Job Classification Using Machine Learning

I am grateful to have had the opportunity to intern at Anant this summer. My project entailed classifying jobs using machine learning.

Goal

The goal of my project was to classify job listings from websites such as LinkedIn and Indeed into job types, which makes it easier for job seekers to find jobs that suit them. In particular, my project will help classify the job listings that are posted on the Jobs section of Cassandra.Link. Each job listing is sorted into one of the following seven categories:

  • Interface Technical Architect
  • Software Technical Architect
  • System Technical Architect
  • Database Technical Architect
  • Business Analyst
  • Project Manager
  • Miscellaneous

Training Data

A single data point has 15 different features, as represented in the following picture. Some of the features are unique to each data point, such as “slug”, “id”, and “epoch”. Other features provide a better description of the job, such as what company the job is at and what position is open. The “label” feature indicates what category the job listing falls into. Overall, the entire data file contained 275 such data points.

[Image: the data points used in training the job classification machine learning algorithm.]

Initial Brainstorming

There are two ways that I thought about approaching this problem. My first idea was to use Python to implement natural language processing techniques such as Word2Vec. I could then process the data and train a machine learning classifier (e.g. Support Vector Machine/SVM). My second idea was to use Mathematica/Wolfram Language, a computational language that has a lot of neat features for data processing and analytics.

Approach

I ultimately decided to go with the Wolfram Language since it seemed well suited for this task.

To solve this problem, I first split my data into an 80/20 train/test split. Then, I converted each data point into an Association, which is similar to a dictionary, and made a list of Associations. This gives a structure similar to the one shown in the “Training Data” section above. I then converted the list into a Dataset, which organizes the data into a table-like format.
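As a rough illustration of the split and the dictionary-like records, here is a minimal sketch in Python rather than the Wolfram Language code I actually used. The function name `train_test_split`, the fixed seed, and the stand-in records are assumptions made for the example; the real data has 15 features per listing.

```python
import random


def train_test_split(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and cut it into train and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in records: one dictionary per job listing,
# playing the role of one Association per data point.
records = [{"id": i, "label": "Miscellaneous"} for i in range(275)]

train, test = train_test_split(records)
print(len(train), len(test))  # 220 55
```

With 275 records, an 80/20 split gives 220 training points and 55 test points.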

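For example, if my original data was organized in the format {<|type -> apple, color -> red, taste -> good|>, <|type -> orange, color -> orange, taste -> good|>, <|type -> strawberry, color -> red, taste -> excellent|>}, the corresponding Dataset would be a table with one row per Association and one column per key:

  type        color   taste
  apple       red     good
  orange      orange  good
  strawberry  red     excellent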

I then trained a classifier on this Dataset, specifying each data point’s “label” as the value to predict. I completed this process twice. The first time, I included all of the given features in the data. The second time, I removed features that I thought were irrelevant, such as “id” and “date”, as these were often unique to each data point and did not seem likely to affect the classification task.
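The feature-removal step for the second run can be sketched as follows. This is again a Python illustration, not my Wolfram Language code: the helper `drop_features`, the sample listing, and its field values are all made up for the example.

```python
def drop_features(record, ignored):
    """Return a copy of one record without the ignored feature names."""
    return {key: value for key, value in record.items() if key not in ignored}


# Hypothetical listing; the values are invented for illustration.
listing = {
    "id": 17,
    "date": "2020-07-01",
    "company": "ExampleCo",
    "position": "Data Architect",
    "label": "Database Technical Architect",
}

# Features treated as irrelevant in the second run.
IGNORED = {"id", "date"}

reduced = drop_features(listing, IGNORED)
print(sorted(reduced))  # ['company', 'label', 'position']
```

The original record is left untouched, so the same data can feed both the all-features run and the reduced run.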

I used the Classify function in the Wolfram Language for the machine learning task. The Classify function takes in a set of training data, automatically selects a classification method that suits it, and returns a trained classifier. For the first method, which kept all of the features, the two methods it selected most often were Decision Trees and Nearest Neighbors. The second method almost always used Gradient Boosted Trees.

Analysis

[Image: results key for job classification using machine learning.]
[Image: results tables for job classification using machine learning.]

The accuracy baseline is the accuracy a classifier would achieve by always predicting the most common class. The accuracy baseline for both methods was (73 ± 6)%.
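To make the baseline concrete, here is a minimal Python sketch of a majority-class baseline. The function name `baseline_accuracy` and the toy labels are assumptions for illustration, and the sketch does not reproduce the ± 6 uncertainty reported above.

```python
from collections import Counter


def baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)


# Toy labels, invented for the example.
train_labels = ["Miscellaneous"] * 6 + ["Project Manager"] * 2
test_labels = ["Miscellaneous"] * 3 + ["Business Analyst"]

label, accuracy = baseline_accuracy(train_labels, test_labels)
print(label, accuracy)  # Miscellaneous 0.75
```

A trained classifier is only useful to the extent that it beats this number on the same test set.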

We got a wide range of accuracies depending on how the data was split into training and testing sets. Oftentimes, the second method slightly outperformed the first. The first method had accuracies as low as 68% and as high as 80%. The second method was a little more consistent, with the accuracy hovering between 75% and 80%.

However, we cannot draw any clear conclusions about which method performed better, as our training and testing sets were fairly small: 220 and 55 data points, respectively. The given data also did not have an even distribution of job types, so sometimes a job type appeared for the first time in the test set. For this reason, even if a classifier performed very well on one split, shuffling the data and forming a completely different 80/20 train/test split would very likely give a different accuracy. Given more data, we could get a better classifier.
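The effect of an uneven label distribution on small splits can be checked with a short Python sketch like the one below. The helper `unseen_test_labels` and the label counts are hypothetical; the counts are not the real distribution of my data, only an imbalanced example that adds up to 275.

```python
import random


def unseen_test_labels(labels, train_fraction=0.8, seed=0):
    """Return the labels that occur in the test split but never in training."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return set(shuffled[cut:]) - set(shuffled[:cut])


# Hypothetical, deliberately imbalanced label counts (275 in total).
labels = (
    ["Miscellaneous"] * 200
    + ["Project Manager"] * 40
    + ["Business Analyst"] * 30
    + ["System Technical Architect"] * 4
    + ["Interface Technical Architect"] * 1
)

# Each seed gives a different 80/20 split; a rare label can land only in the
# test set, where the classifier has no chance of predicting it correctly.
for seed in range(5):
    print(seed, unseen_test_labels(labels, seed=seed))
```

Running the same check over many seeds gives a feel for how often a rare job type is missing from training, which is one reason the accuracy moves from split to split.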

Next Steps

My solution did fairly well, but there are a couple of things I would like to try next. 

  • I would like to gather more data to train an improved classifier. 
  • It would also be interesting to do some analysis on the data to figure out which features are most important.

Code

All of the code for this project can be found in the following GitHub repository: https://github.com/akarpurapu4/Anant-JobClassification

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!