While trying to understand the distribution of skills on a particular job portal, I observed that the frequencies of the skills listed on the portal follow a Zipf's law-like distribution. That's cool!

### Background

I find myself still looking at job descriptions from Jobsbank, and this time I'm looking at the skills associated with each job description. Each job description comes with a list of skills, and each skill has a confidence score (indicating how relevant the skill is to the job). It would be interesting to see the distribution of skills across all job postings, specific industry sectors, and specific job titles.

Intuitively, one would expect there to be significant variance in the signal-to-noise ratio of these listed skills. For instance, almost everyone is expected to know how to use and navigate a computer nowadays, so knowing 'Microsoft Office' is sort of taken for granted, but is nonetheless listed as a skill. It would be interesting to see if that is indeed reflected in the distribution of skills on the market, and if so, what can be done to improve the signal coming through in this mess.

### Counts and Scores

The skills for each job posting are generated by JobKred. I am not aware of how these skills are generated, but I would assume that the job description is fed into an engine which then spits out a certain number of skills with associated confidence scores.

In looking at these skills, I collected two measures. First, there is the raw count - the number of job postings on which the skill appears. Second, there is the confidence score - the sum of the skill's individual confidence scores across all the job postings where it appears.
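To make the two measures concrete, here is a minimal sketch of how they could be computed. The data layout (a list of postings, each with a `skills` list of `(name, confidence)` pairs), the function name `aggregate_skills`, and the sample values are all assumptions made up for illustration; the real Jobsbank data may be shaped differently.

```python
from collections import defaultdict


def aggregate_skills(postings):
    """Return (counts, score_sums), both keyed by skill name.

    Assumes each posting is a dict with a "skills" list of
    (skill_name, confidence) pairs. This layout is hypothetical.
    """
    counts = defaultdict(int)        # number of postings listing the skill
    score_sums = defaultdict(float)  # sum of confidence scores for the skill
    for posting in postings:
        for skill, confidence in posting["skills"]:
            counts[skill] += 1
            score_sums[skill] += confidence
    return dict(counts), dict(score_sums)


# Made-up sample data, not taken from the real portal.
postings = [
    {"skills": [("Microsoft Office", 2.5), ("Java Application Development", 12.0)]},
    {"skills": [("Microsoft Office", 1.5)]},
]
counts, score_sums = aggregate_skills(postings)
```

In this toy example 'Microsoft Office' ends up with a higher count but a lower summed score than 'Java Application Development', which is the kind of difference the two measures are meant to expose.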

The total number of possible skills was identified by manually inspecting the API endpoints of the site. This came out to be a rather manageable number of about ten thousand, which happened to be nicely arranged in linearly increasing IDs.

As for the actual job posting data, I looked at a snapshot of all available job postings on a particular day. This came out to be approximately 25 thousand job postings, which were extracted using Scrapy and other tools.

The raw skills data and the following charts can be found here.

### Counts

First, I looked at the discrete counts of the skills. The following chart shows the true and predicted counts of the 250 most frequently occurring skills. The predicted count for each skill was calculated by multiplying the highest true count by the reciprocal of the skill's rank, so the skill at rank 2 is predicted to appear half as often as the top skill, the skill at rank 3 a third as often, and so on. I decided to cut off the graph at 250 ranks to better accentuate the shape of the curve at smaller rank values, which is really what we're after anyway.
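The prediction described above can be sketched in a few lines. The function name `zipf_predicted` and the `top_n` parameter are my own labels for illustration, and the input numbers are invented; this is just the "highest count divided by rank" rule, not a fitted model.

```python
def zipf_predicted(true_counts, top_n=250):
    """Rank counts in descending order and predict each as max_count / rank.

    Returns (ranked_true_counts, predicted_counts), both truncated to top_n.
    """
    ranked = sorted(true_counts, reverse=True)[:top_n]
    if not ranked:
        return [], []
    top = ranked[0]
    # Rank starts at 1, so the top skill is predicted exactly.
    predicted = [top / rank for rank in range(1, len(ranked) + 1)]
    return ranked, predicted


# Invented counts that happen to follow the rule exactly.
ranked, predicted = zipf_predicted([120, 30, 60, 40])
```

Plotting `ranked` against `predicted` over the rank axis gives the kind of comparison chart described here.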

The true counts roughly follow a Zipf's law distribution. That is my way of saying that the shape looks kind of like it, which is cool and is all I really wanted to know. Not being a mathematical statistician, I am unable to say more than that.

Here are the top 10 skills as ordered by count:

- Management Accounting
- Leadership Development
- Microsoft Office SharePoint Server
- Customer Service Management
- Marketing Analytics
- Strategic Policy Development
- Sales & Marketing
- Strategy Alignment
- Business Development Consultancy
- Microsoft Exchange

### Scores

Next, I looked at the confidence scores of the skills. I did not look too much at the range of the confidence scores, but they appear to be distributed between 0 and 20, with extreme values for low/high confidence options.

I would expect the distribution of scores to be closer to that predicted by Zipf's law. Intuitively, there are only so many relevant skills one can list for a particular job description, and that number is unlikely to exceed two digits. However, most of the job descriptions in this data set contain about 20 listed skills, so a large number of the listed skills have low relevance values but made it into the top 20 for a job description by virtue of the quota. Hence, the relative weights indicated by the confidence scores should compensate for this effect.

This distribution appears to be much closer to the predicted distribution than that observed in the previous chart.

Here are the top 10 skills as ordered by the sum of the confidence scores:

- Management Accounting
- Microsoft Office SharePoint Server
- Customer Service Management
- Sales & Marketing
- Leadership Development
- Engineering Design
- Marketing Analytics
- Java Application Development
- Microsoft Exchange
- Research & Development

There is significant overlap between this list and the previous list, which was sorted by count. I'm of the opinion that many of the listed skills are far too generic to conclude anything significant about the underlying job.

### Possible Future Work

That's all I've done so far. There are a number of interesting stretch goals that would be quite relevant. Here are some that you or I may want to look at in future:

- A pseudo 'tf-idf' scoring mechanism to filter out skills that are too generic to be of real help in understanding a job
- Analyze change in skill demand over time. The proportion scores are already calculated in the Excel spreadsheet; we just have to do that over multiple time points
- Remove really irrelevant skills like 'Maintaining a Positive Attitude'. Also, does 'Mahout' refer to an elephant rider or the Apache framework? I'm not particularly convinced that the former hires through an online job portal.
- Remove overlapping skills like 'Engineering' and 'Mechanical Engineering'. Again, not convinced that 'Engineering' as a skill tells me any more than having the word 'engineer' in the job title.
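
As a starting point for the first idea above, here is one possible sketch of a pseudo 'tf-idf' score. It treats the confidence score as the "term frequency" part and multiplies it by `log(N / df)`, where `N` is the number of postings and `df` is the number of postings listing the skill. That particular weighting, the function names `idf_weights` and `reweight`, and the sample data are all assumptions for illustration, and nothing like this has been run on the real data yet.

```python
import math
from collections import Counter


def idf_weights(postings):
    """Map each skill to log(N / df), where df is its posting frequency.

    Assumes each posting is a list of (skill_name, confidence) pairs.
    """
    n_postings = len(postings)
    doc_freq = Counter()
    for skills in postings:
        # Use a set so a skill counts at most once per posting.
        doc_freq.update({skill for skill, _ in skills})
    return {skill: math.log(n_postings / df) for skill, df in doc_freq.items()}


def reweight(posting, idf):
    """Scale each confidence by the skill's idf and sort, highest first."""
    scored = [(skill, confidence * idf[skill]) for skill, confidence in posting]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up sample data: 'Microsoft Office' appears on every posting.
postings = [
    [("Microsoft Office", 5.0), ("Java Application Development", 3.0)],
    [("Microsoft Office", 4.0), ("Engineering Design", 2.0)],
]
idf = idf_weights(postings)
reweighted = reweight(postings[0], idf)
```

With this weighting, a skill that appears on every posting gets a weight of zero no matter how confident the engine was, so generic skills drop to the bottom of each posting's list.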