Classifying jobs is probably a labor economist's dream. Many a great public servant has toiled for hours to produce standard occupational classification codes. Alas, if only they were actually used!

(This post does not contribute any new knowledge to the world)

As a topic somewhat related to my current work, I'm looking for methods of automating the classification of jobs into standardized occupational codes. Specifically, given a random instance of a typical job found on an online job portal (a job title, a description, some other auxiliary information), can we construct an automated tool to classify the job into one of N categories?
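To make the problem statement concrete, here is a minimal sketch of what such a tool could look like: a tiny multinomial Naive Bayes classifier over the words in the title and description, written with only the Python standard library. Everything in it is an assumption for illustration. The class name `NaiveBayesJobClassifier`, the four toy postings, and the labels `SOFTWARE` and `NURSING` are all made up, and the labels stand in for real occupational codes. It is not a method taken from any paper, just the simplest baseline I can think of.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesJobClassifier:
    """Hypothetical bag-of-words Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # code -> word frequencies
        self.doc_counts = Counter()              # code -> number of postings
        self.vocab = set()

    def fit(self, postings, codes):
        for text, code in zip(postings, codes):
            tokens = tokenize(text)
            self.word_counts[code].update(tokens)
            self.doc_counts[code] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_code, best_score = None, -math.inf
        for code, n_docs in self.doc_counts.items():
            counts = self.word_counts[code]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of smoothed log likelihoods for each token
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_code, best_score = code, score
        return best_code


# Toy, made-up training data: (title + description, placeholder code).
train = [
    ("Software Engineer. Build and maintain backend services in Python.", "SOFTWARE"),
    ("Registered Nurse. Provide patient care in a hospital ward.", "NURSING"),
    ("Frontend Developer. Write JavaScript and build web applications.", "SOFTWARE"),
    ("Staff Nurse. Administer medication and monitor patient health.", "NURSING"),
]
texts, labels = zip(*train)
clf = NaiveBayesJobClassifier().fit(texts, labels)

print(clf.predict("Python developer to build web services"))
print(clf.predict("Nurse for patient care in hospital"))
```

With two categories and four postings this is a toy, of course. The interesting version has N in the hundreds, and that is where the lack of labeled data starts to hurt.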

There is currently no shortage of job posting information online. Datasets exist on Kaggle, and more exist on the actual portals. There is, however, a dearth of labeled data. The only consolation, perhaps, is that job portals do have their own attempted labelings. Job portals usually have about 15-20 large buckets, with terms like 'IT', 'Engineering', or 'Healthcare'. This sort of conflation between job function and industry works as a rough first-pass filter for improving the search experience, but not much more than that.

We could, of course, pick one of these job portals with a categorization that catches our eye and heart, and just go with that, but no labor economist would take us seriously. I'm not a labor economist, nor am I looking for one, but I do find the standard occupational code classification system (and its counterpart, the standard industry classification system) more rigorous and fine-grained. Furthermore, a system developed against any particular country's standard occupational codes would be transferable, since it would be far easier to map one country's codes to another's than to reconstruct a dataset. There is a granularity trade-off to be made here, of course, particularly since there is no labeled data available, but I suppose the labor statistics people already addressed that when they came up with the latest classification system. I just can't believe it's 2019 and no one has attempted to automate this.

I did find one thing, well, sort of. I chanced upon a Chinese paper that did exactly what I was looking for. This paper describes creating a dataset of 100k online job postings, and subsequently sorting each posting into one of 465 categories according to the Chinese version of the SOC (it has an amazing name, the Grand Classification of Occupations, which makes me feel we should start adding adjectives to all our things to make them sound more amazing). They use the job description information, which is a huge step up from most of the datasets on Kaggle. My only gripe with it: I can't find any more information about this work on either Google or Baidu, much less the dataset, which is unfortunate. But the findings of the paper do point in a positive direction - that it is possible to do this classification automatically with a high degree of accuracy, if we had the data and labeled it.

If anyone knows anything about this topic, let me know!