Glassdoor Analytics

How people want to be perceived is often vastly different to how they are perceived. The same is true of companies. The first step to aligning the two is to understand where the differences lie.

This post explains why Glassdoor is an important data source, how it fits into the broader picture, and how you can write a web scraper to analyse Glassdoor reviews in bulk.

This was originally written as a bit of a coding challenge while learning how to web scrape - if you decide to try it yourself on any website, I would encourage you to check its T&Cs first.

Why would you care about Glassdoor?

No company wants to view itself as mediocre, and no company will sell working for it as a mediocre job. But online reviews on platforms like Glassdoor take control of the company brand away from its slick marketing and recruitment teams, and in doing so become more trusted by potential employees.

If you run a company that talks a big game about making a difference and work-life balance, but when people look up what it's like to work for you they see nothing but negativity, you're probably going to have trouble hiring.

Equally you could score well on all these areas, but if your competitors for talent are scoring higher then you're going to lose candidates to them - or end up paying more.

The implications for retaining talent are similar. Glassdoor reviews have all sorts of issues as a source of information, which makes engagement surveys more reliable for the internal perspective - but Glassdoor is the source for understanding the external perspective.

Look at the war for talent between banks and tech. Banks are increasingly hiring top-tier tech talent. At the start of the year Goldman announced its "headcount increased 9% during 2018, reflecting an increase in technology professionals and investments in new business initiatives" - notably Marcus, its new online customer offering. Lloyd Blankfein, the ex-CEO, repeatedly referred to it as a technology company. Yet it, along with all the other banks, still falls behind its tech competitors for talent in online perception.

The need to understand this external perspective is where the need to pull data from Glassdoor comes in. A snapshot of Glassdoor ratings across five key areas for top banking firms and top tech firms makes the difference abundantly clear.

This would take a while to collect manually through Glassdoor - fortunately there's an API that allows code to speak directly to their servers. A quick script written in R pulls this, wrangles the data and visualises it. But if we want more detail, we'll have to dig into the individual reviews - which aren't available through the API.

How does what the reviews say fit into the bigger picture?

Much like Maslow's famous hierarchy of needs, companies have offerings for employees at different levels - often referred to as the Employee Value Proposition, or "EVP".

At the foundation are the basic contractual elements (compensation and benefits). Moving up, we come to how the employee experiences the organisation - can they learn and develop, and do they have opportunities to progress? Finally, at the peak is the sense of purpose - the link to the vision or mission of the organisation.

If you can offer a better purpose, like charities, you have less need to compete on the other elements like pay. Equally if you pay the same as a competitor, but they've nailed their career development, people will want to work for them.

Crafting your EVP is a whole project in itself. Here we're focused on measuring how it's perceived.

How do you automate the process?

The traditional approach to this would be to get some poor soul to click through hundreds of pages of reviews and pick out themes manually. But it's an interesting coding challenge to try and automate the process.

The output is word clouds of pros and cons, along with all the reviews in a table for easy review. For Goldman this visualisation of the latest 1000 reviews (dating to mid 2018 and scraped from 100 pages of reviews) makes it immediately clear that people like the smart people they work with - but don't like the work-life balance.

Compare it to Facebook and you can see how much of a role relatively inexpensive perks like free food can play in shaping employee experiences and thus external perceptions of a company.

How does the code do it?

The first, and hardest, part is the web scraping itself. More detail on the exact code can be found in the second section of this, but I'll give a summary here.

The only inputs are the company's Glassdoor reviews page, the number of pages to scrape (ten reviews per page) and the cut-off date for reviews.

The scraper is written in R, and uses the rvest package to (ha)rvest the HTML code of the first page of reviews. It then looks for certain HTML tags that correspond to different parts of the page - specifically the date fields and the review text fields. It extracts the text and turns it into a table. It does this for all pages, looping over them using the fact that the page number is in the URL. The output is a table with columns for date, pros and cons - and one row per review.
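To make those steps concrete, here is a minimal sketch of the same idea. It is not the original script: it's written in Python with only the standard library rather than R and rvest, and the URL pattern and the class names (review-date, review-pros, review-cons) are made up for illustration, since the real page markup isn't shown here. It parses a hard-coded sample page so that it runs offline.

```python
from html.parser import HTMLParser

# Hypothetical class names: the real page's markup is not given in the post.
FIELD_CLASSES = {"review-date": "date", "review-pros": "pros", "review-cons": "cons"}


class ReviewParser(HTMLParser):
    """Collect one dict per review from elements carrying the classes above.

    Simplification: each field is assumed to hold plain text with no nested
    tags, and every review is assumed to contain all three fields.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished reviews, one dict each
        self._row = {}      # review currently being assembled
        self._field = None  # field whose text we are inside, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in FIELD_CLASSES:
                self._field = FIELD_CLASSES[cls]
                self._row[self._field] = ""

    def handle_data(self, data):
        if self._field:
            self._row[self._field] += data

    def handle_endtag(self, tag):
        if self._field:
            self._row[self._field] = self._row[self._field].strip()
            self._field = None
            # A review is complete once date, pros and cons have all been seen.
            if len(self._row) == len(FIELD_CLASSES):
                self.rows.append(self._row)
                self._row = {}


def page_urls(base_url, n_pages):
    """Build one URL per page, assuming a made-up '_P<n>.htm' suffix."""
    return [f"{base_url}_P{page}.htm" for page in range(1, n_pages + 1)]


def parse_pages(html_pages):
    """Turn a list of HTML strings into one table (list of row dicts)."""
    rows = []
    for html in html_pages:
        parser = ReviewParser()
        parser.feed(html)
        rows.extend(parser.rows)
    return rows


SAMPLE_PAGE = """
<div class="review">
  <time class="review-date">2018-09-01</time>
  <p class="review-pros">Smart people</p>
  <p class="review-cons">Long hours</p>
</div>
<div class="review">
  <time class="review-date">2018-08-15</time>
  <p class="review-pros">Great learning</p>
  <p class="review-cons">Work-life balance</p>
</div>
"""

if __name__ == "__main__":
    print(page_urls("https://example.com/Reviews/Acme", 3))
    for row in parse_pages([SAMPLE_PAGE]):
        print(row)
```

In a real run you would download each URL from page_urls and pass the responses to parse_pages. That step is left out so the sketch doesn't hit anyone's servers.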

The actual process of text analytics is relatively simple. The script starts by cutting off all reviews before a specified date to keep the analysis relevant. It strips out punctuation and converts the text to lower case.
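As a rough illustration of that cleaning step (again in Python rather than the original R, and assuming dates arrive as ISO strings like "2018-07-15", which may not match the real site), it could look something like this:

```python
import string
from datetime import date

# Map every punctuation character to a space, so "work-life" becomes
# "work life" rather than "worklife". This is a design choice, not
# necessarily what the original script does.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text):
    """Lower-case the text, drop punctuation and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def filter_and_clean(rows, cutoff):
    """Keep reviews dated on or after `cutoff` and clean their text fields."""
    kept = []
    for row in rows:
        if date.fromisoformat(row["date"]) >= cutoff:
            kept.append({
                "date": row["date"],
                "pros": clean_text(row["pros"]),
                "cons": clean_text(row["cons"]),
            })
    return kept


if __name__ == "__main__":
    reviews = [
        {"date": "2018-05-30", "pros": "Good pay.", "cons": "Politics"},
        {"date": "2018-07-15", "pros": "Smart People!", "cons": "Work-life balance..."},
    ]
    for row in filter_and_clean(reviews, date(2018, 6, 1)):
        print(row)
```

Replacing punctuation with spaces does split contractions ("don't" becomes "don t"), but the stray fragments are easy to drop along with the stop words later.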

Of course, if you just did a word cloud of every word, the biggest word would probably be "the" - so the script takes out "stop words" using a dictionary of common words. Some reviews are in different languages - so it plugs into the Google Translate API to automatically translate them. And sometimes it's combinations of words that are important, such as "work life balance", so the script looks for two- and three-word phrases as well.
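Here is a small sketch of the stop-word and phrase counting, again in Python rather than the original R. The stop-word list is a tiny illustrative one (a real run would use a proper dictionary), and the translation step is left out because it depends on an external service.

```python
from collections import Counter

# A tiny illustrative stop-word list, not a real dictionary.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "are", "in", "with", "but"}


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined into one phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_terms(texts, max_n=3):
    """Count single words plus two- and three-word phrases across all texts.

    Stop words are removed before phrases are built, so a phrase can span
    a gap where a stop word used to be.
    """
    counts = Counter()
    for text in texts:
        tokens = [t for t in text.split() if t not in STOP_WORDS]
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts


if __name__ == "__main__":
    cons = ["great work life balance", "the work life balance is poor"]
    for term, freq in count_terms(cons).most_common(5):
        print(freq, term)
```

The resulting counts are what would drive the size of each word or phrase in the word cloud.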

Doing this manually would be...a lot of work. And if you did it for another company (say, for comparison), you'd have to do it all again. Whereas this code can run in 3 minutes.

What more can be done?

Of course, word clouds are hardly a scientific approach. They're engaging, but hard to compare and quantify - the ratings from the API shown at the start are much clearer on that front.

A fuller approach could group words into themes and look at prevalence of these themes in each review. By adjusting the date ranges you could look at change in theme prevalence over time - for example following a project to improve a company's Employee Value Proposition. Specific reviews could be pulled out to give context to the appearance of specific words.
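As a sketch of what theme prevalence might look like in code (in Python, with an entirely made-up theme dictionary, since building the real groupings is the hard part), you could measure the share of reviews that mention at least one keyword from each theme:

```python
# Hypothetical themes and keywords, for illustration only.
THEMES = {
    "people": {"people", "colleagues", "team"},
    "work-life balance": {"hours", "balance", "weekends"},
    "perks": {"food", "perks", "benefits"},
}


def theme_prevalence(texts, themes=THEMES):
    """Return, per theme, the fraction of reviews mentioning any of its keywords.

    Each text is expected to be cleaned already (lower case, no punctuation).
    """
    if not texts:
        return {name: 0.0 for name in themes}
    hits = {name: 0 for name in themes}
    for text in texts:
        words = set(text.split())
        for name, keywords in themes.items():
            if words & keywords:
                hits[name] += 1
    return {name: count / len(texts) for name, count in hits.items()}


if __name__ == "__main__":
    reviews = [
        "smart people long hours",
        "free food great team",
        "long hours no balance",
    ]
    for name, share in theme_prevalence(reviews).items():
        print(f"{name}: {share:.0%}")
```

Running this separately for reviews before and after a given date would give the change in theme prevalence over time.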

Much like with the initial graph, benchmarks for competitors could be pulled in. The data could be supplemented with internal employee engagement survey data to compare and contrast. If you had it, you could segment by location and role.

All of these things would give a fuller picture of what is going on. The elegance of a simple word cloud, however, is that it gives an immediate impression to the viewer. An immediate impression much like the one a potential employee would form by looking through the reviews themselves.