Business Intelligence Platform: Tutorial Using MongoDB Aggregation Pipeline

Wassup everyone. I’m starting a collaboration with TopTal to bring more content to this blog. I will publish some articles from TopTal developers as guest posts. I hope you all enjoy in-depth and high-quality articles that were written by top-tier web developers.

The first article is written by Avinash Kaza. He is a senior developer with experience designing & developing data visualizations that answer difficult questions by using the latest tools available.

All that said, please take over Avinash:

Using data to answer interesting questions is what researchers are busy doing in today’s data driven world. Given huge volumes of data, the challenge of processing and analyzing it is a big one; particularly for statisticians or data analysts who do not have the time to invest in learning business intelligence platforms or technologies provided by Hadoop eco-system, Spark, or NoSQL databases that would help them to analyze terabytes of data in minutes.

The norm today is for researchers or statisticians to build their models on subsets of data in analytics packages like R, MATLAB, or Octave, and then give the formulas and data processing steps to IT teams who then build production analytics solutions.

One problem with this approach is that if the researcher realizes something new after running his model on all of the data in production, the process has to be repeated all over again.

What if the researcher could work with a MongoDB developer and run his analysis on all of the production data and use it as his exploratory dataset, without having to learn any new technology or complex programming languages, or even SQL?

If we use MongoDB’s Aggregation Pipeline and MEAN effectively we can achieve this in a reasonably short time. Through this article and the code that is available here in this GitHub repository, we would like to show how easy it is to achieve this.

Most of the Business Intelligence tools that are on the market are providing ways for researchers to import datasets from NoSQL and other Big Data technologies into the tool, then the transformations and analysis are done inside the tool. But in this business intelligence tutorial we are using the power of MongoDB Aggregation Pipeline without pulling the data out of MongoDB, and the researcher is using a simple interface to do all kinds of transformations on a production big data system.

MongoDB Aggregation Pipeline for Business Intelligence

Simply put, MongoDB’s aggregation pipeline is a framework to perform a series of data transformations on a dataset. The first stage takes the entire collection of documents as input, and from then on each subsequent stage takes the previous transformation’s result set as input and produces some transformed output.

There are 10 types of transformations that can be used in an aggregation pipeline:

  • $geoNear: outputs documents in order of nearest to farthest from a specified point
  • $match: filters input record set by any given expressions
  • $project: creates a resultset with a subset of input fields or computed fields
  • $redact: restricts the contents of the documents based on information from the document
  • $unwind: takes an array field with n elements from a document and returns n documents with each element added to each document as a field replacing that array
  • $group: groups by one or more columns and perform aggregations on other columns
  • $limit: picks first n documents from input sets (useful for percentile calculations, etc.)
  • $skip: ignores first n documents from input set
  • $sort: sorts all input documents as per the object given
  • $out: takes all the documents returned from previous stage and writes them to a collection

Except for the first and last in the list above, there are no rules about the order in which these transformations may be applied. $out should be used only once, and at the end, if we want to write the result of the aggregation pipeline to a new or existing collection. $geoNear can be used only as the first stage of a pipeline.

In order to make things easier to understand, let us walk through two datasets and two questions relevant to these datasets.

Difference in Salaries by Designation

In order to explain the power of MongoDB’s aggregation pipeline, we have downloaded a dataset which has salary information of university instructional staff for the entire US. This data is available at We have data from 7598 institutions with the following fields:

With this data we want to find out (on average) what the difference is between salaries of associate professors and professors by state. Then, an associate professor can realize in which state he is valued closer to a professor in terms of salary.

To answer this question, a researcher first needs to weed out bad data from the collection, because there are a few rows/documents in our dataset where the average salary is a null or empty string. To accomplish this cleaning of the dataset we will add the following stage:

This will filter out all the entities which have string values in those two fields. In MongoDB, each type is represented with a unique number – for strings, the type number is 2.

This dataset is a good example because in real world data analytics, engineers often have to deal with data cleanups as well.

Now that we have some stable data, we can continue to the next stage where we will average the salaries by state:

We just need to run a projection of the above result set and get the difference in state average salaries, as shown below in Stage 3 of our pipeline:

This should give us the state level average salary difference between professors and associate professors from a dataset of 7519 educational institutions all over US. To make it even more convenient to interpret this information, let us do a simple sort so we know which state has the least difference by adding a $sort stage:

From this dataset, it is apparent that Idaho, Kansas, and West Virginia are three states where the difference in salaries of associate professors and professors is the least compared to all the other states.

The full aggregation pipeline generated for this is shown below:

The resulting dataset that shows up looks like this. Researchers can also export these results to CSV in order to report on it using visualization packages like Tableau, or through simple Microsoft Excel charts.

Average Pay by Employment Type

Another example that we will explore in this article involves a dataset obtained from Given the payroll information of all state and local government organizations in the United States of America, we would like to figure out the average pay of full-time and part-time “Financial Administration” employees in each state.

The dataset has been imported, resulting in 1975 documents where each document follows this schema:

The answer to this question may help a Financial Administration employee to choose the best state to move to. With our MongoDB aggregator pipeline based tool, this can be done quite easily:

In the first stage, filter on GovernmentFunction column to discard all non-”Financial Administration” entities:

In the next stage of the tutorial, we will group the entities by state and calculate the average full time and part time salaries in each state:

Finally, we will sort the results from higher paying states to lower paying states:

This should allow the tool to generate the following aggregation pipeline:

Running the aggregation pipeline should produce some results like this:


In this MongoDB aggregation pipeline tutorial we have demonstrated a cost effective way to give researchers a tool where they can use production data as exploratory datasets and run different sets of transformations to analyze and construct models from.

We were able to develop and deploy this application end-to-end in just 3 days. This applicaton was developed by a team of 4 experienced engineers (2 in the US and 2 in India) and a designer and freelance UX expert helping us with some thoughts on interface design. At some point in the future, I will take the time to explain how this level of collaboration works to build awesome products in unbelievably short time.

We hope you take advantage of MongoDB’s Aggregation Pipeline, and put power in the hands of your researchers who can change the world with their clever analysis and insights.

This article was originally published in: