POC – Proof of Concept or Piece of Cake?
“Hey, let’s make a POC for a BI system that will accessible the data of the organisation” so I’ve been told. Great, but what would the POC will be about? “Let’s start by producing a report with statistical data of our funnel, showing visitors, pageviews, purchases and such, considering uniqueness & non-uniqueness numbers per visitor, oh, and let’s make it for a various of data ranges”. Alright, challenge accepted.
Some high level background: We have an operational system where our site is being managed (content & business logic) and we have a tracking system that holds information about the actual visiting activity. Keeping in mind that for the future we may integrate with other source of data about our traffic such as Google Analytics or Google AdWords. Now all is left is to integrate information from all of those and produce some data. One other note: It was obvious to me that assuming the POC will succeed demand for information will start to flow and there will not be time to rebuild it “correctly”, and not a “fast & dirty”. Refactoring is acceptable, but once initial reports are generated we should not rebuild it all from scratch, so the ETLs should be designed to last.
1st run – Quick Reports
Tackling the issue requires, of course, to build several ETL processes that will process the data from all relevant sources and resolve all information. But the question is what are the final outputs of those ETLs?
Keeping in mind at this point of time we are using simple SQL DBs such as MySQL or PostgreSQL, let’s ignore the basic option to store all data as raw facts and to aggregate results on runtime, as it will take too long or will fail. That been said, the old fashion style for preparing the data for the reports led to two options: The first was to aggregate the data on a daily level, where final aggregations would be done online during a report generation (e.g aggregate several days according to user selection). As most of the data is already aggregated, doing the final aggregation upon request is expected to be quick enough. The second was to fully aggregate all required data per report and store it in the datawarehouse.
But both these approaches are very problematic, one will lead us to wrong results and the other is not robust and will not answer all requirements. Let’s review these issues:
- The required reports should include, to begin with, unique (distinct) counting of values. Aggregating on a daily level and then aggregating multiple days to generate a report (option 1) will lead to wrong results, as it will count multiple time a value that should be counted only once. For example, say that some user visits our site every day during April, thus should be counted only once, will be counted 30 times as each day will include it as a visitor. So daily aggregations are off the table.
- Fully pre-aggregating data within the ETL and storing the results in the datawarehouse would solve the above mentioned uniqueness issue, however it will pose new problems – For which data ranges should the reports be generated? Weekly? Monthly? Month-to-date? What about custom date range? and such.
Moreover, both approaches fail to support dynamic drill-downs. Say that a data analyst would like to see breakdown of visitors to geographic location (country, region, etc.). Or a breakdown by hour of the day. In the second approach of pre-aggregating we can support it, but it means to upgrade & maintain the ETL logic with each minor requirement – not robust at all!
As for the POC, and as we wanted to show some data to the organization, we had to progress by choosing the second option: We have pre-aggregated some agreed upon reports, considering all unique numbers correctly. We have made reports for various date ranges (daily, weekly, monthly, month-to-date, etc.) with a few level of drill downs. It led us to about 18 possible outputs. As expected, shortly after delivering initial reports our satisfied users wanted more, and asked for more drill-down levels. That was the time to find the correct way to tackle the issue for the long run.
Building Analytical DataWarehouse
The issues we were facing are common issues in the world as many organizations have similar requirements. The solution is to have an analytical DB of some kind as the DataWarehouse. Such will allow to hold the raw data (no aggregations at all) while running ad hoc queries that respond fast with accurate results. Common solutions include Columnar DB, Installing strong hardware & software within the organization, In memory DB, Massive Parallel Processing (MPP) DB, etc. We sought for a solution that will be scalable, easy to maintain (i.e. not to buy lots of hardware and deploy services that need to be maintained and increased as we go), supports large storage and can run queries fast.
We found two options that seemed suitable for us:
- Amazon Redshift – A columnar DB that uses the MPP technology. It is a cloud service that requires no onsite installation in the organization.
- Google BigQuery – A RESTful web service that enables massive analysis on large sets of data. It is also a cloud service as Amazon Redshift. BigQuery is not a traditional DB, as it is based on Dremel engine, however from a user point of view it acts as a DB.
Analyzing the differences between the two shows very little differences, both as for capabilities, cost and access (both have an SQL/SQL-like interaction). We have chosen Amazon Redshift for several reasons, where the most important one was better support for “join” (joining data from several tables), which BigQuery has a major weakness at.
Amazon Redshift have gave us the advantage of being both a Columnar DB and an MPP service. As such, we changed the ETLs to save raw facts to the DataWarehouse, where each fact may have lots of attributes about it – starting from obvious timestamp of a pageview event, the user, a visit ID. etc. to attributes like the device and agent that user has been using at that visit or even position of the clicked component within the screen. Adding more attributes is easy as adding a new column to the table is easy & quick. To generate a report all is required is an SQL query that defines the required data range (so we can support any required & custom date ranges) and parameters to group-by, thus supporting any drill down levels that is required – if a BI user would like to breakdown the visitors per country or city, no problem, simply group by another attribute, and the query will return accurate results as required. Fast.
We started with a simple straight-forward approach for generating a few POC reports in the organization, leading us to pre-aggregate required results on a MySQL server. That answered the immediate requirements, but was not a robust solution. Creating real BI in a modern organization that needs to handle big data with flexible requirements requires to work with an analytical DB & tools. We have chose to work with Amazon Redshift as it answers our needs for fast response on flexible queries and is a low cost and easy to implement solution for an organization.
And if it was not clear – we are very pleased with the results it produces us, and we are able to give ability of data analysis that some of our users have not thought to ask for a few months ago.
A nice article compares Amazon Redshift & Google BigQuery: http://courses.cs.washington.edu/courses/cse544/13sp/final-projects/p18-lijl.pdf