Data Tracking and Collection

Motivation

To understand what is going on with a product, organization, or strategy, we need to place probes at specific points of its structure so that we can measure them in detail. Often several sources generate data, which then needs to be reconciled, cleaned up, and aggregated into a consumable format.

Therefore, each organization will have the need to:

Once the data from several sources is in a consumable format, analysts can start querying it and surfacing insights that better inform the organization's future strategies and actions.

Data repositories

What is common in all organizations by default

Data Analysis repositories

For analysis purposes, the data from these production systems is collected, cleaned, and aggregated into a separate data repository optimized for analysis work. Such a repository can generate reports automatically and makes the data easier to navigate, i.e. it holds simplified, aggregated data that is more easily consumable.
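As a minimal sketch of the collect, clean, and aggregate step, the Python below merges rows from two made-up production sources into one total per customer. The source names (`orders_system`, `billing_system`), the fields, and the cleaning rules are all assumptions for illustration, not a description of any particular system.

```python
from collections import defaultdict

# Hypothetical raw rows from two production sources (illustrative data only).
orders_system = [
    {"customer": "alice ", "amount": "10.50"},
    {"customer": "BOB", "amount": "4.00"},
]
billing_system = [
    {"customer": "Alice", "amount": "2.50"},
    {"customer": "bob", "amount": None},  # incomplete record
]


def clean(row):
    """Normalize the customer key and parse the amount; return None if unusable."""
    if row["amount"] is None:
        return None
    return {
        "customer": row["customer"].strip().lower(),
        "amount": float(row["amount"]),
    }


def aggregate(*sources):
    """Reconcile several sources into one total amount per customer."""
    totals = defaultdict(float)
    for source in sources:
        for row in source:
            cleaned = clean(row)
            if cleaned is not None:
                totals[cleaned["customer"]] += cleaned["amount"]
    return dict(totals)


summary = aggregate(orders_system, billing_system)
print(summary)
```

The point is only that the analysis repository stores the cleaned, aggregated result (`summary`), so analysts do not have to repeat the reconciliation each time.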

Dedicated Analytics Telemetry & Collection

It is increasingly common to also have a dedicated telemetry and collection system, placed explicitly at the “critical” points we need and capturing exactly the data we define. This allows more specific measurement and makes available data that does not exist in the default systems (or is too cumbersome to use).

It requires somewhat more effort from the organization: adding the telemetry to the product (at the critical points) and having extra mechanisms and support to run the telemetry and data collection systems.

Dedicated vs default, what to choose?

What is required depends on the data needs. Default sources are typically always there, and they make sense when the data is within easy reach and in a consumable format. When the data needed does not exist in the default systems, the dedicated way is required.

UI click-stream data is “the” typical dataset that requires a dedicated telemetry system.

Data structures

The industry standard for data representation is a table, either in a .csv file or, very often, a relational database table. More recent formats also include the key->value paradigm (JSON, Hadoop). For consumable data, the relational table format still seems to persist as the most successful one (hint: Hive on top of Hadoop).
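To make the difference between the two shapes concrete, here is a small sketch that flattens key->value records (JSON lines) into a fixed-column table and writes it out as CSV. The event fields and the column list are invented for the example.

```python
import csv
import io
import json

# Hypothetical key->value records, one JSON object per line.
json_lines = [
    '{"user": "u1", "event": "click", "target": "buy-button"}',
    '{"user": "u2", "event": "view"}',
]


def to_table(lines, columns):
    """Turn JSON records into rows with a fixed column order; missing keys become ''."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append([record.get(col, "") for col in columns])
    return rows


def to_csv(rows, columns):
    """Serialize the header and rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


columns = ["user", "event", "target"]
table = to_table(json_lines, columns)
csv_text = to_csv(table, columns)
print(csv_text)
```

Key->value records can omit fields freely, while the table forces every row into the same columns, which is part of what makes it easy to consume.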

Realtime vs non-Realtime

Looking into collected data often branches into 2 different needs:

Telemetry principles

Some care should go into adding telemetry.

Data collection types and caveats

Logs

Caveats

Tools

URL get/post (includes Javascript telemetry)

Here I include JavaScript and other telemetry mechanisms that work by making a GET/POST request to a URL with the data as parameters. A very common one is JavaScript in web analytics.
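The mechanism can be sketched in a few lines: the client encodes the event as query parameters on a URL, and the collector decodes them back. The collector address and the parameter names below are hypothetical, and the sketch only builds and parses the URL; it does not send any request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical collector endpoint, used only for illustration.
COLLECTOR_URL = "https://collector.example.com/track"


def build_beacon_url(event, **params):
    """Client side: encode the event and its parameters into a GET URL."""
    query = urlencode({"event": event, **params})
    return f"{COLLECTOR_URL}?{query}"


def parse_beacon_url(url):
    """Collector side: decode the query string back into a flat dict."""
    query = urlsplit(url).query
    return {key: values[0] for key, values in parse_qs(query).items()}


url = build_beacon_url("click", page="/pricing", button="sign up")
print(url)
print(parse_beacon_url(url))
```

In a browser the same URL would typically be requested by a JavaScript snippet; the encoding and decoding steps are the same idea.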

Caveats

Tools

Big Data

Google, needing to handle massive amounts of data for its search engine, created a tool based on a distributed file system that runs parallel MapReduce jobs. Hadoop is the open-source implementation of the same approach, available outside of Google.
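For readers who have not seen MapReduce, here is a toy, single-process sketch of the idea using the classic word count. It only illustrates the map, shuffle, and reduce phases; it is not how Hadoop is actually programmed, and all function names are made up for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    """Run the three phases in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = map_reduce(["big data big tools", "data tools"])
print(counts)
```

In a real cluster the map calls run in parallel on the machines that hold each piece of the data, and the reduce calls run in parallel per key; here everything runs in one loop.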

More recently, Google created a second tool called Dremel, which is faster and better suited to exploratory data analysis (EDA). It supports SQL natively. It is also available as an online service from Google, called BigQuery.
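To show the kind of SQL-based exploration meant here, the sketch below runs a simple aggregation query. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in; it is not Dremel or BigQuery, and the `events` table and its rows are invented for the example.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a SQL-capable analysis engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, event TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("u1", "click"), ("u1", "view"), ("u2", "click"), ("u3", "click")],
)

# A typical exploratory question: how many events of each type do we have?
rows = conn.execute(
    "SELECT event, COUNT(*) AS n FROM events GROUP BY event ORDER BY n DESC"
).fetchall()
print(rows)
conn.close()
```

Being able to ask this sort of question directly in SQL, without writing a job by hand, is what makes such tools convenient for analysis.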

Reference

Data Tools Ecosystem: http://insightdataengineering.com/blog/new-ecosystem/

