Big Data–Intro

MapReduce: A distributed data-processing model. “Mapper” nodes preprocess the data (filtering, transforming, etc.), then “Reducer” nodes aggregate the mapped output into the result.
– Jobs are written as Java code and run in batch (not online) mode, so responsiveness is slow.
GFS – Google File System: Google's distributed file-storage system. HDFS is Hadoop's open source counterpart.
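
The mapper/reducer split described above can be sketched in a few lines of single-process Python. This is only an illustration of the model, not how a Hadoop job is written: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are made up for this note, and the grouping step that a real framework performs across the network is done here with an in-memory dictionary.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group the emitted values by key (done by the framework between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: aggregate all values collected for one key."""
    return key, sum(values)

def run_job(lines):
    """Run map, shuffle and reduce in sequence on a small in-memory input."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = run_job(["big data big clusters", "data nodes"])
print(counts)
```

On a cluster, many mappers would each process one split of the input files, and each reducer would receive all values for a subset of the keys.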

Hadoop: An open source (Apache, with Yahoo as an early major contributor) implementation of Google’s MapReduce and GFS (Google File System).
Hadoop can work with flat files, and some DB formats.
No native high-level query language for MapReduce.
Streaming: A Hadoop utility that lets MapReduce jobs be written in non-Java languages (e.g. Python). Any executable that reads lines from standard input and writes lines to standard output can act as mapper or reducer.
Hive: A separate (now Apache) layer converting HiveQL (SQL-like) code to MapReduce jobs. Used by many BI products requiring tabular data. Keeps tables as files in HDFS, each bound to a declared schema.
Hadoop can use the HBase NoSQL DB.
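
As a rough sketch of the Streaming item above: with Hadoop Streaming, the mapper and the reducer exchange plain text lines, conventionally `key<TAB>value`, and the framework sorts the mapper output by key before the reducer sees it. The code below imitates that contract locally with generators. The `simulate` helper and the use of `sorted()` in place of the cluster's sort step are assumptions made for this note; in a real job each function would be its own script reading `sys.stdin`.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share a key.

    Relies on the input being sorted by key, which the framework guarantees.
    """
    parsed = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"

def simulate(lines):
    """Local stand-in for the job: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))

result = simulate(["big data", "big cluster"])
print(result)
```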

Massively Parallel Processing (MPP): A distributed processing architecture, conceptually very similar to MapReduce.
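
To make the similarity with MapReduce concrete, here is a minimal single-process sketch of how an MPP engine might answer `SELECT region, SUM(amount) ... GROUP BY region`: rows are distributed across nodes by a hash of a key, every node aggregates its own slice, and a coordinator merges the partial results. The node count, the choice of `region` as distribution key and all function names are assumptions for illustration, not any vendor's design.

```python
import zlib
from collections import Counter

def partition(rows, node_count):
    """Assign each row to a node by hashing its distribution key (region)."""
    nodes = [[] for _ in range(node_count)]
    for region, amount in rows:
        node_id = zlib.crc32(region.encode("utf-8")) % node_count
        nodes[node_id].append((region, amount))
    return nodes

def local_aggregate(node_rows):
    """Each node computes partial sums over the rows it stores."""
    partial = Counter()
    for region, amount in node_rows:
        partial[region] += amount
    return partial

def coordinator_merge(partials):
    """The coordinator adds up the partial sums returned by the nodes."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values per key
    return dict(total)

rows = [("east", 10), ("west", 5), ("east", 7), ("north", 3)]
nodes = partition(rows, node_count=3)
totals = coordinator_merge(local_aggregate(node) for node in nodes)
print(totals)
```

The local aggregation plays the role of the map side and the coordinator merge the role of the reduce side, which is why the two architectures look alike on paper.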

Column Store: Most MPPs use it. A column store keeps data as sets of column values rather than sets of row values. This allows a high compression rate, so more of the data fits into memory, which makes queries even faster. It works best with BI, since it helps aggregating one or more specific columns.
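
A small sketch of why the column layout compresses well and suits aggregation. The sample rows are invented, and run-length encoding is used here only as one simple example of a compression scheme; real column stores combine several encodings.

```python
rows = [
    {"id": 1, "country": "US", "amount": 10},
    {"id": 2, "country": "US", "amount": 20},
    {"id": 3, "country": "US", "amount": 5},
    {"id": 4, "country": "DE", "amount": 7},
]

def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}

def rle_encode(values):
    """Run-length encode a column: consecutive repeats become [value, count]."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

columns = to_columns(rows)

# Repeated values sit next to each other in a column, so they collapse into runs.
country_runs = rle_encode(columns["country"])

# An aggregate touches only the one column it needs, not whole rows.
total_amount = sum(columns["amount"])

print(country_runs, total_amount)
```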


MapReduce & Hadoop | MPP
Mappers are different from reducers. | Same engine.
Cheaper hardware. | Cabinet.
Infinite scaling. | Limited scaling.
Uses imperative code. | Uses SQL.
Works with flat files, wide-column tables (HBase), and relational tables (through Hive). | Works with relational models only.


SQL is more productive than MapReduce code, and developers who know SQL are easier and cheaper to hire than MapReduce developers, so solutions are quicker to develop.
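
The productivity gap is easy to see on a toy aggregation. In the sketch below the same average is computed twice: once as a single declarative SQL statement and once as hand-written map/group/reduce steps. SQLite from the Python standard library stands in for a SQL engine here, and the `staff` table and its rows are invented for the example.

```python
import sqlite3
from collections import defaultdict

staff = [("eng", 100), ("ops", 60), ("eng", 80), ("ops", 40)]

# Declarative version: one statement says what is wanted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO staff VALUES (?, ?)", staff)
sql_avg = dict(conn.execute("SELECT dept, AVG(salary) FROM staff GROUP BY dept"))
conn.close()

# Imperative MapReduce-style version: the steps are spelled out by hand.
mapped = [(dept, salary) for dept, salary in staff]        # map: emit key/value
grouped = defaultdict(list)
for dept, salary in mapped:                                # shuffle: group by key
    grouped[dept].append(salary)
mr_avg = {dept: sum(v) / len(v) for dept, v in grouped.items()}  # reduce

print(sql_avg, mr_avg)
```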


Bridging the gap between SQL and MapReduce:

Hive is a potential solution to bridge the two technologies, but it still runs MapReduce batch jobs, so each query, and each modification to it, is slow to return.

Better solution: a native SQL implementation over Hadoop. The available solutions are:


Apache Hive: SQL over Hadoop. It has ODBC/JDBC drivers. It still uses MapReduce and so is still slow, which works against the purpose of BI, where reports need fast queries. Beeswax is a web interface for Hive.
Cloudera Impala: A native distributed SQL query engine running directly against HDFS, bypassing MapReduce. It is compatible with Hive and ODBC, and thus with most BI tools, and is very quick in comparison with Hive, making it a better DW option for BI. Partnering with Pentaho. Basically an MPP engine.

Hadapt: Joins data from relational tables and Hadoop. Each cluster node is both a Hadoop node (data node and worker node) and a Postgres-based MPP node, and its query optimizer decides between the two.

Hardware agnostic (commodity, network, enterprise, ..). May reduce the data footprint 30-40x via compression and de-duplication. Fast: reads 1 million records/sec. De-duplicates patterns, not only values. Can run standalone, as MPP, or on Hadoop clusters too. Has readers and writers for MapReduce jobs (e.g. Pig).

A pioneer MPP vendor, moving to MapReduce via Aster Data (now: Teradata Aster).
v.1: SQL-MapReduce: MapReduce programming against relational data (MR functions written in Java, C#, etc. that can be called from SQL queries).
v.2: SQL-H, which imports HDFS data as external tables (via HCatalog), using SQL or Hive.

v.3: Big Data Appliance: MPP + SQL-H + Hadoop. A spin-off of Hadoop, in JV with Hortonworks.
Microsoft HDInsight.

SQL Server Parallel Data Warehouse (PDW) is Microsoft's MPP appliance. v.2 includes PolyBase, which imports Hadoop data into this DB and queries it from there (like Teradata Aster SQL-H).

MPP engine handles query & parallelism and targets HDFS directly, escaping MR (like Hadapt).
MPP over HDFS only. Query optimizer choosing between MPP and MR (like Hadapt).


Apache Sqoop: SQL to Hadoop. An import/export tool that moves data between relational DBs and Hadoop. Used with DWs and MPP appliances, and bridges MPPs with Hadoop.
