Big data areas and technologies


Storage: HDFS is best for full scans. HBase is best for random-read filter queries. Kudu is good at both.


Ingestion from streaming: Flume is best for reading from streams.

Batch ingestion from DBs: Sqoop is best for reading from RDBMSs.

Hadoop vs Spark: For each process, MapReduce requires a map job, a shuffle job, then a reduce job, and Hadoop uses the hard disk heavily to ensure the resilience and fault tolerance that RAM cannot guarantee. Spark, on the other hand, keeps data in memory and reuses maps and shuffles: one map and one shuffle can feed multiple reduces, which is much faster (think of a SQL execution plan, or query data cached in memory).
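The map → shuffle → multiple-reduces idea above can be sketched in plain Python (this is an illustrative simulation with made-up sample data, not actual MapReduce or Spark code):

```python
from collections import defaultdict

# Hypothetical input lines -- assumed sample data for illustration.
records = ["a b a", "b c", "a c c"]

# Map phase: emit (word, 1) pairs.
mapped = [(word, 1) for line in records for word in line.split()]

# Shuffle phase: group values by key.
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# Classic MapReduce would redo map + shuffle for every job;
# Spark can cache the shuffled data and run several reduces over it.
counts = {k: sum(v) for k, v in shuffled.items()}  # reduce #1: word counts
max_count = max(counts.values())                   # reduce #2: highest count
distinct = len(shuffled)                           # reduce #3: vocabulary size

print(counts, max_count, distinct)
```

Here the expensive map and shuffle steps run once, and three different reduces reuse their output, which is the essence of Spark's speedup over chained MapReduce jobs.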

Big data on the cloud: S3 is AWS blob storage, basically an equivalent of HDFS. As for HBase, on AWS one can install it directly.

Processing: SQL is stateless, so for sessionization on streaming data use Spark jobs written in Scala or Java, both of which run on the JVM (Java Virtual Machine). Non-JVM languages like R or Python are harder to translate to the JVM.
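To see why sessionization is stateful, here is a minimal plain-Python sketch (assumed sample timestamps and a 30-minute timeout, both illustrative, not from the post):

```python
from datetime import datetime, timedelta

# Hypothetical click timestamps for one user (assumed sample data).
events = [datetime(2020, 1, 1, 10, 0),
          datetime(2020, 1, 1, 10, 10),
          datetime(2020, 1, 1, 11, 30),   # > 30 min gap -> new session
          datetime(2020, 1, 1, 11, 40)]

TIMEOUT = timedelta(minutes=30)

sessions = []
current = []
last = None
for ts in events:
    # Stateful step: the decision depends on the previous event,
    # which a stateless per-row SQL query cannot carry across rows.
    if last is not None and ts - last > TIMEOUT:
        sessions.append(current)
        current = []
    current.append(ts)
    last = ts
if current:
    sessions.append(current)

print(len(sessions))  # 2 sessions
```

Each event's session assignment depends on the previous event's timestamp; that running state is what pushes this kind of job toward Spark code rather than plain SQL.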
Analyzing: Impala handles concurrency by going directly to disk; it uses neither Hadoop (MapReduce) nor Spark, and is thus faster at accessing the data: high throughput, low latency. Spark SQL is good for simple, detailed queries over the data. Hive on Spark is great for more resilient workloads such as ETL jobs or long queries.

ETL orchestration: technologies include Oozie and Azkaban.

Installation, versioning and security: these became easy with distributions like Cloudera and Hortonworks.

Maintainability: still lacking.
