Getting Started with Hadoop, Spark, Hive and Kafka – live blog from Oracle Code

Title: Getting Started with Hadoop, Spark, Hive and Kafka

Speakers: Edelweiss Kammermann

See my live blog table of contents from Oracle Code

Nice beginning with a picture of Uruguay and a map

Big data

  • Volume – Lots of data
  • Variety – Many different data formats
  • Velocity – Data is created/consumed quickly
  • Veracity – Know data is accurate
  • Value – Data has intrinsic value, but you have to find it

Hadoop

  • Manage huge volumes of data
  • Parallel processing
  • Highly scalable
  • HDFS: Hadoop Distributed File System – for storing info
  • MapReduce – for processing data; the programming model/methods inside Hadoop (see the word-count sketch after this list)
  • Writes data in fixed-size blocks
  • NameNode – like an index; the central entry point
  • DataNodes – store the data; each sends data on to the next DataNode and so on until the write is done
  • Fault tolerant – can survive node failure (each DataNode sends a heartbeat to the NameNode every 3 seconds and is assumed dead after 10 minutes), communication failure (DataNodes send acks), and data corruption (DataNodes send block reports of the good blocks to the NameNode)
  • Can have a second NameNode for an active/standby config. DataNodes report to both.
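
The session didn't show code, but the classic way to see the map/reduce flow is word count. Below is a minimal sketch of a mapper and reducer that could run under Hadoop Streaming – my own example, not the speaker's; file names are hypothetical.

```python
#!/usr/bin/env python3
# mapper.py – emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py – sum the counts per word (Hadoop sorts input by key before the reduce)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

These would typically be wired together with the hadoop-streaming jar (-mapper, -reducer, -input, -output), with the input and output directories living in HDFS.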

Hive

  • Analyze and query HDFS data to find patterns
  • Structures the data into tables so you can write SQL-like queries – HiveQL (see the sketch after this list)
  • HiveQL adds multi-table inserts and a CLUSTER BY clause
  • HiveQL has high latency and lacks a query cache
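
A tiny sketch of what HiveQL-style querying looks like, here run through PySpark with Hive support since the talk didn't show code. The page_hits table and its columns are made up, and enableHiveSupport assumes a Hive metastore is configured.

```python
from pyspark.sql import SparkSession

# Hive support lets Spark read/write tables registered in the Hive metastore
spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical table of web-server hits stored in HDFS
spark.sql("""
    CREATE TABLE IF NOT EXISTS page_hits (
        url STRING,
        country STRING,
        hits INT
    )
""")

# HiveQL reads like SQL: aggregate hits per country
spark.sql("""
    SELECT country, SUM(hits) AS total_hits
    FROM page_hits
    GROUP BY country
    ORDER BY total_hits DESC
""").show()
```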

Spark

  • Can write in Java, Scala, Python or R
  • Fast in-memory data processing engine
  • Supports SQL, streaming data, machine learning and graph processing
  • Can run standalone, on Hadoop or on Apache Mesos
  • Much faster than MapReduce. How much faster depends on whether the data can fit into memory
  • Includes packages for core, streaming, SQL, MLlib and GraphX
  • RDD (resilient distributed dataset) – an immutable programming abstraction over a collection of objects that can be split across a cluster. Can be created from a text file, SQL, NoSQL, etc. (see the sketch after this list)
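
A minimal RDD sketch in PySpark (my own example): build an RDD from a text file, apply lazy transformations, then trigger them with an action. The HDFS path is hypothetical.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")

# Build an RDD from a text file in HDFS (path is hypothetical)
lines = sc.textFile("hdfs:///data/access.log")

# Transformations are lazy; nothing runs until an action is called
errors = (lines
          .filter(lambda line: "ERROR" in line)
          .map(lambda line: (line.split()[0], 1))   # key by the first field, e.g. a date
          .reduceByKey(lambda a, b: a + b))

# Actions trigger the computation and bring results back to the driver
for key, count in errors.take(10):
    print(key, count)

sc.stop()
```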

Kafka

  • Integrate data from different sources as input/output
  • Producer/consumer pattern (called source and sink); see the sketch after this list
  • Incoming messages are stored in topics
  • Topics are identified by unique names and split into partitions (for redundancy and parallelism)
  • Partitions are ordered, and each message within a partition has an id called its offset
  • Brokers are Kafka servers in a cluster. Recommended to have three
  • Define replication factor for data. 2 or 3 is common
  • Producers can choose which acks they need to receive – none, from the leader, or from all replicas
  • Consumers read data from a topic. They read in order from a partition, but in parallel between partitions.
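
A small producer/consumer sketch using the kafka-python client – my choice of library, since the talk didn't name one. The broker address, topic name and payload are assumptions.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: acks="all" waits for every replica, the strongest of the three ack options above
producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")
producer.send("orders", key=b"order-42", value=b'{"item": "book", "qty": 1}')
producer.flush()

# Consumer: reads in order within each partition; the offset tracks its position
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="demo-group",
    auto_offset_reset="earliest",
)
for msg in consumer:
    print(msg.partition, msg.offset, msg.value)
```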

My take

Good simplified intro to a bunch of topics. It was good seeing how things fit together. The audience asked what sounded like detailed questions; I would have liked it if those had been held for the end.
