Getting Started with Hadoop, Spark, Hive and Kafka – live blog from Oracle Code

Title: Getting Started with Hadoop, Spark, Hive and Kafka

Speakers: Edelweiss Kammermann

See my live blog table of contents from Oracle Code

Nice beginning with a picture of Uruguay and a map

Big data

  • Volume – Lots of data
  • Variety – Many different data formats
  • Velocity – Data is created/consumed quickly
  • Veracity – Know data is accurate
  • Value – Data has intrinsic value, but you have to find it

Hadoop

  • Manage huge volumes of data
  • Parallel processing
  • Highly scalable
  • HDFS: Hadoop Distributed File System – for storing info
  • MapReduce – for processing data; the programming model/methods inside Hadoop (see the word-count sketch after this list)
  • Writes data in fixed-size blocks
  • NameNode – like an index; the central entry point
  • DataNodes – store the data; each sends data on to the next DataNode and so on until the write is done
  • Fault tolerant – can survive node failure (each DataNode sends a heartbeat to the NameNode every 3 seconds and is assumed dead after 10 minutes), communication failure (DataNodes send acks), and data corruption (DataNodes send block reports of the good blocks to the NameNode)
  • Can have a second NameNode for an active/standby config. DataNodes report to both.
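
The session didn't show code, but the classic way to see the map/reduce flow is word count. Below is a minimal sketch of a mapper and reducer that could run under Hadoop Streaming – my own example, not the speaker's; file names are hypothetical.

```python
#!/usr/bin/env python3
# mapper.py – emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py – sum the counts per word (Hadoop sorts input by key before the reduce)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

These would typically be wired together with the hadoop-streaming jar (-mapper, -reducer, -input, -output), with the input and output directories living in HDFS.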

Hive

  • Analyze and query HDFS data to find patterns
  • Structures the data into tables so you can write SQL-like queries – HiveQL (see the sketch after this list)
  • HiveQL adds multi-table inserts and a CLUSTER BY clause
  • HiveQL has high latency and lacks a query cache
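
A tiny sketch of what HiveQL-style querying looks like, here run through PySpark with Hive support since the talk didn't show code. The page_hits table and its columns are made up, and enableHiveSupport assumes a Hive metastore is configured.

```python
from pyspark.sql import SparkSession

# Hive support lets Spark read/write tables registered in the Hive metastore
spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical table of web-server hits stored in HDFS
spark.sql("""
    CREATE TABLE IF NOT EXISTS page_hits (
        url STRING,
        country STRING,
        hits INT
    )
""")

# HiveQL reads like SQL: aggregate hits per country
spark.sql("""
    SELECT country, SUM(hits) AS total_hits
    FROM page_hits
    GROUP BY country
    ORDER BY total_hits DESC
""").show()
```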

Spark

  • Can write in Java, Scala, Python or R
  • Fast in-memory data processing engine
  • Supports SQL, streaming data, machine learning and graph processing
  • Can run standalone, on Hadoop or on Apache Mesos
  • Much faster than MapReduce. How much faster depends on whether the data can fit into memory
  • Includes packages for core, streaming, SQL, MLlib and GraphX
  • RDD (resilient distributed dataset) – an immutable programming abstraction over a collection of objects that can be split across a cluster. Can be created from a text file, SQL, NoSQL, etc. (see the sketch after this list)
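
A minimal RDD sketch in PySpark (my own example): build an RDD from a text file, apply lazy transformations, then trigger them with an action. The HDFS path is hypothetical.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")

# Build an RDD from a text file in HDFS (path is hypothetical)
lines = sc.textFile("hdfs:///data/access.log")

# Transformations are lazy; nothing runs until an action is called
errors = (lines
          .filter(lambda line: "ERROR" in line)
          .map(lambda line: (line.split()[0], 1))   # key by the first field, e.g. a date
          .reduceByKey(lambda a, b: a + b))

# Actions trigger the computation and bring results back to the driver
for key, count in errors.take(10):
    print(key, count)

sc.stop()
```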

Kafka

  • Integrate data from different sources as input/output
  • Producer/consumer pattern (called source and sink); see the sketch after this list
  • Incoming messages are stored in topics
  • Topics are identified by unique names and split into partitions (for redundancy and parallelism)
  • Partitions are ordered, and each message within a partition has an id called its offset
  • Brokers are Kafka servers in a cluster. Recommended to have three
  • Define replication factor for data. 2 or 3 is common
  • Producers can choose which acks they need to receive – none, from the leader, or from all replicas
  • Consumers read data from a topic. They read in order from a partition, but in parallel between partitions.
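
A small producer/consumer sketch using the kafka-python client – my choice of library, since the talk didn't name one. The broker address, topic name and payload are assumptions.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: acks="all" waits for every replica, the strongest of the three ack options above
producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")
producer.send("orders", key=b"order-42", value=b'{"item": "book", "qty": 1}')
producer.flush()

# Consumer: reads in order within each partition; the offset tracks its position
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="demo-group",
    auto_offset_reset="earliest",
)
for msg in consumer:
    print(msg.partition, msg.offset, msg.value)
```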

My take

Good simplified intro to a bunch of topics. It was good seeing how things fit together. The audience asked what sounded like detailed questions; I would have liked it if those had been held for the end.
