Big Data

5V's

  1. Volume -> clustering -> distributed file management -> Apache Spark

  2. Velocity (speed at which data is generated and processed)

  3. Variety (types of data)

  4. Value (usefulness of the data)

  5. Veracity (reliability of the data)

Workflow

Ingest data (Apache Kafka) --> Storage (framework; may be skipped) --> Preprocessing --> Processing
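
A minimal sketch of the ingest step using the Kafka consumer API; the broker address (localhost:9092), group id, and topic name ("events") are placeholders for illustration:

import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

object IngestExample {
    def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092") // assumed broker address
        props.put("group.id", "ingest-demo")             // hypothetical consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

        val consumer = new KafkaConsumer[String, String](props)
        consumer.subscribe(List("events").asJava) // hypothetical topic name

        // Poll one batch and print it; a real pipeline would hand records
        // on to the storage or preprocessing stage.
        val records = consumer.poll(Duration.ofSeconds(1))
        records.asScala.foreach(r => println(s"${r.key} -> ${r.value}"))
        consumer.close()
    }
}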

MapReduce
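
A minimal sketch of the MapReduce model using plain Scala collections (not an actual Hadoop job): the map phase emits (key, value) pairs, a shuffle groups them by key, and the reduce phase combines the values per key:

object MapReduceSketch {
    def main(args: Array[String]): Unit = {
        val docs = Seq("big data", "data value")

        // Map phase: emit (word, 1) for every word.
        val mapped = docs.flatMap(_.split(" ")).map(w => (w, 1))

        // Shuffle: group the pairs by key.
        val grouped = mapped.groupBy(_._1)

        // Reduce phase: sum the counts for each key.
        val reduced = grouped.map { case (w, pairs) => (w, pairs.map(_._2).sum) }

        // Prints (big,1), (data,2), (value,1) in some order.
        reduced.foreach(println)
    }
}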

Apache Spark

Resilient Distributed Dataset (RDD)
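
An RDD is Spark's core abstraction: an immutable, partitioned collection of records distributed across the cluster, which Spark can recompute from its lineage if a partition is lost (hence "resilient"). The word-count example below builds a small chain of RDD transformations: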

import org.apache.spark.{SparkConf, SparkContext}

object SparkExample {
    def main(args: Array[String]): Unit = {
        // Run locally, using all available cores.
        val conf = new SparkConf().setAppName("SparkExample").setMaster("local[*]")
        val sc = new SparkContext(conf)

        val lines = sc.textFile("data.txt")                // RDD of lines
        val words = lines.flatMap(l => l.split(" "))       // split each line into words
        val pairs = words.map(w => (w, 1))                 // pair each word with a count of 1
        val wordCount = pairs.reduceByKey((a, b) => a + b) // sum the counts per word

        wordCount.collect().foreach(println)

        sc.stop()
    }
}
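
Note that flatMap, map, and reduceByKey are lazy transformations; Spark only executes the job when an action such as collect() is called.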

Data Harvesting

data -> analytics -> actions -> back to data (a feedback loop)

  • Legibility -> ability to understand the data and how it is used

  • Agency -> capacity to act on it

  • Negotiability -> capacity to renegotiate the terms of data use over time
