# Big Data

## 5V's

1. Volume -> **clustering** -> distributed file management -> Apache Spark
2. Velocity
3. Variety (types of data)
4. Value
5. Veracity (reliability)

## Workflow

Ingest data (Apache Kafka) --> Storage (framework; this step might be skipped) --> Preprocessing --> Processing
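
The ingest step is typically a producer writing records to a topic. A minimal sketch using the Kafka Java client from Scala, assuming a broker at `localhost:9092`; the topic name `events` and the record contents are hypothetical:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object IngestExample {
  def main(args: Array[String]): Unit = {
    // Minimal producer config; assumes a broker running on localhost:9092.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // "events" is a hypothetical topic; key and value are illustrative.
    producer.send(new ProducerRecord[String, String]("events", "sensor-1", "42.0"))
    producer.close()
  }
}
```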

## MapReduce
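
MapReduce processes data in three phases: a map phase emits key-value pairs, a shuffle groups the pairs by key, and a reduce phase aggregates each group. A minimal local sketch of the model in plain Scala (no Hadoop), using the classic word count:

```scala
object MapReduceExample {
  def main(args: Array[String]): Unit = {
    val documents = Seq("to be or not to be", "to do is to be")

    // Map phase: emit a (word, 1) pair for every word in every document.
    val mapped: Seq[(String, Int)] =
      documents.flatMap(_.split(" ").map(w => (w, 1)))

    // Shuffle phase: group pairs by key (the framework does this in real MapReduce).
    val shuffled: Map[String, Seq[Int]] =
      mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2)) }

    // Reduce phase: sum the counts for each word.
    val reduced: Map[String, Int] =
      shuffled.map { case (word, counts) => (word, counts.sum) }

    reduced.foreach(println)
  }
}
```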

## Apache Spark

Spark's core abstraction is the Resilient Distributed Dataset (RDD): an immutable, partitioned collection of records processed in parallel across the cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkExample {
  def main(args: Array[String]): Unit = {
    // Run locally, using all available cores.
    val conf = new SparkConf().setAppName("SparkExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("data.txt")                // RDD of lines
    val words = lines.flatMap(l => l.split(" "))       // RDD of words
    val pairs = words.map(w => (w, 1))                 // (word, 1) pairs
    val wordCount = pairs.reduceByKey((a, b) => a + b) // sum counts per word

    wordCount.collect().foreach(println)
    sc.stop()
  }
}
```
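
Note that `flatMap`, `map`, and `reduceByKey` are lazy transformations that only record the RDD lineage; the job actually executes when an action such as `collect()` is called.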

## Data Harvesting

data -> analytics -> actions -> back to data (a feedback loop)

* Legibility -> understanding what data is collected and what the analytics do with it
* Agency -> the capacity to act within the data system
* Negotiability -> the ability to renegotiate the terms of data use over time
