Big Data
5V's
- Volume -> sheer amount of data -> handled via clustering and distributed file management -> processed with Apache Spark
- Velocity -> speed at which data is generated and must be processed
- Variety -> different types and formats of data
- Value -> usefulness that can be extracted from the data
- Veracity -> reliability/trustworthiness of the data
Workflow
Ingest Data (e.g., Apache Kafka) --> Storage (framework; may be skipped) --> Preprocessing --> Processing
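A minimal sketch of the ingest step with the Kafka producer API, assuming a local broker at localhost:9092 and a hypothetical "events" topic:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object IngestExample {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed local broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Send one record to the hypothetical "events" topic (key and value are strings)
    producer.send(new ProducerRecord[String, String]("events", "sensor-1", "42.0"))
    producer.close() // flushes pending records before shutting down
  }
}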
MapReduce: two-phase programming model where map transforms each input record into (key, value) pairs and reduce aggregates all values that share a key; see the sketch below.
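A minimal sketch of the two phases on plain Scala collections (no cluster, just to show the model):

object MapReduceSketch {
  def main(args: Array[String]): Unit = {
    // Input split into records (here, lines of text)
    val lines = Seq("to be or", "not to be")
    // Map phase: emit a (word, 1) pair for every word
    val mapped = lines.flatMap(_.split(" ")).map(w => (w, 1))
    // Shuffle groups the pairs by key; Reduce phase sums the counts per word
    val reduced = mapped.groupBy(_._1).map { case (w, pairs) => (w, pairs.map(_._2).sum) }
    println(reduced) // Map(or -> 1, be -> 2, not -> 1, to -> 2)
  }
}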
Apache Spark
Resilient Distributed Dataset (RDD): immutable, partitioned collection processed in parallel; transformations (map, flatMap, reduceByKey) are lazy, and actions (collect) trigger the actual computation, as in the word count below.
import org.apache.spark.{SparkConf, SparkContext}

object SparkExample {
  def main(args: Array[String]): Unit = {
    // Run locally, using all available cores
    val conf = new SparkConf().setAppName("SparkExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("data.txt")                // RDD of lines (lazy)
    val words = lines.flatMap(l => l.split(" "))       // RDD of individual words
    val pairs = words.map(w => (w, 1))                 // (word, 1) pairs
    val wordCount = pairs.reduceByKey((a, b) => a + b) // sum counts per word

    wordCount.collect().foreach(println)               // action: triggers the job
    sc.stop()
  }
}
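One way to run it is spark-submit (a sketch: the jar path below is hypothetical and assumes an sbt build; the master is already set in code):

spark-submit --class SparkExample target/scala-2.12/spark-example_2.12-0.1.jar   # jar name is hypothetical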
Data Harvesting
data -> analytics -> actions -> back to data (a feedback loop)
- Legibility -> making data and the analytics applied to it understandable
- Agency -> capacity to act on and within the data system
- Negotiability -> ability to renegotiate how one's data is collected and used