Big Data
5V's
Volume (scale of data) -> clustering -> distributed file management -> Apache Spark
Velocity (speed at which data is generated and processed)
Variety (types of data)
Value (usefulness of the data)
Veracity (reliability of the data)
Workflow
Ingest Data (e.g., Apache Kafka) --> Storage (framework; may be skipped) --> Preprocessing --> Processing
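A minimal ingestion sketch using the Kafka producer API (assumes a broker on localhost:9092; the topic name "sensor-events" and the record values are hypothetical):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object IngestExample {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed local broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // "sensor-events" is a hypothetical topic name
    producer.send(new ProducerRecord[String, String]("sensor-events", "sensor-1", "42.0"))
    producer.close() // flush buffered records and release resources
  }
}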
MapReduce
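Map emits (key, value) pairs, a shuffle groups them by key, and reduce aggregates each group. A sketch of the pattern on plain Scala collections (not the Hadoop API, just the idea):

object MapReducePattern {
  def main(args: Array[String]): Unit = {
    val docs = Seq("to be or not to be", "to do")

    // Map phase: emit a (word, 1) pair for every word
    val mapped = docs.flatMap(_.split(" ")).map(w => (w, 1))

    // Shuffle phase: group the pairs by key (the word)
    val grouped = mapped.groupBy(_._1)

    // Reduce phase: sum the counts for each word
    val counts = grouped.map { case (word, pairs) => (word, pairs.map(_._2).sum) }

    counts.foreach(println) // e.g. (to,3), (be,2), ...
  }
}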
Apache Spark
Resilient Distributed Dataset (RDD)
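The classic RDD example is word count: build an RDD from a text file, chain lazy transformations, and trigger execution with an action (assumes a local data.txt):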
import org.apache.spark.{SparkConf, SparkContext}

object SparkExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("data.txt")                // RDD with one element per line
    val words = lines.flatMap(l => l.split(" "))       // split each line into words
    val pairs = words.map(w => (w, 1))                 // emit a (word, 1) pair per word
    val wordCount = pairs.reduceByKey((a, b) => a + b) // sum the counts per word

    wordCount.collect().foreach(println)
    sc.stop()
  }
}
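Note that flatMap, map, and reduceByKey are lazy transformations; Spark only builds the lineage graph and runs the job when an action such as collect() is called.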
Data Harvesting
data -> analytics -> actions -> back to data (a feedback loop)
Legibility -> ability to understand the data
Agency -> capacity to act on it
Negotiability -> ability to renegotiate how data is collected and used