# Big Data

## 5V's

1. Volume -> **clustering** -> distributed file management -> Apache Spark
2. Velocity
3. Variety (types of data)
4. Value
5. Veracity (reliability)

## Workflow

Ingest data (Apache Kafka) --> Storage (framework; this step might be skipped) --> Preprocessing --> Processing
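
The ingest step is typically a producer writing records to a topic. A minimal sketch using the Kafka Java client from Scala, assuming a broker at `localhost:9092`; the topic name `events` and the record contents are hypothetical:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object IngestExample {
  def main(args: Array[String]): Unit = {
    // Minimal producer config; assumes a broker running on localhost:9092.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // "events" is a hypothetical topic; key and value are illustrative.
    producer.send(new ProducerRecord[String, String]("events", "sensor-1", "42.0"))
    producer.close()
  }
}
```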

## MapReduce
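
MapReduce processes data in three phases: a map phase emits key-value pairs, a shuffle groups the pairs by key, and a reduce phase aggregates each group. A minimal local sketch of the model in plain Scala (no Hadoop), using the classic word count:

```scala
object MapReduceExample {
  def main(args: Array[String]): Unit = {
    val documents = Seq("to be or not to be", "to do is to be")

    // Map phase: emit a (word, 1) pair for every word in every document.
    val mapped: Seq[(String, Int)] =
      documents.flatMap(_.split(" ").map(w => (w, 1)))

    // Shuffle phase: group pairs by key (the framework does this in real MapReduce).
    val shuffled: Map[String, Seq[Int]] =
      mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2)) }

    // Reduce phase: sum the counts for each word.
    val reduced: Map[String, Int] =
      shuffled.map { case (word, counts) => (word, counts.sum) }

    reduced.foreach(println)
  }
}
```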

## Apache Spark

Spark's core abstraction is the Resilient Distributed Dataset (RDD): an immutable, partitioned collection of records processed in parallel across the cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkExample {
  def main(args: Array[String]): Unit = {
    // Run locally, using all available cores.
    val conf = new SparkConf().setAppName("SparkExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("data.txt")                // RDD of lines
    val words = lines.flatMap(l => l.split(" "))       // RDD of words
    val pairs = words.map(w => (w, 1))                 // (word, 1) pairs
    val wordCount = pairs.reduceByKey((a, b) => a + b) // sum counts per word

    wordCount.collect().foreach(println)
    sc.stop()
  }
}
```
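
Note that `flatMap`, `map`, and `reduceByKey` are lazy transformations that only record the RDD lineage; the job actually executes when an action such as `collect()` is called.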

## Data Harvesting

data -> analytics -> actions -> back to data (a feedback loop)

* Legibility -> understanding what data is collected and what the analytics do with it
* Agency -> the capacity to act within the data system
* Negotiability -> the ability to renegotiate the terms of data use over time
