Big Data
5V's
Volume -> clustering -> distributed file management -> Apache Spark
Velocity (speed at which data arrives and must be processed)
Variety (types of data)
Value (usefulness of the data)
Veracity (reliability of the data)
Workflow
Ingest data (e.g., Apache Kafka) --> Storage (framework; may be skipped) --> Preprocessing --> Processing
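A minimal ingestion sketch, assuming the kafka-python client and a broker on localhost:9092 (both assumptions; the topic name and payload are made up):

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer side: push raw events into a topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("readings", b'{"sensor": 1, "value": 3.2}')
producer.flush()

# Consumer side: a downstream job reads the same topic for storage/preprocessing
consumer = KafkaConsumer("readings",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=2000)
for message in consumer:
    print(message.value)
```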
MapReduce
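MapReduce splits a job into a map phase that emits key/value pairs, a shuffle that groups pairs by key, and a reduce phase that aggregates each group. A plain-Python word-count sketch of the idea (no Hadoop involved; the sample documents are made up):

```python
from collections import defaultdict
from itertools import chain

docs = ["big data needs big clusters", "spark builds on mapreduce ideas"]

# Map: emit a (word, 1) pair for every word in every document
mapped = chain.from_iterable(((w, 1) for w in doc.split()) for doc in docs)

# Shuffle: group the emitted values by key (word)
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce: aggregate each group, here by summing the ones
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)  # {'big': 2, 'data': 1, ...}
```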
Apache Spark
Resilient Distributed Dataset (RDD)
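A minimal RDD sketch, assuming PySpark is installed and run in local mode (the app name and sample lines are made up):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# An RDD is an immutable, partitioned collection that can be recomputed from its lineage
lines = sc.parallelize(["big data needs big clusters", "spark keeps data in memory"])

counts = (lines.flatMap(lambda line: line.split())  # transformation: split lines into words
               .map(lambda word: (word, 1))         # transformation: pair each word with 1
               .reduceByKey(lambda a, b: a + b))    # transformation: sum counts per word

print(counts.collect())  # action: triggers execution and returns results to the driver
sc.stop()
```

Transformations are lazy; nothing runs until an action such as collect() is called, and a lost partition can be rebuilt from the RDD's lineage.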
Data Harvesting
data -> analytics -> actions -> back to data (a closed feedback loop)
Legibility -> being able to understand what data is collected and how it is used
Agency -> capacity to act on that data
Negotiability -> ability to renegotiate the terms of data use