Hadoop and other noSQL technologies
|On This Page||More||Other Pages|
In 15 years since 1998 the number of servers in Google grew to approx 3 million servers in 2013.
Microsoft, Amazon, and others are not far behind. How do you manage millions of servers?
How do you distribute data between them?
How you query this data?
How Google returns results from millions of servers in less than a second?
What technologies power those huge distributed databases?
List of some well known BigData technologies (as of Summer 2014):
List below contains both open source and commercial solutions.
Which ones to choose?
"Open source or open wallet.
Vibrant innovative engineering community or marketing.
It's not about a great product.
It is about unbridled thinking and freedom from single company control.
The choice is easy."
-- David Garrison, Hortonworks
Open Source system to store data distributed over multiple (thousands) servers. Created in 2005 by Doug Cutting and Mike Cafarella for "Nutch" search engine project. Name "Hadoop" comes from the name of Cutting's little son's toy elephant. Cutting was working at Yahoo! at the time. Hadoop is written in Java.
|HDFS||Hadoop Distributed File System|
|YARN||Yet Another Resource Negotiator - for better resource management and map reduce in Hadoop 2.x|
|Storm||Distributed processing and streaming of data. Open source. Written in Clojure. Uses "spouts" and "bolts" to define topologies of moving data. Integrates well with many common messaging systems (RabbitMQ, Kestrel, Kafka, etc).|
|Samza||Open Source distributed stream processing framework. Uses Apache Kafka for messaging. Written in Java and Scala.|
|Hive||SQL-like language called HiveQL to query Hadoop HDFS data. DataWarehouse infrastructure. Can run on top of Spark for faster performance.|
Open Source distributed database on top of HDFS (Hadoop Distributed File System). Non-relational. Modeled after Google's BigTable - storing large quantities of sparse data. Written in Java. Good for analyzing huge 2-dim data (billions of rows, millions of columns - searches, log processing, etc.). HTTP or Thrift interface.
Open source distributed high-performance database to store and query huge amounts of data. Originally from Facebook. Has its own data model (not Hadoop). Many users load data from Hadoop into Cassandra to do analytics. Uses SQL-like query language (CQL3). Written in Java. Great for web analytics, transaction logs, etc.
|Spark||Open source, written in Scala, runs MapReduce up to 100x faster than Hadoop on top of Hadoop Distributed File System (HDFS). Originally developed in the AMPLab at UC Berkeley. Company - Databricks.|
|SparkSQL||Open source, allows to use SQL (or HiveQL or Scala) over Spark. It is recommended to move from Hive to SparkSQL.|
|Spark RDD||Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.|
Open source memory-caching for distributed file systems (HDFS, S3, GlusterFS, etc.). Hadoop compatible. Existing Spark and MapReduce programs can run on top of it without any code change. Used in Berkeley Data Analytics Stack (BDAS).
|DataPad||Data Analysis tools (Wes McKinney et al)
- PyData, June 2014, 40min
- GraphLab Conf, July 2014, 22min
|Pig||provides language (Pig Latin) for MapReduce over HDFS. Originally from Yahoo, 2006.|
|Mahout||machine learning algorythm on the Hadoop platform. Also provides math and statistics.|
|Oozie||A workflow scheduler system to manage Hadoop jobs (web app, Java servlets)|
distributed service for collecting, aggregating, and moving large amounts of log data, good for analytic applications
|Sqoop||tool for moving bulk data between Apache Hadoop and other storages (for example, relational databases)|
|Claudera Hue||Cloudera web interface for using Hadoop and analyzing data.|
|Claudera CDH||Cloudera Hadoop Distribution (free download)|
|Cloudera Express||CDH + Cloudera Manager (free download) - best way to start with Hadoop|
|Cloudera Impala||fast SQL queying of HDFS data (massive parrallel processing) - good for analytics, eliminates the need of moving data into some other datastore for analytics. Impala = medium-sized African antelope known for fast running and jumping.|
Open source, VERY FAST (written in C) in-memory (disk-backed) key-value database. REDIS = REmote DIctionary Server. DB should fit in RAM (or span RAMs of several computers using clustering). Supports bits, sets, lists, hashes. Support transactions. Great for real time stock prices, analytics, etc.
|Kyoto Tycoon - Kyoto Cabinet||
Open Source fast and lightweight network DBM over HTTP, can do more than 1 million insert/select per second. Multiple storage options (hash, tree, dir, etc.). Lua on server side. Can be used with C, Java, Python, Ruby, Perl, Lua, etc. Great for real-time data (cache).
Open Source distributed NoSQL document-oriented database to serve many concurrent users. Easy-to-scale key-value or document access with low latency and high sustained throughput. Designed to be clustered to very large scale deployments.
Open source graph database for connected data (nodes and relationships). Transactional. Advanced path finding. Good for road maps, topologies, etc. (graphs). Written in Java.
Open source database - basically a faster smaller HBase implemented in C++, uses ideas from Google's BigTable. Runs on HDFS (or ClusterFS or KFS (Kosmos File System). Uses its own, "SQL-like" language, HQL.
Open source distributed full-text search engine with HTTP interface and JSON documents (parent/children docs). Based on Java Lucene library. Great when you need advanced flexible fuzzy search.
|Solr||Open source popular blazing fast search platform. Based on Java Lucene library. Used by SalesForce.com over pure-java Jetty web server/servlet container.|
Open source - similar to HBase, but provides cell-level security (access labels). Sorted, distributed key/value storage. Built on top of Hadoop, ZooKeeper, and Thrift. Written in Java and C++.
|VoltDB||The world’s fastest in-memory relational database. Fast ingestion and export, massive scalability, real-time analytics. SQL access from within pre-compiled Java stored procedures. Open Source and commercial versions.|
Open source distributed memory caching system. Can be used to cache results of DB queries.
Open source scalable, distributed, transactional key-value store. Written in Erlang, accessible from Python, Ruby, Java, etc.
|BigTable||Google proprietary data storage system. Compressed, high performance. Uses Google File System, Chubby Lock Service, SSTable (log-structured storage like LevelDB).|
|LevelDB||open source on-disk key-value store (by Google).|
|GoogleFS||Google File System - proprietary|
|QFS||Quantcast File System (QFS) is an open-source distributed file system software package for large-scale MapReduce or other batch-processing workloads. It was designed as an alternative to Apache Hadoop’s HDFS, intended to deliver better performance and cost-efficiency for large-scale processing clusters.|
|Oracle noSQL database||Oracle NoSQL Database (ONDB) provides network-accessible multi-terabyte distributed key/value pair storage with predictable latency.|
|Oracle BigData SQL||Oracle Big Data SQL extends Oracle SQL to Hadoop and NoSQL|
|Oracle Data Integrator||Oracle Data Intergator (ODI) is an ETL (Extract-Transform-Load) platform for high volume/high performance batch loads, event-driven, trickle-feed integration, SOA-enabled data service, etc. BigData support, parallelism. Monitoring.|
|Talend||Open source ETL / integration tools, supports BigData. Look at Talend Jumpstart Sandbox.|
|Ab Initio||Applications and custom soultions for very high-volume data processing and data integration. Record-breaking.|
|kiji||Open Source framework for collecting, analyzing, and serving entity data in real time.|
|HPCC||HPCC (High Performance Computing Cluster) - open source platform for massive parallel-processing to solve Big Data problems|
|Pivotal||Big Data solutions, Hadoop distribution|
|IBM Big Data Platform||IBM's enterprise class big data and analytics platform – Watson Foundations.|
|HP Haven||HP HAVEn Big Data platform = (Hadoop, Autonomy Corporation, Vertica, HP Enterprise Security Products).|
|Amazon||Cloud Platform. Amazon Elastic Cloud Compute (Amazon EC2), Amazon Simple Storage Service (Amazon S3), Amazon Web Services (AWS), Amazon Redshift (Datawarehousing solutions), Amazon Elastic MapReduce.|
|Intel||Partners with Claudera to provide server platforms for Big Data.|
|Signiant||System to move large data sets into and out of the cloud (accross datacenters, parallel, encrypted). Signiant Media Shuttle, Signiant Media Exchange and Signiant Manager+Agents, Signiant SkyDrop.|
|BitYota||Datawarehouse as a service|
|DataStax||Delivers certified version of Apache Cassandra that is ready for heavy-duty production environments|
|Greenplum||Bigdata analytics, in 2012 became a part of Pivotal.|
|Teradata||Expandable relational datawarehouse system (since 1979), "shared nothing" architecture.|
|Splunk||Captures, indexes and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards and visualizations|
|Sumo Logic||Cloud-based log management and analytics service. Uses advanced machine learning algorithms to whittle down mountains of log file data into common groupings.|
|qubole||Creators of Facebook’s Big Data infrastructure and Apache Hive have leveraged their experience to deliver Qubole Data Service (QDS) – a cloud Big Data service offering the same advanced capabilities used by Big Data savvy organizations.|
|altiscale||Altiscale Data Cloud is the first cloud service purpose-built to run Hadoop. We offer an on-demand, elastic solution on a pay-as-you-go basis|
Microsoft and Big Data ------------------------------
Hadoop and HDInsight: Big Data in Windows Azure:
HDInsight is an Apache Hadoop implementation that runs in globally distributed Microsoft datacenters.
It’s a service that allows you to easily build a Hadoop cluster in minutes when you need it.
Windows Azure Storage - can include NoSQL stores, SQL database, blobs, etc. REST-ful API http). Support virtual machines to run Hadoop on Linux.
http://talkincloud.com/talkin-cloud-top-100-cloud-services-providers-0 - top 100 Cloud service providers
Here are the top 10 (number = Rank, 1 - top rank)
1. Salesforce.com, San Francisco, CA
2. Amazon, Seattle, WA
3. Microsoft, Redmond, WA
4. Oracle, Redwood City, CA
5. Google, Mountain View, CA
6. SAP, Walldorf, Germany
7. SoftLayer (IBM), Dallas, TX
8. Terremark (a Verizon Company), Miami, FL
9. Rackspace, San Antonio, TX
10. NetSuite, San Mateo, CA
Some technologes to consider:
38 Seminal Articles Every Data Scientist Should Read (from datasciencecentral.com, August 2014)
DSC Internal Papers
Hadoop - 15 video tutorials
Hadoop MapReduce - 5 videos
Cloudera Video Tutorials
Real-Time Analytics - videos