With Cloud Dataproc, Google promises a Hadoop or Spark cluster in 90 seconds

Getting insights out of big data is typically neither quick nor easy, but Google is aiming to change all that with a new, managed service for Hadoop and Spark.

Cloud Dataproc, which the search giant launched into open beta on Wednesday, is a new piece of its big data portfolio that’s designed to help companies create clusters quickly, manage them easily and turn them off when they’re not needed.

Enterprises often struggle with getting the most out of rapidly evolving big data technology, said Holger Mueller, a vice president and principal analyst with Constellation Research.

“It’s often not easy for the average enterprise to install and operate,” he said. When two open source products need to be combined, “things can get even more complex.”

Computerworld Cloud Computing

IBM Strengthens Effort to Support Open Source Spark for Machine Learning

IBM is providing substantial resources to the Apache Software Foundation’s Spark project to prepare the platform for machine learning tasks such as pattern recognition and object classification. The company plans to offer Spark as a service on Bluemix and has dedicated 3,500 researchers and developers to help maintain and further develop it.

In 2009, the AMPLab at the University of California, Berkeley, developed the Spark framework, which was open-sourced a year later and subsequently became an Apache project. The framework, which runs on a server cluster, can process some workloads up to 100 times faster than Hadoop MapReduce. With data and analytics now embedded throughout business and society, from applications to the Internet of Things (IoT), Spark provides essential advances in large-scale data processing.

First, it significantly improves the performance of data-dependent applications. Second, it radically simplifies the process of developing intelligence that is fed by that data. Specifically, in its effort to accelerate innovation in the Spark ecosystem, IBM has decided to build Spark into its own predictive analytics and machine learning platforms.

IBM Watson Health Cloud will use Spark to serve healthcare providers and researchers as they gain access to new population health data. At the same time, IBM will release its SystemML machine learning technology as open source. IBM is also collaborating with Databricks to advance Spark's capabilities.

IBM will commit more than 3,500 researchers and developers to work on Spark-related projects in more than a dozen laboratories worldwide. Big Blue plans to open a Spark Technology Center in San Francisco for the data science and developer community. IBM will also train more than one million data scientists and data engineers on Spark through partnerships with DataCamp, AMPLab, Galvanize, MetiStream, and Big Data University.

A typical large corporation has hundreds or thousands of data sets residing in different databases across its computer systems. A data scientist can design an algorithm to plumb the depths of any one database, but developing that algorithm takes about 90 working days of data science effort. Today, applying it to another system means roughly another quarter of work to adapt the algorithm. Spark cuts that time in half. A Spark-based system can access and analyze any database without further development or additional delay.

Another of Spark's virtues is ease of use: developers can concentrate on designing the solution rather than building an engine from scratch. Spark advances large-scale data processing because it improves the performance of data-dependent applications, radically simplifies the process of developing intelligent solutions, and provides a platform capable of unifying all kinds of information in real-world workflows.

Many experts consider Spark the successor to Hadoop, but its adoption remains slow. Spark works very well for machine learning tasks that normally require large clusters of computers. The latest version of the platform, released recently, extends the range of machine learning algorithms it can run.


CloudTimes

Amazon Web Services jumps on Spark bandwagon

Amazon Web Services’ EMR (Elastic MapReduce) service has been upgraded to handle Spark applications, giving enterprises that want to use the increasingly popular processing engine a way to do so without building their own infrastructure.

Apache Spark is an open-source distributed processing engine used for big data workloads. It’s a good fit for batch processing, streaming, graph databases and machine learning thanks to in-memory caching and optimized execution for fast performance, according to Amazon.

EMR supports Spark version 1.3.1 and utilizes Hadoop YARN as the cluster manager. Running Spark on top of EMR has been possible before, but the integrated support should make using the engine more straightforward. IT staff can create a cluster from the AWS Management Console, for example. Spark applications developed using Scala, Python, Java, and SQL can all run on EMR.

CIO Cloud Computing