Cloud-Scale Text Classification With Convolutional Neural Networks On Microsoft Azure



This post is by Miguel Fierro, Ilia Karmanov, Thomas Delteil, Andreas Argyriou, and Max Kaznady, all Data Scientists at Microsoft.

Natural Language Processing (NLP) is one of the fields in which deep learning has made significant progress. Specifically, the area of text classification, where the objective is to categorize documents, paragraphs or individual sentences into classes, has attracted the interest of both industry and academia. Examples include determining what topics are discussed in a document or assessing whether the sentiment conveyed in a text passage is positive, negative or neutral. This information can be used by companies to define marketing strategy, generate leads or improve customer service.

This is the fourth blog showcasing deep learning applications on Microsoft’s DATA SCIENCE VIRTUAL MACHINE (DSVM) with GPUS using the R API of the deep learning library MXNET. The DSVM is a custom virtual machine image from Microsoft that comes pre-installed with popular data science tools for modeling and development activities.

In our FIRST POST, we showed how to set up a deep learning environment in one of the new DSVMs with NVIDIA TESLA K80 GPUS, installing CUDA drivers, Microsoft R Server and MXNet. In the SECOND POST, we presented a pipeline for massively parallel scoring of 2.3 million images, forming a collage of the Mona Lisa, using an HDINSIGHT APACHE SPARK CLUSTER. Finally, in the THIRD POST we illustrated how to train a network on multiple GPUs to classify objects among 1000 classes, using the IMAGENET dataset and RESNET architecture.

In this installment of the deep learning series, we will demonstrate how to use Convolutional Neural Networks (CNNs) in a text classification problem. We will explain how to build an end-to-end pipeline, train a CNN for text classification and prepare the model for production so it can be queried by a user to classify sentences via a web service.

Deep Learning for Text Classification on Azure

The development of Recurrent Neural Networks (RNNs) has led to significant advances in deep learning for NLP. These networks, especially the subclass of Long Short Term Memory Networks (LSTMs), have achieved promising results in tasks related to temporal series, for instance, in SPEECH RECOGNITION, TEXT UNDERSTANDING and TEXT CLASSIFICATION, usually treating the text as groups of words.

The area of text classification has been developed mostly with machine learning models that use features at the word level. The use of word features such as BAG OF WORDS, N-GRAMS or WORD EMBEDDINGS has been shown to be very successful. Some examples of text classification methods are BAG OF WORDS WITH TFIDF, K-MEANS on WORD2VEC, CNNS WITH WORD EMBEDDING, LSTM or BAG OF N-GRAMS WITH A LINEAR CLASSIFIER.

In parallel, there have been important advances in image recognition using different types of CNNs. The ResNet architecture, introduced by Microsoft Research, is an example of this and was the first to surpass HUMAN PERFORMANCE in image classification. The reason for this extraordinary success is that CNNs learn hierarchical representations in increasing levels of abstraction. This means they don’t just classify features but also automatically generate them in the first place.

Motivated in part by the success of CNNs in image recognition problems, where the inputs to the network are the pixels in images, a group of RESEARCHERS proposed using CNNs for text understanding, using the most atomic representation of a sentence: characters. Even though other researchers have used sub-word units as inputs to deep networks for INFORMATION RETRIEVAL and ANTI-SPAM FILTERING, the idea of using CNNs for text classification at character level first appeared in 2015 with the CREPE MODEL. The following year the technique was developed further in the VDCNN MODEL and the CHAR-CRNN model.

Text Classification with Convolutional Neural Networks at the Character Level

To achieve text classification with a CNN at the character level, each sentence needs to be transformed into an image-like matrix, where each encoded character is equivalent to a pixel in the image. This process is explained in detail in ZHANG ET AL., but here’s a quick summary.

Fig. 1: Scheme of character encoding. Each sentence is encoded as a 69×1014 matrix.
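
The encoding in Fig. 1 can be illustrated with a minimal sketch. It assumes a 69-character alphabet of lowercase letters, digits and punctuation (the exact alphabet is not listed in this post, so the string below is an assumption), one row per alphabet character and one column per character position. Characters outside the alphabet, including spaces, become all-zero columns in this sketch, and text longer than 1014 characters is truncated. The function name encode_sentence is hypothetical.

```python
# Assumed 69-character alphabet: 26 letters, 10 digits, 33 punctuation marks.
# The hyphen occurs twice in this string, so one of its two rows is never set.
ALPHABET = (
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"
)
MAX_LENGTH = 1014  # number of character positions (matrix columns)


def encode_sentence(sentence, alphabet=ALPHABET, max_length=MAX_LENGTH):
    """Return a len(alphabet) x max_length one-hot matrix as nested lists."""
    matrix = [[0] * max_length for _ in alphabet]
    for column, char in enumerate(sentence.lower()[:max_length]):
        row = alphabet.find(char)  # -1 when the character is not in the alphabet
        if row >= 0:
            matrix[row][column] = 1
    return matrix


encoded = encode_sentence("Azure rocks!")
print(len(encoded), len(encoded[0]))  # 69 1014
```

Each column holds at most a single 1, so the matrix plays the role of a very sparse one-channel image that a CNN can convolve over.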


Quick-Start Guide To The Data Science Bowl Lung Cancer Detection Challenge, Using Deep Learning, Microsoft Cognitive Toolkit And Azure GPU VMs



This post is by Miguel Fierro, Data Scientist, Ye Xing, Senior Data Scientist, and Tao Wu, Principal Data Scientist Manager, all at Microsoft.

Since its launch in mid-January, THE DATA SCIENCE BOWL LUNG CANCER DETECTION COMPETITION has attracted more than 1,000 submissions. To be successful in this competition, data scientists need to be able to get started quickly and make rapid iterative changes. In this post, we show how to compute features of the scanned images in the competition with a pre-trained Convolutional Neural Network (CNN), and use these features to classify the scans into cancerous or not cancerous, using a boosted tree, all in one hour. With a score of 0.55979, you would be ranked in the top 10% as of January 19th on the leaderboard, or in the top 20% as of February 7th.

To achieve this, we used the following:

  1. A pre-trained CNN as the image featurizer. This 152-layer ResNet model is implemented on the Microsoft Cognitive Toolkit deep learning framework (formerly called CNTK) and trained using the ImageNet dataset.
  2. LightGBM gradient boosting framework as the image classifier.
  3. Azure Virtual Machines (VMs) with GPU acceleration.

For the impatient, we have shared our code in this JUPYTER NOTEBOOK. The computation of the Cognitive Toolkit process takes 53 minutes (29 minutes, if a simpler, 18-layer ResNet model is used), and the computation of the LightGBM process takes 6 minutes at a learning rate of 0.001. A simple version of the code was also published on KAGGLE.
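
To put the 0.55979 score quoted above in context, here is a small sketch of how such a score could be computed. It assumes the leaderboard metric is the binary log loss (lower is better); this post does not name the metric, so treat that as an assumption and check the competition page. The labels below are made up for illustration.

```python
import math


def log_loss(labels, probabilities, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [1, 0, 0, 1]             # hypothetical ground truth (1 = cancer)
uninformed = [0.5] * len(labels)  # a model that always answers "50%"
print(round(log_loss(labels, uninformed), 5))  # 0.69315
```

Under that assumption, a score below about 0.693 means the predictions carry more information than always answering 0.5.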


According to the American Lung Association, LUNG CANCER IS THE LEADING CANCER IN MORTALITY, in both men and women in the US, with a low rate of early diagnosis. The DATA SCIENCE BOWL COMPETITION on Kaggle aims to help with early lung cancer detection. Participants use machine learning to determine whether CT SCANS of the lung have cancerous lesions or not. A 3D representation of such a scan is shown in Fig. 1.

Fig. 1: 3D volume rendering of a sample lung using competition data. It was computed using the script from this blog post.

Training speed is one of the most important factors for success at competitions like these. In this respect, both Cognitive Toolkit and LightGBM are excellent in a range of tasks (SHI ET AL., 2016; LIGHTGBM PERFORMANCE SUMMARY). These two solutions, combined with Azure’s high-performance GPU VM, provide a powerful on-demand environment to compete in the Data Science Bowl.

To get started on the GPU VM, you need to install these frameworks:

  • CUDA: CUDA 8.0 can be downloaded from the NVIDIA website (registration is required). If you are using Linux, you also need to download CUDA Patch 1 from the website. The patch adds support for gcc 5.4 as one of the host compilers.
  • cuDNN: cuDNN 5.1 (registration with NVIDIA required).
  • MKL: Intel’s Math Kernel Library (MKL) version 11.3 update 3 (registration with Intel required).
  • Anaconda: Anaconda 4.2.0 provides support for conda environments and jupyter notebooks.
  • OpenCV: Download and install from the official OpenCV website. This can also be installed via conda with this command:

    conda install -c conda-forge opencv

  • Scikit-learn: Scikit-learn 0.18 is easily installed via pip:

    pip install scikit-learn

  • Cognitive Toolkit: Cognitive Toolkit 2.0 beta9 for Python. You can build from source but it’s faster to install the precompiled binaries.
  • LightGBM: LightGBM is easily installed with CMake. You will also need to install the Python bindings.
  • Data management libraries: You also need to install dicom and glob libraries, using pip:

    pip install pydicom glob2

In addition to these libraries and the pre-trained network (downloadable HERE), it’s necessary to download the competition DATA. The images are in DICOM format and consist of a group of slices of the thorax of each patient (see Fig. 2).

Fig. 2: Axial slices of the thorax of a patient with cancer (left) and a patient without cancer (right).

Cancer Image Detection with Cognitive Toolkit and LightGBM


Connecting your own Hadoop or Spark to Azure Data Lake Store


A frequent question we get is: how do I connect my Hadoop or Spark cluster to Azure Data Lake Store? It turns out it is really easy to do. Here is a step-by-step article that will help you get this configured. Enjoy!

Connecting your own Hadoop or Spark to Azure Data Lake Store

Load Data From Azure Data Lake Into Azure SQL Data Warehouse At 3TB/Hour



Re-posted from the Azure blog.

AZURE SQL DATA WAREHOUSE (Azure SQL DW, or just SQL DW for short) is a SQL-based, fully managed, petabyte-scale data warehousing solution in the cloud. It is highly elastic, enabling you to provision in minutes and scale capacity in seconds. You can scale compute and storage independently, allowing you to burst compute for complex analytical workloads or scale down your warehouse for archival scenarios. What’s more, you can pay by usage, rather than being locked into expensive predefined cluster configurations.

AZURE DATA LAKE (ADL) is a no-limits data lake optimized for massively parallel processing, and it lets you store and analyze petabyte-size files and trillions of objects.

A common use case involving ADL Store (ADLS) and SQL DW is the following: Raw data is ingested into ADLS from a variety of sources. ADL Analytics (ADLA) is used to clean and process the data into a loading-ready format. From there, high value data is imported into Azure SQL DW for interactive analytics.

Until recently, the data in ADLS would be loaded into SQL DW using row-by-row insertion which, obviously, consumed time and meant delays in how quickly data could be explored to gain useful business insights.

However, as we RECENTLY ANNOUNCED, with SQL DW PolyBase support for ADLS, you can now load data directly from ADLS into your SQL DW instance using External Tables at nearly 3TB per hour. Because SQL DW can now ingest data directly from Azure Storage Blob and ADLS, you can load data from any Azure storage service, giving you the flexibility to choose what’s right for your application. The picture below captures the “Before” and “After” situation.

Intrigued? Read THIS POST to learn more, including how to connect ADLS to SQL DW, and best practices for loading data. Learn more about the new PolyBase capability HERE. You can also check out a short VIDEO CLIP on how to use this new feature:

If you already have an Azure Data Lake Store, you can try LOADING YOUR DATA INTO SQL DATA WAREHOUSE. For those of you still exploring Azure Data Lake, check out these nice ADLS TUTORIALS which will get you up and running.

CIML Blog Team

Distributed Deep Learning on HDInsight with Caffe on Spark



Deep learning is impacting everything from healthcare to transportation to manufacturing, and more. Companies are turning to deep learning to solve hard problems, like image classification, speech recognition, object recognition, and machine translation.

There are many popular frameworks, including Microsoft Cognitive Toolkit, TensorFlow, MXNet, Theano, etc. Caffe is one of the most famous non-symbolic (imperative) neural network frameworks, and is widely used in many areas including computer vision. Furthermore, CaffeOnSpark combines Caffe with Apache Spark, so that deep learning can be easily used on an existing Hadoop cluster together with Spark ETL pipelines, reducing system complexity and latency for end-to-end learning.

HDInsight is the only fully-managed cloud Hadoop offering that provides optimized open source analytic clusters for Spark, Hive, MapReduce, HBase, Storm, Kafka, and R Server backed by a 99.9% SLA. Each of these big data technologies and ISV applications is easily deployable as a managed cluster with enterprise-level security and monitoring.

Some users have asked us how to use deep learning on HDInsight, Microsoft’s PaaS Hadoop product. We will have more to share in the future, but today we want to summarize a technical blog on how to use Caffe on HDInsight Spark.

If you have installed Caffe before, you will have noticed that installing this framework is a little bit challenging. In this blog, we will first illustrate how to install Caffe on Spark for an HDInsight cluster, then use the built-in MNIST demo to demonstrate distributed deep learning using HDInsight Spark on CPUs.

There are four major steps to get it working on HDInsight.

  1. Install the required dependencies on all the nodes
  2. Build Caffe on Spark for HDInsight on the head node
  3. Distribute the required libraries to all the worker nodes
  4. Compose a Caffe model and run it in a distributed manner

Since HDInsight is a PaaS solution, it offers great platform features – so it is quite easy to perform some tasks. One of the features that we heavily use in this blog post is called Script Action, with which you can execute shell commands to customize cluster nodes (head node, worker node, or edge node).

Step 1: Install the required dependencies on all the nodes

To get started, we need to install the dependencies. The Caffe site and the CaffeOnSpark site offer useful wikis for installing the dependencies for Spark in YARN mode (which is the mode for HDInsight Spark), but we need to add a few more dependencies for the HDInsight platform. We will use the script action below and run it on all the head nodes and worker nodes. This script action takes about 20 minutes, as those dependencies also depend on other packages. I put the script in my GitHub location so it is accessible by the cluster.

#Please be aware that installing the packages below adds about 20 minutes to cluster creation because of their dependencies
#installing all dependencies, including the ones mentioned in, as well as a few packages that are not included in HDInsight, such as gflags, glog, lmdb, numpy
#It seems numpy is only needed at compile time, but to be safe we install it on all the nodes

sudo apt-get install -y libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libhdf5-serial-dev protobuf-compiler maven libatlas-base-dev libgflags-dev libgoogle-glog-dev liblmdb-dev build-essential  libboost-all-dev python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose

#install protobuf (assumes protobuf-2.5.0.tar.gz is already in the current directory)
sudo tar xzvf protobuf-2.5.0.tar.gz -C /tmp/
cd /tmp/protobuf-2.5.0/
sudo ./configure
sudo make
sudo make check
sudo make install
sudo ldconfig
echo "protobuf installation done"

There are two steps in the script action above. The first step is to install all the required libraries. These include the libraries needed both for compiling Caffe (such as gflags, glog) and for running Caffe (such as numpy). We are using libatlas for CPU optimization, but you can always follow the CaffeOnSpark wiki on installing other optimization libraries, such as MKL or CUDA (for GPU).

The second step is to download, compile, and install protobuf 2.5.0, which Caffe needs at runtime. Protobuf 2.5.0 is required; however, this version is not available as a package on Ubuntu 16, so we need to compile it from source. There are also a few resources on the Internet on how to compile it, such as this

To get started quickly, you can just run this script action against all the worker nodes and head nodes of your cluster (for HDInsight 3.5). You can run the script action either on a running cluster or at cluster provisioning time. For more details on script actions, please see the documentation here

Script Actions to Install Dependencies

Step 2: Build Caffe on Spark for HDInsight on the head node


Home Appliances, Vending Machines – Even Cruise Ships – Get an Infusion of Cortana Intelligence & Machine Learning



A quick overview of recent customer case studies involving the application of Microsoft’s AI, Big Data & Machine Learning offerings.

Carnival Maritime Predicts Water Consumption on Cruise Ships

The Costa Group’s fleet of 26 cruise ships sails all over the world. The industrial equipment on their ships has thousands of sensors that collect data in real time. As part of their digital transformation, the company’s marine service unit, Carnival Maritime, wanted to explore how it might take advantage of this data to find opportunities for operational improvement. One of the areas they looked at was water consumption on board their ships. This is a complex problem, as consumption patterns can vary widely. Passengers of different nationalities shower for different durations at different temperatures and times of the day, for instance, and there are numerous other variables that make such consumption challenging to predict. Accurately predicting water consumption helps ship captains avoid spending fuel on producing excessive amounts of water at sea. This also mitigates their need to carry all that excess water along the way, which further shaves costs.

Carnival needed a mechanism to predict the right amount of water to produce at the right time, without having to store any excess. Carnival’s partner, Arundo Analytics, a global provider of analytical and predictive solutions, helped them build a microservice on their proprietary big-data platform, and trained a model to help them do just that. Using the machine learning models, APIs, and templates in the MICROSOFT CORTANA INTELLIGENCE SUITE, Arundo analyzed historical data sets along with data such as the speed and position of the ships, age and nationality of passengers, historical weather data and more, to better understand the exact drivers of water consumption. Their platform runs on Azure and can easily connect to and derive value from a variety of data, from both Carnival’s cloud and on-premises databases.

Carnival is now able to better predict how much water a ship will need for a specific route with a particular set of guests. They estimate that their optimizations can help each ship save over $200,000 a year. The solution also contributes to the company’s goal of reducing carbon emissions. Carnival is next looking to implement a predictive maintenance solution for its fleet, using Cortana Intelligence to study the data that’s already being collected from thousands of on-board sensors on each ship. You can LEARN MORE ABOUT THE CARNIVAL MARITIME STORY HERE.

Arçelik A.Ş. Increases Forecasting Accuracy on Spare Parts


Add Intelligence To Any SQL App, With The Power Of Deep Learning



Re-posted from the SQL Server blog.

Recent results and applications involving Deep Learning have proven to be incredibly promising, and across a diverse set of areas too, including speech recognition, language understanding, computer vision and more. Deep Learning is changing customer expectations and experiences around a variety of products and mobile apps, whether we’re aware of it or not. That’s definitely true of Microsoft apps you’re likely to be using every day, such as Skype, Office 365, Cortana or Bing. As we’ve mentioned before, our Deep Learning based language translation in Skype was recently named one of the 7 GREATEST SOFTWARE INNOVATIONS OF 2016 BY POPULAR SCIENCE, a true technological milestone, with machines now sitting at or above human parity, when it comes to recognizing conversational speech.

As a result of these developments, it’s only a matter of time before intelligence powered by Deep Learning becomes an expectation of any app.

In a new blog post, Rimma Nehme addresses the question of how easy it might be for your typical SQL Server developer to integrate Deep Learning into their app. This question is especially timely in light of the recent enhancement to SQL Server 2016 through the integration of R Services, with powerful ML functions, including deep neural networks (DNNs) as a core part of it.

Can we help you turn any SQL app into a truly ‘intelligent’ app, and ideally with just a few lines of code?

To find out, READ THE ORIGINAL BLOG POST HERE – the answer may surprise you.

Deep Learning – Microsoft Cognitive Toolkit (CNTK) Deep Dive and Hands-on Tutorial – Nov 2016

The Microsoft Cognitive Toolkit—previously known as CNTK—empowers you to harness the intelligence within massive datasets through deep learning by providing uncompromised scaling, speed and accuracy with commercial-grade quality and compatibility with the programming languages and algorithms data scientists already use.

Next Generation of Databases and Data Lakes from Microsoft

This post was authored by Joseph Sirosh, Corporate Vice President of the Microsoft Data Group.

Microsoft Connect() 2016

For the past two years, we’ve unveiled several of our cutting-edge technologies and innovative solutions at Connect();, which will be livestreamed globally from New York City starting November 16. This year, I am thrilled to announce the next generation of SQL Server and Azure Data Lake, and several new capabilities to help developers build intelligent applications.

1. Next release of SQL Server with Support for Linux and Docker (Preview)

I am excited to announce the public preview of the next release of SQL Server which brings the power of SQL Server to both Windows – and for the first time ever – Linux. Now you can also develop applications with SQL Server on Linux, Docker, or macOS (via Docker) and then deploy to Linux, Windows, Docker, on-premises, or in the cloud.  This represents a major step in our journey to making SQL Server the platform of choice across operating systems, development languages, data types, on-premises and the cloud.  All major features of the relational database engine, including advanced features such as in-memory OLTP, in-memory columnstores, Transparent Data Encryption, Always Encrypted, and Row-Level Security now come to Linux. Getting started is easier than ever. You’ll find native Linux installations (more info here) with familiar RPM and APT packages for Red Hat Enterprise Linux, Ubuntu Linux, and SUSE Linux Enterprise Server. The public preview on Windows and Linux will be available on Azure Virtual Machines and as images available on Docker Hub, offering a quick and easy installation within minutes.  The Windows download is available on the Technet Eval Center.

We have also added significant improvements into R Services inside SQL Server, such as a very powerful set of machine learning functions that are used by our own product teams across Microsoft. This brings new machine learning and deep neural network functionality with increased speed, performance and scale, especially for handling a large corpus of text data and high-dimensional categorical data. We have just recently showcased SQL Server running more than one million R predictions per second and encourage you all to try out R examples and machine learning templates for SQL Server on GitHub.

The choice of application development stack with the next release of SQL Server is absolutely amazing – it includes .NET, Java, PHP, Node.JS, etc. on Windows, Linux and Mac (via Docker). Native application development experience for Linux and Mac developers has been a key focus for this release. Get started with the next release of SQL Server on Linux, macOS (via Docker) and Windows with our developer tutorials that show you how to install and use the next release of SQL Server on macOS, Docker, Windows, RHEL and Ubuntu and quickly build an app in a programming language of your choice.

SQL Server

2. SQL Server 2016 SP1

We are announcing SQL Server 2016 SP1, which is a unique service pack: for the first time, we introduce a consistent programming model across SQL Server editions. With this model, programs written to exploit powerful SQL features such as in-memory OLTP, in-memory columnstore analytics, and partitioning will work across Enterprise, Standard and Express editions. Developers will find it easier than ever to take advantage of innovations such as in-memory databases and advanced analytics: you can use these advanced features in the Standard Edition and then step up to Enterprise for mission-critical performance, scale and availability, without having to rewrite your application.


Eight scenarios with Apache Spark on Azure that will transform any business


This post was authored by Rimma Nehme, Technical Assistant, Data Group.


Since its birth in 2009, and the time it was open sourced in 2010, Apache Spark has grown to become one of the largest open source communities in big data with over 400 organizations from 100 companies contributing to it. Spark stands out for its ability to process large volumes of data up to 100x faster than Hadoop MapReduce, because data is persisted in-memory. Azure cloud makes Apache Spark incredibly easy and cost effective to deploy with no hardware to buy, no software to configure, with a full notebook experience to author compelling narratives, and integration with partner business intelligence tools. In this blog post, I am going to review some of the truly game-changing usage scenarios with Apache Spark on Azure that companies can employ in their context.

Scenario #1: Streaming data, IoT and real-time analytics

Apache Spark’s key use case is its ability to process streaming data. With so much data being processed on a daily basis, it has become essential for companies to be able to stream and analyze it all in real time. Spark Streaming has the capability to handle this type of workload exceptionally well. As shown in the image below, a user can create an Azure Event Hub (or an Azure IoT Hub) to ingest rapidly arriving data into the cloud; both Event and IoT Hubs can intake millions of events and sensor updates per second that can then be processed in real-time by Spark.

Scenario 1_Spark Streaming

Businesses can use this scenario today for:

  • Streaming ETL: In traditional ETL (extract, transform, load) scenarios, the tools are used for batch processing, and data must be first read in its entirety, converted to a database compatible format, and then written to the target database. With Streaming ETL, data is continually cleaned and aggregated before it is pushed into data stores or for further analysis.
  • Data enrichment: Streaming capability can be used to enrich live data by combining it with static or ‘stationary’ data, thus allowing businesses to conduct more complete real-time data analysis. Online advertisers use data enrichment to combine historical customer data with live customer behavior data and deliver more personalized and targeted ads in real-time and in the context of what customers are doing. Since advertising is so time-sensitive, companies have to move fast if they want to capture mindshare. Spark on Azure is one way to help achieve that.
  • Trigger event detection: Spark Streaming can allow companies to detect and respond quickly to rare or unusual behaviors (“trigger events”) that could indicate a potentially serious problem within the system. For instance, financial institutions can use triggers to detect fraudulent transactions and stop fraud in its tracks. Hospitals can also use triggers to detect potentially dangerous health changes while monitoring patient vital signs and sending automatic alerts to the right caregivers who can then take immediate and appropriate action.
  • Complex session analysis: Using Spark Streaming, events relating to live sessions, such as user activity after logging into a website or application, can be grouped together and quickly analyzed. Session information can also be used to continuously update machine learning models. Companies can then use this functionality to gain immediate insights as to how users are engaging on their site and provide more real-time personalized experiences.
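
The session grouping described in the last bullet can be sketched in a few lines of plain Python. This is not Spark Streaming code: it runs over an in-memory list rather than a live stream, and the 30-minute inactivity timeout, the (user_id, timestamp) event shape and the function name sessionize are all assumptions made for illustration.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp) events into per-user sessions.

    A new session starts when a user has been idle for more than `gap` seconds.
    """
    sessions = defaultdict(list)  # user_id -> list of sessions (lists of timestamps)
    for user_id, timestamp in sorted(events, key=lambda e: (e[0], e[1])):
        user_sessions = sessions[user_id]
        if user_sessions and timestamp - user_sessions[-1][-1] <= gap:
            user_sessions[-1].append(timestamp)  # continue the current session
        else:
            user_sessions.append([timestamp])  # start a new session
    return dict(sessions)


events = [("alice", 0), ("bob", 10), ("alice", 600), ("alice", 4000), ("bob", 20)]
print(sessionize(events))
# {'alice': [[0, 600], [4000]], 'bob': [[10, 20]]}
```

In a streaming engine the same idea is usually expressed as a keyed, stateful operation, so that each user's open session is updated as new events arrive.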

Scenario #2: Visual data exploration and interactive analysis

Using Spark SQL running against data stored in Azure, companies can use BI tools such as Power BI, PowerApps, Flow, SAP Lumira, QlikView and Tableau to analyze and visualize their big data. Spark’s interactive analytics capability is fast enough to perform exploratory queries without sampling. By combining Spark with visualization tools, complex data sets can be processed and visualized interactively. These easy-to-use interfaces then allow even non-technical users to visually explore data, create models and share results. Because a wider audience can analyze big data without preconceived notions, companies can test new ideas and visualize important findings in their data earlier than ever before. Companies can identify new trends and relationships that were not apparent before, quickly drill down into them, ask new questions and find smarter ways to innovate.

Scenario 2_Spark visual data exploration and interactive analysis

This scenario is even more powerful when interactive data discovery is combined with predictive analytics (more on this later in this blog). Based on relationships and trends identified during discovery, companies can use logistic regression or decision tree techniques to predict the probability of certain events in the future (e.g., customer churn probability). Companies can then take specific, targeted actions to control or avert certain events.
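
As an illustration of the churn example above, here is a minimal logistic regression sketch in plain Python, trained with batch gradient descent. It is a toy stand-in for a library implementation such as Spark MLlib, and the features (support tickets last month, years as a customer), the data and the function names are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(rows, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def churn_probability(x, weights, bias):
    """Predicted probability that the customer described by x churns."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical features: [support tickets last month, years as a customer]
rows = [[5, 0.5], [4, 1.0], [3, 0.5], [0, 4.0], [1, 3.0], [0, 5.0]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = churned
weights, bias = train_logistic_regression(rows, labels)
print(churn_probability([4, 0.5], weights, bias) > 0.5)  # True
print(churn_probability([0, 4.5], weights, bias) < 0.5)  # True
```

The output is a probability rather than a hard label, which is what lets a business rank customers by risk and decide where a targeted action is worth the cost.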

Scenario #3: Spark with NoSQL (HBase and Azure DocumentDB)

This scenario provides scalable and reliable Spark access to NoSQL data stored either in HBase or our blazing fast, planet-scale Azure DocumentDB, through “native” data access APIs. Apache HBase is an open-source NoSQL database that is built on Hadoop and modeled after Google BigTable. DocumentDB is a true schema-free managed NoSQL database service running in Azure designed for modern mobile, web, gaming, and IoT scenarios. DocumentDB ensures 99% of your reads are served under 10 milliseconds and 99% of your writes are served under 15 milliseconds. It also provides schema flexibility, and the ability to easily scale a database up and down on demand.

The Spark with NoSQL scenario enables ad-hoc, interactive queries on big data. NoSQL can be used for capturing data that is collected incrementally from various sources across the globe. This includes social analytics, time series, game or application telemetry, retail catalogs, up-to-date trends and counters, and audit log systems. Spark can then be used for running advanced analytics algorithms at scale on top of the data coming from NoSQL.

Scenario 3_Spark NoSQL

Companies can employ this scenario in online shopping recommendations, spam classifiers for real time communication applications, predictive analytics for personalization, and fraud detection models for mobile applications that need to make instant decisions to accept or reject a payment. I would also include in this category a broad group of applications that are really “next-gen” data warehousing, where large amounts of data need to be processed inexpensively and then served in an interactive form to many users globally. Finally, internet of things scenarios fit in here as well, with the obvious difference that the data represents the actions of machines instead of people.

Scenario #4: Spark with Data Lake

Spark on Azure can be configured to use Azure Data Lake Store (ADLS) as additional storage. ADLS is an enterprise-class, hyper-scale repository for big data analytic workloads. Azure Data Lake includes all the capabilities required to make it easy for developers, data scientists, and analysts in an enterprise environment to store data of any size, shape and speed, and do all types of processing and analytics across platforms and languages. Because ADLS is a file system compatible with Hadoop Distributed File System (HDFS), it is very easy to combine it with Spark for running computations at scale using pre-existing Spark queries.

Scenario 4_Spark with Data Lake

The data lake scenario arose because new types of data needed to be captured and exploited by companies, while still preserving all of the enterprise-level requirements like security, availability, compliance, failover, etc. The Spark with Data Lake scenario enables truly scalable advanced analytics on healthcare data, financial data, business-sensitive data, geo-location coordinates, clickstream data, server logs, social media, and machine and sensor data. If companies want an easy way of building data pipelines, have unparalleled performance, ensure their data quality, manage access control, perform change data capture (CDC) processing, get enterprise-level security seamlessly and have world-class management and debugging tools, this is the scenario they need to implement.

Scenario #5: Spark with SQL Data Warehouse

While there is still a lot of confusion on this point, Spark and big data analytics are not a replacement for traditional data warehousing. Instead, Spark on Azure can complement and enhance a company’s data warehousing efforts by modernizing the company’s approaches to analytics. A data warehouse can be viewed as an ‘information archive’ that supports business intelligence (BI) users and reporting tools for the mission-critical functions of a company. My definition of mission-critical is any system that supports revenue generation or cost control. If such a system fails, companies would have to manually perform these tasks to prevent loss of revenue or increased cost. Big data analytics systems like Spark help augment such systems by running more sophisticated computations and smarter analytics, and by delivering deeper insights using larger and more diverse datasets.

Azure SQL Data Warehouse (SQLDW) is a cloud-based, scale-out database capable of processing massive volumes of data, both relational and non-relational. Built on our massively parallel processing (MPP) architecture, SQLDW combines the power of the SQL Server relational database with Azure cloud scale-out capabilities. You can increase, decrease, pause, or resume a data warehouse in seconds with SQLDW. Furthermore, you save costs by scaling out CPU when you need it and cutting back usage during non-peak times. SQLDW is the manifestation of the elastic future of data warehousing in the cloud.

Scenario 5_Spark with SQLDW

Use cases for the Spark with SQLDW scenario include using the data warehouse to get a better understanding of customers across product groups, then using Spark for predictive analytics on top of that data. Another is running advanced analytics using Spark on top of an enterprise data warehouse containing sales, marketing, store management, point of sale, customer loyalty, and supply chain data, to drive more informed business decisions at the corporate, regional, and store levels. Using Spark with data warehouse data, companies can do anything from risk modeling, to parallel processing of large graphs, to advanced analytics and text processing, all on top of their elastic data warehouse.

Scenario #6: Machine Learning using R Server, MLlib

Another Spark use case in Azure, and probably one of the most prominent, is machine learning. By storing datasets in memory during a job, Spark delivers great performance for the iterative queries common in machine learning workloads. Common machine learning tasks that can be run with Spark in Azure include (but are not limited to) classification, regression, clustering, topic modeling, singular value decomposition (SVD), principal component analysis (PCA), hypothesis testing, and calculating sample statistics.
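To see why in-memory storage matters for these workloads, consider what an iterative training loop looks like. The sketch below is plain Python, not Spark code, and the function name and toy data are made up for illustration. It fits a line by batch gradient descent, and every pass re-reads the entire dataset. On a cluster, each of those passes would be a scan over distributed data, which is why keeping the data in memory between passes (for example with Spark's `cache()`) can pay off.

```python
def fit_line(xs, ys, learning_rate=0.05, passes=2000):
    """Fit y = w*x + b by batch gradient descent on mean squared error."""
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(passes):
        # Each pass touches every data point again; this repeated full scan
        # is the access pattern that benefits from in-memory storage.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


if __name__ == "__main__":
    # Toy data generated from y = 2x + 1.
    w, b = fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
    print(round(w, 3), round(b, 3))
```

With 2,000 passes over the data, an engine that re-reads its input from disk on every pass pays that cost 2,000 times.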

Typically, if you want to train a statistical model on very large amounts of data, you need three things:

  • Storage platform capable of holding all of the training data
  • Computational platform capable of efficiently performing the heavy-duty mathematical computations required
  • Statistical computing language with algorithms that can take advantage of the storage and computation power

Microsoft R Server, running on HDInsight with Apache Spark, provides all three things above. Microsoft R Server runs within HDInsight Hadoop nodes running on Microsoft Azure. Better yet, the big-data-capable algorithms of ScaleR take advantage of the in-memory architecture of Spark, dramatically reducing the time needed to train models on large data. With multi-threaded math libraries and transparent parallelization in R Server, customers can handle up to 1000x more data and up to 50x faster speeds than open source R. And if your data grows or you just need more power, you can dynamically add nodes to the Spark cluster using the Azure portal. Spark in Azure also includes MLlib for a variety of scalable machine learning algorithms, or you can use your own libraries. Some of the common applications of the machine learning scenario with Spark on Azure are listed in the table below.

| Vertical | Sales and Marketing | Finance and Risk | Customer and Channel | Operations and Workforce |
| --- | --- | --- | --- | --- |
| Retail | Demand forecasting; Loyalty programs; Cross-sell and upsell; Customer acquisition | Fraud detection; Pricing strategy | Personalization; Lifetime customer value; Product segmentation | Store location demographics; Supply chain management; Inventory management |
| Financial Services | Customer churn; Loyalty programs; Cross-sell and upsell; Customer acquisition | Fraud detection; Risk and compliance; Loan defaults | Personalization; Lifetime customer value | Call center optimization; Pay for performance |
| Healthcare | Marketing mix optimization; Patient acquisition | Fraud detection; Bill collection | Population health; Patient demographics | Operational efficiency; Pay for performance |
| Manufacturing | Demand forecasting; Marketing mix optimization | Pricing strategy; Perf risk management | Supply chain optimization; Personalization | Remote monitoring; Predictive maintenance; Asset management |


Scenario 6_Spark Machine Learning

Examples with just a few lines of code that you can try out right now:

Scenario #7: Putting it all together in a notebook experience

For data scientists, we provide out-of-the-box integration with Jupyter (iPython), the most popular open source notebook in the world. Unlike other managed Spark offerings that might require you to install your own notebooks, we worked with the Jupyter OSS community to enhance the kernel to allow Spark execution through a REST endpoint.

We co-led “Project Livy” with Cloudera and other organizations to create an open source, Apache-licensed REST web service that makes Spark a more robust back-end for running interactive notebooks. As a result, Jupyter notebooks are now accessible within HDInsight out-of-the-box. In this scenario, we can use all of the Azure services mentioned above with Spark, with a full notebook experience to author compelling narratives and create collaborative data science spaces. Jupyter is a multi-lingual REPL on steroids. Jupyter notebook provides a collection of tools for scientific computing using powerful interactive shells that combine code execution with the creation of a live computational document. These notebook files can contain arbitrary text, mathematical formulas, input code, results, graphics, videos and any other kind of media that a modern web browser is capable of displaying. So, whether you’re absolutely new to R, Python or SQL, or you do some serious parallel/technical computing, the Jupyter Notebook in Azure is a great choice.
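For readers curious what Spark execution through a REST endpoint looks like underneath the notebook, here is a small Python sketch that builds the two requests a client would send to Livy: one to create an interactive session and one to run a statement in it. It uses the Livy REST paths `/sessions` and `/sessions/{id}/statements`. The cluster URL, the helper names and the `X-Requested-By` value are assumptions for illustration, and the requests are only constructed here, not sent, since sending them requires cluster credentials.

```python
import json
import urllib.request


def livy_request(base_url: str, path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a Livy endpoint."""
    data = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Livy deployments with CSRF protection expect this header;
        # the value used here is an assumption.
        "X-Requested-By": "admin",
    }
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def create_session(base_url: str, kind: str = "pyspark") -> urllib.request.Request:
    """Request that starts an interactive Spark session of the given kind."""
    return livy_request(base_url, "/sessions", {"kind": kind})


def run_statement(base_url: str, session_id: int, code: str) -> urllib.request.Request:
    """Request that runs a snippet of code inside an existing session."""
    return livy_request(base_url, f"/sessions/{session_id}/statements", {"code": code})


if __name__ == "__main__":
    # Hypothetical cluster name; prints the target URL and JSON body only.
    req = run_statement("https://mycluster.azurehdinsight.net/livy", 0, "sc.parallelize(range(10)).sum()")
    print(req.full_url)
    print(req.data.decode("utf-8"))
```

A notebook kernel does essentially this on your behalf: it opens a session, posts each cell as a statement, and polls for the result.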

Scenario 7_Spark with Notebook

You can also use Zeppelin notebooks on Spark clusters in Azure to run Spark jobs. The Zeppelin notebook for HDInsight Spark clusters is offered only to showcase how Zeppelin can be used in an Azure HDInsight Spark environment. If you want to use notebooks to work with HDInsight Spark, I recommend that you use Jupyter notebooks. To make development on Spark easier, we support IntelliJ Spark Tooling, which introduces native authoring support for Scala and Java, local testing, remote debugging, and the ability to submit Spark applications to the Azure cloud.

Scenario #8: Using Excel with Spark

As a final example, I want to describe the ability to connect Excel to a Spark cluster running in Azure using the Microsoft Open Database Connectivity (ODBC) Spark Driver. Download it here.

Scenario 8_Spark with Excel

Excel is one of the most popular clients for data analytics on Microsoft platforms. In Excel, our primary BI tools such as PowerPivot, data-modeling tools, Power View, and other data-visualization tools are built right into the software, with no additional downloads required. This enables users of all levels to do self-service BI using the familiar interface of Excel. Through a Spark Add-in for Excel, users can easily analyze massive amounts of structured or unstructured data with a very familiar tool.