HDInsight – How To Perform Bulk Load With Phoenix?



APACHE HBASE is an open source NoSQL Hadoop database: a distributed, scalable big data store that provides real-time read/write access to large datasets. HDINSIGHT HBASE is offered as a managed cluster that is integrated into the Azure environment. HBase provides many features as a big data store, but to use it, customers first have to load their data into HBase.

There are multiple ways to get data into HBase, such as using the client APIs, running a MapReduce job with TableOutputFormat, or entering the data manually in the HBase shell. Many customers are interested in using APACHE PHOENIX, a SQL layer over HBase, for its ease of use. This post describes how to use Phoenix bulk load with HDInsight clusters.

Phoenix provides two methods for loading CSV data into Phoenix tables – a single-threaded client loading tool via the psql command, and a MapReduce-based bulk load tool.

Single threaded method

Note that this method is suitable only when your bulk load data is in the tens of megabytes, so it may not be a suitable option for most production scenarios. The steps to use this method are as follows.

  • Create Table:

Put the CREATE TABLE command in a file (let’s say CreateTable.sql), based on the schema of your table. Example:

CREATE TABLE inputTable (
    Field1 varchar NOT NULL PRIMARY KEY,
    Field2 varchar,
    Field3 decimal,
    Field4 INTEGER,
    Field5 varchar);


  • Input data: This file contains the input data for bulk load (let’s say input.csv).
  • Query to execute on the data: You can put any SQL query that you would like to run on the data in a file (let’s say Query.sql). A sample query:


SELECT Field5, COUNT(*) FROM inputTable GROUP BY Field5;
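
As a rough illustration of what input.csv might contain, the following Python sketch builds a few rows that match the five-column schema above (varchar key, varchar, decimal, integer, varchar). The row values and the helper name rows_to_csv_text are made up for illustration, and the sketch assumes that the loader reads plain comma-separated values without a header row, in the same column order as the table definition.

```python
import csv
import io
from decimal import Decimal

# Hypothetical sample rows: (Field1, Field2, Field3, Field4, Field5).
SAMPLE_ROWS = [
    ("key-001", "alpha", Decimal("10.50"), 1, "groupA"),
    ("key-002", "beta", Decimal("3.25"), 2, "groupA"),
    ("key-003", "gamma", Decimal("7.00"), 3, "groupB"),
]


def rows_to_csv_text(rows):
    """Render rows as header-less CSV text, one line per table row."""
    keys = [row[0] for row in rows]
    if len(keys) != len(set(keys)):
        # Field1 is the primary key, so duplicates would overwrite each other.
        raise ValueError("Field1 values must be unique")
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()
```

To produce the file itself, the returned text could be written out with something like open("input.csv", "w", newline="").write(rows_to_csv_text(SAMPLE_ROWS)).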


Building Advanced Analytical Solutions Faster Using Dataiku DSS On HDInsight




The AZURE HDINSIGHT APPLICATION PLATFORM lets users run applications that span a variety of use cases, such as data ingestion, data preparation, data processing, building analytical solutions, and data visualization. In this post we will see how DSS (DATA SCIENCE STUDIO) from Dataiku can help a user build a predictive machine learning model to analyze movie sentiment on Twitter.

To learn more about DSS integration with HDInsight, register for the WEBINAR featuring Jed Dougherty from Dataiku and Pranav Rastogi from Microsoft.

DSS on HDInsight

By installing the DSS application on an HDInsight cluster (Hadoop or Spark), the user has the ability to:

  • Automate data flows

DSS has the ability to integrate with multiple data connectors. Users can connect to their existing infrastructure to consume their data. Data can be cleaned, merged and enriched by creating reusable workflows.

  • Use a collaborative platform

One of the highlights of DSS is the ability to work collaboratively on building an analytics solution. Data scientists and analysts can interact with developers to build solutions and improve results. DSS supports a wide variety of technologies, such as R, MapReduce, and Spark.

  • Build prediction models

Another key feature of DSS is the ability to build predictive models leveraging the latest machine learning technologies. The models can be trained using various algorithms and applied to existing flows to predict or cluster information.

  • Work using an integrated UI

DSS offers an integrated UI where you can visualize all the data transforms. Users can create interactive dashboards and share them with other members of the team.

Leverage the power of Azure HDInsight

DSS can leverage the benefits of the HDInsight platform, such as enterprise security, monitoring, SLA and more. DSS users can use the power of MapReduce and Spark to perform advanced analytics on their data. DSS offers various mechanisms to train the built-in ML algorithms when the data is stored in HDInsight. The diagram below illustrates how DSS uses the HDInsight cluster:


How to install DSS on an HDInsight cluster?


Cloud-Scale Text Classification With Convolutional Neural Networks On Microsoft Azure

Source: https://blogs.technet.microsoft.com/machinelearning/2017/02/13/cloud-scale-text-classification-with-convolutional-neural-networks-on-microsoft-azure/


This post is by Miguel Fierro, Ilia Karmanov, Thomas Delteil, Andreas Argyriou, and Max Kaznady, all Data Scientists at Microsoft.

Natural Language Processing (NLP) is one of the fields in which deep learning has made significant progress. Specifically, the area of text classification, where the objective is to categorize documents, paragraphs or individual sentences into classes, has attracted the interest of both industry and academia. Examples include determining what topics are discussed in a document or assessing whether the sentiment conveyed in a text passage is positive, negative or neutral. This information can be used by companies to define marketing strategy, generate leads or improve customer service.

This is the fourth blog showcasing deep learning applications on Microsoft’s DATA SCIENCE VIRTUAL MACHINE (DSVM) with GPUS using the R API of the deep learning library MXNET. The DSVM is a custom virtual machine image from Microsoft that comes pre-installed with popular data science tools for modeling and development activities.

In our FIRST POST, we showed how to set up a deep learning environment in one of the new DSVMs with NVIDIA TESLA K80 GPUS, installing CUDA drivers, Microsoft R Server and MXNet. In the SECOND POST, we presented a pipeline for massively parallel scoring of 2.3 million images forming a collage of the Mona Lisa using an HDINSIGHT APACHE SPARK CLUSTER. Finally, in the THIRD POST we illustrated how to train a network on multiple GPUs to classify objects among 1000 classes, using the IMAGENET dataset and RESNET architecture.

In this sequel of the deep learning series, we will demonstrate how to use Convolutional Neural Networks (CNNs) in a text classification problem. We will explain how to generate an end-to-end pipeline, train a CNN for text classification and prepare the model for production so it can be queried by a user to classify sentences via a web service.

Deep Learning for Text Classification on Azure

The development of Recurrent Neural Networks (RNNs) has led to significant advances in deep learning for NLP. These networks, especially the subclass of Long Short Term Memory Networks (LSTMs), have achieved promising results in tasks related to temporal series, for instance, in SPEECH RECOGNITION, TEXT UNDERSTANDING and TEXT CLASSIFICATION, usually treating the text as groups of words.

The area of text classification has been developed mostly with machine learning models that use features at the word level. The use of word features such as BAG OF WORDS, N-GRAMS or WORD EMBEDDINGS has been shown to be very successful. Some examples of text classification methods are BAG OF WORDS WITH TFIDF, K-MEANS on WORD2VEC, CNNS WITH WORD EMBEDDING, LSTM or BAG OF N-GRAMS WITH A LINEAR CLASSIFIER.

In parallel, there have been important advances in image recognition using different types of CNNs. The ResNet architecture, introduced by Microsoft Research, is an example of this and was the first to surpass HUMAN PERFORMANCE in image classification. The reason for this extraordinary success is that CNNs learn hierarchical representations in increasing levels of abstraction. This means they don’t just classify features but also automatically generate them in the first place.

Motivated in part by the success of CNNs in image recognition problems, where the inputs to the network are the pixels in images, a group of RESEARCHERS proposed using CNNs for text understanding, using the most atomic representation of a sentence: characters. Even though other researchers have used sub-word units as inputs to deep networks for INFORMATION RETRIEVAL and ANTI-SPAM FILTERING, the idea of using CNNs for text classification at character level first appeared in 2015 with the CREPE MODEL. The following year the technique was developed further in the VDCNN MODEL and the CHAR-CRNN model.

Text Classification with Convolutional Neural Networks at the Character Level

To achieve text classification with a CNN at the character level, each sentence needs to be transformed into an image-like matrix, where each encoded character is equivalent to a pixel in the image. This process is explained in detail in ZHANG ET AL., but here’s a quick summary.

Fig. 1: Scheme of character encoding. Each sentence is encoded as a 69×1014 matrix.
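
The encoding idea can be sketched in a few lines of Python. This is only a minimal illustration of turning a sentence into a 69×1014 matrix, not the authors' implementation: the 69-symbol alphabet used here (lowercase letters, digits, ASCII punctuation and the space character) is an assumption chosen to match the row count, and the choices to lowercase the text, truncate it at 1014 characters and leave unknown characters as all-zero columns are likewise assumptions.

```python
import string

# Illustrative 69-symbol alphabet (assumption): lowercase letters, digits,
# ASCII punctuation and the space character. The paper's alphabet may differ.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " "
MAX_LENGTH = 1014  # number of character positions (columns) per sentence


def encode_sentence(text, alphabet=ALPHABET, max_length=MAX_LENGTH):
    """Return a len(alphabet) x max_length one-hot matrix as nested lists."""
    index = {char: row for row, char in enumerate(alphabet)}
    matrix = [[0] * max_length for _ in alphabet]
    for column, char in enumerate(text.lower()[:max_length]):
        row = index.get(char)
        if row is not None:  # characters outside the alphabet stay all-zero
            matrix[row][column] = 1
    return matrix
```

Each column holds at most a single 1, in the row of the character at that position, so short sentences are implicitly padded with zero columns up to the fixed width.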


Quick-Start Guide To The Data Science Bowl Lung Cancer Detection Challenge, Using Deep Learning, Microsoft Cognitive Toolkit And Azure GPU VMs

Source: https://blogs.technet.microsoft.com/machinelearning/2017/02/17/quick-start-guide-to-the-data-science-bowl-lung-cancer-detection-challenge-using-deep-learning-microsoft-cognitive-toolkit-and-azure-gpu-vms/


This post is by Miguel Fierro, Data Scientist, Ye Xing, Senior Data Scientist, and Tao Wu, Principal Data Scientist Manager, all at Microsoft.

Since its launch in mid-January, THE DATA SCIENCE BOWL LUNG CANCER DETECTION COMPETITION has attracted more than 1,000 submissions. To be successful in this competition, data scientists need to be able to get started quickly and make rapid iterative changes. In this post, we show how to compute features of the scanned images in the competition with a pre-trained Convolutional Neural Network (CNN), and use these features to classify the scans into cancerous or not cancerous, using a boosted tree, all in one hour. With a score of 0.55979, you would be ranked in the top 10% as of January 19th on the leaderboard, or in the top 20% as of February 7th.

To achieve this, we used the following:

  1. A pre-trained CNN as the image featurizer. This 152-layer ResNet model is implemented on the Microsoft Cognitive Toolkit deep learning framework (formerly called CNTK) and trained using the ImageNet dataset.
  2. LightGBM gradient boosting framework as the image classifier.
  3. Azure Virtual Machines (VMs) with GPU acceleration.

For the impatient, we have shared our code in this JUPYTER NOTEBOOK. The computation of the Cognitive Toolkit process takes 53 minutes (29 minutes, if a simpler, 18-layer ResNet model is used), and the computation of the LightGBM process takes 6 minutes at a learning rate of 0.001. A simple version of the code was also published on KAGGLE.


According to the American Lung Association, LUNG CANCER IS THE LEADING CANCER IN MORTALITY, in both men and women in the US, with a low rate of early diagnosis. The DATA SCIENCE BOWL COMPETITION on Kaggle aims to help with early lung cancer detection. Participants use machine learning to determine whether CT SCANS of the lung have cancerous lesions or not. A 3D representation of such a scan is shown in Fig. 1.

Fig. 1: 3D volume rendering of a sample lung using competition data. It was computed using the script from this blog post.

Training speed is one of the most important factors for success at competitions like these. In this respect, both Cognitive Toolkit and LightGBM are excellent in a range of tasks (SHI ET AL., 2016; LIGHTGBM PERFORMANCE SUMMARY). These two solutions, combined with Azure’s high-performance GPU VM, provide a powerful on-demand environment to compete in the Data Science Bowl.

To get started on the GPU VM, you need to install these frameworks:

  • CUDA: CUDA 8.0 can be downloaded from the NVIDIA website (registration is required). If you are using Linux, you also need to download CUDA Patch 1 from the website. The patch adds support for gcc 5.4 as one of the host compilers.
  • cuDNN: cuDNN 5.1 (registration with NVIDIA required).
  • MKL: Intel’s Math Kernel Library (MKL) version 11.3 update 3 (registration with Intel required).
  • Anaconda: Anaconda 4.2.0 provides support for conda environments and Jupyter notebooks.
  • OpenCV: Download and install from the official OpenCV website. This can also be installed via conda with this command:

    conda install -c conda-forge opencv

  • Scikit-learn: Scikit-learn 0.18 is easily installed via pip:

    pip install scikit-learn

  • Cognitive Toolkit: Cognitive Toolkit 2.0 beta9 for Python. You can build from source but it’s faster to install the precompiled binaries.
  • LightGBM: LightGBM is easily installed with CMake. You will also need to install the Python bindings.
  • Data management libraries: You also need to install the pydicom and glob2 libraries, using pip:

    pip install pydicom glob2

In addition to these libraries and the pre-trained network (downloadable HERE), it’s necessary to download the competition DATA. The images are in DICOM format and consist of a group of slices of the thorax of each patient (see Fig. 2).

Fig. 2: Axial slices of the thorax of a patient with cancer (left) and a patient without cancer (right).

Cancer Image Detection with Cognitive Toolkit and LightGBM


Connecting your own Hadoop or Spark to Azure Data Lake Store


A frequent question we get is “How do I connect my Hadoop or Spark cluster to Azure Data Lake Store?” It turns out that this is really easy to do. Here is a step-by-step article that will help you get this configured. Enjoy!

Connecting your own Hadoop or Spark to Azure Data Lake Store

Load Data From Azure Data Lake Into Azure SQL Data Warehouse At 3TB/Hour

Source: https://blogs.technet.microsoft.com/machinelearning/2017/02/09/load-data-from-azure-data-lake-into-azure-sql-data-warehouse-at-3tbhour/


Re-posted from the Azure blog.

AZURE SQL DATA WAREHOUSE (Azure SQL DW, or just SQL DW for short) is a SQL-based fully managed, petabyte-scale data warehousing solution in the cloud. It is highly elastic, enabling you to provision in minutes and scale capacity in seconds. You can scale compute and storage independently, allowing you to burst compute for complex analytical workloads or scale down your warehouse for archival scenarios. What’s more, you can pay by usage, rather than being locked into expensive predefined cluster configurations.

AZURE DATA LAKE (ADL) is a no-limits data lake optimized for massively parallel processing, and it lets you store and analyze petabyte-size files and trillions of objects.

A common use case involving ADL Store (ADLS) and SQL DW is the following: Raw data is ingested into ADLS from a variety of sources. ADL Analytics (ADLA) is used to clean and process the data into a loading-ready format. From there, high value data is imported into Azure SQL DW for interactive analytics.

Until recently, the data in ADLS would be loaded into SQL DW using row-by-row insertion which, obviously, consumed time and meant delays in how quickly data could be explored to gain useful business insights.

However, as we RECENTLY ANNOUNCED, with SQL DW PolyBase support for ADLS, you can now load data directly from ADLS into your SQL DW instance using External Tables at nearly 3TB per hour. Because SQL DW can now ingest data directly from Azure Storage Blob and ADLS, you can load data from any Azure storage service, giving you the flexibility to choose what’s right for your application. The picture below captures the “Before” and “After” situation.

Intrigued? Read THIS POST to learn more, including how to connect ADLS to SQL DW, and best practices for loading data. Learn more about the new PolyBase capability HERE. You can also check out a short VIDEO CLIP on how to use this new feature:

If you already have an Azure Data Lake Store, you can try LOADING YOUR DATA INTO SQL DATA WAREHOUSE. For those of you still exploring Azure Data Lake, check out these nice ADLS TUTORIALS which will get you up and running.

CIML Blog Team

Distributed Deep Learning on HDInsight with Caffe on Spark

Source: https://blogs.msdn.microsoft.com/azuredatalake/2017/02/02/distributed-deep-learning-on-hdinsight-with-caffe-on-spark/


Deep learning is impacting everything from healthcare to transportation to manufacturing, and more. Companies are turning to deep learning to solve hard problems, like image classification, speech recognition, object recognition, and machine translation.

There are many popular frameworks, including Microsoft Cognitive Toolkit, TensorFlow, MXNet, Theano, etc. Caffe is one of the most famous non-symbolic (imperative) neural network frameworks, and is widely used in many areas including computer vision. Furthermore, CaffeOnSpark combines Caffe with Apache Spark, so that deep learning can be easily used on an existing Hadoop cluster together with Spark ETL pipelines, reducing system complexity and latency for end-to-end learning.

HDInsight is the only fully-managed cloud Hadoop offering that provides optimized open source analytic clusters for Spark, Hive, MapReduce, HBase, Storm, Kafka, and R Server backed by a 99.9% SLA. Each of these big data technologies and ISV applications is easily deployable as a managed cluster with enterprise-level security and monitoring.

Some users are asking us about how to use deep learning on HDInsight, which is Microsoft’s PaaS Hadoop product. We will have more to share in the future, but today we want to summarize a technical blog on how to use Caffe on HDInsight Spark.

If you have installed Caffe before, you will have noticed that installing this framework is a little bit challenging. In this blog, we will first illustrate how to install Caffe on Spark for an HDInsight cluster, then use the built-in MNIST demo to demonstrate how to run distributed deep learning with HDInsight Spark on CPUs.

There are four major steps to get it working on HDInsight:

  1. Install the required dependencies on all the nodes
  2. Build Caffe on Spark for HDInsight on the head node
  3. Distribute the required libraries to all the worker nodes
  4. Compose a Caffe model and run it in a distributed manner

Since HDInsight is a PaaS solution, it offers great platform features – so it is quite easy to perform some tasks. One of the features that we heavily use in this blog post is called Script Action, with which you can execute shell commands to customize cluster nodes (head node, worker node, or edge node).

Step 1: Install the required dependencies on all the nodes

To get started, we need to install the dependencies. The Caffe site and the CaffeOnSpark site offer useful wiki pages on installing the dependencies for Spark on YARN mode (which is the mode for HDInsight Spark), but we need to add a few more dependencies for the HDInsight platform. We will use the script action below and run it on all the head nodes and worker nodes. This script action takes about 20 minutes, as those dependencies also depend on other packages. I put the script in my GitHub location so it is accessible by the cluster.

# Please be aware that installing the packages below adds about 20 minutes to cluster creation because of the dependencies.
# Install all dependencies, including the ones mentioned in http://caffe.berkeleyvision.org/install_apt.html, as well as a few packages that are not included in HDInsight, such as gflags, glog, lmdb, numpy.
# It seems numpy is only needed at compile time, but to be safe we install it on all the nodes.

sudo apt-get install -y libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libhdf5-serial-dev protobuf-compiler maven libatlas-base-dev libgflags-dev libgoogle-glog-dev liblmdb-dev build-essential  libboost-all-dev python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose

#install protobuf
wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
sudo tar xzvf protobuf-2.5.0.tar.gz -C /tmp/
cd /tmp/protobuf-2.5.0/
sudo ./configure
sudo make
sudo make check
sudo make install
sudo ldconfig
echo "protobuf installation done"

There are two steps in the script action above. The first step is to install all the required libraries. These include the libraries needed both for compiling Caffe (such as gflags and glog) and for running Caffe (such as numpy). We are using libatlas for CPU optimization, but you can always follow the CaffeOnSpark wiki on installing other optimization libraries, such as MKL or CUDA (for GPU).

The second step is to download, compile, and install protobuf 2.5.0, which Caffe needs at runtime. Protobuf 2.5.0 is required; however, this version is not available as a package on Ubuntu 16, so we need to compile it from the source code. There are also a few resources on the Internet on how to compile it, such as this
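
One way to confirm that the script action left the expected protobuf version on a node is to ask the compiler for its version. The sketch below is only a hedged illustration: the helper names are hypothetical, and it assumes that protoc is on the PATH and prints a line of the form "libprotoc 2.5.0" when run with --version.

```python
import re
import subprocess

REQUIRED_VERSION = (2, 5, 0)  # the protobuf version this post builds from source


def parse_protoc_version(output):
    """Extract (major, minor, patch) from text such as 'libprotoc 2.5.0'."""
    match = re.search(r"libprotoc\s+(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError("unrecognized protoc output: %r" % output)
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def installed_protoc_version():
    """Run 'protoc --version' on this node and parse what it prints."""
    result = subprocess.run(
        ["protoc", "--version"], capture_output=True, text=True, check=True
    )
    return parse_protoc_version(result.stdout)
```

On a node, comparing installed_protoc_version() with REQUIRED_VERSION gives a quick yes/no answer before moving on to the Caffe build.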

To get started quickly, you can run this script action against all the worker nodes and head nodes of your cluster (for HDInsight 3.5). You can run the script action either on a running cluster or during cluster provisioning. For more details on script actions, please see the documentation here

Script Actions to Install Dependencies

Step 2: Build Caffe on Spark for HDInsight on the head node


Home Appliances, Vending Machines – Even Cruise Ships – Get an Infusion of Cortana Intelligence & Machine Learning

Source: https://blogs.technet.microsoft.com/machinelearning/2017/02/01/home-appliances-vending-machines-even-cruise-ships-get-an-infusion-of-cortana-intelligence/


A quick overview of recent customer case studies involving the application of Microsoft’s AI, Big Data & Machine Learning offerings.

Carnival Maritime Predicts Water Consumption on Cruise Ships

The Costa Group’s fleet of 26 cruise ships sails all over the world. The industrial equipment on their ships has thousands of sensors that collect data in real time. As part of their digital transformation, the company’s marine service unit, Carnival Maritime, wanted to explore how it might take advantage of this data to find opportunities for operational improvement. One of the areas they looked at was water consumption onboard their ships. This is a complex problem, as consumption patterns can vary widely. Passengers of different nationalities shower for different durations at different temperatures and times of the day, for instance, and there are numerous other variables that make such consumption challenging to predict. Accurately predicting water consumption helps ship captains avoid spending fuel on producing excess water at sea. This also mitigates their need to carry all that excess water along the way, which further shaves costs.

Carnival needed a mechanism to predict the right amount of water to produce at the right time, without having to store any excess. Carnival’s partner, Arundo Analytics, a global provider of analytical and predictive solutions, helped them build a microservice on their proprietary big-data platform, and trained a model to help them do just that. Using the machine learning models, APIs, and templates in the MICROSOFT CORTANA INTELLIGENCE SUITE, Arundo analyzed historical data sets along with data such as the speed and position of the ships, age and nationality of passengers, historical weather data and more, to better understand the exact drivers of water consumption. Their platform runs on Azure and is able to easily connect to and derive value from a variety of data, from both Carnival’s cloud and on-premises databases.

Carnival is now able to better predict how much water a ship will need for a specific route with a particular set of guests. They estimate that their optimizations can help each ship save over $200,000 a year. The solution also contributes to the company’s goal of reducing carbon emissions. Carnival is next looking to implement a predictive maintenance solution for its fleet, using Cortana Intelligence to study the data that’s already being collected from thousands of on-board sensors on each ship. You can LEARN MORE ABOUT THE CARNIVAL MARITIME STORY HERE.

Arçelik A.Ş. Increases Forecasting Accuracy on Spare Parts


Add Intelligence To Any SQL App, With The Power Of Deep Learning

Source: https://blogs.technet.microsoft.com/machinelearning/2017/01/06/add-intelligence-to-any-sql-app-with-the-power-of-deep-learning/


Re-posted from the SQL Server blog.

Recent results and applications involving Deep Learning have proven to be incredibly promising, and across a diverse set of areas too, including speech recognition, language understanding, computer vision and more. Deep Learning is changing customer expectations and experiences around a variety of products and mobile apps, whether we’re aware of it or not. That’s definitely true of Microsoft apps you’re likely to be using every day, such as Skype, Office 365, Cortana or Bing. As we’ve mentioned before, our Deep Learning based language translation in Skype was recently named one of the 7 GREATEST SOFTWARE INNOVATIONS OF 2016 BY POPULAR SCIENCE, a true technological milestone, with machines now sitting at or above human parity, when it comes to recognizing conversational speech.

As a result of these developments, it’s only a matter of time before intelligence powered by Deep Learning becomes an expectation of any app.

In a new blog post, Rimma Nehme addresses the question of how easy it might be for your typical SQL Server developer to integrate Deep Learning into their app. This question is especially timely in light of the recent enhancement to SQL Server 2016 through the integration of R Services, with powerful ML functions, including deep neural networks (DNNs), as a core part of it.

Can we help you turn any SQL app into a truly ‘intelligent’ app, and ideally with just a few lines of code?

To find out, READ THE ORIGINAL BLOG POST HERE – the answer may surprise you.

Deep Learning – Microsoft Cognitive Toolkit (CNTK) Deep Dive and Hands-on Tutorial – Nov 2016

The Microsoft Cognitive Toolkit—previously known as CNTK—empowers you to harness the intelligence within massive datasets through deep learning by providing uncompromised scaling, speed and accuracy with commercial-grade quality and compatibility with the programming languages and algorithms data scientists already use.