Saving Snow Leopards with Deep Learning and Computer Vision on Spark

This post is authored by Mark Hamilton, Software Engineer at Microsoft; Rhetick Sengupta, President of the Snow Leopard Trust and Principal Program Manager at Microsoft; and Roope Astala, Senior Program Manager at Microsoft.

The Snow Leopard – A Highly Endangered Animal

Snow leopards are highly endangered animals that inhabit the high-altitude steppes and mountainous terrain of Central Asia. There are only an estimated 3,900 to 6,500 individuals left in the wild. Due to the cats’ remote habitat, expansive range and extremely elusive nature, they have proven quite hard to study, so very little is known about their ecology, range, survival rates and movement patterns. To truly understand the snow leopard and influence its survival rates more directly, much more data is needed. Biologists have set up motion-sensitive camera traps in snow leopard territory to gain a better understanding of these animals. Over the years, these cameras have produced over 1 million images, which are used to understand the leopard population, its range and other behaviors. This information, in turn, can be used to establish new protected areas as well as improve the many community-based conservation efforts administered by the Snow Leopard Trust.

However, the problem with camera trap data is that biologists must sort through all the images to identify those with snow leopards or their prey, as opposed to those with neither. Doing this sort of classification manually is very time-consuming, taking around 300 hours per camera survey. To solve this problem, the Snow Leopard Trust partnered with Microsoft. Working with the Azure Machine Learning team, the Trust built an image classification model that uses deep neural networks at scale on Spark.

This blog post discusses the successes and learnings from our collaboration.

Image Analysis Using Microsoft Machine Learning for Apache Spark

Convolutional neural networks (CNNs) are today’s state-of-the-art statistical models for image analysis. They are used in everything from driverless cars and facial recognition systems to image search engines. To build our model, we used Microsoft Machine Learning for Apache Spark (MMLSpark), which provides easy-to-use APIs for many different kinds of deep learning models, including CNNs from the Microsoft Cognitive Toolkit (CNTK). Furthermore, MMLSpark provides capabilities for scalable image loading and preprocessing, so end-to-end image analysis workflows can be built as SparkML pipelines.

Our model uses two key machine learning techniques: transfer learning, and ensembling combined with dataset augmentation.

Transfer Learning: Instead of training a CNN model from scratch, we used a technique called transfer learning. We took a network, namely ResNet, that had been pre-trained on generic images and had learned to hierarchically recognize different levels of structure in an image. We used the output of the network’s final layers as high-order features. These features were then used to train a “traditional” SparkML logistic regression classifier to specifically detect snow leopards. Note that DNN featurization is an “embarrassingly parallel” task that scales with the number of Spark executors, and can therefore easily be applied to very large image datasets. You can find a generic example of transfer learning using MMLSpark in our GitHub repository.
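To make the classifier stage concrete, here is a minimal, self-contained sketch of the idea in plain Python. The `train_logistic` and `predict` helpers are hypothetical stand-ins: in the real pipeline the feature vectors come from ResNet's final layers and the training is done by SparkML's logistic regression, not by this toy SGD loop.

```python
import math

def train_logistic(features, labels, lr=0.5, epochs=300):
    """Toy SGD trainer for logistic regression. In the real pipeline, each
    row of `features` would be the DNN activations extracted for one image."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - y                        # gradient of log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """1 = snow leopard, 0 = no leopard."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0
```

Because the expensive DNN featurization is computed once per image and the classifier on top is tiny, retraining the model for a new survey is cheap.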

Ensembling and Dataset Augmentation: The second key part of our model involves “ensembling” or averaging the predictions of our model across multiple images. Often, when a snow leopard roams near a camera, she will trigger the camera to fire several times creating a sequence of images. This provides us with several different shots of the same leopard. If the network can classify most of these shots correctly, we can use the correct majority to override the incorrect predictions in the sequence, dramatically reducing the error on hard-to-classify shots. We also augmented the dataset by flipping each image horizontally to double our training data, which helped the network learn a more robust snow leopard identifier. We discovered that both additions significantly improved accuracy.
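The sequence-averaging idea can be sketched in a few lines. The function below is an illustrative stand-in for the aggregation we ran on Spark, taking the per-image leopard probabilities of one camera-trap sequence:

```python
def ensemble_sequence(probabilities, threshold=0.5):
    """Average per-image scores over one burst of camera-trap shots and
    apply the pooled decision to every image in the sequence, letting a
    confident majority override hard-to-classify shots."""
    pooled = sum(probabilities) / len(probabilities)
    return [pooled >= threshold] * len(probabilities)
```

For example, a sequence scored [0.9, 0.8, 0.4] pools to 0.7, so the third, ambiguous shot is labeled a leopard along with the rest.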


We compared several different models:

  1. L120: No DNN featurization, plain logistic regression against the image scaled down to 120×120 pixels.
  2. RN1: Using the last layer of the ResNet50 neural network as featurizer, and logistic regression as classifier.
  3. RN2: Using the second-to-last layer of the ResNet50 neural network as featurizer, and logistic regression as classifier.
  4. RN2+A: Using the RN2 setup with a dataset augmented with horizontally flipped images.
  5. RN2+A+E: Using RN2+A but grouping the images into sequences of camera shots and averaging the predictions over these groups.

The results for each model are shown below, from left to right.

The most basic model, without transfer learning or ensembling, reaches 63.4% accuracy on the test dataset, only slightly better than random guessing. The first big improvement comes from transfer learning, which brings the model up to 83% accuracy. Cutting more layers off the net and augmenting the dataset helps a bit more, bringing the model to 89.5% accuracy. Finally, averaging over sequences of images brings us to 90% and dramatically improves our ROC curve, almost entirely eliminating false positives, as seen in the image below.

To summarize, using the knowledge contained in a pre-trained deep learning model allowed us to effectively automate snow leopard classification. This will greatly help the Trust’s conservation efforts by saving hundreds of hours of manual work and allowing the biologists to concentrate on more impactful scientific and conservation efforts. By using image processing tools and pre-trained DNN models from MMLSpark, we could easily build the image classification workflow and process large volumes of images effectively. As the workflow consists of SparkML pipelines, it is straightforward to deploy to production for use in future surveys. We have also laid the groundwork for future image analyses, as this workflow can easily be adapted to other image classification tasks.

Microsoft has been working for some time to bring the power of AI and the cloud to organizations and individuals who are working to improve, protect and preserve our planet and the species that live upon it. Projects like this one help advance our understanding of how these tools can be used across a wide range of ecological and environmental applications, and we’re looking forward to sharing more information about this work in coming months.

You can read more about Snow Leopard Trust here, and about the Microsoft Machine Learning Library for Apache Spark here.

Mark, Rhetick & Roope

Apache®, Apache Spark, and Spark® are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.


Monitoring Petabyte-Scale AI Data Lakes In Azure


This post is authored by Reed Umbrasas, Software Engineer at Microsoft.

Azure Data Lake Analytics lets customers run massively parallel jobs over petabytes of raw data stored in Azure Data Lake Store. During the recent Microsoft Data Amp event, we demonstrated how this massive processing power can be used to build a petabyte scale AI data lake which turns 2PB of raw text data into actionable business insights. As part of this demo, we showed an operations dashboard to visualize the solution architecture and display key metrics for each Azure service used in our architecture. The goal of this exercise was two-fold – to visualize the solution architecture and to demonstrate the ease of integrating solutions running in Azure with custom monitoring software. Integrating cloud solutions with in-house monitoring infrastructure is particularly important for many organizations that are adopting Microsoft Cortana Intelligence solutions.

In this blog post we will show how we used Azure Resource Manager APIs and Azure .NET SDKs to fetch and display service metrics in real time. Our examples will focus on Azure SQL DB, Azure SQL Data Warehouse (SQL DW), Azure Data Lake (ADL) and Azure Analysis Services. However, Azure Resource Manager APIs expose metric data for each provisioned Azure resource, so a similar approach could be extended to other architectures as well.

Solution Architecture

The illustration below shows the data flows between the different services in Azure.

We have made the source code available on GitHub. We’ll next discuss how to fetch each of the metrics displayed here.

Authenticating with Azure Active Directory

To authenticate our web app requests to Azure APIs, we will set up Service-to-Service authentication using Azure Active Directory (AAD).

First, we’ll go to the Active Directory tab in the Azure Portal and register a new application with a client secret. For detailed instructions on how to do this, please refer to the Azure documentation.

Before calling any Azure API, we will need to fetch an access token and pass it in the “Authorization” header of the request. The Azure documentation provides an overview of how service-to-service authentication works with AAD, but for the purposes of this blog post, you can refer to the code on GitHub.
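As a rough sketch (not the exact code in the repo), the client-credentials token request and the header look like this in Python. The `fetch_token` helper and its default `resource` value are illustrative assumptions:

```python
import json
import urllib.parse
import urllib.request

AAD_TOKEN_URL = "https://login.microsoftonline.com/{tenant}/oauth2/token"

def fetch_token(tenant_id, client_id, client_secret,
                resource="https://management.azure.com/"):
    """Fetch a service-to-service access token from AAD using the
    OAuth 2.0 client-credentials grant."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "resource": resource,
    }).encode()
    with urllib.request.urlopen(AAD_TOKEN_URL.format(tenant=tenant_id), body) as resp:
        return json.load(resp)["access_token"]

def auth_header(access_token):
    """Every ARM call carries the token in the Authorization header."""
    return {"Authorization": "Bearer " + access_token}
```

The token is cached and reused until it expires, so `fetch_token` does not need to run before every single API call.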

Azure Data Lake Analytics

To get the number of active jobs running in an ADL Analytics account, we will use the Azure Data Lake Analytics .NET SDK. The SDK provides an API to list the jobs in each account. We simply list the running jobs and count the result – you can see a code sample on GitHub.

Azure Data Lake Store

To get the size of the ADL Store account, we will call the following REST API.


We will need to add the “Authorization” header to the request and populate it with a token fetched from AAD, as outlined above. The API call will return the size of the ADL Store account during the specified period. We display the latest value in the dashboard.
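A hedged Python equivalent of this pattern might look as follows. `latest_value` assumes the metric response carries a list of {"timestamp", "value"} points; that shape is an assumption for illustration, not the documented contract:

```python
import json
import urllib.request

def get_json(url, access_token):
    """GET an ARM endpoint with the AAD token in the Authorization header."""
    req = urllib.request.Request(
        url, headers={"Authorization": "Bearer " + access_token})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def latest_value(points):
    """Pick the most recent data point from a list of
    {"timestamp": ..., "value": ...} dicts (assumed response shape)."""
    return max(points, key=lambda p: p["timestamp"])["value"]
```

ISO-8601 timestamps sort correctly as strings, which is why `max` over the raw timestamp works here.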

Again, you can refer to the GitHub code sample.

Azure SQL Database and Azure SQL Data Warehouse

To get the number of active queries in Azure SQL DB and SQL DW, we must first establish a connection using ADO.NET. You can find the right connection string to use in the Azure portal.

Next, we query the “sys.dm_exec_requests” table in SQL DB and the “sys.dm_pdw_nodes_exec_requests” table in SQL DW to get a count of active queries.

SELECT count(*) FROM {table name} WHERE status = 'Running';

There is a wealth of information that can be gathered by querying those tables for building more complex dashboards. For more details, please refer to the SQL Server documentation.
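The same query can be issued from any DB-API client (for example pyodbc with the connection string from the portal); the helper below is an illustrative sketch. Note that the table name is interpolated directly into the SQL, so it must be one of the trusted system-view names above, never user input:

```python
ACTIVE_QUERY_SQL = "SELECT count(*) FROM {table} WHERE status = 'Running';"

def active_query_count(cursor, table):
    """Run the count against any DB-API cursor and return the single value.
    `table` is 'sys.dm_exec_requests' for SQL DB or
    'sys.dm_pdw_nodes_exec_requests' for SQL DW."""
    cursor.execute(ACTIVE_QUERY_SQL.format(table=table))
    return cursor.fetchone()[0]
```

Taking the cursor as a parameter keeps the helper independent of the driver, so the same function works against SQL DB and SQL DW connections.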

Azure Analysis Services

To get the number of active user sessions in Azure Analysis Services, we will call the following REST API:

{aas_server}/providers/Microsoft.Insights/metrics?api-version=2016-06-01&$filter=(name.value eq 'CurrentUserSessions') and (aggregationType eq 'Average') and startTime eq {start_time} and endTime eq {end_time} and timeGrain eq duration'PT1M'

This API call will return the average user session count over the specified period at a 1-minute granularity. We simply take the latest value and display it on the website. Please refer to the GitHub code sample.

Retrieving Metric Definitions from Azure Resource Manager

At this point you might be wondering where to find the metric definitions for a given Azure resource – i.e., if we provisioned an Azure Analysis Services account, how do we know what metrics Azure provides for it? The REST API below will return a list of all metrics that are provided for the given resource:

{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.AnalysisServices/servers/{aasServerName}/providers/microsoft.insights/metricDefinitions?api-version=2016-03-01

Based on the list returned, you can choose which metrics you are interested in. For more information on this topic, please refer to the Monitoring REST API walkthrough article in the Azure documentation.
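For illustration, the metricDefinitions URL can be assembled like this, assuming the standard https://management.azure.com ARM endpoint as the prefix:

```python
ARM = "https://management.azure.com"

def metric_definitions_url(subscription_id, resource_group, server_name):
    """Build the metricDefinitions URL shown above for an
    Analysis Services server."""
    return (f"{ARM}/subscriptions/{subscription_id}"
            f"/resourceGroups/{resource_group}"
            f"/providers/Microsoft.AnalysisServices/servers/{server_name}"
            f"/providers/microsoft.insights/metricDefinitions"
            f"?api-version=2016-03-01")
```

Swapping the `Microsoft.AnalysisServices/servers` segment for another resource provider and type gives you the metric definitions for any other provisioned resource.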


In this blog post we demonstrated how custom monitoring solutions can retrieve Azure resource operational metrics. Once retrieved, this data can be displayed, persisted, sent to a third-party monitoring infrastructure, etc. This allows Azure customers to fully integrate their Cortana Intelligence solutions with their existing monitoring infrastructure.

Optimizing Intelligent Apps Using Machine Learning In SQL Server


Re-posted from the Azure blog.

SQL Server 2016 introduced a new feature called R Services, bringing machine learning capabilities to the platform, and the next version of SQL Server will further extend this with support for Python. These new capabilities – running R and Python in-database and at scale – let developers keep their analytics services close to the data, eliminating expensive data movement. They also simplify the process of developing and deploying a new breed of intelligent apps.

There are several optimization tips and tricks available to developers, to help you get the most mileage out of SQL Server in these scenarios, including fine-tuning the model and boosting performance. In a new blog post, we apply some of these optimization techniques to a resume-matching scenario and demonstrate how such techniques can make your data analytics much more efficient and powerful. The optimization techniques covered are:

  • Full durable memory-optimized tables.
  • CPU affinity and memory allocation.
  • Resource governance and concurrent execution.

We ran benchmark tests comparing scoring time with and without these optimizations, scoring 1.1 million rows of data using prediction models trained separately with the RevoScaleR and MicrosoftML packages. Tests were run on the same Azure SQL Server VM using the same SQL query and R code. As the charts below show, these optimizations can significantly boost performance.

Click here to read the original post and access the full tutorial, complete with sample code and detailed step-by-step walkthroughs.

CIML Blog Team

Serving AI With Data: A Summary Of Build 2017 Data Innovations


This post was authored by Joseph Sirosh, Corporate Vice President, Microsoft Data Group

This week at the annual Microsoft Build conference, we are discussing how, more than ever, organizations are relying on developers to create breakthrough experiences. With big data, cloud and AI converging, innovation and disruption are accelerating to a pace never seen before. Data is the key strategic asset at the heart of this convergence. When combined with the limitless computing power of the cloud and new capabilities like machine learning and AI, it enables developers to build the next generation of intelligent applications. As a developer, you are looking for faster, easier ways to embrace these converging technologies and transform your app experiences.

Today at Build, we made several product announcements, adding to the recent momentum announced last month at Microsoft Data Amp, that will help empower every organization on the planet with data-driven intelligence. Across these innovations, we are pursuing three key themes:

  1. Infusing AI within our data platform
  2. Turnkey global distribution to push intelligence wherever your users are
  3. Choice of database platforms and tools for developers

Infusing AI within our data platform

A thread of innovation you will see in our products is the deep integration of AI with data. In the past, a common application pattern was to create machine learning models outside the database, in the application layer or in specialty statistical tools, and deploy these models in custom-built production systems. This results in a lot of developer heavy lifting, and the development and deployment lifecycle can take months. Our approach dramatically simplifies the deployment of AI by bringing intelligence into existing well-engineered data platforms through a new extensibility model for databases.

SQL Server 2017

We started this journey by introducing R support within the SQL Server 2016 release and we are deepening this commitment with the upcoming release of SQL Server 2017. In this release, we have introduced support for a rich library of machine learning functions and introduced Python support to give you more choices across popular languages. SQL Server can also leverage GPU-accelerated computing through the Python/R interface to power even the most intensive deep learning jobs on images, text and other unstructured data. Developers can implement GPU-accelerated analytics and very sophisticated AI directly in the database server as stored procedures and gain orders of magnitude higher throughput.

Additionally, as data becomes more complex and the relationships across data are many-to-many, developers are looking for easier ways to ingest and manage this data. With SQL Server 2017, we have introduced Graph support to deliver the best of both relational and graph databases in a single product, including the ability to query across all data using a single platform.

We have made it easy for you to try SQL Server with R, Python, and Graph support today, whether you are working with C#, Java, Node, PHP, or Ruby.

Azure SQL Database

We’re continuing to simultaneously ship SQL Server 2017 enhancements to Azure SQL Database, so you get a consistent programming surface area across on-premises and cloud. Today, I am excited to announce that support for Graph is also coming to Azure SQL Database, so you can get the best of both relational and graph in a single proven service on Azure.

SQL Database is built for developer productivity with most database management tasks built-in. We have also built AI directly into the service itself, making it an intelligent database service. The service runs millions of customer databases, learns, and then adapts to offer customized experiences for each database. With Database Advisor, you can choose to let the service learn your unique patterns and make performance and tuning recommendations or automatically take action on your behalf. Today, I am also excited to announce general availability of Threat Detection, which uses machine learning around the clock to learn, profile and detect anomalous activity over your unique database and sends alerts in minutes so you can take immediate action versus what historically can take an organization days, months, or years to discover.

Also, we are making it even easier for you to move more of your existing SQL Server apps as-is to Azure SQL Database. Today we announced the private preview for a new deployment option within the service, Managed Instance – you get all the managed benefits of SQL Database, now at the instance level, which offers support for SQL Agent, three-part names, DBMail, CDC and other instance-level capabilities.

To streamline this migration effort, we also introduced a preview for Azure Database Migration Service that will dramatically accelerate the migration of on-premises third-party and SQL Server databases into Azure SQL Database.

Eric Fleischman, Vice President & Chief Architect from DocuSign notes, “Our transaction volume doubles every year. We wanted the best of what we do in our datacenter…with the best of what Azure could bring to it. For us, we found that Azure SQL Database was the best way to do it. We deploy our SQL Server schema elements into a Managed Instance, and we point the application via connection string change directly over to the Managed Instance. We basically picked up our existing build infrastructure and we’re able to deploy to Azure within a few seconds. It allows us to scale the business very quickly with minimal effort.”

Learn more about our investments in Azure SQL Database in this deeper blog and sign up for an invitation to these previews today.

Turnkey global distribution to push intelligence wherever your users are

With the intersection of mobile apps, internet of things, cloud and AI, users and data can come from anywhere around the globe. To deliver transformative intelligent apps that support the global nature of modern applications, and the volume, velocity, variety of data, you need more than a relational database, and more than a simple NoSQL database. You need a flexible database that can ingest massive volumes of data and data types, and navigate the challenges of space and time to ensure millisecond performance to any user anywhere on earth. And you want this with simplicity and support for the languages and technologies you know.

I’m also excited to share that today, Microsoft announced Azure Cosmos DB, the industry’s first globally-distributed, multi-model database service. Azure Cosmos DB was built from the ground up with global distribution and horizontal scale at its core – it offers turnkey global distribution across any number of Azure regions by transparently scaling and distributing your data wherever your users are, worldwide. Azure Cosmos DB leverages the work of Turing Award winner Dr. Leslie Lamport: the Paxos algorithm for distributed systems and TLA+, a high-level modeling language. Check out a new interview with Dr. Lamport on Azure Cosmos DB.

Azure Cosmos DB started as “Project Florence” in 2010, to address the pain points developers faced building large-scale applications inside Microsoft. Observing that the challenges of building globally distributed apps are not unique to Microsoft, in 2015 we made the first generation of this technology available to Azure developers in the form of Azure DocumentDB. Since that time, we’ve added new features and introduced significant new capabilities. Azure Cosmos DB is the result. It is the next big leap in globally distributed, at-scale cloud databases.

Now, with more innovation and value, Azure Cosmos DB delivers a schema-agnostic database service with turnkey global distribution, support for multiple models across popular NoSQL technologies, elastic scale of throughput and storage, five well-defined consistency models, and financially-backed SLAs across uptime, throughput, consistency, and millisecond latency.

“Domino’s Pizza chose Azure to rebuild their ordering system and a key component in this design is Azure Cosmos DB—delivering the capability to regionally distribute data, to scale easily, and support peak periods which are critical to the business. Their online solution is deployed across multiple regions around the world—even with the global scaling they can also rely on Azure Cosmos DB millisecond load latency and fail over to a completely different country if required.”

Learn more about Azure Cosmos DB in this deeper blog.

Choice of database platforms and tools for developers

We understand that SQL Server isn’t the only database technology developers want to build with. Therefore, I’m excited to share that today we also announced two new relational database services: Azure Database for MySQL and Azure Database for PostgreSQL, joining our database services offerings.

These new services are built on the proven database services platform that has been powering Azure SQL Database, and offer high availability, data protection and recovery, and scale with minimal downtime – all built in at no extra cost or configuration. Starting today, you can develop on MySQL and PostgreSQL database services on Azure. Microsoft manages the MySQL and PostgreSQL technology you know, love and expect, backed by an enterprise-grade, highly available and fault-tolerant cloud services platform that allows you to focus on developing great apps rather than on management and maintenance.

Each month, up to 2 million people turn to the GeekWire website for the latest news on tech innovation. Now, GeekWire is making news itself by migrating its popular WordPress site to the Microsoft Azure platform. Kevin Lisota, Web Developer at GeekWire, notes, “The biggest benefit of Azure Database for MySQL will be to have Microsoft manage and back up that resource for us so that we can focus on other aspects of the site. Plus, we will be able to scale up temporarily as traffic surges and then bring it back down when it is not needed. That’s a big deal for us.”

Learn more about these new services and try them today.

Azure Data Lake Tools for Visual Studio Code (VSCode)

Azure Data Lake includes all the capabilities required to make it easy for developers, data scientists, and analysts to store data of any size, shape, and speed, and do all types of processing and analytics across platforms and languages. Additionally, Azure Data Lake includes a set of cognitive capabilities built-in, making it seamless to execute AI over petabytes of data. On our journey to make it easier for every developer to become an AI and data science developer, we are investing in bringing more great tooling for data into the tools you know and love.

Today, I’m excited to announce general availability of Azure Data Lake Tools for Visual Studio Code (VSCode), which gives developers a light but powerful code editor for big data analytics. The new Azure Data Lake Tools for VSCode supports U-SQL language authoring, scripting, and extensibility with C# to process different types of data and efficiently scale any size of data. The new tooling integrates with Azure Data Lake Analytics for U-SQL job submissions, with job output to Azure Data Lake Analytics or Azure Blob Storage. In addition, a U-SQL local run service has been added to allow developers to locally validate scripts and test data. Learn more and download these tools today.

Getting started

It has never been easier to get started with the latest advances in the intelligent data platform. We invite you to watch our Microsoft Build 2017 online event for streaming and recorded coverage of these innovations, including SQL Server 2017 on Windows, Linux and Docker; scalable data transformation and intelligence from Azure Cosmos DB, Azure Data Lake Store and Azure Data Lake Analytics; the Azure SQL Database approach to proactive Threat Detection and intelligent database tuning; and the new Azure Database for MySQL and Azure Database for PostgreSQL. I look forward to a great week at Build and your participation in this exciting journey of infusing AI into every software application.

Using IoT Sensors To Up Your Game


This post is authored by Patty Ryan, Principal Data Scientist, Hang Zhang, Senior Data and Applied Scientist, and Mustafa Kasap, Senior Software Design Engineer, at Microsoft.

Learn from the Professionals through Comparison of Sensor Data

As any athlete aspiring to greatness can tell you, measurement of your own performance and tips from the pros are two keys to improvement. Thanks to affordable wearable sensors, it is now possible for you to measure your own performance and also benchmark it to that of the professionals.

Everyone knows that a professional practicing a sport looks visibly different from an amateur. In skiing, we’ve identified just nine sensor positions that can clearly differentiate professionals from amateurs. The information from these nine sensors allowed us to build a simple but powerful machine learning model that classifies professionals and non-professionals correctly 98% of the time.

Sensor Data Delivers an Activity Proficiency Signature

Here’s how it works: each sensor individually measures position, acceleration and rotation, all relative to x, y and z coordinates, and records this data along with a time stamp. While the sample rate of sensors varies, we recommend a minimum of 100 Hz. To illustrate the potential of these wearable sensors in measuring sports performance, we worked with the Professional Ski Instructors of America and the American Association of Snowboard Instructors (PSIA-AASI). We measured the organization’s professional-level skiers and compared these measures to those of intermediate skiers.

To start, we worked with the PSIA-AASI to characterize the hallmark differences between professionals and non-professionals. These differences include the relative position of the upper body vs. the lower body and the limbs, and how they take a turn. Using these insights from domain experts, we engineered data features that characterize limb position relative to one another, and the upper body relative to the lower body. We added these engineered features to the dataset of individual sensor measures.
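As an illustrative sketch (the exact feature definitions live in our R script), an engineered "upper body relative to lower body" feature could be a normalized difference between two sensors' readings:

```python
import math

def relative_feature(upper, lower):
    """Normalized (x, y, z) difference between an upper-body and a
    lower-body sensor reading; the direction of the offset survives,
    the overall scale does not."""
    diff = [u - l for u, l in zip(upper, lower)]
    norm = math.sqrt(sum(d * d for d in diff)) or 1.0
    return [d / norm for d in diff]
```

Normalizing out the magnitude makes the feature comparable across athletes of different heights and sensor placements.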

Then we broke our ski activity sample into small activity interval slices – in our case, time slices of two seconds each. For each of these slices, we created summary statistical measures on the sensor data from these intervals. These summary statistical measures were basic measures and included medians, minimums, maximums and quartiles.
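Sketched in Python (our actual implementation is in R), the fixed-interval slicing and summary statistics look roughly like:

```python
import statistics

def window_features(samples, window_size):
    """Cut one sensor series into fixed-size windows and compute the
    summary statistics used as model features for each window."""
    features = []
    for start in range(0, len(samples) - window_size + 1, window_size):
        w = sorted(samples[start:start + window_size])
        q1, median, q3 = statistics.quantiles(w, n=4)
        features.append({
            "min": w[0], "max": w[-1],
            "q1": q1, "median": median, "q3": q3,
            "stdev": statistics.stdev(w),
        })
    return features
```

Here `window_size` is the two-second interval expressed in samples, e.g. 200 samples at a 100 Hz sampling rate.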

We also created features that characterize frequency measures for this sensor data, represented by spatial and temporal features in the time-series graphic illustration. We ran a fast discrete Fourier transform on the variables to transform the data from a time series into frequency measures (in Hz) for a given interval window – in our case, a 2-second time window. This generated the frequency components of our sensor signal, including constant power, low-band average power, mid-band average power, and high-band average power. Finally, we generated cross-correlation measures on select variables to measure the similarity of various two-series combinations.
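The band-power features can be sketched with a naive discrete Fourier transform. We used R's FFT in practice; this O(n²) version and the example band edges are only for illustration:

```python
import math

def power_spectrum(samples, sample_rate_hz):
    """Power per frequency bin via a naive O(n^2) DFT; returns a list of
    (frequency_hz, power) pairs from DC up to the Nyquist frequency."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        re = sum(samples[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(samples[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        spectrum.append((k * sample_rate_hz / n, (re * re + im * im) / n))
    return spectrum

def band_power(spectrum, lo_hz, hi_hz):
    """Sum the power of all bins whose frequency falls in [lo_hz, hi_hz)."""
    return sum(power for freq, power in spectrum if lo_hz <= freq < hi_hz)
```

The bin at 0 Hz gives the constant power, and summing disjoint low/mid/high bands of the same spectrum yields the band-average power features described above.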

We filtered down to the best thirty features, and trained and tested a logistic regression model. This model predicted the right skill level classification 98% of the time. Beyond this classification, an amateur can get even more guidance on specific differences and areas to improve relative to the professional model, by investigating specific sensor differences at various phases of the activity.

Follow this Recipe to Recreate from Scratch

Refer to the sensor kit and R script here.

  1. Place the sensors on the body per the above diagram and test. Refer to the sensor kit at this GitHub location for suggestions on sensor options. Your sensors should emit positional, acceleration and rotational data from the feet, each of the lower legs, the pelvis, torso and shoulders, as identified in the diagram above, with a minimum sampling rate of 100 Hz.
  2. Generate data! Create skiing experiment data of drills, including short radius turns, medium radius turns, and large radius turns. Be sure to exclude non-skiing time from your experiment sample.
  3. Label the data by skill level. We labeled it as professionals vs. non-professionals.
  4. Store the data in the cloud. Store data on the device, or your phone for batch upload, or stream data to a storage location in the cloud.
  5. Import data into an Azure Machine Learning workspace. Options for Azure ML data import include Azure SQL Database, Azure Blob Storage, Azure Table, Azure Document DB and more.
  6. Clean and transform the data. Transform into the wide data format, with one row for each athlete and experiment. Refer to the R script here for this, and the following steps. Using the dplyr functions makes this easy. Create engineered features to better illustrate the differences between pros and non-pros. Features that best illustrate these differences in skiing include the normalized difference between the upper body and lower body, and relational positions and rotation of upper body and lower body, as well as the relational position and rotation of the limbs.
  7. Slice the data into intervals and generate statistics for these intervals. Over your time window, generate summary statistics as well as frequency and frequency covariance statistics. You can use a fast discrete Fourier transform function in R to generate frequency statistics. Generate summary statistics including the median, standard deviation, max, min, 1st quartile, and 3rd quartile. Generate frequency statistics for constant power, low-band average power, mid-band average power, and high-band average power of each time window. Finally, you can generate frequency covariance statistics for select sets of variables.
  8. Prepare the data to train the predictive model. Exclude the label and irrelevant columns from training: in this case, we exclude the experiment number, subject ID, time stamp and, of course, the skill level.
  9. Select features. In our case, we used Joint Mutual Information Maximization to reduce to 30 features.
  10. Split the data for training and test. In our case, we used a 70/30 split of train and test.
  11. Train the model. In our case, we chose logistic regression to predict the skill level.
  12. Evaluate the model on the held-out test set. Reviewing the confusion matrix will let you see how your model performs.
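The windowed feature generation in step 7 can be sketched as follows (the post's own script is in R; this is a minimal Python sketch, and the band edges and synthetic signal are hypothetical stand-ins for real sensor data):

```python
import numpy as np

def band_power(window, fs=100.0):
    """Constant (DC), low-, mid- and high-band average power for one window,
    via a fast discrete Fourier transform. Band edges are hypothetical."""
    spec = np.abs(np.fft.rfft(window)) ** 2 / len(window)
    freqs = np.fft.rfftfreq(len(window), d=1.0 / fs)
    bands = {"constant": (0.0, 0.5), "low": (0.5, 3.0),
             "mid": (3.0, 10.0), "high": (10.0, fs / 2)}
    return {name: spec[(freqs >= lo) & (freqs < hi)].mean()
            for name, (lo, hi) in bands.items()}

def summary_stats(window):
    """Median, standard deviation, max, min, 1st and 3rd quartiles."""
    q1, med, q3 = np.percentile(window, [25, 50, 75])
    return {"median": med, "sd": np.std(window), "max": window.max(),
            "min": window.min(), "q1": q1, "q3": q3}

# Slice one sensor channel into 2-second (200-sample) intervals at 100 Hz
# and build one feature row per interval.
fs = 100.0
t = np.arange(0, 10, 1.0 / fs)
channel = np.sin(2 * np.pi * 2.0 * t)   # synthetic 2 Hz "turn rhythm" signal
windows = channel.reshape(-1, 200)      # 5 windows of 2 seconds each
features = [{**summary_stats(w), **band_power(w, fs)} for w in windows]
```

For the synthetic 2 Hz signal, the low band dominates the feature row, which is the kind of contrast the pro vs. non-pro classifier in steps 9-12 learns from.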

Compare Your Skiing Sensor Data to the Pros

Comparing your data to that of the pros will give you very specific guidance on how to improve. Use our data set LINKED HERE (about 890 MB), and analyze your own skiing sensor data vs. the sample from professionals.
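As a minimal sketch of such a comparison (assuming you have already extracted windowed features as above; the feature names and all numbers below are made up for illustration, not taken from the linked data set), you can z-score your own feature averages against the professional sample:

```python
# Hypothetical per-feature mean and standard deviation from the pro sample.
pro_mean = {"upper_lower_diff": 0.12, "hip_rotation_power": 0.85}
pro_sd = {"upper_lower_diff": 0.04, "hip_rotation_power": 0.20}

# Your own averaged features for the same drills (also made up).
mine = {"upper_lower_diff": 0.31, "hip_rotation_power": 0.40}

def compare_to_pros(mine, pro_mean, pro_sd):
    """Z-score per feature: how many SDs you are from the pro average."""
    return {k: (mine[k] - pro_mean[k]) / pro_sd[k] for k in mine}

gaps = compare_to_pros(mine, pro_mean, pro_sd)
# The feature with the largest |z| is the most promising area to work on.
focus = max(gaps, key=lambda k: abs(gaps[k]))
```

Here the upper/lower body separation is nearly five standard deviations off the pro baseline, so that is where the specific guidance would point first.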

Expand this Model to Other Sports

We invite you to help us expand this model by adding activity models for additional sports and activities. Read about our sports sensor work at REAL LIFE CODE. Contribute to the sensor kit, sports activity data and models found at this GITHUB LOCATION. Or reach out to KEVIN ASHLEY or MAX ZILBERMAN at Microsoft.

Allowing Multiple Users To Access R Server On HDInsight




Recently, a few customers have asked me how to enable multiple users to access R Server on HDInsight, so I thought it would be a good idea to write up all the ways to do it.

To provide some background, you need to provide two users when creating an HDInsight cluster. One is the so-called "http user" – i.e. the "Cluster login user name" below. The other is the "ssh user" – i.e. the "SSH user name" below.


Basically, the "http user" is used to authenticate through the HDInsight gateway, which protects the HDInsight clusters you create. This user is used to access the Ambari UI, the YARN UI, and many other UI components.

The "ssh user" is used to access the cluster through secure shell. This user is an actual user in the Linux system on all the head nodes, worker nodes, edge nodes, etc., so you can use secure shell to access the remote clusters.

For Microsoft R Server on HDInsight clusters, it's a bit more complex, because we include the RStudio Server Community edition in HDInsight, which only accepts a Linux user name and password as its login mechanism (it does not support passing tokens). So if you have created a new cluster and want to use RStudio, you need to first log in through the HDInsight gateway using the http user's credentials, and then log in to RStudio using the ssh user's credentials.



One limitation of existing HDInsight clusters is that only one ssh user account can be created at cluster provisioning time. So in order to allow multiple users to access Microsoft R Server on HDInsight clusters, we need to create additional users in the Linux system.

Because the RStudio Server Community edition runs on the cluster's edge node, there are three steps here:

  1. Use the existing ssh user to log in to the edge node
  2. Add more Linux users on the edge node
  3. Use the RStudio Community version with the users just created

Step 1: Use the existing ssh user to log in to the edge node

You can follow this documentation: Connect to HDInsight (Hadoop) using SSH (HTTPS://DOCS.MICROSOFT.COM/EN-US/AZURE/HDINSIGHT/HDINSIGHT-HADOOP-LINUX-USE-SSH-UNIX) to access the edge node. To start simple, download an ssh tool (such as PuTTY) and log in with the existing SSH user.

The edge node address for an R Server on HDInsight cluster is:

Step 2: Add more Linux users on the edge node

Execute the command below:

sudo useradd yournewusername -m

sudo passwd yournewusername

You will see something like the output below. When prompted for the "Current Kerberos password:", just press Enter to skip it. The -m option of useradd tells the system to create a home folder for the user.


Step 3: Use the RStudio Community version with the user just created

Use the user just created to log in to RStudio:


You will see that we are now using the new user (sshuser6) to log in to the cluster.


You can submit a job using ScaleR functions:

# Set the HDFS (WASB) location of example data
bigDataDirRoot <- "/example/data"
# create a local folder for storing data temporarily
source <- "/tmp/AirOnTimeCSV2012"
dir.create(source, showWarnings = FALSE)
# Download data to the tmp folder
remoteDir <- ""
download.file(file.path(remoteDir, "airOT201201.csv"), file.path(source, "airOT201201.csv"))
download.file(file.path(remoteDir, "airOT201202.csv"), file.path(source, "airOT201202.csv"))
download.file(file.path(remoteDir, "airOT201203.csv"), file.path(source, "airOT201203.csv"))
download.file(file.path(remoteDir, "airOT201204.csv"), file.path(source, "airOT201204.csv"))
download.file(file.path(remoteDir, "airOT201205.csv"), file.path(source, "airOT201205.csv"))
download.file(file.path(remoteDir, "airOT201206.csv"), file.path(source, "airOT201206.csv"))
download.file(file.path(remoteDir, "airOT201207.csv"), file.path(source, "airOT201207.csv"))
download.file(file.path(remoteDir, "airOT201208.csv"), file.path(source, "airOT201208.csv"))
download.file(file.path(remoteDir, "airOT201209.csv"), file.path(source, "airOT201209.csv"))
download.file(file.path(remoteDir, "airOT201210.csv"), file.path(source, "airOT201210.csv"))
download.file(file.path(remoteDir, "airOT201211.csv"), file.path(source, "airOT201211.csv"))
download.file(file.path(remoteDir, "airOT201212.csv"), file.path(source, "airOT201212.csv"))
# Set directory in bigDataDirRoot to load the data into
inputDir <- file.path(bigDataDirRoot,"AirOnTimeCSV2012")
# Make the directory
rxHadoopMakeDir(inputDir)
# Copy the data from source to input
rxHadoopCopyFromLocal(source, bigDataDirRoot)
# Define the HDFS (WASB) file system
hdfsFS <- RxHdfsFileSystem()
# Create info list for the airline data
airlineColInfo <- list(
DAY_OF_WEEK = list(type = "factor"),
ORIGIN = list(type = "factor"),
DEST = list(type = "factor"),
DEP_TIME = list(type = "integer"),
ARR_DEL15 = list(type = "logical"))
# get all the column names
varNames <- names(airlineColInfo)
# Define the text data source in hdfs
airOnTimeData <- RxTextData(inputDir, colInfo = airlineColInfo, varsToKeep = varNames, fileSystem = hdfsFS)
# Define the text data source in local system
airOnTimeDataLocal <- RxTextData(source, colInfo = airlineColInfo, varsToKeep = varNames)
# formula to use
formula <- as.formula(paste("ARR_DEL15 ~", paste(varNames[1:4], collapse = " + ")))
# Define the Spark compute context
mySparkCluster <- RxSpark()
# Set the compute context
rxSetComputeContext(mySparkCluster)
# Run a logistic regression
modelSpark <- rxLogit(formula, data = airOnTimeData)
# Display a summary
summary(modelSpark)

You will see that all the submitted jobs run under the different user names in the YARN UI.


Please note that the newly added users do not have root privileges in the Linux system, but they do have the same access to all the files in the remote storage (HDFS or WASB storage).

Apache CarbonData announced as a Top-Level Project

The Apache Software Foundation has announced that CarbonData has been promoted to a Top-Level Project.

Apache CarbonData is an indexed columnar data format for fast analytics on big data platforms such as Apache Hadoop and Apache Spark.

Highlights include:

  • Unique data organization for faster filtering and better compression;
  • Multi-level indexing to enable faster search and speed up query processing;
  • Deep Apache Spark integration for DataFrame and SQL compliance;
  • Advanced push-down optimization to minimize the amount of data being read, processed, converted, transmitted, and shuffled;
  • Efficient compression and global encoding schemes to further improve aggregation query performance;
  • Dictionary encoding for reduced storage space and faster processing; and
  • Data update and delete support using standard SQL syntax.


Use BigDL on HDInsight Spark for Distributed Deep Learning


Author: Xiaoyong Zhu

Deep learning is impacting everything from healthcare to transportation to manufacturing, and more. Companies are turning to deep learning to solve hard problems like image classification, speech recognition, object recognition, and machine translation. In this blog post, Intel's BigDL team and the Azure HDInsight team collaborate to provide the basic steps for using BigDL on Azure HDInsight.

What is Intel’s BigDL library?

In 2016, Intel released its BigDL distributed deep learning project into the open-source community (see the BigDL GitHub repository). It integrates natively into Spark, supports popular neural net topologies, and achieves feature parity with other open-source deep learning frameworks. BigDL also provides 100+ basic neural network building blocks, allowing users to create novel topologies to suit their unique applications. Thus, with Intel's BigDL, users are able to leverage their existing Spark infrastructure to enable deep learning applications without having to invest in bringing up separate frameworks to take advantage of neural network capabilities.

Since BigDL is an integral part of Spark, a user does not need to explicitly manage distributed computations. While providing high-level control "knobs" such as the number of compute nodes, cores, and batch size, a BigDL application leverages stable Spark infrastructure for node communications and resource management during its execution. BigDL applications can be written in either Python or Scala and achieve high performance through both algorithm optimization and intimate integration with Intel's Math Kernel Library (MKL). Check out Intel's BigDL portal for more details.

Azure HDInsight

Azure HDInsight is the only fully managed cloud Hadoop offering that provides optimized open-source analytic clusters for Spark, Hive, MapReduce, HBase, Storm, Kafka, and R Server, backed by a 99.9% SLA. In addition, HDInsight is an open platform for third-party big data applications from ISVs, as well as custom applications such as BigDL.

In this blog post, the BigDL and Azure HDInsight teams give a high-level view of how to use BigDL with Apache Spark for Azure HDInsight. You can find a more detailed walkthrough of using BigDL to analyze the MNIST dataset in the engineering blog post.

Getting BigDL to work on Apache Spark for Azure HDInsight

BigDL is very easy to build and integrate. There are two major steps:

  • Get BigDL source code and build it to get the required jar file
  • Use Jupyter Notebook to write your first BigDL application in Scala

Step 1: Build BigDL libraries

The first step is to build the BigDL libraries and get the required jar file. You can simply ssh into the cluster head node and follow the build instructions in the BigDL documentation. Please note that you need to install Maven on the head node to build BigDL, and put the jar file (dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar) into the default storage account of your HDInsight cluster. Please refer to the engineering blog for more details.

Step 2: Use Jupyter Notebook to write your first application

HDInsight clusters come with Jupyter Notebook, which provides a nice notebook experience for authoring Spark jobs. Here is a snapshot of a Jupyter Notebook running BigDL on Apache Spark for Azure HDInsight. For a detailed step-by-step example of training on the popular MNIST dataset using the LeNet model, please refer to Microsoft's engineering blog post. For more details on how to use Jupyter Notebooks on HDInsight, please refer to the documentation.


BigDL workflow and major components

Below is the general workflow of how BigDL trains a deep learning model on Apache Spark. As shown in the figure, BigDL jobs are standard Spark jobs. In a distributed training process, BigDL launches Spark tasks in each executor (each task leverages Intel MKL to speed up the training process).

A BigDL program starts with imports and then initializes the Engine, specifying the number of executor nodes and the number of physical cores on each executor.

If the program runs on Spark, Engine.init() will return a SparkConf with the proper configuration populated, which can then be used to create the SparkContext. In this particular case, the Jupyter Notebook automatically sets up a default Spark context, so you don't need to do the above configuration, but you do need to set a few other Spark-related configurations, which are explained in the sample Jupyter Notebook.


In this blog post, we have demonstrated the basic steps to set up a BigDL environment on Apache Spark for Azure HDInsight; you can find a more detailed walkthrough of using BigDL to analyze the MNIST dataset in the engineering blog post "How to use BigDL on Apache Spark for Azure HDInsight." Leveraging the BigDL Spark library, a user can easily write scalable distributed deep learning applications within familiar Spark infrastructure, without intimate knowledge of the underlying compute cluster's configuration. The BigDL and Azure HDInsight teams have been collaborating closely to enable BigDL in the Apache Spark for Azure HDInsight environment.

If you have any feedback for HDInsight, feel free to drop us an email. If you have any questions for BigDL, you can raise them in the BigDL Google Group.


HDInsight – How to Perform Bulk Load with Phoenix?



APACHE HBASE is an open-source NoSQL Hadoop database: a distributed, scalable, big data store that provides real-time read/write access to large datasets. HDINSIGHT HBASE is offered as a managed cluster that is integrated into the Azure environment. HBase provides many features as a big data store, but in order to use it, customers first have to load their data into HBase.

There are multiple ways to get data into HBase, such as using the client APIs, running a MapReduce job with TableOutputFormat, or entering the data manually in the HBase shell. Many customers are interested in using APACHE PHOENIX – a SQL layer over HBase – for its ease of use. This post describes how to use Phoenix bulk load with HDInsight clusters.

Phoenix provides two methods for loading CSV data into Phoenix tables: a single-threaded client loading tool via the psql command, and a MapReduce-based bulk load tool.

Single threaded method

Please note that this method is suitable when your bulk load data is on the order of tens of megabytes, so it may not be a good option for most production scenarios. Following are the steps to use this method.

  • Create Table:

Put the command to create the table in a file (say, CreateTable.sql) based on the schema of your table. Example:

CREATE TABLE inputTable (
		Field1 varchar NOT NULL PRIMARY KEY,
		Field2 varchar,
		Field3 decimal,
		Field4 INTEGER,
		Field5 varchar);


  • Input data: This file contains the input data for bulk load (let’s say input.csv).
  • Query to execute on the data: You can put any SQL query which you would like to run on the data (let’s say Query.sql). A Sample query:


SELECT Field5, COUNT(*) FROM inputTable GROUP BY Field5;
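With those three files in place, psql.py creates the table, loads the CSV, and runs the query in a single invocation. The sketch below only constructs and prints the command rather than running it; the Phoenix client path and Zookeeper quorum shown are hypothetical and should be replaced with your cluster's values, and the command itself should be run on a cluster node where the Phoenix client is installed:

```shell
# Hypothetical Zookeeper quorum for an HDInsight HBase cluster.
ZK_QUORUM="zk0-myhbase:2181:/hbase-unsecure"
# Typical Phoenix client location on an HDP-based cluster (may differ).
PSQL=/usr/hdp/current/phoenix-client/bin/psql.py

# -t names the target table; the .sql and .csv files are processed in order:
# DDL first, then the CSV data load, then the query.
CMD="$PSQL -t INPUTTABLE $ZK_QUORUM CreateTable.sql input.csv Query.sql"
echo "$CMD"
```

Phoenix upper-cases unquoted identifiers, which is why the table name is passed as INPUTTABLE here.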


Building Advanced Analytical Solutions Faster Using Dataiku DSS On HDInsight




The AZURE HDINSIGHT APPLICATION PLATFORM allows users to use applications that span a variety of use cases, such as data ingestion, data preparation, data processing, building analytical solutions, and data visualization. In this post, we will see how DSS (DATA SCIENCE STUDIO) from Dataiku can help a user build a predictive machine learning model to analyze movie sentiment on Twitter.

To learn more about DSS integration with HDInsight, register for the WEBINAR featuring Jed Dougherty from Dataiku and Pranav Rastogi from Microsoft.

DSS on HDInsight

By installing the DSS application on an HDInsight cluster (Hadoop or Spark), the user has the ability to:

  • Automate data flows

DSS has the ability to integrate with multiple data connectors. Users can connect to their existing infrastructure to consume their data. Data can be cleaned, merged and enriched by creating reusable workflows.

  • Use a collaborative platform

One of the highlights of DSS is the ability to collaborate on building an analytics solution. Data scientists and analysts can interact with developers to build solutions and improve results. DSS supports a wide variety of technologies like R, MapReduce, Spark, etc.

  • Build prediction models

Another key feature of DSS is the ability to build predictive models leveraging the latest machine learning technologies. The models can be trained using various algorithms and applied to existing flows to predict or cluster information.

  • Work using an integrated UI

DSS offers an integrated UI where you can visualize all the data transforms. Users can create interactive dashboards and share them with other members of the team.

Leverage the power of Azure HDInsight

DSS can leverage the benefits of the HDInsight platform, like enterprise security, monitoring, SLAs, and more. DSS users can leverage the power of MapReduce and Spark to perform advanced analytics on their data. DSS offers various mechanisms to train its built-in ML algorithms when the data is stored in HDInsight. The diagram below illustrates how the HDInsight cluster is utilized by DSS:


How to install DSS on an HDInsight cluster?