Skytree is announcing a major new release, version 16.0, which continues to combine automation, scalability, and ease of use into a powerful data science tool for experts and non-experts alike.
In our last release (http://www.skytree.net/2016/10/10/creating-data-prep-pipelines-interactively) we added transform snippets to the Skytree Platform. Snippets let users perform the arbitrary data preparation demanded by real-world dataflows while taking advantage of the scalability of Spark, without needing to write Spark code. Furthermore, with snippets added to the system, GUI users can apply them without any coding at all.
In 16.0, we continue our trend of increasing ease of use through automation, further enabling non-expert users to access the power of machine learning and add value to their business. Having made a major stride in data preparation, in this release we turn our attention to another major demand on data scientists' time: feature engineering.
In our latest release, Skytree 15.6, we continue taking steps toward making data science easier, with our main focus on the pains of data preparation. Last release, we brought you our "low code" (i.e., minimal-coding) approach to data preparation with our new snippet mechanism. A snippet lets the user author a new data preparation transform, which can run on big (or any size) data under the hood, by specifying only its actual logic, without worrying about the implementation details required when writing in Spark, MapReduce, or other systems. Further, the new transform is then registered in the system, allowing it to leverage the system's visual auto-documentation and resource management, and making it available to subsequent users, seamlessly extending the data science team's Skytree system. In this release, we've brought this powerful snippet mechanism into the GUI, allowing users to apply and chain snippets together to create data preparation pipelines even more easily. FULL ARTICLE
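Chaining snippets into a pipeline is, at heart, ordinary function composition. The sketch below is a minimal, hypothetical illustration in plain Python; none of the names here (build_pipeline, drop_bad_rows, normalize_fare) are Skytree APIs. Each snippet transforms a batch of rows, and the pipeline applies them in order.

```python
# Minimal sketch of chaining data-prep "snippets": each snippet is a plain
# function over a list of row dicts, and a pipeline composes them in order.
# All names here are hypothetical illustrations, not the Skytree API.

def build_pipeline(*snippets):
    """Compose snippets left-to-right into a single transform."""
    def pipeline(rows):
        for snippet in snippets:
            rows = snippet(rows)
        return rows
    return pipeline

def drop_bad_rows(rows):
    # Keep only rows with a positive fare value.
    return [r for r in rows if r.get("fare", 0) > 0]

def normalize_fare(rows):
    # Scale fares to the [0, 1] range.
    top = max(r["fare"] for r in rows)
    return [dict(r, fare=r["fare"] / top) for r in rows]

prep = build_pipeline(drop_bad_rows, normalize_fare)
clean = prep([{"fare": 10.0}, {"fare": -1.0}, {"fare": 20.0}])
# clean == [{"fare": 0.5}, {"fare": 1.0}]
```

In a real deployment the per-snippet logic would run distributed (e.g., on Spark), but the composition idea is the same.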
By Rahul Deshmukh
Our friends at Cloudera have allowed us to share with our readers a great example of how to use big data insights to drive board level business decisions.
Organizations across the globe are going through a digital transformation. Fueled by rapid adoption of digital technologies—mobile, web, social, and an explosion of sensor-driven devices—this transformation is creating new revenue streams, saving money, and changing the relationships these organizations have with their customers. At the heart of this transformation is how organizations use data (and the unique insights derived from it) to drive competitive advantage. Those who view data as an asset or opportunity will thrive, while those who lose sight of it will likely see significant negative business impact. Furthermore, actionable insights from this new class of data (as well as existing data) are greatly impacting board-level initiatives: driving customer insights, improving the efficiency of products and services, and lowering business risks. Most importantly, Apache Hadoop is, in many ways, enabling this transformation. FULL ARTICLE
Part 2 of the New York Taxi Dataset Analysis with Skytree
This is the second part of a two-part blog series that analyzes the publicly available New York Taxi dataset. To read Part 1, click here. Part 2 discusses the results of our model and how we could improve them in future analysis.
Skytree Performance and Model Results
The model training in Smart Search takes quite a long time to run: a couple of days. The required YARN resources (automatically calculated and shown in the GUI as described above) are 3 containers, 28 cores, and 60,416 MB of memory. Skytree ML runs distributed, so the memory available to it is the aggregate memory of the cluster's Hadoop nodes, though it requests only what it needs. Our internal system consists of 10 nodes with 96 GB of memory each, so this analysis ran easily. FULL ARTICLE
The New York City Taxi &amp; Limousine Commission (NYC TLC) has made available data about its taxi journeys. In total, there are over 1 billion journeys from 2009-2015, yielding about 200 gigabytes of tabular data. Here, we demonstrate the use of Skytree supervised machine learning (ML) to analyze this data, and in particular, that Skytree ML can train fully nonlinear models such as gradient boosted decision trees on 100% of the data using our user-friendly graphical user interface (GUI).
In this two-part blog series we concentrate on the substance of the data preparation and machine learning analysis. In Part 1, we describe the use case, the data preparation, and the training of the model. In Part 2, we describe the results and how the analysis could be improved. A future entry could show the data preparation integrated into Skytree via PySpark (see Skytree Simplifies Advanced Data Preparation with Code Collaboration).
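For readers unfamiliar with the model family mentioned above: gradient boosted decision trees fit an ensemble of small trees, each one trained on the residuals of the current prediction, and add each tree's output with a small learning rate. The toy below implements that idea with one-dimensional "stump" trees in plain Python; it is a stand-in to illustrate the technique, not Skytree's implementation.

```python
# Toy gradient boosting for regression, using single-split "stump" trees on
# a 1-D feature. Illustrates the boosting idea only; real implementations
# (including the one discussed in this post) use full decision trees and
# run distributed over the whole dataset.

def fit_stump(x, residuals):
    """Find the single threshold split minimizing squared error on residuals."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, rounds=50, lr=0.1):
    base = sum(y) / len(y)          # start from the mean prediction
    preds = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - p for yi, p in zip(y, preds)]
        s = fit_stump(x, residuals)  # fit the next stump to the residuals
        stumps.append(s)
        preds = [p + lr * s(xi) for p, xi in zip(preds, x)]
    return lambda xi: base + lr * sum(s(xi) for s in stumps)

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]   # a step function
model = boost(x, y)
# model(0.5) approaches 0 and model(4.5) approaches 1 as rounds increase
```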
All data scientists and data analysts know that quite often, much of the work of a project is in the data preparation phase. Just getting from the raw data to its first ML-ready form may often constitute the majority of the overall effort. Adding to the complication is the fact that every dataset is just a little different, many requiring one or more custom transforms — such as unique ways to normalize or clean columns.
We are happy to introduce, in our free community edition, the ability to quickly and easily develop and use new data preparation transforms. Through an API-based low-code approach, the user needs to write only a minimal amount of code to create a new custom transform, and through a new code collaboration mechanism, they can use data prep transforms as soon as teammates create them. FULL ARTICLE
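As a rough sketch of what an API-based low-code approach with shared transforms might look like, here is a minimal, hypothetical Python example. The names (TRANSFORMS, register, clip_outliers) are illustrative only, not the actual Skytree API; the point is that the author writes only the transform's logic, and registration makes it immediately available to teammates by name.

```python
# Hypothetical sketch of a low-code transform registry. The author supplies
# only the per-value logic; a decorator registers it so teammates can look
# it up by name. None of these names are the real Skytree API.

TRANSFORMS = {}

def register(name):
    def wrap(fn):
        TRANSFORMS[name] = fn
        return fn
    return wrap

@register("clip_outliers")
def clip_outliers(value, low=0.0, high=100.0):
    # The author writes only this cell-level logic; running it at scale
    # (Spark, MapReduce, ...) would be handled by the framework.
    return max(low, min(high, value))

# A teammate can now apply the shared transform by name:
fn = TRANSFORMS["clip_outliers"]
cleaned = [fn(v) for v in [-5, 50, 500]]
# cleaned == [0.0, 50, 100.0]
```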
Experiments involving machine learning have typically been run on individual computers. A recent trend has seen sophisticated data scientists and teams worldwide starting to adopt distributed platforms (e.g., Hadoop) in order to run increasingly advanced experiments on significantly larger datasets. Unfortunately, even when access to advanced software, capable of utilizing distributed compute resources, is available, data scientists are constrained by the explicit ceiling on available resources (e.g., compute, disk, memory, network). More precisely, they are limited either by their computer’s components or the available resources of the cluster. These limitations affect the productivity of data science teams in many ways, some of which are outlined as follows:
- Loss of Productivity: Data scientists spend substantial time waiting for results. Generally speaking, data science is an iterative process that alternates between many rounds of analysis and experimentation. The longer the experiments take, the longer the time to production is for a model.
- Lower Accuracy: Lack of resources prevents data scientists from using all of the data to fully explore and create the most accurate models.
- Wasted Resources: It is not uncommon for a compute cluster to be underutilized for long periods of time between bursts of activity. In a cluster provisioned to accommodate such bursts, there can be long stretches of time where capacity dramatically exceeds demand.
- Limitations of Capacity Planning: Demand for resources will inevitably exceed planned maximum cluster capacity. In fact, accurately predicting the requisite maximum capacity to provision for a worst-case scenario represents an interesting machine-learning problem in its own right.
The standard solution to this problem of limited capacity is the acquisition of additional resources. However, purchasing a bigger computer or requesting that additional nodes be added to a cluster requires significant investment in both human capital and money. As it stands, these limits can pose a serious hindrance to the pace of obtaining actionable results.
What if capacity could be added to a cluster dynamically, depending on demand and in effective increments, and removed as necessary on the fly?
I recently did some preliminary research and have broken it into three blog posts. This is the third and final post in the series. Please read my post "The Problem: Sloan Digital Sky Survey and Galaxies" (posted 4/12/16) to understand the problem, and my post "Applying Machine Learning to Galaxy Distance Problem" (posted 4/14/16) to understand how I applied machine learning to solve it.
As we saw above, Skytree is able to meet the desired performance target of a redshift MAE of 0.02 or better.
Figure 10 shows the predicted versus true values for the redshift. This is similar to previous plots in the literature which show this result, and represents a good fit to the data. In contrast, the right-hand panel shows the same result for a generalized linear model fit, which clearly gives worse performance. FULL ARTICLE
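The metric quoted above, mean absolute error (MAE), is simply the average of |predicted - true| over all galaxies. A small self-contained example (the numbers are made up for illustration, not actual survey results):

```python
# Mean absolute error: average absolute difference between predicted and
# true redshifts. Values below 0.02 meet the target cited in this post.

def mae(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Illustrative values only:
pred = [0.10, 0.21, 0.33]
true = [0.11, 0.20, 0.30]
score = mae(pred, true)  # ~0.0167, under the 0.02 target
```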
I recently did some preliminary research and have broken it into three blog posts. This is the second post in the series. Please read my first post, "The Problem: Sloan Digital Sky Survey and Galaxies," posted on 4/12/16.
SDSS Blog Series Continued:
5) Machine Learning
Now that we have our training set as a CSV file (720,423 rows by 5 columns, plus ID and target columns, for a total file size of 51 MB), we are ready to run our machine learning. The data are summarized in Table 1.
Table 1: Data for machine learning, summarizing the number of missing/bad values and the magnitude ranges of the u, g, r, i, and z bands.
We are predicting the redshift, thus we require regression to a continuous value, rather than assigning classification probabilities to classes.
The training set here is easily large enough to be representative of the population of galaxies whose distances we are interested in. We therefore use a simple 2:1 ratio of training to holdout set to train the model and assess performance. This can be trivially changed to K-fold cross-validation in Skytree (or even Monte Carlo cross-validation with both K-fold and holdout), but such a change here would only add to the runtime, unless we wish for a confidence interval on our performance. FULL ARTICLE
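The 2:1 training-to-holdout split described above can be sketched as follows in plain Python (a fixed seed makes the split reproducible across runs):

```python
# Shuffle the rows once with a fixed seed, then take the first two-thirds
# for training and hold out the remaining third for assessing performance.

import random

def train_holdout_split(rows, train_frac=2 / 3, seed=42):
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

rows = list(range(9))
train, holdout = train_holdout_split(rows)
# len(train) == 6 and len(holdout) == 3
```

Swapping this for K-fold cross-validation would mean rotating which third is held out and averaging the scores, at the cost of K times the training runtime.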
I recently did some preliminary research and have broken it into three blog posts that I will publish over the next week. The first post (this one) discusses the problem that I looked at. The second post will cover how I applied machine learning to the problem, and the third will discuss the results.
As an astrophysicist, my research area was galaxies: their properties, environments, and evolution. One important property of a galaxy is its distance: how far away is it? Over the history of astronomy, measuring the distances to objects has been surprisingly difficult. It is only in the last 100 years, for example, that we have known that other galaxies exist at all.
Why are distances helpful? Aside from simply knowing the scale of our universe, because we know the direction to a galaxy very precisely from its position on the sky, knowing its distance immediately gives us a 3D map of the universe. A map gives us the structures in which galaxies reside. The nature of those structures gives us galaxies' environments, and, because looking out in distance is looking back in time (light from the galaxies travels to us at a finite speed), we can study their evolution. FULL ARTICLE