Frequently Asked Questions
This is a static FAQ page that helps you get started. For a community-driven, dynamic Q&A forum, go to ask.skytree.net and ask or answer a question!
What is Skytree Express?
- Skytree Express is an express way to use Skytree’s machine learning software. It is available as a Virtual Machine (VM) for Windows and Mac OS X, and through an easy install script on RHEL/CentOS systems.
- Currently, Skytree Express comes in two versions: (i) Skytree GUI & Python SDK and (ii) Skytree Command Line Interface (CLI).
- Skytree Express can be downloaded free of cost. It comes with a preconfigured license that is valid for one year.
Where can I download Skytree Express?
- A Virtual Machine (VM) for Mac, Windows, and Linux (running on VirtualBox 5.0) that enables Skytree’s GUI and Python SDK can be downloaded here. This VM version is close to Skytree’s full-featured Enterprise version, which makes it over 4 GB in size.
- The CLI version can be downloaded here. This is a lightweight version of Skytree’s Enterprise offering that can be installed directly on CentOS/RHEL systems through an easy install script.
What do I do after downloading Skytree Express?
Please follow the Read Me for the respective download.
How can I run Machine Learning after I have downloaded and installed Skytree?
A sample script can help you get started with the CLI! For the VM with GUI and Python SDK, simply browse to localhost:10000 to see a list of IPython notebooks that will get you started.
How can I access the VM’s command line?
The command line of the VM (GUI and Python SDK) can be accessed via ssh using: ssh -p 2222 skytree@localhost. The password is skytree.
Where can I find the datasets that the sample script/IPython notebook refers to?
The datasets are available under the datasets directory that comes with the Skytree Express download.
On the virtual machine, do I need to change any of the settings? What about for large datasets close to the 100-million element limit?
For small datasets, the default VM settings should work well. For large datasets, you may need to increase the resources available to the VM. The limiting size depends upon the available resources on your machine. Note that we recommend at least 6 GB RAM and 2 virtual cores dedicated to the VM that comes with Skytree’s GUI and Python SDK.
Are there any restrictions (for example, educational, personal, or commercial use)?
Skytree Express is available for personal, educational, and even commercial usage! The free version restricts usage to at most 100 million elements on a single machine/node.
What methods of support are available to Skytree users?
Please join Skytree’s User Community. Community support is provided through this forum. For FAQs concerning the Command Line Interface that is enabled with Skytree Express, see below.
How can I cite Skytree in my work?
If you find Skytree useful in your work, a citation is always appreciated. See examples here.
Command Line Interface
How can I get started with using Skytree through the Command Line Interface (CLI)?
A sample script that is also available with the Skytree Express download can help you get started. It illustrates how to prepare datasets and how to tune, train, test, and score models.
Where can I find complete documentation for Skytree’s CLI? Is there quick-reference help?
Skytree’s CLI documentation is available here. For quick help, type skytree-server --help at the command line. For help on using a specific module, pass the module name, for example, skytree-server gbt --help.
What machine learning algorithms are available through Skytree CLI?
Here is a list of algorithms available through the CLI:
- AutoModel automodel
- Gradient-boosted decision trees gbt
- Random decision forests rdf
- The above for regression gbtr, rdfr
- Support vector machines (linear and nonlinear) svm
- Nearest neighbors binary classification nnc, with weights wnnc
- K-means clustering kmeans
- Logistic regression logistic
- Singular value decomposition svd (includes principal components analysis)
- Generalized linear model for classification or regression glmc, glmr
- Collaborative Filtering cf
- Kernel density estimation kde
- What-if analysis whatif
- Two-point correlation function two_pt
- Model scoring score
Is data preparation required before running a skytree-server module to train a model on the dataset? How do I carry out data preparation if it’s required?
- Basic data preparation is required to make the datasets machine learning ready.
- Skytree’s data preparation tools available through the CLI and enabled via Skytree Express are:
  - generate-header.sh takes the input datasets and generates a header file with summary statistics about the data. For more information, see generate-header.sh --help.
  - convert-data.sh takes the input datasets and the header generated by generate-header.sh, and outputs datasets that are ready for machine learning. For more information, see convert-data.sh --help.
How can I use Skytree’s supervised machine learning on a dataset that is ready for machine learning (output of convert-data.sh)?
Supervised learning modules require the user to specify the predictors using --training_in and either the labels (for classification) using --training_labels_in or the targets (for regression) using --training_targets_in, in addition to basic model parameters, in order to create a model.
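As a rough illustration of the options named above, the following Python sketch assembles (but does not run) a training command line. The file names, the choice of the gbt module, and the space-separated "--flag value" form are assumptions made for illustration; skytree-server gbt --help is the place to confirm the exact syntax.

```python
import shlex


def build_training_command(module, training_in, labels_or_targets_in, regression=False):
    """Assemble a skytree-server training command as an argument list.

    Only flags named in this FAQ are used. Passing each value as a separate
    argument after its flag is an assumption, not confirmed syntax.
    """
    # Classification takes labels; regression takes targets.
    flag = "--training_targets_in" if regression else "--training_labels_in"
    return [
        "skytree-server", module,
        "--training_in", training_in,
        flag, labels_or_targets_in,
    ]


# Hypothetical file names, standing in for the output of convert-data.sh.
cmd = build_training_command("gbt", "income.train.st", "income.train.labels")
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so file names containing spaces stay intact if the list is later handed to subprocess.run.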
How can I tune models using Skytree?
Tuning is carried out on a portion of the dataset that is held out from training. There are 3 ways to specify such a portion:
- Specifying a tuning dataset using --tuning_in and --tuning_labels_in for classification (--tuning_targets_in for regression)
- Specifying a holdout ratio using --holdout_ratio, which holds out a fraction of the training data for tuning the parameters
- Carrying out cross-validation through the option --num_folds
Skytree supports tuning via grid search of parameters (specified as, for example, --num_trees min:step:max) for all methods and via smart search in a given range (specified as, for example, --num_trees min:max) for most methods. With grid search, the number of tuning iterations is the number of parameter combinations in the grid; with smart search, it is set using --smart_search_iterations. The default tuning metric is Gini/AUC (area under the ROC curve) for classification problems and MAE (mean absolute error) for regression problems.
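To make the grid-search iteration count concrete, here is a minimal Python sketch that expands min:step:max specifications and enumerates the resulting combinations. It assumes integer values and that both endpoints are inclusive, neither of which this FAQ states; tree_depth is a hypothetical second parameter used only to show how the counts multiply.

```python
from itertools import product


def expand_range(spec):
    """Expand a 'min:step:max' string into a list of integers.

    Assumption: both endpoints are inclusive. The FAQ does not say so.
    """
    low, step, high = (int(part) for part in spec.split(":"))
    return list(range(low, high + 1, step))


def grid_combinations(grid):
    """Return every parameter combination in the grid as a dict."""
    names = sorted(grid)
    value_lists = [expand_range(grid[name]) for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# num_trees comes from the FAQ; tree_depth is hypothetical.
combos = grid_combinations({"num_trees": "100:100:300", "tree_depth": "2:2:6"})
print(len(combos))  # 3 values x 3 values = 9 combinations
```

Under these assumptions, each additional tuned parameter multiplies the number of tuning iterations by its number of grid values, which is why smart search with a fixed iteration budget can be attractive for wide ranges.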
How can I test a model using Skytree?
Testing can be specified in the same command as training and tuning via the option --testing_in. The results on the test set can be written to file(s) using --labels_out and --probabilities_out (for classification) or --targets_out (for regression).
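The sketch below shows one way these testing options might be appended to an existing training command. It only builds the argument list and does not invoke Skytree; all file names and the output prefix are hypothetical, and the "--flag value" form is an assumption.

```python
import shlex


def add_testing_options(command, testing_in, regression=False, prefix="test"):
    """Return a copy of `command` with test input and output options appended."""
    extended = list(command) + ["--testing_in", testing_in]
    if regression:
        # Regression writes predicted targets.
        extended += ["--targets_out", f"{prefix}.targets"]
    else:
        # Classification writes predicted labels and class probabilities.
        extended += [
            "--labels_out", f"{prefix}.labels",
            "--probabilities_out", f"{prefix}.probabilities",
        ]
    return extended


# Hypothetical training command and file names.
base = [
    "skytree-server", "gbt",
    "--training_in", "income.train.st",
    "--training_labels_in", "income.train.labels",
]
full = add_testing_options(base, "income.test.st")
print(shlex.join(full))
```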
How do I evaluate the test results using Skytree?
Skytree provides a score module that can be used for evaluating results. For more details, see skytree-server score --help. For classification problems, it provides useful evaluation metrics such as the AUC, confusion matrix, ROC curve, and precision-recall curve. For regression problems, it provides mean absolute error, mean squared error, L1 error, L2 error, and coefficient of determination, to name a few.
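For readers unfamiliar with the regression metrics listed above, the following plain-Python sketch computes three of them from their textbook definitions. It is not Skytree's implementation, and the sample values are made up for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def coefficient_of_determination(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up targets and predictions.
targets = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 9.0]
print(mean_absolute_error(targets, predictions))
print(mean_squared_error(targets, predictions))
print(coefficient_of_determination(targets, predictions))
```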
How can I export models using Skytree?
Skytree can export models in binary format for testing using the option --model_out. Skytree can also export models in PMML format using the option --pmml_out.
What options are useful to interpret models with Skytree?
Variable importances, which show how much each variable contributes to the trained model, are available via the option --variable_importances_out. Partial dependencies, which show how the response depends on the input(s), are available via --partial_dependencies_out.
Are all features in the CLI available through Skytree Express?
No. Skytree Express limits the CLI to processing of 100 million elements. Furthermore, the CLI tool skytree-yarn that allows the computation to be distributed on multiple nodes is not available through Skytree Express.
Is the Python SDK available via Skytree Express?
Yes. The Python SDK can be accessed through a Virtual Machine that enables both the GUI and the Python SDK.
What additional features are available via the Python SDK?
Skytree’s Python SDK enables distributed data preparation and custom transformation using Spark. It also enables interactive usage with Skytree’s graphical user interface and complies with the security norms of large enterprises. Finally, through the Python SDK, a user can build models in a distributed fashion across multiple nodes, making the modeling process many times faster.
How can I access the Python SDK?
Once the VM is launched, point the browser to http://localhost:10000. This will show you a list of Python notebooks that can be used to get started. We recommend working with IncomePredictionShort.ipynb to understand how the Python SDK can be used for creating machine learning models using the Skytree module.
What methods are available through the Python SDK?
Graphical User Interface (GUI)
Is the GUI available via Skytree Express?
Yes. The GUI can be accessed through a Virtual Machine that enables both the GUI and the Python SDK.
What additional features are available via the GUI?
The GUI is the point-and-shoot version of the CLI. It’s a self-documenting data science flow that captures and enables data uploads, transforms, models, and results in a few clicks. Skytree’s automation available through the GUI empowers analysts to solve prediction problems using sophisticated machine learning methods. It also complies with the security norms of large enterprises. Finally, through the GUI, the user can build models in a distributed fashion across multiple nodes, making the modeling process many times faster.
Where can I access the GUI?
Once the VM is launched, point the browser to http://localhost:8080. This will bring up a login screen. Please use username: firstname.lastname@example.org and password: Skytree1 to log in to the GUI. It comes preloaded with a project named EXAMPLE PROJECT – INCOME PREDICTION.
What should I do if I’m having trouble invoking Skytree Express on port 8080?
All you need to do is change the virtual machine’s port-forwarding configuration to an available port. For step-by-step instructions, click here.
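Before changing the port-forwarding rule, it can help to find out which host ports are actually free. The Python sketch below is one way to check, using only the standard library; the candidate range 8081 to 8090 is an arbitrary assumption, and the check reflects only the moment it runs.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if this process can bind to the port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Already in use, or not permitted for this user.
            return False
        return True


def first_free_port(candidates):
    """Return the first bindable port among the candidates, or None."""
    return next((port for port in candidates if port_is_free(port)), None)


# Check the default GUI port, then look for an alternative (arbitrary range).
print(port_is_free(8080))
print(first_free_port(range(8081, 8091)))
```

If port_is_free(8080) reports False while the VM is not running, another program is probably holding that port, and a port returned by first_free_port is a reasonable candidate for the new forwarding rule.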