Introduction to integration with H2O

Below, we provide an introduction to the H2O integration with Stata and discuss how it works.

What is H2O?

H2O is a scalable and distributed machine learning and predictive analytics platform. You can perform in-memory data analysis and machine learning using this framework.

H2O is an open-source platform, and its core code is written in Java. Stata uses H2O's REST API to connect to it. More information about the H2O framework and its various machine learning algorithms can be found on the H2O website at http://docs.h2o.ai/. You can also refer to the User Guide for more information.

How does it work from within Stata?

You can either start a new H2O cluster or connect to an existing H2O cluster from within Stata. Then you may use a suite of Stata commands to interact with the H2O cluster.

Start a local H2O cluster

You can start a local H2O cluster by typing h2o init in Stata. h2o init will look for the existence of h2o.jar, a Java Archive file that is used to start H2O. This file is distributed by H2O.ai. Stata does not distribute h2o.jar with its installation. You can acquire it from http://docs.h2o.ai/h2o/latest-stable/h2o-docs/downloading.html.

After downloading h2o.jar, place the file in a directory included in Stata’s adopath. You can view these directories using the adopath command. We recommend using the SITE, PERSONAL, or PLUS directory. See the following sample from a typical Stata adopath on a Windows computer:

. adopath
  [1] (BASE) "C:\Program Files\Stata18\ado\base"
  [2] (SITE) "C:\Program Files\Stata18\ado\site"
  [3] "."
  [4] (PERSONAL) "C:\ado\personal"
  [5] (PLUS) "C:\ado\plus"
  [6] (OLDPLACE) "C:\ado"

When h2o.jar is placed along the adopath, h2o init will use it directly to start a new local H2O cluster. If multiple copies of h2o.jar exist along the adopath, Stata uses its normal rules to search the adopath and will use the first h2o.jar it locates. Because the file is a .jar file, h2o init can also locate h2o.jar if it is placed in a jar/ subdirectory of any of the defined adopath locations. If h2o.jar cannot be located, an error is produced.
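As a quick check before calling h2o init, the sketch below uses Stata's general-purpose findfile command to look for h2o.jar along the adopath. This is only an approximation of the search described above: it assumes findfile follows its usual rules, which may not include jar/ subdirectories, so a miss here does not necessarily mean that h2o init would fail.

```stata
* Look for h2o.jar along the adopath; capture keeps a miss from stopping a do-file
capture findfile h2o.jar
if _rc == 0 {
    * findfile stores the full path of the first match in r(fn)
    display as text "h2o.jar found at: " as result "`r(fn)'"
}
else {
    * Assumption: findfile may not search jar/ subdirectories the way h2o init does
    display as error "h2o.jar was not found by findfile along the adopath"
}
```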

After h2o.jar is located, h2o init will determine if a cluster is already running on your local machine. It uses the address localhost:54321, where the IP of localhost is 127.0.0.1 and the port is 54321. If a cluster is not already running, h2o init will attempt to create one at this location, and by default, the new cluster will only allow connections from the local machine.

When the cluster has been successfully initialized, you will get a summary of the H2O cluster status similar to the following:

. h2o init
Connecting to the H2O cluster running at http://127.0.0.1:54321.....not found.
Starting a new cluster running at http://127.0.0.1:54321.
Connecting to the H2O cluster running at http://127.0.0.1:54321.. Successful.
------------------------------------------------------------------------------
H2O cluster uptime:        2 secs
H2O cluster timezone:      America/Chicago
H2O data parsing timezone: UTC
H2O cluster version:       3.36.0.1
H2O cluster version age:   1 year and 27 days
H2O cluster total nodes:   1
H2O cluster free memory:   7.984 Gb
H2O cluster total cores:   24
H2O cluster allowed cores: 24
H2O cluster status:        accepting new members, healthy
H2O connection url:        http://127.0.0.1:54321
------------------------------------------------------------------------------

Note that h2o init accepts some options to customize how the H2O cluster is initialized. For example, you can specify the nthreads() option to set the maximum number of parallel threads to use when launching the H2O cluster. See h2o init for more information.
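For illustration, a call that uses the nthreads() option might look like the following. The value 4 is an arbitrary example, not a recommendation; choose a limit that suits your machine.

```stata
* Start (or connect to) a local cluster, limiting it to at most 4 parallel threads
* (4 is an illustrative value only)
h2o init, nthreads(4)
```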

If there is already an H2O cluster running on your local machine, h2o init will attempt to connect to it. If you explicitly specify the IP and port of a remote machine when calling h2o init, by using the ip() and port() options, it will attempt to connect to the H2O cluster running on the remote machine. This is the same as calling h2o connect. See Connect to an existing H2O cluster for more details.

Connect to an existing H2O cluster

Another way to interact with H2O is to connect to an existing H2O cluster. This is done by calling h2o connect. By default, it will attempt to connect to a cluster running at localhost:54321 on your local machine. If the connection is established successfully, you will get a summary of the cluster status similar to the following:

. h2o connect
Connecting to the H2O cluster running at http://127.0.0.1:54321. Successful.
------------------------------------------------------------------------------
H2O cluster uptime:        7 secs
H2O cluster timezone:      America/Chicago
H2O data parsing timezone: UTC
H2O cluster version:       3.30.0.5
H2O cluster version age:   2 years, 7 months and 7 days
H2O cluster total nodes:   1
H2O cluster free memory:   7.982 Gb
H2O cluster total cores:   24
H2O cluster allowed cores: 24
H2O cluster status:        locked, healthy
H2O connection url:        http://127.0.0.1:54321
------------------------------------------------------------------------------

You can also connect to an H2O cluster running on a remote machine by specifying its IP and port in the ip() and port() options, respectively.
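A remote connection might be sketched as follows. The address 192.0.2.10 is a placeholder from the range reserved for documentation, and 54321 is simply the default port mentioned above; substitute the values for your own cluster. The exact form of the arguments accepted by ip() and port() is assumed here, so see h2o connect for the authoritative syntax.

```stata
* Connect to an H2O cluster that is already running on another machine
* (192.0.2.10 is a placeholder address, not a real server)
h2o connect, ip(192.0.2.10) port(54321)
```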

When connecting to an existing H2O cluster, a new Stata H2O session is created between the Stata client and the H2O cluster. Multiple clients can be connected to the H2O cluster at the same time, and they will all share its resources, such as the data and models within the cluster.

Interact with the H2O cluster

Once the H2O cluster is up, you can interact with the H2O cluster from within Stata. For example, you can type h2o query to check the status of the cluster at any time.

. h2o query
Cluster is running at http://127.0.0.1:54321.
------------------------------------------------------------------------------
H2O cluster uptime:        50 secs
H2O cluster timezone:      America/Chicago
H2O data parsing timezone: UTC
H2O cluster version:       3.36.0.1
H2O cluster version age:   1 year and 27 days
H2O cluster total nodes:   1
H2O cluster free memory:   7.984 Gb
H2O cluster total cores:   24
H2O cluster allowed cores: 24
H2O cluster status:        accepting new members, healthy
H2O connection url:        http://127.0.0.1:54321
------------------------------------------------------------------------------

If there are multiple nodes within the cluster, you can also specify the detail option, as in h2o query, detail, to list the information for each node. The node details look similar to the following:

 Node Details:
 ------------------------------------------------------------------------------
 Node 1
 ------------------------------------------------------------------------------
 IP:                        127.0.0.1:54321
 Healthy:                   yes
 Total cores:               24
 Allowed cores:             24
 Free memory:               7.982 Gb
 Free disk:                 1.359 Tb
 Pid:                       7624
 ------------------------------------------------------------------------------

You can import data from your local drive to the cluster as an H2O frame. For example, the following code will load Stata’s auto dataset to the cluster.

. sysuse auto
. _h2oframe _put, into(h2oauto)

By default, _h2oframe _put loads the entire dataset in Stata's memory into the cluster. To load a subset instead, you can specify a columnlist and the if and in qualifiers. See _h2oframe _put for more information. The dataset will be stored as an H2O frame named h2oauto in the cluster. Once the dataset is loaded into the cluster, any operations you perform on the H2O frame will be handled by the cluster, not by Stata. In other words, the two copies of the auto dataset are independent of each other.
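As a sketch of loading a subset, the following sends only three variables and only the foreign cars to the cluster. The frame name h2oforeign is a hypothetical name chosen for this example, and the placement of the column list and the if qualifier follows ordinary Stata command syntax, which is assumed rather than confirmed here; see _h2oframe _put for the exact syntax.

```stata
* Load the auto dataset into Stata's memory
sysuse auto, clear

* Send a subset to the cluster as a new H2O frame
* (h2oforeign is a hypothetical frame name used for illustration)
_h2oframe _put price mpg weight if foreign == 1, into(h2oforeign)

* Confirm that the new frame exists in the cluster
_h2oframe _dir
```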

You can type _h2oframe _dir to list all H2O frames in the cluster, along with the dimensions of the data and the amount of memory the data consume in the cluster.

. _h2oframe _dir
Name                                     |        Rows        Cols        Size
-----------------------------------------+------------------------------------
h2oauto                                  |          74          12    3.982 Kb

Total: 1

For more information about H2O frames, see Introduction to H2O frames.

Close and disconnect from the H2O cluster

Once you have finished your analysis on the H2O cluster, you can type h2o disconnect to close the connection between Stata and the H2O session, or use h2o shutdown to shut down the cluster altogether.

h2o disconnect will close the H2O connection between Stata and the cluster, leaving the H2O cluster running. h2o connect can be used to reestablish the connection and access the cluster's resources.

h2o shutdown will destroy the cluster you are currently connected to along with all of its resources. By default, h2o shutdown will exit with an error and give a warning about its destructive nature. To override this warning and actually shut down the cluster, use the force option. The force option will force the cluster to shut down, and everything in the cluster will be destroyed regardless of whether the cluster was created from Stata or outside of Stata.

Importantly, note that if the cluster was created by Stata using h2o init, then it will be automatically shut down and destroyed when the Stata session exits. Ensure you have saved all the necessary resources within the cluster before exiting. To prevent a cluster that Stata created from being shut down automatically, use h2o disconnect before closing Stata. If the cluster was created outside of Stata and a connection was made using h2o connect, then exiting Stata will only close the connection, leaving all resources within the cluster intact.

In practice, if the H2O cluster is started in Stata with h2o init, and if you are certain that all necessary results have been saved, it is preferable to use h2o shutdown to destroy the H2O cluster. Putting all H2O-related commands between h2o init and h2o shutdown, force can make the sequence more obvious.
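A do-file that follows this pattern might be laid out as below. It uses only the commands introduced in this entry; the analysis steps in the middle are a placeholder, and the frame name h2oauto is the one used in the earlier example.

```stata
* Start a local H2O cluster (or connect to one already running locally)
h2o init

* Load data into the cluster and confirm that it arrived
sysuse auto, clear
_h2oframe _put, into(h2oauto)
_h2oframe _dir

* ... H2O-related analysis goes here; save any results you need ...

* Destroy the cluster and everything in it once the results are saved
h2o shutdown, force
```

Keeping the whole sequence in one do-file makes it easy to see which cluster the commands act on and when its resources are discarded.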