Software

stochtree

stochtree is a general-purpose R and Python library for stochastic tree ensemble modeling.

Its primary interface is a “batteries-included” implementation of both the BART (Bayesian Additive Regression Trees) model for supervised learning and the BCF (Bayesian Causal Forest) model for causal inference.

It also offers a flexible “low-level” interface for specifying custom models that involve stochastic tree ensemble terms. The underlying implementation is largely in C++, with tree classes and I/O routines borrowed from XGBoost and LightGBM. See here for stochtree’s source repo.

scikit-tree

scikit-tree was a spiritual precursor to stochtree that began while I was a PhD student. It extended the Cython implementation of a decision tree learner from scikit-learn. The general rationale was:

scikit-learn’s codebase is fast and heavily tested. Relying on its implementation of the tree data structure makes it easier to extend to new methods without worrying about edge cases in the tree code.
Cython offers a nice tradeoff between high-level code (readability, maintainability, syntactic sugar, etc.) and low-level code (direct memory access and management, GIL release, etc.).

The plan was for this project to provide an easier way to experiment with and extend decision tree models without spending 6 months writing C++. Ultimately, it proved not to be the right tool for the job, for several reasons:

Coupling: tying the code to Cython / Python means that an R package would be difficult to offer (and would certainly be swimming upstream).
Tooling: Cython is much less developed than C / C++ in terms of debugging and profiling tools.
Performance: most existing BART packages were built with a C++ core, and while Cython can be much faster than Python, it is still nontrivial to write Cython code that competes with C++. The selling point of Cython being as readable as Python could end up being sacrificed as successive optimizations are made to match C++ performance.

The process of setting up this testbed project is detailed in this blog post.

implementations

Implementations is a GitHub repository with simple implementations of common statistical and mathematical algorithms. It is now largely dormant, but it was a fun source of learning during my PhD. The goal of the repo was to enable easy exploration of the challenges and tradeoffs inherent in statistics and machine learning methods, not to provide robust implementations for use in research or applications.

Some examples: