Yandex open sources CatBoost, a gradient boosting ML library (techcrunch.com)
202 points by bobuk on July 18, 2017 | hide | past | favorite | 28 comments


> gradient boosting — the branch of ML that is specifically designed to help “teach” systems when you have a very sparse amount of data, and especially when the data may not all be sensorial (such as audio, text or imagery), but includes transactional or historical data, too.

That's some strange definition.


Weird, yeah. Seems like a roundabout way of trying to preemptively answer the "why not deep learning?" question omnipresent among ML newcomers. The bits identified aren't really wrong: you could argue that gradient boosting's comparative strength is that it works well (often out-of-the-box, with little tuning) on structured data sets, including relatively small data sets. Hence the good performance on Kaggle-type problems, whereas deep learning is ahead in audio/text/image/video data; and hence the lack of gradient boosting being used on ImageNet-type problems.

But these points all belong in some section entitled "why use gradient boosting instead of another ML method?", not in a definition of gradient boosting.
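The "works well out-of-the-box on small structured data" claim above is easy to demonstrate. A minimal sketch, using sklearn's GradientBoostingClassifier as a stand-in for CatBoost, on a small tabular dataset with zero hyperparameter tuning:

```python
# Gradient boosting with all-default hyperparameters on a small
# structured dataset (569 rows) -- no tuning at all.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(f"held-out accuracy with zero tuning: {acc:.3f}")
```

Deep nets on the same 569 rows would typically need architecture and regularization choices before they got competitive, which is the comparative strength being gestured at.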


Seems that deep learning can benefit from gradient boosting too (at least, from a computational perspective).

https://arxiv.org/abs/1706.04964 "Learning Deep ResNet Blocks Sequentially using Boosting Theory"

(As for the lay-man description: I thought boosting performed better out-of-the-box on dense data than on sparse data, because most feature sub-selections for bagging are on zero'd features)


The benchmark scores seem to have been measured against Kaggle datasets, which makes them more reliable. With categorical feature support and less tuning required, CatBoost might be the ML library XGBoost enthusiasts have been looking for. But then again, how is a gradient boosting library making news while everyone's talking about deep learning?


I think it's because of three reasons:

1. Yandex is announcing a new ML library, and that makes it news because Yandex is well established

2. Gradient boosting is quite effective and popular

3. Not everything has to be about deep learning


Adding to 1.: This came out of MatrixNet research, which was state-of-the-art (and well-guarded) for years.


Deep learning works best on images, video, text, audio and reinforcement learning. Most common applications are outside these domains, based on diverse types of business/scientific data. But, as they said, you can compose CatBoost with TF and Keras neural nets. It connects like a NN module.


Do you have any examples of this or can point me in the right direction of how to implement it? Thanks!
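One common composition pattern (a hedged sketch, not anything from CatBoost's docs): use a trained neural net's penultimate-layer activations as extra features for the boosted trees. The `nn_embed` function below is a hypothetical stand-in for a real Keras/TF forward pass, and sklearn's GradientBoostingClassifier stands in for CatBoost.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 10))                  # raw tabular features
y = (X_raw[:, 0] + X_raw[:, 1] > 0).astype(int)

def nn_embed(X):
    # Stand-in for extracting a trained net's penultimate-layer
    # activations: here just a fixed random projection plus a ReLU.
    W = np.random.default_rng(1).normal(size=(X.shape[1], 4))
    return np.maximum(X @ W, 0.0)

# Concatenate raw features with the (here: simulated) embedding and
# let the boosted trees model the combined representation.
X_aug = np.hstack([X_raw, nn_embed(X_raw)])
clf = GradientBoostingClassifier(random_state=0).fit(X_aug, y)
print("train accuracy:", clf.score(X_aug, y))
```

With a real net you'd replace `nn_embed` with a `Model` whose output is the layer before the softmax; the boosting side doesn't care where the columns came from.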


Afaik gradient boosting wins lots of Kaggle competitions when they are on small amounts of data. It is possible that gradient boosting is the king of small data, but I'm not an expert on the subject


Catboost is implemented in C. Does anyone know how stuff like this is run at scale over multiple machines? For example, if I want to run a distributed computation in Spark, I use primitives that are distributed in nature.

But how does someone use Catboost across a cluster of 10 machines? All the help documents assume a single machine. Is there any kind of infra-framework that will distribute the jobs across all the machines running Catboost?


Last time I did something in C requiring spreading work across a cluster of machines it was with MPI(CH). Docs available at mpich.org. This was Monte Carlo simulation for hyper-dimensional (~8 IIRC) asset allocation and thus the simulations were easily divisible—recombining was just simple arithmetic—and the network I/O was minimal.

Interesting tidbit (to me anyway): This was 2006, before you could use something like AWS for the purpose and we were trying to keep costs to a minimum. (IBM had their "public one" grid but it was unusable.) The Core 2 Duo processor had just come out so I hired an intern from Cal—now a PhD and brilliant engineer at Netflix—to figure out the optimal overclocking rig and we built 32 chassis consisting of just motherboard, RAM, NIC, custom cooling/heatsink, and power supply. The problem was then how to deploy these in a colo. At the time there were some low end providers at 200 Paul willing to get creative with a cabinet so I found a machinist (metal worker?) able to cut some custom aluminum shelving on top of which we could stack the ATX cases. Rigged the boxes up to network boot off one of the nodes, compiled our application with Intel's C++ compiler to take advantage of the SIMD/SSE3 instruction set, and away we went running billions of simulations on a startup budget.
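The divide-and-recombine shape described above (independent simulation slices per node, recombined with simple arithmetic) is the classic MPI scatter/compute/reduce pattern. A toy sketch, using a Python thread pool in place of MPI ranks purely to show the structure — the pi-estimation workload is a hypothetical stand-in for the asset-allocation simulations:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def simulate(seed, n):
    # One worker's independent slice of the Monte Carlo run:
    # count random points in the unit square that land inside
    # the quarter circle (classic pi estimation).
    rng = random.Random(seed)
    return sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(n))

workers, n_per_worker = 4, 50_000
with ThreadPoolExecutor(max_workers=workers) as pool:
    hits = list(pool.map(lambda seed: simulate(seed, n_per_worker),
                         range(workers)))

# Recombining is just simple arithmetic, as in the parent's setup.
pi_est = 4.0 * sum(hits) / (workers * n_per_worker)
print(f"pi estimate: {pi_est:.3f}")
```

In the real MPI version, each rank would run its slice on a separate machine and the sum would happen via `MPI_Reduce`; the minimal network I/O is exactly why this workload scaled so cheaply.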


If you were using a distributed workload (not clear from your comment if you were), I'm curious whether you tried out multiple NICs per machine. By 2006 there were already multiple queues per machine, but prior to that, multiple NICs were sometimes helpful.


This is incredible. Could you talk about the OS and deployment?


Did it work out?


Hey!

CatBoost team here.

CatBoost is currently single-host; the distributed cluster-training version will be open-sourced later.


Catboost is implemented in C++. Like XGBoost, LightGBM, or TensorFlow, by the way.


The github page shows a large % of C LOC but that's mostly all in 3rd-party libraries like openssl. CatBoost itself seems to be C++.


There might not be a need to do so at the moment. Otherwise, they might open that up at a later time.


I think someone at Yandex's scale would be using a HUUGE cluster. I wonder if there is already something out there to do it.


You could just run 10 versions of the same classifier and put a load balancer in front of it.


The problem is distributed training, not inference.


“Reduced overfitting” which Yandex says helps you get better results in a training program.

So that's awesome...

The benchmarks at the bottom of https://catboost.yandex/ are somewhat useful though. I do remember that when LightGBM came out, its benchmarks vs XGB were... very selective.


I love both lightgbm and xgb; together they make a good ensemble. It should be interesting to see how this one turns out.


Interesting idea to ensemble them. The main difference is the way the trees are constructed, right? So ensembling them is kind of like saying "I don't know which one is better, so screw it, I'll do both and average the results", right?


Hey! CatBoost team here.

Yes, stacking different gradient boosting algorithms works well in practice. One example is a just-finished Kaggle-like competition, http://mlbootcamp.ru/round/12/sandbox/, where mpershin stacked CatBoost with LGBM and took 7th place.

The kernel for this solution can be found in one of our tutorials: https://github.com/catboost/catboost/blob/master/catboost/tu...


If I'm doing regression, I use weights: e.g. 0.7 xgb, 0.3 lgb. I think it works because they each have their biases, but the biases are not completely correlated. This is one of the leading strategies in the Kaggle competition for Zillow logerror; check out this kernel: https://www.kaggle.com/aharless/xgb-w-o-outliers-lgb-with-ou...

When doing classification, I use several classifiers and get their scores rather than predictions, via 20-fold cross-validation. I then use a random forest on top of the scores to get the final prediction; the random forest can figure out when to trust one classifier over another.
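A hedged sketch of both patterns described above, with sklearn models standing in for xgb/lgb (and 5 folds instead of 20 to keep it quick): out-of-fold probability scores from two base classifiers, a fixed-weight blend, and a random forest stacked on top.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=400, n_features=12, random_state=0)

# Out-of-fold probability scores from two base classifiers, so the
# meta-model never sees scores the base models produced on their
# own training folds.
base_models = [GradientBoostingClassifier(random_state=0),
               LogisticRegression(max_iter=1000)]
scores = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Pattern 1: a fixed-weight blend (the 0.7 / 0.3 style average).
blend = 0.7 * scores[:, 0] + 0.3 * scores[:, 1]

# Pattern 2: a random forest on the scores learns when to trust
# which base model.
meta = RandomForestClassifier(random_state=0).fit(scores, y)
print("meta-model accuracy on OOF scores:", meta.score(scores, y))
```

The blend is the "I'll do both and average" answer from upthread; the stacked forest is the stronger version because it can weight the base models differently in different regions of score space.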


Lesser predictors can actually help in ensembles, so long as their errors aren't highly correlated with the better predictors.


Thanks OP, I really like this part of the whole article - 'It also uses an API interface that lets you use CatBoost from the command line or via API for Python or R'




