Yandex open sources CatBoost, a gradient boosting ML library (techcrunch.com)
202 points by bobuk on July 18, 2017 | hide | past | favorite | 28 comments


> gradient boosting — the branch of ML that is specifically designed to help “teach” systems when you have a very sparse amount of data, and especially when the data may not all be sensorial (such as audio, text or imagery), but includes transactional or historical data, too.

That's some strange definition.


Weird, yeah. Seems like a roundabout way of trying to preemptively answer the "why not deep learning?" question omnipresent among ML newcomers. The bits identified aren't really wrong: you could argue that gradient boosting's comparative strength is that it works well (often out-of-the-box, with little tuning) on structured data sets, including relatively small data sets. Hence the good performance on Kaggle-type problems, whereas deep learning is ahead in audio/text/image/video data; and hence the lack of gradient boosting being used on ImageNet-type problems.

But these points all belong in some section entitled "why use gradient boosting instead of another ML method?", not in a definition of gradient boosting.
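The "works well out-of-the-box on small structured data" claim above is easy to demonstrate. A minimal sketch, using sklearn's GradientBoostingClassifier as a stand-in for CatBoost, on a small tabular dataset with zero hyperparameter tuning:

```python
# Gradient boosting with all-default hyperparameters on a small
# structured dataset (569 rows) -- no tuning at all.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(f"held-out accuracy with zero tuning: {acc:.3f}")
```

Deep nets on the same 569 rows would typically need architecture and regularization choices before they got competitive, which is the comparative strength being gestured at.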


Seems that deep learning can benefit from gradient boosting too (at least, from a computational perspective).

https://arxiv.org/abs/1706.04964 "Learning Deep ResNet Blocks Sequentially using Boosting Theory"

(As for the lay-man description: I thought boosting performed better out-of-the-box on dense data than on sparse data, because most feature sub-selections for bagging are on zero'd features)


The benchmark scores seem to have been measured against Kaggle datasets, which makes them more reliable. With categorical feature support and less tuning required, CatBoost might be the ML library XGBoost enthusiasts have been looking for. But then again, how is a gradient boosting library making news while everyone's talking about deep learning?


I think it's because of three reasons:

1. Yandex is announcing a new ML library, and that makes it news because Yandex is well established

2. Gradient boosting is quite effective and popular

3. Not everything has to be about deep learning


Adding to 1.: This came out of MatrixNet research, which was state-of-the-art (and well-guarded) for years.


Deep learning works best on images, video, text, audio and reinforcement learning. Most common applications are outside these domains, based on diverse types of business/scientific data. But, as they said, you can compose CatBoost with TF and Keras neural nets. It connects like a NN module.


Do you have any examples of this or can point me in the right direction of how to implement it? Thanks!
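One common composition pattern (a hedged sketch, not anything from CatBoost's docs): use a trained neural net's penultimate-layer activations as extra features for the boosted trees. The `nn_embed` function below is a hypothetical stand-in for a real Keras/TF forward pass, and sklearn's GradientBoostingClassifier stands in for CatBoost.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 10))                  # raw tabular features
y = (X_raw[:, 0] + X_raw[:, 1] > 0).astype(int)

def nn_embed(X):
    # Stand-in for extracting a trained net's penultimate-layer
    # activations: here just a fixed random projection plus a ReLU.
    W = np.random.default_rng(1).normal(size=(X.shape[1], 4))
    return np.maximum(X @ W, 0.0)

# Concatenate raw features with the (here: simulated) embedding and
# let the boosted trees model the combined representation.
X_aug = np.hstack([X_raw, nn_embed(X_raw)])
clf = GradientBoostingClassifier(random_state=0).fit(X_aug, y)
print("train accuracy:", clf.score(X_aug, y))
```

With a real net you'd replace `nn_embed` with a `Model` whose output is the layer before the softmax; the boosting side doesn't care where the columns came from.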


Afaik gradient boosting wins lots of Kaggle competitions when they are on small amounts of data. It is possible that gradient boosting is the king of small data, but I'm not an expert on the subject


Catboost is implemented in C. Does anyone know how stuff like this is run at scale over multiple machines? For example, if I want to run a distributed computation in Spark, I use primitives that are distributed in nature.

But how does someone use Catboost across a cluster of 10 machines? All the help documents assume a single machine. Is there any kind of infra-framework that will distribute the jobs across all the machines running Catboost?


Last time I did something in C requiring spreading work across a cluster of machines it was with MPI(CH). Docs available at mpich.org. This was Monte Carlo simulation for hyper-dimensional (~8 IIRC) asset allocation and thus the simulations were easily divisible—recombining was just simple arithmetic—and the network I/O was minimal.

Interesting tidbit (to me anyway): This was 2006, before you could use something like AWS for the purpose and we were trying to keep costs to a minimum. (IBM had their "public one" grid but it was unusable.) The Core 2 Duo processor had just come out so I hired an intern from Cal—now a PhD and brilliant engineer at Netflix—to figure out the optimal overclocking rig and we built 32 chassis consisting of just motherboard, RAM, NIC, custom cooling/heatsink, and power supply. The problem was then how to deploy these in a colo. At the time there were some low end providers at 200 Paul willing to get creative with a cabinet so I found a machinist (metal worker?) able to cut some custom aluminum shelving on top of which we could stack the ATX cases. Rigged the boxes up to network boot off one of the nodes, compiled our application with Intel's C++ compiler to take advantage of the SIMD/SSE3 instruction set, and away we went running billions of simulations on a startup budget.
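The divide-and-recombine shape described above (independent simulation slices per node, recombined with simple arithmetic) is the classic MPI scatter/compute/reduce pattern. A toy sketch, using a Python thread pool in place of MPI ranks purely to show the structure — the pi-estimation workload is a hypothetical stand-in for the asset-allocation simulations:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def simulate(seed, n):
    # One worker's independent slice of the Monte Carlo run:
    # count random points in the unit square that land inside
    # the quarter circle (classic pi estimation).
    rng = random.Random(seed)
    return sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(n))

workers, n_per_worker = 4, 50_000
with ThreadPoolExecutor(max_workers=workers) as pool:
    hits = list(pool.map(lambda seed: simulate(seed, n_per_worker),
                         range(workers)))

# Recombining is just simple arithmetic, as in the parent's setup.
pi_est = 4.0 * sum(hits) / (workers * n_per_worker)
print(f"pi estimate: {pi_est:.3f}")
```

In the real MPI version, each rank would run its slice on a separate machine and the sum would happen via `MPI_Reduce`; the minimal network I/O is exactly why this workload scaled so cheaply.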


If you were using a distributed workload (not clear from your comment if you were), I'm curious whether you tried out multiple NICs per machine. By 2006 there were already multiple queues per machine, but prior to that, multiple NICs were sometimes helpful.


This is incredible. Could you talk about the OS and deployment?


Did it work out?


Hey!

CatBoost team here.

CatBoost is currently single-host; the distributed cluster-training version will be open-sourced later.


Catboost is implemented in C++. Like XGBoost, LightGBM, or TensorFlow, by the way.


The github page shows a large % of C LOC but that's mostly all in 3rd-party libraries like openssl. CatBoost itself seems to be C++.


There might not be a need to do so at the moment. Otherwise, they might open that up at a later time.


I think someone at Yandex's scale would be using a HUUGE cluster. I wonder if there is already something out there to do it.


You could just run 10 versions of the same classifier and put a load balancer in front of it.


The problem is distributed training, not inference.


“Reduced overfitting” which Yandex says helps you get better results in a training program.

So that's awesome...

The benchmarks at the bottom of https://catboost.yandex/ are somewhat useful though. I do remember that when LightGBM came out, its benchmarks vs XGB were... very selective.


I love both lightgbm and xgb; together they make a good ensemble. It should be interesting to see how this one turns out.


Interesting idea to ensemble them. The main difference is the way the trees are constructed, right? So ensembling them is kind of like saying "I don't know which one is better, so screw it, I'll do both and average the results", right?


Hey! CatBoost team here.

Yes, stacking different gradient boosting algorithms works well in practice. One example is a just-finished Kaggle-like competition, http://mlbootcamp.ru/round/12/sandbox/, where mpershin stacked CatBoost with LGBM and took 7th place.

The kernel for this solution can be found in one of our tutorials: https://github.com/catboost/catboost/blob/master/catboost/tu...


If I'm doing regression, I use weights: e.g. 0.7 xgb, 0.3 lgb. I think it works because they each have their biases, but the biases are not completely correlated. This is one of the leading strategies in the Kaggle competition for Zillow logerror; check out this kernel: https://www.kaggle.com/aharless/xgb-w-o-outliers-lgb-with-ou...

When doing classification, I use several classifiers and get their scores rather than predictions, via 20-fold cross-validation. I then use a random forest on top of the scores to get the final prediction; the random forest can figure out when to trust one classifier over another.
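A hedged sketch of both patterns described above, with sklearn models standing in for xgb/lgb (and 5 folds instead of 20 to keep it quick): out-of-fold probability scores from two base classifiers, a fixed-weight blend, and a random forest stacked on top.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=400, n_features=12, random_state=0)

# Out-of-fold probability scores from two base classifiers, so the
# meta-model never sees scores the base models produced on their
# own training folds.
base_models = [GradientBoostingClassifier(random_state=0),
               LogisticRegression(max_iter=1000)]
scores = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Pattern 1: a fixed-weight blend (the 0.7 / 0.3 style average).
blend = 0.7 * scores[:, 0] + 0.3 * scores[:, 1]

# Pattern 2: a random forest on the scores learns when to trust
# which base model.
meta = RandomForestClassifier(random_state=0).fit(scores, y)
print("meta-model accuracy on OOF scores:", meta.score(scores, y))
```

The blend is the "I'll do both and average" answer from upthread; the stacked forest is the stronger version because it can weight the base models differently in different regions of score space.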


Lesser predictors can actually help in ensembles, so long as their errors aren't highly correlated with the better predictors.


Thanks OP, I really like this part of the whole article - 'It also uses an API interface that lets you use CatBoost from the command line or via API for Python or R'




