Hacker News | pbh's comments

This is a great description of R usage (import, clean, fit models), but I think a slightly erroneous explanation of R history.

John Chambers created "S" at Bell Labs. S was a programming language designed for interactive statistical analysis. Much like gcc and icc are implementations of C compilers, R and S-PLUS are implementations of S. S-PLUS was/is the primary proprietary implementation of the S language, whereas R is the primary free one (also, sometimes called GNU S). (SAS and SPSS are completely different languages/systems as far as I know.) I think that statisticians at some point made a conscious effort to publish their work in R, rather than S-PLUS (or any other statistical system like SAS) because it was more widely available. That in turn led R to be a viable competitor to S-PLUS (and other systems) because it had vast amounts of recent statistical libraries, often implemented by the people who developed the techniques. That said, SAS and SPSS seem to pretty much still have social science students locked up --- the market for R is probably statisticians who are also excellent functional programmers.

This history is in really marked contrast to MATLAB and its corresponding free version Octave, where computer scientists pretty much refuse to use Octave, despite MATLAB's massive price tag to pretty much everyone involved (even with 90% discounts).

(That said, if anyone lived through the change over from S-PLUS to R, I'd love to hear if this history is wrong!)


I think that the Bioconductor project (http://www.bioconductor.org/) has also been a big part of R adoption, as it has produced a core of well and consistently documented libraries for importing, managing, and analyzing biological data that is not really matched anywhere else. R co-creator Robert Gentleman was/is a big driving force in that, so of course it is in R.


> This history is in really marked contrast to MATLAB and its corresponding free version Octave, where computer scientists pretty much refuse to use Octave, despite MATLAB's massive price tag to pretty much everyone involved (even with 90% discounts).

Do you have any insights as to why Octave does not have higher adoption?


I've always been a bit sad about it, but everyone involved is probably a rational actor.

Computer science professors probably view a couple hundred dollars per MATLAB network license as a tiny expense on a $1m+ grant (whereas statistics grants are apparently often smaller), and they may be charged for it in departmental overhead anyway (removing the incentive to cut costs).

The type of people who could contribute either core code or toolbox type code to Octave often have an extremely rare quantitative skill set that is worth hundreds of dollars an hour, so there is a huge incentive to get paid to do similar work instead. There probably isn't much community recognition (to balance things out) for implementing a library in Octave. (Though, in the R world there are certain recognizable superstars like Hadley Wickham.)

Graduate students (who might work for cheap on these problems) are probably more focused on publications and networking.

As long as all of this is the case, Octave will always kind of just be a worse MATLAB that happens to be open source, so a new user choosing between them will probably just choose MATLAB by default.


Octave core developer here.

It is true that we have a lot of trouble attracting new contributors. Most of our users keep demanding features that seem to us unimportant but to them mean the world: a GUI ("whatever for?", we think. "Use a real text editor!"), a JIT compiler ("here's a nickel, get better vectorised code, kid"), perfect Matlab compatibility (a never-ending chase, not very fun, in which we must always be behind).

Of these, we're finally slowly listening to our users. Two of our current three GSoC students are working on a GUI and a JIT compiler respectively. I have wild hope that this will attract more users and developers. I'm also currently hosting an Octave conference in a few days towards this goal:

    http://www.octave.org/wiki/index.php?title=OctConf_2012
By the way, Octave is GNU (so is R, supposedly), so we're not really open source; we're free. ;-)

I don't know why Octave hasn't been able to replicate R's success. I don't know if R's not really being GNU except in name has something to do with it (R developers routinely try to find new ways to get around the GPL and link R to non-free code, and I don't doubt that this linking to Oracle's database is another example of that). I don't know if it's just that a lot of people with big money care more about statistics and R than they care about Octave (banks and brokers for R, electrical and civil engineers for Octave). Maybe our code sucks more than R's.

Do you have any suggestions for how to make Octave the standard instead of Matlab? The recent gratis classes that emerged from Stanford gave Octave a lot of publicity. Do you have any suggestions for what else we might do?


You're probably in a much better position to evaluate than I am! My guess is that more Octave-based classes would translate into more users and more code written for Octave down the line, but I'm not sure how to encourage more use of Octave in the classroom in the first place.


Matlab is truly the RAD tool of choice for numerical programming and has a solid grip in universities combined with enterprise-level support.

I do not think Octave ever tried to replicate its workflow (which is not general programmer centric at all) and domain-specific documentation but merely focused on the underlying language compatibility, which is really the least important part of Matlab.

On top of that, I seem to recall, Matlab was one of the first of the specialist programming toolsets to offer a very competitive "Student Edition". This was a godsend for schools and universities before the Internet took off.

In short, Octave was too little too late, and Numpy/Scipy, while catching up fast, has supporting tools spread all over the place as well as being geared more to general programmers who want access to convenient numerics rather than numerical modellers/engineers wanting a RAD tool.

Numpy/Scipy etc. may well overtake Matlab eventually, but that will be purely a function of its infrastructure, not something as mundane as the niceness of the language (which admittedly was its initial driving force). At least in this respect, it has done a lot better than Octave in much less time.


Actually, there are tons of people who want to run Matlab code freely. The code is already written. They need to run it in clusters, or they need to run it at home.

This is why we are doing Octave.


MATLAB is commonly used in introductory CS courses for engineering majors, since it's useful for a lot of general tasks and is pretty forgiving.

MATLAB has a nice GUI and IDE. It also generates good graphs with minimal effort.

Octave has a command-line REPL.


ezl does not seem to have gone all the way down the rabbit hole with the oDesk tracking link. (I work at oDesk, but not on this, so I was really curious.) After about 3 HTTP redirects, the www.dpbolvw.net link seems to turn into some sort of affiliate link. So my guess is that a (possibly errant, but presumably very data-driven) affiliate was the one sending him e-mail about oDesk, rather than oDesk itself.

That seems to raise a broader question, which is: to what extent should companies be blamed for (and thus expected to control) the actions of the people in their affiliate programs? Would you be unhappy about a blog purporting to be by a cute ice-cream-eating girl filled with recommended books with Amazon affiliate links? What are the standards in this area, exactly? (On reload, see dabent's comment below as well.)


Actually, I thought his comment was quite delicate.

I used to think that DB comparisons were bad because of vested interests, like Oracle refusing to allow anyone to benchmark them. But I now think that Antirez's view is much closer to the truth: poorly informed people can't set up benchmarks correctly because they don't understand the underlying DB model and possible optimizations, and well informed people can't set up benchmarks because they can't think of a fair benchmark that doesn't explicitly handicap one model or set of optimizations versus the other.

(Of course, in this case, the comparison is pretty bad. Other commenters have noted that an inverted index is more sensible here and that Postgres' text search support should have been used. But the big issue to me is that the graphs are incredibly misleading, with totally different units on the y axis, so that a 0.004 second win for Mongo looks like as big a win as a 3 second win for PG.)


I think the most succinct response to this question I've found comes from the Starcraft II casting archon, "Tastosis" during one of those filler periods where nothing is happening in the game:

Tasteless: I'm a philosophy major, man.

Artosis: Hmmmm? That's good.

Tasteless: You major in that, people are like: "What are you going to do with a philosophy major?" ::outraged mumble:: I'm like: "I dunno, properly navigate the world with my mind? Geez!"


Is there a way to fix the recursion depth problem with this solution?

To me, the memoized version is clearer and should be just as efficient as the iterative version. However, this version of fib_mem dies before fib_mem(1000)!

Apparently, you can increase the recursion depth limit using the sys module. Is that the right thing to do? And, if you do it, are there guidelines for doing so without causing memory problems?
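(For anyone following along, here is a minimal sketch of the failure mode being described. The article's actual fib_mem isn't shown here, so this is an assumed reconstruction: a plain recursive Fibonacci with a cache, written for modern Python 3 where the overflow raises RecursionError.)

```python
import sys
from functools import lru_cache

sys.setrecursionlimit(1000)  # CPython's default limit, made explicit

# Assumed stand-in for the article's fib_mem: memoized, but still recursive.
@lru_cache(maxsize=None)
def fib_mem(n):
    if n < 2:
        return n
    return fib_mem(n - 1) + fib_mem(n - 2)

# Memoization removes redundant *calls*, but the first descent for
# fib_mem(1000) still nests ~1000 stack frames, exceeding the limit.
try:
    fib_mem(1000)
except RecursionError:
    print("hit the recursion limit")
```

So the memoized version is asymptotically fine; it is the stack depth of the first, uncached descent that kills it.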


Yes, use sys.setrecursionlimit(n) to change the max recursion depth. Max supported stack depth of course varies by platform.

In many cases, a better solution is to implement tail recursion, which can be done quite easily. See http://paulbutler.org/archives/tail-recursion-in-python/ .


This tutorial was a few months ago, though it was quite good. The slides are here if anyone is curious:

http://www.slideshare.net/ipeirotis/managing-crowdsourced-hu...


I know a lot of Daphne's former and current students and she seems to be a great advisor who genuinely cares about producing both excellent research and top quality research talent.

Further, in the department, I think she is one of the people who cares most about teaching. She runs the undergraduate summer research program. She re-does her class on PGMs substantially almost every time she teaches it to try to make it better. (Though such a high rate of change may or may not be a good idea.) Daphne is almost certainly one of the key people behind the *-class effort at Stanford CS.

For those with a negative impression of Daphne, my guess is just that they are misinterpreting her directness. If she thinks you're wrong, or you're doing the wrong research, you'll know about it.

(Also, Ullman is one of the nicest people in the department in person, which is crazy given that he wrote the standard texts in compilers, databases, and arguably automata. He's emeritus these days though.)


Thanks for the insights! I appreciate it.


Is this a CVPR reference, or just an amazingly accurate comment?


Yes, I think computer vision is a common term for the area (at least, among academics and researchers), and CV is a common abbreviation.

Examples include Intel's OpenCV and the CVPR conference.

(I actually thought that it was weird that they called themselves "machine vision" rather than computer vision in the title.)


OpenCV is OSS maintained by Willow Garage. Intel provides the Intel Performance Primitives, which implement everything from basic filtering up through Haar features, segmentation, and optical flow.

Machine vision usually refers to the domain of industrial inspection where the scene contents are highly controlled, and computer vision usually refers to harder problems where someone is walking around a real environment with bad lighting and awkward perspectives.


OpenCV originated at Intel and was based on Intel's IPL.

Some researchers prefer the term 'machine vision' as more general, referring to the entire field of artificial vision problems, not necessarily involving a 'computer' in common terms (i.e. FPGA-based or analog electronics).


I agree with your point, but with an important caveat.

You are right that application developers should not be ignorant of databases, and DBAs should not be ignorant of what application developers are doing. All too often, this is the case.

However, the point when people came to think of application and database development as two separate things was the origin of databases themselves, and arguably one of the most important historical points in computer science itself. The notion that data has a life outside of any given application, and that it deserves itself to be managed, was and is incredibly powerful.

