It's the future, representing a convergence of the Field Programmable Gate Array (FPGA) and the microprocessor.
Gate arrays are vast arrays of logic gates, which can be wired together in almost arbitrary patterns by a sea of "fuses", typically controlled by state stored in on-board SRAM. They are real-time and blindingly fast due to their massive parallelism, achieving supercomputer-type speeds when applied to the right type of problem and programmed well. They are more difficult to program than a microprocessor. One way of looking at an FPGA is as an array of tens of millions of very simple computing engines.
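To make the "simple computing engine" view concrete: each logic cell is essentially a small lookup table (LUT) whose contents the configuration bits select. A toy sketch in Python (purely illustrative, nothing to do with any real toolchain):

```python
# A k-input LUT is just a 2^k-entry truth table: any boolean
# function of k inputs can be "programmed" by filling the table.
def make_lut(truth_table):
    """truth_table: list of 2^k output bits, indexed by the packed inputs."""
    def lut(*inputs):
        index = 0
        for bit in inputs:          # pack the input bits into a table index
            index = (index << 1) | bit
        return truth_table[index]
    return lut

# "Program" one cell as a 2-input XOR, another as a 2-input AND.
xor2 = make_lut([0, 1, 1, 0])
and2 = make_lut([0, 0, 0, 1])

# "Routing": wire the two cells together to build a half adder.
def half_adder(a, b):
    return xor2(a, b), and2(a, b)   # (sum, carry)
```

The point is that the same fabric of cells computes XOR, AND, or anything else, depending only on the bits loaded into each table.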
Over the years, the number of transistors on an FPGA has been rocketing up. Generally these transistors have been put to use by providing more and more simple logic blocks. We are now at the point where we have almost more gates than we know what to do with, and the chip is dominated by interconnect. This has driven a move towards including a limited number of elaborate hard-wired blocks, such as CPUs and multipliers, in addition to the array of logic.
The logical evolution is to stop providing more blocks and instead make each block more complex as transistor counts go up. Eventually we will see arrays of tens of millions of microprocessors, rather than tens of millions of logic blocks. There will be no distinction between a multicore CPU and an FPGA.
It's worth noting that the first Xilinx FPGAs, thirty years ago, provided arrays of around 144 logic blocks, similar to the processor count in this chip. Extrapolate 30 years and we will have an array of 10 million microprocessors.
I shall try, but this stuff gets really hairy, really really fast.
Von Neumann architecture is what almost all computers use today: you have (very roughly) an ALU (arithmetic logic unit) hooked up to a memory bank which stores both program data and the instructions the program consists of.
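A toy sketch of that (instruction names invented here, nothing like a real ISA):

```python
# Toy von Neumann machine: one memory array holds both the program
# and the data it operates on.
def run(memory):
    acc, pc = 0, 0
    while True:
        op, arg = memory[pc]        # fetch from the same memory...
        pc += 1
        if op == "LOAD":            # ...that the data lives in
            acc = memory[arg]
        elif op == "ADD":
            acc += memory[arg]
        elif op == "STORE":
            memory[arg] = acc
        elif op == "HALT":
            return memory

# Program in cells 0-3, data in cells 4-6: compute mem[4] + mem[5].
final = run([("LOAD", 4), ("ADD", 5), ("STORE", 6), ("HALT", 0), 2, 3, 0])
# final[6] is now 5
```

The single shared memory is exactly what becomes the bottleneck once you add more cores, as below.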
Now you can add a couple of cores to that, but you pretty soon start to run into problems -- threads which try to access the same data, race conditions, etc.
But the biggest problem is that under the Von Neumann architecture all memory is shared, so any thread can access any other thread's memory. This puts rather drastic limits on how much benefit you can get from new cores.
You also run into issues like the limited speed at which the main memory banks can be accessed. Caches can compensate for this to some degree, but they have their own problems.
But the fundamental problem with them is that they were from and of an era where the clock speed kept increasing and increasing.
Today we have a situation where transistors get smaller and smaller. But if you use these new transistors to build a traditional CPU, all that gets you is a really small chip.
What we need is an architecture inspired by something else. Personally I am kinda hoping it will be some form of message sending -- you run a lot of small (green) threads which each have their own memory as well as the ability to send and receive packages of information to/from the other cores on the CPU.
You can have access to a (comparatively large but slower) shared memory bank too (like RAM today).
I like it because it works well with how you would design a cluster of computers (where you cannot afford the illusion of shared memory), it matches how computation is organized under the actor model (which I prefer to threads), and it would be possible to implement without that many changes to the CPU.
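A minimal sketch of that message-sending style, using Python threads and queues as stand-in "cores" (all names invented for illustration):

```python
import queue
import threading

# Each "core" is a thread with a private mailbox. Cores share
# nothing except the ability to send each other messages.
class Core:
    def __init__(self):
        self.mailbox = queue.Queue()   # this core's private memory
        self.result = None

    def send(self, msg):
        self.mailbox.put(msg)

    def run(self, handler):
        def loop():
            while True:
                msg = self.mailbox.get()
                if msg is None:        # sentinel: stop the core
                    return
                self.result = handler(msg, self.result)
        t = threading.Thread(target=loop)
        t.start()
        return t

# One core sums whatever the others send it; no locks needed,
# because only this core ever touches its own state.
adder = Core()
t = adder.run(lambda msg, acc: (acc or 0) + msg)
for n in (1, 2, 3):
    adder.send(n)
adder.send(None)
t.join()
# adder.result is now 6
```

Since a core's state is only ever touched from inside the core, the shared-memory race conditions mentioned above simply can't occur.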
That wasn't a "try" - that was a success. Thank you!
If I may attempt a paraphrase: CPU caches stop being a bandage for slow access to RAM and become a valuable first class citizen for each core of the CPU when coupled with the actor model.
Well you could do that today if you as a programmer could manually tell the system "please load addr x, y and z into the cache".
But if the cores of the CPU start to communicate via the actor model, then you wouldn't be using the memory close to the cores as a cache but as a storage area for messages that haven't yet been sent/processed, and possibly for thread-local storage.
I hesitate to divide the world into "FPGA guys" and "non-FPGA guys". It's a continuum of computing power; CPU<->DSP<->FPGA, and one moves up and down it as the task varies and technology changes.
If anything defines an "FPGA guy", it's choice of language. Historically FPGAs have been programmed with an HDL, such as VHDL or Verilog, because these languages are concurrent, meaning they can describe parallel systems.
My prediction is that as mainstream multicore CPUs develop we will see the rise of concurrent versions of mainstream languages, which will supplant HDLs. Just as DSPs are now programmed in (almost) vanilla C/C++/..., we will see FPGAs being programmed using the same languages as micros and DSPs. It will then be possible to write code and (almost) seamlessly run it on any platform, depending on how fast it needs to run.
As an FPGA guy, I want to ask this - what software programming languages, models, etc. exist to support the description of fundamentally concurrent processing, whether task-parallel, data-parallel, hybrid, or "other" (whatever that may be) that will allow for the supplanting of HDLs? I know that there's been long-standing efforts to do C-to-HDL but to my knowledge the successes of this approach have been limited to relatively constrained solution spaces.
I guess my feeling is that there's a fundamental difference between the sequential execution inherent in something like C and the ability to describe concurrency that is fundamental to HDLs, and it's going to be a hard bridge to build.
Side question: are "SW guys" satisfied with the tools/languages available for multi-core development? I've always thought things like OpenMP and CUDA were steps in the right direction but still hacks. It seems to me like there's still problems to solve there as well.
For example, I've seen a meaningful subset of Haskell in which the compiler would only accept programs which would provably terminate. The compiler output could then be sent down the FPGA synthesis toolchain.
Side question: are "SW guys" satisfied with the tools/languages available for multi-core development?
Yes and no. The new C++11 memory model, together with the atomics and multithreading library support, is a huge step in the right direction. But it's nothing particularly slick and it still feels clunky. The fancier tools still tend not to be both portable and open source.
I've always thought things like OpenMP and CUDA were steps in the right direction but still hacks.
I'm guilty of not having used OpenMP. I should try it.
CUDA seems well-designed and has a decently well-supported toolchain. But it's a vendor-specific technology. OpenCL seems to be almost as fast and supported by more vendors.
It seems to me like there's still problems to solve there as well.
I remember as a kid hearing my father talk about the computer they were building at the university for research on high-level parallelization. That was the Illiac IV, completed in 1976.
There's nothing inherently special about Verilog/VHDL as concurrent languages--I imagine a very slick-looking Clojure module could be built as well. MyHDL let me reason about the program much more easily than a Verilog equivalent would, particularly around synchronous execution, blocking/non-blocking statements, and integers ( http://www.myhdl.org/doku.php/why ). Also, (mostly) painless simulation and tests!
Though I would tend to agree that a C to Verilog system would be a step back; MyHDL makes heavy use of decorators and generators and other goodies that come from the functional programming world. C doesn't really work well there; you can hack it in but at that point you might as well just use Verilog.
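The generator idea can be shown without pulling in MyHDL itself. Real MyHDL wraps this pattern in decorators like `@always`; the toy version below (plain Python, invented names) just shows why generators map so naturally onto concurrent hardware processes:

```python
# MyHDL-flavoured sketch: each hardware "process" is a generator
# that yields until the next clock edge, so concurrency falls out
# of ordinary language features.
def counter(state):
    while True:
        yield                       # wait for the next clock edge
        state["count"] += 1

def toggler(state):
    while True:
        yield
        state["bit"] ^= state["count"] % 2   # some dependent logic

def simulate(procs, cycles):
    gens = list(procs)
    for g in gens:
        next(g)                     # prime each process to its first yield
    for _ in range(cycles):
        for g in gens:              # one "clock edge": step every process
            next(g)

state = {"count": 0, "bit": 0}
simulate([counter(state), toggler(state)], cycles=4)
# state["count"] is now 4
```

Every process advances exactly once per clock edge, which is the same mental model as an `always @(posedge clk)` block in Verilog.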
Sequential execution isn't necessarily inherent. C++, Smalltalk, VHDL and others share Simula as a common ancestor. Simula had elements of concurrency and event driven simulation in it. One could envisage an object oriented language, such as C++ or Smalltalk being extended to allow objects to execute in parallel and communicate via methods. Maybe every object would be derived from a root object that fundamentally supports concurrency?
I'm not saying that such languages currently exist, but I do think they will come into being, as programming comes to grips with multicore CPUs. It will be essentially HDL synthesis, with the synthesiser/compiler being smart enough to hide all scheduling issues from the user.
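Something like the "active object" pattern gets at the root-object idea; a sketch in Python with invented names, where every method call becomes a message processed on the object's own thread:

```python
import queue
import threading

# A root class that gives every derived object its own thread of
# execution; method calls are delivered as messages to its inbox.
class ConcurrentObject:
    def __init__(self):
        self._inbox = queue.Queue()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        while True:
            method, args = self._inbox.get()
            if method is None:            # sentinel: shut down
                return
            getattr(self, method)(*args)  # run the method on our thread

    def call(self, method, *args):        # asynchronous method invocation
        self._inbox.put((method, args))

    def stop(self):
        self._inbox.put((None, ()))
        self._thread.join()

class Accumulator(ConcurrentObject):
    def __init__(self):
        self.total = 0
        super().__init__()                # start the thread last

    def add(self, n):                     # always runs on this object's thread
        self.total += n

acc = Accumulator()
for n in (1, 2, 3):
    acc.call("add", n)
acc.stop()
# acc.total is now 6
```

Because each object processes one message at a time, its internal state needs no locking, which is the property a synthesiser could exploit.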
As someone with a foot in both camps, I'm not happy that I have to choose early in the design process whether to run my code on a CPU or FPGA.
Pi-calculus / actor model / "Erlang-style" concurrency.
In (synthesizable) verilog, you have tiny state machines communicating via explicit channels (clock+wires+buses), and functions that get turned into gates. A higher level language would give you ideal channels (mapping onto fixed hardware channels or synthesized to HDL). Depending on their complexity, the functions at each state-node would also be transformed to use more general blocks and intermediate states.
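A toy version of that picture in Python (invented names; queues standing in for wires/buses, a loop standing in for the clock):

```python
import queue

# Two tiny state machines that communicate only through explicit
# channels, stepped in lockstep by a shared "clock" loop, roughly
# as synthesizable HDL processes would be.
def producer(out_ch, data):
    for item in data:
        out_ch.put(item)            # drive the channel this cycle
        yield

def doubler(in_ch, out_ch):
    while True:
        if not in_ch.empty():
            out_ch.put(in_ch.get() * 2)
        yield

a_to_b = queue.Queue()              # the "wires" between machines
results = queue.Queue()

machines = [producer(a_to_b, [1, 2, 3]), doubler(a_to_b, results)]
for _ in range(4):                  # four clock cycles
    for m in machines:
        try:
            next(m)
        except StopIteration:       # a finished machine just sits idle
            pass

out = []
while not results.empty():
    out.append(results.get())
# out is now [2, 4, 6]
```

The higher-level language's job would be to let you write the channel as an ideal abstraction and decide later whether it becomes wires, a FIFO, or synthesized HDL.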
Do you have links to any tutorial, intros or such for the list that you enumerated above? I am very much in appreciation of your (and everyone else's) feedback - learning of a number of new options and approaches that are worth further research.
I haven't been in the hardware world for a while, so I don't have a great answer to your question. For concrete embedded systems tools, I'd look to see what springs up around chips like the OP (I think Parallax made a massively parallel chip a while back, too?). Hardware system design is already all about explicitly decomposed parallelism. The problems to be solved are really from the software perspective of not caring about how concurrent resources are automatically allocated, as long as the constraints are met. If you're looking to learn how to think about parallelism in that manner, I would learn to program in a popular high level language such as Erlang.
Side question: are "SW guys" satisfied with the tools/languages available for multi-core development?
I really like Intel's Threading Building Blocks for multicore programming in C++ and I really like Clojure's take on time and concurrency, but overall, I feel language support is very limited and library support doesn't mesh as well with the languages as hoped. I would like to see a practical dataflow-centric language.
It's like Erlang, maybe not as HDL-like as you were thinking. But, I don't see why you couldn't model every little gate as a process (obviously impractical), so it theoretically could fit.
what software programming languages, models, etc. exist to support the description of fundamentally concurrent processing, whether task-parallel, data-parallel, hybrid, or "other" that will allow for the supplanting of HDLs?
Not a SW guy, but I'm pretty sure the options can be boiled down to "pthreads or HDLs"
Sorry for the misinterpretation. If I was an FPGA manufacturer, I'd be trying to buy a company like Green Array and beat the CPU manufacturers to the middle ground. Agree about the current divide, though I think it will weaken and eventually disappear.
Correction: I've been lax in my terminology with logic gates and cells. Current big FPGAs contain around 2 million "logic cells", which typically represent between 1 and 10 logic gates each, depending on device programming.
I've done many write-ups on this if you look under my profile. Basically the idea is that most of the big problems in software engineering can't be solved easily in a serial fashion with current methodologies, and many of the existing parallel solutions are either too expensive or too complicated to advance the state of the art. I wish I had this chip 15 years ago...
Can anyone explain this to a non-hardware geek?