Good: a memory vendor finally getting serious about "processor in memory" silicon.
Bad: It doesn't mention any of the research that has already been done on these sorts of things. I mean, it makes a great story that you invented the whole thing from scratch, but nothing screams "this is a slick marketing doc" more than that kind of pretense. Or maybe they really did ignore the literature and made a bunch of mistakes as a result?
Edit: wow, the scientific paper does the same thing.
I hope the refs reject it. It's unacceptable to have a paper that reads like it was written by a marketing department. There's nothing new or original about FPGAs, yet they make it sound like they invented the concept. They may have improved on it, but they don't talk about the specifics of that.
I had a professor mention a concept like this in a microprocessor class. The basic idea comes from the efficient organization of CPU caches. Ideally you would use an LRU queue over the whole cache, instead of a 4-way set-associative cache where each memory location is tied to a specific spot in one of those 4 banks. The issue with a full LRU queue is that you would need to search every tag location to determine whether any of them stores the right address. That is a boatload of transistors to do within the latencies demanded of a CPU cache, so we settle for caches with only 4 [1] locations to search for each memory address. It looks like Micron figured out a way to place some logic close to the memory arrays to search the entire memory for a matching pattern. It would be interesting to know how they did it.
[1] The number of locations to search depends on the associativity of the cache; I used a 4-way set-associative cache in my example.
With a regular 4-way set-associative cache you search 4 tags to find the right block. With a fully associative cache you search all the tags (where a tag is an entire block number), and that is typically implemented with a CAM, where "some logic [is placed] close to memory chips to search entire memory for matching pattern".
Is that what you're describing? Are you saying Micron invented some form of CAM?
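For anyone following along, here is a minimal Python sketch (made up for illustration, nothing from Micron) of the difference being described: a set-associative lookup only compares the handful of tags in one set, while a CAM-style fully associative lookup conceptually compares every stored tag at once. The cache geometry and names are hypothetical.

    # Hypothetical cache geometry, purely for illustration.
    NUM_SETS = 256
    WAYS = 4

    def set_associative_lookup(cache, block_number):
        """4-way set-associative: only the 4 tags in one set are checked."""
        index = block_number % NUM_SETS
        return any(cache[index][way] == block_number for way in range(WAYS))

    def fully_associative_lookup(cam_tags, block_number):
        """Fully associative (CAM-like): every stored tag is compared.
        A real CAM does all these comparisons in parallel in hardware;
        in software we can only model it as a search over all entries."""
        return any(tag == block_number for tag in cam_tags)

In hardware, the second version is the expensive one: the "boatload of transistors" mentioned above is the comparator per tag that a CAM needs in order to do all those comparisons at once.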
I don't have a better one handy, but this definitely is a Wikipedia entry where "This article needs additional citations for verification." is more than true.
For example: "In the 1980s, a tiny CPU that executed FORTH was fabricated into a DRAM chip to improve PUSH and POP. FORTH is a Stack-oriented programming language and this improved its efficiency."
Aside: I Googled for this (mythical?) CPU and did not find it, but I did find the next hot thing in agile programming: "Initially OKAD was implemented as the only application program in OK and was an experiment in sourceless programming. The structure of the programs looks like Forth code, but there was no Forth compiler or interpreter except Chuck himself. He entered the kernel of OK using a debugger and built the tools he need to build the rest of OK and OKAD" (http://www.ultratechnology.com/okad.htm)
Somebody should start a project where your system runs compiled Forth code, and the only way to back up the system is through a command that retrieves a set of functions from another running system ("please replace the 'BEEP' function with the one on the system at this IP address").
Harris Semiconductor attempted to introduce the "RTX" microcontroller in the late 1980s. It featured a true Forth-based instruction set. RTX never gained traction, due (IMO) to the lack of developer tools; the group was discontinued in ~1989.
If you mean http://en.wikipedia.org/wiki/RTX2010, that's nothing like what this describes. It's just a CPU with two hardware stacks (one more than the 6502 had).
There are plenty of alternatives for Forth hardware, especially if you are willing to use an FPGA. What this describes is more like Moore's GreenArray hardware, but with way more, less powerful chips (if it were DRAM-like, a major difference would be that you would be able to address all those CPUs from the outside, not just a few, like in the GreenArray chips).
> Good: a memory vendor finally getting serious about "processor in memory" silicon.
Let's face it, Moore's Law is dead. Now we must extract the sub-exponential gains that come from architecture optimizations, and non-uniform memory can lead to huge ones.
It's time to finally take on the challenge of programming such exotic kinds of computers. The good news is that our computers are now good enough to help us.
The particular form of Moore's Law that used to give regular doublings of clock speed is the one that has been dead for years.
Although not the original formulation (which was number of transistors), the clock speed one is very significant because it means that sequential computations are no longer getting faster exponentially.
The forms of Moore's Law that remain are still being used advantageously by Intel, e.g. for more cores amongst lots of other things, but it's not helping as much, since not all computations can be parallelized.
There's only one formulation of Moore's Law - the one from the 1965 paper. Clock speed has nothing to do with it.
Clock speed is generally not a good indicator of processor performance. I'm guessing a single 2GHz Haswell core is faster than a 4GHz P4, due to a number of significant architectural improvements, which were made possible by the larger number of transistors available.
Keep in mind that processor performance per watt continues to improve significantly every two years, with no sign of slowing down in the near future.
> Clock speed is generally not a good indicator of a processor performance.
I didn't say it was. But if all else is equal, the exact same architecture running at a doubled clock frequency will run exactly twice as fast as the original.
You are arguing that not all else is usually equal, and although true, it's a different topic.
The point is that the "free" doubling of speed that we used to get stopped some years back.
Computation per watt is another important subject -- but it is a different topic (and for that matter, is yet another formulation of Moore's Law).
There are a thousand issues that are important to varying degrees when discussing architecture and performance, but there is no point acting like we don't know what people mean when they say "Moore's Law has failed". What they mean is quite clear.
I actually looked at the section of the Wikipedia article you linked to. It lists about a dozen rules or laws similar to Moore's Law. In every case, it's made clear the rule is not Moore's Law. In fact, most of those rules have their own distinct names (Dennard scaling, Butters' Law, Wirth's Law, etc.). Read it carefully and you will see that throughout the article, it's maintained that Moore's Law has exactly one, well-defined meaning.
> there is no point acting like we don't know what people mean when they say "Moore's Law has failed". What they mean is quite clear.
When I read "Moore's Law is dead", I thought the OP meant we can't shrink transistors any more. You can probably agree it is not the same as saying "clock speeds are not improving anymore".
If someone says "Moore's Law" I will assume they mean "number of transistors on a chip doubles every 2 years or so". If they mean something else, such as "clock speed doubles every 2 years", then they use the term incorrectly. Moore's Law is alive and well, and if you talk to people who actually work on extending it, you should use the correct terminology.
I was talking about the original form, as stated by Moore: the number of transistors on the most cost-effective chip doubling every X months (with X=18 as most recently stated; it used to be smaller).
It's dead.
If you go back one and a half cycles, you'll see that it does not add up to a doubling for any big manufacturer. But what's most troubling is that the rate of increase is going down.
Yep, that has happened in the past, and manufacturers recovered (although not for that long). But this time it's different. We are very near the limits of MOSFETs created on silicon by lithography, and we are so invested in this technology that I doubt we'll be able to transition quickly into anything else.
It is definitely on its way out. We are having more and more trouble scaling to volume production at each new process node. 14nm is behind schedule, 9nm is going to be harder, and no one is really sure how to get down to 5nm. Beyond 5nm, parts of the transistor would need to be smaller than an atom. We are hitting a wall; predictions are that by 2020 the cost of developing new manufacturing techniques will not justify the return. A fundamentally new substrate, or perhaps a new paradigm, will be required to push us forward. It will need to be revolutionary, not evolutionary, in nature, and it is not guaranteed to arrive in time to continue Moore's Law.
People have been saying that for decades. Experts claimed it would be impossible to shrink transistors below 1 micron. When that was done, other experts claimed it's impossible to shrink below 100nm.
Yes, Intel does not know how to build 5nm transistors, which are 3 process generations away from the current state of the art (14nm - 10nm - 7nm - 5nm). It's always been like that - for example, when Intel released 90nm technology, they didn't know how to do 22nm.
Yes, a new paradigm or substrate might be required to get there, so what? There's no shortage of new ideas, or new materials. Graphene is looking pretty good. Can't shrink it below atomic dimensions? Put another layer on top!
The only thing that can kill Moore's Law is lack of demand. But as long as people want faster, more efficient computers, they will be getting faster and more efficient. And I don't see the demand decreasing any time soon.
The slides are misleading. "Memory wall" refers to the fact that CPU speed has been doubling every 24 months from the 1960s till the early 2000s (annual improvement rate is 50%), while RAM speed doubles every 10 years (annual improvement rate is 10%). This "memory wall" is there regardless of the processor architecture that you use.
Math nitpick: It's a bit more complicated than that with exponential growth. You can't divide up 100% over 10 years and say it was 10% each year; instead you have to solve 2 = 1.1^x and find that it doubles in about 7.27 years.
If something grows 10% a year for 10 years, it actually ends up at 2.59x its original size.
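A quick sanity check of that arithmetic in Python (a throwaway snippet, not from the slides):

    import math

    annual_rate = 0.10                                  # 10% improvement per year
    print(math.log(2) / math.log(1 + annual_rate))      # ~7.27 years to double
    print((1 + annual_rate) ** 10)                      # ~2.59x after 10 years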
I think there's more here than most comments reflect. Yes, this is being done in many implementations today, but not nearly as efficiently. 4W for 1Gb of DFA stream analysis? That's pretty crazy, considering it can put full-on content inspection into a SOHO device that can end up in consumers' networks. DFA-based matching is the shift away from file-based scanning in network security, and it offers parallelism efficiencies far beyond most of the UTM-type platforms out there today. Only a few companies use stream-based platforms (Palo Alto Networks is one), and that is why such a platform can do much more content scanning in one pass versus the legacy devices.
But this is killer for SOHO networking, IMHO. It puts a ton of power into the next wave of network security.
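For readers who haven't run into DFA-based stream scanning, here is a tiny, made-up Python sketch of the idea: the scanner walks a state machine byte-by-byte as traffic streams past, with no file reassembly. The signature and transition logic are purely illustrative and have nothing to do with Micron's actual design.

    # Tiny illustrative DFA that scans a byte stream for the signature b"EVIL".
    # State n means "the last n bytes matched the first n bytes of the signature".
    SIGNATURE = b"EVIL"

    def next_state(state, byte):
        # Advance on a match; otherwise fall back. (Real scanners precompute
        # a full transition table covering many signatures at once.)
        if byte == SIGNATURE[state]:
            return state + 1
        return 1 if byte == SIGNATURE[0] else 0

    def scan(stream):
        state = 0
        for byte in stream:
            state = next_state(state, byte)
            if state == len(SIGNATURE):
                return True       # signature seen somewhere in the stream
        return False

    print(scan(b"...some traffic...EVIL...more traffic..."))   # True

The appeal of hardware like this is that the per-byte transition is essentially a memory lookup, so many such machines can be run against the same stream in parallel.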
Micron demonstrated early versions of this at Microprocessor Forum in 2004 or 2005. So it isn't exactly "new" but being able to actually get one is.
Pattern algorithms are pretty difficult to synthesize, however; see the Conway glider search as an example. One of the challenges is that the set of 'instructions' and the solution possibilities are quite tightly interlinked. I hope that I can get my hands on one at some point; some of the old texture research from the Image Processing Institute would really fly on this thing.
Oh ... I completely missed that interpretation (maybe I'm too close to these techniques). In any case, it's probably better to think of the system as a giant state machine in which the data are states and the operations are transitions. Its processing and memory are co-mingled. You'll notice that it's got a very wide data width (not a new idea) that speeds it up tremendously for parallelizable tasks, but it actually slows down the "clock", since the memory itself is still the slowest part.
(I also upvoted you ... you shouldn't get penalized for an honest question)
Upvoted you. We need more insights like this; they're worth more than the document itself. The combination of state machines, PIM (processing in memory), and parallelization is the core of it, and it looks like Micron has an implementation now. That's why they call it an Automata Processor (AP).
Given the later exchange between us ... I agree that it was rude and I'd like to apologize. I'm sorry.
I don't generally make excuses when I apologize because I think it diminishes the value of the apology itself. In this case I think it's fair to admit I thought he was trolling (and apparently so did his down-voters).
In any case, hopefully I'll be a better man tomorrow ... iron sharpens iron.
It was 5% trollish in the delivery, for I could have been a little more explicit rather than just throwing my metaphor out into the wind. Happy that 1) I wasn't too far off and 2) the situation is resolved.
ps: the explanation for my quick one-liner is that I often see very limited protocols (say FTP, or SMTP) that require many round trips over the wire, where it would now make sense (I understand that back in the day servers were anemic) to distribute the computation a little on both sides. Not unlike memory, IMHO.