Micron Automata Processor – A Brief Introduction (shareholder.com)
111 points by crb002 on Feb 27, 2014 | 50 comments



Good: a memory vendor finally getting serious about "processor in memory" silicon.

Bad: it doesn't mention any of the research that was already done on these sorts of things. I mean, it makes a great story that you invented the whole thing from scratch, but nothing screams "this is a slick marketing doc" more than that sort of pretense. Or maybe they did ignore the literature and made a bunch of mistakes as a result?

Edit: wow, the scientific paper does the same thing.


I hope the referees reject it. It's unacceptable to have a paper that was written by a marketing department. There's nothing new or original about FPGAs; they make it sound like they invented the concept. They may have improved on it, but they didn't talk about the specifics of that.


They already got one accepted to IPDPS 2014, "Finding Motifs in Biological Sequences Using the Micron Automata Processor" http://www.ipdps.org/ipdps2014/2014_%20advance_program.html


I had a professor mention a concept like this in a microprocessor class. Basically the idea comes from the efficient organization of CPU cache: using an LRU queue instead of a 4-way associative cache with each memory location tied to a specific spot in one of those 4 banks. The issue with an LRU queue is that you would need to search all memory tag locations to determine whether any of them store the correct address. That is a boatload of transistors to accomplish within the latencies demanded of CPU cache, so we have caches with only 4 [1] locations to search for each memory address (see the sketch below). It looks like Micron figured out a way to place some logic close to the memory chips to search the entire memory for a matching pattern. Would be interesting to know how they did it.

[1] The number of locations to search depends on the associativity of the cache; I used a 4-way associative cache in my example.
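
To make this concrete, here is a minimal C sketch of a 4-way set-associative lookup (the sizes, bit widths, and names are invented for illustration; the point is that only 4 tag comparisons happen per access):

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_SETS 256  /* index bits pick one of these sets   */
    #define WAYS     4    /* only 4 tags are compared per lookup */

    /* Hypothetical cache model: each set holds WAYS tag entries. */
    struct set {
        uint32_t tag[WAYS];
        bool     valid[WAYS];
    };

    static struct set cache[NUM_SETS];

    /* The index bits of the address select the set, so the
       hardware needs only WAYS comparators, not one per line.  */
    bool lookup(uint32_t addr)
    {
        uint32_t index = (addr >> 6) & (NUM_SETS - 1); /* 64B lines */
        uint32_t tag   = addr >> 14;                   /* the rest  */

        for (int way = 0; way < WAYS; way++)
            if (cache[index].valid[way] && cache[index].tag[way] == tag)
                return true;  /* hit  */
        return false;         /* miss */
    }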


With a regular 4-way associative cache you search 4 tags to find the right block. With a fully associative cache, you search all tags (where a tag would mean an entire block number), and it is typically implemented with a CAM, where "some logic [is placed] close to memory chips to search entire memory for matching pattern".
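
For reference, here is a software stand-in for what a CAM does (a rough sketch; the real hardware compares the key against every stored tag simultaneously, and this loop merely models that behaviour serially):

    #include <stdint.h>

    #define ENTRIES 1024  /* arbitrary size for illustration */

    /* Returns the index of the matching entry, or -1. In a CAM,
       all ENTRIES comparisons happen in parallel in one cycle. */
    int cam_search(const uint32_t tags[], uint32_t key)
    {
        for (int i = 0; i < ENTRIES; i++)
            if (tags[i] == key)
                return i;
        return -1;
    }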

Is that what you're describing? Are you saying Micron invented some form of CAM?


That is the gist I got from the presentation.

http://en.wikipedia.org/wiki/Content-addressable_memory


What are some of the good papers?


This (http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumb...) is one of the classic papers on PIMs.


You should reach out to the IPDPS 2014 program committee and see that this reference gets added.


PIMs are outside my field, but Wikipedia looks like a reasonable start (http://en.wikipedia.org/wiki/Processor-in-memory).


I don't have a better one handy, but this definitely is a Wikipedia entry where "This article needs additional citations for verification." is more than true.

For example: "In the 1980s, a tiny CPU that executed FORTH was fabricated into a DRAM chip to improve PUSH and POP. FORTH is a Stack-oriented programming language and this improved its efficiency."

I would love to see a reference there. Anybody know one? If it only handled PUSH and POP, I wouldn't call it a processor in memory. I wouldn't say the MuP21 (http://www.ultratechnology.com/p21.html#p21) or F21 (http://www.ultratechnology.com/f21cpu.html) had a processor in memory, either.

Aside: I Googled for this (mythical?) CPU and did not find it, but I did find the next hot thing in agile programming: "Initially OKAD was implemented as the only application program in OK and was an experiment in sourceless programming. The structure of the programs looks like Forth code, but there was no Forth compiler or interpreter except Chuck himself. He entered the kernel of OK using a debugger and built the tools he need to build the rest of OK and OKAD" (http://www.ultratechnology.com/okad.htm)

Somebody should start a project where your system runs compiled Forth code, and the only way to back up the system is through a command that retrieves a set of functions from another running system ("please replace the 'BEEP' function with the one on the system at this IP address").


Harris Semiconductor attempted to introduce the "RTX" microcontroller in the late 1980s. It featured a true Forth-based instruction set. RTX never gained traction, due (IMO) to the lack of developer tools; the group was discontinued in ~1989.


If you mean http://en.wikipedia.org/wiki/RTX2010, that's nothing like what this describes. It's just a CPU with two hardware stacks (one more than the 6502 had).

There are plenty of alternatives for Forth hardware, especially if you are willing to use an FPGA. What this describes is more like Moore's GreenArray hardware, but with way more, less powerful chips. (If it were DRAM-like, a major difference would be that you would be able to address all those CPUs from the outside, not just a few, as in the GreenArray chips.)


If you do a search for "Cellular Automata Machines" on Google Scholar, you'll have a good start.

The idea of cellular automata was first developed by von Neumann and his colleagues back in the 1940s.


Do you know what's being talked about here? It's state machines (http://en.wikipedia.org/wiki/Deterministic_finite_automaton), not cellular automata.
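
A DFA in this sense is just a transition table walked once per input symbol. A toy C example (the pattern and state encoding are my own, purely for illustration; the AP's pitch is that it evaluates enormous numbers of such transitions in parallel for each input byte):

    #include <stdbool.h>

    /* Accepts any string containing the substring "abc".
       States: 0 = nothing matched, 1 = saw 'a', 2 = saw "ab",
       3 = saw "abc" (accepting and absorbing).                */
    bool contains_abc(const char *s)
    {
        int state = 0;
        for (; *s; s++) {
            switch (state) {
            case 0: state = (*s == 'a') ? 1 : 0; break;
            case 1: state = (*s == 'b') ? 2 : (*s == 'a') ? 1 : 0; break;
            case 2: state = (*s == 'c') ? 3 : (*s == 'a') ? 1 : 0; break;
            case 3: return true;
            }
        }
        return state == 3;
    }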


Cellular automata are what you get when you use automata as cells in a network. Just like in the presentation.


> Good: a memory vendor finally getting serious about "processor in memory" silicon.

Let's face it, Moore's Law is dead. Now we must extract the sub-exponential gains that come from architecture optimizations, and non-uniform memory can lead to huge ones.

It's time to finally take on the challenge of programming such exotic kinds of computers. The good news is that our computers are now good enough to help us.


> Moore's Law is dead.

What makes you say that? Intel would certainly disagree...


The particular form of Moore's Law that used to give regular doublings of clock speed is the one that has been dead for years.

Although not the original formulation (which was about the number of transistors), the clock-speed one is very significant because it means that sequential computations are no longer getting faster exponentially.

The forms of Moore's Law that remain are still being used advantageously by Intel, e.g. for more cores amongst lots of other things, but it's not helping as much, since not all computations can be parallelized.


There's only one formulation of Moore's Law - the one from the 1965 paper. Clock speed has nothing to do with it.

Clock speed is generally not a good indicator of processor performance. I'm guessing a single 2GHz Haswell core is faster than a 4GHz P4, due to a number of significant architectural improvements, which were made possible by the larger number of transistors available.

Keep in mind that processor performance per watt continues to improve significantly every two years, with no sign of slowing down in the near future.


> There's only one formulation of Moore's Law

You can't dictate that to the world:

http://en.wikipedia.org/wiki/Moore's_law#Other_formulations_...

> Clock speed is generally not a good indicator of a processor performance.

I didn't say it was. But if all else is equal, the exact same architecture running at a doubled clock frequency will run exactly twice as fast as the original.

You are arguing that not all else is usually equal, and although true, it's a different topic.

The point is that the "free" doubling of speed that we used to get stopped some years back.

Computation per watt is another important subject -- but it is a different topic (and, for that matter, yet another formulation of Moore's Law).

There are a thousand issues that are important to varying degrees when discussing architecture and performance, but there is no point acting like we don't know what people mean when they say "Moore's Law has failed". What they mean is quite clear.


I actually looked at the section of the Wikipedia article you linked to. It lists about a dozen rules or laws similar to Moore's Law. In every case, it's made clear the rule is not Moore's Law. In fact, most of those rules have their own distinct names (Dennard scaling, Butters' Law, Wirth's Law, etc.). Read it carefully and you will see that throughout the article, it's maintained that Moore's Law has exactly one well-defined meaning.

> there is no point acting like we don't know what people mean when they say "Moore's Law has failed". What they mean is quite clear.

When I read "Moore's Law is dead", I thought the OP meant we can't shrink transistors any more. You can probably agree it is not the same as saying "clock speeds are not improving anymore".

If someone says "Moore's Law" I will assume they mean "number of transistors on a chip doubles every 2 years or so". If they mean something else, such as "clock speed doubles every 2 years", then they use the term incorrectly. Moore's Law is alive and well, and if you talk to people who actually work on extending it, you should use the correct terminology.


I was talking about the original form, as created by Moore: the number of transistors on the most cost-effective chip doubling every X months (with X=18 as most recently stated; it was smaller once).

It's dead.

If you go back one and a half cycles, you'll see that it does not add up to a doubling for any big manufacturer. But what's most troubling is that the rate of increase is going down.

Yep, that happened in the past, and manufacturers recovered (although not for long). But this time it's different. We are very near the limits of MOSFETs created on silicon by lithography, and we are so invested in this technology that I doubt we'll be able to transition quickly into anything else.


It is definitely on its way out. We are having more and more trouble scaling to volume production for each process node. 14nm is behind schedule, 9nm is going to be harder, and no one is really sure how to get down to 5nm. Beyond 5nm, parts of the transistor would need to be smaller than an atom. We are hitting a wall; predictions are that by 2020 the cost of developing new manufacturing techniques will not justify the return. A fundamentally new substrate, or perhaps paradigm, will be required to push us forward. This will need to be revolutionary, not evolutionary, in nature. It is not guaranteed to be forthcoming in time to continue Moore's Law.


> It is definitely on its way out.

People have been saying that for decades. Experts claimed it would be impossible to shrink transistors below 1 micron. When that was done, other experts claimed it's impossible to shrink below 100nm.

Yes, Intel does not know how to build 5nm transistors, which are 3 process generations away from the current state of the art (14nm - 10nm - 7nm - 5nm). It's always been like that - for example, when Intel released 90nm technology, they didn't know how to do 22nm.

Yes, a new paradigm or substrate might be required to get there, so what? There's no shortage of new ideas, or new materials. Graphene is looking pretty good. Can't shrink it below atomic dimensions? Put another layer on top!

The only thing that can kill Moore's Law is lack of demand. But as long as people want faster, more efficient computers, they will be getting faster and more efficient. And I don't see the demand decreasing any time soon.


> "Deep packet inspection to monetize traffic"

Whoops.


Excuse me! ... I'm with the NSA and we really need to monetize some traffic. Have you seen our budget?


The slides are misleading. "Memory Wall" refers to the fact that CPU speed was doubling every 24 months from the 1960s till the early 2000s (an annual improvement rate of 50%), while RAM speed doubles every 10 years (an annual improvement rate of 10%). This "memory wall" is there regardless of the processor architecture that you use.


Math nitpick: it's a bit more complicated than that with exponential growth. You can't divide 100% over 10 years and say it was 10% each year; instead you have to solve 2 = 1.1^x, and find that it doubles in 7.27 years.

If something grows 10% a year for 10 years, it actually ends up at 2.59x its original size.
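
Worked out in full:

    2 = 1.1^x \implies x = \frac{\ln 2}{\ln 1.1} \approx 7.27 \text{ years}

    1.1^{10} \approx 2.594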


Rule of 72: doubles in 10 years, annual %age is 72/10 = 7.2%. Close to the right answer, easy to do in your head.

http://en.wikipedia.org/wiki/Rule_of_72
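
The derivation, for the curious (72 is used instead of the exact 69.3 because it has many small divisors and compensates for the approximation ln(1+r) ≈ r at typical rates):

    (1+r)^t = 2 \implies t = \frac{\ln 2}{\ln(1+r)} \approx \frac{0.693}{r} = \frac{69.3}{100r}

For the doubles-in-10-years example, the exact answer is 7.18% per year; the rule gives 7.2%.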


Huh, that's a handy approximation. I'd never heard of it before.


You might not have heard of it, but every financial planner in the world can do this math in their head!


Also: pi-seconds is a nano-century.
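
It checks out to within half a percent:

    100 \text{ years} \approx 3.156 \times 10^9 \text{ s} \implies 10^{-9} \text{ centuries} \approx 3.156 \text{ s} \approx \pi \text{ s}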


My favorite: 1337% of pi = 42. Beats Euler's identity any day.
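
Verified:

    13.37 \times \pi = 42.0031\ldots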


I think there's more here than most comments reflect. Yes, this is being done in many implementations today, but not nearly as efficiently. 4W for 1Gb of DFA stream analysis? That's pretty crazy, considering that can put full-on content inspection in a SOHO device that can end up in consumer networks. DFA is the shift away from file-based scanning in network security, and it offers parallelism efficiencies far beyond most of the UTM-type platforms out there today. There are only a few companies using stream-based platforms (Palo Alto Networks is one), and this is why that platform can do much more content scanning in one pass vs. the legacy devices.

But this is killer for SOHO networking, IMHO. It puts a ton of power into the next wave of network security.
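
If you want to experiment with this style of matching in software before the hardware ships, the closest cheap analogue is bit-parallel (Shift-And) search, where each bit of a machine word tracks one NFA state and a single 64-bit operation advances all of them per input byte. A sketch (my own illustration, not Micron's API; pattern length is limited to 64 here):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Bit i of 'state' is set iff the pattern prefix of length
       i+1 matches ending at the current byte. One shift+AND per
       byte advances every tracked NFA state at once.           */
    bool shift_and_search(const char *text, const char *pat)
    {
        size_t m = strlen(pat);  /* must be 1..64 */
        uint64_t mask[256] = {0};
        uint64_t state = 0, accept = 1ULL << (m - 1);

        for (size_t i = 0; i < m; i++)
            mask[(unsigned char)pat[i]] |= 1ULL << i;

        for (; *text; text++) {
            state = ((state << 1) | 1) & mask[(unsigned char)*text];
            if (state & accept)
                return true;     /* full pattern matched */
        }
        return false;
    }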



Micron demonstrated early versions of this at Microprocessor Forum in 2004 or 2005. So it isn't exactly "new" but being able to actually get one is.

Pattern algorithms are pretty difficult to synthesize, however; see the Conway glider search as an example. One of the challenges is that the set of 'instructions' and the solution possibilities are quite tightly interlinked. I hope that I can get my hands on one at some point; some of the old texture research from the Image Processing Institute would really fly on this thing.


Can't wait for Stephen Wolfram to claim it as his own invention.


So, when can I get one and start mining bitcoins? Or since it's all about memory, litecoins?


With bitcoin we passed the FPGA mark a while ago, now it's all about custom ASIC chips.


The bioinformatics slides in that presentation make no sense. The image shown doesn't map in any way to the bullet points.


So this is like going from SMTP to IMAP?


I didn't down-vote you, but if you were serious I think you should find a profession outside of technology.


I was almost serious; skimming through the diagram gave me the impression they were moving computation capabilities close to the data ~backend.


Oh ... I completely missed that interpretation (maybe I'm too close to these techniques). In any case, it's probably better to think of the system as a giant state machine in which the data are states and the operations are transitions. It's processing and memory co-mingled. You'll notice that it's got a very wide data width (not a new idea) that speeds it up tremendously for parallelizable tasks, but it's actually slowing down the "clock", since the memory itself is still the slowest part.

(I also upvoted you ... you shouldn't get penalized for an honest question)


Upvoted you. We need more insights like this; they say more than the document itself. State machines, PIM (processing in memory), and parallelization are the core concepts, and it looks like Micron has an implementation now. That's why they call it an Automata Processor (AP).


What a rude, condescending thing to say to someone who is asking for help understanding something.


Given the later exchange between us ... I agree that it was rude and I'd like to apologize. I'm sorry.

I don't generally make excuses when I apologize because I think it diminishes the value of the apology itself. In this case I think it's fair to admit I thought he was trolling (and apparently so did his down-voters).

In any case, hopefully I'll be a better man tomorrow ... iron sharpens iron.


It was 5% trollish in the delivery; I could have been a little more explicit rather than just throwing my metaphor out into the wind. Happy that 1) I wasn't too far off, and 2) the situation is resolved.

PS: the explanation for my quick one-liner is that I often see very limited protocols (say FTP, or SMTP) that require many round trips over the wire, where it would now make sense to distribute the computation a little on both sides (I understand that back in the day servers were anemic). Not unlike memory, IMHO.


The NSA processor—now available commercially!



