Micron Automata Processor – A Brief Introduction (shareholder.com)
111 points by crb002 on Feb 27, 2014 | 50 comments



Good: a memory vendor finally getting serious about "processor in memory" silicon.

Bad: it doesn't mention any of the research that was already done on these sorts of things. I mean, it makes a great story that you invented the whole thing from scratch, but nothing screams "this is a slick marketing doc" more than that sort of pretense. Or maybe they did ignore the literature and made a bunch of mistakes as a result?

Edit: wow, the scientific paper does the same thing.


I hope the referees reject it. It's unacceptable to have a paper that was written by a marketing department. There's nothing new or original about FPGAs; they make it sound like they invented the concept. They may have improved on it, but they didn't talk about the specifics of that.


They already got one accepted to IPDPS 2014, "Finding Motifs in Biological Sequences Using the Micron Automata Processor" http://www.ipdps.org/ipdps2014/2014_%20advance_program.html


I had a professor mention a concept like this in a microprocessor class. Basically the idea comes from the efficient organization of CPU cache: using an LRU queue instead of a 4-way associative cache with each memory location tied to a specific spot in one of those 4 banks. The issue with an LRU queue is that you would need to search all memory tag locations to determine whether any of them store the correct address. That is a boatload of transistors to accomplish within the latencies demanded of CPU cache, so we have caches with only 4 [1] locations to search for each memory address (see the sketch below). It looks like Micron figured out a way to place some logic close to the memory chips to search the entire memory for a matching pattern. Would be interesting to know how they did it.

[1] The number of locations to search depends on the associativity of the cache; I used a 4-way associative cache in my example.
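
To make this concrete, here is a minimal C sketch of a 4-way set-associative lookup (the sizes, bit widths, and names are invented for illustration; the point is that only 4 tag comparisons happen per access):

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_SETS 256  /* index bits pick one of these sets   */
    #define WAYS     4    /* only 4 tags are compared per lookup */

    /* Hypothetical cache model: each set holds WAYS tag entries. */
    struct set {
        uint32_t tag[WAYS];
        bool     valid[WAYS];
    };

    static struct set cache[NUM_SETS];

    /* The index bits of the address select the set, so the
       hardware needs only WAYS comparators, not one per line.  */
    bool lookup(uint32_t addr)
    {
        uint32_t index = (addr >> 6) & (NUM_SETS - 1); /* 64B lines */
        uint32_t tag   = addr >> 14;                   /* the rest  */

        for (int way = 0; way < WAYS; way++)
            if (cache[index].valid[way] && cache[index].tag[way] == tag)
                return true;  /* hit  */
        return false;         /* miss */
    }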


With a regular 4-way associative cache you search 4 tags to find the right block. With a fully associative cache, you search all tags (where a tag would mean an entire block number), and it is typically implemented with a CAM, where "some logic [is placed] close to memory chips to search entire memory for matching pattern".
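
For reference, here is a software stand-in for what a CAM does (a rough sketch; the real hardware compares the key against every stored tag simultaneously, and this loop merely models that behaviour serially):

    #include <stdint.h>

    #define ENTRIES 1024  /* arbitrary size for illustration */

    /* Returns the index of the matching entry, or -1. In a CAM,
       all ENTRIES comparisons happen in parallel in one cycle. */
    int cam_search(const uint32_t tags[], uint32_t key)
    {
        for (int i = 0; i < ENTRIES; i++)
            if (tags[i] == key)
                return i;
        return -1;
    }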

Is that what you're describing? Are you saying Micron invented some form of CAM?


That is the gist I got from the presentation.

http://en.wikipedia.org/wiki/Content-addressable_memory


What are some of the good papers?


This (http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumb...) is one of the classic papers on PIMs.


You should reach out to the IPDPS 2014 program committee and see that this reference gets added.


PIMs are outside my field, but Wikipedia looks like a reasonable start (http://en.wikipedia.org/wiki/Processor-in-memory).


I don't have a better one handy, but this definitely is a Wikipedia entry where "This article needs additional citations for verification." is more than true.

For example: "In the 1980s, a tiny CPU that executed FORTH was fabricated into a DRAM chip to improve PUSH and POP. FORTH is a Stack-oriented programming language and this improved its efficiency."

I would love to see a reference there. Anybody know one? If it only handled PUSH and POP, I wouldn't call it a processor in memory. I wouldn't say the MuP21 (http://www.ultratechnology.com/p21.html#p21) or F21 (http://www.ultratechnology.com/f21cpu.html) had a processor in memory, either.

Aside: I Googled for this (mythical?) CPU and did not find it, but I did find the next hot thing in agile programming: "Initially OKAD was implemented as the only application program in OK and was an experiment in sourceless programming. The structure of the programs looks like Forth code, but there was no Forth compiler or interpreter except Chuck himself. He entered the kernel of OK using a debugger and built the tools he need to build the rest of OK and OKAD" (http://www.ultratechnology.com/okad.htm)

Somebody should start a project where your system runs compiled Forth code, and the only way to back up the system is through a command that retrieves a set of functions from another running system ("please replace the 'BEEP' function with the one on the system at this IP address").


Harris Semiconductor attempted to introduce the "RTX" microcontroller in the late 1980s. It featured a true Forth-based instruction set. RTX never gained traction, due (IMO) to the lack of developer tools; the group was discontinued in ~1989.


If you mean http://en.wikipedia.org/wiki/RTX2010, that's nothing like what this describes. It's just a CPU with two hardware stacks (one more than the 6502 had).

There are plenty of alternatives for Forth hardware, especially if you are willing to use an FPGA. What this describes is more like Moore's GreenArray hardware, but with way more, less powerful chips. (If it were DRAM-like, a major difference would be that you would be able to address all those CPUs from the outside, not just a few, as in the GreenArray chips.)


If you do a search for "Cellular Automata Machines" on Google Scholar, you'll have a good start.

The idea of cellular automata was first developed by von Neumann and his colleagues back in the 1940s.


Do you know what's being talked about here? It's state machines (http://en.wikipedia.org/wiki/Deterministic_finite_automaton), not cellular automata.
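
A DFA in this sense is just a transition table walked once per input symbol. A toy C example (the pattern and state encoding are my own, purely for illustration; the AP's pitch is that it evaluates enormous numbers of such transitions in parallel for each input byte):

    #include <stdbool.h>

    /* Accepts any string containing the substring "abc".
       States: 0 = nothing matched, 1 = saw 'a', 2 = saw "ab",
       3 = saw "abc" (accepting and absorbing).                */
    bool contains_abc(const char *s)
    {
        int state = 0;
        for (; *s; s++) {
            switch (state) {
            case 0: state = (*s == 'a') ? 1 : 0; break;
            case 1: state = (*s == 'b') ? 2 : (*s == 'a') ? 1 : 0; break;
            case 2: state = (*s == 'c') ? 3 : (*s == 'a') ? 1 : 0; break;
            case 3: return true;
            }
        }
        return state == 3;
    }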


Cellular automata are what you get when you use automata as cells in a network. Just like in the presentation.


> Good: a memory vendor finally getting serious about "processor in memory" silicon.

Let's face it, Moore's Law is dead. Now we must extract the sub-exponential gains that come from architecture optimizations, and non-uniform memory can lead to huge ones.

It's time to finally take on the challenge of programming such exotic kinds of computers. The good news is that our computers are now good enough to help us.


> Moore's Law is dead.

What makes you say that? Intel would certainly disagree...


The particular form of Moore's Law that used to give regular doublings of clock speed is the one that has been dead for years.

Although not the original formulation (which was about the number of transistors), the clock-speed one is very significant because it means that sequential computations are no longer getting faster exponentially.

The forms of Moore's Law that remain are still being used advantageously by Intel, e.g. for more cores amongst lots of other things, but it's not helping as much, since not all computations can be parallelized.


There's only one formulation of Moore's Law - the one from the 1965 paper. Clock speed has nothing to do with it.

Clock speed is generally not a good indicator of processor performance. I'm guessing a single 2GHz Haswell core is faster than a 4GHz P4, due to a number of significant architectural improvements, which were made possible by the larger number of transistors available.

Keep in mind that processor performance per watt continues to improve significantly every two years, with no sign of slowing down in the near future.


> There's only one formulation of Moore's Law

You can't dictate that to the world:

http://en.wikipedia.org/wiki/Moore's_law#Other_formulations_...

> Clock speed is generally not a good indicator of a processor performance.

I didn't say it was. But if all else is equal, the exact same architecture running at a doubled clock frequency will run exactly twice as fast as the original.

You are arguing that not all else is usually equal, and although true, it's a different topic.

The point is that the "free" doubling of speed that we used to get stopped some years back.

Computation per watt is another important subject -- but it is a different topic (and, for that matter, yet another formulation of Moore's Law).

There are a thousand issues that are important to varying degrees when discussing architecture and performance, but there is no point acting like we don't know what people mean when they say "Moore's Law has failed". What they mean is quite clear.


I actually looked at the section of the Wikipedia article you linked to. It lists about a dozen rules or laws similar to Moore's Law. In every case, it's made clear the rule is not Moore's Law. In fact, most of those rules have their own distinct names (Dennard scaling, Butters' Law, Wirth's Law, etc.). Read it carefully and you will see that throughout the article, it's maintained that Moore's Law has exactly one well-defined meaning.

> there is no point acting like we don't know what people mean when they say "Moore's Law has failed". What they mean is quite clear.

When I read "Moore's Law is dead", I thought the OP meant we can't shrink transistors any more. You can probably agree it is not the same as saying "clock speeds are not improving anymore".

If someone says "Moore's Law" I will assume they mean "number of transistors on a chip doubles every 2 years or so". If they mean something else, such as "clock speed doubles every 2 years", then they use the term incorrectly. Moore's Law is alive and well, and if you talk to people who actually work on extending it, you should use the correct terminology.


I was talking about the original form, as created by Moore: the number of transistors on the most cost-effective chip doubling every X months (with X=18 as most recently stated; it was smaller once).

It's dead.

If you go back one and a half cycles, you'll see that it does not add up to a doubling for any big manufacturer. But what's most troubling is that the rate of increase is going down.

Yep, that happened in the past, and manufacturers recovered (although not for long). But this time it's different. We are very near the limits of MOSFETs created on silicon by lithography, and we are so invested in this technology that I doubt we'll be able to transition quickly into anything else.


It is definitely on its way out. We are having more and more trouble scaling to volume production for each process node. 14nm is behind schedule, 9nm is going to be harder, and no one is really sure how to get down to 5nm. Beyond 5nm, parts of the transistor would need to be smaller than an atom. We are hitting a wall; predictions are that by 2020 the cost of developing new manufacturing techniques will not justify the return. A fundamentally new substrate, or perhaps paradigm, will be required to push us forward. This will need to be revolutionary, not evolutionary, in nature. It is not guaranteed to be forthcoming in time to continue Moore's Law.


> It is definitely on its way out.

People have been saying that for decades. Experts claimed it would be impossible to shrink transistors below 1 micron. When that was done, other experts claimed it's impossible to shrink below 100nm.

Yes, Intel does not know how to build 5nm transistors, which are 3 process generations away from the current state of the art (14nm - 10nm - 7nm - 5nm). It's always been like that - for example, when Intel released 90nm technology, they didn't know how to do 22nm.

Yes, a new paradigm or substrate might be required to get there, so what? There's no shortage of new ideas, or new materials. Graphene is looking pretty good. Can't shrink it below atomic dimensions? Put another layer on top!

The only thing that can kill Moore's Law is lack of demand. But as long as people want faster, more efficient computers, they will be getting faster and more efficient. And I don't see the demand decreasing any time soon.


> "Deep packet inspection to monetize traffic"

Whoops.


Excuse me! ... I'm with the NSA and we really need to monetize some traffic. Have you seen our budget?


The slides are misleading. "Memory Wall" refers to the fact that CPU speed was doubling every 24 months from the 1960s till the early 2000s (an annual improvement rate of 50%), while RAM speed doubles every 10 years (an annual improvement rate of 10%). This "memory wall" is there regardless of the processor architecture that you use.


Math nitpick: it's a bit more complicated than that with exponential growth. You can't divide 100% over 10 years and say it was 10% each year; instead you have to solve 2 = 1.1^x, and find that it doubles in 7.27 years.

If something grows 10% a year for 10 years, it actually ends up at 2.59x its original size.
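
Worked out in full:

    2 = 1.1^x \implies x = \frac{\ln 2}{\ln 1.1} \approx 7.27 \text{ years}

    1.1^{10} \approx 2.594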


Rule of 72: doubles in 10 years, annual %age is 72/10 = 7.2%. Close to the right answer, easy to do in your head.

http://en.wikipedia.org/wiki/Rule_of_72
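
The derivation, for the curious (72 is used instead of the exact 69.3 because it has many small divisors and compensates for the approximation ln(1+r) ≈ r at typical rates):

    (1+r)^t = 2 \implies t = \frac{\ln 2}{\ln(1+r)} \approx \frac{0.693}{r} = \frac{69.3}{100r}

For the doubles-in-10-years example, the exact answer is 7.18% per year; the rule gives 7.2%.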


Huh, that's a handy approximation. I'd never heard of it before.


You might not have heard of it, but every financial planner in the world can do this math in their head!


Also: pi-seconds is a nano-century.
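
It checks out to within half a percent:

    100 \text{ years} \approx 3.156 \times 10^9 \text{ s} \implies 10^{-9} \text{ centuries} \approx 3.156 \text{ s} \approx \pi \text{ s}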


My favorite: 1337% of pi = 42. Beats Euler's identity any day.
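
Verified:

    13.37 \times \pi = 42.0031\ldots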


I think there's more here than most comments reflect. Yes, this is being done in many implementations today, but not nearly as efficiently. 4W for 1Gb of DFA stream analysis? That's pretty crazy, considering that can put full-on content inspection in a SOHO device that can end up in consumer networks. DFA is the shift away from file-based scanning in network security, and it offers parallelism efficiencies far beyond most of the UTM-type platforms out there today. There are only a few companies using stream-based platforms (Palo Alto Networks is one), and this is why that platform can do much more content scanning in one pass vs. the legacy devices.

But this is killer for SOHO networking, IMHO. It puts a ton of power into the next wave of network security.
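
If you want to experiment with this style of matching in software before the hardware ships, the closest cheap analogue is bit-parallel (Shift-And) search, where each bit of a machine word tracks one NFA state and a single 64-bit operation advances all of them per input byte. A sketch (my own illustration, not Micron's API; pattern length is limited to 64 here):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Bit i of 'state' is set iff the pattern prefix of length
       i+1 matches ending at the current byte. One shift+AND per
       byte advances every tracked NFA state at once.           */
    bool shift_and_search(const char *text, const char *pat)
    {
        size_t m = strlen(pat);  /* must be 1..64 */
        uint64_t mask[256] = {0};
        uint64_t state = 0, accept = 1ULL << (m - 1);

        for (size_t i = 0; i < m; i++)
            mask[(unsigned char)pat[i]] |= 1ULL << i;

        for (; *text; text++) {
            state = ((state << 1) | 1) & mask[(unsigned char)*text];
            if (state & accept)
                return true;     /* full pattern matched */
        }
        return false;
    }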



Micron demonstrated early versions of this at Microprocessor Forum in 2004 or 2005. So it isn't exactly "new" but being able to actually get one is.

Pattern algorithms are pretty difficult to synthesize, however; see the Conway glider search as an example. One of the challenges is that the set of 'instructions' and the solution possibilities are quite tightly interlinked. I hope that I can get my hands on one at some point; some of the old texture research from the Image Processing Institute would really fly on this thing.


Can't wait for Stephen Wolfram to claim it as his own invention.


So, when can I get one and start mining bitcoins? Or since it's all about memory, litecoins?


With bitcoin we passed the FPGA mark a while ago, now it's all about custom ASIC chips.


The bioinformatics slides in that presentation make no sense. The image shown doesn't map in any way to the bullet points.


So this is like going from SMTP to IMAP?


I didn't down-vote you, but if you were serious I think you should find a profession outside of technology.


I was almost serious; skimming through the diagram gave me the impression they were moving computation capabilities close to the data ~backend.


Oh ... I completely missed that interpretation (maybe I'm too close to these techniques). In any case, it's probably better to think of the system as a giant state machine in which the data are states and the operations are transitions. It's processing and memory co-mingled. You'll notice that it's got a very wide data width (not a new idea) that speeds it up tremendously for parallelizable tasks, but it's actually slowing down the "clock", since the memory itself is still the slowest part.

(I also upvoted you ... you shouldn't get penalized for an honest question)


Upvoted you. We need more insights like this; they say more than the document itself. State machines, PIM (processing in memory), and parallelization are the core concepts, and it looks like Micron has an implementation now. That's why they call it an Automata Processor (AP).


What a rude, condescending thing to say to someone who is asking for help understanding something.


Given the later exchange between us ... I agree that it was rude and I'd like to apologize. I'm sorry.

I don't generally make excuses when I apologize because I think it diminishes the value of the apology itself. In this case I think it's fair to admit I thought he was trolling (and apparently so did his down-voters).

In any case, hopefully I'll be a better man tomorrow ... iron sharpens iron.


It was 5% trollish in the delivery; I could have been a little more explicit rather than just throwing my metaphor out into the wind. Happy that 1) I wasn't too far off, and 2) the situation is resolved.

PS: the explanation for my quick one-liner is that I often see very limited protocols (say FTP, or SMTP) that require many round trips over the wire, where it would now make sense to distribute the computation a little on both sides (I understand that back in the day servers were anemic). Not unlike memory, IMHO.


The NSA processor—now available commercially!



