Neural network spotted deep inside Samsung's Galaxy S7 silicon brain (theregister.co.uk)
269 points by dragonbonheur on Aug 23, 2016 | 99 comments



My Computer Architecture professor Daniel Jimenez worked on (invented?) something like this:

"Dynamic Branch Prediction with Perceptrons" (PDF)

http://hpca23.cse.tamu.edu/taco/pdfs/hpca7_dist.pdf


I read the paper description and immediately jumped to the conclusion that you are Calvin's student :) We had to study this paper during one of my classes at UTCS.


Calvin Lin is also no slouch :)


Incidentally, Daniel was my very first CS instructor about 20 years ago, when he taught my Intro to CS class at UT San Antonio. Back then it was the vi editor and K&R C programming on ancient green monochrome terminals. I remember it as a fairly severe introduction, but he was always great at explaining things properly.


I don't see why branching is so hard -- it's two cycles to test the branch, one more if the branch is taken, one more if it crosses a page boundary! /6502joke


Well... for the size of its pipeline (1?) and the amount of speculative execution going on (0), the non-existent branch prediction of the 6502 was basically optimal.


The 6502 has a tiny bit of pipeline/speculative execution. The CPU fetches the next instruction before the current one has finished.

http://www.atarihq.com/danb/files/64doc.txt:

"If an instruction does not store data in memory on its last cycle, the processor can fetch the opcode of the next instruction while executing the last cycle."

The 6502 also sometimes does excess reads and writes. For example, "Read-modify-write instructions (like INC) read the original data, write it back, and then write the modified data." (http://forum.6502.org/viewtopic.php?t=2).

One could call that speculative execution, where the processor writes back the value, just in case adding one to it doesn't increase it, and then, having discovered that adding one changes the value, writes the right value :-)
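
For the curious, here is a rough cycle-by-cycle sketch of that read-modify-write behaviour for an absolute-addressed INC. This is my own model of the bus traffic, not a transcription of the linked docs, but it shows the dummy write of the unmodified value before the real one:

    # Rough model of the 6502's bus activity for "INC $1234" (absolute addressing).
    # Each tuple is (cycle, bus operation) -- note the dummy write of the old value.
    def inc_absolute_bus_cycles(addr, old_value):
        new_value = (old_value + 1) & 0xFF
        return [
            (1, "read  opcode (INC abs)"),
            (2, f"read  low byte of ${addr:04x}"),
            (3, f"read  high byte of ${addr:04x}"),
            (4, f"read  ${addr:04x} -> {old_value:#04x}"),   # fetch the original data
            (5, f"write ${addr:04x} <- {old_value:#04x}"),   # dummy write of the unmodified value
            (6, f"write ${addr:04x} <- {new_value:#04x}"),   # write the incremented value
        ]

    for cycle, op in inc_absolute_bus_cycles(0x1234, 0x7F):
        print(cycle, op)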


The 6502 had no concept of not doing a memory operation.

Either it's doing a read or it's doing a write; there is no option to leave the bus unused. So the 6502 spams dummy reads and occasionally dummy writes whenever it's doing something internally.

Edit: And that's not the only interesting thing I've learned recently about the 6502. There is a very good reason why the stack is on page 1 and not some other constant page: there are only 3 constant values which can be pushed onto the SB/ADH buses, 0xff, 0x00 and 0x01.

0x00 is loaded into the upper address register for zero page instructions. For Push operations, it loads 0x01 into the upper address register and puts the stack pointer + 0xff into the adder, which subtracts one from the stack pointer. (The 0xff is just the default bus value if nothing else is pulling it low, so you get it for free.)

For Pop operations, it loads the same 0x01 into both the upper address register and the adder, at the same time.

So if they had wanted to put the stack at another constant page, or at a variable page, they would have needed extra gates to generate another constant somewhere.
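
As a toy model of that address arithmetic, using only the three constants the chip can actually drive onto those buses (this is my own illustration, not a transcription of the real datapath):

    # Toy model of 6502 stack addressing: the only constants available are
    # 0xFF, 0x00 and 0x01, so the stack page is hardwired to page 1 (0x0100-0x01FF).
    STACK_PAGE_HIGH = 0x01   # driven onto the upper address bus for every push/pop

    def push_address(sp):
        """Address used for a push; afterwards SP is decremented by adding 0xFF (mod 256)."""
        addr = (STACK_PAGE_HIGH << 8) | sp
        new_sp = (sp + 0xFF) & 0xFF      # +0xFF mod 256 == -1; 0xFF is the "free" default bus value
        return addr, new_sp

    def pop_address(sp):
        """For a pop, the same 0x01 constant feeds both the upper address register and the adder."""
        new_sp = (sp + 0x01) & 0xFF      # increment SP first
        addr = (STACK_PAGE_HIGH << 8) | new_sp
        return addr, new_sp

    addr, sp = push_address(0xFD)
    print(hex(addr), hex(sp))            # 0x1fd 0xfc
    addr, sp = pop_address(sp)
    print(hex(addr), hex(sp))            # 0x1fd 0xfd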


This is literally insane. I've never heard of a neural net branch predictor.

Edit: my layman knowledge about branch prediction and neural networks is showing its gross inadequacy :/


Branch prediction strategies can be learned, just like any other model. I could certainly see something like this outperforming the huge bag of hand-tuned optimization strategies for branch prediction anyway. It could save some silicon.

Even a simple dot product can be called a "neural network," albeit a small uninteresting one. Your features could be (say) the state of cache, the number of jumps, the number of recent stalls / stack size / returns, and so on. Put them into a 10-dimensional vector or whatever, take the dot product between that and a set of 10 learned weights, and preload the branch if the result is above some threshold. Even a simple model could perhaps work really well here.
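
For instance, a minimal sketch of exactly that dot-product-and-threshold idea (the features and weights below are invented purely for illustration):

    # A "neural network" that is just a dot product over made-up features:
    # predict "taken" if the weighted sum clears a threshold.
    features = [1.0, 3.0, 0.0, 2.0]      # e.g. cache state, recent jumps, stalls, ... (invented)
    weights  = [0.4, -0.2, 0.7, 0.1]     # learned offline, fixed at runtime
    threshold = 0.0

    score = sum(f * w for f, w in zip(features, weights))
    predict_taken = score > threshold
    print(score, predict_taken)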

Of course it's not like they have some deep learning convolutional net running on the program input. I also bet the weights are set at the factory so it's not "learning" anything.


Nothing in the article suggests that it's learning - so I suspect that the weights are indeed set in the factory.

Learning on the fly would consume much more power as it back-propagates, so it doesn't really make much sense: the potential performance gain would be marginal, and it would probably cause a performance hit when switching to a different process or thread.


The article links to a paper[1] that talks about online learning:

> The perceptrons are trained by an algorithm that increments a weight when the branch outcome agrees with the weight’s correlation and decrements the weight otherwise.

A "perceptron" seems to be a single linear function, akin to a single neuron in a neural network. So complex learning algorithms like back-propagation don't apply here.

The complete process, from the paper, is:

> 1. The branch address is hashed to produce an index "i" into the table of perceptrons.

> 2. The "i"th perceptron is loaded from the table into a vector register "P" of weights.

> 3. The value of "y" is computed as the dot product of "P" and the global history register. [This is a shift register containing values of -1 or 1 for the last N branches seen. -1 means not taken, and 1 means taken.]

> 4. The branch is predicted "not taken" when "y" is negative, or "taken" otherwise.

> 5. Once the actual outcome of the branch becomes known, the training algorithm uses this outcome and the value of "y" to update the weights in "P".

> 6. "P" is written back to the "i"th entry in the table.

The article says that AMD's microarchitecture uses a hashed perceptron system in its branch prediction. So, from the above, it seems that "hashed" means that you hash the branch address to pick which perceptron (model) to use, and "perceptron" is just a dot product. (Actually the wikipedia article[2] suggests that the name "perceptron" refers to the training algorithm, i.e. how you update the weights.)
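
Putting those six steps together, here is a minimal Python sketch of such a predictor. The table size, history length and training threshold are my own choices (the threshold heuristic and the bias weight follow the paper's description), and the "hash" here is just a modulo; a real design would index more carefully:

    class PerceptronPredictor:
        """Minimal sketch of a hashed perceptron branch predictor (not cycle- or paper-exact)."""

        def __init__(self, num_perceptrons=1024, history_len=16, theta=None):
            self.n = num_perceptrons
            self.h = history_len
            # Train only while |y| is small or the prediction was wrong.
            self.theta = theta if theta is not None else int(1.93 * history_len + 14)
            # One weight vector per table entry; index 0 is the bias weight.
            self.table = [[0] * (history_len + 1) for _ in range(num_perceptrons)]
            # Global history register: +1 = taken, -1 = not taken.
            self.history = [1] * history_len

        def _index(self, branch_addr):
            return branch_addr % self.n                      # step 1: hash the branch address

        def predict(self, branch_addr):
            w = self.table[self._index(branch_addr)]         # step 2: load the perceptron
            y = w[0] + sum(wi * xi for wi, xi in zip(w[1:], self.history))  # step 3: dot product
            return y, (y >= 0)                               # step 4: taken iff y >= 0

        def update(self, branch_addr, y, taken):
            t = 1 if taken else -1
            w = self.table[self._index(branch_addr)]
            if (y >= 0) != taken or abs(y) <= self.theta:    # step 5: train on mispredict or low confidence
                w[0] += t
                for i, xi in enumerate(self.history, start=1):
                    w[i] += t * xi                           # increment if outcome agrees with history bit
            self.history = self.history[1:] + [t]            # step 6 (plus shifting in the new outcome)

    # Usage: predict, learn the real outcome, update -- here on a branch that is always taken.
    bp = PerceptronPredictor()
    for _ in range(100):
        y, guess = bp.predict(0x4004f0)
        bp.update(0x4004f0, y, taken=True)
    print(bp.predict(0x4004f0)[1])   # True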

[1]: Daniel A. Jiménez and Calvin Lin. 2002. Neural Methods for Dynamic Branch Prediction. https://www.cs.utexas.edu/~lin/papers/tocs02.pdf

[2]: https://en.wikipedia.org/wiki/Perceptron


But if it is not learning, can you really call it NN?


Yes. Most neural net applications separate training and inference so that once the net is in production it doesn't change, though it might be replaced with a new net rather frequently depending on how much the distribution of data it's operating on changes over time.


To add to that, the backpropagation doesn't have to happen on the device - they could, for instance, record statistics from a sampling of the cellphones in use, backprop on a server, then push new weights in the next over-the-air update. Collecting the statistics would be much cheaper than actually training the network on the device.


not an expert on this, but wouldn't what you mention be adaptive control rather than ANN?


It's actually really common these days. :)

Search google for "hashed perceptron branch predictor"


Intel (and probably AMD) have been using neural-based approaches to branch prediction since the early 2000s. This isn't particularly new, it's just not a topic talked about much.


[citation needed]. As far as I know, Intel is suspected to be using TAGE or some variant of it in its last few generations of chips.


Google for a PDF describing x86 architecture CPUs. There are ways to write a loop that will detect the kind of predictor in use.


No need, other people have already done the work: the paper 'Branch Prediction and the Performance of Interpreters' gave strong hints that Intel has been using ITTAGE at least since Haswell.

Edit: also, Intel branch predictors up to Core2 were well characterized and were not believed to be using NN.

[1] https://hal.inria.fr/hal-01100647/document, discussed last year on HN.


Not necessarily. The cost of backtracking on a bad branch prediction on a device like a phone may be great enough that an ANN is worth the effort to implement.

I'm not as familiar with mobile phone architecture as with that of PCs, but the number and types of operations influenced by the device's various sensors (light sensor, GPS, accelerometer, gyroscope) could conceivably be more than normal, naïve branch prediction can handle effectively.


Why would speculation invalidation be more expensive on a phone? If anything, many phone CPUs tend to have shorter pipelines and less expensive branch prediction misses.

Also I do not see how the number or quality of sensors could in any way affect the prediction rate.


In addition, could this be worse than what we have today - on 64-bit ARM?


The article indicates that academia has been writing about using them for branch prediction for a while, and that Intel and AMD have been doing just that for a while. They just don't talk about it much.

> AMD's Zen architect Mike Clark confirmed to us his microarchitecture uses a hashed perceptron system in its branch prediction. "Maybe I should have called it a neural net," he added.

Neural networks can be very complicated, but they can also be very simple. What little I know about perceptrons is that they are very simple. If you google "hashed perceptron", many of the papers that come up mention branch prediction in their titles.


First time I've heard of them in hardware. The JVM has been employing similar techniques for many years now.


Like where? Even the best JVMs do basic counted profiling of branches only, as far as I know.



Have HotSpot JVM optimization mechanisms ever been disclosed?


What? It's been open source for years!


I was under the impression that only some parts of the JVM were open and that the high-performance server implementations have always been proprietary?


The (old) Oracle JRockit is not really better. It was at one point, but it was probably getting troublesome to maintain, and after buying Sun they probably felt that it was rather redundant. (There was no need for a clean room implementation since they owned all the relevant IP anyhow.)


Not at all.


e: was wrong about everything


I guess it's naive to assume that Apple isn't also doing the same thing. And Apple's cores are bigger, so they generally do more work per core per cycle. The Exynos in the GS7 has eight cores, but the die area is roughly the same as Apple's dual-core chip.


If you have to pull in an army of cores to fight two, the really impressive arch belongs to Apple.

OTOH I see this as optimizing the hardware to work around the cap imposed by software, paradigms and devs.


As long as the die area is the same, I see it as a fair fight.


In fact, for the same die size, Android hardware has been adapted to work better with the way Android works (many small processes working together).

This is maybe not the best approach for certain games or benchmarks, but for normal use this has proven to be great.


Even with all those neural net advantages, the dual-core, 2GB-RAM iPhone 6s beats the octa-core, 4GB-RAM Galaxy Note 7:

http://www.redmondpie.com/galaxy-note-7-vs-iphone-6s-real-wo...

Video (3:29 sec): https://www.youtube.com/watch?v=3-61FFoJFy0

PS: I highly recommend you check out the video... iPhone completely obliterated Note 7.


That test is relevant for the end user, but it doesn't make sense as a hardware test. It's a different system, on different hardware, with different apps (that happen to share the same name), and essentially what you're comparing is resource loading strategies. iPhone is better at that - great.

But for a hardware test, I'd expect a single running app, executing shared native codebase, in no disruptions airplane mode, measuring specific part of the hardware. And that's not even close to measuring before/after improved branch prediction, which could be useful in both phones.


> But for a hardware test, I'd expect a single running app, executing shared native codebase, in no disruptions airplane mode, measuring specific part of the hardware. And that's not even close to measuring before/after improved branch prediction, which could be useful in both phones.

I would argue that would be a completely useless test unless you're explicitly testing something for development purposes only. Users use phones. Nothing else matters but the user experience. If you can make your phone beat another phone in your limited, no-disruptions, network-traffic-off example, that doesn't mean anything.

I feel this is a fair test. You test the phones at what they're supposed to be able to do. If one does it better it's legit to point that out regardless of what the underlying structure looks like.


I think you either misread my comment or missed the point of it. Overall tests matter, specific tests matter, unittests matter, microbenchmarks matter, user experience matters, perceived performance matters. But they don't always intersect.

Bringing up user experience when talking about branch predictors is meaningless. Just as bringing up branch predictors when talking about perceived performance. (unless you're proving that this exact feature being available/not available, with other things being controlled for, makes a major difference in the test)


> Bringing up user experience when talking about branch predictors is meaningless.

If working on branch predictors does not impact the user experience, then arguably that work is meaningless in the context of building real products -- it may still be interesting for academic purposes.


Branch predictors made the phones faster and this affects the user experience. But if you want to test the performance of branch predictors you have to compare similar things; you cannot benchmark two different OSes and apps, because then you get an OS and app test, not a processor test.


I think the point that previous posters were making is that in controlled environments some features might outperform others. And yes, the scientific way of testing things is isolating variables and changing one thing each time.

But, for the final user, it doesn't matter whether each technology alone is better than another. You could have a background 64-core processor driven by another 8-core one... what matters is how fast something loads and performs, and how long the battery lasts.

Of course, this might be out of place in a thread discussing NN branch prediction.


Why measure and therefore optimize for anything but how the product is being used?


I didn't say it's not valid, or that it shouldn't have been done. I'm just saying "NN still doesn't help them beat the iPhone" is not a relevant comment regarding this test. To go with the car analogy, it's close to an article about a slightly improved design of one part of the fuel injection in the new ABC, and someone commenting "look, an XYZ from last year can still go faster without it".

The test is a valid observation, but is irrelevant to this article, or to the technology.

Why measure something that doesn't show up in a standard user's test? Here are some examples: lower power usage, lower latency of small operations (rather than throughput of large actions), smaller design (saving chip space), etc. And finally - if this is a good technology, Apple can use it too and get even faster.


That is a great analogy. That test in the video didn't really say anything about the overall speed of the device. It's interesting to see, but I wonder how this test would change over n trials.


You should do that, but not only that. If you just test the way the product is being used, you have a million possible reasons why it could be slow. If you just want to know whether slapping a neural network on the chip is a good idea or not, it's best to test that in such a way as to exclude as many confounding factors as possible.


In general: because how a piece of hardware (or set of hardware) is used today isn't necessarily how it's going to be used tomorrow. Faster hardware allows for use-cases that weren't possible prior.


However, when we're talking about phones or consumer devices, how it's used today is the only thing that's relevant. It will be obsolete in just a couple of years.


I think the parent's point is still valid. Look at Pokemon Go. The average user wouldn't care about AR performance, all that high GPS accuracy, or gfx performance in Unity. Suddenly, there are millions of players who are choosing their hardware based on that. If you are planning for today's usage you may be lagging.


> That test is relevant for the end user, but it doesn't make sense as a hardware test.

Err? It seems to me that the only tests that make any sense are those that count for end users. Everything else is pointless. It doesn't matter if it's the same system or code base or whatever other engineering fetish.

Users who compare their phones to see which goes faster look at how the same app opens and runs on two separate phones. Which incidentally is what this test is about. It's the only sensible test you can do.


> Everything else is pointless.

Better understanding individual components lets engineers build better overall systems. Millions of choices and decisions go into something like a phone, most of which can't have their result easily measured by just looking at the end result. But if enough of them are good, you end up with the iPhone.

If enough of them are bad, you end up with mass recalls, and a class action lawsuit because your phones catch fire when charged overnight because your engineers didn't think to test that specific use case, didn't test their capacitors, didn't test their QA process, didn't unit test their battery management code, or otherwise failed to indulge their "fetish" (read: job.)

Proper hardware tests might be pointless to end users, but that doesn't make them pointless.


> Err? It seems to me that the only tests that make any sense are those that count for end users. Everything else is pointless.

If you are specifically comparing two specific phones against each other then yes, this is the only sort of test that matters, and if you are comparing manufacturers' flagship models then that is what you are doing.

From every other point of view, either comparing the performance of individual components or comparing like-for-like performance of devices with very similar specs, it is less valid though: the screen resolution is a huge variable for a game, so it isn't a "fair" test to compare the performance of two devices like that. You could equally state that the higher resolution produces a better result because of the resolution difference, and you would similarly be called to task for not comparing like for like. Maybe a device with the same hardware other than the better screen would outperform the iDevice on the test, or at least not appear to underperform by as much; this test can't tell us that.

This is difficult of course because few phone models are practically identical and few vary by just one factor, and can make a developers life more difficult when trying to support a varied market like Android based devices.


If you're an engineer who's working on branch predictors, or a developer working on compiler optimisations, then such a test isn't exactly pointless...


This is a test of the entire phone, including the OS. The article focuses on the chip alone.


>That test is relevant for the end user, but it doesn't make sense as a hardware test. It's a different system, on different hardware, with different apps (that happen to share the same name), and essentially what you're comparing is resource loading strategies. iPhone is better at that - great.

Anything besides the concrete end-user experience of a system doesn't matter at all.

(In theory it could matter for CPU/compiler designers etc -- but even what matters to them is inconsequential if, at the end of the line, it doesn't matter to the end-user).


It should be noted most Android phones use a "Dual-Quad" design, with a bank of 4x A53 for low-power operation, and a dual or quad set of more powerful processors (A57, M1, etc) for full-power operation.

Still, it's interesting to see the single A9 with coprocessor outperform dual M1s - as was said about the reduced core count in the Snapdragon 820:

“The most important thing to have is peak single-threaded performance, as most of the time only one or two cores are active." [0]

This is apparently true, with the iPhone outperforming despite its A9 running up to 400MHz slower than the Android cohort's Snapdragon.

[0] http://trustedreviews.com/opinions/snapdragon-820-vs-snapdra...


Oh god it's one of those app opening videos isn't it?

Yep, it is. D: Well here is a tip that should be somewhat obvious: unscientific measurements of which phone loads apps faster when you click from the home screen aren't any basis to claim one phone "beats" the other, especially when we were talking about processor performance.

By the way, the iPhone beats the S7 on single-core and the S7 wins on multi-core performance. Which makes sense, seeing as you are comparing a dual core with an octa core.


It'd be really interesting to see a performance analysis of what's going on there. Is it just a weaker CPU or are the apps less optimised as well? The two phones were comparable right up to the time they loaded a game, and the Note just fell behind further and further from that point on. Why is loading that particular game so much slower? The developers didn't care to optimise it?

edit: The article points out that the resolution of the Samsung display is a lot higher, so it's not quite an apples-to-apples comparison here. I'm guessing that a much higher resolution is a pure win when drawing mostly vector graphics, i.e. apps, but really kills you when loading games, as presumably they're paging in much larger assets from disk.


Well for starters, Anandtech measured the NAND performance of the 6s as significantly faster than the Note 7's: 1.65x faster sequential 256K and 1.15x faster random 4K.


I doubt the assets are different; that would require the developer to include different assets for the various Android screen resolutions in the Android version of the app. Rather, it's the dynamic drawing engine that will be doing more work generating the screen images at the higher fidelity, because it works at the screen's physical resolution.


I believe separate APK files for separate resolutions may be a thing on the Play store, but I'm not entirely sure.


This is because of Samsung's garbage software. Years and years go by and every single time they manage to make the best hardware and slap a laggy and bloated version of Android on top of it. I'm sure that a Note 7 with a more Nexus-like build of Android is a dream for many.


While I prefer stock Android, the Note arguably benefits a lot from different software because it's tailored to the stylus and round edges. Other than that, they have already toned it down a lot compared to a few years ago, so I don't think the impact is that huge. I really love the Note 7 for the possibilities that the stylus brings, so the Samsung software is something I'd have to live with.


I've been a happy Samsung customer for the longest time, because I basically flashed their nice hardware with CyanogenMod.

My recently purchased S7 is the first cellphone I have ever returned to the store, because: 1) it's way too hard to flash their Exynos-based hardware, and 2) the default install comes with 14GB of garbage included. That includes the full Microsoft Office and a slew of garbage apps I'll never use (I'd include the launcher in that, but I understand some people like it, so I'll give it a pass).

Anecdata, but I have now become that guy who will tirelessly slap Samsung's garbage software "strategy" every time I have the opportunity.


They drastically pared down the software differences in the last version or two. Me from five years ago would be shocked to hear me say it, but it's gotten to the point that with the S6, I preferred the new stripped-down TouchWiz by a long shot to stock Android (TouchWiz has always had design advantages that stock Android sometimes incorporates, e.g. quick settings in the notification bar).


The Apple A9 is a quite sophisticated CPU; there is no reason to believe it is not using a state-of-the-art predictor. The Samsung CPU might not have any advantage at all in this area.


That's not too surprising. There is a lot more that runs on an Android phone usually.


Yeah, my first thought went to how Android will background anything when you hit the center button, while iOS will basically give it a few seconds to pack up state and then evict it. It's been that way since day one, and has very little to do with the hardware.


It doesn't really evict it. It pauses execution, but the program is still resident, so when you switch back to it, it'll continue to run from the same moment you left it. If another foreground app requests more memory, then yes, iOS might terminate background apps, but that doesn't necessarily happen. Also there's limited support for background tasks in modern iOS, so background apps might still drain battery and affect CPU (though probably not significantly).


> Also there's limited support for background tasks in modern iOS, so background apps might still drain battery and affect CPU (though probably not significantly).

Unless they do something abusive like keep the speaker or mic on, which then permits them to be more abusive like keep location services active, which then rapidly drains battery.

http://solutionowl.com/the-ultimate-guide-to-solving-iphone-...


What more is running on an Android phone than on other OSes? Seems like they have some fat to trim if there's more required to run their phone.


Even today, Android and iOS are not comparable in terms of what they allow apps to do, which is why iOS tends to win responsiveness comparisons and Android (still) tends to win feature comparisons, even years after Apple fell behind in the features race and started simply duplicating Android's featureset (multi-tasking, notifications drawer, gradually allowing more and more background operations etc).


Frameworks based on Java with GC and JIT for each process?


If you mean bloated Java runtimes, then we agree.


Android doesn't have any Java runtimes because it doesn't use Java! Java is merely the programming language used, but the programs are compiled into an Android-specific bytecode.



"Even with all those neural net advantages"

On the other hand, without a neural net, it might be even slower. Android supports tons of different architectures and platforms, of course you can't optimize it like Apple can with a single platform.


> Android supports tons of different architectures and platforms, of course you can't optimize it like Apple can with a single platform.

Not really. Sure Android can run on multiple architectures but 99% of the time it's mostly the same. Why couldn't they provide optimizations for multiple architectures anyway? Whenever something can run on more than 1 architecture I usually see the argument about how it can't be as optimized on any specific architecture but I don't see why not. Optimize for the 80% use case and you're probably covered. Go beyond that to get more performance out of other platforms but it's unlikely necessary.


"Why couldn't they provide optimizations for multiple architectures anyway?"

Because Android is usually developed by at least two companies: Google does the generic parts, and the OEM is responsible for making Android run smoothly on their hardware. Sure, given enough time and resources, you could optimize for every hardware combination possible, but that is not how things work in the real world, especially with release cycles as insane as in the smartphone market.


> Sure, given enough time and resources, you could optimize for every hardware combination possible, but that is not how things work in the real world, especially with release cycles as insane as in the smartphone market

You make it sound as if every version of CPU, GPU and chipset has to be optimized in every possible combination, but typically you do optimizations based on various CPU targets, optimizations based on GPU targets, etc.; the vast majority of smartphones use the same architecture and many end up targeting the same CPUs and GPUs (I mean, how many phones ended up using / still use the Snapdragon 820?).

Yes, they can't optimize for every combination, but optimizing for the typical CPUs and GPUs seems like something they're likely already doing. If the stack were collapsed and Google made everything from hardware to software, I'm not sure that they would necessarily be doing optimizations any differently at the hardware/driver level.


Yeah, the Note 7 is documented as being full of bloat affecting its performance. http://www.xda-developers.com/with-the-note-7-samsung-still-...


US versions of the Note 7 use a Qualcomm processor, not an Exynos. (supposedly to support Verizon and Sprint's CDMA networks)


This is not new, AMD has been using perceptron-based predictors probably even before Bulldozer. They are good, but harder to optimise for than standard rule based heuristic ones.


They should expose the NN coefficients to the end user so that one can at least start from a known-better value. This problem would seem analogous to JIT warmup.


The latter half of the article makes me wonder to what extent the cycle count of instructions is specified, or up to a specific implementation.


Hmm now I'm wondering if the brain does something like this? Might the brain need branch prediction? Interesting line of thought.


I have a background in machine learning, and it is my belief that the mind does use branch prediction. Take for example a scenario where you are climbing down the stairs in your house. Your brain already predicts that you are about to land on the next step. Now if the step height is lower than usual, the prediction fails and you immediately focus back on the climbing-down process to see what has happened and execute whatever fall-back actions are needed to minimize disruption.


Reminds me of what happens when you step onto an escalator that isn't working (the broken escalator phenomenon):

https://en.wikipedia.org/wiki/Broken_escalator_phenomenon


Nitpick: having a background in machine learning is only slightly relevant to belief on what the mind does. It's having a background in neuroscience that's important.


An analogy is basically every time you subconsciously predict something, but something else happens. You get surprised and confused which is the brain's analog of flushing the pipeline and calibrating the predictor. Indeed, the execution of most mundane tasks only bubbles up to the conscious level when something unexpected happens and you need to update your model of the world.

Disclaimer: as I said, this is an analogy. Brains are not literally executing instructions in a pipeline.


Why is it still slower than the OnePlus 3 and the HTC 10?


Samsung bloat?


The iPhone excels because the OS and HW are designed by one company, unlike Samsung, where the HW and OS are from different companies; on top of that, add Samsung's bloatware to Android.

iOS works well with the iPhone simply because it is fine-tuned for the product. On the Android side, the OS is tuned to add the bloatware of the respective HW manufacturers. To add the so-called speed, the HW manufacturers put in more RAM.


You do realise that Samsung designs and makes most of Apple's stuff, right?


What parts do Apple get exclusively from Samsung and depend on their expertise these days?

It seems like they go for redundancy, using two suppliers etc. for common parts and chips. Which means these are built based on Apple's designs; Samsung just manufactures them.

I don't follow closely though so I might be wrong.


I think that was the plan after the court fight with Samsung. But after some quality problems (IIRC with LG displays) Apple had to give up the dual-source approach for at least some components.

Now, the original iPhone was probably 80% Samsung design and manufacturing, but these days the design part is closer to 0%.



