TensorFlow Lite is TensorFlow’s lightweight solution for mobile and embedded devices! TensorFlow has always run on many platforms, from racks of servers to tiny devices, but as the adoption of machine learning models has grown over the last few years, so has the need to deploy them on mobile and embedded devices. TensorFlow Lite enables low-latency inference of on-device machine learning models.
Looking forward to your feedback as you try it out.
> Looking forward to your feedback as you try it out.
Thanks Rajat. We use typical Cortex-A9/A7 SoCs running plain Linux rather than Android. We would use it for inference.
1. Platform choice
Why make TFL Android/iOS only? TF works on plain Linux, TFL even uses the NDK, and it would appear the inference part could work on plain Linux as well.
2. Performance
I did not find any info on the performance of TensorFlow Lite; I'm mainly interested in inference performance. The tag "low-latency inference" catches my eye: just how low is low latency here? Milliseconds?
1. The code is standard C/C++ with minimal dependencies, so it should be buildable even on non-standard platforms. Linux is easy.
2. The interpreter is optimized for low overhead, and the kernels are currently better optimized, especially for ARM CPUs. Performance varies by model, but we have seen significant improvements on most models going from TensorFlow to TensorFlow Lite. We'll share benchmarks soon.
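In the meantime, here is a rough sketch of how you could measure per-inference latency yourself with the Python interpreter binding (tf.lite.Interpreter; availability depends on your TF version, and the model path is a placeholder):

    import time
    import numpy as np
    import tensorflow as tf

    # Placeholder path; substitute your own converted .tflite model.
    interpreter = tf.lite.Interpreter(model_path="mobilenet.tflite")
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()

    # Random input just to exercise the graph.
    dummy = np.random.random_sample(
        tuple(input_details[0]["shape"])).astype(input_details[0]["dtype"])

    # Warm up once, then time repeated invocations.
    interpreter.set_tensor(input_details[0]["index"], dummy)
    interpreter.invoke()

    runs = 100
    start = time.time()
    for _ in range(runs):
        interpreter.set_tensor(input_details[0]["index"], dummy)
        interpreter.invoke()
    print("avg latency: %.2f ms" % ((time.time() - start) / runs * 1000))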
We want to provide a great experience across all our supported platforms, and are exploring ways to provide a simpler experience with good acceleration on iOS as well.
Woah, this is cool. I've been waiting for this since you announced it. I was thinking about benchmarking it against other solutions. What do you think about similar frameworks like CoreML?
- As mentioned below, FlatBuffers makes startup faster while trading off some flexibility
- Smaller code size means trading off dependencies on some libraries and broader support, versus writing more things from scratch, focused on the use cases people care about
Do you have performance/memory comparisons from using FlatBuffers vs. protobuf in TF? A quick writeup on how the switch affected performance would be really interesting :)
We have had huge issues trying to figure out how to save models (freeze graph, etc.) and load them on Android. If you look at my previous thread, it also mentions bugs, threads, and support requests where people are consistently confused.
petewarden (https://news.ycombinator.com/item?id=15596990) from Google is also working on this, so I'm really hopeful you guys will have something soon. This is a serious blocker for doing anything reasonable in TF.
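For anyone else stuck here, a minimal freeze-graph sketch using the TF 1.x graph_util API (the checkpoint paths and output node name below are placeholders for your own graph):

    import tensorflow as tf

    # Restore your graph and variables, then convert the variables to
    # constants so the GraphDef is self-contained and loadable on device.
    with tf.Session() as sess:
        saver = tf.train.import_meta_graph("model.ckpt.meta")
        saver.restore(sess, "model.ckpt")

        frozen = tf.graph_util.convert_variables_to_constants(
            sess, sess.graph_def, output_node_names=["output"])

        with tf.gfile.GFile("frozen_model.pb", "wb") as f:
            f.write(frozen.SerializeToString())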
One nice thing about Lite is that it's a lot easier to include only the operations you need (compared to TensorFlow 'classic'), there's fusion for common patterns, and the base interpreter is only 70KB. That covers a lot of the advantages of using XLA for mobile apps. On top of that, you keep the ability to load models separately from the code, and the ops are hand-optimized for ARM.
I'm still a fan of XLA, and I expect the two will grow closer over time, but I think Lite is better for a lot of scenarios on mobile.
How about quantization? Does TensorFlow Lite perform the quantization, or is TensorFlow supposed to do it? Is it an iterative process or a straightforward one? Or are you training quantized models, as the NN API docs say?
The quantization is done with a special training script that is quantization aware. We will soon be open sourcing a quantized MobileNet training script to show how to do this.
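To give a rough idea of what "quantization aware" means (just an illustrative sketch, not the script itself): fake-quantization ops simulate 8-bit rounding during training so the weights adapt to the error they will see at inference time.

    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 224, 224, 3])
    w = tf.get_variable("w", [3, 3, 3, 32])

    # Simulate 8-bit quantization of the weights during training; the
    # min/max range here is an arbitrary example.
    w_q = tf.fake_quant_with_min_max_args(w, min=-1.0, max=1.0, num_bits=8)

    y = tf.nn.conv2d(x, w_q, strides=[1, 1, 1, 1], padding="SAME")
    # ... build the rest of the model and train as usual; the recorded
    # ranges can then be mapped onto real uint8 kernels by the converter.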
TensorFlow Lite is an interpreter, in contrast with XLA, which is a compiler. The advantage of TensorFlow Lite is that a single interpreter can handle several models rather than needing specialized code for each model and each target platform. TensorFlow Lite's core kernels have also been hand-optimized for common machine learning patterns. The advantage of compiler approaches is fusing many operations to reduce memory bandwidth (and thus improve speed). TensorFlow Lite fuses many common patterns in the TensorFlow converter. We are of course excited about the possibility of using JIT techniques and XLA technology within the TensorFlow Lite interpreter, or as part of the TensorFlow Lite converter, as a possible future direction.
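One concrete example of the kind of pattern fusion a converter can do is folding a batch norm into the preceding convolution's constants, so no separate BN op runs at inference. In NumPy terms (an illustration of the math, not the converter's actual code):

    import numpy as np

    def fold_batch_norm(w, b, gamma, beta, mean, var, eps=1e-3):
        # conv -> batch-norm collapses to a single conv with rescaled
        # per-output-channel weights and a new bias.
        scale = gamma / np.sqrt(var + eps)   # shape [out_channels]
        w_folded = w * scale                 # broadcasts over last axis
        b_folded = (b - mean) * scale + beta
        return w_folded, b_folded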
This should be possible, but we haven't tried it. We're likely going to add a simplified target that has minimal dependencies (like no Eigen) that allows building on simple platforms.
It might be a tad annoying, but the rest of TensorFlow uses Bazel, so it makes sense that TensorFlow Lite also uses it. It also probably matches the internal Google workflow better, since Google uses Blaze internally.
CoreML doesn't actually support TensorFlow. Its support for TensorFlow is only through Keras, which is fine if you just want to build stock-standard models, but if you're doing crazy research implementations then that's not going to work.
It's all in the converter tool: if the converter tool can turn the TF file into a .mlmodel properly, then it will be supported. Inside is just a bunch of weights, layers, and parameters. We just need a proper script to translate it.
"just a bunch of weights and layers and parameters" -- I think you and the GP are agreeing. That's the definition of standard: If the model can be expressed using the currently-blessed set of layer definitions in CoreML, then yes. But if you're doing nonstandard stuff with weird control flow behavior, or RNNs that don't map into some of the common flavors, then all bets are off.
An example: some of my colleagues put a QP solver in tandem with a DNN, so that the neural network could 'shell out' to the solver as part of its learning, and it learned to solve small sudoku problems from examples alone: https://arxiv.org/abs/1703.00443 The PyTorch code for it is one of the examples I like to use as a stress test for doing funky things in the machine learning context.
TensorFlow is a very generic dataflow library at its heart, which happens to have a lot of DNN-specific functionality as ops. It's possible to express arbitrary computations in it, whereas CoreML and similar frameworks make more assumptions that the computation will fit a particular mould, and optimize accordingly.
Looks like you are right; CoreML only supports these three DNN types: feedforward, convolutional, and recurrent. I suppose capsule nets are not any one of those, if they were implemented in TF.
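For what it's worth, the Keras-only path mentioned above looks roughly like this with coremltools (a sketch; the model path and tensor names are placeholders, and the exact arguments vary by coremltools version):

    import coremltools

    # Only "stock standard" layers convert cleanly; anything exotic
    # falls outside the supported layer set.
    mlmodel = coremltools.converters.keras.convert(
        "my_keras_model.h5",
        input_names=["image"],
        output_names=["probs"])
    mlmodel.save("MyModel.mlmodel")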
We developed TensorFlow Lite to be small enough to target really small devices that lack MMUs, like the ARM Cortex-M MCU series, but we haven't done the actual work to target those devices. That being said, we are excited when the ecosystem and community around machine learning expands.
You can deploy TensorFlow model binaries as serverless APIs on Google Cloud ML Engine [1]. But I would also be interested in seeing a TensorFlow Lite implementation.
Thanks, @rasmi. I have some feedback for you guys. The pricing for prediction inference on GCP is not very fair. If I deploy a small model (like a SqueezeNet or MobileNet), I pay almost the same price as someone deploying large models (like ResNet or VGG). That's why I'm deploying my models on serverless environments and paying about 5 dollars for 1 million inferences.
The pricing of GCP is: $0.10 per thousand predictions, plus $0.40 per hour. That’s more than 100 dollars for 1 million inferences.
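Rough math behind that figure, using the quoted rates (the node-hours value below is just an example assumption, not part of the quoted pricing):

    predictions = 1000000
    per_thousand = 0.10       # $ per 1,000 online predictions (quoted rate)
    node_hours = 24           # assumption: one node kept warm for a day
    per_node_hour = 0.40      # $ per hour (quoted rate)

    cost = predictions / 1000 * per_thousand + node_hours * per_node_hour
    print(cost)  # 100.0 + 9.6 = 109.6 dollars for a million predictions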
I see what you mean. To some companies, ML Engine's cost as a managed service may be worth it. To others, spinning up a VM with TensorFlow Serving on it is worth the cost savings. If you've taken other approaches to serving TensorFlow models to get around ML Engine's per-prediction cost, I'm curious to hear about them.
The main TensorFlow interpreter provides a lot of functionality for larger machines like servers (e.g. desktop GPU support and distributed support). Of course, TensorFlow Lite does run on standard PCs and servers, so using it on non-mobile/small devices is possible. If you wanted to create a very small microservice, TensorFlow Lite would likely work, and we'd love to hear about your experiences if you try this.
Thanks for the answer. Currently I'm using AWS Lambda to deploy my TensorFlow models, but it's pretty hard and hacky. I need to remove a considerable portion of the codebase that is not needed for inference-only routines. I do that so the code loads faster and to fit within the deployment package size limit.
If TensorFlow Lite already has a much smaller footprint, then it may be much easier to deploy to a serverless environment.
I’ll be trying it in my next deployments.
With TensorFlow and TF Lite we are looking to provide a great experience across all platforms, and are exploring ways to provide a simpler experience with good acceleration on iOS as well.
Yes to your first question, from the article: "As you may know, TensorFlow already supports mobile and embedded deployment of models through the TensorFlow Mobile API. Going forward, TensorFlow Lite should be seen as the evolution of TensorFlow Mobile, and as it matures it will become the recommended solution for deploying models on mobile and embedded devices."
This is pretty Android/iPhone-only; I wish it could be more flexible so it can be used on other edge devices, such as home routers or other embedded products where resources are constrained.
The current examples talk about Android/iPhone; however, the core runtime is quite lightweight, with the goal of supporting all kinds of embedded products.
Do let us know if you build/run on other platforms.
Quantization comes in many different forms. TensorFlow Lite provides optimized kernels for 8-bit uint quantization. This specific form of evaluation is not directly supported in TensorFlow right now (though it can train such a model). We will be releasing training scripts that show how to set up such models for evaluation.
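For context, this kind of 8-bit scheme represents a real value as scale * (q - zero_point). A quick NumPy illustration of the idea (not the actual optimized kernels):

    import numpy as np

    def quantize(x, scale, zero_point):
        # real value r is stored as uint8 q with r ~= scale * (q - zero_point)
        q = np.round(x / scale) + zero_point
        return np.clip(q, 0, 255).astype(np.uint8)

    def dequantize(q, scale, zero_point):
        return scale * (q.astype(np.float32) - zero_point)

    x = np.array([-0.5, 0.0, 0.7], dtype=np.float32)
    q = quantize(x, scale=0.01, zero_point=128)
    print(q, dequantize(q, 0.01, 128))  # values recovered to ~0.01 precision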
> “Google won't say much more about this new project. But it has revealed that TensorFlow Lite will be part of the primary TensorFlow open source project later this year”
How does this relate to other hardware beyond iOS/JavaScript, i.e. Raspberry Pi, NVIDIA Jetson, etc.? And what's the likelihood of libraries that sit on top of TF, like Keras and PyTorch, supporting this? Just some questions that spring to mind.
And we'll start to see more battery-efficient hardware to run those apps without consuming all of your battery. :)
(I'm saying that glibly, but I'm dead serious -- look at what we've seen emerge just this year in Apple's Neural Engine, the Pixel Visual Core, rumored chips from Qualcomm, and the Movidius Myriad 2. The datacenter was the first place to get dedicated DNN accelerators in the form of Google's TPU, but the phones -- and even smaller devices, like the "Clips" camera -- are the clear next spot. And this is why, for example, TensorFlow Lite can call into the Android NN API to take advantage of local accelerators as they evolve.)
Being able to run locally, if battery life is preserved, is a huge win in latency, privacy, and potentially bandwidth. It'll be good, though it does need advances in both the HW and the DNN techniques (things like MobileNet, but we need far more).