Faker is a Python package that generates fake data for you (joke2k.net)
207 points by yaph on Jan 22, 2014 | 36 comments



I love faker! It pisses me off a bit that the library is named faker but the package is fake-factory. It trips me up all the time.

I use it with factory_boy (http://factoryboy.readthedocs.org/en/latest/) to generate test fixtures that seem to make logical sense. Usernames are real names, birthdays are real dates, etc.

I think it helps with experimentation when you're using the REPL and also makes bugs stand out a bit more easily. Very neat for demoing purposes too.

Perhaps my favourite is faker.bs() which always gets a giggle when doing live demos.
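For anyone curious, a minimal sketch of the combination (assuming a recent factory_boy and Faker's Factory.create() entry point; the model and field names are illustrative, not anyone's real schema):

    import factory
    from faker import Factory as FakerFactory

    faker = FakerFactory.create()

    class UserFactory(factory.Factory):
        class Meta:
            model = dict  # stand-in; normally a Django/SQLAlchemy model class

        name = factory.LazyAttribute(lambda o: faker.name())
        email = factory.LazyAttribute(lambda o: faker.email())
        birthday = factory.LazyAttribute(lambda o: faker.date())

    user = UserFactory()  # every field is a plausible-looking fake value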


It's probably fake-factory to prevent confusion with the Ruby gem that's been around for more than 6 years: http://www.rubyinside.com/faker-quick-fake-data-generation-i...

It's a standard gem in the community and it's used for generating various types of seed database data. I don't know if it does the exact same things that the Python faker does.


I used fake-factory because the name faker was already taken on PyPI. Initially I asked the author to give up the name, but the response was negative.


Nice, fake.bs() looks pretty useful, as does fake.catchPhrase().


If I ever start a company, I'm going to run fake.company() and fake.companySuffix() until I find one that looks good.


I was just about to post asking why the package is called fake-factory.


I made a tool like this for my company in Ruby (it wasn't nearly as mature as this). The largest challenge I struggled with (and never really solved) is that ultimately, there's no way to generate data as useful as real data. The value of real data comes from the fact that it's messy. Real data is different sizes than you expect[1], collides with your sentinel values[2], and comes in with unexpected encodings[3]. And sometimes people will enter data that is intended to break your system[4][5].

The value of testing with real data is that it doesn't conform to your assumptions.

As far as I can tell, this benefit is impossible to fake with a system that generates fake data algorithmically. Generated data conforms to the assumptions of the system that generated it and therefore can only be used to test that a system conforms to those assumptions.

Fake data is still useful. Volume is often important (does your database slow down or crash when there are 10 billion records?). And if your fake data has very few assumptions, you can use that to reduce the assumptions made by the system you're testing.

Nevertheless, I'd really like to see a system like this which integrates data from some sort of general-purpose real dataset. Ideally it would be configurable so that people can document and choose a 99% use case they want to support (for example, a US company might want to support long names, but might not get a ton of value from supporting names with Chinese characters).

[1] http://jalopnik.com/this-hawaiian-womans-name-is-too-long-fo...

[2] http://www.snopes.com/autos/law/noplate.asp

[3] http://www.joelonsoftware.com/articles/Unicode.html

[4] http://en.wikipedia.org/wiki/Buffer_overflow

[5] http://xkcd.com/327/


I'd suggest looking into fuzzers. Short version - tools designed to input messy, non-conforming data to ensure the inputs don't cause problems, that things are sanitized correctly, etc. At this point they are a mature technology, with improvements constantly being researched. They are generally thought of as security tools[1], but are very useful for basic development too.

[1] The common use of fuzzers in a security context is to send malformed packets to protocol parsers to see if they fall over, cause buffer overruns, or otherwise do fun things in the context of exploiting a system. Another common one is automatic SQL-injection discovery tools.
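As a rough illustration, a toy mutation fuzzer is only a few lines of Python (the parse function here stands in for whatever input handler you're testing; this is a sketch, not a real tool):

    import random

    def mutate(data, flips=8):
        # Flip a few random bits in an otherwise valid sample input.
        buf = bytearray(data)
        for _ in range(flips):
            i = random.randrange(len(buf))  # assumes a non-empty seed input
            buf[i] ^= 1 << random.randrange(8)
        return bytes(buf)

    def fuzz(parse, seed_input, rounds=1000):
        for _ in range(rounds):
            try:
                parse(mutate(seed_input))
            except ValueError:
                pass  # clean rejection is fine; crashes and hangs are the findings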


A quick search gave me this list: http://www.infosecinstitute.com/blog/2005/12/fuzzers-ultimat... Is there a notable fuzzer missing? It's a pretty long list; does anyone know which of these tools are really worth checking out?


That's a pretty old list. Just to name one, I would recommend taking a look at Radamsa:

https://www.ee.oulu.fi/research/ouspg/Radamsa

...from the University of Oulu. It's more like a framework for generating intelligent fuzzers than a shrink-wrapped product, though.

The OUSPG guys are really good at fuzzing. There is also a commercial spin-off, Codenomicon, whose tools are quite widely used.


Crude fuzzing can be done on the command line using dd.

The command "dd if=/dev/urandom bs=1000 count=1" will spit out 1 KB of pseudorandom data you can pipe, POST or otherwise send to your application. (GNU's implementation lets you use "1K" as well.)



You might be interested in something like https://github.com/buger/gor


I'm not a huge fan of the 'factory' singleton pattern used here.

I can see how explicit execution of the startup code is a good thing, but can't help thinking how much better the experience would be if it just lazy-loaded the same code.

Am I missing something obvious that would prevent this? Bad magic?
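For what it's worth, a lazy singleton is only a few lines. A sketch of the idea, wrapping the library's existing Factory.create() (the proxy class here is hypothetical, not part of Faker):

    class LazyFaker(object):
        _instance = None

        def __getattr__(self, name):
            # Build the real generator on first attribute access.
            if LazyFaker._instance is None:
                from faker import Factory
                LazyFaker._instance = Factory.create()
            return getattr(LazyFaker._instance, name)

    fake = LazyFaker()
    fake.name()  # Factory.create() runs here, on first use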


Has anyone ever done a project like this that builds test databases from larger production databases? I mean something that can look at a prod DB, model the data in the columns and the relationships between tables, and produce a much smaller test DB that has the same statistical properties. For numeric columns you could just fit a statistical distribution and sample from that. For names, you could look at the frequency charts for the first, second, third... letters and sample according to that; I believe this technique is a Markov chain (a toy sketch below).
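Something like this character-level Markov chain sketch (training data and chain order are made up for illustration):

    import random
    from collections import defaultdict

    def build_model(names, order=2):
        # Record which character follows each length-`order` prefix.
        model = defaultdict(list)
        for name in names:
            padded = '^' * order + name.lower() + '$'
            for i in range(len(padded) - order):
                model[padded[i:i + order]].append(padded[i + order])
        return model

    def sample_name(model, order=2, max_len=12):
        state, out = '^' * order, []
        while len(out) < max_len:
            nxt = random.choice(model[state])
            if nxt == '$':
                break
            out.append(nxt)
            state = state[1:] + nxt
        return ''.join(out).capitalize()

    model = build_model(['alice', 'albert', 'alfred', 'amanda', 'amelia'])
    print(sample_name(model))  # prints a name-like blend of the training set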

In Andrew Ng's Machine Learning class he talked about taking labeled images and expanding the set by inverting, shearing, flipping, distorting them etc. He called the technique 'data synthesis'.

The test data problem has been hampering my team's ability to create maintainable automated tests.


I've been working on such a project for a couple years as a solo/on-the-side thing. It's not available as a turnkey product yet, but the technology is working and can build arbitrary amounts of realistic test data (based either on your private data sets, or public sources like census data). I'd love to begin working with others on how exactly to integrate this into their test and development workflows. Email me at ken.woodruff@gmail.com if you'd like to discuss further.


After a quick look, it seems this is only a simple randomizer. A good generator produces data that isn't too random, with the right internal correlations and ranges.

I used to maintain one over 15 years ago.

At a minimum, the city, street address, postal code, and telephone number had to be internally linked. Those are things that can be easily and automatically checked, so those constraints also need to be enforced when generating data. It's also silly to give a flat address in an area where there aren't any flats, etc. A 30th floor in the countryside? Oh yeah. A distance-based address downtown? Just as silly.
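For illustration, a toy sketch of what keeping fields internally linked looks like (the data and field names here are made up):

    import random

    CITIES = {
        'Helsinki': {'postcodes': ['00100', '00120', '00130'], 'phone_prefix': '09'},
        'Oulu':     {'postcodes': ['90100', '90120'],          'phone_prefix': '08'},
    }

    def fake_address():
        city = random.choice(list(CITIES))
        info = CITIES[city]
        return {
            'city': city,
            'postcode': random.choice(info['postcodes']),  # postcode matches the city
            'phone': info['phone_prefix'] + str(random.randint(1000000, 9999999)),
        }

A pure randomizer would draw each field independently, and would fail exactly the automated cross-checks described above.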


The name looked familiar and it's indeed inspired by the Faker library for PHP: https://github.com/fzaninotto/Faker

Another interesting library that is built on top of Faker is Alice. It allows you to define complex fixtures in .yml: https://github.com/nelmio/alice


Aren't they all kind of inspired by the original Perl implementation? I use the Ruby one for testing.

Something kind of similar and worth thinking about is this:

http://en.wikipedia.org/wiki/QuickCheck


I was just about to comment the same thing about QuickCheck. Arbitrary data galore.

I've only done serious work with QuickCheck in Haskell but here's the python implementation I've played with: https://pypi.python.org/pypi/pytest-quickcheck/
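Even without a library, the core idea is tiny. A hand-rolled sketch in the QuickCheck spirit (the property here is just an example):

    import random
    import string

    def random_text(max_len=50):
        return ''.join(random.choice(string.printable)
                       for _ in range(random.randrange(max_len)))

    def test_roundtrip_property():
        # Property: UTF-8 encode/decode returns the original string,
        # checked against hundreds of arbitrary inputs rather than a few fixtures.
        for _ in range(200):
            s = random_text()
            assert s.encode('utf-8').decode('utf-8') == s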


> Aren't they all kind of inspired by the original Perl implementation?

The original Perl implementation is Data::Faker (https://metacpan.org/pod/Data::Faker). The earliest version available on CPAN appears to be from 2005.


I've got a semi-similar python library that I've been adding to for a few years called testdata: https://github.com/Jaymon/testdata

testdata has a lot of Unicode and file-system stuff I've found really useful; it looks like between this and testdata I'll be in generated-data heaven :)


I've used the Ruby version of Faker to do fuzz/property/quickcheck-style testing in Ruby. I believe this to be an incredibly important, under-recognized form of testing. Faker is not the best tool for this as you really need more sources of randomness than it provides, but it's not a bad start.

The best places to learn are from the canonical libraries, quickcheck in Haskell, Quviq in Erlang, simple-check in Clojure, and there are others.

The challenge with all of these methods is that you want some notion of referential transparency in order to make useful properties. You can at least do that in certain contexts for certain expressions in Ruby and doing so will improve code readability.

I'd love to hear from others with experience using these techniques in Ruby or Python.
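In Python, a minimal sketch of the Faker-as-randomness-source idea (slugify here is a stand-in for whatever pure function you want to check):

    from faker import Factory

    fake = Factory.create()

    def slugify(name):
        return '-'.join(name.lower().split())

    def test_slugify_is_idempotent():
        # Referential transparency is what makes the property checkable:
        # applying slugify twice must equal applying it once.
        for _ in range(100):
            name = fake.name()
            assert slugify(slugify(name)) == slugify(name)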


Test data serves multiple purposes. Boundary conditions (long fields, Unicode, etc.) are important for invalidation testing (i.e. testing to break your code and functionality).

Testing at scale is important for performance and for predicting bottlenecks as you grow (i.e. testing to break your system's capacity).

It can be difficult to generate good quality test data at scale, and data based on your specific schema.

This is how http://goodtestdata.com/ came about. It has the building blocks of core data, and new sources can be built on request.


Awesome! I needed something just like this a couple of years ago. A couple of comments:

- I wanted fake CC numbers and SSNs/other national IDs at the time (don't remember why). I see that Faker is missing those, so they might be useful additions to the library.

- Method names should be snake_case rather than camelCase (http://www.python.org/dev/peps/pep-0008/#method-names-and-in...).


Actually, it looks like this doc is out of date. The README on Github shows that method naming has been corrected, and CC numbers have been added: https://github.com/joke2k/faker


I wrote a much simpler Python version a few years ago:

https://pypi.python.org/pypi/Phony/0.5.0

I don't recommend my version: it isn't maintained and isn't complete. However, writing a faker-type library is a great way to learn a language: you learn about how to organize code, how it handles different types, and how to package it up for use.


I created a .NET port[1] of the Ruby port of the original Perl Faker library. It's also available on NuGet[2]

[1] https://github.com/slashdotdash/faker-cs

[2] http://www.nuget.org/packages/Faker.Net/


Yet another entry in the "I made something similar to this" category, although the tool I wrote isn't as feature-rich as this one.

https://github.com/mindcrime/dummydatagenerator


I wrote a similar package in JavaScript, for the browser or for Node, that I called Chance:

http://chancejs.com/

https://github.com/victorquinn/chancejs


We use something very similar in node called Charlatan: https://github.com/nodeca/charlatan

Very very useful in any build.


Thanks for posting this, I was unaware of the various Faker implementations. I had often considered implementing a similar lib, but never invested the time. Now I don't have to!


Inspired by faker, I created https://github.com/alexmic/mongoose-fakery for Mongoose.


The amount of batteries included with this library is impressive.


I wish it had support for more locales, especially more Asian locales (ja_JP would be really useful).


This sort of looks like Faker.js and I remember we used it to populate a table with fake data so we could fill it up and demonstrate the app. It was very useful.



