It does seem somewhat similar. However, Pickfu is going for written feedback from one person (or a small number of people), whereas the A/B testing idea would simply compare two photos and aggregate the results within seconds.
Pickfu also charges for their service, which I don't think would work for someone who just wants help choosing their wardrobe. This app would have to monetize in other ways (ads, or charging for gender/age breakdowns of the results, more results, the ability to compare more than 2 photos, etc.).
This is the part I don't understand. You would need a whole lot of scale, and a strong incentive, for people to give feedback this fast and this often. Jelly, for instance, can take a while to get a response.
Every time you upload your own photos, you are forced to rate other people's photos before you get your own response. This is the incentive.
Rating photos is very fast -- just look at two photos and tap one. All network lag can be eliminated by pre-caching the images. I'd bet the average rating speed is one per second (it's all snap judgements).
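To make the pre-caching idea concrete, here's a minimal sketch in Python (fetch_image is a hypothetical stand-in for a real HTTP download): the next pair starts downloading while the user is still judging the current one, so the tap-to-next-pair transition never waits on the network.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_image(url: str) -> bytes:
    """Hypothetical stand-in for a real HTTP download."""
    time.sleep(0.3)  # simulate network latency
    return b"<image bytes>"

pool = ThreadPoolExecutor(max_workers=4)

# Start downloading the NEXT pair in the background while the user
# is still looking at the current pair.
next_pair = [pool.submit(fetch_image, url)
             for url in ("photo_c.jpg", "photo_d.jpg")]

# ... user studies the current pair for a second and taps one ...
time.sleep(1.0)

# By now the background downloads have finished; .result() returns
# instantly, so the next pair appears with no perceived lag.
images = [future.result() for future in next_pair]
```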
As long as the app has 30+ people using it at every moment of the day, you'll always get at least 30 responses within 30 seconds. Plus I suspect people would enjoy voting and would spend time doing it even while not waiting on their own results, meaning you'd get more votes in than you put in.
You don't want to force people to do anything; otherwise it skews the results. Force someone to vote for 30 seconds, or to vote 10 times, and they'll just sit there tapping the first picture until they reach their target. You'll get lots of votes; they'll just be useless.
I think you'd get enough votes if the app was in the right format. It would have to be like Hot or Not, where you see 2 photos, pick one, and instantly get another 2 photos along with the results from the first set. This gets people trapped in the "just one more click" mindset.
Focus it on fashion and clothing: people will vote to see more photos, to judge the outfits of others (a lot of people enjoy doing this for fun), and to get inspiration for themselves.
Also, allow users to select what gender and age groups can vote on their photo.
they'll just sit there tapping the first picture until they reach their target. You'll get lots of votes; they'll just be useless.
Some people would do this, but as long as you randomize the order of the pictures, their data won't alter the winner, since each image would get a roughly equal number of bad votes.
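A quick simulation supports the intuition (a toy model, not data from any real app): if the left/right order is shuffled per vote, an "always tap the first picture" voter lands on each photo with equal probability, so both sides absorb the same noise in expectation, though as the reply below notes, that noise still dilutes the margin.

```python
import random

def simulate(honest_a=15, honest_b=2, lazy_voters=80, trials=10_000):
    """Toy model: lazy voters always tap the first picture shown, but the
    first position is assigned to A or B by a coin flip. How often does
    that noise actually flip (or tie) the honest winner, A?"""
    flips = 0
    for _ in range(trials):
        a, b = honest_a, honest_b
        for _ in range(lazy_voters):
            if random.random() < 0.5:  # shuffled order: A shown first
                a += 1
            else:                      # B shown first
                b += 1
        if b >= a:
            flips += 1
    return flips / trials

print(simulate())  # ~0.07 with these numbers: usually harmless, but not never
```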
You can also detect when people are doing this and start throwing out their votes, or just stop prompting them to vote and show them ads instead. I'm pretty sure this would be a minority: most people would understand that to get real results for their own photos, they need to give real votes on other people's.
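Detection could be as simple as tracking how often each voter picks whichever photo was displayed first; under randomized ordering an honest voter should hover around 50%. A minimal sketch (the threshold and minimum-vote values are illustrative guesses, not tuned numbers):

```python
def is_serial_first_tapper(first_position_picks: int,
                           total_votes: int,
                           threshold: float = 0.9,
                           min_votes: int = 20) -> bool:
    """With randomized ordering, an honest voter picks the first-shown
    photo about 50% of the time. A rate near 100% over enough votes is
    a strong sign the voter is just tapping through."""
    if total_votes < min_votes:
        return False  # too little data to judge fairly
    return first_position_picks / total_votes >= threshold

# e.g. 28 first-position picks out of 30 votes: flag, then discard or down-weight
print(is_serial_first_tapper(28, 30))  # True
```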
You could offset the "tap the first picture" effect by, e.g., randomizing the order the pictures are shown in, weighting the votes of serial "top clickers" less, etc.
Then you're just applying random votes, instead of actual votes.
For example, something that should be 15-2 votes ends up being 95-82. You'll be applying a large number of votes to both sides, pushing everything towards a 50/50 rating. This doesn't help anyone: the goal is A/B testing, and you're making it harder to get accurate data. 15-2 shows a lot of promise for A, but add 80 random votes to both sides and 95-82 looks like a tie.
Well, I didn't really outline a specific approach; I was just trying to suggest that there are techniques you could apply to the data after collection to solve the problem without modifying the fundamental premise of the app. For example, if you weighted the serial top clickers' votes less, in addition to randomizing the order the pictures are shown in, then their votes would count as 0.1 instead of 1. Suddenly your 95-82 moves back toward the 15-2, leaving you with something like 23-10 (15 + 80×0.1 vs. 2 + 80×0.1), which isn't as clear as 15-2 but is clearly significant.
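Working through the thread's own numbers, with an assumed junk-vote weight of 0.1:

```python
def weighted_total(honest_votes: int, junk_votes: int,
                   junk_weight: float = 0.1) -> float:
    """Honest votes count fully; votes from flagged serial clickers
    count at a reduced weight."""
    return honest_votes + junk_votes * junk_weight

# The 95-82 case: 15-2 honest votes buried under ~80 junk votes per side.
a = weighted_total(15, 80)  # 23.0
b = weighted_total(2, 80)   # 10.0
print(f"{a:g}-{b:g}")       # 23-10: noisier than 15-2, but A clearly still wins
```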
Instead of showing the user 95-82 or 23-10, you tell them "others preferred shirt A roughly 2-to-1", or "shirt A: 70%, shirt B: 30%", or whatever presentation is appropriate.
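A tiny formatting helper along those lines (the wording and rounding are arbitrary choices):

```python
def present_result(votes_a: float, votes_b: float) -> str:
    """Summarize a (possibly weighted) tally in the form users actually read."""
    winner = "A" if votes_a >= votes_b else "B"
    pct = round(100 * max(votes_a, votes_b) / (votes_a + votes_b))
    return f"Others preferred shirt {winner}, {pct}% to {100 - pct}%"

print(present_result(23, 10))  # Others preferred shirt A, 70% to 30%
```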
Anyway, again, I'm not suggesting any particular techniques are the right ones, just that there are (almost certainly) viable techniques available to mitigate the problem.
For example, something that should be 15-2 votes ends up being 95-82.
It's all in the presentation. If you just highlight one image and stamp "WINNER" next to it, most people won't even look at the numbers. Crowning a winner is more important than being scientifically accurate.