Build Your Own Mobile Proxy for Web Scraping (scrapingfish.com)
185 points by mateuszbuda on Aug 17, 2022 | 31 comments



You can buy SIM boxes on Alibaba[0], which are routers with 12 to 64 SIM cards. Each SIM card has its own port, which you can use in the config of, for instance, Playwright or Scrapy. We use one in production for frontend testing and some scraping.

It works perfectly, but it’s not cheap, and keeping track of which SIM cards are prepaid or on subscriptions, along with their data usage, is a bit of work.

Resetting a SIM usually gives a new IP, and in our experience the scraped party rarely blocks an IP, because IPs in the mobile world are shared among many users.

[0]https://a.aliexpress.com/_m0XnDLm
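A minimal sketch of the one-port-per-SIM setup described above, assuming the box exposes each SIM as an HTTP proxy on consecutive ports. The host `192.168.1.50` and first port `8001` are hypothetical placeholders; the real values depend on the device.

```python
from itertools import cycle


def simbox_proxies(host, first_port, sim_count):
    """Return one proxy URL per SIM slot, assuming consecutive ports."""
    return [f"http://{host}:{first_port + i}" for i in range(sim_count)]


# Hypothetical address and port range; check the admin page of your device.
PROXIES = simbox_proxies("192.168.1.50", 8001, 4)
_rotation = cycle(PROXIES)


def next_proxy():
    """Round-robin over the SIM ports so consecutive requests use different SIMs."""
    return next(_rotation)
```

In Scrapy the URL would go into the request as `meta={"proxy": next_proxy()}`; in Playwright it would be passed at launch as `proxy={"server": next_proxy()}`.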


Those are absolutely overpriced crap. We scrape about 2T a month using mobile proxies, and testing those put us in a month-long nightmare of crashing, errors, connection drops, etc.

There are much better solutions out there.


What other solutions do you suggest? My experience with 4G dongles is that they are really unstable (even buying the Proxidize hardware + software license).

In the end, the solution that has worked best for me has been VPN + 5G phones. But it would be great to get rid of the batteries ...


Keep in mind that this doesn't need to be a Raspberry Pi and can be any computer. With some networking magic (and potentially network namespaces) you can also run multiple dongles & proxies on the same machine. A bit of hardware engineering will allow you to decouple the SIM card from the modems themselves, so SIMs can be kept at a central location and assigned to modems at will, potentially automatically depending on remaining credit/carrier blocks/rate-limiting/etc.


How would you decouple the sims from the modems? Would it just be about reading some info off the sim and then passing it to the modem?


One of the Osmocom projects[0] does precisely that.

[0] https://osmocom.org/projects/osmo-remsim/wiki


Yes, RPi is totally optional. This article just provides an example setup.


Have you tried probing for 4G provider(s?) IP renew limits and IP allocation space size?

Things I can think of that would be interesting to know:

  1. How uniformly distributed are the IP addresses you're getting?
  2. The size of the IP addresses allocation space, assuming a fixed geographic location (I imagine this can vary a lot between providers and regions)
  3. How often can an IP be requested?
  4. Is there some sort of rate limiting? If so, how does it manifest itself (Sticky IP? slow reconnect? Something else?)?
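One possible way to start answering the first two questions: collect the exit IP after each reconnect, then summarize how many were unique and how they spread over subnets. This is only a sketch; the function name, the /24 grouping, and the sample addresses (taken from documentation ranges) are my own assumptions, not measurements.

```python
from collections import Counter
import ipaddress


def summarize_ips(observed, prefix_len=24):
    """Summarize IPv4 addresses collected after successive reconnects."""
    unique = set(observed)
    # Group every sample by the subnet it falls into (host bits are masked off).
    subnets = Counter(
        str(ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False))
        for ip in observed
    )
    return {
        "samples": len(observed),
        "unique_ips": len(unique),
        # Fraction of samples that were a repeat of an address already seen.
        "repeat_rate": 1 - len(unique) / len(observed) if observed else 0.0,
        "subnets": dict(subnets),
    }


# Made-up sample using documentation address ranges, not real carrier data.
sample = ["203.0.113.7", "203.0.113.9", "203.0.113.7", "198.51.100.20"]
summary = summarize_ips(sample)
```

A high repeat rate over many samples would suggest a small allocation pool or sticky IPs for that provider and location.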


The distribution of IP addresses depends to a large extent on the 4G provider. Some of them reuse IPs, and you can get the same IP after you change network mode 4G > 3G > 4G. The same applies to the IP address allocation space. You can request an IP change as often as you want. Resetting network mode takes 5-10 seconds. There is no rate limiting. From time to time reconnection fails, but it has nothing to do with rate limiting. Sticky IP is an issue for some 4G providers but, again, it has nothing to do with rate limiting.
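Since a reset can hand back the same IP or fail outright, a rotation helper probably wants to retry until the address actually changes. A rough sketch, assuming you supply two callables of your own: `get_ip` (returns the current exit IP and raises `OSError` while the link is down) and `reset_network_mode` (triggers the 4G > 3G > 4G switch). Both names are hypothetical.

```python
import time


def rotate_ip(get_ip, reset_network_mode, max_attempts=5,
              settle_seconds=10.0, sleep=time.sleep):
    """Reset the modem until the exit IP differs from the current one.

    Returns (new_ip, attempts_used) or raises RuntimeError.
    """
    old_ip = get_ip()
    for attempt in range(1, max_attempts + 1):
        reset_network_mode()
        # The reset reportedly takes 5-10 seconds, so wait before checking.
        sleep(settle_seconds)
        try:
            new_ip = get_ip()
        except OSError:
            continue  # reconnection failed this time; reset again
        if new_ip != old_ip:
            return new_ip, attempt
    raise RuntimeError(f"IP still {old_ip} after {max_attempts} resets")
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.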


Keep in mind that this behavior will be a significant outlier in ISP analytics - they will notice you eventually, just like they’ll notice 10 seemingly synchronized Huawei modems connecting to (and disconnecting from) the same base station.

That’s all to say - be careful using this for any nefarious activity. It’s not inconspicuous.


In Australia it's not possible [1] to get SIM cards without identification. So getting on the radar of the security folks at your telco could easily end with you never being able to use that telco again.

[1] Perhaps it is but I don't know of a way.


Or just use the free version of https://proxidize.com or their Android app https://github.com/proxidize/proxidize-android


> Residential: provide IP addresses from Internet Service Providers (ISP) pool that are shared with other users

> Mobile: the best class of proxies for web scraping that is based on ephemeral IP addresses which are frequently exchanged with mobile network users who move between Base Transceiver Stations (BTS)

I can change my home IP address whenever I want... just change my router's MAC address and reboot the modem.


I'm not an ISP but I assume your previous IP addresses aren't released immediately so you could end up impacting the ISP's business by using up too many IP addresses.

If they notice, you might get banned. You could argue that you're not violating the terms of service but it wouldn't get you very far.


I was wondering, is it possible to use these 4G USB modems to make voice calls over VoLTE? Can somebody direct me to resources that make it possible?


Kind of.

The modem can attach to the IMS APNs, however you'll need to do SIM-specific authentication which requires being able to send the right APDUs to the SIM. Some modems expose the SIM card in one way or another (either an AT command to send APDUs and get responses or as an actual USB smartcard reader) and it would be possible with those to do the authentication flow and then register onto the IMS server (which speaks SIP).

Alternatively, if you're happy with circuit-switched calls, some modems will expose the raw audio (as PCM data) as a separate serial port from which you read/write, and control the call flow using AT commands on the main serial port. Some modems need this to be first enabled via a poorly-documented AT command, and some might need authentication (security by obscurity) to do so which software such as DC Unlocker can break.
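For the circuit-switched route, the call-control part is just text on the main serial port. A small sketch of the two standard commands involved, `ATD<number>;` (the trailing semicolon requests a voice call) and `ATH` to hang up. Opening the serial port, the PCM audio port, and any vendor-specific command to enable audio are left out because they differ per modem; the helper name is my own.

```python
ALLOWED_DIAL_CHARS = set("0123456789+*#")

# Standard hang-up command; some modems may expect a different one.
HANG_UP = "ATH\r"


def dial_command(number):
    """Build the AT command that starts a voice call to `number`."""
    if not number or not set(number) <= ALLOWED_DIAL_CHARS:
        raise ValueError(f"unexpected characters in number: {number!r}")
    # The ';' after the number asks for a voice call rather than a data call.
    return f"ATD{number};\r"
```

The returned strings would be encoded and written to the modem's control port, with audio read from and written to the separate PCM port if the modem exposes one.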


In addition to the EAP-AKA' authentication, the strange and highly opaque ways through which VoWiFi support is signalled and enabled might be an issue here too.

I've seen brand new Pixel devices (which fully support VoWifi) act like they have never heard of the feature, until a special cell broadcast SMS is received from the carrier, to turn the feature on.

It would be interesting to see if you could manually tick all the boxes and open up a tunnel to the ePDG etc without that happening, but given the number of moving parts, and the differences between each network's implementation, I'd be sceptical it would work reliably.

(I'm pretty sure I had a situation where individual batches of IMEIs were being whitelisted for WiFi calling on another network, where the switch appeared, but VoWiFi didn't work... Yet on the Nexus device in question, one bought from the carrier directly did work. Even when you manually flashed both devices with identical firmwares from Google directly, this remained the case - must have been an IMEI whitelist or similar).


Ah I see, so I did find someone that tried to do this here [1], but it seems like they were not even able to connect to the SIP server. Bummer, I wonder if someone else has had success.

1. https://worthdoingbadly.com/vowifi/


Why don't you let us know what you really intend to do here?


Just making voice calls from my laptop lol, or better over internet from a RPI connected to this modem.


You can already do this with 4G/LTE dongles; a proxy seems overkill.


Examples?

https://laforge.gnumonks.org/blog/20170902-cellular_modems-v...

comes to mind. Regardless of age.


I wasn't aware, and most of them do not advertise such a feature. It would be great if you could point me to one that allows this out of the box.


What is this sorcery?

Just use an old android phone.

It was just spam, sorry.



Is there a way to use this/scrapingfish with puppeteer/playwright?


Yes, it's implicitly stated in the article:

> curl --proxy 192.168.0.10:2000 https://eth0.me

Just use this proxy + port in your playwright config.
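A sketch of what that could look like with Playwright's Python API, reusing the `192.168.0.10:2000` address from the curl example. It assumes the proxy speaks plain HTTP (as curl does by default for `--proxy`); the helper names are made up.

```python
def playwright_proxy(host="192.168.0.10", port=2000):
    """Build the `proxy` argument accepted by Playwright's browser launch."""
    return {"server": f"http://{host}:{port}"}


def fetch_exit_ip(proxy):
    """Open https://eth0.me through the proxy and return the IP it reports."""
    # Imported here so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(proxy=proxy)
        page = browser.new_page()
        page.goto("https://eth0.me")
        ip = page.inner_text("body").strip()
        browser.close()
        return ip
```

Calling `fetch_exit_ip(playwright_proxy())` should then return the mobile IP rather than your own, if the proxy is reachable.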


Scraping Fish uses puppeteer/playwright (headless browsers) connected to our mobile proxy pool under the hood.


What would the total cost be for everything needed?


That's not easy to estimate, depends on your location and if you want to use RPi.

A (very) rough estimate is ~$40 per dongle, RPi starts from $35, totalling ~$115. You'd also need to buy two 4G plans.
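The arithmetic behind that figure, as a tiny sketch for plugging in your own prices. It assumes two dongles (matching the two 4G plans) and leaves the plans themselves out, since their cost is not given.

```python
def setup_cost(dongles=2, dongle_price=40, rpi_price=35):
    """One-off hardware cost in USD: dongles plus one Raspberry Pi."""
    return dongles * dongle_price + rpi_price


hardware_total = setup_cost()  # 2 * 40 + 35
```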


Good insight!



