The barrier to entry for web scraping is higher than ever. Gone are the days
of simple scripts and easy-to-scrape sites, replaced by JavaScript web apps
and readily available anti-bot tech everywhere. Plus, there's a whole new enemy: AI.
Let's talk about it. Over the last 5 years, I've scraped millions of lines of
data using many different technologies and methods. And I want to share my
thoughts about modern web scraping, what works, what doesn't, and how you can
still be effective in extracting data from the web. I want to cover what's
being made obsolete, what's working and working well, and why AI isn't the
answer you so desperately want it to be, and how it's potentially detrimental to
the industry. So, I definitely think that web scraping is getting harder, and
I think a lot of people are actually ignoring that fact. The definition of
insanity is doing the same thing over and over and expecting different
results. So trying to scrape modern sites with just requests and random
proxies will make you want to bang your head against the wall. We need to be
cleverer with our approach. Modern scraping requires a wider set of techniques and tools. There's much more to consider, like full browser headers rather than just a user agent, and TLS and browser fingerprints instead of just a random thing you've thrown together.
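Just to show the difference, here's a rough sketch; the URL and header values are placeholders you'd replace with ones copied from your own browser's dev tools:

```python
import requests

# A lone User-Agent like this is an instant red flag on protected sites.
bare_headers = {"User-Agent": "Mozilla/5.0"}

# A fuller, more consistent set copied from a real Chrome session
# (values are illustrative; grab the actual ones from your own browser).
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.google.com/",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "cross-site",
    "Upgrade-Insecure-Requests": "1",
}

resp = requests.get("https://example.com/", headers=browser_headers, timeout=15)
print(resp.status_code)
```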
But more than ever, a single script just won't cut it. It may work for a few rounds, but try and scale it up and you'll quickly get knocked out. I think a scraper's best friend now is good, clear logging, error handling, and well-thought-out retries. This just means a better understanding of code and coding practices is needed, along with a good understanding of how websites actually work.
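Something like this is the kind of skeleton I mean; the URL and retry numbers are just placeholders:

```python
import logging
import time

import requests

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("scraper")

def fetch(url, retries=3, backoff=2.0):
    """Fetch a URL with clear logging and exponential backoff on failures."""
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, timeout=15)
            if resp.status_code == 200:
                log.info("OK %s (attempt %d)", url, attempt)
                return resp
            log.warning("Got %d for %s (attempt %d)", resp.status_code, url, attempt)
        except requests.RequestException as exc:
            log.error("Request failed for %s: %s (attempt %d)", url, exc, attempt)
        time.sleep(backoff ** attempt)  # back off before the next try
    log.error("Giving up on %s after %d attempts", url, retries)
    return None

page = fetch("https://example.com/some/listing")
```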
You see, I always used to try and force Selenium to scrape sites until I stumbled upon a Reddit post talking about a problem I was having. In true
Reddit fashion, the answer was extremely condescending, but it did have good
information. From then on, the first place I look when I'm scraping is the network section of the dev tools, to see if I can find the site's backend API. I can use it to get the data, usually in JSON format with no parsing needed.
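For example, if the Network tab (XHR/Fetch filter) shows the page pulling its data from a JSON endpoint, you can often just call it directly. The endpoint and fields below are made up purely for illustration:

```python
import requests

# Hypothetical endpoint spotted in the dev tools Network tab (XHR/Fetch filter).
API_URL = "https://www.example.com/api/v1/products"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "application/json",
    # Some APIs also expect the same Referer or X-Requested-With the site sends.
    "Referer": "https://www.example.com/products",
}

resp = requests.get(API_URL, params={"page": 1, "per_page": 48},
                    headers=headers, timeout=15)
resp.raise_for_status()

# Structured data straight from the backend, no HTML parsing needed.
for item in resp.json().get("products", []):
    print(item.get("name"), item.get("price"))
```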
But as scrapers advance, so does the anti-bot tech, and it's now better than
ever. And it's not only better, but it's much more widely available. And often a
good level of protection is included in a free tier. But most tutorials just
don't talk about this, leading beginners down the wrong path of trying to parse messy and obfuscated HTML instead of looking in the places that they should be. My second point is that the scraper's toolkit just feels like it's becoming obsolete, and as anti-bot tech and blocks get more common, we need to update the tools
we use to compensate. I mentioned fingerprinting, and this is by far the most useful and easiest change to make. Swapping out requests for curl_cffi or rnet means we have a modern, built-for-scraping HTTP client that can send real-looking browser fingerprints to keep us out of that blocking zone. And this is now essential; plain requests just won't cut it.
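The swap really is minimal. Here's a sketch with curl_cffi, assuming the generic "chrome" impersonation target that recent releases support:

```python
# pip install curl_cffi
from curl_cffi import requests

# One keyword argument gives the request a TLS fingerprint that matches a real
# Chrome build instead of a generic Python client.
resp = requests.get("https://example.com/", impersonate="chrome", timeout=15)
print(resp.status_code)
```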
And if you're thinking you can just use a browser, well, the same applies tenfold. It's so easy to spot a headless browser. I haven't touched Playwright in months as it's so unreliable. The amount
of data a website can get about you from your browser is just staggering: the extensions you've installed, what fonts you use, where you're based, the way it renders things. All this allows a profile to be built up about
you. A common check that I think people neglect is that your browser's time zone
doesn't match your proxy's time zone, and that's just a dead giveaway. So, we
need to be more careful about proxies, sessions, cookies, and browser
fingerprints. But out of a flower emerges, and the community has come out
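One way to keep that identity consistent across a crawl is to pin everything to a single session; here's a rough sketch with curl_cffi, using a placeholder proxy URL:

```python
# pip install curl_cffi
from curl_cffi import requests

PROXY = {
    "http": "http://user:pass@proxy.example.com:8000",   # placeholder proxy
    "https": "http://user:pass@proxy.example.com:8000",
}

# One session = one identity: the same proxy exit, the same cookie jar and the
# same browser fingerprint for every request, rather than a random mix each time.
session = requests.Session()

home = session.get("https://example.com/", impersonate="chrome", proxies=PROXY)
listing = session.get("https://example.com/items", impersonate="chrome", proxies=PROXY)
print(listing.status_code)
```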
But out of it all, a flower emerges, and the community has come out with some great scraping-specific tools that are a fantastic first step to avoid being blocked. Here's what I use now. My HTTP client is rnet. It's written in Rust for Python by a large contributor from the Rust community, so my hope is that it stays updated and in active development. Another good option is curl_cffi. Both of these can make async HTTP requests with a TLS fingerprint that matches a common browser.
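Here's roughly what that looks like with curl_cffi's async session (the URLs are placeholders; rnet has its own async API, so check its docs for the exact calls):

```python
import asyncio

from curl_cffi.requests import AsyncSession

URLS = [f"https://example.com/page/{n}" for n in range(1, 6)]  # placeholder URLs

async def main():
    # Every request goes out concurrently, each with a Chrome-like TLS fingerprint.
    async with AsyncSession() as session:
        tasks = [session.get(url, impersonate="chrome") for url in URLS]
        responses = await asyncio.gather(*tasks)
        for resp in responses:
            print(resp.url, resp.status_code)

asyncio.run(main())
```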
For actual browser automation, there are two I recommend: Camoufox, and nodriver plus its fork, zendriver.
These are both modern browser automation libraries with more stealth capability than I've seen anywhere else in the open-source world, and as a direct replacement for Selenium or Playwright, they're definitely worth giving a try.
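As a quick taste, here's a minimal nodriver sketch; the target URL is a placeholder, and Camoufox has its own Playwright-style API:

```python
# pip install nodriver
import nodriver as uc

async def main():
    # Starts a real, automation-hardened Chrome session.
    browser = await uc.start()
    page = await browser.get("https://example.com/")  # placeholder target
    html = await page.get_content()
    print(len(html), "bytes of HTML")

if __name__ == "__main__":
    # nodriver ships its own event-loop helper for running async scrapers.
    uc.loop().run_until_complete(main())
```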
But if you'd rather have a simpler, all-in-one style package that uses some of the above, try hrequests or Scrapling. So, what's next? What's the future? Well, there's really only one thing that is, in my opinion, way overhyped right now, and that's of course AI scraping. Off the bat, I want to say that if
anyone's AI tool promises to scrape any site for you, they're most likely lying.
Don't get me wrong, AI does have a place in our workflow, but the world of scraping consistently at scale has much bigger issues that it can't really solve. So let's look at where it can help. Things like using AI to generate
boilerplate code for your spiders to help you get up and running faster.
Using it to monitor links and only crawl selected pages, or models that are trained on common scraping targets to help reduce parsing pain. But it won't stop you getting banned, because your fancy AI tool uses plain Playwright to scrape the initial data. And dumping a whole load of HTML into an LLM, burning through loads of tokens, just feels like a massive waste of resources.
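If you do reach for an LLM to parse something, the sensible version is feeding it a small, pre-trimmed fragment rather than the whole page. Here's a rough sketch with the OpenAI client; the model name and fields are just examples:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A small, pre-trimmed fragment (e.g. one product card), not the whole page.
snippet = """
<div class="card"><h2>Widget Pro</h2><span class="price">£19.99</span></div>
"""

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[{
        "role": "user",
        "content": "Extract the product name and price from this HTML as JSON "
                   "with keys 'name' and 'price':\n" + snippet,
    }],
)
print(resp.choices[0].message.content)
```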
Let's also consider that everybody's using AI, so it's being used more effectively to spot common patterns for anti-bot measures. Meaning, if anything, I think it's working against scraping more than helping. Plus, Cloudflare just came out with their AI Labyrinth technology, which detects crawling and uses AI to generate a deep hole of links, full of useless data, for your crawler to follow.
It's AI inception at its finest. Now, I'm not saying don't use it, but just be aware of what you're going to use it for and how it will work for you. For
me, it all comes down to the fact that the old methods don't work anymore, and we need to adapt our techniques and our toolkit to stay ahead and not get left behind. But let me show you how I actually scrape data, using everything I've just talked about, in this project video.