The Harsh Truth of Web Scraping in 2025
2025-04-23
Web Scraping is changing, but will we manage the challenges of the modern web? ➡ JOIN MY MAILING LIST https://johnwr.com ➡ COMMUNITY https://discord.gg/C4J2uckpbR https://www.patreon.com/johnwatsonrooney ➡ PROXIES https://proxyscrape.com/?ref=jhnwr ➡ HOSTING https://m.do.co/c/c7c90f161ff6 If you are new, welcome. I'm John, a self taught Python developer working in the web and data space. I specialize in data extraction and automation. If you like programming and web content as much as I do...
Subtitles

The barrier for entry into web scraping is higher than ever. Gone are the days of simple scripts and easy-to-scrape sites, replaced by JavaScript web apps and readily available anti-bot tech everywhere. Plus a whole new enemy: AI. Let's talk about it. Over the last 5 years, I've scraped millions of lines of data using many different technologies and methods, and I want to share my thoughts about modern web scraping: what works, what doesn't, and how you can still be effective in extracting data from the web. I want to cover what's being made obsolete, what's working and working well, why AI isn't the answer you so desperately want it to be, and how it's potentially detrimental to the industry.

So, I definitely think that web scraping is getting harder, and I think a lot of people are actually ignoring that fact. The definition of insanity is doing the same thing over and over and expecting different results, so trying to scrape modern sites with just requests and random proxies will make you want to bang your head against the wall. We need to be cleverer with our approach. Modern scraping requires a wider set of techniques and tools. There's much more to consider, like full browser headers rather than just a user agent, and consideration of TLS and browser fingerprints instead of just a random thing you've thrown together.
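To give a rough idea of what I mean by full browser headers, here's a minimal sketch of sending the whole header set a real browser would send rather than a lone User-Agent. The URL and values are placeholders; copy the real ones from a request in your own dev tools. Headers alone won't fix the TLS fingerprint, which I'll come back to below.

```python
import requests

# Minimal sketch: send the full header set a real Chrome build would send,
# not just a User-Agent. URL and values here are illustrative placeholders.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.example.com/",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "same-origin",
    "Upgrade-Insecure-Requests": "1",
}

resp = requests.get("https://www.example.com/products", headers=headers, timeout=10)
resp.raise_for_status()
print(resp.status_code, len(resp.text))
```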

But more than ever, a single script just won't cut it. It may work for a few runs, but try and scale it up and you'll quickly get knocked out. I think a scraper's best friend now is good, clear logging, error handling, and well-thought-out retries. This just means a better understanding of code and coding practice is needed, along with a good understanding of how websites actually work.
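To make that concrete, here's a rough sketch of what logging plus well-thought-out retries can look like. The status codes, delays and helper name are illustrative, not a prescription.

```python
import logging
import random
import time

import requests

# A rough sketch of "clear logging, error handling and well-thought-out retries":
# retry transient failures with backoff and record what happened, instead of
# letting one bad response kill the whole run.
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("scraper")


def fetch(url, session, max_retries=4):
    for attempt in range(1, max_retries + 1):
        try:
            resp = session.get(url, timeout=10)
            if resp.status_code in (403, 429):
                # Blocked or rate limited -- worth a warning and a longer wait.
                logger.warning("attempt %d: got %d for %s", attempt, resp.status_code, url)
            else:
                resp.raise_for_status()
                logger.info("attempt %d: ok %s", attempt, url)
                return resp
        except requests.RequestException as exc:
            logger.warning("attempt %d: %s for %s", attempt, exc, url)
        # Exponential backoff with jitter so retries don't fire on a fixed rhythm.
        time.sleep(2 ** attempt + random.uniform(0, 1))
    logger.error("giving up on %s after %d attempts", url, max_retries)
    return None


session = requests.Session()
result = fetch("https://www.example.com/products", session)
```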

You see, I always used to try and force Selenium to scrape sites until I stumbled upon a Reddit post talking about a problem I was having. In true Reddit fashion, the answer was extremely condescending, but it did have good information. From then on, the first place I look when I'm scraping is the network section of the dev tools, to see if I can find the site's backend API. I can use it to get the data, usually in JSON format, with no parsing needed.
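Here's a sketch of what that looks like once you've spotted an XHR/fetch call in the network tab. The endpoint, query parameters and response shape are entirely hypothetical; substitute whatever your target site actually uses.

```python
import requests

# Sketch of calling a site's backend API directly once you've found it in the
# network tab. Endpoint, parameters and response shape are hypothetical --
# substitute whatever the XHR/fetch requests on your target actually use.
api_url = "https://www.example-shop.com/api/v2/products"
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "application/json",
    "Referer": "https://www.example-shop.com/products",
}

resp = requests.get(api_url, params={"page": 1, "per_page": 48}, headers=headers, timeout=10)
resp.raise_for_status()

# JSON straight from the backend -- no HTML parsing needed.
for item in resp.json().get("products", []):
    print(item.get("name"), item.get("price"))
```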

But as scrapers advance, so does the anti-bot tech, and it's now better than ever. And it's not only better, it's much more widely available, and often a good level of protection is included in a free tier. But most tutorials just don't talk about this, leading beginners down the wrong path of trying to parse messy and obfuscated HTML instead of looking in the places that they should be.

My second thing is that the scraper's toolkit just feels like it's becoming obsolete as anti-bot tech and blocks get more common, and we need to update the tools we use to compensate. I mentioned fingerprinting, and this is by far the most useful and easiest change to make. Swapping out requests for curl_cffi or rnet means we have a modern, built-for-scraping HTTP client that can send realistic-looking browser fingerprints to keep us out of that blocking zone. And this is now essential: plain requests just won't cut it.
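As a minimal sketch of that swap, with a placeholder URL: curl_cffi exposes a requests-like API with an impersonate option that sends a browser-matching TLS and header fingerprint.

```python
from curl_cffi import requests as creq

# Minimal sketch: curl_cffi's requests-style API with browser impersonation.
# The URL is a placeholder; "chrome" selects a recent Chrome fingerprint profile.
resp = creq.get(
    "https://www.example.com/products",
    impersonate="chrome",
    timeout=15,
)
resp.raise_for_status()
print(resp.status_code, resp.headers.get("content-type"))
```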

And if you're thinking you can just use a browser, well, the same applies tenfold. It's so easy to spot a headless browser. I haven't touched Playwright in months as it's so unreliable. The amount of data a website can get about you from your browser is just staggering: the extensions you've installed, what fonts you use, where you're based, the way that it renders things. All of this allows a profile to be built up about you. A common check that I think people neglect is whether your browser's time zone matches your proxy's time zone; a mismatch is just a dead giveaway. So we need to be more careful about proxies, sessions, cookies, and browser fingerprints.
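A quick, hedged sketch of checking for that time-zone mismatch: look up the time zone of the proxy's exit IP (ip-api.com is one free lookup service) and make sure whatever browser or session you run reports the same zone. The proxy URL is a placeholder.

```python
import requests

# Rough sanity check for the time-zone giveaway: ask a geo-IP service which
# time zone the proxy's exit IP sits in, then make sure your browser/session
# is configured to report the same zone. The proxy URL is a placeholder.
proxies = {
    "http": "http://user:pass@proxy.example.com:8000",
    "https": "http://user:pass@proxy.example.com:8000",
}

geo = requests.get("http://ip-api.com/json/", proxies=proxies, timeout=10).json()
proxy_tz = geo.get("timezone")  # e.g. "America/New_York"
print("Proxy exit node time zone:", proxy_tz)

# Whatever automation you use should then present the same zone, for example
# by setting the TZ environment variable before launching the browser.
```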

But out of all this, a flower emerges, and the community has come out with some great scraping-specific tools that are a fantastic first step to avoiding being blocked. Here's what I use now. My HTTP client is rnet. It's written in Rust for Python by a large contributor from the Rust community, so my hope is that it stays updated and in active development. But another good option is curl_cffi. Both of these can make async HTTP requests with a TLS fingerprint that matches a common browser.
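Here's a small sketch of the async side of that, shown with curl_cffi's AsyncSession since I'm less certain of rnet's exact API; the URLs are placeholders.

```python
import asyncio

from curl_cffi.requests import AsyncSession

# Sketch: fire several requests concurrently while still presenting a
# browser-like TLS fingerprint. URLs are placeholders.
async def main():
    urls = [f"https://www.example.com/products?page={n}" for n in range(1, 6)]
    async with AsyncSession(impersonate="chrome") as session:
        responses = await asyncio.gather(*(session.get(u, timeout=15) for u in urls))
    for resp in responses:
        print(resp.url, resp.status_code)

asyncio.run(main())
```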

recommend. Camo Fox, which is and no driver, plus it's forks and driver.

These are both modern browser automation libraries that have much more stealth

capabilities than I've seen anywhere in the open source world. And as a direct

replacement for a sleing or playright, these are definitely worth giving a try.

But if you'd rather like a simpler all-in-one style package that uses some

of the above, try H request or scrape. So what's next? What's the future? Well,
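As one concrete example, here's a minimal nodriver sketch following the pattern from its own documentation: start a stealth Chrome session, load a page, and grab the rendered HTML. The URL is a placeholder, and Camoufox and zendriver have their own, similar entry points.

```python
import nodriver as uc

# Minimal nodriver sketch: launch a stealth Chrome session, load a page,
# and pull the rendered HTML. The URL is a placeholder.
async def main():
    browser = await uc.start()
    page = await browser.get("https://www.example.com/products")
    html = await page.get_content()
    print(len(html), "characters of rendered HTML")
    browser.stop()

if __name__ == "__main__":
    uc.loop().run_until_complete(main())
```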

So what's next? What's the future? Well, there's really only one thing that is, in my opinion, way overhyped right now, and that's of course AI scraping. Right off the bat, I want to say that if anyone's AI tool promises to scrape any site for you, they're most likely lying. Don't get me wrong, AI does have a place in our workflow, but scraping consistently at scale has much bigger issues that it can't really solve. But let's look at where it can help: things like using AI to generate boilerplate code for your spiders to help you get up and running faster, using it to monitor links and only crawl selected places, and models that are trained on common scraping targets to help reduce parsing pain.

But it won't stop you getting banned, because your fancy AI tool uses plain Playwright to scrape the initial data. And dumping a whole load of HTML into an LLM, burning through loads of tokens, just feels like a massive waste of resources. Let's also consider that everybody's using AI, so it's being used more effectively to spot common patterns for anti-bot measures, meaning that if anything, I think it's working against scraping more than helping. Plus, Cloudflare just came out with their AI Labyrinth technology, which detects crawling and uses AI to generate a deep hole of links full of useless data for your crawler to follow. It's AI inception at its finest. Now, I'm not saying don't use it, but just be aware of what you're going to use it for and how it will work for you.

For me, it all comes down to the fact that the old methods don't work anymore, and we need to adapt our techniques and our toolkit to stay ahead and not get left behind. But let me show you how I actually scrape data, using everything I've just talked about, in this project video.