the web scraping trap that breaks beginners (and the easy fix)
09:18 · 6.1K views · 271 likes · 2025-03-19
➡ JOIN MY MAILING LIST https://johnwr.com ➡ COMMUNITY https://discord.gg/C4J2uckpbR https://www.patreon.com/johnwatsonrooney ➡ PROXIES https://proxyscrape.com/?ref=jhnwr ➡ HOSTING (Digital Ocean) https://m.do.co/c/c7c90f161ff6 If you are new, welcome. I'm John, a self-taught Python developer working in the web and data space. I specialize in data extraction and automation. If you like programming and web content as much as I do, you can subscribe for weekly content. ⚠ DISCLAIMER Some/all o...
Subtitles

What's the first thing you do when you get a blocked request in Python? Is it reaching for a headless browser like Playwright or Selenium? In this video, I want to explain why you may not need to do that; I think it's a trap a lot of people fall into. We're going to look at one of the ways anti-bot companies can block your requests, and see how easily they can tell where a request came from and what sent it.

And we'll cover a very simple way of overcoming it. And I think this is

something that you should try first or even just use from the start. Uh, and

I'll show you a Python package as well, a new one that's come out which I think

looks really interesting to try. So, of course, I'm talking about TLS

fingerprinting. Now, the TLS is like the handshake between the request that you

make to the server and a little bit of it data is transferred, but it's that

little bit of data that can be fingerprinted and profiled. Now, the TLS

fingerprint from a modern browser looks very, very different to the fingerprint

from something like Python requests. And it's that fingerprint that it makes

very, very easy for the companies to spot and block you. Regardless of your

headers or anything like that, it's a dead giveaway. So, what we want to do is

we want to be able to mimic those browser uh TLS fingerprints, but use it

from within a standard HTTP client. So the one that I'm going to show you today
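To make the idea concrete: fingerprinting schemes like JA3 boil the ClientHello down to a hash of its properties (TLS version, cipher suites, extensions). Here's a rough sketch of that idea using made-up cipher and extension IDs, not real captures; real JA3 also folds in elliptic curves and EC point formats.

```python
import hashlib

def ja3_style_digest(tls_version: int, ciphers: list[int], extensions: list[int]) -> str:
    """Hash ClientHello properties the way JA3-style schemes do:
    join them into a string, then take an md5 digest. Simplified
    for illustration -- real JA3 includes more fields."""
    raw = ",".join([
        str(tls_version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
    ])
    return hashlib.md5(raw.encode()).hexdigest()

# Illustrative (made-up) IDs -- the two clients offer the SAME ciphers,
# just in a different order:
browser_like = ja3_style_digest(771, [4865, 4866, 4867, 49195], [0, 23, 65281, 10])
python_like  = ja3_style_digest(771, [49195, 4867, 4866, 4865], [0, 23, 65281, 10])

print(browser_like)
print(python_like)
print(browser_like != python_like)
```

Because the hash covers ordering as well as contents, even offering the same ciphers in a different order produces a different fingerprint, which is why spoofing headers alone gets you nowhere.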

The client I'm going to show you today is called rnet. The reason I'm quite excited to give it a go is that not only is it written in Rust, which is interesting given how much Python-Rust tooling is out there, but it's also written by the guy who made Hyper, a very popular HTTP library for Rust. I've already covered TLS fingerprinting in more detail on my channel, so I'm not going to go too deep into it here. What I want to do is show you how it can get you past certain protections. We'll build up an async extraction class using rnet that we can drop into any of our code, giving us access to good TLS fingerprints, and we'll utilize proxies as well.

Here's the example. I can run it on my machine here, `python test`, and we can look at all the information that comes back. There are a few key points. The Akamai fingerprint is something you only get with a good HTTP/2 fingerprint; otherwise, that field will be blank. There's all sorts of other information too: cipher codes, algorithms, all that sort of stuff. You could look through this, make the same request with Python requests and compare the two, or do some reading up on it if you want to. The part I looked into most was the ciphers, because they make up the bulk of the fingerprint.
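You can inspect part of what a stock Python TLS stack offers, without any third-party tools, using the standard library's `ssl` module; the exact list depends on your Python and OpenSSL build, but it's one of the main ingredients a fingerprinting service sees.

```python
import ssl

# Inspect the cipher suites a default Python SSLContext will offer in its
# ClientHello -- a big part of why a stock Python client doesn't look
# like a browser to a fingerprinting service.
ctx = ssl.create_default_context()
cipher_names = [c["name"] for c in ctx.get_ciphers()]

print(f"{len(cipher_names)} ciphers offered, e.g.:")
for name in cipher_names[:5]:
    print(" ", name)
```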

Now let's have a look at a more practical use case. What I've got here is a request to this website's API, the one the front end calls to get the product data from the back end. If I remove the impersonate part, save, and run it, we get errors: we're expecting something, it's not working, we're getting a decode error. So now I'll put the impersonate back in and run it again, and we have the information. This is a very simple way of getting around basic, entry-level anti-bot protection.
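As a reference point, a minimal rnet request with impersonation might look something like the sketch below. This assumes `pip install rnet`; the `Impersonate` member names (browser and version) change between releases, the endpoint URL is a placeholder, and the method names follow rnet's README at the time of writing, so treat this as a shape rather than copy-paste code.

```python
import asyncio

async def fetch_json(url: str):
    # Lazy import so the sketch parses even where rnet isn't installed.
    # Impersonate members (e.g. Firefox135) vary by release -- check
    # the rnet docs for the version you install.
    from rnet import Client, Impersonate

    client = Client(impersonate=Impersonate.Firefox135)
    resp = await client.get(url)
    return await resp.json()

# usage (requires network + rnet installed):
#   data = asyncio.run(fetch_json("https://example.com/api/products/1"))
```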

A lot of people who don't know about TLS fingerprinting will fall into the trap of thinking they need a browser for this sort of thing. Now, obviously, in some cases you do need a browser, and there I'd highly recommend a stealth one. I have a video coming out on my channel very shortly covering what I think is the best open-source stealth browser at the moment, so go ahead and subscribe if you want to see that.

That video should come out in a week or so. So, how can we actually utilize this in more of a project example? What I want to show you is an extractor class I've written that uses rnet's async support and impersonate features along with a proxy, plus a blocking version, which is the sync version, should you want that.

What we do initially is start up the class in its `__init__`. I'm getting the proxy here from an environment variable; however you want to handle your proxy string is up to you, I pull mine from the environment. And if I don't find the proxy, I don't scrape anything, because I always scrape behind a proxy. Then I create the client session here. I'll make this a bit bigger so it's easier to see. All this does is create our client session.
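The proxy-from-environment check can be sketched like this; the `PROXY` variable name and the error class are assumptions, use whatever your setup defines.

```python
import os

class ProxyNotSetError(RuntimeError):
    """Raised when no proxy is configured -- refuse to scrape without one."""

def load_proxy(var: str = "PROXY") -> str:
    """Read the proxy URL from an environment variable.

    Returns the proxy string, or raises if it's missing/empty, so the
    extractor never silently scrapes without a proxy."""
    proxy = os.environ.get(var, "").strip()
    if not proxy:
        raise ProxyNotSetError(f"{var} is not set; refusing to scrape without a proxy")
    return proxy
```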

Then I update it with a few things. The first is the very important part: the impersonate. It's here that you choose which browser you want to mimic, and that puts in the right fingerprint, including the correct matching user agent, and so on. I've been using the Firefox one; I just feel like Firefox is maybe less likely to be blocked, but that could be all in my head. Use whichever one you think is best for you, but definitely use the latest version that's available, in my opinion.

Then I add in my proxies. I also have a separate logging file that I've imported, but that's by the by; we're not focusing on it too much. I just log that my session was created. Then I create my blocking client, which again is the synchronous version, so that if you wanted to make non-async requests you could do that there, and I apply the same update to it. There was probably a neater way of doing this, but I've made it this way, so we can just work with that.

Now, I have a retry decorator on top of my async fetch function, using tenacity: the retry comes from tenacity, and the async function is within my class, using rnet. I use my logger to show which URL we're requesting, then create our response by getting that URL with our session. Because we created and updated the session up here, when we use it from an instance of this class, the impersonate and our proxies are already in there, ready to go. Then I return `response.json()`; with rnet you can't return the response object directly, it's not an awaitable, so I return the JSON from it, and I know I'm expecting JSON data.
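In the video the retry comes from tenacity; to show the shape of what that decorator does, here's a minimal stdlib stand-in (tenacity adds backoff, jitter, stop conditions, and much more on top of this idea).

```python
import asyncio
import functools

def retry_async(attempts: int = 3, delay: float = 0.5):
    """Minimal stand-in for tenacity's @retry, for illustration only:
    re-run an async function up to `attempts` times, sleeping between tries,
    and re-raise the last exception if every attempt fails."""
    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            last_exc = None
            for attempt in range(1, attempts + 1):
                try:
                    return await fn(*args, **kwargs)
                except Exception as exc:
                    last_exc = exc
                    if attempt < attempts:
                        await asyncio.sleep(delay)
            raise last_exc
        return wrapper
    return decorator

@retry_async(attempts=3, delay=0.0)
async def flaky(counter: dict):
    # Fails twice, then succeeds -- simulating a transient network error.
    counter["calls"] += 1
    if counter["calls"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

counter = {"calls": 0}
print(asyncio.run(flaky(counter)), counter["calls"])  # ok 3
```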

That's absolutely fine. The next thing I do is create a fetch_all method, because I want to utilize the async: give it a load of URLs and send it out to get them all asynchronously, again using the impersonation. I create tasks from the URL list, get the results for those with asyncio.gather and await, then zip the URLs back together with the results, so I know which result came from which URL.
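Putting the pieces together, the extractor's shape looks roughly like this. In the real class the client is an rnet Client built with impersonate and proxies in `__init__`; in this sketch the client is injected instead, so any object with an async `get` works, which also makes it runnable without a network.

```python
import asyncio

class Extractor:
    """Sketch of the video's extractor. In real use you'd build the client
    with rnet (impersonate + proxies); here any object with an async
    get(url) returning something with an async json() will do."""

    def __init__(self, client):
        self.client = client

    async def fetch(self, url: str):
        # Session already carries impersonation/proxies in the real version.
        resp = await self.client.get(url)
        return await resp.json()

    async def fetch_all(self, urls: list[str]):
        # Fire off all requests concurrently, then pair each URL with its result.
        results = await asyncio.gather(*(self.fetch(u) for u in urls))
        return list(zip(urls, results))

# Demo with a stand-in client (no network, hypothetical URLs):
class FakeResponse:
    def __init__(self, url): self.url = url
    async def json(self): return {"url": self.url, "price": 9.99}

class FakeClient:
    async def get(self, url): return FakeResponse(url)

urls = ["https://example.com/api/p/1", "https://example.com/api/p/2"]
data = asyncio.run(Extractor(FakeClient()).fetch_all(urls))
print(data)
```

You'd then call it from an `async def main` and run it with `asyncio.run(main())`, as in the video.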

Then we add in the blocking fetch, which is the synchronous one. Same kind of thing, really; nothing exciting there. So, let's go ahead and use this. I write my `async def main` and add in some URLs; these are from the same example site, but they are different products. Then we create an instance of our extractor class.

Then we say our data is equal to awaiting our extractor's fetch_all on those URLs. Of course, I'm doing this within the same file, but you could easily put your extractor in a different file and just import it into your main, adding it into your project as you go. Some logging, and some sync stuff as well, and then I run everything with asyncio.run on my main function. I'm going to remove the sync stuff for the moment, and we'll just log out what we get from our extractor class.

So, we've got four URLs. Let's clear this up and run our extractor. We can see that our logging is working, and pretty much instantly we got four responses back with the information from each of those pages. You can see that there.

So, what I really wanted to focus on here was, first, showing you that you don't always need to go to a browser; that should kind of be your last resort, although it is sometimes very much needed. And again, make it a stealth browser: don't go trying to use standard Playwright or Selenium, you're probably going to have a bad time. And second, how, by building up an extractor class, we can utilize the asynchronous functionality and make our lives very easy when making chunks of requests like this. Now, this could be anything, any URLs that you've pulled.

Maybe you go page by page synchronously, pull all the product URLs from each page, and then fetch those asynchronously. You can absolutely do that here, because we've got the blocking fetch: that could be your main way of looping through things, with the async fetch handling the product pages. And by creating a class like this, we can go further: we'd probably want to move this main code out of the extractor class's file, but we could then create another class that does a transform on the data, and another one that does something else, so we can pass the data in between
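That page-by-page-sync, products-async pattern can be sketched like this, with stand-in functions in place of the real blocking listing fetch and async product fetch (the URLs and page counts are made up for illustration).

```python
import asyncio

def get_product_urls(page: int) -> list[str]:
    """Stand-in for a blocking fetch of one listing page; a real version
    would use the extractor's sync client and parse out product links."""
    return [f"https://example.com/api/product/{page * 10 + i}" for i in range(3)]

async def fetch_detail(url: str) -> dict:
    """Stand-in for the async product fetch."""
    await asyncio.sleep(0)  # placeholder for real network I/O
    return {"url": url}

async def crawl(pages: int) -> list[dict]:
    results = []
    for page in range(1, pages + 1):      # walk listing pages one at a time
        urls = get_product_urls(page)      # blocking fetch per page
        # then grab every product on that page concurrently
        batch = await asyncio.gather(*(fetch_detail(u) for u in urls))
        results.extend(batch)
    return results

print(len(asyncio.run(crawl(2))))  # 6
```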