What's the first thing you do when you get a blocked request in Python? Is it reaching for a headless browser like Playwright or Selenium? Well, in this video, I want to explain why you may not need to do that, because I think it's a trap a lot of people fall into. We're going to look at one of the ways the anti-bot companies can block your requests and work out, very easily, where they've come from and what sent them. And we'll cover a very simple way of overcoming it, something I think you should try first or even just use from the start. I'll also show you a new Python package that I think looks really interesting to try. So, of course, I'm talking about TLS fingerprinting. TLS is the handshake between your client and the server when you make a request; a little bit of data is transferred during it, and it's that little bit of data that can be fingerprinted and profiled. Now, the TLS
fingerprint from a modern browser looks very different to the fingerprint from something like Python requests, and it's that fingerprint that makes it very easy for these companies to spot and block you. Regardless of your headers or anything like that, it's a dead giveaway. So, what we want to do is mimic those browser TLS fingerprints, but from within a standard HTTP client. The one I'm going to show you today
is called rnet. The reason I'm quite excited to give it a go is not only that it's written in Rust, which is interesting given how much Python-and-Rust tooling is out there, but also that it's written by the guy who made Hyper, a very popular HTTP library for Rust. I've already covered TLS fingerprinting in a bit more detail on my channel, so I'm not going to go too deep into it here; what I want to do is show you how it can get you past certain protections. We'll build up an async extraction class using rnet that we can drop into any of our code, giving us access to the actual good TLS
fingerprints, and we'll utilize proxies as well. So, here's the example; I can run this on my machine and we can have a look at all of the information that comes back. There are a few key points. The Akamai fingerprint is something you only get with a good HTTP/2 fingerprint; otherwise, it will be blank. There's all sorts of other information too: cipher suites, algorithms, all that sort of stuff. You could look through it, make the same request with Python requests and compare the two, or do some reading up on it if you want to. The bits I looked into the most were the ciphers, really, because they make up the bulk of the fingerprint. Now let's look at it in a more practical use case. What I've got here is a request to this website's API, the one the front end calls to get the data from the back end; it returns a bit of product information. If I go ahead and remove the impersonate part, save it, and run it, we get errors: we're expecting something, it's not working, and we're getting a decode error. So now I'll put the impersonate back in and run it again, and we have the information. This is a very simple way of getting around some of the most basic, entry-level anti-bot protection.
I think a lot of people, especially those who don't know about TLS fingerprinting, fall into the trap of thinking they need to use a browser for this sort of thing. Now, obviously, in some cases you do need a browser, and there I'd highly recommend a stealth browser. I'll have a video about that coming out on my channel very shortly, covering what I think is the best open-source one at the moment, so go ahead and subscribe if you want to see it; it should be out in a week or so. So how can we actually utilize this in more of
a project example? Well, what I want to show you is an extractor class I've written that uses rnet's asynchronous and impersonate features along with a proxy, plus a blocking version, which is the sync version, should you want that. Initially, we set the class up with its __init__, where I get the proxy from an environment variable; however you want to handle your proxy string is up to you, but I pull mine from my environment. And because I always scrape behind a proxy, if I don't find one, I don't want to scrape anything at all. Then I create the client session here. So, this
is where our client session gets created; I'll make this a bit bigger so it's easier to see. Then I update it with a few things. The first is the really important part, the impersonate, and it's here that you choose which sort of browser you want to mimic; that puts in the right fingerprint, including the correct matching user agent and so on. I've been using the Firefox one; I just feel like Firefox is less likely to be blocked, but that could be all in my head. Use whichever one you think is best for you, but definitely use the latest version available, in my opinion. Then I add in my proxies
as well. Separately, I have a logging setup that I've imported; that's by the by for the moment, we're not focusing on it too much, so I'll just log that my session was created. Then I create my blocking client, which again is the synchronous version, so if you wanted to make non-async requests, you could do that there, and I do the same update on it. There was probably a neater way of doing this, but I've made it this way, so we can just work with that. Now, I have the retry decorator on top of my async fetch
function, using tenacity. So the retry comes from tenacity, and the async function is within my class, using rnet. I use my logger to show which URL we're requesting, then we create our response by getting that URL with our session. Because we created and updated the session up in __init__, when we use it from an instance of this class, the impersonate and our proxies are already in there, ready to go. Then I just return the response JSON; with rnet, you can't return the response object directly here, as it's not awaitable, so I return the JSON from it. I know I'm expecting JSON data, so that's absolutely fine. The next thing I'm going to do is create a fetch-all method,
because I want to be able to utilize the async: I want to hand it a load of URLs and have it go and get them asynchronously, again using the impersonation. So I create tasks from the URL list, get the results for those with asyncio.gather and await, then zip everything back together so each URL is paired with its result and I know which response came from which URL. Then we add in the blocking fetch, which is the synchronous one; same kind of thing really, nothing exciting there. So, let's go ahead and
use this. I'm going to write my async def main and add in some URLs; these are from the same example site, but they're different products. Then we create an instance of our extractor class and say our data is equal to our extractor's fetch_all on those URLs. Of course, I'm doing this within the same file, but you could easily save the extractor in a different file, import it into your main, and add it into your project as you go. Then there's some logging, and some sync stuff as well, and then I run everything with asyncio.run on my main function. I'm going to remove the sync stuff for the moment, and we're just going to log out what we get from our extractor class.
So, we've got four URLs; let's clear this up and run our extractor. We can see that our logging is working, and pretty much instantly we got four responses back with the information from each of those pages, as you can see there. What I really wanted to focus on here was, first, showing you that you don't always need to go to a browser; that should be something of a last resort, although sometimes it is very much needed. And again, make it a stealth browser; don't go trying to use just standard Playwright or
Selenium; you're probably going to have a bad time. And second, how, by building up an extractor class, we can utilize the asynchronous functionality and make our lives very easy when firing off chunks of requests like this. Now, these could be any URLs you've pulled: maybe you go page by page synchronously, pull all the product URLs from each page, and then fetch those asynchronously. You can absolutely do that here, because we've got the blocking fetch; that could be your main way of looping through things, with the async fetch handling the product pages. And by creating a class like this, we can go further: we'd probably want to remove the demo code from the extractor's file, but we could then create another class that does a transform on the data, and another that does something else, so we can pass the data between them.
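That page-loop-plus-async-products pattern can be sketched with stand-in fetch functions (stubbed here, so it runs without rnet, a network, or a proxy; in a real project you'd swap in the extractor's `blocking_fetch` and `fetch_all`):

```python
# Control-flow sketch: synchronous loop over listing pages, concurrent
# fetch of each page's product URLs. The fetchers are stubs standing in
# for the rnet-backed extractor methods; URLs are made up.
import asyncio


def blocking_fetch(page_url):
    # Stub: pretend each listing page yields two product URLs.
    page = page_url.rsplit("/", 1)[-1]
    return {"product_urls": [f"https://example.com/p/{page}-a",
                             f"https://example.com/p/{page}-b"]}


async def fetch(url):
    # Stub for the async, impersonated request.
    return {"url": url, "ok": True}


async def fetch_all(urls):
    # Concurrent fetch, with each URL paired to its own result.
    results = await asyncio.gather(*(fetch(u) for u in urls))
    return list(zip(urls, results))


def run_pipeline(page_urls):
    collected = []
    for page_url in page_urls:            # pages: one by one, blocking
        listing = blocking_fetch(page_url)
        # products from that page: fetched concurrently
        collected.extend(asyncio.run(fetch_all(listing["product_urls"])))
    return collected


results = run_pipeline(["https://example.com/page/1",
                        "https://example.com/page/2"])
```

Each entry in `results` pairs a product URL with its response, so a downstream transform class always knows which result came from which URL.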