I talk about this web scraping method a lot: intercepting the network requests from the front-end site to the backend API and then mimicking those requests. In this video, not only am I going to show you how to do that, I'm also going to show you how to find which headers you actually need, and then how to reliably get them each and every time using a stealth browser.

So what I'm going to do here is open up the dev tools, come over to the Network tab, and load this page up, and we're just going to find the API call that we want to make ourselves. How you do this on the site you're looking at is up to you.
The best way to do it, in my opinion, is just to start clicking around and looking for things. So I'm going to go with this one here. We found it: to me this one has a good response, with product data, prices and everything, so this is the one we want. But if we go ahead and just copy this URL and paste it into our browser, you can see we're getting "authentication denied". Quite often that won't be the case, but when it is, we need to figure out what it is that's stopping us: what do we need to be authenticated?
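Just to make that failure concrete, here's the same bare request from Python; the URL is a placeholder for whatever API call you copied out of dev tools:

```
import requests

# the API URL copied straight from the Network tab (placeholder)
url = "https://api.example.com/api/products?ids=..."

response = requests.get(url)  # no browser headers at all
print(response.status_code)   # an auth error, e.g. 401/403
print(response.text)          # "authentication denied" or similar
```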
This video is sponsored by ProxyScrape, friends of the channel, and theirs are the proxies I use myself. Whenever I scrape data I'm always behind a proxy; being able to rotate through IPs is just a necessity these days. I pretty much always use residential ones, but there are use cases for the datacenter proxies too; you just need to figure out what's going to work best for you. I've actually been using the mobile proxies a lot recently and found them to be pretty good, especially when you're utilizing them within the country that you're expecting to scrape. They're very easy to add to your project: I use an environment variable to pull mine in, so I don't have to worry about things like making sure the credentials are there. It's all done nice and neatly, and then you just let ProxyScrape handle the rest for you. You can choose to rotate them if you want a new proxy on every request, or you can have sticky sessions, which keep hold of the same proxy for a few minutes; that can also be pretty useful, and it's something I'd consider doing in the project I'm working on here, although in this case I'm just going to rotate through the mobile proxies. Also, any traffic you purchase is yours to use whenever, as it doesn't ever expire, which is a nice touch too. So if this all sounds good to you, go ahead and check out ProxyScrape at the link in the description below.
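Since I mentioned pulling the proxy from an environment variable, here's roughly what that looks like; the variable name PROXY is my own choice, not anything the provider requires:

```
import os

import requests

session = requests.Session()

# e.g. PROXY="http://user:pass@host:port" exported in your shell,
# so no credentials ever live in the code itself
proxy = os.environ.get("PROXY")
if proxy:
    session.proxies = {"http": proxy, "https": proxy}
```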
Now, let's get back to our project. I'm going to come over to the request, do "Edit and Resend", move this over a little bit, and bring this panel out here. Right away, the more observant amongst you will see that there's this client ID and client secret down here. These have been generated by my browser when it loaded up the page. So if I go ahead and turn these two off and then hit send, you'll see down here that our response is "authentication denied", which is what we got when we made the request in the plain browser. If I put, I think it's just the client ID, back in and send again, now we get the response. So we know that we need this client ID header, whatever its value turns out to be, to be able to make the request properly.
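If you'd rather do that elimination in code than by toggling headers in dev tools, a small sketch like this does the same job: start from the full captured header set and drop one header at a time, keeping any whose removal breaks the request. The URL and header dict are placeholders, and this assumes the headers act independently and that success means a 200:

```
import requests

def required_headers(url: str, headers: dict) -> dict:
    """Return only the headers the endpoint actually needs (sketch)."""
    needed = {}
    for name in headers:
        trial = {k: v for k, v in headers.items() if k != name}
        if requests.get(url, headers=trial).status_code != 200:
            needed[name] = headers[name]  # removing it broke the request
    return needed
```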
So how do we do that? Well, we need to go through a browser: load the page up in a browser, find this client ID, grab it, and add it to our session so we can then make the requests we need. If I come over here, do "Copy as cURL", come back over to my browser, go to curlconverter and paste it in, you'll see that we have a load of cookies and we also have the client ID here.
As a demonstration, I'm going to copy this out and put it into my code editor real quick, and we'll have a quick look at it, make sure it works as we're expecting, and then double-check what we actually need. So let's open this up and save. What I'm going to do is just blast away all the cookies, so we're not sending the cookies at all, but we are going to keep the headers in; then we'll print response.json(). This is all just me figuring out what we need. If we run this... okay, so this is the data that we actually want back. So we know from this example that we don't need those cookies; they're not relevant to getting this information back. That's really important: if they are needed, you have to include them, but if they aren't, you can just omit them completely.
So what we want to do now is build up our project. We can make one request to the main site with our stealth browser, capture those cookies and headers (or just the headers, in this case), load them into a requests session, and then make all our subsequent requests with that session. Like I have here, I'm going to create a session that holds these headers, including the client ID; every time we run it we'll get a fresh ID, and so on, so it will keep working for us. We make one request with the browser, which takes a while, and then we can make subsequent requests to this API however you decide to; it depends on whatever website you're working with. This one just looks like it takes a group of product IDs, which is fine. We're not going to focus on that; we're going to focus on how to get the actual headers in.
So I'm going to create a new project in my projects folder, and we'll just call this one lv_edit. I'll cd into the folder, and we need to set up a virtual environment first. Once we've created it, the tools we're going to use are requests, plus this stealth browser here. I'm going to do a separate video on it, because I think it's really powerful; it's possibly the best one I've seen so far, and there's some really good information on its site. The only thing I'll say about stealth browsers is that everyone's kind of got their own way of doing things, and eventually the tricks a given stealth browser uses tend to get patched. This one is open source, which brings benefits: it's free to use, whenever or wherever you like; but it also means that if something changes, it's up to the one maintainer to fix and patch things. For me, right now, it works really well, so if you're struggling with the stealth browser you use, definitely try this one. So we activate our environment and pip3 install the browser, along with requests and rich.
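For reference, the setup steps look something like this; note that Camoufox also needs to download its browser binary once, which I believe is done with the fetch command below:

```
mkdir lv_edit && cd lv_edit
python3 -m venv venv
source venv/bin/activate
pip3 install camoufox requests rich
python3 -m camoufox fetch   # one-time download of the browser itself
```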
One of the good things about Camoufox is that it's built around Playwright's API, so pretty much anything you can do in Playwright you can do with Camoufox. Another thing I'm going to do differently here is create a couple of different files: I'm going to have an extractor class that I'll put in its own file, and then my main.py file, and I'm going to call the first one extractor. The reason I'm doing this is that I want to separate things out, because I want to be able to build upon this; everything I put inside this class, and then access through it, you could do separately if you wanted to.
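So the project ends up laid out like this (the file names are just my choices):

```
lv_edit/
├── venv/           # virtual environment
├── extractor.py    # the Extractor class
└── main.py         # entry point that uses it
```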
Let's open up our extractor class and import what we need. We're going to create our Extractor class and initialize it, and we want to give it some information to initialize with. The first thing is a class session object, because we want to be able to update that session with the headers and proxies we'll be pulling in: the proxies from my environment variables, and the headers from the requests we capture. This is the proxy I'll be using, my UK mobile proxy; these are among my favourites at the moment, as they're much harder to block because mobile IPs are so widely shared. Then I say: if the variable doesn't exist, skip it; if it does, set the session's proxies. Next we want our headers_from_browser method.
What this method involves is loading the page up using the Camoufox browser, then creating handlers to actually pull the requests out; this is the standard Playwright way, if you've ever seen or done something like that. We create a handler, and in it I'm going to look for the letters "api" inside the request URL. When we load the page, it's going to fire a request via XHR to the backend, and that request will have the keyword "api" in it, which we saw over here. When we find it, we pull the headers from it, go through each header item, and build a little dictionary, leaving out a few particular headers. This is important, because if you include set-cookie this way you'll get an error, since it won't be formatted the right way; if you want to carry cookies over as well, you have to do that through the cookie jar, I believe. Then I update our session with all of the headers.
So what this code is doing is going through all the headers it finds when the page makes that request, turning them into a key-value dictionary, and updating our session with it. Then we attach our handlers to the page we've created through Camoufox, which is essentially Playwright; then we go to our URL, wait for the load state, and reload. Now, I had some issues here, and I don't know if this is typical or something that happens a lot, but I basically had to go to the page first and then reload it, which was interesting; if I didn't have the reload, it didn't actually catch those API requests. Something worth bearing in mind.
I've put a print statement in so I can see that the session has been updated with the headers. Then this last method is just doing a GET request using the session, except the session now carries all of the extra headers. That's pretty much all there is to it. As I said, if you don't want to use a class you don't have to; you can structure this however you need it to be. I wanted to put it in a class because it was going to be easier for me to go through, explain and show.
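Putting it together, here's a sketch of the class as described; the environment variable name, the "api" keyword match, and the skipped header names follow what I walked through above, but details like those will vary by site:

```
# extractor.py
import os

import requests
from camoufox.sync_api import Camoufox

# headers a requests session can't take verbatim from the browser
SKIP_HEADERS = {"host", "content-length", "cookie", "set-cookie"}


class Extractor:
    def __init__(self) -> None:
        self.session = requests.Session()
        proxy = os.environ.get("PROXY")  # my env-var name; yours may differ
        if proxy:
            self.session.proxies = {"http": proxy, "https": proxy}

    def headers_from_browser(self, url: str) -> None:
        """Load the page in Camoufox, catch the XHR whose URL contains
        'api', and copy its request headers into the session."""

        def handler(request) -> None:
            if "api" in request.url:
                headers = {
                    k: v for k, v in request.headers.items()
                    if k.lower() not in SKIP_HEADERS
                }
                self.session.headers.update(headers)
                print("session updated with:", headers)

        with Camoufox(headless=False) as browser:
            page = browser.new_page()
            page.on("request", handler)  # plain Playwright event hook
            page.goto(url)
            page.wait_for_load_state()
            page.reload()  # without this, the API call wasn't captured for me

    def get(self, url: str) -> requests.Response:
        # an ordinary GET, but through the session carrying the captured headers
        return self.session.get(url)
```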
Now we can import that class into our main.py file and go from there. I did put in a visit to Google first; I don't think that's actually necessary, it's just something I left in for the sake of it. So I'll go to my main.py file and import our Extractor class: from extractor, import Extractor, and I think that's all we need. I'll say that e is equal to an instance of the Extractor class, and here are the two URLs: this is the page for the browser to go to, and this is the one with the API. This could be anything you want it to be; it's just the one with those products on. Once we've got this working, this is the part where you'd actually start changing the API URL to get more data, or less data, or more specific data; we're just focusing here on getting the headers, and specifically that client ID header, that we need to make this work. I'll call e.headers_from_browser and pass in the browser URL that I need. There we go. Now, at this point in our code, our session will be primed with all of the headers we need to make requests, so we should just be able to do response = e.get with the API URL, using that get function we created around our session, and it should be that simple. Then I'll do for item in response.json() and just print each item for the moment; we'll narrow it down afterwards so we can actually see.
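The main.py ends up as small as this; both URLs are placeholders for the page you load in the browser and the API call you found in dev tools, and the field names in the last comment are just the ones I print from this particular response:

```
# main.py
from extractor import Extractor

BROWSER_URL = "https://www.example.com/products-page"     # page the browser visits
API_URL = "https://api.example.com/api/products?ids=..."  # endpoint from dev tools

e = Extractor()
e.headers_from_browser(BROWSER_URL)  # one slow browser visit to harvest headers

response = e.get(API_URL)  # fast request through the primed session
for item in response.json():
    print(item)  # later narrowed down to the product ID and name fields
```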
So when I run this now, we should see our browser open up. A good thing about Camoufox is that it handles all the different fingerprints for us, so you might see the window at different sizes or whatever; that's just it doing its thing. Let's run it... ah, we need to set this up first, since it's the first time I've run it here. There we go. Cool, so it's loading up Google, and you can see it's a different size and whatnot; then it loads up the actual page we were looking at. We'll make the window a bit smaller, because there's going to be loads of data on this side of the screen in a minute; in fact we can make it like this. I didn't bother with any of the clicking, and it worked, like so. I'm going to scroll back up: in here are the headers that I pulled out (or rather, just the headers, not the cookies), and this is what we've updated our session with. We've got the browser user agent, we've got all of the referers, and we've got the client ID, which is the most important thing, along with all this other information. We do have the cookie header, because that was set; I don't think we needed it, but it got set anyway just by doing it like that. And that's it: we were then able to make the request with this information.
So instead of all of this, I'll just print out the product ID, or whatever it was called, so we can actually see it, and I think the name as well. Again, you can pull whatever information you want from here. And just to show it working, let's say we were going to make different requests; I'm going to make the same request twice, just so you get the idea. We can run this again, and now we have to wait a few seconds for our browser to do its thing, because we need that data; once that's done, we should be able to make a decent number of calls to their API with that client ID that came from our browser.
It's worth noting (and there, you can see it's worked) that I'm not doing anything here that my browser isn't doing. Instead of accessing the data through the browser, I'm just loading my browser up, getting the information it sends over, and then accessing the data that way. In my opinion, that's still perfectly fine: I'm not pulling any information that's behind a login or behind anything that isn't outwardly visible on the main page; I'm merely doing it in a more targeted, quicker, and more efficient manner by utilizing what my browser is doing. So in my opinion it's absolutely fine. From here, what you'd want to do is update the API URL to whatever you want it to be, go ahead and explore that, find out what you need, and then make the requests using the session. This should work on quite a lot of sites as well; you might find that some need different bits of information, but generally, pulling the headers from the browser is going to work well enough.