So I was partway through scraping this site because I wanted to demonstrate a technique on it, which I'm still planning to do, but it took a bit of a twist and a turn. The plan was to use selenium-driverless, a very hard-to-detect browser, to load the page and get the cookies and headers that we need, then make subsequent requests using requests on a session object. That is roughly what we're going to do, but here's what I found: when I selected a product with dev tools open, filtered to XHR requests, the first thing that popped up showed that this is a GraphQL site, which is quite unusual — less common is probably the best way to put it. What I had was the JSON data with all the product information, and that's exactly what I was after. Structured data like this is so much easier to process, so much easier to work with, and more efficient to extract.
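For context, a GraphQL endpoint typically exposes one URL that takes a POST with a JSON body, which is why all the product data arrives in a single structured response. A minimal sketch of the shape — the query text and field names here are made up, not this site's actual schema:

```python
# Hypothetical GraphQL request body. One POST carries the query text
# and its variables; the whole result comes back as structured JSON.
body = {
    "query": "query Product($id: ID!) { product(id: $id) { name price } }",
    "variables": {"id": "12345"},
}
```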
Because of the technique we're using in this video, I'm going to be using geolocated proxies and holding each IP for three to five minutes, which gives me enough time to replicate the requests from the same proxy IP. I use ProxyScrape, who are kind enough to sponsor this video. They give us access to high-quality, secure, fast, and ethically sourced proxies covering residential, datacenter, and mobile, with rotating and sticky session options. There are over 10 million proxies in the pool, all with unlimited concurrent sessions, from countries all over the globe, enabling us to scrape quickly and efficiently. I use a variety of proxies depending on the situation, but I'd recommend you start out with residential ones. Make sure you select countries appropriate to the site you're trying to scrape, and match your own country where possible. Also consider not rotating on every request: hold an IP for a short time, like I'm going to do in this project. Either way, it's still only one line of code to add, and you can let ProxyScrape handle the rest from there. Any traffic you purchase is yours to use whenever you need, as it never expires. If this all sounds good to you, go ahead and check out ProxyScrape at the link in the description below. Now let's get on with the code.
When I went to the headers, the response headers were the usual kind, but in the request headers there was this: an authorization token. This is interesting, because the token has been created by the browser to make the request — these are the request headers straight from the browser. If I copy this request as cURL, go to curlconverter, paste it in, copy the generated code to the clipboard, and drop it into my other terminal, then add print(response.json()) — I'm just showing you this quickly, we'll get into the code demo shortly — and run it, we should get the data back. And we do, there it is: all the JSON data that we want. But if I come back to this file, look down to the authorization header, and take it out, we get a very different result: a credentials error. That's an interesting way of seeing it working, because normally you don't really see this.
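The code curlconverter generates is roughly this shape — the header and token values here are placeholders, not the site's real ones, and `fetch` is just a hypothetical wrapper to show where the token sits:

```python
import requests

# Placeholder headers — curlconverter fills these in from the copied
# request; the Bearer token is the interesting one.
headers = {
    "user-agent": "Mozilla/5.0",
    "content-type": "application/json",
    "authorization": "Bearer <token copied from dev tools>",
}

def fetch(url: str, payload: dict) -> requests.Response:
    # With the authorization header present the API answers with the
    # product JSON; pop it from `headers` and the same call returns a
    # credentials error instead.
    return requests.post(url, headers=headers, json=payload, timeout=30)
```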
So when we write our code, we want to make sure we're able to get those request headers (and I don't think I need that tab any more). This is very possible in things like selenium-wire, but that's been deprecated since the beginning of the year, plus I really wanted to use one of the less detectable browsers, so we're going to use selenium-driverless to write this code.
One other thing to mention before we get started: I found that the product IDs are actually within the HTML. It's hard to see, but there's a data-product-id attribute in there — there's one here — so they're easily obtainable from the HTML source of a main listing page, which is good. The second thing is that I've already put the JSON data into a JSON parser so I could see what everything was, because I'm going to create a model based on it. Right, let's get started. I've got my project folder open here,
and I'll start by pip installing what I need. We're going to use selenium-driverless, which I covered in a video a couple back — it's pretty good — plus rich, and pydantic. I'll explain why I use pydantic: it's just easier. I don't need any other validation; I can paste all of the JSON into the JSON-to-Pydantic website, get all the models generated for me, and just delete the ones I don't want. That's why I use it, no other reason really — I'm not worried about performance or anything like that. So let's create a main file and get started. We're going to need a few things.
from let's make this a b bit selenium driverless we're going to
import in web driver don't know why it's called that cuz it's no driver but
anyway it doesn't matter and we're going to need a syn iio I think yeah we'll
need a syn iio um and we'll leave it at that for the moment so what I'm going to
do to start with i i just install rich as well from Rich to import print we'll
start with this and then we'll we'll we'll expand it as we go um I'm going to
create a proxy variable at the top because I am going to be using my proxy
for this I need OS for that import OS I keep my proxies as an environment
variable and I'm going to be using the mobile
proxy uh I think it's this the mobile proxies uh for this one they're pretty
good when it comes to sites like this because you know the the mobile traffic
is going to be a huge proportion of traffic going to these sorts of Ecom
sites and they really don't want to block those sorts of ips so this is
pretty cool uh and then we'll do if proxy
is none then we'll just say print no proxy
found you don't need to do this and I'm just going to quit if you do if you
handle your proxy by writing it directly into the script you won't need that
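The env-var check can be wrapped like this (`PROXY_URL` is a hypothetical variable name — use whatever you export your proxy under):

```python
import os

def load_proxy(var: str = "PROXY_URL") -> str:
    # Fail fast if the proxy isn't configured, rather than accidentally
    # scraping from our own IP.
    proxy = os.environ.get(var)
    if proxy is None:
        raise SystemExit("no proxy found")
    return proxy
```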
That's absolutely fine. So let's create the main function: async def main — this is all asynchronous Python we're going to be using. We'll set options = webdriver.ChromeOptions(), and the option we want is options.single_proxy, which we set equal to our proxy, so the browser is set up to use our proxy every time. I'm going to use a context manager for this, because I want Chrome to close properly the way it's designed to — it clears out all of the profile data when it exits. Context managers are great in Python, even if I think they add a little too much indenting; they work well. So: async with webdriver.Chrome(options=options) as driver, and now we fill in the middle.
We're going to be using selenium-driverless's request interception here. In the documentation — not the examples, the request interception page — it shows that you can use this interceptor with a handler function and have things happen when a request fires, which is exactly what we want: when that request is made, we can do something with it. We're going to use the NetworkInterceptor; I just wanted to show you that in the documentation. So, another async with, because this is another context manager. The import first: from selenium_driverless.scripts.network_interceptor import NetworkInterceptor, InterceptedRequest. (This is all getting a bit messy, so I'll make my editor text a little smaller.) Then async with NetworkInterceptor, giving it the driver and on_request — I'm calling the handler function on_request, and we'll create it in a second — and inside that, await driver.get() with our URL. My code editor, or rather my LSP, doesn't like this for some reason, so I'm just going to ignore it.
So now the new function: async def on_request — this is our handler, the function that gets called back whenever we hit a network request. The data parameter is the intercepted request: every time a request is intercepted, it's passed in here as an InterceptedRequest instance, and we have access to all its data. We need to narrow things down, because there are going to be quite a lot of requests firing off, so: if "api" in data.request.url — that's a good start — and data.request.method == "POST". That should narrow it down quite a lot for us. When we looked at this earlier (I've closed it now), it was a POST request, not a GET, which is why I'm filtering on that.
What I'm going to do now is create a global variable. I just found this was the easiest way, because there were lots of different requests I had to filter through and wait on before I got the one with the authorization header. So I declare the global, then a try block: index data.request.headers with the key "authorization" to test whether it's there, and if it is, set our auth variable to those headers, data.request.headers. The except catches a KeyError — and it should go over here, there we go — so if the "authorization" key I'm asking for isn't in the headers, I'll just print "no auth header found in request". That will do for now; it'll probably need a bit of tidying up, but we'll leave it like this.
Now I'll just grab the URL to start with and put it in there. So we have our driver.get, and after it I'm adding await driver.sleep. There's a reason for this: the intercepted requests don't seem to fire straight away. They do happen — we can see it working — but if you don't have some kind of wait here for the page, your code will run and complete before the actual request to the API has been made. The number will depend on your network, how quick it is, and how quickly the site responds; I'm leaving it at 6 seconds. Waiting 6 seconds to then be able to make loads of subsequent requests using these headers and cookies is absolutely fine in my opinion.
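The driver side, as a sketch, using selenium-driverless's API as described here (`ChromeOptions`, `single_proxy`, `NetworkInterceptor`, `driver.sleep`). The imports are deferred into the function only so the sketch stays loadable without the browser stack installed; normally they'd sit at module top:

```python
import asyncio

async def main(proxy: str, url: str, on_request) -> None:
    # Deferred imports so this sketch can be read without
    # selenium-driverless installed.
    from selenium_driverless import webdriver
    from selenium_driverless.scripts.network_interceptor import NetworkInterceptor

    options = webdriver.ChromeOptions()
    options.single_proxy = proxy  # route every request through the proxy

    # Context managers so Chrome exits cleanly and wipes its profile data.
    async with webdriver.Chrome(options=options) as driver:
        async with NetworkInterceptor(driver, on_request=on_request):
            await driver.get(url)
            # The API call fires after page load; without a pause the
            # coroutine can finish before the request is intercepted.
            await driver.sleep(6)

# asyncio.run(main(PROXY, TARGET_URL, on_request))  # run it like this
```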
Now we add asyncio.run(main()) at the bottom. Let me double-check this really quickly: I've got my proxy, which I'm passing in here, with a check that it actually exists; the handler checks the request method and the headers — sorry, the request URL — for "api", which will match the actual network request I want; and at the end of this we should end up with an auth header, which I'll just print out to debug. Save, clear the terminal, and run the main file. This loads up — oh, I think I've missed... no, no, we should be fine. There we go: "no auth header found in request" at first, then it's found it, and I'm waiting the 6 seconds (I didn't need to wait the full 6 seconds this time, because that worked just fine). And here we have it — it's happened a few times, each with this authorization Bearer token, so any of these would be absolutely fine, and we can see we're actually pulling it out. Now, I spent quite a while banging my head against the wall trying to get this little bit to work how I wanted. This is the best I've come up with at the moment, but I'm sure there's a better way of doing it, so if there is, leave a comment down below and let me know. I mean, I could spend more time on this, but it works well enough in my opinion.
So now we've run this, the browser has run, and we have those auth headers — the good headers — and the cookies and everything we need to make the request. What I'm going to do is reuse them: feed them into another session, an actual HTTP client that I can use to make requests directly to the GraphQL API using those headers. Again, I'm aware this is possibly a bit of a grey area because of that authorization token, so this is for educational purposes only. Right — so once our asyncio.run finishes, we've got our browser cookies, headers, and all that sort of stuff we need. Next, I needed an HTTP client to actually make the further requests,
one that could hold a session. Now, urllib3 came installed with selenium-driverless — it's already one of its dependencies — so I decided not to bother adding anything; you could easily pip install requests or httpx if you prefer. What it meant is that I had to do some funny stuff with the proxy for the session, because urllib3 handles it slightly differently: it has this ProxyManager, and I needed to handle my authenticated proxy a bit differently too. I'd already done this, otherwise I probably would have just installed requests — it is what it is. We now have a session that's going to use my proxy.
So next I'll say response = session.request, a GET, to https://httpbin.org/headers, with headers set to our captured auth headers, and then print(response.json()). Now I'll come back to the top and get rid of that debug print, so we only print when we don't get the headers; instead they should be printed back to us from our request to httpbin, which just echoes your request headers back to you. So the page loads, the first few requests don't contain the auth header, then it should find it, then the browser closes... and I could have done this wrong — oh, all that talking just to set it up and I typed one thing wrong. Okay, fine, we'll do it again. (I never check my code, I always just run it and deal with the errors afterwards — probably not a good idea.) Back to where we were: once the browser closes, we make the request to httpbin, and we can see the headers sent back to us include that authorization Bearer header and all the other ones we need to make requests to our API endpoint.
There we go. Now let's come back down here. We don't need the httpbin call any more, because we know that works; what we do need is all of this JSON data, which we'll copy, because we need to send it along with the request — it's essentially the query for the product information we're trying to get. So: response = session.request, and — my bad — this should be a POST, not a GET, with that URL, headers equal to our headers, and json equal to the JSON data. What I'm also going to do is increase the timeout, because on these HTTP clients the default timeout is sometimes too short, and this is a large request — we're going to get a lot of data back from a GraphQL API, so it might take a little longer.
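As a sketch, the GraphQL call with the raised timeout — the endpoint URL is a placeholder, `fetch_product` is my hypothetical wrapper name, and the `json=` parameter assumes urllib3 v2:

```python
import urllib3

GRAPHQL_URL = "https://www.example.com/graphql"  # placeholder endpoint

def fetch_product(session, auth_headers: dict, payload: dict):
    # Large GraphQL responses can outlive a short default timeout,
    # so give the request a generous total budget.
    return session.request(
        "POST",
        GRAPHQL_URL,
        headers=auth_headers,
        json=payload,
        timeout=urllib3.Timeout(total=30),
    )
```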
Now that's in place, let's go ahead and run this again. We have to wait for the browser to come up, obviously, because the browser is what gives us the up-to-date, latest headers and, of course, the cookies we're going to need. And there we have it: we've made the request, and the response has the information in it — everything from that XHR request I showed you earlier — except we are, of course, the ones supplying the headers now. Let me scroll down: if I were, for example, to take these headers out and try to make the request without them, we'd obviously get blocked, denied, whatever you want to call it. Again, maybe a bit of a grey area, but I'm not doing anything my browser wouldn't be doing — my browser has done it all for me, and I'm basically just mimicking those requests in a more time-efficient manner. These are methods of web scraping that I just want to show you some of; you might find them interesting or useful.
The last thing I'm going to do is create my models, so I'll have a models.py file and use the JSON-to-Pydantic converter — I've already put the JSON in here, and you can see you dump the JSON in and get all of this back. So I'll just go ahead and copy that out, paste it into models.py, and rename the bottom-level class from Model to ProductModel — that's the one that wraps everything. Then, within this whole thing, I'm going to get rid of a load of the fields: don't need that, we'll keep those, don't need that or that, get rid of that, don't want the product story or any of that, and get rid of all this as well. You can keep whatever you want; this step is basically just removing fields. This is why I like pydantic: it fits the JSON to this model, and anything that doesn't match these fields just gets discarded, which is pretty handy.
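Here's that effect in miniature — the field names are made up, not the site's schema; by default pydantic v2 simply ignores keys that aren't declared on the model:

```python
from pydantic import BaseModel

class Product(BaseModel):
    # Hypothetical fields — generate the real ones by pasting the
    # captured JSON into the JSON-to-Pydantic converter.
    id: int
    name: str

raw = {"id": 1, "name": "Desk", "productStory": "...", "images": []}
product = Product.model_validate(raw)  # extra keys are discarded
```

In main, the same call runs against the nested part of the response, e.g. `Product.model_validate(response.json()["data"]["product"])`.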
Save that, come back to main, and from models import ProductModel. At the bottom, instead of printing the response, we want ProductModel, and we load the JSON in... actually, hold on, let me look at the response again. Clearing that up: the JSON is data, then product. So what I might do is load that part in directly — back in the model file there's a Product class, so I'll load straight into Product rather than the wrapper. Back at the top of my main file, I'll change the import from ProductModel to Product, which maps directly to that part, and from the response JSON I'll get data, then product. So instead of loading the whole of the response directly, I'm just loading that inner part, ignoring the extra fields we don't need. (My dog's barking for some reason; I'm sure she'll be fine.) I also need to add our auth headers back in — headers equal to our auth headers — then save and run.
So let's run this: we use the browser to get the good headers, cookies, etc. that we need, and then from our urllib3 session — which could of course be a requests session, httpx, or any client with a session or client option — we get back just the information we wanted, in our pydantic model. If I remove some more fields from the model, it might be a bit easier to see what we're actually getting, so I'll remove the description, the images, the colors list, and the variations as well, because what I want to show is just how we can change this as we need to.
So I'll come down here and add an input: product = input(...), prompting for a product ID, and pass str(product) into the request. I should have called the variable product_id, actually — we'll change that — and I'm going to grab a different product ID to test with.
This one will do fine. Now when we run it, we get the prompt asking for the product ID we want to fetch. We could obviously handle that however we like: give it a list of product IDs, pull them from the page, or pull them from your database. Let's put this one in, it goes ahead and makes the request, and here's the information back — I've cut a load of the data out here.
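Parameterizing by product ID could be wrapped like this — `build_payload` and the `"productId"` slot are hypothetical names; put the ID wherever the captured payload actually carries it:

```python
def build_payload(base_payload: dict, product_id) -> dict:
    # Copy the captured GraphQL body and swap in the product ID we
    # want — IDs could come from input(), a list, the page HTML, or
    # a database.
    payload = dict(base_payload)
    payload["variables"] = {
        **payload.get("variables", {}),
        "productId": str(product_id),
    }
    return payload
```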
Right, let me summarize this quickly. What we've done is use an undetectable — or as undetectable as you can get — browser with a proxy, specifically a mobile proxy; pick whichever type works best for your use case. We found we needed to make a POST request to a GraphQL API to get the structured JSON data — everything that's on the page comes back in that data anyway, so we shouldn't be doing anything wrong there. Then we found this authorization header (not a cookie, sorry) carrying some kind of base64-encoded token. A bit of a grey area, because I don't really like it when you have to use a token, but it's being generated by my browser to actually make that request, so take it as you will. Once we had it, we could make subsequent requests to different products to get that data back, and I put it into a pydantic model. There's a lot going on here, but actually not a lot of code in the end — I spent way more time looking things up and figuring out how to do it, and ended up with about 60 lines of code. This is web scraping; this is how it goes. The hardest part is getting the data. So if you want to know how I go about getting data like this without the browser, where you don't have to worry too much about the headers, watch this video next — it's much simpler and applies to many more use cases.