Should I have used this Web Scraping Technique?
2024-11-10
➡ JOIN MY MAILING LIST https://johnwr.com
➡ COMMUNITY https://discord.gg/C4J2uckpbR
➡ PROXIES https://proxyscrape.com/?ref=jhnwr
➡ WEB SCRAPING API https://hubs.li/Q043T88w0
➡ HOSTING https://m.do.co/c/c7c90f161ff6

If you are new, welcome. I'm John, a self-taught Python developer and content creator, working at Zyte. I specialize in data extraction and automation. If you like programming and web content as much as I do, you can subscribe for weekly content. All views in this video are my o...
Subtitles

So I was partway through scraping this site because I wanted to demonstrate a technique on it, which I'm still planning to do, but things took a bit of a twist and a turn. What I was going to do was use Selenium Driverless, a very hard-to-detect browser, to load the page and get the cookies and headers we need, then make subsequent requests with a requests session object. That is roughly what we're going to do, but what I want to show you first is this: when I was having a look at the site and selected a product, with dev tools open on the XHR requests, the first thing that popped up was the fact that this is a GraphQL site, which is quite unusual, or less common is probably the best way to put it. What I had there was the JSON data with all the product information, and this is exactly what I was after. Structured data like this is just so much easier to process, so much easier to work with, and more efficient to extract too.

Because of the technique we're using in this video, I'm going to be using geolocated proxies and holding each IP for 3 to 5 minutes, which gives me enough time to replicate the requests from the same proxy IP. I use ProxyScrape, who were kind enough to sponsor this video. We get access to high quality, secure, fast, and ethically sourced proxies covering residential, datacenter, and mobile, with rotating and sticky session options. There are 10 million plus proxies in the pool, all with unlimited concurrent sessions, from countries all over the globe, enabling us to scrape quickly and efficiently. I use a variety of proxies depending on the situation, but I'd recommend you start out with residential ones. Make sure you select countries that are appropriate to the site you're trying to scrape, and match your own country where possible. Also consider not rotating on every request: hold an IP for a short time, like I'm going to do in this project. Either way, it's still only one line of code to add, and you can let ProxyScrape handle the rest from there. Any traffic you purchase is yours to use whenever you need, as it doesn't ever expire. If this all sounds good to you, go ahead and check out ProxyScrape at the link in the description below. Now let's get on with the code.

What I looked for when I went to the headers was the usual kind of response headers, but in the request headers there was this: essentially an authorization token. This is interesting because it has been created by the browser to make the request; these are the request headers from the browser itself. Now, if I copy this request as cURL, go to curlconverter, bang it in there, copy the result to the clipboard, and come to my other terminal where I've got a file open, I can add print(response.json()) at the bottom. I'm just showing you this quickly; we'll get into the code demo shortly. If I run this, we get the information back. There it is: all the JSON data that we want. But if I come back to this file, look down to the authorization header, and take it out, we get a very different result: invalid credentials. It's an interesting way of seeing it working, because normally you don't really get to see this.
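As a rough sketch, the curlconverter output boils down to something like this; the endpoint, headers, and body here are placeholders rather than the real site's values:

```python
import requests

# Placeholder endpoint and token -- curlconverter fills these in with the
# real values from the browser's "Copy as cURL" output.
url = "https://www.example.com/api/graphql"

headers = {
    "user-agent": "Mozilla/5.0",
    "authorization": "Bearer <token copied from dev tools>",
}

json_data = {"query": "...", "variables": {}}

response = requests.post(url, headers=headers, json=json_data)
print(response.json())  # full product JSON while the token is present

# Remove the authorization header and the same request is rejected
del headers["authorization"]
denied = requests.post(url, headers=headers, json=json_data)
print(denied.status_code, denied.text)  # an "invalid credentials" style error
```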

So what we need to do now, when we write our code, is make sure we're able to get those request headers. This is very possible in something like Selenium Wire, but that's been deprecated since the beginning of the year, and I really wanted to use one of the less detectable browsers, so we're going to use Selenium Driverless to write this code out.

One other thing I'm going to talk about before we get started: I found within the HTML (it's hard to see) that the product IDs are actually in there, as data-product-id attributes, so they're easily obtainable from the HTML source of a main page, which is good. The second thing is that I've already put the JSON data into a JSON parser so I could see what everything was, because I'm going to create a model based on it.

Right, let's get started. I've got my project folder open here, and I'm going to start by pip installing what I need: pip install selenium-driverless rich pydantic. Selenium Driverless I covered in a video a couple back, and it's pretty good. As for why I use Pydantic: it's just easier. I don't need any other validation, and I can paste all of the JSON into the JSON-to-Pydantic website, get all the models made for me, and just delete the ones I don't want. That's why I use it, no other reason really; I'm not worried about performance or anything like that.

Let's create a main file and get started. We're going to need a few things: from selenium_driverless we import webdriver (I don't know why it's called that, since there's no driver, but it doesn't matter), and we'll need asyncio. I installed rich as well, so from rich import print. We'll start with this and expand as we go. I'm going to create a proxy variable at the top, because I am going to be using my proxy for this; I need os for that (import os), since I keep my proxies as environment variables. I'm going to be using the mobile proxies for this one. They're pretty good when it comes to sites like this, because mobile traffic is going to be a huge proportion of the traffic going to these sorts of e-commerce sites, and they really don't want to block those sorts of IPs, which is pretty cool. Then we'll do: if proxy is None, print "no proxy found" and quit. You don't need to do this; if you handle your proxy by writing it directly into the script, you won't need that check, and that's absolutely fine.
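Putting that setup together, the top of the file looks something like this; MOBILE_PROXY is just my placeholder for whatever you name the environment variable:

```python
import asyncio
import os
import sys

from rich import print
from selenium_driverless import webdriver

# Proxy URL (user:pass@host:port) kept out of the script as an env var;
# "MOBILE_PROXY" is a placeholder name.
proxy = os.environ.get("MOBILE_PROXY")

if proxy is None:
    print("no proxy found")
    sys.exit()
```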

Now let's create a main function, async def main; this is all asynchronous Python we're going to be using. We'll say options = webdriver.ChromeOptions(), and then options.single_proxy is the one we want: set it equal to our proxy, so the browser is set up to use our proxy every time. I'm going to use a context manager for this, because I want it to close Chrome properly the way that it does it, since that clears out all of the profile data when it exits. Context managers are great in Python, even if I think they bring a little too much indenting; they work well. So we'll do async with webdriver.Chrome, with options equal to the options we've created, as driver, and put our code in the middle.

Now, we're going to be using Selenium Driverless's interceptor here. If I come down to the documentation, under request interception, it basically shows that you can use this interceptor with a handler function, and have things happen per request: when a request is fired, which is exactly what we want, we can do something with it. We're going to be using the NetworkInterceptor; I just wanted to show you that in the documentation. This is another context manager, so again async with. First the import: from selenium_driverless.scripts.network_interceptor we import NetworkInterceptor and InterceptedRequest. (This is all getting a bit messy, so I'm going to make my text a little smaller.) We write async with NetworkInterceptor, give it the driver, and say on_request=on_request, the handler function we're going to create in a second. Then I do await driver.get() with our URL in there. My code editor, or rather my LSP, doesn't like this for some reason, so I'm just going to ignore it.

I'm going to create a new function, async def on_request; this is our handler function, the one that gets called back whenever we hit a network request. Every time a request is intercepted, it gets passed in as this class, the InterceptedRequest, and we have access to all the data there. We need to narrow things down, because there are going to be quite a lot of requests firing off, so: if "api" in data.request.url, which is a good start, and data.request.method equals POST. That should narrow it down quite a lot for us. When we looked at this earlier (I've closed it now), it was a POST request, not a GET request, which is why I'm filtering on that.

What I'm going to do now is actually create a global variable. I just found this was the easiest way to do it, because I had lots of different requests to filter through and wait on before I got the one with the authorization header. So I create a global here, and inside a try block I check whether data.request.headers has the key "authorization". If it's there, our auth_headers variable is set to those headers, data.request.headers. The except catches a KeyError: if the "authorization" key I'm asking for isn't within those headers, I just print "no auth header found in request". That will do for now; it could probably use a bit of tidying up, but we'll leave it like this.
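A minimal sketch of that handler; auth_headers is my name for the global, and the dict-style header access mirrors what's shown on screen:

```python
from selenium_driverless.scripts.network_interceptor import (
    InterceptedRequest,
    NetworkInterceptor,
)

auth_headers = None  # set once we intercept the authorized API request


async def on_request(data: InterceptedRequest):
    # Only the GraphQL call matters: a POST with "api" in the URL.
    if "api" in data.request.url and data.request.method == "POST":
        global auth_headers
        try:
            data.request.headers["authorization"]  # KeyError if missing
            auth_headers = data.request.headers
        except KeyError:
            print("no auth header found in request")
```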

I'm just going to grab the product URL to start with and put it in there. So now we have our driver.get, and after it I'm going to do await driver.sleep. There's a reason for this: these intercepted requests don't seem to fire straight away. They do happen, and we can see it working, but if you don't have some kind of wait in here for the page, your code will run and complete before the actual request to the API has been made. The number will depend on your network, how quick it is, and how quickly the site responds to you; I'm leaving it at 6 seconds. Waiting 6 seconds to then be able to make loads of subsequent requests using these headers and cookies is absolutely fine by me, in my opinion.
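Which makes the whole main function roughly this (the product URL is a placeholder):

```python
async def main():
    options = webdriver.ChromeOptions()
    options.single_proxy = proxy  # route the whole browser through the proxy

    # Context managers so Chrome exits cleanly and wipes its profile data
    async with webdriver.Chrome(options=options) as driver:
        async with NetworkInterceptor(driver, on_request=on_request):
            await driver.get("https://www.example.com/products/some-product")
            # Intercepted XHR calls fire a little after page load, so wait
            # long enough for the API request to actually happen.
            await driver.sleep(6)


asyncio.run(main())
```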

Then we want asyncio.run(main()) at the bottom. Let me just double-check this really quickly: I've got my proxy, which I'm passing in here, and I'm checking that it actually exists; the handler is going to check the request method and the request URL for "api", which will match the actual network request I want; and at the end of this we should end up with an auth header. What I'll do then is add a print so we can debug it here, and save. So let's clear this, run the main file, and it's going to load up.

Oh, I think I've missed... no, no, we should be fine. There we go: "no auth header found in request", and then it's found it this time, and now I'm waiting out the 6 seconds. I didn't actually need to wait the full 6 seconds this time, because it worked just fine. So here we have it: it's happened a few times, and each one has this authorization Bearer token, so any of these would be absolutely fine, and we can see that we're actually pulling it out. Now, I spent quite a while banging my head against the wall trying to get this little bit to work how I wanted. This is the best I came up with for the moment, but I'm sure there's a better way of doing it, so if there is, leave a comment down below and let me know. I could spend more time on this, but it works well enough, in my opinion.

So now we've run this, our browser has run, and we have those auth headers, the good headers, and the cookies and everything we need to make the request. What I'm going to do is reuse those: I'm going to put them into another session, an actual HTTP client, that I can use to make requests directly to the GraphQL API using those headers. And again, I'm aware this is possibly a bit of a gray area because of that authorization token, so this is for educational purposes only. Right, there you go. So once we've finished with our asyncio.run, we're done getting our browser cookies, headers, and all that sort of stuff we need.

I needed an HTTP client to actually make further requests, one that could hold a session. Now, urllib3 came installed with Selenium Driverless; it was already a dependency of the program, so I decided not to add anything. You could easily pip install requests or httpx if you prefer. What it meant was I had to do some funny stuff with the proxy for the session, because urllib3 handles it slightly differently: it has this ProxyManager, and I needed to handle my authenticated proxy a bit differently. I'd already done this, otherwise I probably would have just installed requests. There we go, it is what it is; we now have a session that's going to use my proxy.
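The "funny stuff" amounts to something like this: with urllib3 you build a ProxyManager yourself, and an authenticated proxy needs its credentials passed as proxy headers rather than left in the URL (a sketch, assuming a user:pass@host:port proxy string):

```python
from urllib.parse import urlparse

import urllib3

parts = urlparse(proxy)

# urllib3 wants proxy auth as a Proxy-Authorization header, not as
# credentials embedded in the proxy URL itself.
session = urllib3.ProxyManager(
    f"{parts.scheme}://{parts.hostname}:{parts.port}",
    proxy_headers=urllib3.make_headers(
        proxy_basic_auth=f"{parts.username}:{parts.password}"
    ),
)
```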

So what I'm going to do is say our response is equal to session.request, a GET request, to https://httpbin.org/headers, with headers equal to our auth headers, and then print(response.json()). Now I'm going to come back to the top and get rid of the debug print there, so we only see a print if we don't get those headers, and instead we should get them printed back from our request to httpbin, which just sends our headers back to us.
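The check itself is only a couple of lines; httpbin.org simply echoes back whatever headers it receives:

```python
# Confirm the captured browser headers survive the round trip through urllib3
response = session.request(
    "GET",
    "https://httpbin.org/headers",
    headers=auth_headers,
)
print(response.json())  # should echo back the authorization Bearer header
```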

So the page is going to load up; in the first few requests we're not going to find the auth header, it should find it now, and then this browser is going to close. And... I could have done this wrong. Oh, all that talking just to set it up, and I typed one thing wrong. There, okay, fine, so now we'll do it again. I never check my code; I always just run it and deal with the errors afterwards, probably not a good idea. Back to where we were: now, when the browser is closed, we make the request to httpbin, and we can see that the headers sent back to us have this authorization Bearer header and all of the other ones we need to be able to make requests to our API endpoint.

There we go. So now we'll come back down here, and we don't need the httpbin call anymore, because we know that works. What we do need is all of this JSON data, so we're going to copy it, because we need to send it along with the request; this is basically what we're trying to get, the product information. So we'll do: response = session.request, with the URL from over here, and this should be a POST request, my bad, not a GET. Then we need our headers, which are going to be equal to our auth headers, and the JSON data, which should just be json equal to json_data. What I'm also going to do is increase the timeout, because sometimes on these HTTP clients the default timeout is too short, and this is a large request; we're going to get a lot of data back from a GraphQL API, so it might take a little bit longer.
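Swapping httpbin for the real endpoint, the replay looks something like this; the URL, payload, and timeout value are placeholders for whatever you copied out of dev tools (json= and a plain float timeout are urllib3 v2 conveniences):

```python
# GraphQL body copied from the intercepted request (placeholder shape)
json_data = {
    "query": "query Product($id: ID!) { product(id: $id) { ... } }",
    "variables": {"id": "12345"},
}

response = session.request(
    "POST",
    "https://www.example.com/api/graphql",  # placeholder endpoint
    headers=auth_headers,  # captured browser headers, incl. authorization
    json=json_data,
    timeout=30.0,  # large GraphQL responses can outlast a tight default
)
print(response.json())
```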

Now we've got it all, we're going to do the request, so let's go ahead and run this again. We have to wait for the browser to come up, obviously, because this browser is the one that's going to be giving us the up-to-date, latest headers, and of course the cookies we're going to need. And there we have it: we've made the request, and it has the information in it, everything from that XHR request I showed you earlier, except we are of course using the headers and everything we need. Let me scroll down.

If I was to, for example, take out these headers and try to make the request without them, we're obviously going to get blocked, denied, whatever you want to call it. Again, this is maybe a bit of a gray area, but I'm not doing anything that my browser wouldn't be doing; my browser has done it all for me, and I'm basically just mimicking those requests, in a more time-efficient manner. These are methods of web scraping that I just want to show you; you might find them interesting, or you might find them useful.

The last thing I'm going to do is create my models, so I'm going to have a models.py file, and I'm going to use the JSON-to-Pydantic website. I've already put the JSON in here; you can see you dump the JSON in there and get all of this back. So I'll go ahead and copy it out, paste it into the file, and instead of Model I'm going to call the bottom one ProductModel; it wraps everything in here. Within this whole thing, I'm going to get rid of a load of the fields: don't need that, we'll keep those, don't need that or that, get rid of that, don't want the product story or any of that, and get rid of all this as well. You can keep whatever you want; this one is basically just going to remove all of these things. This is why I like Pydantic: it's going to fit the JSON to this model, and anything that doesn't match these fields just gets discarded, which is pretty handy.
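The trimmed result ends up shaped something like this; the field names here are invented stand-ins for whatever your GraphQL response actually contains:

```python
from pydantic import BaseModel


class Price(BaseModel):
    amount: float
    currency: str


class Product(BaseModel):
    # Pydantic ignores undeclared fields by default, which is exactly
    # the "anything that doesn't match gets discarded" behaviour above.
    id: str
    name: str
    price: Price
```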

Save this, and we'll come back to our main file: from models import ProductModel. Then at the bottom, instead of printing the response, we want ProductModel, and we want to load in the JSON... not there. In fact, let me have a look at this; let's come back to the clean JSON. So I've got data, then product. What I might do is just try to load that part directly. Going back to our models: Product, yes, I'm going to load directly into Product rather than the outer model. So back in my main file, at the top, instead of importing ProductModel we'll import Product, which maps directly to that part, and from the response JSON I'm going to get "data" and then "product". All I'm doing is, instead of loading the whole of the thing directly, loading just this part, so we ignore the extra fields we don't need. (And my dog's barking for some reason; I'm sure she'll be fine.) Add our auth headers back in, headers equal to our auth headers, save, and run.
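Loading just the inner object then looks like this; model_validate is the Pydantic v2 spelling (use Product.parse_obj(...) on v1):

```python
from models import Product

payload = response.json()
# The GraphQL response wraps everything in {"data": {"product": {...}}},
# so validate only the inner object against the trimmed model.
product = Product.model_validate(payload["data"]["product"])
print(product)
```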

So let's run this. We're going to use our browser to get the good headers, cookies, etc. that we need, and then from our urllib3 session (which could of course be a requests session, httpx, or any client that has a session or client option, something like that) we now get just the information we wanted back in our Pydantic model. If I go ahead and remove some more from the model, it might be a bit easier to see what we're actually getting, so I'm going to remove the description and the images, the colors list, and the variations as well, because what I want to show is just how we can change this as we need to.

So what I'm going to do is come down here and say our input is... actually, let's do product = input(), which will just prompt me for a product ID, and then we can use str(product) in the request. I should have called this product_id, actually, so we'll change that. I'm going to grab a different product ID; this one will do fine. Now when we run this, we get that question: give us the product ID that you want to get. We could obviously handle that however we want: we could give it a list of product IDs, we could pull them from the page, or you could pull them from your database. Let's put this one in, it goes ahead and makes the request, and there's the information back; I've cut a load of the data out here.
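That prompt-driven lookup is only a few lines; "productId" is a placeholder for whatever variable name the site's GraphQL query really uses:

```python
product_id = input("product id: ")

# Drop the requested ID into the copied GraphQL payload
json_data["variables"]["productId"] = str(product_id)

response = session.request(
    "POST",
    "https://www.example.com/api/graphql",  # placeholder endpoint
    headers=auth_headers,
    json=json_data,
    timeout=30.0,
)
print(Product.model_validate(response.json()["data"]["product"]))
```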

Right, I need to summarize this up real quick. What we've done is use an undetectable, or as undetectable as you can be, browser with a proxy, specifically a mobile proxy; pick whichever works best for your use case. We found that we needed to make a POST request to a GraphQL API to get the structured JSON data; everything that's on the page comes back in that data anyway, so we shouldn't be doing anything wrong there. Then we found we had this authorization header (not a cookie, sorry) which carried some kind of base64-encoded token. That's a bit of a gray area, because I don't really like it when you have to use a token; however, it is being generated by my browser to actually make that request, so who knows, take it as you will. We got that, and then we could make subsequent requests to different products to get that data back, and I put it into a Pydantic model. There's a lot going on here, but actually not a lot of code in the end; I spent way more time looking stuff up and figuring out how to do it, and ended up with maybe 60 lines of code. This is web scraping, this is how it goes: the hardest part is getting the data. So if you want to know how I go about getting data like this but without the browser, where you don't have to worry too much about the headers, you want to watch this video next; it's much simpler and probably applies to many more use cases.