This script I threw together saves me hours.
2023-08-16
If you are new, welcome. I'm John, a self taught Python developer and content creator, working at Zyte. I specialize in data extraction and automation. If you like programming and web content as much as I do, you can subscribe for weekly content. All views in this video are my o...
Subtitles

so I'm going to show you a tool that I wrote for myself that loads up the page

using selenium wire and checks all of the network responses and requests that

it makes so we can easily find that JSON data that's in that backend API so we

don't have to keep loading the page and looking and seeing what's going on we can

give it a URL it will load it up and it will give us a nice list of URLs and

also the responses saved to a file for us to interrogate and work out what we're doing

I like building tools like this they make your life so much easier hopefully

you like this one too so we're going to be using selenium wire which is an

extension to selenium it kind of adds to it so you'll need to make sure you pip

install that and then we're going to go ahead and do from seleniumwire we're going

to import webdriver and then we also need some of the utilities so from

seleniumwire.utils we need to import decode now I'm going to import

decode as decode_sw because we are going to use the normal decode as well I'm

also going to import json because we will need that later on so what selenium

wire does is it will load up the page and it will then show and check all of

the network activity that that website is doing so we want to be able to see

that and we want to be able to intercept that so we can create a few functions

first so what I'm going to do is I'm going to call this one show request

URLs so what this is going to do is it's going to just return us the URLs that

the site has made requests to externally this is where we can easily find the API

so here we need to give it the driver and I'm going to cover this in just a

second and also a Target URL so it knows what to load up within this driver I'm

going to do driver.get this is basically going to the page and we'll say Target

URL I'm going to create a blank list here of URLs so we can like add them to

it and from here we basically just want to interrogate the requests now we do

this using driver.requests so we'll do for request in

driver.requests and we'll just append it to our list urls.append I'm going to make this

a dictionary with the key url and then request.url so this is the

first part of our selenium wire that gives us access to this request and the

driver requests here and also the responses which we'll do in a separate

function so I'm just going to return the URLs from this

function and we'll create a new function which will be our main

this is where we're going to run everything so here we need to actually

initialize our webdriver so I'm going to say driver is equal to

webdriver.Firefox you can use whichever one you like which is installed I like

Firefox massive Firefox Fanboy and we need to add in some selenium wire

options here as a dictionary because when we get the response back it's going

to be encoded or it's going to be bytes we want to make sure that it doesn't do

any extra encoding so we'll do disable_encoding

is equal to True and this needs to be a dictionary here so now that we have this

driver we can then use it within our show request URLs to actually open the

web browser and load it up so I'm going to say that our Target I will just call

this URL is going to be equal to I'm just going to grab it from over here

we'll use this website here as a good example so

now we're going to say that our URLs is equal to actually I'm going to change

this because that is going to be a bit confusing Target URL okay so now we'll

have our URLs which is going to come back what's coming back out of this

function it's going to be equal to show request URLs and we'll pass in the

driver which we've created and also the target URL here like so

then let's just run through these URLs and print them out for URL in urls

print the URL out and then let's make sure we

run this function so main so if

__name__ is equal to "__main__"

and then we'll just run the main function
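
Put together, the script so far could look roughly like the sketch below. It assumes selenium-wire and Firefox with its driver are installed, and the target URL is a placeholder, since the address used in the video is not spelled out. The seleniumwire import sits inside main only so that the helper function can be tried without a browser.

```python
def show_request_urls(driver, target_url):
    """Load target_url and return every URL the page made a request to."""
    driver.get(target_url)
    urls = []
    # selenium wire records each captured network request on driver.requests
    for request in driver.requests:
        urls.append({"url": request.url})
    return urls


def main():
    # Imported here so show_request_urls can be used without selenium wire.
    from seleniumwire import webdriver

    # disable_encoding asks servers to send uncompressed response bodies
    driver = webdriver.Firefox(seleniumwire_options={"disable_encoding": True})
    target_url = "https://example.com"  # placeholder, not the site from the video
    urls = show_request_urls(driver, target_url)
    for url in urls:
        print(url)
```

Calling main() under the usual `if __name__ == "__main__":` guard runs it as a script.
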

then we can just run the main function like so let's save that can I format

with black in here I don't know do I save that great

so let's give this a go let's run this now and check and see if we've got

anywhere so I'm going to hit run hopefully this is going to load up the

browser as you'll see it happens on the right hand side we're going to go to

that Target URL which is that website that I put in here and it's loading up

and we're going to get back a load of URLs that this page is now making

requests to like so now this didn't close because I need to add that in but

you can see we now have all of these URLs now that's everything that the

network every Network request that's happened when that page was loaded up a

request has been made to one or many or all of these URLs rather so this is

really interesting and we can actually look through this um

you'll find some things more interesting than others probably the ones that

you're going to like the most are ones like this where you can see we have this

full URL for the API search and then this product identifier this is really

what you're looking for and this is going to give you a good idea of how you

can actually get the data from this website so I think that this is a pretty

handy way of looking at it what I'm going to do now is I'm going to add in

my driver.close because we want to

make sure that this browser closes when we are done another thing that I do like

to do because we're looking at URLs is maybe have a list of keywords like

perhaps we want to have products or maybe even you want to put in API might

be a better option so we want to know if there's an API coming back and sometimes

the API might have something like V1 in it or or whatever you'll use keywords

depending on your knowledge of the target site and what you've sort of

decided you want to do or just general knowledge overall I tend to use or have

been using just api but what we'll do is we will then have a look and check these

URLs so we'll do for kw in keywords if kw for our

keyword in url print the URL like so
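
A minimal sketch of that keyword check, assuming the list of dictionaries with a url key built earlier. The function name filter_urls and the sample URLs are made up for illustration.

```python
def filter_urls(urls, keywords):
    """Return the URL strings that contain at least one of the keywords."""
    matches = []
    for entry in urls:
        # each entry is a dict like {"url": "..."}, so search the string under "url"
        if any(kw in entry["url"] for kw in keywords):
            matches.append(entry["url"])
    return matches


# made-up sample data in the same shape as the list of request URLs
urls = [
    {"url": "https://shop.example/api/search?q=shoes"},
    {"url": "https://shop.example/static/app.js"},
    {"url": "https://shop.example/v1/products/123"},
]
for url in filter_urls(urls, ["api", "v1"]):
    print(url)
```

Using any() means a URL that matches two keywords is only listed once, which a plain nested loop with print would not guarantee.
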

of course I need to reference the dictionary key here because otherwise

it's not going to know where to look we're searching within the dictionary rather

than the value so we want to look for the keyword in the url value so this should give

us now the list okay so there we go that's a bit better so now we have a

list of more condensed URLs that have the API keyword in them and this is a

pretty good start it gives us a good idea of what's going on but we can do

more because we can actually then interrogate the actual API response

which is obviously going to be Json so we've got a good opportunity to actually

just grab the data there and then that we might want so I'm going to create a

new function and I'm going to call this one show

response and we'll say driver again and we want the target URL

and we'll need to do the same thing here I'm just going to grab

this and we'll paste him in here now we'll say our responses is

going to be equal to our blank list and we want to now look at how we handle the

encoding so I'm going to say for request in

driver.requests we need to access the request because we need the response

from the request we're going to need to do a try and except now this is a bit

messy I'm not really sure what the best way to handle this is so if you know a

better way stick it down in the comments below so we can all benefit I want to

say our data is going to be equal to decode_sw and within here we need to

pass in a couple of bits of information the first one is going to be

request.response.body because we want to decode the response body we also

want request.response.headers.get this is going to

basically get the information it's going

to understand the headers that are coming back and we want Content-Encoding

this is all from the documentation for selenium wire and identity like this
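
To make the two arguments concrete: the first is the raw response bytes, and the second is the compression named in the Content-Encoding header, with identity (no compression) as the fallback when the header is missing. The sketch below mimics that lookup with the standard library only. Note that decode_body is a hypothetical stand-in that handles just gzip and identity, not selenium wire's own decode, and the sample headers and body are made up.

```python
import gzip


def decode_body(body, encoding):
    """Hypothetical stand-in for selenium wire's decode: gzip and identity only."""
    if encoding == "identity":
        return body  # nothing to undo
    if encoding == "gzip":
        return gzip.decompress(body)
    raise ValueError(f"unsupported encoding: {encoding}")


# made-up responses: one compressed, one without a Content-Encoding header
raw = b'{"items": []}'
compressed = {"headers": {"Content-Encoding": "gzip"}, "body": gzip.compress(raw)}
plain = {"headers": {}, "body": raw}

for response in (compressed, plain):
    # same lookup as described above: fall back to "identity" if the header is absent
    encoding = response["headers"].get("Content-Encoding", "identity")
    print(encoding, decode_body(response["body"], encoding))
```
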

then what we want to do is we want to say

response is equal to json.loads because we want this to be Json information if

it's not Json data we're not interested so we're just going to discard

everything else and then we want data dot decode and this is why I said at the

beginning we import selenium wire's decode as decode_sw because we are now

accessing Python's decode and we want to say this has got to be utf-8 this is

going to give us the actual information that we want so if this is valid if this

works inside our try block I'm going to do responses dot append the response

that we got back just here and if it doesn't I'm going to do that thing that

you probably don't want to do I'm just going to straight up ignore those errors

because I don't care we want to then return out here

responses like this so now we have a nice neat

list of only the things only the responses back from the back end to

the front end that are Json encodable that's the information that we want like

I said we're going to discard everything else so now what we can do is I'm going

to say that our responses are going to be equal to

our show response again driver and the target URL and then we can actually save

this data now you'll notice here that I am actually loading the page up twice

and this is intentional because my idea going forward with this is I will have

some kind of uh argparse or maybe even go the full route of Click and we'll be

able to choose whether you want to see just the URLs or the responses or both

so I've got them separated like this for the moment also means you can choose as

well which ones you want whether you want the responses or just the URLs so

we are going to load the page twice I don't see that being a massive issue so
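
The response function described above might be sketched as follows. The optional decode_body parameter is an addition for illustration so the function can be tried without a browser; left at its default, it falls back to selenium wire's decode. Catching Exception is a slightly narrower take on simply ignoring every error.

```python
import json


def show_response(driver, target_url, decode_body=None):
    """Load target_url and return every response body that parses as JSON."""
    if decode_body is None:
        # default to selenium wire's decoder
        from seleniumwire.utils import decode as decode_body

    driver.get(target_url)
    responses = []
    for request in driver.requests:
        try:
            data = decode_body(
                request.response.body,
                request.response.headers.get("Content-Encoding", "identity"),
            )
            # bytes -> str with Python's own decode, then str -> Python objects
            responses.append(json.loads(data.decode("utf-8")))
        except Exception:
            # no response, not UTF-8, or not JSON: skip it
            continue
    return responses
```

Because this function calls driver.get itself, running it after the URL function loads the page a second time, as noted above.
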

underneath this so we do get the URLs I'm going to do with open and we're

going to save these responses in to a Json file because there's a potential

there's going to be a lot of them and there could be a lot of data so it's

definitely worth saving so I'm just going to call this data.json W and we

want to do as f and we want to do json.dumps of our

responses into our file there and let's give ourselves a little space there now
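
A minimal sketch of the save step. The helper name save_responses is made up for illustration. It uses json.dump, which writes to an open file, whereas json.dumps only returns a string.

```python
import json


def save_responses(responses, path="data.json"):
    """Write the collected responses to a JSON file."""
    with open(path, "w") as f:
        # json.dump serialises straight into the file object
        json.dump(responses, f)
```
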

so if we go back to the top we have selenium wire which we're using we have

our first function which gets the URLs that are being

requested then our responses so when we open this page you'll get all of the

information back nice and neat that you can just see and interrogate as opposed

to having to load it up in your browser and have a long look around through the

network Tab and see what's going on now this doesn't entirely replace that but

is a good start and I think this can be improved and built upon too so let's run

and we should get our data Json file out and also our print of URLs that are

requested with that keyword in that we've chosen in this case API so you'll

see the page does load twice as explained earlier I'm okay with that for

the moment and we've made a mistake and this needs to be driver.requests not driver.request

otherwise we're going to get that error that we just saw here which means you

can't do it because it doesn't exist so this should work now this time one more

small error json.dump writes to a file json.dumps returns a string third time lucky maybe okay so that

finished and we do have a data.json file so let's open that up I think I can

format document there we go so now we can see all of the Json information that

came back and we have this items here so this could be interesting for us to look

at and find out more about there's a product URL all sorts of information we

could scan through this and have a look and see what information is available

using this method to scrape data and of course this is my preferred method if we

can do it and this tool that I've just shown you hopefully will help you know

whether you can or cannot use this method or whether you need to take a

different approach so hopefully you've enjoyed this video

and got some value from it I have a patreon which I'll link down below

there's a free tier check that out and also like and subscribe really helps me

out I hope you've enjoyed this video cheers see you in the next one