So I'm going to show you a tool that I wrote for myself that loads up the page
using Selenium Wire and checks all of the network requests and responses that
it makes, so we can easily find the JSON data that's in the backend API. We
don't have to keep loading the page and looking to see what's going on: we can
give it a URL, it will load it up, and we get a nice list of URLs out, and
also the responses saved to a file for us to interrogate and work out what
we're doing. I like building tools like this. They make your life so much
easier. Hopefully
you like this one too. So we're going to be using Selenium Wire, which is an
extension to Selenium that adds to it, so you'll need to make sure you pip
install that (the package is selenium-wire). Then we're going to go ahead and
do: from seleniumwire we're going to import webdriver. We also need some of the
utilities, so from seleniumwire.utils we need to import decode. Now, I'm going
to import decode as decode_sw, because we are going to use Python's normal
decode as well. I'm also going to import json, because we will need that later
on. What Selenium Wire does is load up the page and then capture all of the
network activity that the website is doing. We want to be able to see that and
we want to be able to intercept it, so we can create a few functions
first. I'm going to call this one show_request_urls. What this is going to do
is just return us the URLs that the site has made requests to externally. This
is where we can easily find the API. Here we need to give it the driver, which
I'm going to cover in just a second, and also a target URL so it knows what to
load up. Within this, I'm going to do driver.get, which is basically going to
the page, and we'll pass it the target URL. I'm going to create a blank list
here called urls so we can add them to it. From here we basically just want to
interrogate the requests. We do this using driver.requests, so we'll do for
request in driver.requests and we'll just append to our list with urls.append.
I'm going to make this a dictionary with the key "url" and the value
request.url. So this is the first part of Selenium Wire: it gives us access to
the requests through driver.requests, and also the responses, which we'll do in
a separate function. I'm just going to return the URLs from this function, and
we'll create a new function which will be our main.
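A minimal sketch of the function just described. It only assumes the driver has a get method and a requests list whose items carry a url attribute, which is what Selenium Wire's webdriver exposes; the docstring and comments are additions for illustration.

```python
def show_request_urls(driver, target_url):
    """Load target_url in the browser and return every URL it requested."""
    driver.get(target_url)  # navigate; Selenium Wire records the traffic
    urls = []
    for request in driver.requests:
        # One small dict per captured request, keyed by "url"
        urls.append({"url": request.url})
    return urls
```

Because the function only touches driver.get and driver.requests, it can be tried out with any stand-in object that provides those two things before a real browser is involved.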
This is where we're going to run everything. Here we need to actually
initialize our webdriver, so I'm going to say driver is equal to
webdriver.Firefox. You can use whichever one you like that you have installed.
I like Firefox, massive Firefox fanboy. We need to add in some Selenium Wire
options here as a dictionary, because when we get the response back it's going
to be encoded, or it's going to be bytes, and we want to make sure that it
doesn't do any extra encoding. So we'll do seleniumwire_options with
disable_encoding set to True, and this needs to be a dictionary here. Now that
we have this driver, we can use it within our show_request_urls to actually
open the web browser and load the page up. So I'm going to say that our target,
I will just call this url, is going to be equal to... I'm just going to grab it
from over here. We'll use this website here as a good example. So
now we're going to say that our urls is equal to... actually, I'm going to
change this, because that is going to be a bit confusing: target_url. Okay, so
now we'll have our urls, which is what's coming back out of this function: it's
going to be equal to show_request_urls, and we'll pass in the driver which
we've created and also the target URL here, like so. Then let's just run
through these URLs and print them out: for url in urls, print the url out. And
then let's make sure we run this main function, so if __name__ is equal to
"__main__", then we can just run the main function, like so. Let's save that.
Can I format with Black in here? I don't know. Do I save that? Great.
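Put together, the script at this point might look roughly like the sketch below. Two things here are assumptions rather than what's on screen: the target URL is read from the command line instead of being hard-coded (the site used in the video isn't named here), and the filename find_api.py in the usage message is made up. The seleniumwire import sits inside main purely so the helper function can be reused on a machine without Selenium Wire installed.

```python
import sys


def show_request_urls(driver, target_url):
    """Load target_url and return the URLs the page requested."""
    driver.get(target_url)
    return [{"url": request.url} for request in driver.requests]


def main(target_url):
    # Imported here (an illustrative choice) so show_request_urls can be
    # imported and tested without Selenium Wire or a browser installed.
    from seleniumwire import webdriver

    # disable_encoding asks servers to send responses uncompressed
    driver = webdriver.Firefox(seleniumwire_options={"disable_encoding": True})
    urls = show_request_urls(driver, target_url)
    for url in urls:
        print(url)


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        # find_api.py is a hypothetical filename
        print("usage: python find_api.py <target-url>")
```

Firefox here needs geckodriver available; any other webdriver class from seleniumwire that you have installed can be swapped in the same way.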
So let's give this a go. Let's run this now and see if we've got anywhere. I'm
going to hit run. Hopefully this is going to load up the browser, as you'll see
happen on the right-hand side. We're going to go to that target URL, which is
the website that I put in here, and it's loading up, and we're going to get
back a load of URLs that this page is now making requests to, like so. Now,
this didn't close, because I need to add that in, but you can see we now have
all of these URLs. That's every network request that's happened: when that page
was loaded up, a request has been made to all of these URLs. So this is really
interesting, and we can actually look through this.
You'll find some things more interesting than others. Probably the ones that
you're going to like the most are ones like this, where you can see we have
this full URL for the API search and then this product identifier. This is
really what you're looking for, and it's going to give you a good idea of how
you can actually get the data from this website. So I think that this is a
pretty handy way of looking at it. What I'm going to do now is add in my
driver.close, because we want to make sure that this browser closes when we are
done. Another thing that I like to do, because we're looking at URLs, is have a
list of keywords. Perhaps we want to have "products", or "api" might be a
better option, so we know if there's an API coming back. Sometimes the API
might have something like "v1" in it, or whatever. You'll use keywords
depending on your knowledge of the target site and what you've decided you want
to do, or just general knowledge overall. I tend to use just "api". What we'll
do is then have a look and check these URLs, so we'll do for kw in keywords, if
kw, for our keyword, is in url, print the url, like so.
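A sketch of the keyword check, pulled out into a standalone function so it can be run without a browser. The name filter_urls and the sample URLs are made up for illustration. Each captured entry is a dictionary, so the test has to read the "url" value; checking the keyword against the dictionary itself would only compare it with the keys.

```python
def filter_urls(urls, keywords):
    """Return the captured URLs that contain at least one keyword."""
    matches = []
    for url in urls:
        # url is a dict like {"url": "..."}; search inside its "url" value
        if any(kw in url["url"] for kw in keywords):
            matches.append(url["url"])
    return matches


keywords = ["api"]
urls = [
    {"url": "https://example.com/static/app.js"},
    {"url": "https://example.com/api/search?q=shoes"},
]
for match in filter_urls(urls, keywords):
    print(match)  # prints https://example.com/api/search?q=shoes
```

Using any() means a URL that matches several keywords is listed once, whereas a plain nested loop would print it once per matching keyword.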
Of course, I need to reference the dictionary key here, because otherwise it's
not going to know where to look: we'd be searching within the dictionary's keys
rather than the value, so we want to look for the keyword in the "url" value.
So this should give us the list now. Okay, so there we go, that's a bit better.
Now we have a more condensed list of URLs that have the "api" keyword in them,
and this is a pretty good start. It gives us a good idea of what's going on,
but we can do more, because we can actually interrogate the API response
itself, which is obviously going to be JSON. So we've got a good opportunity to
just grab the data that we might want there and then. I'm going to create a new
function and I'm going to call this one show_response, and we'll pass in driver
again, and we want the target URL, target_url, and we'll need to do the same
thing here, so I'm just going to grab this and paste it in here. Now we'll say
our responses is going to be equal to a blank list, and we want to now look at
how we handle the encoding. So I'm going to say for request in driver.requests.
We need to access the request because we need the response from the request.
We're going to need to do a try and except. Now, this is a bit messy. I'm not
really sure what the best way to handle this is, so if you know a better way,
stick it down in the comments below so we can all benefit. I want to say our
data is going to be equal to decode_sw, and within here we need to pass in a
couple of bits of information. The first one is going to be
request.response.body, because we want to decode the response body. We also
want request.response.headers.get, which is going to read the headers that are
coming back, and we want "Content-Encoding" and "identity", like this. This is
all from the documentation for Selenium Wire.
Then what we want to do is say response is equal to json.loads, because we want
this to be JSON information. If it's not JSON data we're not interested, so
we're just going to discard everything else. Inside that we want data.decode,
and this is why I said at the beginning we import Selenium Wire's decode as
decode_sw, because we are now accessing Python's decode, and we want to say
this has got to be UTF-8. This is going to give us the actual information that
we want. So if this is valid, if this works inside our try block, I'm going to
do responses.append with the response that we got back just here, and if it
doesn't, I'm going to do that thing that you probably don't want to do: I'm
just going to straight up ignore those errors, because I don't care. We then
want to return responses out here, like this. So now we have a nice neat list
of only the responses coming back from the backend to the front end that parse
as JSON. That's the information that we want. Like I said, we're going to
discard everything else. So now what we can do is, I'm going
to say that our responses are going to be equal to our show_response, again
with driver and the target URL, and then we can actually save this data. Now,
you'll notice here that I am actually loading the page up twice, and this is
intentional, because my idea going forward with this is I will have some kind
of argparse, or maybe even go the full route of Click, and we'll be able to
choose whether you want to see just the URLs or the responses or both. So I've
got them separated like this for the moment. It also means you can choose which
ones you want, whether you want the responses or just the URLs. So we are going
to load the page twice. I don't see that being a massive issue. Underneath
this, where we get the URLs, I'm going to do with open, and we're going to save
these responses into a JSON file, because there's potentially going to be a lot
of them and there could be a lot of data, so it's definitely worth saving. I'm
just going to call this data.json, with "w", and we want to do as f, and we
want to do json.dumps of our responses into our file there. And let's give
ourselves a little space there.
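A sketch of the response-collecting function and the save step, under a few stated assumptions. The decode_body parameter is not in the video: it's added here so the function can be exercised with a stand-in decoder, and it falls back to Selenium Wire's decode (the one aliased as decode_sw) when omitted. save_responses is a hypothetical helper name. The video ignores every error inside the try block; this version only catches ValueError, which covers both invalid JSON and invalid UTF-8, and skips requests that never received a response.

```python
import json


def show_response(driver, target_url, decode_body=None):
    """Load target_url and return every response body that parses as JSON."""
    if decode_body is None:
        # Selenium Wire's decode helper (imported as decode_sw in the video)
        from seleniumwire.utils import decode as decode_body

    driver.get(target_url)
    responses = []
    for request in driver.requests:
        if request.response is None:  # no response was captured
            continue
        try:
            data = decode_body(
                request.response.body,
                request.response.headers.get("Content-Encoding", "identity"),
            )
            # bytes -> str with Python's own decode, then parse as JSON
            responses.append(json.loads(data.decode("utf-8")))
        except ValueError:
            # Not JSON, or not valid UTF-8: discard it
            continue
    return responses


def save_responses(responses, path="data.json"):
    """Write the collected JSON payloads to a file."""
    with open(path, "w") as f:
        json.dump(responses, f)  # dump writes to a file; dumps returns a str
```

Note the file is written with json.dump rather than json.dumps, since the target is an open file object and not a string.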
So if we go back to the top: we have Selenium Wire, which we're using; we have
our first function, which gets the URLs that are being requested; and then our
responses function. So when we open this page you'll get all of the information
back, nice and neat, that you can just see and interrogate, as opposed to
having to load it up in your browser and have a long look around through the
Network tab to see what's going on. Now, this doesn't entirely replace that,
but it is a good start, and I think this can be improved and built upon too. So
let's run it, and we should get our data.json file out and also our print of
the URLs that are requested with the keyword in that we've chosen, in this case
"api". You'll see the page does load twice, as explained earlier. I'm okay with
that for the moment. And we've made a mistake: this needs to be requests, not
request, otherwise we're going to get that error that we just saw here, which
means you can't do it because the attribute doesn't exist. So this should work
now, this time. One more small error: json.dump writes to a file, json.dumps
returns a string. Third time lucky, maybe. Okay, so that
finished, and we do have a data.json file, so let's open that up. I think I can
format the document. There we go. So now we can see all of the JSON information
that came back, and we have this items key here, so this could be interesting
for us to look at and find out more about. There's a product URL, all sorts of
information. We could scan through this and see what information is available
using this method to scrape data. Of course, this is my preferred method if we
can do it, and this tool that I've just shown you will hopefully help you know
whether you can or cannot use this method, or whether you need to take a
different approach. So hopefully you've enjoyed this video and got some value
from it. I have a Patreon, which I'll link down below. There's a free tier, so
check that out, and also like and subscribe, it really helps me out. Cheers,
see you in the next one.