While scraping like this, using Beautiful Soup or some other HTML parser to pull out the data you want, is a method that does work on some websites, it's outdated and just doesn't work on modern websites. What I want to do in this video is show you the method I think you should learn first when you're starting out with web scraping, rather than spending a load of time trying to figure out how to parse HTML that may or may not be there, and I'll show you what I mean in a second.

So if we come over to this website and look at it: this is your typical modern e-commerce website, and if we scroll down we've got loads of fancy moving pictures, all sorts of stuff flashing up. There's just no way you'd be able to parse this information out of the HTML; it's going to have loads of dynamic classes and everything like that. But fortunately for us, there is a much easier way.
I'm just going to open the inspector, go over to the Network tab, and refresh this page. Oh, it's already popped a load up; here we have this wallet/all request, so let's just kill this and refresh again. What we want to look for is the request that the front end, the front end we were just looking at that gives us all the images and all the information, makes to its API to actually get that data. We can then just mimic that request ourselves and get that JSON data back: all the information will be there, everything we could possibly need, all in a nice structured way, and it takes a lot fewer calls, because we don't have to go to product pages and so on.
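As a rough sketch of the idea, with a hypothetical endpoint standing in for the real request URL you'd copy from the Network tab, it looks something like this:

```python
import requests  # any HTTP client works for the basic idea

# Hypothetical endpoint -- in practice, copy the real request URL
# from the browser's Network tab.
API_URL = "https://example.com/api/products"

response = requests.get(API_URL, params={"page": 1}, timeout=30)
response.raise_for_status()

# The API returns structured JSON (here, assumed to be a list of
# products), so there is no HTML to parse at all.
for product in response.json():
    print(product)
```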
Using an HTTP client that offers a solid TLS fingerprint is a great step towards unlocking sites, but when it comes to scaling up, you need to use proxies. So I use ProxyScrape, who are kind enough to sponsor this video. We get access to high-quality, secure, fast and ethically sourced proxies covering residential, datacenter and mobile, with rotating and sticky session options. There are 10 million plus proxies in the pool, all available with unlimited concurrent sessions from countries all over the globe, enabling us to scrape quickly and efficiently. I use a variety of proxies depending on the situation, but I recommend you start out with residential ones; just make sure you select countries that are appropriate to the sites you're trying to scrape, and match your own country where possible. To be fair, I've had great success with their mobile proxies too, and although I'm not using them in this project, I've used them to great effect before. Either way, it's still only one line of code to add to your project, and then you can let ProxyScrape handle the rest. Any traffic you purchase is yours to use whenever you need, as it doesn't ever expire. So if this all sounds good to you, go ahead and check out ProxyScrape at the link in the description below. Now let's get back to our project.
So I found this one here. This is the request URL; you can see api, wallet, finder, query, and it's even got a page number in there for us. And if we go to the response, if I go to Preview, this has all the information in it. I believe somewhere down here there's pricing as well; I've seen that. If I go to the response, I think it's here. Yeah, look: regional pricing, everything, loads and loads of information. This is everything they have on this product. So if you were trying to do some kind of market research, or maybe you were looking at selling similar products and wanted to know what's out there, you could easily get all this information and track it.
So we want to mimic this. The first thing I will always do is copy the URL from the headers, the request URL, and just paste it into my browser and hit enter. Assuming you get back the information you're after, you generally know it's not going to be too difficult. If you find some issues with this, well, it's not always this straightforward, so you might have to do a little bit more to get there, but you'd be surprised how often this just works. So I'm going to copy this again.
Now I'm going to open up my terminal down here, run curl, and paste the URL in, and look: we got all the information back. Let's pipe it into jq so it's easier to read. There it is. Now, this one is particularly interesting because there's nothing to stop us; I've made a plain curl request and just got this information, so we don't have to worry about anything. And this is all publicly available data we're pulling: everything comes from the site itself, and anything we could find would be on there anyway, so there are no legal issues here. In fact, to pull this information we're going to make something like 20 requests, which is nothing, and we don't even need to make them quickly.
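If you'd rather do that same sanity check from Python instead of curl, a minimal equivalent (with a placeholder URL) looks something like this:

```python
import json

import requests

# Placeholder -- use the request URL copied from the Headers tab.
url = "https://example.com/api/wallets?page=1"

data = requests.get(url, timeout=30).json()
print(json.dumps(data, indent=2)[:1000])  # peek at the structure, like jq
```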
So before I build out a scraper for this, I want to show you another site. I'll make this smaller, move it over to that screen, and close this one out. This is another example, same sort of thing, again working along that line of: maybe you're looking at getting into this market and you're doing some research. So I'm going to go to Inspect, and I'm going to leave it on Fetch/XHR; I need to do that for this actual page, as I had the other one open still. There we go. Now, if I go to Network and refresh this page... ah, we didn't find anything we were looking for; none of this is particularly useful, it's rubbish, basically. What you want to do is just click around, move around pages, do all sorts of things, and see what comes up. So I'm just going to hit next page, and there we go, we got this one here.
So let's make this bigger. We've got this long URL here, which is basically just telling the API what information it wants, and here's the preview of the response: you can see we got products. The actual JSON response here has the products in, so you can see we have products, and position one here is all the information: product IDs, the price somewhere in here, all sorts of things. Everything is here, because this is the information the back end sent to hydrate the front end and put it on the page so you can see it. So let's go ahead and make a simple example out of this.
We'll do this one, and maybe this one as well. In fact, let's first have a quick check: what we normally do is grab this URL and paste it into the browser; same thing, it all works. So let's go back over to our terminal: curl, paste the URL in, move this over here, and pipe it into jq, which will format the JSON a bit more neatly, assuming it works. There we go, everything's all here, nice and easy.
Right, so this is what I'm trying to say: this should be the first method you learn when you're learning how to web scrape. All these tutorials that are out there are too old; they all tell you to make a request and then parse the HTML, or to use some kind of browser. Now, both of those methods still work, and you will find websites that are server-side rendered, so the server sends HTML back to the front end; that's easy, you just fetch that HTML and parse it. But in most cases I wouldn't start there. I would always start by looking for this, especially if you're doing e-commerce, because there's so much product information that needs to go backwards and forwards between the server and the front end that it's easy to find, and it has to be done like this. It's all structured, because it needs to be; there are also schemas, which keep the structure consistent, and which I've covered in other videos. The most important thing is to check this first.
So let's build something out real quick for this one. I'll get rid of that; we need to come back here, open Inspect, go to Network, and if I just come back and refresh this page... it was down here somewhere, it had 'query' in it, I think. I can't see it now; can't see the wood for the trees, there are too many. There it is, cool.
So I'm just going to copy this URL, and let's create a new project. I'll go to my projects folder and give it some kind of cryptic name, so I'll never remember what it is and it'll get lost forever. I'm going to create a new virtual environment, and once that's done I'm going to activate it; that's just a shortcut I have for activating the virtual environment, so you might have to type the full thing out. Now I'm going to install a couple of different things.
I'm going to install rich, because rich just helps when I'm printing to the terminal, so we can all see it a bit more neatly. I'm also going to install tls-client, and we're also going to install pydantic, and I'll explain why I use these. tls-client is essentially built on top of requests, and what it does is send more browser-like information up with the request. With TLS fingerprinting, some websites, and the WAFs, the firewalls they use to block bots, especially the basic Cloudflare tier, can check from the TLS information you send whether the request came from a browser or not, and they just block everything that hasn't. So using something like tls-client with Python, or curl_cffi, or anything based on that kind of browser-impersonating client, is fine; they all send the right information and give you a better chance of not getting blocked. We probably don't need it in this case, but it's worth doing anyway, just so you know you're covered, and the API is all requests-like.
So let's just create a new Python file; I'm going to call it main.py, and then we're going to import tls_client. Let me make this bigger. Then, from rich, I'm going to import print. One thing I want to check, actually, before we write any more code, is the pagination. So I'm going to grab this URL again.
What we're going to do is change the page number, so, you know, we go up and we get more products. So what happens when we go to page 10? Still getting products, cool. Let's try 15: still going. OK, 20: right, a blank list. This is important, because you want to know what happens when you go to a page beyond the last one, so you know how to break out of your pagination loop. In this case, the response is raw JSON which will be interpreted as a list in Python, so I know that if the list has a length of zero, we can break. That's always a good thing to figure out before you start writing code. So let's put this URL in here before I forget.
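A quick way to see that behaviour for yourself, sketched with a placeholder URL and a plain HTTP client:

```python
import requests

# Placeholder -- the real URL has a page parameter in it.
BASE_URL = "https://example.com/api/search?query=foo&page="

for page in (1, 10, 15, 20):
    data = requests.get(f"{BASE_URL}{page}", timeout=30).json()
    print(page, len(data))  # a length of 0 means we're past the last page
```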
And now we want to build out our scraper, so let's create our session. What I normally do is copy it over from the tls-client GitHub page. OK, we don't need that any more; this is the Python tls-client I've been using, and we just want to copy this bit, so let's get that in there, like so. I don't know why, but the example is wrong: you need to change this chrome112 identifier to have an underscore in it, like chrome_112, and we want to bump the version up so it's a bit more consistent with what we're expecting. There we go, cool.
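The session setup ends up looking roughly like this; the snippet follows the tls-client README, and the exact identifier ("chrome_120" here) is my assumption, so check which identifiers your installed release supports:

```python
import tls_client

session = tls_client.Session(
    client_identifier="chrome_120",  # note the underscore in the identifier
    random_tls_extension_order=True,
)
```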
What I'm also going to do is put my proxy in here. I always scrape with a proxy these days; there's just no point in not doing it, because if you work through a scraping project and figure out what you need to do on your own IP, and then try to use it with your proxies, you might hit other issues, so I just use them from the start. We don't need any extra headers. What I do need to do is import os, because I'm going to be pulling my proxy from an environment variable with os.getenv; I'm just going to call this one 'proxy'. So this is just me pulling the proxy string from my environment variable; if you don't do this, that's fine, you can paste your proxy string straight in here and it will work just as well. So that should be good.
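As a self-contained sketch of that step ("PROXY" is just the hypothetical environment-variable name I'm using here):

```python
import os

import tls_client

session = tls_client.Session(
    client_identifier="chrome_120",
    random_tls_extension_order=True,
)

# Attach the proxy to everything the session does; paste the proxy
# string in directly if you don't want to use an environment variable.
proxy = os.getenv("PROXY")
session.proxies.update({"http": proxy, "https": proxy})
```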
So let's change this to our URL, for example, and then just print the response.json(). I'm going to rename that res to response, so it's a bit clearer. Cool, let's save this and run: we got an empty list, and that's good, because we're on page 20. So let's put it back to page one and run again, and there's the information we were after.
Now, the reason I installed pydantic is that I want the easiest and most convenient way of putting this data into something I can move around my program with dot notation, with the option to put it back into JSON, or into anything else I'd want to export it to, maybe my database or a different application. That's why I always tend to use pydantic. To create the models, I'm going to use JSON to Pydantic, a website that does it all for me: I can just paste my JSON in, and it gives me the pydantic models out. What I tend to do is go through the output and have a quick look at what I actually want, so I don't copy everything I don't need. I think I just want this, because I want the regional pricing, so I'm just going to copy this section, paste it in, and it will create the models we need to dump this JSON into. Now, obviously we're going to have more information coming in than is going into these models, but pydantic will just ignore any information that isn't in our model, which is exactly what we want: our data will fit straight into our models. So you've got the region pricing, the metadata and the base model, and this is exactly what I wanted. I'm going to copy this out and come back to my code, which I've put over here, create a models.py, and paste it in. I'll create this and change the name to ItemModel.
There we go. Now we can see we have this all imported, with the information we need. And if you get a load of stuff here and maybe you don't want some of it, what I normally do is just comment out the parts I don't want. I just get rid of these if I don't want any of that information, and all that means is it's not part of the model, so it will be ignored. Obviously this is just a bit of a rough start, but it will get you to where you want to be.
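For reference, models.py ends up with a shape something like this; the field names here are purely illustrative, since the real ones come from pasting your actual JSON into the generator:

```python
# models.py
from typing import List, Optional

from pydantic import BaseModel

class RegionPricing(BaseModel):
    region: Optional[str] = None
    price: Optional[int] = None  # Optional: some regions have no price set

class Metadata(BaseModel):
    product_id: Optional[str] = None

class ItemModel(BaseModel):
    name: Optional[str] = None
    metadata: Optional[Metadata] = None
    region_pricing: Optional[List[RegionPricing]] = None
    # Fields you don't want can simply be commented out; anything not in
    # the model is ignored when the JSON is unpacked into it.
```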
So I'm going to import my ItemModel into main.py: from models import ItemModel, and I need to refresh my editor; there we go. Now, instead of printing just the response, I'm going to print an ItemModel, unpacking the response into it. And if I run main... actually, no, that's not going to work, is it? Because I've got a list; I need to loop through the list and put the items in one at a time. So let's get rid of that. What we want instead is for item in response.json(); let me make this a bit bigger and put it in the middle. Then we can say our product is going to be an ItemModel instance, unpacking whatever's in that item into it, and then I'll just print the product.
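In other words, something like this (assuming the response and ItemModel from the steps above):

```python
# `response` is the session.get() result and ItemModel comes from
# models.py, both from the steps above.
for item in response.json():
    product = ItemModel(**item)  # unpack one JSON object into the model
    print(product)
```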
We'll save this and run it, and we should now get actual pydantic models back. And this is where you hit some of the issues, which are validation errors. There are a few different ways you can handle this: because some of the prices, as we saw, don't validate, because they don't exist, we just need to change those fields to be optional. So I'm going to create a load of cursors and mark each one Optional[int] or None, and that should solve the problem for us. Now, if we run this again, it should just be None where the price doesn't exist, and there we go: nice models with the information we wanted.
That's exactly where we wanted to be. Now, you'll notice that for some of these, like the content, my model just says it's a dictionary, and that was fine: I didn't want to create pydantic models for all of them, because I didn't really feel the need. I was probably going to remove the content list anyway; it has the pricing in it again, for some reason, I don't know why, so I'm just going to remove it for now. You'd just want to check your data and decide whether it's useful for you or not. So we'll leave it like this for the moment, and let's come back to our main.py file.
What I'm going to do now is tidy this up. Basically, this is how we're getting the information, and we're putting it into our pydantic model so we can do something with it; for example, it makes it so much easier to do something like product.name, which will just give us the name of every product. So, to tidy this up: the first thing I'm going to do is have a create_session function, and that's basically going to be all of this here; it will create my session. I'll copy this in and put it up here. There we go; we don't need this here now,
because we're going to return the session from this function. I like to do this because it keeps everything together: when you call the function, you get back the session you actually need, with everything on it. From here I'm also going to do session.proxies.update(). What sort of object does this need? OK, it needs a dictionary, so this is just an easier way to handle it. Let's go back to our main file. So, in this dictionary, 'http' is going to be equal to os.getenv() for our proxy, and we also want the 'https' key, which is the same thing, the same proxy. Cool, and then we can return the session, so we don't need the old setup any more.
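So create_session ends up looking something like this (again, "chrome_120" and the "PROXY" variable name are my assumptions):

```python
import os

import tls_client

def create_session() -> tls_client.Session:
    # Build the session once, with the TLS fingerprint and proxies attached.
    session = tls_client.Session(
        client_identifier="chrome_120",
        random_tls_extension_order=True,
    )
    session.proxies.update({
        "http": os.getenv("PROXY"),
        "https": os.getenv("PROXY"),
    })
    return session
```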
Now we want a function that makes an API request. This is going to take in a session first, then a URL. I'm going to put a type hint in here, actually, tls_client.Session, just so we have type hinting. So I can say that my response is going to be equal to session.get() on the URL, and we'll just return that out. Let's put in a little bit of error handling, maybe: if response.status_code does not equal 200, we'll raise an exception here, just a generic exception, 'bad status code'. That will give us an idea, if it goes wrong somewhere, of why it's going wrong; there are probably better ways to do this, but I like to put something in there just in case. Then we'll just return the response.json() out of here.
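That function, roughly:

```python
import tls_client

def api_request(session: tls_client.Session, url: str):
    response = session.get(url)
    if response.status_code != 200:
        # Crude error handling, but it flags where and why a run failed.
        raise Exception(f"bad status code: {response.status_code}")
    return response.json()
```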
So we don't need the old code any more, and we need to create our main function to tie all this together. I'm going to keep my URL there, and I'm going to say my session is equal to the create_session function, so we know that this session has everything we need: all the TLS fingerprinting and the proxies we're going to use. Then we'll do for page in range, and I'm just going to do 1 to 25. Let's put our URL down here, because we want it inside this for loop, make it an f-string, and put the page number straight into the URL, like so. You can do this with the actual parameters through requests, or whatever HTTP client, if you want to; I'm just going to use an f-string for now, it's easier. Then our json_data is going to be equal to the api_request we want to make, with our session and our URL, which now has the right page in it. Then we do for item in json_data, and let's just print out ItemModel, our item model, unpacking the item JSON into it. There we go, cool. Right, we just need one more thing: our if __name__ == "__main__" guard, and then we can run our main function.
So I'm just going to run this, because there's invariably going to be something that I've typed wrong, so we might need to fix it... but no, it looks all right. It seems to be working: we can see the Nones coming up where there's no price for that region, which is pretty cool. So what I need to do now is add in the stop, like I said: when it showed us the empty list, when we hit a page that doesn't have any data in it. What we can do now is: if the length of our json_data is equal to zero, break. I'll also put a print statement in here that just says 'end of results'. I'm also going to tidy this up: we don't want to just print this, so I'm going to call it new_product, and print out new_product.name. Then we're going to add it to a global list; we'll say our output list, output, is a list, and we'll do output.append(new_product). Then, once the whole thing has run, we'll just print our output, and we'll print the length of it as well.
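Pulled together, the script ends up looking roughly like this; BASE_URL stands in for the real request URL with its page parameter, and create_session() and api_request() are the helpers sketched above:

```python
from rich import print

from models import ItemModel  # the pydantic model from models.py

# Placeholder for the real request URL copied from the Network tab.
BASE_URL = "https://example.com/api/search?query=foo&page="

def main():
    session = create_session()
    output = []
    for page in range(1, 25):
        json_data = api_request(session, f"{BASE_URL}{page}")
        if len(json_data) == 0:  # empty list -> past the last page
            print("end of results")
            break
        for item in json_data:
            new_product = ItemModel(**item)
            print(new_product.name)
            output.append(new_product)
    print(output)
    print(len(output))

if __name__ == "__main__":
    main()
```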
Cool, I'm going to clear this up and run it now. Hopefully we'll see all of the product names come up here as we go, and when we hit that 25th page, or the 20th page where the data runs out, we'll break out of our loop, and then we'll just have all the products stored in that list, ready for us to move somewhere else, put in a database, or anything else like that. But I really want you to understand that this is pretty much what your cookie-cutter web scraping project for modern sites is going to look like. Now, they're not all going to be this easy, that's a given, right... 'end of results', there we go, 271. They're not all going to be this easy: you're going to run into issues, you're going to need to make sure you get all of the headers and cookies that are required, and you might need to find a way to generate those cookies, but that is all very, very doable; you might need to use an undetected browser and get the cookies that way. But if the website works like this, where it makes this kind of API request, it is possible to scrape it like this. As I said, it's not always that easy, but I think this is a really good place for you to start learning how to web scrape, instead of spending a load of time trying to parse ghost HTML which just doesn't exist: just grab the JSON data instead.
So, hopefully you have enjoyed this video, and I haven't been waffling on for far too long without you learning anything, because, well, that would be bad. Anyway, if you've enjoyed this, hit like and don't forget to subscribe; that always helps me out. And if you want to see me do some more projects like this, you'll want to go here and look at this one, which is a little bit more advanced.