A large part of the work I do in scraping is e-commerce data: competitor analysis, product analysis, and all that. In this video I want to show you how I go about scraping almost every single site I come up against, especially ones like this. I've covered this before, but you absolutely don't want to be trying to pull out links and scrape the HTML; that's just not going to work. If you look over my head here, I'll make it a bit bigger, parsing this HTML is just not going to work. What we want to do instead is find the backend API that this site uses to hydrate the front end, to basically populate this data.
To find that, we open up our inspect tools here in Chrome and go to the Network tab; I'll try to make this a little bit bigger. Then we need to start interrogating the site. The first thing I pretty much always do is just scroll around and see what pops up. I'm going to click on Fetch/XHR, because it's the responses that are JSON that we're interested in. You can either move around, go to different categories, or click on a product; that will do just fine.
When you start to scale up projects like this one, you'll find that your requests start to get blocked, and that's where you need to start using high-quality proxies. I want to share with you the proxy provider that I use and the sponsor of this video, ProxyScrape. ProxyScrape gives us access to high-quality, secure, fast, and ethically sourced proxies covering residential, datacenter, and mobile, with rotating and sticky session options. There are over 10 million proxies in the pool, with unlimited concurrent sessions from countries all over the globe, enabling us to scrape quickly and efficiently. My go-to is either geo-targeted residential proxies, based on the location of the website, or the mobile proxies, as these are the best options for getting past anti-bot protection on sites, and with auto-rotation or sticky sessions it's a good first step to avoid being blocked. For the project we're working on today I'm going to use sticky sessions with residential proxies, holding on to a single IP for about three minutes. It's still only one line of code to add to your project, and then we can let ProxyScrape handle the rest. Also, any traffic you purchase is yours to use whenever you need, as it doesn't ever expire. So if this all sounds good to you, go ahead and check out ProxyScrape at the link in the description below. Let's get on with the video.
So, let's go ahead and look at what we've got here. Right away I can see a load of images and a load of JSON data. The one I'm interested in straight away says availability, and it has all the product availability: basically the stock numbers, the SKUs, and so on for this item. That's pretty handy and very relevant. The other one is right here, which is sort of the whole product data, everything that comes with it: we can see all the images and things like that, there's pricing information in here, metadata, and if I collapse these we can see everything coming up, including the pricing information. So this is essentially the data that I want.
Now, I've shown you all this before in other videos, and if this is new to you I'll cover everything you need to get started, but what I haven't done before is show you more of a full project, which is what I'm going to go through in a minute. The first thing I want to do, though, is understand the API, the endpoints, and what's happening. So I'm going to copy the request URL for this one, which is the product. We can see that this is essentially just their API, and by hitting it like this we do indeed get the JSON response for this data.
What that means is we could effectively take a different product. For example, let's see if I can grab the code for this one, put it on the end of the URL here, and we get that information. But how do we go about getting these product codes? There's another way we can do this, and I'm going to keep this one open. So now I've got the product link here, and I'm going to open the availability one as well, so we can have all three and take a look. Where is the availability... here. Again, the availability endpoint is basically very straightforward: I just paste this in here and we get the availability, and if I change the product code it gives us the availability for that product.
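To make that pattern concrete, here's a tiny sketch of the idea; the hostname, paths, and product codes below are placeholders I've made up, not the site's real endpoints:

```python
# Hypothetical endpoint shapes based on what we saw in the Network tab.
# The real hostname, paths, and codes will differ for whatever site you inspect.
PRODUCT_URL = "https://www.example.com/api/products/{product_id}"
AVAILABILITY_URL = "https://www.example.com/api/products/{product_id}/availability"

for product_id in ("ABC123", "XYZ789"):  # made-up product codes
    print(PRODUCT_URL.format(product_id=product_id))
    print(AVAILABILITY_URL.format(product_id=product_id))
```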
Now, to actually find the product IDs, how would you find them on the website? You could either go to a category, or you might want to search, and that's where I tend to start. So I might type something like "boots" into the search, with the Network tab open on this side, and here we go: 431 results. This is how I would typically look to get this information. If I come back over to the data I had, I need to scroll to the bottom; somewhere around here we're going to find a request. Actually, I wish it wouldn't show me all of these, so what I'm going to do is delete them all, since I had all the old ones, and search again so it comes up at the top. OK, so you can see it loading up all these products, because these are the products that have come from the search. This endpoint is actually slightly different and gives you different bits of information; we'll cover that. The one I'm looking for is the actual search one, here: the search query. There we go, I found it.
What this is, basically, is hitting the API endpoint with the search query that we gave it. Again, I can put this URL in here (ignoring this overlay that I wish would go away), and here is the response. I'm going to collapse a lot of this information and get rid of the parts we're not that interested in. What we are interested in, if I make this full screen and have a good look, is this: we have a viewSize, a viewSetSize, the count, which is 431, the whole of the search, the search term, and then the items at 48 per page, which was the view size. We also have the current set and, I believe... no, there should be another one: the start index, here we go. So what we can actually do is start to see whether any of these parameters are available for us to manipulate. If I change the start index to 10, what happens? OK, that wasn't the right one; startIndex didn't work, so I'm going to change it, because quite often it's just "start". OK, "start" is the start index. That's fine.
To find that out, you could try to guess it like that, but what you could also do is come back here and manually go to the next page with the developer tools open, and you would see it there. If we scroll down, somewhere along here: start is 48, and we can see it right there. So you can do everything that you would normally do on the page, keep an eye on the Network tab, and you'll see everything come through.
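As a sketch, paging is then just a matter of stepping the start value in the URL; the hostname is a placeholder, and the page count here is arbitrary:

```python
# Hypothetical search endpoint; "start" is the parameter name we just found.
SEARCH_URL = "https://www.example.com/api/search?query={query}&start={start}"

PAGE_SIZE = 48  # the viewSize we saw in the response
for start in range(0, 3 * PAGE_SIZE, PAGE_SIZE):  # first three pages
    print(SEARCH_URL.format(query="boots", start=start))
```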
So now that I know the start index works, we can start to put together something we can use to search. We want to start on index zero, I guess, and then go through the items. What we have in the items of the response, somewhere down here, is actually a lot of good information, and in some cases this is enough, but in a lot of cases you do want to go deeper into the product itself. We have a product ID; this product is some kind of kids' Superstar boot. So now we come back to our products endpoint, put this ID in, and here's the product: it's come back straight away with all this information, and the part I want to look at most is the pricing information, which has a discount and all this cool stuff right here. Then we can of course go to the availability endpoint, put the product code in, and here's the availability; this one has some stock.
So you can see that we're starting to work out how their API works. This isn't that difficult, especially if you've either worked with REST APIs or built them before, but my best advice, as I said, is just to click through the website. What I want to do now is take this and turn it into something we can repeat within our code. I'm going to get rid of this for the moment; I don't think I'll need it, and we can always come back to it. I've got my terminal open here in a new folder; let's make this a bit bigger.
I'm going to create a virtual environment, like so, and activate it. What I want to show you now is a couple of interesting things. I'm going to use curl: I'll take this endpoint that we know works in our browser, we can see it works there, paste it here, and we get denied. This is a curl error, and it's basically telling us we can't get this data like this. Well, let's try it with requests.
So let's import requests and do response = requests.get with the URL in there. You can see we're having issues here; we're not able to stream the data for whatever reason. So I'm going to change the headers (can I clear this up? we'll do it this way), because you always want a good user agent, right? I'll say our headers are equal to a dictionary with a User-Agent; let me just grab mine, this one will be fine, sanitize the paste, there we go. Cool. So now we'll import requests again, do response = requests.get with our URL and headers equal to the headers we just created, which is the user agent, and check response.status_code: it comes back 403.
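For reference, that failing attempt looks roughly like this; a sketch with a placeholder URL and an example user agent string, shown because the 403 is the instructive part:

```python
import requests

url = "https://www.example.com/api/products/ABC123"  # placeholder endpoint

# A "good" user agent alone is not enough here.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36"
}

response = requests.get(url, headers=headers)
print(response.status_code)  # the target site still answers 403
```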
Now, this is because of TLS fingerprinting. I'm going to cover this much more in depth in an upcoming video, so if you're interested in finding out why this really happens, what you can do to avoid it, and how everything works underneath the hood, subscribe for that video. But essentially, what we want to do is from curl_cffi import requests as creq (I was going to come out of the shell so I don't get any namespace issues, but actually I don't need to). curl_cffi is going to give us a more consistent fingerprint that looks like a real browser. So now I can go up to here, take the same URL, and instead of using actual requests, use the curl_cffi requests. I check the status code and got 403 again, because I forgot to pass impersonate; we can just put "chrome" in there, you don't have to put the version. And now response.status_code gives us our 200, and response.json() is all the data. So we basically just needed to get our fingerprint sorted to make the request. Notice I didn't need any cookies, I didn't need any headers, I didn't need anything other than what curl_cffi, or the other TLS fingerprint spoofers out there, there are a few, does for us, and as I said, I'll cover that in a following video.
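Here's the whole working call as a minimal sketch; the URL is a placeholder again, and impersonate="chrome" is what makes the TLS fingerprint look like a real browser:

```python
from curl_cffi import requests as creq  # pip install curl_cffi

url = "https://www.example.com/api/products/ABC123"  # placeholder endpoint

# No cookies or custom headers needed -- just a browser-like TLS fingerprint.
response = creq.get(url, impersonate="chrome")
print(response.status_code)  # 200
print(response.json())       # the full product payload as JSON
```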
So now that I know this is going to work, I'm going to go into my project; we need to activate the environment here. I'm going to pip3 install curl_cffi, the library we just used, plus rich, I always use rich for printing, and pydantic, because I want to get to a point where we have modeled the data a bit better. I'll install these; I think that should be enough for us in this instance. Then I'm going to touch main.py and open it here.
Now, I've imported everything that we're going to need, and I'm going to look at modeling my data a little more closely. I've done this already, but essentially I'm going to take the products endpoint and the search endpoint, so we can get that information. I haven't done the availability one, but you can add it on nice and easy now that you know the endpoint. So we're going to model this information: I'm basically just going to take what I want from the responses and create a pydantic model with it. The first one is the SearchItem, which has the product ID, the model ID, price, sale price, display name, and rating; that all comes from the search endpoint. Then the same thing with the SearchResponse, which means I can easily find out and manipulate the page, count, etc.: we can see the search term, the count of total items for that search, the start index I talked about earlier, and then the items, which is a list of SearchItems. Then I've modeled the ItemDetail, which is the information I was after before. I've just put the product description and the pricing information in as dictionaries rather than modeling them, because this data is quite dynamic; I found some products don't have all of this information, so it was easier to do it like this, and the same again with the product description. It's up to you, but basically what I'm saying is: model your data.
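Here's a sketch of what those models might look like; the JSON key names in the aliases are my guesses, so adjust them to match what your target site actually returns:

```python
from pydantic import BaseModel, Field

class SearchItem(BaseModel):
    # Field aliases are guesses at the search response's JSON keys.
    product_id: str = Field(alias="productId")
    model_id: str = Field(alias="modelId")
    price: float
    sale_price: float = Field(alias="salePrice")
    display_name: str = Field(alias="displayName")
    rating: float | None = None

class SearchResponse(BaseModel):
    search_term: str = Field(alias="searchTerm")
    count: int
    start_index: int = Field(alias="startIndex")
    items: list[SearchItem]

class ItemDetail(BaseModel):
    # These sub-objects vary a lot between products, so keep them as dicts.
    product_description: dict = Field(alias="productDescription")
    pricing_information: dict = Field(alias="pricingInformation")
```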
From here, I'm creating a new session. I made a function for this because initially I thought I might want to expand on this project and be able to import this new_session function into a different file or a different part of the project. All I'm doing is creating a session using requests.Session, and again this is curl_cffi, so we have impersonate here, and I'm also bringing in my proxy. I talked about sticky proxies earlier, and that's what I'm going to use here. It's not actually essential for this specific site, but there are sites that will match your fingerprint or your request to the IP address, and if that starts to differ, it starts to get flagged. That's a lot less common, though, so this should be fine.
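A sketch of that new_session helper, assuming curl_cffi's Session and with a placeholder proxy URL standing in for real sticky-session credentials:

```python
from curl_cffi import requests

# Placeholder sticky-session proxy URL -- swap in your real credentials.
PROXY = "http://user:pass@proxy.example.com:1234"

def new_session() -> requests.Session:
    # One session keeps one browser-like fingerprint and, with a sticky
    # proxy, one IP -- so the fingerprint and the IP stay matched up.
    session = requests.Session(impersonate="chrome")
    session.proxies = {"http": PROXY, "https": PROXY}
    return session
```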
Now I'm going to write a function that's going to query the search API. We need our session, which we've created, our query string, and our start number, and I've just put an f-string into the URL to handle that. Then I'm basically just going to get the data. We want something in there to handle a bad response, so I've put raise_for_status, which will throw an exception if we get anything that isn't a 200 response, basically letting me know if we're starting to get blocked. I'm not too fond of this, and I think there's probably a more elegant way of handling it, but it will work just fine for now. Then we take the response data and push it into our SearchResponse model. I'm unpacking it from the "raw" and "itemList" keys, which is essentially this piece here: raw, then into this one, then this one, and then unpacking everything that fits into my models, like so. Again, it's up to you how you model your data. Then I return the search, which is of type SearchResponse.
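Roughly, that search function looks like this; a sketch reusing the SearchResponse model from above, where the URL shape and the raw/itemList nesting are what this particular site returned, so treat them as assumptions:

```python
from curl_cffi import requests

# Placeholder URL shape; "query" and "start" were the parameter names we
# found in the Network tab, but verify them against your own target.
SEARCH_URL = "https://www.example.com/api/search?query={query}&start={start}"

def get_search(session: requests.Session, query: str, start: int) -> SearchResponse:
    resp = session.get(SEARCH_URL.format(query=query, start=start))
    resp.raise_for_status()  # raise on anything that isn't a 2xx, e.g. a block
    data = resp.json()
    # On this site the useful payload sat under raw -> itemList.
    return SearchResponse(**data["raw"]["itemList"])
```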
I'm going to do exactly the same now for the detail API; it's very, very similar. We put in item.product_id, and this is why I like to use models with my data: now I can clearly see in this function that it takes in a SearchItem, and we use its product ID in our URL rather than just having some arbitrary piece of data from a dictionary. I find this much, much easier to read. We raise_for_status again, and same thing: we push our response JSON into our ItemDetail model and return that out.
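As a sketch, on top of the models above and with the same placeholder endpoint shape:

```python
# Placeholder product endpoint shape, matching what we hit in the browser.
PRODUCT_URL = "https://www.example.com/api/products/{product_id}"

def get_item_detail(session: requests.Session, item: SearchItem) -> ItemDetail:
    # Accepting a SearchItem makes it obvious where the product ID comes from.
    resp = session.get(PRODUCT_URL.format(product_id=item.product_id))
    resp.raise_for_status()
    return ItemDetail(**resp.json())
```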
And here's our main function: we create a new session and put a search term in. Again, this is our session that we're giving it; the search query parameter, which I defined in the other function, is "hoodie", and the start index I put as 1, which should probably be zero, but you get the idea. Then I just loop through all of these items and print out the name of the product as we go.
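Tying the sketches above together, a minimal main could look like this; it reuses the hypothetical new_session, get_search, and get_item_detail from the earlier sketches:

```python
from rich import print  # rich's print pretty-prints models and dicts

def main() -> None:
    session = new_session()
    # Start at 0 rather than 1 so the first page isn't skipped.
    search = get_search(session, query="hoodie", start=0)
    for item in search.items:
        detail = get_item_detail(session, item)
        print(item.display_name)

if __name__ == "__main__":
    main()
```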
So I've got it to this point, and I wanted to show you up to here because this is the main part of getting the data, which is absolutely the hardest part of web scraping: understanding how the site's backend APIs work and then manipulating them slightly to get the information you're after. Once you've got that data, it's entirely up to you what you do with it; you could collect more here, and you'd probably want to add the availability, etc. So I'm going to save this, come over here, and run python main.py, and we should hopefully start to see some of the product names coming through. I've searched for "hoodie", and this is the information coming back. I'm just looping through the products that were on that first search page, 48 of them, querying their API as if I were a browser, like I showed you on this page, and pulling the data out.
This is the absolute best and easiest way to get data from websites like this. Website owners and site designers will find it very difficult to protect their backend API in a way that still lets their front end access it; just by the nature of it, this happens a lot. Now, it's not always going to be as easy as this, but you'd be surprised how often it is. The only thing I will say is that if you're going to do this, you're going to be able to pull a lot of data quite quickly, so always be considerate and don't hammer it; if you hammer it, you're probably going to get blocked, and they'll notice anyway. Pull just the data that you need; there's a small sketch of a polite delay below. This is all publicly available data: I'm not using any API keys, I'm not using anything I shouldn't, I'm just pulling it in the most convenient and easy fashion possible. So hopefully you've got the idea and you can mimic this in your own projects.
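On that "don't hammer it" point, a minimal sketch of a polite, randomized pause you could drop between requests; the interval is an arbitrary choice of mine, not something from the video:

```python
import random
import time

def polite_pause() -> None:
    # Sleep 1-3 seconds between requests; randomizing the interval looks
    # less robotic and keeps the request rate considerate.
    time.sleep(random.uniform(1.0, 3.0))
```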
If you've enjoyed this video, I'd really appreciate a like, comment, and subscribe; it makes a whole load of difference to me, it really does. Check out the Patreon, I always post stuff early on there, or consider joining the YouTube channel memberships down below as well. There's another video right here which, if you watch now, will continue my watch time across YouTube and they'll promote my channel more. Thanks, bye.