The BEST Web Scraping Method I Teach Beginners
2024-12-29
Check Out ProxyScrape here: https://proxyscrape.com/?ref=jhnwr
➡ JOIN MY MAILING LIST https://johnwr.com
➡ COMMUNITY https://discord.gg/C4J2uckpbR | https://www.patreon.com/johnwatsonrooney
➡ PROXIES https://proxyscrape.com/?ref=jhnwr
➡ HOSTING (Digital Ocean) https://m.do.co/c/c7c90f161ff6
If you are new, welcome. I'm John, a self-taught Python developer working in the web and data space. I specialize in data extraction and automation. If you like programming and web content as much as I do...
Subtitles

Scraping like this, using Beautiful Soup or some other HTML parser to pull out the data you want, is a method that does work on some websites, but it's outdated and often just doesn't work on modern sites. What I want to do in this video is show you the method I think you should learn first when you're starting out with web scraping, rather than spending a load of time trying to figure out how to parse HTML that may or may not be there. I'll show you what I mean in a second.

If we come over to this website and look at it, this is your typical modern e-commerce site, and as we scroll down we've got loads of fancy moving pictures and all sorts of stuff flashing up. There's just no way you'd be able to parse this information out of the HTML; it's going to have loads of dynamic classes and everything like that. Fortunately for us, there is a much easier way. I'm going to hit Inspect, go over to the Network tab, and refresh the page. Oh, it's already popped a load of requests up; we have this wallet/all one, so let's just clear it and refresh again.

What we want to look for is the request that the front end we were just looking at, the one that gives us all the images and information, makes to its API to actually get that data. We can then just mimic that request ourselves and get the JSON back. All the information will be there, everything we could possibly need, in a nice structured format, and it works out to far fewer calls because we don't have to visit the individual product pages.

Using an HTTP client that offers a solid TLS fingerprint is a great step towards unlocking sites, but when it comes to scaling up you need to use proxies, so I use ProxyScrape, who are kind enough to sponsor this video. We get access to high quality, secure, fast and ethically sourced proxies covering residential, datacenter and mobile, with rotating and sticky session options. There are 10 million plus proxies in the pool, all usable with unlimited concurrent sessions from countries all over the globe, enabling us to scrape quickly and efficiently. I use a variety of proxies depending on the situation, but I recommend you start out with residential ones; make sure you select countries that are appropriate to the sites you're trying to scrape, and match your own country where possible. To be fair, I've had great success with their mobile proxies too, and although I'm not using them in this project, I've used them to great effect before. Either way, it's still only one line of code to add to your project, and then you can let ProxyScrape handle the rest from there. Any traffic you purchase is yours to use whenever you need, as it doesn't ever expire. So if this all sounds good to you, go ahead and check out ProxyScrape at the link in the description below. Now let's get back to our project.

I found this one here. This is the request URL; you can see api, wallet and query in it, and it's even got a page number in there for us. If we go to the Response tab, or the Preview, it has all the information in it. I believe somewhere down here there's pricing as well; I've seen that. If I go to the response, I think it's here... yeah, look: regional pricing, and loads and loads of information. This is everything they have on this product. So if you were trying to do some kind of market research, or maybe you were looking at selling similar products and wanted to know what's out there, you could easily get all this information and you could track it.

We want to mimic this. The first thing I will always do is copy the URL from the Headers tab, the request URL, and paste it into my browser. I hit enter, and assuming you get the information back that you're after, you generally know it's not going to be too difficult. If you find some issues with it, well, it's not always this straightforward, so you might have to do a bit more to get there, but you'd be surprised how often this is the case. So I'm going to copy this again.

now what I'm going to do is I'm going to open up my terminal down here and I'm

going to do curl and I'm going to paste the URL in and look we got all the

information back here so let's pipe this into JQ so it's easier to see there it

is now this one is particularly interesting because there is no there's

nothing to stop us I've made a plain curl request just to get this

information um so you know we don't have to worry about anything and to to this

is all publicly available data that we are pulling off everything comes from

here that we could find it would all be on here anyway so you know there's no

there's no legal issues here in fact to pull this information we're going to do

it in like you know 20 requests which is nothing and we don't even need to do

them quickly so before I build out a scraper of this I want to show you
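As a rough Python equivalent of that curl-into-jq check, a minimal sketch might look like this (the endpoint URL is a placeholder, since you'd paste in whatever you copied from the network tab):

```python
# Minimal sketch: replicate the copied request in Python and pretty-print
# the JSON, much like piping curl output into jq.
import json

import requests

url = "https://example.com/api/wallet/all?page=1"  # placeholder endpoint
data = requests.get(url, timeout=10).json()
print(json.dumps(data, indent=2))  # pretty-print, jq-style
```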

Before I build out a scraper for this, I want to show you another site. I'll move over to that screen and close this one out. This is another example of the same sort of thing, again working along the lines of maybe you're looking at getting into this market and doing some research. So I'm going to hit Inspect and leave the Network tab filtered on Fetch/XHR; I need to set that for this page, as I had the other one open still. There we go. Now, if I refresh this page... ah, we didn't find anything we were looking for; none of this is particularly useful, it's basically rubbish. What you want to do in that case is click around, move through pages, do all sorts of things, and see what comes up.

So I'm just going to hit next page, and there we go, we got this one here. Let's make this bigger. We've got this long URL, which is basically just telling the API what information it wants, and here's the preview of the response: you can see we got products. The actual JSON response has the products in it, so you can see products, position one, with all the information, product IDs, and it's going to have the price in here somewhere. All of this is here because it's the information the back end sent to hydrate the front end and put the data on the page so you can see it. So let's go ahead and make a simple example of this.

we'll do this one and maybe we'll do this one as well in fact let's just have

a quick check of do what we do normally is just grab this here paste it in same

thing that all works so let's go back over to our

curl curl paste the URL in uh this just needs to be moved over here I'm going to

pipe it into JQ assuming that it works which is just going to pass the Javas

the um the Json a bit neater there we go everything it's all here nice and easy

right so I don't know why this this so this is what I'm trying to say this

should be the first method that you learn when you're learning how to web

scrape you all these like tutorials that are out there they're all too old they

all tell you to make a request and then pass the HTML or to use some kind of

browser now both of those methods still do work and you will find websites where

you know it's service I rendered so it sends HTML back up to the front page

that's easy you just get that HTML and pass it but in most cases I wouldn't

start there I would always start looking at this especially if you're doing

e-commerce because there's so much product information that needs to come

forwards backwards and forwards from the from the server to the front end there

it's so easy to just find it and it needs to be done like this and it's all

structured because it needs to be there's also schemas which make it you

know very um structured and uh consistent structure as well uh which

I've covered in other videos the most important thing is to check this first

so let's build something out real quick for this one um I'll get rid of that we

need to come back here and I need the inspect go Network and if I just come

back and refresh this page and down here somewhere it was had query

in it I think I can't see it now can't see the

word for the trees there's too many there it is cool so I'm just going

to copy this URL so let's create a new

project let's go to my project folder and we'll do

um some kind of cryptic name so I'll never remember what this is ever and

we'll get lost forever so I'm going to do uh create a new virtual environment

once that's done I'm going to activate it that's just a shortcut for me to

activate that virtually environment um you might have to type the full thing

out now this is what I'm going to do here is I'm going to install a couple of

different things right so I'm going to install Rich because rich just helps

when I'm printing out to the terminal so we can all see it a bit neater but I'm

also going to install TLS client uh and we're also going to do pantic and I'm

going to explain what these uh explain why I use these so TLS client is

TLS client is essentially built on top of requests, and what it does is send more browser-like information up with the request. With the TLS fingerprinting that some websites and WAFs (the firewalls they use to block bots, especially the basic Cloudflare tier) perform, they can check from the TLS information you send whether the request came from a browser or not, and they just block everything that hasn't. So using something like tls-client with Python, or curl_cffi, anything that's based around that Go client, is fine; they all use the same approach, so you should be good. It will just send the right information and give you a better chance of not getting blocked. We probably don't need it in this case, but it's worth doing anyway, just so you know you're covered, and it's all requests-like. So let's just create a new Python file.

I'm going to call this main.py, and then we're just going to do import tls_client, let me make this bigger, and then from rich I'm going to import print.

One thing I wanted to check before we write any more code is the actual pagination. I'm going to grab this URL again, and we're going to change the page number, so we go up and get more products. What happens when we go to page 10? Still getting products, cool. Let's try 15, still going. Okay, 20: blank list. This is important, because you want to know what happens when you go to a page beyond the last one, so you know how to break out of your pagination loop. In this case the endpoint returns real JSON, just a JSON array that will be interpreted as a list in Python, so we know that if the list is empty we can break. That's always a good thing to figure out before you start writing code, so let's just put this in here before I forget.
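Just to make that stop condition concrete, a throwaway probe might look like this (the endpoint is a placeholder, and plain requests is fine for a quick check):

```python
# Probe pages until the API returns an empty list, so we know what
# "past the last page" looks like before writing the real scraper.
import requests

BASE = "https://example.com/api/products?page={}"  # placeholder endpoint

for page in range(1, 30):
    data = requests.get(BASE.format(page), timeout=10).json()
    if len(data) == 0:  # an empty JSON array means we've gone past the end
        print(f"no results at page {page}")
        break
    print(f"page {page}: {len(data)} items")
```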

Now we want to build out our session. Let's create it; in fact, what I normally do is just copy it over from the tls-client GitHub page. Okay, we don't need that anymore; we're on the Python tls-client repo, the one I've been using, and we just want to copy this snippet, so let's get it in there. Right, I don't know why, but the example is wrong: you need to change the chrome112 identifier to have an underscore in it, like so, and we want to bump the version up so it's a bit more consistent with what we're expecting. There we go, cool.

going to do as well is I'm going to put my proxy in here um I always scrape with

a proxy these days there's just no point in not because if you go through a pro

if you go through a scraping program and you you figure out what you need to do

and you're using your own IP then try and use it with your proxies you might

have other issues so I just use them from the start uh we don't need any

extra headers what I do need to do is import OS because I'm going to be

pulling my proxy from my environment variable get

EnV I'm just going to call this one this one's just proxy so this is just me

pulling the proxy string from my environment variable if you don't do

this that's fine you can paste your proxy string straight in here and it

will work just fine as well um so that should be good so let's
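As a sketch, with the environment variable name being my assumption, that wiring might look like:

```python
# Pull the proxy string from an environment variable and attach it to the
# session. Pasting the proxy string in directly works just as well.
import os

import tls_client

session = tls_client.Session(
    client_identifier="chrome_112",
    random_tls_extension_order=True,
)
session.proxies.update({
    "http": os.getenv("PROXY"),   # env var name is an assumption
    "https": os.getenv("PROXY"),
})
```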

Let's change this to our URL, for example, and then just print the response.json(); I'm going to rename that res to response so it's a bit clearer. Cool, let's save this and run... and we got an empty list, and that's good, because we're on page 20. So let's put it back down to page one and run again, and there's the information we were after.

What I want to do now, and the reason I installed pydantic, is find the easiest and most convenient way of putting this data into something I can move around my program with dot notation, with the option to turn it back into JSON or export it into anything else, maybe my database or a different application. That's why I always tend to use pydantic, and to create the models I'm going to use JSON to Pydantic, a website that does it all for me: I can just paste my JSON in and it gives me the pydantic models out. What I tend to do is go through the output and have a quick look at what I actually want, so I don't copy everything I don't need.

I think I just want this part, because I want the regional pricing, so I'm going to copy this section and paste it in. This is going to create the models we need to dump this JSON into. Now, obviously we're going to have more information coming in than the models cover, but pydantic will just ignore the information that isn't in our model, which is exactly what we want: our data will fit straight into our models. So we've got the regional pricing, the metadata and the base model, and this is exactly what I wanted. I'm going to copy this out and come back to my code over here.

Let's create a models.py and paste this in. I'm going to create it and change the name to ItemModel; there we go. Now we have this all set up with the information we need. If you get a load of fields and maybe you don't want some of them, what I normally do is just comment out the parts I don't want; all that means is they're not part of the model, so that information gets ignored. Obviously this is just a bit of a rough start, but it will get you where you want to go to start with.
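A hedged sketch of what such a models.py might look like; the field names are assumptions, since the real ones come from the site's JSON:

```python
# models.py (sketch): generated-style pydantic models for the product JSON.
from pydantic import BaseModel


class RegionalPricing(BaseModel):
    region: str
    price: int  # we'll revisit this; some regions turn out to have no price


class ItemModel(BaseModel):
    id: int
    name: str
    regional_pricing: list[RegionalPricing]
```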

So I'm going to import my ItemModel into main.py now: from models import ItemModel, and I need to refresh my editor, there we go. Now, instead of printing just the response, I'm going to print an ItemModel and unpack the response into it. And if I run main... actually no, that's not going to work, is it, because I've got a list. I need to loop through the list and feed the items in first, so let's get rid of that.

What we want to do instead is: for item in response.json(), let me make this a bit bigger and put it in the middle, then we can say our product is equal to an ItemModel instance, unpacking whatever's in that item into it, and then I'll just print the product. We'll save this and run it, and we should now get actual pydantic models back.
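That loop, sketched out as a self-contained example (the sample data stands in for response.json()):

```python
# ** unpacks each JSON object (a dict) into keyword arguments for the model;
# fields that aren't declared on the model are ignored by default.
from pydantic import BaseModel


class ItemModel(BaseModel):
    id: int
    name: str


items = [  # stand-in for response.json()
    {"id": 1, "name": "Alpha", "unmodelled_field": "ignored"},
    {"id": 2, "name": "Beta"},
]

for item in items:
    product = ItemModel(**item)
    print(product)
```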

This is where some of the issues you'll get show up: validation errors. There are a few ways you can handle this. Because some of the prices, as we saw, simply don't exist, they don't validate, so we just need to make those fields optional. What I'm going to do is drop in a load of multiple cursors and change the type to Optional int, defaulting to None, and that should solve the problem for us. Now if we run this again it should be None wherever the price doesn't exist, and there we go: nice models with the information we wanted.
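That fix, sketched on the assumed pricing model from earlier:

```python
# Optional fields validate as None when the value is missing, instead of
# raising a validation error.
from typing import Optional

from pydantic import BaseModel


class RegionalPricing(BaseModel):
    region: str
    price: Optional[int] = None  # None where a region has no price
```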

That's pretty much exactly where we wanted to be. Now, you'll notice that for some of these, like the content field, my model just says it's a dictionary, and that was fine; I didn't want to create pydantic models for all of it because I didn't really feel the need to. I was probably going to remove the content list anyway; it has the pricing in it again for some reason, I don't know why, so I'm just going to remove it for now. You'd want to check your own data and decide whether it's useful for you or not. We'll just do this for the moment. Let's come back to our main.py file.

What I'm going to do now is tidy this up. Basically, this is how we're getting the information, and we're putting it into our pydantic model so we can do something with it; for example, it makes it so much easier to do something like .name, which is just going to give us the name of all the products. So now I'm going to tidy this up properly.

The first thing I'm going to do is have a create_session function, and this is basically going to be all of this here: it creates my session, so I'll copy the code in and put it up here. There we go; we don't need it down there now, because we're going to return the session from this function. I like to do this because when you call the function, you get back a session with everything it actually needs already attached. From here I'm also going to do session.proxies.update; let's see what shape it needs... okay, it needs a dictionary, which is just an easier way to handle it. So back in our main file, this needs to be a dict: the http key is going to be equal to our os.getenv for the proxy, and we also want the https key, which is the same thing, same proxy. Cool, and then we can return the session.
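Pulled together, a sketch of that create_session function (the env var name is still my assumption):

```python
import os

import tls_client


def create_session() -> tls_client.Session:
    """Return a session with the TLS fingerprint and proxies already attached."""
    session = tls_client.Session(
        client_identifier="chrome_112",
        random_tls_extension_order=True,
    )
    session.proxies.update({
        "http": os.getenv("PROXY"),   # assumed env var name
        "https": os.getenv("PROXY"),
    })
    return session
```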

So now we don't need this bit. We want a function that makes an API request, and it's going to take a session first and a URL. I'm going to put a type hint in here, actually: the session will be tls_client.Session, just so we have type hinting. So now I can say that our response is going to be equal to the session.get on the URL, and we'll return that out. Let's put in a little bit of error handling too: if response.status_code does not equal 200, we'll raise a generic exception, 'bad status code'. That will at least give us an idea of why it's going wrong if it goes wrong somewhere; there are probably better ways to do this, but I like to put something in just in case. Then we just return the response.json() out of here, so we don't need this anymore.
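As a sketch, that request helper might be:

```python
import tls_client


def api_request(session: tls_client.Session, url: str) -> list:
    """GET the URL and return the parsed JSON, failing loudly on bad statuses."""
    response = session.get(url)
    if response.status_code != 200:
        # Crude, but it tells us where and why things went wrong.
        raise Exception(f"bad status code: {response.status_code}")
    return response.json()
```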

let's have our main here and tie all this

together I'm going to keep my URL there um and I'm going to do my say my session

is equal to the create session function so we know for now now that this session

that we've got is going to have everything that we need all the TLs

fingerprinting and the proxies that we're going to use um and then we'll do

our um for page in range and I'm just going to do one to 25 and we'll say uh

let's copy let's put our URL down here so we want to want it inside this for

Loop and we're going to make this an F string and we'll just put the page

number in here here into the URL directly like so um you can do this with

like the actual parameters and everything through requests or whatever

HTTP client if you wanted to I'm just going to use an fstring for now it's

just easier um and then we're going to do our Json data is going to be equal to

the API request that we want to make with our session and our URL this is

going to have the good page in and then we want to do for item in Json data let

let's just print out the unpacking uh no item model this is our item model and we

want to unpack the item Json into it there we go cool right uh we just need

one more thing we need our if name equal to

main then we can run our main function here so what I'm going to do is I'm just
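At this point the script is roughly the following sketch, building on the create_session, api_request and ItemModel pieces above (the URL is still a placeholder):

```python
def main():
    session = create_session()
    for page in range(1, 25):
        # f-string the page number straight into the (placeholder) URL
        url = f"https://example.com/api/products?page={page}"
        json_data = api_request(session, url)
        for item in json_data:
            print(ItemModel(**item))


if __name__ == "__main__":
    main()
```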

So I'm just going to run this, because invariably there'll be something I've typed wrong that we'll need to fix... but no, it looks all right. It seems to be working, and we can see the Nones coming up where there's no price for that region, which is pretty cool. So what I need to do now is add in the stop.

Like I said, when it showed us the list was empty on a page with no data in it, that's our signal, so we can now do: if the length of our json_data is equal to zero, break. I'll also put a print statement in there; I'll just say 'end of results', like so. I'm also going to tidy this up: we don't want to just print each item, so I'll call it new_product, print out new_product.name, and then add it to a list. We'll make an output list, do output.append(new_product), and then once the whole thing has run we'll print our output and print the length of it as well.
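With the stop condition and the output list added, the finished main looks something like this sketch (again reusing the helpers above, with a placeholder URL):

```python
def main():
    session = create_session()
    output = []  # everything we collect, ready to move elsewhere later
    for page in range(1, 25):
        url = f"https://example.com/api/products?page={page}"  # placeholder
        json_data = api_request(session, url)
        if len(json_data) == 0:  # empty page: we've gone past the last one
            print("end of results")
            break
        for item in json_data:
            new_product = ItemModel(**item)
            print(new_product.name)
            output.append(new_product)
    print(output)
    print(len(output))


if __name__ == "__main__":
    main()
```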

Cool, I'm going to clear this up and run it now. Hopefully we'll see all of the product names come up as we go, and when we hit that 25th page, or the 20th page where it runs out, we'll break out of our loop and have all the products stored in that list, ready to move somewhere else, put in a database, anything like that. But I really want you to understand that this is pretty much what your cookie-cutter web scraping project for modern sites is going to look like. Now, they're not all going to be this easy, that's a given... right, 'end of results', there we go, 271.

easy you're going to have cause issues you're going to need to make sure that

you do all you get all of the headers and um cookies that are required you

might need to find a way to generate those cookies but that is all very very

doable you might need to use an undertech browser and get the cookies

that way but if the website works like this where it makes this API request it

is possible to scrape it like this as I said it's not always that easy but I

think this is a really good place for you to start learning how to web scrape

in instead of spending a load of time trying to pass ghost HTML which just

doesn't exist and just grab the Json data instead so hopefully you have

So hopefully you've enjoyed this video, I haven't been waffling on for far too long, and you've actually learned something, because otherwise, well, that would be bad. Anyway, if you've enjoyed this, hit like and don't forget to subscribe, that always helps me out, and if you want to see me do some more projects like this, you'll want to go and look at this one here, which is a little bit more advanced.