Scraping with Playwright 101 - Easy Mode
2024-03-29
➡ JOIN MY MAILING LIST https://johnwr.com ➡ COMMUNITY https://discord.gg/C4J2uckpbR ➡ PROXIES https://proxyscrape.com/?ref=jhnwr ➡ WEB SCRAPING API https://hubs.li/Q043T88w0 ➡ HOSTING https://m.do.co/c/c7c90f161ff6 If you are new, welcome. I'm John, a self taught Python developer and content creator, working at Zyte. I specialize in data extraction and automation. If you like programming and web content as much as I do, you can subscribe for weekly content. All views in this video are my o...
Subtitles

sometimes browser automation with something like Playwright is all you need

to scrape the data that you're after and in this video I'm going to do just that I'm

just going to use Playwright on its own nothing else and we're going to scrape

the data from this site there's something like 700-something items

in this category but this would work across any of these categories in this

sales section but before we write any code the first thing I always do is have

a look at the site so I can know what's going on so I want to show you a few

things first now this is a paginated website here and you can

see that we have these buttons that take us to the next page but if we check

out the product page first I'll show you that we don't need to do a lot of

parsing and that's why I'm happy to use Playwright on its own for this I

don't particularly like using Playwright when I have to do a load of parsing

now when you're on a dedicated product page like this always come to view page

source and then do a search for the word schema and see what you can find now I

did a video on this on my channel a little bit earlier but this here is all

valid JSON and we can access it from this script tag and grab it out here so

if I copy this and we go to a JSON parser online and paste this in if I copy the

whole thing rather like this and paste it in it's valid

JSON and you can see that this is all of the information we could ever want from

this product it's basically the schema which is a standard so this is very

very good for us to have so I'm going to close that and that's how we're going to

get the product information from each of the detail pages so what we want to do

is we want to loop through all of the product pages on each listing page and then go to

the next page and do the same and then go to the next page and do the same as I

said we're going to do that in Playwright so what I'm going to do is I'm going to

create a virtual environment first python3

-m venv venv and then we'll pip install what we need once we've activated that

and I'm going to install playwright and I'm also going to install rich it just

makes it easier when I print stuff to my terminal when I'm running it so you guys

can see you don't need that here then you want to do playwright install I'm

going to do playwright install chromium only because I don't need the other ones

it gives me this error but I've already installed it so I know it works it's

fine in that case cool I'm going to create a new file I'm going to call this

main.py and I'm going to open this up in my code editor I'm using Helix I'm

really enjoying Helix at the moment I've pretty much moved to it from Neovim but

whatever code editor is fine so let's start by importing what we need so we'll

do from playwright.sync_api we're going to import sync_playwright

and also the Playwright class itself and then from rich import print

and we will import anything else we need as we go so it's very simple to run

Playwright just on its own I'm going to keep this as straightforward as possible

so we're going to stick with the default of the run function and this takes in an

instance of Playwright which I'm going to use for these type hints here and then

pass and then we have our with sync_playwright() as playwright and we're just going

to run it down here this is going to run our code for us so we're going to put

everything in this run function as I said because we don't have to do loads

of parsing it's not going to be that difficult there's not going to be too

many lines of code we'll end up with like 50 or 60 lines of code
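A minimal sketch of the skeleton described above, extended with the launch steps the video adds straight afterwards. `START_URL` is a placeholder, since the listing URL is never spelled out. The `playwright.sync_api` import sits inside `main()` only so that `run()` can be tried against a stand-in object; the video imports it at the top of the file and uses `Playwright` as the type hint for `run`.

```python
# Placeholder: substitute the listing URL you actually want to scrape.
START_URL = "https://example.com/sale"


def run(playwright, start_url: str = START_URL) -> None:
    """Launch Chromium, open one page, load the start URL, then clean up."""
    chrome = playwright.chromium
    # headless=False opens a visible browser window, which is the setting
    # the video ends up using for its target site.
    browser = chrome.launch(headless=False)
    page = browser.new_page()
    page.goto(start_url)
    browser.close()


def main() -> None:
    # Imported here (an assumption of this sketch) so run() stays usable
    # without Playwright installed, for example with a test double.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        run(playwright)
```

Calling `main()` at the bottom of `main.py` would start the browser; everything else in the walkthrough goes inside `run`.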

so it's all good in that respect right so let's get this little bit started

first so we'll have a start URL and I'll grab

that in just a second then we'll have our chrome which is going to be equal to

playwright.chromium then our browser which is going to be equal to chrome.launch()

and then our page which is going to be equal to browser.new_page() like

so this is going to basically launch Playwright for us create the browser

context create a new page for us etc etc then we can do page.goto with our

start URL like this so I'm going to save this I'm going to come over to my other

terminal and I'm going to activate my virtual environment here we'll just do

python main.py and we get cannot navigate to an invalid URL

of course you can't I didn't put the URL in there that would help so let's put

it in there save now let's run it

again okay cool so it did nothing but it

didn't not work we are going to need to use headless=False

here and that is because when we run it completely headless there's a

giveaway unless you remove it that tells the website what's

going on so it doesn't work so we're going to do this we're going to see the

browser open here and load this page up and then disappear so I'm happy I know

that that's all working so let's construct the main part of our code that

is going to go to Every product page and return that data for us now I'm going to

put this in a while True now this is just a continuous loop that I'm going to

use and I'm going to break out of it on a condition um it's up to you however

you want to Loop through however it works for you that's fine we need to

grab the links now for the page for each of the product pages on the main page so

I'm going go to the inspect tool I'm going to grab the selector for this and

here it is over here this thing here with the data-selenium attribute so to do

that we're going to do for link in page.locator now the locator

is going to allow us to use CSS selectors to actually grab the element

so I'm going to say a and it was data-selenium which equalled this thing

here and we want to do .all() and this returns a list with all of the

links a bit like find_all if you're used to Beautiful Soup or

something like that what we want to do now is we don't want to use the original

page we don't want to use this to go to that whilst that is a valid approach I'm

going to create a new page every time and open it up so I don't have to go

back and forth between loading up the different pages I can just load up the

list page and all the product pages separately and then the next page from

the list page so to do that we do p is equal to

browser.new_page() like so and because we're not

clicking we want to create a base URL for this I'm just

going to grab that which is this here and I'll show you that in just a second

so that's a base URL and the reason why we do that is because just over here

above my head you'll see that the href is not a complete link it's not a full

absolute URL so we need to put the base bit in front of it so when we open it we

can go to this page here now we want to do our url is going to be equal to

link.get_attribute("href") like so
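The link-collection step above can be sketched roughly like this. The selector value and base URL are passed in as parameters because the video only points at them on screen, so any concrete values are your own. `urljoin` from the standard library is swapped in here for the plain "base plus href" concatenation the video describes, and the `None` check covers the case where a link has no `href` at all (`get_attribute` can return `None`).

```python
from typing import Iterator, Optional
from urllib.parse import urljoin


def absolute_url(base_url: str, href: Optional[str]) -> Optional[str]:
    """Turn a relative href into a full URL, or None if there is no href."""
    if href is None:
        return None
    return urljoin(base_url, href)


def detail_urls(page, link_selector: str, base_url: str) -> Iterator[str]:
    """Yield the absolute URL of every product link on a listing page.

    `page` is a Playwright page; `link_selector` is a CSS selector such as
    an `a[...]` attribute selector for the product links (hypothetical here).
    """
    for link in page.locator(link_selector).all():
        url = absolute_url(base_url, link.get_attribute("href"))
        if url is not None:
            yield url


print(absolute_url("https://example.com", "/product/123"))
```

In the video's flow each yielded URL is opened in its own page with `browser.new_page()`, visited with `goto`, and that page is closed again before moving to the next link.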

this is the attribute the actual link bit which is going to get added onto the

base URL here so now we want to do p.goto the url like so now you'll notice

my editor is flagging an error here because it's a string or None and

that's because this attribute may or may not exist so what we're going to do is

if url is not None p.goto the

url else p.close() and then basically that just handles that error there just

in case if this doesn't exist it doesn't try and go to it because it doesn't

exist and it just closes that page there cool so let's save and

come back over here and let's run this now and we should open up a page and

then open up the next one the next one the next one cool so these are all the

product pages that are opening up we do have an issue here is that they are not

closed so they are going to hang around forever and cause us untold misery so

what we want to do now is whilst we're in our loop here

p.close() cool so let's do this again and we should now open up a page close it open

up the next one close it like so so you can see that we're going

through all of the product pages here which hold the information that we're

actually wanting to scrape now there's about 28 or something per page so that's

going to do this 28 times all in all it's not going to be the quickest

thing in the world but it's not going to be that bad you could easily set this to

run I reckon the whole thing would probably take about an hour if that

which is not that big a deal so now that we're loading the product detail

page up we want to take the schema data which I showed you from here

we want to grab this wherever it's gone we want to grab this here so this is the

script tag with the type application/ld+json whenever you see this it's

likely to have this information in here so we can do the same thing again we can

do data is equal to p.locator and it was a script and its

type is equal to application/ld+json

and from this we want the text content like so so now I'm going to

print out the data like this and we'll run a few and we'll see that we should

get that information spat out to our terminal it's going to be a bit

difficult to see but you can basically see it coming across here now and that

is exactly all the information that we want I'm going to stop this we don't

need it to run so this is all the data there so what I'm going to do is

we'll just clear this screen up and we'll come back to our code here I've

somehow ended up with an extra terminal that I don't

need great so now we've got this data what we're going to do is we're going to

parse it as JSON so we'll do import json and we'll come down here and we'll

do our json_data is equal to json.loads load a string data like so and then

we'll just print out our json_data like this and we'll check that that still

works and now instead of that string type we're going to get an actual Python

dict and you can see it's formatted ever so slightly on the left hand side

of my screen and that's because rich knows that now it's not a string it's

an actual dict so it's doing all the indenting for us so that's good and I

think we should be able to ask for just the name
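The extraction step described above might look something like the sketch below. The sample string is made up for illustration and is not data from the site; `parse_ld_json` and `product_schema` are hypothetical helper names.

```python
import json
from typing import Any, Optional

# CSS selector for the structured-data script tag on a product page.
LD_JSON_SELECTOR = 'script[type="application/ld+json"]'


def parse_ld_json(raw: Optional[str]) -> Any:
    """Parse the text of an ld+json script tag; None in, None out."""
    if raw is None:
        return None
    return json.loads(raw)


def product_schema(p) -> Any:
    """`p` is a Playwright page that has already loaded a product URL."""
    return parse_ld_json(p.locator(LD_JSON_SELECTOR).text_content())


# Illustrative input only: a tiny stand-in for what such a tag contains.
sample = '{"@type": "Product", "name": "Example Jacket", "offers": {"price": "49.99"}}'
data = parse_ld_json(sample)
print(data["name"])  # Example Jacket
```

One caveat worth knowing: if a page carries more than one `application/ld+json` script, the locator matches several elements and `text_content()` raises a strictness error, so you may need `.first` or a loop over `.all()`.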

now there we go so I'm just asking for the name key

just to make it a bit easier so we can see what's going on right so that's

great that works clear that up so this is essentially the Crux of it so what we

got to do now is we're going to sort out the pagination so whilst I'm in this

that's why I set up this while true Loop so I'm going to go through all of the

links on the page and then underneath here I'm going to go to the next page

from the main page for the pagination so if we come over here let's make this a

little bit smaller scroll to the bottom and it's here here's the next page link

and we can see that it is here I've lost it now one second this listing

paging next now at the end of this and I think there's 28 pages let me try

that 28 at the end of this you'll see that

it's greyed out however it still has the full class here which is why I've

done this as a while loop so we can break out and we can choose how we want

to do that so let's go ahead and do page.locator so we want to find this

it was an a tag with a class equal to listing paging

next and we can do .click() like this and this is going to then click on that

link on every page what I'm going to do just so we don't have to wait for it for

every single one to check the pagination is I'm going to index just one of the

products so this will be the first for every like grid of products on the page

so we're not going to get the full data but it means we can test out the

pagination without waiting all that time so let's run now so we're going to see

less uh products come by but we should hopefully see this page here go to the

next page there we go you can see now we're getting different

products so I'm going to let this run and we're going to see what happens

when we get to the end and at what page number I think it's something like 28 or

29 so let's see what happens when we get there so we're just going

round and round and round in circles now because we have nothing to break out of

this Loop and it's just loading up this swapping over because I'm moving my

mouse around it's just loading up this page over and over and over again until

I stop it so we need to now break out of our while true Loop now if we weren't in

our while true we would have to do something like figure out the number of

pages because we can't use a stop here or something like that but what I

decided to do was to use this here now if you look at this piece of text and we

come and open it up here it's a text string and we have 757

to 776 of 776 so what I'm going to do is I'm going to get this string and I'm

going to split it up and I'm going to compare these two numbers and if they're

equal that's how I'm going to break out of my Loop now there's obviously a few

different ways you could do this this is just the way that I chose it's entirely

up to you how you want to do it so what we're going to do is we'll say our

page_numbers is equal to page.

locator and it was a

span like so then .text_content() like this and we need to do a split on this

because this is a string so I'm going to split it first on a dash so we'll do

.split like so now when we split it on a dash let me actually copy this so I

can show you so if I open up Python 3 like this let's make this nice and big

so if we say that our string is equal to this and we do

string.split on the dash like this we're going to end up with a list like

this so what we want to do is we want to then ask for index 1 and then

from that we want to split on a space and then we have this

and then we can actually reference index 0 which is 776 so

if I make this an integer like

so we have 776 and then we want to compare it to index 2 turning

that into an integer so we then ignore the of and we have our comparison which

we can then do on those two numbers so what I'm going to do is I'm going to

do .split here then index 1 and then .split on a space that's our

page_numbers okay and then we can do if int of page_numbers

0 is equal to int of page_numbers

2 that means we're on the last page so let's just do

print no more pages and break so we're going to break

out of this while loop so I'm within the while loop here I'm not within this

for loop this is for the detail pages let me put a comment in here

detail pages well that's not very nice formatting on this let's not do that

that's not very nice formatting there so now we will break out of this so I'm

going to do else and I'm going to put the click in an else so it only happens if

we're not on the last page like so and then finally we want to have

browser.close() like so that should be within our run function so when it's

finally done the browser closes and we are all happy in our own way that is it

essentially so this is what did we get to 43 lines of Playwright code and

that's going to work and go through all of those pages let's run this again I

think I'm still just getting the first one I am so we'll just check that this

works in fact what we're going to do just still work on the first page yeah

so what we'll do is we'll change the start URL to page

27 and we'll check that this works when it

goes to 28 and it's the end no more pages
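The last-page check described above, pulled out as a small helper. This is a sketch that assumes the counter text has the shape read out in the video ("757-776 of 776"); `is_last_page` is a hypothetical name, and `.split()` with no argument is used instead of splitting on a single space so that stray whitespace around the dash does not matter.

```python
def is_last_page(range_text: str) -> bool:
    """Return True when the upper bound of the shown range equals the total."""
    # "757-776 of 776" -> ["757", "776 of 776"] -> ["776", "of", "776"]
    page_numbers = range_text.split("-")[1].split()
    return int(page_numbers[0]) == int(page_numbers[2])


print(is_last_page("757-776 of 776"))  # True
print(is_last_page("1-28 of 776"))     # False (illustrative first-page text)
```

Inside the `while True` loop this would sit after the detail-page loop: read the counter with `text_content()`, `break` when `is_last_page(...)` is true, and otherwise click the next-page link.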

perfect so we found a way to consistently break out using that page

number selector at the end depending on what site you're looking at you

may need to figure something out a little bit different but this worked for

me in this case I've reset everything so we're going to go through all

the pages so let's save and run it again and we'll just see it

working one more time and it'll load up each one and it'll go through and you

can see we're getting the information on the right hand side now the only thing

left to do would be to save this but we're basically ending up here with a

nicely formatted piece of JSON data so I would suggest from there the most

likely thing to do would just be to export it to a JSON or a JSON lines file

and then handle it outside of the script that way I find that's a much better

option than trying to do anything with it whilst you're actually scraping

it and as we're going through page by page you can also

append so if it stops partway through for whatever reason you can carry on and

not lose everything that you've done so far so that's it that was nice and easy

for Playwright nice and easy mode didn't take us too long to do 40

something lines of code 43 lines of code no big deal so nice and easy there's a

few cool things in here one last thing that I want to show you which

is useful in a way is that in between here we can do

page.route and we can actually block images as well I just need to remember

how to do this it's like this so we're blocking dot and we'll

have png jpg and jpeg like this and then we pass this

into a lambda anonymous function that takes route and calls

route.abort() like so if we copy this line and we put it down here as

well I think we want it here and make this p like this now when we come back

over here we should have no images which means you know if you're using proxies

saves you a little bit of data marginally quicker because we don't have

to wait for it to load up the images if it's an image heavy site which most

modern sites are so that's a nice tip to make things a little bit quicker and a

little bit easier just by blocking the images this will work for any other

types of files as well so if you got websites loading up something else

that's really kind of heavy that you don't need you can add it into here and

it will also block that there so that's it for this one hopefully you've enjoyed

it got something out of it make sure you like comment subscribe all that cool

stuff it really does help me out join the Discord there's loads of cool people

stuck in there now loads of cool stuff going on and yeah thank you very much

for watching and I will see you again in the next one
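The image-blocking tip from the end of the video can be sketched as below. The glob pattern follows the form used in Playwright's own routing examples; `block_images` is a hypothetical wrapper name, and the exact list of extensions is an assumption you can extend.

```python
# Glob matching any request URL that ends in one of these image extensions.
IMAGE_GLOB = "**/*.{png,jpg,jpeg}"


def block_images(page) -> None:
    """Register a route on a Playwright page that aborts image requests."""
    page.route(IMAGE_GLOB, lambda route: route.abort())
```

As in the video, this needs to be applied to both the listing page and each product page `p` before calling `goto`, because a route only affects the page it is registered on. Adding other extensions to the pattern blocks those file types in the same way.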