Sometimes browser automation with something like Playwright is all you need
to scrape the data that you're after, and in this video I'm going to do just
that. I'm just going to use Playwright on its own, nothing else, and we're
going to scrape the data from this site. There are 700-and-something items in
this category, but this would work across any of the categories in this sales
section. Before we write any code, the first thing I always do is have a look
at the site so I know what's going on, so I want to show you a few things
first. This is a paginated website, and you can see that we have these buttons
that take us to the next page. But if we check out the product page first,
I'll show you that we don't need to do a lot of parsing, and that's why I'm
happy to use Playwright in its entirety for this. I don't particularly like
using Playwright when I have to do a load of parsing.
Now, when you're on a dedicated product page like this, always go to view page
source, search for the word "schema", and see what you can find. I did a video
on this on my channel a little bit earlier, but this here is all valid JSON,
and we can access it from this script tag and grab it out. So if I copy this,
go to JSON Parser Online and paste it in (if I copy the whole thing, like
this), it's valid JSON, and you can see that this is all of the information we
could ever want from this product. It's basically the schema, which is
standard, so this is very, very good for us to have. I'm going to close that,
and that's how we're going to get the product information from each of the
detail pages. So what we want to do is loop through all of the product pages
on each list page, then go to the next page and do the same, and then go to
the next page and do the same. As I said, we're going to do that in
Playwright, so what I'm going to do is I'm going to
create a virtual environment first: python3 -m venv venv. Then we'll pip
install what we need once we've activated that. I'm going to install
playwright, and I'm also going to install rich, which just makes things easier
to read when I print stuff to my terminal so you guys can see it; you don't
need that. Then you want to do playwright install. I'm going to do playwright
install chromium only, because I don't need the other browsers. It gives me
this error, but I've already installed it so I know it works; it's fine in
that case. Cool. I'm going to create a new file, I'm going to call this
main.py, and I'm going to open this up in my code editor. I'm using Helix; I'm
really enjoying Helix at the moment and I've pretty much moved to it from
Neovim. Whatever code editor you use is fine.

So let's start by importing what we need. We'll do from playwright.sync_api,
and we're going to import sync_playwright and also the Playwright class
itself, and then from rich import print, and we will import anything else we
need as we go. It's very simple to run Playwright just on its own, and I'm
going to keep this as straightforward as possible.
So we're going to stick with the default of the run function. This takes in an
instance of Playwright, which I'm going to use for the type hints here, and
then pass. Then we have our with sync_playwright() as playwright, and we're
just going to run it down here; this is going to run our code for us. We're
going to put everything in this run function. As I said, because we don't have
to do loads of parsing, it's not going to be that difficult and there aren't
going to be too many lines of code. We'll end up with something like 50 or 60
lines of code,
so it's all good in that respect. Right, let's get this little bit started
first. We'll have a start URL, and I'll grab that in just a second. Then we'll
have our chrome, which is going to be equal to playwright.chromium, then our
browser, which is going to be equal to chrome.launch(), and then our page,
which is going to be equal to browser.new_page(), like so. This is going to
launch Playwright for us, create the browser context, create a new page for
us, and so on. Then we can do page.goto with our start URL, like this. I'm
going to save this, come over to my other terminal, activate my virtual
environment here, and we'll just do python main.py, and we should...
"Cannot navigate to invalid URL."
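
As a rough sketch, the script described up to this point might look like the following once a start URL is filled in. `START_URL` is a placeholder (the real category URL is not given in the transcript), `main` is a name introduced here, and the Playwright import is deferred into `main` only so the file can be loaded without starting a browser; the video imports it at the top of main.py.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Needed only for the type hint on run().
    from playwright.sync_api import Playwright

# Placeholder: the real category URL is not given in the transcript.
START_URL = "https://example.com/sale"


def run(playwright: "Playwright") -> None:
    chrome = playwright.chromium
    # headless=False is what the video switches to shortly afterwards.
    browser = chrome.launch(headless=False)
    page = browser.new_page()
    page.goto(START_URL)
    # Closing the browser is added here for tidiness; the video adds it later.
    browser.close()


def main() -> None:
    # Deferred import (an assumption of this sketch, not the video's layout).
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        run(playwright)

# Call main() to run this against a real browser.
```

Leaving `START_URL` empty is what produces the "Cannot navigate to invalid URL" error shown above.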
Of course you can't; I didn't put the URL in there. That would help. So let's
put it in there, save, and run it again. Okay, cool. So it did nothing, but it
didn't not work. We are going to need to use headless equal to False here, and
that is because when we run it completely headless there's a giveaway, unless
you remove it, that lets the website know what's going on, so it doesn't work.
So we're going to do this, and we're going to see the browser open here, load
this page up, and then disappear. I'm happy; I know that that's all working.

So let's construct the main part of our code, which is going to go to every
product page and return that data for us. I'm going to put this in a while
True. This is just a continuous loop that I'm going to use, and I'm going to
break out of it on a condition. It's up to you how you want to loop through;
however it works for you is fine. We need to
grab the links now for each of the product pages on the main page, so I'm
going to go to the inspect tool and grab the selector for this. Here it is
over here: this thing with the data-selenium attribute. So to do that, we're
going to do for link in page.locator. The locator is going to allow us to use
CSS selectors to actually grab the element, so I'm going to say a, with
data-selenium equal to this thing here, and we want to do .all(), which
returns a list of all of the matching links, a bit like find_all if you're
used to Beautiful Soup or something like that. What we want to do now is not
use the original page to go to each link. Whilst that is a valid approach, I'm
going to create a new page every time and open it up, so I don't have to go
back and forth between loading up the different pages. I can just load up the
list page, all the product pages separately, and then the next page from the
list page. So to do that we do p is equal to browser.new_page(), like so.
Because we're not clicking, we want to create a base URL for this. I'm just
going to grab that, which is this here, and I'll show you why in just a
second. So that's the base URL, and the reason why we do that is because, just
over here above my head, you'll see that the href is not a complete link; it's
not a full absolute URL. So we need to put the base bit in front of it so that
when we open it we can go to this page here. Now our URL is going to be equal
to link.get_attribute("href"), like so.
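
A minimal sketch of turning a relative href into a full URL. `BASE_URL` is a placeholder, `absolute_url` is a helper name introduced here, and `urljoin` from the standard library is swapped in for simply putting the base in front of the href as the video does. The `None` check reflects the fact that `get_attribute` can return `None` when the attribute is missing.

```python
from typing import Optional
from urllib.parse import urljoin

# Placeholder: the real site's base URL is not given in the transcript.
BASE_URL = "https://www.example.com"


def absolute_url(href: Optional[str], base: str = BASE_URL) -> Optional[str]:
    """Join a possibly relative href onto the base URL; pass None through."""
    if href is None:
        return None
    return urljoin(base, href)


# Hypothetical use inside the loop over product links:
#     url = absolute_url(link.get_attribute("href"))
```

With this helper, a later `if url is not None` check still decides whether the new page navigates anywhere.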
This is the attribute, the actual link bit, which is going to get added onto
the base URL here. Now we want to do p.goto with the URL, like so. You'll
notice my editor is flagging an error here, saying it's a string or None, and
that's because this attribute may or may not exist. So what we're going to do
is: if url is not None, p.goto the URL, else p.close(). That just handles that
error, so if the attribute doesn't exist it doesn't try to go to it, and it
just closes that page. Cool, so let's save, come back over here, and run this
now, and we should open up a page, then open up the next one, the next one,
the next one. Cool, so these are all the product pages that are opening up. We
do have an issue here, which is that they are not closed, so they are going to
hang around forever and cause us untold misery. So what we want to do now,
whilst we're in our loop here, is p.close(). Cool, so let's do this again, and
we should now open up a page, close it, open up the next one, close it, like
so. You can see that we're going through all of the product pages here, which
hold the information that we actually want to scrape. There are about 28 or so
per page, so it's going to do this 28 times per page. All in all it's not
going to be the quickest thing in the world, but it's not going to be that
bad. You could easily set this to run; I reckon the whole thing would probably
take about an hour, if that,
which is not that big a deal.

Now that we're loading the product detail page up, we want to take the schema
data which I showed you. We want to grab this, wherever it's gone; we want to
grab this here. This is the script tag with application/ld+json. Whenever you
see this, it's likely to have this information in it. So we can do the same
thing again: we can do data is equal to p.locator, and it was a script tag
where type is equal to (let me copy it) application/ld+json, and from this we
want the text content, like so. Now I'm going to print out the data like this,
and we'll run a few, and we should get that information spat out to our
terminal. It's going to be a bit difficult to see, but you can basically see
it coming across here now, and that is exactly all the information that we
want. I'm going to stop this; we don't need it to run. So this is all the data
there. What I'm going to do is clear this screen up and come back to our code
here. I've somehow ended up with an extra terminal that I don't need. Great.
Now that we've got this data, we're going to parse it as JSON. So we'll do
import json, and we'll come down here and do json_data is equal to json.loads
(load a string) of data, like so. Then we'll just print out json_data like
this and check that that still works. Now, instead of that string type, we're
going to get an actual JSON object, and you can see it's formatted ever so
slightly on the left-hand side of my screen. That's because Rich knows that
now it's not a string, it's actual JSON, so it's doing all the indenting for
us. So that's good, and I think we should be able to ask for just the name.
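
To illustrate the parsing step, here is a small sketch that uses a made-up JSON-LD string in place of the script tag's text content. The product values are invented for the example; only the shape (a schema.org Product with a `name` key) follows what the video describes.

```python
import json

# Hypothetical stand-in for the text content of the
# <script type="application/ld+json"> tag on a product page.
data = (
    '{"@context": "https://schema.org", "@type": "Product", '
    '"name": "Example Jacket", '
    '"offers": {"@type": "Offer", "price": "49.99", "priceCurrency": "GBP"}}'
)

# json.loads turns the string into ordinary Python dicts and lists.
json_data = json.loads(data)

# Individual fields are then plain dictionary lookups.
name = json_data["name"]
price = json_data["offers"]["price"]
```

In the real script the text content can be `None` if the tag is missing, so a check before calling `json.loads` may be worth adding.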
Now, there we go. I'm just asking for the name key, to make it a bit easier to
see what's going on. Right, so that's great, that works. Clear that up. This
is essentially the crux of it.

What we've got to do now is sort out the pagination; that's why I set up this
while True loop. I'm going to go through all of the links on the page, and
then underneath here I'm going to go to the next page from the main page for
the pagination. So if we come over here, make this a little bit smaller, and
scroll to the bottom, here's the next page link, and we can see that it is
here (I've lost it now, one second): this listing paging next. Now, at the end
of this (I think there are 28 pages, let me try that, 28), you'll see that
it's greyed out. However, it still has the full class here, which is why I've
done this as a while loop, so we can break out and choose how we want to do
that. So let's go ahead and do page.locator. We want to find this: it was an a
tag, like so, matching that listing paging next class, and we can do .click()
like this, and this is going to click on that link on every page. What I'm
going to do, just so we don't have to wait for every single product when we
check the pagination, is index just one of the products. This will be the
first one from each grid of products on the page, so we're not going to get
the full data, but it means we can test out the pagination without waiting all
that time. So let's run it now. We're going to see fewer products come by, but
we should hopefully see this page here go to the next page. There we go: you
can see now we're getting different products. I'm going to let this run, and
we're going to see what happens when we get to the end, at whatever page
number it is; I think it's something like 28 or 29. So let's see what happens
when we get there. We're just going round and round in circles now, because we
have nothing to break out of this loop. It's just loading up this page (it's
swapping over because I'm moving my mouse around) over and over again until I
stop it. So we now need to break out of our while True loop. Now, if we
weren't in
our while True, we would have to do something like figure out the number of
pages, because we can't use a stop here or something like that. But what I
decided to do was to use this here. If you look at this piece of text (I'll
open it up here), it's a text string, and we have 757 to 776 of 776. So what
I'm going to do is get this string, split it up, and compare these two
numbers, and if they're equal, that's how I'm going to break out of my loop.
There are obviously a few different ways you could do this; this is just the
way that I chose. It's entirely up to you how you want to do it. So what we're
going to do is say our page_numbers is equal to (are we in p or page? page)
page.locator, and it was a span, like so, then .text_content(), like this. We
need to do a split on this, because it's a string, so I'm going to split it
first on a dash. So we'll do .split, like so. Now, when we split it on a
dash... let me actually copy this so I can show you. If I open up python3 like
this, make this nice and big, and say that our string is equal to this, then
if we do string.split on the dash like this, we're going to end up with a list
like this. So what we want to do is then ask for index 1, and then from that
we want to split on a space, and then we have this, and then we can actually
reference the first one, which is 776. So if I make this an integer, like so,
we have 776, and then we want to compare it to index 2, turning that into an
integer as well. So we ignore the "of", and we have our comparison, which we
can then do on those two numbers. So what I'm going to do is do .split here,
then index 1, and then .split on a space. That's our page_numbers. Okay, and
then we can do: if int of page_numbers[0] is equal to int of page_numbers[2],
that means we're on the last page. So let's just do print "no more pages" and
break. So we're going to break
out of this while loop. I'm within the while loop here, not within this for
loop; this is for the detail pages. Let me put a comment in here: detail
pages. Well, that's not very nice formatting on this; let's not do that. So
now we will break out of this. I'm going to do else, and I'm going to put this
click in an else, so it only happens if that condition isn't met: page.locator,
like so. And then finally we want to have browser.close(), like so. That
should be within our run function, so when it's finally done, the browser
closes and we are all happy in our own way. That is it, essentially. So this
is, what did we get to, 43 lines of Playwright code, and that's going to work
and go through all of those pages. Let's run this again. I think I'm still
just getting the first one; I am. So we'll just check that this works. In
fact, what we're going to do is still work on just the first product, yeah. So
what we'll do is change the start URL to page 27, and we'll check that this
works when it goes to 28 and it's the end. "No more pages."
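
The last-page check described above can be sketched as a small function. The exact text format is an assumption based on what is read out in the video ("757 to 776 of 776", split on a dash), and `is_last_page` is a name introduced here. It uses `split()` with no argument for the second split so that stray spaces around the dash do not matter.

```python
def is_last_page(range_text: str) -> bool:
    """Return True when the upper bound of the range equals the total.

    Expects text shaped like "757-776 of 776" (assumed format).
    """
    # "757-776 of 776" -> ["757", "776 of 776"]; keep the part after the dash.
    after_dash = range_text.split("-")[1]
    # "776 of 776" -> ["776", "of", "776"]
    page_numbers = after_dash.split()
    return int(page_numbers[0]) == int(page_numbers[2])


# Hypothetical use at the bottom of the while True loop:
#     if is_last_page(text):
#         print("no more pages")
#         break
#     else:
#         ...click the next-page link...
```

If the site ever changes the wording of that string, the index lookups will raise an error rather than silently looping forever, which is arguably the safer failure.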
Perfect. So we found a way to consistently break out, using that page number
selector at the end. Depending on what site you're looking at, you may need to
figure out something a little bit different, but this worked for me in this
case. We've reset everything so we're going to go through all the pages, so
let's save and run it again, and we'll see it working one more time. It'll
load up each one and go through, and you can see we're getting the information
on the right-hand side. Now, the only thing left to do would be to save this,
but we're basically ending up here with a nicely formatted piece of JSON data.
So I would suggest that the most likely thing to do from there would just be
to export it to a JSON or a JSON Lines file and then handle it outside of the
script. I find that's a much better option than trying to do anything with it
whilst you're actually scraping it. And as we're going through page by page,
you can also append, so if it stops partway through for whatever reason, you
can carry on and not lose everything that you've done so far. So that's it;
that was nice and easy.
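
The video does not show the saving step, so the following is only one way the JSON Lines suggestion could look. `append_jsonl` and `read_jsonl` are hypothetical helper names, and the file path is up to you.

```python
import json
from typing import Any, Dict, List


def append_jsonl(path: str, record: Dict[str, Any]) -> None:
    """Append one record as a single line of JSON."""
    # Mode "a" keeps earlier lines, so a crash partway through loses nothing.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read every non-empty line back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical use inside the product loop:
#     append_jsonl("products.jsonl", json_data)
```

Because each product is written as soon as it is scraped, a restarted run can pick up from the last saved item instead of starting over.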
For Playwright, nice-and-easy mode didn't take us too long to do:
40-something lines of code, 43 lines of code, no big deal. There are a few
cool things in here.

One last thing that I want to show you, which is useful in a way, is that in
between here we can do page.route, and we can actually block images as well. I
just need to remember how to do this. It's like this: we're blocking dot, and
then we'll have png, jpg and jpeg, like this, and then we pass this into a
lambda, an anonymous function, and we do route: route.abort(), like so. If we
copy this line and put it down here as well (I think we want it here) and make
this p, like this, then when we come back over here we should have no images.
That means, you know, if you're using proxies it saves you a little bit of
data, and it's marginally quicker because we don't have to wait for the images
to load if it's an image-heavy site, which most modern sites are. So that's a
nice tip to make things a little bit quicker and a little bit easier, just by
blocking the images. This will work for any other types of files as well, so
if you've got websites loading up something else that's really heavy that you
don't need, you can add it in here and it will block that too.

So that's it for this one. Hopefully you've enjoyed it and got something out
of it. Make sure you like, comment, subscribe, all that cool stuff; it really
does help me out. Join the Discord; there are loads of cool people in there
now and loads of cool stuff going on. And yeah, thank you very much for
watching, and I will see you again in the next one.
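
For reference, the image-blocking tip from the end of the video could be sketched as below. The video passes a glob pattern string to `page.route`; this sketch swaps in a predicate function instead (which `page.route` also accepts) so the matching logic can be checked on its own. `is_image_url` and `block_images` are names introduced here.

```python
from urllib.parse import urlparse

# Extensions named in the video; add others (fonts, video) as needed.
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg")


def is_image_url(url: str) -> bool:
    """Return True when the URL path ends in one of IMAGE_SUFFIXES."""
    # Only the path is checked, so query strings do not affect the match.
    return urlparse(url).path.lower().endswith(IMAGE_SUFFIXES)


def block_images(page) -> None:
    """Abort every request on this page whose URL looks like an image."""
    # page.route accepts a glob string, a regex, or a predicate like this one.
    page.route(is_image_url, lambda route: route.abort())


# Hypothetical use: call block_images(page) for the list page and
# block_images(p) for each product page before calling goto().
```

Routes are registered per page, which is why the video repeats the line for both the list page and each new product page.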