There are some times when a browser is just absolutely necessary, and in this video I want to see how much it will slow down my existing Scrapy project by putting Playwright into it. So this is the project I've got. It's a very, very simple spider, but pretty effective: we basically parse out the page links and the product links, we use response.follow_all, and then we return the item data. Now, I've already run this, and we'll see that I have somewhere in the region of 1,100 requests; the item count is 9,060 — that's because there are multiple items on a page, so the request count is different — and it took 262 seconds to run. That's fairly typical, and this is also going through my proxies.

So that was pretty straightforward. What I want to do now is have a look at the documentation for scrapy-playwright, and we're going to implement it into this project really quickly. We're going to use the base settings to start with — just what it suggests here — see how well we get on, run it, and then maybe tweak it a little to see if we can make it go even quicker than it already is (or isn't). So let's find out.
The first thing we need to do is install it: pip install scrapy-playwright. I'm going to do that in this shell here — let's clear this up — pip install scrapy-playwright should do this nice and quickly, and then playwright install. I'm going to use Firefox; I've had some issues with my Playwright Chrome installation, but this should work fine.
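For reference, the setup is just these two commands (Firefox here because of my Chrome issues):

```shell
# Install the Scrapy integration, then the Firefox browser binary for Playwright.
pip install scrapy-playwright
playwright install firefox
```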
So what I want to do now is come to the download handlers in the docs. I'm going to copy these, come back to my settings file, and quickly search for "download" to make sure there's nothing already going on here — middlewares, no — cool. So let's go to the bottom and put a new section in called "scrapy playwright", so we can keep all of our scrapy-playwright settings under it. The first one is the download handlers it says we need. We also need to make sure we're using the Twisted asyncio reactor, which we are — it's the default in all new Scrapy projects now, and it's very good, very powerful.

And that's all we need to do by the looks of it — except that, unless explicitly marked, requests will be processed through the regular Scrapy download handler, so we do need to add the extra meta. We can see it right here: it says meta playwright is True. Before I add that in, though, I want to set the browser type, because I want to use Firefox and I think it defaults to Chromium, so let's put that in here.

There's one more setting we need to change: it mentioned the user agent, so we're going to change that now. It's also worth noting that if you're trying to do this on Windows, make sure you follow this section here so it will work — obviously I don't run Windows, so I can't really help you there. I'm going to comment out my existing user agent string rather than delete it, so if we go back to it we don't have to type it out again, and create a new one — I think it said to just use None, so that should work. I'm also going to put the concurrent requests setting back to the default; I think I ran the earlier test without it, I'd just been messing around with the settings.
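Pulled together, the settings changes so far look roughly like this — a sketch of my settings.py, not a verbatim copy:

```python
# settings.py — baseline scrapy-playwright setup from its README.
# Routes http/https downloads through Playwright; requests without the
# playwright meta key still use Scrapy's regular download handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# The default in new Scrapy projects, but scrapy-playwright requires it.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Firefox instead of the default Chromium.
PLAYWRIGHT_BROWSER_TYPE = "firefox"

# Comment out any hard-coded user agent and let the browser send its own.
USER_AGENT = None
```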
Okay, let's go back to our spider — products, there it is — and we need to add the scrapy-playwright meta into our requests. So I'm going to go here and, I think it was, playwright is True, like this. And we want to do the same here, because obviously we're making requests here too. Now, I don't actually know if this proxy string in the meta works with Playwright — we'll find out. Otherwise, I believe there's a section on proxies down here in the docs... yes, see, it's not going to work like that. So what we'll do is run it without the proxy for the moment; I'm going to remove that section, and we'll add it back in the way scrapy-playwright suggests later on.
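The spider change amounts to something like this. It's a sketch — the class here doesn't show the scrapy.Spider base class, and the CSS selectors are placeholders, not the real ones from my project — but the meta key is exactly what scrapy-playwright looks for:

```python
# Sketch of the spider with the playwright meta added to every request.
class ProductsSpider:  # in the real project this subclasses scrapy.Spider
    name = "products"

    def parse(self, response):
        # Pagination links — now rendered through Playwright.
        yield from response.follow_all(
            css="a.next-page",  # placeholder selector
            callback=self.parse,
            meta={"playwright": True},
        )
        # Product links go to the item callback with the same meta.
        yield from response.follow_all(
            css="a.product",  # placeholder selector
            callback=self.parse_item,
            meta={"playwright": True},
        )

    def parse_item(self, response):
        # Return the item data, unchanged from the original spider.
        yield {"url": response.url}  # placeholder fields
```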
So I want to try this now — I'm going to clear this up. I do sometimes have issues with Playwright when I install it into a virtual environment: it sometimes doesn't work and tells me I need to run playwright install even though I've already done it. So let's just see where we get to.
Okay, we can see it says the scrapy-playwright download handler is starting. I haven't changed any settings beyond that — this is just bog standard. We are getting some responses back, and it does look like we're getting items, which is good, but there's a lot of information coming through on the screen. So I'm going to stop this and run it again with the output going to pw_test.json, just so I can actually check whether those items are coming through — so we can confirm, before we run this properly, that the items we actually want are coming out. Let's run it for a few more goes... I think I can see some items; it's hard to tell, really, with so much coming back — I believe it's logging every request and response from the browser page, which is probably quite a lot.

So I'm going to stop that now, and let's cat out our pw_test.json — and we do have items. Let's format this... great, this is exactly what I wanted, exactly what we were expecting to get. This should be somewhere around 1,100 pages and, what did we say, about 9,000 items. So that's good — I like that it worked straight away out of the box, and that's one thing I really like about this package: you can do very little and it will work straight away for you. It's constantly being updated, it's very, very powerful, and it's the thing I lean on now if I need a browser for rendering and want to do it on my machine rather than going out to a third party for that sort of service. So let's go ahead and change some of the settings now.
I'm going to go to the Playwright launch options — let me go back to our settings, I should have this in my buffer, down at the bottom. Now I've got this, we can look at headless. I often don't have an awful lot of success running headless anyway — and note that headless False here is a change: the default is True, so the docs example changes it to False. I'm also going to up the timeout a little, to 60 seconds, because we're going to put proxies in in just a second. You can also connect via CDP — the Chrome DevTools Protocol — which I've done before, but it's not something I've ever really used that much. Contexts we're going to come back to in a minute, along with max contexts, so let's go down... that should be fine.

This one is interesting too, and I want to touch on it before I start running: we really want to cut down on the stuff that we load up on the page, because it takes longer and uses up more of our data. But aborting requests wasn't that successful for me on this specific site — in my testing, if I try to abort the loading of images from this site, I actually run into a few errors — so I want to do more testing with that, and I'm going to ignore it for the moment. I don't think we need to worry about anything else here.
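In settings terms, that's roughly the following. The abort function is only a sketch of what I'd like to get working — as I said, it wasn't reliable for me on this site:

```python
# settings.py — launch options: visible browser windows, 60 s timeout
# (Playwright timeouts are in milliseconds, hence 60 * 1000).
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": False,
    "timeout": 60 * 1000,
}

# Resource blocking — the part that caused me errors on this site, shown
# only as a sketch: abort image requests to cut bandwidth.
def should_abort_request(request):
    return request.resource_type == "image"

PLAYWRIGHT_ABORT_REQUEST = should_abort_request
```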
Okay, so let's come back to our project — second time I've done that key binding wrong — save it, clear, and run it again. I'm going to remove my test file first, and we should get browsers popping up and opening on the screen now. I quite like this; it's quite interesting, because you can really see what's going on: how many browser instances are running and what pages they're loading. We can see we're already going through product pages here — one closes and a couple more open. I think we ended up with eight there, or six, and I think that's the default.

So it's all loading up, and we could of course actually do something on these pages each time if we wanted to. I don't often do a lot of that, although it's good to know it's available — you can scroll or interact with the page in some way. If you're trying to scroll because you're looking to get extra items from infinite scroll, though, there's often a better way, and that's reverse engineering the API — there'll be a video link up here somewhere for you to check that method out if you want to. But when it comes to actually loading up the page to click on something or action something else on it, running it through a real browser like this is basically essential.
So this is pretty good — I'm very pleased with the way this is working. We can see the pages all loading up and disappearing; it's handling it. I don't think my system is struggling — I'm at 42–43%, and I can't remember whether that's CPU or RAM, but either way I don't have a massively powerful system and it's coping just fine. So this is pretty decent. I'm going to stop it now and go back and see how many items we had — that was 1,058, and we're looking for about 9,000 in total. So all in all, so far, that's pretty good, and you can see how quick that was compared to your standard one-page-at-a-time browser automation. And we didn't have to do anything to get it to run multiple browsers — we just installed it and ran it, which is really cool. So let's come out of this, remove that file again, and clear this up.
Now, we do need to start thinking about proxies, because if I ran that more often, or let it keep going through my native residential IP — which is obviously a high-value IP, because it's my real one — it would end up getting blocked, probably sooner rather than later. So let's go back over to the docs and find the proxy part. If I search this page for "proxy"... there it is: we can put them directly into the request, or we can run the launch options with the proxy. Yes — so we want to put them under the Playwright launch options, under "proxy", like this. We already have these launch options in settings, so I'm going to paste this in and format it nicely, then get my proxy information — I'll do that on a different screen, because I don't want you using my data, basically. Okay, that's done. I'll either blur that out or just change my password; either way, don't try it, it won't work.
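The proxy block sits inside the launch options, something like this — server, username, and password are placeholders, not my real details:

```python
# settings.py — browser-level proxy via Playwright's launch options.
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": False,
    "timeout": 60 * 1000,
    "proxy": {
        "server": "http://proxy.example.com:6060",  # placeholder host and port
        "username": "proxy-user",  # placeholder
        "password": "proxy-pass",  # placeholder
    },
}
```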
So, did I remove that file? I did. Let's try running it again and see if it still works — we should be going through our proxy now, which obviously means we can crawl more, and more often. Hmm — we are getting some errors here. Okay, I've done something wrong: I've mistyped the port number, it should be 6060. That's why it couldn't go anywhere. So now this should work. It's obviously a bit slower going through the proxy than through my own IP, but if it means the scrape actually works and we don't get IP-banned straight away, it's worth it. So now this is working, and I'm going to stop it before it runs the whole way through, because I can see that it's working.
Now I want to check out the contexts. If I come back over here and look up "context"... so, Playwright contexts: we can actually define different contexts. If we have a look here at browser contexts — and I think this is the Playwright link... okay, that didn't take me where I wanted, so let's look up contexts in here. Browser context: it basically provides a way to operate multiple independent browser sessions. So we can have multiple browser contexts for the different parts of the site we're going to — different styles of links, I suppose, is the right way to put it.

So I'm going to copy this example and put it underneath here, and I'm just going to comment these options out because I don't think we need them — and we don't need "persistent"; in fact, we could probably just remove it, but I'll comment it out. I'm going to call one of these contexts "products" and the other "search", so I'm creating two separate contexts. Then, in our spider, under the callback that returns the search pages — let me go back to the docs to remember how we put this into the meta... playwright_context — we can put that in here and say we want the context called "search" (I think that's what I called it — search and products, yes). And then under the one that goes to the parse-item callback and the product pages, I can set playwright_context to "products". So it's going to use those contexts.
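So the two contexts and the per-request meta look roughly like this. The context dicts are empty here — the per-context options I commented out would go inside them:

```python
# settings.py — two named browser contexts, one per style of page.
PLAYWRIGHT_CONTEXTS = {
    "search": {},    # per-context options (viewport, storage state, ...) go here
    "products": {},
}

# In the spider, each request picks its context through the meta dict:
search_meta = {"playwright": True, "playwright_context": "search"}
product_meta = {"playwright": True, "playwright_context": "products"}
```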
Let's try running this again, and we should see down here that it launches two startup contexts — one browser context used for the search pages and one for the product pages. This just gives us a bit more control over what happens in which part of the browser. You can see that one stayed on the search page for a bit before it was superseded — so it does work.
The other thing that access to the contexts gives us — let's go back to our settings — is this option for how many contexts we want to run, and I've had mixed results with this. Looking at max contexts: we're only defining two, so no limit is needed there. What we are going to change is this one: max pages per context. If I change it to two, what it means is we'll see fewer pages open at once, because we have our two contexts but we're only allowing two pages per context. So we should have, in theory... yeah, you can see we've got two for the products, and I think this is the search page, which is obviously loading one page at a time, so it's not going quite as quickly. So we can tweak this setting to scrape faster or slower, depending on the number of contexts we're using and the number of pages per context — I've said "context" so many times.
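That setting is a one-liner; the commented line is the other knob mentioned a moment ago:

```python
# settings.py — at most two Playwright pages open per browser context.
# With two contexts defined, that means at most four pages at once.
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 2

# There's also a cap on the number of contexts themselves:
# PLAYWRIGHT_MAX_CONTEXTS = 8
```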
Right, so let's put this at 10 and see what happens — in fact, I think when I did this before I had quite a lot of issues, so let's see what we get with 10 pages per context. Okay, we still only have, I think, eight browser pages loaded up here, so I wonder if we've hit a limit somewhere — processor cores, maybe, or some cap on how many pages we can have per context. But the idea is that you can use these settings to control how many browser pages you open per context, and you can have a different context for each part of the site you're scraping. So if you were going through categories, products, and search pages, you could have a browser context for each one and say how many pages you wanted to load for each. Obviously you're going to hit a hardware limit at some point, which I think may be what we're hitting here — I'm not entirely sure.
So I'm going to turn that back down — just comment it back out and let the system decide — and I've been leaving this running headful; let's try it as headless. I think this might fail now: an issue you get quite often when you try headless browsing is that it's spotted and detected very, very quickly... yeah, see, with these 407s it's working out very quickly that we're a headless browser and rejecting our responses — in fact, no, it isn't; that's interesting, it's working in this case. My point was that I thought it wouldn't: quite often I've found that if you run headless, especially without something like playwright-stealth, which tries to remove those flags, there's an obvious giveaway sent — or discoverable by the website — that tells it you're running a completely headless browser.

Anyway, I'm going to leave headless as False, leave most of the rest of the settings at their defaults, but keep my two separate contexts, and now run this in its entirety. Let's remove our output file, run the whole thing, and see how long it takes versus running it standard.
So let's go ahead and do that. Okay, it's successfully finished, and there are a few things I want to point out, with possibly a couple of reasons for them. The first: the Playwright run, which I need to open here, took 1,270 seconds, versus the standard version's 262 — so roughly five times slower. I don't think that's bad, considering we were loading up multiple browsers, using a lot more memory, and obviously loading a lot more resources. Somewhere around here it will tell us the response bytes, which is this many — I'll have to look up how much that is... yes, so that was about a gigabyte, whereas over here the standard run was around 175 megabytes. So this is a lot heavier on data usage, which becomes an issue when you're using proxies, because the data will cost you. However, as I said, there is the option to abort requests — blocking all the images, and from there anything else you want. That's something I want to look at and figure out how to get working properly, so it becomes less of an issue.
One thing you may have seen already: the item scrape count here is 8,387, whereas over here it's 9,060. That's a difference of a few products, and I suspect it's because the version of the site we get when we load it in a browser differs slightly from what's returned by the standard HTTP request — we might be seeing slightly different links, slightly fewer, and we're also getting fewer duplicates filtered out. So that's just something to keep in mind. Looking at the request count — the response-received count is 1,107 here, versus 1,186 there — that's eighty-odd fewer requests that got a response, and again I suspect that's down to the difference in what the browser sees.
We also have all this other information that tells you so much — look, you can see here how many images Playwright downloaded, and things will get so much easier (and quicker) once we can start blocking those. So all in all, I think this is really quite impressive. For those times when you do need a browser, this is definitely what I'll be reaching for. I've been using Scrapy so much more recently, because once you start building more complicated scraping programs of your own, you inevitably find yourself rewriting Scrapy itself — so you might as well just use it.
And it's incredibly quick to get going. I mean, if I close this out and go back to my spider — this is all of it. It was easy to write, it's hardly any code whatsoever, and it scrapes all that data very, very quickly in the case of the standard HTTP requests, and probably quickly enough in the case of the Playwright requests — and it gives you access to all these cool things. Anyway, I'm digressing ever so slightly. For me it's definitely worth using: I've used it plenty already, and I'll be using it 100% going forward whenever I need a browser for anything — and it's easy to drop into your existing Scrapy projects, as I showed you here. So thank you very much for watching; if you've enjoyed this video, here's another one with a long Scrapy project that you might be interested in watching to find out more.