Playwright Isn't THAT slow for Scraping, if you do this
2024-06-29
➡ E-commerce Data Extraction Specialist https://johnwr.com ➡ COMMUNITY https://discord.gg/C4J2uckpbR https://www.patreon.com/johnwatsonrooney ➡ PROXIES https://nodemaven.com/?a_aid=JohnWatsonRooney ➡ HOSTING https://m.do.co/c/c7c90f161ff6 If you are new, welcome. I'm John, a self taught Python developer working in the web and data space. I specialize in data extraction and automation. If you like programming and web content as much as I do, you can subscribe for weekly content. ⚠ DISCLAIME...
Subtitles

There are some times when a browser is just absolutely necessary, and in this video I want to see how much it will slow down my existing Scrapy project by putting Playwright into it. So this is the project I've got. It's a very, very simple spider, but pretty effective: we basically parse out the page links and the product links, we use response.follow_all, and then we return the item data out from that. Now, I've already run this, and we'll see that I have somewhere in the region of 1,100 requests; the item count is 9,060, because there are multiple items on a page, so the two counts are different. It took 262 seconds to run, which is fairly typical — and this is also going through my proxies. So there you go.

That was pretty straightforward. So what I want to do now is have a look at the documentation for scrapy-playwright, and we're going to implement it into this project really quickly. We're going to use the base settings to start with — just what it suggests here — and we'll see how well we get on. We'll run it, and then maybe we'll tweak it a little to see if we can make it go even quicker than it already is (or isn't). So let's find out.

The first thing we need to do is install it: pip install scrapy-playwright. I'm going to do that in this shell here — let's clear this up — and pip install scrapy-playwright should do it nice and quickly. Then playwright install. I'm going to use Firefox; I've had some issues with my Playwright Chrome installation, but this should work fine.

So what I want to do now is come to the download handlers in the docs. I'm going to copy these, come back to my settings file here, and quickly search for "download" to make sure there's nothing already going on — middlewares? No? Cool. So let's go to the bottom and put in a new section called "scrapy playwright", so we can keep all of our scrapy-playwright settings under it. The first one is the download handlers that it says we need. We also need to make sure we're using the Twisted asyncio reactor, which we are — it's the default in all new Scrapy projects now, and it's very good, very powerful. And that's all we need to do, by the looks of it — except that "unless explicitly marked, requests will be processed by the regular Scrapy download handler", so we do need to add the extra meta. We can see it right here: it says meta playwright is True. Before I add that in, though, I want to add this, because I want to use Firefox — I believe it defaults to Chromium.

So let's put that in here. There's one more setting we need to change: it mentioned the user agent, so we're going to change that now. It's also worth a look if you're trying to do this on Windows — make sure you follow that section of the docs so it will work, because obviously I don't run Windows, so I can't really help you there. So I'm going to remove this — in fact, we'll just comment it out and create a new one, so if we go back to it we can just uncomment the string rather than retyping it. I think it said to just use None, so that should work. I'm also going to put the concurrent requests setting back to the default — I think I ran it previously without that; I'd been messing around with the settings. Okay, so let's go ahead and go back to our spider.
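At this point the new section of my settings file looks something like this (a sketch following the scrapy-playwright README):

```python
# settings.py -- scrapy-playwright section (sketch)

# Route http/https downloads through the Playwright handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# Required by scrapy-playwright; the default in new Scrapy projects anyway.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Use Firefox instead of the default Chromium.
PLAYWRIGHT_BROWSER_TYPE = "firefox"

# Let the browser supply its own User-Agent header.
USER_AGENT = None
```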

Products — there it is. We need to add this playwright meta to our requests, so I'm going to go here and — what did it say it was? — playwright is True, like this. Okay, and we want to do the same for this one, because obviously we're making requests here too.

Now, I don't actually know whether the proxy string in the request meta works with Playwright — we'll find out. Actually, I believe there's a section on proxies down here... yes, see, it's not going to work like that. So what we'll do is run it without the proxy for the moment; I'm going to remove this section here, and we'll add it back in later the way Playwright suggests. So I want to try this now — I'm going to clear this up. Now, I do sometimes have issues with Playwright when I install it into a virtual environment: it sometimes tells me I need to do playwright install even though I've already done it. So let's just see where we get to.

Okay, we can see it says "scrapy-playwright: starting download handler" — and I haven't changed any settings; this is just bog standard. So we are getting some responses back, but I can't quite tell. It does look like we're getting items, which is good, but there's a lot of information coming through on the screen. So I'm going to stop this and run it again with an output of pw_test.json, just so I can actually see whether those items are coming through — so we can check, before we run this properly, that the items we actually want are coming out. Let's run it for a few more goes... I think I might see some items; it's hard to tell, really, because there's a lot coming back. I believe it's logging every request and response from the browser page, which is probably quite a lot. So I'm going to stop that now, and let's cat out our pw_test.json — and we do have items. Let's format this: great, this is exactly what I wanted, exactly what we were expecting to get. It should come to somewhere around 1,100 pages and 9,000-odd items.

So that's good — I like that it worked straight away, out of the box, and that's really important. That's one thing I really like about this package: you can do very little and it will work straight away for you. It's been updated very recently — I believe it's constantly being updated — and it's very, very powerful. It's the thing I lean on now if I need a browser for rendering and want to do it on my own machine, rather than going out to a third party for that sort of service.

So let's go ahead and change some of the settings. I'm going to go to the Playwright launch settings in the docs and — let me go back to our settings; I should have this in my buffer — paste them in down at the bottom. Now that I've got this, we can look at headless. I often don't have an awful lot of success running headless... oh, headless is false — sorry, I already had that setting; the default is True, so the example changes it to False. I'm also going to up the timeout a little — I'm going to change it to 60 — because we're going to put proxies in in just a second. You can also connect via CDP, the Chrome DevTools Protocol; I've done that before, but it's not something I've ever really used much. Then there's contexts — we'll come back to contexts in a minute — and the max contexts setting. So let's go on down; the rest should be fine.

go down um should be fine this one is an interesting one too um I just want to

touch on this before I go back and start running it is that we really want to cut

down on the stuff that we load up on the page because obviously it takes longer

and uses up more of our data but this wasn't that successful for me on this

specific site so I want to do more testing with that I found that um if you

try and abort loading of the images from this site through my testing I actually

come into a few errors so I'm going to ignore that for the moment and I think

we don't need to worry about anything else here I'm going to go ahead we don't

need to worry about the page okay so let's come back to our project second
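For reference, the image-blocking option described above is scrapy-playwright's `PLAYWRIGHT_ABORT_REQUEST`: a predicate called for each request the page makes, returning True to abort it. A sketch (and, as noted, aborting images broke this particular site, so treat it as something to test per site):

```python
# settings.py -- abort matching in-page requests before they are fetched (sketch)
def should_abort_request(request):
    # Skip images (by resource type or common image extensions)
    # to save bandwidth, which matters when paying for proxy data.
    return (
        request.resource_type == "image"
        or request.url.rsplit(".", 1)[-1].lower() in {"png", "jpg", "jpeg", "webp", "gif"}
    )

PLAYWRIGHT_ABORT_REQUEST = should_abort_request
```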

(That's the second time I've done that key binding wrong.) Let's save it, clear, and run it again — I'm going to remove my test file first. We should get the browser popping up and opening on the screen now, and I quite like this; it's quite interesting, because you can really see what's going on: how many browser instances are running and what pages they're loading. We can see we're already going through product pages here — one closes and a couple more open. I think we ended up with eight there, or six; I think that's the default. So it's all loading up, and we could, of course, actually do something on these pages each time if we wanted to. I don't often do a lot of that, although it's good to know it's available — you can scroll or interact with a page in some way. But if you're trying to scroll because you're looking to get extra items from an infinite scroll, there's often a better way, and that's reverse engineering the API — there'll be a video link up here somewhere if you want to check that method out. When it comes to actually loading up the page to click on something or action something else on the page, though, running it through a Playwright browser like this is almost essential.

Basically, this is pretty good — I'm very pleased with the way this is working. We can see the pages all loading up and disappearing; it's handling it. I don't think my system's struggling — my microphone's just in the way — I'm at 42, 43%, and I can't remember whether that's CPU or RAM, but either way I don't have a massively powerful system and it's handling it just fine. So this is pretty decent. I'm going to stop it now and go back and see how many items we had: that was 1,058, and we're looking for roughly 9,000 in total. All in all, so far that's pretty good, and you can see how quick that was compared to your standard one-page-at-a-time browser. And we didn't have to do anything to get it to run multiple browsers — we just installed it and ran it, which is really cool. So let's come out of this, remove that file again, and clear this up.

Now, we do need to start thinking about proxies, because if I run this more often, or let it keep going through my native residential IP — which is obviously a high-value IP, because it's my real one — it will end up getting blocked, probably sooner rather than later. So let's go back over to the docs and find the proxy part. If I search this page for "proxy"... there it is: we can either put them directly into the request, or we can run the launch options with the proxy. Yes — we want to put them under the Playwright launch options, under "proxy", like this. Okay, so that's what I'm going to do.
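Merged into the launch options, an authenticated proxy looks roughly like this (all of the host, port, and credentials below are placeholders):

```python
# settings.py -- launch options including an authenticated proxy (sketch)
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": False,
    "timeout": 60_000,  # ms
    "proxy": {
        "server": "http://proxy.example.com:6060",  # placeholder endpoint
        "username": "YOUR_USERNAME",
        "password": "YOUR_PASSWORD",
    },
}
```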

We already have these launch options here, so I'm going to paste this in and format it nicely. Then I'm going to get my proxy information — I'll do that on a different screen, because I don't want you using my data, basically. Okay, that's done; I'll either blur that out or just change my password — either way, don't try it, it won't work. So, did I remove that file? I did. Let's try running it again now; if it still works, we should be going through our proxy, which obviously means we can crawl more, and more often... Hmm — okay, we're getting some errors here, so I've done something wrong. I've mistyped the port number — it should be 6060; that's why it couldn't get anywhere. Now this should work. It's obviously a bit slower going through the proxy than through my own IP, but if it means it actually works and we don't get IP-banned straight away, it's worth it. Now that this is working, I'm going to stop it before it runs the whole way through, because I can see that it's working.

so now I want to check out the contexts so if I come back over here and we look

at context context so playright contexts so we can

actually Define different contexts and if we have a look here in um browser

context and then I think this is the playright link okay that didn't take me

to where I wanted it to so let's look up contexts in here browser context so it's

basically uh provides a way to operate multiple independent browser uh sessions

so we can actually say we can have multiple Conta browser contexts for the

different parts of the site that we're going to or different like

um different style of links I suppose is the right answer so what I'm going to do

is I'm going to copy this and we're going to come and put this underneath

here and I'm just going to comment these out because I don't think we need them

and we don't need persistent in fact we can probably just remove this I'll just

comment it out so what I'm going to do is I'm going

to call one of these product uh products and one of these can

be search so I'm creating two separate
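In the settings, the two named contexts can be as simple as this (each inner dict takes the keyword arguments of Playwright's `browser.new_context()`; empty dicts just mean defaults):

```python
# settings.py -- two independent browser contexts (sketch)
PLAYWRIGHT_CONTEXTS = {
    "products": {},  # kwargs for browser.new_context(); defaults are fine here
    "search": {},
}
```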

So when we come to our spider, under the callback that returns the search pages I can — let me just go back to the docs: how did we put this into the meta? Let me remember... playwright_context. So we can put this in here, and for this one I'll say we want the context called "search" — I think that's what I called it; search and products, yes. And under this one, which goes to the parse-item callback and the product pages, I can set playwright_context to "products". So it's going to use those two contexts.
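Spider-side, each request then names its context through the meta — roughly (selectors hypothetical again):

```python
def parse(self, response):
    # Search/pagination pages use the "search" context...
    yield from response.follow_all(
        css="a.next-page",
        callback=self.parse,
        meta={"playwright": True, "playwright_context": "search"},
    )
    # ...while product pages use the "products" context.
    yield from response.follow_all(
        css="a.product",
        callback=self.parse_product,
        meta={"playwright": True, "playwright_context": "products"},
    )
```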

Let's try running this again, and we should now see down here that we're launching two startup contexts — two different browser contexts, one used for the search pages and one for the product pages. This just gives us a little more control over what happens in which part of the browser. You can see that one stayed on the search page for a bit before it was superseded — so it does work.

The other thing that access to contexts gives us — let's go back to our settings — is the option of how many contexts to run. I've had mixed results with this, so let me go back to max contexts: we're only specifying two, and we'll set no limit here. What we are going to change is this one — max pages per context. If I change this to two, I think what it means is we'll see fewer browser pages opened up, because we have our contexts but we're only allowing two pages per context. So in theory... yes, you can see we've got two for the products, and I think this is the search page, which is obviously loading one page at a time, so it's not going quite as quickly. So we can tweak this setting to scrape faster or slower, depending on the number of contexts we're using and the number of pages per context. (I've said "context" so many times.)
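That cap is a single setting, with a related one for the contexts themselves:

```python
# settings.py -- limit concurrent pages (tabs) per browser context (sketch)
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 2

# A related knob, left unset here so any number of contexts is allowed:
# PLAYWRIGHT_MAX_CONTEXTS = 8
```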

Right, so let's put this at 10 and see what happens — in fact, I think when I did this before I had quite a lot of issues, so let's see what we get with 10 pages per context. Okay... we still only have, I think, eight browser pages loaded up here, so I wonder if we've hit a limit somewhere — processor cores, maybe, or perhaps a limit on how many pages we can have per context. But the idea is that you can use these settings to control how many browser pages you open per context, and you can have different contexts for different parts of the site you're visiting. So if you were going through categories, products, and search pages, you could have a browser context for each one and say how many pages to load for each. Obviously you're going to hit a hardware limit at some point, which I think may be what we're hitting here — I'm not entirely sure.

So I'm going to turn this back down — I'll just comment it back out and let the system decide. I've been leaving this running head-full; let's try it headless. I think this might fail now — this is an issue you get quite often when you try headless browsing: it gets spotted and detected very, very quickly... yeah, see, with the 407s it's — actually, no, it isn't; that's interesting, it's working in this case. My point was that I thought it wasn't going to work: quite often I've found that if you run headless — especially if you don't run something like playwright-stealth, which tries to remove those flags — there's an obvious giveaway that's sent, or can be found by the website, telling it that you're running a completely headless browser.

So anyway, I'm going to leave headless as False, keep most of the rest of the settings at their defaults, and keep my two separate contexts. Now I'm going to remove the output file and run the whole thing in its entirety, and we'll see how long it takes versus the standard run.

Let's go ahead and do that. Okay — it's successfully finished, and there are a few things I want to point out, and possibly suggest a couple of reasons for. The first is that the Playwright run took 1,270 seconds, versus only 262 for the standard version — so roughly five times slower. I don't think that's bad, considering we were loading up multiple browsers, using a lot more memory, and obviously downloading a lot more. Somewhere around here it tells us the response bytes: this many, which — I'll have to look it up — yes, that's about a gigabyte, whereas over here on the standard run it was about 175 megabytes. So this is much heavier on data usage, which becomes an issue when you're using proxies, because that data will cost you. However, as I said, there is that abort option for blocking all the images — and you can block anything you want with it — and that's something I want to look at and get working properly, so this becomes less of an issue.

One thing you may have seen already is that the item scrape count here is 8,387, whereas over here it's 9,060. That's a difference of a few products, and I suspect it's because the version of the site we get when we load the browser differs slightly from what's returned by the standard HTTP request — so we might be seeing slightly different links, perhaps slightly fewer, and we're also getting fewer duplicates filtered. Just something to keep in mind. If we look at the request count — the responses-received count — it's 1,107 here, where before it was 1,186, so about 80 fewer requests got a response; again, I suspect that's down to the difference in what the browser sees.

We also have all this other information here that tells you so much — look: you can see the number of images Playwright downloaded, which was this many, and it's going to make things so much easier (and quicker) once we can start blocking those. So, all in all, I think this is really quite impressive. For those times when you do need to use a browser, this is definitely what I'll be reaching for. I've been using Scrapy so much more recently because, once you start building more complicated scraping programs of your own, you inevitably find yourself effectively rewriting Scrapy itself — so you might as well just use it. And it's incredibly quick to get going.

I mean, if I close this out and go back to my spider: all of this was easy to write, and it's hardly any code whatsoever, yet it's able to scrape all that data very quickly in the case of the standard HTTP requests — and probably quickly enough in the case of the Playwright requests — while giving you access to all these cool extras. Anyway, I'm digressing ever so slightly. For me it's definitely worth using; I'll be using it 100% going forward — I've used it plenty already — whenever I need a browser for anything, and it's easy to drop into your existing Scrapy projects, as I showed you in this one. So thank you very much for watching. If you've enjoyed this video, here's another one — a long Scrapy project that you might be interested in watching to find out more.