22:10
Selenium Web Scraping is too Slow. Try This.
18.5K
590
2024-11-03
➡ JOIN MY MAILING LIST https://johnwr.com ➡ COMMUNITY https://discord.gg/C4J2uckpbR ➡ PROXIES https://proxyscrape.com/?ref=jhnwr ➡ WEB SCRAPING API https://hubs.li/Q043T88w0 ➡ HOSTING https://m.do.co/c/c7c90f161ff6 If you are new, welcome. I'm John, a self taught Python developer and content creator, working at Zyte. I specialize in data extraction and automation. If you like programming and web content as much as I do, you can subscribe for weekly content. All views in this video are my o...
Subtitles

Right, so in this project we're going to use selenium-driverless, and I'm going to show you how to use it asynchronously as a browser to scrape some data from a website. We're going to load up this page here, pull all of the available product links, and then visit each link and pull the information out. Now, this website doesn't strictly need this technique, but I wanted to demonstrate it to you. So we're going to be using selenium-driverless, which I talked about in my last video; the documentation is here in the GitHub, and it's a great library for when you need to use a browser to scrape. We're also going to be using an asyncio rate limiter, which is going to let us control how many windows we're able to open. I'll show you with and without it, and you'll get the idea of why we want to use something like this.

I'm using a browser for this project, and that makes it even more important to use high-quality proxies and to consider geolocation, because even with a non-detectable browser like the one I'm using here, there are still ways for antibots to find you and block you. So this video is sponsored by ProxyScrape, friends of the channel and the proxies that I use myself. We get access to high-quality, secure, fast, and ethically sourced proxies covering residential, datacenter, and mobile, with rotating and sticky session options. There are 10 million plus proxies in the pool to use, all with unlimited concurrent sessions, from countries all over the globe. I use a variety of proxies depending on the situation, but I'd recommend you start out with residential ones. Make sure you select countries that are appropriate to the site you're trying to scrape, and match your own country where possible. Also consider using sticky sessions and keeping the same proxy for 3 to 5 minutes, which is what I'm going to do here. Either way, it's still only one line of code to add to your project, and then you can let ProxyScrape handle the rest from there. Also, any traffic you purchase is yours to use whenever you need, as it doesn't ever expire. So if this all sounds good to you, go ahead and check out ProxyScrape at the link in the description below. Now, on with the

project. The first thing I'm going to do is create my virtual environment in Python in my project folder, and activate it ('act' is a shortcut for me; otherwise do it the normal way). Then we're going to pip install what we need. I think it's called selenium-driverless - it is - and then, I forget what this one is called... asynciolimiter. Okay, so let's copy that; that's the one we want. I'm also going to install rich as well, because we're going to print some stuff out to the terminal, and it makes our lives a lot easier. So I'm going to create a main file, main.py, open it in my text editor, and we're going to get started.

We're going to import what we need first. So let's do from selenium_driverless import webdriver - I know it's slightly confusing that it's called webdriver when the library is driverless, but that's just the way it is. Then, because we're going to want to find elements on the page, we want to import By from types.by, so we can say find this element by the CSS selector, or by the XPath; we need that. We're of course going to need asyncio from Python because, you know, we're going to be running this asynchronously - you'll see how many browser windows get spawned when I get to the end, and why we need that rate limiter. We're going to import os, because we're going to be using our ProxyScrape proxies for this, and I have mine stored in an environment variable on my system; I suggest you do the same - I've covered this in another video - but however you want to store them, you're going to need them. And then, of course, from asynciolimiter we're going to import the Limiter class. Right, so we can get

started now. So I'm going to leave the rate limiter out for the moment - we'll do it without, and then I'll show you why we need to put it in. But I'm going to do a little thing here: I'm going to say that my proxy is going to be equal to os.getenv - I'm just setting this up before we even look at the site, just so that I know it's done - and I'm going to use the sticky proxy; I want to use this one. And then we're going to do: if proxy is None - i.e. if it just finds nothing in my environment variables - I'm going to print 'no proxy found', and we're going to quit out of the program. This isn't essential for you if you're putting the proxy in directly; but because I'm pulling it from my environment variable, and os.getenv returns None if it doesn't find that variable, I'm going to make sure that we do find it - just in case I've typed this wrong and I start scraping on my home IP, which I don't really want to do.
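A minimal sketch of that guard, assuming the proxy URL lives in an environment variable. The variable name STICKY_PROXY here is hypothetical - use whatever name you stored yours under:

```python
import os
import sys

def get_proxy(var_name: str = "STICKY_PROXY") -> str:
    """Read the proxy URL from an environment variable, or exit the program."""
    proxy = os.getenv(var_name)  # returns None if the variable is not set
    if proxy is None:
        print("no proxy found")
        sys.exit(1)  # bail out rather than scrape from the home IP
    return proxy
```

Exiting early like this is cheap insurance: a typo in the variable name fails loudly instead of silently scraping unproxied.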

Okay, so let's get into the main part of the project now. I'm going to come back to the website, and we're going to have a look at this. First we'll open up the inspect element tool - the dev tools - because we want to find the link for all of these products here. I'm just going to hover over one of them (I'll make this full screen now), go back up, and I like this div class here: div, grid product figure, blah blah blah. You can see when I hover over it, it's got everything in it, and from there we can find the link underneath. That's what we're going to need to do, because we're going to collect all the links on the page and then visit them all asynchronously, since we want to get the product information from the product page. So we'll keep that there, and I'm going to open up one of the product pages.

Generally speaking, this part is open for you - you can get data from here however you want - but when I'm doing e-commerce sites, or anything really, I always come here and search for 'schema'. If I scroll down, we have this here - let me make it a little bit bigger so you can see it - this whole script type application/ld+json tag, which has all of this information in it. If I just copy this out and go to a JSON parser (my favourite one online), and paste this in, you can see that this is JSON data for all of the products, including - this one has the colourways, I think... oh no, it has the sizes, sorry; you can see the different sizes here - yes, and it has all of the product information. So this is really handy, a really good way to get that data out. So we're going to basically just pull that from that element, which is over here if I go back to the source. If we copy this element here, this LD+JSON - there's only one, which makes our lives really easy - we can just find it, pull the text, and it will automatically be put nice and neatly into a Python dictionary for us.
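The idea, sketched with a made-up fragment of that kind of JSON-LD (the real script tag on the site is much larger, but parses the same way):

```python
import json

# A made-up JSON-LD product blob, shaped like what sits inside
# <script type="application/ld+json"> on many e-commerce pages.
raw = """
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Tee",
  "offers": [
    {"@type": "Offer", "sku": "TEE-S", "price": "19.99"},
    {"@type": "Offer", "sku": "TEE-M", "price": "19.99"}
  ]
}
"""

data = json.loads(raw)  # the element's text parses straight into a dict
print(data["name"])                        # Example Tee
print([o["sku"] for o in data["offers"]])  # ['TEE-S', 'TEE-M']
```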

So now we know how we're going to pull the product information, let's start writing our code. We'll start with our main function. This is all going to be asynchronous, so: async def main, like so; and then - if I just put pass in for now - we have asyncio.run on the main function, like so. Cool.

So inside this main function we need to have our async with block, which is going to open the browser, do what we need to do with it, and then close it when we're done. Using a context manager when you're dealing with stuff like this is just generally a better idea, because it clears everything up for you at the end - and we'll see that selenium-driverless actually creates a new profile for us when it loads up our browser; we'll talk about that when the browser pops up. But what I am going to do is: options is equal to webdriver.ChromeOptions. We aren't going to set any options here, but I'm putting it in just because it's good to know, and you can add in options if you need to - arguments and so on for when you want to launch Chrome. It's very useful in some cases.

Now we're going to do async with - I'm going to remove this pass, it's going to get a bit confusing - async with our webdriver.Chrome, where I'm going to put options=options in anyway (obviously it isn't really going to do anything for us at the moment), and we're going to call this driver. Then we're going to await driver.set_single_proxy. Because we're using our proxy we can set it here, and this means every request that goes through this driver - every time a browser page is opened - uses our proxy. So if we open one browser window and just keep working with it, it's going to use the same proxy over and over again; and if we open a new browser context window, it's going to use our proxy too, etc. So this is what we want to do here.

Now we're going to do await driver.get, and this is the URL that we want to grab - so let's go back to this page; this is the one we want - and I'm going to put in here wait_load is equal to True. Then what I'll do is put in - I think it's asyncio.sleep; let's try this, I can't remember if it works - not 20, do 10 - and we'll await it. Cool, so let's try running this now: python main.py. We should see the browser open up, connect to this page, and load it up. There we go - done.
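Put together, the skeleton so far looks roughly like this. It's a sketch of the selenium-driverless API as used in the video, not a verified implementation, and the proxy string and URL are placeholders:

```python
import asyncio

async def main() -> None:
    # Imported inside the function so this sketch only needs the
    # selenium-driverless package when you actually run it.
    from selenium_driverless import webdriver

    options = webdriver.ChromeOptions()  # empty for now; add arguments if needed
    async with webdriver.Chrome(options=options) as driver:
        # Route every page this driver opens through the sticky proxy.
        await driver.set_single_proxy("http://user:pass@proxy.example:8080")
        # wait_load=True waits until the page has loaded before returning.
        await driver.get("https://example.com/collections/all", wait_load=True)
        await asyncio.sleep(10)  # just to look at the window for now

# asyncio.run(main())  # uncomment to run for real (needs Chrome + the package)
```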

One of the good things with selenium-driverless - and I covered this in my last video - is the fact that it uses your actual Chrome install, rather than having a separate one, rather than having that actual driver binary which controls everything and gives away that automated-browser flag. It does all of the basic cover-up stuff that you need, all the basic stealth stuff, and it makes things much easier. It does everything through the CDP - the Chrome DevTools Protocol, there we go, that's what it's called - and basically it's just a much more modern way of doing it. Selenium itself and Playwright are really just fully focused on testing, so they don't care about all of this stuff, whereas people have adapted them to give you these sorts of things, which work very, very well and are much better. So that's why we're using selenium-driverless. Okay, so I'm going to remove

the sleep, and now we can move on. So what we want to do here is find all the elements. I'm going to say our products is going to be equal to await driver.find_elements, and in here we want to do By.CSS, and give it the CSS selector - which I think I've still got open here; yep, it's this one, copy that - div dot whatever that is. This is going to give us all of the elements that match that on the page. This could be anything for whatever you're scraping: find what the element's called, do a little bit of testing, figure it out, print them all out, see what you're getting. Maybe you need to load another page, maybe you need to scroll - all of that can be done. I'm just going to keep it a bit more simple on this one: we're going to grab all of the product links it can find right away, for this proof of concept.

So now I'm going to create a list of URLs, and then we're going to loop through the products. So we do for p in products - let's put this in the middle of the screen - and we want data is equal to await (because we're of course in async here) p.find_element. What we're doing is saying: for each element that you found here, I'm going to look for something else inside it - and that is of course By.CSS (that needs to be a capital B), and we want to find the a tag here. Now, we could probably make this CSS selector a bit better, but you know, I've done it this way, so we'll be fine. Then we're going to do link is equal to await data.get_dom_attribute, and we want the href here, and then urls.append(link).

So why am I doing it this way - why am I not clicking on the links and going through? Well, because I want to do this as quickly as possible; I want to do it asynchronously. I'm going to create a new browser context for every link, open them all up together, and then be able to visit all of those pages simultaneously, as opposed to having to do it one by one, waiting to go through - and that's where the limiter is going to come in; I'll show you that, though. So I'm going to go ahead now and just print - we'll do print await urls; I think that might work. Let's load it up. So we should load up the full page and... oh yeah, I've given await a list - you can't do that, my bad. Okay, now we'll run that, and it will have to be blocking; that's fine, because this is just for demonstration - we're going to create tasks for all these links in just a minute. So let's just make sure these are actually product links - they are, we can see them there. Great. Right, let's clear that up and come back to our code here.
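The link-gathering step can be sketched like this. It's a hedged sketch of the selenium-driverless calls described above: the default selector is a stand-in for whatever the product card class actually is on your target site, and I use By.CSS_SELECTOR where the video says By.CSS (names may differ by library version):

```python
import asyncio

async def collect_links(driver, selector: str = "div.grid-product") -> list[str]:
    """Collect the href of the first <a> inside each matching product card."""
    # Imported inside the function so this sketch only needs the
    # selenium-driverless package when you actually run it.
    from selenium_driverless.types.by import By

    urls: list[str] = []
    products = await driver.find_elements(By.CSS_SELECTOR, selector)
    for p in products:
        # Each card holds the product link as an anchor tag underneath it.
        a = await p.find_element(By.CSS_SELECTOR, "a")
        link = await a.get_dom_attribute("href")
        urls.append(link)
    return urls
```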

So now that we've got all of these URLs, we're going to use a coroutine and tasks to actually go through them all. To create a task, we need another function, because what we do with a task is say: run this async coroutine with this piece of data. So I'm going to say: run this task - which is going to load the page and pull the data that I want - for every one of these URLs; it will create all the tasks for us and run them all asynchronously. So: async def, and I'm going to call this one get_data, and in here we're going to pass in the driver and also the URL.

This is where we're going to create our new context. A new context, in this instance, is essentially a new browser window; because we want to scrape as many pages as we can, we're going to need that context to load them up. Now, I think when you create a new context you do kind of create a fresh session, so you don't have all of the browser cookies and everything that we loaded up previously - I think you can pass them through to each context if you want to. I'm going to do it this way because I know that it works like this; that's just something to bear in mind. So we want to do await driver.new_context, and then await new_context.get with the URL that we pass in.

Now from here we want to say that our schema is going to be equal to our new_context.find_element - I should really have put type hints in this so I had the completions - and By, capital B, By.CSS, and now we need to put in the CSS for that element that was in the view source, which I've lost... here it is, this one: script type. That's a pretty standard CSS selector - script, type is equal to... not that one; let's grab the real one. Lovely copy-and-paste skills. There we go. You do need the single quote marks here because of the characters in the attribute value, otherwise it won't know what you're on about. Let's remove that. Cool.

Then I'm just going to print - and we need an await here - .text. If you're ever unsure what the library thinks is async, just stick an await in front of it and see what happens. Then we'll do await new_context.close, because we want to close that context, that browser window, once it's opened and done with - so we can move on with our lives, essentially, and not have a shitload of browser windows open causing us issues. Cool, so now we need to move on to our tasks.
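That get_data coroutine might look roughly like this - again a sketch of the selenium-driverless API as I understand it, with json.loads added at the end since, as noted earlier, the script tag's text parses straight into a dictionary:

```python
import asyncio
import json

async def get_data(driver, url: str) -> dict:
    """Open `url` in a fresh browser context and pull the JSON-LD schema."""
    # Imported inside the function so this sketch only needs the
    # selenium-driverless package when you actually run it.
    from selenium_driverless.types.by import By

    context = await driver.new_context()  # a fresh window, no shared cookies
    try:
        await context.get(url)
        schema = await context.find_element(
            By.CSS_SELECTOR, "script[type='application/ld+json']"
        )
        text = await schema.text   # element properties are awaited too
        return json.loads(text)    # JSON-LD text straight into a dict
    finally:
        await context.close()      # always close the extra window when done
```

The try/finally mirrors the "close it once it's done with" point above: even if the page times out, the window gets closed.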

As I said, we needed that function I just wrote - the get_data function - and we need to run it for every URL. So I'm going to call this tasks; it's going to be equal to - I'm going to use a list comprehension here, but basically you just need to end up with a list of tasks - get_data, and we need to give it the driver, for url in urls, and I need to give it the url as well, obviously; like so. Then we can do await, and we use asyncio and we want to do .gather, so we can gather all these tasks up, and we give it the list, like so. This says: all of these URLs need to be run with this function, and we pass them off to asyncio.gather to run them all within the async loop. So it's going to work for us.
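The pattern, shown with a stand-in coroutine so it runs anywhere; in the real script the comprehension calls get_data(driver, url) instead of this dummy:

```python
import asyncio

async def fake_get_data(url: str) -> str:
    # Stand-in for the real get_data(driver, url) coroutine.
    await asyncio.sleep(0)  # yield to the event loop, like real I/O would
    return f"scraped {url}"

async def main() -> list[str]:
    urls = ["/product/a", "/product/b", "/product/c"]
    tasks = [fake_get_data(url) for url in urls]  # one coroutine per URL
    return await asyncio.gather(*tasks)           # run them all concurrently

results = asyncio.run(main())
print(results)  # ['scraped /product/a', 'scraped /product/b', 'scraped /product/c']
```

Note that gather takes the coroutines unpacked (the star), and returns results in the same order as the input list regardless of which finished first.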

Now, I didn't show you how many links there were, but we'll see in just a second - okay, so this is why we're going to need the limiter. Let's run this. If I've done everything right, we're about to spawn a million Chrome windows and it's probably going to crash - well, it will crash, I assume. So for every link that it's found, it's opening a Chrome window. How many is that? I don't know. Is it going to work, is it going to crash? It's probably going to crash, or it's going to time out, because it's not going to be able to load these all up quickly enough, so we'll get a timeout. Theoretically this would work, but it's just not the best way to do it. I mean, it's going to time out in just a minute - I think it's a 30-second timeout - one of the pages won't load and we'll get an error. But you can kind of see that they are starting to load up; if I try to make one full screen, it's kind of loading, sort of working... and - there we go, timeout. So that's not going to work. Fun, though.

So this is where the limiter comes in. I'm going to create a limiter up here - I'm going to call this rate_limiter - and this is a really easy way to limit anything that you put into your async loop. We're going to create a Limiter, and you get an option, so I'm going to do one every 5 seconds to start with, and we'll see how we get on. Then, inside the function that we want to limit, I'm going to do await rate_limiter.wait(). This is going to control how many of these can spawn within our asyncio.gather - sorry, jumping around way too much there - it's going to control how many can spawn within this time frame.
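If you want to see the idea behind the limiter, it's small enough to sketch by hand. This is a minimal stand-in for the library's Limiter(rate).wait(), where rate is calls per second (so "one every 5 seconds" would be a rate of 1/5); the demo uses a fast rate just so it finishes quickly:

```python
import asyncio
import time

class Limiter:
    """Minimal rate limiter: allow at most `rate` wait() calls per second."""
    def __init__(self, rate: float) -> None:
        self._interval = 1.0 / rate  # seconds between permitted calls
        self._next_ok = 0.0          # earliest time the next call may proceed
        self._lock = asyncio.Lock()

    async def wait(self) -> None:
        async with self._lock:       # serialize so each caller gets its own slot
            now = time.monotonic()
            delay = self._next_ok - now
            self._next_ok = max(now, self._next_ok) + self._interval
        if delay > 0:
            await asyncio.sleep(delay)  # sleep until our slot comes up

async def demo() -> float:
    limiter = Limiter(50)  # 50 per second, just to keep the demo fast
    start = time.monotonic()

    async def task() -> None:
        await limiter.wait()  # each task pauses until it gets a slot

    await asyncio.gather(*(task() for _ in range(5)))
    return time.monotonic() - start

elapsed = asyncio.run(demo())
print(f"{elapsed:.3f}s")  # roughly 0.08s: five calls spaced 0.02s apart
```

In the real script you would await rate_limiter.wait() at the top of get_data, so new contexts are spawned one per interval even though asyncio.gather launches all the coroutines at once.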

So I'm going to save this now and we're going to run it - python main.py - and now we're going to see that they load up much slower: one every 5 seconds. We might find that this is too slow. So we're going to load up one, and is this one going to be finished within 5 seconds, and do we get the data back... Oh - I never awaited it, my bad; this should have been awaited. Okay, that's why that failed, so just run it again and we should be fine this time.

Let me move this over so we get more windows over here - we're not really too worried about the data coming back. Okay, so it's been a bit slower, so this might time out... oh, there - it did work. So I'm doing one every 5 seconds, so these ones should start closing by the time we open up new ones, and we can see that our data is coming out over here. If I made this one full screen, for example, this is the information that we're after. I didn't put rich in there - I should have done, because it would be nice and easy and you'd be able to see it much more easily - but now we can kind of see, you know, that it's spawning one up every 5 seconds and we're doing it a bit quicker, a bit more async, and you can tweak this. I'm going to close this, we'll come back to our code, and I'm going to tweak this - you can make it quicker or slower depending on your network connection and what you need to do. I'm also going to do an import: we'll do from rich import print, to make our lives a little easier, and we'll make this number smaller. So now we should have one window every 3 seconds, and we should be able to see the data much more neatly coming through on the left-hand side, and you can kind of see how it all comes together.

So this is possibly one of the better ways to run a scrape with a browser and make it not so impossibly slow, but it does come with a lot more complications, because you need to have a good handle on Python's asynchronous runtime - its async loop - how to utilize it best, what you can then do with it going forwards, and how it's all going to work. So this one's taking a little bit longer to load up the pages, so maybe one every 3 seconds is too many... it should start ticking away now... yeah, see, we're starting to time out, so one every 3 seconds isn't very good. Or you could extend the timeout - set the timeout in your code; you can have the timeout extended up here somewhere. But

hopefully you kind of understand the concept here: we get all the links using selenium-driverless, which gives us a really good chance of beating blocking on sites; we use the proxies from ProxyScrape, which helps us too, because we can utilize good, strong IP quality there and get through; and then we're basically using tasks to go ahead and grab all of the data. I'm not doing anything with the data here, but from this point you've got it in a variable and you could easily do whatever it is you needed to do with it - there are async-capable libraries for all sorts of databases, as well as for saving to files, so you could easily fit that in within the loop too.

So, yeah, that's going to cover it for this one. Hopefully you got something out of it - let me know down below what you think, go ahead and leave me a like and a comment, and also subscribe as well; it really helps me. But if you want to know how I scrape without using a browser - and how I'd probably scrape this site without all this, and quicker - you want to watch this video next.