Right, so in this project we're going to use selenium-driverless, and I'm going to show you how to use it asynchronously as a browser to scrape some data from a website. We're going to load up this page here, pull all of the available product links, and then visit each link and pull the information out. Now, this website doesn't strictly need this technique, but I wanted to demonstrate it to you. We're going to be using selenium-driverless, which I talked about in my last video; the documentation is here on GitHub, and it's a great library for when you need to use a browser to scrape. We're also going to be using the asynciolimiter rate limiter, which lets us control how many windows we're able to open at once. I'll show you with and without it, and you'll get the idea of why we want something like this. I'm using a browser for this project, and that makes it even more important to use high-quality proxies and to consider geolocation, because even with a non-detectable browser like the one I'm using here, there are still ways for antibots to find you and block you.
This video is sponsored by Proxy Scrape, friends of the channel and the proxies that I use myself. We get access to high-quality, secure, fast, and ethically sourced proxies covering residential, datacenter, and mobile, with rotating and sticky session options. There are 10 million plus proxies in the pool, all with unlimited concurrent sessions, from countries all over the globe. I use a variety of proxies depending on the situation, but I'd recommend you start out with residential ones. Make sure you select countries that are appropriate to the site you're trying to scrape, and match your own country where possible. Also consider using sticky sessions and keeping the same proxy for three to five minutes, which is what I'm going to do here. Either way, it's still only one line of code to add to your project, and then you can let Proxy Scrape handle the rest. Any traffic you purchase is yours to use whenever you need, as it doesn't ever expire. So if this all sounds good to you, go ahead and check out Proxy Scrape at the link in the description below. Now, on with the project.
The first thing I'm going to do is create my virtual environment in Python in my project folder and activate it — `act` is a shortcut for me, otherwise do it the normal way — and then we're going to pip install what we need. I think it's called selenium-driverless... it is, and then the rate limiter, asynciolimiter — let's copy that. I'm also going to install rich as well, because we're going to print some stuff out to the terminal and it makes our lives a lot easier. Then I'll create a main file, main.py, and open it in my text editor.
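The setup so far might look like this in the shell — a sketch; I'm assuming the pip package names are `selenium-driverless` and `asynciolimiter` as discussed:

```shell
# create and activate a virtual environment, then install the dependencies
python -m venv venv
source venv/bin/activate        # on Windows: venv\Scripts\activate
pip install selenium-driverless asynciolimiter rich
```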
Let's get started by importing what we need. From selenium_driverless we import webdriver — I know it's slightly confusing that it's called webdriver when the library is driverless, but that's just the way it is. Then, because we're going to want to find elements on the page, from selenium_driverless.types.by we import By, so we can say find this element by the CSS selector, or by the XPath. We're of course going to need asyncio from Python, because we're going to be running this asynchronously — you'll see how many browser windows get spawned when I get to the end, and why we need that rate limiter. We're going to import os, because we're using our Proxy Scrape proxies for this and I have mine stored in an environment variable on my system; I suggest you do the same, I've covered this in another video, but however you want to use them, you're going to need them. And then from asynciolimiter we import the Limiter class.
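Put together, the import block might look like this — a sketch, assuming the module paths described here (`selenium_driverless`, `selenium_driverless.types.by`, and `Limiter` from `asynciolimiter`):

```python
import asyncio  # we'll run everything in the async event loop
import os       # to read the proxy from an environment variable

from selenium_driverless import webdriver    # the browser driver
from selenium_driverless.types.by import By  # By.CSS, By.XPATH, etc.
from asynciolimiter import Limiter           # the rate limiter class
```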
Now we can get started. I'm going to leave the rate limiter out for the moment — we'll do it without, and then I'll show you why we need to put it in. But first, a little thing here: my proxy is going to be equal to os.getenv(...). I'm setting this up before we even look at the site just so I know it's done, and I'm going to use the sticky proxy — I want to use this one. Then: if proxy is None — i.e. if it just finds nothing in my environment variables — print "no proxy found" and quit out of the program. This isn't essential for you if you're putting the proxy in directly, but because I'm pulling it from my environment variable, if it doesn't find that variable it returns None, so I'm making sure we do find it, just in case I typed it wrong and start scraping on my home IP, which I really don't want to do.
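That guard can be sketched as a small helper — the environment variable name here is a stand-in for whatever you stored your proxy under:

```python
import os
import sys

def load_proxy(var_name: str = "PROXY") -> str:
    """Return the proxy URL from the environment, or bail out loudly."""
    proxy = os.getenv(var_name)  # returns None if the variable isn't set
    if proxy is None:
        print("no proxy found")
        sys.exit(1)  # better to stop than to scrape from the home IP
    return proxy
```

The hard exit makes the failure impossible to miss, which is the whole point of the check.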
OK, let's get into the main part of the project now. I'm going to come back to the website, and we're going to open up the inspect element tool, the dev tools, because we want to find the link for each of these products here. I'll hover over one of them, make this full screen, and go back up: I like this div class here, div grid product figure, blah blah blah. You can see when I hover over it, it's got everything in it, and from there we can find the link underneath. That's what we're going to need, because we're going to collect all the links on the page and then visit them all asynchronously, since we want the product information from the product pages. So we'll keep that there, and I'm going to open up one of the product pages.
Generally speaking, you can get the data from here however you want, but when I'm doing e-commerce sites — or anything, really — I always come here and search for "schema". If I scroll down, we have this script type="application/ld+json" element — let me make it a bit bigger so you can see — which has all of this information in it. If I copy it out and paste it into my favorite online JSON parser, you can see this is JSON data for the product, including — I think those are the colorways... no, the sizes, sorry, you can see the different sizes here — and all of the product information. So this is a really handy, really good way to get that data out. We're basically just going to pull that from the element, which is over here if I go back to the source. If we copy this element, this ld+json — there's only one, which makes our lives really easy — we can just find it, pull the text, and it will drop nice and neatly into a Python dictionary for us. So now we know how we're going to pull the product information.
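Once we have the element's text, turning it into a dictionary is just json.loads. Here's the idea, with a trimmed, made-up product record in the same application/ld+json shape (the real string comes from the element's text):

```python
import json

# a made-up stand-in for what the ld+json script tag's text might contain
schema_text = """
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Trainer",
  "offers": [
    {"@type": "Offer", "sku": "EX-8", "price": "89.99", "priceCurrency": "GBP"}
  ]
}
"""

product = json.loads(schema_text)  # straight into a Python dict
print(product["name"], product["offers"][0]["price"])  # → Example Trainer 89.99
```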
Let's start writing out our code with our main function. This is all going to be asynchronous, so async def main, like so, with a pass for now, and then asyncio.run(main()) at the bottom. Cool. Inside this main function we need an async with block, which is going to open the browser, do what we need to do with it, and then close it when we're done. Using a context manager when you're dealing with stuff like this is just generally a better idea, because it clears everything up for you at the end — and we'll see that selenium-driverless actually creates a new profile for us when it loads up our browser; we'll talk about that when the browser pops up. What I am going to do is options = webdriver.ChromeOptions(). We aren't going to set any options here, but I'm putting it in because it's good to know you can add options if you need to — arguments and so on for when you want to launch Chrome. It's very useful in some cases.
Now the async with — I'll remove that pass, it's going to get a bit confusing. We do async with webdriver.Chrome(options=options) as driver; obviously those options aren't really doing anything for us at the moment, but I'll put them in anyway. Then we await driver.set_single_proxy(proxy). Because we're using our proxy, we can set it here, and this means every request that goes through this driver — every time a browser page is opened — uses our proxy. If we open one browser window and just keep working with it, it would use the same proxy over and over anyway, but if we open a new context, a new browser window, it would otherwise use no proxy, etc., so this is what we want. Now we do await driver.get() with the URL we want to grab — let's go back to this page, that's the one — and I'm going to put in wait_load=True. Then I'll add, I think it's asyncio.sleep — let's try this, I can't remember if it works — not 20, do 10, with an await. Cool, let's try running this now with python main.py. We should see the browser open up, connect to the page, and load it up. There we go, done.
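Up to this point the script might look like this sketch — the URL is a placeholder, and I'm assuming the selenium-driverless calls (Chrome, ChromeOptions, set_single_proxy, get with wait_load) behave as described here:

```python
import asyncio
import os

from selenium_driverless import webdriver

async def main():
    proxy = os.getenv("PROXY")  # stand-in name for the sticky proxy variable
    if proxy is None:
        print("no proxy found")
        return

    options = webdriver.ChromeOptions()  # empty for now; add arguments if needed
    async with webdriver.Chrome(options=options) as driver:
        await driver.set_single_proxy(proxy)  # every window goes via the proxy
        # placeholder URL for the collection page being scraped
        await driver.get("https://www.example.com/collection", wait_load=True)
        await asyncio.sleep(10)  # pause so we can watch the page load

asyncio.run(main())
```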
One of the good things with selenium-driverless — and I covered this in my last video — is that it uses your actual Chrome install rather than a separate one, rather than that driver binary which controls everything and gives away the automated-browser flag. It does all of the basic cover-up stuff you need, all the basic stealth stuff, which makes things much easier, and it does everything through the CDP — the Chrome DevTools Protocol, that's what it's called. Basically it's just a much more modern way of doing it. Selenium itself and Playwright are really fully focused on testing, so they don't care about all of this, whereas people have adapted them to give you these sorts of things, which work very, very well and are much better. That's why we're using selenium-driverless. OK, I'm going to remove the sleep, and now we can move on.
What we want to do here is find all the elements, so products is going to be equal to await driver.find_elements(By.CSS, ...), and we give it the CSS selector, which I think I've still got open — yep, it's this one, copy that, div dot whatever that is. This is going to give us all of the elements on the page that match. It could be anything for whatever you're scraping: find what the element is called, do a little bit of testing, figure it out, print them all out and see what you're getting; maybe you need to load another page, maybe you need to scroll — all of that can be done. I'm keeping it a bit more simple on this one: we're just going to grab all of the product links it can find right away, as this is just a proof of concept. So now I'm going to create a list of urls.
Then we're going to loop through the products: for p in products — let's get this in the middle of the screen — and we do data = await p.find_element(...), because of course we're in async here. What we're doing is saying: for each element you found, look for something else inside it, and that is of course By.CSS — that needs to be a capital B — and we want to find the a tag. We could probably make this CSS selector a bit better, but I've done it this way and we'll be fine. Then link = await data.get_dom_attribute("href"), and urls.append(link). So why am I doing it this way? Why am I not clicking on the links and going through? Because I want to do this as quickly as possible, asynchronously: I'm going to create a new browser context for every link, open them all up together, and visit all of those pages simultaneously, as opposed to doing it one by one and waiting to go through. That's where the limiter is going to come in — I will show you that, though.
So I'm going to go ahead now and just print — we'll do print await urls, I think that might work, let's load it up... we should load up the full page and — oh yeah, I've given await a list, you can't do that, my bad. OK, now we'll run that, and it will have to be blocking; that's fine, because this is just for demonstration — we're going to create tasks for all these links in just a minute. Let's just make sure these are actually product links... they are, we can see them there. Great, let's clear that up and come back to our code.
So now that we've got all of these URLs, we're going to use a coroutine and a task to actually go through them all. To create a task, we need another function: what we do with a task is say, run this async coroutine with this piece of data. So I'm going to say: do this task — which loads the page and pulls the data I want — for every one of these URLs; it will create all the tasks for us and run them all asynchronously. So: async def, and I'm going to call this one get_data, and in here we're going to pass in the driver and also the URL. This is where we're going to create our new context. A new context, in this instance, is essentially a new browser window, and because we want to scrape as many pages as we can, we're going to need a context to load each of them up. Now, I think when you create a new context you do kind of create a fresh version — you don't have all of the browser cookies and everything we loaded up previously; I think you can pass them through to each context if you want to. I'm going to do it this way because I know it works like this.
That's just something to bear in mind. So we do new_context = await driver.new_context(), and then await new_context.get() with the URL we passed in. From here, our schema is going to be equal to await new_context.find_element() — I should really have put type hints in so I'd get the completion — with By, capital B, By.CSS, and now we need the CSS for that element from the view source, which I've lost... here it is, this one: script type. That's a pretty standard CSS selector, script type equals — not that one, let's grab the right one, great copy-and-paste skills — there we go. You do need the single quote marks here because of the characters, otherwise it won't know what you're on about; and let's remove that. Then I'm just going to print it, and we need an await here, on .text — if you're ever unsure whether something needs it, just stick an await in front and see what happens. And then await new_context.close(), because once that context, that browser window, has been opened and we're done with it, we want to close it, so we can move on with our lives and not have a pile of browser windows open causing us issues. Cool.
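Put together, the per-URL coroutine might look like this — a sketch, assuming the new_context, find_element, and close calls work as described:

```python
from selenium_driverless.types.by import By

async def get_data(driver, url: str):
    new_context = await driver.new_context()  # a fresh browser window
    await new_context.get(url)
    schema = await new_context.find_element(
        By.CSS, "script[type='application/ld+json']"  # note the single quotes
    )
    print(await schema.text)   # the raw JSON-LD; json.loads() it from here
    await new_context.close()  # close the window so they don't pile up
```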
Now we need to move on to our tasks. As I said, we need the get_data function I just wrote to run for every URL, so I'm going to call this tasks, and it's going to be a list comprehension — but basically you just need to end up with a list of tasks: get_data with the driver, for url in urls, and I need to give it the url as well, obviously. Then we can do await asyncio.gather(), so we can gather all these tasks up, and we give it the list, like so. This says: hey, all of these URLs need to be run with this function, and we pass it off to asyncio.gather to run them all within the async loop, and it's going to work for us. Now, I didn't show you how many links there were, but we'll see in just a second.
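The fan-out itself is plain asyncio, so here's the shape of it with a dummy coroutine standing in for the real get_data (no browser involved):

```python
import asyncio

async def get_data(driver, url: str) -> str:
    # dummy stand-in: the real coroutine opens a context and scrapes the page
    await asyncio.sleep(0)
    return f"scraped {url}"

async def main() -> list[str]:
    driver = None  # the real code passes the selenium-driverless driver here
    urls = ["/product/1", "/product/2", "/product/3"]
    tasks = [get_data(driver, url) for url in urls]  # one coroutine per URL
    return await asyncio.gather(*tasks)              # run them all together

print(asyncio.run(main()))
# → ['scraped /product/1', 'scraped /product/2', 'scraped /product/3']
```

gather preserves the order of the list you hand it, so the results line up with the urls even though the work is interleaved.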
OK, so this is why we're going to need the limiter. Let's run this... If I've done everything right, we're about to spawn a million Chrome windows, and it's probably going to crash — well, it will crash, I assume. For every link it's found, it's opening a Chrome window. How many is that? I don't know. Is it going to work? Is it going to crash? It's probably going to crash, or it's going to time out because it's not able to load them all up quickly enough, so we'll get a timeout. Theoretically this would work, but it's just not the best way to do it. It's going to time out in a minute — I think it's a 30-second timeout — one of the pages won't load and we'll get an error. But you can see they are starting to load up; if I try to make one full screen it's kind of loading, sort of working, but... there we go, timed out. So that's not going to work. Fun, though.
This is where the limiter comes in. I'm going to create a limiter up here and call it rate_limiter — this is a really easy way to limit anything you put into your async loop. We create a Limiter, and you get an option for the rate: I'm going to do one every 5 seconds to start with, and we'll see how we get on. Then, inside the function we want to limit, I do await rate_limiter.wait(). This is going to control how many of these can spawn — sorry, jumping around way too much — within our asyncio.gather, how many can spawn within that time frame.
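To show what the limiter is actually doing without installing anything, here's a minimal hand-rolled stand-in for asynciolimiter's Limiter — my own illustration, same idea: a .wait() that spaces out task starts to a fixed rate:

```python
import asyncio
import time

class TinyLimiter:
    """Minimal stand-in for asynciolimiter.Limiter: `rate` starts per second."""
    def __init__(self, rate: float):
        self.interval = 1.0 / rate
        self._next = 0.0
        self._lock = asyncio.Lock()

    async def wait(self) -> None:
        async with self._lock:
            now = asyncio.get_running_loop().time()
            delay = max(0.0, self._next - now)
            self._next = max(now, self._next) + self.interval
        if delay:
            await asyncio.sleep(delay)  # hold this task back to its time slot

async def job(limiter: "TinyLimiter", i: int, log: list) -> None:
    await limiter.wait()  # the gate: only `rate` jobs get past per second
    log.append(i)

async def demo() -> tuple[int, float]:
    limiter = TinyLimiter(rate=5)  # 5 per second here; the video uses 1 per 5 s
    log: list[int] = []
    t0 = time.monotonic()
    await asyncio.gather(*[job(limiter, i, log) for i in range(6)])
    return len(log), time.monotonic() - t0

# asyncio.run(demo()) -> all 6 jobs complete, spread over about a second
```

With the real package, the equivalent is constructing a Limiter with the rate you want at module level and calling await rate_limiter.wait() at the top of the coroutine you're limiting, exactly as in the video.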
So I'll save this now and run it with python main.py, and we'll see them load up much slower, one every 5 seconds. We might find that's too slow — we load up one, and is it going to be finished within 5 seconds, and do we get the data back... oh, I never awaited it, my bad, this should have been awaited — await — that's why it failed. Run it again and we should be fine this time. Let me move this over so we get more windows over here; we're not really too worried about the data coming back. OK, so it's been a bit slower, so this might time out — oh, there, it did work. I'm doing one every 5 seconds, so these ones should start closing by the time we open up new ones, and we can see our data coming out over here. If I made this one full screen, for example, this is the information we're after — I should have put rich in already, because it would be nice and easy and you'd be able to see it much better — but you can see it's spawning one up every 5 seconds, and we're doing it a bit quicker, a bit more asynchronously, and you can tweak this. I'm going to close this and we'll come back to our code.
I'm going to tweak this — you can make it quicker or slower depending on your network connection and what you need to do. I'm also going to do from rich import print to make our lives a little easier, and make this number smaller. Now we should get one window every 3 seconds, and we should see the data coming through much neater on the left-hand side, and you can kind of see how it all comes together.
This is possibly one of the better ways to run a scrape with a browser and make it not so impossibly slow, but it does come with a lot more complications, because you need a good handle on Python's asynchronous runtime — its async loop — how best to utilize it, and what you can do with it going forwards. This one's taking a little longer to load the pages, so maybe one every 3 seconds is too many... it should start ticking away now... yeah, see, we're starting to time out, so one every 3 seconds isn't very good — or you could extend the timeout; you can set the timeout in your code somewhere up here. But hopefully you understand the concept here: get all the links using selenium-driverless, which gives us a really good chance to beat blocking on sites, along with the proxies from Proxy Scrape, which help too, because we can utilize good, strong IP quality and get through; and then we're basically using tasks to go and grab all of the data. I'm not doing anything with the data, but from here you've got it in a variable and you could easily do whatever you need with it — there are async-capable libraries for all sorts of databases, as well as for saving to files, so you could easily fit that in within the loop too. So that's going to cover it for this one. Hopefully you got something out of it — let me know down below what you think, go ahead and leave me a like and a comment, and subscribe as well, it really helps me. But if you want to know how I scrape without using a browser, and how I'd probably scrape this site without all this, and quicker, you want to watch this video next.