This method turns standard web scraping on its head, trading off slightly more setup for a few key benefits that, although quite niche, are very useful.
Let me explain. Conventional web scraping works by requesting HTML, parsing that HTML, and then saving the transformed output into our desired format, but this can pose a couple of problems. So in this video I'm going to propose a slight change to the conventional method: we'll talk about what that change is, why it's so effective in the right situation, the benefits of it, and then we'll code out an example.

I've scraped millions of rows of data, and with this method we solve a couple of key issues: errors while parsing, not being able to go back to the HTML as it was at that given time, and having to write out extra code to scrape more sites.
We're going to write generic code that takes a list of URLs and scrapes the HTML, but crucially we're going to save the full document to our database. We store the URL, the HTML text, and a timestamp. This means we can parse it later at our convenience, avoiding any parsing errors, and should we need to revisit a page for any extra data we can, as we now have a timestamped record of it, which is also very useful for tracking pricing and product changes and updates. And because we're separating out our code and giving each part a specific job, we can reuse the request part over and over again for any site we want; each new site added is just a matter of writing parsing code. Doing it this way gives us more structure in our scraping code and lets us easily use Python's async abilities to download data. So let me show you.
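Just to make that concrete, here's roughly the shape of the record we'll end up saving for every page we request (the field names here are just the ones I like; call them whatever you want):

```python
from datetime import datetime

# one stored page: the URL we hit, the raw unparsed HTML, and when we fetched it
record = {
    "url": "https://example.com/some-page",
    "html": "<html>...</html>",   # the full response body, no parsing at all
    "date": datetime.now(),       # timestamp so we can track changes over time
}
```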
To get started we're going to need a database, and you're also going to need a way of interacting with that database; I use MongoDB Compass, which is just a nice, easy GUI, etc. Here is my version. To install it I would recommend Docker; I run mine on my home server via Docker Compose, which looks very similar to this. Once you've got it up and running, it's dead easy to come into Compass and connect to it just by typing in the URL, your localhost, and the port. Once that's up and running, we can continue with our Python code.
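If you want to sanity-check the connection from Python before going any further, a quick ping does the job, using pymongo, which we'll install in just a second. This is only a rough check, assuming the container is exposed on the default localhost port; swap in whatever host and port your Compose file actually uses:

```python
from pymongo import MongoClient

# assumes MongoDB is exposed on localhost:27017 by the Docker container
client = MongoClient("localhost", 27017)
print(client.admin.command("ping"))  # {'ok': 1.0} means the database is reachable
```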
Now, we have a few choices here, but we do need to make sure we have a few things installed which are mandatory: pymongo being the first one, so we can talk to our database, and I'm going to use curl_cffi, which is an HTTP client for Python that utilizes a load of cool tricks to get around certain fingerprinting. You can absolutely use any HTTP client that you like for Python; I just used it in this project because I quite like it. You will benefit from an async client though, which this is, but so are aiohttp and httpx, my other favourites; use this one or httpx. We're also going to use Rich, because Rich just makes it nice and easy to print stuff out to the terminal, and that should be enough for us to get started. I'm going to create a new terminal in my tmux, and now we've got these installed I'm going to create a new file; I'm just going to call it main.py, and then open it in my code editor. I'm using Helix; use whichever one you like, it doesn't matter.
The first thing that we want to do is just import everything that we need. We're going to be doing async in this, so I'm going to be using asyncio, and I'm going to need os because I'm going to be pulling my proxy from my environment variables. Now, you will need a proxy, especially if you're scraping async. I save my proxy string in an environment variable on my machine; you can just put yours directly into your code too. But if you need a proxy, you're going to want to check out ProxyScrape, which is the sponsor of today's video.
I've sent countless amounts of data through ProxyScrape, and these are the proxies I'm using today and the ones I've been using for the last year or so. As we know, proxies are an integral part of scraping data, and with ProxyScrape we have access to high-quality, secure, fast, and ethically sourced proxies that are perfect for our web scraping use case. I almost exclusively use the residential ones, as these are the best option for beating any anti-bot protection, and with auto-rotation we're able to scale up our scraping solutions with ease. There are 10 million plus proxies in the pool to use, with unlimited concurrent sessions, so adding proxies to our project is simple and extremely effective when combined with any scraping code, but especially asynchronous requests like we're going to be using here. You'll have a choice of country too, for when you're working on very region-specific sites, there's a 99% success rate, and traffic never expires, which is also very nice. As for other options, if you just want throughput, then the datacenter proxies, with unlimited bandwidth, 99% uptime, no rate limits, and reputable countries and IPs, make a very, very attractive option for the right use case. So go ahead and check out ProxyScrape at the link in the description below.
So, on with our project. From curl_cffi, or rather curl_cffi.requests, we're going to import the AsyncSession; this is what we're going to use to make all of our requests. I'm going to import logging (I'll talk about that in just a second), and from rich.logging I'm going to import the RichHandler, capital R, there we go. I'm going to import time, just so we can see how quick we're going, and from pymongo we're going to import MongoClient.
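So the top of the file ends up looking something like this. This is just my rough version of it; I've also included csv and datetime here, which only actually come up later on:

```python
import asyncio
import csv
import logging
import os
import time
from datetime import datetime

from curl_cffi.requests import AsyncSession
from pymongo import MongoClient
from rich.logging import RichHandler
```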
Now, just before we get started, I want to say that I'm going to include some things in here which aren't mandatory; I will tell you what is mandatory and what isn't. The first thing I'm going to do is set up my logging, and I'm going to pop this in here just so you can get an idea of what I'm doing. This is not mandatory; this section is just logging for what I'm doing, so if you don't want any logging like this, or you'd rather use print statements to log, that's fine. This section is optional, I'll type that in: optional. It's definitely worth learning how to use logging though, if you so desire.
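I can't show you exactly what I pasted in here as text, but a minimal Rich logging setup looks something like this; the logger name and level are just my choices, so tweak them to taste:

```python
import logging

from rich.logging import RichHandler

# optional: pretty, colourised logging in the terminal via Rich
logging.basicConfig(
    level=logging.INFO,
    format="%(message)s",
    datefmt="[%X]",
    handlers=[RichHandler()],
)
log = logging.getLogger("scraper")
```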
The crux of the main part of this is our async function, so I'm going to write that first, and this is of course mandatory if you're using async. I highly recommend you use async: because we don't need to worry about parsing, as we're doing that later, we can use async to its full effect and not have to worry too much. So I'm going to do async def run, let's put this in the middle of the screen, and we'll have this function here. I'm going to do async with, and this is our AsyncSession, as session. Now I'm going to grab my proxy, so our proxy is going to be equal to os.getenv (I always do that, getenv), and you can either type the name in here or, as we're going to do, create a constant; we'll call this PROXY. I'm going to go to the top of the code and say that my PROXY constant holds the name of my environment variable, like so. This is just so I can easily rotate and use different versions; I have many different ones that I use from ProxyScrape, and the one I want to use for this is called, in my environment, sticky proxy. This is one that is a sticky session that rotates every 5 minutes or so, I think.
To make sure this is working, I'm going to say if proxy is not None, and then do log.info (this is my nice Rich logging); I'm going to say "proxy found from env", from my environment. Then I'll do session.proxies, so I can update my session with my proxy; it's not this, it's equal to a dictionary with "http" set to the proxy and "https" also set to the proxy. Again, proxies are kind of mandatory when you're doing this. You don't have to use one, but you'll probably find your IP gets blocked very, very fast if you make a load of requests very quickly, which we will be doing because we're going to be using async. I'm also going to put a warning in here; I'm just going to say "no proxy found, continuing without". Cool. I don't know if that's how you spell continuing, but that will do for us.
Right, the async part: we're basically going to create a load of tasks, and each of these tasks is one URL, which is then handled in our coroutine, so we can have everything going at the same time. So I'm going to do for url in urls (I'm going to create this urls list in just a minute), then task is equal to session.get on that URL, because we want to use this to go and get our URL, and then tasks.append(task). So we're basically creating a load of tasks which we're going to then run here, where I'm going to collect the results, because we want to get the information back from these HTTP requests, with an await, because this is our async: we want to do asyncio.gather on the tasks. That's going to make everything go at the same time; it's going to get all of the information, wait for it all to come back, and then we're going to have a nice load of data within results. This is probably only going to take a few seconds, depending on how many URLs you're running. Then from this function we want to return our results.
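Put together, my sketch of that coroutine looks roughly like this. The environment-variable name is just illustrative, and the log object and the session.proxies style of proxy setup are the ones from earlier, so treat it as an outline rather than a drop-in script:

```python
import asyncio
import logging
import os

from curl_cffi.requests import AsyncSession

log = logging.getLogger("scraper")

PROXY = "STICKY_PROXY"  # name of the environment variable holding my proxy string


async def run(urls):
    async with AsyncSession() as session:
        proxy = os.getenv(PROXY)
        if proxy is not None:
            log.info("proxy found from env")
            # route both http and https traffic through the proxy
            session.proxies = {"http": proxy, "https": proxy}
        else:
            log.warning("no proxy found, continuing without")

        # one task per URL so every request is in flight at the same time
        tasks = []
        for url in urls:
            task = session.get(url)
            tasks.append(task)

        # wait for every request to come back, then hand the responses out
        results = await asyncio.gather(*tasks)
        return results
```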
So, from this: I do recommend you use a proxy. You could of course just use a string here and put the proxy string that you get from ProxyScrape directly into this variable if you want to. I've only done it this way because I want to make sure I'm pulling different ones from my environment, which we can then use depending on what I'm trying to do, and I will also get a log when I run this if it hasn't been picked up from my environment, so I can avoid using my own IP for such things. Then we create a load of tasks with our URLs, which we are yet to create, we gather all those tasks together, we wait for them all to be done, and then we return the result. This section is mandatory, because otherwise we're not going to end up with anything back from our URLs. You could do this synchronously, one after another: for each URL in the list, get the HTML and then return it out. But that's going to be super, super slow.
From this, we want to say our data is going to be equal to asyncio.run on our run function. I probably should have called the run function something else, but you get the idea: we're basically saying, now, with asyncio, let's execute this function that's going to get everything for us, and then we'll have all of our results data back here.

So we do need some URLs. What I'm going to do is say, here, with open, because we're going to pull these from a CSV file, and I'm just going to say our urls.csv, read, as f. It's up to you how you get your URLs: you could scrape for them, you could pull them from a database; in this instance I'm just pulling them from a CSV. All you need to make sure is that you end up with a list of the URLs you want to get asynchronously. So our reader is going to be equal to csv.reader, and I didn't import csv, that's interesting, okay; and then urls is going to be equal to the first index of each row, because otherwise we'll have a list of lists, for each row in our reader, like so. Let's import csv, because we're going to need it.
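So the bottom of the script, where we load the URLs and kick everything off, looks something like this; it leans on the run coroutine sketched above, and urls.csv is assumed to be a simple one-column file of URLs:

```python
import asyncio
import csv

# pull the URLs out of a one-column CSV; any source is fine as long as
# you end up with a flat list of URL strings
with open("urls.csv", "r") as f:
    reader = csv.reader(f)
    urls = [row[0] for row in reader]

# run the async coroutine from above and collect every response
data = asyncio.run(run(urls))
```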
As we can see, if I come over to my file, I've dumped in the list of URLs we're going to be working with; there are 80 in here, so that's a reasonable chunk to get working with. Let's come back to my code editor, which I closed by accident, there we go. So now we have our URLs, which are going to go into this function and return the result. What I'm going to do now is run this; I'm just going to print out the data that we get back, save, and then we'll come over here, activate our virtual environment, clear the screen up, and run main.py. We should get "proxy found from environment", that's our logging, lovely colours and everything. So we're actually creating all the requests now, and we're just going to wait for them all to come back before we do anything. And there we have it: these all look like 200 responses, which means they were all successful.
So we're going to come back to our code now, and I'm just going to paste this snippet in here, which is one that I copied from online (I apologise, I can't remember where I copied it from), and with it we're now able to tell how long this takes. Let's run it again and see: I think it takes about 5 or 6 seconds for 80 URLs, which is not bad at all compared to doing them one after the other, which would take far longer. So, not bad.
Let's go back to our code, and now we can think about what we're going to do with MongoDB, so I'm going to put mandatory here. I like to store my connection details up here, so I'm just going to copy this over so I don't have to type it all out, and I've chopped the end off. What I've got is the connection string and the port, which is actually a different port; I think it's this one, let me have a look, perfect. And I've created a database here and a collection within it. Now, you don't have to do this, because when you use them down here, if they don't exist Mongo will create them for you, but this is my database and my collection, and this is important, and I also have the port here. I'm just keeping these as constants at the top, so if anything changes it's easy to go to the top of the file and change it there. You could of course put these inline if you wanted to.
So now we have our data out here, and we're going to need to add our connection details so we can do something with this data. We want to create our client here now: I'm going to say our client, actually we'll call it client, it's probably easier, or mongo_client; that's not very pythonic, but it doesn't matter what you call it. That's going to be equal to the MongoClient that we imported from pymongo, and we need to give it our connection string and our port. Then we want to say that our db is equal to the client indexed with our MongoDB database, which in this case is called scraped items, and then our collection is equal to db indexed with the name of the Mongo collection, there we go, perfect. Ah, that's the wrong thing; this here should be mongo_client, there we go.
So now we have access to our collection, and we can easily go ahead and add stuff into our database. Underneath where we're getting all the information, we want to go through the responses that we saw on the screen here and get the HTML from them. We're doing no HTML parsing: we're going to store the raw documents in our database with a timestamp and a URL, so we can parse them at a later date whenever we need to, or revisit them if we need to.
So let's do for response in data, and I'm going to say if response.status_code does not equal 200, we're going to do log.warning, because we want to flag this up; I'm going to log "failed on", response.url, "with code", response.status_code. This is just some logging, but you'll thank yourself later when you try to run this again at a different time and you have this lovely logging in here that tells you what's going on. From here, what we could do is create a failed list; let's call it failed, and then we could do failed.append and just store the response.url, so we have a list of anything that failed and we can revisit them or do whatever we want with them. But if it does have a response code of 200, we'll append to a results list; let's create that list as well whilst I'm here, so we have a failed list and a results list. That's cool.
failed list and a results list that's cool results. append this is where we
want to construct the information that we're going to be storing in our
database like it like it does that we'll do it
like this thank you so we want to say we want the URL which is going to be equal
to the response. URL we want the uh time or the date so we're going to have uh
date time do now and this is going to create
a date time object which will go neatly into our database and the HTML
which is the important part which is our response. text like so great now we want
to add this to our database so I'm going to come outside of
the for Loop and I'm going to use insert many and this is the part so we'll
just do inserted is equal to collection which is the collection that we
stipulated at the top uh do insert many and this just takes in a list of Json
objects results like so that is essentially it this is the one line that
it takes to um import stuff into your database once you've set it up and
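Written out, that whole storage step is roughly this; it's a sketch that leans on the data, log, and collection objects from earlier:

```python
from datetime import datetime

results = []
failed = []

for response in data:
    if response.status_code != 200:
        # log the miss and keep the URL so we can retry it later
        log.warning("failed on %s with code %s", response.url, response.status_code)
        failed.append(response.url)
    else:
        # no parsing here: just the URL, when we grabbed it, and the raw HTML
        results.append({
            "url": response.url,
            "date": datetime.now(),
            "html": response.text,
        })

# one call pushes every successful page into the collection
inserted = collection.insert_many(results)
```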
Nice and easy. And that's essentially it: about 80-odd lines of code where we take a list of URLs, and it can be any URLs (we could add argument parsing with argparse to change them, or we could just change them manually), then asynchronously, with our proxy from ProxyScrape, go and get the response from all the URLs, loop through them, and save each and every response into our document database for us to handle later. So we're pulling all the information, we're storing the raw data, and then we're going to parse it later and move it to wherever we need. And when we do that parsing, we could remove each document from the database to keep it neat, so we know the only items left in this collection are the ones we still want to parse, or anything else that suits us in that instance. And it's super easy.
So I'm going to clear this up and run main.py again. We should have a little pause, because I've got nothing going on in between, while all the requests go out, and then we should see all the responses and the insert statements come through in just a second; I think it took about 10 seconds last time. There we go: it took 16 seconds all in all to request 80 pages from the URL list and store them all in our document database. Let's scroll up, and you can see we've inserted the results and we have all the document IDs, the object IDs, there. And if I come over to Compass here and hit refresh, here are our documents, all 80 of them, and this is what they look like. You can view them in different ways if you want to, but this is the raw HTML.
So all we need to do later is say, hey, give me every entry in the products collection, then we can just hand the HTML parser the HTML and figure out what we want to pull from it. Then we could move each document to maybe a completed collection (maybe we create a new collection called completed and move it over there), or maybe we just leave it and say give me all the documents in this collection from this date and we parse those out, or we have one collection per website, and so on; however you want to manage it.
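When you do come back to parse, that later step might look something like this. Everything here is hypothetical: parse_html stands in for whatever parser you end up writing, and the "completed" collection is just one way of keeping things tidy:

```python
def parse_html(html: str) -> dict:
    # hypothetical placeholder for your real parsing logic
    return {"length": len(html)}


# a later, separate script: pull the raw pages back out and parse at our leisure
for doc in collection.find({}):
    parsed = parse_html(doc["html"])
    # ... store `parsed` wherever it needs to go (Postgres, CSV, etc.) ...

    # optionally move the raw document into a "completed" collection to keep this one tidy
    db["completed"].insert_one(doc)
    collection.delete_one({"_id": doc["_id"]})
```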
The good thing about document databases like this is that they're very, very easy to use, but the downside is that working with relationships in your data is not so great. So if you're storing parsed data that you've put into a specific schema, I wouldn't use this; I would 100% use Postgres for that and set it up to do that. This is just a great interim database for storing raw HTML documents.
So, I hope you've enjoyed this video. If you want to watch me do some more traditional web scraping using Python's best web scraping framework, watch this video next.