20:34
Is this how pros scrape HUGE amounts of data?
8.3K
312
2024-08-04
Check Out ProxyScrape here: https://proxyscrape.com/?ref=jhnwr ➡ WORK WITH ME https://johnwr.com ➡ COMMUNITY https://discord.gg/C4J2uckpbR https://www.patreon.com/johnwatsonrooney ➡ PROXIES https://proxyscrape.com/?ref=jhnwr ➡ HOSTING https://m.do.co/c/c7c90f161ff6 If you are new, welcome. I'm John, a self taught Python developer working in the web and data space. I specialize in data extraction and automation. If you like programming and web content as much as I do, you can subscribe for ...
Subtitles

This method turns standard web scraping on its head, trading off slightly more setup for a few key benefits that, although quite niche, are very good. Let me explain. Conventional web scraping works by requesting HTML, parsing that HTML, and then saving the transformed output into our desired format, but this can pose a couple of problems. So in this video I'm going to propose a slight change to the conventional method: we'll talk about what that change is, why it's so effective in the right situation, what the benefits are, and then we'll code out an example.

I've scraped millions of rows of data, and with this method we solve a couple of key issues: errors while parsing, not being able to go back to the HTML from that given time, and having to write out extra code to scrape more sites. We're going to write generic code that takes a list of URLs and scrapes the HTML, but crucially we're going to save the full document to our database. We store the URL, the HTML text, and a timestamp. This means we can parse it later at our convenience, avoiding any parsing errors, and should we need to revisit the page for any extra data we can, as we now have a timestamped record of it, which is also very useful for tracking pricing and product changes and updates. And because we're separating out our code and giving each part a specific job, we can reuse the request part over and over again for any site we want; each new site added is just a matter of writing parsing code. Doing this gives our scraping code more structure and allows us to easily use Python's async abilities to download data. So let me show you.
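To make that concrete, here's a minimal sketch of the kind of record each page becomes; the field names are just the three pieces described above (URL, timestamp, HTML), and the example URL is hypothetical:

```python
from datetime import datetime

# Illustrative shape of one stored record: the raw page plus just enough
# metadata (URL and timestamp) to parse or revisit it later.
record = {
    "url": "https://example.com/product/123",  # hypothetical URL
    "date": datetime.now(),
    "html": "<html>...</html>",  # the full, unparsed response body
}
```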

To get started, we are going to need a database. You're also going to need a way of interacting with that database; I use Compass, it's just a nice, easy GUI, etc. Here is my version. To install it I would recommend Docker; I run mine on my home server via Docker Compose, which looks very similar to this. Once you've got it up and running, it's dead easy to come into Compass and connect to it just by typing in the URL, your localhost, and the port. Once that's up and running, we can continue with our Python code.
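As a quick sanity check that the database is reachable before writing any scraping code, a minimal pymongo connection test looks something like this (localhost and the default port 27017 are assumptions; his instance uses a different port):

```python
from pymongo import MongoClient

# Assumed local MongoDB instance, e.g. one exposed by a Docker container.
client = MongoClient("localhost", 27017)

# "ping" is a lightweight admin command; it raises if the server is unreachable.
client.admin.command("ping")
print("Connected to MongoDB")
```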

Now, we have a few choices here, but we do need to make sure we have a few things installed which are mandatory, pymongo being the first one, so we can talk to our database. I'm going to use curl_cffi, which is an HTTP client for Python that utilises a load of cool tricks to get around certain fingerprinting. You can absolutely use any HTTP client that you like for Python; I just used it in this project because I quite like it. You will benefit from an async client though, which this is, but aiohttp and httpx are also among my favourites, so use this one or httpx. We're also going to use Rich, because Rich just makes it nice and easy to print stuff out to my terminal, and that should be enough for us to get started. I'm going to create a new terminal in my tmux, and now we've got this installed I'm going to create a new file. I'm just going to call it main.py and then open it in my code editor, which is Helix; use whichever one you like, it doesn't matter.

The first thing that we want to do is just import everything that we need. We're going to be doing async in this, so I'm going to be using asyncio, and I'm going to need os because I'm going to be pulling my proxy from my environment variables. Now, you will need a proxy, especially if you're scraping async. I save my proxy string in an environment variable on my machine; you can just put yours directly into your code too. But if you need a proxy, you're going to want to check out ProxyScrape, which is the sponsor of today's video. I've sent countless amounts of data through ProxyScrape, and these are the proxies I'm using today and the ones I've been using for the last year or so. As we know, proxies are an integral part of scraping data, and with ProxyScrape we have access to high quality, secure, fast, and ethically sourced proxies that are perfect for our web scraping use case. I almost exclusively use residential ones, as these are the best option for beating any anti-bot protection, and with auto rotation we're able to scale up our scraping solutions with ease. There are 10 million plus proxies in the pool to use, with unlimited concurrent sessions, so adding proxies to our project is simple and extremely effective when combined with any scraping code, but especially asynchronous requests like we're going to be using here. You'll have a choice of country too, for when you're working on very region-specific sites, and there's a 99% success rate and traffic which never expires, which is also very nice. Other options, though: if you just want throughput, then datacenter proxies with unlimited bandwidth, 99% uptime, and no rate limits, from reputable countries and IPs, are a very, very attractive option for the right use case. So go ahead and check out ProxyScrape at the link in the description below. So, on with our project.

Now, from curl_cffi (we need to do .requests, actually) we're going to import the AsyncSession; this is what we're going to use to make our requests. I'm going to import logging, which I'll talk about in just a second, and from rich.logging we're going to import the RichHandler, capital R, there we go. I'm going to import time, just so we can see everything that's going on when we want to see how quick we're going, and from pymongo we're going to import MongoClient.
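Put together, the import block he's describing comes out roughly like this (csv and datetime get used later in the video, so I've included them here too):

```python
import asyncio
import csv
import logging
import os
import time
from datetime import datetime

from curl_cffi.requests import AsyncSession  # async HTTP client with fingerprint-evading tricks
from pymongo import MongoClient              # to talk to MongoDB
from rich.logging import RichHandler         # pretty, coloured log output
```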

Now, just before we get started, I want to say that I'm going to include some things in here which aren't mandatory; I will tell you what is mandatory and what isn't. The first thing I'm going to do is set up my logging. I'm going to pop this in here just so you can get an idea of what I'm doing. This section is not mandatory; this is just logging for what I'm doing. If you don't want to have any logging like this, or you want to use print statements to log, that's fine, this section is optional. I'll type that in: 'optional'. It's definitely worth learning how to use logging though, if you so desire.
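He doesn't dwell on the exact configuration, but a typical Rich logging setup (essentially the pattern from Rich's own documentation) looks something like this:

```python
import logging
from rich.logging import RichHandler

# Route standard logging through Rich for coloured, timestamped terminal output.
logging.basicConfig(
    level="INFO",
    format="%(message)s",
    datefmt="[%X]",
    handlers=[RichHandler()],
)
log = logging.getLogger("rich")

log.info("Logging is set up")
```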

So the crux of the main part of this is our async function, so I'm going to write that first, and this is of course mandatory if you're using async. I highly recommend you use async, because we don't need to worry about parsing (we're doing that later), so we can use async to its full effect and not have to worry too much. I'm going to do async def run; let's put this in the middle of the screen and we'll have this function here. I'm going to do async with, and this is our AsyncSession, as session. Now I'm going to grab my proxy, so our proxy is going to be equal to os.getenv (I always do that, getenv), and you can either type the name in here or, if you want to, we can create a constant; we're going to call this PROXY. I'm going to go to the top of the code and set it to the name of my environment variable, like so. This is just so I can easily rotate and use different versions; I have many different ones that I use from ProxyScrape, and the one I want to use for this one is called, in my environment, sticky proxy. This is one that is a sticky session that rotates every 5 minutes, I think I've got it set to. To make sure this is working, I'm going to say if proxy is not None, and do log.info (this is my nice Rich logging) saying 'proxy found from env', from my environment, and then I'll do session.proxies, so I can update my session with my proxy, which is equal to a dictionary: 'http' with the proxy, and 'https' also with the proxy. Again, proxies are kind of mandatory when you're doing this. You don't have to use one, but you'll probably find your IP gets blocked very, very fast if you make a load of requests very quickly, which we will be doing, because we're going to be using async. I'm just going to put a warning in here saying 'no proxy found, contining without'. Cool. I don't know if that's how you spell 'continuing', but that will do for us.

Right, the async part: we're basically going to create a load of tasks, and each of these tasks is going to be one URL, which then gets handled inside our coroutine, so we can have everything going at the same time. So I'm going to do for url in urls (I'm going to create this urls list in just a minute) and do task is equal to session.get(url), because we want to use this to go and get our URL, and then we'll say tasks.append(task). So we're basically creating a load of tasks which we're then going to run here, where I'm going to collect results, because we want to get the information back from these HTTP requests, with an await, because this is async: we want to do asyncio.gather(*tasks). That's going to make everything go at the same time; it's going to get all of the information, wait for it all to come back, and then we're going to have a nice load of data within results. This is probably only going to take a few seconds, depending on how many URLs you're running. Then, from this function, we want to return our results.

So, from this: I recommend you use a proxy. You could of course just take the proxy string that you get from ProxyScrape and put it directly in this variable if you want to; I've just done it this way because I want to make sure I'm pulling different ones from my environment that we can then use depending on what I'm trying to do, and I will also get a log when I run this if it hasn't been picked up from my environment, so I can avoid using my own IP for such things. Then we create a load of tasks with our URLs (which we are yet to create), we gather all those tasks together, we wait for them all to be done, and then we return the result. This section is mandatory, because otherwise we're not going to end up with any results. You could do this synchronously, one after the other: for each URL, get the HTML and return it out, but that's going to be super, super slow. From this, we want to say our data is going to be equal to asyncio.run on our run function. I probably should have called the run function something else, but you get the idea: we're basically saying, now, with asyncio, let's execute this function that's going to get everything for us, and we're going to have all of our results data back here.
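As a rough sketch, the coroutine he builds comes out something like this. The overall shape (one session, one task per URL, asyncio.gather, return the responses) is what's described in the video; passing urls in as a parameter and the STICKY_PROXY variable name are my own tidy-ups:

```python
import asyncio
import logging
import os

from curl_cffi.requests import AsyncSession

log = logging.getLogger("rich")

PROXY_ENV = "STICKY_PROXY"  # name of the environment variable holding the proxy string (assumption)


async def run(urls: list[str]) -> list:
    """Fetch every URL concurrently and return the raw responses."""
    async with AsyncSession() as session:
        proxy = os.getenv(PROXY_ENV)
        if proxy is not None:
            log.info("proxy found from env")
            session.proxies = {"http": proxy, "https": proxy}
        else:
            log.warning("no proxy found, continuing without")

        # One task (coroutine) per URL so every request is in flight at the same time.
        tasks = [session.get(url) for url in urls]
        results = await asyncio.gather(*tasks)
        return results


# data = asyncio.run(run(urls))  # urls is built from the CSV in the next step
```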

So we do need some URLs. What I'm going to do is say, here, we're going to do with open, because we're going to pull these from a CSV file, and I'm just going to say our urls.csv, read, as f. It's up to you how you get your URLs: you could scrape for them, you could pull them from a database; in this instance I'm just pulling them from a CSV. All you need to make sure is that you end up with a list of the URLs you want to get asynchronously. So our reader is going to be equal to csv.reader (and I didn't import csv, that's interesting, okay), and then urls is going to be equal to url[0], the first element, because otherwise we'll have a list of lists, for url in our reader, like so. Let's import csv, because we're going to need it.
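Written out, that URL-loading step is just a list comprehension over the CSV rows; urls.csv is the filename used in the video, and each row is assumed to have the URL in its first column:

```python
import csv

# Build a flat list of URLs; row[0] takes the first column of each row,
# otherwise we would end up with a list of lists.
with open("urls.csv", "r") as f:
    reader = csv.reader(f)
    urls = [row[0] for row in reader]
```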

As we can see, if I come to my file, I've dumped in the list of URLs we're going to be working with; there are 80 in here, so that's a reasonable chunk to get working with. So let's come back to my code editor, which I closed by accident, there we go. Now we have our URLs, which are going to go into this function and return the result. What I'm going to do now is run this; I'm just going to print out the data that we get back, we'll save, and then we'll come over here, activate our virtual environment, clear the screen, and run python main.py. We should get 'proxy found from environment', that's our logging, lovely colours and everything. So we're actually creating all the requests now, and we're just going to wait for them all to come back before we do anything. And there we have it: these all look like 200 responses, which means they were all successful.

So we're going to come back to our code now, and I'm just going to paste this timing snippet in here, which is one that I've copied from online (I apologise, I can't remember where I copied it from), and we are now able to tell how long this takes. Let's run it again and see. I think it takes about 5 or 6 seconds for 80 URLs, which is not bad at all compared to doing them one after the other, which would probably take about 90 seconds in that case. So, not bad.
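He doesn't show the exact snippet he pasted, but a simple way to time the whole fetch with the time module imported earlier would be something like this; it assumes the run() coroutine, the urls list, and the log object from the sketches above:

```python
import asyncio
import time

start = time.perf_counter()
data = asyncio.run(run(urls))  # run() and urls as defined earlier
elapsed = time.perf_counter() - start

log.info(f"fetched {len(data)} pages in {elapsed:.2f}s")
```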

So let's go back to our code, and now we can think about what we're going to do with MongoDB; I'm going to put 'mandatory' here. I like to store this sort of thing up at the top, so I'm just going to copy it over so I don't have to type it out, and I've chopped the end off. What I've got is the connection string and the port, which is actually a different port (let me have a look... perfect), and I've created a database here and a collection within it. Now, you don't have to do this, because when you connect, if it doesn't exist it will be created for you, but this is my database and my collection, and this is important, and I also have the port here. I'm just keeping these as constants at the top, so if anything changes it's easy to go to the top of the file and change it there; you could of course put them inline if you wanted to. So now that we have our data out here, we're going to need to add in our information so we can do something with it, so we want to create our client. I'm going to say our client (actually we'll call it client, it's probably easier, or mongo_client; that's not very Pythonic, but it doesn't matter what you call it), and that's going to be equal to the MongoClient that we imported from pymongo, and we need to give it our connection string and our port. Then we want to say that our db is equal to the client with our MongoDB database, which in this case is called scraped_items, and then our collection is equal to db with the collection name. There we go, perfect. Ah, that's the wrong thing: this here should be mongo_client. There we go. So now we have access to our collection and can easily go ahead and add stuff into our database.
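A minimal version of that setup, with the connection details kept as constants at the top of the file; the host, port, and collection name here are placeholders (his instance runs on a non-default port, and the 'products' collection name is the one he refers to later):

```python
from pymongo import MongoClient

# Connection details as constants so they are easy to change in one place.
MONGO_HOST = "localhost"        # placeholder
MONGO_PORT = 27017              # placeholder; the video uses a non-default port
MONGO_DB = "scraped_items"      # database name used in the video
MONGO_COLLECTION = "products"   # collection name referred to later in the video

mongo_client = MongoClient(MONGO_HOST, MONGO_PORT)
db = mongo_client[MONGO_DB]           # created on first write if it doesn't exist
collection = db[MONGO_COLLECTION]     # likewise
```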

So, underneath where we're getting all the information back, we want to go through the responses that we saw on the screen and get the HTML from them. We're doing no HTML parsing: we're going to store the raw documents in our database with a timestamp and a URL, so we can parse them at a later date whenever we need to, or revisit them if we need to. So let's do for response in data, and I'm going to say if response.status_code does not equal 200, we're going to do log.warning, because we want to warn about this. I'm going to log 'failed on' response.url 'with code' response.status_code; this is just some logging, but you'll thank yourself later when you try to run this again at a different time and have this lovely logging in here telling you what's going on. From here, what we could do is create a failed list (let's call it failed) and then do failed.append and just store the response.url, so we have a list of anything that failed and can revisit them or do whatever we want with them. But if it does have a response code of 200, we'll have results.append; let's create that list actually whilst I'm here, so we have a failed list and a results list, that's cool. This is where we want to construct the information that we're going to be storing in our database, and we'll do it like this: we want the URL, which is going to be equal to response.url; we want the time, or the date, so we're going to have datetime.now(), which creates a datetime object that will go neatly into our database; and the HTML, which is the important part, which is our response.text, like so. Great, now we want to add this to our database, so I'm going to come outside of the for loop and use insert_many. We'll just do inserted is equal to collection (which is the collection that we stipulated at the top) .insert_many, and this just takes in a list of JSON-like objects: results, like so. That is essentially it; this is the one line it takes to get stuff into your database once you've set it up. Nice and easy.
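Written out, that section looks roughly like this; the field names match the ones described above, and it assumes the data, log, and collection objects from the earlier sketches:

```python
from datetime import datetime

failed = []   # URLs that did not return a 200, to retry or inspect later
results = []  # raw documents to store

for response in data:
    if response.status_code != 200:
        log.warning(f"failed on {response.url} with code {response.status_code}")
        failed.append(response.url)
    else:
        results.append(
            {
                "url": response.url,
                "date": datetime.now(),  # stored as a proper datetime object
                "html": response.text,   # the full, unparsed page
            }
        )

# One call writes everything; insert_many takes a list of dicts.
inserted = collection.insert_many(results)
log.info(f"inserted {len(inserted.inserted_ids)} documents")
```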

This is essentially it: about 80-odd lines of code where we take a list of URLs (and it can be any URLs; we could add argparse arguments to change them, or we could just change them manually), then asynchronously, with our proxy from ProxyScrape, go through and get the response from all the URLs, and then save each and every response into our document database for us to handle later. So we're pulling all the information, we're storing the raw data, and then we're going to parse it later and move it to wherever we need. And when we do that parsing, we could remove the document from the database to keep it neat, so we know that only the items still in this collection are the ones we want to parse, or anything else that suits us in that instance, and it's super easy. So I'm going to clear this up and run python main.py, and we should have a little pause, because I've got nothing going on in between all the requests; we should see all the responses and the insert statements come through in just a second. I think it took about 10 seconds last time. There we go: it took 16 seconds all in all to request 80 pages from the URL list and store them all in our document database.

Let's scroll up, and you can see we've inserted the results and we have all the document IDs, the object IDs, there. And if I come over to Compass here and hit refresh, here are our documents, all 80 of them, and this is what they look like. You can view them in different ways if you want to, but this is the raw HTML. So all we need to do is say: hey, give me every entry in the products collection, then we can just give the HTML parser the HTML and figure out what we want to pull from it. Then we could move the document to maybe a completed collection (maybe we create a new collection, completed, and move it over there), or maybe we just leave it and say give me all the documents in this collection from this date and we parse those out, or we have one collection per website, etc. etc., however you want to manage it.
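The parsing stage itself isn't shown in this video, but a minimal sketch of the 'read from the collection, parse, then move to a completed collection' idea might look like this; selectolax as the parser, the title extraction, and the connection details are my choices, not his:

```python
from pymongo import MongoClient
from selectolax.parser import HTMLParser  # any HTML parser works; this is just one option

client = MongoClient("localhost", 27017)  # placeholder connection details
db = client["scraped_items"]
collection = db["products"]
completed = db["completed"]               # hypothetical destination collection

for doc in list(collection.find({})):     # every stored page still waiting to be parsed
    tree = HTMLParser(doc["html"])
    title = tree.css_first("title")
    print(doc["url"], title.text() if title else "no title")

    # Once parsed, move the document so only unparsed pages remain in 'products'.
    completed.insert_one(doc)
    collection.delete_one({"_id": doc["_id"]})
```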

The good thing about document databases like this is that they're very, very easy to use, but the downside is that working with relationships in your data is not so great. So if you're storing parsed data that you've put into a specific schema, I wouldn't use this; I would 100% use Postgres for that and set it up to do that. This is just a great interim database for storing raw HTML documents, like so. So I hope you've enjoyed this video; if you want to watch me do some more traditional web scraping using Python's best web scraping framework, watch this video next.