How I Write Web Scrapers w/ Python
2025-04-02
➡ JOIN MY MAILING LIST https://johnwr.com
➡ COMMUNITY https://discord.gg/C4J2uckpbR | https://www.patreon.com/johnwatsonrooney
➡ PROXIES https://proxyscrape.com/?ref=jhnwr
➡ HOSTING https://m.do.co/c/c7c90f161ff6

A look at how I use Python classes and build structured apps over one-off scripts. If you are new, welcome. I'm John, a self-taught Python developer working in the web and data space. I specialize in data extraction and automation. If you like programming and web content as much as I...
Subtitles

For the longest time, I would write code that looks like this. It's pretty straightforward, very understandable, and it worked. But there's a big problem with it: it's not modular, it doesn't handle errors, and it's very difficult to upgrade or change when things inevitably go wrong. You'll end up rewriting most of it if you want to swap in a different package, for example. So what I suggest now is to make your Python code much more reusable and build yourself an actual scraping system, rather than one-off scripts that are going to break and that you'll have to fix.

What I mean by that is: at the moment we just have a single main.py file, and that's fine, it works and it runs. But we can start to think about what our code actually does. The first thing is that we want to extract the data, and that is going to be the most important piece. Then we want to transform it, do something with it, and finally we want to load it, save it somewhere. Those three keywords, extract, transform, load, are ETL, which is very common in the programming and tech world, and we're doing exactly the same thing. What we can do now is think about how our code and our project structure can reflect these three pieces. So let's look at each one.

What does our extract need to do? Well, it might need to fetch some data. There might be a sync client, or we might need to use a browser, or we might want to do something with async. We might need to set up our client and our proxies, and the list goes on. Next is our transform. What are we going to do here? We take some data in and do something with it: probably parsing HTML or working with JSON, and then maybe using Pydantic classes, or attrs, or just some kind of class system for our data. These are all important things to consider. And finally, we want to load the data somewhere.

We're going to want to save it to something. That could be CSV, JSON, or any database you want, whatever you're trying to do. Underneath these three headers we can start to think: this is going to be a class on its own, and so is this, and so is this. Within each class we can then implement these methods and expand on them as we need to. Each of these classes can then be pulled into whatever file we need, possibly something like a pipeline, and from there the pipeline just gets executed as and when we need it. This gives us much more modularity: the ability to change things and switch packages out as we need to.
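To make that concrete, here is a minimal, hypothetical skeleton of what those three classes and a small pipeline around them could look like. Every name here (Extractor, Transformer, Loader, pipeline, the httpx client) is a placeholder for illustration rather than the exact code from the project:

```python
# A bare-bones sketch of the extract / transform / load split.
# Everything here is a placeholder to show the shape, not the real project code.
import json

import httpx


class Extractor:
    """Owns the HTTP client(s); fetches raw HTML or JSON."""

    def __init__(self) -> None:
        self.client = httpx.Client(timeout=30)

    def fetch_html(self, url: str) -> str:
        response = self.client.get(url)
        response.raise_for_status()
        return response.text


class Transformer:
    """Turns raw responses into clean Python data structures."""

    def parse(self, html: str) -> list[dict]:
        # Site-specific parsing (CSS selectors, JSON handling) goes here.
        return []


class Loader:
    """Saves the transformed data: CSV, JSON, a database..."""

    def save(self, items: list[dict], path: str = "results.json") -> None:
        with open(path, "w") as f:
            json.dump(items, f, indent=2)


def pipeline(urls: list[str]) -> None:
    e, t, l = Extractor(), Transformer(), Loader()
    items = []
    for url in urls:
        items.extend(t.parse(e.fetch_html(url)))
    l.save(items)
```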

Let me show you what I mean with a more complete project example. On the screen here is the basic project I've been using recently, pulling some data both synchronously and asynchronously. I have my extract, my load, my main.py and my transform. I'm using uv for this, which is the new hot thing when it comes to Python projects, and I think it's really cool; I'm really liking it so far. That's the only difference: that's what the uv.lock file and the pyproject.toml file are. I've also got a logging file here. So we're starting to build up our scraper as if it's a proper Python application.
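That logging file can be as simple as a small shared module the other files import; a hypothetical version might be:

```python
# logging_config.py - a hypothetical shared logging setup imported by the other modules.
import logging


def setup_logging(level: int = logging.INFO) -> None:
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s - %(message)s",
    )
```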

Let's open it up, starting with the extract. I have my extractor class here. Generally speaking, without too much site-specific customization, I can copy and paste this extractor class into whichever project I'm doing next; there's nothing project-specific in it. This class is solely designed to extract data from whatever website or URL I send it to. If we look at the initialization, I'm getting my proxies (I'm probably always going to want to use those; there's a link to the ones I use in the description), then I'm creating my clients and updating some of the session information. There are two clients, as you'll see: the session client, which is my asynchronous client, and the blocking client. This is actually using rnet, which I talked about in a recent video. I'm testing it out a bit more, and so far it's been pretty good; I'm really enjoying it. I add my proxies to these clients and then log some info out. Again, logging is very useful: when I run this, I can see exactly what's happening and in which part.
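As a rough sketch of that kind of initialization, using httpx here as a stand-in for rnet's async and blocking clients, and with a made-up proxy URL, so treat the details as assumptions rather than the exact code from the video:

```python
import logging

import httpx

logger = logging.getLogger(__name__)

# Hypothetical proxy endpoint; in practice this comes from your provider.
PROXY_URL = "http://user:pass@proxy.example.com:8000"


class Extractor:
    def __init__(self) -> None:
        # One async client for gathered requests, one blocking client for
        # simple sequential fetches. Both are routed through the same proxy.
        self.client = httpx.AsyncClient(proxy=PROXY_URL, timeout=30)
        self.blocking_client = httpx.Client(proxy=PROXY_URL, timeout=30)

        # Shared session information, e.g. headers, applied to both clients.
        headers = {"user-agent": "Mozilla/5.0"}
        self.client.headers.update(headers)
        self.blocking_client.headers.update(headers)

        logger.info("Extractor initialized with proxy and both clients")
```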

The methods on this class are fairly straightforward. I have the sync fetch HTML version, which uses the blocking client, and it has a retry decorator on top. That comes from a package called Tenacity. I've started using Tenacity a bit, but I'm going to do a video on other retry packages: which is the easiest to use and the most straightforward, because that's what Python is all about for me, ease of use and being straightforward. This one is quite simple: we just say stop after attempt number five, and wait exponentially, so the delay grows with each attempt, starting at a few seconds. We have our logging in here as well, so we can see what's actually going on. Then there's the fetch HTML function itself, which is pretty self-explanatory: if we don't get a 200 status, I raise an exception, and that exception is what triggers the retry.
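A sketch of what that Tenacity setup can look like; the method body is illustrative, but the decorator mirrors the stop-after-five-attempts, exponential-wait behaviour described above:

```python
import logging

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)


class Extractor:
    def __init__(self) -> None:
        self.blocking_client = httpx.Client(timeout=30)

    @retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=3, max=30))
    def fetch_html(self, url: str) -> str:
        logger.info("Fetching %s", url)
        response = self.blocking_client.get(url)
        if response.status_code != 200:
            # Raising here is what hands control back to the retry decorator.
            raise Exception(f"Bad status {response.status_code} for {url}")
        return response.text
```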

Then there are two more methods. fetch_json is asynchronous, and I know I'm going to be expecting JSON data back from the endpoints I use it on, so that's what it returns. And then a fetch_all function, which essentially gathers up the tasks for all the URLs I give it and runs them asynchronously, so I can scrape much more quickly.
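The async pair might look roughly like this, again a sketch using httpx and asyncio.gather rather than the project's exact code:

```python
import asyncio

import httpx


class Extractor:
    def __init__(self) -> None:
        self.client = httpx.AsyncClient(timeout=30)

    async def fetch_json(self, url: str) -> dict:
        response = await self.client.get(url)
        response.raise_for_status()
        return response.json()

    async def fetch_all(self, urls: list[str]) -> list[dict]:
        # Build one task per URL and run them all concurrently.
        tasks = [self.fetch_json(url) for url in urls]
        return await asyncio.gather(*tasks)
```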

Now let's think about what's actually going on here. There's nothing particularly project-specific, other than perhaps the fact that I'm returning the JSON data; I should possibly be returning the actual response object instead, which would probably be better. But what this structure gives us is that if we come across something like, hey, I now need to run a browser for this page, I can just import that, come down here, and create a new method, get_html_with_browser or whatever you want to call it, give it self and the URL, and write it in here. You get the idea. We build it up, and it becomes much more modular and much easier to manage and improve. And with that retry and the logging, if something goes wrong while we're scraping, we can see exactly what's going on.

I put most of the effort into the extractor class, because extracting the data is the hardest part. The transform, by comparison, is very straightforward and very simple. I'm just using selectolax as my HTML parser. This method parses the list page: when I'm scraping this specific site, I know I want the elements returned from my CSS selectors, and then I extract the text and the href from them. If we wanted to get more data, or do something else with it, we just write a new method here and call it where we instantiate this class in our main code.
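For illustration, a transform along those lines could look like this with selectolax; the CSS selectors and field names are made up for the example:

```python
from selectolax.parser import HTMLParser


class Transformer:
    def parse_list_page(self, html: str) -> list[dict]:
        tree = HTMLParser(html)
        items = []
        # Hypothetical selectors; every site needs its own.
        for node in tree.css("div.product-card a.title"):
            items.append(
                {
                    "name": node.text(strip=True),
                    "url": node.attributes.get("href"),
                }
            )
        return items
```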

And quickly, the load. All I'm doing in this class is saving to a JSON file. Nothing particularly interesting, but again, if I wanted to add a database it's very simple: I can add it in here and initialize the connection string or whatever I need. You can even start building a config.py or config.json file to import settings from.
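A minimal sketch of that loader, with a note on where a database writer could slot in later; the file name and method names are illustrative:

```python
import json
import logging

logger = logging.getLogger(__name__)


class Loader:
    def save_json(self, items: list[dict], path: str = "results.json") -> None:
        with open(path, "w") as f:
            json.dump(items, f, indent=2)
        logger.info("Saved %d items to %s", len(items), path)

    # A save_to_db(self, items) method could be added here later,
    # initializing its connection string from a config file.
```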

The whole thing then becomes much more of a system that you can build, use, and understand, and you can take what you need from it from project to project, so you never have to sit there and write out your client again. It makes your life much easier and gets you away from those one-off scripts that have such a high chance of failing. Unless, of course, you write lots of good logging into them anyway, in which case that's fine; this is just a much easier way to manage all of that.

So, coming to the main file: what I've got here is basically just the imports I need. I decided I wanted to scrape from these categories, so I've instantiated my classes here; I thought I must have been writing Go for a second, using e, t and l as the variable names, which is not very Pythonic at all. Then it's a straightforward main function that goes through the categories and a range of pages, pulls the data out using the methods I've created, logging as we go, and then I just run that function. I opted not to use a pipeline class, although you absolutely could, depending on what you're trying to achieve; this felt like an easier way to manage everything.
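Put together, a main.py along those lines might look like this sketch, assuming the hypothetical Extractor, Transformer and Loader methods sketched above; the categories, page count and URL pattern are placeholders:

```python
import asyncio
import logging

from extract import Extractor
from load import Loader
from transform import Transformer

logging.basicConfig(level=logging.INFO)

CATEGORIES = ["shirts", "shoes"]  # placeholder categories
PAGES = 3  # placeholder page count


async def main() -> None:
    e, t, l = Extractor(), Transformer(), Loader()
    results = []
    for category in CATEGORIES:
        for page in range(1, PAGES + 1):
            # Fetch each listing page sequentially with the blocking client.
            html = e.fetch_html(f"https://example.com/{category}?page={page}")
            items = t.parse_list_page(html)
            # Then grab all product endpoints for that page concurrently.
            product_data = await e.fetch_all([item["url"] for item in items])
            results.extend(product_data)
    l.save_json(results)


if __name__ == "__main__":
    asyncio.run(main())
```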

So if I come out of this, clear the terminal, and do uv run main.py, we see all of the logging. You can see the line saying this is the first time calling it, which comes from Tenacity, and you can see I'm going through page by page, but then asynchronously grabbing the product information 48 at a time, so I don't have to wait for each of those pages to load. Then I have my results output. If I open up my results.json file, you can see I've got 5,800 lines, and this is the information. I didn't quite format it properly, I must have done something wrong with my formatting there, but you get the idea.

What I'm trying to get at is that with a little more effort and a little more time while you're learning Python, you can really start to treat your scrapers as projects in themselves and make your life much, much easier when it comes to managing them going forward. We all know that maintenance is one of the hardest parts of scraping; I'd put it as the second hardest thing in a proper scraping project. The first is actually getting the data; the second is maintaining it. Especially when you have lots of sites to scrape and maintain data from, having a nice project structure like this, where you can easily see what's going on with your logging, work with the retries, and implement new utilities and methods as you need them, is absolutely the way to go. And essentially, what we've done here is rewrite Scrapy.