For the longest time, I would write code that looks like this. Pretty
straightforward, very understandable, and you know, it worked. But there's a
big problem with this. It's not modular. It doesn't handle anything like
errors, and it's very, very difficult to upgrade and change when things
inevitably go wrong. You'll end up having to rewrite most of this if you
want to implement a different package, for example, or anything else like that.
So, what I suggest now is to actually make your Python code much more reusable
and build yourself an actual scraping system rather than one-off scripts which
are going to break and that you'll have to fix. What I mean by that is, at the
moment we just have your main.py file, and that's fine — it works and it runs.
But what we can actually do is start to think about what our code actually
does. So, let's have a think. Well, the first thing we do is extract the data,
and this is probably going to be the most important piece here. Then we want to
do something with it, and then we want to save it somewhere. These three
keywords — extract, transform, load — are pretty common and pretty important.
ETL is very common in the programming world and in the tech world, and we're
basically doing exactly the same thing.
But what we can do now is we can start to think of how our code and our project
structure can reflect these three things. So let's have a look at it, starting
with extract. What does our extract need to do? Well, it might need to fetch
some data. There might be a sync client, or we might need to use a browser. We
might want to do something with async. We might need to set up our client and
our proxies. And the list goes on. Next is our transform. What are we going to
do here? Well, we're going to take some data in and do something with it. So
we're probably going to parse HTML, or work with JSON. And then we're maybe
going to use Pydantic classes, or something else like attrs, or just some kind
of class system for our data. These are all important things that we need to
consider. And finally, we want to load the data somewhere. Well, we're going to
want to save it to something. This could be CSV, JSON, or any database that you
want — whatever you're trying to do. So, underneath these three headers, we
can start to think: well, this is going to be a class on its own, this will be
a class on its own, and so will this. Within each class we can then start to
implement these methods and expand on them as we need to. Then each of these
classes gets pulled into whatever file we need it in — possibly something like
a pipeline — and from there our pipeline just gets executed as and when we need
it. This gives us much more modularity: the ability to change things as and
when we need to and to switch packages out. Something like the skeleton below
gives you the idea, and then let me show you what I mean in a bit more of a
real project example.
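Here's a minimal sketch of that structure — the class names and method signatures are placeholders rather than the exact code from the project:

```python
# A rough ETL skeleton - names and signatures are placeholders.

class Extractor:
    """Fetches raw data (HTML/JSON) from the web."""

    def __init__(self, proxies: dict | None = None):
        self.proxies = proxies  # set up clients, headers, proxies here

    def fetch_html(self, url: str) -> str: ...
    async def fetch_json(self, url: str) -> dict: ...


class Transformer:
    """Turns raw HTML/JSON into clean, structured records."""

    def parse_list_page(self, html: str) -> list[dict]: ...


class Loader:
    """Saves the structured records somewhere (CSV, JSON, a database...)."""

    def save(self, items: list[dict]) -> None: ...
```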
So on the screen here is my basic project that I've been using recently,
pulling some data both synchronously and asynchronously. I have my extract, my
load, my main.py and my transform. I'm using uv for this, which is the new hot
stuff when it comes to Python projects, and I think it's really cool — I'm
really liking it so far. So that's the only difference; that's what the uv.lock
file and the pyproject.toml file are. I've also got a logging file here. So
we're starting to build up our scraper as if it's an actual, proper Python
application.
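A logging file like that typically boils down to the standard-library basics — something roughly like this, where the format string and level are assumptions rather than the project's actual config:

```python
# logging setup - a guess at what a small logging module might contain.
import logging

def setup_logging(level: int = logging.INFO) -> logging.Logger:
    logging.basicConfig(
        level=level,
        format="%(asctime)s | %(name)s | %(levelname)s | %(message)s",
    )
    return logging.getLogger("scraper")
```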
So, let's open it up. I'm going to go and open my extract; let's have a look.
So, I have my extractor class. Now, generally speaking, without too much
extra specific customization, this extractor class I can just copy and
paste to whichever project I'm going to be doing next. There's nothing project
specific here. This class is solely designed to extract the data from
whatever website or URL that I send it to. So if we have a look at our
initialization, I'm getting my proxies. Probably always going to want to use
those. There's a link for the ones I use in the description. Then I'm creating my
client and then I'm updating some of the session information. There are two
clients here, as you'll see: this one, the session client, which is my
asynchronous client, and also the blocking client here. This is actually using
rnet, which I talked about in a previous video — I'm testing it out a bit more
and so far it's pretty good; I'm really enjoying it. I'm adding my proxies to
these clients and then I'm just logging some info out.
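The shape of that initialisation is roughly this — I've sketched it with httpx standing in for the actual client library, and the proxy URL and headers are placeholders:

```python
# extract.py - a sketch of the Extractor's __init__, with httpx as a stand-in
# for the real client library; PROXY_URL and HEADERS are placeholders.
import logging
import httpx

logger = logging.getLogger("scraper")
PROXY_URL = "http://user:pass@proxy.example.com:8000"  # placeholder proxy
HEADERS = {"user-agent": "Mozilla/5.0"}                # placeholder headers

class Extractor:
    def __init__(self) -> None:
        # async client, used for gathering many requests concurrently
        # (note: older httpx releases call this argument "proxies")
        self.session = httpx.AsyncClient(proxy=PROXY_URL, headers=HEADERS)
        # blocking client for straightforward, one-at-a-time requests
        self.client = httpx.Client(proxy=PROXY_URL, headers=HEADERS)
        logger.info("Extractor initialised: async + blocking clients with proxy")
```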
Again, logging is very useful: when I run this, I can see exactly what's
happening and in which parts. If we look at the actual methods on this class,
they're fairly straightforward. I have the fetch HTML sync version which uses
the blocking client. This actually has a retry decorator on top, which comes
from a package called Tenacity. I've started using Tenacity a bit, but I'm
going to be doing a video on other retry libraries — what's the easiest to use
and what's the most straightforward, because that ease of use is really what
Python's all about for me. But this is quite simple. We just say,
hey, stop after attempt number five, and wait exponentially, so the delay
between retries grows each time. Then we have our logging in here as well, so
we can see what's actually going on. Then I have the fetch HTML function
itself. It's pretty self-explanatory: if we don't get a 200 status, I raise an
exception, and that exception is what triggers the retry.
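Roughly, that decorated sync fetch looks like this — a sketch assuming an httpx-style blocking client, with the attempt count and backoff start taken from the description above:

```python
# A sketch of the retrying sync fetch - wait parameters are illustrative.
import logging
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential, before_sleep_log

logger = logging.getLogger("scraper")

class Extractor:
    def __init__(self) -> None:
        self.client = httpx.Client()  # blocking client (proxy setup omitted here)

    @retry(
        stop=stop_after_attempt(5),                  # give up after 5 tries
        wait=wait_exponential(multiplier=1, min=3),  # exponential backoff from ~3s
        before_sleep=before_sleep_log(logger, logging.WARNING),  # log each retry
    )
    def fetch_html_sync(self, url: str) -> str:
        resp = self.client.get(url)
        if resp.status_code != 200:
            # raising here is what triggers Tenacity to retry
            raise Exception(f"Bad status {resp.status_code} for {url}")
        return resp.text
```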
Then there are two more functions. One fetches JSON asynchronously — I know I'm
going to be expecting JSON data back from this function specifically, so that's
what I use there. And then there's a fetch all function, which essentially
gathers all the tasks for the URLs I give it and runs them asynchronously, so I
can scrape much quicker.
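That async pair looks something like this — again a sketch, with httpx standing in for the real client, but the gather pattern is the important bit:

```python
# A sketch of the async JSON fetch and the fetch-all gather pattern.
import asyncio
import httpx

class Extractor:
    def __init__(self) -> None:
        self.session = httpx.AsyncClient()  # async client (proxy setup omitted)

    async def fetch_json(self, url: str) -> dict:
        resp = await self.session.get(url)
        resp.raise_for_status()  # surface bad responses instead of returning them
        return resp.json()

    async def fetch_all(self, urls: list[str]) -> list[dict]:
        # build one task per URL and run them concurrently
        tasks = [self.fetch_json(url) for url in urls]
        return await asyncio.gather(*tasks)
```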
Now let's think about what's actually going on here. Well, there's nothing
particularly project specific, other than perhaps the part where I'm returning
the JSON data — I should probably be returning the actual response object there
instead; that would probably be better. But what we can do now is, if we come
across something like "hey, I need to run a browser for this page", I can just
import that, come down here, and create a new function — get HTML with browser,
or whatever you want to call it — give it self and a URL, and write it in there,
as sketched below. (I spelled browser wrong, but you get the idea.) So we start
to build it up, and it
becomes much more modular and much easier to manage and improve. And obviously,
with that retry and the logging, if something goes wrong when we're scraping,
we can see exactly what's going on.
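As an example of what that new browser method could look like — the video only stubs out the signature, so this body is my own guess, using Playwright purely as an illustration:

```python
# A guess at a browser-based fetch method, using Playwright as an example.
from playwright.sync_api import sync_playwright

class Extractor:
    # ... clients and proxies set up in __init__ as before ...

    def get_html_with_browser(self, url: str) -> str:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(url)
            html = page.content()  # fully rendered HTML after JS has run
            browser.close()
            return html
```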
So I put most of the effort into the extractor class here, because extracting
the data is the hardest part. But if I come to the transform, it's very
straightforward and very simple. I'm just using selectolax as my HTML parser,
and basically saying: here's the parser I'm using, and this method parses the
list page. When I'm scraping this specific site, I know I want the elements
returned from my CSS selectors, and then I extract the text and the href data
from them. If we wanted to get more data, or do something else with it, we just
need to write a new method here and call it when we instantiate this class in
our main code.
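That transform looks roughly like this — the CSS selector is made up, since it depends entirely on the site you're parsing:

```python
# transform.py - a sketch of a selectolax-based list-page parser.
# The CSS selector is a placeholder; yours depends on the target site.
from selectolax.parser import HTMLParser

class Transformer:
    def parse_list_page(self, html: str) -> list[dict]:
        tree = HTMLParser(html)
        items = []
        for node in tree.css("a.product-item"):  # placeholder selector
            items.append({
                "title": node.text(strip=True),       # visible text
                "url": node.attributes.get("href"),   # link target
            })
        return items
```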
And quickly, just the load here. All I'm doing in this class is saving to a
JSON file. Nothing particularly interesting, but again, if I wanted to add a
database, it's very simple — I can add it in here and initialise the connection
string or whatever I need to do. And you can even start building a config.py
file, or a config.json file, that you can then import bits from.
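The loader really is as small as it sounds — something along these lines, with the output path as a placeholder:

```python
# load.py - a sketch of a minimal JSON loader; the file path is a placeholder.
import json
from pathlib import Path

class Loader:
    def __init__(self, out_path: str = "results.json") -> None:
        self.out_path = Path(out_path)

    def save(self, items: list[dict]) -> None:
        # indent=2 keeps the output readable when you open the file
        self.out_path.write_text(json.dumps(items, indent=2, ensure_ascii=False))
        # a database loader would live here as another method, e.g. save_to_db()
```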
So the whole thing becomes much more of a system that you can actually build,
use and understand — you know what it's going to do for you — and you can take
what you need from it from project to project, so you never have to sit there
and write out your client again and all that sort of stuff. It makes your life
much, much easier and gets you away from those one-off scripts that have such a
high chance of failing. Unless, of course, you write loads of good logging and
so on, and then that's fine — but this is just a much easier way to manage all
of that. So I come to
the main file. What I've got here is basically just importing what I need. I
decided I wanted to scrape from these categories, so I've instantiated my
classes here — I thought I must have been writing Go for a second with those
short names for the E, the T and the L; not very Pythonic at all. Then it's a
straightforward main function that goes through the categories and a value in a
range, pulls the data out using the methods I've created, logging as we go, and
then I just run the function. Now, I opted not to use a pipeline class,
although you absolutely could do that depending on what you're trying to
achieve; this is just an easier way to manage everything.
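Wired together, the main file ends up something like this — the category names, page range and URL pattern are all placeholders, and this is a sketch of the flow rather than the exact script:

```python
# main.py - a sketch of how the pieces are wired together.
# Categories, page counts and the URL pattern are placeholders.
import asyncio
import logging

from extract import Extractor
from transform import Transformer
from load import Loader

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scraper")

CATEGORIES = ["mens", "womens"]  # placeholder categories
PAGES = 3                        # placeholder page count per category


async def main() -> None:
    # short e/t/l names, as joked about above - rename if you prefer
    e, t, l = Extractor(), Transformer(), Loader("results.json")
    results: list[dict] = []

    for category in CATEGORIES:
        for page in range(1, PAGES + 1):
            url = f"https://example.com/{category}?page={page}"  # placeholder URL
            logger.info("Fetching list page %s", url)
            html = e.fetch_html_sync(url)        # blocking fetch for list pages
            items = t.parse_list_page(html)
            # then hit the product endpoints concurrently for this page
            product_urls = [item["url"] for item in items if item["url"]]
            results.extend(await e.fetch_all(product_urls))

    l.save(results)
    logger.info("Saved %d items", len(results))


if __name__ == "__main__":
    asyncio.run(main())
```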
So if I come out of this now, clear this up, and do uv run main.py, we're going
to see all of the logging that I'm doing here. You can see "this is the first
time calling it" — that's from Tenacity. And you can see I'm going through page
by page, but then asynchronously grabbing the product information 48 at a time,
so I don't have to wait for each one of those pages to load. And then I've got
my results output here. If I open that up — my results.json file — you can see
I've got 5,800 lines, and this is the information. We can see it there. I
didn't quite format it properly; I must have done something wrong with my
formatting there, but you get the idea. So, what I'm trying to get at with this
is that with a little bit more effort and a little bit more time on your part
when you're learning Python, you can really start to treat your scrapers as a
project in themselves and make your life much, much easier when it comes to
managing them going forward. Because we all know that scraper maintenance is
one of the hardest things — I put it as the second hardest thing when we're
talking about actual, proper scraping projects: the first is getting the data,
the second is maintaining it. And especially when you have lots of sites to
scrape and maintain data from, having a nice project structure like this —
where you can easily see what's going on with your logging, work with the
retries, and implement new utilities and methods as you need them — is going to
be absolutely it. And essentially, what we've done here is rewritten Scrapy.