The library that will save your scrapers.
2025-04-09
➡ JOIN MY MAILING LIST https://johnwr.com
➡ COMMUNITY https://discord.gg/C4J2uckpbR | https://www.patreon.com/johnwatsonrooney
➡ PROXIES https://proxyscrape.com/?ref=jhnwr
➡ HOSTING https://m.do.co/c/c7c90f161ff6
Stamina Retries: https://github.com/hynek/stamina

A look at how I use Python classes and build structured apps over one-off scripts.

If you are new, welcome. I'm John, a self-taught Python developer working in the web and data space. I specialize in data extraction and automation. I...
Subtitles

If you've ever had your scraper fail because of one failed HTTP request, you'll know how much of a pain it can be. In this video, I want to show you how to build retries into your web scraping logic so you can easily implement them in your own code. I'm going to show you two different ways. But first, we'll zip through what a decorator is, and then I'll show you how to add one using a handwritten decorator and then using a neat Python package.

So this is my initial retry decorator, written to demonstrate how a decorator works and how the retry logic fits together. Essentially, we have a function down here where something could go wrong, and in practice that's going to be our HTTP request. If you've ever used httpx or requests, you'll know about raise_for_status(): a neat way of saying that if the response comes back with a bad status code, raise an exception. It's that exception that we can use to trigger our retries through the decorator.
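For reference, that check is a one-liner in httpx (requests has the same method):

```python
import httpx

resp = httpx.get("https://httpbin.org/status/404")
resp.raise_for_status()  # raises httpx.HTTPStatusError for any 4xx/5xx response
```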

The demo function here raises a ValueError every time. What we're going to do is write a new function that wraps our function, runs it each attempt, and handles the exception however we choose. This simple decorator just goes through three iterations, sleeping after each one, and then reports the failure and nothing more. So if I run it, we can see the output: "trying to run" for each attempt, then "failed after three attempts". That's the basis of our initial decorator.
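The file itself isn't shown in the subtitles, but a minimal sketch of that simple version might look like this (the ValueError and the printed messages follow the demo; exact wording is illustrative):

```python
import functools
import time


def retry(func):
    # run the wrapped function up to three times, sleeping between attempts
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(1, 4):
            try:
                return func(*args, **kwargs)
            except ValueError:
                print(f"trying to run (attempt {attempt})")
                time.sleep(1)
        print("failed after three attempts")
    return wrapper


@retry
def something_that_could_go_wrong():
    # stand-in for the HTTP request; fails every time
    raise ValueError("bad response")


something_that_could_go_wrong()
```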

I'm going to come over to a different part of my code, to this file here. It looks a lot more complicated, but this part with the decorator and the wrapper is pretty much exactly the same as the simple version we were just looking at; I've just wrapped it inside a class. So my Extractor class carries the retry decorator and applies it to one of its methods.

What we're doing here is setting a maximum number of attempts. This is very important: if you don't have it, your code will retry over and over again, causing you even more issues than if it had just failed in the first place. We also have a delay here, a time.sleep() between each request, and you can set both numbers however you want. Then there's the decorator function itself, with a wrapper that loops for attempt in range(1, max_attempts + 1), hence the plus one.

Inside that loop we try to return the value from the function. The function we're decorating, down here, is fetch_url. It takes a URL, makes the request using the rnet Python package, and raises an exception if the response has a bad status. That exception is intercepted up in the wrapper, where we log the failure; if we're still under the maximum number of attempts, we sleep for the delay and try again, and once we've hit the maximum, we return None.

This is important because with a standard retry decorator, it will try however many times you ask it to, and then nothing happens afterwards, and that does nothing for us. We want to retry x number of times and then decide: store this URL to try again another time, put it back into a queue if we're using a queue system, or just skip over it altogether. That's what's happening here: returning None is that signal.

It means that when we call this function down in the part that actually runs our code, we can say: if the response exists, i.e. is not None, do whatever we need to do with it; in this case I'm just logging out the URL and the headers. Otherwise, we append the URL to a failed_urls list and return that back out.
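Pieced together from the walkthrough, the class version looks something like this. I've swapped in httpx for rnet so the sketch stays self-contained and runnable; the names Extractor, fetch_url, and failed_urls follow the video, the rest is illustrative:

```python
import functools
import time

import httpx


class Extractor:
    max_attempts = 3  # hard cap: never retry forever
    delay = 2         # seconds to sleep between attempts

    def retry(func):  # plain function used as a decorator inside the class body
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            for attempt in range(1, self.max_attempts + 1):
                try:
                    return func(self, *args, **kwargs)
                except httpx.HTTPStatusError as exc:
                    print(f"attempt {attempt}/{self.max_attempts} failed: {exc}")
                    if attempt < self.max_attempts:
                        time.sleep(self.delay)
            return None  # attempts exhausted; the caller decides what happens next
        return wrapper

    @retry
    def fetch_url(self, url):
        response = httpx.get(url)
        response.raise_for_status()  # bad status -> exception -> retry
        return response


extractor = Extractor()
failed_urls = []
for url in ("https://httpbin.org/status/200", "https://httpbin.org/status/404"):
    response = extractor.fetch_url(url)
    if response is not None:          # the request eventually succeeded
        print(url, response.headers)
    else:
        failed_urls.append(url)       # re-queue, retry later, or skip
print("failed urls:", failed_urls)
```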

So if I run my function here, we initially see successful attempts, because I'm reaching out to httpbin.org and getting a 200 response. When I try a 404, we can see one, two, and then three attempts, and it fails. Because it failed, we return None, store that URL, and skip on. I'll let this finish, and at the end we'll have a list of all the URLs that failed after three retries.

You can see them down here under this warning: a list of the URLs that failed for us. In real-world code, we could then do something with those URLs. What's happened is that we've tried to handle whichever status codes we decided to catch, in the best way we can: we retried, and maybe on each of those retries we would have rotated in different proxies or different fingerprints. If it still fails after that, we log it out so we can do something with it later and we're aware of what's going on.

Retries and logging really do go hand in hand, and I think in this one I'm using structlog, a cool package I've just started using. It's a very simple way to avoid having to set up your own loggers, so if that's something you're interested in, I'd definitely check out structlog. It seems pretty cool so far.
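Out of the box, structlog really is this small to use, with no handler or formatter setup needed (the event names and fields here are just examples):

```python
import structlog

log = structlog.get_logger()
log.info("fetch_attempt", url="https://httpbin.org/status/404", attempt=1)
log.warning("max_retries_reached", failed_urls=["https://httpbin.org/status/404"])
```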

Let's move on to a different way of working with decorators. I'm going to introduce you to a Python package called stamina, by Hynek, who's very active in the Python community. It's built on tenacity: as I said, stamina is a wrapper around tenacity, but it gives you some sensible defaults, makes it that little bit easier to use, and adds some extra features. I'll link to the package down below so you can check those out. It's been pretty cool.
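In its basic decorator form, stamina looks like this; the on= and attempts arguments are the core of the API (this mirrors the example in stamina's own README, which also uses httpx):

```python
import httpx
import stamina


@stamina.retry(on=httpx.HTTPError, attempts=3)
def fetch(url: str) -> httpx.Response:
    # retried up to three times, with backoff, whenever an httpx.HTTPError is raised
    resp = httpx.get(url)
    resp.raise_for_status()
    return resp
```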

But one thing I really wanted to try was working with an async client, using stamina to handle the retries there. The way I ended up doing it was by using the retry inside a context manager rather than as a decorator. That has positives and negatives: a decorator makes it easy to decorate any function that might fail and could be retried, but in this instance the only things we want to retry are the HTTP requests, so using the context manager is, I think, a pretty good option here.

It sits inside my main fetch function, where I've got max_attempts equal to three; this could be whatever you want it to be. Then we do async for attempt in stamina.retry_context(...). If you go to the source code, you can see that this context manager yields attempt objects that let us retry the block inside it, which is pretty cool; so far I think it looks really nice. So we have that loop, then with attempt:, and that's where the actual request code goes. And of course, we're using rnet and asynchronous code here, so it all fits in nicely.

One thing I did do here is create my own StatusCodeError. That came from when I was working with it initially and wanted to try a few different things. It turned out not to be particularly necessary, but it is a clearer signal of what's going wrong than a bare Exception, so it's the better option, and you might want to consider writing your own exceptions when you get to that point.
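A custom exception like that is only a few lines; something along these lines (the name StatusCodeError is from the video, the exact fields are my guess):

```python
class StatusCodeError(Exception):
    # raised when a response comes back with a status code we don't want to accept
    def __init__(self, status_code: int):
        self.status_code = status_code
        super().__init__(f"bad status code: {status_code}")
```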

So if we look back into our async attempts: we log the attempt with the URL, and then, if the response is not OK, we check the attempt number. If it's equal to the maximum number of attempts, we return the status code; that's just our way of returning something once we've reached the maximum. In the decorator version earlier we returned None, and we could have done that here too, but in this case I chose to return the status code, and I'll show you what that looks like when it runs in just a second. Otherwise, we raise our StatusCodeError so that stamina retries. And if everything's good, we return the text.

What I'm going to do here is take four URLs and reach out to them asynchronously using our client, impersonating as well. We don't need impersonation for this, but I'm really enjoying rnet at the moment, and impersonating TLS fingerprints is what it's good for. Then we try to get the information, and I'm just going to print the data. We know from looking at these httpbin URLs that some of them are going to fail, so let's see how that's handled.
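Reconstructed from the walkthrough, the async version is roughly as below. Again I'm substituting httpx's AsyncClient for rnet so it runs as-is (rnet's draw is the TLS-fingerprint impersonation, which httpx doesn't do), and the URLs are illustrative:

```python
import asyncio

import httpx
import stamina


class StatusCodeError(Exception):
    pass  # the custom exception sketched above; raising it triggers the retry


MAX_ATTEMPTS = 3

URLS = [
    "https://httpbin.org/json",
    "https://httpbin.org/status/404",
    "https://httpbin.org/status/504",
    "https://httpbin.org/uuid",
]


async def fetch(client: httpx.AsyncClient, url: str):
    # retry_context yields attempt objects; the `with attempt:` block is retried
    # whenever it raises one of the exceptions named in on=
    async for attempt in stamina.retry_context(on=StatusCodeError, attempts=MAX_ATTEMPTS):
        with attempt:
            response = await client.get(url)
            if not response.is_success:
                if attempt.num == MAX_ATTEMPTS:
                    return url, response.status_code  # out of attempts: hand back the code
                raise StatusCodeError(response.status_code)  # not the last attempt: retry
            return url, response.text


async def main():
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*(fetch(client, url) for url in URLS))
    for url, data in results:
        print(url, data)


asyncio.run(main())
```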

Let me clear the screen and run uv run main.py. We can see our stamina retries here. There are lots of different ways you can manage the retry and the backoff within stamina; I'm just using the defaults here as an example, but I definitely recommend you check out the GitHub repo. Hynek has done a video on it too, which explains the differences between tenacity and stamina and what he wanted to include; it's definitely worth watching. I think stamina is the retry package I'll be using going forward.

So, looking at what came back from my code: we didn't fail. Our code did not crash just because we hit a 404 or a 504 (or whatever that was; I think I made that one up). What we've done is return the status code after the maximum number of attempts. We can see that up here: retry number one, two, three, failing on those URLs, then maximum retries reached, passing on this URL, and because I chose to return the status code, that's what comes back.

Now the information coming out of my function includes the URL and either the data (the JSON) or our bad status code. We can review this, review the logs, and see exactly what's happened. Our whole run doesn't finish and crash just because one HTTP request fails; we can retry requests and then handle them however we want. If you've enjoyed this video, you'll want to watch this one next, which goes into more detail about how I actually get data when I'm scraping sites.