If you've ever had your scraper fail because of one failed HTTP request, you'll know how much of a pain it can be. In this video, I want to show you how to build retries into your web scraping logic so that you can easily implement them in your own code. I'm going to show you two different ways.
But first, we'll talk a little bit about what a decorator is. We'll zip through that, and then I'll show you two ways to add retries: a handwritten decorator that I'm going to write, and a cool Python package.

So this is my initial retry decorator, which I've written to demonstrate how a decorator works and how this logic fits together. Essentially, we have a function down here where something could go wrong, and in practice that's going to be our HTTP request. Now, if you've ever used HTTPX or requests, you'll know they have a raise_for_status method. That's a neat way of saying: if the status is a bad status, raise an exception. And it's that exception that we can use to activate our retries through our decorator.
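For reference, here's what that looks like in HTTPX (requests is nearly identical):

```python
import httpx

response = httpx.get("https://httpbin.org/status/404")
response.raise_for_status()  # raises httpx.HTTPStatusError on any 4xx/5xx status
```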
The demo function here is just going to raise a ValueError every time. What we're going to do is write a new function that wraps our function, runs it each time, and handles the exception however we choose. This simple decorator is just going to go through three iterations, sleeping after each one, and then report that it failed. So if I run this one, we can see we're getting our error: trying to run, trying to run, failed after three attempts. This is how we're going to build our initial decorator.
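Roughly, the simple version looks like this (a minimal sketch; the function names and the one-second delay are my stand-ins):

```python
import time

def retry(func):
    # Wrap the function so we can re-run it when it raises.
    def wrapper(*args, **kwargs):
        for attempt in range(1, 4):  # three attempts
            try:
                return func(*args, **kwargs)
            except ValueError:
                print(f"trying to run, attempt {attempt} failed")
                time.sleep(1)  # pause before the next attempt
        print("failed after three attempts")
    return wrapper

@retry
def something():
    # Stand-in for an HTTP request that could go wrong.
    raise ValueError("bad response")

something()
```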
Now I'm going to come over to a different part of my code, to this file here. It looks a lot more complicated, but essentially this part here with the decorator and the wrapper is pretty much exactly the same as the simple version we were just looking at. I've just wrapped it inside a class: my extractor class holds the decorator and a method that uses it. What we're doing here is setting a maximum number of attempts. This is very important, because without it your code will retry over and over and over again, causing you even more issues than if it had just failed in the first place. We also have a delay in here, a time.sleep between each request, and you can choose both of these numbers however you want.
Then I have the decorator function itself. It's a wrapper, and we say: for attempt in range from one to our maximum attempts, and I've done plus one here so the range includes the final attempt. We try to return the value from the function. Now, the function we're going to put this on is our fetch_url function down here. It takes our URL and makes the request, and if the response is bad (this is using the rnet Python package) we raise an exception. That exception gets intercepted by our wrapper up here, where we log that it failed. If we're still under our maximum number of attempts, we sleep for our delay and try again. And once we're at the maximum, we return None.
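Here's a minimal sketch of that shape. The video uses rnet for the requests; I've used httpx here as a stand-in, and the class and attribute names are mine:

```python
import time
import httpx

class Extractor:
    max_attempts = 3  # hard cap so we never retry forever
    delay = 2         # seconds to sleep between attempts

    def retry(func):  # plain function in the class body, used as a decorator
        def wrapper(self, url):
            for attempt in range(1, self.max_attempts + 1):
                try:
                    return func(self, url)
                except httpx.HTTPStatusError as e:
                    print(f"attempt {attempt} failed: {e}")
                    if attempt < self.max_attempts:
                        time.sleep(self.delay)
            return None  # out of attempts: let the caller decide what to do

        return wrapper

    @retry
    def fetch_url(self, url):
        response = httpx.get(url)
        response.raise_for_status()  # bad status -> exception -> retry
        return response
```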
This is important, because what can happen with a standard retry decorator is that it will try however many times you ask it to, but then nothing happens afterwards, and that's not going to do anything for us. We want to retry x amount of times and then decide: hey, we're going to store this URL to try again another time, put it back into a queue if we're using a queue system, or just skip over it altogether. That's what returning None does here. It means that when we actually call this function, in the part that runs our code, we can say: if the response exists, i.e. if the response is not None, we do what we need to do with it. In this case I'm just logging out the URL and the headers. Otherwise, we append the URL to a failed-URLs list and return that list back out.
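The calling side might look something like this (a sketch continuing the example above; the URLs are just placeholders):

```python
extractor = Extractor()
failed_urls = []

for url in ["https://httpbin.org/status/200", "https://httpbin.org/status/404"]:
    response = extractor.fetch_url(url)
    if response is not None:
        print(url, response.headers)  # success: use the response
    else:
        failed_urls.append(url)       # gave up: store the URL for later

print("failed:", failed_urls)
```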
So if I run my function here, we initially see successful attempts, because I'm reaching out to httpbin and getting a 200 response. When I try a 404, we can see it makes one, two, three attempts and then fails. Because it's failed, it skips on: we store that URL by returning None and catching it afterwards. I'm just going to let this finish, and at the end we'll have a list of the URLs that failed all three retries. You can see them down here under this warning: this is the list of the URLs that failed for us.
Now, in real-world code we could then do something with these URLs. What's happened is: for whatever status code we've decided we want to catch and work with, we've tried to handle it in the best way we can. We've retried it, and maybe on each of those retries we would have swapped in different proxies or different fingerprints. If it still fails, we log it out so we can do something with it later and we're aware of what's going on. Retries and logging really do go hand in hand.
I think I've used it here with structlog, which is a cool package I've just started using. It's a very, very simple way to handle logging without having to set up your own loggers. If that's something you're interested in, I would definitely check out structlog. It seems pretty cool so far.
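For instance, getting a logger with structlog is a one-liner (a minimal sketch; the event names and key-value pairs are mine):

```python
import structlog

log = structlog.get_logger()

log.info("request_retrying", url="https://httpbin.org/status/404", attempt=2)
log.warning("request_failed", url="https://httpbin.org/status/404", attempts=3)
```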
Let's move on to a different way of working with decorators, and I'm going to introduce you to a Python package too. It's called stamina, by Hynek, who is very active in the Python community, and it's built on tenacity. So let's have a look at it over here. I'm importing it in. As I said, stamina is a wrapper around tenacity, but it gives you some sensible defaults, makes it that little bit easier to use, and also adds some extra features. I'll link to the package down below so you can check those out. Anyway, it's been pretty cool.
But one thing I really wanted to try was working with an async client, using stamina to handle the retries there. The way I came up with doing this is actually using the retry inside a context manager rather than a decorator. Now, this has positives and negatives. Obviously a decorator is easier, as you can decorate any function that might fail and could be retried, but in this instance the only things we want to retry are the HTTP requests, so using the context manager to do that is a pretty good option here. It's going to sit inside my main fetch function, and I've got max_attempts equal to three; this could be whatever you want it to be. Then we do: async for attempt in stamina.retry_context. If we go to the source code, we can see that this context manager yields attempt objects that allow us to retry the context within them. This is pretty cool in my opinion; so far it looks really nice.
So we've got this, and I say with attempt, and that's where the actual request code goes. Of course we're using rnet and asynchronous code here, so this all fits in nicely. One thing I did do is create my own status code error. This was from when I was working with it initially and wanted to try a few different things. It turned out not to be strictly necessary, but it's a clearer signal of what's going on than a bare exception, so it's a bit of a better option, and you might want to consider writing your own exceptions when you get to that point.
If we look back at our async attempts: we're logging the URL, and we're saying that if the response is not OK, we return the status code if our attempt number is equal to the maximum number of attempts. That's just our way of returning something once we've reached the maximum. In the previous decorator example we returned None; we could have done that here too, but in this case I chose to return the response's status code, and I'll show you how that looks when it runs in just a second. Otherwise, we raise our status code error. And if everything's good, we return the text.
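Put together, the shape of it is something like this (a sketch: again I'm using httpx in place of rnet, and I'm assuming stamina's documented retry_context and attempt.num API):

```python
import asyncio
import httpx    # stand-in for rnet here
import stamina

MAX_ATTEMPTS = 3

class StatusCodeError(Exception):
    """Raised on a bad status so the retry only fires when we want it to."""

async def fetch(client: httpx.AsyncClient, url: str) -> str | int:
    # retry_context yields attempt objects; the `with attempt:` block is
    # re-run whenever it raises one of the exceptions named in `on=`.
    async for attempt in stamina.retry_context(on=StatusCodeError, attempts=MAX_ATTEMPTS):
        with attempt:
            response = await client.get(url)
            if response.status_code >= 400:
                if attempt.num == MAX_ATTEMPTS:
                    # Last attempt: return the status code instead of crashing.
                    return response.status_code
                raise StatusCodeError(f"{url} -> {response.status_code}")
            return response.text
```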
What I'm going to do here is take four URLs and reach out to them asynchronously using our client. We're impersonating as well; we don't need to do that for this, but I'm really enjoying using rnet at the moment, which is good for impersonating TLS fingerprints. Then we try to get the information, and I'm just going to print the data. Now, we know from looking at these httpbin URLs that some are going to fail, so let's see how that's handled. Let's clear the screen and do uv run main.py.
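The driver for that looks roughly like this (placeholder URLs; same assumptions as the sketch above):

```python
async def main() -> None:
    urls = [
        "https://httpbin.org/json",
        "https://httpbin.org/status/404",
        "https://httpbin.org/headers",
        "https://httpbin.org/status/504",
    ]
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*(fetch(client, url) for url in urls))
    for url, data in zip(urls, results):
        print(url, data)  # either the body text or the bad status code

asyncio.run(main())
```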
We can see our stamina retries here. Now, there are lots of different ways you can manage the retries and the backoff within stamina. I'm just using the defaults here as an example, but I definitely recommend you check out the GitHub. He's done a video on it too, which explains the differences between tenacity and stamina and what he wanted to include. It's definitely worth looking at; I think it's the retry package I'm going to be using going forward.
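For example, stamina also ships a decorator form where you can tune the attempt cap and the backoff. As I understand the API, that looks something like this (the exact parameter values here are illustrative):

```python
@stamina.retry(
    on=httpx.HTTPError,  # which exceptions trigger a retry
    attempts=3,          # cap the attempts
    wait_initial=0.5,    # starting backoff in seconds
    wait_max=5.0,        # ceiling for the backoff
)
async def fetch_text(client: httpx.AsyncClient, url: str) -> str:
    response = await client.get(url)
    response.raise_for_status()
    return response.text
```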
So, looking at what's been returned from my code: we didn't fail. Our code did not crash just because we hit a 404 or a 504, or whatever that was; I think I just made that error up. What we've done is return the status code after the maximum number of attempts has been reached. We can see that up here: retry number one, two, and so on, and it fails on these ones. Maximum retries reached, so we pass on this URL, and because I chose to return the status code, the information coming out of my function includes the URL and either the JSON data or our bad status code. So we can review this, we can review the logs, and we can see what's happened. We don't have our whole run finishing and crashing just because one HTTP request fails; we can retry requests and then handle them however we want. If you've enjoyed this video, you might want to watch this one next, which goes into more detail about how I actually get data when I'm scraping sites.