One of the problems with putting web scraping directly into a web application is that scraping jobs take time to run, and they block your app while they're running. Scraping jobs can take minutes, if not longer, and you just can't have your app blocked for that long. In this video we're going to look at RQ, a Python library built on top of Redis that makes it easy to set up a queue system with workers and jobs, so we can hand our scraping tasks off, get a job ID back, and then use that ID to collect the results. Think about how this would work in your application: you might have a load of URLs in a database that you want to update while your application carries on doing other things, or maybe you take in a load of URLs from your users. This is a great way to handle that.
If we have a look at the RQ docs here, we see there are queues, workers, results, and jobs. Essentially, we create a queue in Redis, as you can see here, and then give it jobs; in our case a job is going to be our scraper function plus a URL. The RQ workers will automatically pick the job up, run that function with that URL, and we can then query
back for the results. So to get started, we have this scraping code here. I'm just going to quickly run it so you get an idea of what comes back: you can see we're getting some arbitrary product information from each page of my test site. So this works, and here's our run function.
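The run function only flashes by on screen, so here's a rough, hypothetical stand-in; the URL pattern, the markup it parses, and the regex are all assumptions, not the video's actual scraper:

```python
# scraper.py -- hypothetical stand-in for the run() function used below.
# The markup pattern it parses is an assumption, not the real site.
import re
from urllib.request import urlopen

PRODUCT_RE = re.compile(
    r'<h2 class="name">(.*?)</h2>\s*<span class="price">(.*?)</span>', re.S
)

def parse(html: str) -> list[dict]:
    """Pull (name, price) pairs out of one page of product markup."""
    return [
        {"name": name.strip(), "price": price.strip()}
        for name, price in PRODUCT_RE.findall(html)
    ]

def run(url: str) -> list[dict]:
    """Fetch one page and return its product info (what the workers will call)."""
    with urlopen(url, timeout=10) as resp:  # stdlib; requests would work too
        return parse(resp.read().decode())
```

In a real scraper you'd likely use requests and BeautifulSoup rather than a regex, but this keeps the sketch dependency-free.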
Now let's come out of this and create a new file called product_q.py. We're basically going to follow along with the docs to start with, so let me get them back up here as a reference. From rq we import the Queue class (let me make this a bit bigger), then from redis we import the Redis class, then we import our scraper: from scraper import run, the function I just showed you. I'm also going to import time, because we'll need that in a moment.
Let's start with our Redis connection, which is going to be an instance of the Redis class. I have Redis running on my local server all the time, just so I can access it whenever I need to. You can of course run Redis on your local machine if you want to test with it, or run it in the cloud if you want to throw a little bit of money at it; it's entirely up to you. If this were a full web application, Redis would be running either on the same server as the application or on a dedicated one if it grows too big. Mine is at 192.168.11.144, and the port is 32768, I hope.
Now the queue itself: q is an instance of the Queue class, with the connection being our Redis connection. This sets up the queue on our Redis instance so we can actually use it. If you look over here, the docs basically say we can add jobs with q.enqueue, so that's essentially what we're going to do. Our urls list is going to be built in a loop (we'll grab the URL in just a second): for x in range(1, 13), because I know that's how many pages there are on that site. Let me quickly re-grab the URL, because I lost it; back to my product page, paste that in, and we'll make it an f-string so we can construct all of our URLs this way. As I said, your URLs might come from your clients, from your customers, or from your database, or something like that; we're just constructing them like this.
Now we want to create all of our jobs with this. So: for idx, url in enumerate(urls, start=1). The reason I'm using enumerate is that we need a job ID, and I'm just going to use the index from the list as the job ID; you probably have a better way of doing this, depending on what you're trying to do. Then our job is q.enqueue(run, url, job_id=str(idx)); the job ID needs to be a string, so I'm turning the integer index into a string there. Then I'm going to print the length of the queue with len(q), just so we can see that things are going on.
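Putting the enqueue side together, a sketch of product_q.py might look like this; the Redis host and port are the ones from the video (adjust for your own setup), and the site URL is a placeholder, since the real one isn't shown:

```python
# product_q.py -- sketch of the enqueue side.
BASE_URL = "https://example.com/products/page-{}"  # placeholder URL pattern

def build_urls(pages: int = 12) -> list[str]:
    """Pages 1..12, matching range(1, 13) in the video."""
    return [BASE_URL.format(x) for x in range(1, pages + 1)]

def main() -> None:
    # Imported here so build_urls() is usable without rq/redis installed.
    from redis import Redis
    from rq import Queue
    from scraper import run  # the run(url) function shown earlier

    redis_conn = Redis(host="192.168.11.144", port=32768)  # host/port from the video
    q = Queue(connection=redis_conn)
    for idx, url in enumerate(build_urls(), start=1):
        # job_id must be a string; we just reuse the list index
        q.enqueue(run, url, job_id=str(idx))
    print(f"Queue length: {len(q)}")

if __name__ == "__main__":
    main()  # needs a running Redis server (and an rq worker to process jobs)
```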
Now, to get our results back, we come over to the documentation again; if I make this bigger, go up to Results, and scroll all the way down to the bottom, you'll see it gives you a couple of options. We're going to use this one here, where job = Job.fetch(...) and then job.return_value is the shortcut for the whole thing; that's what we want. It does mention multiple results here, and that's because if you run the same job over and over again it actually stores the results from each run, up to a certain point, but we're only doing one run each. Back in our code, this is job = Job.fetch(...) (we need to auto-import Job from rq; see it pop up there), with the id being "1" to ask for the first one, and we need to give it the Redis connection we created earlier. Then we can print job.return_value. I'm also going to put time.sleep underneath this, giving it 5 seconds, just to give the job a chance to complete; obviously it's been sent off and the scraping actually has to happen first. So this looks about right.
We might need to fix a couple of other things, but I'm going to come over to my other terminal and run rq worker. This is obviously connecting to the same Redis, and you can see high, default, and low; that's just a priority system, which we're going to ignore for the moment. So our workers are running. Let's come back over here, come out of this, and run our product_q.py file with Python. You can see that we have plenty of things in the queue, and we wait... and now we have the data coming back. If I go back over to the worker, you'll see that it's actually completed all of the jobs; all the page numbers are there on the screen (the one I'm pointing at that you can't see). So the data is all there; we just need to request it.
But what we want is to not request the data in the same file that sends jobs to the queue. The idea is that the queue system sits in the middle: we send everything to it, we wait, and then we come back and pick the data back up from the jobs. So I'm going to create a new file, job_collection.py; that sounds about right. In it, from rq.job we import Job (we need the Job class, don't we), and from redis we import Redis. We need to set up the connection again, which I'll just quickly type out.
Now we can actually go and collect all the data from our jobs. Let's come back over here; rather than for job (we can't use job as the name), we'll do for j in jobs, and let's actually create a list of job IDs first (yes, I know there's probably a quicker way to do that). Then data = Job.fetch(id=str(j), connection=...), which is basically me going back through the jobs to connect to, with the connection being our Redis connection. I've imported something I don't need; thank you, goodbye. Then we can print data.return_value. Let's save.
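Sketched out as its own file, the collection side might look like this; again the Redis details are the ones from the video, and the NoSuchJobError handling is an addition for when a result has already expired:

```python
# job_collection.py -- sketch of the collection side.

def expected_ids(pages: int = 12) -> list[str]:
    """The string job IDs we assigned when enqueueing: '1' through '12'."""
    return [str(i) for i in range(1, pages + 1)]

def collect(job_ids: list[str]) -> dict:
    # Imported here so expected_ids() is usable without rq/redis installed.
    from redis import Redis
    from rq.job import Job
    from rq.exceptions import NoSuchJobError

    redis_conn = Redis(host="192.168.11.144", port=32768)
    results = {}
    for job_id in job_ids:
        try:
            job = Job.fetch(id=job_id, connection=redis_conn)
        except NoSuchJobError:
            results[job_id] = None  # expired (result_ttl) or never enqueued
            continue
        # On recent RQ versions return_value is a method; older ones use job.result
        results[job_id] = job.return_value()
    return results

if __name__ == "__main__":
    for job_id, data in collect(expected_ids()).items():
        print(job_id, data)
```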
We should still have those jobs in Redis, so we'll run python3 job_collection.py, and there's all the data from the jobs we already ran the first time, with the other code. If I come back here, you can see the docs say the result is kept for 500 seconds, so it's all still there waiting for us. Really cool.
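That 500 seconds is RQ's default result_ttl, and you can keep results around longer by passing result_ttl when you enqueue. A small sketch; the hour-long TTL is just an example value:

```python
def enqueue_kwargs(job_id: str, keep_seconds: int = 3600) -> dict:
    """Extra keyword arguments for Queue.enqueue: a string job ID, plus a
    result_ttl that keeps this job's result for an hour instead of the
    default 500 seconds."""
    return {"job_id": job_id, "result_ttl": keep_seconds}

# Usage (needs rq and a live Redis connection):
#     q.enqueue(run, url, **enqueue_kwargs("1"))
```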
So, what we've done is essentially create a queue and job system using RQ and Redis, which has been really easy. We've basically said: here's the scraper function we want to run, here are all the URLs, let it do its thing, and we'll come back later and pick up the results. So all you'd do is let it all run; I think if you try to collect results when they're not available, you get something back saying the job isn't done yet, so you can still poll for that, work out what you still need to fetch, and then do whatever you need to with your data. This is a great way to handle any kind of long-running task in your application, and I think it works pretty well for web scraping too. Let's just go back to our product_q file so you can see it there.
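Earlier I mentioned that collecting a result before the job is done just tells you it isn't finished yet; that polling can be sketched generically like this — the RQ call shown in the docstring is only an illustration of what you'd pass in:

```python
import time
from typing import Callable, Optional

def poll(fetch: Callable[[], Optional[object]],
         attempts: int = 10, delay: float = 0.5) -> Optional[object]:
    """Call fetch() until it returns something other than None, or give up.

    With RQ you might pass, for example:
        lambda: Job.fetch("1", connection=redis_conn).return_value()
    since return_value is None while the job is still running.
    """
    for _ in range(attempts):
        result = fetch()
        if result is not None:
            return result
        time.sleep(delay)
    return None
```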
And this is basically it. So hopefully you've enjoyed this video. We're going to expand on this and make it into something a bit better, building an application around it. I really want to explore the Django part; under the integrations section for Django it looks really simple to use, so that's what we're going to do in the next video. If you're interested in that and want to see it working, make sure you subscribe. Also, join the Discord; there are loads of people in there now, all talking about all different sorts of web scraping. It's fantastic; it's gone better than I ever imagined. If you want to watch more web scraping content where I actually get the data, like I did in this scraper, you'll want to watch this video next. Cheers.