Video Thumbnail 10:38
How I Use Python-RQ to create a scraper queue
6.1K
225
2024-01-13
➡ JOIN MY MAILING LIST https://johnwr.com ➡ COMMUNITY https://discord.gg/C4J2uckpbR ➡ PROXIES https://proxyscrape.com/?ref=jhnwr ➡ WEB SCRAPING API https://hubs.li/Q043T88w0 ➡ HOSTING https://m.do.co/c/c7c90f161ff6 If you are new, welcome. I'm John, a self-taught Python developer and content creator, working at Zyte. I specialize in data extraction and automation. If you like programming and web content as much as I do, you can subscribe for weekly content. All views in this video are my o...
Subtitles

One of the problems with putting web scraping directly into a web application is that scraping jobs take time to run, and they block your app while they're running. Scraping jobs can take minutes, if not longer, and you just can't have your app blocked for that long. In this video we're going to look at RQ, a Python library built on top of Redis that makes it easy to set up a queue system with workers, jobs, and so on. We can hand our scraping tasks off, get a job ID back, and use that ID to retrieve the results later.

Think of it like this: in your application you might have a load of URLs in a database that you want to update while the app carries on doing other things — this is a great way to do that — or maybe you take in a load of URLs from your users. If we have a look at the RQ documentation, we see it covers queues, workers, results, and jobs.

Essentially, what we do is create a queue in Redis and give it jobs. In our case, a job is our scraper function plus a URL. The RQ workers automatically pick each job up, run the function with that URL, and we can then query back for the results.

To get started, we have this scraping code here. I'm just going to quickly run it so you get an idea of what should come back: you can see we're getting some arbitrary product information from each page of my test site. So this works, and here's our run function.
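As a rough sketch, the scraper module might look something like this — the URL handling and CSS selectors here are placeholders I've assumed, not the actual code from the video:

```python
# scraper.py -- hypothetical sketch of the run() function we'll enqueue.
# The selectors below are assumptions; adapt them to your target site.
import requests
from bs4 import BeautifulSoup


def parse_products(html: str) -> list[dict]:
    """Pull product name and price out of one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {
            "name": card.select_one("h3").get_text(strip=True),
            "price": card.select_one(".price").get_text(strip=True),
        }
        for card in soup.select(".product")
    ]


def run(url: str) -> list[dict]:
    """Fetch one page and return its products -- this is what a worker will call."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return parse_products(resp.text)
```

The important part is that `run()` takes plain arguments and returns a plain value, so RQ can serialize both: the worker executes the function and stores whatever it returns.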

Let's come out of this and create a new file; let's call it product_q.py. We're basically going to follow along with the docs to start with, so let me get them back up here as a reference. From rq we import the Queue class (let me make this a bit bigger), then from redis we import the Redis class. Then we want our scraper, so from scraper we import run — the function I just showed you. I'm also going to import time, because we'll need it in a moment.

Let's start with our Redis connection, which is going to be an instance of the Redis class. I have Redis running on my local server all the time, just so I can access it whenever I need it. You can of course run Redis on your local machine if you want to test with it, or run it in the cloud if you want to throw a little bit of money at it — it's entirely up to you. If this were a full web application, Redis would be running either on the same server as the app or on a dedicated one if it grows too big. Mine is at 192.168.1.144 and the port is 32768.

Now we can create the queue: it's an instance of the Queue class, with the connection set to our Redis connection. This sets up the queue on our Redis instance so we can actually use it.

If you look at the documentation, it says we can add jobs by calling q.enqueue, so that's essentially what we're going to do. Our urls list is going to be built for x in range(1, 13), because I know that's how many pages there are on that site. Let me quickly grab the product URL again, since I lost it, paste it in, and make it an f-string so we can construct all of our URLs this way. As I said, your URLs might come from your clients, your customers, or your database; we're just constructing them like this.

Now we want to create all of our jobs, so: for idx, url in enumerate(urls, start=1). The reason I'm using enumerate is that we need a job ID, and I'm just going to use the index from the list as the job ID — you probably have a better way of doing this, depending on what you're trying to do. Then our job is equal to q.enqueue with our run function, the URL, and job_id equal to str(idx) — I'm just turning the index, which is an integer, into a string. Then I'll print the length of the queue with len(q), just so we can see things are going on.

Now, to get back our results, let's come back over to the documentation. If I make this bigger, go to the Results section, and scroll all the way down to the bottom, you'll see it gives you a couple of options. We're going to use this one, where job = Job.fetch(...) and return_value is a shortcut for the whole thing. It does mention multiple results here; that's because if you run the same job over and over again, it actually stores the results from each run, up to a point. But we're only doing a single run each, so let's come back to our code.

This is going to be job = Job.fetch — we need to import Job, and auto-import picks it up (you can see it pop up there). We ask for id equal to "1", the first one, and give it the connection, which is the Redis connection we created earlier. Then we can print job.return_value(). I'm going to put time.sleep underneath this, just to give the job a chance to complete — five seconds — because obviously the job has been sent off and the scraping actually has to be done first. This looks about right.

We might need to fix a couple of other things, but I'm going to come over to my other terminal and start the RQ workers with the rq worker command. This is obviously the same Redis we're connecting to, and you can see high, default, and low — that's just a priority system; we're going to ignore it for the moment. So our workers are running.
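Starting a worker is just the rq CLI pointed at the same Redis instance — the address below is the one from my setup, and listing several queues makes the worker drain them in priority order:

```shell
# Listen on three queues; "high" is drained before "default", then "low".
rq worker --url redis://192.168.1.144:32768 high default low
```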

Let's come back over here, exit this, and run our product_q Python file. You can see that we have plenty of things in the queue, and we wait, we wait — and now we have the data coming back. If I go back over to the worker, you'll see that we've actually completed all of the jobs: all the page numbers are there on my screen (which you can't see). So the data is all there; we just need to request it.

But what we don't want to do is request the data in the same file we use to send jobs to the queue. The idea is that the queue system sits in the middle: we send everything to it, we wait, and when we come back we pick the data up from the jobs. So I'm going to create a new file — job_collection.py sounds about right. From rq.job we import Job, and from redis we import Redis. We need to set up the connection again, which I'll just quickly type out. Now we can actually go and collect all the data from our jobs.

Let's come back over here. First let's create a new list of the job IDs — yes, I know there's probably a quicker way to do that. We can't use job as the loop variable, so: for j in jobs, our data is equal to Job.fetch, with the id being str(j) — this is basically me going back through the jobs to connect to — and the connection is our Redis connection. (I've imported something I don't need — thank you, goodbye.) Then we can print data.return_value(). Let's save.

We might still have those jobs in there, so we'll run python3 job_collection.py — and there's all the data from the jobs we ran the first time with the other code. If I come back here, you can see the docs say the result is kept for 500 seconds by default, so it's all there waiting for us. That's really cool.

So what we've done is essentially create a queue and job system using RQ and Redis, which has been really easy. We've basically just said: here's the scraper function we want to run, here are all the URLs, let it do its thing, and we'll come back later and pick up the results. So all you'd do is let this all run; I think if you try to collect results when they're not available, you get a response back saying the job isn't done yet, so you can still poll for that, work out what you need to get, and then do whatever you need to do with your data. This is a great way to handle any kind of long-running task in your application, and I think it works pretty well for web scraping too.

Let's just go back to our product_q file so you can see it all there — this is basically it. Hopefully you've enjoyed this video. We're going to expand on this, make it into something a bit better, and build an application around it. I really want to explore the Django part: under the integrations for Django it looks really simple to use, so that's what we're going to do in the next video. If you're interested in that and want to see it working, make sure you subscribe. Also, join the Discord — there are loads of people in there now, all talking about different sorts of web scraping stuff. It's fantastic; it's gone better than I ever imagined. And if you want to watch more web scraping content where I actually get the data, like I did in this scraper, you'll want to watch this video next.

Cheers.