How To Get Started with Scrapy Redis
2024-01-28
Join the Discord to discuss all things Python and Web with our growing community! https://discord.gg/C4J2uckpbR A look at distributed scraping with Scrapy and Redis. If you are new, welcome! I am John, a self-taught Python developer working in the web and data space. I specialize in data extraction and JSON web APIs, both server and client. If you like programming and web content as much as I do, you can subscribe for weekly content. :: Links :: My Patrons really keep the channel alive, and ge...
Subtitles

In this video we're going to be looking at distributed crawling with Scrapy and Redis. On the left-hand side of my screen, in these two split terminals, I have two spiders running and waiting for instructions; they're waiting for a URL to come through so they can pass the data back. On the right-hand screen I've connected to my Redis instance through the CLI, and if I push this URL you're going to notice one of these spiders pick it up and scrape the data. There it goes, the top one got it. Now if I come over to this window and show you the actual GUI (I'm using Redis Commander to connect to my Redis instance), when I hit refresh I have this product key, and the items now hold the information that came back from that spider.

So what we've essentially done is have our spiders waiting and ready; they picked the URL up from Redis, scraped the data, and sent it back. I've just added a load more URLs, and you'll see both spiders have started to pick up jobs. Let's do it again, and we should see both of them start to flick through; you can see how quickly they pick the URLs up and run through them. If I come back to our key here and refresh, we now have 38 items. They'll all be the same thing, because I gave it the same URL over and over, but each one has a different ID with the information in it.

Now, because it's Redis, you can connect to it from any programming language you like, so you could decouple things here if you wanted to: say, take information from a JavaScript front end and then use Scrapy on the back end to get the data. Even if you're just using Python, we could push these URLs into the queue from plain Python using the redis library. So what I'll do quickly is close the spiders down and put a couple of URLs into the key myself. If we hit refresh you'll see we have start URLs here; they're just sitting there now, waiting for a spider to start and pick them up.
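If you want to do that push from Python rather than from the CLI, a minimal sketch with the redis library looks like this. The key name matches the product spider we set up later in this video, the port is just where my local instance lives, and the URLs are placeholders, so swap in your own.

import redis

# Connect to the same Redis instance the spiders use (host/port are mine, change them).
r = redis.Redis(host="localhost", port=32768, decode_responses=True)

# Placeholder URLs for whatever product pages you want scraped.
urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
]

# scrapy-redis pops start URLs off this list, so a plain LPUSH is all the feeding we need.
for url in urls:
    r.lpush("product:start_urls", url)

print(r.llen("product:start_urls"), "URLs waiting in the queue")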

Let's have a quick look at what this is actually doing. I'm going to draw this out, and we'll say that this is our Redis queue, like so; let's make that nice and big, not too big, there we go. We're basically passing the URLs we get into a spider, which we'll draw here; let's write "spider". Okay, there we go, we have our spider. The spider does the work and then passes the results back into our Redis store, which is the same Redis instance, just a different key. Of course, you could have multiple keys and values for different spiders, so they pick work up from a different place and send it back. I'll just label this so we know it's URLs going in at the top, into start_urls or however we choose to name it.

The beauty of this is that, because the spiders pick up URLs as we go, we can have as many of them as we like. We've got three here; let's have five running, for example, and they'll do exactly the same thing. They all poll the same queue for URLs, they're all the same spider, and they're all running and waiting. So we can easily scale this horizontally, which is where the distribution comes from, because you can get a Redis instance in the cloud and connect to it from anywhere. You could have these spiders running across any number of different machines, or all on one machine, and simply scale up and down depending on how big your queue is. That's the most important thing to take away from this: it's not that this is going to be the best approach if you're just scraping one site from home, but if you're trying to do lots of different things, it's a really cool way of doing it.
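As a quick preview of the settings we'll add later: pointing every machine at a shared cloud Redis is just a matter of changing the connection URL in settings.py. The host and password below are made-up placeholders.

# settings.py on every machine that runs a spider: the only thing that changes
# for a hosted/cloud Redis is this URL (placeholder credentials shown here).
REDIS_URL = "redis://:your-password@your-redis-host.example.com:6379/0"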

To replicate this, we're going to come to the GitHub repo for scrapy-redis, and if we scroll down it tells you a little bit about it. Now, I did have some issues with this when I did it on the livestream a while ago, but those have been fixed. Still, we aren't going to install from pip, because when I tried that before I had some issues; I'm just going to do the git clone from GitHub. Installing from pip may very well work now, depending on when you're watching this video. We're going to set up an example spider here, so you can see we have our Redis spider, and we're going to scrape the data. We can give it URLs from my test store, which is available for everyone; if you want to practice on it or use it, that's absolutely fine, and you'll find the URL here.

So we need to create a new project folder. I'm going to call this one "distributed", and we'll cd into it. We always want to use a virtual environment when we're working with anything, especially with Scrapy, because the dependencies can give you all sorts of problems. I'll activate it and do pip3 install scrapy so we get that installed. Then I'm going to come back over here and copy the code to install scrapy-redis from the clone; we're doing that separately, as I explained a minute ago. Let's paste that in so it runs, and it should install fine, and it has. Let's clear the terminal, come out of this folder because we don't need to be here, and do scrapy startproject; I'm just going to name the project after the code name for my site, which I don't know why it's that, it just is. Then we'll cd into the folder and do scrapy genspider, as I almost always do; we'll call this one "product", and point it at my test store's domain, like so. Cool, so now we have our skeleton Scrapy project set up. I'm going to cd into it and open up Neovim; you can use whichever code editor you like, but if you don't use Neovim, you're wrong.

We're going to come to the settings, because we do need to change a few things here. I'm going to come down to the bottom to start with and put in a comment saying this is our scrapy-redis configuration. Now we go back to the documentation, to the example spider settings, and it basically tells us exactly what we need. These are the main ones; I'm going to copy them and paste them in here, and I don't think we need to worry about any of the others, so we'll leave the rest as it is for now. Then we need to add in our REDIS_URL, which is going to be a redis:// address; mine is on my local home server, so it's on here. I think the port is 32768, or is it 32678... I think it's 32768. Perfect, so that's done.
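For reference, here is roughly what that block at the bottom of settings.py ends up as, based on the example settings in the scrapy-redis docs (the port is just where my local instance happens to be, so change the REDIS_URL to match yours).

# settings.py

# Scrapy-Redis configuration

# Store the request queue in Redis instead of in memory.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Share one duplicate filter between every spider connected to this Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Push scraped items back into Redis (the <spider>:items key we saw earlier).
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Where the Redis instance lives; mine is on my local home server.
REDIS_URL = "redis://localhost:32768"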

We can now go to our spider, the product spider, where we need to import a couple of new things: from scrapy_redis.spiders we import RedisSpider. There is a RedisCrawlSpider as well, but we're going to be using the normal one here. We want to change our spider over to this class, so it now inherits from RedisSpider, like so. We no longer need allowed_domains or start_urls, but what we do need is this right here: if I come back to the spider example, you can see it has a redis_key, so I'm going to copy that and paste it in. This is the key the spider will watch in Redis to find its URLs. I tend to stick with the name of the spider followed by start_urls, which is absolutely fine. By default it will send the results back under a matching key too; when the items come back, product:items is where they'll go, which keeps things nice and tidy for me. So if you have multiple spiders you can do that, or if you have two different spiders, one that collects a set of URLs and you want to pass those into a queue for another set of spiders, you could of course configure it like that.
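As a sketch of that chaining idea (this isn't something we build in the video): a hypothetical category spider could read listing pages from its own queue and push the product URLs it finds into the product spider's queue. The selector and key names here are assumptions.

from scrapy_redis.spiders import RedisSpider


class CategorySpider(RedisSpider):
    name = "category"
    redis_key = "category:start_urls"  # this spider's own queue

    def parse(self, response):
        # Push every product link we find into the product spider's queue.
        # RedisSpider already holds a Redis client as self.server, so we reuse it here.
        for href in response.css("a.product-link::attr(href)").getall():
            self.server.lpush("product:start_urls", response.urljoin(href))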

So we're going to parse this and get the information back just like any other Scrapy spider; we want to yield out a dictionary. The first thing I'm going to do is record the URL this data came from, just so we have it there, and that's response.url. Then we'll have the name of the product, response.css with a .get() on it, and we'll grab the CSS selector in just a minute. Then we'll do the price, response.css again (my typing is atrocious, I know) with .get() again. Let's go ahead and grab those selectors. This is a product page, so we'll inspect it: the h1 is probably absolutely fine for the name, and the price is a p tag, then a span, then a bdi. Okay, cool. So the name will be h1, and we want the text from that, which should work fine for us. For the price we want the p tag with a class of price, then the span tag, then the bdi tag, and we get the text off of that. My code editor is complaining about the leftover pass here, but we can totally ignore that for now, or even just remove it; we don't need it.
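Putting that together, the spider ends up looking roughly like this (the class name comes from the genspider skeleton, and the selectors are the ones we just grabbed from my test store, so adjust them for your own site).

from scrapy_redis.spiders import RedisSpider


class ProductSpider(RedisSpider):
    name = "product"
    # The Redis list this spider watches for URLs to scrape.
    redis_key = "product:start_urls"

    def parse(self, response):
        # Yield a plain dict; the RedisPipeline will push it into product:items.
        yield {
            "url": response.url,
            "name": response.css("h1::text").get(),
            "price": response.css("p.price span bdi::text").get(),
        }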

This is essentially all you need to get going. Your spider here, of course, is whatever your spider is; whatever it does, it will run in its entirety when it picks a URL up from Redis. So let's save this, come out, clear the terminal, and check that it's working by running scrapy crawl product. As long as we don't get any errors, it's all connected and it's all good, and once it was up it straight away picked up the URLs that were left in the queue. If I come back to Redis Commander, you'll see we had those start URLs sitting in there, waiting for our spiders to run; now when I hit refresh they've gone, and the items should have been put in here. I'm going to delete these keys now, there we go, and we'll come over here, get rid of this one and this one, and set this window up over here, like that. Let's clear that up and try the push again. I don't know if I've got any other URLs in there, and it's a bit tedious to type them out. No? Okay, no worries, we'll just push some more in, and you'll see them come through like this. So you can see how you can just fill the queue up, and your spider is going to keep pulling jobs, keep grabbing them, keep working through them, and then it will just idle and wait for some more.

Hopefully you can start to see how this can work for you and the things you can achieve with it. It's very, very easy to set up, it's just these few simple settings, and Redis is not difficult to learn, at least for a basic use like this. There are a few more scrapy-redis settings you can play around with if you want to. I'm definitely going to explore this more, and I may put out another video where I run multiple spiders, pull the URLs from somewhere else, maybe from a database, and then get that data back out and do something with it. Of course, when you take the data out of Redis you pop it out, like you pop something out of a list in Python, so it then disappears from Redis and you have all that data on your side to sort through.
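For example, a minimal sketch of draining the items back out with the redis library (the key and field names match the spider above; once an item is popped it's gone from Redis):

import json
import redis

r = redis.Redis(host="localhost", port=32768, decode_responses=True)

# RedisPipeline stores each item as a JSON string in the product:items list,
# so we pop until the list is empty and decode as we go.
while (raw := r.lpop("product:items")) is not None:
    item = json.loads(raw)
    print(item["url"], item["name"], item["price"])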

Hopefully you've enjoyed this video. If you have, make sure you subscribe, like, and all that good stuff; it really helps me out. Come and join the Discord as well: there are over 800 people in there now, which is crazy, with loads of chats going on about all this sort of stuff, and plenty of people who are super helpful, so I really appreciate them. So come and join that, thanks very much for watching, and I will see you again soon.