In this video we're going to be looking at distributed crawling with Scrapy and Redis. On the left-hand side of my screen, in these two split terminals, I have two spiders running and waiting for instructions; they're waiting for a URL to come through so they can parse the data. On the right-hand side I've connected to my Redis instance through the CLI, and if I push this URL, you'll notice one of these spiders picks it up and scrapes the data. There it goes, the top one got it. Now if I come over to this window and show you the actual GUI (I'm using Redis Commander to connect to my Redis instance) and hit refresh, I have this product key, and the items now hold the information that came back from that spider.
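For reference, that push from the CLI is just an LPUSH onto the list the spiders are watching. A minimal sketch, assuming the product:start_urls key naming we set up later in the video, with a placeholder URL:

    redis-cli LPUSH product:start_urls "https://example.com/product/1"

Any client that can write to that list feeds the spiders in exactly the same way.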
So what we've essentially done is this: we had our spiders waiting and ready, and we sent them a URL, or rather they picked the URL up from Redis, they scraped the data, and then they sent it back. I've just added a load more URLs, and you'll see both spiders have started to pick up some jobs. Let's do it again, and we get both of them flicking through; you can see how quickly they pick the URLs up and run through them. If I come back to our key value here and refresh, we now have 38 items. They'll all be the same thing, because I gave it the same URL over and over, but they'll each have a different ID with the information in.
Now, because this is Redis, you can connect to it from any programming language you like. So you could have a disconnect here if you wanted to: say, take information from a JavaScript front end and then use Scrapy on the back end to get the data. Or, even if you're just using Python, we could push these URLs into the queue through plain Python using the redis package. So what I'll do quickly is close the spiders down and put a couple of URLs into the key values. If we hit refresh, you'll see we have start URLs here; these are just waiting now for a spider to start and pick them up.
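Pushing from Python looks like this. A quick sketch, assuming a Redis instance on localhost's default port and the product:start_urls key we use later (the URLs are placeholders, so swap in real product pages):

    import redis

    # connect to the same instance the spiders are watching
    r = redis.Redis(host="localhost", port=6379)

    # placeholder URLs; use real product pages from your target site
    urls = [
        "https://example.com/product/1",
        "https://example.com/product/2",
    ]
    for url in urls:
        r.lpush("product:start_urls", url)

As soon as these land in the list, any idle spider watching that key will start working through them.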
Let's have a quick look then at what this is actually doing. I'm going to sketch it out here: we'll say that this is our Redis queue, like so. Make that nice and big, not too big, there we go. Then the information, the URLs we push in, goes into a spider, which we'll draw here. Okay, there we go, we have our spider. The spider does the work and then passes the results back into our Redis store, which is the same Redis instance, just a different key value. Of course, you could have multiple keys and values for different spiders, so they'd pick work up from one place and send it back to another. I'll just label this so we know it's URLs going in at the top, into start_urls, however we choose to push them.
The beauty of this is that, with the spiders able to pick up URLs as they go, we can have as many of them as we like. We've got three here; let's have five running, for example, and they'll all do exactly the same thing. They all poll the same queue for URLs, they're all the same spider, all running and waiting. So we can easily scale this horizontally, which is where the distribution comes from. If you think about it, you can get a Redis instance in the cloud and connect to it from anywhere, so you could have these spiders running across any number of different machines, or all on one machine, and simply scale up and down depending on how big your queue is. That's the most important thing to take away from this. It's not that this is going to be the best option if you're just scraping one site from home, but if you're trying to do lots of different things, it's a really good way of doing it.
So, to replicate this, we're going to come to the GitHub repo for scrapy-redis, and if we scroll down, it tells you a little bit about it here. Now, I did have some issues with this when I tried it on a live stream a while ago, but those have since been fixed. Even so, we aren't going to install from pip, because that's where I hit problems before; I'm just going to do the git clone from GitHub. The pip install may very well work now, depending on when you're watching this video. We're going to set up an example spider here, so you can see we have our RedisSpider, and we're going to scrape the data. We can give it URLs from my test store, which is available for everyone; if you want to practice with it or use it, that's absolutely fine, and you'll find the URL here.
First we need to create a new project folder, so I'm going to call this one distributed, and we'll cd into it. We always want to use a virtual environment when we're working with anything, especially with Scrapy; the dependencies can give you all sorts of problems. I'll activate it, and then I'm going to do pip3 install scrapy to get that installed. Then I'll come back over to the repo and copy the code to install scrapy-redis; we're doing that separately, as I explained just a minute ago. Let's paste that in so it runs, and it should install fine, and it has. Let's clear the terminal and come out of this folder, because we don't need to be in there, and then do scrapy startproject. I'm just going to call this one "go for one", which is the code name for my site; I don't know why it's that, it just is. Then we'll cd into the folder and do scrapy genspider, as I almost always do. We'll call this one product, and it's going to be on this website, extract.com, like so. Cool, so now we have our skeleton Scrapy project set up.
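Recapping those setup steps as shell commands (the repo URL is the current scrapy-redis home on GitHub; the project name is the one from the video, and the genspider domain is a placeholder for my test store):

    mkdir distributed && cd distributed
    python3 -m venv venv
    source venv/bin/activate
    pip3 install scrapy

    # install scrapy-redis from a clone of the repo rather than from pip
    git clone https://github.com/rmax/scrapy-redis.git
    pip3 install ./scrapy-redis

    scrapy startproject goforone
    cd goforone
    scrapy genspider product example.com   # placeholder; use your target site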
I'm going to cd into it and open up Neovim; you can use whichever code editor you like, but if you don't use Neovim, you're wrong. We're going to come to settings.py, because we do need to change a few things here. I'll come down to the bottom to start with and put a comment in: this is going to be our scrapy-redis configuration. Now we're going to come back to the documentation, to the example spider settings, and it basically tells us exactly what we need. These are the main ones; I'm going to copy them and paste them in here, and I don't think we need to worry about any of the other ones.
We'll leave the rest as it is for now. Now we need to add in our REDIS_URL. This is going to equal a redis:// URL, and mine is on my local home server, so it's on here. I think the port is 32768, or is it 32678? 32768, yes. Perfect, so that's done.
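For reference, the settings block ends up looking like this. The scheduler, dupefilter, and pipeline lines come from the scrapy-redis example settings; the REDIS_URL host here is a stand-in for my home server, so point it at your own instance:

    # scrapy-redis configuration

    # schedule requests through Redis instead of Scrapy's in-memory queue
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"

    # share one duplicates filter between every spider instance, via Redis
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

    # push scraped items into Redis (the <spider>:items key)
    ITEM_PIPELINES = {
        "scrapy_redis.pipelines.RedisPipeline": 300,
    }

    # your Redis instance; mine is on a non-standard port on my home server
    REDIS_URL = "redis://192.168.0.10:32768"  # stand-in host, use your own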
So we can now go to our spider, the product spider, and we need to import a couple of new things. We'll do: from scrapy_redis.spiders import RedisSpider. There is a RedisCrawlSpider as well, but we're going to be using just the normal one here, and we want to change our spider's class to RedisSpider, like so. Now we don't need allowed_domains or start_urls, but what we do need is the redis_key. If I come back to the spider example, you can see it there, so I'm going to copy it and paste it in. This is where the spider is going to look in Redis to find its URLs. I tend to stick with just calling it the name of the spider followed by start_urls, which is absolutely fine; it's the default anyway. By default it will send the items back to a matching key as well: product:items will be where they go, which is nice and neat and tidy for me. So if you have multiple spiders you can do the same for each, or of course, if you have two different spiders, one that gets a set of URLs and you want to pass them into a queue for another set of spiders, you could configure the keys like that too, as sketched below.
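As a rough sketch of that chaining idea (hypothetical spider name, key names, and link selector, not something built in this video): a first-stage spider can push the URLs it finds straight onto the second stage's start-URLs list.

    import redis
    from scrapy_redis.spiders import RedisSpider

    class CategorySpider(RedisSpider):
        # hypothetical first-stage spider: finds product links, queues them
        name = "category"
        redis_key = "category:start_urls"  # this spider's own input queue

        # plain redis client pointed at the same instance as settings.py
        queue = redis.Redis(host="localhost", port=6379)

        def parse(self, response):
            for href in response.css("a.product::attr(href)").getall():
                # feed the queue that the product spider is watching
                self.queue.lpush("product:start_urls", response.urljoin(href))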
From there we're going to parse the response and get the information back just like any other Scrapy spider; we want to yield out a dictionary. The first thing I'm going to do is say, hey, this is the URL this data came from, just so we have that there, and that's response.url. Then we'll have the name of the product, which is response.css with a .get() on it, and we'll grab the CSS selector in just a minute. Then price: response.css again (my typing is atrocious, I know) and .get() again. So let's go ahead and grab those selectors. This is a product page, so we'll inspect it, and I'd say the h1 is probably absolutely fine for the name, and the price is the p tag, then a span, then a bdi. Okay, cool. So the name will be h1, and we want the text from that, which should work fine for us. Then for the price we want the p tag with a class of price, then the span tag, then the bdi tag, and we want the text off of that. My code editor is complaining here, but we can totally ignore that for now; in fact we can even remove that line, we don't need it.
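Put together, the spider looks roughly like this. It's a sketch assembled from the steps above; the h1 and p.price span bdi selectors are the ones from my test store, so adjust them for whatever pages you're scraping:

    from scrapy_redis.spiders import RedisSpider

    class ProductSpider(RedisSpider):
        name = "product"
        # the Redis list this spider watches for URLs to crawl
        redis_key = "product:start_urls"

        def parse(self, response):
            # yield a plain dict like any other Scrapy spider; with the
            # RedisPipeline enabled it ends up in the product:items key
            yield {
                "url": response.url,
                "name": response.css("h1::text").get(),
                "price": response.css("p.price span bdi::text").get(),
            }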
And that's essentially all you need to get going. Your spider here, of course, is whatever your spider is; whatever it does, it will run in its entirety when it picks a URL up from Redis. So we're going to save this, come out, clear the terminal, and check that it's working by running scrapy crawl product. As long as we don't get any errors, it's all connected and all good, and once it was up it straight away picked up the URLs that were left in the queue. If I come back to Redis Commander, you'll see we had those start URLs sitting in there waiting for our spiders to run; now when I hit refresh they've gone, and the items have been put in here instead. I'm going to delete these keys now, there we go, and rearrange these windows so we have this one over here, like that, and clear that up, and we'll try a push again. I don't know if I've got any other URLs to hand; they're a bit tedious to type out, that's all. No? Okay, no worries, we'll just push some more in, and you'll see them come through like this. So you can see how you can just fill the queue up, and your spider is just going to keep pulling jobs, keep grabbing them, keep working through them, and then just idle and wait for some more.
So hopefully you can see how this can start to work for you and the things you can achieve with it. It's very easy to set up; it's just these simple settings, and Redis is not difficult to learn, at least for a basic use like this. There are a few more settings for scrapy-redis that you can play around with if you want to, and I'm definitely going to explore this more. I'll maybe put out another video where I run multiple spiders, pull the URLs from somewhere else, perhaps a database, and then get the data back out and do something with it. Of course, when you take the data out of Redis, you pop it off like you pop something out of a list in Python, so it then disappears from Redis, and you have all that data out and sorted.
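If you want to drain that items list yourself, it's just a pop. A minimal sketch, assuming the default product:items key, a local Redis, and that the items were stored by scrapy-redis's pipeline, which serializes them as JSON:

    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)

    # each lpop removes the entry from Redis, which is why the data
    # disappears from the list once you've taken it out
    while (raw := r.lpop("product:items")) is not None:
        item = json.loads(raw)
        print(item["name"], item["price"])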
So hopefully you've enjoyed this video. If you have, make sure you subscribe, like, and all that good stuff; it really helps me out. Come and join the Discord as well; there are over 800 people in there now, which is crazy, with loads of chat going on about all this sort of stuff, and plenty of people who are super helpful too, so I really appreciate them. So come and join that, and thanks very much for watching. I will see you again soon.