if you don't use pipelines in your Scrapy projects you're missing out on
some great features that once you start to use you'll wonder how you ever went
without so in this video we'll talk through what pipelines are what they're
for and cover some great use cases for using them I'll share some code at the
end too that with some minor tweaking you can drop right into your Scrapy
projects and for those who are new I'm John I've been scraping for over 5 years
and I help businesses extract the data they need for analysis this video is
sponsored by ProxyScrape more on that a little later but first are they really
that simple and easy to use at their core Scrapy item pipelines are classes
that do something with an item and by item we mean a Scrapy Item class
they implement a simple method process_item that with the help of the
ItemAdapter class allows us to interrogate and modify that scraped item
we can access any item field modify it check it or drop the item completely this is what
makes pipelines so powerful so let's take a step back and talk about data
pipelines in a broader sense think of it like a conveyor belt that your
data moves down it has a start point and an end and in between we have the
opportunity to pick up some of the data and do something with it before putting
it back down or even discarding it think of toys moving down that conveyor belt
and at certain points they're picked up and checked and any that don't meet the
quality standards are then thrown away this is what we're able to do with our
data our pipeline starts with the input of the scraped item which we can then
pick up and check or modify and discard based on our criteria once we're all
done all the good toys are presented neatly at the end of the pipeline which
could be a box to load onto a truck another pipeline with a different
function perhaps or in our case we're going to insert it into the database if
you've ever used the Scrapy CLI tool and added -o output.json or something
similar you've already used one of Scrapy's default feed exporters which
itself is a data pipeline it takes the item and transforms it into JSON or CSV
and outputs a file but of course you don't have to use them you can run a
successful Scrapy project with no extra pipelines whatsoever and in fact if
you're running a simple scraper that doesn't need any extra processing of the
items like simply exporting URLs then there's just no need however once you
start to scrape more and more data data pipelines become crucial and there's
another good reason too which I'll get to today's video is sponsored by
ProxyScrape friends of the channel and the proxies I've been using myself
personally for the last year or so as we know proxies are an integral part of
scraping data and with ProxyScrape we have access to high quality secure fast
and ethically sourced proxies perfect for our scraping use case my preference
is the residential ones which give us the best option for beating captchas and
any anti-bot protection on the sites we're scraping something that makes our
lives much easier there's 10 million plus proxies in the pool to use that
will all auto rotate and with unlimited concurrent sessions adding proxies to
our project is simple and extremely effective especially when combined with
something like Scrapy you have a choice of country too which helps when
working on very region specific sites there's a 99% success rate too and
traffic that never expires which is very very nice but if you just want
throughput there may be some data center proxies that have unlimited bandwidth
99% uptime and no rate limit all from reputable countries and with IP
authentication which makes these a very easy to use and attractive option within the
right use case so if you're looking for some top quality proxies check out
ProxyScrape at the link in the description below let's get back to the video but
first how do we use them then well I'll show you but there's a couple of
important things we need to address first and if you don't do this your
pipelines just won't work firstly we must use Scrapy's Item class to hold our
data it's this that the pipeline is expecting to see being
passed in and without it it will fail I'd always recommend taking a little
extra time to yield a Scrapy Item anyway unless you have a really specific
reason not to they're easy to create and work just like Python's dictionaries but
let me ask you one question have you ever put data cleaning directly into
your spider code I know I have and well it worked just fine but I'm now saying
that this is not best practice and to understand why we need to have a quick
dip into separation of concerns this states that our project should be split
into sections and each one addresses a single concern think back to our toy
conveyor belt we wouldn't want the machine that assembles toys to do the
quality control as well as that would be too much responsibility for one station
and would cause huge issues when we're working at scale and it's the same here
once we grow our project potentially with multiple spiders we need to let
that spider do what it does best crawl and get data then we can pass that to
our pipelines as soon as we muddy the water managing the project will become
much much harder and trust me I've been there right so you're on board and ready
to start using Scrapy item pipelines let's start with a simple example for
cleaning data I'm going to assume you've created a Scrapy project and it looks a
little bit like this here's our original set of data we can see that the price is
a string the product ID is also a string and this check list has duplicates
in it and the price in there is in a different format again so what we want to do is
change the price into a good format for both instances remove the
duplicates and change the product ID to an integer so what I'm going to do is open
up our pipelines file and all I've got so far is I've imported DropItem
because we're going to add in a case to drop some items if we don't need them so
the first pipeline we're going to create is going to be our product ID
pipeline this is going to change it to an integer so we need to give the class a
name and then define our process_item method
which is the main method we're going to be using it takes in the item which is a
Scrapy Item which is why that's important and the spider itself then we
say that our adapter is equal to the ItemAdapter for this specific item this
means we have access to all those fields then when we have that we can access the
information that we're after that we want to check or do something with I've
used adapter.get for this because that means if the item doesn't have this field
it's going to return None rather than throw an error if it does exist we're
going to update it to an integer and we're going to return the item in this
one I also decided that if it doesn't exist i.e. the product doesn't have a
product ID I'm going to drop it this just means that products that don't have
that ID will be dropped out of the pipeline and will be no more the next
pipeline is going to be for our price and this is the first price not the one within
the check list and again it's the same process process_item and then the
adapter of the item then we get the price field and all I'm doing here is
I'm removing the dot from that field so we now have
690000 as a string and then I'm changing it to an integer else I'm
setting the price to None so if the item comes through with no price
we're just going to set it to None and then we're going to return that item
back out fairly straightforward so if we come back to our thing what we would
have done right now is we would have changed this to an integer like this and
this is going to be updated to this so like that now there's a few different
ways you can handle prices in Python I tend to do it like this I find it's
quite easy to manage although you can use a Decimal you shouldn't use a float
though because floats can't represent most decimal fractions exactly so you'll end up with
small rounding errors from adding and subtracting prices back to the
pipeline now we need to handle the slightly more complicated one and this
is the check pipeline again it starts off the same way and then we're basically
turning our list into a set and then back into a list because within Python a
set cannot have duplicates so they will automatically get removed so this is
going to solve our duplicate issue from here what we're going to do is we're
going to get the index of this element so we can update it later then
we're going to do our replace with the dot and the pound symbol then we're
going to convert it to an integer so it matches the other price in the price
field and then update the list at that same index with our new value and then
I'm also updating the currency field to have GBP because it is a pound symbol
in front of the number for the price then I'm just returning the item
and that is it so once you have all of that changed you need to come to your
settings and then search for pipelines in your Scrapy project and then you
need to uncomment it and add your pipelines in here so that they run
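
As a rough illustration of that settings entry, assuming a project package called `myproject` and three pipeline classes with the names used here (both are placeholders, not names taken from the video), the uncommented block might look something like this:

```python
# settings.py (fragment). The package name "myproject" and the class names
# are assumptions for illustration; use your own module path and classes.
ITEM_PIPELINES = {
    "myproject.pipelines.ProductIdPipeline": 100,
    "myproject.pipelines.PricePipeline": 200,
    "myproject.pipelines.CheckPipeline": 300,
}
```

Each key is the import path of a pipeline class and each value is an integer, conventionally in the 0 to 1000 range.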
you'll notice that they have a number at the end and this just decides which
order they run in and this is important because if you were to save to a
database you would want to make sure you clean all the data first before you do
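
To recap the three cleaning pipelines walked through above, here is a minimal sketch rather than the exact code from the video. The field names (`product_id`, `price`, `check`, `currency`), the sample item, and the rule that the price inside the check list is the entry starting with a pound sign are all assumptions. It also swaps the set round-trip for `dict.fromkeys`, which removes duplicates in the same way but keeps the original order. The `try`/`except` only provides stand-ins so the sketch can run without Scrapy installed.

```python
try:
    from itemadapter import ItemAdapter
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-ins so the sketch still runs without Scrapy installed:
    # a plain dict already offers the .get()/[] access used below.
    def ItemAdapter(item):
        return item

    class DropItem(Exception):
        pass


def to_pence(text):
    """Turn a price string such as '£6900.00' into an integer (690000)."""
    return int(text.replace("£", "").replace(".", ""))


class ProductIdPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        product_id = adapter.get("product_id")  # None if the field is missing
        if not product_id:
            raise DropItem("missing product_id")
        adapter["product_id"] = int(product_id)
        return item


class PricePipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        price = adapter.get("price")
        adapter["price"] = to_pence(price) if price else None
        return item


class CheckPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # dict.fromkeys drops duplicates like set() but keeps the original order
        check = list(dict.fromkeys(adapter.get("check") or []))
        for index, value in enumerate(check):
            # Assumption: the price entry is the string starting with "£"
            if isinstance(value, str) and value.startswith("£"):
                check[index] = to_pence(value)
                adapter["currency"] = "GBP"
        adapter["check"] = check
        return item


# Usage with a made-up item, run in the same order as the settings priorities
item = {"product_id": "1234", "price": "6900.00", "check": ["new", "£6900.00", "new"]}
for pipeline in (ProductIdPipeline(), PricePipeline(), CheckPipeline()):
    item = pipeline.process_item(item, spider=None)
```

With a real `scrapy.Item`, the `currency` field would need to be declared on the item class before it can be set. Storing prices as integer pence sidesteps float rounding; in Python `0.1 + 0.2 == 0.3` evaluates to `False`.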
so if I come over to my other project you'll see here that I have a stock
available pipeline that runs before the SQLite no dupes pipeline and if I
show you that here here's the stock available one very very similar to what
we were just doing before adapter.get and checking it and then we have our SQLite
pipeline here now this has an __init__ which is slightly different to the other
ones and you're basically going to do your standard SQLite stuff here
connecting to the database or creating it getting a cursor make this one
bigger then executing creating the table if it doesn't exist this will obviously
happen every time the pipeline runs so we have that table set up how we like
it then in our process_item we're just going to check
if that item that we are holding in this pipeline exists already if it does we'll
find it and then we'll just tell ourselves through the logger that
it does exist and nothing else will
happen if it doesn't we're going to insert it commit it to the database and
then return the item back to our original project and once you run it
again you'll end up with this here and you'll see now that we have our price in
both instances in the same format and our product ID as an integer and the
check with no extra duplicate data and the currency with the GBP here and that
is done for every single product because it went through our three pipelines I
have to admit when learning about this the first time I wasn't able to
completely see all of the use cases and I was happy filling up my spider file
with split and replace so what I want to do here is highlight some of the things
the Scrapy docs suggest but also my own personal use cases for pipelines
I'll try and break these down into about four categories given now that we
understand our pipeline is going to take data in move it down like a conveyor
belt the most obvious first one to implement is to take our item and save
it into a database within one pipeline class we can initialize create tables
check for duplicates and add our scraped item to the database this one's for
SQLite which is a good option for starting out but you can of course use
any database that you like the Scrapy documentation shows one for MongoDB
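
A database pipeline along those lines could be sketched as follows. This is an assumed shape, not the code shown on screen: the table name `products`, its columns, the default file name `products.db`, and the class name are all hypothetical, and a module-level logger is used where the video may well use the spider's own logger. The `try`/`except` is only there so the sketch runs without Scrapy installed.

```python
import logging
import sqlite3

try:
    from itemadapter import ItemAdapter
except ImportError:
    def ItemAdapter(item):  # stand-in: a plain dict already supports get/[]
        return item

logger = logging.getLogger(__name__)


class SqliteNoDupesPipeline:
    def __init__(self, db_path="products.db"):
        # Connect (creating the file if needed) and make sure the table exists
        self.con = sqlite3.connect(db_path)
        self.cur = self.con.cursor()
        self.cur.execute(
            "CREATE TABLE IF NOT EXISTS products ("
            "product_id INTEGER PRIMARY KEY, name TEXT, price INTEGER)"
        )

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        self.cur.execute(
            "SELECT 1 FROM products WHERE product_id = ?",
            (adapter["product_id"],),
        )
        if self.cur.fetchone():
            # Already stored: log it and skip the insert
            logger.warning("Item already in database: %s", adapter["product_id"])
        else:
            self.cur.execute(
                "INSERT INTO products (product_id, name, price) VALUES (?, ?, ?)",
                (adapter["product_id"], adapter.get("name"), adapter.get("price")),
            )
            self.con.commit()
        return item

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes
        self.con.close()
```

Using `?` placeholders instead of string formatting lets `sqlite3` handle quoting of the scraped values.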
but before we save our data we will want to clean it which is what we did earlier
and this is another great use case for our pipeline again we take the initial
item and we can now process each field removing things that we don't want I've
used this for things as simple as correcting prices removing currency
symbols or removing whitespace and any unnecessary characters around names any
parts of the data you might want to remove from any field would come in here
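
As one possible shape for that kind of clean-up, the sketch below trims every string field on an item. The class name and the set of stray characters in `strip_chars` are hypothetical choices for illustration, and the `try`/`except` is only there so it runs without Scrapy installed.

```python
try:
    from itemadapter import ItemAdapter
except ImportError:
    def ItemAdapter(item):  # stand-in: a plain dict already supports items()/[]
        return item


class StripTextPipeline:
    """Tidy whitespace and stray leading/trailing characters in string fields."""

    strip_chars = " -|"  # hypothetical: characters to trim from both ends

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # Copy the pairs first so fields can be reassigned while looping
        for field, value in list(adapter.items()):
            if isinstance(value, str):
                collapsed = " ".join(value.split())  # collapse runs of whitespace
                adapter[field] = collapsed.strip(self.strip_chars)
        return item
```

Non-string fields such as an integer price pass through untouched, so this can sit before or after a price pipeline.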
so I work a lot in e-commerce and the main piece of data here is pricing and
getting it into the same format and data type before saving is very important so
I usually add in a pipeline specifically to clean up and modify the price field
in our example removing the currency symbol but also using it to add a new
field with a text representation instead is quite common and making sure that the
data type is consistent using a Decimal or a string even other things like
adding date and time stamps would also come in here this one's probably the
most important and also the most useful but also the most vague the
business logic required for your use case this could be anything but here's
some examples checking stock flags and updating data combining fields and
updating information like VAT and taxes checking dates for posts and discarding
any over a certain threshold and seeing if any item is on sale and adding a
percentage discount field the final use case is checking the integrity of the
data making sure all the fields contain something setting default values and
acting on the item if there's something missing but these are just examples if
you want to see me use these in a project you're going to need to watch
this video right here next