How I Use Data Pipelines in my Web Scrapers
2024-07-21
➡ JOIN MY MAILING LIST https://johnwr.com ➡ COMMUNITY https://discord.gg/C4J2uckpbR ➡ PROXIES https://proxyscrape.com/?ref=jhnwr ➡ WEB SCRAPING API https://hubs.li/Q043T88w0 ➡ HOSTING https://m.do.co/c/c7c90f161ff6 If you are new, welcome. I'm John, a self taught Python developer and content creator, working at Zyte. I specialize in data extraction and automation. If you like programming and web content as much as I do, you can subscribe for weekly content. All views in this video are my o...
Subtitles

if you don't use pipelines in your Scrapy projects you're missing out on

some great features that once you start to use you'll wonder how you ever went

without so in this video we'll talk through what pipelines are what they're

for and cover some great use cases for using them I'll share some code at the

end too that with some minor tweaking you can drop right into your Scrapy

projects and for those who are new I'm John I've been scraping for over 5 years

and I help businesses extract the data they need for analysis this video is

sponsored by ProxyScrape more on that a little later but first are they really

that simple and easy to use at their core Scrapy item pipelines are classes

that handle doing something with an item and by item we mean Scrapy's Item class

they implement a simple method process_item that with the help of the

ItemAdapter class allows us to interrogate and modify that scraped item

we can access any item field modify check or drop it completely this is what

makes pipelines so powerful so let's take a step back and talk about data

pipelines in a broader sense think of it like a conveyor belt that your

data moves down it has a start point and an end and then in between we have the

opportunity to pick up some of the data and do something with it before putting

it back down or even discarding it think of toys moving down that conveyor belt

and at certain points they're picked up and checked and any that don't meet the

quality standards are then thrown away this is what we're able to do with our

data our pipeline starts with the input of the scraped item which we can then

pick up and check or modify and discard based on our criteria once we're all

done all the good toys are presented neatly to the end of the pipeline which

would be a box to load onto a truck another pipeline with a different

function perhaps or in our case we're going to insert it into the database if

you've ever used Scrapy's CLI tool and added -o output.json or something

similar you've already used one of Scrapy's default feed exporters which in

itself is a data pipeline it takes the item and transforms it into JSON or CSV

and outputs a file but of course you don't have to use them you can run

a successful Scrapy project with no extra pipelines whatsoever and in fact if

you're running a simple scraper that doesn't need any extra processing of the

items like simply exporting URLs then there's just no need however once you

start to scrape more and more data data pipelines become crucial and there's

another good reason too which I'll get to today's video is sponsored by

ProxyScrape friends of the channel and the proxies I've been using myself

personally for the last year or so as we know proxies are an integral part of

scraping data and with ProxyScrape we have access to high quality secure fast

and ethically sourced proxies perfect for our scraping use case my preference

are the residential ones which give us the best option for beating CAPTCHAs and

any anti-bot protection on the sites we're scraping something that makes our

lives much easier there's 10 million plus proxies in the pool to use that

will all auto rotate and with unlimited concurrent sessions adding proxies to

our project is simple and extremely effective especially when combined with

something like Scrapy you have a choice of country too for helping when

working on very region specific sites there's a 99% success rate too and

traffic that never expires which is very very nice but if you just want

throughput there may be some data center proxies that have unlimited bandwidth

99% uptime and no rate limit all from reputable countries and with IP

authentication which makes these a very easy to use and attractive option within the

right use case so if you're looking for some top quality proxies check out

ProxyScrape at the link in the description below let's get back to the video but

first how do we use them then well I'll show you but there's a couple of

important things we need to address first and if you don't do this your

pipelines just won't work firstly we must use Scrapy's Item class to hold our

data it's this that the pipeline is expecting to see being

passed in and without it it will fail I'd always recommend taking a little

extra time in yielding out a Scrapy item anyway unless you have a really specific

reason not to they're easy to create and work just like Python's dictionaries but

let me ask you one question have you ever put data cleaning directly into

your spider code I know I have and well it worked just fine but I'm now saying

that this is not best practice and to understand why we need to have a quick

dip into separation of concerns this states that our project should be split

into sections and each one addresses a single concern think back to our toy

conveyor belt we wouldn't want the machine that assembles toys to do the

quality control as well as that would be too much responsibility for one station

and will cause huge issues when we're working at scale and it's the same here

once we grow our project potentially with multiple spiders we need to let

that spider do what it does best crawl and get data then we can pass that to

our pipelines as soon as we muddy the

water managing the project will become much much harder and trust me I've been there right so you're on board and ready

to start using Scrapy item pipelines let's start with a simple example for

cleaning data I'm going to assume you've created a Scrapy project and it looks a

little bit like this here's our original set of data we can see that the price is

in a string the product ID is also a string and this checklist has duplicates

in and the price is in a different format again so what we want to do is we

want to change the price into a good format for both instances remove the

duplicates and change this to an integer so what I'm going to do is I'm going to open

up our pipelines file and all I've got so far is I've imported DropItem

because we're going to add in a case to drop some items if we don't need them so

the first pipeline we're going to create is going to be our product ID

pipeline this is going to change it to an integer so we need to give it a name

and a class and then we define our process_item method

which is the main method we're going to be using takes in the item which is a

Scrapy item which is why that's important and the spider itself then we

say that our adapter is equal to the ItemAdapter for this specific item this

means we have access to all those fields then when we have that we can access the

information that we're after that we want to check or do something with I've

done adapter.get with this because that means if it doesn't have this field

it's going to return None rather than throw an error if it does exist we're

going to update it to an integer and we're going to return the item in this

one I also decided that if it doesn't exist i.e. the product doesn't have a

product ID I'm going to drop it this just means that products that don't have

that ID will be dropped out of a pipeline and will be no more the next
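A minimal sketch of the product ID pipeline described above. The class name and the `product_id` field name are assumptions, and the `except ImportError` branch is only a stand-in so the sketch also runs on plain dict items without Scrapy installed.

```python
try:
    from itemadapter import ItemAdapter
    from scrapy.exceptions import DropItem
except ImportError:  # stand-ins so the sketch runs without Scrapy (dict items only)
    class DropItem(Exception):
        pass

    def ItemAdapter(item):
        return item


class ProductIdPipeline:
    """Convert product_id to an int, or drop the item if it has none."""

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # .get() returns None for a missing field instead of raising KeyError
        product_id = adapter.get("product_id")
        if product_id is None:
            # Raising DropItem stops the item from reaching later pipelines
            raise DropItem(f"missing product_id in {item!r}")
        adapter["product_id"] = int(product_id)
        return item
```

Returning the item hands it on to the next pipeline, while raising `DropItem` removes it from the conveyor belt.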

pipeline is going to be for our price and this is the first price not within

the check and again it's the same process process_item and then the

adapter of the item then we get the price field and all I'm doing here is

I'm removing the dot from that field so we now have

690000 as a string and then I'm changing it to an integer else I'm

setting the price to None so if the item comes through with no price

we're just going to set it to None and then we're going to return that item

back out fairly straightforward so if we come back to our thing what we would

have done right now is we would have changed this to an integer like this and

this is going to be updated to this so like that now there's a few different

ways you can handle prices in Python I tend to do it like this I find it's

quite easy to manage although you can use a Decimal you can't use a float

though because it won't calculate properly because you'll end up with

wrong calculations from adding and subtracting decimals back to the
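A rough sketch of the price pipeline and of why a float is a poor fit for money. The class name and the `price` field name are assumptions, and removing the dot only gives a sensible number of pence when the scraped string always has exactly two decimal places, as in "6900.00".

```python
try:
    from itemadapter import ItemAdapter
except ImportError:  # stand-in so the sketch runs without Scrapy (dict items only)
    def ItemAdapter(item):
        return item


class PricePipeline:
    """Store a price string such as "6900.00" as an integer number of pence."""

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        price = adapter.get("price")
        if price:
            # "6900.00" -> "690000" -> 690000
            adapter["price"] = int(price.replace(".", ""))
        else:
            # No price scraped: keep the field but mark it as empty
            adapter["price"] = None
        return item


# Binary floats cannot represent most decimal fractions exactly,
# so sums drift; the same sum in whole pence is exact.
print(0.1 + 0.2 == 0.3)  # False
print(10 + 20 == 30)     # True
```

`decimal.Decimal` from the standard library is the other exact option mentioned here if you would rather keep the decimal point.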

pipeline now we need to handle the slightly more complicated one and this

is the check pipeline again starts off the same way and then we're basically

turning our list into a set and then back into a list because within Python a

set cannot have duplicates so they will automatically get removed so this is

going to solve our duplicate issue from here what we're going to do is we're

going to get the index of this element so we can update it later then

we're going to do our replace with the dot and the pound symbol then we're

going to convert it to an integer so it matches the other price in the price

field and then update the list with that same index with our new item and then

I'm also updating the currency field to have GBP because it is a pound symbol in

front of the number for the price then I'm just returning the item
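One way the check pipeline could look, as a sketch. The field names `check` and `currency`, and the assumption that the price entry is the one string starting with "£", are illustrative guesses, since the on-screen data is not in the transcript. With a declared Scrapy Item, `currency` would need to be one of its fields.

```python
try:
    from itemadapter import ItemAdapter
except ImportError:  # stand-in so the sketch runs without Scrapy (dict items only)
    def ItemAdapter(item):
        return item


class CheckPipeline:
    """De-duplicate the check list and normalise the price inside it."""

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # A set cannot hold duplicates, so the round trip removes them.
        # Note that it does not preserve the original order.
        check = list(set(adapter.get("check") or []))
        for index, value in enumerate(check):
            if isinstance(value, str) and value.startswith("£"):
                # "£6900.00" -> "690000" -> 690000, matching the price field
                check[index] = int(value.replace("£", "").replace(".", ""))
                adapter["currency"] = "GBP"
                break
        adapter["check"] = check
        return item
```

If the order of the list matters, `list(dict.fromkeys(values))` removes duplicates while keeping the first occurrence of each value in place.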

and that is it so once you have all of that changed you need to come to your

settings and then search for pipelines in your Scrapy project and then you

need to uncomment it and add your pipelines in here so that they run

you'll notice that they have a number at the end and this just decides which

order they run in and this is important because if you were to save to a

database you would want to make sure you clean all the data first before you do
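As an illustration, the setting in settings.py might look like this. The module path `myproject.pipelines` and the class names are placeholders for whatever your project uses; the numbers are conventionally kept in the 0 to 1000 range, and items pass through the pipelines from the lowest number to the highest.

```python
# settings.py
ITEM_PIPELINES = {
    # cleaning pipelines run first (lower numbers)
    "myproject.pipelines.ProductIdPipeline": 100,
    "myproject.pipelines.PricePipeline": 200,
    "myproject.pipelines.CheckPipeline": 300,
    # the database pipeline runs last, so it only ever sees cleaned items
    "myproject.pipelines.SQLiteNoDupesPipeline": 400,
}
```

Leaving gaps between the numbers makes it easy to slot another pipeline in between later on.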

so if I come over to my other project you'll see here that I have a stock

available pipeline that happens before the SQLite no dupes pipeline and if I

show you that here here's the stock available one very very similar to what

we were just doing before adapter.get and checking it and then we have our

SQLite pipeline here now this has an __init__ which is slightly different to the other

ones and you're basically going to do your standard SQLite stuff here

connecting to the database or creating it getting a cursor make this one

bigger then executing creating the table if it doesn't exist this will obviously

happen every time the pipeline runs so we have that table set up how we like

it then we're just going to in our process item we're just going to check

if that item that we are holding in this pipeline exists already if it does we'll

find it and then we'll just you know tell ourselves through the logger that

it does exist and nothing else will

happen if it doesn't we're going to insert it commit it to the database and

then return the item back to our original project and once you run it
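A sketch of a SQLite pipeline along these lines. The table layout, the field names and the `db_path` parameter are assumptions rather than the code shown in the video, and here a duplicate is logged and still returned so that any later pipeline receives it; raising `DropItem` instead would be a reasonable alternative.

```python
import sqlite3

try:
    from itemadapter import ItemAdapter
except ImportError:  # stand-in so the sketch runs without Scrapy (dict items only)
    def ItemAdapter(item):
        return item


class SQLiteNoDupesPipeline:
    def __init__(self, db_path="products.db"):
        # Connect (creating the file if needed) and make sure the table exists.
        self.con = sqlite3.connect(db_path)
        self.cur = self.con.cursor()
        self.cur.execute(
            """
            CREATE TABLE IF NOT EXISTS products (
                product_id INTEGER PRIMARY KEY,
                name TEXT,
                price INTEGER
            )
            """
        )

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        product_id = adapter.get("product_id")
        self.cur.execute(
            "SELECT 1 FROM products WHERE product_id = ?", (product_id,)
        )
        if self.cur.fetchone():
            # Already stored: log it and do nothing else
            spider.logger.warning("item already in database: %s", product_id)
        else:
            self.cur.execute(
                "INSERT INTO products (product_id, name, price) VALUES (?, ?, ?)",
                (product_id, adapter.get("name"), adapter.get("price")),
            )
            self.con.commit()
        return item

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes
        self.con.close()
```

The `?` placeholders let sqlite3 handle quoting, which matters when the values come straight from scraped pages.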

again you'll end up with this here and you'll see now that we have our price in

both instances in the same format and our product ID as an integer and the

check with no extra duplicate data and the currency with the GBP here and that

is done for every single product because it went through our three pipelines I

have to admit when learning about this the first time I wasn't able to

completely see all of the use cases and I was happy filling up my spider file

with split and replace so what I want to do here is highlight some of the things

both the Scrapy docs suggest but also my own personal use cases for pipelines

I'll try and break these down into about four categories given now that we

understand our pipeline is going to take data in move it down like a conveyor

belt the most obvious first one to implement is to take our item and save

it into a database within one pipeline class we can initialize create tables

check for duplicates and add our scraped item to the database this one's for

SQLite which is a good option for starting out but you can of course use

any database that you like the Scrapy documentation shows one for MongoDB

but before we save our data we will want to clean it which is what we did earlier

and this is another great use case for our pipeline again we take the initial

item and we can now process each field removing things that we don't want I've

used this for things as simple as correcting prices removing currency

symbols or removing whitespace and any unnecessary characters around names any

parts of the data you might want to remove from any field would come in here

so I work a lot in e-commerce and the main piece of data here is pricing and

getting it into the same format and data type before saving is very important so

I usually add in a pipeline specifically to clean up and modify the price field

in our example removing the currency symbol but also using it to add a new

field with a text representation instead is quite common and making sure that the

data type is consistent using a decimal or a string even other things like

adding date and time stamps would also come in here this one's probably the

most important and also the most useful but also the most vague the

business logic required for your use case this could be anything but here's

some examples checking stock flags and updating data combining fields and

updating information like VAT and taxes checking dates for posts and discarding

any over a certain threshold and seeing if any item is on sale and adding a

percentage discount field the final use case is checking the integrity of the

data making sure all the fields contain something setting default values and

acting on the item if there's something missing but these are just examples if

you want to see me use these in a project you're going to need to watch

this video right here next
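To make the last two categories concrete, here is a hedged sketch of an integrity pipeline and a small business logic pipeline. Every field name (`name`, `currency`, `in_stock`, `scraped_at`, `was_price`, `discount_percent`) and both class names are hypothetical, and prices are assumed to already be integers in pence.

```python
from datetime import datetime, timezone

try:
    from itemadapter import ItemAdapter
    from scrapy.exceptions import DropItem
except ImportError:  # stand-ins so the sketch runs without Scrapy (dict items only)
    class DropItem(Exception):
        pass

    def ItemAdapter(item):
        return item


class IntegrityPipeline:
    """Drop items with missing required fields and fill in default values."""

    required = ("product_id", "name")
    defaults = {"currency": "GBP", "in_stock": False}

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        missing = [f for f in self.required if adapter.get(f) in (None, "")]
        if missing:
            raise DropItem(f"missing required fields: {missing}")
        for field, value in self.defaults.items():
            if adapter.get(field) is None:
                adapter[field] = value
        # Date and time stamp in UTC, stored as an ISO 8601 string
        adapter["scraped_at"] = datetime.now(timezone.utc).isoformat()
        return item


class SaleDiscountPipeline:
    """Add a percentage discount field when an item is on sale."""

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        price = adapter.get("price")
        was_price = adapter.get("was_price")
        if price is not None and was_price and price < was_price:
            discount = 100 * (was_price - price) / was_price
            adapter["discount_percent"] = round(discount)
        return item
```

Keeping each concern in its own small class means either one can be switched off in ITEM_PIPELINES without touching the spider.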