The Best Tools to Scrape Data in 2024
2024-04-10
Python has a great ecosystem for web scraping, and in this video I run through the packages I use every day to scrape data. Join the Discord to discuss all things Python and web with our growing community! https://discord.gg/C4J2uckpbR If you are new, welcome! I am John, a self-taught Python developer working in the web and data space. I specialize in data extraction and JSON web APIs, both server and client. If you like programming and web content as much as I do, you can subscribe for weekly co...
Subtitles

If you've clicked on this video hoping I'm going to tell you the best web scraping package to use in Python: it's Scrapy. It's that straightforward. But help me keep my retention graph nice and high and flat, and humor me for the rest, because it's not always that straightforward. I've been web scraping professionally and personally for over five years now; it's pretty much what I learned Python for, it's what I fell into, it's what I built my whole channel around, and it's basically what I do. And I want to explain that there is no one best tool: it's all about using the right tool for the right job and the right project you're approaching. So what I want to do in this video is give you a rundown of what I use now, what I have used, some thoughts on each, and what I think you should be looking at using in certain circumstances.

Having said that there probably is no one best tool, I mean, why would you fire up Scrapy if you just need to make one request to a server and parse some JSON? You wouldn't do that, although you can if you want to. It's like the web framework wars, except not quite on as high a level, obviously; everyone's got their personal favorites, and everyone likes things for different reasons. So it's up to you to pick the right tool for the job, and also the ones you like working with the most.

I'll talk about Scrapy towards the end, but first we'll look at HTTP clients. This is what you're going to use to make the request to the server and get the data back so you can do something with it. The obvious choice is requests; I used that for a very long time, and there are a few other libraries built around it too. Then there are httpx and aiohttp, which are good for async. It's worth mentioning that the requests-like libraries are generally all built on top of urllib3, so if you're just making simple requests and all you need is sessions and maybe some proxies, it's definitely worth looking at using urllib3 directly. I've used it in a few projects now and it's pretty decent: it works just like you'd expect it to, no hidden surprises, nothing fancy, it's just good. Of course, you can also just use requests, which is built on top of it.
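To make that concrete, here's a minimal sketch of a plain urllib3 request; the URL and User-Agent are placeholders, and the .json() helper assumes urllib3 v2.

```python
import urllib3

# One PoolManager handles connection pooling and re-use;
# for proxies you'd use urllib3.ProxyManager("http://proxy.example:8080") instead.
http = urllib3.PoolManager()

# Placeholder URL and User-Agent: swap in whatever you're scraping with.
resp = http.request(
    "GET",
    "https://httpbin.org/get",
    headers={"User-Agent": "my-scraper/0.1"},
)
print(resp.status)
print(resp.json())  # .json() is available on urllib3 v2 responses
```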

The thing with requests, though, is that it's on a feature freeze, I believe, so nothing new is coming to it, and that sets us back a little bit when we need more advanced web scraping features. That's where I've moved on to using this next library for almost all of my requests, and it integrates well with Scrapy too. It's called curl_cffi, the Python wrapper for a C library that itself wraps curl. It's all a bit confusing, but if you look through the source code you'll see how it all fits together. The great thing about it is that it builds on top of requests, so it does everything requests does (in fact you import requests from it when you use it), but it also allows you to send browser-like handshake information to a server.

Now this is really important, because one of the most common and basic approaches that services like Cloudflare use to block you is based on your JA3 hash: your TLS handshake information, derived from the particular cipher suites and the other initial bits of the handshake, before you've even requested any data. requests and all the other libraries send their own handshake information, and you can't change it, which means, as I said, the likes of Cloudflare can spot you and block you before you've even done anything, regardless of whatever other headers you're using and whatever proxies you're using too. curl_cffi, which also integrates into Scrapy, sends browser-like handshake information instead: you just choose which browser you want to impersonate. So far it's been working really well for me, so definitely check that one out.
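Here's a minimal sketch of what that looks like; the URL is a placeholder, and the exact impersonation target names depend on your curl_cffi version, so check its docs.

```python
from curl_cffi import requests  # drop-in requests-style API

# impersonate="chrome" sends a recent Chrome TLS/JA3 fingerprint;
# older curl_cffi versions want a pinned target like "chrome110".
resp = requests.get(
    "https://example.com",  # placeholder URL
    impersonate="chrome",
)
print(resp.status_code)
```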

Next up is browser automation. Sometimes you just do need to use a real browser, and generally I go for Playwright. Playwright and Selenium both do roughly the same thing and work in very similar ways, but I find the documentation for Playwright to be really, really good, which means if I want to know how to do something quickly, I can find it really easily. It also has a slightly better installation process, in my opinion, although Selenium since version 4 has been dead simple too. The only exception is when I want to use Grid, which manages multiple instances of Selenium on your server, run on different threads. I have that running on my server now, I can connect to it, and then you can open up multiple browsers on the server. It's just a really neat and tidy way to manage those Selenium instances, and Playwright doesn't have a version of it.
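A minimal Playwright sketch using the sync API; the URL is a placeholder, and the final content() call grabs the raw HTML so it can be handed to a separate parser, as discussed next.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")  # placeholder URL
    html = page.content()  # raw HTML source, ready to hand off to another parser
    browser.close()
```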

So now we've got the data, and we want to parse it. I almost never use Playwright's or Selenium's built-in parsing, although it works absolutely fine for selecting elements. I will always take the HTML source and hand it off to another HTML parser to work with, just in case I need to take Selenium out and put Playwright in, or take Playwright out and put something else in; I find that process much, much easier when things are a bit more modular. I would probably recommend against using Beautiful Soup, just because I feel like it's quite old now and it's not totally transferable: it has its own selector system, whereas you really want to learn how to use either CSS selectors or XPath, whichever you prefer. I use CSS selectors solely, which is why I always recommend selectolax as my HTML parser. It's built on top of a C library, it's super fast, I've never had any issues with it, and it's fantastic, so I always recommend using it. It doesn't work with XPath, though, so if you want to use XPath you can use something like parsel, which is Scrapy's HTML parser: you can take it out and use it separately, and it works great too.
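Here's a minimal sketch of both on a made-up HTML snippet: selectolax with CSS selectors, and parsel standalone with XPath.

```python
from selectolax.parser import HTMLParser
from parsel import Selector

html = "<div><a class='title' href='/item/1'>First item</a></div>"  # toy HTML

# selectolax: fast C-backed parsing, CSS selectors only
tree = HTMLParser(html)
for node in tree.css("a.title"):
    print(node.text(), node.attributes.get("href"))

# parsel (Scrapy's parser, usable on its own): CSS or XPath
sel = Selector(text=html)
print(sel.xpath("//a[@class='title']/@href").get())
```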

One thing I'm looking forward to in this space, and maybe even in the HTTP client space too, is some kind of Rust-backed library with Python bindings, so maybe we can use Rust to parse HTML more easily and quickly. Selectolax is perfectly good as it is, but it would give us an extra option. And just to wrap up this first part, the last thing we always need is output, which is usually going to be a CSV file, a JSON file, or a database. For databases I almost always use SQLite; I don't really feel the need to use anything else, and when I do, it's always Postgres. I don't really have a lot of experience with document databases, although, you know, if it works for you, it works for you. Not really a lot to say on this one.
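For reference, a minimal sketch of dumping scraped rows into SQLite with the standard library; the table and rows here are hypothetical.

```python
import sqlite3

# Hypothetical scraped rows: (title, url) pairs
rows = [("First item", "/item/1"), ("Second item", "/item/2")]

con = sqlite3.connect("scrape.db")
con.execute("CREATE TABLE IF NOT EXISTS items (title TEXT, url TEXT)")
con.executemany("INSERT INTO items VALUES (?, ?)", rows)
con.commit()
con.close()
```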

So now I want to talk about what I'm going to call complete solutions. Scrapy falls into this category, and I'm going to explain why I think Scrapy is worth learning and why perhaps these other ones aren't. There are a few new complete solutions that have come into the space recently: one's called hrequests and one's called Botasaurus. They both do very similar things, they both look really promising, and they say they can do all this stuff, which is cool. Underneath all of it, though, they're basically built on top of most of the tools I've just talked about: curl_cffi, selectolax, Playwright, et cetera. Now, why do I suggest possibly not learning these? And obviously I don't want to say bad words about them, because people have put a lot of work into these projects and they are fantastic, but I personally don't use them, simply because they have their own way of doing things, and if you don't know what's going on under the hood, then when it all of a sudden doesn't work, you can't fix it. If you know exactly how everything works, then using something like this is going to be fine, because you can figure out: hey, this has stopped working for this reason, I need to update this, I need to do that. But these packages are generally aimed at beginners, I think, which is a bit of a contradiction. I think you're much better off building your own thing from scratch first, so you understand how everything works and how everything fits together, and then maybe looking at something like these complete packages.

There's a reason I'm not including Scrapy in that group: Scrapy has built everything itself, and it gives you lots of ways to modify and change things. You can build your own middleware and your own pipelines, you can implement your own back ends and your own parsing; you can do all of that and it will all work for you, and it's just a matter of building on top of the Spider classes. I think this really is the best way forward. I know I've said in the past that I don't use Scrapy so much, but I think that's changed, because a lot of the stuff I've been writing recently has essentially been mimicking what Scrapy can do, so I've just gone ahead and rewritten it in Scrapy and made use of it. It also, obviously, has all the built-in features that are really useful: exporting to JSON or CSV without even having to think about it, auto-throttling requests, and all the other great stuff in the settings. It also has loads of add-ons and plugins you can use to integrate with different things: scrapy-playwright, and Scrapy with curl_cffi for impersonation, which I've been using recently and which is great. All of this good stuff is why I'd generally say Scrapy is where it's at. It does have a bit more of a learning curve; you do have to understand how to scrape first and then start to build into it, but it's definitely worth it now, in my opinion.
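A minimal spider sketch to show the shape of it, pointed at the quotes.toscrape.com practice site as a stand-in target; the feed export and auto-throttle settings are the built-ins mentioned above.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]  # practice site, stand-in for a real target

    # Built-in niceties: feed export and auto-throttling, no extra code needed
    custom_settings = {
        "FEEDS": {"quotes.json": {"format": "json"}},
        "AUTOTHROTTLE_ENABLED": True,
    }

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination until it runs out
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, this runs standalone with scrapy runspider quotes_spider.py, no full project needed.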

In this last section I'm going to talk a little bit about paid options. These definitely have a place, and I use a few myself, because it's a trade-off, a balance, between how much time you want to spend building something versus saying, hey, I'm just going to pay for this to be done for me, take away the hard part so I can focus on delivering. That's why things like ScrapingBee and similar services are aimed more at businesses and professionals than at people who are just learning. Again, if that fits your use case, I think they're definitely well worth it: they take away a lot of the headache and a lot of the figuring out, so you can just move on, do whatever it is you need to do, and deliver whatever it is you said you were going to do. So it's definitely worth having a look at those options, but they're not a silver bullet, and they won't solve everything: some will struggle on certain sites, others won't. It's all a bit of a balancing act, working out what's best in your case. So don't just sign up thinking all your problems are solved, because they probably won't be; they will, however, solve a lot of them. It's definitely worth checking out if you don't want to write everything yourself, or you're struggling and you just need it for a business purpose or whatever.

So, to sum up: you need to be adaptable, and you need to pick the right tools for the job. For example, if I know there's going to be a little bit of blocking, or Cloudflare or something like that, I'm straight in with curl_cffi. If I don't need any of that, I'll just use requests or urllib3 or httpx, whatever I feel like. I'm going to parse with selectolax, and I'm going to store the data in a JSON file or CSV to start with; I don't mess around with databases until I need to. And I'm always going to use proxies if I need to: my link for proxies is down below if you want to help me out and support the channel, and if you need some good ones, I've got good links down there. Also, if you need to leverage some paid solutions, just do it, don't overthink it, if it's worth it for your business case. You need to make sure you use, master, and feel comfortable with everything you pick, and also understand how it works underneath. You need to understand what's blocking you, why you're getting redirected, what's the cause of this, what's the cause of that, because if you don't know, you can't solve the problems that will come up. Web scraping is dynamic; things change all the time. You just have to get on with it and be adaptable.