If you've clicked on this video hoping I'm going to tell you the best web scraping package to use in Python: it's Scrapy. It's that straightforward. But help me keep my retention graph nice and high and flat, and humor me for the rest, because it's not always that straightforward. I've been web scraping professionally and personally for over five years now. It's pretty much what I learned Python for, it's what I fell into, it's what I've built my whole channel around, and it's basically what I do. And I want to explain that there is no one best tool; it's all about using the right tool for the right job and the right project you're approaching. So it's up to you, but what I want to do in this video is give you a rundown of what I use now, what I have used, some thoughts on each, and what I think you should be looking at using in certain circumstances. Having said there's probably no one best tool: I mean, why would you fire up Scrapy if you just need to make one request to a server and parse some JSON? You wouldn't do that, although you can if you want to. It's a bit like the web framework wars, except obviously not on the same scale, but everyone has their personal favourites and everyone likes things for different reasons. So it's up to you to pick the right tool for the job, and also the tools you like working with the most. I'll talk about Scrapy towards the end, but first
we'll look at HTTP clients. This is what you're going to use to make the request to the server and get the data back so you can then do something with it. The obvious choice is requests; I used that for a very long time, and there are a few others built around it too. Then there's httpx and aiohttp, which are good for async. Now, it's worth mentioning that all of the requests-like libraries are generally built on top of urllib3, so if you're just making simple requests and all you need is some sessions and maybe some proxies, it's definitely worth looking at just using urllib3 directly. I've used it in a few projects now and it's pretty decent: it works just like you would expect it to, no hidden surprises, nothing fancy, it's just good. Of course, you can just use requests as well, which is built on top of it.
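Just to sketch roughly what I mean here (the URLs and the proxy address are made up for the example), both cover the simple sessions-plus-proxies case:

```python
import urllib3
import requests

# urllib3 directly: PoolManager reuses connections, ProxyManager adds a proxy
http = urllib3.ProxyManager("http://proxy.example.com:8000")   # placeholder proxy
resp = http.request("GET", "https://example.com/api/items")    # placeholder URL
print(resp.status, len(resp.data))

# requests: a Session keeps cookies and pooled connections between calls
session = requests.Session()
session.proxies = {"https": "http://user:pass@proxy.example.com:8000"}  # placeholder
r = session.get("https://example.com/api/items")
print(r.status_code)
```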
The thing with requests, though, is that it's on a feature freeze, I believe, so nothing new is going to come to it, and that sets us back a little when we need more advanced web scraping features. That's why I've moved on to using this next library for almost all of my requests, and it also integrates well with Scrapy. It's called curl_cffi, and it's the Python binding for curl-impersonate, which is itself a modified build of curl. It's all a bit confusing, but if you look through the source code you'll see how it all fits together. The great thing about it is that it mimics the requests API, so it does everything requests does; in fact, you import requests from it when you use it. But it will also let you send browser-like handshake information to a server. This is really important, because one of the most common and basic approaches that services like Cloudflare use to block you is based on your JA3 hash, which is a fingerprint of your TLS handshake: the cipher suites and the other initial bits of information sent before you've even requested any data. Requests and all the other libraries send their own fingerprint and you can't change it, which means, as I said, services like Cloudflare can spot you and block you before you've even done anything, regardless of whatever other headers and proxies you're using. With curl_cffi, which also integrates into Scrapy, you send browser-like information instead; you just choose which browser you want to impersonate. So far it's been working really well for me, so definitely check that one out.
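Here's a minimal sketch of what that looks like; the URL and proxy are placeholders, and the exact impersonation target names available depend on the curl_cffi version you have installed:

```python
from curl_cffi import requests  # drop-in, requests-style API

# impersonate a real browser's TLS/JA3 fingerprint instead of Python's default
resp = requests.get(
    "https://example.com/api/items",                    # placeholder URL
    impersonate="chrome",                               # or e.g. "safari", depending on version
    proxies={"https": "http://user:pass@proxy.example.com:8000"},  # optional, placeholder
)
print(resp.status_code)
print(resp.json())
```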
Next is browser automation. Sometimes you just do need a real browser, and generally I go for Playwright. They all kind of do the same thing; Playwright and Selenium work in very similar ways, but I just find the Playwright documentation to be really, really good, which means if I want to know how to do something quickly I can find it easily. It also has a slightly better installation process, in my opinion, although Selenium since version 4 has been dead simple too. The only exception is when I want to use Selenium Grid, which manages multiple Selenium instances on a server. I have that running on a server I can connect to, and then you can open up multiple browsers on it; it's just a really neat and tidy way to manage those Selenium instances, and Playwright doesn't have an equivalent.
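A minimal Playwright sketch of how I tend to use it (the URL and selector are made up): render the page, grab the HTML, and hand it off to a separate parser.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products")   # placeholder URL
    page.wait_for_selector("div.product")       # placeholder selector
    html = page.content()                       # raw HTML to hand to selectolax/Parsel
    browser.close()
```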
Now we've got the data, we want to parse it. I almost never use Playwright's or Selenium's built-in parsing, although it works absolutely fine for selecting elements. I will always take the HTML source and hand it off to another HTML parser to work with, just in case I need to take Selenium out and put Playwright in, or take Playwright out and put something else in; I find that process much easier when it's a bit more modular. I would probably recommend against using BeautifulSoup, just because it's quite old now and the skills aren't totally transferable: it has its own selector system, whereas you really want to learn either CSS selectors or XPath, whichever you prefer. I use CSS selectors solely, which is why I always recommend selectolax as my HTML parser. It's built on top of a C library, it's super fast, I've never had any issues with it, and it's fantastic, so I always recommend using it. It doesn't work with XPath, though, so if you want XPath you can use something like Parsel, which is Scrapy's parsing library; you can take it out and use it separately, and that works great too.
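Here's a rough sketch of both, on a made-up snippet of HTML:

```python
from selectolax.parser import HTMLParser
from parsel import Selector

html = '<div class="product"><a href="/p/1">Widget</a><span class="price">9.99</span></div>'

# selectolax: CSS selectors only, backed by a fast C parser
tree = HTMLParser(html)
for product in tree.css("div.product"):
    print(product.css_first("a").text(), product.css_first("span.price").text())

# Parsel: Scrapy's parsing library, usable standalone, supports XPath (and CSS)
sel = Selector(text=html)
print(sel.xpath('//div[@class="product"]/a/text()').get())
```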
One thing I'm looking forward to in this space, and maybe in the HTTP client space too, is some kind of Rust-backed Python bindings and library, so maybe we can use Rust to parse HTML more easily and quickly, although selectolax is perfectly good as it is; it would just give us an extra option. And just to wrap up this first run of tools we always use: output, which is usually going to be either a CSV file, a JSON file, or a database. For databases I almost always use SQLite; I don't really feel the need to use anything else, and when I do, it's always Postgres. I don't have a lot of experience with document databases, although if they work for you, they work for you, so there's not really a lot to say on that.
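For the SQLite side, the standard library is all you need; a minimal sketch, with made-up data and column names:

```python
import sqlite3

items = [{"name": "Widget", "price": 9.99}]  # placeholder scraped items

con = sqlite3.connect("scrape.db")
con.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)")
con.executemany("INSERT INTO products (name, price) VALUES (:name, :price)", items)
con.commit()
con.close()
```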
Now I want to talk about what I'm going to call complete solutions; Scrapy sort of falls into this category, and I'm going to explain why I think Scrapy is worth learning and why perhaps these others aren't. There are a few new complete solutions that have come into the space recently: one's called hrequests and one's called Botasaurus. They both do very similar things, they both look really promising, and they say they can do all this stuff, which is cool. Underneath, though, they're basically built on top of most of the tools I've just talked about: curl_cffi, selectolax, Playwright, and so on. Now, why do I suggest possibly not learning these? Obviously I don't want to speak badly of them, because people have put a lot of work into these projects and they are fantastic, but I personally don't use them, simply because they have their own way of doing things, and if you don't know what's going on under the hood, then when it suddenly stops working, you can't fix it. If you know exactly how everything works, using something like this is going to be fine, because you can figure out: hey, this has stopped working for this reason, I need to update this, I need to do that. But these packages are generally aimed at beginners, I think, which is a bit of a contradiction. I think you're much better off building your own thing from scratch first, so you understand how everything works and fits together, and then maybe looking at these complete packages.
Now, there's a reason why I'm not including Scrapy among these: Scrapy has built everything itself, and it has lots of ways to modify and change things. You can write your own middlewares and your own pipelines, you can implement your own backends and your own parsing, and it will all just work, and it's all built on top of these Spider classes. I think this really is the best way forward. I know I've said in the past that I don't use Scrapy so much, but I think I've changed on that, because a lot of the stuff I've been writing recently has essentially been mimicking what Scrapy can do, so I've just gone ahead and rewritten it in Scrapy and used that instead. And obviously it has all the built-in features that are really useful: exporting to JSON or CSV without even having to think about it, automatically throttling requests, and all the great stuff that's in the settings. It also has loads of add-ons and plugins you can use to integrate with different things: scrapy-playwright, Scrapy with curl_cffi for impersonation, which I've been using recently and which is great, all of this good stuff. That's why I would generally say Scrapy is where it's at. Now, it does have a bit more of a learning curve; you do have to understand how to scrape first and then start to build into it, but in my opinion it's definitely worth it.
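To give a flavour of those Spider classes, here's a minimal sketch; the spider name, URL, and selectors are all made up, and the FEEDS and AUTOTHROTTLE settings are the built-in export and throttling features I just mentioned:

```python
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]   # placeholder URL

    custom_settings = {
        "FEEDS": {"products.json": {"format": "json"}},  # built-in feed export
        "AUTOTHROTTLE_ENABLED": True,                    # built-in request throttling
    }

    def parse(self, response):
        for product in response.css("div.product"):
            yield {
                "name": product.css("a::text").get(),
                "price": product.css("span.price::text").get(),
            }
        # follow pagination if there is a next page
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

You'd run that with scrapy crawl products inside a project, or with scrapy runspider on the file directly, and the items land in products.json without any extra code.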
In this last section I'm going to talk a little bit about paid options. These definitely have a place; I tend to use a few myself, because it's a trade-off between how much time you want to spend building something versus saying, hey, I'm just going to pay for this to be done for me and take away the hard part, so I can focus on delivering. That's why things like ScrapingBee and the like are aimed more at businesses and professionals rather than people who are just learning, and again, if that fits your use case, I think they're definitely well worth it. They take away a lot of the headache and a lot of the figuring out, so you can just move on, do whatever it is you need to do, and deliver whatever it is you said you were going to deliver. So it's definitely worth having a look at those options, but they're not a silver bullet; they won't solve everything. Some will struggle on certain sites, others won't; it's a bit of a balancing act, working out what's best in your case. So don't just sign up thinking all your problems are solved, because they probably won't be; however, they will solve a lot of them, so they're definitely worth checking out if you don't want to write everything yourself, you're struggling, or you just need it for a business purpose.
To sum up: you need to be adaptable, and you need to pick the right tool for the right job. For example, if I know there's going to be a bit of blocking, or Cloudflare, or something like that, I'm straight in with curl_cffi; if I don't need any of that, I'm just going to use requests or urllib3 or httpx, whatever I feel like. I'm going to parse with selectolax. I'm going to store the data in a JSON file or a CSV to start with; I don't mess around with databases until I need to. And I'm always going to use proxies if I need to; my link for proxies is down below if you want to help me out and support the channel, and if you need some good ones, I've got good links down there. Also, if you need to leverage some paid solutions, just do it, don't overthink it; if it's worth it for your business case, just do it. You need to make sure you master and feel comfortable with everything you pick, and also understand how it works underneath. You need to understand what's blocking you, why you're getting redirected, what's the cause of this or that, because if you don't know, you can't solve the problems that will come up. Web scraping changes; it's dynamic, things change all the time, and you just have to get on with it and be adaptable.