This will change Web Scraping forever.
10.1K
264
2024-05-19
Want to try this yourself? Sign up at https://www.zyte.com/ and use code JWR203 for $20 free each month for 3 months. Limited availability, first come first served. Once you have created an account, enter the coupon code JWR203 under Settings, Subscriptions, Modify & enter code. Zyte gave me access to their API and NEW AI spider tech to see how it compares to scraping manually, with incredible results. This video was sponsored by Zyte. ➡ COMMUNITY https://discord.gg/C4J2uckpbR https://www.pa...
Subtitles

Okay, look at this. I've got a basic spider here, I didn't change anything, this is straight out of the box. I did 1,000 requests, which is what I actually asked for, so it stopped on that rather than on a limit: 756 items, no errors, full logs, and it only took half an hour. I mean, that's just mind-blowing.

So, coming up this morning I have a really interesting call with a guy called Ian from Zyte. He's the chief product officer over there, and they have a new Scrapy product that we're going to talk about. The thing that makes this particularly interesting is that it's AI-backed. When AI originally came out and hit the forefront with ChatGPT, I couldn't really see how we could use it as web scrapers and data extractors, because the problems we had couldn't really be solved by that sort of tool. My main problem with it was always the fact that LLMs are heavy on compute and heavy on compute cost, so it's going to be interesting to see how they get around that.

This video is sponsored by Zyte. They gave me access to the AI tool and asked me for my genuine feedback, as well as to share it with you guys.

I want to know, what is the motivation behind creating this tool?

Our motivation is that we think web scraping is too slow. There's the challenge of getting set up for new sites, and what does that look like when there are multiple sites? It's set-up for each and every one. Then there's the challenge of maintenance. When we speak to engineering teams, they often say: we're under pressure, we've got to add 100 sites, we're maintaining the ones we've already got, and adding 100 more is difficult. So our motivation is: how can we make this quicker?

Really interesting, because rather than trying to make it easier to scrape, it's trying to get you to a base where you can start the rest of your project quicker.

What we've looked to do is produce a solution with automation, where the automation does the routine work: for the most common types of website, like product sites; for the most common data points in a schema, of which we've got about 20; and for the most common crawling strategies, like whole site, category, or search keywords. We can get you to data immediately, but then we give people control.
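(To make that concrete, here is roughly the shape of structured item those common data points produce for a product page. The field names below are illustrative guesses at typical data points, not the exact schema Zyte ships.)

```python
# Illustrative only: the kind of structured product item the extraction aims
# to produce. Field names are common e-commerce data points chosen for the
# example, not the exact Zyte schema.
example_product_item = {
    "url": "https://example.com/product/123",
    "name": "Example Trail Running Shoe",
    "price": "89.99",
    "currency": "GBP",
    "sku": "TRS-123",
    "brand": "ExampleBrand",
    "availability": "InStock",
}

print(example_product_item["name"], example_product_item["price"])
```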

That's good to know, that it's still going to be open source, or at least the spider part is, and you'll be able to actually take that, build upon it, and then have your own sort of spider that you can put into their system, host, and run. That kind of makes sense too. Do you think that part of it, creating all that automation to take that big chunk of work out of the start, makes this different to other generic AI (I'm going to use that term loosely) web scraping solutions, where people are just asking ChatGPT, as an example?

Because this is a more controlled use case, we've been able to develop models that are specific to it. So we have a machine learning model for product-type websites, another for articles, another for jobs, and because they're much more focused, they're more accurate for that task, but they're also a lot cheaper.

To be honest, the spider itself is relatively simple. What it's doing is the crawling, or the controlling of the crawling, and it's then calling Zyte API, which is solving the bans and allowing people to retrieve the HTML in the format they want: do they need it rendered or not, do they need to do browser actions like scripting. It's the machine learning that does the parsing, turning the HTML into JSON.
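(As a rough sketch of what a single extraction call to Zyte API looks like from Python; the endpoint and field names are my reading of the public docs, so double-check them before relying on this.)

```python
# Sketch of one Zyte API extraction request (not the spider itself).
# Endpoint, auth style, and request fields are my understanding of the docs;
# verify against the current Zyte API documentation.
import requests

ZYTE_API_KEY = "YOUR_API_KEY"  # placeholder

response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=(ZYTE_API_KEY, ""),  # API key as the basic-auth username
    json={
        "url": "https://example.com/some-product-page",
        "browserHtml": True,  # ask for rendered HTML (optional)
        "product": True,      # run the product extraction model
    },
    timeout=60,
)
response.raise_for_status()

# The structured item parsed out of the page: name, price, SKU, and so on.
print(response.json().get("product"))
```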

All of that is controlled by a relatively simple Scrapy spider, and typically what we find is that if people want to make a modification, like adding a filter to a crawl, it's a couple of lines of code. It's also quite a nice way for people to get started, because they can start with something where they can see the code and see what it does, and modifying it is very straightforward.
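(To give an idea of the couple-of-lines filter Ian describes, here is a minimal Scrapy sketch. The spider, URL, and selectors are made up for illustration; in practice you would be editing the generated spider, but the shape of the change is the same.)

```python
# Hypothetical sketch: a plain Scrapy spider with a two-line item filter.
# Class name, URL, and CSS selectors are placeholders for illustration only.
import scrapy


class FilteredProductSpider(scrapy.Spider):
    name = "filtered_products"
    start_urls = ["https://example.com/category/shoes"]  # placeholder URL

    def parse(self, response):
        for product in response.css("div.product"):  # illustrative selector
            item = {
                "name": product.css("h2::text").get(),
                "price": product.css(".price::text").get(),
            }
            # The "couple of lines" filter: skip items that don't mention "boot".
            if item["name"] and "boot" not in item["name"].lower():
                continue
            yield item
```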

The only scenario we find where this isn't a great solution is the very largest sites, where the cost of setup and maintenance is dwarfed by the volume of requests. In those scenarios it's worth creating a spider by hand, because running static code will be the cheapest solution. In general, for anyone looking to set up sites where they're not yet in the billions of requests per month, merely up to the tens of millions, we're finding it's a great solution.

Right, so I need to know: is this going to change the landscape forever, or is it going to make me redundant? So what I'm going to do is take two hours, build two spiders, and compare myself to the AI.

[Music]

Okay, so here are the results for the two spiders. Here's mine: I got 1,634 items and it took 1,250 seconds, which is about 20 minutes, and I got a decent amount of data points. On the left side here is the overview for the AI spider job. Now, this one took about an hour, which I thought was interesting because it was much slower than the other sites I did in my test. We did get all of the data points I was expecting to get, though, in fact more than I pulled, because I didn't want to parse all of that information. I would definitely check with Zyte as to why this one took an hour to run, slightly slower, but we had no issues and it got it all. Given that mine took an hour to write, we're definitely in the positive here.

If we have a quick look at the items, you can see all the data points and all the items, which you can download and export in these formats if you want to, and there are full logs and stats as well. The stats are going to be similar to the ones here: as you can see, request counts, etc., everything you could want to see from your spider comes back to you here.

And here are the results from the other spider. The item counts were about the same, 1,547 versus, I think, around 1,560. Mine took about 37 minutes to run, though, and this one was 26 minutes, and that's because the AI was deciding when to use browser rendering and when not to, whereas I was forcing browser rendering all the time, which made mine a bit slower.
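(For context on that trade-off: if you're calling Zyte API from your own Scrapy spider, for example through the scrapy-zyte-api plugin, you can choose per request whether to pay for browser rendering. The meta key below is my recollection of that plugin's interface, so treat it as a sketch and check the plugin's documentation.)

```python
# Sketch only: per-request control of browser rendering when using Zyte API
# from Scrapy. The "zyte_api_automap"/"browserHtml" meta usage is from memory
# of the scrapy-zyte-api plugin; verify against its docs. URLs and selectors
# are placeholders.
import scrapy


class RenderWhenNeededSpider(scrapy.Spider):
    name = "render_when_needed"

    def start_requests(self):
        # Plain HTTP fetch: cheaper and faster when the page doesn't need JS.
        yield scrapy.Request(
            "https://example.com/static-listing",
            callback=self.parse,
        )
        # Forced browser rendering: what I was doing for every request.
        yield scrapy.Request(
            "https://example.com/js-heavy-listing",
            meta={"zyte_api_automap": {"browserHtml": True}},
            callback=self.parse,
        )

    def parse(self, response):
        for title in response.css("h2.product-title::text").getall():
            yield {"name": title}
```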

Plus, it took me about an hour to write as well, so you know this is much, much quicker again. Let's have a quick look at the stats over here, just so you can see everything: all the information you want to know about your general spider stats, the logs as well, and the items, which are very similar to the item data that I pulled out. You can see I have it here with the SKU, in a slightly different format, but it's all available here; you can see all of the information, and again you can download it as you need.

All in all, a very, very positive experience with this one, given how easy it was to create these: all I had to do was put in the URL, configure two or three settings, and set it to run. The fact that it managed to get all of these data points out in such a short amount of time, with very little setup, default out of the box, is very impressive. Now, there's a lot to be said for getting to that point so that you can actually build upon it if you need to, but in these cases I wouldn't have to, so the time saved would be huge, especially considering this was just two use cases. If I had to do that tenfold, the amount of time I would save by using the AI spider is astronomical.

So, what are my thoughts about this, and has it changed how I feel about AI in this space? Well, I think the tool kind of speaks for itself; it's pretty incredible. The fact that you can go from nothing to data with just one URL that quickly is really impressive, and the fact that it handles all the bans through the Zyte API as well is just an extension of that. I think it makes it a really, really great package. I've used it quite a lot now over the last week or so since I recorded the rest of this video, and I've been very impressed. I think it's important to know that tools like this are supposed to be supplementary to your work, not supposed to take over what you do. Whilst it does take out quite a lot of the knowledge work, it means, for me, that I can provide a better service to my clients by being able to give them the data they need in much less time. I think I'm honestly just going to adapt to it, work with it, build upon it myself, and use it to enhance the services that I can deliver.

How do I feel about AI in this space? Still mixed. One thing I think is important is that this is a much smaller model; this isn't the large language models that we would associate with AI. This is a very different thing, much more along the lines of machine learning, where it's been fed specific data and trained on lots of different specific sites to pull out information. I think that is definitely the right place to be when we're talking about this sort of thing, AI/ML in web scraping: using it in conjunction with other tools to pull out the data you need, saving yourself a load of time parsing a load of data.

If you want to try this tool for yourself, there'll be a link in the description and a code as well for you to use to get you going. Hopefully you enjoy it, use it, and like it as much as I did. So yeah, once again, thanks to Zyte for this opportunity, and thank you all very much for watching. Join the Discord, like, comment, subscribe, it all makes a lot of difference to me. Thank you very much, and I'll see you in the next one.