The Easiest Way to Avoid Being Blocked When Web Scraping
2024-08-18
Check Out ProxyScrape here: https://proxyscrape.com/?ref=jhnwr ➡ WEB https://johnwr.com ➡ COMMUNITY https://discord.gg/C4J2uckpbR https://www.patreon.com/johnwatsonrooney ➡ PROXIES https://proxyscrape.com/?ref=jhnwr ➡ HOSTING (Digital Ocean) https://m.do.co/c/c7c90f161ff6 If you are new, welcome. I'm John, a self taught Python developer working in the web and data space. I specialize in data extraction and automation. If you like programming and web content as much as I do, you can subscri...
Subtitles

right here and the cf_clearance these are the Cloudflare-specific cookies and

it knows that when we send these that it's checked us out and it's all

happy. This is a simple method to help you avoid being blocked and IP banned from

sites with low to medium bot protection it's going to give you a good chance to

avoid blocks on all but the toughest sites and the best part is that it's very simple

and very fast let's look at what this method is when you should use it how it

works to avoid blocks and how you can adapt it to your situation and whilst

it's not a golden ticket it's absolutely worth learning and knowing how to do

before I go into more detail we need to understand a little bit about how your

scrapers get blocked the most common is a JavaScript test these sites execute

some basic JS in your browser and compare it to the result that they're

expecting if you aren't using a browser at all it's a simple block before you've

even had a chance. There are lots of ways through fingerprinting that your bots

can be found out and if you're interested in learning about those drop

me a comment down below and I will create a video on it given that we know

that plain requests are easy to spot and block that leaves us with running a

browser but this is painfully slow and not even guaranteed given the

fingerprinting but what we can do is use a modified browser instance to give us

the best chance of passing that test and then returning those cookies back to us

to reuse with subsequent standard requests within a session it's important

to note that we will absolutely need to use proxies for this and in some cases

certain anti-bot measures include tagging your cookies with your IP

so if we rotate it's a red flag and our whole session is blocked but with

today's sponsor ProxyScrape we have the option to use sticky sessions that will

hold that IP for us for a certain number of minutes. With ProxyScrape we have

access to high quality, secure, fast and ethically sourced proxies that are

perfect for our use case here. There are 10 million plus proxies in the pool to use

all with unlimited concurrent sessions from countries all over the globe

enabling us to scrape quickly and efficiently I'd always recommend

residential proxies as these are the best option for helping FlareSolverr here

beat any anti-bot protection on sites and with auto rotation or the sticky

sessions I suggested earlier this is the simplest but most effective way to avoid

our projects being blocked and allowing us access to the data we need it's only

one line of code to add and then we can let ProxyScrape handle the rest from

there and any traffic you purchase is yours to use whenever you need as it

doesn't ever expire there's definitely a use case for data center proxies though

and ProxyScrape has you covered there too. Unlimited bandwidth, 99% uptime and

no rate limit with IP authentication makes them a great option in the right

use case so if this all sounds good to you go ahead and check out ProxyScrape

at the link in the description below okay let's carry on with our project and

get the cookies that we are after so we know we're going to make a clean, good

request pass that test and then keep the cookies and matching IP potentially to

have the best chance for more requests within our session but how do we do that

well there are a few options you could run an instance of Playwright or

Selenium with the undetected Chrome driver or even do it manually with your

real browser and copy out the headers and the cookies or what I've

been doing recently which is using FlareSolverr this is a special version of

Chrome with the undetected driver that runs as an HTTP service locally via

Docker give it our URL and it will make sure it runs a browser and passes that

basic JavaScript test for us and returns the HTML page and crucially the cookies
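
As a rough illustration of the exchange just described, the sketch below builds the JSON body that would be posted to a locally running FlareSolverr and pulls the cookie names out of a response. The field names (`cmd`, `url`, `maxTimeout`, `solution`, `cookies`) and the sample response are assumptions based on the project's README, so check them against the version you run. The helper names and the cookie value are placeholders for illustration, not real output.

```python
import json


def build_request(url, max_timeout_ms=60000):
    """Build the JSON body for a basic FlareSolverr GET request (assumed field names)."""
    return {"cmd": "request.get", "url": url, "maxTimeout": max_timeout_ms}


# Illustrative response shape only; the values are placeholders, not real output.
SAMPLE_RESPONSE = {
    "status": "ok",
    "solution": {
        "url": "https://example.com/",
        "status": 200,
        "response": "<html>...</html>",
        "cookies": [
            {
                "name": "cf_clearance",
                "value": "placeholder",
                "domain": ".example.com",
                "path": "/",
            }
        ],
    },
}


def cookie_names(response_json):
    """Return the names of the cookies found under solution.cookies."""
    return [cookie["name"] for cookie in response_json["solution"]["cookies"]]


print(json.dumps(build_request("https://example.com"), indent=2))
print(cookie_names(SAMPLE_RESPONSE))
```

The body would be sent as a POST to the service endpoint, which is assumed here to be `http://localhost:8191/v1` when FlareSolverr is started with its default Docker settings.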

now I'll show you how to pass these back into your request session shortly but

let's have a look and see FlareSolverr working and check out the cookies it's

responding with so here's the FlareSolverr GitHub page it explains a

little bit here about how it works using Selenium with the undetected

Chrome driver to give you the best chance of passing those JavaScript

tests how to run it using Docker which is very very easy just use docker run

and it will start and get going for you and then it tells you a little bit

about some basic usage here as you can see we make a request to its endpoint

and then we give it the URL and let it do its thing and it returns the data to

us but what we're interested in the most today are the cookies and down here

somewhere we have our returnOnlyCookies which just ignores all the

rest of the HTML data because we don't want to be making requests through this

to get the HTML we want to make requests through this to get us those cookies

back of course we can add in our proxy here as well here's mine running at the

moment as you can see just on this endpoint it just tells you that it's

ready and whatever version etc etc so let me show you my code so I have a get

cookies function that's going to take in that initial URL for me it's going to do

what we saw on the GitHub page make the request I have my returnOnlyCookies

set to true and I also have my proxy set to

the sticky proxies which I've got configured to keep the same IP for about

3 minutes I think which are the ones from ProxyScrape that I'm using. From

here we're going to return the JSON response which is going to contain the

cookie data which is a list of dictionaries we can see this down here
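
A minimal sketch of what a get-cookies function along these lines might look like follows. It is not the code shown on screen. The endpoint URL, the `returnOnlyCookies` and `proxy` fields and the `solution.cookies` response path are assumptions taken from the FlareSolverr README, and the `post` parameter is a hypothetical addition so the function can be exercised without a running service.

```python
FLARESOLVERR_URL = "http://localhost:8191/v1"  # assumed default local endpoint


def get_cookies(url, proxy_url=None, post=None):
    """Ask FlareSolverr to load `url` in its browser and return the cookie list.

    `post` defaults to requests.post; it is injectable (a hypothetical
    convenience) so the function can be tried without a live service.
    """
    if post is None:
        import requests  # third-party; imported lazily so the sketch loads without it

        post = requests.post

    payload = {
        "cmd": "request.get",
        "url": url,
        "maxTimeout": 60000,        # milliseconds the browser may spend on the challenge
        "returnOnlyCookies": True,  # skip the HTML body, we only want the cookies
    }
    if proxy_url:
        # Sticky proxy, so the cookies are issued to the IP we keep using afterwards.
        payload["proxy"] = {"url": proxy_url}

    response = post(FLARESOLVERR_URL, json=payload, timeout=90)
    response.raise_for_status()
    # Assumed response shape: {"solution": {"cookies": [{"name": ..., "value": ...}, ...]}}
    return response.json()["solution"]["cookies"]
```

The HTTP timeout is set longer than `maxTimeout` because the call blocks while the browser starts and works through the challenge.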

I'm creating the cookies here then I have this load cookies function because

we want to put these cookies into our requests session so we can then use that

session with those cookies going forward which is going to tell the website hey

you've already checked us, we're legit, let us through and it's going to

do so. Now to do this we're going to use cookiejar_from_dict because we want

to take the dictionary from the cookies that we get back from FlareSolverr and

convert that into a cookie jar object for our session that's why I created

this function to do so and then in the main function I create my session object

I get my sticky proxy string from my local environment

variables I recommend you do this rather than sticking them into your code

because you'll inevitably leave something in there and you don't want

people accessing it and then we go ahead and go to this URL and get the

responding cookies to that then we load them into

our session and then I'm just going to use the httpbin cookies and IP endpoints so we can

see that those cookies come back to us when we make requests to here

and I did the IP one just to check that my proxy was working fine so I'm going

to go ahead save and we'll run this code and we'll see it's going to take a

few seconds because it needs to start that browser up it needs to pass any of

the Cloudflare tests etc etc but we aren't going to be using it for every request we're only going

to be using it once every in my case probably two or three minutes and make plenty of

requests in between so here we are so the thing that we're interested in the

most is the cf_ (CF underscore) cookies here's one right here and the cf_clearance

these are the Cloudflare-specific cookies and it knows that when we send

these that it's checked us out and it's all happy and we can see that this

is my proxy IP which I will now have for however long I think 3 minutes so that's

essentially it we can take these cookies I've sent them to httpbin but you would

continue to send them to the same site to make subsequent requests through your

requests session, much much faster than waiting for a browser to load up each

time so as you can see we successfully got ourselves good cookies to use but

what is the significance of these well now we have our CF cookie so that's the

kind of verification so that Cloudflare knows that we've passed the JavaScript

test and it will allow us access that's the important part it's kind of like our

free pass all without having to do much more than a single browser request every

time we need a new set of cookies but does this actually stop us being blocked
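
To make the load-cookies step from earlier concrete, here is a hedged sketch of turning the cookie list into a cookie jar on a requests session and reading the sticky proxy from the environment. `cookiejar_from_dict` is the requests helper named in the video. The function names, the `STICKY_PROXY` variable name and the cookie layout (`name` and `value` keys) are assumptions for illustration.

```python
import os


def cookies_to_dict(cookie_list):
    """Flatten a list of cookie dictionaries into a plain {name: value} mapping."""
    return {cookie["name"]: cookie["value"] for cookie in cookie_list}


def proxies_from_env(var_name="STICKY_PROXY"):
    """Read the proxy URL from an environment variable (hypothetical name).

    Returns a requests-style proxies mapping, or an empty dict if the variable is unset.
    """
    proxy = os.environ.get(var_name)
    return {"http": proxy, "https": proxy} if proxy else {}


def load_cookies(session, cookie_list):
    """Replace the session's cookies with a jar built from the cookie list."""
    # Imported here so the helpers above work even without requests installed.
    from requests.utils import cookiejar_from_dict

    session.cookies = cookiejar_from_dict(cookies_to_dict(cookie_list))
    return session


# Example wiring (needs the third-party requests package and a cookie list
# obtained from the browser step):
#
#   import requests
#   session = requests.Session()
#   session.proxies = proxies_from_env()
#   load_cookies(session, cookies)
#   print(session.get("https://httpbin.org/cookies").json())
```

When the sticky IP window ends, the browser step would be run again to fetch a fresh set of cookies and the session reloaded the same way.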

well it's important to know that this field is ever changing and that

something that works today may not work tomorrow this method is very good for

low-level protection where you find yourself up against pages with mild

Cloudflare WAF levels which will get you around things like checking you're

not a bot all that sort of stuff this will actually get you quite far still

but as I said it's not a golden ticket to never getting blocked again my

opinion has always been to learn as much as possible to give yourself the best

chance at success if you want to know more about quick and easy ways to scrape

large amounts of data you're going to want to watch this video next