Right here, and the cf_clearance: these are the Cloudflare-specific cookies, and
it knows that when we send these, it's checked us out and it's all happy.

This is a simple method to help you avoid being blocked and IP banned from
sites with low to medium bot protection. It's going to give you a good chance to
avoid blocks on all but the toughest sites, and the best part is that it's very
simple and very fast. Let's look at what this method is, when you should use it,
how it works to avoid blocks, and how you can adapt it to your situation. And
whilst it's not a golden ticket, it's absolutely worth learning and knowing how to do.
Before I go into more detail, we need to understand a little bit about how your
scrapers get blocked. The most common is a JavaScript test: these sites execute
some basic JS in your browser and compare the result to the one they're
expecting. If you aren't using a browser at all, it's a simple block before you've
even had a chance. There are lots of ways, through fingerprinting, that your bots
can be found out, and if you're interested in learning about those, drop
me a comment down below and I will create a video on it.

Given that we know that plain requests are easy to spot and block, that leaves
us with running a browser, but this is painfully slow and not even guaranteed to
work, given the fingerprinting. What we can do is use a modified browser instance
to give us the best chance of passing that test, and then have it return those
cookies back to us to reuse with subsequent standard requests within a session.
It's important
to note that we will absolutely need to use proxies for this, and in some cases
certain anti-bot measures include tagging your cookies with your IP,
so if we rotate, it's a red flag and our whole session is blocked. But with
today's sponsor, ProxyScrape, we have the option to use sticky sessions that will
hold that IP for us for a certain number of minutes. With ProxyScrape we have
access to high-quality, secure, fast, and ethically sourced proxies that are
perfect for our use case here. There are 10 million plus proxies in the pool to
use, all with unlimited concurrent sessions, from countries all over the globe,
enabling us to scrape quickly and efficiently. I'd always recommend
residential proxies, as these are the best option for helping FlareSolverr here
beat any anti-bot protection on sites, and with auto rotation or the sticky
sessions I suggested earlier, this is the simplest but most effective way to avoid
our projects being blocked and to allow us access to the data we need. It's only
one line of code to add, and then we can let ProxyScrape handle the rest from
there. Any traffic you purchase is yours to use whenever you need, as it
doesn't ever expire. There's definitely a use case for data center proxies though,
and ProxyScrape has you covered there too: unlimited bandwidth, 99% uptime, and
no rate limit, with IP authentication, make them a great option in the right
use case. So if this all sounds good to you, go ahead and check out ProxyScrape
at the link in the description below.

Okay, let's carry on with our project and
get the cookies that we are after. So we know we're going to make a clean, good
request, pass that test, and then keep the cookies, and potentially the matching
IP, to have the best chance for more requests within our session. But how do we do that?
Well, there are a few options. You could run an instance of Playwright, or
Selenium with undetected-chromedriver, or even do it manually with your
real browser and copy out the headers and the cookies. Or there's what I've
been doing recently, which is using FlareSolverr. This is a special version of
Chrome with the undetected driver that runs as an HTTP service locally via
Docker. Give it our URL and it will make sure it runs a browser and passes that
basic JavaScript test for us, and returns the HTML page and, crucially, the cookies.
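
As a rough illustration of that request, here is a minimal sketch in Python. It assumes FlareSolverr is listening on its default local endpoint, http://localhost:8191/v1, and uses the `request.get` command, `returnOnlyCookies` flag, and `proxy` field described in its README. The helper names `build_payload` and `solve` are hypothetical, and the standard library's urllib is used here only to keep the sketch dependency-free; it is not necessarily what the video's code does.

```python
import json
import urllib.request

# Assumed: FlareSolverr's default local endpoint when started via Docker.
FLARESOLVERR_URL = "http://localhost:8191/v1"


def build_payload(url, proxy_url=None, cookies_only=False, max_timeout_ms=60000):
    """Build the JSON body for a FlareSolverr `request.get` command."""
    payload = {"cmd": "request.get", "url": url, "maxTimeout": max_timeout_ms}
    if cookies_only:
        # Ask FlareSolverr to skip the HTML body and return just the cookies.
        payload["returnOnlyCookies"] = True
    if proxy_url:
        # Route the browser through a proxy, e.g. a sticky-session proxy.
        payload["proxy"] = {"url": proxy_url}
    return payload


def solve(url, proxy_url=None, cookies_only=False):
    """POST the command to FlareSolverr and return its `solution` object."""
    body = json.dumps(build_payload(url, proxy_url, cookies_only)).encode("utf-8")
    request = urllib.request.Request(
        FLARESOLVERR_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # The browser has to start and pass the challenge, so allow a generous timeout.
    with urllib.request.urlopen(request, timeout=90) as response:
        result = json.load(response)
    if result.get("status") != "ok":
        raise RuntimeError(f"FlareSolverr error: {result.get('message')}")
    return result["solution"]
```

With the service running, `solve("https://example.com")["cookies"]` should give the cookie list, and passing `cookies_only=True` keeps the response small when the HTML is not needed.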
Now, I'll show you how to pass these back into your requests session shortly, but
let's have a look and see FlareSolverr working and check out the cookies it's
responding with. So here's the FlareSolverr GitHub page. It explains a
little bit here about how it works: it uses Selenium with the undetected
ChromeDriver to give you the best chance of passing those JavaScript
tests. It shows how to run it using Docker, which is very, very easy: just use
docker run and it will start and get going for you. And then it tells you a little bit
about some basic usage here. As you can see, we make a request to its endpoint,
and then we give it the URL and let it do its thing, and it returns the data to
us. But what we're interested in the most today are the cookies, and down here
somewhere we have our returnOnlyCookies option, which just ignores all the
rest of the HTML data, because we don't want to be making requests through this
to get the HTML; we want to make requests through this to get us those cookies
back. Of course, we can add in our proxy here as well. Here's mine running at the
moment. As you can see, on this endpoint it just tells you that it's
ready and whatever version, etc. So let me show you my code. I have a get_cookies
function that's going to take in that initial URL for me. It's going to do
what we saw on the GitHub page and make the request. I have returnOnlyCookies
set to true, and I also have my proxy set to
the sticky proxies, which I've got configured to keep the same IP for about
3 minutes, I think; those are the ones from ProxyScrape that I'm using. From
here we're going to return the JSON response, which is going to contain the
cookie data, which is a list of dictionaries. We can see this down here.
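
Since the cookies come back as a list of dictionaries, they need flattening into a plain name-to-value mapping before a requests session can use them. Here is a minimal sketch, assuming each entry carries `name` and `value` keys as in FlareSolverr's documented response. The names `cookies_to_dict` and `load_cookies` are illustrative and not necessarily those in the video's code; `cookiejar_from_dict` is the helper from `requests.utils`.

```python
def cookies_to_dict(cookies):
    """Flatten a list of cookie dicts into a {name: value} mapping."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


def load_cookies(session, cookies):
    """Replace a requests.Session's cookie jar with the solved cookies."""
    # Imported lazily so the pure helper above has no third-party dependency.
    from requests.utils import cookiejar_from_dict

    session.cookies = cookiejar_from_dict(cookies_to_dict(cookies))
    return session
```

Something like `session = load_cookies(requests.Session(), cookie_list)` would then give a session that sends those cookies on every following request.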
I'm creating the cookies here. Then I have this load_cookies function, because
we want to put these cookies into our requests session so we can then use that
session with those cookies going forward, which is going to tell the website, hey,
you've already checked us, we're legit, let us through, and it's going to
do so. Now, to do this we're going to use cookiejar_from_dict, because we want
to take the dictionary from the cookies that we get back from FlareSolverr and
convert that into a cookie jar object for our session. That's why I created
this function. And then in the main function I create my session object,
I get my sticky proxy string from my local environment
variables. I recommend you do this rather than sticking them into your code,
because you'll inevitably leave something in there, and you don't want
people accessing it. And then we go ahead and go to this URL and get the
responding cookies for that. Then we load them into
our session, and then I'm just going to use httpbin's cookies and IP endpoints so
we can see that those cookies come back to us when we make requests to here,
and I did the IP one just to check that my proxy was working fine. So I'm going
to go ahead, save, and we'll run this code. It's going to take a
few seconds, because it needs to start that browser up, it needs to pass any of
the Cloudflare tests, etc. But we aren't going to be using it for every request;
we're going to be using it once every, in my case, probably two or three minutes,
and making plenty of requests in between. So here we are. The thing that we're interested in the
most is the CF cookies. Here's one right here, and the cf_clearance.
These are the Cloudflare-specific cookies, and it knows that when we send
these, it's checked us out and it's all happy. And we can see that this
is my proxy IP, which I will now have for however long, I think 3 minutes. So that's
essentially it. We can take these cookies (I've sent them to httpbin, but you would
continue to send them to the same site) to make subsequent requests through your
requests session, much, much faster than waiting for a browser to load up each
time. So as you can see, we successfully got ourselves good cookies to use. But
what is the significance of these? Well, now we have our CF cookie, so that's the
kind of verification by which Cloudflare knows that we've passed the JavaScript
test, and it will allow us access. That's the important part. It's kind of like our
free pass, all without having to do much more than a single browser request every
time we need a new set of cookies. But does this actually stop us being blocked?
Well, it's important to know that this field is ever changing, and that
something that works today may not work tomorrow. This method is very good for
low-level protection, where you find yourself up against pages with mild
Cloudflare WAF levels, and it will get you around things like the checks that
you're not a bot, all that sort of stuff. This will actually get you quite far still,
but as I said, it's not a golden ticket to never getting blocked again. My
opinion has always been to learn as much as possible to give yourself the best
chance at success. If you want to know more about quick and easy ways to scrape
large amounts of data, you're going to want to watch this video next.