Web Scrape Websites with a LOGIN - Python Basic Auth
2020-01-22
Here we go through how to use requests to POST the login information and session to make it persistent, allowing us to scrape information behind a login wall. Dummy site: https://the-internet.herokuapp.com/login
Subtitles

hello everyone welcome John here and today we're going to cover how to web

scrape sites that require a login using requests and a requests session using the

inspect element tool in our browser we can see where the login request is

actually sent and we can mimic that in our program and the session

part allows us to stay logged in and access all the pages that are behind

the login there are a few things we need to do before we write our code however

and we need to find out the login URL what parameters are sent with that post

request and of course we need the login credentials although in

this example I will share the login information with you because we're using

a dummy site I'll also show you a way to separate out your credentials at the end

to make it a bit safer and better when you're sharing your script or uploading

to GitHub or whatever so this is the site we're going to use and it's at this URL

and I'll put a link to that in the description

as you can see we've got a simple login form with a username and password

required so if we log into this now using the information given to us here

we'll see that when we log in correctly

we get a you are logged in message and we get a secure area so this is what we want to

get to with our Python program and then scrape the pages

within this although this is a demo so there's no real meaningful information

here okay so if we log out now so the way that we find out what's going on

with the requests is by using the inspect or inspect element part of

your web browser and the tab we're most interested in is the network

one so as you can see here if we click the login button with no credentials

we'll get a load of requests pop up and this is what we just

sent to the server so we can see here one of them is called login and

there's an authenticate one as well now this looks like a post request to me so

if we click on it here that was a get request so we want the one above

which is a post request so a post request sends data to the server

from the web browser and a get request basically asks for information to come back

what we need to find out is the URL that is being posted to with the

username and password and any other information that goes along with that we

can see right away here that the request URL is this one so let's copy and paste

that over here for safekeeping because that's where we're going to need to send

our post request from our script so if we now clear this up and we clear that

up and if we click Preserve log we'll be able to see everything come in so if we

use exactly the same super secret password and login I typed that wrong

let's clear that again I'll get the password right this time great so we logged

in correctly now what we can do is we can actually see on our request

here that it was a post request and somewhere down here it should give us a

response now the response didn't load okay here we go here's our form

data and this is what was sent along with our request to the URL so we need

to make sure that we use the correct

matching information here now sometimes you might find there might be a bit more

information down here it might have other parameters with it and

you need to make sure that those go along with the request as well but we

can see here there's only a username and password so that's all that we need from

logging in here as well we can see that we got redirected back to secure and

this should be our get request here that we got sent back so we need this URL as

well I'll just put that in here okay great so I'm going to close out the browser now

and we'll get onto our editor and start writing our code so the first thing we

need to do as always is import requests and we need to set our URL so let's

call this login URL is equal to this which is where the information was posted to

not the URL that we actually went to to get the login form

and then let's call this one our secure URL there we go so that's

where it posts to and this is the URL that we want to get

to once we have logged in okay so now we need to work on our post request and we

need to send the username and the password

along with that to get authenticated with the server now to do that we need

to send some kind of payload and because we have two parameters we need to make

that into a dictionary so we'll do payload is equal to and create a Python

dictionary and the first one was username which is what we saw in our

post request in the browser and that was tomsmith and then the password was this

password just like that okay so now we've created our payload to send along

with it if there were any other parameters that needed to go with the

request they would also need to go in here and match what we looked at on the

inspect element network tab of the browser so the next thing we need to do

is let's ignore session for now and let's just see if we can get

authenticated with the server so if we do r is equal to requests.post and

then we need the login URL that we set and then data is equal to the payload so

what this is doing is just using requests to post this information to

this URL and the payload is what we created so if we just

print r.text hopefully what we should get back is the secure page there we go

secure area so this shows that we did actually manage to log in to the secure

area okay so that's great
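
As a rough sketch of the steps so far, the snippet below sets the two URLs, builds the payload and shows the single POST without a session. The /authenticate and /secure paths and the tomsmith credentials are assumed from the dummy site's login page, and `preview_form_body` and `login_once` are illustrative names that do not appear in the video. The network call is left as a commented usage example so the block also runs offline.

```python
import requests

# Assumed from the dummy site: the form posts to /authenticate and a
# successful login lands on /secure.
LOGIN_URL = "https://the-internet.herokuapp.com/authenticate"
SECURE_URL = "https://the-internet.herokuapp.com/secure"

# Keys must match the form data shown in the browser's network tab.
payload = {
    "username": "tomsmith",
    "password": "SuperSecretPassword!",
}


def preview_form_body(url, data):
    """Build the POST without sending it and return the encoded form body."""
    prepared = requests.Request("POST", url, data=data).prepare()
    return prepared.body


def login_once(url, data):
    """Send a single POST with no session and return the response."""
    return requests.post(url, data=data)


# Offline check of what would be sent as form data.
print(preview_form_body(LOGIN_URL, payload))

# Usage (needs network access):
# r = login_once(LOGIN_URL, payload)
# print(r.text)  # HTML of the page the server redirects to
```

Comparing the previewed body with the form data in the network tab is a quick way to confirm the parameter names match before sending anything.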

okay so we've authenticated with the server so we might think that if we were to try and

navigate to a different page within that login area we could just access that as

is but if we try to do that say r2 is equal to

requests.get and let's try and get the same page back secure what did I call that

secure URL so this is exactly the same page but with this one when we send

this post request we're actually getting the information back and within that

information was a redirect which was this page here the secure area

so if we try and do that if we try and do this post request and then also get

the same page back again and this could be a different page but this is the only

one that's there then we should hopefully get this information

back again but we won't it will send us back to the login page

because we are not authenticated so if we print the text out from that request

which is going here we should see that we're back at the login page
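
One way to picture why the second request is not authenticated, sketched offline: a plain requests call builds each request from scratch, whereas a Session attaches whatever cookies it is holding. The cookie name and value below are made up for illustration, and the idea that the dummy site tracks the login with a cookie is an assumption that the video does not state.

```python
import requests

SECURE_URL = "https://the-internet.herokuapp.com/secure"

# A request built on its own, the way a bare requests.get would build it:
# nothing from the earlier POST is attached.
bare = requests.Request("GET", SECURE_URL).prepare()

# A session holding a cookie (hypothetical name and value, standing in
# for whatever the server sets after a successful login).
session = requests.Session()
session.cookies.set(
    "example_session", "abc123", domain="the-internet.herokuapp.com"
)
with_session = session.prepare_request(requests.Request("GET", SECURE_URL))

print("bare request Cookie header:   ", bare.headers.get("Cookie"))
print("session request Cookie header:", with_session.headers.get("Cookie"))
```

Nothing is sent over the network here; the two prepared requests are only inspected to show what each would carry.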

what this has done is that we have authenticated with the server but then

because we don't have our session we're not staying authenticated

so we're not going to get anything so what do we need to do well we need to

use requests session so I'm going to remove these and we're going to keep

these for now and also to make it a bit easier to see what's going

on I'm going to import BeautifulSoup as well so we can make the output a bit

nicer so we can see everything ok so we need to keep the same

payload and we're going to use a context manager in this case now a context manager

is very useful because it will allow us to stay connected and stay logged in as

long as we remain within our with statement and when we come out of that the

session is closed again it's always good Python practice to use a context manager

when you're opening files or creating a session like this it means you don't

stay connected to or logged into something so let's do with requests dot

Session with the brackets there and we'll do that as s just to give it a

name and we're then going to do s dot post and exactly what we did before with

our login URL and then data is our payload so this is basically

just opening it and calling it s which is why it's s dot post here because

that's what we've used and then let's

create our soup variable and we'll do BeautifulSoup and actually I'm

getting a bit ahead of myself here let's just see what we get back if we do

r equals that response and then print r so we should get

response 200 because we get the status code and if we do the text we should

get our secure area back which we do great
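
The session version of the login could look something like the sketch below. The URL and credentials are again assumed from the dummy site, and `login` is an illustrative helper name rather than anything from the video. The network call is commented out so the block runs offline.

```python
import requests

LOGIN_URL = "https://the-internet.herokuapp.com/authenticate"
payload = {"username": "tomsmith", "password": "SuperSecretPassword!"}


def login(session, url, data):
    """POST the login form through the session so it keeps any cookies."""
    return session.post(url, data=data)


# Opening and closing a session needs no network access by itself.
with requests.Session() as s:
    print(type(s).__name__)
    # Usage (needs network access):
    # r = login(s, LOGIN_URL, payload)
    # print(r)       # the video shows a 200 response at this point
    # print(r.text)
```

Leaving the with block closes the session and its connections on the client side. It does not send a logout request to the site.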

so that proves that we've logged in there okay so we can get rid of this and

let's try and load that page up again as we did before but when we did it without

the session we were not logged in so we couldn't get the page so now let's do r is

equal to requests dot get and then let's do the secure URL so we send a request

directly to the URL which will only give a response back if we are still

logged in and in this case I am going to create a soup

variable so it's just easier to see and BeautifulSoup with capitals and let's do r dot

content and we use the HTML parser like that and then let's print soup dot and

we'll use prettify so it's a bit easier to see
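
A small offline sketch of the BeautifulSoup step. The HTML below is a hypothetical stand-in for r.content, since the real page has more markup, and it assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.content; the real page has more markup.
html = b"<html><body><h2>Secure Area</h2><p>Welcome</p></body></html>"

# html.parser is the parser that ships with Python, as used in the video.
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())

# Individual elements can be pulled out once the page is parsed.
heading = soup.find("h2").get_text(strip=True)
print(heading)
```

prettify only changes how the markup is printed, which makes it easier to spot which page actually came back.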

with this we'll keep our session open so when we post our login information which

we've created here to the authenticate URL which came from the inspect element

on the browser that we saw we should then stay connected with our session

which means when we request the secure URL we should get the information back

from that page okay well we didn't so we've done something wrong okay so I can

see straight away what we've done wrong here is that we haven't used our session

we've used requests dot get as opposed to our session variable so if we change

this to s we'll get in there we go welcome to the secure area okay so what

we've managed to do is we've logged in to the website using the post and using

our session as a context manager and then we've got our response using our

session get to the secure page URL and we've got the response back so this could be

anything you could be logging into whatever website and then going directly

to another URL that you can only access when you're logged in and getting that

information so what I want to show now is what I mentioned at the beginning of the video

where you can hide your username and password from your main script which is

always a good practice so what we're going to do is we're going to create a

new file a new py file and within that we're going to have username is equal to

tomsmith and our password equal to the password

like this and we're going to save that as another py file I'm going to call that

creds dot py and it's going to be in the same folder the same directory as our

main script and here what we can do is we can actually import that py file into

our main program and by doing that what we can do is we can

call those variables so we can then call creds dot username and also our creds

dot password and what that's going to do is it's going to go to this file and get

that information so you could then leave this out of your git upload and

just upload this which means no one can see your username and password let's

just check that works and there we go straight back to the secure area
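
Putting the pieces together, the final script might look roughly like this. In the video the credentials come from a separate creds.py file containing a username and a password variable. Here a SimpleNamespace stands in for that module so the sketch runs without the extra file. `build_payload` and `fetch_secure_page` are illustrative names, and the URLs and credentials are assumed from the dummy site.

```python
from types import SimpleNamespace

import requests

LOGIN_URL = "https://the-internet.herokuapp.com/authenticate"
SECURE_URL = "https://the-internet.herokuapp.com/secure"


def build_payload(creds):
    """Read the form fields from any object with username and password."""
    return {"username": creds.username, "password": creds.password}


def fetch_secure_page(session, creds):
    """Log in, then fetch the protected page through the same session."""
    session.post(LOGIN_URL, data=build_payload(creds))
    # Use the session here, not requests.get, or the login is not carried over.
    return session.get(SECURE_URL)


# Stand-in for "import creds", so the sketch runs without a creds.py file.
creds = SimpleNamespace(username="tomsmith", password="SuperSecretPassword!")
print(sorted(build_payload(creds)))

# Usage (needs network access):
# with requests.Session() as s:
#     r = fetch_secure_page(s, creds)
#     print(r.text)
```

In a real project the stand-in line would be replaced by import creds, and listing creds.py in a .gitignore file is one common way to keep it out of a git upload.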

that's it guys we managed to log into a website using requests and session to

keep it alive and then access pages only available behind that login I've also

shown you a way you can hide your credentials from your main file so make

sure you get into that habit just by