15:00
How to Scrape and Download ALL images from a webpage with Python
88.2K
1.7K
2020-10-28
Today we are going to create an image downloader / scraper using Python. Using web scraping we can extract all the image links from a page and then save them to our PC in bulk. This is a basic Python programming tutorial for beginners to help show what can be achieved by learning Python! Saving all the images from a website is a real-world project that you could well be asked to do at a job, or freelancing. It combines some basic scraping skills with learning the basics of creating directories a...
Subtitles

hi everyone and welcome john here and in today's video i'm going to show you how

you can create your own image downloader using python so we're

going to be using python requests and beautiful soup and we are going to

be finding all the image tags and then saving

all of the images that it finds to our computer

so let's get started the first thing we want to do is import

requests and from bs4 we're going to import

beautiful soup and i'm also going to import the os

module because that's going to let us create folders and change directories

which we're going to need to do so now i've got those installed the os one

is in the standard python library if you need to pip install requests or

beautiful soup go ahead and do that so this is the website we're going to be

getting the images from everyone knows this website is airbnb um

i've never been to ljubljana before but i'm sure it's really nice so what i'm

going to do is i'm going to try and download the images that it lets

us from these listings what i'm not going to do is i'm not

going to go into each and every individual listing to get all the images

i'm just going to get the top the first one that it gives us so what

we want to do to start with is inspect element so we can start to

see how it looks so if i make that bigger so

we can see if we hover over the first image here

there is an image here image class blah blah blah and all this but

more specifically the most important thing to us is it's actually inside this

image tag now images in html will usually be inside

these image tags so we can actually just use

find_all with beautiful soup to get them all and start collecting the links that

we want to then download so now i can see that is in there i'm

just going to double check the page source

it's always useful to do and i'm going to just copy

some part of the text so we can get to the

let's just copy and we'll search for under free parking

just so we can see that it's there and it looks like it is

available so we know that we can get to it

so i'm going to copy this url it's quite a long one

i'm just going to put it in here so we're going to say url is equal to this

and just move that up and out of the way the first part is to actually reach out

to the server with requests and then get that information back so as

always i like to do r is equal to requests.get and then we

give it our url which we have specified here see

these two right here the next thing we want to do is we want

to create our soup so we can do soup is equal to

beautiful soup and then we want r dot

we can do text in this case and we'll do html

dot parser so beautiful soup is just using the html parser in this case

let's move that up one and now i'm just going to check that this is working like

i always do and i'm just going to say print soup dot

title dot text and run that and hopefully if we get

something back that is right which we do we know that this is all going to work
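
The fetch-and-parse check described above might look roughly like the sketch below. To stay runnable offline it parses a small made-up HTML string instead of a live page; in the video that text comes from r = requests.get(url) followed by r.text. The HTML and the variable name page_title are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.text, the HTML returned by requests.get(url).
html = "<html><head><title>Example listings</title></head><body></body></html>"

# "html.parser" selects the parser that ships with the Python standard library.
soup = BeautifulSoup(html, "html.parser")

# Quick sanity check: if a sensible title prints, the page was parsed.
page_title = soup.title.text
print(page_title)
```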

let's clear that off delete that we don't want that what we

do want is we want to find all of the image tags so they're all

like this in the html which means we can simply do images is equal to soup

dot find_all because we want it to return a list of

every single one that it can find on the page and we want to do img

like this what i'm going to do now is i'm just going to print out

images and hopefully we get back a load

of information there we go we do so we can see that we actually got a list

and it's got all of this and we can actually see that the links are here

inside it so we can see there but that's no good we just got the

elements there what we'll do is we'll do a for loop so

we'll do for image in images so each one of those

elements that we just saw inside the images list that we created here

i'm going to print image and then after that i'm going to do src

in the square brackets with the quotation marks because

if i come back here we can see the actual link to the image that i hover

over on the right hand side is under this src the source equals

and we can access the information that's just in this little tag here which is

where the image url is so to do that let's do that and then

let's run that and hopefully scroll down and we've got a nice long

list of image links that we could if i just

click on one that didn't work if i go to chrome copy and paste it in we can see

that is the image returned that's not quite the images that i was

hoping for from this but you know it's there and it works so the next

thing you want to do is to save the image but first what i'm

going to check out is i'm going to try and give it a better name than just the

file name so i'm going to go back over to our

source code and i'm going to have a look and quite often you get these alt tags

here which basically is the sort of the name

for the image so we can actually access that the same way that we did the

source tag this one we can use this in the alt tag almost all websites will

have an alt tag for their images it's quite

important for seo so they will be there we can access

let's close that down so then let's get rid of our print statement here

and say uh let's call this one link because that was the image link

above that i'm going to put name and i'm just going to say image and then the alt

tag like that so now if i print name and link

we should get that information out as well okay we can see it's all here so

the first one this is obviously something else at the

top of the page it doesn't have an alt tag and it seems

to be just a gif file we're just going to ignore that for now
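
A minimal sketch of collecting the name and link for each image, using made-up markup rather than the real page. One assumption differs from the video: it reads the attributes with .get instead of square brackets, so an image with no alt or src is skipped rather than raising an error. The example URLs and alt texts are invented.

```python
from bs4 import BeautifulSoup

# Made-up markup: one image without an alt attribute (like the gif at
# the top of the page) and two images that have one.
html = """
<img src="https://example.com/spacer.gif">
<img src="https://example.com/a.jpg" alt="Flat with free parking">
<img src="https://example.com/b.jpg" alt="Studio near the river">
"""

soup = BeautifulSoup(html, "html.parser")
images = soup.find_all("img")  # a list-like result holding every <img> tag

found = []
for image in images:
    link = image.get("src")  # .get returns None when the attribute is missing
    name = image.get("alt")
    if not link or not name:
        continue  # skip images that lack a link or a usable name
    found.append((name, link))

for name, link in found:
    print(name, link)
```

Whether to skip unnamed images or give them a fallback name is a design choice; skipping simply keeps the sketch short.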

um and that will be fine but the rest of them are all there

and working to save the images we can do with open so we're going to be

opening a file writing to it and then saving it

and we need to give it a file name this is why we've gone ahead and got the

name from the image here so we can call our file this

name we need to give it an extension so i'm just going to do plus

and then i'm going to give it a jpeg for an image extension

it doesn't matter if the original file isn't a jpeg file or if it's

jpeg go ahead and try and save it as a jpeg first

um that's usually your best option most web files are jpegs anyway so

that's a good start and then we want to do wb because we want to write to it but

we want the bytes we want to know the actual

raw bits of the information that are in there so that's why we need wb
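
To illustrate why the file is opened with wb, here is a small sketch that writes a few bytes in binary mode and reads them back unchanged. The bytes are only a stand-in for a downloaded image, and the temporary directory and file name are assumptions so the example leaves nothing behind.

```python
import os
import tempfile

# Stand-in for the raw bytes of a downloaded image; these four bytes are
# how a typical JPEG file starts, not a complete image.
data = b"\xff\xd8\xff\xe0"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.jpg")

    # "wb" means write + binary: the bytes go to disk exactly as given,
    # with no text encoding or newline translation applied.
    with open(path, "wb") as f:
        f.write(data)

    # Read the file back in binary mode to confirm nothing changed.
    with open(path, "rb") as f:
        round_trip = f.read()

print(round_trip == data)
```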

and then as f and a colon and then under here we want

to actually send out a request to the individual links that we can then get

the information from them from the server so we're going to want

to do another request so i've got r is equal to requests dot

get up here so i'm actually just going to do i

m for image and then we're going to do requests.get

and then we're going to say link and then we want to do

f dot write the i m that is our response for the link for the image

and we want the dot content the content is going to be the bytes

content so we'll be able to save that using our write with our bytes

file and then save that to the disk so i'm

going to run this now and we'll see that it's going to go out and download all

those images and it's going to save them into the current directory that we're

working in we've got no output so i've got an error

here and that's because what i've tried to do is i've tried to write a name that

is not an acceptable file name so the best thing to do is i'm just going to go

ahead and hit replace and i'm going to replace all of the

blank spaces with a dash now hopefully what that will

do is it'll fill in all the blanks that are

actually causing us issues and saving that

with their new file name so that's looking like

it's failed right so that didn't work so let's go ahead and replace

the i think it's probably the slash forward slashes and replace them with

nothing let's try that okay there we go so we can see when i

actually read this error the first time i didn't take into account that

there was the forward slashes that were causing the problem i was just looking

at the extra dots so after we replaced that it worked fine
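
The two replacements described above can be sketched as a small helper. The function name safe_name and the example alt text are made up for illustration. Note that this only handles spaces and forward slashes; other characters (such as a colon or question mark on Windows) can also be invalid in file names and are not covered here.

```python
def safe_name(alt_text):
    """Turn an image's alt text into something usable as a file name."""
    # Spaces become dashes; forward slashes are removed because the
    # operating system treats them as path separators.
    return alt_text.replace(" ", "-").replace("/", "")


example = "Entire flat / free parking"  # made-up alt text
filename = safe_name(example) + ".jpg"
print(filename)  # Entire-flat--free-parking.jpg
```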

so if i go ahead and open the folder we can see we've

actually got all these images here so if i just

open the reveal explorer we can see that we've got them all and

they're all saved all of the thumb all the images there

for all of those and they've all got their

appropriate names as we saved them the duplicate ones are where we ran it the

first time i'm showing that bigger for you guys

so there we go so that's worked that's great

so there's a few things we can do to improve this although this is uh

the basic sort of frame of what it is that will work

but what i'd like to do is i'd like to turn this into a function that we can

then use for different websites add a little bit of error handling in as

well and also create a new folder that we can say

hey save all of the images from this um

this page into this folder okay so i'm going to

actually just collapse some of this down now and i'm going to create our function

so def defining our function and i'm just going

to call this one image down and then inside this function we're

going to give it two things so we're going to have url

and we're going to have folder so when i say folder i'm going to create a new

folder with the name that we give it so we need to indent this now to

create a folder in python it's really simple we use the

os module that we've imported and we would just do

os dot mkdir make directory but we need to kind of

do a little bit more than that first so we need to find out

we need to get the current working directory first and then we need to

create one inside that because if we just did this it probably wouldn't be in

the right place so we want it to be in this folder but a

new directory so what i'm going to do is i'm going to

say we're going to do make a directory but what we want to do

is we want to join the current working directory

and the folder name that we give it so i'm going to say

os.path dot join and there we're going to join the two

together so when we do os.path.join it will automatically put in the path

separators in the correct places for us and we're going to join the two of os

dot getcwd i think it is get current working

directory and folder so that looks a little bit

sort of long and maybe quite a little bit convoluted but all we're doing is

the main part is we're creating a directory and what we're doing is we're

creating the directory that is joining together

the current directory we're in and the new folder name we give it

okay so it's just all on one line but it should be quite

straightforward what i'm going to do is i'm going to do

try first um and then i'm just going to do a real um

basic error handling you shouldn't really do except pass but for this case

i think it's fine because we know what this is doing um so i'm going to

try creating the directory and if it fails

instead of kicking us out our program is just going to move on
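
As noted above, a bare except with pass is not ideal. One narrower alternative, which is a substitution and not what the video uses, is to catch only FileExistsError, so that a folder that already exists is ignored while other problems (such as missing permissions) still surface. The helper name ensure_folder is made up.

```python
import os
import tempfile


def ensure_folder(path):
    """Create path if needed; return True only if it was newly created."""
    try:
        os.mkdir(path)
        return True
    except FileExistsError:
        # The one failure we expect and can safely ignore.
        return False


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "images")
    first = ensure_folder(target)   # folder did not exist yet
    second = ensure_folder(target)  # folder already there, no crash

print(first, second)
```

The standard library call os.makedirs(path, exist_ok=True) gives a similar result in one line.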

okay so then we can do our r is equal to requests dot get and we can find all the

image tags and then we can get the alt and the

source for each one and then we can write them all to the

file but what we haven't done is we haven't actually

um changed into our directory so i'm going to do that underneath

that i'm going to do os dot chdir for

change directory what i'm going to do is i'm just going to paste this back in

because this has now created this directory the

join so i'm going to go ahead and put that right in there

because that's just going to go ahead and change into that directory

that we created now we've done that i'm just going to

add in a quick print statement down at the bottom

so i'm just going to say just so we can see it working

not like that print and i'm going to say writing and then we'll give it

name okay so what we've done is we've turned our

little basic script just into a function that we can reuse

we're going to give it a url and then a folder name so i'm going to comment this

url out here i'm going to let's find another

place let's go to where else do you want to go

let's go bratislava why not and select some random dates that we

might be looking at going cool great so we've got a new link let's copy

that and underneath here we're going to do image

down for our function and if you remember we have to give it the url

and this is hidden by me there we go and then we're going to give it the folder

name of which i'm going to just call it

bratislava why not i'm going to save that let's

move back over here and then going to run that and we'll get

writing see we still get that blank one at the top but i think that's okay we

kind of understand what that is we could write some

code out for that if we wanted to but i don't think we need to

and let's go to our file browser and we can see we've got a new folder here

created and all the images in and if i reveal the explorer

we should have all those images right there so that was nice and easy
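
Putting the steps together, the function might look roughly like the sketch below. It is a reconstruction from the narration and not the author's published code: the spelling imagedown is a guess, images with a blank name or link are skipped, and files are written to a joined path instead of changing directory with os.chdir. It also assumes every src is a full URL.

```python
import os

import requests
from bs4 import BeautifulSoup


def imagedown(url, folder):
    """Save every named <img> on the page at url into folder."""
    # Create the folder inside the current working directory if needed.
    target = os.path.join(os.getcwd(), folder)
    os.makedirs(target, exist_ok=True)

    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")

    saved = []
    for image in soup.find_all("img"):
        name = image.get("alt")
        link = image.get("src")
        if not name or not link:
            continue  # e.g. the blank entry at the top of the page

        # Fetch the image first, then write its bytes to disk.
        im = requests.get(link)
        filename = name.replace(" ", "-").replace("/", "") + ".jpg"
        path = os.path.join(target, filename)
        with open(path, "wb") as f:
            f.write(im.content)

        print("writing:", name)
        saved.append(path)
    return saved
```

A call would then look like imagedown("https://example.com/listings", "bratislava"), where the URL is a placeholder for the page you want to scrape.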

um i'll put this code in my github uh you guys can go ahead and take it and

maybe change it a little bit make it work for

you um but it's pretty simple uh the only sort of complicated bits

that you may or may not have seen is the os module and changing directories and

creating new folders just keeps it all tidy and you have to

do a little bit of replace on the string of the name if

you're using the alt tag you don't have to use the alt tag you

can call it whatever you like you could just call

you could do a loop and you could say the first image you find is called

image one and then all the way down just keep adding

onto it if you like i just thought it was it was a nicer way to have the

actual alt name of the image in there um just makes it a bit better to

sort of know where you're at and know what it is

that you've actually got the image for but you could call it whatever you like

so that'll do it for this one guys thank you very much for watching don't forget

to like comment and subscribe and i will see you in the

next one thank you bye