Video Thumbnail 24:32
Why I chose Python & Polars for Data Analysis
8.5K
279
2024-04-19
To try everything Brilliant has to offer—free—for a full 30 days, visit https://brilliant.org/JohnWatsonRooney/ . You’ll also get 20% off an annual premium subscription. This video was sponsored by Brilliant join the Discord to discuss all things Python and Web with our growing community! https://discord.gg/C4J2uckpbR Work with me: https://johnwr.com If you are new, welcome! I am John, a self taught Python developer working in the web and data space. I specialize in data extraction and JSON ...
Subtitles

so when I first started getting into working with data for the company I was

working for at the time I had a few choices I got put through an Excel

course and that was fine but this was around about the time that I was

starting to teach myself Python and I could see that there was a clear divide

in between the two now I had the option to go ahead and carry on my Excel course

and learn more about quow query and this was kind of like the decided path for

the sort of company that I was working for at the time we didn't really do a

whole load of data analysis it was basically just to look at our own

customer data and figure out where we can improve things but as I started to

learn more Python and learn about where the data actually came from through the

apis through competitor analysis and web scraping which is kind of what I built

my channel around it became clear to me that I didn't want to actually export

that data out and then repport it into Excel to work with it in power query or

do anything like that now we had no proprietary business like bi tools or

anything like that I was allowed to make the decision myself so having watched a

few videos it was clear to me the option of being able to pull the data down from

the API from our company system and then actually go ahead and scrape competitor

data from their websites and pull all of that into a panda's data frame and

analyze and analyze it there it was just a no-brainer for me that that was the

choice that I was going to go for I also didn't have a load of fun with Excel

when it came to lots and lots of rows and manipulating and even doing the most

basic things was really challenging and even simple things like when you're

working with it in when working with a load of data in pandas you create

multiple data frames you don't chop up your original data you just build from

that uh little things like that made my choice very clear so in this video what

I want to show you is how to is how I kind of started with that I want to take

some data that I've scraped from the from from the uh internet and I'm going

to pull it all into a data frame and I'm going to show you just how simply you

can work with it to actually show off how pandas or in this case polers which

is very similar just a different sort of approach but basically the same thing I

just prefer the words and terminology for the functions that it uses how you

can actually take that and you can then present that to your boss or whoever and

say hey look this is what I've done this might help us out here this will help us

out there and then you're obviously going to help yourself out in the long

in the long run too so let's get some data going let's load up our data frame

and I'll show you how we can start to look at insights for the data that we

have making the choice to Learn Python for data became my path but if you're

interested in working with data in any capacity then it's all about how you

look at it and how to find those useful insights this is where today's sponsor

brilliant can really make a difference they make it easy for anyone to pick up

and get started and with their learn by doing and Hands-On lessons give you the

opportunity to play around with the concepts it will help you build up a

solid foundation and power up your all important problem solving skills this is

by far the most effective way to learn and is way more interesting and

captivating than watching lecture videos there's thousands of interactive lessons

in maths data analysis programming and AI already to help you regardless of

your skill level and completing just a few minutes each day will quickly become

a habit and become that powerful growth tool beats mindless scrolling there's a

ton of new data content now too including how to explore data visually

which which is where I went to brush up my skills on things like aggregating

filtering and grouping data like we're going to do later on in this video so to

try everything brilliant has to offer for free for a full 30 days visit

brilliant.org johnwatson Rooney or click on the link in the description you'll

also get 20% off an annual premium subscription so we are in our Jupiter

notebook now in my project folder I'm going to leave some instructions for you

of how to actually get to this point uh save me just running through it and just

boring everybody so the first thing we want to do is we want to import in

polers and we do this as PL like so now we want to create our data frame so we

do DF and we're going to do pl. read CSV now what's great about this is we can

actually read in a all the csvs in our folder just by this one command so we

just do star. CSV like so and then if we hit DF do shape you can see I have 33

columns and 6,227 rows the next thing I want to do

is want to have a look at DF do describe um just so we can see a little bit more

about what's going on here and this will show us some of the information so we

can see these columns the count uh of items so we start looking for null

values you can see here these ones I've got have high null values because

there's only 675 as opposed to 6,000 and so on and so forth and you can see

there's lots of null values here as well now obviously I was only able to do this

because all of my CSV files were the same if you have them at a different

number of columns or whatever you need to um import them in separately create

their own data frames and then go from there so now we have this in one last

thing to check which is the schema it's always useful to see I don't need the

parentheses here we can see the names of all the columns so there's going to be a

few main commands that we're going use there's going to be uh filter Group by

and uh select so select is going to select columns filter is going to filter

rows and group by is going to basically group the data together based on the

criteria that we give it so if we do DF do um let's do select first and let's

have a look at pl. call to select the column and let's have a look at the

price column to start with like so now this is going to be our our shape is the

correct amount of row uh rows with one column obviously the price we can see

them all there and it's telling us that this is an

I64 which is exactly what we wanted it to be now if it wasn't we would need to

change this and we may need to do this later on but you can do we could do DF

do cast and then within our uh parentheses we create some brackets we

give it the column price and then what we want it to be so

we could do PL dot I think it's float 60 before then if we were to find the price

in this bearing in mind that we haven't saved this this is just its own thing

here we can see now that the price is a float um if we were wanted to make this

permanent we could just do DF is equal to DF but I don't want the price to be a

float I want it to be an integer uh we might work with this later but this is a

good opportunity to just have a look at the data in fact what I'll do is we'll

just do DF do head to look at the top four uh five rows and we can see that

there is an added this is actually a date stream but it's come through as an

I64 number you would probably want to convert this but I'm not going to be

doing anything with the date in this analysis so um that' be one thing that I

would going to leave out but we want to sort of figure out what columns we

actually want here so we definitely want the number of beds that's going to be

useful for us and coming across here we're going to

want the post code this is going to be interesting obviously the price and the

property type as well um this is going to be the information that we're going

to work with so I've decided what I'm going to do is I'm going to look see if

there's a correlation between um the type of the property either like a house

or a flat or whatever the price and then the postcode location and we need to

work with that now obviously there is going to be some but you know it's going

to be interesting for us to see as we go forward um how we can pull this

information out and have a look so to select those columns we would just do DF

do select pl. call and we can then say we want the best

we want the price what else do we say we want to forget the columns when I'm

doing this I tend to do this with the schema let's get the schema back up

because it's nice and easy to see like so we can just copy The Columns over

that we wanted property type and the post code which is out here

somewhere we hit enter you see now we have our own new our new own data frame

which is obviously separate um that we've just created with this piece of

information and is there anything else that we going to want from this

possibly not we'll leave it like this for now we might come back and revisit

that um but this will do for now so what I'm going to do is I'm just going to I'm

going to save this as um post code I'm going to save it like this and that

means that now I have this subset data frame here with just this information

from it and if we were to look at the main data frame here's everything back

again so this is how we can easily sort of like me manage all our different data

frames like instead of having to have different sheets or whatever whatever on

our Excel page or however you want to handle it there so now we've got our

data narrowed down a bit there's a few things that we still want to do as you

can see we have this property type here now we're going to want to use this to

sort of figure out um the different types of properties that have different

costs and values obviously different prices we also need to split up the

postcode at the moment this is a UK set of data set so this is obviously a UK

postcode but the whole thing is quite a specific location we only want the first

three so we'll be chopping this up but first of all let's have a quick look at

doing unique this is quite a good way of being able

to sort of see the unique value so we do subset is equal to and this is a list

and I'm just going to put property type in here and this is just going to show

me all of the unique values that there are for the property type so you can see

we have eight in this case here so we know what we're kind of looking at and

we can now filter these down um a bit more should we need to uh we could also

for example if you put beds in here you'll see you know one of these

properties has 12 beds or 20 I mean that's land so that's a bit different so

that's kind of a we probably won't want to look at that but for example you know

nine bed house it's pretty pretty something there I suppose so as you can

tell so let's have uh another look oh there's one more thing I want to show

you which is Glimpse Glimpse is kind of useful as well you get to have a kind of

a quick look at the data um in kind of more of a raw State um you can see like

it's got all this information here so that's kind of useful as well so let's

go ahead and start uh actually clear clearing this up a little bit so we have

um something better to work with so let's get rid of this and we're going to

do they're going to split the post code up first so we'll do post code I've

called this data frame post codes not ideal and we'll dowi columns this is

another um one of the methods that we use on our data frame this is basically

what it says going to be working with some columns and changing or adding data

in some way so we're going to select the postcode column because this is the one

we want to work with post code and we going to do do string

because this is string in this post code see down here string and now we can do

do split and I'm going to split this on a space so you can see that the the data

has a space in the middle so that's going to work just fine for me and then

again we can chain these again chain these together so we can do with columns

again and again pl. and the same column and then it's a list because

we will have split it up in fact I'll show you this in just a second and we

want to get the zero index like so we need get here

like we need list. get my B list. get right so if I was to just I'm just going

to remove this bit for the moment if we run just this like it is you see we

return this list back here because we've done split split is obviously just like

the python split we are split witing and it returns us a list now that's fine but

by changing the Expressions here we're basically saying we're going to do with

columns again same column but because it's now a list see list we can do

list.get and I'm asking for the index of zero if I put the index of one you see

we get the second half which we don't want we want the zero index like this so

now I have my property type and my post code narrow my post code in a more of a

general area so we can actually think about you know having a look at the

differences in prices in different locations so again I'm just going to

override my data frame here and just save it like this I don't need to do

this anymore so now we have this let's have a look at the mean values for uh

each of these kind of like groups of postcodes so to do that we need to do

filter so let's do postcode do filter and we're going to be looking at pl.

call this is one of the things I like about um polers actually it's really

straightforward is saying oh this is looking for column so it's pretty

straightforward and easy to do then we're going to take our property type

like this we're going to say this is going to be equal

to houses and remember we looked at the all the unique values up here I should

probably include that down again actually what I'm going to do is I'm

just going to create a new uh a new cell here and we're going to do unique

again and it was our subset and we wanted it on property

value type property

type close that off cool so now I can see these here we have some that are not

specified that's fine so let's do um let's do bungalows in this case here so

now we're basically saying give me all of the rows that match Bungalow so I can

just close this off and run it and here we are here's all of our Bungalows chain

this with do select and we're going to do pl. call on the price and I'm going

to have have the mean which is going to be our average here so across this whole

data set the average Bungalow price is just about 299 but you'll see now this

has turned this into a float a 64b float we can actually do cast on the end of

this I showed youcast at the start where we were doing it because this is the an

output is the data frame so we'll do price and we do pl.

int64 close our brackets and now we have it as an integer so you can use cast

anytime you end with a data frame it's really handy to be able to chain these

all together so let's have a look at something else let's put in another

filter in the middle of this filter pl. call

beds and we'll make this equal to one and we can see now the average price

across the whole of the thing so this is a very general not that useful is uh for

one is 128,000 and and two is 246,000 is there

three yeah three and on so on and so forth um so that's kind of like one way

we can start to look at things here and we can of course put in houses here and

we'll say two bed houses 171,000 so versus

Bungalows 246,000 so in the case of small houses they seem to be cheaper

generally across the board or at least across my data set which is obviously

you know just a small 6,000 or so listings of properties for sale so there

we go so the next thing that you want to look at is I've done filter and I've

done select and we looked at mean so now we want to do group by and so what I'm

going to do is postcode which was this data frame that we created Group by and

this is going to like as it says group all of our data together so we want to

group it by post code this is the one that we want to do want to want to get

everything together like this and then we want to do Aggregate and we want to

do pl. we want to do of the uh price and I'm going to say uh mean again like so

and let's see I've got the syntax in complete one more bracket there and now

we have a list of all of the average prices for these this sort of area this

general area uh again just come through as a float 64 so what do we do again we

just go ahead and cast this over here as well and change this to an integer like

so so what be we've got here is a long list of all these different postcodes so

what we can do is we can do we need to move uh I think we can do do sort can we

do do sort on WE off this yes we can so we'll do sort

by and we want to sort on the sort on the

price and we'll do descending is equal to true so this will

give us the most expensive ones at the top we can see out of my data set these

areas have the highest average uh prices again this is just a my data set

we could end up saying that there's not that many um properties in this to match

but you know that's why I've pulled so this is what we're looking at right here

we can actually dive even deeper into this as well so we could go ahead and

say our postcode we're going to be group by um so I'm going to save this as

prices by postcode while

typing like so great so we have this now so let's try and narrow this uh broaden

it even more so for example this is the area and this this is like the general

area and then this is kind of like a subset of it so we want to go for just

these first two and we we'll bride we'll broaden this out even more and then

we'll have a sort of a good overview of the data

so what I'm going to do now is I'm going to say this is with columns so we're

going to be working with these columns again and we want to do

PL do call which is the post code like this I'm going to do string

because this is a string as it says here St Str and then we'll do do head and two

so we hit enter oh I missed a bracket I always do that so now you can see that

we have our CBS and is all matching here but now we need to do group by again so

we can sort of regroup all of this data and this is essentially the same as this

so what I'm going to do is uh because this basically gives us a data frame is

I'm going to just going to grab all of this and we'll put it on here and we'll

run it again and now you'll see that they have grouped all of these together

and if we were to let's go ahead and just grab the top ones so I'm just going

to put head again and this is obviously the head of the data frame rather than

the string which we did in the middle and now we have the top five sort of

shorthand postcode areas with the most expensive um houses or prices for uh any

property within now what we can do here is we can actually save this to uh

price totals like so I'm going to come back to the top here and I'm going to do

import cbor as SNS I'm just going to run all the sales

just to make make sure nothing goes wrong and now we can do SNS do set theme

this isn't necessary but I'm just going to do it anyway and we can do SNS

barplot now we can do price totals price

totals and we'll have the X column is going to be equal

to our post code and I think this should be price

we'll double check in just a second why there we go now this is a very

uninspiring graph because we chopped this right down uh where are we so we

come all the way to the end here and we can just you know make this a bit bigger

like so same deal there we go so we can start to look at this and we can start

to compare and you can of course put these the other way around if you wanted

to grow out sideways like so okay so let's construct ourselves another nice

uh long expression let's find um the sort of average price for houses

depending on the number of beds so to start with we want to do our again uh

filter we're going to go back to using the postcode data frame it's just there

available for us so we're going to use this one so we'll do

filter and we're going to filter on our property type column so pl. call

property type and we're going to say that that is going to be equal to going

to look at houses houses like this and now we want to do a group buy because we

want to group Buy on the number of beds so we'll do uh

beds and we'll do aggregate function or a function pl. call on the price so we

just chaining all these Expressions together now and we want to have the

mean of the price so there we go now we've got this sort of data here but

again we have our it's converted our uh price into a float 64 and you can see

that we've got some dodgy numbers going on here um so let's go ahead and bring

this in here so we want to do we've got this whole cast and sort and head so I'm

going to copy that we'll put you here as well and now we have uh the top 10 of

price for the number of beds and you can see here that generally speaking funnily

enough the more beds you have apart from this bit of anonymous this bit of data

which is a bit strange I'd be interested to see where that came from obviously

the more beds you have the higher the price we can do Tail as well which I

think is probably just going to give us this as well yes so there we go

obviously that's not overly surprising um one one thing I want to show you is

that when you create when you construct these sort of long uh Expressions it's

very easy just to drop in and drop out the different bits of information so we

take uh land for example here we can now just change this and we've got the same

amount the same information but for the type of land so it's very easy and I'm

going to change this back two heads so we get the top ones it's very easy just

to drop this in and I think this one here um is character properties so let's

go ahead and put that in cool so we can see right away that the character

properties are much more expensive and uh maybe generally have uh the this

different number of beds but you get the idea of what you can start to build and

construct with these long um expressions and obviously you don't have to do this

you can split these up but for example I know that when I run this is my data

frame when I run this this is going to end up with a data frame as well so then

I can do group by and when I end up here I know that I'm going to have data left

and I can do cast to change the data type then I can sort it because I have

that data frame left and of course if you wanted to you can then throw in

caborn to do some kind of um graphing if you want um

it's up to you this it's fairly straightforward I tend to only use S

graph plot or ra plot I think kind of get the idea not that

particularly useful in this but it is when you have more different more and

different data points so from here if I was going to continue to analyze this

data let's go back up so I can see the schema I would be changing added which

to a date and I'd be looking at seeing how

long um things have been on sale because we have

somewhere down here um sold STC so we can filter those out and we can have a

look how long things have been on sale and what sort of property stale stay on

sale for the longest that would be an interesting piece of data to look at we

focused in much more on the price and the type and the number of beds and the

averages in this instance so hopefully you enjoyed this video if you want to

know how I got all of this data make sure you subscribe because that video is

coming up next