so when I first started getting into working with data for the company I was
working for at the time I had a few choices I got put through an Excel
course and that was fine but this was around about the time that I was
starting to teach myself Python and I could see that there was a clear divide
in between the two now I had the option to go ahead and carry on my Excel course
and learn more about quow query and this was kind of like the decided path for
the sort of company that I was working for at the time we didn't really do a
whole load of data analysis it was basically just to look at our own
customer data and figure out where we can improve things but as I started to
learn more Python and learn about where the data actually came from through the
apis through competitor analysis and web scraping which is kind of what I built
my channel around it became clear to me that I didn't want to actually export
that data out and then repport it into Excel to work with it in power query or
do anything like that now we had no proprietary business like bi tools or
anything like that I was allowed to make the decision myself so having watched a
few videos it was clear to me the option of being able to pull the data down from
the API from our company system and then actually go ahead and scrape competitor
data from their websites and pull all of that into a panda's data frame and
analyze and analyze it there it was just a no-brainer for me that that was the
choice that I was going to go for I also didn't have a load of fun with Excel
when it came to lots and lots of rows and manipulating and even doing the most
basic things was really challenging and even simple things like when you're
working with it in when working with a load of data in pandas you create
multiple data frames you don't chop up your original data you just build from
that uh little things like that made my choice very clear so in this video what
I want to show you is how to is how I kind of started with that I want to take
some data that I've scraped from the from from the uh internet and I'm going
to pull it all into a data frame and I'm going to show you just how simply you
can work with it to actually show off how pandas or in this case polers which
is very similar just a different sort of approach but basically the same thing I
just prefer the words and terminology for the functions that it uses how you
can actually take that and you can then present that to your boss or whoever and
say hey look this is what I've done this might help us out here this will help us
out there and then you're obviously going to help yourself out in the long
in the long run too so let's get some data going let's load up our data frame
and I'll show you how we can start to look at insights for the data that we
have making the choice to Learn Python for data became my path but if you're
interested in working with data in any capacity then it's all about how you
look at it and how to find those useful insights this is where today's sponsor
brilliant can really make a difference they make it easy for anyone to pick up
and get started and with their learn by doing and Hands-On lessons give you the
opportunity to play around with the concepts it will help you build up a
solid foundation and power up your all important problem solving skills this is
by far the most effective way to learn and is way more interesting and
captivating than watching lecture videos there's thousands of interactive lessons
in maths data analysis programming and AI already to help you regardless of
your skill level and completing just a few minutes each day will quickly become
a habit and become that powerful growth tool beats mindless scrolling there's a
ton of new data content now too including how to explore data visually
which which is where I went to brush up my skills on things like aggregating
filtering and grouping data like we're going to do later on in this video so to
try everything brilliant has to offer for free for a full 30 days visit
brilliant.org johnwatson Rooney or click on the link in the description you'll
also get 20% off an annual premium subscription so we are in our Jupiter
notebook now in my project folder I'm going to leave some instructions for you
of how to actually get to this point uh save me just running through it and just
boring everybody so the first thing we want to do is we want to import in
polers and we do this as PL like so now we want to create our data frame so we
do DF and we're going to do pl. read CSV now what's great about this is we can
actually read in a all the csvs in our folder just by this one command so we
just do star. CSV like so and then if we hit DF do shape you can see I have 33
columns and 6,227 rows the next thing I want to do
is want to have a look at DF do describe um just so we can see a little bit more
about what's going on here and this will show us some of the information so we
can see these columns the count uh of items so we start looking for null
values you can see here these ones I've got have high null values because
there's only 675 as opposed to 6,000 and so on and so forth and you can see
there's lots of null values here as well now obviously I was only able to do this
because all of my CSV files were the same if you have them at a different
number of columns or whatever you need to um import them in separately create
their own data frames and then go from there so now we have this in one last
thing to check which is the schema it's always useful to see I don't need the
parentheses here we can see the names of all the columns so there's going to be a
few main commands that we're going use there's going to be uh filter Group by
and uh select so select is going to select columns filter is going to filter
rows and group by is going to basically group the data together based on the
criteria that we give it so if we do DF do um let's do select first and let's
have a look at pl. call to select the column and let's have a look at the
price column to start with like so now this is going to be our our shape is the
correct amount of row uh rows with one column obviously the price we can see
them all there and it's telling us that this is an
I64 which is exactly what we wanted it to be now if it wasn't we would need to
change this and we may need to do this later on but you can do we could do DF
do cast and then within our uh parentheses we create some brackets we
give it the column price and then what we want it to be so
we could do PL dot I think it's float 60 before then if we were to find the price
in this bearing in mind that we haven't saved this this is just its own thing
here we can see now that the price is a float um if we were wanted to make this
permanent we could just do DF is equal to DF but I don't want the price to be a
float I want it to be an integer uh we might work with this later but this is a
good opportunity to just have a look at the data in fact what I'll do is we'll
just do DF do head to look at the top four uh five rows and we can see that
there is an added this is actually a date stream but it's come through as an
I64 number you would probably want to convert this but I'm not going to be
doing anything with the date in this analysis so um that' be one thing that I
would going to leave out but we want to sort of figure out what columns we
actually want here so we definitely want the number of beds that's going to be
useful for us and coming across here we're going to
want the post code this is going to be interesting obviously the price and the
property type as well um this is going to be the information that we're going
to work with so I've decided what I'm going to do is I'm going to look see if
there's a correlation between um the type of the property either like a house
or a flat or whatever the price and then the postcode location and we need to
work with that now obviously there is going to be some but you know it's going
to be interesting for us to see as we go forward um how we can pull this
information out and have a look so to select those columns we would just do DF
do select pl. call and we can then say we want the best
we want the price what else do we say we want to forget the columns when I'm
doing this I tend to do this with the schema let's get the schema back up
because it's nice and easy to see like so we can just copy The Columns over
that we wanted property type and the post code which is out here
somewhere we hit enter you see now we have our own new our new own data frame
which is obviously separate um that we've just created with this piece of
information and is there anything else that we going to want from this
possibly not we'll leave it like this for now we might come back and revisit
that um but this will do for now so what I'm going to do is I'm just going to I'm
going to save this as um post code I'm going to save it like this and that
means that now I have this subset data frame here with just this information
from it and if we were to look at the main data frame here's everything back
again so this is how we can easily sort of like me manage all our different data
frames like instead of having to have different sheets or whatever whatever on
our Excel page or however you want to handle it there so now we've got our
data narrowed down a bit there's a few things that we still want to do as you
can see we have this property type here now we're going to want to use this to
sort of figure out um the different types of properties that have different
costs and values obviously different prices we also need to split up the
postcode at the moment this is a UK set of data set so this is obviously a UK
postcode but the whole thing is quite a specific location we only want the first
three so we'll be chopping this up but first of all let's have a quick look at
doing unique this is quite a good way of being able
to sort of see the unique value so we do subset is equal to and this is a list
and I'm just going to put property type in here and this is just going to show
me all of the unique values that there are for the property type so you can see
we have eight in this case here so we know what we're kind of looking at and
we can now filter these down um a bit more should we need to uh we could also
for example if you put beds in here you'll see you know one of these
properties has 12 beds or 20 I mean that's land so that's a bit different so
that's kind of a we probably won't want to look at that but for example you know
nine bed house it's pretty pretty something there I suppose so as you can
tell so let's have uh another look oh there's one more thing I want to show
you which is Glimpse Glimpse is kind of useful as well you get to have a kind of
a quick look at the data um in kind of more of a raw State um you can see like
it's got all this information here so that's kind of useful as well so let's
go ahead and start uh actually clear clearing this up a little bit so we have
um something better to work with so let's get rid of this and we're going to
do they're going to split the post code up first so we'll do post code I've
called this data frame post codes not ideal and we'll dowi columns this is
another um one of the methods that we use on our data frame this is basically
what it says going to be working with some columns and changing or adding data
in some way so we're going to select the postcode column because this is the one
we want to work with post code and we going to do do string
because this is string in this post code see down here string and now we can do
do split and I'm going to split this on a space so you can see that the the data
has a space in the middle so that's going to work just fine for me and then
again we can chain these again chain these together so we can do with columns
again and again pl. and the same column and then it's a list because
we will have split it up in fact I'll show you this in just a second and we
want to get the zero index like so we need get here
like we need list. get my B list. get right so if I was to just I'm just going
to remove this bit for the moment if we run just this like it is you see we
return this list back here because we've done split split is obviously just like
the python split we are split witing and it returns us a list now that's fine but
by changing the Expressions here we're basically saying we're going to do with
columns again same column but because it's now a list see list we can do
list.get and I'm asking for the index of zero if I put the index of one you see
we get the second half which we don't want we want the zero index like this so
now I have my property type and my post code narrow my post code in a more of a
general area so we can actually think about you know having a look at the
differences in prices in different locations so again I'm just going to
override my data frame here and just save it like this I don't need to do
this anymore so now we have this let's have a look at the mean values for uh
each of these kind of like groups of postcodes so to do that we need to do
filter so let's do postcode do filter and we're going to be looking at pl.
call this is one of the things I like about um polers actually it's really
straightforward is saying oh this is looking for column so it's pretty
straightforward and easy to do then we're going to take our property type
like this we're going to say this is going to be equal
to houses and remember we looked at the all the unique values up here I should
probably include that down again actually what I'm going to do is I'm
just going to create a new uh a new cell here and we're going to do unique
again and it was our subset and we wanted it on property
value type property
type close that off cool so now I can see these here we have some that are not
specified that's fine so let's do um let's do bungalows in this case here so
now we're basically saying give me all of the rows that match Bungalow so I can
just close this off and run it and here we are here's all of our Bungalows chain
this with do select and we're going to do pl. call on the price and I'm going
to have have the mean which is going to be our average here so across this whole
data set the average Bungalow price is just about 299 but you'll see now this
has turned this into a float a 64b float we can actually do cast on the end of
this I showed youcast at the start where we were doing it because this is the an
output is the data frame so we'll do price and we do pl.
int64 close our brackets and now we have it as an integer so you can use cast
anytime you end with a data frame it's really handy to be able to chain these
all together so let's have a look at something else let's put in another
filter in the middle of this filter pl. call
beds and we'll make this equal to one and we can see now the average price
across the whole of the thing so this is a very general not that useful is uh for
one is 128,000 and and two is 246,000 is there
three yeah three and on so on and so forth um so that's kind of like one way
we can start to look at things here and we can of course put in houses here and
we'll say two bed houses 171,000 so versus
Bungalows 246,000 so in the case of small houses they seem to be cheaper
generally across the board or at least across my data set which is obviously
you know just a small 6,000 or so listings of properties for sale so there
we go so the next thing that you want to look at is I've done filter and I've
done select and we looked at mean so now we want to do group by and so what I'm
going to do is postcode which was this data frame that we created Group by and
this is going to like as it says group all of our data together so we want to
group it by post code this is the one that we want to do want to want to get
everything together like this and then we want to do Aggregate and we want to
do pl. we want to do of the uh price and I'm going to say uh mean again like so
and let's see I've got the syntax in complete one more bracket there and now
we have a list of all of the average prices for these this sort of area this
general area uh again just come through as a float 64 so what do we do again we
just go ahead and cast this over here as well and change this to an integer like
so so what be we've got here is a long list of all these different postcodes so
what we can do is we can do we need to move uh I think we can do do sort can we
do do sort on WE off this yes we can so we'll do sort
by and we want to sort on the sort on the
price and we'll do descending is equal to true so this will
give us the most expensive ones at the top we can see out of my data set these
areas have the highest average uh prices again this is just a my data set
we could end up saying that there's not that many um properties in this to match
but you know that's why I've pulled so this is what we're looking at right here
we can actually dive even deeper into this as well so we could go ahead and
say our postcode we're going to be group by um so I'm going to save this as
prices by postcode while
typing like so great so we have this now so let's try and narrow this uh broaden
it even more so for example this is the area and this this is like the general
area and then this is kind of like a subset of it so we want to go for just
these first two and we we'll bride we'll broaden this out even more and then
we'll have a sort of a good overview of the data
so what I'm going to do now is I'm going to say this is with columns so we're
going to be working with these columns again and we want to do
PL do call which is the post code like this I'm going to do string
because this is a string as it says here St Str and then we'll do do head and two
so we hit enter oh I missed a bracket I always do that so now you can see that
we have our CBS and is all matching here but now we need to do group by again so
we can sort of regroup all of this data and this is essentially the same as this
so what I'm going to do is uh because this basically gives us a data frame is
I'm going to just going to grab all of this and we'll put it on here and we'll
run it again and now you'll see that they have grouped all of these together
and if we were to let's go ahead and just grab the top ones so I'm just going
to put head again and this is obviously the head of the data frame rather than
the string which we did in the middle and now we have the top five sort of
shorthand postcode areas with the most expensive um houses or prices for uh any
property within now what we can do here is we can actually save this to uh
price totals like so I'm going to come back to the top here and I'm going to do
import cbor as SNS I'm just going to run all the sales
just to make make sure nothing goes wrong and now we can do SNS do set theme
this isn't necessary but I'm just going to do it anyway and we can do SNS
barplot now we can do price totals price
totals and we'll have the X column is going to be equal
to our post code and I think this should be price
we'll double check in just a second why there we go now this is a very
uninspiring graph because we chopped this right down uh where are we so we
come all the way to the end here and we can just you know make this a bit bigger
like so same deal there we go so we can start to look at this and we can start
to compare and you can of course put these the other way around if you wanted
to grow out sideways like so okay so let's construct ourselves another nice
uh long expression let's find um the sort of average price for houses
depending on the number of beds so to start with we want to do our again uh
filter we're going to go back to using the postcode data frame it's just there
available for us so we're going to use this one so we'll do
filter and we're going to filter on our property type column so pl. call
property type and we're going to say that that is going to be equal to going
to look at houses houses like this and now we want to do a group buy because we
want to group Buy on the number of beds so we'll do uh
beds and we'll do aggregate function or a function pl. call on the price so we
just chaining all these Expressions together now and we want to have the
mean of the price so there we go now we've got this sort of data here but
again we have our it's converted our uh price into a float 64 and you can see
that we've got some dodgy numbers going on here um so let's go ahead and bring
this in here so we want to do we've got this whole cast and sort and head so I'm
going to copy that we'll put you here as well and now we have uh the top 10 of
price for the number of beds and you can see here that generally speaking funnily
enough the more beds you have apart from this bit of anonymous this bit of data
which is a bit strange I'd be interested to see where that came from obviously
the more beds you have the higher the price we can do Tail as well which I
think is probably just going to give us this as well yes so there we go
obviously that's not overly surprising um one one thing I want to show you is
that when you create when you construct these sort of long uh Expressions it's
very easy just to drop in and drop out the different bits of information so we
take uh land for example here we can now just change this and we've got the same
amount the same information but for the type of land so it's very easy and I'm
going to change this back two heads so we get the top ones it's very easy just
to drop this in and I think this one here um is character properties so let's
go ahead and put that in cool so we can see right away that the character
properties are much more expensive and uh maybe generally have uh the this
different number of beds but you get the idea of what you can start to build and
construct with these long um expressions and obviously you don't have to do this
you can split these up but for example I know that when I run this is my data
frame when I run this this is going to end up with a data frame as well so then
I can do group by and when I end up here I know that I'm going to have data left
and I can do cast to change the data type then I can sort it because I have
that data frame left and of course if you wanted to you can then throw in
caborn to do some kind of um graphing if you want um
it's up to you this it's fairly straightforward I tend to only use S
graph plot or ra plot I think kind of get the idea not that
particularly useful in this but it is when you have more different more and
different data points so from here if I was going to continue to analyze this
data let's go back up so I can see the schema I would be changing added which
to a date and I'd be looking at seeing how
long um things have been on sale because we have
somewhere down here um sold STC so we can filter those out and we can have a
look how long things have been on sale and what sort of property stale stay on
sale for the longest that would be an interesting piece of data to look at we
focused in much more on the price and the type and the number of beds and the
averages in this instance so hopefully you enjoyed this video if you want to
know how I got all of this data make sure you subscribe because that video is
coming up next