iste780day03d exercise


>>Hi, Mick McQuaid
here, welcoming you to the last video for day three. This is the video in which we
will concentrate exclusively on exercise 2.4.9. This is not an exercise to turn
in; this is simply a dry run for tomorrow’s exercise,
which is to turn in. You know, just the idea is to
help you format it properly and, you know, get the maximum
possible points for the one that you’re going
to turn in tomorrow. Okay, so here it is,
exercise 2.4.9, and in order to accomplish this, in order
to do this, I’m going to point out that I have a set of
exercise homework instructions that I’d like you to follow and so I’m going to
follow them here. So the first thing
that it tells me in the exercise instructions
is that I need to copy and paste this into a text file
in a full feature text editor so I’m going to copy this. And here I have a
full-feature text editor. I also have a copy of r in here
running and I have to put a, I’m going to, actually I might
as well do this right now. Oops. [ Tapping ] Put a blank line in
between each one of these. Oh, maybe I should
wait, let me do that. I didn’t think this
thing through carefully. Okay, so, I’m going to, for yeah,
for all the lines I’m going to substitute for the beginning
I’m going to substitute the, that should do it, yeah, okay. So I put comment characters
in front of all of these. So this is in a file
called 0.r. Your file for tomorrow’s homework is
going to be called i, letter i, lower case, dot r,
also lower case. And you’re going to copy the
instructions for that exercise and you’re going to include,
let me get rid of these, you’re going to include the
instructions for the homework and then interweave your answers
in between those instructions. Okay, so, and then
at the beginning of the file you’re going
to say the exercise number from the book, which is, in this
case, 2.4.9, your name and do it in the way that you
like to be called. Don’t worry about me being able
to identify you from your name. Put the way that you like
your name to be remembered. I only ever remember
students’ first names. I don’t, I just don’t
have the ability to remember your last name but I can pretty easily memorize
your first names and so, put down the first name the
way you like to be called. I can always identify
you by MyCourses. Okay, so having done that then
I’m going to start putting in my answers and I’m
going to leave a blank line in between these
comments and my answers. And I have two types of
things that I’m going to put in my answers, r code,
which is going to be flush left with no hash marks
as comment characters and then my comments, or
the things that it asks me like which of the predictors are
quantitative, things like that. And I’ll precede those
with two hash marks. Okay, so now the exercise
involves the auto data set. Make sure the missing
values have been removed. Uh! I’m completely unprepared
here; I’m not on the network. Let me establish a network
connection because what I want to do is show you how I
get this data set and I’ll. Okay, actually, I guess
I’m on my own network now? Yes, I’m on the network now. Okay, so what I’m
going to do is open up a browser to get
the data set. So let’s see where this
browser window will open, if it will be on screen or off. Okay, so it’s off screen here. So I’m going to put it on
screen and what I’m going to do is Google,
what should I Google? islr.data. And it’s going to be the case
that the very first thing that, or one of the very
first things that comes up is the website for this book. And one of the things that’s on this very first
page is auto.data. And actually, instead of
auto.data I’ve already looked at all of these files and
this particular version of it is not very good. This is a much better version so
I’m going to get this version. So I’m going to download
it, save it, make a note of where I
saved it and then move it. Actually, I’m not going to do
that because I did it already, move it to the directory
where I’m working. Okay, so that’s a
part of the work that I’ve already
done before this. So I actually have a copy
of a file called auto.csv in my current directory. So in order to bring
that into r I’m going to write a command like this. Don’t worry if you don’t
understand this command yet because I will explain. Okay, so this is an
assignment operator in r and so I’m assigning, this object is the
recipient of the assignment. I’m going to create
an object called auto and I create it using
the read function in the csv method of
the read function. And the first argument
by default for that type of argument I don’t have to
name, is the name of the file and that’s in my current
directory so I don’t need to qualify the name at all,
I can just say auto.csv. That, I happen to know that
that has a header in it. I could examine the file
separately but I’m not going to bother with that right now. I’m just going to tell you that
I know there’s a header in it so I’m going to say header=
true, T is an abbreviation, an acceptable abbreviation for
true to r. Another parameter that I’m going to pass is
na.strings equals “?”. And that simply means
that for any information in the file that’s
not available, whoever made the file up put a ? instead of the missing value. So the missing value is ?. So this simply alerts r
to how this is encoded. Then once I have it in I’m going
to alter it, once I have it in r I’m going to change it so
I’m going to say na.omit(auto) and that’s going to
remove all the lines that have some missing values because I just don’t
want to use those lines. And then, I’m going to
check the number of rows in my resulting data set. Oops, what am I saying? That’s not, I’m not
creating an object. That’s a function, nrow
is the function for the number of rows and then I also want to check
the number of columns and, let’s see, I’ll also
check the column names. Notice that I’m using a
syntax-highlighting editor and when I spelled the name of the function wrong it
wouldn’t change the color of it. And then I’m going to attach
it so that I can refer to the column names without
having to qualify them. Oh, and I’m also going,
it seems like I’m doing so much stuff before we get
started here, I’m also going to say cylinders are a
factor, not a, not a numerical. I’m going to use them as a
factor with levels rather than just using them
as a raw number. So I’m going to do
all these things. And what you’re going to
see here is in the left side of my screen I have r running. I’m simply going to press
a keystroke that’s going to pass these commands over to r
and that’s a little bit simpler from my standpoint than actually
typing the things into r and then cutting and pasting
them into a text editor. Okay, so I have the expected
number of rows, which is 392, and I have the expected
number of columns, yay! And here are the column
names that I have. But I don’t, you know, I
don’t really like the way that this looks so I’m going
to say, what am I going to say, options(width=50). Let’s see if that looks
a little bit better. Oh, that wasn’t what
I meant to do. There we go, yeah. So that’s what I want; I wanted
that to look a little bit better when I put options
in front of that. Okay, so that just tells the
terminal how many columns wide the display is so it narrows
the display so it’s easier for me to, I can’t
read across that well. I have terrible eyesight. Okay, so I have attached auto
and that lets me use these names so I can now say
cylinders instead of having to say auto$cylinders. So I’m going to make
that a factor and now what’s the next
thing I’m going to do. Which of the predictors
are quantitative and which are qualitative? So how am I going to know that,
how can I figure that out? Well, one way that I can figure
that out is to say summary(auto) and if I do that what I’m
going to get is summaries of all the items in here. Now, I can tell immediately
from this that name is a qualitative
variable. I can’t really tell
that about origin but I believe origin is
a qualitative variable. Origin is the country
of origin coded as 1, 2, or 3 and I think 1 means North
America, which, come to think of it, isn’t even a country. Two is the European Union,
which is also not a country, and 3 is Asia so it’s
not country of origin, I guess it’s continent of
origin or something like that. Miles per gallon,
that’s numerical. Cylinders we’ve changed to
a factor so I’m going to say that the qualitative,
which are quantitative, are everything except, I guess
everything except cylinders, origin, and name. Okay, so- [ Tapping ] quantitative is mpg,
displacement, horsepower, weight, acceleration, I can’t
seem to spell, and year. And then qualitative is
cylinders, origin, name. And this may not actually
agree with the textbook. The textbook may claim that two
of these are actually numerical. They may claim that cylinders
and origin are numerical because we can place them in
order but we’re not going to. So that’s my opinion
is that they’re this way. Okay, what is the range of
each quantitative predictor? You can answer this by using
the range function and, indeed, we could say range(mpg). So that’s one way to
do it that’s going to give us the range
of miles per gallon. Miles per gallon
ranges from 9 to 46.6. You can also do that by looking
in summary and we can see that the min in summary
is 9 and the max is 46.6. So we can do this
for all of them. I could list them
all out like this and it would be incredibly
boring. Want to do another one? What’s the next one? Displacement. Okay, 68 to 455, which agrees
with what we already know. This is really kind
of boring to do though so let’s do it a different way. Let’s use a function, an
r function called sapply that applies a, that applies a
function to a bunch of vectors and the vectors we’re going to
hand it are auto. And this isn’t going
to work right away so don’t get too excited,
it’s going to fail and I’ll explain
why in a second. Okay, so we’ve gotten a
message here that says that range is not
meaningful for factors. So remember that we
converted cylinder to a factor so that’s a problem. So what we’re going to
do is get rid of factor as a column being
considered here. And I want to get rid
of, did I say cylinders? That’s what I meant
to say anyway. So I want to only
consider column 1 and columns 3 through,
3, 4, 5, 6, 7. So I’m not going
to consider origin or name even though origin
hasn’t been declared as a factor so consequently origin, as far
as r knows origin is a number. Okay, so what I’ve done here is
index, which we’re going to talk about tomorrow, I’ve
indexed this data set so that only a portion of
the data set is being passed to the s apply function
and having done that I get this result,
which is the range is just for the desired items. So isn’t that nice? So now I can do the same exact
thing for the next question so I’m just going to
copy and paste this. What is the mean and
standard deviation? So I can do this with mean and I
get that, I get the mean there. And I can do it for standard
deviation so let me just go over here and change mean
to standard deviation. And that’s really
swell, isn’t it? Now, there is another thing
that I can do, let’s see, yeah. Here’s a real fancy thing that
I can do so this is pretty cool. So I’m going to type
this in separately. This is really long. I’m going to wrap everything
in a data frame and I’m going to transpose my results
then I’m going to give this sapply thing here,
auto, all rows, columns 1 and 3 to 7, and I’m
going to say function. Now, I’m just going to
say placeholder name for the function so
I’m just going to say, bla is my usual placeholder. And now I need to create a
list of the actual functions. So I’m going to say
means, which is mean(bla), sds, which is sd(bla). Did I put a comma, okay, yeah,
I did, all right good, ranges. Quite a bit of what’s hard
about computing is typing and one thing that I’ve
noticed [laughter] God, how many parentheses
am I going to need? See what I, I have a
parenthesis matching thing. Here, so I was waiting until
I had enough parentheses so that a parenthesis would
match that very first one there. Okay, so this should give me
a nice looking little table, and it does with means, SDs,
and ranges for the 7, for the 6, rather, quantitative variables. Okay, and this is not something
that you should’ve been able to figure out on your own. This is part of the r toolbox
that I’m just showing you, like, just junk that you can do
that you might not know about that you might
not think of. R is a very powerful language
and there are a lot of things that we can do quickly that
you might, instead, do slowly. But as you gain experience
you should learn little tricks like this and tricks like
this will become less opaque over time. Okay, now remove the 10th
through 85th observations, what is the range mean and standard deviation
of each predictor? Okay, so what we can do,
we can copy this stuff in the subset that remain. So we can copy all of these
things and we just have to alter them a tiny bit. So I can alter these things
a tiny bit so what I want to do is omit rows
10 through 85. So the easy way to do that is to
put in a row specification here and so that row specification
is going to be, oops, -10:-85. And I should point out that
I did not need to wrap these in c() with parentheses
because they are contiguous. If they were not contiguous,
if there were a comma in them then I would
have needed to do it that way but since there isn’t I
can just do it like this. Okay? And what I’m going to get
I can either use this method, which is three commands, or use this table method,
which is one command. And what you’ll notice, if you
look up here at this version, is that the ranges are
narrower than they were before and that is as it should be
because we’ve taken a subset so obviously a subset is going to have either the same
ranges or narrow ranges. If we had wider ranges there than what we have here
then we would know that something is wrong. And you shoujd always
be checking your work. You should always be looking
for common sense things that tell you that something
is wrong and so I glance at these things from time
to time and try to see if I see anything that
looks suspicious to me and I don’t see anything
that looks suspicious here. Okay, using the full data
set, so we’re going to go back to the full data set here, so it’s a good thing we didn’t
actually eliminate these. It’s a good thing that we
simply said don’t use these. So we still have
the full data set. So using the full data set, investigate predictors
graphically using scatter plots or other tools of your choice. Create some plots highlighting
the relationships among the predictors, comment
on your findings. Okay, so I’m only going to
do a couple of plots here. The main one I’m going
to do is called pairs. This is a scatter plot. This will produce a
scatter plot matrix. We can also say splom, I
think, although I might need to load a particular
library to say splom. And here’s the output of it. Now, this output is ridiculously
compressed so that it will fit onto this screen and I
can compress it even more. But I’m going to expand it so
now we’re only going to be able to see the upper corner of it but you’ll have a much
better view of it. Okay, and I can sort of move
it along the display here. So this is a scatter
plot matrix. So this is a matrix that shows
along the main diagonal every single variable name and
it shows in the spaces where the variable names
meet, it shows a scatter plot of those two variables. So this is a scatter pot of miles per gallon
versus cylinders. And if you’ll notice,
this is a flipped version of that exact same plot. These are the same,
these two are the same, these two are the same and
so on, all the way down. So you really only
need to look at half of the scatter plot matrix and,
in fact, if you learn much more about r you’ll learn of ways
of using this extra space down here for some other stuff. Or, using this space
here for other stuff instead of putting
the names here. So my favorite way to first, to
begin looking at a data set is to do a scatter plot matrix. So this is my favorite graphical
command for starting out. It’s very easy, it’s
very simple, and it tells me right away where there are some
relationships I can see. For example, all these very
linear looking relationships and all these very
curvilinear-looking relationships there. So this is quite valuable and I can tell these are
categorical variables right away because of the lines. There’s another categorical
variable here. Well, that’s a variable that
perhaps should be categorical because there are only, oh no, there are a whole bunch
of values over here. I take it back. No, that’s just, again, showing
that cylinders is categorical. Okay, and then another thing
that I could do, so let me, I’m going to squeeze this
into the upper corner here and I’m just going
to plot two of them. So if it’s too hard to view
this, we can say, for example, plot displacement by horsepower. So this is just two
of the variables and we could get a
more readable plot. So I could do that
with all of them to get better scatter plots. And there were vast number
of plots that I could do but what I can tell
from just looking at the scatter plot
matrix is the following. Oops. The strongest
linear relationship is between displacement; and
usually with statistics we like to weasel word things. So usually I would
say more like, the strongest linear
relationship seems to be between displacement
and horsepower. The next strongest linear
relationship seems to be between displacement and weight. I’m just copying down;
I looked at this earlier because I didn’t want to have, well I do have long
pauses in here. I didn’t want to have too many
long pauses in here and horsepower [ Tapping ] There’s a strong
categorical relationship between displacement and cylinders. There is a
somewhat strong categorical relationship between displacement and
origin. There are strong
curvilinear relationships between miles per
gallon and displacement. Let me just fix that typo here. And what else? Between miles per
gallon and displacement? I already said miles per
gallon and horsepower. Those were those
three at the top, miles per gallon and weight. It’s three at the top on the top
row of the, when we were looking at the scatter plot
matrix made by pairs. The other relationships
are less clear. [ Tapping ] Okay, suppose that we wish to
predict gas mileage on the basis of the other variables. Do your plots suggest any of
the variables might be useful in predicting miles per gallon? Here again is something
where I don’t need to run any r commands,
or I shouldn’t say again. This is the first one where I’m
not going to run any r commands, I’m just going to use my
scatter plot matrix as the basis for answering this
question here. So let me leave a blank line
there with my two hash marks and say weight appears And this, again, I just
went back and looked at my scatter plot matrix. It might actually be
better for you to, if you have a big enough screen
you won’t have any problems doing the scatter plot matrix. My whole screen is
quite a bit larger than what I’m showing
you so it’s pretty easy. You may need to print out or, if you have a low
resolution screen, or do individual scatter plots. It shows that only, let’s see, two 3500-pound plus
cars get better than 20 miles per gallon. Well, no 40 mile per gallon
car, this is really old data. Cars have come a long way
since this was made up. You can’t really
save that much money by getting an old car these
days because newer cars are so much better and they last so much longer unless
you just love old cars. I have had a couple of old cars
just because I love them but I’m under no illusions; it’s not a
way of saving money it’s just, you know, it’s something
that really, you know, some old cars are just
beautiful and interesting. [laughter] but they are
expensive compared to new cars. [ Tapping ] And I’m putting this amount of
detail in here, you don’t have to be too detailed and I do
hate like really, you know, wasteful irrelevant
stuff that people stick in just to try to get points. If I suspect that you’re
sticking in a bunch, that you’re just saying
every possible thing that could be said, hoping that
something turns out to be right, I will take off points for
that and I will tell you in the comments section in
MyCourses that you went too far. So you have to develop an
intuition for being concise and that’s something that
students often don’t have. Okay, so that’s pretty
much the end of it. That’s the whole, that’s all
there is to this assignment. So I’m looking forward to
talking to you again tomorrow and I thank you for
your attention.
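
The steps walked through in this video can be collected into one file, laid out the way the homework instructions describe: exercise number and name at the top, the exercise text as comments, R code flush left, and written answers preceded by two hash marks. What follows is only a rough sketch under stated assumptions, not the file produced in the video. The real auto.csv is not included here, so the sketch first writes a small made-up stand-in (the data frame `fake`, the path `csv_path`, and the helper `describe` are all hypothetical names). It also uses `auto$cylinders` rather than attach, and restricts the scatter plot matrix to the first eight columns, since recent versions of R may read the name column as character, which pairs does not accept.

```r
# exercise 2.4.9
# Mick

# This exercise involves the Auto data set. Make sure the missing
# values have been removed.

# ASSUMPTION: stand-in for the real auto.csv so the sketch runs anywhere.
# Every value is made up; only the column names and the "?" encoding of
# missing values follow the video.
set.seed(780)
n <- 100
fake <- data.frame(
  mpg          = round(runif(n, 9, 46.6), 1),
  cylinders    = sample(c(4, 6, 8), n, replace = TRUE),
  displacement = round(runif(n, 68, 455)),
  horsepower   = as.character(round(runif(n, 46, 230))),
  weight       = round(runif(n, 1600, 5200)),
  acceleration = round(runif(n, 8, 25), 1),
  year         = sample(70:82, n, replace = TRUE),
  origin       = sample(1:3, n, replace = TRUE),
  name         = paste("car", seq_len(n))
)
fake$horsepower[c(5, 50)] <- "?"              # two rows with a missing value
csv_path <- file.path(tempdir(), "auto.csv")  # hypothetical location
write.csv(fake, csv_path, row.names = FALSE, quote = FALSE)

auto <- read.csv(csv_path, header = TRUE, na.strings = "?")
auto <- na.omit(auto)          # drop every row that has a missing value
nrow(auto)
ncol(auto)
names(auto)
options(width = 50)            # narrower console output
auto$cylinders <- as.factor(auto$cylinders)

# Which of the predictors are quantitative and which are qualitative?

summary(auto)

## quantitative: mpg, displacement, horsepower, weight, acceleration, year
## qualitative: cylinders, origin, name

# What is the range of each quantitative predictor?
# What is the mean and standard deviation?

quant <- c(1, 3:7)             # column 1 and columns 3 to 7
sapply(auto[, quant], range)
sapply(auto[, quant], mean)
sapply(auto[, quant], sd)

# The same three summaries as one table, one row per variable.
describe <- function(bla) c(means = mean(bla), sds = sd(bla), ranges = range(bla))
full_table <- data.frame(t(sapply(auto[, quant], describe)))
full_table

# Now remove the 10th through 85th observations. What is the range, mean,
# and standard deviation of each predictor in the subset that remains?

sub_table <- data.frame(t(sapply(auto[-10:-85, quant], describe)))
sub_table

# Using the full data set, investigate the predictors graphically.

# Guarded so the sketch also runs in a non-interactive session.
if (interactive()) {
  pairs(auto[, 1:8])                        # scatter plot matrix without name
  plot(auto$displacement, auto$horsepower)  # one pair, easier to read
}
```

Because the negative row index only skips rows 10 through 85 and does not delete them, `auto` still holds the full data set afterwards, and the ranges in `sub_table` can be checked against `full_table` as the common-sense test described above.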