Why does R not auto-complete data frame variables when subsetting - r

Using the data frame mtcars on RStudio.
Say for example I want to subset mtcars[mtcars$cyl == 4,]
Tabbing after mtcars$ will provide a drop down list of variable names in the data frame.
Tabbing after mtcars[mtcars$ does not return the variable names.
Why does this happen?

it will if you add a space:
mtcars[ mtcars$
otherwise your expecting r to look in something called mtcars[mtcars not mtcars...

I was going to ask the same thing. I disagree with the answer that you are expecting R to look for something called mtcars[mtcars, because you can't even make that without putting it all in quotes anyway, e.g.
test[test <- c(1,3,2) # leaves you stuck with the next line being +
The only way to make such an abomination is:
"test[test" <- c(1,3,2)
And once made you still cant use
test[test[2]
You still need to use quotes
"test[test"[2]
So, as far as I can tell, tabbing after mtcars[mtcars$ failing is either a bug or has some sort of reason behind it. If there is a reason does anyone know what it is?

Related

In R, dataframe[-NULL] returns an empty dataframe

I'm creating some routines in R to ease model creation and to distinguish several groups based on several parameters (ex: original watches VS fakes ones using watches common attributes).
During the proccess, I keep track of the potential excluded lines in a vector (empty at first), and I get ride of them at the end using:
model$var <- raw_data[-line_excluded,]
The problem is that if line_excluded is c() (ndlr no line exlcuded), model$var is an empty dataframe then in that case I want all the lines of the dataframe.
The only solution I have think about is the us of
if (!is.null(line_excluded)){
model$var <- raw_data[-line_excluded,]}
But that's not really pretty, and I have several tracking variables as line_excluded which need that.
Thanks for the help
You can make it in another way using setdiff(), which can deal with empty line_excluded i.e.,
model$var <- raw_data[setdiff(seq(nrow(raw_data)),line_excluded),]
You can also try:
model$var <- raw_data[!(1:nrow(raw_data) %in% line_excluded),]
This is similar to what #THomasIsCoding suggested, you look for the row numbers that are not in your line_excluded..

Anonymous function in lapply

I am reading Wickham's Advanced R book. This question is relating to solving Question 5 in chapter 12 - Functionals. The exercise asks us to:
Implement a version of lapply() that supplies FUN with both the name and value of each component.
Now, when I run below code, I get expected answer for one column.
c(class(iris[1]),names(iris[1]))
Output is:
"data.frame" "Sepal.Length"
Building upon above code, here's what I did:
lapply(iris,function(x){c(class(x),names(x))})
However, I only get the output from class(x) and not from names(x). Why is this the case?
I also tried paste() to see whether it works.
lapply(iris,function(x){paste(class(x),names(x),sep = " ")})
I only get class(x) in the output. I don't see names(x) being returned.
Why is this the case? Also, how do I fix it?
Can someone please help me?
Instead of going over the data frame directly you could switch things around and have lapply go over a vector of the column names,
data(iris)
lapply(colnames(iris), function(x) c(class(iris[[x]]), x))
or over an index for the columns, referencing the data frame.
lapply(1:ncol(iris), function(x) c(class(iris[[x]]), names(iris[x])))
Notice the use of both single and double square brackets.
iris[[n]] references the values of the nth object in the list iris (a data frame is just a particular kind of list), stripping all attributes, making something like mean(iris[[1]]) possible.
iris[n] references the nth object itself, all attributes intact, making something like names(iris[1]) possible.

ifelse with no else

Basically in SAS I could just do an if statement without an else. For example:
if species='setosa' then species='regular';
there is no need for else.
How to do it in R? This is my script below which does not work:
attach(iris)
iris2 <- iris
iris2$Species <- ifelse(iris2$Species=='setosa',iris2$Species <- 'regular',iris2$Species <- iris2$Species)
table(iris2$Species)
A couple options. The best is to just do the replacement, this is nice and clean:
iris2$Species[iris2$Species == 'setosa'] <- 'regular'
ifelse returns a vector, so the way to use it in cases like this is to replace the column with a new one created by ifelse. Don't do assignment inside ifelse!
iris2$Species <- ifelse(iris2$Species=='setosa', 'regular', iris2$Species)
But there's rarely need to use ifelse if the else is "stay the same" - the direct replacement of the subset (the first line of code in this answer) is better.
New factor levels
Okay, so the code posted above doesn't actually work - this is because iris$Species is a factor (categorical) variable, and 'regular' isn't one of the categories. The easiest way to deal with this is to coerce the variable to character before editing:
iris2$Species <- as.character(iris2$Species)
iris2$Species[iris2$Species == 'setosa'] <- 'regular'
Other methods work as well, (editing the factor levels directly or re-factoring and specifying new labels), but that's not the focus of your question so I'll consider it out of scope for the answer.
Also, as I said in the comments, don't use attach. If you're not careful with it you can end up with your columns out of sync creating annoying bugs. (In the code you post, you're not using it anyway - the rest runs just as well if you delete the attach line.)
I would recommend looking at the base R documentation for help with this. You can find the documentation of if, else, and ifelse here. For use of if and else, refer to ?Control.
Regular control flow in code is done with the basic if and else statements, as in most languages. ifelse() is used for vectorized operations--it will return the same shape as your vector based on the test. Regular if and else expressions do not necessarily have those properties.

Strangeness with filtering in R and showing summary of filtered data

I have a data frame loaded using the CSV Library in R, like
mySheet <- read.csv("Table.csv", sep=";")
I now can print a summary on that mySheet object
summary(mySheet)
and it will show me a summary for each column, for example, one column named Diagnose has the unique values RCM, UCM, HCM and it shows the number of occurences of each of these values.
I now filter by a diagnose, like
subSheet <- mySheet[mySheet$Diagnose=='UCM',]
which seems to be working, when I just type subSheet in the console it will print only the rows where the value has been matched with 'UCM'
However, if I do a summary on that subSheet, like
summary(subSheet)
it still 'knows' about the other two possibilities RCM and HCM and prints those having a value of 0. However, I expected that the new created object will NOT know about the possible values of the original mySheet I initially loaded.
Is there any way to get rid of those other possible values after filtering? I also tried subset but this one just seems to be some kind of shortcut to '[' for the interactive mode... I also tried DROP=TRUE as option, but this one didn't change the game.
Totally mind squeezing :D Any help is highly appreciated!
What you are dealing with here are factors from reading the csv file. You can get subSheet to forget the missing factors with
subSheet$Diagnose <- droplevels(subSheet$Diagnose)
or
subSheet$Diagnose <- subSheet$Diagnose[ , drop=TRUE]
just before you do summary(subSheet).
Personally I dislike factors, as they cause me too many problems, and I only convert strings to factors when I really need to. So I would have started with something like
mySheet <- read.csv("Table.csv", sep=";", stringsAsFactors=FALSE)

Do you use attach() or call variables by name or slicing?

Many intro R books and guides start off with the practice of attaching a data.frame so that you can call the variables by name. I have always found it favorable to call variables with $ notation or square bracket slicing [,2]. That way I can use multiple data.frames without confusing them and/or use iteration to successively call columns of interest. I noticed Google recently posted coding guidelines for R which included the line
1) attach: avoid using it
How do people feel about this practice?
I never use attach. with and within are your friends.
Example code:
> N <- 3
> df <- data.frame(x1=rnorm(N),x2=runif(N))
> df$y <- with(df,{
x1+x2
})
> df
x1 x2 y
1 -0.8943125 0.24298534 -0.6513271
2 -0.9384312 0.01460008 -0.9238312
3 -0.7159518 0.34618060 -0.3697712
>
> df <- within(df,{
x1.sq <- x1^2
x2.sq <- x2^2
y <- x1.sq+x2.sq
x1 <- x2 <- NULL
})
> df
y x2.sq x1.sq
1 0.8588367 0.0590418774 0.7997948
2 0.8808663 0.0002131623 0.8806532
3 0.6324280 0.1198410071 0.5125870
Edit: hadley mentions transform in the comments. here is some code:
> transform(df, xtot=x1.sq+x2.sq, y=NULL)
x2.sq x1.sq xtot
1 0.41557079 0.021393571 0.43696436
2 0.57716487 0.266325959 0.84349083
3 0.04935442 0.004226069 0.05358049
I much prefer to use with to obtain the equivalent of attach on a single command:
with(someDataFrame, someFunction(...))
This also leads naturally to a form where subset is the first argument:
with(subset(someDataFrame, someVar > someValue),
someFunction(...))
which makes it pretty clear that we operate on a selection of the data. And while many modelling function have both data and subset arguments, the use above is more consistent as it also applies to those functions who do not have data and subset arguments.
The main problem with attach is that it can result in unwanted behaviour. Suppose you have an object with name xyz in your workspace. Now you attach dataframe abc which has a column named xyz. If your code reference to xyz, can you guarantee that is references to the object or the dataframe column? If you don't use attach then it is easy. just xyz refers to the object. abc$xyz refers to the column of the dataframe.
One of the main reasons that attach is used frequently in textbooks is that it shortens the code.
"Attach" is an evil temptation. The only place where it works well is in the classroom setting where one is given a single dataframe and expected to write lines of code to do the analysis on that one dataframe. The user is unlikely to ever use that data again once the assignement is done and handed in.
However, in the real world, more data frames can be added to the collection of data in a particular project. Furthermore one often copies and pastes blocks of code to be used for something similar. Often one is borrowing from something one did a few months ago and cannot remember the nuances of what was being called from where. In these circumstances one gets drowned by the previous use of "attach."
Just like Leoni said, with and within are perfect substitutes for attach, but I wouldn't completely dismiss it. I use it sometimes, when I'm working directly at the R prompt and want to test some commands before writing them on a script. Especially when testing multiple commands, attach can be a more interesting, convenient and even harmless alternative to with and within, since after you run attach, the command prompt is clear for you to write inputs and see outputs.
Just make sure to detach your data after you're done!
I prefer not to use attach(), as it is far too easy to run a batch of code several times each time calling attach(). The data frame is added to the search path each time, extending it unnecessarily. Of course, good programming practice is to also detach() at the end of the block of code, but that is often forgotten.
Instead, I use xxx$y or xxx[,"y"]. It's more transparent.
Another possibility is to use the data argument available in many functions which allows individual variables to be referenced within the data frame. e.g., lm(z ~ y, data=xxx).
While I, too, prefer not to use attach(), it does have its place when you need to persist an object (in this case, a data.frame) through the life of your program when you have several functions using it. Instead of passing the object into every R function that uses it, I think it is more convenient to keep it in one place and call its elements as needed.
That said, I would only use it if I know how much memory I have available and only if I make sure that I detach() this data.frame once it is out of scope.
Am I making sense?

Resources