Data.frames in R: name autocompletion? - r

Sorry if this is trivial. I am seeing the following behaviour in R:
> myDF <- data.frame(Score=5, scoreScaled=1)
> myDF$score ## forgot that the Score variable was capitalized
[1] 1
Expected result: returns NULL (even better: throws error).
I have searched for this, but was unable to find any discussion of this behaviour. Is anyone able to provide any references on this, the rationale on why this is done and if there is any way to prevent this? In general I would love a version of R that is a little stricter with its variables, but it seems that will never happen...

The $ operator does partial matching on column names: it needs only a prefix that uniquely identifies one column of the data frame. So for example:
> d <- data.frame(score=1, scotch=2)
> d$sco
NULL
> d$scor
[1] 1
A way of avoiding this behavior is to use the [[]] operator, which matches names exactly by default and will behave like so:
> d <- data.frame(score=1, scotch=2)
> d[['scor']]
NULL
> d[['score']]
[1] 1
I hope that was helpful.
Cheers!
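To make the link between the two operators concrete, here is a small sketch using the exact argument of [[ on the same toy data frame. With exact = FALSE, [[ does the same kind of prefix matching that $ does; the variable names partial, ambiguous and strict are only for illustration.

```r
d <- data.frame(score = 1, scotch = 2)

# With exact = FALSE, a unique prefix is enough (this is what `$` does)
partial <- d[["scor", exact = FALSE]]

# A prefix shared by several columns matches nothing
ambiguous <- d[["sco", exact = FALSE]]

# The default (exact = TRUE) requires the full column name
strict <- d[["scor"]]

print(partial)    # [1] 1
print(ambiguous)  # NULL
print(strict)     # NULL
```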

Using [,""] instead of $ will throw an error if the name is not found.
myDF$score
#[1] 1
myDF[,"score"]
#Error in `[.data.frame`(myDF, , "score") : undefined columns selected
myDF[,"Score"]
#[1] 5
myDF[,"score", drop=TRUE] #More explicit and will also work with tibble::as_tibble
#Error in `[.data.frame`(myDF, , "score", drop = TRUE) :
# undefined columns selected
myDF[,"Score", drop=TRUE]
#[1] 5
as.data.frame(myDF)[,"score"] #Will work also with tibble::as_tibble and data.table::as.data.table
#Error in `[.data.frame`(as.data.frame(myDF), , "score") :
# undefined columns selected
as.data.frame(myDF)[,"Score"]
#[1] 5
unlist(myDF[,"score"], use.names = FALSE) #Will work also with tibble::as_tibble and data.table::as.data.table
#Error in `[.data.frame`(myDF, , "score") : undefined columns selected
unlist(myDF[,"Score"], use.names = FALSE)
#[1] 5
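If you would rather have one strict accessor than remember which indexing form throws, a small wrapper is another option. This is only a sketch: get_col is a hypothetical helper name, not something from base R or any package, and the error message text is made up.

```r
# Hypothetical helper: return a column by its exact name, or fail loudly
get_col <- function(df, name) {
  if (!name %in% names(df)) {
    stop("no column named '", name, "'", call. = FALSE)
  }
  df[[name]]
}

myDF <- data.frame(Score = 5, scoreScaled = 1)

ok <- get_col(myDF, "Score")

# Capture the error message instead of aborting the script
res <- tryCatch(get_col(myDF, "score"), error = function(e) conditionMessage(e))

print(ok)   # [1] 5
print(res)  # [1] "no column named 'score'"
```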

Related

R base::options with variables

I have noticed some behaviour in R's base::options() that I am unable to fully understand.
the following is fine:
> vals_vector
[1] "temp" "hum" "co2" "voc" "pm1" "pm2_5" "pm10"
> options("hum" = TRUE)
> if (getOption("hum")) {
+ print("stuff")
+ }
[1] "stuff"
And this is also fine:
> options(TEMP_ENABLE = "temp" %in% vals_vector)
> getOption("TEMP_ENABLE")
[1] TRUE
However the following does not work.
> options(as.character(vals_vector[1]) = TRUE)
Error: unexpected '=' in "options(as.character(vals_vector[1]) ="
> as.character(vals_vector[1])
[1] "temp"
> "temp"
[1] "temp"
Makes no sense. You can see, I have evaluated the argument and it's exactly the same in both cases. Just in one I've used the variable. My intention was to use a loop to set an option for each variable present in a data set. Why doesn't this work as expected?
You can't use function calls as stand-ins for argument names in R. This has nothing to do with options; it's just how the R parser works. Take the following example, using the function data.frame.
data.frame(A_B = 2)
#>   A_B
#> 1   2
Suppose we wanted to generate the name A_B programmatically:
paste("A", "B", sep = "_")
#> [1] "A_B"
Looks good. But if we try to use this function call with the intention that its output is interpreted as an argument name, the parser will simply tell us we have a syntax error:
data.frame(paste("A", "B", sep = "_") = 2)
#> Error: unexpected '=' in "data.frame(paste("A", "B", sep = "_") ="
There are ways round this - with most base R functions we would create a named list programmatically and pass that as an argument list using do.call:
mylist <- list(TRUE)
names(mylist) <- vals_vector[1]
do.call(options, mylist)
getOption("temp")
#> [1] TRUE
However, if you read the docs for options, it says
Options can also be passed by giving a single unnamed argument which is a named list.
So a more concise idiom would be:
options(setNames(list(TRUE), vals_vector[1]))
getOption("temp")
#> [1] TRUE
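Since the original goal was to set one option per variable, the same idiom extends to the whole vector without a loop. A minimal sketch, assuming every option should simply be set to TRUE; flags, old and all_set are illustrative names.

```r
vals_vector <- c("temp", "hum", "co2", "voc", "pm1", "pm2_5", "pm10")

# Build a named list: one TRUE per variable name
flags <- setNames(rep(list(TRUE), length(vals_vector)), vals_vector)

# options() accepts a single named list and returns the previous values
old <- options(flags)

# Check that every option is now set
all_set <- all(vapply(vals_vector, function(nm) isTRUE(getOption(nm)), logical(1)))
print(all_set)

# Restore the previous state (here: unset the options again)
options(old)
```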

R error: "duplicate 'row.names' are not allowed"

I got the error when I wanted to set the first column as the row names:
dt <- fread('../data/data_logTMP.csv', header = T)
rownames(dt) <- dt$GENE
I used duplicated() to check the values:
> which(duplicated(dt$GENE) == TRUE)
[1] 20209 21919
Therefore, I compared these values:
> dt$GENE[20209] == dt$GENE[21919]
[1] FALSE
> dt$GENE[20209]
[1] "1-Mar"
> dt$GENE[21919]
[1] "2-Mar"
Why were these two values recognized as duplicated? And how can I fix this problem?
Because you are using fread to read the file, the default class of your object dt will be data.table. By design, data.table does not support row names. Therefore you need to pass an additional argument to fread, as shown below, so that the object you read in is not a data.table.
data.table::fread(input = "file name", sep = ",", header = TRUE, data.table = FALSE)
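On the "why were these two flagged" part of the question: duplicated() marks an element when it repeats an earlier element, so the two reported rows do not have to equal each other; each one presumably repeats some row further up. A sketch on made-up data (the genes vector below is invented, not taken from the asker's file):

```r
# Toy gene column with two repeated labels
genes <- c("1-Mar", "2-Mar", "TP53", "1-Mar", "2-Mar")

# duplicated() flags only the later occurrences
dup_idx <- which(duplicated(genes))

# The two flagged entries differ from each other
same <- genes[4] == genes[5]

# To see every copy, including the first occurrences:
all_copies <- which(genes %in% genes[duplicated(genes)])

print(dup_idx)     # [1] 4 5
print(same)        # [1] FALSE
print(all_copies)  # [1] 1 2 4 5
```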

Why does the function work after doing fix() in R

Here's what had happened:
> NA.of.df = which(rowSums(is.na(df)) == ncol(df))
> NA.of.df
named integer(0)
> fix(df) # i want to see what's in here -- nothing wrong
> NA.of.df # so i run it again
1 3 5 7 9 # it works!
Why would this happen?
A reproducible example (although dput() does not show much of a data structure here) is the following:
> dput(NA.of.df)
structure(integer(0), .Names = character(0))
and NA.of.df is just the code for finding rows with all NAs, obtained from "Remove rows in R matrix where all data is NA" (i.e. NA.of.df = which(rowSums(is.na(df)) == ncol(df))).
It could be an issue with quotes around the NA, which would cause is.na to not pick up those elements:
is.na(c(NA, "NA"))
#[1] TRUE FALSE
After doing the fix, it may have dropped the quotes, so that the values are evaluated as real NAs.
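The effect can be reproduced without fix() at all. The sketch below uses an invented data frame whose "missing" cells are the string "NA"; it only illustrates the string versus real NA difference and does not claim to show what fix() does internally.

```r
# Toy data: rows 1 and 3 contain the *string* "NA", not a missing value
df <- data.frame(a = c("NA", "x", "NA"),
                 b = c("NA", "y", "NA"),
                 stringsAsFactors = FALSE)

# No row counts as all-NA, because is.na("NA") is FALSE
before <- which(rowSums(is.na(df)) == ncol(df))

# Replace the strings with real missing values
df[df == "NA"] <- NA

# Now rows 1 and 3 are found
after <- which(rowSums(is.na(df)) == ncol(df))

print(length(before))  # [1] 0
print(unname(after))   # [1] 1 3
```

If the data came from a file, passing the na.strings argument to the reading function is usually the cleaner place to handle this.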

r stop guessing names when root is similar

Is there an option in R that prevents it from returning values from field names with the same beginning if the one you asked for does not exist? This is causing me a fair amount of problems as my fields may or may not be present, and they have similar root names.
d <- data.frame(areallylongname = -99, y = 2, z = 0)
# How do I stop this returning a value
d$a
#[1] -99
# it should return NULL like this
d$jjj
# NULL
You can switch to bracket notation, which requires exact column names:
> d['a']
Error in `[.data.frame`(d, "a") : undefined columns selected
> d['y']
  y
1 2
If you want to avoid partial matching and get an error instead, the following could work.
However, this will turn all other warnings into errors as well.
options(warnPartialMatchDollar = TRUE, warn = 2)
# test
d$a
Error in `$.data.frame`(d, a) :
(converted from warning) Partial match of 'a' to 'areallylongname' in data frame
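If turning every warning into an error is too broad, one alternative is to leave warn alone and catch the warning only around the access you care about. A minimal sketch using tryCatch; old and msg are illustrative names.

```r
d <- data.frame(areallylongname = -99, y = 2, z = 0)

# Enable the partial-match warning, remembering the previous setting
old <- options(warnPartialMatchDollar = TRUE)

# Intercept the warning for this one access and keep its message
msg <- tryCatch(d$a, warning = function(w) conditionMessage(w))

# Restore the previous option value
options(old)

print(msg)
```

Inside the handler you could call stop() instead of returning the message, which gives an error for this access only.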

dplyr invalid subscript type list

I have run into an error in a script I am writing that only occurs when I have dplyr running. I first encountered it when I found a function from dplyr that I wanted to use, after which I installed and ran the package. Here is an example of my error:
First I read in a table from excel that has column values I am going to use as indices in it:
library(readxl)
examplelist <- read_excel("example.xlsx")
The contents of the file are:
1 2 3 4
1 1 4 1
2 3 2 1
4 4 1 4
And then I build a data frame:
testdf = data.frame(1:12, 13:24, 25:36, 37:48)
And then I have a loop that calls a function that uses the values of examplelist as indices.
testfun <- function(df, a, b, c, d){
  value1 <- df[[a]]
  value2 <- df[[b]]
  value3 <- df[[c]]
  value4 <- df[[d]]
}
for (i in 1:nrow(examplelist)){
  testfun(testdf, examplelist[i, 1], examplelist[i, 2],
          examplelist[i, 3], examplelist[i, 4])
}
When I run this script without dplyr, everything is fine, but with dplyr it gives me the error:
Error in .subset2(x, i, exact = exact) : invalid subscript type 'list'
Why would having dplyr cause this error, and how can I fix it?
I think MKR's answer is a valid solution, I will elaborate a bit more on the why with some alternatives.
The readxl library is part of the tidyverse and returns a tibble (tbl_df) with the function read_excel. This is a special type of data frame and there are differences from base behaviour, notably printing and subsetting (read here).
Tibbles also clearly delineate [ and [[: [ always returns another tibble, [[ always returns a vector. No more drop = FALSE
So you can see now that your examplelist[i, n] will return a tibble and not a vector of length 1, which is why using as.numeric works.
library(readxl)
examplelist <- read_excel("example.xlsx")
class(examplelist[1, 1])
# [1] "tbl_df" "tbl" "data.frame"
class(examplelist[[1, 1]])
# [1] "numeric"
class(as.numeric(examplelist[1, 1]))
# [1] "numeric"
class(as.data.frame(examplelist)[1, 1])
# [1] "numeric"
My workflow tends towards using the tidyverse so you could use [[ to subset or as.data.frame if you don't want tibbles.
I can see this issue even without loading dplyr. It seems the culprit is the use of examplelist items. If you print the value of examplelist[1, 2], it is a data.frame of dimension 1x1, but the values of a, b, c and d are expected to be simple numbers. Hence, if you convert examplelist[i, 1] etc. using as.numeric, the error is avoided. Change the call of testfun to:
testfun(testdf, as.numeric(examplelist[i, 1]), as.numeric(examplelist[i, 2]),
        as.numeric(examplelist[i, 3]), as.numeric(examplelist[i, 4]))
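The same error can be reproduced in base R alone, which may help when tibble or readxl is not at hand. The sketch below mimics tibble's [ behaviour with drop = FALSE on an ordinary data frame; idx is a made-up stand-in for examplelist, and this is an approximation of the situation, not the asker's actual data.

```r
testdf <- data.frame(1:12, 13:24, 25:36, 37:48)

# Stand-in for the index table read from Excel
idx <- data.frame(a = c(1, 2, 4))

# drop = FALSE keeps a 1x1 data frame, similar to what `[` returns on a tibble
cell_df <- idx[1, 1, drop = FALSE]

# A data frame is a list, so using it as a `[[` subscript fails
err <- tryCatch(testdf[[cell_df]], error = function(e) conditionMessage(e))

# `[[` with row and column extracts the bare value, which works as an index
cell_num <- idx[[1, 1]]
val <- testdf[[cell_num]]

print(class(cell_df))  # [1] "data.frame"
print(err)
print(val)
```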
