R error subsetting data.frame when using [[ - r

For a data frame (data) that has a column named sulfate:
What is the difference between data[["sulfate"]] and data[[colnames(data)=="sulfate"]]?
data["sulfate"] and data[colnames(data)=="sulfate"] yield the same result, and both have class data.frame. But data[["sulfate"]] returns a numeric vector in my case, whereas data[[colnames(data)=="sulfate"]] throws an error. Why?

First, here are some other ways to extract the column:
data$sulfate
getElement(data, "sulfate")
Next a short explanation why data[[colnames(data)=="sulfate"]] does not work.
1) The expression within [[ is colnames(data)=="sulfate" which is a logical vector.
2) The [[ function accepts a single index (because it is used to select a single element) or a numeric vector, in which case it indexes recursively into a nested list. For example:
a <- list(list(2,3), list(3,4))
a[[c(2,1)]]
# [1] 3
The help page help(`[[`) will have more information about how it works.
3) The data.frame object in R is a list, you can confirm this by doing is.list(data). So the function [[ works the same way.
Now, what happens when you pass it a logical vector instead of a single number? It gets treated as a numeric vector of 0s and 1s. For example, inspect as.numeric(colnames(data)=="sulfate").
Then [[ encounters a 0 entry, and subsetting with 0 throws an error saying that you are attempting to select less than one element:
data[[0]]
Notice that the error is the same as the one from data[[colnames(data)=="sulfate"]].
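If the column really has to be located by comparing names, converting the logical vector to a position with which() gives [[ the single index it expects. A minimal sketch, assuming a made-up two-column data frame (the nitrate column and all values are illustrative, not from the question):

```r
# Hypothetical data; only the column name "sulfate" comes from the question
data <- data.frame(nitrate = c(0.5, 1.2, 0.9), sulfate = c(3.1, 2.7, 4.4))

is_sulfate <- colnames(data) == "sulfate"   # logical vector, one entry per column

# which() turns the logical vector into the position of the TRUE entry,
# so [[ receives a single positive index
by_position <- data[[which(is_sulfate)]]
by_name     <- data[["sulfate"]]

# Passing the logical vector itself to [[ fails; capture the error instead of stopping
attempt <- tryCatch(data[[is_sulfate]], error = function(e) e)
```

Here by_position and by_name hold the same numeric vector, while attempt holds the captured error condition.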

Related

When creating new data.frame column, what is the difference between `df$NewCol=` and `df[,"NewCol"]=` methods?

Using the default "iris" DataFrame in R, how come when creating a new column "NewCol"
iris[,'NewCol'] = as.POSIXlt(Sys.Date()) # throws Warning
BUT
iris$NewCol = as.POSIXlt(Sys.Date()) # is correct
This issue doesn't exist when assigning Primitive types like chr, int, float, ....
First, notice, as @sindri_baldur pointed out, that as.POSIXlt returns a list.
From R help ($<-.data.frame):
There is no data.frame method for $, so x$name uses the default method which treats x as a list (with partial matching of column names if the match is unique, see Extract). The replacement method (for $) checks value for the correct number of rows, and replicates it if necessary.
So, if you try iris[, "NewCol"] <- as.POSIXlt(Sys.Date()), you get a warning because you are trying to assign a list object to a single column. Only the first element of the list is used.
Again, from R help:
"For [ the replacement value can be a list: each element of the list is used to replace (part of) one column, recycling the list as necessary".
In your case only one column is specified, meaning only the first element of the as.POSIXlt result (a list) will be used. You are warned about that.
With the $ syntax, the iris data.frame is treated as a list, and the result of as.POSIXlt (again a list) is appended to it. The result is still a data.frame, but take a look at the type of NewCol2: it's a list.
iris[, "NewCol"] <- as.POSIXlt(Sys.Date()) # warning
iris$NewCol2 <- as.POSIXlt(Sys.Date())
typeof(iris$NewCol) # double
typeof(iris$NewCol2) # list
Suggestion: maybe you wanted to use as.POSIXct()?
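The difference between the two date-time classes can be checked directly. A minimal sketch, assuming a small made-up data frame df rather than iris; the column names when and when2 are illustrative:

```r
today <- Sys.Date()

# POSIXlt is stored as a list of components; POSIXct is a single number
lt_is_list <- is.list(as.POSIXlt(today))   # TRUE
ct_is_list <- is.list(as.POSIXct(today))   # FALSE

df <- data.frame(x = 1:3)
df[, "when"] <- as.POSIXct(today)   # length-one value, recycled to 3 rows
df$when2     <- as.POSIXct(today)

# With POSIXct, both assignment forms should produce the same kind of column
same_class <- identical(class(df$when), class(df$when2))
```

Because a POSIXct value is an atomic vector, there is no list for [ to split across columns, which is why both forms are expected to agree here.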

mpfr'izing a data.frame in R

I'm trying to convert a data.frame in R to mpfr format by multiplying by an mpfr unit constant. This works, as demonstrated in the code below, when applied to a column (result variable 'mpfr_col'), but for both approaches shown for working with a data.frame, it does not. The relevant errors for each attempt are listed in comment.
library(Rmpfr)
prec <- 256
m1 <- mpfr(1,prec)
col_build <- 1:10
test_df <- data.frame(col_build, col_build, col_build)
mpfr_col <- m1*(col_build)
mpfr_df <- m1*test_df # (list) object cannot be coerced to type 'double'
for(colnum in 1:length(colnames(test_df))){
test_df[,colnum] <- m1*test_df[,colnum] # attempt to replicate an object of type 'S4'
}
Answer:
Use [[colnum]] to access the columns instead of [,colnum]:
for(colnum in 1:length(colnames(test_df))){
test_df[[colnum]] <- m1*test_df[[colnum]]
}
(Note: the print method of data.frame will fail, but the 'mpfr-izing' works. You can print the result either by printing the columns individually or by using as_tibble(test_df).)
Explanation
The original fails because the [,colnum] assignment doesn't coerce the argument, I think. Using [[ returns an element (aka a column) of the list (aka the data.frame).
See this bit of Hadley Wickham's Advanced R book:
[ selects sub-lists. It always returns a list; if you use it with a
single positive integer, it returns a list of length one. [[ selects
an element within a list. $ is a convenient shorthand: x$y is
equivalent to x[["y"]].
And the help from Extract.data.frame {base}:
When [ and [[ are used to add or replace a whole column, no coercion
takes place but value will be replicated (by calling the generic
function rep) to the right length if an exact number of repeats can be
used.
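The distinction quoted above can be seen on any data frame. A minimal base-R sketch with made-up columns (no Rmpfr needed):

```r
df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))

one_col_df <- df["a"]     # [ keeps the list structure: a one-column data frame
column     <- df[["a"]]   # [[ extracts the element itself: an integer vector
shorthand  <- df$a        # $ is shorthand for [["a"]]

# Replacing a whole column through [[ stores the new value as given
df[["b"]] <- as.character(df[["b"]])
```

After the last line, column b holds character values, since the replacement is stored without being converted back to the old column type.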

r - Check if any value in a data.frame column is null

I am trying to see if the data.frame column has any null values to move to the next loop. I am currently using the code below:
if (is.na(df[,relevant_column]) == TRUE ){next}
which spits out the warning:
In if (is.na(df_cell_client[, numerator]) == TRUE) { ... : the
condition has length > 1 and only the first element will be used
How do I check if any of the values are null and not just the first row?
(I assume by "null" you really mean NA, since a data.frame cannot contain NULL in that sense.)
Your problem is that if expects a single logical, but is.na(df[,relevant_column]) is returning a vector of logicals. any reduces a vector of logicals into a single global "or" of the vector.
Try:
if (any(is.na(df[,relevant_column]))) {next}
BTW: == TRUE is unnecessary. Keep it if you feel you want the clarity in your code, but I think you'll find most R code does not use that. (I've also seen something == FALSE, equally "odd/wrong", where ! something should work ... but I digress.)
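A minimal sketch of the check inside a loop, assuming a made-up data frame and a hypothetical complete_cols vector for collecting results. It uses anyNA(), a base R shortcut for any(is.na(x)):

```r
df <- data.frame(a = c(1, NA, 3), b = c(4, 5, 6))

complete_cols <- character(0)
for (col in names(df)) {
  # anyNA() returns a single TRUE/FALSE, so it is safe inside if()
  if (anyNA(df[[col]])) next
  complete_cols <- c(complete_cols, col)
}
```

Only column b survives the check here. Note that since R 4.2.0 an if() condition of length greater than one is an error rather than a warning, so the reduction to a single logical is required, not just tidy.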

if-else vs ifelse with lists

Why do the if-else construct and the function ifelse() behave differently?
mylist <- list(list(a=1, b=2), list(x=10, y=20))
l1 <- ifelse(sum(sapply(mylist, class) != "list")==0, mylist, list(mylist))
l2 <-
if(sum(sapply(mylist, class) != "list") == 0){ # T: all list elements are lists
mylist
} else {
list(mylist)
}
all.equal(l1,l2)
# [1] "Length mismatch: comparison on first 1 components"
From the ifelse documentation:
‘ifelse’ returns a value with the same shape as ‘test’ which is
filled with elements selected from either ‘yes’ or ‘no’ depending
on whether the element of ‘test’ is ‘TRUE’ or ‘FALSE’.
Your test has length one, so the output is truncated to length 1.
You can also see this illustrated with a simpler example:
ifelse(TRUE, c(1, 3), 7)
# [1] 1
if (cond) { yes } else { no } is a control structure. It was designed to effect programming forks rather than to process a sequence. I think many people come from SPSS or SAS, whose authors chose "IF" to implement conditional assignment within their DATA or TRANSFORM functions, and so they expect R to behave the same. SAS and SPSS both have implicit FOR-loops in their data steps, whereas R came from a programming tradition. R's implicit for-loops are built into the many vectorized functions (including ifelse). The lapply/sapply functions are the more R-savvy way to implement most sequential processing, although they don't succeed at doing lagged variable access, especially if there are any randomizing features whose "effects" get cumulatively handled.
ifelse takes an expression that builds a vector of logical values as its first argument. The second and third arguments should be vectors of the same length as the test (shorter ones are recycled), and for each element of the test the value is taken from either the second or the third argument. This is similar to the SPSS/SAS IF commands, which have an implicit by-row mode of operation.
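The contrast can be sketched side by side. The vector x and the variable names below are made up for illustration:

```r
x <- c(-2, 0, 3)

# ifelse() is vectorized: one result per element of the test
labels <- ifelse(x < 0, "negative", "non-negative")

# if/else is control flow: one branch is evaluated and returned whole
whole <- if (length(x) > 2) c(1, 3) else 7

# With a length-one test, ifelse() keeps only the first element of `yes`
first_only <- ifelse(length(x) > 2, c(1, 3), 7)
```

Here labels has three elements, whole is the full vector c(1, 3), and first_only is just 1.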
For some reason this is marked as a duplicate of "Why does ifelse() return single-value output?"
A workaround for that question is:
a=3
yo <- ifelse(a==1, 1, list(c(1,2)))
yo[[1]]

comparing two integers in R: "longer object length not multiple of shorter object length" ddply

I'm getting a "longer object length not multiple of shorter object length" warning in R when comparing two integers to subset a dataframe in the midst of a user-defined function.
The user defined function just returns the median of a subset of integers taken from a dataframe:
medianfuncEDB <- function(s){
return(median((subset(EDB,as.integer(validSession) == as.integer(s)))$absStudentDeviation))
}
(I did not originally have the as.integer coercions in there. I put them there to debug and test, and I'm still getting an error.)
The specific error I'm getting is:
In as.integer(validSession) == as.integer(s) :
longer object length is not a multiple of shorter object length
I get this warning over 50 times when calling:
mediandf <- ddply(mediandf,.(validSession),
transform,
grossMed2 = medianfuncEDB(as.integer(validSession)))
The goal is to calculate the median of $validSession associated with the given validSession in the large dataframe EDB and attach that vector to mediandf.
I have actually double-checked that all values for validSession in both the mediandf dataframe and the EDB dataframe are integers by subsetting with is.integer(validSession).
Furthermore, it appears that the command actually does what I intend: I get a new column in my dataframe (with values I have not verified). But I want to understand the warning. If medianfuncEDB is being called with an integer as its input, why am I getting "longer object length is not a multiple of shorter object length" when s == validSession is evaluated?
Note that simple function calls, like medianfuncEDB(5) work without any problems, so why do I get warnings when using ddply?
EDIT: I found the problem with the help of Joran's comment. I did not know that transform fed entire vectors into the function. Using validSession[1] instead gave no warnings.
The ddply function already subsets your data frame by validSession. Hence transform is only fed a data frame with all the rows corresponding to a particular validSession.
That is, transform is already being fed subset(mediandf,validSession==s) for each s in unique(mediandf$validSession).
Since you don't have to do any subsetting (ddply takes care of that), all you need to do is:
ddply(mediandf,.(validSession),transform,grossMed2=median(absStudentDeviation))
And then you'll get mediandf back out with a new column grossMed2 with the value you want (so it will be the same value within each unique validSession).
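For comparison, the same per-group median can be computed without plyr. This is a sketch in base R using ave(), which is swapped in here for ddply/transform; the column names come from the question, but the data values are made up:

```r
mediandf <- data.frame(
  validSession        = c(1L, 1L, 2L, 2L, 2L),
  absStudentDeviation = c(1, 3, 2, 4, 10)
)

# ave() applies FUN within each group and returns a vector aligned with the rows
mediandf$grossMed2 <- ave(mediandf$absStudentDeviation,
                          mediandf$validSession,
                          FUN = median)
```

Every row with validSession 1 gets the median of that group's deviations, and likewise for validSession 2, so no manual subsetting inside a helper function is needed.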
