as.formula does not like equivalence '=' (object not found)

consider the following example
df1 <- data.frame(a = c(1, 2, 3), b = c(2, 4, 6))
transform(df1, c = a + b)
  a b c
1 1 2 3
2 2 4 6
3 3 6 9
So far, so good. Now I would like to code this dynamically, using as.formula:
transform(df1, as.formula("c=a+b"))
However, R says
Error in eval(expr, envir, enclos) : object 'b' not found
This error does not occur when using "~" as the separator between the left-hand and right-hand sides. Can I somehow delay the evaluation of the formula? Is it possible at all to use as.formula on an assignment? I have tried fiddling around with with() but to no avail.

I've solved the problem you mentioned in your comment, since that seems to be your real goal. This avoids messing with the formulae from your original question.
A reproducible version of your dataset:
group_names <- apply(
  expand.grid("X", c("X", "O", "Y"), c("A", "B", "C"), "_", 0:9, 0:9),
  1,
  paste,
  collapse = ""
)
n_groups <- 50
n_points_per_group <- 10
df1 <- as.data.frame(matrix(
  runif(n_points_per_group * n_groups),
  ncol = n_groups
))
colnames(df1) <- sample(group_names, n_groups)
Now convert the data frame to long format. (Using the reshape package here; you could also use stats::reshape.)
library(reshape)
melted_df1 <- melt(df1)
Define a grouping based upon your criteria that the second character and the number match.
melted_df1$group <- with(melted_df1, paste(
  substring(variable, 2, 2),
  substring(variable, 5, 6),
  sep = ""
))
Now call tapply (or plyr::ddply if you prefer) to get the summary stats.
with(melted_df1, tapply(value, group, mean))
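As for the formula question itself: one way to sidestep formulas entirely and still build the new column dynamically (a sketch, evaluating a parsed expression inside the data frame rather than using as.formula) is:
df1$c <- eval(parse(text = "a + b"), envir = df1)  # right-hand side built as text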


Substituting part of a function with variable input

I want to substitute parts of the transform function with variable inputs.
I have created a df using subset with col1 from an existing table:
col1 = c('A','B','C')
The df looks something like this:
A = c(1, 3)
B = c(3, 1)
C = c(5, 2)
df = data.frame(A, B, C)
I now want to automate calculations which manually would look like this:
df <- transform(df, 'ABC' = (A + B + C))
where (A + B + C) refers to the columns of the df. Because I have hundreds of 'col1's I can't do it by hand. I was trying to use something similar to %s (as available in Python 2.x), yet so far nothing has really worked, and I understand too little of R (related to eval()?) to get things working (I tried paste, as.formula, sprintf, substitute, etc.).
Using cv(col1) I'm trying to paste the output inside the transform function, but the furthest I got was transform trying to grab values from the environment (not columns) when using as.formula.
cv = function(var) {
  output = paste('(', paste(var, collapse = ' + '), ')', sep = '')
  return(output)
}
Would appreciate any hints or ideas!
You have maneuvered yourself into a strange corner. This is easy with R:
cols <- c("A", "B", "C")
df[, paste(cols, collapse = "")] <- rowSums(df[, cols])
# alternatively, for other binary functions:
# Reduce("+", df[, cols])
#  A B C ABC
#1 1 3 5   9
#2 3 1 2   6
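Since the question mentions hundreds of column sets, the same idea extends to a loop (a sketch; col_sets is a hypothetical list of character vectors, one per new column):
col_sets <- list(c("A", "B"), c("A", "B", "C"))
for (cols in col_sets) {
  # each new column is named by concatenating its source columns
  df[, paste(cols, collapse = "")] <- rowSums(df[, cols])
}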
You can get a similar effect using mutate from dplyr:
library(dplyr)
cols <- c("A", "B", "C")
df %>% mutate_(.dots = setNames(paste(cols, collapse = '+'),
                                'new_column_name'))
Here we tell mutate_ (spot the _) what to do via paste(), which yields "A+B+C", and use setNames to name the new column.
I acknowledge the syntax is somewhat convoluted, but that is related to non-standard evaluation in dplyr. If you want to stay in the dplyr ecosystem, though, this is the way to do it.
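For readers on current dplyr, the underscore verbs are deprecated; a tidy-eval sketch of the same idea (assumes dplyr >= 1.0, using the := operator re-exported from rlang):
library(dplyr)
cols <- c("A", "B", "C")
# !! unquotes the string as the new column name; across() selects the source columns
df %>% mutate(!!paste(cols, collapse = "") := rowSums(across(all_of(cols))))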

Append a column of NA values: lit() and withColumn() giving error

I am trying to append a column of null values to a SparkR DataFrame with the following code:
w <- rbind(3, 0, 2, 3, NA, 1)
z <- rbind("a", "b", "c", "d", "e", "f")
x <- rbind(3, 3, 3, 3, 3, 3)
d <- cbind.data.frame(w, z, x)
B <- as.DataFrame(sqlContext, d)
B1 <- sample(B, withReplacement = FALSE, fraction = 0.5)
B2 <- except(B, B1)
col_sub <- c("z", "x")
B2 <- select(B2, col_sub)
B2 <- withColumn(B2, "w", lit(NA))
But, the last expression returns the error: Error in FUN(X[[i]], ...) : Unsupported data type: null. I have used the lit operation to produce a column of null values before, but I'm not sure why it won't work this time.
Also, this has been discussed on SE before, see this question. I'm completely clueless as to why my expression yields that error. For reference, I'm using SparkR 1.6.1.
Whether or not it works, adding a column this way is not good practice. Since the only practical reason to add a column containing nothing but undefined values is to enforce a specific schema for unions or external writes, you should always use a column of a specific type.
For example:
withColumn(B2, "w", cast(lit(NULL), "double"))
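The same pattern works for other target types; for instance, a sketch of a string-typed null column:
withColumn(B2, "w", cast(lit(NULL), "string"))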
Spark columns can have types such as numeric and character. My understanding is that columns of other R data types are intentionally treated as illegal.
NA is not recognized by SparkR the way R recognizes it, as an indicator of a missing value. SparkR sees NA as a value of type logical. For example:
dtypes(NA)
unable to find an inherited method for function ‘dtypes’ for signature ‘"logical"’
If you try to add a column of NA's, Spark tries to create a column of type logical, which is not a valid data type for a column. Hence the error.
There are a couple of places where SparkR (1.6.2) is inconsistent in trapping errors around the creation of illegal column types. As you found, SparkR throws an error if you use lit(NA), but it will happily convert an R data.frame with a column of NAs, successfully creating an illegal column of type "logical":
x <- c(NA,NA,NA, NA, NA)
dfX <- data.frame(x)
colnames(dfX) <- c("Empty")
sdfX <- createDataFrame(sqlContext, dfX)
str(sdfX)
'DataFrame': 1 variables:
$ Empty: logi NA NA NA NA NA
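A sketch of a workaround on the R side: give the all-NA column an explicit type before conversion, so that Spark infers a legal schema (this assumes the same sqlContext as above):
dfX$Empty <- as.numeric(dfX$Empty)        # NA becomes NA_real_, i.e. a double
sdfX <- createDataFrame(sqlContext, dfX)  # the column now gets Spark type double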

One liner wanted: Create data frame and give colnames: R data.frame(..., colnames = c("a", "b", "c"))

Is there an easier (i.e. one line of code instead of two!) way to do the following:
library(stringr)
results <- as.data.frame(str_split_fixed(
  c("SampleID_someusefulinfo.countsA",
    "SampleID_someusefulinfo.countsB",
    "SampleID_someusefulinfo.counts"),
  "\\.", n = 2))
names(results) <- c("a", "b")
Something like:
results <- data.frame(
  str_split_fixed(c("SampleID_someusefulinfo.countsA",
                    "SampleID_someusefulinfo.countsB",
                    "SampleID_someusefulinfo.counts"), "\\.", n = 2),
  colnames = c("a", "b"))
I do this a lot, and would really love to have a way to have this in one line of code.
(data.table works too, if it's easier to do there than in base data.frame.)
Clarifying:
My expected output (which is achieved by running the two lines of code at the top - AND I WANT IT TO BE ONE - THAT'S IT!!!) is a result data frame of the structure:
results
                        a       b
1 SampleID_someusefulinfo countsA
2 SampleID_someusefulinfo countsB
3 SampleID_someusefulinfo  counts
What I would like to do is:
1. CREATE the data frame from a matrix or with some content (for example the toy code matrix(c(1,2,3,4), nrow=2, ncol=2) I provided in the first example I wrote)
2. SPECIFY IN THAT SAME LINE what I would like the column names of my data frame to be
Use setNames() around a data.frame
setNames(data.frame(matrix(c(1,2,3,4), nrow=2, ncol=2)), c("a","b"))
#  a b
#1 1 3
#2 2 4
?setNames:
a convenience function that sets the names on an object and returns the object
> setNames
function (object = nm, nm)
{
    names(object) <- nm
    object
}
We can use the dimnames argument of matrix(), since the OP was using matrix to create the data.
data.frame(matrix(1:4, 2, 2, dimnames=list(NULL, c("a", "b"))))
Or
`colnames<-`(data.frame(matrix(1:4, 2, 2)), c('a', 'b'))
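Applied back to the original str_split_fixed example, the setNames() approach gives the one-liner the question asked for (a sketch; assumes the stringr package is loaded):
results <- setNames(
  as.data.frame(str_split_fixed(
    c("SampleID_someusefulinfo.countsA",
      "SampleID_someusefulinfo.countsB",
      "SampleID_someusefulinfo.counts"),
    "\\.", n = 2)),
  c("a", "b")
)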

Using dplyr, Remove all strings from a data frame

I have a data frame with 300 columns which has a string variable somewhere that I am trying to remove. I found the solution below on Stack Overflow using lapply, which does what I want, but I would like to do it with the dplyr package. I have tried using the mutate_each function but can't seem to make it work.
"If your data frame (df) is really all integers except for NAs and garbage, then the following converts it.
df2 <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
You'll get a warning about NAs introduced by coercion, but that's just all those non-numeric character strings turning into NAs."
dplyr 0.5 now includes a select_if() function.
For example:
person <- c("jim", "john", "harry")
df <- data.frame(matrix(c(1:9,NA,11,12), nrow=3), person)
library(dplyr)
df %>% select_if(is.numeric)
#  X1 X2 X3 X4
#1  1  4  7 NA
#2  2  5  8 11
#3  3  6  9 12
Of course you could add further conditions if necessary.
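For instance, a sketch that keeps logical columns alongside numeric ones:
df %>% select_if(function(x) is.numeric(x) || is.logical(x))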
If you want to use this line of code:
df2 <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
with dplyr (by which I assume you mean "using pipes") the easiest would be
df2 = df %>%
  lapply(function(x) as.numeric(as.character(x))) %>%
  as.data.frame
To "translate" this into the mutate_each idiom:
mutate_each(df, funs(as.numeric(as.character(.))))
This function will, of course, convert all columns to character, then to numeric. To improve efficiency, don't bother doing two conversions on columns that are already numeric:
mutate_each(df, funs({
  if (is.numeric(.)) return(.)
  as.numeric(as.character(.))
}))
Data for testing:
df = data.frame(v1 = 1:10, v2 = factor(11:20))
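On current dplyr, mutate_each is defunct; a sketch of the same conditional conversion using across() (assumes dplyr >= 1.0):
df %>% mutate(across(everything(),
                     ~ if (is.numeric(.x)) .x else as.numeric(as.character(.x))))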
mutate_all works here; simply wrap the gsub in a function. (I also assume you aren't necessarily string hunting, so much as trawling for non-integers.)
StrScrub <- function(x) {
  as.integer(gsub("^\\D+$", NA, x))
}
ScrubbedDF <- mutate_all(data, funs(StrScrub))
Example dataframe:
library(dplyr)
options(stringsAsFactors = F)
data = data.frame("A" = c(2:5), "B" = c(5, "gr", 3:2), "C" = c("h", 9, "j", "1"))
with reference/help from Tony Ladson
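For reference, ScrubbedDF from the code above should look something like this (digit strings survive; strings made purely of non-digits become NA):
#  A  B  C
#1 2  5 NA
#2 3 NA  9
#3 4  3 NA
#4 5  2  1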

Using filter_ in dplyr where both field and value are in variables

I want to filter a dataframe using a field which is defined in a variable, to select a value that is also in a variable. Say I have
df <- data.frame(V=c(6, 1, 5, 3, 2), Unhappy=c("N", "Y", "Y", "Y", "N"))
fld <- "Unhappy"
sval <- "Y"
The value I want would be df[df$Unhappy == "Y", ].
I've read the NSE vignette to try to use filter_, but can't quite understand it. I tried
df %>% filter_(.dots = ~ fld == sval)
which returned nothing. I got what I wanted with
df %>% filter_(.dots = ~ Unhappy == sval)
but obviously that defeats the purpose of having a variable to store the field name. Any clues please? Eventually I want to use this where fld is a vector of field names and sval is a vector of filter values for each field in fld.
You can try with interp from lazyeval
library(lazyeval)
library(dplyr)
df %>%
  filter_(interp(~v == sval, v = as.name(fld)))
#  V Unhappy
#1 1       Y
#2 5       Y
#3 3       Y
For multiple key/value pairs, I found the following to work, though I think there should be a better way.
df1 %>%
  filter_(interp(~v == sval1[1] & y == sval1[2],
                 .values = list(v = as.name(fld1[1]), y = as.name(fld1[2]))))
#  V Unhappy Col2
#1 1       Y    B
#2 5       Y    B
For these cases, I find the base R option easier. For example, if we are trying to filter the rows based on the 'key' variables in 'fld1' with corresponding values in 'sval1', one option is Map. We subset the dataset (df1[fld1]) and apply the function (==) to each column of df1[fld1] with the corresponding value in 'sval1', then combine with Reduce and `&` to get a logical vector that can be used to filter the rows of 'df1'.
df1[Reduce(`&`, Map(`==`, df1[fld1], sval1)), ]
#  V Unhappy Col2
#2 1       Y    B
#3 5       Y    B
data
df1 <- cbind(df, Col2= c("A", "B", "B", "C", "A"))
fld1 <- c(fld, 'Col2')
sval1 <- c(sval, 'B')
Now, rlang 0.4.0 introduces a new, more intuitive way to handle this type of use case:
packageVersion("rlang")
# [1] ‘0.4.0’
df <- data.frame(V=c(6, 1, 5, 3, 2), Unhappy=c("N", "Y", "Y", "Y", "N"))
fld <- "Unhappy"
sval <- "Y"
df %>% filter(.data[[fld]]==sval)
#OR
filter_col_val <- function(df, fld, sval) {
  df %>% filter({{ fld }} == sval)
}
filter_col_val(df, Unhappy, "Y")
More information can be found at https://www.tidyverse.org/articles/2019/06/rlang-0-4-0/
Previous Answer
With dplyr 0.6.0 and later, this code works:
packageVersion("dplyr")
# [1] ‘0.7.1’
df <- data.frame(V=c(6, 1, 5, 3, 2), Unhappy=c("N", "Y", "Y", "Y", "N"))
fld <- "Unhappy"
sval <- "Y"
df %>% filter(UQ(rlang::sym(fld))==sval)
#OR
df %>% filter((!!rlang::sym(fld))==sval)
#OR
fld <- quo(Unhappy)
sval <- "Y"
df %>% filter(UQ(fld)==sval)
More about the dplyr syntax available at http://dplyr.tidyverse.org/articles/programming.html and the quosure usage in the rlang package https://cran.r-project.org/web/packages/rlang/index.html .
If you find it challenging to master non-standard evaluation in dplyr 0.6+, Alex Hayes has an excellent write-up on the topic: https://www.alexpghayes.com/blog/gentle-tidy-eval-with-examples/
Original Answer
With dplyr version 0.5.0 and later, it is possible to use a simpler syntax that gets closer to the syntax @Ricky originally wanted, which I also find more readable than using lazyeval::interp:
df %>% filter_(.dots = paste0(fld, "=='", sval, "'"))
#  V Unhappy
#1 1       Y
#2 5       Y
#3 3       Y
#OR
df %>% filter_(.dots = glue::glue("{fld}=='{sval}'"))
Here's an alternative with base R, which is maybe not very elegant, but it might have the benefit of being rather easily understandable:
df[df[colnames(df) == fld] == sval, ]
#  V Unhappy
#2 1       Y
#3 5       Y
#4 3       Y
Following on from LmW: personally, I prefer a dplyr pipeline where the dots are specified before the pipeline, so that it is easier to use programmatically, say in a loop of filters.
dots <- paste0(fld," == '",sval,"'")
df %>% filter_(.dots = dots)
LmW's example is correct but the values are hardcoded.
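A sketch of the loop alluded to above (flds and svals are hypothetical parallel vectors of field names and values; each pass narrows the data further):
flds <- c("Unhappy", "Col2")
svals <- c("Y", "B")
out <- df1
for (i in seq_along(flds)) {
  # build the filter expression as a string, one field/value pair per pass
  out <- out %>% filter_(.dots = paste0(flds[i], " == '", svals[i], "'"))
}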
So I was trying to do the same thing, and it seems that dplyr now has built-in functionality to address exactly this.
Check the last example here: https://dplyr.tidyverse.org/reference/filter.html
I'm also pasting it here for simplicity:
# To refer to column names that are stored as strings, use the `.data` pronoun:
vars <- c("mass", "height")
cond <- c(80, 150)
starwars %>%
  filter(
    .data[[vars[[1]]]] > cond[[1]],
    .data[[vars[[2]]]] > cond[[2]]
  )
