Unquoting variable in user-defined function - r

I need to bind two data.frames using a user-defined function. As an example, imagine that the data frames look like this.
library(dplyr)
library(lazyeval)
df <- data.frame(type1 = c("a","b","c","a","b","c",NA), type2 = c("d","e","f","d","e","f","f"))
f <- function(x) {
  y <- df %>%
    dplyr::filter_(lazyeval::interp(~ !is.na(x), x = as.name(x))) %>%
    dplyr::group_by_(x) %>%
    dplyr::summarize("Sum" = sum(type2 == "d"))
  y <- dplyr::bind_rows(y, data.frame(x = "Total", Sum = sum(y$Sum)))
  return(y)
}
result_f <- f("type1")
The problem is that the column holding "Total" in the second data frame ends up named "x" rather than the value of x (here "type1"), so bind_rows creates an additional column because the names do not match those of the first data frame.
How can the function interpret x as a variable instead of a string? Unquoting? How?

You can change the last line in the function to
y <- dplyr::bind_rows(y,setNames(data.frame("Total",sum(y$Sum)), c(x, "Sum")))
That sets the column names of the data.frame you are binding to match the original names.
Before you spend too much time learning all the underscore functions in dplyr, note that in the next version (0.6) they are being superseded by a completely different method of non-standard evaluation. Read more here: https://blog.rstudio.org/2017/04/13/dplyr-0-6-0-coming-soon/
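To see the naming difference in isolation, here is a minimal base R sketch. It assumes a small hand-built y standing in for the summarised result, and uses rbind instead of bind_rows so it runs without dplyr; the names wrong, right and combined are only illustrative.

```r
# Column name passed in as a string, as in f("type1")
x <- "type1"

# Hand-built stand-in for the grouped summary (an assumption for illustration)
y <- data.frame(type1 = c("a", "b", "c"), Sum = c(2, 0, 0),
                stringsAsFactors = FALSE)

# data.frame(x = ...) takes the argument name literally, so the column is "x"
wrong <- data.frame(x = "Total", Sum = sum(y$Sum), stringsAsFactors = FALSE)

# setNames() applies the value stored in x as the column name instead
right <- setNames(data.frame("Total", sum(y$Sum), stringsAsFactors = FALSE),
                  c(x, "Sum"))

names(wrong)
names(right)

# Names now match, so the total row lines up with the existing columns
combined <- rbind(y, right)
```

With matching names the total row is appended to the existing columns rather than creating a new one.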

Related

How to convert all factor variables into numeric variables (in multiple data frames at once)?

I have n data frames, each corresponding to data from a city.
There are 3 variables per data frame and currently they are all factor variables.
I want to transform all of them into numeric variables.
I have started by creating a vector with the names of all the data frames in order to use in a for loop.
cities <- as.vector(objects())
for ( i in cities){
i <- as.data.frame(lapply(i, function(x) as.numeric(levels(x))[x]))
}
Although the code runs and I get no error, I don't see any changes to my data frames: all three variables remain factor variables.
The strangest thing is that when doing them one by one (as below) it works:
df <- as.data.frame(lapply(df, function(x) as.numeric(levels(x))[x]))
What you're essentially trying to do is modify the type of the field if it is a factor (to a numeric type). One approach using purrr would be:
library(purrr)
map(cities, ~ modify_if(., is.factor, as.numeric))
Note that modify() in itself is like lapply() but it doesn't change the underlying data structure of the objects you are modifying (in this case, dataframes). modify_if() simply takes a predicate as an additional argument.
For anyone who's interested in my question, I worked out the answer:
for ( i in cities){
assign(i, as.data.frame(lapply(get(i), function(x) as.numeric(levels(x))[x])))
}
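The original loop failed because i only ever holds the name of a data frame as a string, so assigning to i never touches the data frame itself. As an alternative to assign/get, a possible sketch is to collect the data frames into a list and convert them there. The data frames paris and berlin and the helper to_numeric are made up for illustration. Note also that as.numeric() applied directly to a factor returns the integer level codes, which is why the levels are converted first.

```r
# Two hypothetical city data frames whose columns are all factors
paris  <- data.frame(a = factor(c("1.5", "2")), b = factor(c("10", "20")))
berlin <- data.frame(a = factor(c("3", "4.5")), b = factor(c("30", "40")))
cities <- c("paris", "berlin")

# Gotcha: this yields the level codes 1 and 2, not 10 and 20
codes <- as.numeric(factor(c("10", "20")))

# Convert every factor column via its levels; d[] keeps the data.frame class
to_numeric <- function(d) {
  d[] <- lapply(d, function(col) {
    if (is.factor(col)) as.numeric(levels(col))[col] else col
  })
  d
}

# mget() looks the names up and returns a named list of data frames
city_list <- lapply(mget(cities, envir = environment()), to_numeric)
str(city_list$paris)
```

Keeping the results in city_list avoids writing back into the workspace by name; list2env(city_list, envir = environment()) would overwrite the originals if that is really wanted.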

Reference variables by string in loop when normalising multiple variable names (assign and get do not seem to work)

I am trying to normalise column names for all data sets in the global environment. Since I am using a loop I would need to use strings to reference the data sets. Many similar examples suggest using either get or assign but neither seems to work in this case.
library(magrittr) #the piping operator
library(rattle) #normVarNames()
datanames <- names(which(sapply(.GlobalEnv, is.data.frame)))
for(name in datanames){
names(name) %<>% normVarNames()
}
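In the loop above, names(name) refers to the string held in name, not to the data set with that name. One way get and assign could be combined is sketched below. It is not a confirmed answer: normalise_names is a hypothetical base R stand-in for rattle's normVarNames(), and a separate environment env is used instead of .GlobalEnv so the sketch does not alter the workspace.

```r
# Hypothetical stand-in for rattle::normVarNames(): lower case, underscores
normalise_names <- function(nms) tolower(gsub("[^A-Za-z0-9]+", "_", nms))

# A scratch environment holding two made-up data sets and one non-data-frame
env <- new.env()
assign("sales", data.frame(`Total Sales` = 1:2, check.names = FALSE), envir = env)
assign("stock", data.frame(`Item.ID` = 3:4, check.names = FALSE), envir = env)
assign("note", "not a data frame", envir = env)

# Names of the objects in env that are data frames
datanames <- names(which(sapply(as.list(env), is.data.frame)))

for (name in datanames) {
  d <- get(name, envir = env)            # fetch the object by its name
  names(d) <- normalise_names(names(d))  # modify the local copy
  assign(name, d, envir = env)           # write the copy back under the same name
}

names(env$sales)
names(env$stock)
```

The key point is that the modification happens on a copy fetched with get, which then has to be written back with assign; replacing env with .GlobalEnv would apply the same steps to the workspace.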

Order based on multiple columns passed in as an input

I would like to write a function that sorts a given data.frame (which I'll refer to as dataSet) by any number of its columns, whose names are also passed into the function (in a vector which I will refer to as orderList). I know that to order by a single passed in string you can just use
sortDataset <- function(dataSet, sortCol) {
return(dataSet[order(dataSet[[sortCol]]),])
}
and that you can order by multiple passed in strings using
sortDataset <- function(dataSet, sortCol1, sortCol2) {
return(dataSet[order(dataSet[[sortCol1]], dataSet[[sortCol2]]),])
}
with however many sortCol# inputs as I would want. I would, however, like to be able to pass in a list of any number of strings. I tried the following:
dataSet[order(dataSet[[orderList]]),]
dataSet[order(dataSet$orderList),]
dataSet[order(dataSet[,orderList])]
The first two fail because they are simply not valid ways to select multiple columns (I still tried, though), and the third fails because order doesn't seem to accept the object returned by dataSet[,orderList] as a parameter.
I would like a function as follows:
sortDataset <- function(dataSet, sortCols)
where the first element of sortCols is the column which takes highest priority, then the second column is the first tiebreaker, the third column is the second tiebreaker, etc. and the function returns dataSet sorted appropriately. It would also be nice if I could specify whether each should be ascending in an optional input, so the first column could be sorting ascending, the second sorted descending, etc.
So far, the only method I can really think of is to assume each list only contains numeric values, and then do some multiplying of the various sorting columns by 10^n so that all the columns can be consolidated into one column that maintains the priorities, and then sort by that column. I feel like there should be a better way to do this, though, since this seems like a pretty basic function.
Use do.call:
data[do.call("order", data[sortCols]), ]
where data is a data frame and sortCols is a character vector of column names.
Also have a look at orderBy in the doBy package.
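For the optional ascending/descending part of the question, the do.call idea can be extended roughly as follows. This is only a sketch: sort_dataset and its decreasing argument are illustrative names, and it relies on xtfrm() to turn each column into a numeric key that can be negated for descending order.

```r
# Sort by any number of columns; decreasing[i] flips the direction of column i
sort_dataset <- function(data, sort_cols,
                         decreasing = rep(FALSE, length(sort_cols))) {
  keys <- lapply(seq_along(sort_cols), function(i) {
    key <- xtfrm(data[[sort_cols[i]]])  # numeric key that sorts like the column
    if (decreasing[i]) -key else key    # negate to sort this column descending
  })
  # order() receives one key per column, highest priority first
  data[do.call(order, keys), , drop = FALSE]
}

d <- data.frame(g = c("b", "a", "b", "a"), v = c(1, 2, 3, 4),
                stringsAsFactors = FALSE)

# g ascending, then v descending within each g
sorted <- sort_dataset(d, c("g", "v"), decreasing = c(FALSE, TRUE))
sorted
```

Because each column becomes its own key, this avoids the trick of multiplying columns by powers of ten and also works for character columns.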
We can do this with tidyverse
library(dplyr)
data %>%
arrange_at(vars(sortCols))
which can be made into a function using either quos/!!!
sortDataset <- function(dataSet, ...) {
  stopifnot(rlang::is_quosures(...))
  a1 <- c(...)
  dataSet %>%
    arrange(!!! a1)
}
sortDataset(mtcars, quos(mpg, cyl))
or with arrange_at if we are passing the variables as strings
sortDataset <- function(dataSet, ...) {
  a1 <- c(...)
  dataSet %>%
    arrange_at(vars(a1))
}
sortDataset(mtcars, "mpg", "cyl")
As @Nettle mentioned in the comments, using arrange_at with group_by can cause some bugs (based on here)

Error in 'colsplit' function?

I am trying to split a column of a data frame into 2 columns using transform and colsplit from the reshape package. I don't get what I am doing wrong. Here's an example...
library(reshape)
df1 <- data.frame(col1=c("x-1","y-2","z-3"))
Now I am trying to split col1 into col1.a and col1.b at the delimiter '-'. The following is my code...
df1 <- transform(df1,col1 = colsplit(col1,split='-',names = c('a','b')))
Now in my RStudio when I do View(df1) I do get to see col1.a and col1.b split the way I want to.
But when I run df1$col1.a or head(df1$col1.a), I get NULL. Apparently I am not able to perform any further operations on these split columns. What exactly is wrong with this?
colsplit returns a list (a data frame), so transform stores the whole result as a single column named col1 rather than as two separate columns. The easiest (and idiomatic) way to assign the result to multiple columns in the data frame is to use [<-
e.g.
df1[c('col1.a','col1.b')] <- colsplit(df1$col1,'-',c('a','b'))
It will be much harder to do this within transform (see Assign multiple new variables on LHS in a single line in R)
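The multi-column assignment with [<- can be tried without the reshape package. In this sketch strsplit is a base R stand-in for colsplit (an assumption made so the example is self-contained), and parts and split_cols are illustrative names.

```r
df1 <- data.frame(col1 = c("x-1", "y-2", "z-3"), stringsAsFactors = FALSE)

# Base R stand-in for colsplit(): split on "-" and bind the pieces row-wise
parts <- do.call(rbind, strsplit(df1$col1, "-", fixed = TRUE))
split_cols <- data.frame(a = parts[, 1], b = as.integer(parts[, 2]),
                         stringsAsFactors = FALSE)

# Assign both new columns in one step; each becomes a regular top-level column
df1[c("col1.a", "col1.b")] <- split_cols

df1$col1.a
df1$col1.b
```

After this assignment df1$col1.a and df1$col1.b are ordinary columns, so further operations on them behave as expected.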

Using as.numeric, with functions and pipes in R

I have a function that looks like this
calc_df <- function(A_df, B_df){
C_df <- filter(A_df, Type == "Animal") %>%
left_join(B_df) %>%
as.numeric(C$Count)
I cannot get the last line to work. The first 3 work properly, but I would like the last line to take the column "Count" from the new df calculated in the function and make it numeric. (Right now it is a character vector.)
** I have to do this at the end of the function because before the filter command, the Count column contains letters and cannot be made as.numeric
Looks like you're using dplyr, and that you want to change or add a column. This is what the dplyr::mutate function does.
Replace
as.numeric(C$Count)
with
mutate(Count = as.numeric(Count))
to replace the old, non-numeric Count column with the coerced-to-numeric replacement.
As to why your code didn't work, there are a few problems:
dplyr is made for working with data frames, and the main dplyr functions (select, filter, mutate, summarize, group_by, *_join, ...) expect a data frame as the first argument and return a data frame. By piping the result of a left join into as.numeric, you are really calling as.numeric(unnamed_data_frame_from_your_join, C$Count), which clearly doesn't make much sense.
You are trying to reference a data frame called C inside the definition of a data frame called C_df, which I think you mean to be the same thing. There are two issues here: (1) the mismatch between the names C and C_df, and (2) you can't reference C_df inside its own definition.
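Putting the pieces together, the whole function might look like the sketch below. The input data, the id column and the by = "id" join key are assumptions made up for illustration; only Type and Count come from the question.

```r
library(dplyr)

# Hypothetical inputs: Count is a character column containing a non-numeric entry
A_df <- data.frame(id = 1:3, Type = c("Animal", "Plant", "Animal"),
                   stringsAsFactors = FALSE)
B_df <- data.frame(id = 1:3, Count = c("4", "n/a", "7"),
                   stringsAsFactors = FALSE)

calc_df <- function(A_df, B_df) {
  A_df %>%
    filter(Type == "Animal") %>%         # drop the rows with non-numeric counts
    left_join(B_df, by = "id") %>%       # join key is an assumption
    mutate(Count = as.numeric(Count))    # coerce the column inside the pipe
}

C_df <- calc_df(A_df, B_df)
str(C_df)
```

Each step takes a data frame and returns one, so the pipe's result can be returned directly without naming C_df inside the function.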
