I would like to write a function that sorts a given data.frame (which I'll refer to as dataSet) by any number of its columns, whose names are also passed into the function (in a vector which I will refer to as orderList). I know that to order by a single passed in string you can just use
sortDataset <- function(dataSet, sortCol) {
return(dataSet[order(dataSet[[sortCol]]),])
}
and that you can order by multiple passed in strings using
sortDataset <- function(dataSet, sortCol1, sortCol2) {
return(dataSet[order(dataSet[[sortCol1]], dataSet[[sortCol2]]),])
}
with however many sortCol# inputs as I would want. I would, however, like to be able to pass in a list of any number of strings. I tried the following:
dataSet[order(dataSet[[orderList]]),]
dataSet[order(dataSet$orderList),]
dataSet[order(dataSet[,orderList])]
and encountered issues: the first two just aren't valid ways to get multiple columns (I still tried, though), and with the third, order doesn't seem to accept the matrix returned by dataSet[,orderList] as a parameter.
I would like a function as follows:
sortDataset <- function(dataSet, sortCols)
where the first element of sortCols is the column which takes highest priority, then the second column is the first tiebreaker, the third column is the second tiebreaker, etc. and the function returns dataSet sorted appropriately. It would also be nice if I could specify whether each should be ascending in an optional input, so the first column could be sorting ascending, the second sorted descending, etc.
So far, the only method I can really think of is to assume each column contains only numeric values, multiply the various sorting columns by powers of 10 so that they can be consolidated into one column that preserves the priorities, and then sort by that column. I feel like there should be a better way to do this, though, since this seems like a pretty basic operation.
Use do.call:
data[do.call("order", data[sortCols]), ]
where data is a data frame and sortCols is a character vector of column names.
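Wrapped up as the requested function, a minimal sketch (the per-column direction variant is an extra assumption: it relies on R >= 3.3.0, where order() accepts a vector-valued decreasing argument together with method = "radix"):
sortDataset <- function(dataSet, sortCols) {
  # pass each chosen column as a separate argument to order()
  dataSet[do.call("order", dataSet[sortCols]), ]
}
sortDataset(mtcars, c("cyl", "mpg"))

# optional per-column direction, e.g. cyl ascending and mpg descending
# (needs R >= 3.3.0 for a vector-valued 'decreasing' with method = "radix")
sortDataset2 <- function(dataSet, sortCols, decreasing = FALSE) {
  args <- c(unname(dataSet[sortCols]),
            list(decreasing = decreasing, method = "radix"))
  dataSet[do.call("order", args), ]
}
sortDataset2(mtcars, c("cyl", "mpg"), decreasing = c(FALSE, TRUE))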
Also have a look at orderBy in the doBy package.
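For example, a sketch assuming the doBy package is installed (and, if I remember its interface correctly, a - prefix in the formula sorts that column in descending order):
library(doBy)
orderBy(~ cyl + mpg, data = mtcars)   # ascending by cyl, then by mpg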
We can do this with tidyverse
library(dplyr)
data %>%
arrange_at(vars(sortCols))
which can be made into a function either using quos/!!!
sortDataset <- function(dataSet, ...) {
stopifnot(rlang::is_quosures(...))
a1 <- c(...)
dataSet %>%
arrange(!!! a1)
}
sortDataset(mtcars, quos(mpg, cyl))
or with arrange_at if we are passing the variables as strings
sortDataset <- function(dataSet, ...) {
a1 <- c(...)
dataSet %>%
arrange_at(vars(a1))
}
sortDataset(mtcars, "mpg", "cyl")
As @Nettle mentioned in the comments, using arrange_at with group_by can cause some bugs.
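On newer dplyr (1.0 or later, an assumption), the superseded *_at verbs can be rewritten with across()/all_of(); a sketch:
library(dplyr)
sortDataset <- function(dataSet, sortCols) {
  dataSet %>%
    arrange(across(all_of(sortCols)))
}
sortDataset(mtcars, c("mpg", "cyl"))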
I am trying to create a custom function that I can apply across various columns to recode values from characters to numeric data. The data has many blanks, and each character entry is the same within a given column (i.e., a "select all that apply" survey question, where you need to create binary 1/0 variables indicating whether the choice was selected). So, logically, I am trying to create a function that does the following:
"In a specified range of columns in the data, if there is any character present, recode that entry as 1, otherwise mark as NA"
The following works perfectly as a standalone statement:
data$var <- if_else(data$var == data$var[grep("[a-z]", data$var)], 1, NULL)
But I am having trouble creating a function that does this that I can apply to many different columns.
I have tried to solve this with lapply, mutate, and if_else in the following ways to no avail.
I can compute the correct result with the following function (fxn), but I need to update the actual data frame:
fxn <- function(x) {
if_else(x == (x[grep("[a-z]", x)]), 1, NULL)
}
fxn(data$variable)
But when I try to use mutate to update the data frame as follows, it doesn't work:
data %>%
mutate(across(.cols = variable, fxn))
Any help would be appreciated as there are 100+ columns I need to do this on!
We create the function and apply it to the selected columns with lapply. In the example below, the function is applied to columns 1 to 5 and the result is assigned back:
# NA^0 is 1 and NA^1 is NA, so entries containing a letter become 1 and everything else becomes NA
fxn <- function(x) NA^(!grepl('[a-z]', x))
data[1:5] <- lapply(data[1:5], fxn)
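If you prefer to stay with the mutate/across style the question started with, a rough equivalent of the same idea (assuming the first five columns are still the targets; adjust the selection as needed):
library(dplyr)
data <- data %>%
  mutate(across(1:5, ~ if_else(grepl("[a-z]", .x), 1, NA_real_)))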
I have n data frames, each corresponding to data from a city.
There are 3 variables per data frame and currently they are all factor variables.
I want to transform all of them into numeric variables.
I have started by creating a vector with the names of all the data frames in order to use in a for loop.
cities <- as.vector(objects())
for ( i in cities){
i <- as.data.frame(lapply(i, function(x) as.numeric(levels(x))[x]))
}
Although the code runs and I get no error, I don't see any changes to my data frames: all three variables remain factors.
The strangest thing is that when doing them one by one (as below) it works:
df <- as.data.frame(lapply(df, function(x) as.numeric(levels(x))[x]))
What you're essentially trying to do is modify the type of the field if it is a factor (to a numeric type). One approach using purrr would be:
library(purrr)
# mget() fetches the data frames named in `cities`; modify_if() converts only
# the factor columns, preserving the original values via their levels
map(mget(cities), ~ modify_if(.x, is.factor, function(f) as.numeric(levels(f))[f]))
Note that modify() in itself is like lapply() but it doesn't change the underlying data structure of the objects you are modifying (in this case, dataframes). modify_if() simply takes a predicate as an additional argument.
For anyone who's interested in my question, I worked out the answer:
for ( i in cities){
assign(i, as.data.frame(lapply(get(i), function(x) as.numeric(levels(x))[x])))
}
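A list-based alternative to the assign()/get() loop, as a sketch: mget() collects the data frames named in cities, the conversion happens in one lapply, and list2env() writes the converted versions back to the global environment.
converted <- lapply(mget(cities), function(df) {
  as.data.frame(lapply(df, function(x) {
    if (is.factor(x)) as.numeric(levels(x))[x] else x
  }))
})
list2env(converted, envir = .GlobalEnv)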
I need to bind two data.frames using a user-defined function. As example let's imagine that the data frames look like this.
library(dplyr)
library(lazyeval)
df<-data.frame(type1=c("a","b","c","a","b","c",NA),type2=c("d","e","f","d","e","f","f"))
f<-function(x){
y<-df%>%
dplyr::filter_(lazyeval::interp(~!is.na(x),x=as.name(x)))%>%
dplyr::group_by_(x)%>%
dplyr::summarize("Sum"=sum(type2=="d"))
y<-dplyr::bind_rows(y,data.frame(x="Total",Sum=sum(y$Sum)))
return(y)
}
result_f<-f("type1")
The problem is that data.frame(x = "Total", Sum = sum(y$Sum)) creates a column literally named "x" rather than one named after the value of x (e.g. "type1"), so bind_rows adds an extra column because of the mismatch with the first data frame.
How can the function interpret x as a variable instead of a string? Unquoting? How?
You can change the last line in the function to
y <- dplyr::bind_rows(y,setNames(data.frame("Total",sum(y$Sum)), c(x, "Sum")))
That will set the names of the data.frame you are trying to bind in to the original names.
Before you spend too much time learning all the underscore functions in dplyr, note that in the next version (0.6) they are being superseded by a completely different method of non-standard evaluation. Read more here: https://blog.rstudio.org/2017/04/13/dplyr-0-6-0-coming-soon/
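For reference, a sketch of the same function written in that newer tidy-eval style (assumes dplyr 0.7+ with rlang; sym()/!! replace the underscore verbs, and the bind uses the setNames() fix from above):
f2 <- function(x) {
  col <- rlang::sym(x)
  y <- df %>%
    dplyr::filter(!is.na(!!col)) %>%
    dplyr::group_by(!!col) %>%
    dplyr::summarize(Sum = sum(type2 == "d"))
  dplyr::bind_rows(y, setNames(data.frame("Total", sum(y$Sum)), c(x, "Sum")))
}
result_f2 <- f2("type1")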
I have a function that looks like this
calc_df <- function(A_df, B_df){
C_df <- filter(A_df, Type == "Animal") %>%
left_join(B_df) %>%
as.numeric(C$Count)
I cannot get the last line to work; the first three lines work properly, but I would like the last line to take the column "Count" from the new data frame calculated in the function and make it numeric. (Right now it is a character vector.)
I have to do this at the end of the function because, before the filter command, the Count column contains letters and cannot be converted with as.numeric.
Looks like you're using dplyr, and that you want to change or add a column. This is what the dplyr::mutate function does.
Replace
as.numeric(C$Count)
with
mutate(Count = as.numeric(Count))
to replace the old, non-numeric Count column with the coerced-to-numeric replacement.
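Putting it together, a sketch of the corrected function (the join columns shared by A_df and B_df are left to left_join()'s default guess, as in the original):
library(dplyr)
calc_df <- function(A_df, B_df) {
  filter(A_df, Type == "Animal") %>%
    left_join(B_df) %>%
    mutate(Count = as.numeric(Count))
}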
As to why your code didn't work, there are a few problems:
dplyr is made for working with data frames, and the main dplyr functions (select, filter, mutate, summarize, group_by, *_join, ...) expect data frames as the first argument, and then return data frames. By piping the result of a left join into as.numeric, you are really calling as.numeric(unnamed_data_frame_from_your_join, C$Count), which clearly doesn't make much sense.
You are trying to reference a data frame called C inside the definition of a data frame called C_df, which I think you mean to be the same thing. There are two issues here: (1) the mismatch between the names C and C_df, and (2) you can't reference C_df inside its own definition.
My data frame (m x n) has a few hundred columns. I need to compare each column with all the other columns (contingency tables), perform a chisq test, and save the results for each column in a different variable.
It works for one column at a time, like this:
s <- function(x) {
a <- table(x,data[,1])
b <- chisq.test(a)
}
c1 <- apply(data,2,s)
The results are stored in c1 for column 1, but how will I loop this over all columns and save result for each column for further analysis?
If you're sure you want to do this (I wouldn't, thinking about the multiple-testing problem), work with lists:
Data <- data.frame(
x=sample(letters[1:3],20,TRUE),
y=sample(letters[1:3],20,TRUE),
z=sample(letters[1:3],20,TRUE)
)
# Make a nice list of indices
ids <- combn(names(Data),2,simplify=FALSE)
# use the appropriate apply
my.results <- lapply(ids,
function(z) chisq.test(table(Data[,z]))
)
# use some paste voodoo to give the results the names of the column indices
names(my.results) <- sapply(ids,paste,collapse="-")
# select all values for y :
my.results[grep("y",names(my.results))]
It's no harder than that. As I show in the last line, you can easily get all tests for a specific column, so there is no need to make a separate list for each column; that just takes longer and more space but gives the same information. You can write a small convenience function to extract the data you need:
extract <- function(col,l){
l[grep(col,names(l))]
}
extract("^y$",my.results)
This means you can even loop over the different column names of your data frame and get a list of lists returned:
lapply(names(Data),extract,my.results)
I strongly suggest you get yourself acquainted with working with lists, they're one of the most powerful and clean ways of doing things in R.
PS: Be aware that you save the whole chisq.test object in your list. If you only need the chi-square statistic or the p-value, select them first.
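For example, keeping only the p-values (a sketch):
p.values <- sapply(my.results, function(r) r$p.value)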
Fundamentally, you have a few problems here:
You're relying heavily on global arguments rather than local ones. This makes the double usage of "data" confusing.
Similarly, you rely on a hard-coded value (column 1) instead of passing it as an argument to the function.
You're not extracting the one value you need from chisq.test(). This means your result gets returned as a list.
You didn't provide some example data. So here's some:
m <- 10
n <- 4
mytable <- matrix(runif(m*n),nrow=m,ncol=n)
Once you fix the above problems, simply run a loop over various columns (since you've now avoided hard-coding the column) and store the result.
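A sketch of what that might look like, reusing the question's data frame data (assumed to hold categorical columns) and keeping only the p-value from each test:
s <- function(x, ref) {
  chisq.test(table(x, ref))$p.value
}
# compare every column against column 1 and store the p-values, named by column
results <- sapply(data, s, ref = data[, 1])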