I am trying to order a data frame on multiple columns, where the column names are passed through a variable, i.e. a character vector.
df <- data.frame(var1 = c("b","a","b","a"), var2 = c("l","l","k","k"),
var3 = c("t","w","x","t"))
var1 var2 var3
1 b l t
2 a l w
3 b k x
4 a k t
Sorting on one column using a variable
sortvar <- "var1"
df[order(df[ , sortvar]),]
var1 var2 var3
2 a l w
4 a k t
1 b l t
3 b k x
Now, if I want to order by two columns, the above solution does not work.
sortvar <- c("var1", "var2")
df[order(df[, sortvar]), ] #does not work
I can manually order with column names:
df[with(df, order(var1, var2)),]
var1 var2 var3
4 a k t
2 a l w
3 b k x
1 b l t
But, how do I order the data frame dynamically on multiple columns using a variable with column names? I am aware of the plyr and dplyr arrange function, but I want to use base R here.
order expects multiple ordering variables as separate arguments, which is unfortunate in your case but suggests a direct solution: use do.call:
df[do.call(order, df[, sortvar, drop = FALSE]), ]
In case you’re unfamiliar with do.call: it constructs and executes a call programmatically. The following two statements are equivalent:
fun(arg1, arg2, …)
do.call(fun, list(arg1, arg2, …))
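To make the equivalence concrete for the question's data, here is a minimal sketch comparing the hand-written call with the do.call version. The names idx_manual and idx_dynamic are only illustrative. The unname(as.list(...)) step is an extra precaution that is not part of the answer above: it keeps a column that happens to be called, say, decreasing from being matched to one of order's own arguments.

```r
df <- data.frame(var1 = c("b", "a", "b", "a"),
                 var2 = c("l", "l", "k", "k"),
                 var3 = c("t", "w", "x", "t"))
sortvar <- c("var1", "var2")

# Hand-written call: every sort key is a separate argument.
idx_manual <- with(df, order(var1, var2))

# Same call built programmatically. df[sortvar] is a list of columns;
# unname() drops the names so the columns are passed positionally.
idx_dynamic <- do.call(order, unname(as.list(df[sortvar])))

identical(idx_manual, idx_dynamic)  # both give the same row permutation
df[idx_dynamic, ]
```

Because the argument list is built from sortvar, the same line works for any number of sort columns.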
It's a bit awkward, but you can use do.call() to pass each of the columns to order as a separate argument:
dat[do.call("order", dat[,cols, drop=FALSE]), ]
I added drop=FALSE just in case length(cols)==1, where indexing a data.frame would return a vector instead of a list. You can wrap it in a function to make it a bit easier to use:
order_by_cols <- function(data, cols=1) {
data[do.call("order", data[, cols, drop=FALSE]), ]
}
order_by_cols(dat, cols)
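A short usage sketch, assuming a small dat modelled on the question's data (the object names sorted_one and sorted_two are illustrative). It shows why drop=FALSE matters for a single sort column and that the helper handles one or several columns alike.

```r
dat <- data.frame(var1 = c("b", "a", "b", "a"),
                  var2 = c("l", "l", "k", "k"))

# Same helper as above, repeated so the snippet runs on its own.
order_by_cols <- function(data, cols = 1) {
  data[do.call("order", data[, cols, drop = FALSE]), ]
}

# With a single column, plain indexing drops to a vector ...
is.data.frame(dat[, "var1"])                # FALSE
# ... while drop = FALSE keeps a one-column data frame (a list).
is.data.frame(dat[, "var1", drop = FALSE])  # TRUE

sorted_one <- order_by_cols(dat, "var1")
sorted_two <- order_by_cols(dat, c("var1", "var2"))
rownames(sorted_one)  # "2" "4" "1" "3" (ties keep their original order)
rownames(sorted_two)  # "4" "2" "3" "1"
```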
It's a bit easier with dplyr, if that's something you might consider:
library(dplyr)
dat %>% arrange(across(all_of(cols)))
dat %>% arrange_at(cols) # though this method has been superseded by the above line
I often have the problem that R converts my one column data frames into character vectors, which I solve by using the drop=FALSE option.
However, there are some cases where I do not know how to work around this kind of behavior in R, and this is one of them.
I have a data frame like the following:
mydf <- data.frame(ID=LETTERS[1:3], value1=paste(LETTERS[1:3], 1:3), value2=paste(rev(LETTERS)[1:3], 1:3))
that looks like:
> mydf
ID value1 value2
1 A A 1 Z 1
2 B B 2 Y 2
3 C C 3 X 3
The task here is to replace spaces with _ in every column except the first, and I want to use an apply-family function for this, sapply in this case.
I do the following:
new_df <- as.data.frame(sapply(mydf[,-1,drop=F], function(x) gsub("\\s+","_",x)))
new_df <- cbind(mydf[,1,drop=F], new_df)
The resulting data frame looks exactly how I want it:
> new_df
ID value1 value2
1 A A_1 Z_1
2 B B_2 Y_2
3 C C_3 X_3
My problem starts with some rare cases where my input can have one row of data only. For some reason I never understood, R has a completely different behavior in these cases, but no drop=FALSE option can save me here...
My input data frame now is:
mydf <- data.frame(ID=LETTERS[1], value1=paste(LETTERS[1], 1), value2=paste(rev(LETTERS)[1], 1))
which looks like:
> mydf
ID value1 value2
1 A A 1 Z 1
However, when I apply the same code, my resulting data frame looks hideous, like this:
> new_df
ID sapply(mydf[, -1, drop = F], function(x) gsub("\\\\s+", "_", x))
value1 A A_1
value2 A Z_1
How to solve this issue so that the same line of code gives me the same kind of result for input data frames of any number of rows?
A deeper question would be: why on earth does R do this? I keep having to go back to my code whenever some new, weird input with one row or column breaks everything... Thanks!
You can solve your problem by using lapply instead of sapply, and then combining the result using do.call as follows:
new_df <- as.data.frame(lapply(mydf[,-1,drop=F], function(x) gsub("\\s+","_",x)))
new_df <- do.call(cbind, new_df)
new_df
# value1 value2
#[1,] "A_1" "Z_1"
new_df <- cbind(mydf[,1,drop=F], new_df)
#new_df
# ID value1 value2
#1 A A_1 Z_1
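A variation on the same idea, offered as a sketch rather than as this answer's method: assigning the lapply result back into the selected columns keeps the object a data frame whatever the number of rows, so no cbind step is needed. clean_spaces and mydf3 are hypothetical names introduced for the illustration.

```r
# Hypothetical helper: replace runs of whitespace with an underscore.
clean_spaces <- function(x) gsub("\\s+", "_", x)

# One-row input, as in the problematic case.
mydf <- data.frame(ID = "A", value1 = "A 1", value2 = "Z 1")
mydf[-1] <- lapply(mydf[-1], clean_spaces)  # overwrite every column but the first
mydf

# Three-row input, handled by exactly the same line.
mydf3 <- data.frame(ID = LETTERS[1:3],
                    value1 = paste(LETTERS[1:3], 1:3),
                    value2 = paste(rev(LETTERS)[1:3], 1:3))
mydf3[-1] <- lapply(mydf3[-1], clean_spaces)
mydf3
```

lapply always returns a list with one element per column, which is exactly the shape a data frame expects, so the one-row case needs no special handling.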
As for your question about the unpredictable behavior of sapply: the s in sapply stands for simplification, but the simplified result is never a data frame. It can be a list, a matrix, or a vector, depending on the input.
According to the documentation of sapply:
sapply is a user-friendly version and wrapper of lapply by default
returning a vector, matrix or, if simplify = "array", an array if
appropriate, by applying simplify2array().
On the simplify argument:
logical or character string; should the result be simplified
to a vector, matrix or higher dimensional array if possible? For
sapply it must be named and not abbreviated. The default value, TRUE,
returns a vector or matrix if appropriate, whereas if simplify =
"array" the result may be an array of “rank” (=length(dim(.))) one
higher than the result of FUN(X[[i]]).
The Details section explains behavior that looks similar to what you experienced:
Simplification in sapply is only attempted if X has length greater
than zero and if the return values from all elements of X are all of
the same (positive) length. If the common length is one the result is
a vector, and if greater than one is a matrix with a column
corresponding to each element of X.
Hadley Wickham also recommends not using sapply:
I recommend that you avoid sapply() because it tries to simplify the
result, so it can return a list, a vector, or a matrix. This makes it
difficult to program with, and it should be avoided in non-interactive
settings
He also recommends not to use apply with a data frame. See Advanced R for further explanation.
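The simplification rule quoted above can be seen directly in a small sketch. The objects three_rows, one_row, and clean_spaces are illustrative names, not part of the question.

```r
clean_spaces <- function(x) gsub("\\s+", "_", x)

three_rows <- data.frame(value1 = c("A 1", "B 2", "C 3"),
                         value2 = c("Z 1", "Y 2", "X 3"))
one_row <- three_rows[1, , drop = FALSE]

# Each column yields three values, so sapply simplifies to a 3 x 2 matrix.
res_three <- sapply(three_rows, clean_spaces)
is.matrix(res_three)  # TRUE

# Each column yields a single value, so sapply simplifies to a named vector.
res_one <- sapply(one_row, clean_spaces)
is.matrix(res_one)    # FALSE

# lapply never simplifies: it always returns one list element per column,
# which converts to a data frame with the expected shape.
as.data.frame(lapply(one_row, clean_spaces))
```

Calling as.data.frame on the named vector res_one produces a single column with one row per original column, which is the transposed-looking result shown in the question.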
You can also use the map_df function from the purrr package, which applies a function to each element of an object and returns a data frame:
library(dplyr)
library(purrr)
mydf %>%
mutate(map_df(select(cur_data(), starts_with("value")), ~ gsub("\\s", "_", .x)))
ID value1 value2
1 A A_1 Z_1
And with the original data frame:
ID value1 value2
1 A A_1 Z_1
2 B B_2 Y_2
3 C C_3 X_3
Here's a solution that replaces the original data. Not sure if this fits into your workflow, though. Notice that I used apply, which processes data frames by rows or columns.
mydf <- data.frame(ID=LETTERS[1], value1=paste(LETTERS[1], 1), value2=paste(rev(LETTERS)[1], 1))
xy <- apply(X = mydf[, -1, drop = FALSE],
MARGIN = 2,
FUN = function(x) gsub("\\s+", "_", x),
simplify = FALSE
)
xy <- do.call(cbind, xy)
xy <- as.data.frame(xy)
mydf[, -1] <- as.data.frame(xy)
mydf
ID value1 value2
1 A A_1 Z_1
I would like to expand a data frame based on all pairwise combinations of one variable while keeping the associated value of a second variable. For example:
V1 <- letters[1:2]
V2 <- 1:2
df <- data.frame(V1, V2)
I would like to return:
Var1 Var2 Var3 Var4
a a 1 1
b a 2 1
a b 1 2
b b 2 2
I can use expand.grid(df$V1, df$V1) to get all of the pairs, but I'm not sure how to include the second variable without having its values expanded also.
If we need to expand each column separately, then we can do this with Map, passing df twice as the arguments:
do.call(cbind, Map(expand.grid, df, df))
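As an alternative sketch (a different technique from the Map call above, not the answerer's method), the pairing can be made explicit by expanding row indices and then looking up both variables through them. The names idx and res are illustrative, and stringsAsFactors = FALSE is set only so the result is the same in every R version.

```r
V1 <- letters[1:2]
V2 <- 1:2
df <- data.frame(V1, V2, stringsAsFactors = FALSE)

# All pairs of row numbers; i varies fastest, as in expand.grid(df$V1, df$V1).
idx <- expand.grid(i = seq_len(nrow(df)), j = seq_len(nrow(df)))

# Look up both variables through the same row indices, so each V2 value
# stays attached to its V1 value.
res <- data.frame(Var1 = df$V1[idx$i], Var2 = df$V1[idx$j],
                  Var3 = df$V2[idx$i], Var4 = df$V2[idx$j],
                  stringsAsFactors = FALSE)
res
#   Var1 Var2 Var3 Var4
# 1    a    a    1    1
# 2    b    a    2    1
# 3    a    b    1    2
# 4    b    b    2    2
```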
In R, I have a twofold problem.
First,
I would like to transform my data from this:
d <- data.table(
person_id=1:10,
cat=letters[1:10],
group_id=c(rep(1,5),rep(2,5))
)
Into this:
d_grouped <- data.table(
group_id=1:2
)
d_grouped$Cat_grouped <- list(letters[1:5],letters[6:10] )
i.e. group my data, from person level to group level, but keeping the information on individual characteristics into a column containing list of person level characteristics for each group.
How can I do this aggregation?
Preferably a data.table solution. But it could also be a normal data.frame.
Second,
I would like to search for the presence of the elements of a vector in each list of each group.
Something like (I know this is not correct syntax):
c('a','b') %in% d_grouped$Cat_grouped
which should return another list:
list(c(T,T),c(F,F))
More broadly, I am trying to merge two lists (A and B) that both contain vectors. The match should be based on the elements of a vector in list A being present in a vector in list B. Is there any merge command based on this sub-vector logic?
To accomplish the first transformation:
d[, list(Cat_grouped=paste0(cat, collapse = ',')), group_id]
To accomplish the second, it seems as though your best bet is to leave the data in the original shape? After all
d[, c('a', 'b') %in% cat, group_id]
returns
group_id V1
1: 1 TRUE
2: 1 TRUE
3: 2 FALSE
4: 2 FALSE
All this being said, your "more broadly" question appears to be asking for something else, which I fear I have not really addressed by answering the two specific questions. Perhaps you could provide another example?
Just do it in data.table, returning a list for each by= group:
d[, .(cat_grouped=.(cat)), by=group_id]
# group_id cat_grouped
#1: 1 a,b,c,d,e
#2: 2 f,g,h,i,j
I tend to agree with @HarlandMason's answer, however, that the analysis you are doing does not require this intermediate data.table.
Base R solution using aggregate
d2 = aggregate(list(cat = d$cat), list(group = d$group_id), function(x)
as.character(x), simplify = FALSE)
d2
# group cat
#1 1 a, b, c, d, e
#2 2 f, g, h, i, j
lapply(d2$cat, function(x) c("a","b") %in% x)
#$`1`
#[1] TRUE TRUE
#$`2`
#[1] FALSE FALSE
Also consider
mylist = split(d$cat, d$group_id)
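Building on the split result, here is a base R sketch of both the element-wise lookup and a "does the group contain the whole vector" check, which is one possible reading of the sub-vector matching asked about in the question. The names needle, hits, and contains_all are illustrative, and d is rebuilt as a plain data frame so the snippet needs no packages.

```r
d <- data.frame(person_id = 1:10,
                cat = letters[1:10],
                group_id = rep(1:2, each = 5),
                stringsAsFactors = FALSE)

# One character vector of categories per group.
mylist <- split(d$cat, d$group_id)

needle <- c("a", "b")

# Element-wise presence of each needle value within each group.
hits <- lapply(mylist, function(x) needle %in% x)
hits

# TRUE for a group only if it contains every element of needle.
contains_all <- vapply(mylist, function(x) all(needle %in% x), logical(1))
contains_all
```

vapply is used for the second step so the result is always a logical vector with one entry per group.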
We can also use dplyr
library(dplyr)
d %>%
group_by(group_id) %>%
summarise(cat = list(cat))
I have a large (millions of rows) melted data.table with the usual melt-style unrolling in the variable and value columns. I need to cast the table in wide form (rolling the variables up). The problem is that the data table also has a list column called data, which I need to preserve. This makes it impossible to use reshape2 because dcast cannot deal with non-atomic columns. Therefore, I need to do the rolling up myself.
The answer from a previous question about working with melted data tables does not apply here because of the list column.
I am not satisfied with the solution I've come up with. I'm looking for suggestions for a simpler/faster implementation.
x <- LETTERS[1:3]
dt <- data.table(
x=rep(x, each=2),
y='d',
data=list(list(), list(), list(), list(), list(), list()),
variable=rep(c('var.1', 'var.2'), 3),
value=seq(1,6)
)
# Column template set up
list_template <- Reduce(
function(l, col) { l[[col]] <- col; l },
unique(dt$variable),
list())
# Expression set up
q <- substitute({
l <- lapply(
list_template,
function(col) .SD[variable==as.character(col)]$value)
l$data = .SD[1,]$data
l
}, list(list_template=list_template))
# Roll up
dt[, eval(q), by=list(x, y)]
x y var.1 var.2 data
1: A d 1 2 <list>
2: B d 3 4 <list>
3: C d 5 6 <list>
This old question piqued my curiosity, as data.table has been improved significantly since 2013.
However, even with data.table version 1.11.4
dcast(dt, x + y + data ~ variable)
still returns an error
Columns specified in formula can not be of type list
The workaround follows the general outline of jonsedar's answer:
Reshape the non-list columns from long to wide format
Aggregate the list column data grouped by x and y
Join the two partial results on x and y
but uses features of the current data.table syntax, e.g., the on parameter:
dcast(dt, x + y ~ variable)[
dt[, .(data = .(first(data))), by = .(x, y)], on = .(x, y)]
x y var.1 var.2 data
1: A d 1 2 <list>
2: B d 3 4 <list>
3: C d 5 6 <list>
The list column data is aggregated by taking the first element. This is in line with the OP's line of code
l$data = .SD[1,]$data
which also picks the first element.
I have a somewhat cheating method that might do the trick. Importantly, I assume that each x,y,list combination is unique! If not, please disregard.
I'm going to create two separate data.tables: the first is dcast without the data list objects, and the second holds only the unique data list objects and a key. Then I just merge them together to get the desired result.
require(data.table)
require(stringr)
require(reshape2)
x <- LETTERS[1:3]
dt <- data.table(
x=rep(x, each=2),
y='d',
data=list(list("a","b"), list("c","d")),
variable=rep(c('var.1', 'var.2'), 3),
value=seq(1,6)
)
# First create the dcasted datatable without the pesky list objects:
dt_nolist <- dt[,list(x,y,variable,value)]
dt_dcast <- data.table(dcast(dt_nolist,x+y~variable,value.var="value")
,key=c("x","y"))
# Second: create a datatable with only unique "groups" of x,y, list
dt_list <- dt[,list(x,y,data)]
# Rows are duplicated, so I'd like to use unique() to get rid of them, but
# unique() doesn't work when there are list objects in the data.table.
# Instead I cheat by assigning a value to each row within an x,y "group"
# that is unique within EACH group, but present within EVERY group.
# Then I simply subselect based on that value.
# I've chosen rank(), but no doubt there are other options
dt_list <- dt_list[,rank:=rank(str_c(x,y),ties.method="first"),by=str_c(x,y)]
# now keep only one row per x,y "group"
dt_list <- dt_list[rank==1]
setkeyv(dt_list,c("x","y"))
# drop the rank since we no longer need it
dt_list[,rank:=NULL]
# Finally just merge back together
dt_final <- merge(dt_dcast,dt_list)