I would like to expand a data frame based on all pairwise combinations of one variable while keeping the associated value of a second variable. For example:
V1 <- letters[1:2]
V2 <- 1:2
df <- data.frame(V1, V2)
I would like to return:
Var1 Var2 Var3 Var4
a a 1 1
b a 2 1
a b 1 2
b b 2 2
I can use expand.grid(df$V1, df$V1) to get all of the pairs, but I'm not sure how to include the second variable without having its values expanded also.
If each column needs to be expanded separately, we can do this with Map, passing 'df' twice so that expand.grid is applied column by column:
do.call(cbind, Map(expand.grid, df, df))
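As an alternative sketch (not part of the answer above), the rows can be paired by index, which keeps each value of V2 attached to its V1. The names idx and res are illustrative, and the column names Var1 to Var4 simply follow the desired output in the question.

```r
V1 <- letters[1:2]
V2 <- 1:2
df <- data.frame(V1, V2)

# All pairs of row indices; the first index varies fastest
idx <- expand.grid(i = seq_len(nrow(df)), j = seq_len(nrow(df)))

# Look up both variables through the same index pairs,
# so each V2 value stays aligned with its V1 value
res <- data.frame(
  Var1 = df$V1[idx$i],
  Var2 = df$V1[idx$j],
  Var3 = df$V2[idx$i],
  Var4 = df$V2[idx$j]
)
res
```

Indexing both columns with the same idx is what prevents V2 from being expanded independently of V1.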
I am trying to order a data frame on multiple columns. The column names are passed through a variable, i.e. a character vector.
df <- data.frame(var1 = c("b","a","b","a"), var2 = c("l","l","k","k"),
var3 = c("t","w","x","t"))
var1 var2 var3
1 b l t
2 a l w
3 b k x
4 a k t
Sorting on one column using a variable
sortvar <- "var1"
df[order(df[ , sortvar]),]
var1 var2 var3
2 a l w
4 a k t
1 b l t
3 b k x
Now, if I want to order by two columns, the above solution does not work.
sortvar <- c("var1", "var2")
df[order(df[, sortvar]), ] #does not work
I can manually order with column names:
df[with(df, order(var1, var2)),]
var1 var2 var3
4 a k t
2 a l w
3 b k x
1 b l t
But, how do I order the data frame dynamically on multiple columns using a variable with column names? I am aware of the plyr and dplyr arrange function, but I want to use base R here.
order expects multiple ordering variables as separate arguments, which is unfortunate in your case but suggests a direct solution: use do.call:
df[do.call(order, df[, sortvar]), ]
In case you’re unfamiliar with do.call: it constructs and executes a call programmatically. The following two statements are equivalent:
fun(arg1, arg2, …)
do.call(fun, list(arg1, arg2, …))
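A small sketch of that equivalence, using paste and order as stand-ins for fun. The variable names direct and via_call are illustrative only.

```r
# A direct call and the equivalent do.call form
direct   <- paste("a", "b", sep = "-")
via_call <- do.call(paste, list("a", "b", sep = "-"))
identical(direct, via_call)

# A data frame is a list of columns, so do.call passes
# each column to order() as a separate argument
df <- data.frame(var1 = c("b", "a", "b", "a"),
                 var2 = c("l", "l", "k", "k"))
same_order <- identical(do.call(order, df), order(df$var1, df$var2))
same_order
```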
It's a bit awkward, but you can use do.call() to pass each of the columns to order as a separate argument:
dat[do.call("order", dat[,cols, drop=FALSE]), ]
I added drop=FALSE in case length(cols)==1, where indexing a data.frame would otherwise return a vector instead of a list. You can wrap it in a function to make it a bit easier to use:
order_by_cols <- function(data, cols=1) {
  data[do.call("order", data[, cols, drop=FALSE]), ]
}
order_by_cols(dat, cols)
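Since dat and cols are not defined in the answer, here is a self-contained sketch that assumes dat is the data frame from the question and cols is the character vector of sort columns. The names sorted_two and sorted_one are illustrative.

```r
order_by_cols <- function(data, cols = 1) {
  # drop = FALSE keeps a one-column selection as a data frame (a list)
  data[do.call("order", data[, cols, drop = FALSE]), ]
}

# Assumed inputs: the example data from the question
dat <- data.frame(var1 = c("b", "a", "b", "a"),
                  var2 = c("l", "l", "k", "k"),
                  var3 = c("t", "w", "x", "t"))
cols <- c("var1", "var2")

sorted_two <- order_by_cols(dat, cols)    # sort by var1, then var2
sorted_one <- order_by_cols(dat, "var1")  # a single column also works
sorted_two
```

The row names of the result show the original positions, so they can be compared with the manually ordered output in the question.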
It's a bit easier with dplyr, if that's something you might consider:
library(dplyr)
dat %>% arrange(across(all_of(cols)))
dat %>% arrange_at(cols) # though this method has been superseded by the above line
I want to replace the values of one element of a list with the values of a second element of a list. Specifically,
I have a list containing multiple data sets.
Each data set has 2 variables
The variables are factors
The n'th element of the second variable of each data set needs to be replaced with the n'th element of the first variable in each data set
Also, the replaced value should be called "replaced"
dat1 <- data.frame(names1 =c("a", "b", "c", "f", "x"),values= c("val1_1", "val2_1", "val3_1", "val4_1", "val5_1"))
dat1$values <- as.factor(dat1$values)
dat2 <- data.frame(names1 =c("a", "b", "f2", "s5", "h"),values= c("val1_2", "val2_2", "val3_2", "val4_2", "val5_2"))
dat2$values <- as.factor(dat2$values)
list1 <- list(dat1, dat2)
The result should be the same list, but just with the 5th value replaced.
[[1]]
names1 values
1 a val1_1
2 b val2_1
3 c val3_1
4 f val4_1
5 replaced x
[[2]]
names1 values
1 a val1_2
2 b val2_2
3 f2 val3_2
4 s5 val4_2
5 replaced h
A base R approach using lapply. Since both columns are factors, we need to add the new levels before assigning the new values; otherwise those values would become NA.
n <- 5
lapply(list1, function(x) {
  levels(x$values) <- c(levels(x$values), as.character(x$names1[n]))
  x$values[n] <- x$names1[n]
  levels(x$names1) <- c(levels(x$names1), "replaced")
  x$names1[n] <- "replaced"
  x
})
#[[1]]
# names1 values
#1 a val1_1
#2 b val2_1
#3 c val3_1
#4 f val4_1
#5 replaced x
#[[2]]
# names1 values
#1 a val1_2
#2 b val2_2
#3 f2 val3_2
#4 s5 val4_2
#5 replaced h
Another approach is to convert both columns to character, replace the values at the required position, and convert them back to factors. However, since every data frame in the list can be huge, converting all the values to character and back to factor just to change one value could be computationally very expensive.
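For comparison, the character-conversion approach could be sketched as follows. This is only an illustration of the idea described above, not the answerer's code; it assumes both columns are factors (forced here with stringsAsFactors = TRUE) and the name res is made up.

```r
dat1 <- data.frame(names1 = c("a", "b", "c", "f", "x"),
                   values = c("val1_1", "val2_1", "val3_1", "val4_1", "val5_1"),
                   stringsAsFactors = TRUE)
dat2 <- data.frame(names1 = c("a", "b", "f2", "s5", "h"),
                   values = c("val1_2", "val2_2", "val3_2", "val4_2", "val5_2"),
                   stringsAsFactors = TRUE)
list1 <- list(dat1, dat2)

n <- 5
res <- lapply(list1, function(x) {
  x[] <- lapply(x, as.character)  # every column to character
  x$values[n] <- x$names1[n]      # copy the n-th name into values
  x$names1[n] <- "replaced"       # then overwrite the name
  x[] <- lapply(x, factor)        # back to factors (levels are rebuilt)
  x
})
res
```

It is shorter than the level-based version, but it touches every value in both columns, which is the cost the paragraph above warns about.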
Here is one option with the tidyverse. Loop through the list with map, slice the row of interest (in this case it is the last row, so n() can be used), mutate the column values, and bind the result to the original data without its last row:
library(tidyverse)
map(list1, ~ .x %>%
      slice(n()) %>%
      mutate(values = names1, names1 = 'replaced') %>%
      bind_rows(.x %>% slice(-n()), .))
#[[1]]
# names1 values
#1 a val1_1
#2 b val2_1
#3 c val3_1
#4 f val4_1
#5 replaced x
#[[2]]
# names1 values
#1 a val1_2
#2 b val2_2
#3 f2 val3_2
#4 s5 val4_2
#5 replaced h
Or it can be made more compact with fct_c from forcats, which concatenates factors and combines their levels. Here it is applied to both the 'values' and 'names1' columns:
library(forcats)
map(list1, ~ .x %>%
      mutate(values = fct_c(values[-n()], names1[n()]),
             names1 = fct_c(names1[-n()], factor('replaced'))))
Or use a similar approach in base R: loop through the list with lapply, convert the data.frame to a matrix, rbind the matrix without its last row to the values of interest, and convert back to a data.frame. Before R 4.0.0, stringsAsFactors = TRUE was the default, so the columns became factors automatically; from R 4.0.0 on it has to be passed explicitly:
lapply(list1, function(x) as.data.frame(rbind(as.matrix(x)[-5, ],
    c('replaced', as.character(x$names1[5]))), stringsAsFactors = TRUE))
I would like to compute an id variable based on the unique combination of two (or more) variables. Consider the simple example below:
# Example dataframe
mydf <- data.frame(var1 = LETTERS[c(1, 2, 1)], var2 = LETTERS[c(2, 1, 3)])
mydf
# var1 var2
# A B
# B A
# A C
Here, rows 1 and 2 should have the same id because AB and BA represent a combination of the same elements. Row 3, however, has a different id since the AC combination appears only once.
# Desired output
cbind(mydf, cid = c(1, 1, 2))
# var1 var2 cid
# A B 1
# B A 1
# A C 2
Any suggestion?
We can sort by row, create a logical vector with duplicated, and take the cumsum:
cbind(mydf, cid = cumsum(!duplicated(t(apply(mydf, 1, sort)))))
Alternatively, you could use a factor in base R:
mydf$cid <- as.numeric(factor(apply(mydf,1,function(x) paste0(sort(x), collapse = ""))))
This does not depend on the order in which equivalent rows appear in the data frame. The cumsum approach breaks when, for example, rows 2 and 3 of your data frame are swapped.
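To make the difference concrete, here is a sketch with rows 2 and 3 of the example swapped. The names mydf2, key, cid_factor, cid_cumsum and cid_match are illustrative; the match() variant is an extra suggestion, not taken from either answer.

```r
# The example data with rows 2 and 3 swapped
mydf2 <- data.frame(var1 = c("A", "A", "B"), var2 = c("B", "C", "A"))

# One key per row, built from the sorted elements: "AB", "AC", "AB"
key <- apply(mydf2, 1, function(x) paste0(sort(x), collapse = ""))

# Factor approach: rows 1 and 3 share an id
cid_factor <- as.numeric(factor(key))

# cumsum approach: row 3 is flagged as a duplicate, but it inherits
# the running count (2) instead of the id of the row it duplicates (1)
cid_cumsum <- cumsum(!duplicated(t(apply(mydf2, 1, sort))))

# Variant that numbers groups by first appearance
cid_match <- match(key, unique(key))

cbind(mydf2, cid_factor, cid_cumsum, cid_match)
```

Note that the factor approach numbers the groups by the sorted order of the keys, while match() numbers them in order of first appearance; on this data both give the same ids.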
I need an efficient way to convert the column names of a number of data frames to lowercase.
Suppose we have:
df1 <- data.frame(VAR1=c(1,2), VAR2=c("a", "b"))
df2 <- data.frame(VAR1=c(TRUE,FALSE), VAR2=c("foo", "bar"))
A simple way to get what I want is:
names(df1) <- tolower(names(df1))
names(df2) <- tolower(names(df2))
A little tedious if you have a large number of data frames, though.
I need something better.
I thought I could use get() in a loop:
my.files <- ls()
for(i in 1:2) names(get(my.files[i])) <- tolower(names(get(my.files[i])))
but it doesn't work. I couldn't find a solution using lapply() either.
Any suggestion to modify the column names of a large number of data frames without too much coding?
Here's a one-liner that uses setNames, a convenient function that sets the "names" attribute of an object and returns the modified object, so no temporary variable is needed.
for(i in ls(pattern = "df")) assign(i, setNames(get(i), tolower(names(get(i)))))
df1
# var1 var2
# 1 1 a
# 2 2 b
df2
# var1 var2
# 1 TRUE foo
# 2 FALSE bar
Generally doing this kind of get and assign routine is discouraged. It's better to just put your data.frames in a list rather than a bunch of named objects in the .GlobalEnv. In your case, you could do something like the following:
a <- list(df1 = df1, df2 = df2)
a
# $df1
# VAR1 VAR2
# 1 1 a
# 2 2 b
#
# $df2
# VAR1 VAR2
# 1 TRUE foo
# 2 FALSE bar
lapply(a, function(x) setNames(x, tolower(names(x))))
# $df1
# var1 var2
# 1 1 a
# 2 2 b
#
# $df2
# var1 var2
# 1 TRUE foo
# 2 FALSE bar
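Note that lapply returns a new list and leaves a unchanged, so the result has to be assigned. A minimal sketch, assuming the data frames already exist as separate objects: mget collects them into a named list, and list2env can optionally write the renamed versions back. The name dfs is illustrative.

```r
df1 <- data.frame(VAR1 = c(1, 2), VAR2 = c("a", "b"))
df2 <- data.frame(VAR1 = c(TRUE, FALSE), VAR2 = c("foo", "bar"))

# Collect the existing objects into a named list
dfs <- mget(c("df1", "df2"), envir = environment())

# Lowercase the column names and keep the result
dfs <- lapply(dfs, function(x) setNames(x, tolower(names(x))))
names(dfs$df1)

# Optional: overwrite the original objects with the renamed versions
# list2env(dfs, envir = environment())
```

Keeping the data frames in dfs and working on the list is usually the simpler choice; the list2env step is only needed if other code expects df1 and df2 as separate objects.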