I am trying to use the expand() function to create combinations of multiple variables named in a character vector. The following code correctly produces 27 rows of combinations when the column names are passed directly. However, when I try to pass var_list in various forms, expand() does not produce the desired 27 combinations. How can I use var_list to dynamically create combinations of multiple columns in a data frame df?
library(tidyr)
abc <- letters[1:3]
num <- c(1,2,3)
xyz <- letters[24:26]
df <- as.data.frame(cbind(abc,num,xyz))
combinations_1 <- expand(df,abc,num,xyz) #This returns 27 combinations
var_list <- c("abc","num","xyz")
combinations_2 <- expand(df,var_list) #This returns 3 combinations
combinations_3 <- expand(df,df[var_list]) #This returns 3 combinations
combinations_4 <- expand(df,noquote(var_list)) #This returns 3 combinations
We can use mget to return the values of the objects named in var_list:
expand.grid(mget(var_list))
Or, if we need to use df itself, just extract the columns with [:
expand.grid(df[var_list])
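One caveat with the expand.grid route, sketched here under the assumption that df may contain repeated values: expand.grid() crosses every element it is given, whereas tidyr::expand() works on the unique values of each column. Taking unique() per column first keeps the two in line. The name grid is only illustrative.

```r
abc <- letters[1:3]
num <- c(1, 2, 3)
xyz <- letters[24:26]
df <- data.frame(abc, num, xyz)
var_list <- c("abc", "num", "xyz")

# Reduce each selected column to its unique values, then cross them.
grid <- expand.grid(lapply(df[var_list], unique), stringsAsFactors = FALSE)
nrow(grid)
```

With the three-by-three-by-three example data this yields the same 27 rows as before; the unique() step only matters once a column repeats a value.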
Or using expand
library(dplyr)
library(tidyr)
expand(df, !!! rlang::syms(var_list))
# A tibble: 27 x 3
# abc num xyz
# <fct> <fct> <fct>
# 1 a 1 x
# 2 a 1 y
# 3 a 1 z
# 4 a 2 x
# 5 a 2 y
# 6 a 2 z
# 7 a 3 x
# 8 a 3 y
# 9 a 3 z
#10 b 1 x
# … with 17 more rows
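You can turn var_list into a list of symbols and splice it with !!!. Note that syms() comes from rlang, so rlang has to be loaded (or the call namespaced as rlang::syms).
library(tidyr)
library(rlang)
df %>%
expand(!!!syms(var_list))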
# A tibble: 27 x 3
# abc num xyz
# <fct> <fct> <fct>
# 1 a 1 x
# 2 a 1 y
# 3 a 1 z
# 4 a 2 x
# 5 a 2 y
# 6 a 2 z
# 7 a 3 x
# 8 a 3 y
# 9 a 3 z
#10 b 1 x
# ... with 17 more rows
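If the same pattern is needed in several places, the splice can be wrapped in a small helper. This is only a sketch: expand_by_name is a hypothetical name, and it assumes tidyr and rlang are installed.

```r
library(tidyr)

# Hypothetical helper: expand `data` over the columns named in `cols`.
expand_by_name <- function(data, cols) {
  expand(data, !!!rlang::syms(cols))
}

df <- data.frame(abc = letters[1:3], num = c(1, 2, 3), xyz = letters[24:26])
combos <- expand_by_name(df, c("abc", "num", "xyz"))
nrow(combos)
```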
I know there are several ways to create a column based on another column, however I would like to know how to do it while creating a data frame.
For example this works but is not the way I want to use it.
v1 = rnorm(10)
sample_df <- data.frame(v1 = v1,
cs = cumsum(v1))
This does not work:
sample_df2 <- data.frame(v2 = rnorm(10),
cs = cumsum(v2))
Is there a way to do it directly in the data.frame function? Thanks in advance.
It cannot be done using data.frame, but package tibble implements a data.frame analogue with the functionality that you want.
library("tibble")
tib <- tibble(x = 1:6, y = cumsum(x))
tib
# # A tibble: 6 × 2
# x y
# <int> <int>
# 1 1 1
# 2 2 3
# 3 3 6
# 4 4 10
# 5 5 15
# 6 6 21
In most cases, the resulting object (called a "tibble") can be treated as if it were a data frame, but if you truly need a data frame, then you can do this:
dat <- as.data.frame(tib)
dat
# x y
# 1 1 1
# 2 2 3
# 3 3 6
# 4 4 10
# 5 5 15
# 6 6 21
You can wrap everything in a function if you like:
f <- function(...) as.data.frame(tibble(...))
f(x = 1:6, y = cumsum(x))
# x y
# 1 1 1
# 2 2 3
# 3 3 6
# 4 4 10
# 5 5 15
# 6 6 21
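If an extra package is not wanted, a base R sketch gets close: build the data frame with the first column, then let transform() evaluate the second expression with v2 in scope. This is not a single data.frame() call, so it only approximates what was asked.

```r
set.seed(1)  # only to make the sketch reproducible

# transform() evaluates cumsum(v2) inside the freshly built data frame.
sample_df2 <- transform(data.frame(v2 = rnorm(10)), cs = cumsum(v2))
str(sample_df2)
```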
The data frame x has a column in which the values are periodic. For each unique value in that column, I want to calculate the sum of the second column. If x is something like this:
x <- data.frame(a=c(1:2,1:2,1:2),b=c(1,4,5,2,3,4))
a b
1 1 1
2 2 4
3 1 5
4 2 2
5 1 3
6 2 4
The output I want is the following data frame:
a b
1 9
2 10
Using aggregate as follows will get you your desired result:
aggregate(b ~ a, x, sum)
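For reference, here is the call run end to end on the data from the question, with a tapply() cross-check. The name res is illustrative.

```r
x <- data.frame(a = c(1:2, 1:2, 1:2), b = c(1, 4, 5, 2, 3, 4))

# Formula interface: sum b within each level of a.
res <- aggregate(b ~ a, data = x, FUN = sum)
res
#   a  b
# 1 1  9
# 2 2 10

# Cross-check with tapply(), which returns a named vector instead.
tapply(x$b, x$a, sum)
```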
Here is the option with dplyr
library(dplyr)
x %>%
group_by(a) %>%
summarise(b = sum(b))
# A tibble: 2 x 2
# a b
# <int> <dbl>
#1 1 9.00
#2 2 10.0
In the 2nd column I need a running count of how many times each observation has appeared so far in the 1st column. Here, the 2nd column should have the values 1,2,3,1,2,1,2. This code doesn't work.
x <- c(11,11,11,22,22,33,33)
y <- c(1,1,1,1,1,1,1)
df <- data.frame(x,y)
i <- 1
for (i in 1:(nrow(df)-1)){
if(df[i+1,1] == df[i,1]){df[i+1,2] <- 2}
if(df[i+2,1] == df[i,1]){df[i+2,2] <- 3}
else df[i+2,2] <- 1}
Here's a solution:
require(dplyr)
df %>% group_by(x) %>% mutate(z=cumsum(y)) %>% ungroup()
# A tibble: 7 × 3
# x y z
# <dbl> <dbl> <dbl>
# 1 11 1 1
# 2 11 1 2
# 3 11 1 3
# 4 22 1 1
# 5 22 1 2
# 6 33 1 1
# 7 33 1 2
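The cumsum(y) approach relies on y being all ones. A sketch that does not need the helper column, assuming dplyr is available, uses row_number() inside the grouped mutate:

```r
library(dplyr)

df <- data.frame(x = c(11, 11, 11, 22, 22, 33, 33))

out <- df %>%
  group_by(x) %>%
  mutate(z = row_number()) %>%  # position of each row within its group
  ungroup()
out$z
```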
You can try base R (here x is the vector from the question, i.e. df$x):
unlist(by(x, x, seq_along), use.names = FALSE)
# or
ave(x, x, FUN = seq_along)
You could also use sequence and rle, like
sequence(rle(x)$lengths)
[1] 1 2 3 1 2 1 2
So
df$y <- sequence(rle(x)$lengths)
rle produces an object with the lengths of runs of repeated values as well as the values themselves. By feeding these lengths to sequence, we can produce counts from 1 to each of these values.
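To make the pieces visible, the sketch below prints the two components of the rle object and then shows one caveat: rle() only sees consecutive runs, so on input that is not grouped by value it gives a different answer from ave(). The vector x2 is a made-up example.

```r
x <- c(11, 11, 11, 22, 22, 33, 33)
r <- rle(x)
r$lengths                        # 3 2 2
r$values                         # 11 22 33
run_count <- sequence(r$lengths)
run_count                        # 1 2 3 1 2 1 2

# Unsorted input: three runs of length one, but 11 occurs twice.
x2 <- c(11, 22, 11)
by_run <- sequence(rle(x2)$lengths)       # 1 1 1
by_value <- ave(x2, x2, FUN = seq_along)  # 1 1 2
```

If the first column is not guaranteed to be sorted, the ave() form is the safer of the two.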
I have two data frames. Data frame A has many observations/rows, an ID for each observation, and many additional columns. For a subset of observations X, the values for a set of columns are missing/NA. Data frame B contains a subset of the observations in X (which can be matched across data frames using the ID) and variables with identical names as in data frame A, but containing values to replace the missing values in the set of columns with missing/NA.
My code below (using a join operation) merely adds columns rather than replacing missing values. For each of the additional variables (let's name them W) in B, the resulting table produces W.x and W.y.
library(dplyr)
foo <- data.frame(id = seq(1:6), x = c(NA, NA, NA, 1, 3, 8), z = seq_along(10:15))
bar <- data.frame(id = seq(1:2), x = c(10, 9))
dplyr::left_join(x = foo, y = bar, by = "id")
I am trying to replace the missing values in A using the values in B based on the ID, but do so in an efficient manner since I have many columns and many rows. My goal is this:
id x z
1 1 10 1
2 2 9 2
3 3 NA 3
4 4 1 4
5 5 3 5
6 6 8 6
One thought was to use ifelse() after joining, but typing out ifelse() functions for all of the variables is not feasible. Is there a way to do this simply without the database join or is there a way to apply a function across all columns ending in .x to replace the values in .x with the value in .y if the value in .x is missing?
Another attempt, which should essentially be only one assignment operation. Using @alistaire's data again:
vars <- c("x","y")
foo[vars] <- Map(pmax, foo[vars], bar[match(foo$id, bar$id), vars], na.rm=TRUE)
foo
# id x y z
#1 1 10 1 1
#2 2 9 2 2
#3 3 NA 3 3
#4 4 1 4 4
#5 5 3 5 5
#6 6 8 6 6
EDIT
Updating the answer to use @alistaire's example data frame.
We can extend the same answer given below using mapply so that it can handle multiple columns for both foo and bar.
Finding out common columns between two dataframes and sorting them so they are in the same order.
vars <- sort(intersect(names(foo), names(bar))[-1])
foo[vars] <- mapply(function(x, y) {
ind = is.na(x)
replace(x, ind, y[match(foo$id[ind], bar$id)])
}, foo[vars], bar[vars])
foo
# id x y z
#1 1 10 1 1
#2 2 9 2 2
#3 3 NA 3 3
#4 4 1 4 4
#5 5 3 5 5
#6 6 8 6 6
Original Answer
I think this does what you are looking for:
foo[-1] <- sapply(foo[-1], function(x) {
ind = is.na(x)
replace(x, ind, bar$x[match(foo$id[ind], bar$id)])
})
foo
# id x z
#1 1 10 1
#2 2 9 2
#3 3 NA 3
#4 4 1 4
#5 5 3 5
#6 6 8 6
For every column (except id) we find the missing value in foo and replace it with corresponding values from bar.
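Note that the snippet above always takes its replacement from bar$x. A minimal base R sketch that instead pulls from the same-named column in bar, and computes the id match only once, could look like this. The names shared, idx and miss are illustrative.

```r
foo <- data.frame(id = 1:6, x = c(NA, NA, NA, 1, 3, 8), z = 1:6)
bar <- data.frame(id = 1:2, x = c(10, 9))

# Columns present in both data frames, excluding the key.
shared <- setdiff(intersect(names(foo), names(bar)), "id")
# Row of bar for each row of foo (NA where the id is absent from bar).
idx <- match(foo$id, bar$id)

for (v in shared) {
  miss <- is.na(foo[[v]])
  foo[[v]][miss] <- bar[[v]][idx[miss]]
}
foo$x
```

Rows whose id does not occur in bar simply stay NA, as in the desired output.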
If you don't mind verbose base R approaches, then you can easily accomplish this using merge() and a careful subsetting of your data frame.
df <- merge(foo, bar, by="id", all.x=TRUE)
names(df) <- c("id", "x", "z", "y")
df$x[is.na(df$x)] <- df$y[is.na(df$x)]
df <- df[c("id", "x", "z")]
> df
id x z
1 1 10 1
2 2 9 2
3 3 NA 3
4 4 1 4
5 5 3 5
6 6 8 6
You can iterate dplyr::coalesce over the intersect of non-grouping columns. It's not elegant, but it should scale reasonably well:
library(tidyverse)
foo <- data.frame(id = seq(1:6),
x = c(NA, NA, NA, 1, 3, 8),
y = 1:6, # add extra shared variable
z = seq_along(10:15))
bar <- data.frame(id = seq(1:2),
y = c(1L, NA),
x = c(10, 9))
# names of non-grouping variables in both
vars <- intersect(names(foo), names(bar))[-1]
foobar <- left_join(foo, bar, by = 'id')
foobar <- vars %>%
map(paste0, c('.x', '.y')) %>% # make list of columns to coalesce
map(~foobar[.x]) %>% # for each set, subset foobar to a two-column data.frame
invoke_map(.f = coalesce) %>% # ...and coalesce it into a vector
set_names(vars) %>% # add names to list elements
bind_cols(foobar) %>% # bind into data.frame and cbind to foobar
select(union(names(foo), names(bar))) # drop duplicated columns
foobar
#> # A tibble: 6 x 4
#> id x y z
#> <int> <dbl> <int> <int>
#> 1 1 10 1 1
#> 2 2 9 2 2
#> 3 3 NA 3 3
#> 4 4 1 4 4
#> 5 5 3 5 5
#> 6 6 8 6 6
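A shorter route may be dplyr::rows_patch(), which overwrites only the NA cells of the first data frame with values from matching rows of the second. This sketch assumes dplyr 1.0.0 or later, that every column of bar also exists in foo, and that every id in bar is present in foo (by default the call errors otherwise). It uses the data from the question.

```r
library(dplyr)

foo <- data.frame(id = 1:6, x = c(NA, NA, NA, 1, 3, 8), z = 1:6)
bar <- data.frame(id = 1:2, x = c(10, 9))

# Fill only the NA cells of foo from the rows of bar with the same id.
patched <- rows_patch(foo, bar, by = "id")
patched$x
```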
Let's say I have a data.frame that looks like this:
Variable X Y Z
A 2 5 3
B 4 3 2
C 5 1 5
B 6 2 4
C 2 5 2
Using dplyr or any other suitable package, I would like to group by every single variable, compare it to the rest of the variables pooled together and compute a mathematical operation between the two resulting groups, let's say the sum along columns. I would get something like this:
Variable X Y Z
A 2 5 3
rest 17 11 13
Variable X Y Z
B 10 5 6
rest 9 11 10
Variable X Y Z
C 7 6 7
rest 12 10 9
I have a large data.frame with hundreds of variables, so I would also like to do it in an iterative way. Any suggestion would be of great help. Thank you very much in advance.
Something along these lines? (You can then subset from the resulting list.)
l = lapply(unique(df$Variable), function(x) rbind(colSums(df[df$Variable == x,][c("X", "Y", "Z")]),
colSums(df[df$Variable != x,][c("X", "Y", "Z")])))
#[[1]]
# X Y Z
#[1,] 2 5 3
#[2,] 17 11 13
#[[2]]
# X Y Z
#[1,] 10 5 6
#[2,] 9 11 10
#[[3]]
# X Y Z
#[1,] 7 6 7
#[2,] 12 10 9
names(l) = LETTERS[1:3]
l = lapply(l, function(x){rownames(x) = c("Variable", "Rest");x})
list2env(l, .GlobalEnv) # this should load each matrix into the global environment as a separate object
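With hundreds of columns it may be cheaper to compute the per-group sums once and obtain each "rest" row by subtracting from the column totals. The sketch below assumes the operation is a sum (the subtraction trick does not carry over to other statistics); group_sums, total and result are illustrative names.

```r
df <- data.frame(Variable = c("A", "B", "C", "B", "C"),
                 X = c(2, 4, 5, 6, 2),
                 Y = c(5, 3, 1, 2, 5),
                 Z = c(3, 2, 5, 4, 2))

num_cols <- setdiff(names(df), "Variable")
m <- as.matrix(df[num_cols])

group_sums <- rowsum(m, df$Variable)  # one row per value of Variable
total <- colSums(m)                   # sums over all rows

result <- lapply(rownames(group_sums), function(v) {
  rbind(group = group_sums[v, ], rest = total - group_sums[v, ])
})
names(result) <- rownames(group_sums)

result$B  # group: X = 10, Y = 5, Z = 6; rest: X = 9, Y = 11, Z = 10
```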
If you want to go full tidyverse
library(tidyverse)
df <- tibble(Variable = c("A","B","C","B","C"),
X = c(2,4,5,6,2),
Y = c(5,3,1,2,5),
Z = c(3,2,5,4,2))
group_summary <- function(data, var) {
data %>%
group_by_(group = ~ if_else(grepl(var, Variable), var, "rest")) %>%
summarise_each_(funs(sum),~-Variable) %>%
rename_(.dots = setNames(c("group"), c("Variable")))
}
map(unique(df$Variable), ~group_summary(df, .x))
[[1]]
# A tibble: 2 × 4
Variable X Y Z
<chr> <dbl> <dbl> <dbl>
1 A 2 5 3
2 rest 17 11 13
[[2]]
# A tibble: 2 × 4
Variable X Y Z
<chr> <dbl> <dbl> <dbl>
1 B 10 5 6
2 rest 9 11 10
[[3]]
# A tibble: 2 × 4
Variable X Y Z
<chr> <dbl> <dbl> <dbl>
1 C 7 6 7
2 rest 12 10 9
If you want a different output than a list, you can explore the other map functions (e.g. map_df) and the use of tibbles.
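The underscore verbs (group_by_, summarise_each_, rename_) and funs() used above are deprecated in recent dplyr releases. A sketch of the same idea with current verbs, assuming dplyr 1.0.0 or later, is given here. It compares with == rather than grepl(), so it matches the value exactly instead of as a pattern; the names results and combined are illustrative.

```r
library(dplyr)

df <- tibble(Variable = c("A", "B", "C", "B", "C"),
             X = c(2, 4, 5, 6, 2),
             Y = c(5, 3, 1, 2, 5),
             Z = c(3, 2, 5, 4, 2))

group_summary <- function(data, var) {
  data %>%
    # Relabel every row that is not `var` as "rest", then sum per label.
    mutate(Variable = if_else(Variable == var, var, "rest")) %>%
    group_by(Variable) %>%
    summarise(across(everything(), sum))
}

results <- lapply(unique(df$Variable), function(v) group_summary(df, v))
names(results) <- unique(df$Variable)

# One long table instead of a list, with a column saying which value was singled out.
combined <- bind_rows(results, .id = "target")
```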