Average values from multiple data frames by position [duplicate]

I have two dataframes:
dataA <- data.frame(A = replicate(5, 1), B = replicate(5, 2))
dataB <- data.frame(A = replicate(5, 3), B = replicate(5, 4))
I would like to create a third data frame dataC that is the average of the other two. For example, row 1 column 1 in the third data frame would be the average of the same position in the first two data frames.
Desired output:
dataC <- data.frame(A = replicate(5, 2), B = replicate(5, 3))
dataC
A B
2 3
2 3
2 3
2 3
2 3

We can place the datasets in a list, do elementwise addition with `+` via Reduce, and divide by the length of the list:
Reduce(`+`, list(dataA, dataB))/2
Output:
A B
1 2 3
2 2 3
3 2 3
4 2 3
5 2 3
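The same idea scales to any number of data frames; a minimal sketch, dividing by the length of the list as described above (dfs is a hypothetical list of same-shaped numeric data frames):
dfs <- list(dataA, dataB)
Reduce(`+`, dfs) / length(dfs)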
Another option is to bind the datasets row-wise while creating a grouping column, regroup by the within-dataset row number, and then take group means:
library(dplyr)
library(data.table)
bind_rows(dataA, dataB, .id = 'grp') %>%
  group_by(grp = rowid(grp)) %>%
  summarise(across(everything(), mean)) %>%
  select(-grp)
Output:
# A tibble: 5 x 2
A B
<dbl> <dbl>
1 2 3
2 2 3
3 2 3
4 2 3
5 2 3

Here are some solutions:
# method 1:
dataC <- (dataA + dataB) / 2
# method 2:
dataC <- dataA
dataC[] <- Map(function(x,y) (x+y)/2, dataA, dataB)
# A B
# 1 2 3
# 2 2 3
# 3 2 3
# 4 2 3
# 5 2 3
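If some cells may be NA, here is a base R sketch that stacks the frames into a 3-d array and averages elementwise (assumption: all columns are numeric):
dfs <- list(dataA, dataB)
arr <- simplify2array(lapply(dfs, as.matrix))
as.data.frame(apply(arr, c(1, 2), mean, na.rm = TRUE))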

is it possible to filter rows of one dataframe based on another dataframe?
I have these two data frames:
df_node <- data.frame(id = c("a","b","c","d","e","f","g","h","i"),
                      group = c(1,1,1,2,2,2,3,3,3))
df_link <- data.frame(from = c("a","d","f","i","b"),
                      to = c("d","f","i","b","h"))
I would like to drop the rows whose id values are not present in the second data frame.
Here is a basic way to do that:
df_node <- data.frame(id = c("a","b","c","d","e","f","g","h","i"),
                      group = c(1,1,1,2,2,2,3,3,3))
df_link <- data.frame(from = c("a","d","f","i","b"),
                      to = c("d","f","i","b","h"))
library(dplyr)
df_result <- df_node %>%
  filter(id %in% c(df_link$from, df_link$to))
df_result
# > df_result
# id group
# 1 a 1
# 2 b 1
# 3 d 2
# 4 f 2
# 5 h 3
# 6 i 3
We could use a semi_join:
library(dplyr)
df_node |>
  semi_join(tibble(id = c(df_link$from, df_link$to)))
Output:
id group
1 a 1
2 b 1
3 d 2
4 f 2
5 h 3
6 i 3
Here is a one-liner in base R:
df_node[df_node$id %in% unlist(df_link),]
id group
1 a 1
2 b 1
4 d 2
6 f 2
8 h 3
9 i 3
But you could also use a join:
library(dplyr)
df_uniqueID <- data.frame(id = unique(c(df_link$from, df_link$to)))
right_join(df_node, df_uniqueID)
Joining, by = "id"
id group
1 a 1
2 b 1
3 d 2
4 f 2
5 h 3
6 i 3
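For larger tables, a data.table sketch of the same filter (assuming df_node and df_link as defined above; %chin% is data.table's fast %in% for character vectors):
library(data.table)
as.data.table(df_node)[id %chin% unique(unlist(df_link, use.names = FALSE))]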

Column bind several list elements based on id variable [duplicate]

Assuming the following list:
a <- data.frame(id = 1:3, x = 1:3)
b <- data.frame(id = 3:1, y = 4:6)
my_list <- list(a, b)
my_list
# [[1]]
# id x
# 1 1 1
# 2 2 2
# 3 3 3
# [[2]]
# id y
# 1 3 4
# 2 2 5
# 3 1 6
I now want to column bind the list elements into a data frame/tibble while matching the respective rows based on the id variable, i.e. the outcome should be:
# A tibble: 3 x 3
id x y
<int> <int> <int>
1 1 1 6
2 2 2 5
3 3 3 4
I know how I can do it with some pivoting, but I'm wondering if there's a smarter way. I had hoped there was some binding function that simply allows for specifying an id column.
Current approach:
library(tidyverse)
my_list %>%
  bind_rows() %>%
  pivot_longer(cols = -id) %>%
  filter(!is.na(value)) %>%
  pivot_wider()
You can reduce the list with a full join:
library(dplyr)
library(purrr)
reduce(my_list, full_join, by = 'id')
id x y
1 1 1 6
2 2 2 5
3 3 3 4
If it's only 2 data frames:
invoke(full_join, my_list, by='id')
id x y
1 1 1 6
2 2 2 5
3 3 3 4
If you are using base R, any of the following should work:
Reduce(merge, my_list)
do.call(merge, my_list)
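Note that do.call(merge, my_list) only works here because the list has exactly two elements, which merge() takes as x and y; Reduce(merge, my_list) scales to longer lists. A sketch with a third, hypothetical data frame c2:
c2 <- data.frame(id = 1:3, w = 7:9)
Reduce(function(x, y) merge(x, y, by = 'id'), list(a, b, c2))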

repeat dataframe n times whilst adding column

This is my reproducible code:
library(dplyr)
df <- data.frame(x = c(1, 2), y = c(3, 4))
df1 <- df %>% mutate(z = 1)
df2 <- df %>% mutate(z = 2)
df3 <- df %>% mutate(z = 3)
df <- rbind(df1, df2, df3)
df
I repeat the original data frame df 3 times, adding a column z whose value indicates the repetition. In my use case I have to do this more than 3 times. I could use a loop, but is there a neater way? I guess I cannot use expand.grid.
You can also do it with a merge:
dfz <- data.frame(z = 1:3)
merge(df, dfz)
# x y z
# 1 1 3 1
# 2 2 4 1
# 3 1 3 2
# 4 2 4 2
# 5 1 3 3
# 6 2 4 3
We can create a list column and unnest:
library(tidyverse)
df %>%
  mutate(z = list(1:3)) %>%
  unnest(cols = z) %>%
  arrange(z)
# x y z
#1 1 3 1
#2 2 4 1
#3 1 3 2
#4 2 4 2
#5 1 3 3
#6 2 4 3
We can also do a cross join with sqldf, which creates the Cartesian product of the df and reps tables:
library(sqldf)
reps <- data.frame(z = 1:3)
sqldf("select * from df, reps order by z")
or simply with map_dfr from purrr:
library(purrr)
map_dfr(1:3, ~cbind(df, z = .))
Output:
x y z
1 1 3 1
2 2 4 1
3 1 3 2
4 2 4 2
5 1 3 3
6 2 4 3
Yet another option using base R:
n <- 3
do.call(rbind,
        Map(`[<-`,
            replicate(n = n, expr = df, simplify = FALSE),
            "z",
            value = seq_len(n)))
# x y z
#1 1 3 1
#2 2 4 1
#3 1 3 2
#4 2 4 2
#5 1 3 3
#6 2 4 3
A few other ways not covered yet:
# setup
df = data.frame(x = c(1, 2), y = c(3, 4))
n = 3
# simple row indexing, add column manually
result = df[rep(1:nrow(df), n), ]
result$id = rep(1:n, each = nrow(df))
# cross join in base
merge(df, data.frame(id = 1:n), by = NULL)
# cross join in tidyr
tidyr::crossing(df, data.frame(id = 1:n))
# dplyr version of the row-index method above
library(dplyr)
slice(df, rep(1:n(), n)) %>% mutate(id = rep(1:n, each = nrow(df)))
Inspiration drawn heavily from an old question of mine, How can I repeat a data frame?. Basically the same question but without the id column requirement.
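One more option worth mentioning: data.table::rbindlist can add the repetition index itself via its idcol argument, a compact sketch equivalent to the map_dfr approach above (df and n as in the setup):
library(data.table)
rbindlist(rep(list(df), n), idcol = "z")
Note that rbindlist places the id column z first rather than last.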

In R, split a dataframe so subset dataframes contain last row of previous dataframe and first row of subsequent dataframe

There are many answers for how to split a dataframe, for example How to split a data frame?
However, I'd like to split a dataframe so that the smaller dataframes contain the last row of the previous dataframe and the first row of the following dataframe.
Here's an example
n <- 1:9
group <- rep(c("a","b","c"), each = 3)
df <- data.frame(n = n, group)
df
n group
1 1 a
2 2 a
3 3 a
4 4 b
5 5 b
6 6 b
7 7 c
8 8 c
9 9 c
I'd like the output to look like:
d1 <- data.frame(n = 1:4, group = c(rep("a",3),"b"))
d2 <- data.frame(n = 3:7, group = c("a",rep("b",3),"c"))
d3 <- data.frame(n = 6:9, group = c("b",rep("c",3)))
d <- list(d1, d2, d3)
d
[[1]]
n group
1 1 a
2 2 a
3 3 a
4 4 b
[[2]]
n group
1 3 a
2 4 b
3 5 b
4 6 b
5 7 c
[[3]]
n group
1 6 b
2 7 c
3 8 c
4 9 c
What is an efficient way to accomplish this task?
Suppose DF is the original data.frame, the one with columns n and group, and let n be the number of rows in DF. Define a function extract which, given a sequence of indexes ix, enlarges it to include the index just before the first and just after the last, and then returns those rows of DF. Finally, split the vector 1, ..., n by group and apply extract to each component of the split.
n <- nrow(DF)
extract <- function(ix) DF[seq(max(1, min(ix) - 1), min(n, max(ix) + 1)), ]
lapply(split(seq_len(n), DF$group), extract)
$a
n group
1 1 a
2 2 a
3 3 a
4 4 b
$b
n group
3 3 a
4 4 b
5 5 b
6 6 b
7 7 c
$c
n group
6 6 b
7 7 c
8 8 c
9 9 c
Or why not try good ol' by, which "[a]ppl[ies] a Function to a Data Frame Split by Factors [INDICES]".
by(data = df, INDICES = df$group, function(x) {
  id <- c(min(x$n) - 1, x$n, max(x$n) + 1)
  na.omit(df[id, ])
})
# df$group: a
# n group
# 1 1 a
# 2 2 a
# 3 3 a
# 4 4 b
# --------------------------------------------------------------------------------
# df$group: b
# n group
# 3 3 a
# 4 4 b
# 5 5 b
# 6 6 b
# 7 7 c
# --------------------------------------------------------------------------------
# df$group: c
# n group
# 6 6 b
# 7 7 c
# 8 8 c
# 9 9 c
Although the print method of by creates a 'fancy' output, the (default) result is a list, with elements named by the levels of the grouping variable (just try str and names on the resulting object).
I was going to comment under @cdetermans answer, but it's too late now.
You can generalize his approach using data.table::shift (or dplyr::lag) to find the group-boundary indices and then run a simple lapply over the ranges, something like:
library(data.table) # v1.9.6+
indx <- setDT(df)[, which(group != shift(group))]
lapply(Map(`:`, c(1L, indx - 1L), c(indx, nrow(df))), function(x) df[x, ])
# [[1]]
# n group
# 1: 1 a
# 2: 2 a
# 3: 3 a
# 4: 4 b
#
# [[2]]
# n group
# 1: 3 a
# 2: 4 b
# 3: 5 b
# 4: 6 b
# 5: 7 c
#
# [[3]]
# n group
# 1: 6 b
# 2: 7 c
# 3: 8 c
# 4: 9 c
This could be done with data.frame as well, but is there ever a reason not to use data.table? It also has the option of being executed in parallel (see the sketch after the output below).
library(data.table)
n <- 1:9
group <- rep(c("a","b","c"), each = 3)
df <- data.table(n = n, group)
df[, group := factor(group)]
df[, `:=` (group_i = seq_len(.N), group_N = .N), by = "group"]
library(doParallel)
groups <- unique(df$group)
foreach(i = seq_along(groups)) %do% {
  df[group == groups[i] |
       (as.integer(group) == i + 1 & group_i == 1) |
       (as.integer(group) == i - 1 & group_i == group_N),
     c("n", "group"), with = FALSE]
}
[[1]]
n group
1: 1 a
2: 2 a
3: 3 a
4: 4 b
[[2]]
n group
1: 3 a
2: 4 b
3: 5 b
4: 6 b
5: 7 c
[[3]]
n group
1: 6 b
2: 7 c
3: 8 c
4: 9 c
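As for the parallelism mentioned above, a hypothetical setup: register a backend and swap %do% for %dopar% (only worthwhile for many or large groups):
library(doParallel)
cl <- makeCluster(2)  # assumption: 2 worker processes are acceptable
registerDoParallel(cl)
# ... run the foreach() loop above with %dopar% instead of %do% ...
stopCluster(cl)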
Here is another dplyr way:
library(dplyr)
data =
  tibble(n = n, group) %>%
  group_by(group)
firsts =
  data %>%
  slice(1) %>%
  ungroup() %>%
  mutate(new_group = lag(group)) %>%
  slice(-1)
lasts =
  data %>%
  slice(n()) %>%
  ungroup() %>%
  mutate(new_group = lead(group)) %>%
  slice(-n())
bind_rows(firsts, data, lasts) %>%
  mutate(final_group = ifelse(is.na(new_group), group, new_group)) %>%
  arrange(final_group, n) %>%
  group_by(final_group)
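The result above is a grouped tibble rather than a list; to match the list-of-data-frames output of the other answers, dplyr's group_split() could be appended. A sketch, with res as a hypothetical name for the grouped result:
res %>% group_split()  # one tibble per final_group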

R, dplyr: cumulative version of n_distinct

I have a dataframe as follows. It is ordered by column time.
Input -
df = data.frame(time = 1:20,
                grp = sort(rep(1:5, 4)),
                var1 = rep(c('A','B'), 10))
head(df,10)
time grp var1
1 1 1 A
2 2 1 B
3 3 1 A
4 4 1 B
5 5 2 A
6 6 2 B
7 7 2 A
8 8 2 B
9 9 3 A
10 10 3 B
I want to create another variable var2 which gives the number of distinct var1 values observed so far, i.e. up to that point in time, within each group grp. This is a little different from what I'd get if I were to use n_distinct.
Expected output -
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
I want to create a function say cum_n_distinct for this and use it as -
d_out = df %>%
  arrange(time) %>%
  group_by(grp) %>%
  mutate(var2 = cum_n_distinct(var1))
A dplyr solution inspired by @akrun's answer.
The logic is to set the first occurrence of each unique value of var1 to 1 and the rest to 0 within each group grp, and then apply cumsum:
df = df %>%
  arrange(time) %>%
  group_by(grp, var1) %>%
  mutate(var_temp = ifelse(row_number() == 1, 1, 0)) %>%
  group_by(grp) %>%
  mutate(var2 = cumsum(var_temp)) %>%
  select(-var_temp)
head(df,10)
Source: local data frame [10 x 4]
Groups: grp
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
Assuming stuff is ordered by time already, first define a cumulative distinct function:
dist_cum <- function(var)
sapply(seq_along(var), function(x) length(unique(head(var, x))))
Then a base solution that uses ave to create groups (note, assumes var1 is factor), and then applies our function to each group:
transform(df, var2=ave(as.integer(var1), grp, FUN=dist_cum))
A data.table solution, basically doing the same thing:
library(data.table)
(data.table(df)[, var2:=dist_cum(var1), by=grp])
And dplyr, again, same thing:
library(dplyr)
df %>% group_by(grp) %>% mutate(var2=dist_cum(var1))
Update: with your new dataset, an approach in base R:
df$var2 <- unlist(lapply(split(df, df$grp), function(x) {
  x$var2 <- 0
  indx <- match(unique(x$var1), x$var1)
  x$var2[indx] <- 1
  cumsum(x$var2)
}))
head(df,7)
# time grp var1 var2
# 1 1 1 A 1
# 2 2 1 B 2
# 3 3 1 A 2
# 4 4 1 B 2
# 5 5 2 A 1
# 6 6 2 B 2
# 7 7 2 A 2
Here's another solution using data.table that's pretty quick.
Generic Function
cum_n_distinct <- function(x, na.include = TRUE) {
  # Given a vector x, returns a corresponding vector y
  # where the ith element of y gives the number of unique
  # elements observed up to and including index i.
  # If na.include = TRUE (default), NA is counted as an
  # additional unique element; otherwise it's essentially ignored.
  temp <- data.table(x, idx = seq_along(x))
  firsts <- temp[temp[, .I[1L], by = x]$V1]
  if (na.include == FALSE) firsts <- firsts[!is.na(x)]
  y <- rep(0, times = length(x))
  y[firsts$idx] <- 1
  y <- cumsum(y)
  return(y)
}
Example Use
cum_n_distinct(c(5,10,10,15,5)) # 1 2 2 3 3
cum_n_distinct(c(5,NA,10,15,5)) # 1 2 3 4 4
cum_n_distinct(c(5,NA,10,15,5), na.include = FALSE) # 1 1 2 3 3
Solution To Your Question
d_out = df %>%
  arrange(time) %>%
  group_by(grp) %>%
  mutate(var2 = cum_n_distinct(var1))
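For what it's worth, since duplicated() marks repeat occurrences, the same semantics can be had from cumsum(!duplicated(x)). A sketch mirroring cum_n_distinct above:
cum_n_distinct2 <- function(x, na.include = TRUE) {
  # duplicated() treats NA as a value, so the first branch matches na.include = TRUE
  if (na.include) cumsum(!duplicated(x))
  else cumsum(!duplicated(x) & !is.na(x))
}
cum_n_distinct2(c(5, NA, 10, 15, 5))                     # 1 2 3 4 4
cum_n_distinct2(c(5, NA, 10, 15, 5), na.include = FALSE) # 1 1 2 3 3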
