This question already has answers here:
Simultaneously merge multiple data.frames in a list
(9 answers)
Closed 1 year ago.
Assuming the following list:
a <- data.frame(id = 1:3, x = 1:3)
b <- data.frame(id = 3:1, y = 4:6)
my_list <- list(a, b)
my_list
# [[1]]
# id x
# 1 1 1
# 2 2 2
# 3 3 3
# [[2]]
# id y
# 1 3 4
# 2 2 5
# 3 1 6
I now want to column bind the list elements into a data frame/tibble while matching the respective rows based on the id variable, i.e. the outcome should be:
# A tibble: 3 x 3
id x y
<int> <int> <int>
1 1 1 6
2 2 2 5
3 3 3 4
I know how to do it with some pivoting, but I'm wondering if there's a smarter way. Is there a binding function that simply allows specifying an id column?
Current approach:
library(tidyverse)
my_list %>%
bind_rows() %>%
pivot_longer(cols = -id) %>%
filter(!is.na(value)) %>%
pivot_wider()
With purrr's reduce() you can iteratively full_join all list elements on id:
reduce(my_list, full_join, by='id')
id x y
1 1 1 6
2 2 2 5
3 3 3 4
If it's only 2 data frames:
invoke(full_join, my_list, by='id')
id x y
1 1 1 6
2 2 2 5
3 3 3 4
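Note that invoke() has since been retired in purrr in favour of exec(); an equivalent call (a small sketch, assuming purrr 0.3.0 or later, not part of the original answer) would be:
# exec() splices the two list elements into full_join(a, b, by = 'id')
exec(full_join, !!!my_list, by = 'id')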
If you are using base R, either of the following should work (note that do.call(merge, my_list) only handles a list of exactly two data frames, while Reduce() scales to longer lists):
Reduce(merge, my_list)
do.call(merge, my_list)
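For illustration, a minimal sketch with a hypothetical extra data frame c_df shows how the Reduce() form extends to longer lists:
# c_df is a made-up third element; Reduce() merges the list pairwise on id
c_df <- data.frame(id = 1:3, w = 7:9)
Reduce(function(x, y) merge(x, y, by = "id"), list(a, b, c_df))
#   id x y w
# 1  1 1 6 7
# 2  2 2 5 8
# 3  3 3 4 9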
Related
I have a long list of vectors:
mylist <- list(a = c(1,2,3)
,b = c(2,3)
)
I would like to combine these vectors into a single two-column dataframe, where the first column (named sd) stores the vector content, and the second column (named id) stores the vector ID. The final dataframe should look as follows:
sd id
1 1 a
2 2 a
3 3 a
4 2 b
5 3 b
I imagined that bind_rows(mylist, .id = "id") would do the job, but I get the error "Tibble columns must have compatible sizes."
Using tidyr and tibble:
library(tibble)
library(tidyr)
enframe(mylist, name = "id", value = "sd") %>% unnest(sd)
# A tibble: 5 × 2
id sd
<chr> <dbl>
1 a 1
2 a 2
3 a 3
4 b 2
5 b 3
enframe() converts a named atomic vector or list to a one- or two-column data frame, and unnest() puts each element of sd on its own row.
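For reference, the intermediate result before unnest() looks like this, with sd as a list-column holding the original vectors:
enframe(mylist, name = "id", value = "sd")
# A tibble: 2 × 2
#   id    sd
#   <chr> <list>
# 1 a     <dbl [3]>
# 2 b     <dbl [2]>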
You can do this with pivot_longer:
library(tidyr)
data.frame(t(mylist)) %>%
pivot_longer(1:2) %>% unnest(1:2)
# A tibble: 5 × 2
name value
<chr> <dbl>
1 a 1
2 a 2
3 a 3
4 b 2
5 b 3
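A base R alternative not used in the answers above is stack(), which flattens a named list of vectors into a values/ind data frame; a small sketch, renaming the columns to match the requested output:
out <- stack(mylist)        # columns: values (the numbers), ind (the element names)
names(out) <- c("sd", "id")
out
#   sd id
# 1  1  a
# 2  2  a
# 3  3  a
# 4  2  b
# 5  3  b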
This question already has an answer here:
Average Cells of Two or More DataFrames
(1 answer)
Closed 1 year ago.
I have two dataframes:
dataA <- data.frame(A = replicate(5, 1), B = replicate(5, 2))
dataB <- data.frame(A = replicate(5, 3), B = replicate(5, 4))
I would like to create a third data frame dataC that is the average of the other two. For example, row 1 column 1 in the third data frame would be the average of the same position in the first two data frames.
Desired output:
dataC <- data.frame(A = replicate(5, 2), B = replicate(5, 3))
dataC
A B
2 3
2 3
2 3
2 3
2 3
We can place the datasets in a list, do an elementwise sum with + and divide by the length of the list:
Reduce(`+`, list(dataA, dataB))/2
-output
A B
1 2 3
2 2 3
3 2 3
4 2 3
5 2 3
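The same idea generalizes to any number of data frames, since the divisor is just the length of the list (a small sketch, not part of the original answer):
# element-wise mean across however many equally-shaped data frames the list holds
dfs <- list(dataA, dataB)
Reduce(`+`, dfs) / length(dfs)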
Or another option is to bind the datasets while creating a grouping column based on the row sequence, and then take the group-wise mean:
library(dplyr)
library(data.table)
bind_rows(dataA, dataB, .id = 'grp') %>%
group_by(grp = rowid(grp)) %>%
summarise(across(everything(), mean)) %>%
select(-grp)
-output
# A tibble: 5 x 2
A B
<dbl> <dbl>
1 2 3
2 2 3
3 2 3
4 2 3
5 2 3
Here are some solutions:
# method 1:
dataC <- (dataA + dataB) / 2
# method 2:
dataC <- dataA
dataC[] <- Map(function(x,y) (x+y)/2, dataA, dataB)
# A B
# 1 2 3
# 2 2 3
# 3 2 3
# 4 2 3
# 5 2 3
This question already has an answer here:
R code to assign a sequence based off of multiple variables [duplicate]
(1 answer)
Closed 3 years ago.
I have the following kind of data and I need the output shown in the second data frame:
a <- c(1,1,1,1,2,2,2,2,2,2,2)
b <- c(1,1,1,2,3,3,3,3,4,5,6)
d <- c(1,2,3,4,1,2,3,4,5,6,7)
df <- as.data.frame(cbind(a,b,d))
output <- c(1,1,1,2,1,1,1,1,2,3,4)
df_output <- as.data.frame(cbind(df,output))
I have tried cumsum but am not able to get the desired results.
The counter should reset to 1 whenever the value in column a changes. Within each group of a, rows with the same value of b should get the same number: in the 5th record column b has the value 3, so the output resets to 1; rows 5 through 8 all have b = 3, so they all get 1, and each subsequent change in b increments the output by 1.
We can group by column 'a' and then create the new column by matching 'b' against its unique values:
library(dplyr)
df2 <- df %>%
group_by(a) %>%
mutate(out = match(b, unique(b)))
df2
# A tibble: 11 x 4
# Groups: a [2]
# a b d out
# <dbl> <dbl> <dbl> <int>
# 1 1 1 1 1
# 2 1 1 2 1
# 3 1 1 3 1
# 4 1 2 4 2
# 5 2 3 1 1
# 6 2 3 2 1
# 7 2 3 3 1
# 8 2 3 4 1
# 9 2 4 5 2
#10 2 5 6 3
#11 2 6 7 4
Or another option is to convert 'b' to a factor and coerce it to integer:
df %>%
group_by(a) %>%
mutate(out = as.integer(factor(b)))
data
df <- data.frame(a, b, d)
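A base R equivalent of the same match() logic, applied group-wise with ave() (a sketch, not part of the original answer):
# ave() applies the match() trick within each group defined by 'a'
df$out <- ave(df$b, df$a, FUN = function(x) match(x, unique(x)))
df$out
# [1] 1 1 1 2 1 1 1 1 2 3 4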
So I am trying to create tables of counts of the distinct values in my data table, which looks like this:
mytable
group team num ID
1 a x 1 9
2 a x 2 4
3 a y 3 5
4 a y 4 9
5 b x 1 7
6 b y 4 4
7 b x 3 9
8 b y 2 8
The column names are group, team, num, and ID. I want an individual table containing the counts of the distinct values in each column, and I want the table names to be in the format "table_colName".
colName <- c('group','team','num','ID')
for (col in colName)
'table_'+colName <- mytable %>% group_by(col) %>% summarise(Count = n())
This generates the error "Error in grouped_df_impl(data, unname(vars), drop) : Column col is unknown".
Is there a way I can iterate through the group_by function using the columns in my data table and to save it to a new data table each time so that in this example I end up with table_group, table_team,table_num, and table_ID?
An option is to use group_by_at in combination with lapply. You need to pass the column names of mytable to lapply. The function groups by each column in turn, and the results are returned in a list.
library(dplyr)
lapply(names(mytable), function(x){
group_by_at(mytable, x)%>%summarise(Count = n()) %>% as.data.frame()
})
# [[1]]
# group Count
# 1 a 4
# 2 b 4
#
# [[2]]
# team Count
# 1 x 4
# 2 y 4
#
# [[3]]
# num Count
# 1 1 2
# 2 2 2
# 3 3 2
# 4 4 2
#
# [[4]]
# ID Count
# 1 4 2
# 2 5 1
# 3 7 1
# 4 8 1
# 5 9 3
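If you also want each table available under the table_colName naming asked for in the question, you can name the list elements accordingly (a small addition to this answer, storing the result in res):
# name the list so each per-column table is reachable as res$table_group, res$table_team, ...
res <- lapply(names(mytable), function(x){
  group_by_at(mytable, x) %>% summarise(Count = n()) %>% as.data.frame()
})
names(res) <- paste0("table_", names(mytable))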
Data:
mytable <- read.table(text=
"group team num ID
1 a x 1 9
2 a x 2 4
3 a y 3 5
4 a y 4 9
5 b x 1 7
6 b y 4 4
7 b x 3 9
8 b y 2 8",
header = TRUE, stringsAsFactors = FALSE)
try this:
mytable %>%
group_by(.dots=c('group','team','num','ID')) %>%
summarise(Count = n())
I was able to fix this with the code below. Thank you all for trying to help; I am new to coding and probably did not phrase the question well, sorry!
colName <- c('group','team','num','ID')
for (col in colName) {
tables <- paste('table',col, sep = '_')
assign(tables, mytable %>% group_by(.dots = col) %>% summarise(Count = n()))
}
A solution using data.table and lapply.
Create data
library(data.table)
dt <- read.table(text = "
group team num ID
1 a x 1 9
2 a x 2 4
3 a y 3 5
4 a y 4 9
5 b x 1 7
6 b y 4 4
7 b x 3 9
8 b y 2 8")
Code to generate results
setDT(dt)
cnms <- names(dt)   # columns to tabulate
l <- lapply(cnms, function(i) setnames(dt[, .N, get(i)], "get", i))
names(l) <- paste0("table_", cnms)
str(l)
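Each per-column count table can then be pulled from the named list, e.g.:
l$table_group
#    group N
# 1:     a 4
# 2:     b 4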
I have two data frames. Data frame A has many observations/rows, an ID for each observation, and many additional columns. For a subset of observations X, the values of a set of columns are missing/NA. Data frame B contains a subset of the observations in X (which can be matched across data frames using the ID) and variables with the same names as in data frame A, containing the values needed to fill the missing ones.
My code below (using a join operation) merely adds columns rather than replacing missing values. For each of the additional variables (let's name them W) in B, the resulting table produces W.x and W.y.
library(dplyr)
foo <- data.frame(id = seq(1:6), x = c(NA, NA, NA, 1, 3, 8), z = seq_along(10:15))
bar <- data.frame(id = seq(1:2), x = c(10, 9))
dplyr::left_join(x = foo, y = bar, by = "id")
I am trying to replace the missing values in A using the values in B based on the ID, but do so in an efficient manner since I have many columns and many rows. My goal is this:
id x z
1 1 10 1
2 2 9 2
3 3 NA 3
4 4 1 4
5 5 3 5
6 6 8 6
One thought was to use ifelse() after joining, but typing out ifelse() calls for all of the variables is not feasible. Is there a way to do this simply without the database join, or a way to apply a function across all columns ending in .x that replaces the value in .x with the value in .y whenever .x is missing?
Another attempt, which should essentially be only one assignment operation. (Note that pmax() takes the larger value when both inputs are non-missing, so it could overwrite an existing value in foo if bar holds a larger one; that does not happen with this data.) Using @alistaire's data again:
vars <- c("x","y")
foo[vars] <- Map(pmax, foo[vars], bar[match(foo$id, bar$id), vars], na.rm=TRUE)
foo
# id x y z
#1 1 10 1 1
#2 2 9 2 2
#3 3 NA 3 3
#4 4 1 4 4
#5 5 3 5 5
#6 6 8 6 6
EDIT
Updating the answer to use @alistaire's example data frame.
We can extend the original answer given below using mapply so that it handles multiple shared columns in foo and bar.
First find the common columns of the two data frames (dropping id) and sort them so they are in the same order:
vars <- sort(intersect(names(foo), names(bar))[-1])
foo[vars] <- mapply(function(x, y) {
ind = is.na(x)
replace(x, ind, y[match(foo$id[ind], bar$id)])
}, foo[vars], bar[vars])
foo
# id x y z
#1 1 10 1 1
#2 2 9 2 2
#3 3 NA 3 3
#4 4 1 4 4
#5 5 3 5 5
#6 6 8 6 6
Original Answer
I think this does what you are looking for:
foo[-1] <- sapply(foo[-1], function(x) {
ind = is.na(x)
replace(x, ind, bar$x[match(foo$id[ind], bar$id)])
})
foo
# id x z
#1 1 10 1
#2 2 9 2
#3 3 NA 3
#4 4 1 4
#5 5 3 5
#6 6 8 6
For every column of foo (except id) we find the missing values and replace them with the corresponding values from bar. (This original version assumes bar has a single value column, x; the mapply version above handles several shared columns.)
If you don't mind verbose base R approaches, you can easily accomplish this using merge() and careful subsetting of your data frame.
df <- merge(foo, bar, by="id", all.x=TRUE)  # both have an x column, so merge creates x.x and x.y
names(df) <- c("id", "x", "z", "y")         # rename x.x/x.y to x and y
df$x[is.na(df$x)] <- df$y[is.na(df$x)]      # fill missing x values from y
df <- df[c("id", "x", "z")]                 # drop the helper column
> df
id x z
1 1 10 1
2 2 9 2
3 3 NA 3
4 4 1 4
5 5 3 5
6 6 8 6
You can iterate dplyr::coalesce over the intersection of the non-grouping columns. It's not elegant, but it should scale reasonably well:
library(tidyverse)
foo <- data.frame(id = seq(1:6),
x = c(NA, NA, NA, 1, 3, 8),
y = 1:6, # add extra shared variable
z = seq_along(10:15))
bar <- data.frame(id = seq(1:2),
y = c(1L, NA),
x = c(10, 9))
# names of non-grouping variables in both
vars <- intersect(names(foo), names(bar))[-1]
foobar <- left_join(foo, bar, by = 'id')
foobar <- vars %>%
map(paste0, c('.x', '.y')) %>% # make list of columns to coalesce
map(~foobar[.x]) %>% # for each set, subset foobar to a two-column data.frame
invoke_map(.f = coalesce) %>% # ...and coalesce it into a vector
set_names(vars) %>% # add names to list elements
bind_cols(foobar) %>% # bind into data.frame and cbind to foobar
select(union(names(foo), names(bar))) # drop duplicated columns
foobar
#> # A tibble: 6 x 4
#> id x y z
#> <int> <dbl> <int> <int>
#> 1 1 10 1 1
#> 2 2 9 2 2
#> 3 3 NA 3 3
#> 4 4 1 4 4
#> 5 5 3 5 5
#> 6 6 8 6 6
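If you are on dplyr 1.0.0 or later, rows_patch() was added for this exact pattern of filling only NA cells by key; a brief sketch with the same data (not part of the original answers):
# rows_patch() overwrites only the NA values in foo with matching values from bar, keyed by id
rows_patch(foo, bar, by = "id")
#   id  x y z
# 1  1 10 1 1
# 2  2  9 2 2
# 3  3 NA 3 3
# 4  4  1 4 4
# 5  5  3 5 5
# 6  6  8 6 6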