Compare two dataframes in R - r

I have two dataframes in R and want to compare any entries of rows. I want two check if the value of the first entrie, second entrie etc. of first (any) row of the first dataframe is bigger as the entrie of the first entrie of the the first row of the second dataframe. Afterwards it should give me a TRUE if all entries are bigger and in the intervall (0,2). It looks like this.
Dataframe 1
Letter 2011 2012 2013
A 2 3 5
B 6 6 6
C 5 4 8
Dataframe 2
Letter 2011 2012 2013
A 1 1 4
C 5 5 5
Result for example like this (comparing rows A and A and C and C)
Letter 2011 2012 2013
A 1 2 1 TRUE- all ok
C 0 -1 3 FALSE- second entrie smaller of the first table and third entrie much more
bigger of the first table.

One approach could be to convert data to long format, perform an inner_join subtract values, check if all the values are in range and get the data back in wide format.
library(dplyr)
library(tidyr)
df1 %>% pivot_longer(cols = -Letter) %>%
inner_join(df2 %>% pivot_longer(cols = -Letter), by = c("Letter", "name")) %>%
mutate(value = value.x - value.y) %>%
group_by(Letter) %>%
mutate(check = all(between(value, 0, 2))) %>%
select(-value.x, -value.y) %>%
pivot_wider()
# Letter check `2011` `2012` `2013`
# <chr> <lgl> <int> <int> <int>
#1 A TRUE 1 2 1
#2 C FALSE 0 -1 3
data
df1 <- structure(list(Letter = c("A", "B", "C"), `2011` = c(2L, 6L,5L),
`2012` = c(3L, 6L, 4L), `2013` = c(5L, 6L, 8L)), row.names = c(NA, -3L),
class = "data.frame")
df2 <- structure(list(Letter = c("A", "C"), `2011` = c(1L, 5L), `2012` = c(1L,
5L), `2013` = 4:5), row.names = c(NA, -2L), class = "data.frame")

Related

Add a new column with sum of count to a dataframe according to informations from another in R

I would need help in order to add count column into a table called tab1 according to another tab2.
Here is the first tab :
tab1
Event_Groups Other_column
1 1_G1,2_G2 A
2 2_G1 B
3 4_G4 C
4 7_G5,8_G5,9_G5 D
as you can see in Event_Groups column I have 2 information (Event and Groups numbers separated by a "_"). These informations will also be found in tab2$Group and tab2$Event and the idea is for each element within rows in tab1 (separated by a comma) , to count the number of rows within tab2 where VALUE1 < 10 AND VALUE2 > 30 and then add this count into tab1 in a new column called Sum_count.
Here is the
tab2
Group Event VALUE1 VALUE2
1 G1 1 5 50 <- VALUE1 < 10 & VALUE2 > 30 : count 1
2 G1 2 6 20 <- VALUE2 < 30 : count 0
3 G2 2 50 50 <- VALUE1 > 10 : count 0
4 G3 3 0 0
5 G4 1 0 0
6 G4 4 2 40 <- VALUE1 < 10 & VALUE2 > 30 : count 1
7 G5 7 1 70 <- VALUE1 < 10 & VALUE2 > 30 : count 1
8 G5 8 4 67 <- VALUE1 < 10 & VALUE2 > 30 : count 1
9 G5 9 3 60 <- VALUE1 < 10 & VALUE2 > 30 : count 1
Example :
For instance for the first element of row1 in tab1: 1_G1
we see in tab2 (row1) that VALUE1 < 10 & VALUE2 > 30, so I count 1.
For the seconde element (row1) : 2_G2 we see in tab2 (row3) that VALUE1 > 10, so I count 0.
And here is the expected result tab1 dataframe;
Event_Groups Other_column Sum_count
1_G1,2_G2 A 1
2_G1 B 0
4_G4 C 1
7_G5,8_G5,9_G5 D 3
I dot not know if I am clear enough, do not hesitate to ask questions.
Here are the two tables in dput format if it can helps:
tab1
structure(list(Event_Groups = structure(1:4, .Label = c("1_G1,2_G2",
"2_G1", "4_G4", "7_G5,8_G5,9_G5"), class = "factor"), Other_column =
structure(1:4, .Label = c("A", "B", "C", "D"), class = "factor")),
class = "data.frame", row.names = c(NA,
-4L))
tab2
structure(list(Group = structure(c(1L, 1L, 2L, 3L, 4L, 4L, 5L,
5L, 5L), .Label = c("G1", "G2", "G3", "G4", "G5"), class = "factor"),
Event = c(1L, 2L, 2L, 3L, 1L, 4L, 7L, 8L, 9L), VALUE1 = c(5L,
6L, 50L, 0L, 0L, 2L, 1L, 4L, 3L), VALUE2 = c(50, 20, 50,
0, 0, 40, 70, 67, 60)), class = "data.frame", row.names = c(NA,
-9L))
Here is one way to do it:
library(dplyr)
library(tidyr)
tab1 %>%
mutate(Event_Groups = as.character(Event_Groups)) %>%
separate_rows(Event_Groups, sep = ",") %>%
left_join(.,
tab2 %>%
unite(col = "Event_Groups", Event, Group) %>%
mutate(count = if_else(VALUE1 < 10 & VALUE2 > 30,1L, 0L))) %>%
group_by(Other_column) %>%
summarise(Event_Groups = paste(unique(Event_Groups), collapse = ","),
Sum_count = sum(count)) %>%
select(Event_Groups, everything())
#> Joining, by = "Event_Groups"
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 4 x 3
#> Event_Groups Other_column Sum_count
#> <chr> <fct> <int>
#> 1 1_G1,2_G2 A 1
#> 2 2_G1 B 0
#> 3 4_G4 C 1
#> 4 7_G5,8_G5,9_G5 D 3
Created on 2021-07-29 by the reprex package (v0.3.0)
You can try a tidyverse
library(tidyverse)
tab1 %>%
rownames_to_column() %>%
separate_rows(Event_Groups, sep = ",") %>%
separate(Event_Groups, into = c("Event", "Group"), sep="_", convert = T) %>%
left_join(tab2 %>%
mutate(count = as.numeric(VALUE1 < 10 & VALUE2 > 30)),
by = c("Event", "Group")) %>%
unite(Event_Groups, Event, Group) %>%
group_by(rowname) %>%
summarise(Event_Groups = toString(Event_Groups),
Other_column = unique(Other_column),
count =sum(count))
# A tibble: 4 x 4
rowname Event_Groups Other_column count
<chr> <chr> <chr> <dbl>
1 1 1_G1, 2_G2 A 1
2 2 2_G1 B 0
3 3 4_G4 C 1
4 4 7_G5, 8_G5, 9_G5 D 3

Coalescing multiple columns from both the left and right side

Given the following data
df1 <- structure(list(ID = 1:3, alpha_1 = c(2L, 2L, 3L),
alpha_2 = c(1L, 2L,
3L), alpha_3 = c(4L, 4L, 2L), alpha_4 = c(3L, NA, NA), beta_1 = c(NA,
2L, NA), beta_2 = c(3L, NA, 2L), charlie_1 = c(1L, NA, 1L), charlie_2 = c(NA,
2L, NA)), class = "data.frame", row.names = c(NA, -3L))
I'm trying to coalesce all columns sharing the same initial prefix name (i.e. coalesce alpha_1, alpha_2, alpha_3, alpha_4, and coalesce beta_1 beta_2, etc.), but from both the left and right sides. That is, I want to generate two new variables, say 'alpha_left' and 'alpha_right', whose columns would be, in this example, (2, 2, 3) and (3, 4, 2) respectively (first non-missing elements from the left and right side of the dataframe).
User #akrun offered a great solution for the coalescing part here, but I'm unsure how to create two new variables from both the left and right coalesces.
Here is an option in tidyverse
Reshape to 'long' format - pivot_longer
Grouped by 'ID'
Do the summarise across the columns 'alpha' till 'charlie'
Get the column name - cur_column()
Create a tibble with the first non-NA element from the left and the right
Change the column names by appending the 'nm1' as prefix
Finally, unnest the list columns created in summarise
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
pivot_longer(cols = contains("_"),
names_to = c( ".value", "grp"), names_sep = "_") %>%
group_by(ID) %>%
summarise(across(alpha:charlie, ~ {
nm1 <- cur_column()
tbl1 <- tibble(left= .[complete.cases(.)][1],
right = rev(.)[complete.cases(rev(.))][1]);
names(tbl1) <- str_c(nm1, "_", names(tbl1))
list(tbl1)})) %>%
unnest(c(alpha, beta, charlie))
-output
# A tibble: 3 x 7
ID alpha_left alpha_right beta_left beta_right charlie_left charlie_right
<int> <int> <int> <int> <int> <int> <int>
1 1 2 3 3 3 1 1
2 2 2 4 2 2 2 2
3 3 3 2 2 2 1 1
Or using base R
lst1 <- lapply(split.default(df1[-1], sub("_\\d+$", "", names(df1)[-1])),
function(x) {
x1 <- apply(x, 1, function(y) {
y1 <- na.omit(y)
if(length(y1) > 1 ) y1[c(1, length(y1))] else y1[1]
})
if(is.vector(x1)) as.data.frame(matrix(x1)) else as.data.frame(t(x1))
})
You could also do:
df1[-1] %>%
split.default(sub("_\\d+", "", names(.))) %>%
imap_dfc(~data.frame(right = coalesce(!!!.x),
left = coalesce(!!!rev(.x))) %>%
set_names(paste(.y, names(.), sep="_")))
alpha_right alpha_left beta_right beta_left charlie_right charlie_left
1 2 3 3 3 1 1
2 2 4 2 2 2 2
3 3 2 2 2 1 1
One more approach not as elegant as #Onyambu's
library(tidyverse)
df1[-1] %>%
split.default(sub("_\\d+", "", names(.))) %>%
imap_dfc(~ .x %>% rowwise() %>%
mutate(!!paste0(.y, '_left') := head(na.omit(c_across(everything())),1),
!!paste0(.y, '_right') := tail(na.omit(c_across(!last_col())),1),
.keep = 'none' )
)
#> # A tibble: 3 x 6
#> # Rowwise:
#> alpha_left alpha_right beta_left beta_right charlie_left charlie_right
#> <int> <int> <int> <int> <int> <int>
#> 1 2 3 3 3 1 1
#> 2 2 4 2 2 2 2
#> 3 3 2 2 2 1 1
Created on 2021-06-19 by the reprex package (v2.0.0)
Another option
library(tidyverse)
df1 <- structure(list(ID = 1:3, alpha_1 = c(2L, 2L, 3L),
alpha_2 = c(1L, 2L,
3L), alpha_3 = c(4L, 4L, 2L), alpha_4 = c(3L, NA, NA), beta_1 = c(NA,
2L, NA), beta_2 = c(3L, NA, 2L), charlie_1 = c(1L, NA, 1L), charlie_2 = c(NA,
2L, NA)), class = "data.frame", row.names = c(NA, -3L))
df1 %>%
pivot_longer(cols = -ID, names_sep = "_", names_to = c(".value", "set")) %>%
group_by(ID) %>%
fill(alpha:charlie, .direction = "updown") %>%
filter(set %in% range(set)) %>%
mutate(set = c("left", "right")) %>%
pivot_wider(id_cols = ID, names_from = set, values_from = alpha:charlie)
#> # A tibble: 3 x 7
#> # Groups: ID [3]
#> ID alpha_left alpha_right beta_left beta_right charlie_left charlie_right
#> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 2 3 3 3 1 1
#> 2 2 2 4 2 2 2 2
#> 3 3 3 2 2 2 1 1
Created on 2021-06-20 by the reprex package (v2.0.0)

R function to paste information from different rows with a common column? [duplicate]

This question already has an answer here:
dplyr::first() to choose first non NA value
(1 answer)
Closed 2 years ago.
I understand we can use the dplyr function coalesce() to unite different columns, but is there such function to unite rows?
I am struggling with a confusing incomplete/doubled dataframe with duplicate rows for the same id, but with different columns filled. E.g.
id sex age source
12 M NA 1
12 NA 3 1
13 NA 2 2
13 NA NA NA
13 F 2 NA
and I am trying to achieve:
id sex age source
12 M 3 1
13 F 2 2
You can try:
library(dplyr)
#Data
df <- structure(list(id = c(12L, 12L, 13L, 13L, 13L), sex = structure(c(2L,
NA, NA, NA, 1L), .Label = c("F", "M"), class = "factor"), age = c(NA,
3L, 2L, NA, 2L), source = c(1L, 1L, 2L, NA, NA)), class = "data.frame", row.names = c(NA,
-5L))
df %>%
group_by(id) %>%
fill(everything(), .direction = "down") %>%
fill(everything(), .direction = "up") %>%
slice(1)
# A tibble: 2 x 4
# Groups: id [2]
id sex age source
<int> <fct> <int> <int>
1 12 M 3 1
2 13 F 2 2
As mentioned by #A5C1D2H2I1M1N2O1R2T1 you can select the first non-NA value in each group. This can be done using dplyr :
library(dplyr)
df %>% group_by(id) %>% summarise(across(.fns = ~na.omit(.)[1]))
# A tibble: 2 x 4
# id sex age source
# <int> <fct> <int> <int>
#1 12 M 3 1
#2 13 F 2 2
Base R :
aggregate(.~id, df, function(x) na.omit(x)[1], na.action = 'na.pass')
Or data.table :
library(data.table)
setDT(df)[, lapply(.SD, function(x) na.omit(x)[1]), id]

Pmax of columns ending with a given string

I would like to conditionally mutate a new column representing the pmax() of columns ending with "_n" for a given row. I know I can do this by explicitly specifying the column names, but I would prefer to have this be the result of a call to ends_with() or similar.
I have tried mutate_at() and plain mutate(). My general thought is that I need to pass a vars(ends_with("_n")) to something, but I'm just missing that something.
Thanks in advance.
library(dplyr)
library(tidyr)
mtcars %>%
group_by(vs, gear) %>%
summarize(mean = mean(disp),
sd = sd(disp),
n = n()) %>%
mutate_if(is.double, round, 1) %>%
mutate(mean_sd = paste0(mean, " (", sd, ")")) %>%
select(-mean, -sd) %>%
group_by(vs, gear) %>%
nest(n, mean_sd, .key = "summary") %>%
spread(key = vs, value = summary) %>%
unnest(`0`, `1`, .sep = "_")
gear `0_n` `0_mean_sd` `1_n` `1_mean_sd`
<dbl> <int> <chr> <int> <chr>
1 3 12 357.6 (71.8) 3 201 (72)
2 4 2 160 (0) 10 115.6 (38.5)
3 5 4 229.3 (113.9) 1 95.1 (NA)
edit: both answers are much appreciated. Cheers!
Here's one way using the unquote-splice operator. We can select columns that we want to compare and then splice them as vectors into pmax:
library(tidyverse)
tbl <- structure(list(gear = c(3, 4, 5), `0_n` = c(12L, 2L, 4L), `0_mean_sd` = c("357.6 (71.8)", "160 (0)", "229.3 (113.9)"), `1_n` = c(3L, 10L, 1L), `1_mean_sd` = c("201 (72)", "115.6 (38.5)", "95.1 (NA)")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
tbl %>%
mutate(pmax = pmax(!!!select(., ends_with("_n"))))
#> # A tibble: 3 x 6
#> gear `0_n` `0_mean_sd` `1_n` `1_mean_sd` pmax
#> <dbl> <int> <chr> <int> <chr> <int>
#> 1 3 12 357.6 (71.8) 3 201 (72) 12
#> 2 4 2 160 (0) 10 115.6 (38.5) 10
#> 3 5 4 229.3 (113.9) 1 95.1 (NA) 4
Created on 2019-04-23 by the reprex package (v0.2.1)
A base R version, just as an alternative:
tbl <- structure(list(gear = c(3, 4, 5), `0_n` = c(12L, 2L, 4L), `0_mean_sd` = c("357.6 (71.8)", "160 (0)", "229.3 (113.9)"), `1_n` = c(3L, 10L, 1L), `1_mean_sd` = c("201 (72)", "115.6 (38.5)", "95.1 (NA)")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
tbl$pmax <- do.call(pmax,as.list(dat[,grepl("_n$",names(dat))]))

Merge two tables based on 2 conditions and output the average as result column

I have the following two tables:
Table_1
ID Interval
1 10
1 11
2 11
and
Table_2
ID Interval Rating
1 10 0.5
1 10 0.3
1 11 0.1
2 11 0.1
2 11 0.2
The output table should look like this:
ID Interval Mean Ratings
1 10 0.4
1 11 0.1
2 11 0.15
My goal is to join both tables based on the two conditions/columns ID and interval. Given that I have several ratings for the same ID and interval, I want to compute the mean of the ratings. Whereas the IDs are unique (~9500), the interval repeats for different IDs (as seen in the table above). My current approach is the join function with 2 by arguments. How can I create a final table in which Table_1 and Table_2 are joined based on the condition ID and interval, and receive the average rating in the result column?
left_join(Table_1, Table_2, by = c("ID" = "ID", "Interval" = "Interval"))
First of all you would need to summarize second table DT2 and then perform a right join with first table DT1.
library(data.table)
DT1[DT2[, .(Mean_Rating = mean(Rating)), .(ID, Interval)], on = c(ID = "ID", Interval = "Interval")]
which gives
ID Interval Mean_Rating
1: 1 10 0.40
2: 1 11 0.10
3: 2 11 0.15
Sample data:
DT1 <- structure(list(ID = c(1L, 1L, 2L), Interval = c(10L, 11L, 11L
)), .Names = c("ID", "Interval"), class = c("data.table", "data.frame"
), row.names = c(NA, -3L))
DT2 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), Interval = c(10L,
10L, 11L, 11L, 11L), Rating = c(0.5, 0.3, 0.1, 0.1, 0.2)), .Names = c("ID",
"Interval", "Rating"), class = c("data.table", "data.frame"), row.names = c(NA,
-5L))
You can achieve it with dplyr's left_join, group_by and then summarise.
library(dplyr)
table1 %>%
left_join(table2, by = c("ID", "Interval")) %>%
group_by(ID, Interval) %>%
summarise("Mean Ratings" = mean(Rating))
## A tibble: 3 x 3
## Groups: ID [?]
# ID Interval `Mean Ratings`
# <int> <int> <dbl>
#1 1 10 0.4
#2 1 11 0.1
#3 2 11 0.15
data
table1 <- read.table(header = T, text="ID Interval
1 10
1 11
2 11")
table2 <- read.table(header = T, text = "ID Interval Rating
1 10 0.5
1 10 0.3
1 11 0.1
2 11 0.1
2 11 0.2")
You don't need join. Instead, bind your tables and use group & summarize from dplyr. The following achieves what you asked for:
library(dplyr)
table_1 <- data.frame("ID"= c(1,1,2),"Interval"=c (10,11,11),"Rating"= c(NA,NA,NA))
table_2 <- data.frame("ID"= c(1,1,1,2,2),"Interval"= c(10,10,11,11,11),"Rating"= c(0.5,0.3,0.1,0.1,0.2))
df1 <- bind_rows(table_1,table_2) %>% group_by(ID,Interval) %>% summarise("Mean Ratings" = mean(Rating,na.rm = TRUE))

Resources