How to match different ids within a single data set in R?

A sample of my data is
df1 <- read.table(text = " id1 time id2 gender id3 group id4 house
123 12 141 F 13 1 156 A
141 19 144 F 144 1 123 A
144 22 123 M 123 1 141 M
168 14 13 M 141 2 144 M
156 13 168 M 168 2 13 Q
13 11 156 F 156 2 168 Q
", header = TRUE)
I want to get the following outcome. For example, for id 123: time = 12, gender = M, group = 1 and house = A, each found by looking up 123 in the other id columns
df1 <- read.table(text = " id time gender group house
123 12 M 1 A
141 19 F 2 M
144 22 F 1 M
168 14 M 2 Q
156 13 F 2 A
13 11 M 1 Q
", header = TRUE)
I have tried left_join, but I struggled to get the outcome of interest
df1 <- left_join(id2,id3,id4 by = "id1")

You've got the folks confused here because your table is in an unusual format. Typically in R, we expect one variable per column and one observation per row. What you have is effectively four tables stuck side-by-side, where id1, id2, id3 and id4 are all actually just "id". So effectively, you are looking to left join columns 3:4 to columns 1:2, then left join columns 5:6 to that, and so on.
I'll show one way of doing that, then maybe some of the smart folks here can show you a better way:
library(dplyr)
df_list <- lapply(list(1:2, 3:4, 5:6, 7:8), function(x) df1[x])
df_list <- lapply(df_list, function(x) {names(x)[1] <- "id"; x})
df2 <- df_list[[1]] %>%
left_join(df_list[[2]]) %>%
left_join(df_list[[3]]) %>%
left_join(df_list[[4]])
df2
#> id time gender group house
#> 1 123 12 M 1 A
#> 2 141 19 F 2 M
#> 3 144 22 F 1 M
#> 4 168 14 M 2 Q
#> 5 156 13 F 2 A
#> 6 13 11 M 1 Q
Created on 2020-07-01 by the reprex package (v0.3.0)

It seems like we need a match for different 'id' columns and corresponding 'group', 'gender' etc columns
nm1 <- c('id1', 'time', 'gender', 'group', 'house')
out1 <- transform(df1, gender = gender[match(id1, id2)],
group = group[match(id1, id3)],
house = house[match(id1, id4)])[nm1]
names(out1)[1] <- 'id'
out1
# id time gender group house
#1 123 12 M 1 A
#2 141 19 F 2 M
#3 144 22 F 1 M
#4 168 14 M 2 Q
#5 156 13 F 2 A
#6 13 11 M 1 Q
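To see what match is doing in the code above, here is a minimal sketch on toy vectors copied from the first three rows of the question's data. The names pos and gender_by_id1 are only for illustration and do not appear in the answer.

```r
# First three rows of id1, id2 and gender from the question
id1    <- c(123, 141, 144)
id2    <- c(141, 144, 123)
gender <- c("F", "F", "M")

# For each value of id1, the position at which it occurs in id2
pos <- match(id1, id2)

# Reorder gender so that it lines up with id1 instead of id2
gender_by_id1 <- gender[pos]
gender_by_id1
```

Here 123 sits at position 3 of id2, so the first element of gender_by_id1 is the third gender value, "M".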
In addition to the base R approach above, an alternative to @AllanCameron's solution is to split the columns into subsets at each occurrence of an 'id' column (split.default), rename the first column of each subset to 'id', and apply left_join within reduce:
library(dplyr)
library(purrr)
df1 %>%
split.default(cumsum(startsWith(names(.), "id"))) %>%
map(~ rename_at(.x, 1, ~ 'id')) %>%
reduce(left_join, by = 'id')
# id time gender group house
#1 123 12 M 1 A
#2 141 19 F 2 M
#3 144 22 F 1 M
#4 168 14 M 2 Q
#5 156 13 F 2 A
#6 13 11 M 1 Q
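The grouping step in the split.default call can be looked at on its own. This is a small sketch using only the column names from the question; nms, grp and parts are illustrative names.

```r
# Column names of df1 from the question
nms <- c("id1", "time", "id2", "gender", "id3", "group", "id4", "house")

# TRUE at every 'id' column; the running sum starts a new group there
grp <- cumsum(startsWith(nms, "id"))
grp

# Splitting by that index pairs each id column with the column after it
parts <- split(nms, grp)
parts
```

split.default does the same thing to the columns of the data frame itself, which gives one two-column data frame per id.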

Related

Identify pairs or groups of rows that have the same values across multiple columns

Say I have a data.frame:
file = read.table(text = "sex age num
M 32 5
F 31 2
M 91 2
M 30 1
M 23 1
F 19 1
F 31 2
F 21 2
M 32 5
F 65 3
M 24 5", header = T, sep = "")
I want to get a sorted data frame of all rows that have the exact same values of sex, age, and num with any other row in the data frame.
The result should look like this (note that the data frame is sorted by the pairs or groups that are duplicated with each other):
result = read.table(text = "sex age num
M 32 5
M 32 5
F 31 2
F 31 2", header = T, sep = "")
I have tried various combinations of distinct in dplyr and duplicated, but they don't quite get at this use case.
We need duplicated twice: once in the normal direction, from top to bottom, and once from bottom to top (fromLast = TRUE). Combining the two with | marks a row as TRUE if it is a duplicate in either direction, and that logical vector is then used for subsetting
out <- file[duplicated(file)|duplicated(file, fromLast = TRUE),]
out$sex <- factor(out$sex, levels = c("M", "F"))
out1 <- out[do.call(order, out),]
row.names(out1) <- NULL
-output
> out1
sex age num
1 M 32 5
2 M 32 5
3 F 31 2
4 F 31 2
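The effect of the two duplicated calls is easier to see on a plain vector. A minimal sketch, with a made-up vector x:

```r
x <- c("a", "b", "a", "c", "b")

# Flags the second and later occurrences only
fwd <- duplicated(x)

# Scans from the end, so it flags the earlier occurrences instead
bwd <- duplicated(x, fromLast = TRUE)

# TRUE for every element that has a twin anywhere in the vector
either <- fwd | bwd
x[either]
```

Only "c" is dropped, because it is the only value that occurs once.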
The above can be written in tidyverse
library(dplyr)
file %>%
arrange(sex == "F", across(everything())) %>%
filter(duplicated(.)|duplicated(., fromLast = TRUE))
sex age num
1 M 32 5
2 M 32 5
3 F 31 2
4 F 31 2
An alternative approach:
Here all groups with more than one row are kept:
library(dplyr)
file %>%
group_by(sex, age, num) %>%
filter(n() > 1) %>%
arrange(.by_group = TRUE) %>%
ungroup()
sex age num
<chr> <int> <int>
1 F 31 2
2 F 31 2
3 M 32 5
4 M 32 5
file = read.table(text = "sex age num
M 32 5
F 31 2
M 91 2
M 30 1
M 23 1
F 19 1
F 31 2
F 21 2
M 32 5
F 65 3
M 24 5", header = T, sep = "")
library(vctrs)
library(dplyr, warn = F)
#> Warning: package 'dplyr' was built under R version 4.1.2
file %>%
filter(vec_duplicate_detect(.)) %>%
arrange(across(everything()))
#> sex age num
#> 1 F 31 2
#> 2 F 31 2
#> 3 M 32 5
#> 4 M 32 5
Created on 2022-08-19 by the reprex package (v2.0.1.9000)
A base R option using subset + ave
> subset(file, ave(seq_along(num), sex, age, num, FUN = length) > 1)
sex age num
1 M 32 5
2 F 31 2
7 F 31 2
9 M 32 5
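For readers unfamiliar with ave, here is a small sketch of what the call above computes, on a made-up grouping vector g. With FUN = length, ave returns the size of each element's group, repeated for every member of that group.

```r
g <- c("x", "y", "x", "z")

# Group size for each element, aligned with the original order
sizes <- ave(seq_along(g), g, FUN = length)
sizes

# Keep only elements whose group has more than one member
g[sizes > 1]
```

In the answer the grouping is done on sex, age and num together, so a size above 1 means the whole row is repeated.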
or rbind + split
> do.call(rbind, Filter(function(x) nrow(x) > 1, split(file, ~ sex + age + num)))
sex age num
F.31.2.2 F 31 2
F.31.2.7 F 31 2
M.32.5.1 M 32 5
M.32.5.9 M 32 5
Here is an approach, using .SD[.N>1] by group in data.table
library(data.table)
result = setDT(file)[, i:=.I][, .SD[.N>1],.(sex,age,num)][, i:=NULL]
Output:
sex age num
1: M 32 5
2: M 32 5
3: F 31 2
4: F 31 2

How to make a copy of every row and column in one table for every index of another table in R?

There are two dataframes, one with an index and another with no index. I want to make a new dataframe with the indices of the first and the rows and columns of the other in such a way that there is a copy of every data in the second table for each index.
df_A <- data.frame("index" = c("id1","id2","id3")
, variable_a = c(1,2,3)
, variable_b = c("x","f","d"))
df_B <- data.frame(variable_x = c("4124","414","123")
, variable_y = c(12,22,13)
, variable_z = c("q","w","d"))
The result should be:
df_C <- data.frame("index" = c("id1","id1","id1","id2","id2","id2","id3","id3","id3")
, variable_x = c("4124","414","123","4124","414","123","4124","414","123")
, variable_y = c(12,22,13,12,22,13,12,22,13)
, variable_z = c("q","w","d","q","w","d","q","w","d"))
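This is a cross join (a Cartesian product, rather than a full outer join on a key), which merge() produces when its two inputs share no column names: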
merge(df_B, df_A$index)
Which yields
> merge(df_B, df_A$index)
variable_x variable_y variable_z y
1 4124 12 q id1
2 414 22 w id1
3 123 13 d id1
4 4124 12 q id2
5 414 22 w id2
6 123 13 d id2
7 4124 12 q id3
8 414 22 w id3
9 123 13 d id3
You could correct the order of the columns like this:
merge(df_B, df_A$index)[,c(4, 1, 2, 3)]
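A possible variant, sketched here with cut-down versions of the question's data frames: passing a one-column data frame instead of a bare vector keeps the column name index (rather than y), and by = NULL asks merge for the Cartesian product explicitly. Selecting columns by name is my own choice here, not part of the answer above.

```r
df_A <- data.frame(index = c("id1", "id2", "id3"),
                   variable_a = c(1, 2, 3))
df_B <- data.frame(variable_x = c("4124", "414", "123"),
                   variable_y = c(12, 22, 13))

# by = NULL: every row of df_B is paired with every index value
df_C <- merge(df_B, df_A["index"], by = NULL)

# Put the index column first, selecting by name rather than by position
df_C <- df_C[c("index", names(df_B))]
df_C
```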
The same cross join can be done in dplyr, if you prefer that (from dplyr 1.1.0 on, by = character() is deprecated in favour of cross_join(df_B, df_A)):
dplyr::full_join(df_B, df_A, by = character())
Another option is to use tidyr::crossing
tidyr::crossing(df_A, df_B)
#----------
# A tibble: 9 x 6
index variable_a variable_b variable_x variable_y variable_z
<chr> <dbl> <chr> <chr> <dbl> <chr>
1 id1 1 x 123 13 d
2 id1 1 x 4124 12 q
3 id1 1 x 414 22 w
4 id2 2 f 123 13 d
5 id2 2 f 4124 12 q
6 id2 2 f 414 22 w
7 id3 3 d 123 13 d
8 id3 3 d 4124 12 q
9 id3 3 d 414 22 w
The following function, which uses dplyr, should help. Pass the data frame with the index as the first argument and the data frame without an index as the second; it returns the requested data frame.
merge_lines_with_index <- function(index_table, data_table){
df <- data.frame(matrix(ncol = ncol(data_table) + 1))
x <- names(data_table) %>% unlist()
colnames(df) <- c("index", x)
for (item in index_table %>% select(1) %>% unlist()) {
new_data <- data_table %>%
mutate("index" = item)
df <- df %>% rbind(new_data)
}
return(df[-1,])
}

How do I combine row entries for the same patient ID# in R while keeping other columns and NA values?

I need to combine some of the columns for these multiple IDs and can just use the values from the first ID listing for the others. For example here I just want to combine the "spending" column as well as the heart attack column to just say whether they ever had a heart attack. I then want to delete the duplicate ID#s and just keep the values from the first listing for the other columns:
df <- read.table(text =
"ID Age Gender heartattack spending
1 24 f 0 140
2 24 m na 123
2 24 m 1 58
2 24 m 0 na
3 85 f 1 170
4 45 m na 204", header=TRUE)
What I need:
df2 <- read.table(text =
"ID Age Gender ever_heartattack all_spending
1 24 f 0 140
2 24 m 1 181
3 85 f 1 170
4 45 m na 204", header=TRUE)
I tried group_by with transmute() and sum() as follows:
df$heartattack = as.numeric(as.character(df$heartattack))
df$spending = as.numeric(as.character(df$spending))
library(dplyr)
df = df %>% group_by(ID) %>% transmute(ever_heartattack = sum(heartattack, na.rm = T), all_spending = sum(spending, na.rm=T))
But this removes all the other columns! Also it turns NA values into zeros...for example I still want "NA" to be the value for patient ID#4, I don't want to change the data to say they never had a heart attack!
> print(dfa) #This doesn't at all match df2 :(
ID ever_heartattack all_spending
1 1 0 140
2 2 1 181
3 2 1 181
4 2 1 181
5 3 1 170
6 4 0 204
Could you do this?
aggregate(
spending ~ ID + Age + Gender,
data = transform(df, spending = as.numeric(as.character(spending))),
FUN = sum)
# ID Age Gender spending
#1 1 24 f 140
#2 3 85 f 170
#3 2 24 m 181
#4 4 45 m 204
Some comments:
The thing is that when aggregating you don't give clear rules how to deal with data in additional columns that differ (like heartattack in this case). For example, for ID = 2 why do you retain heartattack = 1 instead of heartattack = na or heartattack = 0?
Your "na"s are in fact not real NAs. That leads to spending being read as a factor column (a character column from R 4.0.0 on) instead of a numeric column.
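One way to get real NAs from the start, assuming the data is read with read.table as in the question, is to declare "na" as a missing-value string. This is only a sketch of that option; the answer's own code converts the columns after reading instead.

```r
df <- read.table(text =
"ID Age Gender heartattack spending
1 24 f 0 140
2 24 m na 123
2 24 m 1 58
2 24 m 0 na
3 85 f 1 170
4 45 m na 204", header = TRUE, na.strings = "na")

# Both columns are now numeric, with NA where the text said "na"
str(df)
```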
To exactly reproduce your expected output one can do
df %>%
mutate(
heartattack = as.numeric(as.character(heartattack)),
spending = as.numeric(as.character(spending))) %>%
group_by(ID, Age, Gender) %>%
summarise(
heartattack = ifelse(
any(heartattack %in% c(0, 1)),
max(heartattack, na.rm = T),
NA),
spending = sum(spending, na.rm = T))
## A tibble: 4 x 5
## Groups: ID, Age [?]
# ID Age Gender heartattack spending
# <int> <int> <fct> <dbl> <dbl>
#1 1 24 f 0 140
#2 2 24 m 1 181
#3 3 85 f 1 170
#4 4 45 m NA 204
This feels a bit "hacky" on account of the rules not being clear about which heartattack value to keep. In this case we keep the maximum value of heartattack if heartattack contains either 0 or 1, and return NA if it contains neither.
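The same rule can be written as a small helper, which may read more clearly than the nested ifelse. The function name ever is hypothetical, and the sketch assumes heartattack only ever holds 0, 1 or NA.

```r
# Hypothetical helper: NA if nothing was recorded, otherwise the maximum
# of the recorded values (1 if any heart attack, 0 if only 0s)
ever <- function(x) {
  if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
}

a <- ever(c(NA, 1, 0))   # patient with mixed records
b <- ever(c(NA, NA))     # patient with no usable record
c0 <- ever(0)            # patient with a single 0
c(a, b, c0)
```

Inside summarise this would be used as heartattack = ever(heartattack).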

Aggregate dataframe in rolling blocks of 3 rows

I have the following data frame as an example
df <- data.frame(score=letters[1:15], total1=1:15, total2=16:30)
> df
score total1 total2
1 a 1 16
2 b 2 17
3 c 3 18
4 d 4 19
5 e 5 20
6 f 6 21
7 g 7 22
8 h 8 23
9 i 9 24
10 j 10 25
11 k 11 26
12 l 12 27
13 m 13 28
14 n 14 29
15 o 15 30
I would like to sum my data frame over groups of rows that have different names, i.e.
groups sum1 sum2
'a-b-c' 6 51
'c-d-e' 21 60
etc
All the given answers to this kind of question assume that the strings repeat in the row.
The usual aggregate function that I use to obtain the summary delivers a different result:
aggregate(df$total1, by=list(sum1=df$score %in% c('a','b','c'), sum2=df$score %in% c('d','e','f')), FUN=sum)
sum1 sum2 x
1 FALSE FALSE 99
2 TRUE FALSE 6
3 FALSE TRUE 15
If you want a tidyverse solution, here is one possibility:
library(dplyr)
df <- data.frame(score=letters[1:15], total1=1:15, total2=16:30)
df %>%
mutate(groups = case_when(
score %in% c("a","b","c") ~ "a-b-c",
score %in% c("d","e","f") ~ "d-e-f"
)) %>%
group_by(groups) %>%
summarise_if(is.numeric, sum)
returns
# A tibble: 3 x 3
groups total1 total2
<chr> <int> <int>
1 a-b-c 6 51
2 d-e-f 15 60
3 <NA> 99 234
Add a "groups" column with the category value.
df$groups = NA
and then define each group like this:
df$groups[df$score=="a" | df$score=="b" | df$score=="c" ] = "a-b-c"
Finally aggregate by that column.
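Put together, the three steps might look like the sketch below. The use of %in% and of the formula interface of aggregate are my own choices for brevity; note that the formula interface drops rows whose group is still NA.

```r
df <- data.frame(score = letters[1:15], total1 = 1:15, total2 = 16:30)

# Step 1: empty group column
df$groups <- NA

# Step 2: label the rows that belong to each group
df$groups[df$score %in% c("a", "b", "c")] <- "a-b-c"
df$groups[df$score %in% c("d", "e", "f")] <- "d-e-f"

# Step 3: sum both totals within each labelled group
agg <- aggregate(cbind(total1, total2) ~ groups, data = df, FUN = sum)
agg
```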
Here's a solution that works for any sized data frame.
df <- data.frame(score=letters[1:15], total1=1:15, total2=16:30)
# I'm adding a row to demonstrate that the grouping pattern works when the
# number of rows is not equally divisible by 3.
df <- rbind(df, data.frame(score = letters[16], total1 = 16, total2 = 31))
# A vector that represents the correct groupings for the data frame.
groups <- c(rep(1:floor(nrow(df) / 3), each = 3),
rep(floor(nrow(df) / 3) + 1, nrow(df) - length(1:(nrow(df) / 3)) * 3))
# Your method of aggregation by `groups`. I'm going to use `data.table`.
require(data.table)
dt <- as.data.table(df)
dt[, group := groups]
aggDT <- dt[, list(score = paste0(score, collapse = "-"),
total1 = sum(total1), total2 = sum(total2)), by = group][
, group := NULL]
aggDT
score total1 total2
1: a-b-c 6 51
2: d-e-f 15 60
3: g-h-i 24 69
4: j-k-l 33 78
5: m-n-o 42 87
6: p 16 31
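The block index can also be built more compactly with ceiling(seq_len(n) / 3), which handles a trailing short block without special-casing. The base R sketch below is offered as an alternative to the data.table code above, not a replacement for it; blocks and out are illustrative names.

```r
df <- data.frame(score = letters[1:16], total1 = 1:16, total2 = 16:31)

# Rows 1-3 get block 1, rows 4-6 block 2, ..., row 16 block 6
blocks <- split(df, ceiling(seq_len(nrow(df)) / 3))

# Collapse each block to one row: joined labels and column sums
out <- do.call(rbind, lapply(blocks, function(b) {
  data.frame(score  = paste(b$score, collapse = "-"),
             total1 = sum(b$total1),
             total2 = sum(b$total2))
}))
rownames(out) <- NULL
out
```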

How to recode multiple columns in R

I tried my best to recode multiple columns, but I still struggle to do it. Here what I have done:
df<-read.table(text="ZR1 Time1 ZR2 Time2 ZR3 Time3
A 60 A 56 B 44
C 61 B 44 D 78
D 62 C 78 E 66
E 58 D 46 B 45
A 54 B 23 B 23
A 57 E 24 B 100",h=T)
What I have done
for (i in 1) {
ZRi<-paste0("ZR", i)
Zi<-paste0("Z",i)}
df[,Zi]=c(A=4,B=3,C=2,D=1,E=0)
df[,Zi]=c(A=4,B=3,C=2,D=1,E=0)[df[,ZRi]]
I got this:
ZR1 Time1 ZR2 Time2 ZR3 Time3 Z1
1 A 60 A 56 B 44 4
2 C 61 B 44 D 78 3
3 D 62 C 78 E 66 2
4 E 58 D 46 B 45 1
5 A 54 B 23 B 23 4
6 A 57 E 24 B 100 4
As you can see, I only got Z1, and its values are wrong.
I want to get this:
ZR1 Time1 ZR2 Time2 ZR3 Time3 Z1 Z2 Z3
A 60 A 56 B 44 4 4 3
C 61 B 44 D 78 2 3 1
D 62 C 78 E 66 1 2 0
E 58 D 46 B 45 0 1 3
A 54 B 23 B 23 4 3 3
A 57 E 24 B 100 4 0 3
Here's the base approach (and probably fastest). You are just using the values of the ZR columns as an index into c(A=4,B=3,C=2,D=1,E=0) which becomes a translation table and then assigning those results to new columns in df:
df[ paste0("Z", 1:3) ] <-
lapply( df[ , grepl("^ZR", names(df))] , # passes "ZR" columns one-at-a-time
function(x) {c(A=4,B=3,C=2,D=1,E=0)[as.character(x)]})
Depending on what was intended as the purpose for these new columns, @User60 should be aware that this delivers numeric vectors.
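The as.character call in the lookup matters when the ZR columns are factors, because indexing with a factor uses its integer codes rather than its labels. A minimal sketch with made-up values (all variable names here are illustrative):

```r
scores <- c(A = 4, B = 3, C = 2, D = 1, E = 0)
zr_chr <- c("C", "A", "E")

# Character index: values are looked up by name
by_name <- unname(scores[zr_chr])

# Factor index: levels are A, C, E, so the codes are 2, 1, 3
# and the lookup silently happens by position
zr_fct  <- factor(zr_chr)
by_code <- unname(scores[zr_fct])

# Converting back to character restores the lookup by name
fixed <- unname(scores[as.character(zr_fct)])
rbind(by_name, by_code, fixed)
```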
By playing with levels and labels you can get this:
for (i in 1:3) {
df[[paste0("Z",i)]] <-
factor(df[[paste0("ZR", i)]],levels=LETTERS[1:5],labels=4:0)
}
df
# ZR1 Time1 ZR2 Time2 ZR3 Time3 Z1 Z2 Z3
# 1 A 60 A 56 B 44 4 4 3
# 2 C 61 B 44 D 78 2 3 1
# 3 D 62 C 78 E 66 1 2 0
# 4 E 58 D 46 B 45 0 1 3
# 5 A 54 B 23 B 23 4 3 3
# 6 A 57 E 24 B 100 4 0 3
The created columns with this method will be factors, to have numeric instead use the following:
for (i in 1:3) {
df[[paste0("Z",i)]] <-
as.numeric(as.character(factor(df[[paste0("ZR", i)]],levels=LETTERS[1:5],labels=4:0)))
}
Maybe this one-liner with dplyr could help
df %>%
mutate_at(setNames(paste0("ZR", 1:3), paste0("Z", 1:3)),
~5-as.numeric(factor(.x, levels = LETTERS[1:5])))
The trick here is to pass a named vector to mutate_at so that it creates new columns instead of overwriting the old ones. Coercing a factor to numeric gives the intended codes as long as you specify the levels explicitly.
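The arithmetic behind 5 - as.numeric(factor(...)) can be checked on a short made-up vector:

```r
x <- c("A", "C", "E", "B")

# With levels fixed to A..E, the codes are A=1, B=2, C=3, D=4, E=5
codes <- as.numeric(factor(x, levels = LETTERS[1:5]))

# Subtracting from 5 flips the scale to A=4, ..., E=0
recoded <- 5 - codes
recoded
```

Fixing the levels is what makes this safe: without levels = LETTERS[1:5], a column that happens to lack one of the letters would get different codes.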
An alternative solution using dplyr + magrittr packages
library(dplyr); library(magrittr)
df2 <- select(df, starts_with("ZR")) %>%
lapply(as.character) %>%
mapply(`[`, list(c(A=4,B=3,C=2,D=1,E=0)), .) %>%
data.frame(df, .)
names(df2)[ncol(df2)-2:0] <- paste0("Z", 1:3)
Here's a more dplyr-esque method. Useful for recoding when the output isn't an integer.
library(dplyr)
# Make lookup table
lookup <- data.frame(let = LETTERS[1:5], num = 4:0, stringsAsFactors = F)
# Join with lookup table
df %>%
left_join(lookup, by = c('ZR1' = 'let')) %>%
left_join(lookup, by = c('ZR2' = 'let')) %>%
left_join(lookup, by = c('ZR3' = 'let')) %>%
rename_at(vars(matches('num')), ~paste0('Z', 1:3))
Or, with data.table
library(data.table)
lookup <- data.frame(let = LETTERS[1:5], num = 4:0, stringsAsFactors = F)
setDT(df)
df[, paste0('Z', 1:3) := lapply(df[,paste0('ZR', 1:3)],
function(x) lookup$num[match(x, lookup$let)])]
