merge multiple columns in one table? - r

I have a table with several columns. I would like to create a new column by combining the R1, R2 and R3 columns.
DF:
ID R1 T1 R2 T2 R3 T3
rs1 A 1 NA . NA 0
rs21 NA 0 C 1 C 1
rs32 A 1 A 1 A 0
rs25 NA 2 NA 0 A 0
Desired output:
ID R1 T1 R2 T2 R3 T3 New_R
rs1 A 1 NA . NA 0 A
rs21 NA 0 C 1 C 1 C
rs32 A 1 A 1 A 0 A
rs25 NA 2 NA 0 A 0 A

We can use tidyverse
library(tidyverse)
DF %>%
mutate(New_R = pmap_chr(select(., starts_with("R")), ~c(...) %>%
na.omit %>%
unique %>%
str_c(collapse = "")))
#. ID R1 T1 R2 T2 R3 T3 New_R
#1 rs1 A 1 <NA> . <NA> 0 A
#2 rs21 <NA> 0 C 1 C 1 C
#3 rs32 A 1 A 1 A 0 A
#4 rs25 <NA> 2 <NA> 0 A 0 A
If there is only one distinct non-NA value per row, we can use coalesce, which returns the first non-NA value across the columns
DF %>%
mutate(New_R = coalesce(!!! select(., starts_with("R"))))
Or in base R
DF$New_R <- do.call(pmin, c(DF[grep("^R\\d+", names(DF))], na.rm = TRUE))
data
DF <- structure(list(ID = c("rs1", "rs21", "rs32", "rs25"), R1 = c("A",
NA, "A", NA), T1 = c(1L, 0L, 1L, 2L), R2 = c(NA, "C", "A", NA
), T2 = c(".", "1", "1", "0"), R3 = c(NA, "C", "A", "A"), T3 = c(0L,
1L, 0L, 0L)), class = "data.frame", row.names = c(NA, -4L))

You can use the ifelse function in a nested way:
DF$New_R <- ifelse(!is.na(DF$R1), DF$R1,
ifelse(!is.na(DF$R2), DF$R2,
ifelse(!is.na(DF$R3), DF$R3, NA)))
ifelse takes three arguments: a condition, the value to return where the condition is TRUE, and the value to return where it is FALSE. It is vectorised, so when applied to data frame columns it treats each row separately. In my example it picks the first non-NA value found in R1, R2, R3 (in that order).
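The nested ifelse calls grow with the number of columns. As a minimal sketch of the same "first non-NA value" idea for any number of R columns, the nesting can be folded with Reduce. The helper name first_non_na is hypothetical (not from the answer), and the example data repeats only the R columns from the question:

```r
# Example data: only the ID and R columns from the question are needed here
DF <- data.frame(
  ID = c("rs1", "rs21", "rs32", "rs25"),
  R1 = c("A", NA, "A", NA),
  R2 = c(NA, "C", "A", NA),
  R3 = c(NA, "C", "A", "A"),
  stringsAsFactors = FALSE
)

# first_non_na() is a hypothetical helper: it folds the columns left to right,
# keeping the value found so far unless it is NA
first_non_na <- function(cols) {
  Reduce(function(x, y) ifelse(is.na(x), y, x), cols)
}

# Select every column named R followed by digits, then fold them
r_cols <- grep("^R\\d+$", names(DF), value = TRUE)
DF$New_R <- first_non_na(unname(as.list(DF[r_cols])))
DF$New_R
```

Each step of the fold is exactly one level of the nested ifelse above, so the result should match it for the three-column case.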

We can use apply row-wise, removing NA values and keeping only the unique values.
cols <- paste0("R", 1:3)
DF$New_R <- apply(DF[cols], 1, function(x)
paste0(unique(na.omit(x)), collapse = ""))
DF
# ID R1 T1 R2 T2 R3 T3 New_R
#1 rs1 A 1 <NA> . <NA> 0 A
#2 rs21 <NA> 0 C 1 C 1 C
#3 rs32 A 1 A 1 A 0 A
#4 rs25 <NA> 2 <NA> 0 A 0 A

Related

How to get merged data frame from two data frames having some same columns(R)

I want to merge them and find the values of one dataframe that would like to be added to the existing values of the other based on the same columns.
For example:
df1
No A B C D
1 1 0 1 0
2 0 1 2 1
3 0 0 1 0
df2
No A B E F
1 1 0 1 1
2 0 1 2 1
3 2 1 1 0
Finally, I want the output table like this.
df
No A B C D E F
1 2 0 1 0 1 1
2 0 2 2 1 2 1
3 2 1 1 0 1 0
Note: I did try merge(), but in this case, it did not work.
Any help/suggestion would be appreciated.
Reproducible sample data
df1 <-
structure(list(No = 1:3, A = c(1L, 0L, 0L), B = c(0L, 1L, 0L),
C = c(1L, 2L, 1L), D = c(0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
df2 <-
structure(list(No = 1:3, A = c(1L, 0L, 2L), B = c(0L, 1L, 1L),
E = c(1L, 2L, 1L), F = c(1L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
You can also carry out this operation by left_joining these two data frames:
library(dplyr)
library(stringr)
df1 %>%
left_join(df2, by = "No") %>%
mutate(across(ends_with(".x"), ~ .x + get(str_replace(cur_column(), "\\.x", "\\.y")))) %>%
rename_with(~ str_replace(., "\\.x", ""), ends_with(".x")) %>%
select(!ends_with(".y"))
No A B C D E F
1 1 2 0 1 0 1 1
2 2 0 2 2 1 2 1
3 3 2 1 1 0 1 0
You can first row-bind the two dataframes and then compute the sum of each column while 'grouping' by the No column. This can be done like so:
library(dplyr)
bind_rows(df1, df2) %>%
group_by(No) %>%
summarise(across(c(A, B, C, D, E, `F`), sum, na.rm = TRUE),
.groups = "drop")
If a particular column doesn't exist in one dataframe (i.e. columns E and F), values will be padded with NA. Adding the na.rm = TRUE argument (to be passed to sum()) means that these values will get treated like zeros.
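To see the padding and the effect of na.rm in isolation, here is a small base R sketch. The data frames a and b and the helper pad are made up for illustration; pad mimics the NA filling that bind_rows() performs and is not part of dplyr:

```r
# Two small frames that share No and A but differ in one column each
a <- data.frame(No = 1:2, A = c(1, 0), C = c(1, 2))
b <- data.frame(No = 1:2, A = c(1, 0), E = c(1, 2))

# pad() is a hypothetical helper: add any missing column as NA,
# then put the columns in a common order
all_cols <- union(names(a), names(b))
pad <- function(d) {
  d[setdiff(all_cols, names(d))] <- NA
  d[all_cols]
}

stacked <- rbind(pad(a), pad(b))

sum(stacked$C)                # NA, because the rows from b have no C
sum(stacked$C, na.rm = TRUE)  # 3, the NA padding is ignored
```

Without na.rm = TRUE, every column that is missing from one of the inputs would sum to NA.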
Using data.table :
library(data.table)
rbindlist(list(df1, df2), fill = TRUE)[, lapply(.SD, sum, na.rm = TRUE), No]
# No A B C D E F
#1: 1 2 0 1 0 1 1
#2: 2 0 2 2 1 2 1
#3: 3 2 1 1 0 1 0
We can use base R (with R 4.1.0 or later, for the native pipe and the \(x) lambda syntax). Get the values of the objects in a list ('lst1'). Then, find the union of the column names ('nm1'). Loop over the list, use setdiff to find the columns missing from each element and create them with value 0, rbind the elements, and use aggregate to get the sum grouped by 'No'
lst1 <- mget(ls(pattern= '^df\\d+$'))
nm1 <- lapply(lst1, names) |>
{\(x) Reduce(union, x)}()
lapply(lst1, \(x) {x[setdiff(nm1, names(x))] <- 0; x}) |>
{\(x) do.call(rbind, x)}() |>
{\(dat) aggregate(.~ No, data = dat, FUN = sum, na.rm = TRUE,
na.action = na.pass)}()
# No A B C D E F
#1 1 2 0 1 0 1 1
#2 2 0 2 2 1 2 1
#3 3 2 1 1 0 1 0

Subtracting multiple rows from the same row in R

I am looking to subtract multiple rows from the same row within a dataframe.
For example:
Group A B C
A 3 1 2
B 4 0 3
C 4 1 1
D 2 1 2
This is what I want it to look like:
Group A B C
B 1 -1 1
C 1 0 -1
D -1 0 0
So in other words:
Row B - Row A
Row C - Row A
Row D - Row A
Thank you!
Here's a dplyr solution:
library(dplyr)
data %>%
mutate(across(A:C, ~ . - .[1])) %>%
filter(Group != "A")
This gives us:
Group A B C
1: B 1 -1 1
2: C 1 0 -1
3: D -1 0 0
Here's an approach with base R:
data[-1] <- do.call(rbind,
apply(data[-1],1,function(x) x - data[1,-1])
)
data[-1,]
# Group A B C
#2 B 1 -1 1
#3 C 1 0 -1
#4 D -1 0 0
Data:
data <- structure(list(Group = c("A", "B", "C", "D"), A = c(3L, 4L, 4L,
2L), B = c(1L, 0L, 1L, 1L), C = c(2L, 3L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-4L))
We could also replicate the first row and subtract it from the rest
cbind(data[-1, 1, drop = FALSE], data[-1, -1] - data[1, -1][col(data[-1, -1])])
Output:
# Group A B C
#2 B 1 -1 1
#3 C 1 0 -1
#4 D -1 0 0
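Subtracting one row from every other row is a column-wise "sweep", so base R's sweep() is another way to express it. This is a sketch under the assumption that all non-Group columns are numeric; the names num, diffs and result are introduced here for illustration:

```r
# Same data as in the question
data <- data.frame(
  Group = c("A", "B", "C", "D"),
  A = c(3, 4, 4, 2),
  B = c(1, 0, 1, 1),
  C = c(2, 3, 1, 2)
)

# Work on the numeric part as a matrix
num <- as.matrix(data[-1])

# Subtract the first row from each column's values (the default FUN is "-")
diffs <- sweep(num, 2, num[1, ])

# Drop the reference row and reattach the Group labels
result <- data.frame(Group = data$Group[-1], diffs[-1, , drop = FALSE])
result
```

Here MARGIN = 2 means that the i-th element of num[1, ] is subtracted from every entry of column i.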

Repeat a value within each ID

I have a dataset in R in long format. Each ID does not appear the same number of times (i.e. one ID might be one row, another might appear 79 rows).
e.g.
ID V1 V2
1 B 0
1 A 1
1 C 0
2 C 0
3 A 0
3 C 0
I want to create a variable V3 that equals 1 on every row of an ID if any row for that ID has V2 == 1, and 0 otherwise
e.g.
ID V1 V2 V3
1 B 0 1
1 A 1 1
1 C 0 1
2 C 0 0
3 A 0 0
3 C 0 0
In base R we can use any for the check and ave for the grouping.
DF$V3 <- with(DF, ave(V2, ID, FUN = function(x) any(x == 1)))
DF
# ID V1 V2 V3
#1 1 B 0 1
#2 1 A 1 1
#3 1 C 0 1
#4 2 C 0 0
#5 3 A 0 0
#6 3 C 0 0
data
DF <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L), V1 = c("B", "A",
"C", "C", "A", "C"), V2 = c(0L, 1L, 0L, 0L, 0L, 0L)), .Names = c("ID",
"V1", "V2"), class = "data.frame", row.names = c(NA, -6L))
Here's a tidyverse solution.
If V2 can only be 0 or 1:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(V3 = max(V2))
If you want to check that V2 is exactly 1:
df %>%
group_by(ID) %>%
mutate(V3 = as.numeric(any(V2 == 1)))
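Another base R option is shown below. Note that which() returns positions within the rowsum() result rather than ID values, so this relies on the IDs being 1, 2, ..., n:
df$V3 <- with(df, +(ID %in% which(rowsum(V2, ID) > 0)))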
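For IDs that are not the consecutive integers 1, 2, ..., n, a membership test on the ID values themselves is one possible alternative. A minimal sketch, where DF2 and its IDs are made up for illustration:

```r
# Hypothetical data with non-consecutive IDs
DF2 <- data.frame(
  ID = c(10L, 10L, 10L, 25L, 31L, 31L),
  V2 = c(0L, 1L, 0L, 0L, 0L, 0L)
)

# IDs that have at least one row with V2 == 1
flagged <- unique(DF2$ID[DF2$V2 == 1])

# Mark every row whose ID is among the flagged ones
DF2$V3 <- as.integer(DF2$ID %in% flagged)
DF2
```

This compares ID values with ID values, so it does not depend on how the IDs are numbered.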

Seeing if all values in one dataframe row exist in another dataframe

I have a dataframe as follows:
df1
ColA ColB ColC ColD
10 A B L
11 N Q NA
12 P J L
43 M T NA
89 O J T
df2
ATTR Att R1 R2 R3 R4
1 45 A B NA NA
2 40 C D NA NA
3 33 T J O NA
4 65 L NA NA NA
5 20 P L J NA
6 23 Q NA NA NA
7 38 Q L NA NA
How do I match df2 against df1 so that a df2 row is joined to a df1 row only if ALL of its non-NA values (in any order) appear in that df1 row, not just one of them? The final result in this case should be this:
ColA ColB ColC ColD ATTR Att R1 R2 R3 R4
10 A B L 1 45 A B NA NA
10 A B L 4 65 L NA NA NA
11 N Q NA 6 23 Q NA NA NA
12 P J L 4 65 L NA NA NA
12 P J L 5 20 P L J NA
89 O J T 3 33 T J O NA
Thanks
Here is a possible solution using base R.
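Note that in this answer df holds the question's df1 and df1 holds the question's df2 (see DATA below). Make sure everything is a character before continuing, i.e.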
df[-1] <- lapply(df[-1], as.character)
df1[-c(1:2)] <- lapply(df1[-c(1:2)], as.character)
First we create two lists which contain vectors of the row-wise non-NA elements of each data frame. We then create a matrix that holds, for every pair of rows, the number of elements of the l2 row that are not found in the l1 row. If that number is 0, every element of the l2 row appears in the l1 row, i.e. they match:
l1 <- lapply(split(df[-1], seq(nrow(df))), function(i) i[!is.na(i)])
l2 <- lapply(split(df1[-c(1:2)], seq(nrow(df1))), function(i) i[!is.na(i)])
m1 <- sapply(l1, function(i) sapply(l2, function(j) length(setdiff(j, i))))
m1
# 1 2 3 4 5
#1 0 2 2 2 2
#2 2 2 2 2 2
#3 3 3 2 2 0
#4 0 1 0 1 1
#5 2 3 0 3 2
#6 1 0 1 1 1
#7 1 1 1 2 2
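The check behind each cell of that matrix can be tried on a single pair of rows. A small sketch using values from the question's tables; the helper name all_found and the vector names are hypothetical:

```r
# all_found() is a hypothetical helper: TRUE when every non-NA value of
# `needed` occurs somewhere in `available`
all_found <- function(needed, available) {
  needed <- needed[!is.na(needed)]
  length(setdiff(needed, available)) == 0
}

row_10  <- c("A", "B", "L")      # ColB..ColD for ColA == 10
row_11  <- c("N", "Q", NA)       # ColB..ColD for ColA == 11
attr_1  <- c("A", "B", NA, NA)   # R1..R4 for ATTR == 1
attr_7  <- c("Q", "L", NA, NA)   # R1..R4 for ATTR == 7

all_found(attr_1, row_10)  # TRUE: A and B are both present
all_found(attr_7, row_11)  # FALSE: Q is present but L is not
```

An equivalent formulation of the same test is all(needed %in% available).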
We then use that matrix to create a couple of columns in our original df. The first column, rpt, counts the zeros in each column of m1, i.e. how many rows of df1 match that row of df, and we use it as the number of repeats for each row. We also use it to filter out the rows with rpt equal to 0 (i.e. the rows that do not have a match in df1). After expanding the data frame we create another variable, ATTR (same name as ATTR in df1), in order to use it for a merge, i.e.
df$rpt <- colSums(m1 == 0)
df <- df[df$rpt != 0,]
df <- df[rep(row.names(df), df$rpt),]
df$ATTR <- which(m1 == 0, arr.ind = TRUE)[,1]
df
# ColA ColB ColC ColD rpt ATTR
#1 10 A B L 2 1
#1.1 10 A B L 2 4
#2 11 N Q <NA> 1 6
#3 12 P J L 2 4
#3.1 12 P J L 2 5
#5 89 O J T 1 3
We then merge and order the two data frames,
final_df <- merge(df, df1, by = 'ATTR')
final_df[order(final_df$ColA),]
# ATTR ColA ColB ColC ColD rpt Att R1 R2 R3 R4
#1 1 10 A B L 2 45 A B <NA> <NA>
#3 4 10 A B L 2 65 L <NA> <NA> <NA>
#6 6 11 N Q <NA> 1 23 Q <NA> <NA> <NA>
#4 4 12 P J L 2 65 L <NA> <NA> <NA>
#5 5 12 P J L 2 20 P L J <NA>
#2 3 89 O J T 1 33 T J O <NA>
DATA
dput(df)
structure(list(ColA = c(10L, 11L, 12L, 43L, 89L), ColB = c("A",
"N", "P", "M", "O"), ColC = c("B", "Q", "J", "T", "J"), ColD = c("L",
NA, "L", NA, "T")), .Names = c("ColA", "ColB", "ColC", "ColD"
), row.names = c(NA, -5L), class = "data.frame")
dput(df1)
structure(list(ATTR = 1:7, Att = c(45L, 40L, 33L, 65L, 20L, 23L,
38L), R1 = c("A", "C", "T", "L", "P", "Q", "Q"), R2 = c("B",
"D", "J", NA, "L", NA, "L"), R3 = c(NA, NA, "O", NA, "J", NA,
NA), R4 = c(NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_)), .Names = c("ATTR",
"Att", "R1", "R2", "R3", "R4"), row.names = c(NA, -7L), class = "data.frame")

R how to do the partial row sums

I am very new to R, and I sincerely appreciate your help.
The following is part of my data:
subjectID A B C D E F G H I J
S001 1 1 1 1 1 0 0
S002 1 1 1 0 0 0 0
I want to sum the rows from A to J, and so the data will look like this:
subjectID A B C D E F G H I J TOTAL
S001 1 1 1 1 1 0 0 5
S002 1 1 1 0 0 0 0 3
Thank you so much! I would like to sum only the cells where variables A to J equal 1.
As suggested, I post here my answers.
This one uses apply. df[-1] excludes the first column (which is not numeric), and x[x == 1] subsets x (a single row, because of the 1 passed to apply) to the elements equal to 1.
df$TOTAL <- apply(df[-1], 1, function(x) sum(x[x == 1], na.rm = TRUE))
Another way in base R, which is easier to code and likely much faster, is:
df$TOTAL <- rowSums(df[-1] == 1, na.rm = TRUE)
Both give this result:
df
subjectID A B C D E F G H I J TOTAL
1 S001 1 1 1 1 1 0 0 NA NA NA 5
2 S002 1 1 1 0 0 0 0 NA NA NA 3
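Both answers count the cells equal to 1 rather than adding up all values, which only matters when a row contains values other than 0 and 1. A short sketch on a made-up vector x (not from the question) shows the difference:

```r
# Hypothetical row containing a value other than 0 and 1, plus an NA
x <- c(1, 2, 1, 0, NA)

sum(x[x == 1], na.rm = TRUE)  # 2: sums only the elements equal to 1
sum(x == 1, na.rm = TRUE)     # 2: counts the elements equal to 1
sum(x, na.rm = TRUE)          # 4: plain sum, includes the 2
```

With strictly 0/1 data, as in the question, all three give the same number.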
Data
df <- structure(list(subjectID = structure(1:2, .Label = c("S001",
"S002"), class = "factor"), A = c(1L, 1L), B = c(1L, 1L), C = c(1L,
1L), D = c(1L, 0L), E = c(1L, 0L), F = c(0L, 0L), G = c(0L, 0L
), H = c(NA, NA), I = c(NA, NA), J = c(NA, NA)), .Names = c("subjectID",
"A", "B", "C", "D", "E", "F", "G", "H", "I", "J"), class = "data.frame", row.names = c(NA,
-2L))
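Another option, similar to the one posted by SabDeM, uses sapply to sum only the numeric columns. Note that it adds up the values rather than counting the 1s; na.rm = TRUE guards against NA values in numeric columns.
df$Total <- rowSums(df[, sapply(df, is.numeric)], na.rm = TRUE)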
Output:
subjectID A B C D E F G H I J Total
1 S001 1 1 1 1 1 0 0 NA NA NA 5
2 S002 1 1 1 0 0 0 0 NA NA NA 3

Resources