Merging two rows with some having missing values in R

I would like to ask the R community how to merge two rows with the same ID (i.e. same participant) with some variables that are identical and others where there are NA's. In my example, I would like all the values 4-5-6 to appear on one row and therefore for the NA's (or empty cells) to be gone.
I have tried using dplyr without much success, and I have to do the merging by hand (which is quite time consuming and increases the risk for errors). Thank you in advance for your help with this problem!

# Create sample data frame.
id <- c(rep('Participant 1', 2), rep('Participant 2', 2))
value1 <- rep('A', 4)
value2 <- rep('B', 4)
value3 <- rep('C', 4)
value4 <- c('x', NA, NA, 'x')
value5 <- c('x', NA, 'x', NA)
value6 <- c(NA, 'x', NA, 'x')
df <- data.frame(id, value1, value2, value3, value4, value5, value6, stringsAsFactors = FALSE)
# Use dplyr to group the data and keep the non-NA value from the other columns.
library(dplyr)
df %>% group_by(id, value1, value2, value3) %>%
  summarise(value4 = max(value4, na.rm = TRUE),
            value5 = max(value5, na.rm = TRUE),
            value6 = max(value6, na.rm = TRUE))

Another solution with dplyr and tidyr:
library(dplyr)
library(tidyr)
DF %>%
gather(var, val, Value4:Value6) %>%
filter(!is.na(val)) %>%
spread(var, val)
Using the data of @G.Grothendieck, this results in:
ID Value1 Value2 Value3 Value4 Value5 Value6
1 1 A B C x x x
2 2 A B C x x x
Or another variation with summarise_each, using the max approach of @G.Grothendieck:
DF %>%
group_by(ID, Value1, Value2, Value3) %>%
summarise_each(funs(max(., na.rm = TRUE)))
The gather and spread options can also be translated into a solution with reshape2:
library(reshape2)
dcast(na.omit(melt(DF, id.vars = c('ID','Value1','Value2','Value3'))),
ID + Value1 + Value2 + Value3 ~ variable,
value.var = 'value')
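summarise_each() and funs() are deprecated in current dplyr releases, and gather()/spread() are superseded in tidyr. A minimal sketch of the same grouping idea with across(), assuming dplyr 1.0.0 or later; first_non_na is a hypothetical helper name, not a dplyr function:

```r
library(dplyr)

# Same input as used in the answers (DF from G. Grothendieck's note).
DF <- read.table(text = "ID Value1 Value2 Value3 Value4 Value5 Value6
1 A B C x x NA
1 A B C NA NA x
2 A B C NA x NA
2 A B C x NA x", header = TRUE, as.is = TRUE)

# Hypothetical helper: first non-missing value, or NA if every value is missing.
first_non_na <- function(x) x[!is.na(x)][1]

res <- DF %>%
  group_by(ID, Value1, Value2, Value3) %>%
  summarise(across(Value4:Value6, first_non_na), .groups = "drop")

print(res)
```

Unlike max(., na.rm = TRUE), the helper returns NA rather than failing or warning when a group has no non-missing value in a column.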

1) Using DF defined in the Note below try aggregating using the compress function defined below. This function removes NA values and appends an NA just in case all values were removed and then takes the first of what is left. No packages are used.
compress <- function(x) c(na.omit(x), NA)[1]
aggregate(DF[5:7], DF[1:4], compress)
giving:
ID Value1 Value2 Value3 Value4 Value5 Value6
1 1 A B C x x x
2 2 A B C x x x
2) A simpler alternative if no participant has all NA values in any column is that we could eliminate the definition of compress and use max with na.rm = TRUE instead like this:
aggregate(DF[5:7], DF[1:4], max, na.rm = TRUE)
Note: The input in reproducible form:
Lines <- "ID Value1 Value2 Value3 Value4 Value5 Value6
1 A B C x x NA
1 A B C NA NA x
2 A B C NA x NA
2 A B C x NA x"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
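To see what the compress helper from option 1 does, it can be tried on a few small vectors. This is only an illustration with made-up inputs, not part of the original answer:

```r
compress <- function(x) c(na.omit(x), NA)[1]

# First non-missing value wins, wherever it sits in the vector.
r1 <- compress(c(NA, "x"))
r2 <- compress(c("x", NA))

# If everything is missing, the appended NA is what gets returned.
r3 <- compress(c(NA, NA))

print(list(r1, r2, r3))
```

The appended NA is the reason compress is safe for groups in which a column is missing everywhere, which is the case option 2 has to rule out.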

If you prefer to use dplyr try:
library(dplyr)
DF %>%
group_by(ID, Value1, Value2, Value3) %>%
summarise_each(funs(toString(na.omit(.))))
Result:
ID Value1 Value2 Value3 Value4 Value5 Value6
<int> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 A B C x x x
2 2 A B C x x x
Note:
DF as defined by G. Grothendieck https://stackoverflow.com/a/40820313/5727278
This builds off of docendo discimus's https://stackoverflow.com/a/27289383/5727278

Related

Replace a value in a data frame from other dataframe in r

Hi, I have two data frames. Where the id matches, I want to replace the values in table a with those from table b.
A sample dataset is here:
a = tibble(id = c(1, 2,3),
type = c("a", "x", "y"))
b= tibble(id = c(1,3),
type =c("d", "n"))
I'm expecting an output like the following:
c= tibble(id = c(1,2,3),
type = c("d", "x", "n"))
In dplyr v1.0.0, the rows_update() function was introduced for this purpose:
rows_update(a, b)
# Matching, by = "id"
# # A tibble: 3 x 2
# id type
# <dbl> <chr>
# 1 1 d
# 2 2 x
# 3 3 n
Here is an option using dplyr::left_join and dplyr::coalesce
library(dplyr)
a %>%
rename(old = type) %>%
left_join(b, by = "id") %>%
mutate(type = coalesce(type, old)) %>%
select(-old)
# A tibble: 3 × 2
#     id type
#  <dbl> <chr>
#1     1 d
#2     2 x
#3     3 n
The idea is to join a with b on column id; then replace missing values in type from b with values from a (column old is the old type column from a, avoiding duplicate column names).
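The same join-then-coalesce idea can be sketched without packages by using match() to look up each id of a in b. This is a base R alternative, not the answer's own method; plain data frames are used here instead of tibbles:

```r
a <- data.frame(id = c(1, 2, 3), type = c("a", "x", "y"), stringsAsFactors = FALSE)
b <- data.frame(id = c(1, 3), type = c("d", "n"), stringsAsFactors = FALSE)

# Position of each a$id within b$id; NA where b has no matching id.
idx <- match(a$id, b$id)

# Take the value from b where a match exists, otherwise keep the value from a.
res <- a
res$type <- ifelse(is.na(idx), a$type, b$type[idx])

print(res)
```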

Collapsing Columns in R using tidyverse with mutate, replace, and unite. Writing a function to reuse?

Data:
ID  B   C
1   NA  x
2   x   NA
3   x   x
Results:
ID  Unified
1   C
2   B
3   B_C
I'm trying to combine columns B and C using mutate and unite, but how would I scale this up so that I can reuse it for many columns (think 100+) instead of having to write out the variables each time? Or is there a built-in function that already does this?
My current solution is this:
library(tidyverse)
Data %>%
mutate(B = replace(B, B == 'x', 'B'), C = replace(C, C == 'x', 'C')) %>%
unite("Unified", B:C, na.rm = TRUE, remove= TRUE)
We can use across to loop over the columns and replace each value equal to 'x' with the name of its column (cur_column()), then unite the columns:
library(dplyr)
library(tidyr)
Data %>%
mutate(across(B:C, ~ replace(., .== 'x', cur_column()))) %>%
unite(Unified, B:C, na.rm = TRUE, remove = TRUE)
-output
ID Unified
1 1 C
2 2 B
3 3 B_C
data
Data <- structure(list(ID = 1:3, B = c(NA, "x", "x"), C = c("x", NA,
"x")), class = "data.frame", row.names = c(NA, -3L))
Here are a couple of options.
Using dplyr -
library(dplyr)
cols <- names(Data)[-1]
Data %>%
rowwise() %>%
mutate(Unified = paste0(cols[!is.na(c_across(B:C))], collapse = '_')) %>%
ungroup -> Data
Data
# ID B C Unified
# <int> <chr> <chr> <chr>
#1 1 NA x C
#2 2 x NA B
#3 3 x x B_C
Base R
Data$Unified <- apply(Data[cols], 1, function(x)
paste0(cols[!is.na(x)], collapse = '_'))
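Since the question asks for something reusable, the base R approach above can be wrapped in a function that takes the column names as an argument. A minimal sketch; unify_cols and its arguments are hypothetical names, and like the answer above it treats every non-NA cell as "present":

```r
# Hypothetical helper: for each row, join the names of the columns
# in `cols` that hold a non-missing value.
unify_cols <- function(data, cols, sep = "_") {
  present <- !is.na(data[cols])  # logical matrix, one column per entry of cols
  unname(apply(present, 1, function(row) paste(cols[row], collapse = sep)))
}

Data <- data.frame(ID = 1:3, B = c(NA, "x", "x"), C = c("x", NA, "x"),
                   stringsAsFactors = FALSE)

Data$Unified <- unify_cols(Data, c("B", "C"))
print(Data)
```

For many columns, cols can be built programmatically, for example with setdiff(names(Data), "ID").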

Replacing value depending on paired column

I have a dataframe with two columns per sample (n > 1000 samples):
df <- data.frame(
"sample1.a" = 1:5, "sample1.b" = 2,
"sample2.a" = 2:6, "sample2.b" = c(1, 3, 3, 3, 3),
"sample3.a" = 3:7, "sample3.b" = 2)
If there is a zero in column .b, the corresponding value in column .a should be set to NA.
I thought of writing a function over the column names (without suffix) to filter each pair of columns and conditionally exchange values. Is there a simpler approach based on tidyverse?
We can split the data.frame into a list of data.frames and do the replacement in base R
df1 <- do.call(cbind, lapply(split.default(df,
sub("\\..*", "", names(df))), function(x) {
x[,1][x[2] == 0] <- NA
x}))
Or another option is Map
acols <- endsWith(names(df), "a")
bcols <- endsWith(names(df), "b")
df[acols] <- Map(function(x, y) replace(x, y == 0, NA), df[acols], df[bcols])
Or if the columns are alternate with 'a', 'b' columns, use a logical index for recycling, create the logical matrix with 'b' columns and assign the corresponding values in 'a' columns to NA
df[c(TRUE, FALSE)][df[c(FALSE, TRUE)] == 0] <- NA
Or an option with tidyverse: reshape into 'long' format (pivot_longer), change the 'a' column to NA if there is a corresponding 0 in 'b', and reshape back to 'wide' format with pivot_wider
library(dplyr)
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn, names_sep="\\.",
names_to = c('group', '.value')) %>%
mutate(a = replace(a, b == 0, NA)) %>%
pivot_wider(names_from = group, values_from = c(a, b)) %>%
select(-rn)
# The sample data contains no zeros in the 'b' columns, so the 'a' values are unchanged:
# A tibble: 5 x 6
#  a_sample1 a_sample2 a_sample3 b_sample1 b_sample2 b_sample3
#      <int>     <int>     <int>     <dbl>     <dbl>     <dbl>
#1         1         2         3         2         1         2
#2         2         3         4         2         3         2
#3         3         4         5         2         3         2
#4         4         5         6         2         3         2
#5         5         6         7         2         3         2
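The sample data in the question has no zeros in any .b column, so none of the approaches visibly changes it. A small check of the Map approach on made-up data that does contain zeros (the values below are an assumption for illustration only):

```r
# Illustrative data: one zero in each .b column.
df <- data.frame(
  sample1.a = 1:5, sample1.b = c(2, 0, 2, 2, 2),
  sample2.a = 2:6, sample2.b = c(1, 3, 0, 3, 3))

acols <- endsWith(names(df), ".a")
bcols <- endsWith(names(df), ".b")

# Pair each .a column with its .b column and blank out .a where .b is zero.
# This relies on the .a and .b columns appearing in the same sample order.
df[acols] <- Map(function(x, y) replace(x, y == 0, NA), df[acols], df[bcols])

print(df)
```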

Using the value in one column to specify from which row to retrieve a value for a new column

I'm looking for an automated way of converting this:
dat = tribble(
~a, ~b, ~c
, 'x', 1, 'y'
, 'y', 2, NA
, 'q', 4, NA
, 'z', 3, 'q'
)
to:
tribble(
~a, ~b, ~d
, 'x', 1, 2
, 'z', 3, 4
)
So, the column c in dat encodes which row in dat to look at to grab a value for a new column d, and if c is NA, toss that row from the output. Any tips?
We can join dat with itself using c and a columns.
library(dplyr)
dat %>%
inner_join(dat %>% select(-c) %>% rename(d = 'b'),
by = c('c' = 'a'))
# A tibble: 2 x 4
# a b c d
# <chr> <dbl> <chr> <dbl>
#1 x 1 y 2
#2 z 3 q 4
In base R, we can do this with merge :
merge(dat, dat[-3], by.x = 'c', by.y = 'a')
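If each referenced row directly follows the row that points to it, we can create 'd' with lead of 'b', filter out the rows where 'c' is NA, and remove the c column with select. Note that this depends on row order: in dat as posted above the 'q' row comes before the 'z' row, so lead(b) is NA for 'z', and the output below is obtained only when the 'q' row directly follows the 'z' row.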
library(dplyr)
dat %>%
mutate(d = lead(b)) %>%
filter(!is.na(c)) %>%
select(-c)
# A tibble: 2 x 3
# a b d
# <chr> <dbl> <dbl>
#1 x 1 2
#2 z 3 4
Or more compactly
dat %>%
mutate(d = replace(lead(b), is.na(c), NA), c = NULL) %>%
na.omit
Or with fill
library(tidyr)
dat %>%
mutate(c1 = c) %>%
fill(c1) %>%
group_by(c1) %>%
mutate(d = lead(b)) %>%
ungroup %>%
filter(!is.na(c)) %>%
select(-c, -c1)
Or in data.table
library(data.table)
setDT(dat)[, d := shift(b, type = 'lead')][!is.na(c)][, c := NULL][]
# a b d
#1: x 1 2
#2: z 3 4
NOTE: These solutions are simple and don't require any joins, but they give the expected output in the OP's post only when each referenced row directly follows the row that refers to it.
Or using match from base R
cbind(na.omit(dat), d = with(dat, b[match(c, a, nomatch = 0)]))[, -3]
# a b d
#1 x 1 2
#2 z 3 4
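The match approach looks rows up by the value in a rather than by position, so it does not depend on row order. A step-by-step sketch of the same lookup on the data exactly as posted (a plain data frame is used instead of a tibble; keep and res are illustrative names):

```r
dat <- data.frame(a = c("x", "y", "q", "z"),
                  b = c(1, 2, 4, 3),
                  c = c("y", NA, NA, "q"),
                  stringsAsFactors = FALSE)

# Rows that point at another row.
keep <- !is.na(dat$c)

# For each kept row, find the row whose 'a' equals its 'c' and take that row's 'b'.
res <- data.frame(a = dat$a[keep],
                  b = dat$b[keep],
                  d = dat$b[match(dat$c[keep], dat$a)],
                  stringsAsFactors = FALSE)

print(res)
```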

Reshape column values to column names

I've got a dataset with the following structure:
df <- data.frame(mult=c(1,2,3,4),red=c(1,0.9,0.8,0.7),
result=c('value1','value2','value3','value4'))
that I'd like to display in a 3-D plot (x axis: mult, y axis: red, and the x-y points would be 'result') or multiple 2-D plots. Obviously the real DF has a lot more rows and combinations of mult&red.
Columns mult & red do not have values repeated. What I'd like is to reshape DF to DF1:
- 1 0.9 0.8 0.7
1 value1
2 value2
3 value3
4 .....
so essentially:
1) [mult] values stays as it is (column 1)
2) [red] values become the column names.
3) Each cross between 'mult' and 'red' is a value in
the new DF
My preference would be to do this with the reshape function, but other packages are fine too.
Thanks in advance, p.
Try
library(reshape2)
df1 <- transform(df, result=as.character(result),
red= factor(red, levels= unique(red)))
dcast(df1, mult~red, value.var='result', fill='')[-1]
#       1    0.9    0.8    0.7
#1 value1
#2        value2
#3               value3
#4                      value4
Here is a way using tidyr
library(tidyr)
out = rev(spread(df[-1], red, result))
out[is.na(out)] = ''
#> out
# 1 0.9 0.8 0.7
#1 value1
#2 value2
#3 value3
#4 value4
Using reshape as you requested
df <- data.frame(mult=c(1,2,3,4),red=c(1,0.9,0.8,0.7),
result=c('value1','value2','value3','value4'))
df$result = as.character(df$result)
dfWide = reshape(data = df, idvar = "mult", timevar = "red", v.names = "result", direction = "wide")
rownames(dfWide) = dfWide$mult
dfWide$mult = NULL
colnames(dfWide) = gsub(pattern = "result.", replacement = "", colnames(dfWide) )
dfWide[is.na(dfWide)] = ''
dfWide
#        1    0.9    0.8    0.7
# 1 value1
# 2        value2
# 3               value3
# 4                      value4
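Because the target is a grid of strings indexed by mult and red, it can also be built directly as a character matrix with matrix indexing. This is a package-free alternative sketch, not one of the answers above; rows, cols and m are illustrative names:

```r
df <- data.frame(mult = c(1, 2, 3, 4), red = c(1, 0.9, 0.8, 0.7),
                 result = c("value1", "value2", "value3", "value4"),
                 stringsAsFactors = FALSE)

rows <- unique(df$mult)
cols <- unique(df$red)

# Empty grid: one row per mult value, one column per red value.
m <- matrix("", nrow = length(rows), ncol = length(cols),
            dimnames = list(as.character(rows), as.character(cols)))

# Fill each (mult, red) cell with its result using a two-column index matrix.
m[cbind(match(df$mult, rows), match(df$red, cols))] <- df$result

print(m, quote = FALSE)
```

If a data frame is needed afterwards, as.data.frame(m, stringsAsFactors = FALSE) converts the matrix while keeping the row and column labels.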
