I have a data frame with duplicate samples, but the reason for this is that only one variable appears per row:
Sample Var1 Var2
A      1    NA
B      NA   1
A      NA   3
C      NA   2
C      5    NA
B      4    NA
I would like to end up with the Sample values de-duplicated and the corresponding column values side by side:
Sample Var1 Var2
A      1    3
B      4    1
C      5    2
I've tried the group_by() function and that failed miserably!
I would very much appreciate any assistance and happy to clarify anything further if required!
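For reference, the answers below refer to this sample data as df (or df1); a minimal construction (column types are an assumption) is:
df <- data.frame(
  Sample = c("A", "B", "A", "C", "C", "B"),
  Var1   = c(1, NA, NA, NA, 5, 4),
  Var2   = c(NA, 1, 3, 2, NA, NA)
)
df1 <- df  # the collapse and reframe answers refer to the same data as df1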
We could use group_by and summarise for this task. Getting the max() will give us the desired output:
library(dplyr)
df %>%
group_by(Sample) %>%
summarise(across(everything(), ~ max(.x, na.rm = TRUE)))
Sample Var1 Var2
<chr> <int> <int>
1 A 1 3
2 B 4 1
3 C 5 2
data.table approach
library(data.table)
DT <- fread("Sample Var1 Var2
A 1 NA
B NA 1
A NA 3
C NA 2
C 5 NA
B 4 NA")
# or setDT(DT) if DT is not already a data.table
# melt to long format, and remove NA's
DT.melt <- melt(DT, id.vars = "Sample", na.rm = TRUE)
# cast to wide again
dcast(DT.melt, Sample ~ variable, fill = NA)
# Sample Var1 Var2
# 1: A 1 3
# 2: B 4 1
# 3: C 5 2
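The same melt-then-cast idea can also be written with tidyr's pivots (a sketch, not part of the original answers):
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-Sample, values_drop_na = TRUE) %>%       # melt to long format, dropping NAs
  pivot_wider(names_from = name, values_from = value)    # cast back to wide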
Using collapse
library(collapse)
fmax(df1[-1], g = df1$Sample)
Var1 Var2
A 1 3
B 4 1
C 5 2
Or in dplyr 1.1.0, we can also use .by in reframe
df1 %>%
reframe(across(where(is.numeric), ~ max(.x, na.rm = TRUE)), .by = 'Sample')
Sample Var1 Var2
1 A 1 3
2 B 4 1
3 C 5 2
Related
I would like to merge multiple columns. Here is what my sample dataset looks like.
df <- data.frame(
id = c(1,2,3,4,5),
cat.1 = c(3,4,NA,4,2),
cat.2 = c(3,NA,1,4,NA),
cat.3 = c(3,4,1,4,2))
> df
id cat.1 cat.2 cat.3
1 1 3 3 3
2 2 4 NA 4
3 3 NA 1 1
4 4 4 4 4
5 5 2 NA 2
I am trying to merge columns cat.1, cat.2 and cat.3. It is a little complicated for me since there are NAs.
I need to end up with only one cat variable, and even though some columns have NA, I need to ignore them. The desired output is below:
> df
id cat
1 1 3
2 2 4
3 3 1
4 4 4
5 5 2
Any thoughts?
Another variation of Gregor's answer using dplyr::transmute:
library(dplyr)
df %>%
transmute(id = id, cat = coalesce(cat.1, cat.2, cat.3))
#> id cat
#> 1 1 3
#> 2 2 4
#> 3 3 1
#> 4 4 4
#> 5 5 2
With dplyr:
library(dplyr)
df %>%
mutate(cat = coalesce(cat.1, cat.2, cat.3)) %>%
select(-cat.1, -cat.2, -cat.3)
An option with fcoalesce from data.table
library(data.table)
setDT(df)[, .(id, cat = do.call(fcoalesce, .SD)), .SDcols = patterns('^cat')]
-output
# id cat
#1: 1 3
#2: 2 4
#3: 3 1
#4: 4 4
#5: 5 2
Does this work:
> library(dplyr)
> df %>% rowwise() %>% mutate(cat = mean(c(cat.1, cat.2, cat.3), na.rm = T)) %>% select(-(2:4))
# A tibble: 5 x 2
# Rowwise:
id cat
<dbl> <dbl>
1 1 3
2 2 4
3 3 1
4 4 4
5 5 2
Since the non-NA values within each row are identical, the mean of each row returns that same value; max or min would work just as well.
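For completeness, a sketch of the max() variant (assuming, as above, that the non-NA values in each row agree):
library(dplyr)
df %>%
  rowwise() %>%
  mutate(cat = max(c(cat.1, cat.2, cat.3), na.rm = TRUE)) %>%
  ungroup() %>%
  select(id, cat)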
Here is a base R solution which uses apply:
df$cat <- apply(df, 1, function(x) unique(x[!is.na(x)][-1]))
I have the following data frame in R
df1 <- data.frame(
"ID" = c("A", "B", "A", "B"),
"Value" = c(1, 2, 5, 5),
"freq" = c(1, 3, 5, 3)
)
I wish to obtain the following data frame
Value freq ID
1 1 A
2 NA A
3 NA A
4 NA A
5 1 A
1 NA B
2 2 B
3 NA B
4 NA B
5 5 B
I have tried the following code
library(tidyverse)
df_new <- bind_cols(df1 %>%
select(Value, freq, ID) %>%
complete(., expand(.,
Value = min(df1$Value):max(df1$Value))),)
I am getting the following output
Value freq ID
<dbl> <dbl> <fct>
1 1 A
2 3 B
3 NA NA
4 NA NA
5 5 A
5 3 B
I request someone to help me.
Using tidyr::full_seq we can generate the full sequence of Value, but nesting(full_seq(Value, 1)) will return an error:
Error: by can't contain join column full_seq(Value, 1) which is missing from RHS
so we need to add a name, hence nesting(Value = full_seq(Value, 1)):
library(tidyr)
df1 %>% complete(ID, nesting(Value=full_seq(Value,1)))
# A tibble: 10 x 3
ID Value freq
<fct> <dbl> <dbl>
1 A 1. 1.
2 A 2. NA
3 A 3. NA
4 A 4. NA
5 A 5. 5.
6 B 1. NA
7 B 2. 3.
8 B 3. NA
9 B 4. NA
10 B 5. 3.
Using data.table:
library(data.table)
setDT(df1)
setkey(df1, ID, Value)
df1[CJ(ID = c("A", "B"), Value = 1:5)]
ID Value freq
1: A 1 1
2: A 2 NA
3: A 3 NA
4: A 4 NA
5: A 5 5
6: B 1 NA
7: B 2 3
8: B 3 NA
9: B 4 NA
10: B 5 3
Would the following approach work for you?
library(dplyr)

with(data = df1,
     expr = {
       data.frame(Value = rep(wrapr::seqi(min(Value), max(Value)), length(unique(ID))),
                  ID = unique(ID))
     }) %>%
  left_join(y = df1,
            by = c("ID" = "ID", "Value" = "Value")) %>%
  arrange(ID, Value)
Results
Value ID freq
1 1 A 1
2 2 A NA
3 3 A NA
4 4 A NA
5 5 A 5
6 1 B NA
7 2 B 3
8 3 B NA
9 4 B NA
10 5 B 3
Comments
If I'm following your example correctly, Value takes values from 1 to 5 within each ID group. If this is the case, my approach is to generate that full grid by reading the unique values of both from the original data frame.
The only variable carried over from the original data frame is freq, which may or may not be available for a given ID-Value pair. I join that variable via left_join (as you seem to prefer the tidyverse).
In your example, the freq variable has values 1, 3, 5, but in the desired output you list 1, 2, 5? In my example, I took the original freq and left-joined it. You can modify it further with a normal dplyr pipeline if that is something you intended to do.
I need to fill $Year with the missing values of the sequence within each level of the $Country factor. The $Count column can just be padded out with 0s.
Country Year Count
A 1 1
A 2 1
A 4 2
B 1 1
B 3 1
So I end up with
Country Year Count
A 1 1
A 2 1
A 3 0
A 4 2
B 1 1
B 2 0
B 3 1
Hope that's clear guys, thanks in advance!
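The answers below refer to this data under various names (df, d, DT, dt, dat); a minimal construction is:
df <- data.frame(
  Country = c("A", "A", "A", "B", "B"),
  Year    = c(1, 2, 4, 1, 3),
  Count   = c(1, 1, 2, 1, 1)
)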
This is a dplyr/tidyr solution using complete and full_seq:
library(dplyr)
library(tidyr)
df %>% group_by(Country) %>% complete(Year=full_seq(Year,1),fill=list(Count=0))
Country Year Count
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
library(data.table)
# d is your original data.frame
setDT(d)
foo <- d[, .(Year = min(Year):max(Year)), Country]
res <- merge(d, foo, all.y = TRUE)[is.na(Count), Count := 0]
Similar to #PoGibas' answer:
library(data.table)
# set default values
def = list(Count = 0L)
# create table with all levels
fullDT = setkey(DT[, .(Year = seq(min(Year), max(Year))), by=Country])
# initialize to defaults
fullDT[, names(def) := def ]
# overwrite from data
fullDT[DT, names(def) := mget(sprintf("i.%s", names(def))) ]
which gives
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 0
4: A 4 2
5: B 1 1
6: B 2 0
7: B 3 1
This generalizes to having more columns (besides Count). I guess similar functionality exists in the "tidyverse", with a name like "expand" or "complete".
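For reference, the tidyverse counterpart mentioned here is tidyr::complete(); a hedged sketch of the multi-column generalisation, where the extra Other column is hypothetical:
library(dplyr)
library(tidyr)
# Other stands in for a hypothetical second count column alongside Count
df %>%
  group_by(Country) %>%
  complete(Year = full_seq(Year, 1), fill = list(Count = 0, Other = 0)) %>%
  ungroup()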
Another base R idea is to split on Country, use setdiff to find the missing values from seq(max(Year)), and rbind them to the original data frame. Then use do.call to rbind the list back into a single data frame, i.e.
d1 <- do.call(rbind, c(lapply(split(df, df$Country), function(i){
x <- rbind(i, data.frame(Country = i$Country[1],
Year = setdiff(seq(max(i$Year)), i$Year),
Count = 0));
x[with(x, order(Year)),]}), make.row.names = FALSE))
which gives,
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
> setkey(DT,Country,Year)
> DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 NA
4: A 4 2
5: B 1 1
6: B 2 NA
7: B 3 1
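To match the zero-padded desired output, the remaining NAs in Count can be overwritten afterwards (a small addition, not part of the original answer, assuming the join result is saved as res):
res <- DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
res[is.na(Count), Count := 0][]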
Another dplyr and tidyr solution.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(Country) %>%
do(data_frame(Country = unique(.$Country),
Year = full_seq(.$Year, 1))) %>%
full_join(dt, by = c("Country", "Year")) %>%
replace_na(list(Count = 0))
Here is an approach in base R that uses tapply, do.call, range, and seq, to calculate year sequences. Then constructs a data.frame from the named list that is returned, merges this onto the original which adds the desired rows, and finally fills in missing values.
# get named list with year sequences
temp <- tapply(dat$Year, dat$Country, function(x) do.call(seq, as.list(range(x))))
# construct data.frame
mydf <- data.frame(Year=unlist(temp), Country=rep(names(temp), lengths(temp)))
# merge onto original
mydf <- merge(dat, mydf, all=TRUE)
# fill in missing values
mydf[is.na(mydf)] <- 0
This returns
mydf
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
Is there a way, using the "dplyr" package, to intersect two data frames and sum one column? For example:
Given DF 1
Var1 Var2 Var3
1 A 5
1 B 4
2 A 5
2 B 3
2 C 4
DF 2
Var1 Var2 Var3
1 A 3
1 D 2
2 E 3
2 B 3
2 G 2
And return
DF 3
Var1 Var2 Var3
1 A 8
2 B 6
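For a reproducible example, the two inputs can be constructed as follows (column types are an assumption):
df1 <- data.frame(
  Var1 = c(1, 1, 2, 2, 2),
  Var2 = c("A", "B", "A", "B", "C"),
  Var3 = c(5, 4, 5, 3, 4)
)
df2 <- data.frame(
  Var1 = c(1, 1, 2, 2, 2),
  Var2 = c("A", "D", "E", "B", "G"),
  Var3 = c(3, 2, 3, 3, 2)
)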
how easy is that my friend?
df1 %>%
  inner_join(df2, by = c('Var1', 'Var2')) %>%
  transmute(Var1, Var2, Var3 = Var3.x + Var3.y)
I would like to impute missing values for a variable given the existing values.
In var2, we notice that there are a lot of NAs.
If any 2 ids are the same, then their values for var2 are the same.
If the id has no values for var2, like in the case of id==2, then we just output as NA.
The data should go from df_old to df_new.
df_old<- read.table(header = TRUE, text = "
id var1 var2
1 A 12
1 B NA
1 E NA
2 G NA
2 J NA
")
df_new<- read.table(header = TRUE, text = "
id var1 var2
1 A 12
1 B 12
1 E 12
2 G NA
2 J NA
")
I tried:
df_new<-df_old %>%
group_by(id) %>%
mutate(var2=na.omit(var2))
I believe it doesn't work because of the second case. I was also wondering if using ifelse would be okay. Need help thanks!
If there is only one var2 value per id available you could simply do:
df_old %>%
group_by(id) %>%
mutate(var2 = min(var2, na.rm = TRUE))
Source: local data frame [5 x 3]
Groups: id [2]
id var1 var2
<int> <fctr> <int>
1 1 A 12
2 1 B 12
3 1 E 12
4 2 G NA
5 2 J NA
Another option would be:
mutate(var2 = var2[1])
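A hedged variant of the same idea takes the first non-NA value per id, so that all-NA groups naturally stay NA:
library(dplyr)
df_old %>%
  group_by(id) %>%
  mutate(var2 = var2[!is.na(var2)][1]) %>%  # integer(0)[1] yields NA for all-NA groups
  ungroup()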
We can use data.table but, unlike dplyr, for groups that are all NA we have to specify NA as the return value, or else it will give Inf:
library(data.table)
setDT(df_old)[, var2 := if(any(!is.na(var2))) min(var2, na.rm = TRUE)
else NA_integer_, by = id]
df_old
# id var1 var2
#1: 1 A 12
#2: 1 B 12
#3: 1 E 12
#4: 2 G NA
#5: 2 J NA
By now there is a tidyimpute package available on CRAN which looks like it might do the trick:
"Functions and methods for imputing missing values (NA) in tables and list
patterned after the tidyverse approach of 'dplyr' and 'rlang'; works with
data.tables as well."
https://cran.r-project.org/web/packages/tidyimpute/tidyimpute.pdf