Merging columns while ignoring NAs - r

I would like to merge multiple columns. Here is what my sample dataset looks like.
df <- data.frame(
id = c(1,2,3,4,5),
cat.1 = c(3,4,NA,4,2),
cat.2 = c(3,NA,1,4,NA),
cat.3 = c(3,4,1,4,2))
> df
id cat.1 cat.2 cat.3
1 1 3 3 3
2 2 4 NA 4
3 3 NA 1 1
4 4 4 4 4
5 5 2 NA 2
I am trying to merge columns cat.1, cat.2 and cat.3. It is a little complicated for me since there are NAs.
I need to end up with a single cat variable, and where some of the columns have NA, I need to ignore them. The desired output is below:
> df
id cat
1 1 3
2 2 4
3 3 1
4 4 4
5 5 2
Any thoughts?

Another variation of Gregor's answer using dplyr::transmute:
library(dplyr)
df %>%
transmute(id = id, cat = coalesce(cat.1, cat.2, cat.3))
#> id cat
#> 1 1 3
#> 2 2 4
#> 3 3 1
#> 4 4 4
#> 5 5 2

With dplyr:
library(dplyr)
df %>%
mutate(cat = coalesce(cat.1, cat.2, cat.3)) %>%
select(-cat.1, -cat.2, -cat.3)

An option with fcoalesce from data.table
library(data.table)
setDT(df)[, .(id, cat = do.call(fcoalesce, .SD)), .SDcols = patterns('^cat')]
-output
# id cat
#1: 1 3
#2: 2 4
#3: 3 1
#4: 4 4
#5: 5 2

Does this work:
> library(dplyr)
> df %>% rowwise() %>% mutate(cat = mean(c(cat.1, cat.2, cat.3), na.rm = T)) %>% select(-(2:4))
# A tibble: 5 x 2
# Rowwise:
id cat
<dbl> <dbl>
1 1 3
2 2 4
3 3 1
4 4 4
5 5 2
Since the non-NA values within each row are identical, the mean of each row returns that same value; max or min would work as well.
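The mean trick relies on every row holding a single distinct non-NA value. A small sketch of how one might check that assumption first, using made-up data (df_check and its values are hypothetical, not from the question):

```r
# Hypothetical data: row 2 holds two different non-NA values (4 and 2).
df_check <- data.frame(
  id    = c(1, 2),
  cat.1 = c(3, 4),
  cat.2 = c(NA, 2),
  cat.3 = c(3, NA)
)
cat_cols <- c("cat.1", "cat.2", "cat.3")

# Row means silently average the conflicting values in row 2.
row_mean <- rowMeans(df_check[cat_cols], na.rm = TRUE)

# Count distinct non-NA values per row; anything above 1 is a conflict.
n_distinct_vals <- apply(df_check[cat_cols], 1,
                         function(x) length(unique(x[!is.na(x)])))

unname(row_mean)         # 3 3
unname(n_distinct_vals)  # 1 2
```

Row 2 gets a mean of 3 even though neither 4 nor 2 is 3, so if the count is ever above 1 a coalesce-style approach is probably the safer choice.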

Here is a base R solution which uses apply:
df$cat <- apply(df, 1, function(x) unique(x[!is.na(x)][-1]))
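For anyone who wants the coalesce behaviour without extra packages, here is a minimal base R sketch. The helper name first_non_na is made up for illustration; it keeps the value found so far and fills gaps from the next column:

```r
df <- data.frame(
  id    = c(1, 2, 3, 4, 5),
  cat.1 = c(3, 4, NA, 4, 2),
  cat.2 = c(3, NA, 1, 4, NA),
  cat.3 = c(3, 4, 1, 4, 2)
)

# Hypothetical helper: take a where it is present, otherwise fall back to b.
first_non_na <- function(a, b) ifelse(is.na(a), b, a)

# Fold the helper over all columns whose names start with "cat."
cat_cols <- grep("^cat\\.", names(df), value = TRUE)
result <- data.frame(id = df$id, cat = Reduce(first_non_na, df[cat_cols]))
result$cat  # 3 4 1 4 2
```

Unlike the apply version, this takes the first non-NA value from left to right, so it still returns one value per row when the columns disagree.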

Related

Label distinct combinations across multiple columns in R

I want to create a new column that labels each unique combination of values across x, y, z columns. My current work-around to achieve that is this:
> library(tidyverse)
>
> set.seed(100)
> df = tibble(x = sample.int(5, 50, replace = T), y = sample.int(5, 50, replace = T), z = sample.int(5, 50, replace = T))
> df
# A tibble: 50 x 3
x y z
<int> <int> <int>
1 2 4 4
2 3 4 4
3 1 3 5
4 2 1 4
5 4 2 5
6 4 5 2
7 2 3 4
8 3 5 4
9 2 4 1
10 5 5 2
# … with 40 more rows
>
> df2 = df %>% distinct(x,y,z) %>% rowid_to_column("unique_id") %>% left_join(df)
Joining, by = c("x", "y", "z")
> df2
# A tibble: 50 x 4
unique_id x y z
<int> <int> <int> <int>
1 1 2 4 4
2 2 3 4 4
3 3 1 3 5
4 4 2 1 4
5 4 2 1 4
6 5 4 2 5
7 5 4 2 5
8 6 4 5 2
9 6 4 5 2
10 7 2 3 4
# … with 40 more rows
What is a better/more efficient way to do this on a fairly large dataset? I'd like to stay within tidyverse but also open to other suggestions.
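You could use rleidv from data.table. It numbers consecutive runs of identical rows, so sort the data by x, y, z first to get one id per combination:
df <- df[order(df$x, df$y, df$z), ]
df$unique_id <- data.table::rleidv(df)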
In dplyr, we can use the group_indices function for this purpose, which generates a unique id for each group of values.
library(dplyr)
df %>% mutate(unique_id = group_indices(., x, y, z))
In dplyr 1.0.0 and later (the devel version when this was written), we can use cur_group_id
library(dplyr)
df %>%
group_by_all() %>%
mutate(unique_id = cur_group_id())
Or using .GRP from data.table
library(data.table)
setDT(df)[, unique_id := .GRP, names(df)]
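If you would rather avoid extra packages, a rough base R sketch of the same idea is to paste the columns into one key per row and number the keys in order of first appearance. The data below is a small made-up sample, and the "_" separator is an assumption that only holds if it never occurs inside the values:

```r
df <- data.frame(
  x = c(2, 3, 2, 1, 3),
  y = c(4, 4, 4, 3, 4),
  z = c(4, 4, 4, 5, 4)
)

# Build one key per row (assumes "_" does not appear in the data).
key <- do.call(paste, c(df, sep = "_"))

# match() against the unique keys gives ids in order of first appearance.
df$unique_id <- match(key, unique(key))
df$unique_id  # 1 2 1 3 2
```

This does not need the rows to be sorted, and the ids follow the order in which combinations first appear, like the distinct() plus rowid_to_column() work-around in the question.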

Dense Rank by Multiple Columns in R

How can I get a dense rank of multiple columns in a dataframe? For example,
# I have:
df <- data.frame(x = c(1,1,1,1,2,2,2,3,3,3),
y = c(1,2,3,4,2,2,2,1,2,3))
# I want:
res <- data.frame(x = c(1,1,1,1,2,2,2,3,3,3),
y = c(1,2,3,4,2,2,2,1,2,3),
r = c(1,2,3,4,5,5,5,6,7,8))
res
x y r
1 1 1 1
2 1 2 2
3 1 3 3
4 1 4 4
5 2 2 5
6 2 2 5
7 2 2 5
8 3 1 6
9 3 2 7
10 3 3 8
My hack approach works for this particular dataset:
df %>%
arrange(x,y) %>%
mutate(r = if_else(y - lag(y,default=0) == 0, 0, 1)) %>%
mutate(r = cumsum(r))
But there must be a more general solution, maybe using functions like dense_rank() or row_number(). But I'm struggling with this.
dplyr solutions are ideal.
Right after posting, I think I found a solution here. In my case, it would be:
mutate(df, r = dense_rank(interaction(x,y,lex.order=T)))
But if you have a better solution, please share.
data.table
data.table has you covered with frank().
library(data.table)
frank(df, x, y, ties.method = 'dense')
[1] 1 2 3 4 5 5 5 6 7 8
You can use df$r <- frank(df, x, y, ties.method = 'dense') to add it as a new column.
tidyr/dplyr
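Another option (though clunkier) is to use tidyr::unite to collapse your columns into one and then apply dplyr::dense_rank. Note that this ranks the united strings, so it matches numeric order only while the values sort the same way as text (single digits, for example).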
library(tidyverse)
df %>%
# add a single column with all the info
unite(xy, x, y) %>%
cbind(df) %>%
# dense rank on that
mutate(r = dense_rank(xy)) %>%
# now drop the helper col
select(-xy)
You can use cur_group_id:
library(dplyr)
df %>%
group_by(x, y) %>%
mutate(r = cur_group_id())
# x y r
# <dbl> <dbl> <int>
# 1 1 1 1
# 2 1 2 2
# 3 1 3 3
# 4 1 4 4
# 5 2 2 5
# 6 2 2 5
# 7 2 2 5
# 8 3 1 6
# 9 3 2 7
# 10 3 3 8
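For completeness, a base R sketch of what a dense rank over several columns amounts to: sort by the columns, then start a new rank every time a new (x, y) pair appears. This is just one way to write it, not what dense_rank or frank do internally:

```r
df <- data.frame(x = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
                 y = c(1, 2, 3, 4, 2, 2, 2, 1, 2, 3))

# Row order when sorting by x, then y.
ord <- order(df$x, df$y)

# TRUE wherever a sorted row is the first of its (x, y) pair.
is_new <- !duplicated(df[ord, c("x", "y")])

# Running count of new pairs, written back in the original row order.
r <- integer(nrow(df))
r[ord] <- cumsum(is_new)
df$r <- r
df$r  # 1 2 3 4 5 5 5 6 7 8
```

Because the ranks are written back through ord, this should also work when the input is not sorted.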

Creating a new data frame using existing data

I would like to create a new data frame from my existing data frame "ab". The new data frame should look like "Newdf".
a<- c(1:5)
b<-c(11:15)
ab<-data.frame(C1=a,c2=b)
ab
df<-c(1,11,2,12,3,13,4,14,5,15)
CMT<-c(1:2)
CMT1<-rep.int(CMT,times=5)
Newdf<-data.frame(DV=df,Comp=CMT1)
Newdf
Can we use dplyr package? If yes, how?
More important than dplyr, you need tidyr:
library(tidyr)
library(dplyr)
ab %>%
gather(Comp, DV) %>%
mutate(Comp = recode(Comp, "C1" = 1, "c2" = 2))
# Comp DV
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 5
# 6 2 11
# 7 2 12
# 8 2 13
# 9 2 14
# 10 2 15
Using dplyr and tidyr gives you something close...
library(tidyr)
library(dplyr)
df2 <- ab %>%
mutate(Order=1:n()) %>%
gather(key=Comp,value=DV,C1,c2) %>%
arrange(Order) %>%
mutate(Comp=recode(Comp,"C1"=1,"c2"=2)) %>%
select(DV,Comp)
df2
DV Comp
1 1 1
2 11 2
3 2 1
4 12 2
5 3 1
6 13 2
7 4 1
8 14 2
9 5 1
10 15 2
Although the OP has asked for a dplyr solution, I felt challenged to look for a data.table solution. So, FWIW, here is an alternative approach using melt().
Note that this solution does not depend on specific column names in ab as the two dplyr solutions do. In addition, it should work for more than two columns in ab as well (untested).
library(data.table)
melt(setDT(ab, keep.rownames = TRUE), id.vars = "rn", value.name = "DV"
)[, Comp := rleid(variable)
][order(rn)][, c("rn", "variable") := NULL][]
# DV Comp
# 1: 1 1
# 2: 11 2
# 3: 2 1
# 4: 12 2
# 5: 3 1
# 6: 13 2
# 7: 4 1
# 8: 14 2
# 9: 5 1
#10: 15 2
Data
ab <- structure(list(C1 = 1:5, c2 = 11:15), .Names = c("C1", "c2"),
row.names = c(NA, -5L), class = "data.frame")
ab
# C1 c2
#1 1 11
#2 2 12
#3 3 13
#4 4 14
#5 5 15
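A base R sketch of the same reshaping, in case neither package is at hand. It relies on the fact that transposing the data puts each original row into a column, so reading the matrix column by column interleaves the values:

```r
ab <- data.frame(C1 = 1:5, c2 = 11:15)

Newdf <- data.frame(
  # t() turns each row of ab into a column; as.vector() then reads row by row.
  DV   = as.vector(t(as.matrix(ab))),
  # Column index 1..ncol(ab), repeated once per original row.
  Comp = rep(seq_along(ab), times = nrow(ab))
)
Newdf$DV    # 1 11 2 12 3 13 4 14 5 15
Newdf$Comp  # 1 2 1 2 1 2 1 2 1 2
```

Like the melt() version, this does not refer to the column names, and it assumes all columns of ab have the same type.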

summarise by group of columns using min and maintaining row number

I have a data frame with 3 columns
df <- data.frame(ID1=c(rep(1,4),rep(2,4)), ID2=rep(1:2,4), value=1:8)
I need to recover the min for each group (ID1, ID2) and the position (row name) of this min in the original table.
Using group_by and summarise, I have obtained the min, but I can't see a way to obtain the position, as summarise drops the columns that are neither summarised nor used for grouping.
df<-data.frame(ID1=c(rep(1,4),rep(2,4)), ID2=rep(1:2,4), value=1:8)
df[['X']] <- paste0(df$ID1,'.',df$ID2)
df <- group_by( df, X )
df <- summarise( df, Objective=min(value) )
Any ideas on how to get the following?
X Objective Position
1 1.1 1 1
2 1.2 2 2
3 2.1 5 5
4 2.2 6 6
Thanks in advance
If I understand correctly, and since you're already using dplyr, you could do it like this:
library(dplyr); library(tidyr)
unite(df, X, ID1:ID2, sep = ".") %>%
mutate(Position = row_number()) %>%
group_by(X) %>% slice(which.min(value))
#Source: local data frame [4 x 3]
#Groups: X
#
# X value Position
#1 1.1 1 1
#2 1.2 2 2
#3 2.1 5 5
#4 2.2 6 6
Or alternatively (only dplyr) - I'd rather use this one:
mutate(df, Position = row_number()) %>% group_by(ID1, ID2) %>% slice(which.min(value))
#Source: local data frame [4 x 4]
#Groups: ID1, ID2
#
# ID1 ID2 value Position
#1 1 1 1 1
#2 1 2 2 2
#3 2 1 5 5
#4 2 2 6 6
data
df <- data.frame(ID1=rep(1:2, each = 4), ID2=rep(1:2,4), value=1:8)
Here's how I would approach this using data.table (rn would be your row number).
library(data.table)
setDT(df, keep.rownames = TRUE)[, .SD[which.min(value)], list(ID1, ID2)]
# ID1 ID2 rn value
# 1: 1 1 1 1
# 2: 1 2 2 2
# 3: 2 1 5 5
# 4: 2 2 6 6
Another option is ordering and then picking the unique values
unique(setorder(df, value), by = c("ID1", "ID2"))
# ID1 ID2 rn value
# 1: 1 1 1 1
# 2: 1 2 2 2
# 3: 2 1 5 5
# 4: 2 2 6 6
Neither approach requires creating the X column
Or using base R
df <- df[order(df$value), ]
df[!duplicated(df[, 1:2]), ]
# ID1 ID2 value
# 1 1 1 1
# 2 1 2 2
# 5 2 1 5
# 6 2 2 6
data
df <- data.frame(ID1=c(rep(1,4),rep(2,4)), ID2=rep(1:2,4), value=1:8)
Using Aggregate:
Data:
df<-data.frame(ID1=c(rep(1,4),rep(2,4)), ID2=rep(1:2,4), value=1:8)
df[['X']] <- paste0(df$ID1,'.',df$ID2)
df$rn<-row.names(df) #rn is the row number
df<-df[c("X","rn","value")]
#> df
# X rn value
#1 1.1 1 1
#2 1.2 2 2
#3 1.1 3 3
#4 1.2 4 4
#5 2.1 5 5
#6 2.2 6 6
#7 2.1 7 7
#8 2.2 8 8
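Aggregate step (note that min is applied to each column independently, so rn matches the position of the minimum only because value increases with the row number here):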
df2<- aggregate(df, by=list(c(df$X)), min)
#> df2
# Group.1 X rn value
#1 1.1 1.1 1 1
#2 1.2 1.2 2 2
#3 2.1 2.1 5 5
#4 2.2 2.2 6 6
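Taking min of every column separately only lines up with the right row when the values happen to increase with the row number. A base R sketch that looks up the row of each group's minimum directly, on hypothetical shuffled values (not the data from the question):

```r
# Hypothetical data: the minimum is not the first row of any group except 1.2.
df <- data.frame(ID1 = rep(1:2, each = 4), ID2 = rep(1:2, 4),
                 value = c(7, 2, 1, 8, 5, 6, 3, 4))

# Row indices of each (ID1, ID2) group, named "1.1", "2.1", "1.2", "2.2".
groups <- split(seq_len(nrow(df)), list(df$ID1, df$ID2), drop = TRUE)

# Within each group, pick the row index where value is smallest.
pos <- vapply(groups, function(i) i[which.min(df$value[i])], integer(1))

res <- data.frame(X = names(pos), Objective = df$value[pos],
                  Position = unname(pos))
res <- res[order(res$X), ]
res$Objective  # 1 2 3 4
res$Position   # 3 2 7 8
```

which.min returns the first minimum, so ties within a group resolve to the earliest row.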

R, dplyr: cumulative version of n_distinct

I have a dataframe as follows. It is ordered by column time.
Input -
df = data.frame(time = 1:20,
grp = sort(rep(1:5,4)),
var1 = rep(c('A','B'),10)
)
head(df,10)
time grp var1
1 1 1 A
2 2 1 B
3 3 1 A
4 4 1 B
5 5 2 A
6 6 2 B
7 7 2 A
8 8 2 B
9 9 3 A
10 10 3 B
I want to create another variable var2 which computes the number of distinct var1 values seen so far, i.e. up to that point in time, for each group grp. This is a little different from what I'd get if I were to use n_distinct.
Expected output -
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
I want to create a function say cum_n_distinct for this and use it as -
d_out = df %>%
arrange(time) %>%
group_by(grp) %>%
mutate(var2 = cum_n_distinct(var1))
A dplyr solution inspired by @akrun's answer -
The logic is basically to set the 1st occurrence of each unique value of var1 to 1 and the rest to 0 within each group grp, and then apply cumsum to it -
df = df %>%
arrange(time) %>%
group_by(grp,var1) %>%
mutate(var_temp = ifelse(row_number()==1,1,0)) %>%
group_by(grp) %>%
mutate(var2 = cumsum(var_temp)) %>%
select(-var_temp)
head(df,10)
Source: local data frame [10 x 4]
Groups: grp
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
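The "flag first occurrences, then cumsum" logic can be written compactly in base R with duplicated(). A minimal sketch, where cum_distinct is a made-up helper name:

```r
df <- data.frame(time = 1:20,
                 grp  = sort(rep(1:5, 4)),
                 var1 = rep(c("A", "B"), 10))

# !duplicated() is TRUE at the first occurrence of each value;
# cumsum() turns those flags into a running count of distinct values.
cum_distinct <- function(x) cumsum(!duplicated(x))

# ave() runs the function on the row indices of each grp and keeps row order.
df$var2 <- ave(seq_along(df$var1), df$grp,
               FUN = function(i) cum_distinct(df$var1[i]))
head(df$var2, 8)  # 1 2 2 2 1 2 2 2
```

Note that duplicated() treats NA as a value of its own, so NAs would count as one extra distinct element here.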
Assuming stuff is ordered by time already, first define a cumulative distinct function:
dist_cum <- function(var)
sapply(seq_along(var), function(x) length(unique(head(var, x))))
Then a base solution that uses ave to create groups and applies our function to each group (var1 is converted to a factor first, so as.integer also works when var1 is a character column):
transform(df, var2=ave(as.integer(factor(var1)), grp, FUN=dist_cum))
A data.table solution, basically doing the same thing:
library(data.table)
(data.table(df)[, var2:=dist_cum(var1), by=grp])
And dplyr, again, same thing:
library(dplyr)
df %>% group_by(grp) %>% mutate(var2=dist_cum(var1))
Update: with your new dataset, an approach in base R
df$var2 <- unlist(lapply(split(df, df$grp),
function(x) {x$var2 <-0
indx <- match(unique(x$var1), x$var1)
x$var2[indx] <- 1
cumsum(x$var2) }))
head(df,7)
# time grp var1 var2
# 1 1 1 A 1
# 2 2 1 B 2
# 3 3 1 A 2
# 4 4 1 B 2
# 5 5 2 A 1
# 6 6 2 B 2
# 7 7 2 A 2
Here's another solution using data.table that's pretty quick.
Generic Function
cum_n_distinct <- function(x, na.include = TRUE){
# Given a vector x, returns a corresponding vector y
# where the ith element of y gives the number of unique
# elements observed up to and including index i
# if na.include = TRUE (default) NA is counted as an
# additional unique element, otherwise it's essentially ignored
temp <- data.table(x, idx = seq_along(x))
firsts <- temp[temp[, .I[1L], by = x]$V1]
if(na.include == FALSE) firsts <- firsts[!is.na(x)]
y <- rep(0, times = length(x))
y[firsts$idx] <- 1
y <- cumsum(y)
return(y)
}
Example Use
cum_n_distinct(c(5,10,10,15,5)) # 1 2 2 3 3
cum_n_distinct(c(5,NA,10,15,5)) # 1 2 3 4 4
cum_n_distinct(c(5,NA,10,15,5), na.include = FALSE) # 1 1 2 3 3
Solution To Your Question
d_out = df %>%
arrange(time) %>%
group_by(grp) %>%
mutate(var2 = cum_n_distinct(var1))
