Combining columns, while ignoring duplicates and NAs - r

I have a data frame as follows, and I would like to combine two columns, Var1 and Var2. The combined column (Var3) should contain no duplicate <alpha><digit> tokens. For example, if Var1 == A1 and Var2 == A1, then Var3 should be A1, not A1-A1; if Var1 == A4-E9 and Var2 == A4, then Var3 should be A4-E9, not A4-E9-A4.
df <- read.table(header = TRUE, text =
"id Var1 Var2
A A1 A1
B F2 A2
C NA A3
D A4-E9 A4
E E5 A5
F NA NA
G B2-R4 A3-B2
H B3-B4 E1-G5", stringsAsFactors = FALSE)
The following is my code. I would like to improve its readability and get rid of the NA that appears in row 3's entry for Var3, i.e. A3-NA.
library(dplyr)
library(tidyr)
df %>%
mutate(Var3 = paste(Var1, Var2, sep = "-")) %>%
separate_rows(Var3, sep = "-") %>%
group_by(id, Var3) %>%
slice(1) %>%
group_by(id) %>%
mutate(Var3 = paste(unlist(Var3[!is.na(Var3)]), collapse = "-")) %>%
slice(1) %>%
ungroup
Here's my desired output:
# A tibble: 8 x 4
id Var1 Var2 Var3
<chr> <chr> <chr> <chr>
1 A A1 A1 A1
2 B F2 A2 A2-F2
3 C <NA> A3 A3
4 D A4-E9 A4 A4-E9
5 E E5 A5 A5-E5
6 F <NA> <NA> <NA>
7 G B2-R4 A3-B2 A3-B2-R4
8 H B3-B4 E1-G5 B3-B4-E1-G5

If 'df1' is the output of your code, we can remove the 'NA' that follows a - with sub:
df1 %>%
mutate(Var3 = sub("-NA", "", Var3))
# A tibble: 8 x 4
# id Var1 Var2 Var3
# <chr> <chr> <chr> <chr>
#1 A A1 A1 A1
#2 B F2 A2 A2-F2
#3 C <NA> A3 A3
#4 D A4-E9 A4 A4-E9
#5 E E5 A5 A5-E5
#6 F <NA> <NA> NA
#7 G B2-R4 A3-B2 A3-B2-R4
#8 H B3-B4 E1-G5 B3-B4-E1-G5
We can also do this slightly differently with tidyverse: gather the data into 'long' format, split the 'value' column with separate_rows, group by 'id', summarise by pasting the sorted unique elements of 'value' into 'Var3', and left_join the result with the original dataset 'df'.
library(tidyverse)
gather(df, key, value, -id) %>%
separate_rows(value) %>%
group_by(id) %>%
summarise(Var3 = paste(sort(unique(value)), collapse='-')) %>%
mutate(Var3 = replace(Var3, Var3=='', NA)) %>%
left_join(df, .)
# id Var1 Var2 Var3
#1 A A1 A1 A1
#2 B F2 A2 A2-F2
#3 C <NA> A3 A3
#4 D A4-E9 A4 A4-E9
#5 E E5 A5 A5-E5
#6 F <NA> <NA> <NA>
#7 G B2-R4 A3-B2 A3-B2-R4
#8 H B3-B4 E1-G5 B3-B4-E1-G5
NOTE: The %>% spreads even simple code over multiple lines; if required, all of those statements can be put on a single line and called a one-liner.
Here is a one-liner with data.table:
library(data.table)
setDT(df)[, Var3 := paste(sort(unique(unlist(strsplit(unlist(.SD),"-")))), collapse="-"), id]

You could do it in one line
df$Var3 = lapply(strsplit(paste(df$Var1, df$Var2, sep = "-"),"-"),
function(x)paste(unique(x)[unique(x)!="NA"],collapse="-"))
Output:
id Var1 Var2 Var3
1 A A1 A1 A1
2 B F2 A2 F2-A2
3 C <NA> A3 A3
4 D A4-E9 A4 A4-E9
5 E E5 A5 E5-A5
6 F <NA> <NA>
7 G B2-R4 A3-B2 B2-R4-A3
8 H B3-B4 E1-G5 B3-B4-E1-G5
The first part in the lapply function is similar to your first call with dplyr. First the columns are concatenated, and then we split them again.
The function within lapply removes all NA's, and then collapses the string again.
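For comparison, the same split, drop-NA, de-duplicate and collapse steps can be written as a small base R helper. This is only a sketch: the name combine_unique is made up for illustration, and unlike the lapply version it also sorts the tokens and returns a real NA when both inputs are missing, which is what the desired output in the question asks for.

```r
df <- read.table(header = TRUE, text =
"id Var1 Var2
A A1 A1
B F2 A2
C NA A3
D A4-E9 A4
E E5 A5
F NA NA
G B2-R4 A3-B2
H B3-B4 E1-G5", stringsAsFactors = FALSE)

# Hypothetical helper: combine two hyphen-delimited columns row by row.
combine_unique <- function(x, y, sep = "-") {
  mapply(function(a, b) {
    # Split both values into tokens; a missing value stays NA after strsplit
    tokens <- unlist(strsplit(c(a, b), sep, fixed = TRUE))
    # Drop NAs, then keep each token once, in sorted order
    tokens <- sort(unique(tokens[!is.na(tokens)]))
    # Nothing left means both inputs were NA: return a real NA
    if (length(tokens) == 0) NA_character_ else paste(tokens, collapse = sep)
  }, x, y, USE.NAMES = FALSE)
}

df$Var3 <- combine_unique(df$Var1, df$Var2)
df
```

Because the helper works on two plain character vectors, it can also be called inside dplyr::mutate() without further changes.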
Hope this helps!
EDIT: Speed comparison for fun!
262,144 rows
Average runtimes:
Florian: 3.97 seconds
Sotos: 2.46 seconds
Akrun: 1.34 seconds
Adamm: >120 seconds
df <- read.table(header = TRUE, text =
"id Var1 Var2
A A1 A1
B F2 A2
C NA A3
D A4-E9 A4
E E5 A5
F NA NA
G B2-R4 A3-B2
H B3-B4 E1-G5", stringsAsFactors = FALSE)
for(i in 1:15)
{
df = rbind(df,df)
}
library(microbenchmark)
# Florian's method
microbenchmark(
lapply(strsplit(paste(df$Var1, df$Var2, sep = "-"),"-"),
function(x)paste(unique(x)[unique(x)!="NA"],collapse="-")),times=5)
# Sotos'method
microbenchmark(
gsub('NA-|-NA', '', vapply(strsplit(do.call(paste, df[-1]), " |-"), function(i) paste(unique(i), collapse = "-"), character(1L))), times=5)
# akrun method
library(data.table)
microbenchmark(
setDT(df)[, Var3 := paste(sort(unique(unlist(strsplit(unlist(.SD),"-")))), collapse="-"), id], times=5)
# Adamm method
microbenchmark(
sapply(1:nrow(df), function(i) ifelse(df[i,2]!=df[i,3] & !is.na(df[i,2]) & !is.na(df[i,3]), paste(df[i,2], df[i,3], sep="-"), ifelse(!is.na(df[i,3]), df[i,3], df[i,2]))), times=5)

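If you want a more complex solution, here is a long one-liner with nested ifelse(). Note that, as the output shows, it does not remove duplicates inside hyphenated values (rows D and G).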
df$Var3 <- sapply(1:nrow(df), function(i) ifelse(df[i,2]!=df[i,3] & !is.na(df[i,2]) & !is.na(df[i,3]), paste(df[i,2], df[i,3], sep="-"), ifelse(!is.na(df[i,3]), df[i,3], df[i,2])))
> df
id Var1 Var2 Var3
1 A A1 A1 A1
2 B F2 A2 F2-A2
3 C <NA> A3 A3
4 D A4-E9 A4 A4-E9-A4
5 E E5 A5 E5-A5
6 F <NA> <NA> <NA>
7 G B2-R4 A3-B2 B2-R4-A3-B2
8 H B3-B4 E1-G5 B3-B4-E1-G5
Regarding efficiency, I ran a small experiment and measured the run time of each proposed solution. Here are the results.
First of all, I need more rows:
n <- 10000
df <- do.call("rbind", replicate(n, df, simplify = FALSE))
Akrun solution 1 with tidyverse
Time difference of 1.452809 secs
Akrun solution 2 with data.table
Time difference of 0.4530261 secs
Florian Maas solution with lapply
Time difference of 1.812106 secs
My solution with sapply
Time difference of 2.289345 mins
Sotos solution
Time difference of 1.515296 secs
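The "Time difference of ..." lines look like the result of subtracting two Sys.time() values. Here is a minimal sketch of that pattern, assuming this is how the timings were taken; time_it is a hypothetical helper, not part of any package, and the workload is only a stand-in.

```r
# Hypothetical helper: run a function and return the elapsed wall-clock time.
time_it <- function(f) {
  start <- Sys.time()
  f()
  Sys.time() - start   # a difftime object, printed as "Time difference of ..."
}

elapsed <- time_it(function() {
  # Stand-in workload: paste two columns together and split them again
  x <- paste(rep("A1", 1e4), rep("A2", 1e4), sep = "-")
  invisible(strsplit(x, "-", fixed = TRUE))
})
print(elapsed)
```

A single run like this is noisy; microbenchmark, as used in the other answer, repeats the expression several times.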

Related

Convert nested lists to data.frame using all recursive indexes as colnames and fill missing columns with NAs

I have been struggling with this for a day now, and none of the research I have done on SO produces the result that I need.
I have this list:
input_list <- list(
list(A = 'a1', B = 'b1', C = 'c1', D = 'd1'),
list(A = 'a2', C = 'c2', D = 'd2'),
list(A = 'a3', B = 'b3', C = 'c3'),
list(A = 'a4', B = 'b4', C = 'c4',
D = list(
sub_1 = "d4_1",
sub_2 = "d4_2")
)
)
Basically I want to turn it into this structure:
#tbl_df
#A B C D D.sub_1 D.sub_2
#a1 b1 c1 d1 NA NA
#a2 NA c2 d2 NA NA
#a3 b3 c3 NA NA NA
#a4 b4 c4 NA d4_1 d4_2
I tried messing with map function:
output_list <- input_list %>%
map(unlist) %>%
do.call(rbind.data.frame, .)
It correctly unlists all nested lists converting them to named vectors, but I'm stuck as to how to rbind the rows matching the column names and fill missing variables with NAs.
Any help appreciated.
Maybe try this. You can use unlist() with lapply() to flatten each element, then transform each one to a data frame using as.data.frame(t(...)). Finally, bind_rows() from dplyr binds the elements, matching columns by name and filling the gaps with NA. Here is the code:
library(dplyr)
#Code
newdf <- bind_rows(lapply(input_list, function(x) as.data.frame(t(unlist(x)))))
Output:
A B C D D.sub_1 D.sub_2
1 a1 b1 c1 d1 <NA> <NA>
2 a2 <NA> c2 d2 <NA> <NA>
3 a3 b3 c3 <NA> <NA> <NA>
4 a4 b4 c4 <NA> d4_1 d4_2
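If you would rather avoid dplyr, the same idea (flatten each element, then align on the union of names) can be sketched in base R. The variable names rows, all_cols and out are my own, and this assumes every leaf value is a single string, as in the example.

```r
input_list <- list(
  list(A = 'a1', B = 'b1', C = 'c1', D = 'd1'),
  list(A = 'a2', C = 'c2', D = 'd2'),
  list(A = 'a3', B = 'b3', C = 'c3'),
  list(A = 'a4', B = 'b4', C = 'c4',
       D = list(sub_1 = "d4_1", sub_2 = "d4_2"))
)

# Flatten each element to a named character vector (nested names become D.sub_1 etc.)
rows <- lapply(input_list, unlist)

# Union of all names, in order of first appearance
all_cols <- unique(unlist(lapply(rows, names)))

# Index every row by the full set of names; absent names yield NA
mat <- t(vapply(rows, function(r) unname(r[all_cols]),
                character(length(all_cols))))

out <- as.data.frame(mat, stringsAsFactors = FALSE)
names(out) <- all_cols
out
```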
You can use map_dfr :
purrr::map_dfr(input_list, as.data.frame)
# A B C D D.sub_1 D.sub_2
#1 a1 b1 c1 d1 <NA> <NA>
#2 a2 <NA> c2 d2 <NA> <NA>
#3 a3 b3 c3 <NA> <NA> <NA>
#4 a4 b4 c4 <NA> d4_1 d4_2
We can use unnest_wider from tidyr
library(tidyr)
library(dplyr)
tibble(col = input_list) %>%
unnest_wider(c(col)) %>%
unnest_wider(c(D))

Conditionally copy contents of one column to another [duplicate]

I want to add extra columns whose contents depend on whether the value of code is listed in VAR.
DF <- data.frame(id = c(1:5), code = c("A","B","C","D","E"), sub = c("A1","B1","C1","D1","E1"))
id code sub
1 1 A A1
2 2 B B1
3 3 C C1
4 4 D D1
5 5 E E1
VAR <- c("A","B")
The result should be:
id code sub AB ABsub
1 1 A A1 A A1
2 2 B B1 B B1
3 3 C C1 <NA> <NA>
4 4 D D1 <NA> <NA>
5 5 E E1 <NA> <NA>
Or using dplyr:
library(dplyr)
DF<-data.frame(id=c(1:5),code=c("A","B","C","D","E"),sub=c("A1","B1","C1","D1","E1"), stringsAsFactors = FALSE)
VAR<-c("A","B")
DF <- DF %>%
mutate(AB = ifelse(code %in% {{VAR}}, code, NA_character_)) %>%
mutate(ABsub = ifelse(code == AB, sub, NA_character_))
with:
> DF
id code sub AB ABsub
1 1 A A1 A A1
2 2 B B1 B B1
3 3 C C1 <NA> <NA>
4 4 D D1 <NA> <NA>
5 5 E E1 <NA> <NA>
This also works if VAR equals c("A", "B", "C"), although we do not know whether that is what you are after.
A simple base R option using merge + subset
merge(DF,subset(DF,code %in% VAR),by = "id",all = TRUE)
such that
> merge(DF,subset(DF,code %in% VAR),by = "id",all = TRUE)
id code.x sub.x code.y sub.y
1 1 A A1 A A1
2 2 B B1 B B1
3 3 C C1 <NA> <NA>
4 4 D D1 <NA> <NA>
5 5 E E1 <NA> <NA>
A dplyr solution with across():
library(dplyr)
DF %>%
mutate(across(-id, ~ replace(.x, !(code %in% VAR), NA), .names = "AB{col}"))
# id code sub ABcode ABsub
# 1 1 A A1 A A1
# 2 2 B B1 B B1
# 3 3 C C1 <NA> <NA>
# 4 4 D D1 <NA> <NA>
# 5 5 E E1 <NA> <NA>
or with left_join():
DF %>%
filter(code %in% VAR) %>%
left_join(DF, ., by = "id", suffix = c("", "AB"))
# id code sub codeAB subAB
# 1 1 A A1 A A1
# 2 2 B B1 B B1
# 3 3 C C1 <NA> <NA>
# 4 4 D D1 <NA> <NA>
# 5 5 E E1 <NA> <NA>
Note: If you have multiple columns in your real data, you don't need to type
mutate(Col1 = ifelse(...), Col2 = ifelse(...), etc.)
one by one.
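The same "several columns at once" idea can be sketched in base R. The names cols, keep, copies and out are illustrative only, and the AB prefix simply mirrors the across() example.

```r
DF <- data.frame(id = 1:5,
                 code = c("A", "B", "C", "D", "E"),
                 sub = c("A1", "B1", "C1", "D1", "E1"),
                 stringsAsFactors = FALSE)
VAR <- c("A", "B")

cols <- c("code", "sub")          # columns to copy
keep <- DF$code %in% VAR          # rows whose code is listed in VAR

copies <- DF[cols]                # copy the chosen columns
copies[!keep, ] <- NA             # blank out the rows that do not match
names(copies) <- paste0("AB", cols)

out <- cbind(DF, copies)
out
```

Adding another column to copy is then a matter of extending cols.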
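Here's a base R solution:
AB <- ifelse(DF$code %in% VAR, as.character(DF$code), NA)  # as.character() guards against factor columns
ABsub <- ifelse(DF$code %in% VAR, as.character(DF$sub), NA)
cbind(DF, AB, ABsub)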

add row based on variable condition in R

I have df as follow
df
ID type other-col
1 A1 cc
1 A2 dd
1 A3 cc
2 A1 cc
2 B1 aa
3 A2 aa
I want to add a new row whenever "ID" changes, with the value F in the "type" and "other-col" columns:
new_df
ID type other-col
1 A1 cc
1 A2 dd
1 A3 cc
1 F F <- this row added
2 A1 cc
2 B1 aa
2 F F <- this row added
3 A2 aa
How can I do this in R?
Thanks.
This should be doable in a single replacement operation once you know the indexes of where each change occurs. E.g.:
idx <- match(unique(df$ID), df$ID)[-1] - 1
df <- df[sort(c(sequence(nrow(df)),idx)),]
df[seq_along(idx) + idx, c("type","other_col")] <- "F"
# ID type other_col
#1 1 A1 cc
#2 1 A2 dd
#3 1 A3 cc
#3.1 1 F F
#4 2 A1 cc
#5 2 B1 aa
#5.1 2 F F
#6 3 A2 aa
Where df was:
df <- read.table(text="ID type other_col
1 A1 cc
1 A2 dd
1 A3 cc
2 A1 cc
2 B1 aa
3 A2 aa", header=TRUE, stringsAsFactors=FALSE)
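To see why the index arithmetic works, it may help to look at the intermediate values on the example data. This is just the code above unrolled; the extra names first_rows, expanded and targets are mine.

```r
df <- read.table(text = "ID type other_col
1 A1 cc
1 A2 dd
1 A3 cc
2 A1 cc
2 B1 aa
3 A2 aa", header = TRUE, stringsAsFactors = FALSE)

# First row of each ID block
first_rows <- match(unique(df$ID), df$ID)      # 1 4 6

# Last row before each change of ID
idx <- first_rows[-1] - 1                      # 3 5

# Repeat those rows once, keeping the original order
expanded <- sort(c(sequence(nrow(df)), idx))   # 1 2 3 3 4 5 5 6

# Each duplicate sits one position further down for every row inserted before it
targets <- seq_along(idx) + idx                # 4 7

df <- df[expanded, ]
df[targets, c("type", "other_col")] <- "F"
df
```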
Here is an option with group_split and add_row. We split by 'ID' with group_split into a list of data frames, loop through the list with map_dfr, and add a row at the end of each one (add_row appends by default, but the position can be controlled with .before and .after). Finally, we slice out the last row, because the last 'ID' does not need an 'F' row.
library(tidyverse)
df1 %>%
group_split(ID) %>%
map_dfr(~ .x %>%
add_row(ID = first(.$ID), type = 'F', `other-col` = 'F')) %>%
slice(-n())
Here is another approach with a similar idea to @akrun's answer.
library(tidyverse)
dat2 <- dat %>%
split(f = .$ID) %>%
map_if(.p = function(x) unique(x$ID) < max(dat$ID),
~bind_rows(.x, tibble(ID = unique(.x$ID), type = "F", `other.col` = "F"))) %>%
bind_rows()
dat2
# ID type other.col
# 1 1 A1 cc
# 2 1 A2 dd
# 3 1 A3 cc
# 4 1 F F
# 5 2 A1 cc
# 6 2 B1 aa
# 7 2 F F
# 8 3 A2 aa
Data
dat <- read.table(text = "ID type other-col
1 A1 cc
1 A2 dd
1 A3 cc
2 A1 cc
2 B1 aa
3 A2 aa",
header = TRUE, stringsAsFactors = FALSE)
Update
This updated answer shows that if the ID column is character rather than integer, we can create a new column (ID2 in this case) by converting ID to a factor and then to integer. The rest of the operation is similar to the original answer but based on ID2.
library(tidyverse)
dat2 <- dat %>%
mutate(ID2 = as.integer(factor(ID, levels = unique(.$ID)))) %>%
split(f = .$ID2) %>%
map_if(.p = function(x) unique(x$ID2) != unique(last(.)$ID2),
~bind_rows(.x, tibble(ID = unique(.x$ID), type = "F", `other.col` = "F",
ID2 = unique(.x$ID2)))) %>%
bind_rows() %>%
select(-ID2)
dat2
# ID type other.col
# 1 C A1 cc
# 2 C A2 dd
# 3 C A3 cc
# 4 C F F
# 5 A A1 cc
# 6 A B1 aa
# 7 A F F
# 8 B A2 aa
DATA
dat <- read.table(text = "ID type other-col
C A1 cc
C A2 dd
C A3 cc
A A1 cc
A B1 aa
B A2 aa",
header = TRUE, stringsAsFactors = FALSE)
Similar to akrun's answer but in base R. Basically, split the data frame by ID, rbind an extra row to each piece, recombine the pieces, and remove the unneeded last row using head(..., -1):
head(n = -1,
do.call(rbind,
lapply(split(dat, dat$ID), function(x) {
rbind(x, c(x$ID[1], "F", "F"))
})
)
)
ID type other.col
1.1 1 A1 cc
1.2 1 A2 dd
1.3 1 A3 cc
1.4 1 F F
2.4 2 A1 cc
2.5 2 B1 aa
2.3 2 F F
3.6 3 A2 aa
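One thing to be aware of: rbind(x, c(x$ID[1], "F", "F")) appends a character vector, so the ID column is likely to end up as character. A variant that appends a one-row data frame instead keeps the column types. This is a sketch; add_separator_rows is a made-up name, and the column names are assumed to be ID, type and other.col as in dat.

```r
dat <- read.table(text = "ID type other-col
1 A1 cc
1 A2 dd
1 A3 cc
2 A1 cc
2 B1 aa
3 A2 aa",
header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: append an "F" row after every ID block except the last.
add_separator_rows <- function(d) {
  # Split in order of first appearance so character IDs are not re-sorted
  groups <- split(d, factor(d$ID, levels = unique(d$ID)))
  padded <- lapply(groups, function(g) {
    rbind(g, data.frame(ID = g$ID[1], type = "F", other.col = "F",
                        stringsAsFactors = FALSE))
  })
  out <- do.call(rbind, padded)
  rownames(out) <- NULL
  head(out, -1)   # the last block does not need a separator
}

res <- add_separator_rows(dat)
res
```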
Using base R you could do the following (as the output shows, this also appends an F row after the last ID):
cbind(ID=sort(c(dat$ID,unique(dat$ID))),do.call(rbind,by(dat[-1],dat[1],rbind,'F')))
ID type other.col
1.1 1 A1 cc
1.2 1 A2 dd
1.3 1 A3 cc
1.4 1 F F
2.4 2 A1 cc
2.5 2 B1 aa
2.3 2 F F
3.6 3 A2 aa
3.2 3 F F
Or you could do:
do.call(rbind,by(dat,dat$ID,function(x)cbind(ID = unique(x[,1]),rbind(x[-1],"F"))))
Another base R option, using rle to build the row indices:
inds = head(cumsum(with(rle(df$ID), unlist(lapply(lengths, function(i) c((rep(1, i)), F = 0))))), -1)
df1 = df[inds,]
df1[which(names(inds) == "F"), c("type", "other_col")] = "F"
df1
# ID type other_col
#1 1 A1 cc
#2 1 A2 dd
#3 1 A3 cc
#3.1 1 F F
#4 2 A1 cc
#5 2 B1 aa
#5.1 2 F F
#6 3 A2 aa
A possible approach using data.table:
library(data.table)
m <- setDT(df)[, max(ID)]
df[, if (.BY$ID < m) rbind(.SD, as.list(rep("F", ncol(.SD)))) else .SD, ID]
output:
ID type other-col
1: 1 A1 cc
2: 1 A2 dd
3: 1 A3 cc
4: 1 F F
5: 2 A1 cc
6: 2 B1 aa
7: 2 F F
8: 3 A2 aa
Or, if you don't mind an extra row at the bottom, the code is shorter:
setDT(df)[, rbind(.SD, as.list(rep("F", ncol(.SD)))), ID]

access first row of group_by dataset

I have a data frame df1 with columns a, b, c. I want to assign c = 0 to the first row of each group returned by group_by(a, b). I tried something like
t <- df1 %>% group_by(a,b) %>% filter(row_number(a)==1) %>% mutate(c= 0)
But this reduced the number of rows. The expected output is
a b c
a1 b1 0
a1 b1 NA
a2 b2 0
a2 b2 NA
You can use seq_along to number elements in each group from 1 to the total number of elements within each group (2, in this case). Then use ifelse to set the first element of 'c' for each group to be 0 and leave the other element as is.
library(dplyr)
df %>%
group_by(a, b) %>%
mutate(c = ifelse(seq_along(c) == 1, 0, c))
# A tibble: 4 x 3
# Groups: a, b [2]
# a b c
# <fct> <fct> <dbl>
#1 a1 b1 0.
#2 a1 b1 NA
#3 a2 b2 0.
#4 a2 b2 NA
data
df <- data.frame(a = rep(c("a1", "a2"), each = 2),
b = rep(c("b1", "b2"), each = 2),
c = NA)
df
# a b c
#1 a1 b1 NA
#2 a1 b1 NA
#3 a2 b2 NA
#4 a2 b2 NA
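The same result can be had without grouping at all: in base R, duplicated() flags every row whose (a, b) combination has already been seen, so its negation marks the first row of each group. A minimal sketch on the data above (first_in_group is just an illustrative name):

```r
df <- data.frame(a = rep(c("a1", "a2"), each = 2),
                 b = rep(c("b1", "b2"), each = 2),
                 c = NA)

# TRUE for the first occurrence of each (a, b) combination
first_in_group <- !duplicated(df[c("a", "b")])

# Set c to 0 on those rows only; all other rows are left untouched
df$c[first_in_group] <- 0
df
```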

parsing list string vector to multiple column data.table

I am trying to convert data of form:
dt <- data.table(foo = c(c('a=a1|b=b1'),c('a=a2|b=b2|c=c2'),c('a=a3|d=d3')))
to form:
data.table(a=c('a1','a2','a3'),b=c('b1','b2',NA),c=c(NA,'c2',NA),d=c(NA,NA,'d3'))
I tried to parse first step using:
lapply(dt$foo, function(x) unlist(strsplit(x, split = '|', fixed = T)))
but couldn't proceed further. Any pointers?
I will update this if more cases are provided. This uses a tibble rather than a data.table because I don't use data.table, but as far as I know it should still work, perhaps after coercing to a data.frame first.
library(tidyverse)
dt <- tibble(foo = c(c('a=a1|b=b1'),c('a=a2|b=b2|c=c2'),c('a=a3|d=d3')))
tibble(a=c('a1','a2','a3'),b=c('b1','b2',NA),c=c(NA,'c2',NA),d=c(NA,NA,'d3'))
#> # A tibble: 3 x 4
#> a b c d
#> <chr> <chr> <chr> <chr>
#> 1 a1 b1 <NA> <NA>
#> 2 a2 b2 c2 <NA>
#> 3 a3 <NA> <NA> d3
dt %>%
mutate(foo = str_split(foo, pattern = "\\|")) %>%
rowid_to_column() %>%
unnest(foo) %>%
separate(foo, into = c("col", "val"), sep = "=") %>%
spread(col, val)
#> # A tibble: 3 x 5
#> rowid a b c d
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 a1 b1 <NA> <NA>
#> 2 2 a2 b2 c2 <NA>
#> 3 3 a3 <NA> <NA> d3
Created on 2018-04-12 by the reprex package (v0.2.0).
Converting my comment to an answer, you can try:
library(splitstackshape)
cSplit(dt[, row := .I], "foo", "|", "long")[
, cSplit(.SD, "foo", "=")][
, dcast(.SD, row ~ foo_1, value.var = "foo_2")]
# row a b c d
# 1: 1 a1 b1 <NA> <NA>
# 2: 2 a2 b2 c2 <NA>
# 3: 3 a3 <NA> <NA> d3
You can, of course, also just use a combination of strsplit, tstrsplit, and dcast as well.
dt[, unlist(strsplit(foo, "|", TRUE)), 1:nrow(dt)][
, c("col", "val") := tstrsplit(V1, "=", fixed = TRUE)][
, dcast(.SD, nrow ~ col, value.var = "val")]
# nrow a b c d
# 1: 1 a1 b1 <NA> <NA>
# 2: 2 a2 b2 c2 <NA>
# 3: 3 a3 <NA> <NA> d3
An option is to use the read.table function to read each string as key-value pairs and then convert the result into a data.frame. dplyr::bind_rows then combines the per-row data frames, filling missing columns with NA.
dt <- data.table(foo = c(c('a=a1|b=b1'),c('a=a2|b=b2|c=c2'),c('a=a3|d=d3')))
library(dplyr)
bind_rows(mapply(function(x){
t <- read.table(text = gsub("\\|","\n",x), sep=c("="), stringsAsFactors=FALSE)
t <- as.data.frame(t(t), stringsAsFactors = FALSE)
colnames(t) <- t[1,]
t <- t[-1,]
}, dt$foo))
# a b c d
# 1 a1 b1 <NA> <NA>
# 2 a2 b2 c2 <NA>
# 3 a3 <NA> <NA> d3
UPDATED: A data.table based solution, as suggested by @abhiieor:
library(data.table)
rbindlist(mapply(function(x){
t <- read.table(text = gsub("\\|","\n",x), sep=c("="), stringsAsFactors=FALSE)
t <- as.data.frame(t(t), stringsAsFactors = FALSE)
colnames(t) <- t[1,]
t <- t[-1,]
}, dt$foo), use.names = T, fill = T)
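For completeness, here is a dependency-free sketch of the same parsing steps: split on "|", split each piece on "=", then align every row on the sorted set of keys. The names pairs, rows, keys and out are illustrative, and it assumes every piece has exactly one "=".

```r
foo <- c("a=a1|b=b1", "a=a2|b=b2|c=c2", "a=a3|d=d3")

# One character vector of "key=value" pieces per input string
pairs <- strsplit(foo, "|", fixed = TRUE)

# Turn each vector of pieces into a named vector: names are keys, values are values
rows <- lapply(pairs, function(p) {
  kv <- strsplit(p, "=", fixed = TRUE)
  setNames(vapply(kv, `[`, character(1), 2),
           vapply(kv, `[`, character(1), 1))
})

# All keys seen anywhere, sorted to give a stable column order
keys <- sort(unique(unlist(lapply(rows, names))))

# Index each row by the full key set; missing keys become NA
out <- as.data.frame(do.call(rbind, lapply(rows, function(r) unname(r[keys]))),
                     stringsAsFactors = FALSE)
names(out) <- keys
out
```

The result is a plain data.frame; data.table::as.data.table(out) should convert it if a data.table is needed.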
