R data.table merge two columns from the same table

I have:
inputDT <- data.table(COL1 = c(1, NA, NA), COL1 = c(NA, 2, NA), COL1 = c(NA, NA, 3))
inputDT
COL1 COL1 COL1
1: 1 NA NA
2: NA 2 NA
3: NA NA 3
I want
outputDT <- data.table(COL1 = c(1,2,3))
outputDT
COL1
1: 1
2: 2
3: 3
Essentially, I have a data.table with multiple columns whose names are the same (values are mutually exclusive), and I need to generate just one column to combine those.
How to achieve it?

The OP is asking for a data.table solution. As of version v1.12.4 (03 Oct 2019), the fcoalesce() function is available:
library(data.table)
inputDT[, .(COL1 = fcoalesce(.SD))]
COL1
1: 1
2: 2
3: 3
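For readers without a recent data.table, the coalescing idea itself (take the first non-NA value across the columns, row by row) can be sketched in base R. This is only an illustration; the helper name coalesce_two is made up here and is not part of data.table.

```r
# The three columns from the question, as plain vectors
cols <- list(c(1, NA, NA), c(NA, 2, NA), c(NA, NA, 3))

# Hypothetical helper: keep x where it is present, otherwise fall back to y
coalesce_two <- function(x, y) ifelse(is.na(x), y, x)

# Fold the helper over all columns, left to right
combined <- Reduce(coalesce_two, cols)
combined
```

Because the fold goes left to right, earlier columns win whenever more than one column has a value in the same row.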

Alternatively (less elegant than @Uwe's answer), if you have only numbers and NA, you can calculate the max of each row while removing NA:
library(data.table)
inputDT[, .(COL2 = do.call(pmax, c(na.rm=TRUE, .SD)))]
COL2
1: 1
2: 2
3: 3
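The row-max trick matches coalescing only because the question guarantees that the values are mutually exclusive. A small sketch with made-up vectors a and b (not from the question) shows where the two would differ:

```r
# Row 3 has a value in both vectors, unlike the data in the question
a <- c(1, NA, 2)
b <- c(NA, 2, 7)

row_max      <- pmax(a, b, na.rm = TRUE)   # largest non-NA value per row
first_non_na <- ifelse(is.na(a), b, a)     # first non-NA value per row

row_max
first_non_na
```

In the third row the maximum is 7, while the first non-NA value is 2, so the choice matters as soon as columns overlap.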

Related

rbindlist a list column of data.frames and select unique values

I have a data.table 'DT' with a column ('col2') that is a list of data frames:
require(data.table)
DT <- data.table(col1 = c('A','A','B'),
col2 = list(data.frame(colA = c(1,3,54, 23),
colB = c("aa", "bb", "cc", "hh")),
data.frame(colA =c(23, 1),
colB = c("hh", "aa")),
data.frame(colA = 1,
colB = "aa")))
> DT
col1 col2
1: A <data.frame>
2: A <data.frame>
3: B <data.frame>
> DT$col2
[[1]]
colA colB
1 1 aa
2 3 bb
3 54 cc
4 23 hh
[[2]]
colA colB
1 23 hh
2 1 aa
[[3]]
colA colB
1 1 aa
Each data.frame in col2 has two columns colA and colB.
I'd like to have a data.table output that binds each unique row of those data.frames based on col1 of DT.
I guess it's like using rbindlist in an aggregate function of the data.table.
This is the desired output:
> #desired output
> output
colA colB col1
1: 1 aa A
2: 3 bb A
3: 54 cc A
4: 23 hh A
5: 1 aa B
The dataframe of the second row of DT (DT[2, col2]) has duplicate entries, and only unique entries are desired for each unique col1.
I tried the following and I get an error.
desired_output <- DT[, lapply(col2, function(x) unique(rbindlist(x))), by = col1]
# Error in rbindlist(x) :
# Item 1 of list input is not a data.frame, data.table or list
This 'works', though not desired output:
unique(rbindlist(DT$col2))
colA colB
1: 1 aa
2: 3 bb
3: 54 cc
4: 23 hh
Is there any way to use rbindlist in an aggregate function of a data.table?
Group by 'col1', run rbindlist on 'col2':
unique(DT[ , rbindlist(col2), by = col1]) # trimmed thanks to @snoram
# col1 colA colB
# 1: A 1 aa
# 2: A 3 bb
# 3: A 54 cc
# 4: A 23 hh
# 5: B 1 aa
Regarding "only unique entries are desired for each unique col1":
Once col1 is a column of the result, calling unique() on the whole table gives exactly that, because rows are deduplicated across all columns, col1 included.
Henrik's answer is one way to keep col1. Another is:
unique(DT[, rbindlist(setNames(col2, col1), idcol="col1")])
I guess this should be more efficient than
bycols = "col1"
unique(DT[, rbindlist(col2), by=bycols]) # Henrik's
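though the extension to either (1) col1 not being a character column (and so not suitable for setNames) or (2) having multiple by= columns is not so obvious. For either of these cases, I would make an id column equal to the row numbers of DT, copy the by= columns over, and deduplicate only after dropping the id column (the id differs between rows of DT, so calling unique() while it is still present would miss duplicates within a group):
bycols = "col1"
res = DT[, rbindlist(col2, idcol="DT_row")]
res[, (bycols) := DT[DT_row, ..bycols]]
res[, DT_row := NULL]
res = unique(res)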
To put those columns first/leftmost, I think setcolorder(res, bycols) should work, but am on too old a data.table version to see it do so.
There's also an open issue for a tidyr::unnest-like function.
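As a self-contained sketch of the row-id approach, including the setcolorder step the answer could not check: this assumes a data.table version in which setcolorder accepts a subset of columns and moves them to the front (to my knowledge 1.11.0 or later). The lookup uses with = FALSE instead of the .. prefix purely to keep the scoping explicit.

```r
library(data.table)

DT <- data.table(
  col1 = c("A", "A", "B"),
  col2 = list(
    data.frame(colA = c(1, 3, 54, 23), colB = c("aa", "bb", "cc", "hh")),
    data.frame(colA = c(23, 1), colB = c("hh", "aa")),
    data.frame(colA = 1, colB = "aa")
  )
)

bycols <- "col1"

# Stack the nested data frames, recording which row of DT each came from
res <- DT[, rbindlist(col2, idcol = "DT_row")]

# Copy the grouping column(s) over by row number, then drop the helper id
res[, (bycols) := DT[res$DT_row, bycols, with = FALSE]]
res[, DT_row := NULL]

# Deduplicate across all remaining columns and move the by-columns to the front
res <- unique(res)
setcolorder(res, bycols)
res
```

The same code should carry over to several by= columns by putting more names in bycols.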
This works:
DT1<-apply(DT, 1, function(x){cbind(col1=x$col1,x$col2)})
unique(rbindlist(DT1))
# col1 colA colB
#1: A 1 aa
#2: A 3 bb
#3: A 54 cc
#4: A 23 hh
#5: B 1 aa
You could do something hackish like this:
nDT <- cbind(rbindlist(DT[[2]]), col1 = rep(DT[[1]], sapply(DT[[2]], nrow)))
nDT[!duplicated(nDT)]
colA colB col1
1: 1 aa A
2: 3 bb A
3: 54 cc A
4: 23 hh A
5: 1 aa B
Or using tidyr (Inspired by PKumar's comment):
unique(tidyr::unnest(DT))
Or more generalisable base R:
names(DT[[2]]) <- DT[[1]]
ndf <- do.call(rbind, DT[[2]])
ndf$col1 <- substr(row.names(ndf), 1, 1)
unique(ndf)

Subset dataframe to contain only cells that match a pattern in r

I am a beginner in R and this is a very simple question, but I can't find the answer.
I would like to select cells in a table that match a particular pattern and exclude everything else.
Example data:
data.t <- data.frame(ColA = c("NARG_ECOLI^Q:103", "NARG_ECOLI^NARG", "SPEB_KLEP7^Q:103"), ColB = c(NA, NA, NA), ColC = c("KLEP7^Q:103", "NARG_ECOLI^KLEP7", NA), ColD = c("RPOC_ENTFA^Q:2", NA, NA), ColE = c("Y1546_STAS1^Q:6", NA, NA))
which generates a table like this:
ColA ColB ColC ColD ColE
1 NARG_ECOLI^Q:103 NA KLEP7^Q:103 RPOC_ENTFA^Q:2 NA
2 NARG_ECOLI^NARG NA NARG_ECOLI^KLEP7 <NA> NA
3 SPEB_KLEP7^Q:103 NA <NA> <NA> NA
I would like to select only cells containing ECOLI. Thus, the desired output would look like this one:
ColA ColC
1 NARG_ECOLI^Q:103 NARG_ECOLI^KLEP7
2 NARG_ECOLI^NARG <NA>
One possible solution is to visually inspect and make the selections in my data, but the actual table has dozens of columns and hundreds of rows. Any help would be greatly appreciated. Thank you in advance!
If you want to return ONLY the items in the data frame that have "ECOLI" in them, then here is a tidyverse approach
library(tidyverse)
filter_all(data.t, any_vars(grepl("ECOLI", .))) %>%
.[map_lgl(., ~any(grepl("ECOLI", .x)))] %>%
map_df(~replace(.x, !grepl("ECOLI", .x), NA_character_))
# A tibble: 2 x 2
ColA ColC
<fctr> <fctr>
1 NARG_ECOLI^Q:103 <NA>
2 NARG_ECOLI^NARG NARG_ECOLI^KLEP7
data.t <- data.t[grepl('ECOLI', data.t$ColA), ]
To obtain a staggered list of all instances of ECOLI in each column of data.t:
out <- lapply(data.t, grep, pattern='ECOLI', value=TRUE)
To drop the zero-length entries:
nout <- sapply(out, length)
out <- out[nout > 0]
nout <- nout[nout > 0]
To merge that staggered list into a rectangular object like a data frame is unwise, but:
mapply(c, out, mapply(rep, NA, max(nout)-nout))
I tried solving this using base functions.
# Data
data.t <- data.frame(ColA = c("NARG_ECOLI^Q:103", "NARG_ECOLI^NARG",
"SPEB_KLEP7^Q:103"), ColB = c(NA, NA, NA), ColC = c("KLEP7^Q:103",
"NARG_ECOLI^KLEP7", NA), ColD = c("RPOC_ENTFA^Q:2", NA, NA), ColE =
c("Y1546_STAS1^Q:6", NA, NA), stringsAsFactors = FALSE)
# First, write a function to check each cell value. If the value contains
# "ECOLI", it is retained; otherwise it is replaced with NA
findECOLI <- function(x){
ifelse(grepl("ECOLI", x, fixed = TRUE), x, NA)
}
d1 <- sapply(data.t, findECOLI)
#> d1
# ColA ColB ColC ColD ColE
#[1,] "NARG_ECOLI^Q:103" NA NA NA NA
#[2,] "NARG_ECOLI^NARG" NA "NARG_ECOLI^KLEP7" NA NA
#[3,] NA NA NA NA NA
# Now, remove the rows containing only NA
d1 <- d1[rowSums(is.na(d1)) != ncol(d1), ]
#> d1
# ColA ColB ColC ColD ColE
#[1,] "NARG_ECOLI^Q:103" NA NA NA NA
#[2,] "NARG_ECOLI^NARG" NA "NARG_ECOLI^KLEP7" NA NA
# Remove the columns containing only NA
d1 <- d1[, colSums(is.na(d1)) != nrow(d1)]
#Result:
#>d1
# ColA ColC
#[1,] "NARG_ECOLI^Q:103" NA
#[2,] "NARG_ECOLI^NARG" "NARG_ECOLI^KLEP7"
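The three steps above (mask the cells, drop all-NA rows, drop all-NA columns) can also be sketched with a single logical matrix, which keeps the result a data frame rather than a character matrix. The names hits, masked and result are illustrative only.

```r
data.t <- data.frame(
  ColA = c("NARG_ECOLI^Q:103", "NARG_ECOLI^NARG", "SPEB_KLEP7^Q:103"),
  ColB = c(NA, NA, NA),
  ColC = c("KLEP7^Q:103", "NARG_ECOLI^KLEP7", NA),
  ColD = c("RPOC_ENTFA^Q:2", NA, NA),
  ColE = c("Y1546_STAS1^Q:6", NA, NA),
  stringsAsFactors = FALSE
)

# TRUE wherever a cell contains the literal string "ECOLI" (NA cells give FALSE)
hits <- sapply(data.t, grepl, pattern = "ECOLI", fixed = TRUE)

# Blank out the non-matching cells
masked <- data.t
masked[!hits] <- NA

# Keep only rows and columns that contain at least one match
result <- masked[rowSums(hits) > 0, colSums(hits) > 0, drop = FALSE]
result
```

Using drop = FALSE keeps a data frame even if only one column ends up matching.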

Labeling each value in a column by grouping from another column R

I have an unusual data set that I need to work with and I've created a small scale, reproducible example.
library(data.table)
DT <- data.table(Type = c("A", rep("", 4), "B", rep("", 3), "C", rep("", 5)), Cohort = c(NA,1:4, NA, 5:7, NA, 8:12))
dt <- data.table(Type = c(rep("A", 4), rep("B", 3), rep("C", 5)), Cohort = 1:12)
I need DT to look like dt and the actual dataset has 6.8 million rows. I realize it might be a simple issue but I can't seem to figure it out, maybe setkey? Any help is appreciated, thanks.
You can replace "" by NA and use na.locf from the zoo package:
library(zoo)
DT[Type=="",Type:=NA][,Type:=na.locf(Type)][!is.na(Cohort)]
Here is another option without using na.locf. Grouped by the cumulative sum of the logical vector (Type!=""), we select the first 'Type' and the lead value of 'Cohort', assign (:=) them to the names of 'DT' to replace the original column values, and use na.omit to drop the rows containing NA.
na.omit(DT[, names(DT) := .(Type[1L], shift(Cohort, type="lead")), cumsum(Type!="")])
# Type Cohort
# 1: A 1
# 2: A 2
# 3: A 3
# 4: A 4
# 5: B 5
# 6: B 6
# 7: B 7
# 8: C 8
# 9: C 9
#10: C 10
#11: C 11
#12: C 12
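The same cumulative-sum idea can be used as a plain index instead of a grouping variable, which avoids both zoo and a by-group operation. This is a sketch under the assumption, true for the example data, that the first row always carries a non-empty Type.

```r
library(data.table)

DT <- data.table(
  Type   = c("A", rep("", 4), "B", rep("", 3), "C", rep("", 5)),
  Cohort = c(NA, 1:4, NA, 5:7, NA, 8:12)
)

# cumsum(Type != "") counts the labels seen so far, so it indexes into the
# vector of non-empty labels and carries each label forward
DT[, Type := Type[Type != ""][cumsum(Type != "")]]

# The header rows are the ones without a Cohort; drop them
res <- DT[!is.na(Cohort)]
res
```

Since this is a single vectorised index, it may scale more comfortably to millions of rows than a grouped operation, though that is untested here.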

drop levels of factor for which there is one missing value for one column r

I would like to drop any occurrence of a factor level for which one row contains a missing value
Example:
ID var1 var2
1 1 2
1 NA 3
2 1 2
2 2 4
So, in this hypothetical, what would be left would be:
ID var1 var2
2 1 2
2 2 4
Here's a possible data.table solution (sorry @rawr)
library(data.table)
setDT(df)[, if (all(!is.na(.SD))) .SD, ID]
# ID var1 var2
# 1: 2 1 2
# 2: 2 2 4
If you only want to check var1 then
df[, if (all(!is.na(var1))) .SD, ID]
# ID var1 var2
# 1: 2 1 2
# 2: 2 2 4
Assuming that NAs would occur in both var columns,
df[with(df, !ave(!!rowSums(is.na(df[,-1])), ID, FUN=any)),]
# ID var1 var2
#3 2 1 2
#4 2 2 4
Or if it is only specific to var1
df[with(df, !ave(is.na(var1), ID, FUN=any)),]
# ID var1 var2
#3 2 1 2
#4 2 2 4
Or using dplyr
library(dplyr)
df %>%
group_by(ID) %>%
filter(all(!is.na(var1)))
# ID var1 var2
#1 2 1 2
#2 2 2 4
data
df <- structure(list(ID = c(1L, 1L, 2L, 2L), var1 = c(1L, NA, 1L, 2L
), var2 = c(2L, 3L, 2L, 4L)), .Names = c("ID", "var1", "var2"
), class = "data.frame", row.names = c(NA, -4L))
Here's one more option in base R. It will check all columns for NAs.
df[!df$ID %in% df$ID[rowSums(is.na(df)) > 0],]
# ID var1 var2
#3 2 1 2
#4 2 2 4
If you only want to check in column "var1" you can do:
df[!with(df, ID %in% ID[is.na(var1)]),]
# ID var1 var2
#3 2 1 2
#4 2 2 4
In the current development version of data.table, there's a new implementation of na.omit for data.tables, which takes cols = and invert = arguments.
The cols = argument specifies the columns in which to look for NAs, and invert = TRUE returns the rows containing NAs instead of omitting them.
You can install the devel version by following these instructions, or wait for 1.9.6 to reach CRAN. Using that, we can do:
require(data.table) ## 1.9.5+
setkey(setDT(df), ID)
df[!na.omit(df, invert = TRUE)]
# ID var1 var2
# 1: 2 1 2
# 2: 2 2 4
How this works:
setDT converts data.frame to data.table by reference.
setkey sorts the data.table by the columns provided and marks those columns as key columns so that we can perform a join.
na.omit(df, invert = TRUE) gives just those rows that have NA anywhere.
X[!Y] does an anti-join on the key column ID and returns all the rows of X whose ID has no match in Y (here, every row except those with ID = 1). Check this post to read in detail about data.table's joins.
HTH
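The anti-join described above can also be written without setting a key, by naming the join column with on=. This is a minimal sketch; it uses complete.cases() in place of na.omit(invert = TRUE), and the name bad is illustrative.

```r
library(data.table)

df <- data.table(
  ID   = c(1L, 1L, 2L, 2L),
  var1 = c(1L, NA, 1L, 2L),
  var2 = c(2L, 3L, 2L, 4L)
)

# IDs of the rows that contain an NA anywhere
bad <- df[!complete.cases(df), .(ID)]

# Anti-join: keep only rows of df whose ID does not appear in bad
res <- df[!bad, on = "ID"]
res
```

With on= the original row order of df is preserved, because nothing is sorted by a key.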

consolidate duplicate rows and add column in R [duplicate]

I'd like to know how to consolidate duplicate rows in a data frame and then combine the duplicated values in another column.
Here's a sample of the existing dataframe and two dataframes that would be acceptable as a solution
df1 <- data.frame(col1 = c("test1", "test2", "test2", "test3"), col2 = c(1, 2, 3, 4))
df.ideal <- data.frame(col1 = c("test1", "test2", "test3"), col2 = c(1, "2, 3", 4))
df.ideal2 <- data.frame(col1 = c("test1", "test2", "test3"),
col2 = c(1, 2, 4),
col3 = c(NA, 3, NA))
In the first ideal dataframe, the duplicated row is collapsed and both numbers are combined in col2. I've looked at other similar questions on Stack Overflow, but they all dealt with combining rows. I need to delete the duplicate row because I'm merging this with another dataset that needs a certain number of rows, and I want to preserve all of the values. Thanks for your help!
To go from df1 to df.ideal, you can use aggregate().
aggregate(col2~col1, df1, paste, collapse=",")
# col1 col2
# 1 test1 1
# 2 test2 2,3
# 3 test3 4
If you want to get to df.ideal2, that's more of a reshaping from long to wide process. You can do
reshape(transform(df1, time=ave(col2, col1, FUN=seq_along)), idvar="col1", direction="wide")
# col1 col2.1 col2.2
# 1 test1 1 NA
# 2 test2 2 3
# 4 test3 4 NA
using just the base reshape() function.
Another option would be to use splitstackshape
library(data.table)
library(splitstackshape)
DT1 <- setDT(df1)[,list(col2=toString(col2)) ,col1]
DT1
# col1 col2
#1: test1 1
#2: test2 2, 3
#3: test3 4
You could split the col2 in DT1 to get the df.ideal2 or
cSplit(DT1, 'col2', sep=',')
# col1 col2_1 col2_2
#1: test1 1 NA
#2: test2 2 3
#3: test3 4 NA
or from df1
dcast.data.table(getanID(df1, 'col1'), col1~.id, value.var='col2')
# col1 1 2
#1: test1 1 NA
#2: test2 2 3
#3: test3 4 NA
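With a reasonably recent data.table, the long-to-wide step can probably be done without splitstackshape by numbering the rows within each col1 group directly in the dcast formula. This is a sketch that assumes rowid() is available (it was added to data.table after the answers above were written).

```r
library(data.table)

df1 <- data.frame(
  col1 = c("test1", "test2", "test2", "test3"),
  col2 = c(1, 2, 3, 4)
)

# rowid(col1) numbers the occurrences of each col1 value (1, 1, 2, 1),
# and those numbers become the new column names
wide <- dcast(as.data.table(df1), col1 ~ rowid(col1), value.var = "col2")
wide
```

The new columns are named "1" and "2"; setnames() can rename them to col2 and col3 if the exact layout of df.ideal2 is needed.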

Resources