Subsetting data.table based on repeated rows

I have a list of data tables stored in an object ddf (a sample is shown below):
[[43]]
V1 V2 V3
1: b c a
2: b c a
3: b c a
4: b c a
5: b b a
6: b c a
7: b c a
[[44]]
V1 V2 V3
1: a c a
2: a c a
3: a c a
4: a c a
5: a c a
[[45]]
V1 V2 V3
1: a c b
2: a c b
3: a c b
4: a c b
5: a c b
6: a c b
7: a c b
8: a c b
9: a c b
.............and so on up to [[100]]
I want to subset the list ddf such that the result only consists of elements which:
have at least 9 rows each
have all 9 of those rows identical
I want to store this subsetted output.
I have written some code for this below:
for(i in 1:100){
  m = (as.numeric(nrow(df[[i]])) >= 9)
  if(m == TRUE & df[[i]][1,] = df[[i]][2,] = df[[i]][3,] =
     df[[i]][4,] = df[[i]][5,] = df[[i]][6,] =
     df[[i]][7,] = df[[i]][8,] = df[[i]][9,]){
    print(df[[i]])
  }
}
Please tell me what's wrong and how I can generalize the result for subsetting based on "n" identical rows.
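For reference, two things break this loop: = is assignment rather than comparison (== is needed), and R does not support chaining == across several values in one expression. The loop also indexes df although the list is called ddf. A minimal hedged fix, with the threshold pulled out as a parameter n (a sketch, not the only way):
n <- 9
for (i in seq_along(ddf)) {
  x <- ddf[[i]]
  # at least n rows, and all rows identical
  if (nrow(x) >= n && nrow(unique(x)) == 1) {
    print(x)
  }
}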
[Follow-up Question]
Answer obtained from Main question:
> ddf[sapply(ddf, function(x) nrow(x) >= n & nrow(unique(x)) == 1)]
$`61`
V1 V2 V3
1: a c b
2: a c b
3: a c b
4: a c b
5: a c b
6: a c b
7: a c b
$`68`
V1 V2 V3
1: a c a
2: a c a
3: a c a
4: a c a
5: a c a
6: a c a
7: a c a
8: a c a
$`91`
V1 V2 V3
1: b c a
2: b c a
3: b c a
4: b c a
5: b c a
6: b c a
7: b c a
..... and so on until the last data.table that meets the row-matching criterion (at least 9 identical rows)
There are only 2 types of elements [[..]] in the list:
Case 1. >70% accuracy
Case 2. <70% accuracy
You will notice that the output shown above in the "Follow-up Question" is for
$`61`, $`68` and $`91`, but there is NO output for the other data.tables which don't meet the "matching row" criterion.
I need an output where the elements which don't meet the criterion return "bad output" instead.
Thus the final list should be the same length as the input list.
By placing them side by side using paste I should then be able to see each output.

We can loop through the list ('ddf'), subset only the duplicated rows (with duplicated), and order the dataset. If the resulting dataset 'x1' has at least 9 rows, we return the first 9 rows (head(.SD, 9)); otherwise we return 'bad answer':
lapply(ddf, function(x) {
  # rows that occur more than once (first and later occurrences)
  x1 <- x[duplicated(x) | duplicated(x, fromLast = TRUE)]
  if (nrow(x1) >= 9) {
    x1[order(V1, V2, V3), head(.SD, 9)]
  } else "bad answer"
})
#[[1]]
# V1 V2 V3
#1: b c a
#2: b c a
#3: b c a
#4: b c a
#5: b c a
#6: b c a
#7: b c a
#8: b c a
#9: b c a
#[[2]]
#[1] "bad answer"
#[[3]]
#[1] "bad answer"
data
ddf <- list(data.table(V1 = 'b', V2 = rep(c('c', 'b', 'c'), c(8, 1, 2)), V3 = 'a'),
            data.table(V1 = rep("a", 5), V2 = rep("c", 5), V3 = rep("a", 5)),
            data.table(V1 = c('b', 'a', 'b', 'b'), V2 = c('b', 'a', 'c', 'b'),
                       V3 = c("c", "d", "a", "b")))
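A hedged generalization of the same idea to any threshold n (subset_repeated is a hypothetical name; it assumes the columns are V1, V2, V3 as above and, like the answer, that the duplicated rows form a single group):
library(data.table)
subset_repeated <- function(ddf, n = 9) {
  lapply(ddf, function(x) {
    # keep only rows that occur more than once
    x1 <- x[duplicated(x) | duplicated(x, fromLast = TRUE)]
    if (nrow(x1) >= n) x1[order(V1, V2, V3), head(.SD, n)] else "bad answer"
  })
}
subset_repeated(ddf, n = 9)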

When ddf is your list of data.tables, then:
ddf[sapply(ddf, nrow) >= 9 & sapply(ddf, function(x) nrow(unique(x))) == 1]
should give you the desired result.
Where:
sapply(ddf, nrow) >= 9 checks whether the data.tables have nine or more rows
sapply(ddf, function(x) nrow(unique(x))) == 1 checks whether all the rows are the same.
Or with one sapply call, as @docendodiscimus suggested:
ddf[sapply(ddf, function(x) nrow(x) >= 9 & nrow(unique(x)) == 1)]
Or by using the .N special symbol and the uniqueN function of data.table:
ddf[sapply(ddf, function(x) x[,.N] >= 9 & uniqueN(x) == 1)]
Another option is to use Filter (following the suggestion of @Frank in the comments):
Filter(function(x) nrow(x) >= 9 & uniqueN(x) == 1, ddf)
Two approaches to get the datatable numbers:
1. Using which:
which(sapply(ddf, function(x) nrow(x) >= 9 & nrow(unique(x)) == 1))
2. Assign names to the datatables in the list:
names(ddf) <- paste0('dt', 1:length(ddf))
Now the output will show the data.table number:
$dt4
V1 V2 V3
1 a c b
2 a c b
3 a c b
4 a c b
5 a c b
6 a c b
7 a c b
8 a c b
9 a c b
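A hedged sketch combining the naming idea with Filter: since Filter keeps list names, naming the list first labels every surviving data.table automatically (assumes data.table is loaded for uniqueN):
names(ddf) <- paste0('dt', seq_along(ddf))
Filter(function(x) nrow(x) >= 9 & uniqueN(x) == 1, ddf)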


How to compare if any of the elements in a row is the same

Is there a way to compare whether "any value of" a row is identical to "any value" of the row above -- regardless of the order? Below is a very random input data table.
DT <- data.table(A=c("a","a","b","d","e","f","h","i","j"),
B=c("a","b","c","c","f","g",NA,"j",NA),
C=c("a","b","c","b","g","h",NA,NA,NA))
> DT
A B C
1: a a a
2: a b b
3: b c c
4: d c b
5: e f g
6: f g h
7: h NA NA
8: i j NA
9: j NA NA
I would like to add a column D that compares a row with the row above, and compare whether any values of the two rows are identical (regardless of the order). So the desired output would be:
> DT
A B C D
1: a a a 0 #No row above to compare; could be either NA or 0
2: a b b 1 #row 2 has "a", which is in row 1; returns 1
3: b c c 1 #row 3 has "b", which is in row 2; returns 1
4: d c b 1 #row 4 has "b" and "c", which are in row 3; returns 1
5: e f g 0 #row 5 has nothing that is in row 4; returns 0
6: f g h 1 #row 6 has "f" and "g", which are in row 5; returns 1
7: h NA NA 1 #row 7 has "h", which is in row 6; returns 1
8: i j NA 0 #row 8 has nothing that is in row 7 (NA doesn't count)
9: j NA NA 1 #row 9 has "j", which is in row 8; returns 1 (NA doesn't count)
The main idea is that I would like to compare one row (vector) with another row (vector), and treat the two rows as matching if any of their elements are identical (without iterating over each element to compare them).
We can do this by pasting each row into a single string, building a pattern from the lead (next) rows, checking each pattern against the pasted rows of the original dataset using grepl and Map, then unlisting and converting to integer:
DT[, D := {
  v1 <- do.call(paste, .SD)                                   # each row pasted into one string
  v2 <- do.call(paste, c(shift(.SD, type = "lead"), sep="|")) # the next row as an "x|y|z" pattern
  v2N <- gsub("NA\\|*|\\|*NA", "", v2)                        # strip NAs from the pattern
  v3 <- unlist(Map(grepl, v2N, v1), use.names = FALSE)        # does row i contain a value of row i+1?
  as.integer(head(c(FALSE, v3), -1))                          # shift down one: row i vs row i-1
}]
DT
# A B C D
#1: a a a 0
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1
Or we can split the data and do the comparison using Map:
as.integer(c(FALSE, unlist(Map(function(x,y) {
x1 <- na.omit(unlist(x))
y1 <- na.omit(unlist(y))
any(x1 %in% y1 | y1 %in% x1) },
split(DT[-nrow(DT)], 1:(nrow(DT)-1)), split(DT[-1], 2:nrow(DT))), use.names = FALSE)))
Here is another method. It's probably not advisable on large data.tables as it uses by=1:nrow(DT) which tends to be quite slow.
DT[, D:= sign(DT[, c(.SD, shift(.SD))][,
sum(!is.na(intersect(unlist(.SD[, .(A, B, C)]), unlist(.SD[, .(V4, V5, V6)])))),
by=1:nrow(DT)]$V1)]
Here, [, c(.SD, shift(.SD))] creates a copy of the data.table with the lagged variables cbinded on. The second chain then intersects the unlisted original variables with the unlisted shifted ones, row by row; non-NA elements of the intersection count as 1 and NAs as 0, and these are summed. The sum is extracted with $V1 and turned into binary (0 or 1) using sign.
It returns
DT
A B C D
1: a a a 0
2: a b b 1
3: b c c 1
4: d c b 1
5: e f g 0
6: f g h 1
7: h NA NA 1
8: i j NA 0
9: j NA NA 1
Here's a loop-free approach using data.table's joins:
DT[, id := 1:.N]
dt <- melt(DT, id.vars = "id")
dt[, id2 := id-1]
dt <- dt[!is.na(value)]
idx <- dt[dt, on = .(id2 = id, value), nomatch=0][, unique(id)]
DT[, `:=`(D = as.integer(id %in% idx), id = NULL)]
It looks somewhat complicated but it does perform pretty well, taking just over a second for a 1-million-row data set with three columns.
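As a rough check of that claim, a hedged timing sketch (the random test data below is assumed, not from the original answer):
library(data.table)
set.seed(1)
bigDT <- data.table(A = sample(letters, 1e6, TRUE),
                    B = sample(letters, 1e6, TRUE),
                    C = sample(letters, 1e6, TRUE))
system.time({
  bigDT[, id := 1:.N]
  dt <- melt(bigDT, id.vars = "id")
  dt[, id2 := id - 1L]
  dt <- dt[!is.na(value)]
  idx <- dt[dt, on = .(id2 = id, value), nomatch = 0][, unique(id)]
  bigDT[, `:=`(D = as.integer(id %in% idx), id = NULL)]
})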
I would do a sapply along the indices (minus the last) of the table:
compare <- function(i) {
row1 <- as.character(DT[i,])
row2 <- as.character(DT[i+1,])
return(length(intersect(row1[!is.na(row1)], row2[!is.na(row2)])) > 0)
}
result <- sapply(1:(nrow(DT) - 1), compare)
This returns a vector of logicals, so if you prefer integers, wrap the output of compare in as.numeric().
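For instance, to attach the result as the D column (row 1 has no predecessor, so it gets 0, matching the desired output):
DT[, D := c(0L, as.integer(result))]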
Here is a base R solution using intersect:
res <- c(0, sapply(2:nrow(DT), function(i)
length(intersect( na.omit(as.character(DT[i,])), na.omit(as.character(DT[i-1,])) ) )>0))
cbind(DT, D=res)
# A B C D
# 1: a a a 0
# 2: a b b 1
# 3: b c c 1
# 4: d c b 1
# 5: e f g 0
# 6: f g h 1
# 7: h NA NA 1
# 8: i j NA 0
# 9: j NA NA 1
This solution compares the two rows with %in% (after unlist()):
DT[, result:=as.integer(c(NA, sapply(2:DT[,.N], function(i) any(na.omit(unlist(DT[i])) %in% unlist(DT[i-1])))))]
#> DT
# A B C result
#1: a a a NA
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1
Using a combination of intersect and mapply you could do:
#list of unique non-NA elements in each row
tableList = apply(DT, 1, function(x) unique(na.omit(x)))
#a lagged copy of that list: drop the last element and prepend NA
tableListLag = c(NA, tableList[-length(tableList)])
#find common elements using the intersect function;
#length > 0 implies common elements, so set the value to 1, else 0
DT$D = mapply(function(x, y) ifelse(length(intersect(x, y)) > 0, 1, 0),
              tableList, tableListLag, SIMPLIFY = TRUE)
DT
# A B C D
#1: a a a 0
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1

Subset a data.table by matching columns of another data.table

I have been searching for a solution for subsetting a data.table using matching values for certain columns in another data.table.
Here is in example:
set.seed(2)
dt <-
data.table(a = 1:10,
b = rnorm(10),
c = runif(10),
d = letters[1:10])
dt2 <-
data.table(a = 5:20,
b = rnorm(16),
c = runif(16),
d = letters[5:20])
This is the result I need:
> dt2
    a            b          c d
1: 5 -2.311069085 0.62512173 e
2: 6 0.878604581 0.26030004 f
3: 7 0.035806718 0.85907312 g
4: 8 1.012828692 0.43748800 h
5: 9 0.432265155 0.38814476 i
6: 10 2.090819205 0.46150111 j
where the rows returned from the second data.table are those where a and d match, even though b and c may not. The real data are mutually exclusive, and I need to match on three columns.
We can use %in% to match the columns and subset accordingly. (Note that this checks a and d independently rather than as pairs; for a strict pairwise match, see the join answer below.)
dt2[a %in% dt$a & d %in% dt$d]
# a b c d
#1: 5 -2.31106908 0.6251217 e
#2: 6 0.87860458 0.2603000 f
#3: 7 0.03580672 0.8590731 g
#4: 8 1.01282869 0.4374880 h
#5: 9 0.43226515 0.3881448 i
#6: 10 2.09081921 0.4615011 j
Here is an option using a join, specifying the on argument:
na.omit(dt2[dt[, c("a", "d"), with = FALSE], on = c("a", "d")])
# a b c d
#1: 5 -2.31106908 0.6251217 e
#2: 6 0.87860458 0.2603000 f
#3: 7 0.03580672 0.8590731 g
#4: 8 1.01282869 0.4374880 h
#5: 9 0.43226515 0.3881448 i
#6: 10 2.09081921 0.4615011 j
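A minor hedged variant of the same join: nomatch = 0 drops the non-matching rows directly, so na.omit is not needed:
dt2[dt[, .(a, d)], on = .(a, d), nomatch = 0]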

Change the /NA to special values in R

I have a data.table with about 50,000 rows. There are two columns, and there are lots of "/NA" values in it.
Example:
V1 V2
A 1
B 2
A 1
C 3
A NA
B 2
C 3
A /NA
B /NA
A 1
I want to get
V1 V2
A 1
B 2
A 1
C 3
A 1
B 2
C 3
A 1
B 2
A 1
How can I do this?
Thank you so much, Justin
tf <- tempfile()
writeLines(" V1 V2
A A
B B
A A
C C
A NA
B B
C C
A /NA
B /NA
A A", tf )
x <- read.table(tf, header=T, stringsAsFactors = F)
x$V2 <- ifelse(gsub("[/]","", x$V2) == "NA" | is.na(x$V2), x$V1, x$V2)
R> x
V1 V2
1 A A
2 B B
3 A A
4 C C
5 A A
6 B B
7 C C
8 A A
9 B B
10 A A
Edit
A second ifelse() clause (or a switch) is needed for the updated question, which maps V1 to a numeric V2. Note that I've negated the initial clause's test with !:
x$V2 <- ifelse(!(gsub("[/]","", x$V2) == "NA" | is.na(x$V2)), x$V2,
ifelse(x$V1 == "A", 1, ifelse(x$V1 == "B", 2,3)))
You can use a plain data.frame in R to get the same result:
example <- data.frame(V1 = c("A","B","A","C","A","B","C","A","B","A"),
                      V2 = c(1,2,1,3,"NA",2,3,"/NA","/NA",1), stringsAsFactors = FALSE)
example <- within(example, V2[V1=="A" & (V2=="NA" | V2=="/NA")] <- 1)
example <- within(example, V2[V1=="B" & (V2=="NA" | V2=="/NA")] <- 2)
example <- within(example, V2[V1=="C" & (V2=="NA" | V2=="/NA")] <- 3)
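Since the question mentions data.table, here is a hedged sketch of the same replacement done by reference with fcase (fcase requires a recent data.table, added around 1.12.8; the example data below are assumed):
library(data.table)
example <- data.table(V1 = c("A","B","A","C","A","B","C","A","B","A"),
                      V2 = c(1, 2, 1, 3, "NA", 2, 3, "/NA", "/NA", 1))
# update only the rows whose V2 is a literal "NA" or "/NA" string
example[V2 %in% c("NA", "/NA"), V2 := fcase(V1 == "A", "1",
                                            V1 == "B", "2",
                                            V1 == "C", "3")]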

Reshape a data frame to long format by expanding elements of an existing column

I have a dataframe with 3 columns:
A <- c("stringA", "stringA", "stringB", "stringB")
B <- c(1, 2, 1, 2)
C <- c("abcd", "abcd", "abcde", "bbc")
df <- data.frame(A, B, C)
> df
A B C
1 stringA 1 abcd
2 stringA 2 abcd
3 stringB 1 abcde
4 stringB 2 bbc
I would like to reformat so that the values of column B become the column names and the values in column C are split into individual letters, to get:
A 1 2
stringA a a
stringA b b
stringA c c
stringA d d
stringB a b
stringB b b
stringB c c
stringB d NA
stringB e NA
Here's an approach using "data.table" and "reshape2". Make sure you're using at least version 1.8.11 of the "data.table" package first.
library(reshape2)
library(data.table)
packageVersion("data.table")
# [1] ‘1.8.11’
DT <- data.table(df, key="A,B")
DT <- DT[, list(C = unlist(strsplit(as.character(C), ""))), by = key(DT)]
DT[, N := sequence(.N), by = key(DT)]
dcast.data.table(DT, A + N ~ B, value.var="C")
# A N 1 2
# 1: stringA 1 a a
# 2: stringA 2 b b
# 3: stringA 3 c c
# 4: stringA 4 d d
# 5: stringB 1 a b
# 6: stringB 2 b b
# 7: stringB 3 c c
# 8: stringB 4 d NA
# 9: stringB 5 e NA
If you prefer sticking with base R, the approach is somewhat similar:
## Split the "C" column up
X <- strsplit(as.character(df$C), "")
## "Expand" your data.frame
df2 <- df[rep(seq_along(X), sapply(X, length)), ]
## Create an additional "id"
df2$id <- with(df2, ave(as.character(A), A, B, FUN = seq_along))
## Replace your "C" values
df2$C <- unlist(X)
## Reshape your data
reshape(df2, direction = "wide", idvar=c("A", "id"), timevar="B")
# A id C.1 C.2
# 1 stringA 1 a a
# 1.1 stringA 2 b b
# 1.2 stringA 3 c c
# 1.3 stringA 4 d d
# 3 stringB 1 a b
# 3.1 stringB 2 b b
# 3.2 stringB 3 c c
# 3.3 stringB 4 d <NA>
# 3.4 stringB 5 e <NA>
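A hedged alternative that stays entirely in data.table (no reshape2), assuming a reasonably recent version (rowid needs data.table >= 1.9.8):
library(data.table)
DT2 <- as.data.table(df)
# one row per letter of C, then a within-group counter to dcast on
long <- DT2[, .(C = strsplit(as.character(C), "")[[1]]), by = .(A, B)]
long[, N := rowid(A, B)]
dcast(long, A + N ~ B, value.var = "C")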

Count unique elements in a data frame row and return the one with maximum occurrence [duplicate]

This question already has answers here:
How to find mode across variables/vectors within a data row in R
(3 answers)
Closed 9 years ago.
Is it possible to count the unique elements in each data frame row, return the one with the maximum occurrence, and form a vector from the results?
example:
a a a b b b b -> b
c v f w w r t -> w
s s d f b b b -> b
You can use apply to run the table function on every row of the data frame.
df <- read.table(textConnection("a a a b b b b\nc v f w w r t\ns s d f b b b"), header = F)
df$result <- apply(df, 1, function(x) names(table(x))[which.max(table(x))])
df
## V1 V2 V3 V4 V5 V6 V7 result
## 1 a a a b b b b b
## 2 c v f w w r t w
## 3 s s d f b b b b
Yes, with table:
x=c("a", "a", "a", "b" ,"b" ,"b" ,"b")
table(x)
x
a b
3 4
EDIT with data.table
DT = data.table(x = sample(letters[1:5], 10, TRUE), y = sample(letters[1:5], 10, TRUE))
#DT
# x y
# 1: d a
# 2: c d
# 3: d c
# 4: c a
# 5: a e
# 6: d c
# 7: c b
# 8: a b
# 9: b c
#10: c d
f = function(x) names(table(x))[which.max(table(x))]
DT[,lapply(.SD,f)]
# x y
#1: c c
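Note that lapply(.SD, f) returns the most frequent value of each column; the original question asks for it per row. A hedged row-wise variant reusing the same f (ties resolve to the first value, as which.max takes the first maximum):
DT[, result := apply(.SD, 1, f)]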
Note that if you want to keep ALL maxima, you need to ask for them explicitly.
You can save them as a list inside the data.frame; if there is only one per row, the list will simplify to a common vector:
df$result <- apply(df, 1, function(x) {tab <- table(x); list(tab[which(tab == max(tab))])})
With Ties for max:
df2 <- df[, 1:6]
df2$result <- apply(df2, 1, function(x) {tab <- table(x); list(tab[which(tab == max(tab))])})
> df2
V1 V2 V3 V4 V5 V6 result
1 a a a b b b 3, 3
2 c v f w w r 2
3 s s d f b b 2, 2
With No Ties for max:
df$result <- apply(df, 1, function(x) {tab <- table(x); list(tab[which(tab == max(tab))])})
> df
V1 V2 V3 V4 V5 V6 V7 result
1 a a a b b b b 4
2 c v f w w r t 2
3 s s d f b b b 3
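If you then want only the tied values themselves (not their counts), a small hedged sketch to extract the names from the list column:
lapply(df2$result, function(r) names(r[[1]]))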
