I have been searching for a way to subset a data.table using matching values in certain columns of another data.table.
Here is an example:
set.seed(2)
library(data.table)
dt <- data.table(a = 1:10,
                 b = rnorm(10),
                 c = runif(10),
                 d = letters[1:10])
dt2 <- data.table(a = 5:20,
                  b = rnorm(16),
                  c = runif(16),
                  d = letters[5:20])
This is the result I need:
> dt2
   a            b          c d
1: 5 -2.311069085 0.62512173 e
2: 6 0.878604581 0.26030004 f
3: 7 0.035806718 0.85907312 g
4: 8 1.012828692 0.43748800 h
5: 9 0.432265155 0.38814476 i
6: 10 2.090819205 0.46150111 j
That is, I want the rows of the second data table returned where a and d match rows in the first, even though b and c may not. The real data are mutually exclusive, and I need to match on three columns.
We can use %in% to match the columns and subset accordingly.
dt2[a %in% dt$a & d %in% dt$d]
# a b c d
#1: 5 -2.31106908 0.6251217 e
#2: 6 0.87860458 0.2603000 f
#3: 7 0.03580672 0.8590731 g
#4: 8 1.01282869 0.4374880 h
#5: 9 0.43226515 0.3881448 i
#6: 10 2.09081921 0.4615011 j
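Note that %in% checks each column independently, so it can also keep rows where the a value and the d value each occur somewhere in dt, but never together in the same row. A hypothetical illustration (not part of the original data):

#dt pairs a = 5 with d = "e", and d = "f" with a = 6,
#so the pair (5, "f") never occurs in dt, yet this row is kept
tmp <- data.table(a = 5, b = 0, c = 0, d = "f")
tmp[a %in% dt$a & d %in% dt$d]
#   a b c d
#1: 5 0 0 f

If the combination of columns must match, use a join as in the next answer.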
Here is an option using a join, specifying the columns to match with the on argument:
na.omit(dt2[dt[, c("a", "d"), with = FALSE], on = c("a", "d")])
# a b c d
#1: 5 -2.31106908 0.6251217 e
#2: 6 0.87860458 0.2603000 f
#3: 7 0.03580672 0.8590731 g
#4: 8 1.01282869 0.4374880 h
#5: 9 0.43226515 0.3881448 i
#6: 10 2.09081921 0.4615011 j
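The same inner join can skip the na.omit step by telling the join to drop non-matching rows via nomatch (a sketch; nomatch = 0L, or nomatch = NULL in recent data.table versions). The on vector also extends naturally to three columns:

#keep only the rows of dt2 whose (a, d) pair also occurs in dt
dt2[dt[, .(a, d)], on = c("a", "d"), nomatch = 0L]

This returns the same six rows as above.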
Related
Is there a way to compare whether "any value of" a row is identical to "any value" of the row above -- regardless of the order? Below is a very random input data table.
DT <- data.table(A=c("a","a","b","d","e","f","h","i","j"),
B=c("a","b","c","c","f","g",NA,"j",NA),
C=c("a","b","c","b","g","h",NA,NA,NA))
> DT
A B C
1: a a a
2: a b b
3: b c c
4: d c b
5: e f g
6: f g h
7: h NA NA
8: i j NA
9: j NA NA
I would like to add a column D that compares a row with the row above, and compare whether any values of the two rows are identical (regardless of the order). So the desired output would be:
> DT
A B C D
1: a a a 0 #No row above to compare; could be either NA or 0
2: a b b 1 #row 2 has "a", which is in row 1; returns 1
3: b c c 1 #row 3 has "b", which is in row 2; returns 1
4: d c b 1 #row 4 has "b" and "c", which are in row 3; returns 1
5: e f g 0 #row 5 has nothing that is in row 4; returns 0
6: f g h 1 #row 6 has "f" and "g", which are in row 5; returns 1
7: h NA NA 1 #row 7 has "h", which is in row 6; returns 1
8: i j NA 0 #row 8 has nothing that is in row 7 (NA doesn't count)
9: j NA NA 1 #row 9 has "j", which is in row 8; returns 1 (NA doesn't count)
The main idea is that I would like to compare a row (or vector) with another row (vector), and define two rows as identical if any of their elements match, without looping over and comparing each element individually.
We can do this by taking the lead rows of the dataset, pasting each row into a single string, checking whether any element of the pasted lead row occurs in the corresponding original row using grepl and Map, then unlisting and converting to integer:
DT[, D := {
  v1 <- do.call(paste, .SD)                                    # each row collapsed to one string
  v2 <- do.call(paste, c(shift(.SD, type = "lead"), sep="|"))  # next row as an "x|y|z" pattern
  v2N <- gsub("NA\\|*|\\|*NA", "", v2)                         # strip NAs (and stray "|") from the pattern
  v3 <- unlist(Map(grepl, v2N, v1), use.names = FALSE)         # does this row contain any element of the next row?
  as.integer(head(c(FALSE, v3), -1))                           # shift down one row; first row gets 0
}]
DT
# A B C D
#1: a a a 0
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1
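One caveat with this pattern-based approach: grepl does substring matching, so a value can match inside a longer value. With single-letter values, as here, that cannot happen, but a quick hypothetical example shows the behaviour:

grepl("a", "ab")
#[1] TRUE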
Or we can split the data into rows and do the comparison using Map:
as.integer(c(FALSE, unlist(Map(function(x,y) {
x1 <- na.omit(unlist(x))
y1 <- na.omit(unlist(y))
any(x1 %in% y1 | y1 %in% x1) },
split(DT[-nrow(DT)], 1:(nrow(DT)-1)), split(DT[-1], 2:nrow(DT))), use.names = FALSE)))
Here is another method. It's probably not advisable on large data.tables as it uses by=1:nrow(DT) which tends to be quite slow.
DT[, D:= sign(DT[, c(.SD, shift(.SD))][,
sum(!is.na(intersect(unlist(.SD[, .(A, B, C)]), unlist(.SD[, .(V4, V5, V6)])))),
by=1:nrow(DT)]$V1)]
Here, [, c(.SD, shift(.SD))] creates a copy of the data.table with the lagged variables cbinded alongside. The second chain then intersects the unlisted original variables (A, B, C) with the unlisted shifted variables (V4, V5, V6) and counts the non-NA common elements with sum(!is.na(...)). This operation occurs for each row of the copied data.table. The sums are extracted with $V1 and turned into binary (0 and 1) using sign.
It returns
DT
A B C D
1: a a a 0
2: a b b 1
3: b c c 1
4: d c b 1
5: e f g 0
6: f g h 1
7: h NA NA 1
8: i j NA 0
9: j NA NA 1
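To see what the intermediate c(.SD, shift(.SD)) step produces, it can be run on its own on the original three-column DT (illustrative; exact printing may vary by version). V4, V5 and V6 are the lagged copies of A, B and C:

DT[, c(.SD, shift(.SD))]
#   A B C V4 V5 V6
#1: a a a NA NA NA
#2: a b b  a  a  a
#3: b c c  a  b  b
# ... and so on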
Here's a loop-free approach using data.table's joins:
DT[, id := 1:.N]
dt <- melt(DT, id.vars = "id")
dt[, id2 := id-1]
dt <- dt[!is.na(value)]
idx <- dt[dt, on = .(id2 = id, value), nomatch=0][, unique(id)]
DT[, `:=`(D = as.integer(id %in% idx), id = NULL)]
It looks somewhat complicated but it performs pretty well, taking just over a second on a 1-million-row data set with three columns.
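For reference, the long-format table the join operates on looks like this after the melt, id2 and NA-removal steps (first rows shown; value runs through columns A, B and C in turn):

head(dt, 4)
#   id variable value id2
#1:  1        A     a   0
#2:  2        A     a   1
#3:  3        A     b   2
#4:  4        A     d   3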
I would do a sapply along the indices (minus the last) of the table:
compare <- function(i) {
row1 <- as.character(DT[i,])
row2 <- as.character(DT[i+1,])
return(length(intersect(row1[!is.na(row1)], row2[!is.na(row2)])) > 0)
}
result <- sapply(1:(nrow(DT) - 1), compare)
This returns a vector of logicals, so if you prefer integers, wrap the output of compare in as.numeric().
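To attach the desired 0/1 column, the logical vector can be shifted down one row by prepending a value for row 1, which has nothing above it:

DT$D <- as.integer(c(FALSE, result))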
Here is a base R solution using intersect:
res <- c(0, sapply(2:nrow(DT), function(i)
length(intersect( na.omit(as.character(DT[i,])), na.omit(as.character(DT[i-1,])) ) )>0))
cbind(DT, D=res)
# A B C D
# 1: a a a 0
# 2: a b b 1
# 3: b c c 1
# 4: d c b 1
# 5: e f g 0
# 6: f g h 1
# 7: h NA NA 1
# 8: i j NA 0
# 9: j NA NA 1
This solution compares the two rows with %in% (after unlist()):
DT[, result:=as.integer(c(NA, sapply(2:DT[,.N], function(i) any(na.omit(unlist(DT[i])) %in% unlist(DT[i-1])))))]
#> DT
# A B C result
#1: a a a NA
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1
Using a combination of intersect and mapply you could do:
#list of unique non-NA elements in each row
tableList = apply(DT, 1, function(x) unique(na.omit(x)))
#a lagged list to be compared with the list above
#(note: 2:length(tableList)-1 is (2:length(tableList)) - 1, i.e. 1:(length(tableList)-1))
tableListLag = c(NA, tableList[2:length(tableList)-1])
#find common elements using the intersect function;
#if length > 0 there are common elements, hence set value as 1, else 0
DT$D = mapply(function(x,y) ifelse(length(intersect(x,y)) > 0, 1, 0),
              tableList, tableListLag, SIMPLIFY = TRUE)
DT
# A B C D
#1: a a a 0
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1
I have a list of data tables stored in an object ddf (a sample is shown below):
[[43]]
V1 V2 V3
1: b c a
2: b c a
3: b c a
4: b c a
5: b b a
6: b c a
7: b c a
[[44]]
V1 V2 V3
1: a c a
2: a c a
3: a c a
4: a c a
5: a c a
[[45]]
V1 V2 V3
1: a c b
2: a c b
3: a c b
4: a c b
5: a c b
6: a c b
7: a c b
8: a c b
9: a c b
.............and so on till [[100]]
I want to subset the list ddf such that the result only consists of the elements which:
have at least 9 rows each
have all of those rows identical
I want to store this subsetted output.
I have written some code for this below:
for(i in 1:100){
  m=(as.numeric(nrow(df[[i]]))>= 9)
  if(m == TRUE & df[[i]][1,] = df[[i]][2,] = df[[i]][3,] = df[[i]][4,] =
     df[[i]][5,] = df[[i]][6,] = df[[i]][7,] = df[[i]][8,] = df[[i]][9,]){
    print(df[[i]])
  }}
Please tell me what's wrong and how I can generalize the result for subsetting based on "n" similar rows.
[Follow-up Question]
Answer obtained from Main question:
> ddf[sapply(ddf, function(x) nrow(x) >= n & nrow(unique(x)) == 1)]
$`61`
V1 V2 V3
1: a c b
2: a c b
3: a c b
4: a c b
5: a c b
6: a c b
7: a c b
$`68`
V1 V2 V3
1: a c a
2: a c a
3: a c a
4: a c a
5: a c a
6: a c a
7: a c a
8: a c a
$`91`
V1 V2 V3
1: b c a
2: b c a
3: b c a
4: b c a
5: b c a
6: b c a
7: b c a
..... till the last data.frame which meet the row matching criteria (of at least 9 similar rows)
There are only 2 types of elements [[..]] in the list:
Case 1. >70% accuracy
Case 2. <70% accuracy
You will notice that the output shown above in the follow-up question covers $`61`, $`68` and $`91`, but there is no output for the other data frames which don't meet the "matching row" criteria.
I need the elements which don't meet the criteria to return "bad output" instead of being dropped.
Thus the final list should be the same length as the input list.
By placing them side-by-side using paste I should be able to see each output.
We can loop through the list ('ddf'), subset only the duplicated rows (with duplicated), order the subset, and if the subset 'x1' has at least 9 rows, return its first 9 rows (head(x1, 9)); otherwise return "bad answer":
lapply(ddf, function(x) {
x1 <- x[duplicated(x)|duplicated(x, fromLast=TRUE)]
  if(nrow(x1) >= 9) {
x1[order(V1, V2, V3), head(.SD, 9)]
} else "bad answer"
})
#[[1]]
# V1 V2 V3
#1: b c a
#2: b c a
#3: b c a
#4: b c a
#5: b c a
#6: b c a
#7: b c a
#8: b c a
#9: b c a
#[[2]]
#[1] "bad answer"
#[[3]]
#[1] "bad answer"
data
ddf <- list(data.table(V1 = 'b', V2 = rep(c('c', 'b', 'c'), c(8, 1, 2)), V3 = 'a'),
data.table(V1 = rep("a", 5), V2 = rep("c", 5), V3 = rep("a", 5)),
data.table(V1 = c('b', 'a', 'b', 'b'), V2 = c('b', 'a', 'c', 'b'),
V3 = c("c", "d", "a", "b")))
When ddf is your list of data.tables, then:
ddf[sapply(ddf, nrow) >= 9 & sapply(ddf, function(x) nrow(unique(x))) == 1]
should give you the desired result.
Where:
sapply(ddf, nrow) >= 9 checks whether the data.tables have nine or more rows
sapply(ddf, function(x) nrow(unique(x))) == 1 checks whether all the rows are the same.
Or with one sapply call as #docendodiscimus suggested:
ddf[sapply(ddf, function(x) nrow(x) >= 9 & nrow(unique(x)) == 1)]
Or by using the .N special symbol and the uniqueN function of data.table:
ddf[sapply(ddf, function(x) x[,.N] >= 9 & uniqueN(x) == 1)]
Another option is to use Filter (following the suggestion of #Frank in the comments):
Filter(function(x) nrow(x) >= 9 & uniqueN(x) == 1, ddf)
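To generalize to "n" similar rows, as the question asks, the row count can be made a parameter (a minimal sketch):

#keep the list elements that have at least n rows, all identical
subset_n_same <- function(lst, n) {
  Filter(function(x) nrow(x) >= n & nrow(unique(x)) == 1, lst)
}
subset_n_same(ddf, 9)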
Two approaches to get the data.table numbers:
1. Using which:
which(sapply(ddf, function(x) nrow(x) >= 9 & nrow(unique(x)) == 1))
2. Assign names to the data.tables in the list:
names(ddf) <- paste0('dt', 1:length(ddf))
Now the output will include the data.table number:
$dt4
V1 V2 V3
1 a c b
2 a c b
3 a c b
4 a c b
5 a c b
6 a c b
7 a c b
8 a c b
9 a c b
I have a data set in Excel with a lot of VLOOKUP formulas that I am trying to replicate in R using the data.table package.
In my example below I am saying, for each row, find the value in column y within column x and return the value in column z.
The first row results in na because the value 6 doesn't exist in column x.
On the second row the value 5 appears twice in column x, but returning the first match is fine, which is e in this case.
I've added in the result column which is the expected outcome.
library(data.table)
dt <- data.table(x = c(1,2,3,4,5,5),
                 y = c(6,5,4,3,2,1),
                 z = c("a", "b", "c", "d", "e", "f"),
                 Result = c("na", "e", "d", "c", "b", "a"))
Many thanks
You can do this with a join, but need to change the order first:
setorder(dt, y)   #sort by y; for the duplicated x = 5, this makes the earlier original row's z the last assigned, so it wins
dt[.(x = x, z = z), result1 := i.z, on = .("y" = x)]   #for each x, find it in column y and pull back that row's z
setorder(dt, x)   #restore the order by x
# x y z Result result1
#1: 1 6 a na NA
#2: 2 5 b e e
#3: 3 4 c d d
#4: 4 3 d c c
#5: 5 1 f a a
#6: 5 2 e b b
I haven't tested if this is faster than match for a big data.table, but it might be.
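A variant that avoids the two setorder calls entirely would be a self-join with mult = "first", so the first occurrence of a duplicated x wins (a sketch, assuming a data.table version that supports the x. prefix in j):

#for each y, find the first row whose x matches it and take that row's z
dt[, result1 := dt[dt, on = .(x = y), mult = "first", x.z]]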
We can just use match to find the positions of the elements of 'y' in 'x', and use those positions to index into 'z' to get the corresponding values:
dt[, Result1 := z[match(y,x)]]
dt
# x y z Result Result1
#1: 1 6 a na NA
#2: 2 5 b e e
#3: 3 4 c d d
#4: 4 3 d c c
#5: 5 2 e b b
#6: 5 1 f a a
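To see why this picks the first of the two x = 5 rows, inspect what match returns, namely the lookup positions into x:

match(dt$y, dt$x)
#[1] NA  5  4  3  2  1

The 5 here is the position of the first 5 in x, so z[5] = "e" is returned for y = 5.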
I want to add a field to a data frame, setting it equal to an existing field in the same data frame based on a condition on a different (existing) field.
I know this works:
is.even <- function(x) x %% 2 == 0
df <- data.frame(a = c(1,2,3,4,5,6),
                 b = c("A","B","C","D","E","F"))
df$test[is.even(df$a)] <- as.character(df[is.even(df$a), "b"])
> df
a b test
1 1 A NA
2 2 B B
3 3 C NA
4 4 D D
5 5 E NA
6 6 F F
But I have this feeling it can be done a lot better than this.
Using data.table it's quite easy
library(data.table)
dt = data.table(a = c(1,2,3,4,5,6),
                b = c("A","B","C","D","E","F"))
dt[is.even(a), test := b]
> dt
a b test
1: 1 A NA
2: 2 B B
3: 3 C NA
4: 4 D D
5: 5 E NA
6: 6 F F
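For completeness, the same conditional assignment as a base R one-liner (a sketch; ifelse leaves NA where the condition is FALSE here):

df$test <- ifelse(is.even(df$a), as.character(df$b), NA)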
I have 8 columns of variables, of which I must keep columns 1 to 3. From columns 4 to 8, I need to keep those with only 3 levels and drop the ones that do not qualify.
I tried the following command
data3 <- data2[,sapply(data2,function(col)length(unique(col)))==3]
It managed to retain the variables with 3 levels, but deleted my first 3 columns.
You could do a two step process:
data4 <- data2[1:3]
#Your answer for the second part here:
data3 <- data2[,sapply(data2,function(col)length(unique(col)))==3]
cbind(data4, data3)
Note that merge(data3, data4) would not work here: with no columns in common to join on, merge() returns the Cartesian product of the two subsets, so cbind() is the way to put the parts back side by side.
I would suggest another approach:
x = 1:3
cbind(data2[x], Filter(function(i) length(unique(i))==3, data2[-x]))
# 1 2 3 5
#1 a 1 3 b
#2 b 2 4 b
#3 c 3 5 b
#4 d 4 6 a
#5 e 5 7 c
#6 f 6 8 c
#7 g 7 9 c
#8 h 8 10 a
#9 i 9 11 c
#10 j 10 12 b
Data:
data2 = setNames(
  data.frame(letters[1:10],
             1:10,
             3:12,
             sample(letters[1:10], 10, replace=T),
             sample(letters[1:3], 10, replace=T)),
  1:5)
Assuming that columns 4:8 are of factor class, we can also use nlevels to filter the columns. We create 'toKeep' as the numeric index of columns to keep and 'toFilter' as the numeric index of columns to filter. We then subset the dataset in two parts: 1) using 'toKeep' as the index (data2[toKeep]); 2) using 'toFilter', looping over those columns with sapply to get the number of levels (nlevels) and building a logical index (==3) to select the columns, then cbinding with the first subset.
toKeep <- 1:3
toFilter <- setdiff(seq_len(ncol(data2)), toKeep)
cbind(data2[toKeep], data2[toFilter][sapply(data2[toFilter], nlevels)==3])
# V1 V2 V3 V4 V6
#1 B B D C B
#2 B D D A B
#3 D E B A B
#4 C B E C A
#5 D D A D E
#6 E B A A B
data
set.seed(24)
data2 <- as.data.frame(matrix(sample(LETTERS[1:5], 8*6, replace=TRUE), ncol=8))