dataHAVE=data.frame(STUDENT=c(1,1,1,2,2,2,3,3,3),
SCORE=c(0,1,1,5,1,2,1,1,1),
CAT=c(3,10,7,4,5,0,4,5,1),
FOX=c(5,0,10,8,9,1,8,9,0))
dataWANT=data.frame(STUDENT=c(1,2,3),
SCORE=c(1,1,1),
CAT=c(10,5,4),
FOX=c(0,9,8))
I have 'dataHAVE' and want 'dataWANT', which keeps the first row for each 'STUDENT' where 'SCORE' equals 1. I am looking for a data.table solution because the data is large. I tried this, but I do not know how to set the criterion for 'SCORE':
dataWANT[,.SD[1],by = key(STUDENT)]
Convert the 'data.frame' to a 'data.table' (setDT), specify the logical condition in i, group by 'STUDENT', get the index of the first row in each group (.I[1]), extract that column ($V1) and use it to subset the rows
library(data.table)
setDT(dataHAVE)[dataHAVE[SCORE == 1, .I[1], STUDENT]$V1]
.I returns the row index. Without a grouping column, it returns a plain vector, i.e.
setDT(dataHAVE)[SCORE == 1, .I]
#[1] 1 2 3 4 5 6
When we provide the grouping column, the result is a data.table in which the unnamed .I expression gets the default column name V1 (we can override it by supplying a name)
setDT(dataHAVE)[SCORE == 1, .(colindex = .I[1]), STUDENT]
# STUDENT colindex
#1: 1 2
#2: 2 5
#3: 3 7
Now we have two columns, 'STUDENT' and 'colindex'. We are specifically interested in 'colindex', so extract it with the standard accessors ($ or [[) and then use it as the row index in i
i1 <- setDT(dataHAVE)[SCORE == 1, .(colindex = .I[1]), STUDENT]$colindex
i1
#[1] 2 5 7
We use this for subsetting
dataHAVE[i1]
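For comparison, here is a minimal sketch of two shorter spellings of the same idea: filter in i first, then keep the first row per group. Both `.SD[1]` with `by` and `unique()` with its `by` argument are standard data.table calls; the names `first_sd` and `first_unique` are only illustrative, and no claim is made here about which variant is fastest on large data.

```r
library(data.table)

dataHAVE <- data.table(
  STUDENT = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  SCORE   = c(0, 1, 1, 5, 1, 2, 1, 1, 1),
  CAT     = c(3, 10, 7, 4, 5, 0, 4, 5, 1),
  FOX     = c(5, 0, 10, 8, 9, 1, 8, 9, 0)
)

# Keep only rows with SCORE == 1, then take the first row of each STUDENT group
first_sd <- dataHAVE[SCORE == 1, .SD[1], by = STUDENT]

# Same rows: among the filtered rows, drop later duplicates of STUDENT
first_unique <- unique(dataHAVE[SCORE == 1], by = "STUDENT")

first_sd
first_unique
```

Students that never have a SCORE of 1 simply do not appear in either result.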
Here is a base R option using subset + ave
subset(
dataHAVE,
ave(SCORE==1, STUDENT, FUN = function(x) seq_along(x) == min(which(x)))
)
which gives
  STUDENT SCORE CAT FOX
2       1     1  10   0
5       2     1   5   9
7       3     1   4   8
Solution 1. A straightforward, easy-to-follow solution in two lines:
dataWANT <- dataHAVE[dataHAVE$SCORE == 1,] #Filter score equals to 1
dataWANT <- dataWANT[!duplicated(dataWANT$STUDENT), ] #Remove duplicated students
Solution 2. However, if you prefer to solve in one line:
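dataWANT <- dataHAVE[!duplicated(paste(dataHAVE$STUDENT, dataHAVE$SCORE)) & dataHAVE$SCORE == 1, ]
This builds a logical vector marking the first occurrence of each 'STUDENT'/'SCORE' combination and combines it with a test that 'SCORE' is 1. (paste, with its default space separator, is used rather than paste0 so that keys such as STUDENT 1 / SCORE 11 and STUDENT 11 / SCORE 1 cannot collide.)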
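As a side note, `duplicated()` also accepts a data frame, so the key columns do not have to be pasted into strings at all. The sketch below assumes 'dataHAVE' is still a plain data.frame (not yet converted with setDT); `is_first_pair` is an illustrative name.

```r
dataHAVE <- data.frame(
  STUDENT = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  SCORE   = c(0, 1, 1, 5, 1, 2, 1, 1, 1),
  CAT     = c(3, 10, 7, 4, 5, 0, 4, 5, 1),
  FOX     = c(5, 0, 10, 8, 9, 1, 8, 9, 0)
)

# TRUE for the first occurrence of each (STUDENT, SCORE) pair
is_first_pair <- !duplicated(dataHAVE[c("STUDENT", "SCORE")])

# Keep the first occurrences whose SCORE is 1
dataWANT <- dataHAVE[is_first_pair & dataHAVE$SCORE == 1, ]
dataWANT
```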
You could use match to get 1st row where SCORE = 1 for each STUDENT.
library(data.table)
setDT(dataHAVE)
dataHAVE[, .SD[match(1, SCORE)], STUDENT]
# STUDENT SCORE CAT FOX
#1: 1 1 10 0
#2: 2 1 5 9
#3: 3 1 4 8
I have data that looks as follows:
Patent_number<-c(2323,4449,4939,4939,12245)
IPC_class_1<-c("C12N",4,"C29N00185",2,"C12F")
IPC_class_2<-c(3,"K12N","C12F","A01N",8)
IPC_class_3<-c("S12F",1,"CQ010029393049",5,"CQ1N")
df<-data.frame(Patent_number, IPC_class_1, IPC_class_2, IPC_class_3)
View(df)
I want to count only the number of string values such as C12N, A01N etc. per row, by adding another column "counts" at the end of the data frame. In other words, I want to exclude the numeric values from the row count.
Any suggestions?
You can't have mixed types in a dataframe column, so all of the numeric values will also be stored as type character. One approach would be to convert everything using as.numeric, and then use is.na to count those that are not coercible to numeric...
df$counts <- apply(sapply(df, as.numeric), 1, function(x) sum(is.na(x)))
df
  Patent_number IPC_class_1 IPC_class_2 IPC_class_3 counts
1          2323        C12N           3        S12F      2
2          4449           4        K12N           1      1
3          4939        C29N        C12F        CQ01      3
4          4939           2        A01N           5      1
5         12245        C12F           8        CQ1N      2
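We may also count by checking whether each value consists only of digits (and decimal points), and subtracting the number of such values from the number of columns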
df$counts <- ncol(df) - Reduce(`+`, lapply(df, grepl, pattern = '^[0-9.]+$'))
df$counts
[1] 2 1 3 1 2
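A minimal sketch of the coercion idea restricted to the IPC columns, with the "NAs introduced by coercion" warnings silenced. It assumes the columns to inspect all start with `IPC_class_` and that they contain no genuine missing values (a real NA would be counted as a string here); `ipc_cols` and `is_text` are illustrative names.

```r
df <- data.frame(
  Patent_number = c(2323, 4449, 4939, 4939, 12245),
  IPC_class_1   = c("C12N", "4", "C29N00185", "2", "C12F"),
  IPC_class_2   = c("3", "K12N", "C12F", "A01N", "8"),
  IPC_class_3   = c("S12F", "1", "CQ010029393049", "5", "CQ1N")
)

# Columns to inspect (assumption: they share the IPC_class_ prefix)
ipc_cols <- grep("^IPC_class_", names(df), value = TRUE)

# TRUE where a value cannot be read as a number, i.e. it is a "string" value
is_text <- sapply(df[ipc_cols], function(col) {
  is.na(suppressWarnings(as.numeric(col)))
})

# Count the string values per row
df$counts <- rowSums(is_text)
df$counts
```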
Let's say I have the following data.table.
library(data.table)
DT <- data.table(x=1:6, y=c(0,0,1,0,0,0))
Could I write some command DT[...] that selects all the rows within 2 rows of the one in which y=1? That is, using proximity to row three, I want to select rows 1-5.
Here is one option: loop over the position index (which(y == 1)) with lapply, create a sequence by subtracting/adding 2, unlist and take the unique elements (in case of overlaps), and subset the rows by using that in i. (unlist(lapply(...)) is used instead of sapply so that the index stays a plain vector even when several rows have y == 1.)
library(data.table)
DT[unique(unlist(lapply(which(y == 1), function(i) (i - 2):(i + 2))))]
-output
# x y
#1: 1 0
#2: 2 0
#3: 3 1
#4: 4 0
#5: 5 0
If the window runs past the top or the bottom of the table, keep only the valid indices before subsetting
i1 <- DT[, unique(unlist(lapply(which(y == 1), function(i) (i - 2):(i + 2))))]
DT[i1[i1 > 0 & i1 <= nrow(DT)]]
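The same window logic can be sketched without an explicit loop by adding the offsets -2..2 to every matching position with `outer()` and clamping the result to the valid row range. This is only an illustrative variation; `hits` and `idx` are made-up names.

```r
library(data.table)

DT <- data.table(x = 1:6, y = c(0, 0, 1, 0, 0, 0))

# Positions where y equals 1
hits <- DT[, which(y == 1)]

# Every position within 2 rows of a hit, de-duplicated and sorted
idx <- sort(unique(as.vector(outer(hits, -2:2, `+`))))

# Drop positions that fall outside the table
idx <- idx[idx >= 1 & idx <= nrow(DT)]

res <- DT[idx]
res
```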
We can use rolling operations with two different alignments to check whether any value of y within a window of size 3 equals 1.
library(data.table)
library(zoo)
DT[rollapplyr(y == 1, 3, any, fill = FALSE) |
rollapply(y == 1, 3, any, fill = FALSE, align = 'left')]
# x y
#1: 1 0
#2: 2 0
#3: 3 1
#4: 4 0
#5: 5 0
rollapplyr(...) is the same as rollapply(..., align = 'right')
I have a data.table
library(data.table)
DT <- data.table(a=c(1,2,3,4), b=c(4,4,4,4), x=c(1,3,5,5))
> DT
a b x
1: 1 4 1
2: 2 4 3
3: 3 4 5
4: 4 4 5
and I would like to select rows where x equals either a or b. Obviously, I could use
> DT[x==a | x==b]
a b x
1: 1 4 1
which gives the correct result. However, with many columns, I thought the following should work just as well
> DT[x%in%c(a,b)]
a b x
1: 1 4 1
2: 2 4 3
but it gives a different result that is not intuitive to me. Can anyone help?
The expression
DT[x==a | x==b]
returns all rows in DT where the values in x and a are equal or x and b are equal. This is the desired result.
On the other hand
DT[x%in%c(a,b)]
returns all rows where x matches any value in c(a, b), not just the corresponding value. Thus your second row appears because x == 3 and 3 appears (somewhere) in a.
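The difference is easy to see on plain vectors holding the same values as the columns. A small sketch:

```r
a <- c(1, 2, 3, 4)
b <- c(4, 4, 4, 4)
x <- c(1, 3, 5, 5)

# Element-wise: compares x[i] only with a[i] and b[i]
rowwise <- x == a | x == b

# Set membership: compares x[i] with every value in a and b
anywhere <- x %in% c(a, b)

rowwise   # TRUE FALSE FALSE FALSE
anywhere  # TRUE  TRUE FALSE FALSE
```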
We can use Reduce with .SDcols for multiple columns. Specify the columns of interest in .SDcols, then loop over the .SD (Subset of Data.table), do the comparison (==) with 'x', and Reduce it to a single logical vector with |
DT[DT[, Reduce(`|`, lapply(.SD, `==`, x)), .SDcols = a:b]]
# a b x
#1: 1 4 1
Another way is to use rowSums
DT[rowSums(DT[,.SD,.SDcols=-'x']==x)>0,]
# a b x
#1: 1 4 1
You can change this to rowMeans(...) == 1 if you want to select rows where all columns equal x
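With many columns it may be more convenient to pass the column names as a character vector. The sketch below shows both the "any column equals x" and the "all columns equal x" variants with `Reduce`; `cols`, `any_match` and `all_match` are illustrative names.

```r
library(data.table)

DT <- data.table(a = c(1, 2, 3, 4), b = c(4, 4, 4, 4), x = c(1, 3, 5, 5))

# Columns to compare against x, given by name
cols <- c("a", "b")

# TRUE where x equals at least one of the columns in the same row
any_match <- DT[, Reduce(`|`, lapply(.SD, `==`, x)), .SDcols = cols]

# TRUE where x equals every one of the columns in the same row
all_match <- DT[, Reduce(`&`, lapply(.SD, `==`, x)), .SDcols = cols]

DT[any_match]  # the single row with a == 1, b == 4, x == 1
DT[all_match]  # no rows for this data
```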
I have a simple example data frame with two data columns (data1 and data2) and two grouping variables (Measure 1 and 2). Measure 1 and 2 have missing data NA.
d <- data.frame(Measure1 = 1:2, Measure2 = 3:4, data1 = 1:10, data2 = 11:20)
d$Measure1[4]=NA
d$Measure2[8]=NA
d
   Measure1 Measure2 data1 data2
1         1        3     1    11
2         2        4     2    12
3         1        3     3    13
4        NA        4     4    14
5         1        3     5    15
6         2        4     6    16
7         1        3     7    17
8         2       NA     8    18
9         1        3     9    19
10        2        4    10    20
I want to create a new variable (d$new) that contains data1, but only for rows where Measure1 equals 1. I tried this and get the following error:
d$new[d$Measure1 == 1] = d$data1[d$Measure1 == 1]
Error in d$new[d$Measure1 == 1] = d$data1[d$Measure1 == 1] : NAs
are not allowed in subscripted assignments
Next I would like to add to d$new the data from data2 only for rows where Measure2 equals 4. However, the missing data in Measure1 and Measure2 is causing problems in subsetting the data and assigning it to a new variable. I can think of some overly complicated solutions, but I'm sure there's an easy way I'm not thinking of. Thanks for the help!
Find rows where Measure1 is not NA and is the value you want.
measure1_notNA = which(!is.na(d$Measure1) & d$Measure1 == 1)
Initialize your new column with some default value.
d$new = NA
Replace only those rows with corresponding values from data1 column.
d$new[measure1_notNA] = d$data1[measure1_notNA]
Or, in 1 line:
d$new[d$Measure1 == 1 & !is.na(d$Measure1)] = d$data1[d$Measure1 == 1 & !is.na(d$Measure1)]
Based on the description, it seems that the OP wants to create a column 'new' based on two columns, i.e. when Measure1==1, take the corresponding elements of 'data1'; similarly, when Measure2==4, take the corresponding 'data2' values; and fill the rest with NA. We can use ifelse
d$new <- with(d, ifelse(Measure1==1 & !is.na(Measure1), data1,
ifelse(Measure2==4, data2, NA)))
We could also do this with data.table by assigning (:=) in two steps. Convert the 'data.frame' to a 'data.table' (setDT(d)). Based on the logical condition (Measure1==1 & !is.na(Measure1)), we assign 'data1' to the new column 'new'. This creates the column with the values of 'data1' in the rows where the condition is TRUE and NA in the rest. In the second step, we do the same using 'Measure2'/'data2'.
library(data.table)
setDT(d)[Measure1==1 & !is.na(Measure1), new:= data1]
d[Measure2==4, new:= data2]
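As a base R alternative to spelling out !is.na(...), `%in%` can stand in for `==`: it never returns NA, so a missing value simply counts as "no match". A minimal sketch that recreates the example data; `sel1` and `sel2` are illustrative names.

```r
d <- data.frame(Measure1 = 1:2, Measure2 = 3:4, data1 = 1:10, data2 = 11:20)
d$Measure1[4] <- NA
d$Measure2[8] <- NA

# %in% yields FALSE (not NA) for the missing entries
sel1 <- d$Measure1 %in% 1
sel2 <- d$Measure2 %in% 4

# Start with all NA, then fill in from data1 and data2
d$new <- NA_integer_
d$new[sel1] <- d$data1[sel1]
d$new[sel2] <- d$data2[sel2]
d$new
```

If a row satisfied both conditions, the second assignment would win; in this example data the two conditions never overlap.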
I think I am on the right track with this code, but I am not quite there yet.
I tried finding something useful on Google and SE, but I did not seem to be able to formulate the question in a way that gets me the answer I am looking for.
I could write a for-loop for this, comparing for each id and for each unique value of a per row, but I strive to achieve a higher level of R-understanding and thus want to avoid loops.
id <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
a <- c(1,1,1,2,2,2,3,3,4,4,4,5,5,5,6)
b <- c(1,2,3,3,3,4,3,4,5,4,4,5,6,7,8)
require(data.table)
dt <- data.table(id, a, b)
dt
dt[,unique(a) %in% b, by=id]
tmp <- dt[,unique(a) %in% b, by=id]
tmp$id[tmp$V1 == FALSE]
In my example, IDs 2, 3 and 5 should be the result, the decision rule being: "By id, check for each unique value of a whether there is at least one observation where the value of b equals the value of a."
However, my code only outputs IDs 2 and 5, but not 3. This is because for ID 3, the 4 is matched with the 4 of the previous observation.
The result should either output the IDs for which the condition is not met, or add a dummy variable to the original table that indicates whether the condition is met for the ID.
How about
dt[, all(sapply(unique(a), function(i) any(a == i & b == i))), by = id]
# id V1
#1: 1 TRUE
#2: 2 FALSE
#3: 3 FALSE
#4: 4 TRUE
#5: 5 FALSE
If you want to add a dummy variable to the original table, you can modify it like
dt[, check:=all(sapply(unique(a), function(i) any(a == i & b == i))), by = id]
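The same rule can be sketched a little more compactly: within each id, collect the values of a that occur in a row where a equals b, and check that every unique a is among them. This is an illustrative variation on the code above; `ok` and `failing_ids` are made-up names.

```r
library(data.table)

id <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5)
a  <- c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6)
b  <- c(1, 2, 3, 3, 3, 4, 3, 4, 5, 4, 4, 5, 6, 7, 8)
dt <- data.table(id, a, b)

# a[a == b] holds the values of a that are matched by b in the same row;
# the condition holds for an id if every unique a is among them
res <- dt[, .(ok = all(unique(a) %in% a[a == b])), by = id]

# ids for which the condition is not met
failing_ids <- res[ok == FALSE, id]
failing_ids
```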
I wondered whether I could find a more data.table-esque solution for this old question using the enhanced join capabilities that were introduced to data.table in version 1.9.6 (on CRAN 19 Sep 2015). With that version, data.table gained the ability to join without having to set keys, by using the on argument.
Variant 1
dt[a == b][dt[, unique(a), by = id], on = .(id, a == V1)][is.na(b), unique(id)]
[1] 2 3 5
First, the rows of dt where a and b are equal are selected. Only these rows are right joined with the unique values of a for each id. The result of the join is
dt[a == b][dt[, unique(a), by = id], on = .(id, a == V1)]
   id a  b
1:  1 1  1
2:  2 2 NA
3:  3 3  3
4:  3 4 NA
5:  4 4  4
6:  4 4  4
7:  4 5  5
8:  5 5 NA
9:  5 6 NA
The NA values in column b indicate that no match is found. Any id which has an NA value indicates that OP's condition is not met.
Variant 2
dt[dt[, unique(a), by = id], on = .(id, a == V1, b == V1), unique(id[is.na(x.a)])]
[1] 2 3 5
This variant right joins dt (unfiltered!) with the unique values of a for each id, but the join conditions require matches in id as well as matches in a and b. (This resembles the a == i & b == i expression in konvas' accepted answer.) Finally, those ids are returned which have at least one NA value in the join result, indicating a missing match.