unique identifier in data.table - r

I have a data.table with 11 variables and 200,000+ rows. I am trying to find the unique identifier (in other words, key) in this data.table.
I am looking for something like isid in Stata, which checks whether the specified variables uniquely identify the observations. Can someone please help?
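For a direct check in data.table itself, uniqueN() compares the number of distinct rows against the total; a minimal sketch (DT and the column names are placeholders for your own table):
library(data.table)
DT <- data.table(id = 1:4, grp = c("a", "a", "b", "b"))
uniqueN(DT, by = "id") == nrow(DT)   # TRUE: id uniquely identifies the rows
uniqueN(DT, by = "grp") == nrow(DT)  # FALSE: grp does not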

This doesn't exactly answer the OP's question (I haven't used data.table yet), but it should help R-only users follow what is being asked. My focus is on explaining what isid actually does in Stata. I use data shipped with an R package (you need to install optmatch for this data).
library(optmatch)
data(nuclearplants)
sample<-nuclearplants
I focus only on a subset of the data frame, since my goal is just to explain what isid is doing:
sample<-sample[,c(1,2,5,10)]
head(sample,5)
cost date cap cum.n
H 460.05 68.58 687 14
I 452.99 67.33 1065 1
A 443.22 67.33 1065 1
J 652.32 68.00 1065 12
B 642.23 68.00 1065 12
Now, when I use the Stata command isid cost, it displays nothing, which means there are no duplicate observations on cost. The R equivalent is unique(sample$cost), or listing the duplicated rows directly:
sample[duplicated(sample$cost), ]
[1] cost  date  cap   cum.n
<0 rows> (or 0-length row.names)
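A compact equivalent of that isid check in base R is anyDuplicated(), which returns 0 when a vector has no repeated values:
anyDuplicated(sample$cost)  # 0 means cost uniquely identifies the observations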
However, when we use isid date, i.e. on the date variable, Stata reports that it is not unique. Alternatively, if you run duplicates examples date, Stata will list the duplicate observations as follows:
. duplicates examples date
Duplicates in terms of date
+-------------------------------+
| group: # e.g. obs date |
|-------------------------------|
| 1 2 27 67.25 |
| 2 2 2 67.33 |
| 3 3 29 67.83 |
| 4 2 4 68 |
| 5 5 8 68.42 |
|-------------------------------|
| 6 2 1 68.58 |
| 7 2 12 68.75 |
| 8 3 14 68.92 |
+-------------------------------+
To interpret the output: the date value 67.25 has two repeated observations (as indicated by the # column). The e.g. obs column identifies row 27 as one example (it doesn't give the row number of the second duplicate with 67.25). The group column assigns a unique identifier to each set of duplicates.
R command for the same is duplicated(sample$date).
duplicated(sample$date)
[1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
[22] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
To list the distinct values we can also use unique(sample$date) in R.
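To mirror Stata's listing more closely and pull out every row involved in a duplicate (including the first occurrence, which duplicated() alone skips over), a common base R idiom is the two-direction check:
dup <- duplicated(sample$date) | duplicated(sample$date, fromLast = TRUE)
sample[dup, ]            # all rows whose date occurs more than once
table(sample$date[dup])  # counts per duplicated date, like Stata's # column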
We can do the same for two variables with isid cost date. Again, Stata finds no duplicate observations across the two variables. The same holds when you use unique(sample[, c(1, 2)]) in R.
Again, if I run isid on all four variables, Stata confirms that the combination is unique (no warnings):
. duplicates examples cost date cap cum_n
Duplicates in terms of cost date cap cum_n
(0 observations are duplicates)
The same with unique(sample) in R.
Conclusion: I therefore think that as long as one variable is unique (i.e. it has no duplicate observations), any combination of variables that includes that unique variable must also be unique. Please correct me if I am wrong.
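That reasoning is sound: two rows can only agree on a combination of columns if they agree on every column in it, so adding columns to an already-unique variable can never create duplicates. A quick sanity check in base R (x and y are toy vectors):
x <- 1:5                         # unique on its own
y <- c(1, 1, 2, 2, 2)            # heavily duplicated
anyDuplicated(data.frame(x, y))  # 0: the combination is still unique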

I think you are confused on a few points about data.tables and keys.
A data.table will not have a key unless you explicitly set it.
A data.table key does not have to be unique.
You can write a function that checks whether certain columns could form a unique identifier for a dataset.
I've used data.table here, and have taken care to use unique on an unkeyed copy of the data.table.
This is not especially efficient.
isid <- function(columns, data, verbose = TRUE){
  # work on an unkeyed copy so unique() counts rows, not key groups
  if(!is.data.table(data)){
    copyd <- data.table(data)
  } else {
    copyd <- copy(data)
  }
  if(haskey(copyd)){
    setkey(copyd, NULL)
  }
  # NA values don't work in keys for data.tables: set aside columns containing NA
  any.NA <- Filter(function(x) any(is.na(copyd[[x]])), columns)
  if(verbose){
    for(aa in seq_along(any.NA)){
      message(sprintf('Column %s contains NA values', any.NA[aa]))
    }
  }
  validCols <- setdiff(columns, any.NA)
  # try combinations of 1, 2, ... columns until one uniquely identifies the rows
  ncol <- 1L
  validKey <- FALSE
  while(!isTRUE(validKey) && ncol <= length(validCols)){
    anyValid <- combn(x = validCols, m = ncol, FUN = function(xn){
      subd <- copyd[, ..xn]
      # if unique() loses no rows, this combination is a valid identifier
      result <- nrow(subd) == nrow(unique(subd))
      list(cols = xn, valid = result)
    }, simplify = FALSE)
    whichValid <- sapply(anyValid, `[[`, 'valid')
    validKey <- any(whichValid)
    ncol <- ncol + 1L
  }
  if(!validKey){
    warning('No combinations are unique')
    return(NULL)
  } else {
    valid.combinations <- lapply(anyValid, `[[`, 'cols')[whichValid]
    if(length(valid.combinations) > 1){
      warning('More than one combination valid, returning the first only')
    }
    return(valid.combinations[[1]])
  }
}
Some examples in use:
oneU <- data.table(a = c(2,1,2,2), b = c(1,2,3,4))
twoU <- data.table(a = 1:4, b = letters[1:4])
bothU <- data.table(a = letters[1:2], b = rep(letters[1:2], each = 2))
someNA <- data.table(a = c(1,2,3,NA), b = 1:4)
isid(names(oneU), oneU)
# [1] "b"
isid(names(twoU), twoU)
# [1] "a"
# Warning message:
# In isid(names(twoU), twoU) :
# More than one combination valid, returning the first only
isid(names(bothU), bothU)
# [1] "a" "b"
isid(names(someNA), someNA)
# Column a contains NA values
# [1] "b"
# examples with no valid identifiers
isid('a', someNA)
## Column a contains NA values
## NULL
## Warning message:
## In isid("a", someNA) : No combinations are unique
isid('a', oneU)
## NULL
## Warning message:
## In isid("a", oneU) : No combinations are unique

Related

When subsetting in R is it necessary to include `which` or can I just put a logical test?

Say I have a data frame df and want to subset it based on the value of column a.
df <- data.frame(a = 1:4, b = 5:8)
df
Is it necessary to include a which function in the brackets or can I just include the logical test?
df[df$a == "2",]
# a b
#2 2 6
df[which(df$a == "2"),]
# a b
#2 2 6
It seems to work the same either way... I was getting some strange results in a large data frame (i.e., empty rows returned along with the correct ones), but once I cleaned the environment and reran my script it worked fine.
df$a == "2" returns a logical vector, while which(df$a=="2") returns indices. If there are missing values in the vector, the first approach will include them in the returned value, but which will exclude them.
For example:
x=c(1,NA,2,10)
x[x==2]
[1] NA 2
x[which(x==2)]
[1] 2
x==2
[1] FALSE NA TRUE FALSE
which(x==2)
[1] 3
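The same difference explains the "empty rows" mentioned in the question: when subsetting a data frame, each NA in a logical index produces an all-NA row, which which() avoids. A small illustration (df2 is a toy data frame):
df2 <- data.frame(a = c(1, NA, 2), b = 1:3)
df2[df2$a == 2, ]
#     a  b
# NA NA NA
# 3   2  3
df2[which(df2$a == 2), ]
#   a b
# 3 2 3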

rle(): Return average of lengths only if values == TRUE

I have the following rle object:
Run Length Encoding
lengths: int [1:189] 4 5 3 15 6 4 9 1 9 5 ...
values : logi [1:189] FALSE TRUE FALSE TRUE FALSE TRUE ...
I would like to find the average (mean) of the lengths whose corresponding item in values is TRUE (I'm not interested in the lengths where values == FALSE):
df <- data.frame(values = NoOfTradesAndLength$values, lengths = NoOfTradesAndLength$lengths)
AveLength <- aggregate(lengths ~ values, data = df, FUN = function(x) mean(x))
Which returns this:
values lengths
1 FALSE 7.694737
2 TRUE 5.287234
I can now obtain the length where values == TRUE, but is there a nicer way of doing this? Or could I achieve a similar result without using rle at all? It feels a bit fiddly converting from lists to a data frame, and I'm sure there is a clever one-line way of doing this. Derivatives of this question have come up before, but I wasn't able to get anything better from those, so your help is much appreciated.
rle() returns a list with components 'lengths' and 'values'. We can subset 'lengths' using 'values' as a logical index and take the mean:
with(NoOfTradesAndLength, mean(lengths[values]))
Using a reproducible example
set.seed(24)
NoOfTradesAndLength <- rle(sample(c(TRUE, FALSE), 25, replace=TRUE))
with(NoOfTradesAndLength, mean(lengths[values]))
#[1] 1.5
Using the OP's code
AveLength[2,]
# values lengths
#2 TRUE 1.5
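As for doing it without rle at all: the lengths of the TRUE runs can be derived directly from the logical vector with cumsum and tabulate. A base R sketch (x is a toy stand-in for the original series):
x <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)
starts <- x & !c(FALSE, head(x, -1))  # TRUE wherever a run of TRUEs begins
run.id <- cumsum(starts)[x]           # run index for every TRUE position
mean(tabulate(run.id))                # average TRUE run length
#[1] 2
Whether this is nicer is debatable; rle is arguably the natural tool here.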

Finding matching character strings in 2 sets of data in R

I have 2 data sets; one contains information on patients, and the other is a list of medical codes
patient <- data.table(ID = rep(1:5, each = 3),
codes = c("13H42", "1B1U", "Eu410", "Je450", "Fg65", "Eu411", "Eu402", "B110", "Eu410", "Eu50",
"1B1U", "Eu513", "Eu531", "Eu411", "Eu608")
)
code <- data.table(codes = c("BG689", "13H42", "BG689", "Ju34K", "Eu402", "Eu410", "Eu50", "JE541", "1B1U",
"Eu411", "Fg605", "GT6TU"),
term = c(NA))
code$term has values in the real data, but they're omitted for this example.
What I want is an indicator column mh in patient showing whether each value of patient$codes occurs in code$codes:
patient
ID codes mh
1: 1 13H42 TRUE
2: 1 1B1U TRUE
3: 1 Eu410 TRUE
4: 2 Je450 FALSE
5: 2 Fg65 FALSE
6: 2 Eu411 TRUE
7: 3 Eu402 TRUE
8: 3 B110 FALSE
9: 3 Eu410 TRUE
10: 4 Eu50 TRUE
11: 4 1B1U TRUE
12: 4 Eu513 FALSE
13: 5 Eu531 FALSE
14: 5 Eu411 TRUE
15: 5 Eu608 FALSE
My solution was to use grepl:
patient$mh <- mapply(grepl, pattern=code$codes, x=patient$codes)
However, this didn't work, as code isn't the same length as patient, and I got the warning:
Warning message:
In mapply(grepl, pattern = code$codes, x = patient$codes) :
longer argument not a multiple of length of shorter
Any solutions for an exact match?
You can do this:
patient[,mh := codes %in% code$codes]
Update:
As rightly suggested by Pasqui, for getting 0s and 1s,
you can further do:
patient[,mh := as.numeric(mh)]
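Both steps can equally be combined into a single assignment:
patient[, mh := as.integer(codes %in% code$codes)]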
EDIT: others have posted better answers. I like the %in% one from @moto myself. Much more concise, and much more efficient. Stick with those :)
This should do it. I've used a for loop, so you may well find something more efficient. I've also split the loop over a few lines, rather than squeezing it onto one, just so you can see what's happening:
for (row in 1:nrow(patient)) {
  codecheck <- patient$codes[row]
  output <- ifelse(sum(grepl(codecheck, code$codes)) > 0L, 1, 0)
  patient$new[row] <- output
}
This goes through patient one row at a time, checks for a match with grepl, then puts the result (1 for a match, 0 for no match) into a new column of patient.
Is that what you're after?
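If partial (substring) matching were actually the goal, rather than the exact matching %in% provides, one option is to collapse all the codes into a single alternation pattern. A sketch, with the caveat that any regex metacharacters in the codes would need escaping first:
pattern <- paste(code$codes, collapse = "|")
patient[, mh := grepl(pattern, codes)]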

R dropping NA's in logical column levels

I have a data frame which includes a corrupt row with NAs and "". I cannot remove this from the .csv file I am importing into R, since Excel cannot open a file of that size.
So I remove the corrupt row with a check like this after read.csv():
if (any(is.na(unique(data$A)))) {
  print("WARNING: data has a corrupt row in it!")
  data <- data[!is.na(data$A), ]
}
However, as if it were a factor, the A column still remembers NA as a level:
> summary(data$A)
Mode FALSE TRUE NA's
logical 185692 36978 0
This obviously causes issues when I am trying to fit a linear model. How can I get rid of the NA as a logical level here?
I tried this, but it doesn't seem to work:
A <- as.logical(droplevels(factor(data_combine$A)))
summary(A)
Mode FALSE TRUE NA's
logical 185692 36978 0
unique(A)
[1] FALSE TRUE
First, your data$A is not a factor; it's a logical vector. The summary print methods are not the same for factors and logicals: logicals use summary.default, while factors dispatch to summary.factor. Plus, the output itself tells you the variable is a logical.
fac <- factor(c(NA, letters[1:4]))
log <- c(NA, logical(4), !logical(2))
summary(fac)
# a b c d NA's
# 1 1 1 1 1
summary(log)
# Mode FALSE TRUE NA's
# logical 4 2 1
See ?summary for the differences.
Second, your call
A <- as.logical(droplevels(factor(data_combine$A)))
summary(A)
is also calling summary.default, because you wrapped droplevels in as.logical (why?). So don't change data_combine$A at all, and just try
summary(data_combine$A)
and see how that goes. For more information, please provide a sample of your data.
As mentioned in my other answer, those actually are not factor levels. But since you asked how to stop the NA column printing in summary, I'm undeleting this answer.
The NA printing is hard-coded into summary for logical vectors. Here's the relevant code from summary.default:
# value <- if (is.logical(object))
# c(Mode = "logical", {
# tb <- table(object, exclude = NULL)
# if (!is.null(n <- dimnames(tb)[[1L]]) && any(iN <- is.na(n)))
# dimnames(tb)[[1L]][iN] <- "NA's"
# tb
# })
The exclude = NULL passed to table is the problem. If we look at the exclude argument of table with a logical vector log, we can see that when it is NULL the NA count always prints:
log <- c(NA, logical(4), NA, !logical(2), NA)
table(log, exclude = NULL) ## with NA values
# log
# FALSE TRUE <NA>
# 4 2 3
table(log[!is.na(log)], exclude = NULL) ## NA values removed
#
# FALSE TRUE <NA>
# 4 2 0
To make your summary print the way you want it, we can write a summary method based on the original source code.
summary.logvec <- function(object, exclude = NA) {
  stopifnot(is.logical(object))
  value <- c(Mode = "logical", {
    tb <- table(object, exclude = exclude)
    # only relabel the NA count when exclude = NULL keeps it in the table
    if (is.null(exclude)) {
      if (!is.null(n <- dimnames(tb)[[1L]]) && any(iN <- is.na(n)))
        dimnames(tb)[[1L]][iN] <- "NA's"
    }
    tb
  })
  # printing dispatches to the existing summaryDefault method
  class(value) <- c("summaryDefault", "table")
  value
}
And then here are the results. Since exclude = NA is the default in our summary method, the NA count will not print unless we pass exclude = NULL:
summary(log) ## original vector
# Mode FALSE TRUE NA's
# logical 4 2 3
class(log) <- "logvec"
summary(log, exclude = NULL) ## prints NA when exclude = NULL
# Mode FALSE TRUE NA's
# logical 4 2 3
summary(log) ## NA's don't print
# Mode FALSE TRUE
# logical 4 2
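That said, if all you need is counts without the NA column, plain table() already drops NAs by default (its default exclude is c(NA, NaN)), which may make a custom summary method unnecessary:
x <- c(NA, FALSE, FALSE, TRUE)
table(x)
# x
# FALSE  TRUE
#     2     1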
Now that I've done all this I'm wondering if you have tried to run your linear model.

operate a custom loop inside ddply

My data set has about 54,000 rows. I want to set a value (First_Pass) to either T or F, depending both on a value in another column and on whether that column's value has been seen before. I have a for loop that does exactly what I need it to do. However, that loop only works on a subset of the data; I need the same loop run individually for different subsets defined by factor levels.
This seems like a perfect case for the plyr functions, as I want to split the data into subsets, apply a function (my for loop), and then rejoin the data. However, I cannot get it to work. First, here is a sample of the df, called char.data.
session_id list Sent_Order Sentence_ID Cond1 Cond2 Q_ID Was_y CI CI_Delta character tsle tsoc Direct
5139 2 b 9 25 rc su 25 correct 1 0 T 995 56 R
5140 2 b 9 25 rc su 25 correct 2 1 h 56 56 R
5141 2 b 9 25 rc su 25 correct 3 1 e 56 56 R
5142 2 b 9 25 rc su 25 correct 4 1 56 37 R
There is some clutter in there. The key columns are session_id, Sentence_ID, CI, and CI_Delta.
I then initialise a column called First_Pass to "F"
char.data$First_Pass <- "F"
I now want to calculate when First_Pass is actually "T" for each combination of session_id and Sentence_ID. I created a toy set, which is just one such subset, to work out the overall logic. Here's a for loop that gives me just what I want on the toy data:
char.data.toy$First_Pass <- "F"
l <- c(200)
for (i in 1:nrow(char.data.toy)) {
  # %nin% ("not in") comes from the Hmisc package
  if (char.data.toy[i, ]$CI_Delta >= 0 & char.data.toy[i, ]$CI %nin% l) {
    char.data.toy[i, ]$First_Pass <- "T"
    l <- c(l, char.data.toy[i, ]$CI)
  }
}
I now want to take this loop and run it for every session_id and Sentence_ID subset. I've created a function called set_fp and then called it inside ddply. Here is that code:
# define function
set_fp <- function(df) {
  l <- 200
  for (i in 1:nrow(df)) {
    if (df[i, ]$CI_Delta >= 0 & df[i, ]$CI %nin% l) {
      df[i, ]$First_Pass <- "T"
      l <- c(l, df[i, ]$CI)
    }
    else df[i, ]$First_Pass <- "F"
    return(df)
  }
}
char.data.fp <- ddply(char.data, c("session_id", "Sentence_ID"), function(df) set_fp(df))
Unfortunately, this is not quite right. For a long time I was getting all "F" values for First_Pass. Now I'm getting 24 "T" values, when it should be many more, so I suspect it's only keeping the last subset or something similar. Help?
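One thing worth checking before reaching for another tool: in set_fp the return(df) statement sits inside the for loop, so the function returns after processing only the first row of each subset, which would explain getting far too few "T" values. A corrected sketch of the same function (with Hmisc's %nin% rewritten as !(... %in% ...) to drop the dependency):
set_fp <- function(df) {
  l <- 200
  for (i in 1:nrow(df)) {
    if (df$CI_Delta[i] >= 0 && !(df$CI[i] %in% l)) {
      df$First_Pass[i] <- "T"
      l <- c(l, df$CI[i])
    } else {
      df$First_Pass[i] <- "F"
    }
  }
  df  # return once, after the whole subset has been processed
}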
This is a little hard to test with only the four rows you've provided. I created random data to check it, and it seems to work for me; try it on your data too.
This uses the data.table library and doesn't try to run a loop inside ddply. I'm assuming the means aren't important.
library(data.table)
dt <- data.table(char.data)  # the original data frame
l <- c(200)
# subsetting to keep only the important fields
dt <- dt[, list(session_id, Sentence_ID, CI, CI_Delta)]
# initialising First_Pass
dt[, First_Pass := 'F']
# The next two lines reword your logic: within each group of session_id and
# Sentence_ID, flag the duplicated CI entries. These are the values that would
# have been inserted into l; the first occurrence of each CI is marked FALSE
# because it wasn't yet in l when that row was being checked.
dt[CI_Delta >= 0, duplicatedCI := duplicated(CI), by = c("session_id", "Sentence_ID")]
# So if the CI value hasn't occurred before within the session_id, Sentence_ID
# group, and it doesn't appear in l, mark it as "T"
dt[!(CI %in% l) & !(duplicatedCI), First_Pass := "T"]
# Just for curiosity's sake, calculating l too
l <- c(l, dt[duplicatedCI == FALSE, CI])
