Quite new to R...
I loaded a file with 13458 observations containing a time and a value. I ran it through a program which detects homologue series. The output is a large list with 6 elements, including values IDed by the row number in the original file.
I would like to export the original file with values detected by the program marked somehow so I can easily identify them in Excel. Hopefully that makes some sense.
My dataframe looks like this and I'm using the m.z and RT values:
m.z dummy RT
1 151.0092 255975.8 15.043
2 151.0092 110111.7 15.456
3 151.0092 108958.1 15.243
4 151.0093 3258343.0 14.620
5 151.0127 107255.9 6.336
My output contains a list of related series and looks like this:
[359] "3518,4779,5929,6975,8032,9051,9825"
[360] "5927,6977,8036,9052,9824,10507,11043"
I would like a data frame that lets me know if a value has been identified, like this:
m.z dummy RT homologue
3518 459.2006 255975.8 15.043 TRUE
3519 459.2120 110111.7 15.456 FALSE
3520 459.2159 108958.1 15.243 FALSE
Thanks!
Here is an attempt.
Your MS data:
DF <- read.table(text="m.z dummy RT
1 151.0092 255975.8 15.043
2 151.0092 110111.7 15.456
3 151.0092 108958.1 15.243
4 151.0093 3258343.0 14.620
5 151.0127 107255.9 6.336", header = T)
the script output:
vec <- c("1,3,5", "3,5") #from your example looks like a vector of strings with numbers separated by a comma
As I understand it, you would like to label rows in DF with TRUE/FALSE depending on whether they appear anywhere in vec?
DF$homologue <- ifelse(row.names(DF) %in% as.numeric(unlist(strsplit(unlist(vec), ","))), T, F)
explanation:
unlist(vec) #in case it is a list and not a vector
strsplit(unlist(vec), ",") #split strings at "," returning a list
unlist(str... #convert that list into a vector
as.numeric(unlist(str... #convert to numeric
if any row names of DF are in vec they will be labeled T and if not F
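To see what those intermediate steps return on the toy vec above:
strsplit(unlist(vec), ",")                      # list of character vectors: "1" "3" "5" and "3" "5"
unlist(strsplit(unlist(vec), ","))              # "1" "3" "5" "3" "5"
as.numeric(unlist(strsplit(unlist(vec), ","))) # 1 3 5 3 5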
DF
m.z dummy RT homologue
1 151.0092 255975.8 15.043 TRUE
2 151.0092 110111.7 15.456 FALSE
3 151.0092 108958.1 15.243 TRUE
4 151.0093 3258343.0 14.620 FALSE
5 151.0127 107255.9 6.336 TRUE
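Side note: since %in% already returns TRUE/FALSE, the ifelse() wrapper is optional; an equivalent version (untested on the full 13458-row file) would be:
DF$homologue <- as.numeric(row.names(DF)) %in% as.numeric(unlist(strsplit(unlist(vec), ",")))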
I have a question about searching for values in R. It is actually a bit similar to a question which was posted yesterday (as given over here: Searching a vector/data table backwards in R), except I think my problem is a bit more complicated (and also the opposite of what I want to do), and since I'm very new to R I'm not too sure how to solve this problem.
I have a data frame similar to the one given below, and I wish to find a previous index value (relative to my current one) where the Times column is different from my current time and the Midquote column does not have an NA value.
Index Times | Midquote
-----------------------------
1 10:30:45.58 | 5.319
2 10:30:45.93 | 5.323
3 10:30:45.104 | 5.325
4 10:30:45.127 | 5.322
5 10:30:45.188 | 5.325
6 10:30:45.188 | NA
7 10:30:45.212 | NA
8 10:30:45.231 | 5.321
9 10:30:45.231 | 5.321
If we start at the bottom of the data frame and take this to be the 'current' time, we are at index 9, which has a Times value of 10:30:45.231 and a Midquote value of 5.321. If I then want to find the first earlier index where the time is different from my current time, this is index 7, which has a time of 10:30:45.212 (since index 8 has the same time). But at index 7 the Midquote value is NA, so I have to check the data frame again. Index 6 also has a different time (10:30:45.188) but again an NA value in the Midquote column, so moving up once more to index 5 we see that the Times column has a different time from my current time (10:30:45.188 again) and that the Midquote value is 5.325.
Therefore, since at index 5 the time is 10:30:45.188 (which is different from my current time of 10:30:45.231) and the Midquote value at index 5 is not NA, I wish to obtain the output '5', since it is the index value which fulfills both criteria.
My question is, is there a good way of doing this? I am sorry if this is an easy question, I am very new to R and I don't know much about working with data frames...
EDIT: I would also like to do it preferably without adding another column to the data frame (as is given in the top answer of the link I mentioned above), if that is possible
Working with dates is tough, especially with fractional seconds.
If you could convert the times to doubles, it would be easier to work with.
Assuming your 'Times' are in order, you could use this:
library(magrittr)
which(df$Times < df[9,1] & !is.na(df$Midquote)) %>% max()
The which() gives a vector of the 'Index' values where 'Times' is less than that of row 9 AND 'Midquote' is not NA. The %>% sends that vector to max(), which gives the highest value. This is pretty inelegant, but will get the job done.
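For reference, a reproducible version of the example data (assuming Times is stored as character and Midquote as numeric), which this answer and the following ones can be run against:
df <- data.frame(
  Times = c("10:30:45.58", "10:30:45.93", "10:30:45.104", "10:30:45.127",
            "10:30:45.188", "10:30:45.188", "10:30:45.212", "10:30:45.231",
            "10:30:45.231"),
  Midquote = c(5.319, 5.323, 5.325, 5.322, 5.325, NA, NA, 5.321, 5.321),
  stringsAsFactors = FALSE
)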
If I understood it correctly, please check if this is the output you are expecting.
ind <- function(t, df){
  ind <- t
  while(t > 1){
    t = t - 1
    if((df$Times[t] != df$Times[ind]) && (!is.na(df$Midquote[t]))){
      return(t)
    }
  }
}
sapply(nrow(df):1, FUN = ind, df)
#[[1]]
#[1] 5
#[[2]]
#[1] 5
#[[3]]
#[1] 5
#[[4]]
#[1] 4
#[[5]]
#[1] 4
#[[6]]
#[1] 3
#[[7]]
#[1] 2
#[[8]]
#[1] 1
#[[9]]
#NULL
The output series corresponds to the associated index for your data.frame starting from the last row.
Explanation: ind takes the row number of the current row, while t starts at ind - 1 and counts down to 1; df takes the entire data.frame as input. The while loop then checks whether df$Times[t] and df$Midquote[t] satisfy the required conditions. If they do, the index t is returned; otherwise the loop continues until it reaches the first row.
Without using sapply for a particular current row:
ind(9,df)
[1] 5
A data.table solution, in one line.
library(data.table)
dt <- data.table(
  Index = 1:9,
  Times = c('10:30:45.58', '10:30:45.93', '10:30:45.104', '10:30:45.127',
            '10:30:45.188', '10:30:45.188', '10:30:45.212', '10:30:45.231',
            '10:30:45.231'),
  Midquote = c('5.319', '5.323', '5.325', '5.322', '5.325', NA, NA, '5.321', '5.321')
)
> dt[ Times != Times[.N] & !is.na(Midquote), max(Index) ]
[1] 5
EDIT
To remove the Index column you have (at least) two options
dt2 <- data.table(
  Times = c('10:30:45.58', '10:30:45.93', '10:30:45.104', '10:30:45.127',
            '10:30:45.188', '10:30:45.188', '10:30:45.212', '10:30:45.231',
            '10:30:45.231'),
  Midquote = c('5.319', '5.323', '5.325', '5.322', '5.325', NA, NA, '5.321', '5.321')
)
# Option 1 - create an id column on the fly (unfortunately data.table recalculates .I after evaluating the "where" clause, so you need to save it)
dt2[, cbind(.SD, id=.I)][ Times != Times[.N] & !is.na(Midquote), max(id) ]
# Option 2 - simply check the last position of where your condition is met
dt2[, max(which(Times != Times[.N] & !is.na(Midquote))) ]
NB You can't do nrow because you can have, say, the 1st, 2nd, and 4th records matching your condition, and nrow would give you 3, which is wrong because the 3rd row does not match.
EDIT 2 (option 3 is not correct)
dt3 <- data.table(
  Times = c('10:30:45.58', '10:30:45.93', '10:30:45.104', '10:30:45.127',
            '10:30:45.188', '10:30:45.188', '10:30:45.212', '10:30:45.231',
            '10:30:45.231'),
  Midquote = c('5.319', '5.323', NA, '5.322', '5.325', NA, NA, '5.321', '5.321')
)
# Option 1 - create an id column on the fly (unfortunately data.table recalculates .I after evaluating the "where" clause, so you need to save it)
dt3[, cbind(.SD, id=.I)][ Times != Times[.N] & !is.na(Midquote), max(id) ]
[1] 5
# Option 2 - simply check the last position of where your condition is met
dt3[, max(which(Times != Times[.N] & !is.na(Midquote))) ]
[1] 5
# Option 3 - good luck with this
nrow(dt3[Times != Times[.N] & !is.na(Midquote)])
[1] 4
I have 2 data sets; one contains information on patients, and the other is a list of medical codes
patient <- data.table(
  ID = rep(1:5, each = 3),
  codes = c("13H42", "1B1U", "Eu410", "Je450", "Fg65", "Eu411", "Eu402", "B110", "Eu410", "Eu50",
            "1B1U", "Eu513", "Eu531", "Eu411", "Eu608")
)
code <- data.table(
  codes = c("BG689", "13H42", "BG689", "Ju34K", "Eu402", "Eu410", "Eu50", "JE541", "1B1U",
            "Eu411", "Fg605", "GT6TU"),
  term = c(NA)
)
The code$term has values, but for this example they're omitted.
What I want is an indicator column in patient that shows 1 if a code in code occurs in patient$codes.
patient
ID codes mh
1: 1 13H42 TRUE
2: 1 1B1U TRUE
3: 1 Eu410 TRUE
4: 2 Je450 FALSE
5: 2 Fg65 FALSE
6: 2 Eu411 TRUE
7: 3 Eu402 TRUE
8: 3 B110 FALSE
9: 3 Eu410 TRUE
10: 4 Eu50 TRUE
11: 4 1B1U TRUE
12: 4 Eu513 FALSE
13: 5 Eu531 FALSE
14: 5 Eu411 TRUE
15: 5 Eu608 FALSE
My solution was to use grepl:
patient$mh <- mapply(grepl, pattern=code$codes, x=patient$codes)
However, this didn't work since code isn't the same length as patient, and I got the warning:
Warning message:
In mapply(grepl, pattern = code$codes, x = patient$codes) :
longer argument not a multiple of length of shorter
Any solutions for an exact match?
You can do this:
patient[,mh := codes %in% code$codes]
Update:
As rightly suggested by Pasqui, to get 0s and 1s you can further do:
patient[,mh := as.numeric(mh)]
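Or, combining both steps into a single call (this should be equivalent to the two lines above):
patient[, mh := as.numeric(codes %in% code$codes)]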
EDIT: Others have posted better answers. I like the %in% one from @moto myself. Much more concise, and much more efficient. Stick with those :)
This should do it. I've used a for loop, so you might figure something out that would be more efficient. I've also split the loop up into a few lines, rather than squeezing it into one. That's just so you can see what's happening:
for(row in 1:nrow(patient)) {
  codecheck <- patient$codes[row]
  output <- ifelse(sum(grepl(codecheck, code$codes)) > 0L, 1, 0)
  patient$new[row] <- output
}
So this just goes through the patient list one by one, checks for a match using grepl, then puts the result (1 for match, 0 for no match) back into the patient frame, as a new column.
Is that what you're after?
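A loop-free variant of the same idea (just a sketch; fixed = TRUE tells grepl to treat each code as a literal string rather than a regular expression):
patient$new <- as.integer(sapply(patient$codes,
                                 function(x) any(grepl(x, code$codes, fixed = TRUE))))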
I have two dataframes df.o and df.m as defined below. I need to find which observation in df.o (dimension table) corresponds to which observations in df.m (fact table) based on two criteria: 1) df.o$Var1 == df.m$Var1 and 2) df.o$date1 < df.m$date2 < df.o$date3, such that I get the correct value of df.o$oID in df.m$oID (the correct value is manually entered in df.m$CORRECToID). I need the ID to complete a merge afterwards.
df.o <- data.frame(oID = 1:4,
                   Var1 = c("a", "a", "b", "c"),
                   date3 = c(2015, 2011, 2014, 2015),
                   date1 = c(2013, 2009, 2012, 2013),
                   stringsAsFactors = FALSE)

df.m <- data.frame(mID = 1:3,
                   Var1 = c("a", "a", "b"),
                   date2 = c(2014, 2010, 2013),
                   oID = NA,
                   CORRECToID = c(1, 2, 3),
                   points = c(5, 10, 15),
                   stringsAsFactors = FALSE)
I have tried various combinations like the code below, but without luck:
df.m$oID[df.m$date2 < df.o$date3 & df.m$date2 > df.o$date1 & df.o$Var1==df.m$Var1] <- df.o$oID
I have also tried experimenting with various combinations of ifelse, which and match, but none seem to do the trick.
The problem I keep encountering is that my replacement has a different number of rows than the data, and that "longer object length is not a multiple of shorter object length".
What you are looking for is called an "overlap join", you could try the data.table::foverlaps function in order to achieve this.
The idea is simple:
1. Create the columns to overlap on (add an additional column to df.m)
2. Key by these columns
3. Run foverlaps and select the columns you want back
library(data.table)
setkey(setDT(df.m)[, date4 := date2], Var1, date2, date4)
setkey(setDT(df.o), Var1, date1, date3)
foverlaps(df.m, df.o)[, names(df.m), with = FALSE]
# mID Var1 date2 oID CORRECToID points date4
# 1: 2 a 2010 2 2 10 2010
# 2: 1 a 2014 1 1 5 2014
# 3: 3 b 2013 3 3 15 2013
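If the only goal is to fill df.m$oID for the later merge, the matched IDs can also be written straight back (a sketch; it assumes each row of df.m overlaps exactly one interval in df.o, so the foverlaps result lines up row for row with the keyed df.m):
# run after the two setkey() calls above
df.m[, oID := foverlaps(df.m, df.o)$oID]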
I have a data.table with 11 variables and 200,000+ rows. I am trying to find the unique identifier (in other words, key) in this data.table.
I am looking for something like isid in Stata, which checks whether the specified variables uniquely identify the observations. Can someone please help?
This doesn't exactly answer the OP's question [I haven't used data.table yet], but it should help R-only users answer it. My focus will be on explaining how isid actually works in Stata. I use data from an R package (you need to install optmatch for this data).
library(optmatch)
data(nuclearplants)
sample<-nuclearplants
I am focusing only on a subset of the data frame, since my goal is only to explain what isid is doing:
sample<-sample[,c(1,2,5,10)]
head(sample,5)
cost date cap cum.n
H 460.05 68.58 687 14
I 452.99 67.33 1065 1
A 443.22 67.33 1065 1
J 652.32 68.00 1065 12
B 642.23 68.00 1065 12
Now, when I use the Stata command isid cost, it doesn't display anything, which means there are no duplicate observations on cost. The R equivalent is unique(sample$cost), or sample[duplicated(sample), ], which returns:
[1] cost date cap cum.n
<0 rows> (or 0-length row.names)
However, when we use isid date, i.e. on the date variable, Stata reports that it is not unique. Alternatively, if you run duplicates examples date, Stata will give you the duplicate observations as follows:
. duplicates example date
Duplicates in terms of date
+-------------------------------+
| group: # e.g. obs date |
|-------------------------------|
| 1 2 27 67.25 |
| 2 2 2 67.33 |
| 3 3 29 67.83 |
| 4 2 4 68 |
| 5 5 8 68.42 |
|-------------------------------|
| 6 2 1 68.58 |
| 7 2 12 68.75 |
| 8 3 14 68.92 |
+-------------------------------+
To interpret the output, it is saying that the value 67.25 has two repeated observations (as indicated by #). The first of these corresponds to row 27 (it doesn't identify the row number of the second duplicate of 67.25). group gives a unique identifier for each set of repeated values.
R command for the same is duplicated(sample$date).
duplicated(sample$date)
[1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
[22] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
To identify the unique observations we can also use unique(sample$date) in R.
We can do the same for two variables: isid cost date. Again, Stata doesn't identify any duplicate observations across the two variables. The same is true when you use unique(sample[, c(1, 2)]) in R.
Again, if I run isid on all four variables, Stata says that it is unique (no warnings).
duplicates example cost date cap cum_n
Duplicates in terms of cost date cap cum_n
(0 observations are duplicates)
The same with unique(sample) in R.
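Putting the R side together, a rough base-R analogue of isid (just a sketch; the is_id name is only for illustration, and it assumes the data sit in a plain data frame) is to test whether a given set of columns contains any duplicated combinations:
# TRUE if the given columns uniquely identify every row
is_id <- function(df, cols) anyDuplicated(df[, cols, drop = FALSE]) == 0

is_id(sample, "cost")             # TRUE  - cost alone has no duplicates
is_id(sample, "date")             # FALSE - date has repeated values
is_id(sample, c("cost", "date"))  # TRUE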
Conclusion: I therefore think that as long as one variable is unique (i.e. it has no duplicate observations), any combination of variables which includes that unique variable should always be unique as well. Please correct me if I am wrong.
I think you are confused on a few points about data.tables and keys.
A data.table will not have a key unless you explicitly set it.
A data.table key does not have to be unique.
You can write a function that will check whether certain columns could create a unique identifier for a dataset.
I've used data.table here, and have taken care to use unique on an unkeyed copy of the data.table.
This is not efficient.
isid <- function(columns, data, verbose = TRUE){
  if(!is.data.table(data)){
    copyd <- data.table(data)
  } else {
    copyd <- copy(data)
  }
  if(haskey(copyd)){
    setkey(copyd, NULL)
  }
  # NA values don't work in keys for data.tables
  any.NA <- Filter(columns, f = function(x) any(is.na(copyd[[x]])))
  if(verbose){
    for(aa in seq_along(any.NA)){
      message(sprintf('Column %s contains NA values', any.NA[aa]))
    }
  }
  validCols <- setdiff(columns, any.NA)
  # cycle through columns 1 at a time
  ncol <- 1L
  validKey <- FALSE
  while(!isTRUE(validKey) && ncol <= length(validCols)){
    anyValid <- combn(x = validCols, m = ncol, FUN = function(xn){
      subd <- copyd[, ..xn]
      result <- nrow(subd) == nrow(unique(subd))
      list(cols = xn, valid = result)
    }, simplify = FALSE)
    whichValid <- sapply(anyValid, `[[`, 'valid')
    validKey <- any(whichValid)
    ncol <- ncol + 1L
  }
  if(!validKey){
    warning('No combinations are unique')
    return(NULL)
  } else {
    valid.combinations <- lapply(anyValid, `[[`, 'cols')[whichValid]
    if(length(valid.combinations) > 1){
      warning('More than one combination valid, returning the first only')
    }
    return(valid.combinations[[1]])
  }
}
Some examples in use:
oneU <- data.table(a = c(2,1,2,2), b = c(1,2,3,4))
twoU <- data.table(a = 1:4, b = letters[1:4])
bothU <- data.table(a = letters[1:2], b = rep(letters[1:2], each = 2))
someNA <- data.table(a = c(1,2,3,NA), b = 1:4)
isid(names(oneU), oneU)
# [1] "b"
isid(names(twoU), twoU)
# [1] "a"
# Warning message:
# In isid(names(twoU), twoU) :
# More than one combination valid, returning the first only
isid(names(bothU), bothU)
# [1] "a" "b"
isid(names(someNA), someNA)
# Column a contains NA values
# [1] "b"
# examples with no valid identifiers
isid('a', someNA)
## Column a contains NA values
## NULL
## Warning message:
## In isid("a", someNA) : No combinations are unique
isid('a', oneU)
## NULL
## Warning message:
## In isid("a", oneU) : No combinations are unique