I have 2 data sets; one contains information on patients, and the other is a list of medical codes
library(data.table)

patient <- data.table(ID = rep(1:5, each = 3),
                      codes = c("13H42", "1B1U", "Eu410", "Je450", "Fg65", "Eu411", "Eu402", "B110", "Eu410", "Eu50",
                                "1B1U", "Eu513", "Eu531", "Eu411", "Eu608")
)
code <- data.table(codes = c("BG689", "13H42", "BG689", "Ju34K", "Eu402", "Eu410", "Eu50", "JE541", "1B1U",
                             "Eu411", "Fg605", "GT6TU"),
                   term = c(NA))
The code$term has values, but for this example they're omitted.
What I want is an indicator column mh in patient that shows, for each row, whether that code occurs anywhere in code$codes:
patient
ID codes mh
1: 1 13H42 TRUE
2: 1 1B1U TRUE
3: 1 Eu410 TRUE
4: 2 Je450 FALSE
5: 2 Fg65 FALSE
6: 2 Eu411 TRUE
7: 3 Eu402 TRUE
8: 3 B110 FALSE
9: 3 Eu410 TRUE
10: 4 Eu50 TRUE
11: 4 1B1U TRUE
12: 4 Eu513 FALSE
13: 5 Eu531 FALSE
14: 5 Eu411 TRUE
15: 5 Eu608 FALSE
My attempt was to use grepl:
patient$mh <- mapply(grepl, pattern=code$codes, x=patient$codes)
However, this didn't work because code isn't the same length as patient, and I got the warning:
Warning message:
In mapply(grepl, pattern = code$codes, x = patient$codes) :
longer argument not a multiple of length of shorter
Any solutions for an exact match?
You can do this:
patient[,mh := codes %in% code$codes]
Update:
As rightly suggested by Pasqui, for getting 0s and 1s, you can further do:
patient[,mh := as.numeric(mh)]
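Or combine both steps in one line (the same idea, just condensed):
patient[, mh := as.numeric(codes %in% code$codes)]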
EDIT: others have posted better answers. I like the %in% one from @moto myself. Much more concise, and much more efficient. Stick with those :)
This should do it. I've used a for loop, so you might figure something out that would be more efficient. I've also split the loop up into a few lines, rather than squeezing it into one. That's just so you can see what's happening:
# pre-allocate the new indicator column
patient$new <- NA_real_

for( row in 1:nrow(patient) ) {
  codecheck <- patient$codes[row]
  # anchor the pattern so the match is exact (e.g. "Fg65" should not match "Fg605")
  output <- ifelse( sum( grepl( paste0("^", codecheck, "$"), code$codes ) ) > 0L, 1, 0 )
  patient$new[row] <- output
}
So this just goes through the patient list one by one, checks for a match using grepl, then puts the result (1 for match, 0 for no match) back into the patient frame, as a new column.
Is that what you're after?
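For reference, the same grepl check can also be vectorised without the explicit loop (a sketch using the objects above; the pattern is anchored so that, for example, Fg65 does not match Fg605):
patient$new <- as.integer(sapply(patient$codes, function(p)
  any(grepl(paste0("^", p, "$"), code$codes))))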
I am struggling with how to combine information from two tables I have in R. I want to check whether a string of characters in one data frame is present in another data frame and, if it is, record the gene name for that string and append it to a new data frame.
Here's what I am working with:
df_repeats
sequence promoter_numbers promotors
1 AAAAAAAAAAAA 715 NA
2 AAAAAAAAAAAC 61 NA
3 AAAAAAAAAAAG 184 NA
df_promotors
gene promotor_coordinates sequence
1 Xkr4_1 range=chr1:3671549-36… GAGCTAGTTCTCTTTTCCCTGGTTACTAGCCATGTCCCTCCTCCCA…
2 Rp1_2 range=chr1:4360255-43… CACACACACACACACACACACACACATGTAACAATGAAACAAAAAG…
3 Rp1_1 range=chr1:4409254-44… AGGTATAACTTGGTAAAGACTTTGAAGTAAACAAGAACAAACAGCT…
I am trying to see which gene repeat sequences in df_repeats are present in the sequence column of df_promotors. My goal is to create a new data frame that I can use for some visualizations, something like the example below:
df_repeat_occurances
sequence promotor_numbers in_genes
1 AAAAAAAAAAAA 715 Rp1_2
2 AAAAAAAAAAAC 61 Xkr4_1, Rp1_2
3 AAAAAAAAAAAG 184 Xkr4_1
I tried to write a nested loop that searches for matches and, when one is found, appends the gene name to df_repeats in place of the NA (planning to rename the column later), but I am completely lost on how to do this, or on whether it's even a sensible way to combine the information from the two tables into one. Here's what I tried and couldn't get to work:
for (i in 1:nrow(df_repeats)) {
  x = df_repeats$sequence[i]
  for (j in 1:nrow(df_promotors)) {
    if (grepl(x, df_promotors$sequence[j])) {
      y = df_promotors$gene[j]
      df_repeats$sequence[i] = c(df_repeats$sequence[i], " ", y)
    }
  }
}
First time ever posting and asking for help, so any guidance or pointers would be greatly appreciated!!!
Welcome to SO. In the future, please include a reproducible example, as I did below, and use some meaningful names such as "result".
Also mark any acceptable answer as "accepted".
The best approach is to separate the different computation steps.
#First, define a reproducible example
sequences <- c("AAA", "BBB", "CCC", 'DDD')
promnb <- 1:4
result <- data.frame(sequences, promnb)
genes_names <- paste0("gene_", letters[1:4])
sequence <- c('BBB', 'ABC', 'AAA', 'AAA')
df_proms <- data.frame(genes_names, sequence)
# genes_names sequence
# 1 gene_a BBB
# 2 gene_b ABC
# 3 gene_c AAA
# 4 gene_d AAA
# 1: check in which genes each sequence is present using grepl
# sapply over the character vector result$sequences returns a logical matrix with one column per sequence:
in_genes <- sapply(result$sequences, function(x) grepl(x, df_proms$sequence))
# AAA BBB CCC DDD
# [1,] FALSE TRUE FALSE FALSE
# [2,] FALSE FALSE FALSE FALSE
# [3,] TRUE FALSE FALSE FALSE
# [4,] TRUE FALSE FALSE FALSE
#2: replace TRUE or FALSE by the names of the genes
in_genes_names <- data.frame(ifelse(in_genes, paste0(genes_names), ""))
#3: finally, paste each column of the last df to get all the names of the genes that contain this sequence
result$in_genes <- sapply(in_genes_names, paste, collapse = " ")
result$in_genes <- trimws(result$in_genes)
# By the way, you'd probably want to keep a list of the matches
# you can also include this list as a column of the result df
result$in_genes_list <- sapply(in_genes_names, list)
result
# sequences promnb in_genes in_genes_list
# 1 AAA 1 gene_c gene_d , , gene_c, gene_d
# 2 BBB 2 gene_a gene_a, , ,
# 3 CCC 3 , , ,
# 4 DDD 4 , , ,
You may try the following sapply loop -
df_repeats$in_genes <- sapply(df_repeats$sequence, function(x)
toString(df_promotors$gene[grepl(x, df_promotors$sequence)]))
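Applied to the reproducible toy data from the earlier answer (assumption: result and df_proms as defined there, with the column names adjusted to match), the same pattern gives:
toy_in_genes <- sapply(result$sequences, function(x)
  toString(df_proms$genes_names[grepl(x, df_proms$sequence)]))
# "gene_c, gene_d" for AAA, "gene_a" for BBB, "" for CCC and DDD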
I have a data frame that is effectively a list of storms: each row is one point of a storm, and one column indicates whether that point is over land. I'm able to work out which storms have made landfall, but I do not know how to select only those storms, i.e. create a new data frame of only those storms that have made landfall.
This code lets me know whether a storm has made landfall (grouping by ID, it sums the inregion column (1 or 0), and if the sum is greater than 0 the storm has made landfall):
land_tracks <- all_tracks[, sum(inregion) > 0, by = ID]
Gives me:
ID V1
1: 1987051906_15933 TRUE
2: 1987060118_16870 TRUE
3: 1987061306_18015 TRUE
4: 1987062100_18878 TRUE
5: 1987062918_19507 FALSE
6: 1987070512_20168 TRUE
7: 1987070812_20341 TRUE
8: 1987071218_20635 TRUE
9: 1987071412_20762 TRUE
10: 1987071606_20881 TRUE
How do I use this to go through all_tracks to find all the rows which match the ID where V1 == TRUE?
I regularly run into the issue that land_tracks has 41 rows while all_tracks has 1879 rows, and R complains about recycling.
Maybe you can do something like an INNER JOIN between those two tables:
merge(all_tracks, land_tracks[which(land_tracks$V1 == TRUE)], by = 'ID')
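If both objects are data.tables, you can also filter without a merge (a sketch, assuming all_tracks still contains the inregion column):
# keep only the rows of all_tracks whose ID was flagged TRUE in land_tracks
all_tracks[ID %in% land_tracks[V1 == TRUE, ID]]
# or skip the intermediate table entirely and filter by group
all_tracks[, if (sum(inregion) > 0) .SD, by = ID]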
I have the following rle object:
Run Length Encoding
lengths: int [1:189] 4 5 3 15 6 4 9 1 9 5 ...
values : logi [1:189] FALSE TRUE FALSE TRUE FALSE TRUE ...
I would like to find the average (mean) of the lengths where the corresponding item in values == TRUE (I'm not interested in the lengths where values == FALSE).
df <- data.frame(values = NoOfTradesAndLength$values, lengths = NoOfTradesAndLength$lengths)
AveLength <- aggregate(lengths ~ values, data = df, FUN = function(x) mean(x))
Which returns this:
values lengths
1 FALSE 7.694737
2 TRUE 5.287234
I can now obtain the length where values == TRUE, but is there a nicer way of doing this? Or perhaps could I achieve a similar result without using rle at all? It feels a bit fiddly converting from lists to a data frame, and I'm sure there is a clever one-line way of doing this. I've seen variants of this question come up before, but I wasn't able to come up with anything better from those, so your help is much appreciated.
rle returns a list of 'lengths' and 'values'. We can subset the 'lengths' using the 'values' as a logical index and take the mean:
with(NoOfTradesAndLength, mean(lengths[values]))
Using a reproducible example
set.seed(24)
NoOfTradesAndLength <- rle(sample(c(TRUE, FALSE), 25, replace=TRUE))
with(NoOfTradesAndLength, mean(lengths[values]))
#[1] 1.5
Using the OP's code
AveLength[2,]
# values lengths
#2 TRUE 1.5
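If you'd rather skip rle altogether, the runs can be labelled directly on the original logical vector (a sketch; x stands in for the vector that NoOfTradesAndLength was built from):
x <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)   # example stand-in
run_id  <- cumsum(c(TRUE, diff(x) != 0))             # consecutive equal values share one id
run_len <- tapply(x, run_id, length)                 # length of every run
mean(run_len[tapply(x, run_id, any)])                # average only the TRUE runs
#[1] 2.5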
I am trying to use data.table where my j function could and will return a different number of columns on each call. I would like it to behave like rbind.fill in that it fills any missing columns with NA.
fetch <- function(by) {
  if(by == 1)
    data.table(A=c("a"), B=c("b"))
  else
    data.table(B=c("b"))
}
data <- data.table(id=c(1,2))
result <- data[, fetch(.BY), by=id]
In this case 'result' should end up with two columns: A and B. 'A' and 'B' were returned by the first call to 'fetch', and only 'B' was returned by the second. I would like the example code to return this result:
id A B
1 1 a b
2 2 <NA> b
Unfortunately, when run I get this error.
Error in `[.data.table`(data, , fetch(.BY), by = id) :
j doesn't evaluate to the same number of columns for each group
I can do this with plyr as follows, but in my real world use case plyr is running out of memory. Each call to fetch occurs rather quickly, but the memory crash occurs when plyr tries to merge all of the data back together. I am trying to see if data.table might solve this problem for me.
result <- ddply(data, "id", fetch)
Any thoughts appreciated.
DWin's approach is good. Or you could return a list column instead, where each cell is itself a vector. That's generally a better way of handling variable length vectors.
DT = data.table(A=rep(1:3,1:3),B=1:6)
DT
A B
1: 1 1
2: 2 2
3: 2 3
4: 3 4
5: 3 5
6: 3 6
ans = DT[, list(list(B)), by=A]
ans
A V1
1: 1 1
2: 2 2,3 # V1 is a list column. These aren't strings, the
3: 3 4,5,6 # vectors just display with commas
ans$V1[3]
[[1]]
[1] 4 5 6
ans$V1[[3]]
[1] 4 5 6
ans[,sapply(V1,length)]
[1] 1 2 3
So in your example you could use this as follows:
library(plyr)
rbind.fill(data[, list(list(fetch(.BY))), by = id]$V1)
# A B
#1 a b
#2 <NA> b
Or, just make the returned list conformant:
allcols = c("A","B")
fetch <- function(by) {
  if(by == 1)
    list(A=c("a"), B=c("b"))[allcols]
  else
    list(B=c("b"))[allcols]
}
Here are two approaches. The first roughly follows your strategy:
data[,list(A=if(.BY==1) 'a' else NA_character_,B='b'), by=id]
And the second does things in two steps:
DT <- copy(data)[,`:=`(A=NA_character_,B='b')][id==1,A:='a']
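This should produce the layout the question asked for; something like:
DT[]   # a trailing [] forces printing after a := chain
#    id    A B
# 1:  1    a b
# 2:  2 <NA> b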
Using a by just to check for a single value seems wasteful (maybe computationally, but also in terms of clarity); of course, it could be that your application isn't really like that.
Try
data.table(A=NA, B=c("b"))
@NickAllen: I'm not sure from the comments whether you understood my suggestion. (I was posting from a mobile phone that limited my cut-paste capabilities, and I suspect my wife was telling me to stop texting to SO or she would divorce me.) What I meant was this:
fetch <- function(by) {
  if(by == 1)
    data.table(A=c("a"), B=c("b"))
  else
    data.table(A=NA, B=c("b"))
}
data <- data.table(id=c(1,2))
result <- data[, fetch(.BY), by=id]
I have a data.table with 11 variables and 200,000+ rows. I am trying to find the unique identifier (in other words, key) in this data.table.
I am looking for something like isid in Stata, which checks whether the specified variables uniquely identify the observations. Can someone please help?
This doesn't exactly answer the OP's question [I haven't used data.table yet], but it may help R-only users answer it. My focus will be on explaining how isid actually works in Stata. I use data from an R package (you need to install optmatch for this data set).
library(optmatch)
data(nuclearplants)
sample<-nuclearplants
I am focusing only on a subset of the data frame, since my goal is just to explain what isid is doing:
sample<-sample[,c(1,2,5,10)]
head(sample,5)
cost date cap cum.n
H 460.05 68.58 687 14
I 452.99 67.33 1065 1
A 443.22 67.33 1065 1
J 652.32 68.00 1065 12
B 642.23 68.00 1065 12
Now, when I use the Stata command isid cost, it doesn't display anything, which means there are no duplicate observations on cost. The R equivalents are unique(sample$cost) (no repeated values) or sample[duplicated(sample$cost), ], which returns an empty data frame:
[1] cost date cap cum.n
<0 rows> (or 0-length row.names)
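A more compact way to run the same check (a sketch; anyDuplicated() is base R and returns 0 when there are no duplicates):
anyDuplicated(sample$cost)                # 0: every value of cost occurs exactly once
nrow(sample[duplicated(sample$cost), ])   # 0 rows are flagged as duplicates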
However, when we use isid date, i.e. on the date variable, Stata reports that it is not unique. Alternatively, if you run duplicates example date, Stata will list the duplicate observations as follows:
. duplicates example date
Duplicates in terms of date
+-------------------------------+
| group: # e.g. obs date |
|-------------------------------|
| 1 2 27 67.25 |
| 2 2 2 67.33 |
| 3 3 29 67.83 |
| 4 2 4 68 |
| 5 5 8 68.42 |
|-------------------------------|
| 6 2 1 68.58 |
| 7 2 12 68.75 |
| 8 3 14 68.92 |
+-------------------------------+
To interpret the output: the date value 67.25 occurs twice (as indicated by the # column). The first of these corresponds to row 27 (the row number of the second duplicate is not shown). The group column gives a unique identifier for each set of repeated values.
The R command for the same check is duplicated(sample$date):
duplicated(sample$date)
[1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
[22] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
To identify the unique observations we can also use unique(sample$date) in R.
We can do the same for two variables with isid cost date. Again, Stata doesn't identify any duplicate observations across the two variables. The same is true when you use unique(sample[, c(1, 2)]) in R.
Again, if I run isid (or duplicates example) on all four variables, Stata says the combination is unique (no warnings):
duplicates example cost date cap cum_n
Duplicates in terms of cost date cap cum_n
(0 observations are duplicates)
The same with unique(sample) in R.
Conclusion: I therefore think that as long as one variable is unique (i.e. it has no duplicate observations), any combination of variables that includes the unique variable should always be unique. Please correct me if I am wrong.
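A quick sanity check of that claim with made-up data (a sketch; 'a' is unique on its own, 'b' is not):
x <- data.frame(a = 1:5, b = c(1, 1, 2, 2, 3))
anyDuplicated(x$a)              # 0: 'a' alone uniquely identifies the rows
anyDuplicated(x[, c("a", "b")]) # 0: any combination containing 'a' is also unique
anyDuplicated(x$b)              # 2: 'b' alone is not unique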
I think you are confused on a few points about data.tables and keys.
A data.table will not have a key unless you explicitly set it.
A data.table key does not have to be unique.
You can write a function that will check if certain columns could create a unique identifier for a dataset.
I've used data.table here, and have taken care to use unique on an unkeyed copy of the data.table.
This is not efficient.
isid <- function(columns, data, verbose = TRUE){
  if(!is.data.table(data)){
    copyd <- data.table(data)
  } else {
    copyd <- copy(data)
  }
  if(haskey(copyd)){
    setkey(copyd, NULL)
  }
  # NA values don't work in keys for data.tables
  any.NA <- Filter(columns, f = function(x) any(is.na(copyd[[x]])))
  if(verbose){
    for(aa in seq_along(any.NA)){message(sprintf('Column %s contains NA values', any.NA[aa]))}
  }
  validCols <- setdiff(columns, any.NA)
  # cycle through combinations, starting with 1 column at a time
  ncol <- 1L
  validKey <- FALSE
  while(!isTRUE(validKey) && ncol <= length(validCols)){
    anyValid <- combn(x = validCols, m = ncol, FUN = function(xn){
      subd <- copyd[, ..xn]
      result <- nrow(subd) == nrow(unique(subd))
      list(cols = xn, valid = result)
    }, simplify = FALSE)
    whichValid <- sapply(anyValid, `[[`, 'valid')
    validKey <- any(whichValid)
    ncol <- ncol + 1L
  }
  if(!validKey){
    warning('No combinations are unique')
    return(NULL)
  } else {
    valid.combinations <- lapply(anyValid, `[[`, 'cols')[whichValid]
    if(length(valid.combinations) > 1){
      warning('More than one combination valid, returning the first only')
    }
    return(valid.combinations[[1]])
  }
}
Some examples in use:
oneU <- data.table(a = c(2,1,2,2), b = c(1,2,3,4))
twoU <- data.table(a = 1:4, b = letters[1:4])
bothU <- data.table(a = letters[1:2], b = rep(letters[1:2], each = 2))
someNA <- data.table(a = c(1,2,3,NA), b = 1:4)
isid(names(oneU), oneU)
# [1] "b"
isid(names(twoU), twoU)
# [1] "a"
# Warning message:
# In isid(names(twoU), twoU) :
# More than one combination valid, returning the first only
isid(names(bothU), bothU)
# [1] "a" "b"
isid(names(someNA), someNA)
# Column a contains NA values
# [1] "b"
# examples with no valid identifiers
isid('a', someNA)
## Column a contains NA values
## NULL
## Warning message:
## In isid("a", someNA) : No combinations are unique
isid('a', oneU)
## NULL
## Warning message:
## In isid("a", oneU) : No combinations are unique