I am currently using the 'agrep' function with 'lapply' in a data.table code to link entries from a user-provided VIN# list to a DMV VIN# database. Please see the following two links for all data/code so far:
Accelerate performance and speed of string match in R
Imperfect string match using data.table in R
Is there a way to extract the "best" match from my list that is being generated by:
dt <- dt[lapply(car.vins, function(x) agrep(x,vin.vins, max.distance=c(cost=2, all=2), value=T)), list(NumTimesFound=.N), vin.names]
because as of now, the 'agrep' function gives me multiple matches, even after a lot of tweaking of the cost, all, substitution, etc. arguments.
I have also tried using the 'adist' function instead of 'agrep', but because 'adist' does not have an option for value=TRUE like 'agrep', it throws the same error
Error in `[.data.table`(dt, lapply(vin.vins, function(x) agrep(x,car.vins, :
x.'vin.vins' is a character column being joined to i.'V1' which is type 'integer'.
Character columns must join to factor or character columns.
that I was receiving with the 'agrep' before.
Is there perhaps some other package I could use?
Thanks!
Tom, this isn't strictly a data.table problem. Also, it's hard to know exactly what you want without having the data you are using. I tried to figure out what you want, and I came up with this solution:
vin.match <- vapply(car.vins, function(x) which.min(adist(x, vin.vins)), integer(1L))
data.frame(car.vins, vin.vins=vin.vins[vin.match], vin.names=vin.names[vin.match])
# car.vins vin.vins vin.names
# 1 abcdekl abcdef NAME1
# 2 abcdeF abcdef NAME1
# 3 laskdjg laskdjf NAME2
# 4 blerghk blerghk NAME3
And here is the data:
vin.vins <- c("abcdef", "laskdjf", "blerghk")
vin.names <- paste0("NAME", 1:length(vin.vins))
car.vins <- c("abcdekl", "abcdeF", "laskdjg", "blerghk")
This will find the closest match for every value in car.vins in vin.vins, as per adist. I'm not sure data.table is needed for this particular step. If you provide your actual data (or a representative sample), then I can provide a more targeted answer.
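To keep the distance alongside each best match (for instance, to reject matches that are too far off), one option is to compute the full distance matrix once. This is only a sketch on the toy data above: the cutoff max.dist and the names d, best, best.dist and matched are assumptions for illustration, not part of the original question, and which.min silently takes the first candidate when several are tied.

```r
# Toy data from the answer above
vin.vins  <- c("abcdef", "laskdjf", "blerghk")
vin.names <- paste0("NAME", seq_along(vin.vins))
car.vins  <- c("abcdekl", "abcdeF", "laskdjg", "blerghk")

# One call gives a length(car.vins) x length(vin.vins) matrix of edit distances
d <- adist(car.vins, vin.vins)

# Column index of the closest database VIN for each car VIN (first one on ties)
best <- unname(apply(d, 1, which.min))

# Edit distance of each chosen match
best.dist <- d[cbind(seq_along(best), best)]

# Hypothetical cutoff: anything further away than this counts as "no match"
max.dist <- 2
matched <- ifelse(best.dist <= max.dist, vin.vins[best], NA_character_)

result <- data.frame(
  car.vins,
  vin.vins  = matched,
  vin.names = ifelse(is.na(matched), NA_character_, vin.names[best]),
  distance  = best.dist
)
```

With a real VIN database the matrix has one cell per pair, so for large inputs it may be necessary to process car.vins in chunks.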
I am trying to match DNA sequences within a single column. For each sequence, I want to find the rows that contain a longer version of it; the column also contains the sequence itself.
I am trying to use str_which, which I know works: if I type the search pattern in manually, it finds the rows that include the sequence.
As a preview of the data I have:
SNID type seqs2
9584818 seqs TCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAA
9584818 reversed TTCTGTGGCCCCAGGAGCAGAATCTACAAGTTCATTGCACTCACATATAAGTAGATTCCTYAGATGAAGAAGTTTCTTTGGTAGGTTCCCTTTCAGCTTGGGACAGTGTCTTAAAGAAAGA
9562505 seqs GTCTTCAGCATCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAACTTTGTGAAT
9562505 reversed ATTCACAAAGTTCTGTGGCCCCAGGAGCAGAATCTACAAGTTCATTGCACTCACATATAAGTAGATTCCTYAGATGAAGAAGTTTCTTTGGTAGGTTCCCTTTCAGCTTGGGACAGTGTCTTAAAGAAAGATGCTGAAGAC
Using a simple search of row one as x
x <- "TCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAA"
str_which(df$seqs2, x)
I get the answer I expect:
> str_which(df$seqs2, x)
[1] 1 3
But when I pass the whole column as the pattern, each row only finds itself, and not the other rows that also contain its sequence.
> str_which(df$seqs2, df$seqs2)
[1] 1 2 3 4
Since my data set is quite large, I do not want to do this manually, and rather use the column as input, and not just state "x" first.
Does anybody have an idea how to solve this? I have tried most stringr functions by now, but I may have used them incorrectly or skipped some important ones.
Thanks in advance
You may need lapply :
lapply(df$seqs2, function(x) stringr::str_which(df$seqs2, x))
You can also use grep to keep this in base R :
lapply(df$seqs2, function(x) grep(x, df$seqs2))
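A possible refinement, sketched here on made-up toy sequences rather than the real data: drop each row's match with itself, and match the sequence as a literal string instead of a regular expression (fixed = TRUE in grep; with stringr the analogous call would be str_which(df$seqs2, fixed(x))). The vector seqs and the list hits are hypothetical names.

```r
# Hypothetical toy sequences: rows 3 and 4 contain rows 1 and 2 respectively
seqs <- c("ACGT", "TGCA", "TTACGTAA", "CCTGCAGG")

hits <- lapply(seq_along(seqs), function(i) {
  # Rows whose sequence contains seqs[i] as a literal substring
  idx <- grep(seqs[i], seqs, fixed = TRUE)
  # Remove the trivial match of the row with itself
  setdiff(idx, i)
})
names(hits) <- seqs

# Rows that are contained in at least one other, longer row
contained <- which(lengths(hits) > 0)
```

Here hits[["ACGT"]] holds the index of the longer row that contains it, and rows with no longer version get an empty integer vector.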
If I use this simple data.table (one column)
mydata <- data.table(A=c("ID123", "ID22", "AAA", NA))
I can find the position of the rows starting with "ID"
grep("^ID", mydata[,A])
How can I get the same result using a column number instead (say I want the first column)?
I've tried
grep("^ID", mydata[,1, with=F])
but it doesn't work.
And more important, I would like to do it in the data.table way, introducing the command inside the brackets.
mydata[,grep("^ID",.SD), .SDcols=1]
But this doesn't work.
I've found this way, but it's too convoluted
mydata[,lapply(.SD, grep,pattern="ID"), .SDcols=1]
What's the proper way to do it?
A little bit more complex:
What if I want to count simultaneously how many rows are not NA and start with "ID"?
Something like
any(!(grepl("^ID", mydata[,A] ) | is.na(mydata[,A])))
but more compact and inside the brackets.
I don't like the fact that grep treats NA as not matching instead of outputting NA too.
Don't forget that a data.table is a list, too. So if you really just want an entire column as a vector, then it is encouraged to simply use base R methods on it: [[ and $.
mydata <- data.table(A=c("ID123", "ID22", "AAA"))
mydata
# A
#1: ID123
#2: ID22
#3: AAA
grep("^ID", mydata[[1]]) # using a column number
#[1] 1 2
grep("^ID", mydata$A)
#[1] 1 2
If you need this in a loop then [[ and $ are faster as they avoid the overhead of argument checking inside DT[...]. If it's just one call then that overhead is negligible.
grep("^ID", mydata[,1, with=F]) "doesn't work" (please include the error message that you saw instead of "doesn't work"!) because grep wants a vector but DT[] always returns a data.table, even if 1-column, for important type consistency e.g. when chaining. mydata[[1]] directly is cleaner, but another way just to illustrate is grep("^ID", mydata[,1,with=F][[1]]).
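For the second part of the question (counting rows that are not NA and start with "ID", with NA propagated instead of treated as a non-match), a minimal base R sketch could look like this. It uses a plain data.frame as a stand-in so that it runs without extra packages; since [[ is list extraction, it should behave the same on a data.table. The names col, starts_id and n_id are made up for illustration.

```r
# Stand-in for the question's one-column table, including the NA row
mydata <- data.frame(A = c("ID123", "ID22", "AAA", NA), stringsAsFactors = FALSE)

# Extract the first column by number as a plain vector
col <- mydata[[1]]

# grepl() reports FALSE for NA input, so restore the NA explicitly
starts_id <- grepl("^ID", col)
starts_id[is.na(col)] <- NA

# Number of rows that are not NA and start with "ID"
n_id <- sum(starts_id, na.rm = TRUE)
```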
As Frank said in comments, using column numbers is highly discouraged because of the potential for bugs as your data changes over the months and years into the future as the documentation explains. Use column names instead, within DT[...].
But if you really must, and sometimes it's valid, then how about :
..theCol = DT[[theNumber]]
DT[ grep(,..theCol) & ..theCol | ..theCol etc , ... ]
The .. prefix in your variable name kind of means "one up" like a directory path. But any variable name that for sure isn't a column name would do. This way you can use it several times inside DT[...] without having to repeat both the table name DT and the column number just to access the column by number several times. (We try to avoid symbol name repetition as much as possible to reduce the potential for bugs due to typos.)
One data.table way of indexing a column by number would be to convert to a column name , convert to an R symbol, and evaluate:
mydata[ , eval( as.symbol( names(mydata)[1] ) )]
[1] "ID123" "ID22" "AAA"
> grep("^ID", mydata[,eval(as.symbol(names(mydata)[1]))])
[1] 1 2
But this is not really an approved path to success because of the DT FAQ #1 as well as the fact that row numbers are not considered as valid targets. The philosophy (as I understand it) is that row numbers are accidental and you should be storing your records with unique identifiers.
I am working with a long list of data frames.
Here is a simple hypothetical example of a data frame:
DFrame<-data.frame(c(1,0),c("Yes","No"))
colnames(DFrame)<-c("ColOne","ColTwo")
I am trying to retrieve a specified column of the data frame using paste function.
get(paste("DFrame","$","ColTwo",sep=""))
The get function returns the following error, when trying to retrieve a specified column:
Error in get(paste("DFrame", "$", "ColTwo", sep = "")) : object 'DFrame$ColTwo' not found
When I enter the constructed name of the data frame DFrame$ColTwo it returns the desired output of the second column.
If I reconstruct an example without the '$' sign, then I get the desired answer from the get function. For example, the following code yields 2:
Ans <- 2
get(paste("An","s",sep=""))
[1] 2
I am looking for the same desired outcome, but struggling to get past the error that the object could not be found.
I also attempted using the following format, but the quotation in the column name breaks the paste function:
paste("DFrame","[,"ColTwo"]",sep="")
Thank you very much for the input,
Kind regards
You can do that using the following syntax:
get("DFrame")[,"ColTwo"]
You can use paste() in both of these strings, for example:
get(paste("D", "Frame", sep=""))[,paste("Col", "Two", sep="")]
Edit: Despite someone downvoting this answer without leaving a comment, this does exactly what the original poster asked for. If you feel that it does not or is in some way dangerous, I would encourage you to leave a comment.
Stop trying to use paste and get entirely.
The whole point of having a list (of data frames, say) is that you can reference them using names:
DFrame<-data.frame(c(1,0),c("Yes","No"))
colnames(DFrame)<-c("ColOne","ColTwo")
#A list of data frames
l <- list(DFrame,DFrame)
#The data frames in the list can have names
names(l) <- c("DF1",'DF2')
# Now you just use `[[`
> l[["DF1"]][["ColOne"]]
[1] 1 0
> l[["DF1"]][["ColTwo"]]
[1] Yes No
Levels: No Yes
If you have to, you can use paste to construct the indices passed inside [[.
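As a small illustration of that last point, the names passed to [[ can be built with paste0 just like any other string. The variables df_name, col_name and result are hypothetical and only serve the example.

```r
# Same toy data frame as above, stored twice in a named list
DFrame <- data.frame(ColOne = c(1, 0), ColTwo = c("Yes", "No"))
l <- list(DF1 = DFrame, DF2 = DFrame)

# Build the list name and the column name from pieces
df_name  <- paste0("DF", 1)        # "DF1"
col_name <- paste0("Col", "Two")   # "ColTwo"

# [[ accepts the constructed strings directly; no get() required
result <- l[[df_name]][[col_name]]
```

This works because [[ takes its index as an ordinary character value, whereas get() looks up a single object name and does not evaluate operators such as $.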
I want to remove data from a dataframe that is present in another dataframe. Let me give an example:
letters<-c('a','b','c','d','e')
numbers<-c(1,2,3,4,5)
list_one<-data.frame(letters,numbers)
I want to remove every row in list_one with matches in letters to this other dataframe:
letters2<-c('a','c','d')
list_two<-data.frame(letters2)
I should mention that I'm actually trying to do this with two large csv files, so I really can't use the negative expression - to take out the rows.
The goal is a final dataframe which only has the letters b and e and their corresponding numbers. How do I do this?
I'm new to R so it's hard to research questions when I'm not quite sure what key terms to search. Any help is appreciated, thanks!
A dplyr solution
library(dplyr)
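# the key columns have different names, so the join needs an explicit `by`
list_one %>% anti_join(list_two, by = c("letters" = "letters2"))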
Base R Solution
list_one[!list_one$letters %in% list_two$letters2,]
gives you:
letters numbers
2 b 2
5 e 5
Explanation:
> list_one$letters %in% list_two$letters2
[1] TRUE FALSE TRUE TRUE FALSE
This gives you a logical vector of the same length as list_one$letters, which is TRUE where the letter is present in list_two$letters2. ! negates this vector, so you end up with TRUE only for the rows whose letter is not present in list_two$letters2, and those are the rows that are kept.
If you have questions about how to select rows from a data.frame enter
?`[.data.frame`
to the console and read it.
In response to your edit ("so I really can't use the negative expression"):
I guess one of the most efficient ways to do this is using data.table as follows:
require(data.table)
setDT(list_one)
setDT(list_two)
list_one[!list_two, on=c(letters = "letters2")]
Or
require(data.table)
setDT(list_one, key = "letters")
setDT(list_two, key = "letters2")
list_one[!list_two]
(Thanks to Frank for the improvement)
Result:
letters numbers
1: b 2
2: e 5
Have a look at ?"data.table" and "Quickly reading very large tables as dataframes in R" on why to use data.table::fread to read the csv files in the first place.
BTW: If you have letters2 instead of list_two you can use
list_one[!J(letters2)]
I have several datafiles, which I need to process in a particular order. The pattern of the names of the files is, e.g. "Ad_10170_75_79.txt".
Currently they are sorted according to the first numbers (which differ in length), see below:
f <- as.matrix (list.files())
f
[1] "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_1049_25_79.txt" "Ad_10531_77_79.txt"
But I need them to be sorted by the middle number, like this:
> f
[1] "Ad_1049_25_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
As I just need the middle number of the filename, I thought the easiest way would be to strip the rest of the name and rename all the files. For this I tried using strsplit.
f2 <- strsplit (f,"_79.txt")
But I'm sure there is a way to sort the files directly, without renaming all files. I tried using sort and to describe the name with regex but without success. This has been a problem for many days, and I spent several hours searching and trying, to solve this presumably easy task. Any help is very much appreciated.
old example dataset:
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
"Ad_1049_25_79.txt", "Ad_10531_77_79.txt")
Thank you for your answers. I think I have to modify my example, because the solution should work for all possible middle numbers, regardless of how many digits they have.
new example dataset:
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
"Ad_1049_9_79.txt", "Ad_10531_77_79.txt")
Here's a regex approach.
f[order(as.numeric(gsub('Ad_\\d+_(\\d+)_\\d+\\.txt', '\\1', f)))]
# [1] "Ad_1049_9_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
Try this:
f[order(as.numeric(unlist(lapply(strsplit(f, "_"), "[[", 3))))]
[1] "Ad_1049_25_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
First we split by _, then select the third element of every list element, find the order and subset f based on that order.
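The same steps can be written out one at a time, here on the question's newer example (with the single-digit middle number). This is only a restatement of the approach above; the names parts, middle and sorted are made up for illustration.

```r
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
       "Ad_1049_9_79.txt", "Ad_10531_77_79.txt")

# Step 1: split each file name on the literal underscore
parts <- strsplit(f, "_", fixed = TRUE)

# Step 2: take the third piece of every name and convert it to a number
middle <- as.numeric(vapply(parts, `[`, character(1), 3))

# Step 3: order by that number and reorder the file names accordingly
sorted <- f[order(middle)]
```

Converting to numeric before ordering matters: as character strings, "9" would sort after "77".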
I would create a small dataframe containing filenames and their respective extracted indices:
f<- c("Ad_10170_75_79.txt","Ad_10345_76_79.txt","Ad_1049_25_79.txt","Ad_10531_77_79.txt")
f2 <- strsplit (f,"_79.txt")
mydb <- as.data.frame(cbind(f,substr(f2,start=nchar(f2)-1,nchar(f2))))
names(mydb) <- c("filename","index")
library(plyr)
arrange(mydb,index)
Take the first column of this as your filename vector.
ADDENDUM:
If a numeric index is required, convert the index to numeric (going through as.character in case the column was created as a factor):
mydb$index <- as.numeric(as.character(mydb$index))