Accessing a data.table via column numbers and grep

If I use this simple data.table (one column)
mydata <- data.table(A=c("ID123", "ID22", "AAA", NA))
I can find the positions of the rows starting with "ID":
grep("^ID", mydata[,A])
How can I get the same result using column numbers instead (say I want the first column)?
I've tried
grep("^ID", mydata[,1, with=F])
but it doesn't work.
More importantly, I would like to do it the data.table way, putting the command inside the brackets:
mydata[,grep("^ID",.SD), .SDcols=1]
But this doesn't work.
I've found this way, but it's too convoluted:
mydata[, lapply(.SD, grep, pattern="ID"), .SDcols=1]
What's the proper way to do it?
A little bit more complex:
What if I want to simultaneously count how many rows are not NA and start with "ID"?
Something like
any(!(grepl("^ID", mydata[,A] ) | is.na(mydata[,A])))
but more compact and inside the brackets.
I don't like the fact that grep treats NA as a non-match instead of returning NA.

Don't forget that a data.table is a list, too. So if you really just want an entire column as a vector, it is encouraged to use base R methods on it: [[ and $.
mydata <- data.table(A=c("ID123", "ID22", "AAA"))
mydata
# A
#1: ID123
#2: ID22
#3: AAA
grep("^ID", mydata[[1]]) # using a column number
#[1] 1 2
grep("^ID", mydata$A)
#[1] 1 2
If you need this in a loop, then [[ and $ are faster, as they avoid the overhead of argument checking inside DT[...]. If it's just one call, then that overhead is negligible.
grep("^ID", mydata[,1, with=F]) "doesn't work" (please include the error message that you saw instead of "does't work"!) because grep wants a vector but DT[] always returns a data.table, even if 1-column, for important type consistency e.g. when chaining. mydata[[1]] directly is cleaner, but another way just to illustrate is grep("^ID", mydata[,1,with=F][[1]]).
As Frank said in comments, using column numbers is highly discouraged because of the potential for bugs as your data changes over the months and years, as the documentation explains. Use column names instead, within DT[...].
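For instance, a minimal sketch of the column-name route inside DT[...], which also covers the follow-up about counting non-NA rows that start with "ID" (note that grepl already returns FALSE for NA):
mydata <- data.table(A=c("ID123", "ID22", "AAA", NA))
mydata[, grep("^ID", A)]                  # positions: 1 2
mydata[grepl("^ID", A) & !is.na(A), .N]   # count: 2 (the !is.na is just being explicit)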
But if you really must, and sometimes it is valid, then how about:
..theCol = DT[[theNumber]]
DT[ grepl("^ID", ..theCol) & !is.na(..theCol), ... ]
The .. prefix in the variable name is meant to suggest "one level up", like in a directory path, but any variable name that is definitely not a column name would do. This way you can use the column several times inside DT[...] without having to repeat both the table name DT and the column number. (We try to avoid repeating symbol names as much as possible, to reduce the potential for bugs due to typos.)
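Here is a minimal runnable sketch of that pattern (using a plain variable name, since recent data.table versions give the .. prefix its own special meaning inside DT[...]):
library(data.table)
DT <- data.table(A=c("ID123", "ID22", "AAA", NA))
theNumber <- 1
theCol <- DT[[theNumber]]   # grab the column by number, once
DT[grepl("^ID", theCol) & !is.na(theCol), .N]
# [1] 2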

One data.table way of indexing a column by number would be to convert the number to a column name, convert that to an R symbol, and evaluate it:
mydata[, eval(as.symbol(names(mydata)[1]))]
# [1] "ID123" "ID22"  "AAA"
grep("^ID", mydata[, eval(as.symbol(names(mydata)[1]))])
# [1] 1 2
But this is not really an approved path to success, because of DT FAQ #1 and because row numbers are not considered valid targets. The philosophy (as I understand it) is that row numbers are incidental, and you should be storing your records with unique identifiers.

Related

read.csv with check.names=F in R: why does the column name change?

Please look at the column name "if" in the second column. The difference is: when check.names=F, the "." beside "if" disappears.
Sorry there's no code; I tried to type some code to generate this data.frame like the one in the picture, but I failed because of the "if". We know that "if" is a reserved word in R (like else, for, while, function). Here I deliberately used "if" as the column name (the 2nd column) to see whether R would do anything novel.
So, going another way, I typed the "if" into Excel and saved the file as csv in order to use read.csv.
Question is:
Why "if." changes to "if"?(After i use check.names=FALSE)
?read.csv describes check.names= as follows:
check.names: logical. If 'TRUE' then the names of the variables in the
data frame are checked to ensure that they are syntactically
valid variable names. If necessary they are adjusted (by
'make.names') so that they are, and also to ensure that there
are no duplicates.
The default behaviour allows you to do something like dat$<column-name>, but unfortunately dat$if fails with Error: unexpected 'if' in "dat$if", so check.names=TRUE changes the name to something the parser will not trip over. Note, though, that dat[["if"]] works even when dat$if does not.
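A quick sketch of the two access styles on a hypothetical one-column file:
dat <- read.csv(text = "if\n1", check.names = FALSE)
# dat$if    # parse error: 'if' is a reserved word
dat$`if`    # backticks work: [1] 1
dat[["if"]] # so does [[: [1] 1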
If you are wondering if check.names=FALSE is ever a bad thing, then imagine this:
dat <- read.csv(text = "a,a\n2,3")
dat
# a a.1
# 1 2 3
dat <- read.csv(text = "a,a\n2,3", check.names = FALSE)
dat
# a a
# 1 2 3
In the second case, how does one access the second column by name? dat$a returns only the first one (2). However, if you don't want to use $ or [[, and can instead rely on positional indexing for columns, then dat[, colnames(dat) == "a"] does return both of them.

stringr str_which: first compare the 1st row with the whole column, then move to the next row

I am trying to match DNA sequences in a column. For each sequence I am trying to find the longer version of itself, which is also in this column.
I am trying to use str_which, which I know works here, since if I manually put the search pattern in, it finds the rows that include the sequence.
As a preview of the data I have:
SNID type seqs2
9584818 seqs TCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAA
9584818 reversed TTCTGTGGCCCCAGGAGCAGAATCTACAAGTTCATTGCACTCACATATAAGTAGATTCCTYAGATGAAGAAGTTTCTTTGGTAGGTTCCCTTTCAGCTTGGGACAGTGTCTTAAAGAAAGA
9562505 seqs GTCTTCAGCATCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAACTTTGTGAAT
9562505 reversed ATTCACAAAGTTCTGTGGCCCCAGGAGCAGAATCTACAAGTTCATTGCACTCACATATAAGTAGATTCCTYAGATGAAGAAGTTTCTTTGGTAGGTTCCCTTTCAGCTTGGGACAGTGTCTTAAAGAAAGATGCTGAAGAC
Using a simple search with row one as x:
x <- "TCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAA"
str_which(df$seqs2, x)
I get the answer I expect:
> str_which(df$seqs2, x)
[1] 1 3
But when I try to search with the whole column, each row just finds itself, and not the other rows in which its sequence also appears.
> str_which(df$seqs2, df$seqs2)
[1] 1 2 3 4
Since my data set is quite large, I do not want to do this manually; I'd rather use the column as input instead of stating x first each time.
Does anybody have any idea how to solve this? I have tried most stringr commands by now, but I may have used them incorrectly or skipped some important ones.
Thanks in advance
You may need lapply, since str_which(string, pattern) pairs vector arguments element-wise (which is why each row only matched itself):
lapply(df$seqs2, function(x) stringr::str_which(df$seqs2, x))
You can also use grep to keep this in base R:
lapply(df$seqs2, function(x) grep(x, df$seqs2))
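For example, a small sketch (with a made-up df) that also drops each row's trivial self-match:
df <- data.frame(seqs2 = c("abcde", "xyz", "abcdefg"), stringsAsFactors = FALSE)
lapply(seq_along(df$seqs2), function(i) setdiff(grep(df$seqs2[i], df$seqs2), i))
# [[1]]
# [1] 3         # row 1's sequence occurs inside row 3's longer sequence
# [[2]]
# integer(0)
# [[3]]
# integer(0)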

Removing data from one dataframe that exists in another dataframe R

I want to remove data from a dataframe that is present in another dataframe. Let me give an example:
letters<-c('a','b','c','d','e')
numbers<-c(1,2,3,4,5)
list_one<-data.frame(letters,numbers)
I want to remove every row in list_one whose letters value matches this other dataframe:
letters2<-c('a','c','d')
list_two<-data.frame(letters2)
And create a final dataframe which only has the letters b and e and their corresponding numbers. How do I do this?
I should mention that I'm actually trying to do this with two large csv files, so I really can't use the negative expression (-) to take out the rows.
I'm new to R so it's hard to research questions when I'm not quite sure what key terms to search. Any help is appreciated, thanks!
A dplyr solution (the key columns have different names, so they must be given explicitly):
library(dplyr)
list_one %>% anti_join(list_two, by = c("letters" = "letters2"))
Base R Solution
list_one[!list_one$letters %in% list_two$letters2,]
gives you:
letters numbers
2 b 2
5 e 5
Explanation:
> list_one$letters %in% list_two$letters2
[1] TRUE FALSE TRUE TRUE FALSE
This gives you a vector of length nrow(list_one) with TRUE/FALSE values. ! negates this vector, so you end up with TRUE exactly where the value is not present in list_two$letters2, and those are the rows that are kept.
If you have questions about how to select rows from a data.frame, enter
?`[.data.frame`
at the console and read the help page.
This answer is in response to your edit ("so I really can't use the negative expression"):
I guess one of the most efficient ways to do this is using data.table, as follows:
require(data.table)
setDT(list_one)
setDT(list_two)
list_one[!list_two, on=c(letters = "letters2")]
Or
require(data.table)
setDT(list_one, key = "letters")
setDT(list_two, key = "letters2")
list_one[!letters2]
(Thanks to Frank for the improvement)
Result:
letters numbers
1: b 2
2: e 5
Have a look at ?`data.table` and Quickly reading very large tables as dataframes in R for why to use data.table::fread to read the csv files in the first place.
BTW: If you have letters2 instead of list_two you can use
list_one[!J(letters2)]
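For the large-csv case mentioned in the question, a minimal fread-based sketch (the file names here are placeholders):
library(data.table)
list_one <- fread("list_one.csv")
list_two <- fread("list_two.csv")
list_one[!list_two, on=c(letters = "letters2")]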

How do I loop over the rows of a data frame without relying on locational references to columns in R

I have figured out how to create a new column on my data frame that is TRUE if the character string in "Column 5" is contained within the longer string in "Column 6". Can I do this by referring to the names of my columns rather than using [row, column] locational references?
rows = NULL
for(i in 1:length(excptn1[,1]))
{
rows[i] <- grepl(excptn1[i,5],excptn1[i,6], perl=TRUE)
}
As a programmer I'm nervous about referring to things as "Column 5" and "Column 6". I want to refer to the names of the variables captured in those columns so that I'm not reliant on my source file always having the columns in the identical order. Furthermore, I might forget about that locational reference and add something earlier in the code that causes it to fail later. When you can think in terms of column names in general (rather than their particular order at a point in time) it's a lot easier to build robust, production-strength code.
I found a related question on this site, but it uses the same kind of locational references I want to avoid:
How do I perform a function on each row of a data frame and have just one element of the output inserted as a new column in that row
While R does seem very flexible, it seems to lack a lot of features that you'd want in scalable, production-strength code... but I'm hoping I'm wrong and can learn otherwise.
Thanks!
You could refer to the columns by name rather than by index in two ways:
rows[i] <- grepl(excptn1[i,"colname"],excptn1[i,"othercolname"], perl=TRUE)
or
rows[i] <- grepl(excptn1$colname[i],excptn1$othercolname[i], perl=TRUE)
Finally, note that most R programmers would do this without an explicit loop:
rows <- sapply(seq_len(nrow(excptn1)), function(i) grepl(excptn1$colname[i], excptn1$othercolname[i], perl=TRUE))
One thing this avoids is the overhead of growing the vector on each iteration.
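Another base R option (a sketch, assuming the same excptn1 columns as above) is mapply, which pairs the two columns element-wise without an explicit index:
rows <- mapply(grepl, excptn1$colname, excptn1$othercolname,
               MoreArgs = list(perl = TRUE), USE.NAMES = FALSE)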
If you want to do this faster, use the stri_match_first_regex function from the stringi package.
Example:
require(stringi)
ramka <- data.frame(foo=letters[1:3],bar=c("ala","ma","koteczka"))
> ramka
foo bar
1 a ala
2 b ma
3 c koteczka
> stri_match_first_regex(str=ramka$bar, pattern=ramka$foo)
[,1]
[1,] "a"
[2,] NA
[3,] "c"

Extracting the top match from string comparison in R

I am currently using the agrep function with lapply in data.table code to link entries from a user-provided VIN# list to a DMV VIN# database. Please see the following two links for all the data/code so far:
Accelerate performance and speed of string match in R
Imperfect string match using data.table in R
Is there a way to extract the "best" match from my list that is being generated by:
dt <- dt[lapply(car.vins, function(x) agrep(x,vin.vins, max.distance=c(cost=2, all=2), value=T)), list(NumTimesFound=.N), vin.names]
because as of now, the agrep function gives me multiple matches, even after a lot of tweaking of the cost, all, substitution, etc. arguments.
I have also tried using the adist function instead of agrep, but because adist does not have a value=TRUE option like agrep, it throws the same
Error in `[.data.table`(dt, lapply(vin.vins, function(x) agrep(x,car.vins, :
x.'vin.vins' is a character column being joined to i.'V1' which is type 'integer'.
Character columns must join to factor or character columns.
that I was receiving with agrep before.
Is there perhaps some other package I could use?
Thanks!
Tom, this isn't strictly a data.table problem. Also, it's hard to know exactly what you want without having the data you are using. I tried to figure out what you want, and I came up with this solution:
vin.match <- vapply(car.vins, function(x) which.min(adist(x, vin.vins)), integer(1L))
data.frame(car.vins, vin.vins=vin.vins[vin.match], vin.names=vin.names[vin.match])
# car.vins vin.vins vin.names
# 1 abcdekl abcdef NAME1
# 2 abcdeF abcdef NAME1
# 3 laskdjg laskdjf NAME2
# 4 blerghk blerghk NAME3
And here is the data:
vin.vins <- c("abcdef", "laskdjf", "blerghk")
vin.names <- paste0("NAME", 1:length(vin.vins))
car.vins <- c("abcdekl", "abcdeF", "laskdjg", "blerghk")
This will find the closest match in vin.vins for every value in car.vins, as per adist. I'm not sure data.table is needed for this particular step. If you provide your actual data (or a representative sample), then I can provide a more targeted answer.
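If you would rather reject poor matches than always accept the nearest one, here is a small extension of the same idea (the cutoff of 2 edits is only an assumption, mirroring the cost=2 in your agrep call):
d <- adist(car.vins, vin.vins)          # full edit-distance matrix
vin.match <- apply(d, 1, which.min)     # index of the nearest vin per car
vin.match[apply(d, 1, min) > 2] <- NA   # discard anything worse than 2 edits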
