Extract string within data.table

I have the following data.table called D.
ngram
1 in_the_years
2 the_years_thereafter
3 years_thereafter_most
4 he_wasn't_home
5 how_are_you
6 thereafter_most_of
I need to add a few variables.
1. queryWord (the requirement is to extract the first 2 words)
The following is my code:
D[,queryWord:=strsplit(ngram,"_[^_]+$")[[1]],by=ngram]
ngram queryWord
1 in_the_years in_the
2 the_years_thereafter the_years
3 years_thereafter_most years_thereafter
4 he_wasn't_home he_wasn't
5 how_are_you how_are
6 thereafter_most_of thereafter_most
2. predict (the requirement is to extract the last word)
The following is the desired output:
ngram queryWord predict
1 in_the_years in_the years
2 the_years_thereafter the_years thereafter
3 years_thereafter_most years_thereafter most
4 he_wasn't_home he_wasn't home
5 how_are_you how_are you
6 thereafter_most_of thereafter_most of
For this purpose I wrote the following function
getLastTerm<-function(x){
y<-strsplit(x,"_")
y[[1]][length(y[[1]])]
}
getLastTerm("in_the_years") returns "years"; however, it does not work inside the data.table object D:
D[,predict:=getLastTerm(ngram)[[1]],by=ngram]
Please I need help

Before addressing your actual question, note that you can simplify your first step to:
# option 1
D[, queryWord := strsplit(ngram,"_[^_]+$")][]
# option 2
D[, queryWord := sub('(.*)_.*$','\\1',ngram)][]
To get the predict column, you don't need to write a special function. You can combine strsplit, lapply and last:
D[, predict := lapply(strsplit(D$ngram,"_"), last)][]
An even simpler solution uses only sub:
D[, predict := sub('.*_(.*)$','\\1',ngram)][]
Both approaches give the following final result:
> D
ngram queryWord predict
1: in_the_years in_the years
2: the_years_thereafter the_years thereafter
3: years_thereafter_most years_thereafter most
4: he_wasn't_home he_wasn't home
5: how_are_you how_are you
6: thereafter_most_of thereafter_most of
Used data:
D <- fread("ngram
in_the_years
the_years_thereafter
years_thereafter_most
he_wasn't_home
how_are_you
thereafter_most_of", header = TRUE)
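The two regular expressions can be checked on a plain character vector before they are used inside the table. The sketch below is base R only. The names ngram_demo, query_word and last_word are made up for illustration, and the patterns are slight variants of the ones above rather than the exact ones.

```r
# Illustrative input; not the full table D from the question
ngram_demo <- c("in_the_years", "he_wasn't_home", "single")

# Remove the final "_word" to keep everything before the last underscore
query_word <- sub("_[^_]*$", "", ngram_demo)

# Remove everything up to and including the last underscore
last_word <- sub("^.*_", "", ngram_demo)

print(data.frame(ngram_demo, query_word, last_word))
```

A string without any underscore, such as "single" here, is returned unchanged by both patterns, which may or may not be what you want for one-word n-grams.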

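Your getLastTerm function only looks at the first list element, because y[[1]] always refers to the first string. Loop over every element instead:
getLastTerm <- function(x){
y <- strsplit(x,"_")
# take the last piece of each split string, for any number of rows
for (i in seq_along(y)) {
x[i] <- y[[i]][length(y[[i]])]
}
x
}
D$new <- getLastTerm(D$ngram)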
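The same idea can be written without an explicit loop. This is a minimal sketch in base R, assuming every input string is non-empty; get_last_term is a hypothetical name chosen to avoid clashing with the function above.

```r
# Vectorised variant: returns the last "_"-separated piece of each string
get_last_term <- function(x) {
  parts <- strsplit(x, "_", fixed = TRUE)   # one character vector per input string
  # vapply enforces exactly one character value per element
  vapply(parts, function(p) p[length(p)], character(1))
}

terms <- get_last_term(c("in_the_years", "how_are_you", "thereafter_most_of"))
print(terms)
```

Because the function returns a plain character vector of the same length as its input, it should be usable directly on a whole column, for example as D[, predict := get_last_term(ngram)], with no by needed.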

Related

Newbie working on Horse Racing Database using R

I'm new to the group and to the R language.
I've written some code (below) that achieves the desired result.
However, I'm aware that I'm repeating near-identical lines of code, which would surely be more efficient as a for loop.
Also, there will be races with large numbers of horses, so I really need a loop that runs through each horse,
i.e. num_runners = NROW(my_new_data)
my_new_data holds data on horses' previous races.
DaH is a numeric rating attached to each of a horse's previous runs, with DaH1 being the most recent and DaH6 six races back.
Code, a character, signifies the type of race that the horse competed in, e.g. Flat, Fences.
I have played with for loops, e.g. for(i in 1:6), without success.
Since I am assigning to a new horse each time I would hope something such as the following would work:
horse(i) = c(my_new_data$DaH1[i],my_new_data$DaH2[i],my_new_data$DaH3[i],my_new_data$DaH4[i],my_new_data$DaH5[i],my_new_data$DaH6[i])
But I know that horse(i) is not allowed.
Would my best strategy be to pre-define a dataframe of size: 6 rows and 6 columns
and use 2 for loops to populate [row][column]? Something like:
final_data[i,j]
Here is the code I am presently using which creates the dataframe racetest:
horse1 = c(my_new_data$DaH1[1],my_new_data$DaH2[1],my_new_data$DaH3[1],my_new_data$DaH4[1],my_new_data$DaH5[1],my_new_data$DaH6[1])
horse2 = c(my_new_data$DaH1[2],my_new_data$DaH2[2],my_new_data$DaH3[2],my_new_data$DaH4[2],my_new_data$DaH5[2],my_new_data$DaH6[2])
horse3 = c(my_new_data$DaH1[3],my_new_data$DaH2[3],my_new_data$DaH3[3],my_new_data$DaH4[3],my_new_data$DaH5[3],my_new_data$DaH6[3])
horse4 = c(my_new_data$DaH1[4],my_new_data$DaH2[4],my_new_data$DaH3[4],my_new_data$DaH4[4],my_new_data$DaH5[4],my_new_data$DaH6[4])
horse5 = c(my_new_data$DaH1[5],my_new_data$DaH2[5],my_new_data$DaH3[5],my_new_data$DaH4[5],my_new_data$DaH5[5],my_new_data$DaH6[5])
horse6 = c(my_new_data$DaH1[6],my_new_data$DaH2[6],my_new_data$DaH3[6],my_new_data$DaH4[6],my_new_data$DaH5[6],my_new_data$DaH6[6])
horse1.code = c(my_new_data$Code1[1],my_new_data$Code2[1],my_new_data$Code3[1],my_new_data$Code4[1],my_new_data$Code5[1],my_new_data$Code6[1])
horse2.code = c(my_new_data$Code1[2],my_new_data$Code2[2],my_new_data$Code3[2],my_new_data$Code4[2],my_new_data$Code5[2],my_new_data$Code6[2])
horse3.code = c(my_new_data$Code1[3],my_new_data$Code2[3],my_new_data$Code3[3],my_new_data$Code4[3],my_new_data$Code5[3],my_new_data$Code6[3])
horse4.code = c(my_new_data$Code1[4],my_new_data$Code2[4],my_new_data$Code3[4],my_new_data$Code4[4],my_new_data$Code5[4],my_new_data$Code6[4])
horse5.code = c(my_new_data$Code1[5],my_new_data$Code2[5],my_new_data$Code3[5],my_new_data$Code4[5],my_new_data$Code5[5],my_new_data$Code6[5])
horse6.code = c(my_new_data$Code1[6],my_new_data$Code2[6],my_new_data$Code3[6],my_new_data$Code4[6],my_new_data$Code5[6],my_new_data$Code6[6])
racetest = data.frame(horse1,horse1.code,horse2,horse2.code, horse3, horse3.code,
horse4,horse4.code,horse5,horse5.code, horse6, horse6.code)
Thanks in advance for any help that can be offered!
Graham

Using loops in R is usually not the best approach. Still, I will give you something that might work.
I see two possible approaches here; I will address the simpler one.
If the columns are ordered such that columns 1:6 are named DaH1 to DaH6 and columns 7:12 are named Code1 to Code6, then:
library(magrittr)
temp<- cbind(my_new_data[,1:6] %>% t,
my_new_data[,7:12]%>% t)
Odd = seq(1,12,2)
my_new_data[ , Odd] = temp[,1:6]
my_new_data[ , -Odd] = temp[,7:12]
#cleanup
rm(temp,Odd)
my_new_data should now contain your desired output. Before you run this, make sure your data is backed up inside another object as this is untested code.

What we actually want is to reshape the wide format of the data into a different wide format. But first let's look at your desired for loop approach, to understand what's going on.
Using a loop
For the loop we'll need two variables with sequences i and j.
## initialize matrix with dimnames
racetest <- matrix(NA, 3, 6,
dimnames=list(c("DaH1", "DaH2", "DaH3"),
c("horse1", "horse1.code", "horse2", "horse2.code",
"horse3", "horse3.code")))
## loop
for (i in 0:2) {
for (j in 1:3) {
# row i+1 is horse i+1; columns j and j+3 hold DaH<j> and Code<j>
racetest[j, 1:2+2*i] <- unlist(my_new_data[i+1, c(j, j+3)])
}
}
# horse1 horse1.code horse2 horse2.code horse3 horse3.code
# DaH1 1 1 2 2 3 3
# DaH2 1 1 2 2 3 3
# DaH3 1 1 2 2 3 3
For loops are often discouraged in R because they can be slow and don't use the vectorized features of the language. They can also be tricky to program.
Transposing column sets
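We could also take a different approach. What we really want is to transpose the DaH* and Code* column sets (identifiable using grep) and then put the columns in the right order, sorting by the last character of each name (substring starting at position nchar). Note that sorting on a single trailing character only works for up to nine horses.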
rownames(my_new_data) <- paste0("horse.", seq(nrow(my_new_data)))
rr <- data.frame(DaH=t(my_new_data[, grep("DaH", names(my_new_data))]),
Code=t(my_new_data[, grep("Code", names(my_new_data))]))
rr <- rr[order(substring(names(rr), nchar(names(rr))))]
rr
# DaH.horse.1 Code.horse.1 DaH.horse.2 Code.horse.2 DaH.horse.3 Code.horse.3
# DaH1 1 1 2 2 3 3
# DaH2 1 1 2 2 3 3
# DaH3 1 1 2 2 3 3
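For races with ten or more horses, the interleaving can be done by column position instead of by name. The following is a rough sketch under stated assumptions: the toy values, the helper names (dah_cols, code_cols, interleave) and the output names horse1, horse1.code, ... (copied from the question) are illustrative, and it has only been reasoned through on this small example.

```r
# Toy data with distinct values so the result is easy to check (assumed, not from the question)
my_new_data <- data.frame(DaH1 = 1:3, DaH2 = 4:6,
                          Code1 = c("F", "H", "F"), Code2 = c("H", "H", "F"),
                          stringsAsFactors = FALSE)
n <- nrow(my_new_data)

dah_cols  <- grep("^DaH",  names(my_new_data))
code_cols <- grep("^Code", names(my_new_data))

# Transpose each column set: one row per past race, one column per horse
dah  <- t(as.matrix(my_new_data[, dah_cols,  drop = FALSE]))
code <- t(as.matrix(my_new_data[, code_cols, drop = FALSE]))

wide <- data.frame(dah, code, stringsAsFactors = FALSE)
names(wide) <- c(paste0("horse", seq_len(n)),
                 paste0("horse", seq_len(n), ".code"))

# Column order 1, n+1, 2, n+2, ... puts each horse next to its code
interleave <- as.vector(rbind(seq_len(n), n + seq_len(n)))
racetest <- wide[, interleave]
print(racetest)
```

Ordering by position avoids parsing the horse number out of the column names, so it does not depend on how many digits that number has.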
Reshaping data
Last but not least, we actually want to reshape the data. For this we give the data set an ID variable.
my_new_data <- transform(my_new_data, horse=1:nrow(my_new_data))
At first, we reshape the data into "long" format, using the new ID variable horse and put the two varying column sets into a list.
rr1 <- reshape(my_new_data, idvar="horse", varying=list(1:3, 4:6), direction="long", sep="",
v.names=c("DaH", "Code"))
rr1
# horse time DaH Code
# 1.1 1 1 1 1
# 2.1 2 1 2 2
# 3.1 3 1 3 3
# 1.2 1 2 1 1
# 2.2 2 2 2 2
# 3.2 3 2 3 3
# 1.3 1 3 1 1
# 2.3 2 3 2 2
# 3.3 3 3 3 3
Then, in order to get the desired wide format, what we want is to swap idvar and timevar, where our new idvar is "time" and our new timevar is "horse".
reshape(rr1, timevar="horse", idvar="time", direction= "wide")
# time DaH.1 Code.1 DaH.2 Code.2 DaH.3 Code.3
# 1.1 1 1 1 2 2 3 3
# 1.2 2 1 1 2 2 3 3
# 1.3 3 1 1 2 2 3 3
Benchmark
The benchmark reveals that of these three approaches transposing of the matrices is fastest, while the 'for' loop is actually by far the slowest.
# Unit: microseconds
# expr min lq mean median uq max neval cld
# forloop 7191.038 7373.5890 8381.8036 7576.678 7980.4320 46677.324 100 c
# transpose 620.748 656.0845 707.7248 692.953 733.1365 944.773 100 a
# reshape 2791.710 2858.6830 3013.8372 2958.825 3118.4125 3871.960 100 b
Toy data:
my_new_data <- data.frame(DaH1=1:3, DaH2=1:3, DaH3=1:3, Code1=1:3, Code2=1:3, Code3=1:3)

Programmatically generating a list of columns to be assigned to data.table with `:=` syntax

In data.table, I can generate a list of new columns that are immediately assigned to the table using the `:=` syntax, like so:
x <- data.table(x1=1:5, x2=1:5)
x[, `:=` (x3=x1+2, x4=x2*3)]
Alternatively, I could have done the following:
x[, c("x3","x4") := list(x1+2, x2*3)]
I would like to do something like the first method, but have the right hand side of the assignment statement be built up automatically using a custom function. For example, suppose I want a function that will accept a set of column names, then generate new columns that are the mean of the given columns, with the column name being equal to the original column plus some suffix. For example,
x[, `:=` MEAN(x1,x2)]
would yield the same result as
x[, `:=` (x1_mean=mean(x1), x2_mean=mean(x2))]
Is this possible in data.table? I realize this is possible if I'm willing to pass in a list of column names like in the c("x3","x4") := ... example, but I want to avoid this so I don't have to write as much code.

Just refer to the function by name:
myfun <- "mean"
x[,paste(names(x),myfun,sep="_"):=lapply(.SD,myfun)]
# x1 x2 x1_mean x2_mean
# 1: 1 1 3 3
# 2: 2 2 3 3
# 3: 3 3 3 3
# 4: 4 4 3 3
# 5: 5 5 3 3
Customization is straightforward:
divby2 <- function(x) x/2 # custom function
myfun <- "divby2"
mycols <- "x1" # custom columns
x[,paste(mycols,myfun,sep="_"):=lapply(.SD,myfun),.SDcols=mycols]
# x1 x2 x1_mean x2_mean x1_divby2
# 1: 1 1 3 3 0.5
# 2: 2 2 3 3 1.0
# 3: 3 3 3 3 1.5
# 4: 4 4 3 3 2.0
# 5: 5 5 3 3 2.5
We may some day have syntax like paste(.SDcols,myfun,sep="_"):=lapply(.SD,myfun), but .SDcols on the left-hand side is not supported currently.
Making a function. If you want a function to do this, there's
add_myfun <- function(DT,myfun,mycols){
DT[,paste(mycols,myfun,sep="_"):=lapply(.SD,myfun),.SDcols=mycols]
}
add_myfun(x,"median","x2")
Can a function be written that will work inside j of DT[i,j]? Maybe. But I think it's not a good idea.
Can you be sure your function will be robust to all the other uses of j, like by?
Can your function take advantage of data.table's optimization (e.g., of mean)?
Will anyone else be able to read your code?
Using [ can be slow. If you're doing this for many columns, you might be better off initializing the new columns and assigning with set.
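As a rough illustration of the set approach, the loop below adds one summary column per input column. It is a sketch, assuming the data.table package is installed; the "_mean" suffix and the mycols vector are illustrative choices, not part of any fixed API.

```r
library(data.table)

x <- data.table(x1 = 1:5, x2 = 1:5)
mycols <- c("x1", "x2")   # columns to summarise (illustrative)

for (col in mycols) {
  # set() assigns by reference; the length-1 mean is recycled over all rows
  set(x, j = paste0(col, "_mean"), value = mean(x[[col]]))
}
print(x)
```

Each call to set skips the overhead of [.data.table, which is why it tends to be preferred inside loops over many columns.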

R : group by with custom functions

I have managed to aggregate data successfully using the following pattern:
newdf <- setDT(df)[, list(X=sum(x),Y=max(y)), by=Z]
However, the moment I try to do anything more complicated, although the code runs, it no longer aggregates by Z: it seems to create a dataframe with the same number of observations as the original df so I know that no grouping is actually occurring.
The custom function I would like to apply is to find the n-quantile for the current list of values and then do some other stuff with it. I saw use of sdcols in another SO answer and tried something like:
customfunc <- function(dt){
q = unname(quantile(dt$column,0.25))
n = nrow(dt[dt$column <= q])
return(n/dt$someOtherColumn)
}
#fails to group anything!!! also rather slow...
newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c("column", "someOtherColumn")]
Can someone please help me figure out what is wrong with the way I'm trying to use group by and custom functions? Thank you very much.
Literal example as requested:
> df <- data.frame(Z=c("abc","abc","def","abc"), column=c(1,2,3,4), someOtherColumn=c(5,6,7,8))
> df
Z column someOtherColumn
1 abc 1 5
2 abc 2 6
3 def 3 7
4 abc 4 8
> newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c("column", "someOtherColumn")]
> newdf
Z V1
1: abc 0.2000000
2: abc 0.1666667
3: abc 0.1250000
4: def 0.1428571
>
As you can see, DF is not grouped. There should just be two rows, one for "abc", and another for "def" since I am trying to group by Z.

As guided by eddi's point above, the basic problem is thinking that your custom function is being called inside a loop and that 'dt$column' will mysteriously give you the 'current value at the current row'. Instead it gives you the entire column (a vector). The function is passed the entire data table, not row-wise bits of data.
So, replacing the value in the return statement with something that represents a single value works. Example:
customfunc <- function(dt){
q = unname(quantile(dt$column,0.25))
n = nrow(dt[dt$column <= q])
return(n/length(dt$someOtherColumn))
}
> df <- data.frame(Z=c("abc","abc","def","abc"), column=c(1,2,3,4), someOtherColumn=c(5,6,7,8))
> df
Z column someOtherColumn
1 abc 1 5
2 abc 2 6
3 def 3 7
4 abc 4 8
> newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c("column", "someOtherColumn")]
> newdf
Z V1
1: abc 0.3333333
2: def 1.0000000
Now the data is aggregated correctly.
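The rule of thumb is that j must return one value per group if you want one row per group. A minimal sketch of that, assuming data.table is installed: dt_demo and share are made-up names, and the expression is a compact stand-in for the custom function rather than a replacement for whatever "other stuff" you need to compute.

```r
library(data.table)

dt_demo <- data.table(Z = c("abc", "abc", "def", "abc"),
                      column = c(1, 2, 3, 4))

# For each group: fraction of values at or below the group's 25% quantile.
# mean() of a logical vector collapses the group to a single number.
res <- dt_demo[, .(share = mean(column <= quantile(column, 0.25))), by = Z]
print(res)
```

If the expression in j returned a vector as long as the group (for example by dividing by a column), the result would again have one row per original row.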

Extract data elements found in a single column

Here is what my data look like.
id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{
As you can see, there can be multiple codes concatenated into a single column, separated by {. It is also possible for a row to have no interest_string value at all.
How can I manipulate this data frame to extract the values into a format like this:
id interest
1 YI
1 Z0
1 ZI
2 ZO
3 <NA>
4 ZT
I need to complete this task with R.
Thanks in advance.

This is one solution
out <- with(dat, strsplit(as.character(interest_string), "\\{"))
## or
# out <- with(dat, strsplit(as.character(interest_string), "{", fixed = TRUE))
out <- cbind.data.frame(id = rep(dat$id, times = sapply(out, length)),
interest = unlist(out, use.names = FALSE))
Giving:
R> out
id interest
1 1 YI
2 1 Z0
3 1 ZI
4 2 ZO
5 3 <NA>
6 4 ZT
Explanation
The first line of the solution simply splits each element of the interest_string factor in the data object dat, using \\{ as the split pattern. The brace has to be escaped, and in an R string that requires two backslashes. (Escaping is unnecessary if you use fixed = TRUE in the call to strsplit.) The resulting object is a list, which looks like this for the example data:
R> out
[[1]]
[1] "YI" "Z0" "ZI"
[[2]]
[1] "ZO"
[[3]]
[1] "<NA>"
[[4]]
[1] "ZT"
We have almost everything we need in this list to form the output you require. The only thing we need external to this list is the id values that refer to each element of out, which we grab from the original data.
Hence, in the second line, we bind, column-wise (specifying the data frame method so we get a data frame returned) the original id values, each one repeated the required number of times, to the strsplit list (out). By unlisting this list, we unwrap it to a vector which is of the required length as given by your expected output. We get the number of times we need to replicate each id value from the lengths of the components of the list returned by strsplit.
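The two steps can be put together in a self-contained form. This is a sketch in base R under two assumptions: the missing value is a real NA rather than the string "<NA>", and the column is character rather than factor. It uses lengths(), available from R 3.2, as shorthand for sapply(out, length).

```r
# Example data rebuilt by hand (assumed to mirror the question's table)
dat <- data.frame(id = 1:4,
                  interest_string = c("YI{Z0{ZI{", "ZO{", NA, "ZT{"),
                  stringsAsFactors = FALSE)

# Split on a literal "{"; a trailing "{" does not produce an empty last piece
parts <- strsplit(dat$interest_string, "{", fixed = TRUE)

# Repeat each id once per extracted code, then flatten the list of codes
out <- data.frame(id = rep(dat$id, times = lengths(parts)),
                  interest = unlist(parts, use.names = FALSE),
                  stringsAsFactors = FALSE)
print(out)
```

A row whose interest_string is NA yields a single NA piece, so its id still appears once in the output.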

A nice and tidy data.table solution:
library(data.table)
DT <- data.table( read.table( textConnection("id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{"), header=TRUE))
DT$interest_string <- as.character(DT$interest_string)
DT[, {
list(interest=unlist(strsplit( interest_string, "{", fixed=TRUE )))
}, by=id]
gives me
id interest
1: 1 YI
2: 1 Z0
3: 1 ZI
4: 2 ZO
5: 3 <NA>
6: 4 ZT

big table processing (advice needed)

I have a table of 55000 rows (the code to generate sample data is below).
Now I need to convert every row of this table into 6 rows, each containing one letter of "hexamer", with some calculations:
# input for the function is one row of source table, output is 6 rows
splithexamer <- function(x){
dir <- x$dir # strand direction: +1 or -1
pos <- x$pos # hexamer position
out <- x[0,] # template of output
hexamer <- as.character(x$hexamer)
for (i in 1:nchar(hexamer)) {
letter <- substr(hexamer, i, i)
if (dir==1) {newpos <- pos+i-1;}
else {newpos <- pos+6-i;}
y <- x
y$pos <- newpos
y$letter <- letter
out <- rbind(out,y)
}
return(out);
}
# Sample data generation:
set.seed(123)
size <- 55000
letters <- c("G","A","T","C")
df<-data.frame(
HSid=paste0("Hs.", 1:size),
hexamer=replicate(n=size, paste0(sample(letters,6,replace=T), collapse="")),
chr=sample(c(1:23,"X","Y"),size,replace=T),
pos=sample(1:99999,size,replace=T),
dir=sample(c(1,-1),size,replace=T)
)
Now I would like some advice on the most efficient way to apply my function to every row. So far I have tried the following:
# Variant 1: for() with rbind
tmp <- data.frame()
for (i in 1:nrow(df)){
tmp<-rbind(tmp,splithexamer(df[i,]));
}
# Variant 2: for() with direct writing to file
for (i in 1:nrow(df)){
write.table(splithexamer(df[i,]),file="d:/test.txt",append=TRUE,quote=FALSE,col.names=FALSE)
}
# Variant 3: ddply
tmp<-ddply(df, .(HSid), .fun=splithexamer)
# Variant 4: apply - I don't know correct syntax
tmp<-apply(X=df, 1, FUN=splithexamer) # this causes an error
All of the above are extremely slow. I am wondering if there's a better way to solve this task.

Solution using data.table:
df$hexamer <- as.character(df$hexamer)
dt <- data.table(df)
dt[, id := seq_len(nrow(df))]
setkey(dt, "id")
dt.out <- dt[, { mod.pos <- pos:(pos+5); if(dir == -1) mod.pos <- rev(mod.pos);
list(split = unlist(strsplit(hexamer, "")),
mod.pos = mod.pos)}, by=id][dt][, id := NULL]
dt.out
# split mod.pos HSid hexamer chr pos dir
# 1: G 95982 Hs.1 GCTCCA 5 95982 1
# 2: C 95983 Hs.1 GCTCCA 5 95982 1
# 3: T 95984 Hs.1 GCTCCA 5 95982 1
# 4: C 95985 Hs.1 GCTCCA 5 95982 1
# 5: C 95986 Hs.1 GCTCCA 5 95982 1
# ---
# 329996: A 59437 Hs.55000 AATCTG 7 59436 1
# 329997: T 59438 Hs.55000 AATCTG 7 59436 1
# 329998: C 59439 Hs.55000 AATCTG 7 59436 1
# 329999: T 59440 Hs.55000 AATCTG 7 59436 1
# 330000: G 59441 Hs.55000 AATCTG 7 59436 1
Explanation of the main line:
The by=id will group by id and since they are all unique, it'll group by every line, one at a time.
Then, the code within {} sets mod.pos to pos:(pos+6-1) and reverses it if dir == -1.
Now, the list argument: It creates the column split by creating 6 nucleotides from your hexamer using strsplit and also sets mod.pos which we've already calculated in the step before.
This will result in a data.table with columns id, split and mod.pos.
The next part [dt] is a typical usage of data.table's X[Y] syntax which performs a join on the data.tables based on the key column ( = id, here). Since there are 6 rows for every id you get all the other columns in dt duplicated during this join.
I'd suggest you take a look at data.table FAQ first and then its documentation (intro). These links can be obtained by installing the package and loading it and then typing ?data.table. I also suggest you work through the many examples in there one by one with a test data.table to understand practically the features of data.table.
Hope this helps.
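For comparison, the same expansion can be done without grouping by repeating row indices. This is a base R sketch that assumes every hexamer has exactly 6 letters; the two-row df and the names k, idx and offset are illustrative, and the position arithmetic follows the splithexamer function from the question.

```r
# Tiny stand-in for the 55000-row table (values are illustrative)
df <- data.frame(HSid = c("Hs.1", "Hs.2"),
                 hexamer = c("GCTCCA", "AATCTG"),
                 pos = c(100L, 200L),
                 dir = c(1, -1),
                 stringsAsFactors = FALSE)

k <- 6L
idx    <- rep(seq_len(nrow(df)), each = k)    # each source row repeated 6 times
offset <- rep(seq_len(k), times = nrow(df))   # letter index 1..6 within each hexamer

out <- df[idx, ]
out$letter <- substring(out$hexamer, offset, offset)
# Forward strand counts up from pos; reverse strand counts down from pos + 5
out$pos <- ifelse(out$dir == 1, out$pos + offset - 1L, out$pos + k - offset)
rownames(out) <- NULL
print(out)
```

Everything here operates on whole vectors, so there is no per-row function call or repeated rbind.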
