Speeding up word splitting in R

I've written a function that splits words into single letters and then creates a two-column data frame with those letters and their position in the original word, expressed as a fraction of the word length (0 for the first letter, 1 for the last). It looks like this:
pozycje.literek <- function(slowo){
literki <- unlist(strsplit(slowo,""))
liczby <- seq(0,length(literki)-1) / (length(literki)-1)
pozycje <- data_frame(literki, liczby)
return(pozycje)
}
The function does what I need, but it is awfully slow. With the example below, which has 10 thousand elements, it took 52 seconds (just the second loop, not counting the generation of the random example vector of characters). And the vectors I'm dealing with have more than 500 thousand elements.
wektor <- vector()
for(i in 1:10000){
wektor[i] <- paste0(sample(letters[1:24], round(runif(1,3,10),0)),collapse = "")
}
tabelka <- data.frame()
system.time(for(i in wektor){
tabelka <- rbind(tabelka, pozycje.literek(i)) # the table for a given country is built here; after that the code is shared, because I rename it to 'tabelka'
})
Any idea how to speed it up? I couldn't think of a way to use the apply family for this, but I believe there might be one. Or could the job my function does be done in a completely different way?

literki <- strsplit(wektor, "")
x <- lengths(literki)
liczby <- lapply(x, function(x) seq(0, x-1)/(x-1))
pozycje <- data_frame(literki = unlist(literki), liczby = unlist(liczby))
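The same idea can be sketched with base R only, avoiding the per-word lapply() call. This is a minimal sketch, not the answer's own code: the two-word input and the name `pozycja` are assumptions made for illustration, and single-letter words would still produce NaN (0/0), just as in the original function.

```r
# Toy input (hypothetical); the real vector would be 'wektor'
wektor <- c("kot", "pies")

literki <- strsplit(wektor, "")
n <- lengths(literki)

# sequence(n) yields 1..n for every word, concatenated;
# rep(n - 1, times = n) repeats each word's denominator once per letter
pozycja <- (sequence(n) - 1) / rep(n - 1, times = n)

pozycje <- data.frame(literki = unlist(literki),
                      liczby = pozycja,
                      stringsAsFactors = FALSE)
print(pozycje)
```

Everything here operates on whole vectors, so no data frame is grown inside a loop, which is what made the rbind() version slow.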

Related

Loop Changing to Matrix then Running tests

I have a dataframe with ~9000 rows of human coded data in it, two coders per item so about 4500 unique pairs. I want to break the dataset into each of these pairs, so ~4500 dataframes, run a kripp.alpha on the scores that were assigned, and then save those into a coder sheet I have made. I cannot get the loop to work to do this.
I can get it to work individually, using this:
example.m <- as.matrix(example.m)
s <- kripp.alpha(example.m)
example$alpha <- s$value
However, when trying a loop I am getting either "Error in get(v) : object 'NA' not found" when running this:
for (i in items) {
v <- i
v <- v[c("V1","V2")]
v <- assign(v, as.matrix(get(v)))
s <- kripp.alpha(v)
i$alpha <- s$value
}
Or I get "In i$alpha <- s$value : Coercing LHS to a list" when running:
for (i in items) {
i.m <- i[c("V1","V2")]
i.m <- as.matrix(i.m)
s <- kripp.alpha(i.m)
i$alpha <- s$value
}
Here is an example set of data. Items is a list of individual dataframes.
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1),nrow=2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1),nrow=2))
items <- c("l","t")
I am sure this is a basic question, but what I want is for each file, i, to add a column with the alpha score at the end. Thanks!
Your problem is with scoping and with extracting objects that are referenced through strings. You'd need to eval() some of your objects to make your current approach work.
Here's another solution:
library("irr") # For kripp.alpha
# Produce the data
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1),nrow=2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1),nrow=2))
# Collect the data as a list right away
items <- list(l, t)
Now you can sapply() directly over the elements in the list.
sapply(items, function(v) {
kripp.alpha(as.matrix(v[c("V1","V2")]))$value
})
which produces
[1] 0.0 -0.5
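The question also asked for the score to be added as a column to each data frame. A minimal sketch of that step follows, assuming the data frames are kept in a named list. To keep it runnable without the irr package, `agreement()` is a hypothetical stand-in for `kripp.alpha(m)$value`; it is not Krippendorff's alpha.

```r
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1), nrow = 2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1), nrow = 2))

# Keep the data frames themselves (not their names) in a named list
items <- list(l = l, t = t)

# Hypothetical stand-in statistic: share of rows where V1 and V2 agree.
# With irr installed, swap in: function(m) irr::kripp.alpha(m)$value
agreement <- function(m) mean(m[, 1] == m[, 2])

# lapply() returns a new list, so each modified copy is returned explicitly
items <- lapply(items, function(df) {
  df$alpha <- agreement(as.matrix(df[c("V1", "V2")]))
  df
})

print(items$l)
```

Returning `df` from the function is what makes the new column stick; assigning to the loop variable in a for loop only changes a temporary copy.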

How can I stop a while loop and start another while loop where the previous one stopped

I am trying to read the letters in the data frame in groups of three, advancing one letter at a time, until I find "t a c". From there I want to group the rest of them in threes, advancing three letters at a time, until I find "a t t".
An example of what I mean:
"agttacgtaattatgat"
It should produce:
agt,gtt,tta,tac stop, gta,att stop ,atg,tga,gat
(The data frame's name is agen.)
My code for that:
y=c()
x=1
while(x<853){
x=x+1
rt<-paste(agen[x],agen[x+1],agen[x+2])
y=c(y,rt)
ff<-data.frame(y)
if(ff=="t a c"){break}
}
ay=c()
while(x<853){
x=x+3
art<-paste(agen[x],agen[x+1],agen[x+2])
ay=c(ay,art)
aff<-data.frame(ay)
if(aff=="a t t"){break}
}
The first one works fine, but the second one does not break.
There will be a lot of stops and starts in the code, so can you help me write a loop that can do the job?
I only roughly understand what you need, but here is a code example that may do it. I used the example you specified, with a vector holding your DNA bases as elements instead of a 'data frame'. I also changed a few style details.
agen_string <- "agttacgtaattatgat"
# This is a character vector, not a data frame. I don't know why you would use a data frame here.
agen <- strsplit(agen_string, split = "")[[1]]
y <- c()
x <- 0 # Start at 0; otherwise a 'tac' at the very beginning would be missed
# Search for the 'tac' triplet
while(x < length(agen)){
x <- x + 1
rt <- paste(agen[x], agen[x+1], agen[x+2], sep = "")
print(rt)
y <- c(y, rt)
#ff <- data.frame(y)
if(rt == "tac"){
print("stop")
break
}
}
ay <- c()
while(x < length(agen)) {
x <- x + 3
art <- paste(agen[x], agen[x+1], agen[x+2], sep = "")
print(art)
ay <- c(ay, art)
#aff<-data.frame(ay)
if(art == "att"){
print("stop")
break
}
}
If you work with DNA sequences a lot, you may want to use a more specialized R package, such as Biostrings.
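Since the question mentions many stops and starts, the two loops can be folded into one helper that is called once per phase. This is a sketch under assumptions: `scan_triplets()` and its arguments are hypothetical names, and each new phase is assumed to begin three letters after the position where the previous one stopped.

```r
# Hypothetical helper: read triplets from 'start', advancing by 'step',
# until 'stop_at' is found or the sequence runs out
scan_triplets <- function(bases, start, step, stop_at) {
  found <- character(0)
  x <- start
  while (x + 2 <= length(bases)) {          # never read past the end
    triplet <- paste(bases[x:(x + 2)], collapse = "")
    found <- c(found, triplet)
    if (triplet == stop_at) break
    x <- x + step
  }
  list(triplets = found, pos = x)           # pos: where the scan ended
}

agen <- strsplit("agttacgtaattatgat", "")[[1]]

first  <- scan_triplets(agen, start = 1,              step = 1, stop_at = "tac")
second <- scan_triplets(agen, start = first$pos + 3,  step = 3, stop_at = "att")
third  <- scan_triplets(agen, start = second$pos + 3, step = 1, stop_at = "tac")

print(first$triplets)
print(second$triplets)
print(third$triplets)
```

Returning the final position alongside the triplets is what lets the next call pick up where the previous one stopped.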

R for-loop iterating from central value out to extremes

I'm trying to improve the speed of my code, which optimises a value using 3 variables that have large ranges. The most likely output uses values in the middle of the ranges, so it wastes time by starting from the lowest possible value of each variable. I want to start from the middle value and iterate outwards. The actual problem has thousands of lines with numbers from 150 to 650. The C, H and O limits will be defined somewhat based on the starting number, but a match will always be more likely at a central value in the defined range. Is there a way to define the for loop to work outwards like I want? The only, quite shabby, way I can think of is to simply redefine the value within the loop from a vector (e.g. 1=20, 2=21, 3=19, etc.). See the current code below:
set_error<-2.5
ct<-c(325.00214,325.00952,325.02004,325.02762,325.03535,325.03831,325.04588, 325.05641,325.06402,325.06766,325.07167,325.07454,325.10396)
FormFun<-function(x){
for(C in 1:40){
for(H in 1:80){
for(O in 1:40){
test_mass=C*12+H*1.007825+O*15.9949146-1.0072765
error<-1000000*abs(test_mass-x)/x
if(error<set_error){
result<-paste("C",C,"H",H,"O",O,sep ="")
return(result)
break;break;break;break
}
}
}
}
}
old_t <- Sys.time()
ct2<-lapply(ct,FormFun)
new_t <- Sys.time() - old_t # calculate difference
print(new_t)
Use vectorization and create a closure:
FormFun1_fac <- function(gr) {
force(gr) # evaluate the argument now so the grid is captured in the closure
function(x, set_error){
test_mass <- with(gr, C*12+H*1.007825+O*15.9949146-1.0072765)
error <- 1000000 * abs(test_mass - x) / x
ind <- which(error < set_error)[1]
if (is.na(ind)) return(NULL)
paste0("C", gr[ind, "C"],"H", gr[ind, "H"],"O", gr[ind, "O"])
}
}
FormFun1 <- FormFun1_fac(expand.grid(C = 1:40, H = 1:80, O = 1:40))
ct21 <- lapply(ct, FormFun1, set_error = set_error)
all.equal(ct2, ct21)
#[1] TRUE
This saves a grid of all combinations of C, H, O in the function environment and calculates the error for all combinations (which is fast in vectorized code). The first combination that passes the test is returned.
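To get the "centre out" behaviour the question asked for, one option is to sort the grid by distance from the middle of each range before searching, so that the first match is also the most central one. The sketch below is an assumption-laden illustration: the midpoints (20, 40, 20), the unscaled squared distance and the function name `find_formula()` are all choices made here, not part of the answer above.

```r
gr <- expand.grid(C = 1:40, H = 1:80, O = 1:40)

# Squared distance of every combination from the assumed range midpoints
dist2 <- with(gr, (C - 20)^2 + (H - 40)^2 + (O - 20)^2)
gr_sorted <- gr[order(dist2), ]   # most central combinations come first

# Hypothetical lookup: same mass formula as above, searched in sorted order
find_formula <- function(x, grid, set_error = 2.5) {
  mass <- with(grid, C * 12 + H * 1.007825 + O * 15.9949146 - 1.0072765)
  ind <- which(1e6 * abs(mass - x) / x < set_error)[1]
  if (is.na(ind)) return(NULL)
  with(grid[ind, ], paste0("C", C, "H", H, "O", O))
}

# A mass built from the central combination is found at the first row
x <- 20 * 12 + 40 * 1.007825 + 20 * 15.9949146 - 1.0072765
print(find_formula(x, gr_sorted))
```

In vectorized code the ordering does not save time, because every row is evaluated anyway; it only changes which of several passing combinations is returned.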

Indexing certain elements in a nested list, for all nests

I have a list which contains more lists of lists:
results <- sapply(c(paste0("cv_", seq(1:50)), "errors"), function(x) NULL)
## Locations for results to be stored
step_results <- sapply(c("myFit", "forecast", "errors"), function(x) NULL)
step_errors <- sapply(c("MAE", "MSE", "sign_accuracy"), function(x) NULL)
final_error <- sapply(c("MAE", "MSE", "sign_accuracy"), function(x) NULL)
for(i in 1:50){results[[i]] <- step_results}
for(i in 1:50){results[[i]][[3]] <- step_errors}
results$errors <- final_error
Now in this whole structure, I would like to sum up all the values in sign_accuracy and save them in results$errors$sign_accuracy
I could maybe do this with a for-loop, indexing with i:
## This is just an example - it won't actually work!
sign_acc <- matrix(nrow = 50, ncol = 2)
for (i in 1:50){
sign_acc[i, ] <- `results[[i]][[3]][[3]]`
results$errors$sign_accuracy <- sign_acc
}
If I remember correctly, in Matlab there is something like list(:), which means all elements. In Python I have seen something like list(0:-1), which also means all elements.
What is the elegant R equivalent? I don't really like loops.
I have seen methods using the apply family of functions, with something like apply(data, "[[", 2), but I can't get that to work for deeper lists.
Did you try c(..., recursive = TRUE)?
Here is an option with a short example at the end:
sumList <- function(l, label) {
lc <- c(l, recursive = TRUE)
filter <- grepl(paste0("\\.",label, "$"), names(lc)) | (names(lc) == label)
nums <- lc[filter]
return(sum(as.numeric(nums)))
}
ex <- list(a=56,b=list("5",a=34,list(c="3",a="5")))
sumList(ex,"a")
In this case, you can do what you want with
results$errors$sign_accuracy <- do.call(sum, lapply(results, function(x){x[[3]][[3]]}))
lapply loops through the first layer of results and pulls out the third element of the third element of each. do.call(sum, ...) then passes all of those values to sum, which adds them up.
The real problems with lists arise when the nesting is more irregular, or when you need to loop through more than one index. It can always be done in the same way, but it gets extraordinarily ugly very quickly.
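Here is a small self-contained sketch of the same extraction, using names instead of positions. The two-fold list and its numbers are made up for illustration; only the element names follow the question.

```r
# Toy version of the nested structure (values are invented)
results <- list(
  cv_1 = list(myFit = NULL, forecast = NULL,
              errors = list(MAE = 1, MSE = 2, sign_accuracy = 0.5)),
  cv_2 = list(myFit = NULL, forecast = NULL,
              errors = list(MAE = 3, MSE = 4, sign_accuracy = 0.75))
)

# Pull the same nested element out of every fold; sapply() simplifies
# the result to a named numeric vector
acc <- sapply(results, function(x) x$errors$sign_accuracy)
total <- sum(acc)

print(acc)
print(total)
```

Indexing by name (`x$errors$sign_accuracy`) rather than `x[[3]][[3]]` keeps working if the order of the elements changes.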

Store results of a for-loop in an object or matrix

I have the following problem:
I use a for loop in R to get specific data from a matrix.
My code is as follows.
for(i in 1:100){
T <- as.Date(as.mondate (STARTLISTING)+i)
DELIST <- (subset(datensatz_Start_End.frame, TIME <= T))[,1]
write.table(DELIST, file = paste("tab", i, ".csv"), sep="," )
print(DELIST)
}
Using print, R displays the data.
Using write.table, R writes the data to separate files.
My aim is to collect the results from the for loop in one matrix (one row for each 'i').
Unfortunately, I can't get this to work.
Sorry, I'm a real beginner in R.
for(i in 1:100)
{
T <- as.Date(as.mondate (STARTLISTING)+i)
DELIST <- (subset(datensatz_Start_End.frame, TIME <= T))[,1]
assign(paste('b',i,sep=''),DELIST)
}
This produces 100 objects that contain my results.
But what I need is one matrix/data frame with 100 columns, or one list.
Any ideas?
Hey!
Since I'm not allowed to edit my own answers, here is my (simple) solution:
DELIST <- vector("list",100)
for(i in 1:100)
{
T <- as.Date(as.mondate (STARTLISTING)+i)
DELIST[[i]] <- as.character((subset(datensatz_Start_End.frame, TIME <= T))[,1])
}
DELIST[[99]] ## the relevant companies can be requested for every 'i'
Thx to everyone!
George
If you want a list, you can use lapply instead of a loop:
LL <- lapply(1:100,
function(i) {
T <- as.Date(as.mondate (STARTLISTING)+i)
# the value of the last expression becomes element i of LL
(subset(datensatz_Start_End.frame, TIME <= T))[,1]
}
)
After that you can rbind the results together using do.call:
result <- do.call(rbind, LL)
Or, if you are confident that all elements of LL have the same columns, you can use the more efficient rbindlist from the data.table package (it expects each element to be a list, data.frame or data.table):
result <- rbindlist(LL)
Check out the rbind function. You can start with an empty DELIST.DF and append each row to it inside the loop:
DELIST.DF <- NULL
for(i in 1:100){
T <- as.Date(as.mondate (STARTLISTING)+i)
DELIST <- (subset(datensatz_Start_End.frame, TIME <= T))[,1]
DELIST.DF <- rbind(DELIST.DF, DELIST)
write.table(DELIST, file = paste("tab", i, ".csv"), sep="," )
print(DELIST)
}
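Because the number of matches can differ for each 'i', a named list is often the more natural container than a matrix. The following is a minimal sketch with made-up data: `firmy`, `cutoffs` and `delisted` are hypothetical names, and base R's seq() on dates stands in for the mondate arithmetic used above.

```r
# Toy data standing in for datensatz_Start_End.frame (hypothetical)
firmy <- data.frame(
  name = c("A", "B", "C"),
  TIME = as.Date(c("2020-01-15", "2020-02-15", "2020-03-15")),
  stringsAsFactors = FALSE
)

# One cutoff date per iteration: 2020-02-01, 2020-03-01, 2020-04-01
start <- as.Date("2020-01-01")
cutoffs <- seq(start, by = "month", length.out = 4)[-1]

# One list element per cutoff; the elements may have different lengths
delisted <- lapply(seq_along(cutoffs), function(i) {
  firmy$name[firmy$TIME <= cutoffs[i]]
})
names(delisted) <- paste0("b", seq_along(cutoffs))

print(delisted)
```

A list avoids the recycling that rbind() applies when rows of unequal length are combined, and each result can still be reached as `delisted$b2` or `delisted[[2]]`.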

Resources