I want to create a model where I duplicate a sentence several times, introducing random error each time. The duplicates of the sentence also get duplicated. So, in cycle one, I start with just "example_sentence". In cycle two, I have two copies of that sentence. In cycle three, I have 4 copies of that sentence. I want to do this for 25 cycles with 20k sentences. The code I wrote to do that works way too slowly, and I am wondering if there is a way to make my nested for loops more efficient? Here is the part of the code that is the slowest:
alphabet <- c("a","b","d","j")
modr1 <- "sentencetoduplicate"
errorRate <- c()
errorRate <- append(errorRate, rep(1,1))
errorRate <- append(errorRate, rep(0,999))
duplicate <- c(modr1)
for (q in 1:25) {
collect <- c()
for (z in seq_along(duplicate)) {
modr1 = duplicate[z]
compile1 <- c()
for (k in 1:nchar(modr1)) {
error <- sample(errorRate, 1, replace = TRUE)
if (error == 1) {
compile1 <- append(compile1, sub(substring(modr1,k,k),sample(alphabet,1,replace=TRUE),substring(modr1,k,k)))
} else {
compile1 <- append(compile1, substring(modr1,k,k))
}
}
modr1 <- paste(compile1, collapse='')
collect <- append(collect, modr1)
}
duplicate <- append(duplicate, collect)
}
Here is a faster approach to your loop, but I think the difficulty of scaling it to 20k sentences remains!
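library(data.table)  # provides data.table() and := used below
f <- function(let, alphabet = c("a","b","c","d","j"),error_rate=1/1000) {
lenlet=length(let)
let = unlist(let)
# one Bernoulli draw per character: 1 means this character gets an error
k <- rbinom(length(let),1,prob = error_rate)
let[k==1] <- sample(alphabet,size = sum(k==1), replace=TRUE)
# stringsAsFactors = FALSE keeps the letters as characters on older R versions
return(as.list(as.data.frame(matrix(let, ncol=lenlet), stringsAsFactors = FALSE)))
}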
modr1 <- "sentencetoduplicate"
k <- data.table(list(strsplit(modr1,"")[[1]]))
for(q in 1:25) {
k[, V1:=list(f(V1))]
k <- k[rep(1:nrow(k),2)]
}
Updated with slightly faster version! (Notice this is no longer by=1:nrow(k))
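To make the scaling concern concrete: each cycle doubles the count, so 25 cycles turn one sentence into 2^25 (about 33.5 million) copies. Below is a minimal base-R sketch of one way to vectorize a cycle, assuming the question's scheme of keeping the originals and appending mutated copies. The helper name mutate_cycle and the choice of 10 cycles are illustrative assumptions, not part of the answer above.

```r
# Hypothetical helper (not from the answer above): one duplication cycle.
# m is a character matrix: one row per position, one column per copy.
mutate_cycle <- function(m, alphabet, error_rate) {
  copy <- m
  hit <- which(runif(length(copy)) < error_rate)  # cells that receive an error
  copy[hit] <- sample(alphabet, length(hit), replace = TRUE)
  cbind(m, copy)                                  # originals plus mutated copies
}

set.seed(1)
alphabet <- c("a", "b", "d", "j")
sentence <- "sentencetoduplicate"
m <- matrix(strsplit(sentence, "")[[1]], ncol = 1)

# 10 cycles only: the column count doubles every cycle.
for (q in 1:10) m <- mutate_cycle(m, alphabet, error_rate = 1 / 1000)

copies <- apply(m, 2, paste, collapse = "")  # back to one string per copy
length(copies)
```

The whole matrix is mutated in one call per cycle, so there is no per-character loop, but memory still grows exponentially with the number of cycles.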
I have a deeply nested for loop, and once the innermost loop (here, clip.groups) completes I need to go back to the outermost loop and start its next pass. I have tried several options. Each layer involves rasters, so I cannot vectorize it with apply and friends. Because there are so many input files, I cannot provide a reproducible example.
The basic structure though is this:
clip.groups <- c('Bay area','Alameda County','Oakland','West and Downtown Oakland')
rate.groups <- c('co.25','cbg.25')
conc.groups <- c('ppb', 'ug')
pop.groups <- c('pop.ls.night.25')
beta.groups <- c(0.001105454,0.000318195,0.001881231)
for (j in 1:length(conc.groups)){
  for (i in 1:length(beta.groups)){
    for (k in 1:length(rate.groups)){
      for (h in 1:length(pop.groups)){
        for (m in 1:length(clip.groups)){
          break #==== THIS IS WHERE I NEED IT TO GO BACK TO THE OUTER MOST LOOP - (conc.groups j)
        }
      }
    }
  }
}
If you go back to the outermost loop, then the in-between loops are meaningless, because each of them only ever reaches its first element. That is, you get this:
clip.groups <- c('Bay area','Alameda County','Oakland','West and Downtown Oakland')
rate.groups <- c('co.25','cbg.25')
conc.groups <- c('ppb', 'ug')
pop.groups <- c('pop.ls.night.25')
beta.groups <- c(0.001105454,0.000318195,0.001881231)
for (j in 1:length(conc.groups)){
beta.groups[1]
rate.groups[1]
pop.groups[1]
for (m in 1:length(clip.groups)){
cat(j, "-", m, "\n")
}
}
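If the intermediate loops have to stay in the code for some reason, one possible pattern is to move them into a function and call return() once the innermost loop has finished, since break only leaves the loop it sits in. This is a sketch under that assumption. The helper run_inner is hypothetical, and only two of the intermediate levels are shown to keep it short.

```r
conc.groups <- c("ppb", "ug")
beta.groups <- c(0.001105454, 0.000318195, 0.001881231)
clip.groups <- c("Bay area", "Alameda County", "Oakland", "West and Downtown Oakland")

# Hypothetical helper: runs the inner loops for one conc.groups index and
# returns as soon as the clip.groups loop has completed once.
run_inner <- function(j) {
  visited <- character(0)
  for (i in seq_along(beta.groups)) {
    for (m in seq_along(clip.groups)) {
      visited <- c(visited, paste(j, i, m, sep = "-"))
    }
    return(visited)  # leaves every enclosing loop inside this function
  }
}

# The outer loop simply moves on to the next j after each return().
visited <- unlist(lapply(seq_along(conc.groups), run_inner))
print(visited)
```

Only i = 1 is ever visited, which matches the point made above about the in-between loops.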
I'm relatively new to R programming, but have a specific problem concerning the extraction of text from a syntactically parsed historical language corpus. The problem should be easy to solve, but I just can't get my head around it. My question is basically a more specific variation of this one: R: parse nested parentheses
I would like to parse nested parentheses in R. Here is an example of some data:
(sometext(NP-SBJ(D+N_THYSTORYE)(PP(P_OF)(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))))sometext)
From this string I would like to extract all (potentially nested) substrings that begin with "NP", so the result should be
(NP-SBJ(D+N_THYSTORYE)(PP(P_OF)(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))))
(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))
(NPR_REYNARD)
(NP-PRN(D_THE)(N_FOXE))
Any help would be much appreciated!
This probably isn't the most efficient, but here's a function which can extract the "tokens" or strings between matched parentheses.
find_tokens <- function(s) {
stopifnot(length(s)==1)
mm <- gregexpr("[)()]", s)
stack <- numeric()
starts <- numeric()
stops <- numeric()
Map(function(i, v) {
if(v=="(") {
stack <<- c(stack, i)
} else if (v==")") {
starts <<- c(starts, tail(stack, 1))
stops <<- c(stops, i)
stack <<- stack[-length(stack)]
}
}, mm[[1]], regmatches(s, mm)[[1]])
rev(substring(s, starts, stops))
}
This will extract everything. If you want to keep just the values that start with "(NP" you can just grep this list
grep("^\\(NP", find_tokens(s), value=TRUE)
# [1] "(NP-SBJ(D+N_THYSTORYE)(PP(P_OF)(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))))"
# [2] "(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))"
# [3] "(NP-PRN(D_THE)(N_FOXE))"
# [4] "(NPR_REYNARD)"
Here's another possible implementation of find_tokens that might be more efficient and that better supports multiple strings, returning the results as a list.
find_tokens <- function(s) {
mm <- gregexpr("[)()]", s)
vv <- regmatches(s, mm)
extr <- function(x, mm, vv) {
open_i <- 0
shut_i <- 0
open <- numeric(length(vv)/2)
shut <- numeric(length(vv)/2)
close <- numeric(length(vv)/2)
for(i in seq_along(mm)) {
if (vv[i]=="(") {
open_i <- open_i + 1
shut_i <- shut_i + 1
open[open_i] <- mm[i]
shut[shut_i] <- open_i
} else if (vv[i]==")") {
close[shut[shut_i]] <- mm[i]
shut_i <- shut_i - 1
}
}
substring(x, open, close)
}
unname(Map(extr, s, mm, vv))
}
and then you would use
lapply(find_tokens(s), function(x) grep("^\\(NP", x, value=TRUE))
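As a cross-check on the two implementations above, here is a different sketch that pairs parentheses by nesting depth rather than with an explicit stack. The function name match_parens is made up for illustration, and the approach assumes the parentheses in the input are balanced.

```r
# Hypothetical alternative: pair each "(" with its ")" using nesting depth.
match_parens <- function(s) {
  chars <- strsplit(s, "")[[1]]
  pos <- which(chars %in% c("(", ")"))        # positions of all parentheses
  step <- ifelse(chars[pos] == "(", 1L, -1L)
  depth <- cumsum(step)                       # depth after each parenthesis
  opens <- which(step == 1L)
  closes <- vapply(opens, function(o) {
    # the matching ")" is the first later parenthesis where the depth
    # drops below the depth reached by this "("
    later <- which(depth < depth[o])
    later[later > o][1]
  }, integer(1))
  substring(s, pos[opens], pos[closes])       # tokens in order of their "("
}

s <- "(sometext(NP-SBJ(D+N_THYSTORYE)(PP(P_OF)(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))))sometext)"
np_tokens <- grep("^\\(NP", match_parens(s), value = TRUE)
print(np_tokens)
```

Because tokens come back in the order of their opening parenthesis, the filtered result lists the four substrings in the same order as in the question.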
I just read that vectorization improves performance and significantly lowers computation time, and that for if () else the best choice is ifelse().
My problem is I got some if statements inside a for loop, and each if statement contains multiple assignments, like the following:
x <- matrix(1:10,10,3)
criteria <- matrix(c(1,1,1,0,1,0,0,1,0,0,
1,1,1,1,1,0,0,1,1,0,
1,1,1,1,1,1,1,1,1,1),10,3) #criteria for the ifs
output1 <- rep(list(NA),10) #storage list for output
for (i in 1:10) {
if (criteria[i,1]>=1) {
output1[[i]] <- colMeans(x)
output1[[i]] <- output1[[i]][1] #part of the somefunction output
} else {
if(criteria[i,2]>=1) {
output1[[i]] <- colSums(x)
output1[[i]] <- output1[[i]][1] #the same
} else {
output1[[i]] <- colSums(x+1)
output1[[i]] <- output1[[i]][1] #the same
}}}
How can I translate this into ifelse?
Thanks in advance!
Note that you don't need a for loop as all operations used are vectorized:
output2 <- ifelse(criteria[, 1] >= 1,
colMeans(x)[1],
ifelse(criteria[, 2] >= 1,
colSums(x)[1],
colSums(x+1)[1]))
identical(output1, as.list(output2))
## [1] TRUE
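Another way to read the nested ifelse() is as a two-step lookup: first decide which branch each row falls into, then pick the value for that branch. The sketch below spells this out using the same x and criteria as the question. The names choice, candidates and output3 are made up for illustration.

```r
x <- matrix(1:10, 10, 3)
criteria <- matrix(c(1,1,1,0,1,0,0,1,0,0,
                     1,1,1,1,1,0,0,1,1,0,
                     1,1,1,1,1,1,1,1,1,1), 10, 3)

# Step 1: branch index per row (1 = first test passes, 2 = second, 3 = neither).
choice <- ifelse(criteria[, 1] >= 1, 1L,
                 ifelse(criteria[, 2] >= 1, 2L, 3L))

# Step 2: compute each branch value once, then index by branch.
candidates <- c(colMeans(x)[1], colSums(x)[1], colSums(x + 1)[1])
output3 <- candidates[choice]
print(output3)
```

Each of the three values is computed once, no matter how many rows there are.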
At least you can convert two assignments into one. So instead of
output[[i]] <- somefunction(arg1,arg2,...)
output[[i]] <- output[[i]]$thing #part of the somefunction output
you can refer directly to the only part you are interested in.
output[[i]] <- somefunction(arg1,arg2,...)$thing #part of the somefunction output
Hope that it helps!
It seems I found the answer trying to build the example:
output2 <- rep(list(NA),10) #storage list for output
for (i in 1:10) {
output2[[i]] <- ifelse(criteria[i,1]>=1,
yes=colMeans(x)[1],
no=ifelse(criteria[i,2]>=1,
yes=colSums(x)[1],
no=colSums(x+1)[1]))}
Okay, so I have combed the internet for an answer to my problem and I can only put it down to me being a little naive in how R works.
Below is my code for a function that generates public and private keys from the system clock and uses it to attempt to decrypt an encrypted message. This bit works fine, but obviously as it goes through different random generations it comes back with a lot of garbage and NULL data.
I wanted to filter this out by using grep and testing whether the result of that grep was 1; if so, the decoded message would be put into a list.
The problem is that, no matter how I propose the if statement, my list gets cluttered with both the nonsense entries and the NULL entries.
I've tried !is.null, is.character, test == 1, etc., but nothing seems to work. Either the list doesn't get populated at all, or it gets populated by every entry that runs through the if statement.
Any advice would be appreciated. Thanks :)
Edit: Okay, forgive me, for these are copy and paste jobs to provide clarity. The first code is the code I'm using to encrypt the message.
require(gmp)
source("convert.R")
p <- nextprime(urand.bigz(size=51, seed=as.bigz(Sys.time())))
q <- nextprime(urand.bigz(size=50))
n <- p*q
finde <- function(phi) {
r <- floor(log(phi, base = 2))
y <- 0 # initialise
while(y != 1) {
e <- urand.bigz(nb = 1, size = r)
y <- gcd.bigz(e, phi)
}
return(e)
}
phi <- (p-1) * (q-1)
e <-finde(phi)
d <- inv.bigz(e, phi)
text1 <- c("I want to eat a baby panda with my bare teeth and hands. Just so I know there's something else in this world suffering more than myself, right now.")
m <- blocks(text1, n) # arguments are text1 (message) and n (public key)
u <- as.bigz((as.bigz(m, n)^e))
dput(u, file="codedmessage.R")
The second is the code contained in the "convert.R" source file:
blocks <- function(txt, n) {
x <- strtoi(charToRaw(txt), 16L)
ll <- length(x)
bl <- floor(log(n, base=256)) # block length (how large the blocks must be)
nb <- floor(ll / bl)
wp <- bl*nb
rem <- ll - wp
s <- as.bigz(vector(mode="numeric", length=0))
u <- 0
while(u < wp) {
total <- as.bigz(0)
for(i in 1:bl) {
total <- 256 * total + x[i+u]
}
u <- u + bl
s <- c(s, total)
}
if(rem > 0) {
total <- as.bigz(0)
for(i in 1:rem) {
total <- 256 * total + x[i + wp]
}
s <- c(s, total)
}
return(s)
}
words <- function(blocknum) {
w <- vector(mode="numeric", length=0)
wl <- blocknum
while(as.bigz(wl) > 0) {
rem <- as.bigz(wl) %% 256
w <- c(rem, w)
wl <- (as.bigz(wl) - as.bigz(rem)) / 256
}
return(w)
}
dectext <- function(listnum) {
len <- length(listnum)
newls <- as.integer(vector(mode="numeric", length=0))
for(i in 1:len) {
temp <- as.integer(words(listnum[i]))
newls <- c(newls, temp)
}
return(rawToChar(as.raw(newls)))
}
And finally the last code is the decrypt and compile list function that I'm having issues with.
finde <- function(phi) {
r <- floor(log(phi, base = 2))
y <- 0 # initialise
while(y != 1) {
e <- urand.bigz(nb = 1, size = r)
y <- gcd.bigz(e, phi)
}
return(e)
}
FindKey <- function(a, y) {
x <<- 1 #initialisation
decodedlist <<- list() #initialisation
while (x<7200) {
print(x)
print(a)
p <- nextprime(urand.bigz(size=51, seed=as.bigz(a)))
q <- nextprime(urand.bigz(size=50))
n <- p*q
phi <- (p-1) * (q-1)
phi
e <-finde(phi)
d <- inv.bigz(e, phi)
recieved<-dget(file=y)
v<-as.bigz(as.bigz(recieved, n)^d)
tryCatch({
decodetext<-dectext(v)
Decrypt<- capture.output(cat(decodetext))
print(Decrypt)
test <- grep("and", Decrypt)
if (!is.null(Decrypt)){
if (is.character(Decrypt)){
decodedlist[[x]] <<- Decrypt
}else{return}}else{return}
}, warning = function(war) {
return()
}, error = function(err){
return()
}, finally = {
x=x+1
a=a-1})
}
}
Sorry it's long.. But I really don't know what to do :(
I found a "sort of" solution to my problem, albeit within a different code I've written.
I'm not very knowledgeable in the reasoning behind why this works but I believe the problem lay in the fact that the list was storing something with a NULL reference (Reps to Acccumulation for the hint ;D) and therefore was not technically NULL itself.
My workaround for this was to avoid using an if statement altogether. Instead, I found a more efficient method of filtering out NULL list entries in a program I had written for generating large prime numbers.
Extra points for anyone who can figure out what I'm currently studying ;)
#Combine two lists and remove NULL entries therein.
Prime_List2 <<- PrimeList[-which(sapply(PrimeList, is.null))]
Prime_List1 <<- PrimeList[-which(sapply(PrimeList, is.null))]
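Two details may be worth checking here. grep() returns integer(0) when nothing matches, so a test such as test == 1 yields a zero-length logical that if () cannot use, whereas grepl() returns TRUE or FALSE per element. Also, x[-which(cond)] returns an empty result when no element matches, so a logical mask is the safer way to drop NULLs. The sketch below uses made-up data and names (decoded, hits, no_nulls) purely for illustration.

```r
# Made-up stand-in for the list of decoded messages.
decoded <- list("cats and dogs", NULL, "xq#zz", NULL, "salt and pepper")

# Drop NULL entries with a logical mask; safe even when nothing is NULL.
non_null <- decoded[!vapply(decoded, is.null, logical(1))]

# Keep only the strings that contain "and".
has_and <- vapply(non_null, function(txt) grepl("and", txt, fixed = TRUE), logical(1))
hits <- non_null[has_and]

# Pitfall: with no NULL entries, which() is empty and -which() drops everything.
no_nulls <- list(7, 11)
emptied <- no_nulls[-which(sapply(no_nulls, is.null))]    # length 0
kept <- no_nulls[!vapply(no_nulls, is.null, logical(1))]  # length 2
```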
I have the following code (a nested for loop) in R, which is extremely slow. The loop matches values from two columns. It then picks up a corresponding file and iterates through the file to find a match, and takes that row from the file. The number of iterations can exceed 100,000. I would appreciate any insight on how to speed up the process.
for(i in 1: length(Jaspar_ids_in_Network)) {
m <- Jaspar_ids_in_Network[i]
gene_ids <- as.character(GeneTFS$GeneIds[i])
gene_names <- as.character(GeneTFS$Genes[i])
print("i")
print(i)
for(j in 1: length(Jaspar_ids_in_Exp)) {
l <- Jaspar_ids_in_Exp[j]
print("j")
print(j)
if (m == l) {
check <- as.matrix(read.csv(file=paste0(dirpath,listoffiles[j]),sep=",",header=FALSE))
data_check <- data.frame(check)
for(k in 1: nrow(data_check)) {
gene_ids_JF <- as.character(data_check[k,3])
genenames_JF <- as.character(data_check[k,4])
if(gene_ids_JF == gene_ids) {
GeneTFS$Source[i] <- as.character(data_check[k,3])
data1 <- rbind(data1, cbind(as.character(data_check[k,3]),
as.character(data_check[k,8]),
as.character(data_check[k,9]),
as.character(data_check[k,6]),
as.character(data_check[k,7]),
as.character(data_check[k,5])))
} else if (toupper(genenames_JF) == toupper(gene_names)) {
GeneTFS$Source[i] <- as.character(data_check[k,4])
data1 <- rbind(data1, cbind(as.character(data_check[k,4]),
as.character(data_check[k,5]),
as.character(data_check[k,6]),
as.character(data_check[k,7]),
as.character(data_check[k,8]),
as.character(data_check[k,2])))
} else {
# GeneTFS[i,4] <- "No Evidence"
}
}
} else {
# GeneTFS[i,4] <- "Record Not Found"
}
}
}
If you pull out the logic for processing one pair into a function, f(m,l), then you could replace the double loop with:
outer(Jaspar_ids_in_Network, Jaspar_ids_in_Exp, Vectorize(f))
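A small illustration of that pattern, with made-up IDs and a placeholder f that only tests equality (the real f would read the file and collect the rows). Since the double loop only acts when m == l, match() is also shown as a possible way to find the matching index directly. All names and values here are assumptions for the sake of the example.

```r
# Made-up stand-ins for Jaspar_ids_in_Network and Jaspar_ids_in_Exp.
network_ids <- c("MA0001", "MA0002", "MA0003")
exp_ids <- c("MA0003", "MA0001")

# Placeholder for the per-pair logic; must return one value per pair.
f <- function(m, l) m == l

# outer() hands FUN whole vectors, so Vectorize() makes f work pair by pair.
hits <- outer(network_ids, exp_ids, Vectorize(f))
which(hits, arr.ind = TRUE)  # (row, col) = (i, j) of the matching pairs

# For plain equality, match() gives the first matching j for each i (NA if none).
j_for_i <- match(network_ids, exp_ids)
print(j_for_i)
```

Note that outer() with Vectorize() still evaluates f for every pair, so it tidies the code more than it speeds it up. match() avoids the pairwise comparison altogether.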