I'm relatively new to R programming, but have a specific problem concerning the extraction of text from a syntactically parsed historical language corpus. The problem should be easy to solve, but I just can't get my head around it. My question is basically a more specific variation of this one: R: parse nested parentheses
I would like to parse nested parentheses in R. Here is an example of some data:
(sometext(NP-SBJ(D+N_THYSTORYE)(PP(P_OF)(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))))sometext)
From this string I would like to extract all (potentially nested) substrings that begin with "NP", so the result should be
(NP-SBJ(D+N_THYSTORYE)(PP(P_OF)(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))))
(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))
(NPR_REYNARD)
(NP-PRN(D_THE)(N_FOXE))
Any help would be much appreciated!
This probably isn't the most efficient, but here's a function which can extract the "tokens", i.e. the strings between matched parentheses.
find_tokens <- function(s) {
stopifnot(length(s)==1)
mm <- gregexpr("[)()]", s)
stack <- numeric()
starts <- numeric()
stops <- numeric()
Map(function(i, v) {
if(v=="(") {
stack <<- c(stack, i)
} else if (v==")") {
starts <<- c(starts, tail(stack, 1))
stops <<- c(stops, i)
stack <<- stack[-length(stack)]
}
}, mm[[1]], regmatches(s, mm)[[1]])
rev(substring(s, starts, stops))
}
This will extract everything. If you want to keep just the values that start with "(NP" you can just grep this list
grep("^\\(NP", find_tokens(s), value=TRUE)
# [1] "(NP-SBJ(D+N_THYSTORYE)(PP(P_OF)(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))))"
# [2] "(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))"
# [3] "(NP-PRN(D_THE)(N_FOXE))"
# [4] "(NPR_REYNARD)"
Here's another possible implementation of find_tokens that might be more efficient and that better supports multiple strings, returning the results as a list.
find_tokens <- function(s) {
mm <- gregexpr("[)()]", s)
vv <- regmatches(s, mm)
extr <- function(x, mm, vv) {
open_i <- 0
shut_i <- 0
open <- numeric(length(vv)/2)
shut <- numeric(length(vv)/2)
close <- numeric(length(vv)/2)
for(i in seq_along(mm)) {
if (vv[i]=="(") {
open_i <- open_i + 1
shut_i <- shut_i + 1
open[open_i] <- mm[i]
shut[shut_i] <- open_i
} else if (vv[i]==")") {
close[shut[shut_i]] <- mm[i]
shut_i <- shut_i - 1
}
}
substring(x, open, close)
}
unname(Map(extr, s, mm, vv))
}
and then you would use
lapply(find_tokens(s), function(x) grep("^\\(NP", x, value=TRUE))
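For comparison, here is a minimal sketch of a different way to pair the parentheses, using the running nesting depth instead of an explicit stack. The helper name extract_bracketed and its prefix argument are hypothetical, not part of the answer above. It returns the matches in the order in which they open, which is the order listed in the question.

```r
# Hypothetical helper (not from the answer above): return every balanced
# "(...)" substring of `s` that starts with "(" followed by `prefix`.
extract_bracketed <- function(s, prefix = "NP") {
  stopifnot(is.character(s), length(s) == 1)
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  # Nesting depth after each character: +1 for "(", -1 for ")".
  depth <- cumsum((chars == "(") - (chars == ")"))
  out <- character(0)
  for (o in which(chars == "(")) {
    # The matching ")" is the first later position whose depth drops
    # below the depth reached at this "(".
    later <- which(depth < depth[o])
    close <- later[later > o][1]
    if (is.na(close)) next  # unbalanced "(": skip it
    token <- substring(s, o, close)
    if (startsWith(token, paste0("(", prefix))) out <- c(out, token)
  }
  out
}

s <- "(sometext(NP-SBJ(D+N_THYSTORYE)(PP(P_OF)(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))))sometext)"
extract_bracketed(s)
```

Each opening parenthesis rescans the depth vector, so this is quadratic in the worst case; for long corpus lines the stack-based versions above are probably the better choice.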
I want to create a model where I duplicate a sentence several times, introducing random error each time. The duplicates of the sentence also get duplicated. So, in cycle one, I start with just "example_sentence". In cycle two, I have two copies of that sentence. In cycle three, I have 4 copies of that sentence. I want to do this for 25 cycles with 20k sentences. The code I wrote to do that works way too slowly, and I am wondering if there is a way to make my nested for loops more efficient? Here is the part of the code that is the slowest:
alphabet <- c("a","b","d","j")
modr1 <- "sentencetoduplicate"
errorRate <- c()
errorRate <- append(errorRate, rep(1,1))
errorRate <- append(errorRate, rep(0,999))
duplicate <- c(modr1)
for (q in 1:25) {
collect <- c()
for (z in seq_along(duplicate)) {
modr1 = duplicate[z]
compile1 <- c()
for (k in 1:nchar(modr1)) {
error <- sample(errorRate, 1, replace = TRUE)
if (error == 1) {
compile1 <- append(compile1, sub(substring(modr1,k,k),sample(alphabet,1,replace=TRUE),substring(modr1,k,k)))
} else {
compile1 <- append(compile1, substring(modr1,k,k))
}
}
modr1 <- paste(compile1, collapse='')
collect <- append(collect, modr1)
}
duplicate <- append(duplicate, collect)
}
Here is a faster approach to your loop, but I think scaling it up to your 20K sentences will still be a problem!
f <- function(let, alphabet = c("a","b","c","d","j"),error_rate=1/1000) {
lenlet=length(let)
let = unlist(let)
k <- rbinom(length(let),1,prob = error_rate)
let[k==1] <- sample(alphabet,size = sum(k==1), replace=TRUE)
return(as.list(as.data.frame(matrix(let, ncol=lenlet))))
}
library(data.table)
modr1 <- "sentencetoduplicate"
k <- data.table(list(strsplit(modr1,"")[[1]]))
for(q in 1:25) {
k[, V1:=list(f(V1))]
k <- k[rep(1:nrow(k),2)]
}
Updated with slightly faster version! (Notice this is no longer by=1:nrow(k))
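As a rough base-R alternative, the per-character loop can be replaced by one vectorised draw per cycle over a character matrix (one row per sentence, one column per character). This is only a sketch that follows the loop in the question, where the originals are kept and the mutated copies are appended; the names mutate_matrix and simulate_copies are made up for illustration.

```r
# Hypothetical helpers, a sketch only: one row per sentence copy,
# one column per character.
mutate_matrix <- function(m, alphabet, error_rate) {
  # Pick the cells that receive an error in a single vectorised draw.
  hit <- which(runif(length(m)) < error_rate)
  m[hit] <- sample(alphabet, length(hit), replace = TRUE)
  m
}

simulate_copies <- function(sentence, cycles,
                            alphabet = c("a", "b", "d", "j"),
                            error_rate = 1 / 1000) {
  m <- matrix(strsplit(sentence, "", fixed = TRUE)[[1]], nrow = 1)
  for (q in seq_len(cycles)) {
    # Keep the existing rows and append one mutated copy of each.
    m <- rbind(m, mutate_matrix(m, alphabet, error_rate))
  }
  apply(m, 1, paste, collapse = "")
}

set.seed(1)
copies <- simulate_copies("sentencetoduplicate", cycles = 10)
length(copies)
```

The number of rows doubles every cycle, so 25 cycles means 2^25 rows per sentence. At that size memory is likely to be the limit rather than the loop, whichever implementation is used.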
I wish to "copy and modify" a function at a specific point in its body. Currently, what I have is
nearest_psd <- function(mat) {
ed <- eigen(mat)
eigvecs <- ed$vectors
eigvals <- ed$values
eigvals[eigvals<0] <- 0
eigvecs %*% diag(eigvals) %*% t(eigvecs)
}
nearest_pd <- nearest_psd
formals(nearest_pd)$pdeps <- 1e-08
body(nearest_pd)[[c(5,3)]] <- quote(pdeps)
so that nearest_pd is a copy of nearest_psd, except for the line eigvals[eigvals<0] <- pdeps.
However, the line number (5, in this case) is hard-coded, and I would prefer to have a robust way to determine this line number. How can I search for the line that contains the expression eigvals[eigvals<0] <- 0?
You can use identical to compare two expressions; that way, you can identify and replace the expression in question:
to_replace = vapply(body(nearest_pd), function (e) identical(e, quote(eigvals[eigvals < 0] <- 0)), logical(1L))
body(nearest_pd)[to_replace] = list(quote(eigvals[eigvals < pdeps] <- pdeps))
However, this is no more readable, nor more robust, than your code: in both cases you’re forced to hard-code the relevant information; in your code, the indices. In mine, the expression. For that reason I wouldn’t recommend using this.
… of course you could instead use an AST walker to replace all occurrences of 0 in the function’s body with pdeps. But is that better? No, since 0 could be used for other purposes. It currently isn’t, but who knows, once the original function changes. And if the original function can’t be assumed to change, why not hard-code the new function entirely? That is, write this:
nearest_pd <- function (mat, pdeps = 1e-08) {
ed <- eigen(mat)
eigvecs <- ed$vectors
eigvals <- ed$values
eigvals[eigvals < pdeps] <- pdeps
eigvecs %*% diag(eigvals) %*% t(eigvecs)
}
… no need to use metaprogramming just for the sake of it.
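For completeness, here is roughly what the AST walker mentioned above could look like if you do want to go that route. It is a sketch under assumptions: replace_expr is a hypothetical helper, and it swaps out only sub-expressions that are identical to a given target (rather than every 0), at any nesting depth.

```r
# Hypothetical helper: walk an expression tree and replace every
# sub-expression identical to `target` with `replacement`.
# Limitation: calls with empty arguments (e.g. x[, 1]) are not handled.
replace_expr <- function(expr, target, replacement) {
  if (identical(expr, target)) return(replacement)
  if (is.call(expr)) {
    # Rebuild the call from its recursively processed components.
    return(as.call(lapply(as.list(expr), replace_expr, target, replacement)))
  }
  expr
}

nearest_psd <- function(mat) {
  ed <- eigen(mat)
  eigvecs <- ed$vectors
  eigvals <- ed$values
  eigvals[eigvals < 0] <- 0
  eigvecs %*% diag(eigvals) %*% t(eigvecs)
}

nearest_pd <- nearest_psd
formals(nearest_pd)$pdeps <- 1e-08
body(nearest_pd) <- replace_expr(
  body(nearest_psd),
  quote(eigvals[eigvals < 0] <- 0),
  quote(eigvals[eigvals < pdeps] <- pdeps)
)
```

The same caveat applies: the target expression is still hard-coded, so this is no more robust against changes to nearest_psd than the index-based version.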
The following might do what you want.
nearest_psd <- function(mat) {
ed <- eigen(mat)
eigvecs <- ed$vectors
eigvals <- ed$values
eigvals[eigvals<0] <- 0
eigvecs %*% diag(eigvals) %*% t(eigvecs)
}
nearest_pd <- nearest_psd
formals(nearest_pd)$pdeps <- 1e-08
nearest_psd_body <- body(nearest_psd)
# Find the string we are looking for and replace it ...
new.code <- gsub("eigvals[eigvals < 0] <- 0",
"MY_NEW_CODE",
nearest_psd_body, fixed = TRUE)
# Building the function body as a string.
new.code <- new.code[-1] # delete first { such that ...
new.code <- paste(new.code, collapse = ";") # we can collapse the remaining here ....
new.code <- paste("{", new.code, "}", sep = "", collapse = "") # and then wrap the remaining in { }
# parse returns an expression.
body(nearest_pd) <- parse(text = new.code)
See "At a basic level, what does eval-parse do in R?" for an explanation of parse, and "In programming, what is an expression?" for what an expression is.
I'm not quite sure of the terminology here. I'm doing a code parsing project to visualize the relationships between objects of an R script. Almost everything works except for parsing and evaluating functions and loops. I don't think I have a good enough handle on regex to find the beginning/end of nested curly braces.
In the script, it might look like this:
something_above <- "above"
my_function <- function(x){
x2 <- x^2
if(x2 >= 200){
res <- "200+"
} else {
res <- "<200"
}
return(res)
}
something_below <- "below"
When I read it in, it looks like this
string <- '\r\nsomething_above <- \"above\"\r\n\r\nmy_function <- function(x){\r\n x2 <- x^2\r\n if(x2 >= 200){\r\n res <- \"200+\"\r\n } else {\r\n res <- \"<200\"\r\n}\r\n return(res)\r\n }\r\nsomething_below <- \"below\"\r\n'
I would like to be able to collapse the function into a single line like this
... <- function(x){x2 <- x^2 ; if(x2 >= 200){res <- "200+" ; ...}; return(res)}\r\n ...
so that each step within the function is separated by a ; instead of a new line. I would, however, like to keep the new line \r\n pattern at the beginning of the assignment (\r\nmy_function) and once the function assignment is over (};return(res)}\r\n).
The final result would be three lines:
[1] something_above <- "above"
[2] my_function <- function(x){x2 <- x^2; if(x2 >= 200 ...}
[3] something_below <- "below"
Thank you.
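One possible direction, offered as a sketch rather than a full solution: instead of matching braces with a regex, let R's own parser find where each top-level expression starts and ends, then flatten each one onto a single line. The helper top_level_lines below is hypothetical. It joins the physical lines of each expression with a space and does not try to insert the ; separators, which would need a walk over the parsed expression.

```r
# Hypothetical helper: one string per top-level expression of an R script.
top_level_lines <- function(script) {
  # Normalise Windows line endings before parsing.
  script <- gsub("\r\n", "\n", script, fixed = TRUE)
  # keep.source = TRUE attaches a srcref to each top-level expression.
  exprs <- parse(text = script, keep.source = TRUE)
  refs <- attr(exprs, "srcref")
  vapply(refs, function(ref) {
    # as.character() on a srcref returns the source lines it spans.
    paste(trimws(as.character(ref)), collapse = " ")
  }, character(1))
}

script <- 'something_above <- "above"\r\nmy_function <- function(x){\r\n  x2 <- x^2\r\n  return(x2)\r\n}\r\nsomething_below <- "below"\r\n'
top_level_lines(script)
```

Because the boundaries come from the parser, nested braces, strings containing braces, and comments should not need special handling.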
Okay, so I have combed the internet for an answer to my problem and I can only put it down to me being a little naive in how R works.
Below is my code for a function that generates public and private keys from the system clock and uses it to attempt to decrypt an encrypted message. This bit works fine, but obviously as it goes through different random generations it comes back with a lot of garbage and NULL data.
I wanted to filter this out by using grep and testing whether the result of that grep was 1; if so, the decoded message would be put into a list.
The problem is that, no matter how I propose the if statement, my list gets cluttered with both the nonsense entries and the NULL entries.
I've tried !is.null, is.character, test == 1, etc., but nothing seems to work. Either the list doesn't get populated at all, or it gets populated by every entry that runs through the if statement.
Any advice would be appreciated. Thanks :)
Edit: Okay, forgive me, for these are copy and paste jobs to provide clarity. The first code is the code I'm using to encrypt the message.
require(gmp)
source("convert.R")
p <- nextprime(urand.bigz(size=51, seed=as.bigz(Sys.time())))
q <- nextprime(urand.bigz(size=50))
n <- p*q
finde <- function(phi) {
r <- floor(log(phi, base = 2))
y <- 0 # initialise
while(y != 1) {
e <- urand.bigz(nb = 1, size = r)
y <- gcd.bigz(e, phi)
}
return(e)
}
phi <- (p-1) * (q-1)
e <-finde(phi)
d <- inv.bigz(e, phi)
text1 <- c("I want to eat a baby panda with my bare teeth and hands. Just so I know there's something else in this world suffering more than myself, right now.")
m <- blocks(text1, n) # arguments are text1 (message) and n (public key)
u <- as.bigz((as.bigz(m, n)^e))
dput(u, file="codedmessage.R")
The second is the code contained in the "convert.R" source file:
blocks <- function(txt, n) {
x <- strtoi(charToRaw(txt), 16L)
ll <- length(x)
bl <- floor(log(n, base=256)) # block length (how large the blocks must be)
nb <- floor(ll / bl)
wp <- bl*nb
rem <- ll - wp
s <- as.bigz(vector(mode="numeric", length=0))
u <- 0
while(u < wp) {
total <- as.bigz(0)
for(i in 1:bl) {
total <- 256 * total + x[i+u]
}
u <- u + bl
s <- c(s, total)
}
if(rem > 0) {
total <- as.bigz(0)
for(i in 1:rem) {
total <- 256 * total + x[i + wp]
}
s <- c(s, total)
}
return(s)
}
words <- function(blocknum) {
w <- vector(mode="numeric", length=0)
wl <- blocknum
while(as.bigz(wl) > 0) {
rem <- as.bigz(wl) %% 256
w <- c(rem, w)
wl <- (as.bigz(wl) - as.bigz(rem)) / 256
}
return(w)
}
dectext <- function(listnum) {
len <- length(listnum)
newls <- as.integer(vector(mode="numeric", length=0))
for(i in 1:len) {
temp <- as.integer(words(listnum[i]))
newls <- c(newls, temp)
}
return(rawToChar(as.raw(newls)))
}
And finally the last code is the decrypt and compile list function that I'm having issues with.
finde <- function(phi) {
r <- floor(log(phi, base = 2))
y <- 0 # initialise
while(y != 1) {
e <- urand.bigz(nb = 1, size = r)
y <- gcd.bigz(e, phi)
}
return(e)
}
FindKey <- function(a, y) {
x <<- 1 #initialisation
decodedlist <<- list() #initialisation
while (x<7200) {
print(x)
print(a)
p <- nextprime(urand.bigz(size=51, seed=as.bigz(a)))
q <- nextprime(urand.bigz(size=50))
n <- p*q
phi <- (p-1) * (q-1)
phi
e <-finde(phi)
d <- inv.bigz(e, phi)
recieved<-dget(file=y)
v<-as.bigz(as.bigz(recieved, n)^d)
tryCatch({
decodetext<-dectext(v)
Decrypt<- capture.output(cat(decodetext))
print(Decrypt)
test <- grep("and", Decrypt)
if (!is.null(Decrypt)){
if (is.character(Decrypt)){
decodedlist[[x]] <<- Decrypt
}else{return}}else{return}
}, warning = function(war) {
return()
}, error = function(err){
return()
}, finally = {
x=x+1
a=a-1})
}
}
Sorry it's long.. But I really don't know what to do :(
I found a "sort of" solution to my problem, albeit within a different code I've written.
I'm not very knowledgeable in the reasoning behind why this works but I believe the problem lay in the fact that the list was storing something with a NULL reference (Reps to Acccumulation for the hint ;D) and therefore was not technically NULL itself.
My workaround for this was to avoid using an if statement altogether, instead I found a more efficient method of filtering out NULL list entries in a program I had written for generating large prime numbers.
Extra points for anyone who can figure out what I'm currently studying ;)
#Combine two lists and remove NULL entries therein.
Prime_List2 <<- PrimeList[-which(sapply(PrimeList, is.null))]
Prime_List1 <<- PrimeList[-which(sapply(PrimeList, is.null))]
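A short sketch of what may be going on, with made-up example data: assigning to some_list[[x]] at an index beyond the current length pads the skipped positions with NULL, so gaps appear whenever an iteration stores nothing. Appending at length + 1 avoids that. Also note that -which(...) returns an empty list when there are no NULL entries at all, so a logical index is the safer filter. The name drop_nulls is hypothetical.

```r
# Assigning past the end of a list fills the gap with NULL entries.
padded <- list()
padded[[3]] <- "kept"
length(padded)

# Appending keeps the list dense: store only candidates passing the test.
candidates <- list("garbage", NULL, "bread and butter")
results <- list()
for (candidate in candidates) {
  if (is.character(candidate) && length(candidate) == 1 &&
      grepl("and", candidate, fixed = TRUE)) {
    results[[length(results) + 1]] <- candidate
  }
}

# Hypothetical helper: drop NULL entries with a logical index, which
# also behaves correctly when the list contains no NULLs.
drop_nulls <- function(x) x[!vapply(x, is.null, logical(1))]

drop_nulls(list(1, NULL, 3))
drop_nulls(list(1, 2))
```

Here grepl is used instead of grep because it always returns TRUE or FALSE, whereas grep returns integer(0) on no match, which cannot be used directly as an if condition.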
Let me explain what I want using the code
taget_input <- list(1,2,3,4,5,6,7,8,9)
desired_output <- list(list(1,2,3), list(4,5,6), list(7,8,9))
# I wish there were an elegant function that passes this test
testthat::expect_equal(DESIRED_FUNCTION(taget_input, group_length=3), desired_output)
I wish there were a function like that. My Google search did not turn up anything useful. Here's a rough implementation, but it's not elegant, and I'm curious whether there is a better way to do this.
#' @param target_list list to split on
#' @param group_length length for each split
group_list <- function(target_list, group_length) {
len_of_list <- length(target_list)
res <- list()
i <- 1
while(TRUE) {
start <- (1 + group_length*(i-1))
if(start > len_of_list) {
break
}
end <- min((group_length + group_length*(i-1)), len_of_list)
res[[i]] <- target_list[start:end]
i <- i + 1
}
res
}
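One shorter way to get the same grouping, as a sketch: build a group index with ceiling(seq_along(x) / n) and hand it to base R's split(). The name group_list_split is made up here to keep it apart from the implementation above.

```r
# Hypothetical alternative to group_list(), built on base R's split().
group_list_split <- function(target_list, group_length) {
  stopifnot(group_length >= 1)
  # 1,1,1,2,2,2,... : the group each element belongs to.
  groups <- ceiling(seq_along(target_list) / group_length)
  # split() names the pieces after the group ids; drop those names.
  unname(split(target_list, groups))
}

taget_input <- list(1, 2, 3, 4, 5, 6, 7, 8, 9)
group_list_split(taget_input, group_length = 3)
```

When the length is not a multiple of group_length, the last group is simply shorter, as in the loop version.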