While trying to learn R, I want to implement the algorithm below. Consider the two lists below:
List 1: "crashed", "red", "car"
List 2: "crashed", "blue", "bus"
I want to find out how many actions it would take to transform 'list1' into 'list2'.
As you can see I need only two actions:
1. Replace "red" with "blue".
2. Replace "car" with "bus".
But how can we find the number of actions like this automatically?
We can have several actions to transform the sentences: ADD, REMOVE, or REPLACE the words in the list.
Now, I will try my best to explain how the algorithm should work:
At the first step: I will create a table like this:
rows: i= 0,1,2,3,
columns: j = 0,1,2,3
(example: value[0,0] = 0 , value[0, 1] = 1 ...)
          crashed  red  car
        0       1    2    3
crashed 1
blue    2
bus     3
Now, I will try to fill the table. Please note that each cell in the table shows the number of actions (ADD, REMOVE, or REPLACE) we need to transform one sentence into the other.
Consider the comparison between "crashed" and "crashed" (value[1,1]): since they are the same word, we obviously don't need to change anything, so the value will be '0'. Basically, we take the diagonal value, value[0,0].
          crashed  red  car
        0       1    2    3
crashed 1       0
blue    2
bus     3
Now, consider "crashed" and the second part of the sentence, which is "red". Since they are not the same word, we can calculate the number of changes like this:
min{value[0,1], value[0,2], value[1,1]} + 1
min{1, 2, 0} + 1 = 1
Therefore, we need to just remove "red".
So, the table will look like this:
          crashed  red  car
        0       1    2    3
crashed 1       0    1
blue    2
bus     3
And we will continue like this:
"crashed" and "car" will be:
min{value[0,3], value[0,2], value[1,2]} + 1
min{3, 2, 1} + 1 = 2
and the table will be:
          crashed  red  car
        0       1    2    3
crashed 1       0    1    2
blue    2
bus     3
And we will continue to do so. The final result will be:
          crashed  red  car
        0       1    2    3
crashed 1       0    1    2
blue    2       1    1    2
bus     3       2    2    2
As you can see the last number in the table shows the distance between two sentences: value[3,3] = 2
Basically, the algorithm should look like this:
if (characters_in_header_of_matrix[i] == characters_in_column_of_matrix[j])
then {
    value[i, j] = value[i-1, j-1]   # take the DIAGONAL VALUE
}
else {
    value[i, j] = min(value[i-1, j], value[i-1, j-1], value[i, j-1]) + 1
}
endif
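For reference, the filling rule above is the standard word-level Levenshtein recurrence; a minimal sketch of it in R (function and variable names are only illustrative) could look like this:
word_edit_distance <- function(a, b) {
  # a, b: character vectors of words, e.g. c("crashed", "red", "car")
  n <- length(a); m <- length(b)
  value <- matrix(0L, nrow = n + 1, ncol = m + 1)
  value[, 1] <- 0:n                     # cost of transforming a prefix into an empty list
  value[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      if (a[i] == b[j]) {               # same word: take the diagonal value
        value[i + 1, j + 1] <- value[i, j]
      } else {                          # otherwise: min of remove/replace/add, plus 1
        value[i + 1, j + 1] <- min(value[i, j + 1], value[i, j], value[i + 1, j]) + 1L
      }
    }
  }
  value[n + 1, m + 1]                   # the bottom-right cell is the distance
}

word_edit_distance(c("crashed", "red", "car"), c("crashed", "blue", "bus"))
# should give 2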
For finding the difference between the elements of the two lists that you can see in the header and the column of the matrix, I have used the strcmp() function, which gives a Boolean value (TRUE or FALSE) when comparing the words. But I fail at implementing this.
I'd appreciate your help on this one, thanks.
The question
After some clarification in a previous post, and after the update of the post, my understanding is that Zero is asking: 'how one can iteratively count the number of word differences in two strings'.
I am unaware of any existing implementation in R, although I would be surprised if one doesn't already exist. I took a bit of time out to create a simple implementation, altering the algorithm slightly for simplicity (for anyone not interested, scroll down for the two implementations: one in pure R, one using a small amount of Rcpp). The general idea of the implementation:
1. Initialize with string_1 and string_2, of lengths n_1 and n_2.
2. Calculate the cumulative difference between the first min(n_1, n_2) elements.
3. Use this cumulative difference as the diagonal of the matrix.
4. Set the first off-diagonal elements to the very first element + 1.
5. Calculate the remaining off-diagonal elements as: diag(i) - diag(i-1) + full_matrix(i-1, j).
6. In the previous step, i iterates over diagonals and j iterates over rows/columns (either one works); we start at the third diagonal, as the first 2x2 block is already filled by steps 1 to 4.
7. Calculate the remaining abs(n_1 - n_2) elements as full_matrix[, min(n_1, n_2)] + 1:abs(n_1 - n_2), applying the latter over each value in the former, and bind them appropriately to full_matrix.
The output is a matrix whose row and column names are the words of the corresponding strings, formatted for somewhat easier reading.
Implementation in R
library(stringr) # str_split() below comes from stringr

Dist_between_strings <- function(x, y,
split = " ",
split_x = split, split_y = split,
case_sensitive = TRUE){
#Safety checks
if(!is.character(x) || !is.character(y) ||
nchar(x) == 0 || nchar(y) == 0)
stop("x, y needs to be none empty character strings.")
if(length(x) != 1 || length(y) != 1)
stop("Currency the function is not vectorized, please provide the strings individually or use lapply.")
if(!is.logical(case_sensitive))
stop("case_sensitivity needs to be logical")
#Extract variable names of our variables
# used for the dimension names later on
x_name <- deparse(substitute(x))
y_name <- deparse(substitute(y))
#Expression which when evaluated will name our output
dimname_expression <-
parse(text = paste0("dimnames(output) <- list(",make.names(x_name, unique = TRUE)," = x_names,",
make.names(y_name, unique = TRUE)," = y_names)"))
#split the strings into words
x_names <- str_split(x, split_x, simplify = TRUE)
y_names <- str_split(y, split_y, simplify = TRUE)
#are we case_sensitive?
if(!isTRUE(case_sensitive)){ #lower-case both strings when the comparison should ignore case
x_split <- str_split(tolower(x), split_x, simplify = TRUE)
y_split <- str_split(tolower(y), split_y, simplify = TRUE)
}else{
x_split <- x_names
y_split <- y_names
}
#Create an index in case the two are of different length
idx <- seq(1, (n_min <- min((nx <- length(x_split)),
(ny <- length(y_split)))))
n_max <- max(nx, ny)
#If we have one string that has length 1, the output is simplified
if(n_min == 1){
distances <- seq(1, n_max) - (x_split[idx] == y_split[idx])
output <- matrix(distances, nrow = nx)
eval(dimname_expression)
return(output)
}
#If not we will have to do a bit of work
output <- diag(cumsum(ifelse(x_split[idx] == y_split[idx], 0, 1)))
#The loop will fill in the off_diagonal
output[2, 1] <- output[1, 2] <- output[1, 1] + 1
if(n_min > 2) #the first 2x2 block is already filled above
for(i in 3:n_min){
for(j in 1:(i - 1)){
output[i,j] <- output[j,i] <- output[i,i] - output[i - 1, i - 1] + #are the words different?
output[i - 1, j] #How many words were different before?
}
}
#comparison if the list is not of the same size
if(nx != ny){
#Add the remaining words to the side that does not contain this
additional_words <- seq(1, n_max - n_min)
additional_words <- sapply(additional_words, function(x) x + output[,n_min])
#merge the additional words
if(nx > ny)
output <- rbind(output, t(additional_words))
else
output <- cbind(output, additional_words)
}
#set the dimension names,
# I would like the original variable names to be displayed, so I create an expression and evaluate it
eval(dimname_expression)
output
}
Note that the implementation is not vectorized, and as such can only take single string inputs!
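If you do need many pairwise comparisons, a thin wrapper over Map (or lapply, as the error message suggests) is one way around this; a quick sketch with made-up input vectors (the dimension names may come out less tidy than with a direct call):
# hypothetical vectors of sentences to compare element-wise
strings_a <- c("crashed red car", "I am a cat")
strings_b <- c("crashed blue bus", "I am not a blue whale")
pairwise <- Map(Dist_between_strings, strings_a, strings_b)
# the overall difference count of each pair sits in the bottom-right cell
sapply(pairwise, function(m) m[nrow(m), ncol(m)])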
Testing the implementation
To test the implementation, one could use the strings given in the question. As they were said to be contained in lists, we first have to convert them to strings. Note that the function lets one split each string differently; however, by default it assumes space-separated strings. So first I'll show how one could convert the lists to the expected format:
list_1 <- list("crashed","red","car")
list_2 <- list("crashed","blue","bus")
string_1 <- paste(list_1,collapse = " ")
string_2 <- paste(list_2,collapse = " ")
Dist_between_strings(string_1, string_2)
Output:
#Strings in the given example
          string_2
string_1   crashed blue bus
  crashed        0    1   2
  red            1    1   2
  car            2    2   2
This is not exactly the table from the question, but it conveys the same information, as the words are ordered as they were given in the strings.
More examples
Now, I stated that it works for other strings as well, and this is indeed the case, so let's try some made-up strings:
#More complicated strings
string_3 <- "I am not a blue whale"
string_4 <- "I am a cat"
string_5 <- "I am a beautiful flower power girl with monster wings"
string_6 <- "Hello"
Dist_between_strings(string_3, string_4, case_sensitive = TRUE)
Dist_between_strings(string_3, string_5, case_sensitive = TRUE)
Dist_between_strings(string_4, string_5, case_sensitive = TRUE)
Dist_between_strings(string_6, string_5)
Running these shows that they do yield the correct answers. Note that if either string has length 1, the comparison is a lot faster.
Benchmarking the implementation
Now that the implementation is accepted as correct, we would like to know how well it performs (the uninterested reader can scroll past this section to where a faster implementation is given). For this purpose, I will use much larger strings. For a complete benchmark I should test various string sizes, but for present purposes I will only use two rather large strings of 1,000 and 2,500 words. For this I use the microbenchmark package in R, which contains a microbenchmark function that claims to be accurate down to nanoseconds. The function executes the code 100 times (or a user-defined number of times), returning the mean and quartiles of the run times. Due to other parts of R, such as the garbage collector, the median is usually considered a good estimate of the actual average run time of the function.
The execution and results are shown below:
#Benchmarks for larger strings
set.seed(1)
string_7 <- paste(sample(LETTERS,1000,replace = TRUE), collapse = " ")
string_8 <- paste(sample(LETTERS,2500,replace = TRUE), collapse = " ")
microbenchmark::microbenchmark(String_Comparison = Dist_between_strings(string_7, string_8, case_sensitive = FALSE))
# Unit: milliseconds
#              expr      min       lq     mean   median       uq      max neval
# String_Comparison 716.5703 729.4458 816.1161 763.5452 888.1231 1106.959   100
Profiling
Now, I find the run times very slow. One use case for the implementation could be an initial check of student hand-ins for plagiarism, in which case a low difference count very likely indicates plagiarism. These can be very long, and there may be hundreds of hand-ins, and as such I would like the run to be very fast.
To figure out how to improve my implementation, I used the profvis package with the corresponding profvis function. To profile the function, I exported it to another R script that I sourced, running the code once prior to profiling to compile the code and avoid profiling noise (important). The code to run the profiling can be seen below, and the most important part of the output is visualized in an image below it.
library(profvis)
profvis(Dist_between_strings(string_7, string_8, case_sensitive = FALSE))
Now, despite the colour, I can see a clear problem here: the loop filling the off-diagonal is by far responsible for most of the runtime. Loops in R (as in Python and other interpreted languages) are notoriously slow.
Using Rcpp to improve performance
To improve the implementation, we could implement the loop in C++ using the Rcpp package. This is rather simple; the code is not unlike the one we would use in R, if we avoid iterators. A C++ script can be created in RStudio via File -> New File -> C++ File. The following C++ code would be pasted into the corresponding file and sourced using the Source button.
//Rcpp Code
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix Cpp_String_difference_outer_diag(NumericMatrix output){
long nrow = output.nrow();
for(long i = 2; i < nrow; i++){ // note the zero-based indexing in C++
for(long j = 0; j < i; j++){
output(i, j) = output(i, i) - output(i - 1, i - 1) + //are the words different?
output(i - 1, j);
output(j, i) = output(i, j);
}
}
return output;
}
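If you are not working in RStudio, the same file can also be compiled from the console with Rcpp::sourceCpp(); the file name below is just a placeholder for wherever the C++ code was saved:
library(Rcpp)
# compiles the file and makes Cpp_String_difference_outer_diag() available in R
sourceCpp("Cpp_String_difference_outer_diag.cpp") # hypothetical file name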
The corresponding R function needs to be altered to use this function instead of looping. The code is similar to the first function, only switching the loop for a call to the C++ function.
Dist_between_strings_cpp <- function(x, y,
split = " ",
split_x = split, split_y = split,
case_sensitive = TRUE){
#Safety checks
if(!is.character(x) || !is.character(y) ||
nchar(x) == 0 || nchar(y) == 0)
stop("x, y needs to be none empty character strings.")
if(length(x) != 1 || length(y) != 1)
stop("Currency the function is not vectorized, please provide the strings individually or use lapply.")
if(!is.logical(case_sensitive))
stop("case_sensitivity needs to be logical")
#Extract variable names of our variables
# used for the dimension names later on
x_name <- deparse(substitute(x))
y_name <- deparse(substitute(y))
#Expression which when evaluated will name our output
dimname_expression <-
parse(text = paste0("dimnames(output) <- list(", make.names(x_name, unique = TRUE)," = x_names,",
make.names(y_name, unique = TRUE)," = y_names)"))
#split the strings into words
x_names <- str_split(x, split_x, simplify = TRUE)
y_names <- str_split(y, split_y, simplify = TRUE)
#are we case_sensitive?
if(!isTRUE(case_sensitive)){ #lower-case both strings when the comparison should ignore case
x_split <- str_split(tolower(x), split_x, simplify = TRUE)
y_split <- str_split(tolower(y), split_y, simplify = TRUE)
}else{
x_split <- x_names
y_split <- y_names
}
#Create an index in case the two are of different length
idx <- seq(1, (n_min <- min((nx <- length(x_split)),
(ny <- length(y_split)))))
n_max <- max(nx, ny)
#If we have one string that has length 1, the output is simplified
if(n_min == 1){
distances <- seq(1, n_max) - (x_split[idx] == y_split[idx])
output <- matrix(distances, nrow = nx)
eval(dimname_expression)
return(output)
}
#If not we will have to do a bit of work
output <- diag(cumsum(ifelse(x_split[idx] == y_split[idx], 0, 1)))
#The loop will fill in the off_diagonal
output[2, 1] <- output[1, 2] <- output[1, 1] + 1
if(n_max > 2)
output <- Cpp_String_difference_outer_diag(output) #Execute the c++ code
#comparison if the list is not of the same size
if(nx != ny){
#Add the remaining words to the side that does not contain this
additional_words <- seq(1, n_max - n_min)
additional_words <- sapply(additional_words, function(x) x + output[,n_min])
#merge the additional words
if(nx > ny)
output <- rbind(output, t(additional_words))
else
output <- cbind(output, additional_words)
}
#set the dimension names,
# I would like the original variable names to be displayed, so I create an expression and evaluate it
eval(dimname_expression)
output
}
Testing the C++ implementation
To be sure the implementation is correct, we check whether the same output is obtained with the C++ implementation.
#Test the cpp implementation
identical(Dist_between_strings(string_3, string_4, case_sensitive = TRUE),
Dist_between_strings_cpp(string_3, string_4, case_sensitive = TRUE))
#TRUE
Final benchmarks
Now is this actually faster? To see this we could run another benchmark using the microbenchmark package. The code and results are shown below:
#Final microbenchmarking
microbenchmark::microbenchmark(R = Dist_between_strings(string_7, string_8, case_sensitive = FALSE),
Rcpp = Dist_between_strings_cpp(string_7, string_8, case_sensitive = FALSE))
# Unit: milliseconds
# expr       min       lq      mean    median        uq       max neval
#    R 721.71899 753.6992 850.21045 787.26555 907.06919 1756.7574   100
# Rcpp  23.90164  32.9145  54.37215  37.28216  47.88256  243.6572   100
From the microbenchmark we get a median improvement factor of roughly 21 (= 787 / 37), which is a massive improvement from just moving a single loop to C++!
There is already an edit-distance function in R we can take advantage of: adist().
As it works on the character level, we'll have to assign a character to each unique word in our sentences, and stitch them together to form pseudo-words we can calculate the distance between.
s1 <- c("crashed", "red", "car")
s2 <- c("crashed", "blue", "bus")
ll <- list(s1, s2)
alnum <- c(letters, LETTERS, 0:9)
ll2 <- relist(alnum[factor(unlist(ll))], ll)
ll2 <- sapply(ll2, paste, collapse="")
adist(ll2)
#      [,1] [,2]
# [1,]    0    2
# [2,]    2    0
The main limitation here, as far as I can tell, is the number of unique characters available, which in this case is 62, but it can be extended quite easily, depending on your locale. E.g.: intToUtf8(c(32:126, 161:300), TRUE).
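As a rough sketch of that extension (the code-point range is only an example, and how it renders depends on your locale):
# a larger symbol pool than the 62 alphanumerics, one character per code point
symbols <- intToUtf8(c(32:126, 161:300), multiple = TRUE)
length(symbols) # 235 symbols available for unique words
ll2 <- relist(symbols[factor(unlist(ll))], ll)
ll2 <- sapply(ll2, paste, collapse = "")
adist(ll2)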
I recently asked a question about improving performance in my code (Faster method than "while" loop to find chain of infection in R).
Background:
I'm analyzing large tables (300 000 - 500 000 rows) that store data output by a disease simulation model. In the model, animals on a landscape infect other animals. For example, in the example pictured below, animal a1 infects every animal on the landscape, and the infection moves from animal to animal, branching off into "chains" of infection.
In my original question, I asked how I could output a data.frame corresponding to animal "d2"'s "chain of infection" (see below, outlined in green, for an illustration of one chain). The suggested solution worked well for one animal.
In reality, I will need to calculate chains for about 400 animals, corresponding to a subset of all animals (allanimals table).
I've included a link to an example dataset that is large enough to play with.
Here is the code for one chain, starting with animal 5497370, and note that I've slightly changed column names from my previous question, and updated the code!
The code:
allanimals <- read.csv("https://www.dropbox.com/s/0o6w29lz8yzryau/allanimals.csv?raw=1",
stringsAsFactors = FALSE)
# Here's an example animal
ExampleAnimal <- 5497370
ptm <- proc.time()
allanimals_ID <- setdiff(unique(c(allanimals$ID, allanimals$InfectingAnimal_ID)), -1)
infected <- rep(NA_integer_, length(allanimals_ID))
infected[match(allanimals$ID, allanimals_ID)] <-
match(allanimals$InfectingAnimal_ID, allanimals_ID)
path <- rep(NA_integer_, length(allanimals_ID))
curOne <- match(ExampleAnimal, allanimals_ID)
i <- 1
while (!is.na(nextOne <- infected[curOne])) {
path[i] <- curOne
i <- i + 1
curOne <- nextOne
}
chain <- allanimals[path[seq_len(i - 1)], ]
chain
proc.time() - ptm
# check it out
chain
I'd like to output chains for each animal in "sel.set":
sel.set <- allanimals %>%
filter(HexRow < 4 & Year == 130) %>%
pull("ID")
If possible, I'd like to store each "chain" data.frame in a list whose length equals the number of chains.
So I'll return the indices used to access the data frame rather than all the data frame subsets. You'll just need to use lapply(test, function(path) allanimals[path, ]), or a more complicated function inside the lapply if you want to do other things with the data frame subsets.
One could think of just using lapply on the solution for one animal:
get_path <- function(animal) {
curOne <- match(animal, allanimals_ID)
i <- 1
while (!is.na(nextOne <- infected[curOne])) {
path[i] <- curOne
i <- i + 1
curOne <- nextOne
}
path[seq_len(i - 1)]
}
sel.set <- allanimals %>%
filter(HexRow < 4 & Year == 130) %>%
pull("ID")
system.time(
test <- lapply(sel.set, get_path)
) # 0.66 seconds
We could rewrite this function as a recursive function (this will introduce my third and last solution).
system.time(
sel.set.match <- match(sel.set, allanimals_ID)
) # 0
get_path_rec <- function(animal.match) {
`if`(is.na(nextOne <- infected[animal.match]),
NULL,
c(animal.match, get_path_rec(nextOne)))
}
system.time(
test2 <- lapply(sel.set.match, get_path_rec)
) # 0.06
all.equal(test2, test) # TRUE
This solution is 10 times as fast. I don't understand why though.
Why did I want to write a recursive function? I thought you might have a lot of cases where, for example, you want the paths of animalX and animalY, where animalY infected animalX. When computing the path of animalX, you would then recompute the whole path of animalY.
So I wanted to use memoization to store already-computed results, and memoization works well with recursive functions. Hence my last solution:
get_path_rec_memo <- memoise::memoize(get_path_rec)
memoise::forget(get_path_rec_memo)
system.time(
test3 <- lapply(sel.set.match, get_path_rec_memo)
) # 0.12
all.equal(test3, test) # TRUE
Unfortunately, this is slower than the second solution. Hopefully it will be useful for the whole dataset.
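To tie this back to the list of chain data frames asked for, one could, as noted above, map the index paths back onto allanimals; a quick sketch:
# sketch: turn each index path back into a "chain" data.frame,
# named by the animal ID it starts from
chains <- lapply(sel.set.match, get_path_rec)
names(chains) <- sel.set
chain_dfs <- lapply(chains, function(path) allanimals[path, ])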
I'm looking for some simple vectorized approach for my for loop in R.
I have the following data frame with sentences and two dictionaries of positive and negative words:
# Create data.frame with sentences
sent <- data.frame(words = c("just right size and i love this notebook", "benefits great laptop",
"wouldnt bad notebook", "very good quality", "orgtop",
"great improvement for that bad product but overall is not good", "notebook is not good but i love batterytop"), user = c(1,2,3,4,5,6,7),
stringsAsFactors=F)
# Create pos/negWords
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
"extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great",
"wouldnt bad")
negWords <- c("hate","bad","not good","horrible")
And now I create replicates of the original data frame to simulate a big data set:
# Replicate original data.frame - big data simulation (700,000 rows of sentences)
df.expanded <- as.data.frame(replicate(100000,sent$words))
library(zoo) # coredata() below comes from zoo
sent <- coredata(sent)[rep(seq(nrow(sent)),100000),]
rownames(sent) <- NULL
For my next step, I order the dictionary words by decreasing length and attach their sentiment scores (positive word = 1 and negative word = -1).
# Ordering words in pos/negWords
wordsDF <- data.frame(words = posWords, value = 1,stringsAsFactors=F)
wordsDF <- rbind(wordsDF,data.frame(words = negWords, value = -1))
wordsDF$lengths <- unlist(lapply(wordsDF$words, nchar))
wordsDF <- wordsDF[order(-wordsDF[,3]),]
rownames(wordsDF) <- NULL
Then I define the following function with a for loop:
# Sentiment score function
scoreSentence2 <- function(sentence){
score <- 0
for(x in 1:nrow(wordsDF)){
matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words
count <- length(grep(matchWords,sentence)) # count them
if(count){
score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*', collapse='|'), ' ', sentence) # remove matched words from the sentence
library(qdapRegex) # rm_white() strips extra whitespace
sentence <- rm_white(sentence)
}
}
score
}
And I call the previous function on sentences in my data frame:
# Apply scoreSentence function to sentences
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
# Time consumption for 700,000 sentences in the sent data.frame:
# user system elapsed
# 1054.19 0.09 1056.17
# Add sentiment score to origin sent data.frame
sent <- cbind(sent, SentimentScore2)
Desired output is:
                                    words user SentimentScore2
just right size and i love this notebook    1               2
                   benefits great laptop    2               1
                    wouldnt bad notebook    3               1
                       very good quality    4               1
                                  orgtop    5               0
.
.
.
And so forth...
Please, can anyone help me reduce the computing time of my original approach? Because of my beginner programming skills in R, I'm at a dead end :-)
Any of your help or advice will be very appreciated. Thank you very much in advance.
In the spirit of "teaching somebody to fish is better than giving them a fish", I'll walk you through it:
Make a copy of your code: you are going to mess it up!
Find the bottleneck:
1a: make the problem smaller:
nRep <- 100
df.expanded <- as.data.frame(replicate(nRep,sent$words))
library(zoo)
sent <- coredata(sent)[rep(seq(nrow(sent)),nRep),]
1b: keep a reference solution: you'll be changing your code, and there are few activities as good at introducing bugs as optimizing code!
sentRef <- sent
and add the same line, commented out, at the end of your code to remember what your reference is. To make it even easier to check that you are not messing up your code, you can test it automatically at the end of your code:
library("testthat")
expect_equal(sent,sentRef)
1c: wrap the profiler around the code you want to look at:
Rprof()
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
Rprof(NULL)
1d: view the result, with base R:
summaryRprof()
There are also nicer tools; you can check out the profileR or lineprof packages. lineprof is my tool of choice, and here it adds real value by allowing us to narrow the problem down to these two lines:
matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words
count <- length(grep(matchWords,sentence)) # count them
Fix it.
3.1 Fortunately, the main problem is fairly easy: you don't need the first line to be inside the function, so move it before. By the way, the same applies to your paste0(). Your code becomes:
matchWords <- paste("\\<",wordsDF[,1],'\\>', sep="") # matching exact words
matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*')
# Sentiment score function
scoreSentence2 <- function(sentence){
score <- 0
for(x in 1:nrow(wordsDF)){
count <- length(grep(matchWords[x],sentence)) # count them
if(count){
score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
sentence <- gsub(matchedWords[x],' ', sentence) # remove matched words from wordsDF
require(qdapRegex)
# sentence <- rm_white(sentence)
}
}
score
}
That changes the execution time for 1,000 reps from 5.64 s to 2.32 s. Not a bad investment!
3.2 The next bottleneck is the "count <-" line, but I think shadow already had just the right answer :-) Combined, we get:
matchWords <- paste("\\<",wordsDF[,1],'\\>', sep="") # matching exact words
matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*')
# Sentiment score function
scoreSentence2 <- function(sentence){
score <- 0
for(x in 1:nrow(wordsDF)){
count <- grepl(matchWords[x],sentence) # count them
score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
sentence <- gsub(matchedWords[x],' ', sentence) # remove matched words from wordsDF
require(qdapRegex)
# sentence <- rm_white(sentence)
}
score
}
Here that takes 0.18 s, or 31 times faster than the original...
You can easily vectorize your scoreSentence2 function, since grep, grepl are already vectorized:
scoreSentence <- function(sentence){
score <- rep(0, length(sentence))
for(x in 1:nrow(wordsDF)){
matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words
count <- grepl(matchWords, sentence) # count them
score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*', collapse='|'), ' ', sentence) # remove matched words from wordsDF
sentence <- rm_white(sentence)
}
return(score)
}
scoreSentence(sent$words)
Note that the count does not actually count the number of times the expression appears in one sentence (neither in your version nor in mine); it just tells you whether the expression appears at all. If you want to actually count the occurrences, you could use the following instead.
count <- sapply(gregexpr(matchWords, sentence), function(x) length(x[x>0]))
I am relatively new to R and all its wisdom, and I am trying to make my scripts more efficient. I am using a loop to simulate how an animal moves among different sites. The problem is that when I increase the number of sites or change the initial parameters (based on fixed probabilities of moving or staying in the same site), I end up with a very complicated loop. If I have to run several simulations with different parameters, I would prefer a more efficient loop or a function that can adjust to different situations. The first loop fills a matrix according to the initial probabilities, and the second loop compares the cumulative probability matrix against a random number from a list of values (10 in this example) and decides the fate of that individual (either stay or go to a new site).
Here is a simplification of my code:
N<-4 # number of sites
sites<-LETTERS[seq(from=1,to=N)]
p.stay<-0.45
p.move<-0.4
move<-matrix(c(0),nrow=N,ncol=N,dimnames=list(c(sites),c(sites)))
from<-array(0,c(N,N),dimnames=list(c(sites),c(sites)))
to<-array(0,c(N,N),dimnames=list(c(sites),c(sites)))
# Filling matrix with fixed probability #
for(from in 1:N){
for(to in 1:N){
if(from==to){move[from,to]<-p.stay} else {move[from,to]<-p.move/(N-1)}
}
}
move
cumsum.move<-cumsum(data.frame(move))
steps<-100
result<-as.character("") # for storing results
rand<-sample(random,steps,replace=TRUE)
time.step<-data.frame(rand)
colnames(time.step)<-c("time.step")
time.step$event<-""
to.r<-(rbind(sites))
j<-sample(1:N,1,replace=T) # first column to select (random number)
k<-sample(1:N,1,replace=T) # site selected after leaving and coming back
# Beginning of the longer loop #
for(i in 1:steps){
if (time.step$time.step[i]<cumsum.move[1,j]){time.step$event[i]<-to.r[1]} else
if (time.step$time.step[i]<cumsum.move[2,j]){time.step$event[i]<-to.r[2]} else
if (time.step$time.step[i]<cumsum.move[3,j]){time.step$event[i]<-to.r[3]} else
if (time.step$time.step[i]<cumsum.move[4,j]){time.step$event[i]<-to.r[4]} else
if (time.step$time.step[i]<(0.95)){time.step$event[i]<-NA} else
if (time.step$time.step[i]<1.0) break # break the loop
result[i]<-time.step$event[i]
j<-which(to.r==result[i])
if(length(j)==0){j<-k} # for individuals the leave and come back later
}
time.step
result
This loop is part of a bigger loop that will simulate and store the result after a series of simulations. Any ideas or comments on how I can improve the efficiency of this loop so that I can easily modify the number of sites or change the initial probability parameters without repeating or having to do major edits of the loop will be appreciated.
I'm not sure if I'm capturing the essence of your code, but this is faster than the for loops. It starts having an advantage as soon as we get past a few thousand steps. I replaced "random" with a sample from the uniform distribution (runif()).
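For reference, here is the setup assumed below; it simply rebuilds the question's time.step object with runif() in place of the undefined random vector (the seed and number of steps are illustrative):
# rebuild time.step with uniform draws instead of the undefined `random`
set.seed(1)  # illustrative seed
steps <- 10000
time.step <- data.frame(time.step = runif(steps))
time.step$event <- ""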
system.time(
time.step$event <- sapply(
time.step$time.step,
function(x) rownames(
cumsum.move[which(cumsum.move[,j] > x),])[[1]]
)
)
Here are my results at 10,000 steps. I'm working on a laptop, so 100,000 steps with the for loop didn't finish in under 1 minute, but sapply did it in 14 seconds.
> system.time(
+ time.step$event <- sapply(
+ time.step$time.step,
+ function(x) rownames(
+ cumsum.move[which(cumsum.move[,j] > x),])[[1]]
+ )
+ )
user system elapsed
1.384 0.000 1.387
> head(time.step)
time.step event
1 0.2787642 C
2 0.3098240 C
3 0.9079045 D
4 0.9904031 D
5 0.3754330 C
6 0.6984415 C
> system.time(
+ for(i in 1:steps){
+ if (time.step$time.step[i]<cumsum.move[1,j]){time.step$event[i]<-to.r[1]} else
+ if (time.step$time.step[i]<cumsum.move[2,j]){time.step$event[i]<-to.r[2]} else
+ if (time.step$time.step[i]<cumsum.move[3,j]){time.step$event[i]<-to.r[3]} else
+ if (time.step$time.step[i]<cumsum.move[4,j]){time.step$event[i]<-to.r[4]}
+ result[i]<-time.step$event[i]
+ }
+ )
user system elapsed
3.137 0.000 3.143
> head(time.step)
time.step event
1 0.2787642 C
2 0.3098240 C
3 0.9079045 D
4 0.9904031 D
5 0.3754330 C
6 0.6984415 C
I've got a column in a CSV file that looks like c("", "1", "1 1e-3") (i.e. whitespace separated). I'm trying to run through all values, taking the sum() of the values where there is at least one value and returning NA otherwise.
My code currently does something like this:
x <- c("","1","1 2 3")
x2 <- as.numeric(rep(NA,length(x)))
for (i in 1:length(x)) {
si <- scan(text=x[[i]],quiet=TRUE)
if (length(si) > 0)
x2[[i]] <- sum(si)
}
I'm struggling to make this fast; x is really a set of columns from a CSV file containing a few hundred thousand rows, and I thought it should be possible to do this quickly in R.
(These are thinned samples from the posterior of a reversible-jump MCMC algorithm, hence the multiple values per cell as the dimensionality changes throughout the file, and I want to combine them into useful columns.)
Building on the idea from @Chase, but handling NA and also avoiding a name for the helper function:
unlist(lapply(strsplit(x, " "),
function(v)
if (length(v) > 0)
sum(as.numeric(v))
else
NA
) )
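Applied to the three-element example from the question, this should give NA for the empty entry and the sums otherwise:
x <- c("", "1", "1 2 3")
unlist(lapply(strsplit(x, " "),
              function(v) if (length(v) > 0) sum(as.numeric(v)) else NA))
# [1] NA  1  6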
This seems to perform a bit faster and may work for you.
#define a helper function
f <- function(x) sum(as.numeric(x))
unlist(lapply(strsplit(x, " "), f))
#-----
[1] 0 1 6
This will return a zero instead of NA (since sum() of an empty vector is 0), but maybe that isn't a deal breaker for you?
Let's see how this scales to a larger problem:
#set up variables
x3 <- rep(x, 1e5)
x4 <- as.numeric(rep(NA,length(x3)))
#initial approach
system.time(for (i in 1:length(x3)) {
si <- scan(text=x3[[i]],quiet=TRUE)
if (length(si) > 0)
x4[[i]] <- sum(si)
})
#-----
user system elapsed
30.5 0.0 30.5
#New approach:
system.time(unlist(lapply((strsplit(x3, " ")), f)))
#-----
user system elapsed
0.82 0.01 0.84