R iterations only saving the first value of a vector

Up until now, every problem I've had has already been asked and answered here, but this time I'm really at a loss.
I am running grep in R to look for a list of regex values in two strings, and write the ones that are exclusive to one string and not the other. The outputs are vectors, but when I loop through the vectors or even just try to save them, R is only saving the first value of the vector.
So while:
inWT = as.list(rep(0, nrow(NEBres)))
> c(setdiff(c(grep(NEBres$Recognition_Site[1], reverseComplement(bigfile$FASTA, case = "upper")),
grep(NEBres$Recognition_Site[1], bigfile$FASTA)),
c(grep(NEBres$Recognition_Site[1], reverseComplement(bigfile$m_FASTA, case = "upper")),
grep(NEBres$Recognition_Site[1], bigfile$m_FASTA))))
[1] 86 480
If I try to save it, only the first value is stored:
inWT[1] = c(setdiff(c(grep(NEBres$Recognition_Site[1], reverseComplement(bigfile$FASTA, case = "upper")),
grep(NEBres$Recognition_Site[1], bigfile$FASTA)),
c(grep(NEBres$Recognition_Site[1], reverseComplement(bigfile$m_FASTA, case = "upper")),
grep(NEBres$Recognition_Site[1], bigfile$m_FASTA))))
> inWT[1]
[[1]]
[1] 86
I haven't been able to solve this for some time now, and I'm running out of ideas.
Thanks in advance!
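One likely cause, assuming `inWT` is a list as created above: single-bracket assignment (`inWT[1] <- value`) replaces a sub-list, so a length-2 vector is treated as two list elements and only the first one fits (R warns that the number of items to replace is not a multiple of replacement length). Double brackets store the whole vector in one slot. A minimal sketch, with `hits` standing in for the `setdiff()` result and `in_wt` for `inWT` (both names are illustrative):

```r
# Stand-in for the setdiff(...) result; values taken from the printed output above.
hits <- c(86L, 480L)

in_wt <- as.list(rep(0, 3))

# Single brackets: the vector is split into list elements and only the first is kept.
single <- in_wt
suppressWarnings(single[1] <- hits)

# Double brackets: the whole vector becomes the content of slot 1.
double <- in_wt
double[[1]] <- hits

# Equivalent single-bracket form: wrap the vector in a list first.
wrapped <- in_wt
wrapped[1] <- list(hits)

print(single[[1]])  # only 86 survives
print(double[[1]])  # both 86 and 480 are stored
```

Inside a loop, the same change would apply: assign with `inWT[[i]] <- ...` rather than `inWT[i] <- ...`.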

Related

Having trouble filling a vector in a for-loop

I am having trouble filling a specified vector with search-values from the spotifyr package and I can't really understand where it is going wrong.
top2017id <- numeric(200)
for(i in top2017vec){
search <- search_spotify(i, type = "track", limit = 1)
top2017id[i] <- search$id
}
Error in top2017id[i] <- search$id : replacement has length zero
In addition: Warning message:
Unknown or uninitialised column: `id`.
top2017vec is a vector containing 200 track names, for example: "Mi Gente", and what I want the for-loop to do, is search for the first track name in the vector using the search_spotify function, save it to the un-defined "search" and then save search$id to the first place in the already defined vector top2017id, and then repeat the process but with the second track name instead.
The function that I use inside the for-loop, "search_spotify", is a function from the spotifyr package that returns a list with 27 variables. I have tested it outside the for-loop, and indexing with search$id works perfectly fine, returning just a string with the track's ID.
Apart from the error I receive, it does add some values to the top2017id vector. The first 200 values are 0, but after those it adds 27 values that alternate between a track name from top2017vec and that track's ID. Like this:
> top2017id
"0"
"0"
"0"
"0"
...
Believer
"0pqnGHJpmpxLKifKRmU6WP"
Felices los 4
"1RouRzlg8OKFeqc6LvdxmB"
What is it that I have managed to screw up?
Edit:
I kept on trying after the answer from @Dylan_Gomes and made some progress; however, I am now stuck with another similar error.
for(i in 1:length(top2017vec)){
search <- search_spotify(top2017vec[i], type = "track", limit = 1)
top2017id[i] <- search$id
}
It now works for the first 26 IDs, but after that it gives me 0's for the rest of the vector and then ends. The error message I receive is:
Error in top2017id[i] <- search$id : replacement has length zero
In addition: Warning message:
Unknown or uninitialised column: `id`.
The way you have your for loop might be the problem. For example:
vect<-numeric(200)
for(i in vect){
search<-rnorm(1,0,1)
vect[i]<-search
}
vect
Doesn't work, it returns a vector of 200 zeros still. Yet, if we change the for loop structure to:
for(i in 1:length(vect)){
search<-rnorm(1,0,1)
vect[i]<-search
}
vect
[1] 0.87096868 0.78146593 0.72339698 0.45954073 1.29507907 0.28822357 -0.97277289 -0.22033080
[9] -0.41323427 -1.79971088 -0.20233652 -1.30564552 0.46676890 -0.64209630 0.95616195 0.67121680
[17] -0.18220987 -0.45524523 -0.91059605 -1.65350181 -0.33524219 2.60902403 0.58630848 -1.22887993
It then works as expected. There might be a different problem going on with spotifyr but I can't check it because it doesn't work with the current version of R.
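Two things appear to be going on in the question, and both can be reproduced without spotifyr. Looping with `for(i in top2017vec)` makes `i` a track name, so `top2017id[i] <- ...` appends a new named element instead of filling position `i`. And "replacement has length zero" is what R reports when the right-hand side is empty, which would happen if a search returned no rows. The sketch below uses a hypothetical `fake_search()` in place of `search_spotify()` (its return shape is an assumption, not the package's documented behaviour), iterates by position, and leaves `NA` when nothing comes back:

```r
# Hypothetical stand-in for search_spotify(): returns a data frame with an
# `id` column, or zero rows when the track is not found.
fake_search <- function(track) {
  known <- c("Mi Gente" = "id-001", "Believer" = "id-002")
  if (track %in% names(known)) {
    data.frame(id = unname(known[track]), stringsAsFactors = FALSE)
  } else {
    data.frame(id = character(0), stringsAsFactors = FALSE)
  }
}

tracks <- c("Mi Gente", "No Such Track", "Believer")

# Pre-allocate a character vector (IDs are strings, not numbers).
ids <- rep(NA_character_, length(tracks))

for (i in seq_along(tracks)) {
  result <- fake_search(tracks[i])
  # Assign only when at least one ID came back; assigning a zero-length
  # value would raise "replacement has length zero".
  if (length(result$id) >= 1) {
    ids[i] <- result$id[1]
  }
}

print(ids)
```

Any position still holding `NA` afterwards marks a track whose search came back empty, which makes the failing entry (the 27th in the question) easy to inspect.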

Storing matrix after every iteration

I have following code.
for(i in 1:100)
{
for(j in 1:100)
R[i,j]=gcm(i,j)
}
gcm() is some function which returns a number based on the values of i and j and so, R has all values. But this calculation takes a lot of time. My machine's power was interrupted several times due to which I had to start over. Can somebody please help, how can I save R somewhere after every iteration, so as to be safe? Any help is highly appreciated.
You can use the saveRDS() function to save the result of each calculation in a file.
To understand the difference between save and saveRDS, here is a link I found useful. http://www.fromthebottomoftheheap.net/2012/04/01/saving-and-loading-r-objects/
If you want to save the R-workspace have a look at ?save or ?save.image (use the first to save a subset of your objects, the second one to save your workspace in toto).
Your edited code should look like
for(i in 1:100)
{
for(j in 1:100)
R[i,j]=gcm(i,j)
save.image(file="path/to/your/file.RData")
}
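A lighter-weight variant of the same idea, sketched under assumptions: write only the result matrix with `saveRDS()` after each completed row, and on restart read it back and skip rows that are already filled. Here `gcm_stub()`, `res`, and the checkpoint path are placeholders for the real `gcm()`, the matrix `R`, and a real file location.

```r
# Placeholder for the expensive gcm() from the question.
gcm_stub <- function(i, j) i + j / 100

n <- 5
checkpoint <- file.path(tempdir(), "gcm_checkpoint.rds")  # hypothetical path

# Resume from an earlier run if a checkpoint exists, otherwise start empty.
if (file.exists(checkpoint)) {
  res <- readRDS(checkpoint)
} else {
  res <- matrix(NA_real_, nrow = n, ncol = n)
}

for (i in seq_len(n)) {
  # A row with no NA left was finished before the interruption.
  if (!anyNA(res[i, ])) next
  for (j in seq_len(n)) {
    res[i, j] <- gcm_stub(i, j)
  }
  saveRDS(res, checkpoint)  # persist after every completed row
}
```

This relies on `NA` meaning "not computed yet", so it would need a separate progress marker if `gcm()` can legitimately return `NA`.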
About your code taking a lot of time, I would advise trying the ?apply function, which, according to its documentation, "Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix".
You want gcm to be run for each cell, which means you want to apply it to each combination of row and column coordinates
nR = 100 # number of rows
nC = 100 # number of columns
M = expand.grid(1:nR, 1:nC) # Cartesian product of the coordinates
# each row of M contains the indexes of one of R's cells
# head(M) # just to see it
# To use apply we need gcm to take one argument only (that's not entirely true; if you want to know how it really works have a look at ?apply),
# thus I create a function which takes one row of M and passes its first element to gcm as the row index and its second as the column index
gcmWrapper = function(x) { return(gcm(x[1], x[2])) }
# run apply, which will return a vector containing *all* the evaluated expressions
R = apply(M, 1, gcmWrapper)
# re-shape R into a matrix (the row index varies fastest in M, matching R's column-major fill)
R = matrix(R, nrow=nR, ncol=nC)
If the apply approach is still slow, consider the snowfall package, which will allow you to follow the apply approach using parallel computing. An introduction to snowfall usage can be found in this pdf; look at pages 5 and 6 in particular.
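An alternative not mentioned in the answer, shown here only as a sketch: base R's `outer()` combined with `Vectorize()` builds the full matrix in one call. It does not make `gcm()` itself any faster; it just replaces the explicit loops. `gcm_stub()` is a hypothetical scalar function standing in for the real `gcm()`.

```r
# Hypothetical stand-in for gcm(); it only handles one (i, j) pair at a time.
gcm_stub <- function(i, j) {
  if (i > j) i - j else i * 10 + j
}

n_rows <- 4
n_cols <- 3

# Vectorize() wraps the scalar function so outer() can pass it whole vectors.
res_outer <- outer(seq_len(n_rows), seq_len(n_cols), Vectorize(gcm_stub))

# Reference: the explicit double loop from the question.
res_loop <- matrix(NA_real_, n_rows, n_cols)
for (i in seq_len(n_rows)) {
  for (j in seq_len(n_cols)) {
    res_loop[i, j] <- gcm_stub(i, j)
  }
}

print(isTRUE(all.equal(res_outer, res_loop)))
```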

R: Logical from 2 vectors on pattern match

Trying to clean up some dirty data (for work), my data frame has a column for customer information (for our example lets say store and product) in a long weird string, as well as a column for store and a column for product. I can parse the store and the product from the string. Here is where I arrive at my problem.
let's say (consider these vectors part of a larger dataframe, appended with data$ if that helps, I was just working with them as vectors thinking it may speed up the code not having to pull the whole dataframe):
WeirdString <- c("fname: john; lname:smith; store:Amazon Inc.; product:Echo", "fname: cindy; lname:smith; store:BestBuy; product:Ps-4","fname: jon; lname:smith; store:WALMART; product:Pants")
so I parse this to be:
WS_Store <- c("Amazon Inc.", "BestBuy", "WALMART")
WS_Prod <- c("Echo", "Ps-4", "Pants")
What's in the tables (i.e. the non-parsed columns) is:
DB_Store <- c("Amazon", "BEST BUY", "Other")
DB_Prod <- c("ECHO", "PS4", "Jeans")
I currently am using a for loop to loop through i to grepl the "true" string from the parsed string. This takes forever, and I know R was designed to use vectorized code, So my question is, how do I eliminate the loop and use something like lapply (which I tried, and failed at, because I'm not savvy enough with lapply), or some other vectorized thing?
My current code:
for(i in 1:nrow(data)){ # could be i in 1:length(DB_Prod) or whatever; all vectors are the same length
Diff_Store[i] <- !grepl(DB_Store[i], WS_Store[i], ignore.case=T)
Diff_Prod[i] <- !grepl(DB_Prod[i] , WS_Prod[i] , ignore.case=T)
}
I intend to append those columns back into the dataframe, as the true goal is to diagnose why the database has this problem.
If there's a better way than this, rather than trying to vectorize it, I'm open to it. The data in the DB_Store is restricted to a specific number of "stores" (in the table it comes from) but in the string, it seems to be open, which is why I use the DB as the pattern, not the x. Product is similar, but not as restricted, this is why some have dashes and some don't. I would love to match "close things" like Ps-4 vs. PS4, but I will probably just build a table of matches once I see how weird the string gets. To be true though, the string may not match, which is represented by the Pants/Jeans thing. The dataset is 2.5 million records, and there are many different "stores" and "products", and I do want to make sure they match on the same line, not "is it in the database" (which is what previous questions seem to ask, can I see if a string is in a list of strings, rather than a 1:1 comparison, and the last question did end in a loop, which takes minutes and hours to run)
Thanks!
Please check if this works for you:
check <- function(vec_a, vec_b){
mat <- cbind(vec_a, vec_b)
diff <- apply(mat, 1, function(x) !grepl(pattern = x[1], x = x[2], ignore.case = TRUE))
diff
}
Use your different vectors for stores (or products) in the arguments vec_a and vec_b, respectively (example: diff_stores <- check(DB_Store, WS_Store) ). This function will return a logical vector with TRUE values referring to items that weren't a match in the two original vectors. Is this what you wanted?
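A variation on the same row-by-row comparison, offered as a sketch rather than a drop-in replacement: `mapply()` pairs the two vectors directly, and a small normalization step (lower-casing and stripping everything except letters and digits) lets values such as "Ps-4" and "PS4" count as a match. The helper names `normalize()` and `differs()` are made up for this example, and `fixed = TRUE` is used so that characters like "." in store names are not read as regex syntax.

```r
DB_Store <- c("Amazon", "BEST BUY", "Other")
WS_Store <- c("Amazon Inc.", "BestBuy", "WALMART")
DB_Prod <- c("ECHO", "PS4", "Jeans")
WS_Prod <- c("Echo", "Ps-4", "Pants")

# Lower-case and drop everything except letters and digits.
normalize <- function(x) gsub("[^a-z0-9]", "", tolower(x))

# TRUE where the database value is NOT found inside the parsed value, row by row.
differs <- function(db, ws) {
  !mapply(function(p, x) grepl(p, x, fixed = TRUE),
          normalize(db), normalize(ws), USE.NAMES = FALSE)
}

Diff_Store <- differs(DB_Store, WS_Store)
Diff_Prod <- differs(DB_Prod, WS_Prod)

print(Diff_Store)
print(Diff_Prod)
```

`mapply()` still loops at the R level, so on 2.5 million rows it may not be dramatically faster than the original loop. A string library whose matching functions are vectorized over both the text and the pattern (for example `stri_detect_fixed()` in the stringi package) may be worth testing if speed remains a problem.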

R returns list instead of filling in dataframe column

I am trying to use apply() to fill in an additional column in a dataframe by calling a function I created on each row of the data frame.
The dataframe is called Hit.Data and has 2 columns, Zip.Code and Hits. Here are a few rows:
Zip.Code , Hits
97222 , 20
10100 , 35
87700 , 23
The apply code is the following:
Hit.Data$Zone = apply(Hit.Data, 1, function(x) lookupZone("89000", x["Zip.Code"]))
The lookupZone() function is the following:
lookupZone <- function(sourceZip, destZip){
sourceKey = substr(sourceZip, 1, 3)
destKey = substr(destZip, 1, 3)
return(zipToZipZoneMap[[sourceKey]][[destKey]])
}
All the lookupZone() function does is take the 2 strings, truncates to the required characters and looks up the values. What happens when I run this code though is that R assigns a list to Hit.Data$Zone instead of filling in data row by row.
> typeof(Hit.Data$Zone)
[1] "list"
What baffles me is that when I use apply and just tell it to put a number in it works correctly:
> Hit.Data$Zone = apply(Hit.Data, 1, function(x) 2)
> typeof(Hit.Data$Zone)
[1] "double"
I know R has a lot of strange behavior around dropping dimensions of matrices and doing odd things with lists but this looks like it should be pretty straightforward. What am I missing? I feel like there is something fundamental about R I am fighting, and so far it is winning.
Your problem is that you are occasionally looking up non-existing entries in your hashmap, which causes hash to silently return NULL. Consider:
> hash("890", hash("972"=3, "101"=3, "877"=3))[["890"]][["101"]]
[1] 3
> hash("890", hash("972"=3, "101"=3, "877"=3))[["890"]][["100"]]
NULL
If apply encounters any NULL values, then it can't coerce the result to a vector, so it will return a list. Same will happen with sapply.
You have to ensure that all possible combinations of the first three zip code digits in your data are present in your hash, or you need logic in your code to return NA instead of NULL for missing entries.
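The second option (returning NA instead of NULL) can be sketched as follows. A plain nested list stands in for the hash object here, which is an assumption about how `zipToZipZoneMap` is organised; the names `zone_map`, `lookup_zone`, and `hits`, and the extra zip code 55555, are made up for the example. `vapply()` is used because it insists on exactly one numeric value per row, so a stray NULL raises an error instead of silently producing a list.

```r
# Nested list standing in for the zip-to-zone map (structure is assumed).
zone_map <- list("890" = list("972" = 3, "101" = 3, "877" = 3))

lookup_zone <- function(source_zip, dest_zip, map = zone_map) {
  source_key <- substr(source_zip, 1, 3)
  dest_key <- substr(dest_zip, 1, 3)
  zone <- map[[source_key]][[dest_key]]
  # Missing entries come back as NULL; turn them into NA so the
  # result can always be simplified to an atomic vector.
  if (is.null(zone)) NA_real_ else zone
}

hits <- data.frame(
  Zip.Code = c("97222", "10100", "87700", "55555"),  # last one is not in the map
  Hits = c(20, 35, 23, 7),
  stringsAsFactors = FALSE
)

hits$Zone <- vapply(hits$Zip.Code,
                    function(z) lookup_zone("89000", z),
                    numeric(1), USE.NAMES = FALSE)

print(typeof(hits$Zone))
print(hits)
```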
As others have said, it's hard to diagnose without knowing what zipToZipZoneMap contains, but you could try this:
Hit.Data$Zone <- sapply(Hit.Data$Zip.Code, function(x) lookupZone("89000", x))

R assign several list elements the same object

I currently have a loop - well actually a loop in loop, in a simulation model which gets slow with larger numbers of individuals. I've vectorised most of it and made it a heck of a lot faster. But there's a part where I assign multiple elements of a list as the same thing, simplifying a big loop to just the task I want to achieve:
new.matrices[[length(new.matrices)+1]]<-old.matrix
With each iteration of the loop the line above is called, and the same matrix object is assigned to the next new element of a list.
I'm trying to vectorize this - if possible, or make it faster than a loop or apply statement.
So far I've tried stuff along the lines of:
indices <- seq(from = length(new.matrices) + 1, to = length(new.matrices) + reps)
new.matrices[indices] <- old.matrix
However this results in the message:
Warning message:
In new.effectors[effectorlength] <- matrix :
number of items to replace is not a multiple of replacement length
It also tries to assign one value of the old.matrix to one element of new.matrices like so:
[[1]]
[1] 8687
[[2]]
[1] 1
[[3]]
[1] 5486
[[4]]
[1] 0
The desired result is for each list element to hold one whole matrix, a copy of old.matrix.
Is there a way I can vectorize sticking a matrix in list elements without looping? With loops how it is currently implemented we are talking many thousands of repetitions which slows things down considerably, hence my desire to vectorize this if possible.
Probably you already solved your problem, anyway, the issue in your code
new.matrices[indices] <- old.matrix
was caused by trying to replace some objects (the NULL elements in your new.matrices list) with something different, a matrix. So R coerces old.matrix into a vector and tries to stick each single value to a different list element, (that's why you got this result, and when, say, reps is 4 or 8 and old.matrix is NOT a 2 x 2 matrix, you also get the warning). Doing
new.matrices[indices] <- list(old.matrix)
will work, and R will replicate the single element list list(old.matrix) "reps" times automatically.
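A small self-contained check of that behaviour, using a made-up 2 x 3 matrix and `reps <- 3` (both are illustrative values, not taken from the question). It also shows `rep(list(old.matrix), reps)`, which builds the repeated list explicitly and gives the same result:

```r
old.matrix <- matrix(1:6, nrow = 2)  # illustrative matrix
reps <- 3                            # illustrative repetition count

new.matrices <- list()
indices <- seq(from = length(new.matrices) + 1, to = length(new.matrices) + reps)

# Wrapping the matrix in list() makes it a single element that R recycles
# across all the selected positions.
new.matrices[indices] <- list(old.matrix)

# Same result, with the replication spelled out.
appended <- c(list(), rep(list(old.matrix), reps))

print(length(new.matrices))
print(identical(new.matrices, appended))
```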
