How to synchronize an index in a for loop in R - r

I have a vector modbirths which has 23 entries.
I have a vector pop_vars_modded which has 23 entries.
I want to synchronize my for loop so that county[i] and pop_vars_modded[i]
move in step. For example, the loop should pair county[1] with pop_vars_modded[1], ..., county[19] with pop_vars_modded[19], and never pair county[7] with pop_vars_modded[22] or any other mismatched indices. Here is the code I am working with:
for (county in modbirths) {
  for (i in 1:23) {
    print(dbinom(county[i], pop_vars_modded[i], p = wyo2004_smokerate, log = T))
  }
}
The output gives me 23^2 = 529 numbers from dbinom (every element of modbirths crossed with every value of i), but I only want the 23 matched ones.
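One way to keep the two vectors in step (a minimal sketch, assuming modbirths and pop_vars_modded are both length-23 numeric vectors and wyo2004_smokerate is a single probability) is to drop the outer loop and index both vectors with the same i; dbinom() is also vectorized over its first two arguments, so the loop can be dropped entirely:
# single loop: element i of both vectors is used together
for (i in 1:23) {
  print(dbinom(modbirths[i], pop_vars_modded[i], prob = wyo2004_smokerate, log = TRUE))
}
# equivalently, no loop at all: dbinom() aligns its vector arguments element by element
dbinom(modbirths, pop_vars_modded, prob = wyo2004_smokerate, log = TRUE)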

Related

How can I speed up this R code, in which I use stringdist?

I'm trying to clean up our customer database by identifying customer records that are similar enough to be considered the same customer (and thus given the same customer id). I've concatenated the relevant customer data into one column named customerdata. I've found the R package stringdist and I'm using the following code to calculate the distance between every pair of records:
output <- df$id
for (i in 1:(length(df$customerdata) - 1)) {
  for (j in (i + 1):length(df$customerdata)) {
    if (abs(df$customerdataLEN[i] - df$customerdataLEN[j]) < 10) {
      if (stringdist(df$customerdata[i], df$customerdata[j]) < 10) {
        output[j] <- df$id[i]
      }
    }
  }
}
df$newcustomerid <- output
So here, I first initialize a vector named output with the customer id data. Then I loop through all customers. I have a column called customerdataLEN holding the length of each record. To reduce calculation time, I first check whether the two records have a large (10 or more) difference in length; if so, I don't bother calculating the stringdist. Otherwise, if the distance between the two customers is < 10, I consider them to be the same customer and give them the same id.
I'm looking to speed up the process, however. At 2000 rows, this loop takes 2 minutes; at 7400 rows, it takes 32 minutes. I'm looking to run this on around 1,000,000 rows. Does anyone have any idea how to improve the speed of this loop?
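One way to cut the number of stringdist() calls from O(n^2) to O(n) is to compare record i against all remaining candidates in a single call, since stringdist() accepts vectors. A rough sketch, reusing the question's own columns (df$customerdata, df$customerdataLEN, df$id) and keeping the same length pre-filter and overwrite order:
library(stringdist)
output <- df$id
n <- length(df$customerdata)
for (i in 1:(n - 1)) {
  j <- (i + 1):n
  # apply the length pre-filter to all remaining rows at once
  cand <- j[abs(df$customerdataLEN[i] - df$customerdataLEN[j]) < 10]
  if (length(cand) > 0) {
    # one vectorized call compares record i to every surviving candidate
    d <- stringdist(df$customerdata[i], df$customerdata[cand])
    output[cand[d < 10]] <- df$id[i]
  }
}
df$newcustomerid <- output
For a million rows even this will likely be slow; blocking the records first (for example into buckets by string length) and running stringdist::stringdistmatrix() within each block is a common next step.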

how to select a matrix column based on column name

I have a table with shortest paths obtained with:
library(igraph)
g <- barabasi.game(200)
geodesic.distr <- table(shortest.paths(g))
geodesic.distr
# 0 1 2 3 4 5 6 7
# 117 298 3002 2478 3342 3624 800 28
I then build a matrix with 100 rows and the same number of columns as the length of geodesic.distr:
geo <- matrix(0, nrow=100, ncol=length(unlist(labels(geodesic.distr))))
colnames(geo) <- unlist(labels(geodesic.distr))
Now I run 100 experiments where I create preferential attachment-based networks with
for (i in seq(1:100)) {
  bar <- barabasi.game(vcount(g))
  geodesic.distr <- table(shortest.paths(bar))
  distance <- unlist(labels(geodesic.distr))
  for (ii in distance) {
    geo[i, ii] <- WHAT HERE?
  }
}
and for each experiment, I'd like to store in the matrix how many paths I have found.
My question is: how do I select the right column based on the column name? In my case, some path lengths produced by a simulated network may not be present in the original one, so I need not only to find the right column by its name but also to fall back to the closest one (for example, if my maximum is 7 and a run produces a path of length 9, which has no column in geo, I want to add that count to the column named 7).
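For the literal column-lookup part: a character subscript selects a matrix column by its name, so one minimal sketch (hypothetical, meant to replace the inner for(ii in distance) loop above, and simply clamping any path length beyond the largest column onto the last column) would be:
max.name <- colnames(geo)[ncol(geo)]   # largest distance present in the original geo, e.g. "7"
for (ii in distance) {
  col <- if (ii %in% colnames(geo)) ii else max.name
  geo[i, col] <- geo[i, col] + geodesic.distr[[ii]]   # add this run's count of paths of length ii
}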
There is actually a problem with your approach. The length of the geodesic.distr table is stochastic, and you are allocating a matrix to store 100 realizations based on a single run. What if one of the 100 runs gives you a longer geodesic.distr vector? I assume you would want to make the allocated matrix bigger in that case. Or, even better, you could run the 100 realizations first and allocate the matrix once you know its size.
Another potential problem is that if you do table(shortest.paths(bar)), then you are (by default) considering undirected distances, so you will end up with a symmetric matrix and count every distance (except the self-distances) twice. This may or may not be what you want.
Anyway, here is a simple way, with the matrix allocated after the 100 runs:
dists <- lapply(1:100, function(x) {
  bar <- barabasi.game(vcount(g))
  table(shortest.paths(bar))
})
maxlen <- max(sapply(dists, length))
geo <- t(sapply(dists, function(d) c(d, rep(0, maxlen - length(d)))))
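If column names are also wanted on this geo, and assuming every run's table is named with consecutive distances starting at 0 (which is what table(shortest.paths(...)) gives on a connected graph), they can be added afterwards:
colnames(geo) <- as.character(0:(maxlen - 1))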

R looping over two vectors

I have created two vectors in R, using statistical distributions to build the vectors.
The first is a vector of locations on a string of length 1000. That vector has around 10 values and is called mu.
The second vector is a list of numbers, each one representing the number of features at each location mentioned above. This vector is called N.
What I need to do is generate random values for all features (N) at each location (mu).
After some fiddling around, I found that this code works correctly:
for (i in 1:length(mu)) {
  a <- rnorm(N[i], mu[i], 20)
  feature.location <- c(feature.location, a)
}
This produces the right output: a vector of sum(N) numbers, where each number is a location that corresponds to the data in mu.
I found that this only worked when I used concatenate to get the values into a vector.
My question is; why does this code work? How does R know to loop sum(N) times but for each position in mu? What role does concatenate play here?
Thanks in advance.
To try and answer your question directly, c(...) is not "concatenate", it's "combine". That is, it combines its arguments into a vector. So c(1,2,3) is a vector with 3 elements.
Also, rnorm(n,mu,sigma) is a function that returns a vector of n random numbers sampled from the normal distribution. So at each iteration, i,
a <- rnorm(N[i],mu[i],20)
creates a vector a containing N[i] random numbers sampled from Normal(mu[i],20). Then
feature.location <- c(feature.location,a)
adds the elements of that vector to the vector from the previous iteration. So at the end, you have a vector with sum(N) elements.
I guess you're sampling from a series of locations, each a variable number of times.
I'm guessing your data looks something like this:
set.seed(1) # make reproducible
N <- ceiling(10*runif(10))
mu <- sample(seq(1000), 10)
> N;mu
[1] 3 4 6 10 3 9 10 7 7 1
[1] 206 177 686 383 767 496 714 985 377 771
Now you want to take a sample from rnorm of length N[i], with mean mu[i] and sd=20, and store all the results in one vector.
The method you're using (growing the vector) is not recommended, as the vector will be re-copied in memory each time an element is added. (See Circle 2 of The R Inferno; for small examples like this it's not so important.)
First, initialize the storage vector:
f.l <- NULL
for (i in 1:length(mu)) {
  a <- rnorm(n = N[i], mean = mu[i], sd = 20)
  f.l <- c(f.l, a)
}
Then, each time, a stores your sample of length N[i] and c() combines it with the existing f.l by adding it to the end.
A more efficient approach is
unlist(mapply(rnorm, N, mu, MoreArgs=list(sd=20)))
This vectorizes the loop. unlist() is needed because mapply() returns a list of vectors of varying lengths (so it cannot simplify them into a matrix).
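A quick check, using the N, mu and f.l defined above, that both versions return one draw per feature:
f.l2 <- unlist(mapply(rnorm, N, mu, MoreArgs = list(sd = 20)))
length(f.l)  == sum(N)   # TRUE
length(f.l2) == sum(N)   # TRUE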

Iterate process in R using range of vectors derived from matrix

I must first apologize as I have no programming background, so please forgive me if this question is overly simplistic or if it has been addressed repeatedly. I would be very willing to help clarify my issue if it is not clear from my explanation.
I have two data matrices. "A":
          [Ac1] [Ac2] ... [Ac500]
[Ar1]        25    30 ...      15
[Ar2]         7    54 ...      41
...
[Ar25000]
and "B", which has the same number of columns but not the same number of rows:
          [Bc1] [Bc2] ... [Bc500]
[Br1]        25    30 ...      15
[Br2]         7    54 ...      41
...
[Br20000]
I'm running a module ("npSeq") in R that consistently takes matrix A as one input, together with a horizontal vector containing all of the values from one row of matrix B, e.g. [Br1]. The module returns a separate list of values. I will need to run the analysis independently for every row of matrix B, saving all of the returned lists, which I will then need to combine.
So I would like to know if there is a way to automate the process so that the module runs using the vector derived from row [Br1], saves the returned list, and then runs again using the vector derived from row [Br2], repeating the process until [Br20000].
Again I'm sorry that this is worded so poorly. I wish I understood enough of the terminology to state my problem more clearly.
You can use lapply to loop over B's row indices:
result.list <- lapply(1:nrow(B), function(i) npSeq(A, B[i, ]))
Note that this is not going to be much (any?) faster than using a for loop. It is just a short and clean equivalent. 20,000 iterations does sound like a lot so it may take a while depending on how slow the function is.
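Since the returned lists then need to be combined, and assuming each npSeq() call returns an object of the same shape (the shape of npSeq's output is a guess here), something along these lines would follow:
# e.g. if every element of result.list is a vector of the same length
combined <- do.call(rbind, result.list)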

calculating sums of unique values in a log in R

I have a data frame with three columns: timestamp, key, event which is ordered by time.
ts,key,event
3,12,1
8,49,1
12,42,1
46,12,-1
100,49,1
From this, I want to create a data frame with a timestamp column and a prob column, where prob at a given timestamp is (the number of unique keys seen so far minus the number of unique keys whose cumulative sum is 0 at that point) divided by the number of unique keys seen so far. E.g. for the above example the result should be:
ts,prob
3,1
8,1
12,1
46,2/3
100,2/3
My initial step is to calculate the cumsum grouped by key:
library(plyr)
items = data.frame(ts=c(3,8,12,46,100), key=c(12,49,42,12,49), event=c(1,1,1,-1,1))
sumByKey = ddply(items, .(key), transform, sum=cumsum(event))
In the second (and final) step I iterate over sumByKey with a for loop and keep track of both all unique keys and all unique keys that have reached a cumulative sum of 0, using two vectors, e.g. if (!(key %in% uniqueKeys)) uniqueKeys <- append(uniqueKeys, key). The prob is derived from the two vectors.
Initially, I tried to solve the second step using plyr, but I wanted to avoid re-calculating the unique keys up to a certain timestamp for each row in sumByKey. What I'm missing is a way either to refer to external variables from a function passed to ddply, or, alternatively (and more functionally), to use an accumulator passed back into the function, e.g. function(acc, x) acc + x.
Is it possible to solve the second step in a better way, using e.g. ddply?
If my interpretation is right, then this should do it:
items = data.frame(ts=c(3,8,12,46,100), key=c(12,49,42,12,49), event=c(1,1,1,-1,1))
# running count of keys whose cumulative sum has reached zero (no ddply necessary)
nzero <- cumsum(ave(items$event, items$key, FUN=cumsum) == 0)
# running count of unique keys seen up to each timestamp
nunique <- rep(FALSE, length(items$key))
nunique[match(unique(items$key), items$key)] <- TRUE
nunique <- cumsum(nunique)
# which gives:
items$p <- (nunique - nzero) / nunique
items
ts key event p
1 3 12 1 1.0000000
2 8 49 1 1.0000000
3 12 42 1 1.0000000
4 46 12 -1 0.6666667
5 100 49 1 0.6666667
If your problem is only computational time, I bet a better idea would be to implement your algorithm as a chunk of C: you could first use R to convert the keys to a contiguous range of integers (as.numeric(factor(...))) and then use a boolean array in C to track unique keys easily and very quickly. Remember that neither plyr nor the standard R *apply functions are significantly faster than loops (provided both are used without embarrassing errors, of course).
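For reference, the key-recoding step suggested above looks like this in R (the C side is not sketched here):
# map arbitrary key values onto 1..number_of_unique_keys
ikey <- as.numeric(factor(items$key))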
