Using vectors instead of for loop in R - r

I have a long line of code and am trying to speed things up by removing for loops. As I understand it, multiple nested loops can slow your code down. My original code contained 3 loops which ran for 598, 687 and 44 iterations. It took about 15 minutes to run. I use this code to display output from some models I'm running, and waiting 15 minutes is unacceptable. I'm having trouble getting rid of one of the loops. I'm trying to use vectors but it doesn't run correctly. Let's look at the first 10 iterations.
#data
flows=c(-0.088227, 0.73024, 0.39683, 1.1165, 1.0802, 0.22345, 0.78272, 0.91673, 0.53052, 0.13852)
cols=c(31, 31, 30, 30, 30, 30, 31, 31, 31, 31)
rows=c(3, 4, 4, 5, 6, 7, 7, 8, 9, 10)
dataset=matrix(0,33,44)
for (i in 1:10){dataset[rows[i],cols[i]]<-flows[i]+dataset[rows[i],cols[i]]}
#And this is my alternative (not working)
dataset=matrix(0,33,44)
NoR=10
dataset[rows[1:NoR],cols[1:NoR]]<-flows[1:NoR]+dataset[rows[1:NoR],cols[1:NoR]]
See the problem here. Somehow columns are showing the same row information.
What am I missing here? Why won't the second code run correctly?

It's a little hard to help with what is likely your real underlying problem, because I'm guessing we're only seeing a small snippet of your code here. But I think maybe you're looking for something like this:
flow_mat <- matrix(0,33,44)
flow_mat[(cols - 1) * 33 + rows] <- flows
Remember that matrices are just vectors with a dimension attribute. So you can index them just like a vector, imagining that the indices start at the "upper left" and wrap around by column.
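To spell out why the attempt in the question misbehaves: dataset[rows, cols] with two index vectors selects every combination of the listed rows and columns (a whole sub-block), not the ten (row, column) pairs. The sketch below is one way to check a vectorised version against the loop. It assumes the 33 x 44 matrix from the question; the names loop_mat, vec_mat, lin and sums are made up for illustration. Summing with tapply() before assigning means a cell that appears more than once still accumulates, which plain indexed assignment would not do.

```r
flows <- c(-0.088227, 0.73024, 0.39683, 1.1165, 1.0802,
           0.22345, 0.78272, 0.91673, 0.53052, 0.13852)
cols  <- c(31, 31, 30, 30, 30, 30, 31, 31, 31, 31)
rows  <- c(3, 4, 4, 5, 6, 7, 7, 8, 9, 10)

# Reference result: the original loop
loop_mat <- matrix(0, 33, 44)
for (i in seq_along(flows)) {
  loop_mat[rows[i], cols[i]] <- loop_mat[rows[i], cols[i]] + flows[i]
}

# Vectorised version: convert each (row, col) pair to a column-major linear index
vec_mat <- matrix(0, 33, 44)
lin  <- (cols - 1) * nrow(vec_mat) + rows
# Sum the flows per cell first, so repeated cells accumulate like the loop does
sums <- tapply(flows, lin, sum)
vec_mat[as.integer(names(sums))] <- as.vector(sums)

all.equal(loop_mat, vec_mat)
```

When no cell is repeated, a two-column index matrix does the same job more readably: vec_mat[cbind(rows, cols)] <- flows.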

Related

Create a list of element from a vector with an incrementation of specified number of element from the vector itself

I already posted a related question (Create a new vector by appending elements between them in R).
I would like to know whether it's possible to increment a vector by a specified number of elements (like accumulate() from the purrr package).
In fact, I'm working with a vector of 16000 genes. I'm trying to write a for loop where, at each iteration, 100 genes are knocked out of the data set before running some clustering analysis (clustering with 16000 genes, clustering with 15900 genes, clustering with 15800 genes, etc.). My idea is to make a list from the vector where each element is a vector of genes incremented by 100 (first element 100 genes, second element 200 genes, third element 300 genes, and the 160th element the total 16000 genes).
With accumulate(), I can only increment one by one between two consecutive elements. Is there a way to make it increment 100 by 100?
Thank you all once again for your help!
Instead of a for loop, you could use a while loop and build a new list every time. It's not the most efficient way of doing this, but considering your dataset size, it should do the trick.
Here is some code to help you get started:
# Create a vector of values
my_list <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
# Get the length of the vector
max_len <- length(my_list)
# While max_len is positive, take the first max_len elements, then decrement max_len by some value (here, 10) for the next subset
while (max_len > 0) {
  new_list <- my_list[1:max_len]
  print(new_list)
  max_len <- max_len - 10
}
Hope that helps !
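If the goal is the list described in the question (100 genes, then 200, and so on up to 16000), a loop-free sketch could look like the following. The gene names, step and gene_sets are placeholders invented for the example, not taken from the original data.

```r
genes <- paste0("gene", 1:16000)   # hypothetical gene identifiers
step  <- 100

# End positions 100, 200, ..., plus the full length in case it is not a multiple of step
ends <- unique(c(seq(step, length(genes), by = step), length(genes)))

# One list element per end position, each holding the first n genes
gene_sets <- lapply(ends, function(n) genes[seq_len(n)])

length(gene_sets)
lengths(gene_sets)[1:3]
```

Iterating over rev(gene_sets) would then give the 16000, 15900, 15800, ... order needed for the knock-out runs.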

Set my own fixed X-axis value in a grid chart? Including symbols like "<" and ">" (QlikView)

So I'm creating this grid chart and I really want to have the following values on my X-axis:
"<10"
"<20"
">20"
I want my graph to look something like the following graph, in the link below:
Graph example
The nodes' X values do not have the less-than (<) or greater-than (>) symbols; they are just numbers spanning from 1-30 with no extra characters. Choosing only that field as the X-axis doesn't do it, of course. I only want those three specified values, containing the symbols (< and >), on the X-axis.
I feel like this should be a simple thing to solve, but I've tried for a while now without any success...
Sorry about the poor example, hopefully you understand what I'm saying.
Any ideas?
Thanks in advance.
Have a look at the Class function.
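The class function assigns the first parameter to a class interval. The result is a dual value with a<=x<b as the textual value, where a and b are the lower and upper limits of the bin, and the lower bound as the numeric value.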
You could create a grid chart and use PplWatched and Rating as dimensions with expression count(id) using the following testdata:
Data:
load
id, class(PplWatched, 10) as PplWatched, Rating
;
load * inline [
id, PplWatched, Rating
1, 14, 4
2, 2, 2
3, 19, 5
4, 30, 4
5, 9, 1
6, 45, 5
];

R seq function between item 1 and 2, then between 2 and 3 of a vector

I have a vector c(5, 10, 15) and would like to use something like the seq function to create a new vector: 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15. This is how I would do it now, but it seems inelegant at best. In the final (functional) form, I would need to increment by any given number, not necessarily units of 1.
original_vec <- c(5, 10, 15)
new_vec <- unique(c(seq(original_vec[1],original_vec[2],1),seq(original_vec[2],original_vec[3],1)))
> new_vec
[1] 5 6 7 8 9 10 11 12 13 14 15
Is there a way (I'm sure there is!) to use an apply or similar function to apply a sequence across multiple items in a vector, also without repeating the number in the middle (in the case above, 10 would be repeated, if not for the unique function call)?
Edit: Some other possible scenarios might include changing c(1,5,7,10,12) to 1,1.5,2,2.5 ... 10, 10.5, 11, 11.5, 12, or c(1,7,4) where the price increases and then decreases by an interval.
The answer may be totally obvious and I just can't quite figure it out. I have looked at manuals and searched for the answer already. Thank you!
While this isn't the answer to my original question, after discussing with my colleague, we don't have cases where seq(min(original_vec), max(original_vec), by=0.5) wouldn't work, so that's the simplest answer.
However, a more generalized answer might be:
interval = 1
seq(original_vec[1], original_vec[length(original_vec)], by = interval)
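For the c(1,7,4) scenario from the question, where the values rise and then fall, a single seq() call from the first to the last element would skip the peak. One possible sketch builds a segment per consecutive pair and lets the step take the sign of each segment. The function name expand_seq is made up for this example, and it assumes the input has at least two elements.

```r
expand_seq <- function(v, by = 1) {
  # One segment per consecutive pair; the step follows the direction of the pair
  segs <- Map(function(a, b) seq(a, b, by = sign(b - a) * by),
              head(v, -1), tail(v, -1))
  out <- unlist(segs)
  # Adjacent segments share an endpoint, so drop consecutive repeats
  out[c(TRUE, diff(out) != 0)]
}

expand_seq(c(5, 10, 15))
expand_seq(c(1, 7, 4))
expand_seq(c(1, 5, 7, 10, 12), by = 0.5)
```

If by does not divide a gap evenly, each segment simply stops short of its endpoint and the next one restarts from the original value, so the original points are always kept.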
Edit: Just thought I'd go ahead and include the finished product, which uses the seq value in a larger context and works for increasing values AND for cases where values change direction. The use case is the linear interpolation of utilities, given original prices and utilities.
orig_price <- c(2,4,6)
orig_utils <- c(2,1,-3)
utility.expansion = function(x, y, by=1){
  #x = original price, y = original utilities
  require(zoo)
  new_price <- seq(x[1],x[length(x)],by)
  temp_ind <- new_price %in% x
  new_utils <- rep(NA,length(new_price))
  new_utils[temp_ind] <- y
  new_utils <- na.approx(new_utils)
  return(list("new price"=new_price,"new utilities"=new_utils))
}
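As a cross-check that avoids the zoo dependency, base R's approx() performs linear interpolation directly on the price/utility pairs. This is a different technique from the na.approx() approach above, shown only as a sketch; it assumes the prices are distinct, and the variable names are illustrative.

```r
orig_price <- c(2, 4, 6)
orig_utils <- c(2, 1, -3)

# Prices at which interpolated utilities are wanted
new_price <- seq(min(orig_price), max(orig_price), by = 1)

# stats::approx() interpolates linearly between the known (price, utility) points
new_utils <- approx(x = orig_price, y = orig_utils, xout = new_price)$y

data.frame(price = new_price, utility = new_utils)
```

Because approx() works on the x values rather than on positions in a vector, it does not require every original price to fall exactly on the new grid.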

Are random seeds "spent" in some way?

I expected this code to give me the same number each time:
set.seed(1)
sample(rep(1:100), size = 1)
However, it doesn't. Here is how it behaves: it gives me the same number if I run the two lines directly after each other; however, if I then run the second line once again, it gives me something different. Does that mean the seed is "spent" after running sample() once?
I need to produce code that includes random sampling, but which is reproducible. How can I make sure the same, random number is produced each time?
This isn't exactly how things really work, but it helps for understanding.
Imagine that we're drawing numbers between 0 and 9 by looking them up in a pre-generated book of random numbers. A really long book with lots and lots of random draws made by some bored undergraduate intern with a fair 10-sided die. What R normally does is look at the number following whatever the previous random number was in the book. So if the book starts 4, 5, 2, 8, 3, 4, 4, 1, 7, 0, 4, ... the first random number will be 4, then 5 etc.
The results will be as random as the book is---no matter where in the book we start. Typically, you don't know what you're going to get because you have no idea where in the book R currently is, maybe page 103, maybe page 10003. Setting the seed tells R to start at a specific place. So set.seed(1) says "start on page 1", so you can now count on that "4" being first, followed by a 5, and so forth.
It's not that the seed is spent; rather, setting the seed produces a fixed sequence of pseudo-random values that you'll be sampling from. Once you've sampled one value from the sequence, the next value you sample will be the subsequent value in the sequence. However, if you reset the seed, then you will start back at the beginning of the sequence when you sample again.
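A small sketch of that behaviour: sampling twice after one set.seed() call moves along the sequence, while calling set.seed() again with the same value starts over. The variable names are only for illustration.

```r
set.seed(1)
first  <- sample(1:100, size = 3)  # first three draws after seeding
second <- sample(1:100, size = 3)  # continues from where the previous call stopped

set.seed(1)                        # go back to the start of the same sequence
again  <- sample(1:100, size = 3)

identical(first, again)   # the re-seeded draws repeat the first ones
identical(first, second)  # the continued draws are a later part of the sequence
```

For reproducible code, the usual pattern is therefore to call set.seed() once at the top of the script (or immediately before each block that must be repeatable on its own).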

Specifying directly factor levels and sizes

How would you create a factor with levels and corresponding sizes directly specified?
e.g. [0, 5) 6
[5, 7) 20
[7, 13) 4
Edit: This question is related to grouped frequency distributions. Sometimes (say in textbooks), you don't get access to original data but you're just given the count of the occurrences of values within each class. Later on, you'd want to compute cumulative count/frequency, you'd like to tell what count such or such class has and so on. So you just need to be able to enter the class table and hence my question.
Second edit:
Typical textbook example (it's already a summary, the original data set is not available):
[20, 30) 221890
[30, 35) 171050
[35, 40) 121400
[40, 45) 101050
[45, 60) 71620
# ... possibly many more but let's stop here.
Then typical questions are: what is the tally for the [30, 35) class? What is the cumulative count at 45? Plot the corresponding histogram, and so on and so forth.
So @thelatemail's first comment provided a workable answer, but I was worried about the resulting factor 'size'. That's why I asked for alternative solutions. @agstudy's answer also works along the same lines, but with the extra burden of recreating a (temporary, agreed) whole new data set. Still, it's an interesting answer by itself. I was particularly interested in the way @agstudy computed the temporary data set.
All in all, these solutions work but I would like some optimized approach if at all possible.
Theoretically, 'factor's would be the needed output but 'factor's seem way too big to store that summary table.
For example using cut you can do this:
cut(rep(c(1,6,11),c(6,20,4)),c(0,5,7,13))
You can check using table
table(cut(rep(c(1,6,11),c(6,20,4)),c(0,5,7,13)))
 (0,5]  (5,7] (7,13]
     6     20      4
EDIT: to create the data from the intervals, you can also do this:
cut(rep((c(0,5,7,13) +1)[-4],c(6,20,4)),c(0,5,7,13))
EDIT: even after the clarification, it is still not clear to me what you have as inputs, especially the structure of your input data. Here is a straightforward method:
text='[20, 30) 221890
[30, 35) 171050
[35, 40) 121400
[40, 45) 101050
[45, 60) 71620'
dd <- do.call(rbind,strsplit(readLines(textConnection(text)),') ',fixed=TRUE))
vv <- as.numeric(dd[,2])
names(vv) <- paste0(dd[,1],')')
vv
[20, 30) [30, 35) [35, 40) [40, 45) [45, 60)
  221890   171050   121400   101050    71620
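If the concern is storage, one option is to keep only the summary table itself (one row per class) rather than a factor with one entry per observation. The sketch below assumes the five classes from the question; the names breaks, counts and freq are invented for the example. Using right = FALSE in cut() gives the left-closed [a,b) labels the question asks for.

```r
breaks <- c(20, 30, 35, 40, 45, 60)
counts <- c(221890, 171050, 121400, 101050, 71620)

# Build the "[a,b)" labels from the breaks alone, without expanding any data
labels <- levels(cut(breaks[-length(breaks)], breaks, right = FALSE))

freq <- data.frame(
  class = factor(labels, levels = labels),
  lower = breaks[-length(breaks)],
  upper = breaks[-1],
  count = counts
)
freq$cum_count <- cumsum(freq$count)

freq$count[freq$class == "[30,35)"]   # tally for the [30, 35) class
freq$cum_count[freq$upper == 45]      # cumulative count at 45
```

For a histogram with unequal class widths, the bar heights would be freq$count / (freq$upper - freq$lower), which can be passed to barplot() together with the widths.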
