I am using a main R function to call a series of R functions from different scripts. To reproduce results, I call set.seed() at the beginning of my main script. In the code, the sample() function randomly selects a couple of rows from a dataframe in function_8, and rand() is used in function_6. A simple workflow looks like this:
### Main R Function
library(dplyr)
set.seed(111)
### Begin calling other R scripts
output_1 <- function_1(...)
...
output_10 <- function_10(...)
### End Main R Function
Recently, I realized that if I make changes to function_9, which does not contain any randomization, the random numbers generated in function_8 change. For example:
sample() in function_8 selects rows 2, 15, 23, 50, 54 before updating function_9.
sample() in function_8 selects rows 23, 44, 50, 95, 98 after updating function_9.
However, results can be reproduced by starting a new R session.
So, I am wondering if anyone can give me some suggestions on how to properly call set.seed() in this situation? Thanks in advance!
Update
Per a deleted comment, I changed the seed to 123, which produces a set of consistent results. But I would appreciate it if someone could provide an in-depth explanation!
Maybe the sequence produced by seed 111 just happens to have some property that does not change function_8. You may want to generate a time-based random seed instead; here is a previous answer that may help you, using the system time.
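A minimal sketch of what is likely happening (using stand-in calls, not the original function_1 through function_10): R keeps a single global RNG stream, so any extra random draw between set.seed() and sample() shifts every later result. If updating function_9 added or removed even one random draw before function_8 runs, function_8's output changes. Re-seeding immediately before the randomized step makes it independent of what runs earlier:

```r
set.seed(111)
a <- sample(1:100, 5)   # first draw after seeding

set.seed(111)
runif(1)                # an extra draw consumes RNG state...
b <- sample(1:100, 5)   # ...so this (almost surely) no longer matches `a`

# Re-seeding right before the randomized step restores reproducibility,
# regardless of any unrelated randomness in between
set.seed(111)
c1 <- sample(1:100, 5)
runif(10)               # unrelated randomness elsewhere in the pipeline
set.seed(111)
c2 <- sample(1:100, 5)
identical(c1, c2)       # TRUE
```

So one robust pattern is to call set.seed() immediately before each function that uses randomization, rather than only once at the top of the main script.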
I'm trying to figure out how to set up a for loop in R when I want it to run over two or more parameters at once. Below I have posted sample code where I am able to get the code to run and fill a matrix with two values. In the second line of the for loop I have
R<-ARMA.var(length(x_global_sample),ar=c(tt[i],-.7))
And what I would like to do is replace the -.7 with another tt[i] (example below), so that my for loop would run through the values starting at (-1, -1), then (-1, -.99), (-1, -.98), ..., (1, .98), (1, .99), (1, 1), where the result matrix would then be populated by the output of Q and sigma.
R<-ARMA.var(length(x_global_sample),ar=c(tt[i],tt[i]))
or something similar to
R<-ARMA.var(length(x_global_sample),ar=c(tt[i],ss[i]))
It may well be that this would be better handled by two for loops; however, I'm not 100% sure how I would set that up so that the first parameter stays fixed while the code runs through the sequence of the second parameter, and once that finishes, the first parameter moves to its next value and stays fixed while the second parameter does another run-through.
I've posted some sample code down below; the ARMA.var function comes from the ts.extend package. Any insight into this would be great.
Thank you
tt <- seq(-1, 1, 0.01)
Result <- matrix(NA, nrow = length(tt) * length(tt), ncol = 2)
for (i in seq_along(tt)) {
  R <- ARMA.var(length(x_global_sample), ar = c(tt[i], -.7))
  Q <- t(y - X %*% beta_est_d) %*% solve(R) %*% (y - X %*% beta_est_d) +
    lam * t(beta_est_d) %*% D %*% beta_est_d
  RSS <- sum((y - X %*% solve(t(X) %*% solve(R) %*% X + lam * D) %*% t(X) %*% solve(R) %*% y)^2)
  Denom <- n - sum(diag(X %*% solve(t(X) %*% solve(R) %*% X + lam * D) %*% t(X) %*% solve(R)))
  sigma <- RSS / Denom
  Result[i, 1] <- Q
  Result[i, 2] <- sigma
  rm(Q)
  rm(R)
  rm(sigma)
}
Edit: I realize that what I have posted above is quite unclear, so to simplify things, consider the following code:
x <- seq(1, 20, 1)
y <- seq(1, 20, 2)
Result <- matrix(NA, nrow = length(x) * length(y), ncol = 2)
for (i in seq_along(x)) {
  z1 <- x[i] + y[i]
  z2 <- z1 + y[i]
  Result[i, 1] <- z1
  Result[i, 2] <- z2
}
So the results table would contain the following rows:
Row1: 1+1=2, 2+1=3
Row2: 1+3=4, 4+3=7
Row3: 1+5=6, 6+5=11
Row4: 1+7=8, 8+7=15
And this pattern would continue with x staying fixed until the last value of y is reached; then x would move to 2 and cycle through the calculations over y again, until the last row is
RowN: 20+19=39, 39+19=58.
So I just want to know whether there is a way to do it in one loop, or whether it is easier to run it as two loops.
I hope this makes my question clearer. I realize this is not the optimal way to do this; for now it is just for testing purposes, to see how long my initial process takes so that it can be streamlined down the road.
Thank you
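One way to cover every (x, y) combination in a single loop, sketched against the simplified example above (the same idea applies to the ARMA.var call, with tt in both slots): build the grid of pairs first with expand.grid(), then loop over its rows. The first argument to expand.grid() varies fastest, so putting y first keeps x fixed while y cycles, exactly as described.

```r
x <- seq(1, 20, 1)
y <- seq(1, 20, 2)

# All combinations; y (first argument) varies fastest, so x stays
# fixed while y cycles through its full sequence
grid <- expand.grid(y = y, x = x)

Result <- matrix(NA, nrow = nrow(grid), ncol = 2)
for (i in seq_len(nrow(grid))) {
  z1 <- grid$x[i] + grid$y[i]
  z2 <- z1 + grid$y[i]
  Result[i, 1] <- z1
  Result[i, 2] <- z2
}

Result[1, ]            # 2 3   (x = 1,  y = 1)
Result[nrow(grid), ]   # 39 58 (x = 20, y = 19)
```

The nested two-loop version works too, but a pre-built grid keeps a single row index i for filling the result matrix, which avoids having to compute the row position from two loop counters.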
I'm creating a Monte Carlo model using R. My model creates matrices that are filled with either zeros or values that fall within the constraints. I'm running a couple hundred thousand n values through my model, and I want to find the average of the non-zero values that I've created. I'm guessing I can do something in the last section.
Thanks for the help!
Code:
n <- 252500
PaidLoss_1 <- numeric(n)
PaidLoss_2 <- numeric(n)
PaidLoss_3 <- numeric(n)
PaidLoss_4 <- numeric(n)
PaidLoss_5 <- numeric(n)
PaidLoss_6 <- numeric(n)
PaidLoss_7 <- numeric(n)
PaidLoss_8 <- numeric(n)
PaidLoss_9 <- numeric(n)
for (i in 1:n) {
  claim_type <- rmultinom(1, 1, c(0.00166439057698873, 0.000810856947763742,
    0.00183509730283373, 0.000725503584841243, 0.00405428473881871,
    0.00725503584841243, 0.0100290201433936, 0.00529190850119495,
    0.0103277569136224, 0.0096449300102424, 0.00375554796858996,
    0.00806589279617617, 0.00776715602594742, 0.000768180266302492,
    0.00405428473881871, 0.00226186411744623, 0.00354216456128371,
    0.00277398429498122, 0.000682826903379993))
  claim_type <- which(claim_type == 1)
  claim_Amanda <- runif(1, min = 34115, max = 2158707.51)
  claim_Bob <- runif(1, min = 16443, max = 413150.50)
  claim_Claire <- runif(1, min = 30607.50, max = 1341330.97)
  claim_Doug <- runif(1, min = 17554.20, max = 969871)
  if (claim_type == 1) {PaidLoss_1[i] <- 1 * claim_Amanda}
  if (claim_type == 2) {PaidLoss_2[i] <- 0 * claim_Amanda}
  if (claim_type == 3) {PaidLoss_3[i] <- 1 * claim_Bob}
  if (claim_type == 4) {PaidLoss_4[i] <- 0 * claim_Bob}
  if (claim_type == 5) {PaidLoss_5[i] <- 1 * claim_Claire}
  if (claim_type == 6) {PaidLoss_6[i] <- 0 * claim_Claire}
}
PaidLoss1 <- sum(PaidLoss_1) / 2525
PaidLoss3 <- sum(PaidLoss_3) / 2525
PaidLoss5 <- sum(PaidLoss_5) / 2525
PaidLoss7 <- sum(PaidLoss_7) / 2525
[partial output of my numeric matrix omitted]
First, let me make sure I've wrapped my head around what you want to do: you have several columns -- in your example, PaidLoss_1, ..., PaidLoss_9, which have many entries. Some of these entries are 0, and you'd like to take the average (within each column) of the entries that are not zero. Did I get that right?
If so:
Comment 1: At the very end of your code, you might want to avoid using sum and dividing by a number to get the mean you want. It obviously works, but it opens you up to a risk: if you ever change the value of n at the top, then in the best case scenario you have to edit several lines down below, and in the worst case scenario you forget to do that. So, I'd suggest something more like mean(PaidLoss_1) to get your mean.
Right now, you have n as 252500, and your denominator at the end is 2525, which has the effect of inflating your mean by a factor of 100. Maybe that's what you wanted; if so, I'd recommend mean(PaidLoss_1) * 100 for the same reasons as above.
Comment 2: You can do what you want via subsetting. Take a smaller example as a demonstration:
test <- c(10, 0, 10, 0, 10, 0)
mean(test) # gives 5
test!=0 # a vector of TRUE/FALSE for which are nonzero
test[test!=0] # the subset of test which we found to be nonzero
mean(test[test!=0]) # gives 10, the average of the nonzero entries
The middle three lines are just for demonstration; the only necessary lines to do what you want are the first (to declare the vector) and the last (to get the mean). So your code should be something like PaidLoss1 <- mean(PaidLoss_1[PaidLoss_1 != 0]), or perhaps that times 100.
Comment 3: You might consider organizing your data into a matrix or dataframe. Instead of typing PaidLoss_1, PaidLoss_2, etc., it might make sense to collect all the PaidLoss vectors into a matrix. You could then access elements with [ , ] indexing. This would clean up the code and save a lot of typing; you could also make use of the apply() family of functions instead of repeating the same commands over and over for different columns (such as the mean). A dataframe or another structure would work too; the point is that having some structure makes your life easier.
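A minimal sketch of that suggestion (with made-up toy numbers, not the simulation above): store the columns in one matrix, then compute the non-zero mean of every column at once with apply().

```r
# Toy matrix standing in for the PaidLoss columns (3 draws x 3 claim types)
PaidLoss <- cbind(c(10, 0, 30),
                  c(0, 5, 15),
                  c(0, 0, 8))

# Mean of the non-zero entries in each column, via the subsetting trick above
nonzero_mean <- function(v) mean(v[v != 0])
apply(PaidLoss, 2, nonzero_mean)   # 20 10 8
```

One apply() call then replaces the nine near-identical PaidLoss lines at the end of the script.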
(And to be super clear, your code is exactly what my code looked like when I first started writing in R. You can decide if it's worth pursuing some of that optimization; it probably just depends how much time you plan to eventually spend in R.)
I have some code that requires several simulations, which I am hoping to run across separate computers. Each simulation requires identifying a random subset of the data to then run the analyses on. When I run this on separate computers at the same time, I notice that the same rows are selected for each simulation. So if I am running 3 simulations, each simulation will identify the same 'random' samples across the separate computers. I am not sure why this is; can anyone suggest any code to get around this?
I show the sample_n() function from dplyr below, but the same thing happened using the sample() function in base R. Thanks in advance.
library(dplyr)
explanatory <- c(1,2,3,4,3,2,4,5,6,7,8,5,4,3)
response <- c(3,4,5,4,5,6,4,6,7,8,6,10,11,9)
A <- data.frame(explanatory,response)
B <- data.frame(explanatory,response)
C <- data.frame(explanatory,response)
for (i in 1:3) {
  Rand_A <- sample_n(A, 8)
  Rand_B <- sample_n(B, 8)
  Rand_C <- sample_n(C, 8)
  Rand_All <- rbind(Rand_A, Rand_B, Rand_C)
}
You can set the seed for each computer separately, as brb suggests above. You could also have this happen automatically by setting the seed from the computer's IP address, which would eliminate the need to edit your script for each computer. One implementation uses the ipify package:
library(devtools)
install_github("gregce/ipify")
library(ipify)
set.seed(as.numeric(gsub("[.]","",get_ip())))
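One caveat, plus a base-R alternative (a sketch I am adding, not part of the original answer): stripping the dots from an IP address can produce a number larger than the integer range that set.seed() accepts, so it is safer to reduce it modulo .Machine$integer.max. And if you would rather avoid an extra package, combining the clock time with the process ID also gives each machine (and each run) a different seed:

```r
# Derive a per-machine, per-run seed without any extra packages.
# Sys.time() differs across runs; Sys.getpid() differs across processes
# started at the same moment. The modulo keeps it in valid integer range.
seed <- (as.integer(Sys.time()) + Sys.getpid()) %% .Machine$integer.max
set.seed(seed)

sample(1:100, 5)  # now differs across machines, even if started simultaneously
```

Whichever source you use, print or log the seed with each run so any individual simulation can still be reproduced later.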
I am trying to perform a t-sne analysis on a file with 39772 columns and 170 rows.
I first used the "Rtsne" package, but that package seems to have a limit of 10,000 columns, as R keeps aborting every time I run the code with the entire file.
Because of this, I switched from "Rtsne" to the "tsne" package, but now the code is taking FOREVER to run (over 2 hours). This is what I have so far. I've read other posts, but nothing seems to apply to my problem. I'd appreciate any ideas on what I can do to fix this and actually see an output.
CODE USING "TSNE" PACKAGE (taking 2+ hours to run...still haven't seen an output):
exp <- read.csv("tsnedata.csv")
library(tsne)
exp1 <- t(exp)
exp2 <- matrix(as.numeric(unlist(exp1)), nrow = nrow(exp1))
exp3 <- data.matrix(exp2)
cols <- rainbow(10)
ecb <- function(x, y) {plot(x, t = 'n'); text(x, col = cols)}
tsne_res <- tsne(exp3, epoch_callback = ecb, perplexity = 50, epoch = 50)
ORIGINAL CODE USING "RTSNE" PACKAGE (this is the code that immediately causes R to abort unless I run the code using only the first 10,000 columns of the data):
exp <- read.csv("tsnedata.csv")
library(Rtsne)
exp1 <- t(exp)
exp2 <- matrix(as.numeric(unlist(exp1)), nrow = nrow(exp1))
exp3 <- data.matrix(exp2)
tsne <- Rtsne(as.matrix(exp3), check_duplicates = FALSE, pca = FALSE, perplexity = 30, theta = 0.5, dims = 2)
cols <- rainbow(10)
plot(tsne$Y, t = 'n')
text(tsne$Y, col = cols)
If you are dealing with scRNA-seq data and want to visualize each cell as a dot in the t-SNE plot, here are my thoughts:
1. Make sure your input is a cell-by-gene expression matrix.
2. Do dimension reduction first (e.g. PCA), and only feed the first few principal components into Rtsne.
Rtsne is based on the Barnes-Hut implementation; it is much faster than the original implementation of t-SNE and is also a better way to do the analysis (it corrected some bugs in the original tsne package). However, in my experience, tsne produces cuter (round, ball-like) visualizations than Rtsne.
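A sketch of that two-step workflow, using simulated data as a stand-in for tsnedata.csv (the 170 x 1000 shape and the 50-component cutoff are assumptions for illustration; rows are cells/samples, columns are genes/features, so transpose first if your file is the other way around):

```r
library(Rtsne)

set.seed(1)
# Simulated stand-in for tsnedata.csv: 170 samples (rows) x 1000 features (cols)
exp <- matrix(rnorm(170 * 1000), nrow = 170)

# Step 1: PCA, keeping only the leading 50 principal components
pcs <- prcomp(exp, center = TRUE)$x[, 1:50]

# Step 2: t-SNE on the reduced matrix.
# Note Rtsne requires 3 * perplexity < nrow(pcs) - 1, so with 170 rows
# perplexity must stay below about 56.
tsne <- Rtsne(pcs, pca = FALSE, perplexity = 30, theta = 0.5,
              check_duplicates = FALSE)
plot(tsne$Y, pch = 19, cex = 0.5)
```

With only ~50 input columns instead of ~40,000, the Barnes-Hut run should finish in seconds rather than hours, and the crash you saw on the full matrix is avoided entirely.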
I'm trying to use dput() to create a reproducible example with a large database. The database needs to be large because the reproducible example involves moving averages. The way I've found to do this involves the function reproduce, shared in How to make a great R reproducible example? by @Ricardo Saporta. reproduce is based on dput() (code here: https://github.com/rsaporta/pubR/blob/gitbranch/reproduce.R).
library(data.table)
library(devtools)
source_url("https://raw.github.com/rsaporta/pubR/gitbranch/reproduce.R")
data <- read.table("http://pastebin.com/raw/xP1Zd0sC")
setDF(data)
reproduce(data, rows = c(1:100))
That code creates the data dataframe and then provides a dput() output for it, using the rows argument to output the full dataframe. Yet if I use that output to recreate the dataframe, it fails.
Trying to assign the dput() output to a new dataframe results in incomplete code, requiring me to add three parentheses manually at the end. And after doing so, I get the following error message: "Error in View : arguments imply differing number of rows: 100, 61".
Please note that the dput() output from reproduce without the rows = c(1:100) argument works fine. It just does not output the full dataframe, but rather a sample of it.
#This works fine
reproduce(data)
Please also note that I used the pastebin method to create this reproducible example. That method does not replace the dput() method for my purposes, because it fails whenever I try to import data where some columns have spaces between words (e.g. dataframes with datetime stamps).
EDIT: After some further troubleshooting, I discovered that reproduce fails as described above when the rows argument is used together with a dataframe containing 4 or more columns. I will have to find an alternative.
If anyone is interested in testing this, run the code above with the following links, each containing a different number of columns:
1) 100x5: http://pastebin.com/raw/xP1Zd0sC
2) 100x4: http://pastebin.com/raw/YZtetfne
3) 100x4: http://pastebin.com/raw/63Ap2bh5
4) 100x3: http://pastebin.com/raw/1vMMcMtx
5) 100x3: http://pastebin.com/raw/ziM1bYQt
6) 100x1: http://pastebin.com/raw/qxtQs5u4
If you are just trying to dput() the first 100 rows of a data set, then you can simply subset the data just prior to running dput(). There doesn't seem to be a need to use the linked function.
dput(droplevels(head(data, 100))) ## or dput(droplevels(data[1:100,]))
should do it.
It is, however, peculiar that your attempt with reproduce() did not work. I would file an issue on its GitHub page; you will likely get a more constructive answer there.
Thanks to @David Arenburg for reminding me about droplevels(). It is useful in this operation if we have factor columns: "leftover" levels will be dropped.
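A small round-trip check of that approach (a sketch using the built-in iris data rather than the pastebin files): the text that dput() prints for the first 100 rows can be parsed back into an identical dataframe, which is exactly what a reproducible example needs.

```r
original <- droplevels(head(iris, 100))

# dput() prints this representation to the console; deparse() returns
# the same representation as a character vector we can capture
txt <- paste(deparse(original), collapse = "\n")

# Pasting that text back into R (here: parsing it) reconstructs the data
restored <- eval(parse(text = txt))
identical(restored, original)  # TRUE
```

If identical() ever returns FALSE for your own data, comparing attributes(original) and attributes(restored) usually points at the culprit (typically row names or factor levels).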