I'm using data in the format shown below; the actual data set is much longer. The column labels are: Date | Variable 1 | Variable 2 | Failed
I'm sorting the data into date order. Some dates may be missing, but the sorting step handles that. From there, I'm trying to split the data into sets, where a new set starts whenever the far-right column registers a 1. I then want to plot these sets on a single graph with the number of days passed on the x-axis. I've looked into using ggplot, but it seems to require a data frame where the length of each vector is already known. I tried creating a matrix sized to the maximum number of days across all sets and filling the spare cells with NaN values to be plotted, but this took ages because my data set is quite large. Is there a more elegant way of plotting the values against days passed for all sets on a single graph, and then iterating the process for the additional variables?
Any help would be much appreciated.
Code for a reproducible example is included here:
test <- matrix(c(
"01/03/1997", 0.521583294, 0.315170092, 0,
"02/03/1997", 0.63946859, 0.270870821, 0,
"03/03/1997", 0.698687101, 0.253495021, 0,
"04/03/1997", 0.828754157, 0.233024574, 0,
"05/03/1997", 0.87078867, 0.214507537, 0,
"06/03/1997", 0.883279874, 0.212268627, 0,
"07/03/1997", 0.952083969, 0.062663598, 0,
"08/03/1997", 0.991100195, 0.054875256, 0,
"09/03/1997", 0.992490126, 0.026610776, 1,
"10/03/1997", 0.020707391, 0.866874513, 0,
"11/03/1997", 0.32405139, 0.778696984, 0,
"12/03/1997", 0.32665243, 0.703234151, 0,
"13/03/1997", 0.603941956, 0.362869647, 0,
"14/03/1997", 0.944046386, 0.026992527, 1,
"15/03/1997", 0.108246142, 0.939363715, 0,
"16/03/1997", 0.152195386, 0.907458966, 0,
"17/03/1997", 0.285748169, 0.765212667, 0), ncol = 4, byrow=TRUE)
colnames(test) <- c("Date", "Variable 1", "Variable 2", "Failed")
test <- as.table(test)
test
I've managed to hash together a solution, but it looks very messy. I'm convinced that there is a far more elegant way of solving this.
z = as.data.frame.matrix(test)
attach(z)
x = as.numeric(as.character(Failed))
x = cumsum(x) #The variable name x is reused for the cumulative failure count
The cumulative failure sum, corrected by subtracting the current Failed value, puts the data into sets according to the number of preceding failures.
z <- within(z, acc_sum <- x)
attach(z)
z$acc_sum <- as.numeric(as.character(z$acc_sum))-as.numeric(as.character(z$Failed))
attach(z)
z = data.frame(z, Group_Index = ave(acc_sum == acc_sum, acc_sum, FUN = cumsum))
An extra column is created that holds the number of days passed since the start of each measurement set. It's easier to read the code when new variable names are kept rather than indexing directly.
attach(z)
x = (max(acc_sum)+1) #This is the number of sets of variable results
Current columns read: Date|Variable.1|Variable.2|Failed|acc_sum|Group_Index
library(ggplot2)
n = data.frame(acc_sum, Group_Index)
This initialises the data frame so that Group_Index and acc_sum don't have to be read in again on every iteration.
for(j in 1:(ncol(z)-4)){ #This iterates through all the variables to generate a new set of plots. The -4 comes from removing Date, Failed, Group_Index and acc_sum
n$Variable <- z[,(j+1)] #This reads in the new variable data, but requires the variables to all be next to each other
n[] <- lapply(n,function(x)as.numeric(as.character(x))) #This ensures all the values are numeric for plotting
plot <- ggplot(n, aes(x = Group_Index, y = Variable, colour = acc_sum)) +
theme_bw() +
geom_line(aes(group=acc_sum)) #linetype = "dotted"
print(plot) #This ensures that the graph is presented in every iteration
cat ("Press [enter] to continue") #This waits for a user input before moving to the next variable
line <- readline()
}
The graph could be improved so that the y-axis label changes to match the variable being plotted; this could be done by setting the y label inside the for loop.
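For reference, here is a minimal sketch of the kind of tidier route I have in mind, assuming z is built from test as above. It uses reshape2::melt (an extra package, and the names Days, long and Value are just ones I've made up here) to stack the variables, so that one faceted ggplot call replaces the per-variable loop and the NaN padding:
library(ggplot2)
library(reshape2)
z <- as.data.frame.matrix(test)
failed <- as.numeric(as.character(z$Failed))
z$acc_sum <- cumsum(failed) - failed                      #Set index: failures seen before this row
z$Days <- ave(z$acc_sum, z$acc_sum, FUN = seq_along) - 1  #Days since the start of each set
long <- melt(z, id.vars = c("Date", "Failed", "acc_sum", "Days"),
             variable.name = "Variable", value.name = "Value")
long$Value <- as.numeric(as.character(long$Value))        #Ensures the values are numeric for plotting
ggplot(long, aes(Days, Value, colour = factor(acc_sum), group = acc_sum)) +
  theme_bw() +
  geom_line() +
  facet_wrap(~ Variable, scales = "free_y")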
Can someone help me with this? I got the cut_interval code to work for a single test column, but can't seem to get it to work in a for loop to have it run on all of the columns.
#Bin worker data into three groups (low/medium/high %methylation) for the cpg cg10757709
#This code works
cg10757709_interval <- cut_interval(cpgs$cg10757709, n=3, labels = c("low","med","high"))
View(cg10757709_interval)
#Write a loop so that data for each of the significant cpgs will be binned into low, medium, and high groups
#This code gives an error (more elements supplied than there are to replace)
cpgs_interval <- matrix(ncol = length(cpgs), nrow = 29)
for (i in seq_along(cpgs)) {
cpgs_interval[[i]] <- cut_interval(cpgs[[i]], n=3, labels = c("low","med","high"))
}
View(cpgs_interval)
The error says "Error in cpgs_interval[[i]] <- cut_interval(cpgs[[i]], n = 3, labels = c("low", : more elements supplied than there are to replace". Should I not be using a matrix for cpgs_interval? Or is something else the problem? I'm rather new to writing for loops. Thanks.
In your example, cpgs_interval is a matrix. If you want to put the variable into the ith column of the matrix, you could do:
for (i in seq_along(cpgs)) {
cpgs_interval[,i] <- cut_interval(cpgs[[i]], n=3, labels = c("low","med","high"))
}
That said, you might be better off making cpgs_interval a data frame; then you'll retain the factors rather than turning them into text.
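For example, a rough sketch of that data-frame route (assuming cpgs is a data frame of numeric columns; cut_interval() comes from ggplot2, and cpgs_interval is just an illustrative name):
library(ggplot2) #for cut_interval()
cpgs_interval <- as.data.frame(
  lapply(cpgs, cut_interval, n = 3, labels = c("low", "med", "high"))
)
str(cpgs_interval) #every column is now a factor with levels low/med/high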
I have the following MATLAB code and I'm working on translating it to R:
nproc=40
T=3
lambda=4
tarr = zeros(1, nproc);
i = 1;
while (min(tarr(i,:))<= T)
tarr = [tarr; tarr(i, :)-log(rand(1, nproc))/lambda];
i = i+1;
end
tarr2=tarr';
X=min(tarr2);
stairs(X, 0:size(tarr, 1)-1);
It is the Poisson Process from the renewal processes perspective. I've done my best in R but something is wrong in my code:
nproc<-40
T<-3
lambda<-4
i<-1
tarr=array(0,nproc)
lst<-vector('list', 1)
while(min(tarr[i]<=T)){
tarr<-tarr[i]-log((runif(nproc))/lambda)
i=i+1
print(tarr)
}
tarr2=tarr^-1
X=min(tarr2)
plot(X, type="s")
The loop prints an aleatory number of arrays, and only the last one is kept in tarr afterwards.
The result has to look like...
Thank you in advance. All interesting and supportive comments will be rewarded.
Adding on to the previous comment, there are a few things happening in the MATLAB script that are not happening in the R version:
[tarr; tarr(i, :)-log(rand(1, nproc))/lambda]: from my understanding, you are adding another row to your matrix and populating it with tarr(i, :)-log(rand(1, nproc))/lambda.
You will need to use a different method as Matlab and R handle this type of thing differently.
One glaring thing that stands out to me is that you are using R's tarr[i] and MATLAB's tarr(i, :) as if they were equivalent, but they are very different. What I think you are trying to get is all the columns in a given row i, which in R would be tarr[i, ].
The use of min is also different: R's min() returns the minimum of the whole matrix (just one number), whereas MATLAB's min() returns the minimum of each column. For the column-wise version in R you can use Rfast::colMins from the Rfast package.
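A tiny illustration of that difference (assuming the Rfast package is installed):
library(Rfast)
m <- matrix(c(4, 1, 3, 2, 6, 5), nrow = 2)
min(m)                   #1 -- R collapses the whole matrix to a single number
colMins(m, value = TRUE) #1 2 5 -- one minimum per column, like MATLAB's min()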
The stairs part is something I am not familiar with much but something like ggplot2::qplot(..., geom = "step") may work.
Now I have tried to create something that works in R but am not sure really what the required output is. But nevertheless, hopefully some of the basics can help you get it done on your side. Below is a quick try to achieve something!
nproc <- 40
T0 <- 3
lambda <- 4
i <- 1
tarr <- matrix(rep(0, nproc), nrow = 1, ncol = nproc)
while(min(tarr[i, ]) <= T0){
# Major alteration, create a temporary row from previous row in tarr
temp <- matrix(tarr[i, ] - log((runif(nproc))/lambda), nrow = 1)
# Join temp row to tarr matrix
tarr <- rbind(tarr, temp)
i = i + 1
}
# I am not sure what was meant by tarr' in the matlab script I took it as inverse of tarr
# which in matlab is tarr.^(-1)??
tarr2 = tarr^(-1)
library(ggplot2)
library(Rfast)
min_for_each_col <- colMins(tarr2, value = TRUE)
qplot(seq_along(min_for_each_col), sort(min_for_each_col), geom="step")
As you can see, I have sorted min_for_each_col so that the plot is actually a stair plot and not some random stepwise plot. I think there is still a problem: in the MATLAB code, 0:size(tarr2, 1)-1 gives the number of rows less 1, but I can't figure out why, if we grab colMins (and there are 40 columns), we would end up with around 20 steps. But I might be completely misunderstanding! Also, I have changed T to T0, since T already exists in R as TRUE and it is not good to overwrite it!
Hope this helps!
I downloaded GNU Octave today to actually run the MATLAB code. After watching the code run, I made a few tweaks to the great answer by @Croote:
nproc <- 40
T0 <- 3
lambda <- 4
i <- 1
tarr <- matrix(rep(0, nproc), nrow = 1, ncol = nproc)
while(min(tarr[i, ]) <= T0){
temp <- matrix(tarr[i, ] - log(runif(nproc))/lambda, nrow = 1) #fixed paren
tarr <- rbind(tarr, temp)
i = i + 1
}
tarr2 = t(tarr) #takes transpose
library(ggplot2)
library(Rfast)
min_for_each_col <- colMins(tarr2, value = TRUE)
qplot(seq_along(min_for_each_col), sort(min_for_each_col), geom="step")
Edit: Some extra plotting tweaks -- this seems to be closer to the original:
qplot(min_for_each_col, c(1:length(min_for_each_col)), geom="step", ylab="", xlab="") #arrival times on x, count on y, as in stairs()
#or with ggplot2
library(magrittr) #for the %>% pipe
df1 <- cbind(min_for_each_col, 1:length(min_for_each_col)) %>% as.data.frame
colnames(df1)[2] <- "index"
ggplot() +
geom_step(data = df1, mapping = aes(x = min_for_each_col, y = index), color = "blue") +
labs(x = "", y = "")
I'm not too familiar with renewal processes or MATLAB, so bear with me if I misunderstood the intention of your code. That said, let's break down your R code step by step and see what is happening.
The first 4 lines assign numbers to variables.
The fifth line creates an array with 40 (nproc) zeros.
The sixth line (which doesn't seem to be used later) creates an empty vector with mode 'list'.
The seventh line starts a while loop. I suspect this line is supposed to say while the min value of tarr is less than or equal to T ...
or it's supposed to say while i is less than or equal to T ...
It actually takes the minimum of a single boolean value (tarr[i] <= T). Now this can work because TRUE and FALSE are treated like numbers. Namely:
TRUE == 1 # returns TRUE
FALSE == 0 # returns TRUE
TRUE == 0 # returns FALSE
FALSE == 1 # returns FALSE
However, since the value of tarr[i] depends on a random number (see line 8), the same code can run differently each time it is executed. This might explain why the code "prints an aleatory number of arrays".
The eighth line seems to overwrite the assignment of tarr with the computation on the right. It takes the single value tarr[i] and subtracts from it the natural log of runif(nproc) divided by 4 (lambda), which gives 40 different values. These forty values from the last time through the loop are what end up stored in tarr.
If you want to store all forty values from each time through the loop, I'd suggest storing them in, say, a matrix or data frame instead. If that's what you want to do, here's an example of storing them in a matrix:
for(i in 1:nrow(yourMatrix)){
  # computations that build the new row go here
yourMatrix[i,] <- rowCreatedByComputations
}
See this answer for more info about that. Also, since it's a set number of values per run, you could keep them in a vector and simply append to the vector each loop like this:
vector <- c(vector,newvector)
The ninth line increases i by one.
The tenth line prints tarr.
The eleventh line closes the loop statement.
Then after the loop tarr2 is assigned 1/tarr. Again this will be 40 values from the last time through the loop (line 8)
Then X is assigned the min value of tarr2.
This single value is plotted in the last line.
Also note that runif samples from the uniform distribution -- if you're looking for a Poisson distribution see: Poisson
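As a small aside (this is my reading of the intent), -log(runif(n))/lambda is the inverse-transform way of drawing exponential inter-arrival times for a Poisson process, which R can also do directly:
lambda <- 4
nproc <- 40
interarrivals <- rexp(nproc, rate = lambda) #same distribution as -log(runif(nproc))/lambda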
Hope this helped! Let me know if there's more I can do to help.
I have what I thought was a well-prepared dataset. I wanted to use the Apriori Algorithm in R to look for associations and come up with some rules. I have about 16,000 rows (unique customers) and 179 columns that represent various items/categories. The data looks like this:
Cat1 Cat2 Cat3 Cat4 Cat5 ... Cat179
1, 0, 0, 0, 1, ... 0
0, 0, 0, 0, 0, ... 1
0, 1, 1, 0, 0, ... 0
...
I thought having a comma separated file with binary values (1/0) for each customer and category would do the trick, but after I read in the data using:
data5 = read.csv("Z:/CUST_DM/data_test.txt",header = TRUE,sep=",")
and then run this command:
rules = apriori(data5, parameter = list(supp = .001,conf = 0.8))
I get the following error:
Error in asMethod(object):
column(s) 1, 2, 3, ...178 not logical or a factor. Discretize the columns first.
I understand Discretize but not in this context I guess. Everything is a 1 or 0. I've even changed the data from INT to CHAR and received the same error. I also had the customer ID (unique) as column 1 but I understand that isn't necessary when the data is in this form (flat file). I'm sure there is something obvious I'm missing - I'm new to R.
What am I missing? Thanks for your input.
I solved the problem this way: after reading the data into R, I used lapply() to change the columns to factors (I think that's what it does). Then I took that data set and created a data frame from it. Then I was able to apply apriori() successfully.
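Roughly, it looked like this (a sketch from memory; data5_factors and data5_df are just names I chose here):
library(arules)
data5_factors <- lapply(data5, factor)   #turn each 0/1 column into a factor
data5_df <- as.data.frame(data5_factors) #back into a data frame
rules <- apriori(data5_df, parameter = list(supp = 0.001, conf = 0.8))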
Your data is actually already in (dense) matrix format, but read.csv always reads data in as a data.frame. Just coerce the data to a matrix first:
dat <- as.matrix(data5)
rules <- apriori(dat, parameter = list(supp = .001,conf = 0.8))
1s in the data will be interpreted as the presence of the item and 0s as the absence. More information about how to create transactions can be found on the manual page ?transactions.
I am using R to code simulations for a research project I am conducting in college. After creating relevant data structures and generating data, I seek to randomly modify a proportion P of observations (in increments of 0.02) in a 20 x 20 matrix by some effect K. In order to randomly determine the observations to be modified, I sample a number of integers equal to P*400 twice to represent row (rRow) and column (rCol) indices. In order to guarantee that no observation will be modified more than once, I perform this algorithm:
I create a matrix, alrdyModded, that is 20 x 20 and initialized to 0s.
I take the first value in rRow and rCol and check whether alrdyModded[rRow[1]][rCol[1]]==1; WHILE alrdyModded[rRow[1]][rCol[1]]==1, I randomly select new integers for the indices until it ==0
When alrdyModded[rRow[1]][rCol[1]]==0, modify the value in a treatment matrix with same indices and change alrdyModded[rRow[1]][rCol[1]] to 1
Repeat for the entire length of rRow and rCol vectors
I believe a good method to perform this operation is a while loop nested in a for loop. However, when I enter the code below into R, I receive the following error:
R CODE:
propModded<-1.0
trtSize<-2
numModded<-propModded*400
trt1<- matrix(rnorm(400,0,1),nrow = 20, ncol = 20)
cont<- matrix(rnorm(400,0,1),nrow = 20, ncol = 20)
alrdyModded1<- matrix(0, nrow = 20, ncol = 20)
## data structures for computation have been initialized and filled
rCol<-sample.int(20,numModded,replace = TRUE)
rRow<-sample.int(20,numModded,replace = TRUE)
## indices for modifying observations have been generated
for(b in 1:numModded){
while(alrdyModded1[rRow[b]][rCol[b]]==1){
rRow[b]<-sample.int(20,1)
rCol[b]<-sample.int(20,1)}
trt1[rRow[b]][rCol[b]]<-'+'(trt1[rRow[b]][rCol[b]],trtSize)
alrdyModded[rRow[b]][rCol[b]]<-1
}
## algorithm for guaranteeing no observation in trt1 is modified more than once
R OUTPUT
" Error in while (alrdyModded1[rRow[b]][rCol[b]] == 1) { :
missing value where TRUE/FALSE needed "
When I take out the for loop and run the code, the while loop evaluates the statement just fine, which implies an issue with accessing the correct values from the rRow and rCol vectors. I would appreciate any help in resolving this problem.
It appears you're not indexing right within the matrix. Instead of having a condition like while(alrdyModded1[rRow[b]][rCol[b]]==1){, it should read like this: while(alrdyModded1[rRow[b], rCol[b]]==1){. Matrices are indexed like this: matrix[1, 1], and it looks like you're forgetting your commas. The for-loop should be something closer to this:
for(b in 1:numModded){
while(alrdyModded1[rRow[b], rCol[b]]==1){
rRow[b]<-sample.int(20,1)
rCol[b]<-sample.int(20,1)}
trt1[rRow[b], rCol[b]]<-'+'(trt1[rRow[b], rCol[b]],trtSize)
alrdyModded1[rRow[b], rCol[b]]<-1
}
On a side note, why not make alrdyModded1 a boolean matrix (populated with just TRUE and FALSE values) with alrdyModded1<- matrix(FALSE, nrow = 20, ncol = 20) in line 7, and have the condition be just while(alrdyModded1[rRow[b], rCol[b]]){ instead?
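In code, that variant might look something like this (a sketch reusing the variable names from above):
alrdyModded1 <- matrix(FALSE, nrow = 20, ncol = 20) #TRUE marks cells that were already modified
for(b in 1:numModded){
  while(alrdyModded1[rRow[b], rCol[b]]){ #resample until an untouched cell is found
    rRow[b] <- sample.int(20,1)
    rCol[b] <- sample.int(20,1)}
  trt1[rRow[b], rCol[b]] <- trt1[rRow[b], rCol[b]] + trtSize
  alrdyModded1[rRow[b], rCol[b]] <- TRUE
}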
I want to test the sensitivity of a calculation to the value of 4 parameters. To do this, I want to vary one parameter at a time -- i.e., change Variable 1, hold variables 2-4 at a "default" value (e.g., 1). I thought an easy way to organize these values would be in a data.frame(), where each column corresponds to a different variable, and each row to a set of parameters for which the calculation should be made. I would then loop through each row of the data frame, evaluating a function given the parameter values in that row.
This seems like it should be a simple thing to do, but I can't find a quick way to do it.
The problem might be my overall approach to programming the sensitivity analysis, but I can't think of a good, simple way to program the aforementioned data.frame.
My code for generating the data.frame:
Adj_vals <- c(seq(0, 1, by=0.1), seq(1.1, 2, by=0.1)) #a series of values for 3 of the parameters to use
A_Adj_vals <- 10^(seq(1,14,0.5)) #a series of values for another one of the parameters to use
n1 <- length(Adj_vals)
n2 <- length(A_Adj_vals)
data.frame(
"Dg_Adj"=c(Adj_vals, rep(1, n1*2+n2)), #this parameter's default is 1
"Df_Adj"=c(rep(1, n1), Adj_vals, rep(1, n1+n2)), #this parameter's default is 1
"sd_Adj"=c(rep(1, n1*2), 0.01, Adj_vals[-1], rep(1, n2)), #This parameter has default of 1, but unlike the others using Adj_vals, it can only take on values >0
"A"=c(rep(1E7, n1*3), A_Adj_vals) #this parameter's default is 10 million
)
This code produces the desired data.frame. Is there a simpler way to achieve the same result? I would accept an answer where sd_Adj takes on 0 instead of 0.01.
It's pretty debatable whether this is better, but another way to do it would be to follow this pattern:
defaults<-data.frame(a=1,b=1,c=1,d=10000000)
merge(defaults[c("b","c","d")],data.frame(a=c(seq(0, 1, by=0.1), seq(1.1, 2, by=0.1))))
This should be pretty easy to cook up into a function that automatically removes the correct columns from defaults, based on the column names in the data frame you are merging with, and so on.
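For instance, a possible wrapper (vary_one is just an illustrative name, not an existing function), using the question's parameter names:
vary_one <- function(defaults, varied){
  keep <- setdiff(names(defaults), names(varied))
  merge(defaults[keep], varied) #no common columns, so merge() returns every combination
}
defaults <- data.frame(Dg_Adj = 1, Df_Adj = 1, sd_Adj = 1, A = 1e7)
vary_one(defaults, data.frame(Dg_Adj = c(seq(0, 1, by = 0.1), seq(1.1, 2, by = 0.1))))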