compare row values over multiple rows (R)

I don't think this question has been asked yet (the most similar questions are about extracting data or returning a count). I am new to R, so any help would be appreciated!
I have a dataset of multiple runs of an experiment in one file, and the data looks like this, with all the time steps for each run in rows:
time [info] id (unique per run)
I am attempting to calculate when the system reaches equilibrium, which I define as stable values in 3 interdependent parameters. I would like to compare the contents of rows and, if they are within 5% of each other over 20 timesteps, return the timestep at which the stability begins, along with the id.
So far, I'm thinking it will be something like the following (or maybe a while loop; sorry for the bad formatting):
x = 0; y = 1; z = 0      # variables to control the loop
for (ID) {
  if (CC at time=x is within ±0.05 of CC at time=y) {
    if (z <= 20) {       # counts the number of consecutive matching periods
      y = y + 1
      z = z + 1
    } else {
      save value in column
    }
  } else {               # no match for a sustained period, so start over
    x = x + 1
    y = x + 1
    z = 0
  }
}
ETA: CC is one of my parameters of interest; it ranges between 0 and 1, although the endpoints are unlikely.
Here's a simple example that might help: this is something like how my data looks:
zz <- textConnection("time CC ID
1 0.99 1
2 0.80 1
3 0.90 1
4 0.91 1
5 0.92 1
6 0.91 1
1 0.99 2
2 0.90 2
3 0.90 2
4 0.91 2
5 0.92 2
6 0.91 2")
Data <- read.table(zz, header = TRUE)
close(zz)
My question is: how can I run through the rows to find out when the value of CC becomes 'stable' (meaning it doesn't change by more than 0.05 over X (here, 3) time steps), so that it would produce the following result:
ID timeToEQ
1 1 3
2 2 2
Does this help? The only way I can think to do this is with a for-loop, and I think there must be an easier way!

Here is my code. I will post the explanation shortly.
require(plyr)
ddply(Data, .(ID), summarize, timeToEQ = Position(isTRUE, abs(diff(CC)) < 0.05 ))
ID timeToEQ
1 1 3
2 2 2
EDIT: Here is how it works.
ddply breaks Data into subsets based on ID.
diff(CC) computes the difference between the CC of successive rows.
abs(diff(CC)) < 0.05 returns TRUE where the difference has stabilized.
Position locates the first element that satisfies isTRUE.
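For reference, a base-R sketch of the same computation (assuming Data as constructed above), so the logic is visible without plyr:
# for each ID, find the first position where successive CC values
# differ by less than 0.05 (the same test as in the ddply call)
timeToEQ <- tapply(Data$CC, Data$ID,
                   function(cc) Position(isTRUE, abs(diff(cc)) < 0.05))
data.frame(ID = names(timeToEQ), timeToEQ = as.vector(timeToEQ))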


dataframe math to calculate profit based on indicator

Simple explanation: my goal is to figure out how to get the profit column shown below. I am trying to calculate the difference in val for every pair of changing indicator values (-1 to 1 and 1 to -1).
If the starting indicator is 1 or -1, store the value.
Find the next indicator that is opposite (so -1 on row 3). Save this val. Subtract the first value from it (.85 - .84). Store that in the profit column.
Repeat.
Specific to this case:
Go until you find the next opposite indicator (on row 4). Save this value. Subtract the values; save the result in the profit column.
Go until you find the next opposite indicator (on row 8). Save this value. Subtract the values; save the result in the profit column.
Financial explanation (if it is useful)
I am trying to write a function to calculate profit given a column of values and a column of indicators (buy/sell/hold). I know this is implemented in a few of the big packages (quantmod, quantstrat), but I can't seem to find a simple way to do it.
df<-
data.frame(val=c(.84,.83,.85,.83,.83,.84,.85,.81),indicator=c(1,0,-1,1,0,1,1,-1))
df
val indicator profit
1 0.84 1 NA
2 0.83 0 NA
3 0.85 -1 .01 based on: (.85-.84) from 1 to -1
4 0.83 1 .02 based on (.85-.83) from -1 to 1
5 0.83 0 NA
6 0.84 1 NA
7 0.85 1 NA
8 0.81 -1 -.02 based on (.81-.83) from last change (row 4) to now
Notes
Repeated indicators (e.g., 1 1 1 1) mean nothing beyond the first one, which should be stored (row 4 is stored; rows 5, 6, 7 are not).
Ignore 0; holds do not change the profit calculation.
I am happy to provide more info if needed.
It was easier for me to work it out in my head after splitting the problem into two parts, which correspond to the two loops shown below. The first part flags the rows where the indicator value changed, while the second part subtracts the vals of the relevant rows (i.e., those selected in part 1). FYI, I assume you meant to put -.02 for row 4 in your example? If not, then please clarify which rows get subtracted from which when calculating profit.
x <- data.frame(val = c(.84, .83, .85, .83, .83, .84, .85, .81),
                indicator = c(1, 0, -1, 1, 0, 1, 1, -1))
x$num <- seq_along(x$val)
x$rollingProf <- NA
# part 1: flag the rows where the indicator flips sign,
# starting with indicator = 1 on the first row
indicator <- 1
x[1, "rollingProf"] <- 1   # flag the opening position as well
for (i in 1:(nrow(x) - 1)) {
  next_ <- x[i + 1, "indicator"]
  if (indicator * -1 == next_) {
    x[i + 1, "rollingProf"] <- 1
    indicator <- next_
  }
}
# part 2: subtract the vals of successive flagged rows
q <- x[!is.na(x$rollingProf), c("val", "num")]
for (i in 2:nrow(q)) {
  q[i, "change"] <- q[i, "val"] - q[i - 1, "val"]
}
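For comparison, a vectorized sketch of the same two-part idea (an alternative reading, not part of the answer above; it assumes the df from the question and that the first nonzero indicator opens the position):
# drop the holds, keep the rows where the nonzero indicator changes sign,
# then profit is the difference between consecutive kept vals
nz <- df[df$indicator != 0, ]
keep <- c(TRUE, diff(nz$indicator) != 0)   # first trade plus each sign flip
flips <- nz[keep, ]
flips$profit <- c(NA, diff(flips$val))     # NA for the opening trade
flips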

In R, how to take the mean of a varying number of elements for each row in a data frame?

So I have a dataframe, PVALUES, like this:
PVALUES <- read.csv(textConnection("PVAL1 PVAL2 PVAL3
0.1 0.04 0.02
0.9 0.001 0.98
0.03 0.02 0.01"),sep = " ")
That corresponds to another dataframe, DATA, like this:
DATA <- read.csv(textConnection("COL1 COL2 CO3
10 2 9
11 20 200
2 3 5"),sep=" ")
For every row in DATA, I'd like to take the mean of the numbers whose indices correspond to entries in PVALUES that are <= 0.05.
So, for example, the first row of PVALUES has only two entries <= 0.05, at [1,2] and [1,3]. Therefore, for the first row of DATA, I want to take the mean of 2 and 9.
In the second row of PVALUES, only the entry at [2,2] is <= 0.05, so instead of taking a mean for the second row of DATA, I would just use DATA[2,2] (which is 20).
So, my output would look like:
MEANS
6.5
20
3.33
I thought I might be able to generate indices for every entry in PVALUES <=0.05, and then use that to select entries in DATA to use for the mean. I tried to use this command to generate indices:
exp <- which(PVALUES[,]<=0.05, arr.ind=TRUE)
...but it only picks up on indices for entries in the first column that are <= 0.05. In my example above, it would only output [3,1].
Can anyone see what I'm doing wrong, or have ideas on how to tackle this problem?
Thank you!
It's a bit funny looking, but this should work
rowMeans(`is.na<-`(DATA,PVALUES>=.05), na.rm=T)
The "ugly" part is calling is.na<- without doing the automatic replacement, but here we just set all data with p-values larger than .05 to missing and then take the row means.
It's unclear to me exactly what you were doing with exp, but that type of method could work as well. Maybe with
expx <- which(PVALUES[,] <= 0.05, arr.ind = TRUE)
aggregate(val ~ row, cbind(expx, val = DATA[expx]), mean)
(renamed so as not to interfere with the built-in exp() function)
Tested with
PVALUES<-read.table(text="PVAL1 PVAL2 PVAL3
0.1 0.04 0.02
0.9 0.001 0.98
0.03 0.02 0.01", header=T)
DATA<-read.table(text="COL1 COL2 CO3
10 2 9
11 20 200
2 3 5", header=T)
I usually enjoy MrFlick's responses, but the use of is.na<- in that manner seems to violate my expectations of R code because it destructively modifies the data. I admit that I probably should have been expecting that possibility because of the assignment arrow, but it surprised me nonetheless. (I don't object to data.table code, because it is honest and forthright about modifying its contents with the := function.) I also admit that my efforts to improve on it led me down a rabbit hole where I found this equally "baroque" effort. (You have incorrectly averaged 2 and 9.)
# split the qualifying DATA values by the row they came from, then average
sapply(split(DATA[which(PVALUES <= 0.05, arr.ind = TRUE)],
             which(PVALUES <= 0.05, arr.ind = TRUE)[, "row"]),
       mean)
1 2 3
5.500000 20.000000 3.333333

extracting unique combination rows from a data frame in R

I have a data frame that gives pairwise correlations of scores provided by people in the same state.
I am giving a small example of what I wish to do with this data, but right now my actual data set has 15 million rows for pairwise correlations and many more additional columns.
Below is the example data:
>sample_data
Pair_1ID Pair_2ID CORR
1 2 0.12
1 3 0.23
2 1 0.12
2 3 0.75
3 1 0.23
3 2 0.75
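(For reproducibility, the example above can be constructed like this; the values are read off the printout:)
sample_data <- data.frame(Pair_1ID = c(1, 1, 2, 2, 3, 3),
                          Pair_2ID = c(2, 3, 1, 3, 1, 2),
                          CORR     = c(0.12, 0.23, 0.12, 0.75, 0.23, 0.75))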
I want to generate a new data frame without duplicates. For example, row 1 gives the correlation between persons 1 and 2 (0.12). Row 3 carries the same information, showing the correlation between 2 and 1. Since such pairs are redundant, I would like a final file without duplicates, like the one below:
>output
Pair_1ID Pair_2ID CORR
1 2 0.12
1 3 0.23
2 3 0.75
Can someone help? The unique command won't work with this, and I don't know how else to do it.
Assuming every combination shows up twice:
subset(sample_data , Pair_1ID <= Pair_2ID)
If not:
unique(transform(sample_data, Pair_1ID = pmin(Pair_1ID, Pair_2ID),
Pair_2ID = pmax(Pair_1ID, Pair_2ID)))
Edit: regarding that last one, including CORR in the unique is not a great idea because of possible floating point issues. I also see you mention that you have a lot more columns. So it would be better to limit the comparison to the two IDs:
relabeled <- transform(sample_data, Pair_1ID = pmin(Pair_1ID, Pair_2ID),
Pair_2ID = pmax(Pair_1ID, Pair_2ID))
subset(relabeled, !duplicated(cbind(Pair_1ID, Pair_2ID)))
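Applied to sample_data above, this keeps rows 1, 2, and 4, reproducing the desired output.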
flodel's answer is really excellent. I just want to add another solution, based on indexing without looking at the actual values. It only works if all combinations are present and the data frame is ordered by column 1 first and column 2 second (as in the example).
maxVal <- max(sample_data$Pair_1ID)
shrtIdx <- logical(maxVal)   # template vector: all FALSE
# each column has x TRUEs followed by FALSEs; strung out as one long
# vector against the blocks of rows sharing a Pair_1ID, the growing offset
# skips the already-seen pairs, keeping rows where Pair_1ID < Pair_2ID
idx <- sapply(seq(maxVal - 1, 1), function(x) replace(shrtIdx, seq(x), TRUE))
sample_data[idx,]
# Pair_1ID Pair_2ID CORR
# 1 1 2 0.12
# 2 1 3 0.23
# 4 2 3 0.75

Extracting Values from certain time limit in Simulated Data using R

I tried to do a stochastic simulation of an epidemiological SEIR model using the code below.
library(GillespieSSA)
parms <- c(beta=0.591,sigma=1/8,gamma=1/7)
x0 <- c(S=50,E=0,I=1,R=0)
a <- c("beta*S*I","sigma*E","gamma*I")
nu <- matrix(c(-1,0,0,
1,-1,0,
0,1,-1,
0,0,1),nrow=4,byrow=TRUE)
set.seed(12345)
out <- lapply(X=1:10,FUN=function(x) ssa(x0,a,nu,parms,tf=50)$data)
out
I managed to obtain the 10 simulations' values that I wanted. The time is in continuous form; now I have to extract the times in discrete form, such as 1, 2, 3, ..., 50, from each simulation. What kind of code should I use?
I tried using data.frame and extracting, but I am still not able to do it.
Thanks in advance for any help.
Let's say the data looks like this:
df <- data.frame(t=seq(0.4,4.5,0.03), x=1:137)
## t x
## 1 0.40 1
## 2 0.43 2
## 3 0.46 3
## 4 0.49 4
## 5 0.52 5
To get the discrete time index values:
idx <- c(diff(ceiling(df$t)) == 1, FALSE)  # padded so the index matches nrow(df)
Discrete time series will be:
df[idx,]
## t x
## 21 1.00 21
## 54 1.99 54
## 87 2.98 87
## 121 4.00 121
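The same idea should carry over to the simulation output; a sketch, assuming the first column of each simulation matrix in out holds the time:
sim <- as.data.frame(out[[1]])
idx <- c(diff(ceiling(sim[, 1])) == 1, FALSE)  # pad to match nrow(sim)
sim[idx, ]   # last observation at or before each integer time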
Having run the simulation myself, a problem seems to be that many of the time stamps are quite a distance from an integer value.
To see these remainders, check: out[[1]][,1] %% 1
The good news is that you can use the output from this, with a tuning parameter, to select what you want. For this purpose, you'll want to find the distance from one and then control for what counts as an acceptable gap.
Do this as follows and save the result (a bunch of TRUE and FALSE values):
selection <- abs((out[[1]][,1] %% 1) - 1) < 0.1
You can then subset the matrix out using the selection index we just saved:
out[[1]][selection,]
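A minimal sketch extending the same selection to all 10 runs (assuming, as above, that the first column of each matrix in out holds the time):
near_integer <- lapply(out, function(m) {
  sel <- abs((m[, 1] %% 1) - 1) < 0.1   # same tolerance as above
  m[sel, , drop = FALSE]
})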

How do I perform a function on each row of a data frame and have just one element of the output inserted as a new column in that row

It is easy to do an exact binomial test on a single pair of values, but what happens if one wants to do the test on a whole bunch of numbers of successes and numbers of trials? I created a dataframe of test sensitivities and potential numbers of enrollees in a study, and then for each row I calculate how many successes that would be. Here is the code.
sens <-seq(from=.1, to=.5, by=0.05)
enroll <-seq(from=20, to=200, by=20)
df <-expand.grid(sens=sens,enroll=enroll)
df <-transform(df,succes=sens*enroll)
But now, how do I use each row's combination of successes and number of trials to do the binomial test?
I am only interested in the upper limit of the 95% confidence interval of the binomial test. I want that single number to be added to the data frame as a column called "upper.limit"
I thought of something along the lines of
binom.test(succes,enroll)$conf.int
alas, conf.int gives something such as
[1] 0.1266556 0.2918427
attr(,"conf.level")
[1] 0.95
All I want is just 0.2918427
Furthermore, I have a feeling that there has to be a do.call in there somewhere, and maybe even an lapply, but I do not know how that would go through the whole data frame. Or should I perhaps be using plyr?
Clearly my head is spinning. Please make it stop.
If this gives you (almost) what you want, then try this:
binom.test(succes,enroll)$conf.int[2]
And apply across the board or across the rows as it were:
> df$UCL <- apply(df, 1, function(x) binom.test(x[3],x[2])$conf.int[2] )
> head(df)
sens enroll succes UCL
1 0.10 20 2 0.3169827
2 0.15 20 3 0.3789268
3 0.20 20 4 0.4366140
4 0.25 20 5 0.4910459
5 0.30 20 6 0.5427892
6 0.35 20 7 0.5921885
Here you go:
R> newres <- do.call(rbind, apply(df, 1, function(x) {
+ bt <- binom.test(x[3], x[2])$conf.int;
+ newdf <- data.frame(t(x), UCL=bt[2]) }))
R>
R> head(newres)
sens enroll succes UCL
1 0.10 20 2 0.31698
2 0.15 20 3 0.37893
3 0.20 20 4 0.43661
4 0.25 20 5 0.49105
5 0.30 20 6 0.54279
6 0.35 20 7 0.59219
R>
This uses apply to loop over your existing data, computes the test, and returns the value you want by sticking it into a new (one-row) data.frame. We then glue all those 90 data.frame objects into a single new one with do.call(rbind, ...) over the list we got from apply.
Ah yes, if you just want to directly insert a single column, the other answer rocks, as it is simpler. My longer answer shows how to grow or construct a data.frame during the sweep of apply.
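For completeness, a third sketch (hypothetical, not from the answers above): mapply over the two columns directly, which avoids apply's row-wise coercion to a numeric matrix, and names the column upper.limit as the question asked:
df$upper.limit <- mapply(function(s, n) binom.test(s, n)$conf.int[2],
                         df$succes, df$enroll)
head(df)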
