Selecting data chunks depending on a condition - R

I have a question about selecting data chunks based on a condition I provide.
It is a multi-step process which I think should be wrapped in a function, so it can be applied to other data sets with lapply.
I have a data.frame with 19 columns (the example data here has only two). I want to first check the rows of the first column (time): they should lie in the range 90 to 54000; any chunk not in this range is skipped. Then, among the remaining countable chunks, count how many of the mag columns contain only positive values and how many contain a mix of negative and positive values. If a chunk contains a negative number, count it as a switched state, and report the switching rate as (number of chunks showing a switched state) / (number of chunks whose time lies within 90:54000).
Then, for the chunks that satisfy the 90:54000 range, check mag
for the first observation that is < 0, together with its corresponding time. Here is some example data and my attempt so far:
numbers <- c(seq(1, -1, length.out = 601), seq(1, 0.98, length.out = 601))
time <- c(seq(90, 54144, length.out = 601), seq(90, 49850, length.out = 601))
data <- data.frame(time = rep(time, times = 6), mag = rep(numbers, times = 6))
# this fails: n is a vector, so the modulo/split does not produce the intended chunks
n <- 90:54000
dfchunk <- split(data, factor(sort(rank(row.names(data)) %% n)))
ext_fsw <- lapply(dfchunk, function(x) x[which(x$mag < 0)[1], ])
x.n <- data.frame(matrix(unlist(ext_fsw), nrow = n, byrow = TRUE))
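For reference, a minimal working sketch of the chunking step (an editorial suggestion, not part of the original question; it assumes each chunk is one sweep that restarts at time == 90):
chunk_id <- cumsum(data$time == 90)      # new chunk each time the clock resets to 90
dfchunk <- split(data, chunk_id)         # list of per-chunk data.frames
ext_fsw <- lapply(dfchunk, function(x) x[which(x$mag < 0)[1], ])  # first mag < 0 per chunk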
Here is what the real dataset looks like:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1 90 0 0 0 0.0023 -0.0064 0.9987 0.0810 0.0375 0.9814 0.0829 0.0379 0.9803 0.0715 0.0270 0.9823
2 180 0 0 0 0.0023 -0.0064 0.9987 0.0887 -0.0281 0.9818 0.0956 -0.0288 0.9778 0.0796 -0.0469 0.9772
3 270 0 0 0 0.0023 -0.0064 0.9987 -0.0132 -0.0265 0.9776 0.0087 -0.0369 0.9797 0.0311 -0.0004 0.9827
4 360 0 0 0 0.0023 -0.0064 0.9987 0.0843 0.0369 0.9752 0.0765 0.0362 0.9749 0.0632 0.0486 0.9735
5 450 0 0 0 0.0023 -0.0064 0.9987 0.1075 -0.0660 0.9737 0.0914 -0.0748 0.9698 0.0586 -0.0361 0.9794
6 540 0 0 0 0.0023 -0.0064 0.9987 0.0006 0.0072 0.9808 -0.0162 -0.0152 0.9797 0.0369 0.0118 0.9763
Here is the expected output (just an example).
For part 1:
ss (switched states) total countable chunks switching probability
5 10 5/10
For part 2:
time mag
27207 -0.03
26520 -0.98
32034 -0.67
... etc.

Okay, I think I have this figured out. I put the solution into two functions. For each function, you give a dataframe and a column name, and it'll return the requested data.
library(dplyr)

thabescity <- function(data, col) {
  filter_vec <- data[[col]] < 0              # logical vector: rows where the column is negative
  new_df <- data %>%
    filter(filter_vec) %>%
    filter(90 <= time & time <= 54000) %>%
    group_by(time) %>%
    summarise()                              # one row per distinct "switched" time
  ss <- nrow(new_df)                         # switched states
  total <- length(unique(data$time))         # total countable chunks
  switching_probability <- ss / total
  data.frame(ss, total, switching_probability)
}
print(thabescity(data, "mag"))
ss total switching_probability
1 298 1201 0.2481266
You can make a list and run the function in a loop over all the columns, collecting each result in the list:
data_names <- names(data)[2:length(names(data))]
first_problem <- list()
for (name in data_names) {
  first_problem[[name]] <- thabescity(data, name)
}
first_problem[["mag"]]
ss total switching_probability
1 298 1201 0.2481266
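The same loop can be written with lapply() (an editorial aside, equivalent to the loop above):
first_problem <- lapply(setNames(data_names, data_names),
                        function(nm) thabescity(data, nm))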
The second problem is a bit easier:
thabescity2 <- function(data, col) {
  data <- data[, c("time", col)]
  filter_vec <- data[[col]] < 0              # logical vector: negative values
  new_df <- data %>%
    filter(filter_vec) %>%
    filter(90 <= time & time <= 54000) %>%
    group_by(time) %>%
    filter(row_number() == 1)                # first observation per time
  return(new_df)
}
print(thabescity2(data, "mag"))
Source: local data frame [298 x 2]
Groups: time
time mag
1 27207.09 -0.003333333
2 27297.18 -0.006666667
3 27387.27 -0.010000000
4 27477.36 -0.013333333
5 27567.45 -0.016666667
6 27657.54 -0.020000000
7 27747.63 -0.023333333
8 27837.72 -0.026666667
9 27927.81 -0.030000000
10 28017.90 -0.033333333
.. ... ...
You can do the same thing as above to go through the whole dataframe:
data_names <- names(data)[2:length(names(data))]
second_problem <- list()
for (name in data_names) {
  second_problem[[name]] <- thabescity2(data, name)
}
second_problem[["mag"]]
Source: local data frame [298 x 2]
Groups: time
time mag
1 27207.09 -0.003333333
2 27297.18 -0.006666667
3 27387.27 -0.010000000
4 27477.36 -0.013333333
5 27567.45 -0.016666667
6 27657.54 -0.020000000
7 27747.63 -0.023333333
8 27837.72 -0.026666667
9 27927.81 -0.030000000
10 28017.90 -0.033333333
.. ... ...
Double check my results, but I think this does what you want.


For loop with a function for a moving/rolling average?

Essentially (in R), I want to apply a moving average function over a period of time (e.g., date and time variables) to see how a particular metric changes over time. However, the metric itself is a function. The scores can be 1 (pro), 0 (neutral), or -1 (neg). The function for the metric is:
metric <- function(pro, neg, total) {
  x <- (pro / total) * 100
  y <- (neg / total) * 100
  x - y
}
So the percentage of 1's minus the percentage of -1's is the metric value.
Given timestamps for each recorded score, I want to evaluate the metric as a moving average across all rows. I assumed that a for loop would be the best way to apply this, but I am stuck on how to do it.
Does anyone have any thoughts / advice?
As mentioned in the comments, rollapply() from zoo is a good option. I took the liberty of generating some example data; apologies if it doesn't resemble yours.
library(zoo)

f <- function(x, l) {
  p <- sum(x == 1) / l        # proportion of pro scores in the window
  n <- sum(x == -1) / l       # proportion of neg scores in the window
  (p - n) * 100
}

# Or more efficiently: with scores in {-1, 0, 1}, sum(x) equals #pro - #neg,
# so the metric reduces to a scaled mean
f <- function(x, l = length(x)) {
  (sum(x) / l) * 100
}
set.seed(1)
N <- 25
dtf <- data.frame(time = as.Date(15000 + (1:N)), score = sample(-1:1, N, rep = TRUE))
score <- read.zoo(dtf)   # zoo series indexed by date
l <- 8                   # window width
zts <- cbind(score, rolling = rollapply(score, l, f, l, fill = NA))
zts
# score rolling
# 2011-01-27 -1 NA
# 2011-01-28 0 NA
# 2011-01-29 0 NA
# 2011-01-30 1 12.5
# 2011-01-31 -1 25.0
# 2011-02-01 1 12.5
# 2011-02-02 1 0.0
# 2011-02-03 0 -25.0
# 2011-02-04 0 0.0
# 2011-02-05 -1 -12.5
# 2011-02-06 -1 -12.5
# 2011-02-07 -1 -12.5
# 2011-02-08 1 0.0
# 2011-02-09 0 25.0
# 2011-02-10 1 37.5
# 2011-02-11 0 62.5
# 2011-02-12 1 62.5
# 2011-02-13 1 50.0
# 2011-02-14 0 37.5
# 2011-02-15 1 25.0
# 2011-02-16 1 0.0
# 2011-02-17 -1 NA
# 2011-02-18 0 NA
# 2011-02-19 -1 NA
# 2011-02-20 -1 NA
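The NAs at both ends come from rollapply()'s default align = "center"; for a trailing moving average one could right-align the window instead (an editorial sketch, not from the original answer):
zts2 <- cbind(score, rolling = rollapply(score, l, f, l, fill = NA, align = "right"))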

big dataframe: "repeated" t-test between groups for thousand of factors

I have read a lot of posts related to data wrangling and “repeated” t-test but I can’t figure out the way to achieve it in my case.
You can get my example dataset for StackOverflow here: https://www.dropbox.com/s/0b618fs1jjnuzbg/dataset.example.stckovflw.txt?dl=0
I have a big dataframe of gene expression like:
> b<-read.delim("dataset.example.stckovflw.txt")
> head(b)
animal gen condition tissue LogFC
1 animalcontrol1 kjhss1 control brain 7.129283
2 animalcontrol1 sdth2 control brain 7.179909
3 animalcontrol1 sgdhstjh20 control brain 9.353147
4 animalcontrol1 jdygfjgdkydg21 control brain 6.459432
5 animalcontrol1 shfjdfyjydg22 control brain 9.372865
6 animalcontrol1 jdyjkdg23 control brain 9.541097
> str(b)
'data.frame': 21507 obs. of 5 variables:
$ animal : Factor w/ 25 levels "animalcontrol1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ gen : Factor w/ 1131 levels "dghwg1041","dghwg1086",..: 480 761 787 360 863 385 133 888 563 738 ...
$ condition: Factor w/ 5 levels "control","treatmentA",..: 1 1 1 1 1 1 1 1 1 1 ...
$ tissue : Factor w/ 2 levels "brain","heart": 1 1 1 1 1 1 1 1 1 1 ...
$ LogFC : num 7.13 7.18 9.35 6.46 9.37 ...
Each group has 5 animals, and each animal has many gens quantified. (Each animal may have a different set of quantified gens, but many of the gens will be common between animals and groups.)
I would like to perform a t-test for each gen between my treated groups (A, B, C or D) and the controls. The data should be presented as a table containing the p-value for each gen in each group.
Because I have so many gens (thousands), I cannot subset each gen by hand.
Do you know how I could automate the procedure?
I was thinking about a loop, but I am absolutely not sure it could achieve what I want or how to proceed.
Also, I was looking at these posts using the apply function: Apply t-test on many columns in a dataframe split by factor and Looping through t.tests for data frame subsets in r.
Additional information after reading the first comments and answers:
@andrew_reece: Thank you very much for this. It is almost exactly what I was looking for. However, I can't find a way to do it with a t-test. The ANOVA is interesting information, but I then need to know which of the treated groups is/are significantly different from my controls, and also which treated groups differ from each other, two by two.
I have been trying to use your code, changing the aov(...) into t.test(...). For that, I first take subset(b, condition == "control" | condition == "treatmentA") in order to compare only two groups. However, when I export the result table to a csv file, the table is not understandable (no gen names, no p-values, only numbers). I will keep searching for a way to do it properly, but for now I'm stuck.
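For reference, a minimal sketch of what that t.test() swap could look like (an editorial suggestion, not the commenter's code; it assumes the tidyverse is loaded and keeps only gens with at least two observations in each of the two groups):
t_res <- b %>%
  filter(condition %in% c("control", "treatmentA")) %>%
  mutate(condition = droplevels(condition)) %>%          # t.test() needs exactly 2 factor levels
  group_by(gen) %>%
  filter(sum(condition == "control") >= 2,
         sum(condition == "treatmentA") >= 2) %>%
  summarise(p.value = t.test(LogFC ~ condition)$p.value)
t_res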
@42: Thank you very much for these tips. This is just an example dataset; let's assume we do have to use individual t-tests.
This is a very useful start for exploring my data. For example, I have been trying to represent my data with Venn diagrams. I can write my code, but it is somewhat off the initial topic. Also, I don't know how to summarize in a less tedious way the shared "gen"s detected in each combination of conditions, so I have simplified to only 3 conditions.
# Visualisation of shared genes with Venn diagrams;
# let's simplify and consider only 3 conditions:
b <- read.delim("dataset.example.stckovflw.txt")
b <- subset(b, condition == "control" | condition == "treatmentA" | condition == "treatmentB")
b1 <- table(b$gen, b$condition)
b1
b2 <- as.data.frame(b1)        # long format: gen (Var1), condition (Var2), count (Freq)
b3 <- subset(b2, Freq > 2)     # keep only genes quantified in more than 2 animals per group
b3
b4 <- within(b3, {
  Freq <- ifelse(Freq > 1, 1, 0)   # collapse counts: the gene counts as detected (1) regardless of frequency
})
b4
b5 <- table(b4$Var1, b4$Var2)
write.csv(b5, file = "b5.csv")
# make an intermediate .txt file (just add the first column's title manually)
# so now we have the info
bb5 <- read.delim("bb5.txt")
nrow(subset(bb5, control == 1))
nrow(subset(bb5, treatmentA == 1))
nrow(subset(bb5, treatmentB == 1))
nrow(subset(bb5, control == 1 & treatmentA == 1))
nrow(subset(bb5, control == 1 & treatmentB == 1))
nrow(subset(bb5, treatmentA == 1 & treatmentB == 1))
nrow(subset(bb5, control == 1 & treatmentA == 1 & treatmentB == 1))
library(grid)
library(futile.logger)
library(VennDiagram)
venn.plot <- draw.triple.venn(area1 = 1005,
area2 = 927,
area3 = 943,
n12 = 843,
n23 = 861,
n13 = 866,
n123 = 794,
category = c("controls", "treatmentA", "treatmentB"),
fill = c("red", "yellow", "blue"),
cex = 2,
cat.cex = 2,
lwd = 6,
lty = 'dashed',
fontface = "bold",
fontfamily = "sans",
cat.fontface = "bold",
cat.default.pos = "outer",
cat.pos = c(-27, 27, 135),
cat.dist = c(0.055, 0.055, 0.085),
cat.fontfamily = "sans",
rotation = 1);
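As an editorial aside, the hard-coded areas above can be computed from bb5 directly, so the plot stays in sync with the data (a sketch under the same assumptions as the counts above):
venn.plot <- draw.triple.venn(
  area1 = nrow(subset(bb5, control == 1)),
  area2 = nrow(subset(bb5, treatmentA == 1)),
  area3 = nrow(subset(bb5, treatmentB == 1)),
  n12   = nrow(subset(bb5, control == 1 & treatmentA == 1)),
  n13   = nrow(subset(bb5, control == 1 & treatmentB == 1)),
  n23   = nrow(subset(bb5, treatmentA == 1 & treatmentB == 1)),
  n123  = nrow(subset(bb5, control == 1 & treatmentA == 1 & treatmentB == 1)),
  category = c("controls", "treatmentA", "treatmentB"),
  fill = c("red", "yellow", "blue"))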
Update (per OP comments):
Pairwise comparison across condition can be managed with an ANOVA post-hoc test, such as Tukey's Honest Significant Difference (stats::TukeyHSD()). (There are others, this is just one way to demonstrate PoC.)
results <- b %>%
  mutate(condition = factor(condition)) %>%
  group_by(gen) %>%
  filter(length(unique(condition)) >= 2) %>%
  nest() %>%
  mutate(
    model = map(data, ~ TukeyHSD(aov(LogFC ~ condition, data = .x))),
    coef = map(model, ~ broom::tidy(.x))
  ) %>%
  unnest(coef) %>%
  select(-term)
results
# A tibble: 7,118 x 6
gen comparison estimate conf.low conf.high adj.p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 kjhss1 treatmentA-control 1.58 -20.3 23.5 0.997
2 kjhss1 treatmentC-control -3.71 -25.6 18.2 0.962
3 kjhss1 treatmentD-control 0.240 -21.7 22.2 1.000
4 kjhss1 treatmentC-treatmentA -5.29 -27.2 16.6 0.899
5 kjhss1 treatmentD-treatmentA -1.34 -23.3 20.6 0.998
6 kjhss1 treatmentD-treatmentC 3.95 -18.0 25.9 0.954
7 sdth2 treatmentC-control -1.02 -21.7 19.7 0.991
8 sdth2 treatmentD-control 3.25 -17.5 24.0 0.909
9 sdth2 treatmentD-treatmentC 4.27 -16.5 25.0 0.849
10 sgdhstjh20 treatmentC-control -7.48 -30.4 15.5 0.669
# ... with 7,108 more rows
Original answer
You can use tidyr::nest() and purrr::map() to accomplish the technical task of grouping by gen, and then conducting statistical tests comparing the effects of condition (presumably with LogFC as your DV).
But I agree with the other comments that there are issues with your statistical approach here that bear careful consideration - stats.stackexchange.com is a better forum for those questions.
For the purpose of demonstration, I've used an ANOVA instead of a t-test, since there are frequently more than two conditions per gen grouping. This shouldn't really change the intuition behind the implementation, however.
require(tidyverse)
results <- b %>%
  mutate(condition = factor(condition)) %>%
  group_by(gen) %>%
  filter(length(unique(condition)) >= 2) %>%
  nest() %>%
  mutate(
    model = map(data, ~ aov(LogFC ~ condition, data = .x)),
    coef = map(model, ~ broom::tidy(.x))
  ) %>%
  unnest(coef)
A few cosmetic trimmings get closer to your original vision (just a table with gen and p-values), although note that this leaves a lot of important information out, and I'm not advising you to actually limit your results in this way.
results %>%
  filter(term != "Residuals") %>%
  select(gen, df, statistic, p.value)
# A tibble: 1,111 x 4
gen df statistic p.value
<chr> <dbl> <dbl> <dbl>
1 kjhss1 3. 0.175 0.912
2 sdth2 2. 0.165 0.850
3 sgdhstjh20 2. 0.440 0.654
4 jdygfjgdkydg21 2. 0.267 0.770
5 shfjdfyjydg22 2. 0.632 0.548
6 jdyjkdg23 2. 0.792 0.477
7 fckjhghw24 2. 0.790 0.478
8 shsnv25 2. 1.15 0.354
9 qeifyvj26 2. 0.588 0.573
10 qsiubx27 2. 1.14 0.359
# ... with 1,101 more rows
Note: I can't take much credit for this approach - it's taken almost verbatim from an example I saw Hadley give at a talk last night on purrr. Here's a link to the public repo of the demo code he used, which covers a similar use case.
You have 25 animals in 5 different treatment groups with a varying number of gen-values (presumably activities of genetic probes) in two different tissues:
table(b$animal, b$condition)
control treatmentA treatmentB treatmentC treatmentD
animalcontrol1 1005 0 0 0 0
animalcontrol2 857 0 0 0 0
animalcontrol3 959 0 0 0 0
animalcontrol4 928 0 0 0 0
animalcontrol5 1005 0 0 0 0
animaltreatmentA1 0 927 0 0 0
animaltreatmentA2 0 883 0 0 0
animaltreatmentA3 0 908 0 0 0
animaltreatmentA4 0 861 0 0 0
animaltreatmentA5 0 927 0 0 0
animaltreatmentB1 0 0 943 0 0
animaltreatmentB2 0 0 841 0 0
animaltreatmentB3 0 0 943 0 0
animaltreatmentB4 0 0 910 0 0
animaltreatmentB5 0 0 943 0 0
animaltreatmentC1 0 0 0 742 0
animaltreatmentC2 0 0 0 724 0
animaltreatmentC3 0 0 0 702 0
animaltreatmentC4 0 0 0 698 0
animaltreatmentC5 0 0 0 742 0
animaltreatmentD1 0 0 0 0 844
animaltreatmentD2 0 0 0 0 776
animaltreatmentD3 0 0 0 0 812
animaltreatmentD4 0 0 0 0 783
animaltreatmentD5 0 0 0 0 844
Agree you need to "automate" this in some fashion, but I think you are in need of a more general strategy for statistical inference rather than trying to pick out relationships by applying individual t-tests. You might consider either mixed models or one of the random forest variants. I think you should be discussing this with a statistician. As an example of where your hopes are not going to be met, take a look at the information you have about the first "gen" among the 1131 values:
str( b[ b$gen == "dghwg1041", ])
'data.frame': 13 obs. of 5 variables:
$ animal : Factor w/ 25 levels "animalcontrol1",..: 1 6 11 2 7 12 3 8 13 14 ...
$ gen : Factor w/ 1131 levels "dghwg1041","dghwg1086",..: 1 1 1 1 1 1 1 1 1 1 ...
$ condition: Factor w/ 5 levels "control","treatmentA",..: 1 2 3 1 2 3 1 2 3 3 ...
$ tissue : Factor w/ 2 levels "brain","heart": 1 1 1 1 1 1 1 1 1 1 ...
$ LogFC : num 4.34 2.98 4.44 3.87 2.65 ...
You do have a fair number with "complete representation":
gen_length <- ave(b$LogFC, b$gen, FUN=length)
Hmisc::describe(gen_length)
#--------------
gen_length
n missing distinct Info Mean Gmd .05 .10
21507 0 18 0.976 20.32 4.802 13 14
.25 .50 .75 .90 .95
18 20 24 25 25
Value 5 8 9 10 12 13 14 15 16 17
Frequency 100 48 288 270 84 624 924 2220 64 527
Proportion 0.005 0.002 0.013 0.013 0.004 0.029 0.043 0.103 0.003 0.025
Value 18 19 20 21 22 23 24 25
Frequency 666 2223 3840 42 220 1058 3384 4925
Proportion 0.031 0.103 0.179 0.002 0.010 0.049 0.157 0.229
You might start by looking at all the "gen"s that have complete data (building the table of counts first):
gen_tbl <- table(b$gen)
head( gen_tbl[ gen_tbl == 25 ], 25)
#------------------
dghwg1131 dghwg546 dghwg591 dghwg636 dghwg681
25 25 25 25 25
dghwg726 dgkuck196 dgkuck286 dgkuck421 dgkuck691
25 25 25 25 25
dgkuck736 dgkukdgse197 dgkukdgse287 dgkukdgse422 dgkukdgse692
25 25 25 25 25
dgkukdgse737 djh592 djh637 djh682 djh727
25 25 25 25 25
dkgkjd327 dkgkjd642 dkgkjd687 dkgkjd732 fckjhghw204
25 25 25 25 25
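As an editorial footnote to the mixed-model suggestion above, a minimal lme4 sketch (hypothetical; random intercepts for gen and animal are one plausible specification, not a recommendation from the original answer):
library(lme4)
fit <- lmer(LogFC ~ condition + (1 | gen) + (1 | animal), data = b)
summary(fit)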

Dataframe calculation, anchor cell value to formula

I would like to do some calculations with the following dataframe. There are values in specific cells of a column, and I would like to have them replicated based on a second column's value, storing the result in a new, third column:
x <- c("1", "2", "3", "4")
z <- rep(x, 5)
batch <- sort(z)
NDF <- rnorm(20, 10, 1); NDF <- signif(NDF, digits = 3)
Fibre_analysis <- data.frame(batch, NDF)
Fibre_analysis$NDF[[1]] <- 10
Fibre_analysis$NDF[[6]] <- 100
Fibre_analysis$NDF[[11]] <- 1000
Fibre_analysis$NDF[[16]] <- 10000
This is the table that I would like:
batch NDF NEW_column
1 1 10.00 10
2 1 10.80 10
3 1 9.44 10
4 1 10.30 10
5 1 11.60 10
6 2 100.00 100
7 2 8.26 100
8 2 9.15 100
9 2 9.40 100
10 2 8.53 100
11 3 1000.00 1000
12 3 9.41 1000
13 3 9.20 1000
14 3 10.30 1000
15 3 9.32 1000
16 4 10000.00 10000
17 4 11.20 10000
18 4 7.33 10000
19 4 9.34 10000
20 4 11.00 10000
I would like this to create a new column in the dataframe, with absolute cell values from $NDF that have to change for each value of $batch.
Because I need to use this process more than once I created the following function:
batch_Function <- function(x, y) {
  ifelse(x == "1", y[[1]],
  ifelse(x == "2", y[[6]],
  ifelse(x == "3", y[[11]],
                   y[[16]])))
  print(y)
}
when I call the function:
Fibre_analysis$NEW_column <- batch_Function ( Fibre_analysis$batch , Fibre_analysis$NDF )
I expect $NEW_column to look like this:
x <- c(10,100,1000,10000)
NEW_column <- rep(x, each=5)
whereas instead it is an exact copy of $NDF.
The only necessary change is to drop print(y), as it prevents the function from returning the actual result:
batch_Function <- function(x, y) {
  ifelse(x == "1", y[[1]],
  ifelse(x == "2", y[[6]],
  ifelse(x == "3", y[[11]],
                   y[[16]])))
}
batch_Function (Fibre_analysis$batch , Fibre_analysis$NDF )
# [1] 10 10 10 10 10 100 100 100 100 100 1000 1000 1000 1000
# [15] 1000 10000 10000 10000 10000 10000
In case you still want print(y), you may put it at the beginning of batch_Function.
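An editorial alternative that avoids hard-coding row positions (a sketch, assuming the anchor value is always the first row of its batch):
Fibre_analysis$NEW_column <- ave(Fibre_analysis$NDF, Fibre_analysis$batch,
                                 FUN = function(v) v[1])   # replicate each batch's first NDF value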

How can I most efficiently set 0 vals to NA in a subset of columns? [duplicate]

This question already has answers here:
How to replace NA values in a table for selected columns
(12 answers)
I have a book on statistics (using R) showing the following:
> pima$diastolic[pima$diastolic == 0] <- NA
> pima$glucose[pima$glucose == 0] <- NA
> pima$triceps[pima$triceps == 0] <- NA
> pima$insulin[pima$insulin == 0] <- NA
> pima$bmi[pima$bmi == 0] <- NA
Is there a way to do it in one line or more efficiently? I see there are functions such as with, apply, and subset for doing similar things, but I could not figure out how to put them together...
Sample data (also: how do I read this in as a dataframe, like Python's StringIO?):
pregnant glucose diastolic triceps insulin bmi diabetes age test
1 6 148 72 35 0 33.6 0.627 50 positive
2 1 85 66 29 0 26.6 0.351 31 negative
3 8 183 64 0 0 23.3 0.672 32 positive
4 1 89 66 23 94 28.1 0.167 21 negative
5 0 137 40 35 168 43.1 2.288 33 positive
6 5 116 74 0 0 25.6 0.201 30 negative
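On the StringIO aside: read.table(text = ...) is R's rough analogue (an editorial sketch; the row numbers become row names automatically because the header line has one fewer field than the data lines):
pima <- read.table(text = "
  pregnant glucose diastolic triceps insulin  bmi diabetes age     test
1        6     148        72      35       0 33.6    0.627  50 positive
2        1      85        66      29       0 26.6    0.351  31 negative
3        8     183        64       0       0 23.3    0.672  32 positive
4        1      89        66      23      94 28.1    0.167  21 negative
5        0     137        40      35     168 43.1    2.288  33 positive
6        5     116        74       0       0 25.6    0.201  30 negative
", header = TRUE)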
Something like this: use lapply() to apply a function to every column; in the function, test whether the column is numeric, and if so replace zeros with NA, otherwise return the column unchanged:
pima[] <- lapply(pima, function(x) { if (is.numeric(x)) x[x == 0] <- NA; x })
Or for predefined columns:
cols <- c("diastolic", "glucose", "triceps", "insulin", "bmi")
pima[cols] <- lapply(pima[cols], function(x) { x[x == 0] <- NA; x })
Or using is.na<-
is.na(pima[cols]) <- pima[cols] == 0
Using data.table you can try
for (col in c("diastolic","glucose","triceps","insulin", "bmi")) pima[(get(col))==0, (col) := NA]
More details here: How to replace NA values in a table *for selected columns*? (data.frame, data.table)
Using dplyr, you could do:
# banal function definition
zero_to_NA <- function(col) {
# any code that works here
# I chose this because it is concise and efficient
`is.na<-`(col, col==0)
}
# Assuming you want to change 0 to NA only in these 3 columns
pima <- pima %>%
mutate_each(funs(zero_to_NA), diastolic, glucose, triceps)
Or you could skip the function definition and write directly:
pima <- pima %>%
mutate_each(funs(`is.na<-`(., .==0)),
diastolic, glucose, triceps)
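An editorial note: mutate_each() has since been superseded; in current dplyr (>= 1.0) the equivalent uses across() (a sketch):
pima <- pima %>%
  mutate(across(c(diastolic, glucose, triceps), ~ replace(.x, .x == 0, NA)))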

Automatically creating and filling data frames in R

Here is the code that I am working with.
rnumbers <- data.frame(replicate(5,runif(20000, 0, 1)))
dt <- c(.001)
A <- dt*1    # rnum below A (but above B) switches to state 1
B <- dt*.5   # rnum below B switches to state 0
rstate <- rnumbers # copy the structure
rstate[] <- NA # preserve structure with NA's
# Init:
rstate[1, ] <- rnumbers[1, ] < .02 & rnumbers[1, ] > 0.01
step_generator <- function(col, rnum) {
  for (i in 2:length(col)) {
    if (rnum[i] < B) { col[i] <- 0 }          # switch to state 0
    else if (rnum[i] < A) { col[i] <- 1 }     # switch to state 1
    else { col[i] <- col[i - 1] }             # otherwise hold the previous state
  }
  return(col)
}
# Run for each column index:
for (cl in 1:5) {
  rstate[, cl] <- step_generator(rstate[, cl], rnumbers[, cl])
}
rstate1 <- transform(rstate, time = rep(dt))
rstate2 <- transform(rstate1, cumtime = cumsum(time))
This gives me a data frame with 5 columns that contain state switches over time. The time interval is in the 6th column (seconds) and the cumulative time is in the 7th column (seconds). Now I want to see how long each state lasts, in seconds. This is what I am doing:
1) lengths <- rle(rstate2[,1])
Run Length Encoding
lengths: int [1:15] 366 3278 1817 451 3033 1655 1901 748 742 1780 ...
values : num [1:15] 0 1 0 1 0 1 0 1 0 1 ...
2) lengths1 <- data.frame(state = lengths$values, duration = lengths$lengths)
> lengths1
state duration
1 0 366
2 1 3278
3 0 1817
4 1 451
5 0 3033
6 1 1655
7 0 1901
8 1 748
9 0 742
10 1 1780
11 0 26
12 1 458
13 0 305
14 1 1039
15 0 2401
3) library("plyr")
lengths2 <- transform(lengths1, time = duration*dt)
lengths3 <- arrange(lengths2, desc(state))
> lengths3
state duration time
1 1 3278 3.278
2 1 451 0.451
3 1 1655 1.655
4 1 748 0.748
5 1 1780 1.780
6 1 458 0.458
7 1 1039 1.039
8 0 366 0.366
9 0 1817 1.817
10 0 3033 3.033
11 0 1901 1.901
12 0 742 0.742
13 0 26 0.026
14 0 305 0.305
15 0 2401 2.401
4) col1 <- ddply(lengths3, .(state), function(df) 1/mean(df$time))
> col1
state V1
1 0 0.7553583
2 1 0.7439685
So, col1 shows "1/mean(time in each state)" for column 1 of rstate2. What I would like to do is iterate steps 1-4 for every column of rstate2 and generate a data frame that looks like this:
> rates
state col1 col2 col3 col4 col5
1 0 0.1 0.2 0.3 0.4 0.5
2 1 0.3 0.4 0.5 0.6 0.7
Where the numbers for each column are equal to the 1/mean(df$time) for each of the column from rstate2.
Thank you for any and all help.
I'd do this using the development version of data.table (v 1.8.11) in this manner:
require(data.table) # 1.8.11
require(reshape2)
DT <- data.table(rstate2)
DT.m <- melt(DT, id=6, measure=1:5)
ans <- DT.m[, {
  dl <- data.table:::duplist(list(value))
  list(state = value[dl],
       time = c(diff(dl), .N - dl[length(dl)] + 1) * dt)
}, by = list(variable)]
ans <- ans[, 1/mean(time), by=list(variable, state)]
dcast.data.table(ans, state ~ variable)
state X1 X2 X3 X4 X5
1: 0 0.9875568 1.0777521 0.3227194 2.2371365 0.7237054
2: 1 1.0127608 0.4442799 0.2802691 0.2887169 1.0576415
Unfortunately, it's still building on R-Forge. So you can probably install 1.8.10 from CRAN, use reshape2's melt and dcast (which output a data.frame), convert the result back to a data.table, and do the grouping as follows:
require(data.table) # 1.8.10
require(reshape2)
DT.m <- data.table(melt(rstate2, id=6, measure=1:5))
ans <- DT.m[, {
  dl <- data.table:::duplist(list(value))
  list(state = value[dl],
       time = c(diff(dl), .N - dl[length(dl)] + 1) * dt)
}, by = list(variable)]
ans <- ans[, 1/mean(time), by=list(variable, state)]
dcast(ans, state ~ variable)
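For comparison, a minimal base-R sketch (an editorial addition, not the accepted answer) that wraps steps 1-4 in a function and applies it over the five state columns; it assumes states are coded 0/1 and that every column visits both states:
rate_per_state <- function(col, dt = 0.001) {
  r <- rle(col)
  durations <- split(r$lengths * dt, r$values)   # seconds spent in each run, grouped by state
  sapply(durations, function(t) 1 / mean(t))     # 1/mean(duration) for states 0 and 1
}
rates <- data.frame(state = c(0, 1), sapply(rstate2[1:5], rate_per_state))
rates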
