Computing deciles over calendar years and across different columns using R

I have the following dataset that I created using dplyr and the function tbl_df():
date X1 X2
1 2001-01-31 4.698648 4.640957
2 2001-02-28 4.491493 4.398382
3 2001-03-30 4.101235 4.074065
4 2001-04-30 4.072041 4.217999
5 2001-05-31 3.856718 4.114061
6 2001-06-29 3.909194 4.142691
7 2001-07-31 3.489640 3.678374
8 2001-08-31 3.327068 3.534823
9 2001-09-28 2.476066 2.727257
10 2001-10-31 2.015936 2.299102
11 2001-11-30 2.127617 2.590702
12 2001-12-31 2.162643 2.777744
13 2002-01-31 2.221636 2.740961
14 2002-02-28 2.276458 2.834494
15 2002-03-28 2.861650 3.472853
16 2002-04-30 2.402687 3.026207
17 2002-05-31 2.426250 2.968679
18 2002-06-28 2.045413 2.523772
19 2002-07-31 1.468695 1.677434
20 2002-08-30 1.707742 1.920101
21 2002-09-30 1.449055 1.554702
22 2002-10-31 1.350024 1.466806
23 2002-11-29 1.541507 1.844471
24 2002-12-31 1.208786 1.392031
I am interested in computing deciles for each year and each column: for example, the deciles of 2001 for X1, the deciles of 2001 for X2, the deciles of 2002 for X1, the deciles of 2002 for X2, and so on if I have more years and more columns. I tried:
quantile(x, prob = seq(0, 1, length = 11), type = 5)
and also apply.yearly() with the quantile() function on an xts version of x (my data frame above), but neither does what I actually need. Your help will be appreciated.

You can try the following function:
df<- read.table(header=T,text='date X1 X2
1 2001/01/31 4.698648 4.640957
2 2001/02/28 4.491493 4.398382
3 2001/03/30 4.101235 4.074065
4 2001/04/30 4.072041 4.217999
5 2001/05/31 3.856718 4.114061
6 2001/06/29 3.909194 4.142691
7 2001/07/31 3.489640 3.678374
8 2001/08/31 3.327068 3.534823
9 2001/09/28 2.476066 2.727257
10 2001/10/31 2.015936 2.299102
11 2001/11/30 2.127617 2.590702
12 2001/12/31 2.162643 2.777744
13 2002/01/31 2.221636 2.740961
14 2002/02/28 2.276458 2.834494
15 2002/03/28 2.861650 3.472853
16 2002/04/30 2.402687 3.026207
17 2002/05/31 2.426250 2.968679
18 2002/06/28 2.045413 2.523772
19 2002/07/31 1.468695 1.677434
20 2002/08/30 1.707742 1.920101
21 2002/09/30 1.449055 1.554702
22 2002/10/31 1.350024 1.466806
23 2002/11/29 1.541507 1.844471
24 2002/12/31 1.208786 1.392031')
find_quantile <- function(df, year, col, quant) {
  year_df <- subset(df, year == substring(as.character(date), 1, 4))
  a <- quantile(year_df[, col], quant)
  return(a)
}
# where df is the data frame,
# year is the year you want (as character),
# col is the index of the column you want the quantile of (i.e. 2 or 3 in your case),
# quant is the quantile
For example:
> find_quantile(df,'2001',2,0.7) #specify the year as character
70%
4.023187
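If you want all the deciles for every year/column combination with this function, you could also loop over the years and column indices. This is a minimal sketch, assuming the df and find_quantile defined above (quantile() uses its default type here):
years <- unique(substring(as.character(df$date), 1, 4))
# one list element per year; each element is an 11 x 2 matrix (columns 2 = X1, 3 = X2)
lapply(setNames(years, years), function(y)
  sapply(2:3, function(cl) find_quantile(df, y, cl, seq(0, 1, length = 11))))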

Assuming you have a simple data.frame, first, bin the dates by year:
df$year <- cut(as.Date(df$date), "year")
And then aggregate by year:
foo <- aggregate(. ~ year, subset(df, select = -date), quantile,
                 prob = seq(0, 1, length = 11), type = 5)
This returns a data frame. But it needs a bit of cleaning. Using unnest from the dev version of tidyr and lapply, you could do the following. Please note that the first row for X1 is for 2001, and the second for 2002.
devtools::install_github("hadley/tidyr")
library(tidyr)
unnest(lapply(foo[-1], as.data.frame), column)
# column 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
#1 X1 2.015936 2.094113 2.159140 2.561166 3.375840 3.673179 3.893451 4.055756 4.140261 4.553640 4.698648
#2 X1 1.208786 1.307653 1.439152 1.475976 1.591378 1.876578 2.168769 2.270976 2.405043 2.556870 2.861650
#3 X2 2.299102 2.503222 2.713601 2.853452 3.577888 3.876219 4.102062 4.139828 4.236037 4.471155 4.640957
#4 X2 1.392031 1.444374 1.545912 1.694138 1.867160 2.221936 2.675804 2.825141 2.974432 3.160201 3.472853
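Since the question already uses dplyr, the per-year, per-column deciles can also be computed in one grouped pipeline. This is a sketch assuming the df read above, dplyr >= 1.1 (for reframe(), which allows multi-row results per group) and tidyr >= 1.0 (for pivot_longer()); the column names prob and decile are just illustrative:
library(dplyr)
library(tidyr)

df %>%
  mutate(year = substring(as.character(date), 1, 4)) %>%
  pivot_longer(c(X1, X2), names_to = "column", values_to = "value") %>%
  group_by(year, column) %>%
  reframe(prob = seq(0, 1, length = 11),
          decile = quantile(value, prob, type = 5))
This returns one long data frame with 11 rows per year/column combination instead of nested quantile vectors.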

Related

Return values with matching conditions in r

I would like to return values with matching conditions in another column based on a cut score criterion. If a cut score is not available in the variable, I would like to grab the closest larger value. Here is a snapshot of the dataset:
ids <- c(1,2,3,4,5,6,7,8,9,10)
scores.a <- c(512,531,541,555,562,565,570,572,573,588)
scores.b <- c(12,13,14,15,16,17,18,19,20,21)
data <- data.frame(ids, scores.a, scores.b)
> data
ids scores.a scores.b
1 1 512 12
2 2 531 13
3 3 541 14
4 4 555 15
5 5 562 16
6 6 565 17
7 7 570 18
8 8 572 19
9 9 573 20
10 10 588 21
cuts <- c(531, 560, 571)
I would like to grab the scores.b value corresponding to the first cut score (531), which is 13. Then grab the scores.b value corresponding to the second cut score (560); it is not in scores.a, so I would take the closest larger scores.a value, 562, whose corresponding scores.b value is 16. Lastly, for the third cut score (571), I would like to get 19, which corresponds to the closest larger value (572).
Here is what I would like to get.
scores.b
cut.1 13
cut.2 16
cut.3 19
Any thoughts?
Thanks
We can use a rolling join
library(data.table)
setDT(data)[data.table(cuts = cuts), .(ids = ids, cuts, scores.b),
            on = .(scores.a = cuts), roll = -Inf]
# ids cuts scores.b
#1: 2 531 13
#2: 5 560 16
#3: 8 571 19
Or another option is findInterval from base R after changing the sign and taking the reverse
with(data, scores.b[rev(nrow(data) + 1 - findInterval(rev(-cuts), rev(-scores.a)))])
#[1] 13 16 19
This doesn't remove the other columns, but it illustrates the correct results better:
df1 <- data[match(seq_along(cuts), findInterval(data$scores.a, cuts)), ]
rownames(df1) <- paste("cuts", seq_along(cuts), sep = ".")
> df1
ids scores.a scores.b
cuts.1 2 531 13
cuts.2 5 562 16
cuts.3 8 572 19
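For comparison, here is a minimal base R sketch of the same "closest larger (or equal) value" lookup on the original data frame, assuming scores.a is sorted in ascending order:
# for each cut, find the first scores.a that is >= the cut, then take its scores.b
idx <- sapply(cuts, function(ct) which(data$scores.a >= ct)[1])
setNames(data$scores.b[idx], paste("cut", seq_along(cuts), sep = "."))
# cut.1 cut.2 cut.3
#    13    16    19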

When a variable switches from 1 to 2, delete some data from the other variables and average what's left?

I am analysing some data and need help.
Basically, I have a dataset that looks like this:
date <- seq(as.Date("2017-04-01"),as.Date("2017-05-09"),length.out=40)
switch <- c(rep(1:2,each=10),rep(1:2,each=10))
O2 <- runif(40,min=21.02,max=21.06)
CO2 <- runif(40,min=0.076,max=0.080)
test.data <- data.frame(date,switch,O2,CO2)
As can be seen, there's a switch column that switches between 1 and 2 every 10 data points. I want to write code that does the following: when the switch column changes its value (from 1 to 2, or 2 to 1), delete the first 5 rows of data after the switch (i.e. keep the last 5 data points for all 4 variables), average the remaining data points for O2 and CO2, and put the averages in 2 new columns (avg.O2 and avg.CO2) before the next switch. Then repeat this process until the end.
It's quite easy to do manually on paper or in Excel, but my real dataset comprises thousands of data points and I would like R to do it automatically for me. Does anyone have any ideas that could help?
Please find my edits, which should work for both regular and irregular switching:
date <- seq(as.Date("2017-04-01"),as.Date("2017-05-09"),length.out=40)
switch <- c(rep(1:2,each=10),rep(1:2,each=10))
O2 <- runif(40,min=21.02,max=21.06)
CO2 <- runif(40,min=0.076,max=0.080)
test.data <- data.frame(date,switch,O2,CO2)
CleanMachineData <- function(Data, SwitchData, UnreliableRows = 5){
  # First, properly turn your switch column into a grouping column: (1,2,1,2) -> (1,2,3,4)
  grouplength <- rle(Data[, "switch"])$lengths
  # mapply lets us feed vector arguments to functions that usually take only one/first-element arguments.
  # Here we create a sequence for each run length (output is a list or a matrix)
  grouping <- mapply(seq, grouplength)
  # Here we want a single vector representing the groups
  groups <- mapply(rep, 1:length(grouplength), each = grouplength)
  # If the switching frequency was irregular the result is a list, if regular a matrix;
  # convert either into a vector (is.list avoids comparing a multi-element class vector):
  if (is.list(grouping)) {
    groups <- unlist(groups)
  } else {
    groups <- as.vector(groups)
  }
  Data$group <- groups

  # vector of the first row of each new switch (except the starting 0)
  switchRow <- c(0, which(abs(diff(SwitchData)) == 1)) + 1
  # "as.vector" turns the matrix output of mapply into a plain sequence of numbers.
  # "ToRemove" holds all the row numbers to drop from your original data, except for what happens
  # before (in this case) row 10
  ToRemove <- c(1:UnreliableRows, as.vector(mapply(seq, switchRow, switchRow + UnreliableRows - 1)))
  # Concatenate the unreliable rows at the start (1,2,3,4,5) with ToRemove using c(), then drop them with "-"
  Keep <- seq(nrow(Data))[-c(1:UnreliableRows, ToRemove)]
  # Create the new data (in case you don't know: data[<ROW>, <COLUMN>])
  newdat <- Data[-ToRemove, ]
  # return the results
  newdat
}
dat <- CleanMachineData(test.data, test.data$switch, 5)
dat
date switch O2 CO2 group
6 2017-04-05 1 21.03922 0.07648886 1
7 2017-04-06 1 21.04071 0.07747368 1
8 2017-04-07 1 21.05742 0.07946615 1
9 2017-04-08 1 21.04673 0.07782362 1
10 2017-04-09 1 21.04966 0.07936446 1
16 2017-04-15 2 21.02526 0.07833825 2
17 2017-04-16 2 21.04511 0.07747774 2
18 2017-04-17 2 21.03165 0.07662803 2
19 2017-04-18 2 21.03252 0.07960098 2
20 2017-04-19 2 21.04032 0.07892145 2
26 2017-04-25 1 21.03691 0.07691438 3
27 2017-04-26 1 21.05846 0.07857017 3
28 2017-04-27 1 21.04128 0.07891908 3
29 2017-04-28 1 21.03837 0.07817021 3
30 2017-04-29 1 21.02334 0.07917546 3
36 2017-05-05 2 21.02890 0.07723042 4
37 2017-05-06 2 21.04606 0.07979641 4
38 2017-05-07 2 21.03822 0.07985775 4
39 2017-05-08 2 21.04136 0.07781525 4
40 2017-05-09 2 21.05375 0.07941123 4
aggregate(cbind(O2,CO2) ~ group, dat, mean)
group O2 CO2
1 1 21.04675 0.07812336
2 2 21.03497 0.07819329
3 3 21.03967 0.07834986
4 4 21.04166 0.07882221
# crazier, irregular switching
test.data2 <- test.data
test.data2$switch <- unlist(mapply(rep, 1:2, times = 1, each = c(10,8,10,5,3,10)))[1:20]
dat2 <- CleanMachineData(test.data2, test.data2$switch, 5)
dat2
date switch O2 CO2 group
6 2017-04-05 1 21.03922 0.07648886 1
7 2017-04-06 1 21.04071 0.07747368 1
8 2017-04-07 1 21.05742 0.07946615 1
9 2017-04-08 1 21.04673 0.07782362 1
10 2017-04-09 1 21.04966 0.07936446 1
16 2017-04-15 2 21.02526 0.07833825 2
17 2017-04-16 2 21.04511 0.07747774 2
18 2017-04-17 2 21.03165 0.07662803 2
24 2017-04-23 1 21.05658 0.07669662 3
25 2017-04-24 1 21.04452 0.07983165 3
26 2017-04-25 1 21.03691 0.07691438 3
27 2017-04-26 1 21.05846 0.07857017 3
28 2017-04-27 1 21.04128 0.07891908 3
29 2017-04-28 1 21.03837 0.07817021 3
30 2017-04-29 1 21.02334 0.07917546 3
36 2017-05-05 2 21.02890 0.07723042 4
37 2017-05-06 2 21.04606 0.07979641 4
38 2017-05-07 2 21.03822 0.07985775 4
# You can try a range of values for the number of removed rows with the following
lapply(5:7, function(x) {
dat <- CleanMachineData(test.data2, test.data2$switch, x)
list(data = dat, means = aggregate(cbind(O2,CO2)~group, dat, mean))
})
Use
test.data[rep(c(FALSE, TRUE), each=5),]
to always select the last five rows from each group of 10 rows.
Then you can use aggregate:
d2 <- test.data[rep(c(FALSE, TRUE), each=5),]
aggregate(cbind(O2, CO2) ~ 1, data=d2, FUN=mean)
If you want the average for every 5-rows-group:
aggregate(cbind(O2, CO2) ~ gl(k=5, n=nrow(d2)/5L), data=d2, FUN=mean)
Here is a generalization for an arbitrary number of rows in test.data:
stay <- rep(c(FALSE, TRUE), each=5, length.out=nrow(test.data))
d2 <- test.data[stay,]
group <- gl(k=5, n=nrow(d2)/5L+1L, length=nrow(d2))
aggregate(cbind(O2, CO2) ~ group, data=d2, FUN=mean)
Here is a variant for mixing the data with the averages:
group <- gl(k=10, n=nrow(test.data)/10L+1L, length=nrow(test.data))
L <- split(test.data, group)
mySummary <- function(x) {
  if (nrow(x) <= 5) return(NULL)
  x <- x[-(1:5), ]
  d.avg <- aggregate(cbind(O2, CO2) ~ 1, data = x, FUN = mean)
  rbind(x, cbind(date = NA, switch = -1, d.avg))
}
lapply(L, mySummary) # as list of dataframes
do.call(rbind, lapply(L, mySummary)) # as one dataframe
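For completeness, the same drop-and-average logic can also be sketched with dplyr (assuming test.data as generated above): detect run boundaries with lag(), drop the first five rows of each run, and average the rest.
library(dplyr)

test.data %>%
  mutate(group = cumsum(switch != lag(switch, default = first(switch))) + 1) %>%
  group_by(group) %>%
  filter(row_number() > 5) %>%   # drop the 5 unreliable rows right after each switch
  summarise(avg.O2 = mean(O2), avg.CO2 = mean(CO2))
This works for irregular run lengths as well, since the grouping is derived from where switch actually changes rather than from a fixed block size.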

R: calculating ICC across multiple columns in two dataframes

I am calculating the ICC's for 301 variables of 2 readers. Results are saved in two files with 301 columns each. The first column of file1 (reader1$Var1) corresponds to the first column of file2 (reader2$Var302). I can perform the ICC manually (see below), but I need help to automate this process using apply or a loop. Thank you.
library(irr)
a <- data.frame(reader1$Var1)
b <- data.frame(reader2$Var302)
X <- data.frame(a, b)
function.ICC <- function(X) {
  irr::icc(X, model = c("oneway"), type = c("consistency"), unit = c("single"),
           r0 = 0, conf.level = 0.95)
}
Results <- function.ICC(X)
Results[7]
A combination of lapply and do.call could work for your case (although there are quite a few options). You don't provide a sample of your data, so I assume you first cbind your 2 data frames one after the other, so that in this toy example
> X = data.frame(cbind(1:10, 11:20, 21:30, 21:30))
> X
X1 X2 X3 X4
1 1 11 21 21
2 2 12 22 22
3 3 13 23 23
4 4 14 24 24
5 5 15 25 25
6 6 16 26 26
7 7 17 27 27
8 8 18 28 28
9 9 19 29 29
10 10 20 30 30
you would like to run icc of X1 vs X3 and X2 versus X4. It would be something like the following, relying on function.ICC as you've defined it:
> do.call(cbind, lapply(1:2, function(i) function.ICC(X[,c(i, i+2)])))
[,1] [,2]
subjects 10 10
raters 2 2
model "oneway" "oneway"
type "consistency" "consistency"
unit "single" "single"
icc.name "ICC(1)" "ICC(1)"
value -0.8320611 -0.4634146
r0 0 0
Fvalue 0.09166667 0.3666667
df1 9 9
df2 10 10
p.value 0.9993158 0.926668
conf.level 0.95 0.95
lbound -0.9526347 -0.8231069
ubound -0.4669701 0.1848105
So, for your cbind'ed data frames with 301 columns each, something similar to this should work:
do.call(cbind, lapply(1:301, function(i) function.ICC(X[,c(i, i+301)])))
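If you only need the ICC point estimates rather than the full output, a small variation of the same call (assuming the cbind'ed X and function.ICC above, and that irr::icc() returns its estimate in the $value element) keeps just those values as a numeric vector:
icc.values <- sapply(1:301, function(i) function.ICC(X[, c(i, i + 301)])$value)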

change code because of unwanted factors

So basically my code above takes the numbers in groups of 5 and calculates the standard deviation of the values within each group. So if I have sample data like this:
Number STD
1 11.15
2 11.18
3 11.21
4 11.24
5 11.3
10 11.36
11 11.42
12 11.48
13 11.54
14 11.6
15 11.66
16 11.72
17 11.78
18 11.84
19 11.9
20 11.96
When I run my code, I'll get this output
Number STD
1 1 0.05770615
2 2 NA
3 3 0.09486833
4 4 0.09486833
So what I want to do is simply replace the NA with 0. Also, instead of getting factor labels like 1, 2, 3, 4, etc., I want to get 5, 10, 15, 20, 25, etc.
Another way of doing it:
# Generate data
number <- c(1:5, 10:20)
val <- c(11.15, 11.18, 11.21, 11.24, 11.30, 11.36, 11.42,
11.48, 11.54, 11.60, 11.66, 11.72, 11.78, 11.84, 11.90, 11.96)
data <- data.frame(number, val)
# Calculate SD
breaks <- seq(0, 20, 5)
splitted.data <- split(data$val, f=cut(data$number, breaks, labels=F))
err <- sapply(splitted.data, sd)
err[is.na(err)] <- 0
res <- cbind(Number = breaks[-1], STD = err)
Resulting in:
> res
Number STD
1 5 0.05770615
2 10 0.00000000
3 15 0.09486833
4 20 0.09486833
I haven't tried to rewrite what you are doing, but for the sake of continuity:
You can use the labels argument of cut to set the labels of the resulting categories.
Change NA to 0 using spread[is.na(spread)] <- 0.
The full code is:
hunter <- lapply(hunt, function(i) {
  random <- cut(value[, i], seq(0, max(value[i]), 5),
                labels = seq(5, max(value[i]), 5))
  spread <- tapply(value[, i + 1], random, sd, na.rm = TRUE)
  spread[is.na(spread)] <- 0
  Number <- levels(as.factor(random))
  d <- data.frame(Number = Number, STD = spread)
})
Number STD
5 5 0.05770615
10 10 0.00000000
15 15 0.09486833
20 20 0.09486833
Using the data.table package, you can accomplish this in one call:
library(data.table)
DT <- data.table(value)
As a single call:
DT[, list(SD = ifelse(is.na(sd(STD)), 0, sd(STD)))
, by=list("Group" = factor(G <- (Number-1) %/% 5, labels=(unique(G) + 1)*5))]
Group SD
1: 5 0.05770615
2: 10 0.00000000
3: 15 0.09486833
4: 20 0.09486833
Breaking it down:
# you can create your groupings by
(Number-1) %/% 5 # (i.e., integer division by 5)
# you can create your factor levels by
5 * ((Number-1) %/% 5 + 1)
# calculate the Group:
DT[, grp := factor(G <- (Number-1) %/% 5, labels=(unique(G) + 1)*5)]
# calculate the SD by Group, replacing NA's with 0:
DT[, SD := ifelse(is.na(sd(STD)), 0, sd(STD)), by=grp]
unique(DT[, list(grp, SD)])
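For reference, here is a compact dplyr sketch of the same grouping, assuming the data frame built above (columns number and val): bin number by its upper multiple of 5, and replace the NA that sd() returns for single-observation bins with 0.
library(dplyr)

data %>%
  group_by(Number = 5 * ceiling(number / 5)) %>%
  summarise(STD = coalesce(sd(val), 0))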

Using R to apply an equation to specific groups of data within a data set

I have a data set, and I would like to apply an equation to groups of my values. Specifically I would like to apply
sqrt(X^2+Y^2+Z^2)
to all values within a specific time and variable
Looking at the data below I would like to group my values by unique time (TS) and Bins (Bin), and grab the square root of the sum of squares for each of the X Y and Z components.
id D Bin value Month Day Year Hour Minute Second TS
1 X V1 -0.320 1 30 2012 13 59 50 2012-01-30 13:59:50
1 Y V1 -0.088 1 30 2012 13 59 50 2012-01-30 13:59:50
1 Z V1 0.171 1 30 2012 13 59 50 2012-01-30 13:59:50
1 X V2 0.368 1 30 2012 13 59 50 2012-01-30 13:59:50
1 Y V2 -0.104 1 30 2012 13 59 50 2012-01-30 13:59:50
1 Z V2 0.008 1 30 2012 13 59 50 2012-01-30 13:59:50
2 X V1 -0.052 1 30 2012 14 0 50 2012-01-30 14:00:50
2 Y V1 0.278 1 30 2012 14 0 50 2012-01-30 14:00:50
2 Z V1 -0.086 1 30 2012 14 0 50 2012-01-30 14:00:50
2 X V2 -0.214 1 30 2012 14 0 50 2012-01-30 14:00:50
2 Y V2 0.118 1 30 2012 14 0 50 2012-01-30 14:00:50
2 Z V2 -0.030 1 30 2012 14 0
So up first would be V1 at 13:59:50
sqrt((-0.320)^2 + (-0.088)^2 + 0.171^2)
and then for V2 at 13:59:50
sqrt(0.368^2 + (-0.104)^2 + 0.008^2)
and so on
I had tried to use this formula (Data is called "V")
V=aggregate(value~TS+variable,data=V,sqrt((if(V$D=="X")V$value^2)+(if(V$D=="Y")V$value^2))+(if(V$D=="Z")V$value^2))
But obviously that does not work. So does anyone have a better way to first index unique groups in a data set, and then apply an equation to each group?
Use the plyr and reshape (or reshape2) packages. (Really. If you're not using those packages, you'll be astounded how much better things go.) Briefly, you'll want to first cast() your data into a wide form, so that instead of columns named D and value, you have columns named X, Y and Z. From there, you can use any number of techniques. transform in base would work, although I like mutate in the plyr package a bit better:
V <- mutate(V, norm=sqrt(X^2+Y^2+Z^2))
Assuming you always have one X, one Y, and one Z for each combination of (TS, Bin), I would try this:
aggregate(value ~ TS + Bin, data = V, FUN = function(x)sqrt(sum(x^2)))
library("plyr")
ddply(V, .(TS, Bin), summarise, norm=sqrt(sum(value*value)))
If there is exactly one X, Y, and Z per TS/Bin combination.
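For completeness, a dplyr sketch of the same aggregation, assuming V has the columns shown above and exactly one X, Y and Z row per (TS, Bin) pair:
library(dplyr)

V %>%
  group_by(TS, Bin) %>%
  summarise(norm = sqrt(sum(value^2)), .groups = "drop")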
