I have an hourly time series of rainfall at four locations for one year, as shown below. I want to compute the sum or mean for each of the 24 hours of the day over the entire year, separately for each of the four locations. This is a common analysis in meteorology and is known as diurnal variation.
This will give me an idea of which hours are preferred for rainfall at these locations. Is there a simple way to do this with the xts/zoo packages?
head(rg_hr_xts)
rg1 rg2 rg3 rg4
2018-06-01 00:59:17 1.0 0.0 0 0
2018-06-01 01:59:17 0.2 0.0 0 0
2018-06-01 02:59:17 0.0 0.2 0 0
2018-06-01 03:59:17 0.0 1.6 0 0
2018-06-01 04:59:17 0.0 3.4 0 0
2018-06-01 05:59:17 0.0 0.8 0 0
Note: I have used .indexhour(rg_hr_xts),
which gives me the hour of each index as follows:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
20 21 22 23 0 1 2 3 4 5 6 7
I want to sum all the values that share the same hour index (0, 1, 2, ...) and get a data frame with the hour index 0-23 and the total rainfall for each hour.
I expect the following structure for the output data frame:
hour rg1 rg2 rg3 rg4
0 0.3 0.7 1.2 0.4
1 1.3 1.5 1.3 1.3
2 1.5 1.7 1.9 1.8
3 2.0 2.5 2.6 2.9
4 2.1 2.9 3.5 3.6
You can use aggregate() to calculate sums by hour of the day.
library(xts)
# Some reproducible example data
n <- 1e4
set.seed(21)
x <- .xts(cbind(rg1 = runif(n), rg2 = runif(n)), 1:n * 3600 + runif(n)*100)
# Aggregate by hour of the day
y <- aggregate(x, by = .indexhour(x), FUN = sum)
The result of aggregate() will be a zoo object with an index of the hour of the day. Then you can use data.frame() to convert that to a data.frame.
data.frame(hour = index(y), y)
# hour rg1 rg2
# 0 0 214.3876 211.5131
# 1 1 215.5521 205.4340
# 2 2 206.1494 211.7510
# 3 3 223.9533 209.5391
# 4 4 202.8989 211.6612
# 5 5 198.6387 203.7809
# 6 6 218.7807 218.9829
# 7 7 205.2797 214.6127
# 8 8 207.2061 219.2323
# 9 9 217.2509 208.9815
# 10 10 218.4591 202.3216
# 11 11 205.6799 219.2482
# 12 12 206.8984 209.7392
# 13 13 209.4091 205.2837
# 14 14 212.0559 213.2387
# 15 15 211.8372 204.3384
# 16 16 206.5818 221.5508
# 17 17 212.1076 214.9638
# 18 18 219.3799 205.4536
# 19 19 202.6254 202.2210
# 20 20 208.5686 208.5411
# 21 21 213.2116 218.9530
# 22 22 210.6371 207.5539
# 23 23 197.8964 203.9069
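The question also asks about the mean. Here is a minimal base-R sketch of the same idea without xts, assuming the timestamps are available as a POSIXct vector. The names times, rain and diurnal_mean are hypothetical, and the data are simulated rather than the asker's.

```r
# Simulated hourly data for two gauges over 10 days (hypothetical values)
set.seed(21)
n <- 24 * 10
times <- as.POSIXct("2018-06-01 00:59:17", tz = "UTC") + 3600 * (0:(n - 1))
rain <- data.frame(rg1 = runif(n), rg2 = runif(n))

# Hour of the day (0-23) for each timestamp, in the time zone of 'times'
hour <- as.POSIXlt(times)$hour

# Mean rainfall per hour of the day, one column per gauge
diurnal_mean <- aggregate(rain, by = list(hour = hour), FUN = mean)
head(diurnal_mean)
```

With the xts object from the answer above, swapping FUN = sum for FUN = mean in the aggregate() call should give the same kind of result. Either way, the hour depends on the time zone attached to the index, so it is worth checking that it matches local time at the gauges.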
I have data with a primary key and ratio values like the following:
2.243164164
1.429242413
2.119270714
3.013427143
1.208634972
1.208634972
1.23657632
2.212136028
2.168583297
2.151961216
1.159886063
1.234106444
1.694206176
1.401425329
5.210125578
1.215267806
1.089189869
I want to add a rank column which groups these ratios into, say, 3 bins. The functionality should be similar to this SAS code:
PROC RANK DATA = TAB1 GROUPS = &NUM_BINS
I did the following:
Convert your vector to a data frame.
Create variable Rank:
test2$rank<-rank(test2$test)
> test2
test rank
1 2.243164 15.0
2 1.429242 9.0
3 2.119271 11.0
4 3.013427 16.0
5 1.208635 3.5
6 1.208635 3.5
7 1.236576 7.0
8 2.212136 14.0
9 2.168583 13.0
10 2.151961 12.0
11 1.159886 2.0
12 1.234106 6.0
13 1.694206 10.0
14 1.401425 8.0
15 5.210126 17.0
16 1.215268 5.0
17 1.089190 1.0
Define a function to convert to percentile ranks, then define pr as that percentile:
percent.rank<-function(x) trunc(rank(x)/length(x)*100)
test3<-within(test2,pr<-percent.rank(rank))
Then I created the bins, given that you wanted 3 of them:
test3$bins <- cut(test3$pr, breaks=c(0,33,66,100), labels=c("0-33","34-66","66-100"))
test x rank pr bins
1 2.243164 15.0 15.0 88 66-100
2 1.429242 9.0 9.0 52 34-66
3 2.119271 11.0 11.0 64 34-66
4 3.013427 16.0 16.0 94 66-100
5 1.208635 3.5 3.5 20 0-33
6 1.208635 3.5 3.5 20 0-33
7 1.236576 7.0 7.0 41 34-66
8 2.212136 14.0 14.0 82 66-100
9 2.168583 13.0 13.0 76 66-100
10 2.151961 12.0 12.0 70 66-100
11 1.159886 2.0 2.0 11 0-33
12 1.234106 6.0 6.0 35 34-66
13 1.694206 10.0 10.0 58 34-66
14 1.401425 8.0 8.0 47 34-66
15 5.210126 17.0 17.0 100 66-100
16 1.215268 5.0 5.0 29 0-33
17 1.089190 1.0 1.0 5 0-33
Does that work for you?
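For comparison, SAS documents the GROUPS= value of PROC RANK as FLOOR(rank * k / (n + 1)), with tied values receiving their mean rank by default. Here is a minimal base-R sketch of that formula, assuming that definition. The helper name rank_groups is hypothetical.

```r
# Group index in 0..(k - 1), following FLOOR(rank * k / (n + 1))
rank_groups <- function(x, k) {
  r <- rank(x, ties.method = "average")  # ties get their mean rank
  floor(r * k / (length(x) + 1))
}

test <- c(2.243164164, 1.429242413, 2.119270714, 3.013427143, 1.208634972,
          1.208634972, 1.23657632, 2.212136028, 2.168583297, 2.151961216,
          1.159886063, 1.234106444, 1.694206176, 1.401425329, 5.210125578,
          1.215267806, 1.089189869)

bins <- rank_groups(test, 3)
table(bins)
# bins
# 0 1 2
# 5 6 6
```

The groups are numbered from 0, as in SAS. Add 1 if you prefer bins labelled 1 to 3.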
A bit late, but given your data, we can use ntile from the dplyr package to get equal-sized groups:
df <- data.frame(values = c(2.243164164,
1.429242413,
2.119270714,
3.013427143,
1.208634972,
1.208634972,
1.23657632,
2.212136028,
2.168583297,
2.151961216,
1.159886063,
1.234106444,
1.694206176,
1.401425329,
5.210125578,
1.215267806,
1.089189869))
library(dplyr)
df <- df %>%
arrange(values) %>%
mutate(rank = ntile(values, 3))
values rank
1 1.089190 1
2 1.159886 1
3 1.208635 1
4 1.208635 1
5 1.215268 1
6 1.234106 1
7 1.236576 2
8 1.401425 2
9 1.429242 2
10 1.694206 2
11 2.119271 2
12 2.151961 2
13 2.168583 3
14 2.212136 3
15 2.243164 3
16 3.013427 3
17 5.210126 3
Or see cut_number from the ggplot2 package:
library(ggplot2)
df$rank2 <- cut_number(df$values, 3, labels = c(1:3))
values rank rank2
1 1.089190 1 1
2 1.159886 1 1
3 1.208635 1 1
4 1.208635 1 1
5 1.215268 1 1
6 1.234106 1 1
7 1.236576 2 2
8 1.401425 2 2
9 1.429242 2 2
10 1.694206 2 2
11 2.119271 2 2
12 2.151961 2 3
13 2.168583 3 3
14 2.212136 3 3
15 2.243164 3 3
16 3.013427 3 3
17 5.210126 3 3
Because your sample consists of 17 numbers, one bin contains 5 numbers while the other two contain 6. The two functions differ on row 12: ntile puts 6 numbers in each of the first and second groups, whereas cut_number puts 6 in each of the first and third groups.
> table(df$rank)
1 2 3
6 6 5
> table(df$rank2)
1 2 3
6 5 6
See also here: Splitting a continuous variable into equal sized groups
I want to make a new column in my data set with values determined by the values in another data set, but it's not as simple as the values in one column being a function of the values in the other. Here's an example:
>df1
chromosome position
1 1 1
2 1 2
3 1 4
4 1 5
5 1 7
6 1 12
7 1 13
8 1 15
9 1 21
10 1 23
11 1 24
12 2 1
13 2 5
14 2 7
15 2 8
16 2 12
17 2 15
18 2 18
19 2 21
20 2 22
and
>df2
chromosome segment_start segment_end segment.number
1 1 1 5 1.1
2 1 6 20 1.2
3 1 21 25 1.3
4 2 1 7 2.1
5 2 8 16 2.2
6 2 18 22 2.3
I want to make a new column in df1 called 'segment', and the value in segment is to be determined by which segment (as determined by 'segment_start', 'segment_end', and 'chromosome' from df2) the value in 'position' belongs to. For example, in df1, row 7, position=13, and chromosome=1. Because 13 is between 6 and 20, the entry in my hypothetical 'segment' column would be 1.2, from row 2 of df2, because 13 falls between segment_start and segment_end from that row (6 and 20, respectively), and the 'chromosome' value from df1 row 7 is 1, just as 'chromosome' in df2 row 2 is 1.
Each row in df1 belongs to one of the segments described in df2; that is, it lies on the same chromosome as one of the segments, and its 'position' is >=segment_start and <=segment_end. And I want to get that information into df1, so it says what segment each position belongs to.
I was thinking of using an if function, and started with:
if(df1$position>=df2$segment_start & df1$position<=df2$segment_end & df1$chromosome==df2$chromosome) df1$segment<-df2$segment.number
But I am not sure that approach is feasible. If nothing else, maybe the code helps illustrate what I'm trying to do. Basically, I want to match each row, by its position and chromosome, to a segment in df2. Thanks.
This appears to be a rolling join. You can use data.table for this:
require(data.table)
DT1 <- data.table(df1, key = c('chromosome','position'))
DT2 <- data.table(df2, key = c('chromosome','segment_start'))
# DT2[DT1, roll=TRUE] performs the join you want, but the result keeps
# the column names of DT2 (the position ends up in segment_start),
# which is why I have renamed and subset here
DT2[DT1, roll=TRUE][ ,list(chromosome,position = segment_start,segment.number)]
# chromosome position segment.number
# 1: 1 1 1.1
# 2: 1 2 1.1
# 3: 1 4 1.1
# 4: 1 5 1.1
# 5: 1 7 1.2
# 6: 1 12 1.2
# 7: 1 13 1.2
# 8: 1 15 1.2
# 9: 1 21 1.3
# 10: 1 23 1.3
# 11: 1 24 1.3
# 12: 2 1 2.1
# 13: 2 5 2.1
# 14: 2 7 2.1
# 15: 2 8 2.2
# 16: 2 12 2.2
# 17: 2 15 2.2
# 18: 2 18 2.3
# 19: 2 21 2.3
# 20: 2 22 2.3
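One caveat: a rolling join matches each position to the nearest preceding segment_start without checking segment_end, so a position in a gap between segments (for example 17 on chromosome 2, which lies between 16 and 18) would still be assigned a segment. Here is a minimal base-R sketch that checks both bounds, using a small hypothetical subset of df1 plus that extra position. The helper column row is an assumption added for illustration.

```r
# Small hypothetical subset of df1, plus position 17 which falls in a gap
df1 <- data.frame(chromosome = c(1, 1, 2, 2), position = c(4, 13, 7, 17))
df2 <- data.frame(chromosome     = c(1, 1, 1, 2, 2, 2),
                  segment_start  = c(1, 6, 21, 1, 8, 18),
                  segment_end    = c(5, 20, 25, 7, 16, 22),
                  segment.number = c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3))

# Remember the original row order, because merge() may reorder rows
df1$row <- seq_len(nrow(df1))

# Pair every position with every segment on the same chromosome,
# then keep only the pairs where the position lies inside the segment
m <- merge(df1, df2, by = "chromosome")
m <- m[m$position >= m$segment_start & m$position <= m$segment_end, ]

# Positions that fall in no segment get NA
df1$segment <- m$segment.number[match(df1$row, m$row)]
df1$segment
# [1] 1.1 1.2 2.1  NA
```

The merge builds all position/segment pairs per chromosome, so this is only a sketch for small inputs. For large data the rolling join or an interval-overlap method scales better.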
You really need to check out the GenomicRanges package from Bioconductor. It provides the data structures that are appropriate for your use case.
First, we create the GRanges objects:
library(GenomicRanges)
gr1 <- with(df1, GRanges(chromosome, IRanges(position, width=1L)))
gr2 <- with(df2, GRanges(chromosome, IRanges(segment_start, segment_end),
segment.number=segment.number))
Then we find the overlaps and do the merge:
hits <- findOverlaps(gr1, gr2)
gr1$segment[queryHits(hits)] <- gr2$segment.number[subjectHits(hits)]
I'm going to assume that the regions in df2 are non-overlapping, continuous and complete (not missing any positions from df1). I seem to do this differently every time I try, so here's my latest idea.
First, make sure chromosome is a factor in both data sets
df1$chromosome<-factor(df1$chromosome)
df2$chromosome<-factor(df2$chromosome)
Now I want to unwrap chromosome/position into one overall genomic position. I'll do that with:
ends<-with(df2, tapply(segment_end, chromosome, max))
offset<-head(c(0,cumsum(ends)),-1)
names(offset)<-names(ends)
This assigns unique position values to all positions across all chromosomes and it tracks the offset to the beginning of each chromosome in this new system. Now we will build a translation function from the data in df2
seglookup <- approxfun(with(df2, offset[chromosome]+segment_start), 1:nrow(df2),
method="constant", rule=2)
We use approxfun to find the right interval in the genetic position space for each segment. Now we use this function on df1
segid <- with(df1, seglookup(offset[chromosome]+position))
Now we have the correct ID for each position. We can verify this by merging the data and looking at the results
cbind(df1, df2[segid,-1])
chromosome position segment_start segment_end segment.number
1 1 1 1 5 1.1
2 1 2 1 5 1.1
3 1 4 1 5 1.1
4 1 5 1 5 1.1
5 1 7 6 20 1.2
6 1 12 6 20 1.2
7 1 13 6 20 1.2
8 1 15 6 20 1.2
9 1 21 21 25 1.3
10 1 23 21 25 1.3
11 1 24 21 25 1.3
12 2 1 1 7 2.1
13 2 5 1 7 2.1
14 2 7 1 7 2.1
15 2 8 8 16 2.2
16 2 12 8 16 2.2
17 2 15 8 16 2.2
18 2 18 18 22 2.3
19 2 21 18 22 2.3
20 2 22 18 22 2.3
So it looks like we did alright.
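To see why approxfun() works as a lookup here: with method = "constant" it builds a step function that holds the value at each start until the next start is reached, so it returns the index of the last segment start at or before the query. A minimal sketch on the chromosome 1 starts from df2 follows (the names starts, lookup and q are hypothetical). Base R's findInterval() gives the same index for sorted starts.

```r
# Segment starts for chromosome 1 in df2
starts <- c(1, 6, 21)

# Step function: index of the last start at or before the query;
# rule = 2 extends the first/last value beyond the range of 'starts'
lookup <- approxfun(starts, seq_along(starts), method = "constant", rule = 2)

q <- c(1, 5, 6, 13, 24)
lookup(q)
# [1] 1 1 2 2 3

# Same indices from findInterval()
findInterval(q, starts)
# [1] 1 1 2 2 3
```

Neither function looks at segment_end, which is why the answer assumes the segments have no gaps.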
I have the following dummy data set:
ID TIME DDAY DV
1 0 50 6.6
1 12 50 6.1
1 24 50 5.6
1 48 50 7.6
2 0 10 6.6
2 12 10 6.6
2 24 10 6.6
2 48 10 6.6
3 0 50 3.6
3 12 50 6.8
3 24 50 9.6
3 48 50 7.1
4 0 10 8.6
4 12 10 6.4
4 24 10 4.6
4 48 10 5.6
I want to create a summary table with the mean and standard deviation of DV, as shown below:
N TIME DDAY MEAN-DV SD-DV
2 0 50 6.5 1.1
2 12 50 6.1 0.8
2 24 50 4.5 2.0
2 48 50 7.5 1.0
2 0 10 6.9 1.5
2 12 10 8.5 1.3
2 24 10 6.1 0.9
2 48 10 4.5 1.8
How do I do this in R?
You can use:
1) dplyr:
library(dplyr)
dat %>%
  group_by(TIME, DDAY) %>%
  summarise(MEAN_DV = mean(DV), SD_DV = sd(DV), N = length(DV))
# TIME DDAY MEAN_DV SD_DV N
# 1 48 10 6.10 0.7071068 2
# 2 24 10 5.60 1.4142136 2
# 3 12 10 6.50 0.1414214 2
# 4 0 10 7.60 1.4142136 2
# 5 48 50 7.35 0.3535534 2
# 6 24 50 7.60 2.8284271 2
# 7 12 50 6.45 0.4949747 2
# 8 0 50 5.10 2.1213203 2
where dat is the name of your data frame.
2) data.table:
library(data.table)
DT <- as.data.table(dat)
DT[ , list(MEAN_DV = mean(DV), SD_DV = sd(DV), N = .N), by = c("TIME", "DDAY")]
# TIME DDAY MEAN_DV SD_DV N
# 1: 0 50 5.10 2.1213203 2
# 2: 12 50 6.45 0.4949747 2
# 3: 24 50 7.60 2.8284271 2
# 4: 48 50 7.35 0.3535534 2
# 5: 0 10 7.60 1.4142136 2
# 6: 12 10 6.50 0.1414214 2
# 7: 24 10 5.60 1.4142136 2
# 8: 48 10 6.10 0.7071068 2
require(plyr)
# THIS COLLAPSES ON TIME
ddply(df, .(TIME), summarize, MEAN_DV=mean(DV), SD_DV=sd(DV), N=length(DV))
# THIS COLLAPSES ON TIME AND DDAY
ddply(df, .(TIME, DDAY), summarize, MEAN_DV=mean(DV), SD_DV=sd(DV), N=length(DV))
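The same summary can also be sketched in base R with aggregate(), without extra packages. This is a minimal sketch that rebuilds the dummy data from the question. The result name res and the column names DV.MEAN, DV.SD and DV.N are consequences of this particular code, not names from the original posts.

```r
# Dummy data from the question
dat <- data.frame(
  ID   = rep(1:4, each = 4),
  TIME = rep(c(0, 12, 24, 48), times = 4),
  DDAY = rep(c(50, 10, 50, 10), each = 4),
  DV   = c(6.6, 6.1, 5.6, 7.6,
           6.6, 6.6, 6.6, 6.6,
           3.6, 6.8, 9.6, 7.1,
           8.6, 6.4, 4.6, 5.6)
)

# One row per TIME/DDAY combination; the function returns three statistics,
# which aggregate() stores as a matrix column named DV
res <- aggregate(DV ~ TIME + DDAY, data = dat,
                 FUN = function(v) c(MEAN = mean(v), SD = sd(v), N = length(v)))

# Flatten the matrix column into ordinary columns DV.MEAN, DV.SD, DV.N
res <- do.call(data.frame, res)
res
```

For TIME = 0 and DDAY = 50 this gives a mean of 5.1 and a standard deviation of about 2.12, matching the dplyr and data.table output above.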
I am doing a data analysis on wall thickness measurements of circular tubes. I have the following matrix:
> head(datIn, 12)
Component Tube.number Measurement.location Sub.location Interval Unit Start
1 In 1 1 A 121 U6100 7/25/2000
2 In 1 1 A 122 U6100 5/24/2001
3 In 1 1 A 222 U6200 1/19/2001
4 In 1 1 A 321 U6300 6/1/2000
5 In 1 1 A 223 U6200 5/22/2002
6 In 1 1 A 323 U6300 6/18/2002
7 In 1 1 A 21 U6200 10/1/1997
8 In 1 1 A 221 U6200 6/3/2000
9 In 1 1 A 322 U6300 12/11/2000
10 In 1 1 B 122 U6100 5/24/2001
11 In 1 1 B 322 U6300 12/11/2000
12 In 1 1 B 21 U6200 10/1/1997
End Measurement Material.loss Material.loss.interval Run.hours.interval
1 5/11/2001 7.6 0.4 NA 6653.10
2 2/7/2004 6.1 1.9 1.5 15484.82
3 3/7/2002 8.5 -0.5 -0.5 8826.50
4 12/1/2000 7.8 0.2 0.2 4170.15
5 4/30/2003 7.4 0.6 1.1 6879.73
6 9/30/2003 7.9 0.1 -0.1 9711.56
7 4/20/2000 7.6 0.4 NA 15159.94
8 1/5/2001 8.0 0.0 -0.4 4728.88
9 5/30/2002 7.8 0.2 0.0 9829.75
10 2/7/2004 5.9 2.1 0.9 15484.82
11 5/30/2002 7.0 1.0 0.7 9829.75
12 4/20/2000 8.2 -0.2 NA 15159.94
Run.hours.prior.to.interval Total.run.hours.end.interval
1 0.00 6653.10
2 6653.10 22137.92
3 19888.82 28715.32
4 0.00 4170.15
5 28715.32 35595.05
6 30039.58 39751.14
7 0.00 15159.94
8 15159.94 19888.82
9 20209.83 30039.58
10 6653.10 22137.92
11 20209.83 30039.58
12 0.00 15159.94
Straight.or.In.Out.Middle.bend.1 Straight.or.In.Out.Middle.bend.2
1 Out Out
2 Out Out
3 Out Out
4 Out Out
5 Out Out
6 Out Out
7 Out Out
8 Out Out
9 Out Out
10 Middle Out
11 Middle Out
12 Middle Out
The Sub.location column has values A, B, C, D. They are measurements at the same measurement location but at a different position in the cross section. So at 0, 90, 180, 270 degrees along the tube.
I would like to make a plot in which it becomes clear which measurement location has the biggest wall thickness decrease in time.
To do this I first want to calculate the mean value of the wall thickness of a tube at each measurement location at each unique interval (the running hours are coupled to the interval).
I tried doing this with the following code:
par(mfrow=c(1,2))
myfunction <- function(mydata1) { return(mean(mydata1,na.rm=TRUE))}
AVmeasloc <- tapply(datIn$Measurement,list(as.factor(datIn$Sub.location),as.factor(datIn$Measurement.location), myfunction))
AVmeasloc
This doesn't seem to work. I would like to keep using tapply, as I have also calculated the standard deviation for some values with it and it lets me make plots easily.
Does anyone have any advice how to tackle this problem?
In the code you've posted there is a misplaced parenthesis around list(). It should read:
AVmeasloc <- tapply(datIn$Measurement,list(as.factor(datIn$Sub.location),as.factor(datIn$Measurement.location)), myfunction)
This can now be cleaned up to
AVmeasloc <- tapply(datIn$Measurement,datIn[,c(3,4)],mean,na.rm=TRUE)
Here's a working example:
test.data <- data.frame(cat1 = c("A","A","A","B","B","B","C","C","D"),
cat2 = c(1,1,2,2,1,NA,2,1,1),
val = c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9))
tapply(test.data$val, test.data[,c(1,2)],mean,na.rm=TRUE)
cat2
cat1 1 2
A 0.15 0.3
B 0.50 0.4
C 0.80 0.7
D 0.90 NA
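Since the question also mentions standard deviations, here is a minimal sketch of the same tapply() pattern with columns selected by name rather than by position, which is less fragile if the column order changes. It reuses test.data from the answer. The names avg and spread are hypothetical.

```r
test.data <- data.frame(cat1 = c("A","A","A","B","B","B","C","C","D"),
                        cat2 = c(1,1,2,2,1,NA,2,1,1),
                        val  = c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9))

# Mean and standard deviation per cat1/cat2 cell
avg    <- tapply(test.data$val, test.data[, c("cat1", "cat2")], mean, na.rm = TRUE)
spread <- tapply(test.data$val, test.data[, c("cat1", "cat2")], sd,   na.rm = TRUE)

avg["A", "1"]     # mean of 0.1 and 0.2
spread["A", "1"]  # sd of 0.1 and 0.2
spread["A", "2"]  # NA: sd() of a single value is undefined
```

Rows with NA in a grouping column (here the sixth row) are dropped by tapply(), and cells with no data come back as NA.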
I was wondering if somebody could help me with a data set. I would like to make a graph with survivors on the y-axis and day on the x-axis. The problem I am having is breaking up the data by the four intensity groups. Ideally I would like a graph with a trend line for each of the four intensities so that I can see whether there are significant differences between them. Any help or tips would be much appreciated! Here is what my data frame looks like:
Intensity Day Survivors
1 0.0 0 37
2 0.0 1 29
3 0.0 2 9
4 0.0 3 1
5 0.0 4 1
6 0.0 5 0
7 0.0 6 0
8 0.0 7 0
9 0.1 0 40
10 0.1 1 28
11 0.1 2 8
12 0.1 3 0
13 0.1 4 0
14 0.1 5 0
15 0.1 6 0
16 0.1 7 0
17 0.2 0 40
18 0.2 1 26
19 0.2 2 15
20 0.2 3 8
21 0.2 4 5
22 0.2 5 3
23 0.2 6 1
24 0.2 7 0
25 0.4 0 47
26 0.4 1 29
27 0.4 2 5
28 0.4 3 0
29 0.4 4 0
30 0.4 5 0
31 0.4 6 0
32 0.4 7 0
You could try
library(ggplot2)
ggplot(x, aes(x = Day, y = Survivors, colour = as.factor(Intensity))) +
geom_point(pch = 15) +
geom_line() +
theme_bw()
Lattice also does this easily.
library(lattice)
xyplot(Survivors ~ Day,
data=x,
groups=Intensity,
grid=TRUE,
type=c('p','l'),
auto.key=list(title='Intensity', space='right')
)
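If you would rather avoid extra packages, the same grouped plot can be sketched with base graphics. This is a minimal sketch that rebuilds the data frame from the question as x, the name both answers above assume. The names groups and cols are hypothetical.

```r
# Data from the question
x <- data.frame(
  Intensity = rep(c(0, 0.1, 0.2, 0.4), each = 8),
  Day       = rep(0:7, times = 4),
  Survivors = c(37, 29,  9, 1, 1, 0, 0, 0,
                40, 28,  8, 0, 0, 0, 0, 0,
                40, 26, 15, 8, 5, 3, 1, 0,
                47, 29,  5, 0, 0, 0, 0, 0)
)

# One data frame per intensity level
groups <- split(x, x$Intensity)
cols <- seq_along(groups)

# Empty plot spanning the full data range, then one line per intensity
plot(range(x$Day), range(x$Survivors), type = "n",
     xlab = "Day", ylab = "Survivors")
for (i in seq_along(groups)) {
  lines(groups[[i]]$Day, groups[[i]]$Survivors,
        type = "b", pch = 15, col = cols[i])
}
legend("topright", legend = names(groups), col = cols,
       lty = 1, pch = 15, title = "Intensity")
```

Note that lines connecting the points only show the observed counts. Judging whether the intensities differ significantly would need a model or a test, which none of these plots provide.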
If you are using Excel, just insert a scatter plot with smooth lines and markers, then add 4 data series corresponding to your intensities.
Then add whatever legend and descriptions you want.