Creating quantiles in R

I have a data set of individuals with their socioeconomic scores, ranging from -6.3 to 3.5. Now I want to assign each individual to a quantile group based on their socioeconomic score.
I have a dataset named Healthdata with two columns: Healthdata$SSE and Healthdata$ID.
Eventually, I would like to get a data frame matched by their SSE quantiles.
How can I do this in R?

Here's one approach:
# an example data set
set.seed(1)
Healthdata <- data.frame(SSE = rnorm(8), ID = gl(2, 4))
transform(Healthdata, quint = ave(SSE, ID, FUN = function(x) {
  quintiles <- quantile(x, seq(0, 1, .2))
  cuts <- cut(x, quintiles, include.lowest = TRUE)
  quintVal <- quintiles[match(cuts, levels(cuts)) + 1]
  return(quintVal)
}))
# SSE ID quint
# 1 -0.6264538 1 -0.4644344
# 2 0.1836433 1 0.7482983
# 3 -0.8356286 1 -0.7101237
# 4 1.5952808 1 1.5952808
# 5 0.3295078 2 0.3610920
# 6 -0.8204684 2 -0.1304827
# 7 0.4874291 2 0.5877873
# 8 0.7383247 2 0.7383247
A simple illustration of how it works:
values <- 1:10
# [1] 1 2 3 4 5 6 7 8 9 10
quintiles <- quantile(values, seq(0, 1, .2))
# 0% 20% 40% 60% 80% 100%
# 1.0 2.8 4.6 6.4 8.2 10.0
cuts <- cut(values, quintiles, include.lowest = TRUE)
# [1] [1,2.8] [1,2.8] (2.8,4.6] (2.8,4.6]
# [5] (4.6,6.4] (4.6,6.4] (6.4,8.2] (6.4,8.2]
# [9] (8.2,10] (8.2,10]
# 5 Levels: [1,2.8] (2.8,4.6] ... (8.2,10]
quintVal <- quintiles[match(cuts, levels(cuts)) + 1]
# 20% 20% 40% 40% 60% 60% 80% 80% 100% 100%
# 2.8 2.8 4.6 4.6 6.4 6.4 8.2 8.2 10.0 10.0
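If you would rather have the quintile group number (1-5) than the upper bound of the quintile, a small variation of the same ave() pattern works (a sketch; quintGrp is just an illustrative column name):
transform(Healthdata, quintGrp = ave(SSE, ID, FUN = function(x) {
  # cut() with labels = FALSE returns the integer code of the interval, i.e. the group number
  cut(x, quantile(x, seq(0, 1, .2)), labels = FALSE, include.lowest = TRUE)
}))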

So let's start with a sample data set based on your description:
set.seed(315)
Healthdata <- data.frame(SSE = sample(-6.3:3.5, 21, replace = TRUE), ID = 1:21)
Which gives something like this:
> Healthdata[1:15,]
SSE ID
1 -0.3 1
2 -6.3 2
3 -1.3 3
4 -3.3 4
5 -5.3 5
6 -4.3 6
7 -4.3 7
8 0.7 8
9 -4.3 9
10 -4.3 10
11 -3.3 11
12 0.7 12
13 -2.3 13
14 -3.3 14
15 0.7 15
I understand that you want a new variable which identifies the quantile group of the individual's socioeconomic status. I would do something like this:
transform(Healthdata, Q = cut(Healthdata$SSE,
                              breaks = quantile(Healthdata$SSE),
                              labels = c(1, 2, 3, 4),
                              include.lowest = TRUE))
To return:
SSE ID Q
1 -1.3 1 2
2 -6.3 2 1
3 -4.3 3 1
4 0.7 4 3
5 1.7 5 3
6 1.7 6 3
7 -5.3 7 1
8 1.7 8 3
9 2.7 9 4
10 -3.3 10 2
11 -1.3 11 2
12 -3.3 12 2
13 1.7 13 3
14 0.7 14 3
15 -4.3 15 1
If you want to see the upper and lower bounds of the quantile ranges, omit the labels = c(1, 2, 3, 4) argument to return this instead:
SSE ID Q
1 -1.3 1 (-4.3,-1.3]
2 -6.3 2 [-6.3,-4.3]
3 -4.3 3 [-6.3,-4.3]
4 0.7 4 (-1.3,1.7]
5 1.7 5 (-1.3,1.7]
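Since you mention eventually wanting a data frame matched by SSE quantiles, the same cut() result can be passed to split() to get one data frame per quantile group (a sketch, assuming Healthdata as above; Qgroups is just an illustrative name):
Qcut <- cut(Healthdata$SSE, breaks = quantile(Healthdata$SSE), include.lowest = TRUE)
Qgroups <- split(Healthdata, Qcut)  # list with one data frame per quartile group
Qgroups[[1]]                        # individuals in the lowest quartile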

Related

Efficiency in calculating Euclidean distance between points and reference points in R

I have a list of points a1, a2, b1 and b2 and a set of reference points. I would like to figure out the distance between the four points and all of the reference points. I can do this without any problem using the following example:
library(dplyr)
x <- rep(1:5, 5)
y <- rep(1:5, each = 5)
a1.x <- c(4.5)
a1.y <- c(6)
a2.x <- c(0.8)
a2.y <- c(3.2)
b1.x <- c(2.5)
b1.y <- c(5)
b2.x <- c(3.8)
b2.y <- c(1.5)
time <- 1
time1 <- as.data.frame(cbind(time,x,y,a1.x,a1.y,a2.x,a2.y,b1.x,b1.y,b2.x,b2.y))
x <- rep(1:5, 5)
y <- rep(1:5, each = 5)
a1.x <- c(4)
a1.y <- c(5)
a2.x <- c(1.5)
a2.y <- c(3.9)
b1.x <- c(1.4)
b1.y <- c(4.6)
b2.x <- c(6)
b2.y <- c(5.2)
time <- 2
time2 <- as.data.frame(cbind(time,x,y,a1.x,a1.y,a2.x,a2.y,b1.x,b1.y,b2.x,b2.y))
df <- rbind(time1,time2)
df <- df %>%
  mutate(dista1 = sqrt((x - a1.x)^2 + (y - a1.y)^2)) %>%
  mutate(dista2 = sqrt((x - a2.x)^2 + (y - a2.y)^2)) %>%
  mutate(distb1 = sqrt((x - b1.x)^2 + (y - b1.y)^2)) %>%
  mutate(distb2 = sqrt((x - b2.x)^2 + (y - b2.y)^2))
This works fine and is fairly fast. However, on a larger data set it becomes slow and tedious to type out all the columns, or I have to rely on for loops. What is the most efficient way to perform the above?
Edit: additionally, I have a time factor as well.
Using the rdist function in package fields is simpler:
library(fields) #use install.packages("fields") first
pts <- cbind(x= c(a1.x, a2.x, b1.x, b2.x), y=c(a1.y, a2.y, b1.y, b2.y))
ref <- cbind(x, y)
distances <- rdist(ref, pts)
colnames(distances) <- c("dista1", "dista2", "distb1", "distb2")
head(distances)
# dista1 dista2 distb1 distb2
# [1,] 6.103278 2.209072 4.272002 2.8442925
# [2,] 5.590170 2.505993 4.031129 1.8681542
# [3,] 5.220153 3.111270 4.031129 0.9433981
# [4,] 5.024938 3.883298 4.272002 0.5385165
# [5,] 5.024938 4.741308 4.716991 1.3000000
# [6,] 5.315073 1.216553 3.354102 2.8442925
If you want to match df in your example:
df <- cbind(ref, a1.x, a1.y, a2.x, a2.y, b1.x, b1.y, b2.x, b2.y, distances)
head(df)
# x y a1.x a1.y a2.x a2.y b1.x b1.y b2.x b2.y dista1 dista2 distb1 distb2
# [1,] 1 1 4.5 6 0.8 3.2 2.5 5 3.8 1.5 6.103278 2.209072 4.272002 2.8442925
# [2,] 2 1 4.5 6 0.8 3.2 2.5 5 3.8 1.5 5.590170 2.505993 4.031129 1.8681542
# [3,] 3 1 4.5 6 0.8 3.2 2.5 5 3.8 1.5 5.220153 3.111270 4.031129 0.9433981
# [4,] 4 1 4.5 6 0.8 3.2 2.5 5 3.8 1.5 5.024938 3.883298 4.272002 0.5385165
# [5,] 5 1 4.5 6 0.8 3.2 2.5 5 3.8 1.5 5.024938 4.741308 4.716991 1.3000000
# [6,] 1 2 4.5 6 0.8 3.2 2.5 5 3.8 1.5 5.315073 1.216553 3.354102 2.8442925
If there are multiple times, this approach can be extended. First extract the points and reference matrices from your time1 and time2 objects:
time1.pts <- matrix(unlist(time1[1, 4:11]), 4, 2, byrow = TRUE)
time2.pts <- matrix(unlist(time2[1, 4:11]), 4, 2, byrow = TRUE)
ref1 <- as.matrix(time1[, 2:3])  # all reference points for time 1
ref2 <- as.matrix(time2[, 2:3])  # all reference points for time 2
ref <- list(ref1 = ref1, ref2 = ref2)
pts <- list(time1.pts = time1.pts, time2.pts = time2.pts)
Matrices are faster to process than data frames, so this should be quicker than working with data frames directly. Now the analysis:
ntimes <- length(ref)  # number of time points
results <- lapply(seq(ntimes), function(i) rdist(ref[[i]], pts[[i]]))
distances <- do.call(rbind, results)
colnames(distances) <- c("dista1", "dista2", "distb1", "distb2")
The distances matrix contains all of the distances. Now we just combine them with your df:
df <- data.frame(df, distances)
options(digits=4)
head(df, 5); cat(". . . . .\n"); tail(df, 5)
# time x y a1.x a1.y a2.x a2.y b1.x b1.y b2.x b2.y dista1 dista2 distb1 distb2
# 1 1 1 1 4.5 6 0.8 3.2 2.5 5 3.8 1.5 6.103 2.209 4.272 2.8443
# 2 1 2 1 4.5 6 0.8 3.2 2.5 5 3.8 1.5 5.590 2.506 4.031 1.8682
# 3 1 3 1 4.5 6 0.8 3.2 2.5 5 3.8 1.5 5.220 3.111 4.031 0.9434
# 4 1 4 1 4.5 6 0.8 3.2 2.5 5 3.8 1.5 5.025 3.883 4.272 0.5385
# 5 1 5 1 4.5 6 0.8 3.2 2.5 5 3.8 1.5 5.025 4.741 4.717 1.3000
# . . . . .
# time x y a1.x a1.y a2.x a2.y b1.x b1.y b2.x b2.y dista1 dista2 distb1 distb2
# 46 2 1 5 4 5 1.5 3.9 1.4 4.6 6 5.2 3 1.208 0.5657 5.004
# 47 2 2 5 4 5 1.5 3.9 1.4 4.6 6 5.2 2 1.208 0.7211 4.005
# 48 2 3 5 4 5 1.5 3.9 1.4 4.6 6 5.2 1 1.860 1.6492 3.007
# 49 2 4 5 4 5 1.5 3.9 1.4 4.6 6 5.2 0 2.731 2.6306 2.010
# 50 2 5 5 4 5 1.5 3.9 1.4 4.6 6 5.2 1 3.669 3.6222 1.020
I would suggest base R, using:
referencePointList <- list(a1 = c(4.5, 6), a2 = c(0.8, 3.2), b1 = c(2.5, 5), b2 = c(3.8, 1.5))
distanceDfToReferencePoints <- function(x, y, referencePointList) {
  distDf <- setNames(data.frame(do.call("cbind", lapply(referencePointList, function(rp) {
    sqrt((x - rp[1])^2 + (y - rp[2])^2)
  }))), paste0("dist_", names(referencePointList)))
  cbind(data.frame(x = x, y = y), distDf)
}
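As a usage sketch, calling it with the x and y grid from the question returns a data frame with the coordinates plus the four distance columns (the values match the rdist() output above):
head(distanceDfToReferencePoints(x, y, referencePointList), 3)
# returns columns x, y, dist_a1, dist_a2, dist_b1, dist_b2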
Let's put your method into a function, say:
f0 <- function() {
  df <- as.data.frame(cbind(x, y, a1.x, a1.y, a2.x, a2.y, b1.x, b1.y, b2.x, b2.y))
  df %>%
    mutate(dista1 = sqrt((x - a1.x)^2 + (y - a1.y)^2)) %>%
    mutate(dista2 = sqrt((x - a2.x)^2 + (y - a2.y)^2)) %>%
    mutate(distb1 = sqrt((x - b1.x)^2 + (y - b1.y)^2)) %>%
    mutate(distb2 = sqrt((x - b2.x)^2 + (y - b2.y)^2))
}
and do a benchmark:
microbenchmark::microbenchmark(distanceDfToReferencePoints(x,y,referencePointList),f0())
giving, on my machine:
Unit: microseconds
                                                  expr    min      lq     mean  median     uq     max neval
 distanceDfToReferencePoints(x, y, referencePointList)  302.3  340.90  543.882  411.15  479.8 11869.5   100
                                                  f0() 6074.3 6557.85 7497.018 6994.55 7591.5 28291.7   100
Hence the base R solution seems to be much faster. Maybe that already helps you.
A data.table approach:
library(data.table)
setDT(df)
# get names of the reference points
refpoints <- unique(sub("(^.*)\\.[xy]", "\\1", names(df[, -c(1, 2)])))
# melt to long format
DT <- melt(df, id.vars = c("x", "y"), measure.vars = patterns(x2 = ".*\\.x", y2 = ".*\\.y"))
# set the points' names
setattr(DT$variable, "levels", refpoints)
# calculate distance
DT[, distance := sqrt((x - x2)^2 + (y - y2)^2)]
# cast to wide again
dcast(DT, x + y ~ paste0("dist_", variable), value.var = "distance")
# x y dist_a1 dist_a2 dist_b1 dist_b2
# 1: 1 1 6.103278 2.2090722 4.272002 2.8442925
# 2: 1 2 5.315073 1.2165525 3.354102 2.8442925
# 3: 1 3 4.609772 0.2828427 2.500000 3.1764760
# 4: 1 4 4.031129 0.8246211 1.802776 3.7536649
# 5: 1 5 3.640055 1.8110770 1.500000 4.4821870
# 6: 2 1 5.590170 2.5059928 4.031129 1.8681542
# 7: 2 2 4.716991 1.6970563 3.041381 1.8681542
# 8: 2 3 3.905125 1.2165525 2.061553 2.3430749
# 9: 2 4 3.201562 1.4422205 1.118034 3.0805844
#10: 2 5 2.692582 2.1633308 0.500000 3.9357337
#11: 3 1 5.220153 3.1112698 4.031129 0.9433981
#12: 3 2 4.272002 2.5059928 3.041381 0.9433981
#13: 3 3 3.354102 2.2090722 2.061553 1.7000000
#14: 3 4 2.500000 2.3409400 1.118034 2.6248809
#15: 3 5 1.802776 2.8425341 0.500000 3.5902646
#16: 4 1 5.024938 3.8832976 4.272002 0.5385165
#17: 4 2 4.031129 3.4176015 3.354102 0.5385165
#18: 4 3 3.041381 3.2062439 2.500000 1.5132746
#19: 4 4 2.061553 3.2984845 1.802776 2.5079872
#20: 4 5 1.118034 3.6715120 1.500000 3.5057096
#21: 5 1 5.024938 4.7413078 4.716991 1.3000000
#22: 5 2 4.031129 4.3680659 3.905125 1.3000000
#23: 5 3 3.041381 4.2047592 3.201562 1.9209373
#24: 5 4 2.061553 4.2755117 2.692582 2.7730849
#25: 5 5 1.118034 4.5694639 2.500000 3.7000000
# x y dist_a1 dist_a2 dist_b1 dist_b2
My suggestion is to use Rfast::dista() and see the time for yourself.
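A minimal sketch of that suggestion, assuming the Rfast package is installed and pts/ref are built as in the fields answer above (check ?dista for which dimension of the result corresponds to the reference points before naming columns):
library(Rfast)
pts <- cbind(x = c(a1.x, a2.x, b1.x, b2.x), y = c(a1.y, a2.y, b1.y, b2.y))
ref <- cbind(x, y)
d <- dista(ref, pts)  # pairwise Euclidean distances between the rows of ref and the rows of pts
dim(d)                # confirm the orientation of the result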
Using the tidyverse:
df %>%
  rename(x_new = x, y_new = y) %>%
  pivot_longer(3:ncol(df), names_pattern = "(\\w+)\\.(\\w+)",
               names_to = c('var', '.value')) %>%
  mutate(value = sqrt((x - x_new)^2 + (y - y_new)^2)) %>%
  pivot_wider(c(x_new, y_new), var, values_from = value)
x_new y_new a1 a2 b1 b2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 6.10 2.21 4.27 2.84
2 2 1 5.59 2.51 4.03 1.87
3 3 1 5.22 3.11 4.03 0.943
4 4 1 5.02 3.88 4.27 0.539
5 5 1 5.02 4.74 4.72 1.3
6 1 2 5.32 1.22 3.35 2.84
7 2 2 4.72 1.70 3.04 1.87
8 3 2 4.27 2.51 3.04 0.943
9 4 2 4.03 3.42 3.35 0.539
10 5 2 4.03 4.37 3.91 1.3
# ... with 15 more rows
If you can extract the columns by hand, then you could make this even faster:
pts <- cbind(x= c(a1.x, a2.x, b1.x, b2.x), y=c(a1.y, a2.y, b1.y, b2.y))
ref <- cbind(x, y)
sqrt(laGP::distance(ref, pts))
[,1] [,2] [,3] [,4]
[1,] 6.103278 2.2090722 4.272002 2.8442925
[2,] 5.590170 2.5059928 4.031129 1.8681542
[3,] 5.220153 3.1112698 4.031129 0.9433981
[4,] 5.024938 3.8832976 4.272002 0.5385165
[5,] 5.024938 4.7413078 4.716991 1.3000000
[6,] 5.315073 1.2165525 3.354102 2.8442925
[7,] 4.716991 1.6970563 3.041381 1.8681542
[8,] 4.272002 2.5059928 3.041381 0.9433981
or even:
apply(pts, 1, function(p) sqrt(rowSums(sweep(ref, 2, p)^2)))
[,1] [,2] [,3] [,4]
[1,] 6.103278 2.2090722 4.272002 2.8442925
[2,] 5.590170 2.5059928 4.031129 1.8681542
[3,] 5.220153 3.1112698 4.031129 0.9433981
[4,] 5.024938 3.8832976 4.272002 0.5385165
[5,] 5.024938 4.7413078 4.716991 1.3000000
[6,] 5.315073 1.2165525 3.354102 2.8442925
[7,] 4.716991 1.6970563 3.041381 1.8681542
[8,] 4.272002 2.5059928 3.041381 0.9433981
If you are using R >= 4.1:
df |>
  reshape(matrix(3:ncol(df), 2), dir = 'long') |>
  transform(new_pt = sqrt((x - a1.x)^2 + (y - a1.y)^2)) |>
  reshape(v.names = c('a1.x', 'a1.y', 'new_pt'), dir = 'wide')
If not, change the |> pipe to %>%

How to get diurnal cycle from hourly multiseries XTS in R?

I have an hourly time series of rain at four locations for one year, as shown below. I want to compute the sum or mean for each of the 24 hours of the day over the entire year, separately for each of the four locations. This is a common analysis in meteorology and is termed diurnal variation.
This will give me an idea of which hours are preferred for rainfall at these locations. Is there a simple way to do this with the xts/zoo packages?
head(rg_hr_xts)
rg1 rg2 rg3 rg4
2018-06-01 00:59:17 1.0 0.0 0 0
2018-06-01 01:59:17 0.2 0.0 0 0
2018-06-01 02:59:17 0.0 0.2 0 0
2018-06-01 03:59:17 0.0 1.6 0 0
2018-06-01 04:59:17 0.0 3.4 0 0
2018-06-01 05:59:17 0.0 0.8 0 0
Note: I have used .indexhour(rg_hr_xts),
which gives the hour of each index as follows:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
20 21 22 23 0 1 2 3 4 5 6 7
I want to sum all observations that share the same hour index (0, 1, 2, ...) and get a data frame with hour indices 0-23 and the sum of rainfall for each hour.
I expect the following structure for the output data frame:
hour rg1 rg2 rg3 rg4
0 0.3 0.7 1.2 0.4
1 1.3 1.5 1.3 1.3
2 1.5 1.7 1.9 1.8
3 2.0 2.5 2.6 2.9
4 2.1 2.9 3.5 3.6
You can use aggregate() to calculate sums by hour of the day.
library(xts)
# Some reproducible example data
n <- 1e4
set.seed(21)
x <- .xts(cbind(rg1 = runif(n), rg2 = runif(n)), 1:n * 3600 + runif(n)*100)
# Aggregate by hour of the day
y <- aggregate(x, by = .indexhour(x), FUN = sum)
The result of aggregate() will be a zoo object with an index of the hour of the day. Then you can use data.frame() to convert that to a data.frame.
data.frame(hour = index(y), y)
# hour rg1 rg2
# 0 0 214.3876 211.5131
# 1 1 215.5521 205.4340
# 2 2 206.1494 211.7510
# 3 3 223.9533 209.5391
# 4 4 202.8989 211.6612
# 5 5 198.6387 203.7809
# 6 6 218.7807 218.9829
# 7 7 205.2797 214.6127
# 8 8 207.2061 219.2323
# 9 9 217.2509 208.9815
# 10 10 218.4591 202.3216
# 11 11 205.6799 219.2482
# 12 12 206.8984 209.7392
# 13 13 209.4091 205.2837
# 14 14 212.0559 213.2387
# 15 15 211.8372 204.3384
# 16 16 206.5818 221.5508
# 17 17 212.1076 214.9638
# 18 18 219.3799 205.4536
# 19 19 202.6254 202.2210
# 20 20 208.5686 208.5411
# 21 21 213.2116 218.9530
# 22 22 210.6371 207.5539
# 23 23 197.8964 203.9069
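Applied to the rainfall series from the question (a sketch, assuming rg_hr_xts is the hourly xts object shown above), the same call gives hourly totals for all four gauges; swap sum for mean if the average diurnal cycle is wanted:
y <- aggregate(rg_hr_xts, by = .indexhour(rg_hr_xts), FUN = function(v) sum(v, na.rm = TRUE))
diurnal <- data.frame(hour = index(y), coredata(y))  # columns hour, rg1, rg2, rg3, rg4
head(diurnal)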

R: How to create a new column for 90th quantile based off previous rows in a data frame

data.frame(c = c(1,7,11,4,5,5))
c
1 1
2 7
3 11
4 4
5 5
6 5
Desired data frame:
c c.90th
1 1 NA
2 7 1
3 11 6.4
4 4 10.2
5 5 9.8
6 5 9.4
For the first row, I want it to look at the previous rows (none) and get the 90th quantile: NA.
For the second row, I want it to look at the previous rows (1) and get the 90th quantile: 1.
For the third row, I want it to look at the previous rows (1, 7) and get the 90th quantile: 6.4.
etc.
A solution using data.table that also works by groups:
library(data.table)
dt <- data.table(c = c(1, 7, 11, 4, 5, 5),
                 group = c(1, 1, 1, 2, 2, 2))
cumquantile <- function(y, prob) {
  sapply(seq_along(y), function(x) quantile(y[0:(x - 1)], prob))
}
dt[, c90 := cumquantile(c, 0.9)]
dt[, c90_by_group := cumquantile(c, 0.9), by = group]
> dt
c group c90 c90_by_group
1: 1 1 NA NA
2: 7 1 1.0 1.0
3: 11 1 6.4 6.4
4: 4 2 10.2 NA
5: 5 2 9.8 4.0
6: 5 2 9.4 4.9
Try:
dff <- data.frame(c = c(1, 7, 11, 4, 5, 5))
dff$c.90th <- sapply(1:nrow(dff), function(x) quantile(dff$c[0:(x - 1)], 0.9, names = FALSE))
Output:
c c.90th
1 NA
7 1.0
11 6.4
4 10.2
5 9.8
5 9.4
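For completeness, the same cumulative quantile can be computed per group in base R with ave(), reusing the cumquantile() helper from the data.table answer above (a sketch):
d <- data.frame(c = c(1, 7, 11, 4, 5, 5), group = c(1, 1, 1, 2, 2, 2))
d$c90_by_group <- ave(d$c, d$group, FUN = function(v) cumquantile(v, 0.9))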

Calculate mean and SD for every timepoint with different IDs and different doses

I have the following dummy data set:
ID TIME DDAY DV
1 0 50 6.6
1 12 50 6.1
1 24 50 5.6
1 48 50 7.6
2 0 10 6.6
2 12 10 6.6
2 24 10 6.6
2 48 10 6.6
3 0 50 3.6
3 12 50 6.8
3 24 50 9.6
3 48 50 7.1
4 0 10 8.6
4 12 10 6.4
4 24 10 4.6
4 48 10 5.6
I want to create a summary table with the mean and standard deviation of DV, as shown below:
N TIME DDAY MEAN-DV SD-DV
2 0 50 6.5 1.1
2 12 50 6.1 0.8
2 24 50 4.5 2.0
2 48 50 7.5 1.0
2 0 10 6.9 1.5
2 12 10 8.5 1.3
2 24 10 6.1 0.9
2 48 10 4.5 1.8
How do I do this in R?
You can use:
1) dplyr:
library(dplyr)
dat %>%
  group_by(TIME, DDAY) %>%
  summarise(MEAN_DV = mean(DV), SD_DV = sd(DV), N = length(DV))
# TIME DDAY MEAN_DV SD_DV N
# 1 48 10 6.10 0.7071068 2
# 2 24 10 5.60 1.4142136 2
# 3 12 10 6.50 0.1414214 2
# 4 0 10 7.60 1.4142136 2
# 5 48 50 7.35 0.3535534 2
# 6 24 50 7.60 2.8284271 2
# 7 12 50 6.45 0.4949747 2
# 8 0 50 5.10 2.1213203 2
where dat is the name of your data frame.
2) data.table:
library(data.table)
DT <- as.data.table(dat)
DT[ , list(MEAN_DV = mean(DV), SD_DV = sd(DV), N = .N), by = c("TIME", "DDAY")]
# TIME DDAY MEAN_DV SD_DV N
# 1: 0 50 5.10 2.1213203 2
# 2: 12 50 6.45 0.4949747 2
# 3: 24 50 7.60 2.8284271 2
# 4: 48 50 7.35 0.3535534 2
# 5: 0 10 7.60 1.4142136 2
# 6: 12 10 6.50 0.1414214 2
# 7: 24 10 5.60 1.4142136 2
# 8: 48 10 6.10 0.7071068 2
3) plyr:
require(plyr)
# THIS COLLAPSES ON TIME
ddply(df, .(TIME), summarize, MEAN_DV=mean(DV), SD_DV=sd(DV), N=length(DV))
# THIS COLLAPSES ON TIME AND DDAY
ddply(df, .(TIME, DDAY), summarize, MEAN_DV=mean(DV), SD_DV=sd(DV), N=length(DV))
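A base R alternative (a sketch, assuming your data frame is called dat as above): aggregate() with a function that returns a named vector gives the same summary, and do.call(data.frame, ...) flattens the resulting matrix column:
agg <- aggregate(DV ~ TIME + DDAY, data = dat,
                 FUN = function(v) c(MEAN_DV = mean(v), SD_DV = sd(v), N = length(v)))
do.call(data.frame, agg)  # columns TIME, DDAY, DV.MEAN_DV, DV.SD_DV, DV.N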

Calculate mean value of sets of 4 sub locations from multiple location from a larger matrix

I am doing a data analysis on wall thickness measurements of circular tubes. I have the following data frame:
> head(datIn, 12)
Component Tube.number Measurement.location Sub.location Interval Unit Start
1 In 1 1 A 121 U6100 7/25/2000
2 In 1 1 A 122 U6100 5/24/2001
3 In 1 1 A 222 U6200 1/19/2001
4 In 1 1 A 321 U6300 6/1/2000
5 In 1 1 A 223 U6200 5/22/2002
6 In 1 1 A 323 U6300 6/18/2002
7 In 1 1 A 21 U6200 10/1/1997
8 In 1 1 A 221 U6200 6/3/2000
9 In 1 1 A 322 U6300 12/11/2000
10 In 1 1 B 122 U6100 5/24/2001
11 In 1 1 B 322 U6300 12/11/2000
12 In 1 1 B 21 U6200 10/1/1997
End Measurement Material.loss Material.loss.interval Run.hours.interval
1 5/11/2001 7.6 0.4 NA 6653.10
2 2/7/2004 6.1 1.9 1.5 15484.82
3 3/7/2002 8.5 -0.5 -0.5 8826.50
4 12/1/2000 7.8 0.2 0.2 4170.15
5 4/30/2003 7.4 0.6 1.1 6879.73
6 9/30/2003 7.9 0.1 -0.1 9711.56
7 4/20/2000 7.6 0.4 NA 15159.94
8 1/5/2001 8.0 0.0 -0.4 4728.88
9 5/30/2002 7.8 0.2 0.0 9829.75
10 2/7/2004 5.9 2.1 0.9 15484.82
11 5/30/2002 7.0 1.0 0.7 9829.75
12 4/20/2000 8.2 -0.2 NA 15159.94
Run.hours.prior.to.interval Total.run.hours.end.interval
1 0.00 6653.10
2 6653.10 22137.92
3 19888.82 28715.32
4 0.00 4170.15
5 28715.32 35595.05
6 30039.58 39751.14
7 0.00 15159.94
8 15159.94 19888.82
9 20209.83 30039.58
10 6653.10 22137.92
11 20209.83 30039.58
12 0.00 15159.94
Straight.or.In.Out.Middle.bend.1 Straight.or.In.Out.Middle.bend.2
1 Out Out
2 Out Out
3 Out Out
4 Out Out
5 Out Out
6 Out Out
7 Out Out
8 Out Out
9 Out Out
10 Middle Out
11 Middle Out
12 Middle Out
The Sub.location column has values A, B, C, D. They are measurements at the same measurement location but at different positions around the cross section, i.e. at 0, 90, 180 and 270 degrees around the tube.
I would like to make a plot that makes clear which measurement location has the biggest wall thickness decrease over time.
To do this, I first want to calculate the mean wall thickness of a tube at each measurement location at each unique interval (the running hours are coupled to the interval).
I tried doing this with the following code:
par(mfrow=c(1,2))
myfunction <- function(mydata1) { return(mean(mydata1,na.rm=TRUE))}
AVmeasloc <- tapply(datIn$Measurement,list(as.factor(datIn$Sub.location),as.factor(datIn$Measurement.location), myfunction))
AVmeasloc
This doesn't seem to work. I would like to keep the tapply function, as I also calculated the standard deviation for some values with it and it lets me make plots easily.
Does anyone have any advice on how to tackle this problem?
From the code you've posted, there is a parenthesis error around list(); it should read:
AVmeasloc <- tapply(datIn$Measurement, list(as.factor(datIn$Sub.location), as.factor(datIn$Measurement.location)), myfunction)
This can now be cleaned up to:
AVmeasloc <- tapply(datIn$Measurement, datIn[, c(3, 4)], mean, na.rm = TRUE)
Here's a working example:
test.data <- data.frame(cat1 = c("A", "A", "A", "B", "B", "B", "C", "C", "D"),
                        cat2 = c(1, 1, 2, 2, 1, NA, 2, 1, 1),
                        val = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9))
tapply(test.data$val, test.data[, c(1, 2)], mean, na.rm = TRUE)
cat2
cat1 1 2
A 0.15 0.3
B 0.50 0.4
C 0.80 0.7
D 0.90 NA
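If a long data frame is more convenient than tapply()'s matrix for plotting the decrease over time, aggregate() gives the corresponding group means (a sketch, assuming datIn as shown in the question; here the A-D sub-locations are averaged for each tube, location and interval):
AVlong <- aggregate(Measurement ~ Tube.number + Measurement.location + Interval,
                    data = datIn, FUN = mean, na.rm = TRUE)
head(AVlong)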
