Resampling cross-sectional time series data in R

I'm dealing with cross-sectional time series data (many DIFFERENT individuals over time). At the individual level, each person has a quantity of a good demanded. The data is unbalanced: the number of individuals differs from period to period. For each time period, I've aggregated the individual data into a single time series. Example data structure below:
Cross-Section Time Series
Time | Person | Quantity
----------------------
11/18| Bob | 2
11/18| Sally | 1
11/18| Jake | 5
12/18| Jim | 2
12/18| Roger | 8
Time Series
Time | Total Q
-------------
11/18| 8
12/18| 10
What I want to do for each period is resample (with replacement) the individual quantities, aggregate across the individuals, repeat this X times, and then get a mean and standard error from the bootstrap.
The end result should look like
Time | Total Q | Boot Strap Total Mean
-------------------------------------
11/18| 8 | 8.5
12/18| 10 | 10.05
Here is some code to create example sample data:
library(tidyverse)
set.seed(1234)
Cross_Time = data.frame(x = 1:50) %>% # x was undefined in the original; a 50-row placeholder column is assumed
mutate(Period = sample(1:10, 50, replace=T),
Q=rnorm(50,10,1)) %>%
arrange(Period)
Timeseries = Cross_Time %>%
group_by(Period) %>%
summarize(Total=sum(Q))
I know this is possible in R, but I'm at a loss as to how to code it or what the right questions I need to ask are. All help is appreciated!

We may do the following:
X <- 1000
Cross_Time %>% group_by(Period) %>%
do({QS <- colSums(replicate(sample(.$Q, replace = TRUE), n = X))
data.frame(Period = .$Period[1], `Total Q` = sum(.$Q), Mean = mean(QS), `Standard Error` = sd(QS))})
# A tibble: 10 x 4
# Groups: Period [10]
# Period Total.Q Mean Standard.Error
# <int> <dbl> <dbl> <dbl>
# 1 1 28.8 28.8 0.284
# 2 2 35.9 35.8 0.874
# 3 3 109. 109. 3.90
# 4 4 48.9 48.9 2.16
# 5 5 20.2 20.2 0.658
# 6 6 59.0 58.8 3.57
# 7 7 88.7 88.6 2.64
# 8 8 22.7 22.7 1.04
# 9 9 47.7 47.7 2.46
# 10 10 27.9 27.9 0.575
I think the code is quite self-explanatory. In every group we resample its values with replacement X times with replicate, sum each resample, and compute the two desired statistics from those sums. It's also straightforward to add any others!
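The same idea can also be sketched in base R. This is only an illustration under stated assumptions: boot_total and cross_time are hypothetical names, and the sample data is rebuilt here without the undefined x column. One detail worth knowing is that sample(q) treats a single number q >= 1 as "sample from 1:q", so a period with only one individual would be resampled incorrectly; resampling row indices with sample.int avoids that.

```r
# Bootstrap the total of one period: resample individuals with replacement,
# sum the resample, and repeat n_boot times.
boot_total <- function(q, n_boot = 1000) {
  totals <- replicate(n_boot, {
    # Resample indices rather than values, so a length-one q is handled correctly
    idx <- sample.int(length(q), replace = TRUE)
    sum(q[idx])
  })
  c(Total = sum(q), Mean = mean(totals), SE = sd(totals))
}

# Hypothetical sample data in the same shape as the question's Cross_Time
set.seed(1234)
cross_time <- data.frame(Period = sample(1:10, 50, replace = TRUE),
                         Q = rnorm(50, 10, 1))

# Apply the bootstrap to each period and collect one row per period
res <- t(sapply(split(cross_time$Q, cross_time$Period), boot_total))
res <- data.frame(Period = as.integer(rownames(res)), res, row.names = NULL)
res
```

The standard deviation of the bootstrap totals is used as the bootstrap standard error, as in the answer above.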

Related

Writing a function to summarize the results of dunn.test::dunn.test

In R, I perform Dunn's test. The function I use has no option to group the input variables by their statistically significant differences. However, this is what I am genuinely interested in, so I tried to write my own function. Unfortunately, I am not able to wrap my head around it. Perhaps someone can help.
I use the airquality dataset that comes with R as an example. The result that I need could look somewhat like this:
> library (tidyverse)
> ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))
# A tibble: 5 x 2
Month Mean
<int> <dbl>
1 5 23.6
2 6 29.4
3 7 59.1
4 8 60.0
5 9 31.4
When I run the dunn.test, I get the following:
> dunn.test::dunn.test (airquality$Ozone, airquality$Month, method = "bh", altp = T)
Kruskal-Wallis rank sum test
data: x and group
Kruskal-Wallis chi-squared = 29.2666, df = 4, p-value = 0
Comparison of x by group
(Benjamini-Hochberg)
Col Mean-|
Row Mean | 5 6 7 8
---------+--------------------------------------------
6 | -0.925158
| 0.4436
|
7 | -4.419470 -2.244208
| 0.0001* 0.0496*
|
8 | -4.132813 -2.038635 0.286657
| 0.0002* 0.0691 0.8604
|
9 | -1.321202 0.002538 3.217199 2.922827
| 0.2663 0.9980 0.0043* 0.0087*
alpha = 0.05
Reject Ho if p <= alpha
From this result, I deduce that May differs from July and August, June differs from July (but not from August) and so on. So I'd like to append significantly differing groups to my results table:
# A tibble: 5 x 3
Month Mean Group
<int> <dbl> <chr>
1 5 23.6 a
2 6 29.4 ac
3 7 59.1 b
4 8 60.0 bc
5 9 31.4 a
While I did this by hand, I suppose it must be possible to automate this process. However, I don't find a good starting point. I created a dataframe containing all comparisons:
> ozone_differences <- dunn.test::dunn.test (airquality$Ozone, airquality$Month, method = "bh", altp = T)
> ozone_differences <- data.frame ("P" = ozone_differences$altP.adjusted, "Compare" = ozone_differences$comparisons)
P Compare
1 4.436043e-01 5 - 6
2 9.894296e-05 5 - 7
3 4.963804e-02 6 - 7
4 1.791748e-04 5 - 8
5 6.914403e-02 6 - 8
6 8.604164e-01 7 - 8
7 2.663342e-01 5 - 9
8 9.979745e-01 6 - 9
9 4.314957e-03 7 - 9
10 8.671708e-03 8 - 9
I thought that a function iterating through this data frame and using a selection variable to choose the right letter from letters might work. However, I cannot even think of a starting point, because a changing number of rows has to be considered at the same time...
Perhaps someone has a good idea?
Perhaps you could look into the cldList() function from the rcompanion library: you can pass the res table from the output of dunnTest() to it and create a table that specifies the compact letter display per group.
Following the advice of @TylerRuddenfort, the following code will work. The first cld is created with rcompanion::cldList, and the second directly uses multcompView::multcompLetters. Note that to use multcompLetters, the spaces have to be removed from the names of the comparisons.
Here, I have used FSA::dunnTest for the Dunn test (1964).
In general, I recommend ordering groups by e.g. median or mean before running e.g. dunnTest if you plan on using a cld, so that the cld comes out in a sensible order.
library (tidyverse)
ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))
library(FSA)
Result = dunnTest(airquality$Ozone, airquality$Month, method = "bh")$res
### Use cldList()
library(rcompanion)
cldList(P.adj ~ Comparison, data=Result)
### Use multcompView
library(multcompView)
X = Result$P.adj <= 0.05
names(X) = gsub(" ", "", Result$Comparison)
multcompLetters(X)
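To see what the letters encode, the comparison table from the question can be turned into a symmetric "differs from" matrix in base R. This is only a sketch: the P values are copied from the ozone_differences data frame shown in the question, the 0.05 threshold follows the dunn.test output, and the names pairs, sig and differs_from are hypothetical. It does not compute the letters themselves; cldList and multcompLetters do that.

```r
# Adjusted p-values and comparison labels as printed in the question
ozone_differences <- data.frame(
  P = c(4.436043e-01, 9.894296e-05, 4.963804e-02, 1.791748e-04, 6.914403e-02,
        8.604164e-01, 2.663342e-01, 9.979745e-01, 4.314957e-03, 8.671708e-03),
  Compare = c("5 - 6", "5 - 7", "6 - 7", "5 - 8", "6 - 8",
              "7 - 8", "5 - 9", "6 - 9", "7 - 9", "8 - 9"),
  stringsAsFactors = FALSE
)

# Split "5 - 6" into its two group labels (one row per comparison)
pairs <- do.call(rbind, strsplit(ozone_differences$Compare, " - ", fixed = TRUE))
groups <- sort(unique(as.vector(pairs)))

# Symmetric logical matrix: TRUE where two groups differ significantly
sig <- matrix(FALSE, length(groups), length(groups),
              dimnames = list(groups, groups))
is_sig <- ozone_differences$P <= 0.05
sig[pairs[is_sig, , drop = FALSE]] <- TRUE
sig[pairs[is_sig, 2:1, drop = FALSE]] <- TRUE

# For each group, list the groups it differs from
differs_from <- apply(sig, 1, function(r) paste(names(r)[r], collapse = ","))
differs_from
```

Two groups may share a letter in a compact letter display only if the corresponding cell of sig is FALSE, which is the rule the hand-made table in the question follows.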

How to prevent R from rounding in frequency function?

I used the freq function of frequency package to get frequency percent on my dataset$MoriskyAdherence, then R gives me percent values with rounding. I need more decimal places.
MoriskyAdherence=dataset$MoriskyAdherence
freq(MoriskyAdherence)
The result is:
The Percent values are 35.0, 41.3, 23.8. The sum of them is 100.1.
The exact amounts should be 35.0, 41.25, 23.75.
What should I do?
I tried sprintf, as.data.frame, formatC, and some other functions to deal with it, but none of them worked.
The function freq returns a character data frame, and has no option to adjust the number of decimal places. However, it is easy to recreate the table however you want it. For example, I have written this function, which will give you the same result but with two decimal places instead of one:
freq2 <- function(data_frame)
{
df <- frequency::freq(data_frame)
lapply(df, function(x)
{
n <- suppressWarnings(as.numeric(x$Freq))
sum_all <- as.numeric(x$Freq[nrow(x)])
raw_percent <- suppressWarnings(100 * n / sum_all)
t_row <- grep("Total", x[,2])[1]
valid_percent <- suppressWarnings(100*n / as.numeric(x$Freq[t_row]))
x$Percent <- format(round(raw_percent, 2), nsmall = 2)
x$'Valid Percent' <- format(round(valid_percent, 2), nsmall = 2)
x$'Cumulative Percent' <- format(round(cumsum(valid_percent), 2), nsmall = 2)
x$'Cumulative Percent'[t_row:nrow(x)] <- ""
x$'Valid Percent'[(t_row + 1):nrow(x)] <- ""
return(x)
})
}
Now instead of
freq(MoriskyAdherence)
#> Building tables
#> |===========================================================================| 100%
#> $`x:`
#> x label Freq Percent Valid Percent Cumulative Percent
#> 2 Valid High Adherence 56 35.0 35.0 35.0
#> 3 Low Adherence 66 41.3 41.3 76.3
#> 4 Medium Adherence 38 23.8 23.8 100.0
#> 41 Total 160 100.0 100.0
#> 1 Missing <blank> 0 0.0
#> 5 <NA> 0 0.0
#> 7 Total 160 100.0
you can do
freq2(MoriskyAdherence)
#> Building tables
#> |===========================================================================| 100%
#> $`x:`
#> x label Freq Percent Valid Percent Cumulative Percent
#> 2 Valid High Adherence 56 35.00 35.00 35.00
#> 3 Low Adherence 66 41.25 41.25 76.25
#> 4 Medium Adherence 38 23.75 23.75 100.00
#> 41 Total 160 100.00 100.00
#> 1 Missing <blank> 0 0.00
#> 5 <NA> 0 0.00
#> 7 Total 160 100.00
which is exactly what you were looking for.
Two (potential) solutions:
Solution #1:
Make changes inside the function freq. This can be done by retrieving the function's code with the command freq (without round brackets), or by retrieving the code, with comments, from https://rdrr.io/github/wilcoxa/frequencies/src/R/freq.R.
My hunch is that to obtain more decimals, changes must be implemented at this point in the code:
# create a list of frequencies
message("Building tables")
all_freqs <- lapply_pb(names(x), function(y, x1 = as.data.frame(x), maxrow1 = maxrow, trim1 = trim){
makefreqs(x1, y, maxrow1, trim1)
})
Solution #2:
If you're only after percentages with more decimals, you can use aggregate. Let's suppose your data has this structure: a dataframe with two variables, one numeric, one a factor by which you want to group:
set.seed(123)
Var1 <- sample(LETTERS[1:4], 10, replace = T)
Var2 <- sample(10:100, 10, replace = T)
df <- data.frame(Var1, Var2)
Var1 Var2
1 B 97
2 D 51
3 B 71
4 D 62
5 D 19
6 A 91
7 C 32
8 D 13
9 C 39
10 B 96
Then to obtain your percentages by factor, you would use aggregate thus:
aggregate(Var2 ~ Var1, data = df, function(x) sum(x)/sum(df$Var2)*100)
Var1 Var2
1 A 15.93695
2 B 46.23468
3 C 12.43433
4 D 25.39405
You can control the number of decimals by using round:
aggregate(Var2 ~ Var1, data = df, function(x) round(sum(x)/sum(df$Var2)*100,3))
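If what is needed is the percentage of observations per category (as in the original freq output) rather than shares of a numeric total, a base R sketch with table and prop.table may be enough. The adherence vector below is hypothetical: it is rebuilt from the counts shown in the freq output (56, 66 and 38), since the asker's dataset is not available.

```r
# Hypothetical stand-in for dataset$MoriskyAdherence, rebuilt from the printed counts
adherence <- rep(c("High Adherence", "Low Adherence", "Medium Adherence"),
                 times = c(56, 66, 38))

counts <- table(adherence)                   # frequency per category
pct <- round(100 * prop.table(counts), 2)    # percentages, two decimal places

freq_table <- data.frame(label = names(counts),
                         Freq = as.vector(counts),
                         Percent = as.vector(pct))
freq_table
```

Changing the second argument of round controls how many decimal places are kept.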

Applying a label depending on which condition is met using R

I would like to use a simple R function where the contents of a specified data frame column are read row by row, then depending on the value, a string is applied to that row in a new column.
So far, I've tried to use a combination of loops and generating individual columns which were combined later. However, I cannot seem to get the syntax right.
The input looks like this:
head(data,10)
# A tibble: 10 x 5
Patient T1Score T2Score T3Score T4Score
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 96.4 75 80.4 82.1
2 5 100 85.7 53.6 55.4
3 6 82.1 85.7 NA NA
4 7 82.1 85.7 60.7 28.6
5 8 100 76.8 64.3 57.7
6 10 46.4 57.1 NA 75
7 11 71.4 NA NA NA
8 12 98.2 92.9 85.7 82.1
9 13 78.6 89.3 37.5 42.9
10 14 89.3 100 64.3 87.5
and the function I have written looks like this:
minMax<-function(x){
#make an empty data frame for the output to go
output<-data.frame()
#making sure the rest of the commands only look at what I want them to look at in the input object
a<-x[2:5]
#here I'm gathering the columns necessary to perform the calculation
minValue<-apply(a,1,min,na.rm=T)
maxValue<-apply(a,1,max,na.rm=T)
tempdf<-as.data.frame((cbind(minValue,maxValue)))
Difference<-tempdf$maxValue-tempdf$minValue
referenceValue<-ave(Difference)
referenceValue<-referenceValue[1]
#quick aside to make the first two thirds of the output file
output<-as.data.frame((cbind(x[1],Difference)))
#Now I need to define the class based on the referenceValue, and here is where I run into trouble.
apply(output, 1, FUN =
for (i in Difference) {
ifelse(i>referenceValue,"HIGH","LOW")
}
)
output
}
I also tried...
if (i>referenceValue) {
apply(output,1,print("HIGH"))
}else(print("LOW")) {}
}
)
output
}
Regardless, both end up giving me the error message,
c("'for (i in Difference) {' is not a function, character or symbol", "' ifelse(i > referenceValue, \"HIGH\", \"LOW\")' is not a function, character or symbol", "'}' is not a function, character or symbol")
The expected output should look like:
Patient Difference Toxicity
3 21.430000 LOW
5 46.430000 HIGH
6 3.570000 LOW
7 57.140000 HIGH
8 42.310000 HIGH
10 28.570000 HIGH
11 0.000000 LOW
12 16.070000 LOW
13 51.790000 HIGH
14 35.710000 HIGH
Is there a better way for me to organize the last loop?
Since you seem to be using tibbles anyway, here's a much shorter version using dplyr and tidyr:
> d %>%
gather(key = tscore,value = score,T1Score:T4Score) %>%
group_by(Patient) %>%
summarise(Difference = max(score,na.rm = TRUE) - min(score,na.rm = TRUE)) %>%
ungroup() %>%
mutate(AvgDifference = mean(Difference),
Toxicity = if_else(Difference > mean(Difference),"HIGH","LOW"))
# A tibble: 10 x 4
Patient Difference AvgDifference Toxicity
<int> <dbl> <dbl> <chr>
1 3 21.4 30.3 LOW
2 5 46.4 30.3 HIGH
3 6 3.6 30.3 LOW
4 7 57.1 30.3 HIGH
5 8 42.3 30.3 HIGH
6 10 28.6 30.3 LOW
7 11 0 30.3 LOW
8 12 16.1 30.3 LOW
9 13 51.8 30.3 HIGH
10 14 35.7 30.3 HIGH
Your expected output appears to be based on a slightly different average difference (patient 10 is labelled HIGH there but LOW here), so this output differs slightly.
And a much simpler base R version if you prefer:
d$min <- apply(d[,2:5],1,min,na.rm = TRUE)
d$max <- apply(d[,2:5],1,max,na.rm = TRUE)
d$diff <- d$max - d$min
d$avg_diff <- mean(d$diff)
d$toxicity <- with(d,ifelse(diff > avg_diff,"HIGH","LOW"))
A few notes on your existing code:
as.data.frame((cbind(minValue,maxValue))) is not an advisable way to create data frames. This is more awkward than simply doing data.frame(minValue = minValue,maxValue = maxValue) and risks unintended coercion from cbind.
ave is for computing summaries over groups; just use mean if you have a single vector
The FUN argument in apply expects a function, not an arbitrary expression, which is what you're trying to pass at the end. The general syntax for an "anonymous" function in that context would be apply(...,FUN = function(arg) { do some stuff and return exactly the thing you want}).
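Putting these notes together, the original minMax function could be rewritten roughly as follows. This is a sketch rather than a definitive version: min_max, scores and score_cols are hypothetical names, and the data is typed in from the head(data, 10) output in the question.

```r
# The ten rows printed in the question
scores <- data.frame(
  Patient = c(3, 5, 6, 7, 8, 10, 11, 12, 13, 14),
  T1Score = c(96.4, 100, 82.1, 82.1, 100, 46.4, 71.4, 98.2, 78.6, 89.3),
  T2Score = c(75, 85.7, 85.7, 85.7, 76.8, 57.1, NA, 92.9, 89.3, 100),
  T3Score = c(80.4, 53.6, NA, 60.7, 64.3, NA, NA, 85.7, 37.5, 64.3),
  T4Score = c(82.1, 55.4, NA, 28.6, 57.7, 75, NA, 82.1, 42.9, 87.5)
)

min_max <- function(x, score_cols = 2:5) {
  # FUN is an anonymous function that receives one row at a time
  difference <- apply(x[score_cols], 1, function(row) {
    diff(range(row, na.rm = TRUE))  # max minus min, ignoring NAs
  })
  difference <- unname(difference)
  # ifelse is vectorised, so no explicit loop over rows is needed
  data.frame(Patient = x[[1]],
             Difference = difference,
             Toxicity = ifelse(difference > mean(difference), "HIGH", "LOW"),
             stringsAsFactors = FALSE)
}

result <- min_max(scores)
result
```

The key change from the question's version is that the row-wise work is wrapped in function(row) { ... }, and the labelling is done once on the whole Difference vector.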

Selecting subsets of a grouped variable

The data I used can be found here (the "sq.txt" file).
Below is a summary of the data:
> summary(sq)
behaviour date squirrel time
resting :983 2017-06-28: 197 22995 : 127 09:30:00: 17
travelling :649 2017-06-26: 160 22758 : 116 08:00:00: 16
feeding :344 2017-06-30: 139 23080 : 108 16:25:00: 15
OOS :330 2017-07-18: 110 23089 : 100 08:11:00: 13
vocalization:246 2017-06-27: 99 23079 : 97 08:31:00: 13
social : 53 2017-06-29: 96 22865 : 95 15:24:00: 13
(Other) : 67 (Other) :1871 (Other):2029 (Other) :2585
Each squirrel has a number of observations that correspond to a number of different behaviours (behaviour).
For example, squirrel 22995 was observed 127 times. These 127 observations correspond to different behaviour categories: 7 feeding, 1 territorial, 55 resting, etc. I then need to divide the number of each behaviour by the total number of observations (i.e. feeding = 7/127, territorial = 1/127, resting = 55/127, etc.) to get proportions of time spent doing each behaviour.
I already have grouped my observations by squirrel using the dplyr package.
Is there a way, using dplyr, for me to calculate proportions for one column (behaviour) based on the total observations for a column (squirrel) where the values have been grouped?
Something like this?
sq %>%
count(squirrel, behaviour) %>%
group_by(squirrel) %>%
mutate(p = n/sum(n)) %>%
# add this line to see result for squirrel 22995
filter(squirrel == 22995)
# A tibble: 8 x 4
# Groups: squirrel [1]
squirrel behaviour n p
<int> <chr> <int> <dbl>
1 22995 feeding 7 0.0551
2 22995 nest_building 4 0.0315
3 22995 OOS 9 0.0709
4 22995 resting 55 0.433
5 22995 social 6 0.0472
6 22995 territorial 1 0.00787
7 22995 travelling 32 0.252
8 22995 vocalization 13 0.102
EDIT:
If you want to include zero counts for squirrels where a behaviour was not observed, one way is to use tidyr::complete(). That generates NA by default, which you may want to replace with zero.
library(dplyr)
library(tidyr)
sq %>%
count(squirrel, behaviour) %>%
complete(squirrel, behaviour) %>%
group_by(squirrel) %>%
mutate(p = n/sum(n, na.rm = TRUE)) %>%
replace_na(list(n = 0, p = 0)) %>%
filter(squirrel == 22995)
# A tibble: 11 x 4
# Groups: squirrel [1]
squirrel behaviour n p
<int> <chr> <dbl> <dbl>
1 22995 dead 0 0
2 22995 feeding 7.00 0.0551
3 22995 grooming 0 0
4 22995 nest_building 4.00 0.0315
5 22995 OOS 9.00 0.0709
6 22995 resting 55.0 0.433
7 22995 social 6.00 0.0472
8 22995 territorial 1.00 0.00787
9 22995 travelling 32.0 0.252
10 22995 vigilant 0 0
11 22995 vocalization 13.0 0.102
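For comparison, a base R sketch of the same calculation uses a two-way table, which includes the zero counts automatically. The sq_toy data frame below is a small hypothetical stand-in, since the real sq.txt file is not reproduced here.

```r
# Hypothetical toy data in the same shape as sq (one row per observation)
sq_toy <- data.frame(
  squirrel = c(1, 1, 1, 1, 2, 2),
  behaviour = c("feeding", "resting", "resting", "resting", "feeding", "social"),
  stringsAsFactors = FALSE
)

# Counts of each behaviour per squirrel; unobserved combinations are 0
counts <- table(sq_toy$squirrel, sq_toy$behaviour)

# Divide each row by its total to get proportions per squirrel
props <- prop.table(counts, margin = 1)

# Long format, one row per squirrel/behaviour pair
props_long <- as.data.frame(props, responseName = "p")
props_long
```

Here margin = 1 means each row (squirrel) sums to 1, which plays the role of group_by(squirrel) followed by n/sum(n).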

R max of multiple categories [duplicate]

I've got data somewhat like this (of course with many more rows):
Age Work Zone SomeNumber
26 1 2.61
32 4 8.42
41 2 9.71
45 2 4.14
64 3 6.04
56 1 5.28
37 4 7.93
I want to get the maximum SomeNumber for each zone at or below each age. SomeNumber increases with age, so I expect that the highest SomeNumber in Zone 2 by an under-32-y/o is by a guy who's age 31, but it could in fact be a guy age 27.
To do this I've written a nested for loop:
for(i in zonelist){
temp = data[data$zone==i,]
for(j in 1:max(data$age)){
temp.lessequal=c(temp.lessequal,max((temp[temp$Age<=j,])$SomeNumber))
}
#plot temp.lessequal or save it at this point
}
which of course is tremendously slow. How can I do this faster? I've looked at the order function to sort by two columns at once, but that doesn't let me take the max of each group.
Data:
df1 <- read.table(text='Age Work_Zone SomeNumber
26 1 2.61
32 4 8.42
41 2 9.71
45 2 4.14
64 3 6.04
56 1 5.28
37 4 7.93',
header = TRUE)
Code:
df2 <- with( df1, df1[ Age <= 32, ] ) # extract rows with Age <= 32
# get maximum of someNumber by aggregating with work_zone and then merging with df1 to combine the age column
merge(aggregate(SomeNumber ~ Work_Zone, data = df2, max), df2)
# Work_Zone SomeNumber Age
# 1 1 2.61 26
# 2 4 8.42 32
It seems OP is looking for max value based on <= condition on a particular column (Age).
sqldf comes in very handy in such cases, because the SQL makes the logic explicit. One solution could be:
# Data
df <- read.table(text = "Age Work_Zone SomeNumber
26 1 2.61
32 4 8.42
41 2 9.71
45 2 4.14
64 3 6.04
56 1 5.28
37 4 7.93", header = T, stringsAsFactors = F)
library(sqldf)
df3 <- sqldf("select df1.Work_Zone, df1.Age, max(df2.SomeNumber) from df df1
inner join df df2 on df1.Work_Zone = df2.Work_Zone
WHERE df2.Age <= df1.Age
GROUP BY df1.Work_Zone, df1.Age")
# Result:
# Work_Zone Age max(df2.SomeNumber)
# 1 1 26 2.61
# 2 1 56 5.28
# 3 2 41 9.71
# 4 2 45 9.71
# 5 3 64 6.04
# 6 4 32 8.42
# 7 4 37 8.42
Using the data.table library, you can select the rows at or below the required age and then, grouping by Work_Zone, output max(SomeNumber) together with the Age at which it occurs. Returning the bare Age column instead would give one row per age within each zone.
library(data.table)
setDT(df1)[Age<=32,.(SomeNumber=max(SomeNumber),Age=Age[which.max(SomeNumber)]),by=Work_Zone]
Work_Zone SomeNumber Age
1: 1 2.61 26
2: 4 8.42 32
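If the maximum at or below every age in the data is wanted in a single pass, rather than for one cutoff such as 32, a running maximum within each zone is another option. This is a base R sketch using the df1 sample data from above; the column name MaxSoFar is hypothetical, and only ages that actually occur in the data are reported.

```r
df1 <- read.table(text = "Age Work_Zone SomeNumber
26 1 2.61
32 4 8.42
41 2 9.71
45 2 4.14
64 3 6.04
56 1 5.28
37 4 7.93", header = TRUE)

# Sort by zone, then by age, so the running maximum follows increasing age
df1 <- df1[order(df1$Work_Zone, df1$Age), ]

# cummax within each zone: the largest SomeNumber seen at or below this age
df1$MaxSoFar <- ave(df1$SomeNumber, df1$Work_Zone, FUN = cummax)
df1
```

On this sample the MaxSoFar column agrees with the max(df2.SomeNumber) column of the sqldf result, and it avoids both the nested loop and the self-join.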
