I am trying to write code that will compute autocorrelation for multiple subsets. For example, I have health data for multiple countries over time, and I want to get each country's autocorrelation for each variable. Any help would be great!
Here are some things I have tried, unsuccessfully:
require(plyr)
POP_ACF=acf(PhD_data_list_view$POP, lag.max=NULL, type=c("correlation"),
plot=TRUE, na.action=na.pass, demean=TRUE)
dlply(PhD_data_list_view, .(Country), function(x) POP_ACF %+% x)
POP_ACF=function(PhD_data_list_view$POP) c(acf(PhD_data_list_view$POP, plot=TRUE)$acf)
acf is a function that takes a vector and returns a list-like object. That makes it a natural fit for the purrr package, which maps functions over lists, but it can also be done in base R.
I'll use the beaver1 dataset from the datasets package since you didn't provide yours. I'll use different days of observations as the analogue to your different countries, and temperature for your POP variable.
Base R:
split turns the vector beaver1$temp into a list of vectors along the second argument, beaver1$day.
Then mapply runs the function acf on each element of that list.
Since we're using mapply instead of lapply, we can also provide another list of arguments, here the titles for each plot, main = unique(beaver1$day).
The last argument, SIMPLIFY = F, tells it to return the list as-is rather than attempt to coerce it into a vector or matrix (which is mapply's default behavior).
par(mfrow = c(1,2))
mapply(acf,
split(beaver1$temp, beaver1$day),
main = unique(beaver1$day),
SIMPLIFY = F)
# $`346`
#
# Autocorrelations of series ‘dots[[1L]][[1L]]’, by lag
#
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
# 1.000 0.838 0.698 0.593 0.468 0.355 0.265 0.167 0.113 0.069 0.028 0.037 0.087 0.108 0.145 0.177 0.151 0.125 0.123 0.106
# $`347`
#
# Autocorrelations of series ‘dots[[1L]][[2L]]’, by lag
#
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13
# 1.000 0.546 0.335 0.130 0.080 0.024 -0.025 -0.103 -0.090 -0.032 0.168 0.036 -0.089 -0.306
purrr and the tidy way:
This way is a bit more flexible depending on what you want to do with the output. We can use purrr::map as a direct drop-in for mapply:
library(purrr)
beaver1 %>%
split(.$day) %>%
map(~acf(.$temp, main = unique(.$day)))
Which returns the exact same output. But we can also go fully tidy and return the data from acf as a dataframe so that we can explore it further with ggplot2.
The first map is returning a list of outputs, each of which is a list containing, among other things, variables lag, acf, and n.used.
The map_dfr is running the function data.frame on each of those outputs, assigning each of those variables to a new column, row-binding the results, and storing the list names (the days) in a column via .id = "day".
We also make a column with the approximate 95% confidence bound, qnorm(0.975)/sqrt(n.used); for day 346, with its 91 observations, that is 1.96/sqrt(91) ≈ 0.205, matching the ci column below. Refer to: How is the confidence interval calculated for the ACF function?
Then we can use ggplot to make any kind of plot we want, and we still have the data for any other analysis you want to do.
library(ggplot2)
beaver_acf <-
beaver1 %>%
split(.$day) %>%
map(~acf(.$temp, plot = F)) %>%
map_dfr(
~data.frame(lag = .$lag,
acf = .$acf,
ci = qnorm(0.975)/sqrt(.$n.used)
), .id = "day")
head(beaver_acf)
# day lag acf ci
# 1 346 0 1.0000000 0.2054601
# 2 346 1 0.8378889 0.2054601
# 3 346 2 0.6983476 0.2054601
# 4 346 3 0.5928198 0.2054601
# 5 346 4 0.4680912 0.2054601
# 6 346 5 0.3554939 0.2054601
ggplot(beaver_acf, aes(lag, acf)) +
geom_segment(aes(xend = lag, yend = 0)) +
geom_hline(aes(yintercept = ci), linetype = "dashed", color = "blue") +
geom_hline(aes(yintercept = -ci), linetype = "dashed", color = "blue") +
facet_wrap(~day)
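To get all the way to the original goal, each country's ACF for each variable, here is a hedged sketch of one extension: pivot the variables into long format, split on country and variable together, and collect everything into one data frame. The column names (Country, Year, POP, ...) are assumptions based on the question, and each country's rows are assumed to be sorted by time.
library(dplyr)
library(tidyr)
library(purrr)
acf_by_country <-
  PhD_data_list_view %>%
  pivot_longer(-c(Country, Year), names_to = "variable") %>%  # POP and the rest
  group_split(Country, variable) %>%
  map_dfr(~ {
    a <- acf(.$value, plot = FALSE, na.action = na.pass)
    data.frame(Country = .$Country[1],
               variable = .$variable[1],
               lag = a$lag[, 1, 1],
               acf = a$acf[, 1, 1])
  })
From there the same ggplot code works, faceting with facet_grid(Country ~ variable).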
I have a dataframe (gdata) with x (as "r") and y (as "km") coordinates of a function.
When I plot it like this:
plot(x = gdata$r, y = gdata$km, type = "l")
I get the graph of the function:
Now I want to calculate the area under the curve from x = 0 to x = 0.6. When I look for appropriate packages I only find things about calculating the AUC of a ROC curve. But is there a way to just calculate the AUC of a normal function?
The area under the curve (AUC) for a given set of data points can be obtained using numeric integration:
Let data be your data frame containing x and y values. You can get the area under the curve from the lower bound x0 = 0 to the upper bound x1 = 0.6 by integrating a function that linearly interpolates your data.
This is a numeric approximation and not exact, because we do not have an infinite number of data points: for y = sqrt(x) we get 0.3033 instead of the true value of 0.3098. With 200 rows in data the result improves to auc = 0.3096.
library(tidyverse)
data <-
tibble(
x = seq(0, 2, length.out = 20)
) %>%
mutate(y = sqrt(x))
data
#> # A tibble: 20 × 2
#> x y
#> <dbl> <dbl>
#> 1 0 0
#> 2 0.105 0.324
#> 3 0.211 0.459
#> 4 0.316 0.562
#> 5 0.421 0.649
#> 6 0.526 0.725
#> 7 0.632 0.795
#> 8 0.737 0.858
#> 9 0.842 0.918
#> 10 0.947 0.973
#> 11 1.05 1.03
#> 12 1.16 1.08
#> 13 1.26 1.12
#> 14 1.37 1.17
#> 15 1.47 1.21
#> 16 1.58 1.26
#> 17 1.68 1.30
#> 18 1.79 1.34
#> 19 1.89 1.38
#> 20 2 1.41
qplot(x, y, data = data)
integrate(approxfun(data$x, data$y), 0, 0.6)
#> 0.3033307 with absolute error < 8.8e-05
Created on 2021-10-03 by the reprex package (v2.0.1)
The absolute error returned by integrate is correct only if the real-world function between every two data points really is the linear interpolation we assumed.
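To verify the convergence claim above, a quick sketch: integrate the true function for the (effectively) exact value, then rebuild the data with 200 rows and watch the linear approximation close the gap.
integrate(sqrt, 0, 0.6)  # exact value (2/3) * 0.6^1.5 ≈ 0.3098
data200 <- tibble(x = seq(0, 2, length.out = 200)) %>%
  mutate(y = sqrt(x))
integrate(approxfun(data200$x, data200$y), 0, 0.6)  # about 0.3096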
I used the package MESS to solve the problem:
# Toy example
library(MESS)
x <- seq(0,3, by=0.1)
y <- x^2
auc(x,y, from = 0.1, to = 2, type = "spline")
The analytical result is:
7999/3000
which is approximately 2.6663.
The R script above gives 2.66632 using the spline approximation and 2.6695 using the linear approximation.
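For completeness, a small sketch reproducing both numbers; type = "linear" is the other option MESS::auc accepts:
library(MESS)
x <- seq(0, 3, by = 0.1)
y <- x^2
auc(x, y, from = 0.1, to = 2, type = "spline")  # 2.66632, very close to 7999/3000
auc(x, y, from = 0.1, to = 2, type = "linear")  # 2.6695
The linear version overestimates here because x^2 is convex, so every chord used by the trapezoid rule lies above the curve.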
The acf function in the stats package returns a complex object. For example
x = rnorm(1000, mean=100, sd=10)
acf(x)
returns a plot. If I do
acf_x = acf(x)
acf_x
it returns
Autocorrelations of series ‘x’, by lag
0 1 2 3 4 5 6 7 8 9 10 11
1.000 0.000 -0.031 -0.002 -0.052 0.017 -0.014 0.030 0.011 0.002 -0.044 0.000
12 13 14 15 16 17 18 19 20 21 22 23
0.055 -0.007 0.049 0.025 -0.027 -0.048 0.033 0.027 0.043 -0.007 -0.010 0.025
24 25 26 27 28 29 30
-0.083 0.045 -0.074 0.016 0.041 -0.046 0.010
If I look at class(acf_x) it returns 'acf'.
How do I extract the autocorrelation versus lag into a data frame?
More generally, when presented with a function that returns a complex object, how do I extract the data from it, i.e. is there a general pattern for this type of function?
If you look at the help page of acf via ?acf, you'll see under "Value" what the output will look like.
In this case, the acf object is a list with several elements.
If you e.g. want the lags, you can simply access this via:
my_lags <- acf_x$lag
Deschen's answer gives the general response to "how do I discover the elements in a complex model object": str(). One can also use the names() function on S3 objects; the result lists the names you can use to extract elements from the underlying list with the $ or [[ forms of the extract operator.
set.seed(95014)
x = rnorm(1000, mean=100, sd=10)
acf_x <- acf(x)
names(acf_x)
> names(acf_x)
[1] "acf" "type" "n.used" "lag" "series" "snames"
Since the acf and lag elements are stored as arrays, we'll need to extract just the first dimension to obtain a simple vector. We can accomplish this by chaining the [ form of the extract operator onto the object that is generated by the [[ extract on the model object.
head(acf_x[["acf"]][,1,1]) # second extract returns a simple vector
> head(acf_x[["acf"]][,1,1])
[1] 1.000000000 -0.034863150 0.037745441 -0.020464290 -0.004974406
[6] 0.016770363
In this case R performs the extraction left to right - first acf_x[["acf"]] is evaluated, and then [,1,1] is applied to the result.
For the concrete part of the question, "how do I create a data frame with this data?", one can build a data frame from the output of the acf() function as follows.
set.seed(95014)
x = rnorm(1000, mean=100, sd=10)
acf_x <- acf(x)
results <- data.frame(acf_value = acf_x$acf[,1,1],
acf_lag = acf_x$lag[,1,1])
head(results)
...and the output:
> head(results)
acf_value acf_lag
1 1.000000000 0
2 -0.034863150 1
3 0.037745441 2
4 -0.020464290 3
5 -0.004974406 4
6 0.016770363 5
Try
str(acf_x)
or
print.default(acf_x)
This will give you an idea of how the object looks internally and how to access the elements in it.
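More generally, this combination of base R calls (shown here on acf_x) answers the "is there a general pattern" part of the question for almost any complex return value:
str(acf_x, max.level = 1)  # compact overview, one line per element
names(acf_x)               # the names usable with $ or [[
class(acf_x)               # "acf", which is why print() and plot() behave specially
unclass(acf_x)             # print the underlying list, bypassing the acf print method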
I have run a beta regression in R and would like to assess the residual diagnostics. I have used the plot function and obtained plots, however, the potential outliers are not labelled. How can I add the corresponding labels to the outliers?
breg.full <- betareg(Percentage ~ Total_testscore + Campus + Programme +
Gender + SE_track + Hours_Math_SE, data = starters, # [-c(53, 24, 35), ]
link = "logit") # , , link.phi = NULL, type = "ML"
summary(breg.full)
par(mfrow = c(2,3))
plot(breg.full, which = 1:6)
EDIT:
I want to have something like this (without the actual pink box, but with the ID numbers).
The author provides a link for this code (http://www.de.ufpe.br/~cribari/betareg_example.zip), however it is no longer working...
Explanation
I couldn't see your data anywhere here, but I will use the iris dataset to demonstrate how this can be achieved. I'll stick to only two examples because this takes some time to code, but once you see two examples I think it will become fairly quick to recognize what is going on. I will supply a reference at the end that will be helpful too.
Fitting Model Data
First we can fit a regression using the iris data, then turn the data into a tibble with model data using both fortify and as_tibble. I have added an index column for one of the plots later.
#### Load Library ####
library(tidyverse)
#### Fit Model ####
fit <- lm(Petal.Width ~ Petal.Length,
data = iris)
#### Turn Model into Data Frame ####
fit.data <- fortify(fit) %>%
as_tibble() %>%
mutate(.index = 1:150)
fit.data
Which gives you this:
# A tibble: 150 × 9
Petal…¹ Petal…² .hat .sigma .cooksd .fitted .resid .stdr…³ .index
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 0.2 1.4 0.0186 0.207 8.18e-5 0.219 -0.0190 -0.0928 1
2 0.2 1.4 0.0186 0.207 8.18e-5 0.219 -0.0190 -0.0928 2
3 0.2 1.3 0.0197 0.207 1.23e-4 0.177 0.0226 0.111 3
4 0.2 1.5 0.0176 0.207 7.86e-4 0.261 -0.0606 -0.296 4
5 0.2 1.4 0.0186 0.207 8.18e-5 0.219 -0.0190 -0.0928 5
6 0.4 1.7 0.0158 0.207 6.06e-4 0.344 0.0563 0.275 6
7 0.3 1.4 0.0186 0.207 1.49e-3 0.219 0.0810 0.396 7
8 0.2 1.5 0.0176 0.207 7.86e-4 0.261 -0.0606 -0.296 8
9 0.2 1.4 0.0186 0.207 8.18e-5 0.219 -0.0190 -0.0928 9
10 0.1 1.5 0.0176 0.207 5.53e-3 0.261 -0.161 -0.785 10
# … with 140 more rows, and abbreviated variable names ¹Petal.Width,
# ²Petal.Length, ³.stdresid
# ℹ Use `print(n = ...)` to see more rows
You can see here it gives you a lot of valuable information: residuals, fitted values, Cook's distance, etc. This makes it easy to plot them in ggplot2.
Plotting
The first example will be a Cook's distance plot. This takes the index of each data point and plots a column representing its distance using the geom_col function. The key ingredient here is the geom_text portion. Simply subset the data and nudge the labels a little so they don't totally overlap the points, and you can essentially label whatever you want:
#### Cooks Distance ####
fit.data %>%
ggplot(aes(x=.index,
y=.cooksd,
label=.index))+
geom_col()+
labs(x="Index",
y="Cook's Distance",
title = "Cook's Distance")+
geom_text(data=subset(fit.data,
.cooksd > .05),
nudge_y = .003)
Giving you this plot:
Another example using a similar method below plots fitted values versus their respective residuals, with an arbitrary label threshold applied here as well:
#### Fitted vs Residuals ####
ggplot(fit.data,
aes(.fitted,
round(.resid,2),
label=round(.resid,2))) +
geom_point() +
geom_hline(yintercept = 0) +
geom_smooth(se = FALSE)+
labs(x="Fitted",
y="Residual",
title = "Fitted vs Residuals")+
geom_text(data=subset(fit.data,
.resid > .5 | .resid < -.5),
nudge_x = .09)
A slew of other examples of how to do this can be seen at this link. The customization will be up to you, but it should give you a fair idea of how to hand-tailor some of these base R plots you are getting.
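The same recipe should carry over to the betareg fit from the question. fortify() may not know about betareg objects, but fitted() and residuals() do have betareg methods, so the diagnostic frame can be built by hand; a minimal sketch, where the cutoff abs(.resid) > 2 is an arbitrary choice:
library(ggplot2)
# assuming breg.full fitted as in the question
diag.data <- data.frame(.fitted = fitted(breg.full),
                        .resid = residuals(breg.full))
diag.data$.index <- seq_len(nrow(diag.data))
ggplot(diag.data, aes(.fitted, .resid, label = .index)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_text(data = subset(diag.data, abs(.resid) > 2),  # arbitrary outlier cutoff
            nudge_x = .02)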
To give a small working example, suppose I have the following data frame:
library(dplyr)
country <- rep(c("A", "B", "C"), each = 6)
year <- rep(c(1,2,3), each = 2, times = 3)
categ <- rep(c(0,1), times = 9)
pop <- rep(c(NA, runif(n=8)), each=2)
money <- runif(18)+100
df <- data.frame(Country = country,
Year = year,
Category = categ,
Population = pop,
Money = money)
Now the data I'm actually working with has many more repetitions: for every country, year, and category there are many repeated rows corresponding to various sources of money, and I want to sum these all together. However, for now it's enough to have one row for each country, year, and category, and to trivially apply the sum() function to each row. This still exhibits the behavior I'm trying to get rid of.
Notice that for country A in year 1, the population listed is NA. Therefore when I run
aggregate(Money ~ Country+Year+Category+Population, df, sum)
the resulting data frame has dropped the rows corresponding to country A and year 1. I'm only using the ...+Population... bit of code because I want the output data frame to retain this column.
I'm wondering how to make the aggregate() function not drop things that have NAs in the columns by which the grouping occurs--it'd be nice if, for instance, the NAs themselves could be treated as values to group by.
My attempts: I tried turning the Population column into factors, but that didn't change the behavior. I read about the na.action argument, but neither na.action=NULL nor na.action=na.skip changed the behavior. I thought about turning all the NAs into 0s; I can't think of what that would hurt, but it feels like a hack that might bite me later on, and in any case I'm not sure how I would do it. When I wrote a function with is.na() in it, the if (is.na(x)) test wasn't applied in a vectorized way, and R warned that only the first element of the vector would be used. Using lapply() on the column and coercing the result back into a vector also sounds hacky and needlessly round-about.
The solution here seemed to be about keeping the NA values out of the data frame in the first place, which I can't do: Aggregate raster in R with NA values
As you have already loaded dplyr before your data, you can use the dplyr::summarise function. Grouping with group_by keeps NA as its own group, so those rows are not dropped.
library(dplyr)
df %>% group_by(Country,Year,Category,Population) %>%
summarise(Money = sum(Money))
# # A tibble: 18 x 5
# # Groups: Country, Year, Category [?]
# Country Year Category Population Money
# <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 A 1.00 0 NA 101
# 2 A 1.00 1.00 NA 100
# 3 A 2.00 0 0.482 101
# 4 A 2.00 1.00 0.482 101
# 5 A 3.00 0 0.600 101
# 6 A 3.00 1.00 0.600 101
# 7 B 1.00 0 0.494 101
# 8 B 1.00 1.00 0.494 101
# 9 B 2.00 0 0.186 100
# 10 B 2.00 1.00 0.186 100
# 11 B 3.00 0 0.827 101
# 12 B 3.00 1.00 0.827 101
# 13 C 1.00 0 0.668 100
# 14 C 1.00 1.00 0.668 101
# 15 C 2.00 0 0.794 100
# 16 C 2.00 1.00 0.794 100
# 17 C 3.00 0 0.108 100
# 18 C 3.00 1.00 0.108 100
Note: The OP's sample data doesn't have multiple rows for the same groups. Hence, the number of summarized rows is the same as the number of input rows.
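As a side note, with current dplyr (1.0 or later) you can return an ungrouped result and silence the regrouping message by adding the .groups argument:
df %>% group_by(Country, Year, Category, Population) %>%
  summarise(Money = sum(Money), .groups = "drop")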
I have this data frame "df" (showing 15 of the 1000 rows):
inf sup frec prob
1 1.000318 1.005308 12 0.060
2 1.005308 1.010297 5 0.025
3 1.010297 1.015286 5 0.025
4 1.015286 1.020276 2 0.010
5 1.020276 1.025265 3 0.015
6 1.025265 1.030254 3 0.015
7 1.030254 1.035244 8 0.040
8 1.035244 1.040233 2 0.010
9 1.040233 1.045223 3 0.015
10 1.045223 1.050212 0 0.000
11 1.050212 1.055201 4 0.020
12 1.055201 1.060191 1 0.005
13 1.060191 1.065180 1 0.005
14 1.065180 1.070169 0 0.000
15 1.070169 1.075159 1 0.005
I want to plot, for each row, a horizontal segment over the x interval [inf[i], sup[i]] at height y = prob[i].
I tried this solution, using a "for loop" to plot each segment:
plot <- ggplot(data = df)
for(i in 1:15){
plot <- plot + geom_segment(aes(x = df$inf[i], xend = df$sup[i], y = df$prob[i], yend = df$prob[i]))
}
plot
But all I get is a single line at y = 0; I assume because my "prob" values are close to zero. The other problem is that if the for loop runs for enough iterations, an error pops up saying:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
Is there any way to plot those segments over their x intervals?
Or would it be better to abandon the idea of intervals and plot a few points per interval instead?
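For reference, a sketch of the loop-free approach: mapping the columns inside aes() lets a single geom_segment layer draw every row at once, which sidesteps both the overplotting and the nested-evaluation error (assuming df as printed above):
library(ggplot2)
ggplot(df, aes(x = inf, xend = sup, y = prob, yend = prob)) +
  geom_segment()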