I have a series of acf() results and want to extract just the lag-1 autocorrelation coefficient from each. Can anyone give a quick pointer? Thank you.
# A snippet of a series of acf() results
$`25`
Autocorrelations of series 'x', by lag

     0      1      2      3      4      5      6      7      8      9     10     11     12
 1.000  0.366 -0.347 -0.399 -0.074  0.230  0.050 -0.250 -0.213 -0.106  0.059  0.154  0.031

$`26`
Autocorrelations of series 'x', by lag

     0      1      2      3      4      5      6      7      8      9     10     11
 1.000  0.060  0.026 -0.163 -0.233 -0.191 -0.377  0.214  0.037  0.178 -0.016  0.049

$`27`
Autocorrelations of series 'x', by lag

     0      1      2      3      4      5      6      7      8      9     10     11     12
 1.000 -0.025 -0.136  0.569 -0.227 -0.264  0.218 -0.262 -0.411  0.123 -0.039 -0.192  0.130

# For this example, the extracted values would be 0.366, 0.060, -0.025; they can be
# in either a list or a matrix.
EDIT

# `acf` from base R was used
p <- acf.each()

# sapply was tried, but it resulted in this:
sapply(acf.each(), `[`, "1")
       1             2             3
acf    0.7398        0.1746        0.4278
type   "correlation" "correlation" "correlation"
n.used 24            17            14
lag    1             1             1
series "x"           "x"           "x"
snames NULL          NULL          NULL
The structure is a list of acf objects, so we can use sapply to do the extraction. Note that x$acf[1] is the lag-0 autocorrelation (always 1), so the lag-1 value is the second element:

sapply(lst1, function(x) x$acf[2])

Data:

lst1 <- list(acf(ldeaths), acf(ldeaths))
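To make the answer concrete, here is a minimal, self-contained sketch; lst1 stands in for the list returned by acf.each(), which isn't shown in the question:

```r
# Build a small list of acf objects as stand-in data.
set.seed(1)
lst1 <- list(acf(rnorm(30), plot = FALSE),
             acf(rnorm(30), plot = FALSE),
             acf(rnorm(30), plot = FALSE))

# x$acf[1] is the lag-0 autocorrelation (always 1), so lag 1 is element 2.
lag1 <- sapply(lst1, function(x) x$acf[2])

# vapply is a stricter alternative that guarantees a plain numeric vector:
lag1_strict <- vapply(lst1, function(x) x$acf[2], numeric(1))
```

Both calls return one lag-1 coefficient per series; vapply additionally errors early if any element has an unexpected shape.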
Here I have a snippet of my dataset. The rows indicate different days of the year.
The Substations represent individuals; there are over 500 of them.
The 10-minute time periods run all the way through 24 hours.
I need to find an average value for each 10-minute interval for each individual in this dataset. This should result in a single row for each substation, with the respective average value for each time interval.
I have tried:
meanbygroup <- stationgroup %>%
  group_by(Substation) %>%
  summarise(means = colMeans(tenminintervals[sapply(tenminintervals, is.numeric)]))
But this averages the entire column and I am left with the same average values for each individual substation.
So for each individual substation, I need an average for each individual time interval.
Please help!
Try using summarize(across()), like this:
df %>%
  group_by(Substation) %>%
  summarize(across(everything(), ~ mean(.x, na.rm = TRUE)))
Output:
# A tibble: 26 × 4
   Substation `00:00`  `00:10`  `00:20`
   <chr>        <dbl>    <dbl>    <dbl>
 1 A          -0.233   0.110   -0.106
 2 B           0.203  -0.0997  -0.128
 3 C          -0.0733  0.196   -0.0205
 4 D           0.0905 -0.0449  -0.0529
 5 E           0.401   0.152   -0.0957
 6 F           0.0368  0.120   -0.0787
 7 G           0.0323 -0.0792  -0.278
 8 H           0.132  -0.0766   0.157
 9 I          -0.0693  0.0578   0.0732
10 J           0.0776 -0.176   -0.0192
# … with 16 more rows
Input:
library(dplyr)
library(tibble)
set.seed(123)
df = bind_cols(
  tibble(Substation = sample(LETTERS, size = 1000, replace = TRUE)),
  as_tibble(setNames(lapply(1:3, function(x) rnorm(1000)), c("00:00", "00:10", "00:20")))
) %>% arrange(Substation)
# A tibble: 1,000 × 4
Substation `00:00` `00:10` `00:20`
<chr> <dbl> <dbl> <dbl>
1 A 0.121 -1.94 0.137
2 A -0.322 1.05 0.416
3 A -0.158 -1.40 0.192
4 A -1.85 1.69 -0.0922
5 A -1.16 -0.455 0.754
6 A 1.95 1.06 0.732
7 A -0.132 0.655 -1.84
8 A 1.08 -0.329 -0.130
9 A -1.21 2.82 -0.0571
10 A -1.04 0.237 -0.328
# … with 990 more rows
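For comparison, the same per-substation means can be computed in base R with aggregate(); this sketch uses made-up syntactic column names rather than the `00:00`-style headers above:

```r
set.seed(123)
df <- data.frame(
  Substation = sample(LETTERS, 1000, replace = TRUE),
  t00_00 = rnorm(1000),
  t00_10 = rnorm(1000),
  t00_20 = rnorm(1000)
)

# aggregate() applies mean() to every other column within each group,
# producing exactly one row per substation.
means <- aggregate(. ~ Substation, data = df, FUN = mean)
```

The dplyr version above is usually preferable for wide data with many interval columns, but aggregate() avoids the extra dependency.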
For a publication in a peer-reviewed scientific journal (http://www.redjournal.org), we would like to prepare Kaplan-Meier plots. The journal has the following specific guidelines for these plots:
"If your figures include curves generated from analyses using the Kaplan-Meier method or the cumulative incidence method, the following are now requirements for the presentation of these curves:
That the number of patients at risk is indicated;
That censoring marks are included;
That curves be truncated when there are fewer than 10 patients at risk; and
An estimate of the confidence interval should be included either in the figure itself or the text.”
Here, I illustrate my problem with the veteran dataset (https://github.com/tidyverse/reprex is great!).
We can address requirements 1, 2, and 4 easily with the survminer package:
library(survival)
library(survminer)
#> Warning: package 'survminer' was built under R version 3.4.3
#> Loading required package: ggplot2
#> Loading required package: ggpubr
#> Warning: package 'ggpubr' was built under R version 3.4.3
#> Loading required package: magrittr
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
ggsurvplot(fit.obj,
           conf.int = TRUE,
           risk.table = "absolute",
           tables.theme = theme_cleantable())
I have, however, a problem with requirement 3 (truncate curves when there are fewer than 10 patients at risk). I see that all the required information is available in the survfit object:
library(survival)
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
summary(fit.obj)
#> Call: survfit(formula = Surv(time, status) ~ celltype, data = veteran)
#>
#> celltype=squamous
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 1 35 2 0.943 0.0392 0.8690 1.000
#> 8 33 1 0.914 0.0473 0.8261 1.000
#> 10 32 1 0.886 0.0538 0.7863 0.998
#> 11 31 1 0.857 0.0591 0.7487 0.981
#> 15 30 1 0.829 0.0637 0.7127 0.963
#> 25 29 1 0.800 0.0676 0.6779 0.944
#> 30 27 1 0.770 0.0713 0.6426 0.924
#> 33 26 1 0.741 0.0745 0.6083 0.902
#> 42 25 1 0.711 0.0772 0.5749 0.880
#> 44 24 1 0.681 0.0794 0.5423 0.856
#> 72 23 1 0.652 0.0813 0.5105 0.832
#> 82 22 1 0.622 0.0828 0.4793 0.808
#> 110 19 1 0.589 0.0847 0.4448 0.781
#> 111 18 1 0.557 0.0861 0.4112 0.754
#> 112 17 1 0.524 0.0870 0.3784 0.726
#> 118 16 1 0.491 0.0875 0.3464 0.697
#> 126 15 1 0.458 0.0876 0.3152 0.667
#> 144 14 1 0.426 0.0873 0.2849 0.636
#> 201 13 1 0.393 0.0865 0.2553 0.605
#> 228 12 1 0.360 0.0852 0.2265 0.573
#> 242 10 1 0.324 0.0840 0.1951 0.539
#> 283 9 1 0.288 0.0820 0.1650 0.503
#> 314 8 1 0.252 0.0793 0.1362 0.467
#> 357 7 1 0.216 0.0757 0.1088 0.429
#> 389 6 1 0.180 0.0711 0.0831 0.391
#> 411 5 1 0.144 0.0654 0.0592 0.351
#> 467 4 1 0.108 0.0581 0.0377 0.310
#> 587 3 1 0.072 0.0487 0.0192 0.271
#> 991 2 1 0.036 0.0352 0.0053 0.245
#> 999 1 1 0.000 NaN NA NA
#>
#> celltype=smallcell
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 2 48 1 0.9792 0.0206 0.93958 1.000
#> 4 47 1 0.9583 0.0288 0.90344 1.000
#> 7 46 2 0.9167 0.0399 0.84172 0.998
#> 8 44 1 0.8958 0.0441 0.81345 0.987
#> 10 43 1 0.8750 0.0477 0.78627 0.974
#> 13 42 2 0.8333 0.0538 0.73430 0.946
#> 16 40 1 0.8125 0.0563 0.70926 0.931
#> 18 39 2 0.7708 0.0607 0.66065 0.899
#> 20 37 2 0.7292 0.0641 0.61369 0.866
#> 21 35 2 0.6875 0.0669 0.56812 0.832
#> 22 33 1 0.6667 0.0680 0.54580 0.814
#> 24 32 1 0.6458 0.0690 0.52377 0.796
#> 25 31 2 0.6042 0.0706 0.48052 0.760
#> 27 29 1 0.5833 0.0712 0.45928 0.741
#> 29 28 1 0.5625 0.0716 0.43830 0.722
#> 30 27 1 0.5417 0.0719 0.41756 0.703
#> 31 26 1 0.5208 0.0721 0.39706 0.683
#> 51 25 2 0.4792 0.0721 0.35678 0.644
#> 52 23 1 0.4583 0.0719 0.33699 0.623
#> 54 22 2 0.4167 0.0712 0.29814 0.582
#> 56 20 1 0.3958 0.0706 0.27908 0.561
#> 59 19 1 0.3750 0.0699 0.26027 0.540
#> 61 18 1 0.3542 0.0690 0.24171 0.519
#> 63 17 1 0.3333 0.0680 0.22342 0.497
#> 80 16 1 0.3125 0.0669 0.20541 0.475
#> 87 15 1 0.2917 0.0656 0.18768 0.453
#> 95 14 1 0.2708 0.0641 0.17026 0.431
#> 99 12 2 0.2257 0.0609 0.13302 0.383
#> 117 9 1 0.2006 0.0591 0.11267 0.357
#> 122 8 1 0.1755 0.0567 0.09316 0.331
#> 139 6 1 0.1463 0.0543 0.07066 0.303
#> 151 5 1 0.1170 0.0507 0.05005 0.274
#> 153 4 1 0.0878 0.0457 0.03163 0.244
#> 287 3 1 0.0585 0.0387 0.01600 0.214
#> 384 2 1 0.0293 0.0283 0.00438 0.195
#> 392 1 1 0.0000 NaN NA NA
#>
#> celltype=adeno
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 3 27 1 0.9630 0.0363 0.89430 1.000
#> 7 26 1 0.9259 0.0504 0.83223 1.000
#> 8 25 2 0.8519 0.0684 0.72786 0.997
#> 12 23 1 0.8148 0.0748 0.68071 0.975
#> 18 22 1 0.7778 0.0800 0.63576 0.952
#> 19 21 1 0.7407 0.0843 0.59259 0.926
#> 24 20 1 0.7037 0.0879 0.55093 0.899
#> 31 19 1 0.6667 0.0907 0.51059 0.870
#> 35 18 1 0.6296 0.0929 0.47146 0.841
#> 36 17 1 0.5926 0.0946 0.43344 0.810
#> 45 16 1 0.5556 0.0956 0.39647 0.778
#> 48 15 1 0.5185 0.0962 0.36050 0.746
#> 51 14 1 0.4815 0.0962 0.32552 0.712
#> 52 13 1 0.4444 0.0956 0.29152 0.678
#> 73 12 1 0.4074 0.0946 0.25850 0.642
#> 80 11 1 0.3704 0.0929 0.22649 0.606
#> 84 9 1 0.3292 0.0913 0.19121 0.567
#> 90 8 1 0.2881 0.0887 0.15759 0.527
#> 92 7 1 0.2469 0.0850 0.12575 0.485
#> 95 6 1 0.2058 0.0802 0.09587 0.442
#> 117 5 1 0.1646 0.0740 0.06824 0.397
#> 132 4 1 0.1235 0.0659 0.04335 0.352
#> 140 3 1 0.0823 0.0553 0.02204 0.307
#> 162 2 1 0.0412 0.0401 0.00608 0.279
#> 186 1 1 0.0000 NaN NA NA
#>
#> celltype=large
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 12 27 1 0.9630 0.0363 0.89430 1.000
#> 15 26 1 0.9259 0.0504 0.83223 1.000
#> 19 25 1 0.8889 0.0605 0.77791 1.000
#> 43 24 1 0.8519 0.0684 0.72786 0.997
#> 49 23 1 0.8148 0.0748 0.68071 0.975
#> 52 22 1 0.7778 0.0800 0.63576 0.952
#> 53 21 1 0.7407 0.0843 0.59259 0.926
#> 100 20 1 0.7037 0.0879 0.55093 0.899
#> 103 19 1 0.6667 0.0907 0.51059 0.870
#> 105 18 1 0.6296 0.0929 0.47146 0.841
#> 111 17 1 0.5926 0.0946 0.43344 0.810
#> 133 16 1 0.5556 0.0956 0.39647 0.778
#> 143 15 1 0.5185 0.0962 0.36050 0.746
#> 156 14 1 0.4815 0.0962 0.32552 0.712
#> 162 13 1 0.4444 0.0956 0.29152 0.678
#> 164 12 1 0.4074 0.0946 0.25850 0.642
#> 177 11 1 0.3704 0.0929 0.22649 0.606
#> 200 9 1 0.3292 0.0913 0.19121 0.567
#> 216 8 1 0.2881 0.0887 0.15759 0.527
#> 231 7 1 0.2469 0.0850 0.12575 0.485
#> 250 6 1 0.2058 0.0802 0.09587 0.442
#> 260 5 1 0.1646 0.0740 0.06824 0.397
#> 278 4 1 0.1235 0.0659 0.04335 0.352
#> 340 3 1 0.0823 0.0553 0.02204 0.307
#> 378 2 1 0.0412 0.0401 0.00608 0.279
#> 553 1 1 0.0000 NaN NA NA
But I have no idea how I can manipulate this list. I would very much appreciate any advice on how to filter out all lines with n.risk < 10 from fit.obj.
I can't quite seem to get this all the way there. But I see that you can pass a data.frame rather than a fit object to the plotting function. You can do this and clip the values. For example
ss <- subset(surv_summary(fit.obj), n.risk>=10)
ggsurvplot(ss, conf.int = TRUE)
But it seems in this mode it does not automatically print the table. There is a function to draw just the table with
ggrisktable(fit.obj, tables.theme = theme_cleantable())
So I guess you could just combine them. Maybe I'm missing an easier way to draw the table when using a data.frame in the same plot.
As a slight variation on the above answers, if you want to truncate each group individually when less than 10 patients are at risk in that group, I found this to work and not require plotting the figure and table separately:
library(survival)
library(survminer)
# truncate each line when fewer than 10 at risk
atrisk <- 10
# KM fit
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
# subset each stratum separately
# subset each stratum separately
maxcutofftime <- 0  # for plotting
strata <- rep(names(fit.obj$strata), fit.obj$strata)
for (i in names(fit.obj$strata)) {
  cutofftime <- min(fit.obj$time[fit.obj$n.risk < atrisk & strata == i])
  maxcutofftime <- max(maxcutofftime, cutofftime)
  cutoffs <- which(fit.obj$n.risk < atrisk & strata == i)
  fit.obj$lower[cutoffs] <- NA
  fit.obj$upper[cutoffs] <- NA
  fit.obj$surv[cutoffs] <- NA
}
# plot
ggsurvplot(fit.obj, data = veteran, risk.table = TRUE, conf.int = TRUE, pval = FALSE,
           tables.theme = theme_cleantable(), xlim = c(0, maxcutofftime), break.x.by = 90)
Edited to add: note that if we had used pval = TRUE above, that would give the p-value for the truncated data, not the full data. It doesn't make much of a difference in this example, as both are p < 0.0001, but be careful :)
I'm following up on MrFlick's great answer.
I'd interpret (3) to mean that there should be at least 10 at risk in total, i.e. not per group. So:
1. Create an ungrouped Kaplan-Meier fit first and determine the time cutoff from there.
2. Subset the surv_summary object with respect to this cutoff.
3. Plot the KM curve and risk table separately. Crucially, survminer::ggrisktable() (a minimal front end for ggsurvtable()) accepts the options xlim and break.time.by. However, the function can currently only extend the upper time limit, not reduce it. I assume this is a bug. I created the function ggsurvtable_mod() to change this.
4. Turn the ggplot objects into grobs and use gridExtra::grid.arrange() to put both plots together. There is probably a more elegant way to do this based on the widths and heights options.
Admittedly, this is a bit of a hack and needs tweaking to get the correct alignment between the survival plot and the risk table.
library(survival)
library(survminer)
# ungrouped KM estimate to determine cutoff
fit1_ss <- surv_summary(survfit(Surv(time, status) ~ 1, data=veteran))
# time cutoff with fewer than 10 at risk
cutoff <- min(fit1_ss$time[fit1_ss$n.risk < 10])
# KM fit and subset to cutoff
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
fit_ss <- subset(surv_summary(fit.obj), time < cutoff)
# KM survival plot and risk table as separate plots
p1 <- ggsurvplot(fit_ss, conf.int=TRUE)
# note options xlim and break.time.by
p2 <- ggsurvtable_mod(fit.obj,
                      survtable = "risk.table",
                      tables.theme = theme_cleantable(),
                      xlim = c(0, cutoff),
                      break.time.by = 100)
# turn ggplot objects into grobs and arrange them (needs tweaking)
g1 <- ggplotGrob(p1)
g2 <- ggplotGrob(p2)
lom <- rbind(c(NA, rep(1, 14)),
c(NA, rep(1, 14)),
c(rep(2, 15)))
gridExtra::grid.arrange(grobs=list(g1, g2), layout_matrix=lom)
I am trying to identify sequential recordings in a time series and aggregate the data within each sequence.
Example Data
Here is an example of the data taken at a maximum frequency of 1 second:
timestamp Value
06:07:23 0.439
06:07:24 0.556
06:07:25 0.430
06:07:26 0.418
06:07:27 0.407
06:07:47 0.439
06:07:48 0.420
06:07:49 0.405
09:55:21 0.507
09:55:22 0.439
10:03:24 0.439
10:03:25 0.439
10:03:36 1.708
10:03:37 0.608
10:03:38 0.439
10:03:46 0.484
10:03:47 0.380
10:03:48 0.607
10:03:49 0.439
10:03:50 0.439
10:03:51 0.439
10:03:52 0.430
10:03:53 0.439
10:03:54 4.924
10:03:55 1.012
10:03:56 0.887
10:03:57 0.439
10:03:58 0.439
10:04:18 0.447
10:04:19 0.447
As can be seen, there are periods where a value is taken every second. I am trying to find a way to aggregate the rows whenever there is no gap between observations, to end up with something like this:
timestamp max duration
06:07:23 0.556 5
06:07:47 0.439 3
09:55:21 0.507 2
10:03:24 0.439 2
10:03:36 1.708 3
10:03:46 1.012 13
10:04:18 0.447 2
I am struggling to find a way of grouping the data by the sequential data. The closest answer I have been able to find is this one, however, the answers were provided over three and a half years ago and I was struggling to get the data.table method working.
Any ideas much appreciated!
Here is an attempt in data.table:
library(data.table)
dat[,
    .(timestamp = timestamp[1], max = max(Value), duration = .N),
    by = cumsum(c(FALSE, diff(as.POSIXct(dat$timestamp, format = "%H:%M:%S", tz = "UTC")) > 1))
]
# cumsum timestamp max duration
#1: 0 06:07:23 0.556 5
#2: 1 06:07:47 0.439 3
#3: 2 09:55:21 0.507 2
#4: 3 10:03:24 0.439 2
#5: 4 10:03:36 1.708 3
#6: 5 10:03:46 4.924 13
#7: 6 10:04:18 0.447 2
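The same run-detection idea also works without data.table; here is a base R sketch on a small subset of the data (a new group starts whenever the gap to the previous observation exceeds one second):

```r
dat <- data.frame(
  timestamp = c("06:07:23", "06:07:24", "06:07:25", "06:07:47", "06:07:48"),
  Value     = c(0.439, 0.556, 0.430, 0.439, 0.420)
)

# Convert to seconds and assign a run id: increment whenever the gap > 1s.
secs <- as.numeric(as.POSIXct(dat$timestamp, format = "%H:%M:%S", tz = "UTC"))
grp  <- cumsum(c(TRUE, diff(secs) > 1))

# Summarise each run: first timestamp, max value, run length.
out <- do.call(rbind, lapply(split(dat, grp), function(d)
  data.frame(timestamp = d$timestamp[1], max = max(d$Value), duration = nrow(d))))
out
#   timestamp   max duration
# 1  06:07:23 0.556        3
# 2  06:07:47 0.439        2
```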
I currently have a data frame in R where one column contains values from 1-20. I need to replace values 0-4 with A, 5-9 with B, 10-14 with C, and so on. What would be the best way to accomplish this?
This is what the first part of the data frame currently looks like:
sex length diameter height wholeWeight shuckedWeight visceraWeight shellWeight rings
1 2 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.1500 15
2 2 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.0700 7
3 1 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.2100 9
4 2 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.1550 10
5 0 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.0550 7
6 0 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.1200 8
7 1 0.530 0.415 0.150 0.7775 0.2370 0.1415 0.3300 20
8 1 0.545 0.425 0.125 0.7680 0.2940 0.1495 0.2600 16
9 2 0.475 0.370 0.125 0.5095 0.2165 0.1125 0.1650 9
10 1 0.550 0.440 0.150 0.8945 0.3145 0.1510 0.3200 19
11 1 0.525 0.380 0.140 0.6065 0.1940 0.1475 0.2100 14
12 2 0.430 0.350 0.110 0.4060 0.1675 0.0810 0.1350 10
I'm trying to replace the values in rings.
Maybe you can create a named vector to serve as a translation table.
tran_table <- rep(c("A","B","C","D"), rep(5,4))
names(tran_table) <- 1:20
test_df <- data.frame(
ring = sample(1:20, 20)
)
test_df$ring <- tran_table[test_df$ring]
And the result is:
> test_df
ring
1 A
2 D
3 B
4 A
5 A
6 C
7 B
8 D
9 B
10 A
11 B
12 D
13 C
14 A
15 C
16 C
17 B
18 D
19 C
20 D
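An alternative sketch with base R's cut(), which expresses the 0-4 / 5-9 / 10-14 / 15-19... buckets directly through interval breaks (the breaks and labels here are illustrative, matching the grouping asked about):

```r
rings <- c(15, 7, 9, 10, 7, 8, 20, 16, 9, 19, 14, 10)

# With right = TRUE (the default), breaks c(-1, 4, 9, 14, 20) put
# 0-4 in A, 5-9 in B, 10-14 in C, and 15-20 in D.
labels <- cut(rings, breaks = c(-1, 4, 9, 14, 20),
              labels = c("A", "B", "C", "D"))
as.character(labels)
# [1] "D" "B" "B" "C" "B" "B" "D" "D" "B" "D" "C" "C"
```

cut() scales to any number of intervals without building a lookup vector by hand, and returns a factor that keeps the bucket ordering.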
I have two dataframes. For example
library(xlsx)
csvData <- read.csv("myData.csv")
xlsData <- read.xlsx("myData.xlsx", sheetIndex = 1)
csvData looks like this:
Period CPI VIX
1 0.029 31.740
2 0.039 32.840
3 0.028 34.720
4 0.011 43.740
5 -0.003 35.310
6 0.013 26.090
7 0.032 28.420
8 0.022 45.080
xlsData looks like this:
Period CPI DJIA
1 0.029 12176
2 0.039 10646
3 0.028 11407
4 0.011 9563
5 -0.003 10708
6 0.013 10776
7 0.032 9384
8 0.022 7774
When I merge this data, the CPI data is duplicated, and a suffix is put on the header, which is problematic (I have many more columns in my real df's).
mergedData <- merge(xlsData, csvData, by = "Period")
mergedData:
Period CPI.x VIX CPI.y DJIA
1 0.029 31.740 0.029 12176
2 0.039 32.840 0.039 10646
3 0.028 34.720 0.028 11407
4 0.011 43.740 0.011 9563
5 -0.003 35.310 -0.003 10708
6 0.013 26.090 0.013 10776
7 0.032 28.420 0.032 9384
8 0.022 45.080 0.022 7774
I want to merge the data frames without duplicating columns with the same name. For example, I want this kind of output:
Period CPI VIX DJIA
1 0.029 31.740 12176
2 0.039 32.840 10646
3 0.028 34.720 11407
4 0.011 43.740 9563
5 -0.003 35.310 10708
6 0.013 26.090 10776
7 0.032 28.420 9384
8 0.022 45.080 7774
I don't want to use additional 'by' arguments or drop columns from one of the data frames, because too many columns are duplicated across both. I'm just looking for a dynamic way to drop the duplicated columns during the merge.
Thanks!
You can skip the by argument if the common columns are named the same.
From ?merge:
By default the data frames are merged on the columns with names they both have, but separate specifications of the columns can be given by by.x and by.y.
Keeping that in mind, the following should work (as it did on your sample data):
merge(csvData, xlsData)
# Period CPI VIX DJIA
# 1 1 0.029 31.74 12176
# 2 2 0.039 32.84 10646
# 3 3 0.028 34.72 11407
# 4 4 0.011 43.74 9563
# 5 5 -0.003 35.31 10708
# 6 6 0.013 26.09 10776
# 7 7 0.032 28.42 9384
# 8 8 0.022 45.08 7774
You can also index your specific column of interest by name. This is useful if you just need a single column/vector from a large data frame.
Period <- seq(1, 8)
CPI <- seq(11, 18)
VIX <- seq(21, 28)
DJIA <- seq(31, 38)
Other1 <- letters[1:8]
Other2 <- letters[2:9]
Other3 <- letters[3:10]
df1 <- data.frame(Period, CPI, VIX)
df2 <- data.frame(Period, CPI, Other1, DJIA, Other2, Other3)
merge(df1, df2[c("Period", "DJIA")], by = "Period")
Period CPI VIX DJIA
1 1 11 21 31
2 2 12 22 32
3 3 13 23 33
4 4 14 24 34
5 5 15 25 35
6 6 16 26 36
7 7 17 27 37
8 8 18 28 38
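When the frames are guaranteed to agree on their shared columns, a related dynamic trick is to drop the duplicated non-key columns from one frame before merging, computing the overlap with intersect() instead of listing names by hand. A minimal sketch:

```r
df1 <- data.frame(Period = 1:3, CPI = c(0.029, 0.039, 0.028), VIX = c(31.74, 32.84, 34.72))
df2 <- data.frame(Period = 1:3, CPI = c(0.029, 0.039, 0.028), DJIA = c(12176, 10646, 11407))

# Columns present in both frames, other than the merge key itself.
dupes <- setdiff(intersect(names(df1), names(df2)), "Period")

# Drop them from one side, then merge on the key alone.
merged <- merge(df1, df2[, setdiff(names(df2), dupes)], by = "Period")
```

This keeps a single copy of each shared column and never produces .x/.y suffixes, no matter how many columns the two frames have in common.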