I have a large dataframe (34,707,060 obs) that consists of accelerometer data for x, y, z. The data were collected at 30 Hz, meaning I have 30 rows of data for each second. See the head of my data below.
Timestamp Accelerometer.X Accelerometer.Y Accelerometer.Z
1 30/06/2021 08:00:00.000 -1.109 -1.559 1.508
2 30/06/2021 08:00:00.034 -0.688 -1.043 0.891
3 30/06/2021 08:00:00.067 -0.363 -0.531 0.555
4 30/06/2021 08:00:00.100 -0.164 -0.496 0.816
5 30/06/2021 08:00:00.134 0.063 -0.363 0.496
6 30/06/2021 08:00:00.167 -0.098 -0.992 0.227
I would like to compress this dataset to have data for every second, by calculating the mean, minimum, maximum, sum and standard deviation of every 30 rows. I would like to keep the Timestamp with date and time.
I have tried to apply the following code to my dataframe, which I copied from det's answer to the question here:
df %>%
  group_by(group = row_number() %/% 30) %>%
  dplyr::summarize(
    Timestamp = first(Timestamp),
    X_mean = mean(Accelerometer.X),
    Y_mean = mean(Accelerometer.Y),
    Z_mean = mean(Accelerometer.Z),
    X_min = min(Accelerometer.X),
    Y_min = min(Accelerometer.Y),
    Z_min = min(Accelerometer.Z),
    X_max = max(Accelerometer.X),
    Y_max = max(Accelerometer.Y),
    Z_max = max(Accelerometer.Z),
    X_sum = sum(Accelerometer.X),
    Y_sum = sum(Accelerometer.Y),
    Z_sum = sum(Accelerometer.Z),
    X_sd = sd(Accelerometer.X),
    Y_sd = sd(Accelerometer.Y),
    Z_sd = sd(Accelerometer.Z)
  )
Unfortunately, this does not give me the result I want (see below).
# A tibble: 5 × 5
group Timestamp X_mean Y_mean Z_mean
<dbl> <chr> <dbl> <dbl> <dbl>
1 0 30/06/2021 08:00:00.000 -0.576 -0.989 0.431
2 1 30/06/2021 08:00:00.967 -0.240 -1.06 0.270
3 2 30/06/2021 08:00:01.967 -0.287 -0.821 0.390
4 3 30/06/2021 08:00:02.967 -0.364 -0.830 0.337
5 4 30/06/2021 08:00:03.967 -0.332 -0.961 -0.086
The way it looks to me now, it first calculates all the values for the first 30 rows, and then includes these calculated values as the first row of 30 in the next calculation. So rather than calculating the compressed values for rows 1:30, 31:60, 61:90, etc., it keeps applying the code to rows 1:30.
I am not sure how to adjust the code to calculate the mean, min, max, sum and sd for every 30 rows (so 1:30, 31:60, etc.). Would really appreciate some help.
You can use dmy_hms to convert your Timestamp column to a lubridate date-time object, and floor_date to truncate it to the second. Then I'd rather use across here to compute the mean, min, max and sd:
library(lubridate)
library(dplyr)
dat %>%
  group_by(sec = floor_date(dmy_hms(Timestamp), "second")) %>%
  summarise(Timestamp = first(Timestamp),
            across(-Timestamp,
                   list(mean = mean, min = min, max = max, sd = sd),
                   .names = "{.col}_{.fn}"))
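The question also asks for a per-second sum; assuming the same data, sum should slot into the same list of summary functions (a small sketch reusing the pipeline above, not run against the full dataset):
dat %>%
  group_by(sec = floor_date(dmy_hms(Timestamp), "second")) %>%
  summarise(Timestamp = first(Timestamp),
            across(-Timestamp,
                   list(mean = mean, min = min, max = max, sum = sum, sd = sd),
                   .names = "{.col}_{.fn}"))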
I have a long-formatted set of longitudinal data with two variables that change over time (cognitive_score and motor_score), a subject id (labeled subjid), and the age of each subject in days at the moment of measurement (labeled agedays). Measurements were taken twice.
I want to transform it into a wide-formatted longitudinal dataset.
The problem is that the agedays measurements are unique to each subject, and the only way to tell which measurement entry was the first and which was the second is to check which agedays is higher (a higher agedays than in the other entry means the second measurement; a lower agedays means the first).
We thus have this dataset:
subjid agedays cognitive_score motor_score
<int> <int> <dbl> <dbl>
1 4900001 457 0.338 0.176
2 4900001 1035 0.191 0.216
3 4900002 639 0.25 0.176
4 4900002 1248 0.176 0.353
5 4900003 335 0.103 0.196
6 4900003 913 0.176 0.196
And what I tried was using reshape:
reshape(dataset_col, direction = "wide", idvar = "subjid", timevar = "agedays", v.names = c("cognitive_score", "motor_score"))
Where dataset_col is the name of the dataset.
What it does, however, is add these two columns, where the numbers in the column names appear to be the values of the agedays variable.
Any advice on how I can do this?
With libraries dplyr and tidyr:
You can use group_by and mutate to determine which row within each subjid is the first measurement and which is the second. In this case I used an ifelse statement to label the first measurement; note that this only works if you have exactly 2 measurements for each subjid.
Then you can use pivot_wider to pivot your data to wide format based on your newly created names column.
library(dplyr)
library(tidyr)
df %>%
  group_by(subjid) %>%
  mutate(names = ifelse(agedays == min(agedays), "first", "second")) %>%
  pivot_wider(names_from = names, values_from = c(agedays, cognitive_score, motor_score))
# A tibble: 3 × 7
# Groups: subjid [3]
# subjid agedays_first agedays_second cognitive_score_first cognitive_score_second motor_score_first motor_score_second
# <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 4900001 457 1035 0.338 0.191 0.176 0.216
# 2 4900002 639 1248 0.25 0.176 0.176 0.353
# 3 4900003 335 913 0.103 0.176 0.196 0.196
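If a subject could have more than two measurements, a rank-based label generalizes the same idea; here is a sketch under that assumption (the t1/t2/... labels are made up for illustration):
df %>%
  group_by(subjid) %>%
  arrange(agedays, .by_group = TRUE) %>%
  mutate(names = paste0("t", row_number())) %>%
  pivot_wider(names_from = names, values_from = c(agedays, cognitive_score, motor_score))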
Suppose I am using panel data: for each individual and time, there is an observation of a numerical variable. I want to apply a function to this numerical variable but this function outputs a vector of numbers. I'd like to apply this function over the observations of each individual and store the resulting vector as columns of a new dataframe.
Example:
TICKER OFTIC CNAME ANNDATS_ACT ACTUAL
<chr> <chr> <chr> <date> <dbl>
1 0001 EPE EP ENGR CORP 2019-05-08 -0.15
2 0004 ACSF AMERICAN CAPITAL 2014-08-04 0.29
3 000R CRCM CARECOM 2018-02-27 0.32
4 000V EIGR EIGER 2018-05-11 -0.84
5 000Y RARE ULTRAGENYX 2016-02-25 -1.42
6 000Z BIOC BIOCEPT 2018-03-28 -54
7 0018 EGLT EGALET 2016-03-08 -0.28
8 001A SESN SESEN BIO 2021-03-15 -0.11
9 001C ARGS ARGOS 2017-03-16 -7
10 001J KN KNOWLES 2021-02-04 0.38
For each TICKER, I will consider the time-series implied by ACTUAL and compute the autocorrelation function. I defined the following wrapper to perform the operation:
my_acf <- function(x, lag = NULL) {
  acf_vec <- acf(x, lag.max = lag, plot = FALSE, na.action = na.contiguous)$acf
  acf_vec <- as.vector(acf_vec)[-1]
  return(acf_vec)
}
If the desired maximum lag is, say, 3, I'd like to create another dataset with 4 columns: TICKER and the corresponding first 3 autocorrelations of the associated series of ACTUAL observations.
My solution was:
max_lag = 3
autocorrs <- final_sample %>%
  group_by(TICKER) %>%
  filter(!all(is.na(ACTUAL))) %>%
  summarise(rho = my_acf(ACTUAL, lag = max_lag)) %>%
  mutate(order = row_number()) %>%
  pivot_wider(id_cols = TICKER, values_from = rho, names_from = order, names_prefix = "rho_")
This indeed provides the desired output:
TICKER rho_1 rho_2 rho_3
<chr> <dbl> <dbl> <dbl>
1 0001 0.836 0.676 0.493
2 0004 0.469 -0.224 -0.366
3 000R 0.561 0.579 0.327
4 000V 0.634 0.626 0.604
5 000Y 0.370 0.396 0.117
6 000Z 0.476 0.454 0.382
7 0018 0.382 -0.0170 -0.278
8 001A 0.330 0.316 0.0944
9 001C 0.727 0.590 0.400
10 001J 0.281 -0.308 -0.0343
My question is how one can perform this operation without pivot_wider and the manual creation of the order column. The summarise verb creates a single column that stores the autocorrelations sequentially for each TICKER. Is there a way to force summarise to create different columns for the different outputs a given function may produce when applied to, say, the ACTUAL series?
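One possible direction, sketched on the assumption of dplyr >= 1.0 (where summarise() can return a one-row data frame per group and unpack it into columns), is to name the autocorrelation vector and wrap it with tibble's as_tibble_row(), reusing my_acf from above:
library(dplyr)
library(tibble)
max_lag <- 3
final_sample %>%
  group_by(TICKER) %>%
  filter(!all(is.na(ACTUAL))) %>%
  summarise(as_tibble_row(setNames(my_acf(ACTUAL, lag = max_lag),
                                   paste0("rho_", seq_len(max_lag)))))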
I have a data set similar to the following with 1 column and 60 rows:
value
1 0.0423
2 0.0388
3 0.0386
4 0.0342
5 0.0296
6 0.0276
7 0.0246
8 0.0239
9 0.0234
10 0.0214
.
40 0.1424
.
60 -0.0312
I want to reorder the rows so that certain conditions are met. For example, one condition could be: sum(df$value[4:7]) > 0.1000 & sum(df$value[4:7]) < 0.1100
The reordered data set could then look like this, for example:
value
1 0.0423
2 0.0388
3 0.0386
4 0.1312
5 -0.0312
6 0.0276
7 0.0246
8 0.0239
9 0.0234
10 0.0214
.
.
.
60 0.0342
What I tried was using repeat and sample as in the following:
repeat {
  df1 <- as_tibble(sample(df$value, replace = TRUE))
  if (sum(df1$value[4:7]) > 0.1000 & sum(df1$value[4:7]) < 0.1100) break
}
Unfortunately, this method takes quite some time, and I was wondering if there is a faster way to reorder rows based on mathematical conditions such as a sum or product.
Here's a quick implementation of the hill-climbing method I outlined in my comment. I've had to slightly reframe the desired condition as "distance of sum(x[4:7]) from 0.105" to make it continuous, although you can still use the exact condition when doing the check that all requirements are satisfied. The benefit is that you can add extra conditions to the distance function easily.
# Using same example data as Jon Spring
set.seed(42)
vs = rnorm(60, 0.05, 0.08)
get_distance = function(x) {
  distance = abs(sum(x[4:7]) - 0.105)
  # Add to the distance with further conditions if needed
  distance
}

max_attempts = 10000
best_distance = Inf
swaps_made = 0

for (step in 1:max_attempts) {
  # Copy the vector and swap two random values
  new_vs = vs
  swap_inds = sample.int(length(vs), 2, replace = FALSE)
  new_vs[swap_inds] = rev(new_vs[swap_inds])
  # Keep the new vector if the distance has improved
  new_distance = get_distance(new_vs)
  if (new_distance < best_distance) {
    vs = new_vs
    best_distance = new_distance
    swaps_made = swaps_made + 1
  }
  complete = (sum(vs[4:7]) < 0.11) & (sum(vs[4:7]) > 0.1)
  if (complete) {
    print(paste0("Solution found in ", step, " steps"))
    break
  }
}
sum(vs[4:7])
There's no real guarantee that this method will reach a solution, but I often try this kind of basic hill-climbing when I'm not sure if there's a "smart" way to approach a problem.
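For instance, if a second requirement were added, say also keeping sum(x[10:13]) near a target of 0.2 (a value invented purely for illustration), the distance function could simply accumulate both deviations:
get_distance = function(x) {
  # Sum of absolute deviations from each target; smaller is better, 0 means both conditions hit exactly
  abs(sum(x[4:7]) - 0.105) + abs(sum(x[10:13]) - 0.2)
}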
Here's an approach using combn from base R, and then filtering using dplyr. (I'm sure there's a way w/o it but my base-fu isn't there yet.)
With only 4 numbers drawn from a pool of 60, there are "only" about 488k different combinations (ignoring order: 60*59*58*57/(4*3*2*1) = 487,635), so it's quick to brute force in about a second.
# Make a vector of 60 numbers like your example
set.seed(42)
my_nums <- rnorm(60, 0.05, 0.08);
all_combos <- combn(my_nums, 4) # Get all unique combos of 4 numbers
library(tidyverse)
combos_table <- all_combos %>%
  t() %>%
  as_tibble() %>%
  mutate(sum = V1 + V2 + V3 + V4) %>%
  filter(sum > 0.1, sum < 0.11)
> combos_table
# A tibble: 8,989 x 5
V1 V2 V3 V4 sum
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.160 0.00482 0.0791 -0.143 0.100
2 0.160 0.00482 0.101 -0.163 0.103
3 0.160 0.00482 0.0823 -0.145 0.102
4 0.160 0.00482 0.0823 -0.143 0.104
5 0.160 0.00482 -0.0611 -0.00120 0.102
6 0.160 0.00482 -0.0611 0.00129 0.105
7 0.160 0.00482 0.0277 -0.0911 0.101
8 0.160 0.00482 0.0277 -0.0874 0.105
9 0.160 0.00482 0.101 -0.163 0.103
10 0.160 0.00482 0.0273 -0.0911 0.101
# … with 8,979 more rows
This says that in this example, there are about 9000 different sets of 4 numbers from my sequence which meet the criteria. We could pick any of these and put them in positions 4-7 to meet your requirement.
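The last step, splicing one of these rows back into the full vector, could look roughly like this (assuming the first qualifying row is as good as any):
chosen <- unlist(combos_table[1, c("V1", "V2", "V3", "V4")])
remaining <- my_nums
for (v in chosen) remaining <- remaining[-match(v, remaining)]  # drop one instance of each chosen value
reordered <- c(remaining[1:3], chosen, remaining[-(1:3)])
sum(reordered[4:7])  # lands between 0.1 and 0.11 by construction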
First and foremost, regardless of whether you have input or not, thank you for taking the time to view my question.
Let me break down what I am doing, the sample dataset, and the error.
What I currently have is data for several different IDs, listing the dispersion per day (you will see below). I want to loop through the dates and add two columns to the data: a rolling mean column and a rolling standard deviation column.
The code I have written out so far is this:
library(zoo)
Testing1 <- function(dataset, k) {
  ops <- data.frame()
  for (i in unique(dataset$Date)) {
    ops <- dataset %>%
      mutate(rolling_mean = rollmean(dataset$Dispersion, k)) %>%
      mutate(rolling_std = rollapply(dataset$Dispersion, width = k, FUN = sd))
  }
  Results <<- ops
}
However, I get the following error:
Error in mutate_impl(.data, dots) :
Column rolling_mean must be length 30 (the number of rows) or one, not 26
I am assuming that the row differential is due to me specifying a 5-day window for the rolling average, meaning it won't calculate it for the first 4 rows. But how do I go about telling R that it's OK to put NAs in those rows? Or if you have any other solution, that would work as well. Please do help.
Here's a sample of the data:
Identifier Date Dispersion
1000 2/15/2018 0.390
1000 2/16/2018 0.664
1000 2/17/2018 0.526
1000 2/18/2018 0.933
1000 2/19/2018 0.009
1000 2/20/2018 0.987
1000 2/21/2018 0.517
1000 2/22/2018 0.641
1000 2/23/2018 0.777
1000 2/24/2018 0.613
1001 2/15/2018 0.617
1001 2/16/2018 0.234
1001 2/17/2018 0.303
1001 2/18/2018 0.796
1001 2/19/2018 0.359
1001 2/20/2018 0.840
1001 2/21/2018 0.291
1001 2/22/2018 0.699
1001 2/23/2018 0.882
1001 2/24/2018 0.467
1002 2/15/2018 0.042
1002 2/16/2018 0.906
1002 2/17/2018 0.077
1002 2/18/2018 0.156
1002 2/19/2018 0.350
1002 2/20/2018 0.060
1002 2/21/2018 0.457
1002 2/22/2018 0.770
1002 2/23/2018 0.433
1002 2/24/2018 0.366
You get this error because the length of the rolling means/stds does not match the length of Dispersion. Simply add k - 1 NAs at the beginning of your means/stds vectors.
Below is a working example. You can modify this based on your needs.
library(dplyr)
library(zoo)

my_function <- function(df, k) {
  df %>%
    mutate(
      rolling_mean = c(rep(NA, k - 1), rollmean(Dispersion, k)),
      rolling_std = c(rep(NA, k - 1), rollapply(Dispersion, width = k, FUN = sd))
    )
}
For example, you may want to add group_by to compute these values for each Identifier:
my_function <- function(df, k) {
  df %>%
    group_by(Identifier) %>%
    mutate(
      rolling_mean = c(rep(NA, k - 1), rollmean(Dispersion, k)),
      rolling_std = c(rep(NA, k - 1), rollapply(Dispersion, width = k, FUN = sd))
    )
}
Update, following up on @G. Grothendieck's comment:
It turns out the package zoo already has comprehensive features for NA handling, so the code above can be refactored as:
my_function <- function(df, k) {
  df %>%
    mutate(
      rolling_mean = rollmeanr(Dispersion, k, fill = NA),
      rolling_std = rollapplyr(Dispersion, width = k, FUN = sd, fill = NA)
    )
}
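A usage sketch, assuming the sample data above is stored in a data frame called mydata (a name made up here) and the question's 5-day window; grouping first makes the rolling statistics restart for each Identifier:
Results <- mydata %>%
  group_by(Identifier) %>%
  my_function(k = 5)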
I'd take a look at tibbletime.
Assuming your data frame is named mydata and the Date column is a character: first convert the Date, then convert to a time-aware tibble:
library(dplyr)
library(tibbletime)
mydata <- mydata %>%
  mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
  as_tbl_time(index = Date)
Now you can define functions for rolling mean and sd:
mean_5 <- rollify(mean, window = 5)
sd_5 <- rollify(sd, window = 5)
mydata %>%
  mutate(rolling_mean = mean_5(Dispersion),
         rolling_std = sd_5(Dispersion))
# A time tibble: 30 x 5
# Index: Date
Identifier Date Dispersion rolling_mean rolling_std
<int> <date> <dbl> <dbl> <dbl>
1 1000 2018-02-15 0.39 NA NA
2 1000 2018-02-16 0.664 NA NA
3 1000 2018-02-17 0.526 NA NA
4 1000 2018-02-18 0.933 NA NA
5 1000 2018-02-19 0.009 0.504 0.342
6 1000 2018-02-20 0.987 0.624 0.393
7 1000 2018-02-21 0.517 0.594 0.394
8 1000 2018-02-22 0.641 0.617 0.393
9 1000 2018-02-23 0.777 0.586 0.367
10 1000 2018-02-24 0.613 0.707 0.182
# ... with 20 more rows
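As written, the rolling window runs straight across Identifier boundaries; if the statistics should restart for each Identifier (as in the grouped variant of the other answer), a group_by before the mutate ought to do it (a sketch, not separately run):
mydata %>%
  group_by(Identifier) %>%
  mutate(rolling_mean = mean_5(Dispersion),
         rolling_std = sd_5(Dispersion))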
Say I have a dataframe called RaM that holds cumulative return values. In this case, they literally are just a single row of cumulative return values along with column headers, but I would like to apply the logic to more than just single-row dataframes.
Say I want to sort by the max cumulative return value of each column, or even the average, or the sum of each column.
So each column would be re-ordered so that the max cumulative returns of each column are compared, with the highest return becoming the 1st column and the minimum being the last column.
Then say I want to derive either the top 10 (the 1st 10 columns after they are rearranged) or even the top 10%.
I know how to derive the column averages, but I don't know how to effectively do the remaining operations. There is an order function, but when I used it, it stripped my column names, which I need. I could then easily take the 1st, say, 10 columns, but is there a way that preserves the names? I don't think I can easily extract the names from the unordered original dataframe and apply them to my sorted-by-aggregate dataframe. My goal is to extract the column names of the top n columns (in dataframe RaM) in terms of a column aggregate function over the entire dataframe.
something like
top10 <- getTop10ColumnNames(colSums(RaM))
that would then output a dataframe of the top 10 columns in terms of their sum from RaM
Here's the output of RaM:
> head(RaM,2)
ABMD ACAD ALGN ALNY ANIP ASCMA AVGO CALD CLVS CORT
2013-01-31 0.03794643 0.296774194 0.13009009 0.32219178 0.13008130 0.02857604 0.13014640 -0.07929515 0.23375000 0.5174825
2013-02-28 0.14982079 0.006633499 0.00255102 -0.01823456 -0.05755396 0.07659708 -0.04333138 0.04066986 -0.04457953 -0.2465438
CPST EA EGY EXEL FCSC FOLD GNC GTT HEAR HK HZNP
2013-01-31 -0.05269663 0.08333333 -0.01849711 0.01969365 0 0.4179104 0.07992677 0.250000000 0.2017417 0.10404624 -0.085836910
2013-02-28 0.15051595 0.11443102 -0.04475854 -0.02145923 0 -0.2947368 0.14079036 0.002857143 0.4239130 -0.07068063 -0.009389671
ICON IMI IMMU INFI INSY KEG LGND LQDT MCF MU
2013-01-31 0.07750896 0.05393258 -0.01027397 -0.01571429 -0.05806459 0.16978417 -0.03085824 -0.22001958 0.01345609 0.1924290
2013-02-28 -0.01746362 0.03091684 -0.20415225 0.19854862 0.36849503 0.05535055 0.02189055 0.06840289 -0.09713487 0.1078042
NBIX NFLX NVDA OREX PFPT PQ PRTA PTX RAS REXX RTRX
2013-01-31 0.2112299 0.7846467 0.00000000 0.08950306 0.06823721 0.03838384 -0.1800819 0.04387097 0.23852335 0.008448541 0.34328358
2013-02-28 0.1677704 0.1382251 0.03888981 0.04020979 0.06311787 -0.25291829 0.0266223 -0.26328801 0.05079882 0.026656512 -0.02222222
SDRL SHOS SSI STMP TAL TREE TSLA TTWO UVE VICL
2013-01-31 0.07826093 0.2023956 -0.07788381 0.07103175 -0.14166875 -0.030504714 0.10746974 0.1053588 0.0365299 0.2302405
2013-02-28 -0.07585546 0.1384419 0.08052150 -0.09633197 0.08009728 -0.002860412 -0.07144761 0.2029581 -0.0330408 -0.1061453
VSI VVUS WLB
2013-01-31 0.06485356 -0.0976155 0.07494647
2013-02-28 -0.13965291 -0.1156069 0.04581673
Here's one way, using the first section of your sample data to illustrate. You can gather up all the columns so that the summary calculations are easier, calculate all the summaries by group that you want, and then sort with arrange. Here I ordered with the highest sums first, but you could use whatever order you wanted.
library(tidyverse)
ram <- read_table2(
"ABMD ACAD ALGN ALNY ANIP ASCMA AVGO CALD CLVS CORT
0.03794643 0.296774194 0.13009009 0.32219178 0.13008130 0.02857604 0.13014640 -0.07929515 0.23375000 0.5174825
0.14982079 0.006633499 0.00255102 -0.01823456 -0.05755396 0.07659708 -0.04333138 0.04066986 -0.04457953 -0.2465438"
)
summary <- ram %>%
  gather(colname, value) %>%
  group_by(colname) %>%
  summarise_at(.vars = vars(value), .funs = funs(mean = mean, sum = sum, max = max)) %>%
  arrange(desc(sum))
summary
#> # A tibble: 10 x 4
#> colname mean sum max
#> <chr> <dbl> <dbl> <dbl>
#> 1 ALNY 0.152 0.304 0.322
#> 2 ACAD 0.152 0.303 0.297
#> 3 CORT 0.135 0.271 0.517
#> 4 CLVS 0.0946 0.189 0.234
#> 5 ABMD 0.0939 0.188 0.150
#> 6 ALGN 0.0663 0.133 0.130
#> 7 ASCMA 0.0526 0.105 0.0766
#> 8 AVGO 0.0434 0.0868 0.130
#> 9 ANIP 0.0363 0.0725 0.130
#> 10 CALD -0.0193 -0.0386 0.0407
If you then want to reorder your original data frame, you can get the order from this summary output and index with it:
ram[summary$colname]
#> # A tibble: 2 x 10
#> ALNY ACAD CORT CLVS ABMD ALGN ASCMA AVGO ANIP
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.322 0.297 0.517 0.234 0.0379 0.130 0.0286 0.130 0.130
#> 2 -0.0182 0.00663 -0.247 -0.0446 0.150 0.00255 0.0766 -0.0433 -0.0576
#> # ... with 1 more variable: CALD <dbl>
Created on 2018-08-01 by the reprex package (v0.2.0).
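For just the top-n column names via a base R aggregate, colSums() returns a named vector and sort() keeps those names, so a sketch like this (assuming RaM is all numeric) sidesteps the name-stripping problem from the question:
n <- 10
top_n_names <- names(sort(colSums(RaM), decreasing = TRUE))[1:n]
RaM_top <- RaM[top_n_names]   # subsetting by name keeps the column headers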