I am new to R. I have a large dataset at 1-minute resolution covering one year: 55,940 observations, all one minute apart, with dates and times. I want to convert it to 6-minute resolution, which means summing the first 6 rows, then the next 6, and so on. Any good solutions?
You could try something like this:
library(dplyr)

# original df: one value per minute
df <- data.frame(min = 1:60, val = rnorm(60))

# create a grouping variable (minutes 1-6 -> group 0, 7-12 -> group 1, ...) and add it to df
grp <- (df$min - 1) %/% 6
df <- data.frame(grp, df)

# create the new df at the 6-minute level
new.df <- df %>%
  group_by(grp) %>%
  summarise(new.val = sum(val))
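Since your real data has date-times rather than a minute counter, here is a sketch of the same idea keyed off the timestamp instead, assuming a POSIXct column named timestamp (adjust the names to your data):
library(dplyr)
library(lubridate)

# toy data: one POSIXct timestamp and one value per minute
df <- data.frame(
  timestamp = seq(as.POSIXct("2020-01-01 00:00"), by = "1 min", length.out = 60),
  val = rnorm(60)
)

# floor each timestamp to its 6-minute bin and aggregate
df %>%
  group_by(timestamp = floor_date(timestamp, "6 minutes")) %>%
  summarise(val = sum(val))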
Another option with a similar approach:
library(dplyr)

# original dataframe
n <- 55940
df <- data.frame(id = 1:n, val = rnorm(n))

# new dataframe: cut the row index into ~6-row bins and aggregate
df_new <- df %>%
  group_by(grp = cut(id, n / 6)) %>%
  summarise(new.val = sum(val))
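If you want exactly six consecutive rows per group (cut() only approximates that when n is not a multiple of 6), integer arithmetic on the row index is more direct:
library(dplyr)

df_new <- df %>%
  group_by(grp = ceiling(id / 6)) %>% # rows 1-6 -> 1, rows 7-12 -> 2, ...
  summarise(new.val = sum(val))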
I need to operate on columns based on a condition on their names. In the following reproducible example, for each column that ends with 'x', I create a column that multiplies the respective variable by 2:
library(dplyr)
set.seed(8)
id <- seq(1,700, by = 1)
a1_x <- runif(700, 0, 10)
a1_y <- runif(700, 0, 10)
a2_x <- runif(700, 0, 10)
df <- data.frame(id, a1_x, a1_y, a2_x)
# Create variables manually: for every column that ends with "x",
# create one column that multiplies the respective column by 2
df <- df %>%
  mutate(a1_x_new = a1_x * 2,
         a2_x_new = a2_x * 2)
Since I'm working with several columns, I need to automate this process. Does anybody know how to achieve this? Thanks in advance!
Try this:
df %>% mutate(
  across(ends_with("x"), ~ .x * 2, .names = "{.col}_new")
)
Thanks @RicardoVillalba for the correction.
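The .names argument controls the output column names, with {.col} expanding to the original name. The same pattern extends to several transformations at once via a named list, for example:
df %>% mutate(
  across(ends_with("x"), list(new = ~ .x * 2, half = ~ .x / 2),
         .names = "{.col}_{.fn}")
)
This would create a1_x_new, a1_x_half, a2_x_new and a2_x_half in one call.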
You could use transmute and across to generate the new columns for the column names ending in "x", then rename_with to add the "_new" suffix, and bind_cols to attach them back to the original data frame.
library(dplyr)

df <- df %>%
  transmute(across(ends_with("x"), ~ . * 2)) %>%
  rename_with(~ paste0(.x, "_new")) %>%
  bind_cols(df, .)
Result:
head(df)
id a1_x a1_y a2_x a1_x_new a2_x_new
1 1 4.662952 0.4152313 8.706219 9.325905 17.412438
2 2 2.078233 1.4834044 3.317145 4.156466 6.634290
3 3 7.996580 1.4035441 4.834126 15.993159 9.668252
4 4 6.518713 7.0844794 8.457379 13.037426 16.914759
5 5 3.215092 3.5578827 8.196574 6.430184 16.393149
6 6 7.189275 5.2277208 3.712805 14.378550 7.425611
I have a dataset, and I would like to randomize the order of this dataset 100 times and calculate the cumulative mean each time.
# example data
library(dplyr)

ID <- seq.int(1, 100)
val <- rnorm(100)
df <- cbind(ID, val) %>%
  as.data.frame()
I already know how to calculate the cumulative mean using the function "cummean()" in dplyr.
df2 <- df %>%
  mutate(cm = cummean(val))
However, I don't know how to randomize the dataset 100 times and apply the cummean() function to each iteration of the dataframe. Any advice on how to do this would be greatly appreciated.
I realize this could probably be solved via either a loop, or in tidyverse, and I'm open to either solution.
Additionally, if possible, I'd like to include a column that indicates which iteration the data was produced from (i.e., randomization #1, #2, ..., #100), as well as include the "ID" value, which indicates how many data values were included in the cumulative mean. Thanks in advance!
Here is an approach using the purrr package. As a sanity check on cummean() (dplyr's cumulative mean), the column cm2 computes the same quantity directly as cumsum(val) / seq_along(val).
library(tidyverse)

set.seed(2000)

num_iterations <- 100
num_sample <- 100

1:num_iterations %>%
  map_dfr(
    function(i) {
      tibble(
        iteration = i,
        id = 1:num_sample,
        val = rnorm(num_sample),
        cm = cummean(val),
        cm2 = cumsum(val) / seq_along(val)
      )
    }
  )
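Note that this draws a fresh rnorm() sample on every iteration rather than reshuffling an existing dataset. If you want to permute your own df instead, a sketch along the same lines:
library(tidyverse)

1:num_iterations %>%
  map_dfr(
    function(i) {
      df %>%
        mutate(
          iteration = i,
          val = sample(val), # shuffle the existing values
          cm = cummean(val)  # running mean of the shuffled order
        )
    }
  )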
You can use mutate to create 100 shuffled samples and then call cummean:
library(dplyr)
library(purrr)
df %>% mutate(map_dfc(1:100, ~cummean(sample(val))))
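This adds 100 auto-named columns (...1 through ...100), one cumulative-mean series per shuffle. If you prefer the long format the question describes, one way is to pivot afterwards, e.g.:
library(tidyr)

df %>%
  mutate(map_dfc(1:100, ~ cummean(sample(val)))) %>%
  pivot_longer(-c(ID, val), names_to = "iteration", values_to = "cm",
               names_prefix = "\\.\\.\\.") # strip the "..." prefix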
We may use rerun from purrr:
library(dplyr)
library(purrr)

f1 <- function(dat, valcol) {
  dat %>%
    sample_n(size = n()) %>%
    mutate(cm = cummean({{ valcol }}))
}

n <- 100
out <- rerun(n, f1(df, val))
The output of rerun is a list, which we can name with a sequence; if we need a single data frame, bind the elements together with bind_rows:
out1 <- bind_rows(out, .id = 'ID')
> head(out1)
ID val cm
1 1 0.3376980 0.33769804
2 1 -1.5699384 -0.61612019
3 1 1.3387892 0.03551628
4 1 0.2409634 0.08687807
5 1 0.7373232 0.21696708
6 1 -0.8012491 0.04726439
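Note that rerun() has been superseded in recent purrr releases; map() does the same job here:
out <- map(seq_len(n), ~ f1(df, val))
out1 <- bind_rows(out, .id = 'ID')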
I'd like to expand my dataframe by repeating a row 24 times while adding an additional column for "hour". Here is an example of my data:
set.seed(1)
mydata <- data.frame(Tmin = sample(0:3), Tmax = sample(4:7), Day = rep(1:4))
I want to expand this table such that each row is repeated with the same Tmin, Tmax, and Day 24 times, with an additional column mydata$hour where the numbers 1:24 are repeated for each day. All other values (Tmin, Tmax, Day) stay the same for each row. Thanks!
You can repeat each row index 24 times and then assign a new hour column from 1 to 24 using recycling:
newdata <- mydata[rep(seq_len(nrow(mydata)), each = 24),]
newdata$hour <- 1:24
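A quick sanity check, which should show 24 rows per day:
> table(newdata$Day)

 1  2  3  4
24 24 24 24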
A couple of tidyverse options:
library(dplyr)
mydata %>% tidyr::uncount(24) %>% group_by(Day) %>% mutate(hour = 1:24)
and
mydata %>% group_by(Day) %>% slice(rep(row_number(), 24)) %>% mutate(hour = 1:24)
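A base R cross join does the same thing without grouping; merge() with no common columns returns all combinations:
newdata <- merge(mydata, data.frame(hour = 1:24))
The row order differs from the grouped versions, so sort afterwards (e.g. with arrange(Day, hour)) if that matters.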
Another alternative using lapply and dplyr::mutate. Note the each = 24 argument, which keeps the repeated rows grouped by day so the recycled hour values line up:
library(dplyr)

set.seed(1)
mydata <- data.frame(Tmin = sample(0:3), Tmax = sample(4:7), Day = rep(1:4))

newdata <- as.data.frame(lapply(mydata, rep, each = 24))
newdata %>%
  mutate(hour = rep(1:24, times = nrow(mydata)))
I am attempting to sample a dataframe using sample_n. I know that sample_n usually takes a single size argument, but I would like to draw samples of every size from 2 up to the number of rows in the df. Unfortunately, the code I have compiled below does not do the job. The desired output is a dataframe with an id column, or a list split by the id column from crossing().
df <- data.frame(Date = 1:15,
                 grp = rep(1:3, each = 5),
                 frq = rep(c(3, 2, 4), each = 5))

data_sampled_by_stratum <- df %>%
  group_by(Date) %>%
  crossing(id = seq(500)) %>% # repeat dataframes
  group_by(id) %>%
  sample_n(size = c(2:15)) %>%
  group_by(CLUSTER_ID, Date) %>%
  filter(n() > 2)
If you had a column with different sites, you could do something like this (i and s are placeholders for a site value and a sampling rate that you would need to define):
data_sampled_by_stratum <- data_grouped_by_stratum %>%
  group_by(siteid, Date) %>%
  crossing(id = seq(500)) %>% # repeat dataframes
  sample_n(rbinom(1, sum(siteid == i), (1 - s)^2))
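For the goal as originally stated, one sample of each size from 2 up to nrow(df), each tagged with an id column, a sketch using purrr and slice_sample():
library(dplyr)
library(purrr)

data_sampled_by_stratum <- map_dfr(
  2:nrow(df),
  ~ df %>%
    slice_sample(n = .x) %>% # one random sample of size .x
    mutate(id = .x)          # tag rows with the sample size
)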
So currently I am able to calculate a daily max for one site using the following code:
library('dplyr')
library('data.table')
library('tidyverse')
library('tidyr')
library('lubridate')
# a function's formal arguments cannot contain `$`, so take plain vectors instead
funcVolume <- function(enter_yard, exit_yard)
{
  vecOnes <- array(1, c(length(enter_yard), 1))
  vecTime <- c(enter_yard, exit_yard)
  vecCount <- c(vecOnes, -vecOnes)
  df_test <- data.frame(T = vecTime, Count = vecCount)
  df_test <- df_test %>%
    arrange(T) %>%
    mutate(Volume = cumsum(Count))
  df_test
}

df_test <- funcVolume(max_data$enter_yard, max_data$exit_yard)
df_test2 <- df_test
df_test2$date <- as.Date(format(df_test$T, "%Y-%m-%d"))
df_test3 <- tibble(x = df_test2$Volume, y = df_test2$date) %>%
  arrange(y)

dataset <- df_test3 %>%
  group_by(y) %>%
  dplyr::filter(x == max(x)) %>%
  distinct(x, .keep_all = TRUE) %>%
  ungroup()
However, I would like to do this for multiple locations. In my original dataframe, I have a column that lists the name of the site, and two datetime columns for when an object enters or leaves a site. Ideally, I would want an output that looks like the following:
Date | Max Count | Site
  x  |     y     |  z
  x  |     a     |  b
I also have a couple million rows of data, so I need something that can run in a reasonable time frame.
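One way to extend this to multiple sites, sketched under the assumption that the raw data frame is called max_data with columns site, enter_yard and exit_yard (adjust the names to yours): build the +1/-1 enter/exit events once, then compute the running volume per site and take the daily max per site and date. Grouped cumsum() and max() are vectorized, so this should stay fast even with millions of rows.
library(dplyr)

daily_max <- bind_rows(
  max_data %>% transmute(site, time = enter_yard, count = 1L),  # +1 on entry
  max_data %>% transmute(site, time = exit_yard,  count = -1L)  # -1 on exit
) %>%
  arrange(site, time) %>%
  group_by(site) %>%
  mutate(volume = cumsum(count)) %>% # running occupancy per site
  group_by(site, date = as.Date(time)) %>%
  summarise(max_count = max(volume), .groups = 'drop')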