I have a dataset with 3.9M rows, 5 columns, stored as a tibble. When I try to convert it to tsibble, I run out of memory even though I have 32 GB which should be way more than enough. The weird thing is that if I apply a filter function before piping it into as_tsibble() then it works, even though I'm not actually filtering out any rows.
This does not work:
dataset %>% as_tsibble(index = TimeStamp, key = c("TSSU", "Phase"))
This works, even though there are no "Phase" values less than 1, so the filter removes no rows:
dataset %>% filter(Phase > 0) %>% as_tsibble(index = TimeStamp, key = c("TSSU", "Phase"))
Any ideas why the second option works? Here's what the dataset looks like:
  Volume Travel_Time TSSU  Phase TimeStamp
   <dbl>       <dbl> <chr> <int> <dttm>
     105        1.23 01017     2 2020-09-28 10:00:00
      20        1.11 01017     2 2020-09-28 10:15:00
Have you tried using the data.table library? It is optimized for performance with large datasets. I have replicated your steps; depending on where the dataset variable comes from, you may also want to use the fread function to load the data, since it is very fast as well.
library(data.table)
dataset <- data.table(dataset)   # convert the tibble to a data.table
# setkeyv(x = dataset, cols = c("TSSU", "Phase"))   # this line may not be needed
dataset[Phase > 0, ]             # same filter as in your working example
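For example, if dataset originally comes from a delimited file on disk, loading it with fread might look like this (just a sketch; the file name "dataset.csv" is hypothetical):
library(data.table)
dataset <- fread("dataset.csv")   # fast reader, returns a data.table directly
dataset <- dataset[Phase > 0, ]   # same no-op filter as in your working example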
I'm new to plyr and dplyr and seriously don't get it. I have managed my way around some functions, but I struggle with really basic stuff such as the following example.
Taking mtcars, I have different overlapping subsets, such as vs = 1 and am = 1.
I now want to run the same analysis, in this case median(), for one variable over the different subsets, and another analysis, such as mean(), for another variable.
In the end this should give me the same result as the following code, just much shorter:
data_mt <- mtcars # has binary dummy vars for grouping
data_vs <- data_mt[ which(data_mt$vs == 1 ), ]
data_am <- data_mt[ which(data_mt$am == 1 ), ]
median(data_mt$mpg)
median(data_vs$mpg)
median(data_am$mpg)
mean(data_mt$cyl)
mean(data_vs$cyl)
mean(data_am$cyl)
In my real example, I have an analog to data_mt, so if you have a solution that starts there, without creating data_vs etc., that would be great.
I'm sure this is very basic, but I can't wrap my head around it - and as I have some 1,500 variables that I want to look at, I'd appreciate your help =)
It may well be that my answer is already out there, but with the terminology I know, I couldn't find it explained for Dummies ;D
Edit:
To have a better understanding of what I am doing and what I am looking for, I hereby post my original code (not the mtcars example).
I have a dataset ds with 402 observations of 553 variables.
The dataset comes from a study with human participants, some of whom opted in to additional research (mys, obs, or both).
ds$mys <- 0
ds$mys[ which(ds$staffmystery_p == "Yes" ) ] <- 1
ds$obs <- 0
ds$obs[ which( !is.na(ds$sales_time)) ] <- 1
The 553 variables are either integers (e.g. age or years of experience) or factors (e.g. sex or yes/no answers). I now want to compare some descriptives of the full dataset with the descriptives for the subsets, and ideally also do a t-test for the difference.
Currently I have just a very long list of code that reads more or less like the following (just much longer). This doesn't include t-tests.
describe(ds$age_b)
describe(dm$age_b)
describe(do$age_b)
prop.table(table(ds$sex_b))*100
prop.table(table(dm$sex_b))*100
prop.table(table(do$sex_b))*100
ds, dm and do are different datasets, but they are all based on the full dataset ds mentioned above: dm is the subset with ds$mys == 1 and do is the subset with ds$obs == 1.
describe comes from the psych package and just lists descriptive statistics like mean, median, etc. I don't need all of the metrics, mostly n, mean, median, sd and iqr.
The expression around prop.table gives me a readout I can copy straight into the Excel tables I use for the final publications. I don't want fully automated output, because I'm constantly asked to add or change things, and that is just easier in Excel than with automated output (unless you know a much superior way ;)
Thank you so much!
Here is an option if we want to do this for different columns by group separately
library(dplyr)
library(purrr)
library(stringr)
map_dfc(c('vs', 'am'), ~
    mtcars %>%
        group_by(across(all_of(.x))) %>%
        summarise(!! str_c("Mean_cyl_", .x) := mean(cyl),
                  !! str_c("Median_mpg_", .x) := median(mpg),
                  .groups = 'drop')) %>%
    mutate(Mean_cyl_full = mean(mtcars$cyl),
           Median_mpg_full = median(mtcars$mpg))
Output:
# A tibble: 2 x 8
# vs Mean_cyl_vs Median_mpg_vs am Mean_cyl_am Median_mpg_am Mean_cyl_full Median_mpg_full
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 0 7.44 15.6 0 6.95 17.3 6.19 19.2
#2 1 4.57 22.8 1 5.08 22.8 6.19 19.2
If the dplyr version is older, we can replace across with group_by_at:
map_dfc(c('vs', 'am'), ~
    mtcars %>%
        group_by_at(vars(.x)) %>%
        summarise(!! str_c("Mean_cyl_", .x) := mean(cyl),
                  !! str_c("Median_mpg_", .x) := median(mpg),
                  .groups = 'drop')) %>%
    mutate(Mean_cyl_full = mean(mtcars$cyl),
           Median_mpg_full = median(mtcars$mpg))
Update
Based on the update, we could place the datasets in a list, do the transformations in one go, and return a list with the descriptive statistics and the proportion table for each:
out <- map(dplyr::lst(dm, ds, do), ~ {
    dat <- .x %>%
        mutate(mys = as.integer(staffmystery_p == 'Yes'),
               obs = as.integer(!is.na(sales_time)))
    age_b_desc <- describe(dat$age_b)               # describe() from the psych package
    prop_table_out <- prop.table(table(dat$sex_b)) * 100
    dplyr::lst(age_b_desc, prop_table_out)
})
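out is then a named list with one element per dataset, so the individual pieces can be pulled out like this (assuming dm and do already exist as the subset data frames):
out$ds$age_b_desc        # describe() output for the full dataset
out$dm$prop_table_out    # sex proportions (in %) for the dm subset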
I have the dataframe below and I would like to subset it so that I find the observation where a name covered the longest distance between two consecutive observations. If a name moves exactly the same number of meters in more than one pair of observations, I want to select the most recent one.
So as the final result I would like to have 2 rows: the consecutive pair with the longest distance, and if there is more than one such pair, only the most recent should remain. Then I will take those 2 points and display them on a map.
Here is my data:
name <- c("bal","bal","bal","bal","bal","bal","bal","bal")
LAT <- c(54.77127, 54.76542, 54.76007, 54.75468, 54.74926, 54.74385, 54.73847, 54.73228)
LON <- c(18.99692, 18.99361, 18.99059, 18.98753, 18.98447, 18.98150, 18.97842, 18.97505)
dtime <- c("2016-12-16 02:37:02", "2016-12-16 02:38:02", "2016-12-16 02:38:32", "2016-12-16 02:39:08",
           "2016-12-16 02:39:52", "2016-12-16 02:41:02", "2016-12-16 02:42:02", "2016-12-16 02:42:32")
df <- data.frame(name, LAT, LON, dtime)
and this is how I think I should calculate the distance:
library(geosphere)
distm(c(as.numeric(df[1,3]),as.numeric(df[1,2])), c(as.numeric(df[2,3]),as.numeric(df[2,2])), fun = distHaversine)
and this for the time difference:
difftime("2016-12-19 11:33:01", "2016-12-19 11:31:01", units = "secs")
but how can I combine them?
I think you can do everything with one pipeline in dplyr
library(dplyr)
library(geosphere)   # for distm() and distHaversine()

df %>%
    group_by(name) %>%
    mutate(lag_LAT = lag(LAT), lag_LON = lag(LON)) %>%
    mutate(distance = diag(distm(cbind(lag_LON, lag_LAT), cbind(LON, LAT), fun = distHaversine)),
           timespan = difftime(dtime, lag(dtime), units = "secs")) %>%
    slice_max(distance) %>%
    slice_max(dtime) %>%
    ungroup()
#> # A tibble: 1 x 8
#> name LAT LON dtime lag_LAT lag_LON distance timespan
#> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <drtn>
#> 1 bal 54.7 19.0 2016-12-16 02:42:32 54.7 19.0 722. 30 secs
Given your request in the comment, I added the first mutate to keep track of the previous position, so that you're able to plot it later.
Having everything in one row is much better than having two separate rows.
With the second mutate you can calculate the distance between two following points and the time difference.
I did not question whether the calculation of the distance is correct.
I assumed you knew better than I do.
The first slice_max identifies the max distance, while the second one is needed just in case of ties in the first (you said you were looking for the most recent in case of ties).
I grouped because I figured you may have more than one name in your dataset.
I did not get why you need to calculate the time difference, but I left it.
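If you still want the two points back as separate rows for mapping, here is a possible sketch (assuming the result of the pipeline above is stored in res; the point column is just illustrative):
library(dplyr)
two_points <- bind_rows(
    res %>% transmute(name, LAT = lag_LAT, LON = lag_LON, point = "previous"),
    res %>% transmute(name, LAT, LON, point = "current")
)
two_points   # two rows: the start and the end of the longest jump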
I am calculating error rates for two different forecasting methods.
My basic approach is to group by nk, calculate the errors for each method, and choose the one with the lower error rate. The issue is that MAP1E_arima_ds and MAPE_cagr_ds come out as the same value for every group; somehow the group_by is not being used in the calculation.
Here is something I tried:
temp2 %>%
    group_by(nk) %>%
    mutate(MAP1E_arima_ds = sum(temp2$ABS_arima_error_ds)/nrow(temp2)) %>%
    mutate(MAPE_cagr_ds = sum(temp2$ABS_cagr_error_ds)/nrow(temp2))
So the expected result is something like:
nk       MAP1E_arima_ds  MAPE_cagr_ds
1-G0175  value_x         value_y
1-H0182  value_z         value_a
so that I can compare the error rates and choose the forecasting method with the lower error.
If I understand you correctly, I think what you are looking for is this
library(dplyr)
df %>%
    group_by(nk) %>%
    summarise(MAP1E_arima_ds = sum(ABS_arima_error_ds)/n(),
              MAPE_cagr_ds = sum(ABS_cagr_error_ds)/n())
# A tibble: 2 x 3
# nk MAP1E_arima_ds MAPE_cagr_ds
# <chr> <dbl> <dbl>
#1 1-G0175 14.7 3.38
#2 1-H0182 2.91 7.40
which is actually the mean:
df %>%
    group_by(nk) %>%
    summarise(MAP1E_arima_ds = mean(ABS_arima_error_ds),
              MAPE_cagr_ds = mean(ABS_cagr_error_ds))
Moreover, after copying your dput, it seems that your data is already grouped by nk, so the following would also give the same result:
df %>%
    summarise(MAP1E_arima_ds = mean(ABS_arima_error_ds),
              MAPE_cagr_ds = mean(ABS_cagr_error_ds))
I have a dataframe that looks like this in R:
library(dplyr)
group <- c(1,2,3,4,5,6)
num_click <- c(33000, 34000, 35000, 33500, 34500, 32900)
num_open <- c(999000, 999500, 1000000, 1000050, 985000, 999999)
df <- data.frame(group, num_click, num_open)
> df
# group num_click num_open
# 1 1 33000 999000
# 2 2 34000 999500
# 3 3 35000 1000000
# 4 4 33500 1000050
# 5 5 34500 985000
# 6 6 32900 999999
and I've written two trivial functions that I would like to apply to each row:
prop_test_ctr <- function(open, click){
    return(prop.test(c(click, 34000), c(open, 999000), correct = FALSE)$p.value)
}

add_one_to_group <- function(group) {
    return(group + 1)
}
The prop_test_ctr function uses prop.test from R's stats package to test the null hypothesis that the proportions of several groups are the same; $p.value is the value I am grabbing here, which corresponds to the p-value of the test.
The add_one_to_group function is a simple function to add 1 to each group_num in the df so I can verify that rowwise() is working as expected.
When I try to build a new results dataframe by applying the two functions to each row using dplyr's rowwise() with the following:
results <- df %>%
    filter(group %in% c(1,2)) %>%
    rowwise() %>%
    mutate(p_value_ctr = prop_test_ctr(num_open, num_click),
           group_plus_one = add_one_to_group(group))
it yields this output:
results
# A tibble: 2 x 5
group num_click num_open p_value_ctr group_plus_one
* <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 33000 999000 0.00004201837 2
2 2 34000 999500 0.00004201837 3
Where the p_value_ctr column is incorrect: instead of calculating the p-value for the difference in clicks and opens for each row, it calculates the p-value for the combination of groups 1 and 2 plus the values hard-coded in the prop_test_ctr function (34000 and 999000).
The add_one_to_group function works as expected with rowwise(), but p_value_ctr does not. The p-value that the p_value_ctr function returns is actually equal to the value I get if I run:
prop.test(c(33000, 34000, 34000), c(999000, 999500, 999000))$p.value
which suggests that the whole vector of clicks and opens for groups 1 and 2 is being passed to the function instead of the single-row values I intended (hence the use of rowwise()).
I know there are other ways to accomplish this, but I'm specifically curious whether I can stay within the dplyr universe here (as opposed to using sapply() and then cbinding those results to the original df, for example), because it seems like this should be the intended behavior of rowwise(); I've just messed something up.
Thank you for your help!!
It looks like the problem was due to the mutate function being masked by another identically named function (most likely plyr::mutate). Restarting in a clean R session fixed the problem.
Thank you #user2738526 for your response! Looks like mutate being masked was the issue.
Because of the generic nature of dplyr function names, I often prefix them with dplyr:: even when I've attached the package.
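For example, the pipeline from the question with explicit prefixes would be (same logic, just namespaced as a precaution against masking):
library(dplyr)
results <- df %>%
    dplyr::filter(group %in% c(1, 2)) %>%
    dplyr::rowwise() %>%
    dplyr::mutate(p_value_ctr = prop_test_ctr(num_open, num_click),
                  group_plus_one = add_one_to_group(group))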
I have a data frame in R that lists monthly sales data by department for a store. Each record contains a month/year, a department name, and the total sales in that department for the month. I'm trying to calculate the mean sales by department, adding them to the vector avgs, but I seem to be having two problems: the total sales per department is not compiling at all (it's evaluating to zero) and avgs is compiling by record instead of by department. Here's what I have:
avgs = c()
for(dept in data$departmentName){
    total <- 0
    for(record in data){
        if(identical(data$departmentName, dept)){
            total <- total + data$ownerSales[record]
        }
    }
    avgs <- c(avgs, total/72)
}
Upon looking at avgs on completion of the loop, I find that it's returning a vector of zeroes the length of the data frame rather than a vector of 22 averages (there are 22 departments). I've been tweaking this forever and I'm sure it's a stupid mistake, but I can't figure out what it is. Any help would be appreciated.
Why not use library(dplyr)?
library(dplyr)
data(iris)
iris %>%
    group_by(Species) %>%                          # or dept
    summarise(total_plength = sum(Petal.Length),   # total owner sales
              weird_divby72 = total_plength/72)    # total/72?
# A tibble: 3 × 3
Species total_plength weird_divby72
<fctr> <dbl> <dbl>
1 setosa 73.1 1.015278
2 versicolor 213.0 2.958333
3 virginica 277.6 3.855556
Your case would probably look like this:
data %>%
    group_by(deptName) %>%
    summarise(total_sales = sum(ownerSales),
              monthly_sales = total_sales/72)
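And if each department really has exactly 72 monthly records, dividing the total by 72 is the same as taking the mean, so this sketch (assuming the same column names) would be equivalent:
data %>%
    group_by(deptName) %>%
    summarise(avg_sales = mean(ownerSales))   # monthly average per department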
I like dplyr for its syntax and pipeability. I think it is a huge improvement over base R for ease of data wrangling. Here is a good cheat sheet to help you get rolling: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf