Grouped moving average with new time window in R

A minimal example would be:
set.seed(42)
df <- data.frame(ID = rep("P1", 100),
                 treatment = c(rep("A", 50), rep("B", 50)),
                 t = rep(1:50, 2),
                 x = rnorm(n = 100))
which, let's imagine, is some value measured each second. Now I would also want the average of each 20-second window, so 1-20, 21-40, ... (i.e. no overlap).
My actual data has an ID column representing multiple subjects, a treatment column with labels for two treatments, and a time column in increments of 5 (0, 5, 10, 15, 20); there are other important grouping columns as well. The values are from EEG recordings, and I need to increase the window size for some analyses.
What would be the cleanest way to do this? Note that there must be a way of knowing which time window the calculated average relates to (just 20, 40, ... would be enough, not a string like "1-20").
EDIT:
Based on a now-deleted comment, I got here:
df %>%
  mutate(timeWin = ceiling(t / 20) * 20) %>%
  group_by(ID, treatment, timeWin) %>%
  summarise(xAvg = mean(x))
The only issue with that bit of code is the rather crude use of ceiling: in the example above, the final 41-50 bracket is labelled 60.

df$grouped_time = ave(df$t, ceiling(df$t / 20), FUN = max)  # label each window with its last time point
aggregate(df["x"], df[c("ID", "treatment", "grouped_time")], mean)
# ID treatment grouped_time x
#1 P1 A 20 0.19192002
#2 P1 B 20 0.27873536
#3 P1 A 40 -0.27099180
#4 P1 B 40 0.01661547
#5 P1 A 50 -0.02021535
#6 P1 B 50 -0.08719458
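The same last-time-in-window labelling also works inside the dplyr pipeline from the question; a minimal sketch, using the example df from above:
library(dplyr)
df %>%
  mutate(timeWin = ave(t, ceiling(t / 20), FUN = max)) %>%  # labels 20, 40, 50 rather than 20, 40, 60
  group_by(ID, treatment, timeWin) %>%
  summarise(xAvg = mean(x))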

Related

Way to loop over multiple tables and keep only if condition met?

So I'm working on a project that has multiple data tables, separated by month, that I need to iterate through. Speed is of the essence here, and I can't seem to get the time down to something reasonable unless I do a lot of cross joins through data.table functions. So here are my tables:
TABLE 1
Product Date Cost
A 8/1/2020 10
A 8/2/2020 20
A 8/3/2020 30
B 8/4/2020 15
B 8/5/2020 25
B 8/6/2020 35
and TABLE 2:
Product Date Price
A 9/1/2020 20
A 9/2/2020 30
A 9/3/2020 40
B 9/4/2020 27
B 9/5/2020 33
B 9/6/2020 42
So I need to iterate over every combination of Table 2 Price minus Table 1 Cost, and do it by Product. The output would be:
NEW TABLE
Product Date1 Date2 Profit
A 8/1/2020 9/1/2020 10
A 8/1/2020 9/2/2020 20
...
EDIT: To clarify, the New Table should continue on. Product A should have 27 different profits (3 dates under A in Table 1 x 3 dates under A in Table 2 x 3 discount rates), assuming they are all above 0. If any of the profits are below 0, then I don't want them as part of the New Table.
I also have a Discount factor I need to apply to each permutation of Price, as we give discounts quite a bit:
Discount = c(0.10, 0.12, 0.18)  # 10%, 12%, 18%
I've tried using a loop and various ways of using apply, but the loops take way too long to finish (hours, and some never do). The combinations lead to millions of rows, but I only want to keep the profitable ones, where Price*Discount > Cost, which are only maybe 10,000 in number.
My solution is to cross join the data tables to create a massive table that I can vectorize against, which is much faster (around 1 min), but with some of the larger tables I quickly run into memory constraints and it isn't very scalable.
CTbl = setkey(CTbl[, c(k = 1, .SD)], k)[Price[, c(k = 1, .SD)], allow.cartesian = TRUE][, k := NULL]
CTbl[, Profit := Discount * Price - Cost]
CTbl = setDT(CTbl)[, .SD[Price > Cost]]
DT = CTbl[, list(MinProfit = min(Profit)), by = Product]
Of course this is quite fast, but it is a huge waste of memory when all I really want is the profitable rows, and there's still the ongoing memory issue.
Can anyone help? I've asked some R users at work but they seem stumped as well; the loops they made couldn't get close to the sub-5 minutes it takes to run the above. I don't mind a bit of extra time if it means I can scale it up.
Thanks!
This sounds like a problem for the dplyr package, which allows you to string together data operations in a "pipe" so that intermediate results don't need to be stored as separate objects. The pipe operator %>% takes the output of the function on the left and uses it as the first argument of the function on the right. Each function in the dplyr package works over an entire vector or data frame, so there is no need for loops.
So, your operation might look like the following:
# Initialize random data like your first table
df1 <- data.frame(product = sample(LETTERS[1:10], 10000, replace = TRUE),
                  date1 = sample(seq(as.Date("2020/08/01"), as.Date("2020/08/31"),
                                     by = "day"), 10000, replace = TRUE),
                  cost = round(runif(10000, 5, 100)))
# Initialize random data like your second table
df2 <- data.frame(product = sample(LETTERS[1:10], 10000, replace = TRUE),
                  date2 = sample(seq(as.Date("2020/09/01"), as.Date("2020/09/30"),
                                     by = "day"), 10000, replace = TRUE),
                  price = round(runif(10000, 5, 100)))
# Initialize discounts: each = 4 pairs every product with all four rates
discounts <- data.frame(product = rep(LETTERS[1:10], each = 4),
                        discount = rep(c(0, 0.1, 0.12, 0.18), times = 10))
library(dplyr)
out_table <- df1 %>%
  full_join(df2) %>%
  full_join(discounts) %>%
  mutate(profit = price * discount - cost) %>%
  filter(profit > 0)
For my random data, this takes about 3 seconds on my machine. Furthermore, the filter verb only keeps those rows we want.
This is not a complete answer to your question, but maybe you can loop over the products. The following function finds the profits for a specified product. The function does not include the discount, but that can be added if the function works the way you want.
profit = function(product, df1, df2) {
  cost = df1[df1$Product == product, 'Cost']
  price = df2[df2$Product == product, 'Price']
  # cross join of the two date columns; cost dates vary fastest
  date = merge(df1[df1$Product == product, 'Date'],
               df2[df2$Product == product, 'Date'])
  # price[j] - cost[i] for every combination, in the same order as `date`
  # (outer also works when the two products have unequal row counts)
  prof = outer(cost, price, function(co, pr) pr - co)
  out = data.frame(date[which(prof > 0), ], prof[which(prof > 0)])
  names(out) = c('costdate', 'pricedate', 'profit')
  return(out)
}
Example:
df1 = data.frame(Product = c('A', 'A', 'A', 'B', 'B', 'B'),
                 Date = c('8/1/2020', '8/2/2020', '8/3/2020', '8/4/2020', '8/5/2020', '8/6/2020'),
                 Cost = c(10, 20, 30, 15, 25, 35))
df2 = data.frame(Product = c('A', 'A', 'A', 'B', 'B', 'B'),
                 Date = c('9/1/2020', '9/2/2020', '9/3/2020', '9/4/2020', '9/5/2020', '9/6/2020'),
                 Price = c(20, 30, 40, 27, 33, 42))
> profit('A', df1, df2)
costdate pricedate profit
1 8/1/2020 9/1/2020 10
4 8/1/2020 9/2/2020 20
5 8/2/2020 9/2/2020 10
7 8/1/2020 9/3/2020 30
8 8/2/2020 9/3/2020 20
9 8/3/2020 9/3/2020 10
> profit('B', df1, df2)
costdate pricedate profit
1 8/4/2020 9/4/2020 12
2 8/5/2020 9/4/2020 2
4 8/4/2020 9/5/2020 18
5 8/5/2020 9/5/2020 8
7 8/4/2020 9/6/2020 27
8 8/5/2020 9/6/2020 17
9 8/6/2020 9/6/2020 7
I could not test it properly since I have limited data.
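As a sketch of how the discount from the question could be folded into this per-product idea (assuming df1 and df2 as above, and the question's Profit formula, Discount * Price - Cost):
profit_discounted <- function(p, df1, df2, discounts = c(0.10, 0.12, 0.18)) {
  cost = df1[df1$Product == p, ]
  price = df2[df2$Product == p, ]
  # every cost row x price row x discount rate for this product
  grid = expand.grid(i = seq_len(nrow(cost)), j = seq_len(nrow(price)), d = discounts)
  out = data.frame(Product = p,
                   Date1 = cost$Date[grid$i],
                   Date2 = price$Date[grid$j],
                   Discount = grid$d,
                   Profit = price$Price[grid$j] * grid$d - cost$Cost[grid$i])
  out[out$Profit > 0, ]  # keep only the profitable combinations
}
# bind the per-product results, so only one product's grid is in memory at a time
result = do.call(rbind, lapply(unique(df1$Product), profit_discounted, df1, df2))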

How to add values of one column based on conditional statement of another column that has blank cells?

I'm trying to subset data based on a conditional statement on a column that has blank values; a blank means the employee logged in multiple times on a work order. An example data set is shown below:
employee_name <- c("Person A", "Person A", "Person A", "Person A", "Person A",
                   "Person B", "Person B", "Person B")
work_order <- c("WO001", "WO001", "WO001", "WO002", "WO003", "WO001", "WO003", "WO003")
num_of_points <- c(40, "", "", 64, 25, 20, 68, "")
time <- c(10, 30, 15, 20, 25, 5, 15, 30)
final_summary <- data.frame(employee_name, work_order, num_of_points, time)
View(final_summary)
Basically, I want to sum up the points and time by selecting all rows with points > 30, grouped by employee name and work order, giving one summed row per employee and work order.
I can do the summarise step properly, but when I perform the initial subset, it excludes the blank rows of num_of_points and thus does not include all the adjacent time (in minutes) values. This makes sense, because subset(num_of_points > 30) only finds values greater than 30. How can I tweak this to include the blank rows, so that I can filter the data and compute the sum of time accurately, grouped by unique work order and employee name?
Convert num_of_points to numeric class, group by 'employee_name' and 'work_order', get the sum of 'num_of_points' where it is greater than 30 along with the sum of 'time', then filter out the rows where 'num_of_points' is 0.
library(dplyr)
final_summary %>%
  mutate(num_of_points = as.numeric(num_of_points)) %>%
  group_by(employee_name, work_order) %>%
  summarise(num_of_points = sum(num_of_points[num_of_points > 30], na.rm = TRUE),
            time = sum(time)) %>%
  filter(num_of_points > 0)
# A tibble: 3 x 4
# Groups: employee_name [2]
# employee_name work_order num_of_points time
# <chr> <chr> <dbl> <dbl>
#1 Person A WO001 40 55
#2 Person A WO002 64 20
#3 Person B WO003 68 45
In base R you can do:
aggregate(.~employee_name + work_order, type.convert(final_summary), sum, subset = num_of_points>30)
employee_name work_order num_of_points time
1 Person A WO001 40 10
2 Person A WO002 64 20
3 Person B WO003 68 15
Note that subset drops whole rows, which is why time above is summed over only the rows with more than 30 points. To match the desired output, you can aggregate num_of_points and time separately and merge the results.
merge(aggregate(num_of_points ~ employee_name + work_order, final_summary,
                sum, subset = num_of_points > 30),
      aggregate(time ~ employee_name + work_order, final_summary, sum))
# employee_name work_order num_of_points time
#1 Person A WO001 40 55
#2 Person A WO002 64 20
#3 Person B WO003 68 45

ggplot2 alternatives to fill in barplots, occurrence of factor in multiple rows

I'm pretty new to R and I have a problem with plotting a barplot out of my data, which looks like this:
condition answer
2 H
1 H
8 H
5 W
4 M
7 H
9 H
10 H
6 H
3 W
The data consists of 100 rows with the conditions 1 to 10, each randomly generated 10 times (10 times condition 1, 10 times condition 8, ...). Each of the conditions also has an answer, which can be H for Hit, M for Miss or W for Wrong.
I want to plot the number of Hits for each condition in a barplot (for example, 8 hits out of 10 for condition 1, ...). For that I tried the following in ggplot2:
ggplot(data = test, aes(x = test$condition, fill = answer == "H")) +
  geom_bar() + labs(x = "Conditions", y = "Hitrate") +
  coord_cartesian(xlim = c(1:10), ylim = c(0:10)) +
  scale_x_continuous(breaks = seq(1, 10, 1))
The resulting plot was exactly what I need, except for the red color, which covers everything. You can see that conditions 3 to 5 have no blue bar, because there are no hits for these conditions.
Is there any way to get rid of the red color, and maybe to count the number of hits for the different conditions? I tried the count function of dplyr, but it only showed the number of H's when there were some for a particular condition; conditions 3 to 5 were simply ignored by count, there wasn't even a 0 in the output, yet I still need those numbers for the plot.
I'm sorry for this particularly long post, but I'm really at the end of my knowledge here. I'd be open to suggestions or alternatives! Thanks in advance!
This is a situation where a little preprocessing goes a long way. I made sample data that would recreate the issue, i.e. has cases where there won't be any "H"s.
Instead of relying on ggplot to aggregate data in the way you want it, use proper tools. Since you mention dplyr::count, I use dplyr functions.
The preprocessing task is to count observations with answer "H", including cases where the count is 0. To make sure all combinations are retained, convert condition to a factor and set .drop = F in count, which is in turn passed to group_by.
library(dplyr)
library(ggplot2)
set.seed(529)
test <- data.frame(condition = rep(1:10, times = 10),
                   answer = c(sample(c("H", "M", "W"), 50, replace = T),
                              sample(c("M", "W"), 50, replace = T)))
hit_counts <- test %>%
  mutate(condition = as.factor(condition)) %>%
  filter(answer == "H") %>%
  count(condition, .drop = F)
hit_counts
#> # A tibble: 10 x 2
#> condition n
#> <fct> <int>
#> 1 1 0
#> 2 2 1
#> 3 3 4
#> 4 4 2
#> 5 5 3
#> 6 6 0
#> 7 7 3
#> 8 8 2
#> 9 9 1
#> 10 10 1
Then just plot that. geom_col is the version of geom_bar for when you already have your y-values, instead of having ggplot tally them up for you.
ggplot(hit_counts, aes(x = condition, y = n)) +
geom_col()
One option is to just filter out everything but the rows where answer == "H" from your dataset, and then plot.
An alternative is to use a grouped bar plot, made by setting position = "dodge":
test <- data.frame(condition = rep(1:10, each = 10),
                   answer = sample(c('H', 'M', 'W'), 100, replace = T))
ggplot(data = test) +
  geom_bar(aes(x = condition, fill = answer), position = "dodge") +
  labs(x = "Conditions", y = "Hitrate") +
  coord_cartesian(xlim = c(1:10), ylim = c(0:10)) +
  scale_x_continuous(breaks = seq(1, 10, 1))
Also note that if the condition is actually a categorical variable, it may be better to make it a factor:
test$condition <- as.factor(test$condition)
This means that you don't need the scale_x_continuous call, and that the grid lines will be cleaner.
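With the factor conversion in place, a sketch of how the plot call from the question simplifies:
ggplot(data = test, aes(x = condition, fill = answer == "H")) +
  geom_bar() +
  labs(x = "Conditions", y = "Hitrate")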
Another option is to pick your fill colors explicitly and make FALSE transparent by using scale_fill_manual. Since FALSE comes alphabetically first, the first value to specify is FALSE, the second TRUE.
ggplot(data = test, aes(x = condition, fill = answer == "H")) +
  geom_bar() + labs(x = "Conditions", y = "Hitrate") +
  coord_cartesian(xlim = c(1:10), ylim = c(0:10)) +
  scale_x_continuous(breaks = seq(1, 10, 1)) +
  scale_fill_manual(values = c(alpha("red", 0), "cadetblue")) +
  guides(fill = F)

Selecting 10 names based on 10 highest numbers of other column

I want to select the top 10 most voted restaurants and plot them together; that is, create a plot that shows the restaurant names and their votes.
I used:
topTenVotes <- top_n(dataSet, 10, Votes)
and it showed me the rows of the data set with the top 10 highest votes; however, I want just the number of votes and the restaurant names.
My question is how to select only the top 10 highest votes and their restaurant names, and how to plot them together.
expected output:
Restaurant Names Votes
A 300
B 250
C 230
D 220
E 210
F 205
G 200
H 194
I 160
J 120
K 34
And then a bar plot that shows these restaurant names and their votes
Another simple approach with base functions, creating an intermediate named vector:
df <- data.frame(Names = LETTERS, Votes = sample(40:400, length(LETTERS)))
x <- df$Votes
names(x) <- df$Names # x <- setNames(df$Votes, df$Names) is another approach
barplot(sort(x, decreasing = TRUE)[1:10], xlab = "Restaurant Name", ylab = "Votes")
Or a one-line solution with base functions:
barplot(sort(xtabs(Votes ~ Names, df), decreasing = TRUE)[1:10], xlab = "Restaurant Names")
I'm not seeing a data set to use, so here's a minimal example to show how it might work:
library(tidyverse)
df <-
  tibble(
    restaurant = c("res1", "res2", "res3", "res4"),
    votes = c(2, 5, 8, 6)
  )
df %>%
  arrange(-votes) %>%
  head(3) %>%
  ggplot(aes(x = reorder(restaurant, votes), y = votes)) +
  geom_col() +
  coord_flip()
The top_n command also works in this case but is designed for grouped data.
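For example, a minimal sketch using the df above (note that ties may return more than three rows):
df %>%
  top_n(3, votes) %>%
  ggplot(aes(x = reorder(restaurant, votes), y = votes)) +
  geom_col() +
  coord_flip()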
It's also possible, though less readable, to use base functions:
#toy data
d <- data.frame(Names = sample(LETTERS, size = 15), value = rnorm(n = 15, mean = 25, sd = 10))
head(d)
Names value
1 D 25.592749
2 B 28.362303
3 H 1.576343
4 L 28.718517
5 S 27.648078
6 Y 29.364797
# reorder by value and retain the top 10
newdata <- data.frame()
for (i in 1:10) {
  newdata <- rbind(newdata, d[which(d$value == sort(d$value, decreasing = T)[1:10][i]), ])
}
newdata
Names value
8 W 45.11330
13 K 36.50623
14 P 31.33122
15 T 30.28397
6 Y 29.36480
7 Q 29.29337
4 L 28.71852
10 Z 28.62501
2 B 28.36230
5 S 27.64808
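For comparison, a sketch of the usual loop-free idiom for the same subset, using the toy data d from above:
newdata <- d[order(d$value, decreasing = TRUE), ][1:10, ]  # top 10 rows by value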

programmatically create new variables which are sums of nested series of other variables

I have data giving me the percentage of people in some groups who have various levels of educational attainment:
df <- data_frame(group = c("A", "B"),
                 no.highschool = c(20, 10),
                 high.school = c(70, 40),
                 college = c(10, 40),
                 graduate = c(0, 10))
df
# A tibble: 2 x 5
group no.highschool high.school college graduate
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 20. 70. 10. 0.
2 B 10. 40. 40. 10.
E.g., in group A 70% of people have a high school education.
I want to generate 4 variables that give me the proportion of people in each group with less than each of the 4 levels of education (e.g., lessthan_no.highschool, lessthan_high.school, etc.).
desired df would be:
desired.df <- data.frame(group = c("A", "B"),
                         no.highschool = c(20, 10),
                         high.school = c(70, 40),
                         college = c(10, 40),
                         graduate = c(0, 10),
                         lessthan_no.highschool = c(0, 0),
                         lessthan_high.school = c(20, 10),
                         lessthan_college = c(90, 50),
                         lessthan_graduate = c(100, 90))
In my actual data I have many groups and a lot more levels of education. Of course I could do this one variable at a time, but how could I do it programmatically (and elegantly) using tidyverse tools?
I would start by doing something like a mutate_at() inside of a map(), but where I get tripped up is that the list of variables being summed is different for each of the new variables. You could pass in the list of new variables and their corresponding variables to be summed as two lists to a pmap(), but it's not obvious how to generate that second list concisely. Wondering if there's some kind of nesting solution...
Here is a base R solution. Though the question asks for a tidyverse one, considering the dialog in the comments to the question I have decided to post it.
It uses apply and cumsum to do the hard work. Then there are some cosmetic concerns before cbinding into the final result.
tmp <- apply(df[-1], 1, function(x){
  s <- cumsum(x)
  100 * c(0, s[-length(s)]) / sum(x)
})
rownames(tmp) <- paste("lessthan", names(df)[-1], sep = "_")
desired.df <- cbind(df, t(tmp))
desired.df
# group no.highschool high.school college graduate lessthan_no.highschool
#1 A 20 70 10 0 0
#2 B 10 40 40 10 0
# lessthan_high.school lessthan_college lessthan_graduate
#1 20 90 100
#2 10 50 90
how could I do this programmatically (and elegantly) using tidyverse tools?
Definitely the first step is to tidy your data. Encoding information (like edu level) in column names is not tidy. When you convert education to a factor, make sure the levels are in the correct order - I used the order in which they appeared in the original data column names.
library(tidyr)
library(dplyr)
tidy_result = df %>%
  gather(key = "education", value = "n", -group) %>%
  mutate(education = factor(education, levels = names(df)[-1])) %>%
  group_by(group) %>%
  mutate(lessthan_x = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
  arrange(group, education)
tidy_result
# # A tibble: 8 x 4
# # Groups: group [2]
# group education n lessthan_x
# <chr> <fct> <dbl> <dbl>
# 1 A no.highschool 20 0
# 2 A high.school 70 20
# 3 A college 10 90
# 4 A graduate 0 100
# 5 B no.highschool 10 0
# 6 B high.school 40 10
# 7 B college 40 50
# 8 B graduate 10 90
This gives us a nice, tidy result. If you want to spread/cast this data into your un-tidy desired.df format, I would recommend using data.table::dcast, as (to my knowledge) the tidyverse does not offer a nice way to spread multiple columns. See Spreading multiple columns with tidyr or How can I spread repeated measures of multiple variables into wide format? for the data.table solution or an inelegant tidyr/dplyr version. Before spreading, you could create a key less_than_x_key = paste("lessthan", education, sep = "_").
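For completeness, a sketch of that dcast spread applied to tidy_result (the new columns come out named n_<level> and lessthan_x_<level>):
library(data.table)
wide <- dcast(as.data.table(tidy_result), group ~ education,
              value.var = c("n", "lessthan_x"))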
