Efficiently iterate over rows to dynamically/sequentially populate variable going down rows - r

I am trying to dynamically populate a variable, which requires me to reference rows.
There are three columns: time, group, and val.
I want to populate val in rows 3, 4, 7, and 8, which are initially NA.
Here is my toy data:
df <- expand.grid(time = rep(c(1,2,3,4)), group = rep(c("A", "B")))
df$val <- c(50,40,NA,NA)
df
> df
  time group val
1    1     A  50
2    2     A  40
3    3     A  NA
4    4     A  NA
5    1     B  50
6    2     B  40
7    3     B  NA
8    4     B  NA
I have two grouping variables (time and group) and, as an example, row 3 above needs to be populated by this set of rules:
1. Order by group and time (in ascending order).
2. For time = 3, the value of **val** is the arithmetic average of the two previous rows;
(2a). i.e. the average of the time 2 and time 1 values, so it will be 1/2 * (40+50) = 45.
3. For time = 4, the value of **val** is the arithmetic average of the two previous rows;
(3a). i.e. the average of the time 3 and time 2 values, so it will be 1/2 * (45+40) = 42.5.
And so on, going down to the last row of each group, as defined by the time and group variables.
I want to avoid using loops and referencing row indices, and I would prefer to stay within dplyr, as the rest of my scripts are in the dplyr ecosystem. Is there an efficient way to achieve this?

This isn't the cleanest solution, but it gets the job done:
df2 = df %>%
  arrange(group, time) %>%
  mutate(val = if_else(is.na(val), (lag(val, n = 1) + lag(val, n = 2)) / 2.0, val)) %>%
  mutate(val = if_else(is.na(val), (lag(val, n = 1) + lag(val, n = 2)) / 2.0, val))
Again, it's not pretty (it relies on each group having at most two consecutive missing values, since each mutate() pass fills in one more row), but it seems to work. Hope that helps give you something to start from.
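If the runs of missing values can be longer than two, a recursive helper keeps the whole thing inside the tidyverse. This is only a sketch: fill_two_avg is a hypothetical helper name, and it assumes that, once ordered, each group starts with at least two non-missing values.

library(dplyr)
library(purrr)

# Hypothetical helper: walk down a vector, replacing each NA with the
# mean of the two values immediately above it (assumes x[1] and x[2] are not NA).
fill_two_avg <- function(x) {
  states <- accumulate(
    x[-(1:2)],
    function(prev, cur) {
      new <- if (is.na(cur)) mean(prev) else cur
      c(prev[2], new)  # carry the last two filled values forward
    },
    .init = x[1:2]
  )
  c(x[1], vapply(states, `[`, numeric(1), 2))
}

df %>%
  arrange(group, time) %>%
  group_by(group) %>%
  mutate(val = fill_two_avg(val)) %>%
  ungroup()

For the toy data this gives val = 50, 40, 45, 42.5 within each group, matching the 45 and 42.5 worked out in the question.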

Related

In dplyr::mutate, refer to a value conditionally, based on the value of another column

Apologies for the unclear title; I couldn't think of a better way to describe this problem.
Here is a sample dataset I am working with:
test = data.frame(
  Value = c(1:5, 5:1),
  Index = c(1:5, 1:5),
  GroupNum = c(rep.int(1, 5), rep.int(2, 5))
)
I want to create a new column (called "Value_Standardized") whose values are calculated by grouping the data by GroupNum and then dividing each Value observation by the Value observation of the group when the Index is 1.
Here's what I've come up with so far.
test2 = test %>%
  group_by(GroupNum) %>%
  mutate(Value_Standardized = Value / special_function(Value))
The special_function would represent a way to get the Value when Index == 1.
That is also precisely the problem: I cannot figure out a way to get the denominator to be the Value when Index == 1 within that group. Unfortunately, the value when the Index is 1 is not necessarily the max or the min of the group.
Thanks in advance.
Edit: Emphasis added for clarity.
There is a super simple tidyverse way of doing this with cur_data(): it pulls the tibble for the current subset (group) of data and acts on it.
test2 <- test %>%
  group_by(GroupNum) %>%
  mutate(output = Value / cur_data()$Value[1])
cur_data() grabs the tibble for the group, you then extract the Value column as you normally would with $Value, and because the denominator is always in the first row of the group, you just specify that index with [1].
Nice and neat. There are a whole bunch of cur_*() functions you can use; check them out in the dplyr documentation.
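As an aside, newer dplyr releases mark cur_data() as superseded (in favour of pick()), and the same grouped division can also be written with plain subsetting. A sketch, assuming Index == 1 occurs exactly once per group:

test2 <- test %>%
  group_by(GroupNum) %>%
  # denominator: the Value in the row where Index is 1 within each group
  mutate(Value_Standardized = Value / Value[Index == 1]) %>%
  ungroup()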
Not sure if this is what you meant, nor if it's the best way to do this but...
Instead of using a group_by I used a nested pipe, filtering and then left_joining the table to itself.
test = data.frame(
  Value = c(1:5, 5:1),
  Index = c(1:5, 1:5),
  GroupNum = c(rep.int(1, 5), rep.int(2, 5))
)
test %>%
  left_join(test %>%
              filter(Index == 1) %>%
              select(Value, GroupNum),
            by = "GroupNum",
            suffix = c('', '_Index_1')) %>%
  mutate(Value = Value / Value_Index_1)
output:
   Value Index GroupNum Value_Index_1
1    1.0     1        1             1
2    2.0     2        1             1
3    3.0     3        1             1
4    4.0     4        1             1
5    5.0     5        1             1
6    1.0     1        2             5
7    0.8     2        2             5
8    0.6     3        2             5
9    0.4     4        2             5
10   0.2     5        2             5
A quick base R solution:
test = data.frame(
  Value = c(1:5, 5:1),
  Index = c(1:5, 1:5),
  GroupNum = c(rep.int(1, 5), rep.int(2, 5)),
  Value_Standardized = NA
)
groups <- levels(factor(test$GroupNum))

for (currentGroup in groups) {
  test$Value_Standardized[test$GroupNum == currentGroup] <-
    test$Value[test$GroupNum == currentGroup] /
    test$Value[test$GroupNum == currentGroup & test$Index == 1]
}
This only works under the assumption that each group will have only one observation with a "1" index though. It's easy to run into trouble...
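A loop-free base R variant of the same idea, still under the one-Index-1-row-per-group assumption (index_1 is just an illustrative name for the lookup table):

# small lookup table of the Index == 1 value for each group
index_1 <- test[test$Index == 1, c("GroupNum", "Value")]
# match each row's group to its denominator and divide
test$Value_Standardized <- test$Value /
  index_1$Value[match(test$GroupNum, index_1$GroupNum)]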

Conditionally mutate dataframe based on multiple conditions R

I have seen some similar questions, but none of them was exactly the same as the thing I want to do - which is why I am asking.
I have a dataframe (dummy_data) which contains indices of some observations (obs) regarding given subjects (ID). The dataframe contains only the meaningful data (in other words: the desired conditions are met). The last column in this example data contains the total number of observations (total_obs).
ID <- c(rep("item_001", 5), rep("item_452", 8), rep("item_0001", 7), rep("item_31", 9), rep("item_007", 5))
obs <- c(1,2,3,5,6, 3,4,5,7,8,9,12,16, 1,2,4,5,6,7,8, 2,4,6,7,8,10,13,14,15, 3,4,6,7,11)
total_obs <- c(rep(6, 5), rep(16, 8), rep(9, 7), rep(18, 9), rep(11, 5))
dummy_data <- data.frame(ID, obs, total_obs)
I would like to create a new column (interval) with 3 possible values, "start", "center", and "end", based on the following condition:
split the total number of observations (total_obs) into 3 parts, covering the indices from 1 up to the value stored in the total_obs column, and assign the interval label according to which part the index stored in the obs column falls into.
Here is the expected output:
ID <- c(rep("item_001",5), rep("item_452",8), rep("item_0001",7), rep("item_31",9), rep("item_007",5))
segment <- c(1,2,3,5,6, 3,4,5,7,8,9,12,16, 1,2,4,5,6,7,8, 2,4,6,7,8,10,13,14,15, 3,4,6,7,11)
total_segments <- c(rep(6,5), rep(16,8), rep(9,7), rep(18,9), rep(11,5))
interval <- c("start","start","center","end","end",
              "start","start","start","center","center","center","end","end",
              "start","start","center","center","center","end","end",
              "start","start","start","center","center","center","end","end","end",
              "start","start","center","center","end")
wanted_data <- data.frame(ID, segment, total_segments, interval)
I would like to use dplyr::ntile() with dplyr::mutate() and dplyr::case_when(), but I could not make my code function properly. Any solutions?
You just need dplyr::mutate() and dplyr::case_when().
The following should give you something to work off of.
dummy_data %>%
  mutate(interval = case_when(obs < (total_obs / 3) ~ "start",
                              obs < 2 * (total_obs / 3) ~ "center",
                              TRUE ~ "end"))
# TRUE ~ "end" is the 'else' case when everything else is false
This gives slightly different results from your expected output.
I think more careful deliberation is needed about where the endpoints of each interval fall, but if you know what you want, a combination of <=, %/%, and ceiling() should give you the result you desire.
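For instance, one possible explicit-boundary sketch (assuming obs always lies between 1 and total_obs; the cutoffs near the boundaries may still differ slightly from your expected output):

dummy_data %>%
  mutate(tile = ceiling(3 * obs / total_obs),  # 1, 2 or 3
         interval = c("start", "center", "end")[tile]) %>%
  select(-tile)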
First, because dummy_data$obs is identical to wanted_data$segment, and dummy_data$total_obs is identical to wanted_data$total_segments, you just need to rename these columns.
For the interval column, here is one approach to creating it:
1. Group the data (here, by total_segments).
2. Create a column, say tile, and fill it with ntile(segment, 3) results.
3. Create the interval column, using case_when to fill it with the category labels derived from tile. That is, fill interval with "start" when tile is 1, "center" when it is 2, and "end" when it is 3.
4. Drop the tile column.
wanted_data <- dummy_data %>%
  rename(segment = obs, total_segments = total_obs) %>%
  group_by(total_segments) %>%
  mutate(tile = ntile(segment, 3)) %>%
  mutate(interval = case_when(tile == 1 ~ "start",
                              tile == 2 ~ "center",
                              tile == 3 ~ "end")) %>%
  select(-tile)
wanted_data
# A tibble: 34 × 4
# Groups:   total_segments [5]
   ID       segment total_segments interval
   <chr>      <dbl>          <dbl> <chr>
 1 item_001       1              6 start
 2 item_001       2              6 start
 3 item_001       3              6 center
 4 item_001       5              6 center
 5 item_001       6              6 end
 6 item_452       3             16 start
 7 item_452       4             16 start
 8 item_452       5             16 start
 9 item_452       7             16 center
10 item_452       8             16 center
# … with 24 more rows
It's slightly different from the wanted_data$interval you showed, but based on your comment the division into categories should be exactly as dplyr::ntile() does it.

Conditional sum grouped by date in R

The data I am working with is the number of people in a group. The columns in the dataset I'm concerned with are the date (column 1) and the number of people in a group (column 3; there is a separate row for each group on a given day). I am looking for an output spreadsheet with a column for the date, one for the sum of all the groups with one person in them on that day, and one for the sum of all the people who are in groups larger than one on that day.
For example if this was my dataset:
Date   People
10/18       1
10/18       3
10/18       1
10/18       8
10/20       1
10/20       4
10/20       2
My desired output would be:
Date   p=1  p>1
10/18    2   11
10/20    1    6
My data frame is "DF" and a csv with the different dates is "times". I tried to use a for loop but the output was just zeros.
Here is what I tried:
ntimes = length(times$UniTimes)
for (i in 1:ntimes) {
  s <- sum(DF[which(DF[, 3] > 1 & DF[, 1] == i), 3])
  t <- sum(DF[which(DF[, 3] < 2 & DF[, 1] == i), 3])
}
ndf <- data.frame(times, s, t)
write.csv(ndf, 'groups_c.csv')
Thank you for your time and help!
You can use aggregate:
aggregate(People ~ Date, DF, function(x) c("p=1" = sum(x[x == 1]),
                                            "p>1" = sum(x[x > 1])))
#   Date People.p=1 People.p>1
#1 10/18          2         11
#2 10/20          1          6
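Note that because the function returns two values, aggregate() stores the result as a single matrix column named People. If you need ordinary columns (say, before writing to CSV), one common trick is do.call(data.frame, ...); a rough sketch, with res and flat as illustrative names:

res <- aggregate(People ~ Date, DF, function(x) c("p=1" = sum(x[x == 1]),
                                                  "p>1" = sum(x[x > 1])))
# flatten the matrix column into two regular columns (the names may need tidying afterwards)
flat <- do.call(data.frame, res)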
This should work, but without data to reproduce it's difficult to say:
library(dplyr)
DF %>%
  group_by(Date) %>%
  summarise(peq1 = sum(People == 1),
            pgeq1 = sum(People[People > 1]))
An option with data.table
library(data.table)
setDT(DF)[, .(peq1 = sum(People == 1), pgeq1 = sum(People[People > 1])), .(Date)]

Fill in N lags based on a variable in R data frame

I have two columns in my data frame, value and num_leads. I'd like to create a third column that stores the value from n rows below, where n is whatever number is stored in num_leads. Here's an example:
df1 <- data.frame(value = c(1:5),
                  num_leads = c(2, 3, 1, 1, 0))
Desired output:
  value num_leads result
1     1         2      3
2     2         3      5
3     3         1      4
4     4         1      5
5     5         0      5
I have tried using the lead function in dplyr, but unfortunately it seems the number of rows to lead by has to be a single fixed value rather than varying by row.
Using indexing:
with(df1, value[seq_along(value) + num_leads])
Here seq_along(value) gives the row number, and adding num_leads to it pulls out the right value.
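The same indexing idea also works inside mutate() if you want to stay in dplyr. A sketch, assuming row_number() + num_leads never points past the last row:

library(dplyr)
df1 %>%
  mutate(result = value[row_number() + num_leads])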
This is what I came up with:
df1$result <- df1$value[df1$value + df1$num_leads]
(Note that this relies on value being identical to the row number in this example; indexing with seq_len(nrow(df1)) + df1$num_leads would generalise.)

R summing up total for each class for each id

Say I have a dataset like this:
df <- data.frame(id = c(1, 1, 1, 2, 2),
                 classname = c("Welding", "Welding", "Auto", "HVAC", "Plumbing"),
                 hours = c(3, 2, 4, 1, 2))
I.e.,
  id classname hours
1  1   Welding     3
2  1   Welding     2
3  1      Auto     4
4  2      HVAC     1
5  2  Plumbing     2
I'm trying to figure out how to summarize the data in a way that gives me, for each id, a list of the classes they took as well as how many hours of each class. I would want these to be in a list so I can keep it one row per id. So, I would want it to return:
  id     class.list class.hours
1  1  Welding, Auto         5,4
2  2 HVAC, Plumbing         1,2
I was able to figure out how to get it to return the class.list.
library(dplyr)
classes <- df %>%
  group_by(id) %>%
  summarise(class.list = list(unique(as.character(classname))))
This gives me:
  id     class.list
1  1  Welding, Auto
2  2 HVAC, Plumbing
But I'm not sure how I could get it to sum the number of hours for each of those classes (class.hours).
Thanks for your help!
In base R, this can be accomplished with two calls to aggregate. The inner call sums the hours and the outer call "concatenates" the hours and the class names. In the outer call of aggregate, cbind is used to include both the hours and the class names in the output, and also to provide the desired variable names.
# convert class name to character variable
df$classname <- as.character(df$classname)

# aggregate
aggregate(cbind("class.hours" = hours, "class.list" = classname) ~ id,
          data = aggregate(hours ~ id + classname, data = df, FUN = sum), toString)

  id class.hours     class.list
1  1        4, 5  Auto, Welding
2  2        1, 2 HVAC, Plumbing
In data.table, roughly the same output is produced with a chained statement.
setDT(df)[, .(hours = sum(hours)), by = .(id, classname)][, lapply(.SD, toString), by = id]

   id      classname hours
1:  1  Welding, Auto  5, 4
2:  2 HVAC, Plumbing  1, 2
The variable names could then be set using the data.table setnames function.
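For example (a sketch, assuming the chained result has been assigned to an object, here called res):

res <- setDT(df)[, .(hours = sum(hours)), by = .(id, classname)][, lapply(.SD, toString), by = id]
# rename the collapsed columns to match the requested output
setnames(res, c("classname", "hours"), c("class.list", "class.hours"))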
This is how you could do it using dplyr:
classes <- df %>%
  group_by(id, classname) %>%
  summarise(hours = sum(hours)) %>%
  summarise(class.list = list(unique(as.character(classname))),
            class.hours = list(hours))
The first summarise() peels off the last grouping variable (classname), so the second summarise() then works per id. It is not necessary to use unique() anymore, but I kept it in there to match the part you already had.
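In current dplyr versions summarise() prints a message about this regrouping; if you would rather make the peeling explicit (and silence the message), the .groups argument can be used, roughly like this:

classes <- df %>%
  group_by(id, classname) %>%
  summarise(hours = sum(hours), .groups = "drop_last") %>%  # keep grouping by id only
  summarise(class.list = list(unique(as.character(classname))),
            class.hours = list(hours),
            .groups = "drop")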
