I want to sum every two prior observations for each ID, and place them in a new column that is named 'prior_work'. May sound strange, but here is an example that should clarify what I'm trying to do. My data frame:
ID Week Hours
1 1 .00
1 2 24.00
1 3 25.00
1 4 22.00
1 5 19.00
1 6 20.00
2 1 .00
2 2 .00
2 3 .00
2 4 .00
2 5 16.00
2 6 16.00
What I need:
ID Week Hours Hours_prior_two_weeks
1 1 .00 NA
1 2 24.00 NA
1 3 25.00 24.00
1 4 22.00 49.00
1 5 19.00 47.00
1 6 20.00 41.00
2 1 .00 NA #new ID / person here
2 2 .00 NA
2 3 .00 .00
2 4 .00 .00
2 5 16.00 .00
2 6 16.00 16.00
Tried basic aggregation and such, but I can't figure out how to sum 'prior observations'. Thanks!
library(dplyr)
df = data.frame(ID=rep(1:2, each=6),
Week=rep(1:6, times=2),
Hours=c(0,24,25,22,19,20,0,0,0,0,16,16))
df
# ID Week Hours
# 1 1 1 0
# 2 1 2 24
# 3 1 3 25
# 4 1 4 22
# 5 1 5 19
# 6 1 6 20
# 7 2 1 0
# 8 2 2 0
# 9 2 3 0
# 10 2 4 0
# 11 2 5 16
# 12 2 6 16
df %>% group_by(ID) %>% mutate(Hours_Prior_Two_Weeks = lag(Hours, 2) + lag(Hours, 1))
# Source: local data frame [12 x 4]
# Groups: ID [2]
#
# ID Week Hours Hours_Prior_Two_Weeks
# (int) (int) (dbl) (dbl)
# 1 1 1 0 NA
# 2 1 2 24 NA
# 3 1 3 25 24
# 4 1 4 22 49
# 5 1 5 19 47
# 6 1 6 20 41
# 7 2 1 0 NA
# 8 2 2 0 NA
# 9 2 3 0 0
# 10 2 4 0 0
# 11 2 5 16 0
# 12 2 6 16 16
The above code uses dplyr to group by your ID variable and then uses lag to look back at the last two values.
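For comparison, the same shifted sum can be written in base R; this is a sketch using ave() with manual lags, assuming the rows are already ordered by Week within each ID:

```r
# Sum of the two prior observations within each ID, in base R.
# A lag of 1 is c(NA, head(x, -1)); a lag of 2 is c(NA, NA, head(x, -2)).
df$Hours_Prior_Two_Weeks <- ave(df$Hours, df$ID,
  FUN = function(x) c(NA, head(x, -1)) + c(NA, NA, head(x, -2)))
```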
You can use the ave function with rollsum from the zoo package. You need to change the defaults of the fill and align arguments to get the structure you requested:
> library(zoo)
> dat$Hours_prior_two_weeks <- with(dat, ave( Hours, ID, FUN=function(x) rollsum(x, k=3, fill=NA, align="right")))
> dat
ID Week Hours Hours_prior_two_weeks
1 1 1 0 NA
2 1 2 24 NA
3 1 3 25 49
4 1 4 22 71
5 1 5 19 66
6 1 6 20 61
7 2 1 0 NA
8 2 2 0 NA
9 2 3 0 0
10 2 4 0 0
11 2 5 16 16
12 2 6 16 32
But that sums three values including the current one and doesn't shift them, so instead use a window of two, add an extra NA at the beginning of each group's vector, and drop one value off the end (also within groups):
dat$Hours_prior_two_weeks <- with(dat, ave( Hours, ID,
FUN=function(x) c(NA, head(rollsum(x, k=2, fill=NA, align="right"), -1))) )
dat
#-----------
ID Week Hours Hours_prior_two_weeks
1 1 1 0 NA
2 1 2 24 NA
3 1 3 25 24
4 1 4 22 49
5 1 5 19 47
6 1 6 20 41
7 2 1 0 NA
8 2 2 0 NA
9 2 3 0 0
10 2 4 0 0
11 2 5 16 0
12 2 6 16 16
An option using data.table
library(data.table)
setDT(df1)[, Hours_prior_two_weeks := Reduce(`+`, shift(Hours, 1:2)), by = ID]
df1
# ID Week Hours Hours_prior_two_weeks
# 1: 1 1 0 NA
# 2: 1 2 24 NA
# 3: 1 3 25 24
# 4: 1 4 22 49
# 5: 1 5 19 47
# 6: 1 6 20 41
# 7: 2 1 0 NA
# 8: 2 2 0 NA
# 9: 2 3 0 0
#10: 2 4 0 0
#11: 2 5 16 0
#12: 2 6 16 16
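If you have data.table 1.12.0 or later, frollsum() can replace the Reduce/shift idiom; this is a sketch of the same computation (frollsum takes the right-aligned rolling sum of each value and the one before it, and shift() then lags the result one row so only prior weeks are counted):

```r
library(data.table)
# rolling sum of the current and previous Hours, lagged one row
# so that each week sees only the two prior observations
setDT(df1)[, Hours_prior_two_weeks := shift(frollsum(Hours, 2)), by = ID]
```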
Related
I have the following data set:
time <- c(0,1,2,3,4,5,0,1,2,3,4,5,0,1,2,3,4,5)
value <- c(10,8,6,5,3,2,12,10,6,5,4,2,20,15,16,9,2,2)
group <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3)
data <- data.frame(time, value, group)
I want to create a new column called data$diff that is equal to data$value minus the value of data$value when data$time == 0 within each group.
I am beginning with the following code
for(i in 1:nrow(data)){
for(n in 1:max(data$group)){
if(data$group[i] == n) {
data$diff[i] <- ???????
}
}
}
But cannot figure out what to put in place of the question marks. The desired output would be this table: https://i.stack.imgur.com/1bAKj.png
Any thoughts are appreciated.
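For reference, the loop in the question can be completed by looking up each row's group baseline directly; a sketch filling in the question marks (the inner loop over groups is not needed):

```r
# For each row, find its group's value at time == 0 and subtract.
for (i in 1:nrow(data)) {
  baseline <- data$value[data$group == data$group[i] & data$time == 0]
  data$diff[i] <- baseline - data$value[i]
}
```

The vectorized approaches in the answers are preferable for large data; this only fills in the missing piece of the original loop.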
Since in your example data$time == 0 is always the first element of the group, you can use this data.table approach.
library(data.table)
setDT(data)
data[, diff := value[1] - value, by = group]
In case data$time == 0 is not the first element in each group, you can use this:
data[, diff := value[time==0] - value, by = group]
Output:
> data
time value group diff
1: 0 10 1 0
2: 1 8 1 2
3: 2 6 1 4
4: 3 5 1 5
5: 4 3 1 7
6: 5 2 1 8
7: 0 12 2 0
8: 1 10 2 2
9: 2 6 2 6
10: 3 5 2 7
11: 4 4 2 8
12: 5 2 2 10
13: 0 20 3 0
14: 1 15 3 5
15: 2 16 3 4
16: 3 9 3 11
17: 4 2 3 18
18: 5 2 3 18
Here is a base R approach.
within(data, diff <- ave(
seq_along(value), group,
FUN = \(i) value[i][time[i] == 0] - value[i]
))
Output
time value group diff
1 0 10 1 0
2 1 8 1 2
3 2 6 1 4
4 3 5 1 5
5 4 3 1 7
6 5 2 1 8
7 0 12 2 0
8 1 10 2 2
9 2 6 2 6
10 3 5 2 7
11 4 4 2 8
12 5 2 2 10
13 0 20 3 0
14 1 15 3 5
15 2 16 3 4
16 3 9 3 11
17 4 2 3 18
18 5 2 3 18
Here is a short way to do it with dplyr.
library(dplyr)
data %>%
group_by(group) %>%
mutate(diff = value[which(time == 0)] - value)
Which gives
# Groups: group [3]
time value group diff
<dbl> <dbl> <dbl> <dbl>
1 0 10 1 0
2 1 8 1 2
3 2 6 1 4
4 3 5 1 5
5 4 3 1 7
6 5 2 1 8
7 0 12 2 0
8 1 10 2 2
9 2 6 2 6
10 3 5 2 7
11 4 4 2 8
12 5 2 2 10
13 0 20 3 0
14 1 15 3 5
15 2 16 3 4
16 3 9 3 11
17 4 2 3 18
18 5 2 3 18
library(dplyr)
vals2use <- data %>%
group_by(group) %>%
filter(time==0) %>%
select(c(2,3)) %>%
rename(value4diff=value)
dataNew <- merge(data, vals2use, all=T)
dataNew$diff <- dataNew$value4diff-dataNew$value
dataNew <- dataNew[,c(1,2,3,5)]
dataNew
group time value diff
1 1 0 10 0
2 1 1 8 2
3 1 2 6 4
4 1 3 5 5
5 1 4 3 7
6 1 5 2 8
7 2 0 12 0
8 2 1 10 2
9 2 2 6 6
10 2 3 5 7
11 2 4 4 8
12 2 5 2 10
13 3 0 20 0
14 3 1 15 5
15 3 2 16 4
16 3 3 9 11
17 3 4 2 18
18 3 5 2 18
I have a dataframe with 2 columns: time and day. There are 3 days, and for each day time runs from 1 to 12. I want to add new rows for each day with times -2, 1 and 0. How do I do this?
I have tried using add_row and specifying the row number to add to, but this changes each time a new row is added, making the process tedious. Thanks in advance.
picture of the dataframe
We could use add_row, then slice the desired sequence, and bind all to a dataframe:
library(tibble)
library(dplyr)
df1 <- df %>%
add_row(time = -2:0, Day = c(1,1,1), .before = 1) %>%
slice(1:15)
df2 <- bind_rows(df1, df1, df1) %>%
mutate(Day = rep(row_number(), each=15, length.out = n()))
Output:
# A tibble: 45 x 2
time Day
<dbl> <int>
1 -2 1
2 -1 1
3 0 1
4 1 1
5 2 1
6 3 1
7 4 1
8 5 1
9 6 1
10 7 1
11 8 1
12 9 1
13 10 1
14 11 1
15 12 1
16 -2 2
17 -1 2
18 0 2
19 1 2
20 2 2
21 3 2
22 4 2
23 5 2
24 6 2
25 7 2
26 8 2
27 9 2
28 10 2
29 11 2
30 12 2
31 -2 3
32 -1 3
33 0 3
34 1 3
35 2 3
36 3 3
37 4 3
38 5 3
39 6 3
40 7 3
41 8 3
42 9 3
43 10 3
44 11 3
45 12 3
Here's a fast way to create the desired dataframe from scratch using expand.grid(), rather than adding individual rows:
df <- expand.grid(-2:12,1:3)
colnames(df) <- c("time","day")
Results:
df
time day
1 -2 1
2 -1 1
3 0 1
4 1 1
5 2 1
6 3 1
7 4 1
8 5 1
9 6 1
10 7 1
11 8 1
12 9 1
13 10 1
14 11 1
15 12 1
16 -2 2
17 -1 2
18 0 2
19 1 2
20 2 2
21 3 2
22 4 2
23 5 2
24 6 2
25 7 2
26 8 2
27 9 2
28 10 2
29 11 2
30 12 2
31 -2 3
32 -1 3
33 0 3
34 1 3
35 2 3
36 3 3
37 4 3
38 5 3
39 6 3
40 7 3
41 8 3
42 9 3
43 10 3
44 11 3
45 12 3
You can use tidyr::crossing
library(dplyr)
library(tidyr)
add_values <- c(-2, 1, 0)
crossing(time = add_values, Day = unique(day$Day)) %>%
bind_rows(day) %>%
arrange(Day, time)
# A tibble: 45 x 2
# time Day
# <dbl> <int>
# 1 -2 1
# 2 0 1
# 3 1 1
# 4 1 1
# 5 2 1
# 6 3 1
# 7 4 1
# 8 5 1
# 9 6 1
#10 7 1
# … with 35 more rows
If you meant -2, -1 and 0 you can also use complete.
tidyr::complete(day, Day, time = -2:0)
Using R, I am trying to fill NAs in a column with values derived from conditions on the other columns. The data frame has 4 columns, described below.
"Water_Level": Has some values which also include NAs. This is the column I want to replace the NAs. Take this column as the amount of water in liters in a tank.
"Tank": Unique identifier for tanks. In this sample, I have tank 1 and tank 2.
"Flag": This has a series of 0's and 1's. When the value is 0 the tap is open and the Water_level value decreases by a constant 0.05. When the flag is 1, the tank is being pumped, so the water level in that tank increases gradually up to the peak value recorded at the end of the series of 1's. The rate of increase varies and is determined by the length of the series of 1's, i.e. the Counter value at the end of the series.
"Counter": A column counting the number of 0's and 1's in the flag column in order.
I need to fill the NAs in the "Water_level" column with the conditions of the other columns.
Honestly, I haven't been able to try anything despite clearly understanding the outcome required.
df <- data.frame(
Water_level = c(67.92, rep(NA,9),67.96,10.5,rep(NA,8),20),
Flag = c(rep(0,5),rep(1,6),rep(0,5),rep(1,5)),
Tank= c(rep(1, 11), rep(2, 10)),
Counter = c(seq(1:5),seq(1:6), seq(1:5),seq(1:5))
)
df
Water_level Flag Tank Counter
1 67.92 0 1 1
2 NA 0 1 2
3 NA 0 1 3
4 NA 0 1 4
5 NA 0 1 5
6 NA 1 1 1
7 NA 1 1 2
8 NA 1 1 3
9 NA 1 1 4
10 NA 1 1 5
11 67.96 1 1 6
12 10.50 0 2 1
13 NA 0 2 2
14 NA 0 2 3
15 NA 0 2 4
16 NA 0 2 5
17 NA 1 2 1
18 NA 1 2 2
19 NA 1 2 3
20 NA 1 2 4
21 20.00 1 2 5
The result expected is to fill the NAs in the Water_level as described by the conditions in my introduction.
For example, line 2 in the "Water_level" should be 67.92 - 0.05 = 67.87. This is because the tap is open i.e Flag is at 0. line 3 will be 67.87 - 0.05 = 67.82 and so on.
The tricky part is in line 6, where the Flag changes to 1, i.e. the tank is being pumped. The series of 1's for Tank 1 ends at line 11, where the peak value recorded for Water_level is 67.96. So the rate of increase from line 6 to 10 is:
(67.96 - value at line 5, following the decrease pattern) / number of Counter steps, i.e. 6 in this case
This calculation continues for Tank 2.
Thanks in anticipation for a solution.
Update.
@manotheshark: This is a good beginning, but it doesn't generalise well. When I include rows 12 to 16, it produces the wrong output, i.e. it doesn't decline by 0.05 from line 11.
df <- data.frame(
Water_level = c(67.92, rep(NA,9),67.96, rep(NA,5),10.5,rep(NA,8),20),
Flag = c(rep(0,5),rep(1,6),rep(0,5),rep(0,5),rep(1,5)),
Tank= c(rep(1, 16), rep(2, 10)),
Counter = c(seq(1:5),seq(1:6),seq(1:5), seq(1:5),seq(1:5))
)
df
Water_level Flag Tank Counter
1 67.92 0 1 1
2 NA 0 1 2
3 NA 0 1 3
4 NA 0 1 4
5 NA 0 1 5
6 NA 1 1 1
7 NA 1 1 2
8 NA 1 1 3
9 NA 1 1 4
10 NA 1 1 5
11 67.96 1 1 6
12 NA 0 1 1
13 NA 0 1 2
14 NA 0 1 3
15 NA 0 1 4
16 NA 0 1 5
17 10.50 0 2 1
18 NA 0 2 2
19 NA 0 2 3
20 NA 0 2 4
21 NA 0 2 5
22 NA 1 2 1
23 NA 1 2 2
24 NA 1 2 3
25 NA 1 2 4
26 20.00 1 2 5
The output running your solution is presented below. Line 12 should be 67.96 - 0.05 = 67.91.
Water_level Flag Tank Counter
1 67.92000 0 1 1
2 67.87000 0 1 2
3 67.82000 0 1 3
4 67.77000 0 1 4
5 67.72000 0 1 5
6 67.30167 1 1 1
7 67.43333 1 1 2
8 67.56500 1 1 3
9 67.69667 1 1 4
10 67.82833 1 1 5
11 67.96000 1 1 6
12 67.37000 0 1 1
13 67.32000 0 1 2
14 67.27000 0 1 3
15 67.22000 0 1 4
16 67.17000 0 1 5
17 10.50000 0 2 1
18 10.45000 0 2 2
19 10.40000 0 2 3
20 10.35000 0 2 4
21 10.30000 0 2 5
22 12.24000 1 2 1
23 14.18000 1 2 2
24 16.12000 1 2 3
25 18.06000 1 2 4
26 20.00000 1 2 5
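A plain-loop sketch that handles multiple open/fill cycles like the updated data above; fill_levels is a hypothetical helper name, and it assumes the first row of each tank has a recorded level and that the peak at the end of every run of 1's is recorded:

```r
fill_levels <- function(df) {
  wl <- df$Water_level
  for (i in 2:length(wl)) {
    if (!is.na(wl[i])) next
    if (df$Flag[i] == 0) {
      wl[i] <- wl[i - 1] - 0.05                   # tap open: constant drop
    } else {
      j <- i                                      # walk forward to the recorded peak
      while (j < length(wl) && is.na(wl[j])) j <- j + 1
      steps <- df$Counter[j] - df$Counter[i] + 1  # rows remaining in this run of 1's
      wl[i] <- wl[i - 1] + (wl[j] - wl[i - 1]) / steps
    }
  }
  df$Water_level <- wl
  df
}
```

On the updated data this gives 67.91 at line 12 (67.96 - 0.05) while keeping the original fill pattern for lines 6 to 11.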
I haven't tested whether this works for multiple tank cycles. I converted the data.frame to a data.table.
library(data.table)
setDT(df)
# calculate tank levels when dropping with Flag of 0
df[Flag == 0, Water_level := first(Water_level) - 0.05 * (.I - first(.I)), by = .(Flag, Tank)]
# use sequence to determine tank levels when filling from previous minimum to new max
df[Flag == 1, Water_level := seq(df[Flag == 0, last(Water_level), by = .(Flag, Tank)][,V1][.GRP], last(Water_level), length.out = .N + 1)[-1], by = .(Flag, Tank)]
> df
Water_level Flag Tank Counter
1: 67.92 0 1 1
2: 67.87 0 1 2
3: 67.82 0 1 3
4: 67.77 0 1 4
5: 67.72 0 1 5
6: 67.76 1 1 1
7: 67.80 1 1 2
8: 67.84 1 1 3
9: 67.88 1 1 4
10: 67.92 1 1 5
11: 67.96 1 1 6
12: 10.50 0 2 1
13: 10.45 0 2 2
14: 10.40 0 2 3
15: 10.35 0 2 4
16: 10.30 0 2 5
17: 12.24 1 2 1
18: 14.18 1 2 2
19: 16.12 1 2 3
20: 18.06 1 2 4
21: 20.00 1 2 5
I have an unbalanced data frame with date, localities and prices. I would like to calculate the price differences between localities by date. Since my data is unbalanced, I thought of creating data (localities) to balance it so I can get all the differences.
My data look like:
library(dplyr)
set.seed(123)
df= data.frame(date=(1:3),
locality= rbinom(21,3, 0.2),
price=rnorm(21, 50, 20))
df %>%
arrange(date, locality)
> date locality price
1 1 0 60.07625
2 1 0 35.32994
3 1 0 63.69872
4 1 1 54.76426
5 1 1 66.51080
6 1 1 28.28602
7 1 2 47.09213
8 2 0 26.68910
9 2 1 100.56673
10 2 1 48.88628
11 2 1 48.29153
12 2 2 29.02214
13 2 2 45.68269
14 2 2 43.59887
15 3 0 60.98193
16 3 0 75.89527
17 3 0 43.30174
18 3 0 71.41221
19 3 0 33.62969
20 3 1 34.31236
21 3 1 23.76955
To get balanced data, I was thinking of something like:
> date locality price
1 1 0 60.07625
2 1 0 35.32994
3 1 0 63.69872
4 1 1 54.76426
5 1 1 66.51080
6 1 1 28.28602
7 1 2 47.09213
8 1 2 NA
9 1 2 NA
10 2 0 26.68910
10 2 0 NA
10 2 0 NA
11 2 1 100.56673
12 2 1 48.88628
13 2 1 48.29153
14 2 2 29.02214
15 2 2 45.68269
16 2 2 43.59887
etc...
Finally, to get the price differences between pairs of localities:
> date diff(price, 0-1) diff(price, 0-2) diff(price, 1-2)
1 1 60.07625-54.76426 60.07625-47.09213 etc...
2 1 35.32994-66.51080 35.32994-NA
3 1 63.69872-28.28602 63.69872-NA
You don't need to balance your data; if you use dcast, it will add the NAs for you.
First, transform the data so that each locality has its own column:
library(data.table)
library(tidyverse)
setDT(df)
df[, rid := rowid(date, locality)]
df2 <- dcast(df, rid + date ~ locality, value.var = 'price')
# rid date 0 1 2
# 1: 1 1 60.07625 54.76426 47.09213
# 2: 1 2 26.68910 100.56673 29.02214
# 3: 1 3 60.98193 34.31236 NA
# 4: 2 1 35.32994 66.51080 NA
# 5: 2 2 NA 48.88628 45.68269
# 6: 2 3 75.89527 23.76955 NA
# 7: 3 1 63.69872 28.28602 NA
# 8: 3 2 NA 48.29153 43.59887
# 9: 3 3 43.30174 NA NA
# 10: 4 3 71.41221 NA NA
# 11: 5 3 33.62969 NA NA
Then create a data frame to_diff of differences to calculate, and pmap over that to calculate the differences. Here c0_1 corresponds to what you call in your question diff(price, 0-1).
to_diff <- CJ(0:2, 0:2)[V1 < V2]
pmap(to_diff, ~ df2[[as.character(.x)]] - df2[[as.character(.y)]]) %>%
setNames(paste0('c', to_diff[[1]], '_', to_diff[[2]])) %>%
bind_cols(df2[, 1:2])
# A tibble: 11 x 5
# c0_1 c0_2 c1_2 rid date
# <dbl> <dbl> <dbl> <int> <int>
# 1 5.31 13.0 7.67 1 1
# 2 -73.9 -2.33 71.5 1 2
# 3 26.7 NA NA 1 3
# 4 -31.2 NA NA 2 1
# 5 NA NA 3.20 2 2
# 6 52.1 NA NA 2 3
# 7 35.4 NA NA 3 1
# 8 NA NA 4.69 3 2
# 9 NA NA NA 3 3
# 10 NA NA NA 4 3
# 11 NA NA NA 5 3
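A base-R alternative for the same pairwise differences, sketched with combn(); it assumes df2 has locality columns named "0", "1" and "2" as produced by the dcast call above:

```r
locs <- c("0", "1", "2")
# one difference vector per pair of localities
diffs <- combn(locs, 2, function(p) df2[[p[1]]] - df2[[p[2]]], simplify = FALSE)
names(diffs) <- combn(locs, 2, paste, collapse = "_")
data.frame(df2[, c("rid", "date")], diffs, check.names = FALSE)
```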
Somewhat new to R, and I find myself needing to delete rows based on multiple criteria. The data frame has 3 columns, and I need to delete rows where bid = 99 whenever the same rid/qid group also contains values less than 99. In the desired output, each rid/qid group either keeps its bid values below 99, or keeps bid = 99 when that is the only value in the group.
rid qid bid
1 1 5
1 1 6
1 1 99
1 2 6
2 1 7
2 1 99
2 2 2
2 2 3
3 1 7
3 1 8
3 2 1
3 2 99
4 1 2
4 1 6
4 2 1
4 2 2
4 2 99
5 1 99
5 2 99
The expected output...
rid qid bid
1 1 5
1 1 6
1 2 6
2 1 7
2 2 2
2 2 3
3 1 7
3 1 8
3 2 1
4 1 2
4 1 6
4 2 1
4 2 2
5 1 99
5 2 99
Any assistance would be appreciated.
You can use the base R function ave to generate a dropping variable like this:
df$dropper <- with(df, ave(bid, rid, qid, FUN= function(i) i == 99 & length(i) > 1))
ave calculates a function on bid, grouping by rid and qid. The function tests if each element of the grouped bid values i is 99 and if i has a length greater than 1. Also, with is used to reduce typing.
which returns
df
rid qid bid dropper
1 1 1 5 0
2 1 1 6 0
3 1 1 99 1
4 1 2 6 0
5 2 1 7 0
6 2 1 99 1
7 2 2 2 0
8 2 2 3 0
9 3 1 7 0
10 3 1 8 0
11 3 2 1 0
12 3 2 99 1
13 4 1 2 0
14 4 1 6 0
15 4 2 1 0
16 4 2 2 0
17 4 2 99 1
18 5 1 99 0
19 5 2 99 0
then drop the undesired observations with df[df$dropper == 0, 1:3], which will simultaneously drop the new variable.
If you want to just delete rows where bid = 99 then use dplyr.
library(dplyr)
df <- df %>%
filter(bid != 99)
Where df is your data frame, and != means "not equal to".
Updated solution using dplyr
df %>%
group_by(rid, qid) %>%
mutate(tempcount = n())%>%
ungroup() %>%
mutate(DropValue =ifelse(bid == 99 & tempcount > 1, 1,0) ) %>%
filter(DropValue == 0) %>%
select(rid,qid,bid)
Here is another option using all and an if condition in data.table to subset the rows after grouping by 'rid' and 'qid':
library(data.table)
setDT(df1)[, if(all(bid==99)) .SD else .SD[bid!= 99], .(rid, qid)]
# rid qid bid
# 1: 1 1 5
# 2: 1 1 6
# 3: 1 2 6
# 4: 2 1 7
# 5: 2 2 2
# 6: 2 2 3
# 7: 3 1 7
# 8: 3 1 8
# 9: 3 2 1
#10: 4 1 2
#11: 4 1 6
#12: 4 2 1
#13: 4 2 2
#14: 5 1 99
#15: 5 2 99
Or without using the if
setDT(df1)[df1[, .I[all(bid==99) | bid != 99], .(rid, qid)]$V1]
Here is a solution using dplyr, which is a very expressive framework for this kind of problem.
df <- read.table(text =
" rid qid bid
1 1 5
1 1 6
1 1 99
1 2 6
2 1 7
2 1 99
2 2 2
2 2 3
3 1 7
3 1 8
3 2 1
3 2 99
4 1 2
4 1 6
4 2 1
4 2 2
4 2 99
5 1 99
5 2 99",
header = TRUE, stringsAsFactors = FALSE)
dplyr verbs let you express the program in a way that stays close to the very terms of your question:
library(dplyr)
res <-
df %>%
group_by(rid, qid) %>%
filter(!(any(bid < 99) & bid == 99)) %>%
ungroup()
# # A tibble: 15 × 3
# rid qid bid
# <int> <int> <int>
# 1 1 1 5
# 2 1 1 6
# 3 1 2 6
# 4 2 1 7
# 5 2 2 2
# 6 2 2 3
# 7 3 1 7
# 8 3 1 8
# 9 3 2 1
# 10 4 1 2
# 11 4 1 6
# 12 4 2 1
# 13 4 2 2
# 14 5 1 99
# 15 5 2 99
Let's check we get the desired output:
desired_output <- read.table(text =
" rid qid bid
1 1 5
1 1 6
1 2 6
2 1 7
2 2 2
2 2 3
3 1 7
3 1 8
3 2 1
4 1 2
4 1 6
4 2 1
4 2 2
5 1 99
5 2 99",
header = TRUE, stringsAsFactors = FALSE)
identical(as.data.frame(res), desired_output)
# [1] TRUE