Rolling rowsum over existing data frame with NAs in R

Given the data frame:
df1 <- data.frame(Company = c('A','B','C','D','E'),
`X1980` = c(NA, 5, 3, 8, 13),
`X1981` = c(NA, 12, NA, 11, 29),
`X1982` = c(33, NA, NA, 41, 42),
`X1983` = c(45, 47, 53, NA, 55))
I would like to create a new data frame where each value is replaced by the sum of the current value and the previous (already summed) value of the row, i.e. a running sum along each row. NAs should be kept as they are, and the sum restarts after an NA.
This should result in the following data frame:
Company 1980 1981 1982 1983
A NA NA 33 78
B 5 17 NA 47
C 3 NA NA 53
D 8 19 60 NA
E 13 42 84 139

Here is a tidyverse approach
library(dplyr)
library(tidyr)
library(purrr)
df1 %>%
pivot_longer(matches("\\d{4}$")) %>%
group_by(Company) %>%
mutate(value = accumulate(value, ~if (is.na(out <- .x + .y)) .y else out)) %>%
pivot_wider()
Output
# A tibble: 5 x 5
# Groups: Company [5]
Company X1980 X1981 X1982 X1983
<chr> <dbl> <dbl> <dbl> <dbl>
1 A NA NA 33 78
2 B 5 17 NA 47
3 C 3 NA NA 53
4 D 8 19 60 NA
5 E 13 42 84 139
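For comparison, the same running sum can be sketched in base R without reshaping. The helper name cumsum_reset_na is hypothetical (it does not come from the answer above), and the sketch assumes that every column except the first holds the yearly values.

```r
# Data from the question
df1 <- data.frame(Company = c('A','B','C','D','E'),
                  X1980 = c(NA, 5, 3, 8, 13),
                  X1981 = c(NA, 12, NA, 11, 29),
                  X1982 = c(33, NA, NA, 41, 42),
                  X1983 = c(45, 47, 53, NA, 55))

# Hypothetical helper: cumulative sum that keeps NAs and restarts after each NA
cumsum_reset_na <- function(x) {
  grp <- cumsum(is.na(x))                                # a new group starts at every NA
  out <- ave(ifelse(is.na(x), 0, x), grp, FUN = cumsum)  # running sum within each group
  out[is.na(x)] <- NA                                    # put the original NAs back
  out
}

# Apply the helper to each row of the value columns
res <- df1
res[-1] <- t(apply(as.matrix(df1[-1]), 1, cumsum_reset_na))
res
```

Each NA opens a new group, so the running sum never carries across a gap, which matches the expected output in the question.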

Related

Extrapolating sequence of numbers in R

I would like to increment numbers from the last observed number onward, i.e. the NA values should be replaced with 35, 36, 37, 38, 39.
How can I go about it?
library(tidyverse)
trialdata <- tibble(
id = c(13, 8, 20, 34, 4, NA, NA, NA, NA, NA)
)
If your goal is to fill the NA id rows with a sequence that starts after the maximum non-NA value, then here's one way you could do it:
trialdata %>%
mutate(
id_filled = cumsum(is.na(id)) + max(id, na.rm = TRUE),
id_filled = coalesce(id, id_filled)
)
id id_filled
<dbl> <dbl>
1 13 13
2 8 8
3 20 20
4 34 34
5 4 4
6 NA 35
7 NA 36
8 NA 37
9 NA 38
10 NA 39
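To make the two steps explicit, here is a minimal base R sketch of the same idea on a plain vector. The intermediate names na_count and offset are only illustrative.

```r
id <- c(13, 8, 20, 34, 4, NA, NA, NA, NA, NA)

na_count <- cumsum(is.na(id))      # running count of NAs: 0 0 0 0 0 1 2 3 4 5
offset   <- max(id, na.rm = TRUE)  # 34, the largest observed id

# Keep observed ids; replace each NA by the maximum plus its running NA count
id_filled <- ifelse(is.na(id), offset + na_count, id)
id_filled
# [1] 13  8 20 34  4 35 36 37 38 39
```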
Here is one option that also works when NAs occur between observed values
library(dplyr)
library(tidyr)
library(data.table)
trialdata %>%
mutate(id1 = id) %>%
fill(id1, .direction = "downup") %>%
group_by(grp = rleid(id1)) %>%
mutate(id = id[!is.na(id)] + row_number() - row_number()[!is.na(id)]) %>%
ungroup %>%
select(-id1, -grp)
Output (shown for input whose non-NA ids are sorted in ascending order: 4, 8, 13, 20, 34)
# A tibble: 10 × 1
id
<dbl>
1 4
2 8
3 13
4 20
5 34
6 35
7 36
8 37
9 38
10 39
On another data
trialdata1 <- structure(list(id = c(NA, 4, 10, NA, NA, 20, NA)),
class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -7L))
the output is
# A tibble: 7 × 1
id
<dbl>
1 3
2 4
3 10
4 11
5 12
6 20
7 21
Here is an alternative:
library(dplyr)
library(tidyr)
trialdata %>%
mutate(id1 = row_number()) %>%
arrange(id) %>%
fill(id) %>%
group_by(id) %>%
mutate(id = ifelse(row_number()>1, id+row_number()-1, id)) %>%
arrange(id1) %>%
ungroup() %>%
select(-id1)
id
<dbl>
1 13
2 8
3 20
4 34
5 4
6 35
7 36
8 37
9 38
10 39
For trailing NAs, you could do something like this.
x[is.na(x)] <- (x[sum(x > 0, na.rm=TRUE)]) + seq(sum(is.na(x)))
x
# [1] 4 8 13 20 34 35 36 37 38 39
Data:
x <- c(sort(c(13, 8, 20, 34, 4)), NA, NA, NA, NA, NA)
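Note that sum(x > 0, na.rm = TRUE) finds the position of the last observed value only when all observed values are positive. A small sketch that does not rely on that, still assuming all NAs are trailing, could look like this (the example vector is made up for illustration):

```r
# Hypothetical data with non-positive values followed by trailing NAs
x <- c(-2, 0, 5, NA, NA)

last_obs <- max(which(!is.na(x)))  # index of the last non-NA value
n_na     <- sum(is.na(x))          # number of NAs to fill

# Continue counting upward from the last observed value
x[is.na(x)] <- x[last_obs] + seq_len(n_na)
x
# [1] -2  0  5  6  7
```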

Convert R to wide format with arrangement depending on value in column

I have got a table containing data from various samples ("sample1" etc.) on which several types of measurement (A to C) were made. Every measurement gave 3 values: concentration, maximum and minimum.
my.sample <- c("sample1", "sample1", "sample2", "sample2", "sample3")
type <- c("A", "B", "A", "C", "C")
concentration <- c(12, 5, 7, 10, 14)
max <- c(13, 6, 7, 11, 15)
min <- c(11, 4, 6, 9, 13)
mydata <- data.frame(my.sample, type, concentration, max, min)
> mydata
my.sample type concentration max min
1 sample1 A 12 13 11
2 sample1 B 5 6 4
3 sample2 A 7 7 6
4 sample2 C 10 11 9
5 sample3 C 14 15 13
I'd like to convert this data to a new table where I only have one row per sample. This means creating 3 columns (concentration, max, min) for every measurement type, with the type of measurement indicated in the column. Missing values should be defined as NA. Here's an example of the result I'd like to obtain:
my.sample.new <- c("sample1", "sample2", "sample3")
A_concentration <- c(12, 7, NA)
A_max <- c(13, 7, NA)
A_min <- c(11, 6, NA)
B_concentration <- c(5, NA, NA)
B_max <- c(6, NA, NA)
B_min <- c(4, NA, NA)
C_concentration <- c(NA, 10, 14)
C_max <- c(NA, 11, 15)
C_min <- c(NA, 9, 13)
mydata.new <- data.frame(my.sample.new, A_concentration, A_max, A_min, B_concentration, B_max, B_min, C_concentration, C_max, C_min)
> mydata.new
my.sample.new A_concentration A_max A_min B_concentration B_max B_min
1 sample1 12 13 11 5 6 4
2 sample2 7 7 6 NA NA NA
3 sample3 NA NA NA NA NA NA
C_concentration C_max C_min
1 NA NA NA
2 10 11 9
3 14 15 13
Is there a method to widen data based on a condition and include a value (here: from type) in the column name? I have got many more types in my real dataset, so it should ideally be generalisable.
This works:
library(dplyr)
library(tidyr) # pivot_wider() comes from tidyr
mydata %>%
pivot_wider(id_cols = my.sample, names_from = type, values_from = c(concentration, max, min), names_glue = "{type}_{.value}") %>%
select(my.sample, starts_with("A"), starts_with("B"), starts_with("C"))
This gives us:
# A tibble: 3 x 10
  my.sample A_concentration A_max A_min B_concentration B_max B_min C_concentration C_max C_min
  <chr>               <dbl> <dbl> <dbl>           <dbl> <dbl> <dbl>           <dbl> <dbl> <dbl>
1 sample1                12    13    11               5     6     4              NA    NA    NA
2 sample2                 7     7     6              NA    NA    NA              10    11     9
3 sample3                NA    NA    NA              NA    NA    NA              14    15    13
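Since the question mentions many more types in the real data, listing starts_with() for each type may not scale. One possible alternative, assuming tidyr 1.2.0 or later where pivot_wider() has the names_vary argument, is to let pivot_wider() group the columns by type directly:

```r
library(tidyr)

# Data from the question
mydata <- data.frame(
  my.sample = c("sample1", "sample1", "sample2", "sample2", "sample3"),
  type = c("A", "B", "A", "C", "C"),
  concentration = c(12, 5, 7, 10, 14),
  max = c(13, 6, 7, 11, 15),
  min = c(11, 4, 6, 9, 13)
)

wide <- pivot_wider(
  mydata,
  id_cols = my.sample,
  names_from = type,
  values_from = c(concentration, max, min),
  names_glue = "{type}_{.value}",
  names_vary = "slowest"  # keep all value columns of one type together (tidyr >= 1.2.0)
)
names(wide)
```

With names_vary = "slowest" the columns come out as A_concentration, A_max, A_min, B_concentration and so on, without naming any type in the code.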

1 Day lag, 3 Day Average Lags and 7 Average Day lags for all columns in R

I have 158 columns in a dataset. I want to create 3 new columns (1-day, 3-day and 7-day lag) for each column.
data <- data.frame(DATE = c("1/1/2016","1/2/2016","1/3/2016","1/4/2016","1/5/2016","1/6/2016","1/7/2016","1/8/2016","1/9/2016"),
Attr1 = c(5,8,7,6,2,1,4,1,2),
Attr2 = c(10,23,32,12,3,2,5,3,21),
Attr3 = c(12,23,43,3,2,4,1,23,33))
The result wanted is as follows :
Attr1_3D = Average of last 3 days of Attr1
Attr1_7D = Average of last 7 days of Attr1
Attr2_3D = Average of last 3 days of Attr2
Attr2_7D = Average of last 7 days of Attr2
Attr3_3D = Average of last 3 days of Attr3
Attr3_7D = Average of last 7 days of Attr3
One approach using tidyverse and zoo is below. You can use rollapply from zoo package to get rolling means (by 1, 3, or 7 days).
Edit: Also added offset by 1 day (as rolling mean values are included on the day after the X-day window). Also joining back to original data frame to include original Attr columns.
library(tidyverse)
library(zoo)
data %>%
pivot_longer(starts_with("Attr"), names_to = "Attr", values_to = "Value") %>%
group_by(Attr) %>%
mutate(Attr_1D = rollapply(Value, 1, mean, align = 'right', fill = NA),
Attr_3D = rollapply(Value, 3, mean, align = 'right', fill = NA),
Attr_7D = rollapply(Value, 7, mean, align = 'right', fill = NA),
DATE = lead(DATE)) %>%
pivot_wider(id_cols = DATE, names_from = "Attr", values_from = c("Attr_1D", "Attr_3D", "Attr_7D")) %>%
right_join(data)
Output
# A tibble: 9 x 13
DATE Attr_1D_Attr1 Attr_1D_Attr2 Attr_1D_Attr3 Attr_3D_Attr1 Attr_3D_Attr2 Attr_3D_Attr3 Attr_7D_Attr1 Attr_7D_Attr2 Attr_7D_Attr3 Attr1 Attr2 Attr3
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1/1/2016 NA NA NA NA NA NA NA NA NA 5 10 12
2 1/2/2016 5 10 12 NA NA NA NA NA NA 8 23 23
3 1/3/2016 8 23 23 NA NA NA NA NA NA 7 32 43
4 1/4/2016 7 32 43 6.67 21.7 26 NA NA NA 6 12 3
5 1/5/2016 6 12 3 7 22.3 23 NA NA NA 2 3 2
6 1/6/2016 2 3 2 5 15.7 16 NA NA NA 1 2 4
7 1/7/2016 1 2 4 3 5.67 3 NA NA NA 4 5 1
8 1/8/2016 4 5 1 2.33 3.33 2.33 4.71 12.4 12.6 1 3 23
9 1/9/2016 1 3 23 2 3.33 9.33 4.14 11.4 14.1 2 21 33
Data
data <- structure(list(DATE = structure(1:9, .Label = c("1/1/2016", "1/2/2016",
"1/3/2016", "1/4/2016", "1/5/2016", "1/6/2016", "1/7/2016", "1/8/2016",
"1/9/2016"), class = "factor"), Attr1 = c(5, 8, 7, 6, 2, 1, 4,
1, 2), Attr2 = c(10, 23, 32, 12, 3, 2, 5, 3, 21), Attr3 = c(12,
23, 43, 3, 2, 4, 1, 23, 33)), class = "data.frame", row.names = c(NA,
-9L))
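With 158 columns it may be convenient to stay in wide format. The following is a sketch, not the answer's method: it assumes dplyr's across() and zoo's rollmeanr() (the right-aligned rolling mean) and names the new columns Attr1_1D, Attr1_3D and so on, as requested in the question.

```r
library(dplyr)
library(zoo)

data <- data.frame(
  DATE  = c("1/1/2016", "1/2/2016", "1/3/2016", "1/4/2016", "1/5/2016",
            "1/6/2016", "1/7/2016", "1/8/2016", "1/9/2016"),
  Attr1 = c(5, 8, 7, 6, 2, 1, 4, 1, 2),
  Attr2 = c(10, 23, 32, 12, 3, 2, 5, 3, 21),
  Attr3 = c(12, 23, 43, 3, 2, 4, 1, 23, 33)
)

out <- data %>%
  mutate(across(
    starts_with("Attr"),
    list(
      # lag() shifts each result by one day so that a row only uses earlier days
      `1D` = ~ dplyr::lag(.x),
      `3D` = ~ dplyr::lag(rollmeanr(.x, 3, fill = NA)),
      `7D` = ~ dplyr::lag(rollmeanr(.x, 7, fill = NA))
    ),
    .names = "{.col}_{.fn}"
  ))
out
```

For example, Attr1_3D on 1/4/2016 is the mean of the Attr1 values from 1/1 to 1/3, which agrees with the 6.67 in the output above.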

How to join data frames in R without duplicating original data values

I have 2 data frames (DF1 & DF2) and I would like to join them together by a key column called "acc_num". In DF2, payment was made twice by acc_num A and thrice by B. Data frames are as follows.
DF1:
acc_num total_use sales
A 433 145
A NA 2
A NA 18
B 149 32
DF2:
acc payment
A 150
A 98
B 44
B 15
B 10
My desired output is:
acc_num total_use sales payment
A 433 145 150
A NA 2 98
A NA 18 NA
B 149 32 44
B NA NA 15
B NA NA 10
I've tried full_join and merge but the output was not as desired. I couldn't work this out as I'm still a beginner in R, and haven't found the solution to this.
Example of the code I used was
test_full_join <- DF1 %>% full_join(DF2, by = c("acc_num" = "acc"))
The displayed output was:
acc_num total_use sales payment
A 433 145 150
A 433 145 98
A NA 2 150
A NA 2 98
A NA 18 150
A NA 18 98
B 149 32 44
B 149 32 15
B 149 32 10
This is contrary to my desired output because, in the end, my concern is to get the total sums of total_use, sales and payment. The duplicated rows in this output would give me a wrong interpretation in data visualization later on.
We may need to do a join by row_number() based on 'acc_num'
library(dplyr)
df1 %>%
group_by(acc_num) %>%
mutate(grpind = row_number()) %>%
full_join(df2 %>%
group_by(acc_num = acc) %>%
mutate(grpind = row_number())) %>%
select(acc_num, total_use, sales, payment)
# A tibble: 6 x 4
# Groups: acc_num [2]
# acc_num total_use sales payment
# <chr> <int> <int> <int>
#1 A 433 145 150
#2 A NA 2 98
#3 A NA 18 NA
#4 B 149 32 44
#5 B NA NA 15
#6 B NA NA 10
data
df1 <- structure(list(acc_num = c("A", "A", "A", "B"), total_use = c(433L,
NA, NA, 149L), sales = c(145L, 2L, 18L, 32L)), class = "data.frame",
row.names = c(NA,
-4L))
df2 <- structure(list(acc = c("A", "A", "B", "B", "B"), payment = c(150L,
98L, 44L, 15L, 10L)), class = "data.frame", row.names = c(NA,
-5L))
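The same idea (number the rows within each account, then join on account and row number) can also be sketched in base R with ave() and merge(). This is an illustrative alternative, not the answer's code; the final colSums() call checks the totals that the question cares about.

```r
df1 <- data.frame(acc_num = c("A", "A", "A", "B"),
                  total_use = c(433L, NA, NA, 149L),
                  sales = c(145L, 2L, 18L, 32L))
df2 <- data.frame(acc = c("A", "A", "B", "B", "B"),
                  payment = c(150L, 98L, 44L, 15L, 10L))

# Row counter within each account: 1, 2, 3, ... per acc_num / acc
df1$grpind <- ave(seq_along(df1$acc_num), df1$acc_num, FUN = seq_along)
df2$grpind <- ave(seq_along(df2$acc), df2$acc, FUN = seq_along)

# Full outer join on account and within-account row number
out <- merge(df1, df2,
             by.x = c("acc_num", "grpind"),
             by.y = c("acc", "grpind"),
             all = TRUE)
out$grpind <- NULL

# Each original value appears exactly once, so the totals are not inflated
colSums(out[c("total_use", "sales", "payment")], na.rm = TRUE)
```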

Replace the values of NA with a sum of previous value and a current value in different column

I have a dataset where I have to fill NA values using the previous value and a sum of current value in another column. Basically, my data looks like
library(lubridate)
library(tidyverse)
library(zoo)
df <- tibble(
Id = c(1, 1, 1, 1, 2, 2, 2, 2),
Time = ymd(c("2012-09-01", "2012-09-02", "2012-09-03", "2012-09-04", "2012-09-01", "2012-09-02", "2012-09-03", "2012-09-04")),
av = c(18, NA, NA, NA, 21, NA, NA, NA),
Value = c(121, NA,NA, NA, 146, NA, NA, NA)
)
# A tibble: 8 x 4
Id Time av Value
<dbl> <date> <dbl> <dbl>
1 2012-09-01 18 121
1 2012-09-02 NA NA
1 2012-09-03 NA NA
1 2012-09-04 NA NA
2 2012-09-01 21 146
2 2012-09-02 NA NA
2 2012-09-03 NA NA
2 2012-09-04 NA NA
What I want to do is: where the Value is NA, I want to replace it by sum of previous Value and current value of av. If av is NA, it can be replaced with previous value. I use na.locf function from zoo package as
df1 <- df %>% arrange(Id, Time) %>% group_by(Id) %>%
mutate(av = zoo::na.locf(av))
However, filling in for Value seems to be difficult. I can do it using for loop as
# Back up the Value column for testing
df1$Value_backup <- df1$Value
for(i in 2:nrow(df1))
{
df1$Value[i] <- ifelse(is.na(df1$Value[i]), df1$av[i] + df1$Value[i-1], df1$Value[i])
}
This produces the result I want, but for a large dataset I believe there are better ways to do it in R. I tried the complete function from tidyr, but it adds two additional rows:
df1 <- df %>% arrange(Id, Time) %>% group_by(Id) %>% mutate(av = zoo::na.locf(av)) %>%
mutate(num_rows = n()) %>%
complete(nesting(Id), Value = seq(min(Value, na.rm = TRUE),
(min(Value, na.rm = TRUE) + max(num_rows) * min(na.omit(av))), min(na.omit(av))))
The output has two extra rows; 10 instead of 8
# A tibble: 10 x 5
# Groups: Id [2]
Id Value Time av num_rows
<dbl> <dbl> <date> <dbl> <int>
1 121 2012-09-01 18 4
1 139 NA NA NA
1 157 NA NA NA
1 175 NA NA NA
1 193 NA NA NA
2 146 2012-09-01 21 4
2 167 NA NA NA
2 188 NA NA NA
2 209 NA NA NA
2 230 NA NA NA
Any help to do it faster without loops would be greatly appreciated.
In the question, av starts with a non-NA value in each group and is followed by NAs; if this is the general pattern, then the following will work. Note that it is good form to close any group_by with ungroup; however, we did not do that below so that we could compare df2 with df1.
df2 <- df %>%
group_by(Id) %>%
mutate(Value_backup = Value,
av = first(av),
Value = first(Value) + cumsum(av) - av)
identical(df1, df2)
## [1] TRUE
Note
For reproducibility first run this (taken from question except we only load needed packages):
library(dplyr)
library(tibble)
library(lubridate)
df <- tibble(
Id = c(1, 1, 1, 1, 2, 2, 2, 2),
Time = ymd(c("2012-09-01", "2012-09-02", "2012-09-03", "2012-09-04", "2012-09-01", "2012-09-02", "2012-09-03", "2012-09-04")),
av = c(18, NA, NA, NA, 21, NA, NA, NA),
Value = c(121, NA,NA, NA, 146, NA, NA, NA)
)
df1 <- df %>% arrange(Id, Time) %>% group_by(Id) %>%
mutate(av = zoo::na.locf(av))
df1$Value_backup <- df1$Value
for(i in 2:nrow(df1))
{
df1$Value[i] <- ifelse(is.na(df1$Value[i]), df1$av[i] + df1$Value[i-1], df1$Value[i])
}
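If av can change within a group (new non-NA values appearing later), a slightly more general sketch is to carry av forward with tidyr::fill() and then use a cumulative sum. This assumes that only the first Value of each group is observed; the av values below are made up for illustration and differ from the question's data.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: av changes within a group, Value is known only at the start
df <- tibble(
  Id    = c(1, 1, 1, 1, 2, 2, 2, 2),
  av    = c(18, NA, 20, NA, 21, NA, NA, 5),
  Value = c(121, NA, NA, NA, 146, NA, NA, NA)
)

res <- df %>%
  group_by(Id) %>%
  fill(av, .direction = "down") %>%                        # carry the last av forward
  mutate(Value = first(Value) + cumsum(av) - first(av)) %>% # previous Value + current av
  ungroup()
res$Value
# [1] 121 139 159 179 146 167 188 193
```

Subtracting first(av) removes the av of the first row, whose Value is already given, so every later row equals the previous Value plus its own av.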
