Running delta - Lag difference over row of data frame - r

I have a data frame such as
df1 <- data.frame(Company = c('A','B','C','D','E'),
`X1980` = c(1, 5, 3, 8, 13),
`X1981` = c(20, 13, 23, 11, 29),
`X1982` = c(33, 32, 31, 41, 42),
`X1983` = c(45, 47, 53, 58, 55))
I would like to create a new data frame (df2) keeping the company column as is. The values for each year from 1980 to 1983 should be calculated as the current value minus the previous year's value (1980 has no previous year, so it becomes NA). So basically I would like a data frame containing the rolling deltas.
Company 1980 1981 1982 1983
A NA 19 13 12
B NA 8 19 15
C NA 20 8 22
D NA 3 30 17
E NA 16 13 13
Thanks for the help! If there's any way for me to improve the question, then just let me know.

You can take the difference along each row and prepend NA as the first value.
df1[-1] <- t(apply(df1[-1], 1, function(x) c(NA, diff(x))))
df1
# Company X1980 X1981 X1982 X1983
#1 A NA 19 13 12
#2 B NA 8 19 15
#3 C NA 20 8 22
#4 D NA 3 30 17
#5 E NA 16 13 13
You can also use tidyverse functions.
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = -Company) %>%
group_by(Company) %>%
mutate(value = value - lag(value)) %>%
pivot_wider()

We can use rowDiffs from matrixStats
library(matrixStats)
df1[-1] <- cbind(NA, rowDiffs(as.matrix(df1[-1])))
-output
df1
# Company X1980 X1981 X1982 X1983
#1 A NA 19 13 12
#2 B NA 8 19 15
#3 C NA 20 8 22
#4 D NA 3 30 17
#5 E NA 16 13 13
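The same deltas can also be computed in base R without apply or an extra package, by subtracting the year columns shifted by one position. This is only a sketch: the names vals, delta and df2 are illustrative, and it assumes every column after Company is numeric and already in year order.

```r
# Example data from the question
df1 <- data.frame(Company = c('A', 'B', 'C', 'D', 'E'),
                  X1980 = c(1, 5, 3, 8, 13),
                  X1981 = c(20, 13, 23, 11, 29),
                  X1982 = c(33, 32, 31, 41, 42),
                  X1983 = c(45, 47, 53, 58, 55))

# Year columns as a numeric matrix
vals <- as.matrix(df1[-1])

# Each year minus the previous year; the first year has no predecessor, so NA
delta <- cbind(NA, vals[, -1, drop = FALSE] - vals[, -ncol(vals), drop = FALSE])
colnames(delta) <- colnames(vals)

# New data frame; df1 itself is left unchanged
df2 <- data.frame(Company = df1$Company, delta)
df2
```

Unlike the answers above, this leaves df1 untouched and writes the result to df2, as the question asked.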

Related

R - Impute missing values by group (linear / moving average)

I have a large dataset with a lot of missing values and I want to impute it by group "name" either linearly or with moving average.
d <- data.frame(
name = c('a', 'a','a','a','b','b','b','b','c','c','c','c'),
year = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
V = c(NA, 21, 31, 41, 11, NA, NA, 41, NA, NA, NA, 41),
W = c(11, NA, 31, 41, 11, 21, NA, NA, NA, NA, 31, NA),
X = c(11, 21, NA, 41, NA, 21, NA, 41, 11, NA, NA, NA),
Y = c(11, 21, 31, NA, NA, 21, 31, NA, NA, 21, NA, NA),
Z = c(NA, NA, 31, 41, 11, NA, 31, NA, NA, NA, NA, NA)
)
> d
name year V W X Y Z
1 a 1 NA 11 11 11 NA
2 a 2 21 NA 21 21 NA
3 a 3 31 31 NA 31 31
4 a 4 41 41 41 NA 41
5 b 1 11 11 NA NA 11
6 b 2 NA 21 21 21 NA
7 b 3 NA NA NA 31 31
8 b 4 41 NA 41 NA NA
9 c 1 NA NA 11 NA NA
10 c 2 NA NA NA 21 NA
11 c 3 NA 31 NA NA NA
12 c 4 41 NA NA NA NA
Hopefully the results can be as close as possible to the following:
name year V W X Y Z
1 a 1 11 11 11 11 11
2 a 2 21 21 21 21 21
3 a 3 31 31 31 31 31
4 a 4 41 41 41 41 41
5 b 1 11 11 11 11 11
6 b 2 21 21 21 21 21
7 b 3 31 31 31 31 31
8 b 4 41 41 41 41 41
9 c 1 11 11 11 11 NA
10 c 2 21 21 21 21 NA
11 c 3 31 31 31 31 NA
12 c 4 41 41 41 41 NA
I found this and this. I tried the following without grouping, but it didn't work:
data.frame(lapply(d, function(X) approxfun(seq_along(X), X)(seq_along(X))))
imputeTS::na_ma(d, k = 2, weighting = "simple")
The first one gave an error as below:
Error in approxfun(seq_along(X), X) :
need at least two non-NA values to interpolate
In addition: Warning message:
In xy.coords(x, y, setLab = FALSE) :
Error in approxfun(seq_along(X), X) :
need at least two non-NA values to interpolate
So I tried the second one, but it kept loading for a long time and nothing happened. According to the reply from the first link,
the package requires time series/vector input (that's why each column
has to be called separately).
Any help is greatly appreciated!
You can use zoo::na.spline -
library(dplyr)
d %>%
group_by(name) %>%
mutate(across(V:Z, zoo::na.spline, na.rm = FALSE)) %>%
ungroup
# name year V W X Y Z
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 11 11 11 11 11
# 2 a 2 21 21 21 21 21
# 3 a 3 31 31 31 31 31
# 4 a 4 41 41 41 41 41
# 5 b 1 11 11 11 11 11
# 6 b 2 21 21 21 21 21
# 7 b 3 31 31 31 31 31
# 8 b 4 41 41 41 41 41
# 9 c 1 41 31 11 21 NA
#10 c 2 41 31 11 21 NA
#11 c 3 41 31 11 21 NA
#12 c 4 41 31 11 21 NA
For name "c", I think it would be difficult to impute the missing values from only 1 number.
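If linear interpolation is preferred over a spline, a base R sketch along the following lines may work. The helper name interp_linear is hypothetical, and two choices here are assumptions rather than anything from the question: series with fewer than two observed values are left unchanged (which avoids the approxfun error quoted above), and leading or trailing NAs are filled with the nearest observed value (rule = 2) instead of being extrapolated linearly.

```r
d <- data.frame(
  name = c('a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c'),
  year = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
  V = c(NA, 21, 31, 41, 11, NA, NA, 41, NA, NA, NA, 41),
  W = c(11, NA, 31, 41, 11, 21, NA, NA, NA, NA, 31, NA),
  X = c(11, 21, NA, 41, NA, 21, NA, 41, 11, NA, NA, NA),
  Y = c(11, 21, 31, NA, NA, 21, 31, NA, NA, 21, NA, NA),
  Z = c(NA, NA, 31, 41, 11, NA, 31, NA, NA, NA, NA, NA)
)

# Hypothetical helper: linear interpolation within one group's series
interp_linear <- function(x) {
  ok <- !is.na(x)
  # approx() needs at least two observed points; otherwise keep the series as is
  if (sum(ok) < 2) return(x)
  # rule = 2 fills leading/trailing NAs with the nearest observed value
  approx(which(ok), x[ok], xout = seq_along(x), rule = 2)$y
}

# Apply the helper per group of `name`, one column at a time
cols <- c("V", "W", "X", "Y", "Z")
d[cols] <- lapply(d[cols], function(col) ave(col, d$name, FUN = interp_linear))
d
```

With this choice, group "b" column V becomes 11, 21, 31, 41, while the leading NA of group "a" column V is filled with 21 rather than 11, and group "c" stays as it was.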
One issue I see is that some of the series you want to impute have only 1 non-NA value, so na_ma or na_interpolation from imputeTS (and similar functions from other packages) cannot be applied successfully, since these require at least 2 non-NA values.
That is why in this solution I created an impute_select function for you that lets you choose what to do when more than 1 value is present, when exactly 1 value is present, or when there are only NAs.
In this case, when more than 1 value is present, it uses na_ma, but you could also use na_interpolation or any other imputation function from imputeTS here.
When only 1 value is present, it uses na_locf, since this method also works with only 1 value in the series.
When no non-NA values are in the series, it uses na_replace, just replacing all the NAs with a default value (I just set it to 11).
By adjusting this function you should be able to individually adjust the imputation for different amounts of NAs in the series.
library("imputeTS")
d <- data.frame(
name = c('a', 'a','a','a','b','b','b','b','c','c','c','c'),
year = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
V = c(NA, 21, 31, 41, 11, NA, NA, 41, NA, NA, NA, 41),
W = c(11, NA, 31, 41, 11, 21, NA, NA, NA, NA, 31, NA),
X = c(11, 21, NA, 41, NA, 21, NA, 41, 11, NA, NA, NA),
Y = c(11, 21, 31, NA, NA, 21, 31, NA, NA, 21, NA, NA),
Z = c(NA, NA, 31, 41, 11, NA, 31, NA, NA, NA, NA, NA)
)
impute_select <- function(x) {
  n_obs <- sum(!is.na(x))
  if (n_obs > 1) {
    # Method to use when more than 1 value is available
    na_ma(x)
  } else if (n_obs == 1) {
    # Method to use when only 1 value is in the series
    na_locf(x)
  } else {
    # Method to use when no non-NA value is present
    na_replace(x, 11)
  }
}
# This code is to apply the function row-wise to your data frame
# Since usually the imputation would happen column-wise instead
d[,3:7] <- t(apply(d[,3:7], MARGIN = 1, FUN = impute_select))
d
These are the results (hopefully what you wanted):
name year V W X Y Z
1 a 1 11 11 11 11 11
2 a 2 21 21 21 21 21
3 a 3 31 31 31 31 31
4 a 4 41 41 41 41 41
5 b 1 11 11 11 11 11
6 b 2 21 21 21 21 21
7 b 3 31 31 31 31 31
8 b 4 41 41 41 41 41
9 c 1 11 11 11 11 11
10 c 2 21 21 21 21 21
11 c 3 31 31 31 31 31
12 c 4 41 41 41 41 41

How to restructure data with one observation by row into data with one observation by ID (and multiple columns) in R?

Let's say I have a dataframe with 3 ID columns and one column of interest. Each row represents one observation. Some ID have multiple observations, i.e., multiple rows.
df <- data.frame(id1 = c( 1, 2, 3, 4, 4),
id2 = c( 11, 12, 13, 14, 14),
id3 = c(111, 112, 113, 114, 114),
variable_of_interest = c(13, 24, 35, 31, 12))
id1 id2 id3 variable_of_interest
1 1 11 111 13
2 2 12 112 24
3 3 13 113 35
4 4 14 114 31
5 4 14 114 12
My goal is to restructure it in order to have one row per ID, to keep the 3 IDs and to name the new columns "variable_of_interest1", "variable_of_interest2":
id1 id2 id3 variable_of_interest1 variable_of_interest2
1 1 11 111 13 NA
2 2 12 112 24 NA
3 3 13 113 35 NA
4 4 14 114 31 12
The solution might need reshape2 and the dcast function, but so far I could not work this out.
We can create a sequence grouped by the 'id' columns and then reshape to wide with pivot_wider
library(dplyr)
library(stringr)
library(tidyr)
library(data.table)
df %>%
mutate(ind = str_c('variable_of_interest', rowid(id1, id2, id3))) %>%
pivot_wider(names_from = ind, values_from = variable_of_interest)
-output
# A tibble: 4 x 5
# id1 id2 id3 variable_of_interest1 variable_of_interest2
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 11 111 13 NA
#2 2 12 112 24 NA
#3 3 13 113 35 NA
#4 4 14 114 31 12
Or another option is data.table
library(data.table)
dcast(setDT(df), id1 + id2 + id3 ~
paste0('variable_of_interest', rowid(id1, id2, id3)),
value.var = 'variable_of_interest')
-output
# id1 id2 id3 variable_of_interest1 variable_of_interest2
#1: 1 11 111 13 NA
#2: 2 12 112 24 NA
#3: 3 13 113 35 NA
#4: 4 14 114 31 12
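For reference, the same idea (number the observations within each ID combination, then spread them into columns) can be sketched in base R with ave() and reshape(). The column name obs and the result name wide are illustrative choices, not part of the question.

```r
df <- data.frame(id1 = c(1, 2, 3, 4, 4),
                 id2 = c(11, 12, 13, 14, 14),
                 id3 = c(111, 112, 113, 114, 114),
                 variable_of_interest = c(13, 24, 35, 31, 12))

# Running counter within each id1/id2/id3 combination (1, 2, ... per ID)
df$obs <- ave(seq_along(df$id1), df$id1, df$id2, df$id3, FUN = seq_along)

# Long to wide: one row per ID, one column per counter value.
# sep = "" yields names such as variable_of_interest1
wide <- reshape(df,
                idvar = c("id1", "id2", "id3"),
                timevar = "obs",
                direction = "wide",
                sep = "")
wide
```

The counter plays the role of rowid() in the answers above; IDs with fewer observations than the maximum get NA in the extra columns.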

How to handle or ignore NAs when using ifelse to mutate a new column with multiple conditions (solved)

I am a newcomer to dplyr and tried to create a new composite variable from three different age variables using dplyr and ifelse. I made a data frame to explain the situation as follows:
library(dplyr)
z <- data.frame("j6" = c(6, 19, NA, NA, NA, NA, NA, 8, 20, 20, NA),
"j7" = c(27, 20, NA, 7, 19, NA, NA, 20, 30, 9, NA),
"j8" = c(8, 22, NA, 20, NA, 8, 30, NA, NA, NA, 3))
z <- z %>%
mutate(., age_event = NA) %>%
mutate(., age_event = ifelse(j6 < 18 | j7 < 18 | j8 < 18, 1, 0))
My expectations:
The three columns (j6, j7, and j8) indicate ages, and if at least one of them is less than 18 years old, the new column (age_event) should be 1, otherwise 0.
And if two of the three columns are both 18 years or older and the other is NA, the age_event variable should be 0.
Likewise, if one of the three columns is 18 years or older and the others are NAs, the age_event variable should be 0.
Also, it should be NA if all three columns are NAs.
However, the result and problems are shown as follows:
> z
j6 j7 j8 age_event
1 6 27 8 1
2 19 20 22 0
3 NA NA NA NA
4 NA 7 20 1
5 NA 19 NA NA <-- should be 0, but NA
6 NA NA 8 1
7 NA NA 30 NA <-- should be 0, but NA
8 8 20 NA 1
9 20 30 NA NA <-- should be 0, but NA
10 20 9 NA 1
11 NA NA 3 1
I'd like to know if there is a way to turn the 5th, 7th, and 9th observations above into 0s using mutate and ifelse. Any suggestions would be greatly appreciated!
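The NA results in rows 5, 7, and 9 follow from how R combines missing values in logical expressions: NA | TRUE is TRUE, but NA | FALSE is NA, because the missing operand could still turn the result TRUE. A minimal illustration using the values from row 5 (the variable names cond and any_under are only for this sketch):

```r
# Three-valued logic in R
NA | TRUE    # TRUE: the result is TRUE whatever the missing value is
NA | FALSE   # NA: the result depends on the missing value

# Row 5 of the example: only j7 is observed
j6 <- NA
j7 <- 19
j8 <- NA

# NA | FALSE | NA evaluates to NA, so ifelse() returns NA for this row
cond <- j6 < 18 | j7 < 18 | j8 < 18

# Dropping the NAs before combining gives FALSE instead
any_under <- any(c(j6, j7, j8) < 18, na.rm = TRUE)
```

Note that any(..., na.rm = TRUE) also returns FALSE when all three values are NA, so the all-NA case still needs its own check if it should stay NA.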
Update (2/27/2020): I found a solution with pmin when using mutate and ifelse:
z <- z %>%
mutate(., age_event = ifelse(is.na(j6) & is.na(j7) & is.na(j8), NA,
ifelse(pmin(j6, j7, j8, na.rm = T) < 18, 1, 0)))
> z
j6 j7 j8 age_event
1 6 27 8 1
2 19 20 22 0
3 NA NA NA NA
4 NA 7 20 1
5 NA 19 NA 0
6 NA NA 8 1
7 NA NA 30 0
8 8 20 NA 1
9 20 30 NA 0
10 20 9 NA 1
11 NA NA 3 1
You can use rowMeans() in place of ifelse(), which will also handle rows that are all NA.
z %>%
mutate(age_event = +(rowMeans(. < 18, na.rm = TRUE) > 0))
j6 j7 j8 age_event
1 6 27 8 1
2 19 20 22 0
3 NA NA NA NA
4 NA 7 20 1
5 NA 19 NA 0
6 NA NA 8 1
7 NA NA 30 0
8 8 20 NA 1
9 20 30 NA 0
10 20 9 NA 1
11 NA NA 3 1
We can use rowSums to calculate the number of non-NA values in a row and the number of values that are less than 18. We can then use case_when to assign numbers based on different conditions.
library(dplyr)
z %>%
mutate(calc = rowSums(!is.na(.), na.rm = TRUE),
ls18 = rowSums(. < 18, na.rm = TRUE),
age_event = case_when(calc == 0 & ls18 == 0 ~ NA_integer_,
ls18 > 0 ~ 1L,
TRUE ~ 0L)) %>%
select(-calc, -ls18)
# j6 j7 j8 age_event
#1 6 27 8 1
#2 19 20 22 0
#3 NA NA NA NA
#4 NA 7 20 1
#5 NA 19 NA 0
#6 NA NA 8 1
#7 NA NA 30 0
#8 8 20 NA 1
#9 20 30 NA 0
#10 20 9 NA 1
#11 NA NA 3 1

Replace missing values with the values of related ID

I have a dataframe with missing values.
df1 <- data.frame(ID = c(1, 2, 3, 4, 5, 6), value1 = c(23, 14, NA, 45, NA, NA),
value2 = c(25, 15, NA, 34, NA, NA), value3 = c(33, 29, NA, 29, NA, NA))
ID value1 value2 value3
1 23 25 33
2 14 15 29
3 NA NA NA
4 45 34 29
5 NA NA NA
6 NA NA NA
And a dataframe with id relations.
df2 <- data.frame(ID1 = c(1, 2, 4), ID2 = c(3, 5, 6))
ID1 ID2
1 3
2 5
4 6
I want to replace the missing values, with the values of the related ID.
So the dataframe will look like this.
ID value1 value2 value3
1 23 25 33
2 14 15 29
3 23 25 33
4 45 34 29
5 14 15 29
6 45 34 29
Any help would be appreciated.
You will need a for loop like this:
for (i in seq_along(df2[, "ID2"])) {
df1[df2[i, "ID2"], c("value1", "value2", "value3")] <- df1[df2[i, "ID1"], c("value1", "value2", "value3")] }
As @FannieY already suggested, you can use a for loop. In addition, I test with is.na to avoid overwriting existing values.
for(i in seq_len(nrow(df2))) {
idx <- is.na(df1[df2[i,2],-1])
df1[df2[i,2],-1][idx] <- df1[df2[i,1],-1][idx]
}
df1
# ID value1 value2 value3
#1 1 23 25 33
#2 2 14 15 29
#3 3 23 25 33
#4 4 45 34 29
#5 5 14 15 29
#6 6 45 34 29
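Both loops above index df1 by row number, which works here because each ID equals its row position. A loop-free sketch that looks the rows up with match() instead, so it should also work when IDs and row numbers differ (the names src, dst, cols, target, source and fill are illustrative):

```r
df1 <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                  value1 = c(23, 14, NA, 45, NA, NA),
                  value2 = c(25, 15, NA, 34, NA, NA),
                  value3 = c(33, 29, NA, 29, NA, NA))
df2 <- data.frame(ID1 = c(1, 2, 4), ID2 = c(3, 5, 6))

# Row positions of the source IDs (ID1) and the IDs to fill (ID2) in df1
src <- match(df2$ID1, df1$ID)
dst <- match(df2$ID2, df1$ID)
cols <- c("value1", "value2", "value3")

# Copy values only where the target cell is missing
target <- as.matrix(df1[dst, cols])
source <- as.matrix(df1[src, cols])
fill <- is.na(target)
target[fill] <- source[fill]
df1[dst, cols] <- target
df1
```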

data frame selecting top by grouping

I have a data frame such as:
set.seed(1)
df <- data.frame(
sample = 1:50,
value = runif(50),
group = c(rep(NA, 20), gl(3, 10)))
I want to select the top 10 samples based on value. However, if there is a group corresponding to the sample, I only want to include one sample from that group. If group == NA, I want to include all of them. Arranging df by value looks like:
df_top <- df %>%
arrange(-value) %>%
top_n(10, value)
sample value group
1 46 0.7973088 3
2 49 0.8108702 3
3 22 0.8394404 1
4 2 0.8612095 NA
5 27 0.8643395 1
6 20 0.8753213 NA
7 44 0.8762692 3
8 26 0.8921983 1
9 11 0.9128759 NA
10 30 0.9606180 1
I would want to include samples 36, 22, 2, 20, 11, and the next five highest values in my data frame that continue to fit the pattern. How do I accomplish this?
I think I figured this out. Would this be the best way:
df_top <- df %>%
arrange(-value) %>%
group_by(group) %>%
filter(ifelse(!is.na(group), value == max(value), value == value)) %>%
ungroup() %>%
top_n(10, value)
# A tibble: 10 x 3
sample value group
<int> <dbl> <int>
1 18 0.992 NA
2 7 0.945 NA
3 21 0.935 1
4 4 0.908 NA
5 6 0.898 NA
6 35 0.827 2
7 41 0.821 3
8 20 0.777 NA
9 15 0.770 NA
10 17 0.718 NA
Similar method that uses slice instead of filter:
library(dplyr)
df_top <- df %>%
arrange(-value) %>%
group_by(group) %>%
slice(if(any(!is.na(group))) 1 else 1:n()) %>%
ungroup() %>%
top_n(10, value)
Result:
# A tibble: 10 x 3
sample value group
<int> <dbl> <int>
1 21 0.9347052 1
2 35 0.8273733 2
3 41 0.8209463 3
4 18 0.9919061 NA
5 7 0.9446753 NA
6 4 0.9082078 NA
7 6 0.8983897 NA
8 20 0.7774452 NA
9 15 0.7698414 NA
10 17 0.7176185 NA
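The same selection can be sketched in base R: sort by value, keep every row without a group plus the first (highest) row of each group, then take the first ten. The names ord and keep are illustrative. Unlike top_n, head() does not return extra rows when values are tied.

```r
set.seed(1)
df <- data.frame(
  sample = 1:50,
  value = runif(50),
  group = c(rep(NA, 20), gl(3, 10)))

# Sort by value, highest first
ord <- df[order(-df$value), ]

# Keep all rows without a group, and only the first (best) row of each group
keep <- is.na(ord$group) | !duplicated(ord$group)

# Top 10 of the remaining candidates
df_top <- head(ord[keep, ], 10)
df_top
```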
