I have a dataframe with missing values.
df1 <- data.frame(ID = c(1, 2, 3, 4, 5, 6), value1 = c(23, 14, NA, 45, NA, NA),
value2 = c(25, 15, NA, 34, NA, NA), value3 = c(33, 29, NA, 29, NA, NA))
ID value1 value2 value3
1 23 25 33
2 14 15 29
3 NA NA NA
4 45 34 29
5 NA NA NA
6 NA NA NA
And a dataframe with id relations.
df2 <- data.frame(ID1 = c(1, 2, 4), ID2 = c(3, 5, 6))
ID1 ID2
1 3
2 5
4 6
I want to replace the missing values with the values of the related ID.
So the dataframe will look like this.
ID value1 value2 value3
1 23 25 33
2 14 15 29
3 23 25 33
4 45 34 29
5 14 15 29
6 45 34 29
Any help would be appreciated.
You will need a for loop, like this:
for (i in seq_along(df2[, "ID2"])) {
  # row indices coincide with the ID values in this example
  df1[df2[i, "ID2"], c("value1", "value2", "value3")] <-
    df1[df2[i, "ID1"], c("value1", "value2", "value3")]
}
As @FannieY already suggested, you can use a for loop. In addition, I test with is.na to avoid overwriting existing values.
for (i in seq_len(nrow(df2))) {
  idx <- is.na(df1[df2[i, 2], -1])
  df1[df2[i, 2], -1][idx] <- df1[df2[i, 1], -1][idx]
}
df1
# ID value1 value2 value3
#1 1 23 25 33
#2 2 14 15 29
#3 3 23 25 33
#4 4 45 34 29
#5 5 14 15 29
#6 6 45 34 29
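If df1 grows large, the same fill can be vectorized with match instead of looping over df2. A sketch of that idea (same data as above; only the value columns are filled, and only where they are missing):

```r
df1 <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                  value1 = c(23, 14, NA, 45, NA, NA),
                  value2 = c(25, 15, NA, 34, NA, NA),
                  value3 = c(33, 29, NA, 29, NA, NA))
df2 <- data.frame(ID1 = c(1, 2, 4), ID2 = c(3, 5, 6))

cols <- c("value1", "value2", "value3")
to   <- match(df2$ID2, df1$ID)   # row positions to fill
from <- match(df2$ID1, df1$ID)   # row positions to copy from

idx <- is.na(df1[to, cols])      # logical matrix: fill only missing cells
df1[to, cols][idx] <- df1[from, cols][idx]
df1
```

Because the rows are looked up with match on the ID column, this also works when the IDs do not happen to equal the row numbers.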
Here is dataframe 1
card value cat1 cat2 cat3
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 10 1 2 3
2 A 20 4 5 6
3 B 30 7 8 9
4 A 40 10 11 12
Here is dataframe 2 with the same number of rows and columns
card value cat1 cat2 cat3
<chr> <dbl> <dbl> <dbl> <dbl>
1 C 11 13 14 15
2 C 19 16 17 18
3 A 35 19 20 21
4 B 45 22 23 24
I want to create a new dataframe chosen row by row based on the "value" column: each row of the new dataframe is the entire row, from whichever dataframe, that has the higher number in its "value" column.
Thus the desired solution is:
card value cat1 cat2 cat3
<chr> <dbl> <dbl> <dbl> <dbl>
1 C 11 13 14 15
2 A 20 4 5 6
3 A 35 19 20 21
4 B 45 22 23 24
These are demo dataframes. The actual data frames are on the order of 200,000 rows. What is the best way to do this? Note, it would also be good to have a column in the new dataframe indicating which data frame the row was from: df_1 or df_2.
Dataframes
df_1 <- structure(list(card = c("A", "A", "B", "A"), value = c(10, 20,
30, 40), cat1 = c(1, 4, 7, 10), cat2 = c(2, 5, 8, 11), cat3 = c(3,
6, 9, 12)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-4L))
df_2 <- structure(list(card = c("C", "C", "A", "B"), value = c(11, 19,
35, 45), cat1 = c(13, 16, 19, 22), cat2 = c(14, 17, 20, 23),
cat3 = c(15, 18, 21, 24)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L))
You can probably avoid needing to do this as a grouped operation if you stack each dataset and offset the row indexes. E.g.:
sel <- max.col(cbind(df_1$value, df_2$value))
rbind(df_1, df_2)[seq_along(sel) + c(0,nrow(df_1))[sel],]
## A tibble: 4 x 5
# card value cat1 cat2 cat3
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 C 11 13 14 15
#2 A 20 4 5 6
#3 A 35 19 20 21
#4 B 45 22 23 24
sel already records which dataset each row came from, so you can attach it as a column:
cbind(rbind(df_1, df_2)[seq_along(sel) + c(0,nrow(df_1))[sel],], src=sel)
# card value cat1 cat2 cat3 src
#1 C 11 13 14 15 2
#2 A 20 4 5 6 1
#3 A 35 19 20 21 2
#4 B 45 22 23 24 2
base solutions
ifelse
do.call(rbind,
        ifelse(df_1$value >= df_2$value,
               split(df_1, seq_len(nrow(df_1))),
               split(df_2, seq_len(nrow(df_2)))))
lapply
do.call(rbind, lapply(seq_len(nrow(df_1)), \(x) {
  if (df_1$value[x] >= df_2$value[x]) df_1[x, ] else df_2[x, ]
}))
# card value cat1 cat2 cat3
# 1 C 11 13 14 15
# 2 A 20 4 5 6
# 3 A 35 19 20 21
# 4 B 45 22 23 24
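Another base sketch that avoids per-row work entirely: find the rows where df_2 wins, overwrite them in a copy of df_1, and record the source (the src column is my addition, not part of the original data):

```r
df_1 <- data.frame(card = c("A", "A", "B", "A"), value = c(10, 20, 30, 40),
                   cat1 = c(1, 4, 7, 10), cat2 = c(2, 5, 8, 11),
                   cat3 = c(3, 6, 9, 12))
df_2 <- data.frame(card = c("C", "C", "A", "B"), value = c(11, 19, 35, 45),
                   cat1 = c(13, 16, 19, 22), cat2 = c(14, 17, 20, 23),
                   cat3 = c(15, 18, 21, 24))

swap <- df_2$value > df_1$value   # TRUE where the row should come from df_2
out <- df_1
out[swap, ] <- df_2[swap, ]       # overwrite the losing rows in place
out$src <- ifelse(swap, "df_2", "df_1")  # which data frame each row came from
out
```

Both vectors of 200,000 values are compared once, so this scales well to the stated data size.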
I prepared a reproducible example dataset:
library(tibble)

patient <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
month <- rep(1:10, 2)
fev1 <- c(58, NA, NA, NA, 57, NA, NA, NA, NA, 60, NA, NA, NA, NA, 32, NA, NA, NA, NA, 40)
adherence <- c(30, 32, 34, 36, 34, 32, 30, 34, 32, 36, 70, 65, 75, 70, 70, 55, 50, 65, 70, 70)
data <- tibble(patient, month, fev1, adherence)
data
I would like to obtain a 5th column, called mean adherence. For each non-NA fev1 value, it should hold the mean of the adherence values observed since the previous non-NA fev1 value: including the adherence value recorded alongside that previous fev1 value, but excluding the one recorded alongside the current fev1 value. This should be computed separately for each patient id.
For example, for patient 1 at month 5 (fev1 = 57), it would compute the mean of (30, 32, 34, 36); then, for fev1 = 60, it would compute the mean of (34, 32, 30, 34, 32).
Thank you very much for your help
We could create a grouping variable based on the NA values in fev1, and then get the mean by group
library(dplyr)
data %>%
  group_by(patient) %>%
  mutate(lagadher = lag(adherence),
         grp = lag(cumsum(!is.na(fev1)))) %>%
  group_by(grp, .add = TRUE) %>%
  mutate(Mean_adhere = mean(lagadher) * NA^(is.na(fev1))) %>%
  ungroup %>%
  select(-grp, -lagadher)
-output
# A tibble: 20 × 5
patient month fev1 adherence Mean_adhere
<dbl> <int> <dbl> <dbl> <dbl>
1 1 1 58 30 NA
2 1 2 NA 32 NA
3 1 3 NA 34 NA
4 1 4 NA 36 NA
5 1 5 57 34 33
6 1 6 NA 32 NA
7 1 7 NA 30 NA
8 1 8 NA 34 NA
9 1 9 NA 32 NA
10 1 10 60 36 32.4
11 2 1 NA 70 NA
12 2 2 NA 65 NA
13 2 3 NA 75 NA
14 2 4 NA 70 NA
15 2 5 32 70 70
16 2 6 NA 55 NA
17 2 7 NA 50 NA
18 2 8 NA 65 NA
19 2 9 NA 70 NA
20 2 10 40 70 62
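For comparison, the same idea can be sketched in base R: build the lagged grouping per patient and average with ave. (One small difference from the dplyr output: an all-NA group yields NaN rather than NA for the first fev1 observation.)

```r
patient <- c(rep(1, 10), rep(2, 10))
month <- rep(1:10, 2)
fev1 <- c(58, NA, NA, NA, 57, NA, NA, NA, NA, 60,
          NA, NA, NA, NA, 32, NA, NA, NA, NA, 40)
adherence <- c(30, 32, 34, 36, 34, 32, 30, 34, 32, 36,
               70, 65, 75, 70, 70, 55, 50, 65, 70, 70)
data <- data.frame(patient, month, fev1, adherence)

data <- do.call(rbind, lapply(split(data, data$patient), function(d) {
  # group index: increments after each non-NA fev1, lagged by one month
  grp <- c(0, head(cumsum(!is.na(d$fev1)), -1))
  lagadh <- c(NA, head(d$adherence, -1))   # adherence shifted down by one month
  m <- ave(lagadh, grp, FUN = function(x) mean(x, na.rm = TRUE))
  d$Mean_adhere <- ifelse(is.na(d$fev1), NA, m)
  d
}))
data
```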
I have a large dataset with a lot of missing values and I want to impute it by group "name" either linearly or with moving average.
d <- data.frame(
name = c('a', 'a','a','a','b','b','b','b','c','c','c','c'),
year = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
V = c(NA, 21, 31, 41, 11, NA, NA, 41, NA, NA, NA, 41),
W = c(11, NA, 31, 41, 11, 21, NA, NA, NA, NA, 31, NA),
X = c(11, 21, NA, 41, NA, 21, NA, 41, 11, NA, NA, NA),
Y = c(11, 21, 31, NA, NA, 21, 31, NA, NA, 21, NA, NA),
Z = c(NA, NA, 31, 41, 11, NA, 31, NA, NA, NA, NA, NA)
)
> d
name year V W X Y Z
1 a 1 NA 11 11 11 NA
2 a 2 21 NA 21 21 NA
3 a 3 31 31 NA 31 31
4 a 4 41 41 41 NA 41
5 b 1 11 11 NA NA 11
6 b 2 NA 21 21 21 NA
7 b 3 NA NA NA 31 31
8 b 4 41 NA 41 NA NA
9 c 1 NA NA 11 NA NA
10 c 2 NA NA NA 21 NA
11 c 3 NA 31 NA NA NA
12 c 4 41 NA NA NA NA
Hopefully the results can be as close as possible to the following:
name year V W X Y Z
1 a 1 11 11 11 11 11
2 a 2 21 21 21 21 21
3 a 3 31 31 31 31 31
4 a 4 41 41 41 41 41
5 b 1 11 11 11 11 11
6 b 2 21 21 21 21 21
7 b 3 31 31 31 31 31
8 b 4 41 41 41 41 41
9 c 1 11 11 11 11 NA
10 c 2 21 21 21 21 NA
11 c 3 31 31 31 31 NA
12 c 4 41 41 41 41 NA
I found this and this. I tried the following without grouping, but it didn't work:
data.frame(lapply(d, function(X) approxfun(seq_along(X), X)(seq_along(X))))
imputeTS::na_ma(d, k = 2, weighting = "simple")
The first one gave an error as below:
Error in approxfun(seq_along(X), X) :
need at least two non-NA values to interpolate
In addition: Warning message:
In xy.coords(x, y, setLab = FALSE) :
So I tried the second one, and it kept running for a long time with no result. According to the reply from the first link,
the package requires time series/vector input (that's why each column
has to be called separately).
Any help is greatly appreciated!
You can use zoo::na.spline -
library(dplyr)
d %>%
  group_by(name) %>%
  mutate(across(V:Z, zoo::na.spline, na.rm = FALSE)) %>%
  ungroup
# name year V W X Y Z
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 11 11 11 11 11
# 2 a 2 21 21 21 21 21
# 3 a 3 31 31 31 31 31
# 4 a 4 41 41 41 41 41
# 5 b 1 11 11 11 11 11
# 6 b 2 21 21 21 21 21
# 7 b 3 31 31 31 31 31
# 8 b 4 41 41 41 41 41
# 9 c 1 41 31 11 21 NA
#10 c 2 41 31 11 21 NA
#11 c 3 41 31 11 21 NA
#12 c 4 41 31 11 21 NA
For name "c", I think it would be difficult to impute the missing values from only one number.
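If spline extrapolation at the ends is undesirable, linear interpolation with constant end-fill is a base-R alternative. A sketch with an explicit fallback for groups that have fewer than two observed values (the fill_lin helper is mine, not from the question):

```r
d <- data.frame(
  name = c('a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c'),
  year = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
  V = c(NA, 21, 31, 41, 11, NA, NA, 41, NA, NA, NA, 41),
  W = c(11, NA, 31, 41, 11, 21, NA, NA, NA, NA, 31, NA),
  X = c(11, 21, NA, 41, NA, 21, NA, 41, 11, NA, NA, NA),
  Y = c(11, 21, 31, NA, NA, 21, 31, NA, NA, 21, NA, NA),
  Z = c(NA, NA, 31, 41, 11, NA, 31, NA, NA, NA, NA, NA)
)

fill_lin <- function(x) {
  ok <- !is.na(x)
  if (sum(ok) >= 2) {
    # linear interpolation; rule = 2 extends the end values as constants
    approx(seq_along(x)[ok], x[ok], xout = seq_along(x), rule = 2)$y
  } else if (sum(ok) == 1) {
    rep(x[ok], length(x))  # a single observation is carried everywhere
  } else {
    x  # all NA: nothing to impute
  }
}

# apply per group via ave, column by column
for (cl in names(d)[3:7]) d[[cl]] <- ave(d[[cl]], d$name, FUN = fill_lin)
d
```

Note that this carries end values rather than extrapolating, so single-value groups like c become constant columns instead of following the 11/21/31/41 pattern in the desired output.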
One issue I see is that some of the series you want to impute have only one non-NA value, so na_ma or na_interpolation from imputeTS (and functions from other packages) cannot be applied, since they require at least two non-NA values.
That is why in this solution I created an impute_select function that lets you choose what to do when more than one value is present, when exactly one value is present, and when there are only NAs.
In this case, when more than one value is present, it uses na_ma, but you could also use na_interpolation or any other imputation function from imputeTS here.
When only one value is present, it uses na_locf, since this method also works with a single value in the series.
When no non-NA values are in the series, it uses na_replace, simply replacing all the NAs with a default value (I set it to 11).
By adjusting this function you should be able to tailor the imputation to different amounts of NAs in the series.
library("imputeTS")
d <- data.frame(
name = c('a', 'a','a','a','b','b','b','b','c','c','c','c'),
year = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
V = c(NA, 21, 31, 41, 11, NA, NA, 41, NA, NA, NA, 41),
W = c(11, NA, 31, 41, 11, 21, NA, NA, NA, NA, 31, NA),
X = c(11, 21, NA, 41, NA, 21, NA, 41, 11, NA, NA, NA),
Y = c(11, 21, 31, NA, NA, 21, 31, NA, NA, 21, NA, NA),
Z = c(NA, NA, 31, 41, 11, NA, 31, NA, NA, NA, NA, NA)
)
impute_select <- function(x) {
  n_obs <- sum(!is.na(x))
  if (n_obs > 1) {
    # more than one value available: moving-average imputation
    result <- na_ma(x)
  } else if (n_obs == 1) {
    # exactly one value in the series: carry it with locf
    result <- na_locf(x)
  } else {
    # no non-NA value present: replace with a default value
    result <- na_replace(x, 11)
  }
  result
}

# This code applies the function row-wise to your data frame,
# since usually the imputation would happen column-wise instead.
d[, 3:7] <- t(apply(d[, 3:7], MARGIN = 1, FUN = impute_select))
d
With this branching, rows with more than one observed value are filled by na_ma's weighted moving average, rows with a single observed value repeat that value, and all-NA rows are filled with the default 11.
Let's say I have a dataframe with 3 ID columns and one column of interest. Each row represents one observation. Some ID have multiple observations, i.e., multiple rows.
df <- data.frame(id1 = c( 1, 2, 3, 4, 4),
id2 = c( 11, 12, 13, 14, 14),
id3 = c(111, 112, 113, 114, 114),
variable_of_interest = c(13, 24, 35, 31, 12))
id1 id2 id3 variable_of_interest
1 1 11 111 13
2 2 12 112 24
3 3 13 113 35
4 4 14 114 31
5 4 14 114 12
My goal is to restructure it in order to have one row per ID, keep the 3 IDs, and name the new columns "variable_of_interest1", "variable_of_interest2":
id1 id2 id3 variable_of_interest1 variable_of_interest2
1 1 11 111 13 NA
2 2 12 112 24 NA
3 3 13 113 35 NA
4 4 14 114 31 12
The solution might need reshape2 and the dcast function, but until now, I could not solve this out.
We can create a sequence grouped by the 'id' columns and then with pivot_wider reshape to wide
library(dplyr)
library(stringr)
library(tidyr)
library(data.table)
df %>%
  mutate(ind = str_c('variable_of_interest', rowid(id1, id2, id3))) %>%
  pivot_wider(names_from = ind, values_from = variable_of_interest)
-output
# A tibble: 4 x 5
# id1 id2 id3 variable_of_interest1 variable_of_interest2
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 11 111 13 NA
#2 2 12 112 24 NA
#3 3 13 113 35 NA
#4 4 14 114 31 12
Or another option is data.table
library(data.table)
dcast(setDT(df), id1 + id2 + id3 ~
paste0('variable_of_interest', rowid(id1, id2, id3)),
value.var = 'variable_of_interest')
-output
# id1 id2 id3 variable_of_interest1 variable_of_interest2
#1: 1 11 111 13 NA
#2: 2 12 112 24 NA
#3: 3 13 113 35 NA
#4: 4 14 114 31 12
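If neither tidyr nor data.table is available, stats::reshape can do the same cast in base R. A sketch (the time column is a helper I add to number the observations within each ID combination):

```r
df <- data.frame(id1 = c(1, 2, 3, 4, 4),
                 id2 = c(11, 12, 13, 14, 14),
                 id3 = c(111, 112, 113, 114, 114),
                 variable_of_interest = c(13, 24, 35, 31, 12))

# number the observations within each (id1, id2, id3) combination
df$time <- ave(df$variable_of_interest,
               df$id1, df$id2, df$id3, FUN = seq_along)

wide <- reshape(df, direction = "wide",
                idvar = c("id1", "id2", "id3"),
                timevar = "time",
                v.names = "variable_of_interest", sep = "")
wide
```

With sep = "", the new columns come out as variable_of_interest1 and variable_of_interest2, matching the requested names.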
I have a data frame such as
df1 <- data.frame(Company = c('A','B','C','D','E'),
`X1980` = c(1, 5, 3, 8, 13),
`X1981` = c(20, 13, 23, 11, 29),
`X1982` = c(33, 32, 31, 41, 42),
`X1983` = c(45, 47, 53, 58, 55))
I would like to create a new data frame (df2), keeping the company column as is. The values for the years 1981 through 1983 should be calculated as the current value minus the previous value, so I end up with a data frame of rolling deltas (1980 has no previous year, so it becomes NA).
Company 1980 1981 1982 1983
A NA 19 13 12
B NA 8 19 15
C NA 20 8 22
D NA 3 30 17
E NA 16 13 13
Thanks for the help! If there's any way for me to improve the question, then just let me know.
You can take the difference within each row, adding NA as the first value.
df1[-1] <- t(apply(df1[-1], 1, function(x) c(NA, diff(x))))
df1
# Company X1980 X1981 X1982 X1983
#1 A NA 19 13 12
#2 B NA 8 19 15
#3 C NA 20 8 22
#4 D NA 3 30 17
#5 E NA 16 13 13
You can also use tidyverse functions.
library(dplyr)
library(tidyr)
df1 %>%
  pivot_longer(cols = -Company) %>%
  group_by(Company) %>%
  mutate(value = value - lag(value)) %>%
  pivot_wider()
We can use rowDiffs from matrixStats
library(matrixStats)
df1[-1] <- cbind(NA, rowDiffs(as.matrix(df1[-1])))
-output
df1
# Company X1980 X1981 X1982 X1983
#1 A NA 19 13 12
#2 B NA 8 19 15
#3 C NA 20 8 22
#4 D NA 3 30 17
#5 E NA 16 13 13
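The deltas can also be taken without apply by subtracting a shifted copy of the value columns from the original, a whole-frame sketch:

```r
df1 <- data.frame(Company = c('A', 'B', 'C', 'D', 'E'),
                  X1980 = c(1, 5, 3, 8, 13),
                  X1981 = c(20, 13, 23, 11, 29),
                  X1982 = c(33, 32, 31, 41, 42),
                  X1983 = c(45, 47, 53, 58, 55))

vals <- df1[-1]
# subtract each column's left neighbour; the first column has none, so NA
df1[-1] <- vals - cbind(NA, vals[-ncol(vals)])
df1
```

Data frame subtraction works element-wise by position, so the NA column lines up under X1980 and every later year is reduced by its predecessor in one vectorized step.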