I have a large dataset with a lot of missing values and I want to impute them within each group "name", either linearly or with a moving average.
d <- data.frame(
name = c('a', 'a','a','a','b','b','b','b','c','c','c','c'),
year = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
V = c(NA, 21, 31, 41, 11, NA, NA, 41, NA, NA, NA, 41),
W = c(11, NA, 31, 41, 11, 21, NA, NA, NA, NA, 31, NA),
X = c(11, 21, NA, 41, NA, 21, NA, 41, 11, NA, NA, NA),
Y = c(11, 21, 31, NA, NA, 21, 31, NA, NA, 21, NA, NA),
Z = c(NA, NA, 31, 41, 11, NA, 31, NA, NA, NA, NA, NA)
)
> d
name year V W X Y Z
1 a 1 NA 11 11 11 NA
2 a 2 21 NA 21 21 NA
3 a 3 31 31 NA 31 31
4 a 4 41 41 41 NA 41
5 b 1 11 11 NA NA 11
6 b 2 NA 21 21 21 NA
7 b 3 NA NA NA 31 31
8 b 4 41 NA 41 NA NA
9 c 1 NA NA 11 NA NA
10 c 2 NA NA NA 21 NA
11 c 3 NA 31 NA NA NA
12 c 4 41 NA NA NA NA
Hopefully the results can be as close as the following:
name year V W X Y Z
1 a 1 11 11 11 11 11
2 a 2 21 21 21 21 21
3 a 3 31 31 31 31 31
4 a 4 41 41 41 41 41
5 b 1 11 11 11 11 11
6 b 2 21 21 21 21 21
7 b 3 31 31 31 31 31
8 b 4 41 41 41 41 41
9 c 1 11 11 11 11 NA
10 c 2 21 21 21 21 NA
11 c 3 31 31 31 31 NA
12 c 4 41 41 41 41 NA
I found this and this. I tried the following without grouping, but it didn't work:
data.frame(lapply(d, function(X) approxfun(seq_along(X), X)(seq_along(X))))
imputeTS::na_ma(d, k = 2, weighting = "simple")
The first one gave an error as below:
Error in approxfun(seq_along(X), X) :
need at least two non-NA values to interpolate
In addition: Warning message:
In xy.coords(x, y, setLab = FALSE) :
So I tried the second one, and it kept loading for a long time and nothing happened. According to the reply from the first link,
the package requires time series/vector input (that's why each column
has to be called separately).
Any help is greatly appreciated!
You can use zoo::na.spline -
library(dplyr)
d %>%
  group_by(name) %>%
  mutate(across(V:Z, ~ zoo::na.spline(.x, na.rm = FALSE))) %>%
  ungroup()
# name year V W X Y Z
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 11 11 11 11 11
# 2 a 2 21 21 21 21 21
# 3 a 3 31 31 31 31 31
# 4 a 4 41 41 41 41 41
# 5 b 1 11 11 11 11 11
# 6 b 2 21 21 21 21 21
# 7 b 3 31 31 31 31 31
# 8 b 4 41 41 41 41 41
# 9 c 1 41 31 11 21 NA
#10 c 2 41 31 11 21 NA
#11 c 3 41 31 11 21 NA
#12 c 4 41 31 11 21 NA
For name "c", I think it would be difficult to impute the missing values from only one number.
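If you specifically want linear interpolation or a moving average rather than splines, the same grouped pipeline works with zoo::na.approx. This is only a sketch: the na_approx_safe helper below is made up for illustration and simply skips columns with fewer than two observed values (which plain interpolation cannot handle), and rule = 2 extends the endpoints with the nearest observed value.
library(dplyr)
library(zoo)

# illustrative helper: interpolate only when at least two observed values exist,
# otherwise return the column unchanged
na_approx_safe <- function(x) {
  if (sum(!is.na(x)) >= 2) na.approx(x, rule = 2) else x
}

d %>%
  group_by(name) %>%
  mutate(across(V:Z, na_approx_safe)) %>%
  ungroup()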
One issue I see is that some of the series you want to impute have only one non-NA value, so na_ma or na_interpolation from imputeTS (and functions from other packages) cannot be applied successfully, since they require at least two non-NA values.
That is why this solution defines an impute_select function that lets you choose what to do when more than one value is present, when exactly one value is present, or when there are only NAs.
In this case, when more than one value is present, it uses na_ma, but you could also use na_interpolation or any other imputation function from imputeTS here.
When only one value is present, it uses na_locf, since this method also works with a single value in the series.
When no non-NA values are in the series, it uses na_replace, simply replacing all the NAs with a default value (set to 11 here).
By adjusting this function you should be able to tailor the imputation to different amounts of NAs in the series.
library("imputeTS")
d <- data.frame(
name = c('a', 'a','a','a','b','b','b','b','c','c','c','c'),
year = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
V = c(NA, 21, 31, 41, 11, NA, NA, 41, NA, NA, NA, 41),
W = c(11, NA, 31, 41, 11, 21, NA, NA, NA, NA, 31, NA),
X = c(11, 21, NA, 41, NA, 21, NA, 41, 11, NA, NA, NA),
Y = c(11, 21, 31, NA, NA, 21, 31, NA, NA, 21, NA, NA),
Z = c(NA, NA, 31, 41, 11, NA, 31, NA, NA, NA, NA, NA)
)
impute_select <- function(x) {
  if (sum(!is.na(x)) > 1) {
    # more than 1 non-NA value available: use moving-average imputation
    result <- na_ma(x)
  } else if (sum(!is.na(x)) == 1) {
    # only 1 non-NA value in the series: carry it to the other positions
    result <- na_locf(x)
  } else {
    # no non-NA value present: replace the NAs with a default value
    result <- na_replace(x, 11)
  }
  result
}
# This code applies the function row-wise to your data frame
# (usually the imputation would happen column-wise instead)
d[, 3:7] <- t(apply(d[, 3:7], MARGIN = 1, FUN = impute_select))
d
These are the results (hopefully exactly what you wanted):
name year V W X Y Z
1 a 1 11 11 11 11 11
2 a 2 21 21 21 21 21
3 a 3 31 31 31 31 31
4 a 4 41 41 41 41 41
5 b 1 11 11 11 11 11
6 b 2 21 21 21 21 21
7 b 3 31 31 31 31 31
8 b 4 41 41 41 41 41
9 c 1 11 11 11 11 11
10 c 2 21 21 21 21 21
11 c 3 31 31 31 31 31
12 c 4 41 41 41 41 41
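If you prefer to impute column-wise within each name group (the grouping the question asks for), the same impute_select helper can be applied per group with dplyr. This is just a sketch; the filled values will differ from the row-wise output above, because each column is then interpolated along the years of one group.
library(dplyr)

d %>%
  group_by(name) %>%
  mutate(across(V:Z, impute_select)) %>%
  ungroup()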
Let's say I have a dataframe with 3 ID columns and one column of interest. Each row represents one observation. Some IDs have multiple observations, i.e., multiple rows.
df <- data.frame(id1 = c( 1, 2, 3, 4, 4),
id2 = c( 11, 12, 13, 14, 14),
id3 = c(111, 112, 113, 114, 114),
variable_of_interest = c(13, 24, 35, 31, 12))
id1 id2 id3 variable_of_interest
1 1 11 111 13
2 2 12 112 24
3 3 13 113 35
4 4 14 114 31
5 4 14 114 12
My goal is to restructure it in order to have one row per ID, keep the 3 IDs, and name the new columns "variable_of_interest1" and "variable_of_interest2":
id1 id2 id3 variable_of_interest1 variable_of_interest2
1 1 11 111 13 NA
2 2 12 112 24 NA
3 3 13 113 35 NA
4 4 14 114 31 12
The solution probably needs reshape2 and the dcast function, but so far I have not been able to work it out.
We can create a sequence grouped by the 'id' columns and then reshape to wide format with pivot_wider.
library(dplyr)
library(stringr)
library(tidyr)
library(data.table)
df %>%
  mutate(ind = str_c('variable_of_interest', rowid(id1, id2, id3))) %>%
  pivot_wider(names_from = ind, values_from = variable_of_interest)
-output
# A tibble: 4 x 5
# id1 id2 id3 variable_of_interest1 variable_of_interest2
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 11 111 13 NA
#2 2 12 112 24 NA
#3 3 13 113 35 NA
#4 4 14 114 31 12
Or another option is data.table
library(data.table)
dcast(setDT(df), id1 + id2 + id3 ~
        paste0('variable_of_interest', rowid(id1, id2, id3)),
      value.var = 'variable_of_interest')
-output
# id1 id2 id3 variable_of_interest1 variable_of_interest2
#1: 1 11 111 13 NA
#2: 2 12 112 24 NA
#3: 3 13 113 35 NA
#4: 4 14 114 31 12
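Since the question mentions reshape2, the same reshape can also be done there after creating a within-group sequence with base ave. This is only a sketch, and the obs column name is purely illustrative.
library(reshape2)

# number the observations within each ID combination, then cast to wide
df$obs <- with(df, ave(variable_of_interest, id1, id2, id3, FUN = seq_along))
df$obs <- paste0("variable_of_interest", df$obs)
dcast(df, id1 + id2 + id3 ~ obs, value.var = "variable_of_interest")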
I am a newcomer to dplyr and tried to create a new composite variable from three different age variables using dplyr and ifelse. I made a data frame to explain the situation as follows:
library(dplyr)
z <- data.frame("j6" = c(6, 19, NA, NA, NA, NA, NA, 8, 20, 20, NA),
"j7" = c(27, 20, NA, 7, 19, NA, NA, 20, 30, 9, NA),
"j8" = c(8, 22, NA, 20, NA, 8, 30, NA, NA, NA, 3))
z <- z %>%
  mutate(., age_event = NA) %>%
  mutate(., age_event = ifelse(j6 < 18 | j7 < 18 | j8 < 18, 1, 0))
My expectations:
The three columns (j6, j7, and j8) indicate ages, and if at least one of them is less than 18 years old, the new column (age_event) should be 1, otherwise 0.
If two of the three columns are 18 or older and the other is NA, age_event should be 0.
Likewise, if one of the three columns is 18 or older and the others are NA, age_event should be 0.
It should be NA if all three columns are NA.
However, the result and problems are shown as follows:
> z
j6 j7 j8 age_event
1 6 27 8 1
2 19 20 22 0
3 NA NA NA NA
4 NA 7 20 1
5 NA 19 NA NA <-- should be 0, but NA
6 NA NA 8 1
7 NA NA 30 NA <-- should be 0, but NA
8 8 20 NA 1
9 20 30 NA NA <-- should be 0, but NA
10 20 9 NA 1
11 NA NA 3 1
I'd like to know if there is a way to turn the 5th, 7th, and 9th observations above into 0s using mutate and ifelse. Any suggestions would be greatly appreciated!
Update (2/27/2020): I found a solution with pmin when using mutate and ifelse:
z <- z %>%
  mutate(., age_event = ifelse(is.na(j6) & is.na(j7) & is.na(j8), NA,
                               ifelse(pmin(j6, j7, j8, na.rm = T) < 18, 1, 0)))
> z
j6 j7 j8 age_event
1 6 27 8 1
2 19 20 22 0
3 NA NA NA NA
4 NA 7 20 1
5 NA 19 NA 0
6 NA NA 8 1
7 NA NA 30 0
8 8 20 NA 1
9 20 30 NA 0
10 20 9 NA 1
11 NA NA 3 1
You can use rowMeans() in place of ifelse(), which will handle rows that are all NA.
z %>%
  mutate(age_event = +(rowMeans(. < 18, na.rm = TRUE) > 0))
j6 j7 j8 age_event
1 6 27 8 1
2 19 20 22 0
3 NA NA NA NA
4 NA 7 20 1
5 NA 19 NA 0
6 NA NA 8 1
7 NA NA 30 0
8 8 20 NA 1
9 20 30 NA 0
10 20 9 NA 1
11 NA NA 3 1
We can use rowSums to calculate the number of non-NA values in a row and the number of values that are less than 18. We can then use case_when to assign values based on those conditions.
library(dplyr)
z %>%
  mutate(calc = rowSums(!is.na(.), na.rm = TRUE),
         ls18 = rowSums(. < 18, na.rm = TRUE),
         age_event = case_when(calc == 0 & ls18 == 0 ~ NA_integer_,
                               ls18 > 0 ~ 1L,
                               TRUE ~ 0L)) %>%
  select(-calc, -ls18)
# j6 j7 j8 age_event
#1 6 27 8 1
#2 19 20 22 0
#3 NA NA NA NA
#4 NA 7 20 1
#5 NA 19 NA 0
#6 NA NA 8 1
#7 NA NA 30 0
#8 8 20 NA 1
#9 20 30 NA 0
#10 20 9 NA 1
#11 NA NA 3 1
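With a recent dplyr (1.0.4 or later), if_any() and if_all() express the same logic without helper columns. A sketch, relying on case_when treating an NA condition as not matching, so those rows fall through to 0:
library(dplyr)

z %>%
  mutate(age_event = case_when(
    if_all(j6:j8, is.na)     ~ NA_integer_,  # all three ages missing
    if_any(j6:j8, ~ .x < 18) ~ 1L,           # at least one age under 18
    TRUE                     ~ 0L            # otherwise (including NA comparisons)
  ))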
I have a dataframe with missing values.
df1 <- data.frame(ID = c(1, 2, 3, 4, 5, 6), value1 = c(23, 14, NA, 45, NA, NA),
value2 = c(25, 15, NA, 34, NA, NA), value3 = c(33, 29, NA, 29, NA, NA))
ID value1 value2 value3
1 23 25 33
2 14 15 29
3 NA NA NA
4 45 34 29
5 NA NA NA
6 NA NA NA
And a dataframe with id relations.
df2 <- data.frame(ID1 = c(1, 2, 4), ID2 = c(3, 5, 6))
ID1 ID2
1 3
2 5
4 6
I want to replace the missing values with the values of the related ID.
So the dataframe will look like this.
ID value1 value2 value3
1 23 25 33
2 14 15 29
3 23 25 33
4 45 34 29
5 14 15 29
6 45 34 29
Any help would be appreciated.
You will need a for loop like this:
for (i in seq_along(df2[, "ID2"])) {
  df1[df2[i, "ID2"], c("value1", "value2", "value3")] <-
    df1[df2[i, "ID1"], c("value1", "value2", "value3")]
}
You can use a for loop, as @FannieY already suggested. In addition, I test with is.na to avoid overwriting existing values.
for (i in seq_len(nrow(df2))) {
  idx <- is.na(df1[df2[i, 2], -1])
  df1[df2[i, 2], -1][idx] <- df1[df2[i, 1], -1][idx]
}
df1
# ID value1 value2 value3
#1 1 23 25 33
#2 2 14 15 29
#3 3 23 25 33
#4 4 45 34 29
#5 5 14 15 29
#6 6 45 34 29
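A join-based alternative avoids the explicit loop. This is only a dplyr sketch (the donor data frame is an illustrative intermediate), and it assumes, as in the example, that the ID2 rows are entirely missing, so a plain coalesce per column is enough.
library(dplyr)

# look up the donor values of each ID2 via its related ID1
donor <- df2 %>%
  left_join(df1, by = c("ID1" = "ID")) %>%
  select(ID = ID2, value1, value2, value3)

df1 %>%
  left_join(donor, by = "ID", suffix = c("", ".donor")) %>%
  mutate(value1 = coalesce(value1, value1.donor),
         value2 = coalesce(value2, value2.donor),
         value3 = coalesce(value3, value3.donor)) %>%
  select(ID, value1, value2, value3)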
I have a data frame such as:
set.seed(1)
df <- data.frame(
sample = 1:50,
value = runif(50),
group = c(rep(NA, 20), gl(3, 10)))
I want to select the top 10 samples based on value. However, if there is a group corresponding to the sample, I only want to include one sample from that group. If group == NA, I want to include all of them. Arranging df by value and taking the top 10 looks like:
df_top <- df %>%
arrange(-value) %>%
top_n(10, value)
sample value group
1 46 0.7973088 3
2 49 0.8108702 3
3 22 0.8394404 1
4 2 0.8612095 NA
5 27 0.8643395 1
6 20 0.8753213 NA
7 44 0.8762692 3
8 26 0.8921983 1
9 11 0.9128759 NA
10 30 0.9606180 1
I would want to include samples 36, 22, 2, 20, 11, and the next five highest values in my data frame that continue to fit the pattern. How do I accomplish this?
I think I figured this out. Would this be the best way?
df_top <- df %>%
  arrange(-value) %>%
  group_by(group) %>%
  filter(ifelse(!is.na(group), value == max(value), value == value)) %>%
  ungroup() %>%
  top_n(10, value)
# A tibble: 10 x 3
sample value group
<int> <dbl> <int>
1 18 0.992 NA
2 7 0.945 NA
3 21 0.935 1
4 4 0.908 NA
5 6 0.898 NA
6 35 0.827 2
7 41 0.821 3
8 20 0.777 NA
9 15 0.770 NA
10 17 0.718 NA
A similar method that uses slice instead of filter:
library(dplyr)
df_top <- df %>%
  arrange(-value) %>%
  group_by(group) %>%
  slice(if (any(!is.na(group))) 1 else 1:n()) %>%
  ungroup() %>%
  top_n(10, value)
Result:
# A tibble: 10 x 3
sample value group
<int> <dbl> <int>
1 21 0.9347052 1
2 35 0.8273733 2
3 41 0.8209463 3
4 18 0.9919061 NA
5 7 0.9446753 NA
6 4 0.9082078 NA
7 6 0.8983897 NA
8 20 0.7774452 NA
9 15 0.7698414 NA
10 17 0.7176185 NA
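In current dplyr, top_n() is superseded by slice_max(); the same idea can be sketched as: keep only the per-group maximum for non-NA groups, keep every NA-group row, then take the 10 largest values.
library(dplyr)

df %>%
  group_by(group) %>%
  filter(is.na(group) | value == max(value)) %>%
  ungroup() %>%
  slice_max(value, n = 10)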