How to replace missing points in a data set in R?

I want to write a function in R that takes any data set as input, where the data set has some missing points (NA). I want to use the mean to replace the missing points (NA) in the data set. What I have in mind is a function like this:
x <- function(data, type = c("mean"), lag = 2)
It should compute the mean of the two values before and the two values after the missing point (because I set lag to 2 in the function). For example, if the missing point is at position 12, the function should compute the mean of the values at positions 10, 11, 13, and 14 and substitute the result for the missing point at position 12. In special cases, for example when the missing point is in the last position and there are no two values after it, the function should compute the mean of all the data in the corresponding column and substitute that for the missing point. Here is an example to make it clear. Consider the following data set:
3 7 8 0 8 12 2
5 8 9 2 8 9 1
1 2 4 5 0 6 7
5 6 0 NA 3 9 10
7 2 3 6 11 14 2
4 8 7 4 5 3 NA
In the above data set, the first NA should be replaced with the mean of 2 and 5 (the two values before) and 6 and 4 (the two values after), which is (2+5+6+4)/4 = 17/4 = 4.25. The last NA should be replaced with the mean of the last column, which is (2+1+7+10+2)/5 = 22/5 = 4.4.
My question is: how can I add the necessary logic (if, if-else, or other control flow) to the above function to make it a complete function that satisfies the description above? I should highlight that I want to use the apply family of functions.
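For reference, here is a minimal base-R sketch of the rule described above, written with the apply family. The helper name fill_na is hypothetical, and the sketch assumes numeric columns; it falls back to the column mean whenever a complete window of lag values on both sides is not available:
fill_na <- function(data, lag = 2) {
  fill_col <- function(x) {
    for (i in which(is.na(x))) {
      idx <- c(i - seq_len(lag), i + seq_len(lag))  # positions before and after
      idx <- idx[idx >= 1 & idx <= length(x)]       # drop out-of-range positions
      if (length(idx) == 2 * lag && !anyNA(x[idx])) {
        x[i] <- mean(x[idx])                        # local mean of the neighbours
      } else {
        x[i] <- mean(x, na.rm = TRUE)               # fall back to the column mean
      }
    }
    x
  }
  as.data.frame(apply(data, 2, fill_col))
}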

First we can define a function that smooths a single vector:
library(dplyr)
smooth = function(vec, n=2){
  # Lead and lag the vector n times in both directions
  purrr::map(1:n, function(i){
    cbind(
      lead(vec, i),
      lag(vec, i)
    )
  }) %>%
    # Bind the matrices together
    do.call(cbind, .) %>%
    # Take the mean of each row, i.e. the smoothed version at each position.
    # If there are NAs in the mean, it will itself be NA
    rowMeans() %>%
    # In order, take a) original values b) locally smoothed values
    # c) globally smoothed values (i.e. the entire mean ignoring NAs)
    coalesce(vec, ., mean(vec, na.rm=TRUE))
}
> smooth(c(0, 2, 5, NA, 6, 4))
[1] 0.00 2.00 5.00 4.25 6.00 4.00
> smooth(c(2, 1, 7, 10, 2, NA))
[1] 2.0 1.0 7.0 10.0 2.0 4.4
Then we can apply it to each column:
> c(3, 7, 8, 0, 8, 12, 2, 5, 8, 9, 2, 8, 9, 1, 1, 2, 4, 5, 0, 6, 7, 5, 6, 0, NA, 3, 9, 10, 7, 2, 3, 6, 11, 14, 2, 4, 8, 7, 4, 5, 3, NA) %>%
matrix(byrow=TRUE, ncol=7) %>%
as_tibble(.name_repair="universal") %>%
mutate(across(everything(), smooth))
# A tibble: 6 × 7
...1 ...2 ...3 ...4 ...5 ...6 ...7
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 7 8 0 8 12 2
2 5 8 9 2 8 9 1
3 1 2 4 5 0 6 7
4 5 6 0 4.25 3 9 10
5 7 2 3 6 11 14 2
6 4 8 7 4 5 3 4.4
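Since the question asks for the apply family of functions, the same smooth() function can also be mapped over the columns with base lapply instead of mutate()/across(). A small sketch, assuming the data is stored in a data frame called myData (a hypothetical name):
filled <- as.data.frame(lapply(myData, smooth))  # apply the smoother column by column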

Please find below one solution using the data.table library.
Reprex
Your data:
m1 <- "3 7 8 0 8 12 2
5 8 9 2 8 9 1
1 2 4 5 0 6 7
5 6 0 NA 3 9 10
7 2 3 6 11 14 2
4 8 7 4 5 3 NA"
myData <- read.table(text = m1, header = FALSE)
Code for the function replaceNA
library(data.table)
replaceNA <- function(data){
  setDT(data)
  # Create a data.table identifying row and column indexes of NA values in the data.table
  NA_DT <- as.data.table(which(is.na(data), arr.ind = TRUE))
  # Select row and column indexes of NAs that are not in the last row of the data.table
  NA_not_Last <- NA_DT[row < nrow(data)]
  # Select row and column indexes of NAs that are in the last row of the data.table
  NA_Last <- NA_DT[row == nrow(data)]
  # Create a vector of column names where NA values are not in the last row of the data.table
  Cols_NA_not_Last <- colnames(data)[NA_not_Last[, col]]
  # Create a vector of column names where NA values are in the last row of the data.table
  Cols_NA_Last <- colnames(data)[NA_Last[, col]]
  # Replace NA values that are not in the last row of the data.table by the mean of the values
  # located in the two previous rows and the two following rows of the row containing the NA value
  data[, (Cols_NA_not_Last) := lapply(.SD, function(x) {
    replace(x, which(is.na(x)),
            mean(c(x[which(is.na(x)) - 2], x[which(is.na(x)) - 1],
                   x[which(is.na(x)) + 1], x[which(is.na(x)) + 2]),
                 na.rm = TRUE))
  }), .SDcols = Cols_NA_not_Last][]
  # Replace NA values that are in the last row of the data.table by the mean of all the values
  # in the column where the NA value is found
  data[, (Cols_NA_Last) := lapply(.SD, function(x) {
    replace(x, which(is.na(x)), mean(x, na.rm = TRUE))
  }), .SDcols = Cols_NA_Last][]
  return(data)
}
Test of the function with your data
replaceNA(myData)
#> V1 V2 V3 V4 V5 V6 V7
#> 1: 3 7 8 0.00 8 12 2.0
#> 2: 5 8 9 2.00 8 9 1.0
#> 3: 1 2 4 5.00 0 6 7.0
#> 4: 5 6 0 4.25 3 9 10.0
#> 5: 7 2 3 6.00 11 14 2.0
#> 6: 4 8 7 4.00 5 3 4.4
Created on 2021-11-08 by the reprex package (v2.0.1)

Related

R Completing NAs with average of previous values

I have looked at several similar questions on SO but can't seem to find a solution that works for me (though zoo and tidyr have gotten me the closest). I have a df with a column containing a series of NA values and need to fill those values with the average of the previous 2 lags. That new value needs to be included as one of the lags in the next record and so on. So something like this:
1
2
3
4
5
NA
NA
NA
needs to become
1
2
3
4
5
4.5
4.75
4.625
Thanks in advance for any suggestions, here is some sample data to play with.
df <- tibble::tribble(
  ~x,
  1,
  2,
  3,
  4,
  5,
  NA,
  NA,
  NA
)
I'd use a for loop:
for (i in 1:nrow(df)){
  if (is.na(df$x[i])){
    df$x[i] <- mean(c(df$x[i-1], df$x[i-2]))
  }
}
# x
# <dbl>
# 1 1
# 2 2
# 3 3
# 4 4
# 5 5
# 6 4.5
# 7 4.75
# 8 4.62
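For reference, the same rolling fill can be sketched without an explicit index loop by carrying the last two (possibly already filled) values as state with purrr::accumulate. This is only a sketch, assuming purrr is available and df$x is numeric:
library(purrr)
state <- accumulate(df$x, function(prev, cur) {
  filled <- if (is.na(cur)) mean(prev) else cur
  c(prev[2], filled)            # keep only the two most recent filled values
}, .init = c(NA_real_, NA_real_))
df$x <- map_dbl(state[-1], 2)   # the second element of each state pair is the filled value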

subtracting the greater column from smaller columns in a dataframe in R

I have the input below and I would like to subtract the two columns, but I always want to subtract the lower value from the higher value, because I don't want negative values as a result, and sometimes the higher value is in the first column (PaternalOrigin) and other times in the second column (MaternalOrigin).
Input:
df <- PaternalOrigin MaternalOrigin
16 20
3 6
11 0
1 3
1 4
3 11
and the dput output is this:
df <- structure(list(PaternalOrigin = c(16, 3, 11, 1, 1, 3), MaternalOrigin = c(20, 6, 0, 3, 4, 11)), colnames = c("PaternalOrigin", "MaternalOrigin"), row.names= c(NA, -6L), class="data.frame")
Thus, my expected output would look like:
df2 <- PaternalOrigin MaternalOrigin Results
16 20 4
3 6 3
11 0 11
1 3 2
1 4 3
3 11 8
Please, can someone advise me?
Thanks.
We can wrap with abs
transform(df, Results = abs(PaternalOrigin - MaternalOrigin))
# PaternalOrigin MaternalOrigin Results
#1 16 20 4
#2 3 6 3
#3 11 0 11
#4 1 3 2
#5 1 4 3
#6 3 11 8
Or we can assign it to 'Results'
df$Results <- with(df, abs(PaternalOrigin - MaternalOrigin))
Or using data.table
library(data.table)
setDT(df)[, Results := abs(PaternalOrigin - MaternalOrigin)]
Or with dplyr
library(dplyr)
df %>%
mutate(Results = abs(PaternalOrigin - MaternalOrigin))
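If you prefer to express the "subtract the smaller from the larger" rule literally rather than taking the absolute difference, a base R sketch with pmax()/pmin() gives the same result:
df$Results <- with(df, pmax(PaternalOrigin, MaternalOrigin) - pmin(PaternalOrigin, MaternalOrigin))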

Deleting incomplete cases across multiple rows in R studio

Say I have a longitudinal data set as below
ID <- c(1, 1, 2, 2, 3, 3, 4, 4)
time <- c(1, 2, 1, 2, 1, 2, 1, 2)
value <- c(7, 5, 9, 2, NA, 3, 7, NA)
mydata <- data.frame(ID, time, value)
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
5 3 1 NA
6 3 2 3
7 4 1 7
8 4 2 NA
In this data set, we have 4 cases with data at two time points (let's say pre and post treatment).
Something I want to do is set criteria to delete any case that is not complete for both time points. In this example, I would want to delete ID 3 (which is missing time point 1) and ID 4 (which is missing time point 2). Like below:
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
I am not having much luck. I've tried variants of complete.cases() and which() to no avail.
I'm still new to R and would be hugely appreciative if anyone could help me out.
Edit: Thank you Ronak for answering my question. Upon reflection on my real data, I have encountered a second problem. My actual data is better reflected by the below:
ID <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 7, 8)
time <- c(1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1)
value <- c(7, 5, 9, 2, NA, 3, 7, NA, 8, 9, 7, 6)
mydata <- data.frame(ID, time, value)
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
5 3 1 NA
6 3 2 3
7 4 1 7
8 4 2 NA
9 5 1 8
10 6 1 9
11 7 1 7
12 8 1 6
Where I would also want to remove cases 5, 6, 7 and 8. These IDs have an entry for Time 1, but not Time 2. Hopefully this makes sense
Thanks a heap
If you switch your data to wide format (where each time point is represented as its own column), then you can use na.omit. Using dplyr and tidyr functions:
library(dplyr)
mydata <- mydata %>%
  tidyr::spread(key = time, value = value) %>%  # reformat to wide
  na.omit() %>%                                 # delete cases with missingness on any variable (i.e. any time point)
  tidyr::gather(key = "time", value = "value", -ID)  # put it back in long format
> mydata
ID time value
1 1 1 7
2 2 1 9
3 1 2 5
4 2 2 2
Note that this will work (it will keep only cases with complete data for both time 1 and time 2) even when you have a time point missing without an explicit NA present in the data, like this:
> mydata
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
5 3 1 NA
6 3 2 3
7 4 1 7
8 4 2 NA
9 5 1 8
10 6 1 9
11 7 1 7
12 8 1 6
You can do this easily with sqldf.
library(sqldf)
sqldf(' select * from (select ID, count(*) as cnt from mydata where value is not null group by id having cnt >1 ) t1 inner join mydata t2 on t1.ID=t2.ID')
You would select the IDs having a count greater than 1 and no NA in their values, and then join back with the original data.
#Ronak already provided
mydata[!mydata$ID %in% mydata$ID[is.na(mydata$value)], ]
For the second part, you can just group over each id and filter on their frequency
k2 <- data.frame(table(mydata$ID))
k2$Var1[k2$Freq > 1]
and then do something like
mydata[mydata$ID %in% k2$Var1[k2$Freq > 1],]
See the updated answer
# Eliminates ID cases with NA
mydata = mydata[!mydata$ID %in% mydata[!complete.cases(mydata) ,]$ID, ]
library(plyr)
# counts all the IDs
cnt = count(mydata, "ID")
# Eliminates any ID that doesn't have 2 observations
mydata[mydata$ID %in% cnt[cnt$freq == 2, ]$ID, ]
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
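As a further sketch (not taken from the answers above), a dplyr-only variant can cover both parts of the question at once by keeping an ID only when it has both time points and no missing values; it assumes exactly 2 expected time points per ID:
library(dplyr)
mydata %>%
  group_by(ID) %>%
  filter(n() == 2, !any(is.na(value))) %>%
  ungroup()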

Conditional cumsum with reset

I have a data frame; the data frame is already sorted as needed, but now I would like to "slice" it into groups.
These groups should have a maximum cumulative value of 10. When the cumulative value would exceed 10, the cumulative sum should reset and start over again.
library(dplyr)
id <- sample(1:15)
order <- 1:15
value <- c(4, 5, 7, 3, 8, 1, 2, 5, 3, 6, 2, 6, 3, 1, 4)
df <- data.frame(id, order, value)
df
This is the output I'm looking for (I did it "manually"):
cumsum_10 <- c(4, 9, 7, 10, 8, 9, 2, 7, 10, 6, 8, 6, 9, 10, 4)
group_10 <- c(1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 7)
df1 <- data.frame(df, cumsum_10, group_10)
df1
So I'm having 2 problems:
How to create a cumulative variable that resets every time it passes an upper limit (10 in this case)
How to count/group each group
For the first part, I was trying some combinations of group_by and cumsum, with no luck:
df1 <- df %>% group_by(cumsum(c(False, value < 10)))
I would prefer a pipe (%>%) solution instead of a for loop
Thanks
I think this is not easily vectorizable... at least I do not know how.
You can do it by hand via:
my_cumsum <- function(x){
  grp = integer(length(x))
  grp[1] = 1
  for(i in 2:length(x)){
    if(x[i-1] + x[i] <= 10){
      grp[i] = grp[i-1]
      x[i] = x[i-1] + x[i]
    } else {
      grp[i] = grp[i-1] + 1
    }
  }
  data.frame(grp, x)
}
For your data this gives:
> my_cumsum(df$value)
grp x
1 1 4
2 1 9
3 2 7
4 2 10
5 3 8
6 3 9
7 4 2
8 4 7
9 4 10
10 5 6
11 5 8
12 6 6
13 6 9
14 6 10
15 7 4
Also for my "counter-example" this gives:
> my_cumsum(c(10,6,4))
grp x
1 1 10
2 2 6
3 2 10
As @Khashaa pointed out, this can be implemented more efficiently via Rcpp. He linked to this answer, How to speed up or vectorize a for loop?, which I find very useful.
You could define your own function and then use it inside dplyr's mutate statement as follows:
df %>% group_by() %>%
  mutate(
    cumsum_10 = cumsum_with_reset(value, 10),
    group_10 = cumsum_with_reset_group(value, 10)
  ) %>%
  ungroup()
The cumsum_with_reset() function takes a column and a threshold value which resets the sum. cumsum_with_reset_group() is similar but identifies rows that have been grouped together. Definitions are as follows:
# group rows based on cumsum with reset
cumsum_with_reset_group <- function(x, threshold) {
  cumsum <- 0
  group <- 1
  result <- numeric()
  for (i in 1:length(x)) {
    cumsum <- cumsum + x[i]
    if (cumsum > threshold) {
      group <- group + 1
      cumsum <- x[i]
    }
    result <- c(result, group)
  }
  return(result)
}
# cumsum with reset
cumsum_with_reset <- function(x, threshold) {
  cumsum <- 0
  group <- 1
  result <- numeric()
  for (i in 1:length(x)) {
    cumsum <- cumsum + x[i]
    if (cumsum > threshold) {
      group <- group + 1
      cumsum <- x[i]
    }
    result <- c(result, cumsum)
  }
  return(result)
}
# use functions above as window functions inside mutate statement
df %>% group_by() %>%
  mutate(
    cumsum_10 = cumsum_with_reset(value, 10),
    group_10 = cumsum_with_reset_group(value, 10)
  ) %>%
  ungroup()
This can be done easily with purrr::accumulate
library(dplyr)
library(purrr)
df %>% mutate(cumsum_10 = accumulate(value, ~ifelse(.x + .y <= 10, .x + .y, .y)),
group_10 = cumsum(value == cumsum_10))
id order value cumsum_10 group_10
1 8 1 4 4 1
2 13 2 5 9 1
3 7 3 7 7 2
4 1 4 3 10 2
5 4 5 8 8 3
6 10 6 1 9 3
7 12 7 2 2 4
8 2 8 5 7 4
9 15 9 3 10 4
10 11 10 6 6 5
11 14 11 2 8 5
12 3 12 6 6 6
13 5 13 3 9 6
14 9 14 1 10 6
15 6 15 4 4 7
We can take advantage of the function cumsumbinning, from the package MESS, that performs this task:
library(MESS)
df %>%
group_by(group_10 = cumsumbinning(value, 10)) %>%
mutate(cumsum_10 = cumsum(value))
Output
# A tibble: 15 x 5
# Groups: group_10 [7]
id order value group_10 cumsum_10
<int> <int> <dbl> <int> <dbl>
1 6 1 4 1 4
2 10 2 5 1 9
3 1 3 7 2 7
4 5 4 3 2 10
5 3 5 8 3 8
6 9 6 1 3 9
7 14 7 2 4 2
8 11 8 5 4 7
9 15 9 3 4 10
10 8 10 6 5 6
11 12 11 2 5 8
12 2 12 6 6 6
13 4 13 3 6 9
14 7 14 1 6 10
15 13 15 4 7 4
The function below uses recursion to construct a vector with the lengths of each group. It is faster than a loop for small data vectors (length less than about a hundred values), but slower for longer ones. It takes three arguments:
1) vec: A vector of values that we want to group.
2) i: The index of the starting position in vec.
3) glv: A vector of group lengths. This is the return value, but we need to initialize it and pass it along through each recursion.
# Group a vector based on consecutive values with a cumulative sum <= 10
gf = function(vec, i, glv) {
  ## Break out of the recursion when we get to the last group
  if (sum(vec[i:length(vec)]) <= 10) {
    glv = c(glv, length(i:length(vec)))
    return(glv)
  }
  ## Keep recursion going if there are at least two groups left
  # Calculate length of current group
  gl = sum(cumsum(vec[i:length(vec)]) <= 10)
  # Append to previous group lengths
  glv.append = c(glv, gl)
  # Call function recursively
  gf(vec, i + gl, glv.append)
}
Run the function to return a vector of group lengths:
group_vec = gf(df$value, 1, numeric(0))
[1] 2 2 2 3 2 3 1
To add a column to df with the group lengths, use rep:
df$group10 = rep(1:length(group_vec), group_vec)
In its current form the function will only work on vectors that don't have any values greater than 10, and the grouping by sums <= 10 is hard-coded. The function can of course be generalized to deal with these limitations.
The function can be sped up somewhat by doing cumulative sums that look ahead only a certain number of values, rather than the remaining length of the vector. For example, if the values are always positive, you only need to look ten values ahead, since you'll never need to sum more than ten numbers to reach a value of 10. This too can be generalized for any target value. Even with this modification, the function is still slower than a loop for a vector with more than about a hundred values.
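A sketch of that look-ahead modification, keeping the recursive structure but only summing a fixed window ahead (an assumption here, as in the original function, is that the values are positive, at least 1, and never larger than the target of 10):
gf_window = function(vec, i, glv, window = 10) {
  end = min(i + window - 1, length(vec))
  # If the remainder fits inside the window and sums to <= 10, it is the last group
  if (end == length(vec) && sum(vec[i:end]) <= 10) {
    return(c(glv, end - i + 1))
  }
  # Length of the current group, looking at most `window` values ahead
  gl = sum(cumsum(vec[i:end]) <= 10)
  gf_window(vec, i + gl, c(glv, gl), window)
}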
I haven't worked with recursive functions in R before and would be interested in any comments and suggestions on whether recursion makes sense for this type of problem and whether it can be improved, especially execution speed.

Select first observed data and utilize mutate

I am running into an issue with my data where I want to take the first observed score for each individual id and subtract the last observed score from it.
The problem with asking for the first observation minus the last observation is that sometimes the first observation is missing.
Is there any way to ask for the first observed score for each individual, thus skipping any missing data?
I built the below df to illustrate my problem.
help <- data.frame(id = c(5, 5, 5, 5, 5, 12, 12, 12, 17, 17, 20, 20, 20),
                   ob = c(1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 1, 2, 3),
                   score = c(NA, 2, 3, 4, 3, 7, 3, 4, 3, 4, NA, 1, 4))
id ob score
1 5 1 NA
2 5 2 2
3 5 3 3
4 5 4 4
5 5 5 3
6 12 1 7
7 12 2 3
8 12 3 4
9 17 1 3
10 17 2 4
11 20 1 NA
12 20 2 1
13 20 3 4
And what I am hoping to run is code that will give me...
id ob score es
1 5 1 NA -1
2 5 2 2 -1
3 5 3 3 -1
4 5 4 4 -1
5 5 5 3 -1
6 12 1 7 3
7 12 2 3 3
8 12 3 4 3
9 17 1 3 -1
10 17 2 4 -1
11 20 1 NA -3
12 20 2 1 -3
13 20 3 4 -3
I am attempting to work within dplyr, and I understand the use of the group_by command; however, I am not sure how to select only the first observed scores and then mutate to create es.
I would use first() and last() (both dplyr functions) and na.omit() (from the default stats package).
First, I would make sure your score column is a numeric column with proper NA values (not strings as in your example):
help <- data.frame(id = c(5, 5, 5, 5, 5, 12, 12, 12, 17, 17, 20, 20, 20),
                   ob = c(1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 1, 2, 3),
                   score = c(NA, 2, 3, 4, 3, 7, 3, 4, 3, 4, NA, 1, 4))
then you can do
library(dplyr)
help %>%
  group_by(id) %>%
  arrange(ob) %>%
  mutate(es = first(na.omit(score)) - last(na.omit(score)))
library(dplyr)
temp <- help %>%
  group_by(id) %>%
  arrange(ob) %>%
  filter(!is.na(score)) %>%
  mutate(es = first(score) - last(score)) %>%
  select(id, es) %>%
  distinct()
help %>% left_join(temp)
This solution is a little verbose, only because it relies on a couple of helper functions, FIRST and LAST:
# The position (index) of the last value that evaluates to TRUE
LAST <- function(x, none = NA) {
  out <- FIRST(rev(x), none = none)
  if (identical(none, out)) {
    return(none)
  } else {
    return(length(x) - out + 1)
  }
}
# The position (index) of the first value that evaluates to TRUE
FIRST <- function(x, none = NA) {
  x[is.na(x)] <- FALSE
  if (any(x))
    return(which.max(x))
  else return(none)
}
# returns the difference between the first and last non-missing values
diff2 <- function(x)
  x[FIRST(!is.na(x))] - x[LAST(!is.na(x))]
library(dplyr)
help %>%
  group_by(id) %>%
  arrange(ob) %>%
  summarise(diff = diff2(score))
