Impute missing values for missing dates - R

Imagine I have the following two data frames:
> sp
date value
1 2004-08-20 1
2 2004-08-23 2
3 2004-08-24 4
4 2004-08-25 5
5 2004-08-26 10
6 2004-08-27 11
> other
date value
1 2004-08-20 2
2 2004-08-23 4
3 2004-08-24 5
4 2004-08-25 10
5 2004-08-27 11
where the first column represents the dates and the second the values for each day. The data frame of reference is sp, and I want to impute into other the dates and values that are missing with respect to sp. For instance, in this case other is missing the date "2004-08-26". I should add to other a new row with the date "2004-08-26" and the value given by the mean of the values at "2004-08-25" and "2004-08-27".
Could anyone suggest how I can do it?
Data
sp <- data.frame(date = c("2004-08-20", "2004-08-23", "2004-08-24", "2004-08-25",
                          "2004-08-26", "2004-08-27"),
                 value = c(1, 2, 4, 5, 10, 11))
other <- data.frame(date = c("2004-08-20", "2004-08-23", "2004-08-24", "2004-08-25",
                             "2004-08-27"),
                    value = c(2, 4, 5, 10, 11))

An option using zoo::na.approx:
library(dplyr)
sp %>%
  select(date) %>%
  left_join(other, by = 'date') %>%
  mutate(value = zoo::na.approx(value))
# date value
#1 2004-08-20 2.0
#2 2004-08-23 4.0
#3 2004-08-24 5.0
#4 2004-08-25 10.0
#5 2004-08-26 10.5
#6 2004-08-27 11.0
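Note that na.approx here interpolates over row positions rather than calendar time, so the imputed value is the plain mean of the two neighbouring rows, which is exactly what was asked for. If you ever want interpolation on the actual date scale instead, na.approx accepts the positions through its x argument (in this example the result is the same, since the dates involved are consecutive):
sp %>%
  select(date) %>%
  left_join(other, by = 'date') %>%
  mutate(value = zoo::na.approx(value, x = as.Date(date)))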

If I understand correctly, you want to add dates from sp that are missing in other.
You can merge other with just the date column of sp. Note that, by default, subsetting a data frame (or matrix) to a single column drops the dimensions and returns a bare vector, so we need drop=FALSE.
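You can check the difference quickly:
class(sp[, 'date'])                # "character": dimensions dropped
class(sp[, 'date', drop = FALSE])  # "data.frame": kept as a one-column data frame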
The resulting NA can then be linearly interpolated using approx, which gives the desired mean.
other2 <- merge(other, sp[, 'date', drop = FALSE], all = TRUE) |>
  transform(value = approx(value, xout = seq_along(value))$y)
other2
# date value
# 1 2004-08-20 2.0
# 2 2004-08-23 4.0
# 3 2004-08-24 5.0
# 4 2004-08-25 10.0
# 5 2004-08-26 10.5 ## interpolated
# 6 2004-08-27 11.0
Note: For R < 4.1 (which lacks the native pipe |>), do:
transform(merge(other, sp[, "date", drop = FALSE], all = TRUE),
          value = approx(value, xout = seq_along(value))$y)
# date value
# 1 2004-08-20 2.0
# 2 2004-08-23 4.0
# 3 2004-08-24 5.0
# 4 2004-08-25 10.0
# 5 2004-08-26 10.5 ## interpolated
# 6 2004-08-27 11.0

Related

How to replace missing points in a data set?

I want to write a function in R that receives any data set as input, where the data set has some missing points (NA). I want to use the mean to replace the missing points (NA) in the data set. What I am thinking of is a function like this:
x <- function(data, type = "mean", lag = 2)
It should compute the mean of the two numbers after and the two numbers before the missing point (because I set lag to 2 in the function). For example, if the missing point is in place 12, then the function should compute the mean of the numbers in places 10, 11, 13, and 14 and substitute the result for the missing point at place 12. In particular cases, for example if the missing point is in the last place and we do not have two numbers after it, the function should compute the mean of all the data in the corresponding column and substitute that for the missing point. Here I give an example to make it clear. Consider the following data set:
3 7 8 0 8 12 2
5 8 9 2 8 9 1
1 2 4 5 0 6 7
5 6 0 NA 3 9 10
7 2 3 6 11 14 2
4 8 7 4 5 3 NA
In the above data set, the first NA should be replaced with the mean of the numbers 2 and 5 (the two values before) and 6 and 4 (the two values after), which is (2+5+6+4)/4 = 17/4. And the last NA should be replaced with the mean of the last column, which is (2+1+7+10+2)/5 = 22/5.
My question is how I can add some code (if, if-else, or other control flow) to the above function to make it satisfy the above requirements. I should highlight that I want to use the apply family of functions.
First we can define a function that smooths a single vector:
library(dplyr)
smooth = function(vec, n = 2){
  # Lead and lag the vector n times in both directions
  purrr::map(1:n, function(i){
    cbind(
      lead(vec, i),
      lag(vec, i)
    )
  }) %>%
    # Bind the matrices together
    do.call(cbind, .) %>%
    # Take the mean of each row, ie the smoothed value at each position
    # If there are NAs among the neighbours, the mean will itself be NA
    rowMeans() %>%
    # In order, take a) original values b) locally smoothed values
    # c) globally smoothed values (ie the mean of the entire vector, ignoring NAs)
    coalesce(vec, ., mean(vec, na.rm = TRUE))
}
> smooth(c(0, 2, 5, NA, 6, 4))
[1] 0.00 2.00 5.00 4.25 6.00 4.00
> smooth(c(2, 1, 7, 10, 2, NA))
[1] 2.0 1.0 7.0 10.0 2.0 4.4
Then we can apply it to each column:
> c(3, 7, 8, 0, 8, 12, 2, 5, 8, 9, 2, 8, 9, 1, 1, 2, 4, 5, 0, 6, 7, 5, 6, 0, NA, 3, 9, 10, 7, 2, 3, 6, 11, 14, 2, 4, 8, 7, 4, 5, 3, NA) %>%
+   matrix(byrow = TRUE, ncol = 7) %>%
+   as_tibble(.name_repair = "universal") %>%
+   mutate(across(everything(), smooth))
# A tibble: 6 × 7
...1 ...2 ...3 ...4 ...5 ...6 ...7
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 7 8 0 8 12 2
2 5 8 9 2 8 9 1
3 1 2 4 5 0 6 7
4 5 6 0 4.25 3 9 10
5 7 2 3 6 11 14 2
6 4 8 7 4 5 3 4.4
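Since the question specifically asks for the apply family: the same smooth() function can also be applied column by column with base R's apply (a small sketch, rebuilding the question's numbers as a data frame):
dat <- as.data.frame(matrix(c(3, 7, 8, 0, 8, 12, 2, 5, 8, 9, 2, 8, 9, 1,
                              1, 2, 4, 5, 0, 6, 7, 5, 6, 0, NA, 3, 9, 10,
                              7, 2, 3, 6, 11, 14, 2, 4, 8, 7, 4, 5, 3, NA),
                            byrow = TRUE, ncol = 7))
dat[] <- apply(dat, 2, smooth)  # smooth() returns a full column, so assign back in place
dat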
Please find below one solution using the data.table library.
Reprex
Your data:
m1 <- "3 7 8 0 8 12 2
5 8 9 2 8 9 1
1 2 4 5 0 6 7
5 6 0 NA 3 9 10
7 2 3 6 11 14 2
4 8 7 4 5 3 NA"
myData <- read.table(text = m1, header = FALSE)
Code for the function replaceNA
library(data.table)
replaceNA <- function(data){
  setDT(data)
  # Identify the row and column indexes of NA values in the data.table
  NA_DT <- as.data.table(which(is.na(data), arr.ind = TRUE))
  # Row and column indexes of NAs that are not in the last row
  NA_not_Last <- NA_DT[row < nrow(data)]
  # Row and column indexes of NAs that are in the last row
  NA_Last <- NA_DT[row == nrow(data)]
  # Names of the columns whose NAs are not in the last row
  Cols_NA_not_Last <- colnames(data)[NA_not_Last[, col]]
  # Names of the columns whose NAs are in the last row
  Cols_NA_Last <- colnames(data)[NA_Last[, col]]
  # Replace NAs that are not in the last row by the mean of the values located
  # in the two previous rows and the two following rows of the row containing the NA
  data[, (Cols_NA_not_Last) := lapply(.SD, function(x)
    replace(x, which(is.na(x)),
            mean(c(x[which(is.na(x)) - 2], x[which(is.na(x)) - 1],
                   x[which(is.na(x)) + 1], x[which(is.na(x)) + 2]),
                 na.rm = TRUE))),
    .SDcols = Cols_NA_not_Last][]
  # Replace NAs in the last row by the mean of all the values in their column
  data[, (Cols_NA_Last) := lapply(.SD, function(x)
    replace(x, which(is.na(x)), mean(x, na.rm = TRUE))),
    .SDcols = Cols_NA_Last][]
  return(data)
}
Test of the function with your data
replaceNA(myData)
#> V1 V2 V3 V4 V5 V6 V7
#> 1: 3 7 8 0.00 8 12 2.0
#> 2: 5 8 9 2.00 8 9 1.0
#> 3: 1 2 4 5.00 0 6 7.0
#> 4: 5 6 0 4.25 3 9 10.0
#> 5: 7 2 3 6.00 11 14 2.0
#> 6: 4 8 7 4.00 5 3 4.4
Created on 2021-11-08 by the reprex package (v2.0.1)

How to create a function which loops through column index numbers in R?

Consider the following data frame (df):
"id" "a1" "b1" "c1" "not_relevant" "p_a1" "p_b1" "p_c1"
a 2 6 0 x 2 19 12
a 4 2 7 x 3.5 7 11
b 1 9 4 x 7 1.5 4
b 7 5 11 x 8 12 5
I would like to create a new column which shows the sum of the products of pairs of corresponding columns. To write less code I address the columns by their index numbers. Unfortunately I have no experience in writing functions, so I ended up doing this manually, which is extremely tedious and not very elegant.
Here a reproducible example of the data frame and what I have tried so far:
id <- c("a", "a", "b", "b")
df <- data.frame(id)
df$a1 <- as.numeric(c(2, 4, 1, 7))
df$b1 <- as.numeric(c(6, 2, 9, 5))
df$c1 <- as.numeric(c(0, 7, 4, 11))
df$not_relevant <- c("x", "x", "x", "x")
df$p_a1 <- as.numeric(c(2, 3.5, 7, 8))
df$p_b1 <- as.numeric(c(19, 7, 1.5, 12))
df$p_c1 <- as.numeric(c(12, 11, 4, 5))
require(dplyr)
df %>% mutate(total = .[[2]]*.[[6]] + .[[3]] *.[[7]]+ .[[4]] *.[[8]])
This leads to the desired result, but as I mentioned is not very efficient:
"id" "a1" "b1" "c1" "not_relevant" "p_a1" "p_b1" "p_c1" "total"
a 2 6 0 x 2 19 12 118.0
a 4 2 7 x 3.5 7 11 105.0
b 1 9 4 x 7 1.5 4 36.5
b 7 5 11 x 8 12 5 171.0
The real data I am working with has many more columns, so I would be glad if someone could show me a way to pack this operation into a function which loops through the column index numbers and matches the correct columns to each other.
Column indices are not a good way to do this. (Not a good way in general...)
Here's a simple dplyr method that assumes the columns are in the correct corresponding order (that is, it will give the wrong result if a1, b1, c1 are in a different order than p_a1, p_b1, p_c1). You may also need to refine the selection criteria for your real data:
df$total = rowSums(select(df, a1:c1) * select(df, starts_with("p_")))
df
# id a1 b1 c1 not_relevant p_a1 p_b1 p_c1 total
# 1 a 2 6 0 x 2.0 19.0 12 118.0
# 2 a 4 2 7 x 3.5 7.0 11 105.0
# 3 b 1 9 4 x 7.0 1.5 4 36.5
# 4 b 7 5 11 x 8.0 12.0 5 171.0
The other good option would be to convert your data to a long format, where you have a single x column and a single p column, with an "index" column indicating the 1, 2, 3. Then the operation could be done by group, finally moving back to a wide format.
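For illustration, here is a sketch of that long-format route using tidyr's pivot_longer()/pivot_wider() (the helper columns row, set, and index are names made up for this sketch):
library(dplyr)
library(tidyr)

totals <- df %>%
  mutate(row = row_number()) %>%                  # remember each original row
  pivot_longer(-c(id, not_relevant, row)) %>%     # one (name, value) pair per cell
  mutate(set = if_else(startsWith(name, "p_"), "p", "x"),
         index = sub("^p_", "", name)) %>%        # "p_a1" and "a1" share index "a1"
  select(-name) %>%
  pivot_wider(names_from = set, values_from = value) %>%  # one x and one p column
  group_by(row) %>%
  summarise(total = sum(x * p))                   # sum of products per original row

df$total <- totals$total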

R Compare Columns across Dataframes to Match Values

I have two dataframes, looking at houses (n=6) and certain dates (n=22).
ORIGINAL is the original dataset. It contains 38 observations on 5 variables. Not all houses have all the dates listed, and vice versa, leading to errors in calculations with different length variables.
SAMPLE is a new empty dataset. It contains 132 (6 x 22) observations on the same 5 variables. Now there is an observation for every household for every date.
House Day Mongoose Fruit Elephant
A 1 40 7 0.6
A 6 32 12 4.2
B 2 50 3 4.0
B 4 51 4 8.6
B 6 8 7 12.1
C 2 12 8 13.0
I am trying to fill in the rest of SAMPLE by asking R to compare House and Day between the two dataframes; if they match, the rest of the variables (Mongoose, Fruit, Elephant) should be copied over for that observation.
I tried this to no avail...
for(i in 1:nrow(original))
{
  if ((sample$Day == original$Day) && (sample$House == original$House))
  {
    sample$Mongoose[i] <- original$Mongoose[i]
    sample$Fruit[i] <- original$Fruit[i]
    sample$Elephant[i] <- original$Elephant[i]
  }
}
This produces the following three warnings in sequence:
In sample$Day == original$Day : longer object length is not a multiple of
shorter object length
In is.na(e1) | is.na(e2) : longer object length is not a multiple of
shorter object length
In ==.default(sample$House, original$House) : longer object length is
not a multiple of shorter object length
The data DOES copy over, but incorrectly. All the values get transferred to house A against sequential days, rather than to the appropriate house and day.
I.e., it looks like this:
House Day Mongoose Fruit Elephant
A 1 40 7 0.6
A 2 50 3 4.0
A 3 51 4 8.6
A 4 8 7 12.1
A 5 12 8 13.0
A 6 32 12 4.2
B 1
B 2
B 3 [...]
When it should (in essence) look like this:
House Day Mongoose Fruit Elephant
A 1 40 7 0.6
A 2
A 3
A 4
A 5
A 6 32 12 4.2 [rest of house A's days have no data]
B 1
B 2 50 3 4.0
B 3
B 4 51 4 8.6
B 5
B 6 8 7 12.1 [rest of house B's days have no data]
C 1
C 2 12 8 13.0
Please advise; I will eventually have to extend this technique to a sample dataset with 198K entries and an original dataset with 115K.
Thanks!
Sounds to me like this should work:
merge(sample, original, by = c("House", "Day"), all.x = TRUE)
But hard to tell without a reproducible example. You may also want to look into dplyr::left_join(). That is, assuming your data looks like the following:
sample <- data.frame(House = rep(c("A", "B", "C"), each = 6),
                     Day = rep(1:6, 3))
# head(sample)
# House Day
# 1 A 1
# 2 A 2
# 3 A 3
# 4 A 4
# 5 A 5
# 6 A 6
original <- data.frame(House = c("A", "A", "B", "B", "B", "C"),
                       Day = c(1, 6, 2, 4, 6, 2),
                       Mongoose = c(40, 32, 50, 51, 8, 12),
                       Fruit = c(7, 12, 3, 4, 7, 8),
                       Elephant = c(0.6, 4.2, 4.0, 8.6, 12.1, 13.0))
# head(original)
# House Day Mongoose Fruit Elephant
# 1 A 1 40 7 0.6
# 2 A 6 32 12 4.2
# 3 B 2 50 3 4.0
# 4 B 4 51 4 8.6
# 5 B 6 8 7 12.1
# 6 C 2 12 8 13.0
We obtain:
# head(merge(sample, original, by = c("House", "Day"), all.x = TRUE))
# House Day Mongoose Fruit Elephant
# 1 A 1 40 7 0.6
# 2 A 2 NA NA NA
# 3 A 3 NA NA NA
# 4 A 4 NA NA NA
# 5 A 5 NA NA NA
# 6 A 6 32 12 4.2
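For reference, the dplyr equivalent of the merge() call above, keeping all rows of sample in the same way, would be:
library(dplyr)
left_join(sample, original, by = c("House", "Day"))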
It could be a small tweak; look at this line of your original code:
if ((sample$Day == original$Day) && (sample$House == original$House))
See if you can change it to this:
if ((sample$Day[i] == original$Day[i]) && (sample$House[i] == original$House[i]))
Because:
You are using a for loop with an i variable,
which you use very well in lines such as sample$Mongoose[i] <- original$Mongoose[i],
but in your example the if statement is not actually making use of the i variable,
so we revise it to make use of i, so that it compares that specific row's sample$Day with that row's original$Day, and likewise that row's sample$House with original$House.

R sum consecutive duplicate rows and remove all but first

I am stuck with a probably simple question - how to sum consecutive duplicate rows and remove all but the first row. And, if there is an NA between two duplicates (such as 2, NA, 2), also sum them and remove all but the first entry.
Here is my sample data:
ia<-c(1,1,2,NA,2,1,1,1,1,2,1,2)
time<-c(4.5,2.4,3.6,1.5,1.2,4.9,6.4,4.4, 4.7, 7.3,2.3, 4.3)
a<-as.data.frame(cbind(ia, time))
Sample output:
a
ia time
1 1 4.5
2 1 2.4
3 2 3.6
4 NA 1.5
5 2 1.2
6 1 4.9
7 1 6.4
8 1 4.4
9 1 4.7
10 2 7.3
11 1 2.3
12 2 4.3
Now I want to
1.) sum the "time" column over consecutive ia's - i.e., sum the times when the same number occurs twice or more in a row; in my case here, sum the first and second rows of column time (4.5 + 2.4).
2.) if there is an NA between two equal numbers in the ia column (i.e., ia = 2, NA, 2), then also sum all of those times.
3.) keep only the first occurrence of each ia run, and delete the rest.
In the end, I would want to have something like this:
a
ia time
1 1 6.9
3 2 6.3
6 1 20.4
10 2 7.3
11 1 2.3
12 2 4.3
I found this for summing, but it does not take the consecutive factor into account:
aggregate(time ~ ia, data = a, FUN = sum)
and I found this for deleting:
a[cumsum(rle(as.numeric(a[,1]))$lengths),]
although the rle approach keeps the last entry, and I would want to keep the first. I also have no idea how to handle the NAs.
If I have a pattern of 1-NA-2, then the NA should NOT be counted with either of them; in this case the NA row should be removed.
With data.table (as RHertel suggested for na.locf):
library(data.table)
library(zoo)
setDT(a)[
  na.locf(ia, fromLast = TRUE) == na.locf(ia),       # drop NAs not surrounded by the same value
  sum(time),                                         # sum the times within each run
  by = .(id = cumsum(c(TRUE, !!diff(na.locf(ia)))))  # new group whenever the filled ia changes
]
# id V1
#1: 1 6.9
#2: 2 6.3
#3: 3 20.4
#4: 4 7.3
#5: 5 2.3
#6: 6 4.3
You first need to replace sequences of NAs with the values surrounding them (if they are the same). This answer shows zoo's na.locf function, which fills in NAs with the last observation. By testing whether it's the same when you carry values backwards or forwards, you can filter out the NAs you don't want, then do the carrying forward:
library(dplyr)
library(zoo)
a %>%
  filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
  mutate(ia = na.locf(ia))
#> ia time
#> 1 1 4.5
#> 2 1 2.4
#> 3 2 3.6
#> 4 2 1.5
#> 5 2 1.2
#> 6 1 4.9
#> 7 1 6.4
#> 8 1 4.4
#> 9 1 4.7
#> 10 2 7.3
#> 11 1 2.3
#> 12 2 4.3
Now that you've fixed those NAs, you can group consecutive sets of values using cumsum. The full solution is:
result <- a %>%
  filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
  mutate(ia = na.locf(ia)) %>%
  mutate(change = ia != lag(ia, default = FALSE)) %>%
  group_by(group = cumsum(change), ia) %>%
  summarise(time = sum(time))
result
#> Source: local data frame [6 x 3]
#> Groups: group [?]
#>
#> group ia time
#> (int) (dbl) (dbl)
#> 1 1 1 6.9
#> 2 2 2 6.3
#> 3 3 1 20.4
#> 4 4 2 7.3
#> 5 5 1 2.3
#> 6 6 2 4.3
If you want to get rid of the group column, use the additional lines:
result %>%
  ungroup() %>%
  select(-group)
df <- a  # work on a copy of the example data
nas <- which(is.na(df$ia))
add.index <- sapply(nas, function(x) {
  logi <- which(!is.na(df$ia))      # positions of the non-NA entries
  aft <- logi[logi > x][1]          # nearest non-NA position after the NA
  fore <- tail(logi[logi < x], 1)   # nearest non-NA position before the NA
  # replace only if both neighbours exist and hold the same value
  if (length(fore) == 1 && !is.na(aft) && df$ia[aft] == df$ia[fore]) aft else NA
})
df$ia[nas] <- df$ia[add.index]
df <- df[complete.cases(df), ]
First we determine if the NA values of the column are surrounded by the same value. If yes, the surrounding value replaces the NA. There is no problem if the data has consecutive NA values.
Next we do a standard sum by group operation. cumsum allows us to create a unique group based on changes in the numbers.
df$grps <- cumsum(c(FALSE, df$ia[-length(df$ia)] != df$ia[-1])) + 1
aggregate(time ~ grps, df, sum)
# grps time
# 1 1 6.9
# 2 2 6.3
# 3 3 20.4
# 4 4 7.3
# 5 5 2.3
# 6 6 4.3
This is a base R approach. With packages like dplyr, zoo, or data.table different options are available as they come built with specialized functions to do what we did here.
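As an aside, once the NAs have been filled in as above, data.table's rleid() builds the same run identifier in a single call (a sketch reusing the df from this answer):
library(data.table)
setDT(df)[, .(time = sum(time)), by = .(grp = rleid(ia), ia)]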

Using ddply to aggregate over irregular time periods in longitudinal data

I'm looking for help adapting two existing scripts.
I am working with a longitudinal dataset, aggregating a key variable over time periods. I have variables for both weeks and months. I'm able to aggregate over both weeks and months - but my goal is to aggregate over weeks for the first six weeks, and then switch to aggregating by month beyond six weeks.
Aggregating by weeks and months is easy enough...
df.summary_week <- ddply(df, .(weeks), summarise,
                         var.mean = mean(var, na.rm = TRUE))
Which yields something like:
weeks var.mean
1 3.99
2 5.44
3 6.7
4 8.100
5 2.765
6 2.765
7 3.765
8 4.765
9 1.765
10 4.765
11 1.765
And then aggregating by month would yield something similar:
df.summary_months <- ddply(df, .(months), summarise,
                           var.mean = mean(var, na.rm = TRUE))
months var.mean
1 5.00
2 3.001
3 4.7
4 7.100
My initial idea was to simply subset the two datasets with cut points and then bind them together, but I don't know how to do that when the 1-month aggregation starts at 6 weeks rather than 8.
Thoughts, R wizards?
Basic example data:
dat <- data.frame(var = 1:24, weeks = 1:24, months = rep(1:6, each = 4))
Means for the first 6 groups should be just 1:6; after that, the means are the values for the subsequent 4-week periods (e.g., mean(7:10) = 8.5, etc.).
Make a suitable group identifier going from weeks to months:
dat$grp <- findInterval(dat$weeks,seq(7,max(dat$weeks),4)) + 6
dat$grp <- ifelse(dat$grp==6,dat$weeks,dat$grp)
#[1] 1 2 3 4 5 6 7 7 7 7 8 8 8 8 9 9 9 9 10 10 10 10 11 11
Group the data:
ddply(dat, .(grp), summarise, var.mean = mean(var,na.rm=T))
grp var.mean
1 1 1.0
2 2 2.0
3 3 3.0
4 4 4.0
5 5 5.0
6 6 6.0
7 7 8.5
8 8 12.5
9 9 16.5
10 10 20.5
11 11 23.5
How about just creating a new grouping column?
set.seed(1618)
dat <- data.frame(week = sample(1:26, 200, replace = TRUE),
                  value = rpois(200, 2))
dat <- within(dat, {
  idx <- cut(week, c(0, 6, seq(10, max(week), by = 4)))
})
# head(dat)
# week value idx
# 1 6 1 (0,6]
# 2 16 2 (14,18]
# 3 9 1 (6,10]
# 4 13 2 (10,14]
# 5 8 2 (6,10]
# 6 16 2 (14,18]
library(plyr)
ddply(dat, .(idx), summarise,
      mean = mean(value, na.rm = TRUE))
# idx mean
# 1 (0,6] 1.870968
# 2 (6,10] 2.259259
# 3 (10,14] 2.171429
# 4 (14,18] 1.931034
# 5 (18,22] 1.560000
# 6 (22,26] 1.954545
# checking a couple values
mean(dat[dat$week %in% 1:6, 'value'])
# [1] 1.870968
mean(dat[dat$week %in% 7:10, 'value'])
# [1] 2.259259
mean(dat[dat$week %in% 23:26, 'value'])
# [1] 1.954545
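For what it's worth, plyr has since been superseded; the same aggregation reads almost identically with its successor dplyr (reusing dat and idx from the example above):
library(dplyr)
dat %>%
  group_by(idx) %>%
  summarise(mean = mean(value, na.rm = TRUE))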
