Linear interpolation in data.table [duplicate] - r

I would like to perform a linear interpolation on a variable of a data frame that takes into account: 1) the time difference between the two points, 2) the moment when the data was taken, and 3) the individual on which the variable was measured.
For example, in the following data frame:
df <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3),
Individuals=c(1,1,1,1,1,1,1,2,2,2),
Value=c(1, 2, 3, NA, 5, NA, 7, 5, NA, 7))
df
I would like to obtain:
result <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3),
Individuals=c(1,1,1,1,1,1,1,2,2,2),
Value=c(1, 2, 3, 4, 5, 6, 7, 5, 6, 7))
result
I cannot use the function na.approx from the zoo package on its own, because the observations are not all consecutive: some belong to one individual and others belong to other individuals. If the first observation of the second individual were NA and I used na.approx alone, I would be using information from individual 1 to interpolate the NA of individual 2 (e.g. the following data frame would have this error):
df_2 <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3),
Individuals=c(1,1,1,1,1,1,1,2,2,2),
Value=c(1, 2, 3, NA, 5, NA, 7, NA, 5, 7))
df_2
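To make the problem concrete, here is a minimal base R sketch of what goes wrong. It uses stats::approx instead of na.approx and treats the row position as the x-axis; both are assumptions made only for illustration.

```r
# Values from df_2; row 8 is the first observation of individual 2.
value <- c(1, 2, 3, NA, 5, NA, 7, NA, 5, 7)
pos   <- seq_along(value)
known <- !is.na(value)

# Ungrouped interpolation over row position: every NA is filled from its
# nearest known neighbours, regardless of which individual they belong to.
naive <- approx(pos[known], value[known], xout = pos)$y

naive[8]  # filled from row 7 (individual 1) and row 9 (individual 2)
```

Row 8 ends up with a value even though individual 2 has no earlier observation, which is exactly the leakage between individuals that a grouped approach has to avoid.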
I have tried using the packages zoo and dplyr:
library(dplyr)
library(zoo)
proof <- df %>%
group_by(Individuals) %>%
na.approx(df$Value)
But I cannot perform group_by in a zoo object.
Do you know how to interpolate NA values in one variable by groups?
Thanks in advance,

Use data.frame, rather than cbind to create your data. cbind returns a matrix, but you need a data frame for dplyr. Then use na.approx inside mutate. I've commented out group_by, as you haven't provided the grouping variable in your data, but the approach should work once you've added the grouping variable to the data frame.
df <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3),
Individuals=c(1,1,1,1,1,1,1,2,2,2),
Value=c(NA, 2, 3, NA, 5, NA, 7, 8, NA, 10))
library(dplyr)
library(zoo)
df %>%
group_by(Individuals) %>%
mutate(ValueInterp = na.approx(Value, na.rm=FALSE))
time Individuals Value ValueInterp
1 1 1 NA NA
2 2 1 2 2
3 3 1 3 3
4 4 1 NA 4
5 5 1 5 5
6 6 1 NA 6
7 7 1 7 7
8 1 2 8 8
9 2 2 NA 9
10 3 2 10 10
Update: To interpolate multiple columns, we can use mutate_at. Here's an example with two value columns. We use mutate_at to run na.approx on every column whose name matches "Value". Passing the named list list(interp=na.approx) tells mutate_at to keep the original columns and store the results in new columns with interp appended as a suffix (Value1_interp, Value2_interp):
df <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3),
Individuals=c(1,1,1,1,1,1,1,2,2,2),
Value1=c(NA, 2, 3, NA, 5, NA, 7, 8, NA, 10),
Value2=c(NA, 2, 3, NA, 5, NA, 7, 8, NA, 10)*2)
df %>%
group_by(Individuals) %>%
mutate_at(vars(matches("Value")), list(interp=na.approx), na.rm=FALSE)
time Individuals Value1 Value2 Value1_interp Value2_interp
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 NA NA NA NA
2 2 1 2 4 2 4
3 3 1 3 6 3 6
4 4 1 NA NA 4 8
5 5 1 5 10 5 10
6 6 1 NA NA 6 12
7 7 1 7 14 7 14
8 1 2 8 16 8 16
9 2 2 NA NA 9 18
10 3 2 10 20 10 20
If you don't want to preserve the original, uninterpolated columns, you can do:
df %>%
group_by(Individuals) %>%
mutate_at(vars(matches("Value")), na.approx, na.rm=FALSE)
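Note that na.approx called this way treats the observations as equally spaced. If the time column should drive the interpolation (point 1 in the question), one possible base R sketch is shown below. It uses stats::approx and ave rather than zoo, and the helper name interp_rows is hypothetical.

```r
df <- data.frame(time = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3),
                 Individuals = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
                 Value = c(1, 2, 3, NA, 5, NA, 7, 5, NA, 7))

# Hypothetical helper: interpolate Value against time for one group of rows.
interp_rows <- function(rows) {
  x <- df$time[rows]
  y <- df$Value[rows]
  known <- !is.na(y)
  if (sum(known) < 2) return(y)           # approx() needs two known points
  approx(x[known], y[known], xout = x)$y  # stays NA outside the known range
}

# ave() applies the helper to the row indices of each individual separately.
df$ValueInterp <- ave(seq_len(nrow(df)), df$Individuals, FUN = interp_rows)
```

Because approx is given the actual time values, unevenly spaced measurements are weighted by their distance in time, and leading or trailing NAs within an individual are left as NA rather than being filled from another individual.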

We can use data.table
library(data.table)
library(zoo)
# na.rm = FALSE keeps leading/trailing NAs, so the result has the same length as each group
setDT(df)[, ValueInterp := na.approx(Value, na.rm = FALSE), by = Individuals]

Related

Completing the NAs of a Tibble in R

I have a database in R where there are some NAs in the variables. I would like to apply a logic function where the NAs would be filled with the immediately preceding value. Below is an example:
dados <- tibble::tibble(x = c(2, 3, 5, NA, 2, 1, NA, NA, 9, 3),
y = c(4, 1, 9, NA, 8, 5, NA, NA, 1, 2)
)
# A tibble: 10 x 2
x y
<dbl> <dbl>
1 2 4
2 3 1
3 5 9
4 NA NA
5 2 8
6 1 5
7 NA NA
8 NA NA
9 9 1
10 3 2
In this case, the 4th value of the variable x would be filled with a 5 and so on.
Thank you!
We could use fill from tidyr package:
library(tidyr)
library(dplyr)
dados %>%
fill(c(x,y), .direction = "down")
x y
<dbl> <dbl>
1 2 4
2 3 1
3 5 9
4 5 9
5 2 8
6 1 5
7 1 5
8 1 5
9 9 1
10 3 2
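We can use coalesce with lag. Note that this looks back only one row, so the second of two consecutive NAs (row 8 below) stays NA: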
library(dplyr)
dados %>%
mutate(across(x:y, ~ coalesce(., lag(.))))
# A tibble: 10 x 2
x y
<dbl> <dbl>
1 2 4
2 3 1
3 5 9
4 5 9
5 2 8
6 1 5
7 1 5
8 NA NA
9 9 1
10 3 2
library(dplyr)
dados %>%
mutate(x = case_when(is.na(x) ~ lag(x),
TRUE ~ x),
y = case_when(is.na(y) ~ lag(y),
TRUE ~ y))
The following only works if the first value in a column is not NA; for the sake of clear and simple code, I leave handling that case as an exercise for you. We can solve this for one column as follows:
library(tibble)
dados <- tibble::tibble(x = c(2, 3, 5, NA, 2, 1, NA, NA, 9, 3),
y = c(4, 1, 9, NA, 8, 5, NA, NA, 1, 2)
)
#where are the NA?
pos <- dados$x |>
is.na() |>
which()
# replace
while(any(is.na(dados$x)))
dados$x[pos] <- dados$x[pos-1]
dados
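For comparison, here is a vectorised base R sketch of the same "carry the last value forward" idea that also copes with a leading NA. The helper name fill_down is hypothetical, and a plain data.frame is used instead of a tibble so that no packages are needed.

```r
# Hypothetical helper: carry the last non-NA value forward.
fill_down <- function(v) {
  idx <- cumsum(!is.na(v))        # number of non-NA values seen so far (0 = none yet)
  c(NA, v[!is.na(v)])[idx + 1]    # slot 1 holds NA for gaps before the first value
}

dados <- data.frame(x = c(2, 3, 5, NA, 2, 1, NA, NA, 9, 3),
                    y = c(4, 1, 9, NA, 8, 5, NA, NA, 1, 2))

# Apply the helper to every column while keeping the data frame structure.
dados[] <- lapply(dados, fill_down)
dados
```

Runs of consecutive NAs are filled in one pass, and a column that starts with NA keeps that NA instead of failing.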

How to bind tibbles by row with different number of columns in R [duplicate]

I want to bind df1 with df2 by row, keeping the same column name, to obtain df3.
library(tidyverse)
df1 <- tibble(a = c(1, 2, 3),
b = c(4, 5, 6),
c = c(1, 5, 7))
df2 <- tibble(a = c(8, 9),
b = c(5, 6))
# how to bind these tibbles by row to get
df3 <- tibble(a = c(1, 2, 3, 8, 9),
b = c(4, 5, 6, 5, 6),
c = c(1, 5, 7, NA, NA))
Created on 2020-10-30 by the reprex package (v0.3.0)
Try this using bind_rows() from dplyr. Updated credit to @AbdessabourMtk:
df3 <- dplyr::bind_rows(df1,df2)
Output:
# A tibble: 5 x 3
a b c
<dbl> <dbl> <dbl>
1 1 4 1
2 2 5 5
3 3 6 7
4 8 5 NA
5 9 6 NA
A base R option
df2[setdiff(names(df1),names(df2))]<-NA
df3 <- rbind(df1,df2)
giving
> df3
# A tibble: 5 x 3
a b c
<dbl> <dbl> <dbl>
1 1 4 1
2 2 5 5
3 3 6 7
4 8 5 NA
5 9 6 NA
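The base R idea above only pads df2. A small sketch that pads whichever side is missing columns might look like this; the function name bind_rows_base is hypothetical, and plain data frames are used here instead of tibbles.

```r
# Hypothetical helper: row-bind two data frames whose column sets differ.
bind_rows_base <- function(a, b) {
  all_cols <- union(names(a), names(b))
  # Add each missing column, filled with NA, to the side that lacks it.
  for (col in setdiff(all_cols, names(a))) a[[col]] <- NA
  for (col in setdiff(all_cols, names(b))) b[[col]] <- NA
  # Put both in the same column order before binding.
  rbind(a[all_cols], b[all_cols])
}

df1 <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(1, 5, 7))
df2 <- data.frame(a = c(8, 9), b = c(5, 6))
df3 <- bind_rows_base(df1, df2)
df3
```

Looping over setdiff() means nothing happens when a side already has every column, so the helper is symmetric in its two arguments.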
We can use rbindlist from data.table
library(data.table)
rbindlist(list(df1, df2), fill = TRUE)
-output
# a b c
#1: 1 4 1
#2: 2 5 5
#3: 3 6 7
#4: 8 5 NA
#5: 9 6 NA

R - How to recode multiple columns [duplicate]

I am trying to change the 6s to NAs across multiple columns. I have tried using the mutate_at command in dplyr, but can't seem to make it work. Any ideas?
library(dplyr)
ID <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) #Create vector of IDs for ID column.
Score1 <- c(1, 2, 3, 2, 5, 6, 6, 2, 5, 4) #Create vector of scores for Score1 column.
Score2 <- c(2, 2, 3, 6, 5, 6, 6, 2, 3, 4) #Create vector of scores for Score2 column.
Score3 <- c(3, 2, 3, 4, 5, 5, 6, 2, 6, 4) #Create vector of scores for Score3 column.
df <- data.frame(ID, Score1, Score2, Score3) #Combine columns into a data frame.
VectorOfNames <- as.vector(c("Score1", "Score2", "Score3")) #Create a vector of column names.
df <- mutate_at(df, VectorOfNames, 6=NA) #Within the data frame, apply the function (6=NA) to the columns specified in VectorOfNames.
dplyr has the na_if() function for precisely this task. You were almost there with your code and can use:
mutate_at(df, VectorOfNames, ~na_if(.x, 6))
ID Score1 Score2 Score3
1 1 1 2 3
2 2 2 2 2
3 3 3 3 3
4 4 2 NA 4
5 5 5 5 5
6 6 NA NA 5
7 7 NA NA NA
8 8 2 2 2
9 9 5 3 NA
10 10 4 4 4
You could use :
library(dplyr)
df %>%mutate_at(VectorOfNames, ~replace(., . == 6, NA))
#OR
#df %>%mutate_at(VectorOfNames, ~ifelse(. == 6, NA, .))
# ID Score1 Score2 Score3
#1 1 1 2 3
#2 2 2 2 2
#3 3 3 3 3
#4 4 2 NA 4
#5 5 5 5 5
#6 6 NA NA 5
#7 7 NA NA NA
#8 8 2 2 2
#9 9 5 3 NA
#10 10 4 4 4
Or in base R :
df[VectorOfNames][df[VectorOfNames] == 6] <- NA
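If more than one code has to be treated as missing, the same base R idea can be written column by column with replace and %in%. This is only a sketch; the vector missing_codes is a hypothetical name introduced for illustration.

```r
df <- data.frame(ID = 1:10,
                 Score1 = c(1, 2, 3, 2, 5, 6, 6, 2, 5, 4),
                 Score2 = c(2, 2, 3, 6, 5, 6, 6, 2, 3, 4),
                 Score3 = c(3, 2, 3, 4, 5, 5, 6, 2, 6, 4))
VectorOfNames <- c("Score1", "Score2", "Score3")

missing_codes <- c(6)  # hypothetical: add further sentinel codes here

# Recode only the listed columns; ID is left untouched even though it contains a 6.
df[VectorOfNames] <- lapply(df[VectorOfNames],
                            function(col) replace(col, col %in% missing_codes, NA))
df
```

Using %in% rather than == also means values that are already NA are simply left as they are.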

Combining/joining rows within the same dataframe based on grouping R [duplicate]

I am executing a map_df function that results in a dataframe similar to the df below.
name <- c('foo', 'foo', 'foo', 'bar', 'bar', 'bar')
year <- c(19, 19, 19, 18, 18, 18)
A <- c(1, NA, NA, 2, NA, NA)
B <- c(NA, 3, NA, NA, 4, NA)
C <- c(NA, NA, 2, NA, NA, 5)
df <- data.frame(name, year, A, B, C)
name year A B C
1 foo 19 1 NA NA
2 foo 19 NA 3 NA
3 foo 19 NA NA 2
4 bar 18 2 NA NA
5 bar 18 NA 4 NA
6 bar 18 NA NA 5
Based on my unique groups within the df, in this case name + year, I want to merge the data into the same row. Desired result:
name year A B C
1 foo 19 1 3 2
2 bar 18 2 4 5
I can definitely accomplish this with a mix of filtering and joins, but with my actual dataframe that would be a lot of code and inefficient. I'm looking for a more elegant way to "squish" this dataframe.
library(dplyr)
df %>%
group_by(name, year) %>%
summarise_all(mean, na.rm = TRUE)
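This is a dplyr answer. It works if your data really looks like the example you posted, that is, with at most one non-NA value per column in each group, so that the mean simply returns that value.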
Output:
name year A B C
<fct> <dbl> <dbl> <dbl> <dbl>
1 bar 18 2 4 5
2 foo 19 1 3 2
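If taking a mean feels too implicit, a base R sketch that picks the first non-NA value per group makes the intent explicit. The helper first_non_na is a hypothetical name, and aggregate is used here in place of dplyr.

```r
df <- data.frame(name = c("foo", "foo", "foo", "bar", "bar", "bar"),
                 year = c(19, 19, 19, 18, 18, 18),
                 A = c(1, NA, NA, 2, NA, NA),
                 B = c(NA, 3, NA, NA, 4, NA),
                 C = c(NA, NA, 2, NA, NA, 5))

# Hypothetical helper: first non-missing value, or NA if the group has none.
first_non_na <- function(v) {
  v <- v[!is.na(v)]
  if (length(v) > 0) v[1] else NA
}

# One output row per name/year combination.
squished <- aggregate(df[c("A", "B", "C")],
                      by = df[c("name", "year")],
                      FUN = first_non_na)
squished
```

Unlike the mean, this also works for character columns, and it does not silently average when a group happens to contain two different values.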

Counting complete cases by ID for several variables

I'm just beginning to learn R, so my apologies if this is simpler than I think it is, but I'm really struggling to find an answer.
What I'm attempting to do is to create a vector with a count of complete cases, by ID, for multiple variables.
For example, in this data frame:
ID<-c(1:5)
score.1<-c(1, 7, 3, 5, NA, 4, 6, 9, 11, NA)
score.2<-c(2, NA, 7, 6, NA, 5, NA, 7, 10, 1)
sample<-data.frame(ID, score.1, score.2)
ID score.1 score.2
1 1 2
2 7 NA
3 3 7
4 5 6
5 NA NA
1 4 5
2 6 NA
3 9 7
4 11 10
5 NA 1
The output I'm looking for is something like:
ID Complete
1 4
2 2
3 4
4 4
5 1
Is there a way to do this that I'm missing? I've tried count(complete.cases(sample)) with plyr and sum(complete.cases()), but it's not giving me what I actually want.
Any help with this is appreciated.
You can use dplyr:
library(dplyr)
sample %>%
mutate(new_var = rowSums(!is.na(sample[,2:3]))) %>%
group_by(ID) %>%
summarize(Complete = sum(new_var))
The output is exactly what you are looking for:
ID Complete
(int) (dbl)
1 4
2 2
3 4
4 4
5 1
With the dplyr package and the base function complete.cases, try:
require(dplyr)
sample %>%
mutate(complete = complete.cases(sample)) %>%
group_by(ID) %>%
summarise(complete = sum(complete))
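It is worth checking which of two things is being counted: rows in which every score is present, or individual non-missing scores. A base R sketch of both follows. The data frame is named scores here (a hypothetical name, chosen to avoid masking base::sample), and ID is written out with rep instead of relying on recycling.

```r
scores <- data.frame(ID = rep(1:5, 2),
                     score.1 = c(1, 7, 3, 5, NA, 4, 6, 9, 11, NA),
                     score.2 = c(2, NA, 7, 6, NA, 5, NA, 7, 10, 1))

# Rows per ID in which *every* score is present.
complete_rows <- tapply(complete.cases(scores), scores$ID, sum)

# Individual non-missing scores per ID (the counts shown in the question).
non_missing <- tapply(rowSums(!is.na(scores[-1])), scores$ID, sum)

complete_rows
non_missing
```

For this data, complete_rows is 2, 0, 2, 2, 0 while non_missing is 4, 2, 4, 4, 1, so only the second matches the output requested in the question.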
This should do it:
score.1_complete <- sample[complete.cases(sample$score.1), ]
score.2_complete <- sample[complete.cases(sample$score.2), ]
total <- rbind(score.1_complete, score.2_complete)
output <- plyr::count(total, "ID")  # plyr's count() accepts column names as strings
My reasoning:
score.1_complete selects the rows where score.1 (though not necessarily score.2) is complete. score.2_complete selects the rows where score.2 (though not necessarily score.1) is complete. Therefore, counting how many times an ID shows up in total gives you how many times score.1 is complete for that ID plus how many times score.2 is complete for that ID, which is what you want.
Here is another option with gather/summarise. We convert the 'wide' to 'long' format with gather (from tidyr), get the sum of non-NA 'value' grouped by 'ID'.
library(tidyr)
library(dplyr)
gather(sample, score, value,-ID) %>%
group_by(ID) %>%
summarise(value= sum(!is.na(value)) )
# ID value
# (int) (int)
#1 1 4
#2 2 2
#3 3 4
#4 4 4
#5 5 1
Or a base R approach would be
tapply(rowSums(!is.na(sample[-1])), sample$ID, FUN=sum)
# 1 2 3 4 5
# 4 2 4 4 1
