How do I calculate Euclidean distances across NA values in R

I have a data frame like this:
individual <- c("1",NA,NA,NA,NA,NA,NA,NA,"1","1")
x <- c(665,NA,NA,NA,NA,NA,NA,NA,663,665)
y <- c(-474.5,NA,NA,NA,NA,NA,NA,NA,-474.5,-472.5)
frame <- rep(1:10)
df <- data.frame(individual,x,y,frame)
I have an ID column labeled 'individual', xy coordinates, and a frame number.
I need to calculate the euclidean distances for the x,y coordinates between rows but over the NA values.
So, in the example I gave - I would need to calculate the distances between rows 1 and 9, as well as 10 and 9. In the real data there would be substantially more rows of course.
Eventually what I need to do is interpolate the data, so that if the euclidean distance is <5, fill in the data rows that are missing with the ID of the individual. If the euclidean distance is >5, then ignore and interpolate nothing.
Here is the example result data frame that's needed:
individual <- c("1","1","1","1","1","1","1","1","1","1")
x <- c(665,NA,NA,NA,NA,NA,NA,NA,663,665)
y <- c(-474.5,NA,NA,NA,NA,NA,NA,NA,-474.5,-472.5)
frame <- rep(1:10)
dist_measure <- c(NA,NA,NA,NA,NA,NA,NA,NA,2,2.828427)
df <- data.frame(individual,x,y,frame,dist_measure)
Any advice on an approach to this problem is greatly appreciated. My first thought was to have a function that calculates Euclidean distance and put it in a for loop. But I'm a bit stuck on how to work this over the NA values. I thought somehow using the lag function in the tidyverse would help, but not sure again how to integrate that into the loop/function.
Thank you in advance.

This should work. I've added another individual into the hypothetical data to show how it works.
individual <- c("1",NA,NA,NA,NA,NA,NA,NA,"1","1",
"2",NA,NA,NA,NA,NA,NA,NA,"2","2")
x <- c(665,NA,NA,NA,NA,NA,NA,NA,663,665,
.665,NA,NA,NA,NA,NA,NA,NA,.663,.665)
y <- c(-474.5,NA,NA,NA,NA,NA,NA,NA,-474.5,-472.5,
-.4745,NA,NA,NA,NA,NA,NA,NA,-.4745,-.4725)
frame <- rep(1:10, 2)
df <- data.frame(individual,x,y,frame)
for(i in 1:2){
  # rows spanning the first to the last occurrence of individual i
  idx <- which(df$individual == as.character(i))
  tmp <- df[min(idx):max(idx), ]
  # rows just before and just after the run of NAs
  ends <- range(which(is.na(tmp$individual))) + c(-1,1)
  # && because if() needs a single TRUE/FALSE
  if(nrow(tmp) > 1 && ends[1] > 0 && ends[2] <= nrow(tmp)){
    d <- c(dist(tmp[ends, c("x", "y")]))
    if(d < 5){
      df$individual[min(idx):max(idx)] <- tmp$individual[ends[1]]
    }
  }
}
df
# individual x y frame
# 1 1 665.000 -474.5000 1
# 2 1 NA NA 2
# 3 1 NA NA 3
# 4 1 NA NA 4
# 5 1 NA NA 5
# 6 1 NA NA 6
# 7 1 NA NA 7
# 8 1 NA NA 8
# 9 1 663.000 -474.5000 9
# 10 1 665.000 -472.5000 10
# 11 2 0.665 -0.4745 1
# 12 2 NA NA 2
# 13 2 NA NA 3
# 14 2 NA NA 4
# 15 2 NA NA 5
# 16 2 NA NA 6
# 17 2 NA NA 7
# 18 2 NA NA 8
# 19 2 0.663 -0.4745 9
# 20 2 0.665 -0.4725 10
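
The loop above fills in the IDs but does not produce the dist_measure column from the question. A minimal sketch of one way to get both is shown here. It assumes the rows are already in frame order, that a gap should only be filled when both of its end points carry the same ID, and that the threshold is 5 as in the question. The names obs and step are illustrative and not from the original post.

```r
individual <- c("1", NA, NA, NA, NA, NA, NA, NA, "1", "1")
x <- c(665, NA, NA, NA, NA, NA, NA, NA, 663, 665)
y <- c(-474.5, NA, NA, NA, NA, NA, NA, NA, -474.5, -472.5)
df <- data.frame(individual, x, y, frame = 1:10, stringsAsFactors = FALSE)

# Rows that actually have coordinates
obs <- which(!is.na(df$x) & !is.na(df$y))

# Euclidean distance from each observed row to the previous observed row,
# which skips over any NA rows in between
step <- sqrt(diff(df$x[obs])^2 + diff(df$y[obs])^2)

df$dist_measure <- NA_real_
df$dist_measure[obs[-1]] <- step

# Fill the ID across a gap only when the jump over it is shorter than 5
for (k in seq_along(step)) {
  from <- obs[k]
  to <- obs[k + 1]
  same_id <- identical(df$individual[from], df$individual[to])
  if (to - from > 1 && same_id && step[k] < 5) {
    df$individual[(from + 1):(to - 1)] <- df$individual[from]
  }
}
df
```

For several individuals, the same steps could be applied to each individual's block of rows separately.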

Related

adding two variables which has NA present

Let's say the data is 'ab':
a <- c(1,2,3,NA,5,NA)
b <- c(5,NA,4,NA,NA,6)
ab <-c(a,b)
I would like to have a new variable that is the sum of the two, keeping NAs as follows:
desired output:
ab$c <- c(6,2,7,NA,5,6)
So number + NA should equal the number, while NA + NA should stay NA.
I tried following but does not work as desired:
ab$c <- a+b
gives me : 6 NA 7 NA NA NA
Also don't know how to include "na.rm=TRUE", something I was trying.
I would also like to create a third, categorical variable based on a cutoff: if a value is <= 4 then event 1, otherwise 0:
desired output:
ab$d <- c(1,1,1,NA,0,0)
I tried:
ab$d =ifelse(ab$a<=4|ab$b<=4,1,0)
print(ab$d)
gives me logical(0)
Thanks!
a <- c(1,2,3,NA,5,NA)
b <- c(5,NA,4,NA,NA,6)
dfd <- data.frame(a,b)
dfd$c <- rowSums(dfd, na.rm = TRUE)
dfd$c <- ifelse(is.na(dfd$a) & is.na(dfd$b), NA_integer_, dfd$c)
dfd$d <- ifelse(dfd$c >= 4, 1, 0)
dfd
a b c d
1 1 5 6 1
2 2 NA 2 0
3 3 4 7 1
4 NA NA NA NA
5 5 NA 5 1
6 NA 6 6 1
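
Note that the d column above thresholds the sum with >= 4, so it does not reproduce the (1,1,1,NA,0,0) the question asks for. A minimal sketch that does, assuming "event" means that at least one of a or b is <= 4 with missing values ignored (this reading is inferred from the desired output, not stated by the answerer):

```r
a <- c(1, 2, 3, NA, 5, NA)
b <- c(5, NA, 4, NA, NA, 6)
dfd <- data.frame(a, b)

# Sum that treats a single NA as 0 but keeps NA when both values are missing
both_na <- is.na(dfd$a) & is.na(dfd$b)
dfd$c <- ifelse(both_na, NA_real_, rowSums(dfd[c("a", "b")], na.rm = TRUE))

# pmin(..., na.rm = TRUE) takes the smaller available value per row and
# returns NA only when both are NA, so "any value <= 4" becomes one test
dfd$d <- ifelse(pmin(dfd$a, dfd$b, na.rm = TRUE) <= 4, 1, 0)
dfd
```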

How to find whether at least one column satisfies a certain condition, with NAs

I have a dataframe with multiple columns: I need to identify those rows in which there is at least one outlier among some of the columns, but I do not know how to deal with NAs.
An example of dataframe (different from mine):
# X atq ME.BE.crsp X2
# 1 10 0.5 4
# NA 2 1.3 5
# 3 NA 5 2
# NA NA NA NA
# 2 4 NA 3
I'm doing the following:
data = data %>%
mutate(outlier= as.numeric(atq > quantile(atq, 0.99,na.rm=T)|
atq < quantile(atq, 0.01,na.rm=T)|
ME.BE.crsp > quantile(ME.BE.crsp, 0.99,na.rm = T)|
ME.BE.crsp < quantile(ME.BE.crsp, 0.01,na.rm = T)
))
My expected result is (I'm making up the outliers, the point is about NAs):
# X atq ME.BE.crsp X2 outlier
# 1 10 0.5 4 1
# NA 2 1.3 5 0
# 3 NA 5 2 0
# NA NA NA NA NA
# 2 4 NA 3 1
What I get instead is:
# X atq ME.BE.crsp X2 outlier
# 1 10 0.5 4 1
# NA 2 1.3 5 0
# 3 NA 5 2 NA
# NA NA NA NA NA
# 2 4 NA 3 NA
So, it seems that as soon as the as.numeric finds an NA either in data$atq or in data$ME.BE.crsp, it just gives NA to data$outlier, while I would like it to consider the non NA value and assign 0 or 1 based on that one.
Any suggestions? Thanks!
If the result should be NA only when both 'atq' and 'ME.BE.crsp' are NA, use a condition with case_when:
library(dplyr)
data %>%
  mutate(outlier = case_when(
    is.na(atq) & is.na(ME.BE.crsp) ~ NA_real_,
    TRUE ~ as.numeric(
      (atq > quantile(atq, 0.99, na.rm = TRUE)) & !is.na(atq) |
      (atq < quantile(atq, 0.01, na.rm = TRUE)) & !is.na(atq) |
      (ME.BE.crsp > quantile(ME.BE.crsp, 0.99, na.rm = TRUE)) & !is.na(ME.BE.crsp) |
      (ME.BE.crsp < quantile(ME.BE.crsp, 0.01, na.rm = TRUE)) & !is.na(ME.BE.crsp)
    )
  ))
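
The same rule can also be written in base R with a small helper, which may be easier to extend to more columns. This is only a sketch: is_outlier is a hypothetical helper name, and the sample values are taken from the example table in the question.

```r
data <- data.frame(
  atq        = c(10, 2, NA, NA, 4),
  ME.BE.crsp = c(0.5, 1.3, 5, NA, NA)
)

# TRUE/FALSE per value, NA where the value itself is missing
is_outlier <- function(v, lower = 0.01, upper = 0.99) {
  q <- quantile(v, c(lower, upper), na.rm = TRUE)
  v < q[[1]] | v > q[[2]]
}

flags <- sapply(data[c("atq", "ME.BE.crsp")], is_outlier)

# A row is flagged if any non-missing column is an outlier;
# it is NA only when every checked column is missing
any_flag <- rowSums(flags, na.rm = TRUE) > 0
all_missing <- rowSums(!is.na(flags)) == 0
data$outlier <- ifelse(all_missing, NA_real_, as.numeric(any_flag))
data
```

With only three non-missing values per column, each column's minimum and maximum fall outside the 1% and 99% quantiles, so the flags here are 1, 1, 1, NA, 0. The point of the example is the NA handling, not the outlier values.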

Replace missing values with row means if exactly N missing values per row

I have a data matrix with different number of missing values per rows. What I want is to replace the missing values with row means if the number of missing values per row is N (let's say 1).
I have already created a solution for this problem but it's a very inelegant one so I'm looking for something else.
My solution:
#SAMPLE DATA
a <- c(rep(c(1:4, NA), 2))
b <- c(rep(c(1:3, NA, 5), 2))
c <- c(rep(c(1:3, NA, 5), 2))
df <- as.matrix(cbind(a,b,c), ncol = 3, nrow = 10)
#CALCULATING THE NUMBER OF MISSING VALUES PER ROW
miss_row <- rowSums(apply(as.matrix(df), c(1,2), function(x) {
sum(is.na(x)) +
sum(x == "", na.rm=TRUE)
}) )
df <- cbind(df, miss_row)
#CALCULATING THE ROW MEANS FOR ROWS WITH 1 MISSING VALUE
row_mean <- ifelse(df[,4] == 1, rowMeans(df[,1:3], na.rm = TRUE), NA)
df <- cbind(df, row_mean)
Here is the approach I mentioned in the comments, with more details:
# create your matrix
df <- cbind(a, b, c) # already a matrix, you don't need as.matrix there
# Get number of missing values per row (is.na is vectorised so you can apply it directly on the entire matrix)
nb_NA_row <- rowSums(is.na(df))
# Replace only the missing cells by the row mean when there are N NAs in the row
N <- 1 # the given example
idx <- which(is.na(df) & nb_NA_row == N, arr.ind = TRUE) # (row, col) of the cells to fill
df[idx] <- rowMeans(df, na.rm = TRUE)[idx[, "row"]]
# check df
df
# a b c
# [1,] 1 1 1
# [2,] 2 2 2
# [3,] 3 3 3
# [4,] 4 NA NA
# [5,] 5 5 5
# [6,] 1 1 1
# [7,] 2 2 2
# [8,] 3 3 3
# [9,] 4 NA NA
#[10,] 5 5 5
df <- data.frame(df)
df$miss_row <- rowSums(is.na(df))
df$row_mean <- NA
df$row_mean[df$miss_row == 1] <- rowMeans(df[df$miss_row == 1,1:3],na.rm = TRUE)
# a b c miss_row row_mean
# 1 1 1 1 0 NA
# 2 2 2 2 0 NA
# 3 3 3 3 0 NA
# 4 4 NA NA 2 NA
# 5 NA 5 5 1 5
# 6 1 1 1 0 NA
# 7 2 2 2 0 NA
# 8 3 3 3 0 NA
# 9 4 NA NA 2 NA
# 10 NA 5 5 1 5
(This gives your expected output, which seems not to be completely in line with your text, but for this see comments and duplicate link)

Replace values within a range in a data frame in R

I have ranked the rows of a data frame based on the values in each column, with ranks from 1 to 10 (not every column is shown in the picture).
I have code that replaces values with NA or 1, but I can't figure out how to replace a range of numbers, e.g. 3-6, with 1 and then replace the rest (1-2 and 7-10) with NA.
lag.rank <- as.matrix(lag.rank)
lag.rank[lag.rank > n] <- NA
lag.rank[lag.rank <= n] <- 1
At the moment it only replaces numbers above or under n. Any suggestions? I figure it should be fairly simple?
Is this what you are trying to accomplish?
> x <- sample(1:10,20, TRUE)
> x
[1] 1 2 8 2 6 4 9 1 4 8 6 1 2 5 8 6 9 4 7 6
> x <- ifelse(x %in% c(3:6), 1, NA)
> x
[1] NA NA NA NA 1 1 NA NA 1 NA 1 NA NA 1 NA 1 NA 1 NA 1
If your data aren't integers but numeric you can use between from the dplyr package:
x <- ifelse(between(x,3,6), 1, NA)
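
Applied to a matrix like lag.rank from the question, the same idea might look like the sketch below. The matrix values and the bounds lo and hi are made up for illustration. Assigning into lag.rank[] keeps the matrix shape and dimnames.

```r
# Hypothetical 3 x 3 matrix of ranks standing in for the real data
lag.rank <- matrix(c(1, 5, 10, 3, 7, 6, 2, 4, 9), nrow = 3)

lo <- 3
hi <- 6

# 1 inside [lo, hi], NA everywhere else; works for non-integer values too
lag.rank[] <- ifelse(lag.rank >= lo & lag.rank <= hi, 1, NA)
lag.rank
```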

R- Perform operations on column and place result in a different column, with the operation specified by the output column's name

I have a dataframe with 3 columns- L1, L2, L3- of data and empty columns labeled L1+L2, L2+L3, L3+L1, L1-L2, etc. combinations of column operations. Is there a way to check the column name and perform the necessary operation to fill that new column with data?
I am thinking:
-use match to find the appropriate original columns and using a for loop to iterate over all of the columns in this search?
so if the column I am attempting to fill is L1+L2 I would have something like:
apply(dataframe[, c(i, j)], 1, sum)
It seems strange that you would store your operations in your column names, but I suppose it is possible to achieve:
As always, sample data helps.
## Creating some sample data
mydf <- setNames(data.frame(matrix(1:9, ncol = 3)),
c("L1", "L2", "L3"))
## The operation you want to do...
morecols <- c(
combn(names(mydf), 2, FUN=function(x) paste(x, collapse = "+")),
combn(names(mydf), 2, FUN=function(x) paste(x, collapse = "-"))
)
## THE FINAL SAMPLE DATA
mydf[, morecols] <- NA
mydf
# L1 L2 L3 L1+L2 L1+L3 L2+L3 L1-L2 L1-L3 L2-L3
# 1 1 4 7 NA NA NA NA NA NA
# 2 2 5 8 NA NA NA NA NA NA
# 3 3 6 9 NA NA NA NA NA NA
One solution could be to use eval(parse(...)) within lapply to perform the calculations and store them to the relevant column.
mydf[morecols] <- lapply(names(mydf[morecols]), function(x) {
with(mydf, eval(parse(text = x)))
})
mydf
# L1 L2 L3 L1+L2 L1+L3 L2+L3 L1-L2 L1-L3 L2-L3
# 1 1 4 7 5 8 11 -3 -6 -3
# 2 2 5 8 7 10 13 -3 -6 -3
# 3 3 6 9 9 12 15 -3 -6 -3
dfrm <- data.frame( L1=1:3, L2=1:3, L3=3+1, `L1+L2`=NA,
`L2+L3`=NA, `L3+L1`=NA, `L1-L2`=NA,
check.names=FALSE)
dfrm
#------------
L1 L2 L3 L1+L2 L2+L3 L3+L1 L1-L2
1 1 1 4 NA NA NA NA
2 2 2 4 NA NA NA NA
3 3 3 4 NA NA NA NA
#-------------
dfrm[, 4:7] <- lapply(names(dfrm[, 4:7]),
function(nam) eval(parse(text=nam), envir=dfrm) )
dfrm
#-----------
L1 L2 L3 L1+L2 L2+L3 L3+L1 L1-L2
1 1 1 4 2 5 5 0
2 2 2 4 4 6 6 0
3 3 3 4 6 7 7 0
I chose to use eval(parse(text=...)) rather than with, since the use of with is specifically cautioned against in its help page. I'm not sure I can explain why the eval(..., target_dfrm) form should be any safer, though.
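
If evaluating column names as code feels too risky, one alternative is to split each name into its two operands and the operator, then look the operator up in a fixed table. This is a sketch under the assumption that every target name has the form <column><+ or -><column>. The names ops, fill_one and targets are hypothetical.

```r
mydf <- data.frame(L1 = 1:3, L2 = 4:6, L3 = 7:9)
targets <- c("L1+L2", "L2+L3", "L3+L1", "L1-L2")

# Only these operators are allowed; anything else is rejected
ops <- list("+" = `+`, "-" = `-`)

fill_one <- function(name, data) {
  # parts = full match, left operand, operator, right operand
  parts <- regmatches(name, regexec("^(\\w+)([+-])(\\w+)$", name))[[1]]
  if (length(parts) != 4) stop("unsupported column name: ", name)
  ops[[parts[3]]](data[[parts[2]]], data[[parts[4]]])
}

mydf[targets] <- lapply(targets, fill_one, data = mydf)
mydf
```

Because the operator comes from a lookup table instead of parse(), a column name can never run arbitrary code, at the cost of supporting only the operations listed in ops.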
