I have a data frame like this:
individual <- c("1",NA,NA,NA,NA,NA,NA,NA,"1","1")
x <- c(665,NA,NA,NA,NA,NA,NA,NA,663,665)
y <- c(-474.5,NA,NA,NA,NA,NA,NA,NA,-474.5,-472.5)
frame <- rep(1:10)
df <- data.frame(individual,x,y,frame)
I have an ID column labeled 'individual', xy coordinates, and a frame number.
I need to calculate the Euclidean distances between the x,y coordinates of rows, skipping over the NA values.
So, in the example I gave - I would need to calculate the distances between rows 1 and 9, as well as 10 and 9. In the real data there would be substantially more rows of course.
Eventually what I need to do is interpolate the data, so that if the euclidean distance is <5, fill in the data rows that are missing with the ID of the individual. If the euclidean distance is >5, then ignore and interpolate nothing.
Here is the example result data frame that's needed:
individual <- c("1","1","1","1","1","1","1","1","1","1")
x <- c(665,NA,NA,NA,NA,NA,NA,NA,663,665)
y <- c(-474.5,NA,NA,NA,NA,NA,NA,NA,-474.5,-472.5)
frame <- rep(1:10)
dist_measure <- c(NA,NA,NA,NA,NA,NA,NA,NA,2,2.828427)
df <- data.frame(individual,x,y,frame,dist_measure)
Any advice on an approach to this problem is greatly appreciated. My first thought was to have a function that calculates Euclidean distance and put it in a for loop. But I'm a bit stuck on how to work this over the NA values. I thought somehow using the lag function in the tidyverse would help, but not sure again how to integrate that into the loop/function.
Thank you in advance.
This should work. I've added another individual into the hypothetical data to show how it works.
individual <- c("1",NA,NA,NA,NA,NA,NA,NA,"1","1",
"2",NA,NA,NA,NA,NA,NA,NA,"2","2")
x <- c(665,NA,NA,NA,NA,NA,NA,NA,663,665,
.665,NA,NA,NA,NA,NA,NA,NA,.663,.665)
y <- c(-474.5,NA,NA,NA,NA,NA,NA,NA,-474.5,-472.5,
-.4745,NA,NA,NA,NA,NA,NA,NA,-.4745,-.4725)
frame <- rep(1:10, 2)
df <- data.frame(individual,x,y,frame)
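# Process each individual in turn (IDs "1" and "2" in this example)
for (i in 1:2) {
  # Rows spanning the first to the last occurrence of this individual
  rows <- which(df$individual == as.character(i))
  tmp <- df[min(rows):max(rows), ]
  # Rows just before and just after the run of NAs
  ends <- range(which(is.na(tmp$individual))) + c(-1, 1)
  if (nrow(tmp) > 1 && ends[1] > 0 && ends[2] <= nrow(tmp)) {
    # Euclidean distance between the two points that bracket the gap
    d <- c(dist(tmp[ends, c("x", "y")]))
    if (d < 5) {
      # Gap is short enough: fill in the missing IDs
      df$individual[min(rows):max(rows)] <- tmp$individual[ends[1]]
    }
  }
}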
df
# individual x y frame
# 1 1 665.000 -474.5000 1
# 2 1 NA NA 2
# 3 1 NA NA 3
# 4 1 NA NA 4
# 5 1 NA NA 5
# 6 1 NA NA 6
# 7 1 NA NA 7
# 8 1 NA NA 8
# 9 1 663.000 -474.5000 9
# 10 1 665.000 -472.5000 10
# 11 2 0.665 -0.4745 1
# 12 2 NA NA 2
# 13 2 NA NA 3
# 14 2 NA NA 4
# 15 2 NA NA 5
# 16 2 NA NA 6
# 17 2 NA NA 7
# 18 2 NA NA 8
# 19 2 0.663 -0.4745 9
# 20 2 0.665 -0.4725 10
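The loop above fills in the IDs but does not produce the dist_measure column shown in the question. A minimal sketch of one way to get it, assuming a single individual and that a row counts as observed only when both x and y are present (the name obs is just for illustration):

```r
individual <- c("1", NA, NA, NA, NA, NA, NA, NA, "1", "1")
x <- c(665, NA, NA, NA, NA, NA, NA, NA, 663, 665)
y <- c(-474.5, NA, NA, NA, NA, NA, NA, NA, -474.5, -472.5)
df <- data.frame(individual, x, y, frame = 1:10)

# Indices of rows that actually have coordinates
obs <- which(!is.na(df$x) & !is.na(df$y))

# Distance from each observed row to the previous observed row,
# so NA rows in between are skipped over
df$dist_measure <- NA_real_
df$dist_measure[obs[-1]] <- sqrt(diff(df$x[obs])^2 + diff(df$y[obs])^2)
df
```

With several individuals, the same steps would presumably need to be applied per individual (for example by splitting the data frame on a filled-in ID first).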
I have a large dataset (8,000 obs) and about 16 lists with anywhere from 120 to 2,000 items. Essentially, I want to check to see if any of the observations in the dataset match an item in a list. If there is a match, I want to include a variable indicating the match.
As an example, if I have data that look like this:
dat <- as.data.frame(1:10)
list1 <- c(2:4)
list2 <- c(7,8)
I want to end with a dataset that looks something like this
Obs Var List
1 1
2 2 1
3 3 1
4 4 1
5 5
6 6
7 7 2
8 8 2
9 9
10 10
How do I go about doing this? Thank you!
Here is one way to do it using %in%, multiplying each logical match vector by its list index and taking the row-wise maximum. If an observation matches several lists, the highest-numbered list is taken:
dat <- data.frame(Obs = 1:10)
list_all <- list(c(2:4), c(7,8))
present <- sapply(seq_along(list_all), function(n) (dat$Obs %in% list_all[[n]]) * n)
dat$List <- apply(present, 1, FUN = max)
dat$List[dat$List == 0] <- NA
dat
> dat
Obs List
1 1 NA
2 2 1
3 3 1
4 4 1
5 5 NA
6 6 NA
7 7 2
8 8 2
9 9 NA
10 10 NA
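An alternative sketch, assuming the same dat and list_all as above: flatten the lists into a lookup table and use match. The name lookup is only illustrative. Note that match returns the first hit, so here the lowest-numbered list wins when an observation appears in several lists, which is the opposite tie-break from the max approach.

```r
dat <- data.frame(Obs = 1:10)
list_all <- list(2:4, c(7, 8))

# One row per list item, tagged with the index of the list it came from
lookup <- data.frame(
  item = unlist(list_all),
  List = rep(seq_along(list_all), lengths(list_all))
)

# match() gives the position of each Obs in lookup$item (NA if absent)
dat$List <- lookup$List[match(dat$Obs, lookup$item)]
dat
```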
I want to process a data frame as follows: for each row, take the sum of two vectors and append it to a new data frame as a row. The two vectors are the row vector of b values in the row under consideration, and a column vector of fixed length taken from column A, starting just below that row.
data
A b1 b2 b3
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
output (expected)
A b1 b2 b3
1 4 5 6
2 6 7 8
3 8 9 -
4 10 - -
5 - - -
In the example, if the 1st row is considered, the two vectors are
row vector r - [2 2 2]
column vector c - [2,3,4]
After transposing the column vector I can add the two vectors and append the result to a new data frame. This process must be done for all the rows.
The easiest way to do this is with a loop, but loops are not efficient in R; an apply function could be used instead. However, in this scenario that requires knowing the current row number.
Is there a way to do this efficiently in R?
1) rollapply We can use rollapply to form the matrix of subvectors of A and then add that together with an initial column of zero to m. Note that we pad A with NA values so that the result of rollapply is the appropriate shape.
library(zoo)
m <- cbind(A = 1:5, b1 = 2:6, b2 = 2:6, b3 = 2:6) # input matrix
nc1 <- ncol(m) - 1
A <- c(m[, 1], rep(NA, nc1))
cbind(0, rollapply(A[-1], nc1, c)) + m
giving:
A b1 b2 b3
[1,] 1 4 5 6
[2,] 2 6 7 8
[3,] 3 8 9 NA
[4,] 4 10 NA NA
[5,] 5 NA NA NA
2) base This solution is similar but does not use any packages. It reuses the input matrix m from (1), and the first two lines below are the same as in (1).
nc1 <- ncol(m) - 1
A <- c(m[, 1], rep(NA, nc1))
cbind(0, embed(A[-1], nc1)[, seq(nc1, 1)]) + m
giving:
A b1 b2 b3
[1,] 1 4 5 6
[2,] 2 6 7 8
[3,] 3 8 9 NA
[4,] 4 10 NA NA
[5,] 5 NA NA NA
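A third possibility, offered only as a sketch: build the row-plus-offset indices with outer and index column A directly. It relies on the fact that indexing a vector past its end returns NA, which produces the padding automatically. The names a, idx and res are illustrative.

```r
m <- cbind(A = 1:5, b1 = 2:6, b2 = 2:6, b3 = 2:6) # input matrix
a <- m[, "A"]

# idx[i, j] = i + j: the position in A that belongs to row i, column b_j
idx <- outer(seq_len(nrow(m)), seq_len(ncol(m) - 1), "+")

# a[idx] is NA wherever idx runs past the end of a
res <- m
res[, -1] <- m[, -1] + a[idx]
res
```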
I want to get a vector of length m that, for each row of an m x n matrix, gives the value in the column identified by another column (say column "Z").
I made it using a for loop:
for (i in 1:dim(data.frame)[1]){vector[i] <- data.frame[i,data.frame$Z[i]]}
Do you see a simpler way to code it avoiding the loop?
"apply" is a possibility:
> M <- cbind( matrix(1:15,3,5), "Z"=c(3,1,2) )
> M
Z
[1,] 1 4 7 10 13 3
[2,] 2 5 8 11 14 1
[3,] 3 6 9 12 15 2
> v <- apply(M,1,function(x){x[x["Z"]]})
> v
[1] 7 2 6
>
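Another option, shown here as a small sketch on the same M, is to index with a two-column matrix of (row, column) pairs, which avoids calling a function once per row:

```r
M <- cbind(matrix(1:15, 3, 5), Z = c(3, 1, 2))

# Each row of the index matrix is one (row, column) pair to extract
v <- M[cbind(seq_len(nrow(M)), M[, "Z"])]
v
```

For a data frame, the same idea should work after converting the relevant columns with as.matrix, provided they share a type.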
I have a number of large datasets with ~10 columns and ~200000 rows. Not all columns contain values for each row, although at least one column must contain a value for the row to be present. I would like to set a threshold for how many NAs are allowed in a row.
My Dataframe looks something like this:
ID q r s t u v w x y z
A 1 5 NA 3 8 9 NA 8 6 4
B 5 NA 4 6 1 9 7 4 9 3
C NA 9 4 NA 4 8 4 NA 5 NA
D 2 2 6 8 4 NA 3 7 1 32
And I would like to be able to delete the rows that contain more than 2 cells containing NA to get
ID q r s t u v w x y z
A 1 5 NA 3 8 9 NA 8 6 4
B 5 NA 4 6 1 9 7 4 9 3
D 2 2 6 8 4 NA 3 7 1 32
complete.cases removes all rows containing any NA, and I know one can delete rows that contain NA in certain columns but is there a way to modify it so that it is non-specific about which columns contain NA, but how many of the total do?
Alternatively, this dataframe is generated by merging several dataframes using
file1<-read.delim("~/file1.txt")
file2<-read.delim(file=args[1])
file1<-merge(file1,file2,by="chr.pos",all=TRUE)
Perhaps the merge function could be altered?
Thanks
Use rowSums. To remove rows from a data frame (df) that contain precisely n NA values:
df <- df[rowSums(is.na(df)) != n, ]
or to remove rows that contain n or more NA values:
df <- df[rowSums(is.na(df)) < n, ]
In both cases, replace n with the number that's required. To drop rows with more than 2 NAs, as in the question, use n = 3 in the second form.
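As a check against the question's example, here is a small sketch that rebuilds the four sample rows and keeps those with at most 2 NAs. The names max_na, na_count and kept are illustrative, and the ID column is included in the count (it has no NAs here).

```r
df <- data.frame(
  ID = c("A", "B", "C", "D"),
  q = c(1, 5, NA, 2),  r = c(5, NA, 9, 2), s = c(NA, 4, 4, 6),
  t = c(3, 6, NA, 8),  u = c(8, 1, 4, 4),  v = c(9, 9, 8, NA),
  w = c(NA, 7, 4, 3),  x = c(8, 4, NA, 7), y = c(6, 9, 5, 1),
  z = c(4, 3, NA, 32)
)

max_na <- 2                      # threshold: NAs allowed per row
na_count <- rowSums(is.na(df))   # number of NAs in each row
kept <- df[na_count <= max_na, ] # drops row C, which has 4 NAs
kept
```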
If dat is the name of your data.frame the following will return what you're looking for:
keep <- rowSums(is.na(dat)) <= 2
dat <- dat[keep, ]
What this is doing:
is.na(dat)
# returns a matrix of T/F
# note that when adding logicals
# T == 1, and F == 0
rowSums(.)
# quickly computes the total per row
# since your task is to identify the
# rows with a certain number of NA's
rowSums(.) <= 2
# for each row, determine if the sum
# (which is the number of NAs) is at
# most 2 or not. Returns T/F accordingly
We use the output of this last statement to
identify which rows to keep. Note that it is not necessary to actually store this last logical.
If d is your data frame, try this:
d <- d[rowSums(is.na(d)) <= 2,]
This will return a dataset where at most two values per row are missing:
dfrm[ apply(dfrm, 1, function(r) sum(is.na(r)) <= 2 ) , ]