How do I use output of indexes to subset my dataframe? - r

I want to match vector 1 to vector 2 to see whether items in vector 1 are found in vector 2. Then I want to create two new vectors: a subset of vector 1 with the values contained in both vectors, and a subset of vector 1 with the values not found in both vectors. The match() function followed by which(is.na()) works great for small data sets, but I have a data set with 1000 elements.
Data1 <- c(1, 2, 3, 4, 5)
Data2 <- c(1, 3, 5, 6, 7)
#Match vector1 to vector2
A <- match(Data1, Data2)
[1] 1 NA 2 NA 3
#to obtain positions of non matching elements
x <- which(is.na(A), arr.ind = TRUE)
[1] 2 4
Data1[c(2,4)]
#to obtain positions of matching elements
y <- which(A >= 1)
[1] 1 3 5
Data1[c(1,3,5)]

Try this so you do not have to deal with the NAs from match():
Data1 <- c(1, 2, 3, 4, 5)
Data2 <- c(1, 3, 5, 6, 7)
# Values of Data1 in Data2
A <- Data1[Data1 %in% Data2]
A
# output:
# > A
# [1] 1 3 5
# create not in function
'%ni%' <- Negate('%in%')
# Values of Data1 not in Data2
B <- Data1[Data1 %ni% Data2]
B
# output:
# > B
# [1] 2 4
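Since the title asks about subsetting a data frame, the same logical index can select rows directly, so no positions need to be typed by hand. A minimal sketch, assuming a hypothetical data frame `dat` whose `id` column plays the role of Data1 (`dat`, `id` and `value` are illustrative names, not from the question):

```r
Data1 <- c(1, 2, 3, 4, 5)
Data2 <- c(1, 3, 5, 6, 7)

# Hypothetical data frame; `dat`, `id` and `value` are illustrative names
dat <- data.frame(id = Data1, value = letters[1:5])

# Logical index: TRUE where id also occurs in Data2
in_both <- dat$id %in% Data2

matched   <- dat[in_both, ]    # rows whose id is in Data2
unmatched <- dat[!in_both, ]   # rows whose id is not in Data2

# The match() output from the question can be reused the same way
A <- match(Data1, Data2)
Data1[!is.na(A)]   # values found in Data2
Data1[is.na(A)]    # values not found in Data2
```

Indexing with the logical vector (or with the stored result of which()) scales to any length, because the positions never have to be written out.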

Related

Writing a summation formula using variables from multiple observations

I am trying to create a new variable for each observation using the following formula:
Index = ∑(BAj / DISTANCEij)
where:
j = focal observation; i= other observation
Basically, I'm taking the focal individual (i) and finding the Euclidean distance between it and another point, then dividing the other point's BA by that distance. I do that for all the other points, sum the results, and repeat all of this for each point.
Here is some sample data:
ID <- 1:4
BA <- c(3, 5, 6, 9)
x <- c(0, 2, 3, 7)
y <- c(1, 3, 4, 9)
df <- data.frame(ID, BA, x, y)
print(df)
ID BA x y
1 1 3 0 1
2 2 5 2 3
3 3 6 3 4
4 4 9 7 9
Currently, I've extracted vectors and written a function that calculates part of the formula, shown here:
vec1 <- df[1, ]
vec2 <- df[2, ]
dist <- function(vec1, vec2) vec1$BA/sqrt((vec2$x - vec1$x)^2 +
  (vec2$y - vec1$y)^2)
My question is how do I repeat this with the x and y values for vec2 changing for each new other point with vec1 remaining the same and then sum them all together?
We may loop over the row sequence, extract the data and apply the dist function
library(dplyr)
library(purrr)
df %>%
  mutate(dist_out = map_dbl(row_number(), ~ {
    othr <- cur_data()[-.x, ]
    cur <- cur_data()[.x, ]
    sum(dist(cur, othr))
  }))
-output
ID BA x y dist_out
1 1 3 0 1 2.049983
2 2 5 2 3 5.943485
3 3 6 3 4 6.593897
4 4 9 7 9 3.404545
Here are two base R ways.
1. for loop
ID <- 1:4
BA <- c(3, 5, 6, 9)
x <- c(0, 2, 3, 7)
y <- c(1, 3, 4, 9)
df <- data.frame(ID, BA, x, y)
n <- nrow(df)
d <- dist(df[c("x", "y")], upper = TRUE)
d <- as.matrix(d)
Index <- numeric(n)
for (j in seq_len(n)) {
  d_j <- d[-j, j, drop = TRUE]
  Index[j] <- sum(df$BA[j]/d_j)
}
Index
#> [1] 2.049983 5.943485 6.593897 3.404545
Created on 2022-08-18 by the reprex package (v2.0.1)
2. sapply loop
Index <- sapply(seq_len(n), \(j) sum(df$BA[j]/d[-j, j, drop = TRUE]))
Index
#> [1] 2.049983 5.943485 6.593897 3.404545
Created on 2022-08-18 by the reprex package (v2.0.1)
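The loops can also be replaced by matrix arithmetic. The sketch below is one possible vectorised version, not taken from the answers: it sets the diagonal of the distance matrix to Inf so that self-pairs contribute zero. The names `ratio`, `Index_focal` and `Index_other` are illustrative. It computes both readings of the formula, because the answers divide the focal point's BA by each distance while the prose of the question describes dividing the other points' BA.

```r
df <- data.frame(ID = 1:4, BA = c(3, 5, 6, 9),
                 x = c(0, 2, 3, 7), y = c(1, 3, 4, 9))

# Pairwise Euclidean distances; stats::dist is named explicitly in case
# a user-defined dist() is masking it
d <- as.matrix(stats::dist(df[c("x", "y")]))
diag(d) <- Inf            # BA / Inf is 0, so self-pairs drop out of the sums

ratio <- df$BA / d        # ratio[i, k] = BA[i] / d[i, k]

# Focal point's BA over each distance (what the loops above compute)
Index_focal <- unname(rowSums(ratio))

# Other points' BA over each distance (what the question's prose describes)
Index_other <- unname(colSums(ratio))
```

Under these assumptions `Index_focal` should agree with the `Index` vector printed above; which of the two variants is wanted depends on how the formula is meant to be read.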

How to find orders of elements in a vector in which duplicate elements have the same order?

I have a vector x = [5, 5, 3, 2, 2]. The rank of an element is its position in the descending list of unique values. I would like to return a vector containing the rank of each element, i.e. [1, 1, 2, 3, 3]. Unfortunately, the function order does not do the job.
x <- c(5, 5, 3, 2, 2)
order(x)
and the result is
[1] 4 5 3 1 2
Could you please elaborate on how to do so?
1) factor Convert to a factor having the indicated levels and then convert to numeric to get the level numbers:
as.numeric(factor(x, levels = unique(x)))
## [1] 1 1 2 3 3
2) match Another possibility is to use match:
match(x, unique(x))
## [1] 1 1 2 3 3
3) findInterval findInterval requires non-descending numbers in the second argument so we negate x.
findInterval(-x, unique(-x))
## [1] 1 1 2 3 3
4) diff/cumsum
cumsum(c(TRUE, diff(x) != 0))
## [1] 1 1 2 3 3
5) rle
r <- rle(x)
r$values <- seq_along(r$values)
inverse.rle(r)
## [1] 1 1 2 3 3
Note
The input in R syntax is:
x <- c(5, 5, 3, 2, 2)
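The approaches above give the expected result because x is already sorted in descending order. If the input may arrive in any order, one option is to sort the unique values explicitly before matching. A minimal sketch, where `dense_rank_desc` is a hypothetical helper name, not a base R function:

```r
# Position of each element in the descending list of unique values;
# does not assume that x is sorted
dense_rank_desc <- function(x) {
  match(x, sort(unique(x), decreasing = TRUE))
}

dense_rank_desc(c(5, 5, 3, 2, 2))
## [1] 1 1 2 3 3
dense_rank_desc(c(3, 5, 2, 5, 2))
## [1] 2 1 3 1 3
```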

Back tracing parents/paths of two-column data of a tree

I have tree data serialized as follows:
Relationship: P to C is "one-to-many", and C to P is "one-to-one". So column P may have duplicate values, but column C has unique values.
P, C
1, 2
1, 3
3, 4
2, 5
4, 6
# in data.frame
df <- data.frame(P=c(1,1,3,2,4), C=c(2,3,4,5,6))
1. How do I efficiently implement a function func so that:
func(df, val) returns the full path from the root (1 in this case) to val as a vector.
For example:
func(df, 3) returns c(1,3)
func(df, 5) returns c(1,2,5)
func(df, 6) returns c(1,3,4,6)
2. Alternatively, quickly transforming df to a lookup table like this also works for me:
C, Paths
2, c(1,2)
3, c(1,3)
4, c(1,3,4)
5, c(1,2,5)
6, c(1,3,4,6)
Here is a solution using igraph
library(igraph)
g <- graph_from_data_frame(df)
df <- within(df,
  Path <- sapply(match(as.character(C), names(V(g))),
    function(k) toString(names(unlist(all_simple_paths(g, 1, k))))))
such that
> df
P C Path
1 1 2 1, 2
2 1 3 1, 3
3 3 4 1, 3, 4
4 2 5 1, 2, 5
5 4 6 1, 3, 4, 6
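For the first form of the question, a plain base R version of func is also possible without igraph. The sketch below walks from val up to the root by repeated parent lookup. It assumes every child has exactly one parent and that the data contains no cycles. The `paths` object is an illustrative name for the lookup table.

```r
df <- data.frame(P = c(1, 1, 3, 2, 4), C = c(2, 3, 4, 5, 6))

# Walk from val up to the root by repeatedly looking up the parent.
# Assumes one parent per child and no cycles in the data.
func <- function(df, val) {
  path <- val
  repeat {
    parent <- df$P[match(path[1], df$C)]  # NA once path[1] has no parent
    if (is.na(parent)) break
    path <- c(parent, path)
  }
  path
}

func(df, 6)
## [1] 1 3 4 6

# Lookup table: one path per child, named by the child value
paths <- setNames(lapply(df$C, function(v) func(df, v)), df$C)
paths[["5"]]
## [1] 1 2 5
```

Each call does one match() per level of the tree, so the cost grows with the depth of val. A value that does not occur in column C is returned unchanged, which may or may not be the behavior you want.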

Randomly sample contiguous rows from a data frame or matrix

I want to sample a number of contiguous rows from a data frame df.
df <- data.frame(C1 = c(1, 2, 4, 7, 9), C2 = c(2, 4, 6, 8, 10))
I am trying to get something similar to the following which allows me to sample 3 random rows and repeat the process 100 times.
test <- replicate(100, df[sample(1:nrow(df), 3, replace=T),], simplify=F)
By contiguous, I mean the result should look something like:
[[1]]
C1 C2
2 2 4
3 4 6
4 7 8
[[2]]
C1 C2
1 1 2
2 2 4
3 4 6
.
.
.
How could I achieve this?
We just need to sample the starting row index for a chunk.
sample.block <- function (DF, chunk.size) {
  if (chunk.size > nrow(DF)) return(NULL)
  start <- sample.int(nrow(DF) - chunk.size + 1, 1)
  DF[start:(start + chunk.size - 1), ]
}
replicate(100, sample.block(df, 3), simplify = FALSE)
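To confirm that the sampled blocks are contiguous, and to cover the matrix case from the title, here is a small sketch. `sample_block` is a hypothetical variant of the function above that adds drop = FALSE, and `is_contiguous` is an illustrative helper. Neither comes from the original answer.

```r
# Variant of sample.block with drop = FALSE, so a one-column matrix
# is not collapsed to a plain vector
sample_block <- function(DF, chunk.size) {
  if (chunk.size > nrow(DF)) return(NULL)
  start <- sample.int(nrow(DF) - chunk.size + 1, 1)
  DF[start:(start + chunk.size - 1), , drop = FALSE]
}

df <- data.frame(C1 = c(1, 2, 4, 7, 9), C2 = c(2, 4, 6, 8, 10))
blocks <- replicate(100, sample_block(df, 3), simplify = FALSE)

# Every block should hold rows whose original row numbers are consecutive
is_contiguous <- function(b) all(diff(as.integer(rownames(b))) == 1)
all_ok <- all(vapply(blocks, is_contiguous, logical(1)))

# The same function on a one-column matrix still returns a matrix
m <- matrix(1:5, ncol = 1)
block_dim <- dim(sample_block(m, 3))
```

The row-name check relies on the data frame having its default row names 1, 2, 3 and so on.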

Arrange list of vectors

I have a list of vectors and another vector. I would like to arrange the list of vectors according to the values of the other vector.
a <- c(1, 2)
b <- c(1, 4)
c <- c(1, 1)
x <- list(a, b, c) # list of vector
v <- c(3, 2, 5) # other vector
Here I want to arrange x according to v. So the desired output will be:
2 b
3 a
5 c
Here is an option with stack and arrange
library(dplyr)
v %>%
  setNames(letters[1:3]) %>%  # base R; set_names() would need purrr, rlang or magrittr attached
  stack() %>%
  arrange(values)
# values ind
#1 2 b
#2 3 a
#3 5 c
Alternatively, in base R, bind v together with the names of the list x as a second column, then order the rows by v.
It will be something like:
cbind(as.data.frame(v), col = names(x))[order(v),]
# v col
#2 2 b
#1 3 a
#3 5 c
Data:
a <- c(1, 2)
b <- c(1, 4)
c <- c(1, 1)
x <- list(a=a, b=b, c=c) # list of vector
v <- c(3, 2, 5) # other vector
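If the goal is to reorder the list itself rather than build a table of names, indexing the list with order(v) may be enough. A minimal sketch using the named list from the Data block (`ord` and `x_sorted` are illustrative names):

```r
a <- c(1, 2)
b <- c(1, 4)
c <- c(1, 1)
x <- list(a = a, b = b, c = c)  # named list of vectors
v <- c(3, 2, 5)                 # ordering vector

ord <- order(v)        # positions that sort v ascending
x_sorted <- x[ord]     # list elements rearranged to match
names(x_sorted)
## [1] "b" "a" "c"
v[ord]
## [1] 2 3 5
```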
