R Transition Matrix

I would like to convert a vector into a transition matrix.
I have a vector t, which I divided by its max value to get values between 0 and 1. I then made this into a matrix:
t <- c(22, 65, 37, 84, 36, 14, 9, 19, 5, 49)
x <- t/max(t)
y <- x%*%t(x)
My problem is that I want the columns of the matrix (y) to add up to 1, i.e. to make it into a transition matrix, but I'm not sure how to do that. Any suggestions appreciated!

sweep() is a versatile little function that you can use here to divide each column by its own sum:
yy <- sweep(y, MARGIN = 2, STATS = colSums(y), FUN = "/")
## Confirm that the columns of yy all sum to 1
colSums(yy)
## [1] 1 1 1 1 1 1 1 1 1 1
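An equivalent one-liner, if you prefer not to spell out the sweep: prop.table() with margin = 2 divides each column by its sum (a base R alternative I'm adding, not part of the original answer):
prop.table(y, margin = 2)
## same result as sweep(y, 2, colSums(y), "/")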

Related

Rounding numbers to the nearest values (with different intervals) in R

I want to round (or replace) the numbers in a vector a
a <- c(0.505, 1.555, 2.667, 53.850, 411.793)
to the nearest values in b:
b <- c(0, 5, 10, 50, 100, 200, 500)
The output should be this:
a_rnd <- c(0, 0, 5, 50, 500)
The logic is simple, but I couldn't find any solution, because all the approaches I found require the values in b to be equally spaced!
How can I achieve this?
You can use sapply to loop over all values of a, find the index of the nearest value in b with which.min, and use these indices to extract the proper b values:
b[sapply(a, function(x) which.min(abs(x - b)))]
#> [1] 0 0 5 50 500
This is a relatively simple approach:
b[apply(abs(outer(a, b, "-")), 1, which.min)]
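If b is sorted in increasing order (as it is here), a fully vectorized alternative is to cut at the midpoints between consecutive b values with findInterval. This is a sketch of my own, not from the answers above; ties at a midpoint go to the larger value:
mids <- (b[-1] + b[-length(b)]) / 2  # boundaries halfway between consecutive b values
b[findInterval(a, mids) + 1]
#> [1] 0 0 5 50 500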

unique pairs or combinations from a vector

Where am I going wrong with my function?
I am trying to create a function which will count all the unique pairs in a vector. Say I have the following input:
ar <- c(10, 20, 20, 30, 30, 30, 40, 50)
The number of unique pairs is 1 for 20 and 1 for 30, so I can just sum these up and the total number of unique pairs is 2.
However, everything I am trying counts 30 as having 2 unique pairs (since 30 occurs 3 times in the vector).
n <- 9
ar <- c(10, 20, 20, 30, 30, 30, 40, 50)
CountThePairs <- function(n, ar){
  for(i in 1:length(ar)){
    sum = ar[i] - ar[]
    pairs = length(which(sum == 0))
  }
  return(sum)
}
CountThePairs(n = NULL, ar)
Is there an easier way of doing this? I prefer a base R version but am interested in package versions also.
Here's a simpler way using floor and table from base R -
ar <- c(10, 20, 20, 30, 30, 30, 40, 50)
sum(floor(table(ar)/2))
[1] 2
Example 2 - Adding one more 30 to the vector, so now there are 2 pairs of 30 -
ar <- c(10, 20, 20, 30, 30, 30, 30, 40, 50)
sum(floor(table(ar)/2))
[1] 3
If two pairs of 30 count as one "unique" pair, then the original solution by @tmfmnk was correct -
sum(table(ar) >= 2)
You could use sapply on the unique values of the vector to return a logical vector indicating whether each value is repeated. The sum of that logical vector is then the number of unique pairs.
ar <- c(10, 20, 20, 30, 30, 30, 40, 50)
is_pair <- sapply(unique(ar), function(x) length(ar[ar == x]) > 1)
sum(is_pair)
#[1] 2
I'm not sure what behaviour you want if there are four 30's - does this count as one unique pair still or is it now two? If the latter, you would need a slightly different solution:
n_pair <- sapply(unique(ar), function(x) length(ar[ar == x]) %/% 2)
sum(n_pair)
#[1] 2
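To illustrate the difference, here is a quick check of the second version with four 30's (my own example data, not from the answer):
ar2 <- c(10, 20, 20, 30, 30, 30, 30, 40, 50)
sum(sapply(unique(ar2), function(x) length(ar2[ar2 == x]) %/% 2))
#[1] 3  # one pair of 20's plus two pairs of 30's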

intersection in R

I have two tables.
Both tables have only 1 column.
Both have random integer values between 1 and 1000.
I want to intersect these two tables. The catch is that I want to intersect the numbers even if they differ by up to 10.
1st table -> 5, 50, 160, 280
2nd table -> 14, 75, 162, 360
Output ->
1st table -> 5, 160
2nd table -> 14, 162
How can I achieve this in R?
You could do this with the sapply function, checking if each element of x or y is sufficiently close to some member of the other vector:
x <- c(5, 50, 160, 280)
y <- c(14, 75, 162, 360)
new.x <- x[sapply(x, function(z) min(abs(z-y)) <= 10)]
new.y <- y[sapply(y, function(z) min(abs(z-x)) <= 10)]
new.x
# [1] 5 160
new.y
# [1] 14 162
Here is an approach that uses the outer function (so your 2 tables will need to be reasonably sized):
x <- c(5, 50, 160, 280)
y <- c(999, 14, 75, 162, 360)
tmp1 <- outer(x, y, function(x, y) abs(x - y))
tmp2 <- which(tmp1 <= 10, arr.ind = TRUE)
rbind(
  x = x[tmp2[, 1]],
  y = y[tmp2[, 2]]
)
This looks at every possible pair between x and y and computes the difference between the 2 values, then finds those with a difference <= 10.
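If you want the two filtered vectors rather than matched pairs, you can collapse the row and column indices in tmp2 with unique() (a small extension of my own; note that a single value close to several partners would otherwise appear more than once):
new.x <- x[unique(tmp2[, 1])]
new.y <- y[unique(tmp2[, 2])]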

R: find nearest index

I have two vectors with a few thousand points, but generalized here:
A <- c(10, 20, 30, 40, 50)
b <- c(13, 17, 20)
How can I get the indices of A that are nearest to b? The expected outcome would be c(1, 2, 2).
I know that findInterval can only find the first occurrence, and not the nearest, and I'm aware that which.min(abs(b[2] - A)) is getting warmer, but I can't figure out how to vectorize it to work with long vectors of both A and b.
You can just put your code in a sapply. I think this has the same speed as a for loop, so it isn't technically vectorized though:
sapply(b, function(x) which.min(abs(x - A)))
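A small variant (my own tweak, not part of the answer): vapply pins down the return type, which guards against surprises such as an empty b returning a list:
vapply(b, function(x) which.min(abs(x - A)), integer(1))
# [1] 1 2 2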
findInterval gets you very close. You just have to pick between the offset it returns and the next one:
# returns the index of the nearest value to x in vec
nearest.vec <- function(x, vec)
{
  smallCandidate <- findInterval(x, vec, all.inside = TRUE)
  largeCandidate <- smallCandidate + 1
  # nudge is TRUE if the large candidate is nearer, FALSE otherwise
  nudge <- 2 * x > vec[smallCandidate] + vec[largeCandidate]
  return(smallCandidate + nudge)
}
nearest.vec(b, A)
returns (1,2,2), and should be comparable to findInterval in performance.
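As a quick check of the end cases (my own test, not from the answer): all.inside = TRUE keeps both candidates within bounds, so values beyond the range of A are handled too:
nearest.vec(c(5, 55), A)
# [1] 1 5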
Here's a solution that uses R's often overlooked outer function. Not sure if it'll perform better, but it does avoid sapply.
A <- c(10, 20, 30, 40, 50)
b <- c(13, 17, 20)
dist <- abs(outer(A, b, '-'))
result <- apply(dist, 2, which.min)
# [1] 1 2 2
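A marginally faster equivalent for large inputs (my own variant; ties.method = "first" is needed to match which.min, because max.col breaks ties at random by default):
max.col(-t(dist), ties.method = "first")
# [1] 1 2 2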

Interpolate NA values

I have two sets of samples taken at independent times. I would like to merge them and calculate the missing values
for the times where I do not have values of both. Simplified example:
A <- cbind(time = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
           Avalue = c(1, 2, 3, 2, 1, 2, 3, 2, 1, 2))
B <- cbind(time = c(15, 30, 45, 60), Bvalue = c(100, 200, 300, 400))
C <- merge(A, B, all = TRUE)
   time Avalue Bvalue
1    10      1     NA
2    15     NA    100
3    20      2     NA
4    30      3    200
5    40      2     NA
6    45     NA    300
7    50      1     NA
8    60      2    400
9    70      3     NA
10   80      2     NA
11   90      1     NA
12  100      2     NA
By assuming linear change between samples, it is possible to calculate the missing NA values.
Intuitively it is easy to see that the A value at times 15 and 45 should be 1.5. But a proper calculation for B, for instance at time 20, would be
100 + (20 - 15) * (200 - 100) / (30 - 15)
which equals 133.33333.
The first parenthesis is the time between the estimation time and the last available sample.
The second parenthesis is the difference between the nearest samples.
The third parenthesis is the time between the nearest samples.
How can I use R to calculate the NA values?
Using the zoo package:
library(zoo)
Cz <- zoo(C)
index(Cz) <- Cz[,1]
Cz_approx <- na.approx(Cz)
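If you'd rather avoid the zoo dependency, approx() from base R does the same linear interpolation directly. A minimal sketch of my own (the _i column names are mine): times outside a series' range stay NA, because approx() defaults to rule = 1.
# Interpolate each series at every time in the merged table
C$Avalue_i <- approx(A[, "time"], A[, "Avalue"], xout = C$time)$y
C$Bvalue_i <- approx(B[, "time"], B[, "Bvalue"], xout = C$time)$y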
The proper way to do this statistically and still get valid confidence intervals is to use Multiple Imputation. See Rubin's classic book, and there's an excellent R package for this (mi).
An ugly and probably inefficient Base R solution:
# Data provided:
A <- cbind(time = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
           Avalue = c(1, 2, 3, 2, 1, 2, 3, 2, 1, 2))
B <- cbind(time = c(15, 30, 45, 60), Bvalue = c(100, 200, 300, 400))
C <- merge(A, B, all = TRUE)
# Scalar holding the minimum time difference: -> min_time_diff
min_time_diff <- min(diff(C$time))
# Adjust the frequency of the series to hold all steps in range: -> df
df <- merge(C,
            data.frame(time = seq(min_time_diff,
                                  max(C$time),
                                  by = min_time_diff)),
            by = "time",
            all = TRUE)
# Linear interpolation function;
# returns an interpolated vector the same length
# as the input vector: -> vector
l_interp_vec <- function(na_vec){
  approx(x = na_vec,
         method = "linear",
         ties = "constant",
         n = length(na_vec))$y
}
# Applied to a data frame, replacing the NA values
# in each of the numeric vectors
# with interpolated values.
# Input is a data frame: -> data.frame
interped_df <- data.frame(lapply(df, function(x){
  if(is.numeric(x)){
    # Store a scalar of the min row where x isn't NA: -> min_non_na
    min_non_na <- min(which(!(is.na(x))))
    # Store a scalar of the max row where x isn't NA: -> max_non_na
    max_non_na <- max(which(!(is.na(x))))
    # Store a scalar of the number of rows needed to impute prior
    # to the first non-NA value: -> ru_lower
    ru_lower <- ifelse(min_non_na > 1, min_non_na - 1, min_non_na)
    # Store a scalar of the number of rows needed to impute after
    # the last non-NA value: -> ru_upper
    ru_upper <- ifelse(max_non_na == length(x),
                       length(x) - 1,
                       (length(x) - (max_non_na + 1)))
    # Store a vector ramping up to the first non-NA value: -> ramp_up
    ramp_up <- as.numeric(
      cumsum(rep(x[min_non_na]/(min_non_na), ru_lower))
    )
    # Apply the interpolation function on vector "x": -> y
    y <- as.numeric(l_interp_vec(as.numeric(x[min_non_na:max_non_na])))
    # Create a vector that combines the ramp_up vector
    # and y, depending on where the NA values fall: -> z
    if(length(ramp_up) > 1 & max_non_na != length(x)){
      # Create a vector of interpolations if there are
      # multiple NA values after the last non-NA value: -> lower_l_int
      lower_l_int <- as.numeric(cumsum(rep(mean(diff(c(ramp_up, y))),
                                           ru_upper+1)) +
                                  as.numeric(x[max_non_na]))
      # Store the linear interpolations in a vector: -> z
      z <- as.numeric(c(ramp_up, y, lower_l_int))
    }else if(length(ramp_up) > 1 & max_non_na == length(x)){
      # Store the linear interpolations in a vector: -> z
      z <- as.numeric(c(ramp_up, y))
    }else if(min_non_na == 1 & max_non_na != length(x)){
      # Create a vector of interpolations if there are
      # multiple NA values after the last non-NA value: -> lower_l_int
      lower_l_int <- as.numeric(cumsum(rep(mean(diff(c(ramp_up, y))),
                                           ru_upper+1)) +
                                  as.numeric(x[max_non_na]))
      # Store the linear interpolations in a vector: -> z
      z <- as.numeric(c(y, lower_l_int))
    }else{
      # Store the linear interpolations in a vector: -> z
      z <- as.numeric(y)
    }
    # Interpolate between points in x, return the new x:
    return(as.numeric(ifelse(is.na(x), z, x)))
  }else{
    x
  }
}))
# Subset interped_df to only contain
# the time values in C, store as a data frame: -> int_df_subset
int_df_subset <- interped_df[interped_df$time %in% C$time,]
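As a sanity check (my own addition, assuming the code above ran as written), the interpolated Bvalue at time 20 should reproduce the hand calculation from the question:
int_df_subset[int_df_subset$time == 20, "Bvalue"]
# [1] 133.3333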
