I'm trying to compute a moving window transition matrix for credit rating data. My data looks like the example below, but with 4000+ rows (there are 8 different ratings in the complete dataset). So we have firms (id), their prior and current credit ratings, and the date at which each transition took place.
df <- data.frame(id = c(100, 100, 98, 99, 98, 98, 56, 54),
                 date = c(19750102, 19750205, 19750402, 19750609,
                          19831212, 19840202, 19840704, 19861104),
                 priorrating = c("A", "BBB", "AAA", "AAA", "AA", "A", "BB", "D"),
                 currentrating = c("BBB", "BB", "AA", "AAA", "A", "A", "BB", "D"))
I want to compute the transition probability matrices of these ratings over a moving window: 6-month intervals, shifted forward by 1 month at a time. So you would get matrices P for the intervals [19750102, 19750602], [19750202, 19750702], and so forth.
For each element (each date) in the different intervals I want to compute the following. There are no ties (so no dates with multiple different transitions).
require(Matrix)
#Table with transitions between ratings at time t in the interval
N <- table(df$priorrating,df$currentrating)
#Getting the number of total exposed firms Y at time t in the interval
firms <- apply(N,1,sum)
Y <- sum(firms)
#Computing the off-diagonal elements of matrix dA; the kjth off-diagonal element counts the
#fraction of transitions from the kth category to the jth category in the number of exposed
#firms at time t.
dA <- N/Y
#Complete matrix dA by adding default row
dA[6,] <- 0
#Computing the diagonal elements of dA; the kth diagonal element counts the fraction of the
#exposed firms Y leaving the state at time t
D <- rep(0,6)
diag(dA) <- D #setting diagonal to zeros
diag(N) <- D #setting diagonal of transition count matrix to zero
outtrans <- apply(N,1,sum) #vector with number of firms leaving each state at time t
diag(dA) <- -outtrans/Y
#Finally compute probability matrix P for element i in the interval (time t)
attributes(dA)$class <- "matrix" #drop the "table" class so dA behaves as a plain matrix
P <- (diag(6) + dA) #diag(6) because the sample has 6 ratings; with the complete dataset (8 ratings) this would be diag(8)
And finally I want to compute P.hat for the entire interval, which would be the product of the P matrices computed above for each element (date) in the interval.
So the output would look like a list of matrices P.hat for each interval.
Now, my code above works on the entire sample. But I'm quite unsure how to implement it as a moving window.
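For the moving window, the structure I have in mind is something like the rough, untested sketch below: wrap the per-date computation above into a function, then slide monthly over 6-month windows and multiply the per-date P matrices together. (The levels "B" and "CCC" are just my guesses for the two ratings missing from the sample data.)

# Hard-code the full set of 8 rating levels so every window yields an 8x8 matrix
ratings <- c("AAA", "AA", "A", "BBB", "BB", "B", "CCC", "D")

# The per-date computation from above, wrapped into a function
compute.P <- function(x) {
  N <- table(factor(x$priorrating, levels = ratings),
             factor(x$currentrating, levels = ratings))
  Y <- sum(N)
  dA <- unclass(N) / Y                 # off-diagonal elements
  diag(dA) <- 0
  diag(N) <- 0
  diag(dA) <- -apply(N, 1, sum) / Y    # fraction leaving each state
  diag(length(ratings)) + dA
}

# Slide a 6-month window forward one month at a time
df$d <- as.Date(as.character(df$date), format = "%Y%m%d")
starts <- seq(min(df$d), max(df$d), by = "1 month")
P.hat <- lapply(starts, function(s) {
  end <- seq(s, by = "6 months", length.out = 2)[2]
  win <- df[df$d >= s & df$d < end, ]
  if (nrow(win) == 0) return(NULL)
  # P.hat for the window: product of the per-date P matrices, in date order
  Reduce(`%*%`, lapply(split(win, win$d, drop = TRUE), compute.P))
})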
PS: Hopefully my question is clear, if not please let me know!
I have code here that generates a random spatial distribution of points, returns a column of distances between every point and an infected individual, and uses a function to calculate the probability of infection in the next time step. There are 60 hosts, one of which is infected. I would like to bind the values of Pi (which calculates infection probability) to my data frame with the original co-ordinates. Obviously one point, the infected individual itself, is removed from the distance matrix. This value I would like to replace with NA in the main data frame as the next step in my code, and also to confirm that the co-ordinates correspond with the output of the function Pi.
So as it stands I am trying to attach a column of 59 rows to the main data frame of 60 rows.
# Create a spatial distribution with infected individuals
xcoord <- sample(1:100,60)
ycoord <- sample(1:100,60)
infectionstatus <- rep(0,60)
Df <- data.frame(xcoord, ycoord, infectionstatus)
a <- sample(1:60, 1)
Df$infectionstatus[a] <- 1
# Calculate distance between infected individuals and susceptibles
library(rdist)
distances <- pdist(Df[,1:2], metric = "euclidean")
position_infected_individual <- which(Df[,3]==1)
distance_from_infected <- distances[-(position_infected_individual), position_infected_individual]
#Assign parameter values and calculate probability of infection
beta <- 100
alpha <- 0.1
Pi <- vector()
for (p in 1:length(distance_from_infected)){
  Pi[p] <- 1 - exp(-beta*exp(-alpha*distance_from_infected[p]))
}
The obvious step is:
replace:
distance_from_infected <- distances[-(position_infected_individual), position_infected_individual]
with:
distance_from_infected <- c(NA, distances[-(position_infected_individual), position_infected_individual])
But you're setting yourself up for quite a few failures, because this assumes:
- there is only one infected case;
- the data frame can always be sorted so the infected individual comes first;
- NA makes "sense" for this kind of numeric summary.
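A sketch of what I'd do instead (untested): keep Pi as a full-length column and fill every row except the infected individual's own, wherever it happens to sit, so no sorting assumption is needed:

# Vectorised Pi, placed into a full-length column; the infected
# individual's own row stays NA regardless of its position
Df$Pi <- NA
Df$Pi[-position_infected_individual] <-
  1 - exp(-beta * exp(-alpha * distance_from_infected))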
I'm still new to the programming world and looking for some guidance on a model I am building for individual animal growth over time.
The goal of the code I'm working on is to:
i) Generate random starting sizes of animals from a given distribution
ii) Give each of these individuals a starting growth rate from a given distribution
iii) Calculate new size of individual after 1 year
iv) Assign a new growth rate from above distribution
v) Calculate the new size of individual after another year.
So far I have the code below; what I want to do is repeat the last two lines x times without having to run them by hand over and over.
# Generate starting lengths
lengths <- seq(from=4.4, to=5.4, by =0.1)
# Generate starting ks (growth rate)
ks <- seq(from=0.0358, to=0.0437, by =0.0001)
#Create individuals
create.inds <- function(id = NaN, length0 = NaN, k1 = NaN){
  inds <- data.frame(id = id, length0 = length0, k1 = k1)
  inds
}
# Generate individuals (note: the argument is length0, not length,
# and n.initial needs to be defined; 100 matches the sample sizes used below)
n.initial <- 100
inds <- create.inds(id = 1:n.initial,
                    length0 = sample(lengths, n.initial, replace = TRUE),
                    k1 = sample(ks, n.initial, replace = TRUE))
# Calculate new lengths based on last and 2nd last columns and insert into next column
inds[,ncol(inds)+1] <- 326*(1-exp(-(inds[,ncol(inds)])))+
(inds[,ncol(inds)-1]*exp(-(inds[,ncol(inds)])))
# Calculate new ks and insert into last column
inds[,ncol(inds)+1] <- sample(ks, 100, replace=TRUE)
Any and all assistance would be appreciated, also if you think there is a better way to write this please let me know.
I think what you are asking for is a simple loop:
for (i in 1:100) { #replace 100 with the number of times you want this to execute
  # Calculate new lengths based on the last two columns and insert into next column
  inds[,ncol(inds)+1] <- 326*(1-exp(-(inds[,ncol(inds)])))+
    (inds[,ncol(inds)-1]*exp(-(inds[,ncol(inds)])))
  # Calculate new ks and insert into last column
  inds[,ncol(inds)+1] <- sample(ks, 100, replace=TRUE)
}
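If the run gets long, a variation worth considering (an untested sketch; n.years is a placeholder for however many years you want) is to preallocate matrices instead of growing the data frame two columns at a time:

n.years <- 100  # placeholder: how many years to simulate
L <- matrix(NA_real_, nrow = n.initial, ncol = n.years + 1)  # lengths
K <- matrix(NA_real_, nrow = n.initial, ncol = n.years)      # growth rates
L[, 1] <- inds$length0
for (t in 1:n.years) {
  K[, t] <- sample(ks, n.initial, replace = TRUE)
  L[, t + 1] <- 326 * (1 - exp(-K[, t])) + L[, t] * exp(-K[, t])
}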
I'm aware there are many questions related to smoothing data in R, however, my knowledge is far too basic to apply it to the following problem! My key issue is that my data is >1.7m rows.
My Problem
I have a list "df" of 4 equal length vectors.
df[[1]] is a vector containing all uk postcodes
df[[2]] is a vector of latitudes
df[[3]] is a vector of longitudes
df[[4]] contains concentrations of a certain material
What I need to do is create a vector of 'smoothed' concentrations for each postcode, which should be calculated as: "A weighted average of concentrations in all postcodes within a given distance. The weighting is defined as exp(-Distance)"
I currently have the following code. It works perfectly (I've tested it on a subset of 100k postcodes). However, it's far too slow, given that it loops over almost 2 million entries.
Can anyone help me find a faster way to do this?
df <- as.list(Import[,c("Postcode", "Latitude", "Longitude", "Concentration")])
n <- length(df[[1]])
Out <- rep(0,n)
for(i in 1:n){
  #Calculate the squared Euclidean distance to every other postcode
  BaseLat <- df[[2]][i]
  BaseLong <- df[[3]][i]
  Distance <- (df[[2]]-BaseLat)^2 + (df[[3]]-BaseLong)^2
  #Weightings: exp(-Distance), zero beyond the cutoff
  Weight <- ifelse(Distance < 0.01, exp(-Distance), 0)
  #Take the weighted average and assign it to the output vector
  Out[i] <- sum(df[[4]]*Weight)/sum(Weight)
}
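One direction I've considered for speed (an untested sketch, using the dbscan package's fixed-radius neighbour search frNN) is to only ever compare each postcode against points within the cutoff; the squared-distance cutoff of 0.01 corresponds to a radius of 0.1:

library(dbscan)

xy <- cbind(Import$Latitude, Import$Longitude)
conc <- Import$Concentration
nn <- frNN(xy, eps = 0.1)  # neighbours within Euclidean distance 0.1

Out <- vapply(seq_len(nrow(xy)), function(i) {
  w <- exp(-nn$dist[[i]]^2)  # exp(-squared distance), as in the loop above
  # frNN omits the point itself, so add it back (distance 0, weight 1)
  sum(c(conc[i], conc[nn$id[[i]]] * w)) / sum(c(1, w))
}, numeric(1))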
I'm trying to understand the cross wavelet function in R, but can't figure out how to convert the phase lag arrows to a time lag with the biwavelet package. For example:
require(gamair)
data(cairo)
data_1 <- within(cairo, Date <- as.Date(paste(year, month, day.of.month, sep = "-")))
data_1 <- data_1[,c('Date','temp')]
data_2 <- data_1
# add a lag
n <- nrow(data_1)
nn <- n - 49
data_1 <- data_1[1:nn,]
data_2 <- data_2[50:nrow(data_2),]
data_2[,1] <- data_1[,1]
require(biwavelet)
d1 <- data_1[,c('Date','temp')]
d2 <- data_2[,c('Date','temp')]
xt1 <- xwt(d1,d2)
plot(xt1, plot.phase = TRUE)
These are my two time series. Both are identical, but one is lagging the other. The arrows suggest a phase angle of 45 degrees (apparently pointing right or left means in phase or anti-phase, and straight up or down means a 90-degree lag), so my interpretation is that I'm looking at a lag of 45 degrees.
How would I now convert this to a time lag i.e. how would I calculate the time lag between these signals?
I've read online that this can only be done for a specific wavelength (which I presume means for a certain period?). So, given that we're interested in a period of 365 days, and the time step between the signals is one day, how would one calculate the time lag?
So I believe you're asking how you can determine what the lag time is given two time series (in this case you artificially added in a lag of 49 days).
I'm not aware of any packages that make this a one-step process, but since we are essentially dealing with sine waves, one option would be to "zero out" the waves and then find the zero crossing points. You could then calculate the average distance between the zero crossing points of wave 1 and wave 2. If you know the time step between measurements (here, one day), you can easily convert that to a lag time.
Here is the code I used to accomplish this:
#smooth the data to get rid of the noise that would introduce excess zero crossings
#subtracted 70 from the temp to introduce a "zero" approximately in the middle of the wave
spline1 <- smooth.spline(data_1$Date, y = (data_1$temp - 70), df = 30)
plot(spline1)
#add the smoothed y back into the original data just in case you need it
data_1$temp_smoothed <- spline1$y
#do the same for wave 2
spline2 <- smooth.spline(data_2$Date, y = (data_2$temp - 70), df = 30)
plot(spline2)
data_2$temp_smoothed <- spline2$y
#function for finding zero crossing points, borrowed from the msProcess package
#(its helpers checkVectorType, checkScalarType, lowerCase and ifelse1 come from
#the splus2R package, so that needs to be loaded first)
zeroCross <- function(x, slope="positive")
{
  checkVectorType(x,"numeric")
  checkScalarType(slope,"character")
  slope <- match.arg(lowerCase(slope), c("positive","negative"))
  ipost <- ifelse1(slope == "negative",
                   sort(which(c(x, 0) < 0 & c(0, x) > 0)),
                   sort(which(c(x, 0) > 0 & c(0, x) < 0)))
  offset <- apply(matrix(abs(x[c(ipost-1, ipost)]), nrow=2, byrow=TRUE),
                  MARGIN=2, order)[1,] - 2
  ipost + offset
}
#find zero crossing points for the two waves
zcross1 <- zeroCross(data_1$temp_smoothed, slope = 'positive')
length(zcross1)
[1] 10
zcross2 <- zeroCross(data_2$temp_smoothed, slope = 'positive')
length(zcross2)
[1] 11
#join the two vectors as a data.frame (using only the first 10 crossing points for wave2 to avoid any issues of mismatched lengths)
zcrossings <- as.data.frame(cbind(zcross1, zcross2[1:10]))
#calculate the mean of the crossing point differences
mean(zcrossings$zcross1 - zcrossings$V2)
[1] 49
I'm sure there are more elegant ways of going about this, but it should get you the information that you need.
In my case, for the semidiurnal tidal wave, 90 degrees equals about 3.1 hours (90*12.5/360 = 3.125 hours), where 12.5 hours is the period of the semidiurnal tide. So 45 degrees equals 45*12.5/360 = 1.56 hours.
Thus in your case, with a period of 365 days:
90 degrees -> 90*365/360 = 91.25 days.
45 degrees -> 45*365/360 = 45.625 days.
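The general conversion is just time lag = (phase angle / 360) * period, which you could wrap in a small helper (phase_to_lag is a hypothetical name, not from any package):

# convert a phase angle in degrees to a time lag,
# in whatever units the period is given in
phase_to_lag <- function(angle_deg, period) angle_deg / 360 * period
phase_to_lag(45, 365)  # 45.625 days, close to the artificial 49-day lag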
My understanding is as follows:
For there to be a simple cause-and-effect relationship between the phenomena recorded in the time series, we would expect that the oscillations are phase-locked (Grinsted 2004); so, the period where you find the "in phase" arrow (--->) indicates the lag between the signals.
See the simulated examples with different distances between cause-and-effect phenomena; observe that the greater the distance, the greater the period at which the "in phase" arrow occurs in the cross wavelet transform.
Grinsted, A., Moore, J. C., and Jevrejeva, S. (2004). Application of the cross wavelet transform and wavelet coherence to geophysical time series. Nonlinear Processes in Geophysics, 11, 561–566. SRef-ID: 1607-7946/npg/2004-11-561.
See the example here
For a game design issue, I need to inspect binomial distributions more closely. Using R, I need to build a two-dimensional table that, given fixed parameters 'pool' (the number of dice rolled) and 'sides' (the number of sides of the die), has:
In rows --> minimum for a success (ranging from 0 to sides, it's a discrete distribution)
In columns --> number of successes (ranging from 0 to pool)
I know how to calculate it as a single task, but I'm not sure on how to iterate to fill the entire table
EDIT: I forgot to say that I want to calculate the probability p of gaining at least the number of successes.
OK, I think this could be a simple solution. It has the ratio of successes (s out of n) on rows and the success threshold of the die roll on columns.
poolDistribution <- function(n, sides=10, digits=2, roll.Under=FALSE){
  m <- 1:sides
  names(m) <- paste(m, ifelse(roll.Under, "-", "+"), sep="")
  s <- 1:n
  names(s) <- paste(s, n, sep="/")
  sapply(m, function(m.value)
    round((if(roll.Under) (1 - pbinom(s - 1, n, m.value/sides))*100
           else (1 - pbinom(s - 1, n, (sides - m.value + 1)/sides))*100),
          digits=digits))
}
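For example, for a pool of 6 dice with the default d10:

# rows: at least s successes out of 6; columns: success threshold (m+)
poolDistribution(6)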
This gets you half of the way.
If you are new to R, you might miss out on the fact that a very powerful feature is that you can use a vector of values as an index into another vector. This makes part of the problem trivially easy:
pool <- 3
sides <- 20 # <cough>D&D<cough>
# you need to store the values somewhere, use a vector
NumberOfRollsPerSide <- rep(0, sides)
names(NumberOfRollsPerSide) <- 1:sides # this will be useful in table
## Repeat so long as there are still zeros
## ie, so long as there is a side that has not come up yet
while (any(NumberOfRollsPerSide == 0)) {
  # roll once
  oneRoll <- sample(1:sides, pool, TRUE)
  # add (+1) to each side's total rolls
  # note that you can use the roll outcome to index the vector. R is great.
  NumberOfRollsPerSide[oneRoll] <- NumberOfRollsPerSide[oneRoll] + 1
}
# These are your results:
NumberOfRollsPerSide
All you have left to do now is count, for each side, in which roll number it first came up.
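One untested sketch of that counting step, reusing pool and sides from above, tracks the first roll at which each side appears rather than running totals:

# record the roll number at which each side first comes up
FirstSeenAt <- rep(NA_integer_, sides)
names(FirstSeenAt) <- 1:sides
rollNumber <- 0
while (any(is.na(FirstSeenAt))) {
  rollNumber <- rollNumber + 1
  oneRoll <- sample(1:sides, pool, TRUE)
  unseen <- oneRoll[is.na(FirstSeenAt[oneRoll])]  # sides not seen before
  FirstSeenAt[unseen] <- rollNumber
}
FirstSeenAt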