I need some assistance in solving the following issue:
I am drawing a Taylor diagram using R. I calculated the JJAS mean precipitation values (mm/day) for the observation and two models, then manually defined those values to get the Taylor diagram. It gives me output, but that does not seem right, as the standard deviation values are too low (a sample is attached).
This is the code I am using:
library(plotrix)
ref<-c(3.3592,4.1377,4.0888,3.3098)
model1<-c(2.5053,3.0912,2.9271,2.4238)
model2<-c(2.2181,2.7910,2.7024,2.2495)
taylor.diagram(ref,model1,add=FALSE,col="red")
taylor.diagram(ref,model2,add=TRUE,col="blue")
An alternative way is to use the netCDF files of the observation and models, but I don't know how to extract the precipitation information and use it (I know how to view netCDF data in R, but extraction is challenging for me at this stage).
Kindly solve this problem.
I have no prior knowledge of this, but a cursory look at the code of taylor.diagram clarifies what is going on here.
If you enter taylor.diagram without parentheses at the console, it will print the function's source. Around line 15 you will find this helper function used to calculate the SD:
SD <- function(x, subn) {
    meanx <- mean(x, na.rm = TRUE)
    devx <- x - meanx
    ssd <- sqrt(sum(devx * devx, na.rm = TRUE)/(length(x[!is.na(x)]) -
        subn))
    return(ssd)
}
We can run this function with parameter subn as TRUE or FALSE (in R, TRUE equates to 1 and FALSE equates to 0):
> SD(ref, TRUE)
[1] 0.4505061
> SD(ref, FALSE)
[1] 0.3901498
> SD(model1, FALSE)
[1] 0.2798994
> SD(model1, TRUE)
[1] 0.3232
And from this we can see that subn is set to FALSE. Further inspection of the code shows:
subn <- sd.method != "sample"
In other words: if sd.method equals "sample" (the default value), then subn will be FALSE.
It's up to you to decide what is the correct choice here.
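If it helps to see the difference concretely, here is a minimal sketch using the ref and model1 vectors from the question. Note that the alternative sd.method value below is only inferred from the subn line above (any string other than "sample" flips it), so check ?taylor.diagram for the documented spelling:
library(plotrix)
ref    <- c(3.3592, 4.1377, 4.0888, 3.3098)
model1 <- c(2.5053, 3.0912, 2.9271, 2.4238)
## n - 1 divisor (R's built-in sd(), same as SD(x, TRUE) above)
sd(ref)                                        # 0.4505061
## n divisor (what taylor.diagram plots by default, same as SD(x, FALSE))
sqrt(sum((ref - mean(ref))^2) / length(ref))   # 0.3901498
## because of `subn <- sd.method != "sample"`, passing any other value for
## sd.method switches the diagram to the n - 1 divisor
taylor.diagram(ref, model1, add = FALSE, col = "red", sd.method = "estimated")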
One of the great things about R is that all R functions can be inspected at the console. Doing so can resolve most "why is this function behaving like this?" questions with limited effort.
I have several variables in my dataset that represent daily timing of events across a week.
For example for two rows might look like:
t1 = c(NA,12.6,10.7,11.5,12.5,9.5,14.1)
t2 = c(23.7,1.2,NA,22.9,23.2,0.5,0.1)
I want to calculate the variance of these rows. To do this, I need the mean, and because these are periodic variables, I've adapted the code from this page:
#This can all be wrapped in a function like this
circ.mean <- function(m, int, na.rm=T) {
  if(na.rm) m <- m[!is.na(m)]
  rad.m = m*(360/int)*(pi/180)
  mean.cos = mean(cos(rad.m))
  mean.sin = mean(sin(rad.m))
  x.deg = atan(mean.sin/mean.cos)*(180/pi)
  return(x.deg/(360/int))
}
This works as expected for t2:
> circ.mean(t2,24)
[1] -0.06803088
although ideally the answer would be 23.93197. But for t1, it gives an incorrect answer:
> circ.mean(t1,24)
[1] -0.1810074
whereas using the normal mean function gives the right answer:
> mean(t1,na.rm=T)
[1] 11.81667
My questions are:
1) Is this "circular mean" code correct and if so, am I using it correctly?
2) I've had a stab at my own circ.var function (see below) to calculate the variance of a periodic variable - will this produce the correct variances for all possible input timing vectors?
circ.var <- function(m, int=NULL, na.rm=TRUE) {
  if(is.null(int)) stop("Period parameter missing")
  if(na.rm) m <- m[!is.na(m)]
  if(sum(!is.na(m))==0) return(NA)
  n = length(m)
  mean.m = circ.mean(m, int)
  var.m = 1/(n-1)*sum((((m-mean.m+(int/2))%%int)-(int/2))^2)
  return(var.m)
}
Any help would be hugely appreciated! Thanks for taking the time to read this!
I deleted my old answer, as I believe there was a mistake in the solution I provided.
I've written a series of R scripts that I've made available at my GitHub page which should calculate the mean, variance and other stats.
Thanks to @Gregor for his help.
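For completeness, a minimal sketch of one common fix (my own illustration, not necessarily what the linked GitHub scripts do): the wrong t1 result comes from the quadrant ambiguity of atan(), and atan2() resolves it because it keeps the signs of the sine and cosine components separate.
circ.mean2 <- function(m, int, na.rm = TRUE) {
  if (na.rm) m <- m[!is.na(m)]
  rad.m <- m * 2 * pi / int                           # map values onto the circle
  x.rad <- atan2(mean(sin(rad.m)), mean(cos(rad.m)))  # mean angle, correct quadrant
  (x.rad * int / (2 * pi)) %% int                     # map back and wrap into [0, int)
}
circ.mean2(t1, 24)   # ~11.82, matching mean(t1, na.rm = TRUE)
circ.mean2(t2, 24)   # ~23.93, wrapping correctly around midnight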
I have found two outlier data points in my data set, but I don't know how to remove them. All of the guides that I have found online seem to emphasize plotting the data, but my question does not require plotting, only regression model fitting. I am having great difficulty finding out how to remove the two data points from my data set and then fit a new model to the reduced data.
Here is the code that I have written and the outliers that I found:
library(alr4)
library(MASS)
data(lathe1)
head(lathe1)
y=lathe1$Life
x1=lathe1$Speed
x2=lathe1$Feed
x1_square=(x1)^2
x2_square=(x2)^2
#part A (Box-Cox method show log transformation)
y.regression=lm(y~x1+x2+(x1)^2+(x2)^2+(x1*x2))
mod=boxcox(y.regression, data=lathe1, lambda = seq(-1, 1, length=10))
best.lam=mod$x[which(mod$y==max(mod$y))]
best.lam
#part B (null-hypothesis F-test)
y.regression1_Reduced=lm(log(y)~1)
y.regression1=lm(log(y)~x1+x2+x1_square+x2_square+(x1*x2))
anova(y.regression1_Reduced, y.regression1)
#part D (F-test of log(Y) without beta1)
y.regression2=lm(log(y)~x2+x2_square)
anova(y.regression1_Reduced, y.regression2)
#part E (Cook's distance and refit)
cooks.distance(y.regression1)
Outliers:
9 10
0.7611370235 0.7088115474
I think you may be able (if execution time / corpus size allows it) to loop through your data and copy or remove elements according to your criteria to obtain the desired result, e.g.
corpus_list_without_outliers = []
for elem in corpus_list:
    if elem.speed <= 10000:  # elem.<any_param_name> <= arbitrary outlier cutoff
        corpus_list_without_outliers.append(elem)  # keep it because it is OK :)
print(corpus_list_without_outliers)
# regression algorithm after
This is how I'd see the situation, but you can also replace the if above with a remove statement to avoid creating a second list, e.g.
for elem in list(corpus_list):  # iterate over a copy so removing is safe
    if elem.speed > 10000:  # elem.<any_param_name>
        corpus_list.remove(elem)  # drop it from the corpus because it is an outlier :(
print(corpus_list)
# regression algorithm after
Hope it helped you!
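Back in R terms, a minimal sketch of the same idea for the original question, assuming the two rows flagged by Cook's distance are rows 9 and 10 of lathe1: drop them by index and refit.
library(alr4)
data(lathe1)
# drop the two flagged rows and refit the log-transformed model on the reduced data
lathe1_clean <- lathe1[-c(9, 10), ]
y.regression_clean <- lm(log(Life) ~ Speed + Feed + I(Speed^2) + I(Feed^2) + Speed:Feed,
                         data = lathe1_clean)
summary(y.regression_clean)
cooks.distance(y.regression_clean)   # re-check influence on the new fit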
I have the following R code, which I am trying to convert to MATLAB. (No, I do not want to run the R code in MATLAB as shown here.)
The R code is here:
# model parameters
dt <- 0.001
t <- seq(dt,0.3,dt)
n=700*1000
D = 1
d = 0.5
# model
ft <- n*d/sqrt(2*D*t^3)*dnorm(d/sqrt(2*D*t),0,1)
fmids <- n*d/sqrt(2*D*(t+dt/2)^3)*dnorm(d/sqrt(2*D*(t+dt/2)),0,1)
plot(t,ft*dt,type="l",lwd=1.5,lty=2)
# simulation
#
# simulation by drawing from uniform distribution
# and converting to time by using quantile function of normal distribution
ps <- runif(n,0,1)
ts <- 2*pnorm(-d/sqrt(2*D*t))
sumn <- sapply(ts, FUN = function(tb) sum(ps < tb))
lines(t[-length(sumn)],sumn[-1]-sumn[-length(sumn)],col=4)
And here is the MATLAB code I have written so far:
% # model parameters (mirroring the R definitions above)
dt = 0.001;
t = dt:dt:0.3;
n = 700*1000;
D = 1;
d = 0.5;
% # model
ft = (n*d)./sqrt(2*D.*t.^3).*normpdf(d./sqrt(2*D.*t),0,1);
fmids = (n*d)./sqrt(2*D*(t+dt/2).^3).*normpdf(d./sqrt(2*D*(t+dt/2)),0,1);
figure;plot(t,ft.*dt);
% # simulation
% #
% # simulation by drawing from uniform distribution
% # and converting to time by using quantile function of normal distribution
ps = rand(1,n);
ts = 2*normcdf(-d./sqrt(2*D*t));
So, here is where I am stuck. I don't understand what the function sumn = sapply(ts, FUN = function(tb) sum(ps < tb)) does, or where the parameter tb comes from; it is not defined anywhere else in the given R code.
Could anyone tell me what the equivalent of that function R code is in MATLAB?
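For reference on the R side, sapply() simply applies the anonymous function to every element of ts; tb is just the name of that function's argument, bound to one element of ts per call. A sketch of the equivalent explicit R loop, assuming ps and ts as defined above:
sumn <- numeric(length(ts))
for (i in seq_along(ts)) {
  tb <- ts[i]              # tb is the current threshold
  sumn[i] <- sum(ps < tb)  # count how many uniform draws fall below it
}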
[EDIT 1: UPDATE]
So, based on the comments from @Croote, I came up with the following code for the function defined in sapply():
sumidx = bsxfun(@lt,ps,ts');
summat = sumidx.*repmat(ps,300,1);
sumn = sum(summat,2);
sumnfin = sumn(2:end)-sumn(1:end-1);
plot(t(1:length(sumn)-1),sumnfin)
However, I am not getting the desired results. The curves should overlap with each other: the blue curve is correct, so the orange one needs to overlap with the blue curve.
What am I missing here? Is R's pnorm() equivalent to MATLAB's normcdf(), as I have used here?
[EDIT 2: FOUND THE BUG!]
So, after fiddling around, I discovered that all I had to do was obtain the number of occurrences of ps < tb. The line summat = sumidx.*repmat(ps,300,1) is not supposed to be there. After removing that line and keeping sumn = sum(sumidx,2);, I get the desired result.
So, based on the comments from @Croote and after fiddling around, I came up with the following code for the function defined in sapply():
sumidx = bsxfun(@lt,ps,ts');
sumn = sum(sumidx,2);
And for the plot, I coded it as
sumnfin = sumn(2:end)-sumn(1:end-1);
plot(t(1:length(sumn)-1),sumnfin)
Finally, I get the desired result.
I had previously asked this question trying to recombine forecasts using top-down forecast proportions in the hts package. The solution there works great for multi-level hierarchies; however, I have found that I get an error when I try to use the solution on a two-level hierarchy.
library(hts)
# Create the hierarchy
newhts <- hts(htseg1$bts, list(ncol(htseg1$bts)))
# forecast creation adapted from the `combinef()` example
h <- 12
ally <- aggts(newhts)
allf <- matrix(NA, nrow = h, ncol = ncol(ally))
for(i in 1:ncol(ally))
allf[,i] <- forecast(auto.arima(ally[,i]), h = h, PI = FALSE)$mean
allf <- ts(allf, start = 51)
# Earo Wang's solution to my previous question
hts:::TdFp(allf, nodes = htseg1$nodes)
Error in *.default(fcasts[, 1L], prop) : time-series/vector length mismatch
The problem seems to arise because a two-level hierarchy skips the last if conditional, the one with the condition if (l.levels > 2L). The last statement of this conditional includes a piece where prop is multiplied by the time series flist[[k + 1L]], which converts prop into a time series matrix. When this statement is skipped, prop remains a regular matrix, causing the error when the time series vector fcasts[, 1L] is multiplied by the matrix prop.
I understand that TdFp is a non-exported function and therefore may not be as robust as the other functions in the package, but is there any way around this problem? Since it is a relatively simple case, I can code a solution myself, but since hts::forecast.hts() can handle two-level hierarchies with method = "tdfp", I thought there might be a nice clean solution.
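One possible workaround, sketched under the assumption that per-series custom forecasts are not essential: let the exported forecast method do the fitting and the top-down recombination itself (method = "tdfp" with fmethod = "arima" fits ARIMA models internally, per the hts documentation).
library(hts)
newhts <- hts(htseg1$bts, list(ncol(htseg1$bts)))
# exported interface: fits the models and recombines top-down in one call
fc <- forecast(newhts, h = 12, method = "tdfp", fmethod = "arima")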
I'm having some very annoying problems getting a Naive Bayes Classifier to work with a document term matrix. I'm sure I'm making a very simple mistake but can't figure out what it is. My data is from accounts spreadsheets. I've been asked to figure out which categories (in text format: mostly names of departments or names of budgets) are more likely to spend money on charities and which ones mostly (or only) spend on private companies. They suggested I use Naive Bayes classifiers to do this.
I have a thousand or so rows of data to train a model and many hundreds of thousands of rows to test the model against. I have prepared the strings, replacing spaces with underscores and ands/&s with +, then treated each category as one term: so 'alcohol and drug addiction' becomes: alcohol+drug_addiction.
Some example rows:
"environment+housing strategy+commissioning third_party_payments supporting_ppl_block_gross_chargeable" -> This row went to a charity
"west_north_west customer+tenancy premises h.r.a._special_maintenance" -> This row went to a private company.
Using this example as a template, I wrote the following function to come up with my document term matrix (using tm), both for training and test data.
library(tm)
library(e1071)
getMatrix <- function(chrVect){
  testsource <- VectorSource(chrVect)
  testcorpus <- Corpus(testsource)
  testcorpus <- tm_map(testcorpus, stripWhitespace)
  testcorpus <- tm_map(testcorpus, removeWords, stopwords("english"))
  testmatrix <- t(TermDocumentMatrix(testcorpus))
}
trainmatrix <- getMatrix(traindata$cats)
testmatrix <- getMatrix(testdata$cats)
So far, so good. The problem arises when I try to a) apply a Naive Bayes model and b) predict from that model. Using the klaR package, I get a zero-probability error, since many of the terms have zero instances in one category, and playing around with the Laplace terms does not seem to fix this. Using e1071, the model worked, but then when I tested the model using:
model <- naiveBayes(as.matrix(trainmatrix),as.factor(traindata$Code))
rs<- predict(model, as.matrix(testdata$cats))
... every single item was predicted as the same category, even though the categories should be roughly equal. Something in the model clearly isn't working. Looking at some of the terms in model$tables, I can see that many have high values for private and zero for charity, and others vice versa. I have used as.factor for the Code column.
output:
rs 1 2
1 0 0
2 19 17
Any ideas on what is going wrong? Do DTM matrices not play nicely with naiveBayes? Have I missed a step in preparing the data? I'm completely out of ideas. Hope this is all clear. Happy to clarify if not. Any suggestions would be much appreciated.
I have run into this problem myself. You have done (as far as I can see) everything right; the Naive Bayes implementation in e1071 (and thus klaR) is buggy.
But there is a quick and easy fix so that Naive Bayes as implemented in e1071 works again: you should change your text vectors to categorical variables, i.e. with as.factor. You have already done this with your target variable traindata$Code, but you also have to do it for your trainmatrix and, of course, for your test data.
I could not track the bug down 100%, but it lies in this part of the naive Bayes implementation in e1071 (I may note that klaR is only a wrapper around e1071):
L <- log(object$apriori) + apply(log(sapply(seq_along(attribs),
    function(v) {
        nd <- ndata[attribs[v]]
        ## nd is now a cell, row i, column attribs[v]
        if (is.na(nd) || nd == 0) {
            rep(1, length(object$apriori))
        } else {
            prob <- if (isnumeric[attribs[v]]) {
                ## we select table for attribute
                msd <- object$tables[[v]]
                ## if stddev is eqlt eps, assign threshold
                msd[, 2][msd[, 2] <= eps] <- threshold
                dnorm(nd, msd[, 1], msd[, 2])
            } else {
                object$tables[[v]][, nd]
            }
            prob[prob <= eps] <- threshold
            prob
        }
    })), 1, sum)
You can see that there is an if-else condition: if we have no numerics, naive Bayes is used as we expect it to work. If we have numerics - and here comes the bug - this naive Bayes automatically assumes a normal distribution. If you only have 0 and 1 in your text counts, dnorm pretty much sucks. I assume that, due to the very low values created by dnorm, the probabilities are always replaced by the threshold, and thus the variable with the higher a priori factor will always "win".
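A toy illustration of that claim (my own, not code from e1071): with 0/1 term counts treated as numeric, the class-conditional density is a very narrow normal, so dnorm() at the unseen value is effectively zero and gets clamped to the threshold.
dnorm(0, mean = 0, sd = 0.001)   # ~399: huge density at the value seen in training
dnorm(1, mean = 0, sd = 0.001)   # ~0: the other value is vanishingly unlikely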
However, if I understand your problem correctly, you do not even need prediction, but rather the a priori factor for identifying which department gives money to whom. Then all you have to do is take a close look at your model. For every term in the model, the a priori probability appears, which is what I assume you are looking for. Let's do this, plus the aforementioned fix, with a slightly modified version of your sample:
## i have changed the vectors slightly
first <- "environment+housing strategy+commissioning third_party_payments supporting_ppl_block_gross_chargeable"
second <- "west_north_west customer+tenancy premises h.r.a._special_maintenance"
categories <- c("charity", "private")
library(tm)
library(e1071)
getMatrix <- function(chrVect){
  testsource <- VectorSource(chrVect)
  testcorpus <- Corpus(testsource)
  testcorpus <- tm_map(testcorpus, stripWhitespace)
  testcorpus <- tm_map(testcorpus, removeWords, stopwords("english"))
  ## testmatrix <- t(TermDocumentMatrix(testcorpus))
  ## instead just use DocumentTermMatrix; the assignment is superfluous
  return(DocumentTermMatrix(testcorpus))
}
## since you did not supply more data, I cannot do anything about these lines
## trainmatrix <- getMatrix(traindata$cats)
## testmatrix <- getMatrix(testdata$cats)
## instead only
trainmatrix <- getMatrix(c(first, second))
## I prefer running this instead of as.matrix as I can add categories more easily
traindf <- data.frame(categories, as.data.frame(inspect(trainmatrix)))
## now transform every term column to a factor, since numeric columns trigger the dnorm path
for (cols in names(traindf[-1])) traindf[[cols]] <- factor(traindf[[cols]])
## traindf <- apply(traindf, 2, as.factor) did not result in factors
## check if it's as we wished
str(traindf)
## it is
## let's create a model (with formula syntax)
model <- naiveBayes(categories~., data=traindf)
## if you look at the output (doubled to see it more clearly)
predict(model, newdata=rbind(traindf[-1], traindf[-1]))
But as I have already said, you do not need to predict. A look at the model is enough; e.g. model$tables$premises will give you the likelihood for the premises term going to private corporations: 100%.
If you are dealing with very large datasets, you should specify threshold and eps. eps defines the limit below which the threshold value is substituted; e.g. eps = 0 and threshold = 0.000001 can be of use.
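In the e1071 versions I have seen, threshold and eps are arguments of predict.naiveBayes() rather than of naiveBayes() itself, so a minimal sketch (reusing model and traindf from above) would be:
predict(model, newdata = rbind(traindf[-1], traindf[-1]), threshold = 0.000001, eps = 0)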
Furthermore, you should stick to using term-frequency weighting; tf-idf, for example, will not work due to the dnorm in the naive Bayes.
Hope I can finally get my 50 reputation :P