I'm trying to do some econometric analysis using R and can't figure out how to do the analysis I'm looking for. Specifically, I want to calculate consumer surplus.
I am trying to predict the number of trips (dependent variable) based on variables like water quality, scenery, parking, etc. I've run a regression of my dependent variable on my independent variables using:
lm()
and also got my predicted values using:
y_hat <- as.matrix(mydata[c("y")])
Now I want to calculate the consumer surplus for each individual (~260 total) from my predicted (y_hat) values.
Welcome to R. I studied economics in college and wish R had been taught. You will find the language very useful in your work.
Note that R is able to accomplish vectorized operations that may speed up your analysis. Consider:
mydata <- data.frame(x=letters[1:3], y=1:3)
x y
1 a 1
2 b 2
3 c 3
Let's say your predicted 'y' is 1.25.
y_hat <- 1.25
You can subtract the entire column of the dataset from that number, and it will go row by row for you without complicated 'for' loops.
y_hat - mydata[c("y")]
y
1 0.25
2 -0.75
3 -1.75
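For an lm() fit, the predicted values are usually obtained with fitted() or predict(), and the same element-wise arithmetic then applies to the whole vector at once. A sketch only; the model and column names below are assumptions, not taken from your post:
fit   <- lm(trips ~ water_quality + scenery + parking, data = mydata)
y_hat <- fitted(fit)                 # one prediction per observation (~260 here)
deviation <- mydata$trips - y_hat    # element-wise, no loop needed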
Without more information about your particular issue, that is all the help that I can offer. In the future, add a reproducible example that illustrates your data and the specific issue that you are stuck on.
This is probably a simple problem but I just can't work it out. I have a dataframe of biochemistry test results. Some of these tests like base_crp are returning values like <3 because of limits of detection. I need to impute this data before moving forward. I'd like to do this properly, so not just substituting.
I tried multLN from the zCompositions package, but it seems to think that all the <3 values are negative (the error says X contains negative values). There also doesn't seem to be much documentation out there; is this an obscure package?
I also looked at LODI, but it wants me to specify covariates for the imputation model; is there a proper way to select these? Anyway, I picked 3 that would theoretically correlate well and used this code:
clmi.out <- clmi(formula = log(base_crp) ~ base_wcc + base_neut + base_lymph, df = all, lod = crplim, seed = 12345, n.imps = 5)
where base_crp is the variable I'm trying to fix. I replaced all the <3 with NA and inserted a new column all$crplim <- "3". However, this is just returning
Error in sprintf("%s must be numeric.") : too few arguments.
Even if I can get LODI working, I'm not sure if it's the right tool. I'm only an undergraduate university student with little statistical background, so I don't really understand what I'm doing; I just want something that will populate the column with numbers so I can move forward with Pearson correlations, linear regressions, etc. I would really appreciate some help with this. Thanks in advance.
I've done a bit of statistical modelling of CRP (C reactive protein) levels before - see this peer-reviewed paper as an example. CRP has an approximately log-normal distribution, and the median value in an unselected population across all testing indications is usually around 3.5 mg/l (most healthy people will be in that "<3mg/l" category). You probably don't want to be using an imputation model, because these are for missing data. The low CRP data is not missing. You already know it lies within a certain range, so you are losing information if you do the imputation this way.
It is reasonable to want to replace "<3" with a numeric value for regressions etc, as long as you are using this to correlate CRP with clinical findings etc and not (as Ben Norris points out) for CRP machine calibration.
I can tell you from data of over 10,000 samples of high-sensitivity CRP measurements in the study I linked above that the mean CRP in people with CRP < 3 is about 1.3, and it would be reasonable to substitute all of your "CRP < 3" measurements with 1.3 for most real-world clinical observational studies.
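A minimal sketch of that substitution (assuming base_crp is still the raw character column containing the "<3" entries, as in the code further down):
base_crp[base_crp == "<3"] <- 1.3
base_crp <- as.numeric(base_crp)   # the column is character because of the "<3" entries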
If you really need to have plausible numerical values on the missing CRP, you could impute the bottom half of a lognormal distribution. The following function would give you numbers that would likely be indistinguishable from real-life CRP measurements:
impute_crp <- function(n)
{
  # draw plenty of values from a lognormal roughly matching population CRP,
  # keep only those below the 3 mg/l detection limit, and return n of them
  # rounded to one decimal place
  x <- exp(rnorm(10 * n, 1.355, 1.45))
  round(x[x < 3][seq(n)], 1)
}
So you could do
impute_crp(10)
#> [1] 1.5 2.0 1.1 0.4 2.5 0.1 0.7 1.5 1.4 0.4
And
base_crp[base_crp == "<3"] <- impute_crp(length(which(base_crp == "<3")))
However, you will notice that I didn't use imputation at all in my own CRP models. Replacing the lower value with the threshold of detection was good enough for the purposes of modelling - and I'm fairly sure whether you replace the "< 3" with a lognormal tail, or all 1.3, or all 2, it will make no difference to the conclusions you are trying to draw.
I would appreciate some input on this a lot!
I have data for 5 time series (an example of one step in the series is in the plot below), where each step in the series is a vertical profile of species sightings in the ocean. Observations within a profile are spaced vertically by 0.1 m, and the 5 profiles were taken 6 h apart in time.
What I want to do is calculate the multivariate cross-correlation between all series in order to find out at which lag the profiles are most correlated and stable over time.
Profile example (plot not reproduced here):
I find the R documentation on this not so great, so what I have done so far is use the MTS package with the ccm function to create cross-correlation matrices. However, interpreting the figures is rather difficult with such sparse documentation. I would appreciate some help with that a lot.
Data example:
http://pastebin.com/embed_iframe.php?i=8gdAeGP4
Save in file cross_correlation_stack.csv or change as you wish.
library(dplyr)
library(MTS)
library(data.table)
d1 <- file.path('cross_correlation_stack.csv')
d2 = read.csv(d1)
# USING package MTS
mod1<-ccm(d2,lag=1000,level=T)
#USING base R
acf(d2,lag.max=1000)
# MQ plot also from MTS package
mq(d2,lag=1000)
The ccm command produces several figures, and the acf command from above produces another (plots not reproduced here).
My question now is whether I am going in the right direction, or whether there are better-suited packages and commands.
Since the default figures don't get any titles etc., what am I looking at, specifically in the ccm figures?
The acf command was proposed somewhere, but can I use it here? Its documentation says it ... calculates autocovariance or autocorrelation ..., which I assume is not what I want. But then again it's the only command that seems to work on multivariate data. I am confused.
The plot with the significance values shows that after a lag of about 150 (15 meters) the p-values increase. How would you interpret that with regard to my data: 0.1 m intervals of species sightings, with many lags up to 100-150 being significant? Would that mean that peaks in sightings are stable over the 5 time steps on a scale of 150 lags, i.e. 15 meters?
Either way, it would be nice if somebody who has worked with this before could explain what I am looking at! Any input is highly appreciated!
You can use the base R function ccf(), which will estimate the cross-correlation function between any two variables x and y. However, it only works on vectors, so you'll have to loop over the columns of d2 (the data frame read in above, not the file path d1). Something like:
cc <- vector("list", choose(dim(d2)[2], 2))              # one slot per column pair
par(mfrow = c(ceiling(choose(dim(d2)[2], 2) / 2), 2))    # grid of plots, two per row
cnt <- 1
for (i in 1:(dim(d2)[2] - 1)) {
  for (j in (i + 1):dim(d2)[2]) {
    cc[[cnt]] <- ccf(d2[, i], d2[, j],
                     main = paste0("Cross-correlation of ", colnames(d2)[i],
                                   " with ", colnames(d2)[j]))
    cnt <- cnt + 1
  }
}
This will plot each of the estimated CCF's and store the estimates in the list cc. It is important to remember that the lag-k value returned by ccf(x,y) is an estimate of the correlation between x[t+k] and y[t].
All of that said, however, the ccf is only defined for data that are more-or-less normally distributed, but your data are clearly overdispersed with all of those zeroes. Therefore, lacking some adequate transformation, you should really look into other metrics of "association" such as the mutual information as estimated from entropy. I suggest checking out the R packages entropy and infotheo.
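As a possible starting point for that (not from the original answer, just a sketch): the infotheo package can estimate mutual information after discretising the counts; handling the lag structure would still require shifting one series against the other.
library(infotheo)
d2_disc <- discretize(d2)                    # equal-frequency bins by default
mutinformation(d2_disc[, 1], d2_disc[, 2])   # mutual information (in nats) between the first two profiles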
So help would be much appreciated!
I have already completed a CCA plot which shows 7 sites, about 15 species and 6 environmental variables. However, it is telling me that the unconstrained (residual) part is 0, and I cannot run an ANOVA on my CCA results to see what the significance of the axes is. I also attempted to use the spenvcor function to look at the species-environment correlation, and it gives me 1s for all of the axes.
So I am definitely doing something wrong, but I just can't figure out what.
Here is my code:
library(vegan)   # provides cca() and spenvcor()
MayEnviro <- read.csv("MayEnviro.csv", header = TRUE)
MaySpecies <- read.csv("MaySpecies.csv", header = TRUE)
t <- cca(MaySpecies,
         MayEnviro[, c("AFDM", "Chla", "Chloride", "TSS", "TN", "TP", "Velocity")])
spenvcor(t)
The number of axes you can derive from a data set with n = 7 sites, m = 15 species is min(n, m) - 1, which is 6. As you also have 6 constraints (the environmental variables) you explain the data exactly and there is no residual variance to work with. In fact there are no constraints on the solution and the result is just like CA.
In this instance, with so few sites, you should look to fit a model with fewer constraints, say 2 or 3 at most.
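A sketch of what that could look like with vegan (the choice of TN, TP and Velocity here is purely illustrative, not a recommendation):
library(vegan)
t2 <- cca(MaySpecies ~ TN + TP + Velocity, data = MayEnviro)
anova(t2, by = "axis", permutations = 999)   # permutation test for each constrained axis
spenvcor(t2)                                 # species-environment correlations should no longer all be 1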
Hi… I have a very basic question regarding the input of weighted data into R. Currently I have to process data (mostly for curve-fitting purposes) similar to the following:
> head(mydata, 10)
v sf
1 0.3003434 3.933106
2 0.3027852 5.947432
3 0.3052270 9.832596
4 0.3076688 12.927439
5 0.3101106 14.197519
6 0.3125525 13.572904
7 0.3149943 11.691078
8 0.3174361 9.543095
9 0.3198779 8.048558
10 0.3223197 7.660252
The first column is the data (increasing and equidistant), while the 2nd column gives the frequency (weights). Currently these weights don't add up to one, but I can easily fix that.
Now, I searched for weighted data in R and the closest I found was via using the survey package and the svydesign() command, but is it really that hard?
What I did to work around my lack of knowledge, and what got me in trouble with the Kolmogorov-Smirnov test (more below), is the following:
> y <- with(mydata, c(rep(v, times=floor(10*sf))))
which will repeat the elements of the first column in proportion to the corresponding weight (times 10 to get whole numbers). But now the problem is that when I conduct the Kolmogorov-Smirnov goodness-of-fit test, I get a warning that the p-value cannot be computed since the data has ties.
Question is: How can I input and process the data in its original form (i.e. as a frequency or probability table) for the purpose of curve fitting? Thanks.
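Not part of the original post, just an illustrative sketch: some base R functions accept the weights directly, so the frequency column can be used without expanding the data, e.g. a weighted kernel density estimate of v with the normalised frequencies:
w <- mydata$sf / sum(mydata$sf)
d <- density(mydata$v, weights = w)   # weighted KDE, no rep() needed
plot(d)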
I'm working on a very big data set (CSV).
The data set is composed of both numeric and categorical columns.
One of the columns is my "target column", meaning I want to use the other columns to determine which value (out of 3 possible known values) is likely to be in the "target column", and in the end check my classification against the real data.
My question:
I'm using R.
I am trying to find a way to select the subset of features which gives the best classification.
Going over all the subsets is impossible.
Does anyone know an algorithm, or can you think of a way to do it in R?
This seems to be a classification problem. Without knowing the number of covariates you have for your target I can't be sure, but wouldn't a neural network solve your problem?
You could use the nnet package, which fits a feed-forward neural network and works with multiple classes. Having categorical columns is not a problem, since you can just use factors.
Without a data sample I can only explain it a bit, but mainly you would use the function:
newNet<-nnet(targetColumn~ . ,data=yourDataset, subset=yourDataSubset [..and more values]..)
You obtain a trained neural net. What is also important here is the size of the hidden layer, which is a tricky thing to get right. As a rule of thumb, it should be roughly 2/3 of the number of inputs plus the number of outputs (3 in your case).
Then with:
myPrediction <- predict(newNet, newdata = yourDataset[-yourDataSubset, ])   # i.e. the rows not used for training
You obtain the predicted values. As for how to evaluate them, I use the ROCR package, but it currently only supports binary classification; I guess a Google search will turn up some help.
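Not from the original answer, but a simple multi-class check is a confusion matrix; the names below follow the sketch above, and the evaluation should use rows not seen in training:
classPred <- predict(newNet, newdata = yourDataset[-yourDataSubset, ], type = "class")
conf <- table(predicted = classPred, actual = yourDataset$targetColumn[-yourDataSubset])
conf
sum(diag(conf)) / sum(conf)   # overall accuracy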
If you are adamant about eliminating some of the covariates, the cor() function may help you to identify the less characteristic ones.
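An illustration only (yourDataset is the placeholder name used above): pairwise correlations among the numeric columns can flag near-redundant covariates.
num_cols <- sapply(yourDataset, is.numeric)
round(cor(yourDataset[, num_cols], use = "pairwise.complete.obs"), 2)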
Edit for a step by step guide:
Let's say we have this dataframe:
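One way to build it, for reproducibility (the values are read off the str() output below):
df <- data.frame(a = c(1, 2, 3, 4, 5),
                 b = c(1, 1.5, 2, 2.5, 3),
                 c = factor(c("red", "red", "blue", "red", "yellow")))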
str(df)
'data.frame': 5 obs. of 3 variables:
$ a: num 1 2 3 4 5
$ b: num 1 1.5 2 2.5 3
$ c: Factor w/ 3 levels "blue","red","yellow": 2 2 1 2 3
The column c has 3 levels, that is, 3 types of values it can take. This is what a dataframe traditionally did by default when a column contained strings instead of numerical values (since R 4.0 strings are no longer converted automatically, so you may need factor() or stringsAsFactors = TRUE).
Now, using the columns a and b, we want to predict which value c is going to take, using a neural network. The nnet package is simple enough for this example. If you don't have it installed, use:
install.packages("nnet")
Then, to load it:
require(nnet)
After this, let's train the neural network with a sample of the data. For that, the call
portion <- sample(1:nrow(df), floor(0.7 * nrow(df)))
will store in portion the row indices of roughly 70% of the dataframe (floor() keeps the sample size a whole number). Now, let's train that net! I recommend you check the documentation of the nnet package with ?nnet for deeper knowledge. Using only the basics:
myNet <- nnet(c ~ a + b, data = df, subset = portion, size = 1)
c ~ a + b is the formula for the prediction: you want to predict the column c using the columns a and b.
data = the data origin, in this case the dataframe df.
subset = self-explanatory: the rows to train on.
size = the size of the hidden layer; as I said, use roughly 2/3 of the number of input columns (a and b) plus the number of outputs.
We have a trained net now; let's use it. Using predict(), you apply the trained net to new values.
newPredictedValues<-predict(myNet,newdata=df[-portion,])
After that, newPredictedValues will have the predictions.
Since you have both numerical and categorical data, you may try an SVM.
I am using SVM and KNN on my numerical data, and I also tried to apply a DNN. DNN training is pretty slow in R, especially on big data. KNN does not need to be trained, but it only works with numerical data. The following is what I am using; maybe you can have a look at it.
library(e1071)   # provides svm()

# Train the model
y_train <- data[, 1]                      # first column is the response variable
x_train <- subset(data, select = -1)
train_df <- data.frame(x = x_train, y = y_train)
svm_model <- svm(y ~ ., data = train_df, type = "C")

# Test
y_test <- testdata[, 1]
x_test <- subset(testdata, select = -1)
# wrap in data.frame(x = ...) so the column names match the training frame
pred <- predict(svm_model, newdata = data.frame(x = x_test))
svm_t <- table(pred, y_test)
sum(diag(svm_t)) / sum(svm_t)             # accuracy