Error in looped regression using the prodest package in R

I have an unbalanced panel, and I am trying to run this regression to estimate a production function.
I know this regression has a bootstrap/iterative component.
This is the package and a dataset similar to mine:
library(prodest)
data(chilean)
This dataset is very similar to mine; the only difference is that chilean$idvar is a numeric id while my id (base2$ruc) is a string, and my data has many more observations.
This is the code for the regression:
On Chilean data:
LP_model1 <- prodestLP(chilean$Y,
                       fX = cbind(chilean$fX1, chilean$fX2),
                       sX = chilean$sX,
                       pX = chilean$pX,
                       idvar = chilean$idvar,
                       timevar = chilean$timevar,
                       R = 30)
And it works
This is the call on my data: all numeric, no missing values, the only differences being the idvar and some extra columns that I don't use in this process.
levpet <- prodestLP(base2$c_y,
                    fX = base2$c_l,
                    sX = base2$c_k,
                    pX = base2$c_m,
                    idvar = base2$ruc,
                    timevar = base2$year,
                    R = 30)
and I get the following error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = c(42719L, 82109L, 82678L, :
replacement has 469326 rows, data has 78221
My data has 78221 obs, while the chilean dataset has 2544.
I don't understand what the error is saying or how to solve it.
Can anybody please help me with this one?
Thanks in advance.
This is an example using the chilean model: https://rpubs.com/hacamvan/319728
This is the package documentation: https://cran.r-project.org/web/packages/prodest/prodest.pdf
And here's what my variables mean:
c_y is log of value added
c_l is log of wages
c_k is log of capital
c_m is log of materials
ruc is the firm identifier (a string)
year is a numeric.
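One hedged workaround sketch, assuming the string id is indeed the culprit (not a confirmed fix): map the string firm ids to integers so that idvar is numeric, as in the chilean data.
# Hypothetical workaround: convert the string firm ids to integer ids
base2$ruc_num <- as.integer(factor(base2$ruc))
levpet <- prodestLP(base2$c_y,
                    fX = base2$c_l,
                    sX = base2$c_k,
                    pX = base2$c_m,
                    idvar = base2$ruc_num,
                    timevar = base2$year,
                    R = 30)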

Related

Error in array, regression loop using "plyr"

Good morning,
I'm currently trying to run a truncated regression loop on my dataset. Below is a reproducible example of my data frame.
library(plyr)
library(truncreg)
df <- data.frame("grid_id" = rep(c(1, 2), 6),
                 "htcm" = rep(c(160, 170, 175), 4),
                 stringsAsFactors = FALSE)
View(df)
Now I tried to run a truncated regression on the variable "htcm", grouped by grid_id, to obtain only the coefficients (the intercept as well as sigma), which I then stored in a data frame. This code is based on ideas from #hadley:
reg <- dlply(df, "grid_id", function(.)
truncreg(htcm ~ 1, data = ., point = 160, direction = "left")
)
regcoef <- ldply(reg, coef)
While this code works for one of my three datasets, I receive error messages for the other two. The datasets do not differ in any column, only in their absolute length
(length(df1) = 4,000; length(df2) = 100,000; length(df3) = 13,000)
The error message which occurs is
"Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), : 'data' must be of type vector, was 'NULL'
I do not even know how to reproduce an example where this error occurs, because the code works totally fine with one of my three datasets.
I already accounted for missing values in both columns.
Does anyone have a guess as to what I can fix in this code?
Thanks!!
EDIT:
I think I found the origin of the error in my code. The problem is most likely that a truncated regression model estimates a standard deviation, which implies more than one observation for any group. As there are also groups with only n = 1 observations included, the standard deviation equals zero, which causes my code to detect a vector of length NULL. How can I drop the groups with fewer than two observations within the regression code?
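One way to do that (a sketch assuming the grouping column is grid_id, as in the example above) is to filter out the small groups before the dlply() call:
# Keep only grid_id groups with at least two observations
counts <- table(df$grid_id)
df_ok <- df[df$grid_id %in% names(counts)[counts >= 2], ]
reg <- dlply(df_ok, "grid_id", function(.)
  truncreg(htcm ~ 1, data = ., point = 160, direction = "left")
)
regcoef <- ldply(reg, coef)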

How to apply weights associated with the NIS (National inpatient sample) in R

I am trying to apply the weights provided with the NIS data using the R package "survey", but I have been unsuccessful. I am fairly new to R and to survey commands.
This is what I have tried:
# Create the unweighted dataset
library(readstata13)  # read.dta13() comes from this package
d <- read.dta13(path)
sum(d$DISCWT) # This produces the correct weighted number of cases I need.
library(survey)
# Create survey object
dsvy <- svydesign(id = ~ d$HOSP_NIS, strata = ~ d$NIS_STRATUM, weights = ~ d$DISCWT, nest = TRUE, data = d)
d$count <- 1
svytotal(~d$count, dsvy)
However I get the following error after running the survey total:
Error in onestrat(x[index, , drop = FALSE], clusters[index], nPSU[index][1], :
Stratum (1131) has only one PSU at stage 1
Any help would be greatly appreciated, thank you!
The error indicates that you have specified a design where one of the strata has just a single primary sampling unit. It's not possible to get an unbiased estimate of variance for a design like that: the contribution of stratum 1131 will end up as 0/0.
As you see, R's default response is to give an error, because a reasonably likely explanation is that the data or the svydesign statement is wrong. Sometimes, as here, that's not what you want, and the global option 'survey.lonely.psu' describes other ways to respond. You want to set
options(survey.lonely.psu = "adjust")
This and other options are documented at help(surveyoptions)
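For example (a sketch assuming d is the data frame read above), the option can be set before the design is built:
library(survey)
# handle strata with a single PSU instead of raising an error
options(survey.lonely.psu = "adjust")
d$count <- 1
dsvy <- svydesign(id = ~HOSP_NIS, strata = ~NIS_STRATUM,
                  weights = ~DISCWT, nest = TRUE, data = d)
svytotal(~count, dsvy)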

Cross validation help: Error in model.frame.default(as.formula(delete.response(Terms)), newdata, : variable lengths differ (found for 'fun factor')

So I have a specific error that I can't figure out. From searching, it seems the model and the cross validation set do not have data with the same levels to fit the model, and I am trying to understand this fully for my use case. Basically I am building a QDA model to predict vehicle country based on numeric values. This code will run for anyone since it is a public Google Sheets document. For those of you who follow Doug Demuro on YouTube, you may find this a tad bit interesting.
#load dataset into r
library(gsheet)
url = 'https://docs.google.com/spreadsheets/d/1KTArYwDWrn52fnc7B12KvjRb6nmcEaU6gXYehWfsZSo/edit'
doug_df = read.csv(text=gsheet2text(url, format='csv'), stringsAsFactors=FALSE,header=FALSE)
#begin cleanup. remove first blank rows of data
doug_df = doug_df[-c(1,2,3), ]
attach(doug_df)
#name columns appropriately
names(doug_df) = c("year","make","model","styling","acceleration","handling","fun factor","cool factor","total weekend score","features","comfort","quality","practicality","value","total daily score","dougscore","video duration","filming city","filming state","vehicle country")
#removing categorical columns and columns not being used for discriminate analysis to include totals columns
library(dplyr)
doug_df = doug_df %>% dplyr::select (-c(make,model,`total weekend score`,`total daily score`,dougscore,`video duration`,`filming city`,`filming state`))
#convert from character to numeric
num.cols <- c("year","styling","acceleration","handling","fun factor","cool factor","features","comfort","quality","practicality","value")
doug_df[num.cols] <- sapply(doug_df[num.cols], as.numeric)
`vehicle country` = as.factor(`vehicle country`)
#create a new column to reflect groupings for response variable
doug_df$country.group = ifelse(`vehicle country`=='Germany','Germany',
                        ifelse(`vehicle country`=='Italy','Italy',
                        ifelse(`vehicle country`=='Japan','Japan',
                        ifelse(`vehicle country`=='UK','UK',
                        ifelse(`vehicle country`=='USA','USA','Other')))))
#remove the initial country column
doug_df = doug_df %>% dplyr::select (-c(`vehicle country`))
#QDA with multiple predictors
library(MASS)
qdafit1 = qda(country.group~styling+acceleration+handling+`fun factor`+`cool factor`+features+comfort+quality+value,data=doug_df)
#predict using model and compute error
n=dim(doug_df)[1]
fittedclass = predict(qdafit1,data=doug_df)$class
table(doug_df$country.group,fittedclass)
Error = sum(doug_df$country.group != fittedclass)/n; Error
#conduct k 10 fold cross validation
allpredictedCV1 = rep("NA",n)
cvk = 10
groups = c(rep(1:cvk,floor(n/cvk)))
set.seed(4)
cvgroups = sample(groups,n,replace=TRUE)
for (i in 1:cvk) {
  qdafit1 = qda(country.group~styling+acceleration+handling+`fun factor`+`cool factor`+features+comfort+quality+value, data=doug_df, subset=(cvgroups!=i))
  newdata1i = data.frame(doug_df[cvgroups==i,])
  allpredictedCV1[cvgroups==i] = as.character(predict(qdafit1,newdata1i)$class)
}
table(doug_df$country.group,allpredictedCV1)
CVmodel1 = sum(allpredictedCV1!=doug_df$country.group)/n; CVmodel1
This throws the error in the last part of the code, with the cross validation:
Error in model.frame.default(as.formula(delete.response(Terms)), newdata, : variable lengths differ (found for 'fun factor')
Can someone explain in a bit more depth what is happening? I think the variable fun factor doesn't have the same levels in each fold of the cross validation as it did in the model. Now I need to know my options to fix it. Thanks in advance!
EDIT
In addition to the above, I am getting a very similar error when I try to predict a dummy car review.
#build a dummy review and predict it using multiple models
dummy_review = data.frame(year=2014,styling=8,acceleration=6,handling=6,`fun factor`=8,`cool factor`=8,features=4,comfort=4,quality=6,practicality=3,value=5)
#predict vehicle country for dummy data using model 1
predict(qdafit1,dummy_review)$class
This returns the following error:
Error in model.frame.default(as.formula(delete.response(Terms)), newdata, : variable lengths differ (found for 'fun factor')
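One thing worth checking (an assumption based on the error, not verified against this dataset): data.frame() uses check.names = TRUE by default and silently renames "fun factor" to "fun.factor", so predict() cannot find `fun factor` in newdata. Keeping the original names sidesteps that, e.g.:
# Hypothetical fix: preserve the original column names when building newdata
newdata1i = doug_df[cvgroups==i, ]   # already a data frame, no renaming
# or, if wrapping in data.frame() is needed:
newdata1i = data.frame(doug_df[cvgroups==i, ], check.names = FALSE)
# same idea for the dummy review:
dummy_review = data.frame(year=2014, styling=8, acceleration=6, handling=6,
                          `fun factor`=8, `cool factor`=8, features=4, comfort=4,
                          quality=6, practicality=3, value=5, check.names = FALSE)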

elm function in nnfor package does not recognize input for xreg

I'm trying to include an exogenous regressor in my time series analysis:
elm_nn <- elm(ts(df),
              m = 1,
              hd = NULL,
              type = "step",
              reps = 20,
              comb = "median",
              difforder = c(0:12),
              outplot = TRUE,
              sel.lag = FALSE,
              direct = FALSE,
              allow.det.season = TRUE,
              det.type = "auto",
              xreg = reg)
Using
df<-c(0,0,173,0,0,80,0)
reg<-c(182,135,30,203,150,83,163)
Information in the package documentation alludes to xreg being a column, presumably from a table. I had a vector that I had used as xreg with ARIMA, where it performed without issue. Using the same vector in elm, however, generates the error Error in xreg[1, ] : incorrect number of dimensions. I can't find information on exogenous regressors for extreme learning machines that deals specifically with time series. Any help would be greatly appreciated.
It appears that when I converted the list to a table, R added an additional column, called Var1, with alphanumeric values that looked like A1, B1, ..., Z1, A2, .... Trying to remove this column converted it into a single row and generated the following warning message:
Warning message:
In reg$Var1 <- NULL : Coercing LHS to a list
By converting the vector to a data frame instead, I was able to use it without issue.
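A minimal sketch of that workaround (assuming the same reg vector as above), so that xreg has explicit column structure and xreg[1, ] works:
reg <- c(182, 135, 30, 203, 150, 83, 163)
reg_xreg <- data.frame(reg = reg)   # one named column, no stray Var1 column
# then pass xreg = as.matrix(reg_xreg) (or reg_xreg) in the elm() call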

Error when doing linear regression using zoo objects ... Error in `$<-.zoo`(`*tmp*`

I am new to R and slowly getting acquainted. My question refers to the following piece of code.
I am creating a zoo object with the following headers and then filtering by date. On the filtered dates I am subtracting two columns (Tom from Elena). Everything works fine up to this point.
Code below:
b <- read.zoo(b1, header = TRUE, index.column = 1, format = "%d/%m/%Y")
startDate = "2013/11/02"
endDate = "2013/12/20"
dates <- seq(as.Date(startDate), as.Date(endDate), by=1)
TE = b[dates]$Tom - b[dates]$Elena
However, I am then regressing the result of my subtraction (TE above) on Elena, and I get an error message every time I try to run this regression:
TE$model <- lm(TE ~ b[dates]$Elena)
Error in `$<-.zoo`(`*tmp*`, "model", value = list(coefficients = c(-0.0597128230859905, :
  not possible for univariate zoo series
I have tried creating a data frame and then doing the regression, but to no avail. Any help would be appreciated. Thanks.
You cannot add the outcome of a regression (a list of class lm) to a time series of class zoo.
I recommend saving the model in a separate object, e.g.,
fit <- lm(TE ~ b[dates]$Elena)
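From there the fitted model can be inspected and used as usual, for example:
summary(fit)   # coefficient estimates, standard errors, and fit statistics
coef(fit)      # just the estimated coefficients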
