I am working on a diff-in-diff analysis where treatment is measured at the country-year observation and treatment reversals within countries over time are allowed. For the analysis, I use panelMatch to match the countries on 4 covariates. I'm also interested in a qualitative analysis of cases "on and off" the regression line, but I have a few questions about using this analysis and predicted values to select these cases. Specifically:
-When selecting cases that differ in their outcome from panel data, how do I select from/among the different years covered in the data?
-How would I actually obtain the paired countries from panel match to make these comparisons?
Related
So I have a logistic regression model to predict teenage pregnancy, in which I have dummy-coded categorical variables such as religion, race, father's highest degree, and so forth. I also dummy-coded the income groups (low, medium, and high).
I was wondering whether I was actually meant to turn the income group into dummy variables. With religion, for example, the categories are unordered, but income is still ordered: low incomes have lower dollar amounts and high incomes have higher dollar amounts. I'm not sure I am explaining this correctly, but I was shocked to see that my stepwise regression in R dropped income as a predictor of teenage pregnancy. After further research, I think what I am describing is an ordinal variable.
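One way to look at the dummy-versus-ordinal question is to fit the model both ways and compare the fits. The sketch below uses simulated data; the names `income`, `income_score`, and `preg` are hypothetical, and the assumed "true" effect is made up for illustration only.

```r
# Simulated stand-in data; all variable names here are hypothetical.
set.seed(1)
n <- 500
income <- factor(sample(c("low", "medium", "high"), n, replace = TRUE),
                 levels = c("low", "medium", "high"))
# Assumption for the simulation: risk falls steadily as income rises.
p <- plogis(-0.5 - 0.6 * (as.integer(income) - 1))
preg <- rbinom(n, 1, p)

# (a) Dummy coding: one coefficient per non-reference level (2 df).
fit_dummy <- glm(preg ~ income, family = binomial)

# (b) Linear trend across the ordered levels (1 df).
income_score <- as.integer(income)  # low = 1, medium = 2, high = 3
fit_trend <- glm(preg ~ income_score, family = binomial)

# The trend model is nested in the dummy model, so a likelihood-ratio
# test checks whether the extra flexibility of the dummies is needed.
lrt <- anova(fit_trend, fit_dummy, test = "Chisq")
print(lrt)
```

If the test gives no evidence against the trend model, a single ordered score uses fewer degrees of freedom than separate dummies, which can matter when a stepwise procedure decides whether to keep the variable.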
I have longitudinal data with 6 repeated measurements of weight and height for each child. There are three treatment groups, and I need to compare the growth trend between groups.
-How can I plot all three groups in one figure?
-What would be more intuitive knot positions? The idea was to place knots at four-month intervals defined by the visits. When I used B-splines via bs(), it seemed I could only place knots in terms of the x-variable (age).
So far I have tried running this code:
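library(splines); library(mgcv)  # bs() and gam(); assuming gam() comes from mgcv
g1 <- na.omit(subset(visit_no_NA, study_group == 1))  # subset the rows once, not inside the formula
gam_bs <- gam(weight ~ bs(age, knots = mean(age)), data = g1)  # one interior knot at group 1's mean age
lm1 <- lm(weight ~ bs(age, df = 5), data = g1)  # cubic B-spline with 5 degrees of freedom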
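A possible way to get all three groups into one plot is to fit a single model with a spline-by-group interaction and overlay the predicted curves. The sketch below uses simulated data. It assumes visits happen at roughly the same ages for all children, so knots at the scheduled visit ages are meaningful. The column names follow the question, and everything else (values, `grid`, `pred`) is made up for illustration.

```r
library(splines)  # bs()

# Simulated stand-in for visit_no_NA; the values are invented.
set.seed(2)
visits <- seq(0, 20, by = 4)             # six visits, four months apart
d <- expand.grid(id = 1:30, age = visits)
d$study_group <- factor((d$id - 1) %% 3 + 1)
d$age <- d$age + runif(nrow(d), 0, 1)    # actual age varies around the schedule
d$weight <- 3.5 + 0.4 * d$age - 0.008 * d$age^2 +
  0.3 * as.integer(d$study_group) + rnorm(nrow(d), sd = 0.4)

# Interior knots at the scheduled visit ages (months 4, 8, 12, 16).
knots <- visits[2:5]

# One model for all groups: the interaction gives each group its own curve.
fit <- lm(weight ~ bs(age, knots = knots) * study_group, data = d)

# Predict each group's curve on a common age grid.
grid <- expand.grid(age = seq(min(d$age), max(d$age), length.out = 100),
                    study_group = levels(d$study_group))
grid$pred <- predict(fit, newdata = grid)

# Overlay raw points and the three fitted curves in one plot.
plot(weight ~ age, data = d, col = as.integer(d$study_group),
     pch = 16, cex = 0.5)
for (g in levels(d$study_group)) {
  lines(pred ~ age, data = grid[grid$study_group == g, ],
        col = as.integer(g), lwd = 2)
}
legend("topleft", legend = paste("group", levels(d$study_group)),
       col = 1:3, lwd = 2)
```

This sketch ignores the within-child correlation of repeated measurements. A mixed model with a random effect per child would be the more careful choice for inference on group differences.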
I have conducted a logistic regression with a binary dependent variable and 5 independent variables. The data frame I drew these variables from is survey data asking whether a person voted for or against a policy change (the binary dependent variable), with the other variables being questions about income, location, and other personal information that could indicate whether they would vote for or against the policy change.
Having conducted the regression, I'd now like to calculate the predicted probability that each person would have voted yes/no to see how informative those variables are. In total my dataframe has information on 3000 people and I'd like to calculate the predicted probability of voting for/against for every single row/person.
What methods are available for doing so?
Appreciate the help!
You can use the predict function in order to calculate the predicted probabilities.
predict(model, newdata, type="response")
Here, model is the fitted logistic regression (the result of the glm() function), and newdata is a data frame containing all the variables used in the model for every individual whose probability you want. If newdata is omitted, predict returns the fitted probabilities for the data used to fit the model.
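As a minimal sketch of that call on a 3000-row survey: the data below are simulated, and the column names (`income`, `urban`, `vote_yes`, `p_yes`, `p_no`) are hypothetical stand-ins for the real survey variables.

```r
# Simulated survey; variable names are hypothetical stand-ins.
set.seed(3)
n <- 3000
survey <- data.frame(
  income = rnorm(n, mean = 50, sd = 15),
  urban  = rbinom(n, 1, 0.6)
)
survey$vote_yes <- rbinom(
  n, 1, plogis(-1 + 0.02 * survey$income + 0.5 * survey$urban)
)

model <- glm(vote_yes ~ income + urban, family = binomial, data = survey)

# type = "response" returns probabilities; the default type ("link")
# would return log-odds instead.
survey$p_yes <- predict(model, newdata = survey, type = "response")
survey$p_no  <- 1 - survey$p_yes

head(survey)
```

Each row now carries its own predicted probability of voting yes and of voting no. Because these are in-sample predictions, they tend to look somewhat better than the model would perform on new respondents.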
I am running a multiple linear regression model using the lm function in R to study the impact of some characteristics on gene expression level.
My data matrix contains one continuous dependent variable (gene expression level) and 50 explanatory variables, which are the counts of these characteristics for each gene. Many of these counts are zero.
I checked all of the regression assumptions and found two issues: the first is heteroscedasticity and the second is autocorrelation. The latter is not serious. I wonder whether multiple linear regression is appropriate here, and whether any other regression technique can be used to solve these problems.
I used a stepwise method and got just 11 significant variables among those 50. But when I checked for heteroscedasticity, I found it still appears, as shown below. The sample size is 15,000 genes (15,000 rows and 50 columns).
Updated image, with weights added to lm call, re comments
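Since the update mentions passing weights to lm, one option sometimes used in this situation is feasible weighted least squares, where the weights are estimated from the residuals of an initial fit. The sketch below is an illustration on simulated data, not the poster's actual procedure. The names `x1`, `x2`, `y`, `var_fit`, and `w` are hypothetical, and the assumed variance structure is invented.

```r
# Simulated stand-in: zero-heavy count predictors, spread growing with x1.
set.seed(4)
n <- 2000
x1 <- rpois(n, 1)
x2 <- rpois(n, 0.5)
y <- 2 + 0.5 * x1 - 0.3 * x2 + rnorm(n, sd = 0.5 + 0.5 * x1)

# Step 1: ordinary least squares.
ols <- lm(y ~ x1 + x2)

# Step 2: model the log squared residuals to estimate how the error
# variance depends on the predictors.
var_fit <- lm(log(resid(ols)^2) ~ x1 + x2)
w <- 1 / exp(fitted(var_fit))  # weights = inverse of estimated variance

# Step 3: refit with the estimated weights.
wls <- lm(y ~ x1 + x2, weights = w)

# Compare residual spread; for the weighted fit, the weighted residuals
# are the ones that should look homoscedastic.
par(mfrow = c(1, 2))
plot(fitted(ols), resid(ols), main = "OLS",
     xlab = "fitted", ylab = "residual")
plot(fitted(wls), weighted.residuals(wls), main = "WLS",
     xlab = "fitted", ylab = "weighted residual")
```

When checking a weighted fit, plotting the raw residuals will still show the original fan shape, so the diagnostic plot should use the weighted residuals.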
I have a problem deriving a meaningful generalized linear regression model. The response variable depends on several predictors, some of which are factors and others numeric vectors (elevation, slope, depth to permafrost (vector), exposure (rank, vector), permeability (factor)). The data fall into seven groups that correspond to geographic location. The predictors are constant within each group, so only the response (erosion rate) changes. There are 1700 data points.
I would like to set up a generalized linear regression model to test the influence of the predictors on the erosion rate. As the data are spatially autocorrelated, the analysis should take 30 samples from each group and then run the glm, say, 10,000 times. I would like to compare the spread of the coefficients. Any help is highly appreciated!
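The resampling scheme described above could be sketched as follows. The data are simulated, the Gamma family with log link is an assumption (the question does not name a family), and the names `site`, `dat`, and `coefs` are hypothetical. Only two numeric predictors are used, because predictors that are constant within seven groups can identify at most six slopes plus an intercept.

```r
set.seed(5)

# Simulated stand-in: 7 groups, predictors constant within each group.
n_groups <- 7
site <- data.frame(
  group     = 1:n_groups,
  elevation = c(120, 340, 560, 210, 480, 90, 400),
  slope     = c(5, 12, 20, 8, 15, 3, 10)
)
dat <- site[rep(1:n_groups, length.out = 1700), ]
mu <- exp(0.2 + 0.001 * dat$elevation + 0.03 * dat$slope)
dat$erosion <- rgamma(nrow(dat), shape = 2, rate = 2 / mu)

n_rep <- 1000  # the question suggests 10,000; reduced to keep the sketch quick

coefs <- t(replicate(n_rep, {
  # Draw 30 row indices from each group without replacement.
  idx <- unlist(lapply(split(seq_len(nrow(dat)), dat$group),
                       function(i) sample(i, 30)))
  # Family is an assumption; swap in whatever suits the erosion data.
  coef(glm(erosion ~ elevation + slope,
           family = Gamma(link = "log"), data = dat[idx, ]))
}))

# Spread of each coefficient across the resamples.
apply(coefs, 2, quantile, probs = c(0.025, 0.5, 0.975))
```

Each row of `coefs` holds the coefficients from one resample, so histograms or quantiles per column show the spread. Subsampling 30 points per group balances the groups but does not by itself remove spatial autocorrelation within a group.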