Create a new data frame to do piecewise linear regression on percentages after doing serial crosstabs in R - r

I am working in R. I need to identify predictors of a higher active trial start percentage over time (StartDateMonthsYrs). I will use linear regression with Percent.Active as the dependent variable.
My original data frame is attached, and the active trial start percentage over time that I obtained (named Percent.Active) is presented here.
So, I need to assess whether federally sponsored, industry sponsored, or other sponsored trials were associated with a higher active trial start percentage over time. I have many other variables that I need to assess, but this is a sample of my data.
I am thinking of doing a crosstab for each variable (e.g. Federal & Active, then Industry & Active, etc.) in each month (maybe with the help of lapply), then accumulating the resulting percentages in the second sheet and running the analysis on that.
My code for the linear regression is as follows:
q.lm0 <- lm(Percent.Active ~ Time.point+ xyz, data.percentage);summary(q.lm0)
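The crosstab-then-regress idea described above could be sketched in base R roughly as follows, assuming trial-level data with one row per trial. The `trials` data frame, its `Sponsor` and `Active` columns and all values are made up for illustration; only StartDateMonthsYrs, Percent.Active, Time.point and data.percentage are names taken from the question.

```r
# Hypothetical trial-level data: one row per trial (values are invented)
set.seed(1)
trials <- data.frame(
  StartDateMonthsYrs = rep(c("2020-01", "2020-02", "2020-03"), each = 40),
  Sponsor = rep(c("Federal", "Industry", "Other"), length.out = 120),
  Active = rbinom(120, 1, 0.6)
)

# One row per month x sponsor: percentage of active trials in that cell
data.percentage <- aggregate(Active ~ StartDateMonthsYrs + Sponsor,
                             data = trials,
                             FUN = function(x) 100 * mean(x))
names(data.percentage)[names(data.percentage) == "Active"] <- "Percent.Active"

# Numeric time index (1, 2, 3, ...) derived from the sorted month labels
data.percentage$Time.point <- as.integer(factor(data.percentage$StartDateMonthsYrs))

# Same model form as in the question, with Sponsor standing in for xyz
q.lm0 <- lm(Percent.Active ~ Time.point + Sponsor, data = data.percentage)
summary(q.lm0)
```

With several predictors, the same aggregate() call could be repeated per variable (for instance inside lapply) and the results stacked with rbind.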

I'm a little bit confused. You write 'associated'. If you really want to look for an association, then yes, a crosstab is possible and may be sufficient, since association is not the same as causation (which needs a correlation plus a theory behind it). If you are looking for correlation and insights over time, doing a regression with the lm function is not useful.
If you want a regression-type analysis, there are R packages such as plm that can deal with panel data, and you clearly have panel data (time points, the trial labels of interest, and repeated time points for these labels). See this post for information about the package: https://stackoverflow.com/questions/2804001/panel-data-with-binary-dependent-variable-in-r
I'm writing this because your Percent.Active variable is only a binary 0/1 outcome, and I'm not sure whether that is on purpose. Even if your outcome is not binary, the plm package might help, and you will find other packages mentioned in that post.

Related

How do I find out which observations of my dataset have been used for my mlm in R (nlme)?

I have longitudinal data and specified 3 multilevel models for different outcomes with nlme in R.
'model <- lme (...)'
They all are based on the same dataset.
Now,
'summary(model)'
shows me that the observations used for my final three models vary.
Probably, this is due to missing data that is different for every outcome (predictors stayed pretty much the same).
Is there a way to see which observations of my dataset were included in each model? Note that lme does not give me an S4 object, but medMer. Therefore,
'model@frame'
unfortunately does not work.
My aim is to give precise sample characteristics for each model. Therefore, I somehow need to address the observations included in each of them.
Thank you for any thoughts on this!
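Not an answer from the original thread, just a minimal sketch of one possible approach, assuming the rows were dropped through na.action = na.omit. An lme fit is an S3 list of class "lme", so there is no @frame slot; instead, the rows with complete data on every variable in the model can be recovered from the original data frame. The example data (a copy of nlme's Orthodont with a few values set to NA) and the names dat, vars and used are purely illustrative.

```r
library(nlme)

# Illustrative data: plain copy of Orthodont with some outcomes set to missing
dat <- data.frame(
  distance = Orthodont$distance,
  age      = Orthodont$age,
  Sex      = Orthodont$Sex,
  Subject  = factor(as.character(Orthodont$Subject))
)
dat$distance[c(3, 10, 57)] <- NA

model <- lme(distance ~ age + Sex, random = ~ 1 | Subject,
             data = dat, na.action = na.omit)

# Variables in the fixed part plus the grouping variable (added by hand here)
vars <- c(all.vars(formula(model)), "Subject")

# Rows with no missing value on any model variable are the rows lme kept
used <- dat[complete.cases(dat[, vars]), ]

# Sanity check: same count as the number of fitted values
nrow(used) == length(fitted(model))
```

The `used` data frame can then be summarised to describe the sample of that particular model; repeating this per model gives the three sets of sample characteristics.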

Did I just do an ANCOVA or MANOVA?

I’m trying to do an ANCOVA here ...
I want to analyze the effect of EROSION FORCE and ZONATION on all the species (listed with small letters) in each POOL.STEP (ranging from 1-12/1-4), while controlling for the effect of FISH.
I’m not sure if I’m doing it right. What is the command for ANCOVA?
So far I used lm(EROSIONFORCE~ZONATION+FISH,data=d), which yields:
So what I see here is that both erosion force percentage (intercept?) and sublittoral zonation are significant in some way, but I’m still not sure if I’ve done an ANCOVA correctly here or is this just an ANOVA?
In general, ANCOVA (analysis of covariance) is simply a special case of the general linear model with one categorical predictor (factor) and one continuous predictor (the "covariate"), so lm() is the right function to use.
However ... the bottom line is that you have a moderately challenging statistical problem here, and I would strongly recommend that you try to get local help (if you're working within a research group, can you consult with others in your group about appropriate methods?). I would suggest following up either on CrossValidated or r-sig-ecology@r-project.org.
By putting EROSIONFORCE on the left side of the formula, you're specifying that you want to use EROSIONFORCE as a response (dependent) variable, i.e. your model is estimating how erosion force varies across zones and for different fish numbers. It says nothing about species response.
If you want to analyze the response of a single species to erosion and zone, controlling for fish numbers, you need something like
lm(`Acmaeidae s...` ~ EROSIONFORCE+ZONATION+FISH, data=your_data)
The lm() suggestion above would do each species independently, i.e. you'd have to do a separate analysis for each species. If you also want to do it separately for each POOL.STEP, you're going to have to do a lot of separate analyses. There are various ways of automating this in R; the most idiomatic is probably to melt your data (see reshape2::melt or tidyr::gather) into long format and then use lmList from lme4.
Since you have count data with low means, i.e. lots of zeros (and a few big values), you should probably consider a Poisson or negative binomial model, and possibly even a zero-inflated/hurdle model (i.e. analyze presence-absence and size of positive responses separately).
If you really want to analyze the joint distribution of all species (i.e., a multivariate response, which is the M in MANOVA), you're going to have to work quite a bit harder ... there are a variety of joint species distribution models by people like Pierre Legendre, David Warton and others ... I'd suggest you try starting with the mvabund package, but you might need to do some reading first.
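As a rough sketch of the "one model per species" idea combined with the count-data suggestion, the loop below fits a Poisson GLM for each species column. It uses a plain lapply rather than the melt/lmList route, and everything about the data is assumed: the species columns sp_a and sp_b, the zone labels and all values are simulated stand-ins for the real table.

```r
# Simulated stand-in for the real data (column names sp_a, sp_b are hypothetical)
set.seed(42)
n <- 60
d <- data.frame(
  EROSIONFORCE = runif(n, 0, 100),
  ZONATION     = factor(rep(c("littoral", "sublittoral"), length.out = n)),
  FISH         = rpois(n, 3),
  sp_a         = rpois(n, 1),
  sp_b         = rpois(n, 0.5)
)

species <- c("sp_a", "sp_b")

# One Poisson GLM per species: species count as response,
# erosion and zone as predictors, FISH as the covariate controlled for
fits <- lapply(species, function(sp) {
  glm(reformulate(c("EROSIONFORCE", "ZONATION", "FISH"), response = sp),
      family = poisson, data = d)
})
names(fits) <- species

# Coefficient tables, one per species
lapply(fits, function(m) coef(summary(m)))
```

If the counts turn out to be overdispersed, MASS::glm.nb is one possible drop-in for the negative binomial case mentioned above.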

Applying univariate coxph function to multiple covariates (columns) at once

First, I gathered from this link, Applying a function to multiple columns, that using the "function" function would perhaps do what I'm looking for. However, I have not been able to make the leap from the way it is presented there to making it actually work in my situation (or really even knowing where to start). I'm a beginner in R, so I apologize in advance if this is a really "newb" question.
My data is a data frame that consists of an event variable (tumor recurrence) and a time variable (follow-up time/time to recurrence) as well as recurrence risk factors (t-stage, tumor size, age at dx, etc.). Some risk factors are categorical and some are continuous. I have been running my univariate analyses by hand, one at a time, like this example: univariateageatdx<-coxph(survobj~agedx), and then collecting the results. This gets very tedious for multiple factors and for a few different recurrence types.
I figured there must be a way to write one line of code with the coxph equation, apply it to all of my variables of interest, and get back the univariate results for each factor. I tried using cbind to bind variables (i.e. x<-cbind("agedx","tumor size")) and then running coxph(recurrencesurvobj~x), but this of course just did the multivariate analysis on these variables and didn't split them out as true univariate analyses.
I also tried the following code based on a similar problem that I found on a different site, but it gave the error shown and I don't know quite what to make of it. Is this on the right track?
f <- as.formula(paste('regionalsurvobj ~', paste(colnames(nodcistradmasvssubcutmasR)[6:9], collapse='+')))
I then ran it as coxph(f)
This gave me the results of a multivariate Cox analysis.
Thanks!
Edit: I just fixed the error; I suppose I needed to use the column numbers, not the names. The changes are reflected in the code above. However, it still runs the selected variables as a multivariate analysis and not as true univariate analyses...
If you want to go the formula route (which, in your case, with multiple outcomes and multiple variables, might be the most practical way to go about it), you need to create a formula per model you want to fit. I've split the steps up a bit here (making formulas, making models and extracting data). They can of course be combined, but keeping them separate allows you to inspect all your models.
# example using the transplant data from the survival package
library(survival)
# make a new event variable (death or no death) to have a dichotomous outcome
transplant$death <- transplant$event == "death"
# making formulas, one per covariate
univ_formulas <- sapply(c("age", "sex", "abo"),
                        function(x) as.formula(paste("Surv(futime, death) ~", x)))
# making a list of models
univ_models <- lapply(univ_formulas, function(x) coxph(x, data = transplant))
# extract data (here I've gone for HR and confint)
univ_results <- lapply(univ_models, function(x) exp(cbind(coef(x), confint(x))))
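To collect the univariate results into one table instead of a list, the steps above can be combined along these lines. This is only a sketch on the same transplant example; the helper names tx, covariates and univ_table are mine, and the layout of the output table is just one possible choice.

```r
library(survival)

# Work on a copy so the packaged data set is left untouched
tx <- transplant
tx$death <- tx$event == "death"

covariates <- c("age", "sex", "abo")

# Fit one Cox model per covariate and stack hazard ratios with 95% CIs
univ_table <- do.call(rbind, lapply(covariates, function(v) {
  f   <- as.formula(paste("Surv(futime, death) ~", v))
  fit <- coxph(f, data = tx)
  est <- cbind(HR = exp(coef(fit)), exp(confint(fit)))
  data.frame(term = rownames(est), est,
             row.names = NULL, check.names = FALSE)
}))

univ_table
```

Categorical covariates contribute one row per non-reference level, so the table has more rows than there are covariates.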

R - Linear Regression - Control for a variable

I have a computer science background and I am trying to teach myself data science by solving problems available on the internet.
I have a smallish data set which has 3 variables: race, gender and annual income. There are about 10,000 sample observations. I am trying to predict income from race and gender.
I have divided the data into 2 parts, one for each gender, and now I am trying to create 2 regression models. Is this possible in R? Can someone provide example syntax?
You don't specify how your data are stored or how the variable race is recorded (is it a factor?).
[If you're just fitting income against race for males, say, and you had the male income and race in income.m and race.m and if the second was a factor in R, then lm(income.m~race.m) will fit the line for males (use summary on the resulting object to get information about it). You could do something similar for females. But most people won't fit the models this way.]
If you're prepared to assume that the variation about the lines is the same for both genders, you can fit both lines with one model.
This has several advantages over analyzing the lines separately, though that can also be done.
If gender is either a factor or a numeric variable recorded as (0/1), and race is a factor and you have the data in a data frame (called, for example, incdata), then you'd fit both lines at once with:
lm(income~race*gender, data=incdata)
which is R shorthand for
lm(income~race+gender+race:gender, data=incdata)
where race:gender is an interaction term.
If you further assume that the effect of race is the same for both sexes, then the smaller model:
lm(income~race+gender, data=incdata)
would be used instead. This would often be the model people would fit if asked to 'control for gender', though many would consider the interaction model I mentioned before instead.
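To judge whether the interaction is worth keeping, the two models above can be compared with anova(). A minimal sketch on simulated data follows; the incdata values, the race levels A/B/C and the effect sizes are all invented for illustration, and the F-test comparison is my addition rather than something prescribed above.

```r
# Simulated stand-in for incdata (all values and effect sizes are invented)
set.seed(123)
n <- 300
incdata <- data.frame(
  race   = factor(rep(c("A", "B", "C"), length.out = n)),
  gender = factor(rep(c("F", "M"), each = n / 2))
)
incdata$income <- 40000 +
  5000 * (incdata$gender == "M") +
  3000 * (incdata$race == "B") +
  rnorm(n, sd = 8000)

# Additive model: same race effect for both genders
fit_add <- lm(income ~ race + gender, data = incdata)

# Interaction model: race effect allowed to differ by gender
fit_int <- lm(income ~ race * gender, data = incdata)

# F-test of the interaction terms (nested model comparison)
anova(fit_add, fit_int)
```

A large p-value in the comparison suggests the simpler additive model is adequate for these data.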
I'd strongly advise working on more simple regression problems first, with a textbook or set of notes suitable for guiding you through the ideas.
If you haven't already fitted a regression in R, I'd start with a smaller data set that only has a single predictor just to get used to the basic mechanics.
R comes with many data sets already built in. See, for example, library(help=datasets) which has about 80 data sets; some of the packages that come with R have more (MASS has over 80, for example). Many R packages on CRAN are packed with data sets, many suitable for regression.
For example, the cars data set (see ?cars in R) records the stopping distance of cars, given their speed. You don't need to read the data in, it's already there.
A simple linear regression (not necessarily the best model given some understanding of physics, but just about adequate for the data) would be:
lm(dist~speed, cars)
Again, you use summary to examine it. e.g. (I suggest you type these one at a time):
carsfit <- lm(dist~speed, cars)
summary(carsfit)
plot(dist~speed, cars)
abline(carsfit, col=2)
The examples in the help on the cars data set (?cars) give several other models and plots. You might try those one at a time also.
The car package (CAR is short for "Companion to Applied Regression") has many small data sets specifically for regression.
It is very simple.
fit1 <- lm(income~gender+race,data=Dataframe1)
summary(fit1)
I would not recommend using two data frames unless you are using more advanced statistical methods that require it. Just use your gender variable.
Also, check this site out: http://www.statmethods.net/stats/regression.html
You could indeed do so, Abhi, but I believe your question is very broad.
(1) You could predict income from race and gender. This can be done in various ways, but the most common would perhaps be "regression analysis". I suggest you do some searches on the internet on that topic. What kind of regression to use and how to perform it depends on the situation. You would probably find out yourself after reading about regression.
(2) R can do that. But I suggest you do some reading about regression before you get into R.
(3) If I were to analyze if race and gender can predict income I would simply do a linear regression where income would be the dependent variable and race and sex would be independent (predictors). This can be done by the "lm" function in R.
Or did I misunderstand something here?
Regards
You need to do some reading on linear/multiple regression techniques. I'm not sure why you divide the data into 2 groups based on gender. Randomly split the data into train and test sets, so that you can fit the model on the training set and validate it on the test set.
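A random train/test split of the kind suggested here might look like the following sketch. It uses the built-in cars data so that it runs as-is; the 70/30 ratio and the use of RMSE as the validation measure are my assumptions, not something the question specifies.

```r
set.seed(2024)

# Randomly pick 70% of the rows for training, keep the rest for testing
n         <- nrow(cars)
train_idx <- sample(n, size = round(0.7 * n))
train     <- cars[train_idx, ]
test      <- cars[-train_idx, ]

# Fit on the training rows only
fit <- lm(dist ~ speed, data = train)

# Validate on the held-out rows: root mean squared prediction error
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$dist - pred)^2))
rmse
```

For the income problem the same pattern would apply with lm(income ~ race + gender, data = train).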

two-sided censored model in R (similar to Zelig's Tobit)?

Is there a model for dependent variables that are censored on both sides? And if so, is there an implementation in R? I am only aware of tobit models (e.g. in the Zelig package), but they're obviously only censored on the left side... I wonder if it even makes sense to truncate on both sides...
There's a difference between truncation and censoring. You need to be aware of which is the case before you start modeling. In a nutshell: censoring means events can be detected, but the measurements are not known completely (i.e. in your case you know neither the exact beginning nor the exact end of the time interval during which subjects were at risk for the event you're considering). Truncation means events can be observed only if another condition is fulfilled: a popular example is survival in a retirement home that only accepts people over 65 as residents; entry into the study population is then truncated at age 65.
If you have both left- and right-censored data, or data that are simultaneously right- and left-censored, the technical term you are looking for is interval censored. ?Surv in package survival will show you how to define interval-censored observations for modelling time-to-event data in that case.
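For the regression case, a two-limit tobit can be written as a model with left- and right-censored observations. The sketch below assumes a Gaussian latent outcome with known censoring limits; fitting it with survreg and Surv(type = "interval2") is one option that I am choosing here, and the data, the limits lo_lim/hi_lim and all variable names are simulated.

```r
library(survival)

# Simulated latent outcome, observed only within [lo_lim, hi_lim]
set.seed(11)
n      <- 500
x      <- rnorm(n)
y_star <- 1 + 2 * x + rnorm(n)
lo_lim <- -1
hi_lim <- 3

# interval2 coding: (NA, t) = left-censored at t, (t, NA) = right-censored at t,
# (t, t) = observed exactly
dat <- data.frame(
  x     = x,
  left  = ifelse(y_star <= lo_lim, NA, pmin(y_star, hi_lim)),
  right = ifelse(y_star >= hi_lim, NA, pmax(y_star, lo_lim))
)

# Gaussian regression with censoring on both sides
fit <- survreg(Surv(left, right, type = "interval2") ~ x,
               data = dat, dist = "gaussian")
summary(fit)
```

The coefficients are on the scale of the latent (uncensored) outcome, which is the usual tobit interpretation.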
In a very real sense most of the observational studies on "free-range human" populations are doubly censored... i.e. we do not observe the individuals over all of their lifespans. Here is a citation to a PhD thesis that seems to lay out the statistical terminology well. Furthermore, several of the packages in R will function properly when set up for interval censoring or left-censoring, including packages survival, NADA, sand (from their DOE website) and several others for which you can search at Baron's website with appropriate search strategies in this link that sets up that page to get both functions and r-help entries.
Edit: Adding comments to address the clarification that this is about truncation rather than censoring.
If one is looking to fit to truncated distributions then look at the gamlss package, or create a suitable density for a doubly-truncated distribution and use fitdistr in the MASS package.
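The fitdistr route could be sketched like this for a doubly-truncated normal. The density dtnorm is hand-written for the example (it is not a MASS function), the truncation bounds a and b are assumed known, and the data are simulated.

```r
library(MASS)

# Density of a normal distribution truncated to [a, b] (hand-written helper)
dtnorm <- function(x, mean, sd, a = 0, b = 10) {
  dens <- dnorm(x, mean, sd) / (pnorm(b, mean, sd) - pnorm(a, mean, sd))
  ifelse(x >= a & x <= b, dens, 0)
}

# Simulated data: draws from N(4, 3) kept only if they fall in [0, 10]
set.seed(7)
raw <- rnorm(2000, mean = 4, sd = 3)
x   <- raw[raw >= 0 & raw <= 10]

# Maximum likelihood fit of the untruncated mean and sd
fit <- fitdistr(x, dtnorm, start = list(mean = mean(x), sd = sd(x)))
fit
```

The estimates refer to the parameters of the underlying untruncated normal, so they will generally differ from the sample mean and standard deviation of the truncated data.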

Resources