Is there a maximum number of sample variables (imported as a metadata file) that can be included in a phyloseq analysis to create an ordination plot?

I am trying to create an ordination plot with phyloseq data:
phyloseq-class experiment-level object
otu_table() OTU Table: [ 7934 taxa and 45 samples ]
sample_data() Sample Data: [ 45 samples by 37 sample variables ]
tax_table() Taxonomy Table: [ 7934 taxa by 6 taxonomic ranks ]
When I carry out and plot the ordination (including sample arrows as labelled variables), not all of the sample variables are included. I specify 24 sample variables for inclusion, but it seems to "max out" at 14 and will not include any additional variables. Furthermore, after trying different metadata tables with random combinations of variables, it always seems to include only the first 14 variables in the table.
Here is the current code:
cap_ord <- ordinate(
  physeq   = saltmarsh_not_na,
  method   = "CAP",
  distance = bray_not_na,
  formula  = ~ percent_water + BD + percent_OM + C + N + C_N + salinity + temp + pH + dic + doc_uM + nox + nh4 + po4 + don + doc_don + h2s + suva + a440 + E2_E3 + SR + protein_like + terrestrial_humic_like + m
)
arrowmat <- vegan::scores(cap_ord, display = "bp")
I do not include code for the plots since the issue is with the data itself.
Is there simply a maximum number of sample variables that phyloseq can handle with ordination plots?
Any insight is appreciated.
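A possible place to look: as far as I know, phyloseq's "CAP" method is a wrapper around vegan::capscale, and vegan drops constraints that are linearly dependent (aliased) on the others rather than enforcing a fixed maximum; dropped constraints then have no biplot arrow in scores(). A diagnostic sketch, assuming cap_ord behaves like a vegan capscale/cca object:
# Diagnostic sketch -- assumes cap_ord behaves like a vegan capscale/cca object.
nrow(vegan::scores(cap_ord, display = "bp"))  # how many biplot arrows actually came back
alias(cap_ord, names.only = TRUE)             # names of constraints dropped as linearly dependent, if any
vegan::vif.cca(cap_ord)                       # variance inflation factors; NA or very large values flag redundant terms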

Related

Why is one of my control variables not showing up in my regression table?

I am using R on panel data to get a regression table. One of my control variables is not showing up in my regression table (namely: SPINDEX).
Does anyone know how to fix this?
# Descriptives and tables
library(dplyr)       # for select()
library(stargazer)
library(plm)

# Descriptive 1
dfOnlyInteresting = select(df, "amountAquired", "NumberDirectors", "AnnualReportDate", "GenderRatio", "AGE", "roa", "xrd", "SPINDEX", "aquisitionFin", "aqusitionsNot0", "CEOTenure", "compRatio", "laggedAquisition", "amountAquired_mean")
stargazer(dfOnlyInteresting, type = "html", title = "Descriptive statistics", digits = 1, out = "descriptives.doc")
# Create a panel dataframe
df.p = pdata.frame(df, index = c("GVKEY" ,"AnnualReportDate"))
# Test for duplicate row names
occur = data.frame(table(row.names(df.p)))
duplicateRowNames = occur[occur$Freq > 1,]
# Table 2
#----------------------------------------------------------
# Define models
#----------------------------------------------------------
mdlA <- amountAquired ~ GenderRatio + CEOTenure + AGE + roa + factor(SPINDEX) + aquisitionFin + xrd + laggedAquisition
mdlB <- amountAquired ~ GenderRatio + CEOTenure + AGE + roa + factor(SPINDEX) + aquisitionFin + xrd + laggedAquisition + compRatio
mdlC <- amountAquired ~ GenderRatio + CEOTenure + AGE + roa + factor(SPINDEX) + aquisitionFin + xrd + laggedAquisition + compRatio*NumberDirectors + compRatio
Thanks!
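The calls that fit mdlA-mdlC and pass them to stargazer are not shown, so this is only a guess: factor(SPINDEX) is expanded into one dummy column per level (so there is no single "SPINDEX" row), and a fixed-effects ("within") plm estimator silently drops it altogether if it does not vary within GVKEY. A quick check, assuming the models are fitted with plm() on df.p:
library(plm)

fitA <- plm(mdlA, data = df.p, model = "within")   # hypothetical fit; swap in the model/estimator actually used
names(coef(fitA))                                  # which terms survive and what they are called in the table
length(unique(df$SPINDEX))                         # how many dummy columns factor(SPINDEX) would expand into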

How do I run a linear regression between certain observations in a data set?

My problem is that I am unable to run a linear regression on only, say, the 450th to 500th observations of NBA salaries (the salaries only in the year 18/19 in my continuous data set).
So far I have this code:
lm(log(Salary) ~ Twitter + Insta + Age + PPG + APG + RPG + SPG +
BPG + MPPG + FG + THREEPG + FT,
data = subset(Econ_III_Data_Set1, Year = 18/19))
But it is giving the same results as it would if I ran a linear regression on all salaries, not just the ones in the year 18/19 (or the 450th to 500th observations of salaries).
lm has a subset argument built in. (In the original call, subset(Econ_III_Data_Set1, Year = 18/19) passes the value of the division 18/19 as an extra argument named Year rather than a filter condition, so the full data set is returned.)
lm(log(Salary) ~ Twitter + Insta + Age + PPG + APG + RPG + SPG +
BPG + MPPG + FG + THREEPG + FT,
data = Econ_III_Data_Set1,
subset = Year == "18/19")
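If the goal really is the 450th to 500th rows specifically (rather than filtering on Year), the subset argument also accepts a plain index vector; a sketch, assuming the rows are ordered as described:
lm(log(Salary) ~ Twitter + Insta + Age + PPG + APG + RPG + SPG +
       BPG + MPPG + FG + THREEPG + FT,
   data = Econ_III_Data_Set1,
   subset = 450:500)   # rows 450 to 500 only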

Adding two Y axes to an xy plot

Working with this data in RStudio. I need to run a simple regression of lwage76 on ed76 and a saturated regression that turns ed76 into a dummy variable for every level within the column. Then I need to plot both regressions in an XY plot with lwage76 as the Y axis and ed76 as the X axis. This is what I have so far:
library(lattice)   # for xyplot()

regression <- lm(nlsdata$lwage76 ~ nlsdata$ed76)
predicted <- data.frame(Edu = nlsdata$ed76, Wage = predict(regression))
aggplot <- aggregate(Wage ~ Edu, data = predicted, mean)
xyplot(Wage ~ Edu, data = aggplot, grid = TRUE, type = c("p", "l"))
This gives me a very nice XY plot, but now I need to add the predicted values from my saturated model:
satreg <- lm(lwage76 ~ ed76*edu_1 + ed76*edu_2 + ed76*edu_3 +
ed76*edu_4 + ed76*edu_5 + ed76*edu_6 + ed76*edu_7 +
ed76*edu_8 + ed76*edu_9 + ed76*edu_10 + ed76*edu_11 +
ed76*edu_12 + ed76*edu_13 + ed76*edu_14 + ed76*edu_15 +
ed76*edu_16 + ed76*edu_17, data = nlsdata)
satmodel <- data.frame(Edu =nlsdata$ed76, Wage = predict(satreg))
So how do I add the second data set to the graph that I have?
Solution in ggplot:
library(ggplot2)

ggplot(data = predicted, aes(Edu, Wage)) +
  geom_line() +
  geom_point() +
  geom_line(data = satmodel, colour = "blue") +
  geom_point(data = satmodel, colour = "blue")
Alternatively, you can label each of your tables and combine them into a single data.frame:
library(dplyr)   # for mutate() and %>%

satmodel <- satmodel %>% mutate(type = "sat_model")
predicted <- predicted %>% mutate(type = "predicted")
df <- rbind(satmodel, predicted)

ggplot(df, aes(Edu, Wage, colour = type)) +
  geom_line() +
  geom_point()
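One extra point: in the xyplot above the predictions were averaged per education level with aggregate() before plotting, while the ggplot calls plot one point per observation. If the lines look jagged, the same aggregation step can be applied to both prediction tables first (a sketch using the objects defined above):
# Average predictions per education level, mirroring the earlier aggregate() call.
predagg <- aggregate(Wage ~ Edu, data = predicted, mean)
satagg <- aggregate(Wage ~ Edu, data = satmodel, mean)

ggplot(predagg, aes(Edu, Wage)) +
  geom_line() +
  geom_point() +
  geom_line(data = satagg, colour = "blue") +
  geom_point(data = satagg, colour = "blue")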

upper scope has term ‘NA’ not included in model

I am working on a data set and would like to do stepwise logistic regression using some variables; to do so I am using the add1() function in R. A sample of the data set can be downloaded from the link here: https://drive.google.com/file/d/0B0N-Nc7kEi4bVjhDd1FDaEE5cEE/view?usp=sharing
I fit a logistic regression using:
train <- read.csv('training.csv')
glm.model_step_1 <- glm(loan_status ~ acc_open_past_24mths + annual_inc + avg_cur_bal + bc_open_to_buy + delinq_2yrs + dti + inq_last_6mths + installment + int_rate + mo_sin_old_il_acct + mo_sin_old_rev_tl_op + mo_sin_rcnt_rev_tl_op + mo_sin_rcnt_tl + mort_acc + mths_since_last_delinq + mths_since_recent_bc + mths_since_recent_inq + num_accts_ever_120_pd + num_actv_bc_tl + num_actv_rev_tl + num_bc_tl + num_il_tl + num_op_rev_tl + num_tl_op_past_12m + pct_tl_nvr_dlq + percent_bc_gt_75 + pub_rec_bankruptcies + revol_bal + revol_util + term + total_acc + total_bc_limit + total_il_high_credit_limit + fico_mean + addr_state + emp_length + verification_status + Count_NA + Info_missing + Engineer + Teacher + Doctor + Professor + Manager + Director + Analyst + senior + lead + consultant + home_ownership_own + home_ownership_rent + purpose_debt_consolidation + purpose_medical + purpose_credit_card + purpose_other,
data = train,
family = binomial(link = 'logit'))
And use the add1() function to do a forward selection.
add1(glm.model_step_1, scope = train)
This code does not work. I get the below error:
Error in factor.scope(attr(terms1, "factors"), list(add = attr(terms2, :
upper scope has term ‘NA’ not included in model
Does anyone know how to solve this error?
A question asked previously on datascience.stackexchange (https://datascience.stackexchange.com/questions/11604/checking-regression-coefficients-stability) mentioned checking for NAs. There aren't any NAs in the data set, which can be confirmed by running sapply(train, function(x) sum(is.na(x))).
The train dataset of @Jash Sash has some anomalous values which force read.csv to read some numerical variables as factors with many categories.
Anyway, I consider here a model with only a few variables in order to show how to avoid the error message reported above.
Remember that the scope argument must be a "formula giving the terms to be considered for adding or dropping"; it cannot be a data.frame as in the code of @Jash Sash.
train <- read.csv('training.csv')
factor_cols <- sapply(train, is.factor)   # which columns read.csv turned into factors

glm.model_step_1 <- glm(loan_status ~ acc_open_past_24mths + avg_cur_bal + bc_open_to_buy,
                        data = na.omit(train),
                        family = binomial(link = 'logit'))
add1(glm.model_step_1, scope = ~ . + delinq_2yrs + inq_last_6mths + int_rate)
The result is:
Model:
loan_status ~ acc_open_past_24mths + avg_cur_bal + bc_open_to_buy
               Df Deviance    AIC
<none>            1038.6 1046.6
delinq_2yrs     1 1037.9 1047.9
inq_last_6mths  1 1038.0 1048.0
int_rate        1 1038.0 1048.0
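To consider every remaining column instead of a hand-picked few, the scope formula can be built from the column names; a sketch, assuming loan_status is the response and all other columns are candidate terms:
# Build the scope formula "~ var1 + var2 + ..." from the column names (base R only).
candidates <- setdiff(names(train), "loan_status")
full_scope <- reformulate(candidates)
add1(glm.model_step_1, scope = full_scope, test = "Chisq")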

Error in the class of random forest

I'm feeding a new set of data to the random forest prediction model and encountering this error:
Error in checkData(oldData, RET) :
Classes of new data do not match original data
Here's the code:
fit1 <- cforest((b == 'three')~ affect+ certain+ negemo+ future+swear+sad
+negate+ppron+sexual+death + filler+leisure + conj+ funct + i
+future + past + bio + body+cause + cogmech + death +
discrep + future +incl + motion + quant + sad + tentat + excl+insight +percept +posemo
+ppron +quant + relativ + space + article + age + s_all + s_sad + gender
, data = trainset1,
controls=cforest_unbiased(ntree=500, mtry= 1))
testset2$pre_swl<-predict(fit1, newdata=testset2 , type='response')
Both the training set and the test set are data.frames.
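That checkData() error typically means at least one predictor column has a different class (or different factor levels) in testset2 than in trainset1. A diagnostic sketch to compare them, assuming the two data frames share column names:
# Compare column classes between training and test data; mismatches are the usual
# cause of "Classes of new data do not match original data".
shared <- intersect(names(trainset1), names(testset2))
data.frame(
    column = shared,
    train  = sapply(trainset1[shared], function(x) class(x)[1]),
    test   = sapply(testset2[shared],  function(x) class(x)[1])
)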