How to predict values for a new randomly generated dataset using multiple regression in R? - r

I have generated a MARS regression model using known soil property data collected from field samples across the Great Plains region. I reduced all variables down to 5 predictor variables (elevation, tpi, k_factor, precipitation, and temperature) and a single dependent variable (soil organic content: SOC). I split the original dataset into a training set and a test set. After the model was created, I was able to use it to predict values on the test dataset just fine.
I now want to predict on a newly generated dataset derived from geospatial rasters across the Great Plains region. I generated random samples based on the study area size and created a point shapefile over the area. The raster values were written into the points where they intersected, giving me a table of the 5 predictor variables for each point. I do not have an SOC raster, so my new table is missing that column.
My intention was to predict the SOC values from the 5 predictor variables in the new table. However, I keep getting the error "variable lengths differ" for each of my columns. I would like to export the predictions back to the new table so I can visualize the distribution of SOC within GIS. Below is an example of my code:
setwd("E:\\Fall19\\stats\\FinalProject\\Excel_tables")
table=read.csv("sel_el_train.csv")
attach(table)
my_data=table[,c(8,9,15,16,18,19)]
mars1 <- earth(
SOC ~ ., data=my_data)
print(mars1)
summary(mars1)
plot(mars1)
predict(mars1, newdata=test.data)
Below are screenshots of the bottom of each table. You can see the difference in the number of records I built the model from versus the dataset I'm trying to predict on.

I figured it out. The method I was using is very particular about column-name spelling. My K_factor variable was not spelled correctly in the new table. Once all column names matched, everything worked well.
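For anyone hitting the same error, a minimal sketch of the name check and the export step (the file names and the SOC_pred column are hypothetical; mars1 and my_data come from the code above):

library(earth)

new.table <- read.csv("new_points.csv")          # hypothetical table of raster-derived points
predictors <- setdiff(names(my_data), "SOC")     # the 5 predictor names the model expects
setdiff(predictors, names(new.table))            # anything printed here is misspelled or missing

new.table$SOC_pred <- as.vector(predict(mars1, newdata = new.table))
write.csv(new.table, "new_points_predicted.csv", row.names = FALSE)   # rejoin to the points in GIS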

Related

When setting your obsCovs for the function pcount (package unmarked), how does R "know" which obsCov observation corresponds to each y value?

I'm relatively new to R, particularly with this package. I am running N-mixture models assessing detection probabilities and abundance. I have abundance data, site covariates, and observation covariates. There are three repeated observations (rounds) per site. The observation covariates are set up as columns (three columns per covariate, one for each round), and the rows are individual sites. The abundance data is formatted similarly, with each column heading representing a different round. I've copied my code below.
library(unmarked)

y.abun2 <- COYE[2:4]   # counts: one column per round
obsCovs.ss <- list(temp = Covariate2021[3:5], Date = Covariate2021[13:15],
                   Cloud = Covariate2021[17:19], Wind = Covariate2021[21:23],
                   Observ = Covariate2021[25:27])
siteCovs.ss <- Covariate2021[c(29, 30, 31, 32)]
coyeabund <- unmarkedFramePCount(y = y.abun2, siteCovs = siteCovs.ss,
                                 obsCovs = obsCovs.ss)
After this I scale using this code:
coyeabund@siteCovs$TreeCover <- scale(coyeabund@siteCovs$TreeCover)
Moving on to my model I use this code:
abun.coye.full <- pcount(~ TreeCover + temp + Date + Cloud + Wind + Observ
                         ~ HHSDI + ProportionNH + Quality,
                         data = coyeabund, mixture = "NB", K = 132, se = TRUE)
Is the model matching the observation covariates to the abundance measurements for each round? (i.e., can it tell that temp column 5 corresponds to the third round of abundance measurements?)
The models seem fine so far, but I am so new at this that I want to confirm I haven't gone astray.
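In case it helps, a toy sketch (made-up numbers, not the COYE data) of how unmarkedFramePCount pairs things positionally: column j of each obsCovs data frame is matched to column j of y, i.e. to round j:

library(unmarked)

# 2 sites x 3 rounds of counts (toy values)
y.toy <- matrix(c(2, 3, 1,
                  0, 4, 2), nrow = 2, byrow = TRUE)
# temperature for the same 2 sites x 3 rounds; the third column is round 3
temp.toy <- data.frame(temp1 = c(10, 12), temp2 = c(15, 14), temp3 = c(20, 18))

umf <- unmarkedFramePCount(y = y.toy, obsCovs = list(temp = temp.toy))
obsCovs(umf)   # one row per site-round combination; the round index follows the column position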

Why is this the output for my linear model and how can I fix it?

I am trying to set up a multivariable linear regression model in R, but the model keeps creating new variables in the output.
Essentially I am trying to find correlations between air quality and different factors such as population, time of day, weather readings, and a few others. For this example, I am looking at multiple sensor locations over a month. I have data on the actual AQI and the weather, and I assumed the population in the area surrounding each sensor doesn't change over time (which might be my problem). Therefore, the population varies between sensors but remains constant over the month. I then combined each sensor's data into one data frame to fit the regression. The code for my model is below:
model = lm(AQI ~ Time.of.Day + Temp + Humidity + Pressure + pop + ind + rd_dist, data = Krakdata)
The output is given in the picture below. I do not know why it doesn't come up with a single coefficient for population. Instead, it outputs each population reading as another factor level. Thanks!
Linear Model Output:
Krakdata example. Note how the population does not change until the next sensor comes up:
pop is a categorical variable. You need to convert it to an integer, otherwise each value will be treated as a separate category and therefore a separate variable.
pop is a categorical variable, hence R treats it as such. R turns the pop variable into dummy variables, hence the output. You have to convert it to numeric if this variable is supposed to be numeric in nature/in your analysis.
As to how to convert it:
Krakdata$pop <- as.numeric(as.character(Krakdata$pop))
As to why pop is read as a factor while it resembles numbers, you need to look into your previous code or at the data itself.
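A quick sketch of the check and the refit (Krakdata and the column names are taken from the question):

class(Krakdata$pop)     # "factor": each level gets its own dummy coefficient in lm()
Krakdata$pop <- as.numeric(as.character(Krakdata$pop))
class(Krakdata$pop)     # "numeric": lm() now fits a single slope for pop

model <- lm(AQI ~ Time.of.Day + Temp + Humidity + Pressure + pop + ind + rd_dist, data = Krakdata)
summary(model)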

Display the name of the corresponding PC when using prcomp for PCA in R

I use prcomp to run PCA in R. When I output the summary, i.e. standard deviation, proportion of variance, and cumulative proportion, the results are always ordered and the actual column names are replaced by PC1, PC2, and so on. Thus, I cannot tell the exact proportion of variance for each column.
Can anyone show me, or give me a hint on, how to display the column names when outputting the summary results? Two result pics are attached here:
It is not clear that you understand what principal components analysis does. It reduces the dimensionality of the data. Assuming the rows are observations and the columns are variables, imagine plotting your rows in 35 dimensions (the columns). Most people have trouble visualizing more than 3 dimensions. Principal components creates a smaller set of axes that explains most of the variation in the data. The axes are orthogonal, meaning they are at right angles to one another. Your plot and the results of summary(res.pca5) and plot(res.pca5) show that the first dimension explains 28% of the variation in the 35 variables. Adding a second dimension gives you almost 38%, and three gives you 44%. These new variables are combinations of your original variables, not the original variables themselves. The first two components explain more of the variability than any other combination of two axes would.
For some reason you did not try res.pca5 as a command (or the equivalent print(res.pca5)), which would show you the coefficients PCA used to create the components from the original variables, or biplot(res.pca5), which plots the rows and columns in the new two-dimensional space.
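If the goal is to see how the original columns feed into the components, a short sketch against the prcomp object from the question (res.pca5):

summary(res.pca5)          # variance explained is per component (PC1, PC2, ...), not per original column
res.pca5$rotation[, 1:3]   # loadings: the weight of each original column in PC1-PC3
biplot(res.pca5)           # observations and original variables drawn in the PC1-PC2 plane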

Plotting different mixture model clusters in the same curve

I have two data sets, one representing a healthy condition with 4 variables and 11,000 points, and another representing a faulty condition with 4 variables and 600 points. I have used R's mclust package to obtain a GMM clustering for each data set separately. What I want to do is obtain both clusterings in the same frame so as to study them at the same time. How can that be done?
I have tried joining both datasets, but the result I am obtaining is not what I want.
The code in use is:
library(mclust)
Dat4M <- Mclust(Dat3, G = 3)
Dat3 is where I store my dataset, and Dat4M is where I store the result of Mclust. G = 3 is the number of Gaussian mixture components I want, which in this case is three. To plot the result, the following code is used:
plot(Dat4M)
The following is obtained when I apply the above code to my healthy dataset:
The following is obtained when the above code is used on the faulty dataset:
Notice in the faulty-data density curve that, for the mixture of CCD and CCA, two density peaks are obtained. I want to place that same curve in the same panel as the healthy data and study the differences.
Any help on how to do this will be appreciated.
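One possible sketch, using mclust's densityMclust to overlay the fitted density of a single shared variable from both datasets (Healthy, Faulty, and the column name CCD are placeholders for your actual objects):

library(mclust)

# fit a 3-component mixture density to the same variable in each dataset
dens.healthy <- densityMclust(Healthy$CCD, G = 3)
dens.faulty  <- densityMclust(Faulty$CCD,  G = 3)

# evaluate both fitted densities on a common grid and draw them in one frame
grid <- seq(min(Healthy$CCD, Faulty$CCD), max(Healthy$CCD, Faulty$CCD), length.out = 500)
plot(grid, predict(dens.healthy, grid), type = "l", col = "blue",
     xlab = "CCD", ylab = "density")
lines(grid, predict(dens.faulty, grid), col = "red")
legend("topright", legend = c("healthy", "faulty"), col = c("blue", "red"), lty = 1)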

How to use Dismo's predict() with a maxent model based on a dataframe

I am trying to figure out how dismo's predict function operates for a model built with 'x' as a data frame rather than raster layers. I have successfully run models using raster layers and made prediction maps based on them.
My model is built as follows:
library(dismo)
model <- maxent(x = sightings.data, p = presence.vector)
with sightings.data being a data frame containing the GPS locations of sightings, followed by the conditions at those times and locations. presence.vector is a vector indicating whether a row is a presence or a background point.
I am looking to find out:
What arguments to supply to predict() given a model of this type
What predict() is capable of providing from a model such as this
The help file for predict() is not particularly detailed, and the 'Species distribution modelling with R' vignette does not successfully cover this topic (the examples just list 'cannot run this example because maxent is not available' outputs).
I have tried modelling with a data frame containing only variables I have raster layers for, and tried predicting as I would for a model built with rasters, but I get the following error:
Error in .local(object, ...) : missing layers (or wrong names)
I have ensured the data frame column names and the raster layer names are the same, excluding the mandatory latitude and longitude columns:
names(raster.stack) <- colnames(sightings.data[3:5])
The method I found in the code available from the paper Oppel et al. 2012 demonstrates that dismo's predict can produce relative values when provided with a data frame of input variables.
> predictions <- predict(model, variables)
> str(predictions)
num [1:100] 0.635 ...
I'm still looking for an easy method to create a predicted distribution raster map from such predicted values.
If you provide dismo::maxent a data frame, the function will recognize the first column as longitude and the second column as latitude. If the data do not follow this format, the function will not work.
In this format the sightings data does not need to include the GPS locations, so you can remove the x and y columns from sightings.data. Then you can run the model, and then predict to a raster stack whose layer names are identical to the sightings.data column names.
predict() was looking for the GPS locations in your raster stack, which I'm guessing were not there.
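A minimal sketch of that workflow, assuming the first two columns of sightings.data are the GPS coordinates and raster.stack holds one layer per remaining column (object names reused from the question, output file name hypothetical):

library(dismo)
library(raster)

env.data <- sightings.data[, -(1:2)]                   # drop the x/y coordinate columns
model    <- maxent(x = env.data, p = presence.vector)

names(raster.stack) <- colnames(env.data)              # layer names must match the training columns
prediction.map <- predict(model, raster.stack)         # RasterLayer of predicted suitability

plot(prediction.map)
writeRaster(prediction.map, "predicted_distribution.tif")   # export for GIS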
