R regression over columns with fixed deltas

I have a data frame in R, df, where each row, X, is a subject (N = 100) and each column, S, is that subject's score on a task in a given month over the span of two years. Thus I have a data frame of 100 subjects and 24 observations evenly spaced at 1-month intervals (ignoring month/day variance).
Question 1: how do I fit a line (linear regression) to each subject? I have trouble understanding how to do this over columns, as opposed to rows within a column.
Question 2: how do I fit a line (linear regression) to the whole dataset? I ask because I would like to segment the dataset into groups A and B (i.e. a column labels each subject's condition: {A, B}) and fit a line to each subset of subjects over the 24 time points.
Apologies if this is a simple question.

I constructed a dataset based on your description. If this is useful, perhaps include it in your question itself.
df<- as.data.frame(matrix(rep(1:24,100)+rnorm(2400),nrow=100,byrow=T))
names(df)<- paste("S",1:24,sep="")
df$ID<-1:100
df$group <- as.factor(sample(c("A","B"),100,replace=T))
Now melt your data frame to get the S1 to S24 columns as a factor variable.
library(reshape2)
m<- melt(df,id.vars=c("ID","group"))
Then you can use the following kind of call to examine a linear model of time for a particular ID. You can use lapply to do this in one shot for all IDs.
summary(lm(value~as.numeric(variable), data=m, subset=ID==5))
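For example, a sketch of that lapply approach, collecting one fit per subject and pulling out the per-subject slopes (the names fits and slopes are just illustrative, and this assumes the melted frame m built above):
fits <- lapply(unique(m$ID), function(i) {
  # fit value ~ time for one subject, using months 1..24 as the time axis
  lm(value ~ as.numeric(variable), data = m[m$ID == i, ])
})
slopes <- sapply(fits, function(f) coef(f)[2])  # one slope estimate per subject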
And this will model all values as predicted by group. Note that group is a factor, so lm() uses treatment contrasts: A is the reference level, and the groupB coefficient is the estimated difference of B relative to A.
summary(lm(value~group, data=m))
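For Question 2, a line per group over the 24 time points can be sketched with an interaction between time and group (again assuming the melted frame m); the groupB terms then give B's intercept and slope differences relative to A:
fit_groups <- lm(value ~ as.numeric(variable) * group, data = m)
summary(fit_groups)  # group B gets its own intercept and slope via the interaction terms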

Related

When setting your obsCovs for the function pcount (package unmarked) how does R "know" which obsCov observation corresponds to each y value?

I'm relatively new to R, particularly with this package. I am running N-mixture models assessing detection probabilities and abundance. I have abundance data, site covariates, and observation covariates. There are three repeated observations (rounds) per site. The observation covariates are set up as columns (three columns per covariate, one for each round). The rows are individual sites. The abundance data are formatted similarly, with each column heading representing a different round. I've copied my code below.
y.abun2<-COYE[2:4]
obsCovs.ss <- list(temp=Covariate2021[3:5], Date=Covariate2021[13:15], Cloud=Covariate2021[17:19], Wind=Covariate2021[21:23],Observ=Covariate2021[25:27])
siteCovs.ss <- Covariate2021[c(29,30,31,32)]
coyeabund<-unmarkedFramePCount(y=y.abun2, siteCovs = siteCovs.ss,
obsCovs = obsCovs.ss)
After this I scale using this code:
coyeabund@siteCovs$TreeCover <- scale(coyeabund@siteCovs$TreeCover)
Moving on to my model I use this code:
abun.coye.full<-pcount(~TreeCover+temp+Date+Cloud+Wind+Observ ~ HHSDI+ProportionNH+Quality, coyeabund,mixture="NB", K=132,se=TRUE)
Is the model matching the observation covariates to the abundance measurements to each round? (i.e., is it able to tell that temp column 5 corresponds to the third round of abundance measurements?)
The models seem fine so far but I am so new at this I want to confirm that I haven't gone astray.
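Not a definitive answer, but one way to convince yourself is to build a tiny unmarked frame from made-up toy data and inspect it (y_toy and temp_toy below are assumptions, not your data). As far as I understand, each obsCovs data frame is matched to the columns of y by position, so its k-th column is treated as the covariate for round k at every site:
library(unmarked)
y_toy    <- data.frame(r1 = c(2, 0, 1), r2 = c(1, 1, 0), r3 = c(0, 2, 1))           # counts: 3 sites x 3 rounds
temp_toy <- data.frame(t1 = c(10, 12, 9), t2 = c(14, 15, 13), t3 = c(18, 17, 16))   # temperature, same layout
umf_toy  <- unmarkedFramePCount(y = y_toy, obsCovs = list(temp = temp_toy))
obsCovs(umf_toy)  # long format: one temp value per site-round, following the column order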

Run stats on frequency table as if it were full dataset in R [duplicate]

This question already has an answer here:
R: lm() result differs when using `weights` argument and when using manually reweighted data
I have billions of measurements for two values, x and y. This is too large to operate on the raw data, so I'm representing them as a frequency table. I have one row for each unique combination of x value and y value, and a variable freq showing how many data points had that combination of values.
If I want to estimate the relationship between x and y, I can do: lm(y ~ x, data=df, weights=df$freq). I've tested this and it gives accurate parameter estimates, but the wrong t value. It's still treating each row as one observation, so the degrees of freedom are much smaller than they should be.
Is there a way to run analyses that treat each row as the appropriate number of records?
Are there generalizable tools for having R operate on a frequency table as if it were a raw dataset?
note: this question shows how to recreate the raw data, but my raw data is unmanageably large, which is why I'm using a frequency table in the first place.
example
# This dataset has a negative correlation between x and y:
library(dplyr)
raw_data<-data.frame(
x=rep(c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4), 100),
y=rep(c(5,5,5,5,1,4,4,4,4,1,3,3,3,3,7,2,2,2,2,8), 100)
)
lm_raw<-lm(x ~ y, data=raw_data)
summary(lm_raw)[c("coefficients", "df")]
# Let's say instead I have a summary dataset that has the frequency for each x-y pair:
freq_data <- raw_data %>% group_by(x,y) %>% summarise(freq=n())
# Analyze and weight by frequency. Parameter estimates are right but the t value is wrong:
lm_freq<-lm(x ~ y, data=freq_data, weights=freq_data$freq)
summary(lm_freq)$coefficients
# ... because it's treating this as 8 data points instead of thousands
summary(lm_freq)$df
You can manually adjust the degrees of freedom:
lm_freq$df.residual <- with(lm_freq, sum(weights) - length(coefficients))
Now you should get the correct t-values. I referenced this article.
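As a quick check (a sketch using the objects from the example above, after running the adjustment line), the reweighted fit should now report the same standard errors, t values, and residual degrees of freedom as the full raw-data fit:
summary(lm_freq)$coefficients  # t values now reflect sum(freq) = 2000 observations
summary(lm_raw)$coefficients   # should agree with the adjusted weighted fit
summary(lm_freq)$df            # residual df is now sum(freq) - 2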

Identify the outliers with the highest squared residuals under the Linear regression model in R

I have a data set [1000 x 80] of 1000 data points each with 80 variable values. I have to linearly regress two variables: price and area, and identify the 5 data points that have highest squared residuals. For these identified data points, I have to display 4 of the 80 variable values.
I do not know how to use the residuals to identify the original data points. All I have at the moment is:
model_lm <- lm(log(price) ~ log(area), data = ames)
Can I please get some guidance on how I can approach the above problem?
The model_lm object will contain a variable called 'residuals' that holds the residuals in the same order as the original observations. If I'm understanding the question correctly, then an easy way to do this in base R is:
ames$residuals <- model_lm$residuals ## Add the residuals to the data.frame
o <- order(ames$residuals^2, decreasing=TRUE) ## Reorder to put largest squared residuals first
ames[o[1:5],] ## Return results
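To display just a few of the 80 variables for those five points, subset the columns as well; a sketch (price and area come from your model, and you can swap in whichever other column names you need):
ames[o[1:5], c("price", "area", "residuals")]  ## Add whichever other columns you want to see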

Using R to perform a PCA on melted variables

I have a dataset where I've measured gene expression of 21 genes and also measured the output of 3 other assays. I have measured these for 8 different clones. I have also measured these on 5 different days.
However, I haven't measured every gene or assay on every day, or for every clone, so I have datasets of varying lengths. In order to combine them easily into one large dataset to perform a PCA on, I melted each dataset and then row-bound them. I then standardized all the values. I now have a dataset that looks like the one below.
What I want to do is a PCA where each of the factors in "group" is calculated in the PCA. Then, I'd like to create graphs where different colors of datapoints represent different "clones" or "days". I've pasted my sad attempt to get that working below. Any help would be appreciated!
set.seed(1)
# Creates variables for a dataset
clone <- sample(c(rep(c("1A","2A","2B","3B","3C"), each=100),rep(c("1B","2C","3A"), each=200)))
day <- sample(c(rep(1,225),rep(2,25),rep(3,600),rep(4,25),rep(5,225)))
group <- sample(c(rep(paste0("gene",1:21), each=42),rep("assay1",90),rep("assay2",80),rep("assay3",48)))
value = rnorm(1100, mean=0, sd=3)
# Create data frame from variables
df <- data.frame(clone,day,group,value)
df$day <- as.factor(df$day)
# Create PCA data
df_PCA <- prcomp(clone + day + group ~ value, data = df, scale = FALSE)
# Graphing results of PCA
par(mfrow=c(2,3))
plot(df_PCA$x[,1:2], col=clone)
plot(df_PCA$x[,1:2], col=day)
plot(df_PCA$x[,1:3], col=clone)
plot(df_PCA$x[,1:3], col=day)
plot(df_PCA$x[,2:3], col=clone)
plot(df_PCA$x[,2:3], col=day)
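For what it's worth, prcomp() expects a numeric matrix (rows = observations, columns = variables) rather than a formula mixing factors and values, so one possible sketch is to cast the melted data back to wide format first; the object names below and the mean-imputation of unmeasured combinations are assumptions of this sketch, not a recommendation:
library(reshape2)
wide <- dcast(df, clone + day ~ group, value.var = "value", fun.aggregate = mean)  # one row per clone-day
mat  <- as.matrix(wide[, -(1:2)])                    # numeric block: genes and assays as columns
for (j in seq_len(ncol(mat))) {                      # crude fill for clone-day cells that were never measured
  mat[is.na(mat[, j]), j] <- mean(mat[, j], na.rm = TRUE)
}
df_PCA <- prcomp(mat, scale. = TRUE)
plot(df_PCA$x[, 1:2], col = as.factor(wide$clone))   # colour points by clone
plot(df_PCA$x[, 1:2], col = wide$day)                # or by day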

How to use random forests in R for classification to decide if the value of a column is less or greater than a value N?

I have already used random forests in R for classification where the column of interest has categorical values (0 or 1, for example). For example, for the iris dataset, we can use random forests to classify the data by species as follows:
myRF <- randomForest(Species ~ ., data=iris, importance=TRUE,proximity=TRUE)
This makes sense because Species can take only a couple of categorical values. The question is: what if Species could take values from 1 to 100 and I wanted to classify the data into two categories, the ones where the value is greater than 50 and the ones where it is less than 50?
Of course, I could add another column whose value is 1 or 0 depending on Species, and then do classification on that column instead of Species, but is there a way to tell R directly that we want to classify our data into 2 categories: one where Species is less than 50 and another where it is greater than 50 (assuming the new hypothetical values for Species)?
Thank you
myRF <- randomForest(Species < 50 ~ ., ...)
which is really no different from defining a new variable that contains whether Species is less than 50, but it avoids modifying your dataset. Note that this is only sensible if Species is a continuous rather than a categorical variable (i.e., it makes sense to compare species numbers in this way).
In the more general case where you want to predict that a factor will take on one of a subset of values, you can use
randomForest(y.fac %in% c("level1","level2",...) ~ .....)
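A minimal runnable sketch of the first form, using iris as a stand-in (the 5.8 cutoff on Sepal.Length is just an assumed example, and wrapping the condition in factor() is my addition so that randomForest() does two-class classification rather than regression on a logical response):
library(randomForest)
set.seed(1)
rf_bin <- randomForest(factor(Sepal.Length > 5.8) ~ Sepal.Width + Petal.Length + Petal.Width + Species,
                       data = iris, importance = TRUE)   # predictors listed explicitly to keep Sepal.Length out
print(rf_bin)  # shows the OOB confusion matrix for the two classes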
