How to check if an ANOVA test excludes zero values in R

I was wondering if there is a simple way to check whether the zero values in my data are excluded from my ANOVA.
I first changed all my zero values to NA with
BFL$logDecomposers[which(BFL$logDecomposers==0)] = NA
I'm not sure if 'na.action=na.exclude' makes sure these values are ignored (like I want them to be)?
standard<-lm(logDecomposers~1, data=BFL) #null model
ANOVAlnDeco.lm<-lm(logDecomposers~Species_Number,data=BFL,na.action=na.exclude)
anova(standard,ANOVAlnDeco.lm)
P.S.: I've just been using R for a few weeks, and this website has been of tremendous help to me :)

You haven't given a reproducible example, but I'll make one up.
set.seed(101)
mydata <- data.frame(x=rnorm(100),y=rlnorm(100))
## add some zeros
mydata$y[1:5] <- 0
As pointed out by @Henrik, you can use the subset argument to exclude these values:
nullmodel <- lm(y~1,data=mydata,subset=y>0)
fullmodel <- update(nullmodel,.~x)
It's a little confusing, but na.exclude and na.omit (the default) actually lead to the same fitted model -- the difference is in whether NA values are included when you ask for residual or predicted values. You can try it out:
mydata2 <- within(mydata,y[y==0] <- NA)
fullmodel2 <- update(fullmodel,subset=TRUE,data=mydata2)
(subset=TRUE turns off the previous subset argument, by specifying that all the data should be included).
You can compare the fits (coefficients etc.). One shortcut is to use the nobs method, which counts the number of observations used in the model:
nrow(mydata) ## 100
nobs(nullmodel) ## 95
nobs(fullmodel) ## 95
nobs(fullmodel2) ## 95
nobs(update(fullmodel,subset=TRUE)) ## 100
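The practical difference shows up only when you extract residuals or predictions. A quick sketch using mydata2 from above: na.exclude pads the result back to the original length with NAs, while na.omit just drops those rows.
fit.om <- lm(y~x, data=mydata2, na.action=na.omit)
fit.ex <- lm(y~x, data=mydata2, na.action=na.exclude)
length(residuals(fit.om)) ## 95 -- NA rows dropped
length(residuals(fit.ex)) ## 100 -- padded with NA at the missing rows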

Related

Outlier detection for grouped (clustered) data

I want to find outliers in my data (named "df") and eliminate them:
> head(df)
  cluster machine.code age Good.Times repair.price
1       1     13010132  23      58.54    198170000
2       1     13010129  23     105.25    390847500
3       1     13010131  23      20.50     20701747
4       1     13010072  18      14.30     22340000
5       1     13010101  18      57.63     13220000
6       1     13010106  27      49.96    254450000
My data has 65 clusters, and I want to run the outlier detection within each cluster separately. I had previously used the code below for outlier detection on a single cluster, and it worked fine:
library("ggstatsplot")
df<- read.csv("C:/Users/gadmin/Desktop/dataE.csv",header = TRUE)
ggbetweenstats(df,cluster, repair.price , outlier.tagging = TRUE)
Q <- quantile(df$repair.price, probs=c(.25, .75), na.rm = FALSE)
iqr <- IQR(df$repair.price)
up <- Q[2]+1.5*iqr # Upper Range
low<- Q[1]-1.5*iqr # Lower Range
eliminated<- subset(df, df$repair.price > (Q[1] - 1.5*iqr) & df$repair.price < (Q[2]+1.5*iqr))
ggbetweenstats(eliminated, cluster, repair.price, outlier.tagging = TRUE)
Now I want to do the same thing for all 65 clusters using a "for" loop, something like this:
for(i in 1:length(unique(df$cluster))) {
...
}
but I don't know how. (I mean: after detecting the outliers in the first cluster, how should the subset be replaced so that the process continues with the next cluster?)
Core question
There are various ways to detect outliers. As for the core of your question, I understand it as "How do I subset the data so I can apply a for-loop to remove the outliers for each cluster?"
# maybe insert a column id that assigns an id (identical to the row number) to identify individual entries
df$id <- seq(1, nrow(df))
# make a list to store the outlier ids for each cluster
outlrs <- list()
# loop through the clusters
for(clust in unique(df$cluster)){
  subset <- df[df$cluster == clust,]
  outlrs[[clust]] <- [INSERT YOUR OUTLIER DETECTION FUNCTION HERE*]
}
# remove the outliers; unlist() flattens the per-cluster id vectors into one vector
outliers <- unlist(outlrs)
df <- df[-outliers, ]
* The outlier detection function you use should ultimately output the ids of the rows containing the outliers. This part would have to be adapted to your method of outlier identification.
I didn't test it since I have insufficient data. You could use e.g. dput(df) to output a version of your data you can copy and paste to make it accessible to people who want to test their proposed solutions.
Edit: one (of many) alternative approaches
Alternatively, you could apply the functions you included in your question on a subset of the data within the loop, store the cleaned-up output e.g. as a list and subsequently apply do.call(rbind.data.frame, your_list) to the list.
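For instance, a minimal (untested) sketch of that approach, assuming df has the cluster and repair.price columns shown in the question:
cleaned <- list()
for (clust in unique(df$cluster)) {
  sub <- df[df$cluster == clust, ]
  Q   <- quantile(sub$repair.price, probs = c(.25, .75), na.rm = TRUE)
  iqr <- IQR(sub$repair.price, na.rm = TRUE)
  # keep only the rows inside the 1.5*IQR fences for this cluster
  keep <- sub$repair.price > (Q[1] - 1.5*iqr) & sub$repair.price < (Q[2] + 1.5*iqr)
  cleaned[[as.character(clust)]] <- sub[keep, ]
}
df_clean <- do.call(rbind.data.frame, cleaned)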
Note
As Phil pointed out, it is questionable whether outliers should be removed, especially when you're just applying a loop that "takes care of them". While we can provide the means by which "outliers" can be removed programmatically, the question whether you should actually remove those outliers in a given situation is another one (probably more adequate on CrossValidated). It should also be noted that there are many algorithms to determine which values differ "significantly" from the bulk of values and the border between "significant" and not significant is arbitrary.

Build loop to use increasing part of dataframe in R as input to function

I'm using the first principal component from a PCA analysis as an explanatory variable in a forecasting model that forecasts recursively using Kalman filtering. In other words, at each point in time, the model updates and produces a new forecast based on the new observation included into the model. Since PCA uses data from all observations included in the model for its calculations, I need to run also the PCAs recursively, using only the observations prior to the point in time that I am forecasting (otherwise, the PCA-result could reveal information about the future, and help the model produce a more accurate answer than it would have otherwise). I think a loop might be the solution, but I am struggling with how to formulate the code.
As a more specific example, consider if I have the following data.frame
data <- as.data.frame(rbind(c(6,15,23),c(9,11,22), c(7,13,23), c(6,12,25),c(7,13,23)))
names(data) <- c("V1","V2","V3")
> data
V1 V2 V3
1 6 15 23
2 9 11 22
3 7 13 23
4 6 12 25
5 7 13 23
At each observation date, I wish to run a PCA (function prcomp() from the stats package) on all observations up to, and including, that observation. So I first want to run PCA for the first two observations
pca2 <- prcomp(data[1:2,], scale = TRUE)
next I want to run PCA with the first, second and third observation as input
pca3 <- prcomp(data[1:3,], scale = TRUE)
next I want to run PCA with the first, second, third and fourth observation as input
pca4 <- prcomp(data[1:4,], scale = TRUE)
and so on, until the last run of the PCA, which includes all observations in the dataframe. For each of these "runs" of the PCA, I wish to extract the last value (though for pca2, I use both the first and second value) of the first principal component (PC1), and merge these into a final dataframe, where each monthly observation is the last value of the first principal component of PCA results for each of the runs.
The principal component outputs are:
> my_pca2 <- as.data.frame(pca2$x)
> my_pca2
PC1 PC2
1 -1.224745 -5.551115e-17
2 1.224745 5.551115e-17
> my_pca3 <- as.data.frame(pca3$x)
> my_pca3
PC1 PC2 PC3
1 -1.4172321 -0.2944338 6.106227e-16
2 1.8732448 -0.1215046 3.330669e-16
3 -0.4560127 0.4159384 4.163336e-16
> my_pca4 <- as.data.frame(pca4$x)
> my_pca4
PC1 PC2 PC3
1 -1.03030993 -1.10154914 0.015457199
2 2.00769890 0.07649216 0.011670433
3 0.03301806 -0.24226508 -0.033461874
4 -1.01040702 1.26732205 0.006334242
So I want my final output to be a dataframe to look like
>final.output
PC1
1 -1.224745
2 1.224745
3 -0.4560127
4 -1.01040702
Comment: yes, it looks a bit weird with the first two values, but please don't pay too much attention to that. My point is that I wish to build a dataframe consisting of the last calculated value of the first principal component from each of the PCA runs.
I am thinking that a for-loop might be the best solution here, but I have not been successful in finding any threads that might guide me closer to a coding solution. How can I make the loop use an increasing amount of the dataframe in its calculations? Does anyone have any suggestions/tips/links? Any help on this is much appreciated!
I had a very similar approach.
PCA <- vector("list", length=nrow(data)-1)
for(i in 1:(nrow(data)-1)) {
  if(i==1) j <- 1:2 else j <- i+1
  PCA[[i]] <- as.data.frame(prcomp(data[1:(1+i),], scale = TRUE)$x)[j, 1]
}
unlist(PCA)
You can use a for loop. It's maybe not the most efficient solution, but it will work.
First, you create an empty list to store your results:
all_results <- list()
Next, you iterate from 2 to the number of rows of data with a loop. For each iteration of the loop, run prcomp on data[1:i,]. You can directly create your pca data frame and extract PC1 from it as a vector, then store it in the list at index i - 1:
for(i in 2:nrow(data)) {
  all_results[[i - 1]] <- as.data.frame(prcomp(data[1:i,], scale = TRUE)$x)$PC1
}
Now to extract all the results, you use lapply (list apply) to extract only the last element from each PC1 vector:
PC1 <- lapply(all_results, function(pca) pca[length(pca)] )
Now you convert these from a list of single elements to a vector:
PC1 <- do.call("c", PC1)
Finally, you want to stick the first value of the first analysis back on to the front of this vector:
PC1 <- c(all_results[[1]][1], PC1)
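If you prefer not to grow a list by hand, the same expanding-window idea fits in a few lines. A sketch (final.output is the name from the question):
last_pc1 <- sapply(2:nrow(data), function(i) {
  scores <- prcomp(data[1:i,], scale = TRUE)$x[, "PC1"]
  scores[length(scores)]  # last PC1 score of this window
})
## prepend the first score of the first (two-row) window, as in the question
final.output <- data.frame(PC1 = c(prcomp(data[1:2,], scale = TRUE)$x[1, "PC1"], last_pc1))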

Reducing correlation of datasets with NA

Consider the sample data below:
a=c(NA,1,NA)
b=c(1,2,4)
c=c(0,1,0)
d=c(1,2,4)
df=data.frame(a,b,c,d)
The objective is to find the correlation between two columns, where NA should reduce the correlation. NA means that an event did not take place.
Is there a way to use NA in the correlation such that it pulls down the value of the correlation?
> cor(df$a, df$b)
[1] NA
Or should I be looking at some other mathematical function?
Is there a way to use NA in the correlation such that it pulls down the value of the correlation?
Here is a way to use NA values to decrease correlation. For demonstration, I am using different data with some good size.
a <- sort(runif(10))
b <- sort(runif(10))
## Sorting so that there is some good correlation between them.
## Now making some values NA deliberately
a[c(9,10)] <- NA
cor(a[1:8],b[1:8])
## [1] 0.890465 #correlation value is high
## Let's assign a to c and fill the NA values with something
c <- a
## using mean causes no change to numerator but increases denominator.
c[is.na(a)] <- mean(a, na.rm=TRUE)
cor(c,b)
## [1] 0.6733387
Note that when you replace the NA terms with the mean, the numerator does not change, since the additional terms are multiplied by zero. The denominator, however, adds some more values for b, so the correlation value comes down. Also, the more NAs in your data, the more the correlation comes down.
The question doesn't make mathematical sense as there is no correlation between events that didn't happen. Correlation cannot be reduced by no event happening. There is no function to do this other than to transform the data.
You may replace the NA values with something, as @Ujjwal Kumar has suggested, but this is just data manipulation and not a predefined function.
Look at the help file for cor (?cor): using calls like cor(df$a, df$b, use="pairwise.complete.obs"), you can see how NA values are usually treated; they are simply removed and have no impact on the correlation itself.
?cor output
If use is "everything", NAs will propagate conceptually, i.e., a resulting value will be NA whenever one of its contributing observations is NA.
If use is "all.obs", then the presence of missing observations will produce an error. If use is "complete.obs" then missing values are handled by casewise deletion (and if there are no complete cases, that gives an error).
"na.or.complete" is the same unless there are no complete cases, that gives NA. Finally, if use has the value
"pairwise.complete.obs" then the correlation or covariance between each pair of variables is computed using all complete pairs of observations on those variables. This can result in covariance or correlation matrices which are not positive semi-definite, as well as NA entries if there are no complete pairs for that pair of variables. For cov and var, "pairwise.complete.obs" only works with the "pearson" method. Note that (the equivalent of) var(double(0), use = *) gives NA for use = "everything" and "na.or.complete", and gives an error in the other cases.
I guess there is no simple explanation. You have to remove the data with NA, and of course the corresponding data in columns b, c, d, and then compute the correlation. You can check whether there are corresponding NAs in each dataset (a, b, c, d).
In your example you can compute the correlation for all combinations of b, c, d, but if you want to compute cor(a,b) you have to pick only the rows without NA in a and b. And perhaps, when you compute this cor(a,b), multiply it by the number of rows without NA in a and b divided by the total number of rows in the dataset.
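That weighting idea could be sketched as a small helper (an ad-hoc penalty, not a standard statistic; penalized_cor is a made-up name). Using the a and b vectors from the first answer:
penalized_cor <- function(x, y) {
  ok <- complete.cases(x, y)
  # correlation on the complete pairs, scaled by the share of complete pairs
  cor(x[ok], y[ok]) * sum(ok) / length(x)
}
penalized_cor(a, b)  ## 0.890465 * 8/10 = 0.712372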

Multinomial logit model in R on grouped data, data conversion and mlogit set-up

I want to estimate the parameters of a multinomial logit model in R and wondered how to correctly structure my data. I'm using the "mlogit" package.
The purpose is to model people's choice of transportation mode. However, the dataset is a time series at an aggregated level, e.g.:
This data must be reshaped from grouped count data to ungrouped data. My approach is to make three new rows for every individual, so I end up with a dataset looking like this:
For every individual's choice in the grouped data I make three new rows and use chid to tie these three rows together. I now want to run:
mlogit.data(MyData, choice = "choice", chid.var = "chid", alt.var = "mode")
Is this the correct approach? Or have I misunderstood the purpose of the chid variable?
It's too bad this was migrated from stats.stackexchange.com, because you probably would have gotten a better answer there.
The mlogit package expects data on individuals, and can accept either "wide" or "long" data. In the former there is one row per individual indicating the mode chosen, with separate columns for every combination of the mode-specific variables (time and price in your example). In the long format there are n rows for every individual, where n is the number of modes, a second column containing TRUE or FALSE indicating which mode was chosen for each individual, and one additional column for each mode-specific variable. Internally, mlogit uses long format datasets, but you can provide wide format and have mlogit transform it for you. In this case, with just two variables, that might be the better option.
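To make the two shapes concrete, a long-format fragment for a single individual who chose the car would look something like this (illustrative values only):
##   individual    mode choice price time
## 1          1     car   TRUE   120    5
## 2          1     bus  FALSE    60   10
## 3          1 bicycle  FALSE     0   30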
Since mlogit expects individuals, and you have counts of individuals, one way to deal with this is to expand your data to have the appropriate number of rows for each mode, filling out the resulting data.frame with the variable combinations. The code below does that:
df.agg <- data.frame(month=1:4,car=c(3465,3674,3543,4334),bus=c(1543,2561,2432,1266),bicycle=c(453,234,123,524))
df.lvl <- data.frame(mode=c("car","bus","bicycle"), price=c(120,60,0), time=c(5,10,30))
get.mnth <- function(mnth) data.frame(mode=rep(names(df.agg[2:4]),df.agg[mnth,2:4]),month=mnth)
df <- do.call(rbind,lapply(df.agg$month,get.mnth))
cols <- unlist(lapply(df.lvl$mode,function(x)paste(names(df.lvl)[2:3],x,sep=".")))
cols <- with(df.lvl,setNames(as.vector(apply(df.lvl[2:3],1,c)),cols))
df <- data.frame(df, as.list(cols))
head(df)
# mode month price.car time.car price.bus time.bus price.bicycle time.bicycle
# 1 car 1 120 5 60 10 0 30
# 2 car 1 120 5 60 10 0 30
# 3 car 1 120 5 60 10 0 30
# 4 car 1 120 5 60 10 0 30
# 5 car 1 120 5 60 10 0 30
# 6 car 1 120 5 60 10 0 30
Now we can use mlogit(...)
library(mlogit)
fit <- mlogit(mode ~ price+time|0 , df, shape = "wide", varying = 3:8)
summary(fit)
#...
# Frequencies of alternatives:
# bicycle bus car
# 0.055234 0.323037 0.621729
#
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# price 0.0047375 0.0003936 12.036 < 2.2e-16 ***
# time -0.0740975 0.0024303 -30.489 < 2.2e-16 ***
# ...
coef(fit)["time"]/coef(fit)["price"]
# time
# -15.64069
So this suggests that reducing travel time by 1 (minute?) is worth about 15 (dollars)?
This analysis ignores the month variable. It's not clear to me how you would incorporate that, as month is neither mode-specific nor individual-specific. You could "pretend" that month is individual-specific and use a model formula like mode ~ price+time|month, but with your dataset the system is computationally singular.
To reproduce the result from the other answer, you can use mode ~ 1|month with reflevel="car". This ignores the mode-specific variables and just estimates the effect of month (relative to mode = car).
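Something like this (a sketch along the lines of the fit above, not verified against that answer's output):
fit2 <- mlogit(mode ~ 1 | month, df, shape = "wide", varying = 3:8, reflevel = "car")
summary(fit2)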
There's a nice tutorial on mlogit here.
Are price and time real variables that you're trying to make a part of the model?
If not, then you don't need to "unaggregate" that data. It's perfectly fine to work with counts of the outcomes directly (even with covariates). I don't know the particulars of doing that in mlogit, but with multinom it's simple:
# Assuming your original data frame is saved in "df" below
library(nnet)
response <- as.matrix(df[,c('Car', 'Bus', 'Bicycle')])
predictor <- df$Month
# Determine how the multinomial distribution parameter estimates
# are changing as a function of time
fit <- multinom(response ~ predictor)
In the above case the counts of the outcomes are used directly with one covariate, "Month". If you don't care about covariates, you could also just use multinom(response ~ 1) but it's hard to say what you're really trying to do.
Glancing at the "TravelMode" data in the mlogit package and some examples for it though, I do believe the options you've chosen are correct if you really want to go with individual records per person.

Looping a Student T-Test and Chi-Squared with Missing Data in R

I am trying to use R to run a Student t-test and a chi-squared test with large data sets. Since I am fairly new to R, my inexperience has prevented much success with my own code.
Both data sets have missing data and look something like this:
AA          assayX activity   assayY1 activity   assayY2 activity
chemical 1  TRUE              0                  12.2
chemical 2  TRUE              0
chemical 3                    45.2               35.6
chemical 4  FALSE             0                  0

AB          assayX activity   assayY1 activity   assayY2 activity
chemical 1  TRUE              FALSE              TRUE
chemical 2  TRUE              FALSE
chemical 3                    TRUE               TRUE
chemical 4  FALSE             FALSE              FALSE
Since it is a large data set, I am trying to create code where I can compare assayX to all assayYs. I'm hoping to create a Student t-test loop for the first data set, and a chi-squared loop for the second data set. I had previously been successful creating loop code for a correlation analysis, so I based my code off of that idea.
x<- na.omit(mydata1[, c(assayX)])
y<- na.omit(mydata1[, c(assayY1:assayYend)])
lapply(y, function(x)t.test(y~x))
x<-na.omit(mydata2[, c(assayX)])
y <- na.omit(mydata2[, c(assayY1:assayYend)])
lapply(y, x=x, chisq.test)
Problem with the first code is:
Invalid variable y
Problem with the second code is:
x and y must have the same length
I've done small tweaks here and there and have just got different types of errors like not enough 'y' observations and so on. I've been primarily using this site to figure out how work R, so I'm hoping you guys will have a clever little solution for a new guy.
After a long time and gaining experience in R, I can answer my own question. The first step is to read the data file so that blanks become NA:
df1 <- read.csv("data2.csv", header=T, na.strings=c("","NA"))
Then for the Student t-test:
df1.p = rep(NA, length(df1$Assays))
for (i in seq_along(df1$Assays)){
  test = t.test(df1[,c(i)] ~ df1$assay.activity)
  current.p.val = test$p.value
  df1.p[i] = current.p.val
}
Then to add a Pearson's correlation or chi-squared test (not actually appropriate for this dataset, but just as an example):
df1.p.2= rep(NA, length(df1$Assays))
df1.r.2= rep(NA, length(df1$Assays))
for (i in seq_along(df1$Assays)){
  test2 = cor.test(df1$assay.activity, df1[,c(i)], method='pearson')
  current.p.val2 = test2$p.value
  current.rval = test2$estimate
  df1.p.2[i] = current.p.val2
  df1.r.2[i] = current.rval
}
df2= cbind(df1$Assays, df1.p, df1.p.2, df1.r.2)
I then filtered it for only assays with 0.1 significance, but that wasn't the question here. If you want to know that, just ask a question and I'll post an answer there :)
I don't think your data is being passed correctly to the test. t.test has arguments for whether the data is paired or not (the default is false) and for how to handle NAs, should you want to change from the default. You should probably use those rather than omitting NAs up front. An example with NAs in the data:
set.seed(1)
y <- runif(30, 0, 1)
y.NA <- c(3,24,27)
y[y.NA] <- NA
x <- runif(30, 0, 1)
x.NA <- c(1,3,8,12,21)
x[x.NA] <- NA
t.test(x,y)
For chisq.test you can use the table function.
chisq.test(table(x,y))$p.value
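Putting the pieces together for the looped comparison the question asks about, a minimal sketch (the data frame and column names here are made up; note that t.test drops NAs itself, so no up-front na.omit is needed):
set.seed(1)
mydata1 <- data.frame(assayX = rep(c(TRUE, FALSE), 15),
                      assayY1 = runif(30), assayY2 = runif(30))
mydata1$assayY1[c(3, 24, 27)] <- NA
## run a two-sample t-test of each Y column against the X grouping
tt <- lapply(mydata1[c("assayY1", "assayY2")],
             function(y) t.test(y ~ mydata1$assayX))
sapply(tt, function(t) t$p.value)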
