Simple if statement: comparing non-numeric values in R

My data:
pirmas antras trecias
17     44     55
788    890    1409
968    218    344
333    355    NA
I want to check which correlation is bigger:
the correlation between the pirmas and antras columns, or
the correlation between the antras and trecias columns.
Next, I want to write the if statement: if the correlation between the antras and trecias columns is bigger, I fill the NA value in the last column with the value from the antras column.
BUT I get an error, because cor.test returns a test object rather than a single number, so I cannot compare the results in an if statement.
How can I do this?
My source code:
data<- X12_5_3
data
a<-cor.test(data$pirmas, data$trecias)
b<-cor.test(data$antras, data$trecias)
if (a<b) {
data$trecias[4]<-data$antras[4]
}
data

You can extract the correlation value from the test objects with $estimate.
set.seed(7)
a <- cor.test(rnorm(5), rnorm(5))  # cor.test returns an htest object, not a number
b <- cor.test(rnorm(5), rnorm(5))
if (a$estimate < b$estimate) {     # $estimate is the numeric correlation coefficient
  print('correlation of a smaller than b')
}
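As a side note, cor.test stores the estimate as a named number; a small sketch of getting a plain numeric out of it:
unname(a$estimate)  # drops the "cor" name, leaving a plain numeric value
# str(a) lists everything else the htest object carries (p.value, conf.int, ...)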

If you don't need a hypothesis test, just use cor() to get the correlation coefficient. Also, because of the missing values, you need to set the use argument to handle them.
a <- cor(data$pirmas, data$trecias, use = "complete.obs")
b <- cor(data$antras, data$trecias, use = "complete.obs")
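Putting the two answers together, a minimal sketch of the whole workflow might look like this. It follows the comparison described in the question's prose (pirmas~antras versus antras~trecias), keeps the question's object X12_5_3 and column names, and uses is.na() instead of a hard-coded row index:
data <- X12_5_3
a <- cor(data$pirmas, data$antras, use = "complete.obs")
b <- cor(data$antras, data$trecias, use = "complete.obs")
if (b > a) {
  # fill the missing trecias entries with the matching antras values
  data$trecias[is.na(data$trecias)] <- data$antras[is.na(data$trecias)]
}
data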

Related

How to use anovaScores results to eliminate columns/predictors with p-values greater than 0.01

I had a dataset with 36400 columns/features/predictors (types of proteins) and 500 observations; the last column is the response column "class", which indicates 2 types of cells, A and B. We're supposed to perform feature selection to reduce the number of predictors that help in differentiating the 2 cell types. The first step was to remove all columns whose max value was less than 2. I did the below to achieve this and reduced the number of predictors to 26000:
newdf <- protein2 %>%
  # keep columns whose max value is greater than or equal to 2
  select_if(~max(., na.rm = TRUE) >= 2)
ncol(newdf)
To reduce it further, we're expected to remove predictors with low variance by performing an ANOVA test on each predictor and removing predictors with p-value >= 0.01. I think I did it right using the code below:
scores <- as.data.frame(apply(newdf[,-ncol(newdf)],2, anovaScores, newdf$class))
scores
new_scores <- scores[scores<0.01]
I'm not sure why, but I can't confirm my results using ncol or colnames or anything similar. length(new_scores) gives 2084, which is in the range of reduced predictors the professor is expecting. But I need someone to confirm whether this was the right way to go about it, and if so, why I'm not able to split my data into training and testing datasets.
When trying that, I get the error
Error in new_scores$class : $ operator is invalid for atomic vectors.
This is how I'm splitting training and testing dataset:
intrain <- createDataPartition(y = new_scores$class ,p = 0.8,list = FALSE) #split data
assign("training", new_scores[intrain,] )
assign("testing", new_scores[-intrain,] )
The problem is in the createDataPartition line, but I'm not sure if something I did in the prior steps is incorrect or if I'm missing something.
I'm not sure how to provide reproducible data, but below is a snippet of the data, with the last column being the response variable class and the rest all predictors:
X Y Z A B C class
3 4.5 3 4 8 10.1 A
9 6 2.5 6 4 4 B
4 3.8 4 9 6 8.2 B
6 7.1 6 7 4 8 A
4 5.6 9 5 3 7.5 A
You can do it like this; I used an example dataset:
library(caret)
library(mlbench)
data(Sonar)
newdf = Sonar
It makes sense to split the data into train and test first (see the comments below by @missuse for details and other possible alternatives):
intrain <- createDataPartition(y = newdf$Class ,p = 0.8,list = FALSE) #split data
training = newdf[intrain,]
test = newdf[-intrain,]
We calculate the scores; this returns a vector:
scores <- apply(training[,-ncol(training)],2, anovaScores, training$Class)
table(scores<0.01)
FALSE TRUE
33 27
We expect to get back the 27 predictor columns with p < 0.01, so we subset the data.frame with that. We build a vector of the columns to retain (including the dependent variable):
keep = c(which(scores<0.01),ncol(training))
training = training[,keep]
test = test[,keep]
> dim(training)
[1] 167 28
> dim(test)
[1] 41 28
And you can run caret from here.
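From there, a hedged sketch of what "running caret" could look like; the model and resampling choices here are illustrative only, not something specified in the question:
set.seed(1)
fit <- train(Class ~ ., data = training,
             method = "glm",                                   # any two-class caret method works here
             trControl = trainControl(method = "cv", number = 5))
pred <- predict(fit, newdata = test)
confusionMatrix(pred, test$Class)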

Conducting a t-test with a grouping variable

Getting started on an assignment with R, and I haven't really worked with it before, so apologies if this is basic.
brain is an Excel data frame. Its format is as follows (for an odd 40-some rows):
para1 para2 para3 para4 para5 para6 para7
FF 133 132 124 118 64.5 816932
highVAL = ifelse(brain$para2>=130,1, 0)
highVAL gives me a vector of 1's and 0's, categorized by para2.
I'm looking to perform a t-test on the mean para7 between two sets: rows that have para2 > 130 and those that have para2 < 130.
In Python, I would construct two new arrays and append values in, and perform a t-test there. Not sure how I would go about it in R.
You're closer than you think! Your highVAL variable should be added as a new column to the brain data frame:
brain$highVAL <- brain$para2 >= 130
This adds a TRUE/FALSE column to the dataset. Then you can run the test using t.test's formula interface:
result <- t.test(para7 ~ highVAL, data = brain)
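If you prefer the two-vector approach you described from Python, that works in R as well; a quick sketch using the question's para column names (placeholders for the real column names):
group_high <- brain$para7[brain$para2 >= 130]
group_low  <- brain$para7[brain$para2 < 130]
t.test(group_high, group_low)   # same comparison, without the formula interface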

Performing a Specific Function on One Column for the First 12 Rows?

This is easy, but for some reason I'm having trouble with it. I have a set of Data like this:
File Trait Temp Value Rep
PB Mortality 16 52.2 54
PB Mortality 17 21.9 91
PB Mortality 18 15.3 50
...
And it goes on like that for 36 rows. What I need to do is divide the Value column by 100 in only the first 12 rows. I did:
NewData <- Data[1:12,4]/100
to try to create a new data frame without changing the old data. When I do this, it divides the fourth column but saves only that column (rows 1-12) as a value in the Global Environment by itself, not as data with the rest of the rows/columns from the original set. Overall, I'm trying to fit NewData in an nls function, so I need to save the modified data together with the rest of the data, not as a separate value. Is there a way for me to modify the first 12 rows without R saving the result as a standalone value?
Consider copying the data frame and then updating the column at select rows:
NewData <- Data
NewData$Value[1:12] <- NewData$Value[1:12]/100
# NewData[1:12,4] <- NewData[1:12,4]/100   # ALTERNATE EQUIVALENT
library(dplyr)
newdata <- Data[1:12,] %>% mutate(newV = Value/100)
newdata$Value <- newdata$newV
newdata <- newdata %>% select(-newV)
Then you can do
full_data <- rbind(newdata, Data[13:36,])
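A quick sanity check (a sketch, assuming the 36-row layout described above): the ratio of new to old values should be 0.01 for the first 12 rows and 1 afterwards.
full_data$Value / Data$Value   # 0.01 for rows 1-12, 1 for rows 13-36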

Reducing correlation of datasets with NA

Consider the sample data below:
a=c(NA,1,NA)
b=c(1,2,4)
c=c(0,1,0)
d=c(1,2,4)
df=data.frame(a,b,c,d)
The objective is to find the correlation between 2 columns, where NA should reduce the correlation. NA means that an event did not take place.
Is there a way to use NA in the correlation such that it pulls down the value of the correlation?
> cor(df$a, df$b)
[1] NA
Or should I be looking at some other mathematical function?
Is there a way to use NA in the correlation such that it pulls down the value of the correlation?
Here is a way to use NA values to decrease the correlation. For demonstration, I am using different data of a reasonable size.
a <- sort(runif(10))
b <- sort(runif(10))
## Sorting so that there is some good correlation between them.
## Now making some values NA deliberately
a[c(9,10)] <- NA
cor(a[1:8],b[1:8])
## [1] 0.890465 #correlation value is high
## Let's assign a to c and fill the NA values with something
c <- a
## Using the mean causes no change to the numerator but increases the denominator.
c[is.na(a)] <- mean(a, na.rm=T)
cor(c,b)
## [1] 0.6733387
Note that when you replace the NA terms with the mean, the numerator does not change, since the additional terms are multiplied by zero. The denominator, however, picks up more values from b, so the correlation value comes down. Also, the more NAs in your data, the more the correlation comes down.
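To spell out why, written for the filled-in vector c with Pearson's r = Σ(c_i − c̄)(b_i − b̄) / sqrt( Σ(c_i − c̄)² · Σ(b_i − b̄)² ): the filled-in entries satisfy c_i = c̄, so they contribute zero to the numerator and to Σ(c_i − c̄)², while Σ(b_i − b̄)² now sums over the extra rows, which shrinks r.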
The question doesn't make mathematical sense, as there is no correlation between events that didn't happen; correlation cannot be reduced by an event not happening. There is no function to do this other than to transform the data.
You may replace the NA values with something, as @Ujjwal Kumar has suggested, but this is just data manipulation and not a predefined function.
Look at the help file for cor (?cor); using calls like cor(df$a, df$b, use="pairwise.complete.obs"), you can see how NA values are usually treated: they are simply removed and have no impact on the correlation itself.
?cor output
If use is "everything", NAs will propagate conceptually, i.e., a resulting value will be NA whenever one of its contributing observations is NA.
If use is "all.obs", then the presence of missing observations will produce an error. If use is "complete.obs" then missing values are handled by casewise deletion (and if there are no complete cases, that gives an error).
"na.or.complete" is the same unless there are no complete cases, that gives NA. Finally, if use has the value
"pairwise.complete.obs" then the correlation or covariance between each pair of variables is computed using all complete pairs of observations on those variables. This can result in covariance or correlation matrices which are not positive semi-definite, as well as NA entries if there are no complete pairs for that pair of variables. For cov and var, "pairwise.complete.obs" only works with the "pearson" method. Note that (the equivalent of) var(double(0), use = *) gives NA for use = "everything" and "na.or.complete", and gives an error in the other cases.
I guess there is no simple explanation. You have to remove the rows with NA, and of course the corresponding data in columns b, c, d, and then compute the correlation. You can check whether there are corresponding NAs in each column (a, b, c, d).
In your example you can compute the correlation for all combinations of b, c, d, but if you want to compute cor(a,b) you have to pick only the rows that have no NA in either a or b. And perhaps, when you compute this cor(a,b), multiply it by the number of rows without NA in a and b, divided by the total number of rows in the dataset.
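A sketch of that heuristic as a helper function (the name penalized_cor and the scaling rule are just this answer's suggestion, not a standard statistic; note the question's three-row example has only one complete a/b pair, so cor() itself would still return NA there):
penalized_cor <- function(x, y) {
  ok <- complete.cases(x, y)                 # rows where both values are observed
  cor(x[ok], y[ok]) * sum(ok) / length(x)    # complete-case cor, shrunk by the share of complete rows
}
penalized_cor(df$b, df$d)   # no NAs here, so this equals cor(df$b, df$d)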

How to check if an anova test excludes zero values

I was wondering if there is a simple way to check whether the zero values in my data are excluded from my ANOVA.
I first changed all my zero values to NA with
BFL$logDecomposers[which(BFL$logDecomposers==0)] = NA
I'm not sure if na.action=na.exclude makes sure my values are being ignored (like I want them to be)?
standard<-lm(logDecomposers~1, data=BFL) #null model
ANOVAlnDeco.lm<-lm(logDecomposers~Species_Number,data=BFL,na.action=na.exclude)
anova(standard,ANOVAlnDeco.lm)
P.S.:I've just been using R for a few weeks, and this website has been of tremendous help to me :)
You haven't given a reproducible example, but I'll make one up.
set.seed(101)
mydata <- data.frame(x=rnorm(100),y=rlnorm(100))
## add some zeros
mydata$y[1:5] <- 0
As pointed out by @Henrik, you can use the subset argument to exclude these values:
nullmodel <- lm(y~1,data=mydata,subset=y>0)
fullmodel <- update(nullmodel,.~x)
It's a little confusing, but na.exclude and na.omit (the default) actually lead to the same fitted model -- the difference is in whether NA values are included when you ask for residual or predicted values. You can try it out:
mydata2 <- within(mydata,y[y==0] <- NA)
fullmodel2 <- update(fullmodel,subset=TRUE,data=mydata2)
(subset=TRUE turns off the previous subset argument, by specifying that all the data should be included).
You can compare the fits (coefficients etc.). One shortcut is to use the nobs method, which counts the number of observations used in the model:
nrow(mydata) ## 100
nobs(nullmodel) ## 95
nobs(fullmodel) ## 95
nobs(fullmodel2) ## 95
nobs(update(fullmodel,subset=TRUE)) ## 100
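To see where na.exclude and na.omit actually differ, a small sketch built on mydata2 from above (they fit the same model, but na.exclude pads the residuals back to the full data length):
fit_omit    <- lm(y ~ x, data = mydata2, na.action = na.omit)
fit_exclude <- lm(y ~ x, data = mydata2, na.action = na.exclude)
length(residuals(fit_omit))     ## 95  (NA rows dropped from the output)
length(residuals(fit_exclude))  ## 100 (NAs padded back in, matching nrow(mydata2))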

Resources