R: errors in cor() and corrplot()

Another stumbling block. I have a large data set (called "brightly") with roughly 180k rows and 165 columns. I am trying to create a correlation matrix of these columns in R.
Several problems have arisen, none of which I can resolve with the suggestions proposed on this site and others.
First, how I created the data set: I saved it as a CSV file from Excel. My understanding is that CSV should remove any formatting, such that anything that is a number should be read as a number by R. I loaded it with
brightly = read.csv("brightly.csv", header=TRUE)
But I kept getting "'x' must be numeric" error messages every time I ran cor(brightly), so I replaced all the NAs with 0s. (This may be altering my data, but I think it will be all right--anything that's "NA" is effectively 0, either for the continuous or dummy variables.)
Now I am no longer getting the error message about text. But any time I run cor()--either on all of the variables simultaneously or combinations of the variables--I get "Warning message:
In cor(brightly$PPV, brightly, use = "complete") :
the standard deviation is zero"
I am also having some of the correlations of that one variable with others show up as "NA." I have ensured that no cell in the data is "NA," so I do not know why I am getting "NA" values for the correlations.
I also tried both of the following to make REALLY sure I wasn't including any NA values:
cor(brightly$PPV, brightly, use = "pairwise.complete.obs")
and
cor(brightly$PPV,brightly,use="complete")
But I still get warnings about the SD being zero, and I still get the NAs.
Any insights as to why this might be happening?
Finally, when I try to do corrplot to show the results of the correlations, I do the following:
brightly2 <- cor(brightly)
Warning message:
In cor(brightly) : the standard deviation is zero
corrplot(brightly2, method = "number")
Error in if (min(corr) < -1 - .Machine$double.eps || max(corr) > 1 + .Machine$double.eps) { :
missing value where TRUE/FALSE needed
And instead of producing my nice color-coded correlation matrix, it fails with that error. I have yet to find an explanation of what it means.
Any help would be HUGELY appreciated! Thanks very much!!

Please check whether you replaced your NAs with 0 or '0', as the first is numeric and the second is a character. You can also try as.numeric(column_name) to convert character 0s to numeric. This error also occurs if your dataset contains factors, because those are not numeric values, and corrplot throws this error on them.
It would be helpful if you put a sample of your data in the question using
str(head(your_dataset))
That also lets you check the data types of the columns.
Let me know if I am wrong.
Cheerio.
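One other likely culprit, given the "standard deviation is zero" warning: replacing every NA with 0 can turn a column that was entirely NA into a constant column, and a constant column has no defined correlation, which produces the NA entries that later break corrplot. Below is a minimal base-R sketch (the column names are invented, not from the actual "brightly" data) of how one might find and drop zero-variance columns before calling cor():

```r
# Toy stand-in for "brightly": one column became constant after NA -> 0
brightly <- data.frame(
  PPV   = c(1, 2, 3, 4, 5),
  other = c(2, 1, 4, 3, 5),
  dead  = c(0, 0, 0, 0, 0)   # was all NA, replaced with 0s
)

# Flag constant (zero-standard-deviation) columns
constant <- vapply(brightly, function(x) sd(x, na.rm = TRUE) == 0, logical(1))
names(brightly)[constant]    # the columns that trigger the warning

# Drop them; the resulting correlation matrix contains no NAs
brightly_ok <- brightly[, !constant, drop = FALSE]
cor_mat <- cor(brightly_ok)
any(is.na(cor_mat))          # FALSE
```

With the constant columns removed, corrplot(cor_mat) should no longer hit the "missing value where TRUE/FALSE needed" check.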

Related

Why does R keep giving me Undefined Columns Selected Error, and when I input a value, it gives me "invalid type (NULL)"?

I am a total beginner in R and I have to write the lines of code shown in the attached image. When writing the highlighted line, I get an "undefined columns selected" error. I am wondering what is causing that issue? When I try to insert a value before the i+1, for example crash_data[HMSP, i+1], it gives me an "invalid type (NULL) for variable" error. HMSP is the name of one of the columns within the data set I imported.
Does anyone have a solution for this? I would appreciate it so much, thank you.
[Screenshot: the lines of R code I am following, implementing Equation (1)]
Link to dataset: https://drive.google.com/file/d/1qDvDJg5IORAlhNJH-e1ILco38S5zI_Qz/view?usp=sharing
crash_data <- read.csv("Dummycsv.csv", header = TRUE)
Residuals_expanded_model <- matrix(0, 311, 51)
for (i in 1:51) {
  Residuals_expanded_model[, i] <- resid(lm(crash_data[, i + 1] ~
    crash_data[, 53] + crash_data[, 54] + crash_data[, 55] + crash_data[, 56] +
    crash_data[, 57]))
}
firm_spe_week_retur <- log(matrix(1, 311, 51) + Residuals_expanded_model)
write.csv(firm_spe_week_retur, "returns_W.csv")
Is HMSP the actual name, or a variable storing the name? If the former, you must use "HMSP". Also, the square brackets refer to x[row, col]
The problem is that there are NA values in your matrix and thus there are missing residuals, i.e. you try to fill the 311 elements of each matrix column with only 309 values from your model. You can locate the missing cells with which(is.na(crash_data), arr.ind = TRUE)
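One standard way around this length mismatch, without dropping rows by hand, is to fit with na.action = na.exclude: resid() then returns a vector padded with NA at the positions of the incomplete observations, so its length always matches the number of rows. A minimal sketch with made-up data (the variables only mimic the question's layout):

```r
set.seed(1)
# Toy data: 16 observations, one NA in the response
y <- c(rnorm(10), NA, rnorm(5))
x <- rnorm(16)

# na.exclude keeps the residual vector aligned with the original rows
fit <- lm(y ~ x, na.action = na.exclude)

r <- resid(fit)
length(r)        # 16 - same length as the data
which(is.na(r))  # position 11, where y was NA
```

Applied to the question's loop, adding na.action = na.exclude to the lm() call would let Residuals_expanded_model[, i] receive a full-length (311-element) vector, with NA marking the rows that could not be fitted.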

traj step1measures: why am I getting an NA error if I have no missing values?

This is what happens when I run traj::step1measures
step1measures(datamat, timemat, ID = TRUE)
Error in if (cor.mat[i_row, i_col] > 0.999) { : missing value where TRUE/FALSE needed
2. check.correlation(output[, -1], verbose)
1. step1measures(datamat, timemat, ID = TRUE)
I have checked multiple times and I am sure that there are no null or missing values in the data and time matrices. Any suggestions for what's going wrong here/ where a missing value could be popping up?
There are a few reasons why you may be hitting this error:
1. At least one row of your data does not have the required number of data points. A minimum of 4 data points per row is required. You can see the data requirements in the function's documentation: https://cran.rstudio.com/web/packages/traj/traj.pdf
2. Your data contains an ID column but you did not indicate that to the function.
3. Any other unexpected data value combination that yields 'Inf', 'NA' or 'NaN' for one of the measures. This is the sneaky one. You may need to go to line 416 of the step1measures script and view the data before it's passed to the correlation function. You may notice that some data rows contain the invalid values. I would recommend removing those rows. In an ideal world, the package would catch such issues and display a better error, but that's not the case today.
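Points 1 and 3 can be screened for before calling step1measures at all. A hedged base-R sketch (the matrix below is invented purely for illustration) that flags rows with fewer than 4 non-missing points or with non-finite values:

```r
# Toy trajectory matrix: 4 subjects x 6 time points
datamat <- rbind(
  c(1, 2,   3, 4, 5, 6),
  c(1, NA, NA, NA, NA, 2),   # only 2 data points - below the minimum of 4
  c(2, 3, Inf, 5, 6, 7),     # contains a non-finite value
  c(3, 4,   5, 6, 7, 8)
)

# Count usable points per row, and detect Inf/NaN (NA is handled separately)
n_points <- rowSums(!is.na(datamat))
has_nonfinite <- apply(datamat, 1, function(x) any(!is.finite(x) & !is.na(x)))

bad_rows <- n_points < 4 | has_nonfinite
which(bad_rows)   # rows 2 and 3

datamat_clean <- datamat[!bad_rows, , drop = FALSE]
```

Remember to drop the same rows from the time matrix so the two stay aligned.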

Performing HCPC on the columns (i.e. variables) instead of the rows (i.e. individuals) after (M)CA

I would like to perform an HCPC on the columns of my dataset, after performing a CA. For some reason I also have to specify at the start that all of my columns are of type 'factor', just to loop over them afterwards and convert them back to numeric. I don't know why exactly, because if I check the type of each column (without specifying them as factor) they appear to be numeric. When I don't load and convert the data like this, however, I get an error like the following:
Error in eigen(crossprod(t(X), t(X)), symmetric = TRUE) : infinite or missing values in 'x'
Could this be due to the fact that there are columns in my dataset that only contain 0's? If so, how come that it works perfectly fine by reading everything in first as factor and then converting it to numeric before applying the CA, instead of just performing the CA directly?
The original issue with the HCPC, then, is the following:
# read in data; 40 x 267 data frame
data_for_ca <- read.csv("./data/data_clean_CA_complete.csv",row.names=1,colClasses = c(rep('factor',267)))
# loop over first 267 columns, converting them to numeric
for(i in 1:267)
data_for_ca[[i]] <- as.numeric(data_for_ca[[i]])
# perform CA
data.ca <- CA(data_for_ca,graph = F)
# perform HCPC for rows (i.e. individuals); up until here everything works just fine
data.hcpc <- HCPC(data.ca,graph = T)
# now I start having trouble
# perform HCPC for columns (i.e. variables); use their coordinates that are stocked in the CA-object that was created earlier
data.cols.hcpc <- HCPC(data.ca$col$coord,graph = T)
The code above shows me a dendrogram in the last case and even lets me cut it into clusters, but then I get the following error:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w = res.sauv$call$row.w.init) : object 'data.clust' not found
It's worth noting that when I perform MCA on my data and try to perform HCPC on my columns in that case, I get the exact same error. Would anyone have any clue as to how to fix this, or what I am doing wrong exactly? For completeness I insert a screenshot of the upper-left corner of my dataset to show what it looks like:
Thanks in advance for any possible help!
I know this is old, but because I've been troubleshooting this problem for a while today:
HCPC says that it accepts a data frame, but any time I try to simply pass it $col$coord or $colcoord from a standard ca object, it returns this error. My best guess is that there's some metadata it actually needs/is looking for that isn't in a data frame of coordinates, but I can't figure out what that is or how to pass it in.
The current version of FactoMineR will actually just allow you to give HCPC the whole CA object and tell it whether to cluster the rows or columns. So your last line of code should be:
data.cols.hcpc <- HCPC(data.ca, cluster.CA = "columns", graph = T)

Create new column in dataframe using if {} else {} in R

I'm trying to add a conditional column to a dataframe, but not getting the results I'm expecting.
I have a dataframe with values recorded for the column "steps" across 5-minute intervals over various days. I'm trying to impute missing values in the 'steps' column by using the mean number of steps for a given 5-minute interval on the days that do have measurements. n.b. I tried using the MICE package for this but it just crashed my computer so I opted for a more manual workaround.
As an intermediate stage, I have bound an additional column to the existing dataframe with the mean number of steps for that interval. What I want to do next is create a column that returns that mean if the raw number of steps is NULL, and just uses the raw value if not null. Here's my code for that part:
activityTimeAvgs$stepsImp <- if (is.na(activityTimeAvgs$steps)) {
  activityTimeAvgs$avgsteps
} else {
  activityTimeAvgs$steps
}
What I expected to happen is that the if statement would evaluate as TRUE if 'steps' is NA and consequently give 'avgsteps'; in cases where 'steps' is not NA I would expect it to just use the raw value for 'steps'. However, the output just gives the value for 'avgsteps' in every row, which is not much use. I also get the following warning:
Warning message:
In if (is.na(activityTimeAvgs$steps)) { :
the condition has length > 1 and only the first element will be used
Any ideas where I'm going wrong?
Thanks in advance.
The if statement is not suitable for this. You need to use ifelse:
activityTimeAvgs$stepsImp <- ifelse(is.na(activityTimeAvgs$steps), activityTimeAvgs$avgsteps, activityTimeAvgs$steps)
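A small reproducible example of the difference: if() inspects only the first element of the condition (hence the "condition has length > 1" warning) and then returns one whole branch, while ifelse() is vectorised and decides element by element. The column names mirror the question, but the values are invented:

```r
steps    <- c(10, NA, 30, NA)
avgsteps <- c(12, 20, 28, 40)

# if() would test only is.na(steps)[1] (FALSE here) and return the
# entire 'steps' branch for every row - not what we want.

# ifelse() picks per element: take avgsteps where steps is NA
stepsImp <- ifelse(is.na(steps), avgsteps, steps)
stepsImp   # 10 20 30 40
```

For a simple NA fill-in like this, dplyr::coalesce(steps, avgsteps) is an equivalent alternative, if dplyr is already loaded.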

Error with knnImputer from the DMwR Package: invalid 'times' argument

I'm trying to run knnImputation from the DMwR package on a genomic dataset. The dataset has two columns - one for location on a chromosome (numeric, an integer) and one for methylation values (also numeric, a double), with many of the methylation values missing. The idea is that distance should be based on location in the chromosome. (I also have several other features, but chose not to include them.) When I run the following line, however, I get an error.
reg.knn <- knnImputation(as.matrix(testp), k=2, meth="median")
#ERROR:
#Error in rep(1, ncol(dist)) : invalid 'times' argument
Any thoughts on what could be causing this?
If this doesn't work, does anyone know of any other good KNN imputers in R packages? I've been trying several, but each returns some kind of error.
I got a similar error today:
Error in rep(1, ncol(dist)) : invalid 'times' argument
I could not find a solution online, but with some trial and error I think the issue is with the number of columns in the data frame.
Try passing at least 3 columns to knnImputation.
I created a dummy column containing the row number of each observation (as a third column).
It worked for me!
Examples for your reference:
Example 1 -
temp <- data.frame(X = c(1,2,3,4,5,6,7,8,9,10), Y = c(T, T, F, F, F, F, NA, NA, T, T))
temp7 <- NULL
temp7 <- knnImputation(temp, scale = T, k = 3, meth = 'median', distData = NULL)
Error in rep(1, ncol(dist)) : invalid 'times' argument
Example 2 -
temp <- data.frame(X = 1:10, Y = c(T, T, F, F, F, F, NA, T, T, T), Z = c(NA, NA, 7, 8, 9, 5, 11, 9, 9, 4))
temp7 <- NULL
temp7 <- knnImputation(temp, scale = T, k = 3, meth = 'median', distData = NULL)
Here the number of columns passed is 3. Did NOT get any error!
Today, I encountered the same error. My data frame had many more than 3 columns, so that seems not to be the (only?) problem.
I found that rows with too many NAs caused the problem (in my case, more than 95% of a given row was NA). Filtering out those rows solved the problem.
Take-home message: do not only filter for NAs over the columns (which I did), but also check the rows (it is of course impossible to impute by kNN if you cannot define what a nearest neighbor is).
It would be nice if the package provided a readable error message!
When I read the code, I located the problem: if there are fewer than 3 columns, the data is downgraded in the process to something that is no longer a data frame, so all the operations that rely on the data frame structure fail. I think the author should handle this case.
And yes, the earlier answer found the same thing by trial and error: different road, same answer.
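The two observations above can be combined into a quick pre-check before calling knnImputation: require at least 3 columns, and drop rows that are almost entirely NA. The 95% threshold below is the one the previous answer happened to use, not a documented package requirement, and the data frame is invented for illustration:

```r
# Toy data: 3 columns, one row entirely NA
df <- data.frame(
  a = c(1, 2, NA, 4),
  b = c(NA, 2, NA, 4),
  c = c(3, 1, NA, 2)
)

stopifnot(ncol(df) >= 3)   # knnImputation reportedly fails below 3 columns

# Fraction of missing values per row; drop rows that are >= 95% NA
na_frac <- rowMeans(is.na(df))
df_clean <- df[na_frac < 0.95, , drop = FALSE]
nrow(df_clean)   # 3 - the all-NA row is gone
```

After this filter, df_clean would be the input passed on to knnImputation.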