I'm trying to run knnImputation from the DMwR package on a genomic dataset. The dataset has two columns: one for location on a chromosome (numeric, an integer) and one for methylation values (also numeric, double), with many of the methylation values missing. The idea is that distance should be based on location in the chromosome. (I also have several other features, but chose not to include those.) When I run the following line, however, I get an error.
reg.knn <- knnImputation(as.matrix(testp), k=2, meth="median")
#ERROR:
#Error in rep(1, ncol(dist)) : invalid 'times' argument
Any thoughts on what could be causing this?
If this doesn't work, does anyone know of any other good KNN imputers in R packages? I've been trying several, but each returns some kind of error.
I got a similar error today:
Error in rep(1, ncol(dist)) : invalid 'times' argument
I could not find a solution online, but after some trial and error I think the issue is the number of columns in the data frame.
Try passing at least 3 columns to knnImputation.
I created a dummy column holding the row number of each observation (as a third column).
It worked for me!
Examples for your reference:
Example 1 -
temp <- data.frame(X = c(1,2,3,4,5,6,7,8,9,10), Y = c(T, T, F, F,F,F,NA,NA,T,T))
temp7 <- NULL
temp7 <- knnImputation(temp, scale=T, k=3, meth='median', distData = NULL)
Error in rep(1, ncol(dist)) : invalid 'times' argument
Example 2 -
temp <- data.frame(X = 1:10, Y = c(T, T, F, F,F,F,NA,T,T,T), Z = c(NA,NA,7,8,9,5,11,9,9,4))
temp7 <- NULL
temp7 <- knnImputation(temp, scale=T, k=3, meth='median', distData = NULL)
Here number of columns passed is 3. Did NOT get any error!
Today, I encountered the same error. My df had many more than 3 columns, so this does not seem to be the (only?) problem.
I found that rows with too many NAs caused the problem (in my case, more than 95% of a given row was NA). Filtering out this row solved the problem.
Take-home message: do not only filter for NAs over the columns (which I did), but also check the rows (it is of course impossible to impute by kNN if you cannot define what exactly a nearest neighbor is).
Would be nice if the package would provide a readable error message!
When I read through the code, I located the problem: if the number of columns is smaller than 3, the data gets downgraded along the way to something that is no longer a data frame, so the operations that rely on the data frame structure all fail. I think the author should handle this case.
And yes, the previous answer found the same thing by trial and error: different road, same answer.
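For what it's worth, the failure mode described in these answers can be reproduced in base R without the package. This is only a sketch of the mechanism, not the package's actual code, and the matrix `m` is made up for illustration. Subsetting a two-column matrix down to one column drops it to a plain vector, `ncol()` then returns NULL, and `rep()` fails with the same message:

```r
# Made-up 3 x 2 matrix standing in for the data being imputed.
m <- matrix(1:6, nrow = 3)

# Removing one of the two columns silently drops the result to a plain vector.
v <- m[, -1]
print(is.null(ncol(v)))  # a vector has no columns, so ncol() is NULL

# rep(1, NULL) is what then fails; capture the message instead of stopping.
err <- tryCatch(rep(1, ncol(v)), error = function(e) conditionMessage(e))
print(err)

# Keeping the matrix structure with drop = FALSE avoids the problem.
kept <- m[, -1, drop = FALSE]
print(dim(kept))
```

The same drop would happen for a row where all but one of the columns are NA, which may be why a mostly empty row in a wide data frame triggers the identical error.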
I have a numeric column ("value") in a dataframe ("df"), and I would like to generate a new column ("valueBin") based on "value." I have the following conditional code to define df$valueBin:
df$valueBin[which(df$value<=250)] <- "<=250"
df$valueBin[which(df$value>250 & df$value<=500)] <- "250-500"
df$valueBin[which(df$value>500 & df$value<=1000)] <- "500-1,000"
df$valueBin[which(df$value>1000 & df$value<=2000)] <- "1,000 - 2,000"
df$valueBin[which(df$value>2000)] <- ">2,000"
I'm getting the following error:
"Error in $<-.data.frame(*tmp*, "valueBin", value = c(NA, NA, NA, :
replacement has 6530 rows, data has 6532"
Every element of df$value should fit into one of my which() statements. There are no missing values in df$value. Although even if I run just the first conditional statement (<=250), I get the exact same error, with "...replacement has 6530 rows..." although there are way fewer than 6530 records with value<=250, and value is never NA.
This SO link notes that a similar error when using aggregate() was a bug, but it recommends installing the version of R I already have. Plus, the bug report says it's fixed.
R aggregate error: "replacement has <foo> rows, data has <bar>"
This SO link seems more related to my issue, and the issue here was an issue with his/her conditional logic that caused fewer elements of the replacement array to be generated. I guess that must be my issue as well, and figured at first I must have a "<=" instead of an "<" or vice versa, but after checking I'm pretty sure they're all correct to cover every value of "value" without overlaps.
R error in '[<-.data.frame'... replacement has # items, need #
The answer by @akrun certainly does the trick. For future googlers who want to understand why, here is an explanation...
The new variable needs to be created first.
The variable "valueBin" needs to already exist in the df for the conditional assignment to work. Essentially, the syntax of the code is correct. Just add one line in front of the code chunk to create this name:
df$newVariableName <- NA
Then you continue with whatever conditional assignment rules you have, like
df$newVariableName[which(df$oldVariableName<=250)] <- "<=250"
I blame whoever wrote that package's error message... The debugging was made especially confusing by that error message. It is irrelevant information that you have two arrays in the df with different lengths. No. Simply create the new column first. For more details, consult this post https://www.r-bloggers.com/translating-weird-r-errors/
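A small base R sketch of what seems to be going on (the data frame below is made up; only the column names follow the question). When the column does not exist yet, `df$valueBin` is NULL, so the indexed assignment builds a vector that is only as long as the largest matching row index. That shorter vector is what gets assigned back to the data frame, which would explain a "replacement has 6530 rows" message if the last matching row happened to be row 6530:

```r
# Toy data frame (hypothetical values) with five rows.
df <- data.frame(value = c(100, 300, 200, 900, 2500))

# Column does not exist yet: the assignment builds a vector of length 3
# (the largest matching index), which does not fit a 5-row data frame.
err <- tryCatch({
  df$valueBin[which(df$value <= 250)] <- "<=250"
  "ok"
}, error = function(e) conditionMessage(e))
print(err)

# Creating the column first gives the assignment a full-length vector to fill.
df$valueBin <- NA_character_
df$valueBin[which(df$value <= 250)] <- "<=250"
print(df)
```

Using `NA_character_` rather than plain `NA` is just a choice here so the column starts out as character.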
You could use cut
df$valueBin <- cut(df$value, c(-Inf, 250, 500, 1000, 2000, Inf),
labels=c('<=250', '250-500', '500-1,000', '1,000-2,000', '>2,000'))
data
set.seed(24)
df <- data.frame(value= sample(0:2500, 100, replace=TRUE))
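To check that these breaks match the original conditions, here is a quick sketch on a few hand-picked boundary values (the `value` vector is made up for the check). By default `cut()` uses intervals that are closed on the right, so 250 falls in "<=250" and 500 falls in "250-500", the same as the `<=` comparisons in the question:

```r
# Boundary values chosen by hand to probe each break.
value <- c(250, 251, 500, 1000, 1001, 2000, 2001)

bins <- cut(value, c(-Inf, 250, 500, 1000, 2000, Inf),
            labels = c('<=250', '250-500', '500-1,000', '1,000-2,000', '>2,000'))

# Show each value next to the bin it was assigned to.
print(data.frame(value, bin = as.character(bins)))
```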
TL;DR ...and late to the party, but that short explanation might help future googlers..
In general that error message means that the replacement doesn't fit into the corresponding column of the dataframe.
A minimal example:
df <- data.frame(a = 1:2); df$a <- 1:3
throws the error
Error in $<-.data.frame(*tmp*, a, value = 1:3) : replacement
has 3 rows, data has 2
which is clear, because the column a of df has 2 entries (rows) whilst the vector we try to replace it with has 3 entries (rows).
This is what happens when I run traj::step1measures
step1measures(datamat, timemat, ID = TRUE)
Error in if (cor.mat[i_row, i_col] > 0.999) { : missing value where TRUE/FALSE needed
2. check.correlation(output[, -1], verbose)
1. step1measures(datamat, timemat, ID = TRUE)
I have checked multiple times and I am sure that there are no null or missing values in the data and time matrices. Any suggestions for what's going wrong here/ where a missing value could be popping up?
There are a few reasons why you may be hitting this error:
At least one row of your data does not have the required number of data points. A minimum of 4 data points per row is required. You can see the data requirements in the function's documentation: https://cran.rstudio.com/web/packages/traj/traj.pdf
Your data contains an ID column but you did not indicate that to the function.
Any other unexpected combination of data values that yields 'Inf', 'NA' or 'NaN' for one of the measures. This is the sneaky one. You may need to go to line 416 of the step1measures script and view the data before it's passed to the correlation function. You may notice that some data rows contain invalid values. I would recommend removing those rows. In an ideal world, the package would catch such issues and display a better error, but that's not the case today.
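The first and last checks can be sketched in base R before calling the package. Everything below is made up for illustration: `datamat` is a toy matrix whose first column is assumed to be the ID (as with ID = TRUE), and `measures` stands in for whatever table of computed measures you inspect inside the script.

```r
# Toy data matrix: first column is an ID, the rest are observations.
datamat <- matrix(c(1, 2.1, 3.0, 4.2, 5.1,
                    2, 1.0,  NA,  NA, 2.0),
                  nrow = 2, byrow = TRUE)

# Count non-missing data points per row, ignoring the ID column.
n_points <- rowSums(!is.na(datamat[, -1, drop = FALSE]))
too_short <- which(n_points < 4)
print(too_short)

# Hypothetical table of measures: flag rows holding Inf, NA or NaN.
measures <- data.frame(m1 = c(0.5, 1.2, 0.9),
                       m2 = c(1, Inf, 2),
                       m3 = c(0.1, NaN, 0.3))
bad_rows <- which(rowSums(!is.finite(as.matrix(measures))) > 0)
print(bad_rows)
```

`is.finite()` is FALSE for Inf, -Inf, NA and NaN alike, so one test covers all three cases.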
I would like to perform a HCPC on the columns of my dataset, after performing a CA. For some reason I also have to specify at the start, that all of my columns are of type 'factor', just to loop over them afterwards again and convert them to numeric. I don't know why exactly, because if I check the type of each column (without specifying them as factor) they appear to be numeric... When I don't load and convert the data like this, however, I get an error like the following:
Error in eigen(crossprod(t(X), t(X)), symmetric = TRUE) : infinite or
missing values in 'x'
Could this be due to the fact that there are columns in my dataset that only contain 0's? If so, how come that it works perfectly fine by reading everything in first as factor and then converting it to numeric before applying the CA, instead of just performing the CA directly?
The original issue with the HCPC, then, is the following:
# read in data; 40 x 267 data frame
data_for_ca <- read.csv("./data/data_clean_CA_complete.csv",row.names=1,colClasses = c(rep('factor',267)))
# loop over first 267 columns, converting them to numeric
for(i in 1:267)
data_for_ca[[i]] <- as.numeric(data_for_ca[[i]])
# perform CA
data.ca <- CA(data_for_ca,graph = F)
# perform HCPC for rows (i.e. individuals); up until here everything works just fine
data.hcpc <- HCPC(data.ca,graph = T)
# now I start having trouble
# perform HCPC for columns (i.e. variables); use their coordinates that are stocked in the CA-object that was created earlier
data.cols.hcpc <- HCPC(data.ca$col$coord,graph = T)
The code above shows me a dendrogram in the last case and even lets me cut it into clusters, but then I get the following error:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w =
res.sauv$call$row.w.init) : object 'data.clust' not found
It's worth noting that when I perform MCA on my data and try to perform HCPC on my columns in that case, I get the exact same error. Would anyone have any clue as how to fix this or what I am doing wrong exactly? For completeness I insert a screenshot of the upper-left corner of my dataset to show what it looks like:
Thanks in advance for any possible help!
I know this is old, but because I've been troubleshooting this problem for a while today:
HCPC says that it accepts a data frame, but any time I try to simply pass it $col$coord or $colcoord from a standard ca object, it returns this error. My best guess is that there's some metadata it actually needs/is looking for that isn't in a data frame of coordinates, but I can't figure out what that is or how to pass it in.
The current version of FactoMineR will actually just allow you to give HCPC the whole CA object and tell it whether to cluster the rows or columns. So your last line of code should be:
data.cols.hcpc <- HCPC(data.ca, cluster.CA = "columns", graph = T)
Another stumbling block. I have a large set of data (called "brightly") with about 180k rows and 165 columns. I am trying to create a correlation matrix of these columns in R.
Several problems have arisen, none of which I can resolve with the suggestions proposed on this site and others.
First, how I created the data set: I saved it as a CSV file from Excel. My understanding is that CSV should remove any formatting, such that anything that is a number should be read as a number by R. I loaded it with
brightly = read.csv("brightly.csv", header=TRUE)
But I kept getting "'x' must be numeric" error messages every time I ran cor(brightly), so I replaced all the NAs with 0s. (This may be altering my data, but I think it will be all right--anything that's "NA" is effectively 0, either for the continuous or dummy variables.)
Now I am no longer getting the error message about text. But any time I run cor()--either on all of the variables simultaneously or combinations of the variables--I get "Warning message:
In cor(brightly$PPV, brightly, use = "complete") :
the standard deviation is zero"
I am also having some of the correlations of that one variable with others show up as "NA." I have ensured that no cell in the data is "NA," so I do not know why I am getting "NA" values for the correlations.
I also tried both of the following to make REALLY sure I wasn't including any NA values:
cor(brightly$PPV, brightly, use = "pairwise.complete.obs")
and
cor(brightly$PPV,brightly,use="complete")
But I still get warnings about the SD being zero, and I still get the NAs.
Any insights as to why this might be happening?
Finally, when I try to do corrplot to show the results of the correlations, I do the following:
brightly2 <- cor(brightly)
Warning message:
In cor(brightly) : the standard deviation is zero
corrplot(brightly2, method = "number")
Error in if (min(corr) < -1 - .Machine$double.eps || max(corr) > 1 + .Machine$double.eps) { :
missing value where TRUE/FALSE needed
And instead of making my nice color-coded correlation matrix, I get this. I have yet to find an explanation of what that means.
Any help would be HUGELY appreciated! Thanks very much!!
Please check whether you replaced your NAs with 0 or with '0': the first is numeric and the second is a character. You can also try as.numeric(column_name) to convert character 0s to numeric 0s. This error also occurs if your dataset has factors; since those are not numeric values, corrplot throws this error.
It would be helpful if you put a sample of your data in the question using
str(head(your_dataset))
That would also let you check the data types of the columns.
Let me know if I am wrong.
Cheerio.
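Beyond the column types, the warning itself points at columns whose standard deviation is zero, for example a column that became all 0s after the NAs were replaced. A correlation with a constant column is NA, and an NA in the matrix would be enough to produce "missing value where TRUE/FALSE needed" in a comparison like the one in the corrplot error. A minimal sketch on made-up data (`toy` and its columns are hypothetical, not the asker's data):

```r
# Toy data: x2 is constant, as a column of replaced NAs might be.
toy <- data.frame(PPV = c(1, 2, 3, 4, 5),
                  x1  = c(2, 4, 5, 4, 5),
                  x2  = c(0, 0, 0, 0, 0))

# Confirm every column is numeric, then find the zero-variance ones.
print(sapply(toy, is.numeric))
sds <- sapply(toy, sd)
zero_sd <- names(sds)[sds == 0]
print(zero_sd)

# The constant column yields NA correlations (warning silenced here).
cm <- suppressWarnings(cor(toy))
print(anyNA(cm))

# Dropping constant columns gives a matrix without NAs.
clean <- toy[, sds > 0, drop = FALSE]
cm_clean <- cor(clean)
print(cm_clean)
```

Whether dropping such columns is appropriate depends on the analysis; the sketch only shows where the NAs come from.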
I just created a sample which gives the structure of my data:
a<-c(1,2,3,4,5,6,7,8,9,10,1,2,3,4,2,3,5,4,5,6)
b<-c(1,2,3,4,4,1,2,3,9,7,2,3,6,1,9,3,1,5,7,8)
c<-c(1,1,1,0,0,1,0,1,0,0,0,0,0,1,1,1,1,0,1,0)
d<-c(10,9,7,10,11,2,3,3,1,1,2,2,2,2,2,2,2,2,2,3)
e<-c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5)
df<-data.frame(a,b,c,d,e)
library(memisc)
df_p1<- within(df,{
e<-recode(e,
c(1,2,3)->"West",
c(4,5)->"East")})
I would just like to recode the values 1, 2, 3 into West and 4, 5 into East. I know for sure that I ran that recode command a week ago and it worked perfectly. Now I get an error.
Error in `[<-.data.frame`(`*tmp*`, nl, value = list(East = c(4, 5), West = c(1, :
replacement element 2 has 3 rows, need 20
In addition: Warning message:
In if (as.factor.result) { :
the condition has length > 1 and only the first element will be used
I just figured out the problem. I don't know whether this is common knowledge, but I didn't know it. The problem occurs only when I add library(car) to my script. Both memisc and car provide a function called recode, so when both are loaded the one from the package loaded last is used, and you get the error message.
How about amending it slightly to this?
Use the car library instead of memisc
require(car)
df_p1<- within(df,{
e<-recode(e, "c(1,2,3)='West'; c(4,5)='East'")})
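If both packages need to stay loaded, one option is to name the package explicitly, e.g. car::recode(...) or memisc::recode(...), so it no longer matters which one was attached last. Another is to skip recode altogether. The sketch below uses only base R with a named lookup vector; `region_lookup` is a made-up name, and the data reuses column e from the question:

```r
# Column e from the question's sample data.
e <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5)
df <- data.frame(e)

# Hypothetical lookup table: names are the old codes, values the new labels.
region_lookup <- c("1" = "West", "2" = "West", "3" = "West",
                   "4" = "East", "5" = "East")

# Index the lookup by the code (as character); unmatched codes become NA.
df_p1 <- within(df, {
  e <- unname(region_lookup[as.character(e)])
})
print(table(df_p1$e))
```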