R: 'Missing Value where True/False needed' - r

So I know this has been asked before, but from what I've searched I can't really find an answer to my problem. I should also add I'm relatively new to R (and any type of coding at all) so when it comes to fixing problems in code I'm not too sure what I'm looking for.
My code is:
education_ge <- data.frame(matrix(ncol=2, nrow=1))
colnames(education_ge) <- c("Education","Genetic.Engineering")
for (i in 1:nrow(survey))
if (survey[i,12]=="Bachelors")
education_ge$Education <- survey[i,12]
To give more info, 'survey' is a data frame with 12 columns and 26 rows, and the 12th column, 'Education', is a factor which has levels such as 'Bachelors', 'Masters', 'Doctorate' etc.
This is the error as it appears in R:
for (i in 1:nrow(survey))
if (survey[i,12]=="Bachelors")
education_ge$Education <- survey[i,12]
Error in if (survey[i, 12] == "Bachelors") education_ge$Education <- survey[i, :
missing value where TRUE/FALSE needed
Any help would be greatly appreciated!

If you just want to ignore any records with missing values and get on with your analysis, try inserting this at the beginning:
survey <- survey[ complete.cases(survey), ]
It basically finds the indexes of all the rows where there are no NAs anywhere, and then subsets survey to have only those rows.
For more information on subsetting, try reading this chapter: http://adv-r.had.co.nz/Subsetting.html
The command:
sapply(survey,function (x) sum(is.na(x)))
will show you how many NAs you have in each column. That might help your data cleaning.
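To see where the error comes from: comparing NA with a string gives NA rather than TRUE or FALSE, and if() cannot branch on NA. Here is a minimal sketch, assuming a small made-up survey data frame (the id column and all values are invented for illustration). It reproduces the error, runs the two commands above, and shows an isTRUE() guard as one possible alternative if you would rather keep the incomplete rows:

```r
# Toy stand-in for 'survey' (invented data, one missing Education value)
survey <- data.frame(
  id = 1:4,
  Education = factor(c("Bachelors", NA, "Masters", "Bachelors"))
)

# Comparing NA with a string yields NA, not FALSE
cmp <- survey$Education == "Bachelors"

# if() stops when its condition is NA; capture the message instead of halting
err <- tryCatch(
  {
    if (cmp[2]) "yes" else "no"
  },
  error = function(e) conditionMessage(e)
)

# Count the NAs per column, then keep only the complete rows
na_counts <- sapply(survey, function(x) sum(is.na(x)))
clean <- survey[complete.cases(survey), ]

# Alternative: keep all rows and guard the test; isTRUE(NA) is FALSE
hits <- 0
for (i in seq_len(nrow(survey))) {
  if (isTRUE(survey[i, "Education"] == "Bachelors")) hits <- hits + 1
}
```

Whether dropping rows or guarding the comparison is the better choice depends on whether the incomplete records still matter for the rest of the analysis.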

You can try this:
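sub <- subset(survey, Education == "Bachelors")  # subset() also drops rows where the test is NA
education_ge <- data.frame(Education = sub$Education)  # assigning into the 1-row education_ge fails if sub has more rows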
Let me know if this helps.


`$<-.data.frame`(`*tmp*`, Numero, value = numeric(0) error [duplicate]

I have a numeric column ("value") in a dataframe ("df"), and I would like to generate a new column ("valueBin") based on "value." I have the following conditional code to define df$valueBin:
df$valueBin[which(df$value<=250)] <- "<=250"
df$valueBin[which(df$value>250 & df$value<=500)] <- "250-500"
df$valueBin[which(df$value>500 & df$value<=1000)] <- "500-1,000"
df$valueBin[which(df$value>1000 & df$value<=2000)] <- "1,000 - 2,000"
df$valueBin[which(df$value>2000)] <- ">2,000"
I'm getting the following error:
"Error in $<-.data.frame(*tmp*, "valueBin", value = c(NA, NA, NA, :
replacement has 6530 rows, data has 6532"
Every element of df$value should fit into one of my which() statements. There are no missing values in df$value. Although even if I run just the first conditional statement (<=250), I get the exact same error, with "...replacement has 6530 rows..." although there are way fewer than 6530 records with value<=250, and value is never NA.
This SO link notes that a similar error with aggregate() was a bug, but it recommends installing the version of R I already have. Plus the bug report says it's fixed.
R aggregate error: "replacement has <foo> rows, data has <bar>"
This SO link seems more related to my issue, and the issue here was an issue with his/her conditional logic that caused fewer elements of the replacement array to be generated. I guess that must be my issue as well, and figured at first I must have a "<=" instead of an "<" or vice versa, but after checking I'm pretty sure they're all correct to cover every value of "value" without overlaps.
R error in '[<-.data.frame'... replacement has # items, need #
The answer by @akrun certainly does the trick. For future googlers who want to understand why, here is an explanation...
The new variable needs to be created first.
The variable "valueBin" needs to be already in the df in order for the conditional assignment to work. Essentially, the syntax of the code is correct. Just add one line in front of the code chunk to create this name:
df$newVariableName <- NA
Then you continue with whatever conditional assignment rules you have, like
df$newVariableName[which(df$oldVariableName<=250)] <- "<=250"
I blame whoever wrote that error message... It made the debugging especially confusing. The message reports that the replacement and the data frame have different numbers of rows, which is true but does not point to the fix: simply create the new column first. For more details, consult this post https://www.r-bloggers.com/translating-weird-r-errors/
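For what it's worth, the row counts in the message can be explained. A sketch with made-up data (the five values are invented, everything else is base R): when the column does not exist yet, R starts from NULL and grows the new vector only as far as the largest index you assign to, so it ends up shorter than the data frame whenever the last rows do not match the condition.

```r
# Hypothetical toy data: 5 rows, no missing values
df <- data.frame(value = c(100, 600, 200, 700, 900))

idx <- which(df$value <= 250)   # rows 1 and 3

# What happens behind df$valueBin[idx] <- ... when valueBin is missing:
# the target is NULL and is extended only up to index 3
partial <- NULL
partial[idx] <- "<=250"
length(partial)                 # shorter than nrow(df)

# Same thing through the data frame: capture the error message
msg <- tryCatch(
  {
    df$valueBin[idx] <- "<=250"
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# Creating the column first gives a full-length vector to assign into
df$valueBin <- NA
df$valueBin[idx] <- "<=250"
```

In the question, 6530 would then simply be the position of the last row with value <= 250, two short of the 6532 rows in the data frame.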
You could use cut
df$valueBin <- cut(df$value, c(-Inf, 250, 500, 1000, 2000, Inf),
labels=c('<=250', '250-500', '500-1,000', '1,000-2,000', '>2,000'))
data
set.seed(24)
df <- data.frame(value= sample(0:2500, 100, replace=TRUE))
TL;DR ...and late to the party, but that short explanation might help future googlers..
In general that error message means that the replacement doesn't fit into the corresponding column of the dataframe.
A minimal example:
df <- data.frame(a = 1:2); df$a <- 1:3
throws the error
Error in $<-.data.frame(*tmp*, a, value = 1:3) : replacement
has 3 rows, data has 2
which is clear, because column a of df has 2 entries (rows), whilst the vector we try to assign to it has 3 entries (rows).

Why after I use "subset", the filtered data is less than it should be?

I want to have "Blancas" and "Sultana" under the "Variete" column.
Figure 1 is the original data,
figure 2 is the expected result,
figure 3 is result I obtained with the code below:
df <- read_excel("R_NLE_FTSW.xlsx")
options(scipen=200)
BLANCAS<-subset(df, Variete==c("Blancas","Sultana"))
view(BLANCAS)
It's obvious that some data of BLANCAS are missing.
P.S. And if I try it on a sub-sheet, the final result is sometimes 5 times larger!
path = "R_NLE_FTSW.xlsx"
df <- map_dfr(excel_sheets(path),
~ read_xlsx(path, sheet = 4))
I don't understand why sometimes it's more and sometimes less than the expected result. Can anyone help me? Thank you so much!
First of all, while you mention that you need both "Blancas" and "Sultana", your expected result shows only Blancas! So get that straight first.
For data like this coming from Excel:
Always clean the data after it's imported. Check the unique values to find out whether there are any extra spaces etc.
Trim the character data, and ensure date fields are correct and numbers are numeric (not characters).
Now, to subset the data, use df %>% filter(Variete %in% c('Blancas','Sultana'))
-> you can modify the c() vector to include the items of interest.
-> if you wish to clean on the go:
df %>% filter(trimws(Variete) %in% c('Blancas','Sultana'))
As for your sub-sheet problem: we don't even know what data is there. If it's similar, apply the same logic.
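As for why == returns too few rows: it compares element by element and recycles the shorter vector, so row 1 is tested against "Blancas", row 2 against "Sultana", row 3 against "Blancas" again, and so on. A small sketch in base R with invented data (the NLE column and all values are made up for illustration):

```r
# Invented sample: 6 rows, 5 of which are Blancas or Sultana
df <- data.frame(
  Variete = c("Blancas", "Blancas", "Sultana", "Sultana", "Autre", "Blancas"),
  NLE = 1:6
)

# == recycles c("Blancas", "Sultana") along the column, so a row is kept
# only when it happens to line up with the right element
with_eq <- subset(df, Variete == c("Blancas", "Sultana"))

# %in% tests every row against the whole set
with_in <- subset(df, Variete %in% c("Blancas", "Sultana"))

nrow(with_eq)
nrow(with_in)

# Stray spaces from Excel also break matching; trimws() removes them
messy <- c("Blancas ", " Sultana")
messy_raw <- messy %in% c("Blancas", "Sultana")
messy_trimmed <- trimws(messy) %in% c("Blancas", "Sultana")
```

Here the == version keeps only rows 1 and 4, while %in% keeps all five matching rows.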

Copy rows of a df when NA == TRUE plus upper and lower rows in R

Not sure if the title is clear enough. I have the following dataframe: (ST.final is the name of the df)
n;date;ws;wd
1;2011-11-01 00:00:00;7,15;113,7
2;2011-11-01 00:10:00;7,25;115,7
3;2011-11-01 00:20:00;NA;NA
4;2011-11-01 00:30:00;NA;NA
5;2011-11-01 00:40:00;7,2;100,7
6;2011-11-01 00:50:00;6,95;104,7
And I want to create a new one with the rows containing NAs plus the upper and lower limit rows. The result should be something like this:
n;date;ws;wd
2;2011-11-01 00:10:00;7,25;115,7
3;2011-11-01 00:20:00;NA;NA
4;2011-11-01 00:30:00;NA;NA
5;2011-11-01 00:40:00;7,2;100,7
Maybe I am missing something but I have no clue on how to perform this task. So far I am trying to use this
interp.df <- ST.final[(is.na(ST.final$ws)),]
and as expected it just copies every row containing NAs. I searched for a solution on Google but couldn't find anything similar.
Any help is appreciated.
You could try
idx <- which(!complete.cases(ST.final))
idx <- sort(unique(c(idx-1, idx, idx+1)))
ST.final[idx, ]
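Run on the sample data from the question, that looks roughly like the sketch below. The extra line that clamps idx to valid row numbers is my own addition, not part of the snippet above: it only matters when the first or last row contains an NA, in which case idx + 1 would point past the end and produce an extra all-NA row.

```r
# Sample data from the question (semicolon-separated, decimal comma)
ST.final <- read.csv2(text = "n;date;ws;wd
1;2011-11-01 00:00:00;7,15;113,7
2;2011-11-01 00:10:00;7,25;115,7
3;2011-11-01 00:20:00;NA;NA
4;2011-11-01 00:30:00;NA;NA
5;2011-11-01 00:40:00;7,2;100,7
6;2011-11-01 00:50:00;6,95;104,7", stringsAsFactors = FALSE)

idx <- which(!complete.cases(ST.final))        # rows containing an NA
idx <- sort(unique(c(idx - 1, idx, idx + 1)))  # add the neighbours
idx <- idx[idx >= 1 & idx <= nrow(ST.final)]   # added guard: stay inside the table
interp.df <- ST.final[idx, ]

# Edge case (invented data): the NA sits in the last row
edge <- data.frame(x = c(1, 2, NA), y = c(10, 20, 30))
e <- which(!complete.cases(edge))
e <- sort(unique(c(e - 1, e, e + 1)))
unclamped <- edge[e, ]                          # includes a phantom all-NA row
clamped <- edge[e[e >= 1 & e <= nrow(edge)], ]  # only real rows
```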

Error with knnImputer from the DMwR Package: invalid 'times' argument

I'm trying to run knnImputation from the DMwR package on a genomic dataset. The dataset has two columns: one for location on a chromosome (numeric, an integer) and one for methylation values (also numeric, double), with many of the methylation values missing. The idea is that distance should be based on location in the chromosome. (I also have several other features, but chose not to include those.) When I run the following line, however, I get an error.
reg.knn <- knnImputation(as.matrix(testp), k=2, meth="median")
#ERROR:
#Error in rep(1, ncol(dist)) : invalid 'times' argument
Any thoughts on what could be causing this?
If this doesn't work, does anyone know of any other good KNN imputers in R packages? I've been trying several but each returns some kind of error.
I got a similar error today:
Error in rep(1, ncol(dist)) : invalid 'times' argument
I could not find a solution online, but after some trial and error I think the issue is the number of columns in the data frame.
Try passing at least 3 columns to knnImputation.
I created a dummy third column that holds the row number of each observation.
It worked for me!
Examples for your reference:
Example 1 -
temp <- data.frame(X = c(1,2,3,4,5,6,7,8,9,10), Y = c(T, T, F, F,F,F,NA,NA,T,T))
temp7 <- NULL
temp7 <- knnImputation(temp, scale=T, k=3, meth='median', distData = NULL)
Error in rep(1, ncol(dist)) : invalid 'times' argument
Example 2 -
temp <- data.frame(X = 1:10, Y = c(T, T, F, F,F,F,NA,T,T,T), Z = c(NA,NA,7,8,9,5,11,9,9,4))
temp7 <- NULL
temp7 <- knnImputation(temp, scale=T, k=3, meth='median', distData = NULL)
Here number of columns passed is 3. Did NOT get any error!
Today, I encountered the same error. My df had many more than 3 columns, so this seems not to be the (only?) problem.
I found that rows with too many NAs caused the problem (in my case, more than 95% of a given row was NA). Filtering out this row solved the problem.
Take-home message: do not only filter for NAs over the columns (which I did), but also check the rows (it is of course impossible to impute by kNN if you cannot define what exactly a nearest neighbor is).
It would be nice if the package provided a readable error message!
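A quick way to check the rows as well as the columns, sketched in base R on invented data. The 50% cut-off is an arbitrary assumption for the example, not a value recommended by the package:

```r
# Invented data: row 2 is mostly NA
df <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(2, NA, 6, 8),
  c = c(NA, NA, 9, 12),
  d = c(4, 5, 12, 16)
)

# Fraction of missing values per row and per column
row_na <- rowMeans(is.na(df))
col_na <- colMeans(is.na(df))

# Drop rows where more than half of the values are missing
# (threshold chosen only for illustration)
keep <- row_na <= 0.5
df_filtered <- df[keep, ]
```

Rows that survive this filter still contain NAs for the imputation to fill in; only the rows with too little information to find neighbours are removed.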
When I read through the code, I located the problem: if the data has fewer than 3 columns, an intermediate object gets downgraded to something that is no longer a data frame, so the operations that rely on the data frame structure all fail. I think the author should handle this case.
And yes, the trial-and-error answer above found the same thing: different road, same answer.
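I have not copied the package internals here; this is only a base R sketch of the kind of downgrade described, under the assumption that the failing step removes one column and then calls ncol() on what is left. With two columns, removing one leaves a single column, which R silently collapses to a plain vector unless drop = FALSE is given:

```r
# Two-column data frame, as in Example 1 above
temp <- data.frame(X = 1:10, Y = c(T, T, F, F, F, F, NA, NA, T, T))

# Removing one of two columns collapses the result to a plain vector
one_col <- temp[, -2]
is.data.frame(one_col)   # no longer a data frame
ncol(one_col)            # NULL for a plain vector

# rep() cannot use NULL as 'times'; capture the message instead of halting
res <- tryCatch(
  rep(1, ncol(one_col)),
  error = function(e) conditionMessage(e)
)

# drop = FALSE keeps the data frame structure, so ncol() is 1
kept <- temp[, -2, drop = FALSE]
weights <- rep(1, ncol(kept))
```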

R warning message - invalid factor level, NA generated

I have the following block of code. I am a complete beginner in R (a few days old), so I am not sure how much of the code I need to share to solve my problem. So here is everything I have written.
mdata <- read.csv("outcome-of-care-measures.csv",colClasses = "character")
allstate <- unique(mdata$State)
allstate <- allstate[order(allstate)]
spldata <- split(mdata,mdata$State)
if (num=="best") num <- 1
ranklist <- data.frame("hospital" = character(),"state" = character())
for (i in seq_len(length(allstate))) {
if (outcome=="heart attack"){
pdata <- spldata[[i]]
pdata[,11] <- as.numeric(pdata[,11])
bestof <- pdata[!is.na(as.numeric(pdata[,11])),][]
inorder <- order(bestof[,11],bestof[,2])
if (num=="worst") num <- nrow(bestof)
hospital <- bestof[inorder[num],2]
state <- allstate[i]
ranklist <- rbind(ranklist,c(hospital,state))
}
}
allstate is a character vector of states.
outcome can have values similar to "heart attack"
num will be numeric or "best" or "worst"
I want to create a data frame ranklist which will have hospital names and the state names which follow a certain criterion.
However I keep getting the error
invalid factor level, NA generated
I know it has something to do with rbind but I cannot figure out what it is. I have tried googling this, and also tried troubleshooting using other similar questions on this site. I have checked that none of the vectors I am trying to bind are factors. I also tried forcing the coercion by wrapping hospital and state in as.character() during assignment, but that didn't work.
I would be grateful for any help.
Thanks in advance!
Since this is apparently from a Coursera assignment, I am not going to give you a solution, but I am going to hint at it: have a look at the help pages for read.csv and data.frame. Both have the argument stringsAsFactors. What is the default, true or false? Do you want to keep the default setting? Is colClasses = "character" in line 1 necessary? Use the str function to check what the classes of the columns in mdata and ranklist are. read.csv additionally has an na.strings argument. If you use it correctly, the "NAs introduced by coercion" warning will also disappear and line 16 won't be necessary.
Finally, don't grow a matrix or data frame inside a loop if you know the final size beforehand. Initialize it with the correct dimensions (here 52 x 2) and assign e.g. the i-th hospital to the i-th row and first column of the data frame. That way rbind is not necessary.
By the way, you did not get an error but a warning. R didn't interrupt the loop; it just let you know that some values have been coerced to NA. You can also simplify the seq_len statement by using seq_along instead.
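Without giving away the assignment, here is a generic sketch of where the warning comes from, using invented hospital and state values. It assumes the columns of the data frame ended up as factors (the default for data.frame before R 4.0.0); stringsAsFactors is set explicitly below so the behaviour does not depend on the R version.

```r
# Factor columns only know the levels they were created with
ranklist <- data.frame(hospital = "A", state = "TX", stringsAsFactors = TRUE)

# Binding a plain character vector introduces unknown levels;
# capture the warning message instead of printing it
w <- tryCatch(
  rbind(ranklist, c("B", "NY")),
  warning = function(w) conditionMessage(w)
)

# With character columns the same rbind() works as expected
ok <- data.frame(hospital = "A", state = "TX", stringsAsFactors = FALSE)
ok <- rbind(ok, c("B", "NY"))

# Preallocating and filling by index avoids rbind() in the loop altogether
# (the state codes and hospital names are placeholders)
states <- c("AK", "AL", "AR")
filled <- data.frame(
  hospital = character(length(states)),
  state = states,
  stringsAsFactors = FALSE
)
for (i in seq_along(states)) {
  filled[i, "hospital"] <- paste("Hospital", i)
}
```

str(ranklist) and str(ok) show the difference in column classes at a glance.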
