R warning message - invalid factor level, NA generated - r

I have the following block of code. I am a complete beginner in R (a few days old), so I am not sure how much of the code I need to share to explain my problem. So here is everything I have written.
mdata <- read.csv("outcome-of-care-measures.csv",colClasses = "character")
allstate <- unique(mdata$State)
allstate <- allstate[order(allstate)]
spldata <- split(mdata,mdata$State)
if (num=="best") num <- 1
ranklist <- data.frame("hospital" = character(),"state" = character())
for (i in seq_len(length(allstate))) {
  if (outcome=="heart attack"){
    pdata <- spldata[[i]]
    pdata[,11] <- as.numeric(pdata[,11])
    bestof <- pdata[!is.na(as.numeric(pdata[,11])),][]
    inorder <- order(bestof[,11],bestof[,2])
    if (num=="worst") num <- nrow(bestof)
    hospital <- bestof[inorder[num],2]
    state <- allstate[i]
    ranklist <- rbind(ranklist,c(hospital,state))
  }
}
allstate is a character vector of states.
outcome can have values similar to "heart attack"
num will be numeric or "best" or "worst"
I want to create a data frame ranklist which will have hospital names and the state names which follow a certain criterion.
However I keep getting the error
invalid factor level, NA generated
I know it has something to do with rbind but I cannot figure out what it is. I have tried googling this, and also tried troubleshooting using similar questions on this site. I have checked that none of the vectors I am trying to bind are factors. I also tried forcing the coercion by wrapping hospital and state in as.character() during assignment, but that didn't work.
I would be grateful for any help.
Thanks in advance!

Since this is apparently from a Coursera assignment I am not going to give you a solution, but I am going to hint at it: have a look at the help pages for read.csv and data.frame. Both have the argument stringsAsFactors. What is the default, true or false? Do you want to keep the default setting? Is colClasses = "character" in line 1 necessary? Use the str function to check what the classes of the columns in mdata and ranklist are. read.csv additionally has an na.strings argument. If you use it correctly, the "NAs introduced by coercion" warning will also disappear and line 16 won't be necessary.
Finally, don't grow a matrix or data frame inside a loop if you know the final size beforehand. Initialize it with the correct dimensions (here 52 x 2) and assign e.g. the i-th hospital to the i-th row and first column of the data frame. That way rbind is not necessary.
By the way, you did not get an error but a warning. R didn't interrupt the loop; it just let you know that some values have been coerced to NA. You can also simplify the seq_len statement by using seq_along instead.
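The preallocation idea can be sketched as follows. This is only an illustration on made-up data, not the assignment's data set: the hospital names, the rate column and the three-column layout are all assumptions. Note that stringsAsFactors defaulted to TRUE before R 4.0.0, so it is set explicitly here.

```r
# Made-up stand-in for the real file; names and rates are hypothetical.
mdata <- data.frame(
  hospital = c("Alpha General", "Beta Clinic", "Gamma Medical", "Delta Health"),
  state    = c("AK", "AK", "AL", "AL"),
  rate     = c("14.2", "Not Available", "13.1", "12.9"),
  stringsAsFactors = FALSE
)
# "Not Available" cannot be parsed as a number and becomes NA.
mdata$rate <- suppressWarnings(as.numeric(mdata$rate))

allstate <- sort(unique(mdata$state))

# Create the result with its final size up front, with character columns.
ranklist <- data.frame(
  hospital = rep(NA_character_, length(allstate)),
  state    = allstate,
  stringsAsFactors = FALSE
)

for (i in seq_along(allstate)) {
  pdata <- mdata[mdata$state == allstate[i] & !is.na(mdata$rate), ]
  pdata <- pdata[order(pdata$rate, pdata$hospital), ]
  ranklist$hospital[i] <- pdata$hospital[1]  # fill row i, no rbind needed
}

str(ranklist)
```

Because both columns are plain character vectors, assigning a new hospital name never has to match an existing factor level.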

Related

Renaming columns using for loops gives error, R

I have obtained a data set with slightly inconsistent and messy variable names. I would like to rename them in an efficient and automated way.
I have a set of data frames and I need to rename some columns in several of them. The order of the columns, and length of the data frames differ, so I would like to use any function such as grep() or a subset term (df$x[== "term"]).
I have found an older question regarding this problem (Rename columns in multiple dataframes, R), but I haven't been able to get any of the suggested solutions to work since I get an error message. I do not have reputation enough to comment and ask further questions on those replies. However, my problem seems to be a bit different as I get an error message from my for loop that is not mentioned in the earlier question:
Error in `colnames<-`(`*tmp*`, value = character(0)) :
attempt to set 'colnames' on an object with less than two dimensions
Setup: multiple data frames, let's call them myDF1, myDF2 ...
In those data frames there are columns with names (bad_name1, bad_name2) that should be changed to something else (good_name1, good_name2).
Replicable dataset:
myDF1 <- data.frame(bad_name1="A", bad_name2="B")
myDF2 <- data.frame(bad_name1="C", bad_name2="D")
for (x in c(myDF1,myDF2)) {
colnames(x) <- gsub(x = colnames(x), pattern = "bad_name1", replacement = "good_name1")
}
There are several ways of doing this. One that appealed to me is the subset method:
colnames(myDF1)[names(myDF1) == "bad_name1"] <- "good_name1"
This works fine as a single line, but not as a for loop.
for (x in c(myDF1,myDF2)) {
colnames(x)[colnames(x) == "bad_name1"] <- "good_name1"
}
Which renders the error message.
Error in `colnames<-`(`*tmp*`, value = character(0)) :
attempt to set 'colnames' on an object with less than two dimensions
The same error message applies with a 'gsub'-based method:
for (x in c(myDF1,myDF2)) {
colnames(x) <- gsub(x = colnames(x), pattern = "bad_name1", replacement = "good_name1")
}
I realise that I am missing something fundamental here. I suppose that the for loop is not receiving the results of the 'colnames(x)' in an appropriate format. But I cannot understand how I'm supposed to make it work. The methods suggested in Rename columns in multiple dataframes, R do not really cover this error message.
Additional clarification, as asked for by vaettchen in a comment:
There are 3 column names that have to be changed (in all data frames). The reason is that they have names like varX.1, varX.2, varX.3 while I would prefer varXcount, varXmean, varXmax. So I have realised that there are names that I am not happy with, and decided on new ones based on my own taste.
You just need a few minor changes. Look at c(myDF1, myDF2) to see why that is not working - it splits the data frames into a list of 4 factors. Combine the data frames into a list and process the list:
all <- list(myDF1=myDF1, myDF2=myDF2)
for (x in seq_along(all)) {
  colnames(all[[x]]) <- gsub(x = colnames(all[[x]]), pattern = "bad_name1",
                             replacement = "good_name1")
}
list2env(all, envir=.GlobalEnv)
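A variant of the same idea, as a rough sketch: since several names need to change, a named lookup vector can map old names to new ones, and lapply can apply it to every data frame in the list. The helper rename_cols and the lookup vector are hypothetical names made up for this example. It also shows why the original loop failed: c() flattens the data frames into a plain list of columns (factors before R 4.0.0, character vectors since), and a single column has no colnames.

```r
myDF1 <- data.frame(bad_name1 = "A", bad_name2 = "B")
myDF2 <- data.frame(bad_name1 = "C", bad_name2 = "D")

# c() concatenates the columns of both data frames into one list.
length(c(myDF1, myDF2))  # 4 separate columns, not 2 data frames

# Hypothetical mapping from old names to new names.
lookup <- c(bad_name1 = "good_name1", bad_name2 = "good_name2")

# Rename only the columns that appear in the lookup; leave the rest alone.
rename_cols <- function(d, lookup) {
  hit <- names(d) %in% names(lookup)
  names(d)[hit] <- unname(lookup[names(d)[hit]])
  d
}

dfs <- lapply(list(myDF1 = myDF1, myDF2 = myDF2), rename_cols, lookup = lookup)
names(dfs$myDF1)
```

The originals myDF1 and myDF2 are untouched; the renamed copies live in dfs until they are assigned back (for example with list2env as above).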

R in counting data

Right now I'm trying to do a bell curve on a file called output9.csv on my.
Here is my code. I want to use a z-score to detect outliers: it uses the difference between each value and the mean of the data set, and that difference is compared with the standard deviation to find the outliers.
#DATA LOAD
data <- read.csv('output9.csv')
height <- data$Height
hist(height) #histogram
#POPULATION PARAMETER CALCULATIONS
pop_sd <- sd(height)*sqrt((length(height)-1)/(length(height)))
pop_mean <- mean(height)
But I have this error after I tried the histogram part,
> hist(height)
Error in hist.default(height) : 'x' must be numeric
how should I fix this?
Since I don't have your data I can only guess. Can you provide it? Or at least a portion of it?
What class is your data? You can use class(data) to find out. The most common way is to have table-like data in data.frames. To subset one of your columns to use it for the hist you can use the $ operator. Be sure you subset on a column that actually exists. You can use names(data) (if data is a data.frame) to find out what columns exist in your data. Use nrow(data) to find out how many rows there are in your data.
After extracting your height you can go further. First check that your height object is numeric and has something in it. You can use class(height) to find out.
As you posted in your comment you have the following names
names(data)
# [1] "Host" "TimeStamp" "TimeZone" "Command" "RequestLink" "HTTP"
# [7] "ReplyCode" "Bytes"
Therefore you can extract your height with
height <- data$Bytes
Did you try to convert it to numeric? as.numeric(height) might do the trick. as.numeric() converts values that are stored as characters but look like numbers into actual numbers. Try as.numeric("3") as an example.
Here is an example I made up.
height <- c(1,1,2,3,1)
class(height)
# [1] "numeric"
hist(height)
This works just fine, because the data is numeric.
In the following the data are numbers but formatted as characters.
height_char <- c("1","1","2","3","1")
class(height_char)
# [1] "character"
hist(height_char)
# Error in hist.default(height_char) : 'x' must be numeric
So you have to coerce it first:
hist(as.numeric(height_char))
..and then it works fine.
For future questions: Try to give Minimal, Complete, and Verifiable Examples.
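Putting the coercion together with the z-score idea from the question, a minimal sketch on made-up values might look like this. The data, the "n/a" entry and the cutoff of 1.5 are all assumptions chosen for a tiny sample; they are not taken from output9.csv.

```r
# Made-up column as it might arrive from a CSV: numbers stored as text.
height_raw <- c("150", "152", "149", "151", "300", "n/a")

# Coerce to numeric; entries that cannot be parsed become NA and are dropped.
height <- suppressWarnings(as.numeric(height_raw))
height <- height[!is.na(height)]

# z-score: distance from the mean in units of the standard deviation.
z <- (height - mean(height)) / sd(height)

# Arbitrary cutoff for this tiny sample.
outliers <- height[abs(z) > 1.5]
outliers

# If the column is a factor (the read.csv default before R 4.0.0),
# as.numeric() returns the level codes, not the values:
f <- factor(c("10", "20"))
as.numeric(f)                # level codes
as.numeric(as.character(f))  # the actual numbers
```

After the coercion, hist(height) works because the vector is numeric.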

How to subset a list based on the length of its elements in R

In R I have a function (ip2coordinates from the package RDSTK) which looks up 11 fields of data for each IP address you supply.
I have a list of IP's called ip.addresses:
> head(ip.addresses)
[1] "128.177.90.11" "71.179.12.143" "66.31.55.111" "98.204.243.187" "67.231.207.9" "67.61.248.12"
Note: Those or any other IP's can be used to reproduce this problem.
So I apply the function to that object with sapply:
ips.info <- sapply(ip.addresses, ip2coordinates)
and get a list called ips.info as my result. This is all good and fine, but I can't do much more with a list, so I need to convert it to a dataframe. The problem is that not all IP addresses are in the databases thus some list elements only have 1 field and I get this error:
> ips.df <- as.data.frame(ips.info)
Error in data.frame(`128.177.90.10` = list(ip.address = "128.177.90.10", :
arguments imply differing number of rows: 1, 0
My question is -- "How do I remove the elements with missing/incomplete data or otherwise convert this list into a data frame with 11 columns and 1 row per IP address?"
I have tried several things.
First, I tried to write a loop that removes elements with less than a length of 11
for (i in 1:length(ips.info)){
  if (length(ips.info[i]) < 11){
    ips.info[i] <- NULL
  }
}
This leaves some records with no data and makes others say "NULL", but even those with "NULL" are not detected by is.null
Next, I tried the same thing with double square brackets and get
Error in ips.info[[i]] : subscript out of bounds
I also tried complete.cases() to see if it could potentially be useful
Error in complete.cases(ips.info) : not all arguments have the same length
Finally, I tried a variation of my for loop which was conditioned on length(ips.info[[i]] == 11 and wrote complete records to another object, but somehow it results in an exact copy of ips.info
Here's one way you can accomplish this using the built-in Filter function
#input data
library(RDSTK)
ip.addresses<-c("128.177.90.10","71.179.13.143","66.31.55.111","98.204.243.188",
"67.231.207.8","67.61.248.15")
ips.info <- sapply(ip.addresses, ip2coordinates)
#data.frame creation
lengthIs <- function(n) function(x) length(x)==n
do.call(rbind, Filter(lengthIs(11), ips.info))
or if you prefer not to use a helper function
do.call(rbind, Filter(function(x) length(x)==11, ips.info))
Alternative solution based on base package.
# find non-complete elements
ids.to.remove <- sapply(ips.info, function(i) length(i) < 11)
# remove found elements
ips.info <- ips.info[!ids.to.remove]
# create data.frame
df <- do.call(rbind, ips.info)
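The same filtering can be tried without the RDSTK lookup. The sketch below uses a made-up list with 3 fields per complete record instead of 11; the addresses and field names are assumptions. It also shows why the loop in the question never removed anything: single brackets return a list of length 1, so the length test has to use double brackets (or lengths()).

```r
# Made-up stand-in for ips.info; complete records have 3 fields here.
ips.info <- list(
  "10.0.0.1" = list(ip = "10.0.0.1", lat = 1.5, lon = 2.5),
  "10.0.0.2" = list(ip = "10.0.0.2"),  # incomplete record
  "10.0.0.3" = list(ip = "10.0.0.3", lat = 3.5, lon = 4.5)
)

length(ips.info[1])    # a sub-list holding one element
length(ips.info[[1]])  # the number of fields in the first record

# Keep only complete records, then bind them row by row.
complete <- ips.info[lengths(ips.info) == 3]
ips.df <- do.call(
  rbind,
  lapply(complete, as.data.frame, stringsAsFactors = FALSE)
)
ips.df
```

Converting each record to a one-row data frame before rbind keeps the columns typed (character and numeric) instead of producing a matrix of list cells.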

Recoding over multiple data frames in R

(edited to reflect help...I'm not doing great with formatting, but appreciate the feedback)
I'm a bit stuck on what I suspect is an easy enough problem. I have multiple different data sets that I have loaded into R, all of which have different numbers of observations, but all of which have three variables named "A1," "A2," and "A3". I want to create a new variable in each of the data frames that contains the value held in "A1" if A3 contains a value greater than zero, and the value held in "A2" if A3 contains a value less than zero. Seems simple enough, right?
My attempt at this code uses this faux-data:
set.seed(1)
A1=seq(1,100,length=100)
A2=seq(-100,-1,length=100)
A3=runif(100,-1,1)
df1=cbind(A1,A2,A3)
A3=runif(100,-1,1)
df2=cbind(A1,A2,A3)
I'm about a thousand percent sure that R has some functionality for creating the same named variable in multiple data frames, but I have tried doing this with lapply:
mylist=list(df1,df2)
lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2[x$A3>0]
return(x)
})
But the newVar is not available for me once I leave the lapply loop. For example, if I ask for the mean of the new variable:
mean(df1$newVar)
[1] NA
Warning message:
In mean.default(df1$newVar) :
argument is not numeric or logical: returning NA
Any help would be appreciated.
Thank you.
Well first of all, df1 and df2 are not data.frames but matrices (the dollar syntax doesn't work on matrices).
In fact, if you do:
set.seed(1)
A1=seq(1,100,length=100)
A2=seq(-100,-1,length=100)
A3=runif(100,-1,1)
df1=as.data.frame(cbind(A1,A2,A3))
A3=runif(100,-1,1)
df2=as.data.frame(cbind(A1,A2,A3))
mylist=list(df1,df2)
lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2
})
the code almost works but gives some warnings. In fact, there's still an error in the last line of the function called by lapply. If you change it like this, it works as expected:
lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2[x$A3>0] # you need to subset x$A2 otherwise it's too long
return(x) # better to state explicitly what's the return value
})
EDIT (as per comment):
As is almost always the case in R, functions do not modify existing objects; they return new ones.
So, in this case df1 and df2 are unchanged, but lapply returns a list with the expected 2 new data.frames, i.e.:
resultList <- lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2[x$A3>0]
return(x)
})
newDf1 <- resultList[[1]]
newDf2 <- resultList[[2]]
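A compact variant of the same approach, as a sketch: build real data frames with data.frame(), keep them in a named list, and let ifelse() choose between A2 and A1. It follows the code above (A2 where A3 > 0, A1 otherwise), which is the reverse of what the question's prose describes, so swap the two arguments if the other rule is wanted.

```r
set.seed(1)
A1 <- seq(1, 100, length.out = 100)
A2 <- seq(-100, -1, length.out = 100)

# data.frame() instead of cbind(), so that $ works on the result.
df1 <- data.frame(A1, A2, A3 = runif(100, -1, 1))
df2 <- data.frame(A1, A2, A3 = runif(100, -1, 1))

# A named list keeps track of which result belongs to which input.
dfs <- list(df1 = df1, df2 = df2)
dfs <- lapply(dfs, function(x) {
  x$newVar <- ifelse(x$A3 > 0, x$A2, x$A1)
  x  # return the modified copy
})

mean(dfs$df1$newVar)  # the new column lives in the list element
is.null(df1$newVar)   # the original df1 is unchanged
```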

Writing Function

I'm writing a function for a data set called opps on part number sales data, and I'm trying to break the data down into smaller data sets that are specific to the part numbers. I am trying to name the data sets as the argument "modNum". Here is what I have so far-
# modNum (Modified Product Number) takes a product number that looks
# like "950-0004-00" and makes it "opQty950.0004.00"
productNumber <- function(prodNum,modNum){
path <- "C:/Users/Data/"
readFile <- paste(path,"/opps.csv",sep="")
oppsQty <- read.csv(file=readFile,sep=",")
oppsQty$Line.Created.date <- as.Date(as.character(oppsQty$Line.Created),
"%m/%d/%Y")
modNum <- oppsQty[oppsQty$Part.Number=="prodNum",]
}
productNumber(280-0213-00,opQty280.0213.00)
#Error: object 'opQty910.0002.01' not found
The line I believe I'm having problems with is
modNum <- oppsQty[oppsQty$Part.Number=="prodNum",]
and it's because in order for the code to work, there have to be quotation marks around prodNum, but when I put the quotation marks in the code,
prodNum is no longer seen as the argument to be filled in. When I put the quotation marks inside the argument like this,-
productNumber(280-0213-00,"opQty280.0213.00")
I still have a problem.
How can I get around this?
I have tried rewriting the oppsQty$Part.Number variable to be numeric (shown below) so that I can eliminate the quotation marks altogether, but I still have errors...
productNumber <- function(prodNum,nameNum){
path <- "C:/Users/Data"
readFile <- paste(path,"/opps.csv",sep="")
oppsQty <- read.csv(file=readFile,sep=",")
oppsQty$Line.Created.date <- as.Date(as.character(oppsQty$Line.Created),
"%m/%d/%Y")
#ifelse(oppsQty$Part.Number=="Discount",
# oppsQty$Part.Number=="000000000",
# oppsQty$Part.Number)
oppsQty$Part <- paste(substr(oppsQty$Part.Number,1,3),
substr(oppsQty$Part.Number,5,8),
substr(oppsQty$Part.Number,10,11),sep = "")
oppsQty$Part <- as.numeric(oppsQty$Part)
oppsQty$Part[is.na(oppsQty$Part)] <- 0
nameNum <- oppsQty[oppsQty$Part==prodNum,]
}
> productNumber(401110201,opQty401.1102.01)
Warning message:
In productNumber(401110201, opQty401.1102.01) : NAs introduced by coercion
Help is much appreciated!
Thank you!
At the moment you are passing prodNum as a numeric expression, so 280-0213-00 is evaluated as 67 (280 - 213 - 0 = 67).
You should pass (and treat) prodNum as a character string, as this is what you intend, i.e. "280-0213-00". Inside the function, compare against the variable prodNum rather than the quoted literal "prodNum".
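As a rough sketch of how this could look, the function below takes the data and the product number as a string, returns the matching rows, and leaves the naming to the caller. The function name subsetByPart, the Qty column and the inline data frame are assumptions for illustration; the real function would read opps.csv instead.

```r
# Made-up stand-in for the contents of opps.csv.
opps <- data.frame(
  Part.Number = c("280-0213-00", "950-0004-00", "280-0213-00"),
  Qty = c(5, 2, 7),
  stringsAsFactors = FALSE
)

# Return the rows for one product number (passed as a character string).
subsetByPart <- function(data, prodNum) {
  data[data$Part.Number == prodNum, ]
}

# The caller chooses the name by assigning the returned value.
opQty280.0213.00 <- subsetByPart(opps, "280-0213-00")

# Without quotes, R treats the product number as arithmetic.
280-0213-00

# One data frame per part number in a single step, as a named list.
byPart <- split(opps, opps$Part.Number)
names(byPart)
```

A function argument cannot serve as the name of an object created in the caller's workspace, which is why the second argument in the question had no effect; returning the subset and assigning it avoids that.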
