From factor to numeric in R

I am having a lot of trouble converting a set of 53 factor variables to numeric. Here are a couple of the approaches I tried, but none of them work:
sapply(dataset, function(x) transform(as.character(x)))
and then
sapply(dataset, function(x) transform(as.numeric(x)))
I also tried it with lapply, but same thing...
as.numeric(levels(factor))
doesn't work either. Finally, I tried converting the columns one by one:
transform(dataset, s1 = as.numeric(s1), s2= as.numeric(s2)...etc)
Could somebody please help me? I also have a few missing values, coded as NA and M, within the variables, so I don't know how to adjust for that.
Thanks !

Although you didn't provide a reproducible example, this might work:
df[,c(2:54)] <- as.numeric(as.character(unlist(df[,c(2:54)])))
where c(2:54) stands for the columns you want to convert to numeric. Any value that is not a number, such as M, becomes NA with a warning.
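A per-column variant might look like the sketch below. It assumes the factor columns hold numbers stored as text and that "M" is a missing-value code, as the question suggests; the toy data frame and the column names s1 and s2 are made up for illustration.

```r
# Toy data (hypothetical): two factor columns containing NA and the code "M"
df <- data.frame(
  id = 1:4,
  s1 = factor(c("1.5", "M", "3", NA)),
  s2 = factor(c("10", "20", NA, "M"))
)

cols <- c("s1", "s2")  # the columns to convert

df[cols] <- lapply(df[cols], function(x) {
  x <- as.character(x)     # use the factor labels, not the integer codes
  x[x %in% "M"] <- NA      # treat "M" as missing before converting
  as.numeric(x)
})

str(df)
```

Recoding "M" first avoids the "NAs introduced by coercion" warning, and working column by column keeps each column's values in place.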

Related

Mean removing NAs inside vectors in R with ! operator

I am stuck with a problem in R.
It is about removing NAs within vectors and dataframes.
I am given the library, data frame and the vector as follows:
library(dslabs)
data(na_example)
ind <- is.na(na_example)
So, I need to compute the mean, but with the entries that are not NA inside the vector "ind".
I have tried everything, including the answer (I think) that is: mean(!ind), because I HAVE to use the ! operator.
The result is 0.855. However, the evaluating system does not give me a positive score.
Please, could you give me a hand?
You're looking for na.omit, not is.na:
library(dslabs)
data(na_example)
ind <- na.omit(na_example)
mean(ind)
Which gives you: 2.301754
So, I finally got it after many hours of struggle.
I was putting the ! in the wrong place
ind <- is.na(na_example)
mean(!ind)
[1] 0.855
It should be:
ind <- !is.na(na_example)
mean(ind)
[1] 0.855
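To make the difference between the two results in this thread concrete, here is a small sketch on a made-up vector x standing in for na_example. Taking the mean of a logical vector gives a proportion, while indexing with it gives the mean of the values themselves.

```r
x <- c(2, NA, 4, NA, 6)   # toy vector standing in for na_example

ind <- is.na(x)           # TRUE where a value is missing

mean(!ind)                # proportion of entries that are not NA: 0.6
mean(x[!ind])             # mean of the non-missing values: 4
mean(x, na.rm = TRUE)     # same value without an explicit index: 4
```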

Replacing Inf Values with 0 Given a List of Columns in R

I'm terrible at apply functions, and every answer I looked up here is somehow hard for me to apply to this problem. I've tried as hard as I can not to post here.
I have a list of column names called "log_fields"
I want to go through each of these columns in my data frame "df" and replace the infinite values with 0.
This is the code I'm currently trying to use. There must be a syntax error in my function argument, because I'm being told the argument "values" is missing.
sapply(df[log_fields], function(x) replace(is.infinite(x),0))
I'm incredibly grateful for the help!
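lapply(df[log_fields], function(x) ifelse(is.infinite(x), 0, x)), as 李哲源 suggested.
lapply(df[log_fields], function(x) {x[is.infinite(x)] <- 0; x}), as dww suggested.
Either call returns a list, so assign the result back with df[log_fields] <- lapply(...) to update the data frame.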
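For reference, replace() takes three arguments (the vector, the positions, and the new values), which is why the original call complained that "values" was missing. A minimal sketch on a made-up data frame, writing the result back into the selected columns:

```r
# Toy data (hypothetical): column "c" is deliberately left out of log_fields
df <- data.frame(a = c(1, Inf, 3), b = c(-Inf, 2, NA), c = c(Inf, 1, 2))
log_fields <- c("a", "b")

df[log_fields] <- lapply(df[log_fields], function(x) {
  x[is.infinite(x)] <- 0   # is.infinite() is FALSE for NA, so NAs are kept
  x
})

df
```

Only the columns named in log_fields change; other columns and any NA values are left alone.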

Efficient Way to Convert to Numeric

I have converted a bunch of my columns from factor to numeric, but the code was very cumbersome. I had to individually convert each column, which ended up taking more time than it should. This is the code I used (only a short sample - I actually have many more columns):
city1$NY <-as.numeric(levels(city1$NY))[city1$NY]
city1$CHI<-as.numeric(levels(city1$CHI))[city1$CHI]
city1$LA <-as.numeric(levels(city1$LA))[city1$LA]
city1$ATL<-as.numeric(levels(city1$ATL))[city1$ATL]
city1$MIA<-as.numeric(levels(city1$MIA))[city1$MIA]
I was almost positive that instead of doing all of that, I could've just done:
city1[,CityNames]<-as.numeric(levels(city1[,CityNames]))[city1[,CityNames]]
Where CityNames is just all of the columns that I would like to convert. But that doesn't work, as I get:
Error in as.numeric(levels(city1[, CityNames]))[city1[, CityNames]] :
invalid subscript type 'list'
Can anyone tell what I am doing wrong? Or is there just simply no easier way to do this task other than my long, annoying first method?
I was almost positive that instead of doing all of that, I could've just done:
city1[,CityNames]<-as.numeric(levels(city1[,CityNames]))[city1[,CityNames]]
So, a small change is needed:
city1[,CityNames] <- lapply(city1[,CityNames], function(x) as.numeric(levels(x))[x] )
The original approach didn't work because
levels are vector-specific, so it's not clear what myvec = levels(city1[,CityNames]) is.
myvec[ city1[,CityNames] ] throws an error because city1[,CityNames] is a data.frame and cannot be used to subset in this way.
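The points above can be checked on a small made-up data frame (the values are hypothetical; only the names city1 and CityNames come from the question):

```r
city1 <- data.frame(
  NY  = factor(c("10", "20", "10")),
  CHI = factor(c("3.5", "7", "1"))
)
CityNames <- c("NY", "CHI")

# A data frame as a whole has no levels; only its factor columns do
levels(city1[, CityNames])

# as.numeric() on a factor returns the internal integer codes, not the labels
as.numeric(city1$CHI)

# Column by column: convert the levels once, then index them by the factor
city1[CityNames] <- lapply(city1[CityNames], function(x) as.numeric(levels(x))[x])

str(city1)
```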
This is typically what I do when I want to convert many columns in a data.frame to a different data type:
convNames <- c("NY", "CHI", "LA", "ATL", "MIA")
for(name in convNames) { city1[, name] <- as.numeric(as.character(city1[, name])) }
It's a tidy two lines, and to coerce another column you just add its name to the convNames vector.
EDIT: Due to a factor issue, use the lapply method above.
I'm not sure if it is faster, but another option is as.numeric(as.character(x)), applied to each column rather than to the whole data frame: city1[CityNames] <- lapply(city1[CityNames], function(x) as.numeric(as.character(x))). The as.character() converts each value to its level label, and as.numeric() then interprets those strings as numbers. Note that the help page for factor describes as.numeric(levels(x))[x] as slightly more efficient, since it converts each distinct level only once.

R: 'Missing Value where True/False needed'

So I know this has been asked before, but from what I've searched I can't really find an answer to my problem. I should also add I'm relatively new to R (and any type of coding at all) so when it comes to fixing problems in code I'm not too sure what I'm looking for.
My code is:
education_ge <- data.frame(matrix(ncol=2, nrow=1))
colnames(education_ge) <- c("Education","Genetic.Engineering")
for (i in 1:nrow(survey))
if (survey[i,12]=="Bachelors")
education_ge$Education <- survey[i,12]
To give more info, 'survey' is a data frame with 12 columns and 26 rows, and the 12th column, 'Education', is a factor which has levels such as 'Bachelors', 'Masters', 'Doctorate' etc.
This is the error as it appears in R:
for (i in 1:nrow(survey))
if (survey[i,12]=="Bachelors")
education_ge$Education <- survey[i,12]
Error in if (survey[i, 12] == "Bachelors") education_ge$Education <- survey[i, :
missing value where TRUE/FALSE needed
Any help would be greatly appreciated!
If you just want to ignore any records with missing values and get on with your analysis, try inserting this at the beginning:
survey <- survey[ complete.cases(survey), ]
It basically finds the indexes of all the rows where there are no NAs anywhere, and then subsets survey to have only those rows.
For more information on subsetting, try reading this chapter: http://adv-r.had.co.nz/Subsetting.html
The command:
sapply(survey,function (x) sum(is.na(x)))
will show you how many NAs you have in each column. That might help your data cleaning.
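The error itself comes from the comparison returning NA for a missing value, which if() cannot interpret. A minimal sketch on a made-up survey data frame (the counter n_bachelors is hypothetical) shows two ways to guard against it without dropping rows:

```r
# Toy data (hypothetical): one missing Education value
survey <- data.frame(
  Education = factor(c("Bachelors", NA, "Masters", "Bachelors"))
)

# NA == "Bachelors" is NA, and if (NA) stops with
# "missing value where TRUE/FALSE needed"
is.na(survey$Education[2] == "Bachelors")

n_bachelors <- 0
for (i in seq_len(nrow(survey))) {
  # isTRUE() turns an NA comparison into FALSE, so missing rows are skipped
  if (isTRUE(survey$Education[i] == "Bachelors")) {
    n_bachelors <- n_bachelors + 1
  }
}
n_bachelors

# Vectorised alternative: %in% never returns NA
survey[survey$Education %in% "Bachelors", , drop = FALSE]
```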
You can try this:
sub<-subset(survey,survey$Education=="Bachelors")
education_ge$Education<-sub$Education
Let me know if this helps.

R warning message - invalid factor level, NA generated

I have the following block of code. I am a complete beginner in R (a few days old), so I am not sure how much of the code I need to share to explain my problem. So here is everything I have written.
mdata <- read.csv("outcome-of-care-measures.csv",colClasses = "character")
allstate <- unique(mdata$State)
allstate <- allstate[order(allstate)]
spldata <- split(mdata,mdata$State)
if (num=="best") num <- 1
ranklist <- data.frame("hospital" = character(),"state" = character())
for (i in seq_len(length(allstate))) {
if (outcome=="heart attack"){
pdata <- spldata[[i]]
pdata[,11] <- as.numeric(pdata[,11])
bestof <- pdata[!is.na(as.numeric(pdata[,11])),][]
inorder <- order(bestof[,11],bestof[,2])
if (num=="worst") num <- nrow(bestof)
hospital <- bestof[inorder[num],2]
state <- allstate[i]
ranklist <- rbind(ranklist,c(hospital,state))
}
}
allstate is a character vector of states.
outcome can have values similar to "heart attack"
num will be numeric or "best" or "worst"
I want to create a data frame ranklist which will have hospital names and the state names which follow a certain criterion.
However I keep getting the error
invalid factor level, NA generated
I know it has something to do with rbind, but I cannot figure out what it is. I have tried googling this and troubleshooting with similar questions on this site. I have checked that none of the vectors I am trying to bind are factors. I also tried forcing the coercion by wrapping hospital and state in as.character() during assignment, but that didn't work.
I would be grateful for any help.
Thanks in advance!
Since this is apparently from a Coursera assignment, I am not going to give you a solution, but I am going to hint at it: Have a look at the help pages for read.csv and data.frame. Both have the argument stringsAsFactors. What is the default, true or false? Do you want to keep the default setting? Is colClasses = "character" in line 1 necessary? Use the str function to check what the classes of the columns in mdata and ranklist are. read.csv additionally has an na.strings argument. If you use it correctly, the "NAs introduced by coercion" warning will also disappear and line 16 won't be necessary.
Finally, don't grow a matrix or data frame inside a loop if you know the final size beforehand. Initialize it with the correct dimensions (here 52 x 2) and assign e.g. the i-th hospital to the i-th row and first column of the data frame. That way rbind is not necessary.
By the way, you did not get an error but a warning. R didn't interrupt the loop; it just let you know that some values have been coerced to NA. You can also simplify the seq_len statement by using seq_along instead.
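As a generic illustration of the preallocation advice (not a solution to the assignment), the sketch below fills a data frame of known size row by row. The state codes and hospital names are made up, and stringsAsFactors = FALSE is passed explicitly because the default differs between R versions before and after 4.0.0.

```r
allstate <- c("AL", "AK", "AZ")              # hypothetical state codes
best     <- c("Hosp A", "Hosp B", "Hosp C")  # hypothetical per-state results

# Allocate all rows up front, with character (not factor) columns
ranklist <- data.frame(
  hospital = character(length(allstate)),
  state    = character(length(allstate)),
  stringsAsFactors = FALSE
)

for (i in seq_along(allstate)) {
  ranklist$hospital[i] <- best[i]      # assign into row i; no rbind needed
  ranklist$state[i]    <- allstate[i]
}

str(ranklist)
```

Because the columns are plain character vectors, assigning a new string never triggers the "invalid factor level" warning.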
