I'm using a dataset that has periods (.) in place of NAs. Right now, the column I'm looking at is a factor with levels 1, 2, and .. I'm trying to take a mean, and obviously, na.rm isn't working. I went back and cleaned the data by changing the periods to NAs (pe94[pe94 == "."] <- NA), and that appeared to work. However, mean can't take the mean of a factor, and when I convert the factor to a numeric, the NAs become 3s. How can I get rid of this problem?
I also had similar issues (and other issues) converting factors into numbers for mathematical analysis. However, I found a fairly simple solution that seems to work. Hope this helps ...
#Script to convert factor data to numeric data without loss or alterations of values
#Sample data frame with factor variables represented by numbers
factor.vector1<-factor(x=c(111,222,333,444,555))
thousands<-c("1,000","2,000","3,000","4,000","5,000")
factor.vector2<-factor(x=thousands)
df<-data.frame(factor.vector1, factor.vector2)
#Numbers as factors without comma place holders
#1st convert dataset to character data type
df[,1]<-as.character(df[,1])
#2nd convert dataset to numeric data type
df[,1]<-as.numeric(df[,1])
#Numbers as factors WITH comma place holders
#If data contains commas in the numbers (e.g. 2,000) use gsub to remove commas
#If commas are not removed before conversion, the value containing commas will become NA
df[,2]<-gsub(",", "", df[,2])
#1st convert dataset to character data type
df[,2]<-as.character(df[,2])
#2nd convert dataset to numeric data type
df[,2]<-as.numeric(df[,2])
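For the original question (periods standing in for missing values), the same character-first conversion applies. A minimal sketch, assuming a made-up factor `f` with levels "1", "2" and "." in place of the real column:

```r
# Hypothetical sample data: "." marks a missing value
f <- factor(c("1", "2", ".", "2", "1"))

# as.numeric(f) would return the internal level codes, not the values.
# Going through character first keeps the values; "." cannot be parsed
# as a number and becomes NA (suppressWarnings hides the expected
# "NAs introduced by coercion" warning).
x <- suppressWarnings(as.numeric(as.character(f)))

mean(x, na.rm = TRUE)  # 1.5
```

If the data come from read.csv or read.table, passing na.strings = "." at import time may avoid the problem entirely, since the periods are then read as NA from the start.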
I imported my Excel data as a data frame, and most variables are stored as characters.
I have tried to convert them to numeric (starting with the column Average) and to indicate that the "," is a decimal separator, but every cell is filled with NA. When I print the data frame again, or when I run summary, I only see NAs instead of numbers. I got a warning after both attempts:
class(ArgIncome$Average) <- "numeric"
ArgIncome$Average <- as.numeric(as.character(ArgIncome$Average))
saying
"NAs introduced by coercion".
You can transform a character variable into numeric like this:
ArgIncome$Average <- as.numeric(ArgIncome$Average) #if character
ArgIncome$Average <- as.numeric(as.character(ArgIncome$Average)) #if factor
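Note that as.numeric only understands "." as the decimal mark, so values written with a decimal comma still come out as NA. A small sketch of one way around that, using made-up values in a hypothetical vector `average` rather than the real ArgIncome data:

```r
# Hypothetical values as they might arrive from a spreadsheet export
average <- c("1234,5", "987,25", "12,0")

# Replace the decimal comma with a point first, then convert.
# fixed = TRUE treats "," as a literal string, not a regular expression.
average_num <- as.numeric(sub(",", ".", average, fixed = TRUE))

average_num  # 1234.50  987.25   12.00
```

If the numbers also carry a thousands separator (e.g. "1.234,5"), that separator has to be removed before the comma is replaced.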
I have a dataframe called Percent_DF like below.
When I try to convert the Percentage column datatype into numeric datatype, the output does not display the correct values for Percentage column.
I have tried to convert the fctr to numeric by using as.numeric datatype conversion.
Percent_DF$Percentage <- as.numeric(Percent_DF$Percentage)
I am getting 123 and 113 instead of 50.37 and 39.78 respectively. However, the Percentage column's data type has been converted into dbl. I have no idea why the above code produces different values.
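There are two problems: the column is a factor, so as.numeric returns the internal level codes (which is where 123 and 113 come from), and the strings contain a % sign, so they cannot be parsed as numbers directly.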
Try:
Percent_DF$Percentage <- as.character(Percent_DF$Percentage)
Percent_DF$Percentage <- gsub("%","",Percent_DF$Percentage)
Percent_DF$Percentage <- as.numeric(Percent_DF$Percentage)
We first turn the factor into character, then remove the %, and finally convert the value to numeric.
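The difference between the level codes and the real values can be seen on a tiny example. This is only a sketch; `pct` is a hypothetical stand-in for the Percentage column:

```r
# Hypothetical stand-in for the Percentage column
pct <- factor(c("50.37%", "39.78%", "9.85%"))

# Level codes: positions in levels(pct), not the percentages themselves
codes <- as.numeric(pct)

# Convert to character, strip the literal "%", then convert to numeric
pct_num <- as.numeric(sub("%", "", as.character(pct), fixed = TRUE))

pct_num  # 50.37 39.78  9.85
```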
I generally like R, but the type conversion issues are driving me crazy.
Following issue:
I read a data frame from a database connection. The result is a data frame with character columns.
I know that the first column is a date format - all the others are numeric. However, no matter how I tried to convert the character columns of the data frame into the correct types, it didn't work out.
Upon conversion of the data frame into a matrix and then back into a data frame, all columns became type factor, and casting the factors to numeric produced wrong results because the indices of the factor levels were converted instead of the real values.
Moreover, if the table is big in size - I do not want to convert each column manually. Isn't there a way to get this done automatically?
We can use type.convert, looping over the columns of the dataset with lapply. Convert each column to character and apply type.convert with as.is = TRUE, so that numeric-looking columns become integer or numeric and everything else stays character. Passing as.is explicitly avoids depending on its default. Any column that is still character can then be converted to Date class (as there is only a single non-numeric column). The date format is not known from the question, so if it is not the default year-month-day form, specify the format argument in as.Date.
df1[] <- lapply(df1, function(x) {
  x1 <- type.convert(as.character(x), as.is = TRUE)
  # whatever is still character is assumed to be the date column
  if (is.character(x1)) as.Date(x1) else x1
})
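To see the approach on concrete data, here is a sketch with a made-up `df1` (the column names and values are assumptions, not the asker's table). as.is = TRUE is passed explicitly so that strings stay character, and any column that remains character is assumed to hold ISO-formatted dates:

```r
# Hypothetical data frame as it might come back from a database driver:
# every column is character
df1 <- data.frame(
  day   = c("2020-01-01", "2020-01-02"),
  count = c("10", "20"),
  price = c("1.5", "2.25"),
  stringsAsFactors = FALSE
)

df1[] <- lapply(df1, function(x) {
  x1 <- type.convert(as.character(x), as.is = TRUE)
  # anything still character is assumed to be a year-month-day date
  if (is.character(x1)) as.Date(x1) else x1
})

# Inspect the resulting class of each column
sapply(df1, function(col) class(col)[1])
```

With these inputs, day should end up as Date, count as integer and price as numeric. A table with more than one genuinely textual column would need a stricter check than is.character before calling as.Date.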
When I convert my data frame columns to numeric, all the values become NA
offense[,2:13] <- apply(offense[,2:13],2,as.numeric)
The converted data frame.
Dataframe before conversion.
They are all numbers, with no commas. I have even tried removing any whitespace that might be there by using
as.data.frame(apply(offense,2,function(x)gsub('\\s+','',x)))
but still the values are converted to NA on type conversion with a warning message.
I got the data from a URL (Data Science Cookbook chapter 3)
offense <- readHTMLTable(url, encoding = "UTF-8", colClasses="character")[[7]]
The imported variables are factors, so you have to use, e.g.
as.numeric(as.character(offense$`Pts/G`))
apply(offense[, 2:13], 2, function(x) as.numeric(as.character(x)))
See ?factor:
To transform a factor f to approximately its original numeric values,
as.numeric(levels(f))[f] is recommended and slightly more efficient
than as.numeric(as.character(f)).
(However, the first way did not work for me. Maybe I made a mistake, but the second way, with as.numeric(as.character()), works.)
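Both idioms from the help page should give the same result. A quick sketch on a made-up factor `f`; one easy slip with the first form is putting the `[f]` inside the as.numeric() call or leaving it out:

```r
# Hypothetical factor whose labels are numbers
f <- factor(c("10.5", "3", "10.5", "7"))

# Convert each level once, then index the converted levels by the factor
a <- as.numeric(levels(f))[f]

# Convert every element via its character label
b <- as.numeric(as.character(f))

identical(a, b)  # TRUE

# For contrast: this returns the internal level codes, not the values
as.numeric(f)
```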
I have a big data frame (22k rows, 400 columns) which is generated using read.csv from a csv file. It appears that every column is a factor and all the row values are the levels of this factor.
I now want to do some analysis (like PCA) but I can't work with it unless it is a matrix, but even when I try it like matrix, all I get is
> prcomp(as.matrix(my_data))
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Is there a way of transforming this data frame with factors to a simple big matrix?
I am new in R so forgive all the (maybe terrible) mistakes.
Thanks
You can do it that way:
df<-data.frame(a=as.factor(c(1,2,3)), b=as.factor(c(2,3,4)))
m<-apply(apply(df, 1, as.character), 1, as.numeric)
apply calls the given function on each row (or column) of the data.frame. It is important to convert to character first, because calling as.numeric directly on a factor returns the internal numeric representation of the factor (its level codes) rather than the values shown.
To add column names, do this:
m<-m[-1,] # only needed if the header line was read in as a data row: drops that first row
colnames(m)<-c("a", "b") # replace the right hand side with your desired column names, e.g. the first row of your data.frame
One more tip: you probably read the data.frame from a file. If you set the parameter header=TRUE when reading, the first line of the file is used for the column names of the data.frame instead of being read as a data row.
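A column-wise variant of the same idea keeps the column names automatically. This is only a sketch on a small made-up data frame, assuming every column holds numbers stored as factor labels:

```r
# Hypothetical data frame in which every column is a factor
df <- data.frame(a = factor(c(1, 2, 3)), b = factor(c(20, 30, 40)))

# Convert each column through character to numeric; sapply binds the
# results into a numeric matrix and carries the column names over
m <- sapply(df, function(col) as.numeric(as.character(col)))

is.numeric(m)  # TRUE
colnames(m)    # "a" "b"
```

A matrix built this way can be passed to prcomp, provided no entries turned into NA during the conversion.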