Converting NA's to factor in R

I want to convert all the NA's in one column (and only one column) of my data frame into "non-PA" instead. The class of the column is factor.
In the past I've successfully used:
df$column[is.na(df$column)] <- "non-PA"
But for some reason this time I get this error message:
In `[<-.factor`(`*tmp*`, is.na(management.points$management),
value = c(NA, : invalid factor level, NA generated
I've tried converting the column to characters and various other ways around it but I still get the same error message. What am I doing wrong?

You have to turn the column into a character vector first:
df$column <- as.character(df$column)
df$column[is.na(df$column)] <- "non-PA"
df$column <- factor(df$column)
The error happens because you cannot input a value in a factor if it is not already a level of that factor.
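For example, assigning a new value to the NA positions of a small factor reproduces the warning:
f <- factor(c("a", "b", NA))
f[is.na(f)] <- "c"   # Warning: invalid factor level, NA generated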
One potential downside (from #docendo's comment) is that this may remove unused factor levels. To keep them, you could just add "non-PA" to the levels instead of transforming to character:
levels(df$column) <- union(levels(df$column), "non-PA")
df$column[is.na(df$column)] <- "non-PA"
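As an alternative sketch in base R, addNA() makes NA an explicit level of the factor, which can then be relabelled in place:
df$column <- addNA(df$column)                            # NA becomes an explicit level
levels(df$column)[is.na(levels(df$column))] <- "non-PA"  # rename that level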

Related

NA in alphanumerical data

I have a data frame that has a variable Account_No. which is in number format. I have account numbers that are numeric (2607242, 2607141) and alphanumeric (NWU14, NWU32). I see that all the alphanumeric values are NA. Please suggest how I can make those account numbers that are in alphanumeric format appear in my data set.
I tried:
as.numeric(x$Account_No.)
What you described sounds like you started off with either a character or factor vector/column, then tried to coerce it to numeric, e.g.
x <- c("2607242", "2607141", "NWU14", "NWU32")
as.numeric(x)
[1] 2607242 2607141 NA NA
This also generates the warning message:
NAs introduced by coercion
If you intend to store values like NWU14, which contain characters other than numbers, then you should leave the type as character or factor.
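If the column should stay as text from the start, one option (a sketch; the file name here is hypothetical) is to tell read.csv not to guess the type of that column:
x <- read.csv("accounts.csv", stringsAsFactors = FALSE,
              colClasses = c(Account_No. = "character"))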

Adding a new column to a DataFrame

I am trying to add a column for totals to a dataframe using R and am getting this error:
Error in rowSums(EurostatCrime2017[, 7:10]) : 'x' must be numeric.
Here is my code:
EurostatCrime2017$All_Theft <- rowSums(EurostatCrime2017[,7:11])
It could be due to a type issue. If we check the type of the columns with str:
str(EurostatCrime2017[,7:10])
we will see whether the columns are numeric or integer.
One option is to convert the columns to numeric:
EurostatCrime2017[,7:10] <- lapply(EurostatCrime2017[,7:10],
                                   function(x) as.numeric(as.character(x)))
Here, we specified as.character in case the columns are factors.
Then do the rowSums.
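For example (a sketch, assuming the relevant columns are 7:10 as in the error message rather than 7:11 as in the posted code):
EurostatCrime2017$All_Theft <- rowSums(EurostatCrime2017[, 7:10])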
I tried the options and it doesn't seem to be working. Here is a link to the document I am working on:
https://drive.google.com/open?id=193JI7z41xvpDh88MWrKp52I3HiQ76LFb

Removing levels in data frame when importing csv data

I am importing csv data to R using
data <- read.csv(file="file_name.csv")
This data has 9 columns and 5000 rows, and the data values are real numbers. Now I want to use this data as a data frame, but the first column comes in with some levels. I don't want these levels.
Here is some sample data in .csv format
Could anyone please help me remove the levels from the first column after it is imported into R?
Here is my attempt:
data$col_1 = as.numeric(as.character(data$col_1))
But it shows a warning:
Warning message:
NAs introduced by coercion
read.csv is basically a wrapper around read.table, so turning off stringsAsFactors will work:
data <- read.csv(file="filename", stringsAsFactors=FALSE)
That column will then be treated as character, and you can do this to convert it to numeric:
data$col <- as.numeric(data$col)
Note: if you have a clean column containing only numbers, read.csv will read it in as numeric automatically; if it is read in as a factor, it means R detected something that is text or non-numeric. You might want to pay attention to the warnings to see which records got converted to NA and why.
For example, I have a csv file.
When I read it in, the id column is treated as character simply because one row contains ohyeah (if that cell were empty or NA, R would still treat the column as numeric). I would recommend first subsetting the records that have been contaminated, to see whether it is a big issue or not.
> subset(data, is.na(as.numeric(id)))
name id
4 dan ohyeah
Warning message:
In eval(expr, envir, enclos) : NAs introduced by coercion
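Applied back to the original question, a quick sketch (using the question's col_1 column) to inspect which values would be lost in the conversion:
data <- read.csv(file="file_name.csv", stringsAsFactors=FALSE)
subset(data, is.na(suppressWarnings(as.numeric(col_1))))   # rows whose col_1 is not numeric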

Convert factor that includes "." to numeric

I'm using a dataset that has periods (.) in place of NAs. Right now, the column I'm looking at is a factor with levels 1, 2, and .. I'm trying to take a mean, and obviously, na.rm isn't working. I went back and cleaned the data by changing the periods to NAs (pe94[pe94 == "."] <- NA), and that appeared to work. However, mean can't take the mean of a factor, and when I convert the factor to a numeric, the NAs become 3s. How can I get rid of this problem?
I also had similar issues (and other issues) converting factors into numbers for mathematical analysis. However, I found a fairly simple solution that seems to work. Hope this helps ...
#Script to convert factor data to numeric data without loss or alterations of values
#Sample data frame with factor variables represented by numbers
factor.vector1<-factor(x=c(111,222,333,444,555))
thousands<-c("1,000","2,000","3,000","4,000","5,000")
factor.vector2<-factor(x=thousands)
df<-data.frame(factor.vector1, factor.vector2)
#Numbers as factors without comma place holders
#1st convert dataset to character data type
df[,1]<-as.character(df[,1])
#2nd convert dataset to numeric data type
df[,1]<-as.numeric(df[,1])
#Numbers as factors WITH comma place holders
#If data contains commas in the numbers (e.g. 2,000) use gsub to remove commas
#If commas are not removed before conversion, the value containing commas will become NA
df[,2]<-gsub(",", "", df[,2])
#1st convert dataset to character data type
df[,2]<-as.character(df[,2])
#2nd convert dataset to numeric data type
df[,2]<-as.numeric(df[,2])
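Applied to the situation in the question, a minimal sketch (pe94$value is a hypothetical stand-in for the actual column name):
pe94$value[pe94$value == "."] <- NA                  # recode "." placeholders as NA
pe94$value <- as.numeric(as.character(pe94$value))   # convert via character, not via the integer codes
mean(pe94$value, na.rm = TRUE)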

Problems working with factors and apply functions

What I have is a data frame that contains, among others, a factor column which holds ranges of values as its levels. From what I understand, it is essentially bins for numeric values.
What I want to do is convert these to numeric values so I can use them in the downstream analysis. The idea is simple enough: (a) write a function that takes the factor level, splits it at the dash, extracts the numeric values and calculates their average, and (b) apply the function to the column:
data$Range.mean <- sapply(data$Range,
                          function(d) {
                            range <- as.matrix(strsplit(as.character(d), "-"))
                            (as.numeric(range[,1]) + as.numeric(range[,2]))/2
                          })
Which gives the following error
Error in FUN(X[[1L]], ...) :
(list) object cannot be coerced to type 'double'
I tried lapply instead, which makes no difference. While looking for answers, I found some other solutions to this problem, which essentially extract the lower and upper bounds separately into individual vectors; calculating the pairwise average is then of course trivial.
I would like to understand what I am doing/thinking wrong here though. Why is my code giving an error, and what does that error mean, really?
You are correct in that factors are in fact integers with labeled bins. So if you have a factor like this:
x <- factor(c("0-1", "0-1", "1-2", "1-2"))
it is essentially a combination of the following components
as.integer(x)
levels(x)
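which for this example give
[1] 1 1 2 2
and
[1] "0-1" "1-2"
respectively: the underlying integer codes and the labels.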
To convert the factor to the actual values specified by its labels, you can take a detour through as.character and parse that into numbers.
# Recreating a data frame with a factor like yours
data <- data.frame(Range = cut(runif(100), 0:10/10))
levels(data$Range) <- sub("\\((.*),(.*)]", "\\1-\\2", levels(data$Range))
# Calculating range means
sapply(strsplit(as.character(data$Range), "-"),
       function(x) mean(as.numeric(x)))
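As for why the original attempt fails: strsplit() returns a list, so wrapping it in as.matrix() produces a 1x1 matrix of mode list, range[,1] is still a list, and as.numeric() cannot coerce it, hence "(list) object cannot be coerced to type 'double'". A minimal fix that keeps the structure of the original sapply() would be:
data$Range.mean <- sapply(data$Range, function(d) {
  parts <- strsplit(as.character(d), "-")[[1]]  # character vector such as c("0.1", "0.2")
  mean(as.numeric(parts))                       # midpoint of the bin
})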
