I'm working in R and have a dataframe, dd_2006, with numeric vectors. When I first imported the data, I needed to remove $'s, decimal points, and some blank spaces from 3 of my variables: SumOfCost, SumOfCases, and SumOfUnits. To do that, I used str_replace_all. However, once I used str_replace_all, the vectors were converted to characters. So I used as.numeric(var) to convert the vectors to numeric, but NAs were introduced, even though when I ran the code below BEFORE I ran the as.numeric code, there were no NAs in the vectors.
sum(is.na(dd_2006$SumOfCost))
[1] 0
sum(is.na(dd_2006$SumOfCases))
[1] 0
sum(is.na(dd_2006$SumOfUnits))
[1] 0
Here is my code from after the import, beginning with removing the $ from the vector. In the str(dd_2006) output, I deleted some of the variables for the sake of space, so the column #s in the str_replace_all code below don't match the output I've posted here (but they do in the original code):
library("stringr")
dd_2006$SumOfCost <- str_sub(dd_2006$SumOfCost, 2, ) #2=the first # after the $
#Removes decimal pt, zero's after, and commas
dd_2006[ ,9] <- str_replace_all(dd_2006[ ,9], ".00", "")
dd_2006[,9] <- str_replace_all(dd_2006[,9], ",", "")
dd_2006[ ,10] <- str_replace_all(dd_2006[ ,10], ".00", "")
dd_2006[ ,10] <- str_replace_all(dd_2006[,10], ",", "")
dd_2006[ ,11] <- str_replace_all(dd_2006[ ,11], ".00", "")
dd_2006[,11] <- str_replace_all(dd_2006[,11], ",", "")
str(dd_2006)
'data.frame': 12604 obs. of 14 variables:
$ CMHSP : Factor w/ 46 levels "Allegan","AuSable Valley",..: 1 1 1
$ FY : Factor w/ 1 level "2006": 1 1 1 1 1 1 1 1 1 1 ...
$ Population : Factor w/ 1 level "DD": 1 1 1 1 1 1 1 1 1 1 ...
$ SumOfCases : chr "0" "1" "0" "0" ...
$ SumOfUnits : chr "0" "365" "0" "0" ...
$ SumOfCost : chr "0" "96416" "0" "0" ...
I found a response to a similar question to mine here, using the following code:
# create dummy data.frame
d <- data.frame(char = letters[1:5],
fake_char = as.character(1:5),
fac = factor(1:5),
char_fac = factor(letters[1:5]),
num = 1:5, stringsAsFactors = FALSE)
Let us have a glance at data.frame
> d
char fake_char fac char_fac num
1 a 1 1 a 1
2 b 2 2 b 2
3 c 3 3 c 3
4 d 4 4 d 4
5 e 5 5 e 5
and let us run:
> sapply(d, mode)
char fake_char fac char_fac num
"character" "character" "numeric" "numeric" "numeric"
> sapply(d, class)
char fake_char fac char_fac num
"character" "character" "factor" "factor" "integer"
Now you probably ask yourself "Where's an anomaly?" Well, I've bumped into quite peculiar things in R, and this is not the most confounding thing, but it can confuse you, especially if you read this before rolling into bed.
Here goes: first two columns are character. I've deliberately called 2nd one fake_char. Spot the similarity of this character variable with one that Dirk created in his reply. It's actually a numerical vector converted to character. 3rd and 4th column are factor, and the last one is "purely" numeric.
If you utilize transform function, you can convert the fake_char into numeric, but not the char variable itself.
> transform(d, char = as.numeric(char))
char fake_char fac char_fac num
1 NA 1 1 a 1
2 NA 2 2 b 2
3 NA 3 3 c 3
4 NA 4 4 d 4
5 NA 5 5 e 5
Warning message:
In eval(expr, envir, enclos) : NAs introduced by coercion
but if you do same thing on fake_char and char_fac, you'll be lucky, and get away with no NA's:
transform(d, fake_char = as.numeric(fake_char),
char_fac = as.numeric(char_fac))
char fake_char fac char_fac num
1 a 1 1 1 1
2 b 2 2 2 2
3 c 3 3 3 3
4 d 4 4 4 4
5 e 5 5 5 5
So I tried the above code in my script, but still came up with NAs (without a warning message about coercion).
#changing sumofcases, cost, and units to numeric
dd_2006_1 <- transform(dd_2006, SumOfCases = as.numeric(SumOfCases), SumOfUnits = as.numeric(SumOfUnits), SumOfCost = as.numeric(SumOfCost))
> sum(is.na(dd_2006_1$SumOfCost))
[1] 12
> sum(is.na(dd_2006_1$SumOfCases))
[1] 7
> sum(is.na(dd_2006_1$SumOfUnits))
[1] 11
I've also used table(dd_2006$SumOfCases) etc. to look at the observations to see if there are any characters that I missed in the observations, but there weren't any. Any thoughts on why the NAs are popping up, and how to get rid of them?
As Anando pointed out, the problem is somewhere in your data, and we can't really help you much without a reproducible example. That said, here's a code snippet to help you pin down the records in your data that are causing you problems:
test = as.character(c(1,2,3,4,'M'))
v = as.numeric(test) # NAs introduced by coercion
ix.na = is.na(v)
which(ix.na) # row index of our problem = 5
test[ix.na] # shows the problematic record, "M"
Instead of guessing as to why NAs are being introduced, pull out the records that are causing the problem and address them directly/individually until the NAs go away.
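UPDATE: Looks like the problem is in your call to str_replace_all. The pattern ".00" is treated as a regular expression, and an unescaped "." matches any character, not just a literal decimal point. I don't know the stringr library, but I think you can accomplish what you intended with gsub and an escaped dot, like this: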
v2 = c("1.00","2.00","3.00")
gsub("\\.00", "", v2)
[1] "1" "2" "3"
I'm not entirely sure what this accomplishes though:
sum(as.numeric(v2)!=as.numeric(gsub("\\.00", "", v2))) # Illustrate that vectors are equivalent.
[1] 0
Unless this achieves some specific purpose for you, I'd suggest dropping this step from your preprocessing entirely, as it doesn't appear necessary and seems to be giving you problems.
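To see how the unescaped pattern can produce NAs with no coercion warning, here is a minimal sketch in base R. The values are made up for illustration and are not taken from the question's data; the anchored pattern "\\.00$" is one possible fix, assuming the goal is to strip only a trailing ".00".

```r
# Made-up values, similar in shape to the cost/units columns in the question
x <- c("100", "1000", "96416.00", "365")

# Unescaped: "." matches any character, so "100" is itself a full match
bad <- gsub(".00", "", x)
bad               # "" "0" "96416" "365"
as.numeric(bad)   # the empty string becomes NA, and no warning is raised

# Escaped and anchored: only a literal ".00" at the end of the string is removed
good <- gsub("\\.00$", "", x)
good              # "100" "1000" "96416" "365"
as.numeric(good)
```

Note that "1000" silently turns into "0" with the unescaped pattern, so counting NAs alone would not catch every damaged value.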
Converting the character vector to a factor first (using as.factor) and then converting that factor to numeric (using as.numeric) will not create NAs, but be careful: as.numeric on a factor returns the integer level codes, not the original values, so the result is generally not the numbers you started with. If a column is already a factor, convert it with as.numeric(as.character(x)) instead, and fix the offending strings before converting.
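A quick check of what as.numeric() returns for a factor, using made-up values (not the question's data):

```r
f <- factor(c("10", "20", "10", "365"))

as.numeric(f)                # 1 2 1 3: the integer level codes
as.numeric(as.character(f))  # 10 20 10 365: the original values
```

No NAs appear in either case, but only the second call recovers the numbers.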
A simple solution is to let retype guess new data types for each column
library(dplyr)
library(hablar)
dd_2006 %>% retype()
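If you would rather not add a package, base R's type.convert() does a similar kind of guessing. This is a sketch on a made-up data frame (the column names are only for illustration), and I believe the data frame method needs R 3.5.0 or later:

```r
df <- data.frame(id    = c("a", "b", "c"),
                 cases = c("0", "1", "12"),
                 cost  = c("0", "96416.5", "10"),
                 stringsAsFactors = FALSE)

# as.is = TRUE keeps non-numeric columns as character instead of factor
converted <- type.convert(df, as.is = TRUE)
str(converted)
```

Like any guessing approach, a column with even one non-numeric string stays character, so it is still worth inspecting the columns that did not convert.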
I want to read in R a range from an ods file, of which first column must be character, and the other 40 columns must be doubles. How do I specify the col_types? col_types = paste0("c", paste(rep("d", times = 40), collapse = "")) does not work, I get the error Unknown col_types. Can either be a class col_spec, NULL or NA. I cannot find any examples. Anyone a hint for a solution? Thanks!
col_types should be of class col_spec, which means you can define the column types with readr::cols().
Let's say you have this table in table.ods:
A B C
1 characters numbers againnumbers
2 a 1 5
3 a 2 6
4 a 3 7
5 a 4 8
Then you can specify the columns with readr::cols(): the first column (named characters here) as characters by readr::col_character() and others by default as numbers with readr::col_double():
library(readODS)
library(readr)
df <- read_ods("table.ods",
col_types = cols(
characters = col_character(),
.default = col_double())
)
str(df)
#> 'data.frame': 4 obs. of 3 variables:
#> $ characters : chr "a" "a" "a" "a"
#> $ numbers : num 1 2 3 4
#> $ againnumbers: num 5 6 7 8
(If you want to use the compact string form, e.g. "cdd", you need to convert that string to a col_spec with readr::as.col_spec(); however, the result is not named and doesn't seem to work correctly with read_ods().)
Recently I have come across a problem where my data has been converted to factors.
This is a large nuisance, as it's not (always) easily picked up on.
I am aware that I can convert them back with solutions such as as.numeric(paste(x)) or as.character(paste(x)), but that seems really unnecessary.
Example code:
nums <- c(1,2,3,4,5)
chars <- c("A","B","C,","D","E")
str(nums)
#> num [1:5] 1 2 3 4 5
str(chars)
#> chr [1:5] "A" "B" "C," "D" "E"
df <- as.data.frame(cbind(a = nums, b = chars))
str(df)
#> 'data.frame': 5 obs. of 2 variables:
#> $ a: Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5
#> $ b: Factor w/ 5 levels "A","B","C,","D",..: 1 2 3 4 5
1) Don't cbind, as it converts the data to a matrix, and a matrix can hold data of only one type, so it converts the numbers to characters.
2) Use data.frame, because as.data.frame(a = nums, b = chars) returns an error.
3) Use stringsAsFactors = FALSE, because in data.frame the default value of stringsAsFactors is TRUE, which converts characters to factors. The numbers also change to factors because in 1) they have been changed to characters.
df <- data.frame(a = nums, b = chars, stringsAsFactors = FALSE)
str(df)
#'data.frame': 5 obs. of 2 variables:
# $ a: num 1 2 3 4 5
# $ b: chr "A" "B" "C," "D" ...
EDIT: As of R 4.0.0, the default value of stringsAsFactors has changed to FALSE.
This should no longer happen if you have updated R: data frames don't automatically turn chr to fct. In a way, data frames are now more similar to tibbles.
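A short sketch of the coercion described above, using a shortened version of the question's vectors. stringsAsFactors is set explicitly so the result does not depend on the R version:

```r
nums  <- c(1, 2, 3)
chars <- c("A", "B", "C")

# cbind builds a matrix, which holds a single type, so the numbers become strings
m <- cbind(a = nums, b = chars)
typeof(m)           # "character"

# data.frame keeps each column's own type
df <- data.frame(a = nums, b = chars, stringsAsFactors = FALSE)
sapply(df, class)   # a is "numeric", b is "character"
```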
After reading in data and cleaning it, I ended up with factor columns that have levels that should no longer be there.
For example, d below has one blank cell in excel. When it’s read in, the factor columns have a level "", which shouldn’t be part of the data.
d <- read.csv(header = TRUE, text='
x,y,value
a,one,1
,,5
b,two,4
c,three,10
')
d
#> x y value
#> 1 a one 1
#> 2 5
#> 3 b two 4
#> 4 c three 10
str(d)
#> 'data.frame': 4 obs. of 3 variables:
#> $ x : Factor w/ 4 levels "","a","b","c": 2 1 3 4
#> $ y : Factor w/ 4 levels "","one","three",..: 2 1 4 3
#> $ value: int 1 5 4 10
How do I remove the "" level from the factors (there are about 20 factor columns in the data frame) without deleting every row that has even one empty cell? Deleting those rows would cut my sample from 299000 observations to just 7 (I have tried it).
One way would be to replace the '' with NA and use droplevels to remove the unused levels
d[1:2] <- lapply(d[1:2], function(x) droplevels(replace(x, x=="", NA)))
levels(d$x)
#[1] "a" "b" "c"
levels(d$y)
#[1] "one" "three" "two"
Another option, applied while reading the dataset (as we assume the OP wanted factor columns), would be
d <- read.csv("yourfile.csv", na.strings = "")
This should make sure that the '' will be read as NA.
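As a self-contained check of the na.strings approach, here is a sketch that reads the same example from inline text rather than a file. stringsAsFactors = TRUE is passed explicitly so it also yields factors on R 4.0.0 and later:

```r
d <- read.csv(text = "x,y,value
a,one,1
,,5
b,two,4
c,three,10",
              na.strings = "", stringsAsFactors = TRUE)

levels(d$x)   # "a" "b" "c": no "" level
nrow(d)       # 4: the row with the empty cells is kept, with NA in x and y
```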
Update
Suppose, there are numeric columns in between and we need to do the replace/droplevels only for the factor columns
d[] <- lapply(d, function(x) if(is.factor(x)) droplevels(replace(x, x== "", NA))
else x)
I want to convert every misplaced trailing minus sign in my data to a leading minus sign and then convert the data to numeric.
The data was read from a ;-separated file that puts the sign on the wrong side of negative numbers. I need to clean it and convert it to numeric class, so that 4-, 1-, 8- become -4, -1, -8 and are treated as negative numbers.
My data frame is like:
data.frame(a=c("1","1-","2","4-"),b= c("2","3-","4","5"),c=c("3-","6-","3","8"),d=c("5","9","9-","6"))
This requires a regex pattern with a capture group for the number (a character class of the digits 0-9 or a decimal point, repeated one or more times) followed by a minus sign, and replacing the match with a minus sign placed before the captured number before passing to as.numeric. This has no safety tests. If you have not yet deleted your earlier question that had only a picture of the data, then you should go back and delete it now.
df1 <- data.frame(a=c("1","1-","2","4-"),
b= c("2","3-","4","5"),
c=c("3-","6-","3","8"),
d=c("5","9","9-","6"))
lapply(df1, function(col) as.numeric( sub("([0-9.]+)[-]", "-\\1", col) ) )
#---- result looks OK ---
$a
[1] 1 -1 2 -4
$b
[1] 2 -3 4 5
$c
[1] -3 -6 3 8
$d
[1] 5 9 -9 6
# --- now replace the original df1 structure with those values ---
df1[] <- lapply(df1, function(col) as.numeric( sub("([0-9.]+)[-]", "-\\1", col) ) )
#---- check for success----
> str(df1)
'data.frame': 4 obs. of 4 variables:
$ a: num 1 -1 2 -4
$ b: num 2 -3 4 5
$ c: num -3 -6 3 8
$ d: num 5 9 -9 6
Switch the two capture groups (the number and the trailing minus sign), where df is your data.frame, then cast to numeric:
sapply(df,function(x){ as.numeric(sub("([0-9.]*)(-)$","\\2\\1",x)) })
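Since neither snippet above checks for values that fail to convert, here is one possible way to add a basic safety test. fix_trailing_minus is a hypothetical helper name, not a function from any package, and the data is a shortened version of the question's example:

```r
# Hypothetical helper: move a trailing minus sign to the front, then convert.
fix_trailing_minus <- function(x) {
  x <- trimws(as.character(x))
  moved <- sub("^([0-9.]+)-$", "-\\1", x)
  out <- suppressWarnings(as.numeric(moved))
  # Flag non-empty inputs that still failed to convert
  bad <- is.na(out) & !is.na(x) & x != ""
  if (any(bad)) {
    warning("could not convert: ", paste(unique(x[bad]), collapse = ", "))
  }
  out
}

df1 <- data.frame(a = c("1", "1-", "2", "4-"),
                  b = c("2", "3-", "4", "5"),
                  stringsAsFactors = FALSE)

# df1[] <- lapply(...) keeps the data.frame structure; sapply would return a matrix
df1[] <- lapply(df1, fix_trailing_minus)
str(df1)
```

The warning lists the offending strings, so they can be inspected instead of silently becoming NA.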
I am curious about the behaviour of transform. Two ways I might try creating a new column as character not as factor:
x <- data.frame(Letters = LETTERS[1:3], Numbers = 1:3)
y <- transform(x, Alphanumeric = as.character(paste(Letters, Numbers)))
x$Alphanumeric = with(x, as.character(paste(Letters, Numbers)))
x
y
str(x$Alphanumeric)
str(y$Alphanumeric)
The results "look" the same:
> x
Letters Numbers Alphanumeric
1 A 1 A 1
2 B 2 B 2
3 C 3 C 3
> y
Letters Numbers Alphanumeric
1 A 1 A 1
2 B 2 B 2
3 C 3 C 3
But look inside and only one has worked:
> str(x$Alphanumeric) # did convert to character
chr [1:3] "A 1" "B 2" "C 3"
> str(y$Alphanumeric) # but transform didn't
Factor w/ 3 levels "A 1","B 2","C 3": 1 2 3
I didn't find ?transform very useful to explain this behaviour - presumably Alphanumeric was coerced back to being a factor - or find a way to stop it (something like stringsAsFactors = FALSE for data.frame). What is the safest way to do this? Are there similar pitfalls to beware of, for instance with the apply or plyr functions?
This is not so much an issue with transform as it is with data.frames, where stringsAsFactors was set, by default, to TRUE (in R versions before 4.0.0). Add an argument that it should be FALSE and you'll be on your way:
y <- transform(x, Alphanumeric = paste(Letters, Numbers),
stringsAsFactors = FALSE)
str(y)
# 'data.frame': 3 obs. of 3 variables:
# $ Letters : Factor w/ 3 levels "A","B","C": 1 2 3
# $ Numbers : int 1 2 3
# $ Alphanumeric: chr "A 1" "B 2" "C 3"
I generally use within instead of transform, and it seems to not have this problem:
y <- within(x, {
Alphanumeric = paste(Letters, Numbers)
})
str(y)
# 'data.frame': 3 obs. of 3 variables:
# $ Letters : Factor w/ 3 levels "A","B","C": 1 2 3
# $ Numbers : int 1 2 3
# $ Alphanumeric: chr "A 1" "B 2" "C 3"
This is because it takes an approach similar to your with approach: Create a character vector and add it (via [<-) into the existing data.frame.
You can view the source of each of these by typing transform.data.frame and within.data.frame at the prompt.
As for other pitfalls, that's much too broad of a question. One thing that comes to mind right away is that apply would create a matrix from a data.frame, so all the columns would be coerced to a single type.
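A small sketch of that apply pitfall, on a made-up two-column data frame (stringsAsFactors is set explicitly so the output is the same across R versions):

```r
x <- data.frame(Letters = c("A", "B"), Numbers = 1:2,
                stringsAsFactors = FALSE)

# apply() first converts the data.frame to a matrix, so every column is character
apply(x, 2, class)

# sapply() walks the columns as they are, so each keeps its own class
sapply(x, class)
```

For column-wise work on a data.frame, lapply or sapply is usually the safer choice for this reason.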