Removing comma from numbers in R

My numbers use "," for 1,000 and above, and R treats them as a factor. I want to convert two such variables from factor to numeric. (Both variables are actually numbers, but R reads them as factors for some reason; the data is imported from Excel.) To change a factor variable mydata$x1 to numeric I use the following code, but it doesn't seem to work properly and some values change; for example, it turns 8180 into zero, and the same happened with many other values. Is there another way to do this without such issues?
mydata$x1 <- as.numeric(as.character(mydata$x1))

Since it seems as though the problem is that you have saved your numeric data as characters in Excel (instead of using a format to display the commas), you may want a function like this:
#' Replace Commas Function
#'
#' This function converts a character representation of a number that
#' contains comma separators into a numeric value.
#' @keywords read data
#' @export
replaceCommas <- function(x) {
  as.numeric(gsub(",", "", x, fixed = TRUE))
}
Then
rcffull$RetBackers <- replaceCommas(rcffull$Returning.Backers)
rcffull$NewBackers <- replaceCommas(rcffull$New.Backers)
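A quick check with made-up values (not the asker's data) shows what the function returns:
> replaceCommas(c("1,000", "8,180"))
[1] 1000 8180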

The reason that G5W is asking for dput output is that we are unable to figure out how something that displays as 8180 when it's a factor could fail to be converted properly with that code. It's not because of leading or trailing spaces (which would not appear in a printed version of a factor anyway). Witness this test:
> as.numeric(as.character(factor(" 8180")))
[1] 8180
> as.numeric(as.character(factor(" 8180 ")))
[1] 8180
And the fact that it gets converted to 0 is a real puzzle since generally items that do not get recognized as parseable R numerics will get coerced to NA (with a warning).
> as.numeric(as.character(factor(" 0 8180 ")))
[1] NA
Warning message:
NAs introduced by coercion
We really need the dput output from the item that displays as "8180" and its neighbors.
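For reference, one way to produce that dput output (a sketch, assuming mydata$x1 from the question and that the offending value prints as 8180) would be:
i <- which(as.character(mydata$x1) == "8180")[1]
dput(mydata$x1[max(1, i - 2):min(length(mydata$x1), i + 2)])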

Related

How to clean a column with mixed values like currency in char format, blanks, and strings in R

I have a dataset of startups with a column called "Amount", which holds the valuation of each startup. When I tried to plot it, the plot came out ugly, and I found that the values were in "char" format; when I inspected the column using table(copy$Amount), all the values appeared mixed together (see the example pictures in the original post).
I'm a beginner in R and have tried several small pieces of code, but nothing worked. I want to remove the "string rows", the "blank row", and the row with an empty $ symbol and no number, and convert the remaining rows into numbers.
You can use parse_number from the readr package, which:
...drops any non-numeric characters before or after the first number. The grouping mark specified by the locale is ignored inside the number.
For example:
> x <- c("1,000", "$1,000", "$$1,000", "1,000$")
> readr::parse_number(x)
[1] 1000 1000 1000 1000
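For the rows that are not numbers at all (plain strings, blanks, or a lone "$"), parse_number returns NA, so you can drop those rows afterwards. A minimal sketch, assuming the column is copy$Amount as in the question:
amt <- readr::parse_number(copy$Amount)
copy <- copy[!is.na(amt), ]      # keep only rows with a parseable Amount
copy$Amount <- amt[!is.na(amt)]  # store the numeric values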

readxl: returned values are slightly off from the values in Excel

I'm trying to read an Excel file into R.
I used read_excel function of the readxl package with parameter col_types = "text" since the columns of the Excel sheet contain mixed data types.
df <- read_excel("Test.xlsx",sheet="Sheet1",col_types = "text")
But it appears that a very slight difference in the numeric values is introduced. It's always those few values, so I think it's some hidden attribute in Excel.
I tried formatting those values as numbers in Excel, and also tried adding 0s after the number, but it didn't work.
I changed the numeric value of a cell from 2.3 to 2.4, and it was read correctly by R.
This is a consequence of floating-point imprecision, but it's a little tricky. When you enter the number 1.2 (for example) into R or Excel, it's not represented exactly as 1.2:
print(1.2,digits=22)
## [1] 1.199999999999999955591
Excel and R usually try to shield you from these details, which are inevitable if you're using fixed precision floating-point values (which most computer systems do), by limiting the printing precision to a level that will ignore those floating-point imprecisions. When you explicitly convert to character, however, R figures you don't want to lose information, so it gives you all the digits. Numbers that can be represented exactly in a binary representation, such as 2.375, don't gain all those extra digits.
However, there's a simple solution in this case:
readxl::read_excel("Test.xlsx", na="ND")
This tells R that the string "ND" should be treated as a special "not available" value, so all of your numeric values get handled properly. When you examine your data, the tiny imprecisions will still be there, but R will print the numbers the same way that Excel does.
I feel like there's probably a better way to approach this (mixed-type columns are really hard to deal with), but if you need to 'fix' the format of the numbers you can try something like this:
x <- c(format(1.2, digits = 22), "abc")
x
## [1] "1.199999999999999955591" "abc"
fix_nums <- function(x) {
  nn <- suppressWarnings(as.numeric(x))
  x[!is.na(nn)] <- format(nn[!is.na(nn)])
  return(x)
}
fix_nums(x)
## [1] "1.2" "abc"
Then if you're using tidyverse you can use my_data %>% mutate_all(fix_nums)
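If you're on a newer dplyr where mutate_all is superseded, the across() equivalent would be something like this (my_data is just a placeholder name):
library(dplyr)
my_data %>% mutate(across(everything(), fix_nums))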

How can I convert a factor variable with missing values to a numeric variable?

I loaded my dataset (original.csv) to R:
original <- read.csv("original.csv")
str(original) showed that my dataset has 16 variables (14 factors, 2 integers). 14 variables have missing values. That was OK, but 3 variables that are actually numbers were read as factors.
I searched the web and found a command: as.numeric(as.character(original$Tumor_Size))
(Tumor_Size is one of the variables that was read as a factor.)
By the way, missing values in my dataset are marked with a dot (.).
After running as.numeric(as.character(original$Tumor_Size)), the values of Tumor_Size were listed and, at the end, a warning message appeared: "NAs introduced by coercion".
I expected that after running the above command the variable would be converted to numeric, but a second str(original) showed that my guess was wrong: Tumor_Size and the other two variables were still factors. Below is a sample of my dataset (shown as an image in the original post).
How can I solve my problem?
The crucial information here is how missing values are encoded in your data file. The corresponding argument in read.csv() is called na.strings. So if dots are used:
original <- read.csv("original.csv", na.strings = ".")
I'm not 100% sure what your problem is but maybe this will help....
original <- read.csv("original.csv", header = TRUE, stringsAsFactors = FALSE)
original$Tumor_Size <- as.numeric(original$Tumor_Size)
This will introduce NAs because the dot (.) cannot be converted to a numeric value. If you want to put the dots back in place of the NAs, note that doing so turns the column back into a character vector; to do it you can use
original$Tumor_Size[is.na(original$Tumor_Size)]<-"."
Hope this helps.

How can I manage a .csv import in R where intended numeric columns include non-numeric data?

I have a dataset of GPS locations (x), and there are occasional NA values in the dataset. When applying a read.csv command, the GPS lat/long values (in UTM meters) are imported as a factor class, with each GPS value as a level.
To convert back to numbers, I have attempted to use the
print(x$lat, quotes = F)
command to remove quotes. The output appears to lack quotes, but when I store it
x$lat <- print(x$lat, quotes = F)
the column is coerced into a character string. This is a good first step, but the quotes are retained in the character string. Normally, I read that applying the following function usually works for removing non-numeric data
x$lat <- x$lat[!is.na(as.numeric(as.character(x$lat)))]
However, because the quotes are retained, none of the data are recognized as numeric, so the !is.na(...) part returns a vector filled entirely with FALSE values, and the resulting vector is all NA, with the same length as x$lat.
I also copied this string to attempt to remove the quotes from the character vector that I can make, to no avail.
x$lat <- gsub("\\'", "", x$lat)
I suppose I could go into Excel and delete the NA values, but I would like to learn how to manage the data effectively in R instead.
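A common way to handle this (a sketch, assuming the file is gps.csv and the missing fixes appear as the literal text "NA" or blank cells) is to avoid the factor conversion at import time and let read.csv() recognize the NA strings directly:
x <- read.csv("gps.csv", stringsAsFactors = FALSE, na.strings = c("NA", ""))
x$lat <- as.numeric(x$lat)   # any leftover non-numeric entries become NA (with a warning)
x <- x[!is.na(x$lat), ]      # drop rows without a usable latitude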

Data frame column naming

I am creating a simple data frame like this:
qcCtrl <- data.frame("2D6"="DNS00012345", "3A4"="DNS000013579")
My understanding is that the column names should be "2D6" and "3A4", but they are actually "X2D6" and "X3A4". Why are the X's being added and how do I make that stop?
I do not recommend working with column names starting with numbers, but if you insist, use the check.names=FALSE argument of data.frame:
qcCtrl <- data.frame("2D6"="DNS00012345", "3A4"="DNS000013579",
check.names=FALSE)
qcCtrl
2D6 3A4
1 DNS00012345 DNS000013579
One of the reasons I caution against this, is that the $ operator becomes more tricky to work with. For example, the following fails with an error:
> qcCtrl$2D6
Error: unexpected numeric constant in "qcCtrl$2"
To get round this, you have to enclose your column name in back-ticks whenever you work with it:
> qcCtrl$`2D6`
[1] DNS00012345
Levels: DNS00012345
The X is being added because R does not allow syntactic column names to start with a number; by default data.frame() passes the names through make.names(), which prepends an X. To turn this off, use the check.names = FALSE argument shown above.
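As a side note, double-bracket indexing also works without back-ticks, which can be more convenient in scripts:
> qcCtrl[["2D6"]]
[1] DNS00012345
Levels: DNS00012345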
