Data lost in conversion from Factor to Numeric in R

I'm having trouble with a data conversion. I have this data that I get from a .csv file, for instance:
comisiones[2850,28:30]
Periodo.Pago Monto.Pago.Credito Disposicion.En.Efectivo
2850 Mensual 11,503.68 102,713.20
The field Monto.Pago.Credito has a Factor data class and I need it to be numeric but the double precision kind. I need the decimals.
str(comisiones$Monto.Pago.Credito)
Factor w/ 3205 levels "1,000.00","1,000.01",..: 2476 2197 1373 1905 1348 3002 1252 95 2648 667 ...
So I use the generic data conversion function as.numeric():
comisiones$Monto.Pago.Credito <- as.numeric(comisiones$Monto.Pago.Credito)
But then the observation changes to this:
comisiones[2850,28:30]
Periodo.Pago Monto.Pago.Credito Disposicion.En.Efectivo
2850 Mensual 796 102,713.20
str(comisiones$Monto.Pago.Credito)
num [1:5021] 2476 2197 1373 1905 1348 ...
The max of comisiones$Monto.Pago.Credito should be 11,504.68 but now it is 3205.
I don't know if there is a specific data class or type for decimals in R; I've looked for one, but it didn't work.

You need to clean up your column first: remove the commas and convert it to character, then to numeric:
comisiones$Monto.Pago.Credito <- as.numeric(gsub(",", "", comisiones$Monto.Pago.Credito))
The problem shows up when you convert a factor variable directly to numeric: as.numeric() returns the factor's internal integer level codes, not the values printed in its labels.
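The pitfall is easy to reproduce on a toy factor (the values below are made up for illustration):

```r
# as.numeric() on a factor returns the level codes (1, 2, ...),
# not the numbers printed in the labels.
f <- factor(c("1,500.25", "2,000.75", "1,500.25"))
as.numeric(f)                 # level codes: 1 2 1
# Strip the commas first; gsub() coerces the factor to character itself.
as.numeric(gsub(",", "", f))  # 1500.25 2000.75 1500.25
```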

You can use extract_numeric from the tidyr package; it will handle factor inputs and remove commas, dollar signs, etc. (In recent versions of tidyr, extract_numeric is deprecated in favour of readr::parse_number, which does the same job.)
library(tidyr)
comisiones$Monto.Pago.Credito <- extract_numeric(comisiones$Monto.Pago.Credito)
If the resulting numbers are large, they may not print with decimal places when you view them, whether you used as.numeric or extract_numeric (which itself calls as.numeric), but the precision is still stored. For instance:
> x <- extract_numeric("1,200,000.3444")
> x
[1] 1200000
Verify that precision is still stored:
> format(x, nsmall = 4)
[1] "1200000.3444"
> x > 1200000.3
[1] TRUE

Related

How to create specific columns out of text in r

Here is an example I hope you can help me with. Given that the input is a line from a txt file, I want to transform it into a table (see output) and save it as a csv or tsv file.
I have tried with separate functions but could not get it right.
Input
"PR7 - Autres produits d'exploitation 6.9 371 667 1 389"
Desired output
Variable                              note  2020  2019  2018
PR7 - Autres produits d'exploitation   6.9   371   667  1389
I'm assuming that this badly delimited data-set is the only place where you can read your data.
For the purpose of this answer, I created an example file (which I called PR.txt) containing only the two following lines.
PR6 - Blabla 10 156 3920 245
PR7 - Autres produits d'exploitation 6.9 371 667 1389
First, I create a function to parse each line of this data set. I'm assuming here that the original file does not contain the names of the columns. In reality this is probably not the case, but the function could easily be adapted to take a first "header" line into account.
readBadlyDelimitedData <- function(x) {
  # Read the line as whitespace-delimited fields.
  # stringsAsFactors = FALSE keeps the text fields as character
  # (on R < 4.0 they would otherwise become factors, and the
  # typeof() test below would not find any "character" columns).
  dat <- read.table(text = x, stringsAsFactors = FALSE)
  # Get the type of each column
  whatIsIt <- sapply(dat, typeof)
  # Combine the columns that are of type "character"
  variable <- paste(dat[whatIsIt == "character"], collapse = " ")
  # Put everything in a data frame
  res <- data.frame(
    variable = variable,
    dat[, whatIsIt != "character"])
  # Change the names
  names(res)[-1] <- c("note", "Year2021", "Year2020", "Year2019")
  return(res)
}
Note that I do not give the columns holding the yearly figures purely "numeric" names, because naming rows or columns with bare numbers is not good practice in R.
Once I have this function, I can (l)apply it to each line of the data by combining it with readLines, and collapse all the lines with an rbind.
out <- do.call("rbind", lapply(readLines("tests/PR.txt"), readBadlyDelimitedData))
out
                              variable note Year2021 Year2020 Year2019
1                         PR6 - Blabla 10.0      156     3920      245
2 PR7 - Autres produits d'exploitation  6.9      371      667     1389
Finally, I save the result with write.csv:
write.csv(out, file = "correctlyDelimitedFile.csv", row.names = FALSE)
If you can get your hands on the Excel file, a simple gdata::read.xls or openxlsx::read.xlsx would be enough to read the data.
I wish I knew how to make the script simpler... maybe a tidyr magic person would have a more elegant solution?

Splitting a single variable dataframe

I have a CSV file that appears as just one variable. I want to split it into 6 columns. I need help.
str(nyt_data)
'data.frame': 3104 obs. of 1 variable:
$ Article_ID.Date.Title.Subject.Topic.Code: Factor w/ 3104 levels "16833;7-Dec-03;Ruse in Toyland: Chinese Workers' Hidden Woe;Chinese Workers Hide Woes for American Inspectors;5",..: 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 ...
nyt_data$Article_ID.Date.Title.Subject.Topic.Code
The result displayed after the above line of code is:
> head(nyt_data$Article_ID.Date.Title.Subject.Topic.Code)
[1] 41246;1-Jan-96;Nation's Smaller Jails Struggle To Cope With Surge in Inmates;Jails overwhelmed with hardened criminals;12
[2] 41257;2-Jan-96;FEDERAL IMPASSE SADDLING STATES WITH INDECISION;Federal budget impasse affect on states;20
[3] 41268;3-Jan-96;Long, Costly Prelude Does Little To Alter Plot of Presidential Race;Contenders for 1996 Presedential elections;20
Please help me with code to split these into 6 separate columns Article_ID, Date, Title, Subject, Topic, Code.
The data is split with ";" but read.csv defaults to ",". Simply do the following:
df <- read.csv(data, sep = ";")
Just read the CSV file with a custom sep, like this:
data <- read.csv(input_file, sep=';')
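If re-reading the file is not an option, the existing column can also be split in place with base R's strsplit. A minimal sketch on one of the rows shown above (the column names come from the question; the sample rows contain five semicolon-separated fields, so Topic and Code share the last one here):

```r
# Split one semicolon-delimited row into separate fields.
x <- "41246;1-Jan-96;Nation's Smaller Jails Struggle To Cope With Surge in Inmates;Jails overwhelmed with hardened criminals;12"
fields <- strsplit(x, ";", fixed = TRUE)[[1]]
# Turn the fields into a one-row data frame and name the columns.
df <- as.data.frame(t(fields), stringsAsFactors = FALSE)
names(df) <- c("Article_ID", "Date", "Title", "Subject", "Topic.Code")
df$Article_ID  # "41246"
```

For a whole column, the same strsplit call can be applied to every row and the results bound together with rbind.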

R mistreat a number as a character

I am trying to read a series of text files into R. These files are of the same form, or at least appear to be. Everything is fine except one file. When I read that file, R treated all the numbers as characters. I used as.numeric to convert them back, but the data values changed. I also tried converting the text file to csv and then reading it into R, but that did not work either. Has anyone had this problem before? How can I fix it? Thank you!
The data is from Human Mortality Database. I cannot attach the data here due to copyright issue. But everyone can register through HMD and download data (www.mortality.org). As an example, I used Australian and Belgium 1 by 1 exposure data.
My codes are as follows:
AUSe<-read.table("AUS.Exposures_1x1.txt",skip=1,header=TRUE)[,-5]
BELe<-read.table("BEL.Exposures_1x1.txt",skip=1,header=TRUE)[,-5]
Then I want to add some rows of the above data frames or matrices. It is fine for the Australian data (e.g., AUSe[1,3]+AUSe[2,3]). But an error occurred when the same command was applied to the Belgian data: Error in BELe[1, 3] + BELe[2, 3] : non-numeric argument to binary operator. But if you look at the text file, you can see those are two numbers. It is clear that R treated the numbers as characters when reading the text file, which is rather odd.
Try this instead:
BELe<-read.table("BEL.Exposures_1x1.txt",skip=1, colClasses="numeric", header=TRUE)[,-5]
Or you could surely post just a tiny bit of that file and not violate any copyright laws, at least in my jurisdiction (which I think is the same one as the Human Mortality Database's).
Belgium, Exposure to risk (period 1x1) Last modified: 04-Feb-2011, MPv5 (May07)
Year Age Female Male Total
1841 0 61006.15 62948.23 123954.38
1841 1 55072.53 56064.21 111136.73
1841 2 51480.76 52521.70 104002.46
1841 3 48750.57 49506.71 98257.28
.... . ....
So I might have suggested the even more accurate colClasses:
BELe <- read.table("BEL.Exposures_1x1.txt",
                   skip = 2,  # really two lines to skip, I think
                   colClasses = c(rep("integer", 2), rep("numeric", 3)),
                   header = TRUE)[, -5]
I suspect the problem occurs because of lines like these:
1842 110+ 0.00 0.00 0.00
So you will need to decide how much interest you have in preserving the "110+" values. With my method they will be coerced to NAs. (Well, I thought they would be, but like you I got an error.) So this multi-step process is needed:
BELe <- read.table("Exposures_1x1.txt", skip = 2, header = TRUE)
BELe[ , 2:5] <- lapply(BELe[ , 2:5], as.character)
str(BELe)
#-------------
'data.frame': 18759 obs. of 5 variables:
$ Year : int 1841 1841 1841 1841 1841 1841 1841 1841 1841 1841 ...
$ Age : chr "0" "1" "2" "3" ...
$ Female: chr "61006.15" "55072.53" "51480.76" "48750.57" ...
$ Male : chr "62948.23" "56064.21" "52521.70" "49506.71" ...
$ Total : chr "123954.38" "111136.73" "104002.46" "98257.28" ...
#-------------
BELe[ , 2:5] <- lapply(BELe[ , 2:5], as.numeric)
#----------
Warning messages:
1: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
2: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
3: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
4: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
str(BELe)
#-----------
'data.frame': 18759 obs. of 5 variables:
$ Year : int 1841 1841 1841 1841 1841 1841 1841 1841 1841 1841 ...
$ Age : num 0 1 2 3 4 5 6 7 8 9 ...
$ Female: num 61006 55073 51481 48751 47014 ...
$ Male : num 62948 56064 52522 49507 47862 ...
$ Total : num 123954 111137 104002 98257 94876 ...
# and just to show that they are not really integers:
BELe$Total[1:5]
#[1] 123954.38 111136.73 104002.46 98257.28 94875.89
The way I typically read those files is:
BELexp <- read.table("BEL.Exposures_1x1.txt", skip = 2, header = TRUE, na.strings = ".", as.is = TRUE)
Note that Belgium lost 3 years of data in WWI that may never be recovered, and hence those three years are all NAs, which in these files are marked with ".", a character string; hence the argument na.strings = ".". Specifying that argument takes care of all columns except Age, which is character (intentionally), due to the "110+". The reason the HMD does this is so that users have to be intentional about their treatment of the open age group. You can convert the Age column to integer using:
BELexp$Age <- as.integer(gsub("[+]", "", BELexp$Age))
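That substitution can be checked on a small vector:

```r
# gsub("[+]", ...) strips the plus sign so the open age group "110+"
# survives conversion to integer instead of becoming NA.
age <- c("0", "1", "110+")
as.integer(gsub("[+]", "", age))  # 0 1 110
```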
Since such issues have long been the bane of R HMD users, the HMD has recently posted some R functions in a small but growing package on GitHub called (for now) DemogBerkeley. The function readHMD() removes all of the above headaches:
library(devtools)
install_github("DemogBerkeley", subdir = "DemogBerkeley", username = "UCBdemography")
BELexp <- readHMD("BEL.Exposures_1x1.txt")
Note that a new indicator column, called OpenInterval is added, while Age is converted to integer as above.
Can you try read.csv(..., stringsAsFactors = FALSE)?

Read a CSV file in R, and select each element

Sorry if the title is confusing. I can import a CSV file into R, but when I try to select one element by providing the row and column index, I get more than one element. All I want is to use this imported csv as a data.frame from which I can select any column, row, or single cell. Can anyone give me some suggestions?
Here is the data:
SKU On Off Duration(hr) Sales
C010100100 2/13/2012 4/19/2012 17:00 1601 238
C010930200 5/3/2012 7/29/2012 0:00 2088 3
C011361100 2/13/2012 5/25/2012 22:29 2460 110
C012000204 8/13/2012 11/12/2012 11:00 2195 245
C012000205 8/13/2012 11/12/2012 0:00 2184 331
CODE:
Dat = read.table("Dat.csv",header=1,sep=',')
Dat[1,][1] #This is close to what I need but is not exactly the same
SKU
1 C010100100
Dat[1,1] # Ideally, I want to have results only with C010100100
[1] C010100100
3861 Levels: B013591100 B024481100 B028710300 B038110800 B038140800 B038170900 B038260200 B038300700 B040580700 B040590200 B040600400 B040970200 ... YB11624Q1100
Thanks!
You can convert to character to get the value as a string, and no longer as a factor:
as.character(Dat[1,1])
You get just one element, but the factor still carries all the levels of the original column, which is why they are all printed.
Alternatively, pass the option stringsAsFactors=FALSE to read.table when you read the file, to prevent creation of factors for character values:
Dat = read.table("Dat.csv",header=1,sep=',', stringsAsFactors=FALSE )
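The difference is easy to see on a toy factor (two of the SKU values from the question, re-used here for illustration):

```r
# Indexing a factor keeps the full level set attached, which is why
# thousands of levels print along with the value; as.character()
# returns just the single string.
f <- factor(c("C010100100", "B013591100"))
f[1]               # value plus the full level set
as.character(f[1]) # "C010100100"
```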

R from character to numeric

I have this csv file (fm.file):
Date,FM1,FM2
28/02/2011,14.571611,11.469457
01/03/2011,14.572203,11.457512
02/03/2011,14.574798,11.487183
03/03/2011,14.575558,11.487802
04/03/2011,14.576863,11.490246
And so on.
I run this commands:
fm.data <- as.xts(read.zoo(file=fm.file,format='%d/%m/%Y',tz='',header=TRUE,sep=','))
is.character(fm.data)
And I get the following:
[1] TRUE
How do I get fm.data to be numeric without losing its date index? I want to perform some statistical operations that require the data to be numeric.
I was puzzled by two things: it didn't seem that read.zoo should give you a character matrix, and it didn't seem that changing its class would affect the index values, since the data type should be separate from the indices. So then I tried to replicate the problem and got a different result:
txt <- "Date,FM1,FM2
28/02/2011,14.571611,11.469457
01/03/2011,14.572203,11.457512
02/03/2011,14.574798,11.487183
03/03/2011,14.575558,11.487802
04/03/2011,14.576863,11.490246"
require(xts)
fm.data <- as.xts(read.zoo(file=textConnection(txt),format='%d/%m/%Y',tz='',header=TRUE,sep=','))
is.character(fm.data)
#[1] FALSE
str(fm.data)
#-------------
An ‘xts’ object from 2011-02-28 to 2011-03-04 containing:
  Data: num [1:5, 1:2] 14.6 14.6 14.6 14.6 14.6 ...
  - attr(*, "dimnames")=List of 2
   ..$ : NULL
   ..$ : chr [1:2] "FM1" "FM2"
  Indexed by objects of class: [POSIXct,POSIXt] TZ:
  xts Attributes:
List of 2
 $ tclass: chr [1:2] "POSIXct" "POSIXt"
 $ tzone : chr ""
zoo- and xts-objects have their data in a matrix accessed with coredata and their indices are a separate set of attributes.
I think the problem is that you have some dirty data in your csv file. In other words, the FM1 or FM2 column contains a character somewhere that stops it being interpreted as a numeric column. When that happens, xts (which is a matrix underneath) will force the whole thing to character type.
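This is ordinary coercion in R: a vector or matrix (and hence the core data of an xts object) holds a single type, so one stray non-numeric entry converts every cell. A minimal sketch with made-up values:

```r
# One non-numeric value forces the whole vector to character.
x <- c(14.571611, 11.469457, "oops")
typeof(x)  # "character"
# Converting back turns the bad entry into NA (with a warning).
suppressWarnings(as.numeric(x))  # 14.571611 11.469457 NA
```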
Here is one way to use R to find suspicious data:
s <- scan(fm.file,what="character")
# s is now a vector of character strings, one entry per line
# (these lines contain no internal whitespace)
s <- s[-1] #Chop off the header row
all(grepl('^[-0-9,.]*$',s,perl=T)) #True means all your data is clean
s[ !grepl('^[-0-9,.]*$',s,perl=T) ]
which( !grepl('^[-0-9,.]*$',s,perl=T) ) + 1
The second-to-last line prints out all the csv rows that contain characters you did not expect. The last line tells you which lines of the file they are on (+1 because we removed the header row).
Why not simply use read.csv and then convert the first column to a Date object using as.Date?
> x <- read.csv(fm.file, header=T)
> x$Date <- as.Date(x$Date, format="%d/%m/%Y")
> x
Date FM1 FM2
1 2011-02-28 14.57161 11.46946
2 2011-03-01 14.57220 11.45751
3 2011-03-02 14.57480 11.48718
4 2011-03-03 14.57556 11.48780
5 2011-03-04 14.57686 11.49025
