Today I have finally decided to start climbing R's steep learning curve. I have spent a few hours and I managed to import my dataset and do a few other basic things, but I am having trouble with the data type: a column which contains decimals is imported as integer, and conversion to double changes the values.
In trying to get a small csv file to put here as an example I discovered that the problem only happens when the data file is too large (my original file is a 1048418 by 12 matrix, but even with "only" 5000 rows I have the same problem. When I only have 100, 1000 or even 2000 rows the column is imported correctly as double).
Here is a smaller dataset (still 500kb, but again, if the dataset is small the problem is not replicated). The code is
> ex <- read.csv("exampleshort.csv",header=TRUE)
> typeof(ex$RET)
[1] "integer"
Why is the column of returns being imported as integer when the file is large, when it is clearly of the type double?
The worst thing is that if I try to convert it to double, the values are changed
> exdouble <- as.double(ex$RET)
> typeof(exdouble)
[1] "double"
> ex$RET[1:5]
[1] 0.005587 -0.005556 -0.005587 0.005618 -0.001862
2077 Levels: -0.000413 -0.000532 -0.001082 -0.001199 -0.0012 -0.001285 -0.001337 -0.001351 -0.001357 -0.001481 -0.001486 -0.001488 ... 0.309524
> exdouble[1:5]
[1] 1305 321 322 1307 41
This is not the only column that is imported wrong, but I figured that if I find a solution for one column, I should be able to sort the other ones out. Here is some more information:
> sapply(ex,class)
   PERMNO      DATE    COMNAM     SICCD       PRC       RET      RETX    SHROUT    VWRETD    VWRETX    EWRETD    EWRETX
"integer" "integer"  "factor" "integer"  "factor"  "factor"  "factor" "integer" "numeric" "numeric" "numeric" "numeric"
They should be in this order: integer, date, string, integer, double, double, double, integer, double, double, double, double (the types are probably wrong, but hopefully you will get what I mean)
See the help for read.csv: ?read.csv. Here is the relevant section:
colClasses: character. A vector of classes to be assumed for the
columns. Recycled as necessary, or if the character vector
is named, unspecified values are taken to be ‘NA’.
Possible values are ‘NA’ (the default, when ‘type.convert’ is
used), ‘"NULL"’ (when the column is skipped), one of the
atomic vector classes (logical, integer, numeric, complex,
character, raw), or ‘"factor"’, ‘"Date"’ or ‘"POSIXct"’.
Otherwise there needs to be an ‘as’ method (from package
‘methods’) for conversion from ‘"character"’ to the specified
formal class.
Note that ‘colClasses’ is specified per column (not per
variable) and so includes the column of row names (if any).
Good luck with your quest to learn R. It's difficult, but so much fun after you get past the first few stages (which I admit do take some time).
Try this and fix the others accordingly:
ex <- read.csv("exampleshort.csv",header=TRUE,colClasses=c("integer","integer","factor","integer","numeric","factor","factor","integer","numeric","numeric","numeric","numeric"), na.strings=c("."))
As BenBolker points out, the colClasses argument is probably not needed. However, note that using the colClasses argument can make the operation faster, especially with a large dataset.
na.strings must be specified. See the following section in ?read.csv:
na.strings: a character vector of strings which are to be interpreted
as ‘NA’ values. Blank fields are also considered to be
missing values in logical, integer, numeric and complex
fields.
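A minimal sketch of why na.strings matters here, using a small made-up inline CSV rather than the asker's file (the values are hypothetical, and "." is assumed to be the missing-value marker). stringsAsFactors = TRUE is passed explicitly to reproduce the default of R versions before 4.0:

```r
# Hypothetical three-row sample; "." stands in for a missing return.
csv_text <- "PERMNO,RET\n10001,0.005587\n10001,.\n10001,-0.005556"

# Without na.strings, the "." makes the whole column non-numeric.
bad <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)
class(bad$RET)

# With na.strings = ".", the "." becomes NA and the column parses as numeric.
good <- read.csv(text = csv_text, header = TRUE, na.strings = ".")
class(good$RET)
good$RET
```

A single non-numeric entry anywhere in a column is enough to change its class, which would explain why the problem only shows up once the file is large enough to contain one.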
For reference purposes (this should not be used as the solution because the best solution is to import the data correctly in one step):
RET was not imported as an integer. It was imported as a factor. For future reference, if you want to convert a factor to a numeric, use
new_RET <- as.numeric(as.character(ex$RET))
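To illustrate why the direct conversion changed the values, here is a small sketch with a made-up factor standing in for ex$RET. Calling as.numeric() on a factor returns the internal level codes, not the labels that are printed:

```r
# Hypothetical factor holding numbers as labels, like the imported RET column.
f <- factor(c("0.005587", "-0.005556", "0.005587"))

# as.numeric() on a factor returns the internal integer codes, not the labels.
codes <- as.numeric(f)

# Converting to character first recovers the printed values.
values <- as.numeric(as.character(f))

# An equivalent form that converts each level only once.
values2 <- as.numeric(levels(f))[f]
```

Here codes contains small whole numbers (positions in levels(f)), while values and values2 contain the actual returns.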
As stated above, I'm trying to convert data in my dataframe from integer/dbl to numeric but I end up with dbl for both columns.
Original dataset
Code I'm using to convert to numeric;
data$price <- as.numeric(data$price)
data$lot_size <- as.numeric(data$lot_size)
The dataframe I end up with:
Dataset I have been working with: https://dasl.datadescription.com/datafile/housing-prices-ge19
"numeric is identical to double"
https://stat.ethz.ch/R-manual/R-devel/library/base/html/numeric.html
> typeof(as.numeric(3L))
[1] "double"
> typeof(as.integer(3L))
[1] "integer"
The stuff with types in R is a bit confusing. I would say that numeric is not really a data type at all in R. You will never get the answer numeric from the typeof function.
Both integers and doubles are considered to be numeric, and the function is.numeric will return TRUE for either.
On the other hand, numeric is more often a synonym for double.
The functions numeric and as.numeric are the same as double and as.double.
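A short sketch of the points above, using two throwaway variables (x_int and x_dbl are just illustrative names):

```r
x_int <- 3L    # the L suffix makes an integer
x_dbl <- 3     # a plain numeric literal is a double

typeof(x_int)        # "integer"
typeof(x_dbl)        # "double"

# is.numeric() is TRUE for both storage types.
is.numeric(x_int)
is.numeric(x_dbl)

# as.numeric() always produces a double, so converting a double column
# to "numeric" leaves its type unchanged.
typeof(as.numeric(x_int))   # "double"
class(x_dbl)                # "numeric"
class(x_int)                # "integer"
```

So a column reported as dbl is already numeric, and as.numeric() on it has nothing left to change.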
Edit:
With a bit more research under my belt let me rephrase it like this:
'numeric' is the virtual superclass of both integer and double.
See for example getClass("numeric") and help(UseMethod) (first paragraph in the Details section).
Hadley says it better: Advanced R
I have a data.frame from which I extracted a column called Volume. The code is as follows:
volume = aapl.us$Volume
In the console, I am told the following:
typeof(volume)
# "integer"
length(volume)
# 8364
How is this possible?
What you are seeing is not strange behavior in R. It may seem unintuitive at first to users of other programming languages, where there is a distinction between a scalar (single number) and a vector (one-dimensional array).
R does not have "scalar" data. The simplest data structure in R is a vector, and it can be a numeric, character, factor, integer, logical, or complex-valued vector. A single number in R is a "vector of length one", not a "scalar". A vector must contain data of the same type.
typeof() returns the type of a variable (see the link for further information). In your case, Volume is a vector that contains integers, and that vector has length 8364.
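As a quick illustration (the volume vector below is a made-up stand-in for aapl.us$Volume), the type describes the elements and the length counts them:

```r
x <- 42L                 # looks like a scalar, but is a vector of length one
length(x)                # 1
is.vector(x)             # TRUE

volume <- c(100L, 250L, 75L)   # hypothetical stand-in for aapl.us$Volume
typeof(volume)           # "integer": the type of the elements
length(volume)           # 3: how many elements there are
```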
Hi, I am trying to convert a column in my data frame from "double" to "factor", but it's not working.
I am trying to convert the "double" data type to "factor", but it's converting it to an integer. I have tried a couple of other things from Stack Overflow but nothing seems to work. I have provided my code below along with the console output.
Task 1.5 - Change class type from Integer to Factor
typeof(iLPdf$class) #check type
iLPdf$class <- as.factor(iLPdf$class)
typeof(iLPdf$class) #check type
[1] "double"
iLPdf$class <- as.factor(iLPdf$class)
typeof(iLPdf$class) #check type
[1] "integer"
The issue here is that typeof checks the internal representation of an object. Factors are represented as integers. To check that something is actually a factor, use is.factor instead. From the docs:
typeof determines the (R internal) type or storage mode of any object
To verify this "claim", you can check the well-known Species column of the iris dataset, which is a factor. typeof(iris$Species) will nevertheless return "integer", because R stores factors as integers.
Using is.factor is the better option. This ultimately boils down to the difference between types and classes in R.
is.factor(iris$Species)
[1] TRUE
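A small sketch of the type/class distinction, with a made-up numeric vector standing in for iLPdf$class:

```r
f <- as.factor(c(1, 2, 2, 3))   # hypothetical stand-in for iLPdf$class
typeof(f)      # "integer": internal storage of the level codes
class(f)       # "factor": what R treats the object as
is.factor(f)   # TRUE
levels(f)      # "1" "2" "3"
```

In other words, the conversion in the question did work; class() or is.factor() is the check that shows it.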
Starting to learn R, and I would appreciate some help understanding how R decides the class of different vectors. I initialize vec <- c(1:6) and when I perform class(vec) I get 'integer'. Why is it not 'numeric', because I thought integers in R looked like this: 4L
Also with vec2 <- c(1,'a',2,TRUE), why is class(vec2) 'character'? I'm guessing R picks up on the characters and automatically assigns everything else to be characters...so then it actually looks like c('1','a','2','TRUE') am I correct?
Type the following to see the help page of the colon operator.
?`:`
Here is one paragraph.
For numeric arguments, a numeric vector. This will be of type integer
if from is integer-valued and the result is representable in the R
integer type, otherwise of type "double" (aka mode "numeric").
So, in your example c(1:6), the from argument 1 is integer-valued and the result is representable in the R integer type, so the resulting sequence is of type integer.
By the way, c is not needed to create a vector in this case.
For the second question: since all the elements of a vector have to be of the same type, R automatically converts them to a common type. In this case, everything can be converted to character, but "a" cannot be converted to numeric, so the result is a character vector.
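Both points can be checked directly in the console. This sketch reuses the vec and vec2 definitions from the question:

```r
vec <- 1:6
class(vec)           # "integer"
class(c(1, 2, 3))    # "numeric": literals without L are doubles

vec2 <- c(1, "a", 2, TRUE)
class(vec2)          # "character"
vec2                 # "1" "a" "2" "TRUE"

# Going the other way fails for "a" and "TRUE": they become NA.
suppressWarnings(as.numeric(vec2))
```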
My numbers use "," as a thousands separator for 1,000 and above, and R treats them as factors. I want to convert two such variables from factor to numeric. (Both variables are actually numbers, but R treats them as factors for some reason; the data was imported from Excel.) To convert a factor variable mydata$x1 to numeric I use the following code, but it does not seem to work properly and some values change. For example, it changes 8180 to zero, and the same happened to many other values. Is there another way to do this without such issues?
mydata$x1<- as.numeric(as.character(mydata$x1))
Since it seems as though the problem is that you have saved your numeric data as characters in Excel (instead of using format to display the commas) you may want a function like this.
#' Replace Commas Function
#'
#' This function converts a character representation of a number that contains comma separators to a numeric value.
#' @keywords read data
#' @export
replaceCommas <- function(x) {
  # gsub() coerces a factor to its character labels before stripping commas
  as.numeric(gsub(",", "", x, fixed = TRUE))
}
Then
rcffull$RetBackers <- replaceCommas(rcffull$Returning.Backers)
rcffull$NewBackers <- replaceCommas(rcffull$New.Backers)
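For a self-contained check of the idea, here is a sketch with a made-up factor (the values are hypothetical, not taken from the asker's data). It shows what happens with and without removing the commas:

```r
# Hypothetical factor as it might arrive from a spreadsheet export.
x1 <- factor(c("8,180", "950", "1,234,567"))

# Direct conversion fails for values with commas (NA with a warning).
direct <- suppressWarnings(as.numeric(as.character(x1)))

# Removing the commas first gives the intended numbers.
cleaned <- as.numeric(gsub(",", "", as.character(x1), fixed = TRUE))
cleaned
```

Note that the failed conversions come out as NA rather than 0, so this sketch does not by itself reproduce the zeros reported in the question.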
The reason that G5W is asking for dput output is that he (we) are unable to figure out how something that displays as 8180 when it's a factor might not be properly converted with that code. It's not because of leading or trailing spaces (which would not appear in a print-version of a factor). Witness this test:
> as.numeric(as.character(factor(" 8180")))
[1] 8180
> as.numeric(as.character(factor(" 8180 ")))
[1] 8180
And the fact that it gets converted to 0 is a real puzzle since generally items that do not get recognized as parseable R numerics will get coerced to NA (with a warning).
> as.numeric(as.character(factor(" 0 8180 ")))
[1] NA
Warning message:
NAs introduced by coercion
We really need the dput output from the item that displays as "8180" and its neighbors.