R read.csv as numeric or string - r

I have a somewhat general question about how read.csv(...) works.
When I read CSV datasets (created and exported from Excel), some columns are read into R as numeric (correctly), while others end up as either character (stringsAsFactors=FALSE) or factor (stringsAsFactors=TRUE). How does R determine whether a column is string or numeric during import? There is no discernible difference between the columns: for example, of two columns that both hold t scores, one was read in as character and the other as numeric. Can someone explain this to me? (Does my question even make sense?)
Thanks
Andrea

Related

Converting data between classes in R: large CSV file, prices recorded like "$10.00"

So full disclosure, I am new to R and programming in general. Because of that, it is very hard for me to search when I have problems because I am not even sure what keywords to use. I am learning, and all I am hoping for y'all to do is point me in the right direction.
I have a very large CSV file that I imported into R: around 2 million observations (don't worry, I am not planning on using all 2 million). The only problem is that the people recording the data formatted the prices as "$10.00". Because of this, R treats the column as a factor, with each distinct price as a separate level because of the dollar sign. I would like to convert this column to a numeric variable.
I am sure there is some way to go about reformatting this in R; the only problem is that I am not sure which functions I need. Sorry for the very basic question, I have just hit a wall and figured I would reach out.
Any and all help is much appreciated!
Thank you!
We could also use sub to remove the leading run of non-digit characters:
as.numeric(sub('\\D+', '', x))
#[1] 10.00 11.24 15.22
where the data are
x<-c("$10.00","$11.24","$15.22")
Suppose that your data looks like this:
x<-c("$10.00","$11.24","$15.22")
You can use the substring function to trim the initial dollar sign (which still leaves you with strings) and then use as.numeric to turn the result into a numeric vector.
newx<-as.numeric(substring(x,2))
will produce a vector named newx with value
c(10.00,11.24,15.22)
We tell the substring to start at the 2nd character (strings in R are 1-indexed), and then cast to numeric.
In your data frame (suppose it is called df), you can replace the column like
df$MoneyColumn <- as.numeric(substring(df$MoneyColumn,2))
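If some prices also carry thousands separators (for example "$1,250.50"), trimming the first character is not enough. Here is a minimal sketch, assuming a hypothetical data frame df with a MoneyColumn column as above, that strips both dollar signs and commas with gsub before converting:

```r
# Hypothetical example data; df and MoneyColumn are illustrative names.
df <- data.frame(MoneyColumn = c("$10.00", "$1,250.50", "$15.22"),
                 stringsAsFactors = FALSE)

# Remove every dollar sign and comma, then convert the remaining text.
# as.character() keeps this safe even if the column was read in as a factor.
df$MoneyColumn <- as.numeric(gsub("[$,]", "", as.character(df$MoneyColumn)))

str(df$MoneyColumn)
```

Any entry that still is not a plain number after the cleanup comes back as NA with a warning, which is a useful hint that the column holds something other than prices.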

Need to convert factor variable to numeric, but is little more complicated [duplicate]

This question already has answers here:
How to read data when some numbers contain commas as thousand separator?
(11 answers)
Closed 7 years ago.
Today I downloaded a dataset in CSV format from the Eurostat website. I loaded it into RStudio with read.csv and, by subsetting, got the data I need. I now have 12 observations of around 9 variables. One of the variables is the value I am interested in, but it is coded as a factor (with 754 levels).
This would be easy to fix with as.numeric, but the numbers are formatted like "48,478", so R does not see a single number (just my guess). If I use as.numeric, I don't get 48478 but some different number, maybe a mean or something else, but definitely not 48478. After a few minutes I realized that the problem is probably the "," and started looking for a way to remove it.
One solution I found is to use the edit command and erase the commas manually, but I am planning to take more subsets from the original dataset, and I hope I won't have to run edit and erase the offending symbol by hand every time I make a new one.
You can read the data in and then remove the "," before converting the strings to numeric.
Read the dataset with stringsAsFactors=FALSE:
raw <- read.csv("a.csv",stringsAsFactors=FALSE)
Convert the strings to numeric (the same logic as removing the "," in an editor):
raw$number <- as.numeric(gsub(",","",raw$numberAsString)) # convert numberAsString to numeric after removing every ","
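To try the two steps without a file on disk, here is a small self-contained sketch. The CSV contents and the country column are made up for illustration; only the numberAsString name follows the answer above.

```r
# Hypothetical file contents; values with thousands separators are quoted
# so that the comma inside them is not taken as a field separator.
csv_text <- 'country,numberAsString
AT,"48,478"
BE,"1,203"
CZ,"977"'

raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Drop the thousands separators, then convert to numeric.
raw$number <- as.numeric(gsub(",", "", raw$numberAsString))

raw$number
```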

Why does R mix up numerical with categorical variables?

I am confused. I read a .csv file into R and want to fit a multivariate linear regression model.
However, R declares all my obviously numeric variables to be factors and my categorical variables to be integers. Therefore, I cannot fit the model.
Does anyone know how to resolve this?
I know this is probably so basic. But I really need to know this. Elsewhere, I found only posts concerning how to declare factors. But this does not apply here.
Any suggestions very much appreciated!
The easiest way, imo, to handle this is to just tell R what type of data your columns contain when you read them into the workspace. For example, if you have a csv file where the first column should be characters, columns 2-21 should be numeric, and column 22 should be a factor, here's how I would read that csv file into the workspace:
Data <- read.csv("MyData.csv", colClasses=c("character", rep("numeric", 20), "factor"))
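As a smaller illustration of the same idea, the sketch below reads a hypothetical three-column file from an inline string (standing in for MyData.csv) and checks the resulting column classes:

```r
# Hypothetical three-column file; the column names are illustrative.
csv_text <- "id,score,group
001,1.5,a
002,2.25,b
003,3,a"

# One class per column, in file order.
Data <- read.csv(text = csv_text,
                 colClasses = c("character", "numeric", "factor"))

sapply(Data, class)
Data$id   # leading zeros survive because the column is read as character
```

With colClasses set, a value that does not fit the declared type typically makes the read fail with an error instead of silently changing the column type, so bad entries surface early.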
Sometimes (with certain versions of R, as Andrew points out) float entries in a CSV are long enough that R reads them as strings rather than floats. In that case, you can do the following:
data <- read.csv("filename.csv")
data$some.column <- as.numeric(as.character(data$some.column))
Or you could pass stringsAsFactors=FALSE to the read.csv call and apply as.numeric on the next line. That might be a bad idea, though, if you have a lot of data.
It's a little harder to say what's going on with the categorical variables. You might want to try treating those as strings and see how that works. Because factors are stored internally as integer codes, R can end up treating them as numeric, so this is a good first sanity check. If that doesn't work, you can also see whether the regression functions in question let you declare how the variables should be treated.
It is hard to tell without a sample of your data file and the commands that you have been using to try and work with the data, but here are some general problems that can lead to what you describe (though there could be other possibilities as well).
The read.csv and read.table functions (read.csv calls read.table) try to guess the type of each column when they are not told what it should be (via the colClasses argument). Every column is first read as text. If every entry looks like a number, the column is converted to a number, but if any entry in the column does not look like part of a number, the column stays character (and, before R 4.0.0, was converted to a factor by default). Common reasons why something you think should be a number looks non-numeric to R include: a finger slip that puts a letter somewhere in the column; similar-looking substitutions, such as O for 0 or l for 1; a comma where one is not expected, since many European files use , where R expects . (there are options to tell R what you want here); or using read.table without setting sep when the file really is comma-separated.
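The guessing step can be seen directly with type.convert, the function read.table uses for the conversion. In this small sketch the two vectors are made-up t scores, and one of them contains a letter O in place of a zero:

```r
# Hypothetical columns as they look before conversion (everything is text).
clean <- c("50", "61", "47")
typo  <- c("48", "6O", "52")   # "6O" contains the letter O, not a zero

class(type.convert(clean, as.is = TRUE))   # "integer"
class(type.convert(typo,  as.is = TRUE))   # "character"

# The same thing happens through read.csv.
csv_text <- "t1,t2
50,48
61,6O
47,52"
dat <- read.csv(text = csv_text, stringsAsFactors = FALSE)
sapply(dat, class)
```

A single bad entry is enough to change the type of the whole column, which is why two columns that look alike in Excel can come out differently in R.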
If you have a categorical variable represented by integers, R will read it as integer unless you tell it to make a factor. If you use as.numeric on a factor, it returns the integer codes used to represent the factor internally, not the numbers shown in the labels. How to convert a factor whose labels are numbers to a numeric is a question (and answer) in the FAQ.
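A short sketch of that pitfall, using a made-up factor f whose labels are numbers:

```r
# Hypothetical factor with numeric-looking labels.
f <- factor(c("48.5", "51.2", "48.5", "60.1"))

# Wrong: returns the internal integer codes of the levels.
codes <- as.numeric(f)
codes                                  # 1 2 1 3

# Right: go through the labels first.
values <- as.numeric(as.character(f))
values                                 # 48.5 51.2 48.5 60.1

# Equivalent form that converts each level only once.
values2 <- as.numeric(levels(f))[as.integer(f)]
```

The second form does less work on long vectors with few distinct values, because only the levels are parsed.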
If this does not point you in the right direction then give us a sample of your data and what commands you are using.

Reading data from a data.frame [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicates:
Convert factor to integer
R - How to convert a factor to an integer\numeric in R without a loss of information
I have read a text file in which some of the columns with real numbers were read into a data frame as factors. How do I convert the factor columns into numeric columns?
You can use
as.numeric(as.character(x))
The reason that your column of numbers was read in as a factor is that either something somewhere in the column makes it not number-only, or the decimal character is not the one R expects (a special case of the first problem). If your decimal character is not ., you can specify a different one via the dec argument, e.g. dec = ",".
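For instance, here is a minimal sketch with a hypothetical semicolon-separated file that uses decimal commas. The file contents and column names are made up for illustration:

```r
# Hypothetical file: fields separated by ";" and decimals written with ",".
csv_text <- "id;value
1;3,14
2;2,50"

dat <- read.csv(text = csv_text, sep = ";", dec = ",")

class(dat$value)
dat$value
```

read.csv2 uses sep = ";" and dec = "," as its defaults, so it is a shortcut for files in this layout.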
This is FAQ 7.10.
But rather than converting after the fact, why not read them in correctly? Either specify the colClasses argument (if using read.table or one of its variants) or, better yet, figure out which character(s) in the file are convincing R that your numbers are not all numbers and fix the source file (or, best, do both).
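One way to track down the offending entries is to attempt the conversion and look at what fails. The sketch below assumes a hypothetical character vector x, such as a column read with stringsAsFactors = FALSE:

```r
# Hypothetical column contents; "2,7" and "n/a" are the troublemakers.
x <- c("1.5", "2,7", "3.0", "n/a", "4.2", NA)

# Entries that are not already missing but turn into NA cannot be parsed.
converted <- suppressWarnings(as.numeric(x))
bad <- is.na(converted) & !is.na(x)

which(bad)       # positions of the unparseable entries
unique(x[bad])   # the distinct offending values
```

Looking at unique(x[bad]) usually makes it clear whether the fix belongs in the source file, in the dec or na.strings arguments, or in a gsub cleanup.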

