Why does R mix up numerical with categorial variables? - r

I am confused. I input a .csv file in R and want to fit a linear multivariate regression model.
However, R declares all my obvious numeric variables to be factors and my categorial variables to be integers. Therefore, I cannot fit the model.
Does anyone know how to resolve this?
I know this is probably so basic. But I really need to know this. Elsewhere, I found only posts concerning how to declare factors. But this does not apply here.
Any suggestions very much appreciated!

The easiest way, imo, to handle this is to just tell R what type of data your columns contain when you read them into the workspace. For example, if you have a csv file where the first column should be characters, columns 2-21 should be numeric, and column 22 should be a factor, here's how I would read that csv file into the workspace:
Data <- read.csv("MyData.csv", colClasses=c("character", rep("numeric", 20), "factor"))

Sometimes (with certain versions of R, as Andrew points out) float entries in a CSV are long enough that it thinks they are strings and not floats. In this case, you can do the following
data <- read.csv("filename.csv")
data$some.column <- as.numeric(as.character(data$some.column))
Or you could pass stringsAsFactors=F to the read.csv call, and just apply as.numeric in the next line. That might be a bad idea though if you have a lot of data.
It's a little harder to say what's going on with the categorical variables. You might want to try just treating those as strings and see how that works. Sometimes R will treat factor vectors as being of numeric type, so this is a good first sanity check. If that doesn't work, you can also see if the regression functions in question will let you declare how the variables should be treated.

It is hard to tell without a sample of your data file and the commands that you have been using to try and work with the data, but here are some general problems that can lead to what you describe (though there could be other possibilities as well).
The read.csv and read.table (which is called by read.csv) function will try and guess the types of data when they are not told what each column should be (the colClasses argument). If everything looks like a number then it will convert to a number, but if it sees anything in the first lines that does not look like part of a number then it will read it in as character and convert to a factor. Some of the common reasons why what you think should be a number but R sees something non-numeric include: a finger slip results in a letter somewhere in the column; similar looking substitutions, O for 0 or l for 1; a comma where one is not expected, many European files use , where R expects . (but there are options to tell R what you want here) or if you use read.table without setting sep when it really is a comma separated file.
If you have a categorical variable represented by integers, then R will convert it to integers unless you tell it to make a factor. If you use as.numeric on a factor then it will return the integers used to represent the factor internally. How to convert a factor with labels that are numbers to a numeric is a question (and answer) in the FAQ.
If this does not point you in the right direction then give us a sample of your data and what commands you are using.

Related

R studio numeric integer display format options

I don't want the display format like this: 2.150209e+06
the format I want is 2150209
because when I export data, format like 2.150209e+06 caused me a lot of trouble.
I did some search found this function could help me
formatC(numeric_summary$mean, digits=1,format="f").
I am wondering can I set options to change this forever? I don't want to apply this function to every variable of my data because I have this problem very often.
One more question is, can I change the class of all integer variables to numeric automatically? For integer format, when I sum the whole column usually cause trouble, says "integer overflow - use sum(as.numeric(.))".
I don't need integer format, all I need is numeric format. Can I set options to change integer class to numeric please?
I don't know how you are exporting your data, but when I use write.csv with a data frame containing numeric data, I don't get scientific notation, I get the full number written out, including all decimal precision. Actually, I also get the full number written out even with factor data. Have a look here:
df <- data.frame(c1=c(2150209.123, 10001111),
c2=c('2150209.123', '10001111'))
write.csv(df, file="C:\\Users\\tbiegeleisen\\temp.txt")
Output file:
"","c1","c2"
"1",2150209.123,"2150209.123"
"2",10001111,"10001111"
Update:
It is possible that you are just dealing with a data rendering issue. What you see in the R console or in your spreadsheet does not necessarily reflect the precision of the underlying data. For instance, if you are using Excel, you highlight a numeric cell, press CTRL + 1 and then change the format. You should be able to see full/true precision of the underlying data. Similarly, the number you see printed in the R console might use scientific notation only for ease of reading (SN was invented partially for this very reason).
Thank you all.
For the example above, I tried this:
df <- data.frame(c1=c(21503413542209.123, 10001111),
c2=c('2150209.123', '100011413413111'))
c1 in df is scientific notation, c2 is not.
then I run write.csv(df, file="C:\Users\tbiegeleisen\temp.txt").
It does out put all digits.
Can I disable scientific notation in R please? Because, it still cause me trouble, although it exported all digits to txt.
Sometimes I want to visually compare two big numbers. For example, if I run
df <- data.frame(c1=c(21503413542209.123, 21503413542210.123),
c2=c('2150209.123', '100011413413111'))
df will be
c1 c2
2.150341e+13 2150209.123
2.150341e+13 100011413413111
The two values for c1 are actually different, but I cannot differentiate them in R, unless I exported them to txt. The numbers here are fake numbers, but the same problem I encounter very day.

R read.csv as numeric or string

I have a somewhat general question about how read.csv(...) works.
When I read csv datasets (created and exported from Excel), some columns are read into R as numeric (correctly), while others end up as either character (stringsAsFactor=FALSE) or factors (stringsAsFactor=TRUE). How does R determine if its string or numeric in the import process? There is no discernible difference in the columns - e.g., two columns, both t scores, yet one was read in as character, the other as numeric. Can someone explain this to me? (Does my question even make sense?)
Thanks
Andrea

Converting data between classes in R: large CSV file, prices recorded like "$10.00"

So full disclosure, I am new to R and programming in general. Because of that, it is very hard for me to search when I have problems because I am not even sure what keywords to use. I am learning, and all I am hoping for y'all to do is point me in the right direction.
I have a very large csv file that I imported into R. Around 2 million observations (don't worry, I am not planning on using all 2 million). The only problem is that the people recording the data formatted the file to record to prices as "$10.00". Because of this, R recognizes the data has a factor, and also treats each individual price as a separate variable because of the dollar sign. I would like to reformat this column as a numeric variable.
I am sure there is some way to go about reformatting this in R, the only problem is I am not sure which functions I need. Sorry for the very basic question, I have just hit a wall a figured I would reach out.
Any and all help is much appreciated!
Thank you!
We could also use sub
as.numeric(sub('\\D+', '', x))
#[1] 10.00 11.24 15.22
data
x<-c("$10.00","$11.24","$15.22")
Suppose that your data looks like this:
x<-c("$10.00","$11.24","$15.22")
You can use the substring function to trim the initial dollar sign (which will still leave you with strings) and then use as.numeric to turn it to a numeric vector.
newx<-as.numeric(substring(x,2))
will produce a vector named newx with value
c(10.00,11.24,15.22)
We tell the substring to start at the 2nd character (strings in R are 1-indexed), and then cast to numeric.
In your data frame (suppose it is called df), you can replace the column like
df$MoneyColumn <- as.numeric(substring(df$MoneyColumn,2))

`how to read a data set and remove commas from measured variables

I am having an issue creating a linear model from a data frame I have stored because the independent variable contains comma-separators (i.e 314,567.5 vs 314567.4). How could I use read.csv or readr to read a data set and return a data frame without the commas in that specific column?
The answer to the commas question is here.
However, you first need to read the file into R. Although it can be a bit of a pain, I've found that read.fwf is often the best solution in these situations, unless you have a different delimiter, such as a pipe, |, in which case read.delim would probably be best.

R claims that data is non-numeric, but after writing to file is numeric

I have read in a table in R, and am trying to take log of the data. This gives me an error that the last column contains non-numeric values:
> log(TD_complete)
Error in Math.data.frame(list(X2011.01 = c(187072L, 140815L, 785077L, :
non-numeric variable in data frame: X2013.05
The data "looks" numeric, i.e. when I read it my brain interprets it as numbers. I can't be totally wrong since the following will work:
> write.table(TD_complete,"C:\\tmp\\rubbish.csv", sep = ",")
> newdata = read.csv("C:\\tmp\\rubbish.csv")
> log(newdata)
The last line will happily output numbers.
This doesn't make any sense to me - either the data is numeric when I read it in the first time round, or it is not. Any ideas what might be going on?
EDIT: Unfortunately I can't share the data, it's confidential.
Review the colClasses argument of read.csv(), where you can specify what type each column should be read and stored as. That might not be so helpful if you have a large number of columns, but using it makes sure R doesn't have to guess what type of data you're using.
Just because "the last line will happily output numbers" doesn't mean R is treating the values as numeric.
Also, it would help to see some of your data.
If you provide the actual data or a sample of it, help will be much easier.
In this case I assume R has the column in question saved as a string and writes it without any parantheses into the CSV file. Once there, it reads it again and does not bother to interpret a value without any characters as anything else than a number. In other words, by writing and reading a CSV file you converted a string containing only numbers into a proper integer (or float).
But without the actual data or the rest of the code this is mere conjecture.

Resources