I am having trouble fitting a linear model from a data frame because the independent variable contains comma thousands separators (e.g., 314,567.5 instead of 314567.5). How can I use read.csv or readr to read the data set and return a data frame without the commas in that specific column?
The answer to the commas question is here.
However, you first need to read the file into R. Although it can be a bit of a pain, I've found that read.fwf is often the best solution in these situations, unless you have a different delimiter, such as a pipe, |, in which case read.delim would probably be best.
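For what it's worth, here is a minimal sketch of one way to strip the separators, assuming the file is an ordinary CSV in which the offending values are quoted. The column names and values are made up for illustration. The idea is to read that one column as character and convert it afterwards:

```r
# Hypothetical file contents: the quoted "sales" column uses comma thousands separators
csv_text <- 'region,sales\nnorth,"314,567.5"\nsouth,"1,204.25"'

# Read only the problem column as character; other columns are guessed as usual
sales <- read.csv(text = csv_text, colClasses = c(sales = "character"))

# Remove the commas, then convert to numeric
sales$sales <- as.numeric(gsub(",", "", sales$sales, fixed = TRUE))

str(sales)
```

With a real file you would pass the path instead of text =. If you are using readr, parse_number() may do a similar job on a character column.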
Related
I've been googling and reading posts on problems similar but different to the one described below; apologies if this is a duplicate.
I've got a csv file with a field which can contain, among other things, a single instance of a double quote (object descriptions sometimes containing lengths specified in inches).
When I call fread as follows
data_in <- data.table::fread(file_path, stringsAsFactors = FALSE)
the resulting data frame contains two consecutive double quotes in instances where the source file only had one (e.g., the string which appears in the raw csv as
MI|WIRE 9" BGD
appears in the data frame as
MI|WIRE 9"" BGD
).
This character field can also contain commas, semicolons, single quotes in any quantity, and many other characters which I cannot identify.
This is a problem because I need the exact string to match another dataset's values with merge (in fact, the file being read in was originally written from R with fwrite).
I assume that nearly any io problem I'm wrestling with can be solved with readLines and some elbow grease, but I quite like fread. Based on what I've read online this seems similar to problems that others have faced and so I'm guessing that some tweaking of fread's parameters will solve this problem. Any ideas?
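Not a fix for fread itself, but as a possible workaround you could collapse the doubled quotes after the import. This is only a sketch: the vector below is a hypothetical stand-in for the affected column, and it assumes the real data never contains two consecutive double quotes on purpose.

```r
# Hypothetical values as they might look after the import described above
desc <- c('MI|WIRE 9"" BGD', 'NO QUOTES HERE')

# Replace each doubled double quote with a single one (literal match, no regex)
desc_fixed <- gsub('""', '"', desc, fixed = TRUE)

print(desc_fixed)
```

It may also be worth experimenting with fread's quote argument (for example quote = ""), although I would expect that to misbehave if the same field contains the separator character.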
I have a raw dataset in which the columns are not clearly delimited. When I import the data using read.table in R, it tries to guess where the columns begin and end, but it gets them wrong. I know the number of characters per variable, but I am not sure how to specify them the way one would in Excel (=LEFT(x,3) or =MID(x,4,1), etc.). Some variables are separated by spaces and some aren't; it is not consistent.
FYI: The document was originally ".dat", then I saved the file as a ".R" file.
Here is an example of my data
Any help is much appreciated! Let me know
You can use read_fwf from the great readr package to specify the fixed width of each variable.
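Since the question did not include the actual data, here is a rough sketch with made-up records. It uses base R's read.fwf rather than readr so that it runs without extra packages; the widths and column names are assumptions for illustration only.

```r
# Hypothetical fixed-width records: 3-char id, 1-char group, 4-char score
lines <- c("001A 9.5", "002B12.0", "003A7.25")
tmp <- tempfile(fileext = ".dat")
writeLines(lines, tmp)

# widths gives the number of characters per variable, in order
fixed <- read.fwf(
  tmp,
  widths = c(3, 1, 4),
  col.names = c("id", "group", "score"),
  colClasses = c("character", "character", "numeric")
)

print(fixed)
```

The readr version should look much the same, with the widths and names passed through fwf_widths().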
I have a column with values in a format such as 20000000002185979. Every time I read the CSV file into R, the values become "2e+16", so I can't distinguish between different values. Do you have any good ideas for how to keep the original format when reading the file into R? Thanks!
Since it turned out to be the answer you wanted, I'll post it here to close out the question.
Since R is unable to maintain that many digits of precision in its numeric values (doubles carry only about 15 to 16 significant digits), you'll have to read the column in as a character value. You can do that by setting the colClasses parameter of read.table.
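Something along these lines should work. This is a sketch with a made-up two-column file, so the column names are assumptions:

```r
# Hypothetical CSV with a 17-digit identifier column
csv_text <- "id,value\n20000000002185979,1.5\n20000000002185980,2.5"

# Name the column in colClasses so only it is forced to character
ids <- read.csv(text = csv_text, colClasses = c(id = "character"))

print(ids$id)
```

Columns not named in colClasses are still guessed in the usual way, so value stays numeric here.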
I am confused. I input a .csv file in R and want to fit a linear multivariate regression model.
However, R declares all my obviously numeric variables to be factors and my categorical variables to be integers. Therefore, I cannot fit the model.
Does anyone know how to resolve this?
I know this is probably so basic. But I really need to know this. Elsewhere, I found only posts concerning how to declare factors. But this does not apply here.
Any suggestions very much appreciated!
The easiest way, imo, to handle this is to just tell R what type of data your columns contain when you read them into the workspace. For example, if you have a csv file where the first column should be characters, columns 2-21 should be numeric, and column 22 should be a factor, here's how I would read that csv file into the workspace:
Data <- read.csv("MyData.csv", colClasses=c("character", rep("numeric", 20), "factor"))
Sometimes (with certain versions of R, as Andrew points out) float entries in a CSV are long enough that it thinks they are strings and not floats. In this case, you can do the following
data <- read.csv("filename.csv")
data$some.column <- as.numeric(as.character(data$some.column))
Or you could pass stringsAsFactors=FALSE to the read.csv call and just apply as.numeric on the next line. That might be a bad idea, though, if you have a lot of data.
It's a little harder to say what's going on with the categorical variables. You might want to try just treating those as strings and see how that works. Sometimes R will treat factor vectors as being of numeric type, so this is a good first sanity check. If that doesn't work, you can also see if the regression functions in question will let you declare how the variables should be treated.
It is hard to tell without a sample of your data file and the commands that you have been using to try and work with the data, but here are some general problems that can lead to what you describe (though there could be other possibilities as well).
The read.csv and read.table functions (read.csv calls read.table) will try to guess the type of each column when they are not told what it should be (via the colClasses argument). If everything looks like a number then the column is converted to a number, but if R sees anything that does not look like part of a number then it reads the column in as character and converts it to a factor. Common reasons why R sees something non-numeric in a column you expect to be numeric include: a finger slip that leaves a letter somewhere in the column; similar-looking substitutions, such as O for 0 or l for 1; a comma where one is not expected (many European files use , where R expects ., but there are options to tell R what you want here); or using read.table without setting sep when the file really is comma-separated.
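To illustrate the decimal-comma case, here is a small sketch with made-up data, assuming a European-style file that uses ; as the field separator and , as the decimal mark:

```r
# Hypothetical European-style file: semicolon separator, comma decimal mark
txt <- "height;group\n1,5;a\n2,25;b"

# Telling R only about the separator leaves "1,5" looking non-numeric
wrong <- read.csv(text = txt, sep = ";")

# read.csv2 defaults to sep = ";" and dec = ","
eu <- read.csv2(text = txt)

str(wrong$height)
str(eu$height)
```

Passing dec = "," to read.table or read.csv has the same effect as switching to read.csv2.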
If you have a categorical variable represented by integers, then R will convert it to integers unless you tell it to make a factor. If you use as.numeric on a factor then it will return the integers used to represent the factor internally. How to convert a factor with labels that are numbers to a numeric is a question (and answer) in the FAQ.
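A quick sketch of the difference, using a made-up factor:

```r
# Hypothetical factor whose labels are numbers
f <- factor(c("10.5", "3", "10.5"))

# as.numeric on the factor returns the internal integer codes, not the labels
codes <- as.numeric(f)

# Going through character (or through the levels) recovers the actual values
values <- as.numeric(as.character(f))
values_alt <- as.numeric(levels(f))[f]

print(codes)
print(values)
```

Going the other way, wrapping an integer-coded categorical column in factor() before fitting tells the model to treat it as categorical.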
If this does not point you in the right direction then give us a sample of your data and what commands you are using.
I am fairly new to R. I have a data file containing a matrix of complex numbers, each of the form 123+123i. When I try to read the data into R using read.table(), it returns strings, which is not what I want. Is there some way to read in a file of complex numbers?
One thing I could do, since the program that generates the matrix is available to me, is modify it to write two real numbers instead of a single complex number and then combine them into a complex number after reading them into R. Would this be the canonical way of doing what I want?
See ?read.table; in particular, you want to use the colClasses="complex" argument.
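For example, a minimal sketch assuming a whitespace-separated file with no header (the values are made up):

```r
# Hypothetical 2 x 2 matrix of complex numbers, whitespace-separated
txt <- "1+2i 3-4i\n5+0i 0+1i"

# A single colClasses value is recycled across all columns
z <- read.table(text = txt, colClasses = "complex")

# Convert the data frame to a complex matrix
m <- as.matrix(z)

print(m)
```

With a real file you would pass the file name instead of text =.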