How can write.csv and read.csv preserve the data format? - r

I have a data frame in which most of the columns are numeric. I write it to a csv file with this command:
write.csv(df, "mydf.csv", row.names=FALSE, na="")
Then later I read in the file using:
df = read.csv("mydf.csv", header = F, sep = ",", stringsAsFactors=F,dec=".")
Then I checked the column classes using
sapply(df, class)
All the columns have changed to character. If I leave out stringsAsFactors=F,
all the columns change to factor.
I could manually convert the columns to numeric afterwards, but I wonder whether there is a way to preserve the data types, at least for the majority of the columns, when writing or reading the csv file. Is there a better solution?

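By default write.csv includes a header row, but when you read the file back in you are telling R that there is no header (header = F). The header names are non-numeric, so the header row is read as data and every column switches to character. So either write the file without a header (write.csv() always writes one and ignores col.names with a warning, so this means write.table() with sep="," and col.names=FALSE) or set header=TRUE in read.csv().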
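A minimal sketch of the round trip, assuming a small made-up data frame and a temporary file (the column names id, score and label are illustrative only). It compares the classes you get back with header = FALSE and with header = TRUE:

```r
# Round trip a small data frame through a temporary CSV file.
df <- data.frame(id = 1:3, score = c(1.5, NA, 3.25), label = c("a", "b", "c"))
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE, na = "")

# header = FALSE: the header row is read as data, so every column becomes text.
bad <- read.csv(path, header = FALSE, stringsAsFactors = FALSE)

# header = TRUE (the read.csv default): the column types are detected again.
good <- read.csv(path, header = TRUE, stringsAsFactors = FALSE)

print(sapply(bad, class))
print(sapply(good, class))
```

The empty field written by na="" comes back as NA in the numeric column, so the missing value survives the round trip as well.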

Related

read.csv importing two columns instead of one

I'm trying to import a csv file into a vector. There are 100 entries in this csv file, and this is what the file looks like:
My code reads as follows:
> choice_vector <- read.csv("choices.csv", header = FALSE, fileEncoding="UTF-8-BOM")
> choice_vector
And yet, when I try to display said vector, it shows up as:
It is somehow creating a second column, and I cannot figure out why. In addition, writing to a new csv file writes the contents of that second column as well.
The second column was "habilitated" in excel.
Option1: Manually delete the column in excel.
Option2: Delete all columns with all NA
choice_vector2 <- choice_vector[,colSums(is.na(choice_vector))<nrow(choice_vector)]
In case of being interested in reading the first column only:
choice_vector <- read.csv("choices.csv", header = FALSE, fileEncoding="UTF-8-BOM")[,1]
Good luck!
Short answer:
You have an issue with your data file, but
choice_vector <- read.csv("choices.csv", header = FALSE, fileEncoding="UTF-8-BOM")$V1
should create the vector that you're expecting.
Long answer:
The read.csv function returns a data frame and you need to address a particular column within the data frame with the $ operator in order to extract that column as a vector. As for why you have an unexpected column of NAs, your CSV probably codes for two columns. When you read a CSV with R, a comma indicates a data field to its right. If you look at your CSV with a text editor, I'm guessing it'll look like this:
A,
B,
D,
A,
A,
F,
The absence of anything (other than another comma or a line break) to the right of a comma is interpreted as NA.
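To illustrate the trailing-comma explanation, here is a small sketch that reads the same kind of content from an inline string instead of a file (the three-line sample is made up):

```r
# Each line ends with a comma, so the file codes for two fields per row.
csv_text <- "A,\nB,\nD,\n"
choices <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)

print(ncol(choices))   # two columns: V1 holds the letters, V2 is empty
print(choices$V2)      # the empty second field is read as NA

# Extract just the first column as a plain vector.
choice_vector <- choices$V1
print(choice_vector)
```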
If we are using fread from data.table, there is a select option to select only the columns of interest
library(data.table)
dt <- fread("choices.csv", select = 1)
Other than that, it is not clear why the issue happens. It could be some stray white space. If that is the case, specify strip.white = TRUE (it is FALSE by default)
read.csv("choices.csv", header = FALSE,
fileEncoding="UTF-8-BOM", strip.white = TRUE)
Or as we commented, copy the columns of interest into a new file, save it and then read with read.csv

write.table unintendedly adds subscript x to header

I have a comma-delimited csv document with predefined headers and a few rows. I just want to replace the comma delimiter with a pipe delimiter. So my naive approach is:
myData <- read.csv(file="C:/test.CSV", header=TRUE, sep=",", check.names = FALSE)
Viewing myData gives me results without X subscripts in header columns. If I set check.names = TRUE, the column headers have a X subscript.
Now I am trying to write a new csv with pipe-delimiter.
write.table(myData, file = "C:/test_pipe.CSV",row.names=FALSE, na="",col.names=TRUE, sep="|")
In the next step I am going to test my results:
mydata.test <- read.csv(file="C:/test_pipe.CSV", header=TRUE, sep="|")
Import seems fine, but unfortunately the X subscript in column headers appear again. Now my question is:
Is there something wrong with the original file or is there an error in my naive approach?
The original csv test.csv was created with Excel, of course without X subscripts in column headers.
Thanks in advance
You have to keep using check.names = FALSE the second time as well.
Otherwise your header will be modified, because it apparently contains names that are not valid column names for a data.frame. For example, special characters are replaced by dots (.), and names that start with a number are prefixed with X.
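A short sketch of the effect, assuming a made-up pipe-delimited header with one name that starts with a digit and one that contains a space:

```r
# Inline pipe-delimited sample; the header names are illustrative only.
pipe_text <- "2019|unit price|id\n10|2.5|1\n"

# check.names = FALSE keeps the header exactly as written in the file.
kept <- read.csv(text = pipe_text, sep = "|", check.names = FALSE)

# check.names = TRUE (the default) runs the names through make.names().
changed <- read.csv(text = pipe_text, sep = "|", check.names = TRUE)

print(names(kept))
print(names(changed))
```

With the default, "2019" becomes "X2019" and "unit price" becomes "unit.price", while "id" is already a valid name and is left alone.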

How can a data frame be transformed into a string with a csv format on R?

I don't want to write a csv into a file, but to get a string representation of the dataframe with a csv format (to send it over the network).
I'm using R.NET, if it helps to know.
If you are not limited to base functions, you may try readr::format_csv.
library(readr)
format_csv(iris[1:2, 1:3])
# [1] "Sepal.Length,Sepal.Width,Petal.Length\n5.1,3.5,1.4\n4.9,3.0,1.4\n"
If you want a single string in csv format, you could capture the output from write.csv.
Let's use mtcars as an example:
paste(capture.output(write.csv(mtcars)), collapse = "\n")
This reads back into R fine with read.csv(text = ..., row.names = 1). You can make adjustments for the printing of row names and other attributes in write.csv.
Alternatively:
write.csv(mtcars, textConnection("output", "w"), row.names=FALSE)
which will create the variable output in the global environment and store it in a character vector.
You can do
paste0(output, collapse="\n")
to make it one big character string, similar to the capture.output() answer above (but paste0() is marginally faster).
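As a quick check that such a string really is usable csv, this sketch builds the string with capture.output() and parses it back with read.csv(text = ...). The two-column data frame is made up for the example:

```r
# Small example data frame; the column names are illustrative only.
df <- data.frame(x = c(1.5, 2), y = c("a", "b"), stringsAsFactors = FALSE)

# Capture what write.csv would print and join the lines into one string.
csv_string <- paste(capture.output(write.csv(df, row.names = FALSE)),
                    collapse = "\n")
cat(csv_string, "\n")

# Parse the string back into a data frame to confirm the round trip.
restored <- read.csv(text = csv_string, stringsAsFactors = FALSE)
print(all.equal(restored, df))
```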

Load header for data.frame from file

I have two files. One file (csv) contains the data, and the second contains the header for the data (in one column). I need to combine both files and get a data.frame with the data from the first file and the header from the second file. How can this be done?
Reduced sample. Data file:
10;21;36
7;56;543
7;7;7
7890;1;1
Header file:
height
weight
light
I need the same data.frame I would get from this csv file:
height;weight;light
10;21;36
7;56;543
7;7;7
7890;1;1
You could use the col.names argument in read.table() to read the header file as the column names in the same call used to read the data file.
read.table(datafile, sep = ";", col.names = scan(headerfile, what = ""))
As @chinsoon12 shows in the comments, readLines() could also be used in place of scan().
We can read both the datasets with header=FALSE and change the column names with the first column of second dataset.
df1 <- read.csv("firstfile.csv", sep=";", header=FALSE)
df2 <- read.csv("secondfile.csv", header=FALSE)
colnames(df1) <- as.character(df2[,1])
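A self-contained sketch of the col.names approach, using the sample from the question written to two temporary files (the temporary paths stand in for the real data and header files):

```r
# Recreate the two sample files from the question in temporary locations.
datafile <- tempfile(fileext = ".csv")
headerfile <- tempfile(fileext = ".txt")
writeLines(c("10;21;36", "7;56;543", "7;7;7", "7890;1;1"), datafile)
writeLines(c("height", "weight", "light"), headerfile)

# Read the data and take the column names from the header file in one call.
df <- read.table(datafile, sep = ";", col.names = readLines(headerfile))
print(df)
```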

Imported a csv-dataset to R but the values becomes factors

I am very new to R and I am having trouble accessing a dataset I've imported. I'm using RStudio and used the Import Dataset function when importing my csv-file and pasted the line from the console-window to the source-window. The code looks as follows:
setwd("c:/kalle/R")
stuckey <- read.csv("C:/kalle/R/stuckey.csv")
point <- stuckey$PTS
time <- stuckey$MP
However, the data isn't integer or numeric as I am used to but factors so when I try to plot the variables I only get histograms, not the usual plot. When checking the data it seems to be in order, just that I'm unable to use it since it's in factor form.
Both the data import function (here: read.csv()) and a global option let you set stringsAsFactors=FALSE, which should fix this.
By default, read.csv reads each column as text and then checks whether all of its values can be treated as numeric. If it finds non-numeric values, it treats the variable as character data, and character variables are converted to factors (when stringsAsFactors is TRUE, which was the default before R 4.0.0).
It looks like the PTS and MP variables in your dataset contain non-numerics, which is why you're getting unexpected results. You can force these variables to numeric with
point <- as.numeric(as.character(point))
time <- as.numeric(as.character(time))
But any values that can't be converted will become missing. (The R FAQ gives a slightly different method for factor -> numeric conversion but I can never remember what it is.)
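The reason for going through as.character() first can be seen in a small sketch with a made-up factor. Calling as.numeric() directly on a factor returns the internal level codes, not the values:

```r
# A factor whose labels look like numbers, plus one bad value.
f <- factor(c("10", "5", "abc"))

# Directly converting a factor gives the integer level codes.
codes <- as.numeric(f)

# Converting via character gives the real values; "abc" becomes NA
# (R normally warns about that, which is suppressed here).
values <- suppressWarnings(as.numeric(as.character(f)))

print(codes)
print(values)
```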
You can set this globally for all read.csv/read.* commands with
options(stringsAsFactors=F)
Then read the file as follows:
my.tab <- read.table( "filename.csv", as.is=T )
When importing csv data files, the import command should reflect both the separator between columns (;) and the decimal separator of your numeric values (for a numerical value such as 2,5 this would be ",").
The command for importing such a csv therefore has to be a bit more explicit, with more arguments:
stuckey <- read.csv2("C:/kalle/R/stuckey.csv", header=TRUE, sep=";", dec=",")
This should import all variables as either integers or numeric.
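A minimal sketch with made-up inline data, assuming the file really is semicolon-separated with decimal commas. read.csv2() already uses sep = ";" and dec = "," as its defaults, so they are not repeated here:

```r
# Inline sample in the "European" csv layout; the values are illustrative only.
semi_text <- "PTS;MP\n2,5;10\n3,75;12\n"

# read.csv2 defaults to sep = ";" and dec = ",".
stats <- read.csv2(text = semi_text)

print(sapply(stats, class))
print(stats$PTS)
```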
None of these answers mention the colClasses argument which is another way to specify the variable classes in read.csv.
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = "numeric") # all variables to numeric
or you can specify which columns to convert:
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = c("PTS" = "numeric", "MP" = "numeric")) # specific columns to numeric
Note that if a variable can't be converted to numeric, the import either fails with an error (when colClasses forces it to numeric) or leaves you with a factor, which is more awkward to convert to numbers. It can therefore be advisable to read all variables in as character with colClasses = "character" and then convert the specific columns to numeric once the csv has been read in:
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = "character")
point <- as.numeric(stuckey$PTS)
time <- as.numeric(stuckey$MP)
I'm new to R as well and faced the exact same problem. But then I looked at my data and noticed that it was caused by my csv file using a comma as a thousands separator in all numeric columns (e.g. 1,233,444.56 instead of 1233444.56).
I removed the comma separator in my csv file and then reloaded into R. My data frame now recognises all columns as numbers.
I'm sure there's a way to handle this within the read.csv function itself.
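One possible way to handle this in R, sketched under the assumption that the values are quoted in the file (otherwise the commas would be taken as field separators): read the column as character, strip the commas, and convert. The column name amount is made up for the example.

```r
# Inline sample with quoted values that use "," as a thousands separator.
csv_text <- 'amount\n"1,233,444.56"\n"2,000.50"\n'

# The commas keep the column from being detected as numeric.
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(class(raw$amount))

# Remove the thousands separators, then convert to numeric.
amount <- as.numeric(gsub(",", "", raw$amount, fixed = TRUE))
print(amount)
```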
This only worked right for me when including strip.white = TRUE in the read.csv command.
(I found the solution here.)
For me the solution was to include the skip argument
(the number of rows to skip at the top of the file; the default is 0 and it can be set higher)
mydata <- read.csv(file = "file.csv", header = TRUE, sep = ",", skip = 22)
