I have got a comma delimited csv document with predefined headers and a few rows. I just want to exchange the comma delimiter to a pipe delimiter. So my naive approach is:
myData <- read.csv(file="C:/test.CSV", header=TRUE, sep=",", check.names = FALSE)
Viewing myData gives me results without X subscripts in header columns. If I set check.names = TRUE, the column headers have a X subscript.
Now I am trying to write a new csv with pipe-delimiter.
write.table(MyData1, file = "C:/test_pipe.CSV",row.names=FALSE, na="",col.names=TRUE, sep="|")
In the next step I am going to test my results:
mydata.test <- read.csv(file="C:/test_pipe.CSV", header=TRUE, sep="|")
Import seems fine, but unfortunately the X subscript in column headers appear again. Now my question is:
Is there something wrong with the original file or is there an error in my naive approach?
The original csv test.csv was created with Excel, of course without X subscripts in column headers.
Thanks in advance
You have to keep using check.names = FALSE, also the second time.
Else your header will be modified, because apparently it contains variable names that would not be considered valid names of columns of a data.frame. E.g., special characters would be replaced by dots, i.e. . Similarly, numbers would be pre-fixed with X.
Related
I have to read a big .csv file and read.csv is taking a while. I read that I should use read.csv to read a few rows, get the column classes, and then read the whole file. I tried to do that:
read.csv(full_path_astro_data,
header=TRUE,
sep=",",
comment.char="",
nrow=100,
stringsAsFactors=FALSE) %>%
sapply(class) -> col.classes
df_astro_data <- read.csv(full_path_astro_data,
header=TRUE,
sep=",",
colClasses=col.classes,
comment.char="",
nrow=47000,
stringsAsFactors=FALSE)
But then I got an error message:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got '0.0776562500000022'
It looks like a column that contains numeric (double?) data was incorrectly classified as integer. This could be because some numeric columns have many zeros at the beginning. So I tried to increase the number of rows in the first read.csv command, but that did not work. One solution I found was to do
col.classes %>%
sapply(function(x) ifelse(x=="integer", "numeric", x)) -> col.classes
With this the file is read much faster than without specifying column classes. Still, it would be best if all columns were classified correctly.
Any insights?
Thanks
I suspect you are correct that in your row sample some columns contain only integers, but outside your row sample they contain non-integers. This is a common problem with large files. You need to either increase your row sample size or explicitly specify column type for certain columns where you see this happening.
It should be noted that readr's read_csv does this row sampling automatically. From the docs: "all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. If the imputation fails, you'll need to supply the correct types yourself." You can do that like this:
read_csv( YourPathName,
col_types = cols(YourProblemColumn1 = col_double(),
YourProblemColumn2 = col_double())
)
ne,class,regex,match,event,msg
BOU2-P-2,"tengigabitethernet","tengigabitethernet(?'connector'\d{1,2}\/\d{1,2})","4/2","lineproto-5-updown","%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
these are the first two lines, with the first one that will serve as columns names, all separated by commas and with the values in quotation marks except for the first one, and I think it is that that creates troubles.
I am interested in the columns class and msg, so this output will suffice:
class msg
tengigabitethernet %lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down
but I can also import all the columns and unselect the ones I don't want later, it's no worries.
The data comes in a .csv file that was given to me.
If I open this file in excel the columns are all in one.
I work in France, but I don't know in which locale or encoding the file was created (btw I'm not French, so I am not really familiar with those).
I tried with
df <- read.csv("file.csv", stringsAsFactors = FALSE)
and the dataframe has the columns' names nicely separated but the values are all in the first one
then with
library(readr)
df <- read_delim('file.csv',
delim = ",",
quote = "",
escape_double = FALSE,
escape_backslash = TRUE)
but this way the regex column gets splitted in two columns so I lose the msg variable altogether.
With
library(data.table)
df <- fread("file.csv")
I get the msg variable present but empty, as the ne variable contains both ne and class, separated by a comma.
this is the best output for now, as I can manipulate it to get the desired one.
another option is to load the file as a character vector with readLines to fix it, but I am not an expert with regexs so I would be clueless.
the file is also 300k lines, so it would be hard to inspect it.
both read.delim and fread gives warning messages, I can include them if they might be useful.
update:
using
library(data.table)
df <- fread("file.csv", quote = "")
gives me a more easily output to manipulate, it splits the regex and msg column in two but ne and class are distinct
I tried with the input you provided with read.csv and had no problems; when subsetting each column is accessible. As for your other options, you're getting the quote option wrong, it needs to be "\""; the double quote character needs to be escaped i.e.: df <- fread("file.csv", quote = "\"").
When using read.csv with your example I definitely get a data frame with 1 line and 6 columns:
df <- read.csv("file.csv")
nrow(df)
# Output result for number of rows
# > 1
ncol(df)
# Output result for number of columns
# > 6
tmp$ne
# > "BOU2-P-2"
tmp$class
# > "tengigabitethernet"
tmp$regex
# > "tengigabitethernet(?'connector'\\d{1,2}\\/\\d{1,2})"
tmp$match
# > "4/2"
tmp$event
# > "lineproto-5-updown"
tmp$msg
# > "%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
I'm trying to import a csv file into a vector. There are 100 entries in this csv file, and this is what the file looks like:
My code reads as follows:
> choice_vector <- read.csv("choices.csv", header = FALSE, fileEncoding="UTF-8-BOM")
> choice_vector
And yet, when I try to display said vector, it shows up as:
It is somehow creating a second column which I cannot figure out why it is doing so. In addition, trying to write to a new csv file actually writes the contents of that second column to that as well.
The second column was "habilitated" in excel.
Option1: Manually delete the column in excel.
Option2: Delete all columns with all NA
choice_vector2 <- choice_vector[,colSums(is.na(choice_vector))<nrow(choice_vector)]
In case of being interested in reading the first column only:
choice_vector <- read.csv("choices.csv", header = FALSE, fileEncoding="UTF-8-BOM")[,1]
Good luck!
Short answer:
You have an issue with your data file, but
choice_vector <- read.csv("choices.csv", header = FALSE, fileEncoding="UTF-8-BOM")$V1
should create the vector that you're expecting.
Long answer:
The read.csv function returns a data frame and you need to address a particular column within the data frame with the $ operator in order to extract that column as a vector. As for why you have an unexpected column of NAs, your CSV probably codes for two columns. When you read a CSV with R, a comma indicates a data field to its right. If you look at your CSV with a text editor, I'm guessing it'll look like this:
A,
B,
D,
A,
A,
F,
The absence of anything (other than another comma or a line break) to the right of a comma is interpreted as NA.
If we are using fread from data.table, there is a select option to select only the columns of interest
library(data.table)
dt <- fread("choices.csv", select = 1)
Other than that, it is not clear about why the issue happens. Could be some strange white space. If that is the case, specify strip.white = TRUE (by default it is FALSE)
read.csv(("choices.csv", header = FALSE,
fileEncoding="UTF-8-BOM", strip.white = TRUE)
Or as we commented, copy the columns of interest into a new file, save it and then read with read.csv
I was hoping there may be a way to do this, but after trying for a while I have had no luck.
I am working with a datafile (.csv format) that is being supplied with multiple tables in a single file. Each table has its own header row, and data associated with it. Is there a way to import this file and create separate data frames for each header/dataset?
Any help or ideas that can be provided would be greatly appreciated.
A sample of the datafile and it's structure can be found Here
When trying to use read.csv I get the following error:
"Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names"
Read the help for read.table:
nrows: number of rows to parse
skip: number of rows to skip
You can parse your file as follows:
first <- read.table(myFile, nrows=2)
second <- read.table(myFile, skip=3, nrows=2)
third <- read.table(myFile, skip=6, nrows=8)
You can always automate this by using grep() to search for the table seperators.
You can also read the table using fill=TRUE, and then split out the tables afterwards.
I am very new to R and I am having trouble accessing a dataset I've imported. I'm using RStudio and used the Import Dataset function when importing my csv-file and pasted the line from the console-window to the source-window. The code looks as follows:
setwd("c:/kalle/R")
stuckey <- read.csv("C:/kalle/R/stuckey.csv")
point <- stuckey$PTS
time <- stuckey$MP
However, the data isn't integer or numeric as I am used to but factors so when I try to plot the variables I only get histograms, not the usual plot. When checking the data it seems to be in order, just that I'm unable to use it since it's in factor form.
Both the data import function (here: read.csv()) as well as a global option offer you to say stringsAsFactors=FALSE which should fix this.
By default, read.csv checks the first few rows of your data to see whether to treat each variable as numeric. If it finds non-numeric values, it assumes the variable is character data, and character variables are converted to factors.
It looks like the PTS and MP variables in your dataset contain non-numerics, which is why you're getting unexpected results. You can force these variables to numeric with
point <- as.numeric(as.character(point))
time <- as.numeric(as.character(time))
But any values that can't be converted will become missing. (The R FAQ gives a slightly different method for factor -> numeric conversion but I can never remember what it is.)
You can set this globally for all read.csv/read.* commands with
options(stringsAsFactors=F)
Then read the file as follows:
my.tab <- read.table( "filename.csv", as.is=T )
When importing csv data files the import command should reflect both the data seperation between each column (;) and the float-number seperator for your numeric values (for numerical variable = 2,5 this would be ",").
The command for importing a csv, therefore, has to be a bit more comprehensive with more commands:
stuckey <- read.csv2("C:/kalle/R/stuckey.csv", header=TRUE, sep=";", dec=",")
This should import all variables as either integers or numeric.
None of these answers mention the colClasses argument which is another way to specify the variable classes in read.csv.
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = "numeric") # all variables to numeric
or you can specify which columns to convert:
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = c("PTS" = "numeric", "MP" = "numeric") # specific columns to numeric
Note that if a variable can't be converted to numeric then it will be converted to factor as default which makes it more difficult to convert to number. Therefore, it can be advisable just to read all variables in as 'character' colClasses = "character" and then convert the specific columns to numeric once the csv is read in:
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = "character")
point <- as.numeric(stuckey$PTS)
time <- as.numeric(stuckey$MP)
I'm new to R as well and faced the exact same problem. But then I looked at my data and noticed that it is being caused due to the fact that my csv file was using a comma separator (,) in all numeric columns (Ex: 1,233,444.56 instead of 1233444.56).
I removed the comma separator in my csv file and then reloaded into R. My data frame now recognises all columns as numbers.
I'm sure there's a way to handle this within the read.csv function itself.
This only worked right for me when including strip.white = TRUE in the read.csv command.
(I found the solution here.)
for me the solution was to include skip = 0
(number of rows to skip at the top of the file. Can be set >0)
mydata <- read.csv(file = "file.csv", header = TRUE, sep = ",", skip = 22)