I need to write to a CSV file without the column names row.
The following snippet:
CSAT <- data[j, 1]
Verbatim <- data[j, 2]
write.csv(Verbatim, paste(CSAT, '.csv', sep = ""), row.names = FALSE)
CSAT is a variable containing a dynamic value (it changes at runtime) used for the file name.
This writes the data to CSV, but the CSV looks like this (2 rows instead of the desired 1 row):
"x"
"I am very disappointed in your service. I had been with xxxxx for 9 years and even though I had problems a few times I knew they would look after me as a customer. I can't say the same about yyyyy. Only been with you 6 months and already feel let down by "
Where did "x" came from? it's not part of the data..
thanks
You cannot turn off col.names with write.csv, because it always writes a header row. So if your data doesn't have column names, write.csv invents them: your unnamed vector is converted to a single-column data frame internally, and that column gets the default name "x". That's where the "x" came from. Try the more flexible write.table instead:
write.table(Verbatim, paste(CSAT, '.csv', sep = ""), sep = ",", row.names = FALSE, col.names = FALSE)
I was just going through a tremendous headache caused by read_csv messing up my data by substituting content with NA while reading simple and clean CSV files.
I'm iterating over multiple large CSV files that add up to millions of observations. Some columns contain quite a few NAs for some variables.
When reading a CSV that contains NA in a certain column for the first 1000 + x observations, read_csv populates the entire column with NA, and the data is lost for further operations.
The warning message "Warning: x parsing failures" is shown, but as I'm reading multiple files I cannot check them file by file, and I don't know of an automated fix for the parsing problem that problems(x) reports.
Using read.csv instead of read_csv avoids the problem, but it is slow and I run into encoding issues (using different encodings requires too much memory for large files).
One option to overcome this bug is to add a first observation (first row) to your data that contains something for each column, but I still need to read the file first somehow.
See a simplified example below:
library(readr)   # provides write_csv and read_csv

## create a data frame
df <- data.frame(id = numeric(), string = character(),
                 stringsAsFactors = FALSE)
## populate the columns; 'string' stays NA until row 1500
df[1:1500, 1] <- seq(1:1500)
df[1500, 2] <- "something"
# variable string contains the first value in obs. 1500
df[1500,]
## check the number of NAs in variable string
sum(is.na(df$string)) # 1499
##write the df
write_csv(df, "df.csv")
##read the df with read_csv and read.csv
df_readr <- read_csv('df.csv')
df_read_standard <- read.csv('df.csv')
##check the number of NA in variable string
sum(is.na(df_readr$string)) #1500
sum(is.na(df_read_standard$string)) #1499
## the read_csv result is all NA for variable string
problems(df_readr) ## What should this tell me? How do I fix it?
Thanks to MrFlick for answering in a comment on my question:
The whole reason read_csv can be faster than read.csv is because it can make assumptions about your data. It looks at the first 1000 rows to guess the column types (via guess_max) but if there is no data in a column it can't guess what's in that column. Since you seem to know what's supposed to be in the columns, you should use the col_types= parameter to tell read_csv what to expect rather than making it guess. See the ?readr::cols help page to see how to tell read_csv what it needs to know.
Also guess_max = Inf overcomes the problem, but the speed advantage of read_csv seems to be lost.
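Putting that advice into code, a minimal sketch of both fixes for the toy example above (the column names match the df created there):
library(readr)
## declare the column types up front so nothing has to be guessed
df_typed <- read_csv("df.csv",
                     col_types = cols(id = col_double(),
                                      string = col_character()))
## alternatively, scan every row before guessing (slower on large files)
df_full_guess <- read_csv("df.csv", guess_max = Inf)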
I have several .txt files which need to be imported into R as data frames for some data analysis. One of these files has no EOL in any form, so I'm left wondering how to go about importing it.
\"A\";\"B\";\"C\";\"D\";\"D\";\"E\";\"F\";\"G\";\"H\";\"I\";\"J\";\"K\";\"L\";\"M\";\"N\";\"O\";\"P\";\"Q\";\"R\";\"S\";\"T\";\"U\";\"V\"\"1\";4;\"55-555-5555-555\";1234-56-78;\"111\";1510;5;1234-12-17;12345.1234512345;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA\"2\";6;\"22-222-2222-222\";5678-56-78;\"222\";2051;0;NA;0;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA
This is what the first ~500 characters of that .txt file look like. The EOLs would need to be placed like this:
\"A\";\"B\";\"C\";\"D\";\"D\";\"E\";\"F\";\"G\";\"H\";\"I\";\"J\";\"K\";\"L\";\"M\";\"N\";\"O\";\"P\";\"Q\";\"R\";\"S\";\"T\";\"U\";\"V\"
\"1\";4;\"55-555-5555-555\";1234-56-78;\"111\";1510;5;1234-12-17;12345.1234512345;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA
\"2\";6;\"22-222-2222-222\";5678-56-78;\"222\";2051;0;NA;0;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA
Normally I would just gsub a "\n" into the places I need it, but there is no recurring string at the positions where a "\n" belongs, so I don't think gsub would work in this instance.
Seeing how the missing values are clearly indicated with NA, is there a function similar to read_delim that has a "col_number = x" argument? As in: the first x values are the headers, the next x values are the values of the first row, and so on?
If it changes anything, these .txt files are rather big (>300 MB).
Big thank you to Julian_Hn. Works like a charm.
I would probably just read this in as a vector and then reformat it as a matrix with the number of columns you know are in the dataset. This essentially does what you want:
str <- "\"A\";\"B\";\"C\";\"D\";\"D\";\"E\";\"F\";\"G\";\"H\";\"I\";\"J\";\"K\";\"L\";\"M\";\"N\";\"O\";\"P\";\"Q\";\"R\";\"S\";\"T\";\"U\";\"V\";\"1\";4;\"55-555-5555-555\";1234-56-78;\"111\";1510;5;1234-12-17;12345.1234512345;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;\"2\";6;\"22-222-2222-222\";5678-56-78;\"222\";2051;0;NA;0;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA"
vec <- strsplit(str,";")[[1]]
# EDIT: added byrow = TRUE to keep the right format. Thanks Yuriy
table <- matrix(vec, ncol = 23, nrow = 3, byrow = TRUE)
df <- as.data.frame(table)
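Applied to the file on disk, the same idea might look like the sketch below; the file name and the column count of 23 are assumptions, and it presumes the values really are ";"-separated throughout, as in the reconstructed string above:
## read the entire file as one string, split on ";", then reshape
raw <- readChar("input.txt", file.info("input.txt")$size)   # "input.txt" is a placeholder
vec <- strsplit(raw, ";")[[1]]
tab <- matrix(vec, ncol = 23, byrow = TRUE)                  # 23 columns assumed
df <- as.data.frame(tab[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- tab[1, ]   # the first 23 values are the headers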
A strange problem with write and read.csv. I have ways to work around this, but it would be great if someone could identify what is going on.
I have code from someone else which dynamically creates a series of CSVs by appending new rows. The problem is that read.csv appears to read the newly created csv inconsistently.
Dummy code example:
datfile <- "E:/temp.csv"
write(paste("Name","tempname",sep=","),datfile,1)
write(paste("VShort",50,sep=","),datfile,1,append=T)
write(paste("Short1",1,1,sep=","),datfile,1,append=T)
write(paste("Short2",0,2,sep=","),datfile,1,append=T)
write(paste("Short3",0,2,sep=","),datfile,1,append=T)
write(paste("Long",0,0.3,0.6,1,sep=","),datfile,1,append=T)
write(paste("Short4",2,0,sep=","),datfile,1,append=T)
read.csv(datfile,header=F,colClasses="character")
Seven rows of data are written to the CSV, but read.csv reads in eight rows and three columns (the Long row is split over two rows).
The problem is fixed by opening temp.csv in Excel and saving it; read.csv then reads in the 7 lines correctly.
The problem only appears under certain conditions. For example, remove Short3 and there is no problem:
datfile2 <- "E:/temp2.csv"
write(paste("Name","tempname",sep=","),datfile2,1)
write(paste("VShort",50,sep=","),datfile2,1,append=T)
write(paste("Short1",1,1,sep=","),datfile2,1,append=T)
write(paste("Short2",0,2,sep=","),datfile2,1,append=T)
write(paste("Long",0,0.3,0.6,1,sep=","),datfile2,1,append=T)
write(paste("Short4",2,0,sep=","),datfile2,1,append=T)
read.csv(datfile2,header=F,colClasses="character")
Six rows and five columns are read in.
Any ideas what is going on here?
R version 3.2.4 Revised
Windows 10
This is probably related to the following in ?read.csv:
The number of data columns is determined by looking at the first five
lines of input (or the whole file if it has less than five lines), or
from the length of col.names if it is specified and is longer. This
could conceivably be wrong if fill or blank.lines.skip are true, so
specify col.names if necessary (as in the ‘Examples’).
It just happens that the row with the most columns is the sixth row in your first example.
I suggest using col.names to get around this, e.g.:
read.csv(datfile, header = FALSE, colClasses = "character", col.names = paste0('V', 1:6))
As the OP notes in a comment to this answer, you can find out the number of
columns required using readLines:
Ncol <- max(unlist(lapply(strsplit(readLines(datfile), ","), length)))
and then modify the above to give:
read.csv(datfile, header = FALSE, colClasses = "character", col.names = paste0("V", 1:Ncol))
I am aware that there are similar questions on this site, however, none of them seem to answer my question sufficiently.
This is what I have done so far:
I have a CSV file which I open in Excel. I manipulate the columns algebraically to obtain a new column "A". I import the file into R using read.csv(), and the entries in column A are stored as factors; I want them to be stored as numeric. I found this question on the topic:
Imported a csv-dataset to R but the values becomes factors
Following the advice, I include stringsAsFactors = FALSE as an argument in read.csv(); however, as Hong Ooi suggested in the page linked above, this doesn't cause the entries in column A to be stored as numeric values.
A possible solution is to use the advice given in the following page:
How to convert a factor to an integer\numeric without a loss of information?
however, I would like a cleaner solution, i.e. a way to import the file so that the entries of column A are stored as numeric values.
Cheers for any help!
Whatever algebra you are doing in Excel to create the new column could probably be done more effectively in R.
Please try the following: read the raw file (before any Excel manipulation) into R using read.csv(..., stringsAsFactors = FALSE). [If that does not work, take a look at ?read.table (which read.csv wraps); there may be some other underlying issue.]
For example:
delim = "," # or is it "\t" ?
dec = "." # or is it "," ?
myDataFrame <- read.csv("path/to/file.csv", header=TRUE, sep=delim, dec=dec, stringsAsFactors=FALSE)
Then, let's say your numeric column is column 4:
myDataFrame[, 4] <- as.numeric(myDataFrame[, 4]) # you can also refer to the column by "itsName"
Lastly, if you need any help with accomplishing in R the same tasks that you've done in Excel, there are plenty of folks here who would be happy to help you out
In read.table (and its relatives) it is the na.strings argument that specifies which strings are to be interpreted as the missing value NA. The default is na.strings = "NA".
If missing values in an otherwise numeric column are coded as something other than "NA", e.g. "." or "N/A", those entries are read as character, and then the whole column is converted to character.
Thus, if your missing values are coded as anything other than "NA", you need to list them in na.strings.
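For example (the file name and the extra missing-value codes are just for illustration):
## treat ".", "N/A" and the default "NA" as missing while reading
df <- read.csv("data.csv",
               na.strings = c("NA", "N/A", "."),
               stringsAsFactors = FALSE)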
If you're dealing with large datasets (i.e. datasets with a high number of columns), the solution noted above can be manually cumbersome, and requires you to know which columns are numeric a priori.
Try this instead.
char_data <- read.csv(input_filename, stringsAsFactors = FALSE)
num_data <- data.frame(data.matrix(char_data))
numeric_columns <- sapply(num_data, function(x) { mean(as.numeric(is.na(x))) < 0.5 })
final_data <- data.frame(num_data[, numeric_columns], char_data[, !numeric_columns])
The code does the following:
Imports your data as character columns.
Creates an instance of your data as numeric columns.
Identifies which columns from your data are numeric (assuming columns with less than 50% NAs upon converting your data to numeric are indeed numeric).
Merges the numeric and character columns into a final dataset.
This essentially automates the import of your .csv file by preserving the data types of the original columns (as character and numeric).
Including this in the read.csv command worked for me: strip.white = TRUE
(I found this solution here.)
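In full, that might look like this (file name assumed):
## stray leading/trailing whitespace can also keep a column from parsing as numeric
df <- read.csv("data.csv", strip.white = TRUE, stringsAsFactors = FALSE)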
A version for data.table, based on the code from dmanuge:
library(data.table)

convNumValues <- function(ds) {
  ds <- data.table(ds)
  dsnum <- data.table(data.matrix(ds))
  # a column counts as numeric if less than half of it becomes NA on conversion
  num_cols <- sapply(dsnum, function(x) { mean(as.numeric(is.na(x))) < 0.5 })
  nds <- data.table(dsnum[, .SD, .SDcols = names(num_cols)[which(num_cols)]],
                    ds[, .SD, .SDcols = names(num_cols)[which(!num_cols)]])
  return(nds)
}
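Usage would then be along these lines (mydata being a placeholder for a data frame read in with character columns):
nds <- convNumValues(mydata)   # numeric columns first, then character columns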
I had a similar problem. Based on Joshua's premise that Excel was the problem, I looked at the file and found that the numbers were formatted with commas between every third digit. Reformatting without the commas fixed the problem.
I had a similar situation with my data file: when I read it in as a CSV, all the numeric values were turned into character. In my file there was the word "Filtered" where NA should have been. I converted "Filtered" to NA in the vim editor of a Linux terminal with the command :%s/Filtered/NA/g, saved the file, and then read it into R; all the values were of type num, and no longer char.
It looks like the character value "Filtered" was forcing all values into character format.
Hello @Shawn Hemelstrand, here are the steps in detail:
The example matrix file.csv contains the word "Filtered".
I opened file.csv in a Linux terminal:
vi file.csv
Then press Esc followed by Shift+: and type the following command at the bottom:
%s/Filtered/NA/g
Press Enter.
Then press Esc followed by Shift+: again and type "wq" (this saves the file and quits the vim editor).
Then in the R script I read the file:
data <- read.csv("file.csv", sep = ',', header = TRUE)
str(data)
All columns that were previously char were now of type num.
In case you need more help, it would be easier to share your txt or csv file.
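For what it's worth, the same substitution can be done without editing the file at all, by listing "Filtered" among the strings read.csv treats as NA; a sketch using the same file:
data <- read.csv("file.csv", header = TRUE, na.strings = c("NA", "Filtered"))
str(data)   # the affected columns should now come in as num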
The problem
I have a tab delimited input file that looks like so:
Variable [1] Variable [2]
111 Something
Nothing 222
The first row contains the column names and the next two rows contain the column values. As you can see, the column names include both spaces and some tricky characters.
Now, what I want to do is to import this file into R and then output it again to a new text file, making it look exactly the same as the input. For this purpose I have created the following script (assuming that the input file is called "Test.txt"):
file <- "Test.txt"
x <- read.table(file, header = TRUE, sep = "\t")
write.table(x, file = "TestOutput.txt", sep = "\t", col.names = TRUE, row.names = FALSE)
From this, I get an output that looks like this:
"Variable..1." "Variable..2."
"1" "111" "Something"
"2" "Nothing" "222"
Now, there are a few problems with this output.
The "[" and "]" signs have been converted to dots.
The spaces have been converted to dots.
Quote signs have appeared everywhere.
How can I make the output file look exactly the same as the input file?
What I've tried so far
Regarding problems one and two, I've tried specifying the column names by creating a vector, c("Variable [1]", "Variable [2]"), and passing it to the col.names argument of read.table(). This gives me the exact same output. I've also tried different encodings, through the encoding argument of read.table(). If I print the vector mentioned above, it shows the variable names as they should be printed, so I guess something goes wrong in the conversion between the "text -> R" and "R -> text" phases of the process. That is, if I look at the data frame created by read.table() without any manually created vector, the column names are already wrong.
As for problem number three, I'm pretty much lost and haven't been able to figure out what I should try.
Given the following input file as test.txt:
Variable [1] Variable [2]
111 Something
Nothing 222
Where the columns are tab-separated, you can use the following code to create an exact copy:
a <- read.table(file = 'test.txt', check.names = FALSE, sep = '\t',
                header = TRUE, stringsAsFactors = FALSE)
write.table(x = a, file = 'test_copy.txt', quote = FALSE,
            row.names = FALSE, col.names = TRUE, sep = '\t')
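The two arguments doing the heavy lifting are check.names = FALSE, which stops read.table from passing the headers through make.names() (the source of the dots), and quote = FALSE, which stops write.table from quoting every field. As a quick sanity check of the round trip (assuming the two files above):
identical(readLines('test.txt'), readLines('test_copy.txt'))   # TRUE for an exact copy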