I would like to read a large text file separated by "|", so I used the code below.
sampleData <- read.table(file = '2013_4MM01_7-11_CD.txt', header = TRUE, sep = '|', nrows = 10)

library(ff)
classes <- sapply(sampleData, class)  # assumed: column classes derived from the sampled rows
pos <- read.table.ffdf(file = '2013_4MM01_7-11_CD.txt', header = TRUE, VERBOSE = TRUE,
                       FUN = 'read.table', sep = '|',
                       first.rows = 10000, next.rows = 50000, colClasses = classes)
I used "ff" package and checked my data after the running code ended. My data had some long numeric variable like "201304012371090245546" and the object read from the data was wrong. My ffdf object contained many duplicated rows and even a number that was not in the original txt file. I checked this by SAS. Please give me some useful advice.
I am currently trying to analyze Twitter data collected via Python, which is saved as a tab-delimited CSV file. However, a problem occurs when I try to read it into R.
The data comprises 8 columns (e.g., col1: Twitter ID, col2: date of tweets, ..., col4: tweet messages, ..., col9: location information).
So I expect each row to contain information for those eight columns for all data points. However, for some reason, in col4, which should contain only the tweet message, multiple fields (i.e., information from other rows' col1 through col8) end up in that one specific cell.
Below is the code I tried running. It is strange, as this problem does not occur when I read the same CSV file in Python. I just have no clue what is happening. Has anyone encountered a similar issue?
data <- read.csv("Blacklives.csv", header = F, sep = '\t')
data <- read.csv2("Blacklives.csv", header = F, sep = '\t')
data <- read.delim2("Blacklives.csv", header = F, sep = '\t')
To provide a bit more information about the error: the initial data generated by Python is a tab-delimited .csv file (it looks a bit odd, but that is how Python saved it). However, when I read this data into R (I read the data in R and re-saved it to .csv just to show the rows with the problematic cells), some cells, i.e., cells that should each contain a single person's tweet, end up holding a large chunk of information from other rows in that one cell.
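One likely cause, just a guess since the raw file is not shown: tweets often contain stray quote characters, and read.csv/read.delim treat " as a quoting character by default, so an apparently quoted stretch of text can swallow tab separators and line breaks and merge several records into one cell. A minimal sketch that disables quoting when reading:

# Disable quote handling so embedded " characters in tweets cannot
# swallow tab separators or line breaks and merge rows into one cell.
data <- read.delim("Blacklives.csv", header = FALSE, sep = "\t",
                   quote = "", comment.char = "", stringsAsFactors = FALSE)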
My intention is to format a simple matrix as CSV. Since I need to further process the formatted lines, I don't want to write the formatted string to a file.
I already tried using textConnection, which seems to be the right approach.
m<-matrix(c(1,2,3,4), nrow=2)
result<-write.csv(m, file=textConnection(csvData), row.names=FALSE, col.names=FALSE)
I'm expecting csvData to contain the contents of the formatted csv (file) as a vector containing the lines.
I get the error:
Error in textConnection(csvData) : invalid 'text' argument
What is the proper usage of textConnection?
Revised question
After some experimenting and clearing all variables, I ended up with:
rm(list = ls())
m<-matrix(c(1,2,3,4), nrow=2)
result<-write.csv(m, file=textConnection("csvData", "w"), row.names=FALSE, col.names=FALSE)
This at least produces no errors, but I get a warning that col.names is ignored. The content of csvData is also not what I expected:
> csvData
[1] "\"V1\",\"V2\"" "1,3" "2,4"
How do I remove the header?
My solution
After more trying, I found that write.csv should be replaced by write.table:
rm(list = ls())
m<-matrix(c(1,2,3,4), nrow=2)
result<-write.table(m, file=textConnection("csvData", "w"), row.names=FALSE, col.names=FALSE, sep=";")
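A slightly tidier variant of the same idea, sketched here rather than tested against your exact setup: keep the connection in a variable so it can be closed explicitly, or skip the connection entirely with capture.output():

m <- matrix(c(1, 2, 3, 4), nrow = 2)

# Variant 1: explicit text connection, closed when finished
con <- textConnection("csvData", "w")
write.table(m, file = con, row.names = FALSE, col.names = FALSE, sep = ";")
close(con)
csvData
# [1] "1;3" "2;4"

# Variant 2: capture the lines printed to stdout, no connection bookkeeping
csvData2 <- capture.output(
  write.table(m, row.names = FALSE, col.names = FALSE, sep = ";")
)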
Try write.matrix() from MASS:
MASS::write.matrix(m, file = textConnection("csvData", "w"), sep = ";")
One of the columns in my data frame contains a semicolon (;), and when I try to write the data frame out to a CSV using fwrite, that value gets split across different columns.
For example, the input value abcd;#6 ends up after writing as: 1st column: abcd, 2nd column: #6.
I want both to be in the same column.
Could you please suggest how to write the value within a single column?
I am using below code to read the input file:
InpData <- read.table(File01, header=TRUE, sep="~", stringsAsFactors = FALSE,
fill=TRUE, quote="", dec=",", skipNul=TRUE, comment.char="")
while for writing:
fwrite(InpData, File01, col.names=T, row.names=F, quote = F, sep="~")
You didn't give us an example, but it is possible you need to use a different separator than ";"
fwrite(x, file = "", sep = ",")
sep: The separator between columns. Default is ",".
If this simple solution does not work, we need the data to reproduce your problem.
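Another possibility, guessing because the data isn't shown: if the written file is later opened by a tool that splits fields on ";", quoting the fields when writing can keep abcd;#6 together. fwrite has a quote argument for this; a minimal sketch with a made-up one-row table:

library(data.table)

# Hypothetical data illustrating the problem value
InpData <- data.table(id = 1L, val = "abcd;#6")

# Quote fields so downstream tools do not split on the embedded ";"
fwrite(InpData, "File01_out.csv", sep = "~", quote = TRUE,
       col.names = TRUE, row.names = FALSE)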
I am trying to read a tab-delimited file with read.table using the following command:
df <- read.table("input.txt", header=FALSE, sep="\t", quote="", comment.char="",
encoding="utf-8")
There should be 30 million rows. However, after read.table() returns, df contains only ~6 million rows, and there is a warning message:
Warning message:
In read.table("input.txt", header = FALSE, sep = "\t", quote = "", :
incomplete final line found by readTableHeader on 'input.txt'
I believe read.table quits after encountering a special symbol (ASCII code 1A, Substitute) in one of the string columns. In the input file, the only special character should be the tab, because it is used to separate columns. Is there any way to tell read.table to treat every other character as not special?
If you have 30 million rows, I would use fread rather than read.table; it is faster. Learn more about it here: http://www.inside-r.org/packages/cran/data.table/docs/fread
fread(input, sep="auto", encoding = "UTF-8" )
Regarding your issue with read.table, I think the solutions here should solve it:
'Incomplete final line' warning when trying to read a .csv file into R
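If the stray SUB character (0x1A) really is the cause, one workaround, sketched but not tested on your file, is to read the file as raw bytes so nothing can be treated as an end-of-file marker, drop that byte, and parse the cleaned text:

# Read raw bytes, strip the SUB (0x1A) byte, then parse as tab-separated text.
raw_bytes <- readBin("input.txt", what = "raw", n = file.info("input.txt")$size)
txt <- rawToChar(raw_bytes[raw_bytes != as.raw(0x1a)])
df <- read.table(text = txt, header = FALSE, sep = "\t",
                 quote = "", comment.char = "")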
I was hoping there might be a way to do this, but after trying for a while I have had no luck.
I am working with a data file (.csv format) that is supplied with multiple tables in a single file. Each table has its own header row and associated data. Is there a way to import this file and create a separate data frame for each header/dataset?
Any help or ideas that can be provided would be greatly appreciated.
A sample of the data file and its structure can be found here.
When trying to use read.csv I get the following error:
"Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names"
Read the help for read.table:
nrows: number of rows to parse
skip: number of rows to skip
You can parse your file as follows:
first <- read.table(myFile, nrows=2)
second <- read.table(myFile, skip=3, nrows=2)
third <- read.table(myFile, skip=6, nrows=8)
You can always automate this by using grep() to search for the table separators; see the sketch below.
You can also read the whole file with fill=TRUE and then split out the tables afterwards.
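A rough sketch of the grep() automation, assuming each sub-table begins with a header line you can match by a pattern (the pattern "Header" and the file name "myFile.csv" below are placeholders; use whatever actually marks your header rows):

# Locate each table's header line, then parse each block separately.
lines  <- readLines("myFile.csv")           # placeholder file name
starts <- grep("Header", lines)             # hypothetical pattern for header rows
ends   <- c(starts[-1] - 1, length(lines))

tables <- Map(function(s, e) {
  read.csv(text = lines[s:e], header = TRUE)
}, starts, ends)

# tables is now a list of data frames, one per sub-table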