I am working with text files of many long rows with a varying number of elements. The elements in each row are separated by \t and the rows are terminated by \n. I'm using read.table to read the text files. An example sample file is here: https://www.dropbox.com/s/6utslbnwerwhi58/samplefile.txt
The sample file has 60 rows.
Code to read the file:
sampleData <- read.table("samplefile.txt", as.is=TRUE, fill = TRUE);
dim(sampleData);
dim reports 70 rows when in fact there should be 60. When I try nrows = 60, like
sampleData <- read.table("samplefile.txt", as.is=TRUE, fill = TRUE, nrows = 60);
dim(sampleData);
it does work; however, I don't know whether doing so discards some of the information. My suspicion is that the last portions of some of the rows are wrapped onto new rows. I don't know why that would happen, though, since I have fill = TRUE.
I have also tried
na.strings = "NA", fill=TRUE, strip.white=TRUE, blank.lines.skip =
TRUE, stringsAsFactors=FALSE, quote = "", comment.char = ""
but to no avail.
Does anyone have any idea what might be going on?
In the absence of a reproducible example, try something like this:
# Make some fake data
R <- c("1 2 3 4","2 3 4","4 5 6 7 8")
writeLines(R, "samplefile.txt")
# read line by line
r <- readLines("samplefile.txt")
# split by sep
sp <- strsplit(r, " ")
# Turn each row into a one-row data frame (rbind.fill needs data frames)
sp <- lapply(sp, function(x) as.data.frame(t(x)))
# now bind
library(plyr)
rbind.fill(sp)
If this is similar to your actual problem, anyway.
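As for why the extra rows appear at all: with fill = TRUE, read.table determines the number of columns from the first five lines of the file, so any later row with more fields than that gets wrapped onto a new row. A minimal sketch of a workaround, sizing the table to the widest row with count.fields (the V1, V2, ... column names are made up here):
# find the widest row, then tell read.table to allow that many columns
n_fields <- max(count.fields("samplefile.txt", sep = "\t"))
sampleData <- read.table("samplefile.txt", sep = "\t", as.is = TRUE, fill = TRUE,
                         col.names = paste0("V", seq_len(n_fields)))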
I have a text file of names, separated by commas, and I want to read it into R (a data frame or a vector is fine). When I try read.csv, it reads the names in as headers for separate columns, with 0 rows of data. With header = FALSE it reads them in as separate columns of one row. I could work with this, but what I really want is a single column with one row per name. As it stands, when I try to print the data frame, it prints all the column headers, which are useless, and then doesn't print any values. One column of names would be much easier to work with.
Since the OP asked me to, I'll post the comment above as an answer.
It's very simple, and it comes from some practice in reading in sequences of data, numeric or character, using scan.
dat <- scan(file = your_filename, what = 'character', sep = ',')
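If you then want the one-column data frame described in the question, wrapping the vector is enough (a minimal sketch):
# dat is the character vector returned by scan(); one row per name
df <- data.frame(names = dat, stringsAsFactors = FALSE)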
You can use read.csv and let it read the string as a header, then just extract the names (using names()) and put them into a data.frame:
data.frame(x = names(read.csv("FILE")))
For example:
write.table("qwerty,asdfg,zxcvb,poiuy,lkjhg,mnbvc",
"FILE", col.names = FALSE, row.names = FALSE, quote = FALSE)
data.frame(x = names(read.csv("FILE")))
x
1 qwerty
2 asdfg
3 zxcvb
4 poiuy
5 lkjhg
6 mnbvc
Something like that?
Make some test data:
# test data
list_of_names <- c("qwerty","asdfg","zxcvb","poiuy","lkjhg","mnbvc" )
list_of_names <- paste(list_of_names, collapse = ",")
list_of_names
# write to temp file
tf <- tempfile()
writeLines(list_of_names, tf)
You need this part:
# read from file
line_read <- readLines(tf)
line_read
list_of_names_new <- unlist(strsplit(line_read, ","))
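The unlisted result should then be a plain character vector, one element per name:
list_of_names_new
# [1] "qwerty" "asdfg"  "zxcvb"  "poiuy"  "lkjhg"  "mnbvc"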
I have a very large data set that for illustrative purposes looks something like the following.
Cust_ID , Sales_Assistant , Store
123 , Mary, Worthington, 22
456 , Jack, Charles , 42
The real data has many more columns and millions of rows. I'm using the following code to import it into R but it is falling over because one or more of the columns has a comma in the data (see Sales_Assistant above).
df <- read.csv("C:/dataextract.csv", header = TRUE , as.is = TRUE , sep = "," , na.strings = "NA" , quote = "" , fill = TRUE , dec = "." , allowEscapes = FALSE , row.names=NULL)
Adding row.names=NULL imported all the data but it split the Sales_Assistant column over two columns and threw all the other data out of alignment. If I run the code without this I get an error...
Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed
...and the data won't load.
Can you think of a way around this that doesn't involve tackling the data at source, or opening it in a text editor? Is there a solution in R?
First and foremost, it is a csv file, so "Mary, Worthington" will be parsed as two columns. If you have commas in your values, consider saving the data as tsv (tab-separated values) instead.
However, if your data has the same number of commas in every row, so that the columns still line up, I would consider ignoring the first row (which becomes the column names as you read the file) and reassigning proper column names afterwards.
For instance, in your case you can replace Sales_Assistant by
Sales_Assistant_First_Name, Sales_Assistant_Last_Name
which makes perfect sense. Then I could basically do
df <- df[-1, ]
colnames(df) <- c("Cust_ID" , "Sales_Assistant_First_Name" , "Sales_Assistant_Last_Name", "Store")
df <- read.csv("C:/dataextract.csv", skip = 1, header = FALSE)
df_cnames <- read.csv("C:/dataextract.csv", nrow = 1, header = FALSE)
df <- within(df, V2V3 <- paste(V2, V3, sep = ''))
df <- subset(df, select = (c("V1", "V2V3", "V4")))
colnames(df) <- df_cnames
This may need some modification depending on the actual source file.
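An alternative sketch using the readLines/strsplit approach, assuming every data row has exactly four comma-separated fields (the name being split across fields 2 and 3):
lines <- readLines("C:/dataextract.csv")[-1]  # drop the header line
parts <- strsplit(lines, ",")
df <- data.frame(
  Cust_ID         = trimws(sapply(parts, `[`, 1)),
  Sales_Assistant = paste(trimws(sapply(parts, `[`, 2)),
                          trimws(sapply(parts, `[`, 3))),
  Store           = trimws(sapply(parts, `[`, 4)),
  stringsAsFactors = FALSE
)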
I have a big data set consisting of around 94 columns and 3 million rows. The file has single as well as multiple spaces as the delimiter between columns. I need to read some of the columns from this file into R. For this I tried read.table() with the options shown in the code below:
### Define the columns to be read: keep the first 5, skip the next 24, keep the next 5, and skip the last 60
col_classes = c(rep("character",2), rep("numeric", 3), rep("NULL",24), rep("numeric", 5), rep("NULL", 60))
### Reading first 100 rows of the data
data <- read.table(file, sep = " ", header = FALSE, nrows = 100,
                   na.strings = "", colClasses = col_classes, stringsAsFactors = FALSE)
Since the file to be read has more than one space as the delimiter between some of the columns, the above method does not work. Is there a method with which this file can be read in efficiently?
You need to change your delimiter. sep = " " refers to exactly one space character; sep = "" treats any run of whitespace as the delimiter:
data <- read.table(file, sep = "", header = FALSE, nrows = 100,
                   na.strings = "", stringsAsFactors = FALSE)
From the manual:
If sep = "" (the default for read.table) the separator is ‘white space’, that is one or more spaces, tabs, newlines or carriage returns.
Also, with a large data file you may want to consider data.table::fread, which reads data straight into a data.table very quickly. I was using it myself this morning. It is still marked experimental, but I find it works very well indeed.
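For example (a sketch: fread detects the separator itself, and select keeps just the wanted columns by position, here columns 1-5 and 30-34 to match the col_classes above):
library(data.table)
data <- fread(file, nrows = 100, select = c(1:5, 30:34))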
If you want to use the tidyverse (specifically the readr package) instead, you can use read_table:
read_table(file, col_names = TRUE, col_types = NULL,
locale = default_locale(), na = "NA", skip = 0, n_max = Inf,
guess_max = min(n_max, 1000), progress = show_progress(), comment = "")
And see here in the description:
read_table() and read_table2() are designed to read the type of textual data where each column is separated by one (or more) columns of space.
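A minimal usage sketch (read_table2() is the more forgiving of the two when the columns are not vertically aligned):
library(readr)
data <- read_table2(file, col_names = FALSE, n_max = 100)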
If your fields have a fixed width, you should consider using read.fwf(), which might handle missing values better.
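A sketch, with made-up field widths:
# widths gives the character width of each field, in order
data <- read.fwf(file, widths = c(8, 12, 10), header = FALSE, n = 100)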
I'm working with 12 large data files, all of which hover between 3 and 5 GB, so I turned to RSQLite for import and initial selection. Giving a reproducible example in this case is difficult, so if you can come up with anything, that would be great.
If I take a small set of the data, read it in, and write it to a table, I get exactly what I want:
con <- dbConnect("SQLite", dbname = "R2")
f <- file("chr1.ld")
open(f)
data <- read.table(f, nrow=100, header=TRUE)
dbWriteTable(con, name = "Chr1test", value = data)
> dbListFields(con, "Chr1test")
[1] "row_names" "CHR_A" "BP_A" "SNP_A" "CHR_B" "BP_B" "SNP_B" "R2"
> dbGetQuery(con, "SELECT * FROM Chr1test LIMIT 2")
row_names CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
1 1 1 1579 SNP-1.578. 1 2097 SNP-1.1096. 0.07223050
2 2 1 1579 SNP-1.578. 1 2553 SNP-1.1552. 0.00763724
If I read in all of my data directly to a table, though, my columns aren't separated correctly. I've tried both sep = " " and sep = "\t", but both give the same column separation
dbWriteTable(con, name = "Chr1", value ="chr1.ld", header = TRUE)
> dbListFields(con, "Chr1")
[1] "CHR_A_________BP_A______________SNP_A__CHR_B_________BP_B______________SNP_B___________R
I can tell that it's clearly some sort of delimiter issue, but I've exhausted my ideas on how to fix it. Has anyone run into this before?
Edit, update:
It seems as though this works:
n <- 1000000
f <- file("chr1.ld")
open(f)
# read the first chunk with the header row, keeping the column names
data <- read.table(f, nrow = n, header = TRUE)
cnames <- colnames(data)
con_data <- dbConnect("SQLite", dbname = "R2")
while (nrow(data) == n) {
  dbWriteTable(con_data, name = "ch1", value = data, append = TRUE)
  # later chunks have no header line, so reuse the saved column names
  data <- read.table(f, nrow = n, header = FALSE, col.names = cnames)
}
close(f)
if (nrow(data) != 0) {
  dbWriteTable(con_data, name = "ch1", value = data, append = TRUE)
}
Though I can't quite figure out why just writing the table through SQLite is a problem. Possibly a memory issue.
I am guessing that your big file is causing a free memory issue (see Memory Usage under docs for read.table). It would have been helpful to show us the first few lines of chr1.ld (on *nix systems you just say "head -n 5 chr1.ld" to get the first five lines).
If it is a memory issue, then you might try sipping the file as a work-around rather than gulping it whole.
Determine or estimate the number of lines in chr1.ld (on *nix systems, say "wc -l chr1.ld").
Let's say your file has 100,000 lines.
# read the header once, then sip the file in chunks of sip.size rows,
# skipping the header line plus the rows already read
sip.size <- 100
cnames <- colnames(read.table("chr1.ld", nrow = 1, header = TRUE))
for (i in seq(0, 100000, sip.size)) {
  data <- read.table("chr1.ld", nrow = sip.size, skip = i + 1,
                     header = FALSE, col.names = cnames)
  dbWriteTable(con, name = "SippyCup", value = data, append = TRUE)
}
You'll probably see warnings at the end but the data should make it through. If you have character data that read.table is trying to factor, this kludge will be unsatisfactory unless there are only a few factors, all of which are guaranteed to occur in every chunk. You may need to tell read.table not to factor those columns or use some other method to look at all possible factors so you can list them for read.table. (On *nix, split out one column and pipe it to uniq.)
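For instance (a sketch), stringsAsFactors = FALSE keeps read.table from creating factors as each chunk is read:
data <- read.table("chr1.ld", nrow = sip.size, skip = i + 1, header = FALSE,
                   col.names = cnames, stringsAsFactors = FALSE)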