I have a big data set with around 94 columns and 3 million rows. The file uses single as well as multiple spaces as the delimiter between columns. I need to read some of the columns from this file into R. I tried read.table() with the options shown in the code below:
### Defining the columns to be read from the file: the first 5 columns are read, the next 24 are skipped, then the next 5 columns are read, and the last 60 columns are not read in
col_classes = c(rep("character",2), rep("numeric", 3), rep("NULL",24), rep("numeric", 5), rep("NULL", 60))
### Reading first 100 rows of the data
data <- read.table(file, sep = " ",header = F, nrows = 100, na.strings ="", stringsAsFactors= F)
Since the file has more than one space as the delimiter between some of the columns, the above approach does not work. Is there a method that can read this file in efficiently?
You need to change your delimiter. " " refers to a single space character, whereas "" treats whitespace of any length as the delimiter:
data <- read.table(file, sep = "" , header = F , nrows = 100,
na.strings ="", stringsAsFactors= F)
From the manual:
If sep = "" (the default for read.table) the separator is ‘white space’, that is one or more spaces, tabs, newlines or carriage returns.
Also, with a large data file you may want to consider data.table::fread to quickly read the data straight into a data.table. I was using this function myself this morning. It is still experimental, but I find it works very well indeed.
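A rough sketch of that (select keeps columns by position; whether fread autodetects runs of spaces as a single separator depends on your data.table version, so test it on a few rows first):
library(data.table)
dt <- fread(file, nrows = 100, header = FALSE,
            select = c(1:5, 30:34),   # columns 1-5 and 30-34, matching col_classes above
            na.strings = "")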
If you want to use the tidyverse (or, more specifically, readr) package instead, you can use read_table.
read_table(file, col_names = TRUE, col_types = NULL,
locale = default_locale(), na = "NA", skip = 0, n_max = Inf,
guess_max = min(n_max, 1000), progress = show_progress(), comment = "")
And see the description:
read_table() and read_table2() are designed to read the type of textual data where each column is separated by one (or more) columns of space.
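For example (a quick sketch; if the columns are not neatly aligned, read_table2() may be the more forgiving choice):
library(readr)
dat <- read_table(file, col_names = FALSE, na = "", n_max = 100)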
If your fields have a fixed width, you should consider using read.fwf(), which might handle missing values better.
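A minimal read.fwf sketch; the widths below are invented, so replace them with the real fixed widths of your 94 columns (a negative width tells read.fwf to skip that field):
dat <- read.fwf(file,
                widths = c(rep(8, 5), rep(-10, 24), rep(10, 5), rep(-10, 60)),   # made-up widths
                n = 100, na.strings = "", stringsAsFactors = FALSE)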
Related
I have a very large data set that for illustrative purposes looks something like the following.
Cust_ID , Sales_Assistant , Store
123 , Mary, Worthington, 22
456 , Jack, Charles , 42
The real data has many more columns and millions of rows. I'm using the following code to import it into R but it is falling over because one or more of the columns has a comma in the data (see Sales_Assistant above).
df <- read.csv("C:/dataextract.csv", header = TRUE , as.is = TRUE , sep = "," , na.strings = "NA" , quote = "" , fill = TRUE , dec = "." , allowEscapes = FALSE , row.names=NULL)
Adding row.names=NULL imported all the data but it split the Sales_Assistant column over two columns and threw all the other data out of alignment. If I run the code without this I get an error...
Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed
...and the data won't load.
Can you think of a way around this that doesn't involve tackling the data at source, or opening it in a text editor? Is there a solution in R?
First and foremost, it is a csv file, so "Mary, Worthington" is read as two columns. If you have commas in your values, consider saving the data as tsv (tab-separated values) instead.
However, if your data has the same number of commas per row, with good alignment in some sense, I would consider skipping the first row (the column names) when reading the file and reassigning proper column names afterwards.
For instance, in your case you can replace Sales_Assistant by
Sales_Assistant_First_Name, Sales_Assistant_Last_Name
which makes perfect sense. Then I could basically do
df <- df[-1, ]
colnames(df) <- c("Cust_ID" , "Sales_Assistant_First_Name" , "Sales_Assistant_Last_Name", "Store")
df <- read.csv("C:/dataextract.csv", skip = 1, header = FALSE)
df_cnames <- read.csv("C:/dataextract.csv", nrow = 1, header = FALSE)
df <- within(df, V2V3 <- paste(V2, V3, sep = ''))
df <- subset(df, select = (c("V1", "V2V3", "V4")))
colnames(df) <- df_cnames
This may need some modification depending on the actual source file.
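Another minimal sketch of the same idea, forcing four columns up front via col.names (the SA_* names are my own) and then gluing the split name back together:
df <- read.csv("C:/dataextract.csv", skip = 1, header = FALSE,
               col.names = c("Cust_ID", "SA_First", "SA_Last", "Store"),
               strip.white = TRUE, stringsAsFactors = FALSE)
df$Sales_Assistant <- paste(df$SA_First, df$SA_Last)   # e.g. "Mary Worthington"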
Is it possible in R to save a data frame (or data.table) into a text file that uses different separators for different columns?
For example:
Column1[TAB]Column2[,]Column3 ?
[] indicate the separators, here a TAB and comma.
The function write.table accepts only one separator.
MASS::write.matrix can do the trick:
require(MASS)
m <- matrix(1:12, ncol = 3)
write.matrix(m, file = "", sep = c("\\tab", ","), blocksize = 1)   # "\\tab" is a visible stand-in; use "\t" for a real tab
returns
1\tab5,9
2\tab 6,10
3\tab 7,11
4\tab 8,12
but as the documentation of this function does not say that multiple separators are allowed, it may be safer to do it yourself, just in case the above has side effects.
For example,
seps <- c("\\tab", ",", "\n")
# the extra "" after x makes cat() use all three separators, so each row ends
# with the "\n"; otherwise the rows would run together on a single line
invisible(apply(m, 1, function(x, seps)
  cat(x, "", file = "", sep = seps, append = TRUE), seps = seps))
returns
1\tab5,9
2\tab6,10
3\tab7,11
4\tab8,12
Be aware that append is set to TRUE, so if the output file already exists, the new rows will be appended to it rather than replacing it.
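Another way to roll it yourself, using the same toy matrix m: build each output line with paste() and write them all at once (again, "\\tab" is only a visible stand-in; use "\t" for a real tab):
lines <- paste(m[, 1], m[, 2], sep = "\\tab")   # first separator
lines <- paste(lines, m[, 3], sep = ",")        # second separator
writeLines(lines)                               # or writeLines(lines, "out.txt")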
I have a large number of files, each in tab-delimited format. I need to apply some modeling (glm/gbm etc.) to each of these files. They come from hospital data where, in exceptional cases, entries may not be in the proper format. For example, when entering the glucose level for a patient, the data entry operator may enter N or A by mistake instead of an actual number.
While reading these files in a loop, I am encountering a problem: such columns (glucose) are treated as a factor when they should be numeric. It is painful to investigate each file and look for errors. I am reading the files in the following way, but it is certainly not a good approach.
read.table(fn, header = TRUE, sep= "\t" , na.strings = c('', 'NEG', 'TR', 'NA', '<NA>', "Done", "D", "A"))
Is there any other function through which I can treat those values/outliers as NA?
You can check which elements are valid numbers (for the glucose case):
data = read.csv(file, as.is = TRUE, sep = '\t') # don't convert strings to factors
glucose = data$glucose
sapply(glucose, function(x) !is.na(as.numeric(x)), USE.NAMES = FALSE)
Then you can work with these indexes (interpolate or remove).
To loop the files:
files = list.files(path, '*.csv')
for (file in files)
{
data = read.csv(file, sep = '\t', as.is = TRUE)
gluc = data$glucose
idxs = sapply(gluc, function(x)!is.na(as.numeric(x)), USE.NAMES = FALSE)
# interpolate or remove here
}
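For the "interpolate or remove here" step, one minimal option is to coerce the column so the flagged entries become NA, or to drop those rows (a sketch):
  data$glucose <- suppressWarnings(as.numeric(gluc))   # non-numbers become NA
  # data <- data[idxs, ]                                # or keep only the clean rows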
Use the colClasses argument to read.table and friends to specify which columns should be numeric, so R does not need to try and guess; combined with na.strings, the known junk codes become NA. Note that in a numeric column a value not covered by na.strings will stop the read with an error naming the offending value, rather than silently becoming NA, which at least points you straight at the problem file.
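A sketch combining that with the na.strings list from the question (only the glucose column is declared here; "N" is added because the question mentions it):
dat <- read.table(fn, header = TRUE, sep = "\t",
                  colClasses = c(glucose = "numeric"),
                  na.strings = c('', 'NEG', 'TR', 'NA', '<NA>', "Done", "D", "A", "N"))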
I am working with text files of many long rows with a varying number of elements. The elements in the rows are separated by \t and the rows are terminated by \n. I'm using read.table to read the text files. An example sample file is here: https://www.dropbox.com/s/6utslbnwerwhi58/samplefile.txt
The sample file has 60 rows.
Code to read the file:
sampleData <- read.table("samplefile.txt", as.is=TRUE, fill = TRUE);
dim(sampleData);
dim() reports 70 rows when in fact there should be 60. When I add nrows = 60, as in
sampleData <- read.table("samplefile.txt", as.is=TRUE, fill = TRUE, nrows = 60);
dim(sampleData);
it does work; however, I don't know whether doing so discards some of the information. My suspicion is that the last portions of some of the rows are being wrapped onto new rows, though I don't know why that would happen, as I have fill = TRUE.
I have also tried
na.strings = "NA", fill=TRUE, strip.white=TRUE, blank.lines.skip =
TRUE, stringsAsFactors=FALSE, quote = "", comment.char = ""
but to no avail.
Does anyone have any idea what might be going on?
In the absence of a reproducible example, try something like this:
# Make some fake data
R <- c("1 2 3 4","2 3 4","4 5 6 7 8")
writeLines(R, "samplefile.txt")
# read line by line
r <- readLines("samplefile.txt")
# split by sep
sp <- strsplit(r, " ")
# Make each into a list of dataframes (for rbind.fill)
sp <- lapply(sp, function(x) as.data.frame(t(x)))
# now bind
library(plyr)
rbind.fill(sp)
If this is similar to your actual problem, anyway.
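As for why 70 rows appear in the first place: read.table determines the number of columns from the first five lines of input (or from col.names if that is longer, see ?read.table), so with fill = TRUE any later row wider than that guess gets wrapped onto extra rows. A quick diagnostic is to count the fields per line:
table(count.fields("samplefile.txt", sep = "\t"))
If the maximum is larger than what read.table guessed, supplying col.names (or colClasses) of that length keeps each record on a single row.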