r - data.table fread: CSV file increases column amount

I am trying to read an immense amount of data from CSV files (around 80 million rows split across roughly 200 files).
Some of the files are not well structured: after a few hundred thousand rows, for some reason, the rows end with a comma (",") but carry no additional information after it. A short example to illustrate this behaviour:
a,b,c
1,2,3
d,e,f,
4,5,6,
The rows have 19 columns. I tried manually telling fread to read the data as 20 columns, using colClasses, col.names, and fill=TRUE:
all.files <- list.files(getwd(), full.names=TRUE, recursive=TRUE)
lapply(all.files, fread,
       select=c(5,6,9),
       col.names=paste0("V", seq_len(20)),
       #colClasses=c("V1"="character","V2"="character","V3"="integer"),
       colClasses=c(<all 20 data types, 20th arbitrarily as integer>),
       fill=TRUE)
Another workaround I tried was to not use fread at all:
data <- lapply(all.files, readLines)
data <- unlist(data)
data <- as.data.table(tstrsplit(data, ","))
data <- data[, c("V5","V6","V9"), with=FALSE]
However, this approach leads to "Error: memory exhausted", which I believe might be solved by reading only the required 3 columns instead of all 19.
Any hints on how to use fread for this scenario are greatly appreciated.

You can try using readr::read_csv as follows:
library(readr)
txt <- "a,b,c
1,2,3
d,e,f,
4,5,6,"
read_csv(txt)
results in the expected result:
# A tibble: 3 × 3
  a     b     c
  <chr> <chr> <chr>
1 1     2     3
2 d     e     f
3 4     5     6
And the following warning
Warning: 2 parsing failures.
row col  expected    actual
  2  -- 3 columns 4 columns
  3  -- 3 columns 4 columns
To only read specific columns use cols_only as follows:
read_csv(txt,
         col_types = cols_only(a = col_character(),
                               c = col_character()))
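Since you have ~200 files, the same approach scales with lapply. A minimal sketch, assuming the three columns you selected positionally (5, 6, 9) carry header names like col5, col6, and col9 (placeholder names, substitute your own); because cols_only() selects by name, the trailing-comma rows only produce parsing warnings:
library(readr)

all.files <- list.files(getwd(), full.names = TRUE, recursive = TRUE)

# col5/col6/col9 are hypothetical header names -- replace with the real
# names of columns 5, 6 and 9 in your files
data <- lapply(all.files, read_csv,
               col_types = cols_only(col5 = col_character(),
                                     col6 = col_character(),
                                     col9 = col_integer()))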

Related

read_delim( ) from tidyverse cannot directly correct misaligned headers of text file as basic read.table() does

I am trying to use tidyverse's read_delim() to read a tab-separated text file.
I can easily read the file with base R's read.table(), but when I tested read_delim() with delim = "\t" I ran into a problem. For example, I have the file below, "test.txt". As you can see, the header is shifted to the right, because the first column contains row names and has no header of its own.
T1 T2 T3
A 1 4 7
B 2 5 8
C 3 6 9
I can use base R to read this file successfully:
dat <- read.table("test.txt", header=T, sep="\t")
dat
  T1 T2 T3
A  1  4  7
B  2  5  8
C  3  6  9
But when I tried tidyverse's read_delim(), I ran into problems:
dat1 <- read_delim("test.txt", delim ="\t")
Rows: 3 Columns: 3
── Column specification ──────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): T1, T3
dbl (1): T2
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Warning message:
One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
I know base R's read.table() can automatically correct this problem, but could someone tell me if tidyverse's read_delim() has a way to resolve this issue?
Thank you!
-Xiaokuan
The issue isn’t exactly that the headers are misaligned - it’s that readr doesn’t support or recognize row names at all.* readr::read_delim() therefore doesn’t account for the fact that row names don’t have a column header, and just sees three column names followed by four columns of data.
If your goal is to import your data as a tibble, your best bet is probably to use base R's read.table() and then tibble::as_tibble(), using the rownames arg to convert the row names to a regular column.
library(tibble)
dat <- read.table("test.txt", header=T, sep="\t")
as_tibble(dat, rownames = "row")
# A tibble: 3 × 4
  row      T1    T2    T3
  <chr> <dbl> <dbl> <dbl>
1 A         1     4     7
2 B         2     5     8
3 C         3     6     9
Another option would be to manually edit your input file to include a column header above the row names.
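Alternatively, if you want to stay within readr, you can work around the missing header yourself: skip the header line and supply all four column names explicitly. A minimal sketch (the names are just taken from the example file):
library(readr)
dat1 <- read_delim("test.txt", delim = "\t", skip = 1,
                   col_names = c("row", "T1", "T2", "T3"))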
*This isn’t an oversight, by the way — it’s an intentional choice by the tidyverse team, as they believe row names to be bad practice. e.g., from the tibble docs: “Generally, it is best to avoid row names, because they are basically a character column with different semantics than every other column.” Also see this interesting discussion from the tibble github.

Separating samples when importing a csv in R

I'm new to R, but I want to use its statistics tools on some collected data. I'm trying to import raw data from an instrument's output, but to do so I need to strip out the useless comments left over from the machine's display and then separate the multiple samples into their own data frames. The data comes out as:
////this is some preamble
////for sample 1 that would graph
////data on the machines display
1 10
2 20
3 30
///This is the preamble
////for the second sample
1 11
2 19
3 32
4 41
5 50
////this is closing statements
////and final plot command
////for the machine's display
I'm currently trying to import it with whitespace delimiters. If I only had the one sample, I know I could just skip the first four lines and add the column titles later, as
library(readr)
DATA <- read_table2("DATA.txt", col_names = FALSE, skip = 4)
colnames(DATA) <- c("X","Y")
But I can't figure out how to separate sample 2 and the remainder of the unimportant text.
Another problem is that the separation between samples one and two happens on different lines depending on the file, so I figure I need to scan through the text file before even making tables.
I know this is a bit of a cluster, but I appreciate any help.
This should get you started. I'm using base R but you can always convert to a tibble later if you want.
DATA <- read.table("DATA.txt", header=FALSE, comment.char="/")  # drops the "////" preamble lines
colnames(DATA) <- c("X","Y")
begin <- which(DATA$X == 1)          # each sample's X column restarts at 1
end <- c(begin[-1] - 1, nrow(DATA))  # last row of each sample
groups <- mapply(":", begin, end)
DATA.lst <- lapply(groups, function(g) DATA[g, ])
names(DATA.lst) <- sprintf("Group%02i", seq_along(groups))
DATA.lst
# $Group01
# X Y
# 1 1 10
# 2 2 20
# 3 3 30
#
# $Group02
# X Y
# 4 1 11
# 5 2 19
# 6 3 32
# 7 4 41
# 8 5 50
DATA.lst is a list of data frames. You can extract them into separate objects with the loop below, but that may not be your best option if you plan to perform the same analyses on each: R makes it easy to process all of the data frames in a list at once, which saves you writing code for each one (see the sketch after the loop).
for (i in seq_along(DATA.lst)) assign(names(DATA.lst)[i], DATA.lst[[i]])
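More idiomatically, you can run the same analysis over every sample in the list with a single lapply call; for example (summary() here is just a stand-in for your actual analysis):
lapply(DATA.lst, function(d) summary(d$Y))  # e.g., summarise Y for each sample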

Read Excel file and select specific rows and columns

I want to read an xls file into R and select specific columns.
For example, I only want columns 1 to 10 and rows 5 to 700. I think you can do this with xlsx, but I can't use that library on the network I am using.
Is there another package that I can use? And how would I go about selecting the columns and rows that I want?
You can try this:
library(xlsx)
read.xlsx("my_path\\my_file.xlsx", sheetName = "sheet_name", rowIndex = 5:700, colIndex = 1:10)
Since you are unable to load the xlsx package, you might want to consider base R and read.csv(). For this, save your Excel file as a CSV; explanations of how to do this are easily found on the web. Note that CSV files can still be opened in Excel.
These are the steps you need to take to read only the 2nd and 3rd columns and rows:
hd <- read.csv('a.csv', header=FALSE, nrows=1, as.is=TRUE)           # first read the headers
removeCols <- c('NULL', NA, NA)  # 'NULL' drops a column, NA keeps it as-is
df <- read.csv('a.csv', skip=2, header=FALSE, colClasses=removeCols) # skip says which rows not to read
colnames(df) <- hd[is.na(removeCols)]
df
  two three
1   5     8
2   6     9
This is the example data I used.
a <- data.frame(one=1:3, two=4:6, three=7:9)
write.csv(a, 'a.csv', row.names=F)
read.csv('a.csv')
  one two three
1   1   4     7
2   2   5     8
3   3   6     9
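If you are able to install other packages at all, readxl is also worth a look: it has no Java dependency (unlike xlsx) and can read a rectangular block directly via its range argument. A sketch, with placeholder file and sheet names:
library(readxl)
# rows 5-700 of columns 1-10 (A:J), expressed as a cell range
dat <- read_excel("my_file.xlsx", sheet = "sheet_name",
                  range = "A5:J700", col_names = FALSE)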

Replace semicolon-separated values with tabs

I am trying to convert data that I have in a txt file:
4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;...
into a column (table) where the values are separated by tabs:
4.0945725440979
4.07999897003174
4.0686674118042...
So far I tried
mydata <- read.table("1.txt", header = FALSE)
separate_data<- strsplit(as.character(mydata), ";")
But it does not work; separate_data in this case consists of only 1 element:
[[1]]
[1] "1"
Based on the OP, it's not directly stated whether the raw data file contains multiple observations of a single variable, or should be broken into n-tuples. Since the OP does state that read.table results in a single row where s/he expects it to contain multiple rows, we can conclude that the correct technique is to use scan(), not read.table().
If the data in the raw data file represents a single variable, then the solution posted in the comments by @docendo works without additional effort. Otherwise, additional work is required to tidy the data.
Here is an approach using scan() that reads the file into a vector, and breaks it into observations containing 5 variables.
rawData <- "4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512"
value <- scan(textConnection(rawData), sep=";")
columns <- 5                                     # set desired # of columns
observations <- length(value) / columns
observation <- rep(1:observations, each=columns) # observation id for each value
variable <- rep(1:columns, times=observations)   # variable id for each value
data.frame(observation, variable, value)
...and the output:
> data.frame(observation, variable, value)
   observation variable    value
1            1        1 4.094573
2            1        2 4.079999
3            1        3 4.068667
4            1        4 4.059601
5            1        5 4.052183
6            2        1 4.094573
7            2        2 4.079999
8            2        3 4.068667
9            2        4 4.059601
10           2        5 4.052183
At this point the data can be converted into a wide format tidy data set with reshape2::dcast().
Note that this solution requires that the number of data values in the raw data file is evenly divisible by the number of variables.
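For completeness, a minimal sketch of that reshaping step, applied to the long data frame built above:
library(reshape2)
long <- data.frame(observation, variable, value)
wide <- dcast(long, observation ~ variable, value.var = "value")
wide
#   observation        1        2        3        4        5
# 1           1 4.094573 4.079999 4.068667 4.059601 4.052183
# 2           2 4.094573 4.079999 4.068667 4.059601 4.052183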

How to read a badly formatted CSV file with multiple embedded data sets and non-printing characters

I need to open a CSV file that was exported with the options shown in a screenshot (not reproduced here). Here is a link to my files; you can try with the file "20140313_Helix2_FP140_SC45.csv":
https://www.dropbox.com/sh/i5y8r8g7wymalw8/AABXsLkbpowxGObFpGHgv4m-a?dl=0
I have tried many options with read.table and read.csv, but I need a data frame with more than one column, with the data properly separated.
It looks like captured printer output. But it's not too messy:
# read it in as raw lines
lines <- readLines("20140313_Helix2_FP140_SC45.csv")
I'm assuming you want the "frequency point" data (it's the most prevalent) so we find the first one of those:
start <- which(grepl("^FREQUENCY POINTS:", lines))[1]
The rest of the file is "regular" enough to just look for lines beginning with a number (i.e. the PNT column) and read that in, giving it saner column names than the read.table defaults:
dat <- read.table(textConnection(grep("^[0-9]+", lines[start:length(lines)], value=TRUE)),
                  col.names=c("PNT", "FREQ", "MAGNITUDE"))
And, here's the result:
head(dat)
##   PNT     FREQ MAGNITUDE
## 1   1 0.800000   -19.033
## 2   2 0.800125   -19.038
## 3   3 0.800250   -19.071
## 4   4 0.800375   -19.092
## 5   5 0.800500   -19.137
## 6   6 0.800625   -19.167
nrow(dat)
## [1] 1601
The # of rows matches (from what I can tell) the # of frequency point records.
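If you need the other embedded data sets as well, the same pattern extends naturally: find every marker line, split the file into blocks, and parse each block. A sketch, assuming each block runs from one "FREQUENCY POINTS" marker to the line before the next:
starts <- which(grepl("^FREQUENCY POINTS:", lines))
ends <- c(starts[-1] - 1, length(lines))
blocks <- Map(function(s, e) {
  read.table(textConnection(grep("^[0-9]+", lines[s:e], value = TRUE)),
             col.names = c("PNT", "FREQ", "MAGNITUDE"))
}, starts, ends)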
