How to read when delimiter is space and missing values are blank? - r

I have a space-delimited file in which some columns are blank, so we end up with multiple consecutive spaces. fread fails with an error, but read.table works fine. See the example:
library(data.table)
# R version 3.4.2 (2017-09-28)
# data.table_1.10.4-3
fread("A B C D
1 2  3
4 5 6 7", sep = " ", header = TRUE)
Error in fread("A B C D\n1 2  3\n4 5 6 7") :
Expected sep (' ') but new line, EOF (or other non printing character) ends field 2 when detecting types from point 0: 1 2  3
read.table(text ="A B C D
1 2  3
4 5 6 7", sep = " ", header = TRUE)
# A B C D
# 1 1 2 NA 3
# 2 4 5 6 7
How do we read this using fread? I tried setting sep = " " and na.strings = "", but it didn't help.

In the fread function, strip.white is set to TRUE by default, meaning leading and trailing spaces are removed. That is useful for reading fixed-width files or files with an irregular number of spaces as the separator.
In read.table, by contrast, strip.white is set to FALSE by default.
fread("A B C D
1 2  3
4 5 6 7", sep = " ", header = TRUE, strip.white = FALSE)
# A B C D
# 1: 1 2 NA 3
# 2: 4 5 6 7
Note: I am providing a self-answer because I couldn't find a relevant post, and this has tripped me up more than once.
Edit: This no longer works as of data.table_1.12.2; see the related GitHub issue.
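Since the strip.white = FALSE approach reportedly stopped working in later data.table versions, one possible fallback is to parse with base R. The following is a minimal sketch, not part of the original answer: it shows that splitting on a single space turns two consecutive spaces into an empty field, which read.table then reads as NA. If a data.table is needed afterwards, the resulting data frame could presumably be converted with data.table::as.data.table().

```r
# Sample input: the second line has a blank third field (two consecutive spaces).
txt <- "A B C D\n1 2  3\n4 5 6 7"

# Splitting each line on a single literal space shows the empty field.
fields <- strsplit(strsplit(txt, "\n", fixed = TRUE)[[1]], " ", fixed = TRUE)
n_fields <- lengths(fields)   # one count per line
blank_row <- fields[[2]]      # contains an empty string where C should be

# read.table with sep = " " treats every single space as a delimiter,
# so the empty field becomes NA in column C.
df <- read.table(text = txt, sep = " ", header = TRUE)
print(df)
```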

Related

Adding Headers to Dataframe in R

I have a script written in R that is run weekly and produces a csv. I need to add headers above some of the column names, as they are grouped together.
Header1 Header2
A B C D E F
1 2 3 4 5 6
7 8 9 a b c
In this example ABC columns are under the "Header1" header, and DEF are under the "Header2" header. Obviously this can be done manually but I was curious if there was a package that can do this. "No" is an acceptable answer.
EDIT: I should have added that the file can also be an xlsx. Initially I write most of my files out as CSVs, since they usually get used by a script again at some point.
It is a bit ugly, but you can do this in a csv as long as you do not need any merged cells. I used data.table in my example, but I am pretty sure any other writing function will work as long as you write the header row with append = FALSE and col.names = FALSE, and then the data with both set to TRUE. Reading it back is also a bit ugly, but you can skip the first row.
dt <- fread("A B C D E F
1 2 3 4 5 6
7 8 9 a b c")
fwrite(data.table(t(c("Header1", NA, NA, "Header2", NA, NA))), "test.csv", append = FALSE, col.names = FALSE)
fwrite(dt, "test.csv", append = TRUE, col.names = TRUE)
fread("test.csv")
# V1 V2 V3 V4 V5 V6
# 1: Header1 Header2
# 2: A B C D E F
# 3: 1 2 3 4 5 6
# 4: 7 8 9 a b c
fread("test.csv", skip = 1L)
# A B C D E F
# 1: 1 2 3 4 5 6
# 2: 7 8 9 a b c
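The same two-step write can be sketched with base R's write.table, as a hedged illustration of the "any other writing function" remark above. The data frame dat, the group_row matrix, and the temporary file path are stand-ins made up for this example. write.table warns when column names are appended to an existing file, so the second call is wrapped in suppressWarnings.

```r
# Illustrative data matching the example table (names are assumptions).
dat <- data.frame(A = c(1, 7), B = c(2, 8), C = c(3, 9),
                  D = c("4", "a"), E = c("5", "b"), F = c("6", "c"))

# One-row matrix holding the group headers; NA marks the blank cells.
group_row <- t(c("Header1", NA, NA, "Header2", NA, NA))

path <- tempfile(fileext = ".csv")

# Step 1: write the group header row only (no column names, new file).
write.table(group_row, path, sep = ",", na = "", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Step 2: append the real column names and the data.
suppressWarnings(
  write.table(dat, path, sep = ",", quote = FALSE, row.names = FALSE,
              col.names = TRUE, append = TRUE)
)

written <- readLines(path)
print(written)
```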
If you want your header information back, you can do something like this: read the first line, find the positions of the headers, and extract the headers themselves.
headers <- strsplit(readLines("test.csv", n = 1L), ",")[[1]]
which(headers != "")
# [1] 1 4
headers[which(headers != "")]
# [1] "Header1" "Header2"
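Building on the header positions, a possible next step is to assign every column to its group by carrying each header forward over the blank cells. This is a sketch under assumptions: the file layout is recreated in a temporary file, and the first cell of the header row is assumed to be non-empty. Note that strsplit drops a trailing empty field, so the header vector is padded to the number of data columns first.

```r
# Recreate the two-level file in a temporary location (illustrative only).
path <- tempfile(fileext = ".csv")
writeLines(c("Header1,,,Header2,,", "A,B,C,D,E,F",
             "1,2,3,4,5,6", "7,8,9,a,b,c"), path)

# Read the data, skipping the group header row.
dat <- read.csv(path, skip = 1L)

# Read the group header row and pad it to one entry per column.
headers <- strsplit(readLines(path, n = 1L), ",")[[1]]
headers <- c(headers, rep("", ncol(dat) - length(headers)))

# cumsum() gives the index of the most recent non-empty header per column.
group <- headers[headers != ""][cumsum(headers != "")]
mapping <- setNames(group, names(dat))
print(mapping)
```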

Reading in a table with non-standard row names in R?

I have a .txt file that looks like this:
xyz ghj asd qwe
a / b: 1 2 3 4
c / d: 5 6 7 8
e / f: 9 10 11 12
...
...
I'm trying to use read.table(header = T), but it seems to be misinterpreting the row names. Is there a way to deal with this in read.table(), or should I just use readLines()?
There is no option to just skip a few characters in each row using a read.table option.
Instead, you can call read.table twice, once for all the data after the first row, and the second time for the header.
Where your data are in a file called "test.txt", you would do:
library(magrittr)
tmp <- read.table(file="test.txt", sep="", stringsAsFactors = FALSE, skip=1)[, -c(1:3)] %>%
setNames(read.table(file="test.txt", sep="", stringsAsFactors = FALSE, nrows=1))
> tmp
xyz ghj asd qwe
1 1 2 3 4
2 5 6 7 8
3 9 10 11 12
>
Package magrittr is what gives you the pipe operator %>% that allows you to read the data and the header separately, but put them together in a single line. If you have a sufficiently-new R version you can use the |> operator instead, without the magrittr package.
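For comparison, the readLines() route mentioned in the question could look roughly like the sketch below, which also keeps the labels as row names. It assumes that each data line has exactly one label ending at the first colon; the lines vector stands in for readLines("test.txt").

```r
# Stand-in for readLines("test.txt").
lines <- c("xyz ghj asd qwe",
           "a / b: 1 2 3 4",
           "c / d: 5 6 7 8",
           "e / f: 9 10 11 12")

# The first line holds the column names.
header <- strsplit(trimws(lines[1]), "\\s+")[[1]]
body <- lines[-1]

# Everything before the first colon is the row label; the rest is data.
row_labels <- sub(":.*$", "", body)
values <- sub("^[^:]*:", "", body)

tmp <- read.table(text = values, col.names = header, row.names = row_labels)
print(tmp)
```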

Forcing read_delim in readr to treat multiple " and \ as part of column string

Given a ; delimited file of structure:
colA; colB; colC
1;A; 10
2;B; 11
3;C"; 12
4;D""; 15
5;"F";20
6;K"""; 21
7;""M";22
8; \""O;23
I would like to ensure that colB is always imported verbatim as a character string. In particular, I would like to preserve all values including ""M" and \""O.
Attempt
I'm currently trying:
require(readr)
tst_dta <- read_delim(
file = "test_file.csv",
escape_double = FALSE,
delim = ";",
col_types = cols(
colA = col_integer(),
colB = col_character(),
colC = col_integer()
)
)
but this returns:
> tst_dta
# A tibble: 8 x 3
colA colB colC
<int> <chr> <int>
1 1 A 10
2 2 B NA
3 3 "C\"" 12
4 4 "D\"\"" 15
5 5 F 20
6 6 "K\"\"\"" 21
7 7 "\"\"M\"" 22
8 8 " \\\"\"O" 23
Desired results
The desired results should reflect:
colA colB colC
<int> <chr> <int>
1 A 10
2 B 11
3 C" 12
4 D"" 15
5 "F" 20
6 K""" 21
7 ""M" 22
8 \""O 23
Other points:
Ideally, I would also like to ensure that non-ASCII characters are ignored, so that the value \""[Non-ASCII-Character]O would appear in the resulting data frame as the string \""O.
Updates
As per comments, more examples:
is:
colA; colB; colC
1; text \" text; 2
should be:
colA;colB;colC
1;text text;2
is:
colA; colB; colC
1; text \;" text; 2
should be:
colA;colB;colC
1;text text;2
is:
colA; colB; colC
1; [non-ASCII] text something \;" text; 2
should be:
colA;colB;colC
1;text something;2
If you need to use readr functions, look at the argument list and see whether it has an equivalent of the quote argument in read.table, which gives simple access to this behavior:
read.table(text=txt, header=TRUE, quote="", sep=";")
colA colB colC
1 1 A 10
2 2 B 11
3 3 C" 12
4 4 D"" 15
5 5 "F" 20
6 6 K""" 21
7 7 ""M" 22
8 8 ""O 23
Seems like it should succeed, since it's the third argument in readr::read_delim. The default in both cases is "\"" which is a single double-quote. Set it to an empty character (""):
Usage
read_delim(file, delim, quote = "\"", escape_backslash = FALSE,
escape_double = TRUE, col_names = TRUE, col_types = NULL,
locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
comment = "", trim_ws = FALSE, skip = 0, n_max = Inf,
guess_max = min(1000, n_max), progress = show_progress())
And this is the print representation of the result. I would note that this print representation seems a bit irregular: character values are enclosed in double quotes only if they have embedded double quotes, i.e. \". On the other hand, such columns are character, which is a nice change from the default settings in read.table, which give you factor columns:
read_delim(file=txt, quote="", delim=";")
# A tibble: 8 x 3
colA ` colB` ` colC`
<int> <chr> <chr>
1 1 A " 10"
2 2 B " 11 "
3 3 "C\"" " 12"
4 4 "D\"\"" " 15"
5 5 "\"F\"" 20
6 6 "K\"\"\"" " 21"
7 7 "\"\"M\"" 22
8 8 " \"\"O" 23
You are hereby warned that using this option with read_delim does mean that neither column names nor values are trimmed to remove whitespace. As a result, colC comes in as character even though it would otherwise be read as integer. Notice the name of your second column. That does not occur with read.table:
read_delim(file=txt, quote="", delim=";")$` colB` ==
read.table(text=txt, header=TRUE, quote="", sep=";")$colB
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Further gsub-processing would be needed if you wanted leading or trailing whitespace to be removed. rm_non_ascii in pkg {qdapRegex} can remove non-ASCII characters.
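As a rough base-R sketch of that clean-up (an alternative to gsub and qdapRegex, not the method from the answer), trimws() can strip the surrounding whitespace and iconv() with sub = "" can drop bytes that have no ASCII equivalent. The txt sample is a shortened copy of the file, and the string x with an accented character is a made-up example.

```r
# Shortened copy of the sample file; single quotes keep the double quotes literal.
txt <- paste(c('colA; colB; colC',
               '1;A; 10',
               '3;C"; 12',
               '5;"F";20',
               '8; \\""O;23'), collapse = "\n")

# quote = "" disables quote handling, so colB is read verbatim.
dta <- read.table(text = txt, header = TRUE, sep = ";", quote = "",
                  stringsAsFactors = FALSE)

# Remove leading and trailing whitespace from the character column.
dta$colB <- trimws(dta$colB)

# Hypothetical value containing a non-ASCII character (e with acute accent).
x <- '\\""\u00e9O'

# Convert to ASCII, replacing non-convertible bytes with nothing.
clean <- iconv(enc2utf8(x), from = "UTF-8", to = "ASCII", sub = "")
```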

Generate warning in R for more than one items

I have a question about generating a warning for more than one item at a time in R. Please refer to the following data frame and code:
Dataframe dat:
inputs var1 var2
A 1 a 1
B 2 b 3
B 3 b NA
C 4 d NA
C 5 e 4
if (any(duplicated(dat$inputs))==T){
warning(paste("The following inputs: ", dat$inputs[duplicated(dat$inputs)],"is duplicated.",sep=""))
}
As you can see both B and C will be shown in the warning, like:
Warning message:
The following inputs: B is duplicated.The following inputs: C is duplicated.
I'm okay with such warning message output, but it is not ideal. Is there a way to combine the two sentences and make it look like:
Warning message:
The following inputs: B,C are duplicated.
Thanks a lot in advance for your attention and time.
Helene
I couldn't get your code to run, so I modified your code and made up some data.
dat = read.table(text = "
inputs var1 var2 var3
A 1 a 1
B 2 b 3
B 3 b NA
C 4 d NA
C 5 e 4", header = T)
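# Flag rows whose input value has already appeared.
is_dup <- duplicated(dat$inputs)
if (any(is_dup)) {
  # Distinct values that occur more than once.
  dups <- unique(dat$inputs[is_dup])
  if (length(dups) > 1) {
    warning(paste0("The following inputs: ", paste0(dups, collapse = ", "), " are duplicated."))
  } else {
    warning(paste0("The following input: ", paste0(dups, collapse = ", "), " is duplicated."))
  }
}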
Warning message:
The following inputs: B, C are duplicated.
Single duplicate
dat = read.table(text = "
inputs var1 var2 var3
A 1 a 1
A 2 b 3
E 3 b NA
C 4 d NA
G 5 e 4", header = T)
Warning message:
The following input: A is duplicated.
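The singular/plural branching can also be folded into one call with base R's ngettext(), which picks between two message templates based on a count. The helper name warn_duplicates below is hypothetical and only a sketch of the idea.

```r
# Hypothetical helper: warn once, listing every duplicated value.
warn_duplicates <- function(x) {
  dups <- unique(x[duplicated(x)])
  if (length(dups) > 0) {
    # ngettext() returns the first template for n == 1, otherwise the second.
    template <- ngettext(length(dups),
                         "The following input: %s is duplicated.",
                         "The following inputs: %s are duplicated.")
    warning(sprintf(template, paste(dups, collapse = ", ")), call. = FALSE)
  }
  invisible(dups)
}

# Capture the warning text instead of printing it.
msg_many <- tryCatch(warn_duplicates(c("A", "B", "B", "C", "C")),
                     warning = function(w) conditionMessage(w))
msg_one <- tryCatch(warn_duplicates(c("A", "A", "E", "C", "G")),
                    warning = function(w) conditionMessage(w))
```

Calling warn_duplicates(dat$inputs) directly would raise the warning in the usual way; the tryCatch calls are only there to show the generated text.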

R read.table loops row column entries to next row

This is the first time I have encountered this problem using read.table: for rows with a very large number of columns, read.table wraps the column entries onto the following rows.
I have a .txt file with rows of variable and unequal length. For reference this is the .txt file I am reading: http://www.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/4.0/c5.bp.v4.0.symbols.gmt
Here is my code:
tabsep <- gsub("\\\\t", "\t", "\\t")
MSigDB.collection = read.table(fileName, header = FALSE, fill = TRUE, as.is = TRUE, sep = tabsep)
Partial output: first columns
V1 V2 V3 V4 V5 V6
1 TRNA_PROCESSING http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING ADAT1 TRNT1 FARS2
2 REGULATION_OF_BIOLOGICAL_QUALITY http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY DLC1 ALS2 SLC9A7
3 DNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS XRCC5 XRCC4 RAD51C
4 AMINO_SUGAR_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS UAP1 CHIA GNPDA1
5 BIOPOLYMER_CATABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS BTRC HNRNPD USE1
6 RNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD SYNCRIP
7 INTS6 LSM5 LSM4 LSM3 LSM1
8 CRK
9 GLUCAN_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/GLUCAN_METABOLIC_PROCESS GCK PYGM GSK3B
10 PROTEIN_POLYUBIQUITINATION http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_POLYUBIQUITINATION ERCC8 HUWE1 DZIP3
...
Partial output: last columns
V403 V404 V405 V406 V407 V408 V409 V410 V411 V412 V413 V414 V415 V416 V417 V418 V419 V420 V421
1
2 CALCA CALCB FAM107A CDK11A RASGRP4 CDK11B SYN3 GP1BA TNN ENO1 PTPRC MTL5 ISOC2 RHAG VWF GPI HPX SLC5A7 F2R
3
4
5
6 IRF2 IRF3 SLC2A4RG LSM6 XRCC6 INTS1 HOXD13 RP9 INTS2 ZNF638 INTS3 ZNF254 CITED1 CITED2 INTS9 INTS8 INTS5 INTS4 INTS7
7 POU1F1 TCF7L2 TNFRSF1A NPAS2 HAND1 HAND2 NUDT21 APEX1 ENO1 ERF DTX1 SOX30 CBY1 DIS3 SP1 SP2 SP3 SP4 NFIC
8
9
10
For instance, the column entries for row 6 get wrapped to fill rows 7 and 8. I only seem to have this problem for rows with a very large number of columns. This occurs for other .txt files as well, but it breaks at different column numbers. I inspected all the row entries where the break happens and there are no unusual characters in the entries (they are all standard upper case gene symbols).
I have tried both read.table and read.delim with the same result. If I convert the .txt file to .csv first and use the same code, I do not have this problem (see below for the equivalent output). But I don't want to convert each file to .csv first, and really I just want to understand what is going on.
Correct output if I convert to .csv file:
MSigDB.collection = read.table(fileName, header = FALSE, fill = TRUE, as.is = TRUE, sep = ",")
V1 V2 V3 V4 V5 V6
1 TRNA_PROCESSING http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING ADAT1 TRNT1 FARS2 METTL1
2 REGULATION_OF_BIOLOGICAL_QUALITY http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY DLC1 ALS2 SLC9A7 PTGS2
3 DNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS XRCC5 XRCC4 RAD51C XRCC3
4 AMINO_SUGAR_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS UAP1 CHIA GNPDA1 GNE
5 BIOPOLYMER_CATABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS BTRC HNRNPD USE1 RNASEH1
6 RNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD SYNCRIP MED24
7 GLUCAN_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/GLUCAN_METABOLIC_PROCESS GCK PYGM GSK3B EPM2A
8 PROTEIN_POLYUBIQUITINATION http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_POLYUBIQUITINATION ERCC8 HUWE1 DZIP3 DDB2
9 PROTEIN_OLIGOMERIZATION http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_OLIGOMERIZATION SYT1 AASS TP63 HPRT1
To elaborate on my comment...
From the help page to read.table:
The number of data columns is determined by looking at the first five lines of input (or the whole file if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary (as in the ‘Examples’).
To work around this with unknown datasets, use count.fields to determine the number of fields in each line of the file, and use the maximum to create col.names for read.table to use:
x <- max(count.fields("~/Downloads/c5.bp.v4.0.symbols.gmt", "\t"))
Names <- paste("V", sequence(x), sep = "")
y <- read.table("~/Downloads/c5.bp.v4.0.symbols.gmt", col.names=Names, sep = "\t", fill = TRUE)
Inspect the first few lines. I'll leave the actual full inspection to you.
y[1:6, 1:10]
# V1
# 1 TRNA_PROCESSING
# 2 REGULATION_OF_BIOLOGICAL_QUALITY
# 3 DNA_METABOLIC_PROCESS
# 4 AMINO_SUGAR_METABOLIC_PROCESS
# 5 BIOPOLYMER_CATABOLIC_PROCESS
# 6 RNA_METABOLIC_PROCESS
# V2 V3 V4
# 1 http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING ADAT1 TRNT1
# 2 http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY DLC1 ALS2
# 3 http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS XRCC5 XRCC4
# 4 http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS UAP1 CHIA
# 5 http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS BTRC HNRNPD
# 6 http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD
# V5 V6 V7 V8 V9 V10
# 1 FARS2 METTL1 SARS AARS THG1L SSB
# 2 SLC9A7 PTGS2 PTGS1 MPV17 SGMS1 AGTR1
# 3 RAD51C XRCC3 XRCC2 XRCC6 ISG20 PRIM1
# 4 GNPDA1 GNE CSGALNACT1 CHST2 CHST4 CHST5
# 5 USE1 RNASEH1 RNF217 ISG20 CDKN2A CPA2
# 6 SYNCRIP MED24 RORB MED23 REST MED21
nrow(y)
# [1] 825
Here's a minimal example for those who don't want to download the other file to try it out.
Create a 6-line CSV file where the last line has more fields than the first 5 lines and try to use read.table on it:
cat("1,2,3,4", "1,2,3,4", "1,2,3,4", "1,2,3,4",
"1,2,3,4", "1,2,3,4,5", file = "test1.txt",
sep = "\n")
read.table("test1.txt", header = FALSE, sep = ",", fill = TRUE)
# V1 V2 V3 V4
# 1 1 2 3 4
# 2 1 2 3 4
# 3 1 2 3 4
# 4 1 2 3 4
# 5 1 2 3 4
# 6 1 2 3 4
# 7 5 NA NA NA
Note the difference when the longest line is within the first five lines of the file:
cat("1,2,3,4", "1,2,3,4,5", "1,2,3,4", "1,2,3,4",
"1,2,3,4", "1,2,3,4", file = "test2.txt",
sep = "\n")
read.table("test2.txt", header = FALSE, sep = ",", fill = TRUE)
# V1 V2 V3 V4 V5
# 1 1 2 3 4 NA
# 2 1 2 3 4 5
# 3 1 2 3 4 NA
# 4 1 2 3 4 NA
# 5 1 2 3 4 NA
# 6 1 2 3 4 NA
To fix the problem, we use count.fields which returns a vector of the number of fields detected in each line. We take the max from that and pass it on to a col.names argument for read.table.
x <- count.fields("test1.txt", sep=",")
x
# [1] 4 4 4 4 4 5
read.table("test1.txt", header = FALSE, sep = ",", fill = TRUE,
col.names = paste("V", sequence(max(x)), sep = ""))
# V1 V2 V3 V4 V5
# 1 1 2 3 4 NA
# 2 1 2 3 4 NA
# 3 1 2 3 4 NA
# 4 1 2 3 4 NA
# 5 1 2 3 4 NA
# 6 1 2 3 4 5
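The two steps above can be wrapped in a small helper so the column count is always taken from the whole file. The function name read_ragged is hypothetical, and this is only a sketch: na.rm = TRUE is added because count.fields can return NA for lines that fall inside a multi-line quoted string.

```r
# Hypothetical helper: size the table from the widest line in the file.
read_ragged <- function(file, sep = ",", ...) {
  n <- max(count.fields(file, sep = sep), na.rm = TRUE)
  read.table(file, header = FALSE, sep = sep, fill = TRUE,
             col.names = paste0("V", seq_len(n)), ...)
}

# Same layout as test1.txt: the longest line comes after the first five.
path <- tempfile(fileext = ".txt")
writeLines(c("1,2,3,4", "1,2,3,4", "1,2,3,4", "1,2,3,4",
             "1,2,3,4", "1,2,3,4,5"), path)

res <- read_ragged(path)
print(res)
```

For the tab-separated .gmt file from the question, the call would presumably be read_ragged(fileName, sep = "\t", as.is = TRUE).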
