R read.table loops row column entries to next row - r

This is the first time I have encountered this problem with read.table: for rows with a very large number of columns, read.table wraps the extra column entries onto the following rows.
I have a .txt file with rows of variable and unequal length. For reference this is the .txt file I am reading: http://www.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/4.0/c5.bp.v4.0.symbols.gmt
Here is my code:
tabsep <- gsub("\\\\t", "\t", "\\t")
MSigDB.collection = read.table(fileName, header = FALSE, fill = TRUE, as.is = TRUE, sep = tabsep)
Partial output: first columns
V1 V2 V3 V4 V5 V6
1 TRNA_PROCESSING http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING ADAT1 TRNT1 FARS2
2 REGULATION_OF_BIOLOGICAL_QUALITY http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY DLC1 ALS2 SLC9A7
3 DNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS XRCC5 XRCC4 RAD51C
4 AMINO_SUGAR_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS UAP1 CHIA GNPDA1
5 BIOPOLYMER_CATABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS BTRC HNRNPD USE1
6 RNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD SYNCRIP
7 INTS6 LSM5 LSM4 LSM3 LSM1
8 CRK
9 GLUCAN_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/GLUCAN_METABOLIC_PROCESS GCK PYGM GSK3B
10 PROTEIN_POLYUBIQUITINATION http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_POLYUBIQUITINATION ERCC8 HUWE1 DZIP3
...
Partial output: last columns
V403 V404 V405 V406 V407 V408 V409 V410 V411 V412 V413 V414 V415 V416 V417 V418 V419 V420 V421
1
2 CALCA CALCB FAM107A CDK11A RASGRP4 CDK11B SYN3 GP1BA TNN ENO1 PTPRC MTL5 ISOC2 RHAG VWF GPI HPX SLC5A7 F2R
3
4
5
6 IRF2 IRF3 SLC2A4RG LSM6 XRCC6 INTS1 HOXD13 RP9 INTS2 ZNF638 INTS3 ZNF254 CITED1 CITED2 INTS9 INTS8 INTS5 INTS4 INTS7
7 POU1F1 TCF7L2 TNFRSF1A NPAS2 HAND1 HAND2 NUDT21 APEX1 ENO1 ERF DTX1 SOX30 CBY1 DIS3 SP1 SP2 SP3 SP4 NFIC
8
9
10
For instance, the column entries for row 6 get wrapped to fill rows 7 and 8. I only seem to have this problem for rows with a very large number of columns. It occurs for other .txt files as well, but it breaks at different column numbers. I inspected all the row entries where the break happens and there are no unusual characters in them (they are all standard upper-case gene symbols).
I have tried both read.table and read.delim with the same result. If I convert the .txt file to .csv first and use the same code, I do not have this problem (see below for the equivalent output). But I don't want to convert each file to .csv first, and really I just want to understand what is going on.
Correct output if I convert to .csv file:
MSigDB.collection = read.table(fileName, header = FALSE, fill = TRUE, as.is = TRUE, sep = ",")
V1 V2 V3 V4 V5 V6
1 TRNA_PROCESSING http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING ADAT1 TRNT1 FARS2 METTL1
2 REGULATION_OF_BIOLOGICAL_QUALITY http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY DLC1 ALS2 SLC9A7 PTGS2
3 DNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS XRCC5 XRCC4 RAD51C XRCC3
4 AMINO_SUGAR_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS UAP1 CHIA GNPDA1 GNE
5 BIOPOLYMER_CATABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS BTRC HNRNPD USE1 RNASEH1
6 RNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD SYNCRIP MED24
7 GLUCAN_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/GLUCAN_METABOLIC_PROCESS GCK PYGM GSK3B EPM2A
8 PROTEIN_POLYUBIQUITINATION http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_POLYUBIQUITINATION ERCC8 HUWE1 DZIP3 DDB2
9 PROTEIN_OLIGOMERIZATION http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_OLIGOMERIZATION SYT1 AASS TP63 HPRT1

To elaborate on my comment...
From the help page to read.table:
The number of data columns is determined by looking at the first five lines of input (or the whole file if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary (as in the ‘Examples’).
To work around this with unknown datasets, use count.fields to determine the number of fields in each line of the file, and use the maximum to create col.names for read.table to use:
x <- max(count.fields("~/Downloads/c5.bp.v4.0.symbols.gmt", "\t"))
Names <- paste("V", sequence(x), sep = "")
y <- read.table("~/Downloads/c5.bp.v4.0.symbols.gmt", col.names=Names, sep = "\t", fill = TRUE)
Inspect the first few lines. I'll leave the actual full inspection to you.
y[1:6, 1:10]
# V1
# 1 TRNA_PROCESSING
# 2 REGULATION_OF_BIOLOGICAL_QUALITY
# 3 DNA_METABOLIC_PROCESS
# 4 AMINO_SUGAR_METABOLIC_PROCESS
# 5 BIOPOLYMER_CATABOLIC_PROCESS
# 6 RNA_METABOLIC_PROCESS
# V2 V3 V4
# 1 http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING ADAT1 TRNT1
# 2 http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY DLC1 ALS2
# 3 http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS XRCC5 XRCC4
# 4 http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS UAP1 CHIA
# 5 http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS BTRC HNRNPD
# 6 http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD
# V5 V6 V7 V8 V9 V10
# 1 FARS2 METTL1 SARS AARS THG1L SSB
# 2 SLC9A7 PTGS2 PTGS1 MPV17 SGMS1 AGTR1
# 3 RAD51C XRCC3 XRCC2 XRCC6 ISG20 PRIM1
# 4 GNPDA1 GNE CSGALNACT1 CHST2 CHST4 CHST5
# 5 USE1 RNASEH1 RNF217 ISG20 CDKN2A CPA2
# 6 SYNCRIP MED24 RORB MED23 REST MED21
nrow(y)
# [1] 825
Here's a minimal example for those who don't want to download the other file to try it out.
Create a 6-line CSV file where the last line has more fields than the first 5 lines and try to use read.table on it:
cat("1,2,3,4", "1,2,3,4", "1,2,3,4", "1,2,3,4",
"1,2,3,4", "1,2,3,4,5", file = "test1.txt",
sep = "\n")
read.table("test1.txt", header = FALSE, sep = ",", fill = TRUE)
# V1 V2 V3 V4
# 1 1 2 3 4
# 2 1 2 3 4
# 3 1 2 3 4
# 4 1 2 3 4
# 5 1 2 3 4
# 6 1 2 3 4
# 7 5 NA NA NA
Note the difference when the longest line is within the first five lines of the file:
cat("1,2,3,4", "1,2,3,4,5", "1,2,3,4", "1,2,3,4",
"1,2,3,4", "1,2,3,4", file = "test2.txt",
sep = "\n")
read.table("test2.txt", header = FALSE, sep = ",", fill = TRUE)
# V1 V2 V3 V4 V5
# 1 1 2 3 4 NA
# 2 1 2 3 4 5
# 3 1 2 3 4 NA
# 4 1 2 3 4 NA
# 5 1 2 3 4 NA
# 6 1 2 3 4 NA
To fix the problem, we use count.fields which returns a vector of the number of fields detected in each line. We take the max from that and pass it on to a col.names argument for read.table.
x <- count.fields("test1.txt", sep=",")
x
# [1] 4 4 4 4 4 5
read.table("test1.txt", header = FALSE, sep = ",", fill = TRUE,
           col.names = paste("V", sequence(max(x)), sep = ""))
# V1 V2 V3 V4 V5
# 1 1 2 3 4 NA
# 2 1 2 3 4 NA
# 3 1 2 3 4 NA
# 4 1 2 3 4 NA
# 5 1 2 3 4 NA
# 6 1 2 3 4 5
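The two steps above (count the fields, then pass explicit col.names) can be wrapped in a small helper. This is only a sketch: read_ragged is a hypothetical name, not part of base R, and it assumes the whole file can be scanned twice and that the same sep and quote settings suit both count.fields and read.table.

```r
# Hypothetical helper: read a delimited file whose rows have unequal lengths.
read_ragged <- function(path, sep = "\t", quote = "\"'") {
  # Number of fields on every line (NA can appear inside multi-line quotes).
  n_fields <- count.fields(path, sep = sep, quote = quote)
  # One column name per field of the longest line.
  col_names <- paste0("V", seq_len(max(n_fields, na.rm = TRUE)))
  read.table(path, header = FALSE, sep = sep, quote = quote, fill = TRUE,
             col.names = col_names, stringsAsFactors = FALSE)
}

# Demo file: the longest line is the sixth, outside the first five lines.
demo <- tempfile(fileext = ".txt")
writeLines(c("a\tb", "c\td", "e\tf", "g\th", "i\tj", "k\tl\tm\tn"), demo)
res <- read_ragged(demo)
dim(res)
```

With the explicit col.names, the sixth line stays on one row instead of spilling into a seventh.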

Related

R Transform number/character to be a superscript or subscript

I want to append a number (or character, it doesn't matter) as a superscript to an existing string.
This is what my initial dataframe looks like:
testframe = as.data.frame(c("A34", "B21", "C64", "D83", "E92", "F24"))
testframe$V2 = c(1,3,2,2,3,NA)
colnames(testframe)[1] = "V1"
V1 V2
1 A34 1
2 B21 3
3 C64 2
4 D83 2
5 E92 3
6 F24 NA
What I want to do now is to use V2 as a kind of "footnote", as either a superscript or a subscript (it doesn't matter which one). When there is no entry in V2, I just want to keep V1 as it is.
I found a similar question where I saw this answer:
> paste0("H", "\u2081", "O")
[1] "H₁O"
This is what I want, but my problem is that it has to be created automatically since I have way too many rows in my real dataframe.
I tried to add an extra column "V3" to enter the Superscripts and Subscripts Codes:
testframe$V3 = c("u2081", "u2083", "u2082", "u2082", "u2083", NA)
V1 V2 V3
1 A34 1 u2081
2 B21 3 u2083
3 C64 2 u2082
4 D83 2 u2082
5 E92 3 u2083
6 F24 NA <NA>
But when I try paste(testframe$V1, testframe$V3, sep = "\") it gives me an error. How can I use the \ in this case?
If the subscripts will always be digits 0 through 9, you can index into a vector of Unicode subscript digits:
# Unicode subscript digits 0 through 9, so that digit d is at index d + 1
subscripts <- c(
  "\u2080",
  "\u2081",
  "\u2082",
  "\u2083",
  "\u2084",
  "\u2085",
  "\u2086",
  "\u2087",
  "\u2088",
  "\u2089"
)
testframe$V1 <- paste0(
testframe$V1,
ifelse(
is.na(testframe$V2),
"",
subscripts[testframe$V2 + 1]
)
)
testframe
V1 V2
1 A34₁ 1
2 B21₃ 3
3 C64₂ 2
4 D83₂ 2
5 E92₃ 3
6 F24 NA
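If V2 can hold numbers with more than one digit, indexing into a vector no longer works. One possible variation, sketched here under the assumption of a UTF-8-capable R session, translates every digit with chartr. The name to_subscript and the small data frame are made up for illustration.

```r
# Hypothetical helper: turn each digit of x into its Unicode subscript form.
to_subscript <- function(x) {
  plain <- "0123456789"
  subs  <- "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089"
  out <- chartr(plain, subs, as.character(x))
  out[is.na(x)] <- ""   # no footnote when the value is missing
  out
}

testframe <- data.frame(V1 = c("A34", "B21", "F24"),
                        V2 = c(1, 12, NA),
                        stringsAsFactors = FALSE)
testframe$label <- paste0(testframe$V1, to_subscript(testframe$V2))
testframe$label
```

Because chartr maps character by character, 12 becomes the two subscript characters for 1 and 2, and missing values leave V1 unchanged.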

Adding Headers to Dataframe in R

I have a script written in R that is run weekly and produces a csv. I need to add headers on top of some of the column names, as the columns are grouped together.
Header1 Header2
A B C D E F
1 2 3 4 5 6
7 8 9 a b c
In this example ABC columns are under the "Header1" header, and DEF are under the "Header2" header. Obviously this can be done manually but I was curious if there was a package that can do this. "No" is an acceptable answer.
EDIT: I should have added that the file can also be an xlsx. Initially I write most of my files out as CSVs since they usually get used by a script again at some point.
It is a bit ugly, but you can do this in a csv as long as you do not need any merged cells. I used data.table in my example, but I am pretty sure any other writing function works, as long as you write the header row with append = FALSE and col.names = FALSE, and then the data with both set to TRUE. Reading it back is also a bit ugly, but you can skip the first row.
library(data.table)
dt <- fread("A B C D E F
1 2 3 4 5 6
7 8 9 a b c")
fwrite(data.table(t(c("Header1", NA, NA, "Header2", NA, NA))), "test.csv", append = FALSE, col.names = FALSE)
fwrite(dt, "test.csv", append = TRUE, col.names = TRUE)
fread("test.csv")
# V1 V2 V3 V4 V5 V6
# 1: Header1 Header2
# 2: A B C D E F
# 3: 1 2 3 4 5 6
# 4: 7 8 9 a b c
fread("test.csv", skip = 1L)
# A B C D E F
# 1: 1 2 3 4 5 6
# 2: 7 8 9 a b c
If you happen to want your header information back, you can do something like this: read the first line, find the positions of the headers, and extract the headers themselves.
headers <- strsplit(readLines("test.csv", n = 1L), ",")[[1]]
which(headers != "")
# [1] 1 4
headers[which(headers != "")]
# [1] "Header1" "Header2"
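The same two-step write should also work without data.table. The following is a minimal base R sketch, assuming plain writeLines and write.table are acceptable; the data frame and the temporary file are invented for the example.

```r
# Example data: six columns that belong to two groups of three.
df <- data.frame(A = c(1, 7), B = c(2, 8), C = c(3, 9),
                 D = c(4, 10), E = c(5, 11), F = c(6, 12))
out <- tempfile(fileext = ".csv")

# Step 1: write the group header row; empty fields stand in for merged cells.
writeLines("Header1,,,Header2,,", out)

# Step 2: append column names and data below it. write.table warns when
# column names are appended to an existing file, so the warning is silenced.
suppressWarnings(
  write.table(df, out, sep = ",", append = TRUE,
              row.names = FALSE, col.names = TRUE, quote = FALSE)
)

readLines(out)

# Reading the data back: skip the group header row.
back <- read.csv(out, skip = 1)
```

As with the data.table version, the group header is only a convention of the file. Nothing in the CSV format ties Header1 to columns A through C.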

Read tab separated data set with errors

I have a problem with some datasets that contain tab separated data: unfortunately there are some errors in the raw data, which cause problems when reading it into R.
A small example for better understanding, the dataset looks like this:
Col1 Col2 Col3
1 2 3
4 5 6
7
8 9
10 11 12
The 7 8 9 part should be in one row, but is wrongly separated into two (in the raw data). Is there a way to correct this while reading the data in, rather than fixing it manually? The dataset has around 4 million observations, so a manual correction would take a lot of time...
Try this example:
# read the file line by line:
x <- readLines("data.txt")
# Split by " " (or in your case "\t"), and convert to dataframe with 3 columns:
res <- data.frame(matrix(unlist(strsplit(x[-1], " "), recursive = TRUE),
ncol = 3, byrow = TRUE))
# Add column names to dataframe:
colnames(res) <- unlist(strsplit(x[1], " "))
res
# Col1 Col2 Col3
# 1 1 2 3
# 2 4 5 6
# 3 7 8 9
# 4 10 11 12
Example data.txt file:
Col1 Col2 Col3
1 2 3
4 5 6
7
8 9
10 11 12
Note: Just noticed your real data is 4 million rows, maybe this is not the most efficient way.
My solution is more complicated than the solution by user zx8754 but here it goes.
readWrong <- function(file, skip = 1){
  txt <- readLines(file)
  header <- txt[seq_len(skip)]
  header <- scan(what = character(), textConnection(header))
  txt <- txt[-seq_len(skip)]
  data <- scan(textConnection(txt))
  data <- matrix(data, ncol = length(header), byrow = TRUE)
  data <- as.data.frame(data)
  names(data) <- header
  data
}
readWrong("data.txt")
# Col1 Col2 Col3
#1 1 2 3
#2 4 5 6
#3 7 8 9
#4 10 11 12
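For a file with millions of rows, it may be worth skipping readLines and letting scan read the tokens straight from the file. This is a sketch only: read_broken_rows is a hypothetical name, and it assumes all values are numeric and no field is ever empty, since the tokens are simply refilled row by row.

```r
# Hypothetical helper: re-flow whitespace-separated values into full rows.
read_broken_rows <- function(path) {
  # First line holds the column names.
  header <- scan(path, what = character(), nlines = 1, quiet = TRUE)
  # Everything after it is read as one long numeric vector; with the default
  # sep, tabs, spaces and line breaks all count as separators.
  values <- scan(path, what = numeric(), skip = 1, quiet = TRUE)
  # Fail early if the values do not fill complete rows.
  stopifnot(length(values) %% length(header) == 0)
  out <- as.data.frame(matrix(values, ncol = length(header), byrow = TRUE))
  names(out) <- header
  out
}

# Demo file with the broken "7 / 8 9" row from the question.
demo <- tempfile(fileext = ".txt")
writeLines(c("Col1\tCol2\tCol3", "1\t2\t3", "4\t5\t6",
             "7", "8\t9", "10\t11\t12"), demo)
fixed <- read_broken_rows(demo)
fixed
```

The stopifnot check is there because a genuinely missing value would silently shift every later value by one column.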

Text processing on data frame in r

I have a text file in which the data is stored as shown below:
{{2,3,4},{1,3},{4},{1,2} .....}
I want to remove the brackets and convert it to a two-column format, where the first column is the bracket number and the second is the term:
1 2
1 3
1 4
2 1
2 3
3 4
4 1
4 2
So far I have read the file:
tab <- read.table("test.txt",header=FALSE,sep="}")
This gives a dataframe
V1 V2 V3 V4
1 {{2,3,4 {1,3 {4 {1,2 .....
How should I proceed?
We read the file with readLines, split on the braces with strsplit to drop the {}, build a two-column data frame with an index, and reshape it to 'long' format with separate_rows:
library(tidyverse)
v1 <- setdiff(unlist(strsplit(lines, "[{}]")), c("", ","))
tibble(index = seq_along(v1), Col = v1) %>%
separate_rows(Col, convert = TRUE)
# A tibble: 8 x 2
# index Col
# <int> <int>
#1 1 2
#2 1 3
#3 1 4
#4 2 1
#5 2 3
#6 3 4
#7 4 1
#8 4 2
Or, a base R method would be to replace the , after each } with another delimiter, split by , into a list, and stack it into a two-column data.frame:
v1 <- scan(text=gsub("[{}]", "", gsub("},", ";", lines)), what = "", sep=";", quiet = TRUE)
stack(setNames(lapply(strsplit(v1, ","), as.integer), seq_along(v1)))[2:1]
data
lines <- readLines(textConnection("{{2,3,4},{1,3},{4},{1,2}}"))
#reading from file
lines <- readLines("yourfile.txt")
Data:
tab <- read.table(text=' V1 V2 V3 V4
1 {{2,3,4 {1,3 {4 {1,2
2 {{2,3,4 {1,3 {4 {1,2 ')
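Code: using gsub, remove { and split each string by ",", then build one data frame per column of tab. The column names are removed so that the pieces line up. Finally, the list of data frames in df1 is combined using rbindlist. The output has two identical value columns because the example tab contains two identical rows.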
df1 <- lapply( seq_along(tab), function(x) {
temp <- data.frame( x, strsplit( gsub( "{", "", tab[[x]], fixed = TRUE ), split = "," ),
stringsAsFactors = FALSE)
colnames(temp) <- NULL
temp
} )
Output:
data.table::rbindlist(df1)
# V1 V2 V3
# 1: 1 2 2
# 2: 1 3 3
# 3: 1 4 4
# 4: 2 1 1
# 5: 2 3 3
# 6: 3 4 4
# 7: 4 1 1
# 8: 4 2 2
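Another base R route is to pull out each bracket group with a regular expression. This is a sketch under the assumption that the terms are non-negative integers separated by commas with no spaces. The variable names are illustrative.

```r
txt <- "{{2,3,4},{1,3},{4},{1,2}}"

# Each match is a run of digits and commas that starts with a digit,
# i.e. the contents of one inner {...} group.
groups <- regmatches(txt, gregexpr("[0-9][0-9,]*", txt))[[1]]

# Split every group into its terms.
terms <- strsplit(groups, ",", fixed = TRUE)

# One row per term; the index repeats once for each term in its group.
result <- data.frame(index = rep(seq_along(terms), lengths(terms)),
                     term  = as.integer(unlist(terms)))
result
```

To read from a file instead, txt could be built with paste(readLines("yourfile.txt"), collapse = ""), assuming the file is small enough to hold in memory.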

R: read file, split into multiple dataframes

I have a file that is laid out in the following way:
# Query ID 1
# note
# note
tab delimited data across 12 columns
# Query ID 2
# note
# note
tab delimited data across 12 columns
I'd like to import this data into R so that each query is its own dataframe, ideally as a list of dataframes with the query ID as the name of each item in the list. I've been searching for a while, but I haven't seen a good way to do this. Is this possible?
Thanks
We have used commas instead of tabs to make the example easier to read, and have put the body of the file in a string, but aside from those obvious changes try this. First we use readLines to read in the file, then determine where the headers are and create a grp vector that has one element per line of the file, each holding the header that line belongs to. Finally, split the lines by grp and apply Read to each group:
# test data
Lines <- "# Query ID 1
# note
# note
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
# Query ID 2
# note
# note
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12"
L <- readLines(textConnection(Lines)) # L <- readLines("myfile")
isHdr <- grepl("Query", L)
grp <- L[isHdr][cumsum(isHdr)]
# Read <- function(x) read.table(text = x, sep = "\t", fill = TRUE, comment = "#")
Read <- function(x) read.table(text = x, sep = ",", fill = TRUE, comment = "#")
Map(Read, split(L, grp))
giving:
$`# Query ID 1`
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 1 2 3 4 5 6 7 8 9 10 11 12
2 1 2 3 4 5 6 7 8 9 10 11 12
$`# Query ID 2`
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 1 2 3 4 5 6 7 8 9 10 11 12
2 1 2 3 4 5 6 7 8 9 10 11 12
No packages needed.
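A tab-delimited variant with tidied list names might look like the sketch below. It assumes every block starts with a line beginning "# Query" and that all other comment lines also start with "#". The sample lines and the names is_hdr, read_block and frames are made up for the example.

```r
# Sample input: two query blocks, tab-delimited, three columns each.
L <- c("# Query ID 1", "# note", "1\t2\t3", "4\t5\t6",
       "# Query ID 2", "# note", "7\t8\t9")

is_hdr <- grepl("^# Query", L)
# Label each line with the header of the block it belongs to. Using a factor
# with levels in file order keeps split() from sorting blocks alphabetically.
hdr <- L[is_hdr]
grp <- factor(hdr[cumsum(is_hdr)], levels = unique(hdr))

# Comment lines are dropped by read.table itself via comment.char.
read_block <- function(x) read.table(text = x, sep = "\t", comment.char = "#")
frames <- lapply(split(L, grp), read_block)

# Strip the leading "# " so the list names are just the query IDs.
names(frames) <- sub("^# *", "", names(frames))
names(frames)
```

Keeping the factor levels in file order matters once there are ten or more queries, because alphabetical sorting would put "Query ID 10" before "Query ID 2".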
