Text processing on a data frame in R

I have a text file in which data is stored as shown below:
{{2,3,4},{1,3},{4},{1,2} .....}
I want to remove the brackets and convert it to a two-column format, where the first column is the bracket (group) number and the second is the term:
1 2
1 3
1 4
2 1
2 3
3 4
4 1
4 2
So far I have read the file:
tab <- read.table("test.txt",header=FALSE,sep="}")
This gives a dataframe
V1 V2 V3 V4
1 {{2,3,4 {1,3 {4 {1,2 .....
How should I proceed?

We read the file with readLines, split on the braces with strsplit (which drops the {}), build a two-column data frame with a group index, and reshape it to 'long' format with separate_rows:
library(tidyverse)
v1 <- setdiff(unlist(strsplit(lines, "[{}]")), c("", ","))
tibble(index = seq_along(v1), Col = v1) %>%
separate_rows(Col, convert = TRUE)
# A tibble: 8 x 2
# index Col
# <int> <int>
#1 1 2
#2 1 3
#3 1 4
#4 2 1
#5 2 3
#6 3 4
#7 4 1
#8 4 2
Or, as a base R method: replace the , that follows each } with another delimiter, split each group by , into a list, and stack it into a two-column data.frame:
v1 <- scan(text=gsub("[{}]", "", gsub("},", ";", lines)), what = "", sep=";", quiet = TRUE)
stack(setNames(lapply(strsplit(v1, ","), as.integer), seq_along(v1)))[2:1]
data
lines <- readLines(textConnection("{{2,3,4},{1,3},{4},{1,2}}"))
#reading from file
lines <- readLines("yourfile.txt")
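One caveat with the tidyverse version above: setdiff() also de-duplicates its result, so two identical groups (say, {1,3} appearing twice) would collapse into one. Below is a minimal base R sketch that keeps duplicates, assuming the input contains only digits, commas and braces. The sample string and the names groups, terms and res are made up for illustration.

```r
# Hypothetical input in the same shape as the question's file;
# note that the group {1,3} appears twice
lines <- "{{2,3,4},{1,3},{4},{1,3}}"

# Extract every run of comma-separated digits; the braces between
# groups break the runs, so each match is one group
groups <- regmatches(lines, gregexpr("[0-9]+(,[0-9]+)*", lines))[[1]]

# Split each group on commas and convert to integer
terms <- lapply(strsplit(groups, ","), as.integer)

# One row per term, labelled with the position of its group
res <- data.frame(index = rep(seq_along(terms), lengths(terms)),
                  term  = unlist(terms))
res
```

Because nothing is passed through setdiff(), the fourth group is still reported under index 4 even though it repeats the second one.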

Data:
tab <- read.table(text=' V1 V2 V3 V4
1 {{2,3,4 {1,3 {4 {1,2
2 {{2,3,4 {1,3 {4 {1,2 ')
Code: using gsub, remove the { and split each string by ",", then make a data frame per column, with the column index as the first column. The column names are removed so the pieces can be stacked. Finally, the list of data frames in df1 is combined using rbindlist.
df1 <- lapply( seq_along(tab), function(x) {
temp <- data.frame( x, strsplit( gsub( "{", "", tab[[x]], fixed = TRUE ), split = "," ),
stringsAsFactors = FALSE)
colnames(temp) <- NULL
temp
} )
Output:
data.table::rbindlist(df1)
# V1 V2 V3
# 1: 1 2 2
# 2: 1 3 3
# 3: 1 4 4
# 4: 2 1 1
# 5: 2 3 3
# 6: 3 4 4
# 7: 4 1 1
# 8: 4 2 2

Related

R Transform number/character to be a superscript or subscript

I want to append a number (or character, it doesn't matter) as a superscript to an existing string.
This is what my initial dataframe looks like:
testframe = as.data.frame(c("A34", "B21", "C64", "D83", "E92", "F24"))
testframe$V2 = c(1,3,2,2,3,NA)
colnames(testframe)[1] = "V1"
V1 V2
1 A34 1
2 B21 3
3 C64 2
4 D83 2
5 E92 3
6 F24 NA
What I want to do now is to use V2 as a kind of "footnote", so either a superscript or a subscript (it doesn't matter which one). When there is no entry in V2, I just want to keep V1 as it is.
I found a similar question where I saw this answer:
> paste0("H", "\u2081", "O")
[1] "H₁O"
This is what I want, but my problem is that it has to be created automatically since I have way too many rows in my real dataframe.
I tried to add an extra column "V3" to enter the Superscripts and Subscripts Codes:
testframe$V3 = c("u2081", "u2083", "u2082", "u2082", "u2083", NA)
V1 V2 V3
1 A34 1 u2081
2 B21 3 u2083
3 C64 2 u2082
4 D83 2 u2082
5 E92 3 u2083
6 F24 NA <NA>
But when I try paste(testframe$V1, testframe$V3, sep = "\") it gives me an error. How can I use the \ in this case?
If the subscripts will always be digits 0 through 9, you can index into a vector of Unicode subscript digits:
subscripts <- c(
"\u2080",
"\u2081",
"\u2082",
"\u2083",
"\u2084",
"\u2085",
"\u2086",
"\u2087",
"\u2088",
"\u2089"
)
testframe$V1 <- paste0(
testframe$V1,
ifelse(
is.na(testframe$V2),
"",
subscripts[testframe$V2 + 1]
)
)
testframe
V1 V2
1 A34₁ 1
2 B21₃ 3
3 C64₂ 2
4 D83₂ 2
5 E92₃ 3
6 F24 NA
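If V2 can hold values above 9, indexing into a ten-element vector is no longer enough. A possible extension, sketched here under the assumption that V2 contains only small non-negative whole numbers (or NA), shifts each digit's code point into the Unicode subscript range U+2080 to U+2089. The helper name to_subscript is made up for this example.

```r
# Hypothetical helper: turn each number into a string of subscript digits
to_subscript <- function(n) {
  vapply(n, function(v) {
    if (is.na(v)) return("")                  # no footnote for missing values
    codes <- utf8ToInt(as.character(v))       # e.g. 12 -> "12" -> 49 50
    # shift '0'..'9' (48..57) to U+2080..U+2089
    intToUtf8(codes - utf8ToInt("0") + 0x2080L)
  }, character(1), USE.NAMES = FALSE)
}

testframe <- data.frame(V1 = c("A34", "B21", "F24"),
                        V2 = c(1, 12, NA),
                        stringsAsFactors = FALSE)
testframe$V1 <- paste0(testframe$V1, to_subscript(testframe$V2))
testframe
```

For superscripts the same idea does not carry over directly, because the Unicode superscript digits 1, 2 and 3 do not sit in one contiguous block with the others; a lookup vector like the one in the answer is the simpler route there.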

Complex reshaping from wide to long in R (pulling multiple things from original variable name)

I have a question about reshaping a complex data from wide to long format.
"Prim_key" is the unique id. The variables have the following format: "sn016_1_2". I need to pull the first number into a column and name it "S" (For example, here it would be 1) and the second number to a column named "T" (For example, here it would be 2) and then pull the values into other variable names grouped by the unique id. The prefix sn016 is also not the only prefix. Here are the variables:
[1] "prim_key" "sn016_1_2" "sn016_1_3" "sn016_1_4" "sn016_1_5" "sn016_1_6" "sn016_1_7" "sn016_2_3"
[9] "sn016_2_4" "sn016_2_5" "sn016_2_6" "sn016_2_7" "sn016_3_4" "sn016_3_5" "sn016_3_6" "sn016_3_7"
[17] "sn016_4_5" "sn016_4_6" "sn016_4_7" "sn016_5_6" "sn016_5_7" "sn016_6_7" "sn017_1_2" "sn017_1_3"
[25] "sn017_1_4" "sn017_1_5" "sn017_1_6" "sn017_1_7" "sn017_2_3" "sn017_2_4" "sn017_2_5" "sn017_2_6"
[33] "sn017_2_7" "sn017_3_4" "sn017_3_5" "sn017_3_6" "sn017_3_7" "sn017_4_5" "sn017_4_6" "sn017_4_7"
[41] "sn017_5_6" "sn017_5_7" "sn017_6_7"
Any ideas on how to do this? I feel like it shouldn't be terribly hard but it's evading me.
Here's an example of what I'm looking for:
THESE VARS: "prim_key" "sn016_1_2" "sn016_1_3" "sn016_2_6" "sn016_2_7" "sn016_3_4" "sn016_3_5"
prim_key S T sn016
1 1 2 value
1 1 3 value
1 2 6 value
1 2 7 value
1 3 4 value
1 3 5 value
P.s. The goal long format example is not showing up correctly. So I've attached as an image.
Thanks in advance for any help!!
Perhaps you might try using pivot_longer from tidyr.
You can specify:
Columns to make longer (could select columns that start with "sn", such as starts_with("sn"), or all columns except for prim_key)
Names of the new columns generated, which include the initial letter/number combination (e.g., sn016), S, and T
And a regex pattern to split up into these columns
The code is as follows:
library(tidyverse)
df %>%
pivot_longer(cols = -prim_key,
names_to = c(".value", "S", "T"),
names_pattern = "(\\w+)_(\\d+)_(\\d+)")
Output
# A tibble: 10 x 5
prim_key S T sn016 sn017
<dbl> <chr> <chr> <int> <int>
1 1 1 2 5 NA
2 1 1 3 2 NA
3 1 2 6 5 3
4 1 2 7 1 2
5 1 3 5 NA 3
6 1 1 2 2 NA
7 1 1 3 3 NA
8 1 2 6 3 4
9 1 2 7 2 3
10 1 3 5 NA 5
Data
Example data made up:
df <- structure(list(prim_key = c(1, 1), sn016_1_2 = c(5L, 2L), sn016_1_3 = 2:3,
sn016_2_6 = c(5L, 3L), sn016_2_7 = 1:2, sn017_2_6 = 3:4,
sn017_2_7 = 2:3, sn017_3_5 = c(3L, 5L)), class = "data.frame", row.names = c(NA,
-2L))
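To see what the names_pattern regex captures before handing it to pivot_longer, the same pattern can be tried on a few column names in base R. This is only an illustrative sketch; the vector vars and the column labels are chosen for the example.

```r
# A few column names in the question's format
vars <- c("sn016_1_2", "sn016_2_6", "sn017_3_5")

# regexec returns the full match plus one entry per capture group
parts <- regmatches(vars, regexec("(\\w+)_(\\d+)_(\\d+)", vars, perl = TRUE))

# Stack into a character matrix: full match, prefix, S, T
m <- do.call(rbind, parts)
colnames(m) <- c("full", "prefix", "S", "T")
m
```

The first group becomes the ".value" part (the name of the output column, such as sn016), while the second and third groups end up in S and T.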
We could use melt and dcast from data.table:
library(data.table)
dcast(melt(setDT(df), id.var = 'prim_key')[, c("nm1", "S", "T")
:= tstrsplit(variable, '_')], rowid(nm1, S, T) + prim_key + S + T
~ nm1, value.var = 'value')[, nm1 := NULL][]
# prim_key S T sn016 sn017
# 1: 1 1 2 5 NA
# 2: 1 1 3 2 NA
# 3: 1 2 6 5 3
# 4: 1 2 7 1 2
# 5: 1 3 5 NA 3
# 6: 1 1 2 2 NA
# 7: 1 1 3 3 NA
# 8: 1 2 6 3 4
# 9: 1 2 7 2 3
#10: 1 3 5 NA 5
data
df <- structure(list(prim_key = c(1, 1), sn016_1_2 = c(5L, 2L), sn016_1_3 = 2:3,
sn016_2_6 = c(5L, 3L), sn016_2_7 = 1:2, sn017_2_6 = 3:4,
sn017_2_7 = 2:3, sn017_3_5 = c(3L, 5L)), class = "data.frame", row.names = c(NA,
-2L))
The answers using external packages are probably the way to go in terms of parsimony. It's useful, however, to be able to brute force your desired solution using base R sometimes. Below is an example. One benefit of the following is that the call to lapply can be replaced with the parallel version parLapply or mclapply both from the parallel package that ships with R.
#### First make some example data
# The column names you gave
cnames <- c("prim_key", "sn016_1_2", "sn016_1_3", "sn016_1_4", "sn016_1_5",
"sn016_1_6", "sn016_1_7", "sn016_2_3", "sn016_2_4", "sn016_2_5",
"sn016_2_6", "sn016_2_7", "sn016_3_4", "sn016_3_5", "sn016_3_6",
"sn016_3_7", "sn016_4_5", "sn016_4_6", "sn016_4_7", "sn016_5_6",
"sn016_5_7", "sn016_6_7", "sn017_1_2", "sn017_1_3", "sn017_1_4",
"sn017_1_5", "sn017_1_6", "sn017_1_7", "sn017_2_3", "sn017_2_4",
"sn017_2_5", "sn017_2_6", "sn017_2_7", "sn017_3_4", "sn017_3_5",
"sn017_3_6", "sn017_3_7", "sn017_4_5", "sn017_4_6", "sn017_4_7",
"sn017_5_6", "sn017_5_7", "sn017_6_7")
# An example matrix with random data
mat <- matrix(runif(length(cnames) * 4), nrow = 4)
# Make the column names correct
colnames(mat) <- cnames
### Now pretend we already had the data
# Get the column names of the input matrix
cnames <- colnames(mat)
# The column names that are not your primary key
n_primkey <- cnames[which(cnames != "prim_key")]
# Get the unique set of prefixes for the non-primkey variables
prefix <- strsplit(n_primkey, "_")
prefix <- unique(unlist(lapply(prefix, "[", 1)))
# Go row by row through the original matrix
dat <- lapply(seq_len(nrow(mat)), function(i) {
# The row we're dealing with now
row <- mat[i, ]
# The column names of your output matrix
dcnames <- c("prim_key", "S", "T", prefix)
# A pre-allocated data.frame to hold the reshaped data for this row
dat <- matrix(rep(NA, length(dcnames) * length(n_primkey)), ncol = length(dcnames))
dat <- as.data.frame(dat)
colnames(dat) <- dcnames
# All values for this row have the same prim_key value
dat$prim_key <- row["prim_key"]
# Go through each of the non-prim_key variables, split them, and put the
# values in the correct place
for (j in seq_len(length(n_primkey))) {
# k has the non-prim_key name we're dealing with
k <- n_primkey[j]
# l splits this name by underscores "_"
l <- strsplit(k, "_")
# The first element gives the prefix
pref <- l[[1]][1]
# The second gives the "S" value
S_val <- l[[1]][2]
# The third gives the "T" value
T_val <- l[[1]][3]
# Allocate these values into the output data.frame we created earlier
dat[j, "S"] <- S_val
dat[j, "T"] <- T_val
dat[j, pref] <- row[k]
}
# Return the data for row i of the input data
dat
})
# dat is a list, so combine each element into a single data.frame
dat <- do.call(rbind, dat)
# Check a few
dat[1:2, ]
mat[1, ]
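The swap from lapply to a parallel version mentioned at the start of this answer might look like the following sketch. The small matrix and the function row_fun are stand-ins invented for the example; mc.cores = 1L is used so that it also runs on Windows, where forking is not available.

```r
library(parallel)

# Tiny stand-in for the input matrix (hypothetical data)
mat <- matrix(1:6, nrow = 3,
              dimnames = list(NULL, c("prim_key", "sn016_1_2")))

# Stand-in for the per-row reshaping function
row_fun <- function(i) {
  data.frame(prim_key = mat[i, "prim_key"], S = "1", T = "2",
             sn016 = mat[i, "sn016_1_2"])
}

seq_res <- lapply(seq_len(nrow(mat)), row_fun)

# Same call shape; raise mc.cores on Unix-alikes to use several cores
par_res <- mclapply(seq_len(nrow(mat)), row_fun, mc.cores = 1L)

identical(seq_res, par_res)
```

Whether this pays off depends on how much work each row needs; for cheap per-row functions the overhead of starting workers can outweigh the gain.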

R: Reorganizing rows of each dataframe within a list into a new list of new dataframes

Edited to add more details and clarify.
Basically, I have a list of data frames, they have the same row numbers but various column numbers, so the dimension of each data frame is different. What I want to do now is to select the first row of each data frame, put them into a new data frame and use it as the first element of a new list, then do the same things for second rows, third rows...
I have contemplated to use 2 for loops to reassign rows, however that seems to be a very bad way to go about it given that nested for loop is pretty slow and the data I have is huge. Would truly appreciate sound insight and help.
myList <- list()
df1 <- as.data.frame(matrix(1:6, nrow=3, ncol=2))
df2 <- as.data.frame(matrix(7:15, nrow=3, ncol=3))
myList[[1]]<-df1
myList[[2]]<-df2
print(myList)
Current example data -
> print(myList)
[[1]]
V1 V2
1 1 4
2 2 5
3 3 6
[[2]]
V1 V2 V3
1 7 10 13
2 8 11 14
3 9 12 15
Desired Outcome
> print(myList2)
[[1]]
V1 V2 V3
1 1 4 0
2 7 10 13
[[2]]
V1 V2 V3
1 2 5 0
2 8 11 14
[[3]]
V1 V2 V3
1 3 6 0
2 9 12 15
The different dimensions of the current data frames makes it tricky.
Here's a base R method, which consists of:
Adding all the column names to each list item.
Converting the list to an array.
Transposing the array using aperm to match your intended output.
Optionally turning the array into a list using apply.
myListBase <- myList #added because we modify the original list
#get all of the unique names from the list of dataframes
##default ordering is by ordering in list
all_cols <- Reduce(base::union, lapply(myListBase, names))
#loop, add new columns, and then re-order them so all data.frames
# have the same order
myListBase <- lapply(myListBase,
function(DF){
DF[, base::setdiff(all_cols, names(DF))] <- 0 #initialize columns
DF[, all_cols] #reorder columns
}
)
#create 3D array - could be simplified using abind::abind(myListBase, along = 3)
myArrayBase <- array(unlist(myListBase, use.names = F),
dim = c(nrow(myListBase[[1]]), #rows
length(all_cols), #columns
length(myListBase) #3rd dimension
),
dimnames = list(NULL, all_cols, NULL))
#rows and 3rd dimension are transposed
myPermBase <- aperm(myArrayBase, c(3,2,1))
myPermBase
#, , 1
#
# V1 V2 V3
#[1,] 1 4 0
#[2,] 7 10 13
#
#, , 2
#
# V1 V2 V3
#[1,] 2 5 0
#[2,] 8 11 14
#
#, , 3
#
# V1 V2 V3
#[1,] 3 6 0
#[2,] 9 12 15
#make list of dataframes - likely not necessary
apply(myPermBase, 3, data.frame)
#[[1]]
# V1 V2 V3
#1 1 4 0
#2 7 10 13
#
#[[2]]
# V1 V2 V3
#1 2 5 0
#2 8 11 14
#
#[[3]]
# V1 V2 V3
#1 3 6 0
#2 9 12 15
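To make the aperm step less opaque, here is a small sketch on a toy array (the names a and p are arbitrary). With the permutation c(3, 2, 1), the first and third dimensions trade places, so what used to be "row i of slice k" becomes "row k of slice i".

```r
# Toy array: 3 rows, 2 columns, 2 slices
a <- array(1:12, dim = c(3, 2, 2))

# Swap the first and third dimensions, keep the columns in place
p <- aperm(a, c(3, 2, 1))

dim(p)
# [1] 2 2 3

# Element [i, j, k] of `a` is element [k, j, i] of `p`
a[3, 2, 1] == p[1, 2, 3]
# [1] TRUE
```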
Performance
The first version of the answer had a data.table and abind method but I've removed it - the base version is much faster and there's not much additional clarity gained.
Unit: microseconds
expr min lq mean median uq max neval
camille_purrr_dplyr 7910.9 8139.25 8614.956 8246.30 8387.20 60159.5 1000
cole_DT_abind 2555.8 2804.75 3012.671 2917.95 3061.55 6602.3 1000
cole_base 600.3 634.40 697.987 663.00 733.10 3761.6 1000
Complete code for reference:
library(dplyr)
library(purrr)
library(data.table)
library(abind)
library(microbenchmark)
myList <- list()
df1 <- as.data.frame(matrix(1:6, nrow=3, ncol=2))
df2 <- as.data.frame(matrix(7:15, nrow=3, ncol=3))
myList[[1]]<-df1
myList[[2]]<-df2
microbenchmark(
camille_purrr_dplyr = {
myList %>%
map_dfr(tibble::rownames_to_column, var = "id") %>%
mutate_at(vars(-id), ~ifelse(is.na(.), 0, .)) %>%
split(.$id) %>%
map(select, -id)
}
,
cole_DT_abind = {
myListDT <- copy(myList)
all_cols <- Reduce(base::union, lapply(myListDT, names))
# data.table used for side effects of updating-by-reference in lapply
lapply(myListDT, setDT)
# add non-existing columns
lapply(myListDT,
function(DT) {
DT[, base::setdiff(all_cols, names(DT)) := 0]
setcolorder(DT, all_cols)
})
# abind is used to make an array
myArray <- abind(myListDT, along = 3)
# aperm is used to transpose the array to the preferred route
myPermArray <- aperm(myArray, c(3,2,1))
# myPermArray
#or as a list of data.frames
apply(myPermArray, 3, data.frame)
}
,
cole_base = {
myListBase <- myList
all_cols <- Reduce(base::union, lapply(myListBase, names))
myListBase <- lapply(myListBase,
function(DF){
DF[, base::setdiff(all_cols, names(DF))] <- 0
DF[, all_cols]
}
)
myArrayBase <- array(unlist(myListBase, use.names = F),
dim = c(nrow(myListBase[[1]]), length(all_cols), length(myListBase)),
dimnames = list(NULL, all_cols, NULL))
myPermBase <- aperm(myArrayBase, c(3,2,1))
apply(myPermBase, 3, data.frame)
}
# ,
# cole_base_aperm = {
# myListBase <- myList
#
# all_cols <- Reduce(base::union, lapply(myListBase, names))
#
# myListBase <- lapply(myListBase,
# function(DF){
# DF[, base::setdiff(all_cols, names(DF))] <- 0
# DF[, all_cols]
# }
# )
#
# myArrayABind <- abind(myListBase, along = 3)
#
# myPermBase <- aperm(myArrayABind, c(3,2,1))
# apply(myPermBase, 3, data.frame)
# }
, times = 1000
)
One way with a few dplyr & purrr functions is to add an ID column to each row in each data frame, bind them all, then split by that ID. The base rbind would throw an error because of the mismatched column names, but dplyr::bind_rows takes a list of any number of data frames and adds NA columns for anything missing.
First step gets you one data frame:
library(dplyr)
library(purrr)
myList %>%
map_dfr(tibble::rownames_to_column, var = "id")
#> id V1 V2 V3
#> 1 1 1 4 NA
#> 2 2 2 5 NA
#> 3 3 3 6 NA
#> 4 1 7 10 13
#> 5 2 8 11 14
#> 6 3 9 12 15
Fill in the NAs with 0 in all columns except the ID—this could also be adjusted if need be. Split by ID, and drop the ID column since you no longer need it.
myList %>%
map_dfr(tibble::rownames_to_column, var = "id") %>%
mutate_at(vars(-id), ~ifelse(is.na(.), 0, .)) %>%
split(.$id) %>%
map(select, -id)
#> $`1`
#> V1 V2 V3
#> 1 1 4 0
#> 4 7 10 13
#>
#> $`2`
#> V1 V2 V3
#> 2 2 5 0
#> 5 8 11 14
#>
#> $`3`
#> V1 V2 V3
#> 3 3 6 0
#> 6 9 12 15
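For comparison, the difference between base rbind and a padded bind can be checked without any packages. This is a rough sketch using the question's two data frames; the missing column is added by hand here, which is the step dplyr::bind_rows takes care of automatically.

```r
df1 <- as.data.frame(matrix(1:6, nrow = 3, ncol = 2))
df2 <- as.data.frame(matrix(7:15, nrow = 3, ncol = 3))

# Base rbind refuses data frames with different numbers of columns
rbind_failed <- inherits(try(rbind(df1, df2), silent = TRUE), "try-error")
rbind_failed
# [1] TRUE

# Padding the missing column with NA makes the frames compatible
df1$V3 <- NA
stacked <- rbind(df1, df2)
stacked
```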

Replace strings in all dataframe cells by corresponding entries in another data frame

I have a dataframe in which each cell of one column holds a varying number of names, and I want to replace those names with the corresponding numbers from another dataframe. Afterwards, I want to calculate the mean and maximum, but that's not part of my problem.
df_with_names <-read.table(text="
id names
1 AA,BB
2 AA,CC,DD
3 BB,CC
4 AA,BB,CC,DD
",header=TRUE,sep="")
The dataframe with the corresponding numbers looks like
df_names <-read.table(text="
name number_1 number_2
AA 20 30
BB 12 14
CC 13 29
DD 14 27
",header=TRUE,sep="")
At the end of the first step it should be
id number_1 number_2
1 20,12 30,14
2 20,13,14 30,29,27
3 12,13 14,29
4 20,12,13,14 30,14,29,27
From here I know how to proceed but I don't know how to get there.
I tried to separate the names of each row in a loop into a dataframe and then replace the names but I always fail to get the right column of df_with_names. After a while, I doubt that replace() is the function I am looking for. Who can help?
library(data.table)
dt1 = as.data.table(df_with_names)
dt2 = as.data.table(df_names)
setkey(dt2, name)
dt2[setkey(dt1[, strsplit(as.character(names), split = ","), by = id], V1)][,
lapply(.SD, paste0, collapse = ","), keyby = id]
# id name number_1 number_2
#1: 1 AA,BB 20,12 30,14
#2: 2 AA,CC,DD 20,13,14 30,29,27
#3: 3 BB,CC 12,13 14,29
#4: 4 AA,BB,CC,DD 20,12,13,14 30,14,29,27
The above first splits the names along the comma in the first data.table, then joins that with the second one (after setting keys appropriately) and collapses all of the resulting columns back with a comma.
Another all in one:
data2match <- strsplit(as.character(df_with_names$names), ',')
lookup <- function(lookfor, in_df, return_col, search_col=1) {
in_df[, return_col][match(lookfor, in_df[, search_col])]
}
output <-
# for each number_x column....
sapply(names(df_names)[-1],
function(y) {
# for each set of names
sapply(data2match,
function(x) paste(sapply(x, lookup, df_names,
y, USE.NAMES=F), collapse=','))
})
data.frame(id=1:nrow(output), output)
Produces:
id number_1 number_2
1 1 20,12 30,14
2 2 20,13,14 30,29,27
3 3 12,13 14,29
4 4 20,12,13,14 30,14,29,27
Note: make sure both dataframes are ordered by id; otherwise you may see unexpected results.
listing <- df_with_names
listing <- strsplit(as.character(listing$names),",")
col1 <- lapply(listing, function(x) df_names[(df_names[[1]] %in% x),2])
col2 <- lapply(listing, function(x) df_names[(df_names[[1]] %in% x),3])
col1 <- unlist(lapply(col1, paste0, collapse = ","))
col2 <- unlist(lapply(col2, paste0, collapse = ","))
data.frame(number_1 = col1, number_2 = col2 )
number_1 number_2
1 20,12 30,14
2 20,13,14 30,29,27
3 12,13 14,29
4 20,12,13,14 30,14,29,27
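One thing to keep in mind with the %in% approach: it returns rows in the order of df_names, not in the order the names appear in the cell. With the example data the two orders coincide, but a cell such as "BB,AA" (made up here) would show the difference. A short sketch comparing %in% with match:

```r
df_names <- data.frame(name     = c("AA", "BB", "CC", "DD"),
                       number_1 = c(20, 12, 13, 14),
                       number_2 = c(30, 14, 29, 27),
                       stringsAsFactors = FALSE)

# Hypothetical cell "BB,AA" after strsplit
cell <- c("BB", "AA")

# %in% keeps the row order of df_names
by_in <- paste(df_names$number_1[df_names$name %in% cell], collapse = ",")

# match() follows the order of the names in the cell
by_match <- paste(df_names$number_1[match(cell, df_names$name)], collapse = ",")

by_in
# [1] "20,12"
by_match
# [1] "12,20"
```

If the order within each cell matters for later steps, match is the safer choice; for order-independent summaries such as mean or max either works.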
I don't like names like "names" or "name", so I went with "nam":
do.call( rbind, # reassembles the individual lists
apply(df_with_names, 1, # for each row in df_with_names
function(x) lapply( # lapply(..., paste) to each column
# Next line will read each comma separated value and
# and match to rows of df_names[] and return cols 2:3
df_names[ df_names$nam %in% scan(text=x[2], what="", sep=",") ,
2:3, drop=FALSE] , # construct packet of text digits
paste0, collapse=",") ) )
number_1 number_2
[1,] "20,12" "30,14"
[2,] "20,13,14" "30,29,27"
[3,] "12,13" "14,29"
[4,] "20,12,13,14" "30,14,29,27"
(I'm surprised that scan(text= ... a factor variable actually succeeded.)
Another method:
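df3 = data.frame(id=df_with_names$id,
number_1=as.character(df_with_names$names),
number_2=as.character(df_with_names$names), stringsAsFactors=FALSE)
for(n1 in 1:nrow(df3))
for(n2 in 1:nrow(df_names)){
# replace each name by its number; as.character guards against factor columns
nm = as.character(df_names[n2,1])
df3[n1,2] = sub(nm, df_names[n2,2], df3[n1,2] )
df3[n1,3] = sub(nm, df_names[n2,3], df3[n1,3] )
}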
df3
# id number_1 number_2
#1 1 20,12 30,14
#2 2 20,13,14 30,29,27
#3 3 12,13 14,29
#4 4 20,12,13,14 30,14,29,27
I think it would actually be worth your while to rearrange your df_with_names dataset into long format (one name per row) to make things more straightforward:
spl <- strsplit(as.character(df_with_names$names), ",")
df_with_names <- data.frame(
id=rep(df_with_names$id, sapply(spl, length)),
name=unlist(spl)
)
# id name
#1 1 AA
#2 1 BB
#3 2 AA
#4 2 CC
#5 2 DD
#6 3 BB
#7 3 CC
#8 4 AA
#9 4 BB
#10 4 CC
#11 4 DD
aggregate(
. ~ id,
data=merge(df_with_names, df_names, by="name")[-1],
FUN=function(x) paste(x,collapse=",")
)
# id number_1 number_2
#1 1 20,12 30,14
#2 2 20,13,14 30,29,27
#3 3 12,13 14,29
#4 4 20,12,13,14 30,14,29,27

R read.table loops row column entries to next row

This is the first time I encountered this problem using read.table: For row entries with very large number of columns, read.table loops the column entries into the next rows.
I have a .txt file with rows of variable and unequal length. For reference this is the .txt file I am reading: http://www.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/4.0/c5.bp.v4.0.symbols.gmt
Here is my code:
tabsep <- gsub("\\\\t", "\t", "\\t")
MSigDB.collection = read.table(fileName, header = FALSE, fill = TRUE, as.is = TRUE, sep = tabsep)
Partial output: first columns
V1 V2 V3 V4 V5 V6
1 TRNA_PROCESSING http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING ADAT1 TRNT1 FARS2
2 REGULATION_OF_BIOLOGICAL_QUALITY http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY DLC1 ALS2 SLC9A7
3 DNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS XRCC5 XRCC4 RAD51C
4 AMINO_SUGAR_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS UAP1 CHIA GNPDA1
5 BIOPOLYMER_CATABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS BTRC HNRNPD USE1
6 RNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD SYNCRIP
7 INTS6 LSM5 LSM4 LSM3 LSM1
8 CRK
9 GLUCAN_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/GLUCAN_METABOLIC_PROCESS GCK PYGM GSK3B
10 PROTEIN_POLYUBIQUITINATION http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_POLYUBIQUITINATION ERCC8 HUWE1 DZIP3
...
Partial output: last columns
V403 V404 V405 V406 V407 V408 V409 V410 V411 V412 V413 V414 V415 V416 V417 V418 V419 V420 V421
1
2 CALCA CALCB FAM107A CDK11A RASGRP4 CDK11B SYN3 GP1BA TNN ENO1 PTPRC MTL5 ISOC2 RHAG VWF GPI HPX SLC5A7 F2R
3
4
5
6 IRF2 IRF3 SLC2A4RG LSM6 XRCC6 INTS1 HOXD13 RP9 INTS2 ZNF638 INTS3 ZNF254 CITED1 CITED2 INTS9 INTS8 INTS5 INTS4 INTS7
7 POU1F1 TCF7L2 TNFRSF1A NPAS2 HAND1 HAND2 NUDT21 APEX1 ENO1 ERF DTX1 SOX30 CBY1 DIS3 SP1 SP2 SP3 SP4 NFIC
8
9
10
For instance, the column entries for row 6 get looped to fill row 7 and row 8. I seem to have this problem only for row entries with a very large number of columns. This occurs for other .txt files as well, but it breaks at different column numbers. I inspected all the row entries where the break happens and there are no unusual characters in the entries (they are all standard upper case gene symbols).
I have tried both read.table and read.delim with the same result. If I convert the .txt file to .csv first and use the same code, I do not have this problem (see below for the equivalent output). But I don't want to convert each file to .csv first, and really I just want to understand what is going on.
Correct output if I convert to .csv file:
MSigDB.collection = read.table(fileName, header = FALSE, fill = TRUE, as.is = TRUE, sep = ",")
V1 V2 V3 V4 V5 V6
1 TRNA_PROCESSING http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING ADAT1 TRNT1 FARS2 METTL1
2 REGULATION_OF_BIOLOGICAL_QUALITY http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY DLC1 ALS2 SLC9A7 PTGS2
3 DNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS XRCC5 XRCC4 RAD51C XRCC3
4 AMINO_SUGAR_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS UAP1 CHIA GNPDA1 GNE
5 BIOPOLYMER_CATABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS BTRC HNRNPD USE1 RNASEH1
6 RNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD SYNCRIP MED24
7 GLUCAN_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/GLUCAN_METABOLIC_PROCESS GCK PYGM GSK3B EPM2A
8 PROTEIN_POLYUBIQUITINATION http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_POLYUBIQUITINATION ERCC8 HUWE1 DZIP3 DDB2
9 PROTEIN_OLIGOMERIZATION http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_OLIGOMERIZATION SYT1 AASS TP63 HPRT1
To elaborate on my comment...
From the help page to read.table:
The number of data columns is determined by looking at the first five lines of input (or the whole file if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary (as in the ‘Examples’).
To work around this with unknown datasets, use count.fields to determine the number of fields in each line of the file, and use the maximum to create col.names for read.table to use:
x <- max(count.fields("~/Downloads/c5.bp.v4.0.symbols.gmt", "\t"))
Names <- paste("V", sequence(x), sep = "")
y <- read.table("~/Downloads/c5.bp.v4.0.symbols.gmt", col.names=Names, sep = "\t", fill = TRUE)
Inspect the first few lines. I'll leave the actual full inspection to you.
y[1:6, 1:10]
# V1
# 1 TRNA_PROCESSING
# 2 REGULATION_OF_BIOLOGICAL_QUALITY
# 3 DNA_METABOLIC_PROCESS
# 4 AMINO_SUGAR_METABOLIC_PROCESS
# 5 BIOPOLYMER_CATABOLIC_PROCESS
# 6 RNA_METABOLIC_PROCESS
# V2 V3 V4
# 1 http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING ADAT1 TRNT1
# 2 http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY DLC1 ALS2
# 3 http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS XRCC5 XRCC4
# 4 http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS UAP1 CHIA
# 5 http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS BTRC HNRNPD
# 6 http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD
# V5 V6 V7 V8 V9 V10
# 1 FARS2 METTL1 SARS AARS THG1L SSB
# 2 SLC9A7 PTGS2 PTGS1 MPV17 SGMS1 AGTR1
# 3 RAD51C XRCC3 XRCC2 XRCC6 ISG20 PRIM1
# 4 GNPDA1 GNE CSGALNACT1 CHST2 CHST4 CHST5
# 5 USE1 RNASEH1 RNF217 ISG20 CDKN2A CPA2
# 6 SYNCRIP MED24 RORB MED23 REST MED21
nrow(y)
# [1] 825
Here's a minimal example for those who don't want to download the other file to try it out.
Create a 6-line CSV file where the last line has more fields than the first 5 lines and try to use read.table on it:
cat("1,2,3,4", "1,2,3,4", "1,2,3,4", "1,2,3,4",
"1,2,3,4", "1,2,3,4,5", file = "test1.txt",
sep = "\n")
read.table("test1.txt", header = FALSE, sep = ",", fill = TRUE)
# V1 V2 V3 V4
# 1 1 2 3 4
# 2 1 2 3 4
# 3 1 2 3 4
# 4 1 2 3 4
# 5 1 2 3 4
# 6 1 2 3 4
# 7 5 NA NA NA
Note the difference when the longest line is among the first five lines of the file:
cat("1,2,3,4", "1,2,3,4,5", "1,2,3,4", "1,2,3,4",
"1,2,3,4", "1,2,3,4", file = "test2.txt",
sep = "\n")
read.table("test2.txt", header = FALSE, sep = ",", fill = TRUE)
# V1 V2 V3 V4 V5
# 1 1 2 3 4 NA
# 2 1 2 3 4 5
# 3 1 2 3 4 NA
# 4 1 2 3 4 NA
# 5 1 2 3 4 NA
# 6 1 2 3 4 NA
To fix the problem, we use count.fields which returns a vector of the number of fields detected in each line. We take the max from that and pass it on to a col.names argument for read.table.
x <- count.fields("test1.txt", sep=",")
x
# [1] 4 4 4 4 4 5
read.table("test1.txt", header = FALSE, sep = ",", fill = TRUE,
col.names = paste("V", sequence(max(x)), sep = ""))
# V1 V2 V3 V4 V5
# 1 1 2 3 4 NA
# 2 1 2 3 4 NA
# 3 1 2 3 4 NA
# 4 1 2 3 4 NA
# 5 1 2 3 4 NA
# 6 1 2 3 4 5
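The two steps can be wrapped into a small helper so that the file name and separator are only written once. This is a sketch, not a drop-in replacement: the function name read_ragged is made up, and it assumes the file is small enough to be scanned twice.

```r
# Hypothetical helper: size the table from the widest line, then read it
read_ragged <- function(file, sep = ",", ...) {
  # count.fields can return NA for some lines (e.g. inside quoted strings)
  n <- max(count.fields(file, sep = sep), na.rm = TRUE)
  read.table(file, header = FALSE, sep = sep, fill = TRUE,
             col.names = paste0("V", seq_len(n)), ...)
}

# Same layout as test1.txt: the longest line comes last
tmp <- tempfile(fileext = ".txt")
writeLines(c("1,2,3,4", "1,2,3,4", "1,2,3,4", "1,2,3,4",
             "1,2,3,4", "1,2,3,4,5"), tmp)

res <- read_ragged(tmp)
dim(res)
# [1] 6 5
```

For the tab-separated .gmt file from the question, the call would be read_ragged(fileName, sep = "\t"), with further arguments such as as.is = TRUE passed through the dots.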
