R: reading and manipulating a strangely formatted file

I have a file that is formatted in a slightly weird way, like so:
Cluster 1 Score:3.96
Category Term Count
GOTERM_BP_FAT GO:0006412 34
KEGG_PATHWAY hsa00970 9
GOTERM_BP_FAT GO:0043038 9
GOTERM_BP_FAT GO:0043039 9
Cluster 2 Score:3.94
Category Term Count
GOTERM_BP_FAT GO:0006414 21
KEGG_PATHWAY hsa03010 20
GOTERM_BP_FAT GO:0034660 16
GOTERM_BP_FAT GO:0006399 11
GOTERM_BP_FAT GO:0042254 10
GOTERM_BP_FAT GO:0022613 12
... and several more "sub-data frames" (with blank lines in between) and additional (here omitted) columns for the rows after the Cluster X rows.
What I want to do is to somehow read each separate cluster, get it as a data frame (i.e. a data frame with the names Category, Term, Count), manipulate the data frame a bit (mostly adding columns based on calculations) and then write the manipulated data frame AND the Cluster X row to a new file in the very same format as it started.
I've racked my brain for some smart way to do this, but I haven't really come up with anything other than reading each row separately and doing different things depending on the type of row, like this:
con <- file('test.txt', open="r")
# Read the file line by line
while ( length(currentLine <- readLines(con, n=1, warn=FALSE)) > 0 ) {
    line <- strsplit(currentLine, '\t')[[1]]
    # Save previous data, initiate new cluster name/score
    if ( grepl('^Cluster', line[1]) ) {  # matches the "Cluster X Score:..." header lines
        # Save previous data if available
        if ( exists('currentData') ) {
            ## save the current data somehow
        }
        # Initiate new
        clusterInfo <- line
    }
    # Initiate a new, empty data frame
    else if ( grepl('Category', line[1]) ) {
        currentData <- data.frame(t(rep(NA, length(line))))
        names(currentData) <- line
    }
    # Add data to the data frame
    else if ( grepl('GOTERM', line[1]) || grepl('KEGG', line[1]) ) {
        currentData <- rbind(currentData, line)
        # Drop the all-NA placeholder row once the first real row is in
        if ( nrow(currentData) == 2 ) {
            currentData <- na.omit(currentData)
        }
    }
}
close(con)
The above is obviously not finished (I'm not sure how to save clusterInfo together with currentData in the same format), but I hope it gets my idea across. I'm not really fond of this approach, though: it seems very odd to build data frames row by row like this, and to save one block's data at the same moment you're initiating the start of the next block.
Is there some better way of doing this?

You can try read.mtable from my GitHub-only "SOfun" package.
Usage would be something like:
library(SOfun)
read.mtable(x, "Cluster", header = TRUE) ## Replace "x" with your file name
# $`Cluster 1 Score:3.96`
# Category Term Count
# 1 GOTERM_BP_FAT GO:0006412 34
# 2 KEGG_PATHWAY hsa00970 9
# 3 GOTERM_BP_FAT GO:0043038 9
# 4 GOTERM_BP_FAT GO:0043039 9
#
# $`Cluster 2 Score:3.94`
# Category Term Count
# 1 GOTERM_BP_FAT GO:0006414 21
# 2 KEGG_PATHWAY hsa03010 20
# 3 GOTERM_BP_FAT GO:0034660 16
# 4 GOTERM_BP_FAT GO:0006399 11
# 5 GOTERM_BP_FAT GO:0042254 10
# 6 GOTERM_BP_FAT GO:0022613 12
As you can see, the "cluster" information is retained as the list names. Thus, you can go ahead and use lapply to do whatever calculations you need, and then re-write the data in whatever form you need.
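For example, a minimal sketch of that manipulate-and-rewrite step (the Percent column and the output file name are made up for illustration):
clusters <- read.mtable(x, "Cluster", header = TRUE)
clusters <- lapply(clusters, function(d) {
    d$Percent <- 100 * d$Count / sum(d$Count)  # example calculation
    d
})
# Write each cluster back out: the stored list name becomes the
# "Cluster X Score:..." line, followed by the (modified) table
out <- file("test_out.txt", open = "w")
for (nm in names(clusters)) {
    writeLines(nm, out)
    write.table(clusters[[nm]], out, sep = "\t", quote = FALSE, row.names = FALSE)
}
close(out)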
Reproducible sample data:
x <- tempfile()
writeLines("Cluster 1 Score:3.96
Category Term Count
GOTERM_BP_FAT GO:0006412 34
KEGG_PATHWAY hsa00970 9
GOTERM_BP_FAT GO:0043038 9
GOTERM_BP_FAT GO:0043039 9
Cluster 2 Score:3.94
Category Term Count
GOTERM_BP_FAT GO:0006414 21
KEGG_PATHWAY hsa03010 20
GOTERM_BP_FAT GO:0034660 16
GOTERM_BP_FAT GO:0006399 11
GOTERM_BP_FAT GO:0042254 10
GOTERM_BP_FAT GO:0022613 12", con = x, sep = "\n")

You could read the file with readLines and split it using a numeric index ('indx') created from the lines starting with 'Cluster'. Read the list elements with read.table, create two new columns ('Cluster' and 'Score'), and rbind the list elements to create a single dataset.
lines <- readLines('Clusterfile.txt')
indx <- cumsum(grepl('^Cluster', lines))
res <- do.call(rbind, lapply(split(lines, indx), function(x) {
    d1 <- read.table(text = x[-1], header = TRUE, stringsAsFactors = FALSE)
    d2 <- read.table(text = gsub('[^0-9.]+', ' ', x[1]),
                     col.names = c('Cluster', 'Score'))
    cbind(d1, d2)
}))
row.names(res) <- NULL
head(res,3)
# Category Term Count Cluster Score
#1 GOTERM_BP_FAT GO:0006412 34 1 3.96
#2 KEGG_PATHWAY hsa00970 9 1 3.96
#3 GOTERM_BP_FAT GO:0043038 9 1 3.96
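If you later need the original layout back, a rough sketch (using the res built above; the output file name is invented) is to split on Cluster again:
out <- file('Clusterfile_out.txt', open = 'w')
for (cl in split(res, res$Cluster)) {
    writeLines(sprintf('Cluster %s Score:%s', cl$Cluster[1], cl$Score[1]), out)
    write.table(cl[c('Category', 'Term', 'Count')], out,
                quote = FALSE, row.names = FALSE)
}
close(out)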

Related

Error in Sub_data[1, i] : subscript out of bounds

I'm trying to write a program in R using for loops and an if statement.
I have a data set that contains 17 rows and 1091 columns (variables).
I want to compare the values in the 17th row and put the columns that share the same value into one data frame, to treat them afterwards.
The algorithm I thought of contains the following steps (a vectorized sketch of steps 1-3 appears at the end of this question):
1. Take the column I want to compare and put it in a new data frame (Sub_data).
2. Compare the value in the 17th row of this column with all the other values of the other columns in the original data (All_data).
3. When the value of the column equals the value of any other column (B), take that column B and add it to the data frame.
4. After that, compare the variation of the variables in Sub_data (which contains the columns with the same 17th-row values) and choose one of the columns that has the same variation, eliminating the others.
Here I present the rows and the first two columns of my data (All_data):
MT95T843 MT95T756
QC_G.F9_01_4768 70027.0213162601 95774.1359666849
QC_G.F9_01_4765 69578.1863357392 81479.2957458262
QC_G.F9_01_4762 69578.1863357392 87021.9542724389
QC_G.F9_01_4759 68231.1433794304 95558.7673782843
QC_G.F9_01_4756 64874.1293568862 96780.772452217
QC_G.F9_01_4753 63866.6577969569 91854.3530432699
CtrF01R5_G.D1_01_4757 66954.3879935821 128861.361627886
CtrF01R4_G.D5_01_4763 97352.5522885788 101353.25926633
CtrF01R3_G.C8_01_4754 61311.7857641721 7603.60895516428
CtrF01R2_G.D3_01_4760 85768.3611731878 109461.75444564
CtrF01R1_G.C9_01_4755 85302.8194715206 104253.845374077
BtiF01R5_G.D7_01_4766 61252.4254487766 115683.737549183
BtiF01R4_G.D6_01_4764 81873.9637852956 112164.142293011
BtiF01R3_G.D2_01_4758 84981.2191408476 0
BtiF01R2_G.D4_01_4761 36629.0246187626 124806.491006666
BtiF01R1_G.D8_01_4767 0 109927.264246577
rt 13.9018138671285 13.9058590777331
Code for the input data frame:
df1 <- data.frame(
    Name = c("QC_G.F9_01_4768", "QC_G.F9_01_4765", "QC_G.F9_01_4762",
             "QC_G.F9_01_4759", "QC_G.F9_01_4756", "QC_G.F9_01_4753",
             "CtrF01R5_G.D1_01_4757", "CtrF01R4_G.D5_01_4763", "CtrF01R3_G.C8_01_4754",
             "CtrF01R2_G.D3_01_4760", "CtrF01R1_G.C9_01_4755",
             "BtiF01R5_G.D7_01_4766", "BtiF01R4_G.D6_01_4764", "BtiF01R3_G.D2_01_4758",
             "BtiF01R2_G.D4_01_4761", "BtiF01R1_G.D8_01_4767",
             "rt"),
    MT95T843 = c(70027.0213162601, 69578.1863357392, 69578.1863357392, 68231.1433794304,
                 64874.1293568862, 63866.6577969569, 66954.3879935821, 97352.5522885788,
                 61311.7857641721, 85768.3611731878, 85302.8194715206, 61252.4254487766,
                 81873.9637852956, 84981.2191408476, 36629.0246187626, 0, 13.9018138671285),
    MT95T756 = c(95774.1359666849, 81479.2957458262, 87021.9542724389, 95558.7673782843,
                 96780.772452217, 91854.3530432699, 128861.361627886, 101353.25926633,
                 7603.60895516428, 109461.75444564, 104253.845374077, 115683.737549183,
                 112164.142293011, 0, 124806.491006666, 109927.264246577, 13.9058590777331))
df1
#> Name MT95T843 MT95T756
#> 1 QC_G.F9_01_4768 70027.02132 95774.13597
#> 2 QC_G.F9_01_4765 69578.18634 81479.29575
#> 3 QC_G.F9_01_4762 69578.18634 87021.95427
#> 4 QC_G.F9_01_4759 68231.14338 95558.76738
#> 5 QC_G.F9_01_4756 64874.12936 96780.77245
#> 6 QC_G.F9_01_4753 63866.65780 91854.35304
#> 7 CtrF01R5_G.D1_01_4757 66954.38799 128861.36163
#> 8 CtrF01R4_G.D5_01_4763 97352.55229 101353.25927
#> 9 CtrF01R3_G.C8_01_4754 61311.78576 7603.60896
#> 10 CtrF01R2_G.D3_01_4760 85768.36117 109461.75445
#> 11 CtrF01R1_G.C9_01_4755 85302.81947 104253.84537
#> 12 BtiF01R5_G.D7_01_4766 61252.42545 115683.73755
#> 13 BtiF01R4_G.D6_01_4764 81873.96379 112164.14229
#> 14 BtiF01R3_G.D2_01_4758 84981.21914 0.00000
#> 15 BtiF01R2_G.D4_01_4761 36629.02462 124806.49101
#> 16 BtiF01R1_G.D8_01_4767 0.00000 109927.26425
#> 17 rt 13.90181 13.90586
I'm stuck at the third step, where I get this error message:
Error in Sub_data[1, i] : subscript out of bounds
Here's the code I used:
library("readxl")
library("janitor")
All_data <- read_excel("DataMatrix_Excel.xlsx")
dim(All_data)
17 1091
for(i in 1:1091){
#Add column
Sub_data <- cbind(All_data[ , 1, drop=F])
for(j in 2:1091){
if(Sub_data[17,1]==All_data[17,j]) {
Sub_data <- cbind(Sub_data,All_data[ , j, drop=F])
#I added this line just to see if my code work
print(paste("The dim is " , dim(Sub_data)))
}
Please tell me if you need any more information or clarification, and please tell me if you have any suggestions.
Thank you very much
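For steps 1-3, the nested loops (and the subscript error) can be sidestepped entirely by grouping the columns by their row-17 value. A rough sketch, using the df1 built above (not tested against the full 1091-column data):
vals <- unlist(df1[17, -1])            # row 17 ("rt"), excluding the Name column
groups <- split(names(df1)[-1], vals)  # column names sharing a row-17 value
# Each group becomes its own sub-data frame:
sub_list <- lapply(groups, function(cols) df1[, cols, drop = FALSE])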

Capture the output of arules::inspect as data.frame

In "Zero frequent items" when using the eclat to mine frequent itemsets, the OP is interested in the groupings/clusterings based on how frequent they are ordered together. This grouping can be inspected by the arules::inspect function.
library(arules)
dataset <- read.transactions("8GbjnHK2.txt", sep = ";", rm.duplicates = TRUE)
f <- eclat(dataset,
           parameter = list(supp = 0.001,
                            maxlen = 17,
                            tidLists = TRUE))
inspect(head(sort(f, by = "support"), 10))
The data set can be downloaded from https://pastebin.com/8GbjnHK2.
However, the output cannot be easily saved to another object as a data frame.
out <- inspect(f)
So how do we capture the output of inspect(f) for use as a data frame?
We can use the methods labels() to extract the associations/groupings and quality() to extract the quality measures (support and count), then use cbind to combine them into a data frame.
out <- cbind(labels = labels(f), quality(f))
head(out)
# labels support count
# 1 {3031093,3059242} 0.001010 16
# 2 {3031096,3059242} 0.001073 17
# 3 {3060614,3060615} 0.001010 16
# 4 {3022540,3072091} 0.001010 16
# 5 {3061698,3061700} 0.001073 17
# 6 {3031087,3059242} 0.002778 44
Coercing the itemsets to a data.frame also creates the required output.
> head(as(f, "data.frame"))
items support count
1 {3031093,3059242} 0.001010101 16
2 {3031096,3059242} 0.001073232 17
3 {3060614,3060615} 0.001010101 16
4 {3022540,3072091} 0.001010101 16
5 {3061698,3061700} 0.001073232 17
6 {3031087,3059242} 0.002777778 44
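Either version can then be written out like any other data frame (the file name here is just an example):
write.csv(out, "frequent_itemsets.csv", row.names = FALSE)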

Merge without replacement to get unique rows only

I would like to select n rows from y that match the strings in x, where n = length(x), but the same row in y should not be selected more than once. The rows should be selected randomly from y.
> head(x$Age_Yrs_Sex)
[1] "65_0" "72_1" "82_0" "52_0" "81_0" "58_0"
> head(y,20)
ID Age_Yrs_Sex
1 10678800017 30_0
2 106788000024 63_0
4 10678800048 59_0
5 1067880000055 68_1
7 1067800079 59_0
8 10678800086 36_1
10 10678000109 39_0
12 1067880123 42_0
13 10678800130 45_1
14 106788000147 49_1
15 1067880000154 24_0
16 106780000161 44_0
17 1067880000178 43_1
19 106780000192 79_0
20 106880000208 22_0
22 107880000222 89_0
23 167880000239 28_0
24 106788000246 44_1
25 106780000253 76_0
26 106780000260 45_1
Assuming that, for any given value, x never contains more entries than y does, this should work (using dplyr). First, generate some usable example data:
y <- data.frame(
    ID = 1:1000,
    Age_Yrs_Sex = paste(sample(1:10, 1000, TRUE), 0:1, sep = "_")
)
x <- data.frame(
    Age_Yrs_Sex = paste(c(1, 1:4), 0, sep = "_")
)
Count the number of matches for each value (this can be skipped if it is always 1):
matches <- table(x$Age_Yrs_Sex)
Filter y to just the matching values, then from each group select as many rows as the table above says for that value (slice randomly samples that many row numbers between 1 and the group size).
library(dplyr)

y %>%
    filter(Age_Yrs_Sex %in% names(matches)) %>%
    group_by(Age_Yrs_Sex) %>%
    slice(sample(1:n(), matches[as.character(Age_Yrs_Sex[1])]))
Gives (for example):
ID Age_Yrs_Sex
<int> <fctr>
1 95 1_0
2 777 1_0
3 151 2_0
4 951 3_0
5 403 4_0
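The same idea in base R, as a sketch (using the x, y, and matches objects above; sample.int avoids the surprise where sample(rows, k) on a length-one rows vector samples from 1:rows instead):
picked <- unlist(lapply(names(matches), function(v) {
    rows <- which(y$Age_Yrs_Sex == v)
    rows[sample.int(length(rows), matches[[v]])]  # sample without replacement
}))
y[picked, ]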

arules: How find the data matching an lhs(rule) in R or an SQL WHERE clause?

I'm finding working with the arules package a bit tricky. I'm using the apriori algorithm to find association rules, something similar to an example in the arules documentation.
data("AdultUCI")
dim(AdultUCI)
AdultUCI[1:2,]
#Ignore everything from here to the last two lines, this is just data preparation
## remove attributes
AdultUCI[["fnlwgt"]] <- NULL
AdultUCI[["education-num"]] <- NULL
## map metric attributes
AdultUCI[[ "age"]] <- ordered(cut(AdultUCI[[ "age"]], c(15,25,45,65,100)),
labels = c("Young", "Middle-aged", "Senior", "Old"))
AdultUCI[[ "hours-per-week"]] <- ordered(cut(AdultUCI[[ "hours-per-week"]],
c(0,25,40,60,168)),
labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))
AdultUCI[[ "capital-gain"]] <- ordered(cut(AdultUCI[[ "capital-gain"]],
c(-Inf,0,median(AdultUCI[[ "capital-gain"]][AdultUCI[[ "capital-gain"]]>0]),
Inf)), labels = c("None", "Low", "High"))
AdultUCI[[ "capital-loss"]] <- ordered(cut(AdultUCI[[ "capital-loss"]],
c(-Inf,0, median(AdultUCI[[ "capital-loss"]][AdultUCI[[ "capital-loss"]]>0]),
Inf)), labels = c("None", "Low", "High"))
#resume reading here
rules <- apriori(AdultUCI, parameter=list(support=0.6, confidence=0.75, minlen=4))
inspect(rules)
Which returns the following four rules
lhs rhs support confidence lift
1 {race=White,
capital-gain=None,
native-country=United-States} => {capital-loss=None} 0.680398 0.9457029 0.9920537
2 {race=White,
capital-loss=None,
native-country=United-States} => {capital-gain=None} 0.680398 0.9083504 0.9901500
3 {race=White,
capital-gain=None,
capital-loss=None} => {native-country=United-States} 0.680398 0.9189249 1.0239581
4 {capital-gain=None,
capital-loss=None,
native-country=United-States} => {race=White} 0.680398 0.8730100 1.0210133
I must be missing something: how do you find the rows in the source data that match an lhs rule using just arules functions?
Is there an easy way to build an SQL WHERE clause from the lhs(rules)?
Thanks
This answer is based on the following answer: https://stats.stackexchange.com/questions/21340/finding-suitable-rules-for-new-data-using-arules.
The solution is very slow; I'm not sure if it will work for large applications.
library(arules)
rules <- apriori(AdultUCI, parameter=list(support=0.4, confidence=0.75, minlen=4))
inspect(rules)
rec <- function(rules, data, iter){
    basket <- data[iter]
    # rules whose lhs is contained in the basket ('@' accesses the S4 slot)
    rulesMatchLHS <- is.subset(rules@lhs, basket)
    # ... and whose rhs is not already in the basket
    suitableRules <- rulesMatchLHS & !(is.subset(rules@rhs, basket))
    rules <- sort(rules[suitableRules], decreasing=TRUE, by="lift")
    as(head(rules, 1), "data.frame")
}
recom_loop <- function(rules, data){
    temp <- lapply(seq_along(data), function(x) rec(rules, data, x))
    temp <- do.call("rbind", temp)
    recom <- gsub(".*=> |\\{|\\}", "", temp$rules)
    as.data.frame(cbind(as(data, "data.frame"), recom))
}
trans <- as(AdultUCI, "transactions")
recom <- recom_loop(rules, trans[1:50])
Here is some example output:
head(recom)
transactionID
1 1
2 2
3 3
4 4
5 5
6 6
items
1 {age=Middle-aged,workclass=State-gov,education=Bachelors,marital-status=Never-married,occupation=Adm-clerical,relationship=Not-in-family,race=White,sex=Male,capital-gain=Low,capital-loss=None,hours-per-week=Full-time,native-country=United-States,income=small}
2 {age=Senior,workclass=Self-emp-not-inc,education=Bachelors,marital-status=Married-civ-spouse,occupation=Exec-managerial,relationship=Husband,race=White,sex=Male,capital-gain=None,capital-loss=None,hours-per-week=Part-time,native-country=United-States,income=small}
3 {age=Middle-aged,workclass=Private,education=HS-grad,marital-status=Divorced,occupation=Handlers-cleaners,relationship=Not-in-family,race=White,sex=Male,capital-gain=None,capital-loss=None,hours-per-week=Full-time,native-country=United-States,income=small}
4 {age=Senior,workclass=Private,education=11th,marital-status=Married-civ-spouse,occupation=Handlers-cleaners,relationship=Husband,race=Black,sex=Male,capital-gain=None,capital-loss=None,hours-per-week=Full-time,native-country=United-States,income=small}
5 {age=Middle-aged,workclass=Private,education=Bachelors,marital-status=Married-civ-spouse,occupation=Prof-specialty,relationship=Wife,race=Black,sex=Female,capital-gain=None,capital-loss=None,hours-per-week=Full-time,native-country=Cuba,income=small}
6 {age=Middle-aged,workclass=Private,education=Masters,marital-status=Married-civ-spouse,occupation=Exec-managerial,relationship=Wife,race=White,sex=Female,capital-gain=None,capital-loss=None,hours-per-week=Full-time,native-country=United-States,income=small}
recom
1 race=White
2 race=White
3 race=White
4 race=White
5 race=White
6 capital-gain=None
As for the first question, the transactions supporting each rule may be found using this function (it should work faster than the one from the other response):
supp_trans_ids = function(items, transactions){
    # logical matrix showing which sets of items (rows) are fully
    # contained in which transactions (columns)
    tmp = is.subset(items, transactions)
    tmp2 = lapply(
        seq_len(nrow(tmp)),
        # 'which' alone would leave a name on each index -- a complete rule --
        # and would therefore use a lot of memory
        function(i) {
            t = which(tmp[i,])
            names(t) = NULL
            t
        }
    )
    # to easily identify the sets of items
    names(tmp2) = rownames(tmp)
    tmp2
}
Now, you may find which transactions support each rule's lhs with:
AdultUCI_trans = as(AdultUCI, 'transactions')
trans_supporting = supp_trans_ids(lhs(rules), AdultUCI_trans)
e.g.
> str(trans_supporting)
List of 4
$ {race=White,capital-gain=None,native-country=United-States} : int [1:35140] 2 3 6 8 13 17 18 19 20 21 ...
$ {race=White,capital-loss=None,native-country=United-States} : int [1:36585] 1 2 3 6 8 9 10 13 17 18 ...
$ {race=White,capital-gain=None,capital-loss=None} : int [1:36164] 2 3 6 8 13 17 18 19 20 21 ...
$ {capital-gain=None,capital-loss=None,native-country=United-States}: int [1:38066] 2 3 4 6 8 11 13 14 17 18 ...
And data you may find with:
AdultUCI_trans[trans_supporting[[1]]] # transactions supporting
AdultUCI[trans_supporting[[1]],] # data on which these transactions are based
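The SQL part of the question isn't covered above. As a rough sketch, an lhs label can be converted to a WHERE clause with string manipulation (this assumes every value may be quoted as a string, and backquotes the column names, which contain hyphens):
lhs_to_where <- function(lab) {
    # "{race=White,capital-gain=None}" -> "race=White" "capital-gain=None"
    items <- strsplit(gsub("[{}]", "", lab), ",", fixed = TRUE)[[1]]
    kv <- strsplit(items, "=", fixed = TRUE)
    paste(vapply(kv, function(p) sprintf("`%s` = '%s'", p[1], p[2]),
                 character(1)),
          collapse = " AND ")
}
lhs_to_where(labels(lhs(rules))[1])
# yields something like: `race` = 'White' AND `capital-gain` = 'None' AND ...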

R and appending to data frames

I have a cross-correlation function crosscor, and I would like to loop through the function for each of the columns I have in my data matrix. The function outputs some cross correlation that looks something like this each time it is run:
Lags Cross.Correlation P.value
1 0 -0.0006844958 0.993233547
2 1 0.1021006478 0.204691627
3 2 0.0976746274 0.226628526
4 3 0.1150337867 0.155426784
5 4 0.1943150900 0.016092041
6 5 0.2360415470 0.003416147
7 6 0.1855274375 0.022566685
8 7 0.0800646242 0.330081900
9 8 0.1111071269 0.177338885
10 9 0.0689602574 0.404948252
11 10 -0.0097332533 0.906856279
12 11 0.0146241719 0.860926388
13 12 0.0862549791 0.302268025
14 13 0.1283308019 0.125302070
15 14 0.0909537922 0.279988895
16 15 0.0628012627 0.457795228
17 16 0.1669241304 0.047886605
18 17 0.2019811994 0.016703619
19 18 0.1440124960 0.090764520
20 19 0.1104842808 0.197035340
21 20 0.1247428178 0.146396407
I would like to put all of the outputs together in one data frame, and ultimately export it to a csv file, so the columns are as follows: lags.3, cross-correlation.3, p-value.3, lags.4, cross-correlation.4, ...etc., up to p.value.50.
I have tried to use do.call as follows, but have not been successful:
for(i in 3:50) {
    l1 <- crosscor(data[, 2], data[, i], lagmax = 20)
    ccdata <- do.call(rbind, l1)
    cat("Data row", i)
}
I've also tried just creating the data frame straight out, but am just getting the lag column names:
ccdata <- data.frame()
for(i in 3:50) {
    ccdata[i - 2:i + 1] <- crosscor(data[, 2], data[, i], lagmax = 20)
    cat("Data row", i)
}
What am I doing wrong? Or is there an online source on data sets I could access to figure out how to do this? Best,
There is a transpose method for data.frames. If "crosscor" is the name of the output object, just try this:
tcrosscor <- t(crosscor)
write.csv(tcrosscor, file="my_crosscor_1.csv")
The first row would be the Lag's; the second row, the Cross.Correlation's; the third row the P.value's. I suppose you could "flatten" it further so it would be entirely "horizontal" or "wide". Seems painful but this might go something like:
single_line <- as.data.frame(t(unlist(tcrosscor)))
names(single_line) <- paste(rep(c("Lag", "Cross.Correlation", "P.value"), 50),
                            rep(1:50, each = 3), sep = ".")
write.csv(single_line, file="my_single_1.csv")
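Alternatively, the whole 3:50 loop can be collected in one pass, giving exactly the lags.3, cross-correlation.3, ... column layout asked for. A sketch, assuming crosscor returns the three-column data frame shown in the question (the output file name is made up):
results <- lapply(3:50, function(i) {
    cc <- crosscor(data[, 2], data[, i], lagmax = 20)
    names(cc) <- paste(names(cc), i, sep = ".")  # Lags.3, Cross.Correlation.3, P.value.3, ...
    cc
})
ccdata <- do.call(cbind, results)
write.csv(ccdata, file = "my_crosscor_wide.csv", row.names = FALSE)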
