Parse text files with R

I am trying to parse a text file with lines like this:
QUERY Query_3 Peptide 528 AT1G01110.2
DOMAINS
1 Query_3 Specific 404128 374 470 8.74687e-20 84.2155 pfam13178 DUF4005 C 45
1 Query_3 Non-specific 412094 93 173 0.000607039 42.1551 cd22307 Adgb_C_mid-like NC 45
ENDDOMAINS
SITES
ENDSITES
MOTIFS
1 Query_3 globin helix H G93 101P 412094
1 Query_3 IQ motif V125 143L 412094
1 Query_3 globin helix A Q161 173V 412094
ENDMOTIFS
ENDQUERY
QUERY Query_4 Peptide 196 AT1G01160.1
DOMAINS
1 Query_4 Specific 428268 22 73 8.8084e-19 76.1579 pfam05030 SSXT - 45
ENDDOMAINS
ENDQUERY
QUERY Query_5 Peptide 308 AT1G01180.1
DOMAINS
1 Query_5 Specific 433324 139 268 3.13921e-13 64.6367 pfam13578 Methyltransf_24 - 450167
ENDDOMAINS
ENDQUERY
The file is essentially tab-delimited rows separated by marker lines (e.g. QUERY, DOMAINS, ENDDOMAINS, ...). I want to build two data frames, one from the QUERY rows and one from the rows inside the DOMAINS blocks, like:
#data frame 1 ("QUERY" rows):
QUERY Query_3 Peptide 528 AT1G01110.2
QUERY Query_4 Peptide 196 AT1G01160.1
QUERY Query_5 Peptide 308 AT1G01180.1
#data frame 2 (rows after "DOMAINS"):
1 Query_3 Specific 404128 374 470 8.74687e-20 84.2155 pfam13178 DUF4005 C 45
1 Query_3 Non-specific 412094 93 173 0.000607039 42.1551 cd22307 Adgb_C_mid-like NC 45
1 Query_4 Specific 428268 22 73 8.8084e-19 76.1579 pfam05030 SSXT - 45
1 Query_5 Specific 433324 139 268 3.13921e-13 64.6367 pfam13578 Methyltransf_24 - 450167
Is there a way to do this in R? Thanks!
(BTW, this is an output from rpsbproc, a bioinformatics tool for parsing RPS-BLAST output, just in case someone also needs to parse the output.)

Try these:
txt <- readLines("text.txt")
grep("^QUERY", txt, value = TRUE) |>
paste(collapse = "\n") |>
read.table(text = _, header = FALSE)
# V1 V2 V3 V4 V5
# 1 QUERY Query_3 Peptide 528 AT1G01110.2
# 2 QUERY Query_4 Peptide 196 AT1G01160.1
# 3 QUERY Query_5 Peptide 308 AT1G01180.1
split(txt, cumsum(txt == "DOMAINS")) |>
lapply(function(z) if (z[1] == "DOMAINS" && !is.na(end <- which(z[-1] == "ENDDOMAINS"))) z[2:end]) |>
unlist() |>
paste(collapse = "\n") |>
read.table(text = _, header = FALSE)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
# 1 1 Query_3 Specific 404128 374 470 8.74687e-20 84.2155 pfam13178 DUF4005 C 45
# 2 1 Query_3 Non-specific 412094 93 173 6.07039e-04 42.1551 cd22307 Adgb_C_mid-like NC 45
# 3 1 Query_4 Specific 428268 22 73 8.80840e-19 76.1579 pfam05030 SSXT - 45
# 4 1 Query_5 Specific 433324 139 268 3.13921e-13 64.6367 pfam13578 Methyltransf_24 - 450167
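The split/lapply step above assumes every DOMAINS block contains at least one row. As a hedged alternative, here is a minimal sketch that flags the lines lying between each DOMAINS marker and its ENDDOMAINS marker with two running counts, so an empty block simply contributes nothing. The inlined `txt` is a shortened, whitespace-separated copy of the sample (the second block was emptied on purpose), and the names `inside`, `domain_lines`, `queries` and `domains` are illustrative only.

```r
# Shortened sample, inlined so the sketch runs without a file on disk.
txt <- c(
  "QUERY Query_3 Peptide 528 AT1G01110.2",
  "DOMAINS",
  "1 Query_3 Specific 404128 374 470 8.74687e-20 84.2155 pfam13178 DUF4005 C 45",
  "1 Query_3 Non-specific 412094 93 173 0.000607039 42.1551 cd22307 Adgb_C_mid-like NC 45",
  "ENDDOMAINS",
  "SITES",
  "ENDSITES",
  "ENDQUERY",
  "QUERY Query_4 Peptide 196 AT1G01160.1",
  "DOMAINS",
  "ENDDOMAINS",
  "ENDQUERY"
)

# TRUE from each DOMAINS marker up to (not including) its ENDDOMAINS marker.
inside <- cumsum(txt == "DOMAINS") - cumsum(txt == "ENDDOMAINS") == 1

# Drop the DOMAINS marker lines themselves; only data rows remain.
domain_lines <- txt[inside & txt != "DOMAINS"]

# read.table() accepts a character vector via `text`, one element per line.
queries <- read.table(text = grep("^QUERY", txt, value = TRUE), header = FALSE)
domains <- read.table(text = domain_lines, header = FALSE)
```

For the real tab-delimited file, passing `sep = "\t"` to `read.table()` may be the safer choice, since fields in other blocks (such as MOTIFS) contain spaces.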

You could try this.
rl <- readLines('foo.dat')
lapply(c('^QUERY', 'Query.*[Ss]pecific'), \(x) rl[grep(x, rl)]) |> setNames(c('QUERY', 'Domains'))
# $QUERY
# [1] "QUERY Query_3 Peptide 528 AT1G01110.2" "QUERY Query_4 Peptide 196 AT1G01160.1"
# [3] "QUERY Query_5 Peptide 308 AT1G01180.1"
#
# $Domains
# [1] "1 Query_3 Specific 404128 374 470 8.74687e-20 84.2155 pfam13178 DUF4005 C 45"
# [2] "1 Query_3 Non-specific 412094 93 173 0.000607039 42.1551 cd22307 Adgb_C_mid-like NC 45"
# [3] "1 Query_4 Specific 428268 22 73 8.8084e-19 76.1579 pfam05030 SSXT - 45"
# [4] "1 Query_5 Specific 433324 139 268 3.13921e-13 64.6367 pfam13578 Methyltransf_24 - 450167"
If you really want data frames with just one column, do this:
lapply(c('^QUERY', 'Query.*[Ss]pecific'), \(x) data.frame(v=rl[grep(x, rl)])) |> setNames(c('QUERY', 'Domains'))

Related

str_match based on vector with count issue

I haven't got a reprex, but my data are stored in a CSV file:
https://transcode.geo.data.gouv.fr/services/5e2a1fbefa4268bc25628f27/feature-types/drac:site?format=CSV&projection=WGS84
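library(readr)
library(dplyr)    # group_by(), summarise(), %>%
library(tidyr)    # drop_na()
library(stringr)  # str_match()
bzh_sites <- read_csv("site.csv")
I want to count rows based on which keyword matches in the column NATURE: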
pattern<-c("allée|aqueduc|architecture|atelier|bas|carrière|caveau|chapelle|château|chemin|cimetière|coffre|dépôt|dolmen|eau|église|enceinte|enclos|éperon|espace|exploitation|fanum|ferme|funéraire|groupe|habitat|maison|manoir|menhir|monastère|motte|nécropole|occupation|organisation|parcellaire|pêcherie|prieuré|production|rue|sépulture|stèle|thermes|traitement|tumulus|villa")
test2 <- bzh_sites %>%
drop_na(NATURE) %>%
group_by(NATURE = str_match( NATURE, pattern )) %>%
summarise(n = n())
gives me :
NATURE n
1 allée 176
2 aqueduc 73
3 architecture 68
4 atelier 200
And another test with the same data (NATURE):
pattern <- c("allée|aqueduc|architecture|atelier")
test2 <- bzh_sites %>%
drop_na(NATURE) %>%
group_by(NATURE = str_match( NATURE, pattern )) %>%
summarise(n = n())
gives me :
NATURE n
1 allée 178
2 aqueduc 74
3 architecture 79
4 atelier 248
I have no idea why the counts differ.
I tried to find out where the discrepancy comes from for the first group, i.e. "allée". This is what I found:
library(stringr)
pattern1<-c("allée|aqueduc|architecture|atelier|bas|carrière|caveau|chapelle|château|chemin|cimetière|coffre|dépôt|dolmen|eau|église|enceinte|enclos|éperon|espace|exploitation|fanum|ferme|funéraire|groupe|habitat|maison|manoir|menhir|monastère|motte|nécropole|occupation|organisation|parcellaire|pêcherie|prieuré|production|rue|sépulture|stèle|thermes|traitement|tumulus|villa")
#Get indices where 'allée' is found using pattern1
ind1 <- which(str_match(bzh_sites$NATURE, pattern1 )[, 1] == 'allée')
pattern2 <- c("allée|aqueduc|architecture|atelier")
#Get indices where 'allée' is found using pattern2
ind2 <- which(str_match(bzh_sites$NATURE, pattern2)[, 1] == 'allée')
#Indices which are present in ind2 but absent in ind1
setdiff(ind2, ind1)
#[1] 3093 10400
#Get corresponding text
temp <- bzh_sites$NATURE[setdiff(ind2, ind1)]
temp
#[1] "dolmen allée couverte" "coffre funéraire allée couverte"
What happens when we use pattern1 and pattern2 on temp
str_match(temp, pattern1)
# [,1]
#[1,] "dolmen"
#[2,] "coffre"
str_match(temp, pattern2)
# [,1]
#[1,] "allée"
#[2,] "allée"
As we can see, with pattern1 these values are assigned to another group because a different keyword occurs earlier in the string, hence the mismatch.
A similar explanation applies to the mismatches in the other groups.
str_match only returns the first match. To get all the matches of the pattern, we can use str_match_all:
table(unlist(str_match_all(bzh_sites$NATURE, pattern1)))
# allée aqueduc architecture atelier bas
# 178 76 79 252 62
# carrière caveau chapelle château chemin
# 46 35 226 205 350
# cimetière coffre dépôt dolmen eau
# 275 155 450 542 114
# église enceinte enclos éperon espace
# 360 655 338 114 102
#exploitation fanum ferme funéraire groupe
# 1856 38 196 1256 295
# habitat maison manoir menhir monastère
# 1154 65 161 1036 31
# motte nécropole occupation organisation parcellaire
# 566 312 5152 50 492
# pêcherie prieuré production rue sépulture
# 69 66 334 44 152
# stèle thermes traitement tumulus villa
# 651 50 119 1232 225
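The first-match effect can be reproduced on a tiny made-up vector. The sketch below deliberately swaps in base R (`regexpr()` for the first match, `gregexpr()` for all matches) instead of stringr so that it runs without extra packages; the strings in `x` are hypothetical and accent-free, not taken from the real data.

```r
# Hypothetical NATURE-like values; "tumulus" appears in every one of them.
x <- c("dolmen tumulus", "tumulus", "coffre tumulus")

long_pattern  <- "coffre|dolmen|tumulus"
short_pattern <- "tumulus"

# First match only: whichever keyword starts earliest in the string wins.
first_long  <- regmatches(x, regexpr(long_pattern, x))
first_short <- regmatches(x, regexpr(short_pattern, x))

# All matches: every keyword occurrence is counted.
all_long <- unlist(regmatches(x, gregexpr(long_pattern, x)))

table(first_long)   # "tumulus" is counted once with the long pattern
table(first_short)  # ... but three times with the short pattern
table(all_long)     # counting all matches gives three again
```

Adding more alternatives to the pattern can therefore lower the first-match count of a keyword, which is the discrepancy seen between the two tests.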

How to add columns to a dataframe based on indexes in R? (See example)

I'm working with a self-made infix function which simply calculates the
percentage growth between observations in adjacent columns.
options(digits=3)
`%grow%` <- function(x,y) {
(y-x) / x * 100
}
test <- data.frame(a=c(101,202,301), b=c(123,214,199), h=c(134, 217, 205))
Then I apply lapply to my toy data frame in order to add two new columns.
test[,4:5] <- lapply(1:(ncol(test)-1), function(i) test[,i] %grow% test[,(i+1)])
test
#Output
a b h V4 V5
1 101 123 134 21.78 8.94
2 202 214 217 5.94 1.40
3 301 199 205 -33.89 3.02
This is easy because I have only three columns and can simply write test[,4:5]. More generally: how can I do this for n columns using column indexes?
That is, I want to add n-1 columns to a given data frame, starting after the last existing one. Something like:
test[,(last_current_column+1):(last_column_created_using_function)]
From what I've read in other posts, and using my example, the test[,(last_current_column+1): part could be written as:
test[,(ncol(test)+1):]
but the second part is still missing and I have no idea how to write it.
I hope I made myself clear. I appreciate any comment or advice.
Happy 2019 :)
Another way would be:
#options(digits=3)
`%grow%` <- function(x,y) {
(y-x) / x * 100
}
test <- data.frame(a=c(101,202,301),
b=c(123,214,199),
h=c(134, 217, 205),
d=c(156,234,235))
# a b h d
# 1 101 123 134 156
# 2 202 214 217 234
# 3 301 199 205 235
seqcols <- seq_along(test) # saved just to improve readability
test[,seqcols[-length(seqcols)] + max(seqcols)] <- lapply(seqcols[-length(seqcols)],
function(i) test[,i] %grow% test[,(i+1)])
test
# a b h d V5 V6 V7
# 1 101 123 134 156 21.78 8.94 16.42
# 2 202 214 217 234 5.94 1.40 7.83
# 3 301 199 205 235 -33.89 3.02 14.63
Similar to the second solution from @Ronak Shah, but using map2_df from purrr:
cbind(test,
new=purrr::map2_df(test[seqcols[-length(seqcols)]], test[seqcols[-1]], `%grow%`),
deparse.level=1)
# a b h d new.a new.b new.h
# 1 101 123 134 156 21.78 8.94 16.42
# 2 202 214 217 234 5.94 1.40 7.83
# 3 301 199 205 235 -33.89 3.02 14.63
You would always have ncol(test) - 1 new columns. Using this logic, there are multiple ways to do this.
One way would be to construct a character vector with some prefix value.
test[paste0("new_col", seq_len(ncol(test) - 1))] <- lapply(1:(ncol(test)-1),
function(i) test[,i] %grow% test[,(i+1)])
test
# a b h new_col1 new_col2
#1 101 123 134 21.782178 8.943089
#2 202 214 217 5.940594 1.401869
#3 301 199 205 -33.887043 3.015075
Another option uses mapply and transform on two subsets of the data frame:
transform(test,
new_col = mapply(`%grow%`, test[1:(ncol(test)- 1)], test[2:ncol(test)]))
# a b h new_col.a new_col.b
#1 101 123 134 21.782178 8.943089
#2 202 214 217 5.940594 1.401869
#3 301 199 205 -33.887043 3.015075
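Because the growth formula is purely elementwise, the same n-1 columns can also be computed in one step by lining up the data shifted by one column. The following is only a sketch under that assumption; the names `from`, `to`, `growth` and the `growth_` prefix are made up for illustration.

```r
test <- data.frame(a = c(101, 202, 301),
                   b = c(123, 214, 199),
                   h = c(134, 217, 205))

n <- ncol(test)

# Column k of `from` is paired with column k of `to`, its right-hand neighbour.
from <- as.matrix(test[-n])  # columns 1 .. n-1
to   <- as.matrix(test[-1])  # columns 2 .. n

# Elementwise percentage growth for all adjacent column pairs at once.
growth <- (to - from) / from * 100
colnames(growth) <- paste0("growth_", seq_len(n - 1))

# Append the n-1 new columns after the existing ones.
result <- cbind(test, growth)
result
```

This avoids indexing the target columns by position altogether, so it works unchanged for any number of numeric columns.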

Adding columns by splitting number, and removing duplicates

I have a dataframe like the following (this is a reduced example, I have many more rows and columns):
CH1 CH2 CH3
1 3434 282 7622
2 4442 6968 8430
3 4128 6947 478
4 6718 6716 3017
5 3735 9171 1128
6 65 4876 4875
7 9305 6944 3309
8 4283 6060 650
9 5588 2285 203
10 205 2345 9225
11 8634 4840 780
12 6383 0 1257
13 4533 7692 3760
14 9363 9846 4697
15 3892 79 4372
16 6130 5312 9651
17 7880 7386 6239
18 8515 8021 2295
19 1356 74 8467
20 9024 8626 4136
I need to create additional columns by splitting the values. For example, the value 1356 would have to be split into 6, 56, and 356. I do this in a for loop, treating the values as strings in order to keep the leading zeros. So far, so good.
# CREATE ADDITIONAL COLUMNS
for(col in 1:3) {
# Create a temporary variable
temp <- as.character(data[,col] )
# Save the new column
for(mod in c(-1, -2, -3)) {
# Create the column
temp <- cbind(temp, str_sub(as.character(data[,col]), mod))
}
# Merge to the row
data <- cbind(data, temp)
}
My problem is that not all cells have 4 digits: some may have 1, 2 or 3 digits. Therefore, I get repeated values when I split. For example, for 79 I get: 79 (original), 9, 79, 79, 79.
Problem: I need to remove the repeated values. Of course, I could do unique, but that gives me rows of uneven number of columns. I need to fill those missing (i.e. the removed repeated values) with NA. I can only compare this by row.
I checked CJ Yetman's answer here, but they only replace consecutive numbers. I only need to keep unique values.
Reproducible Example: Here is a fiddle with my code working: http://rextester.com/IKMP73407
Expected outcome: For example, for rows 11 & 12 of the example (see the link for the reproducible example), if this is my original:
8634 4 34 634 4840 0 40 840 780 0 80 780
6383 3 83 383 0 0 0 0 1257 7 57 257
I'd like to get this:
8634 4 34 634 4840 0 40 840 780 NA 80 NA
6383 3 83 383 0 NA NA NA 1257 7 57 257
You can use apply():
The data:
data <- structure(list(CH1 = c(3434L, 4442L, 4128L, 6718L, 3735L, 65L,
9305L, 4283L, 5588L, 205L, 8634L, 6383L, 4533L, 9363L, 3892L,
6130L, 7880L, 8515L, 1356L, 9024L), CH2 = c(282L, 6968L, 6947L,
6716L, 9171L, 4876L, 6944L, 6060L, 2285L, 2345L, 4840L, 0L, 7692L,
9846L, 79L, 5312L, 7386L, 8021L, 74L, 8626L), CH3 = c(7622L,
8430L, 478L, 3017L, 1128L, 4875L, 3309L, 650L, 203L, 9225L, 780L,
1257L, 3760L, 4697L, 4372L, 9651L, 6239L, 2295L, 8467L, 4136L
)), .Names = c("CH1", "CH2", "CH3"), row.names = c(NA, 20L), class = "data.frame")
Select rows 11 and 12:
data <- data[11:12, ]
Using your code:
# CREATE ADDITIONAL COLUMNS
for(col in 1:3) {
# Create a temporary variable
temp <- data[,col]
# Save the new column
for(mod in c(10, 100, 1000)) {
# Create the column
temp <- cbind(temp, data[, col] %% mod)
}
data <- cbind(data, temp)
}
data[,1:3] <- NULL
The result is:
temp V2 V3 V4 temp V2 V3 V4 temp V2 V3 V4
11 8634 4 34 634 4840 0 40 840 780 0 80 780
12 6383 3 83 383 0 0 0 0 1257 7 57 257
Then go through the data row by row and remove duplicates and transpose the outcome:
t(apply(data, 1, function(row) {
row[duplicated(row)] <- NA
return(row)
}))
The result is:
temp V2 V3 V4 temp V2 V3 V4 temp V2 V3 V4
11 8634 4 34 634 4840 0 40 840 780 NA 80 NA
12 6383 3 83 383 0 NA NA NA 1257 7 57 257
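Note that `duplicated(row)` compares values across the whole row, so a value in one CH block that also occurs in an earlier block would be blanked as well. If duplicates should only be removed within each block of four columns, a sketch along the following lines may help. The `wide` table and the `block` vector are hypothetical, and it assumes every source column is followed by exactly three suffix columns.

```r
# Hypothetical wide table: each source column followed by its 3 suffix columns.
wide <- data.frame(
  CH1 = c(4840, 79), CH1_1 = c(0, 9), CH1_2 = c(40, 79), CH1_3 = c(840, 79),
  CH2 = c(840, 0),   CH2_1 = c(0, 0), CH2_2 = c(40, 0),  CH2_3 = c(840, 0)
)

# Block id for every column: 1 1 1 1 2 2 2 2
block <- rep(seq_len(ncol(wide) / 4), each = 4)

dedup <- t(apply(wide, 1, function(row) {
  for (b in unique(block)) {
    idx <- which(block == b)
    # Blank repeats only inside this block, keeping the first occurrence.
    row[idx][duplicated(row[idx])] <- NA
  }
  row
}))
dedup
```

In the first row the second block keeps 840, 0 and 40 even though the same values already appear in the first block; only the repeated 840 inside that block becomes NA.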

printing a list based on range met

I would like to generate a string output in a list if certain conditions are met. I have a table that looks like this:
grp V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1: 1 go.1 142 144 132 134 0 31 11 F D T hy al qe 34 6 3
2: 2 go.1 313 315 303 305 0 31 11 q z t hr ye er 29 20 41
3: 3 go.1 316 318 306 308 0 31 11 f w y hu er es 64 43 19
4: 4 go.1 319 321 309 311 0 31 11 r a y ie uu qr 26 22 20
5: 5 go.1 322 324 312 314 0 31 11 g w y hp yu re 44 7 0
I'm using this function to generate a desired output:
library(IRanges); library(data.table)
rangeFinder = function(x){
x.ir = reduce(IRanges(x$V2, x$V3))
max.idx = which.max(width(x.ir))
ans = data.table(out = x[1,1],
start = start(x.ir)[max.idx],
end = end(x.ir)[max.idx])
return(ans)}
rangeFinder(x.out)
out start end
1: 1 313 324
I would also like to generate a list with the letters (from columns V9-V11) of the rows that fall between the start and end output by rangeFinder.
For example, the output should look like this.
out
[[go.1]]
[1] "qztfwyraygwy"
rangeFinder is looking at values in column V2 and V3 and printing the longest match of numbers. Notice how "FDT" is not included in the list output even though rangeFinder produced an output from 313-324 (and not from 142-324). How can I get the desired output?
reduce has an argument with.revmap to add a "metadata" column (accessible with mcols()) to the object. This associates with each reduced range the indexes of the original range that map to the reduced range, as an IntegerList class, basically a list where all elements are guaranteed to be integer vectors. So these are the rows you're interested in
ir <- with(x, IRanges(V2, V3))
r <- reduce(ir, with.revmap=TRUE)
i <- unlist(mcols(r)[which.max(width(r)), "revmap"])
and the data character string can be munged with something like
j <- paste0("V", 9:11)
paste0(as.matrix(x[i, j, drop=FALSE]), collapse="")
It's better to ask your questions about IRanges on the Bioconductor mailing list; no subscription required.
with.revmap is a convenience argument added relatively recently; I think
h = findOverlaps(ir, r)
i = queryHits(h)[subjectHits(h) == which.max(width(r))]
is a replacement.

Converting probe ids to entrez ids from a list of lists

The conversion of probe ids to entrez ids is quite straightforward:
i1<-c("246653_at", "246897_at", "251347_at", "252988_at", "255528_at", "256535_at", "257203_at", "257582_at", "258807_at", "261509_at", "265050_at", "265672_at")
select(ath1121501.db, i1, "ENTREZID", "PROBEID")
PROBEID ENTREZID
1 246653_at 833474
2 246897_at 832631
3 251347_at 825272
4 252988_at 829998
5 255528_at 827380
6 256535_at 840223
7 257203_at 821955
8 257582_at 841494
9 258807_at 819558
10 261509_at 843504
11 265050_at 841636
12 265672_at 817757
But I am unsure how to do it for a long list of lists resulting from a clustering, and how to store the result as a list of ENTREZ ids instead of probe ids:
For instance:
[[1]]
247964_at 248684_at 249126_at 249214_at 250223_at 253620_at 254907_at 259897_at 261256_at 267126_s_at
28 40 44 45 54 95 108 152 171 229
[[2]]
248230_at 250869_at 259765_at 265948_at 266221_at
33 64 151 216 221
[[3]]
245385_at 247282_at 248967_at 250180_at 250881_at 251073_at 53874_at 256093_at 257054_at 260007_at
5 22 42 52 65 67 101 117 125 155
261868_s_at 263136_at 267497_at
181 195 232
It should be something like
[[1]]
"835761","834904","834356","834281","831256","829175","826721","843479","837084","816891","816892"
and similarly for other list of lists.

Resources