I have a text file with many lines (first two are shown below)
1: 146 189 229
2: 191 229
I need to convert it to the following output:
1 146
1 189
1 229
2 191
2 229
I have read the lines in a loop, removed the ":", and split by " ":
fbnet <- readLines("0.egonet")
for (line in fbnet){
line <- gsub(":","",line)
line <- unlist(strsplit(line, " ", fixed = TRUE),use.names=FALSE)
friend = line[1]
}
How do I proceed from here?
We can read the data with read.csv, specifying the delimiter as ":", to get a data.frame with 2 columns, and then use separate_rows to split the second column ('V2' - with header = FALSE the columns are automatically named V followed by a sequence number, i.e. V1, V2) on spaces into separate rows. Finally, remove any NA elements (in case there are multiple spaces) with filter.
library(tidyverse)
read.csv(text=fbnet, sep=":", header = FALSE) %>%
separate_rows(V2, convert = TRUE) %>%
filter(!is.na(V2))
V1 V2
1 1 146
2 1 189
3 1 229
4 2 191
5 2 229
Or, using read_delim from readr with separate_rows and filter:
read_delim(paste(trimws(fbnet), collapse="\n"), delim=":", col_names = FALSE) %>%
separate_rows(X2, convert = TRUE) %>%
filter(!is.na(X2))
Data
fbnet <- readLines(textConnection("1: 146 189 229
2: 191 229"))
# If we are reading from a file, then
fbnet <- readLines("file.txt")
Related
I have a df RawDat with two columns, ID and data. I want to grep() my data by the id, using e.g. lapply(), to generate a new df where the data is sorted into columns by their id:
My df looks like this, except I have >80000 rows, and 75 ids:
ID data
abl 564
dlh 78
vho 354
mez 15
abl 662
dlh 69
vho 333
mez 9
.
.
.
I can manually extract the data using the grep() function:
ExtRawDat = as.data.frame(RawDat[grep("abl",RawDat$ID),])
However, I would not want to do that 75 times and cbind() them. Rather, I would like to use the lapply() function to automate it. I have tried several variations of the following code, but I don't get a script that provides the desired output.
I have a vector, ProLisV, with the 75 ids to loop over:
ExtRawDat = as.data.frame(lapply(ProLisV[1:75],function(x){
Temp1 = RawDat[grep(x,RawDat$ID),] # The issue is here: the pattern is not properly defined by the x input (is it detrimental that some of the names in the list have spaces etc.?)
Values = as.data.frame(Temp1$data)
list(Values$data)
}))
The desired output looks like this:
abl dlh vho mez ...
564 78 354 15
662 69 333 9
.
.
.
How do I adjust that function to provide the desired output? Thank you.
It looks like what you are trying to do is to convert your data from long form to wide form. One way to do this easily is to use the spread function from the tidyr package. To use it, we need a column to remove duplicate identifiers, so we'll first add a grouping variable:
n.ids <- 4 # With your full data this should be 75
df$group <- rep(seq_len(ceiling(nrow(df) / n.ids)), each = n.ids, length.out = nrow(df))
tidyr::spread(df, ID, data)
# group abl dlh mez vho
# 1 1 564 78 15 354
# 2 2 662 69 9 333
If you don't want the group column in the spread result, just remove it afterwards (assign the result and set its group column to NULL).
Data
df <- read.table(text = "
ID data
abl 564
dlh 78
vho 354
mez 15
abl 662
dlh 69
vho 333
mez 9", header = T)
I have multiple files that belong together in pairs, and each pair should be summed based on the values in column 2 to create one file. All files have the same rows. The files that should be summed share the same ID before the L* part of the file name.
I would like to make a loop that identifies the paired files and sums them based on column 2.
I have created a function that reads the files, but not sure how to proceed:
file_list <- list.files(pattern = "\\.csv$")
library(data.table)
lst <- lapply(file_list, function(x)
fread(x, select=c("V1", "V2"))[,
list(ID=paste(V1), freq=V2)])
Two of the pairs are shown below:
Pair one:
01_001_F08_S80_L009
16S_rRNA_copy_A-1 75
16S_rRNA_copy_B-1 86
16S_rRNA_copy_C-1 102
01_001_F08_S80_L002
16S_rRNA_copy_A-1 98
16S_rRNA_copy_B-1 96
16S_rRNA_copy_C-1 101
Pair two:
01_001_F09_S81_L006
16S_rRNA_copy_A-1 242
16S_rRNA_copy_B-1 244
16S_rRNA_copy_C-1 302
01_001_F09_S81_L003
16S_rRNA_copy_A-1 252
16S_rRNA_copy_B-1 253
16S_rRNA_copy_C-1 322
We can split 'lst' by a substring of its names (created with sub), loop through the resulting list, rbind the nested list elements, and get the sum of 'freq' grouped by 'ID'. This assumes the list elements are named after the files they came from, e.g.
names(lst) <- tools::file_path_sans_ext(basename(file_list))
lapply(split(lst, sub("\\d+$", "", names(lst))),
       function(x) rbindlist(x)[, .(freq = sum(freq)), ID])
#$`01_001_F08_S80_L`
# ID freq
#1: 16S_rRNA_copy_A-1 173
#2: 16S_rRNA_copy_B-1 182
#3: 16S_rRNA_copy_C-1 203
#$`01_001_F09_S81_L`
# ID freq
#1: 16S_rRNA_copy_A-1 494
#2: 16S_rRNA_copy_B-1 497
#3: 16S_rRNA_copy_C-1 624
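If data.table is not wanted, roughly the same result can be obtained with base R (a sketch, assuming lst is the named list of two-column tables built above):
grp <- sub("\\d+$", "", names(lst))
lapply(split(lst, grp), function(x)
  aggregate(freq ~ ID, do.call(rbind, x), sum))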
I am looking to work out a percentage total over a look-back range in R.
I know how to do this in excel with the following formula:
=SUM(B2:B4)/SUM(B2:B4,C2:C4)
This sums column B over a range from the current row looking back 3 lines. It then divides this sum by the total sum of columns B and C, again looking back 3 lines.
I am looking to achieve the same calculation in R to run across my matrix.
The output would look something like this:
adv dec perct
1 69 376
2 113 293
3 270 150 0.355625492
4 74 371 0.359559402
5 308 96 0.513790386
6 236 173 0.491255962
7 252 134 0.663886572
8 287 129 0.639966969
9 219 187 0.627483444
This is a line of code to which I could perhaps add the look-back range:
perct <- apply(data.matrix[,c('adv','dec')], 1, function(x) { x[1] / (x[1] + x[2]) })
Ideally, x[1] would sum the previous 3-line range and x[2] would also sum the previous 3-line range.
I am still learning how to apply forward and look-back periods within R, so any additional explanation with the answer would be appreciated!
Here are some approaches. The first 3 use rollsumr and/or rollapplyr from zoo, and the last one uses only base R.
1) rollsumr Create a matrix with rollsumr whose columns contain the rolling sums, convert that to row proportions and take the "adv" column. Finally, assign that to a new column frac in DF. This approach has the shortest code.
library(zoo)
DF$frac <- prop.table(rollsumr(DF, 3, fill = NA), 1)[, "adv"]
giving:
> DF
adv dec frac
1 69 376 NA
2 113 293 NA
3 270 150 0.3556255
4 74 371 0.3595594
5 308 96 0.5137904
6 236 173 0.4912560
7 252 134 0.6638866
8 287 129 0.6399670
9 219 187 0.6274834
1a) This variation is similar except instead of using prop.table we write out the ratio. The code is longer but you may find it clearer.
m <- rollsumr(DF, 3, fill = NA)
DF$frac <- with(as.data.frame(m), adv / (adv + dec))
1b) This is a variation of (1) that is the same except it uses a magrittr pipeline:
library(magrittr)
DF %>% rollsumr(3, fill = NA) %>% prop.table(1) %>% `[`(TRUE, "adv") -> DF$frac
2) rollapplyr We could use rollapplyr with by.column = FALSE like this. The result is the same.
ratio <- function(x) sum(x[, "adv"]) / sum(x)
DF$frac <- rollapplyr(DF, 3, ratio, by.column = FALSE, fill = NA)
3) Yet another variation is to compute the numerator and denominator separately:
DF$frac <- rollsumr(DF$adv, 3, fill = NA) /
rollapplyr(DF, 3, sum, by.column = FALSE, fill = NA)
4) base This uses embed followed by rowSums on each column to get the rolling sums and then uses prop.table as in (1).
DF$frac <- prop.table(sapply(lapply(rbind(NA, NA, DF), embed, 3), rowSums), 1)[, "adv"]
Note: The input used in reproducible form is:
Lines <- "adv dec
1 69 376
2 113 293
3 270 150
4 74 371
5 308 96
6 236 173
7 252 134
8 287 129
9 219 187"
DF <- read.table(text = Lines, header = TRUE)
Consider an sapply that loops through the number of rows in order to index two rows back:
DF$pred <- sapply(seq(nrow(DF)), function(i)
ifelse(i>=3, sum(DF$adv[(i-2):i])/(sum(DF$adv[(i-2):i]) + sum(DF$dec[(i-2):i])), NA))
DF
# adv dec pred
# 1 69 376 NA
# 2 113 293 NA
# 3 270 150 0.3556255
# 4 74 371 0.3595594
# 5 308 96 0.5137904
# 6 236 173 0.4912560
# 7 252 134 0.6638866
# 8 287 129 0.6399670
# 9 219 187 0.6274834
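The trailing sums can also be computed without an explicit loop using stats::filter from base R (a sketch; note this is the time-series filter, not dplyr::filter):
adv3 <- stats::filter(DF$adv, rep(1, 3), sides = 1) # sum of the current and previous 2 rows
dec3 <- stats::filter(DF$dec, rep(1, 3), sides = 1)
DF$frac <- as.numeric(adv3 / (adv3 + dec3)) # NA for the first two rows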
I'm currently performing a multiple sequence alignment using the 'msa' package from Bioconductor. I'm using this to calculate the consensus sequence (msaConsensusSequence) and conservation score (msaConservationScore). This gives me outputs that are values ...
e.g.
ConsensusSequence:
i.llE etc (str = chr)
(lower case = 20%+ conservation, uppercase = 80%+ conservation, . = <20% conservation)
ConservationScore:
221 -296 579 71 423 etc (str = named num)
I would like to convert these into a table where the first row contains the letters of the consensus sequence (one letter per column) and the second row contains the corresponding conservation scores.
e.g.
i . l l E
221 -296 579 71 423
Could people please advise on the best way to go about this?
Thanks
Natalie
From what you have said in the comments, you can get a data frame like this:
data(BLOSUM62)
alignment <- msa(mySequences)
conservation <- msaConservationScore(alignment, BLOSUM62)
# Now create the data frame
df <- data.frame(consensus = names(conservation), conservation = conservation)
head(df, 11)
consensus conservation
1 T 141
2 E 160
3 E 165
4 E 325
5 ? 179
6 ? 71
7 T 216
8 W 891
9 ? 38
10 T 405
11 L 204
If you prefer to transpose it you can:
df <- t(df)
colnames(df) <- 1:ncol(df)
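If you specifically want the letters of the consensus sequence rather than the names of the conservation vector, a sketch (assuming alignment and conservation as above): msaConsensusSequence() returns a single string, which can be split into one character per position.
cons <- strsplit(msaConsensusSequence(alignment), "")[[1]]
df2 <- data.frame(consensus = cons, conservation = as.numeric(conservation))
t(df2) # two rows: letters on top, scores underneath, as in the question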
I am trying to edit a column inside my dataframe. I tried using tstrsplit but I didn't get the desired result. I am trying to remove ';' from OID, and I want a single value in every row of the OID column.
This is the code I tried:
library(data.table)
setDT(df)[, paste0("OID", 1:3) := tstrsplit(OID, ";", fixed = TRUE)]
Running this code created 3 different columns (OID1, OID2, OID3), but I only need to edit the OID column and have a single value in it, as displayed in my desired output below.
Here is my data:
QID OID
189 204;202;201;203;
189 202;203;201;204;
189 na
189 204;202;201;203;
189 na
189 204;202;201;203;
189 na
My desired output is below:
QID OID
189 202
189 201
189 204
189 203
If we need a single element from each row, we can split 'OID' by ";", loop through the list output with sapply, get a single element (using sample, as the selection rule is not clear), and update 'OID' with that output.
transform(df, OID = sapply(strsplit(OID, ";"), sample, 1))
# QID OID
#1 189 202
#2 189 204
#3 189 203
#4 189 202
If we need unique values per row
transform(df, OID = sample(unique(unlist(strsplit(OID, ";")))))
# QID OID
#1 189 202
#2 189 201
#3 189 203
#4 189 204
NOTE: If the "OID" column class is factor, convert to character class before splitting i.e. strsplit(as.character(OID), ";")
data
df <- structure(list(QID = c(189L, 189L, 189L, 189L),
OID = c("204;202;201;203;",
"202;203;201;204;", "204;202;201;203;", "204;202;201;203;")),
.Names = c("QID", "OID"), class = "data.frame", row.names = c(NA, -4L))
Another option is str_split_fixed from the stringr package; it is vectorised over strings, so it should be more efficient than sapply.
str_split_fixed(string, pattern, n)
Please see here: http://www.inside-r.org/packages/cran/stringr/docs/str_split_fixed
df <- data.frame(QID=c(189,189,189,189),
OID=c("204;202;201;203","202;203;201;204",
"204;202;201;203","204;202;201;203"))
df
# QID OID
# 1 189 204;202;201;203
# 2 189 202;203;201;204
# 3 189 204;202;201;203
# 4 189 204;202;201;203
library(stringr)
df$OID = str_split_fixed(df$OID, ";", 4)[,1] # get the first separated column
df
# QID OID
#1 189 204
#2 189 202
#3 189 204
#4 189 204
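If pulling in stringr is not wanted, a base R one-liner gives the same result (a sketch; it keeps whatever precedes the first ";", so "na" rows are left untouched):
df$OID <- sub(";.*$", "", df$OID)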