How to lapply grep() data by id - r

I have a df RawDat with two columns, ID and data. I want to grep() my data by ID, using e.g. lapply(), to generate a new df where the data is sorted into columns by ID:
My df looks like this, except I have >80000 rows, and 75 ids:
ID data
abl 564
dlh 78
vho 354
mez 15
abl 662
dlh 69
vho 333
mez 9
.
.
.
I can manually extract the data using the grep() function:
ExtRawDat = as.data.frame(RawDat[grep("abl",RawDat$ID),])
However, I do not want to do that 75 times and cbind() the results. Rather, I would like to use lapply() to automate it. I have tried several variations of the following code, but I can't get a script that provides the desired output.
I have a vector with the 75 ids ProLisV, to loop my argument
ExtRawDat = as.data.frame(lapply(ProLisV[1:75],function(x){
Temp1 = RawDat[grep(x,RawDat$ID),] # The issue is here, the pattern is not properly defined with the X input (is it detrimental that some of the names in the list having spaces etc.?)
Values = as.data.frame(Temp1$data)
list(Values$data)
}))
The desired output looks like this:
abl dlh vho mez ...
564 78 354 15
662 69 333 9
.
.
.
How do I adjust that function to provide the desired output? Thank you.

It looks like you are trying to convert your data from long form to wide form. One easy way to do this is the spread function from the tidyr package. To use it, each row of the output needs its own identifier, so we'll first add a grouping variable that numbers each consecutive block of IDs:
n.ids <- 4 # With your full data this should be 75
# One group per consecutive block of n.ids rows (assumes every block contains each ID exactly once)
df$group <- rep(seq_len(ceiling(nrow(df) / n.ids)), each = n.ids, length.out = nrow(df))
tidyr::spread(df, ID, data)
# group abl dlh mez vho
# 1 1 564 78 15 354
# 2 2 662 69 9 333
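If you don't want the group column in the result, assign the result to a variable and set its group column to NULL, e.g. wide <- tidyr::spread(df, ID, data); wide$group <- NULL.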
Data
df <- read.table(text = "
ID data
abl 564
dlh 78
vho 354
mez 15
abl 662
dlh 69
vho 333
mez 9", header = T)
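For completeness, here is a minimal base R sketch of the lapply() approach the question was attempting. It swaps grep() for an exact comparison with ==, so IDs containing spaces or regex metacharacters cannot mis-match. It assumes every ID occurs the same number of times; ProLisV is rebuilt here from the toy data as a stand-in for the real vector of 75 IDs.

```r
# Toy data copied from the question
RawDat <- read.table(text = "
ID data
abl 564
dlh 78
vho 354
mez 15
abl 662
dlh 69
vho 333
mez 9", header = TRUE, stringsAsFactors = FALSE)

# Stand-in for the vector of 75 IDs (assumption: built from the data itself)
ProLisV <- unique(RawDat$ID)

# Exact comparison instead of grep(): the ID is not interpreted as a regex
cols <- lapply(ProLisV, function(x) RawDat$data[RawDat$ID == x])
names(cols) <- ProLisV

# Works only if every ID has the same number of rows;
# check.names = FALSE keeps IDs that are not syntactic names (e.g. with spaces)
ExtRawDat <- as.data.frame(cols, check.names = FALSE)

# The same reshaping in one step; columns come out in sorted ID order
wide <- as.data.frame(split(RawDat$data, RawDat$ID), check.names = FALSE)
```

If grep() is still preferred, passing fixed = TRUE makes it treat the ID as a literal string, although it would still match IDs that merely contain that string.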

Related

How to use extract on multiple columns and name output columns based on input column names

I have a data frame of blood pressure data of the following form:
bpdata <- data.frame(bp1 = c("120/89", "110/70", "121/78"), bp2 = c("130/69", "120/90", "125/72"), bp3 = c("115/90", "112/71", "135/80"))
I would like to use the following extract command, but globally, i.e. on all bp\d columns
extract(bp1, c("systolic_1","diastolic_1"),"(\\d+)/(\\d+)")
How can I capture the digit in the column selection and use it in the column output names? I can hack around this by creating a list of column names and then using one of the apply family, but it seems to me there ought to be a more elegant way to do this.
Any suggestions?
We could use read.csv on multiple columns in a loop (Map) with sep = "/" and cbind the list elements at the end with do.call
do.call(cbind, Map(function(x, y) read.csv(text= x, sep="/", header = FALSE,
col.names = paste0(c('systolic', 'diastolic'), y)),
unname(bpdata), seq_along(bpdata)))
# systolic1 diastolic1 systolic2 diastolic2 systolic3 diastolic3
#1 120 89 130 69 115 90
#2 110 70 120 90 112 71
#3 121 78 125 72 135 80
Or without a loop, paste the columns to a single string for each row and then use read.csv/read.table
read.csv(text = do.call(paste, c(bpdata, sep="/")),
sep="/", header = FALSE,
col.names = paste0(c('systolic', 'diastolic'),
rep(seq_along(bpdata), each = 2)))
# systolic1 diastolic1 systolic2 diastolic2 systolic3 diastolic3
#1 120 89 130 69 115 90
#2 110 70 120 90 112 71
#3 121 78 125 72 135 80
Or, using the tidyverse, a similar option is to unite the columns into a single one separated by /, then use either extract or separate to split that column into multiple columns
library(dplyr)
library(tidyr)
library(stringr)
bpdata %>%
unite(bpcols, everything(), sep="/") %>%
separate(bpcols, into = str_c(c('systolic', 'diastolic'),
rep(seq_along(bpdata), each = 2)), convert = TRUE)
# systolic1 diastolic1 systolic2 diastolic2 systolic3 diastolic3
#1 120 89 130 69 115 90
#2 110 70 120 90 112 71
#3 121 78 125 72 135 80
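The answers above number the output columns by position. If the digit should come from the input column name itself, as the question asks, a base R sketch along these lines may help. It assumes every column is named bp followed by a number and that every value has the form systolic/diastolic; the systolic_ and diastolic_ prefixes follow the question's naming.

```r
bpdata <- data.frame(bp1 = c("120/89", "110/70", "121/78"),
                     bp2 = c("130/69", "120/90", "125/72"),
                     bp3 = c("115/90", "112/71", "135/80"),
                     stringsAsFactors = FALSE)

parts <- lapply(names(bpdata), function(nm) {
  # Digit(s) taken from the column name, e.g. "bp2" -> "2"
  idx <- sub("^bp", "", nm)
  # Split each "systolic/diastolic" string into a two-column character matrix
  m <- do.call(rbind, strsplit(as.character(bpdata[[nm]]), "/", fixed = TRUE))
  out <- data.frame(as.integer(m[, 1]), as.integer(m[, 2]))
  names(out) <- paste0(c("systolic_", "diastolic_"), idx)
  out
})

# Bind the per-column results side by side
result <- do.call(cbind, parts)
```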

How to merge tables with different column headers in loop [duplicate]

This question already has answers here:
Combine two data frames by rows (rbind) when they have different sets of columns
I have a for loop that goes through a specific column in different CSV files (the files are different runs for a specific class) and retrieves the count of each value. For example, in the first file (first run):
0 1 67
101 622 277
In the second run:
0 1 67 68
109 592 297 2
In the third run:
0 1 67
114 640 246
Note that each run might result in different values (look at the second run that includes one more value that is 68). I would like to merge all these results in one list and then write it to a CSV file. To do that, I did the following:
files <- list.files("/home/adam/Desktop/runs", pattern="*.csv", recursive=TRUE, full.names=TRUE, include.dirs=TRUE)
all <- list()
col <- 14
for(j in 1:length(files)){
dataset <- read.csv(files[j])
uniqueValues <- table(dataset[,col]) #this generates the examples shown above
all <- rbind(uniqueValues)
}
write.table(all, "all.csv", col.names=TRUE, sep=",")
The result of all is:
0 1 67
114 640 246
How to solve that?
The expected result is:
0 1 67 68
101 622 277 0
109 592 297 2
114 640 246 0
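I have flagged this as a potential duplicate (see the linked question above). rbind.fill() from the plyr package binds data frames with different columns and fills the missing ones with NA: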
library(plyr)
df1 <- data.frame(A0 = c(101),
A1 = c(622),
A67 = c(277))
df2 <- data.frame(A0 = c(109),
A1 = c(592),
A67 = c(297),
A68= c(2))
df3 <- data.frame(A0 = c(114),
A1 = c(640),
A67 = c(246))
newds=rbind.fill(df1,df2,df3)
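Note that rbind.fill() leaves NA where a value is absent, while the expected result in the question shows 0. A base R sketch that yields zeros directly is given below. It is an alternative technique, not the answer's method: the column from every run is collected first, then each run is tabulated against the full set of observed values. The runs list is a hypothetical stand-in for dataset[, col] read from each file.

```r
# Hypothetical stand-ins for dataset[, col] from three runs
runs <- list(
  c(0, 1, 1, 67),
  c(0, 1, 67, 68, 68),
  c(0, 0, 1, 67)
)

# Every value seen in any run, in sorted order
all_values <- sort(unique(unlist(runs)))

# Tabulate each run against the full set of levels, so absent values count as 0
counts_wide <- t(sapply(runs, function(x) table(factor(x, levels = all_values))))

# One row per run, one column per value; write.csv(counts_wide, "all.csv") would save it
counts_wide
```

In the original loop, this corresponds to storing dataset[, col] in a list inside the loop and building the table after the loop has finished.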

Sum paired files over a list of files

I have multiple files that belong together in pairs; each pair should be summed on the values in column 2 to create one file. All files have the same rows. The files in a pair share the same ID before the L* part of the file name.
I would like to make a loop that identifies the paired files and sums them based on column 2.
I have created a function that reads the files, but I am not sure how to proceed:
file_list <- list.files(pattern = "\\.csv$")
library(data.table)
lst <- lapply(file_list, function(x)
fread(x, select=c("V1", "V2"))[,
list(ID=paste(V1), freq=V2)])
Below is shown two of the pairs:
Pair one:
01_001_F08_S80_L009
16S_rRNA_copy_A-1 75
16S_rRNA_copy_B-1 86
16S_rRNA_copy_C-1 102
01_001_F08_S80_L002
16S_rRNA_copy_A-1 98
16S_rRNA_copy_B-1 96
16S_rRNA_copy_C-1 101
Pair two:
01_001_F09_S81_L006
16S_rRNA_copy_A-1 242
16S_rRNA_copy_B-1 244
16S_rRNA_copy_C-1 302
01_001_F09_S81_L003
16S_rRNA_copy_A-1 252
16S_rRNA_copy_B-1 253
16S_rRNA_copy_C-1 322
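We can split 'lst' by the file-name prefix (obtained with sub by stripping the trailing digits from the names of 'lst'), then loop over the resulting groups, rbind the elements of each group, and sum 'freq' grouped by 'ID'. This assumes 'lst' is named after the files without the .csv extension, e.g. names(lst) <- sub("\\.csv$", "", file_list).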
lapply(split(lst, sub("\\d+$", "", names(lst))),
function(x) rbindlist(x)[, .(freq = sum(freq)), ID])
#$`01_001_F08_S80_L`
# ID freq
#1: 16S_rRNA_copy_A-1 173
#2: 16S_rRNA_copy_B-1 182
#3: 16S_rRNA_copy_C-1 203
#$`01_001_F09_S81_L`
# ID freq
#1: 16S_rRNA_copy_A-1 494
#2: 16S_rRNA_copy_B-1 497
#3: 16S_rRNA_copy_C-1 624
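The same grouping idea can be sketched in base R without data.table, which may make the role of the list names easier to see. The list below is built by hand from the two pairs shown in the question; with real files it would come from reading file_list, with the names set from the file names as an assumption.

```r
ids <- c("16S_rRNA_copy_A-1", "16S_rRNA_copy_B-1", "16S_rRNA_copy_C-1")

# Hand-built stand-in for the list of files read from disk.
# With real files, names could be set with e.g.
#   names(lst) <- tools::file_path_sans_ext(basename(file_list))
lst <- list(
  "01_001_F08_S80_L009" = data.frame(ID = ids, freq = c(75, 86, 102)),
  "01_001_F08_S80_L002" = data.frame(ID = ids, freq = c(98, 96, 101)),
  "01_001_F09_S81_L006" = data.frame(ID = ids, freq = c(242, 244, 302)),
  "01_001_F09_S81_L003" = data.frame(ID = ids, freq = c(252, 253, 322))
)

# Files in a pair share everything except the trailing lane digits
pair_key <- sub("\\d+$", "", names(lst))

# Stack each pair and sum freq per ID
summed <- lapply(split(lst, pair_key), function(x) {
  aggregate(freq ~ ID, data = do.call(rbind, x), FUN = sum)
})
```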

R: Convert consensus output into a data frame

I'm currently performing a multiple sequence alignment using the 'msa' package from Bioconductor. I'm using this to calculate the consensus sequence (msaConsensusSequence) and conservation score (msaConservationScore). This gives me outputs that are values ...
e.g.
ConsensusSequence:
i.llE etc (str = chr)
(lower case = 20%+ conservation, uppercase = 80%+ conservation, . = <20% conservation)
ConservationScore:
221 -296 579 71 423 etc (str = named num)
I would like to convert these into a table where the first row contains columns where each is a different letter in the consensus sequence and the second row is the corresponding conservation score.
e.g.
i . l l E
221 -296 579 71 423
Could people please advise on the best way to go about this?
Thanks
Natalie
From what you said in the comments, you can get a data frame like this:
library(msa)
data(BLOSUM62)
alignment <- msa(mySequences)
conservation <- msaConservationScore(alignment, BLOSUM62)
# Now create the data frame
df <- data.frame(consensus = names(conservation), conservation = conservation)
head(df, 11)
consensus conservation
1 T 141
2 E 160
3 E 165
4 E 325
5 ? 179
6 ? 71
7 T 216
8 W 891
9 ? 38
10 T 405
11 L 204
If you prefer to transpose it you can:
df <- t(df)
colnames(df) <- 1:ncol(df)
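If the consensus is only available as a single string, as in the question, the same table can be sketched without the msa objects. The values below are the five example positions from the question; treating the string as one character per alignment position is an assumption.

```r
# Example values copied from the question
consensus <- "i.llE"
scores <- c(221, -296, 579, 71, 423)

# One character per alignment position
residues <- strsplit(consensus, "")[[1]]

# Long form: one row per position
df <- data.frame(consensus = residues, conservation = scores,
                 stringsAsFactors = FALSE)

# Wide form as asked for: first row letters, second row scores.
# rbind() of character and numeric vectors gives a character matrix.
tab <- rbind(consensus = residues, conservation = scores)
colnames(tab) <- seq_along(residues)
```

Because a matrix holds a single type, the scores in tab are stored as text; the long-form df keeps them numeric, which is usually more convenient for further analysis.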

Subset Columns based on partial matching of column names in the same data frame

I would like to understand how to subset multiple columns from same data frame by matching the first 5 letters of the column names with each other and if they are equal then subset it and store it in a new variable.
Here is a small example of what I need. Let's say the data frame is eatable:
fruits_area fruits_production vegetables_area vegetable_production
12 100 26 324
33 250 40 580
660 510 43 581
eatable <- data.frame(c(12,33,660),c(100,250,510),c(26,40,43),c(324,580,581))
names(eatable) <- c("fruits_area", "fruits_production", "vegetables_area",
"vegetable_production")
I was trying to write a function which will match the strings in a loop and will store the subset columns after matching first 5 letters from the column names.
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
checkExpression(eatable,"your_string")
The above function checks the string correctly but I am confused how to do matching among the column names in the dataset.
Edit:- I think regular expressions would work here.
You could try:
v <- unique(substr(names(eatable), 0, 5))
lapply(v, function(x) eatable[grepl(x, names(eatable))])
Or using map() + select() (select_() is deprecated in current dplyr):
library(tidyverse)
map(v, ~ select(eatable, matches(.x)))
Which gives:
#[[1]]
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
#
#[[2]]
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
Should you want to make it into a function:
checkExpression <- function(df, l = 5) {
v <- unique(substr(names(df), 0, l))
lapply(v, function(x) df[grepl(x, names(df))])
}
Then simply use:
checkExpression(eatable, 5)
I believe this may address your needs:
checkExpression <- function(dataset,str){
cols <- grepl(paste0("^",str),colnames(dataset),ignore.case = TRUE)
subset(dataset,select=colnames(dataset)[cols])
}
Note the addition of "^" to the pattern used in grepl.
Using your data:
checkExpression(eatable,"fruit")
## fruits_area fruits_production
##1 12 100
##2 33 250
##3 660 510
checkExpression(eatable,"veget")
## vegetables_area vegetable_production
##1 26 324
##2 40 580
##3 43 581
Your function does exactly what you want but there was a small error:
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
Change the name of the object you're subsetting from obje to dataset.
checkExpression(eatable,"fr")
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
checkExpression(eatable,"veg")
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
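Another way to group columns by their first five letters, sketched here in base R, is split.default(), which splits the columns of a data frame by a grouping vector and names each piece after its prefix. The startsWith() line shows a literal prefix test that, unlike an unanchored grepl(), cannot match in the middle of a name.

```r
eatable <- data.frame(c(12, 33, 660), c(100, 250, 510), c(26, 40, 43), c(324, 580, 581))
names(eatable) <- c("fruits_area", "fruits_production", "vegetables_area",
                    "vegetable_production")

# Group key: first five letters of every column name
prefix <- substr(names(eatable), 1, 5)

# Split the columns (not the rows) by that key; the result is a named list of data frames
groups <- split.default(eatable, prefix)

# Literal prefix match for a single group, no regex involved
fruit_cols <- eatable[startsWith(names(eatable), "fruit")]
```

Each group can then be pulled out by name, for example groups$veget.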
