I am looking for an idiomatic way to concatenate a column, say 'x', that exists in every data.frame in a list. I came up with a two-step solution using lapply and Reduce. A second attempt using only Reduce failed. Can I do this with Reduce alone and a single anonymous function?
#data
xs <- replicate(5, data.frame(x=sample(letters, 10, T), y =runif(10)), simplify = FALSE)
# This works, but may still be unnecessarily long
otmap = lapply(xs, function(df) df$x)
jotm = Reduce(c, otmap)
# This does not count as another solution:
jotm = Reduce(c, lapply(xs, function(df) df$x))
# Try to use only Reduce function. This produces an error
jotr =Reduce(function(a,b){c(a$x,b$x)}, xs)
# Error in a$x : $ operator is invalid for atomic vectors
We can unlist after extracting the 'x' column
unlist(lapply(xs, `[[`, 'x'))
#[1] b y y i z o q w p d f f z b h m c u f s j e i v y b w j n q e w i r h p z q f x a b v z e x l c q f
#Levels: b d i o p q w y z c f h m s u e j n v r x a l
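On the literal question of using only Reduce: the attempt above fails because, after the first step, `a` is already the combined vector rather than a data.frame, so `a$x` is invalid. A minimal sketch of one way around this, assuming an empty character vector as the starting value and converting each column to character so the result is the same whether 'x' is stored as a factor or as character:

```r
set.seed(1)
xs <- replicate(5, data.frame(x = sample(letters, 10, TRUE), y = runif(10)),
                simplify = FALSE)

# acc is the vector built so far; df is the next data.frame in the list
jotr <- Reduce(function(acc, df) c(acc, as.character(df$x)), xs, character(0))

length(jotr)
```

The third argument is Reduce's `init` value; it makes the accumulator a plain vector from the first step on, so only the second argument is ever a data.frame.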
This is my first time posting here and I am very new to coding in general, apologies.
Background information: I am trying to position the peptides and the HLA sequences, so that the corresponding HLA allele protein sequence will be added to my 'Proteins' file. I know which combinations of peptide and HLA allele can occur, but I have to use a positional matrix to find out each HLA allele's exact protein sequence.
I have a .csv file called 'proteins', which looks like below.
Peptide,HLA,Binding
KLEDLERDL,HLA-A*02:01,Positive-Low
EVMPVSMAK,HLA-A*03:01,Positive-Intermediate
EVMPVSMAK,HLA-A*11:01,Positive-High
KTFPPTEPK,HLA-A*03:01,Positive-Intermediate
KTFPPTEPK,HLA-A*11:01,Positive-Intermediate
ATFSVPMEK,HLA-A*03:01,Positive-Intermediate
ATFSVPMEK,HLA-A*11:01,Positive-High
I also have a positional matrix .tsv file, which looks like below. I'm only showing the first allele of the file, which is A*01:01.
allele P-25 P-24 P-23 P-22 P-21 P-20 P-19 P-18 P-17 P-16 P-15 P-14 P-13 P-12 P-11 P-10 P-9 P-8 P-7 P-6
P-5 P-4 P-3 P-2 P-1 P0 P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14 P15 P16 P17 P18 P19 P20 P21 P22 P23 P24 P25 P26 P27 P28 P29 P30 P31 P32 P33 P34 P35
P36 P37 P38 P39 P40 P41 P42 P43 P44 P45 P46 P47 P48 P49 P50 P51 P52 P53 P54 P55 P56 P57 P58 P59 P60 P61 P62 P63 P64 P65 P66 P67 P68 P69 P70 P71 P72 P73 P74 P75 P76
P77 P78 P79 P80 P81 P82 P83 P84 P85 P86 P87 P88 P89 P90 P91 P92 P93 P94 P95 P96 P97 P98 P99 P100 P101 P102 P103 P104 P105 P106 P107 P108 P109 P110 P111 P112 P113 P114 P115 P116 P117
P118 P119 P120 P121 P122 P123 P124 P125 P126 P127 P128 P129 P130 P131 P132 P133 P134 P135 P136 P137 P138 P139 P140 P141 P142 P143 P144 P145 P146 P147 P148 P149 P150 P151 P152 P153 P154 P155 P156 P157 P158
P159 P160 P161 P162 P163 P164 P165 P166 P167 P168 P169 P170 P171 P172 P173 P174 P175 P176 P177 P178 P179 P180 P181 P182 P183 P184 P185 P186 P187 P188 P189 P190 P191 P192 P193 P194 P195 P196 P197 P198 P199
P200 P201 P202 P203 P204 P205 P206 P207 P208 P209 P210 P211 P212 P213 P214 P215 P216 P217 P218 P219 P220 P221 P222 P223 P224 P225 P226 P227 P228 P229 P230 P231 P232 P233 P234 P235 P236 P237 P238 P239 P240
P241 P242 P243 P244 P245 P246 P247 P248 P249 P250 P251 P252 P253 P254 P255 P256 P257 P258 P259 P260 P261 P262 P263 P264 P265 P266 P267 P268 P269 P270 P271 P272 P273 P274 P275 P276 P277 P278 P279 P280 P281
P282 P283 P284 P285 P286 P287 P288 P289 P290 P291 P292 P293 P294 P295 P296 P297 P298 P299 P300 P301 P302 P303 P304 P305 P306 P307 P308 P309 P310 P311 P312 P313 P314 P315 P316 P317 P318 P319 P320 P321 P322
P323 P324 P325 P326 P327 P328 P329 P330 P331 P332 P333 P334 P335 P336 P337 P338 P339 P340 P341 P342 P343 P344 P345 P346 P347 P348 P349 P350 P351 P352 P353 P354 P355 P356 P357 P358 P359 P360 P361
A*01:01 M A V M A P R T L L L L L S G A L A L .
. T Q T W A G S H S M R Y F F T S V S R
P G R G E P R F I A V G Y V D D T Q F V
R F D S D A A S Q K M E P R A P W I E Q
E G P E Y W D Q E T R N M K A H S Q T D
R A N L G T L R G Y Y N Q S E D G S H T
I Q I M Y G C D V G P D G R F L R G Y .
R Q D A Y D G K D Y . I A L N E D L R S
W T A A D M A A Q I T K R K W E A V H A
A E . . . . . . . . . . . . . . Q R R V
Y L E G R C V D G L R R Y L E N . . . G
K E T L Q R T D P P K T H M T H H P I S
D H E A T L R C W A L G F Y P A E I T L
T W Q R D G E D . Q T Q D T E L V E T R
P A G D G T F Q K W A A V V V P S G E E
Q R Y T C H V Q H E G L P K P L T L R W
E L S S Q P T I P I V G I I A G L V L L
G A V I T G A V V A A V M W R R K S S D
R K G G S Y T Q A A S S D S A Q G S D V
S L T A C K V
My attempt so far:
Concatenate all the character values per row from the positional matrix file (the ones starting with P) into one string like so
ABCDEGGH***IHJLMNOP
and then match it according to the HLA position in 'proteins' file.
Create a new column in 'proteins' file, called 'HLA amino acid sequence', where the concatenated string value is added.
Problem:
I figured out how to concatenate string values together, but not according to what I need. Code below:
positional_matrix <- A_AA_mat_pos
concatenated_amino_acid <- c(positional_matrix, sep = "")
do.call(paste, positional_matrix)
head(concatenated_amino_acid)
Looking at the head of this concatenated list, it pastes all the values column-wise into one list, whereas I want each row to be concatenated instead.
Sounds like you can use a for loop to cycle through each row:
concat_rows <- character(nrow(positional_matrix)) # initialise the result vector
for (i in seq_len(nrow(positional_matrix))) {
  concat_rows[i] <- paste(positional_matrix[i, 1],
                          positional_matrix[i, 2],
                          ...,  # placeholder: list the remaining columns here
                          sep = '')
}
I will say that when I tried this, I had to list each column out in the paste command. When I tried something more scalable, like [1:ncol(..)] or similar, it converted everything into numbers. Maybe someone else can shed some light on that.
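One likely explanation for the "converted into numbers" behaviour is that the columns are factors (the default for strings before R 4.0.0), in which case flattening a data frame row can yield the integer codes instead of the letters. A minimal sketch of a scalable version, using a small made-up table rather than the real file (the values below are toy data), converts to a character matrix first and then pastes each row:

```r
# Toy stand-in for the positional matrix (hypothetical values, 2 alleles x 4 positions)
positional_matrix <- data.frame(
  allele = c("A*01:01", "A*02:01"),
  `P-1` = c("M", "M"), P0 = c("A", "A"), P1 = c("V", "V"), P2 = c(".", "M"),
  check.names = FALSE, stringsAsFactors = TRUE
)

# as.matrix() on a data frame of factors gives a character matrix of the labels
aa <- as.matrix(positional_matrix[, -1])

# Paste each row (MARGIN = 1) into a single string
concat_rows <- apply(aa, 1, paste, collapse = "")
names(concat_rows) <- as.character(positional_matrix$allele)
concat_rows
```

Naming the result by allele makes later lookups by allele name straightforward.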
Welcome to Stack Overflow. I personally would use a "tidy" approach.
library(tidyverse) # or library(tidyr)
positional_data <- positional_data %>%
  unite("concatenated", `P-25`:`P333`, sep = "")
That makes a new column called concatenated. The sep = "" argument matters because unite() otherwise joins the values with its default "_" separator.
For what it's worth, I searched Google for "tidy r concatenate across columns", which turned up this link, which compares methods.
For a fuller solution, please look at the suggestions for creating a reproducible example, particularly how to share data or make a minimal example.
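The remaining step in the question, adding the concatenated sequence to the 'proteins' table, could be sketched as a name lookup. This assumes the matrix labels alleles as "A*02:01" while 'proteins' uses "HLA-A*02:01", as in the samples shown; the sequences and the `allele_seq` vector below are shortened, made-up stand-ins:

```r
# Toy versions of the two inputs (hypothetical, shortened sequences)
proteins <- data.frame(
  Peptide = c("KLEDLERDL", "EVMPVSMAK", "EVMPVSMAK"),
  HLA     = c("HLA-A*02:01", "HLA-A*03:01", "HLA-A*11:01"),
  stringsAsFactors = FALSE
)
allele_seq <- c("A*02:01" = "MAVMAPRT", "A*03:01" = "MAVMPPRT", "A*11:01" = "MAVMAPRS")

# Drop the "HLA-" prefix so the names line up with the allele column of the matrix
key <- sub("^HLA-", "", proteins$HLA)

# Look up each row's sequence by name; alleles without a match become NA
proteins$HLA_amino_acid_sequence <- unname(allele_seq[key])
proteins
```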
I am trying to merge a list of matrices all by the first column like this:
a x x
a q q
b y y
c z z
d w w            x x x x
e v v            q q q q
e r r            y y y y
    ---------->  z z z z
a x x            w w w w
a q q            v v v v
b y y            r r r r
c z z
d w w
e v v
e r r
I would like to use the first column to combine the matrices, but it does not need to be in the resulting matrix. The thing that is challenging me is that there are multiple instances of the same value in the first column (a and e).
I have been looking around but have been unable to find any solutions that account for repeated values in the column the matrices are joined on. With my current code (shown below) I get something like:
x x x x
q q q q
x x x x
q q q q
x x x x
q q q q
y y y y
z z z z
w w w w
v v v v
r r r r
v v v v
r r r r
v v v v
r r r r
I can't seem to find out why the duplicate rows are appearing, but it has something to do with the length of the list, so I am assuming it takes place in the merge function.
mergeM <- function(list){ # list is a list of matrices
len = length(list)
mat = merge(list[[1]],list[[2]],by.x = "V1", by.y = "V1", all = TRUE)
if(len >2){
for(i in 3:len){
mat = merge(mat,list[[i]],by.x = "V1", by.y = "V1", all = TRUE)
}
}
mat = mat[,-1]
return(mat)
}# end function
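The duplicated rows are consistent with how merge() treats repeated keys: every "a" row on one side is paired with every "a" row on the other, and the effect compounds with each further matrix. A minimal sketch of one workaround, assuming repeated keys should be paired in order of appearance; the function name `merge_by_key_occurrence` and the helper column `occ` are hypothetical names introduced for illustration:

```r
# Hypothetical helper: merge on the first column, pairing the k-th "a" with the k-th "a"
merge_by_key_occurrence <- function(mats) {
  dfs <- lapply(seq_along(mats), function(i) {
    df <- as.data.frame(mats[[i]], stringsAsFactors = FALSE)
    names(df)[1] <- "V1"
    # Give value columns unique names per matrix to avoid .x/.y suffix clashes
    names(df)[-1] <- paste0("m", i, "_", seq_len(ncol(df) - 1))
    # occ = running count of each key (1 for the first "a", 2 for the second, ...)
    df$occ <- ave(seq_len(nrow(df)), df$V1, FUN = seq_along)
    df
  })
  merged <- Reduce(function(a, b) merge(a, b, by = c("V1", "occ"), all = TRUE), dfs)
  merged <- merged[order(merged$V1, merged$occ), ]
  as.matrix(merged[, !(names(merged) %in% c("V1", "occ")), drop = FALSE])
}

m <- matrix(c("a", "a", "b", "c", "d", "e", "e",
              "x", "q", "y", "z", "w", "v", "r",
              "x", "q", "y", "z", "w", "v", "r"), ncol = 3)
res <- merge_by_key_occurrence(list(m, m))
dim(res)
```

Because the key is now the pair (V1, occ), each row has at most one partner in the other matrix, so no rows are multiplied.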
I have a data frame that looks like this:
GID7173723 GID4878677 GID88208 GID346403 GID268825 GID7399578
1 A A A A G A
2 T T T T C T
3 G G G G G G
4 A A A A A A
5 G G G G G G
6 G G G G G G
7 A A A A A A
8 G G G G G G
9 A A A A A A
10 A A A A A A
However, when I use the apply function to get the sum of all 'A' by row divided by the number of columns in the dataframe, I get the total sum of A's instead of getting row sums.
Here is the function I wrote:
myfun <- function(x){
out <- sum(x=='A')/ncol(x)
return(out)
}
apply(df,MARGIN = 1,FUN=myfun)
I cannot figure out why the apply function gives me the total sum of A and not by row.
We can use rowSums
rowSums(df1=="A")/ncol(df1)
Or use rowMeans
rowMeans(df1 == "A")
With apply, each row is passed to the function as a vector, so ncol(x) returns NULL; we need length(x) instead
myfun <- function(x){
sum(x=='A')/length(x)
#or
# mean(x == "A")
}
Solution with apply()
apply(df, 1,FUN=function(rowVec) table(rowVec)['A'])
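table() gives the count of each base in the row, and ['A'] selects the count for 'A'. Note that this returns a count rather than a proportion (divide by ncol(df) for that), and NA for rows that contain no 'A'.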
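To compare the approaches side by side, here is a small sketch on the first three rows of the example data (retyped by hand, so treat it as illustrative):

```r
# First three rows of the example data frame
df <- data.frame(GID7173723 = c("A", "T", "G"), GID4878677 = c("A", "T", "G"),
                 GID88208   = c("A", "T", "G"), GID346403  = c("A", "T", "G"),
                 GID268825  = c("G", "C", "G"), GID7399578 = c("A", "T", "G"))

# Proportion of 'A' per row, two equivalent ways
prop_apply    <- apply(df, 1, function(x) mean(x == "A"))
prop_rowmeans <- rowMeans(df == "A")

# Count of 'A' per row via table(); NA where the row has no 'A'
counts_table <- apply(df, 1, function(x) table(x)["A"])
```

Row 1 has five 'A' out of six columns, so both proportion versions give 5/6 for it, while the table() version gives the raw count 5.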
I have a data frame with sequences as columns and amino acid sites as rows. I would like to compare the difference between these sequences at each site.
seq1 seq2 seq3 seq4 seq5 seq6 seq7 seq8
1 K E K K A A A A
2 V D A A T A A A
3 W W W W W W W W
4 R R R R R R S R
5 F S F F F Y F F
6 P P P P P P P P
7 N N N C N N N N
8 V I D D Q Q Q Q
9 Q Q Q Q Q Q Q Q
10 E E G G L I S F
11 L L Q L L L L L
12 N N Y Y V V S S
13 N N N N Q Q P P
14 L L L L L L L L
15 T T T T T T T I
Ideally, I would like to be able to have an additional column in my data frame that shows me the sites that are the same in all sequences and those that are the same only between seq1-4 or seq 5-8.
I am not sure what the best way to do this is, and any help is greatly appreciated.
Also, is there a way to add another column that shows the types of amino acids observed at each site?
Thanks in advance!
First, I compute a vector flagging the rows where all columns are the same:
allsame <- apply(df,1,function(x){
val <- ifelse(length(unique(x)) == 1,1,0)
})
Next, I compute a vector flagging the rows where either set of columns is the same:
startfour <- apply(df[,1:4],1,function(x){
val <- ifelse(length(unique(x)) == 1,1,0)
})
lastfour <- apply(df[,5:8],1,function(x){
val <- ifelse(length(unique(x)) == 1,1,0)
})
gen <- startfour + lastfour
eithersame <- ifelse(gen == 0,0,1)
Finally, you can create a column vector as required from the above 2 arrays and join it to the dataframe:
output <- character(length(allsame))
for(i in 1:length(allsame)){
if(allsame[i] == 1){
output[i] <- "all same"
}
else if(eithersame[i] == 1){
output[i] <- "either same"
}
else{
output[i] <- "none same"
}
}
df <- cbind(df,output)
Here is a quick and dirty way to create the flags that you mentioned. Assuming the dataframe is called amino:
amino$first_flag<-with(amino,ifelse(seq1==seq2 & seq2==seq3 & seq3 == seq4,"same","diff"))
amino$second_flag<-with(amino,ifelse(seq5==seq6 & seq6==seq7 & seq7 == seq8,"same","diff"))
amino$total_flag<-with(amino,ifelse(first_flag=="same" & second_flag=="same" & seq1==seq5,"same","diff"))
Hopefully that works.
edit: and for your last question, I'm not sure what you mean but if you just want the letters that appear in each row then something like this could work:
for(i in 1:nrow(amino)) amino$types[i]<-paste(unique(unlist(amino[i,1:8])),collapse=",")
It will give you a column containing a comma separated list of the letters that appeared in each row.
edit2: If you have significantly more than 8 sequences, then a modified form of Ganesh's solution might work better (his output code isn't actually necessary):
amino$first_flag <- apply(amino[,1:4],1,function(x){
ifelse(length(unique(x)) == 1,"same","diff")
})
amino$second_flag <- apply(amino[,5:8],1,function(x){
ifelse(length(unique(x)) == 1,"same","diff")
})
amino$total_flag <- apply(amino[,1:8],1,function(x){
ifelse(length(unique(x)) == 1,"same","diff")
})
amino$types <- apply(amino[,1:8],1,function(x) paste(unique(x),collapse=","))
And for your new question-
amino$one_diff <- apply(amino[,1:8],1,function(x){
ifelse(7 %in% as.data.frame(table(x))[,2,drop=TRUE],"1 diff",NA)
})
This uses the table() function which normally gives you a count based on a vector or a column like table(amino$seq1). Using apply, we instead stick a row of the 8 sequences into it, it returns the counts, then we use as.data.frame and the brackets [] to get rid of some extra table() output that we don't need. The "7 %in%" part means if there are 7 of the same letters then there must be 1 different one. Anything else (i.e., all 8 same or more than 1 difference) will get NA.
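As a small worked example of the table() step, here is row 4 of the sample data (seven R and one S) handled on its own. This sketch tests `7 %in%` directly on the counts, which should be equivalent to the as.data.frame() route above:

```r
# Row 4 of the example data: seven "R" and one "S"
site <- c("R", "R", "R", "R", "R", "R", "S", "R")

# Count how often each amino acid occurs at this site
counts <- table(site)
counts

# TRUE when some amino acid occurs exactly 7 times, i.e. exactly one sequence differs
one_diff <- 7 %in% as.vector(counts)
one_diff
```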
I have an ff object. One of the columns, which is a string variable, has white spaces, and I want to remove these.
I have tried the following:
1). newcol <- gsub("[[:space:]]", "", mydata$mystr)
2). newcol<- as.ffdf(gsub("[[:space:]]", "", mydata$mystr))
I also tried to use the as.character command, such that I said the following before applying the gsub command:
mydata$mystr <- as.character(ff(c(mydata$mystr)))
However, none of these options works. Any suggestions/help would be greatly appreciated.
EDIT: SOLUTION GIVEN BY AKRUN BELOW
Maybe you can try ffbase
library(ffbase)
library(ff)
head(ffd$y[])
#[1] p l k a i v
#20 Levels: a c c e f h i j k k l l n
#n o ... v
ffd$y <- with(ffd, gsub('[[:space:]]', '', y))
head(ffd$y[])
#[1] p l k a i v
#Levels: a c e f h i j k l n o p q t v
data
set.seed(24)
d <- data.frame(x=1:26, y=sample(c(letters, paste(' ', letters, ' ')),
26, replace=TRUE), z=Sys.time()+1:26)
ffd <- as.ffdf(d)