Identify identical columns in two data frames and extract them in r - r

I have two data frames; mRNA (here) and RPPA(here). The mRNA data frame has 1,212 columns, while the RPPA data frame has 937 columns. All columns names in the RPPA data frame appear also in the mRNA data frame (but not in the same order). Within the columns, the values are different between the two data frames.
I want to create a new mRNA data frame, which will contain the same columns as the RPPA data frame, and will not contain the columns that do not appear in the ("old") mRNA data frame.
An example:
mRNA <- data.frame(A=c(25,76,23,45), B=c(56,89,12,452), C=c(45,456,243,5), D=c(13,65,23,16), E=c(17:20), F=c(256,34,0,5))
RPPA <- data.frame(B=c(46,47,45,49), A=c(51,87,34,87), D=c(76,34,98,23))
The expected result would be:
> new.mRNA
B A D
56 25 13
89 76 65
12 23 23
452 45 16
I've tried converting the RPPA column names into a vector, and than use it with the command mRNA[col.names.vector], as described here, but it doesn't work. It gives the error undefined columns selected.
Is there a quick way to do it (without functions, loops etc.)?

Both of the answers that were posted didn't work for my data. Thanks to both answers posted, and with a little more research, I figured out the answer:
First, you need to generate a vector that will include ONLY the column names that appear in BOTH data frames. In order to do that I used the command intersect and Reduce:
target <- Reduce(intersect, list(colnames(raw.mRNA), colnames(RPPA)))
Now you can use the answer that was given:
new.mRNA <- mRNA[target]
and this will generate a new data frame with the right values.
Thank you #akrun and #Titolondon for your help

You can find the dissimilar columns in two data frames as per the below code.
col_name=colnames(mRNA[which(!(colnames(mRNA) %in% colnames(RPPA)))])
new_mRNA=mRNA %>% select(-col_name)

We can subset the mRNA by the column names of 'RPPA' and assign it to 'RPPA'
RPPA[] <- mRNA[names(RPPA)]

Subset of a data.frame with a vector should have work.
Create a vector of the column name you want to keep
Subset you data.frame using this vector
mRNA <- data.frame(A=c(25,76,23,45), B=c(56,89,12,452), C=c(45,456,243,5), D=c(13,65,23,16), E=c(17:20), F=c(256,34,0,5))
RPPA <- data.frame(B=c(46,47,45,49), A=c(51,87,34,87), D=c(76,34,98,23))
mRNA
#> A B C D E F
#> 1 25 56 45 13 17 256
#> 2 76 89 456 65 18 34
#> 3 23 12 243 23 19 0
#> 4 45 452 5 16 20 5
RPPA
#> B A D
#> 1 46 51 76
#> 2 47 87 34
#> 3 45 34 98
#> 4 49 87 23
mRNA[, names(RPPA)]
#> B A D
#> 1 56 25 13
#> 2 89 76 65
#> 3 12 23 23
#> 4 452 45 16

Related

R lapply(): Change all columns within all data frames in a list to numeric, then convert all values to percentages

Question:
I am a little stumped as to how I can batch process as.numeric() (or any other function for that matter) for columns in a list of data frames.
I understand that I can view specific data frames or colunms within this list by using:
> my.list[[1]]
# or columns within this data frame using:
> my.list[[1]][1]
But my trouble comes when I try to apply this into an lapply() function to change all of the data from integer to numeric.
# Example of what I am trying to do
> my.list[[each data frame in list]][each column in data frame] <-
as.numberic(my.list[[each data frame in list]][each column in data frame])
If you can assist me in any way, or know of any resources that can help me out I would appreciate it.
Background:
My data frames are structured as the below example, where I have 5 habitat types and information on how much area an individual species home range extends to n :
# Example data
spp.1.data <- data.frame(Habitat.A = c(100,45,0,9,0), Habitat.B = c(0,0,203,45,89), Habitat.C = c(80,22,8,9,20), Habitat.D = c(8,59,77,83,69), Habitat.E = c(23,15,99,0,10))
I have multiple data frames with the above structure which I have assigned to a list object:
all.spp.data <- list(spp.1.data, spp.2.data, spp.1.data...n)
I am then trying to coerce all data frames to as.numeric() so I can create data frames of % habitat use i.e:
# data, which is now numeric as per Phil's code ;)
data.numeric <- lapply(data, function(x) {
x[] <- lapply(x, as.numeric)
x
})
> head(data.numeric[[1]])
Habitat.A Habitat.B Habitat.C Habitat.D Habitat.E
1 100 0 80 8 23
2 45 0 22 59 15
3 0 203 8 77 99
4 9 45 9 83 0
5 0 89 20 69 10
EDIT: I would like to sum every row, in all data frames
# Add row at the end of each data frame populated by rowSums()
f <- function(i){
data.numeric[[i]]$Sums <- rowSums(data.numeric[[i]])
data.numeric[[i]]
}
data.numeric.SUM <- lapply(seq_along(data.numeric), f)
head(data.numeric.SUM[[1]])
Habitat.A Habitat.B Habitat.C Habitat.D Habitat.E Sums
1 100 0 80 8 23 211
2 45 0 22 59 15 141
3 0 203 8 77 99 387
4 9 45 9 83 0 146
5 0 89 20 69 10 188
EDIT: This is the code I used to convert values within the data frames to % habitat used
# Used Phil's logic to convert all numbers in percentages
data.numeric.SUM.perc <- lapply(data.numeric.SUM,
function(x) {
x[] <- (x[]/x[,6])*100
x
})
Perc.Habitat.A Perc.Habitat.B Perc.Habitat.C Perc.Habitat.D Perc.Habitat.E
1 47 32 0 6 0
2 0 0 52 31 47
3 38 16 2 6 11
4 4 42 20 57 37
5 11 11 26 0 5
6 100 100 100 100 100
This is still not the most condensed way to do this, but it did the trick for me.
Thank you, Phil, Val and Leo P, for helping with this problem.
I'd do this a bit more explicitly:
all.spp.data <- lapply(all.spp.data, function(x) {
x[] <- lapply(x, as.numeric)
x
})
As a personal preference, this clearly conveys to me that I'm looping over each column in a data frame, and looping over each data frame in a list.
If you really want to do it all with lapply, here's a way to go:
lapply(all.spp.data,function(x) do.call(cbind,lapply(1:nrow(x),function(y) as.numeric(x[,y]))))
This uses a nested lapply call. The first one references the single data.frames to x. The second one references the column index for each x to y. So in the end I can reference each column by x[,y].
Since everything will be split up in single vectors, I'm calling do.call(cbind, ... ) to bring it back to a matrix. If you prefer you could add data.frame() around it to bring it back into the original type.

Speedup search of Elements

I got two data.frames m (23 columns 135.973 rows) with the two important columns
head(m[,2])
# [1] "chr1" "chr1" "chr1" "chr1" "chr1" "chr1"
head(m[,7])
# [1] 3661216 3661217 3661223 3661224 3661564 3661567
and search (4 columns 1.019.423 rows) with three important columns
head(search[,1])
# [1] "chr1" "chr1" "chr1" "chr1" "chr1" "chr1"
head(search[,3])
# [1] 3000009 3003160 3003187 3007262 3028947 3050944
head(search[,4])
# [1] 3000031 3003182 3003209 3007287 3028970 3050995
For each row in m I like to get the information if the m[XX,7] position is between any position of search[,3] and search[,4]. So search[,3] can be considered as "start" and search[,4] as "end". In addition search[,1] and m[,2] have to be identical.
Example:
m at row 215
"chr1" 10.984.038
hits in search at line 2898
"chr1" 10.984.024 10.984.046
In general I'm not interested which line or how many lines of search could be found. I just want the information for any line of m is there a matching line in search yes or no.
I'm ending up in this function:
f_4<-function(x,y,z){
for (out in 1:length(x[,1])) {
z[out]<-length(which((y[,1]==x[out,2]) &(x[out,7]>=y[,3]) &(x[out,7]<=y[,4])))
}
return(z)
}
found4<-vector(length=length(m[,1]), mode="numeric")
found4<-f_4(m,search,found4)
It took 3 hours to run this code.
I have already tried some speedup approaches, however I didn't manage to get any of this running proper or faster.
I even tried some lappy/apply approaches -which worked but aren't faster-. However they failed when trying to speed up using parLapply/parRapply.
Anybody got a quite faster approach and may can give some advise?
EDIT 2015/09/18
Found another way to speed up, using foreach %dopar%.
f5<-function(x,y,z){
foreach(out=1:length(x[,1]), .combine="c") %dopar% {
takt<-1000
z=length(which((y[,1]==x[out,2]) &(x[out,7]>=y[,3]) &(x[out,7]<=y[,4]) ))
}
return(z)
}
found5<-vector(length=length(m[,1]), mode="numeric")
found5<-f5(m,search,found5)
Only need 45min. However I'm always getting 0 only. Thing I need to read some more of the foreach %dopar% tutorials.
You can try merging with subsequent logical subsetting. First let's create some mock data:
set.seed(123) # used for reproducibility
m <-as.data.frame(matrix(sample(50,7000, replace=T), ncol=7, nrow=1000))
search <- as.data.frame(matrix(sample(50,1200, replace=T), ncol=4, nrow=300))
Since we want to compare different rows of the two sets, we can use the criterion that m[,2] should be equal to search[,1]. For convenience we can name these columns "ID" in both sets:
m <- cbind(m,seq_along(1:nrow(m)))
search <- cbind(search,seq_along(1:nrow(search)))
colnames(m) <- c("a","ID","c","d","e","f","val","rownum.m")
colnames(search) <- c("ID","nothing","start","end", "rownum.s")
We have added a column to m named 'rownum.m' and a similar column to search which in the end will help identifying the resulting entries in the initial dataset.
Now we can merge the data sets, such that the ID is the same:
m2 <- merge(m,search)
In a final step, we can perform a logical subset of the merged data set and assign the output to a new data frame m3:
m3 <- m2[(m2[,"val"] >= m2[,"start"]) & (m2[,"val"] <= m2[,"end"]),]
#> head(m3)
# ID a c d e f val rownum.m nothing start end rownum.s
#5 1 14 36 36 31 30 25 846 10 20 36 291
#13 1 34 49 24 8 44 21 526 10 20 36 291
#17 1 19 32 29 44 24 35 522 6 33 48 265
#20 1 19 32 29 44 24 35 522 32 31 50 51
#21 1 19 32 29 44 24 35 522 10 20 36 291
#29 1 6 50 10 13 43 22 15 10 20 36 291
If we are only interested in a TRUE/FALSE statement whether a specific row of m matches the criterions, we can define a vector match_s:
match_s <- m$rownum.m %in% m3$rownum.m
which can be stored as an additional column in the original data set m:
m <- cbind(m,match_s)
Finally, we can remove the auxiliary column 'rownum.m' from the data set m which is no longer needed, with m <- m[,-8].
The result is:
> head(m)
# a ID c d e f val match_s
#1 15 14 8 11 16 13 23 FALSE
#2 40 30 8 48 42 50 20 FALSE
#3 21 9 8 19 30 36 19 TRUE
#4 45 43 26 32 41 33 27 FALSE
#5 48 43 25 10 15 13 4 FALSE
#6 3 24 31 33 8 5 36 FALSE
If you're trying to find SNPs (say) inside a set of genomic regions, don't use R. Use BEDOPS.
Convert your SNP or single-base positions to a three-column BED file. In R, make a three-column data table with m[,2], m[,7] and m[,7] + 1, which represent the chromosome, start and stop position of the SNP. Use write.table() to write out this data table to a tab-delimited text file.
Do the same with your genomic regions: Write search[,1], search[,3], and search[,4] to a three-column data table representing the chromosome, start and stop position of the region. Use write.table() to write this out to a tab-delimited text file.
Use sort-bed to sort both BED files. This step might be optional, but it doesn't take long to do and it guarantees that the files are prepped for use with BEDOPS tools.
Finally, use bedmap on the two BED files to map SNPs to genomic regions. Mapping associates SNPs with regions. The bedmap tool can report which SNPs map to a region, or report the number of SNPs, or one or more of many other operations. The documentation for bedmap goes into more detail on the list of operations, but the provided example should get you started quickly.
If your data are in BED format, or can be quickly coerced into BED format, don't use R for genomic operations, as it is slow and memory-intensive. The BEDOPS toolkit introduced the use of sorting to make genomic operations fast, with low memory overhead.

How to find group of rows of a data frame where error occures

I have a two-column dataframe contataining thousands of IDs where each ID has hundreds of data rows, in other words a data frame of about 6 million rows. I am grouping (using either dplyr or data.table) this data frame by IDs and performing a "tso" (outlier detection) function on grouped data frame. The problem is after hours of computation it returns me an error related to ARIMA specification of one of the IDs. Question is how can I identify the ID (or the row number) where my function is returning error?? (if I detect it then I can remove that ID from dataframe)
I tried to manually perform my function on subgroups of this dataframe however I cannot reach the erroneous ID because there are thousands of IDs so it takes me weeks to find them this way.
outlier.detection <- function(x,iter) {
y <- as.ts(x)
out2 <- tso(y,maxit.iloop = iter,tsmethod = "auto.arima",remove.method = "bottom-up",cval=3)
y[out2$outliers$ind] <- NA
return(y)}
df <- data.table(outlying1);setkey(df,id)
test <- df[,list(new.weight = outlier.detection(weight,iter=1)),by=id]
the above function finds the annomalies and replace them with NAs. here is an example,
ID weight
1 a 50
2 a 50
3 a 51
4 a 51.5
5 a 52
6 b 80
7 b 81
8 b 81.5
9 b 90
10 b 82
it will look like the following,
ID weight
1 a 50
2 a 50
3 a 51
4 a 51.5
5 a 52
6 b 80
7 b 81
8 b 81.5
9 b NA
10 b 82

Avoid using a loop to get sum of rows in R, where I want to start and stop the sum on different columns for each row

I am relatively new to R from Stata. I have a data frame that has 100+ columns and thousands of rows. Each row has a start value, stop value, and 100+ columns of numerical values. The goal is to get the sum of each row from the column that corresponds to the start value to the column that corresponds to the stop value. This is direct enough to do in a loop, that looks like this (data.frame is df, start is the start column, stop is the stop column):
for(i in 1:nrow(df)) {
df$out[i] <- rowSums(df[i,df$start[i]:df$stop[i]])
}
This works great, but it is taking 15 minutes or so. Does anyone have any suggestions on a faster way to do this?
You can do this using some algebra (if you have a sufficient amount of memory):
DF <- data.frame(start=3:7, end=4:8)
DF <- cbind(DF, matrix(1:50, nrow=5, ncol=10))
# start end 1 2 3 4 5 6 7 8 9 10
#1 3 4 1 6 11 16 21 26 31 36 41 46
#2 4 5 2 7 12 17 22 27 32 37 42 47
#3 5 6 3 8 13 18 23 28 33 38 43 48
#4 6 7 4 9 14 19 24 29 34 39 44 49
#5 7 8 5 10 15 20 25 30 35 40 45 50
take <- outer(seq_len(ncol(DF)-2)+2, DF$start-1, ">") &
outer(seq_len(ncol(DF)-2)+2, DF$end+1, "<")
diag(as.matrix(DF[,-(1:2)]) %*% take)
#[1] 7 19 31 43 55
If you are dealing with values of all the same types, you typically want to do things in matrices. Here is a solution in matrix form:
rows <- 10^3
cols <- 10^2
start <- sample(1:cols, rows, replace=T)
end <- pmin(cols, start + sample(1:(cols/2), rows, replace=T))
# first 2 cols of matrix are start and end, the rest are
# random data
mx <- matrix(c(start, end, runif(rows * cols)), nrow=rows)
# use `apply` to apply a function to each row, here the
# function sums each row excluding the first two values
# from the value in the start column to the value in the
# end column
apply(mx, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
# df version
df <- as.data.frame(mx)
df$out <- apply(df, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
You can convert your data.frame to a matrix with as.matrix. You can also run the apply directly on your data.frame as shown, which should still be reasonably fast. The real problem with your code is that your are modifying a data frame nrow times, and modifying data frames is very slow. By using apply you get around that by generating your answer (the $out column), which you can then cbind back to your data frame (and that means you modify your data frame just once).

Combining vectors of unequal length into a data frame

I have a list of vectors which are time series of inequal length. My ultimate goal is to plot the time series in a ggplot2 graph. I guess I am better off first merging the vectors in a dataframe (where the shorter vectors will be expanded with NAs), also because I want to export the data in a tabular format such as .csv to be perused by other people.
I have a list that contains the names of all the vectors. It is fine that the column titles be set by the first vector, which is the longest. E.g.:
> mylist
[[1]]
[1] "vector1"
[[2]]
[1] "vector2"
[[3]]
[1] "vector3"
etc.
I know the way to go is to use Hadley's plyr package but I guess the problem is that my list contains the names of the vectors, not the vectors themselves, so if I type:
do.call(rbind, mylist)
I get a one-column df containing the names of the dfs I wanted to merge.
> do.call(rbind, actives)
[,1]
[1,] "vector1"
[2,] "vector2"
[3,] "vector3"
[4,] "vector4"
[5,] "vector5"
[6,] "vector6"
[7,] "vector7"
[8,] "vector8"
[9,] "vector9"
[10,] "vector10"
etc.
Even if I create a list with the object themselves, I get an empty dataframe :
mylist <- list(vector1, vector2)
mylist
[[1]]
1 2 3 4 5 6 7 8 9 10 11 12
0.1875000 0.2954545 0.3295455 0.2840909 0.3011364 0.3863636 0.3863636 0.3295455 0.2954545 0.3295455 0.3238636 0.2443182
13 14 15 16 17 18 19 20 21 22 23 24
0.2386364 0.2386364 0.3238636 0.2784091 0.3181818 0.3238636 0.3693182 0.3579545 0.2954545 0.3125000 0.3068182 0.3125000
25 26 27 28 29 30 31 32 33 34 35 36
0.2727273 0.2897727 0.2897727 0.2727273 0.2840909 0.3352273 0.3181818 0.3181818 0.3409091 0.3465909 0.3238636 0.3125000
37 38 39 40 41 42 43 44 45 46 47 48
0.3125000 0.3068182 0.2897727 0.2727273 0.2840909 0.3011364 0.3181818 0.2329545 0.3068182 0.2386364 0.2556818 0.2215909
49 50 51 52 53 54 55 56 57 58 59 60
0.2784091 0.2784091 0.2613636 0.2329545 0.2443182 0.2727273 0.2784091 0.2727273 0.2556818 0.2500000 0.2159091 0.2329545
61
0.2556818
[[2]]
1 2 3 4 5 6 7 8 9 10 11 12
0.2824427 0.3664122 0.3053435 0.3091603 0.3435115 0.3244275 0.3320611 0.3129771 0.3091603 0.3129771 0.2519084 0.2557252
13 14 15 16 17 18 19 20 21 22 23 24
0.2595420 0.2671756 0.2748092 0.2633588 0.2862595 0.3549618 0.2786260 0.2633588 0.2938931 0.2900763 0.2480916 0.2748092
25 26 27 28 29 30 31 32 33 34 35 36
0.2786260 0.2862595 0.2862595 0.2709924 0.2748092 0.3396947 0.2977099 0.2977099 0.2824427 0.3053435 0.3129771 0.2977099
37 38 39 40 41 42 43 44 45 46 47 48
0.3320611 0.3053435 0.2709924 0.2671756 0.2786260 0.3015267 0.2824427 0.2786260 0.2595420 0.2595420 0.2442748 0.2099237
49 50 51 52 53 54 55 56 57 58 59 60
0.2022901 0.2251908 0.2099237 0.2213740 0.2213740 0.2480916 0.2366412 0.2251908 0.2442748 0.2022901 0.1793893 0.2022901
but
do.call(rbind.fill, mylist)
data frame with 0 columns and 0 rows
I have tried converting the vectors to dataframes, but there is no cbind.fill function, so plyr complains that the dataframes are of different length.
So my questions are:
Is this the best approach? Keep in mind that the goals are a) a ggplot2 graph and b) a table with the time series, to be viewed outside of R
What is the best way to get a list of objects starting with a list of the names of those objects?
What the best type of graph to highlight the patterns of 60 timeseries? The scale is the same, but I predict there'll be a lot of overplotting. Since this is a cohort analysis, it might be useful to use color to highlight the different cohorts in terms of recency (as a continuous variable). But how to avoid overplotting? The differences will be minimal so faceting might leave the viewer unable to grasp the difference.
I think that you may be approaching this the wrong way:
If you have time series of unequal length then the absolute best thing to do is to keep them as time series and merge them. Most time series packages allow this. So you will end up with a multi-variate time series and each value will be properly associated with the same date.
So put your time series into zoo objects, merge them, then use my qplot.zoo function to plot them. That will deal with switching from zoo into a long data frame.
Here's an example:
> z1 <- zoo(1:8, 1:8)
> z2 <- zoo(2:8, 2:8)
> z3 <- zoo(4:8, 4:8)
> nm <- list("z1", "z2", "z3")
> z <- zoo()
> for(i in 1:length(nm)) z <- merge(z, get(nm[[i]]))
> names(z) <- unlist(nm)
> z
z1 z2 z3
1 1 NA NA
2 2 2 NA
3 3 3 NA
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
>
> x.df <- data.frame(dates=index(x), coredata(x))
> x.df <- melt(x.df, id="dates", variable="val")
> ggplot(na.omit(x.df), aes(x=dates, y=value, group=val, colour=val)) + geom_line() + opts(legend.position = "none")
If you're doing it just because ggplot2 (as well as many other things) like data frames then what you're missing is that you need the data in long format data frames. Yes, you just put all of your response variables in one column concatenated together. Then you would have 1 or more other columns that identify what makes those responses different. That's the best way to have it set up for things like ggplot.
You can't. A data.frame() has to be rectangular; but recycling rules assures that the shorter vectors get expanded.
So you may have a different error here -- the data that you want to rbind is not suitable, maybe ? -- but is hard to tell as you did not supply a reproducible example.
Edit Given your update, you get precisely what you asked for: a list of names gets combined by rbind. If you want the underlying data to appear, you need to involve get() or another data accessor.

Resources