Combining datasets - r

I have 15 datasets. The 1st column is "subject" and is identical in all sets. The number of the rest of the columns is not the same in all datasets. I need to combine all of this data in a single dataframe. I found the command "Reduce" but I am just starting with R and I couldn't understand if this is what I need and if so, what is the syntax? Thanks!

I suggest including a reproducible example in the future so that others can see the format of data you're working with and what you're trying to do.
Here is some randomly generated example data, each with the "Subject" column:
list_of_dfs <- list(
df1 = data.frame(Subject = 1:4, a = rnorm(4), b = rnorm(4)),
df2 = data.frame(Subject = 5:8, c = rnorm(4), d = rnorm(4), e = rnorm(4)),
df3 = data.frame(Subject = 7:10, f = rnorm(4)),
df4 = data.frame(Subject = 2:5, g = rnorm(4), h = rnorm(4))
)
Reduce with merge is a good choice:
combined_df <- Reduce(
function(x, y) { merge(x, y, by = "Subject", all = TRUE) },
list_of_dfs
)
And the output:
> combined_df
Subject a b c d e f g h
1 1 1.1106594 1.2530046 NA NA NA NA NA NA
2 2 -1.0275630 0.6437101 NA NA NA NA -1.9393347 -0.4361952
3 3 0.1558639 1.2792212 NA NA NA NA -0.8861966 1.0137530
4 4 0.4283585 -0.1045530 NA NA NA NA 1.8924896 -0.3788198
5 5 NA NA 0.08261190 0.77058804 -1.165042 NA 0.7950784 -1.3467386
6 6 NA NA 2.51214598 0.62024328 1.496520 NA NA NA
7 7 NA NA 0.01581309 -0.04777196 -1.327884 1.5111734 NA NA
8 8 NA NA 0.80448136 -0.33347573 -2.290428 -0.3863564 NA NA
9 9 NA NA NA NA NA -1.2371795 NA NA
10 10 NA NA NA NA NA 1.6819063 NA NA
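To make the syntax a bit more concrete: Reduce applies a two-argument function to a list from left to right, so the call above behaves like nesting merge by hand. Below is a minimal sketch with small made-up data frames (dfs and merge_all are names invented for this illustration, not part of the question).

```r
# Three small, deterministic data frames sharing the key column "Subject".
dfs <- list(
  data.frame(Subject = 1:3, a = c(10, 20, 30)),
  data.frame(Subject = 2:4, b = c("x", "y", "z")),
  data.frame(Subject = c(1L, 4L), c = c(TRUE, FALSE))
)

# Full outer join of two data frames on "Subject".
merge_all <- function(x, y) merge(x, y, by = "Subject", all = TRUE)

# Reduce() folds the list left to right ...
folded <- Reduce(merge_all, dfs)

# ... which is the same as writing the nested calls by hand.
nested <- merge_all(merge_all(dfs[[1]], dfs[[2]]), dfs[[3]])

identical(folded, nested)
```

With 15 datasets the only change is a longer list; the function passed to Reduce stays the same.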


R dataframe: combine conditions by processing

I need to find the columns that consist entirely of NA values. In every column that is not entirely NA, I need to replace the NAs with 0.
My solution is:
NA_check <- colSums(is.na(frame)) == nrow(frame) #True or False - all NA or not
frame[is.na(frame) & which(names(frame) %in% names(NA_check)[which(NA_check == FALSE, arr.ind=T)])] <- 0
These conditions work separately, but they don't work together or I get some errors combining them. How can I solve my problem?
P.S. This modification also doesn't work if NA_check is not all FALSE:
frame[is.na(frame[which(names(frame) %in% names(NA_check)[which(NA_check == FALSE, arr.ind=T)])])] <- 0
You can find the columns that have at least one non-NA value (that is, not all values are NA) and replace the NAs in that subset with 0.
not_all_NA <- colSums(!is.na(frame)) > 0
frame[not_all_NA][is.na(frame[not_all_NA])] <- 0
We can check this with an example :
frame <- data.frame(a = c(NA, NA, 3, 4), b = NA, c = c(NA, 1:3), d = NA)
frame
# a b c d
#1 NA NA NA NA
#2 NA NA 1 NA
#3 3 NA 2 NA
#4 4 NA 3 NA
not_all_NA <- colSums(!is.na(frame)) > 0
frame[not_all_NA][is.na(frame[not_all_NA])] <- 0
frame
# a b c d
#1 0 NA 0 NA
#2 0 NA 1 NA
#3 3 NA 2 NA
#4 4 NA 3 NA
We can also do this with dplyr :
library(dplyr)
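# Anonymous-function form; passing extra arguments through across() is deprecated in recent dplyr
frame %>% mutate(across(where(~ any(!is.na(.x))), ~ tidyr::replace_na(.x, 0)))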

Populate matrix by colname identity

I have many samples, each one of which has a corresponding abundance matrix. From these abundance matrices, I would like to create a large matrix that contains abundance information for each sample in rows.
For example, a single abundance matrix would look like:
A B C D
sample1 1 3 4 2
where A, B, C, and D represent colnames, and the abundances are the row values.
I would like to populate my larger matrix, which has as colnames all possible letters (A:Z) and all possible samples (sample1:sampleN) as rows, by matching the colname values.
For example:
A B C D E F G .... Z
sample1 1 3 4 2 NA NA NA ....
sample2 NA NA 2 5 7 NA NA ....
sample3 4 NA 6 9 2 NA 2 .....
....
sampleN
Different samples have a varying mix of abundances, in no guaranteed order.
When iteratively adding to this larger matrix, how could I ensure that the correct columns are populated by the right abundance values (ex. column "A" is only filled by values corresponding to abundances of "A" in different samples)? Thanks!
Starting data, changing just a little to highlight differences:
m1 <- as.matrix(read.table(header=TRUE, text="
A B C Z
sample1 1 3 4 2"))
m2 <- as.matrix(read.table(header=TRUE, text="
A B C D E F G
sample2 NA NA 2 5 7 NA NA
sample3 4 NA 6 9 2 NA 2"))
First, we need to make sure both matrices have the same column names:
newcols <- setdiff(colnames(m2), colnames(m1))
m1 <- cbind(m1, matrix(NA, nrow=nrow(m1), ncol=length(newcols), dimnames=list(NULL, newcols)))
newcols <- setdiff(colnames(m1), colnames(m2))
m2 <- cbind(m2, matrix(NA, nrow=nrow(m2), ncol=length(newcols), dimnames=list(NULL, newcols)))
m1
# A B C Z D E F G
# sample1 1 3 4 2 NA NA NA NA
m2
# A B C D E F G Z
# sample2 NA NA 2 5 7 NA NA NA
# sample3 4 NA 6 9 2 NA 2 NA
And now we combine them; rbind on matrices stacks columns by position, not by name, so the columns need to be put in the same order as well:
rbind(m2, m1[,colnames(m2),drop=FALSE])
# A B C D E F G Z
# sample2 NA NA 2 5 7 NA NA NA
# sample3 4 NA 6 9 2 NA 2 NA
# sample1 1 3 4 NA NA NA NA 2
You should be able to take advantage of matrix indexing, like so:
big[cbind(rownames(abun),colnames(abun))] <- abun
Using this example abundance matrix, and a big matrix to fill:
abun <- matrix(c(1,3,4,2),nrow=1,dimnames=list("sample1",LETTERS[1:4]))
big <- matrix(NA,nrow=5,ncol=26,dimnames=list(paste0("sample",1:5),LETTERS))
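Note that the cbind form above pairs row and column names element by element, so it fits a one-row abun. For abundance matrices with several rows, a possible variant is to index by the two name vectors directly, which addresses the whole sub-block at once. This is a sketch with made-up values (the three-sample big matrix and the two-row abun are assumptions for illustration):

```r
# Target matrix: one row per sample, one column per letter, all NA to start.
big <- matrix(NA_real_, nrow = 3, ncol = 26,
              dimnames = list(paste0("sample", 1:3), LETTERS))

# A two-row abundance matrix whose columns are in no particular relation
# to the column order of `big`.
abun <- matrix(c(NA, 4, 2, 6, 5, 9), nrow = 2,
               dimnames = list(c("sample2", "sample3"), c("A", "C", "D")))

# Indexing by row names and column names selects the matching block,
# so every value lands in the column that carries its own name.
big[rownames(abun), colnames(abun)] <- abun

big[, c("A", "B", "C", "D")]
```

Because the assignment is driven by names, it can be repeated in a loop over samples without caring about the order in which letters appear.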
Another solution using reduce from purrr package and union_all from dplyr package:
library(purrr)
library(dplyr)
sample_names <- c("sample1","sample2","sample3")
Generating 3 random abundance dataframes:
num1 <- round(runif(runif(1,min = 1, max = 10),min = 1, max = 10))
df1 <- data.frame(t(num1))
colnames(df1) <- sample(LETTERS,length(num1))
num2 <- round(runif(runif(1,min = 1, max = 10),min = 1, max = 10))
df2 <- data.frame(t(num2))
colnames(df2) <- sample(LETTERS,length(num2))
num3 <- round(runif(runif(1,min = 1, max = 10),min = 1, max = 10))
df3 <- data.frame(t(num3))
colnames(df3) <- sample(LETTERS,length(num3))
This is actually the code that does all the magic:
A <- reduce(list(df1,df2,df3),union_all)
col_order <- sort(colnames(A),decreasing = FALSE)
A <- A[,col_order]
rownames(A) <- sample_names
Output:
> A
A C E F O P Q U W Y
sample1 9 NA NA NA 9 NA 5 6 NA NA
sample2 NA NA NA NA 5 4 NA NA 5 NA
sample3 NA 6 5 9 NA NA 3 NA 5 7
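As far as I know, recent dplyr releases (1.1.0 and later) are stricter about set operations and expect both inputs of union_all to have the same columns, so the reduce call above may fail there. A sketch of the same idea with dplyr's bind_rows, which matches columns by name and fills the gaps with NA, using fixed rather than random data (the values are made up for illustration):

```r
library(dplyr)

# One single-row data frame per sample, each with its own set of letters.
df1 <- data.frame(A = 1, B = 3, C = 4, D = 2)
df2 <- data.frame(C = 2, D = 5, E = 7)
df3 <- data.frame(E = 2, A = 4, G = 2)

# bind_rows() aligns columns by name; the names of the arguments
# are stored in the column given by .id.
combined <- bind_rows(sample1 = df1, sample2 = df2, sample3 = df3,
                      .id = "sample")

# Put the letter columns in alphabetical order, keeping "sample" first.
combined <- combined[, c("sample", sort(setdiff(names(combined), "sample")))]
combined
```

bind_rows also accepts a named list of data frames, which is convenient when the number of samples is not known in advance.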

Average over specific number of rows, according to criteria

My data is structured as follows:
set.seed(20)
RawData <- data.frame(Trial = c(rep(1, 10), rep(2, 10)),
X_Velocity = runif(20, 1, 3),
Y_Velocity = runif(20, 4, 6))
I now wish to calculate an average for X_Velocity and Y_Velocity across every two rows, for each Trial. My anticipated output, for the first four rows would be:
X_Velocity_AVG Y_Velocity_AVG
NA NA
2.6460545 4.522224
NA NA
1.8081265 4.5175165
How do I complete this?
You could do this with a function f that averages every two consecutive elements. Since each Trial here has an even number of rows (10), no pair straddles two trials:
f <- function(a) tapply(a, rep(1:(length(a)/2), each = 2), FUN = mean)
res <- data.frame(X_Velocity_AVG=rep(NA, nrow(RawData)),
Y_Velocity_AVG=rep(NA, nrow(RawData)))
# c(FALSE, TRUE) is recycled, so only every second row receives a pair mean
res$X_Velocity_AVG[c(FALSE, TRUE)] <- f(RawData$X_Velocity)
res$Y_Velocity_AVG[c(FALSE, TRUE)] <- f(RawData$Y_Velocity)
# X_Velocity_AVG Y_Velocity_AVG
# 1 NA NA
# 2 2.646055 4.522224
# 3 NA NA
# 4 1.808127 4.517517
# 5 NA NA
# 6 2.943262 4.334551
# 7 NA NA
# 8 1.162082 5.899396
# 9 NA NA
# 10 1.697668 4.739195
# 11 NA NA
# 12 2.473324 4.778723
# 13 NA NA
# 14 1.744730 5.020097
# 15 NA NA
# 16 1.644518 4.986245
# 17 NA NA
# 18 1.431219 5.375815
# 19 NA NA
# 20 2.108719 4.909284
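The function f above pairs rows over the whole column, which is fine here because every Trial has an even number of rows. If a trial could have an odd number of rows, the pairs would drift across trial boundaries. A possible sketch that restarts the pairing inside each Trial, using base R's ave (pos, pair and pair_mean are helper names invented for this example):

```r
set.seed(20)
RawData <- data.frame(Trial = c(rep(1, 10), rep(2, 10)),
                      X_Velocity = runif(20, 1, 3),
                      Y_Velocity = runif(20, 4, 6))

# Position of each row within its own trial: 1, 2, 3, ...
pos <- ave(seq_len(nrow(RawData)), RawData$Trial, FUN = seq_along)

# Rows 1-2 form pair 1, rows 3-4 pair 2, and so on, restarting per trial.
pair <- (pos + 1) %/% 2

# Mean of each (Trial, pair) group, reported on the second row of the pair;
# odd positions (including an unpaired last row) are set to NA.
pair_mean <- function(v) {
  m <- ave(v, RawData$Trial, pair, FUN = mean)
  m[pos %% 2 == 1] <- NA
  m
}

res <- data.frame(X_Velocity_AVG = pair_mean(RawData$X_Velocity),
                  Y_Velocity_AVG = pair_mean(RawData$Y_Velocity))
head(res, 4)
```

For the data in the question this should give the same layout as the tapply approach, with NA on odd rows and the pair mean on even rows.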

Expand loop over multiple columns in R

I have a table (mydf) as shown below. I would like to use this for loop (my code) in R which works for only one column (for ALT1 column in this instance) to loop over all the columns containing ALT1 through ALTn and store the output in separate variables from final1 through finaln.
The purpose here is to loop over ALT1 through ALTn, match each value against the nucleotide columns (A, C, G, T, N), and get the corresponding counts as shown in the result below. Thank you for your help!
mycode
final1 <- {}
i <- 1
result =merge(coverage.bam, rows.concat.alt, by="start")
for(i in 1:nrow(result)){
final1[i] = paste(paste(result$chr[i], result$start[i], result$end[i],sep=":"),"-",
result$REF[i],"(",result[,(as.character(result$REF[i]))][i],")",",", result$ALT1[i],
"(",result[,(as.character(result$ALT1[i]))][i][!is.na(result[,(as.character(result$ALT1[i]))][i])],")",sep="")
}
final1
I have tried to expand this code for ALT through ALTn, but it does not work, could you help me solve this please?
final <- list()
setValue<-function(element){
print(element)
for(i in 1:nrow(result)){
final[[i]] = paste(paste(result$chr[i], result$start[i], result$end[i],sep=":"),"-",
result$REF[i],"(",result[,(as.character(result$REF[i]))][i],")",",", result[,element][i],
"(",result[,(as.character(result[,element][i])))][i][!is.na(result[,(as.character(result[,element][i])][i])],")",sep="")
}
}
for(i in colnames(result)){
if(grepl('ALT', i)){
setValue(i)
}
}
mydf
chr start end A C G T N = - REF ALT ALT1 ALT2 ALT3 ALTn
1 chr10 102022031 102022031 NA 34 NA NA NA NA NA C G G NA NA NA
2 chr10 102220574 102220574 2 22 2 3 NA NA NA C AGT A G T NA
3 chr10 115322228 115322228 NA 25 NA NA NA NA NA C A A NA NA NA
4 chr10 122222925 122222925 30 NA NA NA NA NA NA A C C NA NA NA
5 chr10 121111042 121111042 NA 48 NA NA NA NA NA C T T NA NA NA
6 chr10 124444484 124444484 NA 60 NA NA NA NA NA C T T NA NA NA
Result
"chr10:102022031:102022031-C(34),G()" "chr10:102220574:102220574-C(22),A(2),G(2),T(3)" "chr10:115322228:115322228-C(25),A()"
[4] "chr10:122222925:122222925-A(30),C()" "chr10:121111042:121111042-C(48),T()" "chr10:124444484:124444484-C(60),T()"
Try
p1 <- do.call(paste,c(mydf[1:3], sep=":"))
p2 <- apply(mydf[c(4:8, 11:16)], 1, function(x) {
Un1 <- unique(match( x[7:11], names(x)[1:4], nomatch=0))
i1 <- match(x[6], names(x))
v1 <- paste0(names(x[i1]),'(', x[i1], ')')
v2 <- as.numeric(x[Un1])
v2[is.na(v2)] <- ''
v3 <-paste(names(x[Un1]), '(', v2, ')', sep='', collapse=",")
paste(v1, v3, sep=",") })
paste(p1, p2, sep="-")
#[1] "chr10:102022031:102022031-C(34),G()"
#[2] "chr10:102220574:102220574-C(22),A(2),G(2),T(3)"
#[3] "chr10:115322228:115322228-C(25),A()"
#[4] "chr10:122222925:122222925-A(30),C()"
#[5] "chr10:121111042:121111042-C(48),T()"
#[6] "chr10:124444484:124444484-C(60),T()"
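An alternative that avoids positional column numbers is to look the counts up by name with a two-column index matrix and then loop over whatever ALT columns exist. This is only a sketch on a cut-down, two-row version of the table (the reduced mydf and the helper label are assumptions made for this example):

```r
# Cut-down version of the table: two rows, three ALT columns.
mydf <- data.frame(
  chr   = "chr10",
  start = c(102022031L, 102220574L),
  end   = c(102022031L, 102220574L),
  A = c(NA, 2), C = c(34, 22), G = c(NA, 2), T = c(NA, 3), N = NA,
  REF  = c("C", "C"),
  ALT1 = c("G", "A"), ALT2 = c(NA, "G"), ALT3 = c(NA, "T"),
  stringsAsFactors = FALSE
)

bases  <- c("A", "C", "G", "T", "N")
counts <- as.matrix(mydf[bases])          # numeric matrix of per-base counts

# Build "C(34)"-style labels for one column of base letters.
# A missing letter gives NA; a missing count gives empty parentheses.
label <- function(base) {
  n <- counts[cbind(seq_len(nrow(mydf)), match(base, bases))]
  ifelse(is.na(base), NA_character_,
         paste0(base, "(", ifelse(is.na(n), "", n), ")"))
}

# Apply the same lookup to every ALT<number> column that is present.
alt_cols   <- grep("^ALT[0-9]+$", names(mydf), value = TRUE)
alt_labels <- do.call(cbind, lapply(mydf[alt_cols], label))

# Per row, keep the non-missing ALT labels and join them with commas.
alts <- apply(alt_labels, 1, function(x) paste(x[!is.na(x)], collapse = ","))

final <- paste0(mydf$chr, ":", mydf$start, ":", mydf$end, "-",
                label(mydf$REF), ",", alts)
final
# [1] "chr10:102022031:102022031-C(34),G()"
# [2] "chr10:102220574:102220574-C(22),A(2),G(2),T(3)"
```

Keeping the per-column labels in alt_labels also makes it easy to pull out a single ALT column, which is roughly what the separate final1 ... finaln variables in the question were meant to hold.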

Match a list of items with rows items of a data.frame

Hi guys I have a difficult situation to manage:
I have a data.frame that looks like this:
General_name
a
b
c
d
m
n
and another data.frame that looks like this:
First_names_list a=34;b=4
Second_names_list d=2;m=98;n=32
Third_names_list c=1;d=12;m=0.1
I have to match each element of the first data.frame against the names that appear before each = in the second column of the second data.frame, so that I end up with the following table:
Names a b c d m n
First_names_list 34 4 NA NA NA NA
Second_names_list NA NA NA 2 98 32
Third_names_list NA NA 1 12 0.1 NA
Any suggestion? It seems to be too difficult to me.
Best
E.
Option 1
Here is one approach using dcast from "reshape2" and concat.split from my "splitstackshape" package:
library(splitstackshape)
## The following can also be done in 2 steps. The basic idea is to split
## the values into a semi-long form for `dcast` to be able to use. So,
## I've split first on the semicolon, and made the data into a long form
## at the same time, then I've split on =, but kept it wide that time.
out <- concat.split(concat.split.multiple(df, "V2", ";", "long"),
"V2", "=", drop = TRUE)
out
# V1 time V2_1 V2_2
# 1 First_names_list 1 a 34.0
# 2 Second_names_list 1 d 2.0
# 3 Third_names_list 1 c 1.0
# 4 First_names_list 2 b 4.0
# 5 Second_names_list 2 m 98.0
# 6 Third_names_list 2 d 12.0
# 7 First_names_list 3 <NA> NA
# 8 Second_names_list 3 n 32.0
# 9 Third_names_list 3 m 0.1
library(reshape2)
dcast(out[complete.cases(out), ], V1 ~ V2_1, value.var="V2_2")
# V1 a b c d m n
# 1 First_names_list 34 4 NA NA NA NA
# 2 Second_names_list NA NA NA 2 98.0 32
# 3 Third_names_list NA NA 1 12 0.1 NA
Option 2
Here's another option using a more recent version of data.table. The concept is very similar to the approach taken above.
library(data.table)
library(reshape2)
packageVersion("data.table")
# [1] ‘1.8.11’
dt <- data.table(df)
S1 <- dt[, list(X = unlist(strsplit(as.character(V2), ";"))), by = V1]
S1[, c("A", "B") := do.call(rbind.data.frame, strsplit(X, "="))]
S1
# V1 X A B
# 1: First_names_list a=34 a 34
# 2: First_names_list b=4 b 4
# 3: Second_names_list d=2 d 2
# 4: Second_names_list m=98 m 98
# 5: Second_names_list n=32 n 32
# 6: Third_names_list c=1 c 1
# 7: Third_names_list d=12 d 12
# 8: Third_names_list m=0.1 m 0.1
dcast.data.table(S1, V1 ~ A, value.var="B")
# V1 a b c d m n
# 1: First_names_list 34 4 NA NA NA NA
# 2: Second_names_list NA NA NA 2 98 32
# 3: Third_names_list NA NA 1 12 0.1 NA
Both of the above options assume we're starting with:
df <- structure(list(V1 = c("First_names_list", "Second_names_list",
"Third_names_list"), V2 = c("a=34;b=4", "d=2;m=98;n=32",
"c=1;d=12;m=0.1")), .Names = c("V1", "V2"), class = "data.frame",
row.names = c(NA, -3L))
Here is a solution, using apply within apply:
#Data frame 1
df1 <- read.table(text=
"General_name
a
b
c
d
m
n", header=T, as.is=T)
#Data frame 2
df2 <- read.table(text=
"col1 col2
First_names_list a=34;b=4
Second_names_list d=2;m=98;n=32
Third_names_list c=1;d=12;m=0.1", header=T, as.is=T)
#make lists for each row, sep by ";"
df2split <- strsplit(df2$col2,split=";")
#result
t(
sapply(seq(1:nrow(df2)),function(c){
x <- df2split[[c]]
sapply(df1$General_name,function(n){
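# anchor the pattern at the start so a name only matches as a key, not inside another entry
val <- sub(paste0("^",n,"="),"",x[grepl(paste0("^",n,"="),x)])
ifelse(length(val)==0,NA,as.numeric(val))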
})
})
)
I feel this is a slightly round-about way to do it so I look forward to a better solution as well. But this works.
library(data.table)
library(reshape2)
#creating datasets
dt <- data.table(read.csv(textConnection('
"First_names_list","a=34;b=4"
"Second_names_list","d=2;m=98;n=32"
"Third_names_list","c=1;d=12;m=0.1"
'),header = FALSE))
General_name = c('a','b','c','d','m','n')
TotalBreakup <- data.table(
V1 = General_name
)
# Fixing datatypes
TotalBreakup <- TotalBreakup[,lapply(.SD,as.character)]
dt <- dt[,lapply(.SD,as.character)]
# looping through each row and calculating breakdown
for(i in 1:nrow(dt))
{
# the next two statements are the workhorse of this code. Run each part of these statements step by step to see
dtlist <- strsplit(unlist(strsplit(dt[i,V2],";")),"=")
breakup <- data.table(
t(
matrix(
unlist(
strsplit(
unlist(
strsplit(
dt[i,V2],
";"
)
),
"="
)
),
nrow = 2
)
)
)
# fixing datatypes again
breakup <- breakup[,lapply(.SD,as.character)]
#appending to master dataset
TotalBreakup <- merge(TotalBreakup, breakup, by = "V1", all.x = TRUE)
}
#formatting results
setnames(TotalBreakup,c("Names",dt[,V1]))
TotalBreakup <- acast(melt(TotalBreakup,id.vars = "Names"),variable~Names)
Output -
> TotalBreakup
a b c d m n
First_names_list "34" "4" NA NA NA NA
Second_names_list NA NA NA "2" "98" "32"
Third_names_list NA NA "1" "12" "0.1" NA
One way to do this:
#the second dataframe you provided
DF2 <- read.table(text = '
First_names_list a=34;b=4
Second_names_list d=2;m=98;n=32
Third_names_list c=1;d=12;m=0.1
', header = F, stringsAsFactors = F)
#empty dataframe
DF <- structure(list(a = c(NA, NA, NA), b = c(NA, NA, NA), c = c(NA,
NA, NA), d = c(NA, NA, NA), m = c(NA, NA, NA), n = c(NA, NA,
NA)), .Names = c("a", "b", "c", "d", "m", "n"), row.names = c("First_names_list",
"Second_names_list", "Third_names_list"), class = "data.frame")
DF
# a b c d m n
#First_names_list NA NA NA NA NA NA
#Second_names_list NA NA NA NA NA NA
#Third_names_list NA NA NA NA NA NA
#fill the dataframe
myls <- strsplit(DF2$V2, split = ";")
for(i in 1:length(myls))
{
sapply(myls[[i]],
function(x) { res <- unlist(strsplit(x, "=")) ; DF[i,res[1]] <<- res[2] })
}
DF
# a b c d m n
#First_names_list 34 4 <NA> <NA> <NA> <NA>
#Second_names_list <NA> <NA> <NA> 2 98 32
#Third_names_list <NA> <NA> 1 12 0.1 <NA>
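The loop above leaves the values as character strings. If numeric values are wanted, one possible base R variant is to split everything first and fill a numeric matrix in one step through a two-column index matrix. This is a sketch (general, pairs, row_id and kv are names chosen for this example):

```r
df2 <- data.frame(
  V1 = c("First_names_list", "Second_names_list", "Third_names_list"),
  V2 = c("a=34;b=4", "d=2;m=98;n=32", "c=1;d=12;m=0.1"),
  stringsAsFactors = FALSE
)
general <- c("a", "b", "c", "d", "m", "n")

# Split each row into its "key=value" pieces.
pairs  <- strsplit(df2$V2, ";", fixed = TRUE)
# Remember which row of df2 every piece came from.
row_id <- rep(seq_along(pairs), lengths(pairs))
# Two-column character matrix: key in column 1, value in column 2.
kv <- do.call(rbind, strsplit(unlist(pairs), "=", fixed = TRUE))

# Empty numeric result, one row per list and one column per general name.
out <- matrix(NA_real_, nrow = nrow(df2), ncol = length(general),
              dimnames = list(df2$V1, general))

# Drop keys that are not in the general list, then fill by (row, column).
keep <- kv[, 1] %in% general
out[cbind(row_id[keep], match(kv[keep, 1], general))] <- as.numeric(kv[keep, 2])
out
```

The result is a numeric matrix with the list names as row names; as.data.frame(out) turns it into a data frame if that is preferred.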
