Merge/match two data frames

I would like to merge two data frames, y$genes and symbol_annotations, by the row names of y and the second column, "hgnc_symbol", of symbol_annotations, and create a column labeled "Symbol" (y$genes$Symbol) listing all of the matches. If there is no match between "hgnc_symbol" and the row name, I would like NA to populate the cell instead of leaving it empty. I keep getting an error because the two data frames do not have the same dimensions and contain NAs, and I'm not sure how to correct it.
> read.counts <- read.table("gene_counts.txt", header=TRUE)
> row.names(read.counts) <- read.counts$Geneid
> treatment <- factor(treatment)
> head(treatment)
[1] T0 IL2 IL2.ZA IL2.OKT3 IL2.OKT3.ZA T0
Levels: T0 IL2 IL2.OKT3 IL2.OKT3.ZA IL2.ZA
>y <- DGEList(read.counts, group=treatment, genes=read.counts)
>head(y$genes)
SM01 SM02 SM03 SM04 SM05 SM06 SM07 SM08 SM09 SM10 SM11 SM12 SM13 SM14 SM15 SM16 SM17 SM18 SM19
ENSG00000223972 0 1 1 1 0 0 1 0 0 3 0 0 1 2 0 0 0 0 1
ENSG00000227232 33 31 13 15 20 43 36 32 43 43 61 42 92 73 80 64 33 25 28
ENSG00000278267 1 0 1 0 0 5 3 1 1 2 1 0 2 4 6 0 2 2 1
ENSG00000243485 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0
ENSG00000237613 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ENSG00000268020 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
SM20 SM21 SM22 SM23 SM24 SM25 SM26 SM27 SM28 SM29 SM30
ENSG00000223972 0 0 0 0 1 0 0 0 0 0 0
ENSG00000227232 15 60 13 29 22 28 87 42 61 67 74
ENSG00000278267 2 3 5 1 3 4 4 3 2 4 3
ENSG00000243485 0 0 0 0 0 1 0 0 0 0 1
ENSG00000237613 0 0 0 0 0 0 0 0 0 0 0
ENSG00000268020 0 0 0 0 0 0 0 0 0 0 0
>head(symbol_annotations, n=10)
ensembl_gene_id hgnc_symbol
1 ENSG00000210049 MT-TF
2 ENSG00000211459 MT-RNR1
3 ENSG00000210077 MT-TV
4 ENSG00000210082 MT-RNR2
5 ENSG00000209082 MT-TL1
6 ENSG00000198888 MT-ND1
7 ENSG00000210100 MT-TI
8 ENSG00000223795 <NA>
9 ENSG00000210107 MT-TQ
10 ENSG00000210112 MT-TM
>dim(symbol_annotations)
[1] 58069 2
>dim(y$genes)
[1] 58051 30
>y$genes$Symbol <- merge((rownames(y)), symbol_annotations[,c(2)])
Error in if (n > 0) c(NA_integer_, -n) else integer() :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In rep.fac * nx : NAs produced by integer overflow
2: In .set_row_names(as.integer(prod(d))) :
NAs introduced by coercion to integer range
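
The error is consistent with merge() being handed two plain vectors that share no column name: in that case it builds every pairwise combination (58051 x 58069 rows), which overflows R's integer range. A minimal sketch of a lookup with match() follows. It assumes the row names of y$genes are Ensembl IDs (as the head() output suggests), so the lookup key is ensembl_gene_id rather than hgnc_symbol; the toy data frames and the symbols GENE_A and GENE_B are hypothetical stand-ins, not real annotations.

```r
# Toy stand-in for y$genes: row names are Ensembl gene IDs (hypothetical counts)
genes <- data.frame(
  SM01 = c(0, 33, 1),
  row.names = c("ENSG00000223972", "ENSG00000227232", "ENSG00000210049")
)

# Toy stand-in for symbol_annotations (symbols are made up for illustration)
symbol_annotations <- data.frame(
  ensembl_gene_id = c("ENSG00000210049", "ENSG00000227232", "ENSG00000223795"),
  hgnc_symbol     = c("GENE_A", "GENE_B", NA),
  stringsAsFactors = FALSE
)

# match() returns, for each row name, the position of its ID in the annotation
# table, or NA when the ID is absent
idx <- match(rownames(genes), symbol_annotations$ensembl_gene_id)

# Indexing with NA yields NA, so unmatched genes get NA rather than an empty cell
genes$Symbol <- symbol_annotations$hgnc_symbol[idx]
print(genes)
```

Because match() returns one position per row name, the result always has exactly as many entries as y$genes has rows, regardless of how many rows symbol_annotations has.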

Related

Adding multiple columns in between columns in a data frame using a For Loop

outputdata (df)
Store.No Task
1 70
2 50
3 20
I am trying to add 53 columns after the 'Task' column by using its position, not its name. I want the column names to run from 1 to 53, with 0 in every row. There are 3 rows in this example, but the number could vary, so would it be possible to use the nrow function to set the number of rows rather than hard-coding it?
outputdata- Desired Outcome
Store.No Task 1 2 3 4 5 6 7 8 9 10 ...53
1 70 0 0 0 0 0 0 0 0 0 0
2 50 0 0 0 0 0 0 0 0 0 0
3 20 0 0 0 0 0 0 0 0 0 0
Code used
x <- 1
y <- 0
for (i in 1:53){
outputdata <- add_column(outputdata, x = 0, .after = Fo+y)
y <- y + 1
x <- x + 1
}
The problem is that the columns are being called x, x.1, x.2, x.3, x.4 ... x.53 rather than 1, 2, 3, 4 ... 53, and I'm not sure why.
I am still quite new to R, so if there is a far more efficient way of doing this, please let me know.
Many thanks

You do not need to loop to do this:
as.data.frame(cbind(df, matrix(0, nrow = nrow(df), ncol = 53)))
Store.No Task Third Fourth 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1 1 70 4 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 2 50 5 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 3 20 6 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
matrix will create a matrix with 53 columns and 3 rows filled with 0
cbind will add this matrix to the end of your data
as.data.frame will convert it to a dataframe
Update
To insert these zero columns positionally you can subset your df into two parts: df[, 1:2] are the first and second columns, while df[,3:ncol(df)] are the third to end of your dataframe.
as.data.frame(cbind(df[,1:2], matrix(0, nrow = nrow(df), ncol = 53), df[,3:ncol(df)]))
Store.No Task 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
1 1 70 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 2 50 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 3 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 Third Fourth
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 7
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 8
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 9
add_column
Alternatively, you can use the add_column function from the tibble package, as you did in your post, with the .after argument to insert after the second column:
library(tibble)
tibble::add_column(df, as.data.frame(matrix(0, nrow = nrow(df), ncol = 53)), .after = 2)
Note: the inserted columns will be named V1 to V53 rather than 1 to 53, because as.data.frame assigns those default names to the columns of an unnamed matrix.
Data
df <- data.frame(Store.No = 1:3,
Task = c(70, 50, 20),
Third = 4:6,
Fourth = 7:9)
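
A small base R sketch of the positional insert with the column names set explicitly to 1 to 53. The variable names pos, n_new, zeros and out are introduced here for illustration only; the approach assumes the first argument to cbind is a data frame, so that the matrix column names are kept unchanged.

```r
df <- data.frame(Store.No = 1:3,
                 Task = c(70, 50, 20),
                 Third = 4:6,
                 Fourth = 7:9)

pos   <- 2    # insert after this column position
n_new <- 53   # number of zero columns to add

# Zero matrix sized from nrow(df), with column names "1" .. "53"
zeros <- matrix(0, nrow = nrow(df), ncol = n_new,
                dimnames = list(NULL, as.character(seq_len(n_new))))

# Columns up to pos, then the zero block, then the remaining columns
out <- cbind(df[, seq_len(pos), drop = FALSE],
             zeros,
             df[, -seq_len(pos), drop = FALSE])

names(out)[1:6]
```

Using drop = FALSE keeps each piece a data frame even when it has a single column.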

How to add elements to a vector based on another vector position wise in R

I have a positive integer value vector C which is of length 192. I want to generate another vector based on this vector C. The new vector to be created is called B (same length as C). The algorithm for creation is:
Whenever a value above 0 is observed in C, add the same value 12 places back in the vector B. For example, if the first 15 entries of C are 0 and the 16th entry is 3, then I want to add the value 3, 12 positions back (which is 16-12=position 4) in vector B. The vector B would be generated in this way over all values of C.
Any help would be greatly appreciated! The vector C can be obtained by the R library "outbreaks" and the data file from that package is ebola_kikwit_1995$onset.

It doesn't seem to be very tricky. An index vector and a for loop will do what the question asks for.
library("outbreaks")
i <- which(ebola_kikwit_1995$onset > 0)
i <- i[i > 12]
ebola_kikwit_1995$B <- 0L
for(j in i){
ebola_kikwit_1995$B[j] <- ebola_kikwit_1995$onset[j] + ebola_kikwit_1995$B[j - 12]
}
ebola_kikwit_1995$B
# [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [28] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [55] 0 0 0 0 0 1 1 1 0 0 0 0 3 0 0 0 1 2 0 2 0 2 0 1 0 0 0
# [82] 0 0 0 0 4 0 3 1 3 1 0 1 1 1 1 1 5 3 0 5 7 2 2 5 3 2 8
#[109] 5 9 5 4 10 10 13 14 20 10 9 16 7 14 13 10 18 13 17 21 31 13 21 21 15 17 16
#[136] 18 22 18 18 24 35 16 22 23 18 18 19 21 26 21 23 26 37 17 0 25 0 0 21 0 0 0
#[163] 25 28 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
#[190] 0 0 0
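
The loop above stores each result at position j. Under a literal reading of the question, a positive value at position j of C would instead be written 12 places earlier, at position j - 12 of B. A vectorized sketch of that reading is shown below on a short made-up vector, so it runs without the outbreaks package; the names C, B and lag are illustrative.

```r
# Toy input: positive values at positions 2, 16 and 18
C <- c(0L, 1L, rep(0L, 13), 3L, 0L, 2L)
lag <- 12L

# B starts as all zeros, same length as C
B <- integer(length(C))

# Positions holding a positive value that have a valid slot 12 places back
i <- which(C > 0 & seq_along(C) > lag)

# Add each value 12 positions earlier; position 2 is skipped (no slot to write to)
B[i - lag] <- B[i - lag] + C[i]
B
```

Here the 3 at position 16 lands at position 4, and the 2 at position 18 lands at position 6.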

How to find three consecutive rows with the same value

I have a dataframe as follows:
chr leftPos Sample1 X.DD 3_samples MyStuff
1 324 -1 1 1 1
1 4565 -1 0 0 0
1 6887 -1 1 0 0
1 12098 1 -1 1 1
2 12 -1 1 0 1
2 43 -1 1 1 1
5 1 -1 1 1 0
5 43 0 1 -1 0
5 6554 1 1 1 1
5 7654 -1 0 0 0
5 8765 1 1 1 0
5 9833 1 1 1 -1
6 12 1 1 0 0
6 43 0 0 0 0
6 56 1 0 0 0
6 79 1 0 -1 0
6 767 1 0 -1 0
6 3233 1 0 -1 0
I would like to convert it according to the following rules
For each chromosome:
a. If there are three or more 1's or -1's consecutively in a column then the value stays as it is.
b. If there are fewer than three 1's or -1's consecutively in a column then the value of the 1 or -1 changes to 0
The rows in a column have to have the same sign (+ or -ve) to be called consecutive.
The result of the dataframe above should be:
chr leftPos Sample1 X.DD 3_samples MyStuff
1 324 -1 0 0 0
1 4565 -1 0 0 0
1 6887 -1 0 0 0
1 12098 0 0 0 0
2 12 0 0 0 0
2 43 0 0 0 0
5 1 0 1 0 0
5 43 0 1 0 0
5 6554 0 1 0 0
5 7654 0 0 0 0
5 8765 0 0 0 0
5 9833 0 0 0 0
6 12 0 0 0 0
6 43 0 0 0 0
6 56 1 0 0 0
6 79 1 0 -1 0
6 767 1 0 -1 0
6 3233 1 0 -1 0
I have managed to do this for two consecutive rows but I'm not sure how to change this for three or more rows.
DAT_list2res <-cbind(DAT_list2[1:2],DAT_list2res)
colnames(DAT_list2res)[1:2]<-c("chr","leftPos")
DAT_list2res$chr<-as.numeric(gsub("chr","",DAT_list2res$chr))
DAT_list2res<-as.data.frame(DAT_list2res)
dx<-DAT_list2res
f0 <- function( colNr, dx)
{
col <- dx[,colNr]
n1 <- which(col == 1| col == -1) # The `1`-rows.
d0 <- which( diff(col) == 0) # Consecutive rows in a column are equal.
dc0 <- which( diff(dx[,1]) == 0) # Same chromosome.
m <- intersect( n1-1, intersect( d0, dc0 ) )
return ( setdiff( 1:nrow(dx), union(m,m+1) ) )
}
g <- function( dx )
{
for ( i in 3:ncol(dx) ) { dx[f0(i,dx),i] <- 0 }
return ( dx )
}
dx<-g(dx)

Here is one solution using only base R.
First define a function that replaces any run of repeated values shorter than 3 with zeros:
replace_f <- function(x){
subs <- rle(x)
subs$values[subs$lengths < 3] <- 0
inverse.rle(subs)
}
Then split your data.frame by chr and then apply the function to all columns that you want to change (in this case columns 3 to 6):
df[,3:6] <- do.call("rbind", lapply(split(df[,3:6], df$chr), function(x) apply(x, 2, replace_f)))
Notice that we combine the results together with rbind before replacing the original data. This will give you the desired result:
chr leftPos Sample1 X.DD X3_samples MyStuff
1 1 324 -1 0 0 0
2 1 4565 -1 0 0 0
3 1 6887 -1 0 0 0
4 1 12098 0 0 0 0
5 2 12 0 0 0 0
6 2 43 0 0 0 0
7 5 1 0 1 0 0
8 5 43 0 1 0 0
9 5 6554 0 1 0 0
10 5 7654 0 0 0 0
11 5 8765 0 0 0 0
12 5 9833 0 0 0 0
13 6 12 0 0 0 0
14 6 43 0 0 0 0
15 6 56 1 0 0 0
16 6 79 1 0 -1 0
17 6 767 1 0 -1 0
18 6 3233 1 0 -1 0
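
As a possible variation on the same idea, the per-chromosome step can also be written with ave(), which applies a function within groups and returns a vector of the original length. This is a sketch on a six-row extract of the question's data; the column selection in cols is an assumption for the example.

```r
# Same run-length helper as in the answer above
replace_f <- function(x) {
  subs <- rle(x)                       # lengths and values of consecutive runs
  subs$values[subs$lengths < 3] <- 0   # runs shorter than 3 become 0
  inverse.rle(subs)                    # expand back to a full-length vector
}

# First six rows of the question's data (chr 1 and chr 2), two value columns
df <- data.frame(chr     = c(1, 1, 1, 1, 2, 2),
                 Sample1 = c(-1, -1, -1, 1, -1, -1),
                 X.DD    = c(1, 0, 1, -1, 1, 1))

cols <- c("Sample1", "X.DD")

# ave() runs replace_f separately within each chromosome
df[cols] <- lapply(df[cols], function(col) ave(col, df$chr, FUN = replace_f))
df
```

Because ave() works on one vector per group, it also copes with a chromosome that has a single row.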

A data.table solution using rleid would be
require(data.table)
setDT(dat)
dat[,Sample1 := Sample1 * as.integer(.N>=3), by=.(chr, rleid(Sample1))]
This uses grouping by rleid(Sample1) together with data.table's helpful .N variable.
To do it for all columns, you could use the eval(parse(text=...)) syntax as follows:
for(i in names(dat)[3:6]){
by_string = paste0("list(chr, rleid(", i, "))")
def_string = paste0(i, "* as.integer(.N>=3)")
dat[,(i) := eval(parse(text=def_string)), by=eval(parse(text=by_string))]
}
So it results in:
> dat[]
chr leftPos Sample1 X.DD X3_samples MyStuff
1: 1 324 -1 0 0 0
2: 1 4565 -1 0 0 0
3: 1 6887 -1 0 0 0
4: 1 12098 0 0 0 0
5: 2 12 0 0 0 0
6: 2 43 0 0 0 0
7: 5 1 0 1 0 0
8: 5 43 0 1 0 0
9: 5 6554 0 1 0 0
10: 5 7654 0 0 0 0
11: 5 8765 0 0 0 0
12: 5 9833 0 0 0 0
13: 6 12 0 0 0 0
14: 6 43 0 0 0 0
15: 6 56 1 0 0 0
16: 6 79 1 0 -1 0
17: 6 767 1 0 -1 0
18: 6 3233 1 0 -1 0

Using conditionals to count matching rows from two dataframes r

I have two data frames, and I want to count the number of times both have a result_gain of 1 on the same chr, with start and stop positions that agree to within +/- probes (here probes = 1000).
So if dataframe1 has a result_gain of 1 on chr 5 at some start and stop, and dataframe2 also has a result_gain of 1 on chr 5 with start and stop each within +/- probes of those values, I want to count it. The method I have attached is not working; do you have any suggestions?
dataframe1
chr start stop result_gain result_loss result_cnloh
2 0 90247720 0 0 0
2 95627407 243199373 0 0 0
7 0 57789531 1 0 0
7 61760895 159138663 1 0 0
8 0 6974050 0 0 0
8 8102641 43646413 1 0 0
8 47060977 146364022 0 0 0
9 0 38771460 0 0 0
9 71034203 141213431 0 0 0
10 0 38685231 0 0 0
10 42810783 135534747 0 0 0
11 0 51530241 0 0 0
11 54835623 135006516 0 0 0
12 0 34768168 0 0 0
12 38416139 133851895 0 0 0
13 19263735 115169878 0 0 0
14 20213937 107349540 0 0 0
15 20161372 102531392 1 0 0
17 0 22175355 0 0 0
17 25375921 81195210 0 0 0
dataframe 2
chr start stop result_gain result_loss result_cnloh
2 0 90247720 1 0 0
2 95627407 243199373 0 0 0
7 0 57789531 1 0 0
7 61760895 159138663 1 0 0
8 0 6974050 0 0 0
8 8101641 43646413 1 0 0
8 47060977 146364022 0 0 0
9 0 38771460 0 0 0
9 71034203 141213431 0 0 0
10 0 38685231 0 0 0
10 42810783 135534747 0 0 0
11 0 51530241 0 1 0
11 54835623 135006516 0 0 0
12 0 34768168 0 0 0
12 38416139 133851895 0 0 0
13 19263735 115169878 0 0 0
14 20213937 107349540 0 0 0
15 20161372 102531392 1 0 0
17 0 22175355 0 0 0
17 25375921 81195210 0 0 0
Here are the matching rows from both:
matching rows from dataframe1
7 0 57789531 1 0 0
7 61760895 159138663 1 0 0
8 8102641 43646413 1 0 0
15 20161372 102531392 1 0 0
matching rows from dataframe2
7 0 57789531 1 0 0
7 61760895 159138663 1 0 0
8 8101641 43646413 1 0 0
15 20161372 102531392 1 0 0
output
score = 4
After subsetting each chr for dataframe1 and dataframe2 I am using the following conditionals to count, but it does not work.
for (e in 1:nrow(gains_idcc)) {
for (h in 1:nrow(gains_dciss)) {
if (gains_dciss$start[e] <= (idcc$start[h] + probes_per_bp) | dciss$start[e] <= (idcc$start[h] - probes_per_bp) | dciss$start[e] == idcc$start[h] && dciss$stop[e] >= (idcc$stop[h] + probes_per_bp) | dciss$stop[e] >= (idcc$stop[h] - probes_per_bp) | dciss$stop[e] <= idcc$stop[h]) {
score_gain = score_gain + nrow(gains_dciss)
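
One way the count could be approached without nested loops is to pair up the gain rows of the two data frames by chr with merge() and then test the start and stop tolerances on each pair. The sketch below is an assumption about the intended rule (both start and stop must agree to within probes, inclusive) and uses only a few rows from the question; the names df1, df2, g1, g2, pairs and hits are illustrative.

```r
probes <- 1000

# A few rows from dataframe1 and dataframe2 in the question
df1 <- data.frame(chr   = c(2, 7, 7, 8, 15),
                  start = c(0, 0, 61760895, 8102641, 20161372),
                  stop  = c(90247720, 57789531, 159138663, 43646413, 102531392),
                  result_gain = c(0, 1, 1, 1, 1))
df2 <- data.frame(chr   = c(2, 7, 7, 8, 15),
                  start = c(0, 0, 61760895, 8101641, 20161372),
                  stop  = c(90247720, 57789531, 159138663, 43646413, 102531392),
                  result_gain = c(1, 1, 1, 1, 1))

# Keep only the rows flagged as gains
g1 <- df1[df1$result_gain == 1, ]
g2 <- df2[df2$result_gain == 1, ]

# Every gain segment in g1 paired with every gain segment in g2 on the same chr
pairs <- merge(g1, g2, by = "chr", suffixes = c(".1", ".2"))

# A pair counts when both ends agree to within +/- probes
hits <- abs(pairs$start.1 - pairs$start.2) <= probes &
        abs(pairs$stop.1  - pairs$stop.2)  <= probes

score <- sum(hits)
score
```

On these rows the two chr 7 segments, the chr 8 segment (whose starts differ by exactly 1000) and the chr 15 segment are counted, while the chr 2 row is excluded because only dataframe2 flags it as a gain.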

How to sum leading diagonal of table in R

I have a table created using the table() command in R:
y
x 0 1 2 3 4 5 6 7 8 9
0 23 0 0 0 0 1 0 0 0 0
1 0 23 1 0 1 0 1 2 0 2
2 1 1 28 0 0 0 1 0 2 2
3 0 1 0 24 0 1 0 0 0 1
4 1 1 0 0 34 0 3 0 0 0
5 0 0 0 0 0 33 0 0 0 0
6 0 0 0 0 0 2 32 0 0 0
7 0 1 0 1 0 0 0 36 0 1
8 1 1 1 1 0 0 0 1 20 1
9 1 3 0 1 0 1 0 1 0 24
This table shows the results of a classification, and I want to sum the leading diagonal of it (the diagonal with the large numbers - like 23, 23, 28 etc). Is there a sensible/easy way to do this in R?

How about sum(diag(tbl)), where tbl is your table?
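
A short sketch of that suggestion on a made-up 3-class example. It assumes the table is square with rows and columns in the same class order, which holds here because x and y contain the same set of values; the vectors x and y are invented for illustration.

```r
# Made-up true classes (x) and predicted classes (y)
x <- c(0, 0, 1, 1, 2, 2, 2)
y <- c(0, 1, 1, 1, 2, 2, 0)

tbl <- table(x, y)

# diag() extracts the cells where x == y; their sum is the number of correct calls
correct <- sum(diag(tbl))

# Dividing by the table total gives the overall accuracy
accuracy <- correct / sum(tbl)

correct
accuracy
```

If one of the classes never appears in x or y, the table would not be square; converting both to factors with the same levels before calling table() avoids that.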
