Conversion from pairwise matrix to Cytoscape edge table is too slow - r

My code is similar to this.
Given a matrix like this:
   a  b  c  d
a  1 NA  3  4
b NA  2 NA  4
c NA NA NA NA
d NA NA NA  4
It converts it to this:
a a 1
a c 3
a d 4
b b 2
b d 4
d d 4
The relevant code is as below:
pears <- read.delim("pears.txt", header = TRUE, sep = "\t", dec = ".")
edges <- NULL
for (i in 1:nrow(pears)) {
  for (j in 1:ncol(pears)) {
    if (!(is.na(pears[i,j]))) {
      edges <- rbind(edges, c(rownames(pears)[i], colnames(pears)[j], pears[i,j]))
    }
  }
  print(i)
}
colnames(edges) <- c("gene1", "gene2", "PCC")
write.table(edges, "edges.txt", row.names = FALSE, quote = FALSE, sep = "\t")
When I run the code on a remote server in the background (using screen -S) on a 17804x17804 sparse (99% NA) matrix, it initially runs 5 print statements every 13 seconds. However, it has now slowed down to 7 print statements every minute. Why does the loop get slower and slower as it progresses? Is there a quicker way to convert my matrix into Cytoscape's format?

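The loop slows down because rbind copies the entire accumulated result on every call, so each new edge costs more than the last. A vectorized reshape avoids this: we convert the data.frame to a matrix and use melt from reshape2, which returns the dimnames as two columns and the values as a third column; na.rm = TRUE drops the NA rows.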
library(reshape2)
melt(as.matrix(df1), na.rm = TRUE)
data
df1 <- structure(list(a = c(1L, NA, NA, NA), b = c(NA, 2L, NA, NA),
    c = c(3L, NA, NA, NA), d = c(4L, 4L, NA, 4L)), class = "data.frame",
    row.names = c("a", "b", "c", "d"))

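A base R sketch of the same conversion, for cases where loading reshape2 is not an option. It assumes the df1 sample data above; the column names gene1, gene2 and PCC are taken from the question. which(..., arr.ind = TRUE) locates every non-NA cell in one vectorized pass, so nothing is grown inside a loop.

```r
# Sample data from the answer above
df1 <- data.frame(
  a = c(1L, NA, NA, NA),
  b = c(NA, 2L, NA, NA),
  c = c(3L, NA, NA, NA),
  d = c(4L, 4L, NA, 4L),
  row.names = c("a", "b", "c", "d")
)

m <- as.matrix(df1)

# Row/column positions of every non-NA cell, found in one vectorized call
idx <- which(!is.na(m), arr.ind = TRUE)

edges <- data.frame(
  gene1 = rownames(m)[idx[, "row"]],
  gene2 = colnames(m)[idx[, "col"]],
  PCC   = m[idx],
  stringsAsFactors = FALSE
)

# Optional: order by row, then column, to match the order the loop produces
edges <- edges[order(idx[, "row"], idx[, "col"]), ]
rownames(edges) <- NULL
```

The resulting edges data frame can be passed to the same write.table call used in the question.
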
Related

Sum many rows with some of them have NA in all needed columns

I am trying to use rowSums, but I get zero for the last row and I need it to be NA.
My df is
   a  b  c sum
1  1  4  7  12
2  2 NA  8  10
3  3  5 NA   8
4 NA NA NA  NA
I used this code, based on this question: Sum of two Columns of Data Frame with NA Values
df$sum<-rowSums(df[,c("a", "b", "c")], na.rm=T)
Any advice will be greatly appreciated
For each row check if it is all NA and if so return NA; otherwise, apply sum. We have selected columns a, b and c even though that is all the columns because the poster indicated that there might be additional ones.
sum_or_na <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
transform(df, sum = apply(df[c("a", "b", "c")], 1, sum_or_na))
giving:
   a  b  c sum
1  1  4  7  12
2  2 NA  8  10
3  3  5 NA   8
4 NA NA NA  NA
Note
df in reproducible form is assumed to be:
df <- structure(list(a = c(1L, 2L, 3L, NA), b = c(4L, NA, 5L, NA),
c = c(7L, 8L, NA, NA)),
row.names = c("1", "2", "3", "4"), class = "data.frame")
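A vectorized variant of the same idea, as a sketch: compute the row sums with na.rm = TRUE, then overwrite the rows where every selected column is NA. It assumes the df shown in the Note; cols and all_na are illustrative names, not part of the original answer. This avoids calling a function once per row, which may matter on large data.

```r
# Sample data from the Note above
df <- data.frame(
  a = c(1L, 2L, 3L, NA),
  b = c(4L, NA, 5L, NA),
  c = c(7L, 8L, NA, NA)
)

cols <- c("a", "b", "c")

# TRUE for rows in which every selected column is NA
all_na <- rowSums(!is.na(df[cols])) == 0

# Sum ignoring NAs, then restore NA where there was nothing to sum
df$sum <- rowSums(df[cols], na.rm = TRUE)
df$sum[all_na] <- NA
```
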

Replace NA value with first NonNA value of a column in R

I have a data frame (df) with multiple columns (33). For some columns the first observation is NA, and I want to replace that first-row NA with the column's first non-NA value.
If this is my data frame:
x y z zz
1 na na na
2 na na na
3 S 3 na
4 d 4 7
I want my data frame to be
x y z zz
1 S 3 7
2 na na na
3 S 3 na
4 d 4 7
I used the following code to get the result for a single column, but how can I do this dynamically for multiple columns?
df$y[1] <- df$y[min(which(!is.na(df$y)))]
Any help will be appreciated. Thank you.
Do you mean to have something like this?
df[1,] <- apply(df, 2, function(x) trimws(x[min(which(!is.na(x)))]))
Output is:
  x    y    z   zz
1 1    S    3    7
2 2 <NA> <NA> <NA>
3 3    S    3 <NA>
4 4    d    4    7
Sample data:
df <- structure(list(x = 1:4, y = c(NA, NA, "S", "d"), z = c(NA, NA,
3L, 4L), zz = c(NA, NA, NA, 7L)), .Names = c("x", "y", "z", "zz"
), class = "data.frame", row.names = c(NA, -4L))
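Note that apply first converts the data frame to a character matrix, so the one-liner above turns every column into character. A sketch that keeps each column's type works column by column with lapply instead. It assumes the sample data above; first_non_na is a hypothetical helper name introduced here for illustration.

```r
# Sample data from the answer above
df <- data.frame(
  x  = 1:4,
  y  = c(NA, NA, "S", "d"),
  z  = c(NA, NA, 3L, 4L),
  zz = c(NA, NA, NA, 7L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first non-NA element of a vector (NA if there is none)
first_non_na <- function(x) x[which(!is.na(x))[1]]

# Replace the first element of every column; df[] keeps the data.frame structure
df[] <- lapply(df, function(x) {
  x[1] <- first_non_na(x)
  x
})
```

Because each column is handled on its own, integer columns stay integer and character columns stay character.
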

Summation of the corresponding number of values which are in different columns

My data frame looks like below:
df<-data.frame(alphabets1=c("A","B","C","B","C"," ","NA"),alphabets2=c("B","A","D","D"," ","E","NA"),alphabets3=c("C","F","G"," "," "," ","NA"), number = c("1","2","3","1","4","1","2"))
  alphabets1 alphabets2 alphabets3 number
1          A          B          C      1
2          B          A          F      2
3          C          D          G      3
4          B          D                 1
5          C                            4
6                     E                 1
7         NA         NA         NA      2
NOTE1: within a row all the values are unique; that is, a row like the one below is not possible.
alphabets1 alphabets2 alphabets3 number
1 A A C 1
NOTE2: the data frame may contain NA or blank values.
I am struggling to get the output below: a data frame with the alphabets and the sum of their corresponding numbers. For example, A appears in the 1st and 2nd rows, so its sum is 1+2, i.e. 3; B appears in the 1st, 2nd and 4th rows, so its sum is 1+2+1, i.e. 4.
output <-data.frame(alphabets1=c("A","B","C","D","E","F","G"), number = c("3","4","8","4","1","2","3"))
output
alphabets number
1 A 3
2 B 4
3 C 8
4 D 4
5 E 1
6 F 2
7 G 3
NOTE3: output may or may not have the NA or blanks (it doesn't matter!)
We can reshape it to 'long' format and do a group by operation
library(data.table)
melt(setDT(df), id.var = "number", na.rm = TRUE, value.name = "alphabets1")[
  !grepl("^\\s*$", alphabets1),
  .(number = sum(as.integer(as.character(number)))),
  alphabets1]
# alphabets1 number
#1: A 3
#2: B 4
#3: C 8
#4: D 4
#5: E 1
#6: F 2
#7: G 3
Or we can use xtabs from base R
xtabs(number~alphabets1, data.frame(alphabets1 = unlist(df[-4]),
number = as.numeric(as.character(df[,4]))))
NOTE: In the OP's dataset, the missing values were the string "NA" rather than real NA, and the 'number' column is a factor (it is converted to integer here before summing).
data
df <- data.frame(alphabets1=c("A","B","C","B","C"," ",NA),
alphabets2=c("B","A","D","D"," ","E",NA),
alphabets3=c("C","F","G"," "," "," ",NA),
number = c("1","2","3","1","4","1","2"))
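The same reshape-then-aggregate idea can be sketched in base R with aggregate, for readers who want to avoid data.table. It assumes the data above with real NA values; letter_cols, long and result are illustrative names introduced here.

```r
# Sample data from the answer above (real NA values, numbers stored as text)
df <- data.frame(
  alphabets1 = c("A", "B", "C", "B", "C", " ", NA),
  alphabets2 = c("B", "A", "D", "D", " ", "E", NA),
  alphabets3 = c("C", "F", "G", " ", " ", " ", NA),
  number     = c("1", "2", "3", "1", "4", "1", "2"),
  stringsAsFactors = FALSE
)

letter_cols <- c("alphabets1", "alphabets2", "alphabets3")

# Stack the letter columns; repeat the numbers once per stacked column
long <- data.frame(
  alphabet = unlist(lapply(df[letter_cols], as.character), use.names = FALSE),
  number   = rep(as.integer(as.character(df$number)), times = length(letter_cols)),
  stringsAsFactors = FALSE
)

# Drop NA and blank entries, then sum the numbers per letter
long   <- long[!is.na(long$alphabet) & grepl("[^[:space:]]", long$alphabet), ]
result <- aggregate(number ~ alphabet, data = long, FUN = sum)
```
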
Here is a base R method using sapply and table. I first converted df$number into a numeric. See data section below.
data.frame(table(sapply(df[-length(df)], function(i) rep(i, df$number))))
  Var1 Freq
1        11
2    A    3
3    B    4
4    C    8
5    D    4
6    E    1
7    F    2
8    G    3
9   NA    6
To make the output a little bit nicer, we could wrap a few more functions and perform a subsetting within sapply.
data.frame(table(droplevels(unlist(sapply(df[-length(df)],
function(i) rep(i[i %in% LETTERS],
df$number[i %in% LETTERS])),
use.names=FALSE))))
Var1 Freq
1 A 3
2 B 4
3 C 8
4 D 4
5 E 1
6 F 2
7 G 3
It may be easier to do this afterward, though.
data
I ran
df$number <- as.numeric(df$number)
on the OP's data resulting in this.
df <-
structure(list(alphabets1 = structure(c(2L, 3L, 4L, 3L, 4L, 1L,
5L), .Label = c(" ", "A", "B", "C", "NA"), class = "factor"),
alphabets2 = structure(c(3L, 2L, 4L, 4L, 1L, 5L, 6L), .Label = c(" ",
"A", "B", "D", "E", "NA"), class = "factor"), alphabets3 = structure(c(2L,
3L, 4L, 1L, 1L, 1L, 5L), .Label = c(" ", "C", "F", "G", "NA"
), class = "factor"), number = c(1, 2, 3, 1, 4, 1, 2)), .Names = c("alphabets1",
"alphabets2", "alphabets3", "number"), row.names = c(NA, -7L), class = "data.frame")

Rearrange data by matching columns

I am having an issue rearranging some data.
The original data is:
structure(list(id = 1:3, artery.1 = structure(c(1L, 1L, 2L), .Label = c("a",
"b"), class = "factor"), artery.2 = structure(c(1L, NA, 2L), .Label = c("b",
"c"), class = "factor"), artery.3 = structure(c(1L, NA, 2L), .Label = c("c",
"d"), class = "factor"), artery.4 = structure(c(NA, NA, 1L), .Label = "e", class = "factor"), artery.5 = structure(c(NA, NA, 1L), .Label = "f", class = "factor"),
diameter.1 = c(3L, 2L, 1L), diameter.2 = c(2L, NA, 2L), diameter.3 = c(3L,
NA, 3L), diameter.4 = c(NA, NA, 4L), diameter.5 = c(NA, NA,
5L)), .Names = c("id", "artery.1", "artery.2", "artery.3",
"artery.4", "artery.5", "diameter.1", "diameter.2", "diameter.3",
"diameter.4", "diameter.5"), class = "data.frame", row.names = c(NA,
-3L))
# id artery.1 artery.2 artery.3 artery.4 artery.5 diameter.1 diameter.2 diameter.3 diameter.4 diameter.5
# 1 1 a b c <NA> <NA> 3 2 3 NA NA
# 2 2 a <NA> <NA> <NA> <NA> 2 NA NA NA NA
# 3 3 b c d e f 1 2 3 4 5
I would like to get to this:
structure(list(id = 1:3, a = c(3L, 2L, NA), b = c(2L, NA, 1L),
c = c(3L, NA, 2L), d = c(NA, NA, 3L), e = c(NA, NA, 4L),
f = c(NA, NA, 5L)), .Names = c("id", "a", "b", "c", "d",
"e", "f"), class = "data.frame", row.names = c(NA, -3L))
# id a b c d e f
# 1 1 3 2 3 NA NA NA
# 2 2 2 NA NA NA NA NA
# 3 3 NA 1 2 3 4 5
Basically, a to f represents arteries and the numerical values represent the corresponding diameter. Each row represents a patient.
Is there a neat way to sort this dataframe out?
Modern tidyr makes the solution even more succinct via the pivot_ functions:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-id, names_pattern = '(artery|diameter)\\.(\\d+)', names_to = c('.value', NA)) %>%
  filter(!is.na(artery)) %>%
  pivot_wider(names_from = artery, values_from = diameter)
id a b c d e f
<int> <int> <int> <int> <int> <int> <int>
1 1 3 2 3 NA NA NA
2 2 2 NA NA NA NA NA
3 3 NA 1 2 3 4 5
Here is the older solution, which uses the superseded gather and spread functions (they still work, but tidyr recommends the pivot_ functions for new code):
library(dplyr)
library(tidyr)
new.df <- gather(df, variable, value, artery.1:diameter.5) %>%
  separate(variable, c('variable', 'num')) %>%
  spread(variable, value) %>%
  subset(!is.na(artery)) %>%
  mutate(diameter = as.numeric(diameter)) %>%
  select(-num) %>%
  spread(artery, diameter)
Output:
id a b c d e f
1 1 3 2 3 NA NA NA
2 2 2 NA NA NA NA NA
3 3 NA 1 2 3 4 5
Or use a melt/dcast combination from data.table, selecting the variables by regex with the patterns function:
library(data.table) #v>=1.9.6
dcast(melt(setDT(df),
id = "id",
measure = patterns("artery", "diameter")),
id ~ value1,
sum,
value.var = "value2",
subset = .(!is.na(value2)),
fill = NA)
# id a b c d e f
# 1: 1 3 2 3 NA NA NA
# 2: 2 2 NA NA NA NA NA
# 3: 3 NA 1 2 3 4 5
As you can see, both melt and dcast are very flexible and you can use regex, specify a subset, pass multiple functions and specify how you want to fill missing values.
You can use xtabs with reshape from base R. Use the latter to transform the data to long format and the former to cross-tabulate the summed diameters. Note that id/artery combinations with no data appear as 0 rather than NA:
xtabs(diameter ~ id + artery, reshape(df, varying = 2:11, sep = '.', dir = "long"))
# artery
#id a b c d e f
# 1 3 2 3 0 0 0
# 2 2 0 0 0 0 0
# 3 0 1 2 3 4 5
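If the zeros are a problem, one possible base R variant is tapply, which leaves empty id/artery cells as NA. This is a sketch only: it assumes the artery.k and diameter.k columns appear in the same order, and it rebuilds the question's data with character columns instead of factors. The names arteries, diameters, keep and wide are illustrative.

```r
# The question's data, with arteries stored as character for simplicity
df <- data.frame(
  id = 1:3,
  artery.1 = c("a", "a", "b"),
  artery.2 = c("b", NA, "c"),
  artery.3 = c("c", NA, "d"),
  artery.4 = c(NA, NA, "e"),
  artery.5 = c(NA, NA, "f"),
  diameter.1 = c(3L, 2L, 1L),
  diameter.2 = c(2L, NA, 2L),
  diameter.3 = c(3L, NA, 3L),
  diameter.4 = c(NA, NA, 4L),
  diameter.5 = c(NA, NA, 5L),
  stringsAsFactors = FALSE
)

# Two aligned matrices: cell [i, k] holds patient i's k-th artery / diameter
arteries  <- as.matrix(df[grep("^artery\\.", names(df))])
diameters <- as.matrix(df[grep("^diameter\\.", names(df))])

# Only cells that actually name an artery take part
keep <- !is.na(arteries)

# Cross-tabulate diameter by id and artery; empty cells stay NA
wide <- tapply(
  diameters[keep],
  list(id = df$id[row(arteries)[keep]], artery = arteries[keep]),
  sum
)
```

The result is a matrix with ids as row names; as.data.frame(wide) would turn it into a data frame if one is needed.
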
This can be done with two reshape() calls. First, we can longify both artery and diameter on id, then widen with artery as the time variable. To prevent a column of NAs, we also must subset out rows with NA values for artery in the intermediate frame.
reshape(
  subset(
    reshape(df, dir = 'l', varying = setdiff(names(df), 'id'), timevar = NULL),
    !is.na(artery)
  ),
  dir = 'w', timevar = 'artery'
)
## id diameter.a diameter.b diameter.c diameter.d diameter.e diameter.f
## 1.1 1 3 2 3 NA NA NA
## 2.1 2 2 NA NA NA NA NA
## 3.1 3 NA 1 2 3 4 5
The diameter. prefixes can be removed afterward, if desired. However, an advantage of this solution is that it would be capable of preserving multiple column sets, whereas the xtabs() solution cannot. The prefixes would be essential to distinguish the column sets in that case.

How to delete rows which has 4 values from 5 missing in a row

I'm trying to delete rows that have 4 or 5 missing values. I've already tried code I found here, but with no success so far.
An example of the dataset (dt):
id a b c d e
1 10 NA NA 9 8
2 NA 7 7 NA NA
3 10 NA NA NA NA
Desired output:
id a b c d e
1 10 NA NA 9 8
2 NA 7 7 NA NA
I used this code, dt[!apply(dt, 1, function(i) all(1:5 %in% which(is.na(i)))),], but with no success.
Any suggestion is highly appreciated.
Here, I am not selecting the first column (id), because the post mentions 4 of 5 values missing while dt has 6 columns, so I assume id is not counted. dt[,-1] selects all columns except id.
dt[rowSums(is.na(dt[,-1]))!=4,]
# id a b c d e
#1 1 10 NA NA 9 8
#2 2 NA 7 7 NA NA
If you are using apply, you could use
dt[apply(dt[,-1], 1, function(i) sum(is.na(i))!=4),]
Suppose you wanted to delete rows with >= 4 NAs (from @Taras B's comments):
dt[rowSums(is.na(dt[,-1])) <4,]
data
dt <- structure(list(id = 1:3, a = c(10L, NA, 10L), b = c(NA, 7L, NA
), c = c(NA, 7L, NA), d = c(9L, NA, NA), e = c(8L, NA, NA)), .Names = c("id",
"a", "b", "c", "d", "e"), class = "data.frame", row.names = c(NA,
-3L))
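The same rowSums test can be wrapped in a small function so the columns and the threshold are explicit. This is a sketch; keep_rows_with_few_na, cols and max_na are hypothetical names introduced here, not part of the original answer.

```r
# Sample data from the answer above
dt <- data.frame(
  id = 1:3,
  a = c(10L, NA, 10L),
  b = c(NA, 7L, NA),
  c = c(NA, 7L, NA),
  d = c(9L, NA, NA),
  e = c(8L, NA, NA)
)

# Hypothetical helper: keep rows with at most `max_na` NAs among `cols`
keep_rows_with_few_na <- function(data, cols, max_na) {
  n_na <- rowSums(is.na(data[cols]))
  data[n_na <= max_na, , drop = FALSE]
}

# Drop rows with 4 or 5 NAs among a..e, i.e. keep rows with at most 3
result <- keep_rows_with_few_na(dt, cols = c("a", "b", "c", "d", "e"), max_na = 3)
```

Naming the columns rather than using dt[,-1] keeps the call correct if the position of id changes.
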
