List (and plot) top k values of a matrix - r

How can I plot (for example with a barplot) the top 5 values in a matrix with row names and column names?
rownames = c("row1", "row2", "row3", "row4")
colnames = c("col1", "col2", "col3")
P <- matrix(3:14, nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))
# col1 col2 col3
#row1 3 4 5
#row2 6 7 8
#row3 9 10 11
#row4 12 13 14

List top k values (with their positions) of a matrix P:
k <- 5
ij <- arrayInd(order(P, decreasing = TRUE)[1:k], dim(P))
top.k <- data.frame(x = P[ij], i = ij[, 1], j = ij[, 2])
# x i j
#1 14 4 3
#2 13 4 2
#3 12 4 1
#4 11 3 3
#5 10 3 2
If P has row/column names, we can add them to top.k:
top.k[c("ni", "nj")] <- Map(`[`, dimnames(P), top.k[c("i", "j")])
# x i j ni nj
#1 14 4 3 row4 col3
#2 13 4 2 row4 col2
#3 12 4 1 row4 col1
#4 11 3 3 row3 col3
#5 10 3 2 row3 col2
You can quickly produce a bar-chart using:
with( top.k, barplot(x, names.arg = paste(ni, nj, sep = ",")) )
but I don't know if you like its style. Again, plotting is a subjective matter.
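An alternative sketch, not the method used above: convert the matrix to long format with as.data.frame(as.table(P)), which keeps the row and column names next to each value, then sort and keep the first k rows. The names long and top5 are only illustrative.

```r
# Same example matrix as in the question
P <- matrix(3:14, nrow = 4, byrow = TRUE,
            dimnames = list(paste0("row", 1:4), paste0("col", 1:3)))
k <- 5

# Long format: one row per cell, with row name, column name and value
long <- as.data.frame(as.table(P), stringsAsFactors = FALSE)
names(long) <- c("ni", "nj", "x")

# Sort by value (largest first) and keep the top k cells
top5 <- head(long[order(long$x, decreasing = TRUE), ], k)
top5
```

The result can be passed to barplot the same way, for example with(top5, barplot(x, names.arg = paste(ni, nj, sep = ","))).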

Related

Replace values in one dataframe with values from another that are not NA

I have two dataframes A and B that have the same column names and the same first column (Location)
A <- data.frame("Location" = 1:3, "X" = c(21,15, 7), "Y" = c(41,5, 5), "Z" = c(12,103, 88))
B <- data.frame("Location" = 1:3, "X" = c(NA,NA, 14), "Y" = c(50,8, NA), "Z" = c(NA,14, 12))
How do I replace the values in dataframe A with the values from B if the value in B is not NA?
Thanks.
We can use coalesce
library(dplyr)
A %>%
mutate(across(-Location, ~ coalesce(B[[cur_column()]], .)))
-output
# Location X Y Z
#1 1 21 50 12
#2 2 15 8 14
#3 3 14 5 12
Here's an answer in base R:
i <- which(!is.na(B), arr.ind = TRUE)
A[i] <- B[i]
A
Location X Y Z
1 1 21 50 12
2 2 15 8 14
3 3 14 5 12
One option with fcoalesce from the data.table package
list2DF(Map(data.table::fcoalesce,B,A))
gives
Location X Y Z
1 1 21 50 12
2 2 15 8 14
3 3 14 5 12
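To make explicit what the coalesce-style answers do, here is a minimal base R sketch, assuming A and B have the same columns in the same order and the same number of rows. For every column it takes the value from B unless that value is NA, in which case it keeps the value from A. The name out is only illustrative.

```r
A <- data.frame(Location = 1:3, X = c(21, 15, 7), Y = c(41, 5, 5), Z = c(12, 103, 88))
B <- data.frame(Location = 1:3, X = c(NA, NA, 14), Y = c(50, 8, NA), Z = c(NA, 14, 12))

out <- A
# Column by column: prefer B's value, fall back to A's where B is NA
out[] <- Map(function(a, b) ifelse(is.na(b), a, b), A, B)
out
```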

Post-processing of full_join output to remove multiplicity

I have two data frames (df1, df2) and performed a full_join on the common column of interest, col1.
df1 <- data.frame(col1=c('A','D','C','C','E','E','I'),col2=c(4,7,8,3,2,4,9))
df2 <- data.frame(col1=c('A','A','B','C','C','E','E','I'),col2=c(4,1,6,8,3,2,1,9))
df1 %>% full_join(df2, by = "col1")
# col1 col2.x col2.y
# 1 A 4 4
# 2 A 4 1
# 3 D 7 NA
# 4 C 8 8
# 5 C 8 3
# 6 C 3 8
# 7 C 3 3
# 8 E 2 2
# 9 E 2 1
# 10 E 4 2
# 11 E 4 1
# 12 I 9 9
# 13 B NA 6
As expected, the full_join produces multiple rows for repeated values of the joining column, and I wish to avoid that. I wish to arrive at the following output. What kind of post-processing approaches do you suggest?
# col1 col2.x col2.y
# 1 A 4 4
# 2 A NA 1
# 3 D 7 NA
# 4 C 8 8
# 5 C 3 3
# 6 E 2 2
# 7 E 4 1
# 8 I 9 9
# 9 B NA 6
More information:
Case 1: I do not need four rows in the output for two same values in both input objects:
# 4 C 8 8
# 5 C 8 3
# 6 C 3 8
# 7 C 3 3
instead, I want only two as:
# 4 C 8 8
# 5 C 3 3
Case 2: Similarly, I do not need four rows when the values differ:
# 8 E 2 2
# 9 E 2 1
# 10 E 4 2
# 11 E 4 1
instead, I want only two rows as below:
# 8 E 2 2
# 9 E 4 1
A possible solution in 2 steps using the data.table-package:
0) load package & convert to data.table's
library(data.table)
setDT(df1)
setDT(df2)
1) define helper function
unlistSD <- function(x) {
  l <- length(x)
  ls <- sapply(x, lengths)
  m <- max(ls)
  newSD <- vector(mode = "list", length = l)
  for (i in 1:l) {
    u <- unlist(x[[i]])
    lu <- length(u)
    if (lu < m) {
      u <- c(u, rep(NA_real_, m - lu))
    }
    newSD[[i]] <- u
  }
  return(setNames(as.list(newSD), names(x)))
}
2) merge and apply helper function
merge(df1[, .(col2 = list(col2)), by = col1],
df2[, .(col2 = list(col2)), by = col1],
by = "col1", all = TRUE
)[, unlistSD(.SD), by = col1]
which gives the following result:
col1 col2.x col2.y
1: A 4 4
2: A NA 1
3: C 8 8
4: C 3 3
5: D 7 NA
6: E 2 2
7: E 4 1
8: I 9 9
9: B NA 6
Another possibility with base R:
unlistDF <- function(d, groupcols) {
  ds <- split(d[, setdiff(names(d), groupcols)], d[,groupcols])
  ls <- lapply(ds, function(x) max(sapply(x, lengths)))
  dl <- lapply(ds, function(x) lapply(as.list(x), unlist))
  du <- Map(function(x, y) {
    lapply(x, function(i) {
      if(length(i) < y) {
        c(i, rep(NA_real_, y - length(i)))
      } else i
    })
  }, x = dl, y = ls)
  ld <- lapply(du, as.data.frame)
  cbind(d[rep(1:nrow(d), ls), groupcols, drop = FALSE],
        do.call(rbind.data.frame, c(ld, make.row.names = FALSE)),
        row.names = NULL)
}
Now you can use this function as follows in combination with merge:
df <- merge(aggregate(col2 ~ col1, df1, as.list),
aggregate(col2 ~ col1, df2, as.list),
by = "col1", all = TRUE)
unlistDF(df, "col1")
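A shorter sketch of a different technique, offered as an assumption about what the question intends: number the rows within each col1 group by order of appearance, then join on col1 and that number together, so the first C in df1 pairs only with the first C in df2, and so on. The helper column id and the result name res are hypothetical. For the example data this gives the nine desired rows, though the row order differs because merge sorts by the key columns.

```r
df1 <- data.frame(col1 = c('A','D','C','C','E','E','I'), col2 = c(4,7,8,3,2,4,9))
df2 <- data.frame(col1 = c('A','A','B','C','C','E','E','I'), col2 = c(4,1,6,8,3,2,1,9))

# Running index within each col1 group (1, 2, ... in order of appearance)
df1$id <- ave(seq_along(df1$col1), df1$col1, FUN = seq_along)
df2$id <- ave(seq_along(df2$col1), df2$col1, FUN = seq_along)

# Full outer join on both the key and the within-group index
res <- merge(df1, df2, by = c("col1", "id"), all = TRUE)
res
```

If the pairing should follow something other than order of appearance (for example sorted values), sort each data frame accordingly before computing id.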

R - Sum list of matrix with different columns

I have a large list of matrices with different columns, and I would like to sum these matrices, counting 0 wherever column X does not exist in a matrix.
If you have used the function rbind.fill from plyr, I would like something similar but with a sum function. Of course I could write a function to do that, but I'm looking for a native function efficiently implemented in Fortran or C, because my data is large.
Here is an example:
This is the easy example where I have the same columns:
aa <- list(
m1 = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, dimnames = list(c(1,2,3),c('a','b','c'))),
m2 = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, dimnames = list(c(1,2,3),c('a','b','c')))
)
aa
Reduce('+',aa)
Giving the results:
> aa
$m1
a b c
1 1 4 7
2 2 5 8
3 3 6 9
$m2
a b c
1 1 4 7
2 2 5 8
3 3 6 9
> Reduce('+',aa)
a b c
1 2 8 14
2 4 10 16
3 6 12 18
And with my data:
bb <- list(
m1 = matrix(c(1,2,3,7,8,9), nrow = 3, dimnames = list(c(1,2,3),c('a','c'))),
m2 = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, dimnames = list(c(1,2,3),c('a','b','c')))
)
bb
Reduce('+',bb)
Here I would like to have b = c(0,0,0) in the first matrix to sum them.
> bb
$m1
a c
1 1 7
2 2 8
3 3 9
$m2
a b c
1 1 4 7
2 2 5 8
3 3 6 9
Many thanks!
Xevi
One option would be
un1 <- sort(unique(unlist(lapply(bb, colnames))))
bb1 <- lapply(bb, function(x) {
  nm1 <- setdiff(un1, colnames(x))
  m1 <- matrix(0, nrow = nrow(x), ncol = length(nm1), dimnames = list(NULL, nm1))
  cbind(x, m1)[, un1]
})
and then use Reduce
Reduce(`+`, bb1)
# a b c
# 1 2 4 14
# 2 4 5 16
# 3 6 6 18
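A variation on the same idea, sketched under the assumption that all matrices have the same rows in the same order: allocate one zero matrix over the union of column names and add each matrix into its own columns, so no padded copies are built. The names all_cols and total are illustrative only.

```r
bb <- list(
  m1 = matrix(c(1,2,3,7,8,9), nrow = 3, dimnames = list(c(1,2,3), c('a','c'))),
  m2 = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, dimnames = list(c(1,2,3), c('a','b','c')))
)

# Union of all column names across the list
all_cols <- sort(unique(unlist(lapply(bb, colnames))))

# Accumulator filled with zeros; missing columns therefore count as 0
total <- matrix(0, nrow = nrow(bb[[1]]), ncol = length(all_cols),
                dimnames = list(rownames(bb[[1]]), all_cols))

# Add each matrix into the columns it actually has
for (m in bb) {
  total[, colnames(m)] <- total[, colnames(m)] + m
}
total
```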

I have multiple dataframes under one name and I need to create a new column in each one by combining two of the other columns? [duplicate]

I have several csv files all named with dates and for all of them I want to create a new column in each file that contains data from two other columns placed together. Then, I want to combine them into one big dataframe and choose only two of those columns to keep. Here's an example:
Say I have two dataframes:
a b c a b c
x 1 2 3 x 3 2 1
y 2 3 1 y 2 1 3
Then I want to create a new column d in each of them:
a b c d a b c d
x 1 2 3 13 x 3 2 1 31
y 2 3 1 21 y 2 1 3 23
Then I want to combine them like this:
a b c d
x 1 2 3 13
y 2 3 1 21
x 3 2 1 31
y 2 1 3 23
Then keep two of the columns a and d and delete the other two columns b and c:
a d
x 1 13
y 2 21
x 3 31
y 2 23
Here is my current code (It doesn't work when I try to combine two of the columns or when I try to only keep two of the columns):
f <- list.files(pattern="201\\d{5}\\.csv") # reading in all the files
mydata <- sapply(f, read.csv, simplify=FALSE) # assigning them to a dataframe
do.call(rbind,mydata) # combining all of those dataframes into one
mydata$Data <- paste(mydata$LAST_UPDATE_DT,mydata$px_last) # combining two of the columns into a new column named "Data"
c('X','Data') %in% names(mydata) # keeping two of the columns while deleting the rest
The object mydata is a list of data frames. You can change the data frames in the list with lapply:
lapply(mydata, function(x) "[<-"(x, "c", value = paste0(x$a, x$b)))
file1 <- "a b
x 2 3"
file2 <- "a b
x 3 1"
mydata <- lapply(c(file1, file2), function(x) read.table(text = x, header =TRUE))
lapply(mydata, function(x) "[<-"(x, "c", value = paste0(x$a, x$b)))
# [[1]]
# a b c
# x 2 3 23
#
# [[2]]
# a b c
# x 3 1 31
You can use rbind(data1, data2)[, c("a", "d")] for that. I assume that you can already create column d in each dataframe, which is a basic operation.
data1<-structure(list(a = 1:2, b = 2:3, c = c(3L, 1L), d = c(13L, 21L
)), .Names = c("a", "b", "c", "d"), row.names = c("x", "y"), class = "data.frame")
> data1
a b c d
x 1 2 3 13
y 2 3 1 21
data2<-structure(list(a = c(3L, 2L), b = c(2L, 1L), c = c(1L, 3L), d = c(31L,
23L)), .Names = c("a", "b", "c", "d"), row.names = c("x", "y"
), class = "data.frame")
> data2
a b c d
x 3 2 1 31
y 2 1 3 23
data3<-rbind(data1,data2)
> data3
a b c d
x 1 2 3 13
y 2 3 1 21
x1 3 2 1 31
y1 2 1 3 23
finaldata<-data3[,c("a","d")]
> finaldata
a d
x 1 13
y 2 21
x1 3 31
y1 2 23
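Putting the steps from the question together, here is a minimal sketch on the two small example data frames rather than on the CSV files. It assumes d is built by pasting a and c, which is what the example values in the question suggest (13, 21, 31, 23). The names with_d, combined and final are illustrative.

```r
# Stand-in for the list returned by sapply(f, read.csv, simplify = FALSE)
mydata <- list(
  data.frame(a = c(1, 2), b = c(2, 3), c = c(3, 1), row.names = c("x", "y")),
  data.frame(a = c(3, 2), b = c(2, 1), c = c(1, 3), row.names = c("x", "y"))
)

# 1. Add column d to every data frame in the list
with_d <- lapply(mydata, function(x) {
  x$d <- paste0(x$a, x$c)
  x
})

# 2. Stack them into one data frame (the result must be assigned)
combined <- do.call(rbind, with_d)

# 3. Keep only columns a and d
final <- combined[, c("a", "d")]
final
```

Note that do.call(rbind, mydata) on its own does not modify mydata, which is why the code in the question still sees a list afterwards.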

How to set unique row and column names of a matrix when its dimension is unknown?

I have a matrix like:
[,1][,2][,3][,4]
[1,] 12 32 43 55
[2,] 54 54 7 8
[3,] 2 56 76 88
[4,] 58 99 93 34
I do not know in advance how many rows and columns I will have in matrix. Thus, I need to create row and column names dynamically.
I can name columns (row) directly like:
colnames(rmatrix) <- c("a", "b", "c", "d")
However, how can I create my names vector dynamically to fit the dimensions of the matrix?
nm <- c("a", "b", "c", "d")
colnames(rmatrix) <- nm
You can use rownames and colnames with do.NULL = FALSE to create names dynamically, as in:
set.seed(1)
rmatrix <- matrix(sample(0:100, 16), ncol=4)
dimnames(rmatrix) <- list(rownames(rmatrix, do.NULL = FALSE, prefix = "row"),
colnames(rmatrix, do.NULL = FALSE, prefix = "col"))
rmatrix
col1 col2 col3 col4
row1 26 19 58 61
row2 37 86 5 33
row3 56 97 18 66
row4 89 62 15 42
You can change prefix to name the rows/cols as you want.
To dynamically name columns (or rows) you can try
colnames(rmatrix) <- letters[1:ncol(rmatrix)]
where letters can be replaced by a vector of column names of your choice. You can do a similar thing for rows.
You may use provideDimnames. Some examples with varying degrees of customisation:
m <- matrix(1:12, ncol = 3)
provideDimnames(m)
# A B C
# A 1 5 9
# B 2 6 10
# C 3 7 11
# D 4 8 12
provideDimnames(m, base = list(letters, LETTERS))
# A B C
# a 1 5 9
# b 2 6 10
# c 3 7 11
# d 4 8 12
provideDimnames(m, base = list(paste0("row_", letters), paste0("col_", letters)))
# col_a col_b col_c
# row_a 1 5 9
# row_b 2 6 10
# row_c 3 7 11
# row_d 4 8 12
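One more sketch for matrices whose size is not known in advance: build the names from the dimensions with paste0 and seq_len. Unlike letters, which has only 26 elements and yields NA names beyond that, this works for any number of rows or columns. The prefixes "row" and "col" are arbitrary choices.

```r
# Example matrix; the dimensions are not assumed anywhere below
rmatrix <- matrix(1:6, nrow = 2)

# Generate as many names as there are rows and columns
dimnames(rmatrix) <- list(
  paste0("row", seq_len(nrow(rmatrix))),
  paste0("col", seq_len(ncol(rmatrix)))
)
rmatrix
```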
