Melt the lower half of a symmetric matrix in R

I have a three-by-three symmetric matrix.
> x<-matrix(1:9,3)
> x[lower.tri(x)] = t(x)[lower.tri(x)]
> x
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 4 5 8
[3,] 7 8 9
Then I use the reshape2 library to convert it to long format.
> library(reshape2)
> x <- melt(x)
> x
Var1 Var2 value
1 1 1 1
2 2 1 4
3 3 1 7
4 1 2 4
5 2 2 5
6 3 2 8
7 1 3 7
8 2 3 8
9 3 3 9
As the upper and lower triangles are identical, I only need half of the result, which would look like this:
Var1 Var2 value
1 1 1
2 1 4
3 1 7
2 2 5
3 2 8
3 3 9
Any elegant approach to do this?

You can set the values in the upper (or lower) triangle to NA and then melt while dropping missing values. This assumes the matrix has no missing values originally, or that you don't need to keep them in the result if it does:
x[upper.tri(x)] = NA
reshape2::melt(x, na.rm = TRUE)
# Var1 Var2 value
#1 1 1 1
#2 2 1 4
#3 3 1 7
#5 2 2 5
#6 3 2 8
#9 3 3 9
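
If the matrix may already contain NA values, masking with NA would drop those cells as well. A minimal base-R sketch, assuming the same 3-by-3 matrix as in the question, selects the lower triangle by position instead. The names keep, idx and long are only illustrative:

```r
# Rebuild the symmetric matrix from the question
x <- matrix(1:9, 3)
x[lower.tri(x)] <- t(x)[lower.tri(x)]

# Logical mask of the lower triangle, diagonal included
keep <- lower.tri(x, diag = TRUE)

# Row/column positions of the kept cells, in column-major order
idx <- which(keep, arr.ind = TRUE)

# Long format with the same column names that melt() produces
long <- data.frame(Var1 = idx[, "row"], Var2 = idx[, "col"], value = x[keep])
long
```

Because nothing is overwritten, x itself stays intact and any NA already in the lower triangle is kept in the result.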

As 'x' has already been melted and reassigned, we can sort the first two columns within each row, build a logical index of the non-duplicated rows from that, and use it to subset the rows:
x[!duplicated(t(apply(x[1:2], 1, sort))),]
# Var1 Var2 value
#1 1 1 1
#2 2 1 4
#3 3 1 7
#5 2 2 5
#6 3 2 8
#9 3 3 9
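
The row-wise sort can also be written with vectorised pmin/pmax, which avoids calling apply on every row. This is only a sketch of the same idea; the melted data frame is typed in by hand so the example does not depend on reshape2, and the names long, key and half are illustrative:

```r
# Melted form of the symmetric matrix from the question
long <- data.frame(
  Var1  = rep(1:3, times = 3),
  Var2  = rep(1:3, each = 3),
  value = c(1, 4, 7, 4, 5, 8, 7, 8, 9)
)

# Order-independent key: (2, 1) and (1, 2) both become (1, 2)
key <- data.frame(lo = pmin(long$Var1, long$Var2),
                  hi = pmax(long$Var1, long$Var2))

# Keep the first occurrence of each unordered pair
half <- long[!duplicated(key), ]
half
```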

Related

Producing all combinations of two column values in R

I have a data.frame with two columns
> data.frame(a=c(5,4,3), b =c(1,2,4))
a b
1 5 1
2 4 2
3 3 4
I want to produce a list of data.frames with different combinations of those column values; there should be a total of six possible scenarios for the above example (correct me if I am wrong):
a b
1 5 1
2 4 2
3 3 4
a b
1 5 1
2 4 4
3 3 2
a b
1 5 2
2 4 1
3 3 4
a b
1 5 2
2 4 4
3 3 1
a b
1 5 4
2 4 2
3 3 1
a b
1 5 4
2 4 1
3 3 2
Is there a simple function to do it? I don't think expand.grid worked out for me.
Actually, expand.grid can work here, but it is not recommended because it becomes very inefficient when df has many rows: with n rows you generate n^n index combinations and keep only the n! that are permutations.
Here is an example using expand.grid:
u <- do.call(expand.grid, rep(list(seq(nrow(df))), nrow(df)))
lapply(
asplit(
subset(
u,
apply(u, 1, FUN = function(x) length(unique(x))) == nrow(df)
), 1
), function(v) within(df, b <- b[v])
)
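
To make the cost concrete, the following sketch counts how many candidate rows expand.grid generates and how many survive the filter for the three-row example. It assumes df is the data frame from the question; n and is_perm are illustrative names:

```r
df <- data.frame(a = c(5, 4, 3), b = c(1, 2, 4))
n <- nrow(df)

# Every way of picking one row index per position: n^n candidates
u <- do.call(expand.grid, rep(list(seq_len(n)), n))

# A candidate is a permutation only if it uses every index exactly once
is_perm <- apply(u, 1, function(x) length(unique(x)) == n)

c(candidates = nrow(u), kept = sum(is_perm))  # 27 candidates, 6 kept
```

The gap widens quickly: for n = 8 there would be 8^8 = 16777216 candidates for 40320 permutations.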
A more efficient option is to use perms from the pracma package:
library(pracma)
> lapply(asplit(perms(df$b),1),function(v) within(df,b<-v))
[[1]]
a b
1 5 4
2 4 2
3 3 1
[[2]]
a b
1 5 4
2 4 1
3 3 2
[[3]]
a b
1 5 2
2 4 4
3 3 1
[[4]]
a b
1 5 2
2 4 1
3 3 4
[[5]]
a b
1 5 1
2 4 2
3 3 4
[[6]]
a b
1 5 1
2 4 4
3 3 2
Use combinat::permn to create all possible permutations of the b values, and bind each permutation with column a:
df <- data.frame(a= c(5,4,3), b = c(1,2,4))
result <- lapply(combinat::permn(df$b), function(x) data.frame(a = df$a, b = x))
result
#[[1]]
# a b
#1 5 1
#2 4 2
#3 3 4
#[[2]]
# a b
#1 5 1
#2 4 4
#3 3 2
#[[3]]
# a b
#1 5 4
#2 4 1
#3 3 2
#[[4]]
# a b
#1 5 4
#2 4 2
#3 3 1
#[[5]]
# a b
#1 5 2
#2 4 4
#3 3 1
#[[6]]
# a b
#1 5 2
#2 4 1
#3 3 4
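
If neither pracma nor combinat is installed, a small recursive helper in base R gives the same set of data frames. The function name all_perms is hypothetical, and this is a sketch for short vectors rather than a tuned implementation:

```r
# Return a list with every ordering of the elements of v
all_perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    # Fix element i in front, then permute the remaining elements
    for (p in all_perms(v[-i])) {
      out[[length(out) + 1]] <- c(v[i], p)
    }
  }
  out
}

df <- data.frame(a = c(5, 4, 3), b = c(1, 2, 4))
result <- lapply(all_perms(df$b), function(p) data.frame(a = df$a, b = p))
length(result)
```

The helper permutes by position, so repeated values in b would produce repeated data frames.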

R: Return the value in one column when the value in another column becomes negative for the first time

For each ID, I want to return the value in the 'distance' column at the point where 'value' becomes negative for the first time. If 'value' never becomes negative, return 99 (or some other placeholder number) for that ID. A sample data frame is given below.
df <- data.frame(ID=c(rep(1, 4),rep(2,4),rep(3,4),rep(4,4),rep(5,4)),distance=rep(1:4,5), value=c(1,4,3,-1,2,1,-4,1,3,2,-1,1,-4,3,2,1,2,3,4,5))
> df
ID distance value
1 1 1 1
2 1 2 4
3 1 3 3
4 1 4 -1
5 2 1 2
6 2 2 1
7 2 3 -4
8 2 4 1
9 3 1 3
10 3 2 2
11 3 3 -1
12 3 4 1
13 4 1 -4
14 4 2 3
15 4 3 2
16 4 4 1
17 5 1 2
18 5 2 3
19 5 3 4
20 5 4 5
The desired output is as follows
> df2
ID first_negative_distance
1 1 4
2 2 3
3 3 3
4 4 1
5 5 99
I tried but couldn't figure out how to do it through dplyr. Any help would be much appreciated. The actual data I'm working on has thousands of IDs with 30 different distance levels for each. Bear in mind that for any ID, there could be multiple negative values. I just need the first one.
Edit:
Tried the solution proposed by AntonoisK.
> df%>%group_by(ID)%>%summarise(first_neg_dist=first(distance[value<0]))
first_neg_dist
1 4
This is the result I am getting. It does not match what Antonois got, and I am not sure why.
library(dplyr)
df %>%
group_by(ID) %>%
summarise(first_neg_dist = first(distance[value < 0]))
# # A tibble: 5 x 2
# ID first_neg_dist
# <dbl> <int>
# 1 1 4
# 2 2 3
# 3 3 3
# 4 4 1
# 5 5 NA
If you really prefer 99 instead of NA you can use
summarise(first_neg_dist = coalesce(first(distance[value < 0]), 99L))
instead.
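
For comparison, the same result can be sketched in base R without dplyr. This assumes, as in the sample data, that the rows of each ID are already ordered by distance; first_neg and df2 are illustrative names:

```r
df <- data.frame(
  ID       = rep(1:5, each = 4),
  distance = rep(1:4, 5),
  value    = c(1, 4, 3, -1, 2, 1, -4, 1, 3, 2, -1, 1, -4, 3, 2, 1, 2, 3, 4, 5)
)

# For each ID, take the first distance with a negative value, or 99 if none
first_neg <- sapply(split(df, df$ID), function(g) {
  hit <- g$distance[g$value < 0]
  if (length(hit) > 0) hit[1] else 99L
})

df2 <- data.frame(ID = as.integer(names(first_neg)),
                  first_negative_distance = unname(first_neg))
df2
```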

Bind the frequencies of two observations in r

I have two sets of observations, and my aim is to bind their frequency counts together so I can perform, e.g., a chi-square test.
a <- c(1,1,5,6,3,6,1,5,5,1,2,5,2,1,3,3,6,5,7,4)
b <- c(1,5,4,4,1,5,4,4,2,1,2,1,2)
> table(a)
a
1 2 3 4 5 6 7
5 2 3 1 5 3 1
> table(b)
b
1 2 4 5
4 3 4 2
As the output shows, the second set has no observations for the values 3, 6 and 7. Hence I can't bind them using cbind(table(a), table(b)), because the shorter table is recycled, which results in:
> cbind(table(a), table(b))
[,1] [,2]
1 5 4
2 2 3
3 3 4
4 1 2
5 5 4
6 3 3
7 1 4
Warning message:
In cbind(table(a), table(b)) :
number of rows of result is not a multiple of vector length (arg 2)
I was wondering about appropriate methods to combine the observations to get a result similar to this:
[,1] [,2]
1 5 4
2 2 3
3 3 0
4 1 4
5 5 2
6 3 0
7 1 0
We can convert each vector to a factor with levels specified as the sorted union of both vectors, get the frequency of each (table), and cbind the results:
un1 <- sort(union(a,b))
cbind(table(factor(a, levels = un1)), table(factor(b, levels = un1)))
# [,1] [,2]
#1 5 4
#2 2 3
#3 3 0
#4 1 4
#5 5 2
#6 3 0
#7 1 0
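
The same idea extends to any number of vectors by looping over a named list, which also gives the columns readable names. A minimal sketch using the vectors from the question; lev and counts are illustrative names:

```r
a <- c(1, 1, 5, 6, 3, 6, 1, 5, 5, 1, 2, 5, 2, 1, 3, 3, 6, 5, 7, 4)
b <- c(1, 5, 4, 4, 1, 5, 4, 4, 2, 1, 2, 1, 2)

# Shared levels, so every table has the same length and order
lev <- sort(union(a, b))

# One column of counts per vector; missing values show up as 0
counts <- sapply(list(a = a, b = b), function(v) table(factor(v, levels = lev)))
counts
```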
This will also work:
df <- merge(table(a), table(b), by.x='a', by.y='b', all=TRUE)[-1]
df[is.na(df)] <- 0
df
# Freq.x Freq.y
#1 5 4
#2 2 3
#3 3 0
#4 1 4
#5 5 2
#6 3 0
#7 1 0

Count with table() and exclude 0's

I am trying to count triplets; for this I use three vectors that are packed into a data frame:
X=c(4,4,4,4,4,4,4,4,1,1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,3,3)
Y=c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,3,4,2,2,2,2,3,4,1,1,2,2,3,3,4,4)
Z=c(4,4,5,4,4,4,4,4,6,1,1,1,1,1,1,1,2,2,2,2,7,2,3,3,3,3,3,3,3,3)
Count_Frame=data.frame(matrix(NA, nrow=(length(X)), ncol=3))
Count_Frame[1]=X
Count_Frame[2]=Y
Count_Frame[3]=Z
Counts=data.frame(table(Count_Frame))
There is the following problem: if I increase the value range in the vectors or use even more vectors the "Counts" dataframe quickly approaches its size limit due to the many 0-counts. Is there a way to exclude the 0-counts while generating "Counts"?
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(Count_Frame)), group by all the columns (.(X, Y, Z)), and get the number of rows in each group (.N).
library(data.table)
setDT(Count_Frame)[,.N ,.(X, Y, Z)]
# X Y Z N
# 1: 4 1 4 7
# 2: 4 1 5 1
# 3: 1 1 6 1
# 4: 1 1 1 3
# 5: 1 2 1 2
# 6: 1 3 1 1
# 7: 1 4 1 1
# 8: 2 2 2 4
# 9: 2 3 7 1
#10: 2 4 2 1
#11: 3 1 3 2
#12: 3 2 3 2
#13: 3 3 3 2
#14: 3 4 3 2
Instead of naming all the columns, we can use names(Count_Frame) as well (if there are many columns)
setDT(Count_Frame)[,.N , names(Count_Frame)]
You can accomplish this with aggregate:
Count_Frame$one <- 1
aggregate(one ~ X1 + X2 + X3, data=Count_Frame, FUN=sum)
This counts only the combinations that actually occur in the data, so the zero counts are never listed.
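
To see how much the zero cells cost on the sample data, the sketch below compares the size of the full table with the aggregate result. It builds Count_Frame directly with the default column names X1, X2, X3 from the question; full and Counts are illustrative names:

```r
X <- c(4,4,4,4,4,4,4,4,1,1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,3,3)
Y <- c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,3,4,2,2,2,2,3,4,1,1,2,2,3,3,4,4)
Z <- c(4,4,5,4,4,4,4,4,6,1,1,1,1,1,1,1,2,2,2,2,7,2,3,3,3,3,3,3,3,3)
Count_Frame <- data.frame(X1 = X, X2 = Y, X3 = Z)

# Full cross-tabulation: one cell per combination of observed levels
full <- as.data.frame(table(Count_Frame))

# Aggregate only touches combinations that are present
Count_Frame$one <- 1
Counts <- aggregate(one ~ X1 + X2 + X3, data = Count_Frame, FUN = sum)

c(table_cells = nrow(full), nonzero = nrow(Counts))  # 112 cells, 14 non-zero
```

The full table grows with the product of the number of distinct values per column, while the aggregate result can never have more rows than the data.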
One solution is to create a combination of the column values and count those instead:
library(tidyr)
as.data.frame(table(unite(Count_Frame, tmp, X1, X2, X3))) %>%
separate(Var1, c('X1', 'X2', 'X3'))
Resulting output is:
X1 X2 X3 Freq
1 1 1 1 3
2 1 1 6 1
3 1 2 1 2
4 1 3 1 1
5 1 4 1 1
6 2 2 2 4
7 2 3 7 1
8 2 4 2 1
9 3 1 3 2
10 3 2 3 2
11 3 3 3 2
12 3 4 3 2
13 4 1 4 7
14 4 1 5 1
Or using plyr:
library(plyr)
count(Count_Frame, colnames(Count_Frame))
Output:
# > count(Count_Frame, colnames(Count_Frame))
# X1 X2 X3 freq
# 1 1 1 1 3
# 2 1 1 6 1
# 3 1 2 1 2
# 4 1 3 1 1
# 5 1 4 1 1
# 6 2 2 2 4
# 7 2 3 7 1
# 8 2 4 2 1
# 9 3 1 3 2
# 10 3 2 3 2
# 11 3 3 3 2
# 12 3 4 3 2
# 13 4 1 4 7
# 14 4 1 5 1

Rbind same data.frame with column switching

I am not new to R, but I cannot solve this problem: I have a data.frame and want to rbind it with a copy of itself in which the columns are switched. But R does not switch the columns.
Example:
set.seed(13)
df <- data.frame(var1 = sample(5), var2 = sample(5))
> df
var1 var2
1 4 1
2 1 3
3 2 4
4 5 2
5 3 5
> rbind(df, df[,c(2,1)])
var1 var2
1 4 1
2 1 3
3 2 4
4 5 2
5 3 5
6 4 1
7 1 3
8 2 4
9 5 2
10 3 5
As you can see, the columns are not switched (rows 6-10), whereas switching the columns alone works like a charm:
> df[,c(2,1)]
var2 var1
1 1 4
2 3 1
3 4 2
4 2 5
5 5 3
I guess this has something to do with the column names, but I cannot figure out what exactly.
Can anyone help?
Kind regards!
As pointed out by @Henrik, from ?rbind.data.frame: "The rbind data frame method [...] matches columns by name." So try this:
> rbind(df, setNames(df[,c(2,1)], c("var1", "var2")))
var1 var2
1 4 1
2 1 3
3 2 4
4 5 2
5 3 5
6 1 4
7 3 1
8 4 2
9 2 5
10 5 3
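
The name matching can be checked directly. In this sketch the values of df are typed in from the printed output above rather than regenerated with set.seed, since sample() may give different numbers in other R versions; by_name and stacked are illustrative names:

```r
df <- data.frame(var1 = c(4, 1, 2, 5, 3), var2 = c(1, 3, 4, 2, 5))

# Columns are reordered, but the names travel with them,
# so rbind puts each value back under its original name
by_name <- rbind(df, df[, c(2, 1)])
all(by_name$var1[6:10] == df$var1)   # TRUE: nothing was switched

# Renaming after reordering makes rbind see var2's values as "var1"
stacked <- rbind(df, setNames(df[, c(2, 1)], names(df)))
all(stacked$var1[6:10] == df$var2)   # TRUE: columns are switched
```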
This also works:
> rbind(as.matrix(df), as.matrix(df[,c(2,1)]))
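
Note that the matrix route returns a matrix rather than a data.frame, because matrices are bound by position and take their column names from the first argument. A short sketch, again with the values of df typed in by hand, that converts the result back; m and stacked are illustrative names:

```r
df <- data.frame(var1 = c(4, 1, 2, 5, 3), var2 = c(1, 3, 4, 2, 5))

# Matrices are stacked by position, so the switched columns stay switched
m <- rbind(as.matrix(df), as.matrix(df[, c(2, 1)]))
is.matrix(m)

# Convert back if a data.frame is needed downstream
stacked <- as.data.frame(m)
stacked
```

This only suits data frames whose columns share one type, since as.matrix coerces mixed columns to a common type.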
