Cartesian Product on column headers for Time Series Data - r

Say I had a dataframe like
d <- c("03-12-2018","03-11-2018")
g <- c(10,5)
p <- c(8,9)
a <- c(7,2)
df <- data.frame(d,g,p,a)
colnames(df) <- c("date","grapes","pears","apples")
df
date grapes pears apples
1 03-12-2018 10 8 7
2 03-11-2018 5 9 2
I essentially want output looking like:
date grapes_pears grapes_apples pears_apples
3-12-2018 2 3 1
3-11-2018 -4 3 7
So the values in the output table are just the difference between the first fruit and the second fruit named in the column header. A basic Cartesian product on the headers (excluding the date column) is fine. I know I will also receive pairs in reverse (grapes_pears, pears_grapes), which differ only in sign, as well as cases like grapes_grapes, but for now that is okay. I will refine it later.
Thanks for your help.

You can try combn(), i.e.
combn(names(df[-1]),2, FUN = function(i) Reduce(`-`, df[i]))
which gives,
[,1] [,2] [,3]
[1,] 2 3 1
[2,] -4 3 7
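The matrix above has no column names and no date column. A minimal sketch of one way to attach them, assuming the same df as in the question (the names fruits, diffs and out are introduced here for illustration only):

```r
# Same data as in the question
df <- data.frame(
  date   = c("03-12-2018", "03-11-2018"),
  grapes = c(10, 5),
  pears  = c(8, 9),
  apples = c(7, 2)
)

fruits <- names(df)[-1]  # every column except date

# One column of differences per unordered pair (first minus second)
diffs <- combn(fruits, 2, FUN = function(i) df[[i[1]]] - df[[i[2]]])

# Build matching headers such as "grapes_pears" from the same pairs
colnames(diffs) <- combn(fruits, 2, FUN = paste, collapse = "_")

out <- data.frame(date = df$date, diffs)
out
```

Because combn() only yields each unordered pair once, this avoids the reversed pairs and the grapes_grapes cases mentioned in the question.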

Related

How does the 'group' argument in rowsum work?

I understand what rowsum() does, but I'm trying to get it to work for myself. I've used the example provided in R which is structured as such:
x <- matrix(runif(100), ncol = 5)
group <- sample(1:8, 20, TRUE)
xsum <- rowsum(x, group)
What is the matrix of values that xsum produces, and how are the values obtained? I thought the values in group would state how many entries from the matrix to use in each row sum. For example, say that group = (2,4,3,1,5). I thought this would mean that the first two entries, going by row, would be summed to form the first entry of xsum. It appears that this is not what is happening.
rowsum adds all rows that have the same group value. Let us take a simpler example.
m <- cbind(1:4, 5:8)
m
## [,1] [,2]
## [1,] 1 5
## [2,] 2 6
## [3,] 3 7
## [4,] 4 8
group <- c(1, 1, 2, 2)
rowsum(m, group)
## [,1] [,2]
## 1 3 11
## 2 7 15
Since the first two rows correspond to group 1 and the last 2 rows to group 2 it sums the first two rows giving the first row of the output and it sums the last 2 rows giving the second row of the output.
rbind(`1` = m[1, ] + m[2, ], `2` = m[3, ] + m[4, ])
## [,1] [,2]
## 1 3 11
## 2 7 15
That is the 3 is formed by adding the 1 from row 1 of m and the 2 of row 2 of m. The 11 is formed by adding 5 from row 1 of m and 6 from row 2 of m.
7 and 15 are formed similarly.
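As a further illustration (a small sketch, not part of the original answer), the rows that share a group value do not have to be adjacent, and the group labels can be character strings. The labels "a" and "b" below are made up for the example:

```r
m <- cbind(1:4, 5:8)

# Rows 2 and 4 belong to "a"; rows 1 and 3 belong to "b"
group <- c("b", "a", "b", "a")

res <- rowsum(m, group)
res
# Row "a" holds m[2, ] + m[4, ], i.e. 6 and 14
# Row "b" holds m[1, ] + m[3, ], i.e. 4 and 12
```

By default the output rows are ordered by the sorted group labels, not by the order in which the labels first appear.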

Calculating moving differences across columns per row in r

I would like to do calculations across columns in my data, by row. The calculations are "moving" in that I would like to know the difference between two numbers in column 1 and 2, then columns 3 and 4, and so on. I have looked at "loops" and "rollapply" functions, but could not figure this out. Below are three options of what was attempted. Only the third option gives me the result I am after, but it is very lengthy code and also does not allow for automation (the input data will be a much larger matrix, so typing out the calculation for each row won't work).
Please advise how to make this code shorter, and/or suggest any other packages/functions that would do the job. THANK YOU!
MY TEST SCRIPT IN R + errors/results
Sample data set
a<- c(1,2,3, 4, 5)
b<- c(1,2,3, 4, 5)
c<- c(1,2,3, 4, 5)
test.data <- data.frame(cbind(a,b*2,c*10))
names(test.data) <- c("a", "b", "c")
Sample of calculations attempted:
OPTION 1
require(zoo)
rollapply(test.data, 2, diff, fill = NA, align = "right", by.column=FALSE)
RESULT 1 (not what we're after. What we need is at the bottom of Option 3)
# a b c
#[1,] NA NA NA
#[2,] 1 2 10
#[3,] 1 2 10
#[4,] 1 2 10
#[5,] 1 2 10
OPTION 2:
results <- for (i in 1:length(nrow(test.data))) {
diff(as.numeric(test.data[i,]), lag=1)
print(results)}
RESULT 2: (again not what we're after)
# NULL
OPTION 3: works, but it is the long way, so I would like to simplify the code and make it generic for any number of observations in my dataframe and any number of columns (i.e. more than 3). I would like to "automate" the steps below, given that I know the number of observations (i.e. rows).
row1=diff(as.numeric(test.data[1,]), lag=1)
row2=diff(as.numeric(test.data[2,]), lag=1)
row3=diff(as.numeric(test.data[3,]), lag=1)
row4=diff(as.numeric(test.data[4,]), lag=1)
row5=diff(as.numeric(test.data[5,]), lag=1)
results.OK=cbind.data.frame(row1, row2, row3, row4, row5)
transpose.results.OK=data.frame(t(as.matrix(results.OK)))
names(transpose.results.OK)=c("diff.ab", "diff.bc")
Final.data = transpose.results.OK
print(Final.data)
RESULT 3: (THIS IS WHAT I WOULD LIKE TO GET, "row1" can be "obs1" etc)
# diff.ab diff.bc
#row1 1 8
#row2 2 16
#row3 3 24
#row4 4 32
#row5 5 40
THE END
Here are the 3 options redone plus a 4th option:
# 1
library(zoo)
d <- t(rollapplyr(t(test.data), 2, diff, by.column = FALSE))
# 2
d <- test.data[-1]
for (i in 1:nrow(test.data)) d[i, ] <- diff(unlist(test.data[i, ]))
# 3
d <- t(diff(t(test.data)))
# 4 - also this works
nc <- ncol(test.data)
d <- test.data[-1] - test.data[-nc]
For any of them to set the names:
colnames(d) <- paste0("diff.", head(names(test.data), -1), tail(names(test.data), -1))
(2) and (4) give this data.frame and (1) and (3) give the corresponding matrix:
> d
diff.ab diff.bc
1 1 8
2 2 16
3 3 24
4 4 32
5 5 40
Use as.matrix or as.data.frame if you want the other.
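For reuse on a larger data frame, option (4) can be wrapped in a small function. This is only a sketch; adjacent_diffs is a hypothetical helper name, and it assumes every column is numeric:

```r
# Same sample data as in the question
test.data <- data.frame(a = 1:5, b = (1:5) * 2, c = (1:5) * 10)

# Hypothetical helper: difference between each column and the one before it
adjacent_diffs <- function(x) {
  stopifnot(ncol(x) >= 2)
  nc <- ncol(x)
  d <- x[-1] - x[-nc]  # columns 2..nc minus columns 1..(nc - 1)
  names(d) <- paste0("diff.", names(x)[-nc], names(x)[-1])
  d
}

res <- adjacent_diffs(test.data)
res
```

The function works for any number of rows and columns because it never refers to a row or column by position other than first and last.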
An apply-based solution that runs diff row-wise:
# Result
res <- t(apply(test.data, 1, diff)) #One can change it to data.frame
# Name of the columns
colnames(res) <- paste0("diff.", head(names(test.data), -1),
tail(names(test.data), -1))
res
# diff.ab diff.bc
# [1,] 1 8
# [2,] 2 16
# [3,] 3 24
# [4,] 4 32
# [5,] 5 40

Get row indices of data frame A according to multiple matching criteria in that data frame and another data frame, B

Let's say we have two data frames in R, df.A and df.B, defined thus:
bin_name <- c('bin_1','bin_2','bin_3','bin_4','bin_5')
bin_min <- c(0,2,4,6,8)
bin_max <- c(2,4,6,8,10)
df.A <- data.frame(bin_name, bin_min, bin_max, stringsAsFactors = FALSE)
obs_ID <- c('obs_1','obs_2','obs_3','obs_4','obs_5','obs_6','obs_7','obs_8','obs_9','obs_10')
obs_min <- c(6.5,0,8,2,1,7,5,6,8,3)
obs_max <- c(7,3,10,3,9,8,5.5,8,10,4)
df.B <- data.frame(obs_ID, obs_min, obs_max, stringsAsFactors = FALSE)
df.A defines the ranges of bins, while df.B consists of rows of observations with min and max values that may or may not fall entirely within a bin defined in df.A.
We want to generate a new vector of length nrow(df.B) containing the row indices of df.A corresponding to the bin in which each observation falls entirely. If an observation straddles a bin boundary or falls partially outside a bin, then it can't be assigned to a bin and should return NA (or something similar).
In the above example, the correct output vector would be this:
bin_rows <- c(4, NA, 5, 2, NA, 4, 3, 4, 5, 2)
I came up with a long-winded solution using sapply:
bin_assignments <- sapply(1:nrow(df.B), function(i) which(df.A$bin_max >= df.B$obs_max[i] & df.A$bin_min <= df.B$obs_min[i])) #get bin assignments for every observation
bin_assignments[bin_assignments == "integer(0)"] <- NA #replace "integer(0)" entries with NA
bin_assignments <- do.call("c", bin_assignments) #concatenate the output of the sapply call
Several months ago I discovered a simple, single-line solution to this problem that didn't use an apply function. However, I forgot how I did this and I have not been able to rediscover it! The solution might involve match() or which(). Any ideas?
1) Using SQL it can readily be done in one statement:
library(sqldf)
sqldf('select a.rowid
from "df.B" b
left join "df.A" a on obs_min >= bin_min and obs_max <= bin_max')
rowid
1 4
2 NA
3 5
4 2
5 NA
6 4
7 3
8 4
9 5
10 2
2) merge/by We can do it in two statements using merge and by. No packages are used.
This does have the downside that it materializes the large join which the SQL solution would not need to do.
Note that with df.B as defined in the question, obs_10 is the second level rather than the 10th level when obs_ID is treated as a factor. If obs_10 were the 10th level, the second argument to by could have been just m$obs_ID, so fixing up the input first could simplify it.
m <- merge(df.B, df.A)
stack(by(m, as.numeric(sub(".*_", "", m$obs_ID)),
with, c(which(obs_min >= bin_min & obs_max <= bin_max), NA)[1]))
giving:
values ind
1 4 1
2 NA 2
3 5 3
4 2 4
5 NA 5
6 4 6
7 3 7
8 4 8
9 5 9
10 2 10
3) sapply Note that using the c(..., NA)[1] trick from (2) we can simplify the sapply solution in the question to one statement:
sapply(1:nrow(df.B), function(i)
c(which(df.A$bin_max >= df.B$obs_max[i] & df.A$bin_min <= df.B$obs_min[i]), NA)[1])
giving:
[1] 4 NA 5 2 NA 4 3 4 5 2
3a) mapply A nicer variation of (3) using mapply is given by @Ronak Shah in the comments:
mapply(function(x, y) c(which(x >= df.A$bin_min & y <= df.A$bin_max), NA)[1],
df.B$obs_min,
df.B$obs_max)
4) outer Here is another one statement solution that uses no packages.
seq_len(nrow(df.A)) %*%
(outer(df.A$bin_max, df.B$obs_max, ">=") & outer(df.A$bin_min, df.B$obs_min, "<="))
giving:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 4 0 5 2 0 4 3 4 5 2
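Note that (4) returns a 1 x 10 matrix with 0 where the question asks for NA. A sketch of one way to turn it into the requested vector, using the data from the question (the names inside and idx are introduced here for illustration):

```r
df.A <- data.frame(
  bin_name = c("bin_1", "bin_2", "bin_3", "bin_4", "bin_5"),
  bin_min  = c(0, 2, 4, 6, 8),
  bin_max  = c(2, 4, 6, 8, 10),
  stringsAsFactors = FALSE
)
df.B <- data.frame(
  obs_ID  = paste0("obs_", 1:10),
  obs_min = c(6.5, 0, 8, 2, 1, 7, 5, 6, 8, 3),
  obs_max = c(7, 3, 10, 3, 9, 8, 5.5, 8, 10, 4),
  stringsAsFactors = FALSE
)

# inside[i, j] is TRUE when observation j lies entirely within bin i
inside <- outer(df.A$bin_max, df.B$obs_max, ">=") &
  outer(df.A$bin_min, df.B$obs_min, "<=")

# Weighting each row by its index recovers the bin number per observation
idx <- drop(seq_len(nrow(df.A)) %*% inside)
idx[idx == 0] <- NA  # 0 means no bin contained the observation
idx
```

One caveat: if an observation fits in more than one bin (for example, a zero-width observation sitting exactly on a shared boundary), the matrix product adds the matching indices together, so this approach assumes at most one match per observation.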

Ranking and Counting Matrix Elements in R

I know there are similar questions but I couldn't find an answer to my question. I'm trying to rank elements in a matrix and then extract data of 5 highest elements.
Here is my attempt.
set.seed(20)
d<-matrix(rnorm(100),nrow=10,ncol=10)
start <- d[1, 1]
high <- start; rowind <- 1; colind <- 1
for (i in 1:10) {
  for (j in 1:10) {
    if (start < d[i, j]) {
      start <- d[i, j]  # update the running maximum
      high <- d[i, j]
      rowind <- i
      colind <- j
    }
  }
}
Although this gives me the data of the highest element, including row and column numbers, I can't think of a way to do the same for elements ranked from 2 to 5. I also tried
rank(d, ties.method="max")
But it wasn't helpful because it just spits out the rank in vector format.
What I ultimately want is a data frame (or any sort of table) that contains
rank, column name, row name, and the data(number) of highest 5 elements in matrix.
Edit
set.seed(20)
d<-matrix(rnorm(100),nrow=10,ncol=10)
d[1,2]<-5
d[2,1]<-5
d[1,3]<-4
d[3,1]<-4
Thanks for the answers. Those worked perfectly for my purpose, but I'm running this code on a correlation chart (where every pair's value appears twice), so I want to count only one of the two numbers for ranking purposes. Is there any way to do this? Thanks.
Here's a very crude way:
DF = data.frame(row = c(row(d)), col = c(col(d)), v = c(d))
DF[order(DF$v, decreasing=TRUE), ][1:5, ]
row col v
91 1 10 2.208443
82 2 9 1.921899
3 3 1 1.785465
32 2 4 1.590146
33 3 4 1.556143
It would be nice to only have to partially sort, but in ?order, it looks like this option is only available for sort, not for order.
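One possible workaround, offered only as a sketch: use the partial argument of sort() to find the value of the 5th largest element, then pick out everything at or above that threshold with which(). The names k, n, thr and idx are made up for this example:

```r
set.seed(20)
d <- matrix(rnorm(100), nrow = 10, ncol = 10)

k <- 5
n <- length(d)

# Partial sort: only guarantees that position n - k + 1 holds the right value
thr <- sort(as.vector(d), partial = n - k + 1)[n - k + 1]

# Positions of all elements at or above the k-th largest value
idx <- which(d >= thr, arr.ind = TRUE)
cbind(idx, v = d[idx])
```

The rows come back in matrix (column-major) order rather than rank order, and ties at the threshold would return more than k rows.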
If the matrix has row and col names, it might be convenient to see them instead of numbers. Here's what I might do:
dimnames(d) <- list(letters[1:10], letters[1:10])
DF = data.frame(as.table(d))
DF[order(DF$Freq, decreasing=TRUE), ][1:5, ]
Var1 Var2 Freq
91 a j 2.208443
82 b i 1.921899
3 c a 1.785465
32 b d 1.590146
33 c d 1.556143
The column names don't make much sense here, unfortunately, but you can change them with names(DF) <- as usual.
Here is one option with Matrix
library(Matrix)
m1 <- summary(Matrix(d, sparse=TRUE))
head(m1[order(-m1[,3]),],5)
# i j x
#93 3 10 2.359634
#31 1 4 2.234804
#23 3 3 1.980956
#55 5 6 1.801341
#16 6 2 1.678989
Or use melt
library(reshape2)
m2 <- melt(d)
head(m2[order(-m2[,3]), ], 5)
Here is something quite simple in base R.
# set.seed(20)
# d <- matrix(rnorm(100), nrow = 10, ncol = 10)
d.rank <- matrix(rank(-d), nrow = 10, ncol = 10)
which(d.rank <= 5, arr.ind=TRUE)
row col
[1,] 3 1
[2,] 2 4
[3,] 3 4
[4,] 2 9
[5,] 1 10
d[d.rank <= 5]
[1] 1.785465 1.590146 1.556143 1.921899 2.208443
Results can (easily) be made clearer (see comment from Frank):
cbind(which(d.rank <= 5, arr.ind=TRUE), v = d[d.rank <= 5], rank = rank(-d[d.rank <= 5]))
row col v rank
[1,] 3 1 1.785465 3
[2,] 2 4 1.590146 4
[3,] 3 4 1.556143 5
[4,] 2 9 1.921899 2
[5,] 1 10 2.208443 1
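For the follow-up in the Edit (counting only one value of each symmetric pair), one option is to rank only the elements above the diagonal. This is a sketch using upper.tri(), applied to the matrix from the Edit; the names keep, DF and top5 are introduced for illustration:

```r
set.seed(20)
d <- matrix(rnorm(100), nrow = 10, ncol = 10)
d[1, 2] <- 5; d[2, 1] <- 5
d[1, 3] <- 4; d[3, 1] <- 4

# TRUE only strictly above the diagonal, so each (i, j) pair is seen once
keep <- upper.tri(d)

DF <- data.frame(row = row(d)[keep], col = col(d)[keep], v = d[keep])
top5 <- head(DF[order(DF$v, decreasing = TRUE), ], 5)
top5$rank <- seq_len(nrow(top5))
top5
```

For a real correlation matrix this also drops the diagonal of 1s, which is usually what is wanted when ranking pairs. Use lower.tri() instead if the lower half is preferred.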

Change row order in a matrix/dataframe

I need to change/invert rows in my data frame, not transposing the data but moving the bottom row to the top and so on. If the data frame was:
1 2 3
4 5 6
7 8 9
I need to convert to
7 8 9
4 5 6
1 2 3
I've read about sort() but I don't think it is what I need or I'm not able to find the way.
There probably are more elegant ways, but this works:
m <- matrix(1:9, ncol=3, byrow=TRUE)
# m[rev(seq_len(nrow(m))), ] # Initial answer
m[nrow(m):1, ]
[,1] [,2] [,3]
[1,] 7 8 9
[2,] 4 5 6
[3,] 1 2 3
This works because you are indexing the matrix with a reversed sequence of integers as the row index. nrow(m):1 results in 3 2 1.
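One edge case worth knowing about: for an object with zero rows, nrow(m):1 evaluates to 0:1 and selecting row 1 then fails (or returns an NA row for a data frame), whereas rev(seq_len(nrow(m))) is empty. A small sketch of a helper that covers this, where reverse_rows is a hypothetical name:

```r
# Hypothetical helper: reverse row order for a matrix or data frame
reverse_rows <- function(x) {
  # seq_len() handles zero rows; drop = FALSE keeps the 2-d shape
  x[rev(seq_len(nrow(x))), , drop = FALSE]
}

m <- matrix(1:9, ncol = 3, byrow = TRUE)
rev_m <- reverse_rows(m)
rev_m

# Zero-row input stays a zero-row matrix instead of raising an error
empty <- reverse_rows(m[0, , drop = FALSE])
dim(empty)
```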
You can reverse the order of a data.frame using the dplyr package:
library(dplyr)
iris %>% arrange(-row_number())
Or without using the pipe-operator by doing
arrange(iris, -row_number())
I would reverse the rows with an index that starts at the number of rows, along these lines:
revdata <- thedata[dim(thedata)[1L]:1,]
I think this is the simplest way:
MyMatrix = matrix(1:20, ncol = 2)
MyMatrix[ nrow(MyMatrix):1, ]
If you want to reverse the columns, just do
MyMatrix[ , ncol(MyMatrix):1 ]
We can reverse the order of row.names (for data.frame only):
# create data.frame
m <- matrix(1:9, ncol=3, byrow=TRUE)
df_m <- data.frame(m)
#reverse
df_m[rev(rownames(df_m)), ]
# X1 X2 X3
# 3 7 8 9
# 2 4 5 6
# 1 1 2 3
Veeery late, but this seems to be working fast, does not need any extra packages and is simple:
for(i in 1:ncol(matrix)) {matrix[,i] = rev(matrix[,i])}
I guess that for frequent use, one would make a function out of it.
Tested with R v=3.3.1.
I encountered this problem today, so here is another solution for your interest.
m <- matrix(1:9, ncol=3, byrow=TRUE)
apply(m,2,rev)
