I would like to do calculations across columns in my data, by row. The calculations are "moving" in that I would like to know the difference between the numbers in columns 1 and 2, then columns 2 and 3, and so on. I have looked at "loops" and "rollapply" functions, but could not figure this out. Below are three options of what was attempted. Only the third option gives me the result I am after, but it is very lengthy code and also does not allow for automation (the input data will be a much larger matrix, so typing out the calculation for each row won't work).
Please advise how to make this code shorter and/or suggest any other packages/functions to check out which will do the job. THANK YOU!
MY TEST SCRIPT IN R + errors/results
Sample data set
a<- c(1,2,3, 4, 5)
b<- c(1,2,3, 4, 5)
c<- c(1,2,3, 4, 5)
test.data <- data.frame(cbind(a,b*2,c*10))
names(test.data) <- c("a", "b", "c")
Sample of calculations attempted:
OPTION 1
require(zoo)
rollapply(test.data, 2, diff, fill = NA, align = "right", by.column=FALSE)
RESULT 1 (not what we're after. What we need is at the bottom of Option 3)
# a b c
#[1,] NA NA NA
#[2,] 1 2 10
#[3,] 1 2 10
#[4,] 1 2 10
#[5,] 1 2 10
OPTION 2:
results <- for (i in 1:length(nrow(test.data))) {
diff(as.numeric(test.data[i,]), lag=1)
print(results)}
RESULT 2: (again not what we're after)
# NULL
OPTION 3: works, but long way, so would like to simplify code and make generic for any length of observations in my dataframe and any number of columns (i.e. more than 3). I would like to "automate" the steps below, if know number of observations (i.e. rows).
row1=diff(as.numeric(test.data[1,]), lag=1)
row2=diff(as.numeric(test.data[2,]), lag=1)
row3=diff(as.numeric(test.data[3,]), lag=1)
row4=diff(as.numeric(test.data[4,]), lag=1)
row5=diff(as.numeric(test.data[5,]), lag=1)
results.OK=cbind.data.frame(row1, row2, row3, row4, row5)
transpose.results.OK=data.frame(t(as.matrix(results.OK)))
names(transpose.results.OK)=c("diff.ab", "diff.bc")
Final.data = transpose.results.OK
print(Final.data)
RESULT 3: (THIS IS WHAT I WOULD LIKE TO GET, "row1" can be "obs1" etc)
# diff.ab diff.bc
#row1 1 8
#row2 2 16
#row3 3 24
#row4 4 32
#row5 5 40
THE END
Here are the 3 options redone plus a 4th option:
# 1
library(zoo)
d <- t(rollapplyr(t(test.data), 2, diff, by.column = FALSE))
# 2
d <- test.data[-1]
for (i in 1:nrow(test.data)) d[i, ] <- diff(unlist(test.data[i, ]))
# 3
d <- t(diff(t(test.data)))
# 4 - also this works
nc <- ncol(test.data)
d <- test.data[-1] - test.data[-nc]
For any of them to set the names:
colnames(d) <- paste0("diff.", head(names(test.data), -1), colnames(d))
(2) and (4) give this data.frame and (1) and (3) give the corresponding matrix:
> d
diff.ab diff.bc
1 1 8
2 2 16
3 3 24
4 4 32
5 5 40
Use as.matrix or as.data.frame if you want the other.
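Since the question asks for something generic for any number of rows and columns, approach (4) can be wrapped in a small helper. The following is only a sketch: the function name row_diffs is hypothetical, and it assumes every column of the data frame is numeric.

```r
# Sample data equivalent to the question's test.data
test.data <- data.frame(a = 1:5, b = (1:5) * 2, c = (1:5) * 10)

# Hypothetical helper: difference between each column and the one before it,
# computed for every row at once (assumes all columns are numeric)
row_diffs <- function(df) {
  nc <- ncol(df)
  stopifnot(nc >= 2)
  d <- df[-1] - df[-nc]                     # column k+1 minus column k
  names(d) <- paste0("diff.", names(df)[-nc], names(df)[-1])
  d
}

res <- row_diffs(test.data)
print(res)
```

The same call works unchanged when more rows or columns are added, because nothing in the helper depends on the dimensions of the input.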
An apply-based solution that calls diff on each row can be written as:
# Result
res <- t(apply(test.data, 1, diff)) #One can change it to data.frame
# Name of the columns
colnames(res) <- paste0("diff.", head(names(test.data), -1),
tail(names(test.data), -1))
res
# diff.ab diff.bc
# [1,] 1 8
# [2,] 2 16
# [3,] 3 24
# [4,] 4 32
# [5,] 5 40
I understand what rowsum() does, but I'm trying to get it to work for myself. I've used the example provided in R which is structured as such:
x <- matrix(runif(100), ncol = 5)
group <- sample(1:8, 20, TRUE)
xsum <- rowsum(x, group)
What is the matrix of values that is produced by xsum, and how are the values obtained? What I thought was happening was that the values obtained from group were going to be used to state how many entries from the matrix to use in a rowsum. For example, say that group = (2,4,3,1,5). What I thought this would mean is that the first two entries going by row would be selected as the first entry to xsum. It appears as though this is not what is happening.
rowsum adds all rows that have the same group value. Let us take a simpler example.
m <- cbind(1:4, 5:8)
m
## [,1] [,2]
## [1,] 1 5
## [2,] 2 6
## [3,] 3 7
## [4,] 4 8
group <- c(1, 1, 2, 2)
rowsum(m, group)
## [,1] [,2]
## 1 3 11
## 2 7 15
Since the first two rows correspond to group 1 and the last 2 rows to group 2 it sums the first two rows giving the first row of the output and it sums the last 2 rows giving the second row of the output.
rbind(`1` = m[1, ] + m[2, ], `2` = m[3, ] + m[4, ])
## [,1] [,2]
## 1 3 11
## 2 7 15
That is, the 3 is formed by adding the 1 from row 1 of m and the 2 from row 2 of m. The 11 is formed by adding the 5 from row 1 of m and the 6 from row 2 of m.
7 and 15 are formed similarly.
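The rows of a group do not have to be adjacent. As a small sketch (the interleaved group vector and the names by_rowsum and by_hand are made up here for illustration), the same sums can be reproduced by splitting the row indices by group and summing each subset:

```r
m <- cbind(1:4, 5:8)
group <- c(2, 1, 2, 1)   # rows 2 and 4 form group 1, rows 1 and 3 form group 2

by_rowsum <- rowsum(m, group)

# Same computation by hand: split row indices by group, then sum each block
by_hand <- t(sapply(split(seq_len(nrow(m)), group),
                    function(idx) colSums(m[idx, , drop = FALSE])))

print(by_rowsum)
print(by_hand)
```

Both results have one row per distinct group value, ordered by group value, which is also why xsum in the question has at most 8 rows.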
Let's say we have two data frames in R, df.A and df.B, defined thus:
bin_name <- c('bin_1','bin_2','bin_3','bin_4','bin_5')
bin_min <- c(0,2,4,6,8)
bin_max <- c(2,4,6,8,10)
df.A <- data.frame(bin_name, bin_min, bin_max, stringsAsFactors = FALSE)
obs_ID <- c('obs_1','obs_2','obs_3','obs_4','obs_5','obs_6','obs_7','obs_8','obs_9','obs_10')
obs_min <- c(6.5,0,8,2,1,7,5,6,8,3)
obs_max <- c(7,3,10,3,9,8,5.5,8,10,4)
df.B <- data.frame(obs_ID, obs_min, obs_max, stringsAsFactors = FALSE)
df.A defines the ranges of bins, while df.B consists of rows of observations with min and max values that may or may not fall entirely within a bin defined in df.A.
We want to generate a new vector of length nrow(df.B) containing the row indices of df.A corresponding to the bin in which each observation falls entirely. If an observation straddles a bin boundary or falls partially outside a bin, then it can't be assigned to a bin and should return NA (or something similar).
In the above example, the correct output vector would be this:
bin_rows <- c(4, NA, 5, 2, NA, 4, 3, 4, 5, 2)
I came up with a long-winded solution using sapply:
bin_assignments <- sapply(1:nrow(df.B), function(i) which(df.A$bin_max >= df.B$obs_max[i] & df.A$bin_min <= df.B$obs_min[i])) #get bin assignments for every observation
bin_assignments[bin_assignments == "integer(0)"] <- NA #replace "integer(0)" entries with NA
bin_assignments <- do.call("c", bin_assignments) #concatenate the output of the sapply call
Several months ago I discovered a simple, single-line solution to this problem that didn't use an apply function. However, I forgot how I did this and I have not been able to rediscover it! The solution might involve match() or which(). Any ideas?
1) Using SQL it can readily be done in one statement:
library(sqldf)
sqldf('select a.rowid
from "df.B" b
left join "df.A" a on obs_min >= bin_min and obs_max <= bin_max')
rowid
1 4
2 NA
3 5
4 2
5 NA
6 4
7 3
8 4
9 5
10 2
2) merge/by We can do it in two statements using merge and by. No packages are used.
This does have the downside that it materializes the large join which the SQL solution would not need to do.
Note that in df.B, as defined in the question, obs_10 sorts second rather than tenth among the obs_ID values. If obs_10 sorted tenth, the second argument to by could have been just m$obs_ID, so fixing up the input first could simplify it.
m <- merge(df.B, df.A)
stack(by(m, as.numeric(sub(".*_", "", m$obs_ID)),
with, c(which(obs_min >= bin_min & obs_max <= bin_max), NA)[1]))
giving:
values ind
1 4 1
2 NA 2
3 5 3
4 2 4
5 NA 5
6 4 6
7 3 7
8 4 8
9 5 9
10 2 10
3) sapply Note that using the c(..., NA)[1] trick from (2) we can simplify the sapply solution in the question to one statement:
sapply(1:nrow(df.B), function(i)
c(which(df.A$bin_max >= df.B$obs_max[i] & df.A$bin_min <= df.B$obs_min[i]), NA)[1])
giving:
[1] 4 NA 5 2 NA 4 3 4 5 2
3a) mapply A nicer variation of (3) using mapply is given by @Ronak Shah in the comments:
mapply(function(x, y) c(which(x >= df.A$bin_min & y <= df.A$bin_max), NA)[1],
df.B$obs_min,
df.B$obs_max)
4) outer Here is another one statement solution that uses no packages.
seq_len(nrow(df.A)) %*%
(outer(df.A$bin_max, df.B$obs_max, ">=") & outer(df.A$bin_min, df.B$obs_min, "<="))
giving:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 4 0 5 2 0 4 3 4 5 2
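Note that the matrix product in (4) reports 0 rather than NA for observations that fit no bin, and it would add bin indices together if an observation fit more than one bin. A minimal sketch of a variant that returns NA directly, reusing the same outer comparison (the names inside and bin_rows are chosen here for illustration):

```r
df.A <- data.frame(bin_name = paste0("bin_", 1:5),
                   bin_min = c(0, 2, 4, 6, 8),
                   bin_max = c(2, 4, 6, 8, 10),
                   stringsAsFactors = FALSE)
df.B <- data.frame(obs_ID = paste0("obs_", 1:10),
                   obs_min = c(6.5, 0, 8, 2, 1, 7, 5, 6, 8, 3),
                   obs_max = c(7, 3, 10, 3, 9, 8, 5.5, 8, 10, 4),
                   stringsAsFactors = FALSE)

# inside[i, j] is TRUE when observation j lies entirely within bin i
inside <- outer(df.A$bin_max, df.B$obs_max, ">=") &
          outer(df.A$bin_min, df.B$obs_min, "<=")

# First matching bin per observation; which(...)[1] is NA when nothing matches
bin_rows <- apply(inside, 2, function(col) which(col)[1])
print(bin_rows)
```

Taking the first match is a design choice: with bins that share boundaries, an observation sitting exactly on a boundary is assigned to the lower bin.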
I know there are similar questions but I couldn't find an answer to my question. I'm trying to rank elements in a matrix and then extract data of 5 highest elements.
Here is my attempt.
set.seed(20)
d<-matrix(rnorm(100),nrow=10,ncol=10)
start<-d[1,1]
for (i in 1:10) {
for (j in 1:10) {
if (start < d[i,j])
{high<-d[i,j]
rowind<-i
colind<-j
}
}
}
Although this gives me the data of the highest element, including row and column numbers, I can't think of a way to do the same for elements ranked from 2 to 5. I also tried
rank(d, ties.method="max")
But it wasn't helpful because it just spits out the rank in vector format.
What I ultimately want is a data frame (or any sort of table) that contains
rank, column name, row name, and the data(number) of highest 5 elements in matrix.
Edit
set.seed(20)
d<-matrix(rnorm(100),nrow=10,ncol=10)
d[1,2]<-5
d[2,1]<-5
d[1,3]<-4
d[3,1]<-4
Thanks for the answers. Those worked perfectly for my purpose, but I'm running this code on a correlation chart (where every pair appears twice, so there are duplicate numbers), and I want to count only one of the two numbers for ranking purposes. Is there any way to do this? Thanks.
Here's a very crude way:
DF = data.frame(row = c(row(d)), col = c(col(d)), v = c(d))
DF[order(DF$v, decreasing=TRUE), ][1:5, ]
row col v
91 1 10 2.208443
82 2 9 1.921899
3 3 1 1.785465
32 2 4 1.590146
33 3 4 1.556143
It would be nice to only have to partially sort, but in ?order, it looks like this option is only available for sort, not for order.
If the matrix has row and col names, it might be convenient to see them instead of numbers. Here's what I might do:
dimnames(d) <- list(letters[1:10], letters[1:10])
DF = data.frame(as.table(d))
DF[order(DF$Freq, decreasing=TRUE), ][1:5, ]
Var1 Var2 Freq
91 a j 2.208443
82 b i 1.921899
3 c a 1.785465
32 b d 1.590146
33 c d 1.556143
The column names don't make much sense here, unfortunately, but you can change them with names(DF) <- as usual.
Here is one option with Matrix
library(Matrix)
m1 <- summary(Matrix(d, sparse=TRUE))
head(m1[order(-m1[,3]),],5)
# i j x
#93 3 10 2.359634
#31 1 4 2.234804
#23 3 3 1.980956
#55 5 6 1.801341
#16 6 2 1.678989
Or use melt
library(reshape2)
m2 <- melt(d)
head(m2[order(-m2[,3]), ], 5)
Here is something quite simple in base R.
# set.seed(20)
# d <- matrix(rnorm(100), nrow = 10, ncol = 10)
d.rank <- matrix(rank(-d), nrow = 10, ncol = 10)
which(d.rank <= 5, arr.ind=TRUE)
row col
[1,] 3 1
[2,] 2 4
[3,] 3 4
[4,] 2 9
[5,] 1 10
d[d.rank <= 5]
[1] 1.785465 1.590146 1.556143 1.921899 2.208443
Results can (easily) be made clearer (see comment from Frank):
cbind(which(d.rank <= 5, arr.ind=TRUE), v = d[d.rank <= 5], rank = rank(-d[d.rank <= 5]))
row col v rank
[1,] 3 1 1.785465 3
[2,] 2 4 1.590146 4
[3,] 3 4 1.556143 5
[4,] 2 9 1.921899 2
[5,] 1 10 2.208443 1
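For the follow-up in the question's edit, where each pair appears twice as in a correlation matrix, one possible approach (a sketch, not taken from the answers above) is to rank only the entries above the diagonal using upper.tri. The names keep and top5 are illustrative, and the sketch assumes the matrix is meant to be symmetric, so the upper triangle holds one copy of every pair.

```r
set.seed(20)
d <- matrix(rnorm(100), nrow = 10, ncol = 10)
d[1, 2] <- 5; d[2, 1] <- 5
d[1, 3] <- 4; d[3, 1] <- 4

# Keep only entries strictly above the diagonal: one copy of each pair
keep <- upper.tri(d)
DF <- data.frame(row = row(d)[keep], col = col(d)[keep], v = d[keep])

# Five largest values among the kept entries, with an explicit rank column
top5 <- head(DF[order(DF$v, decreasing = TRUE), ], 5)
top5$rank <- seq_len(nrow(top5))
print(top5)
```

Use upper.tri(d, diag = TRUE) instead if the diagonal entries should also take part in the ranking.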
I have a matrix as below
B = matrix(
c(2, 4, 3, 1, 5, 7),
nrow=3,
ncol=2)
B # B has 3 rows and 2 columns
# [,1] [,2]
#[1,] 2 1
#[2,] 4 5
#[3,] 3 7
I would like to create a data.frame with 3 columns: row number, column number and actual value from above matrix. I am thinking of writing 2 for loops. Is there a more efficient way to do this?
The output that i want (i am showing only first 2 rows below)
rownum columnnum value
1 1 2
1 2 1
Try
cbind(c(row(B)), c(col(B)), c(B))
Or
library(reshape2)
melt(B)
As per @nicola's comments, the output needed may be in row-major order. In that case, take the transpose of the matrix and do the same:
TB <- t(B)
cbind(rownum = c(col(TB)), colnum = c(row(TB)), value = c(TB))
data.frame(which(B==B, arr.ind=TRUE), value=as.vector(B))
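To get the column names shown in the question's expected output, the transpose approach can be combined with data.frame. This is a small sketch; the column names rownum, columnnum and value follow the question, and out is an illustrative name.

```r
B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 3, ncol = 2)

# Transposing first makes the column-major unrolling of TB
# correspond to a row-major walk over B
TB <- t(B)
out <- data.frame(rownum = c(col(TB)),
                  columnnum = c(row(TB)),
                  value = c(TB))
print(out)
```

The first two rows of out are (1, 1, 2) and (1, 2, 1), matching the rows shown in the question.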
Assume you have a data frame like this:
df <- data.frame(Nums = c(1,2,3,4,5,6,7,8,9,10), Cum.sums = NA)
> df
Nums Cum.sums
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 NA
6 6 NA
7 7 NA
8 8 NA
9 9 NA
10 10 NA
and you want an output like this:
Nums Cum.sums
1 1 0
2 2 0
3 3 0
4 4 3
5 5 5
6 6 7
7 7 9
8 8 11
9 9 13
10 10 15
The 4th element of the column Cum.sums is the sum of 1 and 2, the 5th element of the column Cum.sums is the sum of 2 and 3, and so on.
This means I would like to build a kind of cumulative sum of the first column and save it in the second column. However, I don't want the normal cumulative sum but the sum of the element 2 rows above the current row plus the element 3 rows above the current row.
I already tried to play around a little with the sum and cumsum functions, but I failed.
Any ideas?
Thanks!
You could use the embed function to create the appropriate lags, rowSums to sum, then lag appropriately (I used head).
df$Cum.sums[-(1:3)] <- head(rowSums(embed(df$Nums,2)),-2)
You don't need any special function, just use normal vector operations (these solutions are all equivalent):
df$Cum.sums[-(1:3)] <- head(df$Nums, -3) + head(df$Nums[-1], -2)
or
df <- within(df, Cum.sums[-(1:3)] <- head(Nums, -3) + head(Nums[-1], -2))
or
df$Cum.sums[-(1:3)] <- df$Nums[1:(nrow(df)-3)] + df$Nums[2:(nrow(df)-2)]
I believe the first 3 sums SHOULD be NA, not 0, but if you prefer zeroes, you can initialize the sums first:
df$Cum.sums <- 0
Another solution, elegant and general, uses matrix multiplication, which makes it very inefficient for large data. So it is not very practical, though it is a nice exercise:
len <- nrow(df)
sr <- 2 # number of rows to sum
lag <- 3
mat <- matrix(
head(c(
rep(0, lag * len),
rep(rep(1:0, c(sr, len - sr + 1)), len)
), len * len),
nrow = len, byrow = TRUE
)
mat %*% df$Nums
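The same sr and lag parameters can also be used without building a len-by-len matrix. The following is a hedged sketch built on embed and rowSums; the helper name lagged_window_sum is hypothetical, and it assumes lag >= sr - 1 so that the window never reaches past the current row.

```r
# Hypothetical helper: for each position i > lag, sum the sr elements
# starting lag rows above i, i.e. x[i - lag] + ... + x[i - lag + sr - 1]
lagged_window_sum <- function(x, sr = 2, lag = 3) {
  stopifnot(sr >= 1, lag >= sr - 1)
  n <- length(x)
  out <- rep(NA_real_, n)            # positions without a full window stay NA
  if (n > lag) {
    sums <- rowSums(embed(x, sr))    # sums[k] = x[k] + ... + x[k + sr - 1]
    out[(lag + 1):n] <- sums[seq_len(n - lag)]
  }
  out
}

df <- data.frame(Nums = 1:10)
df$Cum.sums <- lagged_window_sum(df$Nums, sr = 2, lag = 3)
print(df)
```

The leading positions are left as NA here; replace them with 0 afterwards if the output in the question is wanted exactly.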