Assume you have a data frame like this:
df <- data.frame(Nums = c(1,2,3,4,5,6,7,8,9,10), Cum.sums = NA)
> df
Nums Cum.sums
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 NA
6 6 NA
7 7 NA
8 8 NA
9 9 NA
10 10 NA
and you want an output like this:
Nums Cum.sums
1 1 0
2 2 0
3 3 0
4 4 3
5 5 5
6 6 7
7 7 9
8 8 11
9 9 13
10 10 15
The 4. element of the column Cum.sum is the sum of 1 and 2, the 5. element of the Column Cum.sum is the sum of 2 and 3 and so on...
This means, I would like to build the cumulative sum of the first row and save it in the second row. However I don't want the normal cumulative sum but the sum of the element 2 rows above the current row plus the element 3 rows above the current row.
I allready tried to play a little bit around with the sum and cumsum function but I failed.
Any ideas?
Thanks!
You could use the embed function to create the appropriate lags, rowSums to sum, then lag appropriately (I used head).
df$Cum.sums[-(1:3)] <- head(rowSums(embed(df$Nums,2)),-2)
You don't need any special function, just use normal vector operations (these solutions are all equivalent):
df$Cum.sums[-(1:3)] <- head(df$Nums, -3) + head(df$Nums[-1], -2)
or
with(df, Cum.sums[-(1:3)] <- head(Nums, -3) + head(Nums[-1], -2))
or
df$Cum.sums[-(1:3)] <- df$Nums[1:(nrow(df)-3)] + df$Nums[2:(nrow(df)-2)]
I believe the first 3 sums SHOULD be NA, not 0, but if you prefer zeroes, you can initialize the sums first:
df$Cum.sums <- 0
Another solution, elegant and general, using matrix multiplication - and so very inefficient for large data. So it's not much practical, though a nice excercise:
len <- nrow(df)
sr <- 2 # number of rows to sum
lag <- 3
mat <- matrix(
head(c(
rep(0, lag * len),
rep(rep(1:0, c(sr, len - sr + 1)), len)
), len * len),
nrow = 10, byrow = TRUE
)
mat %*% df$Nums
Related
I would like to do calculations across columns in my data, by row. The calculations are "moving" in that I would like to know the difference between two numbers in column 1 and 2, then columns 3 and 4, and so on. I have looked at "loops" and "rollapply" functions, but could not figure this out. Below are three options of what was attempted. Only the third option gives me the result I am after, but it is very lengthy code and also does not allow for automation (the input data will be a much larger matrix, so typing out the calculation for each row won't work).
Please advice how to make this code shorter and/or any other packages/functions to check out which will do the job. THANK YOU!
MY TEST SCRIPT IN R + errors/results
Sample data set
a<- c(1,2,3, 4, 5)
b<- c(1,2,3, 4, 5)
c<- c(1,2,3, 4, 5)
test.data <- data.frame(cbind(a,b*2,c*10))
names(test.data) <- c("a", "b", "c")
Sample of calculations attempted:
OPTION 1
require(zoo)
rollapply(test.data, 2, diff, fill = NA, align = "right", by.column=FALSE)
RESULT 1 (not what we're after. What we need is at the bottom of Option 3)
# a b c
#[1,] NA NA NA
#[2,] 1 2 10
#[3,] 1 2 10
#[4,] 1 2 10
#[5,] 1 2 10
OPTION 2:
results <- for (i in 1:length(nrow(test.data))) {
diff(as.numeric(test.data[i,]), lag=1)
print(results)}
RESULT 2: (again not what we're after)
# NULL
OPTION 3: works, but long way, so would like to simplify code and make generic for any length of observations in my dataframe and any number of columns (i.e. more than 3). I would like to "automate" the steps below, if know number of observations (i.e. rows).
row1=diff(as.numeric(test[1,], lag=1))
row2=diff(as.numeric(test[2,], lag=1))
row3=diff(as.numeric(test[3,], lag=1))
row4=diff(as.numeric(test[4,], lag=1))
row5=diff(as.numeric(test[5,], lag=1))
results.OK=cbind.data.frame(row1, row2, row3, row4, row5)
transpose.results.OK=data.frame(t(as.matrix(results.OK)))
names(transpose.results.OK)=c("diff.ab", "diff.bc")
Final.data = transpose.results.OK
print(Final.data)
RESULT 3: (THIS IS WHAT I WOULD LIKE TO GET, "row1" can be "obs1" etc)
# diff.ab diff.bc
#row1 1 8
#row2 2 16
#row3 3 24
#row4 4 32
#row5 5 40
THE END
Here are the 3 options redone plus a 4th option:
# 1
library(zoo)
d <- t(rollapplyr(t(test.data), 2, diff, by.column = FALSE))
# 2
d <- test.data[-1]
for (i in 1:nrow(test.data)) d[i, ] <- diff(unlist(test.data[i, ]))
# 3
d <- t(diff(t(test.data)))
# 4 - also this works
nc <- ncol(test.data)
d <- test.data[-1] - test.data[-nc]
For any of them to set the names:
colnames(d) <- paste0("diff.", head(names(test.data), -1), colnames(d))
(2) and (4) give this data.frame and (1) and (3) give the corresponding matrix:
> d
diff.ab diff.bc
1 1 8
2 2 16
3 3 24
4 4 32
5 5 40
Use as.matrix or as.data.frame if you want the other.
An apply based solution using diff on row-wise can be achieved as:
# Result
res <- t(apply(test.data, 1, diff)) #One can change it to data.frame
# Name of the columns
colnames(res) <- paste0("diff.", head(names(test.data), -1),
tail(names(test.data), -1))
res
# diff.ab diff.bc
# [1,] 1 8
# [2,] 2 16
# [3,] 3 24
# [4,] 4 32
# [5,] 5 40
I am basically new to using R software.
I have a list of repeating codes (numeric/ categorical) from an excel file. I need to add another column values (even at random) to which every same code will get the same value.
Codes Value
1 122
1 122
2 155
2 155
2 155
4 101
4 101
5 251
5 251
Thank you.
We can use match:
n <- length(code0 <- unique(code))
value <- sample(4 * n, n)[match(code, code0)]
or factor:
n <- length(unique(code))
value <- sample(4 * n, n)[factor(code)]
The random integers generated are between 1 and 4 * n. The number 4 is arbitrary; you can also put 100.
Example
set.seed(0); code <- rep(1:5, sample(5))
code
# [1] 1 1 1 1 1 2 2 3 3 3 3 4 4 4 5
n <- length(code0 <- unique(code))
sample(4 * n, n)[match(code, code0)]
# [1] 5 5 5 5 5 18 18 19 19 19 19 12 12 12 11
Comment
The above gives the most general treatment, assuming that code is not readily sorted or taking consecutive values.
If code is sorted (no matter what value it takes), we can also use rle:
if (!is.unsorted(code)) {
n <- length(k <- rle(code)$lengths)
value <- rep.int(sample(4 * n, n), k)
}
If code takes consecutive values 1, 2, ..., n (but not necessarily sorted), we can skip match or factor and do:
n <- max(code)
value <- sample(4 * n, n)[code]
Further notice: If code is not numerical but categorical, match and factor method will still work.
What you could also do is the following, it is perhaps more intuitive to a beginner:
data <- data.frame('a' = c(122,122,155,155,155,101,101,251,251))
duplicates <- unique(data)
duplicates[, 'b'] <- rnorm(nrow(duplicates))
data <- merge(data, duplicates, by='a')
I know there are similar questions but I couldn't find an answer to my question. I'm trying to rank elements in a matrix and then extract data of 5 highest elements.
Here is my attempt.
set.seed(20)
d<-matrix(rnorm(100),nrow=10,ncol=10)
start<-d[1,1]
for (i in 1:10) {
for (j in 1:10) {
if (start < d[i,j])
{high<-d[i,j]
rowind<-i
colind<-j
}
}
}
Although this gives me the data of the highest element, including row and column numbers, I can't think of a way to do the same for elements ranked from 2 to 5. I also tried
rank(d, ties.method="max")
But it wasn't helpful because it just spits out the rank in vector format.
What I ultimately want is a data frame (or any sort of table) that contains
rank, column name, row name, and the data(number) of highest 5 elements in matrix.
Edit
set.seed(20)
d<-matrix(rnorm(100),nrow=10,ncol=10)
d[1,2]<-5
d[2,1]<-5
d[1,3]<-4
d[3,1]<-4
Thanks for the answers. Those perfectly worked for my purpose, but as I'm running this code for correlation chart -where there will be duplicate numbers for every pair- I want to count only one of the two numbers for ranking purpose. Is there any way to do this? Thanks.
Here's a very crude way:
DF = data.frame(row = c(row(d)), col = c(col(d)), v = c(d))
DF[order(DF$v, decreasing=TRUE), ][1:5, ]
row col v
91 1 10 2.208443
82 2 9 1.921899
3 3 1 1.785465
32 2 4 1.590146
33 3 4 1.556143
It would be nice to only have to partially sort, but in ?order, it looks like this option is only available for sort, not for order.
If the matrix has row and col names, it might be convenient to see them instead of numbers. Here's what I might do:
dimnames(d) <- list(letters[1:10], letters[1:10])
DF = data.frame(as.table(d))
DF[order(DF$Freq, decreasing=TRUE), ][1:5, ]
Var1 Var2 Freq
91 a j 2.208443
82 b i 1.921899
3 c a 1.785465
32 b d 1.590146
33 c d 1.556143
The column names don't make much sense here, unfortunately, but you can change them with names(DF) <- as usual.
Here is one option with Matrix
library(Matrix)
m1 <- summary(Matrix(d, sparse=TRUE))
head(m1[order(-m1[,3]),],5)
# i j x
#93 3 10 2.359634
#31 1 4 2.234804
#23 3 3 1.980956
#55 5 6 1.801341
#16 6 2 1.678989
Or use melt
library(reshape2)
m2 <- melt(d)
head(m2[order(-m2[,3]), ], 5)
Here is something quite simple in base R.
# set.seed(20)
# d <- matrix(rnorm(100), nrow = 10, ncol = 10)
d.rank <- matrix(rank(-d), nrow = 10, ncol = 10)
which(d.rank <= 5, arr.ind=TRUE)
row col
[1,] 3 1
[2,] 2 4
[3,] 3 4
[4,] 2 9
[5,] 1 10
d[d.rank <= 5]
[1] 1.785465 1.590146 1.556143 1.921899 2.208443
Results can (easily) be made clearer (see comment from Frank):
cbind(which(d.rank <= 5, arr.ind=TRUE), v = d[d.rank <= 5], rank = rank(-d[d.rank <= 5]))
row col v rank
[1,] 3 1 1.785465 3
[2,] 2 4 1.590146 4
[3,] 3 4 1.556143 5
[4,] 2 9 1.921899 2
[5,] 1 10 2.208443 1
I have one matrix of mutation counts, say "counts". This matrix has column names V1, V2,...,Vi,...Vn where not every "i" is there. Thus it can jump, such as V1, V2, V5 say. Further, most of columns have a 0 in them.
I need to create a sum matrix, called "answer", where element i, j is the sum of the number of the number counts at both i and j. At the i, i element it just shows the number of counts at i.
Here's a quick data set up. I already have the correct dimensioned matrix set up in my code called "answer". Thus what I would need to automate are the last several lines where I fill in the matrix.
counts <- matrix(data = c(0,2,0,5,0,6,0), nrow = 1, ncol = 7, dimnames=list("",c("V1","V2","V3","V4","V5","V6","V7")))
answer <- matrix(data =0, nrow = 3, ncol = 3, dimnames = list(c("V2","V4","V6"),c("V2","V4","V6")))
answer[1,1] <- 2
answer[1,2] <- 7
answer[1,3] <- 8
answer[2,1] <- 7
answer[2,2] <- 5
answer[2,3] <- 11
answer[3,1] <- 8
answer[3,2] <- 11
answer[3,3] <- 6
I understand I can do this with 2 nested for loops, but surely there must be a better way no? Thanks!
This could be done with the right use of expand.grid and rowSums:
n = counts[, counts > 0]
answer = matrix(rowSums(expand.grid(n, n)), nrow=length(n), dimnames=list(names(n), names(n)))
diag(answer) = n
To show how it works, n would end up being:
V2 V4 V5
2 5 6
and expand.grid(n, n) would be:
Var1 Var2
1 2 2
2 5 2
3 6 2
4 2 5
5 5 5
6 6 5
7 2 6
8 5 6
9 6 6
The last line (diag) is necessary because otherwise the diagonal would be twice the original vector (adding 2+2, 5+5, or 6+6).
I am trying to calculated the lagged difference (or actual increase) for data that has been inadvertently aggregated. Each successive year in the data includes values from the previous year. A sample data set can be created with this code:
set.seed(1234)
x <- data.frame(id=1:5, value=sample(20:30, 5, replace=T), year=3)
y <- data.frame(id=1:5, value=sample(10:19, 5, replace=T), year=2)
z <- data.frame(id=1:5, value=sample(0:9, 5, replace=T), year=1)
(df <- rbind(x, y, z))
I can use a combination of lapply() and split() to calculate the difference between each year for every unique id, like so:
(diffs <- lapply(split(df, df$id), function(x){-diff(x$value)}))
However, because of the nature of the diff() function, there are no results for the values in year 1, which means that after I flatten the diffs list of lists with Reduce(), I cannot add the actual yearly increases back into the data frame, like so:
df$actual <- Reduce(c, diffs) # flatten the list of lists
In this example, there are only 10 calculated differences or lags, while there are 15 rows in the data frame, so R throws an error when trying to add a new column.
How can I create a new column of actual increases with (1) the values for year 1 and (2) the calculated diffs/lags for all subsequent years?
This is the output I'm eventually looking for. My diffs list of lists calculates the actual values for years 2 and 3 just fine.
id value year actual
1 21 3 5
2 26 3 16
3 26 3 14
4 26 3 10
5 29 3 14
1 16 2 10
2 10 2 5
3 12 2 10
4 16 2 7
5 15 2 13
1 6 1 6
2 5 1 5
3 2 1 2
4 9 1 9
5 2 1 2
I think this will work for you. When you run into the diff problem just lengthen the vector by putting 0 in as the first number.
df <- df[order(df$id, df$year), ]
sdf <-split(df, df$id)
df$actual <- as.vector(sapply(seq_along(sdf), function(x) diff(c(0, sdf[[x]][,2]))))
df[order(as.numeric(rownames(df))),]
There's lots of ways to do this but this one is fairly fast and uses base.
Here's a second & third way of approaching this problem utilizing aggregate and by:
aggregate:
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x))
df$actual <- c(unlist(t(aggregate(value~id, df, diff2)[, -1])))
df[order(as.numeric(rownames(df))),]
by:
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x))
df$actual <- unlist(by(df$value, df$id, diff2))
df[order(as.numeric(rownames(df))),]
plyr
df <- df[order(df$id, df$year), ]
df <- data.frame(temp=1:nrow(df), df)
library(plyr)
df <- ddply(df, .(id), transform, actual=diff2(value))
df[order(-df$year, df$temp),][, -1]
It gives you the final product of:
> df[order(as.numeric(rownames(df))),]
id value year actual
1 1 21 3 5
2 2 26 3 16
3 3 26 3 14
4 4 26 3 10
5 5 29 3 14
6 1 16 2 10
7 2 10 2 5
8 3 12 2 10
9 4 16 2 7
10 5 15 2 13
11 1 6 1 6
12 2 5 1 5
13 3 2 1 2
14 4 9 1 9
15 5 2 1 2
EDIT: Avoiding the Loop
May I suggest avoiding the loop and turning what I gave to you into a function (the by solution is the easiest one for me to work with) and sapply that to the two columns you desire.
set.seed(1234) #make new data with another numeric column
x <- data.frame(id=1:5, value=sample(20:30, 5, replace=T), year=3)
y <- data.frame(id=1:5, value=sample(10:19, 5, replace=T), year=2)
z <- data.frame(id=1:5, value=sample(0:9, 5, replace=T), year=1)
df <- rbind(x, y, z)
df <- df.rep <- data.frame(df[, 1:2], new.var=df[, 2]+sample(1:5, nrow(df),
replace=T), year=df[, 3])
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x)) #function one
group.diff<- function(x) unlist(by(x, df$id, diff2)) #answer turned function
df <- data.frame(df, sapply(df[, 2:3], group.diff)) #apply group.diff to col 2:3
df[order(as.numeric(rownames(df))),] #reorder it
Of course you'd have to rename these unless you used transform as in:
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x)) #function one
group.diff<- function(x) unlist(by(x, df$id, diff2)) #answer turned function
df <- transform(df, actual=group.diff(value), actual.new=group.diff(new.var))
df[order(as.numeric(rownames(df))),]
This would depend on how many variables you were doing this to.
1) diff.zoo. With the zoo package its just a matter of converting it to zoo using split= and then performing the diff :
library(zoo)
zz <- zz0 <- read.zoo(df, split = "id", index = "year", FUN = identity)
zz[2:3, ] <- diff(zz)
It gives the following (in wide form rather than the long form you mentioned) where each column is an id and each row is a year minus the prior year:
> zz
1 2 3 4 5
1 6 5 2 9 2
2 10 5 10 7 13
3 5 16 14 10 14
The wide form shown may actually be preferable but you can convert it to long form if you want that like this:
dt <- function(x) as.data.frame.table(t(x))
setNames(cbind(dt(zz), dt(zz0)[3]), c("id", "year", "value", "actual"))
This puts the years in ascending order which is the convention normally used in R.
2) rollapply. Also using zoo this alternative uses a rolling calculation to add the actual column to your data. It assumes the data is structured as you show with the same number of years in each group arranged in order:
df$actual <- rollapply(df$value, 6, partial = TRUE, align = "left",
FUN = function(x) if (length(x) < 6) x[1] else x[1]-x[6])
3) subtraction. Making the same assumptions as in the prior solution we can further simplify it to just this which subtracts from each value the value 5 positions hence:
transform(df, actual = value - c(tail(value, -5), rep(0, 5)))
or this variation:
transform(df, actual = replace(value, year > 1, -diff(ts(value), 5)))
EDIT: added rollapply and subtraction solutions.
Kind of hackish but keeping in place your wonderful Reduce you could add mock rows to your df for year 0:
mockRows <- data.frame(id = 1:5, value = 0, year = 0)
(df <- rbind(df, mockRows))
(df <- df[order(df$id, df$year), ])
(diffs <- lapply(split(df, df$id), function(x){diff(x$value)}))
(df <- df[df$year != 0,])
(df$actual <- Reduce(c, diffs)) # flatten the list of lists
df[order(as.numeric(rownames(df))),]
This is the output:
id value year actual
1 1 21 3 5
2 2 26 3 16
3 3 26 3 14
4 4 26 3 10
5 5 29 3 14
6 1 16 2 10
7 2 10 2 5
8 3 12 2 10
9 4 16 2 7
10 5 15 2 13
11 1 6 1 6
12 2 5 1 5
13 3 2 1 2
14 4 9 1 9
15 5 2 1 2