sum nth row every loop - r

My objective is to sum the rows in consecutive blocks of n rows. Maybe a loop would help.
I used this code:
irr = rollapply( irr , width = 1 , by = n , align = "left" , FUN = sum )
Example:
V1
3
2
4
7
5
So if n = 2, the first 2 rows are summed.
Results:
V1
5
4
7
5
So the problem is, I have multiple values of n in another data.frame variable, e.g.
2 5 3, and I want n to change, let's say to 3, once it has finished summing the first two rows:
next n = 3
Results:
5 16
This is my first time using R, so please pardon any mistakes I made, and let me know if the question is hard to understand. Thanks.

You can split the column into groups according to n and then sum each group.
As an example,
v1 <- data.frame(X = c(3,2,4,7,5, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4))
n <- data.frame(Y = c(2, 3, 2, 4, 1,4))
unlist(lapply(split(v1$X, rep(1:nrow(n), n$Y)), sum))
# 1 2 3 4 5 6
# 5 16 5 22 8 10
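As a hedged sketch on the question's own five-row example, the same grouping idea can also be written with tapply. The block sizes c(2, 3) are an assumption read off the expected result "5 16", and the variable names are illustrative only.

```r
# Values from the question's example column V1
v <- c(3, 2, 4, 7, 5)
# Block sizes (assumed): the first 2 rows, then the next 3 rows
n <- c(2, 3)

# rep() labels each row with the block it belongs to: 1 1 2 2 2
block <- rep(seq_along(n), times = n)

# Sum the values within each block
sums <- as.vector(tapply(v, block, sum))
print(sums)
```

The only requirement is that sum(n) equals the number of rows; otherwise the grouping vector and the data have different lengths.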

Related

Aggregate rows in data.frame containing same values over different columns [duplicate]

This question already has answers here:
Aggregating regardless of the order of columns
The following works as expected:
m <- matrix (c(1, 2, 3,
1, 2, 4,
2, 1, 4,
2, 1, 4,
2, 3, 4,
2, 3, 6,
3, 2, 3,
3, 2, 2), byrow=TRUE, ncol=3)
df <- data.frame(m)
aggdf <- aggregate(df$X3, list(df$X1, df$X2), FUN=sum)
colnames(aggdf) <- c("A", "B", "value")
and results in:
A B value
1 2 1 8
2 1 2 7
3 3 2 5
4 2 3 10
But I would like to treat rows 1/2 and 3/4 as equal, not caring whether observation A is 1 and B is 2 or vice versa.
I also do not care about how the aggregation is sorting A/B in the final data.frame, so both of the following results would be fine:
A B value
1 2 1 15
2 3 2 15
A B value
1 1 2 15
2 2 3 15
How can that be achieved?
You need to get them in a consistent order. For just 2 columns, pmin and pmax work nicely:
df$A = with(df, pmin(X1, X2))
df$B = with(df, pmax(X1, X2))
aggregate(df$X3, df[c("A", "B")], FUN = sum)
# A B x
# 1 1 2 15
# 2 2 3 15
For more columns, use sort, as akrun recommends:
df[1:2] <- t(apply(df[1:2], 1, sort))
Changing 1:2 to the indices of all the key columns generalizes this to any number of keys.
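For reference, here is a minimal end-to-end sketch of the sort-based variant on the question's data. The name key_cols is introduced here for illustration, and the formula interface of aggregate is used instead of the call shown above.

```r
# Data from the question
m <- matrix(c(1, 2, 3,
              1, 2, 4,
              2, 1, 4,
              2, 1, 4,
              2, 3, 4,
              2, 3, 6,
              3, 2, 3,
              3, 2, 2), byrow = TRUE, ncol = 3)
df <- data.frame(m)

# Assumed: columns X1 and X2 are the keys whose order should not matter
key_cols <- 1:2

# Sort the key values within each row so (2, 1) and (1, 2) become identical
df[key_cols] <- t(apply(df[key_cols], 1, sort))

# Aggregate on the now order-independent keys
agg <- aggregate(X3 ~ X1 + X2, data = df, FUN = sum)
print(agg)
```

With more key columns, only key_cols and the right-hand side of the formula would need to change.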

calculating simple retention in R

For the dataset test, my objective is to find out how many unique users carried over from one period to the next on a period-by-period basis.
> test
user_id period
1 1 1
2 5 1
3 1 1
4 3 1
5 4 1
6 2 2
7 3 2
8 2 2
9 3 2
10 1 2
11 5 3
12 5 3
13 2 3
14 1 3
15 4 3
16 5 4
17 5 4
18 5 4
19 4 4
20 3 4
For example, in the first period there were four unique users (1, 3, 4, and 5), two of which were active in the second period. Therefore the retention rate would be 0.5. In the second period there were three unique users, two of which were active in the third period, and so the retention rate would be 0.666, and so on. How would one find the percentage of unique users that are active in the following period? Any suggestions would be appreciated.
The output would be the following:
> output
period retention
1 1 NA
2 2 0.500
3 3 0.666
4 4 0.500
The test data:
> dput(test)
structure(list(user_id = c(1, 5, 1, 3, 4, 2, 3, 2, 3, 1, 5, 5,
2, 1, 4, 5, 5, 5, 4, 3), period = c(1, 1, 1, 1, 1, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4)), .Names = c("user_id", "period"
), row.names = c(NA, -20L), class = "data.frame")
How about this? First split the users by period, then write a function that calculates the proportion of users carried over between any two periods, then apply it to each pair of consecutive periods in the split list with mapply.
splt <- split(test$user_id, test$period)
carryover <- function(x, y) {
length(unique(intersect(x, y))) / length(unique(x))
}
mapply(carryover, splt[1:(length(splt) - 1)], splt[2:length(splt)])
1 2 3
0.5000000 0.6666667 0.5000000
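To get from this named vector to the layout requested in the question, one possible sketch is below. It rebuilds the test data from the dput output, uses head and tail to pair consecutive periods, and pads the first period with NA; the name rates is illustrative.

```r
# Test data from the question
test <- data.frame(
  user_id = c(1, 5, 1, 3, 4, 2, 3, 2, 3, 1, 5, 5, 2, 1, 4, 5, 5, 5, 4, 3),
  period  = rep(1:4, each = 5)
)

splt <- split(test$user_id, test$period)

# Share of the unique users in x that also appear in y
carryover <- function(x, y) length(intersect(x, y)) / length(unique(x))

# Pair each period with the one that follows it
rates <- mapply(carryover, head(splt, -1), tail(splt, -1))

# Report each rate against the later period; the first period has no predecessor
output <- data.frame(period = as.numeric(names(splt)),
                     retention = c(NA, unname(rates)))
print(output)
```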
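Here is an attempt using dplyr, though it also uses base R subsetting inside summarise:
library(dplyr)
test %>%
  group_by(period) %>%
  # share of this period's unique users who also appear in the next period
  summarise(retention = length(intersect(user_id, test$user_id[test$period == (period[1] + 1)])) / n_distinct(user_id)) %>%
  # shift down so each rate is reported against the later period
  mutate(retention = lag(retention))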
This returns:
period retention
<dbl> <dbl>
1 1 NA
2 2 0.5000000
3 3 0.6666667
4 4 0.5000000
This isn't so elegant but it seems to work. Assuming df is the data frame:
# make a list to hold unique IDs by period
uniques = list()
for(i in 1:max(df$period)){
uniques[[i]] = unique(df$user_id[df$period == i])
}
# hold the retention rates
retentions = rep(NA, times = max(df$period))
for(j in 2:max(df$period)){
retentions[j] = mean(uniques[[j-1]] %in% uniques[[j]])
}
Basically, %in% creates a logical vector indicating whether each element of the first argument is in the second. Taking the mean gives us the proportion.
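A small illustration of that last step, using the unique users of periods 1 and 2 from the question (the names prev, curr and kept are illustrative):

```r
prev <- unique(c(1, 5, 1, 3, 4))  # period 1 users, in order of appearance: 1 5 3 4
curr <- unique(c(2, 3, 2, 3, 1))  # period 2 users, in order of appearance: 2 3 1

# One logical per period 1 user: is that user also present in period 2?
kept <- prev %in% curr
print(kept)

# TRUE counts as 1 and FALSE as 0, so the mean is the retained share
rate <- mean(kept)
print(rate)
```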

collapse/aggregate some parts of an adjacency matrix simultaneously on rows and columns

I have a matrix, which represents mobility between various jobs:
jobnames <- c("job 1","job 2","job 3","job 4","job 5","job 6","job 7")
jobdat <- matrix(c(
5, 5, 5, 0, 0, 5, 5,
5, 5, 2, 5, 5, 1, 5,
1, 5, 5, 5, 0, 0, 1,
1, 0, 5, 5, 8, 0, 1,
0, 5, 0, 0, 5, 5, 1,
0, 0, 5, 5, 0, 5, 5,
0, 1, 0, 0, 5, 1, 5
),
nrow = 7, ncol = 7, byrow = TRUE,
dimnames = list(jobnames,jobnames
))
This is treated as a directed, weighted adjacency matrix in a social network analysis. The direction of the network is from rows to columns: So mobility is defined as going from a job-row to a job-column. The diagonal is relevant, since it is possible to change to the same job in another firm.
I need to collapse this matrix according to a predefined list containing the indices of the jobs that should be combined:
group.list <- list(grp1=c(1,2) ,grp2 =c(3,4))
Now, since it is an adjacency matrix, it's a bit different from the other answers about how to collapse a matrix that I've found here and elsewhere. The collapse has to be simultaneous on both the rows and the columns. And some jobs aren't grouped at all. So the result in this example should be like this:
group.jobnames <- c("job 1 and 2","job 3 and 4","job 5","job 6","job 7")
group.jobdat <- matrix(c(
20,12,5,6,10,
7,17,8,0,2,
5,0,5,5,1,
0,10,0,5,5,
1,0,5,1,5
),
nrow = 5, ncol = 5, byrow = TRUE,
dimnames = list(group.jobnames,group.jobnames
))
This example groups the first two jobs and then the next two, but in my actual data it could be any combination of (indices of) jobs, and any number of jobs in each group. So jobs [1,7] could be one group, and jobs [2,3,6] could be another group, while jobs 4 and 5 are not grouped. Or any other combination.
Thank you for your time,
I believe there are some typos in the intended output, and the group.list definition. If I am correct in my interpretation, here is a solution.
Here is a new group.list to conform with the names of the desired output. In this version, job 2 is mapped to job 1 and job 4 is mapped to job 3, which matches the names in group.jobnames.
group.list <- list(grp1=c(1, 3), grp2=c(2, 4))
Given this list, construct a grouping vector
# initial grouping
groups <- seq_len(ncol(jobdat))
# map elements of second list item to values of first list item
groups[match(group.list[["grp2"]], groups)] <- group.list[["grp1"]]
groups
[1] 1 1 3 3 5 6 7
So now jobs 1 and 2 share a group, as do jobs 3 and 4. Next, we use rowsum and a couple of transposes to calculate the output.
myMat <- t(rowsum(t(rowsum(jobdat, groups)), groups))
# add the group names
dimnames(myMat) <- list(group.jobnames,group.jobnames)
myMat
job 1 and 2 job 3 and 4 job 5 job 6 job 7
job 1 and 2 20 12 5 6 10
job 3 and 4 7 20 8 0 2
job 5 5 0 5 5 1
job 6 0 10 0 5 5
job 7 1 0 5 1 5
In response to the OP's comments below, the grouping was intended to be within list elements, rather than between corresponding positions of the list elements as I had originally interpreted. To build this form of grouping, repeatedly feed replace to Reduce.
With group.list as in the question,
group.list <- list(grp1=c(1, 2), grp2=c(3, 4))
# start again from one group per job
groups <- seq_len(ncol(jobdat))
groups <- Reduce(function(x, y) replace(x, x %in% y, min(y)),
c(list(groups), unname(group.list)))
groups
[1] 1 1 3 3 5 6 7
Here, replace takes the current grouping, finds the elements that are in one of the vectors in group.list, and replaces them with the minimum value of that vector. Reduce applies this operation once per vector in group.list, passing the grouping updated in one iteration on to the next.
With this result, we use the above transposes and rowsum to get
myMat
job 1 and 2 job 3 and 4 job 5 job 6 job 7
job 1 and 2 20 12 5 6 10
job 3 and 4 7 20 8 0 2
job 5 5 0 5 5 1
job 6 0 10 0 5 5
job 7 1 0 5 1 5
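The steps above could be wrapped into one helper, sketched here under the assumption that every job index appears in at most one group. The function name collapse_adjacency is hypothetical, and a plain loop stands in for the Reduce call.

```r
jobnames <- c("job 1", "job 2", "job 3", "job 4", "job 5", "job 6", "job 7")
jobdat <- matrix(c(
  5, 5, 5, 0, 0, 5, 5,
  5, 5, 2, 5, 5, 1, 5,
  1, 5, 5, 5, 0, 0, 1,
  1, 0, 5, 5, 8, 0, 1,
  0, 5, 0, 0, 5, 5, 1,
  0, 0, 5, 5, 0, 5, 5,
  0, 1, 0, 0, 5, 1, 5
), nrow = 7, ncol = 7, byrow = TRUE, dimnames = list(jobnames, jobnames))

# Hypothetical helper: collapse rows and columns of a square matrix by group
collapse_adjacency <- function(mat, group.list) {
  groups <- seq_len(ncol(mat))          # start with one group per job
  for (idx in group.list) {
    groups[idx] <- min(idx)             # label each group by its smallest index
  }
  # Sum rows by group, then columns by group (via transpose)
  t(rowsum(t(rowsum(mat, groups)), groups))
}

res_q   <- collapse_adjacency(jobdat, list(grp1 = c(1, 2), grp2 = c(3, 4)))
res_alt <- collapse_adjacency(jobdat, list(c(1, 7), c(2, 3, 6)))
print(res_q)
print(dim(res_alt))
```

The row and column names of the result are the group labels (the smallest job index in each group); descriptive names such as group.jobnames can be assigned with dimnames afterwards.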

select dataframes from a list based on maximum column value

I have a list x2 containing two data frames, x and x1. Both have 4 columns: n, m, l and k. I want to select the data frame that has the largest last value in column k.
In the example below, I would like the 2nd data frame to be selected, because the last value in its column k is greater than the last value in column k of data frame 1.
x <- data.frame(n = c(2, 13, 5),m = c(2, 23, 6),l = c(2, 33, 7),k = c(2, 43, 8))
x1 <- data.frame(n = c(2, 3, 15),m = c(2, 3, 16),l = c(2, 3, 17),k = c(2, 3, 18))
x2<-list(x,x1)
Using sapply, loop through the list x2 and get the last value of the k column of each data frame. Using which.max, find the index of the largest of those values and extract that data frame from x2.
Note: This code does not account for ties in the last value of the k column.
x2[which.max(sapply(x2, function(x) tail(x$k, 1)))]
# [[1]]
# n m l k
# 1 2 2 2 2
# 2 3 3 3 3
# 3 15 16 17 18
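If ties matter, one possible variant keeps every data frame whose last k equals the maximum. This is only a sketch: x3 is a hypothetical third data frame added here to create a tie, and frames, last_k and winners are illustrative names.

```r
x  <- data.frame(n = c(2, 13, 5), m = c(2, 23, 6), l = c(2, 33, 7), k = c(2, 43, 8))
x1 <- data.frame(n = c(2, 3, 15), m = c(2, 3, 16), l = c(2, 3, 17), k = c(2, 3, 18))
x3 <- x1  # hypothetical: a copy of x1, so its last k ties with x1

frames <- list(x, x1, x3)

# Last value of column k in each data frame
last_k <- vapply(frames, function(d) tail(d$k, 1), numeric(1))

# Keep all data frames that reach the maximum, not just the first one
winners <- frames[last_k == max(last_k)]
print(length(winners))
```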
Alternatively, use an if statement:
if(x$k[length(x$k)] >= x1$k[length(x1$k)]) x else x1
where x$k[length(x$k)] gets the last element of column k of data frame x. This returns:
n m l k
1 2 2 2 2
2 3 3 3 3
3 15 16 17 18

R Multiplying a list of lists with a vector

I have a data frame with 1 column consisting of 10 lists, each with a varying number of elements. I also have a vector with 10 different values in it (10 integers).
I want to take the "sumproduct" of each of the 10 lists with its corresponding vector value, and end up with 10 values.
Value 1 = sumproduct(First list, First vector value)
Value 2 = sumproduct(Second list, Second vector value)
etc...
Final_Answer <- c(Value 1, Value 2, ... , Value 10)
I have a function that generates the data frame containing lists of numbers representing years. The data frame is constructed using a loop that generates each value and then row-binds it to the data frame.
# Count, Start_Date and Time are assumed to exist in the calling environment
Time_Function <- function(Maturity) {
  for (i in 0:Count) {
    # years between Start_Date and the date i * 365 days before Maturity
    x <- as.numeric((as.Date(Maturity) - i * 365 - Start_Date) / 365)
    Time <- rbind(Time, data.frame(x))
  }
  return(Time)
}
The result is this:
http://pastebin.com/J6phR2hv
http://i.imgur.com/Sf4mpA5.png
If my vector looks like [1,2,3,4...,10], I want the output to be:
Final Answer = [(1*1.1342466 + 1*0.6342466 + 1* 0.1342466), (2*1.3835616 + 2*0.8835616 + 2*0.3835616), ... , ( ... +10*0.0630137)]
Assuming you want to multiply each value in the list by the respective scalar and then add it all up, here is one way to do it.
list1 <- mapply(rep, 1:10, 10:1)
vec1 <- 1:10
df <- data.frame( I(list1), vec1)
df
list1 vec1
1 1, 1, 1,.... 1
2 2, 2, 2,.... 2
3 3, 3, 3,.... 3
4 4, 4, 4,.... 4
5 5, 5, 5,.... 5
6 6, 6, 6,.... 6
7 7, 7, 7, 7 7
8 8, 8, 8 8
9 9, 9 9
10 10 10
mapply(df$list1, df$vec1, FUN = function(x, y) {y* sum(x)})
[1] 10 36 72 112 150 180 196 192 162 100
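As a rough check against the numbers in the question, the same pattern can be run on the first two lists, whose values appear in the desired output. The names years and weights are illustrative, and the remaining eight lists are left out because their values are not given in the text.

```r
# First two lists of year fractions, copied from the question's expected output
years <- list(c(1.1342466, 0.6342466, 0.1342466),
              c(1.3835616, 0.8835616, 0.3835616))

# Corresponding vector values (the question's vector starts 1, 2, ...)
weights <- c(1, 2)

# Multiply each list by its own scalar and add up the products
final_answer <- mapply(function(x, w) sum(x * w), years, weights)
print(final_answer)
```

Since each list is multiplied by a single scalar, sum(x * w) and w * sum(x) give the same result.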

Resources