Cumulative sum conditional over multiple columns in r dataframe containing the same values - r

Say my data.frame is as outlined below:
df<-as.data.frame(cbind("Home"=c("a","c","e","b","e","b"),
"Away"=c("b","d","f","c","a","f")))
df$Index<-rep(1,nrow(df))
Home Away Index
1 a b 1
2 c d 1
3 e f 1
4 b c 1
5 e a 1
6 b f 1
What I want to do is calculate a cumulative sum using the Index column for each character a - f, regardless of whether it is in the Home or Away column. Thus a column called Cumulative_Sum_Home, say, takes the character in the Home column of each row, "b" in the case of row 6, and counts how many times "b" has appeared in either the Home or Away columns in all rows up to and including row 6. In this case b has appeared 3 times cumulatively in the first 6 rows, so Cumulative_Sum_Home gives the value 3. The same logic applies to the Cumulative_Sum_Away column. Taking row 5, character "a" appears in the Away column, and has cumulatively appeared 2 times in either the Home or Away columns up to that row, so Cumulative_Sum_Away takes the value 2.
Home Away Index Cumulative_Sum_Home Cumulative_Sum_Away
1 a b 1 1 1
2 c d 1 1 1
3 e f 1 1 1
4 b c 1 2 2
5 e a 1 2 2
6 b f 1 3 2
I have to confess to being totally stumped as to how to solve this problem. I've tried looking at the data.table approaches, but I've never used that package before so I can't immediately see how to solve it. Any tips would be gratefully received.

There is scope to make this leaner but if that doesn't matter much for you then this should be okay.
NewColumns = list()
# as.character() works whether Home/Away are factors or character vectors
for ( i in sort(unique(c(as.character(df$Home), as.character(df$Away))))) {
NewColumnAddition = i == df$Home | i ==df$Away
NewColumnAddition[NewColumnAddition] = cumsum(NewColumnAddition[NewColumnAddition])
NewColumns[[i]] = NewColumnAddition
}
df$Cumulative_Sum_Home = sapply(
seq(nrow(df)),
function(i) {
NewColumns[[as.character(df[i,"Home"])]][i]
}
)
df$Cumulative_Sum_Away = sapply(
seq(nrow(df)),
function(i) {
NewColumns[[as.character(df[i,"Away"])]][i]
}
)
> df
Home Away Index Cumulative_Sum_Home Cumulative_Sum_Away
1 a b 1 1 1
2 c d 1 1 1
3 e f 1 1 1
4 b c 1 2 2
5 e a 1 2 2
6 b f 1 3 2
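Here's a data.table alternative:
library(data.table)
setDT(df)
# as.character() works whether Home/Away are factors or character vectors
for ( i in sort(unique(c(as.character(df$Home), as.character(df$Away))))) {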
df[, TotalSum := cumsum(i == Home | i == Away)]
df[Home == i, Cumulative_Sum_Home := TotalSum]
df[Away == i, Cumulative_Sum_Away := TotalSum]
}
df[,TotalSum := NULL]
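A loop-free variant of the same idea can be sketched in base R. This is only a sketch under one assumption: a character never appears in both Home and Away of the same row. The helper names teams, idx, running and counts are made up for illustration. The Home and Away values are interleaved row by row, a running sum of Index is taken per character, and the result is split back into two columns:

```r
df <- data.frame(Home = c("a", "c", "e", "b", "e", "b"),
                 Away = c("b", "d", "f", "c", "a", "f"),
                 stringsAsFactors = FALSE)
df$Index <- rep(1, nrow(df))

# Interleave by row: Home[1], Away[1], Home[2], Away[2], ...
teams <- as.vector(t(as.matrix(df[c("Home", "Away")])))
# Each row contributes its Index once for Home and once for Away
idx <- rep(df$Index, each = 2)

# Running sum of Index within each character, in row order
running <- ave(idx, teams, FUN = cumsum)

# Split back: first column belongs to Home, second to Away
counts <- matrix(running, ncol = 2, byrow = TRUE)
df$Cumulative_Sum_Home <- counts[, 1]
df$Cumulative_Sum_Away <- counts[, 2]
df
```

On the example data this should reproduce the two columns shown in the question (3 and 2 in row 6).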

Related

Filtering observations using multivariate column conditions

I'm not a very experienced R user, so I'm seeking advice on how to optimize what I've built and which direction to move in.
I have one reference data frame, it contains four columns with integer values and one ID.
df <- matrix(ncol=5,nrow = 10)
colnames(df) <- c("A","B","C","D","ID")
# df
for (i in 1:10){
df[i,1:4] <- sample(1:5,4, replace = TRUE)
}
df <- data.frame(df)
df$ID <- make.unique(rep(LETTERS,length.out=10),sep='')
df
A B C D ID
1 2 4 3 5 A
2 5 1 3 5 B
3 3 3 5 3 C
4 4 3 1 5 D
5 2 1 2 5 E
6 5 4 4 5 F
7 4 4 3 3 G
8 2 1 5 5 H
9 4 4 1 3 I
10 4 2 2 2 J
The second data frame holds manual user input. I want to turn this into a Shiny app later on, which is also why I'm asking for optimization: my code doesn't seem very neat to me.
df.man <- data.frame(matrix(ncol=5,nrow=1))
colnames(df.man) <- c("A","B","C","D","ID")
df.man$ID <- c("man")
df.man$A <- 4
df.man$B <- 4
df.man$C <- 3
df.man$D <- 4
df.man
A B C D ID
4 4 3 4 man
I want to filter rows from the reference sequentially, following these rules:
If there is an exact match across a whole row between the reference table and the manual input, then extract that row (or those rows) from the reference and show it. If not, reduce the number of matching columns from right to left until there is a match, but never match on fewer than two variables (columns A, B).
So with my limited knowledge I've written this:
library(dplyr)
# subtract the manual input from the reference
df <- df %>% dplyr::mutate(Adiff=A-df.man$A)%>%
dplyr::mutate(Bdiff=B-df.man$B)%>%
dplyr::mutate(Cdiff=C-df.man$C) %>%
dplyr::mutate(Ddiff=D-df.man$D)
# check manually how much in a row has zero difference and filter those
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0 & Ddiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0 & Ddiff==0),
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0),
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0),
"less then two exact match")
))
tbl_df(df0[,1:5])
# A tibble: 1 x 5
A B C D ID
<int> <int> <int> <int> <chr>
1 4 4 3 3 G
It works and finds ID G, but it looks ugly to me. So the first question is: what would be the recommended way to improve this? Are there any functions, packages or something else I'm missing?
Second question: I want to make the condition more complicated.
Imagine we have this reference data set.
A B C D ID
2 4 3 5 A
5 1 3 5 B
3 3 5 3 C
4 3 1 5 D
2 1 2 5 E
5 4 4 5 F
4 4 3 3 G
2 1 5 5 H
4 4 1 3 I
4 2 2 2 J
Manual input is
A B C D ID
4 4 2 2 man
The filtering rules should be the following:
If there is an exact match across a whole row between the reference table and the manual input, then extract that row (or those rows) from the reference and show it. If not, reduce the number of matching columns from right to left until there is a match, but never match on fewer than two variables (columns A, B).
From those rows where only two variables match, filter those which have a ± 1 difference in the columns to the right. So I should end up with cases G and I from the reference table in the example above.
Continuing the way I did above, I would do the following:
ifelse(nrow(df0%>%filter(Cdiff %in% (-1:1) & Ddiff %in% (-1:1)))>0,
df01 <- df0%>%filter(Cdiff %in% (-1:1) & Ddiff %in% (-1:1)),
ifelse(nrow(df0%>%filter(Cdiff %in% (-1:1)))>0,
df01<- df0%>%filter(Cdiff %in% (-1:1)),
"NA"))
It will be about 11 columns at the end, but I assume it doesn't matter so much.
Keeping in mind this objective - how would you suggest to proceed?
Thanks!
This is a lot to sort through, but I have some ideas that might be helpful.
First, you could keep your df a matrix, and use row names for your letters. Something like:
set.seed(2)
df
A B C D
A 5 1 5 1
B 4 5 1 2
C 3 1 3 2
D 3 1 1 4
E 3 1 5 3
F 1 5 5 2
G 2 3 4 3
H 1 1 5 1
I 2 4 5 5
J 4 2 5 5
And for demonstration, you could use a vector for manual as this is input:
# Complete match example
vec.man <- c(3, 1, 5, 3)
To check for complete matches between the manual input and reference (all 4 columns), with all numbers, you can do:
df[apply(df, 1, function(x) all(x == vec.man)), ]
A B C D
3 1 5 3
If you don't have a complete match, you could calculate the differences between df and vec.man:
# Change example vec.man
vec.man <- c(3, 1, 5, 2)
df.diff <- sweep(df, 2, vec.man)
A B C D
A 2 0 0 -1
B 1 4 -4 0
C 0 0 -2 0
D 0 0 -4 2
E 0 0 0 1
F -2 4 0 0
G -1 2 -1 1
H -2 0 0 -1
I -1 3 0 3
J 1 1 0 3
Rows whose diffs start with a run of zeros are your best matches (the same as dropping columns from right to left iteratively). The position of the first non-zero element in each row then tells you how good the match is; the later it occurs, the better:
df.best <- apply(df.diff, 1, function(x) which(x!=0)[1])
A B C D E F G H I J
1 1 3 3 4 1 1 1 1 1
You can see that the best match is E which was non-zero in the 4th column (last column did not match). You can extract rows that have 4 in df.best as your best matches:
df.match <- df[which(df.best == max(df.best, na.rm = T)), ]
A B C D
3 1 5 3
Finally, if only the first two columns match and you want the rows whose third column is off by +/- 1, first check that the best match position is 3 (i.e. two leading matches). Then compare the absolute differences with the vector c(0,0,1), which means two matches followed by a 3rd column off by exactly 1:
# Example vec.man with only 2 matches
vec.man <- c(3, 1, 6, 9)
> df.match
A B C D
C 3 1 3 2
D 3 1 1 4
E 3 1 5 3
if (max(df.best, na.rm = T) == 3) {
vec.alt = c(0, 0, 1)
df[apply(df.diff[,1:3], 1, function(x) all(abs(x) == vec.alt)), ]
}
A B C D
3 1 5 3
This should be scalable for 11 columns and 4 matches.
To generalize for different numbers of columns, @IlyaT suggested:
n.cols <- max(df.best, na.rm=TRUE)
vec.alt <- c(rep(0, each=n.cols-1), 1)
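The "longest run of leading matches, but at least two" rule can also be written compactly. The following is a minimal sketch, not the answerer's code: ref, manual, is_match, prefix_len and result are illustrative names, and ref holds only four of the question's reference rows. It relies on cumprod() turning to 0 at the first mismatch, so the sum of the cumulative product is the number of leading columns that match:

```r
# A few rows from the question's reference table, as a matrix with row names
ref <- matrix(c(2, 4, 3, 5,
                4, 4, 3, 3,
                4, 4, 1, 3,
                4, 2, 2, 2),
              ncol = 4, byrow = TRUE,
              dimnames = list(c("A", "G", "I", "J"), c("A", "B", "C", "D")))
manual <- c(4, 4, 3, 4)

# TRUE where a cell equals the manual value for its column
is_match <- sweep(ref, 2, manual, FUN = "==")

# Number of leading columns that match, per row
prefix_len <- apply(is_match, 1, function(m) sum(cumprod(m)))

# Keep the rows with the longest prefix, but only if at least A and B match
best <- max(prefix_len)
result <- if (best >= 2) ref[prefix_len == best, , drop = FALSE] else NULL
result
```

With these inputs, row G matches on the first three columns and is the only row returned, which agrees with the question's own result.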

How can I replace values in an R matrix matching on rowname, columnname and value

I have a really big matrix and I'd like to replace the values in it using a lookup table.
I have a table of values (which looks a bit like this):
Origin Destination Distance Final
1 1 1 A
1 1 2 B
1 1 3 E
1 2 2 F
1 3 1 B
1 3 2 C
2 1 1 B
2 2 1 A
2 3 3 C
3 1 1 A
3 1 2 D
3 2 1 B
3 3 2 A
...
and I have a matrix, which looks something like this:
x 1 1 3 1 2 1 ...
1 1 3 2 1 2 1
1 2 2 1 2 2 1
3 2 1 2 1 1 2
1 3 1 2 1 2 1
2 1 1 3 1 1 1
1 2 2 1 3 1 1
...
I'm trying to match my matrix rownames with the Origin column, the matrix Colnames with the Destination Column and the matrix values with the Distance Column and then replace that value with the Final Column.
The Matrix is 4000 by 4000.
The Table is 27 by 4
So when I'm done it should look like:
x 1 1 3 1 2 1 ...
1 A E C A F A
1 B B B B F A
3 D A A A B A
1 E A C A F A
2 B B C B A B
1 B B B E A A
...
I'm currently using a little loop, which looks like this;
for (i in 1:nrow(CategoryTable)){
Origin <- CategoryTable[i,"O"]
Dest <- CategoryTable[i,"D"]
Distance <- CategoryTable[i,"Dist"]
Final <- CategoryTable[i,"Final"]
CategoryGrid[CategoryGrid == Distance][CategoryGrid[row.names(CategoryGrid) %in% Origin,colnames(CategoryGrid) %in% Dest]] <-CategoryTable[i,"Final"]
}
Based on this question (Replace all values in a matrix <0.1 with 0) I can replace all the things matching a specific value or the things matching a column or row. But I can't match all at once.
The active ingredient of the current attempt is:
CategoryGrid[CategoryGrid == Distance][CategoryGrid[row.names(CategoryGrid) %in% Origin,colnames(CategoryGrid) %in% Dest]] <-CategoryTable[i,"Final"]
So I was trying to match the rows and columns, pass that as a boolean vector to the value match, and then do the assignment from the right-hand side.
However, what I actually get is:
Error in CategoryGrid[row.names(CategoryGrid) %in% Origin, colnames(CategoryGrid) %in% :
incorrect number of dimensions
How would you go about achieving this?
Replacing the active ingredient with the following did the trick.
# drop = FALSE keeps the selection a matrix even when only one row or column matches
CategoryGrid[row.names(CategoryGrid) == Origin , colnames(CategoryGrid) == Dest] <- apply(CategoryGrid[row.names(CategoryGrid) == Origin , colnames(CategoryGrid) == Dest, drop = FALSE], MARGIN=c(1,2),function(x) ifelse(x == Distance, Final, x))
Specify the rules for the rows and columns, then use apply and put the test on the third variable in its own function.
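For a large grid, the loop can possibly be avoided altogether by building one lookup key per cell. The sketch below is an alternative technique, not the accepted fix. It uses small made-up inputs and assumes the lookup table's columns are named Origin, Destination, Distance and Final as in the printed table. Cells with no matching table row come out as NA:

```r
# Hypothetical small inputs mirroring the shape of the question's data
CategoryTable <- data.frame(Origin      = c(1, 1, 2, 2),
                            Destination = c(1, 2, 1, 2),
                            Distance    = c(1, 2, 1, 1),
                            Final       = c("A", "F", "B", "A"),
                            stringsAsFactors = FALSE)
CategoryGrid <- matrix(c(1, 1, 2, 1), nrow = 2,
                       dimnames = list(c("1", "2"), c("1", "2")))

# One key per cell: "origin_destination_distance" (column-major order)
cell_keys <- paste(rownames(CategoryGrid)[row(CategoryGrid)],
                   colnames(CategoryGrid)[col(CategoryGrid)],
                   CategoryGrid, sep = "_")
table_keys <- paste(CategoryTable$Origin, CategoryTable$Destination,
                    CategoryTable$Distance, sep = "_")

# Look every cell up at once and restore the matrix shape
Result <- matrix(CategoryTable$Final[match(cell_keys, table_keys)],
                 nrow = nrow(CategoryGrid),
                 dimnames = dimnames(CategoryGrid))
Result
```

Because the keys are built from the row and column names by position, duplicated row or column names are handled naturally. Building 16 million strings for a 4000 by 4000 grid costs memory, so this trades space for the loop.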

R Check for levels within a group and duplicate row if not present

I have a problem in Shiny, which I will show with a simple example:
I have the following data:
Group<-c("A","A","B","C","C","D")
Value<-c(1,2,6,7,3,9)
df<-data.frame(Group, Value)
Group Value
A 1
A 2
B 6
C 7
C 3
D 9
Then I add a column to see how many reps a group has:
df$num <- ave(df$Value, df$Group, FUN = seq_along)
Group Value num
A 1 1
A 2 2
B 6 1
C 7 1
C 3 2
D 9 1
Now I would like to check whether each group contains a 2nd rep and, if not, duplicate the 1st row of the group (the one with num=1) and set num to 2.
So I would like to end up with:
Group Value num
A 1 1
A 2 2
B 6 1
B 6 2 #this row is added
C 7 1
C 3 2
D 9 1
D 9 2 #this row is added
I have tried to search for a solution, but I mainly ended up with topics about conditions based on a certain value, rather than conditions within a group.
Could someone help me? I would appreciate it a lot!
Does this code do the trick?
res <- lapply(unique(df$Group), function(x){
a <- df[df$Group == x, ]
if(nrow(a) == 1) {
a <- a[rep(row.names(a), 2), ]
a$num <- c(1:2)
}
a
})
do.call(rbind, res)
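The same result can probably be reached without splitting the data frame group by group. This is a sketch that assumes, as in the example, that a group has either one or two rows. The names group_size, extra and out are illustrative:

```r
df <- data.frame(Group = c("A", "A", "B", "C", "C", "D"),
                 Value = c(1, 2, 6, 7, 3, 9))
df$num <- ave(df$Value, df$Group, FUN = seq_along)

# Size of the group each row belongs to
group_size <- ave(df$Value, df$Group, FUN = length)

# Copy the rows of single-row groups and mark the copies as rep 2
extra <- df[group_size == 1, ]
extra$num <- rep(2, nrow(extra))

# Append the copies and restore the group order
out <- rbind(df, extra)
out <- out[order(out$Group, out$num), ]
rownames(out) <- NULL
out
```

Sorting by Group and num places each added row directly after the row it was copied from.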

vectorise rows of a dataframe, apply vector function, return to original dataframe r

Given the following df:
a=c('a','b','c')
b=c(1,2,5)
c=c(2,3,4)
d=c(2,1,6)
df=data.frame(a,b,c,d)
a b c d
1 a 1 2 2
2 b 2 3 1
3 c 5 4 6
I'd like to apply a function that normally takes a vector (and returns a vector) like cummax row by row to the columns in position b to d.
Then, I'd like to have the output back in the df, either as a vector in a new column of the df, or replacing the original data.
I'd like to avoid writing it as a for loop that would iterate every row, pull out the content of the cells into a vector, do its thing and put it back.
Is there a more efficient way? I've given the apply family functions a go, but I'm struggling to first get a good way to vectorise content of columns by row and get the right output.
the final output could look something like that (imagining I've applied a cummax() function).
a b c d
1 a 1 2 2
2 b 2 3 3
3 c 5 5 6
or
a b c d output
1 a 1 2 2 (1,2,2)
2 b 2 3 1 (2,3,3)
3 c 5 4 6 (5,5,6)
where output is a vector.
Seems this would just be a simple apply problem that you want to cbind to df:
> cbind(df, apply(df[ , 4:2] # work with columns in reverse order
, 1, # do it row-by-row
cummax) )
a b c d 1 2 3
d a 1 2 2 2 1 6
c b 2 3 1 2 3 6
b c 5 4 6 2 3 6
Ouch. Bitten by failing to notice that the result is returned as a column-oriented matrix and needs to be transposed. Such a newbie mistake. But it does show the value of having a question with a reproducible dataset, I suppose.
> cbind(df, t(apply(df[ , 4:2] , 1, cummax) ) )
a b c d d c b
1 a 1 2 2 2 2 2
2 b 2 3 1 1 3 3
3 c 5 4 6 6 6 6
To destructively assign the result to df you would just use:
df <- # .... that code.
This does the concatenation with commas (and as a result no longer needs to be transposed):
> cbind(df, output=apply(df[ , 4:2] , 1, function(x) paste( cummax(x), collapse=",") ) )
a b c d output
1 a 1 2 2 2,2,2
2 b 2 3 1 1,3,3
3 c 5 4 6 6,6,6
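The examples above walk the columns in reverse order (d, c, b). To get the question's first desired output, the columns can be taken in their original b to d order and the transposed result written back in place. A minimal sketch, assuming columns b to d are all numeric (the cols name is illustrative):

```r
df <- data.frame(a = c("a", "b", "c"),
                 b = c(1, 2, 5),
                 c = c(2, 3, 4),
                 d = c(2, 1, 6))

cols <- c("b", "c", "d")
# apply() returns one column per input row, so transpose before assigning
df[, cols] <- t(apply(df[, cols], 1, cummax))
df
```

Row 2 becomes 2, 3, 3 and row 3 becomes 5, 5, 6, as in the question's expected result.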

replace values in a column based on another column but following the numeric index from the first replacement

I have a data.frame that looks like the one below. I need to replace the values in the first column based on the values in the second column, but the replacement needs to continue the numbering of column 1, and should only replace values in column 1 when ValB is not A.
>df1
ValA ValB
1 A
1 A
2 A
2 A
3 A
3 A
4 A
4 A
1 B
1 B
1 B
2 B
2 B
3 B
4 B
4 B
1 C
1 C
2 C
2 C
3 C
3 C
4 C
1 C
What I want is to replace the values in ValA, using ValB==B as the index for which values to replace. The replacement has to continue the values in ValA, i.e. when ValA is 1 and ValB==B, ValA has to become 5, 2 has to become 6, and so on. The desired output below should make it easier to understand what I am after. I could do a for loop with if and else if statements, but I am sure there is a cleaner way.
Desired output
>df1
ValA ValB
1 A
1 A
2 A
2 A
3 A
3 A
4 A
4 A
5 B
5 B
5 B
6 B
6 B
6 B
7 B
7 B
8 C
8 C
9 C
9 C
10 C
10 C
11 C
12 C
You could do something like this. It runs a cumulative sum over a logical vector that flags, for each row, whether ValA or ValB differs from the previous row:
# do a running sum of the change flags
df1$c = cumsum(
c(
# the first value of the result is the first value of ValA
df1$ValA[1],
# compare rows 2..n with rows 1..(n - 1)
sapply(
2:nrow(df1),
function(index) {
# TRUE (counted as 1) if ValA or ValB changed, FALSE (0) otherwise
!(
df1$ValA[index] == df1$ValA[index - 1] &
df1$ValB[index] == df1$ValB[index - 1]
)
}
)
))
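The row-by-row sapply() call can likely be replaced by comparing shifted copies of the columns. The sketch below uses a shortened, made-up version of the data, with changed and NewA as illustrative names. It yields a counter that increases by one whenever ValA or ValB changes:

```r
# Shortened example data in the same layout as the question
df1 <- data.frame(ValA = c(1, 1, 2, 2, 1, 1, 2),
                  ValB = c("A", "A", "A", "A", "B", "B", "B"),
                  stringsAsFactors = FALSE)
n <- nrow(df1)

# TRUE where a row differs from the previous row in ValA or ValB;
# the first row has no predecessor, so it is never flagged
changed <- c(FALSE,
             df1$ValA[-1] != df1$ValA[-n] | df1$ValB[-1] != df1$ValB[-n])

# Start from the first ValA and add one per change
df1$NewA <- df1$ValA[1] + cumsum(changed)
df1
```

Here the B rows continue the numbering with 3, 3, 4 instead of restarting at 1.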
