Sum a group of columns by row count - r

I'm trying to create a new dataset from an existing one. The new dataset is supposed to combine 60 rows from the original dataset in order to convert a sum of events occurring each second to the total by minute. The number of columns will generally not be known in advance.
For example, with this dataset, if we split it into groups of 3 rows:
d1
a b c d
1 1 1 0 1
2 0 1 0 1
3 0 1 0 0
4 0 0 1 0
5 0 0 1 0
6 1 0 0 0
We'll get this data.frame. Row 1 contains the column sums for rows 1-3 of d1 and Row 2 contains the column sums for rows 4-6 of d1:
d2
a b c d
1 1 3 0 2
2 1 0 2 0
I've tried d2<-colSums(d1[seq(1,NROW(d1),3),]) which is about as close as I've been able to get.
I've also considered recommendations from How to sum rows based on multiple conditions - R?, How to select every xth row from table, Remove last N rows in data frame with the arbitrary number of rows, sum two columns in R, and Merging multiple rows into single row. I'm all out of ideas. Any help would be greatly appreciated.

Create a grouping variable, group_by that variable, then summarise_all.
# your data
d <- data.frame(a = c(1,0,0,0,0,1),
                b = c(1,1,1,0,0,0),
                c = c(0,0,0,1,1,0),
                d = c(1,1,0,0,0,0))
# create the grouping variable
d$group <- rep(c("A","B"), each = 3)
# apply the sum to all columns within each group
library(dplyr)
d %>%
  group_by(group) %>%
  summarise_all(sum)
Returns:
# A tibble: 2 x 5
  group     a     b     c     d
  <chr> <dbl> <dbl> <dbl> <dbl>
1 A         1     3     0     2
2 B         1     0     2     0
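When the number of rows (and so the number of groups) is not known in advance, the grouping variable can be computed instead of typed out. A minimal sketch in base R, assuming the goal is one output row per block of `n` consecutive rows; the names `n` and `grp` are illustrative, and `rowsum()` stands in for the dplyr pipeline above:

```r
# data from the question
d1 <- data.frame(a = c(1, 0, 0, 0, 0, 1),
                 b = c(1, 1, 1, 0, 0, 0),
                 c = c(0, 0, 0, 1, 1, 0),
                 d = c(1, 1, 0, 0, 0, 0))

n <- 3  # rows per block (this would be 60 for seconds -> minutes)

# integer division maps rows 1..n to block 0, rows n+1..2n to block 1, ...
grp <- (seq_len(nrow(d1)) - 1) %/% n

# column sums within each block; works for any number of numeric columns
d2 <- rowsum(d1, group = grp)
d2
```

If the row count is not a multiple of `n`, the last block simply contains fewer rows.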

Overview
After reading Split up a dataframe by number of rows, I realized the only thing you need to know is how you'd like to split() d1.
Here, you'd like to split d1 into multiple data frames of 3 rows each. To do that, use rep() so that each element in the sequence 1:2 is repeated three times (the number of rows divided by the length of the sequence).
After that, the logic involves using map() to sum each column for each data frame created after d1 %>% split(). Here, summarize_all() is helpful since you don't need to know the column names ahead of time.
Once the calculations are complete, you use bind_rows() to stack all the observations back into one data frame.
# load necessary package ----
library(tidyverse)
# load necessary data ----
df1 <-
read.table(text = "a b c d
1 1 0 1
0 1 0 1
0 1 0 0
0 0 1 0
0 0 1 0
1 0 0 0", header = TRUE)
# perform operations --------
df2 <-
  df1 %>%
  # split df1 into two data frames
  # based on three consecutive rows
  split(f = rep(1:2, each = nrow(.) / length(1:2))) %>%
  # for each data frame, apply the sum() function to all the columns
  map(.f = ~ .x %>% summarize_all(.funs = sum)) %>%
  # collapse data frames together
  bind_rows()
# view results -----
df2
# a b c d
# 1 1 3 0 2
# 2 1 0 2 0
# end of script #
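The same split-then-sum idea can be written without hard-coding the sequence 1:2. A sketch in base R, assuming a block size `n` (an illustrative name, not part of the answer above) and using `colSums()` in place of `summarize_all()`:

```r
df1 <- read.table(text = "a b c d
1 1 0 1
0 1 0 1
0 1 0 0
0 0 1 0
0 0 1 0
1 0 0 0", header = TRUE)

n <- 3  # rows per block

# block index for every row: 1, 1, 1, 2, 2, 2, ...
block <- ceiling(seq_len(nrow(df1)) / n)

# split into a list of data frames, sum each column, stack the results
chunks <- split(df1, block)
df2 <- as.data.frame(do.call(rbind, lapply(chunks, colSums)))
df2
```

Because `block` is derived from `nrow(df1)`, the number of resulting data frames does not have to be known up front.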

Related

How to add a column to an R dataframe flagging activity in other columns?

In the reproducible working code shown at the bottom of this post, I'm trying to add a column ("ClassAtSplit", shown in the image below with the header highlighted orange) to the R dataframe that flags the activity in other columns of that dataframe. I could use a series of cumbersome conditionals and offsets like I do in the Excel formula for "ClassAtSplit" shown in the right-most column in the image below labeled "ClassAtSplitFormula", but I am looking for an efficient way to do this in R using something like dplyr. Any suggestions for doing this?
I'm not trying to reproduce "ClassAtSplitFormula"! I only included it to show the formulas in "ClassAtSplit" column.
Reproducible code:
library(dplyr)
data <-
data.frame(
Element = c("C","B","D","A","A","A","C","B","B","B"),
SplitCode = c(0,0,0,0,1,1,0,0,2,2)
)
data <- data %>% group_by(Element) %>% mutate(PreSplitClass=row_number())
data
There is likely a more efficient way to solve this in just one or two lines, but with case_when() you can implement the nested Excel IF formulas quickly and clearly using dplyr.
library(dplyr)
data <-
data.frame(
Element = c("C","B","D","A","A","A","C","B","B","B"),
SplitCode = c(0,0,0,0,1,1,0,0,2,2)
)
data <- data %>% group_by(Element) %>% mutate(PreSplitClass=row_number()) %>% ungroup()
data %>%
  mutate(ClassAtSplit =
    case_when(
      SplitCode == 0 ~ 0L, # This eliminates checking for > 0
      SplitCode > lag(SplitCode) ~ PreSplitClass, # if > previous value
      SplitCode == lag(SplitCode) ~ lag(PreSplitClass) # if equal (0s are avoided)
    )
  )
Output:
# A tibble: 10 × 4
Element SplitCode PreSplitClass ClassAtSplit
<chr> <dbl> <int> <int>
1 C 0 1 0
2 B 0 1 0
3 D 0 1 0
4 A 0 1 0
5 A 1 2 2
6 A 1 3 2
7 C 0 2 0
8 B 0 2 0
9 B 2 3 3
10 B 2 4 3
Because the first case handles SplitCode == 0, the later cases never have to check whether two values are equal merely because both are 0.
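For comparison, the nested-conditional form that case_when() replaces can be sketched in base R. This assumes the same data as above; `prev()` is a hypothetical helper written here to mimic dplyr's lag(), and `ave()` is used instead of group_by() to number the rows within each Element:

```r
data <- data.frame(
  Element   = c("C", "B", "D", "A", "A", "A", "C", "B", "B", "B"),
  SplitCode = c(0, 0, 0, 0, 1, 1, 0, 0, 2, 2)
)

# running row number within each Element (same as PreSplitClass above)
data$PreSplitClass <- ave(seq_along(data$Element), data$Element, FUN = seq_along)

# hypothetical helper: shift a vector down by one, padding with NA
prev <- function(x) c(NA, head(x, -1))

# conditions are tested in the same order as in case_when()
data$ClassAtSplit <- ifelse(
  data$SplitCode == 0, 0L,
  ifelse(data$SplitCode > prev(data$SplitCode),
         data$PreSplitClass,
         prev(data$PreSplitClass))
)
data
```

The nesting makes the evaluation order explicit, which is the part case_when() expresses more readably.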

Summing columns from multiple data frames in R

I have multiple dataframes that I created in a for loop. They all have the same 3 columns, XLocs, YLocs, PatchStatus. XLocs and YLocs contain the same coordinates in each dataframe. PatchStatus can be either 0 or 1 depending how the model ran. Example of dataframe 1 looks like
print(listofdfs[1])
allPoints.xLocs allPoints.yLocs allPoints.patchStatus
1 73.5289654 8.8633913 0
2 21.0795393 44.4840248 0
3 51.5969348 21.7864016 0
4 61.9007129 32.4763183 1
5 62.3447741 41.0651838 1
6 16.9311605 6.3765206 0
And dataframe 2 looks like
print(listofdfs[2])
allPoints.xLocs allPoints.yLocs allPoints.patchStatus
1 73.5289654 8.8633913 0
2 21.0795393 44.4840248 1
3 51.5969348 21.7864016 0
4 61.9007129 32.4763183 1
5 62.3447741 41.0651838 1
6 16.9311605 6.3765206 0
I'm hoping to have 1 resultant dataframe that has XLocs, YLocs, and SUM of patch status (note I plan on combining 15 data frames, so PatchStatus can be between 0 and 15).
I posted this answer along with the heatmap - Plot-3: https://stackoverflow.com/a/60974584/1691723
library('data.table')
df2 <- rbindlist(l = listofdfs)
df2 <- df2[, .(sum_patch = sum(allPoints.patchStatus)), by = .(allPoints.xLocs, allPoints.yLocs)]
We can bind the datasets together and do a group by sum
library(dplyr)
bind_rows(listofdfs) %>%
group_by( allPoints.xLocs, allPoints.yLocs) %>%
summarise(allPoints.patchStatus = sum(allPoints.patchStatus))
Or using rbind and aggregate from base R
aggregate(allPoints.patchStatus ~ ., do.call(rbind, listofdfs), FUN = sum)
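If every data frame lists the same coordinates in the same row order, the sum can also be taken position by position without grouping. A sketch under that assumption (the row order is not confirmed by the question), rebuilding the two printed frames; the construction of `listofdfs` and the names `xy`, `status` and `result` are illustrative:

```r
xy <- data.frame(
  allPoints.xLocs = c(73.5289654, 21.0795393, 51.5969348,
                      61.9007129, 62.3447741, 16.9311605),
  allPoints.yLocs = c(8.8633913, 44.4840248, 21.7864016,
                      32.4763183, 41.0651838, 6.3765206)
)
listofdfs <- list(
  cbind(xy, allPoints.patchStatus = c(0, 0, 0, 1, 1, 0)),
  cbind(xy, allPoints.patchStatus = c(0, 1, 0, 1, 1, 0))
)

# pull the status column out of each frame and add them element-wise
status <- lapply(listofdfs, function(df) df$allPoints.patchStatus)
result <- cbind(xy, sum_patch = Reduce("+", status))
result
```

If the row order can differ between frames, the grouped approaches above are the safer choice.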

Is there a R function to count the values in the rows more than 0 [duplicate]

This question already has answers here:
Number of column values greater than 0 for given row? [duplicate]
(2 answers)
Closed 3 years ago.
Is there an R function to count the values greater than 0 in a row?
test <- data.frame(a=c("a","y"),b=c(0,"5"),c=c(2,"0"))
test
  a b c
1 a 0 2
2 y 5 0
I need to get the following, because the first row contains 1 value greater than 0 and the second row also contains 1 value greater than 0. I need to exclude the first column, as it contains only characters.
test
a b c d
1 a 0 2 1
2 y 5 0 1
We can convert the type of columns with type.convert, select the numeric columns, check if it is greater than 0, get the row wise sum of logical matrix, and create a new column in the 'test' dataset
library(tidyverse)
library(magrittr)
type.convert(test, as.is = TRUE) %>%
select_if(is.numeric) %>%
is_greater_than(0) %>%
rowSums %>%
bind_cols(test, d = .)
# a b c d
#1 a 0 2 1
#2 y 5 0 1
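The same steps can be sketched in base R without the pipe. This assumes the columns are stored as character, as in the question; the names `converted` and `is_num` are illustrative:

```r
test <- data.frame(a = c("a", "y"), b = c("0", "5"), c = c("2", "0"))

# turn the character columns that hold numbers into numeric columns
converted <- type.convert(test, as.is = TRUE)

# logical index of the numeric columns (column a stays character)
is_num <- vapply(converted, is.numeric, logical(1))

# TRUE/FALSE matrix of "greater than 0", summed across each row
test$d <- rowSums(converted[is_num] > 0)
test
```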

R: How to get common counts (frequency) of levels of two factor variables by ID Variable (as new data frame) [duplicate]

This question already has answers here:
Create columns from factors and count [duplicate]
(2 answers)
Closed 7 years ago.
To get the question clear, let me start with one baby example of my data frame.
ID <- c(rep("first", 2), rep("second", 4), rep("third",1), rep("fourth", 3))
Var_1 <- c(rep("A",2), rep("B", 2), rep("A",3), rep("B", 2), "A")
Var_2 <- c(rep("C",2), rep("D",3) , rep("C",2), rep("E",2), "D")
DF <- data.frame(ID, Var_1, Var_2)
> DF
ID Var_1 Var_2
1 first A C
2 first A C
3 second B D
4 second B D
5 second A D
6 second A C
7 third A C
8 fourth B E
9 fourth B E
10 fourth A D
There is one ID factor variable and two factor variables Var_1 with R=2 factor levels and Var_2 with C=3 factor levels.
I would like to get a new data frame with (RxC)+1=(2x3)+1 Variables with the frequencies of all combinations of factor levels - separately for each level in ID Variable, that would look like this:
ID A.C A.D A.E B.C B.D B.E
1 first 2 0 0 0 0 0
2 second 1 1 0 0 2 0
3 third 1 0 0 0 0 0
4 fourth 0 1 0 0 0 2
I tried a couple of functions, but the results were not even close to this, so they are not worth mentioning. In the original data frame I should get (6x9)+1=55 variables.
EDIT: There are solutions for counting factor levels for one or many variables separately, but I couldn't figure out how to get common counts for combinations of factor levels of two (or more) variables. Applying the solution to other cases seems easy now that I have the answers, but I could not get there by myself.
Using the dcast function from the reshape2 package (or data.table, which has an enhanced implementation of the dcast function):
library(reshape2)
dcast(DF, ID ~ paste(Var_1,Var_2,sep="."), fun.aggregate = length)
which gives:
ID A.C A.D B.D B.E
1 first 2 0 0 0
2 fourth 0 1 0 2
3 second 1 1 2 0
4 third 1 0 0 0
We could use paste to create a variable combining Var_1 and Var_2 and then produce a contingency table with ID and the new variable:
table(DF$ID,paste(DF$Var_1,DF$Var_2,sep="."))
output
A.C A.D B.D B.E
first 2 0 0 0
fourth 0 1 0 2
second 1 1 2 0
third 1 0 0 0
To order the table rows, we would need to factor(DF$ID,levels=c("first","second","third","fourth")) beforehand.
Try
library(tidyr)
library(dplyr)
DF %>%
unite(Var, Var_1, Var_2, sep = ".") %>%
count(ID, Var) %>%
spread(Var, n, fill = 0)
Which gives:
#Source: local data frame [4 x 5]
#
# ID A.C A.D B.D B.E
# (fctr) (dbl) (dbl) (dbl) (dbl)
#1 first 2 0 0 0
#2 fourth 0 1 0 2
#3 second 1 1 2 0
#4 third 1 0 0 0
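The outputs above omit combinations that never occur (A.E, B.C), while the requested result has a column for every combination. One possible way to keep them, sketched in base R: interaction() builds a factor whose levels cover every pairing of Var_1 and Var_2, so table() reports the zero counts too. The names `id`, `combo` and `counts` are illustrative.

```r
ID <- c(rep("first", 2), rep("second", 4), rep("third", 1), rep("fourth", 3))
Var_1 <- c(rep("A", 2), rep("B", 2), rep("A", 3), rep("B", 2), "A")
Var_2 <- c(rep("C", 2), rep("D", 3), rep("C", 2), rep("E", 2), "D")
DF <- data.frame(ID, Var_1, Var_2)

# keep IDs in their order of appearance rather than alphabetical order
id <- factor(DF$ID, levels = unique(DF$ID))

# every Var_1 x Var_2 combination, ordered A.C, A.D, A.E, B.C, B.D, B.E
combo <- interaction(DF$Var_1, DF$Var_2, sep = ".", lex.order = TRUE)

# contingency table with one row per ID and one column per combination
counts <- table(id, combo)
counts
```

The result is a table object; as.data.frame.matrix(counts) is one way to turn it into a wide data frame if that is needed.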

creating new column based on rows being equal in R

Here is a simple question about creating a new column whose value depends on rows that share a value in one column meeting a criterion in a different column. Specifically, for rows that are duplicates in column "pairs", create a new column "new" based on whether the values in column "y" are equal or unequal.
In the actual data frame I have even more conditions for other columns but my main issue is with making these conditions dependent on the rows being the same in the "pairs" column.
Many thanks!
pairs y new
1 1 1
1 0 1
2 1 0
2 1 0
3 3 1
3 1 1
Assuming values are always paired, i.e., there are only two rows in each group:
DF <- read.table(text="pairs y new
1 1 1
1 0 1
2 1 0
2 1 0
3 3 1
3 1 1", header=TRUE)
library(plyr)
#for integers:
ddply(DF, .(pairs), transform, new1 = 1*(diff(y) != 0L))
#for numerics:
ddply(DF, .(pairs), transform, new1 = 1*(abs(diff(y)) > .Machine$double.eps ^ 0.5))
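A base R alternative that does not rely on exactly two rows per group can be sketched with ave(), flagging a group with 1 when its y values are not all identical. The column name `new1` follows the answer above; the use of ave() and unique() is a substitute technique, not the answer's method, and it assumes exact comparison is acceptable (floating-point y would need a tolerance, as in the second ddply call).

```r
DF <- read.table(text = "pairs y new
1 1 1
1 0 1
2 1 0
2 1 0
3 3 1
3 1 1", header = TRUE)

# 1 if the group has more than one distinct y value, otherwise 0;
# ave() recycles the single result across all rows of the group
DF$new1 <- ave(DF$y, DF$pairs,
               FUN = function(v) as.integer(length(unique(v)) > 1))
DF
```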
