Collecting success and total from incomplete binary groups in dplyr - r

Let's say I have the following data frame:
>blob
id group growth
1 A 1
2 A 1
3 B 0
4 B 1
5 B 0
6 C 0
7 C 0
8 C 0
I would eventually like to get the number of successes out of the total for each group. I have gotten this far:
blob %>%
group_by(group,growth) %>%
tally()
group growth n
A 1 2
B 0 2
B 1 1
C 0 3
I would like to have something like
group success total
A 2 2
B 1 3
C 0 3
I have also tried
blob %>%
group_by(group,growth) %>%
tally() %>%
summarise(fail= n[factor(growth)==1],total = sum(n))
but I get an error because not every group has a row with growth equal to 1.

n() is a dplyr function that counts the rows in the current group. If we group_by the group column, we can use n() to get the total number of rows and sum() to add up the successes.
library(dplyr)
dt2 <- dt %>%
group_by(group) %>%
summarise(success = sum(growth), total = n())
Data Preparation
dt <- read.table(text = "id group growth
1 A 1
2 A 1
3 B 0
4 B 1
5 B 0
6 C 0
7 C 0
8 C 0",
header = TRUE, stringsAsFactors = FALSE)
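The tally-based attempt from the question can also be completed. The following is only a sketch, assuming the blob data frame shown in the question. For group C, which has no successes, n[growth == 1] is empty, so wrapping it in sum() yields 0 instead of a zero-length value.

```r
library(dplyr)

# Data from the question
blob <- data.frame(
  id = 1:8,
  group = c("A", "A", "B", "B", "B", "C", "C", "C"),
  growth = c(1, 1, 0, 1, 0, 0, 0, 0)
)

res <- blob %>%
  group_by(group, growth) %>%
  tally() %>%
  group_by(group) %>%                       # regroup explicitly by group only
  summarise(success = sum(n[growth == 1]),  # sum() of an empty vector is 0
            total = sum(n))
res
```

This keeps the intermediate counts per group and growth value, which can be handy if the failures are needed as well.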

Here's a simple example with data.table (df1 is the data frame from the question):
require(data.table)
setDT(df1)
df1[, .(success = sum(growth), total = .N), by=group]
group success total
1: A 2 2
2: B 1 3
3: C 0 3

a=Map(tapply,list(dt$growth),list(dt$group),c(sum,length))
`names<-`(do.call(cbind.data.frame,a),c("Successes","Totals"))
Successes Totals
A 2 2
B 1 3
C 0 3
You can use mapply instead of Map:
mapply(tapply,list(dt$growth),list(dt$group),c(sum,length))
[,1] [,2]
A 2 2
B 1 3
C 0 3
You can then name the columns as you like. (Remember to convert the result from a matrix to a data frame.)
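The conversion and renaming could look like the sketch below. The column names success and total are my choice, taken from the desired output in the question, and dt is rebuilt here so the snippet runs on its own.

```r
# Data from the question (only the columns that are needed)
dt <- data.frame(
  group = c("A", "A", "B", "B", "B", "C", "C", "C"),
  growth = c(1, 1, 0, 1, 0, 0, 0, 0)
)

# One tapply call per function: sum gives successes, length gives totals
m <- mapply(tapply, list(dt$growth), list(dt$group), c(sum, length))

# Convert the matrix to a data frame and name the columns
res <- as.data.frame(m)
names(res) <- c("success", "total")
res
```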

Related

How to calculate values for the first row that meets a certain condition?

I have the following dummy dataframe:
t <- data.frame(
a= c(0,0,2,4,5),
b= c(0,0,4,6,5))
a b
0 0
0 0
2 4
4 6
5 5
I want to replace just the first value in column b that is not zero. Suppose the row that meets this criterion is i. I want to replace t$b[i] with t$b[i+2] + t$b[i+1], and the rest of t$b should remain the same. So the output would be
a b
0 0
0 0
2 11
4 6
5 5
In fact the dataset is dynamic so I cannot directly point to a specific row, it has to meet the criteria of being the first row not equal to zero in column b.
How can I create this new t$b?
Here is a straightforward solution in base R:
t <- data.frame(
a= c(0,0,2,4,5),
b= c(0,0,4,6,5))
ind <- which(t$b != 0)[1L] # first row where b is not zero
t$b[ind] <- t$b[ind+2L] + t$b[ind+1L]
t
a b
1 0 0
2 0 0
3 2 11
4 4 6
5 5 5
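Since the data set is dynamic, it may be worth guarding against the cases where no value is non-zero or where fewer than two rows follow the first non-zero value. A minimal sketch, assuming the vector should simply be returned unchanged in those cases (the helper name replace_first_nonzero is made up for illustration):

```r
# Hypothetical helper: replace the first non-zero element of b with the
# sum of the two elements that follow it; return b unchanged otherwise.
replace_first_nonzero <- function(b) {
  i <- which(b != 0)[1L]                 # NA if every value is zero
  if (is.na(i) || i + 2L > length(b)) {
    return(b)                            # nothing to replace, or too few rows after i
  }
  b[i] <- b[i + 1L] + b[i + 2L]
  b
}

t <- data.frame(a = c(0, 0, 2, 4, 5),
                b = c(0, 0, 4, 6, 5))
t$b <- replace_first_nonzero(t$b)
t
```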
Here is a roundabout way of getting there with a combination of group_by() and mutate():
library(tidyverse)
t %>%
mutate(
b_cond = b != 0,
row_number = row_number()
) %>%
group_by(b_cond) %>%
mutate(
min_row_number = row_number == min(row_number),
b = if_else(b_cond & min_row_number, lead(b, 1) + lead(b, 2), b)
) %>%
ungroup() %>%
select(a, b) # optional, to get back to original columns
# A tibble: 5 × 2
a b
<dbl> <dbl>
1 0 0
2 0 0
3 2 11
4 4 6
5 5 5

Find Maximum value and ties based on different criteria

For each combination of my variables simulation and iteration, I would like to find out whether group "a" had the highest value of rand1, as well as rand2, and to know whether group "a" tied with another group based on rand1, as well as rand2.
Some sample df (with hard-coded values for rand1 and rand2 for reproducibility):
library(dplyr)
library(tidyr) # for crossing()
df = crossing(simulation = 1:3,
iteration = 1:3,
group =c("a","b","c")) %>%
mutate(rand1 = c(6,2,2,6,4,6, sample(6,21,replace=TRUE)), # first six values are hard coded to match the example below; the rest vary because set.seed was not used
rand2 = c(4,1,2,5,6,1,sample(6,21,replace=TRUE)))
which gives:
simulation iteration group rand1 rand2
1 1 a 6 4
1 1 b 2 1
1 1 c 2 2
1 2 a 6 5
1 2 b 4 6
1 2 c 6 1
This is what I want my output to look like. top.crit1 is 1 if group a has the maximum value and 0 if there is a tie. ties.crit1 tells me whether a was tied for the maximum value with another group. The same applies to top.crit2 and ties.crit2 (not shown below to avoid clutter).
Desired output:
simulation iteration group rand1 rand2 top.crit1 ties.crit1
1 1 a 6 4 1 0
1 1 b 2 1 1 0
1 1 c 2 2 1 0
1 2 a 6 5 0 1
1 2 b 4 6 0 1
1 2 c 6 1 0 1
This is my code so far. It only determines the max value and doesn't take ties into account. It's also a bit tedious to determine the maximum value separately for rand1 and rand2.
df.test = df %>%
group_by(simulation, iteration) %>%
slice(which.max(rand1)) %>%
mutate(top.crit1 = if_else(group=="a",1,0)) %>%
select(-rand2, -rand1, -group) %>%
full_join(., df)
This works provided group a is the first row of each simulation/iteration group (note that the comparisons return TRUE/FALSE rather than 1/0):
df %>%
group_by(simulation, iteration) %>%
mutate(top.crit1 = rand1[1] > max(rand1[-1])) %>%
mutate(ties.crit1 = rand1[1] == max(rand1[-1]))
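If the row order cannot be guaranteed, the same comparison can be written by selecting group a by name. This is a sketch, assuming exactly one row for group a in every simulation/iteration combination. as.integer() converts TRUE/FALSE to the 1/0 used in the desired output, and the data are the six rows shown in the question.

```r
library(dplyr)

# The six rows shown in the question
df <- data.frame(
  simulation = 1,
  iteration = rep(1:2, each = 3),
  group = rep(c("a", "b", "c"), times = 2),
  rand1 = c(6, 2, 2, 6, 4, 6),
  rand2 = c(4, 1, 2, 5, 6, 1)
)

res <- df %>%
  group_by(simulation, iteration) %>%
  mutate(
    # a is strictly above every other group
    top.crit1 = as.integer(rand1[group == "a"] > max(rand1[group != "a"])),
    # a shares the maximum with at least one other group
    ties.crit1 = as.integer(rand1[group == "a"] == max(rand1[group != "a"]))
  ) %>%
  ungroup()
res
```

The same two lines with rand2 in place of rand1 would give top.crit2 and ties.crit2.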

Generate matrix of unique combination using 2 variables from a data frame r

I have a data frame as
df<- as.data.frame(expand.grid(0:1, 0:4, 0:3,0:7, 2:7))
I want to get all unique combinations using 2 variables of the given 5 variables in the data frame df
Apply a function f (which extracts the unique rows for a pair of columns) to each pair of columns:
f<-function(col,df)
{
return(unique(df[,col]))
}
# All combinations of two column names
comb_col<-combn(colnames(df),2)
The output:
apply(comb_col,2,f,df=df)
[[1]]
Var1 Var2
1 0 0
2 1 0
3 0 1
4 1 1
5 0 2
6 1 2
7 0 3
8 1 3
9 0 4
10 1 4
[[2]]
Var1 Var3
1 0 0
2 1 0
11 0 1
12 1 1
21 0 2
22 1 2
31 0 3
32 1 3
...
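A variant of the same idea that names each list element after its column pair is sketched below. It uses a smaller grid than the question so the result is easy to inspect, and the "Var1_Var2" naming scheme is just an assumption for illustration.

```r
# Smaller grid than in the question: 2 x 3 x 2 = 12 rows
df <- expand.grid(0:1, 0:2, 0:1)

# All pairs of column names, as a list of character vectors
pairs <- combn(names(df), 2, simplify = FALSE)

# Unique rows for each pair of columns
res <- lapply(pairs, function(cols) unique(df[, cols]))

# Name each element after its pair, e.g. "Var1_Var2"
names(res) <- vapply(pairs, paste, character(1), collapse = "_")

sapply(res, nrow)
```

With named elements, a specific pair can be pulled out directly, for example res$Var1_Var3.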
You can use the distinct function from the dplyr package:
df <- as.data.frame(expand.grid(0:1, 0:4, 0:3,0:7, 2:7))
library(dplyr)
df %>%
distinct(Var1, Var2)
You can also keep the rest of your columns with the .keep_all = TRUE argument.
If you want to get all the possible combinations:
# Generate matrix with all combinations of variables
comb <- combn(names(df), 2)
# Generate a list with all unique values in your data.frame
apply(comb, 2, function(x) df %>% distinct(across(all_of(x)))) # distinct_() is deprecated in current dplyr

Find consecutive values in dataframe

I have a dataframe. I wish to detect consecutive numbers and populate a new column as 1 or 0.
ID Val
1 a 8
2 a 7
3 a 5
4 a 4
5 a 3
6 a 1
Expected output
ID Val outP
1 a 8 0
2 a 7 1
3 a 5 0
4 a 4 1
5 a 3 1
6 a 1 0
You could do this with the diff function in combination with abs and see whether the outcome is 1 or another value:
d$outP <- c(0, abs(diff(d$Val)) == 1)
which gives:
> d
ID Val outP
1 a 8 0
2 a 7 1
3 a 5 0
4 a 4 1
5 a 3 1
6 a 1 0
If you only want to take decreasing consecutive values into account, you can use:
c(0, diff(d$Val) == -1)
When you want to do this for each ID, you can also do this in base R or with dplyr:
# base R
d$outP <- ave(d$Val, d$ID, FUN = function(x) c(0, abs(diff(x)) == 1))
# dplyr
library(dplyr)
d %>%
group_by(ID) %>%
mutate(outP = c(0, abs(diff(Val)) == 1))
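To see why the grouping matters, here is a small sketch with two IDs (the data are made up for illustration). Without grouping, the first row of ID b would be compared with the last row of ID a.

```r
# Made-up data: the last value of ID "a" (5) and the first value of ID "b" (4)
# differ by one, but they belong to different IDs
d <- data.frame(ID = c("a", "a", "a", "b", "b"),
                Val = c(8, 7, 5, 4, 3))

# Ungrouped: row 4 is flagged because 5 -> 4 crosses the ID boundary
ungrouped <- c(0, abs(diff(d$Val)) == 1)

# Grouped: diff() is computed within each ID, so every ID starts at 0
d$outP <- ave(d$Val, d$ID, FUN = function(x) c(0, abs(diff(x)) == 1))

ungrouped
d$outP
```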
We can also use a faster option by comparing the previous value with the current one:
with(df1, as.integer(c(FALSE, Val[-length(Val)] - Val[-1]) ==1))
#[1] 0 1 0 1 1 0
If we need to group by "ID", one option is data.table
library(data.table)
setDT(df1)[, outP := as.integer((shift(Val, fill =Val[1]) - Val)==1) , by = ID]

Grouping and Counting instances?

Is it possible to group and count instances of all other columns using R (dplyr)? For example, the following data frame
x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1
turns into this (note: y is the value being counted):
EDIT: To explain the transformation: x is the column I'm grouping by. For each value of x, I want to count how many times 0, 1 and 2 occur in each of the other columns. For example, the first row of the transformed data frame counts the occurrences of y = 0 among the rows with x = 1: once in column a, twice in column b and once in column c.
x y a b c
1 0 1 2 1
1 1 1 0 2
1 2 1 1 0
2 1 1 0 1
2 2 0 1 0
An approach with a combination of the melt and dcast functions of data.table or reshape2:
library(data.table) # v1.9.5+
dt.new <- dcast(melt(setDT(df), id.vars="x"), x + value ~ variable)
this gives:
dt.new
# x value a b c
# 1: 1 0 1 2 1
# 2: 1 1 1 0 2
# 3: 1 2 1 1 0
# 4: 2 1 1 0 1
# 5: 2 2 0 1 0
In dcast you can specify which aggregation function to use, but that is not necessary here because the default aggregation function is length. If you do not specify one, you will get a warning about that:
Aggregation function missing: defaulting to length
Furthermore, if you do not explicitly convert the data frame to a data.table, data.table will redirect to reshape2 (see the explanation from @Arun in the comments). Consequently, this method can be used with reshape2 as well:
library(reshape2)
df.new <- dcast(melt(df, id.vars="x"), x + value ~ variable)
Used data:
df <- read.table(text="x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1", header=TRUE)
I'd use a combination of gather and spread from the tidyr package, and count from dplyr:
library(dplyr)
library(tidyr)
df = data.frame(x = c(1,1,1,2), a = c(0,1,2,1), b = c(0,0,2,2), c = c(0,1,1,1))
res = df %>%
gather(variable, value, -x) %>%
count(x, variable, value) %>%
spread(variable, n, fill = 0)
# Source: local data frame [5 x 5]
#
# x value a b c
# 1 1 0 1 2 1
# 2 1 1 1 0 2
# 3 1 2 1 1 0
# 4 2 1 1 0 1
# 5 2 2 0 1 0
Essentially, you first change the format of the dataset to:
head(df %>%
gather(variable, value, -x))
# x variable value
#1 1 a 0
#2 1 a 1
#3 1 a 2
#4 2 a 1
#5 1 b 0
#6 1 b 0
This allows you to use count to find out how often each value occurs in columns a to c. After that, you reshape the dataset into the required format using spread.
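In current tidyr, gather and spread are superseded by pivot_longer and pivot_wider. The same pipeline might look like the sketch below, assuming a tidyr version that accepts a scalar values_fill.

```r
library(dplyr)
library(tidyr)

df <- data.frame(x = c(1, 1, 1, 2),
                 a = c(0, 1, 2, 1),
                 b = c(0, 0, 2, 2),
                 c = c(0, 1, 1, 1))

res <- df %>%
  # long format: one row per (x, variable, value)
  pivot_longer(-x, names_to = "variable", values_to = "value") %>%
  # how often each value occurs per x and column
  count(x, value, variable) %>%
  # back to wide format, one column per original column, 0 where a value never occurs
  pivot_wider(names_from = variable, values_from = n, values_fill = 0L)
res
```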
