For each combination of my variables simulation and iteration, I would like to
find out whether group "a" had the highest value of rand1 (and likewise for rand2),
and to know whether group "a" tied with another group on rand1 (and likewise for rand2).
Some sample df (with hard-coded values for rand1 and rand2 for reproducibility):
library(dplyr)
library(tidyr)
df = crossing(simulation = 1:3,
              iteration = 1:3,
              group = c("a","b","c")) %>%
  mutate(rand1 = c(6,2,2,6,4,6, sample(6,21,replace=TRUE)), # first six values hard-coded to match the example; the rest are random because I forgot to use set.seed
         rand2 = c(4,1,2,5,6,1, sample(6,21,replace=TRUE)))
which gives:
simulation iteration group rand1 rand2
1 1 a 6 4
1 1 b 2 1
1 1 c 2 2
1 2 a 6 5
1 2 b 4 6
1 2 c 6 1
This is what I want my output to look like: top.crit1 is 1 if group a alone has the highest value of rand1, and 0 otherwise (including when there is a tie). ties.crit1 is 1 if a was tied for the max value with another group. The same applies to top.crit2 and ties.crit2 (not added below to avoid clutter).
Desired output:
simulation iteration group rand1 rand2 top.crit1 ties.crit1
1 1 a 6 4 1 0
1 1 b 2 1 1 0
1 1 c 2 2 1 0
1 2 a 6 5 0 1
1 2 b 4 6 0 1
1 2 c 6 1 0 1
This is my code so far. It only determines the max value and doesn't take ties into account, and it's a bit tedious to determine the maximum separately for rand1 and rand2.
df.test = df %>%
  group_by(simulation, iteration) %>%
  slice(which.max(rand1)) %>%
  mutate(top.crit1 = if_else(group=="a",1,0)) %>%
  select(-rand2, -rand1, -group) %>%
  full_join(., df)
This would work if you arrange to have group a as the first row of each group. The new columns are logical (TRUE/FALSE); wrap them in as.integer() if you need 1/0:
df %>%
  group_by(simulation, iteration) %>%
  mutate(top.crit1 = rand1[1] > max(rand1[-1])) %>%
  mutate(ties.crit1 = rand1[1] == max(rand1[-1]))
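The answer above relies on group a being the first row of each (simulation, iteration) group. Here is a minimal sketch that does not depend on row order. It assumes each group contains exactly one row with group == "a", and the small data frame is just the first six rows of the question's table:

```r
library(dplyr)

# First six rows of the example data from the question
df <- data.frame(
  simulation = 1,
  iteration  = rep(1:2, each = 3),
  group      = rep(c("a", "b", "c"), times = 2),
  rand1      = c(6, 2, 2, 6, 4, 6),
  rand2      = c(4, 1, 2, 5, 6, 1)
)

res <- df %>%
  group_by(simulation, iteration) %>%
  mutate(
    # compare the value of group "a" with the best value among the other groups
    top.crit1  = as.integer(rand1[group == "a"] >  max(rand1[group != "a"])),
    ties.crit1 = as.integer(rand1[group == "a"] == max(rand1[group != "a"])),
    top.crit2  = as.integer(rand2[group == "a"] >  max(rand2[group != "a"])),
    ties.crit2 = as.integer(rand2[group == "a"] == max(rand2[group != "a"]))
  ) %>%
  ungroup()
```

Selecting by group == "a" instead of by position means the data does not need to be sorted first, and as.integer() gives the 1/0 coding used in the desired output.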
I have the following dummy dataframe:
t <- data.frame(
a= c(0,0,2,4,5),
b= c(0,0,4,6,5))
a b
0 0
0 0
2 4
4 6
5 5
I want to replace just the first value that is not zero in column b. Call the row that meets this criterion i. I want to replace t$b[i] with t$b[i+2] + t$b[i+1], and the rest of t$b should remain the same. So the output would be:
a b
0 0
0 0
2 11
4 6
5 5
In fact the dataset is dynamic, so I cannot point directly to a specific row; it has to be the first row where column b is not equal to zero.
How can I create this new t$b?
Here is a straightforward solution in base R:
t <- data.frame(
  a = c(0,0,2,4,5),
  b = c(0,0,4,6,5))
ind <- which(t$b != 0)[1L]              # index of the first non-zero value in b
t$b[ind] <- t$b[ind+2L] + t$b[ind+1L]   # replace it with the sum of the next two values
t
a b
1 0 0
2 0 0
3 2 11
4 4 6
5 5 5
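The base R solution assumes that a non-zero value exists and that at least two rows follow it. Here is a hedged sketch of a guarded variant. The function name replace_first_nonzero is hypothetical, and leaving the vector unchanged in the edge cases is an assumption about the desired behaviour:

```r
# Hypothetical helper: replace the first non-zero element of b with the sum
# of the two elements that follow it; return b unchanged if that is impossible.
replace_first_nonzero <- function(b) {
  i <- which(b != 0)[1]                 # NA when every value is zero
  if (is.na(i) || i + 2 > length(b)) {  # no non-zero value, or fewer than two rows after it
    return(b)
  }
  b[i] <- b[i + 1] + b[i + 2]
  b
}

t <- data.frame(a = c(0, 0, 2, 4, 5),
                b = c(0, 0, 4, 6, 5))
t$b <- replace_first_nonzero(t$b)
```

Without the guard, an all-zero column gives an NA index and a non-zero value too close to the end of the column is replaced by NA.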
Here is a roundabout way of getting there with a combination of group_by() and mutate(). Because lead() is evaluated within each b_cond group, it skips any zero rows that follow the first non-zero value:
library(tidyverse)
t %>%
  mutate(
    b_cond = b != 0,
    row_number = row_number()
  ) %>%
  group_by(b_cond) %>%
  mutate(
    min_row_number = row_number == min(row_number),
    b = if_else(b_cond & min_row_number, lead(b, 1) + lead(b, 2), b)
  ) %>%
  ungroup() %>%
  select(a, b) # optional, to get back to original columns
# A tibble: 5 × 2
a b
<dbl> <dbl>
1 0 0
2 0 0
3 2 11
4 4 6
5 5 5
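If the next two rows should be taken by position, regardless of whether they are zero, the grouping can be dropped. This is a minimal sketch under that assumption; first_nonzero is a hypothetical helper column:

```r
library(dplyr)

t <- data.frame(a = c(0, 0, 2, 4, 5),
                b = c(0, 0, 4, 6, 5))

res <- t %>%
  mutate(
    # TRUE only for the first row where b is non-zero
    first_nonzero = b != 0 & cumsum(b != 0) == 1,
    # lead() is ungrouped here, so it looks at the next rows by position
    b = if_else(first_nonzero, lead(b, 1) + lead(b, 2), b)
  ) %>%
  select(-first_nonzero)
```

For the example data both versions give the same result; they only differ when a zero appears after the first non-zero value.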
I asked a question a few months back about how to identify and keep only observations that follow a certain pattern: How can I identify patterns over several rows in a column and fill a new column with information about that pattern using R?
I want to take this a step further. In that question I just wanted to identify the pattern. Now, if the pattern appears several times within a group, how do I keep only the last occurrence of that pattern? For example, given df1, how can I achieve df2?
df1
TIME ID D
12:30:10 2 0
12:30:42 2 0
12:30:59 2 1
12:31:20 2 0
12:31:50 2 0
12:32:11 2 0
12:32:45 2 1
12:33:10 2 1
12:33:33 2 1
12:33:55 2 1
12:34:15 2 0
12:34:30 2 0
12:35:30 2 0
12:36:30 2 0
12:36:45 2 0
12:37:00 2 0
12:38:00 2 1
I want to end up with the following df2
df2
TIME ID D
12:33:55 2 1
12:34:15 2 0
12:34:30 2 0
12:35:30 2 0
12:36:30 2 0
12:36:45 2 0
12:37:00 2 0
12:38:00 2 1
Thoughts? There were some helpful answers in the question I linked above, but I now want to narrow it.
Here is a base R function that I find too complicated, but it gets what is asked for.
If I understood the pattern correctly, it doesn't matter if the last sequence ends in a 1 or a 0. The test with df1b has a last sequence ending in a 0.
keep_last_pattern <- function(data, col){
  x <- data[[col]]
  if(x[length(x)] == 0) x[length(x)] <- 1
  #
  i <- ave(x, cumsum(x), FUN = \(y) y[1] == 1 & length(y) > 1)
  r <- rle(i)
  l <- length(r$lengths)
  n <- which(as.logical(r$values))
  r$values[ n[-length(n)] ] <- 0
  r$values[l] <- r$lengths[l] == 1 && r$values[l] == 0
  j <- as.logical(inverse.rle(r))
  #
  data[j, ]
}
keep_last_pattern(df1, "D")
df1b <- df1
df1b[17, "D"] <- 0
keep_last_pattern(df1b, "D")
Do you want the rows in each ID between the second-to-last 1 and the last 1?
Here is a function that does that and can be applied to each ID.
library(dplyr)
extract_sequence <- function(x) {
  inds <- which(x == 1)                       # positions of all 1s
  inds[length(inds) - 1]:inds[length(inds)]   # from the second-to-last 1 to the last 1
}
df1 %>%
  group_by(ID) %>%
  slice(extract_sequence(D)) %>%
  ungroup()
# TIME ID D
# <chr> <int> <int>
#1 12:33:55 2 1
#2 12:34:15 2 0
#3 12:34:30 2 0
#4 12:35:30 2 0
#5 12:36:30 2 0
#6 12:36:45 2 0
#7 12:37:00 2 0
#8 12:38:00 2 1
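extract_sequence() assumes every ID has at least two 1s. Here is a hedged sketch of a guarded variant. The name extract_sequence_safe is hypothetical, and returning no rows for such an ID is an assumption about what is wanted:

```r
# Hypothetical variant: returns integer(0) when there are fewer than two 1s,
# so slice() keeps no rows for that ID instead of failing.
extract_sequence_safe <- function(x) {
  inds <- which(x == 1)
  n <- length(inds)
  if (n < 2) return(integer(0))
  inds[n - 1]:inds[n]
}

# The D column of df1 from the question
D <- c(0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1)
rows <- extract_sequence_safe(D)
```

It can be dropped into the pipeline above in place of extract_sequence(D).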
Not sure this will help as it's unclear what your pattern is.
Let's assume you have data like this, with one column indicating in some way whether the row matches a pattern or not:
set.seed(123)
df <- data.frame(
grp = sample(LETTERS[1:3], 10, replace = TRUE),
x = 1:10,
y = c(0,1,0,0,1,1,1,1,1,1),
pattern = rep(c("TRUE", "FALSE"),5)
)
If the aim is to keep only the last occurrence of pattern == "TRUE" per group, this might work:
library(dplyr)
df %>%
  filter(pattern == "TRUE") %>%
  group_by(grp) %>%
  slice_tail(n = 1)   # last matching row per group
# A tibble: 3 x 4
# Groups: grp [3]
grp x y pattern
<chr> <int> <dbl> <chr>
1 A 1 0 TRUE
2 B 9 1 TRUE
3 C 5 1 TRUE
Suppose I have the following data:
A <- c(4,4,4,4,4)
B <- c(1,2,3,4,4)
C <- c(1,2,4,4,4)
D <- c(3,2,4,1,4)
filt <- c(1,1,10,8,10)
data <- as.data.frame(rbind(A,B,C,D,filt))
data <- t(data)
data <- as.data.frame(data)
> data
A B C D filt
V1 4 1 1 3 1
V2 4 2 2 2 1
V3 4 3 4 4 10
V4 4 4 4 1 8
V5 4 4 4 4 10
I want to get counts of the occurrences of 1, 2, 3 and 4 for each variable, after filtering. In my attempt below, I get Error: length(rows) == 1 is not TRUE.
data %>%
dplyr::filter(filt ==1) %>%
plyr::summarize(A_count = count(A),
B_count = count(B))
I think I get the error because some of my columns do not contain all of the values 1-4. Is there a way to specify which values it should look for and to return 0 for those not found? I'm not sure how to do this, or whether there is a different workaround.
Any help is VERY appreciated!!!
This was a bit of a weird one. I didn't use classical plyr, but I think this is roughly what you're looking for. I removed the filtering column, filt, so as not to get counts of it:
library(dplyr)
data %>%
  filter(filt == 1) %>%
  select(-filt) %>%
  purrr::map_df(function(a_column){
    # count how often each of the values 1 to 4 occurs in this column
    purrr::map_int(1:4, function(num) sum(a_column == num))
  })
# A tibble: 4 x 4
A B C D
<int> <int> <int> <int>
1 0 1 1 0
2 0 1 1 1
3 0 0 0 1
4 2 0 0 0
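A base R sketch of the same idea: turning each column into a factor with explicit levels makes table() report a 0 for values that do not occur. The names kept and counts are just illustrative:

```r
data <- data.frame(A = c(4, 4, 4, 4, 4),
                   B = c(1, 2, 3, 4, 4),
                   C = c(1, 2, 4, 4, 4),
                   D = c(3, 2, 4, 1, 4),
                   filt = c(1, 1, 10, 8, 10))

# keep the rows where filt == 1 and drop the filter column
kept <- data[data$filt == 1, c("A", "B", "C", "D")]

# factor(..., levels = 1:4) forces all four values to appear in the table,
# with a count of 0 for the ones that are absent
counts <- sapply(kept, function(col) table(factor(col, levels = 1:4)))
```

counts is a matrix with one row per value (1 to 4) and one column per variable.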
Let's say I have the following:
>blob
id group growth
1 A 1
2 A 1
3 B 0
4 B 1
5 B 0
6 C 0
7 C 0
8 C 0
I would eventually like to pull out the number of successes out of the total per group. I have gotten this far:
blob %>%
group_by(group,growth) %>%
tally()
group growth n
A 1 2
B 0 2
B 1 1
C 0 3
I would like to have something like
group success total
A 2 2
B 1 3
C 0 3
I have also tried
blob %>%
  group_by(group,growth) %>%
  tally() %>%
  summarise(fail= n[factor(growth)==1],total = sum(n))
but I get an error because not all growths are equal to 1.
n() is a dplyr function that returns the number of rows in the current group. If we group_by the group column, we can use n() to count the rows and sum() to add up the successes.
library(dplyr)
dt2 <- dt %>%
  group_by(group) %>%
  summarise(success = sum(growth), total = n())
Data Preparation
dt <- read.table(text = "id group growth
1 A 1
2 A 1
3 B 0
4 B 1
5 B 0
6 C 0
7 C 0
8 C 0",
header = TRUE, stringsAsFactors = FALSE)
Here's a simple example with data.table, where df1 holds the data from the question:
library(data.table)
setDT(df1)   # convert to a data.table by reference
df1[, .(success = sum(growth), total = .N), by=group]
group success total
1: A 2 2
2: B 1 3
3: C 0 3
a=Map(tapply,list(dt$growth),list(dt$group),c(sum,length))
`names<-`(do.call(cbind.data.frame,a),c("Successes","Totals"))
Successes Totals
A 2 2
B 1 3
C 0 3
You can use mapply instead of Map:
mapply(tapply,list(dt$growth),list(dt$group),c(sum,length))
[,1] [,2]
A 2 2
B 1 3
C 0 3
You can then give the columns whatever names you want. (The result is a matrix, so convert it to a data frame if you need one.)
Is it possible to group by one column and count the instances of each value in all other columns using R (dplyr)? For example, the following dataframe
x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1
turns into this (note: y is the value that is being counted).
EDIT: to explain the transformation, x is what I'm grouping by. For each value of x, I want to count how many times 0, 1 and 2 appear in each of the other columns. For example, the first row of the transformed dataframe counts how often the value 0 (y) appears in the rows where x = 1: once in column a, twice in column b and once in column c.
x y a b c
1 0 1 2 1
1 1 1 0 2
1 2 1 1 0
2 1 1 0 1
2 2 0 1 0
An approach with a combination of the melt and dcast functions of data.table or reshape2:
library(data.table) # v1.9.5+
dt.new <- dcast(melt(setDT(df), id.vars="x"), x + value ~ variable)
this gives:
dt.new
# x value a b c
# 1: 1 0 1 2 1
# 2: 1 1 1 0 2
# 3: 1 2 1 1 0
# 4: 2 1 1 0 1
# 5: 2 2 0 1 0
In dcast you can specify which aggregation function to use, but that is not necessary in this case because the default aggregation function is length. If you do not specify one, you will get a warning about that:
Aggregation function missing: defaulting to length
Furthermore, if you do not explicitly convert the dataframe to a data.table, data.table will redirect to reshape2 (see the explanation from @Arun in the comments). Consequently this method can be used with reshape2 as well:
library(reshape2)
df.new <- dcast(melt(df, id.vars="x"), x + value ~ variable)
Used data:
df <- read.table(text="x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1", header=TRUE)
I'd use a combination of gather and spread from the tidyr package, and count from dplyr:
library(dplyr)
library(tidyr)
df = data.frame(x = c(1,1,1,2), a = c(0,1,2,1), b = c(0,0,2,2), c = c(0,1,1,1))
res = df %>%
gather(variable, value, -x) %>%
count(x, variable, value) %>%
spread(variable, n, fill = 0)
# Source: local data frame [5 x 5]
#
# x value a b c
# 1 1 0 1 2 1
# 2 1 1 1 0 2
# 3 1 2 1 1 0
# 4 2 1 1 0 1
# 5 2 2 0 1 0
Essentially, you first change the format of the dataset to:
head(df %>%
gather(variable, value, -x))
# x variable value
#1 1 a 0
#2 1 a 1
#3 1 a 2
#4 2 a 1
#5 1 b 0
#6 1 b 0
Which allows you to use count to get the information regarding how often certain values occur in columns a to c. After that, you reformat the dataset to your required format using spread.
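In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). Here is a sketch of the same pipeline with the newer functions, assuming tidyr 1.1 or later:

```r
library(dplyr)
library(tidyr)

df <- data.frame(x = c(1, 1, 1, 2),
                 a = c(0, 1, 2, 1),
                 b = c(0, 0, 2, 2),
                 c = c(0, 1, 1, 1))

res <- df %>%
  # long format: one row per (x, variable, value)
  pivot_longer(-x, names_to = "variable", values_to = "value") %>%
  # how often each value occurs per x and variable
  count(x, variable, value) %>%
  # back to wide format, filling missing combinations with 0
  pivot_wider(names_from = variable, values_from = n, values_fill = 0)
```

The steps are the same as above: reshape to long format, count, then reshape back to wide format.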