This question already has answers here:
cumsum by group [duplicate]
(2 answers)
Closed 4 years ago.
I have this example dataset:
df <- data.frame(ID = c(1, 1, 1, 2, 2, 2),
                 A = c("2018-10-12", "2018-10-12", "2018-10-13",
                       "2018-10-14", "2018-10-15", "2018-10-16"),
                 B = c(1, 5, 7, 2, 54, 202))
ID A B
1 1 2018-10-12 1
2 1 2018-10-12 5
3 1 2018-10-13 7
4 2 2018-10-14 2
5 2 2018-10-15 54
6 2 2018-10-16 202
What I'm trying to do is create a column C that holds the running sum of B within each ID, up to and including each row. For instance, the output I'm seeking is:
ID A B C
1 1 2018-10-12 1 1
2 1 2018-10-12 5 6
3 1 2018-10-13 7 13
4 2 2018-10-14 2 2
5 2 2018-10-15 54 56
6 2 2018-10-16 202 258
I generally use subsets to do individual SUMIF-style calculations for questions like this, but I'm not sure how to do it in a new column.
My end goal is to determine the dates that each ID (if applicable) crosses 50.
Thanks!
We can do a grouped cumulative sum to create the 'C' column:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(C = cumsum(B))
Or use data.table:
library(data.table)
setDT(df)[, C := cumsum(B), by = ID]
Or with base R:
df$C <- with(df, ave(B, ID, FUN = cumsum))
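Since the stated end goal is to find the dates on which each ID crosses 50, here is a minimal sketch building on the grouped cumulative sum above (the threshold of 50 comes from the question; the column selection is an assumption). Note that in this sample ID 1 never exceeds 50, so only ID 2 appears:

library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(C = cumsum(B)) %>%  # running total per ID
  filter(C > 50) %>%         # keep rows past the threshold
  slice(1) %>%               # first such row per ID
  select(ID, A, C)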
Related
I have a dataframe including a column of factors that I would like to subset to select every nth row, after grouping by factor level. For example,
my_df <- data.frame(col1 = c(1:12), col2 = rep(c("A","B", "C"), 4))
my_df
col1 col2
1 1 A
2 2 B
3 3 C
4 4 A
5 5 B
6 6 C
7 7 A
8 8 B
9 9 C
10 10 A
11 11 B
12 12 C
Subsetting to select every 2nd row should yield my_new_df as,
col1 col2
1 4 A
2 10 A
3 5 B
4 11 B
5 6 C
6 12 C
I tried in dplyr:
my_df %>% group_by(col2) %>%
my_df[seq(2, nrow(my_df), 2), ] -> my_new_df
I get an error:
Error: Can't subset columns that don't exist.
x Locations 4, 6, 8, 10, and 12 don't exist.
ℹ There are only 2 columns.
To see if the nrow function was a problem, I tried using the number directly. So,
my_df %>% group_by(col2) %>%
my_df[seq(2, 4, 2), ] -> my_new_df
Also gave an error,
Error: Can't subset columns that don't exist.
x Location 4 doesn't exist.
ℹ There are only 2 columns.
Run `rlang::last_error()` to see where the error occurred.
My expectation was that it would run the subsetting on each group of data and then combine them into 'my_new_df'. My understanding of how group_by works is clearly wrong, but I am stuck on how to move past this error. Any help would be much appreciated.
Try slice(). (Your attempt failed because the pipe inserts the grouped data as the first argument of `[`, so your seq() of row numbers was shifted into the column position, hence the error about columns.)
my_df %>%
  group_by(col2) %>%
  slice(seq(from = 2, to = n(), by = 2))
# A tibble: 6 x 2
# Groups: col2 [3]
col1 col2
<int> <chr>
1 4 A
2 10 A
3 5 B
4 11 B
5 6 C
6 12 C
You might want to ungroup after slicing if you want to do other operations not based on col2.
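For instance, a minimal sketch where ungroup() is the only addition:

my_new_df <- my_df %>%
  group_by(col2) %>%
  slice(seq(from = 2, to = n(), by = 2)) %>%
  ungroup()  # drop the grouping so later verbs see the whole tibble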
Here is a data.table option:
library(data.table)
data <- as.data.table(my_df)
data[(rowid(col2) %% 2) == 0]
col1 col2
1: 4 A
2: 5 B
3: 6 C
4: 10 A
5: 11 B
6: 12 C
Or base R:
my_df[as.logical(with(my_df, ave(col1, col2,
      FUN = function(x) seq_along(x) %% 2 == 0))), ]
col1 col2
4 4 A
5 5 B
6 6 C
10 10 A
11 11 B
12 12 C
I am struggling with a maybe-easy question. I have a data frame of 1 column with n rows (n a multiple of 3). I would like to add a second column with integers like 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, ... How can I achieve this with dplyr as a general solution for any number of rows (always a multiple of 3)?
I tried this:
df <- tibble(Col1 = c(1:12)) %>%
  mutate(Col2 = rep(1:4, each = 3))
This works, but I would like a solution that generalizes to any n rows with each = 3. Many thanks!
You can specify the each and length.out parameters in rep():
library(dplyr)
tibble(Col1 = c(1:12)) %>%
  mutate(Col2 = rep(row_number(), each = 3, length.out = n()))
# Col1 Col2
# <int> <int>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 2
# 5 5 2
# 6 6 2
# 7 7 3
# 8 8 3
# 9 9 3
#10 10 4
#11 11 4
#12 12 4
We can use gl():
library(dplyr)
df %>%
  mutate(col2 = as.integer(gl(n(), 3, n())))  # gl(n, k, length): levels 1..n, each repeated k times
Integer division with %/% 3 over the sequence 0, 1, 2, ... gives 0, 0, 0, 1, 1, 1, ...; adding 1 produces the desired sequence, so this also works:
df %>% mutate(col2 = 1 + (row_number() - 1) %/% 3)
# A tibble: 12 x 2
Col1 col2
<int> <dbl>
1 1 1
2 2 1
3 3 1
4 4 2
5 5 2
6 6 2
7 7 3
8 8 3
9 9 3
10 10 4
11 11 4
12 12 4
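Any of these approaches generalizes to other row counts; a quick hypothetical check with 9 rows instead of 12:

library(dplyr)
tibble(Col1 = 1:9) %>%
  mutate(Col2 = rep(row_number(), each = 3, length.out = n()))
# Col2 is 1 1 1 2 2 2 3 3 3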
I have data with about 1000 groups, and each group is ordered from 1 to 100 (the values can be any numbers up to 100).
As I was looking through the data, I found that some groups had bad orderings, i.e., the order would run up toward 100 and then suddenly a 24 would show up.
How can I delete all of these erroneous rows?
As you can see from the picture above (before -> after), I would like to find all rows that don't follow the increasing order within their group and just delete them.
Any help would be great!
lag() returns the previous value, so order - lag(order) computes the difference between the current row and the one before it; the filter keeps only non-negative differences, i.e. rows where the current value is at least the previous one. Because lag() makes the first value NA, the order == min(order) condition keeps each group's first row. I keep the helper column diff so you can check the result, but you can drop it with %>% select(-diff).
library(dplyr)
df1 %>%
  group_by(gruop) %>%
  mutate(diff = order - lag(order)) %>%
  filter(diff >= 0 | order == min(order))
# A tibble: 8 x 3
# Groups: gruop [2]
gruop order diff
<int> <int> <int>
1 1 1 NA
2 1 3 2
3 1 5 2
4 1 10 5
5 2 1 NA
6 2 4 3
7 2 4 0
8 2 8 4
Data
df1 <- read.table(text="
gruop order
1 1
1 3
1 5
1 10
1 2
2 1
2 4
2 4
2 8
2 3
",header=T, stringsAsFactors = F)
Assuming the order column increments by 1 every time, we can use ave, removing those rows whose difference from the previous row within their group is not 1.
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) == 1, ]
# group order
#1 1 1
#2 1 2
#3 1 3
#4 1 4
#6 2 1
#7 2 2
#8 2 3
#9 2 4
EDIT
For the updated example, we can just change the comparison
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) >= 0, ]
Playing with data.table:
library(data.table)
setDT(df1)[, diffo := c(1, diff(order)), group][diffo == 1, .(group, order)]
group order
1: 1 1
2: 1 2
3: 1 3
4: 1 4
5: 2 1
6: 2 2
7: 2 3
8: 2 4
Where df1 is:
df1 <- data.frame(
group = rep(1:2, each = 5),
order = c(1:4, 2, 1:4, 3)
)
EDIT
If you only need increasing order, and not steps of one, then you can do:
df3 <- transform(df1, order = c(1,3,5,10,2,1,4,7,9,3))
setDT(df3)[, diffo := c(1, diff(order)), group][diffo >= 1, .(group, order)]
group order
1: 1 1
2: 1 3
3: 1 5
4: 1 10
5: 2 1
6: 2 4
7: 2 7
8: 2 9
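As an aside, if the requirement is that order never fall below any earlier value in the group (not just the immediately preceding one), a cummax()-based sketch also works; this is an alternative not given in the original answers, and it additionally catches non-adjacent dips:

library(dplyr)
df3 %>%
  group_by(group) %>%
  filter(order == cummax(order)) %>%  # keep rows at or above the running maximum
  ungroup()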
This question already has answers here:
Add count of unique / distinct values by group to the original data
(3 answers)
Closed 4 years ago.
I'm stuck trying to do some counting on a data frame. The gist is to group by one variable and then split further into subgroups based on a second variable. From there I want to count the number of subgroups within each group. The sample code is this:
set.seed(123456)
df <- data.frame(User = c(rep("A", 5), rep("B", 4), rep("C", 6)),
                 Rank = c(rpois(5, 1), rpois(4, 2), rpois(6, 3)))
# This results in an error
df %>% group_by(User) %>% group_by(Rank) %>% summarize(Res = n_groups())
So what I want is 'User A' to have 3, 'User B' to have 4, and 'User C' to have 5. In other words the data frame df would end up looking like:
User Rank Result
1 A 2 3
2 A 2 3
3 A 1 3
4 A 0 3
5 A 0 3
6 B 1 4
7 B 2 4
8 B 0 4
9 B 6 4
10 C 1 5
11 C 4 5
12 C 3 5
13 C 5 5
14 C 5 5
15 C 8 5
I'm still learning dplyr, so I'm unsure how I should do it. How can this be achieved? Non-dplyr answers are also very welcome. Thanks in advance!
Try this:
df %>% group_by(User) %>% mutate(Result=length(unique(Rank)))
Or, since this counts distinct values, with n_distinct():
df %>% group_by(User) %>% mutate(Result=n_distinct(Rank))
A base R option would be using ave
df$Result <- with(df, ave(Rank, User, FUN = function(x) length(unique(x))))
df$Result
#[1] 3 3 3 3 3 4 4 4 4 5 5 5 5 5 5
and a data.table option is
library(data.table)
setDT(df)[, Result := uniqueN(Rank), by = User]
Newbie question
I have 2 columns in a data frame that looks like
Name Size
A 1
A 1
A 1
A 2
A 2
B 3
B 5
C 7
C 17
C 17
I need a third column that runs as a continuing sequence until either Name or Size changes value:
Name Size NewCol
A 1 1
A 1 2
A 1 3
A 2 1
A 2 2
B 3 1
B 5 1
C 7 1
C 17 1
C 17 2
Basically, a dummy field to reference each record separately even when Name and Size are the same.
The index increments from k to k+1 when it encounters the same pair of Name and Size values again; otherwise it resets.
So if my data set had 200 rows with Name A and Size 1, they would be indexed 1 to 200; when it moved on to A and 2, the index would reset.
We can try with data.table
library(data.table)
setDT(df1)[, NewCol := match(Size, unique(Size)), by = .(Name)]
df1
# Name Size NewCol
#1: A 1 1
#2: A 1 1
#3: A 2 2
#4: B 3 1
#5: C 7 1
#6: C 17 2
If the expected output is indeed a running counter within each Name/Size pair (as the example above suggests), this would produce it:
setDT(df1)[, NewCol := seq_len(.N), .(Name, Size)]
Or using dplyr
library(dplyr)
df1 %>%
group_by(Name) %>%
mutate(NewCol = match(Size, unique(Size)))
Or
df1 %>%
group_by(Name) %>%
mutate(NewCol = row_number())
Or we can use the same approach with ave from base R, as sketched below.
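A minimal base R sketch of the within-(Name, Size) counter using ave(); the code is my own filling-in, since the original answer did not spell it out:

# seq_along(df1$Name) just supplies a vector of the right length;
# ave() then regenerates a 1, 2, 3, ... sequence within each Name/Size pair
df1$NewCol <- ave(seq_along(df1$Name), df1$Name, df1$Size, FUN = seq_along)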
I guess this might not be the most efficient solution, but at least a good start :
# Reproducing the example
df <- data.frame(Name = LETTERS[c(1, 1, 1, 1, 1, 2, 2, 3, 3, 3)],
                 Size = c(1, 1, 1, 2, 2, 3, 5, 7, 17, 17))
# Create a new column with a unique Name/Size id
df$NewCol <- paste0(df$Name, df$Size)
# Modify the column to hold a running count instead (assumes identical
# Name/Size pairs sit on consecutive rows)
df$NewCol <- unlist(sapply(unique(df$NewCol), function(id) 1:table(df$NewCol)[id]))
df
Name Size NewCol
1 A 1 1
2 A 1 2
3 A 1 3
4 A 2 1
5 A 2 2
6 B 3 1
7 B 5 1
8 C 7 1
9 C 17 1
10 C 17 2