How to create column, from the cumulative column in r? - r

df <- data.frame(dat=c("11-03","12-03","13-03"),
c=c(0,15,20,4,19,21,2,10,14), d=rep(c("A","B","C"),each=3))
suppose c has the cumulative values. I want to create a column daily that will look like
dat c d daily
1 11-03 0 A 0
2 12-03 15 A 15
3 13-03 20 A 5
4 11-03 4 B 4
5 12-03 19 B 15
6 13-03 21 B 2
7 11-03 2 C 2
8 12-03 10 C 8
9 13-03 14 C 4
for each value of d and dat (date wise) a daily change in value is generated from the column c has that cumulative value.

We can get the diff of 'c' after grouping by 'd'
library(dplyr)
df %>%
group_by(d) %>%
mutate(daily = c(first(c), diff(c)))
# A tibble: 9 x 4
# Groups: d [3]
# dat c d daily
# <fct> <dbl> <fct> <dbl>
#1 11-03 0 A 0
#2 12-03 15 A 15
#3 13-03 20 A 5
#4 11-03 4 B 4
#5 12-03 19 B 15
#6 13-03 21 B 2
#7 11-03 2 C 2
#8 12-03 10 C 8
#9 13-03 14 C 4
Or do the difference between the 'c' and the lag of 'c'
df %>%
group_by(d) %>%
mutate(daily = c - lag(c))

Data.table solution:
df <- as.data.table(df)
df[, daily:= c - shift(c, fill = 0),by=d]
Shift is datatable's lag operator, so basically we subtract from C its previous value within each group.
fill = 0 replaces NAs with zeros, because within each group, there is no previous value (shift(c)) for the first element.

Related

How to fill a column by group with sampled row numbers according to n per group

I am working with a dataframe in R. I have groups stated by column Group1. I need to create a new column named sampled where I need to fill with a specific value after using sample per group from 1 to each number of rows per group. Here is the data I have:
library(tidyverse)
#Data
dat <- data.frame(Group1=sample(letters[1:3],15,replace = T))
Then dat looks like this:
dat
Group1
1 b
2 a
3 a
4 c
5 c
6 c
7 a
8 b
9 c
10 b
11 a
12 b
13 c
14 c
15 c
In order to get the N per group, we do this:
#Code
dat %>%
arrange(Group1) %>%
group_by(Group1) %>%
mutate(N=n())
Which produces:
# A tibble: 15 x 2
# Groups: Group1 [3]
Group1 N
<chr> <int>
1 a 4
2 a 4
3 a 4
4 a 4
5 b 4
6 b 4
7 b 4
8 b 4
9 c 7
10 c 7
11 c 7
12 c 7
13 c 7
14 c 7
15 c 7
What I need to do is next. I have the N per group, so I have to create a sample of 3 numbers from 1:N. In the case of group a having N=4 it would be sample(1:4,3) which produces [1] 2 4 3. With this in the group a I need that rows belonging to sampled values must be filled with 999. So for first group we would have:
Group1 N sampled
<chr> <int> <int>
1 a 4 NA
2 a 4 999
3 a 4 999
4 a 4 999
And then the same for the rest of groups. In this way using sample we will have random values per group. Is that possible to do using dplyr or tidyverse. Many thanks!
You could try:
set.seed(3242)
library(dplyr)
dat %>%
arrange(Group1) %>%
add_count(Group1, name = 'N') %>%
group_by(Group1) %>%
mutate(
sampled = case_when(
row_number() %in% sample(1:n(), 3L) ~ 999L,
TRUE ~ NA_integer_
)
)
Output:
# A tibble: 15 × 3
# Groups: Group1 [3]
Group1 N sampled
<chr> <int> <int>
1 a 4 999
2 a 4 999
3 a 4 NA
4 a 4 999
5 b 4 999
6 b 4 999
7 b 4 999
8 b 4 NA
9 c 7 NA
10 c 7 999
11 c 7 NA
12 c 7 999
13 c 7 NA
14 c 7 NA
15 c 7 999

Keeping all NAs in dplyr distinct function

I have a data.frame (the eBird basic dataset) where many observers may upload a record from a same sighting to a database, in this case, the event is given a "group identifier"; when not from a group session, a NA will appear in the database; so I'm trying to filter out all those duplicates from group events and keep all NAs, I'm trying to do this without splitting the dataframe in two:
library(dplyr)
set.seed(1)
df <- tibble(
x = sample(c(1:6, NA), 30, replace = T),
y = sample(c(letters[1:4]), 30, replace = T)
)
df %>% count(x,y)
gives:
> df %>% count(x,y)
# A tibble: 20 x 3
x y n
<int> <chr> <int>
1 1 a 1
2 1 b 2
3 2 a 1
4 2 b 1
5 2 c 1
6 2 d 3
7 3 a 1
8 3 b 1
9 3 c 4
10 4 d 1
11 5 a 1
12 5 b 2
13 5 c 1
14 5 d 1
15 6 a 1
16 6 c 2
17 NA a 1
18 NA b 2
19 NA c 2
20 NA d 1
I want no NA at x to be grouped together, as here happened with "NA b" and "NA c" combinations; distinct function has no information on not taking NAs into the computation; is splitting the dataframe the only solution?
With distinct an option is to create a new column based on the NA elements in 'x'
library(dplyr)
df %>%
mutate(x1 = row_number() * is.na(x)) %>%
distinct %>%
select(-x1)
Or we can use duplicated with an OR (|) condition to return all NA elements in 'x' with filter
df %>%
filter(is.na(x)|!duplicated(cur_data()))
# A tibble: 20 x 2
# x y
# <int> <chr>
# 1 1 b
# 2 4 b
# 3 NA a
# 4 1 d
# 5 2 c
# 6 5 a
# 7 NA d
# 8 3 c
# 9 6 b
#10 2 b
#11 3 b
#12 1 c
#13 5 d
#14 2 d
#15 6 d
#16 2 a
#17 NA c
#18 NA a
#19 1 a
#20 5 b

R Get minimum value in dataframe selecting rows on 2 columns [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 2 years ago.
I have a dataframe like the one I've simplified below. I want to first select rows with the same value based on column X, then in that selection select rows with the same value based on column Y. Then from that selection, I want to take the minimal value. I'm now using a forloop, but seems there must be an easier way. Thanks!
set.seed(123)
data<-data.frame(X=rep(letters[1:3], each=8),Y=rep(c(1,2)),Z=sample(1:100, 12))
data
X Y Z
1 a 1 76
2 a 1 22
3 a 2 32
4 a 2 23
5 b 1 14
6 b 1 40
7 b 2 39
8 b 2 35
9 c 1 15
10 c 1 13
11 c 2 21
12 c 2 42
Desired outcome:
X Y Z
2 a 1 22
4 a 2 23
5 b 1 14
8 b 2 35
10 c 1 13
11 c 2 21
Here is a data.table solution:
library(data.table)
data = data.table(data)
data[, min(Z), by=c("X", "Y")]
EDIT based on OP's comment:
If there is a NA value in one of the columns we sort by, an additional row is created:
data[2,2] <-NA
data[, min(Z,na.rm = T), by=c("X", "Y")]
X Y V1
1: a 1 31
2: a NA 79
3: a 2 14
4: b 1 31
5: b 2 14
6: c 1 50
7: c 2 25
library(tidyverse)
data %>%
group_by(X, Y) %>%
summarise(Z = min(Z))
Will do the trick! The other answer right now is the data.table way, this is tidyverse. Both are extremely powerful ways to approach data cleaning & manipulation - it could be helpful to familiarize yourself with one!
In base you can use aggregate to get min from Z grouped by the remaining columns like:
aggregate(Z~.,data,min)
# X Y Z
#1 a 1 31
#2 b 1 31
#3 c 1 50
#4 a 2 14
#5 b 2 14
#6 c 2 25
In case there is an NA in the groups:
data[2,2] <-NA
Ignore it:
aggregate(Z~.,data,min)
# X Y Z
#1 a 1 31
#2 b 1 31
#3 c 1 50
#4 a 2 14
#5 b 2 14
#6 c 2 25
Show it:
aggregate(data$Z, list(X=data$X, Y=addNA(data$Y)), min)
# X Y x
#1 a 1 31
#2 b 1 31
#3 c 1 50
#4 a 2 14
#5 b 2 14
#6 c 2 25
#7 a <NA> 79
This code could benefit from splitting it up in multiple lines, but it works. in Base-R
do.call(rbind,
lapply(unlist(lapply(split(data,data$X), function(x) split(x,x$Y)),recursive=F), function(y) y[y$Z==min(y$Z),])
)
X Y Z
a.1 a 1 31
a.2 a 2 14
b.1 b 1 31
b.2 b 2 14
c.1 c 1 50
c.2 c 2 25

Dynamic select expression in function [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 4 years ago.
I am trying to write a function that will convert this data frame
library(dplyr)
library(rlang)
library(purrr)
df <- data.frame(obj=c(1,1,2,2,3,3,3,4,4,4),
S1=rep(c("a","b"),length.out=10),PR1=rep(c(3,7),length.out=10),
S2=rep(c("c","d"),length.out=10),PR2=rep(c(7,3),length.out=10))
obj S1 PR1 S2 PR2
1 1 a 3 c 7
2 1 b 7 d 3
3 2 a 3 c 7
4 2 b 7 d 3
5 3 a 3 c 7
6 3 b 7 d 3
7 3 a 3 c 7
8 4 b 7 d 3
9 4 a 3 c 7
10 4 b 7 d 3
In to this data frame
df %>% {bind_rows(select(., obj, S = S1, PR = PR1),
select(., obj, S = S2, PR = PR2))}
obj S PR
1 1 a 3
2 1 b 7
3 2 a 3
4 2 b 7
5 3 a 3
6 3 b 7
7 3 a 3
8 4 b 7
9 4 a 3
10 4 b 7
11 1 c 7
12 1 d 3
13 2 c 7
14 2 d 3
15 3 c 7
16 3 d 3
17 3 c 7
18 4 d 3
19 4 c 7
20 4 d 3
But I would like the function to be able to work with any number of columns. So it would also work if I had S1, S2, S3, S4 or if there was an additional category ie DS1, DS2. Ideally the function would take as arguments the patterns that determine which columns are stacked on top of each other, the number of sets of each column, the names of the output columns and the names of any variables that should also be kept.
This is my attempt at this function:
stack_col <- function(df, patterns, nums, cnames, keep){
keep <- enquo(keep)
build_exp <- function(x){
paste0("!!sym(cnames[[", x, "]]) := paste0(patterns[[", x, "]],num)") %>%
parse_expr()
}
exps <- map(1:length(patterns), ~expr(!!build_exp(.)))
sel_fun <- function(num){
df %>% select(!!keep,
!!!exps)
}
map(nums, sel_fun) %>% bind_rows()
}
I can get the sel_fun part to work for a fixed number of patterns like this
patterns <- c("S", "PR")
cnames <- c("Species", "PR")
keep <- quo(obj)
sel_fun <- function(num){
df %>% select(!!keep,
!!sym(cnames[[1]]) := paste0(patterns[[1]], num),
!!sym(cnames[[2]]) := paste0(patterns[[2]], num))
}
sel_fun(1)
But the dynamic version that I have tried does not work and gives this error:
Error: `:=` can only be used within a quasiquoted argument
Here is a function to get the expected output. Loop through the 'patterns' and the corresponding new column names ('cnames') using map2, gather into 'long' format, rename the 'val' column to the 'cnames' passed into the function, bind the columns (bind_cols) and select the columns of interest
stack_col <- function(dat, pat, cname, keep) {
purrr::map2(pat, cname, ~
dat %>%
dplyr::select(keep, matches(.x)) %>%
tidyr::gather(key, val, matches(.x)) %>%
dplyr::select(-key) %>%
dplyr::rename(!! .y := val)) %>%
dplyr::bind_cols(.) %>%
dplyr::select(keep, cname)
}
stack_col(df, patterns, cnames, 1)
# obj Species PR
#1 1 a 3
#2 1 b 7
#3 2 a 3
#4 2 b 7
#5 3 a 3
#6 3 b 7
#7 3 a 3
#8 4 b 7
#9 4 a 3
#10 4 b 7
#11 1 c 7
#12 1 d 3
#13 2 c 7
#14 2 d 3
#15 3 c 7
#16 3 d 3
#17 3 c 7
#18 4 d 3
#19 4 c 7
#20 4 d 3
Also, multiple patterns reshaping can be done with data.table::melt
library(data.table)
melt(setDT(df), measure = patterns("^S\\d+", "^PR\\d+"),
value.name = c("Species", "PR"))[, variable := NULL][]
This solves your problem, although it does not fix your function:
The idea is to use gather and spread on the columns which starts with the specific pattern. Therefore I create a regex which matches the column names and then first gather all of them, extract the group and the rename the groups with the cnames. Finally spread takes separates the new columns.
library(dplyr)
library(purrr)
library(tidyr)
library(stringr)
patterns <- c("S", "PR")
cnames <- c("Species", "PR")
names(cnames) <- patterns
complete_pattern <- str_c("^", str_c(patterns, collapse = "|^"))
df %>%
mutate(rownumber = 1:n()) %>%
gather(new_variable, value, matches(complete_pattern)) %>%
mutate(group = str_extract(new_variable, complete_pattern),
group = str_replace_all(group, cnames),
group_number = str_extract(new_variable, "\\d+")) %>%
select(-new_variable) %>%
spread(group, value)
# obj rownumber group_number PR Species
# 1 1 1 1 3 a
# 2 1 1 2 7 c
# 3 1 2 1 7 b
# 4 1 2 2 3 d
# 5 2 3 1 3 a
# 6 2 3 2 7 c
# 7 2 4 1 7 b
# 8 2 4 2 3 d
# 9 3 5 1 3 a
# 10 3 5 2 7 c
# 11 3 6 1 7 b
# 12 3 6 2 3 d
# 13 3 7 1 3 a
# 14 3 7 2 7 c
# 15 4 8 1 7 b
# 16 4 8 2 3 d
# 17 4 9 1 3 a
# 18 4 9 2 7 c
# 19 4 10 1 7 b
# 20 4 10 2 3 d

Assign single value to grouped variable, R

I have a data.frame with two variables. I need to group them by var1 and replace every x in var2 with the unique different value in that group.
For example:
var1 var2
1 1 a
2 2 a
3 2 x
4 3 b
5 4 c
6 5 a
7 6 c
8 6 x
9 7 c
10 8 x
11 8 b
12 8 b
13 9 a
Outcome should be:
var1 var2
1 1 a
2 2 a
3 2 a <-
4 3 b
5 4 c
6 5 a
7 6 c
8 6 c <-
9 7 c
10 8 b <-
11 8 b
12 8 b
13 9 a
I did manage to solve this example:
dat <- data.frame(var1=c(1,2,2,3,4,5,6,6,7,8,8,8,9), var2=c("a","a","x","b","c","a","a","x","c","x","b","b","a"))
dat %>% group_by(var1) %>% mutate(
var2 = as.character(var2),
var2 = ifelse(var2 == 'x',var2[order(var2)][1],var2))
But this does not work for my real data because of the ordering :(
I would need another approach, I think of something like checking explicit for "not x" but I did not came to a solution.
Any help appreciatet!
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'var1', we get the 'var2' that are not 'x', select the first observation and assign (:=) it to 'var2'.
library(data.table)
setDT(df1)[, var2 := var2[var2!='x'][1], var1]
Or with dplyr
library(dplyr)
df1 %>%
group_by(var1) %>%
mutate(var2 = var2[var2!="x"][1])
# var1 var2
# <int> <chr>
#1 1 a
#2 2 a
#3 2 a
#4 3 b
#5 4 c
#6 5 a
#7 6 c
#8 6 c
#9 7 c
#10 8 b
#11 8 b
#12 8 b
#13 9 a

Resources