I have a simple data frame in a tidy format:
group variable value
<fct> <chr> <dbl>
1 fishers_here 100
1 money_per_fisher 2000
1 unnecessary_variable 10
2 fishers_here 140
2 money_per_fisher 8000
2 unnecessary_variable 304
3 fishers_here 10
3 money_per_fisher 9000
....
For each group I'd like to have the variable "total money in group", which is just fishers_here * money_per_fisher. Basically I'd like it to look like this:
group variable value
<fct> <chr> <dbl>
1 fishers_here 100
1 money_per_fisher 2000
1 unnecessary_variable 10
1 TOTAL_MONEY 200000
....
Is there a simple way to get this done with tidyverse?
By simple I mean without having to filter, summarise, add the variable column back in and then join the two now separate dataframes.
You can spread, do the multiplication, and then gather back up. Note that I'm assuming the group number in row 6 was a typo, as I commented: it should be group 2 instead of group 1. If that's not the case, some additional cleaning steps are needed. You can also sort the resulting rows however you want (e.g. to put the rows for each group back together).
library(tidyverse)
tbl <- read_table2(
"group variable value
1 fishers_here 100
1 money_per_fisher 2000
1 unnecessary_variable 10
2 fishers_here 140
2 money_per_fisher 8000
2 unnecessary_variable 304
3 fishers_here 10
3 money_per_fisher 9000"
)
tbl %>%
spread(variable, value) %>%
mutate(total_money_in_group = money_per_fisher * fishers_here) %>%
gather(variable, value, -group)
#> # A tibble: 12 x 3
#> group variable value
#> <dbl> <chr> <dbl>
#> 1 1 fishers_here 100
#> 2 2 fishers_here 140
#> 3 3 fishers_here 10
#> 4 1 money_per_fisher 2000
#> 5 2 money_per_fisher 8000
#> 6 3 money_per_fisher 9000
#> 7 1 unnecessary_variable 10
#> 8 2 unnecessary_variable 304
#> 9 3 unnecessary_variable NA
#> 10 1 total_money_in_group 200000
#> 11 2 total_money_in_group 1120000
#> 12 3 total_money_in_group 90000
Created on 2019-02-04 by the reprex package (v0.2.1)
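In current tidyr, spread() and gather() are superseded by pivot_wider() and pivot_longer(). Below is a minimal sketch of the same round trip, assuming tidyr 1.0 or later. The tribble just retypes the sample data from the question, and values_drop_na = TRUE is an addition of mine that drops the NA row group 3 would otherwise gain for unnecessary_variable.

```r
library(dplyr)
library(tidyr)

# Sample data retyped from the question
tbl <- tibble::tribble(
  ~group, ~variable,              ~value,
  1,      "fishers_here",         100,
  1,      "money_per_fisher",     2000,
  1,      "unnecessary_variable", 10,
  2,      "fishers_here",         140,
  2,      "money_per_fisher",     8000,
  2,      "unnecessary_variable", 304,
  3,      "fishers_here",         10,
  3,      "money_per_fisher",     9000
)

result <- tbl %>%
  # one column per variable, one row per group
  pivot_wider(names_from = variable, values_from = value) %>%
  mutate(total_money_in_group = fishers_here * money_per_fisher) %>%
  # back to long format; drop cells that only exist because of the widening
  pivot_longer(-group, names_to = "variable", values_to = "value",
               values_drop_na = TRUE)

result
```

Because pivot_longer() walks the wide table row by row, each group's rows stay together without an extra arrange().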
An option would be to filter the rows for 'money_per_fisher' and 'fishers_here', group by 'group', summarise to get the prod of 'value', bind the rows with the original data, and arrange by 'group'.
library(tidyverse)
df1 %>%
filter(variable %in% c('fishers_here', 'money_per_fisher')) %>%
group_by(group) %>%
summarise(variable = "total_money_in_group", value = prod(value)) %>%
bind_rows(df1, .) %>%
arrange(group)
# A tibble: 11 x 3
# group variable value
# <int> <chr> <dbl>
# 1 1 fishers_here 100
# 2 1 money_per_fisher 2000
# 3 1 unnecessary_variable 10
# 4 1 total_money_in_group 200000
# 5 2 fishers_here 140
# 6 2 money_per_fisher 8000
# 7 2 unnecessary_variable 304
# 8 2 total_money_in_group 1120000
# 9 3 fishers_here 10
#10 3 money_per_fisher 9000
#11 3 total_money_in_group 90000
data
df1 <- structure(list(group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
variable = c("fishers_here",
"money_per_fisher", "unnecessary_variable", "fishers_here", "money_per_fisher",
"unnecessary_variable", "fishers_here", "money_per_fisher"),
value = c(100L, 2000L, 10L, 140L, 8000L, 304L, 10L, 9000L
)), class = "data.frame", row.names = c(NA, -8L))
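Based on your output I think this is a possible solution (the filter keeps unnecessary_variable out of the product):
df %>%
filter(variable %in% c("fishers_here", "money_per_fisher")) %>%
group_by(group) %>%
summarise(value = prod(value))
Edit: If you want a column on the original dataset, drop the filter and use mutate instead of summarise, restricting the product inside the call: prod(value[variable %in% c("fishers_here", "money_per_fisher")]).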
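A minimal sketch of that mutate variant, assuming the question's data is held in a data frame like df1 below. The column name total_money is hypothetical. Subsetting value inside prod() keeps unnecessary_variable out of the product while leaving every original row in place.

```r
library(dplyr)

# Sample data retyped from the question
df1 <- data.frame(
  group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  variable = c("fishers_here", "money_per_fisher", "unnecessary_variable",
               "fishers_here", "money_per_fisher", "unnecessary_variable",
               "fishers_here", "money_per_fisher"),
  value = c(100, 2000, 10, 140, 8000, 304, 10, 9000)
)

with_total <- df1 %>%
  group_by(group) %>%
  # multiply only the two relevant rows of each group;
  # the single product is recycled to every row of the group
  mutate(total_money = prod(value[variable %in% c("fishers_here", "money_per_fisher")])) %>%
  ungroup()

with_total
```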
I need some help consolidating columns in R
I have ~130 columns, some of which have a similar name. For example, I have ~25 columns called "pathogen".
However, after importing my datasheet into R, these columns are now listed as follows: pathogen..1, pathogen...2, etc. Because of how R renamed these columns, I'm not sure how to proceed.
I need to consolidate all my columns with the same/similar name, so that I have only 1 column called "pathogen". I also need this consolidated column to include the sums of all the consolidated columns called "pathogen".
Here is an example of my input
sample Unidentified…1 Unidentified…2 Pathogen..1 Pathogen…2
1 5 3 6 8
2 7 2 1 0
3 8 4 2 9
4 9 6 4 0
5 0 7 5 1
Here is my desired output
Sample Unidentified Pathogen
1 8 14
2 9 1
3 12 11
4 15 4
5 7 6
Any help would be really appreciated.
Here is an option where you pivot to create the two groups and then you summarize.
library(tidyverse)
df |>
pivot_longer(cols = -sample,
names_to = ".value",
names_pattern = "(\\w+)") |>
group_by(sample) |>
summarise(across(everything(), sum))
#> # A tibble: 5 x 3
#> sample Unidentified Pathogen
#> <dbl> <dbl> <dbl>
#> 1 1 8 14
#> 2 2 9 1
#> 3 3 12 11
#> 4 4 15 4
#> 5 5 7 6
or with Base R
data.frame(
sample = 1:5,
Unidentified = rowSums(df[,grepl("Unidentified", colnames(df))]),
Pathogen = rowSums(df[,grepl("Pathogen", colnames(df))])
)
#> sample Unidentified Pathogen
#> 1 1 8 14
#> 2 2 9 1
#> 3 3 12 11
#> 4 4 15 4
#> 5 5 7 6
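With ~130 columns, naming each stub by hand gets tedious. The base R idea can be generalised by deriving the stub from every column name and summing each set of columns that share one. This is only a sketch: it assumes the suffixes look like `...1` (punctuation followed by digits), and the column names in the toy data are my guess at what R produced on import.

```r
# Toy data retyped from the question; the exact suffix style is an assumption
df <- data.frame(
  check.names = FALSE,
  sample = 1:5,
  `Unidentified...1` = c(5, 7, 8, 9, 0),
  `Unidentified...2` = c(3, 2, 4, 6, 7),
  `Pathogen...1` = c(6, 1, 2, 4, 5),
  `Pathogen...2` = c(8, 0, 9, 0, 1)
)

value_cols <- setdiff(names(df), "sample")

# Strip the trailing "punctuation + number" suffix to recover the shared stub
stubs <- sub("[^[:alnum:]]+[0-9]+$", "", value_cols)

# split.default() splits the data frame by column; rowSums() totals each piece
sums <- sapply(split.default(df[value_cols], stubs), rowSums)

out <- data.frame(sample = df$sample, sums)
out
```

The result columns come out in alphabetical order of the stubs, so reorder them afterwards if the original order matters.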
or another pivot option where we go long and then immediately go wide again, summing the duplicated cells.
library(tidyverse)
df |>
pivot_longer(-sample, names_pattern = "(\\w+)") |>
pivot_wider(names_from = name,
values_from = value,
values_fn = list(value = sum))
#> # A tibble: 5 x 3
#> sample Unidentified Pathogen
#> <dbl> <dbl> <dbl>
#> 1 1 8 14
#> 2 2 9 1
#> 3 3 12 11
#> 4 4 15 4
#> 5 5 7 6
Here I reshape long to make the column names more easily manipulable. I separate them into "stub" and "number" values and the default separator settings work fine. Then I sum the total values for each id-stub combo, and spread wide again.
library(tidyverse)
data.frame(
check.names = FALSE,
sample = c(1L, 2L, 3L, 4L, 5L),
`Unidentified…1` = c(5L, 7L, 8L, 9L, 0L),
`Unidentified…2` = c(3L, 2L, 4L, 6L, 7L),
Pathogen..1 = c(6L, 1L, 2L, 4L, 5L),
`Pathogen…2` = c(8L, 0L, 9L, 0L, 1L)
) %>%
pivot_longer(-sample) %>%
separate(name, c("stub","num")) %>%
count(sample, stub, wt = value) %>%
pivot_wider(names_from = "stub", values_from = "n")
Result
# A tibble: 5 × 3
sample Pathogen Unidentified
<int> <int> <int>
1 1 14 8
2 2 1 9
3 3 11 12
4 4 4 15
5 5 6 7
I have a dataset with financial data. Sometimes, a product gets refunded, resulting in a negative count of the product (so the money gets returned). I want to conditionally filter these rows out of the dataset.
Example:
library(tidyverse)
set.seed(1)
df <- tibble(
count = sample(c(-1,1),80,replace = TRUE,prob=c(.2,.8)),
id = rep(1:4,20)
)
df %>%
group_by(id) %>%
summarize(total = sum(count))
# A tibble: 4 x 2
id total
<int> <dbl>
1 1 10
2 2 14
3 3 16
4 4 10
id = 1 has 15 positive counts and 5 negatives. (15 - 5= 10). I want to keep 10 values in df with id = 1 with the positive values.
id = 2 has 17 positive counts and 3 negatives. (17- 3 = 14). I want to keep 14 values in df with id = 2 with the positive values.
In the end, this condition should be TRUE: nrow(df) == sum(df$count)
Unfortunately, a filtering join such as anti_join() will remove all the rows. For some reason I cannot think of another option to filter the tibble.
Thanks for helping me!
You can "uncount" using the total column to get the number of repeats of each row.
df %>%
group_by(id) %>%
summarize(total = sum(count)) %>%
uncount(total) %>%
mutate(count = 1)
#> # A tibble: 50 x 2
#> id count
#> <int> <dbl>
#> 1 1 1
#> 2 1 1
#> 3 1 1
#> 4 1 1
#> 5 1 1
#> 6 1 1
#> 7 1 1
#> 8 1 1
#> 9 1 1
#> 10 1 1
#> # ... with 40 more rows
Created on 2022-10-21 with reprex v2.0.2
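Note that uncount() rebuilds the rows from the totals, so any other columns in the original data are lost. If the original rows need to survive, a filtering variant is possible. The sketch below uses a small hypothetical tibble, and the helper column net is my own name. It keeps the first positive rows of each id and drops the surplus ones, which is an arbitrary choice, and it assumes every id nets to zero or more.

```r
library(dplyr)

# Hypothetical toy data: id 1 nets to 2, id 2 nets to 3
df_small <- tibble::tibble(
  id    = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  count = c(1, 1, -1, 1, 1, -1, 1, 1, 1)
)

kept <- df_small %>%
  group_by(id) %>%
  mutate(net = sum(count)) %>%      # net count per id, refunds included
  filter(count > 0) %>%             # drop the refund rows themselves
  filter(row_number() <= net) %>%   # keep only as many positives as the net count
  ungroup() %>%
  select(-net)

nrow(kept) == sum(kept$count)
```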
My question is kind of similar to this question and builds on this answer; the only difference is that my data is in long format, not wide, and I would like to keep it that way.
I wondered if there's a smart way to calculate the weighted.mean() shown in this answer, but with long data.
Say my data looks like this
library(tidyverse)
dft_w <- structure(list(obs = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), education = c("A",
"A", "B", "B", "B", "B", "A", "A"), Item = c("income", "weight",
"income", "weight", "income", "weight", "income", "weight"),
Amount = c(1000L, 10L, 2000L, 1L, 1500L, 5L, 2000L, 2L)), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame")); dft_w
# A tibble: 8 x 4
obs education Item Amount
<int> <chr> <chr> <int>
1 1 A income 1000
2 1 A weight 10
3 2 B income 2000
4 2 B weight 1
5 3 B income 1500
6 3 B weight 5
7 4 A income 2000
8 4 A weight 2
and I would like to get to something like this
# A tibble: 12 x 4
obs education Item Amount
<int> <chr> <chr> <dbl>
1 1 A income 1000
2 1 A weight 10
3 1 A weighted_income 1167.
4 2 B income 2000
5 2 B weight 1
6 2 B weighted_income 1583.
7 3 B income 1500
8 3 B weight 5
9 3 B weighted_income 1583.
10 4 A income 2000
11 4 A weight 2
12 4 A weighted_income 1167.
dft_w %>%
group_by(education) %>%
summarize(
Amount = rep(weighted.mean(Amount[Item == "income"], Amount[Item == "weight"]), length(unique(obs))),
obs = unique(obs),
Item = "weighted_income"
) %>%
bind_rows(dft_w, .) %>%
arrange(obs, education, Item)
# # A tibble: 12 x 4
# obs education Item Amount
# <int> <chr> <chr> <dbl>
# 1 1 A income 1000
# 2 1 A weight 10
# 3 1 A weighted_income 1167.
# 4 2 B income 2000
# 5 2 B weight 1
# 6 2 B weighted_income 1583.
# 7 3 B income 1500
# 8 3 B weight 5
# 9 3 B weighted_income 1583.
# 10 4 A income 2000
# 11 4 A weight 2
# 12 4 A weighted_income 1167.
Note that this will error if the data does not contain equal numbers of "income" and "weight" rows (failing with 'x' and 'w' must have the same length).
This can be preempted with sufficient filtering, perhaps this:
dft_w %>%
slice(-1) %>% # just to trigger the fail, test the filter
group_by(obs, education) %>%
filter(all(c("income", "weight") %in% Item)) %>%
group_by(education) %>%
summarize(
Amount = rep(weighted.mean(Amount[Item == "income"], Amount[Item == "weight"]), length(unique(obs))),
obs = unique(obs),
Item = "weighted_income"
) %>%
bind_rows(slice(dft_w, -1), .) %>% # slice() only to keep the output consistent
arrange(obs, education, Item)
# # A tibble: 10 x 4
# obs education Item Amount
# <int> <chr> <chr> <dbl>
# 1 1 A weight 10
# 2 2 B income 2000
# 3 2 B weight 1
# 4 2 B weighted_income 1583.
# 5 3 B income 1500
# 6 3 B weight 5
# 7 3 B weighted_income 1583.
# 8 4 A income 2000
# 9 4 A weight 2
# 10 4 A weighted_income 2000
noting that the obs/education pair without both will not gain the "weighted_income" value.
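One caveat for newer setups: as of dplyr 1.1.0, summarise() calls that return several rows per group are deprecated, with reframe() as the suggested replacement. The sketch below sidesteps the issue on any recent dplyr by computing one mean per education level and joining it back to the obs/education pairs. The intermediate names group_means and weighted are mine.

```r
library(dplyr)

# Sample data retyped from the question
dft_w <- tibble::tibble(
  obs = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
  education = c("A", "A", "B", "B", "B", "B", "A", "A"),
  Item = rep(c("income", "weight"), times = 4),
  Amount = c(1000L, 10L, 2000L, 1L, 1500L, 5L, 2000L, 2L)
)

# One weighted mean per education level (a single row per group)
group_means <- dft_w %>%
  group_by(education) %>%
  summarise(
    Amount = weighted.mean(Amount[Item == "income"], Amount[Item == "weight"]),
    .groups = "drop"
  )

# One new row per obs/education pair, carrying that group's mean
weighted <- dft_w %>%
  distinct(obs, education) %>%
  left_join(group_means, by = "education") %>%
  mutate(Item = "weighted_income")

result <- bind_rows(dft_w, weighted) %>%
  arrange(obs, education, Item)

result
```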
Another way around this is to use tidyr's pivot_wider and pivot_longer in the same pipe chain, so you can work with wide data before going back to long format. It may not be the most efficient way, but it lets you keep your "wide-format" tips & tricks.
library(dplyr)
dft_w %>%
tidyr::pivot_wider(names_from = Item, values_from = Amount) %>%
group_by(education) %>%
mutate(weighted_income = weighted.mean(income, weight)) %>%
tidyr::pivot_longer(3:last_col(), names_to = "Item", values_to = "Amount")
Output:
# A tibble: 12 x 4
# Groups: education [2]
obs education Item Amount
<int> <chr> <chr> <dbl>
1 1 A income 1000
2 1 A weight 10
3 1 A weighted_income 1167.
4 2 B income 2000
5 2 B weight 1
6 2 B weighted_income 1583.
7 3 B income 1500
8 3 B weight 5
9 3 B weighted_income 1583.
10 4 A income 2000
11 4 A weight 2
12 4 A weighted_income 1167.
Here is just another way of doing this using tibble::add_row. I just opted for only one summary per grouping variable:
library(dplyr)
library(purrr)
dft_w %>%
group_split(education) %>%
map_dfr(~ .x %>%
add_row(obs = .x$obs[1], education = .x$education[1],
Item = "weighted.mean", Amount = weighted.mean(.x$Amount[.x$Item == "income"],
.x$Amount[.x$Item == "weight"])))
# A tibble: 10 x 4
obs education Item Amount
<int> <chr> <chr> <dbl>
1 1 A income 1000
2 1 A weight 10
3 4 A income 2000
4 4 A weight 2
5 1 A weighted.mean 1167.
6 2 B income 2000
7 2 B weight 1
8 3 B income 1500
9 3 B weight 5
10 2 B weighted.mean 1583.
I have dataframe df1 containing data and groups, and df2 which stores the same groups, and one value per group.
I want to filter the rows of df1 using df2, keeping rows where the lagged difference within each group is higher than that group's value in df2.
Dummy example:
# identify the first year of disturbance by lag by group
df1 <- data.frame(year = c(1:4, 1:4),
mort = c(5,16,40,4,5,6,10,108),
distance = rep(c("a", "b"), each = 4))
df2 = data.frame(distance = c("a", "b"),
my.median = c(12,1))
Now calculate the lag between values (creates new column) and filter df1 based on column values of df2:
# calculate lag between years
df1 %>%
group_by(distance) %>%
dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
filter(yearLag > df2$my.median) ##
This however does not produce expected results:
# A tibble: 3 x 4
# Groups: distance [2]
year mort distance yearLag
<int> <dbl> <fct> <dbl>
1 2 16 a 11
2 3 40 a 24
3 4 108 b 98
Instead, I expect to get:
# A tibble: 3 x 4
# Groups: distance [2]
year mort distance yearLag
<int> <dbl> <fct> <dbl>
1 3 40 a 24
2 1 5 b 5
3 3 10 b 4
The filter works great when applied to a single value, but how do I adapt it to a vector, and especially a vector of groups (as the order of elements can potentially change)?
Is this what you're trying to do?
df1 %>%
group_by(distance) %>%
dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
left_join(df2) %>%
filter(yearLag > my.median)
Result:
# A tibble: 4 x 5
# Groups: distance [2]
year mort distance yearLag my.median
<int> <dbl> <fct> <dbl> <dbl>
1 3 40 a 24 12
2 1 5 b 5 1
3 3 10 b 4 1
4 4 108 b 98 1
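The original filter goes wrong because df2$my.median is a length-2 vector that R silently recycles down each group's four rows (12, 1, 12, 1), so rows get compared against the wrong group's threshold. If you'd rather not carry my.median along as a column, a named lookup vector is one alternative to the join. This is a sketch under that assumption; thresholds is a hypothetical name.

```r
library(dplyr)

df1 <- data.frame(year = c(1:4, 1:4),
                  mort = c(5, 16, 40, 4, 5, 6, 10, 108),
                  distance = rep(c("a", "b"), each = 4))
df2 <- data.frame(distance = c("a", "b"),
                  my.median = c(12, 1))

# Named vector: threshold looked up by group name, not by position
thresholds <- setNames(df2$my.median, df2$distance)

res <- df1 %>%
  group_by(distance) %>%
  mutate(yearLag = mort - lag(mort, default = 0)) %>%
  # index by name so the order of rows in df2 does not matter
  filter(yearLag > thresholds[as.character(distance)]) %>%
  ungroup()

res
```

This returns the same four rows as the join, without the extra my.median column.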
here is a data.table approach
library( data.table )
#create data.tables
setDT(df1);setDT(df2)
#create yearLag variable
df1[, yearLag := mort - shift( mort, type = "lag", fill = 0 ), by = .(distance) ]
#update join and filter wanted rows
df1[ df2, median.value := i.my.median, on = .(distance)][ yearLag > median.value, ][]
# year mort distance yearLag median.value
# 1: 3 40 a 24 12
# 2: 1 5 b 5 1
# 3: 3 10 b 4 1
# 4: 4 108 b 98 1
Came to the same conclusion. You should left_join the data frames.
df1 %>% left_join(df2, by="distance") %>%
group_by(distance) %>%
dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
filter(yearLag > my.median)
# A tibble: 4 x 5
# Groups: distance [2]
year mort distance my.median yearLag
<int> <dbl> <fct> <dbl> <dbl>
1 3 40 a 12 24
2 1 5 b 1 5
3 3 10 b 1 4
4 4 108 b 1 98
So I have one column of a dataframe which contains a value, which is equal to a different column name. For each row, I want to change the value of the column that is named.
df <- tibble(.rows = 6) %>% mutate(current_stage = c("Stage-1", "Stage-1", "Stage-2", "Stage-3", "Stage-4", "Stage-4"), `Stage-1` = c(1,1,1,2,4,5), `Stage-2` = c(40,50,20,10,15,10), `Stage-3` = c(1,2,3,4,5,6), `Stage-4` = c(NA, 1, NA, 2, NA, 3))
A tibble: 6 x 5
current_stage `Stage-1` `Stage-2` `Stage-3` `Stage-4`
<chr> <dbl> <dbl> <dbl> <dbl>
Stage-1 1 40 1 NA
Stage-1 1 50 2 1
Stage-2 1 20 3 NA
Stage-3 2 10 4 2
Stage-4 4 15 5 NA
Stage-4 5 10 6 3
So in the first row, I would want to edit the value in the Stage-1 column because the current_stage column has Stage-1. I've tried using !!rlang::sym:
df %>% mutate(!!rlang::sym(current_stage) := 15)
but I get the error: Error in is_symbol(x) : object 'current_stage' not found.
Is this even possible to do? Or should I just bite the bullet and write a different function?
Within the tidyverse, I think using a long format with gather is the easiest way as suggested by Jack Brookes:
library(tidyverse)
df %>%
rowid_to_column() %>%
gather(stage, value, -current_stage, -rowid) %>%
mutate(value = if_else(stage == current_stage, 15, value)) %>%
spread(stage, value)
#> # A tibble: 6 x 6
#> rowid current_stage `Stage-1` `Stage-2` `Stage-3` `Stage-4`
#> <int> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 Stage-1 15 40 1 NA
#> 2 2 Stage-1 15 50 2 1
#> 3 3 Stage-2 1 15 3 NA
#> 4 4 Stage-3 2 10 15 2
#> 5 5 Stage-4 4 15 5 15
#> 6 6 Stage-4 5 10 6 15
Created on 2019-05-20 by the reprex package (v0.2.1)
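If reshaping feels heavy for this, base R matrix indexing can do the same update in place. The sketch below is an alternative I'm suggesting, not part of the answer above. It assumes every value in current_stage names an existing column and that all stage columns are numeric.

```r
# Sample data retyped from the question
df <- data.frame(
  check.names = FALSE,
  current_stage = c("Stage-1", "Stage-1", "Stage-2", "Stage-3", "Stage-4", "Stage-4"),
  `Stage-1` = c(1, 1, 1, 2, 4, 5),
  `Stage-2` = c(40, 50, 20, 10, 15, 10),
  `Stage-3` = c(1, 2, 3, 4, 5, 6),
  `Stage-4` = c(NA, 1, NA, 2, NA, 3)
)

stage_cols <- setdiff(names(df), "current_stage")
m <- as.matrix(df[stage_cols])

# Two-column index matrix: (row number, position of the column named in current_stage)
idx <- cbind(seq_len(nrow(df)), match(df$current_stage, stage_cols))
m[idx] <- 15

# Write the updated values back into the stage columns
df[stage_cols] <- as.data.frame(m)
df
```

Each row of idx picks exactly one cell, so only the column named by that row's current_stage is changed.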