I've got a data frame that looks something like this:
precinct, race, age, people
1001, black, 18-40, 1
1001, white, 18-40, 2
1001, hispanic, 18-40, 3
1001, asian, 18-40, 4
1001, black, 40 or older, 5
1001, white, 40 or older, 6
1001, hispanic, 40 or older, 7
1001, asian, 40 or older, 8
I want to make it look like this:
precinct, black, white, hispanic, asian, 18-40, 40 or older
1001, 6, 8, 10, 12, 10, 26
I've used dcast
dcast(
data = mydataframe,
formula = Precinct ~ race + age,
fun.aggregate = sum,
value.var = 'people'
)
but this does not produce my desired result.
When the formula has + on the right-hand side of ~, dcast creates one column for every combination of the values in those columns rather than one column per unique value of each column. To get the latter, melt the data to long format first and then dcast on the single resulting column (this assumes the melted columns are of the same type).
library(data.table)
dcast(melt(setDT(mydataframe), id.vars = c('precinct', 'people')),
      precinct ~ value, fun.aggregate = sum, value.var = 'people')
-output
Key: <precinct>
precinct 18-40 40 or older asian black hispanic white
<int> <int> <int> <int> <int> <int> <int>
1: 1001 10 26 12 6 10 8
library(dplyr)
library(tidyr)
mydataframe %>%
pivot_longer(cols = c(race, age), names_to = NULL) %>%
pivot_wider(names_from = value, values_from = people, values_fn = sum)
-output
# A tibble: 1 × 7
precinct black `18-40` white hispanic asian `40 or older`
<int> <int> <int> <int> <int> <int> <int>
1 1001 6 10 8 10 12 26
data
mydataframe <- structure(list(precinct = c(1001L, 1001L, 1001L, 1001L, 1001L,
1001L, 1001L, 1001L), race = c("black", "white", "hispanic",
"asian", "black", "white", "hispanic", "asian"), age = c("18-40",
"18-40", "18-40", "18-40", "40 or older", "40 or older", "40 or older",
"40 or older"), people = 1:8), row.names = c(NA, -8L),
class = "data.frame")
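The same idea (stack race and age into a single column, then sum people per precinct and value) can be sketched in base R. This is only an illustration under the assumption that the data matches the sample above; the names long, key and wide are made up for the example.

```r
# Same sample data as above.
mydataframe <- data.frame(
  precinct = rep(1001L, 8),
  race = rep(c("black", "white", "hispanic", "asian"), times = 2),
  age = rep(c("18-40", "40 or older"), each = 4),
  people = 1:8
)

# Stack race and age into one 'key' column, repeating 'people' for each.
long <- data.frame(
  precinct = rep(mydataframe$precinct, times = 2),
  key = c(mydataframe$race, mydataframe$age),
  people = rep(mydataframe$people, times = 2)
)

# Sum 'people' for every precinct/key pair; each key becomes a column.
wide <- xtabs(people ~ precinct + key, data = long)
wide
```

Because every row contributes once through race and once through age, the race columns and the age columns each add up to the full head count.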
Related
I'm relatively new to R and I have a dataframe that looks like this:
             1             2          3          4             5          6          7          8             9          10
Name         Max           Max        Max        Joey          Joey       Nancy      Nancy      Nancy         Linda      Linda
Amount_Type  InternetBill  Groceries  WaterBill  InternetBill  Groceries  WaterBill  Groceries  InternetBill  WaterBill  Groceries
Amount       $75           $230.66    $40        $70           $188.75    $35        $175.89    $75           $30        $236.87
I need to add 3 more rows and pivot the dataframe:
The dataframe needs to be grouped by name and outputs 3 totals columns:
Fixed_Cost which should include InternetBill and WaterBill amounts
Variable_Cost which should include Groceries
Total_Cost which should be fixed + variable costs
So something like this:
Name   Fixed_Cost  Variable_Cost  Total_Cost
Max    $115        $230.66        $345.66
Joey   $70         $188.75        $258.75
Nancy  $110        $175.89        $285.89
Linda  $30         $236.87        $266.87
Any advice on how to go about doing this? Thanks!
If we transpose the data, it becomes easier to do a grouped sum
library(data.table)
data.table::transpose(setDT(df1), make.names = 1)[,
Amount := readr::parse_number(Amount)][,
.(Fixed_Cost = sum(Amount[Amount_Type %in% c("InternetBill", "WaterBill")]),
Variable_Cost = sum(Amount[!Amount_Type %in% c("InternetBill", "WaterBill")])),
by = Name][,
Total_Cost := Fixed_Cost + Variable_Cost][]
-output
Name Fixed_Cost Variable_Cost Total_Cost
<char> <num> <num> <num>
1: Max 115 230.66 345.66
2: Joey 70 188.75 258.75
3: Nancy 110 175.89 285.89
4: Linda 30 236.87 266.87
data
df1 <- structure(list(`0` = c("Name", "Amount_Type", "Amount"), `1` = c("Max",
"InternetBill", "$75"), `2` = c("Max", "Groceries", "$230.66"
), `3` = c("Max", "WaterBill", "$40"), `4` = c("Joey", "InternetBill",
"$70"), `5` = c("Joey", "Groceries", "$188.75"), `6` = c("Nancy",
"WaterBill", "$35"), `7` = c("Nancy", "Groceries", "$175.89"),
`8` = c("Nancy", "InternetBill", "$75"), `9` = c("Linda",
"WaterBill", "$30"), `10` = c("Linda", "Groceries", "$236.87"
)), class = "data.frame", row.names = c(NA, -3L))
library(tidyverse)
setNames(data.frame(t(df1[,-1])), df1[,1]) %>%
pivot_wider(id_cols = Name, names_from = Amount_Type, values_from = Amount,
values_fn = parse_number, values_fill = 0) %>%
mutate(Fixed_cost = InternetBill + WaterBill, variable_cost = Groceries,
Total_Cost = Fixed_cost + variable_cost, .keep ='unused')
# A tibble: 4 x 4
Name Fixed_cost variable_cost Total_Cost
<chr> <dbl> <dbl> <dbl>
1 Max 115 231. 346.
2 Joey 70 189. 259.
3 Nancy 110 176. 286.
4 Linda 30 237. 267.
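For comparison, the transpose-then-sum approach can also be sketched in base R. This is a minimal illustration on a trimmed copy of the input (Max and Joey only); the helper names long, is_fixed and costs are assumptions made for the example, not part of either answer.

```r
# Trimmed copy of the wide input: the first column holds the field names.
df1 <- data.frame(
  `0` = c("Name", "Amount_Type", "Amount"),
  `1` = c("Max", "InternetBill", "$75"),
  `2` = c("Max", "Groceries", "$230.66"),
  `3` = c("Max", "WaterBill", "$40"),
  `4` = c("Joey", "InternetBill", "$70"),
  `5` = c("Joey", "Groceries", "$188.75"),
  check.names = FALSE
)

# Transpose so each record is a row, and take the names from the first column.
long <- as.data.frame(t(df1[, -1]), stringsAsFactors = FALSE)
names(long) <- df1[[1]]

# Strip the dollar sign and convert to numeric.
long$Amount <- as.numeric(sub("$", "", long$Amount, fixed = TRUE))

# Sum fixed and variable amounts per name, then add the total.
is_fixed <- long$Amount_Type %in% c("InternetBill", "WaterBill")
costs <- rowsum(
  cbind(Fixed_Cost = long$Amount * is_fixed,
        Variable_Cost = long$Amount * (!is_fixed)),
  group = long$Name
)
costs <- cbind(costs, Total_Cost = rowSums(costs))
costs
```

rowsum() sorts the groups alphabetically, so the rows come back as Joey, Max rather than in input order.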
Hi, I have a dataframe with a date column and some numeric columns.
This is the data:
year_month total_visits search_brand_co~ search_non_bran~ facebook_cost display_cost total_organic_s~
<date> <int> <dbl> <dbl> <dbl> <dbl> <int>
1 2020-11-01 91655 30314. 60676. 14548. 4555. 829852
2 2020-12-01 98227 327. 2027. 0 6895. 1047370
3 2021-01-01 91352 0 0 193. 7009. 1284317
4 2021-02-01 77060 15058. 18690. 6728. 6294. 668924
5 2021-03-01 96749 32883. 87256. 0 5587. 418764
6 2021-04-01 84738 29919. 71820. 0 2655. 297460
What I need is a group-by that sums the columns in two groups: one where the date column "year_month" is less than or equal to "2021-01-01", and the other where "year_month" is greater than or equal to "2021-02-01".
The final dataframe should have only 2 rows, with the same number of columns.
Thanks for the help
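We can build the grouping from a logical vector (converted to an integer index with + or as.integer) and then summarise the numeric columns to get the sums. Note that the comparison below uses <, so 2021-01-01 itself lands in the later group; use <= if that date should be counted with the earlier months, as the question describes.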
library(dplyr)
df1 %>%
group_by(grp = c('gt_2021_01_01', 'lt_2021_01_01')[1 +
(year_month < as.Date("2021-01-01"))]) %>%
summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
-output
# A tibble: 2 x 7
grp total_visits search_brand_co search_non_bran facebook_cost display_cost total_organic_s
<chr> <int> <dbl> <dbl> <dbl> <dbl> <int>
1 gt_2021_01_01 349899 77860 177766 6921 21545 2669465
2 lt_2021_01_01 189882 30641 62703 14548 11450 1877222
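As a base R sketch of the same grouping idea with an inclusive cutoff (2021-01-01 counted with the earlier months, which is how the question words it), assuming a trimmed copy of the data with only two of the numeric columns; the group labels are made up for the example:

```r
# Trimmed copy of the data: the date column plus two numeric columns.
df1 <- data.frame(
  year_month = as.Date(c("2020-11-01", "2020-12-01", "2021-01-01",
                         "2021-02-01", "2021-03-01", "2021-04-01")),
  total_visits = c(91655L, 98227L, 91352L, 77060L, 96749L, 84738L),
  facebook_cost = c(14548, 0, 193, 6728, 0, 0)
)

# Inclusive cutoff: 2021-01-01 belongs to the earlier group.
grp <- ifelse(df1$year_month <= as.Date("2021-01-01"),
              "up_to_2021_01_01", "from_2021_02_01")

# Sum every numeric column within each group.
res <- rowsum(df1[, -1], group = grp)
res
```

With this cutoff the totals differ from the output shown above, because the January 2021 row moves into the earlier group.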
data
df1 <- structure(list(year_month = structure(c(18567, 18597, 18628,
18659, 18687, 18718), class = "Date"), total_visits = c(91655L,
98227L, 91352L, 77060L, 96749L, 84738L), search_brand_co = c(30314,
327, 0, 15058, 32883, 29919), search_non_bran = c(60676, 2027,
0, 18690, 87256, 71820), facebook_cost = c(14548, 0, 193, 6728,
0, 0), display_cost = c(4555, 6895, 7009, 6294, 5587, 2655),
total_organic_s = c(829852L, 1047370L, 1284317L, 668924L,
418764L, 297460L)), row.names = c("1", "2", "3", "4", "5",
"6"), class = "data.frame")
I would like to transform my data from long format to wide by the values in two columns. How can I do this using tidyverse?
Updated dput
structure(list(Country = c("Algeria", "Benin", "Ghana", "Algeria",
"Benin", "Ghana", "Algeria", "Benin", "Ghana"
), Indicator = c("Indicator 1",
"Indicator 1",
"Indicator 1",
"Indicator 2",
"Indicator 2",
"Indicator 2",
"Indicator 3",
"Indicator 3",
"Indicator 3"
), Status = c("Actual", "Forecast", "Target", "Actual", "Forecast",
"Target", "Actual", "Forecast", "Target"), Value = c(34, 15, 5,
28, 5, 2, 43, 5,
1)), row.names
= c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"))
Country Indicator Status Value
<chr> <chr> <chr> <dbl>
1 Algeria Indicator 1 Actual 34
2 Benin Indicator 1 Forecast 15
3 Ghana Indicator 1 Target 5
4 Algeria Indicator 2 Actual 28
5 Benin Indicator 2 Forecast 5
6 Ghana Indicator 2 Target 2
7 Algeria Indicator 3 Actual 43
8 Benin Indicator 3 Forecast 5
9 Ghana Indicator 3 Target 1
Expected output
Country Indicator1_Actual Indicator1_Forecast Indicator1_Target Indicator2_Actual
Algeria 34 15 5 28
etc
Appreciate any tips!
foo <- data %>% pivot_wider(names_from = c("Indicator","Status"), values_from = "Value")
works perfectly!
I think the mistake is in your pivot_wider() command
data %>% pivot_wider(names_from = Indicator, values_from = c(Indicator, Status))
I bet you can't use the same column for both names and values.
Try this code
data %>% pivot_wider(names_from = c(Indicator, Status), values_from = Value)
Explanation: Since you want the column names to be Indicator 1_Actual, you need both columns indicator and status going into your names_from
It would be helpful if you provided example data and expected output. But I tested this on my dummy data and it gives the expected output -
Data:
# A tibble: 4 x 4
a1 a2 a3 a4
<int> <int> <chr> <dbl>
1 1 5 s 10
2 2 4 s 20
3 3 3 n 30
4 4 2 n 40
Call : a %>% pivot_wider(names_from = c(a2, a3), values_from = a4)
Output :
# A tibble: 4 x 5
a1 `5_s` `4_s` `3_n` `2_n`
<int> <dbl> <dbl> <dbl> <dbl>
1 1 10 NA NA NA
2 2 NA 20 NA NA
3 3 NA NA 30 NA
4 4 NA NA NA 40
Data here if you want to reproduce
structure(list(a1 = 1:4, a2 = 5:2, a3 = c("s", "s", "n", "n"),
a4 = c(10, 20, 30, 40)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
Edit: Regarding the edited question, after trying the corrected pivot_wider() command: it looks like your data could actually contain duplicates, in which case the output you are seeing would make sense. I would suggest checking whether it does by using filter(Country == .., Indicator == .., Status == ..).
This can be achieved by calling both your columns to pivot wider in the names_from argument in pivot_wider().
data %>%
pivot_wider(names_from = c("Indicator","Status"),
values_from = "Value")
Result
Country `Indicator 1_Ac… `Indicator 1_Fo… `Indicator 1_Ta… `Indicator 2_Ac… `Indicator 2_Fo…
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Algeria 34 15 5 28 5
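The underlying idea, building each new column name from both Indicator and Status, can also be sketched in base R. This is only an illustration on the updated dput from the question; removing the space to get names like Indicator1_Actual is an assumption based on the expected output, and key and wide are made-up names.

```r
# Same values as the updated dput in the question.
data <- data.frame(
  Country = rep(c("Algeria", "Benin", "Ghana"), times = 3),
  Indicator = rep(paste("Indicator", 1:3), each = 3),
  Status = rep(c("Actual", "Forecast", "Target"), times = 3),
  Value = c(34, 15, 5, 28, 5, 2, 43, 5, 1)
)

# Build the combined column name, dropping the space in "Indicator 1".
key <- paste(gsub(" ", "", data$Indicator), data$Status, sep = "_")

# One row per country, one column per Indicator/Status pair.
wide <- tapply(data$Value, list(data$Country, key), sum)
wide
```

Cells with no matching Country/Indicator/Status row come back as NA, which is also what pivot_wider() does by default.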
I have a data frame, df:
df <- structure(list(ID = structure(c("ID961", "ID961",
"ID961", "ID961", "ID726",
"ID726", "ID726", "ID864",
"ID864", "ID864"), label = "ID"),
TYPE = structure(c("blind", "blind", "blind", "blind",
"blind", "blind", "blind", "blind", "blind", "notblind"
), label = "blind or not"), AGE = structure(c(50,
50, 50, 50, 67, 67, 67, 35, 35, 35), label = "Age"), AGEU = structure(c("YEARS",
"YEARS", "YEARS", "YEARS", "YEARS", "YEARS", "YEARS", "YEARS",
"YEARS", "YEARS"), label = "Age Units"), AVISIT = structure(c("26",
"46", "46", "36", "66",
"64", "67", "37", "37",
"67"), label = "treat")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"), label = "df")
> df
# A tibble: 10 x 5
ID TYPE AGE AGEU AVISIT
<chr> <chr> <dbl> <chr> <chr>
1 ID961 blind 50 YEARS 26
2 ID961 blind 50 YEARS 46
3 ID961 blind 50 YEARS 46
4 ID961 blind 50 YEARS 36
5 ID726 blind 67 YEARS 66
6 ID726 blind 67 YEARS 64
7 ID726 blind 67 YEARS 67
8 ID864 blind 35 YEARS 37
9 ID864 blind 35 YEARS 37
10 ID864 notblind 35 YEARS 67
For every ID, I want to find out if each column entry that matches that ID is uniform or not.
So for example:
ID961 - TYPE is all the same, AGE is all the same, AGEU is all the same, but AVISIT is not the same.
ID726 - TYPE is all the same, AGE is all the same, AGEU is all the same, but AVISIT is not the same.
ID864 - AGE is all the same, AGEU is all the same, but AVISIT is not the same and TYPE is not the same.
So therefore, I want it returned that AGE and AGEU are all uniform within that ID, eg:
uniform
[1] "AGE" "AGEU"
I have no idea how to do this - I understand I can use
match <- df %>% group_by(ID)
But then don't know how to progress from there.
Perhaps you can do:
library(dplyr)
df %>%
group_by(ID) %>%
summarise(across(everything(), ~ n_distinct(.x) == 1), .groups = "drop")
# Or the older, superseded way
df %>%
group_by(ID) %>%
summarise_all(~n_distinct(.x) == 1)
# A tibble: 3 x 5
ID TYPE AGE AGEU AVISIT
<chr> <lgl> <lgl> <lgl> <lgl>
1 ID726 TRUE TRUE TRUE FALSE
2 ID864 FALSE TRUE TRUE FALSE
3 ID961 TRUE TRUE TRUE FALSE
And if you want the column names, you can do:
df %>%
group_by(ID) %>%
summarise(across(everything(), ~ n_distinct(.x) == 1), .groups = "drop") %>%
rowwise() %>%
transmute(ID, uniform = toString(names(.)[which(c_across(cols = -ID)) + 1]))
# Or, with purrr loaded (library(purrr)) for pmap_chr() ...
df %>%
group_by(ID) %>%
summarise_all(~n_distinct(.x) == 1) %>%
transmute(ID,
uniform = pmap_chr(.[-1], ~ toString(names(df)[c(FALSE, ...)])))
# A tibble: 3 x 2
# Rowwise:
ID uniform
<chr> <chr>
1 ID726 TYPE, AGE, AGEU
2 ID864 AGE, AGEU
3 ID961 TYPE, AGE, AGEU
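To get the single vector the question asks for (the columns that are uniform within every ID), a base R sketch along the same lines might look like this. It assumes a plain copy of the data without the label attributes; is_uniform is a made-up helper name.

```r
# Plain copy of the question's data (label attributes omitted).
df <- data.frame(
  ID = rep(c("ID961", "ID726", "ID864"), times = c(4, 3, 3)),
  TYPE = c(rep("blind", 9), "notblind"),
  AGE = rep(c(50, 67, 35), times = c(4, 3, 3)),
  AGEU = "YEARS",
  AVISIT = c("26", "46", "46", "36", "66", "64", "67", "37", "37", "67")
)

# For each non-ID column: is it constant within every ID?
is_uniform <- sapply(df[-1], function(col) {
  all(tapply(col, df$ID, function(v) length(unique(v)) == 1))
})

uniform <- names(is_uniform)[is_uniform]
uniform
# [1] "AGE"  "AGEU"
```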
I have a data frame that looks like this:
# A tibble: 5 x 5
# Groups: Trial [1]
GID Trial pop `1A-1145442` `1A-1158042`
<chr> <chr> <chr> <int> <int>
GID421213 ES1 ES1-5 12 11
GID419903 ES1 ES1-5 22 12
GID3881 ES1 ES1-5 22 22
GID13646 ES1 ES1-5 12 12
GID418846 ES1 ES1-5 22 11
Here is a dput of it :
structure(list(GID = c("GID421213", "GID419903", "GID3881", "GID13646",
"GID418846"), Trial = c("ES1", "ES1", "ES1", "ES1", "ES1"), pop = c("ES1-5",
"ES1-5", "ES1-5", "ES1-5", "ES1-5"), `1A-1145442` = c(12L, 22L,
22L, 12L, 22L), `1A-1158042` = c(11L, 12L, 22L, 12L, 11L)), row.names =
c(NA, -5L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), vars =
"Trial", drop = TRUE, indices = list(0:4), group_sizes = 5L,
biggest_group_size = 5L, labels = structure(list(Trial = "ES1"), row.names
= c(NA, -1L), class = "data.frame", vars = "Trial", drop = TRUE))
I want to derive a new grouping column from the Trial column, just as I did in the past with the pop column using regex operations, but now with dplyr. The Trial column consists of ES values from 1 to 38; I would like to group them as ES1-3, ES4-6, ES7-9 and so forth using the dplyr package. I know I could start with df %>% group_by(Trial), but from there on I have no idea how to proceed.
library(dplyr)
df %>%
mutate(pop2 = case_when(
Trial == "ES1" | Trial == "ES2" | Trial == "ES3" ~ "ES1-3",
Trial == "ES4" | Trial == "ES5" | Trial == "ES6" ~ "ES4-6"
))
Will return
# A tibble: 5 x 6
# Groups: Trial [1]
GID Trial pop `1A-1145442` `1A-1158042` pop2
<chr> <chr> <chr> <int> <int> <chr>
1 GID421213 ES1 ES1-5 12 11 ES1-3
2 GID419903 ES1 ES1-5 22 12 ES1-3
3 GID3881 ES1 ES1-5 22 22 ES1-3
4 GID13646 ES1 ES1-5 12 12 ES1-3
5 GID418846 ES1 ES1-5 22 11 ES1-3
Given
(df <- data.frame(Trial = paste0("ES", 1:10)))
# Trial
# 1 ES1
# 2 ES2
# 3 ES3
# 4 ES4
# 5 ES5
# 6 ES6
# 7 ES7
# 8 ES8
# 9 ES9
# 10 ES10
We may, using base R, do
size <- 3
groups <- (as.numeric(substring(df$Trial, 3)) - 1) %/% size
(df$newCol <- sprintf("ES%d-%d", 1 + groups * size, size * (1 + groups)))
# [1] "ES1-3" "ES1-3" "ES1-3" "ES4-6" "ES4-6" "ES4-6" "ES7-9" "ES7-9"
# [9] "ES7-9" "ES10-12"
Here as.numeric(substring(df$Trial, 3)) gets the numeric part of df$Trial and converts it to a numeric vector. Subtracting 1 and using %/% then returns the group number for each element of df$Trial, starting from 0. Given a group number, we can easily construct a new column with sprintf.
size is the size of groups. E.g., setting size <- 5 would give values ES1-5, ES6-10, and so on.
Here's a solution that uses parse_number from readr.
df %>%
mutate(grp = cut(parse_number(Trial),
breaks = seq(1, 38, by = 3),
right = FALSE)) %>%
group_by(grp)
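This pulls the number out of Trial, cuts it to create a grouping variable, and then groups by that variable. right = FALSE makes each interval closed on the left and open on the right. Note that with these breaks the last interval is [34,37), so trials 37 and 38 are not assigned to any group.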
An edit based on a comment below.
df %>%
  mutate(grp = cut(parse_number(Trial),
                   breaks = c(seq(1, 34, by = 3), 38),
                   right = FALSE,
                   include.lowest = TRUE)) %>%
  group_by(grp)
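A small base R check of how cut() treats the upper edge when right = FALSE, with and without include.lowest = TRUE passed to cut() itself. The trial labels here are hypothetical and chosen only to hit the first and last groups.

```r
# Hypothetical trial labels covering the first and last groups.
trial <- paste0("ES", c(1, 3, 4, 34, 37, 38))
num <- as.numeric(sub("ES", "", trial))
breaks <- c(seq(1, 34, by = 3), 38)

# Left-closed intervals; include.lowest = TRUE also closes the last
# interval on the right, so 38 is kept.
with_edge <- cut(num, breaks = breaks, right = FALSE, include.lowest = TRUE)

# Without include.lowest, the value equal to the largest break becomes NA.
without_edge <- cut(num, breaks = breaks, right = FALSE)

data.frame(trial, with_edge, without_edge)
```

With these breaks everything from ES34 to ES38 ends up in one final, wider group.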