Here is my sample data:
samp_df <- tibble(id= c("A", "A", "B", "B"),
event= c(111, 112, 113, 114),
values = c(23, 12, 45, 60),
min_value = c(12, 12, 45, 45))
I would like to create a column in the dataframe that has the event of the min value. So in the example the column would look like: c(112, 112, 113, 113). So the idea is that I want to take the value from event whenever values and min_value match. It is important that this is also grouped by the id variable.
Here is what I tried, but it's not exactly right, as it adds NAs instead of the event:
samp_df <- samp_df %>% group_by(id) %>%
mutate(event_with_min = if_else(min_value == values,
event, NA_real_))
A dplyr solution would also be optimal!
library(tibble)
library(dplyr)
library(tidyr)
samp_df <- tibble(id= c("A", "A", "B", "B"),
event= c(111, 112, 113, 114),
values = c(23, 12, 45, 60))
samp_df %>%
group_by(id) %>%
mutate(min_value = min(values),
event_with_min = if_else(min_value == values, event, NA_real_)) %>%
fill(event_with_min, .direction = "downup")
#> # A tibble: 4 x 5
#> # Groups: id [2]
#> id event values min_value event_with_min
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 111 23 12 112
#> 2 A 112 12 12 112
#> 3 B 113 45 45 113
#> 4 B 114 60 45 113
Created on 2022-10-14 by the reprex package (v2.0.1)
I've had to sign up as a guest as I am working on a train. Will revert to own user once back at home! Peter
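A variant that skips the NA-and-fill step, offered as a minimal sketch rather than as the answer above: inside each group, which.min(values) gives the position of the smallest value, and indexing event with that position returns the matching event for every row of the group. The name result is only illustrative, and if several rows tie for the minimum, the first one is used.

```r
library(dplyr)
library(tibble)

samp_df <- tibble(id = c("A", "A", "B", "B"),
                  event = c(111, 112, 113, 114),
                  values = c(23, 12, 45, 60))

result <- samp_df %>%
  group_by(id) %>%
  # position of the group minimum, used to look up the event on that row
  mutate(event_with_min = event[which.min(values)]) %>%
  ungroup()

result$event_with_min
#> [1] 112 112 113 113
```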
Related
I am working with the R programming language.
I have a dataset that looks something like this:
x = c("GROUP", "A", "B", "C")
date_1 = c("CLASS 1", 20, 60, 82)
date_1_1 = c("CLASS 2", 37, 22, 8)
date_2 = c("CLASS 1", 15,100,76)
date_2_1 = c("CLASS 2", 84, 18,88)
my_data = data.frame(x, date_1, date_1_1, date_2, date_2_1)
x date_1 date_1_1 date_2 date_2_1
1 GROUP CLASS 1 CLASS 2 CLASS 1 CLASS 2
2 A 20 37 15 84
3 B 60 22 100 18
4 C 82 8 76 88
I am trying to restructure the data so it looks like this:
Note: in the real Excel data, date_1 is the same date as date_1_1 and date_2 is the same as date_2_1. R won't accept duplicate column names, so I named them differently.
Currently, I am doing this manually in Excel using different "transpose" functions, but I am wondering if there is a way to do this in R (possibly using the dplyr library).
I have been trying to read different tutorial websites online (Pivoting), but so far nothing seems to match the problem I am trying to work on.
Can someone please show me how to do this?
Thanks!
I made some assumptions about your data because of the duplicate column names. For example, if the column header pattern is CLASS_ClassNum_Date:
df<-data.frame(GROUP = c("A", "B", "C"),
CLASS_1_1 = c(20, 60, 82),
CLASS_2_1 = c(37, 22, 8),
CLASS_1_2 = c(15,100,76),
CLASS_2_2 = c(84, 18,88))
library(tidyr)
pivot_longer(df, -GROUP,
names_pattern = "(CLASS_.*)_(.*)",
names_to = c(".value", "Date"))
GROUP Date CLASS_1 CLASS_2
<chr> <chr> <dbl> <dbl>
1 A 1 20 37
2 A 2 15 84
3 B 1 60 22
4 B 2 100 18
5 C 1 82 8
6 C 2 76 88
Edit: Substantially improved pivot_longer by using names_pattern= correctly
There are lots of ways to achieve your desired outcome, but I don't believe there is an 'easy'/'simple' way. Here is one potential solution:
library(tidyverse)
x = c("GROUP", "A", "B", "C")
date_1 = c("CLASS 1", 20, 60, 82)
date_1_1 = c("CLASS 2", 37, 22, 8)
date_2 = c("CLASS 1", 15,100,76)
date_2_1 = c("CLASS 2", 84, 18,88)
my_data = data.frame(x, date_1, date_1_1, date_2, date_2_1)
# Combine the labels in the first row with the column names; date_1_1 is the
# same date as date_1, so the trailing "_1" is dropped first
colnames(my_data) <- paste(my_data[1,],
                           sub("^(date_\\d+)_1$", "\\1", colnames(my_data)),
                           sep = "-")
my_data %>%
  filter(`GROUP-x` != "GROUP") %>% # remove first row (info now in column names)
  pivot_longer(-`GROUP-x`, # pivot the CLASS columns, keep the group column
               names_to = c(".value", "Date"),
               names_sep = "-") %>%
  mutate(across(starts_with("CLASS"), ~as.numeric(.x)),
         Date = str_extract(Date, "\\d+")) %>%
  select(date = Date, # rename and reorder the columns
         group = `GROUP-x`,
         count_class_1 = `CLASS 1`,
         count_class_2 = `CLASS 2`) %>%
  arrange(date) # arrange by "date" to get your desired output
#> # A tibble: 6 × 4
#>   date  group count_class_1 count_class_2
#>   <chr> <chr>         <dbl>         <dbl>
#> 1 1     A                20            37
#> 2 1     B                60            22
#> 3 1     C                82             8
#> 4 2     A                15            84
#> 5 2     B               100            18
#> 6 2     C                76            88
Created on 2022-12-09 with reprex v2.0.2
My data often contain either "Left/Right" or "Pre/Post" prefixes without separators in a wide format, and I need to pivot them to tall format, combining variables by these prefixes. My workaround is to use "gsub()" to insert a separator ("_" or ".") into the column names; "pivot_longer" then does what I want with the "names_sep" argument. I'm wondering, though, whether there is a way to do this more directly with the "pivot_longer" "names" arguments ("names_prefix", "names_pattern", "names_to"). Here is what I am attempting:
Original wide format example:
HW <- tribble(
~Subject, ~LeftV1, ~RightV1, ~LeftV2, ~RightV2, ~LeftV3, ~RightV3,
"A", 0, 1, 10, 11, 100, 101,
"B", 2, 3, 12, 13, 102, 103,
"C", 4, 5, 14, 15, 104, 105)
Desired tall format:
HWT <- tribble(
~Subject, ~Side, ~V1, ~V2, ~V3,
"A", "Left", 0, 10, 100,
"A", "Right", 1, 11, 101,
"B", "Left", 2, 12, 102,
"B", "Right", 3, 13, 103,
"C", "Left", 4, 14, 104,
"C", "Right", 5, 15, 105)
I've tried various iterations of syntax that look more or less like this:
HWT <- HW %>% pivot_longer(
cols = contains(c("Left", "Right")),
names_pattern = "/^(Left|Right)",
names_to = c('Side', '.value') )
or this:
HWT <- HW %>% pivot_longer(
cols = contains(c("Left", "Right")),
names_prefix = "/^(Left|Right)",
names_to = c('Side', '.value') )
Each of these gives errors that I am unsure how to resolve.
We could use
library(tidyr)
library(dplyr)
HW %>%
pivot_longer(cols = -Subject, names_to = c("Side", ".value"),
names_pattern = "^(Left|Right)(.*)")
# A tibble: 6 × 5
Subject Side V1 V2 V3
<chr> <chr> <dbl> <dbl> <dbl>
1 A Left 0 10 100
2 A Right 1 11 101
3 B Left 2 12 102
4 B Right 3 13 103
5 C Left 4 14 104
6 C Right 5 15 105
Here is a similar approach using pivot_longer, but with another strategy. I find it easier to understand if we first insert a simple separator such as _. For this we can use rename_with and str_replace_all before pivoting:
library(dplyr)
library(tidyr)
library(stringr)
HW %>%
rename_with(., ~str_replace_all(., 'V', '_V')) %>%
pivot_longer(-Subject,
names_to =c("Side", ".value"),
names_sep ="_")
# A tibble: 6 x 5
Subject Side V1 V2 V3
<chr> <chr> <dbl> <dbl> <dbl>
1 A Left 0 10 100
2 A Right 1 11 101
3 B Left 2 12 102
4 B Right 3 13 103
5 C Left 4 14 104
6 C Right 5 15 105
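The question also mentions "Pre/Post" prefixes. As a hedged sketch, the same names_pattern idea should carry over by swapping the alternatives in the first capture group. The data frame PP and its column names (PreScore, PostScore, PreTime, PostTime) are made up for illustration and do not come from the question.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical wide data with Pre/Post prefixes and no separator
PP <- tribble(
  ~Subject, ~PreScore, ~PostScore, ~PreTime, ~PostTime,
  "A",      1,         2,          10,       20,
  "B",      3,         4,          30,       40)

PPT <- PP %>%
  pivot_longer(cols = -Subject,
               # first group becomes the Phase column,
               # second group names the value columns
               names_to = c("Phase", ".value"),
               names_pattern = "^(Pre|Post)(.*)")

PPT
```

The first capture group must list every prefix that occurs, and the second group (.*) takes whatever remains of the column name as the new column name.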
My dataframe is as below
df <- data.frame(Webpage = c(111, 111, 111, 111, 222, 222),
Dept = c(101, 101, 101, 102, 102, 103),
Emp_Id = c(1, 1, 2, 3, 4, 4),
weights = c(5,5,2,3,4,5))
Webpage Dept Emp_Id weights
111 101 1 5
111 101 1 5
111 101 2 2
111 102 3 3
222 102 4 4
222 103 4 5
For each webpage, I want the number of employees who have seen that webpage, expressed as the sum of their weights, together with the weight percentage.
A unique employee is a unique combination of Dept and Emp_Id.
For example, webpage 111 is seen by Emp_Id 1, 2 and 3, so the number of employees seen is the sum of their weights, i.e. 5+2+3 = 10, and the weight percentage is 0.52 (10/19). 19 is the total sum of the weights of all unique employees (unique combinations of Dept and Emp_Id).
Webpage Number_people_seen seen_percentage
111 10 0.52
222 9 0.47
Here is what I tried, but I am not sure how to get the sum of weights.
library(dplyr)
df %>% group_by(Webpage) %>% distinct(Dept,Emp_Id)
df <- data.frame(Webpage = c(111, 111, 111, 111, 222, 222),
Dept = c(101, 101, 101, 102, 102, 103),
Emp_Id = c(1, 1, 2, 3, 4, 4),
weights = c(5,5,2,3,4,5))
library(tidyverse)
df %>%
group_by(Webpage) %>%
distinct(Dept,Emp_Id, .keep_all = TRUE) %>%
summarise(Number_people_seen = sum(weights)) %>%
mutate(seen_percentage = prop.table(Number_people_seen))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 3
#> Webpage Number_people_seen seen_percentage
#> <dbl> <dbl> <dbl>
#> 1 111 10 0.526
#> 2 222 9 0.474
Created on 2021-04-05 by the reprex package (v0.3.0)
df %>% group_by(Webpage, Emp_Id) %>%
summarise(no_of_ppl_seen = unique(weights)) %>%
group_by(Webpage) %>%
summarise(no_of_ppl_seen = sum(no_of_ppl_seen)) %>%
mutate(seen_percentage = no_of_ppl_seen/sum(no_of_ppl_seen))
# A tibble: 2 x 3
Webpage no_of_ppl_seen seen_percentage
<dbl> <dbl> <dbl>
1 111 10 0.526
2 222 9 0.474
OR
df %>% filter(!duplicated(across(everything()))) %>%
group_by(Webpage) %>%
summarise(number_ppl_seen = sum(weights)) %>%
mutate(seen_perc = number_ppl_seen/sum(number_ppl_seen))
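For comparison, here is a minimal sketch that spells out the de-duplication key from the question (one row per Webpage, Dept and Emp_Id) instead of de-duplicating on all columns. On the sample data it gives the same numbers as the answers above; the two approaches would differ only if the same employee appeared on one webpage with two different weights. The name res is illustrative.

```r
library(dplyr)

df <- data.frame(Webpage = c(111, 111, 111, 111, 222, 222),
                 Dept    = c(101, 101, 101, 102, 102, 103),
                 Emp_Id  = c(1, 1, 2, 3, 4, 4),
                 weights = c(5, 5, 2, 3, 4, 5))

res <- df %>%
  # keep one row per unique employee (Dept + Emp_Id) on each webpage
  distinct(Webpage, Dept, Emp_Id, .keep_all = TRUE) %>%
  group_by(Webpage) %>%
  summarise(Number_people_seen = sum(weights), .groups = "drop") %>%
  # share of the overall weight total: 10/19 and 9/19
  mutate(seen_percentage = Number_people_seen / sum(Number_people_seen))

res
```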
Not sure if tidyr::gather can be used to take multiple columns and merge them in multiple key columns.
Similar questions have been asked but they all refer to gathering multiple columns in one key column.
I'm trying to gather 4 columns into 2 key and 2 value columns like in the following example:
Sample data:
df <- data.frame(
subject = c("a", "b"),
age1 = c(33, 35),
age2 = c(43, 45),
weight1 = c(90, 67),
weight2 = c(70, 87)
)
subject age1 age2 weight1 weight2
1 a 33 43 90 70
2 b 35 45 67 87
Desired result:
dfe <- data.frame(
subject = c("a", "a", "b", "b"),
age = c("age1", "age2", "age1", "age2"),
age_values = c(33, 43, 35, 45),
weight = c("weight1", "weight2", "weight1", "weight2"),
weight_values = c(90, 70, 67, 87)
)
subject age age_values weight weight_values
1 a age1 33 weight1 90
2 a age2 43 weight2 70
3 b age1 35 weight1 67
4 b age2 45 weight2 87
Here's one way to do it -
df %>%
gather(key = "age", value = "age_values", age1, age2) %>%
gather(key = "weight", value = "weight_values", weight1, weight2) %>%
filter(substring(age, 4) == substring(weight, 7))
subject age age_values weight weight_values
1 a age1 33 weight1 90
2 b age1 35 weight1 67
3 a age2 43 weight2 70
4 b age2 45 weight2 87
Here's one approach. The idea is to use gather, then split the resulting dataframe by variable (age and weight), apply the mutate operations separately to each of the two dataframes, and then merge the dataframes back together using subject and the variable number (1 or 2).
library(dplyr)
library(tidyr)
library(purrr)
df %>%
gather(age1:weight2, key = key, value = value) %>%
separate(key, sep = -1, into = c("var", "num")) %>%
split(.$var) %>%
map(~mutate(., !!.$var[1] := paste0(var, num), !!paste0(.$var[1], "_values") := value)) %>%
map(~select(., -var, -value)) %>%
Reduce(f = merge, x = .) %>%
select(-num)
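Since gather is superseded in current tidyr, the same reshaping can also be sketched with pivot_longer and the ".value" sentinel. This is an alternative to the answers above, not a restatement of them; the helper column num and the result name long are illustrative. The pattern assumes every column other than subject is a run of lowercase letters followed by digits.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  subject = c("a", "b"),
  age1 = c(33, 35),
  age2 = c(43, 45),
  weight1 = c(90, 67),
  weight2 = c(70, 87)
)

long <- df %>%
  # "age1" -> value column "age", num "1"; "weight2" -> "weight", num "2"
  pivot_longer(-subject,
               names_to = c(".value", "num"),
               names_pattern = "^([a-z]+)(\\d+)$") %>%
  rename(age_values = age, weight_values = weight) %>%
  # rebuild the key columns asked for in the question
  mutate(age = paste0("age", num),
         weight = paste0("weight", num)) %>%
  select(subject, age, age_values, weight, weight_values)

long
```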
I have multiple factors ("a","b","c") in my dataset, each with corresponding values for Price and Cost.
dat <- data.frame(
ProductCode = c("a", "a", "b", "b", "c", "c"),
Price = c(24, 37, 78, 45, 20, 34),
Cost = c(10,15,45,25,10,17)
)
I am looking for the sum of Price and Cost for each ProductCode.
by.code <- group_by(dat, code)
by.code <- summarise(by.code,
SumPrice = sum(Price),
SumCost = sum(Cost))
This code does not work as it sums all values in the column, without breaking them into categories.
SumPrice SumCost
1 238 122
Thanks in advance for your help.
This is not dplyr, but this answer is for you if you don't mind using the sqldf or data.table package:
library(sqldf)
sqldf("select ProductCode, sum(Price) as PriceSum, sum(Cost) as CostSum from dat group by ProductCode")
ProductCode PriceSum CostSum
a 61 25
b 123 70
c 54 27
OR using the data.table package:
library(data.table)
MM<-data.table(dat)
MM[, list(sum(Price),sum(Cost)), by = ProductCode]
ProductCode V1 V2
1: a 61 25
2: b 123 70
3: c 54 27
Your code works fine apart from a typo: the column is named ProductCode, but you group by code. Rename the column from ProductCode to code (or group by ProductCode) and it works. I just did that and R gives the proper output. Below is the code:
library(dplyr)
dat <- data.frame(
code = c("a", "a", "b", "b", "c", "c"),
Price = c(24, 37, 78, 45, 20, 34),
Cost = c(10,15,45,25,10,17)
)
dat
by.code <- group_by(dat, code)
by.code <- summarise(by.code,
SumPrice = sum(Price),
SumCost = sum(Cost))
by.code
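The same result can be had without renaming anything, by grouping on the existing ProductCode column. The sketch below also prefixes the verbs with dplyr::. This is an assumption about the asker's session, not something the question confirms: a single row of grand totals is what you would typically see if another package's summarise (for example plyr's, loaded after dplyr) were masking dplyr's. The name by_code is illustrative.

```r
library(dplyr)

dat <- data.frame(
  ProductCode = c("a", "a", "b", "b", "c", "c"),
  Price = c(24, 37, 78, 45, 20, 34),
  Cost = c(10, 15, 45, 25, 10, 17)
)

by_code <- dat %>%
  dplyr::group_by(ProductCode) %>%        # group on the column that exists
  dplyr::summarise(SumPrice = sum(Price), # one row per ProductCode
                   SumCost = sum(Cost))

by_code
```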
We can use aggregate from base R
aggregate(.~ProductCode, dat, sum)
# ProductCode Price Cost
#1 a 61 25
#2 b 123 70
#3 c 54 27