Sum of unique combination of values in columns in R

My dataframe is as below
df <- data.frame(Webpage = c(111, 111, 111, 111, 222, 222),
Dept = c(101, 101, 101, 102, 102, 103),
Emp_Id = c(1, 1, 2, 3, 4, 4),
weights = c(5,5,2,3,4,5))
Webpage Dept Emp_Id weights
111 101 1 5
111 101 1 5
111 101 2 2
111 102 3 3
222 102 4 4
222 103 4 5
For each webpage, I want the number of employees who have seen that webpage, expressed as the sum of their weights, together with the weight percentage.
A unique employee is a unique combination of Dept and Emp_Id.
For example, webpage 111 is seen by Emp_Id 1, 2 and 3, so the number of employees who have seen it is the sum of their weights, i.e. 5+2+3 = 10, and the weight percentage is 0.52 (10/19). Here 19 is the total sum of weights over unique employees (unique combinations of Dept and Emp_Id).
Webpage Number_people_seen seen_percentage
111 10 0.52
222 9 0.47
What I tried is below but not sure how to get the sum of weights.
library(dplyr)
df %>% group_by(Webpage) %>% distinct(Dept,Emp_Id)

df <- data.frame(Webpage = c(111, 111, 111, 111, 222, 222),
Dept = c(101, 101, 101, 102, 102, 103),
Emp_Id = c(1, 1, 2, 3, 4, 4),
weights = c(5,5,2,3,4,5))
library(tidyverse)
df %>%
group_by(Webpage) %>%
distinct(Dept,Emp_Id, .keep_all = TRUE) %>%
summarise(Number_people_seen = sum(weights)) %>%
mutate(seen_percentage = prop.table(Number_people_seen))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 3
#> Webpage Number_people_seen seen_percentage
#> <dbl> <dbl> <dbl>
#> 1 111 10 0.526
#> 2 222 9 0.474
Created on 2021-04-05 by the reprex package (v0.3.0)

df %>% group_by(Webpage, Emp_Id) %>%
summarise(no_of_ppl_seen = unique(weights)) %>%
group_by(Webpage) %>%
summarise(no_of_ppl_seen = sum(no_of_ppl_seen)) %>%
mutate(seen_percentage = no_of_ppl_seen/sum(no_of_ppl_seen))
# A tibble: 2 x 3
Webpage no_of_ppl_seen seen_percentage
<dbl> <dbl> <dbl>
1 111 10 0.526
2 222 9 0.474
OR
df %>% filter(!duplicated(across(everything()))) %>%
group_by(Webpage) %>%
summarise(number_ppl_seen = sum(weights)) %>%
mutate(seen_perc = number_ppl_seen/sum(number_ppl_seen))
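For comparison, here is a minimal base R sketch of the same idea. It assumes that a repeated (Webpage, Dept, Emp_Id) row always carries the same weight, so keeping the first occurrence is enough. The names uniq and res are only illustrative.

```r
df <- data.frame(Webpage = c(111, 111, 111, 111, 222, 222),
                 Dept    = c(101, 101, 101, 102, 102, 103),
                 Emp_Id  = c(1, 1, 2, 3, 4, 4),
                 weights = c(5, 5, 2, 3, 4, 5))

# Keep one row per (Webpage, Dept, Emp_Id) combination
uniq <- df[!duplicated(df[c("Webpage", "Dept", "Emp_Id")]), ]

# Sum the weights of the remaining rows per webpage
res <- aggregate(weights ~ Webpage, data = uniq, FUN = sum)
names(res)[2] <- "Number_people_seen"

# Share of each webpage in the overall total
res$seen_percentage <- res$Number_people_seen / sum(res$Number_people_seen)
res
```

If the same employee could appear with different weights, you would first have to decide which weight to keep, and this sketch would need adjusting.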

Related

R: assign values to a column when two other columns match

Here is my sample data:
samp_df <- tibble(id= c("A", "A", "B", "B"),
event= c(111, 112, 113, 114),
values = c(23, 12, 45, 60),
min_value = c(12, 12, 113, 113))
I would like to create a column in the dataframe that has the event of the min value. So in the example the column would look like: c(112, 112, 113, 113). So the idea is that I want to take the value from event whenever values and min_value match. It is important that this is also grouped by the id variable.
Here is what I tried, but it's not exactly right, as it adds NAs instead of the event:
samp_df <- samp_df %>% group_by(id) %>%
mutate(event_with_min = if_else(min_value == values,
event, NA_real_))
A dplyr solution would also be optimal!
library(tibble)
library(dplyr)
library(tidyr)
samp_df <- tibble(id= c("A", "A", "B", "B"),
event= c(111, 112, 113, 114),
values = c(23, 12, 45, 60))
samp_df %>%
group_by(id) %>%
mutate(min_value = min(values),
event_with_min = if_else(min_value == values, event, NA_real_)) %>%
fill(event_with_min, .direction = "downup")
#> # A tibble: 4 x 5
#> # Groups: id [2]
#> id event values min_value event_with_min
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 111 23 12 112
#> 2 A 112 12 12 112
#> 3 B 113 45 45 113
#> 4 B 114 60 45 113
Created on 2022-10-14 by the reprex package (v2.0.1)
I've had to sign up as a guest as I am working on a train. Will revert to own user once back at home! Peter
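A shorter variant that may also work is to index event directly with which.min(values) inside each group, which avoids the fill step. This is only a sketch: it assumes the minimum should be recomputed per id, and on ties which.min picks the first row. The name res is illustrative.

```r
library(dplyr)
library(tibble)

samp_df <- tibble(id = c("A", "A", "B", "B"),
                  event = c(111, 112, 113, 114),
                  values = c(23, 12, 45, 60))

res <- samp_df %>%
  group_by(id) %>%
  # event on the row where values is smallest within the group
  mutate(event_with_min = event[which.min(values)]) %>%
  ungroup()
res
```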

How to pivot_longer with prefix to ID column

My data often contain either "Left/Right" or "Pre/Post" prefixes without separators in a wide format that I need to pivot to tall format combining variables by these prefixes. I have a work around of using "gsub()" to insert a separator ("_" or ".") into the column names. "pivot_longer" then does what I want with the "names_sep" argument. I'm wondering though if there is a way to make this work more directly with "pivot_longer" "names" syntax ("names_prefix", "names_pattern", "names_to"). Here is what I am attempting:
Original wide format example:
HW <- tribble(
~Subject, ~LeftV1, ~RightV1, ~LeftV2, ~RightV2, ~LeftV3, ~RightV3,
"A", 0, 1, 10, 11, 100, 101,
"B", 2, 3, 12, 13, 102, 103,
"C", 4, 5, 14, 15, 104, 105)
Desired tall format:
HWT <- tribble(
~Subject, ~Side, ~V1, ~V2, ~V3,
"A", "Left", 0, 10, 100,
"A", "Right", 1, 11, 101,
"B", "Left", 2, 12, 102,
"B", "Right", 3, 13, 103,
"C", "Left", 4, 14, 104,
"C", "Right", 5, 15, 105)
I've tried various iterations of syntax that look more or less like this:
HWT <- HW %>% pivot_longer(
cols = contains(c("Left", "Right")),
names_pattern = "/^(Left|Right)",
names_to = c('Side', '.value') )
or this:
HWT <- HW %>% pivot_longer(
cols = contains(c("Left", "Right")),
names_prefix = "/^(Left|Right)",
names_to = c('Side', '.value') )
Each of these gives syntax errors that I am unsure how to resolve.
We could use
library(tidyr)
library(dplyr)
HW %>%
pivot_longer(cols = -Subject, names_to = c("Side", ".value"),
names_pattern = "^(Left|Right)(.*)")
# A tibble: 6 × 5
Subject Side V1 V2 V3
<chr> <chr> <dbl> <dbl> <dbl>
1 A Left 0 10 100
2 A Right 1 11 101
3 B Left 2 12 102
4 B Right 3 13 103
5 C Left 4 14 104
6 C Right 5 15 105
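To see what the two capture groups in that pattern pick up, the same regular expression can be tried on a few column names in base R. This is just an illustration of the regex, not of what pivot_longer does internally. The names nms and parts are made up for the example.

```r
nms <- c("LeftV1", "RightV1", "LeftV2")

# Each row holds: full match, first group (Side), second group (the .value part)
parts <- do.call(rbind, regmatches(nms, regexec("^(Left|Right)(.*)$", nms)))
parts
```

The first group feeds the Side column and the second group supplies the output column names through ".value".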
Here is a similar approach using pivot_longer, but with another strategy. I find it easier to understand if we use a simple separator like _. For this we can use rename_with and str_replace_all before pivoting:
library(dplyr)
library(tidyr)
library(stringr)
HW %>%
# insert "_" before each "V" so the names become Left_V1, Right_V1, ...
rename_with(., ~str_replace_all(., 'V', '_V')) %>%
pivot_longer(-Subject,
names_to =c("Side", ".value"),
names_sep ="_")
# A tibble: 6 x 5
Subject Side V1 V2 V3
<chr> <chr> <dbl> <dbl> <dbl>
1 A Left 0 10 100
2 A Right 1 11 101
3 B Left 2 12 102
4 B Right 3 13 103
5 C Left 4 14 104
6 C Right 5 15 105

finding difference of a column row-by-row in R

I have a subset of data as below:
df <- structure(list(id = c(100, 101, 102, 103, 104, 105),
`family id` = c(1,1, 2, 2, 3, 3),
disease = c(1, 0, 0, 1, 1, 0),
val = c("3.1", "6.2", "2.45", "7.77", "4.56", "2.1")),
class = c("tbl_df", "tbl","data.frame"), row.names = c(NA, -6L))
I want to find the difference: value of sibling with disease(1) - value of sibling with no disease(0)?
the output should be as below:
Adding a helper id column and using tidyr::pivot_wider you could do:
library(dplyr)
library(tidyr)
df |>
group_by(`family id`) |>
mutate(id1 = row_number(), val = as.numeric(val)) |>
ungroup() |>
pivot_wider(names_from = id1, values_from = -c(id1, `family id`), names_sep = "") |>
mutate(difference = ifelse(disease1 == 1, val1 - val2, val2 - val1))
#> # A tibble: 3 × 8
#> `family id` id1 id2 disease1 disease2 val1 val2 difference
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 100 101 1 0 3.1 6.2 -3.1
#> 2 2 102 103 0 1 2.45 7.77 5.32
#> 3 3 104 105 1 0 4.56 2.1 2.46
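If only the difference per family is needed, a sketch without pivoting could subset val by disease inside each group. It assumes every family has exactly one sibling with disease == 1 and exactly one with disease == 0. The name res is illustrative.

```r
library(dplyr)

df <- tibble::tibble(
  id = c(100, 101, 102, 103, 104, 105),
  `family id` = c(1, 1, 2, 2, 3, 3),
  disease = c(1, 0, 0, 1, 1, 0),
  val = c("3.1", "6.2", "2.45", "7.77", "4.56", "2.1")
)

res <- df %>%
  mutate(val = as.numeric(val)) %>%
  group_by(`family id`) %>%
  # value of the affected sibling minus value of the unaffected sibling
  summarise(difference = val[disease == 1] - val[disease == 0])
res
```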

Grouping by each value of a column based on the categories of a list they fall into

Today has been quite challenging and I can't think of any new ideas anymore, so the solution to this question may be quite obvious to you. I have a very simple data frame like the one below:
structure(list(user_id = c(101, 102, 102, 103, 103, 106, 107,
111), phone_number = c(4030201, 4030201, 4030202, 4030202, 4030203,
4030204, 4030205, 4030203), id = 1:8), class = "data.frame", row.names = c(NA,
-8L))
and also a list:
list(c(1, 2, 3, 4, 5, 8), 6, 7)
I want to assign each value in the id column of my data frame to a group according to which element of the list it falls into, preferably with purrr package functions. So the desired output is something like this:
grp <- c(1, 1, 1, 1, 1, 2, 3, 1)
Thank you very much in advance and learning from/ beside you guys has been a great honor of my life.
Sincerely Yours
Anoushiravan
One option involving purrr could be:
df %>%
mutate(grp = imap(lst, ~ .y * (id %in% .x)) %>% reduce(`+`))
user_id phone_number id grp
1 101 4030201 1 1
2 102 4030201 2 1
3 102 4030202 3 1
4 103 4030202 4 1
5 103 4030203 5 1
6 106 4030204 6 2
7 107 4030205 7 3
8 111 4030203 8 1
Case-I when the list is unnamed
df <- structure(list(user_id = c(101, 102, 102, 103, 103, 106, 107,
111), phone_number = c(4030201, 4030201, 4030202, 4030202, 4030203,
4030204, 4030205, 4030203), id = 1:8), class = "data.frame", row.names = c(NA,
-8L))
lst <- list(c(1, 2, 3, 4, 5, 8), 6, 7)
library(tidyverse)
df %>% mutate(GRP = map(id, \(xy) seq_along(lst)[map_lgl(lst, ~ xy %in% .x)]))
#> user_id phone_number id GRP
#> 1 101 4030201 1 1
#> 2 102 4030201 2 1
#> 3 102 4030202 3 1
#> 4 103 4030202 4 1
#> 5 103 4030203 5 1
#> 6 106 4030204 6 2
#> 7 107 4030205 7 3
#> 8 111 4030203 8 1
Case-II when the list is named
df <- structure(list(user_id = c(101, 102, 102, 103, 103, 106, 107,
111), phone_number = c(4030201, 4030201, 4030202, 4030202, 4030203,
4030204, 4030205, 4030203), id = 1:8), class = "data.frame", row.names = c(NA,
-8L))
lst <- list(a = c(1, 2, 3, 4, 5, 8), b = 6, c = 7)
library(tidyverse)
df %>% mutate(GRP = map(id, \(xy) names(lst)[map_lgl(lst, ~ xy %in% .x)]))
#> user_id phone_number id GRP
#> 1 101 4030201 1 a
#> 2 102 4030201 2 a
#> 3 102 4030202 3 a
#> 4 103 4030202 4 a
#> 5 103 4030203 5 a
#> 6 106 4030204 6 b
#> 7 107 4030205 7 c
#> 8 111 4030203 8 a
Created on 2021-06-14 by the reprex package (v2.0.0)
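Although the question asks for purrr, a base R lookup may be worth comparing. The sketch below assumes every id appears in exactly one element of the list. The names lookup and grp are illustrative.

```r
df <- data.frame(user_id = c(101, 102, 102, 103, 103, 106, 107, 111),
                 phone_number = c(4030201, 4030201, 4030202, 4030202,
                                  4030203, 4030204, 4030205, 4030203),
                 id = 1:8)
lst <- list(c(1, 2, 3, 4, 5, 8), 6, 7)

# Group index repeated once per value held in each list element
lookup <- rep(seq_along(lst), lengths(lst))

# Find each id in the flattened list and read off its group index
grp <- lookup[match(df$id, unlist(lst))]
df$grp <- grp
df
```

An id that is missing from the list would get NA here, which is one way to spot values that do not belong to any group.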

How do I select non-unique combination of columns?

My data looks like this:
counts <- data.frame(
pos = c(101, 101, 101, 102, 102, 102, 103, 103, 103, 101, 101, 101),
chr = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
subj = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C")
)
pos is supposed to belong to only one unique chr, but here pos 101 belongs to both chr 1 and 4.
I can detect this case like:
counts %>% select(pos, chr) %>%
group_by(pos) %>%
summarise(n_chrs = length(unique(chr))) %>%
filter(n_chrs > 1)
This returns each pos that has more than one chr value:
# A tibble: 1 x 2
pos n_chrs
<dbl> <int>
1 101 2
What I'd like is to know which chr values are implicated, something like:
pos chr
1 101 1
2 101 4
Thanks!
You could do:
library(dplyr)
counts %>%
group_by(pos) %>%
distinct(chr) %>%
filter(n() > 1)
Output:
# A tibble: 2 x 2
# Groups: pos [1]
pos chr
<dbl> <dbl>
1 101 1
2 101 4
An option using data.table
library(data.table)
unique(setDT(counts), by = c('pos', 'chr'))[, .(chr = chr[.N > 1]), pos]
# pos chr
#1: 101 1
#2: 101 4
Instead of summarize, you could just use mutate to create the group-wise count. This will make sure you keep chr, which you're interested in:
counts %>% select(pos, chr) %>%
group_by(pos) %>%
mutate(n_chrs = length(unique(chr))) %>%
filter(n_chrs > 1) %>%
unique()
Result:
# A tibble: 2 x 3
# Groups: pos [1]
pos chr n_chrs
<dbl> <dbl> <int>
1 101 1 2
2 101 4 2
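Another possible variant is to reduce the data to distinct (pos, chr) pairs first and then keep the groups with more than one chr via n_distinct. This is a sketch along the same lines as the answers above. The name res is illustrative.

```r
library(dplyr)

counts <- data.frame(
  pos = c(101, 101, 101, 102, 102, 102, 103, 103, 103, 101, 101, 101),
  chr = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
  subj = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C")
)

res <- counts %>%
  distinct(pos, chr) %>%            # one row per (pos, chr) pair
  group_by(pos) %>%
  filter(n_distinct(chr) > 1) %>%   # keep pos values mapped to several chr
  ungroup()
res
```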

Resources