Calculating the frequency of a factor with a specific formula in R

I have data set like this:
df <- data.frame(ID = c("A","A","A","B","B","B","C","C","C"),
                 levels = c("Y", "R", "O", "Y", "R", "O", "Y", "R", "O"),
                 Counts = c(5, 1, 5, 10, 2, 1, 3, 5, 8))
ID levels Counts
A Y 5
A R 1
A O 5
B Y 10
B R 2
B O 1
C Y 3
C R 5
C O 8
I want to create another column holding a percentage computed from the second column (levels) with this formula:
freq=(Y+O/Y+O+R)*100
So now the data frame should look like this :
ID freq
A 0.1
B 0.2
C 0.3
I tried a couple of solutions, but they did not work. Can you please help me?

Using pivot_wider
df %>%
  pivot_wider(id_cols = ID, values_from = Counts, names_from = levels) %>%
  mutate(freq = (Y+O/Y+O+R)*100,
         freq. = (Y+O)/(Y+O+R)*100) # %>% select(-Y, -R, -O)
ID Y R O freq freq.
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 5 1 5 1200 90.9
2 B 10 2 1 1310 84.6
3 C 3 5 8 1867. 68.8
I'm not sure what your formula is meant to compute.

You may try using match -
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(freq = (Counts[match('Y', levels)] + Counts[match('O', levels)])/sum(Counts))
# ID freq
# <chr> <dbl>
#1 A 0.909
#2 B 0.846
#3 C 0.688
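The same result can be written without match(), by subsetting with %in% (a sketch; like the answer above, it assumes each ID has at most one Y row and one O row, as in the example). Multiply by 100 if a percentage is wanted:

```r
library(dplyr)

df <- data.frame(ID = c("A","A","A","B","B","B","C","C","C"),
                 levels = c("Y", "R", "O", "Y", "R", "O", "Y", "R", "O"),
                 Counts = c(5, 1, 5, 10, 2, 1, 3, 5, 8))

df %>%
  group_by(ID) %>%
  # sum the Y and O counts, divide by the group total, express as a percentage
  summarise(freq = sum(Counts[levels %in% c("Y", "O")]) / sum(Counts) * 100)
```

Unlike match(), this also works if a level is missing from a group (the missing count simply contributes 0 to the sum).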

Related

How can I calculate the variance across all columns in a data frame in R, according to the values of another data frame, using dplyr?

I have a table (table1) with correlation coefficients that correspond to a variable as described below :
var = c("A","B","C","D","E")
cor = c(0.7,0.3,0.5,0.1,0.9)
table1 = tibble(var,cor)
# A tibble: 5 × 2
var cor
<chr> <dbl>
1 A 0.7
2 B 0.3
3 C 0.5
4 D 0.1
5 E 0.9
I have a vector of interest :
y=c(1,2,3,4)
and a new table (table2) as shown below
A = c(1, 2, NA, 4)
B = c(5, 6, 7, 8)
C = c(NA, 10, 11, 12)
D = c(13, 14, 15, 16)
table2 = tibble(A, B, C, D); table2
# A tibble: 4 × 4
A B C D
<dbl> <dbl> <dbl> <dbl>
1 1 5 NA 13
2 2 6 10 14
3 NA 7 11 15
4 4 8 12 16
I want to calculate the covariance of vector y with (across) all columns of table2, but only if the corresponding correlation in table1 is greater than 0.3; if it is 0.3 or less, return 0.
Therefore I want to search table1 for correlations > 0.3, i.e. A and C (because table2 does not have column E).
How can I implement this in R using base R or the dplyr package?
This should work for you:
library(dplyr)
library(tidyr)
var = c("A","B","C","D","E")
cor = c(0.7,0.3,0.5,0.1,0.9)
table1 = tibble(var,cor)
A = c(1, 2, NA, 4)
B = c(5, 6, 7, 8)
C = c(NA, 10, 11, 12)
D = c(13, 14, 15, 16)
table2 = tibble(A,B,C,D)
table2
y=c(1,2,3,4)
table3 <- table2 %>%
  summarise(across(.cols = everything(), ~ cov(.x, y = y, use = "complete.obs"))) %>%
  pivot_longer(cols = everything(), names_to = "var", values_to = "covar") %>%
  merge(table1) %>%
  filter(cor > 0.3)
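The question actually asks for 0, rather than dropping the column, when the correlation is not above 0.3. A sketch of that variant replaces the filter with a mutate; it assumes every column name of table2 appears in table1$var:

```r
library(dplyr)
library(tidyr)

table1 <- tibble(var = c("A", "B", "C", "D", "E"),
                 cor = c(0.7, 0.3, 0.5, 0.1, 0.9))
table2 <- tibble(A = c(1, 2, NA, 4),
                 B = c(5, 6, 7, 8),
                 C = c(NA, 10, 11, 12),
                 D = c(13, 14, 15, 16))
y <- c(1, 2, 3, 4)

table2 %>%
  summarise(across(everything(), ~ cov(.x, y, use = "complete.obs"))) %>%
  pivot_longer(everything(), names_to = "var", values_to = "covar") %>%
  left_join(table1, by = "var") %>%
  # zero out the covariance where the correlation is not above 0.3
  mutate(covar = ifelse(cor > 0.3, covar, 0))
```

Here B (cor = 0.3) and D (cor = 0.1) come back as 0 while A and C keep their computed covariances.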

Add levels missing in one group to summary table using dplyr

When summarizing data, some groups may have observations not present in another group. In the example below, group 2 has no males. How can I, in a tidy way, insert these observations into the summary table?
data example:
a <- data.frame(gender=factor(c("m", "m", "m", "f", "f", "f", "f")), group=c(1,1,1,1,1,2,2))
gender group
1 m 1
2 m 1
3 m 1
4 f 1
5 f 1
6 f 2
7 f 2
data summary:
a %>% group_by(gender, group) %>% summarise(n=n())
gender group n
<fct> <dbl> <int>
1 f 1 2
2 f 2 2
3 m 1 3
Desired output:
gender group n
<fct> <dbl> <int>
1 f 1 2
2 f 2 2
3 m 1 3
4 m 2 0
After the summarise step, we can use complete:
library(dplyr)
library(tidyr)
a %>%
  group_by(gender, group) %>%
  summarise(n = n(), .groups = 'drop') %>%
  complete(gender, group, fill = list(n = 0))
Output:
# A tibble: 4 x 3
# gender group n
# <fct> <dbl> <dbl>
#1 f 1 2
#2 f 2 2
#3 m 1 3
#4 m 2 0
Another option is to reshape to wide and then back to long format:
a %>%
  pivot_wider(names_from = group, values_from = group,
              values_fn = length, values_fill = 0) %>%
  pivot_longer(cols = -gender, names_to = 'group', values_to = 'n')
It is even easier in base R:
as.data.frame(table(a))
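As a usage note, table() crosses all factor levels, so the zero m/2 combination appears automatically; the only cosmetic difference from the dplyr output is the column name Freq, which can be renamed (a small sketch):

```r
a <- data.frame(gender = factor(c("m", "m", "m", "f", "f", "f", "f")),
                group = c(1, 1, 1, 1, 1, 2, 2))

# table() cross-tabulates all combinations, including the empty m/2 cell
out <- as.data.frame(table(a))
names(out)[names(out) == "Freq"] <- "n"  # match the summarise() column name
out
```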

Dplyr, join successive dataframes to pre-existing columns, summing their values

I want to perform multiple joins to the original dataframe, from the same source with different IDs each time. Specifically, I only need two joins, but when I perform the second one, the columns being joined already exist in the input df; rather than adding them under new names with the .x/.y suffixes, I want to sum their values into the existing columns. See the code below for the desired output.
# Input data:
values <- tibble(
id = LETTERS[1:10],
variable1 = 1:10,
variable2 = (1:10)*10
)
df <- tibble(
twin_id = c("A/F", "B/G", "C/H", "D/I", "E/J")
)
> values
# A tibble: 10 x 3
id variable1 variable2
<chr> <int> <dbl>
1 A 1 10
2 B 2 20
3 C 3 30
4 D 4 40
5 E 5 50
6 F 6 60
7 G 7 70
8 H 8 80
9 I 9 90
10 J 10 100
> df
# A tibble: 5 x 1
twin_id
<chr>
1 A/F
2 B/G
3 C/H
4 D/I
5 E/J
So this is the two joins:
joined_df <- df %>%
  tidyr::separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE) %>%
  left_join(values, by = c("left_id" = "id")) %>%
  left_join(values, by = c("right_id" = "id"))
> joined_df
# A tibble: 5 x 7
twin_id left_id right_id variable1.x variable2.x variable1.y variable2.y
<chr> <chr> <chr> <int> <dbl> <int> <dbl>
1 A/F A F 1 10 6 60
2 B/G B G 2 20 7 70
3 C/H C H 3 30 8 80
4 D/I D I 4 40 9 90
5 E/J E J 5 50 10 100
And this is the output I want, using the only way I can see to get it:
output_df_wanted <- joined_df %>%
  mutate(variable1 = variable1.x + variable1.y,
         variable2 = variable2.x + variable2.y) %>%
  select(twin_id, left_id, right_id, variable1, variable2)
> output_df_wanted
# A tibble: 5 x 5
twin_id left_id right_id variable1 variable2
<chr> <chr> <chr> <int> <dbl>
1 A/F A F 7 70
2 B/G B G 9 90
3 C/H C H 11 110
4 D/I D I 13 130
5 E/J E J 15 150
I can see how to get what I want using a mutate statement, but I will have a much larger number of variables in the actual dataset. I am wondering if this is the best way to do it.
You can try reshaping your data and using dplyr::summarise_at:
library(tidyr)
library(dplyr)
df %>%
  separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE) %>%
  pivot_longer(-twin_id) %>%
  left_join(values, by = c("value" = "id")) %>%
  group_by(twin_id) %>%
  summarise_at(vars(starts_with("variable")), sum) %>%
  separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE)
## A tibble: 5 x 5
# twin_id left_id right_id variable1 variable2
# <chr> <chr> <chr> <int> <dbl>
#1 A/F A F 7 70
#2 B/G B G 9 90
#3 C/H C H 11 110
#4 D/I D I 13 130
#5 E/J E J 15 150
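summarise_at() still works but has since been superseded; the same step in current dplyr style uses across() (a sketch of the equivalent pipeline):

```r
library(dplyr)
library(tidyr)

values <- tibble(id = LETTERS[1:10],
                 variable1 = 1:10,
                 variable2 = (1:10) * 10)
df <- tibble(twin_id = c("A/F", "B/G", "C/H", "D/I", "E/J"))

df %>%
  separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE) %>%
  pivot_longer(-twin_id) %>%
  left_join(values, by = c("value" = "id")) %>%
  group_by(twin_id) %>%
  # across() is the modern replacement for summarise_at()
  summarise(across(starts_with("variable"), sum)) %>%
  separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE)
```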
You can use my package safejoin if using a GitHub package is acceptable to you.
The idea is that you have conflicting columns; dplyr and base R deal with conflicts by renaming the columns, while safejoin is more flexible: you can choose which function to apply in case of a conflict. Here you want to add the values, so we'll use conflict = `+`; for the same effect you could have used conflict = ~ .x + .y or conflict = ~ ..1 + ..2.
# remotes::install_github("moodymudskipper/safejoin")
library(tidyverse)
library(safejoin)
values <- tibble(
id = LETTERS[1:10],
variable1 = 1:10,
variable2 = (1:10)*10
)
df <- tibble(
twin_id = c("A/F", "B/G", "C/H", "D/I", "E/J")
)
joined_df <- df %>%
  tidyr::separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE) %>%
  left_join(values, by = c("left_id" = "id")) %>%
  safe_left_join(values, by = c("right_id" = "id"), conflict = `+`)
joined_df
#> # A tibble: 5 x 5
#> twin_id left_id right_id variable1 variable2
#> <chr> <chr> <chr> <int> <dbl>
#> 1 A/F A F 7 70
#> 2 B/G B G 9 90
#> 3 C/H C H 11 110
#> 4 D/I D I 13 130
#> 5 E/J E J 15 150
Created on 2020-04-29 by the reprex package (v0.3.0)
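If a GitHub package is not an option, the suffixed columns produced by two plain left_joins can still be summed programmatically for any number of variables, e.g. with a small base-R loop over the variable names (a sketch; vars holds the names of the shared value columns):

```r
library(dplyr)
library(tidyr)

values <- tibble(id = LETTERS[1:10],
                 variable1 = 1:10,
                 variable2 = (1:10) * 10)
df <- tibble(twin_id = c("A/F", "B/G", "C/H", "D/I", "E/J"))

joined <- df %>%
  separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE) %>%
  left_join(values, by = c("left_id" = "id")) %>%
  left_join(values, by = c("right_id" = "id"), suffix = c(".x", ".y"))

vars <- setdiff(names(values), "id")  # every value column joined twice
for (v in vars) {
  # sum each .x/.y pair back into a single column named after the variable
  joined[[v]] <- joined[[paste0(v, ".x")]] + joined[[paste0(v, ".y")]]
}
joined <- select(joined, twin_id, left_id, right_id, all_of(vars))
```

This avoids writing one mutate line per variable when the real dataset has many of them.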

R Vlookup Two Criteria and Fill in the Value

The actual dataframe consists of more than a million rows.
Say for example a dataframe is:
UniqueID Code Value OtherData
1 A 5 Z01
1 B 6 Z02
1 C 7 Z03
2 A 10 Z11
2 B 11 Z24
2 C 12 Z23
3 A 10 Z21
I want to obtain the ratio A/B. For example, for UniqueID 1, the ratio A/B = 5/6.
Thus, I transform the original dataframe to:
UniqueID A_Value B_Value Ratio_A/B
1 5
2 10
3 10
The question is: how do I look up the original dataframe by its UniqueID and then fill in its B value? If there is no B value, then just return 0.
Thank you.
You can first remove the columns which are not necessary, select only the rows where Code is "A" or "B", get the data in wide format, and create a new column with the value of A/B:
library(dplyr)
library(tidyr)
df %>%
  select(-OtherData) %>%
  filter(Code %in% c("A", "B")) %>%
  pivot_wider(names_from = Code, values_from = Value, values_fill = list(Value = 0)) %>%
  # OR if you want to have NA values instead of 0 use
  # pivot_wider(names_from = Code, values_from = Value) %>%
  mutate(Ratio_A_B = A/B)
# UniqueID A B Ratio_A_B
# <int> <int> <int> <dbl>
#1 1 5 6 0.833
#2 2 10 11 0.909
#3 3 10 0 Inf
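Since the question asks for 0 rather than Inf when B is absent, one small follow-up step (a sketch, with the example data typed in as a data frame) guards the division:

```r
library(dplyr)
library(tidyr)

df <- data.frame(UniqueID = c(1, 1, 1, 2, 2, 2, 3),
                 Code = c("A", "B", "C", "A", "B", "C", "A"),
                 Value = c(5, 6, 7, 10, 11, 12, 10),
                 OtherData = c("Z01", "Z02", "Z03", "Z11", "Z24", "Z23", "Z21"))

df %>%
  select(-OtherData) %>%
  filter(Code %in% c("A", "B")) %>%
  pivot_wider(names_from = Code, values_from = Value,
              values_fill = list(Value = 0)) %>%
  # return 0 instead of dividing by zero when B is missing
  mutate(Ratio_A_B = ifelse(B == 0, 0, A / B))
```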

Group data hierarchically on two levels, then compute relative frequencies in R using dplyr [duplicate]

This question already has answers here:
Relative frequencies / proportions with dplyr
(10 answers)
Closed 3 years ago.
I want to do something which appears simple, but I don't have a good feel for R yet; it is a maze of twisty passages, all different.
I have a table with several variables, and I want to group on two variables ... I want a two-level hierarchical grouping, also known as a tree. This can evidently be done using the group_by function of dplyr.
And then I want to compute marginal statistics (in this case, relative frequencies) based on group counts for level 1 and level 2.
In pictures, given this table of 18 rows:
I want this table of 6 rows:
Is there a simple way to do this in dplyr? (I can do it in SQL, but ...)
Edited for example
For example, based on the nycflights13 package:
library(dplyr)
install.packages("nycflights13")
require(nycflights13)
data(flights) # contains information about flights, one flight per row
ff <- flights %>%
  mutate(approx_dist = floor((distance + 999)/1000)*1000) %>%
  select(carrier, approx_dist) %>%
  group_by(carrier, approx_dist) %>%
  summarise(n = n()) %>%
  arrange(carrier, approx_dist)
This creates a tbl ff with the number of flights for each pair of (carrier, inter-airport-distance-rounded-to-1000s):
# A tibble: 33 x 3
# Groups: carrier [16]
carrier approx_dist n
<chr> <dbl> <int>
1 9E 1000 15740
2 9E 2000 2720
3 AA 1000 9146
4 AA 2000 17210
5 AA 3000 6373
And now I would like to compute the relative frequencies for the "approx_dist" values in each "carrier" group, for example, I would like to get:
carrier approx_dist n rel_freq
<chr> <dbl> <int>
1 9E 1000 15740 15740/(15740+2720)
2 9E 2000 2720 2720/(15740+2720)
If I understood your problem correctly, here is what you can do. This does not exactly solve your problem (we don't have your data), but it should give you some hints:
library(dplyr)
d <- data.frame(col1 = rep(c("a", "a", "a", "b", "b", "b"), 2),
                col2 = rep(c("a1", "a2", "a3", "b1", "b2", "b3"), 2),
                stringsAsFactors = FALSE)
d %>%
  group_by(col1) %>%
  mutate(count_g1 = n()) %>%
  ungroup() %>%
  group_by(col1, col2) %>%
  summarise(rel_freq = n()/unique(count_g1)) %>%
  ungroup()
# # A tibble: 6 x 3
# col1 col2 rel_freq
# <chr> <chr> <dbl>
# 1 a a1 0.333
# 2 a a2 0.333
# 3 a a3 0.333
# 4 b b1 0.333
# 5 b b2 0.333
# 6 b b3 0.333
Update: #TimTeaFan's suggestion on how to re-write the code above using prop.table
d %>% group_by(col1, col2) %>% summarise(n = n()) %>% mutate(freq = prop.table(n))
Update: Running this trick on the ff table given in the question's example, which has everything set up except the last mutate:
ff %>% mutate(rel_freq = prop.table(n))
# A tibble: 33 x 4
# Groups: carrier [16]
carrier approx_dist n rel_freq
<chr> <dbl> <int> <dbl>
1 9E 1000 15740 0.853
2 9E 2000 2720 0.147
3 AA 1000 9146 0.279
4 AA 2000 17210 0.526
5 AA 3000 6373 0.195
6 AS 3000 714 1
7 B6 1000 24613 0.450
8 B6 2000 22159 0.406
9 B6 3000 7863 0.144
10 DL 1000 20014 0.416
# … with 23 more rows
...or
ff %>% mutate(rel_freq = n/sum(n))
Fake data for demonstration:
library(dplyr)
df <- data.frame(stringsAsFactors = FALSE,
                 col1 = rep(c("A", "B"), each = 9),
                 col2 = rep(1:3),
                 value = 1:18)
#> df
# col1 col2 value
#1 A 1 1
#2 A 2 2
#3 A 3 3
#4 A 1 4
#5 A 2 5
#6 A 3 6
#7 A 1 7
#8 A 2 8
#9 A 3 9
#10 B 1 10
#11 B 2 11
#12 B 3 12
#13 B 1 13
#14 B 2 14
#15 B 3 15
#16 B 1 16
#17 B 2 17
#18 B 3 18
Solution
df %>%
  group_by(col1, col2) %>%
  summarise(col2_ttl = sum(value)) %>%            # Count is boring for this data, but you
  mutate(share_of_col1 = col2_ttl / sum(col2_ttl)) #... could use `n()` for that
## A tibble: 6 x 4
## Groups: col1 [2]
# col1 col2 col2_ttl share_of_col1
# <chr> <int> <int> <dbl>
#1 A 1 12 0.267
#2 A 2 15 0.333
#3 A 3 18 0.4
#4 B 1 39 0.310
#5 B 2 42 0.333
#6 B 3 45 0.357
First we group by both columns. In this case the ordering makes a difference, because the groups are created hierarchically, and each summary we run summarises the last layer of grouping. So the summarise line (or summarize; dplyr accepts both the UK and US spellings) sums up the values in each col1-col2 combination, leaving a residual grouping by col1 which we can use in the next line. (Try stopping the pipeline after the summarise line to see what is produced at that stage.)
In the last line, col2_ttl is divided by the sum of all the col2_ttl values in its group, i.e. the total across each col1.
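The count-based variant hinted at in the comments (a sketch on the same fake data) swaps sum(value) for n(); with three rows in each col1-col2 cell, every share comes out to one third:

```r
library(dplyr)

df <- data.frame(col1 = rep(c("A", "B"), each = 9),
                 col2 = rep(1:3),
                 value = 1:18)

df %>%
  group_by(col1, col2) %>%
  summarise(n = n(), .groups = "drop_last") %>%  # keep the residual col1 grouping
  mutate(share_of_col1 = n / sum(n))             # each cell's share of its col1 total
```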
