Frequency of unique values of one variable grouped in another variable - R?

Extreme newbie question: I have 2 variables, region ID and household ID, and there are duplicate households within the regions. I'm just trying to find out how many unique households are in each region.
This is what I am trying:
library(dplyr)
table <- data %>% group_by(region) %>% summarise(hid = unique(hid))
Error message:
Error: Column hid must be length 1 (a summary value), not 142

Something like this might get you what you want:
library(tidyverse)
df <- tibble(region_id = c(1, 2, 3, 1, 2, 3),
household_id = c("a", "b", "b", "a", "a", "b"))
df %>%
group_by(region_id) %>%
count(household_id) %>%
summarize(unique_households = n())
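The error in the question arises because summarise() expected a single value per group, while unique(hid) returns one value per distinct household. A shorter alternative to the pipeline above, sketched here on the same toy data (the name unique_per_region is only illustrative), is to count the distinct households directly with n_distinct():

```r
library(dplyr)

# Same toy data as in the answer above
df <- data.frame(region_id = c(1, 2, 3, 1, 2, 3),
                 household_id = c("a", "b", "b", "a", "a", "b"))

# n_distinct() returns a single number per group,
# which is what summarise() needs
unique_per_region <- df %>%
  group_by(region_id) %>%
  summarise(unique_households = n_distinct(household_id))

unique_per_region
```

With this data, regions 1 and 3 each contain one distinct household and region 2 contains two.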

Related

Randomly selecting rows based on all groups in two columns

I have a large dataset with about 167k rows. I would like to take a sample of 2000 rows of it while making sure I am taking rows from all groups in two columns (id & quality) in the data.
This is a snapshot of the data
df <- data.frame(id=c(1,2,3,4,5,1,2),
quality=c("a","b","c","d","z","g","t"))
df %>% glimpse()
Rows: 7
Columns: 2
$ id <dbl> 1, 2, 3, 4, 5, 1, 2
$ quality <chr> "a", "b", "c", "d", "z", "g", "t"
So, I need to ensure that the sampled data has rows from all combinations of these two group columns.
I hope someone can help out.
Thanks!
I think that's what you're looking for.
library(dplyr)
my_df <- data.frame(id = c(1, 2, 3, 4, 5, 1, 2, 2, 2),
                    quality = c("a", "b", "c", "d", "z", "g", "t", "t", "t"))
# label each (id, quality) combination with a group number
my_df <- my_df %>% group_by(id, quality) %>% mutate(Unique = cur_group_id())
# row index, to see which row was drawn from each group
my_df$Test <- seq.int(from = 1, to = nrow(my_df), by = 1)
# draw one random row per group, five separate times
my_a <- my_df %>% group_by(Unique) %>% sample_n(., 1)
my_b <- my_df %>% group_by(Unique) %>% sample_n(., 1)
my_c <- my_df %>% group_by(Unique) %>% sample_n(., 1)
my_d <- my_df %>% group_by(Unique) %>% sample_n(., 1)
my_e <- my_df %>% group_by(Unique) %>% sample_n(., 1)
You don't need this many data frames; they are just examples to show that one row is extracted at random from each unique group. The difference shows up in the column named "Test", especially for id = 2 and quality = t in the sample data.
If you want to make sure that each id and quality is represented in your new sample, you will need to group your data by these variables.
What you are looking for is the following,
df %>%
group_by(id,quality) %>%
sample_n(1, replace = TRUE)
You can change the sample size per group and id, and set replacement as desired.
It gives the following output,
# Groups: id, quality [7]
id quality
<dbl> <chr>
1 1 a
2 1 g
3 2 b
4 2 t
5 3 c
6 4 d
7 5 z
The data that you provided has only unique groups, so sampling this way returns the same number of rows as your data.
Edit: sample_n is superseded by slice_sample; I wasn't aware of this. But you can easily change the script to,
df %>%
group_by(id,quality) %>%
slice_sample(
n = 1
)
You can also sample a proportion of your data.frame by setting prop instead of n,
df %>%
group_by(id,quality) %>%
slice_sample(
prop = 0.25
)
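The question asks for a fixed total (2000 rows) with every id/quality combination represented, which per-group sampling alone does not give. One way to get there, as a rough sketch and not the answerers' method, is to take one guaranteed row per group and then fill the remainder at random from the rows not yet chosen. The simulated data, the helper column row, and target_n are assumptions made for illustration. As far as I can tell, prop is applied within each group and rounded down, so very small groups can end up with no rows when prop is used, which is another reason to handle the guarantee explicitly.

```r
library(dplyr)

set.seed(1)
# Hypothetical stand-in for the large data set
df <- data.frame(id = sample(1:5, 200, replace = TRUE),
                 quality = sample(letters[1:4], 200, replace = TRUE))
target_n <- 50  # would be 2000 for the real data

# Row index so that already-sampled rows can be excluded later
df <- df %>% mutate(row = row_number())

# Step 1: one guaranteed row from every (id, quality) combination
guaranteed <- df %>%
  group_by(id, quality) %>%
  slice_sample(n = 1) %>%
  ungroup()

# Step 2: fill up to target_n from the remaining rows
# (assumes target_n is at least the number of groups)
filler <- df %>%
  anti_join(guaranteed, by = "row") %>%
  slice_sample(n = target_n - nrow(guaranteed))

sampled <- bind_rows(guaranteed, filler)
```

The result has exactly target_n rows and contains every combination present in the original data. The filler step is a plain random draw, so larger groups are more likely to contribute extra rows.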

Plotting Number of Times Value Appears in Two Dataframes in R

I have two sets of data. Each contains a column for the name of the molecule and a column for the number of times that molecule appears in the sample. I want to create a scatterplot with the number of times a molecule appears in dataset #1 on the x-axis and the number of times it appears in dataset #2 on the y-axis. If a molecule is in one dataset and not the other, it appears 0 times.
Example:
dat1 <- data.frame(
  name = c("A", "B", "D", "E"),
  count = c(10, 1, 30, 10)
)
dat2 <- data.frame(
  name = c("A", "B", "C", "F"),
  count = c(1, 3, 50, 40)
)
Point #1 would be (10,1) corresponding to A, Point #2 would be (1,3), Point #3 would be (0,50) and so on. I don't want to label my points since my datasets contain tens of thousands of molecules.
Try joining the data.frames
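library(dplyr)
library(ggplot2)
# keep every molecule from both data sets; missing counts become NA
full_join(dat1, dat2, by = "name") %>%
  # a molecule absent from one data set counts as appearing 0 times
  mutate(across(c(count.x, count.y), ~ coalesce(.x, 0))) %>%
  ggplot(aes(count.x, count.y)) +
  geom_point()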
You would need a full_join():
library(dplyr)
library(ggplot2)
#Data
dat1 <- data.frame(
name = c("A", "B", "D", "E"),
count = c(10, 1, 30, 10)
)
dat2 <- data.frame(
name = c("A", "B", "C", "F"),
count = c(1, 3, 50, 40)
)
#Code
dat1 %>% full_join(dat2 %>% rename(count2 = count), by = "name") %>%
replace(is.na(.),0) %>%
ggplot(aes(x=count,y=count2))+
geom_point()+
geom_text(aes(label=name),vjust=-0.5)

Creating duplicate in R

I have the following input data frame with 4 columns and 3 rows.
The time column can take values from 1 to the corresponding value of the maturity column for that customer. I want to create more observations for each customer until the value of time equals the value of maturity, with the other columns retaining their original values. Please see the links below for the input and expected output.
Input
Output
Here is a dplyr solution inspired by, but not identical to, this post.
library(dplyr)
df <- data.frame(custno = 1:3, time = 1, dept = c("A", "B", "A"))
df %>%
slice(rep(1:n(), each = 5)) %>%
group_by(custno) %>%
mutate(time = seq_along(time))
Edit
After the comments by the OP, the following seems to be better.
First, the data:
df <- data.frame(custno = 1:3, time = 1,
dept = c("A", "B", "A"),
maturity = c(5,4,6))
And the solution.
df %>%
tidyr::uncount(maturity) %>%
group_by(custno) %>%
mutate(time = seq_along(time))
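Note that uncount() drops the weights column by default, so maturity disappears from the result above. If the expected output should keep it (the original links are not available, so this is an assumption), a minimal variant sets .remove = FALSE; the name expanded is only illustrative:

```r
library(dplyr)
library(tidyr)

df <- data.frame(custno = 1:3, time = 1,
                 dept = c("A", "B", "A"),
                 maturity = c(5, 4, 6))

expanded <- df %>%
  # repeat each row `maturity` times, keeping the maturity column
  uncount(maturity, .remove = FALSE) %>%
  group_by(custno) %>%
  # number the repeated rows 1, 2, ..., maturity within each customer
  mutate(time = row_number()) %>%
  ungroup()
```

Each customer then has as many rows as its maturity value, with time running from 1 up to that value.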
We can also use slice with row_number
library(dplyr)
library(data.table)
df %>%
slice(rep(row_number(), maturity)) %>%
mutate(time = rowid(custno))
data
df <- data.frame(custno = 1:3, time = 1,
dept = c("A", "B", "A"),
maturity = c(5,4,6))

Apply t.test on a tidy format data

I have a data frame in tidy format as follows:
df <- data.frame(name = c("A", "C", "B", "A", "B", "C", "D") ,
group = c(rep("case", 3), rep("cntrl", 4)),
mean = rnorm(7, 0,1))
I would like to group the data by the two variables name and group and apply a t.test to the mean value of each category. For example, I would run a t.test for A_case vs. A_cntrl and add the p-value to the table as the result.
Do you have any idea how I can do this using the tidyverse package?
Thanks,
Here, a group-wise t.test on 'name' cannot be performed, as there is only a single observation for each pair. Instead, we can do
library(dplyr)
df %>%
summarise(ttest = list(t.test(mean[group == 'case'],
mean[group == 'cntrl']))) %>%
pull(ttest)
Update
If we need to create a column, use mutate
df %>%
mutate(pval = t.test(mean[group == 'case'],
mean[group == 'cntrl'])$p.value)
Or reshape to 'wide' format and then do the t.test on the columns
library(tidyr)
df %>%
pivot_wider(names_from = group, values_from = mean) %>%
summarise(ttest = list(t.test(case, cntrl))) %>%
pull(ttest)
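If the real data had several observations per name and group, the per-name comparison the question asks for (A_case vs. A_cntrl, and so on) would become possible. The following is a sketch under that assumption; df_rep and its simulated values are made up for illustration and are not the asker's data:

```r
library(dplyr)

set.seed(42)
# Hypothetical data: 5 case and 5 control observations for each name
df_rep <- data.frame(
  name  = rep(c("A", "B"), each = 10),
  group = rep(rep(c("case", "cntrl"), each = 5), times = 2),
  mean  = rnorm(20)
)

# One t.test per name, comparing case against control
pvals <- df_rep %>%
  group_by(name) %>%
  summarise(pval = t.test(mean[group == "case"],
                          mean[group == "cntrl"])$p.value)

pvals
```

This returns one row per name with its p-value. Using mutate() instead of summarise() would attach the p-value to every original row.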

In nested data frame, pass information from one list column to function applied in another

I am working on a report for which I have to export a large number of similar data frames into nice looking tables in Word. My goal is to achieve this in one go, using flextable to generate the tables and purrr / tidyverse to apply all the formatting procedures to all rows in a nested data frame. This is what my data frame looks like:
df <- data.frame(school = c("A", "B", "A", "B", "A", "B"),
students = c(round(runif(6, 1, 10), 0)),
grade = c(1, 1, 2, 2, 3, 3))
I want to generate separate tables for all groups in column 'school' and started by using the nest() function within tidyr.
list <- df %>%
group_by(school) %>%
nest()
This gives me a nested data frame to which I can apply the functions in flextable using purrr:
list <- list %>%
mutate(ftables = map(data, flextable)) %>%
mutate(ftables = purrr::map(ftables, ~ set_header_labels(.,
students = "No of students",
grade = "Grade")))
The first mutate generates a new column with flextable objects for each school, and the second mutate applies header labels to the table, based on the column names that are saved in the object.
My goal is now to add another header that is based on the name of the school. This value resides in the list column entitled school, which corresponds row-wise to the tables generated in the list column ftables. How can I pass the name of the school to the add_header function within ftables, using purrr or any other procedure?
Expected output
I have been able to achieve what I want for individual schools with this procedure (identical header cells will later be merged):
school.name <- "A"
ftable.a <- df %>%
filter(school == "A") %>%
select(-school) %>%
flextable() %>%
set_header_labels(students = "No of students",
grade = "Grade") %>%
add_header(students = school.name,
grade = school.name)
ftable.a
The purrr package provides the function map2, which you should use:
library(flextable)
library(magrittr)
library(dplyr)
library(tidyr)
library(purrr)
df <- data.frame(school = c("A", "B", "A", "B", "A", "B"),
students = c(round(runif(6, 1, 10), 0)),
grade = c(1, 1, 2, 2, 3, 3))
byschool <- df %>%
group_by(school) %>%
nest()
byschool <- byschool %>%
mutate(ftables = map(data, flextable)) %>%
mutate(ftables = purrr::map(
ftables, ~ set_header_labels(.,
students = "No of students",
grade = "Grade"))) %>%
mutate(ftables = purrr::map2(ftables, school, function(ft, h){
add_header(ft, students = h, grade = h)
} ))
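To see what map2 does here without flextable, the sketch below walks over the data list column and the school column in parallel and builds a plain text caption per school. The fixed students values and the caption column are made up for illustration; the same pairing is what hands the school name to add_header above.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Fixed (hypothetical) student numbers instead of runif(), for reproducibility
df <- data.frame(school = c("A", "B", "A", "B", "A", "B"),
                 students = c(4, 7, 2, 9, 5, 3),
                 grade = c(1, 1, 2, 2, 3, 3))

byschool <- df %>%
  group_by(school) %>%
  nest() %>%
  ungroup() %>%
  # map2_chr() calls the function once per row with the nested
  # data frame (d) and the matching school name (s)
  mutate(caption = map2_chr(data, school, function(d, s) {
    paste0("School ", s, ": ", sum(d$students), " students")
  }))

byschool$caption
```

For more than two parallel inputs, purrr::pmap() follows the same idea with a list of columns.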
