I have data like this:
df <- (
tibble(
ID = rep(1:2, 4),
Group = c("A", "B", "A", "B","A", "B", "A", "B"),
Parameter = c("Blood", "Blood", "Height", "Height", "Waist", "Waist", "Hip", "Hip"),
Value = c(6.3, 6.0, 180, 170, 90, 102, 60, 65))
)
I want to calculate the ratio between "Height" and "Waist" and between "Waist" and "Hip".
I have the following solution. But my solution requires using spread() and delivers only the calculation for "Waist-to-hip".
df <- rbind(df,
spread(df, Parameter, Value)
%>% transmute(ID = ID,
Group = Group,
Parameter = "Ratio.Height-to-Hip",
Value = Height / Hip,
Parameter = "Ratio.Waist-to-Hip",
Value = Waist / Hip))
Is it possible to stay in tidy data format and avoid switching to the wide format? Why is the calculation for "Height-to-hip" missing?
Here is one possible solution:
# Calculate ratios "Height" vs "Waist" and "Waist" vs "Hip"
# 1. Load packages
library(tidyr)
library(dplyr)
# 2. Data set
df <- tibble(
id = rep(1:2, 4),
group = c("A", "B", "A", "B","A", "B", "A", "B"),
parameter = c("Blood", "Blood", "Height", "Height", "Waist", "Waist", "Hip", "Hip"),
value = c(6.3, 6.0, 180, 170, 90, 102, 60, 65))
# 3. Filter and transform data set
df <- df %>%
filter(parameter %in% c("Height", "Waist", "Hip")) %>%
spread(parameter, value)
# 4. Convert column names to lower case
colnames(df) <- tolower(colnames(df))
# 5. Calculate ratios
df <- df %>%
mutate(
ratio_height_vs_waist = round(height / waist, 2),
ratio_waist_vs_hip = round(waist / hip, 2))
The main problem is that the data are not in a tidy format.
Two key features of the tidy format are (Wickham, 2013):
Each variable forms a column;
Each observation forms a row.
In its original format, your data violates these two rules. For example, the Parameter column contains four variables (Blood, Height, Waist, and Hip). The knock-on effect of grouping several variables within Parameter is that each observation has to be repeated across several rows. In general, repeated rows of an identifier (ID in this case) in the absence of repeated measures is a sign that two or more variables have been grouped under a single column.
Anyway, here's my attempt to clean the data (I have used mutate and not transmute for illustrative purposes).
# Load packages
library(dplyr)
library(tidyr)
library(magrittr) # For the %<>% function, which I love
# Make data frame, df
df <- tibble(
ID = rep(1:2, 4),
Group = c("A", "B", "A", "B","A", "B", "A", "B"),
Parameter = c("Blood", "Blood", "Height", "Height", "Waist", "Waist", "Hip", "Hip"),
Value = c(6.3, 6.0, 180, 170, 90, 102, 60, 65))
# Wrangle df
df %<>%
# ID and Group appear to be repeated, so use them to group_by
group_by(ID, Group) %>%
# Spread the Value column by the Parameter column
spread(key = Parameter,
value = Value) %>%
# Ungroup, just because it's a good habit
ungroup() %>%
# Generate new columns.
mutate(Ratio_height_to_hip = Height / Hip,
Ratio_waist_to_hip = Waist / Hip)
# Print df
df
#> # A tibble: 2 x 8
#> ID Group Blood Height Hip Waist Ratio_height_to_hip
#> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 A 6.3 180 60 90 3.000000
#> 2 2 B 6.0 170 65 102 2.615385
#> # ... with 1 more variables: Ratio_waist_to_hip <dbl>
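library(dplyr)
library(tidyr)
df <- df %>%
# Go wide so each parameter is its own column
spread(Parameter, Value) %>%
mutate("Ratio.Height-to-Hip" = Height / Hip) %>%
mutate("Ratio.Waist-to-Hip" = Waist / Hip) %>%
# Go back to the original long layout
gather("Parameter", "Value", -c("ID", "Group"))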
Your data is not in tidy format ;) If you want your data in tidy format remove the last step.
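On the two questions in the original post: in the transmute() call, Parameter and Value are each assigned twice, and the later assignment overwrites the earlier one, which is why only the waist-to-hip ratio survives. As a minimal sketch of staying in the long layout without spread(), assuming every ID/Group pair has exactly one row per parameter, the ratios can be computed per group and appended as new rows. The helper ratio_rows() is a hypothetical name introduced here for illustration, not a dplyr function.

```r
library(dplyr)

df <- tibble(
  ID = rep(1:2, 4),
  Group = rep(c("A", "B"), 4),
  Parameter = rep(c("Blood", "Height", "Waist", "Hip"), each = 2),
  Value = c(6.3, 6.0, 180, 170, 90, 102, 60, 65)
)

# Hypothetical helper: one ratio row per ID/Group pair.
# Assumes exactly one row for `num` and one for `den` in each group.
ratio_rows <- function(data, num, den) {
  data %>%
    group_by(ID, Group) %>%
    summarise(
      Value = Value[Parameter == num] / Value[Parameter == den],
      .groups = "drop"
    ) %>%
    mutate(Parameter = paste0("Ratio.", num, "-to-", den)) %>%
    select(ID, Group, Parameter, Value)
}

# Append the ratio rows to the original long data
df_long <- bind_rows(
  df,
  ratio_rows(df, "Height", "Hip"),
  ratio_rows(df, "Waist", "Hip")
)
```

Each ratio gets its own call, so no column name is assigned twice and nothing is overwritten.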
I have this data frame:
df <- data.frame(id = c(918, 919, 920, 921, 922),
city = c("a", "c", "b", "c", "a"),
mosquitoes = c(9, 13, 8, 25, 10))
What I want to do is to get the number of unique ID values for each city and then create a new dataframe that should look like:
newdf <- data.frame(city = c("a", "b", "c"),
id = c(2,1,2),
mosquitoes = c(19, 8, 38))
I know how to do half of that using
newdf <- aggregate(mosquitoes ~ city, data = df, sum)
But no matter how I try, I can't get the range for unique values of ID according to the cities that I have. I've been trying
newdf$id <- aggregate(length(id) ~ city, data = df, sum)
And I also tried a loop (because my original data has way more than 3 cities), but only got disaster and can't make it work at all:
x <- unique(df$city)
unique_ID <-
for (x in df$city) {
city = unique(df$city)
mosquitoes = ?
ID = ?
}
This topic was the most similar to mine I could find, but apparently it only works with numeric values. At least I couldn't make it work with my character columns.
Can someone please help me?
You could do:
library(tidyverse)
df <- data.frame(id = c(918, 919, 920, 921, 922),
city = c("a", "c", "b", "c", "a"),
mosquitoes = c(9, 13, 8, 25, 10))
df %>%
group_by(city) %>%
summarise(id = n_distinct(id), mosquitoes = sum(mosquitoes))
#> # A tibble: 3 x 3
#> city id mosquitoes
#> <chr> <int> <dbl>
#> 1 a 2 19
#> 2 b 1 8
#> 3 c 2 38
Created on 2022-09-05 with reprex v2.0.2
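Since the question started from aggregate(), here is a hedged base R sketch of the same result. It assumes the number of unique IDs per city is what is wanted, computes the two summaries separately and then joins them on city; the intermediate names ids and totals are introduced here only for illustration.

```r
df <- data.frame(id = c(918, 919, 920, 921, 922),
                 city = c("a", "c", "b", "c", "a"),
                 mosquitoes = c(9, 13, 8, 25, 10))

# Number of distinct ids per city
ids <- aggregate(id ~ city, data = df, FUN = function(x) length(unique(x)))

# Total mosquitoes per city
totals <- aggregate(mosquitoes ~ city, data = df, FUN = sum)

# Combine the two summaries on the shared city column
newdf <- merge(ids, totals, by = "city")
```

The original attempt failed because length(id) on the left-hand side of the formula is evaluated once for the whole column rather than per city; passing a function through FUN applies it within each city.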
I have a large dataset with about 167k rows. I would like to take a sample of 2000 rows of it while making sure I am taking rows from all groups in two columns (id & quality) in the data.
This is a snapshot of the data
df <- data.frame(id=c(1,2,3,4,5,1,2),
quality=c("a","b","c","d","z","g","t"))
df %>% glimpse()
Rows: 7
Columns: 2
$ id <dbl> 1, 2, 3, 4, 5, 1, 2
$ quality <chr> "a", "b", "c", "d", "z", "g", "t"
So, I need to ensure that the sampled data has rows from all combinations of these two group columns.
I hope someone can help out.
Thanks!
I think that's what you're looking for.
library(dplyr)
my_df <- data.frame(id = c(1, 2, 3, 4, 5, 1, 2, 2, 2),
quality = c("a", "b", "c", "d", "z", "g", "t", "t", "t"))
my_df <- my_df %>% group_by(id, quality) %>% mutate(Unique = cur_group_id())
my_df$Test <- seq.int(from = 1, to = nrow(my_df), by = 1)
my_a <- my_df %>% group_by(Unique) %>% sample_n(., 1)
my_b <- my_df %>% group_by(Unique) %>% sample_n(., 1)
my_c <- my_df %>% group_by(Unique) %>% sample_n(., 1)
my_d <- my_df %>% group_by(Unique) %>% sample_n(., 1)
my_e <- my_df %>% group_by(Unique) %>% sample_n(., 1)
You don't need that many data frames; they are just examples to show that one row is extracted at random for each unique group. The difference shows up in the "Test" column, especially for id = 2 and quality = "t" in the sample data.
If you want to make sure that each id and quality is represented in your new sample, you will need to group your data by these variables.
What you are looking for is the following,
df %>%
group_by(id,quality) %>%
sample_n(1, replace = TRUE)
You can change the sample size per group and set replacement as desired.
It gives the following output,
# Groups: id, quality [7]
id quality
<dbl> <chr>
1 1 a
2 1 g
3 2 b
4 2 t
5 3 c
6 4 d
7 5 z
The data you provided contain only unique groups, so sampling this way returns the same number of rows as your data.
Edit: sample_n is superseded by slice_sample, which I wasn't aware of. You can easily adapt the script:
df %>%
group_by(id,quality) %>%
slice_sample(
n = 1
)
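You can also sample a proportion of each group by setting prop instead of n. Note that the per-group row count is rounded down, so with prop = 0.25 a group with fewer than four rows contributes no rows: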
df %>%
group_by(id,quality) %>%
slice_sample(
prop = 0.25
)
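The question also asks for a fixed total (2000 rows) while covering every id/quality combination. One possible sketch, under the assumption that the target size is at least the number of combinations: draw one guaranteed row per combination, then fill the remainder at random from the rows not yet drawn. The toy data, target_n, row_id, guaranteed and filler are all illustrative names and values, not part of the original post.

```r
library(dplyr)

set.seed(1)  # only to make the sketch reproducible

# Toy stand-in for the real data: 100 rows, 20 id/quality combinations
df <- data.frame(id = rep(1:5, each = 20),
                 quality = rep(c("a", "b", "c", "d"), times = 25))
target_n <- 40  # hypothetical total; 2000 in the question

# Row identifier so already-drawn rows can be excluded later
df <- df %>% mutate(row_id = row_number())

# Step 1: one guaranteed row from every id/quality combination
guaranteed <- df %>%
  group_by(id, quality) %>%
  slice_sample(n = 1) %>%
  ungroup()

# Step 2: fill up to the target size from rows not yet drawn
filler <- df %>%
  anti_join(guaranteed, by = "row_id") %>%
  slice_sample(n = target_n - nrow(guaranteed))

sampled <- bind_rows(guaranteed, filler)
```

The result is not a simple random sample, because rare combinations are over-represented by construction; whether that matters depends on what the sample is used for.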
I have a dataframe that looks like this:
df <- tibble(date = c(2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01),
site = c("X", "X", "X", "X", "Z", "Z", "Z", "Z"),
treatment = c("a", "a", "b", "b", "a", "a", "b", "b"),
species = c("vetch", "clover", "vetch", "clover", "vetch", "clover", "vetch", "clover"),
frequency = c(0, 1, 1, 1, 1, 0, 1, 0))
But with lots of dates, sites and treatments. What I want is to filter out observations where all frequencies of that species (across all treatments and dates) are 0 for that site. So in the above I want to remove clover at site "Z" because it did not occur at any treatment or date at that site, but I want to leave clover in site "X" because it did occur in one of the treatments. So I want:
tibble(date = c(2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01),
site = c("X", "X", "X", "X", "Z", "Z"),
treatment = c("a", "a", "b", "b", "a", "b"),
species = c("vetch", "clover", "vetch", "clover", "vetch", "vetch"),
frequency = c(0, 1, 1, 1, 1, 1))
My first thought was to pivot_wider, select columns then pivot_longer again, but this didn't work because the clover column was still selected by having a 1 in site "X":
df %>%
pivot_wider(names_from = species, names_prefix = "spp.", values_from = frequency, values_fill = 0) %>%
group_by(site) %>%
select_if(~ !is.numeric(.) || sum(.) != 0) %>%
pivot_longer(starts_with("spp."), names_to = "species", names_prefix = "spp.", values_to = "frequency") -> df
So I guess I need to filter instead, but I can't figure out how to do that.
Maybe not for this dataset, but in general sum might not be the right approach: if you have negative numbers, they can cancel out and the wrong groups get removed. You can use all or any instead:
With dplyr :
library(dplyr)
df %>% group_by(date, site, species) %>% filter(any(frequency != 0))
#Also
#df %>% group_by(date, site, species) %>% filter(!all(frequency == 0))
# date site treatment species frequency
# <dbl> <chr> <chr> <chr> <dbl>
#1 2018 X a vetch 0
#2 2018 X a clover 1
#3 2018 X b vetch 1
#4 2018 X b clover 1
#5 2018 Z a vetch 1
#6 2018 Z b vetch 1
The same can be done in data.table as well :
library(data.table)
setDT(df)[, .SD[any(frequency != 0)], .(date, site, species)]
Or in base R :
subset(df, ave(frequency != 0, date, site, species, FUN = any))
An easy solution can be achieved by creating another column that contains the frequency of each species grouped by date, site and species (ignoring treatment). Then you can easily filter using this new column and afterwards eliminate it.
library(tidyverse)
df %>%
# Group by date site and species
group_by(date, site, species) %>%
# Create new column that sums frequency values by grouping variables
mutate(appears = sum(frequency)) %>%
# ignore rows where appears = 0
filter(appears != 0) %>%
# Eliminate appears column
select(-appears)
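The answers above group by date as well as site and species. The question asks about a species being absent "across all treatments and dates" at a site, so with several dates it may be closer to the intent to group by site and species only. A minimal sketch of that variant follows; it assumes the dates are stored as real Date values, and kept is just an illustrative name.

```r
library(dplyr)

df <- tibble(
  date = as.Date(rep("2020-01-01", 8)),
  site = rep(c("X", "Z"), each = 4),
  treatment = rep(c("a", "a", "b", "b"), 2),
  species = rep(c("vetch", "clover"), 4),
  frequency = c(0, 1, 1, 1, 1, 0, 1, 0)
)

kept <- df %>%
  # One group per species within a site, pooling all dates and treatments
  group_by(site, species) %>%
  # Keep the whole group if the species occurred at least once
  filter(any(frequency != 0)) %>%
  ungroup()
```

With a single date, as in the example data, this keeps the same six rows as grouping by date, site and species.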
Extreme newbie question: I have 2 variables, region ID and household ID, there are duplicate households within the regions. I'm just trying to find out how many unique households are in each region.
This is what I am trying:
library(dplyr)
table <- data %>% group_by(region) %>% summarise(hid = unique(hid))
Error message:
Error: Column hid must be length 1 (a summary value), not 142
Something like this might get you what you want:
library(tidyverse)
df <- tibble(region_id = c(1, 2, 3, 1, 2, 3),
household_id = c("a", "b", "b", "a", "a", "b"))
df %>%
group_by(region_id) %>%
count(household_id) %>%
summarize(unique_households = n())
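A shorter variant, sketched on the same example data, uses n_distinct() so that each region collapses to a single number directly. The error in the question arises because unique(hid) returns one value per household, while that version of summarise() expected exactly one value per group. The name result is introduced here for illustration.

```r
library(dplyr)

df <- tibble(region_id = c(1, 2, 3, 1, 2, 3),
             household_id = c("a", "b", "b", "a", "a", "b"))

result <- df %>%
  group_by(region_id) %>%
  # n_distinct() returns a single count per group, as summarise() expects
  summarise(unique_households = n_distinct(household_id))
```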
I would like to sample any number from Min to Max column of a data.frame after grouping and every group having different seed. I've tried a few approaches, you can see them in the reproducible example below, but none of them work.
The data.frame consists of four columns:
letter - my grouping variable
seed - an integer that is dynamic and group/letter specific
min - minimum value for the sample()
max - maximum value for the sample()
Here is a reproducible example:
set.seed(123)
data.frame(letter = sample(letters[1:3],20, replace=TRUE)) %>%
group_by(letter) %>%
summarise(seed = n()) %>%
mutate(min = ifelse(letter == "a", 20,
ifelse(letter == "b", 40, 60)),
max = ifelse(letter == "a", 30,
ifelse(letter == "b", 50, 70))) %>%
group_by(letter) %>%
# set.seed(seed) %>% # or mutate(randomNumber = sample(min:max, 1, set.seed(seed))) # these aren't working, but I hope you get my point
mutate(randomNumber = sample(min:max, 1))
Many thanks in advance!
I would suggest using pmap from the purrr package in your last line:
library(tidyverse)
set.seed(123)
data.frame(letter = sample(letters[1:3],20, replace=TRUE)) %>%
group_by(letter) %>%
summarise(seed = n()) %>%
mutate(min = ifelse(letter == "a", 20,
ifelse(letter == "b", 40, 60)),
max = ifelse(letter == "a", 30,
ifelse(letter == "b", 50, 70))) %>%
group_by(letter) %>%
mutate(randomNumber = pmap_dbl(list(min, max, seed), function(x, y, z){set.seed(z); sample(x:y, 1)}))
# A tibble: 3 x 5
# Groups: letter [3]
letter seed min max randomNumber
<fct> <int> <dbl> <dbl> <dbl>
1 a 5 20 30 21
2 b 7 40 50 49
3 c 8 60 70 63
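The same idea can be sketched with rowwise() and a small named helper instead of pmap_dbl(). Here sample_with_seed() and params are hypothetical names introduced for illustration, and the seed, min and max values are copied from the output above. The helper draws an index with sample.int() rather than calling sample(x, 1) on the values, which avoids the surprise that sample() treats a single number n as 1:n when min equals max.

```r
library(dplyr)

# Hypothetical helper: seed the RNG, then draw one integer from lo..hi.
# Note that set.seed() changes the global RNG state as a side effect.
sample_with_seed <- function(lo, hi, seed) {
  set.seed(seed)
  candidates <- seq(lo, hi)
  candidates[sample.int(length(candidates), 1)]
}

params <- tibble(
  letter = c("a", "b", "c"),
  seed = c(5L, 7L, 8L),
  min = c(20, 40, 60),
  max = c(30, 50, 70)
)

result <- params %>%
  # Evaluate the helper once per row, with that row's min, max and seed
  rowwise() %>%
  mutate(randomNumber = sample_with_seed(min, max, seed)) %>%
  ungroup()
```

Because every row reseeds the generator, rerunning the pipeline gives the same numbers, but the exact values can differ between R versions whose default sampling algorithm differs.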