I'm a little new to R and was hoping to get some insight into how to calculate any percentile, for example the 5th, 15th, etc.
The data I'm working with has two columns:
salary (datatype: numeric / double)
student (datatype: factor / integer, but only has yes/no values)
I've already used:
favstats(salary ~ student, data = Default, na.rm = TRUE)
to get the two rows of stats broken down by whether they're a student or not; however, I'm not sure how to have the output show me a percentile of my choosing.
Would love to know the simplest way to go about this in RStudio.
Thank you!
The quantile() function in base R does this.
x <- rnorm(100)
percentiles <- c(0.05, 0.15)
quantile(x, percentiles)
#> 5% 15%
#> -1.593506 -1.120130
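One detail worth noting: quantile() stops with an error if the data contain NAs. Since the question's favstats() call uses na.rm = TRUE, pass it along here too (a sketch using the question's Default data and salary column):

# quantile() errors on missing values unless na.rm = TRUE
quantile(Default$salary, probs = c(0.05, 0.15), na.rm = TRUE)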
If you need to produce a more complex summary table, you can do something with {tidyverse} like this:
library(tidyverse)
n <- 50
d <- tibble(student = rep(c(T, F), each = n),
            salary = c(rnorm(n, 75, 10), rnorm(n, 95, 15)))
d %>%
  group_by(student) %>%
  summarize(quantile_salary = quantile(salary, percentiles))
#> # A tibble: 4 × 2
#> # Groups: student [2]
#> student quantile_salary
#> <lgl> <dbl>
#> 1 FALSE 70.0
#> 2 FALSE 78.5
#> 3 TRUE 57.3
#> 4 TRUE 64.0
Created on 2022-09-22 by the reprex package (v2.0.1)
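One caveat if you are on dplyr 1.1.0 or later: summarise() now warns when a group returns more than one row, and reframe() is the supported replacement for this pattern:

d %>%
  group_by(student) %>%
  reframe(quantile_salary = quantile(salary, percentiles))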
I have a dataset which looks like this, though much larger
## Fake data for Stack Overflow
exdb <- data.frame(zone = c(1, 1, 1, 2, 2, 2),
                   site = c("study", "collect", "collect", "study", "collect", "collect"),
                   x = c(53.307726, 53.310660, 53.307089, 53.313831, 53.319087, 53.318792),
                   y = c(-6.222291, -6.217151, -6.215080, -6.214152, -6.218723, -6.215815))
I need to run a point analysis between the STUDY site and the COLLECT sites to get the distance in metres. The problem is that I have many different ZONES or groups that are all independent (i.e. the distance from a point in zone 1 is irrelevant to a point in zone 2).
For this reason I need to do two things:
1. the point analysis, which computes the distance in metres between the one study site per zone and the multiple collect sites, and
2. a FOREACH or LOOP function which calculates this distance for every group in the data set.
An optimal output would look like this:
exdb <- data.frame(zone = c(1, 1, 1, 2, 2, 2),
                   site = c("study", "collect", "collect", "study", "collect", "collect"),
                   x = c(53.307726, 53.310660, 53.307089, 53.313831, 53.319087, 53.318792),
                   y = c(-6.222291, -6.217151, -6.215080, -6.214152, -6.218723, -6.215815),
                   dist = c(0, 10.3, 30.4, 0, 12.5, 11.2))
Here the study site in each zone is always 0, as it is the point the distances are measured from, and the distance to each collect site is ONLY CALCULATED TO THE STUDY SITE IN EACH UNIQUE ZONE.
Thank you very much.
Kil
Simple base R version; the only extra package needed is {geosphere}, for distGeo().
Starting with exdb as above.
First add a new column called dist with the value "study" because the plan is to self-merge on zone and site=="study":
> exdb$dist = "study"
Self-Merge, keeping only the coordinate columns:
> MM = merge(exdb, exdb,
+            by.x = c("zone","site"),
+            by.y = c("zone","dist"))[, c("x.x","y.x","x.y","y.y")]
Use geosphere::distGeo() to overwrite the dist column; note that distGeo() expects (longitude, latitude) pairs, which is why the column order is reversed below. Keeps it neat and tidy:
> library(geosphere)
> exdb$dist = distGeo(MM[,2:1], MM[,4:3])
> exdb
zone site x y dist
1 1 study 53.30773 -6.222291 0.0000
2 1 collect 53.31066 -6.217151 473.2943
3 1 collect 53.30709 -6.215080 485.8806
4 2 study 53.31383 -6.214152 0.0000
5 2 collect 53.31909 -6.218723 659.5238
6 2 collect 53.31879 -6.215815 563.1349
Returns the same answer as @wimpel's but with fewer dependencies (only {geosphere}) and in fewer lines of code.
Maybe something like this?
Assuming x and y are latitude and longitude, we can use the haversine function to get the distance in metres, after pivoting the table so that each row contains both points between which the distance is calculated:
library(tidyverse)
library(pracma)
#>
#> Attaching package: 'pracma'
#> The following object is masked from 'package:purrr':
#>
#> cross
data <- data.frame(zone = c(1, 1, 1, 2, 2, 2),
                   site = c("study", "collect", "collect", "study", "collect", "collect"),
                   x = c(53.307726, 53.310660, 53.307089, 53.313831, 53.319087, 53.318792),
                   y = c(-6.222291, -6.217151, -6.215080, -6.214152, -6.218723, -6.215815))
data %>%
  pivot_wider(names_from = site, values_from = c(x, y)) %>%
  unnest(y_collect, y_study, x_collect, x_study) %>%
  mutate(
    dist = list(x_study, y_study, x_collect, y_collect) %>%
      pmap_dbl(~ haversine(c(..1, ..2), c(..3, ..4)) * 1000)
  )
#> Warning: Values are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list` to suppress this warning.
#> * Use `values_fn = length` to identify where the duplicates arise
#> * Use `values_fn = {summary_fun}` to summarise duplicates
#> Warning: Values are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list` to suppress this warning.
#> * Use `values_fn = length` to identify where the duplicates arise
#> * Use `values_fn = {summary_fun}` to summarise duplicates
#> Warning: unnest() has a new interface. See ?unnest for details.
#> Try `df %>% unnest(c(y_collect, y_study, x_collect, x_study))`, with `mutate()` if needed
#> # A tibble: 4 x 6
#> zone x_study x_collect y_study y_collect dist
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 53.3 53.3 -6.22 -6.22 472.
#> 2 1 53.3 53.3 -6.22 -6.22 484.
#> 3 2 53.3 53.3 -6.21 -6.22 659.
#> 4 2 53.3 53.3 -6.21 -6.22 563.
Created on 2021-09-13 by the reprex package (v2.0.1)
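The warnings above are just pivot_wider() falling back to list-columns because zone and site do not uniquely identify rows. A sketch of the same pipeline that avoids the warnings by making the list-columns explicit (values_fn = list) and using the newer unnest(c(...)) interface:

data %>%
  pivot_wider(names_from = site, values_from = c(x, y), values_fn = list) %>%
  # length-1 study entries are recycled against the longer collect entries
  unnest(c(y_collect, y_study, x_collect, x_study)) %>%
  mutate(
    dist = list(x_study, y_study, x_collect, y_collect) %>%
      pmap_dbl(~ haversine(c(..1, ..2), c(..3, ..4)) * 1000)
  )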
I'm still learning the spatial side but does this work?
library(sf)
library(tidyverse)
exdb %>%
  arrange(zone, desc(site)) %>%  # ensure study is first within each zone
  st_as_sf(coords = c("x", "y"), crs = 4326) %>%
  group_by(zone) %>%
  mutate(
    study_coord = geometry[1],
    dist = st_distance(geometry, study_coord, by_element = TRUE)
  )
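One note: st_distance() on geographic (EPSG:4326) coordinates returns a units vector in metres. If a plain numeric dist column is needed downstream, the units class can be dropped afterwards (a sketch, assuming the pipeline result above is saved as res):

res %>% mutate(dist = as.numeric(dist))  # values stay in metres; as.numeric() drops the units class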
I believe this should work, though I could not reproduce the distances in your desired output.
library(data.table)
library(purrr) # Or tidyverse
library(geosphere)
# Make your data (exdb from the question) a data.table
setDT(exdb)
# Split to a list based on zone and site
L <- split(exdb, by = c("zone", "site"), flatten = FALSE)
# Loop over list
L <- lapply(L, function(zone) {
  # get reference point to take distance from; distGeo() expects (lon, lat)
  point.study <- c(zone$study$y, zone$study$x)
  zone$study$dist <- 0
  # Calculate distance from the study point to each collect point
  zone$collect$dist <- unlist(purrr::pmap(list(a = zone$collect$y,
                                               b = zone$collect$x),
                                          ~ geosphere::distGeo(point.study, c(..1, ..2))))
  return(zone)
})
# Rowbind the results together
data.table::rbindlist(lapply(L, data.table::rbindlist))
# zone site x y dist
# 1: 1 study 53.30773 -6.222291 0.0000
# 2: 1 collect 53.31066 -6.217151 473.2943
# 3: 1 collect 53.30709 -6.215080 485.8806
# 4: 2 study 53.31383 -6.214152 0.0000
# 5: 2 collect 53.31909 -6.218723 659.5238
# 6: 2 collect 53.31879 -6.215815 563.1349
I'm quite new to R and coding in general. Your help would be highly appreciated :)
I'm trying to select the top n values by group, with n depending on another value (in the following called factor) from my data frame. Then the selected values should be summarised by group to calculate the mean (d100). My goal is to get one value for d100 per group.
(Background: in forestry there is an indicator called d100, which is the mean diameter of the 100 thickest trees per hectare. If the sampling area is smaller than 1 ha, you need to select proportionally fewer trees to calculate d100; e.g. a circular plot with a 15 m radius covers pi*(15/100)^2 ≈ 0.0707 ha, so you would take the 100 * 0.0707 ≈ 7 thickest trees. That's what the factor is for.)
First I tried to put the factor inside my data frame as its own column. Then I thought something like a "lookup table" might help, because R said that n must be a single number, but I don't know how to write such a lookup function. (See the last part of the sample code.) Or maybe summarising df$factor before using it would do the trick?
Sample data:
(I marked expressions where I'm not sure how to code them in R like this: 'I dont know how')
# creating sample data
library(tidyverse)
df <- data.frame(group = c(rep(1, each = 5), rep(2, each = 8), rep(3, each = 10)),
                 BHD = rnorm(23, mean = 30, sd = 5),
                 factor = c(rep(pi*(15/100)^2, each = 5), rep(pi*(20/100)^2, each = 8), rep(pi*(25/100)^2, each = 10)))
# group by `group`, then select the top_n values of df$BHD with n depending on df$factor
df %>%
  group_by(group) %>%
  slice_max(BHD,
            n = 100*df$factor,
            with_ties = F) %>%
  summarise(d100 = mean('sliced values per group'))
# other thought: having a "lookup-table" for the factor like this:
lt <- data.frame(group = c(1, 2, 3),
                 factor = c(pi*(15/100)^2, pi*(20/100)^2, pi*(25/100)^2))
# then
df %>%
  group_by(group) %>%
  slice_max(BHD,
            n = 100*lt$factor 'where lt$group == df$group',
            with_ties = F) %>%
  summarise(d100 = mean('sliced values per group'))
I already found this answer to a problem which seems similar to mine, but it didn't quite help.
Since all the factor values are the same within each group, you can select any one factor value.
library(dplyr)
df %>%
  group_by(group) %>%
  top_n(BHD, n = 100 * first(factor)) %>%
  ungroup()
# group BHD factor
# <dbl> <dbl> <dbl>
# 1 1 25.8 0.0707
# 2 1 24.6 0.0707
# 3 1 27.6 0.0707
# 4 1 28.3 0.0707
# 5 1 29.2 0.0707
# 6 2 28.8 0.126
# 7 2 39.5 0.126
# 8 2 23.1 0.126
# 9 2 27.9 0.126
#10 2 31.7 0.126
# … with 13 more rows
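To finish with the single d100 value per group that the question asks for, the slice can be piped straight into summarise(), reusing the question's own mean step (a sketch building on the code above):

df %>%
  group_by(group) %>%
  top_n(BHD, n = 100 * first(factor)) %>%
  summarise(d100 = mean(BHD))  # one row per group

(If you prefer the newer slice_max(), note that it wants a single whole number per group, e.g. n = round(100 * first(factor)).)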
I have a data frame in which I would like to compute some extra columns as functions of the existing columns, but I want to specify both each new column name and the corresponding function dynamically. I have a vector of column names that are already in the data frame df_daily:
DAILY_QUESTIONS <- c("Q1_Daily", "Q2_Daily", "Q3_Daily", "Q4_Daily", "Q5_Daily")
The rows of the data frame have responses to each question from each user each time they answer the questionnaire, as well as a column with the number of days since the user first answered the questionnaire (i.e. Days_From_First_Use = 0 on the very first use, = 1 if it is used the next day, etc.). I want to average the responses to these questions by Days_From_First_Use. I start by grouping my data frame by Days_From_First_Use:
df_test <- df_daily %>%
  group_by(Days_From_First_Use)
and then try averaging the responses in a loop as follows:
for(i in 1:5){
  df_test <- df_test %>%
    mutate(!! paste0('Avg_Score_', DAILY_QUESTIONS[i]) :=
             paste0('mean(', DAILY_QUESTIONS[i], ')'))
}
Unfortunately, while my new variable names are correct ("Avg_Score_Q1_Daily", "Avg_Score_Q2_Daily", "Avg_Score_Q3_Daily", "Avg_Score_Q4_Daily", "Avg_Score_Q5_Daily"), my answers are not: every row in my data frame has a literal string such as "mean(Q1_Daily)" in the relevant column.
So I'm clearly doing something wrong. What do I need to do to fix this and get the average score across all users on each day?
Sincerely and with many thanks in advance
Thomas Philips
I took a somewhat different approach, using summarize(across(...)) after group_by(Days_From_First_Use). I achieve the dynamic names with rename_with() and a custom function that replaces a leading "Q" with "Avg_Score_Q":
library(dplyr, warn.conflicts = FALSE)
# fake data -- 30 normalized "responses" from 0 to 2 days from first use to 5 questions
DAILY_QUESTIONS <- c("Q1_Daily", "Q2_Daily", "Q3_Daily", "Q4_Daily", "Q5_Daily")
df_daily <- as.data.frame(do.call('cbind', lapply(1:5, function(i) rnorm(30, i))))
colnames(df_daily) <- DAILY_QUESTIONS
df_daily$Days_From_First_Use <- floor(runif(30, 0, 3))
df_test <- df_daily %>%
  group_by(Days_From_First_Use) %>%
  summarize(across(.fns = mean)) %>%
  rename_with(.fn = function(x) gsub("^Q", "Avg_Score_Q", x))
#> `summarise()` ungrouping output (override with `.groups` argument)
df_test
#> # A tibble: 3 x 6
#> Days_From_First… Avg_Score_Q1_Da… Avg_Score_Q2_Da… Avg_Score_Q3_Da…
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 1.26 1.75 3.02
#> 2 1 0.966 2.14 3.48
#> 3 2 1.08 2.45 3.01
#> # … with 2 more variables: Avg_Score_Q4_Daily <dbl>, Avg_Score_Q5_Daily <dbl>
Created on 2020-12-06 by the reprex package (v0.3.0)
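As an aside, the loop in the question fails because paste0() builds a character string and mutate() simply stores that string; it is never evaluated as code. A minimal direct fix (a sketch, assuming rlang is attached) converts each column name into a symbol with sym() before unquoting:

library(dplyr)
library(rlang)

# df_test is already grouped by Days_From_First_Use, so mean() is a group mean;
# mutate() keeps one row per response, with that mean repeated within each day
for (i in 1:5) {
  df_test <- df_test %>%
    mutate(!!paste0("Avg_Score_", DAILY_QUESTIONS[i]) :=
             mean(!!sym(DAILY_QUESTIONS[i])))
}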
I'm trying to write a function to calculate toplines (as commonly used in polling data). It needs to include both a "percent" and "valid percent" column.
Here's an example
library(tidyverse)
# prepare some data
d <- gss_cat %>%
  mutate(tvhours2 = tvhours,
         tvhours2 = replace(tvhours2, tvhours > 5, "6-8"),
         tvhours2 = replace(tvhours2, tvhours > 8, "9+"),
         tvhours2 = fct_explicit_na(tvhours2),
         # make a weight variable
         fakeweight = rnorm(n(), mean = 1, sd = .25))
The following function works as far as it goes:
make.topline <- function(variable, data, weight){
  variable <- enquo(variable)
  weight <- enquo(weight)
  table <- data %>%
    # calculate denominator
    mutate(total = sum(!!weight)) %>%
    # calculate proportions
    group_by(!!variable) %>%
    summarise(pct = (sum(!!weight)/first(total))*100,
              n = sum(!!weight))
  table
}
make.topline(variable = tvhours2, data = d, weight = fakeweight)
I'm struggling to implement the valid percent column. Here is the syntax I tried.
make.topline2 <- function(variable, data, weight){
  variable <- enquo(variable)
  weight <- enquo(weight)
  table <- data %>%
    # calculate denominator
    mutate(total = sum(!!weight),
           valid.total = sum(!!weight[!!variable != "(Missing)"])) %>%
    # calculate proportions
    group_by(!!variable) %>%
    summarise(pct = (sum(!!weight)/first(total))*100,
              valid.pct = (sum(!!weight)/first(valid.total))*100,
              n = sum(!!weight))
  table
}
make.topline2(variable = tvhours2, data = d, weight = fakeweight)
This yields the following error:
Error: Base operators are not defined for quosures.
Do you need to unquote the quosure?
# Bad:
myquosure != rhs
# Good:
!!myquosure != rhs
Call `rlang::last_error()` to see a backtrace
I know the problem is in this line, but I don't know how to fix it:
mutate(valid.total = sum(!!weight[!!variable != "(Missing)"]))
You can put parentheses around the !!weight. I think of this as making sure the extract brackets are applied only after weight is unquoted (an order-of-operations thing): without the parentheses, R tries to subset and compare the quosure objects themselves, which is what triggers the "Base operators are not defined for quosures" error.
That line would then look like:
valid.total = sum((!!weight)[!!variable != "(Missing)"])
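For completeness, that is the only change the question's function needs; everything else stays as it was:

make.topline2 <- function(variable, data, weight){
  variable <- enquo(variable)
  weight <- enquo(weight)
  table <- data %>%
    # calculate denominator
    mutate(total = sum(!!weight),
           # parentheses: unquote weight first, then subset it
           valid.total = sum((!!weight)[!!variable != "(Missing)"])) %>%
    # calculate proportions
    group_by(!!variable) %>%
    summarise(pct = (sum(!!weight)/first(total))*100,
              valid.pct = (sum(!!weight)/first(valid.total))*100,
              n = sum(!!weight))
  table
}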
Alternatively, you could use the curly-curly operator ({{), introduced in rlang 0.4.0, which takes the place of the enquo()/!! combination for relatively simple cases like yours. Then your function would look something like this:
make.topline <- function(variable, data, weight){
  table <- data %>%
    # calculate denominator
    mutate(total = sum({{ weight }}),
           valid.total = sum({{ weight }}[{{ variable }} != "(Missing)"])) %>%
    # calculate proportions
    group_by({{ variable }}) %>%
    summarise(pct = (sum({{ weight }})/first(total))*100,
              valid.pct = (sum({{ weight }})/first(valid.total))*100,
              n = sum({{ weight }}))
  table
}
Like the parentheses solution, this runs without error.
make.topline(variable = tvhours2, data = d, weight = fakeweight)
# A tibble: 9 x 4
tvhours2 pct valid.pct n
<fct> <dbl> <dbl> <dbl>
1 0 3.16 5.98 679.
2 1 10.9 20.6 2342.
3 2 14.1 26.6 3022.
4 3 9.10 17.2 1957.
5 4 6.67 12.6 1432.
6 5 3.24 6.13 696.
7 6-8 4.02 7.61 864.
8 9+ 1.67 3.16 358.
9 (Missing) 47.2 89.3 10140.