Set rnorm parameters equal to vector - r

I have a data frame that contains columns of sample sizes, means, and standard deviations, as well as a target value:
ssize <- c(200, 300, 150)
mean <- c(10, 40, 50)
sd <- c(5, 15, 65)
target <- c(7, 23, 30)
df <- data.frame(ssize, mean, sd, target)
I wish to add another variable below that returns the number of elements less than the target value, as drawn from a normal distribution with parameters mean and sd and sample size ssize. However, I cannot get rnorm to use the values of each row as parameters. For example, running
df$below <- sum(rnorm(df$ssize, df$mean, df$sd) < df$target)
generates distributions that have sample sizes equal to length(df$ssize) instead of the value of df$ssize itself.
Update: data.table solution for large datasets?
The solutions from @alistaire and @G5W work well, but I would like to extract the mean value of below from 100 replicates of rnorm for each row. I tried adapting both solutions:
df <- df %>% mutate(below = mean(replicate(100, pmap_int(., ~sum(rnorm(..1, ..2, ..3) < ..4)))))
df$below <- with(df, sapply(1:nrow, function(i) mean(replicate(100, sum(rnorm(n[i], mean[i], sd[i]) < target[i])))))
But they take a very long time to run on my dataset, which has more than 4.3 million rows. Is there a data.table (or other) solution that might be faster?

List columns are a natural way to do this, so you can store the samples right next to the parameters that generated them. Using purrr for iteration,
library(tidyverse)
set.seed(47) # for reproducibility
df <- tibble(n = c(200, 300, 150),  # named n to match rnorm's first argument, so pmap works naturally
             mean = c(10, 40, 50),
             sd = c(5, 15, 65),
             target = c(7, 23, 30))
df %>%
  mutate(samples = pmap(.[1:3], rnorm),  # iterate in parallel over parameters and store samples as list column
         below = map2_int(samples, target, ~sum(.x < .y)))  # iterate over samples and target, count values below, and simplify to integer vector
#> # A tibble: 3 x 6
#>       n  mean    sd target samples     below
#>   <dbl> <dbl> <dbl>  <dbl> <list>      <int>
#> 1   200    10     5      7 <dbl [200]>    47
#> 2   300    40    15     23 <dbl [300]>    41
#> 3   150    50    65     30 <dbl [150]>    58

You can do this in base R with sapply and an anonymous function:
df$below = with(df,
    sapply(seq_along(ssize), function(i) sum(rnorm(ssize[i], mean[i], sd[i]) < target[i])))
df$below
[1] 44 45 48
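For the follow-up about averaging 100 replicates over millions of rows, one option is to skip drawing the normal samples altogether. The number of n independent normal draws that fall below target follows a binomial distribution with size n and probability pnorm(target, mean, sd), so rbinom can simulate the counts directly and is vectorized over rows. The following is a minimal sketch under that reasoning, not one of the answers above; the column names below_mean and below_expected are made up for illustration.

```r
df <- data.frame(ssize  = c(200, 300, 150),
                 mean   = c(10, 40, 50),
                 sd     = c(5, 15, 65),
                 target = c(7, 23, 30))

set.seed(47)  # for reproducibility

# Probability that a single draw from N(mean, sd) falls below target, per row
p <- pnorm(df$target, mean = df$mean, sd = df$sd)

# One simulated count per row: Binomial(ssize, p) stands in for
# sum(rnorm(ssize, mean, sd) < target)
df$below <- rbinom(nrow(df), size = df$ssize, prob = p)

# Mean of 100 replicate counts per row. size and prob are recycled in row
# order, so filling the matrix column-wise keeps row i paired with
# ssize[i] and p[i].
reps <- 100
sims <- matrix(rbinom(reps * nrow(df), size = df$ssize, prob = p),
               nrow = nrow(df))
df$below_mean <- rowMeans(sims)

# The exact expected count, if simulation noise is not actually needed
df$below_expected <- df$ssize * p
```

With millions of rows the reps-column matrix may not fit in memory; in that case the replicates could be accumulated in a loop, or below_expected used directly.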

Related

How do I mutate across to multiple columns together that have similar names in R?

I have many columns whose names always start with one of the same prefixes: n_ for the number of students, score_ for the percent of students who passed, and loc_ for the room number.
I want to multiply the n_ columns by their respective score_ columns (so n_math * score_math, n_sci * score_sci, etc.) and create new columns called n_*_success for the number of students who passed the class.
If I had just a few columns like in this sample dataset, I would do something like this for each column:
mutate(n_sci_success = n_sci * score_sci)
But I have many columns and I'd like to write some expression that will match column names.
I think I have to use regex and across (like across(starts_with("n_"))), but I just can't figure it out. Any help would be much appreciated!
Here's a sample dataset:
library(tidyverse)
test <- tibble(id = c(1:4),
n_sci = c(10, 20, 30, 40),
score_sci = c(1, .9, .75, .7),
loc_sci = c(1, 2, 3, 4),
n_math = c(100, 50, 40, 30),
score_math = c(.5, .6, .7, .8),
loc_math = c(4, 3, 2, 1),
n_hist = c(10, 50, 30, 20),
score_hist = c(.5, .5, .9, .9),
loc_hist = c(2, 1, 4, 3))
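Here's one way using across and the new pick function from dplyr 1.1.0. Note that this example multiplies the n_, score_, and loc_ columns together; drop the last pick() term to get only n_ * score_.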
library(dplyr)
out <- test %>%
mutate(across(starts_with('n_'), .names = 'res_{col}') *
pick(starts_with('score_')) * pick(starts_with('loc_')))
out %>% select(starts_with('res'))
# res_n_sci res_n_math res_n_hist
# <dbl> <dbl> <dbl>
#1 10 200 10
#2 36 90 25
#3 67.5 56 108
#4 112 24 54
This should also work if you replace every pick with across. pick is useful for selecting columns; across is useful when you need to apply a function to the selected columns.
I use across in the first case (with starts_with('n_')) because I want to give the new columns unique names via .names, which pick does not have.
pick() is very nice, thanks for sharing. Here is a way using reduce from the purrr package:
We first use split.default to get a list of data frames grouped by column-name suffix, then apply reduce to each one via map_dfr:
library(purrr)
library(stringr)
test %>%
split.default(str_remove(names(.), ".*_")) %>%
map_dfr(reduce, `*`)
# A tibble: 4 × 4
hist id math sci
<dbl> <int> <dbl> <dbl>
1 10 1 200 10
2 25 2 90 36
3 108 3 56 67.5
4 54 4 24 112
We can use {dplyover} to solve this kind of problem.
Disclaimer: I'm the maintainer and the package is not on CRAN.
We have two options:
The easiest way is to use dplyover::across2(). Below I use dplyr::transmute() to only show the newly created columns, but we can use mutate() instead to add the new columns to our data.frame.
across2() lets you specify two sets of columns to loop over. Here we choose all columns that starts_with("n_") and all columns that starts_with("score_"). We can then use .x (for the former) and .y (for the latter) in the .fns argument. In the .names argument we can specify what our new names should look like. We take the name of the first column {xcol} and add _success to it.
library(dplyover) # https://timteafan.github.io/dplyover/
test %>%
transmute(
across2(starts_with("n_"),
starts_with("score_"),
~ .x * .y,
.names = "{xcol}_success")
)
#> # A tibble: 4 × 3
#> n_sci_success n_math_success n_hist_success
#> <dbl> <dbl> <dbl>
#> 1 10 50 5
#> 2 18 30 25
#> 3 22.5 28 27
#> 4 28 24 18
While this approach is easy and straightforward, there is one problem: it assumes that the columns are in the correct order. This is also an assumption of the other two answers. If we have a large data.frame and are not sure whether all columns really are in the correct order, dplyover::over() is the better and programmatically safe option.
Here we loop over a string and use this to construct the variable names. Within over() we use cut_names("^.*_") to get the stems of the variable names, in our example c("sci", "math", "hist"). Then in the function in .fns we construct the variable names by wrapping a string inside .() (to evaluate it as a variable name). Within the string we can use {.x} to access the string of the current iteration. This approach will always combine n_sci with score_sci even if the columns are not ordered correctly. Finally, here too we can create nice names on the fly in the .names argument, where {x} refers to the current string.
test %>%
transmute(
over(cut_names("^.*_"), # <- gets us c("sci", "math", "hist")
~ .("n_{.x}") * .("score_{.x}"),
.names = "n_{x}_success"
)
)
#> # A tibble: 4 × 3
#> n_sci_success n_math_success n_hist_success
#> <dbl> <dbl> <dbl>
#> 1 10 50 5
#> 2 18 30 25
#> 3 22.5 28 27
#> 4 28 24 18
Data from OP
library(tidyverse)
test <- tibble(id = c(1:4),
n_sci = c(10, 20, 30, 40),
score_sci = c(1, .9, .75, .7),
loc_sci = c(1, 2, 3, 4),
n_math = c(100, 50, 40, 30),
score_math = c(.5, .6, .7, .8),
loc_math = c(4, 3, 2, 1),
n_hist = c(10, 50, 30, 20),
score_hist = c(.5, .5, .9, .9),
loc_hist = c(2, 1, 4, 3))
Created on 2023-02-12 with reprex v2.0.2
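If an extra package is not an option, the same name-based pairing can be sketched in base R. This is only an illustration of the idea, not taken from any of the answers: it derives the stems from the n_ columns and looks up each score_ column by name, so column order does not matter. The sample data here is a trimmed, deliberately reordered variant of the OP's data.

```r
# Trimmed variant of the OP's data; score_sci is deliberately placed before n_sci
test <- data.frame(id         = 1:4,
                   score_sci  = c(1, .9, .75, .7),
                   n_sci      = c(10, 20, 30, 40),
                   n_math     = c(100, 50, 40, 30),
                   score_math = c(.5, .6, .7, .8))

# Stems ("sci", "math") come from the n_ columns that exist before any
# new columns are added
stems <- sub("^n_", "", grep("^n_", names(test), value = TRUE))

# Pair n_<stem> with score_<stem> by name and store the product
for (s in stems) {
  test[[paste0("n_", s, "_success")]] <-
    test[[paste0("n_", s)]] * test[[paste0("score_", s)]]
}

test[c("n_sci_success", "n_math_success")]
```

Because the stems are computed once up front, rerunning the loop on a data frame that already holds the n_*_success columns would need the stems to be recomputed with a stricter pattern.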

Splitting data frame into 100 buckets and after split output should contain the values 25, 50, 75 in those 100 values

I want to split a data frame into 100 buckets (to show these 100 values in a graph), and those 100 values must always include the 25, 50, and 75 quantile values; these three must never be missing.
The problem is that when I pass a data frame with fewer than 100 rows, the code below does not produce the 25, 50, and 75 values in the output.
Below is a data frame with fewer than 100 rows for which ntile() fails to give 25, 50, and 75 in output_df.
df <- data.frame(usr_thp =
c(564.4733658,
642.1442142,
751.2999088,
767.3654291,
1177.598259,
1235.646466,
1442.834145,
1799.008496,
2136.925833,
2183.736126,
2251.76611,
2565.485513,
2791.037469,
2817.081995,
2817.169915,
3090.413274,
3452.138419,
3792.897529,
3813.101734,
4364.9713,
4517.857481,
4690.001855,
4737.380507,
4830.179267,
4906.446091,
5437.591944,
5465.150774,
5614.31162,
5862.69037,
5874.591271,
5956.859055,
5990.081847,
6104.084737,
6136.593924,
6156.072481,
6322.750252,
6453.179695,
6525.322676,
6549.369024,
6553.52475,
6646.479767,
6903.330889,
7044.806765,
7089.765708,
7299.0698,
7658.256263,
7668.333518,
7683.219606,
7884.227521,
7976.371067,
8034.630945,
8153.86454,
8155.635059,
8348.227014,
8980.884306,
9223.800387,
9410.946208,
9478.217587,
9801.870998,
10023.12566,
10553.40744,
10583.41615,
11625.85716,
11975.92464,
12304.78107,
13511.52622,
13681.24599,
14783.2377,
15560.06325,
15773.75931,
16202.474,
17118.28004,
17829.97004,
18139.05167,
19044.59533,
20792.20441,
21081.17953,
21585.34027,
22058.40687,
28447.04231), count_no = c(1,2,4,5,6,8,9,10,11,12,14,
15,16,18,19,20,21,22,24,25,26,28,29,30,31,32,34,35,36,38,39,40,41,42,
44,45,46,48,49,50,51,52,54,55,56,58,59,60,61,62,64,65,66,68,69,70,71,
72,74,75,76,78,79,80,81,82,84,85,86,88,89,90,91,92,94,95,96,98,99,100))
N_QUANTILES <- 100
QUARTILES <- c(0.25, 0.50, 0.75) * N_QUANTILES
output_df <- df %>%
dplyr::select(scenario='usr_thp') %>%
dplyr::mutate(quantile = dplyr::ntile(scenario,N_QUANTILES))
I have to show the output_df values in a graph, and they should include the 25, 50, and 75 values. But if you inspect the output, 25, 50, and 75 are not among them.
Is this what you're looking for?
df %>%
summarize(quantile = paste0( "Q",1:3, "(", scales::percent(c(0.25, 0.50, 0.75)), ")"),
value = quantile(usr_thp, c(0.25, 0.50, 0.75))) %>%
pivot_wider(names_from = quantile, values_from = value)
# A tibble: 1 × 3
`Q1(25%)` `Q2(50%)` `Q3(75%)`
<dbl> <dbl> <dbl>
1 4480. 6600. 10156.
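If all 100 points are needed for the graph rather than just the three quartiles, one possible approach is to compute the percentiles with quantile() instead of assigning rows to buckets. quantile() interpolates, so it returns a value for every requested probability even when there are fewer than 100 rows. This is a sketch under that assumption; usr_thp below is a short made-up stand-in for the real column.

```r
# Short hypothetical stand-in for the usr_thp column
usr_thp <- c(564.5, 642.1, 751.3, 767.4, 1177.6, 1235.6, 1442.8, 1799.0)

# Percentiles 1..100 expressed as probabilities 0.01..1
probs <- (1:100) / 100

# One row per percentile; values are interpolated where the data is sparse
output_df <- data.frame(
  quantile = 1:100,
  scenario = quantile(usr_thp, probs = probs, names = FALSE)
)

# Percentiles 25, 50 and 75 are present by construction
output_df[output_df$quantile %in% c(25, 50, 75), ]
```

The trade-off is that the plotted values are interpolated percentiles rather than original rows, which may or may not suit the graph.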

Calculate distance between multiple latitude and longitude points

I have a dataset that has latitude and longitude information for participants' home and work, and I'd like to create a new column in the dataset containing the euclidean distance between home and work for each participant. I think this should be relatively simple, but all the other Q&As I've seen seem to be dealing with slightly different issues.
To start, I tried running this code (using the geosphere package):
distm(c(homelong, homelat), c(worklong, worklat), fun=distHaversine)
But got an error saying "Error in .pointsToMatrix(x) : Wrong length for a vector, should be 2" because (if I understand correctly) I'm trying to calculate the distance between multiple sets of two points.
Can I adjust this code to get what I'm looking for, or is there something else I should be trying instead? Thanks!
distm() returns a distance matrix, which is not what you want; you want the pairwise distances. So use the distance function (distHaversine(), distGeo(), or whatever) directly:
library(tidyverse)
locations <- tibble(
homelong = c(0, 2),
homelat = c(2, 5),
worklong = c(70, 60),
worklat = c(45, 60)
)
locations <- locations %>%
mutate(
dist = geosphere::distHaversine(cbind(homelong, homelat), cbind(worklong, worklat))
)
locations
#> # A tibble: 2 × 5
#> homelong homelat worklong worklat dist
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 2 70 45 8299015.
#> 2 2 5 60 60 7809933.
Note that geosphere functions want matrices as inputs, so you can cbind() your columns together. Don't c() them; that's creating a single shapeless vector and losing the differentiation between lon and lat. This is the cause of the error, I suspect; the vector only has one dimension, not two like a matrix.
You can have the latitudes and longitudes in a dataframe and then do rowwise operations on the dataframe to get the distance corresponding to each row.
library(tidyverse)
library(geosphere)
locations <- tibble(
homelong = c(0, 2),
homelat = c(2, 5),
worklong = c(70, 60),
worklat = c(45, 60)
)
locations %>%
rowwise() %>%
mutate(d = as.numeric(distm(c(homelong, homelat), c(worklong, worklat), fun = distHaversine)))
results in
# A tibble: 2 x 5
# Rowwise:
homelong homelat worklong worklat d
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0 2 70 45 8299015.
2 2 5 60 60 7809933.
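To see what the vectorized call is doing, here is a dependency-free sketch of the haversine formula in base R. The function name haversine_m is made up for this example, and the radius 6378137 m is assumed to match the default used by geosphere::distHaversine, so the results should be close to the ones above, though this has not been checked against the package.

```r
# Great-circle distance in metres between paired points, vectorized over rows.
# lon/lat are in degrees; r is the sphere radius in metres (assumed default).
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad  <- pi / 180
  phi1    <- lat1 * to_rad
  phi2    <- lat2 * to_rad
  dphi    <- (lat2 - lat1) * to_rad
  dlambda <- (lon2 - lon1) * to_rad
  a <- sin(dphi / 2)^2 + cos(phi1) * cos(phi2) * sin(dlambda / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1
  2 * r * asin(pmin(1, sqrt(a)))
}

locations <- data.frame(
  homelong = c(0, 2),
  homelat  = c(2, 5),
  worklong = c(70, 60),
  worklat  = c(45, 60)
)

# One distance per row, no matrix of all pairs
locations$dist <- with(locations,
                       haversine_m(homelong, homelat, worklong, worklat))
locations
```

Each argument is a whole column, so the arithmetic runs element-wise and pairs row i of home with row i of work, which is the behaviour the question asks for.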

Programmatically create new variables which are sums of nested series of other variables

I have data giving me the percentage of people in some groups who have various levels of educational attainment:
df <- data_frame(group = c("A", "B"),
no.highschool = c(20, 10),
high.school = c(70,40),
college = c(10, 40),
graduate = c(0,10))
df
# A tibble: 2 x 5
group no.highschool high.school college graduate
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 20. 70. 10. 0.
2 B 10. 40. 40. 10.
E.g., in group A 70% of people have a high school education.
I want to generate 4 variables that give me the proportion of people in each group with less than each of the 4 levels of education (e.g., lessthan_no.highschool, lessthan_high.school, etc.).
desired df would be:
desired.df <- data.frame(group = c("A", "B"),
no.highschool = c(20, 10),
high.school = c(70,40),
college = c(10, 40),
graduate = c(0,10),
lessthan_no.highschool = c(0,0),
lessthan_high.school = c(20, 10),
lessthan_college = c(90, 50),
lessthan_graduate = c(100, 90))
In my actual data I have many groups and a lot more levels of education. Of course I could do this one variable at a time, but how could I do this programmatically (and elegantly) using tidyverse tools?
I would start by doing something like a mutate_at() inside of a map(), but where I get tripped up is that the list of variables being summed is different for each of the new variables. You could pass in the list of new variables and their corresponding variables to be summed as two lists to a pmap(), but it's not obvious how to generate that second list concisely. Wondering if there's some kind of nesting solution...
Here is a base R solution. Though the question asks for a tidyverse one, considering the dialog in the comments to the question I have decided to post it.
It uses apply and cumsum to do the hard work. Then there are some cosmetic concerns before cbinding into the final result.
tmp <- apply(df[-1], 1, function(x){
s <- cumsum(x)
100*c(0, s[-length(s)])/sum(x)
})
rownames(tmp) <- paste("lessthan", names(df)[-1], sep = "_")
desired.df <- cbind(df, t(tmp))
desired.df
# group no.highschool high.school college graduate lessthan_no.highschool
#1 A 20 70 10 0 0
#2 B 10 40 40 10 0
# lessthan_high.school lessthan_college lessthan_graduate
#1 20 90 100
#2 10 50 90
how could I do this programmatically (and elegantly) using tidyverse tools?
Definitely the first step is to tidy your data. Encoding information (like edu level) in column names is not tidy. When you convert education to a factor, make sure the levels are in the correct order - I used the order in which they appeared in the original data column names.
library(dplyr)
library(tidyr)
tidy_result = df %>% gather(key = "education", value = "n", -group) %>%
  mutate(education = factor(education, levels = names(df)[-1])) %>%
  group_by(group) %>%
  mutate(lessthan_x = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
  arrange(group, education)
tidy_result
# # A tibble: 8 x 4
# # Groups: group [2]
# group education n lessthan_x
# <chr> <fct> <dbl> <dbl>
# 1 A no.highschool 20 0
# 2 A high.school 70 20
# 3 A college 10 90
# 4 A graduate 0 100
# 5 B no.highschool 10 0
# 6 B high.school 40 10
# 7 B college 40 50
# 8 B graduate 10 90
This gives us a nice, tidy result. If you want to spread/cast this data into your un-tidy desired.df format, I would recommend using data.table::dcast, as (to my knowledge) the tidyverse does not offer a nice way to spread multiple columns. See Spreading multiple columns with tidyr or How can I spread repeated measures of multiple variables into wide format? for the data.table solution or an inelegant tidyr/dplyr version. Before spreading, you could create a key less_than_x_key = paste("lessthan", education, sep = "_").
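Since tidyr 1.0.0, pivot_longer() and pivot_wider() can also handle the round trip back to the wide layout. The following is a sketch of that route rather than part of the answer above; the intermediate names less and wide_result are chosen only for this example, and it relies on pivot_longer() keeping the education levels in their original column order within each group.

```r
library(dplyr)
library(tidyr)

df <- tibble(group = c("A", "B"),
             no.highschool = c(20, 10),
             high.school = c(70, 40),
             college = c(10, 40),
             graduate = c(0, 10))

less <- df %>%
  # long format: one row per group and education level, in column order
  pivot_longer(-group, names_to = "education", values_to = "n") %>%
  group_by(group) %>%
  # share of the group strictly below each level
  mutate(lessthan = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
  ungroup() %>%
  select(-n) %>%
  # back to wide: one lessthan_* column per education level
  pivot_wider(names_from = education, values_from = lessthan,
              names_prefix = "lessthan_")

# attach the new columns to the original data
wide_result <- left_join(df, less, by = "group")
wide_result
```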

Calculate sum and min max across each group in a data frame in R

I have the sample data frame below:
df <- data.frame("Group"= c(1,1,2,2,2),"H" =
c("H1","H3","H3","H4","H2"), "W1" = c(95, 0, 0,0,50) , "W2" = c(0,
95,95, 0,85),"W3" = c(85, 50,50 ,95,0))
I need to calculate two additional metrics:
1st metric: within each group, the percentage of the columns W1, W2, W3 whose maximum over the group's rows is at least 85. If all three maxima are at least 85, the output is 100%.
For example, in Group 2 the maxima of W2 and W3 are at least 85,
but the maximum of W1 is less than 85, so the result is 66.7.
2nd metric: the minimum of the per-column maxima of W1, W2, W3 within each group. For example, for Group 2, min(max[0 0 50], max[95 0 85], max[50 95 0]) = 50.
For more clarity, Here is desired output data frame:
DesiredDf <- data.frame("Group"= c(1,1,2,2,2),"H" =
c("H1","H3","H3","H4","H2"), "W1" = c(95, 0, 0,0,50) ,
"W2" = c(0, 95,95, 0,85), "W3" = c(85, 50,50 ,95,0),
"W" = c(100,100,66.7 ,66.7,66.7),MINMAX = c(85,85,50,50,50))
I have tried a for loop and sapply, but the actual data set is too large and execution is too slow. I am looking for a faster way to calculate these metrics in R.
The data.table way:
# use data.table
library(data.table)
setDT(df)
# aggregate data by group in order to calculate the 2 desired metrics
df1 <- df[ , .(maxw1 = max(W1), maxw2 = max(W2), maxw3 = max(W3)), by=Group]
# calculate the metrics
df1[ , metric1 := rowMeans(cbind(maxw1>=85, maxw2>=85, maxw3>=85))]
df1[ , metric2 := do.call(pmin,.SD), .SDcols = c("maxw1", "maxw2", "maxw3")]
# merge metrics back on to original dataframe
df <- merge(df, df1[ , .(Group, metric1, metric2)], by="Group")
By using dplyr:
library(dplyr)
df %>%
  group_by(Group) %>%
  mutate(w = rowMeans(cbind(max(W1) >= 85, max(W2) >= 85, max(W3) >= 85)),
         minmax = min(max(W1), max(W2), max(W3)))
# A tibble: 5 x 7
# Groups: Group [2]
Group H W1 W2 W3 w minmax
<dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1. H1 95. 0. 85. 1.00 85.
2 1. H3 0. 95. 50. 1.00 85.
3 2. H3 0. 95. 50. 0.667 50.
4 2. H4 0. 0. 95. 0.667 50.
5 2. H2 50. 85. 0. 0.667 50.
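For comparison, the same two metrics can be sketched in base R with ave(), which returns the group-wise maximum aligned to the original rows. This is an illustrative alternative, not one of the answers above; the helper names cols and grp_max are chosen for this example, and W is scaled to a percentage to match the desired output.

```r
df <- data.frame(Group = c(1, 1, 2, 2, 2),
                 H  = c("H1", "H3", "H3", "H4", "H2"),
                 W1 = c(95, 0, 0, 0, 50),
                 W2 = c(0, 95, 95, 0, 85),
                 W3 = c(85, 50, 50, 95, 0))

cols <- c("W1", "W2", "W3")

# Matrix with one column per W variable: each row holds the maximum of that
# variable within the row's group
grp_max <- sapply(df[cols], function(col) ave(col, df$Group, FUN = max))

# 1st metric: percentage of columns whose group maximum is at least 85
df$W <- round(100 * rowMeans(grp_max >= 85), 1)

# 2nd metric: minimum of the group maxima across the three columns
df$MINMAX <- unname(do.call(pmin, as.data.frame(grp_max)))

df
```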

Resources