Selecting subsets of a grouped variable - r

The data I used can be found here (the "sq.txt" file).
Below is a summary of the data:
> summary(sq)
        behaviour          date         squirrel          time
 resting     :983   2017-06-28: 197   22995  : 127   09:30:00:  17
 travelling  :649   2017-06-26: 160   22758  : 116   08:00:00:  16
 feeding     :344   2017-06-30: 139   23080  : 108   16:25:00:  15
 OOS         :330   2017-07-18: 110   23089  : 100   08:11:00:  13
 vocalization:246   2017-06-27:  99   23079  :  97   08:31:00:  13
 social      : 53   2017-06-29:  96   22865  :  95   15:24:00:  13
 (Other)     : 67   (Other)   :1871   (Other):2029   (Other) :2585
Each squirrel has a number of observations, each falling into one of several behaviour categories (behaviour).
For example, squirrel 22995 was observed 127 times. These 127 observations are spread across the behaviour categories: 7 feeding, 1 territorial, 55 resting, etc. I then need to divide the count of each behaviour by the total number of observations (i.e. feeding = 7/127, territorial = 1/127, resting = 55/127, etc.) to get the proportion of time spent on each behaviour.
I have already grouped my observations by squirrel using the dplyr package.
Is there a way, using dplyr, to calculate proportions for one column (behaviour) based on the total number of observations for another column (squirrel) whose values have been grouped?

Something like this?
sq %>%
  count(squirrel, behaviour) %>%
  group_by(squirrel) %>%
  mutate(p = n / sum(n)) %>%
  # add this line to see the result for squirrel 22995
  filter(squirrel == 22995)
# A tibble: 8 x 4
# Groups:   squirrel [1]
  squirrel behaviour         n       p
     <int> <chr>         <int>   <dbl>
1    22995 feeding           7 0.0551
2    22995 nest_building     4 0.0315
3    22995 OOS               9 0.0709
4    22995 resting          55 0.433
5    22995 social            6 0.0472
6    22995 territorial       1 0.00787
7    22995 travelling       32 0.252
8    22995 vocalization     13 0.102
EDIT:
If you want to include zero counts for squirrels where a behaviour was not observed, one way is to use tidyr::complete(). That generates NA by default, which you may want to replace with zero.
library(dplyr)
library(tidyr)
sq %>%
  count(squirrel, behaviour) %>%
  complete(squirrel, behaviour) %>%
  group_by(squirrel) %>%
  mutate(p = n / sum(n, na.rm = TRUE)) %>%
  replace_na(list(n = 0, p = 0)) %>%
  filter(squirrel == 22995)
# A tibble: 11 x 4
# Groups:   squirrel [1]
   squirrel behaviour         n       p
      <int> <chr>         <dbl>   <dbl>
 1    22995 dead           0    0
 2    22995 feeding        7.00 0.0551
 3    22995 grooming       0    0
 4    22995 nest_building  4.00 0.0315
 5    22995 OOS            9.00 0.0709
 6    22995 resting       55.0  0.433
 7    22995 social         6.00 0.0472
 8    22995 territorial    1.00 0.00787
 9    22995 travelling    32.0  0.252
10    22995 vigilant       0    0
11    22995 vocalization  13.0  0.102
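As a side note: on dplyr 1.1.0 or later, the same per-group proportion can be written without an explicit group_by() by using the .by argument of mutate() (a minimal sketch, assuming the same sq data as above):

sq %>%
  count(squirrel, behaviour) %>%
  mutate(p = n / sum(n), .by = squirrel)

Unlike the group_by() version, this returns an ungrouped tibble, so no ungroup() is needed afterwards.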

Related

Point in Polygon method in R for merging regions of a shapefile with over 15,000 observations of a dataset

I have been struggling with a problem for days. I have a shapefile of regions:

region   geometry
1        MULTIPOLYGON(6830854,....
2        MULTIPOLYGON(6830854,....

and a dataset of georeferenced points:

gisid   geox     geoy     drug dependency
1       800000   150000   65
2       600000   300000   80

The new dataset should look like this:

Region   mean(drug dependency)
1        55
2        54

For example, region 1 contains gisid 4, 5, 8, 9, 50, 65, and 83; those points lie inside region 1, and I then want to calculate the mean value of drug dependency for that region. I couldn't find any solution.
Here's a somewhat contrived example. The key function is st_intersection(). There may be faster solutions using other methods in sf.
library(sf)
library(tmap)
library(tidyverse)

data("World")
# one centroid per country, keeping only the iso_a3 column
world.centroids <- st_centroid(World)[, 1]
# triplicate the centroids to get several "observations" per region
world.centroids.expanded <- world.centroids[rep(row.names(world.centroids), 3), ]
# a random value to average, standing in for drug dependency
world.centroids.expanded$ref <- runif(nrow(world.centroids.expanded), 0, 1)

# point-in-polygon: attach each point to the polygon it falls in
intersections <- st_intersection(World[, 1], world.centroids.expanded)
intersections <- st_drop_geometry(intersections)
intersections |> group_by(iso_a3) |> summarise(avg = mean(ref))
# A tibble: 164 × 2
   iso_a3   avg
   <fct>  <dbl>
 1 AFG    0.382
 2 AGO    0.457
 3 ALB    0.510
 4 ARE    0.359
 5 ARG    0.627
 6 ARM    0.361
 7 ATA    0.130
 8 ATF    0.241
 9 AUS    0.636
10 AUT    0.304
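On the "there may be faster solutions" point: a hedged sketch of one alternative is st_join() with the st_within predicate, which joins each point to the polygon containing it without building intersection geometries (assuming the same World and world.centroids.expanded objects as above):

# spatial join: each point inherits the iso_a3 of the polygon it lies within
world.centroids.expanded["ref"] |>
  st_join(World["iso_a3"], join = st_within) |>
  st_drop_geometry() |>
  group_by(iso_a3) |>
  summarise(avg = mean(ref))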

How to get a conditional proportion in a tibble in r

I have this tibble:
   host_id district availability_365
     <dbl> <chr>               <dbl>
 1    8573 Fatih                 280
 2    3725 Maltepe               365
 3    1428 Fatih                 355
 4    6284 Fatih                 164
 5    3518 Esenyurt                0
 6    8427 Esenyurt              153
 7    4218 Fatih                   0
 8    5342 Kartal                134
 9    4297 Pendik                  0
10    9340 Maltepe               243
# … with 51,342 more rows
I want to find out, per district, how high the proportion of hosts is whose rooms all have availability_365 == 0. As you can see there are 51,352 rows, but not every row is a different host: there are exactly 37,572 distinct host_ids.
I know that I can use group_by(district) to split the data up into the 5 different districts, but I am not quite sure how to find out what percentage of the hosts only have rooms with no availability. Can anybody help me out here?
Use the summarise() function along with group_by() in dplyr:
library(dplyr)
df %>%
  group_by(district) %>%
  summarise(Zero_Availability = sum(availability_365 == 0) / n())
# A tibble: 5 x 2
  district Zero_Availability
  <chr>                <dbl>
1 Esenyurt              0.5
2 Fatih                 0.25
3 Kartal                0
4 Maltepe               0
5 Pendik                1
It's difficult to make sure my answer is working without actually having the data, but if you're open to using data.table, the following should work:
library(data.table)
setDT(data)
# first flag, per host, whether all of that host's rooms have zero availability,
# then take the share of such hosts within each district
data[, .(no_avail = all(availability_365 == 0)), .(host_id, district)][, .(
  prop_no_avail = sum(no_avail) / .N
), .(district)]
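For comparison, the same two-step, host-level logic written as a dplyr sketch (untested without the data; column names as in the question):

library(dplyr)
df %>%
  # one row per host: TRUE if all of that host's rooms have zero availability
  group_by(host_id, district) %>%
  summarise(no_avail = all(availability_365 == 0), .groups = "drop") %>%
  # share of such hosts within each district
  group_by(district) %>%
  summarise(prop_no_avail = sum(no_avail) / n())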

group_by() summarise() and weights percentages - R

Let's suppose that a company has 3 Bosses and 20 Employees, where each Employee has done n_Projects with an overall Performance in percentage:
> df <- data.frame(Boss = sample(1:3, 20, replace = TRUE),
                   Employee = sample(1:20, 20),
                   n_Projects = sample(50:100, 20, replace = TRUE),
                   Performance = round(sample(1:100, 20, replace = TRUE) / 100, 2),
                   stringsAsFactors = FALSE)
> df
   Boss Employee n_Projects Performance
1     3        8         79        0.57
2     1        3         59        0.18
3     1       11         76        0.43
4     2        5         85        0.12
5     2        2         75        0.10
6     2        9         66        0.60
7     2       19         85        0.36
8     1       20         79        0.65
9     2       17         79        0.90
10    3       14         77        0.41
11    1        1         78        0.97
12    1        7         72        0.52
13    2        6         62        0.69
14    2       10         53        0.97
15    3       16         91        0.94
16    3        4         98        0.63
17    1       18         63        0.95
18    2       15         90        0.33
19    1       12         80        0.48
20    1       13         97        0.07
The CEO asks me to compute the quality of the work for each boss. However, he asks for a specific calculation: each Performance value has to have a weight equal to the n_Projects value over the total n_Projects for that boss.
For example, Boss 1 has a total of 604 n_Projects, where project 1 has a Performance weight of 0.13 (78/604 * 0.97 = 0.13), project 3 a Performance weight of 0.02 (59/604 * 0.18 = 0.02), and so on. The sum of these Performance weights is the Boss's performance, which for Boss 1 is 0.52. So the final output should look like this:
Boss   total_Projects   Performance
1      604              0.52
2      340              0.18   # the values for boss 2 are invented
3      230              0.43   # the values for boss 3 are invented
However, I'm still struggling with this:
df %>%
  group_by(Boss) %>%
  summarise(total_Projects = sum(n_Projects),
            Weight_Project = n_Projects / sum(total_Projects))
In addition to this problem, can you give me any feedback about my code specifically, or any recommendations to improve my data-manipulation skills? (You can see in my profile that I have asked a lot of questions like this, but I'm still not able to solve them on my own.)
We can take the sum of the product of 'n_Projects' and 'Performance' and divide it by 'total_projects':
library(dplyr)
df %>%
  group_by(Boss) %>%
  summarise(total_projects = sum(n_Projects),
            Weight_Project = sum(n_Projects * Performance) / total_projects)
# or, with matrix multiplication:
# Weight_Project = n_Projects %*% Performance / total_projects
# A tibble: 3 x 3
#    Boss total_projects Weight_Project
#   <int>          <int>          <dbl>
# 1     1            604          0.518
# 2     2            595          0.475
# 3     3            345          0.649
Adding some more details about what you did and about akrun's answer.
You must have received the following error message:
df %>%
  group_by(Boss) %>%
  summarise(total_Projects = sum(n_Projects),
            Weight_Project = n_Projects / sum(total_Projects))
## Error in summarise_impl(.data, dots) :
##   Column `Weight_Project` must be length 1 (a summary value), not 7
This tells you that the calculation you made for Weight_Project does not yield a single value for each Boss, but 7 of them. summarise is there to condense several values into one (by means, sums, etc.). Here you just divide each value of n_Projects by sum(total_Projects), without summarising the result into a single value.
Assuming that what you had in mind was first calculating the weight for each performance, then combining it with the performance mark to yield the weighted mean performance, you can proceed in two steps:
df %>%
  group_by(Boss) %>%
  mutate(Weight_Performance = n_Projects / sum(n_Projects)) %>%
  summarise(weighted_mean_performance = sum(Weight_Performance * Performance))
The mutate statement preserves the total number of rows in df, but sum(n_Projects) is calculated for each Boss value thanks to group_by.
Once you have a project weight for each row (which depends on the boss), you can calculate the weighted mean, which is a summary value, with summarise.
A more compact way that still keeps the weighting visible would be:
df %>%
  group_by(Boss) %>%
  summarise(weighted_mean_performance = sum((n_Projects / sum(n_Projects)) * Performance))

# Reordering to minimise parentheses, which gives akrun's answer:
df %>%
  group_by(Boss) %>%
  summarise(weighted_mean_performance = sum(n_Projects * Performance) / sum(n_Projects))
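As a footnote, base R's weighted.mean() computes exactly this sum(w * x) / sum(w) form, so the same summary can also be written as:

df %>%
  group_by(Boss) %>%
  summarise(total_Projects = sum(n_Projects),
            weighted_mean_performance = weighted.mean(Performance, n_Projects))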

Resampling cross-sectional time series data in R

I'm dealing with cross-sectional time series data (many DIFFERENT individuals over time). At the individual level, each person has a quantity of a good demanded. This data is unbalanced with respect to how many individuals are in each period. For each time period, I've aggregated the individual data into a single time series. Example data structure below
Cross-Section Time Series

Time  | Person | Quantity
-------------------------
11/18 | Bob    | 2
11/18 | Sally  | 1
11/18 | Jake   | 5
12/18 | Jim    | 2
12/18 | Roger  | 8

Time Series

Time  | Total Q
---------------
11/18 | 8
12/18 | 10
What I want to do for each period is resample (with replacement) the individual quantities, aggregate across the individuals, iterate X times, and then get a mean and standard error from the bootstrap.
The end result should look like

Time  | Total Q | Bootstrap Total Mean
--------------------------------------
11/18 | 8       | 8.5
12/18 | 10      | 10.05
Here is some code to create example sample data:
library(tidyverse)
set.seed(1234)
# 50 observations spread over 10 periods
Cross_Time = data.frame(Period = sample(1:10, 50, replace = TRUE),
                        Q = rnorm(50, 10, 1)) %>%
  arrange(Period)
Timeseries = Cross_Time %>%
  group_by(Period) %>%
  summarize(Total = sum(Q))
I know this is possible in R, but I'm at a loss as to how to code it, or even what the right questions to ask are. All help is appreciated!
We may do the following:
X <- 1000
Cross_Time %>%
  group_by(Period) %>%
  do({
    # X bootstrap resamples of this period's quantities, summed per resample
    QS <- colSums(replicate(X, sample(.$Q, replace = TRUE)))
    data.frame(Period = .$Period[1], `Total Q` = sum(.$Q),
               Mean = mean(QS), `Standard Error` = sd(QS))
  })
# A tibble: 10 x 4
# Groups:   Period [10]
#    Period Total.Q  Mean Standard.Error
#     <int>   <dbl> <dbl>          <dbl>
#  1      1    28.8  28.8          0.284
#  2      2    35.9  35.8          0.874
#  3      3   109.  109.           3.90
#  4      4    48.9  48.9          2.16
#  5      5    20.2  20.2          0.658
#  6      6    59.0  58.8          3.57
#  7      7    88.7  88.6          2.64
#  8      8    22.7  22.7          1.04
#  9      9    47.7  47.7          2.46
# 10     10    27.9  27.9          0.575
I think the code is quite self-explanatory. In every group we resample its values with replacement X times using replicate, then compute the two desired statistics. It's also straightforward to add any others!
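Since do() has been superseded in later dplyr versions, here is a hedged sketch of the same bootstrap with reframe() instead (requires dplyr >= 1.1.0; boot_stats is a hypothetical helper, not a package function):

library(dplyr)
boot_stats <- function(q, X = 1000) {
  # X bootstrap resamples of q, summed per resample
  qs <- colSums(replicate(X, sample(q, replace = TRUE)))
  tibble(Total_Q = sum(q), Mean = mean(qs), Standard_Error = sd(qs))
}

Cross_Time %>%
  group_by(Period) %>%
  reframe(boot_stats(Q))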

Top-n-Box (Likert Scale) by factor groups in a dataframe

I have the following dataframe, which is the result of a cluster analysis with ten 7-point Likert attitude scales for specific product benefits (see the 'variable' column). Here, n is the number of persons stating a specific value for each Benefit, cum is the total number of persons for each cluster, and n2 is just the relative share of answers to all answers per cluster (n2 = n/cum * 100, which is basically a percentage).
Now, I want to create a new column aggregating / summing up the percentages (n2) of the top-n scale points (in the 'value' column) for each benefit, e.g. a new column "Top-3-Box" with a value of 46.5 for rows 1-7 / Benefit.1 (which is the sum of the n2 of the rows with the top-3 values 7, 6, 5). It would be great if there were a solution for this that is directly applicable in dplyr.
Please see the dataframe below:
   cluster variable   value     n   cum    n2
     <int> <chr>      <dbl> <int> <int> <dbl>
 1       1 Benefit.1      1    11    86  12.8
 2       1 Benefit.1      2    11    86  12.8
 3       1 Benefit.1      3     6    86   7
 4       1 Benefit.1      4    18    86  20.9
 5       1 Benefit.1      5    16    86  18.6
 6       1 Benefit.1      6    14    86  16.3
 7       1 Benefit.1      7    10    86  11.6
 8       1 Benefit.10     1    10    86  11.6
 9       1 Benefit.10     2    13    86  15.1
10       1 Benefit.10     3     8    86   9.3
# ... with 40 more rows
I highly appreciate your support!
We can do a grouped sum of 'n2', subsetting the values that correspond to the top 3 levels of 'value' (7, 6, 5):
library(dplyr)
df1 %>%
  group_by(cluster, variable) %>%
  mutate(percent = sum(n2[value %in% 5:7]))
If the 'value' is already ordered per 'cluster' and 'variable', then we can just subset the last three elements of 'n2':
df1 %>%
  group_by(cluster, variable) %>%
  mutate(percent = sum(tail(n2, 3)))
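For an arbitrary Top-n-Box that does not rely on the row order, a small sketch (top_n_box is a hypothetical helper, not a package function):

library(dplyr)
# sum the n2 shares of the n highest scale points
top_n_box <- function(n2, value, n = 3) {
  sum(n2[order(value, decreasing = TRUE)][seq_len(n)])
}

df1 %>%
  group_by(cluster, variable) %>%
  mutate(Top3Box = top_n_box(n2, value, 3))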
