How do I calculate the average with a certain condition in R?

I've been trying to calculate the average of a column in a dataframe subject to a condition, and then plot it in a graph, but so far I can only get the average of the whole column with mean(df$Age).
What I'm trying to get is the average age of employees in Vancouver, but I'm not sure how to do it, so I can't plot it out.

To get the average for a specific city, you can subset the column and take the mean:
result <- mean(df$Age[df$CityName == 'Vancouver'], na.rm = TRUE)

library(tidyverse)
tribble(
  ~Age, ~City,
  61, "Vancouver",
  58, "Vancouver",
  48, "Terrace",
  48, "Terrace"
) %>%
  group_by(City) %>%
  summarise(Age = mean(Age))
#> # A tibble: 2 x 2
#>   City        Age
#>   <chr>     <dbl>
#> 1 Terrace    48
#> 2 Vancouver  59.5
Created on 2021-11-12 by the reprex package (v2.0.1)
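Since the goal is also to plot the result, here is a minimal sketch of charting the per-city averages with ggplot2 (loaded as part of the tidyverse above); the made-up data mirrors the tribble above, and the Age/City column names are assumptions standing in for your real dataframe:

library(tidyverse)
df <- tribble(
  ~Age, ~City,
  61, "Vancouver",
  58, "Vancouver",
  48, "Terrace",
  48, "Terrace"
)
df %>%
  group_by(City) %>%
  summarise(Age = mean(Age, na.rm = TRUE)) %>%
  ggplot(aes(x = City, y = Age)) +
  geom_col() + # one bar per city, height = mean age
  labs(y = "Mean age")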

Related

Conditional Mutating using specific identifiers in a data frame

I have a dataset that employs a similar approach to this one.
ID <- c(4,5,6,7,3,8,9)
quantity <- c(20,300, 350, 21, 44, 20, 230)
measurementvalue <- c("tin", "kg","kg","tin","tin","tin","kg")
kgs <- c(21,12, 30,23,33,11,24)
DF <- data.frame(ID, quantity, measurementvalue)
My standard way of deriving a new column totalkgs using conditional mutating is the code below.
DF <- DF %>%
  mutate(totalkgs =
           ifelse(measurementvalue == "tin", quantity * 5,
                  ifelse(measurementvalue == "kg", quantity * 1, quantity)))
However, the dataset has erroneous entries in the column quantity, so I'd like to perform a division on those specific identifiers first. The final values of both the multiplication and the division should be stored in the column totalkgs. How do I go about this?
Let's assume the IDs with the erroneous data are 3, 5, 7, and 9, and I'd like to divide the values found in the column quantity by 10.
You could use case_when:
ID <- c(4,5,6,7,3,8,9)
quantity <- c(20,300, 350, 21, 44, 20, 230)
measurementvalue <- c("tin", "kg","kg","tin","tin","tin","kg")
kgs <- c(21,12, 30,23,33,11,24)
DF <- data.frame(ID, quantity, measurementvalue)
library(dplyr)
DF %>%
  mutate(quantity2 = ifelse(ID %in% c(3, 5, 9, 7), quantity / 10, quantity)) %>%
  mutate(totalkgs = case_when(measurementvalue == "tin" ~ quantity2 * 5,
                              measurementvalue == "kg" ~ quantity2 * 1,
                              TRUE ~ quantity2)) %>%
  select(-quantity2) # if you want
#>   ID quantity measurementvalue totalkgs
#> 1  4       20              tin    100.0
#> 2  5      300               kg     30.0
#> 3  6      350               kg    350.0
#> 4  7       21              tin     10.5
#> 5  3       44              tin     22.0
#> 6  8       20              tin    100.0
#> 7  9      230               kg     23.0
Created on 2022-07-05 by the reprex package (v2.0.1)

Calculating distance between all locations to first location, by group

I have GPS locations from several seabird tracks, each starting from colony x. Therefore the individual tracks all have similar first locations. For each track, I would like to calculate the beeline distance between each GPS location and either (a) a specified location that represents the location of colony x, or (b) the first GPS point of a given track which represents the location of colony x. For (b), I would look to use the first location of each new track ID (track_id).
I have looked for appropriate functions in geosphere, sp, raster, adehabitatLT, move, ... and just cannot seem to find what I am looking for.
I can calculate the distance between successive GPS points, but that is not what I need.
library(dplyr)
library(geosphere) # for distVincentyEllipsoid() / distHaversine()
df %>%
  group_by(ID) %>%
  mutate(lat_prev = lag(Lat, 1), lon_prev = lag(Lon, 1)) %>%
  mutate(dist = distVincentyEllipsoid(matrix(c(lon_prev, lat_prev), ncol = 2), # or use distHaversine
                                      matrix(c(Lon, Lat), ncol = 2)))
# example data:
df <- data.frame(Lon = c(-96.8, -96.60861, -96.86875, -96.14351, -92.82518, -90.86053, -90.14208, -84.64081, -83.7, -82, -80, -88.52732, -94.46049,-94.30, -88.60, -80.50, -81.70, -83.90, -84.60, -90.10, -90.80, -92.70, -96.10, -96.55, -96.50, -96.00),
Lat = c(25.38657, 25.90644, 26.57339, 27.63348, 29.03572, 28.16380, 28.21235, 26.71302, 25.12554, 24.50031, 24.89052, 30.16034, 29.34550, 29.34550, 30.16034, 24.89052, 24.50031, 25.12554, 26.71302, 28.21235, 28.16380, 29.03572, 27.63348, 26.57339, 25.80000, 25.30000),
ID = c(rep("ID1", 13), rep("ID2", 13)))
Grateful for any pointers.
You were pretty close. The key is that you want to calculate the distance from the first observation in each track. Therefore you first need to record the order of observations within each track (easy to do with dplyr::row_number()). Then, for the distance calculation, make the reference observation always the first one by subsetting with order == 1.
library(tidyverse)
library(geosphere)
df <- data.frame(Lon = c(-96.8, -96.60861, -96.86875, -96.14351, -92.82518, -90.86053, -90.14208, -84.64081, -83.7, -82, -80, -88.52732, -94.46049,-94.30, -88.60, -80.50, -81.70, -83.90, -84.60, -90.10, -90.80, -92.70, -96.10, -96.55, -96.50, -96.00),
Lat = c(25.38657, 25.90644, 26.57339, 27.63348, 29.03572, 28.16380, 28.21235, 26.71302, 25.12554, 24.50031, 24.89052, 30.16034, 29.34550, 29.34550, 30.16034, 24.89052, 24.50031, 25.12554, 26.71302, 28.21235, 28.16380, 29.03572, 27.63348, 26.57339, 25.80000, 25.30000),
ID = c(rep("ID1", 13), rep("ID2", 13)))
df %>%
  group_by(ID) %>%
  mutate(order = row_number()) %>%
  mutate(dist = distVincentyEllipsoid(matrix(c(Lon[order == 1], Lat[order == 1]), ncol = 2),
                                      matrix(c(Lon, Lat), ncol = 2)))
#> # A tibble: 26 x 5
#> # Groups:   ID [2]
#>      Lon   Lat ID    order     dist
#>    <dbl> <dbl> <chr> <int>    <dbl>
#>  1 -96.8  25.4 ID1       1       0
#>  2 -96.6  25.9 ID1       2   60714.
#>  3 -96.9  26.6 ID1       3  131665.
#>  4 -96.1  27.6 ID1       4  257404.
#>  5 -92.8  29.0 ID1       5  564320.
#>  6 -90.9  28.2 ID1       6  665898.
#>  7 -90.1  28.2 ID1       7  732131.
#>  8 -84.6  26.7 ID1       8 1225193.
#>  9 -83.7  25.1 ID1       9 1319482.
#> 10 -82    24.5 ID1      10 1497199.
#> # ... with 16 more rows
Created on 2022-01-09 by the reprex package (v2.0.1)
This also seems to work (sent to me by a friend); it's very similar to Dan's suggestion above, but slightly different:
library(geosphere)
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Dist_to_col = distHaversine(c(Lon[1], Lat[1]), cbind(Lon, Lat)))
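For completeness, option (a), a fixed user-specified colony location, needs no grouping at all: every GPS point is compared against the same reference. A minimal sketch, where the colony coordinates are a made-up assumption to be replaced with the real location of colony x:

library(dplyr)
library(geosphere)

colony <- c(-96.8, 25.38657) # hypothetical c(Lon, Lat) of colony x

df %>%
  mutate(dist_to_colony = distHaversine(colony, cbind(Lon, Lat)))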

Conditional Sum of a column in R with NAs

I have a dataset with 4 columns which looks like that:
City      Year  Week        Average
Guelph    2020  2020-04-12  28.3
Hamilton  2020  2020-04-12  10.7
Waterloo  2020  2020-04-12  50.1
Guelph    2020  2020-04-20  3.5
Hamilton  2020  2020-04-20  42.9
I would like to sum the Average column for the same week. In other words, I want to create a new dataset with three columns (Year, Week, Average) where I won't have three different rows for the same week but only one (e.g. instead of having 2020-04-12 three times, I will have it once), and the corresponding cell in the Average column will be the sum of all the rows that correspond to that week. Something like this:
Year  Week        Average
2020  2020-04-12  89.1
2020  2020-04-20  46.4
where 89.1 is the sum of the first three rows, which fall in the same week, and 46.4 is the sum of the last two rows of the initial table, which correspond to the same week (2020-04-20).
The code I am using for that looks like this:
data_set <- data_set %>%
  select(`Year`, `Week`, `Average`) %>%
  group_by(Year, Week) %>%
  summarize(Average = sum(Average))
but for some weeks I am getting back NAs while for others I get the correct sum I want. The data are all numeric, and in the initial dataset there are some NA values in the Average column.
Thanks in advance
You can accomplish this by passing na.rm = TRUE to sum(). Also, since you group_by(Year, Week), there isn't much to gain from using select in this case, since you are generating a summary statistic on the Average variable within summarise.
df <- structure(list(City = c("Guelph", "Hamilton", "Waterloo", "Guelph",
  "Hamilton"), Year = c(2020L, 2020L, 2020L, 2020L, 2020L), Week = c("2020-04-12",
  "2020-04-12", "2020-04-12", "2020-04-20", "2020-04-20"), Average = c(28.3,
  10.7, 50.1, 3.5, 42.9)), class = "data.frame", row.names = c(NA, -5L))

library(dplyr)
df %>%
  mutate(Week = as.Date(Week)) %>%
  group_by(Year, Week) %>%
  summarise(Average = sum(Average, na.rm = TRUE))
#> # A tibble: 2 x 3
#> # Groups:   Year [1]
#>    Year Week       Average
#>   <int> <date>       <dbl>
#> 1  2020 2020-04-12    89.1
#> 2  2020 2020-04-20    46.4
Created on 2021-03-10 by the reprex package (v0.3.0)
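For reference, a base R equivalent of the same grouped sum, as a sketch assuming the df built above; the formula interface drops rows with a missing Average before summing, which matches the na.rm = TRUE behaviour:

# one row per Year/Week combination, Average summed within each
aggregate(Average ~ Year + Week, data = df, FUN = sum)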

How to group a data frame by variables that are approximately the same?

I have the following data frame with data for each county:
# A tibble: 6 x 4
# Groups:   countyfips, day_after_reopening, deciles_income [6]
  countyfips day_after_reopening deciles_income winner2016
       <int> <drtn>                       <int> <chr>
1       1001 -109 days                        8 Donald Trump
2       1001 -102 days                        8 Donald Trump
3       1001  -95 days                        8 Donald Trump
4       1001  -88 days                        8 Donald Trump
5       1001  -81 days                        8 Donald Trump
6       1001  -74 days                        8 Donald Trump
And I would like to group it by the day_after_reopening column. However, the problem is that the day_after_reopening number differs a little between counties: the observations are taken at the same time for each county, but the counties each reopened on a different day of the week (e.g. out of two counties I would like to have in the same group, one might have -109, the other -108).
How would you group counties with very similar numeric values together? Thank you.
You can create artificial groups based on some pre-defined maximum difference between numbers. I created one example below:
require(dplyr)

# Maximum difference you want within a group
difference_max <- 2

# Create a dummy data frame
day_after_reopening <- c(108, 109, 107, 50, 51, 68, 69, 67, 108, 109, 55, 56, 57, 100, 101, 101, 100, 56)
df <- data.frame(day_after_reopening = day_after_reopening,
                 index = seq_along(day_after_reopening))

# Order by the column of interest
df <- df[order(df$day_after_reopening), ]
df$test <- c(diff(df$day_after_reopening, lag = 1), 0)

# Create the breaks where the difference is greater than the selected value
breaks <- df[df$test > difference_max, ]
breaks$test <- "here"
df <- rbind(breaks, df)
df <- df[order(df$day_after_reopening, df$test), ]

# Create the split points and group
df <- df %>%
  mutate(split_point = test < "here",
         breaks = with(rle(split_point), rep(seq_along(lengths), lengths))) %>%
  filter(split_point) %>%
  group_by(breaks) %>%
  summarise(day_after_reopening_mean = mean(day_after_reopening))
> df
# A tibble: 5 x 2
  breaks day_after_reopening_mean
   <int>                    <dbl>
1      1                     50.5
2      3                     56
3      5                     68
4      7                    100.
5      9                    108.
Ok, then it sounds like you'll first want to get the max number of days so you know how far out to go. The code for a new dataframe could be something like below (I've never used cut(), so I wouldn't know how to do it a bit more automatically):
df2 <- df1 %>%
  mutate(day_after_grp =
           case_when(day_after_reopening >= 0 & day_after_reopening <= 6 ~ "0-6",
                     day_after_reopening >= 7 & day_after_reopening <= 13 ~ "7-13",
                     day_after_reopening >= 14 & day_after_reopening <= 20 ~ "14-20"
                     # ... and so on, up to the max
                     ))
You'd then have a new variable day_after_grp to use for groupings.
Again, there may be a more programmatic way to do it with less copy/paste; see the cut() sketch below.
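Following up on that thought, here is a minimal sketch of the "more programmatic" version using cut(). The 7-day bin width and the df1/day_after_reopening names are carried over from the snippet above; converting the difftime column to a plain number first is an assumption about the data:

library(dplyr)

df1 %>%
  mutate(days = as.numeric(day_after_reopening), # difftime -> plain number of days
         day_after_grp = cut(days,
                             breaks = seq(0, max(days, na.rm = TRUE) + 7, by = 7),
                             right = FALSE)) # bins of [0,7), [7,14), ... i.e. 0-6, 7-13, ...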

How to compare technical duplicates on separate rows in R?

I would like to compare the mean, sd, and percentage CV of two technical duplicates in R.
Currently my data frame looks like this:
library(tidyverse)
data <- tribble(
  ~rowname, ~Sample, ~Phagocytic_Score,
  1, 1232, 24030,
  2, 1232, 11040,
  3, 4321, 7266,
  4, 4321, 4096,
  5, 5631, 7383,
  6, 5631, 21507
)
Created on 2019-10-22 by the reprex package (v0.3.0)
So I would want to compare the values from rows 1 and 2 together, 3 and 4, and so on, ideally with this being stored in a new data frame containing just the average score and stats, if that makes sense.
Sorry, I'm quite new to R, so apologies if this is really straightforward.
Thanks! Mari
summarize() can give you exactly this, especially if all the stats you want are computed within groups defined by one variable, i.e. Sample:
library(raster)
#> Loading required package: sp
library(tidyverse)
data <- tribble(
  ~rowname, ~Sample, ~Phagocytic_Score,
  1, 1232, 24030,
  2, 1232, 11040,
  3, 4321, 7266,
  4, 4321, 4096,
  5, 5631, 7383,
  6, 5631, 21507
)
data %>%
  group_by(Sample) %>%
  summarize(
    mean = mean(Phagocytic_Score),
    sd = sd(Phagocytic_Score),
    pct_cv = cv(Phagocytic_Score)
  )
#> # A tibble: 3 x 4
#>   Sample  mean    sd pct_cv
#>    <dbl> <dbl> <dbl>  <dbl>
#> 1   1232 17535 9185.   52.4
#> 2   4321  5681 2242.   39.5
#> 3   5631 14445 9987.   69.1
We've got some repeating going on, though, don't we? Each variable is defined as a function call with the same input variable. summarize_at() is more appropriate, then:
data %>%
  group_by(Sample) %>%
  summarize_at("Phagocytic_Score",
               list(mean = mean, sd = sd, cv = cv))
#> # A tibble: 3 x 4
#>   Sample  mean    sd    cv
#>    <dbl> <dbl> <dbl> <dbl>
#> 1   1232 17535 9185.  52.4
#> 2   4321  5681 2242.  39.5
#> 3   5631 14445 9987.  69.1
Ah, but there's still some more room for improvement. Why are we repeating the names of the functions as the names of the variables, since they're the same? Well, mget() will take a single vector of the function names we want, and return a named list of those functions, with the names as those function names:
data %>%
  group_by(Sample) %>%
  summarize_at("Phagocytic_Score",
               mget(c("mean", "sd", "cv"), inherits = TRUE))
#> # A tibble: 3 x 4
#>   Sample  mean    sd    cv
#>    <dbl> <dbl> <dbl> <dbl>
#> 1   1232 17535 9185.  52.4
#> 2   4321  5681 2242.  39.5
#> 3   5631 14445 9987.  69.1
Note we need inherits = TRUE for the reason explained here.
Created on 2019-10-22 by the reprex package (v0.3.0)
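As an aside, in current dplyr (1.0 and later), across() supersedes summarize_at(). A minimal sketch of the same summary on the data tibble above, where .names = "{.fn}" keeps the bare function names as column names:

library(dplyr)
library(raster) # for cv()

data %>%
  group_by(Sample) %>%
  summarize(across(Phagocytic_Score,
                   list(mean = mean, sd = sd, cv = cv),
                   .names = "{.fn}"))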
If I'm understanding your question, you are looking to summarize your dataframe by grouping based on one of the columns. I assume that in your real data you don't always have exactly two observations of each of your samples.
This approach uses the tidyverse packages; there are other ways to accomplish the same thing:
library(tidyverse)
df %>% # name of your data frame
  group_by(Sample) %>% # puts all observations with the same value of Sample into groups for subsequent analysis
  summarize(Mean = mean(Phagocytic_Score),
            SD = sd(Phagocytic_Score),
            PercentCV = SD / Mean * 100) # CV as a percentage, using the SD and Mean just calculated for each group
