How to compare technical duplicates on separate rows in R?

I would like to compare the mean, sd, and percentage CV of two technical duplicates in R.
Currently my data frame looks like this:
library(tidyverse)
data <- tribble(
  ~rowname, ~Sample, ~Phagocytic_Score,
  1, 1232, 24030,
  2, 1232, 11040,
  3, 4321, 7266,
  4, 4321, 4096,
  5, 5631, 7383,
  6, 5631, 21507
)
Created on 2019-10-22 by the reprex package (v0.3.0)
So I would want to compare the values from rows 1 and 2 together, 3 and 4, and so on, with the results ideally stored in a new data frame holding just the average score and stats, if that makes sense.
I'm quite new to R, so apologies if this is really straightforward.
Thanks! Mari

summarize() can give you exactly this, especially since all the stats you want are computed within groups defined by a single variable, Sample:
library(raster)
#> Loading required package: sp
library(tidyverse)
data <- tribble(
  ~rowname, ~Sample, ~Phagocytic_Score,
  1, 1232, 24030,
  2, 1232, 11040,
  3, 4321, 7266,
  4, 4321, 4096,
  5, 5631, 7383,
  6, 5631, 21507
)
data %>%
  group_by(Sample) %>%
  summarize(
    mean = mean(Phagocytic_Score),
    sd = sd(Phagocytic_Score),
    pct_cv = cv(Phagocytic_Score)
  )
#> # A tibble: 3 x 4
#> Sample mean sd pct_cv
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1232 17535 9185. 52.4
#> 2 4321 5681 2242. 39.5
#> 3 5631 14445 9987. 69.1
We've got some repeating going on, though, don't we? Each variable is defined as a function call with the same input variable. summarize_at() is more appropriate, then:
data %>%
  group_by(Sample) %>%
  summarize_at("Phagocytic_Score",
               list(mean = mean, sd = sd, cv = cv))
#> # A tibble: 3 x 4
#> Sample mean sd cv
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1232 17535 9185. 52.4
#> 2 4321 5681 2242. 39.5
#> 3 5631 14445 9987. 69.1
Ah, but there's still some more room for improvement. Why are we repeating the names of the functions as the names of the variables, since they're the same? Well, mget() will take a single vector of the function names we want, and return a named list of those functions, with the names as those function names:
data %>%
  group_by(Sample) %>%
  summarize_at("Phagocytic_Score",
               mget(c("mean", "sd", "cv"), inherits = TRUE))
#> # A tibble: 3 x 4
#> Sample mean sd cv
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1232 17535 9185. 52.4
#> 2 4321 5681 2242. 39.5
#> 3 5631 14445 9987. 69.1
Note we need inherits = TRUE because mget() searches only the calling environment by default, while these functions are defined further up the search path in the attached packages.
Created on 2019-10-22 by the reprex package (v0.3.0)
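For what it's worth, on dplyr 1.0.0 or later, across() supersedes summarize_at(); here is a minimal sketch of the same computation (my addition, not part of the original answer; note that with a list of functions, across() names the output columns Phagocytic_Score_mean and so on by default):
data %>%
  group_by(Sample) %>%
  summarize(across(Phagocytic_Score,
                   list(mean = mean, sd = sd, cv = cv)))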

If I'm understanding your question, you are looking to summarize your dataframe by grouping based on one of the columns. I assume that in your real data you don't always have exactly two observations of each of your samples.
This approach uses the tidyverse packages; there are other ways to accomplish the same thing.
library(tidyverse)
df %>% # name of your data frame
  group_by(Sample) %>% # puts all observations sharing a value of Sample into groups for subsequent analysis
  summarize(Mean = mean(Phagocytic_Score),
            SD = sd(Phagocytic_Score),
            PercentCV = 100 * SD / Mean # percent CV from the sd and mean just calculated for each group
  )
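If the number of replicates varies in your real data, it can also help to carry the group size along; a small sketch extending the code above (the n column is my addition):
df %>%
  group_by(Sample) %>%
  summarize(n = n(), # number of technical replicates in each group
            Mean = mean(Phagocytic_Score),
            SD = sd(Phagocytic_Score),
            PercentCV = 100 * SD / Mean)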

Related

pivot_longer and pivot_wider syntax

I want to ask for ideas on creating pivot_longer syntax for the data shown here.
I've already tried researching on the internet, but I can't seem to find any examples similar to my data, where there is a Metric column that is also separated into three different columns of months.
My desired final output is to have seven columns (regions, months, and the five metrics).
How do I formulate the pivot_longer and pivot_wider syntax to clean my data so that I can visualize it?
The tricky part isn't pivot_longer. You first have to clean your Excel spreadsheet, i.e. get rid of empty rows and merge the two header rows containing the names of the variables and the dates.
One approach to achieve your desired result may look like so:
library(readxl)
library(tidyr)
library(janitor)
library(dplyr)
x <- read_excel("data/Employment.xlsx", skip = 3, col_names = FALSE) %>%
  # Get rid of empty rows and cols
  janitor::remove_empty()
# Make column names
col_names <- data.frame(t(x[1:2, ])) %>%
  fill(1) %>%
  unite(name, 1:2, na.rm = TRUE) %>%
  pull(name)
x <- x[-c(1:2), ]
names(x) <- col_names
# Convert to long and values to numerics
x %>%
  pivot_longer(-Region, names_to = c(".value", "months"), names_sep = "_") %>%
  separate(months, into = c("month", "year")) %>%
  mutate(across(!c(Region, month, year), as.numeric))
#> # A tibble: 6 × 8
#> Region month year `Total Population … `Labor Force Part… `Employment Rat…
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Philippin… April 2020f 73722. 55.7 82.4
#> 2 Philippin… Janu… 2021p 74733. 60.5 91.3
#> 3 Philippin… April 2021p 74971. 63.2 91.3
#> 4 National … April 2020f 9944. 54.2 87.7
#> 5 National … Janu… 2021p 10051. 57.2 91.2
#> 6 National … April 2021p 10084. 60.1 85.6
#> # … with 2 more variables: Unemployment Rate <dbl>, Underemployment Rate <dbl>
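For reference, the names_to = c(".value", "months") pattern is doing the heavy lifting here: ".value" keeps the part of each column name before names_sep as an output column, and the remainder becomes data in the months column. A minimal self-contained sketch with made-up column names:
library(tidyr)
toy <- tibble::tribble(
  ~Region, ~Rate_Jan, ~Rate_Feb, ~Count_Jan, ~Count_Feb,
  "A",           1.5,       1.7,         10,         12
)
# "Rate" and "Count" stay as columns; "Jan"/"Feb" move into the months column
toy %>%
  pivot_longer(-Region, names_to = c(".value", "months"), names_sep = "_")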

Calculating distance between all locations to first location, by group

I have GPS locations from several seabird tracks, each starting from colony x. Therefore the individual tracks all have similar first locations. For each track, I would like to calculate the beeline distance between each GPS location and either (a) a specified location that represents the location of colony x, or (b) the first GPS point of a given track which represents the location of colony x. For (b), I would look to use the first location of each new track ID (track_id).
I have looked for appropriate functions in geosphere, sp, raster, adehabitatLT, move, ... and just cannot seem to find what I am looking for.
I can calculate the distance between successive GPS points, but that is not what I need.
library(dplyr)
library(geosphere) # for distVincentyEllipsoid() and distHaversine()
df %>%
  group_by(ID) %>%
  mutate(lat_prev = lag(Lat, 1), lon_prev = lag(Lon, 1)) %>%
  mutate(dist = distVincentyEllipsoid(matrix(c(lon_prev, lat_prev), ncol = 2), # or use distHaversine
                                      matrix(c(Lon, Lat), ncol = 2)))
#example data:
df <- data.frame(Lon = c(-96.8, -96.60861, -96.86875, -96.14351, -92.82518, -90.86053, -90.14208, -84.64081, -83.7, -82, -80, -88.52732, -94.46049,-94.30, -88.60, -80.50, -81.70, -83.90, -84.60, -90.10, -90.80, -92.70, -96.10, -96.55, -96.50, -96.00),
Lat = c(25.38657, 25.90644, 26.57339, 27.63348, 29.03572, 28.16380, 28.21235, 26.71302, 25.12554, 24.50031, 24.89052, 30.16034, 29.34550, 29.34550, 30.16034, 24.89052, 24.50031, 25.12554, 26.71302, 28.21235, 28.16380, 29.03572, 27.63348, 26.57339, 25.80000, 25.30000),
ID = c(rep("ID1", 13), rep("ID2", 13)))
Grateful for any pointers.
You were pretty close. The key is that you want to calculate the distance from the first observation in each track. Therefore you first need to record each observation's order within its track (easy to do with dplyr::row_number()). Then, for the distance calculation, make the reference observation always the first one by subsetting with order == 1.
library(tidyverse)
library(geosphere)
df <- data.frame(Lon = c(-96.8, -96.60861, -96.86875, -96.14351, -92.82518, -90.86053, -90.14208, -84.64081, -83.7, -82, -80, -88.52732, -94.46049,-94.30, -88.60, -80.50, -81.70, -83.90, -84.60, -90.10, -90.80, -92.70, -96.10, -96.55, -96.50, -96.00),
Lat = c(25.38657, 25.90644, 26.57339, 27.63348, 29.03572, 28.16380, 28.21235, 26.71302, 25.12554, 24.50031, 24.89052, 30.16034, 29.34550, 29.34550, 30.16034, 24.89052, 24.50031, 25.12554, 26.71302, 28.21235, 28.16380, 29.03572, 27.63348, 26.57339, 25.80000, 25.30000),
ID = c(rep("ID1", 13), rep("ID2", 13)))
df %>%
  group_by(ID) %>%
  mutate(order = row_number()) %>%
  mutate(dist = distVincentyEllipsoid(matrix(c(Lon[order == 1], Lat[order == 1]), ncol = 2),
                                      matrix(c(Lon, Lat), ncol = 2)))
#> # A tibble: 26 x 5
#> # Groups: ID [2]
#> Lon Lat ID order dist
#> <dbl> <dbl> <chr> <int> <dbl>
#> 1 -96.8 25.4 ID1 1 0
#> 2 -96.6 25.9 ID1 2 60714.
#> 3 -96.9 26.6 ID1 3 131665.
#> 4 -96.1 27.6 ID1 4 257404.
#> 5 -92.8 29.0 ID1 5 564320.
#> 6 -90.9 28.2 ID1 6 665898.
#> 7 -90.1 28.2 ID1 7 732131.
#> 8 -84.6 26.7 ID1 8 1225193.
#> 9 -83.7 25.1 ID1 9 1319482.
#> 10 -82 24.5 ID1 10 1497199.
#> # ... with 16 more rows
Created on 2022-01-09 by the reprex package (v2.0.1)
This also seems to work (sent to me by a friend). It is very similar to Dan's suggestion above, but slightly different:
library(geosphere)
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Dist_to_col = distHaversine(c(Lon[1], Lat[1]), cbind(Lon, Lat)))
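For option (a), a fixed colony location, the grouping is not even needed; a minimal sketch with a made-up colony coordinate (in lon/lat order, as geosphere expects):
library(geosphere)
library(dplyr)
colony <- c(-96.8, 25.38657) # hypothetical colony location (Lon, Lat)
df %>%
  mutate(dist_to_colony = distHaversine(colony, cbind(Lon, Lat)))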

How do I calculate the average with a certain condition in R?

I've been trying to calculate the average of a column in a data frame subject to a condition, and to plot it in a graph, but so far I can only get the average of the whole column with mean(df$Age).
What I'm trying to get is the average age of employees in Vancouver but I'm not sure how to do it so I can't plot it out.
To get the average for a specific city, you can subset that city's rows and take the mean:
result <- mean(df$Age[df$CityName == 'Vancouver'], na.rm = TRUE)
Alternatively, group by city and summarise to get the mean age of every city at once:
library(tidyverse)
tribble(
  ~Age, ~City,
  61, "Vancouver",
  58, "Vancouver",
  48, "Terrace",
  48, "Terrace"
) %>%
  group_by(City) %>%
  summarise(Age = mean(Age))
#> # A tibble: 2 x 2
#> City Age
#> <chr> <dbl>
#> 1 Terrace 48
#> 2 Vancouver 59.5
Created on 2021-11-12 by the reprex package (v2.0.1)
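Since you mentioned wanting to plot the result, here is a minimal sketch with ggplot2, assuming a data frame df with Age and CityName columns as in the base R line above:
library(dplyr)
library(ggplot2)
city_means <- df %>%
  group_by(CityName) %>%
  summarise(MeanAge = mean(Age, na.rm = TRUE))
ggplot(city_means, aes(x = CityName, y = MeanAge)) +
  geom_col() + # one bar per city
  labs(x = "City", y = "Average age")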

Create dataframe with correlation and p-value by group?

I am trying to correlate several variables according to a specific group (COUNTY) in R. Although I am able to successfully find the correlation for each column through this method, I can't seem to find a way to save the p-value to the table for each group. Any suggestions?
Example Data:
crops <- data.frame(
  COUNTY = sample(37001:37900),
  CropYield = sample(c(1:100), 10, replace = TRUE),
  MaxTemp = sample(c(40:80), 10, replace = TRUE),
  precip = sample(c(0:10), 10, replace = TRUE),
  ColdDays = sample(c(1:73), 10, replace = TRUE))
Example Code:
crops %>%
  group_by(COUNTY) %>%
  do(data.frame(Cor = t(cor(.[, 2:5], .[, 2]))))
^This gives me the correlation for each column, but I need to know the p-value for each one as well. Ideally the final output would look like this:
Desired Output
You only have one observation per COUNTY, so it will not work. I set up more observations per COUNTY:
set.seed(111)
crops <- data.frame(
  COUNTY = sample(37001:37002, 10, replace = TRUE),
  CropYield = sample(c(1:100), 10, replace = TRUE),
  MaxTemp = sample(c(40:80), 10, replace = TRUE),
  precip = sample(c(0:10), 10, replace = TRUE),
  ColdDays = sample(c(1:73), 10, replace = TRUE))
I think you need to convert to long format and run a cor.test per COUNTY and variable:
library(dplyr)
library(tidyr)
calcor <- function(da) {
  data.frame(cor.test(da$CropYield, da$value)[c("estimate", "p.value")])
}
crops %>%
  pivot_longer(-c(COUNTY, CropYield)) %>%
  group_by(COUNTY, name) %>%
  do(calcor(.))
# A tibble: 6 x 4
# Groups: COUNTY, name [6]
COUNTY name estimate p.value
<int> <chr> <dbl> <dbl>
1 37001 ColdDays 0.466 0.292
2 37001 MaxTemp -0.225 0.628
3 37001 precip -0.356 0.433
4 37002 ColdDays 0.888 0.304
5 37002 MaxTemp 0.941 0.220
6 37002 precip -0.489 0.674
The above gives you the correlation of every variable against crop yield, for every county. Now it's a matter of converting it into wide format:
crops %>%
  pivot_longer(-c(COUNTY, CropYield)) %>%
  group_by(COUNTY, name) %>%
  do(calcor(.)) %>%
  pivot_wider(values_from = c(estimate, p.value), names_from = name)
COUNTY estimate_ColdDa… estimate_MaxTemp estimate_precip p.value_ColdDays
<int> <dbl> <dbl> <dbl> <dbl>
1 37001 0.466 -0.225 -0.356 0.292
2 37002 0.888 0.941 -0.489 0.304
# … with 2 more variables: p.value_MaxTemp <dbl>, p.value_precip <dbl>
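An alternative to writing your own helper is broom::tidy(), which turns a cor.test result into a one-row data frame with estimate and p.value columns (among others); a sketch of the same grouped computation, assuming the broom package is installed:
library(dplyr)
library(tidyr)
library(broom)
crops %>%
  pivot_longer(-c(COUNTY, CropYield)) %>%
  group_by(COUNTY, name) %>%
  group_modify(~ tidy(cor.test(.x$CropYield, .x$value))) %>% # estimate, p.value, and more per group
  select(name, estimate, p.value) %>%
  ungroup() %>%
  pivot_wider(values_from = c(estimate, p.value), names_from = name)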

Hierarchical Forecasting problem generating the hts object

I want to do hierarchical forecasting as described in Hyndman Forecasting's book in chapter 10: https://otexts.com/fpp2/
My problem is that to generate this type of forecast (specifically the bottom-up approach) I need to build an hts object, which is created from a matrix. For example:
If I have a data frame like this:
Image of an example of data frame prior to hts object
I need to convert it to a matrix like this:
Image of Matrix that I need
For this matrix, every row is a unit of time (it could be days, months, etc.).
My problem is that my data frame looks like this:
Image of Problem with dataframe
One column is the date and the others are the categories by which I need to forecast the sales. The problem is this: for supermarket = 4, id_product = 187, and id_label = a, the system registers movements on days 21 and 23 but nothing on day 22, which means I need a row with sales = 0 on that day, in other words a row like this:
Image of Row missing
How can I generate the matrix needed to create the hts object? Do I need to create the missing rows with 0? (I have thousands of missing rows, so it would be a nightmare to do it by hand)
Here is a reproducible example:
date <- c("2019-03-22", "2019-03-23", "2019-04-24", "2019-03-25")
id_supermarket <- c(4, 4, 2, 2)
id_product <- c(187, 187, 189, 190)
id_label <- c("a", "a", "c", "d")
sales <- c(21, 22, 23, 24)
# data.frame() keeps the column types; as.data.frame(cbind(...)) would coerce everything to character
df <- data.frame(date, id_supermarket, id_product, id_label, sales)
Thanks in advance.
I recommend you use the fable package instead of hts. It is more recent and much easier to use. Here is an example with your data.
library(tsibble)
library(fable)
# Create tsibble
df <- tibble(
  date = lubridate::ymd(c("2019-03-22", "2019-03-23", "2019-03-24", "2019-03-25")),
  id_supermarket = as.character(c(4, 4, 2, 2)),
  id_product = c(187, 187, 189, 190),
  id_label = c("a", "a", "c", "d"),
  sales = c(21, 22, 23, 24)
) %>%
  as_tsibble(index = date, key = c(id_supermarket, id_product, id_label)) %>%
  fill_gaps(.full = TRUE)
# Forecast with reconciliation
fc <- df %>%
  aggregate_key(id_supermarket * id_label, sales = sum(sales, na.rm = TRUE)) %>%
  model(
    arima = ARIMA(sales)
  ) %>%
  reconcile(
    arima = min_trace(arima)
  ) %>%
  forecast(h = "5 days")
fc
#> # A fable: 45 x 6 [1D]
#> # Key: id_supermarket, id_label, .model [9]
#> id_supermarket id_label .model date sales .distribution
#> <chr> <chr> <chr> <date> <dbl> <dist>
#> 1 2 c arima 2019-03-26 5.82 N(5.8, 44)
#> 2 2 c arima 2019-03-27 5.82 N(5.8, 44)
#> 3 2 c arima 2019-03-28 5.82 N(5.8, 44)
#> 4 2 c arima 2019-03-29 5.82 N(5.8, 44)
#> 5 2 c arima 2019-03-30 5.82 N(5.8, 44)
#> 6 2 d arima 2019-03-26 6.34 N(6.3, 46)
#> 7 2 d arima 2019-03-27 6.34 N(6.3, 46)
#> 8 2 d arima 2019-03-28 6.34 N(6.3, 46)
#> 9 2 d arima 2019-03-29 6.34 N(6.3, 46)
#> 10 2 d arima 2019-03-30 6.34 N(6.3, 46)
#> # … with 35 more rows
Created on 2020-02-01 by the reprex package (v0.3.0)
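On the "do I need to create the missing rows with 0?" part: fill_gaps() creates the missing rows for you, filling measured variables with NA by default, but it also accepts name-value pairs, so the gaps can be made explicit zeros up front instead of relying on na.rm = TRUE later. A minimal sketch (df_raw stands for your tibble before the as_tsibble() call above):
library(tsibble)
df_raw %>%
  as_tsibble(index = date, key = c(id_supermarket, id_product, id_label)) %>%
  fill_gaps(sales = 0, .full = TRUE) # implicit gaps become explicit rows with sales = 0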
