mutate to merge minimum value from different df in R

I have two datasets: One of species in my study and how many times I observed them, and another larger dataset that is a broader database of observations.
I want to mutate a column in my short dataset for "lowest latitude observed" (or highest or mean) from values in the other dataset but I can't quite figure out how to match them in a mutate.
set.seed(1)
# my dataset. sightings isn't important for this, just important that the solution doesn't mess up existing columns.
fake_spp_df <- data.frame(
  species = c("a", "b", "c", "d", "e"),
  sightings = c(5, 1, 2, 6, 3)
)
# broader occurrence dataset
fake_spp_occurrences <- data.frame(
  species = rep(c("a", "b", "c", "d", "f"), each = 20), # notice spp "f" - not all species are the same between datasets
  latitude = runif(100, min = 0, max = 80),
  longitude = runif(100, min = -90, max = -55)
)
# so I know that to find one species' min, I could do this:
min(fake_spp_occurrences$latitude[fake_spp_occurrences$species == "a"])
# but I want to do that in a mutate()
# this was my failed attempt:
fake_spp_df %>%
  mutate(lowest_lat = min(fake_spp_occurrences$latitude[fake_spp_occurrences$species == species]))
Desired result:
> fake_spp_df
  species sightings lowest_lat max_lat median_lat
1       a         5        1.7  etc...
2       b         1        5.3
3       c         2        2.2
4       d         6        4.3
5       e         3         NA
Thinking this could also be done with some kind of join or merge, but I'm not sure.
Thanks!

Summarise the fake_spp_occurrences dataset and then perform the join.
library(dplyr)

fake_spp_occurrences %>%
  group_by(species) %>%
  summarise(lowest_lat = min(latitude),
            max_lat = max(latitude),
            median_lat = median(latitude)) %>%
  right_join(fake_spp_df, by = 'species')

#  species lowest_lat max_lat median_lat sightings
#  <chr>        <dbl>   <dbl>      <dbl>     <dbl>
#1 a             4.94    79.4       48.1         5
#2 b             1.07    74.8       35.7         1
#3 c             1.87    68.9       41.9         2
#4 d             6.74    76.8       38.2         6
#5 e            NA       NA         NA           3
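If you prefer to keep the columns of fake_spp_df in their original order (as in the desired result), you can flip the join direction; a minimal variant of the same idea, not from the original answer:

fake_spp_df %>%
  left_join(
    fake_spp_occurrences %>%
      group_by(species) %>%
      summarise(lowest_lat = min(latitude),
                max_lat = max(latitude),
                median_lat = median(latitude)),
    by = "species"
  )
# species "e" has no occurrences, so its summary columns come through as NA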

Conditionally join from multiple tables based on a column?

What I'm trying to accomplish is basically the equivalent of a vlookup in Excel with an if statement to determine which table to use based on the value of a given column.
Main dataset looks like this:
#STATE CODE AMOUNT
# NJ 1 88
# DE 2 75
# VA 1 24
# PA 1 32
Then there are a handful of other tables that I need to use for the lookups to add the factor column, depending on the state - some states are unique, and the rest all use a common table. For example (the actual tables are much longer than this):
NJ:
#CODE FACTOR
#  1    0.75
#  2    0.90
PA:
#CODE FACTOR
#  1    0.80
#  2    0.95
All Other:
#CODE FACTOR
#  1    0.82
#  2    0.93
So the final output would be:
#STATE CODE AMOUNT FACTOR
# NJ 1 88 0.75
# DE 2 75 0.93
# VA 1 24 0.82
# PA 1 32 0.80
Is there a way to conditionally join/lookup from the various factor tables depending on the value of State, in this example? Or would I need to combine the factor tables into a single table and explicitly list every state/factor combination and then join based on both State and Code? Thanks for your help.
Instead of having a different dataframe for each state, it is easier if all of the lookup data is present in one dataframe.
Here is one approach -
library(dplyr)

combined_states <- bind_rows(lst(NJ, PA, other), .id = "STATE")

main %>%
  mutate(STATE_temp = replace(STATE,
                              !STATE %in% unique(combined_states$STATE), 'other')) %>%
  left_join(combined_states, by = c('STATE_temp' = 'STATE', 'CODE')) %>%
  select(-STATE_temp)
# STATE CODE AMOUNT FACTOR
# <chr> <dbl> <dbl> <dbl>
#1 NJ 1 88 0.75
#2 DE 2 75 0.93
#3 VA 1 24 0.82
#4 PA 1 32 0.8
Note that the name of the "other" dataframe must match the replacement value used for STATE_temp ('other' here).
Data:
main <- tibble(STATE = c('NJ', 'DE', 'VA', 'PA'),
               CODE = c(1, 2, 1, 1),
               AMOUNT = c(88, 75, 24, 32))
NJ <- tibble(CODE = c(1, 2), FACTOR = c(0.75, 0.9))
PA <- tibble(CODE = c(1, 2), FACTOR = c(0.8, 0.95))
other <- tibble(CODE = c(1, 2), FACTOR = c(0.82, 0.93))
Make a list of all data frames in your global environment whose names are state abbreviations, bind them into a single data frame with a STATE column, and join that to the main dataset. Then, for anything left over, we can use rows_patch() to fill in the NA values from the "other" table.
library(dplyr)

state_tables = ls(pattern = paste(state.abb, collapse = "|")) |>
  mget() |>
  bind_rows(.id = "STATE")

main = main |>
  left_join(state_tables, by = c("STATE", "CODE")) |>
  rows_patch(other, by = "CODE")

main
# A tibble: 4 × 4
# STATE CODE AMOUNT FACTOR
# <chr> <dbl> <dbl> <dbl>
# 1 NJ 1 88 0.75
# 2 DE 2 75 0.93
# 3 VA 1 24 0.82
# 4 PA 1 32 0.8
(Using Ronak's kindly shared data)
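One caution to add (not part of the original answer): ls(pattern = ...) will also pick up any other object whose name merely contains a state abbreviation. If that is a concern, matching exact names is a safer sketch:

# only grab objects whose names are exactly a state abbreviation
state_tables <- mget(intersect(ls(), state.abb)) |>
  bind_rows(.id = "STATE")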

How to calculate mean by row for multiple groups using dplyr in R?

I have a dataframe with 4 columns Age, Location, Distance, and Value. Age and Location each have two possible values whereas Distance can have three. Value is the observed continuous variable which has been measured 3 times per Distance.
Accounting for Age and Location, I would like to calculate a mean Value for one of the Distance values, and then another mean Value with the other two Distances combined. In other words: what is the mean Value for Distance 0.5 versus Distances 1.5 & 2.5 combined, for each Age and Location?
How can I do this using dplyr?
Example Data
library(dplyr)
set.seed(123)
df1 <- data.frame(matrix(ncol = 4, nrow = 36))
x <- c("Age", "Location", "Distance", "Value")
colnames(df1) <- x
df1$Age <- rep(c(1, 2), each = 18)
df1$Location <- as.character(rep(c("Central", "North"), each = 9))
df1$Distance <- rep(c(0.5, 1.5, 2.5), each = 3)
df1$Value <- round(rnorm(36, 200, 25), 0)
Output should look something like this
Age Location Mean_0.5 Mean_1.5_and_2.5
1 1 Central 206 202
2 1 North 210 201
3 2 Central 193 186
4 2 North 202 214
We may use %in% or == to subset 'Value' based on the 'Distance' values (assuming the precision is exact) after grouping by 'Age' and 'Location'.
library(dplyr)
df1 %>%
  group_by(Age, Location) %>%
  summarise(Mean_0.5 = mean(Value[Distance == 0.5]),
            Mean_1.5_and_2.5 = mean(Value[Distance %in% c(1.5, 2.5)]),
            .groups = 'drop')
Output:
# A tibble: 4 × 4
Age Location Mean_0.5 Mean_1.5_and_2.5
<dbl> <chr> <dbl> <dbl>
1 1 Central 206. 202.
2 1 North 210. 201.
3 2 Central 193 186.
4 2 North 202. 214.

linear interpolation (approx) by group in a dplyr pipe in R

I have a question that I find kind of hard to explain with an MRE and in an easy
way to answer, mostly because I don't fully understand where the problem lies
myself. So that's my sorry-for-being-vague preamble.
I have a tibble with many sample and reference measurements, for which I want
to do some linear interpolation for each sample. I do this now by taking out
all the reference measurements, rescaling them to sample measurements using
approx, and then patching it back in. But because I take it out first, I
cannot do it nicely in a group_by dplyr pipe. Right now I do it with a really
ugly workaround where I add empty (NA) newly created columns to the sample
tibble, then fill them in with a for-loop.
So my question is really: how can I implement the approx part within groups
into the pipe, so that I can do everything within groups? I've experimented
with dplyr::do(), and ran into the vignette on "programming with dplyr", but
searching mostly gives me broom::augment and lm stuff that I think operates
differently... (e.g. see
Using approx() with groups in dplyr). This thread also seems promising: How do you use approx() inside of mutate_at()?
Somebody on irc recommended using a conditional mutate, with case_when, but I
don't fully understand where and how within this context yet.
I think the problem lies in the fact that I want to filter out part of the data
for the following mutate operations, but the mutate operations rely on the
grouped data that I just filtered out, if that makes any sense.
Here's an MWE:
library(tidyverse) # or just dplyr, tibble

# create fake data
data <- data.frame(
  # in reality a dttm with the measurement time
  timestamp = c(rep("a", 7), rep("b", 7), rep("c", 7)),
  # measurement cycle, normally 40 for sample, 41 for reference
  cycle = rep(c(rep(1:3, 2), 4), 3),
  # whether the measurement is a reference or a sample
  isref = rep(c(rep(FALSE, 3), rep(TRUE, 4)), 3),
  # measurement intensity for mass 44
  r44 = c(28:26, 30:26, 36, 33, 31, 38, 34, 33, 31, 18, 16, 15, 19, 18, 17)) %>%
  # measurement intensity for mass 45, normally also masses up to mass 49
  mutate(r45 = r44 + rnorm(21, 20))
# of course this could be tidied up to "intensity" with a new column "mass"
# (44, 45, ...), but that would make making comparisons even harder...
# overview plot
data %>%
  ggplot(aes(x = cycle, y = r44, colour = isref)) +
  geom_line() +
  geom_line(aes(y = r45), linetype = 2) +
  geom_point() +
  geom_point(aes(y = r45), shape = 1) +
  facet_grid(~ timestamp)
# what I would like to do
data %>%
  group_by(timestamp) %>%
  do(target_cycle = approx(x = data %>% filter(isref) %>% pull(r44),
                           y = data %>% filter(isref) %>% pull(cycle),
                           xout = data %>% filter(!isref) %>% pull(r44))$y) %>%
  unnest()
# immediately append this new column to the original dataframe for all the
# samples (!isref) and then apply another approx for those values.
# here's my current attempt for one of the timestamps
matchref <- function(dat) {
  # split the data into sample gas and reference gas
  ref <- filter(dat, isref)
  smp <- filter(dat, !isref)
  # calculate the "target cycle", the points at which the reference intensity
  # 44 matches the sample intensity 44 with linear interpolation
  target_cycle <- approx(x = ref$r44,
                         y = ref$cycle, xout = smp$r44)
  # append the target cycle to the sample gas
  smp <- smp %>%
    group_by(timestamp) %>%
    mutate(target = target_cycle$y)
  # linearly interpolate each reference gas to the target cycle
  ref <- ref %>%
    group_by(timestamp) %>%
    # this is needed because the reference has one more cycle
    mutate(target = c(target_cycle$y, NA)) %>%
    # filter out all the failed ones (no interpolation possible)
    filter(!is.na(target)) %>%
    # calculate interpolated value based on r44 interpolation (i.e., don't
    # actually interpolate this value but shift it based on the 44
    # interpolation)
    mutate(r44 = approx(x = cycle, y = r44, xout = target)$y,
           r45 = approx(x = cycle, y = r45, xout = target)$y) %>%
    select(timestamp, target, r44:r45)
  # add the new reference gas intensities to the correct sample gases by the
  # target cycle (join by "timestamp", the column that actually exists)
  left_join(smp, ref, by = c("timestamp", "target"))
}
matchref(data)
# and because now "target" must be length 3 (the group size) or one, not 9,
# I have to create this ugly for-loop,
# for which I create a copy of data that has the new columns to be created
mr <- data %>%
  # filter the sample gasses (since we convert ref to sample)
  filter(!isref) %>%
  # add empty new columns
  mutate(target = NA, r44 = NA, r45 = NA)
# apply matchref for each group timestamp
for (grp in unique(data$timestamp)) {
  mr[mr$timestamp == grp, ] <- matchref(data %>% filter(timestamp == grp))
}
Here's one approach that spreads the references and samples to new columns. I drop r45 for simplicity in this example.
data %>%
  select(-r45) %>%
  mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
  spread(isref, r44) %>%
  group_by(timestamp) %>%
  mutate(target_cycle = approx(x = REF, y = cycle, xout = SAMP)$y) %>%
  ungroup()
gives,
# timestamp cycle REF SAMP target_cycle
# <fct> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 30 28 3
# 2 a 2 29 27 4
# 3 a 3 28 26 NA
# 4 a 4 27 NA NA
# 5 b 1 31 26 NA
# 6 b 2 38 36 2.5
# 7 b 3 34 33 4
# 8 b 4 33 NA NA
# 9 c 1 15 31 NA
# 10 c 2 19 18 3
# 11 c 3 18 16 2.5
# 12 c 4 17 NA NA
Edit: to also retain r45, you can use a gather-unite-spread approach like this:
data %>%
  mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
  gather(r, value, r44:r45) %>%
  unite(ru, r, isref, sep = "_") %>%
  spread(ru, value) %>%
  group_by(timestamp) %>%
  mutate(target_cycle_r44 = approx(x = r44_REF, y = cycle, xout = r44_SAMP)$y) %>%
  ungroup()
giving,
# # A tibble: 12 x 7
# timestamp cycle r44_REF r44_SAMP r45_REF r45_SAMP target_cycle_r44
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 30 28 49.5 47.2 3
# 2 a 2 29 27 48.8 48.7 4
# 3 a 3 28 26 47.2 46.8 NA
# 4 a 4 27 NA 47.9 NA NA
# 5 b 1 31 26 51.4 45.7 NA
# 6 b 2 38 36 57.5 55.9 2.5
# 7 b 3 34 33 54.3 52.4 4
# 8 b 4 33 NA 52.0 NA NA
# 9 c 1 15 31 36.0 51.7 NA
# 10 c 2 19 18 39.1 37.9 3
# 11 c 3 18 16 39.2 35.3 2.5
# 12 c 4 17 NA 39.0 NA NA
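Note (an addition, not part of the original answer): gather/unite/spread are superseded in current tidyr. Assuming tidyr >= 1.0 is available, an equivalent sketch with pivot_longer/pivot_wider might look like this:

data %>%
  mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
  pivot_longer(r44:r45, names_to = "r", values_to = "value") %>%
  pivot_wider(names_from = c(r, isref), values_from = value, names_sep = "_") %>%
  group_by(timestamp) %>%
  mutate(target_cycle_r44 = approx(x = r44_REF, y = cycle, xout = r44_SAMP)$y) %>%
  ungroup()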

How to run a for loop for each group in a dataframe?

This question is similar to this one asked earlier but not quite. I would like to iterate through a large dataset (~500,000 rows) and for each unique value in one column, I would like to do some processing of all the values in another column.
Here is code that I have confirmed to work:
df = matrix(nrow = 783, ncol = 2)
counts = table(csvdata$value)
p = as.vector(counts) / length(csvdata$value)
D = 1 - sum(p^2)
The only problem with it is that it returns the value D for the entire dataset, rather than returning a separate D value for each set of rows where ID is the same.
Say I had data like this:
How would I be able to do the same thing as the code above, but return a D value for each group of rows where ID is the same, rather than for the entire dataset? I imagine this requires a loop, and creating a matrix to store all the D values in with ID in one column and the value of D in the other, but not sure.
Ok, let's work with "In short, I would like whatever is in the for loop to be executed for each block of data with a unique value of "ID"".
In general you can group rows by values in one column (e.g. "ID") and then perform some transformation based on values/entries in other columns per group. In the tidyverse this would look like this
library(tidyverse)
df %>%
  group_by(ID) %>%
  mutate(value.mean = mean(value))
## A tibble: 8 x 3
## Groups: ID [3]
# ID value value.mean
# <fct> <int> <dbl>
#1 a 13 12.6
#2 a 14 12.6
#3 a 12 12.6
#4 a 13 12.6
#5 a 11 12.6
#6 b 12 15.5
#7 b 19 15.5
#8 cc4 10 10.0
Here we calculate the mean of value per group, and add these values to every row. If instead you wanted to summarise values, i.e. keep only the summarised value(s) per group, you would use summarise instead of mutate.
library(tidyverse)
df %>%
  group_by(ID) %>%
  summarise(value.mean = mean(value))
## A tibble: 3 x 2
# ID value.mean
# <fct> <dbl>
#1 a 12.6
#2 b 15.5
#3 cc4 10.0
The same can be achieved in base R using one of tapply, ave, by. As far as I understand your problem statement there is no need for a for loop. Just apply a function (per group).
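For illustration (my addition, not part of the original answer), using the df defined under "Sample data" below, the base R equivalents might look like this:

tapply(df$value, df$ID, mean)             # named vector: one mean per ID
ave(df$value, df$ID)                      # group mean repeated on every row
by(df, df$ID, function(d) mean(d$value))  # one result object per ID group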
Sample data
df <- read.table(text =
"ID value
a 13
a 14
a 12
a 13
a 11
b 12
b 19
cc4 10", header = T)
Update
To conclude from the comments and chat, this should be what you're after.
# Sample data
set.seed(2017)
csvdata <- data.frame(
  microsat = rep(c("A", "B", "C"), each = 8),
  allele = sample(20, 3 * 8, replace = T))

csvdata %>%
  group_by(microsat) %>%
  summarise(D = 1 - sum(prop.table(table(allele))^2))
## A tibble: 3 x 2
# microsat D
# <fct> <dbl>
#1 A 0.844
#2 B 0.812
#3 C 0.812
Note that prop.table returns fractions and is shorter than your (as.vector(counts))/length(csvdata$value). Note also that you can reproduce your results for all values (irrespective of ID) if you omit the group_by line.
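A small illustration of that point (my addition):

x <- c("A", "A", "B", "C")
prop.table(table(x))            # A 0.50, B 0.25, C 0.25
as.vector(table(x)) / length(x) # the same fractions, computed by hand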
A base R option would be
df$value.mean <- with(df, ave(value, ID))

How to create a table of percentage changes in R?

I have a table df that looks like this (with many more columns and many more rows; the screenshot from the original post is not reproduced here, but a small version of the data appears as DT under "Data:" in the data.table answer below):
I want to create a new table df2 in which each value outside the Date column is replaced by the percentage change from the value for the previous date (with the top row obviously displaying NA). For example, the bottom-right cell in the new table should read -.0188.
Is there a function (base or from some package) that can do this? If so, how can I use it?
Here is a solution using some made up data and tidyverse functions. The key thing here is dplyr::lag, which lets you reference the previous value in a data frame. Note that the selection vars(-day) selects all the columns except day, and the use of funs(pct_change = ) creates new columns with the original name prepended to pct_change. Inside funs, the . refers to a value in that column, so we can directly calculate the proportional change by dividing by the previous value and subtracting 1.
library(tidyverse)
set.seed(100)

tbl <- tibble(
  day = 1:4,
  col1 = rnorm(4, mean = 10),
  col2 = rnorm(4, mean = 10)
)

tbl %>%
  mutate_at(
    .vars = vars(-day),
    .funs = funs(pct_change = (. / lag(.)) - 1)
  )
#> # A tibble: 4 x 5
#> day col1 col2 col1_pct_change col2_pct_change
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 9.50 10.1 NA NA
#> 2 2 10.1 10.3 0.0667 0.0199
#> 3 3 9.92 9.42 -0.0208 -0.0873
#> 4 4 10.9 10.7 0.0973 0.138
Created on 2018-04-19 by the reprex package (v0.2.0).
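Note (an addition, not from the original answer): mutate_at() and funs() are superseded in current dplyr. Assuming dplyr >= 1.0, an equivalent using across() might look like this:

tbl %>%
  mutate(across(-day, ~ .x / lag(.x) - 1, .names = "{.col}_pct_change"))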
A data.table-based solution can be written as:
library(data.table)
library(lubridate)

# Convert the Date column to Date type and order on that column.
# shift() is data.table's own lag, so no other package is needed for the lag.
DT[, Date := ymd(Date)][
  order(Date), c(.(Date = Date), lapply(.SD, function(x) x / shift(x) - 1)), .SDcols = 2:4]
# Date AEDUSD AUDUSD CADUSD
# 1: 2008-03-17 NA NA NA
# 2: 2008-03-18 0.0000000000 -0.021097496 -0.006626446
# 3: 2008-03-19 0.0000000000 0.009361054 0.001294305
# 4: 2008-03-20 -0.0003671072 -0.006578238 -0.009645023
# 5: 2008-03-24 0.0000000000 -0.026378637 -0.018775100
Data:
DT = data.table(Date = c("2008-03-17", "2008-03-18", "2008-03-19", "2008-03-20", "2008-03-24"),
                AEDUSD = c(0.2724, 0.2724, 0.2724, 0.2723, 0.2723),
                AUDUSD = c(0.9385, 0.9187, 0.9273, 0.9212, 0.8969),
                CADUSD = c(1.0111, 1.0044, 1.0057, 0.9960, 0.9773))
