How to merge two dataframes when column values are not exact? - r

I have:
Linearly interpolated the dFe_env data every 1 m and created a data frame (This works)
Extracted the 'Depth' (based on sinking rate) in 30 minute intervals (This works)
Created a 'Time' column where it increases every 30 minutes (This works)
How do I:
Merge two dataframes together (Bckgd_env2 and bulk_Fe2). In 'bulk_Fe2' the Depth increases by 1m and in 'Bckgd_env2' the depth increases by 0.8m. Can I get the closest 'Depth' match, extract the dFe_env at that depth and create a new data frame with Depth, Time and dFe_env all together?
library(dplyr)
Depth <- c(0, 2, 20, 50, 100, 500, 800, 1000, 1200, 1500)
dFe_env <- c(0.2, 0.2, 0.3, 0.4, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1)
bulk_Fe <- data.frame(Depth, dFe_env)
summary(bulk_Fe)
is.data.frame(bulk_Fe)
do_interp <- function(dat, Depth = seq(0,1500, by=1)) {
out <- tibble(Depth = Depth)
for (var in c("dFe_env")) {
out[[var]] <- tryCatch(approx(dat$Depth, dat[[var]], Depth)$y, error = function(e) NA_real_) # linear interpolation; NA on failure
}
out
}
bulk_Fe2 <- bulk_Fe %>% do(do_interp(.))
bulk_Fe2
summary(bulk_Fe2)
D0 <- 0 #Starting depth
T0 <- 0 #Starting time of the experiment
r <- 40 #sinking rate per day
r_30min <- r/48 #sinking speed every 30 minutes (There are 48 x 30 minute intervals in 24 hours)
days <- round(1501/(r)) #days 1501 is maximum depth
time <- days * 24 * 60 #minutes
n_steps <- 1501/r_30min
Bckgd_env2 <- data.frame(Depth =seq(from = D0, by= r_30min, length.out = n_steps + 1),
Time = seq(from = T0, by= 30, length.out = n_steps + 1))
head(Bckgd_env2)
round(Bckgd_env2, digits = 1)
Bckgd_env3 <- merge(Bckgd_env2, bulk_Fe2)
Bckgd_env3
plot(Bckgd_env3$dFe_env ~ Bckgd_env3$Depth, ylab="dFe (nmol/L)", xlab="Depth (m)", las=1)

You have already built the mechanism for interpolation, which will be useful for the join. But you didn't build it at the right depth values. It is just a matter of reorganizing your code.
Start by building Bckgd_env2, and only afterwards compute bulk_Fe2 and Bckgd_env3, so that bulk_Fe2 is interpolated at exactly the depths found in Bckgd_env2 and merge() can match every row:
bulk_Fe2 <- bulk_Fe %>% do(do_interp(., Depth=Bckgd_env2$Depth))
Bckgd_env3 <- merge(Bckgd_env2, bulk_Fe2)
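As a minimal sketch of the same idea without the do() and merge() steps, approx() can interpolate directly at the depths in Bckgd_env2 through its xout argument. The profile values come from the question; the fixed n_steps and the use of rule = 2 (which reuses the nearest measured value if floating-point rounding pushes the last depth marginally past 1500 m) are assumptions made for this illustration.

```r
# Coarse profile from the question
bulk_Fe <- data.frame(
  Depth   = c(0, 2, 20, 50, 100, 500, 800, 1000, 1200, 1500),
  dFe_env = c(0.2, 0.2, 0.3, 0.4, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1)
)

r_30min <- 40 / 48   # sinking distance per 30 minutes (m)
n_steps <- 1800      # assumed: 1500 m / (40/48 m per step)

Bckgd_env2 <- data.frame(
  Depth = seq(from = 0, by = r_30min, length.out = n_steps + 1),
  Time  = seq(from = 0, by = 30, length.out = n_steps + 1)
)

# Interpolate dFe_env at the sinking depths themselves, so no join is needed
Bckgd_env3 <- Bckgd_env2
Bckgd_env3$dFe_env <- approx(bulk_Fe$Depth, bulk_Fe$dFe_env,
                             xout = Bckgd_env2$Depth, rule = 2)$y

head(Bckgd_env3)
```

Because the interpolation happens at the exact Depth values of Bckgd_env2, the result has one row per time step and does not rely on matching floating-point keys.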

Related

Create iteration function to apply to a list of dataframes

I was wondering how to perform a recursive calculation over a list of dataframes? I have previously created a for loop that works but I would now like to convert this into a function that I can then apply to a list of dataframes. The equation that the for loop is based on is:
api(i) = k * api(i-1) + rain(i)
For loop example:
k <- 0.9 # assign k value
rain <- runif(n =50, min = 0, max = 4) # generate rainfall data
api <- numeric(length(rain)) # create api vector
api[1] <- k * rain[1] # set first api value
# for loop to calculate api
for (i in 2:length(rain)) {
api[i] <- k * api[i-1] + rain[i]
}
Example data:
library(lubridate)
date_time <-
seq(ymd_hm('2022-01-01 00:00'), ymd_hm('2022-02-01 23:45'), by = '15 mins')
precip <- runif(n = length(date_time),
min = 0,
max = 4)
df1 <- data.frame(date_time, precip)
date_time <-
seq(ymd_hm('2022-08-01 00:00'), ymd_hm('2022-09-01 23:45'), by = '15 mins')
precip <- runif(n = length(date_time),
min = 0,
max = 4)
df2 <- data.frame(date_time, precip)
list_Df <- list(df1, df2)
I would like to assign the api object to a new column in each dataframe (e.g. df$api)
Thanks for your help!
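One possible way to do this, offered as a sketch rather than a definitive answer, is to wrap the loop in a function and map it over the list with lapply(). The helper name calc_api is hypothetical, and the two small data frames below are stand-ins for df1 and df2 so that the example runs without lubridate.

```r
# Hypothetical helper: the for loop from the question wrapped in a function
calc_api <- function(rain, k = 0.9) {
  n <- length(rain)
  api <- numeric(n)
  if (n == 0) return(api)
  api[1] <- k * rain[1]                 # same starting convention as the question
  for (i in seq_len(n)[-1]) {           # i = 2, ..., n (no iterations when n == 1)
    api[i] <- k * api[i - 1] + rain[i]
  }
  api
}

# Small stand-in data frames (made up for illustration)
df1 <- data.frame(precip = c(1, 0, 2))
df2 <- data.frame(precip = c(0, 3))
list_Df <- list(df1, df2)

# Add an api column to every data frame in the list
list_Df <- lapply(list_Df, function(df) {
  df$api <- calc_api(df$precip, k = 0.9)
  df
})

list_Df
```

The loop is kept because each value depends on the previous one; lapply() only handles the iteration over data frames.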

How to write a function that collects a specific list of observations from a time series data frame

In the data set created below, assume I randomly picked up 20 flat rocks. Each of these rocks was assigned a unique ID number. I measured the concentration of 7 substances (Copper, Iron, Carbon, Lead, Mg, CaCO, and Zinc) across the surface of the longest axis of each rock. Distance is recorded in mm, and therefore is a function of each rock's length. Note that not all rocks are of the same length. Location is a grouping variable that describes where the rock was picked up.
ID <- data.frame(ID=rep(c(12,122,242,329,595,130,145,245,654,878), each = 200))
ID2 <- data.frame(ID=rep(c(863,425,24,92,75,3,200,300,40,500), each = 300))
RockID<-data.frame(RockID = c(unlist(ID), unlist(ID2)))
Location <- rep(c("Alpha","Beta","Charlie","Delta","Echo"), each = 1000)
a <- rep(c(1:200),times = 10)
b <- rep(c(1:300), times = 10)
Time <- data.frame(Time = c(unlist(a), unlist(b)))
set.seed(1)
Copper <- rnorm(5000, mean = 0, sd = 5)
Iron <- rnorm(5000, mean = 0, sd = 10)
Carbon <- rnorm(5000, mean = 0, sd = 1)
Lead <- rnorm(5000, mean = 0, sd = 4)
Mg <- rnorm(5000, mean = 0, sd = 6)
CaCO <- rnorm(5000, mean = 0, sd = 2)
Zinc <- rnorm(5000, mean = 0, sd = 3)
data <-cbind(RockID, Location, Time,Copper,Iron,Carbon,Lead,Mg,CaCO,Zinc)
data$ID <- as.factor(data$RockID)
I want to create a new data frame that contains the following information:
1. The first observation and the last observation for each individual
2. The average of the first 3 observations and last 3 observations for each individual
3. The same as step 2. for the first and last 5, 7, and 10 observations
I want the new data frame to be set up like this:
ID FirstPt First3 First5 First7 First10 LastPt Last3 Last5 Last7 Last10
12 … … … … … … … … … …
122
242
329
595
130
145
245
654
878
863
425
etc.
How would I write a function to accomplish this?

We can create functions to calculate the average of the first and last n values. Use pivot_longer to get the data in long format, group_by each RockID and substance, and calculate the means.
library(dplyr)
average_of_first_n_values <- function(value, x) mean(head(value, x))
average_of_last_n_values <- function(value, x) mean(tail(value, x))
data %>%
tidyr::pivot_longer(cols = Copper:Zinc) %>%
group_by(RockID, name) %>%
summarise(first_obs = first(value),
last_obs = last(value),
first_3_avg = average_of_first_n_values(value, 3),
first_5_avg = average_of_first_n_values(value, 5),
first_7_avg = average_of_first_n_values(value, 7),
first_10_avg = average_of_first_n_values(value, 10),
last_3_avg = average_of_last_n_values(value, 3),
last_5_avg = average_of_last_n_values(value, 5),
last_7_avg = average_of_last_n_values(value, 7),
last_10_avg = average_of_last_n_values(value, 10))
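The same head()/tail() idea can be sketched in base R with the window sizes passed as a vector, which avoids writing one line per size. The helper name first_last_summary and the toy data are hypothetical and only serve to show the shape of the result; the sketch assumes the rows of each rock are already in measurement order.

```r
# Hypothetical helper: first/last value plus head/tail means for several window sizes
first_last_summary <- function(value, sizes = c(3, 5, 7, 10)) {
  firsts <- vapply(sizes, function(n) mean(head(value, n)), numeric(1))
  lasts  <- vapply(sizes, function(n) mean(tail(value, n)), numeric(1))
  out <- c(value[1], firsts, value[length(value)], lasts)
  names(out) <- c("FirstPt", paste0("First", sizes),
                  "LastPt", paste0("Last", sizes))
  out
}

# Toy data: two rocks with 12 measurements each (made up for illustration)
toy <- data.frame(RockID = rep(c(12, 122), each = 12),
                  Copper = c(1:12, seq(10, 120, by = 10)))

# One row per rock, one column per summary
res <- t(sapply(split(toy$Copper, toy$RockID), first_last_summary))
res
```

The column names follow the layout requested in the question (FirstPt, First3, ..., Last10).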

Natural interval making (1 dimension clustering) with constraint on the range of the observations in each interval

I have a vector of real numbers on which I want to create natural intervals. In other words, I want to perform 1 dimensional clustering. The constraint is that in each interval, the difference between the highest and the lowest value must be less than a constant c, say 3. I want to obtain a solution having a minimal number of intervals.
I tried to make my intervals using density estimation with a Gaussian kernel, decreasing the bandwidth until each range is less than 3. However, it does not work, since some interval ranges stay greater than 3 even if I reduce the bandwidth a lot. Also, there comes a point where the algorithm starts to create intervals containing no data.
library(tidyverse)
library(data.table)
# Create vector of real numbers -------------------------------------------------------------------------------------------------
set.seed(2019)
nb <- c(10, 23, 17, 16, 20)
x <- c(
rnorm(nb[1], mean = 20, sd = 0.5),
rnorm(nb[2], mean = 5, sd = 0.1),
rnorm(nb[3], mean = 10, sd = 0.5),
rnorm(nb[4], mean = 30, sd = 0.8),
rnorm(nb[5], mean = 18, sd = 10)
)
# Functions ---------------------------------------------------------------------------------------------------------------------
# Returns all local minima given a density object
find_local_mins <- function(density) {
y <- density$y
x <- density$x
ind_mins <- which(y - shift(y, 1) < 0 & y - shift(y, 1, type = "lead") < 0)
mins <- x[ind_mins]
return(mins)
}
# Compute differences between max and min value of a vector between breaks
compute_clusters_ranges <- function(x, breaks) {
clusters <- cut(x, breaks = c(-Inf, breaks, Inf))
splits <- split(x, clusters)
clusters_ranges <- map_dbl(splits, ~ diff(range(.)))
return(clusters_ranges)
}
# ----------
# Find and plot intervals using gaussian kernel with bandwidth of 2 -------------------------------------------------------------
densite <- density(x, kernel = "gaussian", bw = 2, n = 10000) # Estimate density
mins <- find_local_mins(densite) # Find local minima for clustering
plot(densite, xlab = "x", main = "")
rug(x, ticksize = 0.06)
abline(v = mins, col = rep("blue", length(mins)))
# Compute range (difference between max and min value) for each interval --------------------------------------------------------
cluster_ranges <- compute_clusters_ranges(x, mins)
cluster_ranges # Some ranges are still greater than 3, so we cluster again with a smaller bandwidth
# ----------
# Find and plot intervals using gaussian kernel with bandwidth of 1 -------------------------------------------------------------
densite <- density(x, kernel = "gaussian", bw = 1, n = 10000) # Estimate density
mins <- find_local_mins(densite) # Find local minima for clustering
plot(densite, xlab = "x", main = "")
rug(x, ticksize = 0.06)
abline(v = mins, col = rep("blue", length(mins)))
# Compute range (difference between max and min value) for each interval --------------------------------------------------------
cluster_ranges <- compute_clusters_ranges(x, mins)
cluster_ranges # Some ranges are still greater than 3, so we cluster again with a smaller bandwidth
# ----------
# Find and plot intervals using gaussian kernel with bandwidth of 0.659 ---------------------------------------------------------
densite <- density(x, kernel = "gaussian", bw = 0.659, n = 10000) # Estimate density
mins <- find_local_mins(densite) # Find local minima for clustering
plot(densite, xlab = "x", main = "")
rug(x, ticksize = 0.06)
abline(v = mins, col = rep("blue", length(mins)))
# Compute range (difference between max and min value) for each interval --------------------------------------------------------
cluster_ranges <- compute_clusters_ranges(x, mins)
cluster_ranges # The empty interval [36.62, 36.63] has been created
I want to obtain natural intervals for a vector of numeric data. In each interval created, I want the difference between the greatest and the smallest value to be less than 3. I want to obtain this using as few intervals as possible.

A simple greedy approach:
1. Sort the data.
2. Find the minimum value m.
3. Put everything in [m, m+3) into a bin.
4. Repeat with the remaining data.
This fairly trivial approach appears to satisfy the constraints you gave: every range is less than 3 and there are no empty intervals. It is also about as fast as it gets.
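The greedy steps could be sketched as follows. The function name greedy_intervals and the example vector are hypothetical; the sketch assumes x contains no missing values and treats each bin as the half-open interval [start, start + width).

```r
# Hypothetical helper: greedy left-to-right binning with range strictly below `width`
greedy_intervals <- function(x, width = 3) {
  ord <- order(x)
  xs <- x[ord]
  bin <- integer(length(xs))
  current <- 0L
  start <- -Inf
  for (i in seq_along(xs)) {
    if (xs[i] - start >= width) {   # value falls outside [start, start + width)
      current <- current + 1L       # open a new bin starting at this value
      start <- xs[i]
    }
    bin[i] <- current
  }
  out <- integer(length(x))
  out[ord] <- bin                   # bin ids in the original order of x
  out
}

# Made-up example values
x <- c(5.0, 5.1, 7.9, 8.0, 10.5, 20, 22.9, 23)
bins <- greedy_intervals(x)
bins

# Range of each bin; all should be below 3
tapply(x, bins, function(v) diff(range(v)))
```

Every bin starts at an observed value, so no bin can be empty, and after the sort the work is a single pass over the data.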

Specifying x values when converting approx() to data frame

I am trying to get a data frame from the output of approx(t, z, n = 120) below. My intent is for the input values returned to be in increments of 0.25 (for instance, 0, 0.25, 0.5, 0.75, ...), so I've set n = 120.
However, the data frame I get doesn't return those input values.
t <- c(0, 0.5, 2, 5, 10, 30)
z <- c(1, 0.9869, .9478, 0.8668, .7438, .3945)
data.frame(approx(t, z, n = 120))
I appreciate any assistance in this matter.

There are 121, not 120, points from 0 to 30 inclusive in steps of 0.25:
length(seq(0, 30, 0.25))
## [1] 121
so use this:
approx(t, z, n = 121)
Another approach is:
approx(t, z, xout = seq(min(t), max(t), 0.25))
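As a quick check, and only as an illustration, the two calls can be compared side by side; the variable names grid, by_xout and by_n are made up for this sketch.

```r
t <- c(0, 0.5, 2, 5, 10, 30)
z <- c(1, 0.9869, 0.9478, 0.8668, 0.7438, 0.3945)

# Explicit grid of x values in steps of 0.25
grid <- seq(min(t), max(t), by = 0.25)

by_xout <- data.frame(approx(t, z, xout = grid))         # x values given directly
by_n    <- data.frame(approx(t, z, n = length(grid)))    # x values implied by n

head(by_xout, 4)
all.equal(by_xout, by_n)   # the two approaches agree when n matches the grid length
```

Passing xout states the intended grid explicitly, so the result does not depend on counting the points correctly.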

Using fund() in r to find frequency

I'm trying to analyse a sine wave and print out its frequency along with the specific times. Unfortunately, I'm getting the error
In fund(samples, f = sampleRate, wl = timeSize, ovlp = 0, fmax = maxFreq, :
NAs introduced by coercion
Here is the code
library(tuneR)
library(seewave)
printFrequencyAndTime <- function(filename) {
waveform <- readWave(filename)
sampleRate <- waveform@samp.rate #44100
samples <- waveform@left #array vector
nOfSamples <- length(samples) #132301
maxFreq <- sampleRate/2
minFreq <- 0
timeSize <- 512
pi <- 3.1415926535
nOfFreqBands <- 512 # number of frequency bands
freqAverage <- 64 #group n number of frequency bands together
#fundamental frequency track
ff <- fund(samples, f= sampleRate, wl = timeSize, ovlp = 0, fmax = maxFreq, threshold = NULL, from = 0, to = 3, plot = TRUE, xlab = "Time (s)", ylab = "Frequency (kHz)", ylim = c(0, sampleRate/2000), pb = FALSE)
#bandNumber[j] <- j*sampleRate/timeSize
cat("Frequency is ",ff, "Hz; time is ", t, "s.")
}
It's been bugging me for a while now, so can anyone tell me if my approach is correct? I'm thinking I need to do the Fourier transform of my sine wave data to get the frequency components, then I need to find the frequency with the maximum amplitude, and that will be the frequency of my sine wave.
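The approach described in the last paragraph can be sketched with base R's fft() on a synthetic signal. The sampling rate, duration and 440 Hz tone below are assumptions chosen for illustration and stand in for the samples read from the wave file; this does not use or replace fund().

```r
sampleRate <- 8000                           # assumed sampling rate (Hz)
n <- 8000                                    # one second of samples
tt <- (0:(n - 1)) / sampleRate               # sample times (s)
samples <- sin(2 * pi * 440 * tt)            # synthetic 440 Hz sine wave

# Magnitude spectrum, keeping only the bins below the Nyquist frequency
spec  <- Mod(fft(samples))[1:(n / 2)]
freqs <- (0:(n / 2 - 1)) * sampleRate / n    # frequency of each FFT bin (Hz)

# Frequency of the bin with the largest magnitude
peakFreq <- freqs[which.max(spec)]
cat("Frequency is", peakFreq, "Hz\n")
```

The frequency resolution is sampleRate / n, so a longer analysis window gives a finer estimate of the peak.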
