Specifying x values when converting approx() output to a data frame

I am trying to get a data frame from the output of approx(t, z, n = 120) below. My intent is for the returned input values to be in increments of 0.25 (0, 0.25, 0.5, 0.75, ...), so I set n = 120.
However, the data frame I get doesn't contain those input values.
t <- c(0, 0.5, 2, 5, 10, 30)
z <- c(1, 0.9869, 0.9478, 0.8668, 0.7438, 0.3945)
data.frame(approx(t, z, n = 120))
I appreciate any assistance in this matter.

There are 121, not 120, points from 0 to 30 inclusive in steps of 0.25:
length(seq(0, 30, 0.25))
## [1] 121
so use this:
approx(t, z, n = 121)
Another approach is:
approx(t, z, xout = seq(min(t), max(t), 0.25))
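For reference, either call wrapped in data.frame() now yields the 0.25 grid; a quick check using the vectors defined above:
df <- data.frame(approx(t, z, xout = seq(min(t), max(t), 0.25)))
head(df$x)
## [1] 0.00 0.25 0.50 0.75 1.00 1.25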


How to merge two dataframes when column values are not exact?

I have:
Linearly interpolated the dFe_env data every 1 m and created a data frame (this works)
Extracted the 'Depth' (based on sinking rate) in 30-minute intervals (this works)
Created a 'Time' column that increases every 30 minutes (this works)
How do I:
Merge the two data frames (Bckgd_env2 and bulk_Fe2)? In bulk_Fe2 the Depth increases by 1 m, and in Bckgd_env2 the depth increases by 0.8 m. Can I get the closest 'Depth' match, extract the dFe_env at that depth, and create a new data frame with Depth, Time and dFe_env all together?
library(dplyr)
Depth <- c(0, 2, 20, 50, 100, 500, 800, 1000, 1200, 1500)
dFe_env <- c(0.2, 0.2, 0.3, 0.4, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1)
bulk_Fe <- data.frame(Depth, dFe_env)
summary(bulk_Fe)
is.data.frame(bulk_Fe)
do_interp <- function(dat, Depth = seq(0, 1500, by = 1)) {
  out <- tibble(Depth = Depth)
  for (var in c("dFe_env")) {
    # interpolate each variable at the requested depths; fall back to NA on error
    out[[var]] <- tryCatch(approx(dat$Depth, dat[[var]], Depth)$y,
                           error = function(e) NA_real_)
  }
  out
}
bulk_Fe2 <- bulk_Fe %>% do(do_interp(.))
bulk_Fe2
summary(bulk_Fe2)
D0 <- 0          # starting depth
T0 <- 0          # starting time of the experiment
r <- 40          # sinking rate per day
r_30min <- r/48  # sinking speed every 30 minutes (there are 48 x 30-minute intervals in 24 hours)
days <- round(1501/r)    # days; 1501 is the maximum depth
time <- days * 24 * 60   # minutes
n_steps <- 1501/r_30min
Bckgd_env2 <- data.frame(Depth = seq(from = D0, by = r_30min, length.out = n_steps + 1),
                         Time = seq(from = T0, by = 30, length.out = n_steps + 1))
head(Bckgd_env2)
round(Bckgd_env2, digits = 1)
Bckgd_env3 <- merge(Bckgd_env2, bulk_Fe2)
Bckgd_env3
plot(dFe_env ~ Depth, data = Bckgd_env3, ylab = "dFe (nmol/L)", xlab = "Depth (m)", las = 1)
You have already built the mechanism for interpolation, which will be useful for the join, but you didn't build it at the right depth values. It is just a matter of reorganizing your code.
Start by building Bckgd_env2, and only afterwards compute bulk_Fe2 and Bckgd_env3:
bulk_Fe2 <- bulk_Fe %>% do(do_interp(., Depth=Bckgd_env2$Depth))
Bckgd_env3 <- merge(Bckgd_env2, bulk_Fe2)
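As a quick check (reusing the objects built above), the merged frame should now carry Depth, Time and dFe_env together, with one row per 30-minute step:
head(Bckgd_env3)
## depths within the range of bulk_Fe$Depth get interpolated dFe_env values;
## anything beyond that range comes back as NA from approx()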

R: lapply multinomial test to a list of data frames

I have a data frame A, which I split into a list of 100 data frames, each having 3 rows (in my real data each data frame has 500 rows). Here I show A with the first 2 elements of the list (rows 1-3; rows 4-6):
A <- data.frame(n = c(0, 1, 2, 0, 1, 2),
                prob = c(0.4, 0.5, 0.1, 0.4, 0.5, 0.1),
                count = c(24878, 33605, 12100, 25899, 34777, 13765))
# This is the list:
nest <- split(A, rep(1:2, each = 3))
I want to apply the multinomial test to each of these data frames and extract the p-value of each test. So far I have done this:
library(EMT)
fun <- function(x){
  multinomial.test(x$count,
                   prob = x$prob,
                   useChisq = FALSE, MonteCarlo = TRUE,
                   ntrial = 100,  # number of Monte Carlo trials
                   atOnce = 100)
}
lapply(nest, fun)
However, I get:
Error in multinomial.test(x$counts_set, prob = x$norm_genome, useChisq = F, :
  Observations have to be stored in a vector, e.g. 'observed <- c(5,2,1)'
Does anyone have a smarter way of doing this?
The results of split are created with the names 1, 2, and so on; that's why x$count in fun cannot access it. To make it simpler, you can combine your split elements using the list function and then use lapply:
n <- c(0, 1, 2, 0, 1, 2)
prob <- c(0.4, 0.5, 0.1, 0.4, 0.5, 0.1)
count <- c(24878, 33605, 12100, 25899, 34777, 13765)
A <- cbind.data.frame(n, prob, count)
nest <- split(A, rep(1:2, each = 3))
fun <- function(x){
  multinomial.test(x$count,
                   prob = x$prob,
                   useChisq = FALSE, MonteCarlo = TRUE,
                   ntrial = 100,  # number of Monte Carlo trials
                   atOnce = 100)
}
# Create a list of splitted elements
new_list <- list(nest$`1`, nest$`2`)
lapply(new_list, fun)
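To also cover the "extract the p-value of each test" part: multinomial.test returns a list with a p.value element (the dplyr answer below relies on the same), so the p-values can be pulled out with sapply. A minimal sketch, reusing new_list and fun from above:
results <- lapply(new_list, fun)
pvals <- sapply(results, function(res) res$p.value)
pvals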
A solution with dplyr (note that EMT still has to be loaded for multinomial.test):
A <- data.frame(n = c(0, 1, 2, 0, 1, 2),
                prob = c(0.4, 0.5, 0.1, 0.4, 0.5, 0.1),
                count = c(43, 42, 9, 74, 82, 9))
library(dplyr)
library(EMT)
nest <- A %>%
  mutate(pattern = rep(1:2, each = 3)) %>%
  group_by(pattern) %>%
  dplyr::summarize(mn_pvals = multinomial.test(count, prob)$p.value)
nest

Calculate a row-specific value based on the minimum

My data looks like this
df <- data.frame(x = c(3, 5, 4, 4, 3, 2),
                 y = c(.9, .8, 1, 1.2, .5, .1))
I am trying to multiply each x value by either y or 1, whichever is smaller.
df$z <- df$x * min(df$y, 1)
The problem is that min() takes the minimum of the whole column, so every x is multiplied by 0.1.
Instead, I need x multiplied by 0.9, 0.8, 1, 1, 0.5, 0.1...
We need pmin, which goes through each value of 'y' and takes the minimum when it is compared with the second value (which is recycled):
pmin(df$y, 1)
#[1] 0.9 0.8 1.0 1.0 0.5 0.1
Likewise, we can pass any number of arguments (the parameter is ...):
pmin(df$y, 1, 0)
#[1] 0 0 0 0 0 0
To get the output, just multiply 'x' by the pmin output:
df$x * pmin(df$y, 1)
which can also be written as
with(df, x * pmin(y, 1))
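For reference, the full assignment and the values it produces with the df above:
df$z <- with(df, x * pmin(y, 1))
df$z
#[1] 2.7 4.0 4.0 4.0 1.5 0.2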
Maybe you could use the ifelse function:
df <- data.frame(x = c(3, 5, 4, 4, 3, 2),
                 y = c(.9, .8, 1, 1.2, .5, .1))
df$z <- ifelse(df$y < 1, df$x * df$y, df$x * 1)
This compares the values row by row.
Hope it helps! :)

Create normally distributed variables with a defined correlation in R

I am trying to create a data frame in R with a set of variables that are normally distributed. First, we create the data frame with just the following variables:
RootCause <- rnorm(500, 0, 9)
OtherThing <- rnorm(500, 0, 9)
Errors <- rnorm(500, 0, 4)
df <- data.frame(RootCause, OtherThing, Errors)
In the second part, we're asked to redo the above, but with a defined correlation of 0.5 between RootCause and OtherThing. I have tried reading through a couple of pages and articles explaining correlation commands in R, but I am afraid I am struggling to comprehend them.
Easy answer
Draw another random variable, OmittedVar, and add it to both of the other variables:
n <- 1000
OmittedVar <- rnorm(n, 0, 9)
RootCause <- rnorm(n, 0, 9) + OmittedVar
OtherThing <- rnorm(n, 0, 9) + OmittedVar
Errors <- rnorm(n, 0, 4)
cor(RootCause, OtherThing)
[1] 0.4942716
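Why roughly 0.5: both variables share Var(OmittedVar) = 81 as their covariance, and each has total variance 81 + 81 = 162, so the theoretical correlation is 81/162 = 0.5. A quick arithmetic check:
81 / sqrt(162 * 162)
#[1] 0.5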
Other answer: use the multivariate normal function from the MASS package.
You have to define the variance/covariance matrix that gives you the correlation you want (the Sigma argument here):
d <- MASS::mvrnorm(n = n, mu = c(0, 0),
                   Sigma = matrix(c(9, 4.5, 4.5, 9), nrow = 2, ncol = 2),
                   tol = 1e-6, empirical = FALSE, EISPACK = FALSE)
cor(d[,1], d[,2])
[1] 0.5114698
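A side note: mvrnorm also accepts empirical = TRUE (the same argument passed as FALSE above), which rescales the sample so the requested correlation holds exactly rather than approximately:
d <- MASS::mvrnorm(n = n, mu = c(0, 0),
                   Sigma = matrix(c(9, 4.5, 4.5, 9), nrow = 2, ncol = 2),
                   empirical = TRUE)
cor(d[, 1], d[, 2])
#[1] 0.5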
Note:
Getting a correlation other than 0.5 depends on the process; to change it, adjust the details (the coefficient on OmittedVar in the first strategy, or Sigma in the second). For that, you'll have to look up the variance rules of the normal distribution.

Average Cells of Two or More DataFrames

So I currently have 3 data frames, and I need to average each cell across them; I am at a loss for how to do this. Essentially, I need the mean of the first observation in column 1 of df1, df2 and df3, and likewise for every single observation.
Here is a reproducible sample data.
set.seed(789)
df1 <- data.frame(
  a = runif(100, 0, 100),
  b = runif(100, 0, 100),
  c = runif(100, 0, 100),
  d = runif(100, 0, 100))
df2 <- data.frame(
  a = runif(100, 0, 100),
  b = runif(100, 0, 100),
  c = runif(100, 0, 100),
  d = runif(100, 0, 100))
df3 <- data.frame(
  a = runif(100, 0, 100),
  b = runif(100, 0, 100),
  c = runif(100, 0, 100),
  d = runif(100, 0, 100))
I need to create a fourth data frame of dimensions 100 by 4 that is the result of averaging each cell across the first three dataframes. Any ideas are highly appreciated!
We can do this with Reduce and `+`, dividing by the number of datasets in the list. This has the flexibility of handling any number of datasets kept in a list:
dfAvg <- Reduce(`+`, mget(paste0("df", 1:3)))/3
Another option is to convert to an array and then use apply, which also has the option of removing missing values (na.rm = TRUE):
apply(array(unlist(mget(paste0("df", 1:3))), c(dim(df1), 3)), 2, rowMeans, na.rm = TRUE)
As @user20650 mentioned, rowMeans can be applied directly on the array via the dims argument:
rowMeans(array(unlist(mget(paste0("df", 1:3))), c(dim(df1), 3)), dims=2)
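One small caveat, assuming the objects above: the array-based variants return a plain matrix without the original column names, so wrap the result in as.data.frame and restore the names if a data frame is needed:
dfAvg2 <- as.data.frame(rowMeans(array(unlist(mget(paste0("df", 1:3))),
                                       c(dim(df1), 3)), dims = 2))
names(dfAvg2) <- names(df1)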
