cut() function appears to group observations incorrectly

Here's an example of how the group label from cut() doesn't seem accurate. The observation with x1 = 200 is placed in the group of x2 labelled [0,200), which looks wrong: the label suggests 200 is excluded. The label can be fixed by increasing dig.lab, but I still think the default rounding should give a label for x2 with face validity. Is this a bug?
df <- data.frame(x1 = c(100, 100.5, 200, 200.5))
df$x2 <- cut(df$x1, breaks = c(0,200.1,999), right = FALSE)
df$x3 <- cut(df$x1, breaks = c(0,200.1,999), right = FALSE, dig.lab = 4)
df
# x1 x2 x3
# 1 100.0 [0,200) [0,200.1)
# 2 100.5 [0,200) [0,200.1)
# 3 200.0 [0,200) [0,200.1)
# 4 200.5 [200,999) [200.1,999)
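For what it's worth, the label appears to be purely a display artifact: cut() formats the break points to dig.lab significant digits (3 by default), so the break 200.1 prints as "200" while the underlying interval is still [0, 200.1). A minimal sketch of that rounding, using essentially the same formatC() call that cut.default() uses to build labels:
formatC(200.1, digits = 3, width = 1)  # "200"   (what the default dig.lab = 3 shows)
formatC(200.1, digits = 4, width = 1)  # "200.1" (what dig.lab = 4 shows)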

Related

Apply different functions to columns of a dataframe selecting functions by name

Let's say I've got a dataframe with multiple columns, some of which I want to transform. The column names define what transformation needs to be used.
library(tidyverse)
set.seed(42)
df <- data.frame(A = 1:100,
                 B = runif(n = 100, 0, 1),
                 log10 = runif(n = 100, 10, 100),
                 log2 = runif(n = 100, 10, 100),
                 log1p = runif(n = 100, 10, 100),
                 sqrt = runif(n = 100, 10, 100))
trans <- list()
trans$log10 <- log10
trans$log2 <- log2
trans$log1p <- log1p
trans$sqrt <- sqrt
Ideally, I would like to use an across call where the column names were matched up with the trans function names and the transformations would be performed on the fly.
The desired output is the following:
df_trans <- df %>%
  dplyr::mutate(log10 = trans$log10(log10),
                log2 = trans$log2(log2),
                log1p = trans$log1p(log1p),
                sqrt = trans$sqrt(sqrt))
df_trans
However, I don't want to manually specify each transformation separately. In this representative example I only have 4, but the number could vary and be significantly higher, making manual specification cumbersome and error-prone.
I have managed to match up the column names with the functions by turning the trans list into a data frame and left-joining but am then unable to call the function in the trans_function column.
trans_df <- enframe(trans, value = "trans_function")
df %>%
  pivot_longer(cols = everything()) %>%
  left_join(trans_df) %>%
  dplyr::mutate(value = trans_function(value))
Error: Problem with `mutate()` column `value`.
i `value = trans_function(value)`.
x could not find function "trans_function"
I think I either need to find a way of calling the functions from the list columns or another way of matching up the function names with the column names. All ideas are welcome.
We can use cur_column() in across to get the column name and use it to subset trans.
library(dplyr)
df %>%
  mutate(across(names(trans), ~ trans[[cur_column()]](.x))) %>%
  head()
# A B log10 log2 log1p sqrt
#1 1 0.9148060 1.821920 6.486402 3.998918 3.470303
#2 2 0.9370754 1.470472 5.821200 3.932046 7.496103
#3 3 0.2861395 1.469690 6.437524 2.799395 8.171007
#4 4 0.8304476 1.653261 5.639570 3.700698 6.905755
#5 5 0.6417455 1.976905 4.597484 4.500461 9.441077
#6 6 0.5190959 1.985133 5.638341 4.551289 4.440590
Comparing it with the output of df_trans:
head(df_trans)
# A B log10 log2 log1p sqrt
#1 1 0.9148060 1.821920 6.486402 3.998918 3.470303
#2 2 0.9370754 1.470472 5.821200 3.932046 7.496103
#3 3 0.2861395 1.469690 6.437524 2.799395 8.171007
#4 4 0.8304476 1.653261 5.639570 3.700698 6.905755
#5 5 0.6417455 1.976905 4.597484 4.500461 9.441077
#6 6 0.5190959 1.985133 5.638341 4.551289 4.440590
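As an added sanity check (not in the original answer), the two results can also be compared directly:
identical(df %>% mutate(across(names(trans), ~ trans[[cur_column()]](.x))), df_trans)
# should be TRUE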
One way is to use lapply:
library(tidyverse)
set.seed(42)
df <- data.frame(A = 1:100,
                 B = runif(n = 100, 0, 1),
                 log10 = runif(n = 100, 10, 100),
                 log2 = runif(n = 100, 10, 100),
                 log1p = runif(n = 100, 10, 100),
                 sqrt = runif(n = 100, 10, 100))
trans <- list()
trans$log10 <- log10
trans$log2 <- log2
trans$log1p <- log1p
trans$sqrt <- sqrt
df_trans <- setNames(lapply(names(df), function(x) {
  if (x %in% names(trans)) trans[[x]](df[, x]) else df[, x]
}), names(df)) %>%
  bind_cols() %>%
  as.data.frame()
head(df_trans)
which gives:
A B log10 log2 log1p sqrt
1 1 0.1365052 1.739051 6.301896 4.530600 4.318942
2 2 0.1771364 1.549601 5.793220 4.521715 3.649834
3 3 0.5195605 1.902438 4.819125 3.343266 6.788565
4 4 0.8111208 1.572253 6.219991 4.075945 3.322401
5 5 0.1153620 1.751276 6.306097 4.060292 7.817301
6 6 0.8934218 1.724403 6.201123 3.235938 9.749128
The original dataframe being:
head(df)
A B log10 log2 log1p sqrt
1 1 0.1365052 54.83409 78.89684 91.81428 18.65326
2 2 0.1771364 35.44878 55.45401 90.99323 13.32129
3 3 0.5195605 79.88006 28.22936 27.31143 46.08461
4 4 0.8111208 37.34675 74.54249 57.90612 11.03835
5 5 0.1153620 56.39961 79.12693 56.99123 61.11019
6 6 0.8934218 53.01557 73.57393 24.43022 95.04549
In base R, we may use Map:
# apply each function in trans to the same-named column of df, in place
df[names(trans)] <- Map(function(x, y) x(y), trans, df[names(trans)])
Checking that it matches df_trans:
> identical(df, df_trans)
[1] TRUE
Another possibility is the following:
library(tidyverse)
set.seed(42)
df <- data.frame(A = 1:100,
                 B = runif(n = 100, 0, 1),
                 log10 = runif(n = 100, 10, 100),
                 log2 = runif(n = 100, 10, 100),
                 log1p = runif(n = 100, 10, 100),
                 sqrt = runif(n = 100, 10, 100))
df %>%
  mutate(across(-(A:B), ~ getFunction(cur_column())(.x))) %>%
  head()
#> A B log10 log2 log1p sqrt
#> 1 1 0.9148060 1.821920 6.486402 3.998918 3.470303
#> 2 2 0.9370754 1.470472 5.821200 3.932046 7.496103
#> 3 3 0.2861395 1.469690 6.437524 2.799395 8.171007
#> 4 4 0.8304476 1.653261 5.639570 3.700698 6.905755
#> 5 5 0.6417455 1.976905 4.597484 4.500461 9.441077
#> 6 6 0.5190959 1.985133 5.638341 4.551289 4.440590
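Note that getFunction() looks a function up by name on the search path, so this approach works here only because the column names (log10, log2, log1p, sqrt) happen to be names of base R functions; with a user-defined trans list, the cur_column() pattern above is the safer choice.
getFunction("log2")  # finds base::log2 on the search path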

How to plot correlation matrix in R with already derived correlation values between variables?

I have an output table which looks like:
Vars Corr SE
1_2 0.51 0.003
1_3 0.32 0.001
...
49_50 0.23 0.006
where correlation values were derived in other software for the variable pairs listed in Vars (1_2 refers to the correlation between Variable 1 and Variable 2). What is the best way to convert this into a format which can display the correlation matrix between all 50 variables?
I'm assuming there needs to be a way to make the diagonals 1 as well?
Thanks!
Suppose you have the data in a single column; you can restructure it into a matrix and use corrplot:
library(dplyr)
library(corrplot)
cordata = data.frame(
  Vars = paste0(rep(1:50, times = 50), "_",
                rep(1:50, each = 50)),
  Corr = rnorm(n = 50 * 50, mean = 0, sd = .3)
) %>%
  # for the sake of demonstration, reset Corrs beyond -1 and 1 to 0
  mutate(Corr = replace(Corr, Corr > 1 | Corr < -1, 0))
head(cordata)
Vars Corr
1 1_1 0.453807195
2 2_1 0.237179163
3 3_1 0.303635874
4 4_1 -0.314318833
5 5_1 0.008682918
6 6_1 -0.067164730
cormat = matrix(cordata$Corr, byrow = TRUE, ncol = 50)
# You can use corrplot::corrplot
corrplot(cormat)
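On the diagonal question from the post: since cormat is a plain matrix, you can force the self-correlations to 1 before plotting (a small sketch, assuming the 50 x 50 layout above):
diag(cormat) <- 1  # set the diagonal to 1
corrplot(cormat)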

BTYD Individual Level Estimations For All Observations

I am using the BG/NBD model from the BTYD package in R and computed the individual-level estimates.
For instance, following the documentation on page 20 of the
BTYD Walkthrough
Code for Data Prep:
library(BTYD)
library(dplyr)
library(lubridate)
system.file("data/cdnowElog.csv", package = "BTYD") %>%
  dc.ReadLines(., cust.idx = 2, date.idx = 3, sales.idx = 5) %>%
  dc.MergeTransactionsOnSameDate() %>%
  mutate(date = parse_date_time(date, "%Y%m%d")) -> elog
end.of.cal.period <- as.Date("1997-09-30")
elog.cal <- elog[which(elog$date <= end.of.cal.period), ]
split.data <- dc.SplitUpElogForRepeatTrans(elog.cal)
birth.periods <- split.data$cust.data$birth.per
last.dates <- split.data$cust.data$last.date
clean.elog <- split.data$repeat.trans.elog
freq.cbt <- dc.CreateFreqCBT(clean.elog)
tot.cbt <- dc.CreateFreqCBT(elog)
cal.cbt <- dc.MergeCustomers(tot.cbt, freq.cbt)
cal.cbs.dates <- data.frame(birth.periods, last.dates, end.of.cal.period)
cal.cbs <- dc.BuildCBSFromCBTAndDates(cal.cbt, cal.cbs.dates, per = "week")
params <- pnbd.EstimateParameters(cal.cbs)
With these in place, one can get estimates for a particular observation.
Code for Individual Level Estimation:
cal.cbs["1516",]
# x t.x T.cal
# 26.00 30.86 31.00
x <- cal.cbs["1516", "x"]
t.x <- cal.cbs["1516", "t.x"]
T.cal <- cal.cbs["1516", "T.cal"]
bgnbd.ConditionalExpectedTransactions(params, T.star = 52,
                                      x, t.x, T.cal)
# [1] 25.76
My question is: is it possible to run this for every row, so that I get a data frame containing the expectations for each customer, instead of hard-coding a particular ID number such as "1516" in this case?
Thanks!
Yes, it is straightforward with dplyr's mutate():
cal.cbs %>%
  data.frame() %>%
  mutate(`Conditional Expectation` =
           bgnbd.ConditionalExpectedTransactions(params, T.star = 52, x, t.x, T.cal))
x t.x T.cal Conditional Expectation
1 2 30.428571 38.85714 2.3224971
2 1 1.714286 38.85714 1.0646350
3 0 0.000000 38.85714 0.5607707
4 0 0.000000 38.85714 0.5607707
5 0 0.000000 38.85714 0.5607707
6 7 29.428571 38.85714 6.0231497
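If you also want the customer IDs (the rownames of cal.cbs) as a column, a small extension of the same pipeline works. This is an added sketch, not part of the original answer; it assumes the same params and cal.cbs as above, and the names cust and cond.expected are arbitrary:
library(tibble)
cal.cbs %>%
  data.frame() %>%
  rownames_to_column("cust") %>%  # keep customer IDs alongside the estimates
  mutate(cond.expected = bgnbd.ConditionalExpectedTransactions(params, T.star = 52, x, t.x, T.cal))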

Mean and standard deviation of triplicated vector data

I have an experiment where I measured a bit less than 200 variables in triplicate. In other words, I have three vectors of ~ 200 values.
I want a quick way to determine whether I should use the mean or the median for my calculations. I can compute the mean easily ((v1 + v2 + v3) / 3), but how do I calculate the SD so that I get a vector of ~200 SDs? And what about the median?
After having these values, I need to fit growth curves (the measurements were taken over a certain period of time).
Here is a dplyr solution:
library(dplyr)
d <- data.frame(
  x1 = rnorm(10),
  x2 = rnorm(10),
  x3 = rnorm(10)
)
d %>%
  rowwise() %>%
  mutate(
    mean = mean(c(x1, x2, x3)),
    median = median(c(x1, x2, x3)),
    sd = sd(c(x1, x2, x3))
  )
It sounds like you also have a substantive question about longitudinal data. If so, Cross Validated would be a good platform for that question.
apply is the tool for this. Put your vectors in a matrix, e.g.
mydat <- matrix(rnorm(600), ncol = 3)
means <- apply(mydat, MARGIN = 1, mean) # MARGIN = 1 is rows, MARGIN = 2 would be columns...
sds <- apply(mydat, MARGIN = 1, sd)
medians <- apply(mydat, MARGIN = 1, median)
Though I have to say, with 3 values each, using median sounds pretty questionable.
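As an aside (not part of the original answer), the mean has a vectorized base shortcut, and if the matrixStats package is available it offers row-wise analogues for sd and median:
means <- rowMeans(mydat)  # vectorized equivalent of apply(mydat, MARGIN = 1, mean)
# sds <- matrixStats::rowSds(mydat)
# medians <- matrixStats::rowMedians(mydat)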
A traditional for loop can also be used, though it is not the preferred approach:
for(i in 1:nrow(d)) d[i,4]=mean(unlist(d[i,1:3]))
for(i in 1:nrow(d)) d[i,5]=sd(unlist(d[i,1:3]))
for(i in 1:nrow(d)) d[i,6]=median(unlist(d[i,1:3]))
names(d)[4:6]=c('meanval', 'sdval', 'medianval')
d
x1 x2 x3 meanval sdval medianval
1 -1.3230176 0.6956100 -0.7210798 -0.44949580 1.0363556 -0.7210798
2 -1.8931166 0.9047873 -1.0378874 -0.67540558 1.4337404 -1.0378874
3 -0.2137543 0.1846733 0.6410478 0.20398893 0.4277283 0.1846733
4 0.1371915 -1.0345325 -0.2260038 -0.37444827 0.5998009 -0.2260038
5 -0.8662465 -0.8229465 -0.2230030 -0.63739866 0.3595296 -0.8229465
6 -0.2918697 -1.3543493 1.3025262 -0.11456426 1.3372826 -0.2918697
7 -0.4931936 1.7186173 1.3757156 0.86704643 1.1904138 1.3757156
8 0.3982403 -0.3394208 1.9316059 0.66347514 1.1585131 0.3982403
9 -1.0332427 -0.3045905 1.1513260 -0.06216908 1.1122775 -0.3045905
10 -1.5603811 -0.1709146 -0.5409815 -0.75742575 0.7195765 -0.5409815
Using d from #DMC's answer.

Running Mean/SD: How can I select within the averaging window based on criteria

I need to calculate a moving average and standard deviation for a moving window. This is simple enough with the caTools package!
... However, what I would like to do, having defined my moving window, is take an average of ONLY those values within the window whose corresponding values of other variables meet certain criteria. For example, I would like to calculate a moving Temperature average using only the values within the window (e.g. +/- 2 days) where, say, Relative Humidity is above 80%.
Could anybody help point me in the right direction? Here is some example data:
# nine Temp values followed by nine RH values, filled column-wise
da <- data.frame(matrix(c(12, 15, 12, 13, 8, 20, 18, 19, 20,
                          80, 79, 91, 92, 70, 94, 80, 80, 90),
                        ncol = 2))
names(da) <- c("Temp", "RH")
Thanks,
Brad
I haven't used caTools, but in the help text for the (presumably) most relevant function in that package, ?runmean, you see that x, the input data, can be either "a numeric vector [...] or matrix with n rows". In your case the matrix alternative is the relevant one: you wish to calculate the mean of a focal variable, Temp, conditional on a second variable, RH, so the function needs access to both variables. However, "[i]f x is a matrix than each column will be processed separately". Thus, I don't think caTools can solve your problem.
Instead, I would suggest rollapply in the zoo package. rollapply has the argument by.column, which defaults to TRUE: "If TRUE, FUN is applied to each column separately". As explained above, we need access to both columns inside the function, so we set by.column to FALSE.
library(zoo)
# First, specify a function to apply to each window: mean of Temp where RH > 80
meanfun <- function(x) mean(x[(x[ , "RH"] > 80), "Temp"])
# Apply the function to windows of size 3 in your data 'da'.
meanTemp <- rollapply(data = da, width = 3, FUN = meanfun, by.column = FALSE)
meanTemp
# If you want to add the means to 'da',
# you need to make it the same length as number of rows in 'da'.
# This can be achieved by the `fill` argument,
# where we can pad the resulting vector of running means with NA
meanTemp <- rollapply(data = da, width = 3, FUN = meanfun, by.column = FALSE, fill = NA)
# Add the vector of means to the data frame
da2 <- cbind(da, meanTemp)
da2
# even smaller example to make it easier to see how the function works
da <- data.frame(Temp = 1:9, RH = rep(c(80, 81, 80), each = 3))
meanTemp <- rollapply(data = da, width = 3, FUN = meanfun, by.column = FALSE, fill = NA)
da2 <- cbind(da, meanTemp)
da2
# Temp RH meanTemp
# 1 1 80 NA
# 2 2 80 NaN
# 3 3 80 4.0
# 4 4 81 4.5
# 5 5 81 5.0
# 6 6 81 5.5
# 7 7 80 6.0
# 8 8 80 NaN
# 9 9 80 NA
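A note on the NaN entries above (an added observation): they appear wherever no value in the window satisfies RH > 80, so meanfun() ends up taking the mean of an empty vector:
mean(numeric(0))  # NaN, which is what meanfun() returns for such windows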
