Row index of the "looked at" row in case_when in R

I'm currently struggling with a coding task concerning the use of a case_when statement in R.
In general, I would like to use the row index of the currently evaluated ("looked at") row in the assignment part of the case_when statement.
A short explanation of the data: I have a large data.frame with a date column, a geo-layer column, and some numeric columns with the numbers for the calculations.
The data.frame isn't sorted, and not every geo layer is necessarily present for every point in time. Sadly, I can't provide a real data set due to legal issues.
The task at hand is to compute, on the one hand, simple mathematical operations within the same point in time and, on the other hand, mathematical operations across different points in time for the same geo layer and numeric value.
The mathematical operations vary, as does the interval between the time points.
For instance I need to calculate a change rate to the last quarter and last year of the value:
((current_value - last_quarter_value) / current_value)*100
This is how I'd like to code it.
library(tidyverse)

test_dataframe <- data.frame(
  times = c(rep(as.Date("2021-03-01"), 2), rep(as.Date("2020-12-01"), 2)),
  geo_layer = rep(c("001001001", "001001002"), 2),
  numeric_value_a = 1:4,
  numeric_value_b = 4:1,
  numeric_value_c = c(1, NA, 3, 1)
)
check_comparison_times <- unique(test_dataframe$times)

test_dataframe <- test_dataframe %>%
  mutate(
    normale_calculation = case_when(
      !is.na(numeric_value_c) ~ (numeric_value_a + numeric_value_b) / numeric_value_c,
      TRUE ~ Inf
    ),
    time_comparison = case_when(
      is.na(numeric_value_c) ~ Inf,
      (times - months(3)) %in% check_comparison_times ~ test_dataframe[
        which(
          test_dataframe[, "times"] ==
            (test_dataframe[row_index_of_current_looked_at_row, "times"] - months(3)) &
          test_dataframe[, "geo_layer"] ==
            test_dataframe[row_index_of_current_looked_at_row, "geo_layer"]
        ),
        "numeric_value_c"
      ] - test_dataframe[row_index_of_current_looked_at_row, "numeric_value_c"],
      TRUE ~ -Inf
    )
  )
With this desired outcome:
times geo_layer numeric_value_a numeric_value_b numeric_value_c normal_calculation time_comparison
1 2021-03-01 001001001 1 4 1 5.000000 2
2 2021-03-01 001001002 2 3 NA Inf Inf
3 2020-12-01 001001001 3 2 3 1.666667 -Inf
4 2020-12-01 001001002 4 1 1 5.000000 -Inf
Currently I solve the problem with a triple loop in which I first pair the values by time, then by geo_layer, and then execute the mathematical operation.
Since my data set is much, much larger than this example, this solution is very inefficient.
Thanks for your help.
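For readers hitting the same problem: one way to avoid referring to row indices at all is to join the data frame to a shifted copy of itself, matching each row to the same geo_layer three months earlier. The following is only a sketch against the example data above (not the original poster's code), using dplyr and lubridate; prev_times, prev_value_c and comparison are names made up for the sketch.

library(dplyr)
library(lubridate)

comparison <- test_dataframe %>%
  mutate(prev_times = times - months(3)) %>%
  left_join(
    test_dataframe %>%
      select(times, geo_layer, prev_value_c = numeric_value_c),
    by = c("prev_times" = "times", "geo_layer" = "geo_layer")
  ) %>%
  mutate(
    time_comparison = case_when(
      is.na(numeric_value_c) ~ Inf,    # current value missing
      is.na(prev_value_c)    ~ -Inf,   # no row for this geo_layer three months earlier
      TRUE                   ~ prev_value_c - numeric_value_c
    )
  ) %>%
  select(-prev_times, -prev_value_c)

On the example data this reproduces the desired time_comparison column (2, Inf, -Inf, -Inf), and a self-join like this scales much better than nested loops because the matching is done once, vectorised, instead of row by row.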


R: Running multiple tests by selecting (and increasing) number of fixed data points selected - Followup

This is a follow-up from a previous post (R: Running multiple tests by selecting (and increasing) number of fixed data points selected):
I have a dataframe (saved as data.csv) that looks something like this:
person  outcome  baseline_post  time
1       0        baseline       BL_1
1       1        baseline       BL_2
1       0        baseline       BL_3
1       2        baseline       BL_4
1       4        post           post_1
1       3        post           post_2
1       4        post           post_3
1       6        post           post_4
2       1        baseline       BL_1
2       2        baseline       BL_2
2       0        baseline       BL_3
2       1        baseline       BL_4
2       3        post           post_1
2       2        post           post_2
2       4        post           post_3
2       3        post           post_4
As in the previous post, the purpose is to iterate the same test (it can be any test) over the desired fixed combinations arranged across time,
i.e., for each participant, compare outcome(s) at BL_1 against post_1, then BL_1 and BL_2 against post_1, ... then BL_1, BL_2, BL_3 and BL_4 against post_1, etc.
Basically, all combinations increasing in the number of weeks tested before (BL_1 to 4) and after (post_1 to 4) treatment.
I tried modifying @Caspar V.'s code (thanks @Caspar V. for your previous response):
#creating pre/post data frames for later use
library(dplyr)

df <- read.csv("C:/Users/data.csv")
df_baseline <- filter(df, baseline_post == "baseline") %>%
  rename(baseline = baseline_post) %>%
  rename(time_baseline = time)
df_post <- filter(df, baseline_post == "post") %>%
  rename(post = baseline_post) %>%
  rename(time_post = time)

#generate a list of desired comparisons
comparisons = list()
for(a_len in seq_along(df_baseline$baseline)) for(b_len in seq_along(df_post$post)) {
  comp = list(baseline = head(df_baseline$time_baseline, a_len), post = head(df_post$time_post, b_len))
  comparisons = append(comparisons, list(comp))
}

#KIV create combined df for time if required
df_baseline_post <- cbind(df_baseline$time_baseline, df_post$time_post)
colnames(df_baseline_post) = c("time_baseline", "time_post")

#iterate through list of comparisons
for(df_baseline_post in comparisons) {
  cat(df_baseline_post$time_baseline, 'versus', df_baseline_post$time_post, '\n')
  #this is where your analysis goes; poisson_frequencies being a test function I created
  poisson_frequencies(df)
}
Unfortunately, this is my output: 16 bare "versus" lines (because there are 16 possible combinations based on the above data):
versus
versus
versus
versus
versus
versus
...
versus
I am not sure what went wrong. Appreciate any input. I am new when it comes to programming in R.
There are a number of problems; the following should get you back on track. Good luck!
1)
You're getting 64 comparisons in comparisons, not 16. If you just look at the contents of comparisons you'll see that. It's because you have duplicates in df$time (each time label appears once per person), so you need to remove them first:
#generate a list of desired comparisons
groupA = unique(df_baseline$time_baseline)
groupB = unique(df_post$time_post)
comparisons = list()
for(a_len in seq_along(groupA)) for(b_len in seq_along(groupB)) {
  comp = list(baseline = head(groupA, a_len), post = head(groupB, b_len))
  comparisons = append(comparisons, list(comp))
}
2)
The following block is not used, and the variable df_baseline_post is overwritten in the for-loop after it, so you can just remove this:
#KIV create combined df for time if required
# df_baseline_post <- cbind(df_baseline$time_baseline, df_post$time_post)
# colnames(df_baseline_post) = c("time_baseline", "time_post")
3)
You're executing poisson_frequencies(df) every time, but not doing anything with the output. That's why you're not seeing anything. You'll need to put a print() around it: print(poisson_frequencies(df)). Of course df is also not the data you want to work with, but I hope you already knew that.
4)
df_baseline_post$time_baseline and df_baseline_post$time_post don't exist. The loop should be:
for(df_baseline_post in comparisons) {
  cat(df_baseline_post$baseline, 'versus', df_baseline_post$post, '\n')
  print(poisson_frequencies(df))
}
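Putting the four fixes together, a consolidated sketch might look like the following. Note that poisson_frequencies() is the asker's own function and is not shown anywhere, so a stand-in dummy_test() is used here, and the subsetting step is only a guess at how the selected time points would feed into the test:

library(dplyr)

df <- read.csv("C:/Users/data.csv")
df_baseline <- filter(df, baseline_post == "baseline")
df_post     <- filter(df, baseline_post == "post")

# deduplicated time labels (fix 1)
groupA <- unique(df_baseline$time)
groupB <- unique(df_post$time)

comparisons <- list()
for (a_len in seq_along(groupA)) for (b_len in seq_along(groupB)) {
  comp <- list(baseline = head(groupA, a_len), post = head(groupB, b_len))
  comparisons <- append(comparisons, list(comp))
}

# stand-in for the asker's poisson_frequencies(); replace with the real test
dummy_test <- function(data) nrow(data)

for (comp in comparisons) {
  cat(comp$baseline, 'versus', comp$post, '\n')                     # fix 4: correct element names
  subset_df <- filter(df, time %in% c(comp$baseline, comp$post))    # pass only the selected weeks
  print(dummy_test(subset_df))                                      # fix 3: print the result
}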

How to group, pivot and count in R

Using the GermanCredit dataset from the caret library.
library("caret")
data(GermanCredit)
After filtering this down a bit
credit.all <- GermanCredit[,c(10, 1:9, 11:13, 16:19)]
attach(credit.all)
names(credit.all)
We have these names
[1] "Class" "Duration"
[3] "Amount" "InstallmentRatePercentage"
[5] "ResidenceDuration" "Age"
[7] "NumberExistingCredits" "NumberPeopleMaintenance"
[9] "Telephone" "ForeignWorker"
[11] "CheckingAccountStatus.lt.0" "CheckingAccountStatus.0.to.200"
[13] "CheckingAccountStatus.gt.200" "CreditHistory.ThisBank.AllPaid"
[15] "CreditHistory.PaidDuly" "CreditHistory.Delay"
[17] "CreditHistory.Critical"
What I need to do is pivot and summarize on two of these columns, something I know how to do in SQL like this:
SELECT
    Class
  , SUM(CASE WHEN `CreditHistory.Critical` = 1 THEN 1 ELSE 0 END) AS Critical
  , SUM(CASE WHEN `CreditHistory.Critical` = 0 THEN 1 ELSE 0 END) AS NotCritical
  , SUM(CASE WHEN `CreditHistory.Critical` = 1 THEN 1 ELSE 0 END) / COUNT(*) AS PctCritical
FROM `credit.all`
GROUP BY
    Class
This would produce one row per Class with the Critical and NotCritical counts and the percentage critical.
However, I am struggling mightily to get a foothold in R. Using books and Google, it seems I should use reshape2's melt and dcast to achieve something like this. What I have tried are basically variants of this:
library(reshape2)
credit.melted <- melt(credit.all[,c(1,17)], ID=c("name", "Class"))
dcast(credit.melted, Class~CreditHistory.Critical, nrow, fill=0)
But all of my attempts with these functions have produced errors too cryptic and too common to understand what I am doing wrong.
Error in vapply(indices, fun, .default) : values must be length 1,
but FUN(X[[1]]) result is length 0
Sometimes my random permutations of the function calls produce slightly different error output, but nothing that points me in the right direction.
Question: How can I do the pivoted summary similar to the SQL result using R?
I wouldn't really consider this a pivot. You're not trying to use a pivot command in SQL. You can use dplyr to follow the exact same method as your SQL:
library(dplyr)
credit.all %>%
  group_by(Class) %>%
  summarize(Critical = sum(CreditHistory.Critical == 1),
            NotCritical = sum(CreditHistory.Critical == 0),
            PctCritical = mean(CreditHistory.Critical == 1))
# # A tibble: 2 x 4
# Class Critical NotCritical PctCritical
# <fct> <int> <int> <dbl>
# 1 Bad 50 250 0.167
# 2 Good 243 457 0.347
Since it's a binary column the == 1 isn't really necessary, but I leave it in because (a) it's more similar to your SQL code, and (b) if there were other values but you wanted the count of 1s, this would be the way to do it. However, you could get the same results a little more simply like this:
credit.all %>%
  group_by(Class) %>%
  summarize(Critical = sum(CreditHistory.Critical),
            NotCritical = n() - Critical,
            PctCritical = Critical / n())
If you really want a pivot approach, we can go that route, it just seems less straightforward. Your data is already in a long format, so we don't need to melt, we can just cast:
pivot = dcast(Class ~ CreditHistory.Critical, data = credit.all)
pivot
# Using CreditHistory.Critical as value column: use value.var to override.
# Aggregation function missing: defaulting to length
# Class 0 1
# 1 Bad 250 50
# 2 Good 457 243
You could then rename the columns and calculate the percentages:
names(pivot)[2:3] = c("NotCritical", "Critical")
pivot$PctCritical = with(pivot, Critical / (Critical + NotCritical))
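For completeness, the same pivot can also be written with tidyr's pivot_wider() (not part of the original answer; just a sketch of the more current tidyverse interface), counting rows per Class and critical flag first:

library(dplyr)
library(tidyr)

credit.all %>%
  count(Class, CreditHistory.Critical) %>%                          # one row per Class x flag
  pivot_wider(names_from = CreditHistory.Critical, values_from = n) %>%
  rename(NotCritical = `0`, Critical = `1`) %>%
  mutate(PctCritical = Critical / (Critical + NotCritical))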

Simulating a timeseries in dplyr instead of using a for loop

So, while lag and lead in dplyr are great, I want to simulate a time series of something like population growth. My old-school code would look something like:
tdf <- data.frame(time = 1:5, pop = 50)
for(i in 2:5){
  tdf$pop[i] = 1.1 * tdf$pop[i-1]
}
which produces
time pop
1 1 50.000
2 2 55.000
3 3 60.500
4 4 66.550
5 5 73.205
I feel like there has to be a dplyr or tidyverse way to do this (as much as I love my for loop).
But, something like
tdf <- data.frame(time=1:5, pop=50) %>%
mutate(pop = 1.1*lag(pop))
which would have been my first guess just produces
time pop
1 1 NA
2 2 55
3 3 55
4 4 55
5 5 55
I feel like I'm missing something obvious.... what is it?
Note - this is a trivial example - my real examples use multiple parameters, many of which are time-varying (I'm simulating forecasts under different GCM scenarios), so, the tidyverse is proving to be a powerful tool in bringing my simulations together.
Reduce (or its purrr variants, if you like) is what you want for cumulative functions that don't already have a cum* version written:
data.frame(time = 1:5, pop = 50) %>%
mutate(pop = Reduce(function(x, y){x * 1.1}, pop, accumulate = TRUE))
## time pop
## 1 1 50.000
## 2 2 55.000
## 3 3 60.500
## 4 4 66.550
## 5 5 73.205
or with purrr,
data.frame(time = 1:5, pop = 50) %>%
mutate(pop = accumulate(pop, ~.x * 1.1))
## time pop
## 1 1 50.000
## 2 2 55.000
## 3 3 60.500
## 4 4 66.550
## 5 5 73.205
If the starting value of pop is, say, 50, then pop = 50 * 1.1^(0:4) will give you the starting value and the next four values. With your code, you could do:
data.frame(time = 1:5, pop = 50) %>%
  mutate(pop = pop * 1.1^(1:n() - 1))
Or,
base = 50
data.frame(time = 1:5) %>%
  mutate(pop = base * 1.1^(1:n() - 1))
purrr's accumulate function can handle time-varying parameters if you pass them to your simulation function as a list with all the parameters in it. However, it takes a bit of wrangling to get this working correctly. The trick here is that accumulate() can work on list as well as vector columns. You can use the tidyr function nest() to group columns into a list-column containing the current population state and parameters, then use accumulate() on the resulting list column. This is a bit complicated to explain, so I've included a demo, simulating logistic growth with either a constant growth rate or a time-varying stochastic growth rate. I also included an example of how to use this to simulate multiple replicates for a given model using dplyr + purrr + tidyr.
library(dplyr)
library(purrr)
library(ggplot2)
library(tidyr)
# Declare the population growth function. Note: the first two arguments
# have to be .x (the prior vector of populations and parameters) and .y,
# the current parameter value and population vector.
# This example function is a Ricker population growth model.
logistic_growth = function(.x, .y, growth, comp) {
  pop = .x$pop[1]
  growth = .y$growth[1]
  comp = .y$comp[1]
  # Note: this uses the state from .x, and the parameter values from .y.
  # The first observation will use the first entry in the vector for .x and .y
  new_pop = pop * exp(growth - pop * comp)
  .y$pop[1] = new_pop
  return(.y)
}
# Starting parameters the number of time steps to simulate, initial population size,
# and ecological parameters (growth rate and intraspecific competition rate)
n_steps = 100
pop_init = 1
growth = 0.5
comp = 0.05
#First test: fixed growth rates
test1 = data_frame(time = 1:n_steps,pop = pop_init,
growth=growth,comp =comp)
# here, the combination of nest() and group_by() split the data into individual
# time points and then groups all parameters into a new vector called state.
# ungroup() removes the grouping structure, then accumulate runs the function
#on the vector of states. Finally unnest transforms it all back to a
#data frame
out1 = test1 %>%
group_by(time)%>%
nest(pop, growth, comp,.key = state)%>%
ungroup()%>%
mutate(
state = accumulate(state,logistic_growth))%>%
unnest()
# This is the same example, except I drew the growth rates from a normal distribution
# with a mean equal to the mean growth rate and a std. dev. of 0.1
test2 = data_frame(time = 1:n_steps,pop = pop_init,
growth=rnorm(n_steps, growth,0.1),comp=comp)
out2 = test2 %>%
group_by(time)%>%
nest(pop, growth, comp,.key = state)%>%
ungroup()%>%
mutate(
state = accumulate(state,logistic_growth))%>%
unnest()
# This demonstrates how to use this approach to simulate replicates using dplyr
# Note the crossing function creates all combinations of its input values
test3 = crossing(rep = 1:10, time = 1:n_steps,pop = pop_init, comp=comp) %>%
mutate(growth=rnorm(n_steps*10, growth,0.1))
out3 = test3 %>%
group_by(rep)%>%
group_by(rep,time)%>%
nest(pop, growth, comp,.key = state)%>%
group_by(rep)%>%
mutate(
state = accumulate(state,logistic_growth))%>%
unnest()
print(qplot(time, pop, data=out1)+
geom_line() +
geom_point(data= out2, col="red")+
geom_line(data=out2, col="red")+
geom_point(data=out3, col="red", alpha=0.1)+
geom_line(data=out3, col="red", alpha=0.1,aes(group=rep)))
The problem here is that dplyr is running this as a set of vector operations rather than evaluating the terms one at a time. Here, 1.1*lag(pop) is being interpreted as "calculate the lagged values for all of pop, then multiply them all by 1.1". Since you set pop = 50, the lagged values for all the steps were 50.
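To see this concretely, here is a small illustration (not from the original answer) of what that vectorised expression actually computes:

library(dplyr)

data.frame(time = 1:5, pop = 50) %>%
  mutate(lagged = lag(pop),        # lag() is computed once over the whole column: NA, 50, 50, 50, 50
         pop_new = 1.1 * lagged)   # so every non-NA entry becomes 55; nothing is fed forward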
dplyr does have some helper functions for sequential evaluation; the standard functions cumsum, cumprod, etc. work, and a few newer ones (see ?cummean) also work within dplyr. In your example, you could simulate the model with:
tdf <- data.frame(time = 1:5, pop = 50, growth_rate = c(1, rep(1.1, times = 4))) %>%
  mutate(pop = pop * cumprod(growth_rate))
time pop growth_rate
1 50.000 1.0
2 55.000 1.1
3 60.500 1.1
4 66.550 1.1
5 73.205 1.1
Note that I added growth rate as a column here, and I set the first growth rate to 1. You could also specify it like this:
tdf <- data.frame(time = 1:5, pop = 50, growth_rate = 1.1) %>%
  mutate(pop = pop * cumprod(lead(growth_rate, default = 1)))
This makes it explicit that the growth rate column refers to the rate of growth in the current time step from the previous one.
There are limits to how many different simulations you can do this way, but it should be feasible to construct a lot of discrete-time ecological models using some combination of the cumulative functions and parameters specified in columns.
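As a small illustration of that idea (a sketch, not part of the original answer), a stochastic, time-varying growth rate can simply live in a column and be rolled up with cumprod():

library(dplyr)

set.seed(1)
data.frame(time = 1:10, pop = 50) %>%
  mutate(growth_rate = c(1, exp(rnorm(n() - 1, mean = 0.1, sd = 0.05))),  # multiplier of 1 at the first step
         pop = pop * cumprod(growth_rate))                                # cumulative product gives the trajectory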
What about the map functions, i.e.
library(purrr)

tdf <- data_frame(time = 1:5)
tdf %>% mutate(pop = map_dbl(.x = tdf$time, .f = (function(x) 50*1.1^x)))

Calculation using two subsets of variable column in long data frame with R Reshape

I have a dataframe that has two sets of data that I need to multiply for a calculation. A simple version would be
library(reshape2)

sample = data.frame(apples = c(10, 20, 25, 30, 40, NA, NA, 15))
sample$oranges = c(25, 60, 90, 86, 10, 67, 45, 10)
sample$oats = c(65, 75, 85, 95, 105, 115, 125, 135)
sample$eggs = c(23, 22, 21, 20, 19, 18, 17, 16)
sample$consumer = c('john','mark','luke','paul','peter','thomas','matthew','brian')
sample$mealtime = c('breakfast','lunch','lunch','snack','lunch','breakfast','snack','dinner')

s1 = melt(sample, id.vars = c(5, 6), measure.vars = c(1:4))
and what I'm trying to do is something along the lines of
s2 = dcast(s1, mealtime ~ ., function(x) (x[variable == 'oranges'] * x[variable == 'apples']) / sum(x[variable == 'apples']))
In practice it's a much longer data.frame and a more elaborate calculation, but the principle should be the same. Thanks -- first post to SO, so apologies for any errors.
The output would be a data frame that has mealtimes as the Id var and the apple weighted average of the orange data as the values for each mealtime.
Something along the lines of
Group.1 x
1 breakfast 1.785714
2 dinner 1.071429
3 lunch 27.500000
4 snack 18.428571
This was calculated using
sample$wa = sample$oranges*sample$apples/sum(sample$apples)
aggregate(sample$wa,by=list(sample$mealtime),sum,na.rm=T)
which feels off mathematically but was meant to be a quick kludgy approximation.
This is a much better task for plyr than it is for reshape.
library(plyr)
s1<-ddply(sample,.(mealtime), function(x) {return(sum(x$apples,x$oranges))})
And now you have clarified the output:
ddply(sample,.(mealtime), summarize,
wavg.oranges = sum(apples * oranges, na.rm=TRUE) / sum(apples, na.rm=TRUE))
# mealtime wavg.oranges
# 1 breakfast 25.00000
# 2 dinner 10.00000
# 3 lunch 45.29412
# 4 snack 86.00000
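If you prefer the tidyverse successor to plyr, the same weighted average can be written with dplyr (a sketch, not part of the original answer):

library(dplyr)

sample %>%
  group_by(mealtime) %>%
  summarise(wavg.oranges = sum(apples * oranges, na.rm = TRUE) /
                           sum(apples, na.rm = TRUE))

This gives the same four mealtime rows as the ddply call above.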

Elegant way to report missing values in a data.frame

Here's a little piece of code I wrote to report variables with missing values from a data frame. I'm trying to think of a more elegant way to do this, one that perhaps returns a data.frame, but I'm stuck:
for (Var in names(airquality)) {
  missing <- sum(is.na(airquality[,Var]))
  if (missing > 0) {
    print(c(Var, missing))
  }
}
Edit: I'm dealing with data.frames with dozens to hundreds of variables, so it's key that we only report variables with missing values.
Just use sapply
> sapply(airquality, function(x) sum(is.na(x)))
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
You could also use apply or colSums on the matrix created by is.na()
> apply(is.na(airquality),2,sum)
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
> colSums(is.na(airquality))
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
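Since the edit asks for only the variables that actually have missing values, ideally as a data.frame, the sapply result can be filtered and wrapped (a small sketch building on the answer above, not part of the original answer):

na_counts <- sapply(airquality, function(x) sum(is.na(x)))
data.frame(variable = names(na_counts)[na_counts > 0],  # keep only columns with at least one NA
           n_missing = na_counts[na_counts > 0],
           row.names = NULL)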
My new favourite for (not too wide) data are the methods from the excellent naniar package. Not only do you get frequencies but also patterns of missingness:
library(naniar)
library(UpSetR)
riskfactors %>%
as_shadow_upset() %>%
upset()
It's often useful to see where the missing values are in relation to the non-missing ones, which can be achieved by plotting a scatter plot with missings:
ggplot(airquality,
aes(x = Ozone,
y = Solar.R)) +
geom_miss_point()
Or for categorical variables:
gg_miss_fct(x = riskfactors, fct = marital)
These examples are from package vignette that lists other interesting visualizations.
We can use map_df with purrr.
library(mice)
library(purrr)
# map_df with purrr
map_df(airquality, function(x) sum(is.na(x)))
# A tibble: 1 × 6
# Ozone Solar.R Wind Temp Month Day
# <int> <int> <int> <int> <int> <int>
# 1 37 7 0 0 0 0
summary(airquality)
already gives you this information
The VIM package also offers some nice missing-data plots for data.frames:
library("VIM")
aggr(airquality)
Another graphical alternative is the plot_missing function from the excellent DataExplorer package.
The docs also point out that you can save these results for additional analysis with missing_data <- plot_missing(data).
More succinctly: sum(is.na(x[1]))
That is:
x[1]     look at the first column
is.na()  TRUE if it's NA
sum()    TRUE is 1, FALSE is 0
Another function that helps you look at missing data is df_status from the funModeling library:
library(funModeling)

# iris.2 is the iris dataset with some added NAs. You can replace this with your dataset.
df_status(iris.2)
This will give you the number and percentage of NAs in each column.
For one more graphical solution, the visdat package offers vis_miss:
library(visdat)
vis_miss(airquality)
Very similar to the Amelia output, with the small difference that it gives percentages of missings out of the box.
I think the Amelia library does a nice job of handling missing data, and it also includes a map for visualizing the missing rows.
install.packages("Amelia")
library(Amelia)
missmap(airquality)
You can also run the following code; it will return, for each row, a logical value indicating whether the row contains any NA:
row.has.na <- apply(training, 1, function(x){any(is.na(x))})
Another graphical and interactive way is to use the is.na10 function from the heatmaply library:
library(heatmaply)

heatmaply(is.na10(airquality), grid_gap = 1,
          showticklabels = c(T, F),
          k_col = 3, k_row = 3,
          margins = c(55, 30),
          colors = c("grey80", "grey20"))
It probably won't work well with large datasets, though.
A dplyr solution to get the counts could be:
summarise_all(df, ~sum(is.na(.)))
Or to get a percentage:
summarise_all(df, ~(sum(is.na(.)) / nrow(df)))
Maybe also worth noting that missing data can be ugly, inconsistent, and not always coded as NA depending on the source or how it's handled when imported. The following function could be tweaked depending on your data and what you want to consider missing:
is_missing <- function(x){
  missing_strs <- c('', 'null', 'na', 'nan', 'inf', '-inf', '-9', 'unknown', 'missing')
  ifelse((is.na(x) | is.nan(x) | is.infinite(x)), TRUE,
         ifelse(trimws(tolower(x)) %in% missing_strs, TRUE, FALSE))
}

# sample ugly data
df <- data.frame(a = c(NA, '1', ' ', 'missing'),
                 b = c(0, 2, NaN, 4),
                 c = c('NA', 'b', '-9', 'null'),
                 d = 1:4,
                 e = c(1, Inf, -Inf, 0))
# counts:
> summarise_all(df, ~sum(is_missing(.)))
a b c d e
1 3 1 3 0 2
# percentage:
> summarise_all(df, ~(sum(is_missing(.) / nrow(df))))
a b c d e
1 0.75 0.25 0.75 0 0.5
If you want to do it for a particular column, then you can also use this:
length(which(is.na(airquality[1])==T))
The ExPanDaR package's function prepare_missing_values_graph can be used to explore panel data:
For piping you could write:
# Counts
df %>% is.na() %>% colSums()
# % of missing rounded to 2 decimals
df %>% summarise_all(.funs = ~round(100*sum(is.na(.))/length(.),2))
