Extracting counts of a variable grouped at 2 levels in R

I have weather data tagged by year, month and day. Here is some of the data:
Date        MinT  Year  Month
1976-01-01   1.1  1976      1
1976-01-02   0.3  1976      1
1976-01-03   1.3  1976      1
The data run is 1976:2016 for all months. Call this TestData.
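For anyone who wants to try this, a reproducible stand-in for TestData might look like the following (the MinT values are invented; only the structure matches the sample above):
set.seed(1)
dates <- seq(as.Date("1976-01-01"), as.Date("2016-12-31"), by = "day")
TestData <- data.frame(Date  = dates,
                       MinT  = round(rnorm(length(dates), mean = 4, sd = 5), 1),  # invented values
                       Year  = as.integer(format(dates, "%Y")),
                       Month = as.integer(format(dates, "%m")))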
I can group and subset as follows (it is very clunky, but that is because I have been testing each step):
temp1 <- TestData %>%
  group_by(Year)
temp2 <- temp1 %>%
  subset(between(Month, 1, 3))
temp3 <- temp2
v1 <- replace(temp3$MinT, temp3$MinT > -2.0, 0)  ### replaces data above the threshold with 0
temp3["v1"] <- v1
index1 <- with(temp3, tapply(X = v1, INDEX = Year, FUN = sum))    ## sums the below -2 degree values for months 1-3
index2 <- with(temp3, tapply(X = v1, INDEX = Year, FUN = length)) ## counts the number of items in each year for the selected period
index2 gives me a count of the days in each year for the selected months. I can use index1 and index2 to create an index of 'weather for the month'.
What I would like is a count of all of the days below -2 (or whatever threshold) so that I get an index of comparable severity for each month.
The v1 assignment is necessary because if I use rle to count instances, months with zero instances drop from the final tally, meaning the compiled table of indices against MinT, year and month has index vectors of different lengths, which R doesn't like. I have tried rle as the FUN in the index2 assignment, but that would not let me reach the day counts. The same was true for using a range value with length in that assignment (index3).
Short of generating a mini table for each year, I am stuck. Does anyone have any suggestions?

I guess summarise is the function you are looking for. Something like this (different data, same principle):
library(latticeExtra)  # for the SeatacWeather example data
library(dplyr)

threshold <- 40
SeatacWeather %>%
  group_by(year, month) %>%
  filter(min.temp < threshold) %>%
  summarise(days_below_threshold = n())
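Note that the filter() step drops year/month groups that have no days below the threshold, which is exactly the zero-instance problem the asker mentioned. A sketch of one way to keep those groups (same SeatacWeather data, assuming dplyr) is to sum a logical condition instead of filtering first:
library(latticeExtra)
library(dplyr)

threshold <- 40
# sum(min.temp < threshold) counts the qualifying days; groups with none get 0 instead of being dropped
SeatacWeather %>%
  group_by(year, month) %>%
  summarise(days_below_threshold = sum(min.temp < threshold, na.rm = TRUE))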

Related

R replace first n column values with NA

I have a large set of stock data over a two year period. The data frame is sorted by stock id and date, i.e. first I have all data for one stock and then all data for the second stock and so on. Now I want to replace the first 29 values (rows) in a column with NA for each stock. Is there a simple way to do that?
I have tried to use:
aggregate(column~stock_id, data = df, FUN = function(x){x[1:29] <- NA})
but it does not work.
aggregate is for summarizing - you end up with 1 row per group. You want the same number of rows, so aggregate won't work for you.
I'd use dplyr:
library(dplyr)
df %>%
  group_by(stock_id) %>%
  mutate(column = case_when(row_number() < 30 ~ NA_real_, TRUE ~ column))
In base R, we can use ave:
i1 <- with(df, ave(seq_along(stock_id), stock_id, FUN = seq_along) < 30)
df[i1, setdiff(names(df), 'stock_id')] <- NA
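If df is large, a data.table version of the same idea might look like the sketch below (assuming, as in the question, that the column is literally named column):
library(data.table)
setDT(df)
# blank out the first 29 rows of each stock; min() guards against groups shorter than 29 rows
df[, column := replace(column, seq_len(min(29L, .N)), NA), by = stock_id]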

Creating a new column while using group_by, quantile and other functions takes a long time and doesn't give the desired outcome

I have a dataframe of 100 columns and 2 million rows. Among the columns, three are year, compound_id and lt_rto. Here:
length(unique(year))
30
length(unique(compound_id))
642
What I want to do is create a new column named avg_rto that is, for each year and each compound_id, the mean of the lowest 12% of lt_rto values. For example, suppose for year 2001 and compound_id xyz it will find all the values of lt_rto that are in the lowest 12% and calculate their mean. This mean will go in the rows where year == 2001 & compound_id == "xyz".
The code I came up with is:
dt <- dt %>% group_by(year, compound_id) %>%
  mutate(avg_rto = mean(dt[['lt_rto']] < quantile(fun.zero.omit(dt[['lt_rto']]),
                                                  probs = .88, na.rm = TRUE)))
Note: I also intend to omit the zero values while calculating the lower 12 % value.
The above code gives me the same value for all the observations, and it also takes a lot of time.
My problem is that I cannot figure out what's wrong with the code and how I can reduce the run time.
Thank you for your help.
You can write a function which ignores 0 values and calculates the mean of the lowest 12%.
mean_of_lower_12_perc <- function(x) {
  val <- x[x != 0]                                        # drop the zero values
  mean(sort(val)[1:(0.12 * length(val))], na.rm = TRUE)   # mean of the lowest 12%
}
Now apply this function by group.
library(dplyr)
dt %>%
  group_by(year, compound_id) %>%
  mutate(avg_rto = mean_of_lower_12_perc(lt_rto))
If your data is huge you can try data.table.
library(data.table)
setDT(dt)[, avg_rto := mean_of_lower_12_perc(lt_rto), by = .(year, compound_id)]
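As a quick sanity check (not from the original answer; the column names mirror the question but the values are made up), the grouped dplyr version can be run on a small dummy data set:
library(dplyr)
# made-up data: 2 years x 2 compounds, 25 rows per group
set.seed(42)
dt <- data.frame(year        = rep(2001:2002, each = 50),
                 compound_id = rep(c("xyz", "abc"), times = 50),
                 lt_rto      = runif(100))
# uses mean_of_lower_12_perc() defined above
dt %>%
  group_by(year, compound_id) %>%
  mutate(avg_rto = mean_of_lower_12_perc(lt_rto)) %>%
  head()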

Divide whole dataframe by mean of control group for each of several sub-groups

Starting data
I'm working in R and I have a set of data generated from groups (cohorts) of animals treated with different doses of different drugs. A simplified reproducible example of my dataset follows:
# set starting values for simulation of animal cohorts across doses of various drugs with a few numeric endpoints
cohort_size <- 3
animals <- letters[1:cohort_size]
drugs <- factor(c("A", "B", "C"))
doses <- factor(c(0, 10, 100))
total_size <- cohort_size * length(drugs) * length(doses)

# simulate data based on above parameters
df <- cbind(expand.grid(drug = drugs, dose = doses, animal = animals),
            data.frame(
              other_metadata = sample(LETTERS[24:26], size = total_size, replace = TRUE),
              num1 = rnorm(total_size, mean = 10, sd = 3),
              num2 = rnorm(total_size, mean = 60, sd = 9),
              num3 = runif(total_size, min = 1, max = 5)))
This produces something like:
## drug dose animal other_metadata num1 num2 num3
## 1 A 0 a X 6.448411 54.49473 4.111368
## 2 B 0 a Y 9.439396 67.39118 4.917354
## 3 C 0 a Y 8.519773 67.11086 3.969524
## 4 A 10 a Z 6.286326 69.25982 2.194252
## 5 B 10 a Y 12.428265 70.32093 1.679301
## 6 C 10 a X 13.278707 68.37053 1.746217
My goal
For each drug treatment, I consider the dose == 0 animals as my control group for that drug (let's say each was run at a different time and has its own control group). I wish to calculate the mean for each numeric endpoint (columns 5:7 in this example) of the control group. Next I want to normalize (divide) every numeric endpoint (columns 5:7) for every animal by the mean of its respective control group.
In other words num1 for all animals where drug == "A" should be divided by the mean of num1 for all animals where drug == "A" AND dose == 0 and so on for each endpoint.
The final output should be the same size as the original data.frame with all of the non-numeric metadata columns remaining unchanged on the left side and all the numeric data columns now with the normalized values.
Naturally I'd like to find the simplest solution possible - minimizing creation of new variables and ideally in a single dplyr pipeline if possible.
What I've tried so far
I should say that I have technically solved this but the solution is super ugly with a ton of steps so I'm hoping to get help to find a more elegant solution.
I know I can easily get the averages for the control groups into a new data.frame using:
df %>%
  filter(dose == 0) %>%
  group_by(drug, dose) %>%
  summarise_all(mean)
I've looked into several things but can't figure out how to implement them. In order of what seems most promising to me:
dplyr::group_modify()
dplyr::rowwise()
sweep() in some type of loop
Thanks in advance for any help you can offer!
If the intention is to divide the numeric columns by the mean of the control group values within each 'drug', then after grouping by 'drug', use mutate with across (from dplyr 1.0.0) to divide the column values (.) by the mean of the values where 'dose' is 0:
library(dplyr) # 1.0.0
df %>%
  group_by(drug) %>%
  mutate(across(where(is.numeric), ~ ./mean(.[dose == 0])))
If the dplyr version is < 1.0.0, use mutate_if:
df %>%
  group_by(drug) %>%
  mutate_if(is.numeric, ~ ./mean(.[dose == 0]))
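A quick way to verify the result (a sketch, not part of the answer above): after normalising, the dose == 0 rows within each drug should average exactly 1 for every numeric endpoint.
library(dplyr)
df %>%
  group_by(drug) %>%
  mutate(across(where(is.numeric), ~ ./mean(.[dose == 0]))) %>%
  filter(dose == 0) %>%
  summarise(across(where(is.numeric), mean))   # should all be 1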

Rearrange dataframe to fit longitudinal model in R

I have a dataframe where each entry relates to a job posting in the NHS specifying the week the job was posted, and what NHS Trust (and region) the job is in.
At the moment my dataframe looks something like this:
set.seed(1)
df1 <- data.frame(
  NHS_Trust = sample(1:30, 20, T),
  Week = sample(1:10, 20, T),
  Region = sample(1:15, 20, T))
And I would like to count the number of jobs for each week across each NHS Trust and assign that value to a new column 'jobs' so my dataframe looks like this:
set.seed(1)
df2 <- data.frame(
  NHS_Trust = rep(1:30, each = 10),
  Week = rep(seq(1, 10), 30),
  Region = rep(as.integer(runif(30, 1, 15)), 1, each = 10),
  Jobs = rpois(10 * 30, lambda = 2))
The dataframe may then be used to create a Poisson longitudinal multilevel model where I may model the number of jobs.
Using the data.table package you can group, count and assign to a new column in a single expression. The syntax for data.tables is dt[i, j, by]. Here i is the row selection ("where"); it is empty in this case, so all rows are used in their original order. The j says what is to be done, here counting the number of occurrences using .N, which is then assigned to the new variable count with the assignment operator :=. The by takes a list of variables, and the j operation is performed within each group.
library(data.table)
setDT(df1)
df1[, count := .N, by = .(NHS_Trust, Week, Region)]
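The desired df2 in the question contains every Trust/Week combination, including weeks with zero jobs, whereas the line above only adds counts to rows that exist. A sketch for padding out the full grid in data.table (the 1:30 and 1:10 ranges are taken from the question) could be:
library(data.table)
setDT(df1)
counts <- df1[, .(Jobs = .N), by = .(NHS_Trust, Week, Region)]
# cross join of all Trust/Week combinations, then fill the missing counts with 0
full_grid <- counts[CJ(NHS_Trust = 1:30, Week = 1:10), on = .(NHS_Trust, Week)]
full_grid[is.na(Jobs), Jobs := 0L]
# Region is NA on the padded rows; it would need to be filled per Trust afterwards if required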
A tidyverse approach would be
library(tidyverse)
df1 <- df1 %>%
  group_by(NHS_Trust, Week, Region) %>%
  count()
You can use count to count the number of jobs for each Region, NHS_Trust and Week, and complete to fill in the missing combinations.
library(dplyr)
df1 %>%
  count(Region, NHS_Trust, Week, name = 'Jobs') %>%
  tidyr::complete(Region, Week = 1:10, fill = list(Jobs = 0))
I guess I'm moving my comment to an answer:
df2 <- df1 %>% group_by(Region, NHS_Trust, Week) %>% count()
colnames(df2)[4] <- "Jobs"
df2$combo <- paste0(df2$Region, "_", df2$NHS_Trust, "_", df2$Week)
for (i in 1:length(unique(df2$Region))){
  for (j in 1:length(unique(df2$NHS_Trust))){
    for (k in 1:length(unique(df2$Week))){
      curr_combo <- paste0(unique(df2$Region)[i], "_",
                           unique(df2$NHS_Trust)[j], "_",
                           unique(df2$Week)[k])
      if (!curr_combo %in% df2$combo){
        curdat <- data.frame(unique(df2$Region)[i],
                             unique(df2$NHS_Trust)[j],
                             unique(df2$Week)[k],
                             0,
                             curr_combo,
                             stringsAsFactors = FALSE)
        #cat(curdat)
        names(curdat) <- names(df2)
        df2 <- rbind(as.data.frame(df2), curdat)
      }
    }
  }
}
tail(df2)
# Region NHS_Trust Week Jobs combo
# 4495 15 1 4 0 15_1_4
# 4496 15 1 5 0 15_1_5
# 4497 15 1 8 0 15_1_8
# 4498 15 1 3 0 15_1_3
# 4499 15 1 6 0 15_1_6
# 4500 15 1 9 0 15_1_9
The for loop here checks which Region-NHS_Trust-Week combinations are missing from df2 and appends those to df2 with a corresponding Jobs value of 0. The checking is done with the help of the new variable combo, which is just a concatenation of the values in the fields mentioned earlier, separated by underscores.
Edit: I am plenty sure the people here can come up with something more elegant than this.

How to count the occurrence of permutations in a data set in R?

I have a question on how to count the occurrence of specified permutations in a data set in R.
I am currently working on continuous-glucose-monitoring data sets. Shortly, each data set has between 1500 to 2000 observations (each observation is a plasma glucose value measured every 5 minutes over 6 days).
I need to count the occurrence of glucose values below 3.9 occurring for 15 minutes or more and less than 120 minutes in a row (>3 observations and <24 observations for values <3.9 in a row) on a numeric scale.
I have made a new variable with a factor 1 or 0 for whether the plasma glucose value is below 3.9 or not.
I would then like to count the number of occurrences of permutations > three 1’s in a row and < twenty-four 1’s in a row.
Is there a function in R for this or what would be the easiest approach?
I'm not sure if I got your data structure right, but maybe the following code can still help.
I'm assuming a data-structure that includes Measurement, person-id and measurement-id.
library(dplyr)
# create dummy data
set.seed(123)
data_test = data.frame(measure = rnorm(100, 3.5, 2),
                       person_id = rep(1:10, each = 10),
                       measure_id = rep(1:10, 10))
# indicator for measures below the critical value
data_test$below_criterion = 0
data_test$below_criterion[which(data_test$measure < 3.9)] = 1
# indicator that shows if the current measurement is the first one below crit_val in a possible series
# shift columns, to compare the current value with the previous one
data_test = data_test %>% group_by(person_id) %>%
  mutate(prev_below_crit = c(below_criterion[1], below_criterion[1:(n()-1)]))
data_test$start_of_run = 0 # create the indicator variable
# if the current value is below crit and the previous value is above, this is the start of a series
data_test$start_of_run[which(data_test$below_criterion == 1 & data_test$prev_below_crit == 0)] = 1
# helper variable to group all the possible series within a person
data_test = data_test %>% group_by(person_id) %>% mutate(grouper = cumsum(start_of_run))
# get rid of the previously created helper variables
data_test = data_test %>% select(measure, person_id, measure_id, below_criterion, grouper)
# count the length of each series by summing up all below_crit indicators within a person and series
data_results = data_test %>% group_by(person_id, grouper) %>%
  summarise(count_below_crit = sum(below_criterion))
# count all series within the desired length for each person
data_results = data_results %>% group_by(person_id) %>%
  filter(count_below_crit >= 3 & count_below_crit <= 24) %>%
  summarise(n())
data_results
data.frame(data_test)
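Since the question asks whether there is a function for this: rle() computes run lengths directly, so an alternative sketch on the same data_test object (with the 3 and 24 bounds mirroring the filter above) would be:
# count, per person, the runs of below-threshold values that are 3 to 24 observations long
data_test %>%
  group_by(person_id) %>%
  summarise(n_runs = {
    r <- rle(below_criterion)
    sum(r$values == 1 & r$lengths >= 3 & r$lengths <= 24)
  })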
