R, interval, mutate

I have a dataset about animals.
library(tidyverse)
a <- c("Date", "Specie", "Number")
b <- c("2020-01-01", "Dog", "3")
c <- c("2020-01-02", "Dog", "4")
d <- c("2020-01-03", "Dog", "5")
e <- c("2020-01-04", "Dog", "6")
f <- c("2020-01-01", "Cat", "3")
g <- c("2020-01-02", "Cat", "7")
h <- c("2020-01-03", "Cat", "8")
i <- c("2020-01-04", "Cat", "10")
df <- as.data.frame(rbind(b, c, d, e, f, g, h, i))
names(df) <- a
df$Date <- as.Date(df$Date)
df$Number <- as.integer(df$Number)
start <- as.Date("2020-01-02")
end <- as.Date("2020-01-04")
df %>%
  filter(Date >= start & Date <= end) %>%
  group_by(Specie) %>%
  summarise(new = prod(10 + Number), .groups = "drop")
The goal is to create a new variable (using the tidyverse) that gives me the following:
For each species, between 2020-01-02 and 2020-01-04 (inclusive), I want a new variable that is the product of (10 + number of dead animals that day).
For example, for dogs it would be (10+4)*(10+5)*(10+6).
Same for all species.
Please note that for some species, I don't have the number of dead animals for all the days in the interval.
Is dropping them the best option?
If yes, how do you do it?
Note that the code filters, and hence keeps only the rows of my dataset for the dates specified. I want to return the output that the code delivers, but in my original dataset.
That is, the output should be a new variable (via mutate) for all species, not a subset of my dataset.
I did a left join to merge the original dataset with the new (filtered) one. It works, but I think there's a more efficient way.
Thanks for your help, much appreciated.

If there are some numbers missing in Number, I can think of three ways of handling those missing values:
1. Set them to NA and use prod(..., na.rm = TRUE) to remove them from the calculation of the product.
2. Set them to 0 and use prod(..., na.rm = FALSE); each missing day then still inflates the product by a factor of (10 + 0) = 10.
3. If you want to preserve some mean (arithmetic or geometric) of the factors (10 + a_i), set the missing values to that mean minus 10.
In cases 2 and 3 you can, for example, give a lower bound for the product: prod >= 10^n. (Taking the logarithm to base 10 on both sides yields log10(prod) = sum of the log10 factors >= n.) But maybe you want to reserve case 2 for those rows that really have zero dead animals. A sketch of the three options follows below.
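A minimal sketch of the three options (the sample data here has no NAs, so these are purely illustrative; dfw is just a helper for the filtered window):
library(tidyverse)

# Same date window as in the question
dfw <- df %>% filter(Date >= start & Date <= end)

# Option 1: keep the NAs and drop them inside prod()
dfw %>%
  group_by(Specie) %>%
  summarise(new = prod(10 + Number, na.rm = TRUE), .groups = "drop")

# Option 2: treat a missing count as 0, so each missing day contributes a factor of 10
dfw %>%
  mutate(Number = coalesce(Number, 0L)) %>%
  group_by(Specie) %>%
  summarise(new = prod(10 + Number), .groups = "drop")

# Option 3: replace NAs with (mean of the observed factors) - 10, per species
dfw %>%
  group_by(Specie) %>%
  mutate(Number = coalesce(as.numeric(Number), mean(10 + Number, na.rm = TRUE) - 10)) %>%
  summarise(new = prod(10 + Number), .groups = "drop")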
Regarding your second point: use mutate with ifelse to flag your wanted dates (instead of filter), and additionally group by this new flag. Then use mutate again instead of summarise.
df %>%
  mutate(new = ifelse(Date >= start & Date <= end, 1, NA)) %>%
  group_by(Specie, new) %>%
  mutate(new = ifelse(!is.na(new), prod(10 + Number, na.rm = FALSE), new))
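For comparison, here is a one-pass variant (my sketch, not part of the original answer; in_window is just an illustrative helper column) that keeps every row and writes the windowed product only onto the flagged rows:
df %>%
  group_by(Specie) %>%
  mutate(in_window = Date >= start & Date <= end,
         new = if_else(in_window, prod((10 + Number)[in_window]), NA_real_)) %>%
  ungroup() %>%
  select(-in_window)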

Related

Using a loop to create columns based on two data frames

I have a situation where I think a loop would be appropriate to avoid repeating chunks of code.
I have two data frames which look like the following:
patid <- seq(1, 10)
date_of_session <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 10)
date_of_referral <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 10)
df1 <- data.frame(patid, date_of_session, date_of_referral)
patid1 <- sample(seq(1, 10), 50, replace = TRUE)
eventdate <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 50)
comorbidity <- sample(c("hypertension", "stroke", "AF"), 50, replace = TRUE)
df2 <- data.frame(patid = patid1, eventdate, comorbidity)
I need to repeat the following code for each comorbidity in df2. It generates a binary (1/0) column per comorbidity, based on whether the earliest "eventdate" (diagnosis) came before "date_of_session", or before "date_of_referral" if "date_of_session" is NA, for each patient.
df_comorb <- df2 %>%
  filter(comorbidity == "hypertension") %>%
  group_by(patid) %>%
  filter(eventdate == min(eventdate))
df1 <- left_join(df1, df_comorb, by = "patid")
df1 <- df1 %>%
  mutate(hypertension_baseline = ifelse(eventdate < date_of_session | eventdate < date_of_referral, 1, 0)) %>%
  replace_na(list(hypertension_baseline = 0)) %>%
  select(-eventdate)
I'd like to avoid repeating the code for each of the 27 comorbid conditions in the full dataset. I figured a loop would be the best way to repeat this for each comorbidity but I don't know how to approach writing one for this problem.
Any help would be appreciated.
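One way to write that loop (a sketch, not from the original thread; column names as in the example data above): compute the earliest diagnosis per patient for each condition, join it on, and build the column name programmatically. coalesce() implements the "use date_of_referral when date_of_session is NA" rule, and the missing argument of if_else() plays the role of replace_na():
library(dplyr)

for (cond in unique(df2$comorbidity)) {
  # earliest diagnosis date per patient for this condition
  earliest <- df2 %>%
    filter(comorbidity == cond) %>%
    group_by(patid) %>%
    summarise(eventdate = min(eventdate), .groups = "drop")

  newcol <- paste0(cond, "_baseline")
  df1 <- df1 %>%
    left_join(earliest, by = "patid") %>%
    mutate(!!newcol := if_else(
      eventdate < coalesce(date_of_session, date_of_referral),
      1, 0,
      missing = 0  # patients with no diagnosis of this condition get 0
    )) %>%
    select(-eventdate)
}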

Select the max row per group (recode it as "Last", keep the others blank), then recode a new variable with multiple conditions across multiple variables

I am trying to select the max number for rows within each group, recode that number as "Last", and keep the others blank (in the dataframe below, the new variable is "Z"). After that, I want to create a new variable with multiple conditions corresponding to other variables (in the dataframe below, the new variable is "X").
Dataframe is:
ID = c(1,1,1,1,2,2,3,3,3,4,4)
Care = c("Yes","Yes","Yes","Yes","Yes","No","Yes","No","Yes","No","No")
Y = c(1,2,3,4,1,2,1,2,3,1,2)
Z = c("", "", "", "Last","","Last","","","Last","","Last")
X = c("","","","Always","","Lost","","","Linked","","Never")
df <- data.frame(ID,Care,Y,Z,X)
df
I am able to create Y using this code:
main <- df %>% group_by(ID) %>% mutate(Y = row_number())
But I want to create the new variables "Z" and "X" in my dataframe. X would be: if Care is "Yes" in all rows within a group, "Always"; if Care is "No" in all rows, "Never"; if Care is "Yes" earlier but "No" at the last row, "Lost"; if Care is "Yes" or "No" earlier but "Yes" at the last row, "Linked".
Here I am able to create the Z variable (I still need to create X):
main %>% group_by(ID) %>% mutate(Z = row_number() >= which.max(Y))
I have been struggling with this for a while now. Any help would be greatly appreciated!
Easy! :)
You can save that step of working with which.max(Y) and instead just compare row_number() against n() in each group.
Creating Z is just an easy ifelse statement, and what I assume caused you a little trouble in creating X can be solved with case_when() to work through the four cases you describe. First, check whether all() observations within the group meet your condition of being "Yes" or "No", then check the two "mixed" cases afterwards.
This is what you're looking for:
library(dplyr)

df <- tibble(
  ID = c(1,1,1,1,2,2,3,3,3,4,4),
  Care = c("Yes","Yes","Yes","Yes","Yes","No","Yes","No","Yes","No","No")
)

df2 <- df %>%
  group_by(ID) %>%
  mutate(
    Z = ifelse(row_number() == n(), "Last", ""),
    X = case_when(
      Z == "" ~ "",
      all(Care == "Yes") ~ "Always",
      all(Care == "No") ~ "Never",
      Care == "Yes" ~ "Linked",
      Care == "No" ~ "Lost"
    )
  )
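As a quick check, the last row of each group reproduces the X values listed in the question:
df2 %>% filter(Z == "Last") %>% pull(X)
#> [1] "Always" "Lost"   "Linked" "Never"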

Time series function in dplyr

I am working with data that stops in a specific year and is NA afterwards, and I need to calculate a lot of variables based on lagged values of other variables. I would like to find a way to calculate the whole series at once, instead of one year at a time whenever one of the variables is NA. I was looking at dplyr, given that I am working with panel data and thus need to group by ID.
I provide the example below:
library(dplyr)
set.seed(1)
df <- data.frame(
  year = c(seq(2000, 2018), seq(2000, 2018)),
  id = c(rep(1, 19), rep(2, 19)),
  varA = floor(rnorm(38) * 100),
  varB = floor(rnorm(38) * 100),
  varC = floor(rnorm(38) * 100)
)
df <- df %>%
  mutate(varA = if_else(year > 2010, as.double(NA), varA),
         varB = if_else(year > 2010, as.double(NA), varB),
         varC = if_else(year > 2010, as.double(NA), varC)) %>%
  group_by(id) %>%
  arrange(year)
What I would like is to find a way to calculate a variable that is equal to varC when it is available, but afterwards is equal to a formula based on lagged values of varC, varB, and varA. When executing the code below, varRESULT and varD are only calculated for one year, given that the lags are only available for one year:
df <- df %>%
  mutate(varD = lag(varA) * lag(varB),
         varRESULT = if_else(is.na(varC), lag(varC, 1) / lag(varD, 2) * lag(varD, 1), varC))
But I would like to find a way to calculate the whole series immediately (taking into account the panel dimension of the data) instead of having to repeat the code 7 times. Preferably a solution where you can calculate varD separately from varRESULT, given that in the final application I have multiple variables that are linked to each other.
Proposed solution:
Starting with the first NA, the "recursive" lags of varA, varB, and varC are equal to the last observed value of these variables.
Thus, starting from these initial variables, we can create new variables varA1, varB1, and varC1, where we fill the NAs with the last value, by id:
library(dplyr)
library(tidyr)  # for the function `fill`

df <- df %>%
  mutate(varA1 = varA, varB1 = varB, varC1 = varC) %>%
  group_by(id) %>%
  arrange(year) %>%
  fill(varA1, varB1, varC1)  # fills each NA with the last non-NA value
Then, we apply the formula:
df <- df %>%
  mutate(varD = lag(varA1) * lag(varB1),
         varRESULT = if_else(is.na(varC), lag(varC1, 1) / lag(varD, 2) * lag(varD, 1), varC)) %>%
  select(-varA1, -varB1, -varC1)
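If the computed values must themselves feed into the next year's lags (a truly recursive update, which the fill()-based approximation above sidesteps), purrr::accumulate() can run the recursion in a single pass. A minimal sketch with a hypothetical one-variable rule, not the exact formula from the question:
library(purrr)

# Hypothetical rule: once x stops being observed, x_t = x_{t-1} * r_t.
x <- c(100, 110, NA, NA, NA)
r <- c(1.00, 1.10, 1.05, 1.02, 1.01)

accumulate(seq_along(x)[-1], .init = x[1], function(prev, i) {
  if (!is.na(x[i])) x[i] else prev * r[i]
})
#> [1] 100.0000 110.0000 115.5000 117.8100 118.9881
For panel data, the same call can be wrapped in a grouped mutate or applied per id.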

What is the most efficient way to perform a t.test from tidy data in R?

I'm working with a dataset that was poorly formatted, and I'm trying to get it into a tidy format for statistical testing and data visualization. I'm hoping someone can provide some insight on whether I have the data in the correct tidy format, and what the simplest way is to perform multiple t.tests.
Here is some sample data similar to my untidied format:
library(tidyverse)
data <- data.frame("subject_id" = 1:10, "age" = 21:30, "weight" = 150:159, "height" = 65:74,
"x_c1_avg" = c(1:9, NA), "y_c1_avg" = runif(10),"z_c1_avg" = c(9:1, NA),
"x_c2e1_avg" = c(1:9, NA), "y_c2e1_avg" = runif(10), "z_c2e1_avg" = runif(10),
"x_c2e2_avg" = runif(10), "y_c2e2_avg" = runif(10), "z_c2e2_avg" = runif(10))
glimpse(data)
The tibble contains demographic information and then three measures, collected under different conditions, with some of the measures being performed by two examiners (e.g. x_c1_avg is the average for measure x collected at condition 1 (a certain leg position), and y_c2e1_avg is the average for measure y collected by examiner 1 at condition 2).
So my first question: am I correct that the output of the code below would be considered tidy? Measure, condition, and examiner are each in their own columns, with the values in another column.
data2 <- data %>%
gather(key = "condition", value = "value", -c(subject_id:height)) %>%
separate(condition, into = c("measure", "condition"), sep = "_", extra = "drop") %>%
separate(condition, into = c("condition", "examiner"), sep = 2, fill = "right")
My second question: what is the most efficient way to perform a paired t.test on this data, or is there a way to do this without creating new vectors for each variable? There are 12 conditions in total, but I'll only be performing t.tests for six comparisons. I'd be comparing measure x at c1 to measure x at c2 for each subject, or measure y by examiner 1 at condition 2 to measure y by examiner 2 at condition 2, and so on. My current code is:
x_c1 <- data2 %>%
  filter(measure == "x", condition == "c1") %>%
  select(value)
x_c2_e2 <- data2 %>%
  filter(measure == "x", condition == "c2", examiner == "e2") %>%
  select(value)
t.test(x_c1$value, x_c2_e2$value, paired = TRUE)
However, this seems much more complicated than it needs to be, and it feels like I'm reversing the work I did to get the data into tidy format. It would have been much easier to run this from the start:
t.test(data$x_c1_avg, data$x_c2e2_avg, paired = TRUE)
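One way to run the tests without building a vector per variable (a sketch, not from the original post): list the planned comparisons once and map a paired t.test over them. Note that after the separate() calls above, rows from condition c1 carry an empty string in examiner:
library(dplyr)
library(purrr)

# Pull the value column for one measure/condition/examiner combination,
# ordered by subject so the pairing lines up.
pull_values <- function(m, cond, exam) {
  data2 %>%
    filter(measure == m, condition == cond, examiner == exam) %>%
    arrange(subject_id) %>%
    pull(value)
}

# Name the six comparisons here; two shown for illustration.
comparisons <- list(
  x_c1_vs_c2e2   = list(m = "x", a = c("c1", ""),   b = c("c2", "e2")),
  y_c2e1_vs_c2e2 = list(m = "y", a = c("c2", "e1"), b = c("c2", "e2"))
)

results <- map(comparisons, function(cmp) {
  t.test(pull_values(cmp$m, cmp$a[1], cmp$a[2]),
         pull_values(cmp$m, cmp$b[1], cmp$b[2]),
         paired = TRUE)
})
results$x_c1_vs_c2e2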

Using apply to replace nested for loop

My goal is to go through various signals and ignore any 1's that are not part of a series (a minimum of at least two 1's in a row). The data is an xts time series with 180K+ columns and 84 months. I've provided a small, simplified data set. I've used a nested for loop, but it's taking way too long to finish on the entire data set. It works but is horribly inefficient.
I know there's some way to use an apply function, but I can't figure it out.
Example data:
library(xts)

mod_sig <- data.frame(a = c(0,1,0,0,0,1,1,0,0,0,1,0,1,1),
                      b = c(0,0,1,0,0,1,0,0,0,1,1,1,1,1),
                      c = c(0,1,0,1,0,1,1,1,0,0,0,1,1,0),
                      d = c(0,1,1,1,0,1,1,0,0,1,1,1,1,1),
                      e = c(0,0,0,0,0,0,0,0,0,0,1,0,0,0))
mod_sig <- xts(mod_sig, order.by = as.Date(seq(as.Date("2016-01-01"), as.Date("2017-02-01"), by = "month")))
Example code:
# fixing months where the condition is only met for one month
# creating a new data frame for modified signals
Signals_Fin <- data.frame(matrix(nrow = nrow(mod_sig), ncol = ncol(mod_sig)))
colnames(Signals_Fin) <- colnames(mod_sig)
# Loop over signals to change 1's to 0's for one-month events,
# treating the missing neighbours at the first and last rows as 0
for (col in 1:ncol(mod_sig)) {
  for (row in 1:nrow(mod_sig)) {
    prev <- if (row > 1) mod_sig[row - 1, col] else 0
    nxt  <- if (row < nrow(mod_sig)) mod_sig[row + 1, col] else 0
    Signals_Fin[row, col] <- ifelse(mod_sig[row, col] == 1,
                                    ifelse(prev == 0,
                                           ifelse(nxt == 0, 0, 1), 1), 0)
  }
}
As you can see with the loop, any 1's that aren't in a sequence are changed to 0's. I know there is a better way, so I'm hoping to improve my approach. Any insights would be greatly appreciated. Thanks!
Answer from Zack and Ryan:
Zack and Ryan were spot on with dplyr; I only made slight modifications based on what was given and some help from a colleague.
Answer code:
mod_sig <- data.frame(a = c(0,1,0,0,0,1,1,0,0,0,1,0,1,1),
                      b = c(0,0,1,0,0,1,0,0,0,1,1,1,1,1),
                      c = c(0,1,0,1,0,1,1,1,0,0,0,1,1,0),
                      d = c(0,1,1,1,0,1,1,0,0,1,1,1,1,1),
                      e = c(0,0,0,0,0,0,0,0,0,0,1,0,0,0))

Signals_fin <- mod_sig %>%
  mutate_all(funs(ifelse(. == 1 & (lag(.) == 1 | lead(.) == 1), 1, 0))) %>%
  mutate_all(funs(ifelse(is.na(.), 0, .)))

Signals_fin <- xts(Signals_fin, order.by = as.Date(seq(as.Date("2016-01-01"), as.Date("2017-02-01"), by = "month")))
Here's a stab from a dplyr perspective. I converted your rownames to a column, but you can just as easily convert them back with tibble::column_to_rownames():
library(dplyr)
library(tibble)
mod_sig %>%
  as.data.frame() %>%
  rownames_to_column('months') %>%
  mutate_at(vars(-months), function(x) {
    if_else(x == 1 &
              (lag(x, order_by = .$months) == 1 |
                 lead(x, order_by = .$months) == 1),
            1,
            0)
  })
As suggested by @Ryan, his mutate_at call is more elegant; it's important that everything is already sorted, though:
mod_sig %>%
  as.data.frame() %>%
  rownames_to_column('months') %>%
  mutate_at(vars(-months), ~ as.numeric(.x & (lag(.x) | lead(.x))))
And to build on his suggestion:
mod_sig %>%
  as.data.frame() %>%
  mutate_all(~ as.numeric(.x & (lag(.x) | lead(.x))))
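One edge case worth noting (my addition, not from the answers): in the first and last rows, lag() and lead() return NA, so a leading or trailing 1 can come out as NA instead of 0. A coalesce() guard settles it:
library(dplyr)

mod_sig %>%
  as.data.frame() %>%
  mutate_all(~ as.numeric(.x & (coalesce(lag(.x), 0) | coalesce(lead(.x), 0))))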
