dplyr::mutate changes row numbers, how to keep them? - r

I am using lme4::lmList on a tibble to obtain the coefficients of linear fit lines fitted for each subject (id) in my data. What I actually want is a nice long chain of pipes because I don't want to keep any of this output, just use it for a slope/intercept plot. However, I am running into a problem. lmList is creating a dataframe where the row numbers are the original subject ID numbers. I want to keep that information, but as soon as I use mutate on the output, the row numbers change to be sequential from 1. I tried rescuing them first by using rowid_to_column but that just gives me a column of sequential numbers from 1 too. What can I do, other than drop out of the pipe and put them in a column with base R? Is unique(a_df$id) really the best solution? I had a look around on here but didn't see a question like this one.
library(tibble)
library(dplyr)
library(Matrix)
library(lme4)
a_df <- tibble(id = c(rep(4, 3), rep(11, 3), rep(12, 3), rep(42, 3)),
age = c(rep(seq(1, 3), 4)),
hair = 1 + (age*2) + rnorm(12) + as.vector(sapply(rnorm(4), function(x) rep(x, 3))))
# as.data.frame to get around stupid RStudio diagnostics bug
int_slope <- coef(lmList(hair ~ age | id, as.data.frame(a_df))) %>%
setNames(., c("Intercept", "Slope"))
# Notice how the row numbers are the original subject ids?
print(int_slope)
Intercept Slope
4 2.9723596 1.387635
11 0.2824736 2.443538
12 -1.8912636 2.494236
42 0.8648395 1.680082
int_slope2 <- int_slope %>% mutate(ybar = Intercept + (mean(a_df$age) * Slope))
# Look! Mutate has changed them to be the numbers 1 to 4
print(int_slope2)
Intercept Slope ybar
1 2.9723596 1.387635 5.747630
2 0.2824736 2.443538 5.169550
3 -1.8912636 2.494236 3.097207
4 0.8648395 1.680082 4.225004
# Try to rescue them with rowid_to_column
int_slope3 <- int_slope %>% rowid_to_column(var = "id")
# Nope, 1 to 4 again
print(int_slope3)
id Intercept Slope
1 1 2.9723596 1.387635
2 2 0.2824736 2.443538
3 3 -1.8912636 2.494236
4 4 0.8648395 1.680082
Thanks,
SJ

The dplyr/tidyverse universe doesn't "believe in" row names. Any data that is important for an observation should be included in a column. The tibble package includes a function to move row names into a column. Try
int_slope %>% rownames_to_column()
before any mutates.
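As a minimal sketch of that suggestion, the pipeline below moves the row names into an id column before mutate runs. The data frame is a hand-built stand-in for the lmList output (values rounded from the question), and the constant 2 stands in for mean(a_df$age); both are assumptions made for illustration.

```r
library(tibble)
library(dplyr)

# Stand-in for coef(lmList(...)): the row names hold the subject ids
int_slope <- data.frame(
  Intercept = c(2.97, 0.28, -1.89, 0.86),
  Slope     = c(1.39, 2.44, 2.49, 1.68),
  row.names = c("4", "11", "12", "42")
)

int_slope2 <- int_slope %>%
  rownames_to_column(var = "id") %>%    # row names become a character column
  mutate(id = as.integer(id),           # convert back to numeric subject ids
         ybar = Intercept + 2 * Slope)  # 2 is the mean of ages 1, 2, 3

print(int_slope2)
```

Because the ids now live in an ordinary column, later verbs cannot lose them.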

Nothing like asking for help to make you see the answer. Those aren't row numbers, they're numeric row names. Of course they are! Non-contiguous row numbers make no sense. rownames_to_column is my answer.

Why don't you just create another 'ybar' column on int_slope?
int_slope$ybar <- int_slope$Intercept + mean(a_df$age) * int_slope$Slope

Related

Mutating dataframe in R in only certain cases

I have a dataframe in R with multiple columns that I want to manipulate depending on the value for a specific column. Here is a sample dataframe set up in the same way:
dat <- data.frame(A=c("L","Right","R","Left"), B=c(2,5,7,-8), C=c(-3,6,-4,9))
dat
If the value in column A is either "L" or "Left", I want to convert the other columns in that row to negative values unless the value is already negative. For example, in the first row, I would want the 2 to change to a negative 2, but keep the -3 the same. In the fourth row, I would want the -8 to stay the same, but want to change the 9 to a -9.
I previously used this code to convert to negative values, but this was when I was working with only positive values initially and did not want to leave some values unchanged.
library(dplyr)
dat <- dat %>% mutate(across(where(is.numeric), ~ if_else(grepl("^L", A), -1, 1) * .))
I am not sure how to change the above code to address this new issue, and I would greatly appreciate if someone could help answer this question. Thanks so much!
Take the absolute value and multiply it by -1 for L and sign(.) otherwise.
dat %>%
mutate(across(-1, ~ if_else(grepl("^L", A), -1, sign(.)) * abs(.)))
giving:
A B C
1 L -2 -3
2 Right 5 6
3 R 7 -4
4 Left -8 -9
or
dat %>%
mutate(across(-1, ~ if_else(grepl("^L", A), -abs(.), .)))
Note that overwriting dat with the new value can make debugging harder: it becomes unclear which version of dat is current, and if you rerun just the line above you will be starting from the second version rather than the first. I suggest using
dat2 <- dat %>% ...whatever...
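For comparison, here is a rough base R sketch of the same rule (force the numeric columns negative on rows where A starts with "L", leave everything else alone). It is not taken from the answer above; the names is_left and num_cols are made up for the example, and the result goes into dat2 so that dat stays untouched.

```r
dat <- data.frame(A = c("L", "Right", "R", "Left"),
                  B = c(2, 5, 7, -8),
                  C = c(-3, 6, -4, 9))

is_left  <- grepl("^L", dat$A)                # rows labelled "L" or "Left"
num_cols <- names(dat)[sapply(dat, is.numeric)]

dat2 <- dat                                   # keep the original for debugging
for (col in num_cols) {
  # -abs() makes the value negative whether or not it already was
  dat2[[col]] <- ifelse(is_left, -abs(dat[[col]]), dat[[col]])
}

print(dat2)
```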

Delete duplicate rows and sum corresponding values of last column in a dataframe

If we want to remove the duplicates from a dataframe df, we need just to write df[!duplicated(df),] and duplicates will be removed from it. I have the following dataframe:
df <- data.frame(from = c("z","y","z","w","y"), to=c("x","w","x","z","w"), weight=c(2,1,3,5,6))
I would like to obtain something different. In df[,1:2], the first and third rows are identical (as are the second and fifth), and I would like to: 1) delete one row of each duplicated pair; 2) sum the corresponding values of weight. For this example, the expected result is:
from to weight
z x 5
y w 7
w z 5
Anyway, if I use:
df2=df[,1:2]
which(duplicated(df2) | duplicated(df2[nrow(df2):1, ])[nrow(df2):1])
I obtain
[1] 1 2 3 5
which does not allow me to obtain the desired result (e.g. rows 1 and 3 are equal to each other, and rows 2 and 5 are equal to each other, but this information is not contained in the latter result).
We can do a group-by sum operation instead of using duplicated
aggregate(weight~ ., df, sum)
In dplyr, this can be done using
library(dplyr)
df %>%
group_by(from, to) %>%
summarise(weight = sum(weight))
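Note that aggregate returns the groups in sorted order, not in the order shown in the question. If the order of first appearance matters, one possible sketch (base R only; the helper variable first_seen is an invented name) is to reorder the totals afterwards:

```r
df <- data.frame(from   = c("z", "y", "z", "w", "y"),
                 to     = c("x", "w", "x", "z", "w"),
                 weight = c(2, 1, 3, 5, 6))

# One row per (from, to) pair, with the weights summed
totals <- aggregate(weight ~ from + to, data = df, FUN = sum)

# Reorder the pairs by where they first appear in df
first_seen <- unique(paste(df$from, df$to))
totals <- totals[match(first_seen, paste(totals$from, totals$to)), ]
rownames(totals) <- NULL

print(totals)
```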

R code to detect a change in a variable over time for multiple patients

I have a data set with multiple rows per patient, where each row represents a 1-week period of time over the course of 4 months. There is a variable grade that can take on values of 1,2,or 3, and I want to detect when a single patient's grade INCREASES (1 to 2, 1 to 3, or 2 to 3) at any point (the result would be a yes/no variable). I could write a function to do it but I'm betting there is some clever functional programming I could do to make use of existing R functions. Here is a sample data set below. Thank you!
df=data.frame(patient=c(1,1,1,2,2,3,3,3,3),period=c(1,2,3,1,3,1,3,4,5),grade=c(1,1,1,2,3,1,1,2,3))
what I would want is a resulting data frame of:
data.frame(patient=c(1,2,3),grade.increase=c(0,1,1))
library(dplyr)
df %>%
arrange(patient, period) %>%
group_by(patient) %>% # group first so lag() never compares one patient's grade with another's
mutate(grade.increase = case_when(grade > lag(grade) ~ TRUE, TRUE ~ FALSE)) %>%
summarise(grade.increase = max(grade.increase))
Combining lag which checks the previous value with case_when allows us to identify each grade.increase.
Summarising the maximum of grade.increase for each patient gets the desired results as boolean calculations treat FALSE as 0 and TRUE as 1.
If you feel like doing this in base R, here's a solution that uses the split-apply-combine approach.
You use split to make a list with a separate data frame for each patient;
you use lapply to iterate a summarization function over each list element, where the summarization function uses diff to look at changes in grade and if and any to summarize; and then
you wrap the whole thing in do.call(rbind, ...) to collapse the resulting list into a data frame.
Here's what that looks like:
do.call(rbind, lapply(split(df, df[,"patient"]), function(i) {
data.frame(patient = i[,"patient"][1],
grade.increase = if (any(diff(i[,"grade"]) > 0)) 1 else 0 )
}))
Result:
patient grade.increase
1 1 0
2 2 1
3 3 1
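A shorter base R variant of the same idea, offered only as a sketch, lets tapply do the splitting and applying in one call. Unlike the code above, it sorts by period first, on the assumption that the rows may not already be in time order.

```r
df <- data.frame(patient = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 period  = c(1, 2, 3, 1, 3, 1, 3, 4, 5),
                 grade   = c(1, 1, 1, 2, 3, 1, 1, 2, 3))

# Make sure each patient's grades are in time order before differencing
df <- df[order(df$patient, df$period), ]

# For each patient: did any consecutive pair of grades go up?
inc <- tapply(df$grade, df$patient,
              function(g) as.integer(any(diff(g) > 0)))

result <- data.frame(patient        = as.numeric(names(inc)),
                     grade.increase = as.vector(inc))
print(result)
```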

Dplyr/Lubridate: How to summarise overlapping intervals after grouping

I would like to group agreements and then compare how much their periods overlap (or are apart).
My dataframe may look like:
library(tidyverse)
library(lubridate)
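dft <- tribble(
~ShipTo, ~Code, ~Start, ~End,
"xxxx", "AAA11", "2018-01-01", "2018-03-01",
"yyyy", "BBB23", "2018-02-01", "2018-05-11",
"yyyy", "BBB23", "2018-03-01", "2018-06-11",
"cccc", "AAA11", "2018-01-06", "2018-03-12",
"yyyy", "CCC04", "2018-01-16", "2018-03-31",
"xxxx", "DDD", "2018-01-21", "2018-03-25"
) %>%
mutate(Start = ymd(Start), End = ymd(End)) # unquoted 2018-01-01 would be evaluated as a subtraction, so parse quoted strings instead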
I would like to mutate a column to create lubridate periods and evaluate them after grouping by ShipTo and Code. What I tried was:
dft3<-dft %>% filter(concat1 %in% to_filter2) %>%
arrange(ShipTo,Code)%>%
group_by(ShipTo,Code)%>%
mutate(period=interval(Start,End),
nextperiod=interval(lead(Start),lead(End)),
interv=day(as.period(intersect(period, nextperiod), "days"))) %>%
group_by(ShipTo,Code)%>%
summarise(count=n(),
intervmax=max(interv),
intervmin=min(interv))
If I remove the line group_by(ShipTo,Code)%>% the intervals are created correctly and also the lead intervals are correctly calculated from the next line. But when I naively use group_by, the intervals are not calculated correctly.
I suspect that perhaps my database should be split into many tables by groups and then, after the operation of creating and comparing intervals it should be glued back together.
Is there a succinct way to do it? Or perhaps there is a simpler way I have not yet learned? Thank you in advance for the hint in the right direction.
EDIT: The desired output should be a column with the overlap of the intervals in days (or the distance between intervals if there is no overlap). Grouping destroys the calculation. I would like to have these values calculated within groups (not across them).
EDIT2: I am trying to solve the problem by splitting the dataframe into a list of dataframes and then combining it, but I am not sure of the syntax. The code below, which I was given as help on another portal, does not quite work: it produces tables with one column (perhaps it can illustrate the issue). The idea is to split the data, create new columns, and combine the tables into a single table.
fnOverlaps <- function(x) {
mutate(x,okres=interval(Start,End),
nastokres=interval(lead(Start),lead(End)),
interv=day(as.period(intersect(okres, nastokres), "days")))
}
dft3<-dft3 %>%
split(list(.$ShipTo, .$Code), drop = TRUE) %>%
map_df(fnOverlaps) %>%
flatten_dfr()
The result (for one group) that I expect would look like this.
tribble(
~ShipTo, ~Code, ~interv,
"yyyy", "BBB23", 70, # say there is a 70 days overlap
"yyyy", "BBB23", NA # there is no next row to compare
)
It looks like the issue is being caused by trying to combine vectors with the class "Interval." Specifically, they appear to be getting converted to numeric and losing their inherent information.
I think the only viable solution is to split the data.frame, run the analysis on each component separately with lapply, then bring them back together with bind_rows. Groups with only one entry present an issue, as max and min return -Inf and Inf when the argument is empty after removing NAs. But that is easy enough to correct for.
This code should work. Note that I am using group_by to ensure the ShipTo/Code columns are kept, though you could do that in other ways.
dft %>%
split(paste(.$ShipTo, "XXX", .$Code)) %>%
lapply(function(x){
x %>%
arrange(ShipTo,Code) %>%
mutate(period=interval(Start,End)
, nextperiod=interval(lead(Start),lead(End))
, interv=day(as.period(intersect(period, nextperiod), "days"))
) %>%
group_by(ShipTo,Code)%>%
summarise(count=n(),
intervmax=max(interv, na.rm = TRUE),
intervmin=min(interv, na.rm = TRUE)) %>%
ungroup()
}) %>%
bind_rows() %>%
mutate(intervmax = ifelse(is.infinite(intervmax)
, NA, intervmax)
, intervmin = ifelse(is.infinite(intervmin)
, NA, intervmin))
Returns
# A tibble: 5 x 5
ShipTo Code count intervmax intervmin
<chr> <chr> <int> <dbl> <dbl>
1 cccc AAA11 1 NA NA
2 xxxx AAA11 1 NA NA
3 xxxx DDD 1 NA NA
4 yyyy BBB23 2 71.0 71.0
5 yyyy CCC04 1 NA NA
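If the Interval class is what breaks under group_by, one way around it (a different technique from the split/lapply code above, sketched here under that assumption) is to skip Interval objects and do the arithmetic on plain day counts. The overlap of two periods is min(End) - max(Start); a negative value is then the gap between them. The column name overlap_days is invented for the example.

```r
library(dplyr)

dft <- data.frame(
  ShipTo = c("xxxx", "yyyy", "yyyy", "cccc", "yyyy", "xxxx"),
  Code   = c("AAA11", "BBB23", "BBB23", "AAA11", "CCC04", "DDD"),
  Start  = as.Date(c("2018-01-01", "2018-02-01", "2018-03-01",
                     "2018-01-06", "2018-01-16", "2018-01-21")),
  End    = as.Date(c("2018-03-01", "2018-05-11", "2018-06-11",
                     "2018-03-12", "2018-03-31", "2018-03-25"))
)

overlaps <- dft %>%
  arrange(ShipTo, Code, Start) %>%
  group_by(ShipTo, Code) %>%          # lead() now stays inside each group
  mutate(
    # Dates as day counts, so nothing class-specific can be lost
    overlap_days = pmin(as.numeric(End), as.numeric(lead(End))) -
                   pmax(as.numeric(Start), as.numeric(lead(Start)))
  ) %>%
  ungroup()

print(overlaps)
```

The last row of each group has no following row to compare with, so its overlap_days is NA.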
I am posting this just for the record. I received an answer from Jake Knaupp on the r4ds Slack group that uses the modern map_df() syntax. It calculates the overlap of the periods but converts the periods to numeric, and it emits a number of warnings saying so.
myFun <- function(x) {
mutate(x,period=interval(Start,End),
nextperiod=interval(lead(Start),lead(End)),
interv=day(as.period(intersect(period, nextperiod), "days")))
}
df %>%
split(list(.$ShipTo, .$Code), drop = TRUE) %>%
map_df(myFun)

R: identify the factor associated with the highest sum of values for multiple groups

Consider this:
plot=c("A","A","A","A","B","B","B","B")
mean=c(3,5,40,0,3,5,3,0)
sp=c("ch","ch","ag",NA,"ch","ag","ch",NA)
df=data.frame(plot,mean,sp)
plot mean sp
1 A 3 ch
2 A 5 ch
3 A 40 ag
4 A 0 <NA>
5 B 3 ch
6 B 5 ag
7 B 3 ch
8 B 0 <NA>
I'd like to figure out some code that will return the "sp" from each "plot" with the highest cumulative "mean" value. For the example above, I'd like to return this:
plot=c("A","B")
sp=c("ag","ch")
df=data.frame(plot,sp)
plot sp
1 A ag
2 B ch
In case that wasn't clear, for plot A, the sp "ag" is returned becasue it has the highest cumulative mean value (40) for the plot. For plot B, "ch" is returned because it has the highest cumulative value (6). The values are not important to me; I want only the most dominant sp by cumulative mean value for each plot.
I've played around with aggregate and suspect that would be useful here, but am unsure about how to proceed.
Many thanks (this site is a huge resource for those of us new to R!)
Not sure how @jebyrnes would have done it with summarise and filter (edit: I figured it out and it's pretty simple too), but here's how I'd go about it with dplyr:
library(dplyr)
group_by(df, plot,sp) %>% summarise(sum=sum(mean)) %>% summarise(sp=sp[sum==max(sum)])
# plot sp
#1 A ag
#2 B ch
Here's an approach that uses the "data.table" package
library(data.table)
setDT(df)[, sum(mean), by=.(plot, sp)][, .(sp = sp[V1 == max(V1)]), by=plot]
# plot sp
# 1: A ag
# 2: B ch
After setting df to a data table with setDT(df), we are doing two things
[, sum(mean), by=.(plot, sp)] calculates the total of the mean column for each combination of plot and sp
[, .(sp = sp[V1 == max(V1)]), by=plot] takes the sp value for which V1 (calculated in step 1) is equal to the maximum of V1 and names that column sp, grouped by plot
You should be able to do this in two steps.
Step 1, aggregate the data frame by plot and sp and calculate the cumulative mean. You can use a package such as plyr with ddply or the dplyr package for this.
Step 2, once you've done this, for each plot output the sp with the highest cumulative mean. There are a lot of ways to do this. I'd again go with dplyr, but that's because I'm a bit besotted with it at the moment.
Actually...you can do this whole thing with 4 lines in dplyr with one line per operation, piping your way through with magrittr. 5 if you want to get rid of the cumulative means column. You just need a group_by, summarise, and filter statement. I'll post the code if you want it, but it will be far more useful for you to go read, say, http://seananderson.ca/2014/09/13/dplyr-intro.html and try it yourself.
Or....
df %>%
group_by(plot, sp) %>%
summarise(cumMean = sum(mean, na.rm=T)) %>%
filter(cumMean == max(cumMean)) %>%
select(plot, sp)
Aggregate twice: once to calculate the sums for each plot and sp, and a second time to get the maxima for each plot. The second aggregation is only going to give you the mean, though, so merge it back in with the first aggregate.
df2 = aggregate(mean ~ plot + sp, FUN = sum, data = df)
df3a = aggregate(mean ~ plot, data = df2, FUN = max)
merge(df3a, df2)
I haven't tested what happens if equal sums come up here, though. Also, this drops any NAs in the data frame. If you want to keep those, bring the data frame in with strings rather than factors and then change the NAs to placeholders ("None" or even "NA") before you begin. The above code works fine with strings!
df = data.frame(plot,mean,sp, stringsAsFactors = FALSE)
df[is.na(df$sp), "sp"] = "None"
> df
plot mean sp
1 A 3 ch
2 A 5 ch
3 A 40 ag
4 A 0 None
5 B 3 ch
6 B 5 ag
7 B 3 ch
8 B 0 None
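On the question of equal sums: a base R sketch that behaves predictably there is to build a plot-by-sp table of totals with tapply and take which.max per row, which returns the first maximum when there is a tie. This is an alternative to the aggregate/merge code above, not a rewrite of it, and it drops the NA species rather than keeping them. The names sums and dominant are invented for the example.

```r
df <- data.frame(plot = c("A", "A", "A", "A", "B", "B", "B", "B"),
                 mean = c(3, 5, 40, 0, 3, 5, 3, 0),
                 sp   = c("ch", "ch", "ag", NA, "ch", "ag", "ch", NA),
                 stringsAsFactors = FALSE)

# Rows are plots, columns are species, cells are summed means
# (rows of df with NA in sp are left out by tapply)
sums <- tapply(df$mean, list(df$plot, df$sp), sum)

# Column with the largest total in each row; ties go to the first column
dominant <- colnames(sums)[apply(sums, 1, which.max)]

dominant_sp <- data.frame(plot = rownames(sums), sp = dominant,
                          stringsAsFactors = FALSE)
print(dominant_sp)
```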
