Mutating dataframe in R in only certain cases

I have a dataframe in R with multiple columns that I want to manipulate depending on the value for a specific column. Here is a sample dataframe set up in the same way:
dat <- data.frame(A=c("L","Right","R","Left"), B=c(2,5,7,-8), C=c(-3,6,-4,9))
dat
If the value in column A is either "L" or "Left", I want to convert the other columns in that row to negative values unless the value is already negative. For example, in the first row, I would want the 2 to change to a negative 2, but keep the -3 the same. In the fourth row, I would want the -8 to stay the same, but want to change the 9 to a -9.
I previously used this code to convert to negative values, but this was when I was working with only positive values initially and did not want to leave some values unchanged.
library(dplyr)
dat <- dat %>% mutate(across(where(is.numeric), ~ if_else(grepl("^L", A), -1, 1) * .))
I am not sure how to change the above code to address this new issue, and I would greatly appreciate if someone could help answer this question. Thanks so much!

Take the absolute value and multiply it by -1 for rows where A starts with "L", and by sign(.) otherwise, which leaves the other rows unchanged.
dat %>%
  mutate(across(-1, ~ if_else(grepl("^L", A), -1, sign(.)) * abs(.)))
giving:
      A  B  C
1     L -2 -3
2 Right  5  6
3     R  7 -4
4  Left -8 -9
or
dat %>%
  mutate(across(-1, ~ if_else(grepl("^L", A), -abs(.), .)))
Note that overwriting dat with the new value can make debugging harder: it becomes unclear which version of dat is the current one, and if you rerun just the line above you will be starting from the already-modified version rather than the original. I suggest assigning the result to a new name instead:
dat2 <- dat %>% ...whatever...
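If you want to check the rule without dplyr, here is a minimal base R sketch of the same idea. It is an equivalent I am assuming, not part of the answer above: apply -abs() only to the numeric cells of rows whose A starts with "L", and write the result to a new object. The names is_left, num_cols, and dat2 are only illustrative.

```r
dat <- data.frame(A = c("L", "Right", "R", "Left"),
                  B = c(2, 5, 7, -8),
                  C = c(-3, 6, -4, 9))

is_left  <- grepl("^L", dat$A)        # rows that must end up negative
num_cols <- sapply(dat, is.numeric)   # only touch the numeric columns

dat2 <- dat                           # leave the original dat intact
dat2[is_left, num_cols] <- -abs(dat2[is_left, num_cols])
dat2
```

Because -abs() is applied only to the selected rows, values that are already negative stay as they are and the "Right"/"R" rows are never touched.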

Related

Replace negative values with NA if values occur x number of times in a row in R

I would like to replace negative values in a column with NA if they occur more than 144 times in a row in R. I've created a separate column that flags whether each value is negative:
df$a_neg <- df$a < 0
This gives me a new column that says TRUE if the value was negative and FALSE if it was positive.
I've tried subsetting into a new dataframe by doing:
df_neg <- df %>%
subset() %>%
group_by(a_neg) %>%
filter(n() > 144)
All that does is give me back the exact same dataframe I started with. Any help will be greatly appreciated!
You can try this:
# Create data
set.seed(123)
df = data.frame(a=rnorm(2000,-2.5))
# Find contiguous blocks of positive and negative values
df$block = cumsum(c(1,abs(diff(df$a<0))))
# Count how many negative values are in each block
df$block_neg = ave(df$a,df$block,FUN = function(x) {sum(x<0)})
# Assign NA to blocks with greater than 144 negative values
df$a1 = ifelse(df$block_neg>144,NA,df$a)
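An alternative way to express the same block idea is with rle(), which returns the length of each run directly. This is only a sketch: mask_long_negative_runs is a hypothetical helper name, and it assumes the input has no NA values to begin with.

```r
# Hypothetical helper: set runs of more than `max_run` consecutive
# negative values to NA. Assumes x contains no NA to begin with.
mask_long_negative_runs <- function(x, max_run = 144) {
  runs <- rle(x < 0)                                 # run lengths of TRUE/FALSE
  too_long <- runs$values & runs$lengths > max_run   # negative runs that are too long
  x[rep(too_long, runs$lengths)] <- NA               # expand back to one flag per element
  x
}

# Small demonstration with a threshold of 2 instead of 144
x <- c(1, -1, -1, -1, 2, -1, -1, 3)
masked <- mask_long_negative_runs(x, max_run = 2)
masked
```

In the demonstration only the run of three negatives is replaced; the run of two is kept because it does not exceed the threshold.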

Delete duplicate rows and sum corresponding values of last column in a dataframe

If we want to remove the duplicates from a dataframe df, we just need to write df[!duplicated(df),] and the duplicates will be removed. I have the following dataframe:
df <- data.frame(from = c("z","y","z","w","y"), to=c("x","w","x","z","w"), weight=c(2,1,3,5,6))
I would like to obtain something different. In df[,1:2], the first and the third rows are equal, and I would like to: 1) delete one of them; 2) sum the corresponding values of weight. For this example, the expected result is:
from to weight
   z  x      5
   y  w      7
   w  z      5
Anyway, if I use:
df2=df[,1:2]
which(duplicated(df2) | duplicated(df2[nrow(df2):1, ])[nrow(df2):1])
I obtain
[1] 1 2 3 5
which does not allow me to obtain the desired result (rows 1 and 3 are equal to each other, and rows 2 and 5 are equal to each other, but this information is not contained in the latter result).
We can do a group-by sum operation instead of using duplicated:
aggregate(weight~ ., df, sum)
In dplyr, this can be done using
library(dplyr)
df %>%
  group_by(from, to) %>%
  summarise(weight = sum(weight))
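The row order returned by aggregate() need not match the order in which each pair first appears in df. If the order of the expected result matters, one possible base R sketch is to reorder the sums afterwards. The helper pair_key and the "|" separator are my own assumptions, and this only works if "|" never occurs in from or to.

```r
df <- data.frame(from = c("z", "y", "z", "w", "y"),
                 to = c("x", "w", "x", "z", "w"),
                 weight = c(2, 1, 3, 5, 6))

sums <- aggregate(weight ~ from + to, df, sum)   # one row per (from, to) pair

# Reorder the sums to follow the first appearance of each pair in df
pair_key <- function(d) paste(d$from, d$to, sep = "|")
first_seen <- df[!duplicated(df[, c("from", "to")]), ]
result <- sums[match(pair_key(first_seen), pair_key(sums)), ]
rownames(result) <- NULL
result
```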

R code to detect a change in a variable over time for multiple patients

I have a data set with multiple rows per patient, where each row represents a 1-week period of time over the course of 4 months. There is a variable grade that can take on values of 1, 2, or 3, and I want to detect when a single patient's grade INCREASES (1 to 2, 1 to 3, or 2 to 3) at any point (the result would be a yes/no variable). I could write a function to do it, but I'm betting there is some clever functional programming I could do to make use of existing R functions. Here is a sample data set below. Thank you!
df=data.frame(patient=c(1,1,1,2,2,3,3,3,3),period=c(1,2,3,1,3,1,3,4,5),grade=c(1,1,1,2,3,1,1,2,3))
What I want is a resulting data frame of:
data.frame(patient=c(1,2,3),grade.increase=c(0,1,1))
library(dplyr)
df %>%
  arrange(patient, period) %>%
  group_by(patient) %>%  # group first so lag() never compares across patients
  mutate(grade.increase = case_when(grade > lag(grade) ~ TRUE, TRUE ~ FALSE)) %>%
  summarise(grade.increase = max(grade.increase))
Combining lag, which looks up the previous value, with case_when flags each row where the grade went up. Grouping by patient before the mutate matters: otherwise the first row of one patient would be compared with the last row of the previous patient.
Summarising the maximum of grade.increase for each patient gives the desired result, because max treats FALSE as 0 and TRUE as 1.
If you feel like doing this in base R, here's a solution that uses the split-apply-combine approach.
You use split to make a list with a separate data frame for each patient;
you use lapply to iterate a summarization function over each list element, where the summarization function uses diff to look at changes in grade and if and any to summarize; and then
you wrap the whole thing in do.call(rbind, ...) to collapse the resulting list into a data frame.
Here's what that looks like:
do.call(rbind, lapply(split(df, df[,"patient"]), function(i) {
  data.frame(patient = i[,"patient"][1],
             grade.increase = if (any(diff(i[,"grade"]) > 0)) 1 else 0)
}))
Result:
  patient grade.increase
1       1              0
2       2              1
3       3              1
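A more compact base R variant of the same split-apply-combine idea uses tapply. This is just a sketch under the assumption that rows can be sorted by patient and period first, since diff only makes sense in time order. The names inc and result are illustrative.

```r
df <- data.frame(patient = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 period  = c(1, 2, 3, 1, 3, 1, 3, 4, 5),
                 grade   = c(1, 1, 1, 2, 3, 1, 1, 2, 3))

df <- df[order(df$patient, df$period), ]   # diff() relies on time order

# For each patient: 1 if any consecutive difference in grade is positive
inc <- tapply(df$grade, df$patient,
              function(g) as.integer(any(diff(g) > 0)))

result <- data.frame(patient = as.numeric(names(inc)),
                     grade.increase = as.vector(inc))
result
```

A patient with a single row gets 0, because diff returns an empty vector and any() of nothing is FALSE.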

dplyr::mutate changes row numbers, how to keep them?

I am using lme4::lmList on a tibble to obtain the coefficients of linear fit lines fitted for each subject (id) in my data. What I actually want is a nice long chain of pipes because I don't want to keep any of this output, just use it for a slope/intercept plot. However, I am running into a problem. lmList is creating a dataframe where the row numbers are the original subject ID numbers. I want to keep that information, but as soon as I use mutate on the output, the row numbers change to be sequential from 1. I tried rescuing them first by using rowid_to_column but that just gives me a column of sequential numbers from 1 too. What can I do, other than drop out of the pipe and put them in a column with base R? Is unique(a_df$id) really the best solution? I had a look around on here but didn't see a question like this one.
library(tibble)
library(dplyr)
library(Matrix)
library(lme4)
a_df <- tibble(id = c(rep(4, 3), rep(11, 3), rep(12, 3), rep(42, 3)),
               age = c(rep(seq(1, 3), 4)),
               hair = 1 + (age*2) + rnorm(12) + as.vector(sapply(rnorm(4), function(x) rep(x, 3))))
# as.data.frame to get around stupid RStudio diagnostics bug
int_slope <- coef(lmList(hair ~ age | id, as.data.frame(a_df))) %>%
  setNames(., c("Intercept", "Slope"))
# Notice how the row numbers are the original subject ids?
print(int_slope)
    Intercept    Slope
4   2.9723596 1.387635
11  0.2824736 2.443538
12 -1.8912636 2.494236
42  0.8648395 1.680082
int_slope2 <- int_slope %>% mutate(ybar = Intercept + (mean(a_df$age) * Slope))
# Look! Mutate has changed them to be the numbers 1 to 4
print(int_slope2)
   Intercept    Slope     ybar
1  2.9723596 1.387635 5.747630
2  0.2824736 2.443538 5.169550
3 -1.8912636 2.494236 3.097207
4  0.8648395 1.680082 4.225004
# Try to rescue them with rowid_to_column
int_slope3 <- int_slope %>% rowid_to_column(var = "id")
# Nope, 1 to 4 again
print(int_slope3)
  id  Intercept    Slope
1  1  2.9723596 1.387635
2  2  0.2824736 2.443538
3  3 -1.8912636 2.494236
4  4  0.8648395 1.680082
Thanks,
SJ
The dplyr/tidyverse universe doesn't "believe in" row names. Any data that is important for an observation should be included in a column. The tibble package includes a function to move row names into a column. Try
int_slope %>% rownames_to_column()
before any mutates.
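To see what "move row names into a column" amounts to, here is a small base R sketch that does not need tibble. The data frame is an illustrative stand-in for the coef(lmList(...)) output, with the values from the question rounded, and with_id is a name I made up.

```r
# Illustrative stand-in for the coef(lmList(...)) output, values rounded
int_slope <- data.frame(Intercept = c(2.97, 0.28, -1.89, 0.86),
                        Slope     = c(1.39, 2.44, 2.49, 1.68),
                        row.names = c("4", "11", "12", "42"))

# Copy the row names into a real column before any dplyr verbs run
with_id <- cbind(id = as.integer(rownames(int_slope)), int_slope)
rownames(with_id) <- NULL   # the ids now live in a column, so drop the row names
with_id
```

Once the subject ids are an ordinary column, later steps can renumber the rows without losing them.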
Nothing like asking for help to make you see the answer. Those aren't row numbers, they're numeric row names. Of course they are! Non-contiguous row numbers make no sense. rownames_to_column is my answer.
Why don't you just create another 'ybar' column on int_slope? A base R assignment leaves the row names untouched:
int_slope$ybar <- int_slope$Intercept + mean(a_df$age) * int_slope$Slope

r find index of minima with gap / distance condition

list <- c(1,1,1,4,5,6,9,9,2)
I want to find the indices of the 3 lowest values, with the condition that the indices of the found minima are at least 3 positions apart from each other.
To find the 3 lowest indices I'm using
which(list <= sort(list, decreasing=FALSE)[3], arr.ind=TRUE)
which doesn't look for any conditions and results in
1,2,3
My desired result is
1,9,4
I want to know if it's possible doing that without any loops since my list is a lot bigger.
Thank you so much in advance.
To clarify what I meant: the indices of the minima must always be at least a certain distance apart. For example, for the list list <- c(1,3,9,5,9,9,2) the resulting minima should be at 1,7,4, not 1,7,2, because indices 1 and 2 are too close together.
Thank you again for helping me.
Try this using dplyr:
create a dataframe with the index sequence in the second column, then sort by value and take the first occurrence of each value
library(dplyr)
kk <- data.frame(cbind(list, seq = seq_along(list))) %>%
  arrange(list) %>%               # sort by value
  group_by(list) %>%              # group identical values
  summarise(V3 = min(seq)) %>%    # first occurrence of each value
  .$V3 %>%                        # extract the index column
  head(3)                         # keep the first 3
[1] 1 9 4
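Note that the pipeline above takes the first index of each distinct value and does not itself check the distance between indices, so for the second example c(1,3,9,5,9,9,2) it would return 1,7,2. One way to enforce the gap is a greedy pass over the indices sorted by value. This is only a sketch: pick_spaced_minima is a hypothetical helper, it does use a loop, and the loop stops as soon as k indices are found.

```r
# Hypothetical helper: indices of the k smallest values such that any two
# chosen indices are at least `gap` positions apart (greedy, smallest first).
pick_spaced_minima <- function(x, k = 3, gap = 3) {
  picked <- integer(0)
  for (i in order(x)) {                     # candidates by value, ties by position
    if (all(abs(i - picked) >= gap)) {      # far enough from everything chosen so far
      picked <- c(picked, i)
    }
    if (length(picked) == k) break          # stop early once k indices are found
  }
  picked
}

first_example  <- pick_spaced_minima(c(1, 1, 1, 4, 5, 6, 9, 9, 2))
second_example <- pick_spaced_minima(c(1, 3, 9, 5, 9, 9, 2))
first_example
second_example
```

On the two vectors from the question this gives 1,9,4 and 1,7,4. If fewer than k sufficiently spaced indices exist, the function returns only those it found.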
