I am doing some manipulations with dplyr. I am working with the brca data set.
I have to find a solution for the below question.
" We are interested what variable might be the best indicator for the outcome
malignant ("M") or benign ("B"). There are 30 features (variables) and we
want to select one variable that has the largest difference between means
for groups M and B."
Now I want to find the difference between the two resulting rows, then find the maximum difference and the corresponding column name.
Can anyone help me with this?
Thanks... :)
To get the column name and the value with the highest absolute difference between the two rows, you can do:
library(dplyr)
library(tidyr)
sumOutcome %>%
  summarise(across(-outcome, diff)) %>%
  pivot_longer(cols = everything()) %>%
  slice(which.max(abs(value)))
# name value
# <chr> <dbl>
#1 concave_pts_worst 436.
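Since sumOutcome is not shown in the question, here is a minimal self-contained sketch of the whole pipeline. The toy tibble and its column names (feature_a, feature_b) are made up for illustration and only stand in for the brca features; the assumption is that sumOutcome holds one row of per-feature means for each outcome.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the brca features
toy <- tibble(
  outcome   = c("B", "B", "M", "M"),
  feature_a = c(1, 3, 2, 4),
  feature_b = c(10, 20, 50, 60)
)

# One row per outcome, one column of means per feature
sum_outcome <- toy %>%
  group_by(outcome) %>%
  summarise(across(everything(), mean))

# Difference between the two rows, then the feature with the largest absolute difference
best <- sum_outcome %>%
  summarise(across(-outcome, diff)) %>%
  pivot_longer(cols = everything()) %>%
  slice(which.max(abs(value)))

best
```

Note that a raw difference in means favours features measured on a large scale, so it may be worth standardising the features first if they are in different units.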
In R, I have a data frame and a vector. The data frame has a column of dates (e.g. column A). The vector also contains dates. The dates are not necessarily continuous (i.e. a few consecutive dates may be 1/4/23, 1/17/23, 2/4/23, etc.) for either column A or the vector.
I want to create a new column in the data frame (column B) which is equal to (for each row) the minimum value of the vector that is greater than the date in column A. Perhaps a more general way of putting it, I want to create a new data frame column based on an existing column compared to a vector.
I have figured out how to do this using a function/loop, but it is not the cleanest. Is there a simpler way to do this without a loop? A dplyr solution would be ideal, as that is what I mostly use elsewhere in my code, but any help would be much appreciated. It would also be helpful to know if this is not possible without a loop. Thanks!
Using a rowwise mutate in dplyr, subset the vector to elements >= your date column (use > instead if the match must be strictly later), sort, and take the first element. If no element qualifies, the result is NA:
library(dplyr)
# example data
dat <- data.frame(
  columnA = as.Date(c("2023-01-04", "2023-01-17", "2023-02-04"))
)
vec <- as.Date(c("2023-01-01", "2023-03-01", "2023-01-04", "2023-01-30"))
dat %>%
  rowwise() %>%
  mutate(columnB = first(sort(vec[vec >= columnA]))) %>%
  ungroup()
# A tibble: 3 × 2
columnA columnB
<date> <date>
1 2023-01-04 2023-01-04
2 2023-01-17 2023-01-30
3 2023-02-04 2023-03-01
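If the data frame is large, a fully vectorised alternative that avoids rowwise() may be worth a try. This is only a sketch using base R's findInterval() on the same example data; it keeps the same >= semantics as above, and the names sorted_vec and idx are just illustrative.

```r
# Same example data as above
dat <- data.frame(
  columnA = as.Date(c("2023-01-04", "2023-01-17", "2023-02-04"))
)
vec <- as.Date(c("2023-01-01", "2023-03-01", "2023-01-04", "2023-01-30"))

sorted_vec <- sort(vec)

# With left.open = TRUE, findInterval() returns i such that
# sorted_vec[i] < x <= sorted_vec[i + 1], so i + 1 indexes the
# smallest element that is >= x (NA if there is none)
idx <- findInterval(as.numeric(dat$columnA), as.numeric(sorted_vec),
                    left.open = TRUE) + 1
dat$columnB <- sorted_vec[idx]

dat
```

Dropping left.open = TRUE should give the strictly-greater-than version instead.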
I have a dataset with repeated measures: measurements nested within participants (ID) nested in groups. A variable G (with range 0-100) was measured on the group-level. I want to create a new column that shows:
The first day on which the maximum value of G was reached in a group coded as zero.
How many days each measurement (in this same group) occurred before or after the day on which the maximum was reached. For example: a measurement taken 2 days before the maximum is then coded -2, and a measurement 5 days after the maximum is coded as 5.
Here is an example of what I'm aiming for (shown as an image in the original post).
I highlighted the days on which the maximum value of G was reached in the different groups. The column 'New' is what I'm trying to get.
I've been trying with dplyr and I managed to get for each group the maximum with group_by, arrange(desc), slice. I then recoded those maxima into zero and joined this dataframe with my original dataframe. However, I cannot manage to do the 'sequence' of days leading up to/ days from the maximum.
EDIT: sorry I didn't include a reprex. I used this code so far:
To find the maximum value, first order by G and then by Date:
data <- data[with(data, order(G, Date)),]
Find maximum and join with original data:
data2 <- data %>%
  dplyr::group_by(Group) %>%
  arrange(desc(c(G)), .by_group=TRUE) %>%
  slice(1) %>%
  ungroup()
data2$New <- data2$G
data2 <- data2 %>%
  dplyr::select(c("ID", "New", "Date"))
data3 <- full_join(data, data2, by=c("ID", "Date"))
data3$New[!is.na(data3$New)] <- 0
This gives me the maxima coded as zero and all the other measurements in column New as NA but not yet the number of days leading up to this, and the number of days since. I have no idea how to get to this.
It would help if you would be able to provide the data using dput() in your question, as opposed to using an image.
It looked like you wanted to group_by(Group) in your example to compute the number of days before and after the maximum date in a Group. However, your example has a row with ID 3 in Group A, which suggests otherwise, so that could be clarified.
Here is one approach using tidyverse I hope will be helpful. After grouping and arranging by Date, you can look at the difference in dates comparing to the Date where G is maximum (the first maximum detected in date order).
Also note, as.numeric is included to provide a number, as the result for New is a difftime (e.g., "7 days").
library(tidyverse)
data %>%
  group_by(Group) %>%
  arrange(Date) %>%
  mutate(New = as.numeric(Date - Date[which.max(G)]))
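Since the question has no reproducible data, here is a small sketch with made-up values to show what the code above produces. The tibble below is hypothetical; only the column names Group, Date and G come from the question.

```r
library(dplyr)

# Hypothetical data: group A reaches its maximum (50) twice, group B reaches 70 twice
data <- tibble(
  Group = c("A", "A", "A", "A", "B", "B", "B"),
  Date  = as.Date(c("2021-01-01", "2021-01-03", "2021-01-04", "2021-01-08",
                    "2021-02-01", "2021-02-02", "2021-02-05")),
  G     = c(10, 50, 50, 20, 70, 30, 70)
)

result <- data %>%
  group_by(Group) %>%
  arrange(Date, .by_group = TRUE) %>%
  # which.max() returns the first maximum, so ties resolve to the earliest date
  mutate(New = as.numeric(Date - Date[which.max(G)])) %>%
  ungroup()

result$New
# -2  0  1  5  0  1  4
```

The first day on which the maximum is reached gets 0, earlier days are negative, and later days are positive, counted in days.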
I have a dataset with survey score results for 3 hospitals over a number of years. This survey contains 2 questions.
The dataset looks like this -
set.seed(1234)
library(dplyr)
library(tidyr)
dataset = data.frame(Hospital = c(rep('A', 10), rep('B', 8), rep('C', 6)),
                     YearN = c(2015, 2016, 2017, 2018, 2019,
                               2015, 2016, 2017, 2018, 2019,
                               2015, 2016, 2017, 2018,
                               2015, 2016, 2017, 2018,
                               2015, 2016, 2017,
                               2015, 2016, 2017),
                     Question = c(rep('Overall Satisfaction', 5),
                                  rep('Overall Cleanliness', 5),
                                  rep('Overall Satisfaction', 4),
                                  rep('Overall Cleanliness', 4),
                                  rep('Overall Satisfaction', 3),
                                  rep('Overall Cleanliness', 3)),
                     ScoreYearN = c(rep(runif(24, min = 0.6, max = 1))),
                     TotalYearN = c(rep(round(runif(24, min = 1000, max = 5000), 0))))
MY OBJECTIVE
To add two columns to the dataset such that -
The first column contains the score for the given question in the given hospital for the previous year
The second column contains the total number of respondents for the given question in the given hospital for the previous year
MY ATTEMPT
I called the first column ScoreYearN-1 and the second column TotalYearN-1
I used the lag function to create the new columns that contain the lagged values from the existing columns.
library(dplyr)
library(tidyr)
dataset$`ScoreYearN-1`=lag(dataset$ScoreYearN)
dataset$`TotalYearN-1`=lag(dataset$TotalYearN)
Which gives me a resulting dataset where I have the desired outcome for the first five rows only (these rows correspond to the first Hospital-Question combination).
The remaining rows do not account for this grouping, and hence the 2015 'N-1' values take on the values of the previous group.
I'm not sure this is the best way to go about this problem. If you have any better suggestions, I'm happy to consider them.
Any help will be greatly appreciated.
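You're close! Just use dplyr to group by hospital and question before calling lag(), so that the lag restarts within each Hospital-Question combination: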
dataset_lagged <- dataset %>%
  group_by(Hospital, Question) %>%
  mutate(`ScoreYearN-1` = lag(ScoreYearN),
         `TotalYearN-1` = lag(TotalYearN))
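One caveat: lag() goes by row order, so the code above relies on the rows already being sorted by year within each group, as they are in the example data. If that is not guaranteed, the order_by argument of dplyr's lag() may help. Here is a sketch on a small made-up tibble (toy and its values are hypothetical) with the years deliberately shuffled:

```r
library(dplyr)

# Hypothetical data with years out of order
toy <- tibble(
  Hospital   = c("A", "A", "A", "B", "B"),
  Question   = "Overall Satisfaction",
  YearN      = c(2017, 2015, 2016, 2016, 2015),
  ScoreYearN = c(0.9, 0.7, 0.8, 0.6, 0.5)
)

lagged <- toy %>%
  group_by(Hospital, Question) %>%
  # order_by makes lag() follow YearN rather than the current row order
  mutate(`ScoreYearN-1` = lag(ScoreYearN, order_by = YearN)) %>%
  ungroup()

lagged$`ScoreYearN-1`
# 0.8  NA 0.7 0.5  NA
```

This still assumes there are no gaps in the years; if a year is missing for a group, lag() returns the value of the previous available year rather than NA.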
I have a large dataframe (4631 rows x 2995 cols). The rows represent the zip codes of all the hospitals in the US and the columns represent the zip codes of patients. I have calculated the distance between the patient's home zips and the hospitals so that each cell value is a numeric value representing the miles between each patient's home and each hospital.
An example df is:
         10960   11040   56277    55379
37160   674.14  238.04   25.89     5.31
37091   162.62   71.25  428.56   672.11
89148   931.31    0.03  389.25  1000.05
91776    15.05  508.74  315.61   101.01
What I want to do now is extract the lowest five values for each patient, which would represent the five closest hospitals for each patient. But not only do I need to extract the cell values but I also need the row names so I can know which zip codes those hospitals are in.
So for example, if I was only looking for the lowest two values for each patient/column, I would like to know that for patient 10960 the closest hospital is 15.05 miles away and is in the 91776 zip code, and the second closest hospital is 162.62 miles away and is in the 37091 zip code.
I have this data transposed so if it would be easier to do this by swapping the rows and columns that's fine by me. I don't need the code to do that.
I've found ways to get the lowest values using functions and apply and stuff but it doesn't give me the corresponding zip codes.
I would appreciate any help!
Thanks!
Something like this should do the trick:
library(dplyr)
library(tidyr)
df %>%
  mutate(hospital = rownames(.)) %>%
  gather("patient", "distance", -hospital) %>%
  group_by(patient) %>%
  arrange(distance) %>%
  slice(1:5) %>%
  ungroup()
First add a hospital column from the row names. In the gather step, the distance columns are turned into rows: each column name becomes an entry in the new patient column, and the distances in each column become part of the distance column. arrange sorts the rows by distance (so they are also in ascending order within each patient), and after group_by, slice takes the first 5 rows of each patient. The ungroup isn't required, but it's nice to undo the group_by if the grouping is no longer necessary.
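For what it's worth, the same idea can be written with the newer pivot_longer() and slice_min() verbs. This is only a sketch: it rebuilds the 4 x 4 example from the question and takes the two closest hospitals per patient, so n = 5 would be used on the real data.

```r
library(dplyr)
library(tidyr)

# The example from the question: hospitals in rows, patients in columns
df <- data.frame(
  `10960` = c(674.14, 162.62, 931.31, 15.05),
  `11040` = c(238.04, 71.25, 0.03, 508.74),
  `56277` = c(25.89, 428.56, 389.25, 315.61),
  `55379` = c(5.31, 672.11, 1000.05, 101.01),
  row.names = c("37160", "37091", "89148", "91776"),
  check.names = FALSE
)

closest <- df %>%
  mutate(hospital = rownames(df)) %>%
  pivot_longer(-hospital, names_to = "patient", values_to = "distance") %>%
  group_by(patient) %>%
  # keep the n smallest distances per patient, ordered from closest to farthest
  slice_min(distance, n = 2, with_ties = FALSE) %>%
  ungroup()

closest
```

For patient 10960 this gives hospital 91776 at 15.05 miles followed by 37091 at 162.62 miles, matching the expected result described in the question.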
Maybe this would work:
library(dplyr)
test <- lapply(seq_along(df), function(i) {
  # sort the rows by the i-th patient column; base order() keeps the row names
  x <- df[order(df[[i]]), , drop = FALSE]
  tibble(HospitalZipCode = rownames(x)[1:5],
         Distance = x[1:5, i, drop = TRUE],
         Order = 1:5,
         PatientID = names(df)[i])
}) %>% bind_rows()
This should give you a table with 5 rows per patient. I added a column for the order of the hospitals (1 for closest, 2 for second, etc.)
I have a tibble with several columns, including an ID column and a "score" column. The ID column has some duplicated values. I want to create a tibble that has one row per unique ID, and the same number of columns as the original tibble. For any ID, the "score" value in this new tibble should be the mean of the scores for the ID in the original tibble. And for any ID, the value for the other columns should be the first value that appears for that ID in the original tibble.
When the number of columns in the original tibble is small and known, this is an easy problem. Example:
scores <- tibble(
  ID = c(1, 1, 2, 2, 3),
  score = 1:5,
  a = 6:10)
scores %>%
  group_by(ID) %>%
  summarize(score = mean(score), a = first(a))
But I often work with tibbles (or data frames) that have dozens of columns. I don't know in advance how many columns there will be or how they will be named. In these cases, I still want a function that takes, within each group, the mean of the score column and the first value of the other columns. But it isn't practical to spell out the name of each column. Is there a generic command that will let me summarize() by taking the mean of one column and the first value of all of the others?
A two-step solution would start by using mutate() to replace each score within a group with the mean of those scores. Then I could create my desired tibble by taking the first row of each group. But is there a one-step solution, perhaps using one of the select_helpers in dplyr?
"Summarizing unknown number of column in R using dplyr" is the closest post that I've been able to find. But I can't see that it quite speaks to this problem.
You can use mutate to get the mean values and then use slice to get the first row of each group, i.e.
library(dplyr)
scores %>%
  group_by(ID) %>%
  mutate(score = mean(score)) %>%
  slice(1L)
#Source: local data frame [3 x 3]
#Groups: ID [3]
# ID score a
# <dbl> <dbl> <int>
#1 1 1.5 6
#2 2 3.5 8
#3 3 5.0 10
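As for the one-step version asked about in the question, something along these lines may work with dplyr 1.0 or later, where across() is available. It is a sketch on the same example tibble: across(-score, first) takes the first value of every non-grouping column other than score, so the other column names never have to be spelled out.

```r
library(dplyr)

scores <- tibble(
  ID = c(1, 1, 2, 2, 3),
  score = 1:5,
  a = 6:10)

one_step <- scores %>%
  group_by(ID) %>%
  # mean of score, first value of everything else, in a single summarise()
  summarise(score = mean(score), across(-score, first))

one_step
```

The result has one row per ID with the same values as the mutate/slice output above, already ungrouped.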