I'm a relatively new R user. I would really appreciate any help with my dataset please.
I have a dataset with 24 million rows. There are 3 variables in the dataset: patient name, pharmacy name, and count of medications picked up from the pharmacy at that visit.
Some patients appear in the dataset more than once (i.e., they have picked up medications from different pharmacies at different time points).
The data frame looks like this:
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("A", "B", "B", "B", "C"),
meds = c(3, 2, 5, 8, 2))
From this data I want to generate a new dataset, which has ONE pharmacy for each patient. This pharmacy needs to be the one where the patient has picked up the highest number of medications.
For example: for Tom his most frequent pharmacy is Pharmacy B because he has picked up 13 medications from there (5+8 meds). The dataset I would like to generate:
data.frame(name = c("Tom", "Rob", "Amy"),
pharmacy = c("B", "B", "C"),
meds = c(13, 2, 2))
Can someone please help me write code to do this?
I have tried various packages and functions in R, such as dplyr, tidyr, and aggregate(), with no success. Any help would be genuinely appreciated.
Thank you very much
Alex
Your question is not reproducible. But here is one solution:
# create reproducible example of data
dataset1 <- data.frame(
name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("pharmacy_A", "pharmacy_B", "pharmacy_B", "pharmacy_B", "pharmacy_C"),
meds_count = c(3, 2, 5, 8, 2))
library(dplyr) #load dplyr
dataset2 <- dataset1 %>% group_by(name, pharmacy) %>% # group by your grouping variables
summarise(meds_count = sum(meds_count)) %>% # sum no. of meds by your grouping variables
top_n(1, meds_count) %>% # filter for only the top 1 count
ungroup()
Resulting dataframe:
> dataset2
# A tibble: 3 x 3
name pharmacy meds_count
<fct> <fct> <dbl>
1 Amy pharmacy_C 2.00
2 Rob pharmacy_B 2.00
3 Tom pharmacy_B 13.0
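Note that top_n() keeps every row tied for the top value, so a patient with equal totals at two pharmacies would appear twice. Below is a minimal sketch of one way to guarantee a single row per patient, assuming dplyr 1.0.0 or later (where slice_max() is available). Keeping whichever tied row comes first is my assumption; the question does not say how ties should be broken.

```r
library(dplyr)

dataset1 <- data.frame(
  name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
  pharmacy = c("pharmacy_A", "pharmacy_B", "pharmacy_B", "pharmacy_B", "pharmacy_C"),
  meds_count = c(3, 2, 5, 8, 2),
  stringsAsFactors = FALSE
)

dataset2 <- dataset1 %>%
  group_by(name, pharmacy) %>%
  # total meds per patient/pharmacy pair, then drop the grouping
  summarise(meds_count = sum(meds_count), .groups = "drop") %>%
  group_by(name) %>%
  # keep exactly one row per patient, even when totals are tied
  slice_max(meds_count, n = 1, with_ties = FALSE) %>%
  ungroup()

dataset2
```

If tied pharmacies should all be reported instead, with_ties = TRUE gives the same behaviour as top_n().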
If I understood you correctly, I think you're looking for something like this.
require(tidyverse)
#Sample data. I copied yours.
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("A", "B", "B", "B", "C"),
meds = c(3, 2, 5, 8, 2))
Edit. I changed the group_by(), summarise() and added filter.
df %>%
group_by(name, pharmacy) %>%
summarise(SumMeds = sum(meds, na.rm = TRUE)) %>%
filter(SumMeds == max(SumMeds))
Results:
name pharmacy SumMeds
<fct> <fct> <dbl>
1 Amy C 2.
2 Rob B 2.
3 Tom B 13.
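The filter() step works per patient even though there is no second group_by(): summarise() drops the last grouping variable (pharmacy), so its result is still grouped by name. A small sketch that makes this explicit, assuming dplyr 1.0.0 or later for the .groups argument (the names totals and result are only for illustration):

```r
library(dplyr)

df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
                 pharmacy = c("A", "B", "B", "B", "C"),
                 meds = c(3, 2, 5, 8, 2),
                 stringsAsFactors = FALSE)

totals <- df %>%
  group_by(name, pharmacy) %>%
  # "drop_last" is the default for grouped summaries; spelled out here for clarity
  summarise(SumMeds = sum(meds, na.rm = TRUE), .groups = "drop_last")

# Shows which grouping variable(s) remain after summarise()
group_vars(totals)

result <- totals %>%
  filter(SumMeds == max(SumMeds)) %>%  # max() is evaluated within each name
  ungroup()
```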
Generating your dataset:
patient = c("Tom","Rob","Tom","Tom","Amy")
pharmacy = c("A","B","B","B","C")
meds = c(3,2,5,8,2)
df = data.frame(patient,pharmacy,meds)
df is your dataframe
library(dplyr)
df = df %>% group_by(patient,pharmacy) %>%
summarize(meds =sum(meds)) %>%
group_by(patient) %>%
filter(meds == max(meds))
Take your df, group by patient and pharmacy
calculate total medicines bought by each patient from different pharmacies by taking the sum of medicines.
Then group_by patient
Finally filter for max.
Print the dataframe
print(df)
You can do it in base R with aggregate twice followed by merge.
It seems to me a bit complicated to have to use aggregate twice. Maybe dplyr solutions run more quickly, especially with a dataset with 24 million rows.
agg <- aggregate(meds ~ name + pharmacy, df, FUN = sum)
agg2 <- aggregate(meds ~ name, agg, FUN = max)
merge(agg, agg2)[c(1, 3, 2)]
# name pharmacy meds
#1 Amy C 2
#2 Rob B 2
#3 Tom B 13
Data.
This is the dataset in the question after the edit.
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("A", "B", "B", "B", "C"),
meds = c(3, 2, 5, 8, 2), stringsAsFactors = FALSE)
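If the second aggregate() and the merge() feel heavy, one alternative is to aggregate once, sort so that each patient's largest total comes first, and keep the first row per patient. This is only a sketch in base R, untested at 24 million rows; ties are broken by sort order, which is an assumption on my part. The name top is hypothetical.

```r
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
                 pharmacy = c("A", "B", "B", "B", "C"),
                 meds = c(3, 2, 5, 8, 2), stringsAsFactors = FALSE)

# Total meds per patient/pharmacy pair
agg <- aggregate(meds ~ name + pharmacy, data = df, FUN = sum)

# Sort by patient, then by descending total
agg <- agg[order(agg$name, -agg$meds), ]

# Keep the first (largest) row for each patient
top <- agg[!duplicated(agg$name), ]
rownames(top) <- NULL
top
```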
Assuming the following dataset:
df <- tribble(
~patient, ~pharmacy, ~medication,
"Tom", "Pharmacy A", "3 meds",
"Rob", "Pharmacy B", "2 meds",
"Tom", "Pharmacy B", "5 meds",
"Tom", "Pharmacy B", "8 meds",
"Amy", "Pharmacy C", "2 meds"
)
A tidyverse-friendly option could be:
df %>%
mutate(med_n = as.numeric(str_extract(medication, "[0-9]+"))) %>% # 1
group_by(patient, pharmacy) %>% # 2
mutate(med_sum = sum(med_n)) %>% # 3
group_by(patient) %>% # 4
filter(med_sum == max(med_sum)) %>% # 5
select(patient, pharmacy, med_sum) %>% # 6
distinct() # 7
1. create a numeric variable as you can't add strings
2. among all patient / pharmacy couples
3. find the total number of medications
4. then among all patients
5. keep only pharmacies with the highest patient / pharm totals
6. discard useless variables
7. discard duplicated lines (several lines per patient / pharmacy couple)
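One detail worth checking in step 1 is how many digits the pattern captures, since counts of 10 or more are likely in real data. A short sketch of the difference, using made-up values ("13 meds" and "120 meds" are not in the sample data):

```r
library(stringr)

medication <- c("3 meds", "13 meds", "120 meds")

# "[0-9]" matches a single digit, so multi-digit counts are truncated
first_digit <- as.numeric(str_extract(medication, "[0-9]"))

# "[0-9]+" matches the whole run of digits
full_number <- as.numeric(str_extract(medication, "[0-9]+"))

first_digit
full_number
```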
Related
I have been trying to learn the best way to recode variables in a column based on the condition of a name being associated with more than one race.
I have been working with a dataframe like this:
df <- data.frame('Name' = c("Jon", "Jon", "Bobby", "Sarah", "Fred"),
'Race' = c("Black", "White", "Asian", "Asian", "Black"))
What I am trying to do is recode any value that appears more than once in a group and transform it into a "multi-racial" category.
The end goal is to construct a dataframe like below:
df1 <- data.frame('Name' = c("Jon", "Bobby", "Sarah", "Fred"),
'Race' = c("Multiracial", "Asian", "Asian", "Black"))
My current approach is to count the distinct races recorded for each name, collect the names with more than one, and replace the race with "multi-racial" for those names only. Code shown below:
library(dplyr)
df1 <- unique(df[, c('Name', 'Race')])
multi_answer <-
df1 %>%
dplyr::group_by(Name) %>%
dplyr::summarise(n_answers = n_distinct(Race))
multi_answer <- multi_answer[multi_answer$n_answers >1,]
df1[df1$Name %in% c(multi_answer$Name), 'Race'] <- 'multi-racial'
df1 <- unique(df1)
You can just group_by the Name and then summarize the data. You just use the condition of "if there is more than one entry" (i.e., n() > 1):
library(tidyverse)
df |>
group_by(Name)|>
summarise(Race = ifelse(n() > 1, "multi-racial", Race))
#> # A tibble: 4 x 2
#> Name Race
#> <chr> <chr>
#> 1 Bobby Asian
#> 2 Fred Black
#> 3 Jon multi-racial
#> 4 Sarah Asian
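One caveat: n() counts rows, so a name that is listed twice with the same race would also be recoded. If that can happen in the real data, here is a sketch that counts distinct races instead. The duplicated "Sarah" row is a made-up example to show the difference.

```r
library(dplyr)

df <- data.frame(
  Name = c("Jon", "Jon", "Bobby", "Sarah", "Sarah", "Fred"),
  Race = c("Black", "White", "Asian", "Asian", "Asian", "Black"),
  stringsAsFactors = FALSE
)

df1 <- df %>%
  group_by(Name) %>%
  summarise(
    # recode only when a name has more than one distinct race
    Race = if (n_distinct(Race) > 1) "multi-racial" else first(Race),
    .groups = "drop"
  )

df1
```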
I hope everyone is doing well. I am having a bit of a brain fart trying to aggregate in R. Let's say I have this df:
student  subject
Amber    math
Colin    math
Bob      science
Amber    math
Amber    science
And I want to get a count of the number of times the student's subject is math and add that to the data frame, so the result would look like this:
student  subject  total 'math'
Amber    math     2
Colin    math     1
Bob      science  0
Amber    math     2
Amber    science  2
Is this possible? I tried aggregate(subject["math"] ~ student, data = df, length) just to get the first part done, but I get "Error in model.frame.default(formula = subject["math"] ~ : variable lengths differ (found for 'student')".
Thank you in advance!
I think that you want something like this
library(magrittr)
library(dplyr)
df <- data.frame(
student = c("Amber", "Colin", "Bob", "Amber", "Amber"),
subject = c("math", "math", "science", "math", "science")
)
# count rows per student/subject pair and keep only the "math" counts
df2 <- df %>%
  group_by(student, subject) %>%
  mutate(`Total math` = n()) %>%
  filter(`Total math` > 0) %>%
  filter(subject == "math") %>%
  distinct()

# join the counts back onto every row; students without math get 0
merge(x = df, y = df2, by = "student", all.x = TRUE) %>%
  mutate(`Total math` = ifelse(!is.na(`Total math`), `Total math`, 0)) %>%
  rename(subject = "subject.x") %>%
  select(student, subject, `Total math`) %>%
  print()
I've tried a different approach. It gives something different from your desired output, but does it work for you?
library(dplyr)
my_df <- data.frame("Student" = c("Amber", "Colin", "Bob", "Amber", "Amber"),
"Subject" = c("math", "math", "science", "math", "science"),
stringsAsFactors = FALSE)
my_df <- my_df %>% group_by(Student, Subject) %>% summarise("Total" = n())
library(dplyr)
df_with_count <- df %>% group_by(student, subject) %>% mutate(count = n())
Found here:
https://www.tutorialspoint.com/how-to-add-a-new-column-in-an-r-data-frame-with-count-based-on-factor-column
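For the exact output requested in the question (each student's math count repeated on every one of their rows, including the science rows), a shorter sketch is a grouped mutate() that sums a logical condition. The column name total_math is my own choice.

```r
library(dplyr)

df <- data.frame(
  student = c("Amber", "Colin", "Bob", "Amber", "Amber"),
  subject = c("math", "math", "science", "math", "science"),
  stringsAsFactors = FALSE
)

df_math <- df %>%
  group_by(student) %>%
  # TRUE counts as 1, so the sum is the number of "math" rows per student
  mutate(total_math = sum(subject == "math")) %>%
  ungroup()

df_math
```

Because mutate() keeps every row, no merge back onto the original data is needed.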
Consider this data frame, containing multiple entries for a person named Steve/Stephan Jones and a person named Steve/Steven Smith (as well as Jane Jones and Matt/Matthew Smith)
df <- data.frame(First = c("Steve", "Stephan", "Steve", "Jane", "Steve", "Steven", "Matt"),
Last = c(rep("Jones", 4), rep("Smith", 3)))
What I'd like is to match values of First to the appropriate value of Name in this data frame.
nicknames <- data.frame(Name = c("Stephan", "Steven", "Stephen", "Matthew"),
N1 = c(rep("Steve", 3), "Matt"))
To yield this target
target <- data.frame(First = c("Stephan", "Stephan", "Stephan", "Jane", "Steven", "Steven", "Matthew"),
Last = c(rep("Jones", 4), rep("Smith", 3)))
The issue is that there are multiple values of Name corresponding to an N1 (or First) value of "Steve", so I need to check within each group based on df$Last to see which version of Steven/Stephan/Stephen is correct.
Using something like this
library(dplyr)
library(stringr)
df %>%
group_by(Last) %>%
mutate(First = First[which.max(str_length(First))])
won't work because the value of "Jane" in row 4 will be converted to "Stephan"
I'm not sure whether this solves your problem or matches your desired output:
library(dplyr)
df %>%
mutate(id = row_number()) %>%
left_join(nicknames, by=c("First" = "N1")) %>%
mutate(real_name = coalesce(Name, First)) %>%
group_by(Last, real_name) %>%
mutate(id = n()) %>%
group_by(Last, First) %>%
filter(id==max(id)) %>%
select(-Name, -id)
returns
# A tibble: 7 x 3
# Groups: Last, First [6]
First Last real_name
<chr> <chr> <chr>
1 Steve Jones Stephan
2 Stephan Jones Stephan
3 Steve Jones Stephan
4 Jane Jones Jane
5 Steve Smith Steven
6 Steven Smith Steven
7 Matt Smith Matthew
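To get all the way to the target data frame, one possible extension of the same idea is to treat each candidate full name as a "vote" within its Last group and keep the most supported candidate per original row. This is a sketch under assumptions: the helper columns row and votes are my own names, base merge() is used for the many-to-many lookup, and a family that contains only "Steve" (with no full form present) would be resolved arbitrarily.

```r
library(dplyr)

df <- data.frame(First = c("Steve", "Stephan", "Steve", "Jane", "Steve", "Steven", "Matt"),
                 Last = c(rep("Jones", 4), rep("Smith", 3)),
                 stringsAsFactors = FALSE)
nicknames <- data.frame(Name = c("Stephan", "Steven", "Stephen", "Matthew"),
                        N1 = c(rep("Steve", 3), "Matt"),
                        stringsAsFactors = FALSE)

# Remember the original row order before the lookup multiplies rows
indexed <- transform(df, row = seq_len(nrow(df)))

# One row per (original row, candidate full name); Name is NA if no nickname matches
candidates <- merge(indexed, nicknames, by.x = "First", by.y = "N1", all.x = TRUE)

resolved <- candidates %>%
  mutate(real_name = coalesce(Name, First)) %>%
  # how often each candidate name is supported within a family
  add_count(Last, real_name, name = "votes") %>%
  group_by(row) %>%
  slice_max(votes, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(row) %>%
  transmute(First = real_name, Last)

resolved
```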
I have a data set with 3 columns: the first has names of cities, the second has dates, and the third has water quality data. I have ordered my data by city name and then by date, and now I am trying to sum the water quality data for each city separately. Do you know how I can do that in RStudio?
Any help would be appreciated.
Thank you
If you're using the tidyverse, you can group by city and then summarise, adding up all the water quality values within each group:
library(tidyverse)
# im using this data set
ds <- tibble(city = "a", wq = 1) %>%
add_row( city = "a", wq = 1) %>%
add_row( city = c("a", "b", "c", "a", "b"), wq = c(.5, .2, .5, .7, 1.2))
#you're interested in this part
ds %>% group_by(city) %>%
summarise(sum = sum(wq))
This is the output:
# A tibble: 3 x 2
city sum
<chr> <dbl>
1 a 3.2
2 b 1.4
3 c 0.5
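The same pattern works on a three-column data set like the one described in the question, and the rows do not need to be sorted first. A sketch with made-up values (the names water, date, wq and total_wq are assumptions, not taken from the question):

```r
library(dplyr)

water <- data.frame(
  city = c("b", "a", "a", "c", "b"),
  date = as.Date(c("2020-01-02", "2020-01-01", "2020-01-03", "2020-01-01", "2020-01-01")),
  wq = c(0.2, 1, 0.5, 0.5, 1.2),
  stringsAsFactors = FALSE
)

totals <- water %>%
  group_by(city) %>%
  # the date column is ignored; na.rm = TRUE skips missing measurements
  summarise(total_wq = sum(wq, na.rm = TRUE), .groups = "drop")

totals
```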
I would like to calculate the average exam score of each student and add this as a new column to a data frame:
library(dplyr)
my_students <- c("John", "Lisa", "Sam")
student_exam <- c("John", "Lisa", "John", "John")
score_exam <- c(7, 6, 7, 6)
students <- as.data.frame(my_students)
scores <- as.data.frame(student_exam)
scores <- cbind(scores, score_exam)
new_frame <- students %>% mutate(avg_score = (scores %>% filter(student_exam == my_students) %>% mean(score_exam)))
But the code above gives the following error:
Error in Ops.factor(student_exam, my_students) :
level sets of factors are different
I assume it has to do with filter(student_exam == my_students). How would I do this in dplyr?
Define two data frames that share a column named "name". You can then use group_by and summarize to compute the average score for each student in scores, and right_join to bring in every student from students. If the join prints a warning or message, it is about the join column rather than the result; the thing to be aware of is that not every student in your class has an exam score. Sam has none, so his average score is NA.
library(dplyr)
my_students <- c("John", "Lisa", "Sam")
student_exam <- c("John", "Lisa", "John", "John")
score_exam <- c(7, 6, 7, 6)
students <- data.frame("name" = as.character(my_students))
scores <- data.frame("name" = as.character(student_exam), "score" = score_exam)
avg_scores <- scores %>%
group_by(name) %>%
summarize(avgScore = mean(score)) %>%
right_join(students, by = "name") # keep every student, even those without scores
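If you would rather keep the students data frame as the starting point and in its original order, as in the question's mutate() attempt, an equivalent sketch uses left_join() from students. Setting stringsAsFactors = FALSE is my assumption to avoid the factor comparison behind the original error on older R versions.

```r
library(dplyr)

students <- data.frame(name = c("John", "Lisa", "Sam"), stringsAsFactors = FALSE)
scores <- data.frame(
  name = c("John", "Lisa", "John", "John"),
  score = c(7, 6, 7, 6),
  stringsAsFactors = FALSE
)

# Average per student who has at least one score
avg_by_student <- scores %>%
  group_by(name) %>%
  summarise(avg_score = mean(score), .groups = "drop")

# Attach the averages; students without scores get NA
new_frame <- students %>%
  left_join(avg_by_student, by = "name")

new_frame
```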