Adding column based on data in other data frame - r

I would like to calculate the average exam score of each student and add this as a new column to a data frame:
library(dplyr)
my_students <- c("John", "Lisa", "Sam")
student_exam <- c("John", "Lisa", "John", "John")
score_exam <- c(7, 6, 7, 6)
students <- as.data.frame(my_students)
scores <- as.data.frame(student_exam)
scores <- cbind(scores, score_exam)
new_frame <- students %>% mutate(avg_score = (scores %>% filter(student_exam == my_students) %>% mean(score_exam)))
But the code above gives the following error:
Error in Ops.factor(student_exam, my_students) :
  level sets of factors are different
I assume it has to do with filter(student_exam == my_students). How would I do this in dplyr?

You need to define the two data frames with a matching column, here named "name". You can then use group_by() and summarize() to compute the average score per student and join that back onto students. Note that not every student in your class has an exam score, so Sam's average comes out as NA.
library(dplyr)
my_students <- c("John", "Lisa", "Sam")
student_exam <- c("John", "Lisa", "John", "John")
score_exam <- c(7, 6, 7, 6)
students <- data.frame("name" = as.character(my_students))
scores <- data.frame("name" = as.character(student_exam), "score" = score_exam)
avg_scores <- scores %>%
  group_by(name) %>%
  summarize(avgScore = mean(score)) %>%
  right_join(students)
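An equivalent way to write this, starting from students so the result keeps its row order, is a left_join (a sketch using the objects defined above):
students %>%
  left_join(scores %>%
              group_by(name) %>%
              summarize(avgScore = mean(score)),
            by = "name")   # Sam has no scores, so his avgScore is NA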

Related

Bootstrapping hierarchical data in R

I have a dataset of the following form:
dat <- expand.grid(cat=factor(1:4), lab=factor(1:10))
dat <- cbind(dat, x=runif(40), y=runif(40, 2, 5))
where I have observations of 4 cats in 10 labs.
Now I want to simulate samples from this dataset by resampling in order to have:
each cat observed in 5 (random) labs AND each lab with 50% (or 2) random cats observed.
Honestly I cannot figure my way out of this... Thanks in advance
This type of thing is generally easiest with a function.
This function takes the data, first filters it down to a sample of labs, then samples the cats within each lab.
library(dplyr)
dat <- expand.grid(cat=factor(1:4), lab=factor(1:10)) %>%
  mutate(x = runif(nrow(.)),
         y = runif(nrow(.), 2, 5))

samplr <- function(dat, nlab = 5, ncat = 2){
  dat %>%
    filter(lab %in% sample(unique(dat$lab), nlab)) %>%
    group_by(lab) %>%
    filter(cat %in% sample(unique(dat$cat), ncat))
}
samplr(dat)
and you can then change the number of cats or labs being sampled
samplr(dat, nlab = 4, ncat = 3)
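A quick sanity check that the constraints hold (a sketch; s is just a throwaway name for one resample):
set.seed(42)                # only to make this particular draw reproducible
s <- samplr(dat)
length(unique(s$lab))       # 5 labs were sampled
table(droplevels(s$lab))    # each sampled lab keeps ncat = 2 cats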

R: How to replace values in column with random numbers WITH duplicates

I have a df with data and a name for each row. I would like the names to be replaced by a random string/number, but with the same string when a name appears twice or more (e.g. for Adam and Camille below).
df <- data.frame("name" = c("Adam", "Adam", "Billy", "Camille", "Camille", "Dennis"), "favourite food" = c("Apples", "Banana", "Oranges", "Banana", "Apples", "Oranges"), stringsAsFactors = F)
The expected output is something like this (it is not important what the random string looks like or how long it is):
df_exp <- data.frame("name" = c("xxyz", "xxyz", "xyyz", "xyzz", "xyzz", "yyzz"),
                     "favourite food" = c("Apples", "Banana", "Oranges", "Banana", "Apples", "Oranges"),
                     stringsAsFactors = F)
I have tried several random replacement functions in R, but each of them creates a different random string for every row rather than reusing the same string for duplicated names, e.g. stri_rand_strings:
library(stringi)
library(magrittr)
library(tidyr)
library(dplyr)
df <- df %>%
  mutate(UniqueID = do.call(paste0, Map(stri_rand_strings, n = 6, length = c(2, 6),
                                        pattern = c('[A-Z]', '[0-9]'))))
One way is with a group_by/mutate
df %>%
  group_by(name) %>%
  mutate(hidden = stringi::stri_rand_strings(1, length = 4)) %>%
  ungroup() %>%
  mutate(name = hidden)
Basically we just generate one random string per group.
You could also generate a translation table first with something like
new_names <- df %>%
  distinct(name) %>%
  mutate(new_name = stringi::stri_rand_strings(n(), length = c(2, 6)))
and then merge that onto the original data. Either way, I'm not sure stri_rand_strings() is guaranteed to return unique values -- they are just random strings, so a collision is unlikely but possible. Creating the translation table first makes it easy to check that the replacements are all distinct.
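The merge step could look like this (a sketch that reuses the new_names table from above):
df %>%
  left_join(new_names, by = "name") %>%   # attach the random ID generated for each name
  mutate(name = new_name) %>%             # overwrite the real name with the ID
  select(-new_name)                       # drop the helper column

stopifnot(!anyDuplicated(new_names$new_name))  # fail loudly if two IDs ever collide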

Manipulating variables to produce a new dataset in R

I'm a relatively new R user. I would really appreciate any help with my dataset please.
I have a dataset with 24 million rows. There are 3 variables in the dataset: patient name, pharmacy name, and count of medications picked up from the pharmacy at that visit.
Some patients appear in the dataset more than once (ie. they have picked up medications from different pharmacies at different time points).
The data frame looks like this:
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("A", "B", "B", "B", "C"),
meds = c(3, 2, 5, 8, 2))
From this data I want to generate a new dataset, which has ONE pharmacy for each patient. This pharmacy needs to be the one where the patient has picked up the highest number of medications.
For example, Tom's most frequent pharmacy is pharmacy B, because he has picked up 13 medications there (5 + 8 meds). The dataset I would like to generate:
data.frame(name = c("Tom", "Rob", "Amy"),
pharmacy = c("B", "B", "C"),
meds = c(13, 2, 2))
Can someone please help me with writing a code to do this?
I have tried various approaches in R, such as dplyr, tidyr and aggregate(), with no success. Any help would be genuinely appreciated.
Thank you very much
Alex
Your question is not reproducible. But here is one solution:
# create reproducible example of data
dataset1 <- data.frame(
  name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
  pharmacy = c("pharmacy_A", "pharmacy_B", "pharmacy_B", "pharmacy_B", "pharmacy_C"),
  meds_count = c(3, 2, 5, 8, 2))

library(dplyr) # load dplyr

dataset2 <- dataset1 %>%
  group_by(name, pharmacy) %>%                # group by your grouping variables
  summarise(meds_count = sum(meds_count)) %>% # sum no. of meds by your grouping variables
  top_n(1, meds_count) %>%                    # filter for only the top 1 count
  ungroup()
Resulting dataframe:
> dataset2
# A tibble: 3 x 3
  name  pharmacy   meds_count
  <fct> <fct>           <dbl>
1 Amy   pharmacy_C       2.00
2 Rob   pharmacy_B       2.00
3 Tom   pharmacy_B      13.0
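As a side note, top_n() has since been superseded in dplyr; if you are on dplyr 1.0.0 or later the same step can be written with slice_max() (a sketch using the objects above):
dataset2 <- dataset1 %>%
  group_by(name, pharmacy) %>%
  summarise(meds_count = sum(meds_count)) %>%
  slice_max(meds_count, n = 1) %>%  # keep the pharmacy with the highest total per patient
  ungroup()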
If I understood you correctly, I think you're looking for something like this.
require(tidyverse)
#Sample data. I copied yours.
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("A", "B", "B", "B", "C"),
meds = c(3, 2, 5, 8, 2))
Edit: I changed the group_by() and summarise() calls and added a filter().
df %>%
  group_by(name, pharmacy) %>%
  summarise(SumMeds = sum(meds, na.rm = TRUE)) %>%
  filter(SumMeds == max(SumMeds))
Results:
  name  pharmacy SumMeds
  <fct> <fct>      <dbl>
1 Amy   C             2.
2 Rob   B             2.
3 Tom   B            13.
Generating your dataset:
patient = c("Tom","Rob","Tom","Tom","Amy")
pharmacy = c("A","B","B","B","C")
meds = c(3,2,5,8,2)
df = data.frame(patient,pharmacy,meds)
df is your dataframe
library(dplyr)
df <- df %>%
  group_by(patient, pharmacy) %>%
  summarize(meds = sum(meds)) %>%
  group_by(patient) %>%
  filter(meds == max(meds))
Take your df and group it by patient and pharmacy.
Calculate the total medicines bought by each patient from each pharmacy by summing meds.
Then group by patient.
Finally, filter for the maximum.
Print the data frame:
print(df)
You can do it in base R with aggregate() twice, followed by merge(). Having to call aggregate twice is a bit clunky, and the dplyr solutions may well run more quickly, especially with a dataset of 24 million rows.
agg <- aggregate(meds ~ name + pharmacy, df, FUN = function(x) sum(x))
agg2 <- aggregate(meds ~ name, agg, function(x) x[which.max(x)])
merge(agg, agg2)[c(1, 3, 2)]
# name pharmacy meds
#1 Amy C 2
#2 Rob B 2
#3 Tom B 13
Data.
This is the dataset in the question after the edit.
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("A", "B", "B", "B", "C"),
meds = c(3, 2, 5, 8, 2), stringsAsFactors = FALSE)
Assuming the following dataset:
df <- tribble(
  ~patient, ~pharmacy,    ~medication,
  "Tom",    "Pharmacy A", "3 meds",
  "Rob",    "Pharmacy B", "2 meds",
  "Tom",    "Pharmacy B", "5 meds",
  "Tom",    "Pharmacy B", "8 meds",
  "Amy",    "Pharmacy C", "2 meds"
)
A tidyverse-friendly option could be:
df %>%
  mutate(med_n = as.numeric(str_extract(medication, "[0-9]+"))) %>% # 1
  group_by(patient, pharmacy) %>%                                   # 2
  mutate(med_sum = sum(med_n)) %>%                                  # 3
  group_by(patient) %>%                                             # 4
  filter(med_sum == max(med_sum)) %>%                               # 5
  select(patient, pharmacy, med_sum) %>%                            # 6
  distinct()                                                        # 7
1. create a numeric variable, as you can't add strings
2. among all patient / pharmacy couples...
3. ...find the total number of medications
4. then among all patients...
5. ...keep only the pharmacies with the highest patient / pharmacy totals
6. discard the unneeded variables
7. discard duplicated lines (there are several lines per patient / pharmacy couple)

How to restrict full_join() duplicates? - R

I am a novice R programmer. Below is the dataframe I am using.
I am currently running into a filtering problem with the full_join() from tidyverse.
library(tidyverse)
set.seed(1234)
df <- data.frame(
  trial = rep(0:1, each = 8),
  sex = rep(c('M', 'F'), 4),
  participant = rep(1:4, 4),
  x = runif(16, 1, 10),
  y = runif(16, 1, 10))
df
I am currently doing the following to perform the full_join():
df <- df %>% mutate(k = 1)
df <- df %>%
  full_join(df, by = "k")
I then restrict the results to the combinations of points for the same participant across the two trials:
df2 <- filter(df, sex.x == sex.y, participant.x == participant.y, trial.x != trial.y)
df3 <- filter(df2, participant.x == 1)
df3
Here, at this step, I am running into trouble. I do not care about the order of the points. How do I condense the duplicates into one row?
Thank you
Depending on which columns you want to consider, use the duplicated() function. The first line below weeds out duplicates based on the first 5 columns (the .x columns); the second weeds them out based on columns 7 to 11 (the .y columns).
df3[!duplicated(df3[, 1:5]), ]
df3[!duplicated(df3[, 7:11]), ]
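Another option, since the trials are coded 0 and 1, is to keep each unordered pair only once by requiring trial.x < trial.y (a sketch using the df2 from the question):
df2 %>%
  filter(trial.x < trial.y)  # keeps exactly one row per unordered pair of trial-0 / trial-1 points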

Change from baseline for repeated ids with missing baseline points

A similar question has been asked and answered below:
Change from baseline for repeated ids
My question differs from the original question in that I have missing baseline values. I am including a small reproducible example below:
library(dplyr)
df1 <- data.frame(probeID = c(rep("A", 19), rep("B", 19), rep("C", 19)),
                  Subject_ID = rep(c(rep(1, 5), rep(2, 4), rep(3, 5), rep(4, 5)), 3),
                  time = rep(c(1:5, 2:5, rep(1:5, 2)), 3))
df1$measure <- df1$Subject_ID * c(1:nrow(df1))
df2 <- subset(df1, Subject_ID != 2)
df2 %>%
  group_by(probeID, Subject_ID) %>%
  mutate(change = measure - measure[time == 1])
However, when I replace df2 with df1 in the pipe above, it fails because the time = 1 data point is missing for Subject_ID = 2. My desired output in the df1 case should be identical to the output from df2. I would appreciate any help.
Thanks
JJ
I had some trouble figuring out exactly what you were asking for; does this work?
df1 %>%
  group_by(probeID, Subject_ID) %>%
  mutate(change = measure - first(measure))
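One caveat (my assumption, not part of the answer above): first(measure) simply takes the first row of each group in its current order, so if the rows are not already sorted by time you may want to arrange within groups first:
df1 %>%
  group_by(probeID, Subject_ID) %>%
  arrange(time, .by_group = TRUE) %>%        # make sure the earliest time comes first
  mutate(change = measure - first(measure))
For Subject_ID 2 the baseline then becomes its earliest available time (time = 2), since time = 1 is missing.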
