I am performing a computation that involves two data frames. I created reproducible examples of both data frames below:
> df1
Day1 Day2 Day3 Day4 Day5 Day6 Day7 Day8 Day9 Day10
Time1 0.03 0.43 0.39 0.41 0.94 0.70 0.18 0.65 0.72 0.72
Time2 0.42 0.63 0.93 0.53 0.19 0.55 0.22 0.16 0.56 0.04
and
> df2
Day Time X3 X4 X5
1 1 1 9.252042 19.512621 11.601671
2 1 2 5.021522 17.712484 5.044728
3 2 1 9.603795 19.404302 17.206771
4 2 2 19.686793 18.791541 12.655874
5 3 1 7.546551 18.810526 19.865979
6 3 2 18.233872 19.596584 11.653980
7 4 1 17.499680 14.014276 15.553013
8 4 2 8.115352 17.898786 12.841630
9 5 1 10.719540 8.518823 19.126440
10 5 2 12.853401 6.026599 14.041490
11 6 1 19.984946 10.693528 6.890835
12 6 2 16.360035 15.778092 18.087471
13 7 1 15.498714 15.039444 5.259257
14 7 2 13.179111 17.533358 7.382507
15 8 1 5.124188 15.507194 12.547365
16 8 2 8.008336 10.463382 6.934014
17 9 1 11.246527 6.975527 14.464758
18 9 2 17.914083 18.039384 19.324091
19 10 1 9.876625 19.216317 8.787550
20 10 2 11.851955 15.729080 5.741095
The columns in df1 represent the days on which the values were recorded and the rows indicate the time (Time 1 or 2). In df2, the first two columns give the day and time respectively, and the remaining columns are the locations where data was recorded.
What I would like to do in R is create another data frame, the same size as df2, that divides the values in df2[,3:5] by the corresponding df1 value, i.e. use the day and time columns of df2 to select the matching value of df1.
For example, for the first value of df2$X3 the new data frame would contain 9.252042 divided by 0.03 (Day 1, Time 1), and for the third value of df2$X3 it would contain 9.603795 divided by 0.43 (Day 2, Time 1).
Thank you in advance for any help!
I suppose you create your data (df1 and df2) as below:
df1 = data.frame(day=c(1:10),time1=c(0.03,0.43,0.39,0.41,.94,.70,.18,.65,0.72,0.72),time2 = c(.42,.63,.93,.53,.19,.55,.22,.16,.56,.04))
df2 = data.frame(Day = rep(c(1:10),each=2),Time = rep(c(1,2),10),X3=c(9.2,5.02,9.6,19.6,7.5,18.2,17.4,8.1,10.7,12.8,19.9,16.3,15.4,13.1,5.1,8,11.2,17.9,9.8,11.8),X4=c(19.5,17.7,19.4,18.8,18,19.5,14.01,17.8,8.5,6,10.6,15.7,15,17.5,15,10,6,18,19,15),X5=c(11.6,5,17,12,19,11,15,12,19,14,6,18,5,7,12,6,14,19,8,5))
Then the code you will need to create df3 is this (pick the divisor from time1 or time2 according to df2$Time, indexed by df2$Day):
divisor = ifelse(df2$Time == 1, df1$time1[df2$Day], df1$time2[df2$Day])
df3 = data.frame(Day = df2$Day, Time = df2$Time, newx3 = df2$X3 / divisor, newx4 = df2$X4 / divisor, newx5 = df2$X5 / divisor)
My suggestion is to follow tidy data principles.
Here I provide an example with the same structure as your data frames, but simplified to days 1-3 only:
library(dplyr)
library(tidyr)
untidy = tibble(day1 = c(0.03,0.42), day2 = c(0.43,0.63), day3 = c(0.39,0.93))
tidy = tibble(day = c(1,1,2,2,3,3), time = c(1,2,1,2,1,2), val1 = c(9.252042,5.021522,9.603795,19.686793,7.546551,18.233872))
untidy_to_tidy = untidy %>%
gather(day,val2) %>%
mutate(day = as.double(gsub("day","",day)),
time = rep(c(1,2), (ncol(untidy) * nrow(untidy))/2)) %>%
select(day,time,val2)
tidy %>%
left_join(untidy_to_tidy, by = c("day","time")) %>%
mutate(division = val1 / val2)
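As an aside, gather() has since been superseded in tidyr by pivot_longer(); a sketch of the same reshaping with it (assuming the same column layout as untidy above) would be:
untidy_to_tidy = untidy %>%
  mutate(time = c(1, 2)) %>%
  pivot_longer(starts_with("day"), names_to = "day", values_to = "val2",
               names_prefix = "day", names_transform = list(day = as.double)) %>%
  select(day, time, val2)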
If you are new to R, keep it simple and do it like this:
read your CSV/TSV/etc. using read_csv("YOUR_FILE.CSV") from the readr package
in my example replace
untidy = tibble(day1 = c(0.03,0.42), day2 = c(0.43,0.63), day3 = c(0.39,0.93))
by
untidy = read_csv("YOUR_FILE.CSV")
and
tidy = tibble(day = c(1,1,2,2,3,3), time = c(1,2,1,2,1,2), val1 = c(9.252042,5.021522,9.603795,19.686793,7.546551,18.233872))
by
tidy = read_csv("YOUR_OTHER_FILE.CSV")
What you need to be careful about: this works because your two data frames are arranged in matching order. unlist(df1) flattens df1 column by column, giving Day1/Time1, Day1/Time2, Day2/Time1, ..., which is exactly the row order of df2, and that length-20 vector is recycled down each of the three columns being divided. The code is as follows:
df2[3:5]/unlist(df1)
X3 X4 X5
1 308.401400 650.420700 386.72237
2 11.956005 42.172581 12.01126
3 22.334407 45.126284 40.01575
4 31.248878 29.827843 20.08869
5 19.350131 48.232118 50.93841
6 19.606314 21.071596 12.53116
...
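A sketch of an order-independent alternative (my addition, not from the answer above), using the wide df1 from the question (rows = times, columns = days): matrix indexing picks the divisor for each row of df2 explicitly, so it does not rely on the rows being sorted.
divisors <- as.matrix(df1)[cbind(df2$Time, df2$Day)]
df3 <- cbind(df2[1:2], df2[3:5] / divisors)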
I'm trying to figure out how to make R count the number of df$event_dt entries for each df$id that fall within the episode.start to episode.end range for that id in the All_date_events data frame that AdhereR produces.
Essentially, I want the number of events for each ID within each episode's date range.
After running the code below you'll have two data frames:
df, the input data, transformed from the original df to four columns
All_date_events, the output to which I'd like to add an event.num column
library(dplyr)
library(tidyverse)
library(AdhereR)
library(hablar)
library(lubridate)
# Starting data
df <- data.frame(id = c("1","1","2","2","1","1","1"),
event_dt = dmy('2-1-2012',
'2-4-2012',
'2-5-2012',
'2-7-2012',
'2-12-2012',
'1-3-2013',
'22-5-2013'))
df <- df %>% mutate(ones = "1", class = "N1")
df <- transform(df, ones = as.numeric(ones))
All_date_events <- compute.treatment.episodes(df,
ID.colname="id",
event.date.colname="event_dt",
event.duration.colname="ones",
event.daily.dose.colname="ones",
medication.class.colname="class",
carryover.within.obs.window = TRUE, # carry-over into the OW
carry.only.for.same.medication = TRUE, # & only for same type
consider.dosage.change = TRUE, # dosage change starts new episode...
medication.change.means.new.treatment.episode = TRUE, # & type change
maximum.permissible.gap = 90, # & a gap longer than 90 days starts a new episode
maximum.permissible.gap.unit = "days", # unit for the above (days)
followup.window.start = 0, # 2-years FUW starts at earliest event
followup.window.start.unit = "days",
followup.window.duration = 365*2,
followup.window.duration.unit = "days",
date.format = "%d/%m/%Y");
Either of these results is what I'm hoping for:
id  episode.ID  episode.start  end.episode.gap.days  episode.duration  episode.end  event.num
1   1           2012-01-02     243                   92                2012-04-03   2
1   2           2012-12-02     223                   172               2013-05-23   3
2   1           2012-05-02     668                   62                2012-07-03   2
or:
Id.code  Event.date  Event.start  Event.end  Events.num  Event.Episode
1        2.1.2012    2.1.2012     2.4.2012   2           1
1        2.4.2012    2.1.2012     2.4.2012   2           1
2        2.5.2012    2.5.2012     2.7.2012   2           1
2        2.7.2012    2.5.2012     2.7.2012   2           1
1        2.12.2012   2.5.2012     22.5.2013  3           2
1        21.3.2013   2.5.2012     22.5.2013  3           2
1        22.5.2013   2.5.2012     22.5.2013  3           2
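A sketch of one way the event.num column could be derived (my own assumption, not a built-in AdhereR feature): join the raw events back onto the episodes and count the events that fall inside each episode window.
event_counts <- All_date_events %>%
  left_join(df, by = "id") %>%
  filter(event_dt >= episode.start, event_dt <= episode.end) %>%
  count(id, episode.ID, name = "event.num")
All_date_events <- left_join(All_date_events, event_counts, by = c("id", "episode.ID"))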
Let's suppose that a company has 3 Bosses and 20 Employees, where each Employee has done n_Projects with an overall Performance in percentage:
> df <- data.frame(Boss = sample(1:3, 20, replace=TRUE),
Employee = sample(1:20,20),
n_Projects = sample(50:100, 20, replace=TRUE),
Performance = round(sample(1:100,20,replace=TRUE)/100,2),
stringsAsFactors = FALSE)
> df
Boss Employee n_Projects Performance
1 3 8 79 0.57
2 1 3 59 0.18
3 1 11 76 0.43
4 2 5 85 0.12
5 2 2 75 0.10
6 2 9 66 0.60
7 2 19 85 0.36
8 1 20 79 0.65
9 2 17 79 0.90
10 3 14 77 0.41
11 1 1 78 0.97
12 1 7 72 0.52
13 2 6 62 0.69
14 2 10 53 0.97
15 3 16 91 0.94
16 3 4 98 0.63
17 1 18 63 0.95
18 2 15 90 0.33
19 1 12 80 0.48
20 1 13 97 0.07
The CEO asks me to compute the quality of the work for each boss. However, he asks for a specific calculation: each Performance value has to have a weight equal to the n_Projects value over the total n_Projects for that boss.
For example, Boss 1 has a total of 604 n_Projects, where project 1 has a Performance weight of 0.13 (78/604 * 0.97 = 0.13), project 3 a Performance weight of 0.02 (59/604 * 0.18 = 0.02), and so on. The sum of these Performance weights is the Boss's performance, which for Boss 1 is 0.52. So, the final output should look like this:
Boss total_Projects Performance
1 604 0.52
2 340 0.18 #the values for boss 2 are invented
3 230 0.43 #the values for boss 3 are invented
However, I'm still struggling with this:
df %>%
group_by(Boss) %>%
summarise(total_Projects = sum(n_Projects),
Weight_Project = n_Projects/sum(total_Projects))
In addition, can you give me any feedback on my code, or any recommendations for improving my data-manipulation skills? (You can see in my profile that I have asked a lot of questions like this, but I'm still not able to solve them on my own.)
We can take the sum of the product of 'n_Projects' and 'Performance' and divide it by 'total_projects':
library(dplyr)
df %>%
group_by(Boss) %>%
summarise(total_projects = sum(n_Projects),
Weight_Project = sum(n_Projects * Performance)/total_projects)
# or
# Weight_Project = n_Projects %*% Performance/total_projects)
# A tibble: 3 x 3
# Boss total_projects Weight_Project
# <int> <int> <dbl>
#1 1 604 0.518
#2 2 595 0.475
#3 3 345 0.649
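For reference, base R's weighted.mean() computes the same quantity, since weighted.mean(x, w) is sum(w * x)/sum(w):
df %>%
  group_by(Boss) %>%
  summarise(total_projects = sum(n_Projects),
            Weight_Project = weighted.mean(Performance, n_Projects))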
Adding some more details about what you did and #akrun's answer:
You must have received the following error message:
df %>%
group_by(Boss) %>%
summarise(total_Projects = sum(n_Projects),
Weight_Project = n_Projects/sum(total_Projects))
## Error in summarise_impl(.data, dots) :
## Column `Weight_Project` must be length 1 (a summary value), not 7
This tells you that the calculation you wrote for Weight_Project does not yield a single summary value for each Boss, but several (here, 7). summarise is there to collapse several values into one (by means, sums, etc.). Here you divide each value of n_Projects by sum(total_Projects), but you don't summarise the result into a single value.
Assuming that what you had in mind was first calculating the weight of each project, then combining it with the performance mark to yield the weighted mean performance, you can proceed in two steps:
df %>%
group_by(Boss) %>%
mutate(Weight_Performance = n_Projects / sum(n_Projects)) %>%
summarise(weighted_mean_performance = sum(Weight_Performance * Performance))
The mutate statement preserves the total number of rows in df, but sum(n_Projects) is calculated within each Boss group thanks to group_by.
Once each row has a project weight (which depends on the boss), you can calculate the weighted mean, which is a mean and thus a summary value, with summarise.
A more compact way that still shows the weighting explicitly would be:
df %>%
group_by(Boss) %>%
summarise(weighted_mean_performance = sum((n_Projects / sum(n_Projects)) * Performance))
# Rearranging to minimise parentheses, which gives #akrun's answer
df %>%
group_by(Boss) %>%
summarise(weighted_mean_performance = sum(n_Projects * Performance) / sum(n_Projects))
I am trying to identify the appropriate thresholds for two activities which generate the greatest success rate.
Listed below is an example of what I am trying to accomplish. For each location I am trying to identify the thresholds to use for activities 1 & 2, so that if either criterion is met we would guess 'yes' (1). I then need to make sure that we are guessing 'yes' for only a certain percentage of the total volume for each location, and that we are maximizing our accuracy (our guess of 'yes' matches an 'outcome' of 1).
set.seed(145)
location <- c(1,2,3)
testFile <- data.frame(location = rep.int(location, 20),
                       activity1 = round(rnorm(20, mean = 10, sd = 3)),  # length-20 vectors are recycled across the 60 rows
                       activity2 = round(rnorm(20, mean = 20, sd = 3)),
                       outcome = rbinom(20,1,0.5)
                       )
act_1_thresholds <- seq(7,12,1)
act_2_thresholds <- seq(19,24,1)
I was able to accomplish this by creating a table that contains all of the possible unique combinations of thresholds for activities 1 & 2, and then merging it with each observation in the sample data set. However, with ~200 locations in the actual data set, each with thousands of observations, I quickly ran out of memory.
I would like to create a function that takes the location id and the sets of possible thresholds for activities 1 and 2, and then calculates how often we would have guessed 'yes' (i.e. the values in 'activity1' or 'activity2' exceed their respective thresholds), to ensure our application rate stays within our desired range (50%-75%). Then, for each set of thresholds producing an application rate within that range, we would store only the set that maximizes accuracy, along with its location id, application rate, and accuracy rate. The desired output is listed below.
location act_1_thresh act_2_thresh application_rate accuracy_rate
1 1 13 19 0.52 0.45
2 2 11 24 0.57 0.53
3 3 14 21 0.67 0.42
I tried writing this as a for loop, but was not able to navigate the number of nested conditions I would have to account for. I would appreciate assistance from anyone who has attempted a similar problem. Thank you!
An example of how to calculate the application and accuracy rate for a single set of thresholds is listed below.
### Create location IDs
location <- c(1,2,3)
### Create a single set of thresholds
single_act_1_threshold <- 12
single_act_2_threshold <- 20
### Calculate the simulated application and accuracy rate of the thresholds above using historical data
library(data.table)
as.data.table(testFile)[,
list(
application_rate = round(sum(ifelse(single_act_1_threshold <= activity1 | single_act_2_threshold <= activity2, 1, 0))/
nrow(testFile),2),
accuracy_rate = round(sum(ifelse((single_act_1_threshold <= activity1 | single_act_2_threshold <= activity2) & (outcome == 1), 1, 0))/
sum(ifelse(single_act_1_threshold <= activity1 | single_act_2_threshold <= activity2, 1, 0)),2)
),
by = location]
Consider expand.grid, which builds a data frame of all combinations of both threshold vectors. Then use Map to iterate elementwise over the two columns of that data frame and build a list of data tables (each of which now includes columns recording the thresholds used).
library(data.table)
act_1_thresholds <- seq(7,12,1)
act_2_thresholds <- seq(19,24,1)
# ALL COMBINATIONS
thresholds_df <- expand.grid(th1=act_1_thresholds, th2=act_2_thresholds)
# USER-DEFINED FUNCTION
calc <- function(th1, th2)
as.data.table(testFile)[, list(
act_1_thresholds = th1, # NEW COLUMN
act_2_thresholds = th2, # NEW COLUMN
application_rate = round(sum(ifelse(th1 <= activity1 | th2 <= activity2, 1, 0)) /
nrow(testFile),2),
accuracy_rate = round(sum(ifelse((th1 <= activity1 | th2 <= activity2) & (outcome == 1), 1, 0)) /
sum(ifelse(th1 <= activity1 | th2 <= activity2, 1, 0)),2)
), by = location]
# LIST OF DATA TABLES
dt_list <- Map(calc, thresholds_df$th1, thresholds_df$th2)
# NAME ELEMENTS OF LIST
names(dt_list) <- paste(thresholds_df$th1, thresholds_df$th2, sep="_")
# SAME RESULT AS POSTED EXAMPLE
dt_list$`12_20`
# location act_1_thresholds act_2_thresholds application_rate accuracy_rate
# 1: 1 12 20 0.23 0.5
# 2: 2 12 20 0.23 0.5
# 3: 3 12 20 0.23 0.5
And if you need to append all elements use data.table's rbindlist:
final_dt <- rbindlist(dt_list)
final_dt
# location act_1_thresholds act_2_thresholds application_rate accuracy_rate
# 1: 1 7 19 0.32 0.47
# 2: 2 7 19 0.32 0.47
# 3: 3 7 19 0.32 0.47
# 4: 1 8 19 0.32 0.47
# 5: 2 8 19 0.32 0.47
# ---
# 104: 2 11 24 0.20 0.42
# 105: 3 11 24 0.20 0.42
# 106: 1 12 24 0.15 0.56
# 107: 2 12 24 0.15 0.56
# 108: 3 12 24 0.15 0.56
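The question also asks to keep, per location, only the threshold pair that maximizes accuracy among those whose application rate falls in the desired window. A sketch of that final selection step (my addition, with the 50%-75% window taken from the question):
best_dt <- final_dt[application_rate >= 0.50 & application_rate <= 0.75]
best_dt <- best_dt[order(-accuracy_rate), .SD[1], by = location]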
I have a data frame with patient data and measurements of different variables over time.
The data frame looks a bit like this, but with more lab-value variables:
df <- data.frame(id=c(1,1,1,1,2,2,2,2,2),
time=c(0,3,7,35,0,7,14,28,42),
labvalue1=c(4.04,NA,2.93,NA,NA,3.78,3.66,NA,2.54),
labvalue2=c(NA,63.8,62.8,61.2,78.1,NA,77.6,75.3,NA))
> df
id time labvalue1 labvalue2
1 1 0 4.04 NA
2 1 3 NA 63.8
3 1 7 2.93 62.8
4 1 35 NA 61.2
5 2 0 NA 78.1
6 2 7 3.78 NA
7 2 14 3.66 77.6
8 2 28 NA 75.3
9 2 42 2.54 NA
I want to calculate, for each patient (unique id), the slope per day between the first and the last measurement, so I can compare the slopes between patients. Time is in days. So, eventually I want a new variable, e.g. diff_labvalues, for each lab value, that gives me for labvalue1:
For patient 1: (2.93-4.04)/ (7-0) and for patient 2: (2.54-3.78)/(42-7) (for now ignoring the measurements in between, just last-first); etc for labvalue2, and so forth.
So far I have used dplyr and created the first1 and last1 functions, because first() and last() did not work with the NA values.
Then I grouped by 'id' and used mutate_all (because there are more lab values in the original df) to calculate the difference between the last1() and first1() lab values for each patient.
But I cannot find how to extract the corresponding time values (the delta time) that I need to calculate the slope of the decline.
Eventually I want something like this (last line):
first1 <- function(x) {
first(na.omit(x))
}
last1 <- function(x) {
last(na.omit(x))
}
df2 = df %>%
group_by(id) %>%
mutate_all(funs(diff=(last1(.)-first1(.)) / #it works until here
(time[position of last1(.)]-time[position of first1(.)]))) #something like this
Not sure if tidyverse even has a solution for this, so any help would be appreciated. :)
We can try (shown here for a single lab-value column, labvalue1):
df %>%
  group_by(id) %>%
  filter(!is.na(labvalue1)) %>%
  summarise(diff_labs = (last(labvalue1) - first(labvalue1))/(last(time) - first(time)))
# A tibble: 2 x 2
# id diff_labs
# <dbl> <dbl>
#1 1 -0.15857143
#2 2 -0.03542857
and
> (2.93-4.04)/ (7-0)
#[1] -0.1585714
> (2.54-3.78)/(42-7)
#[1] -0.03542857
Or another option is data.table
library(data.table)
setDT(df)[!is.na(labvalue1), .(diff_labs = (labvalue1[.N] - labvalue1[1])/(time[.N] - time[1])) , id]
# id diff_labs
#1: 1 -0.15857143
#2: 2 -0.03542857
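To get the slope for every lab-value column at once, as the question asks, here is a sketch using across() from current dplyr (my own extension, assuming all lab columns share the "labvalue" prefix):
library(dplyr)
slope <- function(x, time) {
  ok <- !is.na(x)  # keep only the observed measurements of this lab value
  (last(x[ok]) - first(x[ok])) / (last(time[ok]) - first(time[ok]))
}
df %>%
  group_by(id) %>%
  summarise(across(starts_with("labvalue"), ~ slope(.x, time), .names = "{.col}_diff"))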
I want to write code that compares two columns in a data frame: one holds the max temperature (TMAX) and the other the min temperature (TMIN). If a TMAX value is less than the corresponding TMIN value, the code should swap the two. I need to do this for multiple files in a folder.
Date TMAX TMIN
1 01/01/1960 4.7353 -4.3722
2 01/02/1960 8.3800 11.0600
3 01/03/1960 3.4400 -3.5300
4 01/04/1960 -1.4300 -8.2200
5 01/05/1960 -1.9600 -5.0100
6 01/06/1960 4.5800 -6.3400
7 01/07/1960 -8.8900 -2.7300
after running the code, it should produce something like
Date TMAX TMIN
1 01/01/1960 4.7353 -4.3722
2 01/02/1960 11.0600 8.3800
3 01/03/1960 3.4400 -3.5300
4 01/04/1960 -1.4300 -8.2200
5 01/05/1960 -1.9600 -5.0100
6 01/06/1960 4.5800 -6.3400
7 01/07/1960 -2.7300 -8.8900
dplyr::mutate(df, tmp = pmax(TMAX, TMIN), TMIN = pmin(TMAX, TMIN), TMAX = tmp, tmp = NULL)
(mutate() evaluates its arguments sequentially, so the direct TMAX = pmax(...), TMIN = pmin(...) form would compute the new TMIN against the already-updated TMAX; the temporary column avoids that.) Or, since transform() evaluates everything against the original data frame, simply:
transform(df, TMAX = pmax(TMAX, TMIN), TMIN = pmin(TMAX, TMIN))
How about:
df <- data.frame(TMAX=1:5,TMIN=c(2,1,6,3,4))
dfn <- df
dfn$TMAX <- pmax(df$TMIN,df$TMAX)
dfn$TMIN <- pmin(df$TMIN,df$TMAX)
## TMAX TMIN
## 1 2 1
## 2 2 1
## 3 6 3
## 4 4 3
## 5 5 4
(Maybe not the most elegant way) Assuming your data is called df:
check<-df[,2]>df[,3]
dfn<-df
dfn[!check,3]<-df[!check,2]
dfn[!check,2]<-df[!check,3]
And a more elegant way, using the question's column names, is:
transform(df, TMAX = ifelse(TMAX < TMIN, TMIN, TMAX), TMIN = ifelse(TMIN > TMAX, TMAX, TMIN))
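None of the answers above covers the "multiple files in a folder" part of the question. A sketch of that step (the folder path is a placeholder, and comma-separated files with TMAX/TMIN columns are assumed):
files <- list.files("path/to/folder", pattern = "\\.csv$", full.names = TRUE)
for (f in files) {
  d <- read.csv(f)
  # swap where needed, evaluating both calls against the original columns
  d <- transform(d, TMAX = pmax(TMAX, TMIN), TMIN = pmin(TMAX, TMIN))
  write.csv(d, f, row.names = FALSE)  # overwrites in place; adjust as needed
}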