Suppose you have a long data.frame of the following form:
ID Group Year Field VALUE
1 1 2016 AA 10
2 1 2016 AA 16
1 1 2016 TOTAL 100
2 1 2016 TOTAL 120
etc.
and you want to create a grouped output of weighted.mean(VALUE, ??) for each group_by(Group, Year, Field), using the Field == 'TOTAL' rows as the weights, for years > 2013.
So far I am using dplyr:
dat %>%
  filter(Year > 2013) %>%
  group_by(Group, Year, Field) %>%
  summarize(m = weighted.mean(VALUE, VALUE[Field == 'TOTAL'])) %>%
  ungroup()
Now the problem (to my understanding) is that after group_by(Group, Year, Field), I can no longer reach the "TOTAL" rows inside summarize(), because each group only sees its own Field value (e.g. Field == "AA").
Transforming the data from long to wide is not a solution, as I have >1000 different Field values that will potentially increase over time, and this code will be run daily at some point.
First of all, this is a hacky solution, and I am sure there is a better approach to this issue. The goal is to make a new column containing the weights, and this approach does so using the filling nature of left_join(), but you could likely also do this with fill() or across() (see the sketch after the reprex below).
library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.0.3
# Example data from OP
dat <- data.frame(ID = c(1,2,1,2), Group = rep(1,4), Year = rep(2016,4),Field = c("AA","AA","TOTAL","TOTAL"), VALUE = c(10,16,100,120))
# Make a new dataframe containing the TOTAL values
weights <- dat %>% filter(Field == "TOTAL") %>% mutate(w = VALUE) %>% select(-Field,-VALUE)
weights
#> ID Group Year w
#> 1 1 1 2016 100
#> 2 2 1 2016 120
# Make a new frame containing the original values and the weights
new_dat <- left_join(dat,weights, by = c("Group","Year","ID"))
# Group and summarize, weighting by the new w column
new_dat %>%
  filter(Year > 2013) %>%
  group_by(Group, Year, Field) %>%
  summarize(m = weighted.mean(VALUE, w)) %>%
  ungroup()
#> `summarise()` regrouping output by 'Group', 'Year' (override with `.groups` argument)
#> # A tibble: 2 x 4
#> Group Year Field m
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 2016 AA 13.3
#> 2 1 2016 TOTAL 111.
Created on 2020-11-03 by the reprex package (v0.3.0)
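As hinted above, the helper data frame can also be avoided with a grouped mutate(). This is a minimal sketch of that alternative, not part of the original answer; it assumes there is exactly one TOTAL row per ID/Group/Year combination and should reproduce the same m values as the output above.
library(dplyr)
dat %>%
  group_by(ID, Group, Year) %>%
  mutate(w = VALUE[Field == "TOTAL"]) %>%   # broadcast each ID's TOTAL value as its weight
  ungroup() %>%
  filter(Year > 2013) %>%
  group_by(Group, Year, Field) %>%
  summarise(m = weighted.mean(VALUE, w), .groups = "drop")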
Suppose we start with this very simple dataframe called myData:
> myData
Element Class
1 A 0
2 A 0
3 C 0
4 A 0
5 B 1
6 B 1
7 A 2
Generated by:
myData = data.frame(Element = c("A","A","C","A","B","B","A"),Class = c(0,0,0,0,1,1,2))
How would I use dplyr to extract the number of times "A" appears in the Element column of the myData data frame? I would simply like the number 4 returned, for further processing in dplyr. All I have so far is the dplyr code shown at the bottom, which seems clumsy because, among other things, it yields another data frame with more information than just the number 4 that is needed:
# A tibble: 1 x 2
Element counted
<chr> <int>
1 A 4
The dplyr code that produces the above tibble:
library(dplyr)
myData %>% group_by(Element) %>% filter(Element == "A") %>% summarise(counted = n())
We can use count, which simplifies the group_by + summarise step:
library(dplyr)
myData %>%
  filter(Element == 'A') %>%
  count(Element, name = 'counted')
Or with just summarise and sum
myData %>%
  summarise(counted = sum(Element == 'A'), Element = 'A') %>%
  relocate(Element, .before = 1)
Element counted
1 A 4
Another option is to use tally, like this:
myData = data.frame(Element = c("A","A","C","A","B","B","A"),Class = c(0,0,0,0,1,1,2))
library(dplyr)
myData %>%
filter(Element == "A") %>%
group_by(Element) %>%
tally()
#> # A tibble: 1 × 2
#> Element n
#> <chr> <int>
#> 1 A 4
Created on 2022-07-28 by the reprex package (v2.0.1)
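If only the bare number 4 is needed for further processing, as the question asks, here is a minimal sketch (not part of the answers above) that extracts it with pull(), plus a base R one-liner:
library(dplyr)
# pull() drops the data frame wrapper and returns the plain count
myData %>%
  filter(Element == "A") %>%
  count(Element, name = 'counted') %>%
  pull(counted)
#> [1] 4
# or without dplyr at all
sum(myData$Element == "A")
#> [1] 4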
I'm trying to calculate the number of days that a patient spent during a given state in R.
The original question included an image of example data. I only have columns 1 to 3 and I want to get the answer in column 5. I am thinking that if I can create a date column (column 4) holding the first recorded date for each state, I can then subtract it from column 2 to get the days I am looking for.
I tried group_by(MRN, STATE), but the problem is that it groups the second run of 1's together with the first run of 1's (and likewise for the 2's), which is not what I want.
Use mdy_hm to convert OBS_DTM to POSIXct, and group_by MRN together with the rleid of STATE so that the first run of 1's is handled separately from the second. Then use difftime to calculate the difference, in days, between each OBS_DTM and the minimum OBS_DTM in its group.
If your data is called data:
library(dplyr)
data %>%
  mutate(OBS_DTM = lubridate::mdy_hm(OBS_DTM)) %>%
  group_by(MRN, grp = data.table::rleid(STATE)) %>%
  mutate(Answer = as.numeric(difftime(OBS_DTM, min(OBS_DTM), units = 'days'))) %>%
  ungroup() %>%
  select(-grp) -> result
result
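Since the question's example data was provided only as an image, here is a purely hypothetical data frame in the same shape (MRN, OBS_DTM as month/day/year date-time strings, STATE) that the pipeline above can be run against; the column values are assumptions for illustration, not taken from the post:
# hypothetical example data, values assumed for illustration only
data <- data.frame(
  MRN = 1,
  OBS_DTM = c("7/27/2020 8:44", "7/27/2020 8:56", "8/8/2020 20:12",
              "8/14/2020 10:13", "8/15/2020 13:32"),
  STATE = c(1, 1, 2, 2, 2)
)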
You could try the following:
library(dplyr)
df %>%
  group_by(ID, State) %>%
  mutate(priorObsDTM = lag(OBS_DTM)) %>%
  filter(!is.na(priorObsDTM)) %>%
  ungroup() %>%
  mutate(Answer = as.numeric(OBS_DTM - priorObsDTM, units = 'days'))
The dataframe I used for this example:
df <- data.frame(
  ID = 1,
  OBS_DTM = as.POSIXlt(
    c('2020-07-27 8:44', '2020-7-27 8:56', '2020-8-8 20:12',
      '2020-8-14 10:13', '2020-8-15 13:32')
  ),
  State = c(1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)
df
# A tibble: 3 x 5
# ID OBS_DTM State priorObsDTM Answer
# <dbl> <dttm> <dbl> <dttm> <dbl>
# 1 1 2020-07-27 08:56:00 1 2020-07-27 08:44:00 0.00833
# 2 1 2020-08-14 10:13:00 2 2020-08-08 20:12:00 5.58
# 3 1 2020-08-15 13:32:00 2 2020-08-14 10:13:00 1.14
I want to create a variable using dplyr that takes in a value conditional on another variable.
See example below.
data.frame(list(group = c('a','a','b','b'),
                time = c(1,2,1,2),
                value = seq(1,4,1)))
I want to create a variable 'baseline' that, within each group, takes the value of 'value' where time == 1. As such, the desired output would be:
data.frame(list(group = c('a','a','b','b'),
                time = c(1,2,1,2),
                value = seq(1,4,1),
                baseline = c(1,1,3,3)))
I tried to run the following code using indexing, but am clearly going wrong somewhere:
x <- data.frame(list(group = c('a','a','b','b'),
                     time = c(1,2,1,2),
                     value = seq(1,4,1)))
x %>% group_by(group) %>%
mutate(baseline = .[[.$time==1,.$value]])
Thanks
We can use which.min
library(dplyr)
df1 %>%
  group_by(group) %>%
  mutate(baseline = value[which.min(time)])
# A tibble: 4 x 4
# Groups: group [2]
# group time value baseline
# <chr> <dbl> <dbl> <dbl>
#1 a 1 1 1
#2 a 2 2 1
#3 b 1 3 3
#4 b 2 4 3
and if it is already ordered by 'time', then simply use first
df1 %>%
  group_by(group) %>%
  mutate(baseline = first(value))
data
df1 <- data.frame(group = c('a','a','b','b'),
                  time = c(1,2,1,2),
                  value = seq(1,4,1))
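If you literally want the value at time == 1 (rather than at the earliest time), a direct index also works. A minimal sketch, assuming each group has exactly one row with time == 1:
library(dplyr)
df1 %>%
  group_by(group) %>%
  mutate(baseline = value[time == 1][1]) %>%  # the [1] yields NA if a group has no time == 1 row
  ungroup()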
I need to calculate Revenue per load, after grouping by "Team", for my Shiny dashboard. I am being told I have an invalid 'type' (character) of argument.
I have tried changing how the summarise call is formatted. It does not work in the console either, so I have removed the Shiny portions of the code.
August <- data.frame("Revenue" = c(10,20,30,40), "Volume" = c(2,4,5,7),
"Team" = c("Blue","Green","Gold","Purple"))
x <- August %>% group_by(Team) %>% summarise(Revenue = sum(Revenue)) /
  August %>% group_by(Team) %>% summarise(Volume = sum(volume))
"Error: invalid 'type' (character) of argument"
This error shows up instead of the bar graph.
Summarize the Revenue and Volume and then take their ratio. Note that summarise evaluates its arguments from left to right, so once Revenue and Volume have been defined within the summarise call, the references to them in the RevByVol definition refer to these new definitions and not to the original unsummarized columns.
August %>%
  group_by(Team) %>%
  summarise(Revenue = sum(Revenue),
            Volume = sum(Volume),
            RevByVol = Revenue / Volume) %>%
  ungroup()
giving:
# A tibble: 4 x 4
Team Revenue Volume RevByVol
<fct> <dbl> <dbl> <dbl>
1 Blue 10 2 5
2 Gold 30 5 6
3 Green 20 4 5
4 Purple 40 7 5.71
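If you do want to keep the original idea of building two separate summaries, here is a sketch (my illustration, not part of the answer) that combines them with a join and a mutate instead of dividing two data frames; the single summarise above is simpler:
library(dplyr)
rev <- August %>% group_by(Team) %>% summarise(Revenue = sum(Revenue))
vol <- August %>% group_by(Team) %>% summarise(Volume = sum(Volume))
left_join(rev, vol, by = "Team") %>%
  mutate(RevByVol = Revenue / Volume)   # same Revenue-per-Volume ratio per Team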
I have a longitudinal data set and would like to extract the latest, non-missing complete set of observations for each variable in the data set, where id is a unique identifier, yr is year, and x1 and x2 are variables with missing values. The actual data set has hundreds of variables over the course of 60 years.
data <- data.frame(id = rep(1:3, 3),
                   yr = rep(1:3, times = 1, each = 3),
                   x1 = c(1,3,7,NA,NA,NA,9,4,10),
                   x2 = c(NA,NA,NA,3,9,6,NA,NA,NA))
My expected results: for x1, the latest complete set of observations is year 3; for x2, the latest complete set of observations is year 2.
Using base R
subset(data, yr %in% names(tail(which(sapply(split(data[c('x1', 'x2')],
data$yr), function(x) any(colSums(!is.na(x)) == nrow(x)))), 2)))
Here's a tidyverse solution. First, I create the data frame.
# Create data frame
df <- data.frame(id = rep(1:3, 3),
                 yr = rep(1:3, times = 1, each = 3),
                 x1 = c(1,3,7,NA,NA,NA,9,4,10),
                 x2 = c(NA,NA,NA,3,9,6,NA,NA,NA))
Next, I load the required libraries.
# Load library
library(dplyr)
library(tidyr)
I then go from wide to long format, group by yr and key (i.e., the variable name), drop the year/variable combinations that contain NA values (i.e., keep those that are entirely non-NA), group by key, keep the data from the latest remaining year, switch back to wide format, and arrange to make the printed result easier to read.
df %>%
gather("key", "val", x1, x2) %>%
group_by(yr, key) %>%
filter(all(!is.na(val))) %>%
group_by(key) %>%
filter(yr == max(yr)) %>%
spread(key, val) %>%
arrange(yr)
#> # A tibble: 6 x 4
#> id yr x1 x2
#> <int> <int> <dbl> <dbl>
#> 1 1 2 NA 3
#> 2 2 2 NA 9
#> 3 3 2 NA 6
#> 4 1 3 9 NA
#> 5 2 3 4 NA
#> 6 3 3 10 NA
Created on 2019-05-29 by the reprex package (v0.3.0)
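As a side note, gather() and spread() have since been superseded by pivot_longer() and pivot_wider(). Here is a sketch of the same pipeline with the newer verbs (assuming tidyr >= 1.0 is available); it should produce the same result as above:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(c(x1, x2), names_to = "key", values_to = "val") %>%
  group_by(yr, key) %>%
  filter(all(!is.na(val))) %>%
  group_by(key) %>%
  filter(yr == max(yr)) %>%
  pivot_wider(names_from = key, values_from = val) %>%
  arrange(yr)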