I want to create a binary/indicator variable based on the lagged observation. I have a variable X1; the raw data looks like the sample below. The original data has close to 10K records.
X1
Diagnosis
1
2
3
4
Treatment
1
2
3
I want the output to look like this:
X1         NewVar
Diagnosis  Diagnosis
1          Diagnosis
2          Diagnosis
3          Diagnosis
4          Diagnosis
Treatment  Treatment
1          Treatment
2          Treatment
3          Treatment
Any help would be highly appreciated!
You can achieve this with cumsum. Taking the cumulative sum of a logical vector creates a new group each time a Diagnosis or Treatment appears; NewVar in each group then takes the value of the first X1 in that group:
library(dplyr)
dtf %>%
  mutate(g = cumsum(X1 == 'Diagnosis' | X1 == 'Treatment')) %>%
  group_by(g) %>%
  mutate(NewVar = X1[1]) %>%
  ungroup() %>%
  select(-g)
# # A tibble: 9 x 2
# X1 NewVar
# <fctr> <fctr>
# 1 Diagnosis Diagnosis
# 2 1 Diagnosis
# 3 2 Diagnosis
# 4 3 Diagnosis
# 5 4 Diagnosis
# 6 Treatment Treatment
# 7 1 Treatment
# 8 2 Treatment
# 9 3 Treatment
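To see why this works, here is the intermediate grouping column g on its own (an illustration with the sample data): the logical comparison is TRUE on the two header rows, so the cumulative sum steps from 1 to 2 there and labels each block.
dtf %>% mutate(g = cumsum(X1 == 'Diagnosis' | X1 == 'Treatment'))
#          X1 g
# 1 Diagnosis 1
# 2         1 1
# 3         2 1
# 4         3 1
# 5         4 1
# 6 Treatment 2
# 7         1 2
# 8         2 2
# 9         3 2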
The dtf used in the code above:
> dput(dtf)
structure(list(X1 = structure(c(5L, 1L, 2L, 3L, 4L, 6L, 1L, 2L,
3L), .Label = c("1", "2", "3", "4", "Diagnosis", "Treatment"), class = "factor")), .Names = "X1", class = "data.frame", row.names = c(NA,
-9L))
Here is an option with data.table. After converting to a data.table (setDT(dtf)), take the cumulative sum of a logical vector that flags the purely alphabetic values of 'X1' (treated as characters), and assign 'NewVar' as the first element of 'X1' within each group (X1[1]):
library(data.table)
setDT(dtf)[, NewVar := X1[1], cumsum(grepl('^[A-Za-z]+$', X1))]
dtf
# X1 NewVar
#1: Diagnosis Diagnosis
#2: 1 Diagnosis
#3: 2 Diagnosis
#4: 3 Diagnosis
#5: 4 Diagnosis
#6: Treatment Treatment
#7: 1 Treatment
#8: 2 Treatment
#9: 3 Treatment
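To see what the implicit grouping evaluates to, here are the logical flag and its cumulative sum for the sample data (an illustration; the regex matches values made up entirely of letters):
grepl('^[A-Za-z]+$', dtf$X1)
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
cumsum(grepl('^[A-Za-z]+$', dtf$X1))
# [1] 1 1 1 1 1 2 2 2 2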
I am trying to find the minimum value across different columns, within each group.
A small sample of my data looks something like this:
group cut group_score_1 group_score_2
1 a 1 3 5.0
2 b 2 2 4.0
3 a 0 2 2.5
4 b 3 5 4.0
5 a 2 3 6.0
6 b 1 5 1.0
I want to group by group and, for each group, find the row containing the minimum score across both group_score columns, and also identify which column (group_score_1 or group_score_2) holds that minimum. So my result should be something like this:
group cut group_score_1 group_score_2
1 a 0 2 2.5
2 b 1 5 1.0
I tried a few ideas and eventually ended up dividing the data into several new data frames (filtering by group and selecting the relevant columns) and then using which.min(), but I'm sure there's a much more efficient way to do it. Not sure what I am missing.
We can use data.table methods:
library(data.table)
setDT(df)[df[, .I[which.min(do.call(pmin, .SD))],
group, .SDcols = patterns('^group_score')]$V1]
# group cut group_score_1 group_score_2
#1: a 0 2 2.5
#2: b 1 5 1.0
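To unpack the indexing: the inner call computes, per group, the original row number (.I) of the row whose row-wise minimum (pmin across the score columns) is smallest, and $V1 extracts those row numbers for the outer subset. Run on its own (df is already a data.table at this point), it returns:
df[, .I[which.min(do.call(pmin, .SD))],
   group, .SDcols = patterns('^group_score')]
#    group V1
# 1:     a  3
# 2:     b  6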
For each group, you can calculate the minimum value and select the row in which that value exists in one of the columns.
library(dplyr)
df %>%
group_by(group) %>%
filter({tmp = min(group_score_1, group_score_2);
group_score_1 == tmp | group_score_2 == tmp})
# group cut group_score_1 group_score_2
# <chr> <int> <int> <dbl>
#1 a 0 2 2.5
#2 b 1 5 1
The above works well when you have only two group_score columns. If you have many such columns, it is impractical to list each of them as group_score_1 == tmp | group_score_2 == tmp, etc. In that case, get the data in long format, find the cut value corresponding to the minimum value, and join back to the original data. This assumes cut is unique within each group.
df %>%
  tidyr::pivot_longer(cols = starts_with('group_score')) %>%
  group_by(group) %>%
  summarise(cut = cut[which.min(value)]) %>%
  left_join(df, by = c("group", "cut"))
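Alternatively, here is a sketch that handles any number of score columns without assuming cut is unique, using pick() (which requires dplyr >= 1.1.0):
df %>%
  group_by(group) %>%
  # keep the row whose row-wise minimum across all score columns is smallest
  filter(row_number() == which.min(do.call(pmin, pick(starts_with('group_score')))))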
Here is a base R option using pmin + ave + subset:
subset(
df,
as.logical(ave(
do.call(pmin, df[grep("group_score_\\d+", names(df))]),
group,
FUN = function(x) x == min(x)
))
)
which gives
group cut group_score_1 group_score_2
3 a 0 2 2.5
6 b 1 5 1.0
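Step by step, for the sample data (an illustration): first the row-wise minima, then the per-group flag marking where each group's overall minimum occurs:
do.call(pmin, df[grep("group_score_\\d+", names(df))])
# [1] 3.0 2.0 2.0 4.0 3.0 1.0
ave(c(3, 2, 2, 4, 3, 1), df$group, FUN = function(x) x == min(x))
# [1] 0 0 1 0 0 1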
Data
> dput(df)
structure(list(group = c("a", "b", "a", "b", "a", "b"), cut = c(1L,
2L, 0L, 3L, 2L, 1L), group_score_1 = c(3L, 2L, 2L, 5L, 3L, 5L
), group_score_2 = c(5, 4, 2.5, 4, 6, 1)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
I understand we can use the dplyr function coalesce() to unite different columns, but is there such a function to unite rows?
I am struggling with an incomplete/duplicated data frame that has multiple rows for the same id, each with different columns filled. E.g.
id sex age source
12 M NA 1
12 NA 3 1
13 NA 2 2
13 NA NA NA
13 F 2 NA
and I am trying to achieve:
id sex age source
12 M 3 1
13 F 2 2
You can try:
library(dplyr)
library(tidyr) # fill() comes from tidyr
#Data
df <- structure(list(id = c(12L, 12L, 13L, 13L, 13L), sex = structure(c(2L,
NA, NA, NA, 1L), .Label = c("F", "M"), class = "factor"), age = c(NA,
3L, 2L, NA, 2L), source = c(1L, 1L, 2L, NA, NA)), class = "data.frame", row.names = c(NA,
-5L))
df %>%
  group_by(id) %>%
  fill(everything(), .direction = "down") %>%
  fill(everything(), .direction = "up") %>%
  slice(1)
# A tibble: 2 x 4
# Groups: id [2]
id sex age source
<int> <fct> <int> <int>
1 12 M 3 1
2 13 F 2 2
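With tidyr >= 1.0.0 the two fill() calls can be combined into one pass (a small variant of the same idea):
df %>%
  group_by(id) %>%
  fill(everything(), .direction = "downup") %>%
  slice(1)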
As mentioned by @A5C1D2H2I1M1N2O1R2T1, you can select the first non-NA value in each group. This can be done using dplyr:
library(dplyr)
df %>% group_by(id) %>% summarise(across(.fns = ~na.omit(.)[1]))
# A tibble: 2 x 4
# id sex age source
# <int> <fct> <int> <int>
#1 12 M 3 1
#2 13 F 2 2
Base R:
aggregate(.~id, df, function(x) na.omit(x)[1], na.action = 'na.pass')
Or data.table:
library(data.table)
setDT(df)[, lapply(.SD, function(x) na.omit(x)[1]), id]
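And to answer the literal question: coalesce() itself can unite rows if you fold each column's values with it, since coalesce(a, b) keeps the first non-NA argument. A sketch using purrr::reduce():
library(dplyr)
library(purrr)
df %>%
  group_by(id) %>%
  # reduce() folds each column pairwise with coalesce(), keeping the first non-NA
  summarise(across(everything(), ~ reduce(.x, coalesce)))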
I'm trying to perform a group_by/summarise on a categorical variable, a frailty score. The data is structured such that there are multiple observations for each subject, some of which contain missing data, e.g.
Subject Frailty
1 Managing well
1 NA
1 NA
2 NA
2 NA
2 Vulnerable
3 NA
3 NA
3 NA
I would like the data to be summarised so that a frailty description appears if one is available, and NA if not, e.g.
Subject Frailty
1 Managing well
2 Vulnerable
3 NA
I tried the following two approaches which both returned errors:
Mode <- function(x) {
ux <- na.omit(unique(x[!is.na(x)]))
tab <- tabulate(match(x, ux)); ux[tab == max(tab)]
}
data %>%
  group_by(Subject) %>%
  summarise(frailty = Mode(Frailty))
Error: Expecting a single value: [extent=2].
condense <- function(x) {unique(x[!is.na(x)])}
data %>%
  group_by(Subject) %>%
  summarise(frailty = condense(Frailty))
Error: Column frailty must be length 1 (a summary value), not 0
One solution involving dplyr could be:
data %>%
  group_by(Subject) %>%
  slice(which.min(is.na(Frailty)))
  Subject Frailty
    <int> <chr>
1       1 Managing well
2       2 Vulnerable
3       3 <NA>
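This works because is.na() returns a logical vector and which.min() picks the position of its first FALSE (i.e. the first non-NA value), falling back to 1 when every element is NA:
which.min(is.na(c(NA, NA, "Vulnerable")))
# [1] 3
which.min(is.na(c(NA, NA, NA)))
# [1] 1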
If there is only a single non-NA element per subject, then after grouping by 'Subject', get the first non-NA element:
library(dplyr)
data %>%
  group_by(Subject) %>%
  summarise(Frailty = Frailty[which(!is.na(Frailty))[1]])
# A tibble: 3 x 2
# Subject Frailty
# <int> <chr>
#1 1 Managing well
#2 2 Vulnerable
#3 3 <NA>
If there is more than one unique non-NA element, we can either paste them together or return a list:
data %>%
  group_by(Subject) %>%
  summarise(Frailty = na_if(toString(unique(na.omit(Frailty))), ""))
# A tibble: 3 x 2
# Subject Frailty
# <int> <chr>
#1 1 Managing well
#2 2 Vulnerable
#3 3 <NA>
data
data <- structure(list(Subject = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L
), Frailty = c("Managing well", NA, NA, NA, NA, "Vulnerable",
NA, NA, NA)), class = "data.frame", row.names = c(NA, -9L))
I have a data frame that contains historical price returns, organized with date columns and many asset columns (denoted A1, A2, ...). Each asset column contains a price return for each historical date. I would like to process this data into a data frame with many asset columns and only one row of data, containing the average of the rows for each new column. The new column headers should be the original asset name concatenated with date information. A simplified example of the original data follows:
> df <- read.csv("data.csv", header=T)
> df
Year Month A1 A2 A3
1 2015 Jan 1 1 1
2 2015 Feb 2 2 2
3 2015 Mar 3 3 3
4 2016 Jan 1 1 1
5 2016 Feb 2 2 2
6 2016 Mar 3 3 3
I used simple repeating numbers for the returns here. I am using a function that requires the data to be organized as follows:
> df2 <- read.csv("data2.csv", header=T)
> df2
Returns A1.Jan A1.Feb A1.Mar A2.Jan A2.Feb A2.Mar A3.Jan A3.Feb A3.Mar
1 Average 1 2 3 1 2 3 1 2 3
For clarity, A1.Jan contains the average of all years' Jan returns for A1. Thanks in advance for the insight and/or solution.
Take a look at the base function reshape. This is basically the same task as is solved by the last example on its help page:
reshape(df, idvar="Year", direction="wide", timevar="Month")
Year A1.Jan A2.Jan A3.Jan A1.Feb A2.Feb A3.Feb A1.Mar A2.Mar A3.Mar
1 2015 1 1 1 2 2 2 3 3 3
4 2016 1 1 1 2 2 2 3 3 3
You wanted the Year variable to remain as a column identifier but wanted the Month variable to act as a sequence that gets spread "wide".
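Since the desired output has a single Average row rather than one row per Year, one base R option (a sketch) is to average by Month first and then reshape:
# average each asset by Month; note aggregate() orders Month alphabetically
avg <- aggregate(cbind(A1, A2, A3) ~ Month, data = df, FUN = mean)
avg$Returns <- "Average"
reshape(avg, idvar = "Returns", timevar = "Month", direction = "wide", sep = ".")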
With data.table you can do:
library(data.table)
setDT(df)
df[, lapply(.SD, mean), .SDcols = names(df)[grep("^A", names(df))], by = Month
][, Returns := "Average"
][, melt(.SD, id = c("Month", "Returns"))
][, dcast(.SD, Returns ~ variable + Month, value.var = 'value', sep = ".")]
# Returns A1.Feb A1.Jan A1.Mar A2.Feb A2.Jan A2.Mar A3.Feb A3.Jan A3.Mar
#1: Average 2 1 3 2 1 3 2 1 3
In the first line we aggregate the data by Month. The part names(df)[grep("^A", names(df))] ensures that we only aggregate variables that start with the letter "A".
The second line creates variable Returns that contains the value "Average".
melt gathers your data into long format and dcast finally spreads it into the desired output.
data
df <- structure(list(Year = c(2015L, 2015L, 2015L, 2016L, 2016L, 2016L
), Month = c("Jan", "Feb", "Mar", "Jan", "Feb", "Mar"), A1 = c(1L,
2L, 3L, 1L, 2L, 3L), A2 = c(1L, 2L, 3L, 1L, 2L, 3L), A3 = c(1L,
2L, 3L, 1L, 2L, 3L)), .Names = c("Year", "Month", "A1", "A2",
"A3"), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6"))
Here's a tidyverse solution. I converted the months to an ordered factor, then used tidyr::gather() to convert to long format so I could dplyr::group_by() month and dplyr::summarise() to find the average:
library(dplyr)
library(tidyr)
df <- read.table(text = "
Year Month A1 A2 A3
1 2015 Jan 1 1 1
2 2015 Feb 2 2 2
3 2015 Mar 3 3 3
4 2016 Jan 1 1 1
5 2016 Feb 2 2 2
6 2016 Mar 3 3 3", header = T) %>%
  as_tibble()
df$Month <- df$Month %>%
factor(levels = format(ISOdate(2000, 1:12, 1), "%b"))
df_tidy <- df %>%
  gather(asset, value, -Year, -Month) %>%
  group_by(Month, asset) %>%
  summarise(Average = mean(value)) %>%
  arrange(asset, Month)
df_tidy
# # A tibble: 9 x 3
# # Groups: Month [3]
# Month asset Average
# <fct> <chr> <dbl>
# 1 Jan A1 1
# 2 Feb A1 2
# 3 Mar A1 3
# 4 Jan A2 1
# 5 Feb A2 2
# 6 Mar A2 3
# 7 Jan A3 1
# 8 Feb A3 2
# 9 Mar A3 3
# convert to wide format, as in OP - not sure of an 'easy' way
# to order columns by asset.month other than using 'select()'
# (it currently sorts alphabetically; see the ordering sketch below).
df_tidy %>%
  unite(Returns, c(asset, Month), sep = ".") %>%
  spread(Returns, Average)
# # A tibble: 1 x 9
# A1.Feb A1.Jan A1.Mar A2.Feb A2.Jan A2.Mar A3.Feb A3.Jan A3.Mar
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2 1 3 2 1 3 2 1 3
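One way to get the asset.month column order (a sketch): build the desired names from the month factor levels and pass them to select():
months <- levels(droplevels(df_tidy$Month))  # "Jan" "Feb" "Mar"
assets <- sort(unique(df_tidy$asset))        # "A1" "A2" "A3"
ord <- paste(rep(assets, each = length(months)), months, sep = ".")
df_tidy %>%
  ungroup() %>%
  unite(Returns, c(asset, Month), sep = ".") %>%
  spread(Returns, Average) %>%
  select(all_of(ord))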
I have a data set that looks like this:
id a  b
1  AA 2
1  AB 5
1  AA 1
2  AB 2
2  AB 4
3  AB 4
3  AB 3
3  AA 1
3  AA 4
I need to calculate the cumulative mean for each record within each group, excluding the cases where a == 'AA'. So the sample output should be:
id a b mean
1 AA 2 -
1 AB 5 5
1 AA 1 5
2 AB 2 2
2 AB 4 (4+2)/2
3 AB 4 4
3 AB 3 (4+3)/2
3 AA 1 (4+3)/2
3 AA 4 (4+3)/2
I tried to achieve it using dplyr and cummean but got an error.
df <- df %>%
  group_by(id) %>%
  mutate(mean = cummean(b[a != 'AA']))
Error: incompatible size (123), expecting 147 (the group size) or 1
Can you suggest a better way to achieve this in R?
The trick here is to reconstruct the cummean by dividing the adjusted cumsum by the adjusted count. As a one-liner:
df %>% group_by(id) %>% mutate(mean = cumsum(b * (a != 'AA')) / cumsum(a != 'AA'))
We can make this a little nicer (the "multiply by a != 'AA'" magic is the ugly part, to my mind) by pulling a != 'AA' out as its own column:
df %>%
  group_by(id) %>%
  mutate(relevance = 0 + (a != 'AA'),
         mean = cumsum(relevance * b) / cumsum(relevance))
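For the sample data this yields the following (0/0 produces NaN for groups that start with 'AA', matching the '-' in the desired output):
#   id a     b relevance mean
# 1  1 AA    2         0  NaN
# 2  1 AB    5         1  5.0
# 3  1 AA    1         0  5.0
# 4  2 AB    2         1  2.0
# 5  2 AB    4         1  3.0
# 6  3 AB    4         1  4.0
# 7  3 AB    3         1  3.5
# 8  3 AA    1         0  3.5
# 9  3 AA    4         0  3.5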
There may be an easier approach. Here, we group by 'id' and create a new column 'Mean' by first converting the elements of 'b' that correspond to 'AA' in 'a' to NA (b*NA^(a=='AA')). NA^(a=='AA') evaluates to NA for 'AA' and 1 for every other value, so multiplying by 'b' keeps the original values where it is 1 and leaves NA elsewhere. We then use na.aggregate to replace each NA with the mean of the non-NA elements in its group, and wrap that in cummean to get the cumulative mean. If the first value of 'a' in a group is 'AA', we can get NA for it by multiplying by NA^(row_number()==1 & a=='AA').
library(zoo)
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(Mean = cummean(na.aggregate(b * NA^(a == 'AA'))) *
           NA^(row_number() == 1 & a == 'AA'))
# Source: local data frame [9 x 4]
#Groups: id [3]
# id a b Mean
# (int) (chr) (int) (dbl)
#1 1 AA 2 NA
#2 1 AB 5 5.0
#3 1 AA 1 5.0
#4 2 AB 2 2.0
#5 2 AB 4 3.0
#6 3 AB 4 4.0
#7 3 AB 3 3.5
#8 3 AA 1 3.5
#9 3 AA 4 3.5
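The NA^ trick in isolation, since it is the non-obvious part: in R, any number raised to the power 0 is 1, even NA, while NA^1 stays NA:
NA^c(TRUE, FALSE)
# [1] NA  1
c(2, 5) * NA^c(TRUE, FALSE)
# [1] NA  5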
data
df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
a = c("AA",
"AB", "AA", "AB", "AB", "AB", "AB", "AA", "AA"), b = c(2L, 5L,
1L, 2L, 4L, 4L, 3L, 1L, 4L)), .Names = c("id", "a", "b"),
class = "data.frame", row.names = c(NA, -9L))