I'm currently using group_by then slice, to get the maximum dates in my data. There are a few rows where the date is NA, and when using slice(which.max(END_DT)), the NAs end up getting dropped. Is there an equivalent of summarise_all, so that I can keep the NAs in my data?
ID Date INitials
1 01-01-2020 AZ
1 02-01-2020 BE
2 NA CC
I'm using
df %>%
group_by(ID) %>%
slice(which.max(Date))
I need the final results to look like below, but it's dropping the NA entirely
ID Date Initials
1 02-01-2020 BE
2 NA CC
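which.max() is not suitable in this case because (1) it skips missing values, so a group whose dates are all NA returns no position and the group disappears, and (2) it returns only the first position of the maximum, so tied rows are lost. Here is a more general solution: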
library(dplyr)
df %>%
mutate(Date = as.Date(Date, "%m-%d-%Y")) %>%
group_by(ID) %>%
filter(Date == max(Date) | all(is.na(Date)))
# # A tibble: 2 x 3
# # Groups: ID [2]
# ID Date INitials
# <int> <date> <fct>
# 1 1 2020-02-01 BE
# 2 2 NA CC
df <- structure(list(ID = c(1L, 1L, 2L), Date = structure(c(1L, 2L,
NA), .Label = c("01-01-2020", "02-01-2020"), class = "factor"),
INitials = structure(1:3, .Label = c("AZ", "BE", "CC"), class = "factor")),
class = "data.frame", row.names = c(NA, -3L))
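One case the filter above does not cover is a group that mixes real dates with NA: max(Date) is then NA and the whole group is dropped. A minimal sketch of a helper for that case follows; the name latest_or_na and the extra ID 3 rows are hypothetical additions for illustration, not part of the original data.

```r
library(dplyr)

# Sample data: ID 3 is an added, hypothetical group mixing a date with NA
df2 <- data.frame(
  ID = c(1L, 1L, 2L, 3L, 3L),
  Date = as.Date(c("2020-01-01", "2020-01-02", NA, "2020-03-01", NA)),
  Initials = c("AZ", "BE", "CC", "DD", "EE"),
  stringsAsFactors = FALSE
)

# TRUE for the latest date(s) in x; if every value is NA, keep all rows
latest_or_na <- function(x) {
  if (all(is.na(x))) {
    return(rep(TRUE, length(x)))
  }
  !is.na(x) & x == max(x, na.rm = TRUE)
}

res <- df2 %>%
  group_by(ID) %>%
  filter(latest_or_na(Date)) %>%
  ungroup()
```

With this sample, ID 1 keeps BE, ID 2 keeps its NA row, and ID 3 keeps DD while its NA row is dropped.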
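It's dropping the NA because you're asking for the position of the max date, and which.max() skips missing values, so a group with only NA returns no position at all. If you want to stay with the which.max route, keep the sliced result, filter the NA rows out of the original data separately, and bind the two together. Note that this also keeps NA rows from IDs that do have a valid date.
# rows holding the latest date per ID (all-NA groups are lost here)
df.max <- df %>%
group_by(ID) %>%
slice(which.max(Date))
# rows with a missing date, taken from the original data
df.na <- df %>%
filter(is.na(Date))
df <- bind_rows(df.max, df.na)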
Related
Example: how many days is the longest period between NA updates (that is, between one NA date and the next NA date)?
Date value
1/2/2020 NA
1/3/2020 NA
1/4/2020 3
1/5/2020 NA
1/6/2020 1
1/7/2020 3
1/8/2020 3
1/9/2020 NA
1/10/2020 3
Expected result: 4 days (the longest period runs from 1/5/2020 to 1/9/2020).
I tried using filter to list the NA dates and got stuck...
Here's an approach using data.table (note %Y for the four-digit year):
library(data.table)
setDT(df)[is.na(value)][, diff := c(0, diff(as.IDate(Date, "%m/%d/%Y")))][which.max(diff)]
Similar approach using dplyr
df %>%
filter(is.na(value)) %>%
mutate(diff = c(0, diff(as.Date(Date, "%m/%d/%Y")))) %>%
slice_max(diff)
Output:
Date value diff
<chr> <int> <dbl>
1 1/9/2020 NA 4
Here's an efficient approach using the data.table package.
# input data
df <- data.frame(Date = c("1/2/2020","1/3/2020", "1/4/2020","1/5/2020","1/6/2020",
"1/7/2020","1/8/2020", "1/9/2020","1/10/2020"),
value = c(NA, NA, 3L, NA, 1L, 3L, 3L, NA, 3L))
library(data.table)
# convert Date column from character to date class
setDT(df)[, Date := as.IDate(Date, format="%m/%d/%Y")]
# for each NA row, count the days since the previous NA row
df[is.na(value), days_since_last_na := Date - shift(Date, type = "lag")]
subset(df, days_since_last_na == max(days_since_last_na, na.rm = TRUE))
> Date value days_since_last_na
> 1: 2020-01-09 NA 4
this is your example data df:
df <- structure(list(Date = c("01.02.2020", "01.03.2020", "01.04.2020",
"01.05.2020", "01.06.2020", "01.07.2020", "01.08.2020", "01.09.2020",
"01.10.2020"), value = c(NA, NA, 3L, NA, 1L, 3L, 3L, NA, 3L)), class = "data.frame", row.names = c(NA,
9L))
code suggestion:
library(dplyr)
library(lubridate) ## convenient date handling
df %>%
filter(is.na(value)) %>%
mutate(Date = lubridate::mdy(Date),
from = Date,
to = lead(Date, 1),
duration = to - from
) %>%
filter(!is.na(duration)) %>%
## column-wise min and max; these line up with the shortest and longest
## durations here because every column increases down the rows in this data:
summarise(across(everything(), ~ c(min(.x), max(.x))))
output:
## Date value from to duration
## 1 2020-01-02 NA 2020-01-02 2020-01-03 1 days
## 2 2020-01-05 NA 2020-01-05 2020-01-09 4 days
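For comparison, the same gap calculation can be sketched in base R with no extra packages. This assumes the dates are month/day/year strings as in the question; the names na_dates, gaps and longest are made up for the example.

```r
# Example data from the question (dates as month/day/year strings)
df <- data.frame(
  Date = c("1/2/2020", "1/3/2020", "1/4/2020", "1/5/2020", "1/6/2020",
           "1/7/2020", "1/8/2020", "1/9/2020", "1/10/2020"),
  value = c(NA, NA, 3L, NA, 1L, 3L, 3L, NA, 3L),
  stringsAsFactors = FALSE
)

# Dates on which value is NA, parsed with a four-digit year (%Y)
na_dates <- as.Date(df$Date[is.na(df$value)], format = "%m/%d/%Y")

# Days between each NA date and the next one
gaps <- diff(na_dates)

# Position of the longest gap, reported with its start and end dates
i <- which.max(gaps)
longest <- data.frame(
  from = na_dates[i],
  to = na_dates[i + 1],
  days = as.integer(gaps[i])
)
```

Here longest holds the period from 2020-01-05 to 2020-01-09, which is 4 days.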
I'm looking for an efficient way to rename several columns.
I have a dataframe that looks like the following.
id sdf dir fki
1 3 4 2
2 5 2 1
3 4 1 2
I want to rename columns sdf, dir, and fki.
I know I could do so like this:
df <- df %>%
rename(newname1 = sdf,
newname2 = dir,
newname3 = fki)
With the number of columns I have, it takes a long time to type out every column name I would like to replace.
Ideally, I would like to create a vector with names:
newcolumns <- c("newname1", "newname2", "newname3")
And then specify that these should replace the column names in the dataframe, starting with column sdf. Is there a way to do this?
We can use rename_at (superseded by rename_with since dplyr 1.0.0, but it still works)
library(dplyr)
df %>%
rename_at(vars(-id), ~ newcolumns)
-output
# id newname1 newname2 newname3
#1 1 3 4 2
#2 2 5 2 1
#3 3 4 1 2
Or with rename_with
df %>%
rename_with(~ newcolumns, -id)
Or pass a named vector and use !!! in rename
df %>%
rename(!!! setNames(names(df)[-1], newcolumns))
Or using base R
names(df)[-1] <- newcolumns
data
df <- structure(list(id = 1:3, sdf = c(3L, 5L, 4L), dir = c(4L, 2L,
1L), fki = c(2L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-3L))
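The answers above rename everything except id. If the new names should instead start at a given column (sdf in the question) wherever it sits, a base R sketch could look like this. The variables start and idx are hypothetical helpers, and the approach assumes the columns to rename are contiguous.

```r
df <- data.frame(
  id = 1:3,
  sdf = c(3L, 5L, 4L),
  dir = c(4L, 2L, 1L),
  fki = c(2L, 1L, 2L)
)
newcolumns <- c("newname1", "newname2", "newname3")

# Position of the first column to rename
start <- match("sdf", names(df))

# Positions of as many consecutive columns as there are new names
idx <- seq(start, length.out = length(newcolumns))

# Fail early if the start column is missing or the names run past the last column
stopifnot(!is.na(start), max(idx) <= ncol(df))

names(df)[idx] <- newcolumns
```

The stopifnot() guard is a design choice: without it, assigning past the last column would silently pad names(df) rather than raise an error.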
This question already has an answer here:
dplyr::first() to choose first non NA value
I understand we can use the dplyr function coalesce() to unite different columns, but is there such function to unite rows?
I am struggling with a confusing incomplete/doubled dataframe with duplicate rows for the same id, but with different columns filled. E.g.
id sex age source
12 M NA 1
12 NA 3 1
13 NA 2 2
13 NA NA NA
13 F 2 NA
and I am trying to achieve:
id sex age source
12 M 3 1
13 F 2 2
You can try:
library(dplyr)
library(tidyr) # fill() comes from tidyr, not dplyr
#Data
df <- structure(list(id = c(12L, 12L, 13L, 13L, 13L), sex = structure(c(2L,
NA, NA, NA, 1L), .Label = c("F", "M"), class = "factor"), age = c(NA,
3L, 2L, NA, 2L), source = c(1L, 1L, 2L, NA, NA)), class = "data.frame", row.names = c(NA,
-5L))
df %>%
group_by(id) %>%
fill(everything(), .direction = "down") %>%
fill(everything(), .direction = "up") %>%
slice(1)
# A tibble: 2 x 4
# Groups: id [2]
id sex age source
<int> <fct> <int> <int>
1 12 M 3 1
2 13 F 2 2
As mentioned by @A5C1D2H2I1M1N2O1R2T1, you can select the first non-NA value in each group. This can be done using dplyr:
library(dplyr)
df %>% group_by(id) %>% summarise(across(everything(), ~ na.omit(.x)[1]))
# A tibble: 2 x 4
# id sex age source
# <int> <fct> <int> <int>
#1 12 M 3 1
#2 13 F 2 2
Base R :
aggregate(.~id, df, function(x) na.omit(x)[1], na.action = 'na.pass')
Or data.table :
library(data.table)
setDT(df)[, lapply(.SD, function(x) na.omit(x)[1]), id]
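All three variants rely on the same trick: indexing past the end of a vector yields NA, so taking element [1] of the non-missing values returns the first one if it exists and NA otherwise. A small base R sketch of that idea follows; first_non_na is a hypothetical helper name, and the columns are plain character and integer vectors rather than the factors in the original data.

```r
df <- data.frame(
  id = c(12L, 12L, 13L, 13L, 13L),
  sex = c("M", NA, NA, NA, "F"),
  age = c(NA, 3L, 2L, NA, 2L),
  source = c(1L, 1L, 2L, NA, NA),
  stringsAsFactors = FALSE
)

# First non-missing value of x; NA (of the same type) if all are missing
first_non_na <- function(x) x[!is.na(x)][1]

# Collapse each id group to a single row, column by column
pieces <- lapply(split(df, df$id), function(g) {
  as.data.frame(lapply(g, first_non_na), stringsAsFactors = FALSE)
})
result <- do.call(rbind, pieces)
```

For this data, result has one row per id: 12 with M, 3, 1 and 13 with F, 2, 2.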
I'm trying to perform a group_by summarise on a categorical variable, frailty score. The data is structured such that there are multiple observations for each subject, some of which contain missing data e.g.
Subject Frailty
1 Managing well
1 NA
1 NA
2 NA
2 NA
2 Vulnerable
3 NA
3 NA
3 NA
I would like the data to be summarised so that a frailty description appears if there is one available, and NA if not e.g.
Subject Frailty
1 Managing well
2 Vulnerable
3 NA
I tried the following two approaches which both returned errors:
Mode <- function(x) {
ux <- na.omit(unique(x[!is.na(x)]))
tab <- tabulate(match(x, ux)); ux[tab == max(tab)]
}
data %>%
group_by(Subject) %>%
summarise(frailty = Mode(frailty)) %>%
Error: Expecting a single value: [extent=2].
condense <- function(x){unique(x[!is.na(x)])}
data %>%
group_by(subject) %>%
summarise(frailty = condense(frailty))
Error: Column frailty must be length 1 (a summary value), not 0
One solution involving dplyr could be:
df %>%
group_by(Subject) %>%
slice(which.min(is.na(Frailty)))
Subject Frailty
<int> <chr>
1 1 Managing_well
2 2 Vulnerable
3 3 <NA>
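The reason this works: is.na() returns TRUE/FALSE, which.min() treats FALSE as the smaller value, and it returns the first position of the minimum. So it picks the first non-NA row, and falls back to row 1 when every value is NA. A quick illustration with made-up vectors:

```r
some_missing <- c(NA, NA, "Vulnerable")
all_missing <- c(NA_character_, NA_character_, NA_character_)

# First FALSE in is.na(), i.e. the first non-NA position
pos_some <- which.min(is.na(some_missing))

# No FALSE at all: the minimum is TRUE, first found at position 1
pos_all <- which.min(is.na(all_missing))
```

Here pos_some is 3 and pos_all is 1, which is why the all-NA subject still keeps one row.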
If there is only a single non-NA element per subject, then after grouping by 'Subject', get the first non-NA element
library(dplyr)
data %>%
group_by(Subject) %>%
summarise(Frailty = Frailty[which(!is.na(Frailty))[1]])
# A tibble: 3 x 2
# Subject Frailty
# <int> <chr>
#1 1 Managing well
#2 2 Vulnerable
#3 3 <NA>
If there is more than one unique non-NA element, we can either paste them together or return them as a list
data %>%
group_by(Subject) %>%
summarise(Frailty = na_if(toString(unique(na.omit(Frailty))), ""))
# A tibble: 3 x 2
# Subject Frailty
# <int> <chr>
#1 1 Managing well
#2 2 Vulnerable
#3 3 <NA>
data
data <- structure(list(Subject = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L
), Frailty = c("Managing well", NA, NA, NA, NA, "Vulnerable",
NA, NA, NA)), class = "data.frame", row.names = c(NA, -9L))
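The two errors in the question come from helpers that can return two values (tied modes) or zero values (all NA), while summarise() needs exactly one value per group. A sketch of a mode helper that always returns a single value is shown below. The name mode_or_na is hypothetical, it assumes a character column, and ties are broken by first appearance.

```r
data <- data.frame(
  Subject = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  Frailty = c("Managing well", NA, NA, NA, NA, "Vulnerable", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Most frequent non-NA value of a character vector; always length 1
mode_or_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) {
    return(NA_character_) # nothing observed for this group
  }
  ux <- unique(x)
  # which.max() keeps only the first of any tied modes
  ux[which.max(tabulate(match(x, ux)))]
}

# One value per subject, using base R only
frailty_by_subject <- vapply(
  split(data$Frailty, data$Subject),
  mode_or_na,
  character(1)
)
```

The same helper can be dropped into summarise(Frailty = mode_or_na(Frailty)) after group_by(Subject).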
I have a big dataset with a variety of variables concerning infectious complications. There are columns, containing symptoms written as strings in the corresponding columns ("Dysuria", "Fever", etc.). I would like to know the number of positive symptoms in each observation. I have tried to write different codes, using rowSums within mutate_at with is.character and !is.na, trying to do it simpler and as short as a single line of code, but it did not work.
example:
symps_na %>%
mutate_if(~any(is.character(.), rowSums)) %>%
View()
Then, I wrote a code for each column separately, trying to recode string variables to 1, convert them to numeric and then sum these ones to get the number of symptoms (see the codes below).
symps_na<-
pb_table_ord %>%
select(ID, dysuria:fever)%>%
mutate(dysuria=ifelse(dysuria=="Dysuria", 1, dysuria)) %>%
mutate(frequency=ifelse(frequency=="Frequency", 1, frequency)) %>%
mutate(urgency=ifelse(urgency=="Urgency", 1, urgency)) %>%
mutate(prostatepain=ifelse(prostatepain=="Prostate pain", 1, prostatepain)) %>%
mutate(rigor=ifelse(!is.na(rigor), 1, rigor)) %>%
mutate(loinpain=ifelse(!is.na(loinpain), 1, loinpain)) %>%
mutate(fever=ifelse(!is.na(fever), 1, fever)) %>%
mutate_at(vars(dysuria:fever), as.numeric) %>%
mutate(symptoms.sum=rowSums(select(., dysuria:fever)))
but the column symptoms.sum returns NAs instead of numbers.
Oh, sorry, I have just realized that I missed na.rm = TRUE! But anyway, can anyone suggest a more elegant way to get the number of non-NA (string) values for each observation in a separate column?
You can split the columns into two sets: one where a symptom counts as present when the value equals the column name, and one where any non-NA value counts. In the sample data shared at the end of this answer, cols1 holds the names of the columns whose values are compared against the column name, and cols2 holds the names of the columns checked for NA. Change both to match your own column names.
library(dplyr)
cols1 <- c('b', 'c')
cols2 <- c('d')
purrr::imap_dfc(df %>% select(cols1), `==`) %>% mutate_all(as.numeric) %>%
bind_cols(df %>% transmute_at(vars(cols2), ~+(!is.na(.)))) %>%
mutate(symptoms.sum = rowSums(select(., b:d), na.rm = TRUE))
# A tibble: 5 x 4
# b c d symptoms.sum
# <dbl> <dbl> <int> <dbl>
#1 1 1 0 2
#2 0 1 1 2
#3 1 0 1 2
#4 NA NA 1 1
#5 1 NA 0 1
data
Tested on this data which looks like this
df <- structure(list(a = 1:5, b = structure(c(1L, 2L, 1L, NA, 1L), .Label = c("b",
"c"), class = "factor"), c = structure(c(1L, 1L, 2L, NA, NA), .Label = c("c",
"d"), class = "factor"), d = c(NA, 1, 2, 4, NA)), class = "data.frame",
row.names = c(NA, -5L))
df
# a b c d
#1 1 b c NA
#2 2 c c 1
#3 3 b d 2
#4 4 <NA> <NA> 4
#5 5 b <NA> NA
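If every non-NA entry in a symptom column counts as a positive symptom (an assumption; the question's data may also contain other strings), the count can be sketched in one base R step with rowSums() on the logical matrix from !is.na(). The data frame symps and its contents are made up for illustration.

```r
# Hypothetical sample: each symptom column holds a label or NA
symps <- data.frame(
  ID = 1:3,
  dysuria = c("Dysuria", NA, "Dysuria"),
  fever = c(NA, NA, "Fever"),
  rigor = c("Rigor", NA, "Rigor"),
  stringsAsFactors = FALSE
)

# Every column except the identifier is treated as a symptom column
symptom_cols <- setdiff(names(symps), "ID")

# !is.na() gives a logical matrix; rowSums() counts the TRUEs per row
symps$symptoms.sum <- rowSums(!is.na(symps[symptom_cols]))
```

Because the counting happens on TRUE/FALSE values, no recoding to 1 and no na.rm = TRUE are needed.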