Removing NA's in tibble after using pivot_wider - r

I am trying to create a table from a data set that takes two factors from a variable, pivots them wider, and lines them up in a single row. Unfortunately, I either keep producing two separate lists, or I get this:
# Edited section:
test1 <- df %>% # Code used to create the table below
select(`Incident ID`,`Device Time`, Description, `Elapsed Time`) %>%
filter(Description == "CPR Stopped" | Description == "CPR Started") %>%
mutate(Index = c(1:16)) %>%
pivot_wider(names_from = Description,
values_from = c(`Elapsed Time`, `Device Time`)) %>%
filter(!is.na(test1))
dput(head(test1, 5))
`Incident ID` Index `Elapsed Time_CPR Started` `Elapsed Time_CPR Stopped` `Device Time_CPR Started` `Device Time_CPR Stopped`
<chr> <int> <time> <time> <time> <time>
1 F190158585 1 01'03" NA 18'37" NA
2 F190158585 2 NA 01'08" NA 18'42"
3 F190158585 3 01'34" NA 19'08" NA
4 F190158585 4 NA 03'47" NA 21'22"
5 F190158585 5 04'00" NA 21'35" NA
I am trying to get a table that looks like this:
df <- data.frame(Index = c(1:3),
CPR_Started = c("00:01:00", "00:02:03", "00:05:46"),
CPR_Stopped = c("00:01:53", "00:04:30", "00:08:00"))
print(df)
Index CPR_Started CPR_Stopped
1 1 00:01:00 00:01:53
2 2 00:02:03 00:04:30
3 3 00:05:46 00:08:00
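One possible approach, sketched under assumptions: the staggered NAs appear because every row receives its own Index, so a start and its matching stop never share a row. Numbering the events within each Description group instead lets the first start pair with the first stop. The `events` tibble below is hypothetical sample data shaped like the question's, and the sketch assumes starts and stops strictly alternate.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample data shaped like the question's table
events <- tibble(
  `Incident ID`  = "F190158585",
  Description    = rep(c("CPR Started", "CPR Stopped"), times = 3),
  `Elapsed Time` = c("00:01:03", "00:01:08", "00:01:34",
                     "00:03:47", "00:04:00", "00:05:10")
)

paired <- events %>%
  filter(Description %in% c("CPR Started", "CPR Stopped")) %>%
  # Number events within each incident and event type, so the
  # n-th start and the n-th stop share the same Index
  group_by(`Incident ID`, Description) %>%
  mutate(Index = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = Description, values_from = `Elapsed Time`)

paired
```

If a start can occur without a matching stop, the pairing would drift, so it is worth checking the counts per type first.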

Related

Split columns that are dictionaries

I have a data frame that has about 200,000 rows with columns like:
ID  dictionary column 1            dictionary column 2
1   {""1720100"":4,""1720101"":3}  {""1720100"":5,""1720101"":1,""1720102"":2}
2   {""1720100"":4}                {""1720100"":4,""1720101"":2}
... ...                            ...
The output table I would like to get is:
ID  col_a    col_b  col_c    col_d
1   1720100  4      1720101  5
1   1720101  3      1720102  1
1   NA       NA     1720103  2
2   1720100  4      1720101  4
2   NA       NA     1720102  2
... ...      ...    ...      ...
I also suspect it would be faster to divide the data frame into several chunks before splitting the columns. Could anyone help me with this?
It looks like you want to extract JSON from the columns, which the jsonlite package can do. Because the JSON sits in two columns, first pivot the data into longer form so that each row holds one JSON string, then parse it and pivot again to reach the desired final format. The final select only reorders the columns by the number contained in each column name.
library(tidyverse)
library(jsonlite)
df %>%
pivot_longer(cols = -ID) %>%
mutate(json_parsed = map(value, ~fromJSON(sprintf("[%s]", .), flatten = T))) %>%
unnest(json_parsed) %>%
pivot_longer(cols = -c(ID, name, value), names_to = "n", values_to = "v") %>%
pivot_wider(id_cols = ID, values_from = c(n, v), values_fn = list) %>%
unnest(cols = -ID) %>%
select(ID, order(parse_number(names(.)[-1])) + 1)
Output
ID n_dictionary_column_1 v_dictionary_column_1 n_dictionary_column_2 v_dictionary_column_2
<dbl> <chr> <int> <chr> <int>
1 1 1720100 4 1720100 5
2 1 1720101 3 1720101 1
3 1 1720102 NA 1720102 2
4 2 1720100 4 1720100 4
5 2 1720101 NA 1720101 2
6 2 1720102 NA 1720102 NA

is there an R code for the following data wrangling and transformation

I have the following data set
id<-c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4)
s02<-c(001,002,003,004,001,002,003,004,005,001,002,003,004,005,006,007,001,002,003,004,005,006,007,008,009,010,011,012,013,014,015,016,017,018,019,020,021,022,023,024,025,026,027,028,029)
dat1<-data.frame(id,s02)
I would like to create a new data set from dat1. The code should automatically create n s02 columns named s02__0, s02__1, s02__2, s02__3, s02__4 (so n == 5 here). Then, for each id in dat1, it should allocate the s02 values in order to s02__0 through s02__4. Rows within an id are uniquely identified by a second key, id_2, that counts the rows created for that id. If an id has fewer s02 values than there are cells in the row, the remaining cells should be filled with ##N/A##. If an id has more than n values, a new row with id_2 incremented by one is created to hold the extra values, and any blank cells in it are again filled with ##N/A##.
From the dataset above, I would wish to have the following output
id<-c(1,2,3,3,4,4,4,4,4,4)
id_2<-c(1,1,1,2,1,2,3,4,5,6)
s02__0<-c(1,1,1,6,1,6,11,16,21,26)
s02__1<-c(2,2,2,7,2,7,12,17,22,27)
s02__2<-c(3,3,3,"##N/A##",3,8,13,18,23,28)
s02__3<-c(4,4,4,"##N/A##",4,9,14,19,24,29)
s02__4<-c("##N/A##",5,5,"##N/A##",5,10,15,20,25,"##N/A##")
dat2<-data.frame(id,id_2,s02__0,s02__1,s02__2,s02__3,s02__4)
This can produce what you want:
library(tidyverse)
#Data
id<-c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3)
s02<-c(001,002,003,004,001,002,003,004,005,001,002,003,004,005,006,007)
dat1<-data.frame(id,s02)
#Code
dat2 <- dat1 %>% group_by(id) %>% mutate(id2 = ifelse(s02<=5,1,2)) %>% ungroup() %>%
group_by(id,id2) %>% mutate(val=1:n()-1,nid = cur_group_id()) %>% ungroup() %>%
select(-id2) %>% mutate(id=paste0(id,'.',nid),val=paste0('s02','.',val)) %>% select(-nid) %>%
pivot_wider(names_from = c(val),values_from = s02) %>%
mutate(id=gsub("\\..*","", id)) %>% group_by(id) %>%
mutate(id2=1:n()) %>% select(order(colnames(.)))
dat2
# A tibble: 4 x 7
# Groups: id [3]
id id2 s02.0 s02.1 s02.2 s02.3 s02.4
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 2 3 4 NA
2 2 1 1 2 3 4 5
3 3 1 1 2 3 4 5
4 3 2 6 7 NA NA NA
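The answer above hard-codes the split at s02 <= 5, so it only covers ids with up to ten values. A more general sketch, assuming the s02 values should be allocated in their existing row order, derives both the row key and the column slot from each value's position within its id using integer division and modulo. The helper columns `pos` and `slot` are illustrative names, and missing cells are left as NA rather than the literal ##N/A##.

```r
library(dplyr)
library(tidyr)

n <- 5  # number of s02 columns per output row

# Same shape as the question's dat1: ids 1..4 with 4, 5, 7 and 29 values
dat1 <- data.frame(id  = rep(1:4, times = c(4, 5, 7, 29)),
                   s02 = c(1:4, 1:5, 1:7, 1:29))

dat2 <- dat1 %>%
  group_by(id) %>%
  mutate(pos  = row_number() - 1,            # zero-based position within id
         id_2 = pos %/% n + 1,               # which output row the value lands in
         slot = paste0("s02__", pos %% n)) %>%  # which column it lands in
  ungroup() %>%
  select(-pos) %>%
  pivot_wider(names_from = slot, values_from = s02)

dat2
```

If the literal ##N/A## marker is required, the value columns could be converted to character and the NAs replaced afterwards.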

replace a value in one column with a value from a second column on condition of a value from a third column from different rows

I have a data frame:
df1 <- data.frame(Object = c("Klaus","Klaus","Peter","Peter","Daniel","Daniel"),
PointA = as.numeric(c("7",NA,"17",NA,NA,NA)),
PointB = as.numeric(c("18","22",NA,NA,"17",NA)),
measure = c("1","2","1","2","1","2")
)
And I want this:
df2 <- data.frame(Object = c("Klaus","Klaus","Peter","Peter","Daniel","Daniel"),
PointA = as.numeric(c("7","18","17",NA,NA,"17")),
PointB = as.numeric(c("18","22",NA,NA,"17",NA)),
measure = c("1","2","1","2","1","2")
)
That is, if an Object has no value in PointA for measure == 2, I want it replaced with PointB from measure == 1 of the same Object.
First thing that comes to mind is:
library(dplyr)
df$PointA <- coalesce(df$PointA, df$PointB)
But as far as I know there is no way to make this conditional.
Then I thought maybe something like:
df$PointA[is.na(df$PointA)] <- df$PointB
But this does not differentiate for the measure.
So I thought about:
df$PointA <- ifelse(df$measure == 2 & is.na(df$PointA), df$PointB, df$PointA)
But that does not take into account that I need the corresponding value from measure == 1.
Now I am at a loss and out of ideas for how to approach this. Help?
Edit: I already got two very good solutions, but both rely on the row order of the data frame. Evidently my example was too simple. I am looking for something that also works under the following condition:
df1 <- df1[sample(nrow(df1)), ]
One possible option is using row_number() from dplyr. In case you need to sort your dataframe first, you can insert an arrange statement.
library(dplyr)
df1 %>%
arrange(Object, measure) %>%
group_by(Object) %>%
mutate(PointA = if_else(measure == 2 & is.na(PointA), PointB[row_number()-1], PointA))
# A tibble: 6 x 4
# Groups: Object [3]
# Object PointA PointB measure
# <chr> <dbl> <dbl> <chr>
# 1 Daniel NA 17 1
# 2 Daniel 17 NA 2
# 3 Klaus 7 18 1
# 4 Klaus 18 22 2
# 5 Peter 17 NA 1
# 6 Peter NA NA 2
You could use coalesce() together with lag(), as shown below:
library(tidyverse)
df1 %>%
arrange(Object, measure) %>%
group_by(Object) %>%
mutate(PointA = coalesce(PointA, lag(PointB)))
# A tibble: 6 x 4
# Groups: Object [3]
Object PointA PointB measure
<chr> <dbl> <dbl> <chr>
1 Daniel NA 17 1
2 Daniel 17 NA 2
3 Klaus 7 18 1
4 Klaus 18 22 2
5 Peter 17 NA 1
6 Peter NA NA 2
This could be condensed, but it should be relatively clear and doesn't rely on the row order at all. Beware if you have multiple rows for the same Object/Measure pair - the self-join will have multiple matches and you'll end up with a lot more rows than you started with.
library(dplyr)
df_fill = df1 %>%
filter(measure == 1) %>%
select(Object, fill_in = PointB) %>%
mutate(needs_fill = 1L)
result = df1 %>%
mutate(needs_fill = if_else(measure == 2 & is.na(PointA), 1L, NA_integer_)) %>%
left_join(df_fill) %>%
mutate(PointA = coalesce(PointA, fill_in)) %>%
select(-fill_in, -needs_fill)
result
# Object PointA PointB measure
# 1 Klaus 7 18 1
# 2 Klaus 18 22 2
# 3 Peter 17 NA 1
# 4 Peter NA NA 2
# 5 Daniel NA 17 1
# 6 Daniel 17 NA 2
Same as above but without saving the intermediate object:
result = df1 %>%
mutate(needs_fill = if_else(measure == 2 & is.na(PointA), 1L, NA_integer_)) %>%
left_join(
df1 %>%
filter(measure == 1) %>%
select(Object, fill_in = PointB) %>%
mutate(needs_fill = 1L)
) %>%
mutate(PointA = coalesce(PointA, fill_in)) %>%
select(-fill_in, -needs_fill)
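Another order-independent variant, offered as a sketch rather than a definitive answer, avoids the join by looking up each Object's measure == 1 row inside a grouped mutate. It assumes at most one measure == 1 row per Object (match() returns only the first hit); `fill_in` and `shuffled` are illustrative names.

```r
library(dplyr)

df1 <- data.frame(
  Object  = c("Klaus", "Klaus", "Peter", "Peter", "Daniel", "Daniel"),
  PointA  = c(7, NA, 17, NA, NA, NA),
  PointB  = c(18, 22, NA, NA, 17, NA),
  measure = c("1", "2", "1", "2", "1", "2")
)

# Shuffle the rows to show that the result does not depend on order
set.seed(1)
shuffled <- df1[sample(nrow(df1)), ]

filled <- shuffled %>%
  group_by(Object) %>%
  mutate(
    # PointB from this Object's measure == 1 row (NA if there is none)
    fill_in = PointB[match("1", measure)],
    PointA  = if_else(measure == "2" & is.na(PointA), fill_in, PointA)
  ) %>%
  ungroup() %>%
  select(-fill_in)

filled
```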

Using any() or all() with is.na() over multiple columns

I'd like to drop rows from my dataset that are all NAs (AKA keep rows with any non-NAs) for a list of columns. How could I update this code so that x & y are supplied as a vector? This would enable me to flexibly add and drop columns for inspection.
library(dplyr)
ds <-
tibble(
id = c(1:4),
x = c(NA, 1, NA, 4),
y = c(NA, NA , 3, 4)
)
ds %>%
rowwise() %>%
filter(
any(
!is.na(x),
!is.na(y)
)
) %>%
ungroup()
I'm trying to write something like any(!is.na(c(x,y))) but I'm not sure how to supply multiple arguments to is.na().
We can use filter_at with any_vars
ds %>%
filter_at(vars(x:y), any_vars(!is.na(.)))
# A tibble: 3 x 3
# id x y
# <int> <dbl> <dbl>
#1 2 1 NA
#2 3 NA 3
#3 4 4 4
Update - Feb 7 2022
In newer versions of dplyr (as @GitHunter0 suggested), we can use if_all/if_any or across:
ds %>%
filter(if_any(x:y, complete.cases))
# A tibble: 3 × 3
id x y
<int> <dbl> <dbl>
1 2 1 NA
2 3 NA 3
3 4 4 4
You can also use ds %>% filter(!if_all(x:y, is.na)).
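To supply the columns as a character vector, as the question asks, one option is to wrap the vector in all_of() inside if_any(). This is a minimal sketch assuming dplyr 1.0.4 or later, where if_any() is available; `cols_to_check` is a hypothetical name.

```r
library(dplyr)

ds <- tibble(
  id = 1:4,
  x  = c(NA, 1, NA, 4),
  y  = c(NA, NA, 3, 4)
)

# Columns to inspect, kept in a vector so it is easy to add or drop names
cols_to_check <- c("x", "y")

kept <- ds %>%
  # Keep rows where at least one of the listed columns is not NA
  filter(if_any(all_of(cols_to_check), ~ !is.na(.x)))

kept
```

Editing `cols_to_check` is then enough to change which columns are inspected.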

Flag dates based on multiple columns

I have a data frame that records the create_date and delete_date (if any) for a given ID.
Structure:
ID create_date1 create_date2 delete_date1 delete_date2
1 01-01-2014 NA NA NA
2 01-04-2014 01-08-2014 01-05-2014 NA
The create_date and delete_date columns extend up to 10, i.e. columns create_date10 and delete_date10 are present.
Rules/Logic:
We charge a user on a monthly basis. Even if a user was created on the 30th of a month, the user is treated as active for that whole month (the cost is very low).
If a user has a delete date in a month (regardless of the day), the user is not charged from the next month onward.
If a user has only a create_date and no delete_date, all months including the create month are charged.
Output expected:
ID 2014-01 2014-02 2014-03 2014-04 2014-05 2014-06 2014-07 2014-08
1 1 1 1 1 1 1 1 1
2 0 0 0 1 1 0 0 1
so on till current date
1 indicates the user is charged/active for that month
Problem:
I have been struggling with this and cannot work out a good way to do it. My earlier method is too slow.
Previous Solution:
Make the dataset into tall
Insert sequence of dates for each ID as a new column
Use a for loop to check the status
for each ID, status is equal to 1,
if create_date is equal to sequence, and it's 0 if the lag(delete_date) is equal to sequence
else is same as lag(status)
ID create_date delete_date sequence status?
1 01-01-2014 NA 2014-01 1
1 01-01-2014 NA 2014-02 1
1 01-01-2014 NA 2014-03 1
This may not be that efficient, and it assumes the data covers a single year (it could be extended easily):
# convert all dates to Date format
df[,colnames(df[-1])] = lapply(colnames(df[-1]), function(x) as.Date(df[[x]], format = "%d-%m-%Y"))
# extract the month
library(lubridate)
df[,colnames(df[-1])] = lapply(colnames(df[-1]), function(x) month(df[[x]]))
# df
# ID create_date1 create_date2 delete_date1 delete_date2
#1 1 1 NA NA NA
#2 2 4 8 5 NA
# get the current month
current.month <- month(Sys.Date())
# assume for now current month is 9
current.month <- 9
flags <- rep(FALSE, current.month)
func <- function(x){
x[is.na(x)] <- current.month # replacing all NA with current month(9)
create.columns.indices <- x[grepl("create_date", colnames(df[-1]))] # extract the create_months
delete.columns.indices <- x[grepl("delete_date", colnames(df[-1]))] # extract the delete_months
flags <- pmin(1,colSums(t(sapply(seq_along(create.columns.indices),
function(x){
flags[create.columns.indices[x]:delete.columns.indices[x]] = TRUE;
flags
}))))
flags
}
df1 = cbind(df$ID , t(apply(df[-1], 1, func)))
colnames(df1) = c("ID", paste0("month",1:current.month))
# df1
# ID month1 month2 month3 month4 month5 month6 month7 month8 month9
#[1,] 1 1 1 1 1 1 1 1 1 1
#[2,] 2 0 0 0 1 1 0 0 1 1
Here's a still-pretty-long tidyverse approach:
library(tidyverse)
df %>% gather(var, date, -ID) %>% # reshape to long form
# separate date type from column set number
separate(var, c('action', 'number'), sep = '_date', convert = TRUE) %>%
mutate(date = as.Date(date, '%d-%m-%Y')) %>% # parse dates
spread(action, date) %>% # spread create and delete to two columns
mutate(min_date = min(create, delete, na.rm = TRUE), # add helper columns; use outside
max_date = max(create, delete, na.rm = TRUE)) %>% # variable to save memory if an issue
group_by(ID, number) %>%
mutate(month = list(seq(min_date, max_date, by = 'month')), # add month sequence list column
# boolean vector of whether range of months in whole range
active = ifelse(is.na(create),
list(rep(FALSE, length(month[[1]]))),
lapply(month, `%in%`,
seq.Date(create,
min(delete, max_date, na.rm = TRUE),
by = 'month')))) %>%
unnest() %>% # unnest list columns to long form
group_by(ID, month = format(month, '%Y-%m')) %>%
summarise(active = any(active) * 1L) %>% # combine multiple rows for one ID
spread(month, active) # reshape to wide form
## Source: local data frame [2 x 9]
## Groups: ID [2]
##
## ID `2014-01` `2014-02` `2014-03` `2014-04` `2014-05` `2014-06` `2014-07` `2014-08`
## * <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 1 1 1 1 1 1 1 1 1
## 2 2 0 0 0 1 1 0 0 1
