spread single key against multiple values - r

I have a data frame like the one below; the actual data runs to hundreds of rows and columns.
The objective is to spread "Attribute_Value" across the dates held in columns V1, V2, ..., VN.
That is, the dates that appear in columns V1 to VN should become column names,
and the corresponding "Attribute_Value" should appear under each date.
df1 <- data.frame(ROW_ID = c("23416","23416","23416"),
Process_ID = c("SLT","SLT","SLT"),
Operation_Code = c("SLT","SLT","SLT"),
Resource_Group_Code = c("BD","BD","BT"),
Location_Code = c("JS","JS","JS"),
Resource_Code = c("B-T234","B-T234","B-T234"),
Resource_Desc = c("699","699","699"),
iDeleteFlag = c("N","N","N"),
Attribute_Code = c("RA002","RA002","RA002"),
Attribute_Value = c("266","269","298"),
Capacity_Type = c("s","s","s"),
Planning_Version = c("PDMT","PDMT","PDMT"),
"V1"= c("2021-10-10", "2021-10-31", "2021-11-07"),
"V2"= c("2021-10-17", "", "2021-11-14"),
"V3" = c("2021-10-24", "", "2021-11-21"),
"V4" = c("", "2021-11-07", ""),
"V5" = c("", "2021-11-21", ""))
The required output is as follows:
df2 <- data.frame(ROW_ID = c("23416","23416","23416"),
Process_ID = c("SLT","SLT","SLT"),
Operation_Code = c("SLT","SLT","SLT"),
Resource_Group_Code = c("BD","BD","BT"),
Location_Code = c("JS","JS","JS"),
Resource_Code = c("B-T234","B-T234","B-T234"),
Resource_Desc = c("699","699","699"),
iDeleteFlag = c("N","N","N"),
Attribute_Code = c("RA002","RA002","RA002"),
Capacity_Type = c("s","s","s"),
Planning_Version = c("PDMT","PDMT","PDMT"),
"2021-10-10"= c("266", "", ""),
"2021-10-17"= c("266", "", ""),
"2021-10-24" = c("266", "", ""),
"2021-10-31" = c("", "269", ""),
"2021-11-07" = c("", "269", "298"),
"2021-11-14" = c("", "", "298"),
"2021-11-21" = c("", "269", "298"),
check.names = FALSE) # keep the date strings as column names
My code, which does not give the required output, is as follows:
RA002variable_2021ANeg <- gather(RA002variable_2021ANeg, key, value, -
ROW_ID, - Process_ID, - Operation_Code, - Resource_Group_Code, -
Location_Code, - Resource_Code, - Resource_Desc, -
iDeleteFlag, - Attribute_Code1, - Capacity_Type, -
Planning_Version, -Attribute_Value) %>%
mutate(key =( Attribute_Value)) %>%
select(- Attribute_Value) %>%
spread(key, value)

gather and spread have been superseded by pivot_longer and pivot_wider. While gather and spread still work, it's best to get used to the new functions.
Since your ROW_ID is not unique for each row, I create an additional index column (simply the row number), gather columns V1:V5 into long format, and spread the dates into column names:
library(dplyr)
library(tidyr)

df1 %>%
mutate(index = row_number()) %>%
pivot_longer(V1:V5, names_to = "name", values_to = "value") %>%
select(-name) %>%
filter(value != "") %>%
pivot_wider(names_from = "value", values_from = "Attribute_Value")
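The pipeline above leaves NA in the empty cells and orders the date columns by first appearance. A minimal sketch of one way to get empty strings and sorted date columns instead, assuming tidyr 1.1.0 or later (for names_sort and a scalar values_fill); the toy data frame and its id column are made up for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in the same shape as df1 (id stands in for the other columns)
toy <- data.frame(
  id = c("a", "b"),
  Attribute_Value = c("266", "298"),
  V1 = c("2021-10-17", "2021-10-10"),
  V2 = c("", "2021-10-17"),
  stringsAsFactors = FALSE
)

wide <- toy %>%
  mutate(index = row_number()) %>%          # make every row uniquely identifiable
  pivot_longer(V1:V2, names_to = "name", values_to = "value") %>%
  filter(value != "") %>%                   # drop the empty date slots
  select(-name) %>%
  pivot_wider(
    names_from = value,
    values_from = Attribute_Value,
    names_sort = TRUE,                      # date columns in ascending order
    values_fill = ""                        # empty string instead of NA
  ) %>%
  select(-index)

wide
```

Because the dates are ISO-formatted strings, sorting the names alphabetically also sorts them chronologically.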

Related

If statement with pipe

I've seen a few articles on how to use if statements or conditionals with piping, but I'm not sure how to apply them to my situation. Along with a specific answer to my problem, I was also hoping for a more general explanation of adding an if statement to a pipe, so that I can handle most situations.
I tried to learn from the answer below (use if() to use select() within a dplyr pipe chain), but I don't understand why "." is supplied as an argument on the third line, or when I should do so:
mtcars %>%
group_by(cyl) %>%
{ if (cond) filter(., am == 1) else . } %>%
summarise(m = mean(wt))
Here's a sample of my data:
df_parse<-
structure(list(value = c("HURESPLI\t2\tLINE NUMBER OF THE RESPONDENT\tCURRENT\t22 - 23",
"FILLER\t2\t\t27 - 28", "HUBUSL1\t2\tENTER LINE NUMBER\t81 - 82",
"GEDIV\t1\tDIVISION\t91 - 91", "GESTFIPS\t2\tFEDERAL INFORMATION\t93 - 94"
), starts_with_position = c(TRUE, TRUE, TRUE, TRUE, TRUE), missing_vars = c("HUFINAL\t FINAL OUTCOME CODE\t 24 - 26",
"HETENURE\t ARE YOUR LIVING QUARTERS... (READ ANSWER CATEGORIES)\t 29 - 30",
"FOR HUBUS = 1 VALID ENTRIES 83 - 84", " 92 - 92", " 95 - 95"
)), row.names = c(NA, 5L), class = "data.frame")
I'm trying to separate out the missing_vars column using extract (tidyr) and gsub as shown below:
df_parse<-
df_parse %>%
mutate(dup_value2 = missing_vars) %>%
extract(col = dup_value2, into = "position2", regex = "(\\d+\\s*-\\s*\\d+)$") %>%
mutate(id2 = gsub(pattern = "\\t.*", replacement = "", x = missing_vars)) %>%
mutate(desc2 = gsub(".*\\\t\\d+\\\t", replacement = "", x = missing_vars)) %>%
mutate(desc2 = gsub("(\\d+\\s*-\\s*\\d+)$", replacement = "", x = missing_vars))
This works fine, but I want to add a conditional at the start of this pipe, so that it only applies where df_parse$starts_with_position == TRUE.
Something like this? (I know it doesn't work.)
df_parse %>% if(starts_with_position==TRUE){
mutate(dup_value2 = missing_vars) %>%
extract(col = dup_value2, into = "position2", regex = "(\\d+\\s*-\\s*\\d+)$") %>%
mutate(id2 = gsub(pattern = "\\t.*", replacement = "", x = missing_vars)) %>%
mutate(desc2 = gsub(".*\\\t\\d+\\\t", replacement = "", x = missing_vars)) %>%
mutate(desc2 = gsub("(\\d+\\s*-\\s*\\d+)$", replacement = "", x = missing_vars))
}else ""
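For reference, a minimal sketch of the two situations that usually get mixed up here, assuming dplyr is loaded. The names cond, toy, flag and txt are made up for illustration. A single TRUE/FALSE value can drive if () inside braces, where "." stands for the piped data because braces stop %>% from inserting it as the first argument. A logical column has one value per row, so it needs a vectorised if_else() inside mutate() instead:

```r
library(dplyr)

# Case 1: one scalar condition for the whole pipeline.
# Inside { }, %>% does not insert the data automatically, so `.` is written
# wherever the piped data frame is needed.
cond <- TRUE
res <- mtcars %>%
  group_by(cyl) %>%
  { if (cond) filter(., am == 1) else . } %>%
  summarise(m = mean(wt))

# Case 2: a per-row condition stored in a column (hypothetical toy data).
# if () cannot take a whole column, so use the vectorised if_else().
toy <- data.frame(
  flag = c(TRUE, FALSE),
  txt = c("A\t1 - 2", "B\t3 - 4"),
  stringsAsFactors = FALSE
)
out <- toy %>%
  mutate(id = if_else(flag, sub("\\t.*", "", txt), NA_character_))

out
```

In the second case the rows where flag is FALSE get NA rather than being dropped; filter(flag) before the mutate() would drop them instead.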

How to convert data with different levels of information into wide format? [duplicate]

I have data on patients' operations/procedures (example below), where each row describes one procedure for a patient. There are two levels of information:
1. operation details, i.e. op_start_dt, priority_operation and asa_status
2. procedure details, i.e. proc_desc and proc_table
An operation can have more than one procedure. In the example below, patient A has 2 operations (defined by distinct op_start_dt). In the first operation there was 1 procedure (defined by distinct proc_desc), and in the second there were 2 procedures.
I would like to convert the data into a wide format in which each patient has only one row, with the information arranged operation by operation and, within each operation, procedure by procedure, as shown below. So proc_descxy refers to the proc_desc of the xth operation and yth procedure.
Data:
df <- structure(list(patient = c("A", "A", "A"), department = c("GYNAECOLOGY /OBSTETRICS DEPT",
"GYNAECOLOGY /OBSTETRICS DEPT", "GYNAECOLOGY /OBSTETRICS DEPT"
), op_start_dt = structure(c(1424853000, 1424870700, 1424870700
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), priority_operation = c("Elective",
"Elective", "Elective"), asa_status = c(2, 3, 3), proc_desc = c("UTERUS, MALIGNANT CONDITION, EXTENDED HYSTERECTOMY WITH/WITHOUT LYMPHADENECTOMY",
"KIDNEY AND URETER, VARIOUS LESIONS, NEPHROURETERECTOMY, LAPAROSCOPIC",
"HEART, VARIOUS LESIONS, HEART TRANSPLANTATION"), proc_table = c("99",
"6A", "7C")), row.names = c(NA, 3L), class = "data.frame")
Desired output:
df <- structure(list(patient = "A", department = "GYNAECOLOGY /OBSTETRICS DEPT",
no_op = 2, op_start_dt1 = structure(1424853000, class = c("POSIXct",
"POSIXt"), tzone = "UTC"), no_proc1 = 1, priority_operation1 = "Elective",
asa_status1 = 2, proc_desc11 = "UTERUS, MALIGNANT CONDITION, EXTENDED HYSTERECTOMY WITH/WITHOUT LYMPHADENECTOMY",
proc_table11 = "99", op_start_dt2 = structure(1424870700, class = c("POSIXct",
"POSIXt"), tzone = "UTC"), no_proc2 = 2, priority_operation2 = "Elective",
asa_status2 = 3, proc_desc21 = "KIDNEY AND URETER, VARIOUS LESIONS, NEPHROURETERECTOMY, LAPAROSCOPIC",
proc_table21 = "6A", proc_desc22 = "HEART, VARIOUS LESIONS, HEART TRANSPLANTATION",
proc_table22 = "7C"), row.names = 1L, class = "data.frame")
My attempt:
I tried to work this out, but it gets confusing along the way, with pivot_longer and then pivot_wider again.
df %>%
# Operation-level Information
group_by(patient) %>%
mutate(op_nth = dense_rank(op_start_dt),
no_op = n_distinct(op_start_dt)) %>%
# Procedure-level Information
group_by(patient, op_start_dt) %>%
mutate(proc_nth = row_number(),
no_proc = n_distinct(proc_desc)) %>%
ungroup() %>%
# Make pivoting easier
mutate_all(as.character) %>%
# Pivot Procedure-level Information
pivot_longer(-c(patient, department, no_op, op_nth, proc_nth)) %>%
# Remove the indices for "Procedure" for Operation_level Information
mutate(proc_nth = case_when(!(name %in% c("op_start_dt", "no_proc", "priority_operation", "asa_status")) ~ proc_nth)) %>%
# Create the column names
unite(name, c(name, op_nth, proc_nth), sep = "", na.rm = TRUE) %>%
distinct() %>%
pivot_wider(names_from = name, values_from = value)
Create a unique ID column for each patient and then use pivot_wider.
library(dplyr)
df %>%
group_by(patient) %>%
mutate(row = row_number()) %>%
tidyr::pivot_wider(names_from = row, values_from = op_start_dt:proc_table)
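The row-number approach above names columns by row (e.g. proc_desc_1), not by operation and procedure. A rough sketch of one way to get the proc_descxy style naming, assuming tidyr 1.1.0 or later: index operations and procedures separately, widen each level on its own, then join. The trimmed toy data and the helper names idx, ops and procs are made up for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical trimmed-down version of the data (dates shortened to labels)
toy <- data.frame(
  patient = c("A", "A", "A"),
  op_start_dt = c("d1", "d2", "d2"),
  asa_status = c(2, 3, 3),
  proc_table = c("99", "6A", "7C"),
  stringsAsFactors = FALSE
)

# Number operations within patient, and procedures within operation
idx <- toy %>%
  group_by(patient) %>%
  mutate(op_nth = dense_rank(op_start_dt)) %>%
  group_by(patient, op_nth) %>%
  mutate(proc_nth = row_number()) %>%
  ungroup()

# Operation-level columns: one set per operation (op_start_dt1, asa_status1, ...)
ops <- idx %>%
  distinct(patient, op_nth, op_start_dt, asa_status) %>%
  pivot_wider(names_from = op_nth,
              values_from = c(op_start_dt, asa_status),
              names_sep = "")

# Procedure-level columns: one per operation/procedure pair (proc_table11, ...)
procs <- idx %>%
  select(patient, op_nth, proc_nth, proc_table) %>%
  pivot_wider(names_from = c(op_nth, proc_nth),
              values_from = proc_table,
              names_sep = "",
              names_prefix = "proc_table")

wide <- left_join(ops, procs, by = "patient")
wide
```

Widening the two levels separately avoids converting everything to character first, at the cost of the columns not being interleaved operation by operation.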

How to use gather function to select 0 column if there is no column name of a data frame

I have a big data frame. My variable names (i.e. gene names / AGI) are recorded in the "0" column, but this column has no column name (the data frame is derived from DESeq).
I want to use the gather function to summarize the data. How can I exclude the "0" column when it has no column name?
For example, if there is a column name (e.g. gene names / AGI), I know how to do it, as shown below:
df1 <- structure(list(AGI = c("ATCG01240", "ATCG01310", "ATMG00070"), aox2_0h__1 = c(15.79105291, 14.82652303, 14.70630068), aox2_0h__2 = c(16.06494674, 14.50610036, 14.52189807), aox2_0h__3 = c(14.64596287, 14.73266459, 13.07143141), aox2_0h__4 = c(15.71713641, 15.15430026, 16.32190068 ), aox2_12h__1 = c(14.99030606, 15.08046949, 15.8317372), aox2_12h__2 = c(15.15569857, 14.98996474, 14.64862254), aox2_12h__3 = c(15.12144791, 14.90111092, 14.59618842), aox2_12h__4 = c(14.25648197, 15.09832061, 14.64442686), aox2_24h__1 = c(15.23997241, 14.80968391, 14.22573239 ), aox2_24h__2 = c(15.57551513, 14.94861669, 15.18808897), aox2_24h__3 = c(15.04928714, 14.83758685, 13.06948037), aox2_24h__4 = c(14.79035385, 14.93873234, 14.70402827), aox5_0h__1 = c(15.8245918, 14.9351844, 14.67678306), aox5_0h__2 = c(15.75108628, 14.85867002, 14.45704948 ), aox5_0h__3 = c(14.36545859, 14.79296855, 14.82177912), aox5_0h__4 = c(14.80626019, 13.43330964, 16.33482718), aox5_12h__1 = c(14.66327372, 15.22571466, 16.17761867), aox5_12h__2 = c(14.58089039, 14.98545497, 14.4331578), aox5_12h__3 = c(14.58091828, 14.86139511, 15.83898617 ), aox5_12h__4 = c(14.48097297, 15.1420725, 13.39369381), aox5_24h__1 = c(15.41855602, 14.9890092, 13.92629626), aox5_24h__2 = c(15.78386057, 15.19372889, 14.63254456), aox5_24h__3 = c(15.55321382, 14.82013321, 15.74324956), aox5_24h__4 = c(14.53085803, 15.12196994, 14.81028556 ), WT_0h__1 = c(14.0535031, 12.45484834, 14.89102226), WT_0h__2 = c(13.64720361, 15.07144643, 14.99836235), WT_0h__3 = c(14.28295759, 13.75283646, 14.98220861), WT_0h__4 = c(14.79637443, 15.1108037, 15.21711524 ), WT_12h__1 = c(15.05711898, 13.33689777, 14.81064042), WT_12h__2 = c(14.83846779, 13.62497318, 14.76356308), WT_12h__3 = c(14.77215863, 14.72814995, 13.0835214), WT_12h__4 = c(14.70685445, 14.98527337, 16.12727292), WT_24h__1 = c(15.43813077, 14.56918572, 14.92146565 ), WT_24h__2 = c(16.05986898, 14.70583866, 15.64566505), WT_24h__3 = c(14.87721853, 13.22461859, 16.34119942), 
WT_24h__4 = c(14.92822133, 14.74382383, 12.79146694)), class = "data.frame", row.names = c(NA, -3L))
sdf1 <- gather(df1, "group", "Expression", -AGI) %>%
separate(group, c("sample", "time", "r")) %>%
unite(tgroup, c("sample", "time")) %>%
group_by(AGI, tgroup) %>%
summarize(expression_mean = mean(Expression)) %>%
spread(tgroup, expression_mean) %>%
column_to_rownames(colnames(.)[1])
In the sample above, AGI is the first column; in my current working data frame, this information is in the "0" column, which is not named "AGI".
Q1 - How can I exclude the "0" column in gather?
Q2 - How can I give the "0" column a name (i.e. AGI)?
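A minimal sketch of one possible approach, under the assumption (not confirmed above) that the unnamed "0" column is really the data frame's row names, which is common for DESeq-derived tables. In that case tibble::rownames_to_column() turns them into a named column that can then be excluded by name. The toy data frame m is made up for illustration:

```r
library(tibble)
library(tidyr)

# Hypothetical toy data: gene IDs stored as row names, not as a column
m <- data.frame(
  aox2_0h__1 = c(15.8, 14.8),
  aox2_0h__2 = c(16.1, 14.5),
  row.names = c("ATCG01240", "ATCG01310")
)

# Q2: give the row names a column name
named <- rownames_to_column(m, var = "AGI")

# Q1: exclude that column by name when reshaping to long format
long <- pivot_longer(named, -AGI, names_to = "group", values_to = "Expression")

long
```

If the "0" column turns out to be a real column with an empty name, names(df)[1] <- "AGI" would rename it instead.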

R: Gather/Spread/Reshape 21 Columns Based on 21 Other Columns

I would like to create columns based on values in some fields, populated by values in other fields. For example column1_time has value "1030" and column1_status has value "booked". I would like to pivot those into a new field time1030 with value "booked." There are 21 unique columns with times, (the times are only listed once per row, so they are unique across the 21 columns) -- and there are 21 unique columns with statuses that map back to the time columns. So these 42 time+status columns should be rearranged to one column per unique time, being populated by that time's corresponding status.
I have data that looks like this:
I would like to utilize R's gather/spread or reshape2 (legacy) functionality to transpose this data to look like this:
I tinkered around with gather and spread for a few hours but couldn't figure it out. I thought setting the key to ends_with('_time') and the value to ends_with('_status') might work but it did not from my attempts.
For a reproducible example of the data:
dat <- structure(list(appointment1_time = c("1030", "1030"), appointment2_time = c("1100",
"1100"), appointment3_time = c("1130", "1130"), appointment4_time = c("1200",
"1200"), appointment5_time = c("1230", "1230"), appointment6_time = c("0100",
"0100"), appointment7_time = c("0130", "0130"), appointment8_time = c("0200",
"0200"), appointment9_time = c("0230", "0230"), appointment10_time = c("0300",
"0300"), appointment11_time = c("0330", "0330"), appointment12_time = c("0400",
"0400"), appointment13_time = c("0430", "0430"), appointment14_time = c("0500",
"0500"), appointment15_time = c("0530", "0530"), appointment16_time = c("0600",
""), appointment17_time = c("0630", ""), appointment18_time = c("0700",
""), appointment19_time = c("0730", ""), appointment20_time = c(NA_character_,
NA_character_), appointment21_time = c(NA_character_, NA_character_
), appointment1_status = c("booked", "available"), appointment2_status = c("booked",
"available"), appointment3_status = c("booked", "available"),
appointment4_status = c("booked", "available"), appointment5_status = c("booked",
"available"), appointment6_status = c("booked", "available"
), appointment7_status = c("booked", "available"), appointment8_status = c("booked",
"available"), appointment9_status = c("booked", "available"
), appointment10_status = c("booked", "available"), appointment11_status = c("booked",
"available"), appointment12_status = c("available", "available"
), appointment13_status = c("available", "available"), appointment14_status = c("available",
"available"), appointment15_status = c("booked", "available"
), appointment16_status = c("available", ""), appointment17_status = c("available",
""), appointment18_status = c("available", ""), appointment19_status = c("available",
""), appointment20_status = c(NA_character_, NA_character_
), appointment21_status = c(NA_character_, NA_character_)), row.names = 1:2, class = "data.frame")
A solution using tidyverse.
library(tidyverse)
# Get the time order
ord <- dat %>% select(ends_with("time")) %>% slice(1) %>% unlist()
# Remove NA
ord <- ord[!is.na(ord)]
dat2 <- dat %>%
rowid_to_column() %>%
gather(Column, Value, -rowid) %>%
separate(Column, into = c("Apt", "time/status"), sep = "_") %>%
spread(`time/status`, Value) %>%
# Remove NA or "" in the status column
filter(!is.na(status) & !status %in% "") %>%
mutate(Apt = str_c("apt_slot", time, sep = "_")) %>%
select(-time) %>%
spread(Apt, status) %>%
select(-rowid) %>%
# Reorder the column
select(str_c("apt_slot", ord, sep = "_"))
dat2
#   apt_slot_1030 apt_slot_1100 apt_slot_1130 apt_slot_1200 apt_slot_1230 apt_slot_0100 apt_slot_0130
# 1        booked        booked        booked        booked        booked        booked        booked
# 2     available     available     available     available     available     available     available
#   apt_slot_0200 apt_slot_0230 apt_slot_0300 apt_slot_0330 apt_slot_0400 apt_slot_0430 apt_slot_0500
# 1        booked        booked        booked        booked     available     available     available
# 2     available     available     available     available     available     available     available
#   apt_slot_0530 apt_slot_0600 apt_slot_0630 apt_slot_0700 apt_slot_0730
# 1        booked     available     available     available     available
# 2     available          <NA>          <NA>          <NA>          <NA>
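The same reshaping can also be sketched with pivot_longer and pivot_wider, assuming tidyr 1.0.0 or later. The special ".value" name splits each column name at the underscore and sends the time and status parts to their own columns in one step. The two-appointment toy data below is made up for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical cut-down version of the appointment data
toy <- data.frame(
  appointment1_time = c("1030", "1030"),
  appointment2_time = c("1100", ""),
  appointment1_status = c("booked", "available"),
  appointment2_status = c("available", ""),
  stringsAsFactors = FALSE
)

wide <- toy %>%
  mutate(rowid = row_number()) %>%
  # "appointment1_time" -> Apt = "appointment1", value goes to column `time`
  pivot_longer(-rowid, names_to = c("Apt", ".value"), names_sep = "_") %>%
  # drop slots with no status
  filter(!is.na(status), status != "") %>%
  select(-Apt) %>%
  pivot_wider(names_from = time, values_from = status,
              names_prefix = "apt_slot_") %>%
  select(-rowid)

wide
```

Columns come out in order of first appearance, so no separate reordering step is needed as long as the first row lists the times in the desired order.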

R - count of items in line chart: match DateTime to count of items

I have a dataframe with the following structure:
df <- structure(list(Name = structure(1:9, .Label = c("task 1", "task 2",
"task 3", "task 4", "task 5", "task 6", "task 7", "task 8", "task 9"
), class = "factor"), Start = structure(c(1479799800, 1479800100,
1479800400, 1479800700, 1479801000, 1479801300, 1479801600, 1479801900,
1479802200), class = c("POSIXct", "POSIXt"), tzone = ""), End = structure(c(1479801072,
1479800892, 1479801492, 1479802092, 1479802692, 1479803292, 1479803892,
1479804492, 1479805092), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("Name",
"Start", "End"), row.names = c(NA, -9L), class = "data.frame")
Now I want to count the items in column "Name" over time. Each has a start and an end datetime, formatted as POSIXct.
With the help of a solution here on SO I was able to do so (or at least I think I was) with the following code:
library(data.table)
setDT(df)
dates = seq(min(df$Start), max(df$End), by = "min")
lookup = data.table(Start = dates, End = dates, key = c("Start", "End"))
ans = foverlaps(df, lookup, type = "any", which = TRUE)
library(ggplot2)
ggplot(ans[, .N, by = yid], aes(x = yid, y = N)) + geom_line()
Problem now:
How do I match my DateTime-scale to those integer values on the x-axis? Or is there a faster and better solution to solve my problem?
I tried to use x = as.POSIXct(yid, format = "%Y-%m-%dT%H:%M:%S", origin = min(df$Start)) within the aes of the ggplot(). But that didn't work.
EDIT:
When using the solution to this problem, I ran into another one: minutes with no tasks are plotted with the count of the most recent countable minute. To fix this, we left-join the table of counts (ans) onto a complete sequence of all datetimes and replace every NA with 0, so that every data point has an explicit value.
Like this:
# The part we use to count and match the right times
# (yid is a row index into lookup, so dates[yid] is the matching minute)
df1 <- ans[, .N, by = yid] %>%
mutate(time = dates[yid])
# The part where we use the sequence from the beginning for a LEFT JOIN with the counting dataframe
df2 <- data.frame(time = dates)
dt <- merge(x = df2, y = df1, by = "time", all.x = TRUE)
dt[is.na(dt)] <- 0
In the tidyverse framework, this is a slightly different task:
1. Generate the same dates variable you have.
2. Construct a data frame with all tasks and all times (cartesian join).
3. Filter out the rows that are not in the interval for each task.
4. Add up the tasks that remain for each minute.
5. Plot.
That looks something like this:
library(tidyverse)
library(lubridate)
dates = seq(min(df$Start), max(df$End), by = "min")
df %>%
mutate(key = 1) %>%
left_join(tibble(key = 1, times = dates)) %>% # tibble() replaces the deprecated data_frame()
mutate(include = times %within% interval(Start, End)) %>%
filter(include) %>%
group_by(times) %>%
summarise(count = n()) %>%
ggplot(aes(times, count)) +
geom_line()
#> Joining, by = "key"
If you need it to be faster, it will almost certainly be faster using your original data.table code.
Consider this.
library(data.table)
setDT(df)
dates = seq(min(df$Start), max(df$End), by = "min")
lookup = data.table(Start = dates, End = dates, key = c("Start", "End"))
ans = foverlaps(df, lookup, type = "any", which = TRUE)
ans[, .N, by = yid] %>%
mutate(time = dates[yid]) %>%
ggplot(aes(time, N)) +
geom_line()
Now we use data.table to calculate the overlap, and then look up each minute in dates by its index yid (yid is the row number in lookup, so yid = 1 is the starting minute). Once we add a new column with the times, we can plot.
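Minutes with no overlapping task are missing from the counts, which is the gap the EDIT in the question fills with a left join. A small sketch of the same idea using tidyr::complete(), with a made-up counts table standing in for ans[, .N, by = yid] and a short hypothetical dates sequence:

```r
library(dplyr)
library(tidyr)

# Hypothetical minute sequence and per-index counts (minutes 3 and 4 have no tasks)
dates <- seq(as.POSIXct("2016-11-22 08:00:00", tz = "UTC"),
             by = "min", length.out = 5)
counts <- data.frame(yid = c(1L, 2L, 5L), N = c(2L, 1L, 1L))

filled <- counts %>%
  mutate(time = dates[yid]) %>%                # map each index to its minute
  select(time, N) %>%
  complete(time = dates, fill = list(N = 0L))  # add missing minutes with a count of 0

filled
```

The result has one row per minute in dates, so a line plot drops to zero where no tasks are running.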

Resources