R: Gather/Spread/Reshape 21 Columns Based on 21 Other Column s - r

I would like to create columns based on values in some fields, populated by values in other fields. For example column1_time has value "1030" and column1_status has value "booked". I would like to pivot those into a new field time1030 with value "booked." There are 21 unique columns with times, (the times are only listed once per row, so they are unique across the 21 columns) -- and there are 21 unique columns with statuses that map back to the time columns. So these 42 time+status columns should be rearranged to one column per unique time, being populated by that time's corresponding status.
I have data that looks like this:
I would like to utilize R's gather/spread or reshape2 (legacy) functionality to transpose this data to look like this:
I tinkered around with gather and spread for a few hours but couldn't figure it out. I thought setting the key to ends_with('_time') and the value to ends_with('_status') might work but it did not from my attempts.
For a reproducible example of the data:
structure(list(appointment1_time = c("1030", "1030"), appointment2_time = c("1100",
"1100"), appointment3_time = c("1130", "1130"), appointment4_time = c("1200",
"1200"), appointment5_time = c("1230", "1230"), appointment6_time = c("0100",
"0100"), appointment7_time = c("0130", "0130"), appointment8_time = c("0200",
"0200"), appointment9_time = c("0230", "0230"), appointment10_time = c("0300",
"0300"), appointment11_time = c("0330", "0330"), appointment12_time = c("0400",
"0400"), appointment13_time = c("0430", "0430"), appointment14_time = c("0500",
"0500"), appointment15_time = c("0530", "0530"), appointment16_time = c("0600",
""), appointment17_time = c("0630", ""), appointment18_time = c("0700",
""), appointment19_time = c("0730", ""), appointment20_time = c(NA_character_,
NA_character_), appointment21_time = c(NA_character_, NA_character_
), appointment1_status = c("booked", "available"), appointment2_status = c("booked",
"available"), appointment3_status = c("booked", "available"),
appointment4_status = c("booked", "available"), appointment5_status = c("booked",
"available"), appointment6_status = c("booked", "available"
), appointment7_status = c("booked", "available"), appointment8_status = c("booked",
"available"), appointment9_status = c("booked", "available"
), appointment10_status = c("booked", "available"), appointment11_status = c("booked",
"available"), appointment12_status = c("available", "available"
), appointment13_status = c("available", "available"), appointment14_status = c("available",
"available"), appointment15_status = c("booked", "available"
), appointment16_status = c("available", ""), appointment17_status = c("available",
""), appointment18_status = c("available", ""), appointment19_status = c("available",
""), appointment20_status = c(NA_character_, NA_character_
), appointment21_status = c(NA_character_, NA_character_)), row.names = 1:2, class = "data.frame")

A solution using tidyverse.
library(tidyverse)
# Get the time order
ord <- dat %>% select(ends_with("time")) %>% slice(1) %>% unlist()
# Remove NA
ord <- ord[!is.na(ord)]
dat2 <- dat %>%
rowid_to_column() %>%
gather(Column, Value, -rowid) %>%
separate(Column, into = c("Apt", "time/status"), sep = "_") %>%
spread(`time/status`, Value) %>%
# Remove NA or "" in the status column
filter(!is.na(status) & !status %in% "") %>%
mutate(Apt = str_c("apt_slot", time, sep = "_")) %>%
select(-time) %>%
spread(Apt, status) %>%
select(-rowid) %>%
# Reorder the column
select(str_c("apt_slot", ord, sep = "_"))
dat2
# apt_slot_1030 apt_slot_1100 apt_slot_1130 apt_slot_1200 apt_slot_1230 apt_slot_0100 apt_slot_0130
# 1 booked booked booked booked booked booked booked
# 2 available available available available available available available
# apt_slot_0200 apt_slot_0230 apt_slot_0300 apt_slot_0330 apt_slot_0400 apt_slot_0430 apt_slot_0500
# 1 booked booked booked booked available available available
# 2 available available available available available available available
# apt_slot_0530 apt_slot_0600 apt_slot_0630 apt_slot_0700 apt_slot_0730
# 1 booked available available available available
# 2 available <NA> <NA> <NA> <NA>

Related

How can I make the group-by code to call a function from another package faster?

I have below code to compute a meta value using meta package:
probMetaControl <- long %>% group_by(ID, sample) %>% group_split() %>% mclapply(mc.cores = 10 ,function(endf){
message(endf$ID[1])
res <- meta::metagen(data = endf, studlab = ID, TE = expression , seTE = sd, sm = "SMD",
n.e = rep(1,nrow(endf)),
method.tau = "REML",
hakn = TRUE,
control = list(maxiter=1000))
data.frame(
ID = endf$ID[1],
sample = endf$sample[1],
meta.exprs = res$TE.fixed,
stringsAsFactors = F
)
}) %>% do.call(what = rbind) %>% as.data.frame()
the long dataframe has around 800,000 rows. The small part of long dataframe is as:
as.data.table(structure(list(ID = c("h:5982", "h:3310", "h:7849", "h:2978",
"h:7318"), pID = c("X1053_at", "X117_at", "X121_at", "X1255_g_at",
"X1294_at"), sd = c(0.228908614809978, 0.436455554523966, 0.210542866430305,
0.672545478318169, 0.26926204466525), sample = c("A", "B", "A",
"C", "A"), expression = c(6.53920197406645, 6.12380136266864,
8.01553257692446, 4.62636832157394, 7.58222133679378)), row.names = c(NA,
-5L), class = c("data.table", "data.frame")))
At the moment, this code takes 23 mins to run. Is there any way to make it faster?

spread single key against multiple values

I have data frame as follows:
Actual data runs into hundreds of rows and columns
The objective here is to spread "Attribute Value" against each of the column V1, V2,...VN.
That is dates that are appearing in column V1, should get spread into column names
And corresponding "Attribute Value" should appear against each below
df1 <- data.frame(ROW_ID = c("23416","23416","23416"),
Process_ID = c("SLT","SLT","SLT"),
Operation_Code = c("SLT","SLT","SLT"),
Resource_Group_Code = c("BD","BD","BT"),
Location_Code = c("JS","JS","JS"),
Resource_Code = c("B-T234","B-T234","B-T234"),
Resource_Desc = c("699","699","699"),
iDeleteFlag = c("N","N","N"),
Attribute_Code = c("RA002","RA002","RA002"),
Attribute_Value = c("266","269","298"),
Capacity_Type = c("s","s","s"),
Planning_Version = c("PDMT","PDMT","PDMT"),
"V1"= c("2021-10-10", "2021-10-31", "2021-11-07"),
"V2"= c("2021-10-17", "", "2021-11-14"),
"V3" = c("2021-10-24", "", "2021-11-21"),
"V4" = c("", "2021-11-07", ""),
"V5" = c("", "2021-11-21", ""))
The required output is as follows:
df2 <- data.frame(ROW_ID = c("23416","23416","23416"),
Process_ID = c("SLT","SLT","SLT"),
Operation_Code = c("SLT","SLT","SLT"),
Resource_Group_Code = c("BD","BD","BT"),
Location_Code = c("JS","JS","JS"),
Resource_Code = c("B-T234","B-T234","B-T234"),
Resource_Desc = c("699","699","699"),
iDeleteFlag = c("N","N","N"),
Attribute_Code = c("RA002","RA002","RA002"),
Capacity_Type = c("s","s","s"),
Planning_Version = c("PDMT","PDMT","PDMT"),
"2021-10-10"= c("266", "", ""),
"2021-10-17"= c("266", "", ""),
"2021-10-24" = c("266", "", ""),
"2021-10-31" = c("", "269", ""),
"2021-11-07" = c("", "269", "298"),
"2021-11-14" = c("", "", "298"),
"2021-11-21" = c("", "269", "298"))
My code is as follows:
my code not giving required output
RA002variable_2021ANeg <- gather(RA002variable_2021ANeg, key, value, -
ROW_ID, - Process_ID, - Operation_Code, - Resource_Group_Code, -
Location_Code, - Resource_Code, - Resource_Desc, -
iDeleteFlag, - Attribute_Code1, - Capacity_Type, -
Planning_Version, -Attribute_Value) %>%
mutate(key =( Attribute_Value)) %>%
select(- Attribute_Value) %>%
spread(key, value)
Gather and spread have been substituted with pivot_longerand pivot_wider. While gather and spread are still working, it's best we all got used to the new functions.
Since your ROW_ID is not unique for each row, I create additional index column (simply the row number), gather columns V1:V5 and spread dates into column names:
df1 %>%
mutate(index = row_number()) %>%
pivot_longer(V1:V5, names_to = "name", values_to = "value") %>%
select(-name) %>%
filter(value != "") %>%
pivot_wider(names_from = "value", values_from = "Attribute_Value")

Plot multiple geom_line and geom_smooth objects in one plot

I have somewhat messy looking dataframes, like this one:
df0
# A tibble: 3 x 9
# Groups: Sequ [1]
Sequ Speaker Utterance A_intpl A_dur B_intpl B_dur C_intpl C_dur
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2 ID16.A cool >wha… 31.44786152… 10.5,17,1… 32.86993284… 9.5,16,17… 58.3368399… 14,17,17…
2 2 NA (0.228) 32.75735987… 15.5,17,1… 30.83469006… 14.5,16.9… 26.0386462… 3,17,16,…
3 2 ID16.B u:m Tenne… 32.05752604… 4.5,17,16… 29.95825107… 3.5,16,17… 55.9298614… 8,17,17,…
I want to plot the *_intpl values for each speaker (A, B, or C) for each of the three Utterances in a single chart both as line charts and as trend lines.
I'm just half successful doing this:
library(tidyr)
library(ggplot2)
library(dplyr)
df0 %>%
pivot_longer(cols = contains("_"),
names_to = c("Event_by", ".value"),
names_pattern = "^(.*)_([^_]+$)") %>%
separate_rows(c(intpl, dur), sep = ",", convert = TRUE) %>%
mutate(Time = cumsum(dur)) %>%
mutate(Utterance = paste0(sub(".*(.)$", "\\1",Speaker), ": ", Utterance),
Utterance = factor(Utterance, levels = unique(Utterance))) %>%
ggplot(aes(x = Time, y = log2(intpl),
group = Event_by,
colour = Event_by)) +
geom_line()+
geom_smooth(method = 'lm', color = "red", formula = y~x)+
facet_wrap(~ Utterance, ncol = 1, scales= "free_x")
Half successful because the line plots and trend lines are side-by-side, as if in three columns, whereas they should be in rows, one below the other - how can that be achieved?
Reproducible data:
structure(list(Sequ = c(2L, 2L, 2L), Speaker = c("ID16.A", NA,
"ID16.B"), Utterance = c("cool >what part?<", "(0.228)", "u:m Tennessee="
), A_intpl = c("31.4478615210995,31.5797510648522,31.7143985369445,31.651083739602,31.5806035086034,36.8956763912703,36.2882129597292,35.2124499461012,34.1366869324732,34.1366869324732,32.1927035724058,30.2487202123383,28.3047368522709,26.3607534922035,30.5278334848495,30.5919390424853,30.8898529369568,31.578968913188,31.9011198738002,32.1543265113196,31.9708002079533,31.966536408565,31.8762658607759,31.8994741472105,31.4215913971938,32.1510578328563,31.7863350712876,32.4685052625667,31.7422271490296,32.3286054977263,31.9998974949481,32.5177992323864,32.4727499785435,32.9310888953766,32.7592010033585,33.2231711877427,33.1593949301066,33.2432973964816,33.2569729073414,33.492144800249,33.317650964723,33.4835787832119,33.2377190454279,32.9200836384356,32.9684568771567,32.6400987016883,27.5447101464944,29.3948945479171,35.3449171857603,33.5932932239592,31.8416692621581,30.0900453003569,32.7850431084597,32.7589003618266,32.8365550655013,32.386716057622,32.8420792704881,32.6909995562489,32.6269434402016,32.7370944106334,32.7529759209752,32.6528826975113,32.3663573764448,32.7326853004792,32.6930038462418,32.8975978772676,33.1752899475416,33.2034433355001,33.0667431432803,32.6322933080614,33.2503168843178,32.7573598713719",
"32.7573598713719,32.7531704791313,32.7366130631104,32.918942216354,32.8309939530596,32.3856893430525,32.5368873543441,32.5628510484821,32.5628510484821,32.5628510484821,32.5506564332008,32.7477119716583,32.3458470743288,32.0575260428013",
"32.0575260428013,32.1628824338111,32.0093334061923,32.1461460586991,31.9080762250966,31.9469105074833,31.7431187667232,31.7194255656503,31.7394296413187,31.8594986292975,31.7498243274746,31.9069142374258,32.0835520942767,31.6257067057109,31.757232379438,31.9036689124911,32.1319749301918,31.7203280774998,31.7877137245706,32.3030946636177,32.2800139298454,32.164646135728,32.3636504940227,32.5657818936495,32.3859453482697,32.4797898358193,32.5319835105237,32.92233491509,32.8240561109448,32.664496027779,33.1835064752029,33.0366413969703,33.0406288190821,33.3232964677672,33.2206260057731,33.1537134269402,33.2783471506207,33.2933281566788,33.5322350394609,33.3815736723684,33.7905544185063,33.6143820666896,33.7490659591585,33.7260102344634,34.0721931066557,34.0455026427054,34.3735788774521,34.2888420421073,34.3913721165542,34.5982135545306,34.4417202731001,34.6586347152449,31.1590521215434,31.3276405983897,28.2379253186548,31.133030931336,34.0715906921349,35.8967950760285,35.9334551147377,35.8565504335515,35.7446081905229,35.6300325834155,35.8390086948751,35.9711743270411,36.0029493274176,35.8891056768339"
), A_dur = c("10.5,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,0.5",
"15.5,17,17,16,17,17,16,17,17,16,17,17,16,12.5", "4.5,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,5.5"
), B_intpl = c("32.8699328424689,32.8154348109057,32.5454364786882,32.408257038977,32.5304564519672,32.3270203236281,31.9233218634346,32.0166346064182,31.7360745988363,31.7546527359571,31.8603220354065,31.6520061326962,31.5603191463274,31.3357561466519,31.0976090032219,31.1405090978825,31.1697180784961,31.0863999545386,31.3126984044729,30.580776446803,30.7137016246273,31.0801914571091,31.2343922096768,31.2749857511594,31.3488604642844,30.9327390960718,31.0750482778561,31.1849119826023,31.4180114886183,31.5284273181104,31.147361398529,31.1128597713973,31.5551385744611,31.7479939892741,31.5890352680344,31.5470790538009,31.5427330200078,31.3901913024084,31.5423214446953,31.4814325586741,31.4937336232021,31.3483738841556,31.2516462059018,31.2233881922543,31.2572951780583,31.0087226975291,31.1197589042273,31.053748381687,30.8202174718598,30.845143129195,30.8727194789634,30.4231467151428,30.7254093759809,30.2757746547116,30.6047530953025,29.6835591414008,28.257421076205,29.4634886416064,29.183064807185,28.6935506287734,29.3989017421637,30.8936090542518,30.6884831327852,30.805770713392,30.6938909098627,30.8317757801268,30.8509115577427,30.6836198471168,30.7979978629801,31.0260101704105,30.6248844591805,30.8346900656087",
"30.8346900656087,30.9826158466835,29.814086001996,29.7839590794955,30.7928804535206,31.1589874726521,31.0547403039501,31.2268131145794,31.155503802286,31.3036925274762,31.4782621660348,31.0928322383151,31.589958621025,29.9582510795225",
"29.9582510795225,29.9796434055214,29.9405638729798,30.2602098442174,30.5011865525849,30.6753859842987,28.9331380886365,30.7736467776919,30.8457967803438,30.843630408183,30.8767570425033,30.9178344980247,30.734598946287,30.8877440413271,30.9225051837881,30.9534076039184,31.0172861192043,30.9371712793451,30.9806052132295,31.0593603717961,31.1156928565737,30.4713263393479,26.028518302418,28.1426546887905,29.4308434671559,30.7190322455213,31.2289674937063,31.7389027418913,32.2488379900763,32.7587732382613,33.2687084864463,33.7786437346312,34.2885789828162,34.7985142310012,35.3084494791862,35.8183847273712,36.3283199755562,36.8382552237412,37.3481904719262,37.8581257201112,38.3680609682962,25.5986933949893,29.7968031963901,30.5336819967028,30.1876589408847,30.4260367500101,30.2997107671214,30.3429716412578,30.3537316791924,30.4111899964144,30.7293520851914,30.7778983966343,30.9712137067708,30.9072589183658,31.0696990205164,30.5713926084448,31.3458855877875,31.4169903025083,31.5148974986093,31.5972499257413,31.2293401943969,31.2033325602348,31.1657434266985,30.6784877073261,30.6991365599664,30.6763195188897"
), B_dur = c("9.5,16,17,17,16,17,17,16,17.0000000000146,16.9999999999854,16,17,16.9999999999854,16.0000000000146,17,17,16,17,17,16,17,17,16,17.0000000000146,16.9999999999854,16,17,16.9999999999854,16.0000000000146,17,17,16,17,17,16,17,17,16,17.0000000000146,16.9999999999854,16,17,16.9999999999854,16.0000000000146,17,17,16,17,17,16,17,17,16,17.0000000000146,16.9999999999854,16,17,16.9999999999854,16.0000000000146,17,17,16,17,17,16,17,17,16,17.0000000000146,16.9999999999854,16,2.5",
"14.5,16.9999999999854,16.0000000000146,17,17,16,17,17,16,17,17,16,17.0000000000146,13.4999999999854",
"3.5,16,17,16.9999999999854,16.0000000000146,17,17,16,17,17,16,17,17,16,17.0000000000146,16.9999999999854,16,17,16.9999999999854,16.0000000000146,17,17,16,17,17,16,17,17,16,17.0000000000146,16.9999999999854,16,17,16.9999999999854,16.0000000000146,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,7.5"
), C_intpl = c("58.3368399069697,58.249224089011,59.5198368051218,58.8722012497097,58.4418996252205,58.5849059154389,59.2752163985494,52.8407480422202,51.6276603912397,48.0255346632529,44.753541512539,41.4815483618252,38.2095552111114,34.9375620603975,31.6655689096837,28.3935757589698,25.121582608256,19.4712933827274,22.0108873782783,24.5504813738291,24.8441573376901,24.6902151101703,24.4029572181118,24.9753161974674,24.8664406826514,24.8486668451201,25.1137001504163,25.1142578332509,25.4902077628339,25.4075561268027,25.6622548410237,61.2421678149908,25.1600975771354,25.6667198263373,25.442560744158,25.8736383423437,25.5859074180431,24.7860400673889,24.4337707697216,24.3214953242744,23.915753514736,23.7363185577661,23.7186569801299,23.4313514771952,23.5730151254578,62.5124513171595,23.3260531660862,23.4498217326665,23.2145314844252,57.5586745434594,63.4646233226955,23.0706406704345,23.3318690599491,62.044649715831,62.2720656330432,22.2532276715887,62.7059140614625,22.9511208849958,22.5603175709988,23.3456453893988,63.2523901625561,60.6655429980934,60.2358824325868,59.957910796633,57.3999702562457,54.8277282980263,43.0269305132552,31.2261327284841,19.425334943713,22.7319906068577,26.0386462700023",
"26.0386462700023,29.345301933147,32.6519575962917,35.9586132594364,48.3773995023798,60.7961857453232,49.4980424442242,55.9907960862667,57.2956837917999,58.1409925994177,59.025022056064,60.0098263540792,60.4028460580062,61.2629030450653,55.9298614021542",
"55.9298614021542,55.3877180252389,61.3547152702855,61.7847919095391,56.2457623439544,62.5477315546977,62.3078007189967,62.4272469013149,57.6479672147315,62.9844338801191,58.0081708266629,63.3872796098875,59.0138830718112,58.0612924481098,58.38680047729,58.687179350318,63.8724230039733,63.4126777597892,63.6865154626743,63.5670658627636,63.4496590540706,63.7595297692908,58.9069708176601,63.4547681163061,64.3198376700797,63.415319961042,64.0985879957056,64.1201809531605,63.677902665454,64.1934303628317,64.4682003346273,64.2868853545462,24.8444135816353,64.1579626357752,63.8897139146875,58.5472675827292,64.5784992977498,64.0848591719068,63.8841268679761,64.2901359712354,64.395692486112,64.5425896391638,64.8060565909917,64.3618830026368,64.7088481705444,64.5005944199885,64.5540289192148,64.7408010459365,63.378880767685,63.3415589069662,63.5362700331647,63.5924807719723,63.575801461932,63.6799360982113,64.0041021410894,64.3144923757986,63.8692943755376,63.8594574363473,64.2731841085802,63.3314657812309,64.2758880216293,64.1011768977101,64.0261661917799,64.2865302330478,63.724697791255,64.1202175712152"
), C_dur = c("14,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,14",
"3,17,16,17,17,16,17,17,16,17,17,16,17,17,8", "8,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,17,16,17,2"
)), row.names = c(NA, -3L), groups = structure(list(Sequ = 2L,
.rows = structure(list(1:3), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
There's a possible solution with use of grid.arrange() func from library(gridExtra) library(grid) packages.
I've wrapped your data into unique charts and combined them together into arranged chart.
df1 = df0 %>%
pivot_longer(cols = contains("_"),
names_to = c("Event_by", ".value"),
names_pattern = "^(.*)_([^_]+$)") %>%
separate_rows(c(intpl, dur), sep = ",", convert = TRUE) %>%
mutate(Time = cumsum(dur)) %>%
mutate(Utterance = paste0(sub(".*(.)$", "\\1",Speaker), ": ", Utterance),
Utterance = factor(Utterance, levels = unique(Utterance)))
Set chart objects into enviroment:
for (i in unique(df1$Event_by)){
for (j in levels(df1$Utterance)){
assign(x = paste0(i,j), value = ggplot(data = df1[df1$Event_by == i & df1$Utterance == j,], aes(x = Time, y = log2(intpl))) +
geom_line()+
geom_smooth(method = 'lm', color = "red", formula = y~x))
}
}
Create grided chart:
library(gridExtra) library(grid)
grid.arrange(
`AA: cool >what part?<`,
`AB: u:m Tennessee=` ,
`ANA: (0.228)` ,
`BA: cool >what part?<` ,
`BB: u:m Tennessee=` ,
`BNA: (0.228)` ,
`CA: cool >what part?<` ,
`CB: u:m Tennessee=` ,
`CNA: (0.228)` ,
nrow = 3)
Although i think there should be better solution for that.
You can also try to explore below articlesfor arranging plots:
http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/81-ggplot2-easy-way-to-mix-multiple-graphs-on-the-same-page/
https://ggplot2-book.org/facet.html
Moreover, there's is no themming added to my solution

Count Backwards in String until pattern R

I'm trying to extract UPCs from item descriptions. There is a varying number of /'s in the front of the description, but the UPC is always right before the last /, so I was using a count of characters, however, there is a variable number of characters at the end based on pack size. In the replication, you can see on the first row what this is supposed to look like at the end, but the second row has dropped the first digit of the UPC and picked up the /. Looking for a way to do this inline with DPLYR. My original code is under the replication.
test <- structure(list(Month = structure(c(17987, 17987), class = "Date"),store_id = c("7005", "7005"), UPC = c("000004150860081","00001200050404/"), `Item Description` = c("ACQUA PANNA SPRING WATER/EACH/000004150860081/1","AQUAFINA 24PK/24PK/000001200050404/24"), `Cals Item Description` = c(NA_character_,NA_character_), `Sub-Category` = c(NA_character_, NA_character_), Category = c(NA_character_, NA_character_), Department = c(NA_character_,NA_character_), `Sales Dollars` = c(17.43, 131.78), Units = c(7,528), Cost = c(8.4, 112.2), `Gross Margin` = c(9.03, 19.58), `Gross Margin %` = c(0.5181, 0.1486)), row.names = c(NA,-2L), class = c("tbl_df", "tbl", "data.frame"))
foo <- list.files(pattern = "*.csv", full.names = T) %>%
map_df(~read_csv(.)) %>%
mutate(date = lubridate::mdy(str_sub(textbox43, start = -10))) %>%
mutate(store_id = str_sub(textbox6, start = 1, end = 4)) %>%
mutate(item_desc = textbox57) %>%
filter(!is.na(item_desc), item_desc != "") %>%
mutate(dollars = textbox58,
units = textbox59,
cost = textbox61,
gm = textbox66,
gm_pct = textbox67) %>%
mutate(UPC = str_sub(item_desc, start = -17, end = -3))
Is this what you want?
sub("^.*/([^/]+)/[^/]*$",
"\\1",
test$`Item Description`)
Returns:
[1] "000004150860081" "000001200050404"
Edit: You were asking for dplyr style:
test %>%
mutate(item_id = sub("^.*/([^/]+)/[^/]*$",
"\\1",
test$`Item Description`))

How to convert data with different levels of information into wide format? [duplicate]

This question already has an answer here:
Reshaping data.frame with a by-group where id variable repeats [duplicate]
(1 answer)
Closed 2 years ago.
I have a data of patients' operations/procedures (example as shown in the picture below) where one row describes a patient's procedure. There are 2 levels of information,
the first being the operation details, i.e. op_start_dt, priority_operation and asa_status
the second being the procedure details, i.e. proc_desc and proc_table
An operation can have more than 1 procedures. In the example below, patient A has 2 operations (defined by distinct op_start_dt). In his first operation, he had 1 procedure (defined by distinct proc_desc) and in his second, he had 2 procedures.
I would like to convert the data into a wide format, where a patient only has one row, and his information will be arranged operation by operation and within each operation, it will be arrange procedure by procedure, as shown below. So, proc_descxy refers to the proc_desc on xth operation and yth procedure.
Data:
df <- structure(list(patient = c("A", "A", "A"), department = c("GYNAECOLOGY /OBSTETRICS DEPT",
"GYNAECOLOGY /OBSTETRICS DEPT", "GYNAECOLOGY /OBSTETRICS DEPT"
), op_start_dt = structure(c(1424853000, 1424870700, 1424870700
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), priority_operation = c("Elective",
"Elective", "Elective"), asa_status = c(2, 3, 3), proc_desc = c("UTERUS, MALIGNANT CONDITION, EXTENDED HYSTERECTOMY WITH/WITHOUT LYMPHADENECTOMY",
"KIDNEY AND URETER, VARIOUS LESIONS, NEPHROURETERECTOMY, LAPAROSCOPIC",
"HEART, VARIOUS LESIONS, HEART TRANSPLANTATION"), proc_table = c("99",
"6A", "7C")), row.names = c(NA, 3L), class = "data.frame")
Desired output:
df <- structure(list(patient = "A", department = "GYNAECOLOGY /OBSTETRICS DEPT",
no_op = 2, op_start_dt1 = structure(1424853000, class = c("POSIXct",
"POSIXt"), tzone = "UTC"), no_proc1 = 1, priority_operation1 = "Elective",
asa_status1 = 2, proc_desc11 = "UTERUS, MALIGNANT CONDITION, EXTENDED HYSTERECTOMY WITH/WITHOUT LYMPHADENECTOMY",
proc_table11 = "99", op_start_dt2 = structure(1424870700, class = c("POSIXct",
"POSIXt"), tzone = "UTC"), no_of_proc2 = 2, priority_operation2 = "Elective",
asa_status2 = 3, proc_desc21 = "KIDNEY AND URETER, VARIOUS LESIONS, NEPHROURETERECTOMY, LAPAROSCOPIC",
proc_table21 = "6A", proc_desc22 = "HEART, VARIOUS LESIONS, HEART TRANSPLANTATION",
proc_table22 = "7C"), row.names = 1L, class = "data.frame")
My attempt:
I tried to work this out, but it gets confusing along the way, with pivot_longer then pivot_wideragain.
df %>%
# Operation-level Information
group_by(patient) %>%
mutate(op_nth = dense_rank(op_start_dt),
no_op = n_distinct(op_start_dt)) %>%
# Procedure-level Information
group_by(patient, op_start_dt) %>%
mutate(proc_nth = row_number(),
no_proc = n_distinct(proc_desc)) %>%
ungroup() %>%
# Make pivoting easier
mutate_all(as.character) %>%
# Pivot Procedure-level Information
pivot_longer(-c(patient, department, no_op, op_nth, proc_nth)) %>%
# Remove the indices for "Procedure" for Operation_level Information
mutate(proc_nth = case_when(!(name %in% c("op_start_dt", "no_proc", "priority_operation", "asa_status")) ~ proc_nth)) %>%
# Create the column names
unite(name, c(name, op_nth, proc_nth), sep = "", na.rm = TRUE) %>%
distinct() %>%
pivot_wider(names_from = name, values_from = value)
Create a unique ID column for each patient and then use pivot_wider.
library(dplyr)
df %>%
group_by(patient) %>%
mutate(row = row_number()) %>%
tidyr::pivot_wider(names_from = row, values_from = op_start_dt:proc_table)

Resources