Unlist data frame column and pasting them together - r

I have a dataframe as defined below:
df <- structure(list(ID = 1:19, MEDICATION = c("0", "NOVOMIX 26 BF, 20 D",
"NOVOMIX 14 D", "NOVOMIX 34 BF 22 D", "MIXTARD 52 BF 20 D", "MIXTARD 40 BF 24 D",
"MIXTARD 10 BF 8 D", "MIXTARD 42 BF 24 D", "MIXTARD 20 BF 18 D",
"MIXTARD 82 BF 46 D", "MIXTARD 14 BF 10 D", "NOVOMIX 15 BF 15 D",
"MIXTARD", NA, "MIXTARD 10 BF 4 D", "NOVOMIX", "MIXTARD --> NOVOMIX",
"NOT GIVEN ANY DIABETES MEDICATION INPATIENT PATIENT NORMALLY ON METFORMIN",
"GIVEN ASPART")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -19L), .Names = c("ID", "MEDICATION"))
I would like to extract all the medications (i.e. NOVOMIX, MIXTARD, METFORMIN, ASPART) from the MEDICATION variable in the dataframe and paste them together. I wrote my code as follows:
library(tidyverse)
library(rebus)
df %>%
mutate(MEDICATION2 = str_extract_all(MEDICATION, pattern =
or1(c("NOVOMIX", "MIXTARD", "METFORMIN", "ASPART")))) %>%
unnest(MEDICATION2) %>%
group_by(ID) %>%
mutate(MEDICATION2 = str_c(unlist(MEDICATION2), collapse = " - ")) %>%
slice(1)
My expected output is:
df_out <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19), MEDICATION = c("0", "NOVOMIX 26 BF, 20 D",
"NOVOMIX 14 D", "NOVOMIX 34 BF 22 D", "MIXTARD 52 BF 20 D", "MIXTARD 40 BF 24 D",
"MIXTARD 10 BF 8 D", "MIXTARD 42 BF 24 D", "MIXTARD 20 BF 18 D",
"MIXTARD 82 BF 46 D", "MIXTARD 14 BF 10 D", "NOVOMIX 15 BF 15 D",
"MIXTARD", NA, "MIXTARD 10 BF 4 D", "NOVOMIX", "MIXTARD --> NOVOMIX",
"NOT GIVEN ANY DIABETES MEDICATION INPATIENT PATIENT NORMALLY ON METFORMIN",
"GIVEN ASPART"), MEDICATION2 = c(NA, "NOVOMIX", "NOVOMIX", "NOVOMIX",
"MIXTARD", "MIXTARD", "MIXTARD", "MIXTARD", "MIXTARD", "MIXTARD",
"MIXTARD", "NOVOMIX", "MIXTARD", NA, "MIXTARD", "NOVOMIX", "MIXTARD - NOVOMIX",
"METFORMIN", "ASPART")), .Names = c("ID", "MEDICATION", "MEDICATION2"
), row.names = c(NA, -19L), class = "data.frame")
The problem is that the code removes the row with MEDICATION == 0, and I think my code is too long for a simple extraction of strings. Can this code be shortened, if possible?

We can use stri_extract_all_regex from the stringi package to extract all the words that match the pattern.
library(stringi)
med_pattern <- c("NOVOMIX|MIXTARD|METFORMIN|ASPART")
df$MEDICATION2 <- stri_extract_all_regex(df$MEDICATION, pattern = med_pattern)
As mentioned by @mt1022, the new column is a list. We may paste them together with
df$MEDICATION2<-paste(stri_extract_all_regex(df$MEDICATION,pattern = med_pattern))
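However, this gives some unwanted characters for list elements with more than one match, because paste deparses each element (for example c("MIXTARD", "NOVOMIX")). Collapsing each element separately should give you the expected output.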
chars <- stri_extract_all_regex(df$MEDICATION, pattern = med_pattern)
df$MEDICATION2 <- sapply(chars, paste, collapse = "-")
df$MEDICATION2
#[1] "NA" "NOVOMIX" "NOVOMIX" "NOVOMIX"
#[5] "MIXTARD" "MIXTARD" "MIXTARD" "MIXTARD"
#[9] "MIXTARD" "MIXTARD" "MIXTARD" "NOVOMIX"
#[13] "MIXTARD" "NA" "MIXTARD" "NOVOMIX"
#[17] "MIXTARD-NOVOMIX" "METFORMIN" "ASPART"
You can also do this in a single line:
df$MEDICATION2 <- sapply(stri_extract_all_regex(df$MEDICATION,
pattern = med_pattern), paste, collapse = "-")
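One caveat with the output above: rows without a match come back as the string "NA" rather than a missing value, and the separator is "-" where the expected output uses " - ". The following is a minimal base-R sketch that keeps real NA values, assuming that is what is wanted. extract_meds is a hypothetical helper name, not part of stringi, and the input vector is a shortened sample of the question's data.

```r
# Hypothetical helper: extract every match of `pattern` and join the matches
# with `sep`. Rows with no match (or NA input) stay NA instead of becoming
# the string "NA".
extract_meds <- function(x, pattern, sep = " - ") {
  out <- rep(NA_character_, length(x))
  ok <- !is.na(x)
  # gregexpr finds all matches; regmatches returns character(0) when none exist
  hits <- regmatches(x[ok], gregexpr(pattern, x[ok]))
  out[ok] <- vapply(
    hits,
    function(h) if (length(h) == 0) NA_character_ else paste(h, collapse = sep),
    character(1)
  )
  out
}

# Shortened sample of the MEDICATION column from the question
med <- c("0", "NOVOMIX 26 BF, 20 D", "MIXTARD --> NOVOMIX", NA, "GIVEN ASPART")
med_pattern <- "NOVOMIX|MIXTARD|METFORMIN|ASPART"
res <- extract_meds(med, med_pattern)
```

With the full data frame this would be called as df$MEDICATION2 <- extract_meds(df$MEDICATION, med_pattern), which also keeps the row where MEDICATION is "0".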

Related

Using dplyr to create new groups inside a column

This is my dataframe:
mydf<-structure(list(DS_FAIXA_ETARIA = c("Inválido", "16 anos", "17 anos",
"18 anos", "19 anos", "20 anos", "21 a 24 anos", "25 a 29 anos",
"30 a 34 anos", "35 a 39 anos"), n = c(5202L, 48253L, 67401L,
79398L, 88233L, 90738L, 149634L, 198848L, 238406L, 265509L)), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
I would like to group the following observations into one group called "16 a 20 anos":
"16 anos", "17 anos",
"18 anos", "19 anos", "20 anos"
In other words, I would like to "merge" rows 2-6 and sum their values in the n column, so that a single row represents the sum of rows 2-6.
Is it possible to do this using group_by and then summarise(sum(DS_FAIXA_ETARIA)) verbs from dplyr?
This would be the output that I want:
mydf<-structure(list(DS_FAIXA_ETARIA = c("Inválido","16 a 20 anos" ,"21 a 24 anos", "25 a 29 anos",
"30 a 34 anos", "35 a 39 anos"), n = c(5202L,374023L , 149634L, 198848L, 238406L, 265509L)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Many thanks
This should do the job. First sum with summarise, then add_row to the original dataframe, followed by slice_tail and arrange. Note that slice_tail(n = 5) also drops the "Inválido" row, as the output shows.
df1 <- mydf %>%
summarise(`16 a 20 anos`= sum(n[2:6]))
mydf %>%
add_row(DS_FAIXA_ETARIA=names(df1), n=df1$`16 a 20 anos`[1]) %>%
slice_tail(n=5) %>%
arrange(DS_FAIXA_ETARIA)
Output:
DS_FAIXA_ETARIA n
<chr> <int>
1 16 a 20 anos 374023
2 21 a 24 anos 149634
3 25 a 29 anos 198848
4 30 a 34 anos 238406
5 35 a 39 anos 265509
We create a grouping variable based on the occurrence of 'Inválido' or those elements with only digits (\\d+) followed by a space and 'anos', then summarise by pasting the first and last elements while getting the sum of 'n'.
library(dplyr)
library(stringr)
mydf %>%
group_by(grp = replace(cumsum(!str_detect(DS_FAIXA_ETARIA,
'^\\d+\\s+anos$')), DS_FAIXA_ETARIA == 'Inválido', 0)) %>%
summarise(DS_FAIXA_ETARIA = if(n() > 1)
str_c(DS_FAIXA_ETARIA[c(1, n())], collapse="_") else
DS_FAIXA_ETARIA, n = sum(n), .groups = 'drop') %>%
select(-grp)
-output
# A tibble: 6 x 2
# DS_FAIXA_ETARIA n
# <chr> <int>
#1 Inválido 5202
#2 16 anos_20 anos 374023
#3 21 a 24 anos 149634
#4 25 a 29 anos 198848
#5 30 a 34 anos 238406
#6 35 a 39 anos 265509
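If the labels to merge are known in advance, an explicit recode avoids relying on row positions or on a regular expression, and keeps the "Inválido" row. This is a base-R sketch under that assumption. The names merge_labels, grp and out are illustrative only, and a plain data.frame stands in for the tibble from the question.

```r
mydf <- data.frame(
  DS_FAIXA_ETARIA = c("Inválido", "16 anos", "17 anos", "18 anos", "19 anos",
                      "20 anos", "21 a 24 anos", "25 a 29 anos",
                      "30 a 34 anos", "35 a 39 anos"),
  n = c(5202L, 48253L, 67401L, 79398L, 88233L, 90738L,
        149634L, 198848L, 238406L, 265509L),
  stringsAsFactors = FALSE
)

# Labels that should collapse into the single group "16 a 20 anos"
merge_labels <- c("16 anos", "17 anos", "18 anos", "19 anos", "20 anos")

grp <- ifelse(mydf$DS_FAIXA_ETARIA %in% merge_labels,
              "16 a 20 anos", mydf$DS_FAIXA_ETARIA)
# Fix the level order to first appearance so the rows are not re-sorted
grp <- factor(grp, levels = unique(grp))

sums <- tapply(mydf$n, grp, sum)
out <- data.frame(DS_FAIXA_ETARIA = names(sums),
                  n = as.integer(sums),
                  stringsAsFactors = FALSE)
```

The same recode could be placed inside a dplyr group_by() call. It is written in base R here only to keep the sketch free of dependencies.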

Formatting grouped data for tables in R

I'm trying to display my data in table format and I can't figure out how to rearrange my data to display it in the proper format. I'm used to wrangling data for plots, but I'm finding myself a little lost when it comes to preparing tables. This seems like something really basic, but I haven't been able to find an explanation on what I'm doing wrong here.
I have 3 columns of data, Type, Year, and n. The data formatted as it is now produces a table that looks like this:
Type Year n
Type C 1 5596
Type D 1 1119
Type E 1 116
Type A 1 402
Type F 1 1614
Type B 1 105
Type C 2 26339
Type D 2 14130
Type E 2 98
Type A 2 3176
Type F 2 3071
Type B 2 88
What I want to do is to have Type as row names, Year as column names, and n populating the table contents like this:
1 2
Type A 402 3176
Type B 105 88
Type C 5596 26339
Type D 1119 14130
Type E 116 98
Type F 1614 3071
The mistake might have been made upstream from this point. Using the full original data set I arrived at this output by doing the following:
exampletable <- df %>%
group_by(Year) %>%
count(Type) %>%
select(Type, Year, n)
Here is the dput() output
structure(list(Type = c("Type C", "Type D", "Type E", "Type A",
"Type F", "Type B", "Type C", "Type D", "Type E", "Type A", "Type F",
"Type B", "Type C", "Type D", "Type E", "Type A", "Type F", "Type B",
"Type C", "Type D", "Type E", "Type A", "Type F", "Type B", "Type C",
"Type D", "Type E"), Year = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5), n = c(5596,
1119, 116, 402, 1614, 105, 26339, 14130, 98, 3176, 3071, 88,
40958, 17578, 104, 3904, 3170, 102, 33145, 23800, 93, 1264, 7084,
1262, 34642, 24911, 504)), class = c("spec_tbl_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -27L), spec = structure(list(
cols = list(Type = structure(list(), class = c("collector_character",
"collector")), Year = structure(list(), class = c("collector_double",
"collector")), n = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
You can get the data in wide format and change Type column to rowname.
tidyr::pivot_wider(df, names_from = Year, values_from = n) %>%
tibble::column_to_rownames('Type')
# 1 2 3 4 5
#Type C 5596 26339 40958 33145 34642
#Type D 1119 14130 17578 23800 24911
#Type E 116 98 104 93 504
#Type A 402 3176 3904 1264 NA
#Type F 1614 3071 3170 7084 NA
#Type B 105 88 102 1262 NA
You can use the tidyr package to reshape to wide format and the tibble package to convert a column to row names.
dataset <- read.csv(file_location)
dataset <- tidyr::pivot_wider(dataset, names_from = Year, values_from = n)
tibble::column_to_rownames(dataset, var = 'Type')
1 2
Type C 5596 26339
Type D 1119 14130
Type E 116 98
Type A 402 3176
Type F 1614 3071
Type B 105 88
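For comparison, the same reshaping can be sketched in base R with xtabs, which cross-tabulates n by Type and Year. This is an alternative to the answers above, not a replacement for them. Two differences are worth noting: row names come out sorted alphabetically, and combinations absent from the data show up as 0 rather than NA. The object names counts, wide and wide_df are illustrative, and only the first two years of the question's data are used.

```r
counts <- data.frame(
  Type = c("Type C", "Type D", "Type E", "Type A", "Type F", "Type B",
           "Type C", "Type D", "Type E", "Type A", "Type F", "Type B"),
  Year = rep(c(1, 2), each = 6),
  n = c(5596, 1119, 116, 402, 1614, 105,
        26339, 14130, 98, 3176, 3071, 88),
  stringsAsFactors = FALSE
)

# Sum n for every Type x Year combination; the result is a contingency table
wide <- xtabs(n ~ Type + Year, data = counts)

# Convert the table to a regular data frame with Type as row names
wide_df <- as.data.frame.matrix(wide)
```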

binning the numbers with wrong outcome

I have problems with the output after I bin a numerical vector.
I am trying to bin the length of stay, which was calculated beforehand with the difftime function. It does not make sense to provide the whole code since this is only the background. Yet, when I bin, I do not get the right answer.
Here is the length of stay, assigned to los.
dput(los)
c(61.0416666666667, 61.0416666666667, 61.0416666666667, 2, 2, 3, 3)
Here are my breaks. I tried several variants of na.rm: I passed it as TRUE, as FALSE, and also left it out of my breaks.
breaks <- c(0, 0.8, 0.16,
1.0, 1.8, 1.16,
2.0, 2.8, 2.16,
3.0, 3.8, 3.16,
4.0, 4.8, 4.16,
5.0, 5.8, 5.16,
6.0, 6.8, 6.16,
7.0, 14.0, 21.0, 28.0, max(los)) #, , na.rm = FALSE
Nevertheless, the next code I tried
dt_los$losbinned <- cut(dt_los$LOS,
breaks = breaks,
labels = c("0hrs", "8hrs", "16hrs", "1 d",
"1 d 8hrs", "1 d 16hrs", "2 d",
"2 d 8hrs", "2 d 16hrs", "3 d",
"3 d 8hrs", "3 d 16hrs", "4 d",
"4 d 8hrs", "4 d 16hrs", "5 d",
"5 d 8hrs", "5 d 16hrs", "6 d",
"6 d 8hrs","6 d 16hrs", "7 - 14 d",
"14 - 21 d", "21 - 28 d", "> 28 d"),
right = FALSE)#
gives me different results depending on the value passed for 'right'.
With right = FALSE, the LOS of 61.04 is not binned into the category "> 28 d", but I do get the right bins for the other values, 2.00 and 3.00.
structure(list(IDcol = 101:107, Admissions = structure(c(1539160200,
1539160200, 1539160200, 1539154800, 1539154800, 1539154800, 1539154800
), class = c("POSIXct", "POSIXt"), tzone = "Europe/London"),
Discharges = structure(c(1544434200, 1544434200, 1544434200,
1539327600, 1539327600, 1539414000, 1539414000), class = c("POSIXct",
"POSIXt"), tzone = "Europe/London"), Admission_type = c("Elective",
"Emergency", "Emergency", "Elective", "Emergency", "Elective",
"Emergency"), LOS = c(61.0416666666667, 61.0416666666667,
61.0416666666667, 2, 2, 3, 3), Ward_code = c("DSN", "DSN",
"DNA", "NAS", "BAS", "BAS", "BAS"), Same_day_discharge = c(FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), Spell_type = c("Elective",
"Emergency", "Emergency", "Elective", "Emergency", "Elective",
"Emergency"), Adm_period = c(TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE), losbinned = structure(c(NA, NA, NA, 7L, 7L,
10L, 10L), .Label = c("0hrs", "8hrs", "16hrs", "1 d", "1 d 8hrs",
"1 d 16hrs", "2 d", "2 d 8hrs", "2 d 16hrs", "3 d", "3 d 8hrs",
"3 d 16hrs", "4 d", "4 d 8hrs", "4 d 16hrs", "5 d", "5 d 8hrs",
"5 d 16hrs", "6 d", "6 d 8hrs", "6 d 16hrs", "7 - 14 d",
"14 - 21 d", "21 - 28 d", "> 28 d"), class = "factor")), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
When I pass right = TRUE, 61.04 is binned into "> 28 d", which is the desired answer, yet I do not get the right bins for 2.0 and 3.0: 2.0 is binned in "1 d 16hrs" and 3.0 in "2 d 16hrs". These should be binned in "2 d" and "3 d" respectively.
structure(list(IDcol = 101:107, Admissions = structure(c(1539160200,
1539160200, 1539160200, 1539154800, 1539154800, 1539154800, 1539154800
), class = c("POSIXct", "POSIXt"), tzone = "Europe/London"),
Discharges = structure(c(1544434200, 1544434200, 1544434200,
1539327600, 1539327600, 1539414000, 1539414000), class = c("POSIXct",
"POSIXt"), tzone = "Europe/London"), Admission_type = c("Elective",
"Emergency", "Emergency", "Elective", "Emergency", "Elective",
"Emergency"), LOS = c(61.0416666666667, 61.0416666666667,
61.0416666666667, 2, 2, 3, 3), Ward_code = c("DSN", "DSN",
"DNA", "NAS", "BAS", "BAS", "BAS"), Same_day_discharge = c(FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), Spell_type = c("Elective",
"Emergency", "Emergency", "Elective", "Emergency", "Elective",
"Emergency"), Adm_period = c(TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE), losbinned = structure(c(25L, 25L, 25L, 6L, 6L,
9L, 9L), .Label = c("0hrs", "8hrs", "16hrs", "1 d", "1 d 8hrs",
"1 d 16hrs", "2 d", "2 d 8hrs", "2 d 16hrs", "3 d", "3 d 8hrs",
"3 d 16hrs", "4 d", "4 d 8hrs", "4 d 16hrs", "5 d", "5 d 8hrs",
"5 d 16hrs", "6 d", "6 d 8hrs", "6 d 16hrs", "7 - 14 d",
"14 - 21 d", "21 - 28 d", "> 28 d"), class = "factor")), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
The expected result is the right bin for each length of stay: 61.04 -> "> 28 d", 2 -> "2 d", 3 -> "3 d".
If this can be done with tidyverse while respecting the bins I have assigned, that would be amazing. If not, I am okay with a corrected version of the code I came up with.
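By default (right = TRUE), the cut function's bins are exclusive on the left and inclusive on the right. With right = FALSE they are inclusive on the left and exclusive on the right.
From the cut function's help: The factor level labels are constructed as "(b1, b2]", "(b2, b3]" etc. for right = TRUE and as "[b1, b2)" etc. for right = FALSE.
In order to include the lowest value (or the highest value in this case, since right = FALSE), the include.lowest = TRUE option is required. This makes the outermost bin inclusive at both ends, for example "[b1, b2]", so a value equal to max(los) is no longer left out.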
Try:
labels<-c("0hrs", "8hrs", "16hrs", "1 d",
"1 d 8hrs", "1 d 16hrs", "2 d",
"2 d 8hrs", "2 d 16hrs", "3 d",
"3 d 8hrs", "3 d 16hrs", "4 d",
"4 d 8hrs", "4 d 16hrs", "5 d",
"5 d 8hrs", "5 d 16hrs", "6 d",
"6 d 8hrs","6 d 16hrs", "7 - 14 d",
"14 - 21 d", "21 - 28 d", "> 28 d")
dt_los$losbinned <- cut(los, breaks=breaks, labels=labels, right=FALSE, include.lowest = TRUE)
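The effect of include.lowest can be checked on a small example. The sketch below uses a deliberately simplified set of breaks and labels (whole days only, hypothetical labels) rather than the 25 bins from the question, so that the three cases are easy to follow. An infinite upper break is shown as a second option that avoids depending on max(los).

```r
los <- c(61.0416666666667, 2, 3)

# Simplified, sorted breaks in days; the last break equals the largest value
breaks <- c(0, 1, 2, 3, 7, 28, max(los))
labels <- c("< 1 d", "1 d", "2 d", "3 - 7 d", "7 - 28 d", "> 28 d")

# right = FALSE gives [a, b) bins, so a value equal to the last break is NA
open_top <- cut(los, breaks = breaks, labels = labels, right = FALSE)

# include.lowest = TRUE closes the last bin: [28, max(los)]
closed_top <- cut(los, breaks = breaks, labels = labels,
                  right = FALSE, include.lowest = TRUE)

# Alternative: replace the last break with Inf so no value can sit on the edge
inf_top <- cut(los, breaks = c(breaks[-length(breaks)], Inf),
               labels = labels, right = FALSE)
```

Here open_top is NA for 61.04, while closed_top and inf_top both assign it to "> 28 d". The values 2 and 3 land in "2 d" and "3 - 7 d" in all three versions.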

Linking records in R with multiple, sometimes missing blocking fields

I have data on multiple individuals over time that I am trying to link together in R. The problem is that the names of individuals are often very similar yet spelled slightly differently, and ID variables are often missing (party and district are never missing but are not enough to uniquely describe individuals). Here's an example of 3 distinct individuals, all with Ennis as their last name:
df = structure(list(chamber = c("H", "H", "H", "H", "H", "H", "H",
"H", "H", "S", "S", "S", "S", "S"), year = c("2005", "2007",
"1997", "1999", "2001", "1995", "1997", "1999", "2001", "2007",
"2011", "2012", "2013", "2013"), name = c("Ennis", "Ennis", "Ennis, B",
"Ennis, B", "Ennis, B", "Ennis, D", "Ennis, D", "Ennis, D", "Ennis, D",
"Ennis", "Ennis, Bruce", "Ennis, Bruce", "Ennis, Bruce", "Ennis, J"
), party = c("100", "100", "100", "100", "100", "200", "200",
"200", "200", "100", "100", "100", "100", "100"), district = c("028",
"028", "028", "028", "028", "006", "006", "006", "006", "014",
"014", "014", "014", "007"), os.id = c(NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, "DEL000009", "DEL000009", "DEL000009", NA), msp.id = c("1298",
"1298", NA, NA, "1298", NA, NA, NA, "13676", NA, "1298", "1298",
"1298", "567")), .Names = c("chamber", "year", "name", "party",
"district", "os.id", "msp.id"), row.names = c(NA, 14L), class = "data.frame")
Which describes this example data frame:
chamber year name party district os.id msp.id
1 H 2005 Ennis 100 028 <NA> 1298
2 H 2007 Ennis 100 028 <NA> 1298
3 H 1997 Ennis, B 100 028 <NA> <NA>
4 H 1999 Ennis, B 100 028 <NA> <NA>
5 H 2001 Ennis, B 100 028 <NA> 1298
6 H 1995 Ennis, D 200 006 <NA> <NA>
7 H 1997 Ennis, D 200 006 <NA> <NA>
8 H 1999 Ennis, D 200 006 <NA> <NA>
9 H 2001 Ennis, D 200 006 <NA> 13676
10 S 2007 Ennis 100 014 <NA> <NA>
11 S 2011 Ennis, Bruce 100 014 DEL000009 1298
12 S 2012 Ennis, Bruce 100 014 DEL000009 1298
13 S 2013 Ennis, Bruce 100 014 DEL000009 1298
14 S 2013 Ennis, J 100 007 <NA> 567
So observations 1-5 and 10-13 describe "Ennis, B", observations 6-9 describe "Ennis, D" and observation 14 describes "Ennis, J". I deduced this by piecing multiple fields together logically in my mind. Naturally I want to automate this, as I have hundreds of thousands of such observations. Eventually I want to assign a unique, non-missing ID to all of these 14 observations. In this case, that would be 3 unique IDs.
I have done some research and I think the RecordLinkage package in R could do what I need, along with some fuzzy string matching for the names. The problem is that blocking variables need to match exactly, and the ones I would want to employ (party, os.id, and msp.id) are not always present. For example, observation 3 and observation 11 both describe the same person, but observation 3 has NA for the ID blocking variables.
Here's the code I'm tinkering with that fails due to the missing blocking variables:
rpairsfuzzy <- compare.dedup(df,blockfld = c(4,6,7), strcmp = TRUE)
dim(rpairsfuzzy$pairs)
rpairsfuzzy$pairs
It only identifies observations 11-13 as matches since these all contain non-missing data. But clearly that's wrong.
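The manual reasoning described above (records belong together if they share any non-missing ID, or if they agree on the always-present fields) can be sketched as a connected-components problem. The code below is only an illustration in base R and does not use RecordLinkage or fuzzy matching. The helper link_records is hypothetical, and treating "same chamber, party, district and surname" as the same person is an assumption that happens to hold for this example and may over-link on real data.

```r
df <- data.frame(
  chamber  = c(rep("H", 9), rep("S", 5)),
  name     = c("Ennis", "Ennis", "Ennis, B", "Ennis, B", "Ennis, B",
               "Ennis, D", "Ennis, D", "Ennis, D", "Ennis, D", "Ennis",
               "Ennis, Bruce", "Ennis, Bruce", "Ennis, Bruce", "Ennis, J"),
  party    = c(rep("100", 5), rep("200", 4), rep("100", 5)),
  district = c(rep("028", 5), rep("006", 4), rep("014", 4), "007"),
  os.id    = c(rep(NA, 10), rep("DEL000009", 3), NA),
  msp.id   = c("1298", "1298", NA, NA, "1298", NA, NA, NA, "13676",
               NA, "1298", "1298", "1298", "567"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: records sharing a non-missing value in ANY key are
# merged into one group (simple union-find over row indices).
link_records <- function(keys, n) {
  parent <- seq_len(n)
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }
  for (key in keys) {
    # split() drops rows whose key is NA, so missing IDs never link records
    for (members in split(seq_len(n), key)) {
      root <- find_root(members[1])
      for (m in members[-1]) parent[find_root(m)] <- root
    }
  }
  roots <- vapply(seq_len(n), find_root, integer(1))
  match(roots, unique(roots))  # relabel groups as 1, 2, 3, ...
}

# Assumption: the text before the first comma is the surname
surname <- sub(",.*$", "", df$name)
keys <- list(df$msp.id, df$os.id,
             paste(df$chamber, df$party, df$district, surname))
df$person <- link_records(keys, nrow(df))
```

On the 14 example rows this yields three groups: rows 1-5 and 10-13, rows 6-9, and row 14. A fuzzy name comparison could replace the exact surname key, but that is beyond this sketch.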

strsplit by variable separator

I have some strings of data separated by " " that need to be split into columns. Is there an easy way to split the data at every nth separator? For example, the first value in x tells you that the first 4 values in y correspond to the first trial. The second value in x tells you that the next 3 values in y correspond to the second trial, and so on.
x <- c("4 3 3", "3 3 3 2 3")
y <- c("110 88 77 66 55 44 33 22 33 44 11 22 11", "44 55 66 33 22 11 22 33 44 55 66 77 88 66 77 88")
The goal is something like this:
structure(list(session = 1:2, trial.1 = structure(1:2, .Label = c("110 88 77",
"44 55 66"), class = "factor"), trial.2 = structure(c(2L, 1L), .Label = c("33 22 11",
"66 55 44"), class = "factor"), trial.3 = structure(1:2, .Label = c("22 33 44",
"23 33 44"), class = "factor"), trial.4 = structure(c(NA, 1L), .Label = "55 66", class = "factor"),
trial.5 = structure(c(NA, 1L), .Label = "77 88 66", class = "factor")), .Names = c("session",
"trial.1", "trial.2", "trial.3", "trial.4", "trial.5"), class = "data.frame", row.names = c(NA,
-2L))
Ideally, any extra values from y need to be dropped from the resulting data frame, and the uneven row lengths should be filled with NA's.
This may be useful:
dumx<-strsplit(x,' ')
dumy<-strsplit(y,' ')
dumx<-lapply(dumx,function(x)(cumsum(as.numeric(x))))
dumx<-lapply(dumx,function(x){mapply(seq,c(1,x+1)[-(length(x)+1)],x,SIMPLIFY=FALSE)})
ans<-mapply(function(x,y){lapply(x,function(w,z){z[w]},z=y)},dumx,dumy)
I will leave you to convert the resulting list to a data frame :)
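One way to finish that last step, sketched in base R: split each session's values by the counts in x, drop any extra values, then pad the shorter sessions with NA. split_trials is a hypothetical helper name, and the sketch follows the description in the question (the first count is 4, so the first trial of session 1 has four values).

```r
x <- c("4 3 3", "3 3 3 2 3")
y <- c("110 88 77 66 55 44 33 22 33 44 11 22 11",
       "44 55 66 33 22 11 22 33 44 55 66 77 88 66 77 88")

# Hypothetical helper: cut one session's values into trials of the given sizes
split_trials <- function(counts, values) {
  n <- as.integer(strsplit(counts, " ", fixed = TRUE)[[1]])
  v <- strsplit(values, " ", fixed = TRUE)[[1]]
  v <- v[seq_len(min(length(v), sum(n)))]            # drop extra values
  grp <- rep(seq_along(n), times = n)[seq_along(v)]  # trial index per value
  vapply(split(v, factor(grp, levels = seq_along(n))),
         paste, character(1), collapse = " ")
}

trials <- mapply(split_trials, x, y, SIMPLIFY = FALSE, USE.NAMES = FALSE)

# Indexing past the end of a vector yields NA, which pads the short sessions
width <- max(lengths(trials))
mat <- do.call(rbind, lapply(trials, function(tr) unname(tr)[seq_len(width)]))

out <- data.frame(session = seq_along(trials), mat, stringsAsFactors = FALSE)
names(out)[-1] <- paste0("trial.", seq_len(width))
```

The result has one row per session and columns trial.1 to trial.5, with trial.4 and trial.5 set to NA for the first session.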
