Formatting grouped data for tables in R

I'm trying to display my data in table format and I can't figure out how to rearrange my data to display it in the proper format. I'm used to wrangling data for plots, but I'm finding myself a little lost when it comes to preparing tables. This seems like something really basic, but I haven't been able to find an explanation on what I'm doing wrong here.
I have 3 columns of data, Type, Year, and n. The data formatted as it is now produces a table that looks like this:
Type Year n
Type C 1 5596
Type D 1 1119
Type E 1 116
Type A 1 402
Type F 1 1614
Type B 1 105
Type C 2 26339
Type D 2 14130
Type E 2 98
Type A 2 3176
Type F 2 3071
Type B 2 88
What I want to do is to have Type as row names, Year as column names, and n populating the table contents like this:
1 2
Type A 402 3176
Type B 105 88
Type C 5596 26339
Type D 1119 14130
Type E 116 98
Type F 1614 3071
The mistake might have been made upstream from this point. Using the full original data set I arrived at this output by doing the following:
exampletable <- df %>%
group_by(Year) %>%
count(Type) %>%
select(Type, Year, n)
Here is the dput() output
structure(list(Type = c("Type C", "Type D", "Type E", "Type A",
"Type F", "Type B", "Type C", "Type D", "Type E", "Type A", "Type F",
"Type B", "Type C", "Type D", "Type E", "Type A", "Type F", "Type B",
"Type C", "Type D", "Type E", "Type A", "Type F", "Type B", "Type C",
"Type D", "Type E"), Year = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5), n = c(5596,
1119, 116, 402, 1614, 105, 26339, 14130, 98, 3176, 3071, 88,
40958, 17578, 104, 3904, 3170, 102, 33145, 23800, 93, 1264, 7084,
1262, 34642, 24911, 504)), class = c("spec_tbl_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -27L), spec = structure(list(
cols = list(Type = structure(list(), class = c("collector_character",
"collector")), Year = structure(list(), class = c("collector_double",
"collector")), n = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))

You can reshape the data to wide format and then turn the Type column into row names.
library(dplyr) # provides the %>% pipe
tidyr::pivot_wider(df, names_from = Year, values_from = n) %>%
tibble::column_to_rownames('Type')
# 1 2 3 4 5
#Type C 5596 26339 40958 33145 34642
#Type D 1119 14130 17578 23800 24911
#Type E 116 98 104 93 504
#Type A 402 3176 3904 1264 NA
#Type F 1614 3071 3170 7084 NA
#Type B 105 88 102 1262 NA
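As a possible base R alternative (a sketch, not part of the answer above), xtabs() can build the same Type-by-Year layout without extra packages. The data frame below is a hand-copied stand-in for the first two years of the question's data, and the name `wide` is only illustrative. Note that xtabs() fills missing Type/Year combinations with 0 rather than NA and sorts the row names alphabetically.

```r
# Stand-in for the first two years of the question's data
df <- data.frame(
  Type = rep(c("Type C", "Type D", "Type E", "Type A", "Type F", "Type B"), times = 2),
  Year = rep(c(1, 2), each = 6),
  n    = c(5596, 1119, 116, 402, 1614, 105, 26339, 14130, 98, 3176, 3071, 88)
)

# Rows are Type, columns are Year, cells are the sum of n for each pair
wide <- xtabs(n ~ Type + Year, data = df)
wide
```

The result is a table object; as.data.frame.matrix(wide) should turn it into a plain data frame if a table-printing package needs one.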

You can use the tidyr package to reshape to wide format and the tibble package to convert a column to row names.
dataset <- read.csv(file_location)
dataset <- tidyr::pivot_wider(dataset, names_from = Year, values_from = n)
tibble::column_to_rownames(dataset, var = 'Type')
1 2
Type C 5596 26339
Type D 1119 14130
Type E 116 98
Type A 402 3176
Type F 1614 3071
Type B 105 88
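If the counts come from raw rows, as in the question's group_by()/count() step, the counting and the reshaping can probably be collapsed into a single call to base R's table(). The `raw` data frame here is made up for illustration, since the question does not show the original `df`.

```r
# Made-up raw data: one row per observation, before any counting
raw <- data.frame(
  Type = c("Type A", "Type A", "Type B", "Type A", "Type B"),
  Year = c(1, 1, 1, 2, 2)
)

# Cross-tabulate: Type as rows, Year as columns, cell = number of rows
counts <- table(Type = raw$Type, Year = raw$Year)

# Convert the table to a regular data frame with Type as row names
counts_df <- as.data.frame.matrix(counts)
counts_df
```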

Related

Pivot Wider causing issues when as.yearmon is used

I have the following code:
library(zoo)
library(xts)
df1<-structure(list(Date = structure(c(13523, 13532, 13539, 13551,
13565, 13567, 13579, 13588, 13600, 13607, 13616, 13628, 13637,
13656, 13658, 13670, 13686, 13691, 13698, 13705, 13721, 13735,
13768, 13770, 13783, 13789, 13797, 13811, 13819, 13824, 13838,
13846, 13852, 13860), class = "Date"), Category = c("Type 1",
"Type 2", "Type 1", "Type 1", "Type 1", "Type 2", "Type 1", "Type 3",
"Type 1", "Type 1", "Type 2", "Type 1", "Type 1", "Type 1", "Type 2",
"Type 1", "Type 3", "Type 1", "Type 1", "Type 1", "Type 1", "Type 2",
"Type 1", "Type 3", "Type 1", "Type 1", "Type 1", "Type 1", "Type 2",
"Type 1", "Type 1", "Type 1", "Type 3", "Type 2"), Value = c(2250,
1200, 625, 2250, 1000, 2750, 2250, 2750, 950, 2000, 1100, 950,
2250, 1000, 2500, 2250, 2500, 1000, 2250, 1200, 700, 2500, 2000,
2500, 900, 2250, 1200, 925, 2500, 2250, 750, 2000, 2500, 950)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -34L), groups = structure(list(
Date = structure(c(13523, 13532, 13539, 13551, 13565, 13567,
13579, 13588, 13600, 13607, 13616, 13628, 13637, 13656, 13658,
13670, 13686, 13691, 13698, 13705, 13721, 13735, 13768, 13770,
13783, 13789, 13797, 13811, 13819, 13824, 13838, 13846, 13852,
13860), class = "Date"), .rows = structure(list(1L, 2L, 3L,
4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L,
16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L,
27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -34L), .drop = TRUE))
I've created a rolling_sum by month for this particular dataset using:
df_month <- df1 %>%
group_by(Category, Month = format(Date, "%Y-%m-%d")) %>%
summarize(Rolling_Sum = sum(Value))
df_month$Month <- as.yearmon(df_month$Month)
In preparation for a conversion to xts format, I'd like to pivot wider and replace all NULL/NA values with 0. However, pivot_wider seems to break the dataset, making the NA replacement and the xts conversion impossible:
df_turned <- df_month %>% group_by(Category) %>% pivot_wider(names_from = Category, values_from = Rolling_Sum, id_cols = Month)
If that had worked, I would have done:
df_turned <- df_turned %>% replace(.=="NULL", 0)
Then:
df_turned <- xts(df_turned, order.by = df_turned$Month)
Any advice most appreciated.
If we don't want duplicates, then use values_fn
library(tidyr)
library(dplyr)
df_turned <- df_month %>%
ungroup %>%
pivot_wider(names_from = Category, values_from = Rolling_Sum,
values_fn = sum, values_fill = 0)
-output
df_turned
# A tibble: 12 × 4
Month `Type 1` `Type 2` `Type 3`
<yearmon> <dbl> <dbl> <dbl>
1 Jan 2007 2875 1200 0
2 Feb 2007 3250 2750 0
3 Mar 2007 3200 0 2750
4 Apr 2007 2950 1100 0
5 May 2007 3250 2500 0
6 Jun 2007 3250 0 2500
7 Jul 2007 4150 0 0
8 Sep 2007 2900 0 2500
9 Oct 2007 4375 0 0
10 Nov 2007 5000 2500 0
11 Aug 2007 0 2500 0
12 Dec 2007 0 950 2500
Now, we can convert to xts
xts(df_turned[-1], order.by = df_turned$Month)
Type 1 Type 2 Type 3
Jan 2007 2875 1200 0
Feb 2007 3250 2750 0
Mar 2007 3200 0 2750
Apr 2007 2950 1100 0
May 2007 3250 2500 0
Jun 2007 3250 0 2500
Jul 2007 4150 0 0
Aug 2007 0 2500 0
Sep 2007 2900 0 2500
Oct 2007 4375 0 0
Nov 2007 5000 2500 0
Dec 2007 0 950 2500
As indicated in my comment, your problem is that you create duplicates because as.yearmon is called after the grouping by "Month". You are de facto grouping by "Date". We could do:
library(dplyr)
library(tidyr)
df1 |>
group_by(Category,
Month = as.yearmon(Date)) |>
pivot_wider(names_from = Category,
values_from = Value,
values_fn = sum,
values_fill = 0
) |>
select(-Date) # Or mutate "Date" above instead of creating "Month".
Then call xts.
Month = as.yearmon(Date) shouldn't cause a problem if Date is a date type. However, if it is causing trouble as you indicate in your comment, use as.yearmon(format(Date, "%Y-%m-%d")) instead.
Output:
# A tibble: 12 × 4
Month `Type 1` `Type 2` `Type 3`
<yearmon> <dbl> <dbl> <dbl>
1 Jan 2007 2875 1200 0
2 Feb 2007 3250 2750 0
3 Mar 2007 3200 0 2750
4 Apr 2007 2950 1100 0
5 May 2007 3250 2500 0
6 Jun 2007 3250 0 2500
7 Jul 2007 4150 0 0
8 Sep 2007 2900 0 2500
9 Oct 2007 4375 0 0
10 Nov 2007 5000 2500 0
11 Aug 2007 0 2500 0
12 Dec 2007 0 950 2500
Update: After @akrun updated their answer with a similar solution, mine seems more verbose. The reason is that my approach works directly on the df1 object and solves the problem there.
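To see why grouping on the formatted date produces duplicates, here is a small base R sketch. The three dates are invented, and format(x, "%Y-%m") is used only as a dependency-free stand-in for zoo's as.yearmon(); both collapse a date to its month.

```r
# Three invented dates: two in January 2007, one in February 2007
dates <- as.Date(c("2007-01-10", "2007-01-19", "2007-02-07"))

# Formatting with the day included keeps every date distinct,
# so grouping on this key is really grouping by day
key_by_day <- format(dates, "%Y-%m-%d")

# Dropping the day collapses dates to months, which is what a
# monthly sum needs as its grouping key
key_by_month <- format(dates, "%Y-%m")

length(unique(key_by_day))
length(unique(key_by_month))
```

With the day-level key there are three groups, so converting to a month afterwards leaves two rows for January. With the month-level key there are only two groups to begin with.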
Use read.zoo like this:
library(zoo)
df_month |>
read.zoo(index = "Month", split = "Category", aggregate = sum) |>
na.fill(0)
giving this zoo object -- as.xts can be used to convert that to xts if needed.
Type 1 Type 2 Type 3
Jan 2007 2875 1200 0
Feb 2007 3250 2750 0
Mar 2007 3200 0 2750
Apr 2007 2950 1100 0
May 2007 3250 2500 0
Jun 2007 3250 0 2500
Jul 2007 4150 0 0
Aug 2007 0 2500 0
Sep 2007 2900 0 2500
Oct 2007 4375 0 0
Nov 2007 5000 2500 0
Dec 2007 0 950 2500
or directly from df1 modified from the comment below
df1 |>
read.zoo(FUN = as.yearmon, split = "Category", aggregate = sum) |>
na.fill(0)
Note
df_month from question in immediately reproducible form
df_month <-
structure(list(Category = c("Type 1", "Type 1", "Type 1", "Type 1",
"Type 1", "Type 1", "Type 1", "Type 1", "Type 1", "Type 1", "Type 1",
"Type 1", "Type 1", "Type 1", "Type 1", "Type 1", "Type 1", "Type 1",
"Type 1", "Type 1", "Type 1", "Type 1", "Type 1", "Type 2", "Type 2",
"Type 2", "Type 2", "Type 2", "Type 2", "Type 2", "Type 3", "Type 3",
"Type 3", "Type 3"), Month = structure(c(2007, 2007, 2007.08333333333,
2007.08333333333, 2007.16666666667, 2007.16666666667, 2007.25,
2007.25, 2007.33333333333, 2007.33333333333, 2007.41666666667,
2007.41666666667, 2007.5, 2007.5, 2007.5, 2007.66666666667, 2007.66666666667,
2007.75, 2007.75, 2007.75, 2007.83333333333, 2007.83333333333,
2007.83333333333, 2007, 2007.08333333333, 2007.25, 2007.33333333333,
2007.58333333333, 2007.83333333333, 2007.91666666667, 2007.16666666667,
2007.41666666667, 2007.66666666667, 2007.91666666667), class = "yearmon"),
Rolling_Sum = c(2250, 625, 2250, 1000, 2250, 950, 2000, 950,
2250, 1000, 2250, 1000, 2250, 1200, 700, 2000, 900, 2250,
1200, 925, 2250, 750, 2000, 1200, 2750, 1100, 2500, 2500,
2500, 950, 2750, 2500, 2500, 2500)), row.names = c(NA, -34L
), groups = structure(list(Category = c("Type 1", "Type 2", "Type 3"
), .rows = structure(list(1:23, 24:30, 31:34), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))

subsetting a nested list based on a condition in R

my nested list looks like this:
myList <- list(structure(list(id = 1:3, value = c(22, 33, 44),
code = c("943", "943", "3a0"),
product = c("Product 1", "Product 1", "Product 1")),
row.names = c(NA,-3L),
class = c("data.table", "data.frame")),
structure(list(id = 1:3, value = c(22, 33, 44),
code = c("943", "94f", "3a0"),
product = c("Product 2", "Product 2", "Product 2")),
row.names = c(NA,-3L),
class = c("data.table", "data.frame")),
structure(list(id = 1:3, value = c(22, 33, 44),
code = c("977", "943", "3a0"),
product = c("Product 3", "Product 3", "Product 3")),
row.names = c(NA,-3L),
class = c("data.table", "data.frame")))
I want to remove all list objects that have more than one row with the same code. For example, the first object [[1]] has two entries with the code 943. I want to remove that entire object and keep only those that do not have any duplicates.
The expected outcome would therefore be:
myList <- list(
structure(list(id = 1:3, value = c(22, 33, 44),
code = c("943", "94f", "3a0"),
product = c("Product 2", "Product 2", "Product 2")),
row.names = c(NA,-3L),
class = c("data.table", "data.frame")),
structure(list(id = 1:3, value = c(22, 33, 44),
code = c("977", "943", "3a0"),
product = c("Product 3", "Product 3", "Product 3")),
row.names = c(NA,-3L),
class = c("data.table", "data.frame")))
I was thinking of using lapply, but I couldn't get it to work:
any(duplicated(myList[[1]]$code))
Any ideas or suggestions? This seems like a relatively simple problem, but I can't figure it out.
Your code any(duplicated(myList[[1]]$code)) can be used in Filter
Filter(function(x) !any(duplicated(x$code)), myList)
#[[1]]
# id value code product
#1: 1 22 943 Product 2
#2: 2 33 94f Product 2
#3: 3 44 3a0 Product 2
#[[2]]
# id value code product
#1: 1 22 977 Product 3
#2: 2 33 943 Product 3
#3: 3 44 3a0 Product 3
Or with purrr :
purrr::keep(myList, ~!any(duplicated(.x$code)))
purrr::discard(myList, ~any(duplicated(.x$code)))
Does this work:
myList[sapply(lapply(myList, function(x) +duplicated(x$code)), function(x) sum(x) == 0)]
[[1]]
id value code product
1: 1 22 943 Product 2
2: 2 33 94f Product 2
3: 3 44 3a0 Product 2
[[2]]
id value code product
1: 1 22 977 Product 3
2: 2 33 943 Product 3
3: 3 44 3a0 Product 3
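Another base R variant, offered as a sketch: build a logical index with vapply() and anyDuplicated(), then subset the list with it. The list below uses plain data frames instead of data.tables so that it runs without extra packages, and the names `has_dupes` and `kept` are illustrative.

```r
# Simplified stand-in for the question's list (plain data frames)
myList <- list(
  data.frame(id = 1:3, code = c("943", "943", "3a0"), product = "Product 1"),
  data.frame(id = 1:3, code = c("943", "94f", "3a0"), product = "Product 2"),
  data.frame(id = 1:3, code = c("977", "943", "3a0"), product = "Product 3")
)

# anyDuplicated() returns 0 when every code is unique, otherwise the
# index of the first duplicate; vapply() guarantees a logical vector
has_dupes <- vapply(myList, function(x) anyDuplicated(x$code) > 0L, logical(1))

# Keep only the list elements without duplicated codes
kept <- myList[!has_dupes]
```

Keeping the logical vector around can be handy if you also want to inspect which elements were dropped (myList[has_dupes]).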

how to change one col value using the information from other cols in the same df

I have a data.frame that looks like this:
I want to generate a new df as a codebook, where the numbers in the Label column are replaced using the information from ID and Subject.
What should I do?
The codebook that I want to end up with looks something like this:
Sample data can be built with this code:
df<-structure(list(Var = c("Subject1", "Subject2", "Subject4", "Subject5",
"Subject6", "Score1", "Score2", "Score3", "Score4", "Score5",
"Score6", "TestDate1", "TestDate2", "TestDate3", "TestDate4",
"TestDate5", "TestDate6"), Label = c("Subject 1", "Subject 2",
"Subject 4", "Subject 5", "Subject 6", "Score for Subject 1",
"Score for Subject 2", "Score for Subject 3", "Score for Subject 4",
"Score for Subject 5", "Score for Subject 6", "Date for test Subject 1",
"Date for test Subject 2", "Date for test Subject 3", "Date for test Subject 4",
"Date for test Subject 5", "Date for test Subject 6"), ID = c(1,
2, 3, 4, 5, 6, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Subject = c("Math",
"ELA", "PE", "Art", "Physic", "Chemistry", NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA)), row.names = c(NA, -17L), class = c("tbl_df",
"tbl", "data.frame"))
We can use str_replace_all with a named vector
library(dplyr)
library(stringr)
df1 <- df %>%
transmute(Var, Label = str_replace_all(Label,
setNames(na.omit(Subject), na.omit(ID))))
-output
df1
# A tibble: 17 x 2
# Var Label
# <chr> <chr>
# 1 Subject1 Subject Math
# 2 Subject2 Subject ELA
# 3 Subject4 Subject Art
# 4 Subject5 Subject Physic
# 5 Subject6 Subject Chemistry
# 6 Score1 Score for Subject Math
# 7 Score2 Score for Subject ELA
# 8 Score3 Score for Subject PE
# 9 Score4 Score for Subject Art
#10 Score5 Score for Subject Physic
#11 Score6 Score for Subject Chemistry
#12 TestDate1 Date for test Subject Math
#13 TestDate2 Date for test Subject ELA
#14 TestDate3 Date for test Subject PE
#15 TestDate4 Date for test Subject Art
#16 TestDate5 Date for test Subject Physic
#17 TestDate6 Date for test Subject Chemistry
or using gsubfn
library(gsubfn)
df$Label <- with(df, gsubfn("(\\d+)",
setNames(as.list(na.omit(Subject)), na.omit(ID)), Label))
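For intuition, the lookup-and-replace idea can also be sketched in base R with a loop over a named vector. This is an illustration with made-up labels, not a replacement for the answers above. The word boundaries (\\b) are an added assumption: they keep the ID "1" from matching inside a larger number such as "10", which would matter once there are ten or more subjects.

```r
# Named lookup: names are the IDs found in the labels, values the subjects
lookup <- c("1" = "Math", "2" = "ELA")

# Made-up labels; the last one has an ID that is not in the lookup
labels <- c("Subject 1", "Score for Subject 2", "Date for test Subject 10")

# Replace each ID in turn, matching it only as a whole number
for (id in names(lookup)) {
  labels <- gsub(paste0("\\b", id, "\\b"), lookup[[id]], labels, perl = TRUE)
}
labels
```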

create dataframe from list of lists of data.frames

I have a list of lists of data.frames, which I would like to convert to a data.frame. The structure is as follows:
l_of_lists <- list(
year1 = list(
one = data.frame(date = c("Jan-10", "Jan-22"), type = c("type 1", "type 2")),
two = data.frame(date = c("Feb-1", "Feb-28"), type = c("type 2", "type 3")),
three = data.frame(date = c("Mar-10", "Mar-15"), type = c("type 1", "type 4"))
),
year2 = list( # dates is used here on purpose, as the names don't perfectly match
one = data.frame(dates = c("Jan-22"), type = c("type 2"), another_col = c("entry 2")),
two = data.frame(date = c("Feb-10", "Feb-18"), type = c("type 2", "type 3"), another_col = c("entry 2", "entry 3")),
three = data.frame(date = c("Mar-10", "Mar-15"), type = c("type 1", "type 4"), another_col = c("entry 4", "entry 5"))
),
year3 = list( # this deliberately only contains two data frames
one = data.frame(date = c("Jan-10", "Jan-12"), type = c("type 1", "type 2")),
two = data.frame(date = c("Feb-8", "Jan-28"), type = c("type 2", "type 3"))
))
The data frame has two particularities I tried to mimic above:
the column names differ by 1-2 characters (e.g. date vs. dates)
some columns are only present in some data frames (e.g. another_col)
I now would like to convert this to a data frame (I tried different calls to rbind and also do.call, as described e.g. here unsuccessfully) and would like to
- match on column names tolerantly (if the column names differ by only 1-2 characters, I want them to be matched) and
- fill non-existent columns with NA in other columns.
I want a data frame similar to the following
year level date type another_col
1 one "Jan-10" "type 1" NA
1 one "Jan-22" "type 2" NA
1 two "Feb-1" "type 2" NA
1 two "Feb-28" "type 3" NA
1 three "Mar-10" "type 1" NA
1 three "Mar-15" "type 4" NA
2 one "Jan-22" "type 2" "entry 2"
2 two "Feb-1" "type 2" "entry 2"
2 two "Feb-28" "type 3" "entry 3"
2 three "Mar-10" "type 1" "entry 4"
2 three "Mar-15" "type 4" "entry 5"
3 one "Jan-10" "type 1" NA
3 one "Jan-12" "type 2" NA
3 two "Feb-8" "type 2" NA
3 two "Feb-28" "type 3" NA
Can someone point out if rbind is the correct path here - and what I am missing?
You could do something like the following using purrr and dplyr:
l_of_lists <- list(
year1 = list(
one = data.frame(date = c("Jan-10", "Jan-22"), type = c("type 1", "type 2")),
two = data.frame(date = c("Feb-1", "Feb-28"), type = c("type 2", "type 3")),
three = data.frame(date = c("Mar-10", "Mar-15"), type = c("type 1", "type 4"))
),
year2 = list( # dates is used here on purpose, as the names don't perfectly match
one = data.frame(dates = c("Jan-22"), type = c("type 2"), another_col = c("entry 2")),
two = data.frame(date = c("Feb-10", "Feb-18"), type = c("type 2", "type 3"), another_col = c("entry 2", "entry 3")),
three = data.frame(date = c("Mar-10", "Mar-15"), type = c("type 1", "type 4"), another_col = c("entry 4", "entry 5"))
),
year3 = list( # this deliberately only contains two data frames
one = data.frame(date = c("Jan-10", "Jan-12"), type = c("type 1", "type 2")),
two = data.frame(date = c("Feb-8", "Jan-28"), type = c("type 2", "type 3"))
))
# add libraries
library(dplyr)
library(purrr)
# Map bind_rows to each list within the list
l_of_lists %>%
map_dfr(~bind_rows(.x, .id = "level"), .id = "year")
This will yield:
year level date type dates another_col
1 year1 one Jan-10 type 1 <NA> <NA>
2 year1 one Jan-22 type 2 <NA> <NA>
3 year1 two Feb-1 type 2 <NA> <NA>
4 year1 two Feb-28 type 3 <NA> <NA>
5 year1 three Mar-10 type 1 <NA> <NA>
6 year1 three Mar-15 type 4 <NA> <NA>
7 year2 one <NA> type 2 Jan-22 entry 2
8 year2 two Feb-10 type 2 <NA> entry 2
9 year2 two Feb-18 type 3 <NA> entry 3
10 year2 three Mar-10 type 1 <NA> entry 4
11 year2 three Mar-15 type 4 <NA> entry 5
12 year3 one Jan-10 type 1 <NA> <NA>
13 year3 one Jan-12 type 2 <NA> <NA>
14 year3 two Feb-8 type 2 <NA> <NA>
15 year3 two Jan-28 type 3 <NA> <NA>
Then of course you can do some regex parsing to keep only the numeric year:
l_of_lists %>%
map_dfr(~bind_rows(.x, .id = "level"), .id = "year") %>%
mutate(year = substring(year, regexpr("\\d", year)))
If you know that date and dates are the same, you can always use mutate to fill date with whichever of the two values is not missing, e.g. mutate(date = ifelse(!is.na(date), date, dates)).
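The merge of the two near-identical columns can be sketched in base R as follows. The small `combined` data frame is invented to mimic the shape of the bound result, where each row has a value in either date or dates but not both.

```r
# Invented rows mimicking the bound result: date and dates are
# mutually exclusive, so exactly one of them is NA in each row
combined <- data.frame(
  date  = c("Jan-10", NA, "Feb-10"),
  dates = c(NA, "Jan-22", NA),
  type  = c("type 1", "type 2", "type 2")
)

# Take date where it is present, otherwise fall back to dates
combined$date <- ifelse(!is.na(combined$date), combined$date, combined$dates)

# The helper column is no longer needed
combined$dates <- NULL
combined
```

If dplyr is already loaded, coalesce(date, dates) expresses the same fallback more compactly.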

Unlist data frame column and pasting them together

I have a dataframe as defined below:
df <- structure(list(ID = 1:19, MEDICATION = c("0", "NOVOMIX 26 BF, 20 D",
"NOVOMIX 14 D", "NOVOMIX 34 BF 22 D", "MIXTARD 52 BF 20 D", "MIXTARD 40 BF 24 D",
"MIXTARD 10 BF 8 D", "MIXTARD 42 BF 24 D", "MIXTARD 20 BF 18 D",
"MIXTARD 82 BF 46 D", "MIXTARD 14 BF 10 D", "NOVOMIX 15 BF 15 D",
"MIXTARD", NA, "MIXTARD 10 BF 4 D", "NOVOMIX", "MIXTARD --> NOVOMIX",
"NOT GIVEN ANY DIABETES MEDICATION INPATIENT PATIENT NORMALLY ON METFORMIN",
"GIVEN ASPART")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -19L), .Names = c("ID", "MEDICATION"))
I would like to extract all the medications (i.e. NOVOMIX, MIXTARD, METFORMIN, ASPART) from the MEDICATION variable in the dataframe and paste them together. I wrote my code as follows:
library(tidyverse)
library(rebus)
df %>%
mutate(MEDICATION2 = str_extract_all(MEDICATION, pattern =
or1(c("NOVOMIX", "MIXTARD", "METFORMIN", "ASPART")))) %>%
unnest(MEDICATION2) %>%
group_by(ID) %>%
mutate(MEDICATION2 = str_c(unlist(MEDICATION2), collapse = " - ")) %>%
slice(1)
My expected output is:
df_out <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19), MEDICATION = c("0", "NOVOMIX 26 BF, 20 D",
"NOVOMIX 14 D", "NOVOMIX 34 BF 22 D", "MIXTARD 52 BF 20 D", "MIXTARD 40 BF 24 D",
"MIXTARD 10 BF 8 D", "MIXTARD 42 BF 24 D", "MIXTARD 20 BF 18 D",
"MIXTARD 82 BF 46 D", "MIXTARD 14 BF 10 D", "NOVOMIX 15 BF 15 D",
"MIXTARD", NA, "MIXTARD 10 BF 4 D", "NOVOMIX", "MIXTARD --> NOVOMIX",
"NOT GIVEN ANY DIABETES MEDICATION INPATIENT PATIENT NORMALLY ON METFORMIN",
"GIVEN ASPART"), MEDICATION2 = c(NA, "NOVOMIX", "NOVOMIX", "NOVOMIX",
"MIXTARD", "MIXTARD", "MIXTARD", "MIXTARD", "MIXTARD", "MIXTARD",
"MIXTARD", "NOVOMIX", "MIXTARD", NA, "MIXTARD", "NOVOMIX", "MIXTARD - NOVOMIX",
"METFORMIN", "ASPART")), .Names = c("ID", "MEDICATION", "MEDICATION2"
), row.names = c(NA, -19L), class = "data.frame")
The problem is that the code removed the row with MEDICATION == 0, and I think my code is too long for a simple extraction of strings. I would like to ask for help if you know how this code can be shortened (if possible).
We can use stri_extract_all_regex from the stringi package to extract all the words which match the pattern.
library(stringi)
med_pattern <- c("NOVOMIX|MIXTARD|METFORMIN|ASPART")
df$MEDICATION2 <- stri_extract_all_regex(df$MEDICATION, pattern = med_pattern)
As mentioned by @mt1022, the new column is a list. We may paste them together with
df$MEDICATION2<-paste(stri_extract_all_regex(df$MEDICATION,pattern = med_pattern))
However, this gives some unwanted characters for list elements with more than one match. The following should give you the expected output.
chars <- stri_extract_all_regex(df$MEDICATION, pattern = med_pattern)
df$MEDICATION2 <- sapply(chars, paste, collapse = "-")
df$MEDICATION2
#[1] "NA" "NOVOMIX" "NOVOMIX" "NOVOMIX"
#[5] "MIXTARD" "MIXTARD" "MIXTARD" "MIXTARD"
#[9] "MIXTARD" "MIXTARD" "MIXTARD" "NOVOMIX"
#[13] "MIXTARD" "NA" "MIXTARD" "NOVOMIX"
#[17] "MIXTARD-NOVOMIX" "METFORMIN" "ASPART"
You can also do this in single line :
df$MEDICATION2 <- sapply(stri_extract_all_regex(df$MEDICATION,
pattern = med_pattern), paste, collapse = "-")
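If real NA values are wanted for rows without a match (as in the question's expected output) rather than the string "NA", one option is a base R sketch along these lines. The vector `meds` is a shortened stand-in for df$MEDICATION, and `hits` and `med2` are illustrative names.

```r
# Shortened stand-in for df$MEDICATION
meds <- c("0", "NOVOMIX 26 BF, 20 D", NA, "MIXTARD --> NOVOMIX")
med_pattern <- "NOVOMIX|MIXTARD|METFORMIN|ASPART"

# Treat missing input as an empty string so it simply yields no matches
safe_meds <- ifelse(is.na(meds), "", meds)

# One character vector of matches per input element (possibly empty)
hits <- regmatches(safe_meds, gregexpr(med_pattern, safe_meds))

# Collapse multiple matches; return a true NA when nothing matched
med2 <- vapply(
  hits,
  function(h) if (length(h) == 0) NA_character_ else paste(h, collapse = " - "),
  character(1)
)
med2
```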
