R - Collapsing observations and creating new columns based on multiple columns

In my dataframe there are multiple rows for a single observation (each identified by rif). I would like to collapse the rows and create new columns for the keyword and reg_birth columns.
df1 <- structure(list(rif = c("text10", "text10", "text10", "text11", "text11"),
date = c("20180329", "20180329", "20180329", "20180329", "20180329"),
keyword = c("Lucca", "Piacenza", "Milano", "Cascina", "Padova"),
reg_birth = c("Tuscany", "Emilia", "Lombardy", "Veneto", "Veneto")),
row.names = c(NA, 5L), class = "data.frame")
To create the new columns for the 'keyword' column alone, I used this code, as shown in this answer.
df1 %>% group_by(rif,date) %>%
mutate(n = row_number()) %>%
pivot_wider(id_cols = c(rif,date), values_from = keyword, names_from = n, names_prefix = 'keyword')
However, I don't know how to do the same for an additional column (reg_birth in this case).
This is my expected output
rif keyword reg_birth keyword2 reg_birth2 keyword3 reg_birth3
1 text10 Lucca Tuscany Piacenza Emilia Milano Lombardy
2 text11 Cascina Veneto Padova Veneto <NA> <NA>
Thank you.

You may try with pivot_wider from tidyr.
library(dplyr)
library(tidyr)
df1 %>%
mutate(id = data.table::rowid(rif, date)) %>%
pivot_wider(names_from = id, values_from = c(keyword, reg_birth))
# rif date keyword_1 keyword_2 keyword_3 reg_birth_1 reg_birth_2 reg_birth_3
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 text10 20180329 Lucca Piacenza Milano Tuscany Emilia Lombardy
#2 text11 20180329 Cascina Padova NA Veneto Veneto NA
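The answer above groups all keyword columns before all reg_birth columns, whereas the expected output in the question interleaves them. A minimal sketch of one way to get the interleaved order, assuming tidyr 1.2.0 or later (where the names_vary argument of pivot_wider is available) and using dplyr's row_number() instead of data.table::rowid; the name wide is only illustrative:

```r
library(dplyr)
library(tidyr)

# Same data as in the question
df1 <- data.frame(
  rif = c("text10", "text10", "text10", "text11", "text11"),
  date = rep("20180329", 5),
  keyword = c("Lucca", "Piacenza", "Milano", "Cascina", "Padova"),
  reg_birth = c("Tuscany", "Emilia", "Lombardy", "Veneto", "Veneto"),
  stringsAsFactors = FALSE
)

wide <- df1 %>%
  group_by(rif, date) %>%
  mutate(id = row_number()) %>%   # running index within each rif/date group
  ungroup() %>%
  pivot_wider(
    names_from = id,
    values_from = c(keyword, reg_birth),
    names_sep = "",               # keyword1 rather than keyword_1
    names_vary = "slowest"        # keyword1, reg_birth1, keyword2, reg_birth2, ...
  )

names(wide)
```

With names_vary left at its default ("fastest"), the columns come out grouped by value column, as in the answer's output.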

Related

R pivot_wider x multiple columns with names as ColName + row number

Thank you for looking, I struggled to think of a good name for the question.
I have looked at many different versions of my question here and elsewhere but haven't found the exact answer I need.
I have data consisting of ID numbers, codes, and dates. There is often more than one row per ID, and there is no set number of times an ID can appear. The data are currently in long format and I need to pivot them to wide format: one row per ID, with the Code and Date fields widened. I want to keep the same names but append a number N to each field name, where the highest N is the maximum number of times a single ID repeats. If an ID repeats 9 times, the pivot will produce 9 new columns for Code and 9 for Date, as Code1, Code2...Code9.
I can do it, see below, but the names are a total mess. I am unable to get the names to be neat the way I describe.
EG Data:
eg_data <- data.frame(
ID = c('1','1','1','2', '2', '3', '4', '4') ,
FName = c('John','John','John','Gina', 'Gina', 'Tom', 'Bobby', 'Bobby') ,
LName = c('Smith','Smith','Smith','Jones', 'Jones', 'Anderson', 'Kennedy', 'Kennedy') ,
Code = c('ECV','EDC','EER','ECV', 'ECV', 'EER', 'EDC', 'EER') ,
Date = c('2022-04-23','2021-12-21','2020-01-25','2022-05-18', '2020-05-26', '2021-01-21', '2020-05-14', '2020-06-25'))
What I've done - yes, it works, but it requires renaming every_single_column that gets widened. There must be a better way:
eg_data %>%
group_by(ID) %>%
mutate(
Count = row_number(),
CodeName = paste0('ContactCode', Count),
DateName = paste0('ContactDate', Count)) %>%
ungroup() %>%
select(-Count) %>%
pivot_wider(
names_from = c(CodeName, DateName), values_from = c(Code, Date)) -> LongNamesYuck
Desired Output (using two of the IDs above for brevity)
desired_format <- data.frame(
ID = c('1','2'),
FName = c('John', 'Gina'),
LName = c('Smith', 'Jones'),
Code1 = c('ECV', 'ECV'),
Code2 = c('EDC', 'ECV'),
Code3 = c('EER', NA),
Date1 = c('2022-04-23', '2022-05-18'),
Date2 = c('2021-12-21', '2020-05-26'),
Date3 = c('2020-01-25', NA))
The example below gets close, but it uses the values from one of the fields as new field names, and I don't want that.
How to `pivot_wider` multiple columns without combining the names?
Any help is appreciated, thank you.
We can make this more compact by using rowid from data.table in place of the two-line group_by/mutate, then calling pivot_wider with names_from set to the sequence column ('rn') and values_from set to a vector of column names (quoted or unquoted). Setting names_sep = "" gives names such as Code1 and Date1.
library(dplyr)
library(tidyr)
library(data.table)
out <- eg_data %>%
mutate(rn = rowid(ID, FName, LName)) %>%
pivot_wider(names_from = rn, values_from = c(Code, Date), names_sep = "")
Checking the output against desired_format:
> desired_format
ID FName LName Code1 Code2 Code3 Date1 Date2 Date3
1 1 John Smith ECV EDC EER 2022-04-23 2021-12-21 2020-01-25
2 2 Gina Jones ECV ECV <NA> 2022-05-18 2020-05-26 <NA>
> out[1:2,]
# A tibble: 2 × 9
ID FName LName Code1 Code2 Code3 Date1 Date2 Date3
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 John Smith ECV EDC EER 2022-04-23 2021-12-21 2020-01-25
2 2 Gina Jones ECV ECV <NA> 2022-05-18 2020-05-26 <NA>
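If the longer prefixes from the question's own attempt (ContactCode1, ContactDate1, ...) are wanted instead, a hedged sketch using the names_glue argument of pivot_wider may help. It assumes only dplyr and tidyr (no data.table), and the name out_glue is illustrative:

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
eg_data <- data.frame(
  ID = c('1', '1', '1', '2', '2', '3', '4', '4'),
  FName = c('John', 'John', 'John', 'Gina', 'Gina', 'Tom', 'Bobby', 'Bobby'),
  LName = c('Smith', 'Smith', 'Smith', 'Jones', 'Jones', 'Anderson', 'Kennedy', 'Kennedy'),
  Code = c('ECV', 'EDC', 'EER', 'ECV', 'ECV', 'EER', 'EDC', 'EER'),
  Date = c('2022-04-23', '2021-12-21', '2020-01-25', '2022-05-18',
           '2020-05-26', '2021-01-21', '2020-05-14', '2020-06-25'),
  stringsAsFactors = FALSE
)

out_glue <- eg_data %>%
  group_by(ID) %>%
  mutate(rn = row_number()) %>%   # 1, 2, 3, ... within each ID
  ungroup() %>%
  pivot_wider(
    names_from = rn,
    values_from = c(Code, Date),
    # {.value} is the source column (Code or Date), {rn} the running index
    names_glue = "Contact{.value}{rn}"
  )

names(out_glue)
```

names_glue builds each new column name from a template, so no renaming step is needed after the pivot.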

In R, replace values across time series based on another column

Actually this is linked to my previous question: Replace values across time series columns based on another column
However I need to modify values across a time series data set but based on a condition from the same row but across another set of time series columns. The dataset looks like this:
#there are many more years (yrs) in the data set
product<-c("01","02")
yr1<-c("1","7")
yr2<-c("3","4")
#these follow the number of years
type.yr1<-c("mixed","number")
type.yr2<-c("number","mixed")
#this is a reference column to pull values from in case the type value is "mixed"
mixed.rate<-c("1+5GBP","7+3GBP")
df<-data.frame(product,yr1,yr2,type.yr1,type.yr2,mixed.rate)
Here the value 1 (in yr1) should be replaced by "1+5GBP" and the value 4 (in yr2) by "7+3GBP". I am thinking of something like the below -- could anyone please help?
df %>%
mutate(across(c(starts_with('yr'),starts_with('type'), ~ifelse(type.x=="mixed", mixed.rate.x, .x)))
The final result should be:
product<-c("01","02")
yr1<-c("1+5GBP","7")
yr2<-c("3","7+3GBP")
type.yr1<-c("mixed","number")
type.yr2<-c("number","mixed")
mixed.rate<-c("1+5GBP","7+3GBP")
df<-data.frame(product,yr1,yr2,type.yr1,type.yr2,mixed.rate)
If I understand you correctly, I think you might benefit from pivoting longer, replacing the values in a single if_else, and swinging back to wide.
df %>%
pivot_longer(cols = -c(product,mixed.rate), names_to=c(".value", "year"), names_pattern = "(\\D+)(\\d+)") %>% # \\d+ also handles multi-digit years such as yr10
mutate(yr=if_else(type.yr=="mixed",mixed.rate,yr)) %>%
pivot_wider(names_from=year, values_from=c(yr,type.yr),names_sep = "")
Output:
product mixed.rate yr1 yr2 type.yr1 type.yr2
<chr> <chr> <chr> <chr> <chr> <chr>
1 01 1+5GBP 1+5GBP 3 mixed number
2 02 7+3GBP 7 7+3GBP number mixed
You can use pivot_longer to gather all the yr values into one column and the type.yr values into another. Then recode 1 into 1+5GBP and 4 into 7+3GBP wherever the type.yr column is "mixed", and finally use pivot_wider to return to wide format.
df %>%
pivot_longer(contains('yr'), names_to = c('.value','grp'),
names_pattern = '(\\D+)(\\d+)') %>%
mutate(yr = ifelse(type.yr == 'mixed', recode(yr, '1' = '1+5GBP', '4' = '7+3GBP'), yr)) %>%
pivot_wider(id_cols = c(product, mixed.rate), names_from = grp,
values_from = c(yr, type.yr), names_sep = '')
# A tibble: 2 x 6
product mixed.rate yr1 yr2 type.yr1 type.yr2
<chr> <chr> <chr> <chr> <chr> <chr>
1 01 1+5GBP 1+5GBP 3 mixed number
2 02 7+3GBP 7 7+3GBP number mixed
If you're happy to use base R instead of dplyr then the following will produce your required output:
for (i in 1:2) {
df[,paste0('yr',i)] <- ifelse(df[,paste0('type.yr',i)]=='mixed',df[,'mixed.rate'],df[,paste0('yr',i)])
}
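Since the question mentions that there are many more years than the two shown, the loop above can be generalised so the year columns are discovered by name rather than hard-coded. A minimal base R sketch, assuming every yrN column has a matching type.yrN column; the helper names yr_cols and is_mixed are illustrative:

```r
# Same example data as in the question
df <- data.frame(
  product = c("01", "02"),
  yr1 = c("1", "7"),
  yr2 = c("3", "4"),
  type.yr1 = c("mixed", "number"),
  type.yr2 = c("number", "mixed"),
  mixed.rate = c("1+5GBP", "7+3GBP"),
  stringsAsFactors = FALSE
)

# Find every column named yr<digits>, however many years there are
yr_cols <- grep("^yr\\d+$", names(df), value = TRUE)

for (col in yr_cols) {
  # Rows where the matching type column says "mixed"
  is_mixed <- df[[paste0("type.", col)]] == "mixed"
  # Overwrite only those rows with the reference rate
  df[[col]][is_mixed] <- df$mixed.rate[is_mixed]
}

df
```

The anchored pattern "^yr\\d+$" keeps type.yr1, type.yr2 and mixed.rate out of the set of columns being overwritten.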

R - Collapsing observations and creating new columns

In my dataframe there are multiple rows for a single observation (each identified by rif). I would like to collapse the rows and create new columns for the keyword column. The outcome would include as many keyword columns as there are rows for an observation (e.g. keyword_1, keyword_2, etc.). Do you have any idea? Thanks a lot.
This is my MWE
df1 <- structure(list(rif = c("text10", "text10", "text10", "text11", "text11"),
date = c("20180329", "20180329", "20180329", "20180329", "20180329"),
keyword = c("Lucca", "Piacenza", "Milano", "Cascina", "Padova")),
row.names = c(NA, 5L), class = "data.frame")
Does this work:
library(dplyr)
library(tidyr)
df1 %>% group_by(rif,date) %>% mutate(n = row_number()) %>% pivot_wider(id_cols = c(rif,date), values_from = keyword, names_from = n, names_prefix = 'keyword')
# A tibble: 2 x 5
# Groups: rif, date [2]
rif date keyword1 keyword2 keyword3
<chr> <chr> <chr> <chr> <chr>
1 text10 20180329 Lucca Piacenza Milano
2 text11 20180329 Cascina Padova NA
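For comparison, the same reshaping can be sketched without any packages, using ave() to number the rows within each group and reshape() to widen. This is an alternative to the tidyr answer above, not a replacement for it; the names n and wide are illustrative:

```r
# Same data as in the question
df1 <- data.frame(
  rif = c("text10", "text10", "text10", "text11", "text11"),
  date = c("20180329", "20180329", "20180329", "20180329", "20180329"),
  keyword = c("Lucca", "Piacenza", "Milano", "Cascina", "Padova"),
  stringsAsFactors = FALSE
)

# Running index within each rif/date group: 1, 2, 3, 1, 2
df1$n <- ave(seq_along(df1$rif), df1$rif, df1$date, FUN = seq_along)

# Widen: one row per rif/date, one keyword_<n> column per index value
wide <- reshape(
  df1,
  idvar = c("rif", "date"),
  timevar = "n",
  direction = "wide",
  sep = "_"
)

wide
```

With sep = "_" the new columns follow the keyword_1, keyword_2 naming asked for in the question.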

Indicate which corresponding columns have a TRUE indicator

I have the following dataset:
df<-data.frame(
identifer=c(1,2,3,4),
DF=c("Tablet","Powder","Suspension","System"),
DF_source1=c("Capsule","Powder,Metered","Tablet",NA),
DF_source2=c(NA,NA,"Tablet",NA),
DF_source3=c("Tablet, Extended Release","Liquid","Tablet",NA),
Route_source1=c("Oral","INHALATION","Oral",NA),
Route_source2=c(NA,"TOPICAL","Oral",NA),
Route_source3=c("Oral","IRRIGATION","oral",NA))
I want to know which DF_source matches DF, and additionally which associated Route I should take.
I want the output to look like this:
df_out<-data.frame(
identifer=c(1,2,3,4),
DF=c("Tablet","Powder","Suspension","System"),
DF_match=c("Tablet, Extended Release","Powder,Metered;Powder",NA,NA),
Route_match=c("Oral","INHALATION;TOPICAL",NA,NA),
DF_match_count=c(1,2,0,0),
DF_route_count=c(1,2,0,0))
I tried this, but I'm not sure how to pull the values for DF_match and Route_match:
df%>%mutate_at(vars(matches("(DF_source)")),
list(string_detect = ~str_detect(tolower(DF),tolower(str_replace_all(.,"/|,(\\s)?|(?<!,)\\s","|")))))
Any help would be appreciated, thanks!
I'm not entirely sure this is what you have in mind, but hope this might help.
Your end result appears not to match your example data (e.g. TOPICAL is missing).
This might be easier in a tidier form with pivot_longer.
Edit: If columns are factors, convert to character for str_detect in filter.
library(tidyverse)
library(stringr)
df %>%
mutate_if(is.factor, as.character) %>%
pivot_longer(cols = -c(identifer, DF), names_to = c(".value", "number"), names_pattern = "(\\w+)(\\d+)") %>%
filter(str_detect(DF_source, DF)) %>%
group_by(identifer) %>%
summarise(DF_match = paste(DF_source, collapse = ';'),
Route_match = paste(Route_source, collapse = ';'),
match_count = n()) %>%
right_join(df[,c("identifer", "DF")], by = "identifer") %>%
select(c(identifer, DF, DF_match, Route_match, match_count))
Output
# A tibble: 4 x 5
identifer DF DF_match Route_match match_count
<dbl> <chr> <chr> <chr> <int>
1 1 Tablet Tablet, Extended Release Oral 1
2 2 Powder Powder,Metered;Powder INHALATION;TOPICAL 2
3 3 Suspension NA NA NA
4 4 System NA NA NA
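The desired output in the question shows a count of 0 for rows without a match, while the pipeline above leaves NA. A hedged variant that starts from the original rows with a left_join and fills the missing counts with coalesce() is sketched below; the intermediate names matches and res are illustrative, and the data are the question's as posted:

```r
library(dplyr)
library(tidyr)
library(stringr)

df <- data.frame(
  identifer = c(1, 2, 3, 4),
  DF = c("Tablet", "Powder", "Suspension", "System"),
  DF_source1 = c("Capsule", "Powder,Metered", "Tablet", NA),
  DF_source2 = c(NA, NA, "Tablet", NA),
  DF_source3 = c("Tablet, Extended Release", "Liquid", "Tablet", NA),
  Route_source1 = c("Oral", "INHALATION", "Oral", NA),
  Route_source2 = c(NA, "TOPICAL", "Oral", NA),
  Route_source3 = c("Oral", "IRRIGATION", "oral", NA),
  stringsAsFactors = FALSE
)

# One row per identifier that has at least one matching source
matches <- df %>%
  pivot_longer(cols = -c(identifer, DF),
               names_to = c(".value", "number"),
               names_pattern = "(\\w+)(\\d+)") %>%
  filter(str_detect(DF_source, DF)) %>%   # NA sources are dropped here
  group_by(identifer) %>%
  summarise(DF_match = paste(DF_source, collapse = ";"),
            Route_match = paste(Route_source, collapse = ";"),
            match_count = n())

# Keep every original row and turn missing counts into 0
res <- df %>%
  select(identifer, DF) %>%
  left_join(matches, by = "identifer") %>%
  mutate(match_count = coalesce(match_count, 0L))

res
```

Starting from df and joining the matches onto it keeps the original row order, which a right_join does not necessarily do.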

Separating Column Based on First Value of String

I have an ID variable that I am trying to separate into two separate columns based on their prefix being either a 1 or 2.
An example of my data is:
STR_ID
1434233
2343535
1243435
1434355
I have tried countless ways to try to separate these variables into columns based on their prefixes, but cannot seem to figure it out. Any ideas on how I would do this? Thank you in advance.
We create a grouping variable with substr by extracting the first character/digit of 'STR_ID', and spread it to 'wide' format
library(tidyverse)
df1 %>%
group_by(grp = paste0('grp', substr(STR_ID, 1, 1))) %>%
mutate(i = row_number()) %>%
spread(grp, STR_ID) %>%
select(-i)
# A tibble: 3 x 2
# grp1 grp2
# <int> <int>
#1 1434233 2343535
#2 1243435 NA
#3 1434355 NA
data
df1 <- structure(list(STR_ID = c(1434233L, 2343535L, 1243435L, 1434355L
)), class = "data.frame", row.names = c(NA, -4L))
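spread() still works but has been superseded in tidyr by pivot_wider(). A minimal sketch of the same idea with pivot_wider, under the same assumption that the first digit of STR_ID defines the group; the name wide is illustrative:

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(STR_ID = c(1434233L, 2343535L, 1243435L, 1434355L))

wide <- df1 %>%
  mutate(grp = paste0("grp", substr(STR_ID, 1, 1))) %>%  # grp1 or grp2
  group_by(grp) %>%
  mutate(i = row_number()) %>%   # position of each ID within its group
  ungroup() %>%
  pivot_wider(names_from = grp, values_from = STR_ID) %>%
  select(-i)                     # drop the helper index

wide
```

The helper column i gives pivot_wider a row key, so IDs from different groups line up side by side and the shorter group is padded with NA.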
