R pivot_wider x multiple columns with names as ColName + row number

Thank you for looking, I struggled to think of a good name for the question.
I have looked at many different versions of my question here and elsewhere but haven't found the exact answer I need.
I have data consisting of ID numbers, codes, and dates. There is often more than one row per ID, and there is no set number of times an ID can appear. The data are currently in long format, and I need to pivot them to wide format. The result should be one row per ID, with the Code and Date fields widened. I want to keep the same names but append a number N to each field name, where the highest N is the maximum number of times a single ID repeats. If an ID repeats 9 times, the pivot will produce 9 new columns for Code and 9 for Date, as Code1, Code2...Code9.
I can do it, see below, but the names are a total mess. I am unable to get the names to be neat the way I describe.
EG Data:
eg_data <- data.frame(
ID = c('1','1','1','2', '2', '3', '4', '4') ,
FName = c('John','John','John','Gina', 'Gina', 'Tom', 'Bobby', 'Bobby') ,
LName = c('Smith','Smith','Smith','Jones', 'Jones', 'Anderson', 'Kennedy', 'Kennedy') ,
Code = c('ECV','EDC','EER','ECV', 'ECV', 'EER', 'EDC', 'EER') ,
Date = c('2022-04-23','2021-12-21','2020-01-25','2022-05-18', '2020-05-26', '2021-01-21', '2020-05-14', '2020-06-25'))
What I've done - yes, it works, but it requires renaming every_single_column that gets widened. There must be a better way:
eg_data %>%
group_by(ID) %>%
mutate(
Count = row_number(),
CodeName = paste0('ContactCode', Count),
DateName = paste0('ContactDate', Count)) %>%
ungroup() %>%
select(-Count) %>%
pivot_wider(
names_from = c(CodeName, DateName), values_from = c(Code, Date)) -> LongNamesYuck
Desired Output (using two of the IDs above for brevity)
desired_format <- data.frame(
ID = c('1','2'),
FName = c('John', 'Gina'),
LName = c('Smith', 'Jones'),
Code1 = c('ECV', 'ECV'),
Code2 = c('EDC', 'ECV'),
Code3 = c('EER', NA),
Date1 = c('2022-04-23', '2022-05-18'),
Date2 = c('2021-12-21', '2020-05-26'),
Date3 = c('2020-01-25', NA))
The example below gets close, but it uses the values from one of the fields as new field names, and I don't want that.
How to `pivot_wider` multiple columns without combining the names?
Any help is appreciated, thank you.

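We can make it more compact by using rowid() from data.table in place of the two-line group_by/mutate, and then calling pivot_wider with names_from set to the sequence column ('rn') and values_from set to a vector of column names (quoted or unquoted). Setting names_sep = "" joins the value name and the sequence number without a separator, giving Code1 rather than Code_1.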
library(dplyr)
library(tidyr)
library(data.table)
out <- eg_data %>%
mutate(rn = rowid(ID, FName, LName)) %>%
pivot_wider(names_from = rn, values_from = c(Code, Date), names_sep = "")
-checking the output with desired_format
> desired_format
ID FName LName Code1 Code2 Code3 Date1 Date2 Date3
1 1 John Smith ECV EDC EER 2022-04-23 2021-12-21 2020-01-25
2 2 Gina Jones ECV ECV <NA> 2022-05-18 2020-05-26 <NA>
> out[1:2,]
# A tibble: 2 × 9
ID FName LName Code1 Code2 Code3 Date1 Date2 Date3
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 John Smith ECV EDC EER 2022-04-23 2021-12-21 2020-01-25
2 2 Gina Jones ECV ECV <NA> 2022-05-18 2020-05-26 <NA>
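If you would rather not load data.table just for rowid(), roughly the same result can be sketched with dplyr alone. This is a minimal sketch, assuming the per-ID counter only needs to follow the existing row order; out_dplyr is a name introduced here, and rn is carried over from the answer above.

```r
library(dplyr)
library(tidyr)

eg_data <- data.frame(
  ID    = c('1', '1', '1', '2', '2', '3', '4', '4'),
  FName = c('John', 'John', 'John', 'Gina', 'Gina', 'Tom', 'Bobby', 'Bobby'),
  LName = c('Smith', 'Smith', 'Smith', 'Jones', 'Jones', 'Anderson', 'Kennedy', 'Kennedy'),
  Code  = c('ECV', 'EDC', 'EER', 'ECV', 'ECV', 'EER', 'EDC', 'EER'),
  Date  = c('2022-04-23', '2021-12-21', '2020-01-25', '2022-05-18',
            '2020-05-26', '2021-01-21', '2020-05-14', '2020-06-25'),
  stringsAsFactors = FALSE
)

out_dplyr <- eg_data %>%
  group_by(ID) %>%
  mutate(rn = row_number()) %>%   # 1, 2, 3, ... within each ID
  ungroup() %>%
  pivot_wider(names_from = rn,
              values_from = c(Code, Date),
              names_sep = "")     # "" instead of the default "_"

names(out_dplyr)
```

IDs with fewer rows than the longest ID get NA in the unused Code/Date columns, as in the desired output.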

Related

In R, replace values across time series based on another column

Actually this is linked to my previous question: Replace values across time series columns based on another column
However I need to modify values across a time series data set but based on a condition from the same row but across another set of time series columns. The dataset looks like this:
#there are many more years (yrs) in the data set
product<-c("01","02")
yr1<-c("1","7")
yr2<-c("3","4")
#these follow the number of years
type.yr1<-c("mixed","number")
type.yr2<-c("number","mixed")
#this is a reference column to pull values from in case the type value is "mixed"
mixed.rate<-c("1+5GBP","7+3GBP")
df<-data.frame(product,yr1,yr2,type.yr1,type.yr2,mixed.rate)
Here the value 1 (yr1 of product 01) should be replaced by "1+5GBP" and the value 4 (yr2 of product 02) should be replaced by "7+3GBP". I am thinking of something like the below -- could anyone please help?
df %>%
mutate(across(c(starts_with('yr'),starts_with('type'), ~ifelse(type.x=="mixed", mixed.rate.x, .x)))
The final result should be:
product<-c("01","02")
yr1<-c("1+5GBP","7")
yr2<-c("3","7+3GBP")
type.yr1<-c("mixed","number")
type.yr2<-c("number","mixed")
mixed.rate<-c("1+5GBP","7+3GBP")
df<-data.frame(product,yr1,yr2,type.yr1,type.yr2,mixed.rate)
If I understand you correctly, I think you might benefit from pivoting longer, replacing the values in a single if_else, and swinging back to wide.
df %>%
pivot_longer(cols = -c(product,mixed.rate), names_to=c(".value", "year"), names_pattern = "(.*)(\\d)") %>%
mutate(yr=if_else(type.yr=="mixed",mixed.rate,yr)) %>%
pivot_wider(names_from=year, values_from=c(yr,type.yr),names_sep = "")
Output:
product mixed.rate yr1 yr2 type.yr1 type.yr2
<chr> <chr> <chr> <chr> <chr> <chr>
1 01 1+5GBP 1+5GBP 3 mixed number
2 02 7+3GBP 7 7+3GBP number mixed
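Since the question mentions many more years, one thing to watch (an observation about the regex, not something the answer states) is that "(.*)(\\d)" captures only the final digit, so a column such as yr10 would be split into yr1 and 0. Below is a sketch with a hypothetical yr10 column and a pattern that captures all trailing digits; df10 and res10 are names introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: same layout as the question, but with a two-digit year
df10 <- data.frame(
  product    = c("01", "02"),
  yr1        = c("1", "7"),
  yr10       = c("3", "4"),
  type.yr1   = c("mixed", "number"),
  type.yr10  = c("number", "mixed"),
  mixed.rate = c("1+5GBP", "7+3GBP"),
  stringsAsFactors = FALSE
)

res10 <- df10 %>%
  pivot_longer(cols = -c(product, mixed.rate),
               names_to = c(".value", "year"),
               names_pattern = "(\\D+)(\\d+)") %>%  # non-digits, then all digits
  mutate(yr = if_else(type.yr == "mixed", mixed.rate, yr)) %>%
  pivot_wider(names_from = year,
              values_from = c(yr, type.yr),
              names_sep = "")

res10
```

This assumes the column stems themselves contain no digits, which holds for yr and type.yr.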
You can use pivot_longer to put all the yr values in one column and the type.yr values in another. Then recode 1 to 1+5GBP and 4 to 7+3GBP wherever the type.yr column is mixed, and finally pivot_wider back:
df %>%
pivot_longer(contains('yr'), names_to = c('.value','grp'),
names_pattern = '(\\D+)(\\d+)') %>%
mutate(yr = ifelse(type.yr == 'mixed', recode(yr, '1' = '1+5GBP', '4' = '7+3GBP'), yr)) %>%
pivot_wider(id_cols = c(product, mixed.rate), names_from = grp,
values_from = c(yr, type.yr), names_sep = '')
# A tibble: 2 x 6
product mixed.rate yr1 yr2 type.yr1 type.yr2
<chr> <chr> <chr> <chr> <chr> <chr>
1 01 1+5GBP 1+5GBP 3 mixed number
2 02 7+3GBP 7 7+3GBP number mixed
If you're happy to use base R instead of dplyr then the following will produce your required output:
for (i in 1:2) {
df[,paste0('yr',i)] <- ifelse(df[,paste0('type.yr',i)]=='mixed',df[,'mixed.rate'],df[,paste0('yr',i)])
}
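The loop above hard-codes two years. Below is a sketch that finds the yr columns by name instead, assuming every yrN column has a matching type.yrN column; yr_cols and type_col are names introduced here for illustration.

```r
df <- data.frame(
  product    = c("01", "02"),
  yr1        = c("1", "7"),
  yr2        = c("3", "4"),
  type.yr1   = c("mixed", "number"),
  type.yr2   = c("number", "mixed"),
  mixed.rate = c("1+5GBP", "7+3GBP"),
  stringsAsFactors = FALSE
)

# Every column named "yr" followed only by digits
yr_cols <- grep("^yr[0-9]+$", names(df), value = TRUE)

for (col in yr_cols) {
  type_col <- paste0("type.", col)  # e.g. "type.yr1" for "yr1"
  df[[col]] <- ifelse(df[[type_col]] == "mixed", df[["mixed.rate"]], df[[col]])
}

df
```

Only base R functions are used here, so no packages need to be loaded.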

R - Collapsing observations and creating new columns based on multiple columns [duplicate]

In my dataframe there are multiple rows for a single observation (each referenced by rif). I would like to collapse the rows and create new columns for the keyword and the reg_birth columns.
df1 <- structure(list(rif = c("text10", "text10", "text10", "text11", "text11"),
date = c("20180329", "20180329", "20180329", "20180329", "20180329"),
keyword = c("Lucca", "Piacenza", "Milano", "Cascina", "Padova"),
reg_birth = c("Tuscany", "Emilia", "Lombardy", "Veneto", "Veneto")),
row.names = c(NA, 5L), class = "data.frame")
To create the new columns for just the 'keyword' column, I used this code, as shown in this answer.
df1 %>% group_by(rif,date) %>%
mutate(n = row_number()) %>%
pivot_wider(id_cols = c(rif,date), values_from = keyword, names_from = n, names_prefix = 'keyword')
However, I don't know how to do the same for an additional column (reg_birth in this case).
This is my expected output
rif keyword reg.birth keyword2 reg_birth2 keyword3 reg_birth3
1 text10 Lucca Tuscany Piacenza Emilia Milano Lombardy
2 text11 Cascina Veneto Padova Veneto <NA> <NA>
Thank you.
You may try with pivot_wider from tidyr.
library(dplyr)
library(tidyr)
df1 %>%
mutate(id = data.table::rowid(rif, date)) %>%
pivot_wider(names_from = id, values_from = c(keyword, reg_birth))
# rif date keyword_1 keyword_2 keyword_3 reg_birth_1 reg_birth_2 reg_birth_3
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 text10 20180329 Lucca Piacenza Milano Tuscany Emilia Lombardy
#2 text11 20180329 Cascina Padova NA Veneto Veneto NA
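The output above lists all keyword columns before the reg_birth columns and uses an underscore, whereas the expected output in the question interleaves them without one. Below is a sketch of one way to get closer, assuming tidyr 1.2.0 or later (where, to the best of my knowledge, names_vary was introduced) and using dplyr's row_number() in place of data.table::rowid(); wide is a name introduced here.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  rif       = c("text10", "text10", "text10", "text11", "text11"),
  date      = c("20180329", "20180329", "20180329", "20180329", "20180329"),
  keyword   = c("Lucca", "Piacenza", "Milano", "Cascina", "Padova"),
  reg_birth = c("Tuscany", "Emilia", "Lombardy", "Veneto", "Veneto"),
  stringsAsFactors = FALSE
)

wide <- df1 %>%
  group_by(rif, date) %>%
  mutate(id = row_number()) %>%      # counter within each rif/date pair
  ungroup() %>%
  pivot_wider(names_from = id,
              values_from = c(keyword, reg_birth),
              names_sep = "",          # keyword1 instead of keyword_1
              names_vary = "slowest")  # keyword1, reg_birth1, keyword2, ...

names(wide)
```

The first pair comes out as keyword1/reg_birth1 rather than keyword/reg_birth, so a rename would still be needed to match the expected output exactly.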

R: Pivot numeric data from columns to rows based on string in variable name

I have a data set that I want to pivot to long format depending on whether the variable name contains any of the strings in list_a <- c("a", "b", "c") and list_b <- c("usd", "eur", "gbp"). The data set contains values in only one row. I want the values in list_b to become column names and the values in list_a to become row names in the resulting data set. Please see the reproducible example data set below.
I currently solve this issue by applying the following R code (once for each value in list_b) resulting in three data frames called "df_usd", "df_eur" and "df_gbp" which I then merge based on the column "name". This is however a bit cumbersome and I would very much appreciate if you could help me with finding a more elegant solution since the variables in list_b change from month to month (list_a stays the same each month) and updating the existing code manually is both time consuming and opens up for manual error.
# Current solution for df_usd:
df_usd <- df %>%
select(date, contains("usd")) %>%
pivot_longer(cols = contains(c("a_", "b_", "c_")),
names_to = "name", values_to = "usd") %>% mutate(name = case_when(
str_detect(name, "a_") ~ "a",
str_detect(name, "b_") ~ "b",
str_detect(name, "c_") ~ "c")) %>%
select(-date)
A screenshot of the starting point in Excel
A screenshot of the result I want to achieve in Excel
# Example data to copy and paste into R for easy reproduction of problem:
df <- data.frame (date = c("2020-12-31"),
a_usd = c(1000),
b_usd = c(2000),
c_usd = c(3000),
a_eur = c(100),
b_eur =c(200),
c_eur = c(300),
a_gbp = c(10),
b_gbp = c(20),
c_gbp = c(30))
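One option is to specify names_sep together with names_to in pivot_longer. The ".value" entry in names_to means the part of each name after the "_" becomes a column name, while the part before it goes into 'grp'.
library(dplyr)
library(tidyr)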
df %>%
pivot_longer(cols = -date, names_to = c("grp", ".value"), names_sep = "_")
-output
# A tibble: 3 x 5
# date grp usd eur gbp
# <chr> <chr> <dbl> <dbl> <dbl>
#1 2020-12-31 a 1000 100 10
#2 2020-12-31 b 2000 200 20
#3 2020-12-31 c 3000 300 30
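Because the currency part of each name ends up in .value, nothing in the call names a specific currency. Below is a sketch with a hypothetical set of sek columns swapped in for eur and gbp, to illustrate that the same call should keep working when list_b changes from month to month; df_new and long_new are names introduced here.

```r
library(dplyr)
library(tidyr)

# Hypothetical month in which the currencies are usd and sek
df_new <- data.frame(
  date  = "2020-12-31",
  a_usd = 1000, b_usd = 2000, c_usd = 3000,
  a_sek = 1,    b_sek = 2,    c_sek = 3,
  stringsAsFactors = FALSE
)

long_new <- df_new %>%
  pivot_longer(cols = -date,
               names_to = c("grp", ".value"),  # before "_" -> grp, after "_" -> column name
               names_sep = "_")

long_new
```

This assumes each variable name contains exactly one underscore, as in the example data.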
A base R option using reshape
reshape(
setNames(df, gsub("(\\w+)_(\\w+)", "\\2.\\1", names(df))),
direction = "long",
varying = -1
)
gives
date time usd eur gbp id
1.a 2020-12-31 a 1000 100 10 1
1.b 2020-12-31 b 2000 200 20 1
1.c 2020-12-31 c 3000 300 30 1

String with values mapped from other data frame in R

I would like to make a string basing on ids from other columns where the real value sits in a dictionary.
Ideally, this would look like:
library(tidyverse)
region_dict <- tibble(
id = c("reg_id1", "reg_id2", "reg_id3"),
name = c("reg_1", "reg_2", "reg_3")
)
color_dict <- tibble(
id = c("col_id1", "col_id2", "col_id3"),
name = c("col_1", "col_2", "col_3")
)
tibble(
region = c("reg_id1", "reg_id2", "reg_id3"),
color = c("col_id1", "col_id2", "col_id3"),
my_string = str_c(
"xxx_",
region_name,
"_",
color_name
))
#> # A tibble: 3 x 3
#> region color my_string
#> <chr> <chr> <chr>
#> 1 reg_id1 col_id1 xxx_reg_1_col_1
#> 2 reg_id2 col_id2 xxx_reg_2_col_2
#> 3 reg_id3 col_id3 xxx_reg_3_col_3
Created on 2021-03-01 by the reprex package (v0.3.0)
I know of dplyr's recode() function but I can't think of a way to use it the way I want.
I also thought about first using left_join() and then concatenating the string from the new columns. This is what would work but doesn't seem pretty to me as I would get columns that I'd need to remove later. In the real dataset I have 5 variables.
I'll be glad to read your ideas.
This could also be solved with a fuzzyjoin, but since the ids in the two dictionaries share a common suffix, it makes sense to remove the prefix substring from the 'id' column of each dictionary, do a left_join on what remains, and then create 'my_string' by pasting the columns together
library(stringr)
library(dplyr)
region_dict %>%
mutate(id1 = str_remove(id, '.*_')) %>%
left_join(color_dict %>%
mutate(id1 = str_remove(id, '.*_')), by = 'id1') %>%
transmute(region = id.x, color = id.y,
my_string = str_c('xxx_', name.x, '_', name.y))
-output
# A tibble: 3 x 3
# region color my_string
# <chr> <chr> <chr>
#1 reg_id1 col_id1 xxx_reg_1_col_1
#2 reg_id2 col_id2 xxx_reg_2_col_2
#3 reg_id3 col_id3 xxx_reg_3_col_3
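The join above relies on the region and color ids sharing the same suffix. If that is not guaranteed, a named-vector lookup is one alternative that avoids temporary columns altogether. This is only a sketch: region_lookup, color_lookup and dat are names introduced here, and dat stands in for the real data set.

```r
library(dplyr)
library(tibble)

region_dict <- tibble(
  id   = c("reg_id1", "reg_id2", "reg_id3"),
  name = c("reg_1", "reg_2", "reg_3")
)
color_dict <- tibble(
  id   = c("col_id1", "col_id2", "col_id3"),
  name = c("col_1", "col_2", "col_3")
)

# Stand-in for the real data: one id column per dictionary
dat <- tibble(
  region = c("reg_id1", "reg_id2", "reg_id3"),
  color  = c("col_id1", "col_id2", "col_id3")
)

# Named vectors: names are the ids, values are the display names
region_lookup <- setNames(region_dict$name, region_dict$id)
color_lookup  <- setNames(color_dict$name, color_dict$id)

result <- dat %>%
  mutate(my_string = paste0("xxx_",
                            unname(region_lookup[region]), "_",
                            unname(color_lookup[color])))

result
```

With five variables, each one would need its own lookup vector, but no joined columns have to be dropped afterwards. Note that paste0 turns an unmatched id into the text "NA" rather than a missing value.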

R dplyr rowwise mutate

Good morning all, this is my first time posting on stack overflow. Thank you for any help!
I have 2 dataframes that I am using to analyze stock data. One data frame has dates among other information; we can call it df1:
df1 <- tibble(Key = c('a','b','c'), i =11:13, date= ymd(20110101:20110103))
The second dataframe also has dates and other important information.
df2 <-tibble(Answer = c('a','d','e','b','f','c'), j =14:19, date= ymd(20150304:20150309))
Here is what I want to do:
For each row in df1, I need to:
-Find the row in df2 where df2$Answer is the same as df1$Key and whose date is the closest to that row's date in df1.
-Then extract the value of another column from that row in df2, and put it in a new column in df1.
The code I tried:
df1 %>%
group_by(Key, i) %>%
mutate(
`New Column` = df2$j[
which.min(subset(df2$date, df2$Answer== Key) - date)])
This has the result:
Key i date `New Column`
1 a 11 2011-01-01 14
2 b 12 2011-01-02 14
3 c 13 2011-01-03 14
This is correct for the first row, a. In df2, the closest date is 2015-03-04, for which the value of j is in fact 14.
However, for the second row, Key=b, I want df2 to subset to only look at dates for rows where df2$Answer = b. Therefore, the date should be 2015-03-07, for which j =17.
Thank you for your help!
Jesse
This should work:
library(dplyr)
df1 %>%
left_join(df2, by = c("Key" = "Answer")) %>%
mutate(date_diff = abs(difftime(date.x, date.y, units = "secs"))) %>%
group_by(Key) %>%
arrange(date_diff) %>%
slice(1) %>%
ungroup()
We are first joining the two data frames with left_join. Yes, I'm aware there are possibly multiple dates for each Key, bear with me.
Next, we calculate (with mutate) the absolute value (abs) of the difference between the two dates date.x and date.y.
Now that we have this, we will group the data by Key using group_by. This will make sure that each distinct Key will be treated separately in subsequent calculations.
Since we've calculated the date_diff, we can now re-order (arrange) the data for each Key, with the smallest date_diff as first for each Key.
Finally, we are only interested in that first, smallest date_diff for each Key, so we can discard the rest using slice(1).
This pipeline gives us the following:
Key i date.x j date.y date_diff
<chr> <int> <date> <int> <date> <time>
1 a 11 2011-01-01 14 2015-03-04 131587200
2 b 12 2011-01-02 17 2015-03-07 131760000
3 c 13 2011-01-03 19 2015-03-09 131846400
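The arrange/slice(1) steps can also be written with slice_min(), which keeps the row with the smallest date_diff per group. Below is a sketch under a few assumptions: dplyr 1.0.0 or later, dates built with base as.Date() instead of lubridate, and df2's date column renamed to the hypothetical date_df2 so that df1's date keeps its name; nearest is also a name introduced here.

```r
library(dplyr)
library(tibble)

df1 <- tibble(Key = c("a", "b", "c"), i = 11:13,
              date = as.Date("2011-01-01") + 0:2)
df2 <- tibble(Answer = c("a", "d", "e", "b", "f", "c"), j = 14:19,
              date = as.Date("2015-03-04") + 0:5)

nearest <- df1 %>%
  left_join(rename(df2, date_df2 = date), by = c("Key" = "Answer")) %>%
  mutate(date_diff = abs(as.numeric(date_df2 - date))) %>%  # difference in days
  group_by(Key) %>%
  slice_min(date_diff, n = 1, with_ties = FALSE) %>%        # closest date per Key
  ungroup() %>%
  select(Key, i, date, `New Column` = j)

nearest
```

Setting with_ties = FALSE keeps a single row per Key even if two dates in df2 are equally close.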
