Long to Wide with Non-Unique Key Combinations in R

I am trying to convert a dataset from long to wide format. Need to do this to feed into another program for analysis purposes. My input data is below:
sdata <- data.frame(c(1,1,1,1,1,1,1,1,1,1,1,1,1),c(1,1,1,1,1,1,1,1,1,2,2,2,2),c("X1","A","B","C","D","X2","A","B","C","X1","A","B","C"),c(81,31,40,5,5,100,8,90,2,50,20,24,6))
col_headings <- c("Orig","Dest","Desc","Estimate")
names(sdata) <- col_headings
Input Data
Depending on the unique combination of Orig-Dest-X1, Orig-Dest-X2 category above, the subcategories vary from only A,B,C to A,B,C,D to A,B, etc. I am trying to get the desired output (code to recreate in R below) along with image of desired output.
sdata_spread <- data.frame(c(1,1),c(1,2),c(81,50),c(31,20),c(40,24),c(5,6),c(5,NA),c(100,NA),c(8,NA),c(90,NA),c(2,NA))
col_headings <- c("Orig","Dest","X1", "X1_A", "X1_B", "X1_C", "X1_D","X2", "X2_A", "X2_B", "X2_C")
names(sdata_spread) <- col_headings
Desired Output
I tried the following:
sdata_spread <- sdata %>% spread(Desc,Estimate)
The error I got was:
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 6 rows
I also tried the accepted answer given here: Long to wide with no unique key and here: Long to wide format with several duplicates. Circumvent with unique combo of columns but it did not get me the desired output.
Any insights would be much appreciated.
Thanks,
Krishnan

One option is to create a grouping variable from each occurrence of 'X' as the first character of 'Desc'. Within each group, paste the first element of 'Desc' onto every later element (the condition in case_when), then reshape to wide format with pivot_wider. As of tidyr 1.0.0, spread/gather are superseded by pivot_wider/pivot_longer.
library(dplyr)
library(tidyr)
library(stringr)
sdata %>%
  group_by(grp = cumsum(str_detect(Desc, '^X'))) %>%
  mutate(Desc = case_when(row_number() > 1 ~ str_c(first(Desc), Desc, sep = "_"),
                          TRUE ~ as.character(Desc))) %>%
  ungroup %>%
  select(-grp) %>%
  pivot_wider(names_from = Desc, values_from = Estimate)
# A tibble: 2 x 11
# Orig Dest X1 X1_A X1_B X1_C X1_D X2 X2_A X2_B X2_C
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 81 31 40 5 5 100 8 90 2
#2 1 2 50 20 24 6 NA NA NA NA NA
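To see why spread() complained and what the grouping step does, here is a minimal base R sketch on the question's data. The names key, n_shared, grp, header and new_desc are illustrative only and are not part of the answer above.

```r
# Question's data, rebuilt with explicit column names
sdata <- data.frame(
  Orig = rep(1, 13),
  Dest = c(rep(1, 9), rep(2, 4)),
  Desc = c("X1", "A", "B", "C", "D", "X2", "A", "B", "C", "X1", "A", "B", "C"),
  Estimate = c(81, 31, 40, 5, 5, 100, 8, 90, 2, 50, 20, 24, 6),
  stringsAsFactors = FALSE
)

# Rows whose Orig/Dest/Desc combination occurs more than once
key <- paste(sdata$Orig, sdata$Dest, sdata$Desc)
n_shared <- sum(duplicated(key) | duplicated(key, fromLast = TRUE))
n_shared

# Every "X..." row opens a new block; cumsum() numbers the blocks
grp <- cumsum(grepl("^X", sdata$Desc))

# First Desc of each block, repeated along the block
header <- ave(sdata$Desc, grp, FUN = function(x) rep(x[1], length(x)))

# Prefix the sub-categories with their block header; header rows stay as they are
new_desc <- ifelse(sdata$Desc == header, sdata$Desc,
                   paste(header, sdata$Desc, sep = "_"))
new_desc

# After relabelling, every Orig/Dest/Desc key is unique
anyDuplicated(paste(sdata$Orig, sdata$Dest, new_desc)) == 0
```

n_shared counts the rows that share a key with another row, which is the situation the spread() error message describes. Once the sub-categories carry their block header, the keys no longer collide.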

Related

Combining multiple rows for one ID into one row with multiple columns based on 2 different variables in R

I am working with a dataframe in R that looks like this:
id <- c(1,1,1,2,2,3,3,3,3)
dx_code <- c("MI","HF","PNA","PNA","Cellulitis","MI","Flu","Sepsis","HF")
dx_date <- c("7/11/22","7/11/22","8/1/22","8/4/22","8/7/22","8/4/22","7/11/22","7/11/22","9/10/22")
df <- data.frame(id, dx_code, dx_date)
df
I want to be able to group it so that each patient ID has each date they were seen and each diagnosis they received on each specific date. So it would look something like:
id2 <- c(1,2,3)
dx_date1 <- c("7/11/22","8/4/22","8/4/22")
dx_date1code1 <- c("MI","PNA","MI")
dx_date1code2 <- c("HF",NA,NA)
dx_date2 <- c("8/1/22","8/7/22","7/11/22")
dx_date2code1 <- c("PNA","Cellulitis","Flu")
dx_date2code2 <- c(NA,NA,"Sepsis")
dx_date3 <- c(NA,NA,"9/10/22")
dx_date3code1 <- c(NA,NA,"HF")
df2 <- data.frame(id2, dx_date1, dx_date1code1,dx_date1code2,dx_date2,dx_date2code1,dx_date2code2,dx_date3,dx_date3code1)
df2
I am not sure how to reformat it in this way - is there a function in R, or should I try to use for loops? I would appreciate any help - thanks so much!
I believe you can use pivot_wider for this. The output is not the same as in the original post, but it is similar to what you provided in your comment.
You can enumerate dates and codes after grouping by id using row_number().
After using pivot_wider, you can select column names based on the numeric value contained, which will reorder so that dates and codes columns are next to each other.
library(tidyverse)
df %>%
group_by(id) %>%
mutate(code_num = row_number()) %>%
pivot_wider(id_cols = id,
names_from = code_num,
values_from = c(dx_date, dx_code)) %>%
select(id, names(.)[-1][order(readr::parse_number(names(.)[-1]))])
Output
id dx_date_1 dx_code_1 dx_date_2 dx_code_2 dx_date_3 dx_code_3 dx_date_4 dx_code_4
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 7/11/22 MI 7/11/22 HF 8/1/22 PNA NA NA
2 2 8/4/22 PNA 8/7/22 Cellulitis NA NA NA NA
3 3 8/4/22 MI 7/11/22 Flu 7/11/22 Sepsis 9/10/22 HF
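The two helper steps (numbering rows within each id, then ordering the columns by their trailing number) can be sketched in base R. This is only an illustration: the vector nms is typed in by hand to mimic the column names pivot_wider produces, and sub() stands in for readr::parse_number.

```r
id <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)

# Running counter within each id (what row_number() returns after group_by(id))
code_num <- ave(seq_along(id), id, FUN = seq_along)
code_num

# Column names in the order pivot_wider emits them: all dates, then all codes
nms <- c(paste0("dx_date_", 1:4), paste0("dx_code_", 1:4))

# Extract the trailing number and order by it; ties keep their original order,
# so each date column ends up directly before its code column
num <- as.numeric(sub(".*_", "", nms))
nms[order(num)]
```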

In R, replace values across time series based on another column

Actually this is linked to my previous question: Replace values across time series columns based on another column
However I need to modify values across a time series data set but based on a condition from the same row but across another set of time series columns. The dataset looks like this:
#there are many more years (yrs) in the data set
product<-c("01","02")
yr1<-c("1","7")
yr2<-c("3","4")
#these follow the number of years
type.yr1<-c("mixed","number")
type.yr2<-c("number","mixed")
#this is a reference column to pull values from in case the type value is "mixed"
mixed.rate<-c("1+5GBP","7+3GBP")
df<-data.frame(product,yr1,yr2,type.yr1,type.yr2,mixed.rate)
Where the value 1 should be replaced by "1+5GBP" and 4 should be "7+3GBP". I am thinking of something like the below -- could anyone please help?
df %>%
mutate(across(c(starts_with('yr'),starts_with('type'), ~ifelse(type.x=="mixed", mixed.rate.x, .x)))
The final result should be:
product<-c("01","02")
yr1<-c("1+5GBP","7")
yr2<-c("3","7+3GBP")
type.yr1<-c("mixed","number")
type.yr2<-c("number","mixed")
mixed.rate<-c("1+5 GBP","7+3GBP")
df<-data.frame(product,yr1,yr2,type.yr1,type.yr2,mixed.rate)
If I understand you correctly, I think you might benefit from pivoting longer, replacing the values in a single if_else, and swinging back to wide.
df %>%
pivot_longer(cols = -c(product,mixed.rate), names_to=c(".value", "year"), names_pattern = "(.*)(\\d)") %>%
mutate(yr=if_else(type.yr=="mixed",mixed.rate,yr)) %>%
pivot_wider(names_from=year, values_from=c(yr,type.yr),names_sep = "")
Output:
product mixed.rate yr1 yr2 type.yr1 type.yr2
<chr> <chr> <chr> <chr> <chr> <chr>
1 01 1+5 GBP 1+5 GBP 3 mixed number
2 02 7+3GBP 7 7+3GBP number mixed
You can use pivot_longer to put all the yr columns in one column and the type.yr columns in another. Then recode 1 into 1+5GBP and 4 into 7+3GBP wherever the type.yr column is mixed, and finish with pivot_wider.
df %>%
  pivot_longer(contains('yr'), names_to = c('.value', 'grp'),
               names_pattern = '(\\D+)(\\d+)') %>%
  mutate(yr = ifelse(type.yr == 'mixed', recode(yr, '1' = '1+5GBP', '4' = '7+3GBP'), yr)) %>%
  pivot_wider(id_cols = c(product, mixed.rate), names_from = grp,
              values_from = c(yr, type.yr), names_sep = '')
# A tibble: 2 x 6
product mixed.rate yr1 yr2 type.yr1 type.yr2
<chr> <chr> <chr> <chr> <chr> <chr>
1 01 1+5GBP 1+5GBP 3 mixed number
2 02 7+3GBP 7 7+3GBP number mixed
If you're happy to use base R instead of dplyr then the following will produce your required output:
for (i in 1:2) {
  df[,paste0('yr',i)] <- ifelse(df[,paste0('type.yr',i)]=='mixed', df[,'mixed.rate'], df[,paste0('yr',i)])
}
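Since the question mentions many more years, here is a hedged variant of the same loop that finds the year columns by name instead of hard-coding 1:2. It assumes every yrN column has a matching type.yrN column and that the columns are character vectors (the default in R 4.0 and later); yr_cols and is_mixed are illustrative names.

```r
df <- data.frame(
  product    = c("01", "02"),
  yr1        = c("1", "7"),
  yr2        = c("3", "4"),
  type.yr1   = c("mixed", "number"),
  type.yr2   = c("number", "mixed"),
  mixed.rate = c("1+5GBP", "7+3GBP"),
  stringsAsFactors = FALSE
)

# All columns named yr<number>, however many years there are
yr_cols <- grep("^yr\\d+$", names(df), value = TRUE)

for (col in yr_cols) {
  # Rows where the matching type column says "mixed"
  is_mixed <- df[[paste0("type.", col)]] == "mixed"
  # Overwrite only those rows with the reference rate
  df[[col]][is_mixed] <- df$mixed.rate[is_mixed]
}
df
```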

Tally()ing Multiple Observations In an Entire Data Frame

I'm having trouble with figuring out how to deal with a column that features several observations that I would like to tally. For example:
HTML/CSS;Java;JavaScript;Python;SQL
This is one of the cells for a column of a data frame and I'd like to tally the occurrences of each programming language. Is this something that should be tackled with str_detect(), with corpus(), or is there another way I'm not seeing?
My goal is to make each one of these languages (HTML, CSS, Java, JavaScript, Python, SQL, etc...) into a column name with the tally of how many times they occur in this column of the data frame.
I feel like I might've phrased this strangely so let me know if you need any clarification.
In tidyverse you can use separate_rows and count.
library(dplyr)
df %>% tidyr::separate_rows(PL, sep = ';') %>% count(PL)
In base R, we can split the string on semi-colon and count with table :
table(unlist(strsplit(df$PL, ';')))
#If you need a dataframe
#stack(table(unlist(strsplit(df$PL, ';'))))
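As a self-contained illustration of the base R approach, here is a small sketch on made-up data. The vector pl stands in for the PL column, which the question does not show in full.

```r
# Two hypothetical cells of the semicolon-separated column
pl <- c("HTML/CSS;JavaScript;Python;SQL;R",
        "R;HTML/CSS;Java;JavaScript;SQL;R")

# Split every cell on ";", flatten to one vector, and tally the labels
langs  <- unlist(strsplit(pl, ";", fixed = TRUE))
counts <- table(langs)
counts

# Look up a single language by name
as.integer(counts["R"])
```

Note that "HTML/CSS" is treated as one label because the split is on ";" only. Splitting it further would need a second separator.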
If you just want a total count of each label, you can use unnest_longer and a grouped count:
# using @DPH's example data
library(dplyr)
library(tidyr)
df %>%
  mutate(across(PL, ~ strsplit(.x, ";"))) %>%
  unnest_longer(PL) %>%
  group_by(PL) %>%
  count()
# A tibble: 6 x 2
# Groups: PL [6]
PL n
<chr> <int>
1 HTML/CSS 2
2 Java 1
3 JavaScript 2
4 Python 1
5 R 3
6 SQL 2
If I understood your problem correctly, this would be a solution:
library(dplyr)
library(tidyr)
# demo data
df <- dplyr::tibble(ID = c("Line 1: ","Line 2:"),
PL = c("HTML/CSS;JavaScript;Python;SQL;R","R;HTML/CSS;Java;JavaScript;SQL;R"))
# calculations
df %>%
dplyr::mutate(PLANG = stringr::str_split(PL, ";")) %>%
tidyr::unnest(c(PLANG)) %>%
dplyr::group_by(ID, PLANG) %>%
dplyr::count() %>%
tidyr::pivot_wider(names_from = "PLANG", values_from = "n", values_fill = 0)
ID `HTML/CSS` JavaScript Python R SQL Java
<chr> <int> <int> <int> <int> <int> <int>
1 "Line 1: " 1 1 1 1 1 0
2 "Line 2:" 1 1 0 2 1 1

How can you gather() multiple columns at the same time in dplyr (R)?

I am trying to gather untidy data from wide to long format.
I have 748 variables, that need to be condensed to approximately 30.
In this post, I asked: how to tidy my wide data? The answer: use gather().
However, I am still struggling to gather multiple columns and was hoping you could pinpoint where I'm going wrong.
Reproducible example:
tb1 <- tribble(~x1,~x2,~x3,~y1,~y2,~y3,
1,NA,NA,NA,1,NA,
NA,1,NA,NA,NA,1,
NA,NA,1,NA,NA,1)
# A tibble: 3 x 6
# x1 x2 x3 y1 y2 y3
# <dbl> <dbl> <dbl> <lgl> <dbl> <dbl>
#1 1 NA NA NA 1 NA
#2 NA 1 NA NA NA 1
#3 NA NA 1 NA NA 1
with x1-y3 having the following characteristics:
1 x1 Green
2 x2 Yellow
3 x3 Orange
4 y1 Yes
5 y2 No
6 y3 Maybe
I tried this:
tb1 %>%
rename("Green" =x1,
"Yellow"=x2,
"Orange"=x3,
"Yes"=y1,
"No"=y2,
"Maybe"=y3) %>%
gather(X,val,-Green,-Yellow,-Orange) %>%
gather(Y,val,-X) %>%
select(-val)
I did get an output that I wanted for these variables, but I can't imagine how to do this for 700+ variables?! Is there a more effective way?
tb1 %>%
rename("Green" =x1,
"Yellow"=x2,
"Orange"=x3,
"Yes"=y1,
"No"=y2,
"Maybe"=y3) %>%
gather(X,val,-Green,-Yellow,-Orange) %>%
filter(!is.na(val)) %>%
select(-val) %>%
gather(Y,val,-X) %>%
filter(!is.na(val)) %>%
select(-val)
# A tibble: 3 x 2
X Y
<chr> <chr>
1 No Green
2 Maybe Yellow
3 Maybe Orange
I think I might be just not acquainted enough with gather() so this is probably a stupid question - would appreciate the help. Thanks!
I’m assuming the issue here is with manually specifying all the different variable names. Luckily, tidyverse has the ?select_helpers, which make it easier to select columns based on different rules.
Instead of renaming the variables at the beginning, we can rename them at the end. This lets us use starts_with to get all columns starting with x or y and gather them together in one step. Then we can use ends_with to select the value columns from those gather steps and filter and drop them.
Finally, we replace all values of x1, y1 etc. with their true values in one step using mutate_all and a lookup table
# Make lookup table to match X and Y variables with Values
# the initial values should be the `names` (first) and the values to change them to
# should be the `values` (after the =)
lookup <- c('x1' = 'Green',
'x2' = 'Yellow',
'x3' = 'Orange',
'y1' = 'Yes',
'y2' = 'No',
'y3' = 'Maybe')
tb1 %>%
  gather(X, Xval, starts_with('x')) %>%   # Gather all variables that start with 'x'
  gather(Y, Yval, starts_with('y')) %>%   # Gather all variables that start with 'y'
  filter_at(vars(ends_with('val')),       # Looking in columns ending with 'val'
            all_vars(!is.na(.))) %>%      # Drop rows if ANY of these cols are NA
  select(-ends_with('val')) %>%           # Drop columns ending in 'val'
  mutate_all(~lookup[.])                  # Replace value from lookup table in all cols
# A tibble: 3 x 2
X Y
<chr> <chr>
1 Green No
2 Yellow Maybe
3 Orange Maybe
One tricky thing with select_helpers is knowing when you can use them alone and when you need to “register” them with vars. In gather and select, you can use them as is. In the scoped verbs such as filter_at and mutate_at, you need to surround them with vars.
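The lookup step used in the pipeline above is plain named-vector indexing, which can be tried on its own. In this sketch, codes is a made-up input vector.

```r
# Names are the original column codes, values are the labels to substitute
lookup <- c(x1 = "Green", x2 = "Yellow", x3 = "Orange",
            y1 = "Yes",   y2 = "No",     y3 = "Maybe")

# Indexing by name returns the matching label for each code, in order
codes  <- c("x1", "y2", "x3", "y3")
labels <- unname(lookup[codes])
labels

# A code that is missing from the table comes back as NA rather than an error
is.na(unname(lookup["z9"]))
```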

Accessing grouped subset in dplyr

I have the feeling this was already asked several times, but I can not make it run in my case. Don't know why.
I group_by my data frame and calculate a mean from values. Additionally, I marked a specific row and I want to calculate the ratio of my fresh calculated mean with the value of my highlighted row of the subset.
library(dplyr)
df <- data.frame(int=c(5:1,4:1),
highlight=c(T,F,F,F,F,F,T,F,F),
exp=c('a','a','a','a','a','b','b','b','b'))
df %>%
group_by(exp) %>%
summarise(mean=mean(int),
l1=nrow(.),
ratio_mean=.[.$highlight, 'int']/mean)
But for some reason, . is not the subset of group_by but the complete input. Am I missing something here?
My expected output would be
exp mean ratio_mean
<fct> <dbl> <dbl>
1 a 3 1.67
2 b 2.5 1.2
This works:
df %>%
group_by(exp) %>%
summarise(mean = mean(int),
l1 = n(),
ratio_mean = int[highlight] / mean)
But what's going wrong with your solution?
nrow(.) counts the number of rows of your whole input dataframe, whereas n() counts only the rows per group.
.[.$highlight, 'int']/mean: here again you use the whole input dataframe and subset it using the highlight column, but it gets divided by the correct group mean. You are actually returning two values here, because two rows of your original df have highlight = TRUE. This causes a nasty NA column name.
To save it, we could use do() as suggested by @MikkoMarttila, but this gets a little bit clunky:
df %>%
group_by(exp) %>%
do(summarise(., mean = mean(.$int),
l1 = nrow(.),
ratio_mean = .$int[.$highlight] / mean))
Original output
df %>%
group_by(exp) %>%
summarise(mean=mean(int),
l1=nrow(.),
ratio_mean=.[.$highlight, 'int']/mean)
# A tibble: 2 x 4
# exp mean l1 ratio_mean$ NA
# <fct> <dbl> <int> <dbl> <dbl>
# 1 a 3 9 1.67 2
# 2 b 2.5 9 1 1.2
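To make the whole-input versus per-group distinction concrete, here is a base R sketch that computes the same summary by splitting the data frame by hand. It assumes, as in the example data, exactly one highlighted row per group; per_group and res are illustrative names.

```r
df <- data.frame(
  int       = c(5:1, 4:1),
  highlight = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE),
  exp       = c("a", "a", "a", "a", "a", "b", "b", "b", "b"),
  stringsAsFactors = FALSE
)

# split() hands each group to the function as its own data frame `g`,
# so nrow(g) is the group size, not the size of the whole input
per_group <- lapply(split(df, df$exp), function(g) {
  data.frame(
    exp        = g$exp[1],
    mean       = mean(g$int),
    l1         = nrow(g),
    ratio_mean = g$int[g$highlight] / mean(g$int)
  )
})

res <- do.call(rbind, per_group)
res

# For comparison: the row count of the whole input
nrow(df)
```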

Resources