Unnesting related list-columns of different size - r

After parsing xml files I have data looking like this:
example_df <-
tibble(id = "ABC",
wage_type = "salary",
name = c("Description","Code","Base",
"Description","Code","Base",
"Description","Code"),
value = c("wage_element_1","51B","600",
"wage_element_2","51C","740",
"wage_element_3","51D"))
example_df
# A tibble: 8 x 4
id wage_type name value
<chr> <chr> <chr> <chr>
1 ABC salary Description wage_element_1
2 ABC salary Code 51B
3 ABC salary Base 600
4 ABC salary Description wage_element_2
5 ABC salary Code 51C
6 ABC salary Base 740
7 ABC salary Description wage_element_3
8 ABC salary Code 51D
with roughly 1000 different id, and each having three possible values for wage_type.
I want to change the values in the name column into columns.
I have tried to use pivot but I am struggling to handle the resulting list-cols: since not all salary have a Base, the resulting list-cols are of different size as below:
example_df <- example_df %>%
pivot_wider(id_cols = c(id, wage_type),
names_from = name,
values_from = value)
example_df
# A tibble: 1 x 5
id wage_type Description Code Base
<chr> <chr> <list> <list> <list>
1 ABC salary <chr [3]> <chr [3]> <chr [2]>
So when I try to unnest the cols it throws an error:
example_df%>%
unnest(cols = c(Description,Code,Base))
Error: Can't recycle `Description` (size 3) to match `Base` (size 2).
I understand that this is because tidyr functions do not recycle, but I could not find a way around it or a base R solution to my problem. I have tried to make a df with an unlist(strsplit(as.character(x)) solution, as per "how to split one row into multiple rows in R", but also ran into a resulting column length issue.
Desired output is as follows:
desired_df <-
tibble(
id=c("ABC","ABC","ABC"),
wage_type=c("salary","salary","salary"),
Description = c("wage_element_1","wage_element_2","wage_element_3"),
Code = c("51B","51C","51D"),
Base = c("600","740",NA))
desired_df
id wage_type Description Code Base
<chr> <chr> <chr> <chr> <chr>
1 ABC salary wage_element_1 51B 600
2 ABC salary wage_element_2 51C 740
3 ABC salary wage_element_3 51D NA
I would love a tidyr solution but any help would be appreciated. Thanks.

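I would suggest this approach using tidyverse functions. pivot_wider() returns list-columns when the remaining columns (here id and wage_type) do not uniquely identify a row: each name occurs several times per id and wage_type, so the values are collected into lists. Creating an index variable such as id2, which numbers the occurrences of each name within a group, makes every row uniquely identified and avoids list outputs in your final reshaped data: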
library(tidyverse)
#Code
example_df %>%
arrange(name) %>%
group_by(id,wage_type,name) %>%
mutate(id2=1:n()) %>% ungroup() %>%
pivot_wider(names_from = name,values_from=value) %>%
select(-id2)
Output:
# A tibble: 3 x 5
id wage_type Base Code Description
<chr> <chr> <chr> <chr> <chr>
1 ABC salary 600 51B wage_element_1
2 ABC salary 740 51C wage_element_2
3 ABC salary NA 51D wage_element_3
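Numbering the occurrences of each name lines the values up correctly only when the missing Base belongs to the last element. As a hedged alternative, the sketch below assumes every wage element starts with a Description row and builds the index with cumsum() instead. The data is a hypothetical variation of example_df in which the second element has no Base, and the record column is a name introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical variation of example_df: the SECOND element lacks a Base
example_df <- tibble(
  id = "ABC",
  wage_type = "salary",
  name = c("Description", "Code", "Base",
           "Description", "Code",
           "Description", "Code", "Base"),
  value = c("wage_element_1", "51B", "600",
            "wage_element_2", "51C",
            "wage_element_3", "51D", "740")
)

result <- example_df %>%
  group_by(id, wage_type) %>%
  # Assumption: each wage element begins with a "Description" row,
  # so a running count of those rows identifies the element
  mutate(record = cumsum(name == "Description")) %>%
  ungroup() %>%
  pivot_wider(names_from = name, values_from = value) %>%
  select(-record)

result
```

Because the index follows the original row order, the missing Base stays with wage_element_2 rather than shifting to the last row.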

Related

extracting list items in r data frame

I have a data frame with two columns: ID and product. The product contains a list of items like the ones below:
ID Product
1 'desk','chair','clock'
2 NA
3 'pen'
I want to extract every single product in a separate row with the corresponding ID, as below:
ID Product
1 'desk'
1 'chair'
1 'clock'
3 'pen'
It would be appreciated if you had any suggestions.
You can do it with separate_rows:
library(tidyverse)
df <- data.frame(
id = c(1,2,3),
product=c('desk, chair, clock', NA, 'pen')
)
df |>
separate_rows(product) |>
drop_na()
#> # A tibble: 4 × 2
#> id product
#> <dbl> <chr>
#> 1 1 desk
#> 2 1 chair
#> 3 1 clock
#> 4 3 pen
You can do it with separate_rows from the tidyr library:
library(tidyr)
df = df %>%
separate_rows(Product, sep = ",")
Besides the above answers, I tried this method and it works fine as well (it assumes Product is a list-column rather than a comma-separated string):
result_df <- unnest(df, Product)
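If Product is stored as a comma-separated string instead, one possible way to still use unnest() is to split it into a list-column first. This is a minimal sketch under that assumption; the tibble below is made up to mirror the question's table.

```r
library(dplyr)
library(tidyr)

# Assumed layout: one comma-separated string per ID, NA where there is no product
df <- tibble(
  ID = c(1, 2, 3),
  Product = c("desk,chair,clock", NA, "pen")
)

result_df <- df %>%
  # strsplit() turns each string into a character vector (NA stays NA)
  mutate(Product = strsplit(Product, ",", fixed = TRUE)) %>%
  # one row per list element
  unnest(Product) %>%
  # drop the ID that had no product
  drop_na(Product)

result_df
```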

How to split a column into different columns in R

I have data in which the Education column takes values from 1 to 3.
The payment column also takes values from 1 to 3.
I grouped them in pairs and counted each pair.
How can I convert the table so that the counts are spread across 3 different columns?
I would like the table to look like this:
[image of the desired table not available]
I tried to do this but it gave me an error.
The function pivot_wider from the tidyr package is the way to go:
library(dplyr)
library(tidyr)
df %>%
pivot_wider(names_from = Education,
values_from = count,
names_glue = "Education = {.name}")
# A tibble: 3 × 4
PaymentTier `Education = 1` `Education = 2` `Education = 3`
<dbl> <dbl> <dbl> <dbl>
1 1 1000 666 6543
2 2 33 2222 9999
3 3 455 1111 5234
Data:
df <- data.frame(
Education = c(1,1,1,2,2,2,3,3,3),
PaymentTier = c(1,2,3,1,2,3,1,2,3),
count = c(1000,33,455,666,2222,1111,6543,9999,5234)
)

How to sum all values of a cell if it corresponds with a specific value in another cell?

I might just be going about it the wrong way, but I'm having trouble pulling out all of the female scores and all of the male scores into their own respective dataframes.
I don't need to have any of the exam information, so really I could just get every 'f' and its corresponding score and every 'm' and its corresponding score into a dataframe.
library(tidyverse)
data <- tribble(~"X",~"Exam1",~"X.1",~"Exam2",~"X.2",
"n","Score","Gender","Score","Gender",
"1","45","m","66","f",
"2","60","f","73","m")
# Create informative column names
Colnames <- colnames(data) %>% str_c(.,dplyr::slice(data,1) %>% unlist,sep = "_")
# Set column names
data <- data %>%
setNames(Colnames) %>%
dplyr::slice(-1)
# Arrange data by exam type by first getting exam "number"
colnames(data) %>%
str_extract("\\d+") %>%
str_subset("\\d") %>%
unique %>%
# Split and arrange data by exams
purrr::map_df(~{
data %>%
dplyr::select(matches(str_c("X_n|",.x))) %>%
dplyr::mutate(Exam = str_c("Exam ",.x)) %>%
dplyr::rename_all(~c("Serial number","Exam score","Gender","Exam"))
}) %>%
# Split data by gender
dplyr::group_by(Gender) %>%
dplyr::group_split()
Output:
[[1]]
# A tibble: 2 × 4
`Serial number` `Exam score` Gender Exam
<chr> <chr> <chr> <chr>
1 2 60 f Exam 1
2 1 66 f Exam 2
[[2]]
# A tibble: 2 × 4
`Serial number` `Exam score` Gender Exam
<chr> <chr> <chr> <chr>
1 1 45 m Exam 1
2 2 73 m Exam 2
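Since the title asks for a sum, a short follow-up may help: once the data is in long form, the totals per gender come from a grouped summarise(). The sketch below rebuilds the long table by hand from the output above so that it runs on its own; the names long_scores and total_score are introduced here for illustration.

```r
library(dplyr)

# Long-form scores, copied from the split output above
long_scores <- tibble(
  `Serial number` = c("1", "2", "1", "2"),
  `Exam score` = c("45", "60", "66", "73"),
  Gender = c("m", "f", "f", "m"),
  Exam = c("Exam 1", "Exam 1", "Exam 2", "Exam 2")
)

totals <- long_scores %>%
  # scores were read as text, so convert before summing
  mutate(`Exam score` = as.numeric(`Exam score`)) %>%
  group_by(Gender) %>%
  summarise(total_score = sum(`Exam score`), .groups = "drop")

totals
```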

Clustering similar strings based on another column in R

I have a large data frame that shows the distance between strings and their counts.
For example, in row 1, you see the distance between apple and pple as well as the number of times I have counted apple (counts_col1 = 100) and the number of times I have counted pple (counts_col2 = 2).
library(tidyverse)
df <- tibble(col1 = c("apple","apple","pple", "banana", "banana","bananna"),
col2 = c("pple","app","app", "bananna", "banan", "banan"),
distance = c(1,2,3,1,1,2),
counts_col1 = c(100,100,2,200,200,2),
counts_col2 = c(2,50,50,2,20,20))
df
#> # A tibble: 6 × 5
#> col1 col2 distance counts_col1 counts_col2
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 apple pple 1 100 2
#> 2 apple app 2 100 50
#> 3 pple app 3 2 50
#> 4 banana bananna 1 200 2
#> 5 banana banan 1 200 20
#> 6 bananna banan 2 2 20
Created on 2022-03-15 by the reprex package (v2.0.1)
Now I want to cluster the apples and the bananas based on the string that has the maximum number of counts, which is the apple (100) and the banana (200).
I want my data to look somehow like this
cluster elements sum_counts
apple apple 152
NA pple NA
NA app NA
banana banana 222
NA bananna NA
NA banan NA
The format of the output does not have to be like this. I am really struggling to break down this problem and cluster the groups.
Any help or comments are really appreciated!
You can try using random walk clustering from igraph:
count_df <- data.table::melt(
data.table::as.data.table(df),
measure = list(c("col1", "col2"), c("counts_col1", "counts_col2")),
value.name = c("col", "counts")
) %>%
select(col, counts) %>%
unique()
df %>%
igraph::graph_from_data_frame(directed = FALSE) %>%
igraph::walktrap.community(weights = igraph::E(.)$distance) %>%
# igraph::components() %>%
igraph::membership() %>%
split(names(.), .) %>%
map_dfr(
~tibble(col = .x) %>%
semi_join(count_df, ., by = "col") %>%
arrange(desc(counts)) %>%
summarise(cluster = first(col), elements = list(col), sum_count = sum(counts))
)
cluster elements sum_count
1 apple apple, app, pple 152
2 banana banana, banan, bananna 222
This works on this toy example, but I think your example is too simple and probably does not reflect your main problem. It might be even easier if you are only interested in finding connected components (if two words are connected, they are in the same cluster). In that case you would replace walktrap.community with components.
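The connected-components variant mentioned above could look roughly like the following. It is a sketch only: it rebuilds the edge list from the question's col1 and col2, ignores distance and counts, and the names edges, g and clusters are introduced here for illustration.

```r
library(igraph)

# Edge list taken from col1/col2 of the question's data
edges <- data.frame(
  col1 = c("apple", "apple", "pple", "banana", "banana", "bananna"),
  col2 = c("pple", "app", "app", "bananna", "banan", "banan")
)

g <- graph_from_data_frame(edges, directed = FALSE)

# components() returns a list; $membership maps each vertex name to a component id
membership <- components(g)$membership

# One character vector of words per connected component
clusters <- split(names(membership), membership)
clusters
```

Each element of clusters can then be joined against the counts table in the same way as in the walktrap version.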
Here is one approach, where I initially add a group identifier for the sets (I presume you have this in your actual data). After reshaping to a longer dataset, I group by this id and identify the "word" that has the largest count. I then use an inner join between the initial df and the resulting set of key rows holding that word, summarize, and rename. I push all the variants into a list column.
df <- df %>% mutate(id=c(1,1,1,2,2,2))
df %>% inner_join(
rbind(
df %>% select(id,distance,col=col1, counts=counts_col1),
df %>% select(id,distance,col=col2, counts=counts_col2)
) %>%
group_by(id) %>%
slice_max(counts) %>%
distinct(col),
by=c("col1"="col")
) %>%
group_by(col1) %>%
summarize(variants = list(c(col2, cur_group()$col1)),
total = min(counts_col1) + sum(counts_col2)) %>%
rename_all(~c("cluster", "elements", "sum_counts"))
# A tibble: 2 x 3
cluster elements sum_counts
<chr> <list> <dbl>
1 apple <chr [3]> 152
2 banana <chr [3]> 222
A similar approach in data.table (also depends on having that id column)
library(data.table)
setDT(df)
df[rbind(
df[,.(id,col=col1,counts=counts_col1)],
df[,.(id,col=col2,counts=counts_col2)]
)[order(-counts),.SD[1], by=id],on=.(col1=col)][
, .(elements=list(c(col2,.BY$cluster)),
sum_counts = min(counts_col1) + sum(counts_col2)),
by=.(cluster=col1)]
cluster elements sum_counts
<char> <list> <num>
1: banana bananna,banan,banana 222
2: apple pple,app,apple 152

dataframe manipulation using ddply

I have a dataframe named output (the data is given below).
I want to generate mode(most repeating) of code for each distinct patientID and count of unique patientID with the above code for each distinct zipcode.
I tried this:
ddply(output,~zipcode,summarize,max=mode(code))
this code will generate mode of code for each distinct zipcode...but I want to generate mode of code for distinct patientID within distinct zipcode.
output=data.frame(code=c("E78.5","N08","E78.5","I65.29","Z68.29","D64.9"),patientID=c("34423","34423","34423","34423","34424","34425"),zipcode=c(00718,00718,00718,00718,00718,00719),city=c("NAGUABO","NAGUABO","NAGUABO","NAGUABO","NAGUABO","NAGUABO"))
My desired output:
zipcode most_rep_code patient_count
1 718 E78.5 1
2 719 D64.9 1
If I understand correctly, you need the code with the highest frequency by patientID and zipcode, in which case dplyr might be of use. Use all three columns (patientID, code, zipcode) as grouping variables and then summarise to get the count of each group. The row with the highest count is the mode, and the new column gives that count.
# Your reprex data
output=data.frame(code=c("E78.5","N08","E78.5","I65.29","Z68.29","D64.9"),patientID=c("34423","34423","34423","34423","34424","34425"),zipcode=c(00718,00718,00718,00718,00718,00719),city=c("NAGUABO","NAGUABO","NAGUABO","NAGUABO","NAGUABO","NAGUABO"))
library(dplyr)
output %>%
dplyr::group_by(patientID, code, zipcode) %>%
dplyr::summarise(mode_freq = n())
# A tibble: 5 x 4
# Groups: patientID, code [5]
patientID code zipcode mode_freq
<fct> <fct> <dbl> <int>
1 34423 E78.5 718 2
2 34423 I65.29 718 1
3 34423 N08 718 1
4 34424 Z68.29 718 1
5 34425 D64.9 719 1
I've included dplyr:: because I'm assuming you have plyr loaded and so function names will conflict.
Update:
To get to your suggested output of the mode, by definition it should be the highest frequency:
output %>%
group_by(patientID, code, zipcode) %>%
summarise(mode_freq = n()) %>%
ungroup() %>%
group_by(zipcode) %>%
filter(mode_freq == max(mode_freq))
# A tibble: 2 x 4
# Groups: zipcode [2]
patientID code zipcode mode_freq
<fct> <fct> <dbl> <int>
1 34423 E78.5 718 2
2 34425 D64.9 719 1
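Note that base R's mode() returns the storage mode of an object, not the most frequent value, which is why the ddply attempt in the question does not give a statistical mode. To get the three columns of the desired output directly, one possible sketch is shown below. The helper stat_mode() is hypothetical, not part of dplyr or plyr, and zipcode is stored as a string here on the assumption that leading zeros should be kept.

```r
library(dplyr)

output <- data.frame(
  code = c("E78.5", "N08", "E78.5", "I65.29", "Z68.29", "D64.9"),
  patientID = c("34423", "34423", "34423", "34423", "34424", "34425"),
  zipcode = c("00718", "00718", "00718", "00718", "00718", "00719"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent value; on ties the alphabetically
# first value wins, because table() sorts its names
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

summary_df <- output %>%
  group_by(zipcode) %>%
  summarise(
    most_rep_code = stat_mode(code),
    # number of distinct patients in this zipcode who have that code
    patient_count = n_distinct(patientID[code == stat_mode(code)]),
    .groups = "drop"
  )

summary_df
```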
