I have a df that is of non-finite length that looks like the table below.
The example here only has 2 traits: "lipids" and "density". Other rows may have 50 traits or more. But will always have the same pattern of trait, unit, method. When importing into R using read_excel it changes non unique names to xxx...[col.number]. I want to use pivot_longer to cast the data into a long format from wide. I'm having difficulty manipulating the function and would appreciate some help. The final column names I would like would be geno_name, observation_id, trait, value, unit, method
Sample Data
Desired Output (without the drop_na statement to show example)
x <- structure(list(geno_name = "MB mixed", observation_id = 10, lipids = NA,
unit...3 = NA, method...4 = NA, density = 1.125, unit...6 = "g cm^-3",
method...7 = "3D scanning"), class = "data.frame", row.names = c(NA,-1L))
So far I have:
x %>% pivot_longer(
cols = 3:ncol(x),
names_to = c("trait","unit","method"),
#need help with these other arguments
values_drop_na = T)
The data column names to be used in 'long' format doesn't all have the same pattern in column names. Therefore, the steps included are
rename columns that doesn't have the ... or _ in their column names by adding those with paste/str_c
reshape to long format with pivot_longer - taking into account the pattern in names with either names_sep or names_pattern, specify the names_to as a vector of c(".value", "trait") in the same order we want the column values and the suffix value to be stored as separate columns
Once we reshaped, create a grouping column based on the values in the 'trait' (some of them are numbers - create a logical vector and get the cumulative sum) along with the other grouping 'geno_name', 'observation_id' (which doesn't create a unique column though))
Now summarise the other columns by slicing the first row after ordering based on NA elements i.e. if there are no NA, the first value will be non-NA or else it will be NA
library(dplyr)
library(stringr)
library(tidyr)
x %>%
rename_at(vars(names(.)[!str_detect(names(.), "[_.]+")]),
~ str_c("value...", .)) %>%
pivot_longer(cols = 3:ncol(.),
names_to = c(".value", "trait"), names_sep = "\\.+") %>%
group_by(geno_name, observation_id,
grp = cumsum(str_detect(trait, "\\D+"))) %>%
summarise(across(everything(), ~ .[order(is.na(.))][1]),
.groups = 'drop') %>%
select(-grp)
-output
# A tibble: 2 x 6
# geno_name observation_id trait value unit method
# <chr> <dbl> <chr> <dbl> <chr> <chr>
#1 MB mixed 10 lipids NA <NA> <NA>
#2 MB mixed 10 density 1.12 g cm^-3 3D scanning
data
x <- structure(list(geno_name = "MB mixed", observation_id = 10, lipids = NA,
unit...3 = NA, method...4 = NA, density = 1.125, unit...6 = "g cm^-3",
method...7 = "3D scanning"), class = "data.frame", row.names = c(NA,
-1L))
Related
I have 2 data frames: one with a list of medications, the other with a different but highly overlapping list of medications along with corresponding medication ID codes. I want to merge these two data frames to apply the medication codes to the first data frame's medication list. I have a lot of partial string matches, and I want to detect strings in a case-insensitive manner.
library(tidyverse)
library(stringr)
label <- c("0.4% Lidocaine Hydrochloride", "10% Dextrose", "Act Raloxifene")
df1 <- as.DataFrame(label)
label2 <- c("LIDOCAINE", "RALOXIFENE", "JANUMET", "ESOMEPRAZOLE", "METFORMIN")
code <- c(0003, 0005, 0006, 0001, 0011)
df2 <- data.frame(label2, code)%>%
rename(label=label2)
I try to use str_detect from stringr package
merge_df <- merge(df1, df2,
by.x=c("label" = ifelse(str_detect(df1$label, regex(df2$label, ignore_case = T)),
df1$label, NA)),
by.y=c("label" = ifelse(str_detect(df1$label, regex(df2$label, ignore_case = T)),
df2$label, NA)),
ignore.case=T,all.x=T,all.y=T,
suffixes = c("_list", "_dict"),
nomatch=0)
And I get the error:
Error in str_detect():
! Can't recycle string (size 3) to match pattern (size 5).
An approach using left_join.
First add a variable l_lower in both sets containing all tolower strings, separated by strsplit to enable match of all entries.
After joining and arranging the y-labels remove duplicated entries and the helper column.
library(dplyr)
library(tidyr)
left_join(df1 %>%
rowwise() %>%
mutate(l_label = strsplit(tolower(label), " ")) %>%
unnest(l_label),
df2 %>%
rowwise() %>%
mutate(l_label = unlist(strsplit(tolower(label), " "))), "l_label") %>%
arrange(label.y) %>%
group_by(label.x) %>%
filter(!duplicated(label.x)) %>%
select(-l_label) %>%
ungroup()
# A tibble: 3 × 3
label.x label.y code
<chr> <chr> <dbl>
1 0.4% Lidocaine Hydrochloride LIDOCAINE 3
2 Act Raloxifene RALOXIFENE 5
3 10% Dextrose NA NA
Data
df1 <- structure(list(label = c("0.4% Lidocaine Hydrochloride", "10% Dextrose",
"Act Raloxifene")), class = "data.frame", row.names = c(NA, -3L
))
df2 <- structure(list(label = c("LIDOCAINE", "RALOXIFENE", "JANUMET",
"ESOMEPRAZOLE", "METFORMIN"), code = c(3, 5, 6, 1, 11)),
class = "data.frame", row.names = c(NA,
-5L))
How do I reshape the data.frame input to result? Basically the first part of the column name before "dosis" should be the new variable with two value columns value and dosis containing the data of the columns ending with/without "dosis".
This should not be too difficult, but I have difficulties finding the correct regex to use with pivot_longer or melt.data.table.
library(tibble)
library(tidyr)
library(magrittr)
library(data.table)
input <-
tribble(
~"abc", ~"abcdosis", ~"def", ~"defdosis", ~"ghi", ~"ghidosis",
1, 0, 9, NA, 1, 2
)
result <-
tribble(
~"variable", ~"value", ~"dosis",
"abc", 1, 0,
"def", 9, NA,
"ghi", 1, 2
)
# Not working
pivot_longer(input,
everything(),
names_to = c("variable", "dosis"),
names_pattern = "(^dosis)?(dosis)")
# Also not working
melt.data.table(as.data.table(input), measure.vars = patterns("^(?!.*dosis).*$", "dosis$"))
Using dplyr::rename_with() you can paste "value" onto the end of of the non-"dosis" columns, then use then use the ".value" sentinel in pivot_longer(). Of course, may have to be more specific with the columns in rename_with() if you data is more complex.
library(dplyr)
library(tidyr)
input %>%
rename_with(~paste0(., "value"), -ends_with("dosis")) %>%
pivot_longer(everything(), names_to = c("variable", ".value"), names_pattern = "(.*?)(value|dosis)$")
# A tibble: 3 x 3
variable value dosis
<chr> <dbl> <dbl>
1 abc 1 0
2 def 9 NA
3 ghi 1 2
I need to reshape a complicated table from rows of stacked election data to cleanly formatted columns containing all the information. I'm having trouble automating this.
Here's a simple version of the input data. Note that there are just 2 elections in this example; in the real data there are many, so the code needs to generalize:
input <-
structure(list(a = c("2020 ge", "winner", NA, "2016 ge", "winner"
), b = c(NA, "orange (cat)", NA, NA, "peach (kitten)"), c = c(NA,
"runner up", NA, NA, "runner up"), d = c(NA, "peach (kitten)", NA,
NA, "orange (cat)"), e = c(NA, "margin", NA, NA, "margin"), f = c(NA,
100, NA, NA, 150)), row.names = c(NA, 5L), class = "data.frame")
And this is the output I would like:
output <-
structure(list(`2019_winner_name` = "orange", `2020_winner_party` = "cat",
`2020_runner_up_name` = "peach", `2020_runner_up_party` = "kitten",
`2020_margin` = 100, `2016_winner_name` = "peach", `2016_winner_party` = "kitten",
`2016_runner_up_name` = "orange", `2016_runner_up_party` = "cat",
`2016_margin` = 150), row.names = 1L, class = "data.frame")
Here is what I've tried so far, which works for one year:
# test data
test <-
input %>%
slice(1:2) %>%
fill(c(b, c, d, e, f), .direction = c("up"))
# select first row
row_one <-
test %>%
select(a) %>%
slice(1)
# select year
year <-
str_extract(row_one$a, "^([0-9]*)")
# select second row as name
row_two <-
test %>%
select(a) %>%
slice(2) %>%
as.character()
# bring back to test data
test <-
test %>%
mutate(a = row_two) %>%
slice(1) %>%
add_row() %>%
fill(c(b, d, f)) %>%
mutate(a = ifelse(is.na(a), b, a),
c = ifelse(is.na(c), d, c),
e = ifelse(is.na(e), f, e)) %>%
select(a, c, e) %>%
row_to_names(1) %>%
rename_all(funs(paste0(year, "_", .)))
# extract party variable
test <-
test %>%
mutate_at(vars(contains("winner"), contains("runner")),
funs(party = str_extract(., "(?<=\\().+?(?=\\))"))) %>%
mutate_at(vars(ends_with("winner"), ends_with("up")),
funs(name = str_extract(., "([^()]*)")))
What would be an easier and more concise way to do this, given the unusual data format? How could I automate this so that I can run it over multiple election years?
Thank you.
First off, I agree with #deschen in that this is very messy data. Rather than trying to tidy/reshape the data as provided I would recommend exploring whether source data can be parsed in a better (tidier) way.
Having said that, it is possible to reshape & tidy data into your expected output. Mind you, this is a fairly messy procedure and I have no idea how well this generalises on bigger data.
library(tidyverse)
# Define a convenience function that turns a vector with an even number of elements
# into a named vector where every odd element is the name of the following even element
to_named_vec <- function(x) {
if (length(x) == 1) return(magrittr::set_names(x, "margin"))
nm <- x[c(TRUE, FALSE)]
vec <-x[c(FALSE, TRUE)]
return(magrittr::set_names(vec, nm))
}
# First convert the input into a nested `list`
lst <- input %>%
t() %>%
as.character() %>%
discard(is.na) %>%
split(., cumsum(str_detect(., "\\d{4}"))) %>%
map(~ .x %>%
str_remove(" ge") %>%
stringi::stri_replace_all_regex("(\\w+)\\s\\((\\w+)\\)", "name_$1_party_$2") %>%
str_split("_") %>%
unlist()) %>%
magrittr::set_names(map_chr(., head, 1)) %>%
map(~ .x[-1] %>%
split(cumsum(str_detect(.x[-1], "(winner|runner up|margin)"))) %>%
magrittr::set_names(map_chr(., head, 1)) %>%
map(~ .x %>% tail(-1) %>% to_named_vec() %>% bind_rows()))
# The last step involves `unlist`ing the nested `list`, tidying the names and
# converting the named vector into a `tibble` with `bind_rows`.
lst %>%
unlist() %>%
set_names(., str_replace_all(names(.), "\\.", "_")) %>%
set_names(., str_replace(names(.), "_margin", "")) %>%
bind_rows()
## A tibble: 1 x 10
#`2020_winner_na~ `2020_winner_pa~ `2020_runner up~ `2020_runner up~ `2020_margin` `2016_winner_na~
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 orange cat peach kitten 100 peach
## ... with 4 more variables: `2016_winner_party` <chr>, `2016_runner up_name` <chr>, `2016_runner
## up_party` <chr>, `2016_margin` <chr>
It's best to step through the code line-by-line to understand what every steps does; roughly,
we transpose input,
convert the resulting matrix into a character vector, discard NAs, and
split the vector on the occurrence of "\d{4}" (i.e. the year of the GE).
We then operate on every list element separately, by
removing the string " ge",
replacing occurrences of the form "orange (cat)" with "name_orange_party_cat",
splitting entries on "_".
The rest is a matter of giving the nested list elements proper names that from the vector of list elements themselves.
The final step involves unlisting the nested list and tidying the names of the named vector to reflect those from your expected output.
I am having some issues trying to sum a bunch of columns in R. I am analyzing a huge dataset so I am reproducing a sample. of fake data.
Here's how the data looks like (I have 800 columns).
library(data.table)
dataset <- data.table(name = c("A", "B", "C", "D"), a1 = 1:4, a2 = c(1,2,NaN,5), a3 = 1:4, a4 = 1:4, a5 = c(1,2,NA,5), a6 = 1:4, a8 = 1:4)
dataset
What I want to do is sum the columns in buckets of 100 columns so, for example, all the values in the first row between the first column and the column 100, all the values in the first row between the column 1 and the column 200, all the values in the second row between the first column and the column 100, etc.
Using the sample data I've come with this solution using rowSums.
dataset %>%
mutate_if(~!is.numeric(.x), as.numeric) %>%
mutate_all(funs(replace_na(., 0))) %>%
mutate(sum = rowSums(.[,paste("a", 1:3, sep="")])) %>%
mutate(sum1 = rowSums(.[,paste("a", 4:5, sep="")])) %>%
mutate(sum2 = rowSums(.[,paste("a", 6:8, sep="")]))
but I am getting the following error:
Error in `[.data.frame`(., , paste("a", 6:8, sep = "")) : undefined columns selected
as the data does not include column a7.
The original data is missing a bunch of columns between a1 and a800 so solving this would be key to make it work.
What would it be the best way to approach and solve this error?
Also, I have a few more questions regarding the code I've written:
Is there a smarter way to select the column a1 and a100 instead of using this approach .[,paste("a", 1:3, sep="")]? I am interested in selected the column by name. I do not want to select it by the position of the column because sometimes a100 does not mean that is the column 100.
Also, I am converting the NAs and the NaNs to 0 in order to be able to sum the rows. I am doing it this way mutate_all(funs(replace_na(., 0))), losing my first row than contains the names of the values. What would it be the best way to replace NA and NaN without mutating the string values of the first row to 0?
The type of the columns I am adding is integer as I converted them beforehand mutate_if(~!is.numeric(.x), as.numeric) . Should I follow the same approach in case I have dbl?
Thank you!
Here is one way to do this after transforming data to longer format, for each name, we create a group of n rows and take the sum.
library(dplyr)
library(tidyr)
n <- 2 #No of columns to bucket. Change this to 100 for your case.
dataset %>%
pivot_longer(cols = -name, names_to = 'col') %>%
group_by(name) %>%
group_by(grp = rep(seq_len(n()), each = n, length.out = n()), add = TRUE) %>%
summarise(value = sum(value, na.rm = TRUE)) %>%
#If needed in wider format again
pivot_wider(names_from = grp, values_from = value, names_prefix = 'col')
# name col1 col2 col3 col4
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 2 2 2 1
#2 B 4 4 4 2
#3 C 3 6 3 3
#4 D 9 8 9 4
so in a dataset, I have a column named "Interventions", and each row looks like this:
row1: "Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600"
row2: "Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marroow"
I want to only extract the Intervention type such as "Drug", "Biological", "Procedure" to remain in the column. And even better, if can only have the unique Intervention type instead of "Drug" 4 times like the first row.
The expected output would look like this:
row1: "Drug"
row2: "Biological, Drug, Procedure"
I am just getting started with r, I have tidyverse installed and kinda used to playing with the %>%. If anyone can help me with this, much appreciated !
If we want to extract only the prefix part before the :
library(dplyr)
library(stringr)
library(tidyr)
library(purrr)
df1 %>%
mutate(Interventions = map_chr(str_extract_all(Interventions,
"\\w+(?=:)"), ~ toString(sort(unique(.x)))))
# Interventions
#1 Drug
#2 Biological, Drug, Procedure
Or another option is to separate the rows based on the delimiters, slice the alternate rows and paste together the sorted unique values in 'Interventions'
df1 %>%
mutate(rn = row_number()) %>%
separate_rows(Interventions, sep="[:|]") %>%
group_by(rn) %>%
slice(seq(1, n(), by = 2)) %>%
distinct() %>%
summarise(Interventions = toString(sort(unique(Interventions)))) %>%
ungroup %>%
select(-rn)
# A tibble: 2 x 1
# Interventions
# <chr>
#1 Drug
#2 Biological, Drug, Procedure
data
df1 <- structure(list(Interventions = c("Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600",
"Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marroow"
)), class = "data.frame", row.names = c(NA, -2L))
Not as concise and the same logic as Akruns but in Base R:
# Create df:
df1 <- structure(list(Interventions = c("Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600",
"Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marroow"
)), class = "data.frame", row.names = c(NA, -2L))
# Assign a row id vec:
df1$row_num <- 1:nrow(df1)
# Split string on | delim:
split_up <- strsplit(df1$Interventions, split = "[|]")
# Roll down the dataframe - keep uniques:
rolled_out <- unique(data.frame(row_num = rep(df1$row_num, sapply(split_up, length)),
Interventions = gsub("[:].*","", unlist(split_up))))
# Stack the dataframe:
df2 <- aggregate(Interventions~row_num, rolled_out, paste0, collapse = ", ")
# Drop id vec:
df2 <- within(df2, rm("row_num"))