How to divide this string in multiple columns? - r

I have this string and I need to split it into separate columns:
legend = "Frequency..Derivatives.measure...Derivatives.instrument...Derivatives.risk.category...Derivatives.reporting.country...Derivatives.counterparty.sector...Derivatives.counterparty.country...Derivatives.underlying.risk.sector...Derivatives.currency.leg.1...Derivatives.currency.leg.2...Derivatives.maturity...Derivatives.rating...Derivatives.execution.method...Derivatives.basis...Period..30.06.1998.31.12.1998.30.06.1999.31.12.1999.30.06.2000.31.12.2000.30.06.2001.31.12.2001.30.06.2002.31.12.2002.30.06.2003.31.12.2003.30.06.2004.31.12.2004.30.06.2005.31.12.2005.30.06.2006.31.12.2006.30.06.2007.31.12.2007.30.06.2008.31.12.2008.30.06.2009.31.12.2009.30.06.2010.31.12.2010.30.06.2011.31.12.2011.30.06.2012.31.12.2012.30.06.2013.31.12.2013.30.06.2014.31.12.2014.30.06.2015.31.12.2015.30.06.2016.31.12.2016.30.06.2017.31.12.2017.30.06.2018.31.12.2018.30.06.2019"
Every run of three dots should start a new column, up to the word Period. Note that the first word, Frequency, is separated from the second, Derivatives.measure, by only two dots, not three.
After that comes a series of dates (at 6-month intervals), which should be split this way: "every time there is a 4-digit number, perform a split".
How can I do this? Thank you.

We can use strsplit with fixed = TRUE to split at each literal ... into a list of vectors, and then rbind the vectors to create a data.frame:
df1 <- do.call(rbind.data.frame, strsplit(legend, "...", fixed = TRUE))
names(df1) <- paste0("V", seq_along(df1))
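The fixed = TRUE argument matters because an unescaped "..." is a regular expression that matches any three characters. A minimal sketch on a made-up short string (x is a hypothetical stand-in for legend):

```r
# Hypothetical short input standing in for `legend`
x <- "a...b...c"

# Literal match on three dots
fixed_split <- strsplit(x, "...", fixed = TRUE)[[1]]

# Equivalent regex with the dot escaped
escaped_split <- strsplit(x, "\\.{3}")[[1]]

# Unescaped regex: "." matches any character, so the letters are consumed too
regex_split <- strsplit(x, "...")[[1]]

print(fixed_split)
print(escaped_split)
print(regex_split)
```

The first two calls return the three letters; the third does not, which is why the separator has to be treated as a literal or escaped.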
If we also need to handle the last condition and split the "Period" part:
library(dplyr)
library(tidyr)
library(stringr)
library(data.table)
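tibble(col = legend) %>%
  mutate(rn = row_number()) %>%
  # one row per piece separated by three dots
  separate_rows(col, sep = "[.]{3}") %>%
  # rowid() comes from data.table
  mutate(rn2 = str_c("V", rowid(rn))) %>%
  pivot_wider(names_from = rn2, values_from = col) %>%
  # the last column holds "Period.." followed by the dates
  rename_at(ncol(.), ~ "Period") %>%
  mutate(Period = str_remove(Period, "Period\\.+")) %>%
  # split at the dot that follows each 4-digit year
  separate_rows(Period, sep = "(?<=\\.[0-9]{4})\\.")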
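For comparison, the same two-stage split can be sketched in base R without any packages. This is only an illustration on a shortened, made-up version of the string (s, parts, header_cols and dates are hypothetical names), and it extracts the dates by pattern rather than splitting after each year:

```r
# Shortened, hypothetical version of `legend`
s <- paste0("Frequency..Derivatives.measure...Derivatives.instrument...",
            "Period..30.06.1998.31.12.1998.30.06.1999")

# Stage 1: split on the literal three-dot separator
parts <- strsplit(s, "...", fixed = TRUE)[[1]]

# The header pieces are everything except the last one; the first piece
# still contains the two-dot separator between Frequency and the next name
header_cols <- unlist(strsplit(parts[-length(parts)], "..", fixed = TRUE))

# Stage 2: pull every dd.mm.yyyy date out of the last piece
period <- parts[length(parts)]
dates <- regmatches(period,
                    gregexpr("[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}", period))[[1]]

print(header_cols)
print(dates)
```

Extracting the dates with a pattern avoids having to strip the leading "Period.." first, at the cost of assuming every date has the dd.mm.yyyy shape.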

Related

how to remove duplicate values from specific columns in a data frame?

I want to remove duplicated text within the values of certain columns of a data frame.
Like this:
What should I do?
In base R, we can split the 'originaltext' column at each , followed by zero or more spaces (\\s*), then loop over the resulting list with sapply, keep the unique values, and paste them together, collapsing without a space:
df1$result <- sapply(strsplit(df1$originaltext, ",\\s*"),
                     function(x) paste(unique(x), collapse = ""))
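A small worked example may help here. The data below are made up, since the original sample was not included in the question; only the column name originaltext is taken from the answer above:

```r
# Hypothetical sample data with repeated items inside each value
df1 <- data.frame(originaltext = c("apple, pear, apple", "kiwi,kiwi, fig"),
                  stringsAsFactors = FALSE)

# Split on commas (plus optional spaces), keep first occurrences, paste back
pieces <- strsplit(df1$originaltext, ",\\s*")
df1$result <- sapply(pieces, function(x) paste(unique(x), collapse = ""))

# Variant that keeps a readable separator between the unique items
df1$result_sep <- sapply(pieces, function(x) paste(unique(x), collapse = ", "))

print(df1)
```

Whether to collapse with "" or ", " depends on the desired output, which the question does not show.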
Here's a way with dplyr:
library(dplyr)
df %>%
  mutate(row = row_number()) %>%
  tidyr::separate_rows(original_text, sep = ',\\s*') %>%
  group_by(row) %>%
  summarise(result = paste0(unique(original_text), collapse = ''),
            original_text = toString(original_text)) %>%
  select(-row)

R: How to identify indices of minima of all groups in data frame

In R, say I have a data frame times with columns athlete (character), season (integer), distance (factor, with levels 400, 800, 1500, 5000, 10000) and tm (numeric), and I want to identify the indices of the rows that hold the lowest time for each unique combination of the other three variables.
I can do this with the following code that sorts by grouping columns and then by tm:
times1 <- times # make a copy of the data frame
times1$rownum <- seq_len(nrow(times1)) # add a column of row numbers
times1 <- times1[with(times1, order(athlete, season, distance, tm)), ] # sort by group, then by time
whichmins <- times1$rownum[!duplicated(subset(times1, select = -c(tm, rownum)))] # identify rows where grouping factors change
But I was wondering if there was a more concise way to do it using aggregate, dplyr or data tables.
I tried using dplyr's group_by function with which.min but I could not get it to work.
Thank you
With the tidyverse, a similar approach is to arrange by the columns, filter the rows using the logical vector from duplicated, and pull the 'rownum' column:
library(dplyr)
times %>%
  mutate(rownum = row_number()) %>%
  arrange(athlete, season, distance, tm) %>%
  filter(!duplicated(select(., -c(tm, rownum)))) %>%
  pull(rownum)
Or, instead of duplicated, use distinct:
times %>%
  mutate(rownum = row_number()) %>%
  arrange(athlete, season, distance, tm) %>%
  distinct(across(-c(tm, rownum)), .keep_all = TRUE) %>%
  pull(rownum)
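Both variants rely on the same idea: after sorting by the grouping columns and then by tm, the first row of each group is its minimum. A minimal base R sketch of that idea on a small made-up times data frame (all values are hypothetical):

```r
# Hypothetical data: three groups, two of them with two times each
times <- data.frame(
  athlete  = c("A", "A", "A", "B", "B"),
  season   = c(2019L, 2019L, 2020L, 2019L, 2019L),
  distance = factor(c(800, 800, 800, 1500, 1500)),
  tm       = c(120.5, 118.2, 119.0, 240.1, 238.7)
)

# Original row numbers in sorted order (group columns first, then time)
ord <- order(times$athlete, times$season, times$distance, times$tm)
sorted <- times[ord, ]

# Within the sorted data, the first row of each group has the lowest time
first_in_group <- !duplicated(sorted[c("athlete", "season", "distance")])
whichmins <- ord[first_in_group]

print(whichmins)
```

Here ord plays the role of the rownum column: indexing it with the logical vector maps the kept rows back to their positions in the unsorted data.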
If we want to use a group-by operation, then after grouping by 'athlete', 'season' and 'distance', slice the row where 'tm' is minimal and pull the 'rownum' column:
times %>%
  mutate(rownum = row_number()) %>%
  group_by(athlete, season, distance) %>%
  # with_ties = FALSE keeps a single row per group even when times are tied
  slice_min(tm, with_ties = FALSE) %>%
  pull(rownum)
Or with summarise:
times %>%
  mutate(rownum = row_number()) %>%
  group_by(athlete, season, distance) %>%
  summarise(rownum = rownum[which.min(tm)]) %>%
  pull(rownum)
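The group-then-which.min idea can also be sketched in base R with split and vapply. The data frame below is made up for illustration, and groups and whichmins are hypothetical names:

```r
# Hypothetical data: three groups, two of them with two times each
times <- data.frame(
  athlete  = c("A", "A", "A", "B", "B"),
  season   = c(2019L, 2019L, 2020L, 2019L, 2019L),
  distance = factor(c(800, 800, 800, 1500, 1500)),
  tm       = c(120.5, 118.2, 119.0, 240.1, 238.7)
)

# Row indices split by every observed combination of the grouping columns
groups <- split(seq_len(nrow(times)),
                times[c("athlete", "season", "distance")],
                drop = TRUE)

# For each group, pick the original row index with the smallest time
whichmins <- vapply(groups,
                    function(i) i[which.min(times$tm[i])],
                    integer(1))

print(sort(unname(whichmins)))
```

The order of the result follows the factor levels of the group combinations, so it is sorted before printing.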
Or using data.table:
library(data.table)
setDT(times)[order(athlete, season, distance, tm),
             .I[!duplicated(.SD[, setdiff(names(.SD), 'tm'), with = FALSE])]]
Or with unique:
unique(setorder(setDT(times, keep.rownames = TRUE),
                athlete, season, distance, tm),
       by = c('athlete', 'season', 'distance'))[, rn]

Simplify a list to a data frame & create new columns from numeric vectors in the list

I have a fairly simple list:
ls <- list(560L, 4163L, 3761L, 287:290, 4467L, 3564L, 200:202)
where each element corresponds to a row in a data frame:
df <- enframe(c("tom", "dick", "harry", "sally", "sarah", "petra", "helen"), value = "name", name = NULL)
Because some row elements of the list contain a numeric vector it's not as easy as converting the list to a data frame and using bind_cols to combine the data.
So, I'd like to be able to simplify the list into a data frame and put each vector element into a column so I can combine with the df. The simplified list from this sample would be a data frame 7 rows by 4 columns. The non-reprex data will change and so the number of columns would represent the number of elements in the longest numeric vector and not just this sample.
Thanks.
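We can use unnest_wider:
library(tidyr)
library(dplyr)
library(purrr) # for set_names()
set_names(ls, df$name) %>%
  tibble(col = .) %>%
  # recent tidyr versions expect names_sep when the vectors are unnamed
  unnest_wider(col, names_sep = "_")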
Or, after stacking into a two-column data.frame, use pivot_wider:
set_names(ls, df$name) %>%
  stack %>%
  group_by(ind) %>%
  mutate(rn = row_number()) %>%
  ungroup %>%
  pivot_wider(names_from = ind, values_from = values)
If we need the opposite orientation:
library(stringr) # for str_c()
df %>%
  mutate(val = ls) %>%
  unnest(val) %>%
  group_by(name) %>%
  mutate(rn = str_c('col', row_number())) %>%
  ungroup %>%
  pivot_wider(names_from = rn, values_from = val)
Or with unnest_wider
library(stringr)
df %>%
mutate(val = ls) %>%
unnest_wider(c(val), names_repair = ~ c('name', str_c('col', 1:4)))
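Since the number of columns should follow the longest vector, it can help to see the padding done explicitly. The following is a base R sketch, not the tidyr approach above: it pads each element with NA up to the longest length and binds the result to the names. The names ls_vals, people, padded and out are hypothetical (ls_vals is used instead of ls to avoid masking base::ls):

```r
# The list from the question, under a name that does not mask base::ls()
ls_vals <- list(560L, 4163L, 3761L, 287:290, 4467L, 3564L, 200:202)
people <- data.frame(name = c("tom", "dick", "harry", "sally",
                              "sarah", "petra", "helen"))

# Width of the result is the length of the longest vector
n <- max(lengths(ls_vals))

# Pad every vector with NA to length n; vapply returns an n x 7 matrix
padded <- t(vapply(ls_vals,
                   function(v) c(v, rep(NA_integer_, n - length(v))),
                   integer(n)))
colnames(padded) <- paste0("col", seq_len(n))

out <- cbind(people, as.data.frame(padded))
print(out)
```

Because n is computed from the data, the column count adapts when the longest vector changes.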

Gather a tibble with matrix columns

My tibble looks like this:
df = tibble(x = 1:3, col1 = matrix(rnorm(6), ncol = 2),
col2 = matrix(rnorm(6), ncol = 2))
It has three columns, two of which each contain a matrix with 2 columns (in my real data there are many more columns; this example just illustrates the problem). I transform the data to long format using gather:
gather(df, key, val, -x)
but this does not give the desired result. It stacks only the first column of col1 and col2 and drops the rest. What I want is for val to contain the row vectors of col1 and col2, i.e. val should be a matrix-valued column (containing 1x2 matrices). The tidyverse, however, does not seem able to deal with matrix-valued columns appropriately. Is there a way to achieve my desired result (ideally using tidyverse routines)?
Some of the columns are matrices. They need to be converted to regular data.frame columns first, and then reshaping works:
library(dplyr)
library(tidyr)
do.call(data.frame, df) %>%
pivot_longer(cols = -x)
Or use gather
do.call(data.frame, df) %>%
gather(key, val, -x)
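To see what do.call(data.frame, df) does to the matrix columns, here is a minimal base R sketch. It builds the matrix columns on a plain data.frame instead of a tibble so that it runs without packages; that substitution is an assumption made for the illustration:

```r
set.seed(1)

# A plain data.frame with two matrix-valued columns (stand-in for the tibble)
df <- data.frame(x = 1:3)
df$col1 <- matrix(rnorm(6), ncol = 2)
df$col2 <- matrix(rnorm(6), ncol = 2)

# Three top-level columns, two of which are matrices
print(ncol(df))

# Passing the columns as arguments to data.frame() expands each matrix
# into one ordinary column per matrix column
flat <- do.call(data.frame, df)

print(names(flat))
```

After flattening, every column is an atomic vector, which is what gather and pivot_longer expect.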
Another option is to convert each matrix to a vector with c and then use unnest:
df %>%
mutate_at(-1, ~ list(c(.))) %>%
unnest(c(col1, col2))
If the 'col1' and 'col2' values should end up in a single column:
df %>%
mutate_at(-1, ~ list(c(.))) %>%
pivot_longer(cols = -x) %>%
unnest(c(value))
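The question asks for a long format in which val is itself a matrix-valued column. That differs from what the options above return. As a hedged base R sketch of one way to get that shape (using a plain data.frame with deterministic values as a stand-in for the tibble, and hypothetical names long, key and val):

```r
# Stand-in data with two matrix-valued columns and easy-to-check values
df <- data.frame(x = 1:3)
df$col1 <- matrix(1:6, ncol = 2)
df$col2 <- matrix(7:12, ncol = 2)

# Long skeleton: every x repeated once per matrix column
long <- data.frame(x   = rep(df$x, times = 2),
                   key = rep(c("col1", "col2"), each = nrow(df)))

# Stack the matrices row-wise so each long row keeps its full row vector
long$val <- rbind(df$col1, df$col2)

print(dim(long$val))
print(long)
```

Each row of long$val is the 1x2 row vector from the corresponding matrix, so no matrix columns are dropped.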

Using replace_na for multiple data subsets

I'm trying to replace the NAs in multiple column variables with randomly generated values from each student_id's subset row data:
data snapshot
So for student 3, systolic needs two NAs replaced. I used the min and max values of each variable within the student 3 subset to generate random values.
library(dplyr)
library(tidyr)
library(tibble)
library(tidyverse)
dplyr::filter(exercise, student_id == "3") %>%
  replace_na(list(systolic = round(sample(runif(1000, 125, 130), 2), 0),
                  diastolic = round(sample(runif(1000, 85, 85), 3), 0),
                  heart_rate = round(sample(runif(1000, 79, 86), 2), 0),
                  phys_score = round(sample(runif(1000, 8, 9), 2), 0)))
However, this works only when a single NA needs replacing (successfully replaced systolic NA values). When I try to replace more than one NA, this error comes up:
Error: Replacement for `systolic` is length 2, not length 1
Is there a way to fix this? I tried converting the column variables to data frames instead of the vectors they are now, but it only returned the original data without any replacement changes.
Are there any simpler ways to do this? Any suggestions or comments would be appreciated. Thanks.
A solution that makes things a little more automated but may be unnecessarily complex.
First, generate some grouped missing data from the mtcars dataset:
library(magrittr)
library(purrr)
library(dplyr)
library(stringr)
library(tidyr)
## Generate some missing data with a subset of car make
mtcars_miss <- mtcars %>%
  as_tibble(rownames = "car") %>%
  select(car) %>%
  separate(car, c("make", "name"), " ") %>%
  bind_cols(mtcars[, -1] %>%
              map_df(~.[sample(c(TRUE, NA), prob = c(0.8, 0.2),
                               size = length(.), replace = TRUE)])) %>%
  filter(make %in% c("Mazda", "Hornet", "Merc"))
A function to replace the NA values of a given variable by sampling between that variable's min and max within each group (here, make):
replace_na_sample <- function(df_miss, var, group = "make") {
  var <- enquo(var)
  df_miss %>%
    group_by(across(all_of(group))) %>%
    # one candidate replacement per row, drawn within the group's range
    mutate(replace_var = round(runif(n(), min(!!var, na.rm = TRUE),
                                     max(!!var, na.rm = TRUE)), 0)) %>%
    # replace_na() accepts only a length-1 replacement, so go row by row
    rowwise() %>%
    mutate_at(.vars = vars(!!var),
              .funs = ~ replace_na(., replace_var)) %>%
    select(-replace_var) %>%
    ungroup()
}
Example replacing several missing values in multiple columns.
mtcars_replaced <- mtcars_miss %>%
replace_na_sample(cyl, group = "make") %>%
replace_na_sample(disp, group = "make") %>%
replace_na_sample(hp, group = "make")
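The core of the fix is that each NA needs its own draw, whereas replace_na takes a single replacement value per column (hence the length error in the question). A package-free sketch of the same idea, assuming a made-up exercise data frame and a hypothetical helper fill_na_runif:

```r
# Hypothetical helper: fill each NA with its own rounded uniform draw
# between the observed min and max of the vector
fill_na_runif <- function(x) {
  miss <- is.na(x)
  # Nothing to do without NAs; no range to sample from if all are missing
  if (any(miss) && !all(miss)) {
    x[miss] <- round(runif(sum(miss),
                           min(x, na.rm = TRUE),
                           max(x, na.rm = TRUE)), 0)
  }
  x
}

# Made-up data: student 3 has two missing systolic values, student 4 has one
exercise <- data.frame(
  student_id = c(3, 3, 3, 3, 4, 4, 4),
  systolic   = c(125, NA, 130, NA, 110, NA, 115)
)

set.seed(1)
# ave() applies the helper separately within each student_id
filled <- ave(exercise$systolic, exercise$student_id, FUN = fill_na_runif)
exercise$systolic <- filled

print(exercise)
```

Groups in which every value is missing are left as NA here, since there is no observed range to sample from.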
