Split data frame column when square brackets - r

I have a data frame with some model estimates. Depending on the observation, the estimate is either a single value or a value followed by a confidence interval in square brackets. The variable is currently a character column (I guess I will need to convert it at some point).
df <- data.frame(x = c("5", "3", "8 [3 - 5]"))
I would like to split this data frame column (x) into two columns: one for the estimated values (y) and one for the confidence interval, with or without brackets (z).
I have tried tidyr::separate and tidyr::split (I am a big fan of the dplyr family :-)), but I do not get the desired result.
tidyr::separate(col=x,into=c("y","z"),sep="//[")
Do you know what I am doing wrong?

This can be done with extract
library(tidyr)
extract(df, x, into = c("y", "z"), "(\\d+)\\s*(.*)")
Or use the extra argument in separate
separate(df, x, into = c("y", "z"), "\\s+", extra = "merge")
data
df <- data.frame(x= c("5","3","8 [3 - 5]"))
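Since the question also mentions converting the character column later, here is a base R sketch of the same split that returns the estimate as a number. The column names y and z follow the question; the regular expressions are my own and assume the estimate is a plain non-negative number at the start of the string.

```r
df <- data.frame(x = c("5", "3", "8 [3 - 5]"), stringsAsFactors = FALSE)

# Estimate: the leading run of digits (and decimal point), converted to numeric
df$y <- as.numeric(sub("^\\s*([0-9.]+).*$", "\\1", df$x))

# Interval: the bracketed part, or NA when the row has no brackets
m <- regexpr("\\[.*\\]", df$x)
df$z <- NA_character_
df$z[m != -1] <- regmatches(df$x, m)

df
```

To drop the brackets from z, a further gsub("\\[|\\]", "", df$z) should be enough.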

Here ya go:
library("stringr")
df <- data.frame(c("5", "3", "8 [3 - 5]"))
df2 = str_split_fixed(string = df[,1], pattern = "\\[", n = 2)
df2[,2] = gsub(pattern = "\\]", replacement = "", x = df2[,2])

Related

Convert characters like "84+3" into numeric variables using R

I have a large data.frame with several variables like "89+2" (all two-digit integer + one-digit integer) and I'm trying to quickly convert to numeric variables. Realistically, either just eliminating the second numeric OR performing the calculation and adding them together would work... Bit of an R newbie. Any help appreciated.
example:
df$LM = c("91+2", "89+3", "88+2")
Looking for
df$LM_num = c(91, 89, 88)
or
df$LM_num = c(93, 92, 90)
We can use separate
library(tidyr)
library(dplyr)
separate(df, LM, into = c("LM_num1", "LM_num2"), convert = TRUE) %>%
mutate(LM_num = LM_num1 + LM_num2)
Or with parse_number
library(readr)
df$LM_num <- parse_number(df$LM)
Or another option is eval(parse(...))
df$LM_num <- sapply(df$LM, function(x) eval(parse(text = x)))
Assuming df is defined reproducibly as in the Note at the end, use the first line below to get the first number or the second line to get the sum. Both read the LM column as if it were a file, splitting on + to create a two-column data frame. The first line extracts the first column, whereas the second adds the two columns. No packages are used.
transform(df, LM_num = read.table(text = LM, sep = "+")[[1]])
transform(df, LM_num = rowSums(read.table(text = LM, sep = "+")))
Note
df <- data.frame(LM = c("91+2", "89+3", "88+2"))
Another option would be:
x <- '92+3'
sum(as.numeric(strsplit(x, split = '+',fixed = TRUE)[[1]]))
In case of having a data.frame:
df <- data.frame(LM = c("91+2", "89+3", "88+2"))
df$sum <- sapply(seq_len(nrow(df)),
function(i) sum(as.numeric(strsplit(df$LM, split = '+', fixed = TRUE)[[i]])))
# LM sum
# 1 91+2 93
# 2 89+3 92
# 3 88+2 90
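The loop above calls strsplit on the whole column once per row. A minimal sketch of the same idea that splits only once is shown here; the column name first is my own addition, covering the other result the question asks for.

```r
df <- data.frame(LM = c("91+2", "89+3", "88+2"), stringsAsFactors = FALSE)

# Split every string once on the literal "+"
parts <- strsplit(df$LM, split = "+", fixed = TRUE)

# Sum of the numeric pieces of each element
df$sum <- vapply(parts, function(p) sum(as.numeric(p)), numeric(1))

# First number only (hypothetical column name)
df$first <- vapply(parts, function(p) as.numeric(p[1]), numeric(1))

df
```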

Flipping two sides of string

I need to prepare a certain dataset for analysis. What I have is a table with column names (obviously). The column names are as follows (sample colnames):
"X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM"
(this is a vector, for those not familiar with R's colnames() function)
Now, what I want is simply to flip the values in front of, and after the underscore. e.g. X99_NORM becomes NORM_X99. Note that I want this only for the column names which contain NORM in their name.
Some other base R options
1)
Use sub to switch the beginning and end - we can make use of capturing groups here.
x <- sub(pattern = "(^X\\d+)_(NORM$)", replacement = "\\2_\\1", x = x)
Result
x
# [1] "NORM_X99" "NORM_X101" "X76_110_T02_09747" "NORM_X30"
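Because the question is about column names rather than a free-standing vector, the same substitution can be applied to names() directly. This is only a sketch: dat is a hypothetical data frame built with the names from the question, and [0-9] is used in place of \\d.

```r
# Hypothetical data frame with the column names from the question
dat <- data.frame(X99_NORM = 1, X101_NORM = 2, X76_110_T02_09747 = 3, X30_NORM = 4)

# Swap the two captured groups; names that do not match are left unchanged
names(dat) <- sub("^(X[0-9]+)_(NORM)$", "\\2_\\1", names(dat))

names(dat)
```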
2)
A regex-free approach that might be more efficient, using chartr, dirname and paste. We first need the indices of the columns that contain "NORM".
idx <- grep(x = x, pattern = "NORM", fixed = TRUE)
x[idx] <- paste0("NORM_", dirname(chartr("_", "/", x[idx])))
x
data
x <- c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")
x = c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")
replace(x,
grepl("NORM", x),
sapply(strsplit(x[grepl("NORM", x)], "_"), function(x){
paste(rev(x), collapse = "_")
}))
#[1] "NORM_X99" "NORM_X101" "X76_110_T02_09747" "NORM_X30"
A tidyverse solution with stringr:
library(tidyverse)
library(stringr)
my_data <- tibble(column = c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM"))
my_data %>%
filter(str_detect(column, "NORM")) %>%
mutate(column_2 = paste0("NORM", "_", str_extract(column, ".+(?=_)"))) %>%
select(column_2)
# A tibble: 3 x 1
#   column_2
#   <chr>
# 1 NORM_X99
# 2 NORM_X101
# 3 NORM_X30

Selecting multiple columns using Regular Expressions

I have variables with names such as r1a r3c r5e r7g r9i r11k r13g r15i etc. I am trying to select the variables whose names start with r5 through r12 and create a data frame from them in R.
The best code that I could write to get this done is,
data %>% select(grep("r[5-9][^0-9]" , names(data), value = TRUE ),
grep("r1[0-2]", names(data), value = TRUE))
Given that my experience with regular expressions spans only a day, I was wondering if anyone could help me write better, more compact code for this!
Here's a regex that gets all the columns at once:
data %>% select(grep("r([5-9]|1[0-2])", names(data), value = TRUE))
The vertical bar represents an 'or'.
As the comments have pointed out, this will fail for items such as r51, and can also be shortened. Instead, you will need a slightly longer regex:
data %>% select(matches("r([5-9]|1[0-2])([^0-9]|$)"))
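To see what that pattern keeps without needing a data frame, it can be tried on a plain character vector in base R. In this sketch nms stands in for names(data), r51a is added to cover the case raised in the comments, and the leading ^ anchor is my own addition.

```r
# Names from the question plus r51a, the case raised in the comments
nms <- c("r1a", "r3c", "r5e", "r7g", "r9i", "r11k", "r13g", "r15i", "r51a")

# r, then 5-9 or 10-12, then a non-digit or the end of the name
keep <- grepl("^r([5-9]|1[0-2])([^0-9]|$)", nms)

nms[keep]
# [1] "r5e"  "r7g"  "r9i"  "r11k"
```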
Suppose that in the code below x represents your names(data). Then the following will do what you want.
# The names of 'data'
x <- scan(what = character(), text = "r1a r3c r5e r7g r9i r11k r13g r15i")
y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.numeric(y[sapply(y, `!=`, "")])
x[y > 4]
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"
EDIT.
You can generalize the code above into a function. It takes three arguments: the first is the vector of variable names, and the second and third are the lower and upper limits of the numbers you want to keep.
var_names <- function(x, from = 1, to = Inf){
y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.integer(y[sapply(y, `!=`, "")])
x[from <= y & y <= to]
}
var_names(x, 5)
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"
Remove the non-digits, scan the remainder in and check whether each is in 5:12 :
DF <- data.frame(r1a=1, r3c=2, r5e=3, r7g=4, r9i=5, r11k=6, r13g=7, r15i=8) # test data
DF[scan(text = gsub("\\D", "", names(DF)), quiet = TRUE) %in% 5:12]
## r5e r7g r9i r11k
## 1 3 4 5 6
Using magrittr it could also be written like this:
library(magrittr)
DF %>% .[scan(text = gsub("\\D", "", names(.)), quiet = TRUE) %in% 5:12]
## r5e r7g r9i r11k
## 1 3 4 5 6

R: Convert frequency to percentage with only a selected number of columns

I would like to convert a data frame filled with frequencies into a data frame filled with row-wise percentages using dplyr.
My data set also contains other variables, and I only want to calculate the percentages for a set of columns defined by a vector of names. I also want to stay within the dplyr library.
sim_dat <- function() abs(floor(rnorm(26)*10))
df <- data.frame(a = letters, b = sim_dat(), c = sim_dat(), d = sim_dat()
, z = LETTERS)
names_to_transform <- names(df)[2:4]
df2 <- df %>%
mutate(sum_freq_codpos = rowSums(.[names_to_transform])) %>%
mutate_each(function(x) x / sum_freq_codpos, names_to_transform)
# does not work
Any idea on how to do it? I have tried with mutate_at and mutate_each but I can't get it to work.
You're almost there!
df2 <- df %>%
mutate(sum_freq_codpos = rowSums(.[names_to_transform])) %>%
mutate_at(names_to_transform, funs(./sum_freq_codpos))
The dot . roughly translates to "the object I am manipulating here", which in this call is "the current column from names_to_transform".
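In dplyr 1.0.0 and later, funs() is deprecated and mutate_at() is superseded by across(). A sketch of the same calculation in that style follows; it assumes dplyr >= 1.0.0 is installed and uses a small fixed data frame instead of the simulated one so the result is easy to check.

```r
library(dplyr)

# Small fixed example in the same shape as the question's data
df <- data.frame(a = c("u", "v"), b = c(1, 2), c = c(3, 2), d = c(4, 4),
                 z = c("U", "V"))
names_to_transform <- c("b", "c", "d")

df2 <- df %>%
  # Row total over the selected columns only
  mutate(sum_freq_codpos = rowSums(across(all_of(names_to_transform)))) %>%
  # Divide each selected column by that row total
  mutate(across(all_of(names_to_transform), ~ .x / sum_freq_codpos))

df2
```

Rows whose selected columns sum to zero would give NaN, so the simulated data in the question may need a guard for that case.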

split column if variable number of pieces data.frame

I want to split column y of df below on the '_' character, but my data is incomplete. (df is just a representative portion of a bigger data.frame.)
df <- data.frame(x = 1:10,
y = c("vuh_ftu_yefq", "sos_nvtspb", "pfymm_ucms",
"tucbexcqzh", "n_zndbhoun", "wdetzaolvn",
"lvohrpdqns", "wso_bsqwvr", "wx_gbkbxjl",
"t_dbxkkvge"))
I have tried using:
df$z <- strsplit(df$y,'_')
But I get an error because the number of pieces in each list are different.
How can I do this?
Assumptions:
A closing ) was needed to complete df in your example.
Incomplete data means it is filled in from the left, so a value with no '_' is the first datum.
tidyr's separate():
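library(tidyr)
result <- separate(df, y, into = c("z1","z2","z3") , sep ='_', extra = "drop")
The key here is extra = "drop", which silently drops any surplus pieces so that you always get length(into) columns. Rows with too few pieces are padded with NA on the right; current versions of tidyr warn about this unless you also pass fill = "right".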
data.table's tstrsplit()
library(data.table)
DT <- as.data.table(df)
result <- DT[, c("z1", "z2","z3") := tstrsplit(y, '_', fixed=TRUE)][]
The default behaviour of tstrsplit() does what you need; fixed = TRUE is passed on to the underlying strsplit() to keep things fast.
Note: if your incomplete data is filled from the right, you need to unmix your variables here!
You could use the separate function from tidyr.
# required package
require(tidyr)
# separate (removing the y column)
separate(df, y, paste0("z", 1:3), sep = "_", extra = "merge")
# separate without removing the y column
separate(df, y, paste0("z", 1:3), sep = "_", extra = "merge", remove = FALSE)
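For comparison, the same left-filled split can be sketched in base R by padding each strsplit result to a common length. It uses three rows of the question's data; the column names z1 to z3 follow the answers above, and it assumes at least one row splits into more than one piece.

```r
df <- data.frame(x = 1:3,
                 y = c("vuh_ftu_yefq", "sos_nvtspb", "tucbexcqzh"),
                 stringsAsFactors = FALSE)

# One character vector of pieces per row
pieces <- strsplit(df$y, "_", fixed = TRUE)
n <- max(lengths(pieces))

# Extending the length of a vector pads it with NA on the right
mat <- t(vapply(pieces, function(p) { length(p) <- n; p }, character(n)))
colnames(mat) <- paste0("z", seq_len(n))

res <- cbind(df, mat)
res
```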

Resources