R strsplit function in a data frame - r

I have created a data frame, and now I want to split its first column at the ":" so that the second part goes into a new column.
data frame:
unc.edu.0057f9f7-779b-4914-8290-abbad2a0d81e.2556919.rsem.genes.normalized_results:ASL|435 214.4421
unc.edu.0057f9f7-779b-4914-8290-abbad2a0d81e.2556919.rsem.genes.normalized_results:ASS1|445 2863.8055
unc.edu.0057f9f7-779b-4914-8290-abbad2a0d81e.2556919.rsem.genes.normalized_results:OTC|5009 0
unc.edu.050c2191-b96c-41e7-abdb-e52cbe82f268.2456235.rsem.genes.normalized_results:ASL|435 332.7522
unc.edu.050c2191-b96c-41e7-abdb-e52cbe82f268.2456235.rsem.genes.normalized_results:ASS1|445 3322.629
unc.edu.050c2191-b96c-41e7-abdb-e52cbe82f268.2456235.rsem.genes.normalized_results:OTC|5009 0
desired output:
unc.edu.0057f9f7-779b-4914-8290-abbad2a0d81e.2556919.rsem.genes.normalized_results ASL|435 214.4421
unc.edu.0057f9f7-779b-4914-8290-abbad2a0d81e.2556919.rsem.genes.normalized_results ASS1|445 2863.8055
unc.edu.0057f9f7-779b-4914-8290-abbad2a0d81e.2556919.rsem.genes.normalized_results OTC|5009 0
unc.edu.050c2191-b96c-41e7-abdb-e52cbe82f268.2456235.rsem.genes.normalized_results ASL|435 332.7522
unc.edu.050c2191-b96c-41e7-abdb-e52cbe82f268.2456235.rsem.genes.normalized_results ASS1|445 3322.629
unc.edu.050c2191-b96c-41e7-abdb-e52cbe82f268.2456235.rsem.genes.normalized_results OTC|5009 0
I have tried
strsplit(df$V1, split = "\\:")
but I get this error: Error in strsplit(t$V1, split = "\:") : non-character argument. Thank you.

The error occurs because the column is of class factor. Convert it to character and it should work:
lst <- strsplit(as.character(df$V1), split = ":", fixed = TRUE)
If we need to create two columns, one easy way is with read.table
df1 <- read.table(text = as.character(df$V1), sep=":", stringsAsFactors=FALSE)
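Or using separate from tidyr. Pass sep = ":" explicitly, because the default separator matches any run of non-alphanumeric characters and would also split on the "." and "-" in these strings:
library(tidyr)
# "file" and "gene" are illustrative column names
separate(df, V1, into = c("file", "gene"), sep = ":")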

tidyr::separate(data = df, col = V1, into = c('a', 'b'), sep = ':')
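For reference, a minimal base R sketch of the factor problem and its fix. The two-row df, its shortened sample names, and the output column names (sample, gene, value) are assumptions made up for illustration, not the asker's real data.

```r
# Hypothetical stand-in for the asker's data; V1 is deliberately stored as a factor
df <- data.frame(V1 = c("sampleA:ASL|435", "sampleB:OTC|5009"),
                 V2 = c(214.4421, 0),
                 stringsAsFactors = TRUE)

# strsplit() only accepts character vectors, so a factor raises an error
err <- tryCatch(strsplit(df$V1, split = ":", fixed = TRUE),
                error = function(e) conditionMessage(e))

# Converting to character first makes the split work
parts <- strsplit(as.character(df$V1), split = ":", fixed = TRUE)

# Bind the pieces row-wise into a two-column character matrix
m <- do.call(rbind, parts)
result <- data.frame(sample = m[, 1], gene = m[, 2], value = df$V2,
                     stringsAsFactors = FALSE)
print(err)
print(result)
```

This assumes every string contains exactly one ":", so each split yields two pieces.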

Related

Convert characters like "84+3" into numeric variables using R

I have a large data.frame with several variables like "89+2" (all two-digit integer + one-digit integer) and I'm trying to quickly convert to numeric variables. Realistically, either just eliminating the second numeric OR performing the calculation and adding them together would work... Bit of an R newbie. Any help appreciated.
example:
df$LM = c("91+2", "89+3", "88+2")
Looking for
df$LM_num = c(91, 89, 88)
or
df$LM_num = c(93, 92, 90)
We can use separate
library(tidyr)
library(dplyr)
separate(df, LM, into = c("LM_num1", "LM_num2"), convert = TRUE) %>%
mutate(LM_num = LM_num1 + LM_num2)
Or with parse_number
library(readr)
df$LM_num <- parse_number(df$LM)
Or another option is eval(parse(...)):
df$LM_num <- sapply(df$LM, function(x) eval(parse(text = x)))
Assuming df is as given reproducibly in the Note at the end, use the first line below to get the first number, or the second line to get the sum. Both read the LM column as if it were a file, splitting on + to create a two-column data frame. The first line extracts the first column, whereas the second line adds the two columns. No packages are used.
transform(df, LM_num = read.table(text = LM, sep = "+")[[1]])
transform(df, LM_num = rowSums(read.table(text = LM, sep = "+")))
Note
df <- data.frame(LM = c("91+2", "89+3", "88+2"))
Another option would be:
x <- '92+3'
sum(as.numeric(strsplit(x, split = '+',fixed = TRUE)[[1]]))
In case of having a data.frame:
df <- data.frame(LM = c("91+2", "89+3", "88+2"))
# split the whole column once, then sum the numeric pieces of each row
df$sum <- sapply(strsplit(df$LM, split = '+', fixed = TRUE),
function(p) sum(as.numeric(p)))
# LM sum
# 1 91+2 93
# 2 89+3 92
# 3 88+2 90
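As a package-free sketch that produces both of the requested results side by side, the snippet below takes the first number with a regular expression and the sum with a fixed-string split. The column names LM_first and LM_sum are illustrative assumptions. Unlike eval(parse()), it never evaluates the cell contents as R code.

```r
df <- data.frame(LM = c("91+2", "89+3", "88+2"), stringsAsFactors = FALSE)

# First number: drop everything from the "+" onward, then convert
df$LM_first <- as.numeric(sub("\\+.*$", "", df$LM))

# Sum: split each string on a literal "+" and add the numeric pieces
df$LM_sum <- vapply(strsplit(df$LM, "+", fixed = TRUE),
                    function(p) sum(as.numeric(p)),
                    numeric(1))
print(df)
```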

How to merge all dataframe columns into a single one in R

I know how to manually merge specific columns of a dataframe into a single column:
df_new <- data.frame(paste(df$a, df$b, df$c))
My question is how can I do this dynamically with all of the dataframe's columns?
You can use do.call: ‘do.call’ constructs and executes a function call from a name or a function and a list of arguments to be passed to it.
do.call(paste, df)
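A small sketch of why this works: a data frame is a list of columns, so do.call passes each column to paste as a separate argument. The example data frame here is an assumption for illustration. Extra arguments such as sep can be appended to the list with c().

```r
df <- data.frame(a = c("x", "y"), b = c("p", "q"), c = 1:2)

# Manual version with the columns spelled out
manual <- paste(df$a, df$b, df$c)

# Dynamic version: every column becomes one argument to paste()
dynamic <- do.call(paste, df)

# Appending sep to the argument list changes the separator
with_sep <- do.call(paste, c(df, sep = "-"))

print(dynamic)
print(with_sep)
```

One caveat of this design: a column that happens to be named sep or collapse would be interpreted as that paste argument rather than as data.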
A solution from the tidyverse could be tidyr::unite():
df <- data.frame(x = letters[1:4], y = LETTERS[1:4], z = 1:4)
df_new <- tidyr::unite(df, col = "union", sep = " ")
where col is the name of the newly constructed column in the dataframe. sep is equivalent to its use in paste.

Splitting string in column by separator and adding those as new columns in the same data frame using R

I have a column in data frame df with values like 'name>year>format'. I want to split this column on > and add the pieces as new columns named name, year, and format. How can I do this in R?
You can do that easily using the separate function from tidyr:
library(tidyr)
library(dplyr)
data <-
data.frame(
A = c("Joe>1993>student")
)
data %>%
separate(A, into = c("name", "year", "format"), sep = ">", remove = FALSE)
# A name year format
# Joe>1993>student Joe 1993 student
If you do not want the original column in the resulting data frame, change remove to TRUE.
An option is read.table in base R
cbind(df, read.table(text = as.character(df$column), sep=">",
header = FALSE, col.names = c("name", "year", "format")))
If your data is big, it is a good idea to use data.table, as it is very fast.
If you know how many fields your "combined" column has (suppose it has 3):
library(data.table)
# the 1:3 should be replaced by 1:n, where n is the number of fields
dt1[, paste0("V", 1:3) := tstrsplit(y, split = ">", fixed = TRUE)]
If you DON'T know in advance how many fields the column has:
Now we can get some help from the stringi package:
library(data.table)
library(stringi)
maxFields <- dt2[, max(stri_count_fixed(y, ">")) + 1]
dt2[, paste0("V", 1:maxFields) := tstrsplit(y, split = ">", fixed = TRUE, fill = NA)]
Data used:
library(data.table)
dt1 <- data.table(x = c("A", "B"), y = c("letter>2018>pdf", "code>2020>Rmd"))
dt2 <- rbind(dt1, data.table(x = "C", y = "report>2019>html>pdf"))
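The unknown-number-of-fields case can also be sketched in base R, assuming the same example values as dt2 but held in a plain data frame. The idea is to split every string, find the longest result, and pad the shorter ones with NA on the right before binding. The V1..Vn column names mirror the data.table answer.

```r
df <- data.frame(x = c("A", "B", "C"),
                 y = c("letter>2018>pdf", "code>2020>Rmd", "report>2019>html>pdf"),
                 stringsAsFactors = FALSE)

# Split each string on a literal ">"
parts <- strsplit(df$y, ">", fixed = TRUE)

# The widest row determines how many columns are needed
max_fields <- max(lengths(parts))

# Pad shorter rows with NA so every row has max_fields entries
padded <- lapply(parts, function(p) c(p, rep(NA_character_, max_fields - length(p))))

m <- do.call(rbind, padded)
colnames(m) <- paste0("V", seq_len(max_fields))
result <- cbind(df, as.data.frame(m, stringsAsFactors = FALSE))
print(result)
```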

assign dynamic variable names using a loop and mutate from dplyr

I would like to split a character field into individual variables, one for each character in a string.
library(dplyr)
temp1 <- data.frame(a = c('dedefdewfe' , 'rewewqreqw'))
for(i in 1:10){
temp1 <- temp1 %>%
mutate(paste('v' , i , sep = '') = substr(a , i , i))
}
The resulting data frame would have 11 variables: the original a, plus v1 through v10.
tidyr::separate is good for this. You can't split on an empty string, but you can specify splitting positions ...
library(tidyr)
library(dplyr)
temp1 %>%
mutate(b=a) %>% ## make a copy
separate(b,into=paste0("v",1:10),sep=1:9)
(probably better practice to use nc <- nchar(temp1$a[1]) and then use nc, nc-1 instead of 10, 9 respectively)
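For the loop-with-dynamic-names idea in the question, a base R sketch is shown below. It assigns each new column through [[ ]], which accepts a computed name. It assumes every string in a has the same length as the first one.

```r
temp1 <- data.frame(a = c("dedefdewfe", "rewewqreqw"), stringsAsFactors = FALSE)

# Number of characters, taken from the first string (assumed equal for all rows)
nc <- nchar(temp1$a[1])

# [[<- accepts a computed column name, so the loop can build v1..v10
for (i in seq_len(nc)) {
  temp1[[paste0("v", i)]] <- substr(temp1$a, i, i)
}
print(temp1)
```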

split column if variable number of pieces data.frame

I want to split column y of df below on '_', but my data is incomplete. (df is just a representative portion of a bigger data.frame.)
df <- data.frame(x = 1:10,
y = c("vuh_ftu_yefq", "sos_nvtspb", "pfymm_ucms",
"tucbexcqzh", "n_zndbhoun", "wdetzaolvn",
"lvohrpdqns", "wso_bsqwvr", "wx_gbkbxjl",
"t_dbxkkvge"))
I have tried using:
df$z <- strsplit(df$y,'_')
But I get an error because the number of pieces in each list are different.
How can I do this?
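Assumptions:
A closing ) was needed to complete df in your example.
Incomplete data is filled in from the left, so a value with no intervening '_' is the first datum.
tidyr's separate():
library(tidyr)
result <- separate(df, y, into = c("z1","z2","z3"), sep = '_', extra = "drop", fill = "right")
The key arguments are extra and fill. In current versions of tidyr, extra = "drop" discards any pieces beyond length(into), and fill = "right" pads rows that have too few pieces with NA on the right, so every row ends up with length(into) pieces. (In older versions of tidyr, extra = "drop" alone was documented as doing both.)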
data.table's tstrsplit():
library(data.table)
DT <- as.data.table(df)
result <- DT[, c("z1", "z2","z3") := tstrsplit(y, '_', fixed=TRUE)][]
The default behaviour of tstrsplit() does what you need, and fixed=TRUE is passed to strsplit() underneath to keep things fast.
Note: if your incomplete data is filled from the right instead, you will need to rearrange the resulting columns.
You could use the separate function from tidyr.
# required package
require(tidyr)
# separate (removing the y column)
separate(df, y, paste0("z", 1:3), sep = "_", extra = "merge")
# separate without removing the y column
separate(df, y, paste0("z", 1:3), sep = "_", extra = "merge", remove = FALSE)
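To make the left-versus-right padding question concrete, here is a base R sketch using three of the example values. The helper pad_pieces and its na_side argument are hypothetical names introduced for illustration. na_side = "right" keeps the values in the leftmost columns, and na_side = "left" pushes them to the rightmost columns.

```r
df <- data.frame(x = 1:3,
                 y = c("vuh_ftu_yefq", "sos_nvtspb", "tucbexcqzh"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: pad a vector of pieces to length n with NA on the chosen side
pad_pieces <- function(p, n, na_side = c("right", "left")) {
  na_side <- match.arg(na_side)
  gap <- rep(NA_character_, n - length(p))
  if (na_side == "right") c(p, gap) else c(gap, p)
}

parts <- strsplit(df$y, "_", fixed = TRUE)
n <- max(lengths(parts))

# Values occupy the first columns, NA fills the rest
na_right <- do.call(rbind, lapply(parts, pad_pieces, n = n, na_side = "right"))

# Values occupy the last columns, NA fills the front
na_left <- do.call(rbind, lapply(parts, pad_pieces, n = n, na_side = "left"))

print(na_right)
print(na_left)
```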

Resources