I have the following string structure as a column value in my data frame:
Y: 10 ,W: 3 , cp: 0.05
The numeric values at each row differ, but the structure remains the same. I want to split this string into 3 columns, each containing only the numbers: one column for Y with the corresponding numeric value, another for W, and the last for cp.
I have tried using str_split in the following way:
str_split(string, pattern = " ,", simplify = TRUE)
which obviously gives me:
     [,1]    [,2]   [,3]
[1,] "Y: 10" "W: 3" " cp: 0.05"
Now, I want to keep only the numbers in each of those columns. Still learning this stuff so not sure how to proceed! Any help is highly appreciated!
There are definitely nicer ways, but this should do the job. Now updated for a string vector with more than one element, bringing the result into a matrix with three named columns; it should work on vectors of any length.
library(stringr)
string <- c("Y: 10 ,W: 3 , cp: 0.05","Y: 4 ,W: 9 , cp: 2.2")
vec <- t(str_split(str_split(string, " ,", simplify = TRUE), ": ", simplify = TRUE)[, 2])
mtx <- matrix(vec, nrow = length(vec) / 3, ncol = 3)
colnames(mtx) <- c("Y","W","cp")
mtx
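For the two-element string above, mtx should print something like:

     Y    W   cp
[1,] "10" "3" "0.05"
[2,] "4"  "9" "2.2"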
Maybe not the most elegant way but it works:
library(dplyr)
library(stringr)
library(tidyr)
tibble(row = c(1,2),
col = c("Y: 10 ,W: 3 , cp: 0.05","Y: 4 ,W: 9 , cp: 2.2")) %>%
separate(col, into=c("col1", "col2", "col3"), sep = ",") %>%
gather(id, col, -row) %>%
select(-id) %>%
mutate(col = str_trim(col)) %>%
separate(col, into=c("letter", "number"), sep=":") %>%
mutate(number = str_trim(number)) %>%
spread(letter, number) %>%
select(-row)
# A tibble: 2 x 3
cp W Y
<chr> <chr> <chr>
1 0.05 3 10
2 2.2 9 4
Note that I had to add a new column named row to your data frame to make this approach work.
I find that reformatting name: value pair data into an existing standard structure sometimes helps to take care of the complexity. In this case, I've formatted it as a JSON object and then used stream_in from jsonlite to deal with the data.
This is nice because it will automatically name the columns, and also takes care of occasions when not every value is represented in every row, or the order changes. E.g.:
txt <- c(
"Y: 10 ,W: 3 , cp: 0.05",
"Y: 6 ,W: 7 , cp: 0.08",
"cp: 0.08, Y: 6 "
)
library(jsonlite)
proctxt <- paste("{", gsub("([A-Za-z]+?):", '"\\1":', txt), "}")
stream_in(textConnection(proctxt))
# Found 3 records...
# Imported 3 records. Simplifying...
# Y W cp
#1 10 3 0.05
#2 6 7 0.08
#3 6 NA 0.08
You can remove all unneeded characters, e.g. with gsub, and then use strsplit or read.csv. In base R it would look like:
string <- c("Y: 10 ,W: 3 , cp: 0.05", "Y: 10 ,W: 3 , cp: 0.05")
read.csv(text=gsub("[[:alpha:]: ]", "", string), header=FALSE)
# V1 V2 V3
#1 10 3 0.05
#2 10 3 0.05
#or with strsplit
strsplit(gsub("[[:alpha:]: ]", "", string), ",")
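The strsplit version returns a list of character vectors; wrapping it in lapply with as.numeric converts them, e.g.:

lapply(strsplit(gsub("[[:alpha:]: ]", "", string), ","), as.numeric)
#[[1]]
#[1] 10.00  3.00  0.05
#
#[[2]]
#[1] 10.00  3.00  0.05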
Given that your text strings are uniform, this should be relatively simple to do. The first part would look like this:
txt <- c(
"Y: 10 ,W: 3 , cp: 0.05",
"Y: 6 ,W: 7 , cp: 0.08",
"Y: 5 ,W: 0 , cp: 0.08"
)
x <- do.call(rbind, strsplit(txt, split = " ,"))
That gives you a matrix of your "label: value" pairs:
library(stringr)
y <- matrix(data = str_extract(string = x,
pattern = "([0-9.]+)"),
ncol = ncol(x))
This gets you text strings containing your values. If you want, you can use str_extract() without the matrix call to get your values as a vector. Then:
z <- matrix(data = as.numeric(y),
ncol = ncol(x))
will get you your matrix as numerics, which it sounds like is what you're interested in.
All together it's fairly tidy, and without the intermediate matrix call, if you don't need that, it would look like:
library(stringr)
txt <- c(
"Y: 10 ,W: 3 , cp: 0.05",
"Y: 6 ,W: 7 , cp: 0.08",
"Y: 5 ,W: 0 , cp: 0.08"
)
x <- do.call(rbind, strsplit(txt, split = " ,"))
y <- str_extract(string = x,
pattern = "([0-9.]+)")
z <- matrix(data = as.numeric(y),
ncol = ncol(x))
With z giving you a matrix of numerics.
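For the three-row txt above, z should look roughly like:

#      [,1] [,2] [,3]
# [1,]   10    3 0.05
# [2,]    6    7 0.08
# [3,]    5    0 0.08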
I believe this should work:
library(tidyverse)
string <- c("Y: 10 ,W: 3 , cp: 0.05","Y: 4 ,W: 9 , cp: 2.2")
dat <- tibble(x = string) %>%
separate(x,c("Y","W","cp"), sep = " ,")
dat2 <- dat %>% mutate_all(., ~str_remove(.,"\\D+"))
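For the example string vector, dat2 should be a tibble of character columns, roughly:

# A tibble: 2 x 3
#   Y     W     cp
#   <chr> <chr> <chr>
# 1 10    3     0.05
# 2 4     9     2.2

If you need numerics, add e.g. mutate_all(as.numeric) at the end.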
The Problem
I'm trying to come up with a neat/fast way to read files delimited by newline (\n) characters into more than one column.
Essentially, multiple rows in the input file should become a single row in the output; however, most file-reading functions sensibly interpret the newline character as signifying a new row, so the result ends up as a data frame with a single column. Here's an example:
The input files look like this:
Header Info
2021-01-01
text
...
#
2021-01-02
text
...
#
...
Where the ... represents potentially multiple rows in the input file, and the # signifies what should really be the end of a row in the output data frame. So upon reading this file, it should become a data frame like this (ignoring the header):
X1          X2    ...   Xn
2021-01-01  text  ...   ...
2021-01-02  text  ...   ...
...         ...   ...   ...
My attempt
I've tried base, data.table, readr and vroom, and they all produce one of two outputs: either a data frame with a single column, or a vector. I want to avoid a for loop, so my current solution uses base::readLines() to read the file as a character vector, then manually adds some "proper" column separators (e.g. ;), and then joins and splits again.
# Save the example data to use as input
writeLines(c("Header Info", "2021-01-01", "text", "#", "2021-01-02", "text", "#"), "input.txt")
input <- readLines("input.txt")
input <- paste(input[2:length(input)], collapse = ";") # Skip the header
input <- gsub(";#;*", replacement = "\n", x = input)
input <- strsplit(unlist(strsplit(input, "\n")), ";")
input <- do.call(rbind.data.frame, input)
# Clean up the example input
unlink("input.txt")
My code above works and gives the desired result, but surely there's a better way?
Edit: This is internal to a function, so part (perhaps the larger part) of the intention of any simplification is to improve speed.
Thanks in advance!
1) Read in the data, locate the # signs to give the logical variable at, and then create a grouping variable g which has a distinct value for each desired line. Finally, use tapply with paste to rework the data into lines that can be read using read.table, and read it. (If there are commas in the data then use some other separating character.)
L <- readLines("input.txt")[-1]
at <- grepl("#", L)
g <- cumsum(at)
read.table(text = tapply(L[!at], g[!at], paste, collapse = ","), sep = ",")
giving this data frame:
V1 V2
1 2021-01-01 text
2 2021-01-02 text
2) Another approach is to rework the data into dcf form by removing the # sign and prefacing other lines with their column name and a colon. Then use read.dcf. cnames is a character vector of column names that you want to use.
cnames <- c("Date", "Text")
L <- readLines("input.txt")[-1]
LL <- sub("#", "", paste0(c(paste0(cnames, ": "), ""), L))
DF <- as.data.frame(read.dcf(textConnection(LL)))
DF[] <- lapply(DF, type.convert, as.is = TRUE)
DF
giving this data frame:
Date Text
1 2021-01-01 text
2 2021-01-02 text
3) This approach simply reshapes the data into a matrix and then converts it to a data frame. Note that (1) converts numeric columns to numeric whereas this one just leaves them as character.
L <- readLines("input.txt")[-1]
k <- grep("#", L)[1]
as.data.frame(matrix(L, ncol = k, byrow = TRUE))[, -k]
## V1 V2
## 1 2021-01-01 text
## 2 2021-01-02 text
Benchmark
The question did not mention speed as a consideration, but in a comment it was later mentioned. Based on the data in the benchmark below, (1) runs over twice as fast as the code in the question and (3) runs more than 20x faster.
library(microbenchmark)
writeLines(c("Header Info",
rep(c("2021-01-01", "text", "#", "2021-01-02", "text", "#"), 10000)),
"input.txt")
microbenchmark(times = 10,
ques = {
input <- readLines("input.txt")
input <- paste(input[2:length(input)], collapse = ";") # Skip the header
input <- gsub(";#;*", replacement = "\n", x = input)
input <- strsplit(unlist(strsplit(input, "\n")), ";")
input <- do.call(rbind.data.frame, input)
},
ans1 = {
L <- readLines("input.txt")[-1]
at <- grepl("#", L)
g <- cumsum(at)
read.table(text = tapply(L[!at], g[!at], paste, collapse = ","), sep = ",")
},
ans3 = {
L <- readLines("input.txt")[-1]
k <- grep("#", L)[1]
as.data.frame(matrix(L, ncol = k, byrow = TRUE))[, -k]
})
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## ques 1146.62 1179.65 1188.74 1194.78 1200.11 1219.01 10 c
## ans1 518.95 522.75 548.33 532.59 561.55 647.14 10 b
## ans3 50.47 51.19 51.68 51.69 52.25 52.52 10 a
You can get round some of the string manipulation with something along the lines of:
input <- readLines("input.txt")[-1] #Read in and remove header
ncol <- which(input=="#")[1]-1 #Number of columns of data
data.frame(matrix(input[input != "#"], ncol = ncol, byrow=TRUE)) #Convert to dataframe
# X1 X2
#1 2021-01-01 text
#2 2021-01-02 text
At this point, you might consider going all the way and using a proper grammar to parse it. I don't know how big or complex the situation really is, but using pegr it might look something like this:
input <-
"Header Info
2021-01-01
text
multiple lines
of
text
#
2021-01-02
text
more
lines of text
#
"
library(pegr)
peg <- new.parser(commonRules, action = TRUE) +
c("HEADER <- 'Header Info' EOL" , "{}" ) + # Match the literal 'Header Info' and a \n, then discard
c("TYPE <- 'text' EOL" , "{-}" ) + # Match the literal 'text' and store it as $TYPE
c("DATE <- (!EOL .)* EOL" , "{-}" ) + # Match any characters up to a newline; could be improved to look for a date format
c("EOS <- '#' EOL" , "{}" ) + # Match the end of a section, then discard
c("BODY <- (!EOS .)*" , "{-}" ) + # Match a body of text, including newlines
c("SECTION <- DATE TYPE BODY EOS" ) + # Combine rules to match each section
c("DOCUMENT <- HEADER SECTION*" ) # Combine more rules to match the entire document
res <- peg[["DOCUMENT"]](input)
library(magrittr)   # for %>%
library(data.table) # for setnames()
final <- matrix( value(res), ncol=3, byrow=TRUE ) %>%
as.data.frame %>%
setnames( names(value(res))[1:3] )
final
Produces:
DATE TYPE BODY
1 2021-01-01 text multiple lines\nof\ntext\n
2 2021-01-02 text more\nlines of text\n
It might feel clunky if you don't know the syntax, but once you do, it's a fire-and-forget solution. It'll run according to spec until the spec doesn't hold. You don't have to worry about fragile pretreatment, and it is easy to adapt to changing formats in the future.
There's also the tidyverse way:
library(tidyr)
library(readr)
library(stringr)
max_columns <- 5
d <- {
readr::read_file("file.txt") %>%
stringr::str_remove("^Header Info\n") %>%
tibble::enframe(name = NULL) %>%
separate_rows(value, sep = "#\n") %>%
separate("value", into = paste0("X", 1:max_columns) , sep = "\n")
}
Using your example input in a file called file.txt, d looks like:
# A tibble: 3 x 5
X1 X2 X3 X4 X5
<chr> <chr> <chr> <chr> <chr>
1 2021-01-01 text ... "" NA
2 2021-01-02 text ... "" NA
3 ... NA NA NA NA
Warning message:
Expected 5 pieces. Missing pieces filled with `NA` in 3 rows [1, 2, 3].
Note that the warning is simply to make sure you know you're getting NAs; this is inevitable if the number of rows between # markers varies.
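If you'd rather not hardcode max_columns, one minimal sketch (assuming every record, including the last, ends with a line containing just #) derives it from the file first:

lines <- readLines("file.txt")[-1]                      # drop the header
max_columns <- max(diff(c(0, which(lines == "#")))) - 1 # longest record between # markers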
I am using data similar to that provided by Sirius for demonstration. You can also do something like this to get a variable number of columns in the resulting data frame:
example <- "Header Info
2021-01-01
text
multiple lines
of
text
#
2021-01-02
text
more
lines of text
#"
library(tidyverse)
example %>% as.data.frame() %>% setNames('dummy') %>%
separate_rows(dummy, sep = '\\n') %>%
filter(row_number() !=1) %>%
group_by(rowid = rev(cumsum(rev(dummy == '#')))) %>%
filter(dummy != '#') %>%
mutate(name = paste0('X', row_number())) %>%
pivot_wider(id_cols = rowid, names_from = name, values_from = dummy)
#> # A tibble: 2 x 6
#> # Groups: rowid [2]
#> rowid X1 X2 X3 X4 X5
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 2 2021-01-01 text multiple lines of text
#> 2 1 2021-01-02 text more lines of text <NA>
Created on 2021-05-30 by the reprex package (v2.0.0)
I want to display a list of matrices (not a single matrix, as has been asked about elsewhere) without the small [1,] and [,1] row and column indicators.
For example, given myList:
myList <- list(matrix(c(1,2,3,4,5,6), nrow = 2), matrix(c(1,2,3,4,5,6), nrow = 3))
names(myList) <- c("This is the first matrix:", "This is the second matrix:")
I'm looking for some function myFunction() that will output:
> myFunction(myList)
$`This is the first matrix:`
1 3 5
2 4 6
$`This is the second matrix:`
1 4
2 5
3 6
It would be even better if it could eliminate the $... around the list names so that it would display:
This is the first matrix:
1 3 5
2 4 6
This is the second matrix:
1 4
2 5
3 6
After reading all the related questions, I've tried
myList %>% lapply(print, row.names = F)
myList %>% lapply(prmatrix, collab = NULL, rowlab = NULL)
myList %>% lapply(write.table, sep = " ", row.names = F, col.names = F)
But none work as intended.
So you are just missing the headers? How about something like:
library(purrr) #for walk2()
print_with_name <- function(mat, name) {
  cat(name, "\n")
  write.table(mat, sep = " ", row.names = F, col.names = F)
}
myList %>% walk2(., names(.), print_with_name)
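For myList above, this should print something like:

This is the first matrix:
1 3 5
2 4 6
This is the second matrix:
1 4
2 5
3 6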
I have a data frame with many columns. The columns differ in their types: some are numeric, some are character, etc. Here's a small example where we just have 3 variables with 2 types:
# Generate data
dat <- data.frame(x = c("1","2","3"),
y = c(1.0,2.5,3.3),
z = c(1,2,3),
stringsAsFactors = FALSE)
I want to replace the value 3 with a space, but only for character columns. Here's my current code:
out <- as.data.frame(lapply(dat, function(x) {
  ifelse(is.character(x),
         gsub("3", " ", x),
         x)
}), stringsAsFactors = FALSE)
The problem is that ifelse() ignores the fact that y and z are numeric, and it seems to coerce the numeric variables to character anyway.
One idea has been to pull out the character columns, gsub() them, then bind them back to the original data frame. This, however, changes the ordering of the columns. Key to any solution is that I must not have to specify variables by name, only by type.
One can also do this trivially using dplyr:
# Load package
library(dplyr)
# Create data
dat <- data.frame(x = c("1","2","3"),
y = c(1.0,2.5,3.3),
z = c(1,2,3),
stringsAsFactors = FALSE)
# Replace 3's with spaces for character columns
dat <- dat %>% mutate_if(is.character, function(x) gsub(pattern = "3", " ", x))
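dat should now look like:

#   x   y z
# 1 1 1.0 1
# 2 2 2.5 2
# 3   3.3 3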
I tried your code and for me it seems like ifelse did not work, but separating if and else does. Below is the code that works:
# Generate data
dat <- data.frame(x = c("1","2","3"),
y = c(1.0,2.5,3.3),
z = c(1,2,3),
stringsAsFactors = FALSE)
> lapply(dat, function(x) { if(is.character(x)) gsub("3", " ", x) else x })
$x
[1] "1" "2" " "
$y
[1] 1.0 2.5 3.3
$z
[1] 1 2 3
> as.data.frame(lapply(dat, function(x) { if(is.character(x)) gsub("3", " ", x) else x }))
  x   y z
1 1 1.0 1
2 2 2.5 2
3   3.3 3
It comes down to this line in ?ifelse:
ifelse returns a value with the same shape as test ...
Here the test, is.character(x), has length one, so the returned value has length 1. You can use if (...) yes else no instead, as @Heikki has suggested.
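A quick demonstration of the shape rule:

ifelse(TRUE, c("a", "b", "c"), 1:3)
# [1] "a"

Because the test has length 1, only the first element of the yes value is returned.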
Similar to @user3614648's solution:
library(dplyr)
dat %>%
mutate_if(is.character, funs(ifelse(. == "3", " ", .)))
  x   y z
1 1 1.0 1
2 2 2.5 2
3   3.3 3
I have a column with alphanumeric strings separated by '->' that I'm trying to split into columns.
Df:
column e
1. asd1->ref2
2. fde4 ->fre4
3. dfgt-fgr ->frt5
4. ftr5 -> lkh-oiut
5. rey6->usre-lynng->usre-lkiujh->kiuj-bunny
6. dge1->fgt4->okiuj-dfet
Desired output
   col 1  col 2
1. asd1   ref2
2. fde4   fre4
3.        frt5
4. ftr5
5. rey6
6. dge1   fgt4
I tried out <- strsplit(as.character(Df$column.e), '_->_') with no output, str_extract(m1$column.e, "(?<=\\[)[[:alnum:]]") -> m1$column.f, and also strsplit(as.character(Df$column.e), ' -> ', fixed = TRUE)[[1]][[1]], but I'm not getting the desired output.
The column is of integer type and all are capital letters (I'm not sure if this is important).
Here is one way with tidyverse
library(tidyverse)
df1 %>%
separate(columne, into = c('col1', 'col2'), sep = "->", extra = 'drop') %>%
mutate_all(funs(replace(., str_detect(., '-'), "")))
#  col1 col2
#1 asd1 ref2
#2 fde4 fre4
#3      frt5
#4 ftr5
#5 rey6
#6 dge1 fgt4
A base R solution as well, though a fair bit less concise than @akrun's tidyverse one:
# split as appropriate
out <- strsplit( as.character( Df$column.e ), '->' )
out <- lapply( out, function(x) {
# I assume you don't want the white space
y <- trimws( x )
# take the first two "columns"
y <- y[1:2]
# remove any items containing a hyphen
y[ grepl( "-", y ) ] <- ""
y
}
)
# then bind it all rowwise
out <- do.call( rbind, out )
data.frame( out )
    X1   X2
1 asd1 ref2
2 fde4 fre4
3      frt5
4 ftr5
5 rey6
6 dge1 fgt4
I'd like to make a column of (possibly) non-unique strings into a column of unique strings.
For instance, consider:
df <- data.frame(
'Initials' = c("AA","AB","AB")
, 'Data' = c(1,2,3)
)
df
Initials Data
1 AA 1
2 AB 2
3 AB 3
I would like to obtain this:
Initials Data
1 AA 1
2 AB (1) 2
3 AB (2) 3
Note: I know I could use the rownames to uniquely identify the row, but I'd like to retain the string stored in the Initials column, with a number appended.
transform(df, Initials = ave(as.character(Initials), Initials,
FUN = function(x) if (length(x) > 1) paste0(x, " (", seq(x), ")") else x))
# Initials Data
# 1 AA 1
# 2 AB (1) 2
# 3 AB (2) 3
w <- ave(df$Data, df$Initials, FUN = seq_along)
df$Initials <- paste(df$Initials, "(", w, ")", sep = "")
df
#   Initials Data
# 1    AA(1)    1
# 2    AB(1)    2
# 3    AB(2)    3
TL;DR
You can use make_unique from the makeunique package (disclaimer: I'm the author):
# install.packages('makeunique')
library(makeunique)
df <- data.frame(
Initials = c("AA","AB","AB"),
Data = c(1,2,3)
)
df[['Initials']] <- make_unique(df[['Initials']])
df
#> Initials Data
#> 1 AA 1
#> 2 AB (1) 2
#> 3 AB (2) 3
Created on 2022-10-15 by the reprex package (v2.0.1)
Full Answer
Disclaimer: I'm the author of the makeunique package.
The problem with @Sven Hohenstein's answer is that it can still produce duplicates, for example if your input dataset happens to include the string 'AB (2)'. See an example of the issue below:
df <- data.frame(
Initials = c("AA","AB","AB", "AB (2)"),
Data = c(1,2,3,4)
)
transform(df, Initials = ave(as.character(Initials), Initials,
FUN = function(x) if (length(x) > 1) paste0(x, " (", seq(x), ")") else x))
#> Initials Data
#> 1 AA 1
#> 2 AB (1) 2
#> 3 AB (2) 3
#> 4 AB (2) 4
Created on 2022-10-15 by the reprex package (v2.0.1)
Really, you'd want either something like make_unique from the makeunique package, which warns you when this happens and lets you change the suffix formatting to fix the issue:
# install.packages('makeunique')
library(makeunique)
df <- data.frame(
Initials = c("AA","AB","AB", "AB (2)"),
Data = c(1,2,3,4)
)
df[['Initials']] <- make_unique(df[['Initials']])
#> Error in make_unique(df[["Initials"]]): make_unique failed to make vector unique.
#> This is because appending ' <dup_number>' to duplicate values led to creation of term(s) that were in the original dataset:
#> [AB (2)]
#>
#> Please try again with a different argument for either `wrap_in_brackets` or `sep`
# Change suffix format to fix problem
df[['Initials']] <- make_unique(df[['Initials']], sep = "-")
Created on 2022-10-15 by the reprex package (v2.0.1)
Or you could use make.unique from base R, which ensures uniqueness but doesn't let you control how the suffixes look.
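For example:

make.unique(c("AA", "AB", "AB"))
# [1] "AA"   "AB"   "AB.1"

Base R leaves the first occurrence unsuffixed and appends .1, .2, ... to later duplicates.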
If you don't want to use a package, feel free to just copy the source of make_unique into your code and use it as your own function:
make_unique <- function(x, sep = " ", wrap_in_brackets = TRUE, warn_about_type_conversion = TRUE) {
  # Validate arguments
  if (!(is.character(sep) & length(sep) == 1))
    stop('`sep` must be a string, not a ', paste0(class(sep), collapse = " "))
  if (!(is.logical(wrap_in_brackets) & length(wrap_in_brackets) == 1))
    stop('`wrap_in_brackets` must be a flag, not a ', paste0(class(wrap_in_brackets), collapse = " "))
  if (!(is.logical(warn_about_type_conversion) & length(warn_about_type_conversion) == 1))
    stop('`warn_about_type_conversion` must be a flag, not a ', paste0(class(warn_about_type_conversion), collapse = " "))
  if (!any(is.numeric(x), is.character(x), is.factor(x)))
    stop('input to `make_unique` must be a character, numeric, or factor variable')

  # Coerce factors and numerics to character, warning if requested
  if (is.factor(x)) {
    if (warn_about_type_conversion) warning('make_unique: Converting factor to character variable')
    x <- as.character(x)
  } else if (is.numeric(x)) {
    if (warn_about_type_conversion) warning('make_unique: Converting numeric variable to a character vector')
    x <- as.character(x)
  }

  # Append a sequence number to every member of each group of duplicates
  deduplicated <- stats::ave(x, x, FUN = function(a) {
    if (length(a) > 1) {
      suffixes <- seq_along(a)
      if (wrap_in_brackets) suffixes <- paste0('(', suffixes, ')')
      paste0(a, sep, suffixes)
    } else a
  })

  # Fail loudly if suffixing happened to recreate a value from the input
  values_still_duplicated <- deduplicated[duplicated(deduplicated)]
  if (length(stats::na.omit(values_still_duplicated)) > 0) {
    stop(
      "make_unique failed to make vector unique.\n",
      "This is because appending ' <dup_number>' to duplicate values led to ",
      "creation of term(s) that were in the original dataset: \n[",
      paste0(values_still_duplicated, collapse = ', '),
      "]\n\nPlease try again with a different argument for either `wrap_in_brackets` or `sep`"
    )
  }
  return(deduplicated)
}
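Copied as-is, it reproduces the output shown in the TL;DR:

make_unique(c("AA", "AB", "AB"))
# [1] "AA"     "AB (1)" "AB (2)"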