I have a data.frame where one of the columns has values with this structure:
"2019-09-11 13:29:55:647 INFO".
How could I separate this column into three columns, where:
column 1 is "2019-09-11 13:29:55",
column 2 is "647",
column 3 is "INFO"?
I want to use the tidyr separate function but can't write a regular expression for the separators.
We can use read.csv from base R after inserting delimiters with sub
cbind(df1, read.csv(text = sub("^(\\S+ \\S+):(\\d+) (\\S+)$",
    "\\1,\\2,\\3", df1$datetime), col.names = c('newcol1', 'newcol2', 'newcol3'),
    header = FALSE, stringsAsFactors = FALSE))
If we are using tidyverse, specify the sep with regex lookarounds, i.e. match the last : (the one followed by no further : up to the end of the string) or the space that precedes the uppercase level at the end
library(tidyr)
separate(df1, datetime, into = c('newcol1', 'newcol2', 'newcol3'),
    sep = ":(?=[^:]+$)| (?=[A-Z]+$)")
#              newcol1 newcol2 newcol3
#1 2019-09-11 13:29:55     647    INFO
Or with extract, capture the characters up to the last :, then the digits, then the word at the end of the string
extract(df1, datetime, into = c('newcol1', 'newcol2', 'newcol3'),
    "^(.*):(\\d+)\\s+(\\S+)$")
#              newcol1 newcol2 newcol3
#1 2019-09-11 13:29:55     647    INFO
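A base R alternative along the same lines is strcapture with the regex used in the extract call above (a sketch; the proto column names are my own):
strcapture("^(.*):(\\d+)\\s+(\\S+)$", df1$datetime,
    proto = data.frame(newcol1 = character(), newcol2 = character(),
                       newcol3 = character(), stringsAsFactors = FALSE))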
data
df1 <- data.frame(datetime = "2019-09-11 13:29:55:647 INFO",
stringsAsFactors = FALSE)
Related
I have a data frame df:
df = data.frame(text = "M2O__LS___P_20160727T165346_20160808T165347_VV_ts_pan_fgdo_inte_dbrrt_IW9")
I want to have a new data frame df2 that extracts the first (2016.07.27) and second (2016.08.08) dates from the text column of df.
The desired output data frame is:
begin end
1 20160727 20160808
I would specifically be interested in a tidyverse approach to this problem (unless base R is much more efficient).
1) Base R: Read the text splitting on underscore, take the 7th and 8th columns, add the desired names and remove the T and everything after it.
DF <- read.table(text = df$text, sep = "_")[7:8]
names(DF) <- c("begin", "end")
DF[] <- lapply(DF, sub, pattern = "T.*", replacement = "")
DF
## begin end
## 1 20160727 20160808
1a) Another base solution.
pat <- "_(\\d{8})T.*_(\\d{8})T"
strcapture(pat, df$text, list(begin = character(0), end = character(0)))
## begin end
## 1 20160727 20160808
2) read.pattern: pat is from (1a).
library(gsubfn)
read.pattern(text = df$text, pattern = pat, col.names = c("begin", "end"))
## begin end
## 1 20160727 20160808
3) tidyr: pat is from (1a).
library(tidyr)
extract(df, text, c("begin", "end"), pat)
## begin end
## 1 20160727 20160808
Another solution:
library(tidyverse)
df %>%
  transmute(begin = str_extract(.$text, "(?<=_)\\d{8}"),
            end = str_extract(.$text, "(?<=\\d_)\\d{8}"))
#> begin end
#> 1 20160727 20160808
Given this vector:
vector <- c("Superman1000", "Batman35", "Wonderwoman240")
I want to split the superhero's name and age.
df=data.frame(vector= c("Superman1000", "Batman35", "Wonderwoman240"))
library(stringr)
library(stringi)
library(dplyr)
df %>% separate(vector, c("A", "B"))
I tried this but it doesn't work.
If the data is the same as shown, we can remove all the digits to get the superhero name and remove all the non-digits to get the age.
df$super_hero <- gsub("\\d+", "", df$vector)
df$super_hero_age <- gsub("\\D", "", df$vector)
Or with tidyr::extract
tidyr::extract(df, vector, into = c("name", "age"),regex = "(.*\\D)(\\d+)")
# name age
#1 Superman 1000
#2 Batman 35
#3 Wonderwoman 240
As mentioned by @A5C1D2H2I1M1N2O1R2T1, we can also use strcapture
strcapture("(.*\\D)(\\d+)", df$vector,
proto = data.frame(superhero = character(), age = integer()))
We can use read.csv from base R after creating a delimiter before the numeric part with sub
read.csv(text = sub("(\\d+)", ",\\1", df$vector), header = FALSE,
stringsAsFactors = FALSE, col.names = c('name', 'age'))
# name age
#1 Superman 1000
#2 Batman 35
#3 Wonderwoman 240
Or another option is separate where we specify a regex lookaround
library(tidyr)
separate(df, vector, into = c("name", "age"), sep= "(?<=[a-z])(?=\\d)")
# name age
#1 Superman 1000
#2 Batman 35
#3 Wonderwoman 240
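If the age should end up numeric rather than character, separate() can convert it on the fly via its convert argument (the same call as above, just adding convert = TRUE):
separate(df, vector, into = c("name", "age"),
    sep = "(?<=[a-z])(?=\\d)", convert = TRUE)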
I know there are some answers here about splitting a string every nth character, such as this one and this one. However, these are pretty question-specific and mostly relate to a single string rather than to a data frame of multiple strings.
Example data
df <- data.frame(id = 1:2, seq = c('ABCDEFGHI', 'ZABCDJHIA'))
Looks like this:
id seq
1 1 ABCDEFGHI
2 2 ZABCDJHIA
Splitting on every third character
I want to split the string in each row every third character, such that the resulting data frame looks like this:
id 1 2 3
1 ABC DEF GHI
2 ZAB CDJ HIA
What I tried
I have used splitstackshape before to split a string on a single character, like so: df %>% cSplit('seq', sep = '', stripWhite = FALSE, type.convert = FALSE). I would love to have a similar function (or perhaps it is possible with cSplit) to split on every third character.
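One way to coax cSplit into doing this is to insert a delimiter after every third character first and then split on it (a sketch; the seq3 helper column is my own):
library(splitstackshape)
df$seq3 <- gsub("(.{3})(?=.)", "\\1,", df$seq, perl = TRUE)
cSplit(df, "seq3", sep = ",", stripWhite = FALSE, type.convert = FALSE)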
An option would be separate
library(tidyverse)
df %>%
separate(seq, into = paste0("x", 1:3), sep = c(3, 6))
# id x1 x2 x3
#1 1 ABC DEF GHI
#2 2 ZAB CDJ HIA
If we want to make it more generic
n1 <- nchar(as.character(df$seq[1])) - 3
s1 <- seq(3, n1, by = 3)
nm1 <- paste0("x", seq_len(length(s1) + 1))
df %>%
  separate(seq, into = nm1, sep = s1)
Or, using base R, split the 'seq' column after every instance of 3 characters with strsplit by passing a regex lookbehind, and then rbind the list elements
df[paste0("x", 1:3)] <- do.call(rbind,
strsplit(as.character(df$seq), "(?<=.{3})", perl = TRUE))
NOTE: It is better to avoid column names that start with non-syntactic labels such as numbers. For that reason, 'x' is prepended to the names.
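For example, make.names(), which data.frame() applies to column names by default, shows how digit-only names would get mangled anyway:
make.names(c("1", "2", "3"))
## [1] "X1" "X2" "X3"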
You can also split a string every n characters in base R with read.fwf (Read Fixed Width Format Files), which needs either a file or a connection.
read.fwf(file=textConnection(as.character(df$seq)), widths=c(3,3,3))
V1 V2 V3
1 ABC DEF GHI
2 ZAB CDJ HIA
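If the strings are longer, the widths vector can be built from the string length instead of being hard-coded (assuming, as above, that every string has the same length and it is a multiple of 3):
wd <- rep(3, nchar(as.character(df$seq[1])) / 3)
read.fwf(file = textConnection(as.character(df$seq)), widths = wd)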
I have the data below in a dataframe column:
X_ABC_123_DF</n>
A_NJU_678_PP</n>
J_HH_99_LL</n>
II_00_777_PPP</n>
I want to extract the value between the second and third underscore for each row in the dataframe; I am planning to create a new column and store those values in it. I found one regex on SO (mentioned below), but that answer doesn't show how to apply it in R, and I am not sure how to write the regex function myself.
^(?:[^_]+_){2}([^_ ]+)
extract word between 2nd underscore and 3rd underscore or space
A few solutions:
df$values = sapply(strsplit(df$V1, "_"), function(x) x[3])
df$values = gsub("(.*_){2}(\\d+)_.+", "\\2", df$V1)
library(dplyr)
library(stringr)
df %>%
mutate(values = str_extract(V1, "\\d+(?=_[a-zA-Z]+.+$)"))
Result:
V1 values
1 X_ABC_123_DF</n> 123
2 A_NJU_678_PP</n> 678
3 J_HH_99_LL</n> 99
4 II_00_777_PPP</n> 777
Data:
df = read.table(text = "X_ABC_123_DF</n>
A_NJU_678_PP</n>
J_HH_99_LL</n>
II_00_777_PPP</n>", stringsAsFactors = FALSE)
1) Assume the input is a data frame df with a single column V1. Parse it with read.table using sep = "_" and pick out the third column. No packages or regular expressions are used. If df$V1 is already character (as opposed to factor) then the as.character could be omitted.
read.table(text = as.character(df$V1), sep = "_")$V3
## [1] 123 678 99 777
2) Another option is to replace each non-digit with the empty string and convert to numeric. This works for the sample data even though the last row also carries digits ("00") in its second field, because those extra digits only become leading zeros that do not change the numeric value:
as.numeric(gsub("\\D", "", df$V1))
## [1] 123 678 99 777
My dataframe, which I read from a csv file, has column names like this:
abc.def, ewf.asd.fkl, qqit.vsf.addw.coil
I want to remove the '.' from all the names and convert them to
abcdef, ewfasdfkl, qqitvsfaddwcoil.
I tried using the sub command sub(".","",colnames(dataframe)) but this command took out the first letter of each column name and the column names changed to
bc.def, wf.asd.fkl, qit.vsf.addw.coil
Does anyone know another command to do this? I could change the column names one by one, but I have a lot of files with 30 or more columns in each.
Again, I want to remove the "." from all the colnames. I am trying to do this so I can use "sqldf" commands, which don't deal well with "."
Thank you for your help
1) sqldf can deal with names having dots in them if you quote the names:
library(sqldf)
d0 <- read.csv(text = "A.B,C.D\n1,2")
sqldf('select "A.B", "C.D" from d0')
giving:
A.B C.D
1 1 2
2) When reading the data using read.table or read.csv use the check.names=FALSE argument.
Compare:
Lines <- "A B,C D
1,2
3,4"
read.csv(text = Lines)
## A.B C.D
## 1 1 2
## 2 3 4
read.csv(text = Lines, check.names = FALSE)
## A B C D
## 1 1 2
## 2 3 4
however, in this example it still leaves a name that would have to be quoted in sqldf since the names have embedded spaces.
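The quoting trick from (1) still works for those, though; a sketch continuing this example (d1 is my own name here, and it assumes sqldf/RSQLite pass the space-containing names through to SQLite unchanged):
d1 <- read.csv(text = Lines, check.names = FALSE)
sqldf('select "A B", "C D" from d1')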
3) To simply remove the periods, if DF is a data frame:
names(DF) <- gsub(".", "", names(DF), fixed = TRUE)
or it might be nicer to convert the periods to underscores so that it is reversible:
names(DF) <- gsub(".", "_", names(DF), fixed = TRUE)
This last line could be alternatively done like this:
names(DF) <- chartr(".", "_", names(DF))
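To see what reversible buys you, the original names can be recovered by swapping the chartr arguments (a small sketch with throwaway names):
nm <- c("abc.def", "ewf.asd.fkl")
nm2 <- chartr(".", "_", nm)   # "abc_def"  "ewf_asd_fkl"
chartr("_", ".", nm2)         # back to "abc.def" "ewf.asd.fkl"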
UPDATE dplyr 0.8.0
As of dplyr 0.8, funs() is soft-deprecated; use the formula notation instead.
A dplyr way to do this, using stringr:
library(dplyr)
library(stringr)
data <- data.frame(abc.def = 1, ewf.asd.fkl = 2, qqit.vsf.addw.coil = 3)
renamed_data <- data %>%
rename_all(~str_replace_all(.,"\\.","_")) # note we have to escape the '.' character with \\
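For reference, the pre-0.8 funs() form that the update note refers to looked roughly like this; it still runs but emits a deprecation warning (a sketch only):
renamed_data_old <- data %>%
  rename_all(funs(str_replace_all(., "\\.", "_")))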
Make sure you install the packages with install.packages().
Remember that you have to escape the . character as \\. because, in the regular expressions that functions like str_replace_all use, an unescaped . is a wildcard that matches any character.
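A quick illustration of the difference, using a throwaway string:
library(stringr)
str_replace_all("abc.def", ".", "_")    # "_______"  (every character replaced)
str_replace_all("abc.def", "\\.", "_")  # "abc_def"  (only the literal dot)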
To replace all the dots in the names you'll need to use gsub, rather than sub, which will only replace the first occurrence.
This should work.
test <- data.frame(abc.def = NA, ewf.asd.fkl = NA, qqit.vsf.addw.coil = NA)
names(test) <- gsub( ".", "", names(test), fixed = TRUE)
test
abcdef ewfasdfkl qqitvsfaddwcoil
1 NA NA NA
You can also try the following (note that fixed = TRUE is needed here too; otherwise the "." pattern matches every character and empties the names):
names(df) = gsub(pattern = ".", replacement = "", x = names(df), fixed = TRUE)