Split columns considering only the first dot in R using separate

This is my dataframe:
df <- tibble(col1 = c("1. word","2. word","3. word","4. word","5. N. word","6. word","7. word","8. word"))
I need to split it into two columns using the separate function, one called Numbers and the other called Words. I have tried this, but it is not working:
df %>% separate(col = col1 , into = c('Number','Words'), sep = "^. ")
The problem is that the fifth row has two dots, and I don't know how to handle this in the regex.
Any help?

Here is an alternative using readr's parse_number and a regex:
library(dplyr)
library(readr)
df %>%
mutate(Numbers = parse_number(col1), .before=1) %>%
mutate(col1 = gsub('\\d+\\. ','',col1))
Numbers col1
<dbl> <chr>
1 1 word
2 2 word
3 3 word
4 4 word
5 5 N. word
6 6 word
7 7 word
8 8 word

A tidyverse approach would be to first clean the data then separate.
df %>%
mutate(col1 = gsub("\\s.*(?=word)", "", col1, perl=TRUE)) %>%
tidyr::separate(col1, into = c("Number", "Words"), sep="\\.")
Result:
# A tibble: 8 x 2
Number Words
<chr> <chr>
1 1 word
2 2 word
3 3 word
4 4 word
5 5 word
6 6 word
7 7 word
8 8 word
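If the text after the first dot should be kept intact, separate can also be told to split only once. A minimal sketch, assuming tidyr and tibble are installed; `extra = "merge"` is an existing argument of separate, while the separator `"\\. "` (a literal dot followed by a space) and the shortened example data are my own choices:

```r
library(tidyr)
library(tibble)

df <- tibble(col1 = c("1. word", "5. N. word", "8. word"))

# With two names in `into`, extra = "merge" splits at the first match of `sep`
# only and leaves any later ". " inside the last column.
res <- separate(df, col1, into = c("Number", "Words"),
                sep = "\\. ", extra = "merge")
res
```

Here the row "5. N. word" ends up as Number "5" and Words "N. word".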

I'm assuming that you would like to keep the cumbersome "N." in the result. For that, my advice is to use extract instead of separate:
df %>%
extract(
col = col1 ,
into = c('Number','Words'),
regex = "([0-9]+)\\. (.*)")
The regular expression ([0-9]+)\\. (.*) means that you are looking first for a number, that you want to put in a first column, followed by a dot and a space (\\. ) that should be discarded, and the rest should go in a second column.
The result:
# A tibble: 8 × 2
Number Words
<chr> <chr>
1 1 word
2 2 word
3 3 word
4 4 word
5 5 N. word
6 6 word
7 7 word
8 8 word
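To see what the two capture groups pick up without tidyr, the same pattern can be tried in base R. This is only a sketch for inspecting the regex, not a replacement for extract; the two-element vector x is made up for the illustration:

```r
x <- c("1. word", "5. N. word")

# Same pattern as in the extract() call; regexec() returns the full match
# followed by one entry per capture group.
m <- regmatches(x, regexec("([0-9]+)\\. (.*)", x))

res <- data.frame(
  Number = vapply(m, function(p) p[2], character(1)),
  Words  = vapply(m, function(p) p[3], character(1)),
  stringsAsFactors = FALSE
)
res
```

Only the first dot and space are consumed by the pattern, so the second group of the row with two dots still contains "N. word".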

Try read.table + sub
> read.table(text = sub("\\.", ",", df$col1), sep = ",")
V1 V2
1 1 word
2 2 word
3 3 word
4 4 word
5 5 N. word
6 6 word
7 7 word
8 8 word
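This works because sub replaces only the first match, whereas gsub replaces every match. A small base R sketch of the difference on the row with two dots:

```r
x <- "5. N. word"

# sub() stops after the first dot; gsub() would also replace the dot after "N".
first_only <- sub("\\.", ",", x)
all_dots   <- gsub("\\.", ",", x)

first_only
all_dots
```

With sub the string has a single comma, so read.table sees exactly two fields; with gsub it would see three.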

I am not sure how to do this with tidyr, but the following should work with base R.
df$col1 <- gsub('N. ', '', df$col1, fixed = TRUE)
df$Numbers <- as.numeric(sapply(strsplit(df$col1, ' '), '[', 1))
df$Words <- sapply(strsplit(df$col1, ' '), '[', 2)
df$col1 <- NULL
Result
> head(df)
Numbers Words
1 1 word
2 2 word
3 3 word
4 4 word
5 5 word
6 6 word

Related

Tidytext - set expressions as a single token

I am trying to separate my text data into tokens using the unnest_tokens function from the tidytext package. The thing is that some expressions appear multiple times and I would like to keep them a single token instead of multiple tokens.
Normal outcome:
df <- data.frame(
Id = c(1, 2),
Text = c('A first nice text', 'A second nice text')
)
df %>%
unnest_tokens(Word, Text)
Id Word
1 1 a
2 1 first
3 1 nice
4 1 text
5 2 a
6 2 second
7 2 nice
8 2 text
What I would like (expression = "nice text"):
df <- data.frame(
Id = c(1, 2),
Text = c('A first nice text', 'A second nice text')
)
df %>%
unnest_tokens(Word, Text)
Id Word
1 1 a
2 1 first
3 1 nice text
4 2 a
5 2 second
6 2 nice text
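Here's a concise solution based on negative lookahead (?!...): it stops separate_rows from splitting Text on a whitespace \\s that is followed by the word text (\\b are word boundary anchors, in case you have, say, "nice texts", which you do want to separate). Note that the leading (?!\\bnice\\b) is tested at the whitespace itself, so it never looks at the word to the left; the pattern keeps any word followed by text together, which is fine for this data.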
library(tidyr)
df %>%
separate_rows(Text, sep = "(?!\\bnice\\b)\\s(?!\\btext\\b)")
# A tibble: 6 × 2
Id Text
<dbl> <chr>
1 1 A
2 1 first
3 1 nice text
4 2 A
5 2 second
6 2 nice text
A more advanced regex is with (*SKIP)(*F):
df %>%
separate_rows(Text, sep = "(\\bnice text\\b)(*SKIP)(*F)|\\s")
For more info: How do (*SKIP) or (*F) work on regex?
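The two patterns can be compared in base R with strsplit(perl = TRUE). This is a sketch under the assumption that strsplit is an acceptable stand-in for separate_rows here; the second input string, "A second text", is made up to show where the patterns differ:

```r
lookahead <- "(?!\\bnice\\b)\\s(?!\\btext\\b)"
skip_fail <- "(\\bnice text\\b)(*SKIP)(*F)|\\s"

x <- c("A first nice text", "A second text")

# The lookahead pattern refuses to split on any whitespace followed by "text".
by_lookahead <- strsplit(x, lookahead, perl = TRUE)

# The (*SKIP)(*F) pattern protects only the exact phrase "nice text".
by_skip_fail <- strsplit(x, skip_fail, perl = TRUE)

by_lookahead
by_skip_fail
```

Both give "A", "first", "nice text" for the first string. For the second string the lookahead version keeps "second text" together, while the (*SKIP)(*F) version splits it into three words.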
A bit verbose, and there might be an option to exclude certain phrases in the unnest_tokens, but it does the trick:
library(tidyverse)
library(tidytext)
df <- data.frame(Id = c(1, 2),
Text = c('A first nice text', 'A second nice text')) %>%
unnest_tokens('Word', Text)
df %>%
group_by(Id) %>%
summarize(Word = paste(if_else(lag(Word) == 'nice' & Word == 'text', 'nice text', Word))) %>%
mutate(temp_id = row_number()) %>%
filter(temp_id != temp_id[Word == 'nice text'] - 1) %>%
ungroup() %>%
select(-temp_id)
which gives:
# A tibble: 6 x 2
Id Word
<dbl> <chr>
1 1 a
2 1 first
3 1 nice text
4 2 a
5 2 second
6 2 nice text
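Another way to think about it is to protect the phrase before tokenizing and restore it afterwards. The sketch below is a different technique from the answers above and uses base R only, with whitespace splitting and tolower standing in for unnest_tokens; the underscore placeholder is an assumption and would clash with text that already contains underscores:

```r
text <- c("A first nice text", "A second nice text")
phrase <- "nice text"

# Glue the phrase together with an underscore so whitespace splitting keeps it whole.
protected <- gsub(paste0("\\b", phrase, "\\b"), sub(" ", "_", phrase),
                  text, perl = TRUE)

# Split on whitespace, lower-case, then restore the space inside the phrase.
tokens <- lapply(strsplit(tolower(protected), "\\s+"),
                 function(w) sub("_", " ", w, fixed = TRUE))

# One row per token, with the document Id repeated for each of its tokens.
res <- data.frame(Id = rep(seq_along(tokens), lengths(tokens)),
                  Word = unlist(tokens),
                  stringsAsFactors = FALSE)
res
```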

Remove unwanted letter in data column names in R environment

I have a dataset that contains a large number of columns; every column name is a date in the form x2019.10.10.
What I want is to remove the letter x and change the date format to 2019-10-10.
How can this be done in R?
One solution would be:
Get rid of x
Replace . with -.
Here I create a dataframe that has similar columns to yours:
df = data.frame(x2019.10.10 = c(1, 2, 3),
x2020.10.10 = c(4, 5, 6))
df
x2019.10.10 x2020.10.10
1 1 4
2 2 5
3 3 6
And then, using dplyr (looks much tidier):
library(dplyr)
names(df) = names(df) %>%
sub("^x", "", .) %>% # Get rid of the leading x only and then (%>%):
gsub("\\.", "-", .) # replace "." with "-"
df
2019-10-10 2020-10-10
1 1 4
2 2 5
3 3 6
If you do not want to use dplyr, here is how you would do the same thing in base R:
names(df) = sub("^x", "", names(df))
names(df) = gsub("\\.", "-", names(df))
df
2019-10-10 2020-10-10
1 1 4
2 2 5
3 3 6
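Once the names are in this form they can be parsed as dates, which is a quick way to check that the renaming did what was intended. A minimal base R sketch; the anchored `"^x"` and `fixed = TRUE` are my choices so that only the leading x is dropped and the dot is treated literally:

```r
df <- data.frame(x2019.10.10 = c(1, 2, 3),
                 x2020.10.10 = c(4, 5, 6))

# Drop only the leading "x", then turn every literal dot into a dash.
names(df) <- gsub(".", "-", sub("^x", "", names(df)), fixed = TRUE)

# The names are now ISO formatted, so as.Date() parses them without a format string.
parsed <- as.Date(names(df))

names(df)
parsed
```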

Count how many characters from a column appear in another column

I am trying to count how many characters from column expected appear in column read.
They may appear in different order and they should not be counted twice.
For example, in this df
df <- tibble::tibble(expected=c("AL0","CP1","NM3","PK9","RM2"),
read=c("AL0X24",
"CXP44",
"MLN",
"KKRR9",
"22MMRRS"
))
The result should be:
result <- c(3,2,2,2,3)
An option with str_extract_all/n_distinct: wrap the 'expected' string in [ and ] using paste0 to form a character class, extract all characters from 'read' that match it, and count the distinct elements with n_distinct.
library(stringr)
library(dplyr)
with(df, sapply(str_extract_all(read, paste0("[", expected, "]")), n_distinct))
#[1] 3 2 2 2 3
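Another option uses str_replace_all with str_count. Here we collapse runs of repeated characters in 'read' with str_replace_all and then count the matches of the 'expected' character class built with str_c. Note that the regex only collapses consecutive repeats, so a character that reappears later in 'read' would still be counted twice.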
df %>%
mutate(Count = str_count(str_replace_all(read, "(\\w)\\1+", "\\1"),
str_c("[", expected, "]")))
# A tibble: 5 x 3
# expected read Count
# <chr> <chr> <int>
#1 AL0 AL0X24 3
#2 CP1 CXP44 2
#3 NM3 MLN 2
#4 PK9 KKRR9 2
#5 RM2 22MMRRS 3
One option could be:
mapply(function(x, y) sum(x %in% unique(y)),
x = strsplit(df$expected, ""),
y = strsplit(df$read, ""))
[1] 3 2 2 2 3
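The same idea can be written with intersect, which drops duplicates on both sides, so repeats in either column are not counted twice. A base R sketch; count_shared is a hypothetical helper name:

```r
expected <- c("AL0", "CP1", "NM3", "PK9", "RM2")
read     <- c("AL0X24", "CXP44", "MLN", "KKRR9", "22MMRRS")

# Split both strings into single characters and count the distinct shared ones.
count_shared <- function(e, r) {
  length(intersect(strsplit(e, "")[[1]], strsplit(r, "")[[1]]))
}

counts <- unname(mapply(count_shared, expected, read))
counts
```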

keeping document number in tidytext

When I use unnest_tokens on a list I enter manually, the output includes the row number each word came from.
library(dplyr)
library(tidytext)
library(tidyr)
library(NLP)
library(tm)
library(SnowballC)
library(widyr)
library(textstem)
#test data
text<- c( "furloughs","Working MORE for less pay", "total burnout and exhaustion")
#break text file into single words and list which row they are in
text_df <- tibble(text = text)
tidy_text <- text_df %>%
mutate_all(as.character) %>%
mutate(row_name = row_number())%>%
unnest_tokens(word, text) %>%
mutate(word = wordStem(word))
The results look like this, which is what I want.
row_name word
<int> <chr>
1 1 furlough
2 2 work
3 2 more
4 2 for
5 2 less
6 2 pai
7 3 total
8 3 burnout
9 3 and
10 3 exhaust
But when I try to read in the real responses from a csv file:
#Import data
text <- read.csv("TextSample.csv", stringsAsFactors=FALSE)
But otherwise use the same code:
#break text file into single words and list which row they are in
text_df <- tibble(text = text)
tidy_text <- text_df %>%
mutate_all(as.character) %>%
mutate(row_name = row_number())%>%
unnest_tokens(word, text) %>%
mutate(word = wordStem(word))
I get the entire token list assigned to row 1 and then again assigned to row 2 and so on.
row_name word
<int> <chr>
1 1 c
2 1 furlough
3 1 work
4 1 more
5 1 for
6 1 less
7 1 pai
8 1 total
9 1 burnout
10 1 and
OR, if I move the mutate(row_name = row_number) to after the unnest command, I get the row number for each token.
word row_name
<chr> <int>
1 c 1
2 furlough 2
3 work 3
4 more 4
5 for 5
6 less 6
7 pai 7
8 total 8
9 burnout 9
10 and 10
What am I missing?
I guess that if you import the text using text <- read.csv("TextSample.csv", stringsAsFactors=FALSE), text is a data frame, whereas if you enter it manually it is a vector.
If you alter the code to text_df <- tibble(text = text$col_name), which selects the column (a vector) from the data frame in the csv case, I think you should get the same result as before.
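The difference can be checked in base R without the csv file. In this sketch the inline csv_text and the column name response are hypothetical stand-ins for TextSample.csv and its real column:

```r
csv_text <- "response\nfurloughs\nWorking MORE for less pay\ntotal burnout and exhaustion"
text <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# read.csv() always returns a data frame, even when there is a single column.
is.data.frame(text)

# as.character() on the whole data frame deparses the column into one string
# that starts with "c(", which is likely where the stray "c" token comes from.
deparsed <- as.character(text)

# Selecting the column gives the plain character vector, one element per response.
responses <- text$response
length(responses)
```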

Programmatically split non-delimited strings and generate new columns

I have a 1 column data table containing non-delimited strings like so
d1 = data.table(x = c("2728661941-1945", "2657461921-1925", "2786161921-1925"))
d1
#> x
#> 1: 2728661941-1945
#> 2: 2657461921-1925
#> 3: 2786161921-1925
I have another data table of the form
dic = data.table(field = c("ID","group","year"),start=c(1,6,7), length=c(5,1,9))
dic
#> field start length
#> 1: ID 1 5
#> 2: group 6 1
#> 3: year 7 9
I want to split the strings in the data table d1 using the information in dic, and end up with a new data frame of the form
d2 = data.table(ID = c("27286", "26574", "27861"),
group = c(6, 6, 6),
year = c("1941-1945", "1921-1925", "1921-1925"))
d2
#> ID group year
#> 1: 27286 6 1941-1945
#> 2: 26574 6 1921-1925
#> 3: 27861 6 1921-1925
I have tried
d2 = copy(d1)[,(dic$field) := transpose(
lapply(x, stri_sub, from = dic$start, length = dic$length))]
But, the underneath data is in list form, not really in table form. I want to be able to refer to the created fields as columns.
I have to admit I am not entirely sure what I am doing, and I don't really have to use data table for this, but I can't think of another way to do it. The easiest dataset I have contains strings of 79 characters, and there are 25 fields that would be generated, so I would prefer not to have to pull each field individually.
I hope this makes sense. Any suggestions are appreciated.
1) read.fwf Try read.fwf. No packages are used.
read.fwf(textConnection(d1$x), dic$length, col.names = dic$field)
giving:
ID group year
1 27286 6 1941-1945
2 26574 6 1921-1925
3 27861 6 1921-1925
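Note that read.fwf is given only the widths, so it relies on the fields being contiguous and starting at position 1. A small base R sketch to check that assumption against the start column; dic is rebuilt here as a plain data frame so the snippet runs on its own:

```r
dic <- data.frame(field  = c("ID", "group", "year"),
                  start  = c(1, 6, 7),
                  length = c(5, 1, 9),
                  stringsAsFactors = FALSE)

# Start positions implied by the widths alone, assuming the first field starts at 1.
implied_start <- cumsum(c(1, head(dic$length, -1)))

# TRUE means the widths describe the same layout as the start column.
contiguous <- all(implied_start == dic$start)

# TRUE means the widths account for every character of a sample record.
covers_record <- sum(dic$length) == nchar("2728661941-1945")

contiguous
covers_record
```

If the dictionary had gaps between fields, negative widths in read.fwf can be used to skip the unused columns.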
2) separate This also works and gives the same answer:
library(tidyr)
d1 %>%
separate(x, sep = dic$start[-1] - 1, into = dic$field, remove = TRUE)
A regex is useful here, particularly since you can programmatically define the patterns you want to search for and output (mutate comes from dplyr):
d1 %>%
mutate(x=gsub(paste0("(.{", dic$length, "})", collapse=""), paste0("\\", seq_along(dic$length), collapse=" "), x)) %>%
separate(x, into=dic$field, sep=" ")
# ID group year
# 1 27286 6 1941-1945
# 2 26574 6 1921-1925
# 3 27861 6 1921-1925
Explanation
# Pattern to search for
paste0("(.{", dic$length, "})", collapse="")
# "(.{5})(.{1})(.{9})"
# (.{5}) - group that contains any 5 characters - will be group 1
# (.{1}) - group that contains any 1 character - will be group 2
# (.{9}) - group that contains any 9 characters - will be group 3
# Pattern to output
paste0("\\", seq_along(dic$length), collapse=" ")
# "\\1 \\2 \\3"
# \\1 - output group 1
# \\2 - output group 2
# \\3 - output group 3
# each group is separated by a space
Use tidyr::separate to split the resulting space-delimited string into distinct fields
Not using the dic table, but this can be easily done with extract from tidyr:
library(tidyr)
extract(d1, x, c("ID", "group", "year"), "^(.{5})(.{1})(.{9})$")
Result:
ID group year
1: 27286 6 1941-1945
2: 26574 6 1921-1925
3: 27861 6 1921-1925
Using the dic table as reference (note that mutate_ is deprecated in current dplyr):
library(dplyr)
breaks <- setNames(as.list(paste0("substr(x", ", ", dic$start, ", ", dic$start+dic$length-1, ")")), dic$field)
d1 %>%
mutate_(.dots = breaks)
# substring() takes an inclusive end position, so each field ends at start + length - 1
setNames(data.frame(do.call(rbind, lapply(d1$x, function(X) sapply(1:NROW(dic),
function(i) substring(X, dic$start[i], dic$start[i] + dic$length[i] - 1))))), dic$field)
# ID group year
#1 27286 6 1941-1945
#2 26574 6 1921-1925
#3 27861 6 1921-1925
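Positions in substr and substring are inclusive at both ends, so a field that begins at start and is length characters long ends at start + length - 1. A minimal base R sketch of the same dictionary-driven split, using plain vectors and a data frame instead of data.table so it runs on its own:

```r
x <- c("2728661941-1945", "2657461921-1925", "2786161921-1925")
dic <- data.frame(field  = c("ID", "group", "year"),
                  start  = c(1, 6, 7),
                  length = c(5, 1, 9),
                  stringsAsFactors = FALSE)

# One substr() call per dictionary row; substr() is vectorised over x,
# so each call returns a whole column.
fields <- Map(function(s, l) substr(x, s, s + l - 1), dic$start, dic$length)
names(fields) <- dic$field

d2 <- as.data.frame(fields, stringsAsFactors = FALSE)
d2
```

All columns come back as character here; convert group with as.numeric if a numeric column is needed.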
We can use the strcapture function from base R to capture the pieces of each string and put them in a data frame whose column types are predefined.
strcapture("(\\d{5})(\\d)(.*)",d1$x,data.frame(Id=numeric(),group=numeric(),year=character()))
Id group year
1 27286 6 1941-1945
2 26574 6 1921-1925
3 27861 6 1921-1925
Explanation: (\\d{5}) captures the first 5 digits, then (\\d) captures the next digit, and (.*) captures everything that follows.
