R Extract specific text from variable - r

I have a dataframe with this column:
VAR1
var_1.1
var_1.2
var_1.3
var_2.1
var_2.2
var_2.3
What I would like is to create a new column, VAR2, that keeps only part of the text:
VAR1 VAR2
var_1.1 1
var_1.2 1
var_1.3 1
var_2.1 2
var_2.2 2
var_2.3 2
Basically, I want to retain the text between "_" and ".".
Thx!

We can use str_extract to match one or more digits (\\d+) that are preceded by _ and followed by a literal dot.
library(dplyr)
library(stringr)
df1 %>%
  mutate(VAR2 = str_extract(VAR1, "(?<=_)\\d+(?=\\.)"))
# VAR1 VAR2
#1 var_1.1 1
#2 var_1.2 1
#3 var_1.3 1
#4 var_2.1 2
#5 var_2.2 2
#6 var_2.3 2
Or use str_replace: capture the digits as a group and give the backreference to that group (\\1) as the replacement.
df1 %>%
mutate(VAR2 = str_replace(VAR1, ".*_(\\d+)\\..*", "\\1"))
Or with sub from base R
sub(".*_(\\d+)\\..*", "\\1", df1$VAR1)
data
df1 <- structure(list(VAR1 = c("var_1.1", "var_1.2", "var_1.3", "var_2.1",
"var_2.2", "var_2.3")), class = "data.frame", row.names = c(NA,
-6L))
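For comparison, here is a minimal base R sketch of the same two ideas (capture group and lookaround), assuming the sample data above. The names by_sub and by_lookaround are only illustrative; converting the result with as.integer is an extra step that the question did not ask for.

```r
# Sample data copied from the answer above
df1 <- data.frame(VAR1 = c("var_1.1", "var_1.2", "var_1.3",
                           "var_2.1", "var_2.2", "var_2.3"),
                  stringsAsFactors = FALSE)

# Capture-group approach: keep only the digits between "_" and "."
by_sub <- sub(".*_(\\d+)\\..*", "\\1", df1$VAR1)

# Lookaround approach; perl = TRUE is required for (?<=...) and (?=...)
m <- regexpr("(?<=_)\\d+(?=\\.)", df1$VAR1, perl = TRUE)
by_lookaround <- regmatches(df1$VAR1, m)

# Both results are character vectors; convert if a numeric column is wanted
df1$VAR2 <- as.integer(by_sub)
```

Note that regmatches drops elements that do not match, so the sub version is the safer one to assign back into a data frame when some rows might not follow the pattern.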

If your data always looks like the sample shown, a non-regex approach using parse_number also works:
readr::parse_number(df$VAR1)
#[1] 1.1 1.2 1.3 2.1 2.2 2.3
Since you want the number before the dot (.), we can take the floor of the result above.
df$Var2 <- floor(readr::parse_number(df$VAR1))
df
# VAR1 Var2
#1 var_1.1 1
#2 var_1.2 1
#3 var_1.3 1
#4 var_2.1 2
#5 var_2.2 2
#6 var_2.3 2
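If readr is not installed, roughly the same idea can be sketched in base R. This assumes every value has the form prefix_number, as in the sample, and that the numbers are non-negative (floor rounds negative numbers away from zero); the variable names are illustrative.

```r
VAR1 <- c("var_1.1", "var_1.2", "var_1.3", "var_2.1", "var_2.2", "var_2.3")

# Drop everything up to and including the last underscore, then parse
num <- as.numeric(sub("^.*_", "", VAR1))

# Keep only the integer part
Var2 <- floor(num)
```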

Related

Split columns considering only the first dot in R using separate

This is my dataframe:
df <- tibble(col1 = c("1. word","2. word","3. word","4. word","5. N. word","6. word","7. word","8. word"))
I need to split it into two columns using the separate function, one called Numbers and the other called Words. I've tried this, but it's not working:
df %>% separate(col = col1 , into = c('Number','Words'), sep = "^. ")
The problem is that the fifth row has two dots, and I don't know how to handle this in the regex.
Any help?
Here is an alternative using readr's parse_number and a regex:
library(dplyr)
library(readr)
df %>%
mutate(Numbers = parse_number(col1), .before=1) %>%
mutate(col1 = gsub('\\d+\\. ','',col1))
Numbers col1
<dbl> <chr>
1 1 word
2 2 word
3 3 word
4 4 word
5 5 N. word
6 6 word
7 7 word
8 8 word
A tidyverse approach would be to first clean the data then separate.
df %>%
mutate(col1 = gsub("\\s.*(?=word)", "", col1, perl=TRUE)) %>%
tidyr::separate(col1, into = c("Number", "Words"), sep="\\.")
Result:
# A tibble: 8 x 2
Number Words
<chr> <chr>
1 1 word
2 2 word
3 3 word
4 4 word
5 5 word
6 6 word
7 7 word
8 8 word
I'm assuming that you would like to keep the cumbersome "N." in the result. For that, my advice is to use extract instead of separate:
df %>%
extract(
col = col1 ,
into = c('Number','Words'),
regex = "([0-9]+)\\. (.*)")
The regular expression ([0-9]+)\\. (.*) means that you are looking first for a number, that you want to put in a first column, followed by a dot and a space (\\. ) that should be discarded, and the rest should go in a second column.
The result:
# A tibble: 8 × 2
Number Words
<chr> <chr>
1 1 word
2 2 word
3 3 word
4 4 word
5 5 N. word
6 6 word
7 7 word
8 8 word
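The same regular expression can be tried without tidyr. The following is a base R sketch under the assumption that every value starts with digits followed by ". "; the names pattern and out are only illustrative.

```r
col1 <- c("1. word", "2. word", "3. word", "4. word",
          "5. N. word", "6. word", "7. word", "8. word")

# Group 1: leading digits; group 2: everything after the first ". "
pattern <- "^([0-9]+)\\. (.*)$"

out <- data.frame(
  Number = sub(pattern, "\\1", col1),  # first capture group
  Words  = sub(pattern, "\\2", col1),  # second capture group
  stringsAsFactors = FALSE
)
```

Because [0-9]+ can only match the leading number, the split always happens at the first dot, so the "N." in the fifth row stays in Words.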
Try read.table + sub
> read.table(text = sub("\\.", ",", df$col1), sep = ",")
V1 V2
1 1 word
2 2 word
3 3 word
4 4 word
5 5 N. word
6 6 word
7 7 word
8 8 word
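One thing to watch with this approach is that the second column keeps the space that followed the dot. A hedged variant, using read.table's strip.white and col.names arguments (the column names here are my own choice), might look like this:

```r
col1 <- c("1. word", "2. word", "3. word", "4. word",
          "5. N. word", "6. word", "7. word", "8. word")

# sub() replaces only the FIRST dot, so "5. N. word" becomes "5, N. word"
res <- read.table(text = sub("\\.", ",", col1), sep = ",",
                  strip.white = TRUE,            # trim the leading space
                  col.names = c("Number", "Words"),
                  stringsAsFactors = FALSE)
```

This relies on the text containing no commas of its own; a value with a comma would be split into extra fields.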
I am not sure how to do this with tidyr, but the following should work with base R.
df$col1 <- gsub('N. ', '', df$col1, fixed = TRUE) # fixed = TRUE so that "." is a literal dot, not "any character"
df$Numbers <- as.numeric(sapply(strsplit(df$col1, ' '), '[', 1))
df$Words <- sapply(strsplit(df$col1, ' '), '[', 2)
df$col1 <- NULL
Result
> head(df)
Numbers Words
1 1 word
2 2 word
3 3 word
4 4 word
5 5 word
6 6 word

How to identify the text that are in common between sentences?

I would like to find the text or string that appeared in 3 of my columns.
> dput(df1)
structure(list(Jan = "The price of oil declined.", Feb = "The price of gold declined.",
Mar = "Prices remained unchanged."), row.names = c(NA, -1L
), class = c("tbl_df", "tbl", "data.frame"))
I want to get something like
Word Count
The 2
price 3
declined 2
of 2
Thank you.
You can count the occurrence of each word in the text and keep only the ones that occur more than once.
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = everything()) %>%
separate_rows(value, sep = '\\s+') %>%
mutate(value = tolower(gsub('[[:punct:]]', '', value))) %>%
count(value) %>%
filter(n > 1)
Maybe this:
setNames(data.frame(table(unlist
(strsplit
(trimws(tolower(stack(df)$values),whitespace = '\\.'), '\\s+', perl=TRUE)
)
)
), c('words', 'Frequency'))
stack(df) reshapes the data frame from wide to long, and its values column holds all the sentences. tolower lowercases the text, trimws with whitespace = '\\.' strips the leading and trailing periods, and strsplit splits each sentence on whitespace. unlist flattens the result into a single vector, table counts each word, and data.frame converts the counts to a data frame. setNames renames the columns.
Output:
# words Frequency
#1 declined 2
#2 gold 1
#3 of 2
#4 oil 1
#5 price 2
#6 prices 1
#7 remained 1
#8 the 2
#9 unchanged 1
This code won't process the data exactly as you may wish; for example, it does not treat "price" and "Prices" as the same word. Handling that would be more complicated.
> data.frame(table(strsplit(tolower(gsub("\\.|\\,","",paste(as.character(unlist(df)),collapse=" ")))," ")))
Var1 Freq
1 declined 2
2 gold 1
3 of 2
4 oil 1
5 price 2
6 prices 1
7 remained 1
8 the 2
9 unchanged 1
Base R solution:
setNames(
data.frame(
table(
unlist(strsplit(tolower(do.call(c, df1)), "\\s+|[[:punct:]]"))
)
),
c("Words", "Frequency")
)
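The base R answers above list every word, whereas the question asks only for words that occur more than once. A minimal sketch of that last filtering step, assuming the df1 from the question (rebuilt here as a plain data frame) and using illustrative names, could be:

```r
df1 <- data.frame(Jan = "The price of oil declined.",
                  Feb = "The price of gold declined.",
                  Mar = "Prices remained unchanged.",
                  stringsAsFactors = FALSE)

# Lowercase, then split on runs of whitespace and punctuation
words <- unlist(strsplit(tolower(unlist(df1)), "[[:space:][:punct:]]+"))
words <- words[nzchar(words)]   # drop any empty strings

counts <- table(words)
common <- counts[counts > 1]    # keep words seen more than once

result <- data.frame(Word = names(common),
                     Count = as.integer(common),
                     stringsAsFactors = FALSE)
```

As in the answers above, "price" and "prices" are still counted separately, so "price" gets a count of 2 here.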

Adapting string variables to specific characteristics in R

I have the following data:
id code
1 I560
2 K980
3 R30
4 F500
5 650
I would like to do the following two things to the column code:
i) keep only the letter and the two digits that follow it, and
ii) remove the observations that do not start with a letter. In the end, the data frame should look like this:
id code
1 I56
2 K98
3 R30
4 F50
In base R, you could do:
subset(transform(df, code = sub('([A-Z]\\d{2}).*', '\\1', code)),
grepl('^[A-Z]', code))
Or using tidyverse functions
library(dplyr)
library(stringr)
df %>%
mutate(code = str_extract(code, '[A-Z]\\d{2}')) %>%
filter(str_detect(code, '^[A-Z]'))
# id code
#1 1 I56
#2 2 K98
#3 3 R30
#4 4 F50
An option with substr from base R
df1$code <- substr(df1$code, 1, 3)
df1[grepl('^[A-Z]', df1$code),]
# id code
#1 1 I56
#2 2 K98
#3 3 R30
#4 4 F50
data
df1 <- structure(list(id = 1:5, code = c("I560", "K980", "R30", "F500",
"650")), row.names = c(NA, -5L), class = "data.frame")
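A slightly stricter base R sketch filters first and truncates second. It assumes the input from the question and, unlike the substr answer, only keeps rows where the letter is actually followed by two digits; the names keep and out are illustrative.

```r
df <- data.frame(id = 1:5,
                 code = c("I560", "K980", "R30", "F500", "650"),
                 stringsAsFactors = FALSE)

# TRUE where the code starts with an uppercase letter and two digits
keep <- grepl("^[A-Z][0-9]{2}", df$code)

out <- df[keep, ]
out$code <- substr(out$code, 1, 3)   # letter plus two digits
```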

How to replace column names that include specific string r

I would like to rename the columns whose names contain the string "score" with predefined names.
Here is a simple example dataset and my desired column names to replace.
df1 <- data.frame(a = c(1,2,3,4,5),
b = c(5,6,7,8,9),
c.1_score = c(10,10,2,3,4),
a.2_score= c(1,3,5,6,7))
replace.cols <- c("c_score", "a_score")
The number of columns changes with each trial, so whenever a column name includes _score, I would like to replace it with the corresponding name from replace.cols.
The desired column names are a, b, c_score and a_score.
Any thoughts?
Thanks.
We can use rename_at
library(dplyr)
df1 <- df1 %>%
rename_at(vars(ends_with('score')), ~ replace.cols)
df1
# a b c_score a_score
#1 1 5 10 1
#2 2 6 10 3
#3 3 7 2 5
#4 4 8 3 6
#5 5 9 4 7
or with str_remove
library(stringr)
df1 %>%
rename_at(vars(ends_with('score')), ~ str_remove(., '\\.\\d+'))
Or using base R (assuming the column names order is maintained in 'replace.cols')
names(df1)[endsWith(names(df1), 'score')] <- replace.cols
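Since the number of score columns varies between trials, it may be worth guarding the positional assignment, or deriving the new names from the old ones instead. The sketch below shows both under the assumption that the unwanted part is always ".<digits>" directly before "_score"; the names is_score and derived are illustrative.

```r
df1 <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(5, 6, 7, 8, 9),
                  c.1_score = c(10, 10, 2, 3, 4),
                  a.2_score = c(1, 3, 5, 6, 7))
replace.cols <- c("c_score", "a_score")

is_score <- endsWith(names(df1), "_score")

# Guard: positional replacement only makes sense if the counts agree
stopifnot(sum(is_score) == length(replace.cols))

# Alternative that needs no predefined vector: strip ".<digits>" before "_score"
derived <- sub("\\.\\d+(?=_score$)", "", names(df1), perl = TRUE)

names(df1)[is_score] <- replace.cols
```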

Assigning values to patterns of letters in character strings using R

I have a data frame that looks like this:
head(df)
shotchart
1 BMMMBMMBMMBM
2 MMMBBMMBBMMB
3 BBBBMMBMMMBB
4 MMMMBBMMBBMM
Different patterns of the letter 'M' are worth certain values such as the following:
MM = 1
MMM = 2
MMMM = 3
I want to create an extra column to this data frame that calculates the total value of the different patterns of 'M' in each row individually.
For example:
head(df)
shotchart score
1 BMMMBMMBMMBM 4
2 MMMBBMMBBMMB 4
3 BBBBMMBMMMBB 3
4 MMMMBBMMBBMM 5
I can't seem to figure out how to assign the values to the different 'M' patterns.
I tried using the following code but it didn't work:
df$score <- revalue(df$scorechart, c("MM"="1", "MMM"="2", "MMMM"="3"))
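We create a named vector ('nm1') that maps each run of 'M' to its value, split 'shotchart' into its runs of 'M', look each run up in the named vector, and sum the values. Runs with no assigned value, such as a single 'M', become NA and are dropped by na.rm = TRUE.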
nm1 <- setNames(1:3, strrep("M", 2:4))
sapply(strsplit(gsub("[^M]+", ",", df$shotchart), ","),
function(x) sum(nm1[x[nzchar(x)]], na.rm = TRUE))
Or using tidyverse
library(tidyverse)
df %>%
mutate(score = str_extract_all(shotchart, "M+") %>%
map_dbl(~ nm1[.x] %>%
sum(., na.rm = TRUE)))
# shotchart score
#1 BMMMBMMBMMBM 4
#2 MMMBBMMBBMMB 4
#3 BBBBMMBMMMBB 3
#4 MMMMBBMMBBMM 5
You can also split on "B" and score each run as its number of "M" characters minus 1, as follows:
df <- data.frame(shotchart = c("BMMMBMMBMMBM", "MMMBBMMBBMMB", "BBBBMMBMMMBB", "MMMMBBMMBBMM"),
score = NA_integer_,
stringsAsFactors = FALSE)
# sapply returns a numeric vector; lapply would create a list column
df$score <- sapply(strsplit(df$shotchart, "B"), function(i) sum((nchar(i)-1)[(nchar(i)-1)>0]))
# shotchart score
#1 BMMMBMMBMMBM 4
#2 MMMBBMMBBMMB 4
#3 BBBBMMBMMMBB 3
#4 MMMMBBMMBBMM 5
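Another way to look at the problem is as run-length encoding. The sketch below uses rle with a lookup keyed by run length; it assumes, as the named-vector answer does, that runs with no stated value (a single "M", or five or more) score 0, which the question does not specify. The names values and score_one are illustrative.

```r
shotchart <- c("BMMMBMMBMMBM", "MMMBBMMBBMMB", "BBBBMMBMMMBB", "MMMMBBMMBBMM")

# Value of a run of "M", keyed by run length (MM = 1, MMM = 2, MMMM = 3)
values <- c("2" = 1, "3" = 2, "4" = 3)

score_one <- function(s) {
  runs <- rle(strsplit(s, "")[[1]])            # runs of identical letters
  m_len <- runs$lengths[runs$values == "M"]    # lengths of the "M" runs
  sum(values[as.character(m_len)], na.rm = TRUE)
}

score <- vapply(shotchart, score_one, numeric(1), USE.NAMES = FALSE)
```

Changing the scoring rule then only means editing the values vector.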
