Count characters of a section of a string

I have this df:
dput(df)
structure(list(URLs = c("http://bursesvp.ro//portal/user/_/Banco_Votorantim_Cartoes/0-7f2f5cb67f1-22918b.html",
"http://46.165.216.78/.CartoesVotorantim/Usuarios/Cadastro/BV6102891782/",
"http://www.chalcedonyhotel.com/images/promoc/premiado.tam.fidelidade/",
"http://bmbt.ro/portal/a3/_Votorantim_/VotorantimCartoes2016/0-7f2f5cb67f1-22928b.html",
"http://voeazul.nl/azul/")), .Names = "URLs", row.names = c(NA,
-5L), class = "data.frame")
It contains different URLs, and I am trying to count the number of characters in the host name, whether that is an actual name (http://hostname.com/....) or an IP address (http://000.000.000.000/...). If it is an actual name, I only want the nchar between www. and .com. If it's an IP address, I want all its digits and the dots in between.
Expected Outcome for the above sample data:
exp_outcome
1 8
2 13
3 15
4 4
5 7
I tried to do something with strsplit but could not get anywhere.

Another, maybe more direct way with a different regex:
nchar(sub("^http://(www\\.)?(([a-z]+)|([0-9.]+))(\\.[a-z]+)?/+.+$", "\\2", df$URLs))
#[1] 8 13 15 4 7
explanation:
^http://: looks for "http://" after beginning of the string
(www\\.)?: looks for "www.", zero or one time (so this is optional)
(([a-z]+)|([0-9.]+)): the group that is captured: either one or more lowercase letters, or one or more digits and dots
(\\.[a-z]+)?: looks for "." followed by one or more lowercase letters, zero or one time (so again optional)
/+.+$: looks for one or more "/" followed by one or more of any character, up to the end of the string
NB:
sub("^http://(www\\.)?(([a-z]+)|([0-9.]+))(\\.[a-z]+)?/+.+$", "\\2", df$URLs)
# [1] "bursesvp" "46.165.216.78" "chalcedonyhotel" "bmbt" "voeazul"
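The capture-group approach above can be tried end to end on the sample URLs from the question. This is only a sketch: the helper name host_length is hypothetical, and it adds one assumption of its own, namely that URLs which do not fit the pattern should give NA. Without that guard, sub() returns a non-matching string unchanged, so nchar() would silently report the length of the whole URL.

```r
urls <- c(
  "http://bursesvp.ro//portal/user/_/Banco_Votorantim_Cartoes/0-7f2f5cb67f1-22918b.html",
  "http://46.165.216.78/.CartoesVotorantim/Usuarios/Cadastro/BV6102891782/",
  "http://www.chalcedonyhotel.com/images/promoc/premiado.tam.fidelidade/",
  "http://bmbt.ro/portal/a3/_Votorantim_/VotorantimCartoes2016/0-7f2f5cb67f1-22928b.html",
  "http://voeazul.nl/azul/"
)

# Same pattern as in the answer above
pat <- "^http://(www\\.)?(([a-z]+)|([0-9.]+))(\\.[a-z]+)?/+.+$"

# Hypothetical helper: length of the captured host part, NA when the
# URL does not match the pattern at all
host_length <- function(u) {
  ok <- grepl(pat, u)
  out <- rep(NA_integer_, length(u))
  out[ok] <- nchar(sub(pat, "\\2", u[ok]))
  out
}

host_length(urls)
```

A URL such as "https://example.org/x" (made up for illustration) does not start with "http://", so it yields NA here rather than a misleading count.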

Here’s how to do it (assuming your data.frame is called x):
domains = sub('^(http://)([^/]+)(.*)$', '\\2', x$URLs)
# This will fail for IP addresses …
hostname = sub('^(www\\.)?([^.]+)(\\..+)?$', '\\2', domains)
# … which we treat separately here:
is_ip = grepl('^(\\d{1,3}\\.){3}\\d{1,3}$', domains)
hostname[is_ip] = domains[is_ip]
exp_outcome$domain_length = nchar(hostname)
On a side note, I converted your original data.frame to character strings — it simply makes no sense to use a factor for URLs.

After 5 months of dealing with URLs in general, I found the following packages, which make life a bit easier (the regexes provided in the other answers do work well, by the way):
library(urltools)
library(iptools)
df$Hostname <- domain(df$URLs)
#However, TLDs and 'www' need to go, so I used suffix_extract()$domain from `urltools`
df$Hostname <- ifelse(is.na(suffix_extract(df$Hostname)$domain), df$Hostname,
suffix_extract(df$Hostname)$domain)
#which gives:
# URLs Hostname
#1 http://bursesvp.ro//portal/user/_/... bursesvp
#2 http://46.165.216.78/.CartoesVotorantim/Usuarios/... 46.165.216.78
#3 http://www.chalcedonyhotel.com/images/promoc/ chalcedonyhotel
#4 http://bmbt.ro/portal/a3/_Votorantim_/... bmbt
#5 http://voeazul.nl/azul/ voeazul
#then simply,
nchar(df$Hostname)
#[1] 8 13 15 4 7

Related

Turn txt file into dataframe

I have a txt file with this data in it:
1 message («random_choice»)[5];
2 reply («принято»)[2][3];
3 regulate («random_choice»)[5];
4 Early reg («for instance»)[2][3][4];
4xx: Success (загрузка):
6 OK («fine»)[2][3];
I want to turn it into a data frame with three columns: ID, Message, Comment.
I also want to remove the unnecessary numbers in square brackets at the end.
Also, some values in the ID column contain strings (usually xx). In these cases, the ID must simply be empty.
So the desired result must look like this:
ID Message Comment
1 message random_choice
2 reply принято
3 regulate random_choice
4 Early reg for instance
Success загрузка
6 OK fine
How could I do that? Even when I try to read this txt file, I get a strange error:
df <- read.table("data_received.txt", header = TRUE)
The error I get:
Error in read.table("data_received.txt", header = TRUE) :
more columns than column names

You can use strcapture for this.
Fake data, you'll likely do txt <- readLines("data_received.txt"). (Since my locale on windows is not being friendly to those strings, I'll replace with straight ascii, assuming it'll work just fine on your system.)
txt <- readLines(textConnection("1 message («random_choice»)[5];
2 reply («asdf»)[2][3];
3 regulate («random_choice»)[5];
4 Early reg («for instance»)[2][3][4];
4xx: Success (something):
6 OK («fine»)[2][3];"))
The breakout:
out <- strcapture("^(\\S+)\\s+([^(]+)\\s+\\((.*)\\).*$", txt,
proto = data.frame(ID=0L, Message="", Comment=""))
# Warning in fun(mat[, i]) : NAs introduced by coercion
out
# ID Message Comment
# 1 1 message «random_choice»
# 2 2 reply «asdf»
# 3 3 regulate «random_choice»
# 4 4 Early reg «for instance»
# 5 NA Success something
# 6 6 OK «fine»
The proto= argument indicates what type of columns are generated. Since I set the ID=0L, it assumes it'll be integer, so anything that does not convert to integer becomes NA (which satisfies your fifth row omission).
Explanation on the regex:
in general:
* means zero-or-more of the previous character (or character class)
+ means one-or-more
? (not used, but useful nonetheless) means zero or one
^ and $ mean the beginning and end of the string, respectively (a ^ within [..] is different)
(...) is a capture group: anything within the non-escaped parens is stored, anything not is discarded
[...] is a character group, any of the characters is a match; if this is instead [^..], then it is inverted: anything except what is listed
[[:...:]] is a POSIX character class used inside a character group, e.g. [[:space:]]
^(\\S+), start with (^) one or more (+) non-space characters (\\S);
\\s+ one or more space character (\\s) (discarded);
([^(]+) one or more character that is not a left-paren;
\\((.*)\\).*$ a literal left-paren (\\() and then zero or more of anything (.*), captured all the way to a literal right-paren (\\)); whatever follows up to the end of the string (.*$) is discarded.
It should be noted that \\s and \\S are non-POSIX regex characters, where it is generally suggested to use [^[:space:]] for \\S (no space chars) and [[:space:]] for \\s. Those are equivalent but I went with code-golf initially. With this replacement, it looks like
out <- strcapture("^([^[:space:]]+)[[:space:]]+([^(]+)[[:space:]]+\\((.*)\\).*$", txt,
proto = data.frame(ID=0L, Message="", Comment=""))
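The strcapture() result above still has the guillemets in Comment and NA in ID, while the desired table in the question has neither. A minimal sketch of that clean-up follows. It assumes the guillemets should simply be deleted and that a blank string is an acceptable stand-in for a missing ID (which turns the ID column into character). The guillemets are written as \u escapes so the code does not depend on the file encoding.

```r
# Sample lines; \u00ab and \u00bb are the left and right guillemets
txt <- c(
  "1 message (\u00abrandom_choice\u00bb)[5];",
  "4 Early reg (\u00abfor instance\u00bb)[2][3][4];",
  "4xx: Success (something):"
)

# Same pattern and prototype as above; the non-integer ID triggers a
# coercion warning, which is expected here and therefore suppressed
out <- suppressWarnings(strcapture(
  "^(\\S+)\\s+([^(]+)\\s+\\((.*)\\).*$", txt,
  proto = data.frame(ID = 0L, Message = "", Comment = "",
                     stringsAsFactors = FALSE)
))

# Delete the guillemets from the comment (literal matching, no regex)
out$Comment <- gsub("\u00ab", "", out$Comment, fixed = TRUE)
out$Comment <- gsub("\u00bb", "", out$Comment, fixed = TRUE)

# Show a blank instead of NA for IDs that were not integers
out$ID <- ifelse(is.na(out$ID), "", as.character(out$ID))
out
```

If the IDs are needed as numbers later, it may be better to keep the NA values and only blank them out when printing or exporting.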

We can use {unglue}. Here you have two patterns: one contains "«" and an ID, the other doesn't. {unglue} will use the first pattern that matches. Any {foo} or {} expression matches the regex ".*?", and a data.frame is built from the names put between brackets.
txt <- c(
"1 message («random_choice»)[5];", "2 reply («asdf»)[2][3];",
"3 regulate («random_choice»)[5];", "4 Early reg («for instance»)[2][3][4];",
"4xx: Success (something):", "6 OK («fine»)[2][3];")
library(unglue)
patterns <-
c("{id} {Message} («{Comment}»){}",
"{} {Message} ({Comment}){}")
unglue_data(txt, patterns)
#> id Message Comment
#> 1 1 message random_choice
#> 2 2 reply asdf
#> 3 3 regulate random_choice
#> 4 4 Early reg for instance
#> 5 <NA> Success something
#> 6 6 OK fine

Rename Dataframe Column Names in R using Previous Column Name and Regex Pattern

I am working in R for the first time and I have been having difficulty renaming columns in a data frame (Grade.Data). I have a dataset imported from a CSV file that has column names like this:
Student.ID
Grade
Interactive.Exercises.1..Health
Interactive.Exercises.2..Fitness
Quizzes.1..Week.1.Quiz
Quizzes.2..Week.2.Quiz
Case.Studies.1..Case.Study1
Case.Studies.2..Case.Study2
I would like to change the variable names so that they are simpler, i.e. from Interactive.Exercises.1..Health to Interactive.Exercises.1, or from Quizzes.1..Week.1.Quiz to Quizzes.1.
So far, I have tried this:
grep(".*[0-9]", names(Grade.Data))
But I get this returned:
[1] 3 4 5 6 7 8 9 11 12 13 14 15 16 17 19 20 21 22 23 24 25
Can anyone help me figure out what is going on, and write a better regex? Thank you so much.

It seems you want to truncate the column names after the first chunk of digits.
You may use the following sub solution:
names(Grade.Data) <- sub("^(.*?\\d+).*$", "\\1", names(Grade.Data))
Details
^ - start of string
(.*?\\d+) - Group 1 (later referred with \1 from the replacement pattern) matching any 0+ chars as few as possible (.*?) and then 1 or more digits (\d+)
.* - any 0+ chars as many as possible
$ - end of string
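As a quick check, the same substitution can be run on the column names listed in the question. This is only a sketch: the vector nm stands in for names(Grade.Data), and perl = TRUE is my own choice here so that the lazy quantifier is handled by PCRE.

```r
# Stand-in for names(Grade.Data), copied from the question
nm <- c(
  "Student.ID", "Grade",
  "Interactive.Exercises.1..Health", "Interactive.Exercises.2..Fitness",
  "Quizzes.1..Week.1.Quiz", "Quizzes.2..Week.2.Quiz",
  "Case.Studies.1..Case.Study1", "Case.Studies.2..Case.Study2"
)

# Keep everything up to and including the first run of digits;
# names without any digit do not match and are returned unchanged
short <- sub("^(.*?\\d+).*$", "\\1", nm, perl = TRUE)

# Truncation could in principle produce duplicate names, so check
stopifnot(!anyDuplicated(short))
short
```

Student.ID and Grade contain no digits, so they pass through untouched, which is why the result can be assigned straight back to names(Grade.Data).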

There is nothing wrong with your regex itself. What you are looking for is probably the combination of regexpr - which gets the start and ending of your regex- and regmatches - which gets the actual string corresponding to the output of regexpr:
start_end <- regexpr(".*[0-9]", names(Grade.Data))
regmatches(names(Grade.Data), start_end)
# [1] "Interactive.Exercises.1" "Interactive.Exercises.2"
# [3] "Quizzes.1..Week.1" "Quizzes.2..Week.2"
# [5] "Case.Studies.1..Case.Study1"
Adding a question-mark behind the dot-star will make the regex match as few characters as possible, so it will stop after the first numeric value:
start_end <- regexpr(".*?[0-9]", names(Grade.Data))
regmatches(names(Grade.Data), start_end)
# [1] "Interactive.Exercises.1" "Interactive.Exercises.2"
# [3] "Quizzes.1" "Quizzes.2"
# [5] "Case.Studies.1"
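One thing to watch with this approach: regmatches() returns only the elements that matched, so its result is shorter than the input whenever some names contain no digit. A small sketch of writing the matches back into the right positions is shown below. The vector nm is a shortened stand-in for names(Grade.Data), and perl = TRUE is my own choice for the lazy quantifier.

```r
# Shortened stand-in for names(Grade.Data)
nm <- c("Student.ID", "Grade",
        "Quizzes.1..Week.1.Quiz", "Case.Studies.2..Case.Study2")

# regexpr() gives -1 for elements without a match
m <- regexpr(".*?[0-9]", nm, perl = TRUE)
hit <- m != -1

# Replace only the names that matched; leave the others as they are
new_names <- nm
new_names[hit] <- regmatches(nm, m)
new_names
```

The logical index hit is what keeps Student.ID and Grade in place while the other names are shortened.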

You should use the names() function. Here is a small example; the vector of names can be as long as you need.
names(x = Grade.Data) <- c("Col1_name", "Col2_name")

How do I find the position of a (fuzzy) match within a string?

I have a text processing problem in R. I want to get the character position within a string where a different string makes an exact match and/or a fuzzy match within some edit distance. For example:
A = "blahmatchblah"
B = "match"
C = "latch"
I would like to return something telling me that the 5th character within string A is where the match begins, for a search of both B and C. All the pattern matching tools I'm aware of will tell me whether there's a (fuzzy) match for B and C within A, but not where that match begins.

The base function aregexec() is used for approximate string position matching. Unfortunately it's not vectorized over pattern, so we'll have to use a loop to get the positions for both B and C.
sapply(c(B, C), aregexec, A)
# $match
# [1] 5
# attr(,"match.length")
# [1] 5
#
# $latch
# [1] 5
# attr(,"match.length")
# [1] 5
See help(aregexec) for more.
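To get the positions in a more compact form than the nested list, the loop can pull out just the start and length of each match. A minimal sketch, assuming the default max.distance of aregexec() is acceptable; the helper name first_pos is hypothetical.

```r
A <- "blahmatchblah"
patterns <- c(B = "match", C = "latch")

# Hypothetical helper: start position and length of the approximate
# match of one pattern in one string (-1 means no match was found)
first_pos <- function(pattern, text) {
  m <- aregexec(pattern, text)[[1]]
  c(start  = as.integer(m[1]),
    length = as.integer(attr(m, "match.length")[1]))
}

# One column per pattern, rows "start" and "length"
res <- vapply(patterns, first_pos, integer(2), text = A)
res
```

Passing a larger max.distance to aregexec() would allow looser matches; see its help page for how the distance is interpreted.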

I don't have rep to comment but at least for the first part of your question: gregexpr(B,A)[[1]][1] will yield 5 because "match" is a valid sub-sequence in A.

A few months back I made an interface to the fuzzywuzzy Python package in R, which has the get_matching_blocks() method (it's pretty close to what you actually ask).
Assuming you want to find the matching blocks between two strings,
A = "blahmatchblah"
B = "match"
library(fuzzywuzzyR)
init <- SequenceMatcher$new(string1 = A, string2 = B)
init$get_matching_blocks()
returns,
[[1]]
Match(a=4, b=0, size=5)
[[2]]
Match(a=13, b=5, size=0)
The first sublist gives the matching blocks of the two strings. a = 4 gives the starting index of the string A and b=0 gives the starting index of the string B (indexing starts from 0). size = 5 gives the count of characters that both strings match (in this case the matching block is "match" and has 5 characters).
The documentation, especially for SequenceMatcher, has more info.

Delete duplicate elements in String in R

I've got some problems deleting duplicate elements in a string.
My data look similar to this:
idvisit path
1 1,16,23,59
2 2,14,14,19
3 5,19,23,19
4 10,10
5 23,23,27,29,23
I have a column containing an unique ID and a column containing a path for web page navigation.
The right column contains some cases where a page was simply reloaded and so was tracked twice or more.
The pages are separated by commas and are stored as factors.
My problem is that I don't want the same page several times in a row, so the data should look like this:
idvisit path
1 1,16,23,59
2 2,14,19
3 5,19,23,19
4 10
5 23,27,29,23
Repeated pages next to each other should be removed. I know how to delete a specific repeated number using regular expressions, but I have about 20,000 different pages and can't do this for all of them.
Does anyone have a solution or a hint for my problem?
Thanks
Sebastian

We can use tidyverse. Use separate_rows to split the 'path' variable at the delimiter (,) and convert to long format; then, grouped by 'idvisit', paste together the run-length-encoding values:
library(tidyverse)
separate_rows(df1, path) %>%
group_by(idvisit) %>%
summarise(path = paste(rle(path)$values, collapse=","))
# A tibble: 5 × 2
# idvisit path
# <int> <chr>
#1 1 1,16,23,59
#2 2 2,14,19
#3 3 5,19,23,19
#4 4 10
#5 5 23,27,29,23
Or a base R option is
df1$path <- sapply(strsplit(df1$path, ","), function(x) paste(rle(x)$values, collapse=","))
NOTE: If the 'path' column is factor class, convert to character before passing as argument to strsplit i.e. strsplit(as.character(df1$path), ",")
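The base R option can be checked on the sample data from the question. This is a sketch only: df1 is rebuilt by hand here, with path already stored as character, and vapply() is used instead of sapply() so that the result is guaranteed to be a character vector.

```r
# Sample data from the question, with path as character (not factor)
df1 <- data.frame(
  idvisit = 1:5,
  path = c("1,16,23,59", "2,14,14,19", "5,19,23,19", "10,10", "23,23,27,29,23"),
  stringsAsFactors = FALSE
)

# Split each path, collapse runs of identical neighbours with rle(),
# and paste the remaining values back together
df1$path <- vapply(
  strsplit(df1$path, ",", fixed = TRUE),
  function(p) paste(rle(p)$values, collapse = ","),
  character(1)
)
df1
```

Because rle() only merges neighbouring values, a page that is revisited later (such as 19 in row 3 or 23 in row 5) is kept.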

Using the stringr package and its str_replace_all function, I think the following regular expression gets what you want: ([0-9]+),\\1, replaced with \\1 (we need to escape the \ special character):
library(stringr)
> str_replace_all("5,19,23,19", "([0-9]+),\\1", "\\1")
[1] "5,19,23,19"
> str_replace_all("10,10", "([0-9]+),\\1", "\\1")
[1] "10"
> str_replace_all("2,14,14,19", "([0-9]+),\\1", "\\1")
[1] "2,14,19"
You can also apply it to a vector: x <- c("5,19,23,19", "10,10", "2,14,14,19") then:
str_replace_all(x, "([0-9]+),\\1", "\\1")
[1] "5,19,23,19" "10" "2,14,19"
or using sapply:
result <- sapply(x, function(x) str_replace_all(x, "([0-9]+),\\1", "\\1"))
Then:
> result
5,19,23,19 10,10 2,14,14,19
"5,19,23,19" "10" "2,14,19"
Notes:
The first line is the attribute information:
> str(result)
Named chr [1:3] "5,19,23,19" "10" "2,14,19"
- attr(*, "names")= chr [1:3] "5,19,23,19" "10,10" "2,14,14,19"
If you don't want to see them (it does not affect the result), just do:
attributes(result) <- NULL
Then,
> result
[1] "5,19,23,19" "10" "2,14,19"
Explanation about the regular expression used: ([0-9]+),\\1
([0-9]+): Starts with a group 1 delimited by () and finds any digit (at least one)
,: Then comes a punctuation sign: , (we can include spaces here, but the original example only uses this character as delimiter)
\\1: Then comes an identical string to the group 1, i.e.: the repeated number. If that doesn't happen, then the pattern doesn't match.
Then if the pattern matches, it replaces it, with the value of the variable \\1, i.e. the first time the number appears in the pattern matched.
How to handle more than one duplicated number, for example 2,14,14,14,19?:
Just use this regular expression instead: ([0-9]+)(,\\1)+. It matches when the delimiter and the number are repeated at least once. You can try other possibilities using regex101.com (in my humble opinion it is more user-friendly than other online regular expression checkers).
I hope this works for you; it is a flexible solution that you just need to adapt to the pattern you need.
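One caveat with the back-reference pattern: without boundaries, \\1 can match the start of a longer number, so "1,16" is treated as "1" followed by a repeated "1". A hedged sketch of a variant with word boundaries is shown below. It uses base gsub() with perl = TRUE rather than stringr, purely to keep the example dependency-free.

```r
paths <- c("1,16,23,59", "2,14,14,14,19", "5,19,23,19", "10,10", "23,23,27,29,23")

# Pattern from the answer: the "1" before "16" is wrongly merged into it
unanchored <- gsub("([0-9]+)(,\\1)+", "\\1", paths, perl = TRUE)

# With \\b on both sides, only whole numbers count as repeats
anchored <- gsub("\\b([0-9]+)(,\\1\\b)+", "\\1", paths, perl = TRUE)

unanchored[1]
anchored
```

With the sample data from the question, the first path "1,16,23,59" is the one affected, so the boundary version is probably the safer choice there.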

Wrong replacement of strings with gsub in R

I am trying to remove all ".1" occurrences from my labelexp data frame.
My input
ID
1 NE001403
2 NE001458.1
3 NE001494.1
4 NE001634.1
5 NE001635.1
6 NE001637.1
I have tried labelexp$ID <- gsub(".1", "", labelexp$ID), but my output was:
ID
1 NE0403
2 NE0458
3 NE0494
4 NE0634
5 NE0635
6 NE0637
Any ideas? Thank you.

The "." is a special character in regular expressions in R - it means any character. You need to put "\\" in front of it to tell R that you mean it to be the character ".". Thus, try:
labelexp$ID <- gsub("\\.1", "", labelexp$ID)
Does that work for you?
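To see the difference side by side, the sketch below runs both patterns on a few of the IDs from the question. It also anchors the escaped pattern with $, on the assumption (mine, not stated in the question) that ".1" should only be removed when it is a suffix.

```r
ids <- c("NE001403", "NE001458.1", "NE001494.1")

# Unescaped: "." matches any character, so "01" is removed as well
wrong <- gsub(".1", "", ids)

# Escaped dot, anchored to the end of the string
right <- sub("\\.1$", "", ids)

wrong
right
```

Dropping the $ and using gsub() would instead remove every literal ".1" wherever it appears in the ID.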

You can also use fixed=TRUE option:
sub(".1", "","NE001458.1",fixed=TRUE)
"NE001458"
