Turn txt file into dataframe - r

I have a txt file with this data in it:
1 message («random_choice»)[5];
2 reply («принято»)[2][3];
3 regulate («random_choice»)[5];
4 Early reg («for instance»)[2][3][4];
4xx: Success (загрузка):
6 OK («fine»)[2][3];
I want to turn it into a data frame with three columns: ID, Message, Comment.
I also want to remove the unnecessary numbers in square brackets at the end of each line.
Some values in the ID column contain letters (usually xx). In those cases the ID should simply be left empty.
So the desired result should look like this:
ID  Message    Comment
1   message    random_choice
2   reply      принято
3   regulate   random_choice
4   Early reg  for instance
    Success    загрузка
6   OK         fine
How can I do that? Even when I try to read this txt file, I get a strange error:
df <- read.table("data_received.txt", header = TRUE)
The error I get:
Error in read.table("data_received.txt", header = TRUE) :
more columns than column names

You can use strcapture for this.
The data below is fake; with your file you'll likely do txt <- readLines("data_received.txt"). (Since my locale on Windows is not friendly to those strings, I replaced the Cyrillic with plain ASCII; the original strings should work just fine on your system.)
txt <- readLines(textConnection("1 message («random_choice»)[5];
2 reply («asdf»)[2][3];
3 regulate («random_choice»)[5];
4 Early reg («for instance»)[2][3][4];
4xx: Success (something):
6 OK («fine»)[2][3];"))
The breakout:
out <- strcapture("^(\\S+)\\s+([^(]+)\\s+\\((.*)\\).*$", txt,
proto = data.frame(ID=0L, Message="", Comment=""))
# Warning in fun(mat[, i]) : NAs introduced by coercion
out
# ID Message Comment
# 1 1 message «random_choice»
# 2 2 reply «asdf»
# 3 3 regulate «random_choice»
# 4 4 Early reg «for instance»
# 5 NA Success something
# 6 6 OK «fine»
The proto= argument indicates what type of columns are generated. Since I set the ID=0L, it assumes it'll be integer, so anything that does not convert to integer becomes NA (which satisfies your fifth row omission).
Explanation on the regex:
in general:
* means zero-or-more of the previous character (or character class)
+ means one-or-more
? (not used, but useful nonetheless) means zero or one
^ and $ mean the beginning and end of the string, respectively (a ^ within [..] is different)
(...) is a capture group: anything within the non-escaped parens is stored, anything not is discarded
[...] is a character group, any of the characters is a match; if this is instead [^..], then it is inverted: anything except what is listed
[[:...:]] is a POSIX character class, for example [[:space:]]
^(\\S+), start with (^) one or more (+) non-space characters (\\S);
\\s+ one or more space character (\\s) (discarded);
([^(]+) one or more character that is not a left-paren;
\\((.*)\\).*$ a literal left-paren (\\(), then zero or more of anything (.*), captured up to a literal right-paren (\\)), and then anything else (.*) through the end of the string ($), which discards the trailing bracketed numbers.
It should be noted that \\s and \\S are not POSIX regex escapes; the usual suggestion is to use [[:space:]] for \\s and [^[:space:]] for \\S (no space chars). Those are equivalent, but I went with code-golf initially. With this replacement, it looks like
out <- strcapture("^([^[:space:]]+)[[:space:]]+([^(]+)[[:space:]]+\\((.*)\\).*$", txt,
proto = data.frame(ID=0L, Message="", Comment=""))
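If the ID should be literally empty rather than NA, as the question asks, one possible variation is to capture ID as character and blank out anything that is not purely digits. This is only a minimal sketch, not part of the answer above: the sample lines are simplified to ASCII without the guillemets, and the names txt_ascii and out_chr are made up for illustration.

```r
# Simplified ASCII sample lines (assumption: same layout as the real file)
txt_ascii <- c("1 message (random_choice)[5];",
               "4 Early reg (for instance)[2][3][4];",
               "4xx: Success (something):",
               "6 OK (fine)[2][3];")

# Capture all three fields as character so nothing is coerced to NA
out_chr <- strcapture("^(\\S+)\\s+([^(]+)\\s+\\((.*)\\).*$", txt_ascii,
                      proto = data.frame(ID = "", Message = "", Comment = "",
                                         stringsAsFactors = FALSE))

# Keep IDs made only of digits; replace everything else with an empty string
out_chr$ID[!grepl("^[0-9]+$", out_chr$ID)] <- ""
out_chr
```

Keeping ID as character avoids the coercion warning; as.integer() can still be applied later if a numeric column with NA is preferred.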

We can use {unglue}. You have two patterns here: one contains "«" and an ID, the other doesn't. {unglue} uses the first pattern that matches. Any {foo} or {} expression matches the regex ".*?", and a data.frame is built from the names put between the braces.
txt <- c(
"1 message («random_choice»)[5];", "2 reply («asdf»)[2][3];",
"3 regulate («random_choice»)[5];", "4 Early reg («for instance»)[2][3][4];",
"4xx: Success (something):", "6 OK («fine»)[2][3];")
library(unglue)
patterns <-
c("{id} {Message} («{Comment}»){}",
"{} {Message} ({Comment}){}")
unglue_data(txt, patterns)
#> id Message Comment
#> 1 1 message random_choice
#> 2 2 reply asdf
#> 3 3 regulate random_choice
#> 4 4 Early reg for instance
#> 5 <NA> Success something
#> 6 6 OK fine

Related

extract part of word into a field from a long string using R

I have a single long string variable with 3 observations. I am trying to create a field prob that extracts a specific substring from it. The code and the warning message are below.
data aa: "The probability of being a carrier is 0.0002422359 " " an BRCA1 carrier 0.0001061067 "
" an BRCA2 carrier 0.00013612 "
aa$prob <- ifelse(grepl("The probability of being a carrier is", xx)==TRUE, word(aa, 8, 8), ifelse(grepl("BRCA", xx)==TRUE, word(aa, 5, 5), NA))
Warning message: In aa$prob <- ifelse(grepl("The probability of being a carrier is", : Coercing LHS to a list
Here is my previous answer, updated to reflect a data.frame.
library(dplyr)
aa <- data.frame(aa = c("...", "...", "The probability of being a carrier is 0.0002422359 ", " an BRCA1 carrier 0.0001061067 ", " an BRCA2 carrier 0.00013612 ", "..."))
aa %>%
mutate(prob = as.numeric(if_else(grepl("(probability|BRCA[12] carrier)", aa),
gsub("^.*?\\b([0-9]+\\.?[0-9]*)\\s*$", "\\1", aa), NA_character_)))
# aa prob
# 1 ... NA
# 2 ... NA
# 3 The probability of being a carrier is 0.0002422359 0.0002422359
# 4 an BRCA1 carrier 0.0001061067 0.0001061067
# 5 an BRCA2 carrier 0.00013612 0.0001361200
# 6 ... NA
Regex walk-through:
^ and $ are beginning and end of string, respectively; \\b is a word-boundary; none of these "consume" any characters, they just mark beginnings and endings
. means one character
? means "zero or one", aka optional; * means "zero or more"; + means "one or more"; all refer to the previous character/class/group
\\s is blank space, including spaces and tabs
[0-9] is a class, meaning any character between 0 and 9; similarly, [a-z] is all lowercase letters, [a-zA-Z] are all letters, [0-9A-F] are hexadecimal digits, etc
(...) is a saved group; it's not uncommon in a group to use | as an "or"; this group is used later in the replacement= part of gsub as numbered groups, so \\1 recalls the first group from the pattern
So grouped and summarized:
   "^.*?\\b([0-9]+\\.?[0-9]*)\\s*$"
1           ^^^^^^^^^^^^^^^^^^
2        ^^^
3     ^^^
4                             ^^^^
1. This is the "number" part, that allows for one or more digits, an optional decimal point, and zero or more digits. This is saved in group "1".
2. The word boundary guarantees that we include leading numbers (it's possible, depending on a few things, for "12.345" to be parsed as "2.345" without this).
3. Anything before the number-like string.
4. Some or no blank space after the number.
Grouped logically, in an organized way
Regex isn't unique to R, it's a parsing language that R (and most other programming languages) supports in one way or another.
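For comparison, the same extraction can be sketched in base R without dplyr. This is an illustration under assumptions, not the answer's own code: the vector aa_vec and the variables hit and prob are hypothetical names, and perl = TRUE is added so that the lazy .*? and \\b follow PCRE semantics.

```r
# Sample strings taken from the question, plus one non-matching entry
aa_vec <- c("...",
            "The probability of being a carrier is 0.0002422359 ",
            " an BRCA1 carrier 0.0001061067 ",
            " an BRCA2 carrier 0.00013612 ")

# Same pattern as above: lazy prefix, word boundary, number, trailing blanks
pattern <- "^.*?\\b([0-9]+\\.?[0-9]*)\\s*$"

# Only rows that mention a probability or a BRCA carrier get a value
hit <- grepl("(probability|BRCA[12] carrier)", aa_vec)
prob <- rep(NA_real_, length(aa_vec))
prob[hit] <- as.numeric(sub(pattern, "\\1", aa_vec[hit], perl = TRUE))
prob
```

Rows that do not match the grepl filter stay NA, mirroring the NA_character_ branch of the if_else call above.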

Rename Dataframe Column Names in R using Previous Column Name and Regex Pattern

I am working in R for the first time and I have been having difficulty renaming the columns of a data frame (Grade.Data). I have a dataset imported from a CSV file that has column names like this:
Student.ID
Grade
Interactive.Exercises.1..Health
Interactive.Exercises.2..Fitness
Quizzes.1..Week.1.Quiz
Quizzes.2..Week.2.Quiz
Case.Studies.1..Case.Study1
Case.Studies.2..Case.Study2
I would like to change the variable names so that they are simpler, i.e. from Interactive.Exercises.1..Health to Interactive.Exercises.1, or from Quizzes.1..Week.1.Quiz to Quizzes.1
So far, I have tried this:
grep(".*[0-9]", names(Grade.Data))
But I get this returned:
[1] 3 4 5 6 7 8 9 11 12 13 14 15 16 17 19 20 21 22 23 24 25
Can anyone help me figure out what is going on, and write a better regex expression? Thank you so much.
It seems you want to truncate the column names after the first chunk of digits.
You may use the following sub solution:
names(Grade.Data) <- sub("^(.*?\\d+).*$", "\\1", names(Grade.Data))
Details
^ - start of string
(.*?\\d+) - Group 1 (later referred to with \\1 from the replacement pattern) matching any 0+ chars as few as possible (.*?) and then 1 or more digits (\\d+)
.* - any 0+ chars as many as possible
$ - end of string
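A quick way to check the pattern is to run it on the column names listed in the question. The sketch below is illustrative only: nm and short are hypothetical names, and perl = TRUE is an added assumption so that the lazy quantifier follows PCRE semantics.

```r
# Column names copied from the question
nm <- c("Student.ID", "Grade",
        "Interactive.Exercises.1..Health", "Interactive.Exercises.2..Fitness",
        "Quizzes.1..Week.1.Quiz", "Quizzes.2..Week.2.Quiz",
        "Case.Studies.1..Case.Study1", "Case.Studies.2..Case.Study2")

# Keep everything up to and including the first run of digits;
# names without digits do not match and are returned unchanged
short <- sub("^(.*?\\d+).*$", "\\1", nm, perl = TRUE)
short
```

Because sub leaves non-matching strings untouched, Student.ID and Grade keep their original names.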
There is nothing wrong with your regex itself. What you are looking for is probably the combination of regexpr, which gets the start and length of each match, and regmatches, which extracts the actual strings corresponding to the output of regexpr:
start_end <- regexpr(".*[0-9]", names(Grade.Data))
regmatches(names(Grade.Data), start_end)
# [1] "Interactive.Exercises.1" "Interactive.Exercises.2"
# [3] "Quizzes.1..Week.1" "Quizzes.2..Week.2"
# [5] "Case.Studies.1..Case.Study1"
Adding a question-mark behind the dot-star will make the regex match as few characters as possible, so it will stop after the first numeric value:
start_end <- regexpr(".*?[0-9]", names(Grade.Data))
regmatches(names(Grade.Data), start_end)
# [1] "Interactive.Exercises.1" "Interactive.Exercises.2"
# [3] "Quizzes.1" "Quizzes.2"
# [5] "Case.Studies.1"
You can also assign new names directly with the names function. Here is a small example; the vector of names can be as long as you need (one entry per column).
names(x = Grade.Data) <- c("Col1_name", "Col2_name")

How do I find a subtext without comma using regex in R?

I have a data frame as:
result <- c('Ab1 : 256 ug/mL(R), Ab2(disk); 18mm(S)', 'Ab1 : 4 ug/mL(S), Ab2(disk); <2mm(R)')
df <- data.frame(result)
How can I check whether '(R)' appears right after 'Ab1' (antibiotic 1)?
grep("Ab1[[:print:]]*\\(R\\)", result)
gives
[1] 1 2
while the result I want is
[1] 1
Try this:
grep("Ab1[^(]*?\\(R\\)", result)
[1] 1
Ab1 match 'Ab1' literally
[^(]*? match anything besides an opening parenthesis, non greedily
\\(R\\) match '(R)' literally
In the second case, it is not possible to do this match without first consuming at least one opening parenthesis, hence only the first matches.
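The difference between the two patterns can be seen side by side. This is a small sketch using the data from the question; loose and strict are made-up names, and the lazy modifier is dropped here on the assumption that the negated class [^(] already stops at the first opening parenthesis.

```r
result <- c("Ab1 : 256 ug/mL(R), Ab2(disk); 18mm(S)",
            "Ab1 : 4 ug/mL(S), Ab2(disk); <2mm(R)")

# [[:print:]]* may run across "(S), Ab2(disk); <2mm", so both strings match
loose <- grep("Ab1[[:print:]]*\\(R\\)", result)

# [^(]* cannot cross an opening parenthesis, so "(R)" must be the first
# parenthesised token after "Ab1"
strict <- grep("Ab1[^(]*\\(R\\)", result)

loose   # indices matched by the original pattern
strict  # indices matched by the restricted pattern
```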

Count characters of a section of a string

I have this df:
dput(df)
structure(list(URLs = c("http://bursesvp.ro//portal/user/_/Banco_Votorantim_Cartoes/0-7f2f5cb67f1-22918b.html",
"http://46.165.216.78/.CartoesVotorantim/Usuarios/Cadastro/BV6102891782/",
"http://www.chalcedonyhotel.com/images/promoc/premiado.tam.fidelidade/",
"http://bmbt.ro/portal/a3/_Votorantim_/VotorantimCartoes2016/0-7f2f5cb67f1-22928b.html",
"http://voeazul.nl/azul/")), .Names = "URLs", row.names = c(NA,
-5L), class = "data.frame")
It describes different URLs and I am trying to count the number of characters of the host name, whether that is an actual name(http://hostname.com/....) or an IP(http://000.000.000.000/...). However, if it is an actual name, then I only want the nchar between www. and .com. If it's an IP then all its numbers and "in between" dots.
Expected Outcome for the above sample data:
exp_outcome
1 8
2 13
3 15
4 4
5 7
I tried to do something with strsplit but could not get anywhere.
Another, maybe more direct way with a different regex:
nchar(sub("^http://(www\\.)?(([a-z]+)|([0-9.]+))(\\.[a-z]+)?/+.+$", "\\2", df$URLs))
#[1] 8 13 15 4 7
explanation:
^http://: looks for "http://" after beginning of the string
(www\\.)?: looks for "www.", zero or one time (so this is optional)
(([a-z]+)|([0-9.]+)): the pattern that will be captured : either lowercase letters one or more time or digits and points
(\\.[a-z]+)?: looks for "." followed by one or more lowercase letters, zero or one time (so again optional)
/+.+$: looks for "/" followed by anything, one or more times till the end of string
NB:
sub("^http://(www\\.)?(([a-z]+)|([0-9.]+))(\\.[a-z]+)?/+.+$", "\\2", df$URLs)
# [1] "bursesvp" "46.165.216.78" "chalcedonyhotel" "bmbt" "voeazul"
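To try the pattern without the original data frame, the URLs from the dput output can be placed in a plain character vector. A minimal sketch, where urls and host are hypothetical names:

```r
# URLs copied from the question's dput output
urls <- c("http://bursesvp.ro//portal/user/_/Banco_Votorantim_Cartoes/0-7f2f5cb67f1-22918b.html",
          "http://46.165.216.78/.CartoesVotorantim/Usuarios/Cadastro/BV6102891782/",
          "http://www.chalcedonyhotel.com/images/promoc/premiado.tam.fidelidade/",
          "http://bmbt.ro/portal/a3/_Votorantim_/VotorantimCartoes2016/0-7f2f5cb67f1-22928b.html",
          "http://voeazul.nl/azul/")

# Group 2 holds either the lowercase host name or the dotted IP address
host <- sub("^http://(www\\.)?(([a-z]+)|([0-9.]+))(\\.[a-z]+)?/+.+$", "\\2", urls)

# Number of characters in each extracted host
nchar(host)
```

Note that the pattern only covers http:// URLs with lowercase host names, which is all the sample data contains.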
Here’s how to do it (assuming your data.frame is called x):
domains = sub('^(http://)([^/]+)(.*)$', '\\2', x$URLs)
# This will fail for IP addresses …
hostname = sub('^(www\\.)?([^.]+)(\\..+)?$', '\\2', domains)
# … which we treat separately here:
is_ip = grepl('^(\\d{1,3}\\.){3}\\d{1,3}$', domains)
hostname[is_ip] = domains[is_ip]
exp_outcome$domain_length = nchar(hostname)
On a side note, I converted your original data.frame to character strings — it simply makes no sense to use a factor for URLs.
After 5 months of dealing with URLs in general, I found the following packages which make life a bit easier (Regex provided by other answers do work great by the way),
library(urltools)
library(iptools)
df$Hostname <- domain(df$URLs)
#However, TLDs and 'www' need to go so I used suffix_extract()$domain from `urltools`
df$Hostname <- ifelse(is.na(suffix_extract(df$Hostname)$domain), df$Hostname,
suffix_extract(df$Hostname)$domain)
#which gives:
# URLs Hostname
#1 http://bursesvp.ro//portal/user/_/... bursesvp
#2 http://46.165.216.78/.CartoesVotorantim/Usuarios/... 46.165.216.78
#3 http://www.chalcedonyhotel.com/images/promoc/ chalcedonyhotel
#4 http://bmbt.ro/portal/a3/_Votorantim_/... bmbt
#5 http://voeazul.nl/azul/ voeazul
#then simply,
nchar(df$Hostname)
#[1] 8 13 15 4 7

how to manipulate variables in a factor of a data frame

I need to clean up a factor column named phone_number in my data frame:
the values must be numeric with length 5
and must not contain special characters
I want to change values of the form AO-11111 or VQ-11111 to 11111, that is, erase the leading characters, and finally turn all remaining values into NA
My data.frame comes from a .csv file. Initially phone_number is a factor such as
phone_number
VQ-40773
VQ-43685
VQ-44986
40270
41694
42623
.
.
The strsplit function will help you get the value out of the string.
str <- "VQ-40773"
strsplit(str, "-")[[1]][2]  # returns "40773"
If you want to remove anything that precedes a dash, then:
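Indexing with [2] returns NA for values that contain no dash, such as 40270. One way around that, sketched here with the hypothetical names phone_chr, pieces and last_piece, is to take the last piece of each split instead:

```r
# A few values from the question, with and without a prefix
phone_chr <- c("VQ-40773", "VQ-43685", "40270")

# Split every element on a literal dash
pieces <- strsplit(phone_chr, "-", fixed = TRUE)

# Take the last piece, which is the whole string when there is no dash
last_piece <- vapply(pieces, function(p) p[length(p)], character(1))
last_piece
```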
sub("^([^-]+[-])(.+)", "\\2", phone_number)
> phone_number <- scan(what="")
1: VQ-40773
2: VQ-43685
3: VQ-44986
4: 40270
5: 41694
6: 42623
7:
Read 6 items
> sub("^([^-]+[-])(.+)", "\\2", phone_number)
[1] "40773" "43685" "44986" "40270" "41694" "42623"
> as.numeric(sub("^([^-]+[-])(.+)", "\\2", phone_number))
[1] 40773 43685 44986 40270 41694 42623
The nchar function would allow checking the lengths of a character vector. Post an adequate example and, please, do make a greater effort to get punctuation and capitalization correct.
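Putting the pieces together, a possible sketch of the whole clean-up is shown below. It assumes the column really is a factor and that "valid" means exactly five digits after the prefix is removed; the sample values (including the two deliberately invalid ones) and the names stripped and cleaned are made up for illustration.

```r
# Hypothetical factor column, including two values that should become NA
phone_number <- factor(c("VQ-40773", "AO-43685", "40270", "4162", "41*94"))

# Factors must be converted to character before string manipulation
stripped <- sub("^[^-]+-", "", as.character(phone_number))

# Anything that is not exactly five digits is treated as invalid
stripped[!grepl("^[0-9]{5}$", stripped)] <- NA

cleaned <- as.integer(stripped)
cleaned
```

The grepl check covers both the length requirement and the ban on special characters in one step, so nchar is not needed separately here.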

Resources