R: how to extract the first integer or decimal number from a text, and if the first number equals one of a set of specific numbers, extract the second integer/decimal - r

The data is like this:
example - the name of the database
detail - the first column, which contains strings with numbers in them (the number can be attached to $ etc., like 25m$, and can also be decimal, like 1.2m$ or $1.2M)
Let's say the data table looks like this:
example$detail <- c("The cole mine market worth every year 100M$ and the equipment they use worth 30$m per capita", "In 2017 the first enterpenur realized there is a potential of 500$M in cole mining", "The cole can make 23b$ per year ans help 1000000 familys living on it")
I want to add a column to the example data table, named "number", that will extract the first number from the string in column "detail". BUT if this number is equal to one of the numbers in the vector "years" (it's not in the example database - it's a separate list I created), I want it to extract the second number from the string in example$detail.
So I created another years list (separate from the database):
years <- c(2016:2030)
I'm trying to create a new column - number.
What I did so far:
I managed to add a variable that extracts the first number of a string by writing the following commands:
example$number <- as.integer(sub("\\D*(\\d+).*", "\\1", example$detail)) # EXTRACT ONLY INTEGERS
example$number1 <- format(round(as.numeric(str_extract(example$detail, "\\d+\\.*\\d*")), 2), nsmall = 2) # EXTRACT THE NUMBERS AS DECIMALS WITH TWO DIGITS AFTER THE "." (THAT'S ENOUGH FOR ME)
example$number1 <- ifelse(example$number %in% years, TRUE, example$number1) # IF THE FIRST NUMBER EXTRACTED IS IN THE YEARS VECTOR, RETURN "TRUE"
Then I tried to write code that extracts the second number according to this condition, but it's not working - it just returns errors.
I tried:
gsub("[^\d]*[\d]+[^\d]+([\d]+)", example$detail)
str_extract(example$detail, "\d+(?=[A-Z\s.]+$)",[[2]])
as.integer( sub("\\D*(\\d+).*", "\\1", example$detail) )
as.numeric(strsplit(example$detail, "\\D+")[1])
I didn't understand how to match any number (integer/decimal), or how to match THE SECOND number in a string.
Thanks a lot!!

Since no good example data is provided, I'm just going to wing it here.
Imagine the dataframe df has the columns year (int) and details (char). Then:
library(dplyr)
df <- df %>% mutate(
  # keep only digits, dots and minus signs, then split on the decimal point
  clean_details = gsub("[^0-9.-]", "", details),
  clean_details_part1 = as.integer(sapply(strsplit(clean_details, "[.]"), `[`, 1)),
  clean_details_part2 = as.integer(sapply(strsplit(clean_details, "[.]"), `[`, 2))
)
This works with the code I wrote up. I didn't apply the year-check logic because I can see you're proficient enough to do that; I believe a simple ifelse statement would do to create a boolean, and you can then filter on that boolean, or take a more direct route.
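If it helps, here is a minimal sketch of the full first-number-or-second-number logic, assuming stringr is loaded; the column and vector names follow the question, but the sapply helper is my own construction:
library(stringr)
years <- 2016:2030
# pull every integer/decimal out of each string as a character vector
all_nums <- str_extract_all(example$detail, "\\d+\\.?\\d*")
# take the first number, unless it is a year, in which case take the second
example$number <- sapply(all_nums, function(v) {
  if (length(v) == 0) return(NA_real_)
  first <- as.numeric(v[1])
  if (first %in% years && length(v) >= 2) as.numeric(v[2]) else first
})
On the example data this yields 100, 500 and 23: the second string starts with 2017, which is in years, so the second number (500) is taken instead.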

Related

Text mining: how to count the frequency of two words occurring close together

Let's say I have the following data frame, df.
speaker <- c('Lincoln','Douglas')
text <- c('The framers of the Constitution, those framers...',
'The framers of our great Constitution.')
df <- data.frame(speaker,text)
I want to find (or write) a function that can count the frequency of two words occurring close together. Say I want to count instances of "framers" occurring within three words of the word "Constitution", and vice versa.
Thus, for Lincoln, the function would return 2 because you have one instance of "framers" followed by "Constitution" and another instance of "Constitution" followed by "framers." For Douglas, the function would return 0 because "Constitution" is four words away from "framers."
I'm new to text mining, so I apologize if I'm missing an easy solution.
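One possible base-R sketch (count_near is a hypothetical helper, and tokenizing on runs of non-alphanumeric characters is an assumption): find the positions of both words and count the position pairs at most three words apart.
count_near <- function(text, w1, w2, dist = 3) {
  # split into lowercase words on runs of non-alphanumeric characters
  words <- tolower(unlist(strsplit(as.character(text), "[^[:alnum:]]+")))
  i1 <- which(words == tolower(w1))
  i2 <- which(words == tolower(w2))
  # count all position pairs no more than `dist` words apart
  sum(outer(i1, i2, function(a, b) abs(a - b) <= dist))
}
sapply(df$text, count_near, w1 = "framers", w2 = "Constitution")
On the example above this returns 2 for Lincoln and 0 for Douglas, matching the expected output.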

Splitting a column in a dataframe in R into two based on content

I have a column in an R dataframe that holds a product weight, e.g. 20 kg, but it uses mixed measuring systems, e.g. 1 lbs & 2 kg etc. I want to separate the value from the measurement and put them in separate columns, then convert them in a new column to a standard weight. Any thoughts on how I might achieve that? Thanks in advance.
Assume you have the column given as
x <- c("20 kg","50 lbs","1.5 kg","0.02 lbs")
and you know that there is always a space between the number and the measurement. Then you can split this up at the space-character, e.g. via
splitted <- strsplit(x," ")
This results in a list of vectors of length two, where the first is the number and the second is the measurement.
Now grab the numbers and convert them via
numbers <- as.numeric(sapply(splitted,"[[",1))
and grab the units via
units <- sapply(splitted,"[[",2)
Now you can put everything together in a data.frame.
Note: When using as.numeric, the decimal point has to be a dot. If you have commas instead, you need to replace them by a dot, for example via gsub(",","\\.",...).
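To finish, a small sketch of assembling the columns and standardizing on kilograms; the 0.45359237 lbs-to-kg factor and the column names are my additions:
weights <- data.frame(value = numbers, unit = units)
# convert everything to kilograms; 1 lb = 0.45359237 kg
weights$kg <- ifelse(weights$unit == "lbs",
                     weights$value * 0.45359237,
                     weights$value)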
separate(DataFrame, VariableName, into = c("Value", "Metric"), sep = " ")
My case was simple enough that I could get away with just a single-space separator, but I learned that you can also use a regular expression here for more complex separator considerations. (separate comes from the tidyr package.)

Display number of rows containing a range of numbers (between 0-9) in R

I am trying to count the number of rows in my dataset (called data) that contain a number (i.e. any digit from 0 to 9) using R. I have not created a dataframe; my dataset is imported directly from a csv file into R.
EXAMPLE OF DATASET (INPUT)
MESSAGE
I have to wait 3 days
Feel quite tired
No way is 7pm already
It is too late now
This is beautiful
So the output would be 2 rows (rows 1 and 3).
I have tried the following code, but it gives me the wrong number of rows (3), so I know I am definitely doing something wrong.
data = read.csv (xxxxxx)
#count number of rows that contain numbers between 0 and 9
numbers= filter(data, !grepl("[0-9]",MESSAGE))
length(numbers)
Thank you in advance.
Maybe you can try the code below if the criterion is at least one digit:
> length(grep("\\d", MESSAGE, value = TRUE))
[1] 2
If you want to find the rows that contain a single (standalone) digit, you can try
> length(grep("\\b\\d(?![0-9])", MESSAGE, value = TRUE, perl = TRUE))
[1] 2
Data
MESSAGE <- c(
"I have to wait 3 days",
"Feel quite tired",
"No way is 7pm already",
"It is too late now",
"This is beautiful"
)
The filter function returns a dataframe, and calling length on a dataframe returns the number of columns, not the number of rows. Also, by putting ! in front of grepl, your regex selects the rows which do not contain a number.
You can use sum + grepl:
result <- sum(grepl('[0-9]', data$MESSAGE))
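Equivalently, a quick sketch of the original dplyr attempt fixed up - drop the negation and use nrow instead of length (assuming the data frame and column name from the question):
library(dplyr)
# keep rows that DO contain a digit, then count them
with_numbers <- filter(data, grepl("[0-9]", MESSAGE))
nrow(with_numbers)   # 2 for the example data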

Find specific strings, count their frequency in a given text, and report it as a proportion of the number of words

Trying to write a function in R that would:
1) look through each observation's string variables
2) identify and count certain strings that the user defines
3) report the findings as a proportion of the total number of words each observation contains.
Here's a sample dataset:
df <- data.frame(essay1=c("OMG. american sign language. knee-slides in leather pants", "my face looks totally different every time. lol."),
essay2=c("cheez-its and dried cranberries. sparkling apple juice is pretty\ndamned coooooool too.<br />\nas for music, movies and books: the great american authors, mostly\nfrom the canon, fitzgerald, vonnegut, hemmingway, hawthorne, etc.\nthen of course the europeans, dostoyevski, joyce, the romantics,\netc. also, one of the best books i have read is all quiet on the\nwestern front. OMG. I really love that. lol", "at i should have for dinner\nand when; some random math puzzle, which I loooooove; what it means to be alive; if\nthe meaning of life exists in the first place; how the !##$ can the\npolitical mess be fixed; how the %^&* can the education system\nbe fixed; my current game design project; my current writing Lol"),
essay3=c("Lol. I enjoy life and then no so sure what else to say", "how about no?"))
The furthest I managed to get is this function:
find.query <- function(char.vector, query){
  which.has.query <- grep(query, char.vector, ignore.case = TRUE)
  length(which.has.query) != 0
}
profile.has.query <- function(data.frame, query){
  query <- tolower(query)
  has.query <- apply(data.frame, 1, find.query, query = query)
  return(has.query)
}
This allows the user to detect whether a given value appears in the 'essay' columns for a given user, but that's not enough for the three goals outlined above. What this function would ideally do is count the number of words identified, then divide that count by the total count of words in the overall essays (row sum of counts for each user).
Any advice on how to approach this?
Using the stringi package as in this post:
How do I count the number of words in a text (string) in R?
library(stringi)
words.identified.over.total.words <- function(dataframe, query){
  # make the query all lower-case
  query <- tolower(query)
  # count the total number of words
  total.words <- apply(dataframe, 2, stri_count, regex = "\\S+")
  # count the number of words matching query
  number.query <- apply(dataframe, 2, stri_count, regex = query)
  # divide the number of words identified by total words for each column
  final.result <- colSums(number.query) / colSums(total.words)
  return(final.result)
}
(The df in your question has each essay in a column, so the function sums each column. However, in the text of your question you say you want row sums. If the input data frame was meant to have one essay per row, then you can change the function to reflect that.)
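If the data was indeed meant to have one essay per row, a row-wise sketch of the same idea (my adaptation; the function name is a placeholder):
words.identified.per.user <- function(dataframe, query){
  query <- tolower(query)
  # apply() over rows coerces the data frame to a character matrix first
  total.words <- apply(dataframe, 1, function(r) sum(stri_count(r, regex = "\\S+")))
  number.query <- apply(dataframe, 1, function(r) sum(stri_count(r, regex = query)))
  number.query / total.words
}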

Counting specific characters in a string, across a data frame. sapply

I have found similar problems to this here:
Count the number of words in a string in R?
and here
Faster way to split a string and count characters using R?
but I can't get either to work in my example.
I have quite a large dataframe. One of the columns has genomic locations for features and the entries are formatted as follows:
[hg19:2:224840068-224840089:-]
[hg19:17:37092945-37092969:-]
[hg19:20:3904018-3904040:+]
[hg19:16:67000244-67000248,67000628-67000647:+]
I am splitting these entries into their individual elements, to get the following (i.e. for the first entry):
hg19 2 224840068 224840089 -
But in the case of the fourth entry, I would like to parse this into two separate locations,
i.e.
hg19:16:67000244-67000248,67000628-67000647:+]
becomes
hg19 16 67000244 67000248 +
hg19 16 67000628 67000647 +
(with all the associated data in the adjacent columns filled in from the original)
An easy way for me to identify which rows need this action is simply to find the rows containing commas (','), as commas don't appear in any other text in any other column, except where there are multiple genomic locations for the feature.
However, I am failing at the first hurdle, because the sapply command incorrectly returns '1' for every entry.
testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates), length)
(or)
testdat$multiple <- sapply(gregexpr("\\,", testdat$genome_coordinates), length)
table(testdat$multiple)
1
4
Using the example I have posted above, I would expect the output to be
testdat$multiple
0
0
0
1
Actually, running grep -c on the same data on the command line shows I have 10 entries containing ','.
So initially I would like to get this working, but I am also a bit stumped for ideas as to how to then extract the two (or more) locations, put them on their own rows, and fill in the adjacent data.
What I actually intended to do was stick to something I know (the command line): grep out the rows containing ',', duplicate the file, split and awk the selected columns (first and second locations into respective files), then cat and sort them. If there is a niftier way to do this in R, I would love a pointer.
gregexpr does in fact return an object of length 1 for each of these entries: a match failure returns -1, and a single comma returns a single match position, so both have length 1. If you want to distinguish the rows which have a match from the ones which don't, you need to look at the returned value, not the length.
Try counting actual matches instead: testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates), function(m) sum(m > 0)). Since a failed match gives -1, sum(m > 0) returns 0 for rows without a comma and the comma count otherwise, which matches your expected 0 0 0 1.
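For the second part (one row per genomic range), a possible tidyr sketch - the column names build/chr/ranges/strand are placeholders I made up:
library(dplyr)
library(tidyr)
testdat %>%
  # strip the square brackets, then break the coordinate string on ":"
  mutate(coords = gsub("\\[|\\]", "", genome_coordinates)) %>%
  separate(coords, into = c("build", "chr", "ranges", "strand"), sep = ":") %>%
  # one row per comma-separated range; the other columns are carried along
  separate_rows(ranges, sep = ",") %>%
  separate(ranges, into = c("start", "end"), sep = "-")
This turns the fourth entry into two rows (67000244-67000248 and 67000628-67000647) while duplicating the adjacent data.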
