Converting Movie Box Office to Numbers - r

I have a data frame in R with box office numbers listed like $121.5M and $0.014M, and I'd like to convert them to plain numbers. I'm thinking of stripping the $ and M and then using basic multiplication. Is there a better way to do this?

You could do this either by matching the non-numeric characters ([^0-9.]*) and replacing them with '' (the result is still in millions, so multiply by 1e6 afterwards)
as.numeric(gsub("[^0-9.]*", '', "$121.5M"))
#[1] 121.5
Or by specifically matching the $ and M ([$M]) and replacing them with ''
as.numeric(gsub("[$M]", '',"$121.5M"))
#[1] 121.5
Update
If you have a vector like below
v1 <- c("$1.21M", "$0.5B", "$100K", "$1T", "$0.9P", "$1.5K")
Create another vector with the multipliers and set its names to the corresponding abbreviations
v2 <- setNames(c(1e3, 1e6, 1e9, 1e12, 1e15), c('K', 'M', 'B', 'T', 'P'))
Use it as a lookup to replace each abbreviation with its multiplier, and multiply that by the numeric part of the vector.
as.numeric(gsub("[^0-9.]*", '',v1))* v2[sub('[^A-Z]*', '', v1)]
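For reference, the lookup approach above can be wrapped in a small function. This is only a sketch: the name to_number and the handling of values without a suffix (treated as multiplier 1) are assumptions, not part of the original answer.

```r
# Hypothetical helper: convert strings such as "$1.21M" to plain numbers.
to_number <- function(x, mult = c(K = 1e3, M = 1e6, B = 1e9, T = 1e12, P = 1e15)) {
  num <- as.numeric(gsub("[^0-9.]", "", x))    # keep digits and the decimal point
  suffix <- gsub("[^A-Z]", "", x)              # keep the uppercase suffix ("" if none)
  # Assumption: a value without a suffix is already a plain number
  scale <- ifelse(suffix == "", 1, mult[suffix])
  unname(num * scale)
}

v1 <- c("$1.21M", "$0.5B", "$100K", "42")
to_number(v1)
```

An unknown suffix yields NA rather than a silently wrong number, which makes bad input easy to spot.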

The function extract_numeric from the tidyr package strips all non-numeric characters from a string and returns a number (later tidyr releases deprecate it in favour of readr::parse_number). With your example:
library(tidyr)
dat <- data.frame(revenue = c("$121.5M", "$0.014M"))
dat$revenue2 <- extract_numeric(dat$revenue)*1000000
dat
  revenue  revenue2
1 $121.5M 121500000
2 $0.014M     14000

This removes the $ and translates K and M to e3 and e6. There is an example very similar to this in the gsubfn vignette.
library(gsubfn)
x <- c("$1.21M", "$100K") # input
ch <- gsubfn("[KM$]", list(K = "e3", M = "e6", "$" = ""), x)
as.numeric(ch)
## [1] 1210000 100000
The as.numeric line can be omitted if you don't need to convert it to numeric.
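If installing gsubfn is not an option, the same scientific-notation trick can probably be done with base R alone. The following is a sketch that assumes only the K and M suffixes occur, always at the end of the string:

```r
x <- c("$1.21M", "$100K")             # same input as above
ch <- sub("$", "", x, fixed = TRUE)   # drop the dollar sign (fixed, so "$" is literal)
ch <- sub("K$", "e3", ch)             # thousands
ch <- sub("M$", "e6", ch)             # millions
res <- as.numeric(ch)                 # "1.21e6" parses as 1210000
res
```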

Related

How to use grepl function multiple times, in R

I have a vector like go_id and a data.frame like data.
go_id <- c("[GO:0000086]", "[GO:0000209]", "[GO:0000278]")
protein_id <- c("Q96IF1","P26371","Q8NHG8","P60372","O75526","Q01130")
bio_process <- c("[GO:0000086]; [GO:0000122]; [GO:0000932]", "[GO:0005829]; [GO:0008544]","[GO:0000209]; [GO:0005737]; [GO:0005765]","NA","[GO:0000398]; [GO:0003729]","[GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]")
data <- as.data.frame(cbind(protein_id,bio_process))
How can I keep the rows of data for which the bio_process cell contains at least one of the go_id elements? Note that a GO code cannot be repeated within the same bio_process cell.
To be more precise, I would like to get only the first, third and sixth rows of the data.frame.
I have tried a for loop using 'grepl' function, like this:
go_id <- gsub("GO:","", go_id, fixed = TRUE)
for (i in 1:6) {
new_data <- data[grepl("\\[GO:go_id[i]\\]",data$Gene.ontology..biological.process.)]
}
I know this cannot work, because the value of the variable is not interpolated into the regular expression.
Any ideas on this?
Thank you
We can use Reduce with grepl
data$ind <- Reduce(`|`, lapply(go_id, function(pat)
grepl(pat, data$bio_process, fixed = TRUE)))
data
# protein_id bio_process ind
#1 Q96IF1 [GO:0000086]; [GO:0000122]; [GO:0000932] TRUE
#2 P26371 [GO:0005829]; [GO:0008544] FALSE
#3 Q8NHG8 [GO:0000209]; [GO:0005737]; [GO:0005765] TRUE
#4 P60372 NA FALSE
#5 O75526 [GO:0000398]; [GO:0003729] FALSE
#6 Q01130 [GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714] TRUE
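To go from the logical column to the requested subset, the same Reduce result can be used directly as a row index. A self-contained sketch follows; the names keep and kept are just illustrative:

```r
go_id <- c("[GO:0000086]", "[GO:0000209]", "[GO:0000278]")
data <- data.frame(
  protein_id = c("Q96IF1", "P26371", "Q8NHG8", "P60372", "O75526", "Q01130"),
  bio_process = c("[GO:0000086]; [GO:0000122]; [GO:0000932]",
                  "[GO:0005829]; [GO:0008544]",
                  "[GO:0000209]; [GO:0005737]; [GO:0005765]",
                  "NA",
                  "[GO:0000398]; [GO:0003729]",
                  "[GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]"),
  stringsAsFactors = FALSE
)

# One logical vector per ID (literal matching), combined element-wise with OR
keep <- Reduce(`|`, lapply(go_id, grepl, x = data$bio_process, fixed = TRUE))
kept <- data[keep, ]
kept$protein_id
```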
You should use fixed = TRUE in grepl(), so that the brackets are matched literally:
vect <- rep(FALSE, nrow(data))
for(id in go_id){
vect <- vect | grepl(id, data$bio_process, fixed = TRUE)
}
data[vect,]
You can also use str_extract to build the pattern from the distinctive part of each ID (here the last four digits plus the closing bracket):
library(stringr)
data[grepl(paste(str_extract(go_id, "\\d{4}]"), collapse="|"), data$bio_process),]
protein_id bio_process
1 Q96IF1 [GO:0000086]; [GO:0000122]; [GO:0000932]
3 Q8NHG8 [GO:0000209]; [GO:0005737]; [GO:0005765]
6 Q01130 [GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]
EDIT:
The most straightforward solution is subsetting with grepl and paste0, adding the escape backslashes for the metacharacter [:
data[grepl(paste0("\\", go_id, collapse="|"), data$bio_process),]
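Prepending a backslash works here because every ID starts with "[" and contains no other metacharacter that needs escaping. When the IDs could contain arbitrary metacharacters, one alternative (an assumption on my part, not from the answer) is to quote each ID with \Q...\E, which requires perl = TRUE:

```r
go_id <- c("[GO:0000086]", "[GO:0000209]", "[GO:0000278]")
bio_process <- c("[GO:0000086]; [GO:0000122]; [GO:0000932]",
                 "[GO:0005829]; [GO:0008544]",
                 "[GO:0000209]; [GO:0005737]; [GO:0005765]",
                 "NA",
                 "[GO:0000398]; [GO:0003729]",
                 "[GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]")

# \Q...\E quotes everything in between, so "[" and "]" are taken literally
pat <- paste0("\\Q", go_id, "\\E", collapse = "|")
hits <- grepl(pat, bio_process, perl = TRUE)
which(hits)
```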

How to split strings and numbers in R?

I have character vector of the following form (this is just a sample):
R1Ng(10)
test(0)
n.Ex1T(34)
where as can be seen above, the first part is always some combination of alphanumeric and punctuation marks, then there are parentheses with a number inside. I want to create a numeric vector which will store the values inside the parentheses, and each number should have name attribute, and the name attribute should be the string before the number. So, for example, I want to store 10, 0, 34, inside a numeric vector and their name attributes should be, R1Ng, test, n.Ex1T, respectively.
I can always do something like this to get the numbers and create a numeric vector:
counts <- regmatches(data, gregexpr("[[:digit:]]+", data))
as.numeric(unlist(counts))
But how can I extract the leading string part and store it as the name attribute of that numeric vector?
How about this:
x <- c("R1Ng(10)", "test(0)", "n.Ex1T(34)")
data.frame(Name = gsub( "\\(.*", "", x),
Count = as.numeric(gsub(".*?\\((.*?)\\).*", "\\1", x)))
# Name Count
# 1 R1Ng 10
# 2 test 0
# 3 n.Ex1T 34
Or alternatively as a vector
setNames(as.numeric(gsub(".*?\\((.*?)\\).*", "\\1", x)),
gsub( "\\(.*", "", x ))
# R1Ng test n.Ex1T
# 10 0 34
Here is another variation using the same expression and capturing parentheses:
temp <- c("R1Ng(10)", "test(0)", "n.Ex1T(34)")
data.frame(Name=gsub("^(.*)\\((\\d+)\\)$", "\\1", temp),
count=gsub("^(.*)\\((\\d+)\\)$", "\\2", temp))
We can use str_extract_all
library(stringr)
lst <- str_extract_all(x, "[^()]+")
Or with strsplit from base R
lst <- strsplit(x, "[()]")
If we need to store as a named vector
sapply(lst, function(x) setNames(as.numeric(x[2]), x[1]))
# R1Ng test n.Ex1T
# 10 0 34
data
x <- c("R1Ng(10)", "test(0)", "n.Ex1T(34)")
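The same result can also be reached with regexec and regmatches from base R, which return the capture groups directly. A minimal sketch, assuming every element matches the name(number) shape:

```r
x <- c("R1Ng(10)", "test(0)", "n.Ex1T(34)")

# Each list element holds: full match, name, number
m <- regmatches(x, regexec("^(.*)\\(([0-9]+)\\)$", x))
parts <- do.call(rbind, m)

# Numeric vector of counts, named by the leading string
counts <- setNames(as.numeric(parts[, 3]), parts[, 2])
counts
```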

Keep doubled columns which differ in only 2 letters in a data.frame

I have a data frame in R which consists of around 100 columns. Most of the columns are doubled but differ in 2 letters. I want to keep these columns and delete those columns which are not doubled.
Here is an example:
234-rgz SK 234-rgz PV 556-gft SK 456-hjk SK 456-hjk PV
The Output should be:
234-rgz SK 234-rgz PV 456-hjk SK 456-hjk PV
All columns follow the same naming convention: a number from 2 to 150, then a "-", then 4 or 5 letters, then a space, and then "SK" or "PV". I thought of using a regular expression, but I can't work out how to get rid of the single columns. Thanks for your help!
You can use duplicated on the column names after removing the suffix part. The output is a logical index that can be used to subset the original dataset.
v1 <- colnames(df1)
v2 <- sub('\\s+[^ ]+$', '', v1)
indx <- duplicated(v2)|duplicated(v2, fromLast=TRUE)
v1[indx]
#[1] "234-rgz SK" "234-rgz PV" "456-hjk SK" "456-hjk PV"
To subset the columns in the dataframe,
df1[indx]
Another option is to split the column names into substrings and use grep to match the substrings that have a frequency > 1
tbl <- table(unlist(strsplit(v1, '\\s+.*')))
df1[grep(paste(names(tbl)[tbl>1], collapse="|"), v1)]
data
set.seed(24)
df1 <- as.data.frame(matrix(sample(0:9, 5*10, replace=TRUE), ncol=5,
dimnames=list(NULL, c('234-rgz SK', '234-rgz PV' , '556-gft SK',
'456-hjk SK' , '456-hjk PV') )) )
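One caveat with the grep variant above: the prefixes are matched as regular expressions anywhere in the name, so a prefix such as "34-rgz" would also hit "234-rgz SK". A sketch of an exact-match alternative using table and %in% (the variable names prefix, n and keep are illustrative):

```r
v1 <- c("234-rgz SK", "234-rgz PV", "556-gft SK", "456-hjk SK", "456-hjk PV")

prefix <- sub("\\s+\\S+$", "", v1)    # drop the trailing " SK" / " PV"
n <- table(prefix)                    # how often each prefix occurs
keep <- prefix %in% names(n)[n > 1]   # exact comparison, no regex matching
v1[keep]
```

With a data frame, df1[keep] would then select the paired columns.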

R: edit column values by using if condition

I have a data frame with several columns. One of those contains Plotids like AEG1, AEG2,..., AEG50, HEG1, HEG2,..., HEG50, SEG1, SEG2,..., SEG50. So, the data frame has 150 rows. Now I want to change only some of these Plotids, so that there is AEG01, AEG02,... instead of AEG1, AEG2, ... So, I just want to add a "0" to some of the column entries. I tried it by using lapply, a for loop, writing a function,... but nothing did the job. There was always the error message:
In if (nchar(as.character(dat_merge$EP_Plotid)) == 4)
paste(substr(dat_merge$EP_Plotid, ... :
the condition has length > 1 and only the first element will be used
So, this was my last try:
Plotid_func <- function(x) {
if(nchar(as.character(dat_merge$EP_Plotid))==4)
paste(substr(dat_merge$EP_Plotid, 1, 3), "0", substr(dat_merge$EP_Plotid, 4, 4), sep="")
}
dat_merge$Plotid <- sapply(dat_merge$EP_Plotid, Plotid_func)
With this, I wanted to select only those entries that are four characters long, and add a 0 to only those entries. Can anybody help me? dat_merge is the name of my data frame and EP_Plotid is the column I want to edit. Thanks in advance.
Just extract the "string" portion and the "numeric" portion and paste them back together after using sprintf on the numeric portion.
An example:
## "x" is the "column" of plot ids. Here I go up to 12
## to demonstrate the zero padding that it sounds like
## you're looking for
x <- c(paste0("AEG", 1:12), paste0("HEG", 1:12))
## Extract the string values
Strings <- gsub("([A-Z]+)(.*)", "\\1", x)
## Extract the numeric values
Nums <- gsub("([A-Z]+)(.*)", "\\2", x)
## Put them back together
paste0(Strings, sprintf("%02d", as.numeric(Nums)))
# [1] "AEG01" "AEG02" "AEG03" "AEG04" "AEG05" "AEG06"
# [7] "AEG07" "AEG08" "AEG09" "AEG10" "AEG11" "AEG12"
# [13] "HEG01" "HEG02" "HEG03" "HEG04" "HEG05" "HEG06"
# [19] "HEG07" "HEG08" "HEG09" "HEG10" "HEG11" "HEG12"
Or you can just modify your function to actually use the input variable x (which is not happening in your original function)
dat_merge <- data.frame(EP_Plotid = c("AEG1", "AEG2", "AEG50", "HEG1", "HEG2", "HEG50", "SEG1", "SEG2", "SEG50"))
Plotid_func <- function(x) {
if(nchar(as.character(x)) == 4){
paste(substr(x, 1, 3), "0", substr(x, 4, 4), sep="")
} else as.character(x)
}
dat_merge$Plotid <- sapply(dat_merge$EP_Plotid, Plotid_func)
dat_merge
# EP_Plotid Plotid
# 1 AEG1 AEG01
# 2 AEG2 AEG02
# 3 AEG50 AEG50
# 4 HEG1 HEG01
# 5 HEG2 HEG02
# 6 HEG50 HEG50
# 7 SEG1 SEG01
# 8 SEG2 SEG02
# 9 SEG50 SEG50
A vectorized version of your function (usually preferable to sapply, which loops over the elements one at a time) would be
dat_merge$Plotid <- ifelse(nchar(as.character(dat_merge$EP_Plotid))==4, paste(substr(dat_merge$EP_Plotid, 1, 3), "0", substr(dat_merge$EP_Plotid, 4, 4), sep=""), as.character(dat_merge$EP_Plotid))
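Since the rule is "letters followed by exactly one digit", the same padding can also be written as a single vectorized sub call. This is a sketch that assumes the IDs consist of uppercase letters followed by digits only; R's replacement strings support backreferences \1 to \9, so "\\10" reads as group 1 followed by a literal 0:

```r
ids <- c("AEG1", "AEG2", "AEG50", "HEG1", "SEG50")

# Only strings ending in a single digit match; the others are returned unchanged
padded <- sub("^([A-Z]+)([0-9])$", "\\10\\2", ids)
padded
```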
Or use a combination of formatC with str_extract from library(stringr)
library(stringr)
Using x from Ananda's post: extract the letters and the numbers separately, pad the numbers with zeros using formatC, and paste them together.
paste0(str_extract(x, "[[:alpha:]]+"), formatC(as.numeric(str_extract(x,"\\d+")), width=2, flag="0"))
#[1] "AEG01" "AEG02" "AEG03" "AEG04" "AEG05" "AEG06" "AEG07" "AEG08" "AEG09"
#[10] "AEG10" "AEG11" "AEG12" "HEG01" "HEG02" "HEG03" "HEG04" "HEG05" "HEG06"
#[19] "HEG07" "HEG08" "HEG09" "HEG10" "HEG11" "HEG12"

splitting filename text by underscores using R

In R I'd like to take a collection of file names in the format below and return the number to the right of the second underscore (this will always be a number) and the text string to the right of the third underscore (this will be combinations of letters and numbers).
I have file names in this format:
HELP_PLEASE_4_ME
I want to extract the number 4 and the text ME
I'd then like to create a new field within my data frame where these two types of data can be stored. Any suggestions?
Here is an option using regexec and regmatches to pull out the patterns:
matches <- regmatches(df$a, regexec("^.*?_.*?_([0-9]+)_([[:alnum:]]+)$", df$a))
df[c("match.1", "match.2")] <- t(sapply(matches, `[`, -1)) # first result for each match is full regular expression so need to drop that.
Produces:
                 a match.1 match.2
1 HELP_PLEASE_4_ME       4      ME
2  SOS_WOW_3_Y34OU       3   Y34OU
This will break if any rows don't have the expected structure, but I think that is what you want to happen (i.e. be alerted that your data is not what you think it is). strsplit based approaches will require additional checking to ensure that your data is what you think it is.
And the data:
df <- data.frame(a=c("HELP_PLEASE_4_ME", "SOS_WOW_3_Y34OU"), stringsAsFactors=FALSE)
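To make the "alert on unexpected structure" idea explicit, the non-matching rows can be identified before assigning: regmatches returns an empty character vector for them. A sketch, with "BAD_NAME" added as a hypothetical malformed file name:

```r
a <- c("HELP_PLEASE_4_ME", "SOS_WOW_3_Y34OU", "BAD_NAME")
matches <- regmatches(a, regexec("^.*?_.*?_([0-9]+)_([[:alnum:]]+)$", a))

bad <- lengths(matches) == 0   # TRUE where the pattern did not match
a[bad]                         # file names that need attention

# Capture groups (number, text) for the well-formed names only
parsed <- do.call(rbind, matches[!bad])[, 2:3, drop = FALSE]
parsed
```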
The obligatory stringr version of @BrodieG's quite spiffy answer:
library(stringr)
df[c("match.1", "match.2")] <-
t(sapply(str_match_all(df$a, "^.*?_.*?_([0-9]+)_([[:alnum:]]+)$"), "[", 2:3))
Put here for context only. You should accept BrodieG's answer.
Since you already know that you want the text that comes after the second and third underscore, you could use strsplit and take the third and fourth result.
> x <- "HELP_PLEASE_4_ME"
> spl <- unlist(strsplit(x, "_"))[3:4]
> data.frame(string = x, under2 = spl[1], under3 = spl[2])
## string under2 under3
## 1 HELP_PLEASE_4_ME 4 ME
Then for longer vectors, you could do something like the last two lines here.
## set up some data
> word1 <- c("HELLO", "GOODBYE", "HI", "BYE")
> word2 <- c("ONE", "TWO", "THREE", "FOUR")
> nums <- 20:23
> word3 <- c("ME", "YOU", "THEM", "US")
> XX <-paste0(word1, "_", word2, "_", nums, "_", word3)
> XX
## [1] "HELLO_ONE_20_ME" "GOODBYE_TWO_21_YOU"
## [3] "HI_THREE_22_THEM" "BYE_FOUR_23_US"
## ------------------------------------------------
## process it
> spl <- do.call(rbind, strsplit(XX, "_"))[, 3:4]
> data.frame(cbind(XX, spl))
## XX V2 V3
## 1 HELLO_ONE_20_ME 20 ME
## 2 GOODBYE_TWO_21_YOU 21 YOU
## 3 HI_THREE_22_THEM 22 THEM
## 4 BYE_FOUR_23_US 23 US
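The strsplit result above comes back as character columns with default names (V2, V3). A small sketch of the same idea with explicit column names and the number converted to integer; the column names file, number and text are assumptions:

```r
XX <- c("HELLO_ONE_20_ME", "GOODBYE_TWO_21_YOU", "HI_THREE_22_THEM", "BYE_FOUR_23_US")

# One row per file name, one column per underscore-separated piece
parts <- do.call(rbind, strsplit(XX, "_", fixed = TRUE))

res <- data.frame(file = XX,
                  number = as.integer(parts[, 3]),  # piece after the second underscore
                  text = parts[, 4],                # piece after the third underscore
                  stringsAsFactors = FALSE)
res
```

This relies on every name having exactly four pieces; names with a different number of underscores would need to be checked first.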

Resources