I came across this function that converts numbers written in words into their numeric representation (e.g., five to 5). The function looks like this:
library(english)
words_to_numbers <- function(s) {
  s <- stringr::str_to_lower(s)
  for (i in 0:11)
    s <- stringr::str_replace_all(s, words(i), as.character(i))
  s
}
Can you explain how the function works? I am confused how as.character() is playing a role here.
The function works like this (note you also need the stringr package).
First, it takes the word you input (i.e. "five" if you used words_to_numbers("five"))
Then, str_to_lower() takes that and normalizes it to all lower case (i.e., avoiding issues if you typed "Five" or "FIVE" instead of "five").
It then iterates over a loop (for some reason ending at 11), so i will take the value of 0, then 1, then 2, all the way to 11.
Within the loop, str_replace_all() takes your string (i.e., "five") and looks for a matching pattern. Here, the pattern is words(i) (i.e., words(5) when i == 5 yields the pattern "five"). In the english package, the words() function returns the English words for a given number. For instance, if you type english::words(1000) it will return "one thousand". Once it finds the pattern, it replaces it with as.character(i). The as.character() function converts the number i to a character string, since str_replace_all() requires a character replacement. If you need the return value to be numeric, you could use as.numeric(words_to_numbers("five")).
For some reason, the function stops at 11, meaning if you type words_to_numbers("twelve") it won't work (returns "twelve"). So you will need to adjust that number if you want to use the function for values > 11.
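To make the loop concrete, here is a minimal base-R sketch of the same idea. It is not the original function: number_words is a hand-written, hypothetical stand-in for english::words(0:11), words_to_numbers_base is a made-up name, and gsub() is used in place of str_replace_all() so that it runs without extra packages.

```r
# Hypothetical stand-in for english::words(0:11), written out by hand
number_words <- c("zero", "one", "two", "three", "four", "five", "six",
                  "seven", "eight", "nine", "ten", "eleven")

words_to_numbers_base <- function(s) {
  s <- tolower(s)  # same role as stringr::str_to_lower()
  for (i in seq_along(number_words)) {
    # i runs from 1 to 12 here, so the number to insert is i - 1.
    # as.character() turns that number into the replacement string.
    s <- gsub(number_words[i], as.character(i - 1), s, fixed = TRUE)
  }
  s
}

result <- words_to_numbers_base("Five apples and ELEVEN pears")
print(result)

# Words outside the 0-11 range are left untouched
unchanged <- words_to_numbers_base("twelve")
print(unchanged)

# Convert the character result to a number if needed
five <- as.numeric(words_to_numbers_base("five"))
print(five)
```

One caveat when raising the upper bound: because the words are replaced in order, "fourteen" would already have been turned into "4teen" by the time the loop reaches 14, so values above 11 likely need more than a larger loop range.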
Hope this helps and good luck learning R!
I have a question about how to write a loop in R that checks whether a certain expression occurs in a string. I want to check if the expression "i-sty" occurs in my variable for each i between 1 and 200 and, if it does, return the corresponding i.
For example if we have “4-sty” the loop should give me 4 and if there is no “i-sty” in the variable it should give me . for the observation.
I used
for (i in 1:200){
datafram$height <- ifelse(grepl("i-sty", dataframe$Description), i, ".")
}
But it did not work. I literally only receive points. Attached I show a picture of the string variable.
"i-sty" is just a string with the letter i in it. To use a regex pattern with your variable i, you need to paste together a string, e.g., grepl(paste0(i, "-sty"), ...). I'd also recommend using NA rather than "." for the "else" result - that way the resulting height variable can be numeric.
for (i in 1:200){
  dataframe$height <- ifelse(grepl(paste0(i, "-sty"), dataframe$Description), i, NA)
}
The above works syntactically, but not logically. You also have a problem that you are overwriting height each time through the loop - when i is 2, you erase the results from when i is 1, when i is 3, you erase the results from when i is 2... I think a better approach would be to extract the match, which is easy using stringr (but also possible in base). As a benefit, with the right pattern we can skip the loop entirely:
library(stringr)
dataframe$height = str_match(string = dataframe$Description, pattern = "([0-9]+)-sty")[, 2]
# might want to wrap in `as.numeric`
You use both datafram and dataframe. I've assumed dataframe is correct.
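For reference, the base-R route mentioned above might look like the following sketch. The three Description values are made up for illustration, and the look-ahead pattern is just one way to write it, not necessarily what the real data needs.

```r
# Made-up example data; the real Description column will differ
dataframe <- data.frame(
  Description = c("Nice 4-sty building", "no height given", "12-sty tower"),
  stringsAsFactors = FALSE
)

# Find one or more digits that are directly followed by "-sty".
# regexpr() returns -1 for rows without a match.
m <- regexpr("[0-9]+(?=-sty)", dataframe$Description, perl = TRUE)

# Start with NA everywhere, then fill in only the rows that matched.
# regmatches() returns the matched text for the matching rows only.
height <- rep(NA_real_, nrow(dataframe))
height[m > 0] <- as.numeric(regmatches(dataframe$Description, m))
dataframe$height <- height

print(dataframe)
```

No loop is needed, and rows without an "i-sty" expression end up as NA, so height stays numeric.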
I have a discrete variable with scores from 1-3. I would like to change it so 1=2, 2=1, 3=3.
I have tried
recode(Data$GEB43, "c(1=2; 2=1; 3=3")
But that doesn't work.
I know this is an overly stupid question that can be solved in excel within seconds but trying to learn how to do basics like this in R.
We should always provide a minimal reproducible example:
df <- data.frame(x=c(1,1,2,2,3,3))
You didn't specify the package for recode so I assumed dplyr. ?dplyr::recode tells us how the arguments should be passed to the function. In the original question "c(1=2; 2=1; 3=3" is a string (i.e. not an R expression but a character string "c(1=2; 2=1; 3=3"). To make it an R expression we have to get rid of the double quotes and replace the ; with ,. Additionally, we need a closing parenthesis, i.e. c(1=2, 2=1, 3=3). But still, as ?dplyr::recode tells us, this is not the way to pass this information to recode:
Solution using dplyr::recode:
dplyr::recode(df$x, "1"=2, "2"=1, "3"=3)
Returns:
[1] 2 2 1 1 3 3
Assuming you mean dplyr::recode, the syntax is
recode(.x, ..., .default = NULL, .missing = NULL)
From the documentation it says
.x - A vector to modify
... - Replacements. For character and factor .x, these should be named and replacement is based only on their name. For numeric .x, these can be named or not. If not named, the replacement is done based on position i.e. .x represents positions to look for in replacements
So when you have numeric values, you can replace based on position directly:
recode(1:3, 2, 1, 3)
#[1] 2 1 3
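If dplyr is not available, the same swap can be sketched in base R with a named lookup vector. This is a different technique from recode(), shown only for comparison; the names lookup and x_recoded are hypothetical.

```r
df <- data.frame(x = c(1, 1, 2, 2, 3, 3))

# Names are the old values (as strings), elements are the new values
lookup <- c("1" = 2, "2" = 1, "3" = 3)

# Index the lookup vector by name, then drop the names from the result
df$x_recoded <- unname(lookup[as.character(df$x)])

print(df$x_recoded)
```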
I would like to assign names to list elements. To do so, I run the following code:
file_names <- gsub("\\..*", "", doc_csv_names)
print(file_names)
"201409" "201412" "201504" "201507" "201510" "201511" "201604" "201707"
names(docs_data) <- file_names
In this case the name of the list element appears with backticks (``).
docs_data$`201409`
However, in this case the name of the list element appears in the following way:
names(docs_data) <- paste("name", 1:8, sep = "")
docs_data$name1
How can I convert the gsub() result to get the latter naming pattern without quotes?
gsub() and paste() seem to produce objects of the same class(). What is the difference?
Both gsub and paste return character objects. They are different because they are completely different functions, which you seem to know based on their usage (gsub replaces instances of your pattern with a desired output in a string of characters, while paste just... pastes).
As for why you get the quotations, that has nothing to do with gsub and everything to do with the fact that you are naming variables/columns with numbers. Indeed, try
names(docs_data) <- paste(1:8)
and you'll realize you have the same problem when invoking the naming pattern. It basically has to do with the fact that R doesn't want to be confused about whether a number is really a number or a variable because that would be chaos (how can 1 refer to a variable and also the number 1?). A name that starts with a digit is not a syntactically valid R name, so it can only be used as a quoted name, which is why it shows up with backticks after $. For example, note that
> 1 <- 3
Error in 1 <- 3 : invalid (do_set) left-hand side to assignment
> "1" <- 3 #no problem!
So R is basically correcting that for you! This is not a problem when you name something using characters. Finally, an easy fix: just add a character in front of the numbers of your naming pattern, and you'll be able to invoke them without the quotations. For example:
file_names <- paste("file_",gsub("\\..*", "", doc_csv_names),sep="")
Should do the trick (or just change the "file_" into whatever you want as long as it's not empty, cause then you just have numbers left and the same problem)!
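A small sketch of the behaviour described above, using a made-up two-element list in place of docs_data (the contents and the "file_" prefix are only illustrative):

```r
# Made-up stand-in for docs_data
docs_data <- list("first file", "second file")
names(docs_data) <- c("201409", "201412")

# Names that start with a digit need backticks after $ ...
with_backticks <- docs_data$`201409`

# ... but [[ ]] takes the name as an ordinary string
with_brackets <- docs_data[["201409"]]

# Adding a prefix makes the names syntactically valid
names(docs_data) <- paste0("file_", names(docs_data))
with_prefix <- docs_data$file_201409

# make.names() is a base R helper that does a similar fix automatically
fixed_name <- make.names("201409")
print(fixed_name)
```

If the names must stay purely numeric, accessing elements with [[ ]] avoids the backticks without renaming anything.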
If you load the pracma package into the R console and type
gammainc(2,2)
you get
lowinc uppinc reginc
0.5939942 0.4060058 0.5939942
This looks like some kind of a named tuple or something.
But, I can't work out how to extract the number below the lowinc, namely 0.5939942. The code (gammainc(2,2))[1] doesn't work, we just get
lowinc
0.5939942
which isn't a number.
How is this done?
As can be checked with str(gammainc(2,2)[1]) and class(gammainc(2,2)[1]), the output mentioned in the OP is in fact a number. It is just a named number. The names used as attributes of the vector are supposed to make the output easier to understand.
The function unname() can be used to obtain the numerical vector without names:
unname(gammainc(2,2))
#[1] 0.5939942 0.4060058 0.5939942
To select the first entry, one can use:
unname(gammainc(2,2))[1]
#[1] 0.5939942
In this specific case, a clearer version of the same might be:
unname(gammainc(2,2)["lowinc"])
Double brackets will strip the names:
gammainc(2,2)[[1]]
gammainc(2,2)[["lowinc"]]
I don't claim it to be intuitive, or obvious, but it is mentioned in the manual:
For vectors and matrices the [[ forms are rarely used, although they
have some slight semantic differences from the [ form (e.g. it drops
any names or dimnames attribute, and that partial matching is used for
character indices).
The partial matching can be employed like this
gammainc(2, 2)[["low", exact=FALSE]]
In R, vectors may have a names() attribute. This is an example:
vector <- c(1, 2, 3)
names(vector) <- c("first", "second", "third")
If you display vector, you should get the desired output:
> vector
 first second  third
     1      2      3
To check what type of output a function returns, you can use:
class(your_function())
I hope this helps.
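The points made in these answers can be tried without installing pracma. The sketch below builds a named numeric vector by hand, with the values copied from the output shown in the question; res is a hypothetical stand-in for gammainc(2,2).

```r
# Hand-built stand-in for the result of gammainc(2,2)
res <- c(lowinc = 0.5939942, uppinc = 0.4060058, reginc = 0.5939942)

# A named element is still an ordinary number
print(is.numeric(res[1]))
doubled <- res[1] * 2   # arithmetic works; the name is carried along

# [[ ]] returns the bare value without the name
low <- res[["lowinc"]]
print(low)

# unname() drops all names at once
plain <- unname(res)
print(plain)
```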
This is a really basic question, and I am probably not seeing something obvious, but I am currently stuck with this problem:
In R, I generated a list of integers with the sample() function. Then I want to find an exact pattern in it.
Should be obvious, but grep does the following:
1)
grep('03230', hugeListofNumbers)
>integer(0)
2)
pattern<-toString(03230)
x<-toString(hugeListofNumbers)
grep(pattern, x)
>[1] 1
3) And using matchPattern from the Biostrings Package:
matchPattern(pattern, x)
start end width
[1] 5146 5158 13 [0, 3, 2, 3, 2]
....
No result helps me find the occurrences of the pattern. And although the last one using matchPattern seems OK, it finds some weird 13-character-long string that does not match in any way the 5-character-long pattern...
What am I not seeing here? How can I just perform a normal grep search as in the shell?
Edit:
To generate the list with the properties I needed I used:
hugeListofNumbers<-sample(c(0,1,2,3), 10^5, replace=TRUE, prob=NULL)
pattern<-sample(c(0,1,2,3), 5 , replace=TRUE, prob=NULL)
OK, I found the solution. As expected, I was overlooking a very basic problem: the sample() function in R returns a vector, so I could not match any pattern longer than one character before converting the vector to a single string.
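A minimal sketch of that fix, assuming a short hand-written vector in place of the sampled one so the result is easy to check (the names haystack, needle and starts are made up). Two details are worth noting: toString() joins elements with ", ", which is why the matchPattern hit above is 13 characters wide, and a literal such as 03230 is read as the number 3230, so the leading zero is lost.

```r
# Short made-up vector standing in for the 10^5 sampled values
hugeListofNumbers <- c(1, 0, 3, 2, 3, 0, 2, 2, 0, 3, 2, 3, 0, 1)
pattern <- c(0, 3, 2, 3, 0)

# Collapse both vectors into single strings with no separator
haystack <- paste(hugeListofNumbers, collapse = "")
needle <- paste(pattern, collapse = "")

# gregexpr() returns the start position of every non-overlapping match
# (or -1 if there is none); fixed = TRUE treats the pattern literally
starts <- gregexpr(needle, haystack, fixed = TRUE)[[1]]
print(as.integer(starts))
```

Because every element is a single digit, a start position in the string is also the index in the original vector. Overlapping occurrences are not reported by gregexpr(), so a pattern that can overlap with itself may be undercounted.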