I have the following function in R. It works fine; however, I think there must be a better way to write it.
values <- c("a","b")
print <- function(values){
size <- length(values)
if (size == 1) {
final <- values[1]
}else if(size == 2){
final <- paste0(values[2], " and ", values[1])
}else if(size == 3){
final <- paste0(values[3], " and ",values[2], " and ", values[1])
}
return(final)
}
print(values)
The user can change the size of values, so if they choose values <- c("a","b", "c") the function runs the last condition. However, the last condition is in part equal to the second condition plus something new. Is it possible to write an if statement, or something along those lines, that reuses the previous condition? Something like:
values <- c("a","b", "c")
print <- function(values){
size <- length(values)
if (size == 1) {
final <- values[1]
}else if(size == 2){
final <- paste0(values[2], " and ", final )
}else if(size == 3){
final <- paste0(values[3], " and ",final )
}
return(final)
}
print(values)
Try this, which reverses the order of the input vector and pastes "and" between:
newfun <- function(x){
ifelse(length(x)>1, paste(rev(x), collapse = " and "), x)
}
Output:
newfun(letters[1])
# [1] "a"
newfun(letters[1:2])
# [1] "b and a"
# and so on...
newfun(letters[1:5])
# [1] "e and d and c and b and a"
Testing this against your function to see if it is identical:
all.equal(print(letters[1:3]),
newfun(letters[1:3]))
# [1] TRUE
I would also strongly caution against giving user-defined functions names that already exist in R (e.g. print() is already a function in R).
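As a small aside on the answer above: paste() with collapse already returns a single-element input unchanged, so the length check can probably be dropped. A minimal sketch, where join_reversed is a hypothetical name chosen so that no base function is masked:

```r
# join_reversed is an illustrative name, not part of the original answer
join_reversed <- function(x) {
  # rev() leaves a length-1 vector as it is, and collapse has nothing to join
  paste(rev(x), collapse = " and ")
}

join_reversed("a")
join_reversed(c("a", "b", "c"))
```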
Another way, which sorts the vector in decreasing order (this matches a reversal only when the input is already sorted in increasing order):
reverse_print <- function(values) paste(values[order(values, decreasing = TRUE)], collapse = " and ")
reverse_print(c("a", "b"))
#[1] "b and a"
reverse_print(c("a", "b", "c", "d"))
#[1] "d and c and b and a"
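To make the difference between reversing and sorting concrete, here is a short sketch on a deliberately unsorted, made-up input. It only illustrates that the two approaches agree when the input is already in increasing order and can differ otherwise:

```r
x <- c("b", "c", "a")  # made-up, deliberately unsorted input

# rev() keeps the original positions and just flips them
reversed <- paste(rev(x), collapse = " and ")

# order(..., decreasing = TRUE) ignores the original positions and sorts
sorted_desc <- paste(x[order(x, decreasing = TRUE)], collapse = " and ")

reversed     # "a and c and b"
sorted_desc  # "c and b and a"
```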
However, if your main objective is to create a function that recursively uses a condition and the previous conditions, one way of achieving it is to write a directly recursive function, that is, a function that calls itself (please see #G.Chan's comment for reference). Unfortunately, my attempt at such a function for your case failed with Error: C stack usage 15927520 is too close to the limit. This kind of error is relatively common in recursive functions, as discussed here.
Instead of creating a directly recursive function, I would suggest using a while loop with a decrementing index, as follows:
revprint <- function(values) {
size <- length(values)
if (size == 1) {
cat(values[1])
} else {
while (size > 1) {
final <- values[size]
appended <- paste0(final, " and ")
size <- size - 1
cat(appended)  # cat() returns NULL, so there is nothing useful to store
}
cat(values[1])
}
}
revprint("a")
# a
revprint(c("a", "b", "c", "d"))
# d and c and b and a
If the length of the input (a character vector) is larger than 1, this function prints the last element of the input followed by " and ", and then reduces the index by one. At each step, the last element of the new (shorter) input is printed after what was printed in the previous steps, until only the first element remains, which is printed without a trailing " and ".
Because this function uses cat, the result is displayed on the console, but it cannot be assigned directly to an object. To assign it to an object, you can use capture.output():
out <- capture.output(revprint(c("a", "b", "c", "d")))
out
#[1] "d and c and b and a"
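For reference, the directly recursive version discussed above could look roughly like the sketch below. The name rev_join is hypothetical. A "C stack usage" error usually suggests that the recursion never reaches a stopping case, so the sketch returns early for length-1 input and shrinks the vector on every call:

```r
# rev_join is an illustrative name; a sketch, not the original author's code
rev_join <- function(values) {
  n <- length(values)
  # base case: nothing left to prepend, so stop recursing
  if (n <= 1L) {
    return(values)
  }
  # last element first, then the same rule applied to the shorter vector
  paste0(values[n], " and ", rev_join(values[-n]))
}

rev_join("a")
rev_join(c("a", "b", "c", "d"))
```

Unlike the cat-based version, this returns a string, so the result can be assigned directly.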
I'd like to validate that a data.frame contains columns with specific names. Ideally this would be a utility function that I can just pass the data.frame and expected column names and the function will raise an error if the data.frame does not contain the expected columns. I have written my own function below, however, this seems like something that would already exist in the R ecosystem.
My questions are:
Does such a function (or one-liner) already exist either in base R or in a common package?
If not, any suggestions for my function (below)?
Example of the function I have written to do this:
validate_df_columns <- function(df, columns) {
chr_df <- deparse(substitute(df))
chr_columns <- paste(columns, collapse = ", ")
if (!('data.frame' %in% class(df))) {
stop(paste("Argument", chr_df, "must be a data.frame."))
}
if (sum(colnames(df) %in% columns) != length(columns)) {
stop(paste(chr_df, "must contain the columns", chr_columns))
}
}
validate_df_columns(data.frame(a=1:3, b=4:6), c("a", "b", "c'"))
## Error in validate_df_columns(data.frame(a = 1:3, b = 4:6), c("a", "b", :
## data.frame(a = 1:3, b = 4:6) must contain the columns a, b, c'
The packages tibble and rlang, both part of the tidyverse, have a function to check this:
library(tibble) # or library(rlang) or library(tidyverse)
has_name(iris, c("Species","potatoe"))
# [1] TRUE FALSE
Technically it lives in rlang and its code is just :
function (x, name)
{
name %in% names2(x)
}
where rlang::names2 is an enhanced version of base::names which returns a vector of empty strings rather than NULL when the object doesn't have names.
Here's a way to rewrite your function :
validate_df_columns <- function(df, columns){
if (!is.data.frame(df)) {
stop(paste("Argument", deparse(substitute(df)), "must be a data.frame."))
}
if(!all(i <- rlang::has_name(df,columns)))
stop(sprintf(
"%s doesn't contain: %s",
deparse(substitute(df)),
paste(columns[!i], collapse=", ")))
}
validate_df_columns(iris, c("Species","potatoe","banana"))
# Error in validate_df_columns(iris, c("Species", "potatoe", "banana")) :
# iris doesn't contain: potatoe, banana
Using deparse(substitute(...)) here makes little sense to me though, since the function is not meant to be used interactively; in my opinion it is clearer to just say "df".
The %in% operator works with pairs of vectors, so there is already a one-liner we can use here. Consider:
df <- data.frame(a=c(1:3), b=c(4:6), c=c(7:9))
names <- c("a", "c", "blah", "doh")
names[names %in% names(df)]
[1] "a" "c"
If you want to assert that the data frame contains all the input names, then just use:
all(names %in% names(df)) # to check all inputs are present
setequal(names, names(df)) # to check that input matches df
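Building on the %in% idea, a base R validator that also reports which columns are missing could be sketched as follows. The names check_columns and absent are hypothetical, and the error wording is only an example:

```r
# check_columns is an illustrative helper using base R only
check_columns <- function(df, columns) {
  if (!is.data.frame(df)) {
    stop("df must be a data.frame.", call. = FALSE)
  }
  # setdiff() keeps the requested names that are not column names of df
  absent <- setdiff(columns, names(df))
  if (length(absent) > 0) {
    stop("df doesn't contain: ", paste(absent, collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}

check_columns(data.frame(a = 1:3, b = 4:6), c("a", "b"))
```

Calling it with a name that is not present, for example c("a", "z"), raises an error that lists z.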
Currently the script below splits a combined item code into specific item codes.
rule2 <- c("MR")
df_1 <- test[grep(paste("^",rule2,sep="",collapse = "|"),test$Name.y),]
SpaceName_1 <- function(s){
num <- str_extract(s,"[0-9]+")
if(nchar(num) >3){
former <- substring(s, 1, 4)
latter <- strsplit(substring(s,5,nchar(s)),"")
latter <- unlist(latter)
return(paste(former,latter,sep = "",collapse = ","))
}
else{
return (s)
}
}
df_1$Name.y <- sapply(df_1$Name.y, SpaceName_1)
Example,
Combined item code: Room 324-326 is splitting into MR324 MR325 MR326.
However for this particular Combined item code: Room 309-311 is splitting into MR309 MR300 MR301.
How should I amend the script to give me MR309 MR310 MR311?
You can try something along these lines:
range <- "324-326"
x <- as.numeric(unlist(strsplit(range, split="-")))
paste0("MR", seq(x[1], x[2]))
[1] "MR324" "MR325" "MR326"
I assume that you can obtain the numerical room sequence by some means, and then use the snippet I gave you above.
If your combined item codes always have the form Room xxx-yyy, then you can extract the range using gsub:
range <- gsub("Room ", "", "Room 324-326")
If your item codes were in a vector called codes, then you could obtain a vector of ranges using:
ranges <- sapply(codes, function(x) gsub("Room ", "", x))
We can also evaluate the string after replacing the - with : and then paste the prefix "MR".
paste0("MR", eval(parse(text=sub("\\S+\\s+(\\d+)-(\\d+)", "\\1:\\2", range))))
#[1] "MR324" "MR325" "MR326"
Wrap it as a function for convenience
fChange <- function(prefixStr, RangeStr){
paste0(prefixStr, eval(parse(text=sub("\\S+\\s+(\\d+)-(\\d+)",
"\\1:\\2", RangeStr))))
}
fChange("MR", range)
fChange("MR", range1)
#[1] "MR309" "MR310" "MR311"
For multiple elements, just loop over and apply the function
sapply(c(range, range1), fChange, prefixStr = "MR")
data
range <- "Room 324-326"
range1 <- "Room 309-311"
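If you would rather avoid eval(parse()), the same expansion can be sketched with regmatches and seq. This assumes every code contains exactly two numbers, the start and the end of the range; expand_rooms is a hypothetical name:

```r
# expand_rooms is an illustrative helper; it assumes input like "Room 309-311"
expand_rooms <- function(code, prefix = "MR") {
  # pull out every run of digits, e.g. "309" and "311"
  bounds <- as.integer(regmatches(code, gregexpr("[0-9]+", code))[[1]])
  # build the full numeric sequence, then attach the prefix
  paste0(prefix, seq(bounds[1], bounds[2]))
}

expand_rooms("Room 324-326")
expand_rooms("Room 309-311")
```

Because the sequence is built numerically, the 309-311 case rolls over to 310 and 311 instead of being assembled digit by digit.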
I'm currently working on a programming project in R (for school) and I'm using a data set made of a large quantity of LastFm users (an application that collects data when you're using a media player).
I want to work on an eventual link between 2 variables present in the dataset which are the "nickname" and the "real name". To do so, I would like to compute a variable that represents the rate of similarity between the characters.
As an example take one individual (regardless of the other variables):
name = 'chris meller'
nickname = 'mellertime'
So far, I have tried sorting the strings in order to check for identical characters one by one, but I'm stuck here. What I found is just a way to check whether "name" is present inside "nickname", using different kinds of functions.
>paste(sort(unlist(strsplit(name, ""))), collapse = "")
[1] " ceehillmrrs"
>paste(sort(unlist(strsplit(nickname, ""))), collapse = "")
[1] "eeeillmmrt"
What I would like to know is if there is a way to count the number of identical letters between 2 character strings, regardless of the order?
I would like to end with something like this:
function(a,b)
[1] 0.63
# a,b are 2 character strings
where the result is the number of identical characters between the two strings divided by the number of characters in the real name.
Try this:
SimilarityRatio <- function(wholeName, nickname, matchCase) {
n1 <- sort(strsplit(paste(strsplit(wholeName, " ")[[1]], collapse = ""), "")[[1]])
n2 <- sort(strsplit(paste(strsplit(nickname, " ")[[1]], collapse = ""), "")[[1]])
if (!matchCase) {
n1 <- tolower(n1)
n2 <- tolower(n2)
}
MyLen <- tempLen <- length(n1)
j <- 1L
numMatch <- 0L
while (j <= tempLen) {
test1 <- n1[j] %in% n2
if (test1) {
myRemove <- min(which(n2 %in% n1[j]))
n1 <- n1[-j]
n2 <- n2[-myRemove]
numMatch <- numMatch + 1L
tempLen <- tempLen - 1L
} else {
j <- j+1L
}
}
numMatch/MyLen
}
Below are some test cases:
> SimilarityRatio("chris meller", "mellertime", FALSE)
[1] 0.6363636
> SimilarityRatio("SuperMan3000", "The3Musketeers", FALSE)
[1] 0.5
> SimilarityRatio("SuperMan3000", "The3Musketeers", TRUE)
[1] 0.4166667
> SimilarityRatio("should a garbage collection be performed immediately", "same expression can vary considerably depending on whether", FALSE)
[1] 0.7608696
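The same ratio can also be computed without an explicit loop by counting each character in both strings and summing the smaller of the two counts. This is only a sketch of that counting idea, not the code above; similarity_ratio and its arguments are hypothetical names:

```r
# similarity_ratio is an illustrative, loop-free variant of the function above
similarity_ratio <- function(name, nickname, match_case = FALSE) {
  if (!match_case) {
    name <- tolower(name)
    nickname <- tolower(nickname)
  }
  # drop spaces and split into single characters
  a <- strsplit(gsub(" ", "", name, fixed = TRUE), "")[[1]]
  b <- strsplit(gsub(" ", "", nickname, fixed = TRUE), "")[[1]]
  chars <- union(a, b)
  # per-character counts, aligned on the same set of characters
  count_a <- tabulate(match(a, chars), nbins = length(chars))
  count_b <- tabulate(match(b, chars), nbins = length(chars))
  # a character shared k times contributes the smaller of its two counts
  sum(pmin(count_a, count_b)) / length(a)
}

similarity_ratio("chris meller", "mellertime")
```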
This code is supposed to take in a word and compute values for letters of the word, based on the position of each letter in the word. So for a word like "broke" it's supposed to compute the values for the letters "r" and "k".
strg <- 'broke'
#this part stores everything except the first,
#last, and middle position of the word
strg.leng <- nchar(strg)
other.letts <- sequence(strg.leng)
if (length(other.letts) %% 2 != 0) {
oth_let1 <- other.letts[-c(1, ceiling(length(other.letts)/2), length(other.letts))]
} else {
oth_let0 <- other.letts[-c(1, c(1,0) + floor(length(other.letts)/2), length(other.letts))]
}
print(paste("Values of the other letters of: ", strg))
#here is where the computation starts, taking in the objects created above
if ((nchar(strg) %% 2) != 0) {
sapply(oth_let1, function(i) print(paste(oth_let1[i], "L", (.66666*1.00001) - (oth_let1[i] - 1) *.05 )))
} else {
sapply(oth_let0, function(i) print(paste(oth_let0[i], "L", (.66666*1.00001) - (oth_let0[i] - 1) *.05 )))
}
However for "broke" I get this which is only computing the value of "k" and some other stuff:
[1] "4 L 0.5166666666"
[1] "NA L NA"
[1] "4 L 0.5166666666" "NA L NA"
While the desired output should be a value for both "r" and "k", so something like:
[1] "2 L 0.61666666"
[1] "4 L 0.51666666"
What am I doing wrong? Am I using sapply incorrectly?
sapply iterates through the supplied vector or list and supplies each member in turn to the function. In your case, you're getting the values 2 and 4 and then trying to index your vector again using its own values. Since the oth_let1 vector has only two members, you get NA. You could fix your current code by replacing the oth_let1[i] with just i. However, your code could be greatly simplified to:
strg <- 'broke'
lets <- 2:(nchar(strg) - 1)
lets <- lets[-(1:2 + length(lets)) / 2] # removes middle item for odd and middle two for even
cat("Values of the other letters of:", strg, "\n")
#here is where the computation starts, taking in the objects created above
writeLines(paste(lets, "L", 0.66666*1.00001 - (lets - 1) * 0.05, sep = " "))
I'm assuming you want to output the results to the console.
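The point about sapply handing over values rather than positions can be seen in a tiny sketch. The vector idx is made up to mirror the positions 2 and 4 from the question:

```r
idx <- c(2, 4)  # stands in for oth_let1 from the question

# each call receives the element itself, so i is 2 and then 4
by_value <- sapply(idx, function(i) i)

# indexing the vector again with its own values asks for idx[2] and idx[4];
# idx has only two elements, so the second lookup is NA
by_reindex <- sapply(idx, function(i) idx[i])

by_value
by_reindex
```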
You're using sapply correctly; what you're getting wrong is the function inside it. What you want is element i of the other.letts variable, not of oth_let1. oth_let1 holds the indexes into other.letts.
The code below should work. I also changed the name of the variable to oth_let, so you don't need the second if. To make the output exactly what you asked for, I wrapped the call in invisible().
strg <- 'broke'
strg.leng <- nchar(strg)
other.letts <- sequence(strg.leng)
if(length(other.letts) %% 2 != 0) {
oth_let <- other.letts[-c(1, ceiling(length(other.letts)/2),
length(other.letts))]
}else{
oth_let <- other.letts[-c(1, c(1,0) + floor(length(other.letts)/2),
length(other.letts))]
}
print(paste("Values of the other letters of: ", strg))
invisible(sapply(oth_let,
function(i)
print(paste(other.letts[i], "L", (.66666*1.00001) - (other.letts[i] - 1) *.05 ))))
I'm new to R. I have a problem to solve, and a working function below that solves it nicely (in decent time). But, from what I'm reading on R tutorials, and here on SO, I feel like I'm doing way too much work to solve it. Is there some fancy R way to collapse this all into a few lines?
The problem to solve: Given a CSV file of character data and a "flag" argument, extract the value at position [row, 1]. "row" is calculated to be the minimum value from column "InterestingColumn" for "flag a", the maximum value from column "InterestingColumn" for "flag b", or the n-th value defined by a numeric "flag". The output should be grouped by the unique values of "InterestingColumn". The returned result should be a data frame. The column schema is known, but the length of the file is not.
My instinct is that I should be able to get rid of the for loop altogether, and also that my reconstruction of the matrix with rbind each time is inefficient (like this?) Any tutelage would be appreciated, thanks!
myfunc <- function(flag = "a") {
csv <- read.csv("data.csv", colClasses = "character")
col <- unique(csv$InterestingColumn)
output <- NULL
for (i in 1:length(col)) {
sub <- subset(csv, InterestingColumn == col[i])
vals <- as.numeric(sub[, 12])
if (flag == "a") {
output <- rbind(output, matrix(c(sub[which.min(vals),1], col[i]), ncol = 2))
}
else if (flag == "b") {
output <- rbind(output, matrix(c(sub[which.max(vals),1], col[i]), ncol = 2))
}
else if (is.numeric(flag)) {
output <- rbind(output, matrix(c(sub[flag,1], col[i]), ncol = 2))
}
}
colnames(output) <- c("data", "col")
as.data.frame(output)
}
Say that column 12 is named Col12. Then aggregate may be in order. Everything after the read.csv call in the function should be handled by the following expression (but you may want to set the names of the resulting data frame):
aggregate(Col12 ~ InterestingColumn, data=csv, FUN=function(x) {
if (flag == "a") {
min(x);
} else if (flag == "b") {
max(x);
} else if (is.numeric(flag)) {
x[flag];
}
})
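Note that aggregate returns the aggregated value of Col12 itself. If what is needed is the column 1 entry of the selected row, as in the original loop, a split and lapply sketch along these lines might be closer. The function pick_per_group, the data frame demo and its column names are all made up for illustration:

```r
# made-up stand-in for the CSV, with every column read as character
demo <- data.frame(
  id = c("x1", "x2", "x3", "x4"),
  InterestingColumn = c("g1", "g1", "g2", "g2"),
  Col12 = c("3", "1", "5", "7"),
  stringsAsFactors = FALSE
)

# pick_per_group is an illustrative helper, not part of the original function
pick_per_group <- function(df, flag = "a",
                           value_col = "Col12",
                           group_col = "InterestingColumn") {
  groups <- split(df, df[[group_col]])
  rows <- lapply(groups, function(g) {
    vals <- as.numeric(g[[value_col]])
    # choose the row position within this group according to the flag
    i <- if (identical(flag, "a")) {
      which.min(vals)
    } else if (identical(flag, "b")) {
      which.max(vals)
    } else {
      flag
    }
    data.frame(data = g[i, 1], col = g[[group_col]][1],
               stringsAsFactors = FALSE)
  })
  # stack the one-row data frames into a single result
  do.call(rbind, rows)
}

pick_per_group(demo, "a")
pick_per_group(demo, 1)
```

This builds the result once at the end rather than growing it with rbind inside a loop.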