I am writing an if statement that checks for duplicates. If there are any, I want execution to continue, but I also want to print a message indicating which values are duplicated. I tried message(), but I am unsure how to include the values of the duplicated locations.
if(anyDuplicated(regionGroups$location) > 0){
duplicateRegions <- regionGroups[, 'count' := .N, by = location][count > 1, .SD[1], by = location][[1]]
message("Location is not unique in the table regionGroups. There are length(duplicateRegions) duplicated locations, namely: duplicateRegions[1],duplicateRegions[2] ")
regionGroups <- regionGroups[!duplicated(regionGroups$location)]
}
(anyDuplicated(regionGroups$location) > 0)
[1] TRUE
duplicateRegions
[1] 55100 26080
The desired output is:
Location is not unique in the table regionGroups. There are 2 duplicated locations, namely: 55100, 26080
The complication is that duplicateRegions may contain many more values, and the numbers will change.
QUESTION: how to write the message() statement such that the output will list the respective values of duplicateRegions?
Does this work for you? I'm a little unsure why you need the if statement if you are going to continue anyway, since no else branch appears to be required; perhaps you left it out for simplicity.
Another point to note: duplicated() does not flag the first occurrence of each repeated value, only the later ones, so when you use
regionGroups[!duplicated(regionGroups$location),] it always keeps the first row for each location and drops the rest. That may be fine for you, but be aware of it.
Also, writing duplicateRegions[1],duplicateRegions[2] inside the message call assumes you know how many duplicates there will be, which is not the case. message() pastes its arguments together, so pass the values as separate arguments rather than inside the quoted string, and collapse them into one string with paste(as.character(dup_regions), collapse = ", ") so the count does not matter.
if(any(duplicated(regionGroups$location))){
  dups <- which(duplicated(regionGroups$location))
  # unique() lists a location once even if it occurs three or more times
  dup_regions <- unique(regionGroups$location[dups])
  message("Location is not unique in the table regionGroups. There are ",
          length(dup_regions), " duplicated locations, namely: ",
          paste(as.character(dup_regions), collapse = ", "))
  regionGroups <- regionGroups[!duplicated(regionGroups$location),]
}
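To try the idea without the real table, here is a minimal sketch on made-up data. The data frame contents and the helper name report_duplicates() are assumptions for illustration only; the approach is the same paste(..., collapse = ", ") idea described above.

```r
# Toy stand-in for regionGroups (values invented for illustration)
regionGroups <- data.frame(
  location = c(55100, 26080, 55100, 26080, 55100, 10000),
  group    = c("a", "b", "c", "d", "e", "f")
)

# report_duplicates() is a hypothetical helper, not part of any package
report_duplicates <- function(x, table_name) {
  # Distinct values that occur more than once
  dup_values <- unique(x[duplicated(x)])
  if (length(dup_values) == 0) return(NULL)
  sprintf(
    "Location is not unique in the table %s. There are %d duplicated locations, namely: %s",
    table_name, length(dup_values), paste(dup_values, collapse = ", ")
  )
}

msg <- report_duplicates(regionGroups$location, "regionGroups")
if (!is.null(msg)) {
  message(msg)
  # Keep only the first row of each location
  regionGroups <- regionGroups[!duplicated(regionGroups$location), , drop = FALSE]
}
```

Building the text first and then calling message() keeps the string easy to inspect or reuse, for example in a log file.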
I have a base query that I am trying to add 14 other strings to, which are named query1:14
I am trying to combine them using paste, and the following works for one query. However, I am trying to set this up in a function that is passed the number of queries, so that it knows how many to loop over and add to the final result
Here is some code to add one query, which works:
result <- paste(base_query, eval(as.name(paste0("query", 2))))
Here is something i tried to loop the queries added, to no success
range <- 14
result<- paste(base_query, while(loop<range){loop<-loop+1
eval(as.name(paste0("query", loop)))}
I'm not sure how to get the names generated in the while loop to be added to the paste, thanks
In general, it is preferable to store the queries 1-14 in a list, rather than store them separately in the global environment. That is why half of this solution entails getting the objects into a list, which is easily pasted together with the collapse argument of paste.
Also pay attention to gtools::mixedsort, which sorts them in the order that I assume you want (incrementally 1 to 14).
# Generate the query objects
base_query = "The letters of the alphabet are:"
for (i in 1:14) {
assign(x = paste0("query",i), value = LETTERS[i])
}
# Find all the query names, and sort them in a meaningful order
query_names = gtools::mixedsort(ls(pattern = "query*"))
query_names
#> [1] "base_query" "query1" "query2" "query3" "query4"
#> [6] "query5" "query6" "query7" "query8" "query9"
#> [11] "query10" "query11" "query12" "query13" "query14"
# Now get the appropriate objects
query_list = lapply(query_names, function(x){
get(x)
})
# Paste and collapse the contents of the list
paste(query_list, collapse = "")
#> [1] "The letters of the alphabet are:ABCDEFGHIJKLMN"
Created on 2022-08-26 by the reprex package (v2.0.0)
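If the queries must stay as separate objects and the function is told how many there are, a possible variant is to build the names from the count and fetch them with mget(), which returns the objects as a list in the order requested. This is only a sketch: combine_queries() and the example query strings are hypothetical, and it assumes the objects are visible from the calling environment.

```r
# Hypothetical stand-ins for the asker's objects
base_query <- "SELECT * FROM t WHERE"
query1 <- "a = 1"
query2 <- "AND b = 2"
query3 <- "AND c = 3"

# combine_queries() is an illustrative helper, not part of any package
combine_queries <- function(base, n, envir = parent.frame()) {
  # Build "query1", ..., "queryn" and fetch the objects in that order
  query_names <- sprintf("query%d", seq_len(n))
  parts <- mget(query_names, envir = envir, inherits = TRUE)
  paste(c(base, unlist(parts, use.names = FALSE)), collapse = " ")
}

result <- combine_queries(base_query, 3)
result
```

Because the names are generated in numeric order, no sorting step is needed with this variant.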
I have two columns.
Whenever the first column has the value "Breast" for a given row, I want to append the value of another column (testdat$IHC).
This code should work fine:
testdat$Pathology[testdat$Pathology == "Breast"] <- paste("Breast (", testdat$IHC, ")")
The problem, however, is that the logical index testdat$Pathology == "Breast" contains NA wherever Pathology is NA, and this triggers an error on the assignment:
NAs are not allowed in subscripted assignments
The problem I'm finding is that na.omit() doesn't work as I think it generates some modifications in the data:
na.omit(testdat$Pathology[testdat$Pathology == "Breast"]) <- paste("Breast (", testdat$IHC, ")")
You can check for NA explicitly. Remember to apply the same filter on the right-hand side as well, or the vectors won't match and the values will be offset.
testdat$Pathology[! is.na(testdat$Pathology) & testdat$Pathology == "Breast"] <- paste0("Breast (", testdat$IHC[! is.na(testdat$Pathology) & testdat$Pathology == "Breast"], ")")
You may make it cleaner by saving the common expression in a variable:
subscripts.to.replace <- ! is.na(testdat$Pathology) & testdat$Pathology == "Breast"
testdat$Pathology[subscripts.to.replace] <- paste0("Breast (", testdat$IHC[subscripts.to.replace], ")")
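As an alternative to the explicit is.na() check, which() turns the logical test into integer positions and silently drops both NA and FALSE. A small sketch on invented data (the values in testdat below are assumptions for illustration):

```r
# Toy data with an NA in the Pathology column
testdat <- data.frame(
  Pathology = c("Breast", NA, "Lung", "Breast"),
  IHC       = c("ER+", "HER2", "PD-L1", "TN"),
  stringsAsFactors = FALSE
)

# which() keeps only the positions where the comparison is TRUE
idx <- which(testdat$Pathology == "Breast")

# Use the same index on both sides so the values line up row by row
testdat$Pathology[idx] <- paste0("Breast (", testdat$IHC[idx], ")")
testdat$Pathology
```

Rows where Pathology is NA or has another value are left untouched.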
I'm trying to convert some variables into statements.
I'm looking for some values in a data frame, like this:
y<-subset(x,x$Incidence.type == "Appro.11.Plural")
Invoice Incidence.type
1: 20171200738 Appro.11.Plural
2: 20171200737 Appro.11.Plural
Once that is done, I would like to build a statement like this:
Statement<-paste("The invoices",y[1,2],"and",y[2,2], "are...")
The problem is that the number of elements, instead of two, could be anything from zero upwards. So I need to generalize this code to produce an equivalent result whatever the number of invoices.
I've tried this, but it is still not working:
if (length(y>0) {for (j in (Push$Invoice)) {Statement<-paste(j,sep = ",")}}
Thanks in advance.
0 and 1 are special cases, anything 2 and over we can generalize.
if (nrow(y) == 0) Statement = "No matches."
if (nrow(y) == 1) Statement = paste("The invoice", y[1, 2], "is ...")
if (nrow(y) > 1) {
invoices = paste(y[, 2], collapse = ", and ")
Statement = paste("The invoices", invoices, "are ...")
}
I've mimicked your question using the second column of y, but don't you actually want the invoice number? For more robust code I'd recommend using the column name: y[1, "Invoice"] instead of y[1, 1], that way you will always get the correct column even if the order is different. If you are using data.table you don't need to quote the column name.
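The three cases can also be wrapped in a small function that takes a plain vector of invoice numbers. This is a sketch under assumptions: make_statement() is a hypothetical helper, and it joins the last element with "and" and the others with commas, which is one possible wording rather than the one used above.

```r
# make_statement() is a hypothetical helper for illustration
make_statement <- function(invoices) {
  n <- length(invoices)
  if (n == 0) return("No matches.")
  if (n == 1) return(paste("The invoice", invoices, "is ..."))
  # Join all but the last with commas, then attach the last with "and"
  head_part <- paste(invoices[-n], collapse = ", ")
  paste("The invoices", head_part, "and", invoices[n], "are ...")
}

make_statement(character(0))
make_statement("20171200738")
make_statement(c("20171200738", "20171200737"))
```

With the data from the question, the input could be taken by column name, for example y[["Invoice"]], which returns a vector for both data frames and data.tables.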
When writing functions it is important to check for the type of arguments. For example, take the following (not necessarily useful) function which is performing subsetting:
data_subset = function(data, date_col) {
if (!TRUE %in% (is.character(date_col) | is.expression(date_col))){
stop("Input variable date is of wrong format")
}
if (is.character(date_col)) {
x <- match(date_col, names(data))
} else x <- match(deparse(substitute(date_col)), names(data))
sub <- data[,x]
}
I would like to allow the user to provide the column which should be extracted as character or expression (e.g. a column called "date" vs. just date). At the beginning I would like to check that the input for date_col is really either a character value or an expression. However, 'is.expression' does not work:
Error in match(x, table, nomatch = 0L) : object '...' not found
Since deparse(substitute()) works if one provides expressions, I thought 'is.expression' would have to work as well.
What is wrong here, can anyone give me a hint?
I think you are not looking for is.expression but for is.name.
The tricky part is to get the type of date_col and to check if it is of type character only if it is not of type name. If you called is.character when it's a name, then it would get evaluated, typically resulting in an error because the object is not defined.
To do this, short circuit evaluation can be used: In
if(!(is.name(substitute(date_col)) || is.character(date_col)))
is.character is only called if is.name returns FALSE.
Your function boils down to:
data_subset = function(data, date_col) {
if(!(is.name(substitute(date_col)) || is.character(date_col))) {
stop("Input variable date is of wrong format")
}
date_col2 <- as.character(substitute(date_col))
return(data[, date_col2])
}
Of course, you could use if(is.name(…)) to convert only to character when date_col is a name.
This works:
testDF <- data.frame(col1 = rnorm(10), col2 = rnorm(10, mean = 10), col3 = rnorm(10, mean = 50), rnorm(10, mean = 100))
data_subset(testDF, "col1") # ok
data_subset(testDF, col1) # ok
data_subset(testDF, 1) # Error in data_subset(testDF, 1) : Input variable date is of wrong format
However, I don't think you should do this. Consider the following example:
var <- "col1"
data_subset(testDF, var) # Error in `[.data.frame`(data, , date_col2) : undefined columns selected
col1 <- "col2"
data_subset(testDF, col1) # Gives content of column 1, not column 2.
Though this "works as designed", it is confusing because unless carefully reading your function's documentation one would expect to get col1 in the first case and col2 in the second case.
Abusing a famous quote:
Some people, when confronted with a problem, think “I know, I'll use non-standard evaluation.” Now they have two problems.
Hadley Wickham in Non-standard evaluation:
Non-standard evaluation allows you to write functions that are extremely powerful. However, they are harder to understand and to program with. As well as always providing an escape hatch, carefully consider both the costs and benefits of NSE before using it in a new domain.
Unless you expect large benefits from letting callers skip the quotes around the column name, don't do it.
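For comparison, a version that uses standard evaluation only might look like the sketch below. The name data_subset_se() and the error messages are assumptions for illustration; the point is that the argument is always evaluated, so a variable holding a column name behaves as expected.

```r
# Sketch: accept the column name as a string only (standard evaluation)
data_subset_se <- function(data, date_col) {
  if (!is.character(date_col) || length(date_col) != 1L) {
    stop("date_col must be a single column name given as a string")
  }
  if (!date_col %in% names(data)) {
    stop("Column '", date_col, "' not found in data")
  }
  data[[date_col]]
}

testDF <- data.frame(col1 = 1:3, col2 = 4:6)
var <- "col2"
data_subset_se(testDF, "col1")  # column col1
data_subset_se(testDF, var)     # the value of var is used, so column col2
```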
I want to do the same as explained here, i.e. adding missing rows to a data.table. The only additional difficulty I'm facing is that I want the number of key columns, i.e. those columns that are used for the self-join, to be flexible.
Here is a small example that basically repeats what is done in the link mentioned above:
df <- data.frame(fundID = rep(letters[1:4], each=6),
cfType = rep(c("D", "D", "T", "T", "R", "R"), times=4),
variable = rep(c(1,3), times=12),
value = 1:24)
DT <- as.data.table(df)
idCols <- c("fundID", "cfType")
setkeyv(DT, c(idCols, "variable"))
DT[CJ(unique(df$fundID), unique(df$cfType), seq(from=min(variable), to=max(variable))), nomatch=NA]
What bothers me is the last line. I want idCols to be flexible (for instance if I use it within a function), so I don't want to type unique(df$fundID), unique(df$cfType) manually. However, I just don't find any workaround for this. All my attempts to automatically split the subset of df into vectors, as needed by CJ, fail with the error message Error in setkeyv(x, cols, verbose = verbose) : Column 'V1' is type 'list' which is not (currently) allowed as a key column type.
CJ(sapply(df[, idCols], unique))
CJ(unique(df[, idCols]))
CJ(as.vector(unique(df[, idCols])))
CJ(unique(DT[, idCols, with=FALSE]))
I also tried building the expression myself:
str <- ""
for (i in idCols) {
str <- paste0(str, "unique(df$", i, "), ")
}
str <- paste0(str, "seq(from=min(variable), to=max(variable))")
str
[1] "unique(df$fundID), unique(df$cfType), seq(from=min(variable), to=max(variable))"
But then I don't know how to use str. This all fails:
CJ(eval(str))
CJ(substitute(str))
CJ(call(str))
Does anyone know a good workaround?
Michael's answer is great. do.call is indeed needed to call CJ flexibly in that way, afaik.
To clear up on the expression building approach and starting with your code, but removing the df$ parts (not needed and not done in the linked answer, since i is evaluated within the scope of DT) :
str <- ""
for (i in idCols) {
str <- paste0(str, "unique(", i, "), ")
}
str <- paste0(str, "seq(from=min(variable), to=max(variable))")
str
[1] "unique(fundID), unique(cfType), seq(from=min(variable), to=max(variable))"
then it's :
expr <- parse(text=paste0("CJ(",str,")"))
DT[eval(expr),nomatch=NA]
or alternatively build and eval the whole query dynamically :
eval(parse(text=paste0("DT[CJ(",str,"),nomatch=NA]")))
And if this is done a lot then it may be worth creating yourself a helper function :
E = function(...) eval(parse(text=paste0(...)))
to reduce it to :
E("DT[CJ(",str,"),nomatch=NA]")
I've never used the data.table package, so forgive me if I miss the mark here, but I think I've got it. There's a lot going on here. Start by reading up on do.call, which lets you call a function with its arguments supplied as a list (each element of the list is matched positionally to the function's arguments unless it is explicitly named). Also notice that I had to specify min(df$variable) instead of just min(variable). Read Hadley's page on scoping to get an idea of the issue here.
CJargs <- lapply(df[, idCols], unique)
names(CJargs) <- NULL
CJargs[[length(CJargs) +1]] <- seq(from=min(df$variable), to=max(df$variable))
DT[do.call("CJ", CJargs),nomatch=NA]
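To see what do.call does in isolation, here is a base R sketch in which expand.grid() stands in for CJ(); that substitution is an assumption made so the example runs without data.table, and the values are invented. Both functions take one vector per argument and return all combinations.

```r
# One vector of unique values per id column (invented values)
id_values <- list(fundID = c("a", "b"), cfType = c("D", "T", "R"))

# Append the sequence for the last column to the argument list
args <- c(id_values, list(variable = 1:3))

# Same as calling expand.grid(fundID = ..., cfType = ..., variable = 1:3)
grid <- do.call(expand.grid, args)

nrow(grid)   # 2 * 3 * 3 combinations
names(grid)
```

Because the list is built programmatically, the number of id columns can change without touching the call itself.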