Because this would simplify my code only slightly, I'm mainly asking out of curiosity. Say you have a function
f <- function(x) {
c(x[1] - x[2],
x[2] - x[3],
x[3] - x[1])
}
Is there a way of finding out the dimension of the input required, e.g.
dim(f) = 3
Here's a very questionable solution that involves computing on the language.
I've written three functions that allow matching, searching, and extracting pieces of parse trees based on a sort of "template" parse tree piece. Here they are:
ptlike <- function(lhs,rhs,wcs='.any.',wcf=NULL) if (is.symbol(rhs) && as.character(rhs)%in%wcs) !is.function(wcf) || isTRUE(wcf(lhs,as.character(rhs))) else typeof(lhs) == typeof(rhs) && length(lhs) == length(rhs) && if (is.call(lhs)) all(sapply(seq_along(lhs),function(i) ptlike(lhs[[i]],rhs[[i]],wcs,wcf))) else lhs == rhs;
ptfind <- function(ptspace,ptpat,wcs='.any.',wcf=NULL,locf=NULL,loc=integer(),ptspaceorig=ptspace) c(list(),if (ptlike(ptspace,ptpat,wcs,wcf) && (!is.function(locf) || isTRUE(locf(ptspaceorig,loc)))) list(loc),if (is.call(ptspace)) do.call(c,lapply(seq_along(ptspace),function(i) ptfind(ptspace[[i]],ptpat,wcs,wcf,locf,c(loc,i),ptspaceorig))));
ptextract <- function(ptspace,ptpat,gets='.get.',wcs='.any.',wcf=NULL,locf=NULL,getf=NULL) { getlocs <- do.call(c,lapply(gets,function(get) ptfind(ptpat,as.symbol(get),character()))); if (length(getlocs)==0L) getlocs <- list(integer()); c(list(),do.call(c,lapply(ptfind(ptspace,ptpat,unique(c(gets,wcs)),wcf,locf),function(loc) do.call(c,lapply(getlocs,function(getloc) { cloc <- c(loc,getloc); ptget <- if (length(cloc)>0) ptspace[[cloc]] else ptspace; if (!is.function(getf) || isTRUE(getf(if (missing(ptget)) substitute() else ptget,loc,getloc,as.character(ptpat[[getloc]])))) list(if (missing(ptget)) substitute() else ptget); }))))); };
ptlike() matches two parse tree pieces against each other, allowing for wildcards on the RHS to match anything. For example:
ptlike(1L,2L);
## [1] FALSE
ptlike(1L,1L);
## [1] TRUE
ptlike(quote(a),quote(b));
## [1] FALSE
ptlike(quote(a),quote(a));
## [1] TRUE
ptlike(quote(sum(a+1)),quote(sum(b+1)));
## [1] FALSE
ptlike(quote(sum(a+1)),quote(sum(.any.+1)));
## [1] TRUE
ptlike(quote(sum(a+1)),quote(.any.(a+1)));
## [1] TRUE
ptlike(quote(sum(a+1)),quote(.any.));
## [1] TRUE
ptfind() returns a recursive index vector for each match of a parse tree pattern (RHS, if you like) in a given parse tree space (LHS), combined into a list. For example:
sp <- quote({ a+b*c:d; e+f*g:h; });
ptfind(sp,quote(.any.:.any.));
## [[1]]
## [1] 2 3 3
##
## [[2]]
## [1] 3 3 3
##
sp[[c(2L,3L,3L)]];
## c:d
sp[[c(3L,3L,3L)]];
## g:h
ptextract() is similar to ptfind(), but returns the matched piece of the parse tree space or a subset thereof:
ptextract(sp,quote(c:d));
## [[1]]
## c:d
##
ptextract(sp,quote(.any.:.any.));
## [[1]]
## c:d
##
## [[2]]
## g:h
##
ptextract(sp,quote(.any.:.get.));
## [[1]]
## d
##
## [[2]]
## h
##
So what we can do is extract from (the parse tree that comprises the body of) your function all the subscripts of the x argument and get the maximum literal subscript value:
f <- function(x) c(x[1L]-x[2L],x[2L]-x[3L],x[3L]-x[1L]);
max(unlist(Filter(is.numeric,ptextract(body(f),quote(x[.get.])))));
## [1] 3
The Filter(is.numeric,...) piece isn't really necessary here, since all subscripts are literal numbers (doubles in your definition, integers in mine, although that's inconsequential), but if there were ever non-numeric-literal subscripts, then it would be necessary.
Caveats:
It's absurd.
It would not take into account variable subscripts that might rise above the maximum literal subscript value in the parse tree. (Strictly speaking, in the general case it is impossible to statically determine every subscript that might occur at run time; questions of this kind about program behavior are undecidable, a consequence of the halting problem.)
If the vector was ever assigned to a different variable name, say y, and then that variable was indexed with a different subscript, even a numeric literal subscript, this algorithm would not take that into account.
And there's probably some more.
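For comparison, here is a much smaller sketch of the same idea that does not depend on the three helper functions above. It is only an illustration under stated assumptions: literal_subscripts is a hypothetical helper name, it only recognizes single-index calls of the form x[<numeric literal>], and it does not handle calls with empty arguments such as x[1, ]. It shares the caveats listed above.

```r
# Recursively collect numeric literal subscripts applied to the symbol `var`.
# Hypothetical helper, not part of base R.
literal_subscripts <- function(expr, var = "x") {
  if (!is.call(expr)) return(numeric(0))
  found <- numeric(0)
  # A call like x[3] is stored as `[`(x, 3): function symbol, object, index.
  if (identical(expr[[1]], as.symbol("[")) &&
      length(expr) == 3L &&
      identical(expr[[2]], as.symbol(var)) &&
      is.numeric(expr[[3]])) {
    found <- expr[[3]]
  }
  # Descend into every component of the call.
  c(found, unlist(lapply(as.list(expr), literal_subscripts, var = var)))
}

f <- function(x) c(x[1] - x[2], x[2] - x[3], x[3] - x[1])
max(literal_subscripts(body(f)))
## [1] 3

# A variable subscript is invisible to this kind of static analysis.
g <- function(x) { i <- 5; x[i] }
length(literal_subscripts(body(g)))
## [1] 0
```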
I am an infrequent R user, so my apologies if any of my terminology is incorrect. I am working on a project around the game Wordle to see if a given Wordle submission in my family group chat is unique or if it has already been submitted before. The inspiration for this came from the Twitter account "Scorigami", which tracks every NFL game and tweets whether or not that score has occurred before in the history of the league.
To load the Wordle entries into R, I've decided to turn each submission into a Matrix where 0 = incorrect letter, 1 = right letter/wrong position, and 2 = right letter/correct position. In R this looks like this:
wordle_brendan <- rbind(c(1,0,0,0,0),c(2,2,0,0,0),c(2,2,0,0,0),c(2,2,2,2,2))
wordle_jack <- rbind(c(2,0,0,0,0),c(2,2,0,0,0),c(2,2,2,2,2))
I then combine them into a list that will be used to check against any future Wordle submissions to see if they have been previously submitted.
list <- list(wordle_brendan, wordle_jack)
I think I am on the right track, but I don't know how to create a new Wordle matrix and test whether that submission has been given before. Say I recreated "wordle_brendan" with the same values but under a different name. How would I then get R to check whether that matrix exists in my preexisting list of matrices? I've tried using the %in% operator 1,000 different ways but can't get it to work. Any help would be much appreciated! Thanks! (And if you can think of a better way to do this, please let me know!)
There are multiple ways to do this, but this is pretty simple. We need some samples to check:
new1 <- list[[2]] # The same as your second matrix
new2 <- new1
new2[3, 5] <- 0 # Change one position from 2 to 0.
To compare:
any(sapply(list, identical, y=new1))
# [1] TRUE
any(sapply(list, identical, y=new2))
# [1] FALSE
So new1 matches an existing matrix, but new2 does not. To see which matrix:
which(sapply(list, identical, y=new1))
# [1] 2
which(sapply(list, identical, y=new2))
# integer(0)
So new1 matches the second matrix in list, but new2 does not match any matrix.
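The check above could be wrapped into a small helper that also records new submissions. This is only a sketch: is_repeat and history are hypothetical names (history is used instead of list to avoid masking the base function), and vapply is used in place of sapply so that the result is always a logical vector, even for an empty history.

```r
# Previously seen submissions (same matrices as in the question).
history <- list(
  rbind(c(1,0,0,0,0), c(2,2,0,0,0), c(2,2,0,0,0), c(2,2,2,2,2)),
  rbind(c(2,0,0,0,0), c(2,2,0,0,0), c(2,2,2,2,2))
)

# TRUE if `submission` is identical to any matrix already in `history`.
is_repeat <- function(submission, history) {
  any(vapply(history, identical, logical(1), y = submission))
}

new1 <- history[[2]]     # same as the second matrix
new2 <- new1
new2[3, 5] <- 0          # change one position from 2 to 0

is_repeat(new1, history)
is_repeat(new2, history)

# Record the submission only if it has not been seen before.
if (!is_repeat(new2, history)) history[[length(history) + 1L]] <- new2
length(history)
```

After the last step new2 is part of history, so submitting it again would count as a repeat.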
Here is a way with a matequal function. The base function identical compares objects, not values, so if the matrices have the same values but different attributes, such as dimnames, identical returns FALSE.
This is often too strict. A function that compares values only will return TRUE in these cases.
I will use dcarlson's new1 to illustrate this point.
matequal <- function(x, y) {
ok <- is.matrix(x) && is.matrix(y) && all(dim(x) == dim(y))
ok && all(x == y)
}
wordle_brendan <- rbind(c(1,0,0,0,0),c(2,2,0,0,0),c(2,2,0,0,0),c(2,2,2,2,2))
wordle_jack <- rbind(c(2,0,0,0,0),c(2,2,0,0,0),c(2,2,2,2,2))
list <- list(wordle_brendan, wordle_jack)
new1 <- list[[2]] # The same as your second matrix
wordle_john <- wordle_jack
dimnames(wordle_john) <- list(1:3, letters[1:5])
list2 <- list(wordle_brendan, wordle_jack, wordle_john)
sapply(list2, identical, y=new1)
#> [1] FALSE TRUE FALSE
sapply(list2, matequal, y=new1)
#> [1] FALSE TRUE TRUE
Created on 2022-09-27 with reprex v2.0.2
Edit
identical is not a function to compare two objects' values; it's a function to compare the objects themselves. In the following example identical returns FALSE even though x and y have equal values, in the usual sense of equal.
matequal <- function(x, y) {
ok <- is.matrix(x) && is.matrix(y) && all(dim(x) == dim(y))
ok && all(x == y)
}
x <- matrix(1:5, ncol = 1)
y <- matrix(1 + 0:4, ncol = 1)
all(x == y)
#> [1] TRUE
identical(x, y)
#> [1] FALSE
matequal(x, y)
#> [1] TRUE
Created on 2022-09-28 with reprex v2.0.2
This is because x and y have different internal storage types, which correspond to the C types int and double: x stores its elements as "integer" and y stores them as "double" (class "numeric"). Both matrices have the same class ("matrix" "array"); the storage type of their elements is the only difference.
In a comment it is asked
Thank you and dcarlson for the response! Regarding your two sapply lines, can you explain what the use would be behind using matequal as opposed to identical? Is the only difference that matequal takes into account the column and row names?
So the answer to the question in the comment is no: attributes, in this case dimnames, are not the only reason why identical is often not ideal for comparing R objects.
typeof(x)
#> [1] "integer"
typeof(y)
#> [1] "double"
class(x[1])
#> [1] "integer"
class(y[2])
#> [1] "numeric"
class(x)
#> [1] "matrix" "array"
class(y)
#> [1] "matrix" "array"
Created on 2022-09-28 with reprex v2.0.2
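If you would rather keep using identical, another option is to normalize both objects before comparing them. The following is a minimal sketch of that idea, not a replacement for matequal: it assumes that dropping dimnames and coercing integer storage to double is acceptable for your data.

```r
x <- matrix(1:5, ncol = 1)       # integer storage
y <- matrix(1 + 0:4, ncol = 1)   # double storage

# Coerce a copy of x to double storage; dim is preserved.
x_dbl <- x
storage.mode(x_dbl) <- "double"
identical(x_dbl, y)
#> [1] TRUE

# Drop dimnames before comparing.
wordle_jack <- rbind(c(2,0,0,0,0), c(2,2,0,0,0), c(2,2,2,2,2))
wordle_john <- wordle_jack
dimnames(wordle_john) <- list(1:3, letters[1:5])
identical(unname(wordle_john), wordle_jack)
#> [1] TRUE
```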
babybag - baby bag
badshelter - bad shelter
themoderncornerstore - the modern corner store
hamptonfamilyguidebook - hampton family guide book
Is there a way to use R to extract words from a string that does not have spaces or other delimiters? I have a list of URLs and I am trying to figure out what words are included in the URLs.
input <- c("babybag", "badshelter", "themoderncornerstore", "hamptonfamilyguidebook")
Here is a naive approach that might give you inspiration. I used the hunspell library, but you could test substrings against any dictionary.
I start from the right, try every substring, and keep the longest one I can find in the dictionary, then move my starting position. It's quite slow, so I hope you don't have 4 million of those. "hampton" is not in this dictionary, so it doesn't give the right result for the last one:
split_words <- function(x){
candidate <- x
words <- NULL
j <- nchar(x)
while(j !=0){
word <- NULL
for (i in j:1){
candidate <- substr(x,i,j)
if(!length(hunspell::hunspell_find(candidate)[[1]])) word <- candidate
}
if(is.null(word)) return("")
words <- c(word,words)
j <- j-nchar(word)
}
words
}
input <- c("babybag", "badshelter", "themoderncornerstore", "hamptonfamilyguidebook")
lapply(input,split_words)
# [[1]]
# [1] "baby" "bag"
#
# [[2]]
# [1] "bad" "shelter"
#
# [[3]]
# [1] "the" "modern" "corner" "store"
#
# [[4]]
# [1] "h" "amp" "ton" "family" "guidebook"
#
Here's a quick fix, adding words manually to the dictionary:
split_words <- function(x, additional = c("hampton","otherwordstoadd")){
candidate <- x
words <- NULL
j <- nchar(x)
while(j !=0){
word <- NULL
for (i in j:1){
candidate <- substr(x,i,j)
if(!length(hunspell::hunspell_find(candidate,ignore = additional)[[1]])) word <- candidate
}
if(is.null(word)) return("")
words <- c(word,words)
j <- j-nchar(word)
}
words
}
input <- c("babybag", "badshelter", "themoderncornerstore", "hamptonfamilyguidebook")
lapply(input,split_words)
# [[1]]
# [1] "baby" "bag"
#
# [[2]]
# [1] "bad" "shelter"
#
# [[3]]
# [1] "the" "modern" "corner" "store"
#
# [[4]]
# [1] "hampton" "family" "guidebook"
#
You can only cross your fingers and hope there are no ambiguous expressions, though. Note that "guidebook" comes out as one word in my output, so we already have an edge case in your four examples.
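To illustrate the remark that substrings could be tested against any dictionary, here is a sketch of the same right-to-left, longest-match idea using a plain character vector instead of hunspell. The function name split_with_dict and the tiny dict vector are made up for this example; a real word list would be much larger. It returns character(0) when no split is found.

```r
# Greedy right-to-left segmentation against a character-vector dictionary.
split_with_dict <- function(x, dict) {
  words <- character(0)
  j <- nchar(x)
  while (j > 0) {
    # All substrings ending at position j, longest first.
    candidates <- substring(x, seq_len(j), j)
    hit <- candidates[candidates %in% dict]
    if (length(hit) == 0L) return(character(0))  # no dictionary word ends here
    words <- c(hit[1L], words)                   # keep the longest match
    j <- j - nchar(hit[1L])
  }
  words
}

# Toy dictionary, assumed for illustration only.
dict <- c("baby", "bag", "bad", "shelter", "hampton", "family",
          "guide", "book", "guidebook")

split_with_dict("babybag", dict)
split_with_dict("hamptonfamilyguidebook", dict)
```

Because the longest match wins, "guidebook" is again returned as a single word even though "guide" and "book" are both in the dictionary, which is the ambiguity mentioned above.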
I have the following data frame:
> coc_comp_model[1:3,]
Relationship Output Input |r-Value| Y-Intercept Gradient
1 DG-r ~ DG-cl DG-r DG-cl 0.8271167 0.0027217513 12.9901380
2 CA3-r ~ CA3-cl CA3-r CA3-cl 0.7461309 0.0350767684 27.6107963
3 CA2-r ~ CA2-cl CA2-r CA2-cl 0.9732584 -0.0040992226 35.8299582
I want to create a simple function for each row of the data frame. Here's what I've tried:
for(i in 1:nrow(coc_comp_model)) {
coc_glm_f[i] <- function(x)
x*coc_comp_model$Gradient[i] + coc_comp_model$Y-Intercept[i]
}
I also tried making a vector of functions, which does not work either.
Thanks for reading this/helping.
Something like this:
myfunc<-function(datrow, x){
x*as.numeric(datrow[6]) + as.numeric(datrow[5] )
}
Then you can use apply to call it on each row, changing x as desired:
apply(coc_comp_model, 1, myfunc, x = 0.5)
Note: using dput() to share your data is much easier than pasting in a subset.
There is no such thing as a vector of functions. There are 6 atomic vector types in R: raw, logical, integer, double, complex, and character, plus there is the heterogeneous list type, and finally there is the lesser known expression type, which is basically a vector of parse trees (such as you get from a call to the substitute() function). Those are all the vector types in R.
printAndType <- function(x) { print(x); typeof(x); };
printAndType(as.raw(1:3));
## [1] 01 02 03
## [1] "raw"
printAndType(c(T,F));
## [1] TRUE FALSE
## [1] "logical"
printAndType(1:3);
## [1] 1 2 3
## [1] "integer"
printAndType(as.double(1:3));
## [1] 1 2 3
## [1] "double"
printAndType(c(1i,2i,3i));
## [1] 0+1i 0+2i 0+3i
## [1] "complex"
printAndType(letters[1:3]);
## [1] "a" "b" "c"
## [1] "character"
printAndType(list(c(T,F),1:3,letters[1:3]));
## [[1]]
## [1] TRUE FALSE
##
## [[2]]
## [1] 1 2 3
##
## [[3]]
## [1] "a" "b" "c"
##
## [1] "list"
printAndType(expression(a+1,sum(1,2+3*4),if (T) 1 else 2));
## expression(a + 1, sum(1, 2 + 3 * 4), if (T) 1 else 2)
## [1] "expression"
If you want to store multiple functions in a single object, you have to use a list, and you must use the double-bracket indexing operator in the lvalue to assign to it:
fl <- list();
for (i in 1:3) fl[[i]] <- (function(i) { force(i); function(a) a+i; })(i);
fl;
## [[1]]
## function (a)
## a + i
## <environment: 0x600da11a0>
##
## [[2]]
## function (a)
## a + i
## <environment: 0x600da1ab0>
##
## [[3]]
## function (a)
## a + i
## <environment: 0x600da23f8>
sapply(fl,function(f) environment(f)$i);
## [1] 1 2 3
sapply(fl,function(f) f(3));
## [1] 4 5 6
In the above code I also demonstrate the proper way to close over a loop variable. This requires creating a temporary function evaluation environment to hold a copy of i; the returned function then closes over that evaluation environment so that it can access the iteration-specific i. The same holds for other languages that support dynamic functions and closures, such as JavaScript. In R there is an additional requirement: the promise must be forced via force(). Otherwise, for each generated function independently, the promise would not be resolved until the first evaluation of that particular generated function, which would at that time lock in the then-current value of the promise target (the global i variable in this case). It should also be mentioned that this is an extremely wasteful design: every iteration generates and evaluates a temporary function, which creates a new evaluation environment with a copy of the loop variable.
If you wanted to use this design then your code would become:
coc_glm_f <- list();
for (i in 1:nrow(coc_comp_model)) {
coc_glm_f[[i]] <- (function(i) { force(i); function(x) x*coc_comp_model$Gradient[i] + coc_comp_model$`Y-Intercept`[i]; })(i);
};
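The same closure idea can also be written with a named function factory and Map() instead of an explicit loop. This is a sketch only: make_line and model are hypothetical names, and model is a cut-down stand-in for coc_comp_model with just the two columns that matter here. The factory captures the slope and intercept values themselves, so the generated functions no longer depend on the data frame.

```r
# Function factory: returns a function of x with slope and intercept fixed.
make_line <- function(slope, intercept) {
  force(slope); force(intercept)   # resolve the promises immediately
  function(x) x * slope + intercept
}

# Cut-down stand-in for coc_comp_model (first two rows only).
model <- data.frame(Gradient = c(12.9901380, 27.6107963),
                    `Y-Intercept` = c(0.0027217513, 0.0350767684),
                    check.names = FALSE)

# One function per row, stored in a list.
fs <- Map(make_line, model$Gradient, model$`Y-Intercept`)

fs[[1]](1)   # 12.9901380 * 1 + 0.0027217513
fs[[2]](1)   # 27.6107963 * 1 + 0.0350767684
```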
However, it probably doesn't make sense to create a separate function for every row of the data.frame. If you intended the x parameter to take a scalar value (by which I mean a one-element vector), then you can define the function as follows:
coc_glm_f <- function(x) x*coc_comp_model$Gradient + coc_comp_model$`Y-Intercept`;
This function is vectorized, meaning you can pass a vector for x, where each element of x would correspond to a row of coc_comp_model. For example:
coc_comp_model <- data.frame(Relationship=c('DG-r ~ DG-cl','CA3-r ~ CA3-cl','CA2-r ~ CA2-cl'),Output=c('DG-r','CA3-r','CA2-r'),Input=c('DG-cl','CA3-cl','CA2-cl'),`|r-Value|`=c(0.8271167,0.7461309,0.9732584),`Y-Intercept`=c(0.0027217513,0.0350767684,-0.0040992226),Gradient=c(12.9901380,27.6107963,35.8299582),check.names=F);
coc_glm_f(seq_len(nrow(coc_comp_model)));
## [1] 12.99286 55.25667 107.48578
I have a vector v where each entry is one or more strings (or possibly character(0)) separated by semicolons:
ABC
DEF;ABC;QWE
TRF
character(0)
ABC;GFD
I need to find the indices of the vector which contain "ABC" (1,2,5 or a logical vector T,T,F,F,T) after splitting on ";"
I am currently using a loop as follows:
toSelect=integer(0)
for(i in c(1:length(v))){
if(length(v[i])==0) next
words=strsplit(v[i],";")[[1]]
if(!is.na(match("ABC",words))) toSelect=c(toSelect,i)
}
Unfortunately, my vector has 450k entries, so this takes far too long. I would prefer to create a logical vector by doing something like
toSelect=!is.na(match("ABC",strsplit(v,";")))
But since strsplit returns a list, I can't find a way to properly format strsplit(v,";") as a vector (unlist won't do since it would ruin the indices). Does anybody have any ideas on how to speed up this code?
Thanks!
Use regular expressions:
v = list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD")
grep("(^|;)ABC($|;)", v)
#[1] 1 2 5
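The pattern only matches ABC when it is a complete field: (^|;) requires the start of the string or a semicolon before it, and ($|;) requires the end of the string or a semicolon after it. A short sketch of why that matters, using a made-up entry "XABC;GFD" that is not in the question, and grepl to get the logical vector asked for:

```r
# "XABC;GFD" is an invented entry that contains ABC only as part of a longer field.
v2 <- c("ABC", "DEF;ABC;QWE", "TRF", "XABC;GFD", "ABC;GFD")

# Anchored pattern: ABC must be a whole field.
grepl("(^|;)ABC($|;)", v2)
#[1]  TRUE  TRUE FALSE FALSE  TRUE

# Plain substring search also hits "XABC".
grepl("ABC", v2, fixed = TRUE)
#[1]  TRUE  TRUE FALSE  TRUE  TRUE
```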
The tricky part is dealing with character(0), which @BlueMagister fudges by replacing it with character(1) (this allows use of a vector, but doesn't allow representation of the original problem). Perhaps
v <- list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD")
v[sapply(v, length) != 0] <- strsplit(unlist(v), ";", fixed=TRUE)
to do the string split. One might proceed in base R, but I'd recommend the IRanges package
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("IRanges")
to install, then
library(IRanges)
w = CharacterList(v)
which gives a list-like structure where all elements must be character vectors.
> w
CharacterList of length 5
[[1]] ABC
[[2]] DEF ABC QWE
[[3]] TRF
[[4]] character(0)
[[5]] ABC GFD
One can then do fun things like ask "are element members equal to ABC"
> w == "ABC"
LogicalList of length 5
[[1]] TRUE
[[2]] FALSE TRUE FALSE
[[3]] FALSE
[[4]] logical(0)
[[5]] TRUE FALSE
or "are any element members equal to ABC"
> any(w == "ABC")
[1] TRUE TRUE FALSE FALSE TRUE
This will scale very well. For operations not supported "out of the box", the strategy (computationally cheap) is to unlist then transform to an equal-length vector then relist using the original CharacterList as a skeleton, for instance to use reverse on each member:
> relist(reverse(unlist(w)), w)
CharacterList of length 5
[[1]] CBA
[[2]] FED CBA EWQ
[[3]] FRT
[[4]] character(0)
[[5]] CBA DFG
As @eddi points out, this is slower than grep. The motivation is (a) to avoid needing to formulate complicated regular expressions while (b) gaining flexibility for other operations one might like to do on data structured like this.
Using strsplit with sapply and %in%:
v <- c("ABC","DEF;ABC;QWE","TRF",character(1),"ABC;GFD")
sapply(strsplit(v,";"),function(x) "ABC" %in% x)
#[1] TRUE TRUE FALSE FALSE TRUE
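If the data really is a list containing character(0), as described in the question, a variant of the same idea can handle it without substituting character(1). This is a sketch under that assumption; has_abc is a name chosen for the example, and vapply is used so the result is guaranteed to be a logical vector.

```r
v <- list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD")

# Split each element on ";" and test membership.
# For character(0), strsplit() returns an empty list, so %in% yields FALSE.
has_abc <- vapply(
  v,
  function(s) "ABC" %in% unlist(strsplit(s, ";", fixed = TRUE)),
  logical(1)
)

has_abc
which(has_abc)
```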
I am currently trying to check whether a list (containing multiple vectors filled with values) contains an element equal to a given vector. Unfortunately, the following functions did not work for me: match(), any(), %in%. An example of what I am trying to achieve is given below:
Let's say:
lists=list(c(1,2,3,4),c(5,6,7,8),c(9,7))
vector=c(1,2,3,4)
answer=match(lists,vector)
When I execute this, it returns False values instead of a positive result. Comparing a vector with a vector works, but comparing a vector with a list does not seem to work properly.
I would use intersect, something like this:
lapply(lists,intersect,vector)
[[1]]
[1] 1 2 3 4
[[2]]
numeric(0)
[[3]]
numeric(0)
I'm not completely sure what you want the result to be (for example do you care about vector order?) but regardless you'll need to think about lapply. For example,
##Create some data
R> lists=list(c(1,2,3,4),c(5,6,7,8),c(9,7))
R> vector=c(1,2,3,4)
then we use lapply to go through each list element and apply a function. In this case, I've used the match function (since you mentioned that in your question):
R> lapply(lists, function(i) all(match(i, vector)))
[[1]]
[1] TRUE
[[2]]
[1] NA
[[3]]
[1] NA
It's probably worth converting to a vector, so
R> unlist(lapply(lists, function(i) all(match(i, vector))))
[1] TRUE NA NA
and to change NA to FALSE, something like:
m = unlist(lapply(lists, function(i) all(match(i, vector))))
m[is.na(m)] = FALSE
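Depending on what "equal" should mean, the test inside lapply can be made more direct so that it returns TRUE or FALSE without the NA cleanup step. The sketch below shows three possible readings; target is used instead of vector to avoid masking the base function, and which reading is wanted is an assumption only the asker can settle.

```r
lists  <- list(c(1, 2, 3, 4), c(5, 6, 7, 8), c(9, 7))
target <- c(1, 2, 3, 4)

# Exactly the same vector: same values, same order, same type.
exact <- vapply(lists, identical, logical(1), y = target)

# Same set of values, order ignored.
same_set <- vapply(lists, setequal, logical(1), y = target)

# Every value of the list element occurs somewhere in target.
contained <- vapply(lists, function(el) all(el %in% target), logical(1))

exact
which(exact)
```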