In RHadoop, when we make the results readable, it will use the code:
results <- data.frame(words=unlist(lapply(Output_data,"[[",1)), count
=unlist(lapply(Output_data,"[[",2)))
but what does lapply(Output_data,"[[",1)mean? especially the "[[" and '1' in lapply.
The syntax of extracting list elements with [ or [[ is often used in R. It is not specific to any packages. The meaning of the syntax
lapply(Output_data,"[[",1)
is loop through the object 'Output_data' and extract ([[) the first element. So, if the 'Output_data' is a list of data.frames, it will extract the first column of the data.frame and if it is a list of vectors, it extracts the first elements of vector. It does similar functionality as an anonymous function does i..e
lapply(Output_data, function(x) x[[1]])
The latter syntax is more clear and easier to understand but the former is compact and a bit more stylish...
More info about the [[ can be found in ?Extract
Operators like [[ , [ and -> are actually functions.
list[[1]]
is equal to
`[[`(list,1)
In your case, lapply(Output_data,"[[",1)means to extract the first value of every column (or sublist) of Output_data. And the 1 is a argument passed to [[ function.
Related
If you load the pracma package into the r console and type
gammainc(2,2)
you get
lowinc uppinc reginc
0.5939942 0.4060058 0.5939942
This looks like some kind of a named tuple or something.
But, I can't work out how to extract the number below the lowinc, namely 0.5939942. The code (gammainc(2,2))[1] doesn't work, we just get
lowinc
0.5939942
which isn't a number.
How is this done?
As can be checked with str(gammainc(2,2)[1]) and class(gammainc(2,2)[1]), the output mentioned in the OP is in fact a number. It is just a named number. The names used as attributes of the vector are supposed to make the output easier to understand.
The function unname() can be used to obtain the numerical vector without names:
unname(gammainc(2,2))
#[1] 0.5939942 0.4060058 0.5939942
To select the first entry, one can use:
unname(gammainc(2,2))[1]
#[1] 0.5939942
In this specific case, a clearer version of the same might be:
unname(gammainc(2,2)["lowinc"])
Double brackets will strip the dimension names
gammainc(2,2)[[1]]
gammainc(2,2)[["lowinc"]]
I don't claim it to be intuitive, or obvious, but it is mentioned in the manual:
For vectors and matrices the [[ forms are rarely used, although they
have some slight semantic differences from the [ form (e.g. it drops
any names or dimnames attribute, and that partial matching is used for
character indices).
The partial matching can be employed like this
gammainc(2, 2)[["low", exact=FALSE]]
In R vectors may have names() attribute. This is an example:
vector <- c(1, 2, 3)
names(vector) <- c("first", "second", "third")
If you display vector, you should probably get desired output:
vector
> vector
first second third
1 2 3
To ensure what type of output you get after the function you can use:
class(your_function())
I hope this helps.
I am trying to accomplish the same goal as is resolved in this question, but I want to filter the table by two grep statements. When I try this:
DT[grep("word1", column) | grep("word2", column)]
I get this error:
Warning message:
In grep("word1", column) | grep("word2", column) :
longer object length is not a multiple of shorter object length
And when I try to combine this logic with an assigment := in the j argument of the data.table, I get all kinds of weirdness. Basically, it's apparent that the OR operator | doesn't work with grep's in the i argument of a data.table.
I came up with a messy workaround:
DT.a <- DT[grep("word1", column)]
DT.b <- DT[grep("word2", column)]
DT.all <- rbind(DT.a,DT.b)
but I'm hoping there's a better way to accomplish this goal. Any ideas?
The issue here turned out to be a combination of function choice and syntax in the placement of the OR operator |. DT[grep("word1", column) | grep("word2", column)] is confusing to data.table because each grep() returns vectors of indices (integers), which can be of different lengths depending on the data, and the data.table package isn't built to handle this sort of input. grepl() is a more appropriate function to use here because it returns a boolean of whether there is a regex match or not, and the OR operator | should be placed within the regex pattern string.
Solution:
DT[grepl("word1|word2", column)]
I'm not sure that I understand the different outputs in these two scenarios:
(1)
pioneers <- c("GAUSS:1777", "BAYES:1702", "PASCAL:1623", "PEARSON:1857")
split <- strsplit(pioneers, split = ":")
split
(2)
pioneers <- c("GAUSS:1777", "BAYES:1702", "PASCAL:1623", "PEARSON:1857")
split <- lapply(pioneers, strsplit, split = ":")
split
In both cases, the output is a list but I'm not sure when I'd use the one notation (simply applying a function to a vector) or the other (using lapply to loop the function over the vector).
Thanks for the help.
Greg
To me it's to do with how the output is returned. [l]apply stands for list apply - i.e. the output is returned as a list. strsplit already returns a list as, if there were multiple :s in your pioneers vector, it's the only data structure that makes sense - i.e. a list element of each of the 4 elements of the vector and each list element contains a vector of the split string.
So using lapply(x, strsplit, ...) will always return a list inside a list, which you probably don't want in this case.
Using lapply is useful in cases where you expect the result of the function you're applying to be a vector of an undefined or variable length. As strsplit can see this coming already, the use of lapply is redundant, so you should probably know what form you expect/want your answer to be in, and use the appropriate functions to coerce the output in to the right data structure.
To make clear, the output of the examples you gave is not the same. One is a list, one is a list of lists. The identical result would be
lapply(pioneers, function(x, split) strsplit(x, split)[[1]], split = ":")
i.e. taking the first list element of the inner list (which is only 1 element anyway) in each case.
I have been working with R for about 2 months and have had a little bit of trouble getting a hold of how the $ and %% terms.
I understand I can use the $ term to pull a certain value from a function (e.g. t.test(x)$p.value), but I'm not sure if this is a universal definition. I also know it is possible to use this to specify to pull certain data.
I'm also curious about the use of the %% term, in particular, if I am placing a value in between it (e.g. %x%) I am aware of using it as a modulator or remainder e.g. 7 %% 5 returns 2. Perhaps I am being ignorant and this is not real?
Any help or links to literature would be greatly appreciated.
Note: I have been searching for this for a couple hours so excuse me if I couldn't find it!
You are not really pulling a value from a function but rather from the list object that the function returns. $ is actually an infix that takes two arguments, the values preceding and following it. It is a convenience function designed that uses non-standard evaluation of its second argument. It's called non-standard because the unquoted characters following $ are first quoted before being used to extract a named element from the first argument.
t.test # is the function
t.test(x) # is a named list with one of the names being "p.value"
The value can be pulled in one of three ways:
t.test(x)$p.value
t.test(x)[['p.value']] # numeric vector
t.test(x)['p.value'] # a list with one item
my.name.for.p.val <- 'p.value'
t.test(x)[[ my.name.for.p.val ]]
When you surround a set of characters with flanking "%"-signs you can create your own vectorized infix function. If you wanted a pmax for which the defautl was na.rm=TRUE do this:
'%mypmax%' <- function(x,y) pmax(x,y, na.rm=TRUE)
And then use it without quotes:
> c(1:10, NA) %mypmax% c(NA,10:1)
[1] 1 10 9 8 7 6 7 8 9 10 1
First, the $ operator is for selecting an element of a list. See help('$').
The %% operator is the modulo operator. See help('%%').
The '$' operator is used to select particular element from a list or any other data component which contains sub data components.
For example: data is a list which contains a matrix named MATRIX and other things too.
But to get the matrix we write,
Print(data$MATRIX)
The %% operator is a modulus operator ; which provides the remainder.
For example: print(7%%3)
Will print 1 as an output
I have a data frame with a column called listA, and a listB. I want to pull out only those rows in the data frame which match to an entry in listB, so I have:
newData <- mydata[mydata$listA %in% listB,]
However, some entries of listA are in the format "ABC /// DEF", where both ABC and DEF are possible entries in listB.
I want to pull out the rows of the data frame which have a listA for which any of the words match to an entry in listB. So if listB had "ABC" in it, that entry would be included in newData. I found the strsplit function, but things like
strsplit(mydata$listA," ") %in% listB
always returns FALSE, presumably because it's checking if the whole list returned by strsplit is an entry in listB.
match(word_vector, target_vector) allows both arguments to be vectors, which is what you want (note: that's vectors, not lists). In fact, %in% operator is a synonym for match(), as its help tells you.
But stringi package's methods stri_match_* may well directly do what you want, are all vectorized, and are way more performant than either match() or strsplit():
stri_match_all stri_match_all_regex stri_match_first stri_match_first_regex stri_match_last stri_match_last_regex
Also, you probably won't need to use an explicit split function, but if you must, then use stringi::stri_split_*(), avoid base::strsplit()
Note on performance: avoid splitting strings like the plague in R whenever possible, it creates memory leaks via unnecessary conscells, as gc() will show you. That's yet another reason why stringi is very efficient.