When using fcoalesce, is there any way I can retrieve the indices or names of the selected vectors?
Here is a simplified two-vector example, for the following coalesce of vectors a and b:
library(data.table)
a = c(NA,2,3,4,NA)
b = c(1,3,3,4,5)
fcoalesce(a,b)
[1] 1 2 3 4 5
I'd like to see something like this:
b,a,a,a,b
A real life example could have any number of vectors.
We can use ifelse. For two vectors, coalesce simply takes the first non-NA value at each position, so build a logical condition on the NA elements of a and pass the object names as the 'yes' and 'no' values:
ifelse(is.na(a), 'b', 'a')
[1] "b" "a" "a" "a" "b"
I managed to solve it by merging all vectors into a data.table (dt_combined) and finding, row by row, the index of the first non-NA column:
apply(dt_combined, 1, function(i){
(1:length(dt_combined))[ which(!is.na(i))[1] ]
})
One could also get the column names instead of the column index:
apply(dt_combined, 1, function(i){
colnames(dt_combined)[ which(!is.na(i))[1] ]
})
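The same row-wise idea can be wrapped in a small base R helper that accepts any number of named vectors. This is only a sketch: first_non_na_source is a hypothetical name, it assumes all vectors have the same length and are passed with names, and it returns NA where every vector is NA.

```r
# Hypothetical helper: for each position, the name of the first vector
# (in argument order) that holds a non-NA value.
first_non_na_source <- function(...) {
  vecs <- list(...)
  # Logical matrix, one column per input vector, TRUE where a value is present
  present <- !is.na(do.call(cbind, vecs))
  # Index of the first TRUE in each row; NA when the whole row is NA
  idx <- apply(present, 1, function(r) which(r)[1])
  names(vecs)[idx]
}

a <- c(NA, 2, 3, 4, NA)
b <- c(1, 3, 3, 4, 5)
res_two <- first_non_na_source(a = a, b = b)
res_two
# [1] "b" "a" "a" "a" "b"

# Three vectors, with a position where all of them are NA
res_three <- first_non_na_source(x = c(NA, NA, 3), y = c(NA, 2, 9), z = c(NA, 8, 7))
res_three
# [1] NA  "y" "x"
```

The argument order decides priority, which mirrors how fcoalesce(a, b) prefers a over b.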
I have a simple question, but I can't figure out a simple solution:
library(data.table)
plouf <- data.table(1:10,letters[1:10])
plouf[V1 %in% c(3,1),V2]
[1] "a" "c"
I would like the output to keep the order of the subsetting vector, i.e. "c" "a". What are the possibilities?
I have
sapply(c(3,1),function(x){plouf[V1 == x,V2]})
but I find it ugly.
edit
I have
setkey(plouf,V1)
plouf[c(3,1),V2]
which is surely the right way for data.table.
Still, I am curious about other solutions.
Here is one option with match that can be used in data.table and in base R as well. Unlike %in%, match returns the position index of the first match and this can be used to get the corresponding elements of the other column 'V2'
plouf[, V2[match(c(3, 1), V1)]]
#[1] "c" "a"
plouf[, match(c(3, 1), V1)] # returns numeric index
#[1] 3 1
plouf[, V1 %in% c(3, 1)] # returns logical vector
#[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Because %in% returns a logical vector, using it to extract elements returns the elements at each TRUE position in their original order, i.e. it extracts from the 1st and 3rd positions instead of the 3rd and then the 1st.
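One detail worth knowing about match: a lookup value that is not present yields NA rather than being dropped silently. The sketch below uses a plain data.frame (plouf_df and wanted are illustrative names, not from the question) so that it runs in base R; the same match call works inside a data.table's j.

```r
# Base R stand-in for the data.table in the question
plouf_df <- data.frame(V1 = 1:10, V2 = letters[1:10], stringsAsFactors = FALSE)

wanted <- c(3, 1, 42)               # 42 does not occur in V1
pos <- match(wanted, plouf_df$V1)   # position of the first match, NA if absent
pos
# [1]  3  1 NA

with_na <- plouf_df$V2[pos]         # order of 'wanted' is kept, NA marks the miss
with_na
# [1] "c" "a" NA

found_only <- plouf_df$V2[pos[!is.na(pos)]]   # drop the misses if they are not wanted
found_only
# [1] "c" "a"
```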
Using data.table keys will accomplish what you're going for here; the "Keys and fast binary search based subset" vignette explains the usage.
library(data.table)
plouf <- data.table(1:10,letters[1:10])
## Set a key
setkey(plouf,V1)
## Use .() syntax for key subsetting to get associated values of V2
plouf[.(c(3,1)),V2]
#[1] "c" "a"
I'm trying to check if a specific value is anywhere in a data frame.
I know the %in% operator should allow me to do this, but it doesn't seem to work the way I would expect when applying to a whole data frame:
A = data.frame(B=c(1,2,3,4), C=c(5,6,7,8))
1 %in% A
[1] FALSE
But if I apply this to the specific column the value is in it works the way I expect:
1 %in% A$B
[1] TRUE
What is the proper way of checking if a value is anywhere in a data frame?
You could do:
any(A==1)
#[1] TRUE
OR with Reduce:
Reduce("|", A==1)
OR
length(which(A==1))>0
OR
is.element(1,unlist(A))
To find the location of that value you can do, for example:
which(A == 1, arr.ind=TRUE)
# row col
#[1,] 1 1
Or simply
sum(A == 1) > 0
#[1] TRUE
Loop through the variables with sapply, then use any.
any(sapply(A, function(x) 1 %in% x))
[1] TRUE
or following digEmAll's comment, you could use unlist, which takes a list (data.frame) and returns a vector.
1 %in% unlist(A)
[1] TRUE
The trick to understanding why your first attempt doesn't work comes down to understanding what a data frame is: a list of vectors of equal length. With %in% on the whole data frame you are checking whether the value matches one of those vectors as a whole, whereas what you want is to check whether any of the values inside those vectors matches.
Try:
any(A == 1)
This returns TRUE or FALSE.
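A caveat for the any(A == 1) approach: if the data frame contains NA and the value is not found, any returns NA instead of FALSE. A minimal sketch, where contains_value and A_na are hypothetical names introduced for illustration:

```r
A <- data.frame(B = c(1, 2, 3, 4), C = c(5, 6, 7, 8))

# Hypothetical wrapper: TRUE if 'value' occurs anywhere in 'df', ignoring NAs
contains_value <- function(df, value) {
  any(df == value, na.rm = TRUE)
}

found <- contains_value(A, 1)
found
# [1] TRUE

# With missing data and no match, plain any() gives NA
A_na <- data.frame(B = c(NA, 2), C = c(5, 6))
plain <- any(A_na == 1)
plain
# [1] NA
safe <- contains_value(A_na, 1)
safe
# [1] FALSE

# Which columns hold the value: apply %in% column by column
cols <- names(A)[sapply(A, function(col) 1 %in% col)]
cols
# [1] "B"
```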
Suppose I have a named list like
somelist <- list(a = 1, b = 5, c = 3)
I know that I can drop somelist$b, say, by assigning NULL to it:
somelist$b <- NULL
I suppose this is fine for interactive work, but not so much for programmatic work, because it forces the creation of otherwise superfluous variables.
For example, suppose that foo(42) evaluates to a list similar to somelist above, and that I want to pass the list resulting from dropping the b element from foo(42) to some other function bar. In this case, applying the method shown above would require the following:
superfluous.variable <- foo(42)
superfluous.variable$b <- NULL
bar(superfluous.variable)
rm(superfluous.variable)
I'm looking for a way to pass to bar the modified results from foo that does not require these superfluous assignments. The four lines above would collapse to a single line:
bar(drop.item.from.list(foo(42), item.to.drop = "b"))
Does R already have something like the hypothetical drop.item.from.list function above?
You can do that removal on the fly with replace()
replace(somelist, "b", NULL)
# $a
# [1] 1
#
# $c
# [1] 3
It works for multiple variables as well ...
replace(somelist, c("a", "b"), NULL)
# $c
# [1] 3
So just wrap that in bar() and the original list remains intact.
Note: I am not exactly sure what you are doing with foo(42) but you state that the resulting list takes a similar structure, so this should be fine for that.
We can try with setdiff, assuming foo(42) returns a list with the same names as somelist
bar(foo(42)[setdiff(names(somelist), "b")])
as the setdiff keeps every name except "b", shown here on 'somelist'
somelist[setdiff(names(somelist), "b")]
#$a
#[1] 1
#$c
#[1] 3
We can also use this to subset for multiple variables
somelist[setdiff(names(somelist), c("a", "b"))]
#$c
#[1] 3
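The setdiff idea can be packaged as the kind of function the question asks for, taking the names from the list it receives instead of from somelist. This is a sketch under assumptions: drop_items is a hypothetical name, the list is assumed to be fully named, and foo and bar below are stand-ins invented only to make the example runnable.

```r
# Hypothetical helper: return 'x' without the elements named in 'items'
drop_items <- function(x, items) {
  x[setdiff(names(x), items)]
}

# Stand-ins for the question's foo() and bar(); both are assumptions
foo <- function(n) list(a = 1, b = 5, c = n)
bar <- function(lst) names(lst)

kept <- bar(drop_items(foo(42), "b"))
kept
# [1] "a" "c"

somelist <- list(a = 1, b = 5, c = 3)
only_c <- drop_items(somelist, c("a", "b"))
only_c
# $c
# [1] 3
```

Because subsetting returns a copy, somelist itself is left unchanged and no intermediate variable is needed.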
I am trying to use R to find a vector within a matrix within a list within a list. I tried to check whether the vector 'ab' exists using the following 'exists' calls, but none of them work. How can I make it work?
aa <- list(x = matrix(1,2,3), y = 4, z = 3)
colnames(aa$x) <- c('ab','bb','cb')
aa
#$x
# ab bb cb
#[1,] 1 1 1
#[2,] 1 1 1
#
#$y
#[1] 4
#
#$z
#[1] 3
exists('ab', where=aa)
#[1] FALSE
exists('ab', where=aa$x)
# Error in exists("ab", where = aa$x) : invalid 'envir' argument
exists('ab', where=colnames(aa$x))
# Error in as.environment(where) : no item called "ab" on the search list
colnames(aa$x)
#[1] "ab" "bb" "cb"
Column names are an attribute of a matrix or data.frame. So we loop over the list using sapply, get the column names (colnames), unlist, and check whether 'ab' is in the resulting vector
'ab' %in% unlist(sapply(aa, colnames))
#[1] TRUE
If we want to be more specific for a particular list element, we extract the element (aa$x), get the column names and check whether 'ab' is among them.
'ab' %in% colnames(aa$x)
#[1] TRUE
Another option is to loop through 'aa' and, if the element is a matrix, extract the 'ab' column and check whether it is a vector. Wrap the sapply in any to get a single TRUE/FALSE output. Note that this errors if a matrix in the list has no 'ab' column.
any(sapply(aa, function(x) if(is.matrix(x)) is.vector(x[, 'ab']) else FALSE))
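As a variation on the colnames check above, the sketch below reports which list elements carry an 'ab' column, and shows why exists does not help: exists looks up object names, and with where = aa those are the list's element names (x, y, z), not the column names inside aa$x. The name has_ab is introduced here for illustration.

```r
aa <- list(x = matrix(1, 2, 3), y = 4, z = 3)
colnames(aa$x) <- c("ab", "bb", "cb")

# colnames() is NULL for plain vectors, so %in% is simply FALSE for y and z
has_ab <- vapply(aa, function(el) "ab" %in% colnames(el), logical(1))
has_ab
#     x     y     z
#  TRUE FALSE FALSE

holders <- names(aa)[has_ab]
holders
# [1] "x"

# exists() only sees the element names of the list
exists("x", where = aa, inherits = FALSE)
exists("ab", where = aa, inherits = FALSE)
```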
How do I reference the row number of an observation? For example, if you have a data.frame called "data" and want to create a variable data$rownumber equal to each observation's row number, how would you do it without using a loop?
These are present by default as rownames when you create a data.frame.
R> df = data.frame('a' = rnorm(10), 'b' = runif(10), 'c' = letters[1:10])
R> df
a b c
1 0.3336944 0.39746731 a
2 -0.2334404 0.12242856 b
3 1.4886706 0.07984085 c
4 -1.4853724 0.83163342 d
5 0.7291344 0.10981827 e
6 0.1786753 0.47401690 f
7 -0.9173701 0.73992239 g
8 0.7805941 0.91925413 h
9 0.2469860 0.87979229 i
10 1.2810961 0.53289335 j
and you can access them via the rownames command.
R> rownames(df)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
If you need them as numbers, coerce them with as.numeric, as in as.numeric(rownames(df)).
You don't need to add them as a column: if you know what you are looking for (say the item where df$c == 'i'), you can use the which command:
R> which(df$c =='i')
[1] 9
or if you don't know the column
R> which(df == 'i', arr.ind=TRUE)
row col
[1,] 9 3
you may access the element using df[9, 'c'], or df$c[9].
If you do want to add them, you could use df$rownumber <- as.numeric(rownames(df)), though this is less robust than df$rownumber <- 1:nrow(df): if rownames have been assigned, they will no longer be the default index numbers. (The which command will continue to return index numbers even if you do assign to rownames.)
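To illustrate the robustness point, here is a small sketch (letters_df, sub and empty are illustrative names) showing that rownames survive subsetting, so they stop matching the row positions. It also uses seq_len(nrow(...)) as a variant of 1:nrow(...) that behaves sensibly for a data frame with zero rows.

```r
letters_df <- data.frame(c = letters[1:5], stringsAsFactors = FALSE)

# After subsetting, the original rownames are carried along
sub <- letters_df[letters_df$c %in% c("b", "d"), , drop = FALSE]
from_names <- as.numeric(rownames(sub))
from_names
# [1] 2 4
from_position <- seq_len(nrow(sub))
from_position
# [1] 1 2

# Zero-row case: 1:nrow() counts down from 1 to 0, seq_len() is empty
empty <- letters_df[0, , drop = FALSE]
colon_version <- 1:nrow(empty)
colon_version
# [1] 1 0
length(seq_len(nrow(empty)))
# [1] 0
```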
Simply:
data$rownumber = 1:nrow(data)
Perhaps with dataframes, one of the easiest and most practical solutions is:
data = dplyr::mutate(data, rownum = dplyr::row_number())
This is probably the simplest way:
data$rownumber = 1:dim(data)[1]
It's probably worth noting that if you want to select a row by its row index, you can do this with simple bracket notation
data[3,]
vs.
data[data$rownumber==3,]
So I'm not really sure what this new column accomplishes.