I would like to know if there is an efficient way to check which values in a column of my data frame are powers of two.
My data frame has 6 columns, and one of them holds the values I want to test for being powers of 2.
> class(df$doubling_times)
[1] "numeric"
log2(x) %% 1 == 0 checks whether the base-2 logarithm of the number is an integer (i.e. whether the remainder after dividing the log by 1 is zero).
> x = 1:10
> data.frame(x, power2 = log2(x) %% 1 == 0)
x power2
1 1 TRUE
2 2 TRUE
3 3 FALSE
4 4 TRUE
5 5 FALSE
6 6 FALSE
7 7 FALSE
8 8 TRUE
9 9 FALSE
10 10 FALSE
The above should work, but a safer approach would allow for floating point accuracy issues, and might be something like this:
remainder = log2(x) %% 1
tol = 1e-12 # tolerance
power2 = abs(remainder - round(remainder)) < tol
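If the values are whole numbers within R's integer range, a bit-trick alternative (not from the answer above, just a common idiom) avoids log2() and its floating-point issues entirely: a positive integer is a power of two exactly when it has a single bit set, i.e. when bitwAnd(x, x - 1) is zero.
x <- 1:10
# bitwAnd() needs whole numbers no larger than .Machine$integer.max;
# guard against zero and negatives, which are never powers of two
power2 <- x > 0 & bitwAnd(as.integer(x), as.integer(x) - 1L) == 0
data.frame(x, power2)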
My question is probably quite basic but I've been struggling with it so I'd be really grateful if someone could offer a solution.
I have data in the following format:
ORG_NAME  var_1_12  var_1_13  var_1_14
A         12        11        5
B         13        13        11
C         6         7         NA
D         NA        NA        5
I have data on organizations over 5 years, but over that time some organizations have merged and others have disappeared. I'm planning to run a fixed-effects regression, so I need to add a dummy variable that is "0" for organizations that have remained the same (here rows A and B), and "1" in the year before the merge and after the merge. In this case I know that orgs C and D merged, so I would like the data to look like this:
ORG_NAME  var_1_12  dum_12  var_1_13  dum_13
A         12        0       5         0
B         13        0       11        0
C         6         1       NA        1
D         NA        1       5         1
How would I code this?
This approach (as is any, according to your description) is absolutely dependent on the companies being in consecutive rows.
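(For a reproducible example, here is an assumed reconstruction of the question's table as the dat used below.)
dat <- data.frame(ORG_NAME = c("A", "B", "C", "D"),
                  var_1_12 = c(12, 13, 6, NA),
                  var_1_13 = c(11, 13, 7, NA),
                  var_1_14 = c(5, 11, NA, 5))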
mtx <- apply(is.na(dat[,-1]), MARGIN = 2,
function(vec) zoo::rollapply(vec, 2, function(z) xor(z[1], z[2]), fill = FALSE))
mtx
# var_1_12 var_1_13 var_1_14
# [1,] FALSE FALSE FALSE
# [2,] FALSE FALSE TRUE
# [3,] TRUE TRUE TRUE
# [4,] FALSE FALSE FALSE
out <- rowSums(mtx) == ncol(mtx)
out
# [1] FALSE FALSE TRUE FALSE
out | c(FALSE, out[-length(out)])
# [1] FALSE FALSE TRUE TRUE
### and your 0/1 numbers, if logical isn't right for you
+(out | c(FALSE, out[-length(out)]))
# [1] 0 0 1 1
Brief walk-through:
is.na(dat[,-1]) returns a matrix of whether the values (except the first column) are NA; because it's a matrix, we use apply to call a function on each column (using MARGIN=2);
zoo::rollapply is a function that does rolling calculations on a portion ("window") of the vector at a time, in this case 2-wide. For example, if we have 1:5, then it first looks at c(1,2), then c(2,3), then c(3,4), etc.
xor is an eXclusive OR, meaning it is true when exactly one of its arguments is true and the other is false;
mtx is a matrix indicating that a cell and the one below it met the conditions (one is NA, the other is not). We then check to see which of these rows are all true, forming out.
since we need a 1 in both rows of a merge pair, we vector-OR (|) out with a shifted copy of itself to produce your intended output
If I understand correctly, you want to code rows with at least one NA as "1". If so, you just need one dummy variable for all the years, right? Something like this:
set.seed(4)
df <- data.frame(org=as.factor(LETTERS[1:5]),y1=sample(c(1:4,NA),5),y2=sample(c(3:6,NA),5),y3=sample(c(2:5,NA),5))
df$dummy <- as.numeric(apply(df, 1, function(x)any(is.na(x))))
which gives you
org y1 y2 y3 dummy
1 A 3 5 3 0
2 B NA 4 5 1
3 C 4 3 2 0
4 D 1 6 NA 1
5 E 2 NA 4 1
> a <- sample(c(1:10), 20, replace = TRUE)
> a
[1] 6 3 6 2 6 9 3 9 9 8 2 10 7 9 1 5 3 10 5 5
> a[c(TRUE,FALSE)]
[1] 6 6 6 3 9 2 7 1 3 5
Why does a[c(TRUE,FALSE)] give me the odd-indexed elements of my vector? c(TRUE, FALSE) has length 2, and to my mind this should give me just the single index 1, which is TRUE.
Why does it work this way?
Logical subsets are recycled to match the length of the vector (numerical subsets are not recycled).
From help("["):
Arguments
i, j, …
...
For [-indexing only: i, j, … can be logical vectors,
indicating elements/slices to select. Such vectors are recycled if
necessary to match the corresponding extent. i, j, … can also be
negative integers, indicating elements/slices to leave out of the
selection.
When indexing arrays by [ a single argument i can be a matrix with
as many columns as there are dimensions of x; the result is then a
vector with elements corresponding to the sets of indices in each row
of i.
To illustrate, try:
cbind.data.frame(x = 1:10, odd = c(TRUE, FALSE), even = c(FALSE, TRUE))
# x odd even
# 1 1 TRUE FALSE
# 2 2 FALSE TRUE
# 3 3 TRUE FALSE
# 4 4 FALSE TRUE
# 5 5 TRUE FALSE
# 6 6 FALSE TRUE
# 7 7 TRUE FALSE
# 8 8 FALSE TRUE
# 9 9 TRUE FALSE
# 10 10 FALSE TRUE
a[TRUE] gives you all the elements and a[FALSE] gives none. For a[c(TRUE,FALSE)], the length-2 index is recycled to length(a), which is 20, so it effectively becomes TRUE, FALSE, TRUE, FALSE, ..., and you get just the odd indices.
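To make the recycling explicit (a small illustration, not from the answers above, using the a defined earlier):
idx <- rep(c(TRUE, FALSE), length.out = length(a))  # what recycling produces
identical(a[c(TRUE, FALSE)], a[idx])                        # TRUE
identical(a[c(TRUE, FALSE)], a[seq(1, length(a), by = 2)])  # TRUE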
I can't figure this out, despite (supposedly) being rather close. I want to check whether a drug was given in each 4-hour window.
drug start stop
1 A 1 3
2 A 7 10
3 A 11 17
Drug A was started at time 1 and administered up to time 3, then started again at time 7 and given up to time 10, etc.
t1 t2
1 0 4
2 4 8
3 8 12
4 12 16
5 16 20
6 20 24
These are the windows in question.
DATA:
t1 <- c(0,4,8,12,16,20)
t2 <- t1 + 4
chunks <- data.frame(t1=t1,t2=t2)
drug <- "A"
start <- c(1,7,11)
stop <- c(3,10,17)
times <- data.frame(drug,start,stop)
Expected solution
t1 t2 lsg
1 0 4 1
2 4 8 1
3 8 12 1
4 12 16 1
5 16 20 1
6 20 24 0
Attempt at solution
test <- function() {
  n <- 1
  for (row in times) {
    result <- (times$start[n] > chunks$t1 & times$stop[n] < chunks$t2) |
      ((times$start[n] > chunks$t1 & times$start[n] < chunks$t2) &
         (times$stop[n] > chunks$t2 | times$stop[n] < chunks$t2)) |
      (times$start[n] < chunks$t1 & times$stop[n] > chunks$t1)
    n <- n + 1
    print(result)
  }
}
gives
[1] TRUE FALSE FALSE FALSE FALSE FALSE
[1] FALSE TRUE TRUE FALSE FALSE FALSE
[1] FALSE FALSE TRUE TRUE TRUE FALSE
which is correct! The first administration fell into the first time window, the 2nd and 3rd administrations fell into the 2nd and 3rd windows, and so on. But how do I get to the expected solution?
As I said, I feel close, but I don't know how to join the results to the chunks data frame...
The first half of this is @akrun's comment, but expanded to include the prerequisites. (If you come back and answer, I'll happily defer to you ... just giving more details here.) The second half is new (and often overlooked).
data.table
data.table::foverlaps does joins based on overlaps/inequalities (as opposed to base merge and dplyr::*_join, which only operate on strict equalities). One prerequisite for using overlap joins (in addition to both tables being of class data.table) is that the time fields be keyed correctly.
library(data.table)
setDT(times)
setDT(chunks)
# set the keys
setkey(times, start, stop)
setkey(chunks, t1, t2)
# the join
+(!is.na(foverlaps(chunks, times, which = TRUE, mult = 'first')))
# [1] 1 1 1 1 1 0
The function actually returns, for each row in chunks, which row(s) in times it overlaps:
foverlaps(chunks, times, which = TRUE, mult = 'first')
# [1] 1 2 2 3 3 NA
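To attach the flag to the chunks table itself (which was the original ask), one possible way, using the keyed data.tables above:
chunks[, lsg := +(!is.na(foverlaps(chunks, times, which = TRUE, mult = 'first')))]
chunks
#    t1 t2 lsg
# 1:  0  4   1
# 2:  4  8   1
# 3:  8 12   1
# 4: 12 16   1
# 5: 16 20   1
# 6: 20 24   0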
sqldf
data.table is not the only R tool that can do this. The following works on any variant of data.frame (base, data.table, or tbl_df):
library(sqldf)
sqldf("
select c.t1, c.t2,
(case when drug is null then 0 else 1 end) > 0 as n
from chunks c
left join times t on
(t.start between c.t1 and c.t2) or (t.stop between c.t1 and c.t2)
or (c.t1 between t.start and t.stop) or (c.t2 between t.start and t.stop)
group by c.t1, c.t2")
# t1 t2 n
# 1 0 4 1
# 2 4 8 1
# 3 8 12 1
# 4 12 16 1
# 5 16 20 1
# 6 20 24 0
(I don't know if it's possible to reduce the logic of that join, nor if it will mis-behave with other data.)
If you need the count of drugs that occur in each time frame, I think you can use sum(case when ... end) as n.
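For completeness, here is a dependency-free base-R sketch of the same idea (my own addition, not part of the original answer), starting from the plain chunks and times data frames in the question's DATA block. Two intervals overlap when each one starts no later than the other ends, with endpoints treated as inclusive, like the between clauses above.
# for each window, check whether any administration interval overlaps it
chunks$lsg <- +sapply(seq_len(nrow(chunks)), function(i) {
  any(times$start <= chunks$t2[i] & times$stop >= chunks$t1[i])
})
chunks
#   t1 t2 lsg
# 1  0  4   1
# 2  4  8   1
# 3  8 12   1
# 4 12 16   1
# 5 16 20   1
# 6 20 24   0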
Quick question about understanding R's anyDuplicated: when passed a data frame (say, x, y, z columns with 1k observations), will it check whether any row has the exact same x, y, z values as another row in the same data frame? Thanks
I would use duplicated and combine it from front to back.
mydf <- data.frame(x = c(1:3,1,1), y = c(3:5,3,3))
mydf
# x y
# 1 1 3
# 2 2 4
# 3 3 5
# 4 1 3
# 5 1 3
There are three rows that are duplicates of one another: 1, 4, and 5. But duplicated() only marks the later duplicates, not the first occurrence.
duplicated(mydf)
#[1] FALSE FALSE FALSE TRUE TRUE
duplicated(mydf, fromLast = TRUE)
#[1] TRUE FALSE FALSE TRUE FALSE
Using fromLast = TRUE looks from the end toward the front, so the first occurrence is marked as well. By the way, I will ask the R core team to add a unified function that does both.
myduplicates <- duplicated(mydf) | duplicated(mydf, fromLast = TRUE)
Saving the result in a variable allows us to count and subset later.
sum(myduplicates)
#[1] 3
mydf[myduplicates,]
# x y
# 1 1 3
# 4 1 3
# 5 1 3
mydf[!myduplicates,]
# x y
# 2 2 4
# 3 3 5
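To answer the literal question about anyDuplicated(): on a data frame it compares entire rows, and it returns the index of the first row that duplicates an earlier one (or 0 if there is none), so it is a quick existence check rather than a full flagging:
anyDuplicated(mydf)        # row 4 is the first row that repeats an earlier row
# [1] 4
anyDuplicated(mydf[1:3, ]) # no duplicated rows
# [1] 0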
I have a dataset of true values (location) that I'm attempting to compare to a vector of estimated values using dplyr. My code below results in an error message. How do I compare each value of data$location to every value of est.locations and collapse the resulting vector to true if all comparisons are greater than 20?
library(dplyr)
data <- data.frame("num" = 1:10, "location" = runif(10, 0, 1500) %>% sort)
est.locations <- runif(12, 0, 1500) %>% sort
data %>%
mutate(false.neg = (all(abs(location - est.locations) > 20)))
num location false.neg
1 1 453.4281 FALSE
2 2 454.4260 FALSE
3 3 718.0420 FALSE
4 4 801.2217 FALSE
5 5 802.7981 FALSE
6 6 854.2148 FALSE
7 7 873.6085 FALSE
8 8 901.0217 FALSE
9 9 1032.8321 FALSE
10 10 1240.3547 FALSE
Warning message:
In c(...) :
longer object length is not a multiple of shorter object length
The context of the question is dplyr, but I'm open to other suggestions that may be faster. This is a piece of a larger calculation I'm doing on birth-death mcmc chains for 3000 iterations * 200 datasets. (i.e. repeated many times and the number of locations will be different among datasets and for each iteration.)
UPDATE (10/13/15):
I'm going to mark akrun's solution as the answer. A linear algebra approach is a natural fit for this problem and with a little tweaking this will work for calculating both FNR and FPR (FNR should need an (l)apply by iteration, FPR should be one large vector/matrix operation).
JohannesNE's solution points out the issue with my initial approach -- the use of any() reduces the number of rows to a single value, when instead I intended to do this operation row-wise. Which also leads me to think there is likely a dplyr solution using rowwise() and do().
I attempted to limit the scope of the question in my initial post. But for added context, the full problem is on a Bayesian mixture model with an unknown number of components, where the components are defined by a 1D point process. Estimation results in a 'random effects' chain similar in structure to the version of est.locations below. The length mismatch is a result of having to estimate the number of components.
## Clarification of problem
options("max.print" = 100)
set.seed(1)
# True values (number of items and their location)
true.locations <-
data.frame("num" = 1:10,
"location" = runif(10, 0, 1500) %>% sort)
# Mcmc chain of item-specific values ('random effects')
iteration <<- 0
est.locations <-
lapply(sample(10:14, 3000, replace=T), function(x) {
iteration <<- iteration + 1
total.items <- rep(x, x)
num <- 1:x
location <- runif(x, 0, 1500) %>% sort
data.frame(iteration, total.items, num, location)
}) %>% do.call(rbind, .)
print(est.locations)
iteration total.items num location
1 1 11 1 53.92243818
2 1 11 2 122.43662006
3 1 11 3 203.87297671
4 1 11 4 641.70211495
5 1 11 5 688.19477968
6 1 11 6 1055.40283048
7 1 11 7 1096.11595818
8 1 11 8 1210.26744065
9 1 11 9 1220.61185888
10 1 11 10 1362.16553219
11 1 11 11 1399.02227302
12 2 10 1 160.55916378
13 2 10 2 169.66834129
14 2 10 3 212.44257723
15 2 10 4 228.42561489
16 2 10 5 429.22830291
17 2 10 6 540.42659572
18 2 10 7 594.58339156
19 2 10 8 610.53964624
20 2 10 9 741.62600969
21 2 10 10 871.51458277
22 3 13 1 10.88957267
23 3 13 2 42.66629869
24 3 13 3 421.77297967
25 3 13 4 429.95036650
[ reached getOption("max.print") -- omitted 35847 rows ]
You can use sapply (here inside mutate, although that doesn't really take advantage of dplyr).
library(dplyr)
data <- data.frame("num" = 1:10, "location" = runif(10, 0, 1500) %>% sort)
est.locations <- runif(12, 0, 1500) %>% sort
data %>%
mutate(false.neg = sapply(location, function(x) {
all(abs(x - est.locations) > 20)
}))
num location false.neg
1 1 92.67941 TRUE
2 2 302.52290 FALSE
3 3 398.26299 TRUE
4 4 558.18585 FALSE
5 5 859.28005 TRUE
6 6 943.67107 TRUE
7 7 991.19669 TRUE
8 8 1347.58453 TRUE
9 9 1362.31168 TRUE
10 10 1417.01290 FALSE
We can use outer for this kind of comparison: get all combinations of differences between 'location' and 'est.locations', take the abs, compare with 20, negate (!), take rowSums, and negate again, so that a row is TRUE only when all of its elements are greater than 20.
data$false.neg <- !rowSums(!(abs(outer(data$location, est.locations, FUN = '-')) > 20))
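Following the OP's update, here is a sketch of how the same outer() idea could be applied per MCMC iteration (using the true.locations and est.locations objects constructed in the clarified problem above; the name fn.by.iter is just illustrative):
# split estimated locations by iteration, then flag, for each true location,
# whether no estimate falls within 20 units of it (a false negative)
fn.by.iter <- lapply(split(est.locations$location, est.locations$iteration),
                     function(est) {
                       rowSums(abs(outer(true.locations$location, est, FUN = '-')) <= 20) == 0
                     })
fn.by.iter[[1]]  # logical false-negative indicator for iteration 1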