I have a data frame
> (df <- data.frame(Col1=seq(0,24,by=4),x=rnorm(7),y=rnorm(7,50)))
Col1 x y
1 0 -0.107046196 49.96748
2 4 -0.001515573 50.02819
3 8 -1.884417429 49.80308
4 12 1.692774467 50.45827
5 16 -0.907602775 51.14937
6 20 0.166186536 49.17502
7 24 0.420263825 49.56720
and a variable
t=2
and want to find the subset of the data within which it falls (rows 1 and 2 in this example), and then calculate the ratio in variables x and y, i.e.
Col1 x y
1 0 -0.107046196 49.96748
2 4 -0.001515573 50.02819
then obtain, based on the value t, the ratio (t-0)/(4-0), and then use that ratio to calculate the position in x and y.
I found a find function in MATLAB (Find which interval a point B is located in Matlab) and wonder whether there is a similar function in R.
Specifically, is there a way to determine which interval a variable falls in? And once I find that interval, a way to extract the corresponding subset of the data?
I can only think of the %in% operator at the moment,
> t %in% df$Col1
[1] FALSE
For more clarity, I have tried
> z=NULL
> for(i in 1:(nrow(df)-1)){
+ z[[i]]=df$Col1[i]:df$Col1[i+1]
+ }
> w=NULL
> for(i in 1:length(z)){
+ w=c(w,t %in% z[[i]])
+ }
> v=which(w==1)
> df[v:(v+1),]
Col1 x y
1 0 1.076101 50.17514
2 4 1.971503 47.81647
and now hope there may be a more concise solution, as my real data has >1M rows.
Try the code below and see whether it gives you the expected results:
dataframe <- data.frame(Col1 = seq(0, 24, by = 4), x = rnorm(7), y = rnorm(7, 50))
funfun <- function(x) { v <- findInterval(x, dataframe$Col1); c(v, v + 1) }  # the two rows bracketing x
dataframe[funfun(2),]
Col1 x y
1 0 0.831266 50.28246
2 4 1.751892 48.78810
dataframe[funfun(10),]
Col1 x y
3 8 0.2624929 48.33945
4 12 -0.2243066 51.11304
If this helps, please let us know. Thank you.
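Two side notes, not part of the answer above: findInterval is vectorized, so with >1M rows you can look up many values of t in a single call; and if the end goal is the interpolated position in x and y, base R's approx performs exactly this linear interpolation (ts below is a hypothetical vector of query values):
ts <- c(2, 10, 17)                                 # hypothetical query values
findInterval(ts, dataframe$Col1)                   # lower-bound row for each value
approx(dataframe$Col1, dataframe$x, xout = ts)$y   # interpolated x at each t
approx(dataframe$Col1, dataframe$y, xout = ts)$y   # interpolated y at each t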
I have the dataset w below and a key variable x, for two cases.
Case 1:
x = 4
w = c(1,2,4,4,4,4,6,7,8,9,10,11,12,14,15)
Case 2:
x = 12
w = c(1,2,4,4,4,4,6,7,8,9,10,11,12,14,15)
I want to create a function that searches for x in the dataset w and subsets the original dataset to a smaller one according to x's location in w. The output should be a smaller dataset whose upper-bound value equals the search key. Below is the function I am trying to write in R:
create_chunk <- function(val, tab, L = 1L, H = length(tab))
{
  if (H >= L)
  {
    mid = L + ((H - L) / 2)
    ## If the element is present within middle length
    if (tab[mid] > val)
    {
      ## subset the original data in reduced size and again do mid position value checking
      ## then subset the data
    } else
    {
      mid = mid + (mid / 2)
      ## Increase the mid position to go for right side checking
    }
  }
}
The output I am looking for is below:
Output for Case 1:
Dataset containing: 1,2,4,4,4,4
Output for Case 2:
Dataset containing: 1,2,4,4,4,4,6,7,8,9,10,11,12
Please note:
1. The dataset may contain duplicate values for the search key, and all the duplicate values are expected in the output dataset.
2. I have huge datasets (around 2M rows) from which I am trying to subset a smaller dataset according to my search key.
New Update: Case 3
Input Data:
date value size stockName
1 2016-08-12 12:44:43 10093.40 4 HWA IS Equity
2 2016-08-12 12:44:38 10093.35 2 HWA IS Equity
3 2016-08-12 12:44:47 10088.00 2 HWA IS Equity
4 2016-08-12 12:44:52 10089.95 1 HWA IS Equity
5 2016-08-12 12:44:53 10089.95 1 HWA IS Equity
6 2016-08-12 12:44:54 10088.95 1 HWA IS Equity
The search key is 10089.95 in the value column.
Expected Output is:
date value size stockName
1 2016-08-12 12:44:47 10088.00 2 HWA IS Equity
2 2016-08-12 12:44:54 10088.95 1 HWA IS Equity
3 2016-08-12 12:44:52 10089.95 1 HWA IS Equity
4 2016-08-12 12:44:53 10089.95 1 HWA IS Equity
You could do this, which takes care of duplicate values: in the case of duplicates, the highest position is returned. Please note that A should be in non-decreasing order.
binSearch <- function(A, value, left=1, right=length(A)){
  if (left > right)
    return(-1)
  middle <- (left + right) %/% 2
  if (A[middle] == value){
    ## walk right over duplicates so the highest matching position is returned;
    ## the bounds check keeps the scan from running past the end of A
    while (middle <= length(A) && A[middle] == value)
      middle <- middle + 1
    return(middle - 1)
  }
  else {
    if (A[middle] > value)
      return(binSearch(A, value, left, middle - 1))
    else
      return(binSearch(A, value, middle + 1, right))
  }
}
w[1:binSearch(w,x1)]
# [1] 1 2 4 4 4 4
w[1:binSearch(w,x2)]
# [1] 1 2 4 4 4 4 6 7 8 9 10 11 12
However, as mentioned in the comments, you can simply use findInterval to achieve the same:
w[1:findInterval(x1,w)]
As you know, binary search is O(log(n)); as stated in ?findInterval, findInterval benefits from the same complexity here, since the first argument has length one:
The function findInterval finds the index of one vector x in another, vec, where the latter must be non-decreasing. Where this is trivially equivalent to apply( outer(x, vec, ">="), 1, sum), as a matter of fact, the internal algorithm uses interval search ensuring O(n * log(N)) complexity where n <- length(x) (and N <- length(vec)). For (almost) sorted x, it will be even faster, basically O(n).
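A quick check, using the data defined below, that findInterval also lands on the highest duplicate, matching binSearch:
findInterval(x1, w)  # 6: the index of the last of the four 4's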
EDIT
As per your edit and your new setting, you could do this (suppose your data is in df):
o <- order(df$value)
rows <- o[1:findInterval(key, df$value[o])]
df[rows,]
Or equivalently, using the proposed binSearch function:
o <- order(df$value)
rows <- o[1:binSearch(df$value[o], key)]
df[rows,]
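As a concrete check, here is a minimal reconstruction of the Case 3 data (values taken from the question; the date and stockName columns are omitted for brevity):
df <- data.frame(value = c(10093.40, 10093.35, 10088.00, 10089.95, 10089.95, 10088.95),
                 size  = c(4, 2, 2, 1, 1, 1))
key <- 10089.95
o <- order(df$value)
df[o[1:findInterval(key, df$value[o])], ]  # the four rows with value <= 10089.95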
data
x1 <- 4
x2 <- 12
w <- c(1,2,4,4,4,4,6,7,8,9,10,11,12,14,15)
key <- 10089.95
Here is a very simple solution, and you can build your function out of these commands. Of course you have to check whether x is in w, but that's your part :-)
x <- 12
w <- c(1,2,4,4,4,4,6,7,8,9,10,11,12,14,15)
index <- which(x == w)
w_new <- w[1:index[length(index)]]
print(w_new)
#[1] 1 2 4 4 4 4 6 7 8 9 10 11 12
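For completeness, a minimal guard (my addition) for the case where x does not occur in w:
if (length(index) == 0) stop("x not found in w")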
I am trying to group a column of my data.frame/data.table into three groups, all with equal sums.
The data is first ordered from smallest to largest, such that group one would be made up of a large number of rows with small values, and group three would have a small number of rows with large values. This is accomplished in spirit with:
test <- data.frame(x = as.numeric(1:100000))
store <- 0
total <- sum(test$x)
for(i in 1:100000){
store <- store + test$x[i]
if(store < total/3){
test$y[i] <- 1
} else {
if(store < 2*total/3){
test$y[i] <- 2
} else {
test$y[i] <- 3
}
}
}
While successful, I feel like there must be a better way (and maybe a very obvious solution that I am missing).
My issues with this approach:
1. I never like resorting to loops, especially with nested ifs, when a vectorized approach is available; even with 100,000+ records this code becomes quite slow.
2. This method would become impossibly complex to code for a larger number of groups (not necessarily the looping, but the ifs).
3. It requires pre-ordering of the column. I might not be able to get around this one.
4. As a nuance (not that it makes a difference), the data to be summed would not always (or ever) be consecutive integers.
Maybe with cumsum: integer-dividing each element's running total by one third of the grand total yields the group labels directly:
test$z <- cumsum(test$x) %/% (ceiling(sum(test$x) / 3)) + 1
This is more or less a bin-packing problem.
Use the binPack function from the BBmisc package:
library(BBmisc)
test$bins <- binPack(test$x, sum(test$x)/3+1)
The sums of the 3 bins are nearly identical:
tapply(test$x, test$bins, sum)
1 2 3
1666683334 1666683334 1666683332
I thought that the cumsum/modulo-division approach was very elegant, but it does return a somewhat irregular allocation:
> tapply(test$x, test$z, sum)
1 2 3
1666636245 1666684180 1666729575
> sum(test)/3
[1] 1666683333
So I thought I would first create a random permutation and offer something similar:
test$x <- sample(test$x)
test$z2 <- cumsum(test$x)[ findInterval(cumsum(test$x),
c(0, 1666683333*(1:2), sum(test$x)+1))]
> tapply(test$x, test$z2, sum)
91099 116379 129539
1666676164 1666686837 1666686999
This also achieves a more even distribution of counts:
> table(test$z2)
91099 116379 129539
33245 33235 33520
> table(test$z)
1 2 3
57734 23915 18351
I must admit to puzzlement regarding the naming of the entries in z2.
Or you can just use cut on the cumsum:
test$z <- cut(cumsum(test$x), breaks = 3, labels = 1:3)
or use ggplot2::cut_interval instead of cut:
test$z <- cut_interval(cumsum(test$x), n = 3, labels = 1:3)
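For either variant, a quick sanity check (my addition) that the three group sums come out roughly equal:
tapply(test$x, test$z, sum)  # each sum should be close to sum(test$x)/3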
You can use fold() from groupdata2 and get an almost equal number of elements per group:
# Create data frame
test <- data.frame(x = as.numeric(1:100000))
# Use fold() to create 3 numerically balanced groups
test <- groupdata2::fold(test, k = 3, num_col = "x")
# Look at the first 10 rows
head(test, 10)
## # A tibble: 10 x 2
## # Groups: .folds [3]
## x .folds
## <dbl> <fct>
## 1 1 1
## 2 2 3
## 3 3 2
## 4 4 1
## 5 5 2
## 6 6 2
## 7 7 1
## 8 8 3
## 9 9 2
## 10 10 3
# Check the sum and number of elements per group
library(dplyr) # for the %>% pipe
test %>%
dplyr::group_by(.folds) %>%
dplyr::summarize(sum_ = sum(x),
n_members = dplyr::n())
## # A tibble: 3 x 3
## .folds sum_ n_members
## <fct> <dbl> <int>
## 1 1 1666690952 33333
## 2 2 1666716667 33334
## 3 3 1666642381 33333
I've got a simple question that's stumping me. I'm trying to use a loop to count how many values of a vector fall in a bin (0,.01), (.01,.02), etc. For example (the loop does not work):
set.seed(12345)
x<- rnorm(100, 0, .05)
vec <- rep(NA, 11)
for(i in .01:.11){
vec[i] <- sum(x> i & x < (i +.01))
}
I would like this to ultimately produce a vector of the count between each break, such that the output for the above is:
5,9,10...
I think this may have something to do with the indexing/decimals. Thanks for any and all help.
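As a side note, the immediate problem is that .01:.11 evaluates to just 0.01, because the : operator steps by 1, not by 0.01. A sketch of a corrected loop over bin indices:
breaks <- seq(0.01, 0.12, by = 0.01)
vec <- integer(length(breaks) - 1)
for (i in seq_along(vec)) {
  vec[i] <- sum(x > breaks[i] & x <= breaks[i + 1])  # count of x in (breaks[i], breaks[i+1]]
}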
Your example contains negative numbers, so I assume you are looking to do this with positive numbers. You should use cut to divide your vector into the given bins by setting the breaks parameter. Then, using table, you can compute the frequencies of x's falling within each interval.
## filter x
x <- x[x>=0.01] ## EDIT here : was x <- abs(x)
res <- table(cut(x,breaks=seq(round(min(x),2),round(max(x),2),0.01)))
## prettier output: coerce to data.frame
as.data.frame(res)
# Var1 Freq
# 1 (0.01,0.02] 5
# 2 (0.02,0.03] 9
# 3 (0.03,0.04] 10
# 4 (0.04,0.05] 10
# 5 (0.05,0.06] 4
# 6 (0.06,0.07] 0
# 7 (0.07,0.08] 5
# 8 (0.08,0.09] 2
# 9 (0.09,0.1] 5
# 10 (0.1,0.11] 4
# 11 (0.11,0.12] 1
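An alternative sketch, not from the answer above: since findInterval already featured earlier on this page, the same counts can be obtained by tabulating interval indices; left.open = TRUE makes the bins right-closed, (a, b], like cut's default:
brks <- seq(0.01, 0.12, by = 0.01)
tabulate(findInterval(x, brks, left.open = TRUE), nbins = length(brks) - 1)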
I'm pretty confused about how to go about this. Say I have two columns in a data frame: one column is a numerical series in order (x), the other specifies some value from the first, or -1 (y). These are results from a matching experiment, where the goal is to see if multiple photos are taken of the same individual. In the example below, there are 10 photos, but 6 are unique individuals. In the y column, the corresponding x is reported if there is a match; y is -1 for no match (might as well be NA). If there are more than 2 photos of an individual, the match # will be the most recent record (photos 1, 5 and 7 are the same individual below). The group is the time period the photo was taken (no matches within a group!). Hopefully I've got this example right:
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(-1,-1,-1,-1,1,-1,1,-1,2,4)
group <- c(1,1,1,2,2,2,3,3,3,3)
DF <- data.frame(x,y,group)
I would like to create a new variable to name the unique individuals, and have a final dataset with a single row per individual (i.e. only have 6 rows instead of 10), that also includes the group information. I.e. if an individual is in all three groups, there could be a value of "111" or if just in the first and last group it would be "101". Any tips?
Thanks for asking about the resulting dataset. I realized my group explanation was bad based on the actual numbers I gave, so I changed the results slightly. The bonus would also be nice to have, but is not critical.
name <- c(1,2,3,4,6,8)
group_history <- as.character(c('111','101','100','011','010','001'))
bonus <- as.character(c('1,5,7','2,9','3','4,10','6','8'))
results_I_want <- data.frame(name,group_history,bonus)
My word, more mistakes fixed above...
Using the (updated) example you gave
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(-1,-1,-1,-1,1,-1,1,-1,3,4)
group <- c(1,1,1,2,2,2,3,3,3,3)
DF <- data.frame(x,y,group)
Use the x and y to create a mapping from higher numbers to lower numbers that are the same person. Note that the names are strings, despite being strings of digits.
bottom.df <- DF[DF$y==-1,]
mapdown.df <- DF[DF$y!=-1,]
mapdown <- c(mapdown.df$y, bottom.df$x)
names(mapdown) <- c(mapdown.df$x, bottom.df$x)
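At this point mapdown is a named vector: each photo number (the name) maps to the earlier photo it matched, or to itself if it never matched. For instance:
mapdown["9"]  # 3: photo 9 matched photo 3
mapdown["3"]  # 3: photo 3 maps to itself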
We don't know how many iterations it might take to get everything down to the lowest number, so we have to use a while loop.
oldx <- DF$x
newx <- mapdown[as.character(oldx)]
while(any(oldx != newx)) {
oldx = newx
newx = mapdown[as.character(oldx)]
}
The result is the group each photo belongs to, named by the lowest number of that set.
DF$id <- unname(newx)
Getting the group membership is harder. Use reshape2 to convert this into wide format (one column per group), where the entry is "1" if the individual appears in that group and "0" if not.
library("reshape2")
wide <- dcast(DF, id~group, value.var="id",
fun.aggregate=function(x){if(length(x)>0){"1"}else{"0"}})
Finally, paste these "0"/"1" memberships together to get the grouping variable you described.
wide$grouping = apply(wide[,-1], 1, paste, collapse="")
The result:
> wide
id 1 2 3 grouping
1 1 1 1 1 111
2 2 1 0 0 100
3 3 1 0 1 101
4 4 0 1 1 011
5 6 0 1 0 010
6 8 0 0 1 001
No "bonus" yet.
EDIT:
To get the bonus information, it helps to redo the mapping to keep everything. If you have a lot of cases, this could be slow.
Replace the oldx/newx part with:
iterx <- matrix(DF$x, ncol=1)
iterx <- cbind(iterx, mapdown[as.character(iterx[,1])])
while(any(iterx[,ncol(iterx)]!=iterx[,ncol(iterx)-1])) {
iterx <- cbind(iterx, mapdown[as.character(iterx[,ncol(iterx)])])
}
DF$id <- iterx[,ncol(iterx)]
To generate the bonus data, you can then use
bonus <- tapply(iterx[,1], iterx[,ncol(iterx)], paste, collapse=",")
wide$bonus <- bonus[as.character(wide$id)]
Which gives:
> wide
id 1 2 3 grouping bonus
1 1 1 1 1 111 1,5,7
2 2 1 0 0 100 2
3 3 1 0 1 101 3,9
4 4 0 1 1 011 4,10
5 6 0 1 0 010 6
6 8 0 0 1 001 8
Note this isn't the same as your example output, but I don't think your example output is right (how can you have a grouping_history of "000"?).
EDIT:
Now it agrees.
Another solution for the bonus variable:
f_bonus <- function(data = DF){
  data_a <- subset(data, y == -1, select = x)        # unmatched photos: new individuals
  data_a$pos <- seq(nrow(data_a))
  data_b <- subset(data, y != -1, select = c(x, y))  # matched photos
  data_b$pos <- match(data_b$y, data_a$x)            # map each match to its individual
  data_t <- rbind(data_a, data_b[-2])
  data_t <- with(data_t, tapply(x, pos, paste, sep = "", collapse = ","))
  return(data_t)
}
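A quick usage sketch with DF as defined above; the result is a named character vector, one entry per individual:
f_bonus(DF)  # "1,5,7" "2" "3,9" "4,10" "6" "8"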
I have a dataset like this:
x
A B
1 x 2
2 y 4
3 z 4
4 x 4
5 x 4
6 x 3
......
I want to know whether any value of "A" appears in this dataset more than some number of times (for example 3).
Probably I will need to group these values in a temporary table, getting this:
x y z
4 1 1
and after this I will call another method (that I don't know) that gives me this result:
x
because only the value x is present more than 3 times in my previous table.
Can R optimise this operation?
data<-data.frame(factor(c("x","y","z","x","x","x")),c(2,4,4,4,4,3))
To get the count of each letter, do
table(data[,1])
and to get the names of the factor levels appearing more than 3 times:
names(table(data[,1]))[table(data[,1]) > 3]
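With the example data above, this gives:
table(data[,1])
# x y z
# 4 1 1
names(table(data[,1]))[table(data[,1]) > 3]
# [1] "x"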
Don't know if I understand you right... what's with the B column?
Does this work for you?
set.seed(1234)
A <- sample(c("x", "y", "z"), 20, replace = TRUE)
Ad <- data.frame(table(A))
with(Ad, A[Freq >= 7])
[1] x y