I have a dataset with two different columns (X and Y) that both contain exactly the same counts of 0s and 1s:
0 1
3790 654
Now I want column Y to contain exactly 1733 1s and 2711 0s, with the 1079 extra 1s (1733 - 654) assigned randomly. I already tried the following:
ind <- which(df$X == 0)
ind <- ind[rbinom(length(ind), 1, prob = 1079/3790) > 0]
df$Y[ind] <- 1
But every time I run this code I get a different number of 1s, and I want it to be exactly 1733. How do I do this?
You have this vector:
x <- sample(c(rep(0, 3790), rep(1, 654)))
#> table(x)
#> x
#> 0 1
#> 3790 654
What you need to do is randomly select the positions of 1079 elements in your vector that equal 0, and assign them the value 1:
s <- sample(which(x == 0), 1079)
x[s] <- 1
#> table(x)
#> x
#> 0 1
#> 2711 1733
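Applied directly to your data frame, a minimal sketch (assuming Y currently has a 0 wherever X has a 0):
ind <- sample(which(df$X == 0), 1079)  # pick exactly 1079 of the positions where X is 0
df$Y[ind] <- 1
table(df$Y)
#>    0    1
#> 2711 1733
Because sample() draws exactly 1079 positions (rather than a binomial number of them, as rbinom() does), the count of 1s is the same on every run.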
I am fairly new to R.
I have a column of repeating codes (numeric or categorical) read in from an Excel file. I need to add another column of values (they can be random) such that every occurrence of the same code gets the same value, like this:
Codes Value
1 122
1 122
2 155
2 155
2 155
4 101
4 101
5 251
5 251
Thank you.
We can use match:
n <- length(code0 <- unique(code))
value <- sample(4 * n, n)[match(code, code0)]
or factor:
n <- length(unique(code))
value <- sample(4 * n, n)[factor(code)]
The random integers generated are between 1 and 4 * n. The number 4 is arbitrary; you can also put 100.
Example
set.seed(0); code <- rep(1:5, sample(5))
code
# [1] 1 1 1 1 1 2 2 3 3 3 3 4 4 4 5
n <- length(code0 <- unique(code))
sample(4 * n, n)[match(code, code0)]
# [1] 5 5 5 5 5 18 18 19 19 19 19 12 12 12 11
Comment
The above gives the most general treatment, making no assumption that code is sorted or takes consecutive values.
If code is sorted (no matter what value it takes), we can also use rle:
if (!is.unsorted(code)) {
  n <- length(k <- rle(code)$lengths)
  value <- rep.int(sample(4 * n, n), k)
}
If code takes consecutive values 1, 2, ..., n (but not necessarily sorted), we can skip match or factor and do:
n <- max(code)
value <- sample(4 * n, n)[code]
A further note: if code is not numeric but categorical, the match and factor methods will still work.
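For example, a quick sketch with made-up character codes:
code <- c("a", "a", "b", "b", "b", "c")   # hypothetical categorical codes
n <- length(code0 <- unique(code))
value <- sample(4 * n, n)[match(code, code0)]
# every "a" gets one random number, every "b" another, every "c" a third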
What you could also do is the following; it is perhaps more intuitive for a beginner:
data <- data.frame('a' = c(122, 122, 155, 155, 155, 101, 101, 251, 251))
duplicates <- unique(data)                    # one row per distinct code
duplicates[, 'b'] <- rnorm(nrow(duplicates))  # one random value per code
data <- merge(data, duplicates, by = 'a')     # join the values back on
I have two data frames that look like this:
> head(y,n=4)
Source: local data frame [6 x 3]
Start Date End Date Length
1 2006-06-08 2006-06-10 3
2 2006-06-12 2006-06-14 3
3 2006-06-18 2006-06-21 4
4 2006-06-24 2006-06-25 2
and
> head(x,n=19)
Date Group.Size
413 2006-06-07 6
414 2006-06-08 3
415 2006-06-09 1
416 2006-06-10 3
417 2006-06-11 15
418 2006-06-12 12
419 2006-06-13 NA
420 2006-06-14 4
421 2006-06-15 8
422 2006-06-16 3
423 2006-06-17 1
424 2006-06-18 3
425 2006-06-19 10
426 2006-06-20 2
427 2006-06-21 7
428 2006-06-22 6
429 2006-06-23 2
430 2006-06-24 1
431 2006-06-25 0
I'm looking for a way to add a new column to data frame y that shows the average Group.Size from data frame x (rounded to the nearest integer) for the period between the Start Date and End Date given in each row of y.
For example, in the first row of y, I have 6/8/06 to 6/10/06. This is a length of 3 days, so I would want the new column to have the number 2, because the corresponding Group.Size values are 3, 1, and 3 for the respective days in data frame x (mean=2.33, rounded to nearest integer is 2).
If there is an NA in my dataframe x, I'd like to consider it a 0.
There are multiple steps involved in this task, and there is probably a straightforward approach... I am relatively new to R, and am having a hard time breaking it down. Please let me know if I should clarify my example.
Assuming that x$Date, y$StartDate, and y$EndDate are of class Date (or character), the following apply approach should do the trick:
y$AvGroupSize <- apply(y, 1, function(z) {
  round(mean(x$Group.Size[which(x$Date >= z[1] & x$Date <= z[2])], na.rm = TRUE), 0)
})
#Replace missing values in x with 0
x[is.na(x)] <- 0
#Create new 'Group' variable and loop through x to create groups
x$Group <- 1
j <- 1
for (i in 1:nrow(x)) {
  if (x[i, "Date"] == y[j, "StartDate"]) {
    x[i, "Group"] <- j + 1
    if (j < nrow(y)) {
      j <- j + 1
    } else {
      j <- j
    }
  } else if (i > 1) {
    x[i, "Group"] <- x[i - 1, "Group"]
  } else {
    x[i, "Group"] <- 1
  }
}
#Use tapply function to get the rounded mean of each Group
tapply(x$Group.Size, x$Group, function(z) round(mean(z)))
Here is a different dplyr solution
library(dplyr)
na2zero <- function(x) ifelse(is.na(x),0,x) # Convert NA to zero
ydf %>%
group_by(Start_Date, End_Date) %>%
mutate(avg = round(mean(na2zero(xdf$Group.Size[ between(xdf$Date, Start_Date, End_Date) ])), 0)) %>%
ungroup
## Start_Date End_Date Length avg
## (time) (time) (int) (dbl)
## 1 2006-06-08 2006-06-10 3 2
## 2 2006-06-12 2006-06-14 3 5
## 3 2006-06-18 2006-06-21 4 6
## 4 2006-06-24 2006-06-25 2 0
This is a solution that applies over the rows of the data frame y:
library(dplyr)
get_mean_size <- function(start, end, length) {
  s <- sum(filter(x, Date >= start, Date <= end)$Group.Size, na.rm = TRUE)
  round(s / length)
}
y$Mean.Size = Map(get_mean_size, y$Start_Date, y$End_Date, y$Length)
y
## Start_Date End_Date Length Mean.Size
## 1 2006-06-08 2006-06-10 3 2
## 2 2006-06-12 2006-06-14 3 5
## 3 2006-06-18 2006-06-21 4 6
## 4 2006-06-24 2006-06-25 2 0
It uses the filter() function from the dplyr package, together with base R's Map().
First I define the function get_mean_size, which is supplied with the three values from one row of y: Start_Date, End_Date and Length. It first selects the relevant rows from x using filter and sums up the column Group.Size. Using na.rm = TRUE tells sum() to ignore NA values, which is the same as setting them to zero. Then the average is calculated by dividing by length and rounding. Note that round rounds half to even, thus 0.5 is rounded to 0, while 1.5 is rounded to 2.
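For instance:
round(0.5)  # 0
round(1.5)  # 2
round(2.5)  # 2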
This function is then applied to all rows of y using Map() and added as a new column to y.
A final note regarding the dates in x and y: this solution assumes that the dates are stored as Date objects. You can check this using, e.g.,
is(x$Date, "Date")
If they do not have class Date, you can convert them using
x$Date <- as.Date(x$Date)
(and similarly for y$Start_Date and y$End_Date).
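Spelled out, a minimal sketch (assuming those columns are character strings in yyyy-mm-dd format):
y$Start_Date <- as.Date(y$Start_Date)
y$End_Date   <- as.Date(y$End_Date)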
There are many ways, but here is one. We can first create a list of date positions with lapply (side note: be sure that the dates are in chronological order). Then we map the function round(mean(Group.Size)) over each pair of positions. Note that na.rm=TRUE here drops the NA rather than counting it as 0, which is why the second average below comes out as 8 instead of 5:
lst <- lapply(y[1:2], function(.x) match(.x, x[,"Date"]))
y$avg <- mapply(function(i,j) round(mean(x$Group.Size[i:j], na.rm=TRUE)), lst[[1]],lst[[2]])
y
# StartDate EndDate Length avg
# 1 2006-06-08 2006-06-10 3 2
# 2 2006-06-12 2006-06-14 3 8
# 3 2006-06-18 2006-06-21 4 6
# 4 2006-06-24 2006-06-25 2 0
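One caveat about the assumption this approach makes: every StartDate and EndDate must actually occur in x$Date, otherwise match() returns NA positions. A quick guard:
stopifnot(!anyNA(unlist(lst)))  # all start/end dates were found in x$Date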
I need to extract summed subsets of a data.frame row-by-row and use the output to return a new data.frame. However, I want to increase the number of columns to sum across by 4 each time. So, for example, I want to extract the 1st column by itself, then the sum of columns 2 to 6 on a row-by-row basis, then columns 7 to 15 and so on.
I have this code that returns the sum of a constant number of columns across a data.frame (by a maximum number of trials) into a new data.frame; I just need to find a way to make the window of columns escalate.
t <- max(as.numeric(df[, 5]))
process.row <- function(x) {
  sapply(1:t, function(i) {
    sum(as.numeric(x[(6 + (i - 1) * 5):(10 + (i - 1) * 5)]))
  })
}
collated.data <- t(apply(df, 1, process.row))
I've been really struggling with a way to do this so thanks very much for any help. I couldn't find an answer to this elsewhere so apologies if I've missed something.
I was thinking you wanted to sum the rows of the selected subset of columns. If so, perhaps this will help.
# fake data
mydf <- as.data.frame(matrix(sample(45*5), nrow=5))
mydf
# prepare matrix of start and ending columns
n <- 20
i <- 1:n
ncols <- 1 + (i-1)*4
endcols <- cumsum(ncols)
startcols <- c(1, cumsum(ncols[-length(endcols)])+1)
mymat <- cbind(startcols, endcols)
# function to sum the rows
myfun <- function(df, m) {
  # select subset with end columns within the dimensions of the given df
  subm <- m[m[, 2] <= dim(df)[2], ]
  # sum up the selected columns of df by rows
  sapply(1:dim(subm)[1], function(j)
    rowSums(df[, subm[j, 1]:subm[j, 2], drop = FALSE]))
}
mydf
myfun(df=mydf, m=mymat)
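With this 5 x 45 example, myfun() should return a 5 x 5 matrix: one row per row of mydf and one column per column block (block widths 1, 5, 9, 13 and 17, which together cover all 45 columns). A quick check:
dim(myfun(df = mydf, m = mymat))
# [1] 5 5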
What you are looking for is a function that gives x, the lower column index of block i of the series. Written out, x(i) = 1 + sum over j = 1, ..., i-1 of (1 + 4*(j - 1)), with x(1) = 1.
In R, the code looks like this:
# the foo part of the function
foo <- function(x) ifelse(x > 0, 1 + (x - 1) * 4, 0)
# the wrapper of the function
min.val <- function(i) {
  ifelse(i == 1, 1, 1 + sum(sapply(1:(i - 1), foo)))
}
# takes only one value
min.val(1)
# [1] 1
min.val(2)
# [1] 2
min.val(3)
# [1] 7
# to calculate multiple values, use it like this
sapply(1:5, min.val)
#[1] 1 2 7 16 29
If you want to get the maximum number, you can create another function, which looks like this
max.val <- function(i) min.val(i + 1) - 1
sapply(1:5, max.val)
#[1] 1 6 15 28 45
Testing:
# creating a series to test it
series <- 1:20
min.vals <- sapply(series, min.val)
max.vals <- sapply(series, max.val)
dat <- data.frame(min = min.vals, max = max.vals)
# dat
# min max
# 1 1 1
# 2 2 6
# 3 7 15
# 4 16 28
# 5 29 45
# 6 46 66
# 7 67 91
# 8 92 120
# 9 121 153
# 10 154 190
# 11 191 231
# 12 232 276
# 13 277 325
# 14 326 378
# 15 379 435
# 16 436 496
# 17 497 561
# 18 562 630
# 19 631 703
# 20 704 780
Does that give you what you want?
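To connect this back to the original row-by-row summing task, a minimal sketch (reusing the fake mydf from the earlier answer; the first five blocks exactly cover its 45 columns):
n.blocks <- 5
collated <- t(apply(mydf, 1, function(r)
  sapply(1:n.blocks, function(i) sum(as.numeric(r[min.val(i):max.val(i)])))))
dim(collated)
# [1] 5 5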
I'm new to R and coding in general and I need some help connecting two processes in R.
I have a dataframe:
X <- c(385, 386, 387, 388, 390, 391, 392, 393, 394, 395, 396, 398, 399, 400)
east<- seq(1,14,1)
north<- seq(1,14,1)
df2 <-data.frame(X,east,north)
What I would like to do is to look at the values in X row by row and compare them to each other to populate a new column with a binary result. For example, if X[1,] and X[2,] are sequential, the new column value is 1; if X[1,] and X[2,] are not sequential, it is 0.
This piece of code:
for (i in 1:nrow(df2)) {
  ifelse((df2$X[i + 1] - df2$X[i] <= 1), print(1), print(0))
}
provides the info that I want, but I am struggling to get it into a column.
[1] 1
[1] 1
[1] 1
[1] 0
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 0
[1] 1
[1] 1
I have also tried this:
df2$response <- NA
for (i in 1:nrow(df2)) {
  if (df2$X[i + 1] - df2$X[i] == 1) {
    df2$response[i] <- 1
  } else if (df2$X[i + 1] - df2$X[i] > 1) {
    df2$response[i] <- 0
  }
}
but received this error:
Error in if (df2$X[i + 1] - df2$X[i] == 1) { :
missing value where TRUE/FALSE needed
Any suggestions? Tips? Thank you!
People are getting tied up in knots with arcane solutions. Just:
df2$response <- c(tail(df2$X, -1) - head(df2$X, -1) <= 1, NA_integer_)
OR:
df2$response <- c( diff(df2$X) <= 1, NA_integer_ )
We need the NA to account for the fact that for the last row there is nothing to subtract. Using NA_integer_ as the placeholder rather than NA results in the logical values being coerced to integer (a bare NA is of logical type).
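A quick illustration of that coercion:
c(TRUE, FALSE, NA)           # stays logical: TRUE FALSE NA
c(TRUE, FALSE, NA_integer_)  # coerced to integer: 1 0 NA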
To wrap it up, here is a data.table solution (just for illustration):
library(data.table)
setDT(df2)[, flag := c(diff(X) <= 1, NaN)]
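An equivalent data.table idiom, as a sketch (shift() with type = "lead" returns the next value of X, filling the last position with NA):
setDT(df2)[, flag := as.integer(shift(X, type = "lead") - X == 1)]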
Another option using dplyr:
require(dplyr)
df2 %>% mutate( flag = ifelse( lead(X)-X==1, 1, 0 ) )
but ifelse() doesn't scale well / can be slow, so we could do:
df2 %>% mutate( flag = as.integer( lead(X)-X==1 ) )
where the as.integer() is necessary to get exactly the output you've shown, as it converts TRUE and FALSE to 1 and 0, respectively.
# X flag
# 1 385 1
# 2 386 1
# 3 387 1
# 4 388 0
# 5 390 1
# 6 391 1
# 7 392 1
# 8 393 1
# 9 394 1
# 10 395 1
# 11 396 0
# 12 398 1
# 13 399 1
# 14 400 NA
You're almost there.
df2$flag <- ifelse(c(diff(df2$X), 1) <= 1, 1, 0)
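With the example df2 this yields the same flags as above, except that the last row gets 1 instead of NA, because a trailing 1 is appended in place of the missing difference:
df2$flag
#  [1] 1 1 1 0 1 1 1 1 1 1 0 1 1 1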
I would like to aggregate an R data.frame by equal amounts of the cumulative sum of one of the variables in the data.frame. I googled quite a lot, but probably I don't know the correct terminology to find anything useful.
Suppose I have this data.frame:
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> head(x, 20)
p v
1 10.002904 4
2 10.132200 2
3 10.026105 6
4 10.001146 2
5 9.990267 2
6 10.115907 6
7 10.199895 9
8 9.949996 8
9 10.165848 8
10 9.953283 6
11 10.072947 10
12 10.020379 2
13 10.084002 3
14 9.949108 8
15 10.065247 6
16 9.801699 3
17 10.014612 8
18 9.954638 5
19 9.958256 9
20 10.031041 7
I would like to reduce x to a smaller data.frame where each line contains the weighted average of p, weighted by v, corresponding to n units of v. Something of this sort:
> n <- 100
> cum.v <- cumsum(x$v)
> f <- cum.v %/% n
> x.agg <- aggregate(cbind(v*p, v) ~ f, data=x, FUN=sum)
> x.agg$'v * p' <- x.agg$'v * p' / x.agg$v
> x.agg
f v * p v
1 0 10.039369 98
2 1 9.952049 94
3 2 10.015058 104
4 3 9.938271 103
5 4 9.967244 100
6 5 9.995071 69
First question: is there a better (more efficient) approach than the code above? The second, more important, question is how to correct the code above in order to obtain more precise bucketing. Namely, each row in x.agg should contain exactly 100 units of v, not just approximately as is the case above. For example, the first row contains the aggregate of the first 17 rows of x, which correspond to 98 units of v. The next row of x (the 18th) contains 5 units of v and is fully included in the next bucket. What I would like to achieve instead is to attribute 2 units of the 18th row to the first bucket and the remaining 3 units to the following one.
Thanks in advance for any help provided.
Here's another method that does this without repeating each p value v times. The way I understand it, the place where the cumulative sum crosses 100 (see below)
18 9.954638 5 98
19 9.958256 9 107
should be changed to:
18 9.954638 5 98
19.1 9.958256 2 100 # ---> 2 units will be considered with previous group
19.2 9.958256 7 107 # ----> remaining 7 units will be split for next group
The code:
n <- 100
# get cumulative sum, an id column (for retrace) and current group id
x <- transform(x, cv = cumsum(x$v), id = seq_len(nrow(x)), grp = cumsum(x$v) %/% n)
# Paste these two lines in R to install IRanges (this is the legacy route;
# on current R/Bioconductor, BiocManager::install("IRanges") is the equivalent)
source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
require(IRanges)
ir1 <- successiveIRanges(x$v)
ir2 <- IRanges(seq(n, max(x$cv), by=n), width=1)
o <- findOverlaps(ir1, ir2)
# gets position where multiple of n(=100) occurs
# (where we'll have to do something about it)
pos <- queryHits(o)
# how much do the values differ from multiple of 100?
val <- start(ir2)[subjectHits(o)] - start(ir1)[queryHits(o)] + 1
# we need "pos" new rows of "pos" indices
x1 <- x[pos, ]
x1$v <- val # corresponding values
# reduce the group by 1, so that multiples of 100 will
# belong to the previous row
x1$grp <- x1$grp - 1
# subtract val in the original data x
x$v[pos] <- x$v[pos] - val
# bind and order them
x <- rbind(x1,x)
x <- x[with(x, order(id)), ]
# remove unnecessary entries
x <- x[!(duplicated(x$id) & x$v == 0), ]
x$cv <- cumsum(x$v) # updated cumsum
x$id <- NULL
require(data.table)
x.dt <- data.table(x, key="grp")
x.dt[, list(res = sum(p*v)/sum(v), cv = tail(cv, 1)), by=grp]
Running on your data:
# grp res cv
# 1: 0 10.037747 100
# 2: 1 9.994648 114
Running on #geektrader's data:
# grp res cv
# 1: 0 9.999680 100
# 2: 1 10.040139 200
# 3: 2 9.976425 300
# 4: 3 10.026622 400
# 5: 4 10.068623 500
# 6: 5 9.982733 562
Here's a benchmark on a relatively big data:
set.seed(12345)
x <- data.frame(cbind(p=rnorm(1e5, 10, 0.1), v=round(runif(1e5, 1, 10))))
require(rbenchmark)
benchmark(out <- FN1(x), replications=10)
# test replications elapsed relative user.self
# 1 out <- FN1(x) 10 13.817 1 12.586
It takes about 1.4 seconds on 1e5 rows.
If you are looking for precise bucketing, I am assuming the value of p is the same for the two parts of a "split" v;
i.e. in your example, the value of p for the 2 units of row 18 that go into the first bucket is 9.954638.
With the above assumption, you can do the following for datasets that are not too large:
> set.seed(12345)
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> z <- unlist(mapply(function(x,y) rep(x,y), x$p, x$v, SIMPLIFY=T))
This creates a vector in which each value of p is repeated v times for its row; the results are combined into a single vector using unlist.
After this, aggregation is trivial using the aggregate function:
> aggregate(z, by=list((1:length(z)-0.5)%/%100), FUN=mean)
Group.1 x
1 0 9.999680
2 1 10.040139
3 2 9.976425
4 3 10.026622
5 4 10.068623
6 5 9.982733
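One thing to keep in mind with this expansion approach: z has one element per unit of v, so memory use grows with the total of v rather than with the number of rows. A quick check, reusing the z defined above:
length(z) == sum(x$v)  # TRUE: one element per unit of v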