I have this data and would like to categorize each number into buckets of n (n = 2 in the example below), i.e. a new category starts at every n-th number. Using cut or cut_interval is not "cutting" it. Any suggestions very much welcome. Thanks!
haves <- data.frame(
some_vector = c(1,2,2,3,4,5,6,7,8)
)
haves$category <- ggplot2::cut_interval(haves$some_vector, n = 2) # not the grouping I want
haves
wants <- data.frame(
some_vector = c(1,2,2,3,4,5,6,7,8)
,category = c(1,1,1,2,2,3,3,4,4)
)
wants
This should do (for positive numbers):
cut_interval <- function(x, n) ceiling(x / n) # note: this masks ggplot2::cut_interval
cut_interval(haves$some_vector, n=2)
# [1] 1 1 1 2 2 3 3 4 4
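Applied back to the original data frame, this reproduces the wants column (a small usage sketch):
haves$category <- cut_interval(haves$some_vector, n = 2)
identical(haves$category, wants$category)
# [1] TRUE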
Anyway, cut() should be able to cut it (with improvements from @Henrik so it generalises):
cut(haves$some_vector, c(-Inf, seq(2, max(haves$some_vector), by = 2), Inf), labels = FALSE)
# [1] 1 1 1 2 2 3 3 4 4
My dataframe contains a column with various touch points, numbered 1 through 18. I want to know which touch point leads to touch point 10, so I want to create a new column showing the touch point that occurred before touch point 10 in each customer journey (PurchaseID). If touch point 10 doesn't occur in a customer journey, the value can be NULL or 0.
So for example:
dd <- read.table(text="
PurchaseId TouchPoint DesiredOutcome
1 8 6
1 6 6
1 10 6
2 12 0
2 8 0
3 17 4
3 3 4
3 4 4
3 10 4", header=TRUE)
The complete dataset contains 2,500,000 observations. Does anyone know how to solve my problem? Thanks in advance.
Firstly, it is better to give complete reproducible sample code. I suggest you look at the data.table library, which is nice for handling large datasets.
library(data.table)
mdata <- matrix(sample(x = 1:21, size = 15*10, replace = TRUE), ncol = 10)
mdata[mdata == 21] <- NA # 21 serves as a missing-value marker
mdata <- data.frame(mdata)
names(mdata) <- paste0("cj", 1:10)
df_touch <- data.table(mdata)
# -- using a for loop over the journey columns
res <- rep(0, 10)
for (i in 1:10) {
  cat(i, "\n")
  # record i if touch point 10 occurs anywhere in column cj_i, else 0
  res[i] <- i * df_touch[, (10 %in% get(paste0("cj", i)))]
  cat(res[i], "\n")
}
# -- using lapply
dfun <- function(x, k = 10){ return( k %in% x ) }
df_touch[, lapply(.SD, dfun)]
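For the question as stated (the touch point immediately before 10 within each journey), a minimal data.table sketch on the dd sample from the question might look like this (not from the original answer; it assumes rows are already in journey order, and the column name Outcome is made up):
library(data.table)
setDT(dd)
dd[, Outcome := {
  idx <- match(10, TouchPoint) # position of touch point 10, NA if absent
  if (is.na(idx) || idx == 1L) 0L else TouchPoint[idx - 1L]
}, by = PurchaseId]
Here Outcome reproduces the DesiredOutcome column for the sample data.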
I currently have a string in R that looks like this:
a <- "BMMBMMMMBMMMBMMBBMMM"
First, I need to determine the frequency of the different runs of "M" that appear in the string.
In this example it would be:
MM = 2
MMM = 2
MMMM = 1
Secondly, I then need to designate a numerical value/score for each different pattern, i.e.:
MM = 1
MMM = 2
MMMM = 3
This would mean that the total value/score of M's in a would equal 9.
If anyone knows any script that would allow me to do this for multiple strings like this in a dataframe, that would be great.
Thank you.
a <- "BMMBMMMMBMMMBMMBBMMM"
tbl <- table(strsplit(a, "B"), exclude="")
tbl
# MM MMM MMMM
# 2 2 1
score <- sum(tbl * 1:3) # weights 1:3 line up with MM, MMM, MMMM in table order
score
# 9
You could also use the table function.
a_list <- unlist(strsplit(a, "B"))
a_list <- a_list[!a_list == ""] # drop empty strings (from a leading B or two B's together)
a_list <- table(a_list)
a_list
# MM MMM MMMM
# 2 2 1
Here's a solution that uses the dplyr package. First, I load the library and define my string.
library(dplyr)
a <- "BMMBMMMMBMMMBMMBBMMM"
Next, I define a function that counts the occurrences of character x in string y.
char_count <- function(x, y){
# Get runs of same character
tmp <- rle(strsplit(y, split = "")[[1]])
# Count runs of character stored in `x`
tmp <- data.frame(table(tmp$lengths[tmp$values == x]))
# Return strings and frequencies
tmp %>%
# Var1 is a factor; convert via character so strrep gets the actual run lengths
mutate(String = strrep(x, as.integer(as.character(Var1)))) %>%
select(String, Freq)
}
Then, I run the function.
# Run the function
res <- char_count("M", a)
# String Freq
# 1 MM 2
# 2 MMM 2
# 3 MMMM 1
Finally, I define my value vector and calculate the total value of vector a.
# My value vector
value_vec <- c(MM = 1, MMM = 2, MMMM = 3)
# Total `value` of vector `a`
sum(value_vec * res$Freq)
#[1] 9
If it's acceptable to skip the first step, you could do the following (the regex deletes each run of B's together with one following M, plus a leading M if any, so every run of L M's leaves L - 1 characters behind):
nchar(gsub("(B+M)|(^M)","",a))
# [1] 9
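Since the question asks about scoring many strings in a data frame, here is a hedged sketch of one way to vectorise this (the column name strings and the rule that a run of L M's scores L - 1 are read off the example, not stated anywhere):
score_m <- function(s) {
  r <- rle(strsplit(s, "")[[1]]) # run-length encode the characters
  sum(r$lengths[r$values == "M"] - 1) # each run of L M's contributes L - 1
}
dat <- data.frame(strings = c("BMMBMMMMBMMMBMMBBMMM", "BMMMBBMM"), stringsAsFactors = FALSE)
dat$score <- vapply(dat$strings, score_m, numeric(1), USE.NAMES = FALSE)
dat$score
# [1] 9 3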
First compute all the different patterns (contiguous substrings) that appear in your string:
a <- "BMMBMMMMBMMMBMMBBMMM"
chars <- unlist(strsplit(a, ""))
pat <- c()
for (i in 1:length(chars)) {
  for (j in 1:(length(chars) - i + 1)) {
    pat <- c(pat, paste(chars[j:(j + i - 1)], collapse = ""))
  }
}
pat <- sort(unique(pat))
pat[1:5]
# [1] "B" "BB" "BBM" "BBMM" "BBMMM"
Next, count the occurrence of each pattern:
counts <- sapply(pat, function(w) length(gregexpr(w, a, fixed = TRUE)[[1]])) # counts non-overlapping matches
Finally, build a data frame to summarize everything:
df <- data.frame(counts = counts, num = 1:length(pat))
head(df, 10)
counts num
B 6 1
BB 1 2
BBM 1 3
BBMM 1 4
BBMMM 1 5
BM 5 6
BMM 5 7
BMMB 2 8
BMMBB 1 9
BMMBBM 1 10
library(stringr)
str_count(a, "MMMM")
# [1] 1
# now count how many times "MMM" occurs, but first delete the "MMMM"
str_count(gsub("MMMM", "", a), "MMM")
# [1] 2
# now count how many times "MM" occurs, but first delete the "MMM"s
str_count(gsub("MMM", "", a), "MM")
# [1] 2
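Putting the three counts together reproduces the total score (a small follow-up sketch using the question's weights; for the "MM" count it is safer to delete the "MMMM" runs as well before deleting "MMM"):
library(stringr)
n4 <- str_count(a, "MMMM")
n3 <- str_count(gsub("MMMM", "", a), "MMM")
n2 <- str_count(gsub("MMM", "", gsub("MMMM", "", a)), "MM")
1 * n2 + 2 * n3 + 3 * n4
# [1] 9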
From ?dplyr::bind_cols:
This is an efficient implementation of the common pattern of do.call(rbind, dfs) or do.call(cbind, dfs) for binding many data frames into one
However, with example data:
tmp_df1 <- data.frame(a = 1)
tmp_df2 <- data.frame(b = c(-2, 2))
tmp_df3 <- data.frame(c = runif(10))
The command do.call(cbind, list(tmp_df1, tmp_df2, tmp_df3)) produces:
a b c
1 1 -2 0.8473307
2 1 2 0.8031552
3 1 -2 0.3057430
4 1 2 0.6344999
5 1 -2 0.7870753
6 1 2 0.9453199
7 1 -2 0.6642231
8 1 2 0.9708049
9 1 -2 0.7189576
10 1 2 0.9217087
That is, rows of tmp_df1 and tmp_df2 are recycled to match the number of rows in tmp_df3.
In dplyr:
> bind_cols(tmp_df1, tmp_df2, tmp_df3)
Error in eval(substitute(expr), envir, enclos) :
incompatible number of rows (2, expecting 1)
The reason I want to do something like this is that I am in a situation similar to the one below:
df_normal_param <- data.frame(mu = rnorm(10), sigma = runif(10))
df_normal_sample_list <- lapply(1:10, function(i)
  with(df_normal_param,
       data.frame(sam = rnorm(100, mu[i], sigma[i]))))
and I wish to attach the arguments used to create each entry of df_normal_sample_list to the outputs, e.g.
df_normal_sample_list <- lapply(1:10, function(i)
cbind(df_normal_param[i,], df_normal_sample_list[[i]]))
You argue in a comment that this behavior is safe; I strongly disagree. It seems safe for this very particular case, but it is likely to cause you problems somewhere down the road. Which is why I believe the answer to your stated question ("Is there a way to get dplyr's bind_cols to expand the number of rows like in cbind?") is a simple no, and they probably built it that way intentionally.
Instead, I would suggest that you be more explicit in your approach and add the columns you want right as you build the data. For example, you could include that step right in your call (here using apply to clarify what is going where):
df <- data.frame(mu = rnorm(3), sigma = runif(3))
df_normal_sample_list <- apply(df, 1, function(x){
data.frame(
mu = x["mu"]
, sigma = x["sigma"]
, sam = rnorm(3, x["mu"], x["sigma"])
)
})
Returns
[[1]]
mu sigma sam
1 -0.6982395 0.1690402 -0.592286
2 -0.6982395 0.1690402 -0.516948
3 -0.6982395 0.1690402 -0.804366
[[2]]
mu sigma sam
1 -1.698747 0.2597186 -1.830950
2 -1.698747 0.2597186 -2.087393
3 -1.698747 0.2597186 -1.961376
[[3]]
mu sigma sam
1 0.9913492 0.3069877 0.9629801
2 0.9913492 0.3069877 1.2279697
3 0.9913492 0.3069877 1.1222780
Then, instead of binding the columns and then the rows, you can just bind the rows at the end (also from dplyr):
bind_rows(df_normal_sample_list)
(Edited because I'm a doofus: with replacement, not without.)
I have a large-ish (>500k rows) dataset with 421 groups, defined by two grouping variables. Sample data as follows:
df<-data.frame(group_one=rep((0:9),26), group_two=rep((letters),10))
head(df)
group_one group_two
1 0 a
2 1 b
3 2 c
4 3 d
5 4 e
6 5 f
...and so on.
What I want is some number (k = 12 at the moment, but that number may vary) of stratified samples, by membership in (group_one x group_two). Membership in each group should be indicated by a new column, sample_membership, which has a value of 1 through k (again, 12 at the moment). I should be able to subset by sample_membership and get up to 12 distinct samples, each of which is representative when considering group_one and group_two.
Final data set would thus look something like this:
group_one group_two sample_membership
1 0 a 1
2 0 a 12
3 0 a 5
4 1 a 5
5 1 a 7
6 1 a 9
Thoughts? Thanks very much in advance!
Maybe something like this?:
library(dplyr)
df %>%
group_by(group_one, group_two) %>%
mutate(sample_membership = sample(1:12, n(), replace = TRUE))
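As a rough check (a verification sketch, not part of the original answer), the membership labels should come out roughly uniform over 1:12 across the whole frame:
library(dplyr)
df %>%
  group_by(group_one, group_two) %>%
  mutate(sample_membership = sample(1:12, n(), replace = TRUE)) %>%
  ungroup() %>%
  count(sample_membership)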
Here's a one-line data.table approach, which you should definitely consider if you have a long data.frame.
library(data.table)
setDT(df)
df[, sample_membership := sample.int(12, .N, replace=TRUE), keyby = .(group_one, group_two)]
df
# group_one group_two sample_membership
# 1: 0 a 9
# 2: 0 a 8
# 3: 0 c 10
# 4: 0 c 4
# 5: 0 e 9
# ---
# 256: 9 v 4
# 257: 9 x 7
# 258: 9 x 11
# 259: 9 z 3
# 260: 9 z 8
For sampling without replacement, use replace=FALSE, but as noted elsewhere, make sure no group has more than k members. OR:
If you want to use "sampling without unnecessary replacement" (making this up -- not sure what the right terminology is here) because you have more than k members per group but still want to keep the groups as evenly sized as possible, you could do something like:
# example with bigger groups
k <- 12L
big_df <- data.frame(group_one=rep((0:9),260), group_two=rep((letters),100))
setDT(big_df)
big_df[, sample_round := rep(1:.N, each=k, length.out=.N), keyby = .(group_one, group_two)]
big_df[, sample_membership := sample.int(k, .N, replace=FALSE), keyby = .(group_one, group_two, sample_round)]
head(big_df, 15) # you can see first repeat does not occur until row k+1
Within each "sampling round" (first k observations in the group, second k observations in the group, etc.) there is sampling without replacement. Then, if necessary, the next sampling round makes all k assignments available again.
This approach would really evenly stratify the sample (but perfectly even is only possible if you have a multiple of k members in each group).
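A quick way to see the stratification at work (a verification sketch on the big_df built above): each membership label should appear a near-equal number of times within any one group.
big_df[group_one == 0 & group_two == "a", table(sample_membership)]
# with 20 members in this group and k = 12, each label appears once or twice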
Here is a base R method, that assumes that your data.frame is sorted by groups:
# get number of observations for each group
groupCnt <- with(df, aggregate(group_one, list(group_one, group_two), FUN=length))$x
# for reproducibility, set the seed
set.seed(1234)
# get sample by group
df$sample <- c(sapply(groupCnt, function(i) sample(12, i, replace=TRUE)))
Untested example using dplyr; if it doesn't work, it might point you in the right direction.
library( dplyr )
set.seed(123)
df <- data.frame(
group_one = as.integer( runif( 1000, 1, 6) ),
group_two = sample( LETTERS[1:6], 1000, TRUE)
) %>%
group_by( group_one, group_two ) %>%
mutate(
sample_membership = sample( seq(1, length(group_one) ), length(group_one), FALSE)
)
Good luck!
I am trying to group a column of my data.frame/data.table into three groups, all with equal sums.
The data is first ordered from smallest to largest, such that group one would be made up of a large number of rows with small values, and group three would have a small number of rows with large values. This is accomplished in spirit with:
test <- data.frame(x = as.numeric(1:100000))
store <- 0
total <- sum(test$x)
for (i in 1:100000) {
  store <- store + test$x[i]
  if (store < total/3) {
    test$y[i] <- 1
  } else {
    if (store < 2*total/3) {
      test$y[i] <- 2
    } else {
      test$y[i] <- 3
    }
  }
}
While successful, I feel like there must be a better way (and maybe a very obvious solution that I am missing).
1. I never like resorting to loops, especially with nested ifs, when a vectorized approach is available; with even 100,000+ records this code becomes quite slow.
2. This method would become impossibly complex to code for a larger number of groups (not necessarily the looping, but the ifs).
3. It requires pre-ordering of the column. Might not be able to get around this one.
4. As a nuance (not that it makes a difference), the data to be summed would not always (or ever) be consecutive integers.
Maybe with cumsum:
test$z <- cumsum(test$x) %/% (ceiling(sum(test$x) / 3)) + 1 # which third of the running total each row falls in
This is more or less a bin-packing problem.
Use the binPack function from the BBmisc package:
library(BBmisc)
test$bins <- binPack(test$x, sum(test$x)/3+1)
The sums of the 3 bins are nearly identical:
tapply(test$x, test$bins, sum)
1 2 3
1666683334 1666683334 1666683332
I thought that the cumsum/modulo division approach was very elegant, but it does return a somewhat irregular allocation:
> tapply(test$x, test$z, sum)
1 2 3
1666636245 1666684180 1666729575
> sum(test$x)/3
[1] 1666683333
So I thought I would first create a random permutation and offer something similar:
test$x <- sample(test$x)
test$z2 <- cumsum(test$x)[ findInterval(cumsum(test$x),
c(0, 1666683333*(1:2), sum(test$x)+1))]
> tapply(test$x, test$z2, sum)
91099 116379 129539
1666676164 1666686837 1666686999
This also achieves a more even distribution of counts:
> table(test$z2)
91099 116379 129539
33245 33235 33520
> table(test$z)
1 2 3
57734 23915 18351
I must admit to puzzlement regarding the naming of the entries in z2. On reflection, they are just cumsum(test$x)[1:3]: z2 indexes the cumulative sums by the interval number (1, 2 or 3) rather than using the interval numbers themselves, so the three distinct values that become the labels are the first three cumulative sums.
Or you can just cut on the cumsum
test$z <- cut(cumsum(test$x), breaks = 3, labels = 1:3)
or use ggplot2::cut_interval instead of cut:
test$z <- cut_interval(cumsum(test$x), n = 3, labels = 1:3)
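As with the other cumsum-based answers, you can check the resulting group sums (a quick verification sketch, rebuilding the test data from the question):
test <- data.frame(x = as.numeric(1:100000))
test$z <- cut(cumsum(test$x), breaks = 3, labels = 1:3)
tapply(test$x, test$z, sum)
# each of the three sums should come out close to sum(test$x)/3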
You can use fold() from groupdata2 and get an almost equal number of elements per group:
# Create data frame
test <- data.frame(x = as.numeric(1:100000))
# Use fold() to create 3 numerically balanced groups
test <- groupdata2::fold(test, k = 3, num_col = "x")
# Look at the first 10 rows
head(test, 10)
## # A tibble: 10 x 2
## # Groups: .folds [3]
## x .folds
## <dbl> <fct>
## 1 1 1
## 2 2 3
## 3 3 2
## 4 4 1
## 5 5 2
## 6 6 2
## 7 7 1
## 8 8 3
## 9 9 2
## 10 10 3
# Check the sum and number of elements per group
library(dplyr) # for the pipe
test %>%
dplyr::group_by(.folds) %>%
dplyr::summarize(sum_ = sum(x),
n_members = dplyr::n())
## # A tibble: 3 x 3
## .folds sum_ n_members
## <fct> <dbl> <int>
## 1 1 1666690952 33333
## 2 2 1666716667 33334
## 3 3 1666642381 33333