My code looks as below; I am wondering if there is any better way to make it faster:
pos = NULL
row = data.frame(matrix(nrow = 216, ncol = 4))
colnames(row) = c("sub", "subi", "group", "trial")
for (i in 1:100000){
  row$sub = "Positive"
  row$subi = NA
  row$group = NA
  row$subi[1:144] = c(1:144)
  row$group[1:144] = 1
  row$subi[145:216] = c(1:72)
  row$group[145:216] = 2
  row$trial = i
  pos = rbind(pos, row)
}
No loop needed. You can build the data.frame or tibble (as in my example) directly.
Given that you may want to adjust the number of rows later:
library(dplyr)
n_rows <- 10000
tibble(
  trial = 1:n_rows,
  sub = "positive",
  subi = c(1:144, 1:72, rep(NA, n_rows - 216)),
  group = c(rep(1, 144), rep(2, 72), rep(NA, n_rows - 216))
)
Output is:
# A tibble: 10,000 × 4
trial sub subi group
<int> <chr> <int> <dbl>
1 1 positive 1 1
2 2 positive 2 1
3 3 positive 3 1
4 4 positive 4 1
5 5 positive 5 1
6 6 positive 6 1
7 7 positive 7 1
8 8 positive 8 1
9 9 positive 9 1
10 10 positive 10 1
# … with 9,990 more rows
The only thing different in each pass of the loop is trial. rep is your friend. For the other columns, R will automatically recycle to match the longest column (here trial, with 21.6 million rows).
pos <- data.frame(
  sub = "Positive",
  subi = c(1:144, 1:72),
  group = rep.int(1:2, c(144, 72)),
  trial = rep(1:1e5, each = 216)
)
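As a quick sanity check on the shape of the result:
nrow(pos)   # 21600000, i.e. 216 rows for each of the 100,000 trials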
It looks like you are trying to replicate this data frame 100,000 times, with each iteration of the frame having a different trial number.
data.frame(sub = rep("Positive", 216),
subi = c(1:144, 1:72),
group = rep(c(1, 2), c(144, 72)))
The replicate function is great for running static code multiple times. So one option is to create your 100,000 copies and then update the trial numbers.
FrameList <-
  replicate(n = 100000,
            {
              data.frame(sub = rep("Positive", 216),
                         subi = c(1:144, 1:72),
                         group = rep(c(1, 2), c(144, 72)),
                         trial = rep(NA_real_, 216))
            },
            simplify = FALSE)
To update the trial number, you can go with a for loop
for (i in seq_along(FrameList)){
  FrameList[[i]]$trial <- i
}
or you can try something fancy-pants, though it takes a bit more code
FrameList <- mapply(function(FL, i){
                      FL$trial <- i
                      FL
                    },
                    FrameList,
                    seq_along(FrameList),
                    SIMPLIFY = FALSE)
Whichever way you go, you can stack them all together with
Frame <- do.call("rbind", FrameList)
This certainly isn't the most elegant way to do this, so watch for others to give you other clever tricks. But this, I would guess, would be the basic process to follow.
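One more trick along those lines: if you keep the list approach, data.table::rbindlist (or dplyr::bind_rows) usually stacks a long list of data frames much faster than do.call("rbind", ...). A minimal sketch, assuming data.table is installed:
library(data.table)
Frame <- rbindlist(FrameList)   # same stacked rows, returned as a data.table rather than a plain data.frame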
In order to deal with product time series, many of which show intermittent demand, I want to measure how large the gaps of zero values within each series are.
In the next step I want to measure the average gap length per id. In my example this would be 4.33 for ID 1.
I found an older solution for measuring gap sizes in time series, but it does not give me the result in a way that I can process further to derive measures like average, minimum, and maximum gap size:
Gap size calculation in time series with R
library(tidyverse)
library(lubridate)
library(data.table)
data <- tibble(id = as.factor(c(rep("1",24),rep("2",24),rep("3",24))),
date = rep(c(ymd("2013-01-01")+ months(0:23)),3),
value = c(c(rep(4,5),0,0,0,0,0,0,0,0,7,0,0,0,0,11,23,54,33,45,0),
c(4,6,1,2,3,4,4,6,8,11,18,6,6,1,7,7,13,9,4,33,3,6,81,45),
c(rep(4,5),0,0,0,5,2,0,0,0,7,0,0,8,0,11,23,54,33,0,0))
)
# this gives me the repeated gap size per observation
setDT(data)
data[, gap := rep(rle(value)$lengths, rle(value)$lengths) * (value == 0)]
# I want the distinct gap size per id
1: c(8,4,1)
2: c(0)
3: c(3,3,2,1,2)
If I were able to determine the number of gaps per id, I could also calculate the mean gap size by retrieving the total number of zeros per id like this (13/3 = 4.33):
# total number of zeros per id
data <- as_tibble(data)
data %>% group_by(id) %>% summarise(zero_sum = length(which(value == 0)))
You could use rle:
library(data.table)
setDT(data)
data[,.(n=with(rle(value==0),lengths*values)),by=id][n>0]
id n
<fctr> <int>
1: 1 8
2: 1 4
3: 1 1
4: 3 3
5: 3 3
6: 3 2
7: 3 1
8: 3 2
or in the expected format:
data[,.(n=list(with(rle(value==0),{r = lengths*values;
r <- r[r!=0];
if (length(r)==0) {r <- 0L};
r }))),by=id]
id n
<fctr> <list>
1: 1 8,4,1
2: 2 0
3: 3 3,3,2,1,2
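Building on that result, a short follow-up (mine, not from the original answer) to get the summary measures asked for, i.e. the average, minimum and maximum gap size per id; note that id 2 drops out here because it has no zero runs:
gaps <- data[, .(n = with(rle(value == 0), lengths * values)), by = id][n > 0]
gaps[, .(mean_gap = mean(n), min_gap = min(n), max_gap = max(n)), by = id]
#    id mean_gap min_gap max_gap
# 1:  1 4.333333       1       8
# 2:  3 2.200000       1       3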
I am new to coding and using R. I am working on a project to simulate the game Liar's Dice, also known as Perudo, and have some questions about creating the simulation.
Basically, the game consists of two or more players rolling five dice in a cup, turning it over, and making bids on how many of a certain side they think are on the table. You can look at your own dice, but not anyone else's. To make a bid, on your turn you would say "two 5's," which would mean there are at least two dice that landed on 5. Each bid must either increase the side or the amount. So if you said "two 5's," I could then say "two 6's" or "three 3's" on my turn.
When you believe the last bid is incorrect, you say "Liar" on your turn, and everyone reveals their dice. If you were wrong, you lose a die, but if you were right, the last bidder loses a die. This continues until only one player has dice left.
First, I decided to create a function called cup() which rolls a cup of five six-sided dice.
cup <- function(sides = 6, dice = 5){
  sample(1:sides, size = dice, replace = TRUE)
}
Next, with a little assistance, I created a new function called cups() which rolls three cups for three players.
cups <- function(players = 3, sides = 6, dice = 5){
  out <- cup(sides, dice)
  for(i in 2:players){
    out <- rbind(out, cup(sides, dice))
  }
  rownames(out) <- paste0("P", 1:players)  # label rows by player, for any number of players
  return(out)
}
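Not part of the original code, but for reference, one quick way to count how many dice show each face across all three cups (which is what the bids are about):
rolls <- cups()
table(factor(rolls, levels = 1:6))   # counts of each face across all 15 dice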
What I want to accomplish next is to create a table of probabilities of possible dice outcomes. In other words, what's the probability of there being at least two of a side given fifteen dice (five for each player) in play? And then the probability of there being three, four, five, etc. all the way up to fifteen in this case.
My question is how would I go about doing this in R? And what direction should I go in after getting the probabilities in R?
Here is an empirical process for breaking down the outcomes of rolling 5 dice by how many distinct faces appear (all the same, down to all different), as percentages:
library(gtools) # package with permutations function
allcombos <- permutations(6, 5, repeats.allowed = TRUE) # all 6^5 ordered rolls of 5 dice
alluniques <- apply(allcombos, 1, unique)   # distinct faces in each roll
alllengths <- sapply(alluniques, length)    # number of distinct faces per roll
alllengths2 <- as.factor(alllengths)        # convert to factor so summary() tallies them
allsum <- summary(alllengths2)              # counts by number of distinct faces
allsum
1 2 3 4 5 # number of distinct faces: 1 = all the same, ..., 5 = all different
6 450 3000 3600 720
totsum <- sum(allsum)
allfrac <- allsum / totsum
allpercent <- allfrac * 100
allpercent
1 2 3 4 5
0.07716049 5.78703704 38.58024691 46.29629630 9.25925926 # percentage breakout
There is no doubt an analytical solution, but I don't know what it is. You could use standard probability calculations to estimate specific outcomes among multiple players, e.g. P(at least one four-of-a-kind | 3 players), or run some simulations.
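If you go the simulation route, here is a minimal sketch (the function name and defaults are mine, not from the question) that estimates the probability that at least some number of dice show the same face when all 15 dice are on the table:
set.seed(1)
sim_at_least <- function(target, n_sims = 1e5, n_dice = 15, sides = 6) {
  hits <- replicate(n_sims, {
    roll <- sample(sides, n_dice, replace = TRUE)   # one full table of dice
    max(tabulate(roll, nbins = sides)) >= target    # TRUE if some face appears >= target times
  })
  mean(hits)   # proportion of simulated tables where that happened
}
sim_at_least(4)   # estimated P(at least one face appears 4 or more times)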
Here's likely more than you asked for, but it focuses on the number of sides on the dice, the total number of dice, and the probability of rolling Nrolled or more:
dicegame <- function(Nsides = 6,
                     Ndice = 5,
                     Nrolled = 1,
                     verbose = FALSE)
{
  total_possible_outcomes <- choose(Nsides + Ndice - 1, Ndice)
  outcomes_matrix <- t(combn(Nsides + Ndice - 1,
                             Ndice,
                             sort)) - matrix(rep(c(0:(Ndice - 1)),
                                                 each = total_possible_outcomes),
                                             nrow = total_possible_outcomes)
  chances <- sum(apply(outcomes_matrix, 1, function(x) sum(x == 2)) >= Nrolled) / total_possible_outcomes
  if(verbose) {
    cat(paste("Number of dice",
              Ndice,
              "each with", Nsides, "sides",
              "chances of rolling", Nrolled,
              "\n or more of any one side are:\n"))
  }
  return(chances)
  # return(total_possible_outcomes)
  # return(outcomes_matrix)
}
dicegame(verbose = TRUE)
#> Number of dice 5 each with 6 sides chances of rolling 1
#> or more of any one side are:
#> [1] 0.5
dicegame(6, 15, 10)
#> [1] 0.01625387
Using probability theory we can show that the chance of a given face appearing exactly n times among N = players * dice dice is binomial:

P(X = n) = choose(N, n) * (1/sides)^n * (1 - 1/sides)^(N - n)

We can easily write this as an R function:
prob_get_n <- function(ntimes, players = 3, dice = 5, sides = 6){
  if(missing(ntimes)) ntimes <- 0:(players*dice)
  choose(players*dice, ntimes) * (1 - 1/sides)^((players*dice) - ntimes) * sides^(-ntimes)
}
Notice that this function is vectorised by construction, i.e. it accepts inputs such as 1:2 or c(9, 5).
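For example, requesting two counts at once returns both probabilities (they match rows 2 and 3 of the table further down):
prob_get_n(1:2)
# [1] 0.1947164 0.2726030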
prob_get_n() -> probs
data.frame(ntimes = 1:length(probs) - 1, probs = probs, or_more = rev(cumsum(rev(probs))))
ntimes probs or_more
1 0 6.490547e-02 1.000000e+00
2 1 1.947164e-01 9.350945e-01
3 2 2.726030e-01 7.403781e-01
4 3 2.362559e-01 4.677751e-01
5 4 1.417535e-01 2.315192e-01
6 5 6.237156e-02 8.976567e-02
7 6 2.079052e-02 2.739411e-02
8 7 5.346134e-03 6.603585e-03
9 8 1.069227e-03 1.257451e-03
10 9 1.663242e-04 1.882242e-04
11 10 1.995890e-05 2.190005e-05
12 11 1.814445e-06 1.941153e-06
13 12 1.209630e-07 1.267076e-07
14 13 5.582909e-09 5.744548e-09
15 14 1.595117e-10 1.616385e-10
16 15 2.126822e-12 2.126822e-12
Edit
Or we can use R's built-in dbinom function to get the distribution and pbinom to get the cumulative probabilities:
probs <- function(ntimes, players = 3, dice = 5, sides = 6){
  if(missing(ntimes)) ntimes <- 0:(players*dice)
  data.frame(ntimes = ntimes,
             probs = dbinom(ntimes, players*dice, 1/sides),
             or_more = 1 - pbinom(ntimes - 1, players*dice, 1/sides))
}
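Calling it with the defaults gives the table below:
probs()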
ntimes probs or_more
1 0 6.490547e-02 1.000000e+00
2 1 1.947164e-01 9.350945e-01
3 2 2.726030e-01 7.403781e-01
4 3 2.362559e-01 4.677751e-01
5 4 1.417535e-01 2.315192e-01
6 5 6.237156e-02 8.976567e-02
7 6 2.079052e-02 2.739411e-02
8 7 5.346134e-03 6.603585e-03
9 8 1.069227e-03 1.257451e-03
10 9 1.663242e-04 1.882242e-04
11 10 1.995890e-05 2.190005e-05
12 11 1.814445e-06 1.941153e-06
13 12 1.209630e-07 1.267076e-07
14 13 5.582909e-09 5.744548e-09
15 14 1.595117e-10 1.616385e-10
16 15 2.126822e-12 2.126743e-12
I need to create a new dataframe using multiple conditions on an existing dataframe.
I tried using dplyr, summarise in particular, for the multiple conditions, but failed because the dataset size decreases once the conditions are applied.
For explanation, below is a simple sample of what I am trying to achieve.
df <- data.frame(User = c("Newton","Newton","Newton","Newton","Newton"),
Location = c("A","A","B","A","B"),
Movement = c(10,10,20,20,30),
Unit = c(-2,2,2,-2,-1),
Time = c("4-20-2019","4-20-2019","4-21-2019","4-21-2019"
,"4-23-2019"))
dfNew <- data.frame(User = c("Newton","Newton","Newton"),
FromLocation = c("A","A","B"),
ToLocation = c("A","B","B"),
Movement = c(10,20,30),
Units = c(2,2,-1))
The conditions used to calculate dfNew are as follow:
Looking at the first line of df:
a) if movement is 10 and unit is negative - ignore this line
Looking at the second line of df:
a) if movement is 10 and unit is positive - FromLocation and ToLocation are both A, and Units is taken from df which is 2
Looking at the third line of df:
a) if movement is 20 and unit is positive - ToLocation (B) and Units (2) has to be taken from this line and FromLocation has to be taken from the next line
Looking at the fourth line of df:
a) if movement is 20 and unit is negative - FromLocation(A) for the previous line of dfnew has to be taken from this line
Looking at the fifth line of df:
a) if movement type is 30, then ToLocation and FromLocation will both be B and the units will be the same as df which is -1
Another pattern that could be useful is that each movement occurs on the same day/time. Also, please note that the example is for only one user; I have more than 2,000 users to which similar conditions have to be applied.
Like I said, I tried using dplyr and summarise to put all these conditions in, but since the size of the dataset changes I could not find a way to make it work.
Appreciate any advice, thank you!
It sounds like dplyr::group_by and case_when might suffice, but I'm not sure these are the right interpretations of the "rules" for your table.
library(dplyr)
df %>%
group_by(User) %>%
mutate(FromLocation = case_when(Movement == 10 & Unit < 0 ~ "DROP",
Movement == 10 & Unit > 0 ~ Location,
Movement == 20 & Unit < 0 ~ lag(Location),
Movement == 20 & Unit > 0 ~ lead(Location),
Movement == 30 ~ "B",
TRUE ~ "not specified in rules"),
ToLocation = case_when(Movement == 10 & Unit < 0 ~ "DROP",
Movement == 10 & Unit > 0 ~ Location,
Movement == 20 & Unit < 0 ~ lag(Location), # Not given
Movement == 20 & Unit > 0 ~ Location,
Movement == 30 ~ "B",
TRUE ~ "not specified in rules")) %>%
ungroup() %>%
filter(FromLocation != "DROP") %>%
select(User, FromLocation, ToLocation, Movement, Unit)
Results
# A tibble: 4 x 5
User FromLocation ToLocation Movement Unit
<chr> <chr> <chr> <dbl> <dbl>
1 Newton A A 10 2
2 Newton A B 20 2
3 Newton B B 20 -2
4 Newton B B 30 -1
I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude data that is within the lower quartile. I would then like to rewrite the data without those values and use the new column of data in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way I can write this so that it is easy to change the threshold by applying arguments from Java (as I have with the input file name) that's even better!
Thank you so much.
I have now implemented the answer below and it is working; however, I need to keep the pairs of data together for the correlation. Here is an example of my data (from csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in a row if one does not meet my quartile threshold (the 0.25 quantile). For example, if the threshold for O were 45000, then the row "42046.61549, 152.1321255" would be removed. Is this possible? If I read both columns in as a data frame, can I check each column separately? Or find the quartiles and then use those values in code to remove the appropriate rows?
Thanks again, and sorry for the evolution of the question!
Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
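Applied to the two-column case in the question, a sketch (assuming the columns are read into one data frame as Abundance_O and Abundance_S): drop the whole row if either value falls in its column's bottom quartile, then correlate what is left.
Values <- read.csv(inputFile, header = TRUE)
keep <- Values$Abundance_O > quantile(Values$Abundance_O, 0.25) &
        Values$Abundance_S > quantile(Values$Abundance_S, 0.25)
ValuesTrim <- Values[keep, ]   # rows where both columns clear the threshold
cor(ValuesTrim$Abundance_O, ValuesTrim$Abundance_S)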
Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function which converts a numerical vector into its quantile groups. The parameter n determines the number of quantile groups (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
  qtile = quantile(numvec, probs = seq(0, 1, 1/n))
  out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
  return(out)
}
Function example:
v = rep(1:20)
> qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
library(data.table)

dt = data.table(
  A0 = runif(100),
  A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
           A0        A1 Q0 Q1
1: 0.72121846 0.1908863 3 1
2: 0.70373594 0.4389152 3 2
3: 0.04604934 0.5301261 1 3
4: 0.10476643 0.1108709 1 1
5: 0.76907762 0.4913463 4 2
6: 0.38265848 0.9291649 2 4
Lastly, we only include rows for which both quartile groups are above the first quartile:
dt = dt[Q0 > 1 & Q1 > 1]