Product calculation by group in R data.table

I'm currently working on transforming a dataset to take the running product of all previous observations in a data.table. This is easy to implement in Excel, but I am struggling to find a non-recursive solution in data.table. The data below are in short form: the real data have thousands more levels of ID and thousands of X's per ID. Each ID has the same number of X's.
| index | ID | X |
|-------|----|------|
| 1 | 1 | 0.8 |
| 2 | 1 | 0.75 |
| 3 | 1 | 0.72 |
| 4 | 2 | 0.9 |
| 5 | 2 | 0.5 |
| 6 | 2 | 0.45 |
What I want to end up with is the following
| index | ID | X | product |
|-------|----|------|---------|
| 1 | 1 | 0.8 | 0.8 |
| 2 | 1 | 0.75 | 0.6 |
| 3 | 1 | 0.72 | 0.432 |
| 4 | 2 | 0.9 | 0.9 |
| 5 | 2 | 0.5 | 0.45 |
| 6 | 2 | 0.45 | 0.2025 |
Where product is equal to X multiplied by all previous values of X for that particular ID. This can be done in a for loop, but I am looking for a solution that leverages data.table so it can be run on a cluster.
Reproducible data:
df <- fread('
index ID X
1 1 0.8
2 1 0.75
3 1 0.72
4 2 0.9
5 2 0.5
6 2 0.45
')

You can use cumprod
# If data.table not already loaded, these steps are required first
# library(data.table)
# setDT(df)
df[, Xprod := cumprod(X), ID][]
# index ID X Xprod
# 1: 1 1 0.80 0.8000
# 2: 2 1 0.75 0.6000
# 3: 3 1 0.72 0.4320
# 4: 4 2 0.90 0.9000
# 5: 5 2 0.50 0.4500
# 6: 6 2 0.45 0.2025
If you need to apply a function other than prod, you can use frollapply. For example, the code below gives the same result as the code above.
df[, Xprod := frollapply(X, 1:.N, prod, adaptive = TRUE), by = ID]
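One caveat for the real data: with thousands of X values per ID, a running product of numbers below 1 can underflow to zero in floating point. A minimal sketch of a log-space alternative, assuming all X are strictly positive:
# Equivalent to cumprod(X) in exact arithmetic, but computed in log
# space to avoid underflow (requires X > 0)
df[, Xprod := exp(cumsum(log(X))), by = ID]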

Related

R, Friedman's test 'not an unreplicated complete block design' error?

I am trying to do a Friedman's test; yes, my data is repeated measures, but it is nonparametric.
The data, from a csv imported with RStudio's Import Dataset function (so it is a table in RStudio), is organized like this:
score | treatment | day
10    | 1         | 1
20    | 1         | 1
40    | 1         | 1
7     | 2         | 1
100   | 2         | 1
58    | 2         | 1
98    | 3         | 1
89    | 3         | 1
40    | 3         | 1
70    | 4         | 1
10    | 4         | 1
28    | 4         | 1
86    | 5         | 1
200   | 5         | 1
40    | 5         | 1
77    | 1         | 2
100   | 1         | 2
90    | 1         | 2
33    | 2         | 2
15    | 2         | 2
25    | 2         | 2
23    | 3         | 2
54    | 3         | 2
67    | 3         | 2
1     | 4         | 2
2     | 4         | 2
400   | 4         | 2
16    | 5         | 2
10    | 5         | 2
90    | 5         | 2
library(readr)
sample_data$treatment <- as.factor(sample_data$treatment) #setting treatment as categorical independent variable
sample_data$day <- as.factor(sample_data$day) #setting day as categorical independent variable
summary(sample_data)
#attach(sample_data) #not sure if this should be used only because according to https://www.sheffield.ac.uk/polopoly_fs/1.714578!/file/stcp-marquier-FriedmanR.pdf it says to use attach for R to use the variables directly
friedman3 <- friedman.test(y = sample_data$score, groups = sample_data$treatment, blocks = sample_data$day)
summary(friedman3)
I am interested in day and score using Friedman's.
This is the error I get:
Error in friedman.test.default(y = sample_data$score, groups = sample_data$treatment, blocks = sample_data$day) :
  not an unreplicated complete block design
Not sure what is wrong.
Prior to writing the Friedman part of the code, I only specified day and treatment as categorical using as.factor
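For context on the error: friedman.test requires an unreplicated complete block design, meaning exactly one observation per groups/blocks cell, whereas the data above have three scores per treatment/day combination. A minimal sketch of one possible workaround, aggregating the replicates (here by their mean) before testing; whether averaging is appropriate depends on the study design:
# Collapse the three replicate scores in each treatment/day cell to one
# observation, as friedman.test requires
agg <- aggregate(score ~ treatment + day, data = sample_data, FUN = mean)
# Formula interface: y ~ groups | blocks
friedman.test(score ~ treatment | day, data = agg)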

r data.table groupby join in pyspark 1.6

I have the following data.tables (R code):
accounts <- fread("ACC_ID | DATE | RATIO | VALUE
1 | 2017-12-31 | 2.00 | 8
2 | 2017-12-31 | 2.00 | 12
3 | 2017-12-31 | 6.00 | 20
4 | 2017-12-31 | 1.00 | 5 ", sep='|')
timeline <- fread(" DATE
2017-12-31
2018-12-31
2019-12-31
2020-12-31", sep="|")
In R, I know I can join on DATE, by ACC_ID, RATIO and VALUE:
accounts[, .SD[timeline, on='DATE'], by=c('ACC_ID', 'RATIO', 'VALUE')]
This way, I can "project" ACC_ID, RATIO and VALUE values over timeline dates, getting the following data table:
ACC_ID | RATIO | VALUE | DATE
1 | 2 | 8 |2017-12-31
2 | 2 | 12 |2017-12-31
3 | 6 | 20 |2017-12-31
4 | 1 | 5 |2017-12-31
1 | 2 | 8 |2018-12-31
2 | 2 | 12 |2018-12-31
3 | 6 | 20 |2018-12-31
4 | 1 | 5 |2018-12-31
1 | 2 | 8 |2019-12-31
2 | 2 | 12 |2019-12-31
3 | 6 | 20 |2019-12-31
4 | 1 | 5 |2019-12-31
1 | 2 | 8 |2020-12-31
2 | 2 | 12 |2020-12-31
3 | 6 | 20 |2020-12-31
4 | 1 | 5 |2020-12-31
I've been trying hard to find something similar in PySpark, but I haven't been able to. What would be the appropriate way to solve this?
Thanks very much for your time. I greatly appreciate any help you can give me; this one is important to me.
It looks like you're trying to do a cross join?
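For spark.sql to reference the two tables by name, the DataFrames first need to be registered as temporary tables; a minimal sketch, assuming accounts and timeline are already DataFrames (Spark 1.6 API):
# Register the DataFrames so the SQL query below can see them by name
accounts.registerTempTable('accounts')
timeline.registerTempTable('timeline')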
spark.sql('''
select ACC_ID, RATIO, VALUE, timeline.DATE
from accounts, timeline
''')

Cross-table for subset in R

I have the following data frame (simplified):
IPET Task Type
1 1 1
1 2 2
1 3 1
2 1 1
2 1 2
How can I create a cross table (using the CrossTable function in gmodels, because I need to do a chi-square test), but only for rows where Type equals 1?
You probably want this.
library(gmodels)
with(df.1[df.1$Type==1, ], CrossTable(IPET, Task))
Yielding
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 3
| Task
IPET | 1 | 3 | Row Total |
-------------|-----------|-----------|-----------|
1 | 1 | 1 | 2 |
| 0.083 | 0.167 | |
| 0.500 | 0.500 | 0.667 |
| 0.500 | 1.000 | |
| 0.333 | 0.333 | |
-------------|-----------|-----------|-----------|
2 | 1 | 0 | 1 |
| 0.167 | 0.333 | |
| 1.000 | 0.000 | 0.333 |
| 0.500 | 0.000 | |
| 0.333 | 0.000 | |
-------------|-----------|-----------|-----------|
Column Total | 2 | 1 | 3 |
| 0.667 | 0.333 | |
-------------|-----------|-----------|-----------|
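Since you mention needing a chi-square test, note that CrossTable can report one directly via its chisq argument; the same subset as above:
# chisq = TRUE appends Pearson's chi-squared test of independence to the output
with(df.1[df.1$Type==1, ], CrossTable(IPET, Task, chisq = TRUE))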
Data
df.1 <- read.table(header=TRUE, text="IPET Task Type
1 1 1
1 2 2
1 3 1
2 1 1
2 1 2")

Combining different data frames depending on the values of the columns [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 6 years ago.
I'm a bit new to R and I have to work with it.
I have 2 data frames: df1 and df2
df1:
x | y  | z    | q
1 | 00 | 1.99 | 5
1 | 10 | 2.05 | 11
1 | 12 | 1.89 | 9
1 | 20 | 1.75 | 7
2 | 05 | 3.25 | 3
2 | 15 | 3.35 | 0
2 | 26 | 3.10 | 6

df2:
x | y  | w
1 | 00 | 1.34
1 | 12 | 1.69
2 | 15 | 2.99
And I would like to create a new data frame (ndf) combining df1 and df2 like this:
x | y | z | q | w
1 | 00 | 1.99 | 5 | 1.34
1 | 10 | 2.05 | 11 | NA
1 | 12 | 1.89 | 9 | 1.69
1 | 20 | 1.75 | 7 | NA
2 | 05 | 3.25 | 3 | NA
2 | 15 | 3.35 | 0 | 2.99
2 | 26 | 3.10 | 6 | NA
How can I obtain this data frame in R? Can someone help me, please?
You can use merge to combine two data frames.
merge(df1, df2, all = TRUE)
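By default merge joins on the columns the two frames share (here x and y), and all = TRUE keeps unmatched rows from both sides, i.e. a full outer join. The same call with the defaults written out explicitly:
# Full outer join on the shared key columns x and y;
# df1 rows with no match in df2 get NA in w
merge(df1, df2, by = c("x", "y"), all = TRUE)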
The result:
x y z q w
1 1 0 1.99 5 1.34
2 1 10 2.05 11 NA
3 1 12 1.89 9 1.69
4 1 20 1.75 7 NA
5 2 5 3.25 3 NA
6 2 15 3.35 0 2.99
7 2 26 3.10 6 NA

r bin equal deciles

I have a dataset containing over 6,000 observations, each record having a score ranging from 0-100. Below is a sample:
+-----+-------+
| uID | score |
+-----+-------+
| 1 | 77 |
| 2 | 61 |
| 3 | 74 |
| 4 | 47 |
| 5 | 65 |
| 6 | 51 |
| 7 | 25 |
| 8 | 64 |
| 9 | 69 |
| 10 | 52 |
+-----+-------+
I want to bin them into equal deciles based on their rank order relative to their peers within the score column, with cutoffs at every 10th percentile, as seen below:
+-----+-------+-----------+----------+
| uID | score | position% | scoreBin |
+-----+-------+-----------+----------+
| 7 | 25 | 0.1 | 1 |
| 4 | 47 | 0.2 | 2 |
| 6 | 51 | 0.3 | 3 |
| 10 | 52 | 0.4 | 4 |
| 2 | 61 | 0.5 | 5 |
| 8 | 64 | 0.6 | 6 |
| 5 | 65 | 0.7 | 7 |
| 9 | 69 | 0.8 | 8 |
| 3 | 74 | 0.9 | 9 |
| 1 | 77 | 1 | 10 |
+-----+-------+-----------+----------+
So far I've tried cut, cut2, tapply, etc. I think I'm on the right logical path, but I have no idea how to apply them to my situation. Any help is greatly appreciated.
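For the answers below, a hedged reconstruction of the sample data as a data frame named df:
# Hypothetical reconstruction of the sample shown above
df <- data.frame(uID = 1:10,
                 score = c(77, 61, 74, 47, 65, 51, 25, 64, 69, 52))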
I would use ntile() in dplyr.
library(dplyr)
score<-c(77,61,74,47,65,51,25,64,69,52)
ntile(score, 10)
##[1] 10 5 9 2 7 3 1 6 8 4
scoreBin<- ntile(score, 10)
In base R we can use a combination of .bincode() and quantile():
df$new <- .bincode(df$score,
                   breaks = quantile(df$score, seq(0, 1, by = 0.1)),
                   include.lowest = TRUE)
# uID score new
#1 1 77 10
#2 2 61 5
#3 3 74 9
#4 4 47 2
#5 5 65 7
#6 6 51 3
#7 7 25 1
#8 8 64 6
#9 9 69 8
#10 10 52 4
Here is a method that uses quantile together with cut to get the bins:
# Note: include.lowest is an argument of cut(), not quantile(); without it
# the minimum score falls outside the lowest interval and becomes NA
df$scoreBin <- as.integer(cut(df$score,
                              breaks = quantile(df$score, seq(0, 1, .1)),
                              include.lowest = TRUE))
as.integer coerces the output of cut (which is a factor) into the underlying integer.
One way to get the position percent is to use rank:
df$position <- rank(df$score) / nrow(df)
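With position computed this way, the decile bin follows directly from it; a small sketch, assuming no ties in score (ties produce fractional average ranks):
# Each decile spans 10% of the positions, so scale up and round
df$scoreBin <- ceiling(df$position * 10)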
