How to go parallel in R

I have a dataset as below:
Custid Product
12 A
12 B
12 C
13 A
13 B
13 D
14 B
14 D
14 E
15 A
15 E
15 B
16 C
16 A
16 D
So I have 5 distinct products (A, B, C, D, E) across customers, each customer getting 3. Now I want 5 text files, one per product, with the custids in them. For example:
the text file for A should have the custids:
12
13
15
16
and similarly the other products should have text files with the custids that are assigned those products.
Is there a way to do this via parallel processing in R, as I have millions of records of such data?

# One file per product; x is that product's subset of rows and x[1, 2] is the product name
by(dat, dat$Product, function(x) write.csv(x, paste0(x[1, 2], ".txt")))
Now go to your working directory and check that these files exist, or try reading one from your console: read.csv("A.txt")
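If the files should contain only the custids rather than the whole rows, a minimal variant of the same idea (assuming the data frame is called dat, as above):
by(dat, dat$Product, function(x)
  write.table(x$Custid, paste0(x[1, 2], ".txt"),
              row.names = FALSE, col.names = FALSE))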

To do it in parallel, use the parallel package.
library(parallel)
# split the data into one data.frame per product
lst <- split(x = dat, f = dat$Product)
# write each product's custids to its own file, one id per line
mcmapply(function(t, n) write(t$Custid, paste0(n, ".txt"), ncolumns = 1, append = TRUE),
         lst, names(lst), mc.preschedule = TRUE)
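Note that mcmapply() relies on forking, which is not available on Windows. A cluster-based sketch that also works there, assuming the same lst as above:
library(parallel)
cl <- makeCluster(detectCores() - 1)   # leave one core free
parLapply(cl, names(lst), function(n, lst) {
  write(lst[[n]]$Custid, paste0(n, ".txt"), ncolumns = 1)
}, lst = lst)
stopCluster(cl)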

Related

Write a function in R - calculate value from historical records and add to future records

I have the following dataset
Name<-c('A','A','B','C','B','C','D','B','C','A','D','C','B','C','A','D','C','B','A','D','C','B')
Rate<-c(12,13,4,8,7,3,6,8,5,4,7,5,9,4,7,2,7,3,9,13,14,12)
Date<-c('1998-11-11', '1992-12-01','2010-06-17', '2001-10-3','2019-4-01', '2020-4-23','2021-2-01', '1995-12-01',
'1994-7-11', '2023-3-01','2022-06-17', '1982-10-3','1898-4-01', '2027-4-23','1927-2-01', '2028-12-01',
'1993-5-21', '2013-2-09','2020-01-17', '1987-4-3','1881-5-01', '2024-5-23')
df<-cbind.data.frame(Name,Rate, Date)
df
Name Rate Date
1 A 12 1998-11-11
2 A 13 1992-12-01
3 B 4 2010-06-17
4 C 8 2001-10-3
5 B 7 2019-4-01
6 C 3 2020-4-23
7 D 6 2021-2-01
8 B 8 1995-12-01
9 C 5 1994-7-11
10 A 4 2023-3-01
11 D 7 2022-06-17
12 C 5 1982-10-3
13 B 9 1898-4-01
14 C 4 2027-4-23
15 A 7 1927-2-01
16 D 2 2028-12-01
17 C 7 1993-5-21
18 B 3 2013-2-09
19 A 9 2020-01-17
20 D 13 1987-4-3
21 C 14 1881-5-01
22 B 12 2024-5-23
I want to write a function in R to do the following:
Find the standard deviation for each type of Name (A, B, C, D) over the historical data. Historical data is any record with Date before Dec 2018; future records are not used to calculate the SD for a Name. I then want to add each Name's historical SD to the future Rates of that Name (A, B, C, D). Future Rates are the ones with Date after Dec 2018. Could anyone please help me write this function?
Below is the function I am working on; it needs dplyr for mutate() and zoo for as.yearmon():
library(dplyr)
library(zoo)
with(mutate(df, timediff = as.yearmon(Date) - as.yearmon(Sys.Date())),
     tapply(df$Rate, Name, function(x) {
       ifelse(timediff < 0,
              x + sd(x),
              x)
     }, simplify = FALSE))
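For reference, a minimal dplyr sketch (not part of the original attempt) that treats anything dated before 2018-12-01 as historical and adds each Name's historical SD to that Name's future Rates:
library(dplyr)
cutoff <- as.Date("2018-12-01")
df %>%
  mutate(Date = as.Date(Date)) %>%
  group_by(Name) %>%
  mutate(hist_sd  = sd(Rate[Date < cutoff]),   # NA when a Name has fewer than 2 historical rows
         Rate_adj = ifelse(Date >= cutoff, Rate + hist_sd, Rate)) %>%
  ungroup()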

How to convert overlapping ranges to contiguous ranges?

I have overlapping ranges like this
data
G S E
o 10 15
o 13 20
r 20 28
r 25 33
I am trying to convert this into the table shown below.
G S E
o 10 12
o 13 19
r 20 24
r 25 33
I have been trying to use an ifelse condition and data.table shift to access values from the next row for comparison, but have not succeeded yet. Any suggestion will be greatly appreciated.
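Since the question mentions data.table shift, here is a minimal sketch of that idea, assuming rows are already ordered by S and that each end should stop just before the next row's start:
library(data.table)
dt <- data.table(G = c("o", "o", "r", "r"),
                 S = c(10, 13, 20, 25),
                 E = c(15, 20, 28, 33))
# cap each E at the following row's S - 1; the last row keeps its own E
dt[, E := pmin(E, shift(S, type = "lead") - 1, na.rm = TRUE)]
dt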

Adding extreme value distributed noise (with µ=0,σ=10) to a vector of numbers in R

I have the following matrix
Measurement Treatment
38 A
14 A
54 A
69 A
20 B
36 B
35 B
10 B
11 C
98 C
88 C
14 C
I want to add extreme value distributed noise (with mean=0 and sd=10) to the Measurement values. How can I achieve that in R?
I found revd in the extRemes package, but it does not work as expected. Does devd from the same package do what I want? (It does not allow a mean and sd to be specified.)
If you want to use your measure as the mean for the noise, then you can do this:
measure = round(runif(10, 0, 30), 0)
data = data.frame(measure)
for (i in 1:nrow(data)) {
  data$measure1[i] = rnorm(1, data$measure[i], 10)
}
data
measure measure1
1 6 6.281557
2 12 -5.780177
3 18 13.529773
4 26 33.665584
5 14 12.666614
6 24 41.146132
7 5 -1.850390
8 14 16.728703
9 13 26.082601
10 13 14.066475
EDIT: You can avoid the for loop with this instead:
data$measure1 = data$measure + rnorm(nrow(data), 0, 10)
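If genuinely extreme-value (Gumbel) noise is wanted rather than Gaussian noise, a sketch using extRemes::revd (assuming its loc/scale/shape arguments): for a Gumbel distribution, mean = loc + 0.5772*scale and sd = pi*scale/sqrt(6), so loc and scale below are chosen to give mean 0 and sd 10.
library(extRemes)
scale <- 10 * sqrt(6) / pi       # ~7.80, makes the sd equal 10
loc   <- -0.5772156649 * scale   # ~-4.50, shifts the mean to 0
Measurement <- c(38, 14, 54, 69, 20, 36, 35, 10, 11, 98, 88, 14)
noisy <- Measurement + revd(length(Measurement), loc = loc, scale = scale, shape = 0)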

How to assign a value to users in a data.frame based on user ID records from another data.frame

I have read an Excel file into R, where sheet 1 has 51,500 rows and 5 columns and sheet 2 has the user IDs of buyers (only one column). Objective: extract the users in sheet 1 whose User IDs occur in sheet 2.
Here are two example input files and the desired output:
df <- data.frame(User.ID=c(12: 17), Group="Test", Spend=c(15:20), Purchase=c(5:10))
df
User.ID Group Spend Purchase
1 12 Test 15 5
2 13 Test 16 6
3 14 Test 17 7
4 15 Test 18 8
5 16 Test 19 9
6 17 Test 20 10
hash.ID <- data.frame(User.ID= c(13:16))
User.ID
1 13
2 14
3 15
4 16
desired output :
User.ID Group Spend Purchase Redem_Status
1 12 Test 15 5 Test_NonRedeemer
2 13 Test 16 6 Test_Redeemer
3 14 Test 17 7 Test_Redeemer
4 15 Test 18 8 Test_Redeemer
5 16 Test 19 9 Test_Redeemer
6 17 Test 20 10 Test_NonRedeemer
Based on the above example, if a User.ID from df exists in the hash.ID table, then we add a new column and label it Test_Redeemer, otherwise we label it Test_NonRedeemer. Is there any straightforward approach that can do this task? Thanks a lot!
The test case you presented helped, thanks. As mentioned in the comments, you need to subset the rows you're interested in and assign them a value. By placing ! in front of the statement (note the parentheses!) you negate it and thus select all records not selected in the previous call.
df[df$User.ID %in% hash.ID$User.ID, "Redem_Status"] <- "Test_Redeemer"
df[!(df$User.ID %in% hash.ID$User.ID), "Redem_Status"] <- "Test_NonRedeemer"
df
User.ID Group Spend Purchase Redem_Status
1 12 Test 15 5 Test_NonRedeemer
2 13 Test 16 6 Test_Redeemer
3 14 Test 17 7 Test_Redeemer
4 15 Test 18 8 Test_Redeemer
5 16 Test 19 9 Test_Redeemer
6 17 Test 20 10 Test_NonRedeemer
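An equivalent one-liner sketch with ifelse(), assigning both labels in one step:
df$Redem_Status <- ifelse(df$User.ID %in% hash.ID$User.ID,
                          "Test_Redeemer", "Test_NonRedeemer")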

Is there a way to "auto-name" an expression in data.table's j?

I have a few questions/suggestions concerning data.table.
R) X = data.table(x=c("q","q","q","w","w","e"),y=1:6,z=10:15)
R) X[,list(sum(y)),by=list(x)]
x V1
1: q 6
2: w 9
3: e 6
I think it is too bad that one has to write
R) X[,list(y=sum(y)),by=list(x)]
x y
1: q 6
2: w 9
3: e 6
It should default to keeping the same column name (i.e. y) when the expression uses only one column. This would be a massive gain in most cases, typically in finance, as we usually look at weighted sums or the last value, etc.
=> Is there any variable I can set to default to this behaviour?
When doing a select, I might want to do a calculation on a few columns and apply another operation to all other columns.
It is too bad that when I want this:
R) X = data.table(x=c("q","q","q","w","w","e"),y=1:6,z=10:15,t=20:25,u=30:35)
R) X
x y z t u
1: q 1 10 20 30
2: q 2 11 21 31
3: q 3 12 22 32
4: w 4 13 23 33
5: w 5 14 24 34
6: e 6 15 25 35
R) X[,list(y=sum(y),z=last(z),t=last(t),u=last(u)),by=list(x)] #LOOOOOOOOOOONGGGG
#EXPR
x y z t u
1: q 6 12 22 32
2: w 9 14 24 34
3: e 6 15 25 35
I cannot write it like...
R) X[,list(sum(y)),by=list(x),defaultFn=last] # defaultFn would be applied to all remaining columns
=> Can I do this somehow (may be setting an option)?
Thanks
On part 1, that's not a bad idea. We already do that for expressions in by, and something close is already on the list for j:
FR#2286 Inferred naming could apply to j=colname[...]
Find max per group and return another column
But if we did do that it would probably need to be turned on via an option, to maintain backwards compatibility. I've added a link in that FR back to this question.
On the 2nd part, how about:
X[,c(y=sum(y),lapply(.SD,last)[-1]),by=x]
x y z t u
1: q 6 12 22 32
2: w 9 14 24 34
3: e 6 15 25 35
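An equivalent sketch that names the kept columns explicitly via .SDcols (assumes data.table is attached):
X[, c(list(y = sum(y)), lapply(.SD, last)), by = x, .SDcols = c("z", "t", "u")]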
Please ask multiple questions separately, though. Each question on S.O. is supposed to be a single question.
