Vectorization in R

I am new to R and have read up on vectorization, but I am still trying to train my mind to think in vectorized terms. Too often, examples of vectorization replacing loops are either so simple that I find it difficult to generalize from them, or absent altogether.
Can anyone suggest how I can vectorize the following?
Model2 <- subset(Cor.RMA, MODEL == Models.Sort[2,1])
RCM2 <- count(Model2$REPAIR_CODE)
colnames(RCM2) <- c("REPAIR_CODE", "FREQ")
M2M <- merge(RCM.Sort, RCM2, by = "REPAIR_CODE", all.x = TRUE)
M2M.Sort <- M2M[order(M2M$FREQ.x, decreasing = TRUE), ]
M2M.Sort[is.na(M2M.Sort)] <- 0
In the above code, each "2" needs to run from 2 to 85
writeWorksheetToFile(file="CL2 - TC - RC.xlsx",
data = M2M.Sort[ ,c("FREQ.y")],
sheet = "RC by Model",
clearSheets = FALSE,
startRow = 6,
startCol = 6)
In the above code, "data" should from from "M2M..." to "M85M..." and "startCol" should run from 6 to 89 for an Excel printout.
The data frame this comes from (Cor.RMA) has columns "MODEL", "REPAIR_CODE", and others that are unused.
RCM.Sort is a frequency table of each "REPAIR_CODE" across all models, which I use as a master list onto which I join the device-specific repair-code counts (a left join: all.x = TRUE).
Models.Sort is a frequency table I generated using the "count" function from the plyr package, so I can create subsets for each MODEL.
Then I merge a list of each "REPAIR_CODE" that I generated using the "unique" function.
Sample Data:
CASE_NO DEVICE_TYPE MODEL TRIAGE_CODE REPAIR_CODE
12341 Smartphone X TC01 RC01
12342 Smartphone Y TC02 RC02
12343 Smartphone Z TC01, TC05 RC05
12344 Tablet AA TC02 RC37
12345 Wearable BB TC05 RC37
12346 Smartphone X TC07 RC01
12347 Smartphone Y TC04 RC02
I very much appreciate your time and effort if you are willing to help.

Alright, this is not what your original script did, but here goes:
models <- c("X","Y","Z","AA","BB") # in your case it would be Models.Sort[2:85,1]
new <- Cor.RMA[Cor.RMA$MODEL %in% models,]
new2 <- aggregate(new$REPAIR_CODE, list(new$MODEL), table)
temp <- unlist(new2[[2]])
temp <- temp[, order(colSums(temp), decreasing = T)]
out <- data.frame(group=new2[,1], temp)
out <- out[order(rowSums(out[,-1]), decreasing = T),]
out
# group RC01 RC02 RC37 RC05
# 3 X 2 0 0 0
# 4 Y 0 2 0 0
# 1 AA 0 0 1 0
# 2 BB 0 0 1 0
# 5 Z 0 0 0 1
You can then write it easily to an xlsx file, e.g. with:
library(xlsx)
write.xlsx(out, "test.xlsx", row.names = FALSE)
Edit: Added sorting.
Edit2: Fixed sorting.
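If you prefer to keep your original merge-and-write workflow, a minimal sketch is to wrap it in a loop over the model index, assuming Cor.RMA, RCM.Sort, and Models.Sort as described in the question, count() from plyr, and writeWorksheetToFile() from XLConnect:
library(plyr)      # count()
library(XLConnect) # writeWorksheetToFile()
for (i in 2:85) {
  model.sub <- subset(Cor.RMA, MODEL == Models.Sort[i, 1])
  rc.counts <- count(model.sub$REPAIR_CODE)
  colnames(rc.counts) <- c("REPAIR_CODE", "FREQ")
  merged <- merge(RCM.Sort, rc.counts, by = "REPAIR_CODE", all.x = TRUE)
  merged <- merged[order(merged$FREQ.x, decreasing = TRUE), ]
  merged[is.na(merged)] <- 0
  writeWorksheetToFile(file = "CL2 - TC - RC.xlsx",
                       data = merged[, "FREQ.y", drop = FALSE],
                       sheet = "RC by Model",
                       clearSheets = FALSE,
                       startRow = 6,
                       startCol = i + 4) # columns 6..89 for models 2..85
}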

Related

Loop same actions in R

I have an issue here.
I want to loop my operations in R, but I do not know how to do this properly and efficiently.
I have several datasets of different sizes, and performing the same block of code on each one is time-consuming.
Here is the code I need to apply to each of the datasets, writing the data or the model output into objects with different names.
##########################################################################################################################
# the combined list of separate data frames, where the last letter changes: A, B, C...
z <- list(Data_A, Data_B, Data_C)
# I need to loop these operations over each dataset. Here is an example using the Data_A dataset.
# TFP estimation by using ACF method
ACF_A <- prodest::prodestACF(Data_A$turn, fX = Data_A$cogs, sX = Data_A$tfa, pX = Data_A$cogs, idvar = Data_A$ID, timevar = Data_A$Year,
R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
omegaACF_A <- prodest::omega(ACF_A)
Data_A$omegaACF_A <- prodest::omega(ACF_A)
#########################################################################################################################
# Growth variables
Data_A <- Data_A %>%
arrange(ID, Year) %>%
group_by(ID) %>%
mutate(domegaACF_A = omegaACF_A - dplyr::lag(omegaACF_A),
debt = LOAN + LTD,
ddebt = debt - dplyr::lag(debt),
dsales = SALE - dplyr::lag(SALE)) %>%
ungroup
# Panel data frame
PData_A <- pdata.frame(Data_A, index = c("ID","Year"))
# Within estimator
within_2way_A <- plm(domegaACF_A ~ dplyr::lag(domegaACF_A, 1) + dplyr::lag(domegaACF_A, 2) + ddebt + lag(ff1, 1) + ddebt:lag(ff1, 1) + log(Age) + ta + dsales,
data = PData_A, effect = "twoways", model ="within", index = c("ID", "Year"))
The main problem is that I do not know how to store the results in separate objects with corresponding names. For example, in the following block of code, _A should change to _B, _C according to the dataset being used.
ACF_A <- prodest::prodestACF(Data_A$turn, fX = Data_A$cogs, sX = Data_A$tfa, pX = Data_A$cogs, idvar = Data_A$ID, timevar = Data_A$Year,
R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
omegaACF_A <- prodest::omega(ACF_A)
Data_A$omegaACF_A <- prodest::omega(ACF_A)
I know about lapply and for loops, but I do not know how to use them when the names of the storage variables change:
z <- list(df1, df2, df3)
for (i in z){
ACF_[1 or 2 or 3] <- prodest::prodestACF(i$turn, fX = i$cogs, sX = i$tfa, pX = i$cogs, idvar = i$ID, timevar = i$Year,
R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
omegaACF_[1 or 2 or 3] <- prodest::omega(ACF_[1 or 2 or 3])
Data_[]$omegaACF_[1 or 2 or 3] <- prodest::omega(ACF_[1 or 2 or 3])
}
UPD: Here are several datasets: https://drive.google.com/drive/folders/1gBV2ZkywW6JqDjRICafCwtYhh2DHWaUq?usp=sharing
UPD2:
Data_A
turn cogs tfa SALE
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
Data_B
turn cogs tfa SALE
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
After running the loop I need:
ACF_A, ACF_B, etc.: storage variables where the results of the prodestACF estimations will be stored
omegaACF_A, omegaACF_B, etc.: storage variables where the omega values from prodest will be stored
The omegaACF_A, omegaACF_B results should be added to the Data_A, Data_B datasets accordingly as new variables.
After that, the growth variables should be created in the Data_A, Data_B datasets.
The plm regressions should be stored in within_2way_A, within_2way_B accordingly.
So in the end, I need:
Data_A
turn cogs tfa SALE omegaACF_A domegaACF_A debt ddebt dsales
1 1 1 1 0.1 NA 1 NA NA
2 2 2 2 0.3 0.2 2 1 1
3 3 3 3 0.6 0.3 3 1 1
4 4 4 4 0.9 0.3 4 1 1
Data_B
turn cogs tfa SALE omegaACF_B domegaACF_B debt ddebt dsales
5 5 5 5 1.1 NA 5 NA NA
6 6 6 6 1.5 0.4 6 1 1
7 7 7 7 1.7 0.2 7 1 1
8 8 8 8 2.0 0.3 8 1 1
One approach is to separate the ACF estimation and omega calculation from the summary creation using different lapply() calls. Since you did not supply any example data this is a blind shot, but try the following. Note that I assumed every dataset has the same column names! In case it doesn't solve your problem I will remove my answer.
data <- list(Data_A, Data_B, Data_C)
Estimates <- lapply(data, function(x){
prodest::prodestACF(x$turn, fX = x$cogs, sX = x$tfa, pX = x$cogs, idvar = x$ID, timevar = x$Year,
R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
})
Summaries_estimates <- lapply(Estimates, summary)
Omegas <- lapply(Estimates, function(x) prodest::omega(x))
Summaries_omega <- lapply(Omegas, summary)
Alternative using loops
Since you asked, it is also possible to write a loop that does everything together, though this is usually slower. For this, we define empty lists to hold the results of the ACF estimation etc. and loop over the list of data.frames we already created.
data <- list(Data_A, Data_B, Data_C)
Estimates <- list()
Summaries_estimates <- list()
Omegas <- list()
Summaries_omegas <- list()
for (i in seq_along(data)) {
Estimates[[i]] <- prodest::prodestACF(data[[i]]$turn, fX = data[[i]]$cogs, sX = data[[i]]$tfa, pX = data[[i]]$cogs, idvar = data[[i]]$ID, timevar = data[[i]]$Year,
R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
Summaries_estimates[[i]] <- summary(Estimates[[i]])
Omegas[[i]] <- prodest::omega(Estimates[[i]])
Summaries_omegas[[i]] <- summary(Omegas[[i]])
}
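If you also need the results back under their original names (ACF_A, Data_A, ...), a minimal sketch (my addition, assuming every dataset has the columns used above and that prodest::omega returns one value per row) is to name the list elements and push the updated objects back with list2env():
data <- list(A = Data_A, B = Data_B, C = Data_C)
results <- lapply(names(data), function(nm) {
  d <- data[[nm]]
  fit <- prodest::prodestACF(d$turn, fX = d$cogs, sX = d$tfa, pX = d$cogs,
                             idvar = d$ID, timevar = d$Year,
                             R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
  d[[paste0("omegaACF_", nm)]] <- prodest::omega(fit) # omega column named per dataset
  list(fit = fit, data = d)
})
names(results) <- names(data)
# recreate ACF_A, ACF_B, ... and Data_A, Data_B, ... as separate objects
list2env(setNames(lapply(results, `[[`, "fit"), paste0("ACF_", names(results))), envir = .GlobalEnv)
list2env(setNames(lapply(results, `[[`, "data"), paste0("Data_", names(results))), envir = .GlobalEnv)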

R fastdummies equivalent in sparkR

I have a Spark dataframe with the following data:
categories
1 John
2 Luis
3 Dora
For which I need to create a one-hot encoded version:
categories categories_Dora categories_John categories_Luis
1 John 0 1 0
2 Luis 0 0 1
3 Dora 1 0 0
This is the current code I have:
test <- data.frame("SN" = 1:2, "Age" = c(21,15), "Name" = c("John;Luis","Dora"))
df <- as.DataFrame(test)
df_2 = selectExpr(df, "split(Name, ';') AS categories","Name")
dat <- df_2 %>%
mutate(categories=explode(df_2$categories)) %>%
select("categories")
The current solution I have is to convert this to a regular R data frame and apply the fastDummies function, which works for this case but won't work properly for a large dataset:
r_df = dat %>%
SparkR::collect()
dummy_r = dummy_cols(r_df)
How can I get the same result using sparkR dataframes?
EDIT:
I cannot use sparklyr, only SparkR.
It can be done with sparklyr, which exposes many of the Spark feature transformer functions.
library(sparklyr)
test <- data.frame("categories" = c("John", "Luis","Dora"))
sc <- sparklyr::spark_connect(master = "local")
d_tbl <- copy_to(sc, test, overwrite = TRUE)
d_tbl %>%
ft_string_indexer(input_col = "categories", output_col = "cat_num") %>%
mutate(cat_num = cat_num + 1) %>%
ft_one_hot_encoder("cat_num", "cat_onehot") %>%
sdf_separate_column("cat_onehot",
paste("categories", pull(., categories), sep="_")) %>%
select(-cat_num, -cat_onehot)
The output:
# Source: spark<?> [?? x 4]
categories categories_John categories_Luis categories_Dora
<chr> <dbl> <dbl> <dbl>
1 John 0 0 0
2 Luis 0 1 0
3 Dora 0 0 1
ft_string_indexer generates a column named cat_num which holds a numeric value for each category, very similar to as.numeric(factor(...)) in R. The + 1 is just to shift the indexes to run from 1 to N. ft_one_hot_encoder does the magic at the Spark level; it returns a vectorized value, like a list, with the encoding. The function sdf_separate_column expands the encoding into columns, and the paste generates the column names from the category levels. The final select drops the intermediate columns used in the transformation.
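Since the question rules out sparklyr, here is a SparkR-only sketch (my addition, not from the answer above): collect the distinct categories to the driver, then add one indicator column per category with withColumn() and ifelse(). This assumes the number of distinct categories is small enough to collect:
library(SparkR)
cats <- collect(distinct(select(dat, "categories")))$categories
for (ct in cats) {
  # lit() wraps the 0/1 constants as Spark columns
  dat <- withColumn(dat, paste0("categories_", ct),
                    ifelse(dat$categories == ct, lit(1), lit(0)))
}
head(dat)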

h2o deep learning NN with 1 layer not reproducing X + Y = Z

I have just started using the h2o package in order to build a supervised neural network with h2o.deeplearning.
To get on track, I started by trying to simulate a function like X + Y = Z.
My code is the following:
data <- read.table("DeepLearningTest.csv", header = FALSE, sep = ",", quote = "", stringsAsFactors = FALSE)
test <- read.table("DeepLearningTestRun.csv", header = FALSE, sep = ",", quote = "", stringsAsFactors = FALSE)
df <- data.frame(data)
tf <- data.frame(test)
h2o.init()
hf <- as.h2o(df)
m2 <- h2o.deeplearning(
training_frame = hf,
x = 0:1,
y = 'C',
hidden = c(100),
epochs = 100000,
stopping_tolerance = 0.001
)
h2o.predict(m2, as.h2o(tf))
Where my test samples are the following (for example):
1 1 2
2 2 4
3 3 6
4 4 8
. . .
until
2000 2000 4000
In general, it is X + X = 2X.
The thing I do not understand, and the reason I am writing, is that if I use a one-layer network (which, by the universal approximation theorem, should be sufficient), I can train the network and it then predicts quite good values within the training range.
for instance the network is giving me
100 100 200
101 101 202
7 7 14
but when I put
4000 4000
the result is misleading: it gives me something like 6300.
It seems that the network is not able to generalize.
What am I doing wrong to cause this behavior?
Thank you for your attention.
Regards,
Nicola

Count based on multiple conditions from other data.frame

I am migrating analysis from Excel to R, and would like some input on how best to perform something similar to Excel's COUNTIFS in R.
I have two data.frames, statedf and memberdf.
statedf=data.frame(state=c('MD','MD','MD','NY','NY','NY'), week = 5:7)
memberdf=data.frame(memID = 1:15, state = c('MD','MD','NY','NY','MD'),
finalweek = c(3,3,5,3,3,5,3,5,3,5,6,5,2,3,5),
orders = c(1,2,3))
This data is for a subscription-based business. I would like to know the number of members who newly lapsed for each week/state combo in statedf, where newly lapsed is defined by statedf$week - 1 == memberdf$finalweek. Further, I would like separate counts for each orders value (1, 2, 3).
The desired output would look like
out <- data.frame(state=c('MD','MD','MD','NY','NY','NY'), week = 5:7,
oneorder = c(0,1,0,0,0,0),
twoorder = c(0,0,1,0,1,0),
threeorder = c(0,3,0,0,1,0))
I asked (and got a great response to) a simpler version of this question yesterday - the answers revolved around creating a new data.frame based on memberdf. However, I need to append the data to statedf, because statedf has state/week combos that don't exist in memberdf, and vice versa. If this were in Excel, I'd use COUNTIFS, but I am struggling to find a solution in R.
Thanks.
Here is a solution with the dplyr and tidyr packages:
library(tidyr) ; library(dplyr)
counts <- memberdf %>%
mutate(lapsedweek = finalweek + 1) %>%
group_by(state, lapsedweek, orders) %>%
tally()
counts <- counts %>% spread(orders, n, fill = 0)
out <- left_join(statedf, counts, by = c("state", "week" = "lapsedweek"))
out[is.na(out)] <- 0 # replace NA counts (state/week combos with no lapses) with 0s
names(out)[3:5] <- paste0("order", names(out)[3:5]) # rename columns
We could create a new variable ('week1') in the 'statedf' dataset, merge the 'memberdf' with 'statedf', and then reshape from 'long' to 'wide' format with dcast. I changed the 'orders' column to match the column names in the 'out'.
statedf$week1 <- statedf$week-1
df1 <- merge(memberdf[-1], statedf, by.x=c('state', 'finalweek'),
by.y=c('state', 'week1'), all.y=TRUE)
lvls <- paste0(c('one', 'two', 'three'), 'order')
df1$orders <- factor(lvls[df1$orders],levels=lvls)
library(reshape2)
out1 <- dcast(df1, state+week~orders, value.var='orders', length)[-6]
out1
# state week oneorder twoorder threeorder
#1 MD 5 0 0 0
#2 MD 6 1 0 3
#3 MD 7 0 1 0
#4 NY 5 0 0 0
#5 NY 6 0 1 1
#6 NY 7 0 0 0
all.equal(out, out1)
#[1] TRUE
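For completeness, a base-R sketch (my addition, using the same statedf/memberdf) that mirrors COUNTIFS almost literally: for each state/week row, count the members that satisfy all the conditions. The helper name countifs is illustrative:
# count members matching all three conditions for one state/week/orders combo
countifs <- function(st, wk, ord) {
  sum(memberdf$state == st & memberdf$finalweek == wk - 1 & memberdf$orders == ord)
}
out2 <- statedf
out2$oneorder   <- mapply(countifs, as.character(statedf$state), statedf$week, 1)
out2$twoorder   <- mapply(countifs, as.character(statedf$state), statedf$week, 2)
out2$threeorder <- mapply(countifs, as.character(statedf$state), statedf$week, 3)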

Marking or Tagging Non-structured Data (Text data) with combination of words

I'm working on unstructured data (text).
I want to tag the data with some keywords and combinations of keywords.
I am not able to tag the data with a combination of words: I want to know where "fraud" AND "misselling" both occur.
I tried using the qdap package.
I was able to tag these two words with an OR condition, but not with an AND condition.
Below is the code I used:
library(qdap)
df <- read.csv(file.choose(), header = TRUE)
#### cleaning of text
df$Comment <- strip(df$Comment) ## remove capitalization and punctuation
df$Comment <- clean(df$Comment)
df$Comment <- scrubber(df$Comment)
df$Comment <- qprep(df$Comment)
df$Comment <- replace_abbreviation(df$Comment)
terms <- list(
" fraud ",
" refund ", " cheat ", " cancellation ", "missold", "delay",
combo1 = qcv(fraud, missold))
df2 <- with(df, termco(df$Comment, df$Comment, terms))[["raw"]] ### tagging of data with key words
df3 <- merge(df, df2, by = "Comment")
I am using complaint data for insurance companies.
The variables I have are:
Date of complaint
Brand against which the complaint was made
Comment (the complaint text)
Based on your sample xlsx:
library(xlsx)
df <- read.xlsx(file="sample output.xlsx", sheetIndex=1)
library(tm)
terms <- stemDocument(c("fraud","refund","cheat", "cancellation", "misselling", "delay"))
mat <- DocumentTermMatrix(x=Corpus(VectorSource(df$Comment)),
control=list(removePunctuation = TRUE,
dictionary = terms,
stemming = TRUE,
weighting = weightBin))
df2 <- as.data.frame(as.matrix(mat))
(df2 <- transform(df2, combo = fraud + missel))
df2
# cancel cheat delay fraud missel refund combo
# 1 1 0 0 1 1 0 2
# 2 1 0 0 1 1 0 2
# 3 0 0 0 1 1 0 2
df3 <- cbind(df, df2)
df3
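Since the original goal was an explicit AND condition, a short follow-up (my addition, assuming the df2 produced above) flags only the documents where both terms occur:
# 1 only when both "fraud" and "missel" (the stem of "misselling") are present
df2$fraud_and_missel <- as.integer(df2$fraud > 0 & df2$missel > 0)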
