How to optimize this process? (R)

I have somewhat of a broad question, but I will try to make my intent as clear as possible so that people can make suggestions. I am trying to optimize a process: generally, I am feeding a function a data frame of values and generating a prediction from operations on specific columns, i.e. a custom function used with sapply (code below). What I'm doing is much too large to provide a meaningful example, so instead I will describe the inputs to the process. I know this restricts how helpful answers can be, but I am interested in any ideas for reducing the time it takes to compute a prediction. Currently it takes about 10 seconds to generate one prediction (to run the sapply for one line of a data frame).
mean_rating <- function(df) {
  user  <- df$user
  movie <- df$movie
  u_row <- which(U_lookup == user)[1]    # row of dfm for this user
  m_row <- which(M_lookup == movie)[1]   # column of dfm for this movie
  knn_match  <- knn_txt[u_row, 1:100]    # the user's 100 nearest-neighbour rows of dfm
  knn_match1 <- as.numeric(unlist(knn_match))
  dfm_test <- dfm[knn_match1, ]
  dfm_mov  <- dfm_test[, m_row]          # column of dfm associated with the query movie
  C <- mean(dfm_mov)
}
test <- sapply(1:nrow(probe_test), function(x) mean_rating(probe_test[x, ]))
Inputs:
dfm is my main data matrix, users in the rows and movies in the columns. Very sparse.
> str(dfm)
Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
..@ i : int [1:99072112] 378 1137 1755 1893 2359 3156 3423 4380 5103 6762 ...
..@ j : int [1:99072112] 0 0 0 0 0 0 0 0 0 0 ...
..@ Dim : int [1:2] 480189 17770
..@ Dimnames:List of 2
.. ..$ : NULL
.. ..$ : NULL
..@ x : num [1:99072112] 4 5 4 1 4 5 4 5 3 3 ...
..@ factors : list()
probe_test is my test set, the set I'm trying to predict for. The actual probe test contains approximately 1.4 million rows but I am trying it on a subset first to optimize the time. It is being fed into my function.
> str(probe_test)
'data.frame': 6 obs. of 6 variables:
$ X : int 1 2 3 4 5 6
$ movie : int 1 1 1 1 1 1
$ user : int 1027056 1059319 1149588 1283744 1394012 1406595
$ Rating : int 3 3 4 3 5 4
$ Rating_Date: Factor w/ 1929 levels "2000-01-06","2000-01-08",..: 1901 1847 1911 1312 1917 1803
$ Indicator : int 1 1 1 1 1 1
U_lookup is the lookup I use to convert between a user ID and the row of the matrix that user is in, since user IDs are lost when the data are converted to a sparse matrix.
> str(U_lookup)
'data.frame': 480189 obs. of 1 variable:
$ x: int 10 100000 1000004 1000027 1000033 1000035 1000038 1000051 1000053 1000057 ...
M_lookup is the lookup I use to convert between a movie ID and the column of the matrix that movie is in, for the same reason as above.
> str(M_lookup)
'data.frame': 17770 obs. of 1 variable:
$ x: int 1 10 100 1000 10000 10001 10002 10003 10004 10005 ...
knn_txt contains the 100 nearest neighbours for every row of dfm.
> str(knn_txt)
'data.frame': 480189 obs. of 200 variables:
Thank you for any advice you can provide to me.
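A likely bottleneck is that which(U_lookup == user) and which(M_lookup == movie) re-scan the full lookup tables (480,189 and 17,770 values) for every single prediction. Below is a minimal, hedged sketch of a vectorised alternative rather than a definitive fix; it assumes (per the str() output above) that U_lookup$x and M_lookup$x hold the raw IDs and that knn_txt holds dfm row indices.
u_rows <- match(probe_test$user,  U_lookup$x)   # dfm row for every test user, in one scan
m_cols <- match(probe_test$movie, M_lookup$x)   # dfm column for every test movie
knn_idx <- as.matrix(knn_txt[u_rows, 1:100])    # each row: that user's 100 nearest-neighbour rows of dfm
test <- vapply(seq_len(nrow(probe_test)), function(i) {
  mean(dfm[knn_idx[i, ], m_cols[i]])            # same mean as mean_rating(), stored zeros included
}, numeric(1))
Coercing dfm to column-compressed form first, e.g. dfm_c <- as(dfm, "CsparseMatrix"), may also help, since the inner step extracts a single column; whether that pays off depends on available memory.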

Related

Why does sparse.model.matrix() ignore some of the columns?

Can somebody help me please? I want to prepare data for XGBoost prediction, so I need to encode my factor variables. I use sparse.model.matrix(), but there is a problem: I don't know why the function ignores some of the columns. I'll try to explain. I have a dataset, datas, with many variables, but these 3 are the important ones here:
Tsunami.Event.Validity - Factor with 6 classes: -1,0,1,2,3,4
Tsunami.Cause.Code - Factor with 6 classes: 0,1,2,3,4,5
Total.Death.Description - Factor with 5 classes: 0,1,2,3,4
But when I use sparse.model.matrix() I get a matrix with only 15 columns, not 6+6+5=17 as expected. Can somebody give me some advice?
sp_matrix = sparse.model.matrix(Deadly ~ Tsunami.Event.Validity + Tsunami.Cause.Code + Total.Death.Description -1, data = datas)
str(sp_matrix)
Output:
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ i : int [1:2510] 0 1 2 3 4 5 6 7 8 9 ...
..@ p : int [1:16] 0 749 757 779 823 892 1495 2191 2239 2241 ...
..@ Dim : int [1:2] 749 15
..@ Dimnames:List of 2
.. ..$ : chr [1:749] "1" "2" "3" "4" ...
.. ..$ : chr [1:15] "Tsunami.Event.Validity-1" "Tsunami.Event.Validity0" "Tsunami.Event.Validity1" "Tsunami.Event.Validity2" ...
..@ x : num [1:2510] 1 1 1 1 1 1 1 1 1 1 ...
..@ factors : list()
..$ assign : int [1:15] 0 1 1 1 1 1 2 2 2 2 ...
..$ contrasts:List of 3
.. ..$ Tsunami.Event.Validity : chr "contr.treatment"
.. ..$ Tsunami.Cause.Code : chr "contr.treatment"
.. ..$ Total.Death.Description: chr "contr.treatment"
This question is a duplicate of "In R, for categorical data with N unique categories, why does sparse.model.matrix() not produce a one-hot encoding with N columns?" ... but that question was never answered.
The answers to this question explain how you could get the full model matrix you're looking for, but don't explain why you might not want to. (For what it's worth, unlike regular linear models, regression trees are robust to multicollinearity, so a full model matrix would actually work in this case, but it's worth understanding why R gives you the answer it does, and why this won't hurt your predictive accuracy ...)
This is a fundamental property of the way that linear models based (additively) on more than one categorical predictor work (and hence the way that R constructs model matrices). When you construct a model matrix based on factors f1, ..., fn with numbers of levels n1, ..., nn the number of predictor variables is 1 + sum(ni-1), not sum(ni). Let's see how this works with a slightly simpler example:
xx <- expand.grid(A=factor(1:2), B = factor(1:2), C = factor(1:2))
model.matrix(~A+B+C-1, xx)
A1 A2 B2 C2
1 1 0 0 0
2 0 1 0 0
3 1 0 1 0
4 0 1 1 0
5 1 0 0 1
6 0 1 0 1
7 1 0 1 1
8 0 1 1 1
We have a total of (1 + 3*(2-1) =) 4 parameters.
The first parameter (A1) describes the expected mean at the baseline level of all factors (A=1, B=1, C=1). The second parameter (A2) describes the expected mean for observations with A=2 at the baseline levels of B and C, so the difference A2 - A1 is the A effect. Parameters 3 and 4 (B2, C2) describe the differences between B=2 and B=1 and between C=2 and C=1 (independent of the other factors).
You might be thinking "but I want predictor variables for all the levels of all the factors", e.g.
library(Matrix)  # for fac2sparse()
m <- do.call(cbind, lapply(xx, \(x) t(fac2sparse(x))))
dim(m)
## [1] 8 6
This has all six columns expected, not just 4. But if you examine this matrix, or call rankMatrix(m) or caret::findLinearCombos(m), you'll discover that it is multicollinear. In a typical (fixed-effect) additive linear model, you can only estimate an intercept plus the differences between levels, not values associated with every level. In a regression tree model, the multicollinearity will make your computations slightly less efficient, and will make results about variable importance confusing, but shouldn't hurt your predictions.
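If, for an XGBoost-style design matrix, you still want one indicator column per level, a minimal sketch (reusing datas and the formula from the question, so everything here is an assumption about that data) is to pass full indicator contrasts through contrasts.arg; the resulting 17-column matrix is rank-deficient for the reasons above, but a tree model will accept it.
library(Matrix)
vars <- c("Tsunami.Event.Validity", "Tsunami.Cause.Code", "Total.Death.Description")
full_contrasts <- lapply(datas[vars], contrasts, contrasts = FALSE)  # keep every level, no reference category
sp_matrix_full <- sparse.model.matrix(
  Deadly ~ Tsunami.Event.Validity + Tsunami.Cause.Code + Total.Death.Description - 1,
  data = datas, contrasts.arg = full_contrasts)
dim(sp_matrix_full)  # 6 + 6 + 5 = 17 columns expected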

R anesrake issue with list names non-binary argument

I am using anesrake to weight some survey data, but am getting a "non-numeric argument to binary operator" error. The error only occurs after I have added names to the list I use as targets:
gender1<-c(0.516166000986901,0.483833999013099)
age<-c(0.15828262425613,0.364861110549873,0.429947760183493,0.0469085050104993)
mylist<-list(gender1,age)
names(mylist)<-c("gender1","age")
result<-anesrake(mylist,france,caseid=france$caseid, iterate=TRUE)
Error in x + weights : non-numeric argument to binary operator
In addition: Warning message:
In anesrake(targets, france, caseid = france$caseid, iterate = TRUE) :
Targets for age do not sum to 100%. Adjusting values to total 100%
This also says that the targets for age don't sum to 100%, which they do, so I'm also not sure what that's about. If I leave out the names(mylist) bit, I get the following error instead, presumably because R doesn't know which variables to use, but not the non-numeric error:
Error in selecthighestpcts(discrep1, inputter, pctlim) :
No variables are off by more than 5 percent using the method you have chosen, either weighting is unnecessary or a smaller pre-raking limit should be chosen.
The variables in the data frame have the same names as the targets in the list, and are numeric:
> str(france)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 993 obs. of 5 variables:
$ Gender :Classes 'labelled', 'numeric' atomic [1:993] 2 2 2 2 2 2 2 2 2 2 ...
.. ..- attr(*, "label")= chr "Gender"
$ Age2 : num 2 3 2 2 2 2 2 1 2 3 ...
$ gender1: num 2 2 2 2 2 2 2 2 2 2 ...
$ caseid : int 1 2 3 4 5 6 7 8 9 10 ...
$ age : num 2 3 2 2 2 2 2 1 2 3 ...
I have also tried converting gender1 and age to factor variables (as the numbers represent levels of each variable: gender has 2, age has 4), but with the same result. I have used anesrake successfully before, so there must be something I am missing, but I cannot see it! Any help greatly appreciated.
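Purely as a hedged sketch (not a verified diagnosis), two things worth ruling out are tbl_df input (france is a tibble according to str()) and target vectors whose elements are not named with the codes used in the data; both are assumptions here, but both are cheap to test.
library(anesrake)
france <- as.data.frame(france)                 # plain data.frame rather than a tibble
gender1 <- c("1" = 0.516166000986901, "2" = 0.483833999013099)
age     <- c("1" = 0.15828262425613,  "2" = 0.364861110549873,
             "3" = 0.429947760183493, "4" = 0.0469085050104993)
mylist  <- list(gender1 = gender1, age = age)   # element names match the codes in france$gender1 / france$age
result  <- anesrake(mylist, france, caseid = france$caseid, iterate = TRUE)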

How to automatically convert a numeric column to categorical data using statistical techniques

>data
ACC_ID REG PRBLT OPP_TYPE_DESC PARENT_ID ACCT_NM INDUSTRY_ID BUY PWR REV QTY
11316456 No 90 A 2122628569 INF 7379 10190.82 6500 1
11456476 Yes 1 I 2385888136 Module 9199 17441.72 466.5 31
13453245 No 10 D 2122628087 Wooden 3559 44279.21 2500 500
15674568 No 1 I 2702074521 Nine 7379 183218.8 25.91 1
Above is the given dataset
When I load the same in R, I have the following structure
>str(data)
$ ACC_ID : int 11316974 11620677 11865091 ...
$ REG : Factor w/ 2 levels "No ","Yes ": 1 2 1 1 1 1 1 1 1 1 ...
$ PRBLT : int 90 1 10 1 30 30 10 1 60 1 ...
$ OPP_TYPE_DESC : Factor w/ 3 levels "D",..: 3 2 1 2 1 1 1 3 3 2 ...
$ PARENT_ID : num 2.12e+09 2.39e+09 2.12e+09 2.70e+09 2.12e+09 ...
$ ACCT_NM : Factor w/ 20 levels "Marketing Vertical",..: 10 15 20 17 8 16 2 14 7 11 ...
$ INDUSTRY_ID : int 7379 9199 3559 7379 2711 7374 7371 8742 4813 2111 ..
$ BUY PWR : num 1014791 17442 ...
$ REV : num 6500 46617 250000 25564 20000 ...
$ QTY : int 1 31 500 1 6 100 ...
But I would like R to automatically treat the fields below as factors instead of int (with the help of statistical modelling or any other technique). Ideally, these are not continuous fields but categorical, nominal fields:
ACC_ID
PARENT_ID
INDUSTRY_ID
Whereas the REV and QTY columns should be left as is.
Also, the analysis should not be specific to the data and columns shown here; the logic must be applicable to any data set (with different columns) loaded into R.
Is there any method through which this is possible? Any ideas are welcome.
Thank you
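One hedged, heuristic sketch (the 0.5 threshold is an arbitrary assumption, and the helper name is made up for illustration): treat a numeric column as an ID-like factor when the share of distinct values is high. This catches key-style columns such as ACC_ID and PARENT_ID, but no purely statistical rule can reliably separate a code like INDUSTRY_ID from a genuine count like QTY without domain knowledge, so the threshold needs tuning per data set.
to_factor_heuristic <- function(df, threshold = 0.5) {
  id_like <- vapply(df, function(x) {
    is.numeric(x) && length(unique(x)) / length(x) > threshold
  }, logical(1))
  df[id_like] <- lapply(df[id_like], factor)   # convert the flagged columns to factors
  df
}
data <- to_factor_heuristic(data)
str(data)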

Reading a csv file using ffdf and subsetting it successfully

I have been researching a way to efficiently extract information from large csv data sets using R. Many seem to recommend the package ff. I was successful in reading the data sets but am now running into problems trying to subset them.
The largest data set contains over 650,000 rows and 1005 columns. Not all columns contain the same data types. Viewed as a dataframe, the structure would look like this:
'data.frame': 5 obs. of 1005 variables:
$ SAMPLING_EVENT_ID : Factor w/ 5 levels "S6230404","S6252242",..: 2 1 3 4 5
$ LATITUDE : num 24.4 24.5 24.5 24.5 24.5
$ LONGITUDE : num -81.9 -81.9 -82 -82 -82
$ YEAR : int 2010 2010 2010 2010 2010
$ MONTH : int 4 3 10 10 10
$ DAY : int 97 88 299 298 300
$ TIME : num 9 10 10 11.58 9.58
$ COUNTRY : Factor w/ 1 level "United_States": 1 1 1 1 1
$ STATE_PROVINCE : Factor w/ 1 level "Florida": 1 1 1 1 1
$ COUNT_TYPE : Factor w/ 2 levels "P21","P22": 2 2 1 1 1
$ EFFORT_HRS : num 6 2 7 6 3.5
$ EFFORT_DISTANCE_KM : num 48.28 8.05 0 0 0
$ EFFORT_AREA_HA : int 0 0 0 0 0
$ OBSERVER_ID : Factor w/ 3 levels "obs132426","obs58643",..: 3 2 1 1 1
$ NUMBER_OBSERVERS : Factor w/ 2 levels "?","1": 2 1 2 2 2
$ Zenaida_macroura : int 0 0 1 0 0
All other variables are similar to this last one, i.e. various species of birds.
Here is the code I used to "successfully" read the csv:
B2010 <- read.table.ffdf(x = NULL, "filePath&Name", nrows = -1, first.rows = 50000, next.rows = 50000)
Trying to learn about the ffdf output, I entered commands such as dim(B2010), str(B2010), ls(B2010), etc. dim(B2010) returned the appropriate number of rows but only one column (a single string per record with the values separated by commas), and ls(B2010) output [1] "physical" "row.names" "virtual" instead of the usual list of variables.
I'm not sure how to handle this type of output in order to extract, say, STATE_PROVINCE == "California". How do I tell B2010 what the variables are? I think I need to look at this differently, but I need some of your help to figure it out.
The ultimate goal for me is to subset a bunch of csv data sets (since I have one per year) and put the results back together as a data frame for various analyses.
Thanks,
Joe
To subset an ffdf, use the ffbase package, as in:
require(ffbase)
x <- subset(B2010, B2010$STATE_PROVINCE == "California")
I finally found the solution to getting the ffdf variable names and types properly read and accessible for subsetting:
B2010 <- read.csv.ffdf (file = "filepath/name", colClasses = c("factor", "numeric", "numeric", "integer", "integer", "integer", "numeric", rep("factor",998)), first.rows = 10000, next.rows = 50000, nrows = -1)
This took forever to read but seemed to have worked i.e. I was able to create a subset of the data. Next step: to save the subset back to a "normal" dataframe and/or to a csv.
According to the help page at ?read.table.ffdf, you should be using read.csv.ffdf(...). Then go to the page cited by Brandon.
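As a hedged sketch of the "next step" mentioned above (saving the subset back to a normal data frame and/or a csv), assuming the subset is small enough to fit in RAM:
library(ff)
library(ffbase)
ca <- subset(B2010, STATE_PROVINCE == "California")   # still an ffdf on disk
ca_df <- as.data.frame(ca)                            # pull it into an ordinary data.frame
write.csv(ca_df, "B2010_california.csv", row.names = FALSE)
# for subsets too large for RAM, write.csv.ffdf(ca, "B2010_california.csv") writes from the ffdf instead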

Wrong R data type or bad data?

I'm having trouble doing simple functions on a data frame and am unsure whether it's the data type of the column, or bad data in the data frame.
I exported a SQL query into a CSV file, then loaded it into a data frame, then attached it.
df <- read.csv("~/Desktop/orders.csv")
attach(df)
When I am done, and run str(df), here is what I get:
$ AccountID: Factor w/ 18093 levels "(819947 row(s) affected)",..: 10 97 167 207 207 299 299 309 352 573 ...
$ OrderID : int 1874197767 1874197860 1874196789 1874206918 1874209100 1874207018 1874209111 1874233050 1874196791 1875081598 ...
$ OrderDate : Factor w/ 280 levels "","2010-09-24",..: 2 2 2 2 2 2 2 2 2 2 ...
$ NumofProducts : int 16 6 4 6 10 4 2 4 6 40 ...
$ OrderTotal : num 20.3 13.8 12.5 13.8 16.4 ...
$ SpecialOrder : int 1 1 1 1 1 1 1 1 1 1 ...
When I try to run the following functions, here is what I get:
> length(OrderID)
[1] 0
> min(OrderTotal)
[1] NA
> min(OrderTotal, na.rm=TRUE)
[1] 5.00
> mean(NumofProducts)
[1] NA
> mean(NumofProducts, na.rm=TRUE)
[1] 3.462902
I have two questions related to this data frame:
Do I have the right data types for the columns? Nums versus integers versus decimals.
Is there a way to review the data set to find the rows that are driving the need to use na.rm=TRUE to make the function work? I'd like to know how many there are, etc.
The difference between num and int is pretty irrelevant at this stage.
See help(is.na) for starters on NA handling. Do things like:
sum(is.na(foo))
to see how many foo's are NA values. Then things like:
df[is.na(df$foo),]
to see the rows of df where foo is NA.
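Building on that, a short hedged sketch for the second question (which rows force na.rm = TRUE, and how many there are); the "(819947 row(s) affected)" level in AccountID suggests a SQL row-count footer ended up in the CSV, which would produce exactly this kind of NA row.
colSums(is.na(df))                    # how many NAs each column has
df[!complete.cases(df), ]             # the rows driving the need for na.rm = TRUE
df_clean <- df[complete.cases(df), ]  # or simply drop them (better: re-export without the footer)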
