How can I transform data from a nominal column to a discrete column in Scala - bigdata

I am implementing an algorithm to obtain association rules in Big Data environments. At the moment my algorithm works only with purely numerical databases and does not handle nominal ones. I need a way to transform the data in the nominal columns into numerical values, so that each nominal value is associated with a discrete numerical value. So far I have only managed to obtain the distinct values present in each column.
db.schema.foreach { column =>
  // Collect the distinct values present in this column
  val valuesDistinct = db.select(column.name).distinct()
  val values = valuesDistinct.collect().map(row => row(0).toString)
}
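One way to finish the job, building on the loop above, is to number the distinct values of each string column and replace the column with those indices. The following is only a sketch: it assumes db is a Spark DataFrame, and the names encoded and toIndex are illustrative, not part of the original code.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.StringType

// Replace every string (nominal) column with a discrete numeric index,
// e.g. "red" -> 0.0, "green" -> 1.0, leaving numeric columns untouched.
val encoded: DataFrame = db.schema.fields
  .filter(_.dataType == StringType)
  .foldLeft(db) { (df, field) =>
    // Build a value -> index mapping from the distinct values of this column
    val mapping: Map[String, Double] = df.select(field.name).distinct()
      .collect()
      .map(_.getString(0))
      .zipWithIndex
      .map { case (value, index) => value -> index.toDouble }
      .toMap
    val toIndex = udf((value: String) => mapping.get(value))
    df.withColumn(field.name, toIndex(col(field.name)))
  }

If the DataFrame does come from Spark, org.apache.spark.ml.feature.StringIndexer produces the same kind of result, assigning an index to each distinct value of a column (by descending frequency), and is the usual choice when the encoded columns then feed a Spark ML pipeline.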

Related

How can I get numeric data from all this character data?

In the data set I am using, the only numeric information is the measurement values coded as 0 and 1; the remaining columns hold values such as location and education. How can I get numeric data from all this character data? By the way, I'm using the R language.
I have computed some frequency values, but I don't know what to do with columns like location and education.

Impute different types of variables with MICE

I am trying to perform imputation on a dataset which has 69 columns and over 50000 rows. My dataset has different types of variables:
columns that contain only binary variables (0/1)
categorical columns
columns that take continuous numerical data
Now, I want to perform imputation and I know that my columns have a high level of multicollinearity.
Do I have to split my dataset into three different subsets (one for each of the three column types above), or should I perform imputation on the whole dataset?
The problem is that the mice package has different methods for each of these types. And if I run it three separate times, should each run use the whole dataset or only that specific subset?
You can pass your whole dataset to mice at once.
(You can actually specify which method to use for each variable separately.)
I am citing from the mice reference:
Parameter 'method'
Can be either a single string, or a vector of strings with length length(blocks), specifying the imputation method to be used for each column in data. If specified as a single string, the same method will be used for all blocks. The default imputation method (when no argument is specified) depends on the measurement level of the target column, as regulated by the defaultMethod argument. Columns that need not be imputed have the empty method "". See details.

How to create appropriate dummy variables for all categorical variables with more than 2 values in R?

I have a CSV dataset that has 1000 rows and 21 variables. Of these 21, 9 are categorical variables having more than 2 values. How do I create dummy variables for them in R? I wish to conduct logistic regression on this data set to interpret it. I tried using factors and levels to convert them, but I think that works best only for variables with 2 values. I googled quite a bit and found many sites that explain how to do it theoretically, but there is no code or function mentioned that lets me understand it fully. On this website, I came across the model.matrix() function, the dummies package of R and the dummy.code() function. However, I am still stuck because I am new to R. Sorry for the long question, this is my first time asking here. Thanks in advance!
In R, most modelling functions will recognize when you pass them categorical values (gender, location, etc.) and will automatically create the dummy variables. For example, for a linear regression you can just call lm(y ~ ., data = CSV_DATA) and R will expand the factors for you. If the categorical values are represented by actual numbers, it is recommended to first convert them to factors (or character strings) so that R treats them as categorical rather than numeric.
If you must do this process manually, you can instead write a loop that iterates through your dataset and populates additional variables. For each categorical variable you will need n-1 additional variables to represent it numerically, n being the number of possible categories the variable contains. Each of the n-1 new variables is assigned to one possible category of your original categorical variable; the last category is represented by 0's in all of the n-1 new variables.
For example, if you are trying to represent location and your data can be either "New York", "LA", or "Miami", you would create two (n-1) dummy variables, which for ease of explanation we will call city1 and city2. If the original value was "New York" you would set city1 = 1 and city2 = 0; if it was "LA" you would set city1 = 0 and city2 = 1; and if it was "Miami" you would set city1 = 0 and city2 = 0.
This works because it does not rank any one category numerically above the rest, and it uses the last category as a 'reference' to which all the others are compared. As said previously, if you represent your variables as strings (or factors), R will do this automatically for you.
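For comparison with the Scala question at the top of this page, here is a minimal sketch of the same n-1 dummy coding in plain Scala; the names dummyCode and cities are purely illustrative.

// Encode a categorical value as n-1 indicator values; the last category
// in `categories` acts as the reference level and becomes all zeros.
def dummyCode(categories: Seq[String])(value: String): Seq[Int] =
  categories.dropRight(1).map(level => if (value == level) 1 else 0)

val cities = Seq("New York", "LA", "Miami")
dummyCode(cities)("New York")  // Seq(1, 0)
dummyCode(cities)("LA")        // Seq(0, 1)
dummyCode(cities)("Miami")     // Seq(0, 0)  <- reference category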

nzv filter for continuous features in caret

I am a beginner in practical machine learning with R, specifically caret.
I am currently applying a random forest algorithm to a microbiome dataset. The values are relative-abundance transformed, so with my features as columns, each row sums to 1.
It is common to have cells with a lot of 0 values.
Typically I used the default nzv preprocessing feature in caret.
Default:
a. One unique value across the entire dataset
b. few unique values relative to the number of samples in the dataset (<10 %)
c. large ratio of the frequency of the most common value to the frequency of the second most common value (cutoff used is > 19)
So is this function not actually calculating variance, but rather determining the frequency of occurrence of feature values and filtering based on those frequencies? If so, is it only safe to use for discrete/categorical variables?
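The defaults listed above can indeed be read as frequency checks rather than a variance computation. A minimal sketch of that logic, written in Scala (the language of the question at the top of this page), with the hypothetical name nearZeroVariance and the cutoffs taken from the list above:

// Flag a feature as near-zero-variance when the most-common / second-most-common
// frequency ratio exceeds freqCut AND the percentage of unique values is small.
def nearZeroVariance(values: Seq[Double],
                     freqCut: Double = 19.0,
                     uniqueCut: Double = 10.0): Boolean = {
  // Occurrence counts of each distinct value, most frequent first
  val counts = values.groupBy(identity).values.map(_.size).toSeq.sortBy(-_)
  val percentUnique = 100.0 * counts.size / values.size
  if (counts.size < 2) true  // at most one unique value: zero variance
  else {
    val freqRatio = counts(0).toDouble / counts(1)
    freqRatio > freqCut && percentUnique < uniqueCut
  }
}

Note that nothing in this check looks at the numeric spread of the values, only at how often each value repeats.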
My dataset has a large number of features (~12k), many of which might be singletons or have zero values in a large proportion of samples.
My question: Is nzv suitable for such a continuous, zero inflated dataset?
What pre-processing options would you recommend?
When I use the default nzv I drop a tonne of features (from ~12k to ~2,700) in the final table.
I do want a less noisy dataset, but at the same time I do not want to lose good features.
This is my first question and I am willing to revise, edit and resubmit if required.
Any solutions will be appreciated.
Thanks a tonne!

Imputation using related data (R)

I have a panel dataset with population data. I am working mostly with two vectors: population and households. The household vector (there are 3 countries) has a substantial amount of missing values; the population vector is complete. I use a model with population as the independent variable to obtain the missing values of households. What function should I use to extract these values? I do not need to make any forecasts, just to impute the missing data.
Thank you.
EDIT:
This is a screenshot of my dataset:
https://imagizer.imageshack.us/v2/1366x440q90/661/RAH3uh.jpg
As you can see, many values of the datatype = "original" data are missing and I need to impute them somehow. I have created several panel data models (pooled, within, between) and, without further consideration, tried to extract the missing data with each of them; however, I do not know how to do this.
EDIT 2: What I need is not how to determine which model to use, but how to get the missing values from the model (thus making the dataset more balanced).
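Language aside, the idea described above (fit households on population over the complete cases, then use the fitted values where households is missing) can be sketched as follows. The function name imputeByRegression and the use of a simple one-predictor OLS fit are illustrative assumptions, written in Scala like the question at the top of this page; in R the same steps correspond to fitting with lm() on the complete cases and calling predict() on the rows with missing household values.

// Fill missing household values with fitted values from households ~ population.
// `population` is fully observed; `households` uses None for missing entries.
// Assumes at least two complete cases with non-constant population values.
def imputeByRegression(population: Seq[Double],
                       households: Seq[Option[Double]]): Seq[Double] = {
  // Complete cases: pairs where the household value is observed
  val complete = population.zip(households).collect { case (x, Some(y)) => (x, y) }
  val n = complete.size.toDouble
  val meanX = complete.map(_._1).sum / n
  val meanY = complete.map(_._2).sum / n
  // Ordinary least squares slope and intercept for y = a + b * x
  val b = complete.map { case (x, y) => (x - meanX) * (y - meanY) }.sum /
          complete.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
  val a = meanY - b * meanX
  // Keep observed values; replace missing ones with the fitted value a + b * x
  population.zip(households).map {
    case (_, Some(y)) => y
    case (x, None)    => a + b * x
  }
}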
