R: Create a factor from two variables containing ranks and levels

Everything is in the title: from a database I got many columns, paired two by two, containing codes and labels for some variables. I want an easy way to create half as many factors, where each factor's levels and labels match the original pair of variables.
Here is an example of the original data for two factors:
| customer_type | customer_type_name | customer_status | customer_status_name |
|----------------------|----------------------|----------------------|----------------------|
| 1 | A | 2 | Beta |
| 2 | B | 2 | Beta |
| 3 | C | 1 | Alpha |
| 2 | B | 3 | Gamma |
| 1 | A | 4 | Delta |
| 3 | C | 2 | Beta |
In other words, I want a simpler way (simpler to call in a function for lots of variables) to do the following with the dataframe "accounts":
a <- accounts[, c("customertypecode", "customertypecodename")]  # code/label pair
a <- a[!duplicated(a), ]                                         # keep unique pairs
a <- a[order(a$customertypecode), ]                              # order by code
accounts$customertypecode <- factor(accounts$customertypecode,
                                    labels = a$customertypecodename[!is.na(a$customertypecodename)])
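For many such code/label pairs, a small helper keeps the call sites short. This is a minimal sketch, not from the question; the function name and lookup logic are assumptions based on the steps above:
make_coded_factor <- function(df, code_col, label_col) {
  # unique code/label pairs, ordered by code, with missing labels dropped
  lookup <- unique(df[, c(code_col, label_col)])
  lookup <- lookup[order(lookup[[code_col]]), ]
  lookup <- lookup[!is.na(lookup[[label_col]]), ]
  factor(df[[code_col]], levels = lookup[[code_col]], labels = lookup[[label_col]])
}
# e.g. accounts$customer_type <- make_coded_factor(accounts, "customer_type", "customer_type_name")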

Related

How do you assign groups to larger groups with dplyr

I would like to assign groups to larger groups in order to assign them to cores for processing. I have 16 cores. This is what I have so far:
test <- data_extract %>% group_by(group_id) %>% sample_n(16, replace = TRUE)
This takes samples of 16 from each group.
This is an example of what I would like the final product to look like (with two clusters). All I really want is for every row with the same group_id to belong to the same cluster, given a set number of clusters:
| balance  | group_id | cluster |
|----------|----------|---------|
| 454452   | a        | 1       |
| 5450441  | a        | 1       |
| 5444531  | b        | 1       |
| 5404051  | b        | 1       |
| 5404501  | b        | 1       |
| 5404041  | b        | 1       |
| 544251   | b        | 1       |
| 254252   | b        | 1       |
| 541254   | c        | 2       |
| 54123254 | d        | 1       |
| 542541   | d        | 1       |
| 5442341  | e        | 2       |
| 541      | f        | 1       |
test <- data %>% group_by(group_id) %>% mutate(group = sample(1:16, 1))
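One way to get there (a sketch, not from the question; the modulo assignment and the n_cores name are assumptions) is to map each distinct group_id deterministically to one of the 16 cores, so every row of a group lands in the same cluster:
library(dplyr)

n_cores <- 16
test <- data_extract %>%
  mutate(cluster = (as.integer(factor(group_id)) - 1L) %% n_cores + 1L)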

Possible to invert the randomForest function in R?

I computed a random forest to predict a target value in a large data structure.
The matrix contains a few thousand rows, about 20 input variables, and one output/target/response variable.
For example, the dataframe df is like:
| V1  | V2  | V3  | V4  | ... | Rsp |
|-----|-----|-----|-----|-----|-----|
| 1   | 8   | 2   | 3   | ... | 1.5 |
| 2   | 4   | 3   | 4   | ... | 1.3 |
| 5   | 7   | 6   | 3   | ... | 1.4 |
| 2   | 8   | 8   | 4   | ... | 1.9 |
| 9   | 3   | 1   | 6   | ... | 2.1 |
| ... | ... | ... | ... | ... | ... |
I calculated the forest:
df.r <- randomForest(Rsp ~ ., data = df, subset = train, mtry = 50, ntree = 200)
p <- predict(df.r, df[-train, ])
I want to minimize the response in order to get the best combinations of input variables. But because the input and output are noisy, I cannot directly take the variables at the minimum response value.
So my question is: Is it possible to go the tree bottom-up? Is it possible to get the combinations of variables which give me a low response value?
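randomForest has no built-in way to invert a fitted forest, but one common workaround is to rank candidate input combinations by their predicted response, which is smoother than the raw Rsp values. A minimal sketch, reusing df and df.r from above:
# Predict a smoothed response for every row, then inspect the input
# combinations with the lowest predictions.
pred_all <- predict(df.r, df)
low_idx  <- order(pred_all)[1:10]          # the 10 lowest predicted responses
df[low_idx, setdiff(names(df), "Rsp")]     # candidate input combinations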

Add category column to a data set

I have a data table like this:
+------------+-------+
| Model      | Price |
+------------+-------+
| Apple-1    | 10    |
+------------+-------+
| New Apple  | 11    |
+------------+-------+
| Orange     | 13    |
+------------+-------+
| Orange2019 | 15    |
+------------+-------+
| Cat        | 19    |
+------------+-------+
I want to define a list of base model tags to add to any row that matches a certain condition/value. So, for example, I defined a data frame for tagging like this:
+------------+-----+
| Model      | Tag |
+------------+-----+
| Apple-1    | A   |
+------------+-----+
| New Apple  | A   |
+------------+-----+
| Orange     | B   |
+------------+-----+
| Cat        | B   |
+------------+-----+
I would like to find some way to get these results:
+------------+-------+-----+
| Model      | Price | Tag |
+------------+-------+-----+
| Apple-1    | 10    | A   |
+------------+-------+-----+
| New Apple  | 11    | A   |
+------------+-------+-----+
| Orange     | 13    | B   |
+------------+-------+-----+
| Orange2019 | 15    | B   |
+------------+-------+-----+
| Cat        | 19    | B   |
+------------+-------+-----+
I don't mind using a table to manage the tagging data, and I know I could write a very ad-hoc mutate statement to achieve the results I want; I'm just wondering if there is a more elegant way to tag a string based on a pattern match.
One idea is to use Levenshtein distances to cluster the words you have. You would need to provide the number of clusters. Once you have the clusters, just add the number of each one as a category tag to your table. Check out this answer, which goes into detail on Levenshtein distance clustering: Text clustering with Levenshtein distances
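A minimal sketch of that idea in base R, using the model names from the question; the choice of k = 2 clusters is an assumption:
models <- c("Apple-1", "New Apple", "Orange", "Orange2019", "Cat")
d <- adist(models)                 # Levenshtein (edit) distance matrix
rownames(d) <- models
hc <- hclust(as.dist(d))           # hierarchical clustering on the distances
data.frame(Model = models, Tag = cutree(hc, k = 2))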
edit
I think I totally misunderstood your question... try this
library(dplyr)
library(stringr)
df <- data.frame(Model = c("Apple-1", "New Apple", "Organe", "Orange2019", "Cat"),
                 Price = c(10, 11, 13, 15, 19), stringsAsFactors = FALSE)
tags <- data.frame(Model = c("Apple-1", "New Apple", "Orange", "Cat"),
                   Tag = c("A", "A", "B", "B"), stringsAsFactors = FALSE)
# For each row, take the tag of the first tags$Model pattern found in Model, else "None"
df %>% rowwise() %>%
  mutate(Tag = if_else(!is.na(tags$Tag[which(!is.na(str_extract(Model, tags$Model)))[1]]),
                       tags$Tag[which(!is.na(str_extract(Model, tags$Model)))[1]], false = "None"))
Model Price Tag
<chr> <dbl> <chr>
1 Apple-1 10 A
2 New Apple 11 A
3 Organe 13 None
4 Orange2019 15 B
5 Cat 19 B
I actually changed Orange to Organe so that you can see what happens when there is no match ("None" is returned).

How to get a query result into a key value form in HiveQL

I have tried different things, but none succeeded. I have the following issue, and would be very grateful if someone could help me.
I get the data from a view as several billion records, for different measures:
A)
| s_c_m1 | s_c_m2 | s_c_m3 | s_c_m4 | s_p_m1 | s_p_m2 | s_p_m3 | s_p_m4 |
|--------+--------+--------+--------+--------+--------+--------+--------|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|--------+--------+--------+--------+--------+--------+--------+--------|
Then I need to aggregate it by each measure. So far so good; I have this figured out.
B)
| s_c_m1 | s_c_m2 | s_c_m3 | s_c_m4 | s_p_m1 | s_p_m2 | s_p_m3 | s_p_m4 |
|--------+--------+--------+--------+--------+--------+--------+--------|
| 3 | 6 | 9 | 12 | 15 | 18 | 21 | 24 |
|--------+--------+--------+--------+--------+--------+--------+--------|
Then I need to get the data in the following form. I need to turn it into a key-value form.
C)
| measure | c | p |
|---------+----+----|
| m1 | 3 | 15 |
| m2 | 6 | 18 |
| m3 | 9 | 21 |
| m4 | 12 | 24 |
|---------+----+----|
The first 4 columns from B) form the c column in C), and the second 4 columns form the p column.
Is there an elegant way that is easy to maintain? The perfect solution would be one where, if another measure were introduced in A) and B), no modification would be required and it would automatically pick up the difference.
I know how to get this done in SQL Server and Postgres, but here I am missing the experience.
I think you should use a map for this.
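A minimal sketch of that idea in HiveQL, assuming the aggregated result B) is available as agg_b (a made-up name): build one map per measure group and explode it, so each measure becomes its own row.
SELECT m.measure,
       m.c,
       p_map[m.measure] AS p
FROM (
  SELECT map('m1', s_c_m1, 'm2', s_c_m2, 'm3', s_c_m3, 'm4', s_c_m4) AS c_map,
         map('m1', s_p_m1, 'm2', s_p_m2, 'm3', s_p_m3, 'm4', s_p_m4) AS p_map
  FROM agg_b
) t
LATERAL VIEW explode(c_map) m AS measure, c;
Note that adding a new measure still means adding one key/value pair to each map() call; picking up new columns with no query changes at all would need dynamic SQL generated outside of Hive.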

Creation of Panel Data set in R

Programmers,
I have some difficulties in structuring my panel data set.
My panel data set, for the moment, has the following structure:
Shown here only with T = 2 and N = 3. (My real data set, however, has T = 6 and N = 20 000 000.)
Panel data structure 1:
Year | ID | Variable_1 | ... | Variable_k |
1 | 1 | A | ... | B |
1 | 2 | C | ... | D |
1 | 3 | E | ... | F |
2 | 1 | G | ... | H |
2 | 2 | I | ... | J |
2 | 3 | K | ... | L |
The desired structure is:
Panel data structure 2:
Year | ID | Variable_1 | ... | Variable_k |
1 | 1 | A | ... | B |
2 | 1 | G | ... | H |
1 | 2 | C | ... | D |
2 | 2 | I | ... | J |
1 | 3 | E | ... | F |
2 | 3 | K | ... | L |
This is the classic panel data structure, where the yearly observations over the whole period are grouped block by block for each individual.
My question: is there a simple and efficient R solution that changes the data structure from Table 1 to Table 2 for very large data sets (data.frame)?
Thank you very much for all responses in advance!!
Enrico
You can reorder the rows of your dataframe using order():
df <- df[order(df$ID, df$Year), ]
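For data sets of this size, a by-reference alternative with data.table may be faster; a minimal sketch, assuming df is the data.frame from the question:
library(data.table)
setDT(df)                 # convert to data.table in place (no copy)
setorder(df, ID, Year)    # fast ordering by reference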
