Creation of Panel Data set in R - r

Programmers,
I have some difficulties in structuring my panel data set.
My panel data set, for the moment, has the following structure:
Exemplary here only with T = 2 and N = 3. (My real data set, however, is of size T = 6 and N = 20 000 000 )
Panel data structure 1:
Year | ID | Variable_1 | ... | Variable_k |
1 | 1 | A | ... | B |
1 | 2 | C | ... | D |
1 | 3 | E | ... | F |
2 | 1 | G | ... | H |
2 | 2 | I | ... | J |
2 | 3 | K | ... | L |
The desired structure is:
Panel data structure 2:
Year | ID | Variable_1 | ... | Variable_k |
1 | 1 | A | ... | B |
2 | 1 | G | ... | H |
1 | 2 | C | ... | D |
2 | 2 | I | ... | J |
1 | 3 | E | ... | F |
2 | 3 | K | ... | L |
This data structure represents the classic panel data structure, where the yearly observations over the whole period are structured for all individuals block by block.
My question: Is there any simple and efficient R-solution that changes the data structure from Table 1 to Table 2 for very large data sets (data.frame).
Thank you very much for all responses in advance!!
Enrico

You can reorder the rows of your dataframe using order():
df=df[order(df$ID,df$Year),]

Related

How do you assign groups to larger groups dpylr

I would like to assign groups to larger groups in order to assign them to cores for processing. I have 16 cores.This is what I have so far
test<-data_extract%>%group_by(group_id)%>%sample_n(16,replace = TRUE)
This takes staples OF 16 from each group.
This is an example of what I would like the final product to look like (with two clusters),all I really want is for the same group id to belong to the same cluster as a set number of clusters
________________________________
balance | group_id | cluster|
454452 | a | 1 |
5450441 | a | 1 |
5444531 | b | 1 |
5404051 | b | 1 |
5404501 | b | 1 |
5404041 | b | 1 |
544251 | b | 1 |
254252 | b | 1 |
541254 | c | 2 |
54123254 | d | 1 |
542541 | d | 1 |
5442341 | e | 2 |
541 | f | 1 |
________________________________
test<-data%>%group_by(group_id)%>% mutate(group = sample(1:16,1))

How to get a query result into a key value form in HiveQL

I have tried different things, but none succeeded. I have the following issue, and would be very gratefull if someone could help me.
I get the data from a view as several billions of records, for different measures
A)
| s_c_m1 | s_c_m2 | s_c_m3 | s_c_m4 | s_p_m1 | s_p_m2 | s_p_m3 | s_p_m4 |
|--------+--------+--------+--------+--------+--------+--------+--------|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|--------+--------+--------+--------+--------+--------+--------+--------|
Then I need to aggregate it by each measure. And so long so fine. I got this figured out.
B)
| s_c_m1 | s_c_m2 | s_c_m3 | s_c_m4 | s_p_m1 | s_p_m2 | s_p_m3 | s_p_m4 |
|--------+--------+--------+--------+--------+--------+--------+--------|
| 3 | 6 | 9 | 12 | 15 | 18 | 21 | 24 |
|--------+--------+--------+--------+--------+--------+--------+--------|
Then I need to get the data in the following form. I need to turn it into a key-value form.
C)
| measure | c | p |
|---------+----+----|
| m1 | 3 | 15 |
| m2 | 6 | 18 |
| m3 | 9 | 21 |
| m4 | 12 | 24 |
|---------+----+----|
The first 4 columns from B) would form in C) the first column, and the second 4 columns would form another column.
Is there an elegant way, that could be easily maintainable? The perfect solution would be if another measure would be introduced in A) and B), there no modification would be required and it would automatically pick up the difference.
I know how to get this done in SqlServer and Postgres, but here I am missing the expirience.
I think you should use map for this

How do I analyze Market Basket Output?

I have a sale data as below:
+------------+------+-------+
| Receipt ID | Item | Value |
+------------+------+-------+
| 1 | a | 2 |
| 1 | b | 3 |
| 1 | c | 2 |
| 1 | k | 4 |
| 2 | a | 2 |
| 2 | b | 5 |
| 2 | d | 6 |
| 2 | k | 7 |
| 3 | a | 8 |
| 3 | k | 1 |
| 3 | c | 2 |
| 3 | q | 3 |
| 4 | k | 4 |
| 4 | a | 5 |
| 5 | b | 6 |
| 5 | a | 7 |
| 6 | a | 8 |
| 6 | b | 3 |
| 6 | c | 4 |
+------------+------+-------+
Using APriori algorithm, I modified the Rules into different columns:
For eg, I got output as below, I trimmed support, confidence, Lift value.. I am only considering rules which mapped into different columns into Target Item, Item1, Items ({Item1,Item2} -> {Target Item})
Output is as below:
+-------------+-------+-------+
| Target Item | Item1 | Item2 |
+-------------+-------+-------+
| a | b | |
| a | b | c |
| a | k | |
+-------------+-------+-------+
I am looking to calculate the all the receipts having the rules combination and identify the Target item Sale value only in those receipts and also Combined sale value of Item 1 and Item 2 in the combination receipts:
Output should be something like below (I dont need receipt ID's from below)
+-------------+-------+-------+--------------+----------------------+------------------------------+
| Target Item | Item1 | Item2 | Receipt ID's | Value of Target Item | Remaining value(Item1+item2) |
+-------------+-------+-------+--------------+----------------------+------------------------------+
| a | b | | 1,2,5,6 | 2+2+7+8 | 3+5+6+3 |
| a | b | c | 1,6 | 2 | (3+3) + (2+4) |
| a | k | | 1,2,3,4 | 2+2+8+5 | 4+7+1+4 |
+-------------+-------+-------+--------------+----------------------+------------------------------+
To replicate the Apriori:
library(arules)
Data <- data.frame(
Receipt_ID = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,5,5,6,6,6),
item = c('a','b','c','k','a','b','d','k','a','k','c','q','k', 'a','b','a','a', 'b', 'c'
)
,
value = c(2,3,2,4,2,5,6,7,8,1,2,3,4,5,6,7,8,3,4
)
)
write.table(Data,"item.csv",sep=',',row.names = F)
data_frame = read.transactions(
file = "item.csv",
format = "single",
sep = ",",
cols = c("Receipt_ID","item"),
rm.duplicates = T
)
rules_apriori <- apriori(data_frame)
rules_apriori
rules_tab <- as(rules_apriori, "data.frame")
rules_tab
out <- strsplit(as.character(rules_tab$rules),'=>')
rules_tab$rhs <- do.call(rbind, out)[,2]
rules_tab$lhs <- do.call(rbind, out)[,1]
rules_tab$rhs <- gsub("\\{", "", rules_tab$rhs)
rules_tab$rhs <- gsub("}", "", rules_tab$rhs)
rules_tab$lhs = gsub("}", "", rules_tab$lhs)
rules_tab$lhs = gsub("\\{", "", rules_tab$lhs)
rules_final <- data.frame (target_item = character(),item_combination = character() )
rules_final <- cbind(target_item = rules_tab$rhs,item_Combination = rules_tab$lhs)
rules_final

How to subset a dataframe using a column from another dataframe in r?

I have 2 dataframes
Dataframe1:
| Cue | Ass_word | Condition | Freq | Cue_Ass_word |
1 | ACCENDERE | ACCENDINO | A | 1 | ACCENDERE_ACCENDINO
2 | ACCENDERE | ALLETTARE | A | 0 | ACCENDERE_ALLETTARE
3 | ACCENDERE | APRIRE | A | 1 | ACCENDERE_APRIRE
4 | ACCENDERE | ASCENDERE | A | 1 | ACCENDERE_ASCENDERE
5 | ACCENDERE | ATTIVARE | A | 0 | ACCENDERE_ATTIVARE
6 | ACCENDERE | AUTO | A | 0 | ACCENDERE_AUTO
7 | ACCENDERE | ACCENDINO | B | 2 | ACCENDERE_ACCENDINO
8 | ACCENDERE| ALLETTARE | B | 3 | ACCENDERE_ALLETTARE
9 | ACCENDERE| ACCENDINO | C | 2 | ACCENDERE_ACCENDINO
10 | ACCENDERE| ALLETTARE | C | 0 | ACCENDERE_ALLETTARE
Dataframe2:
| Group.1 | x
1 | ACCENDERE_ACCENDINO | 5
13 | ACCENDERE_FUOCO | 22
16 | ACCENDERE_LUCE | 10
24 | ACCENDERE_SIGARETTA | 6
....
I want to exclude from Dataframe1 all the rows that contain words (Cue_Ass_word) that are not reported in the column Group.1 in Dataframe2.
In other words, how can I subset Dataframe1 using the strings reported in Dataframe2$Group.1?
It's not quite clear what you mean, but is this what you need?
Dataframe1[!(Dataframe1$Cue_Ass_word %in% Dataframe2$Group1),]

R : Create a factor from two variables containing ranks and levels

Everything is in the title, I got from a database many columns, paired two-by-two containing codes and labels for some variables, I want an easy way to create half as many factors, with, for each factor levels/codes matching to the original two variables.
Here is an exemple of original data for two factors
| customer_type | customer_type_name | customer_status | customer_status_name |
|----------------------|----------------------|----------------------|----------------------|
| 1 | A | 2 | Beta |
| 2 | B | 2 | Beta |
| 3 | C | 1 | Alpha |
| 2 | B | 3 | Gamma |
| 1 | A | 4 | Delta |
| 3 | C | 2 | Beta |
i.e. a simpler way (simpler to call in a function for lots of variables) to do from dataframe "accounts"
a<-accounts[,c("customertypecode","customertypecodename")]
a<-a[!duplicated(a),]
a<-a[order(a$customertypecode),]
accounts$customertypecode<-factor(accounts$customertypecode,labels=a$customertypecodename[!is.na(a$customertypecodename)])

Resources