How do you assign groups to larger groups with dplyr in R?

I would like to assign groups to larger groups (clusters) so that each cluster can be sent to one core for processing. I have 16 cores. This is what I have so far:
test <- data_extract %>% group_by(group_id) %>% sample_n(16, replace = TRUE)
This takes samples of 16 from each group.
This is an example of what I would like the final product to look like (with two clusters). All I really want is for every row with the same group_id to belong to the same cluster, with a fixed number of clusters.
________________________________
balance | group_id | cluster|
454452 | a | 1 |
5450441 | a | 1 |
5444531 | b | 1 |
5404051 | b | 1 |
5404501 | b | 1 |
5404041 | b | 1 |
544251 | b | 1 |
254252 | b | 1 |
541254 | c | 2 |
54123254 | d | 1 |
542541 | d | 1 |
5442341 | e | 2 |
541 | f | 1 |
________________________________

test<-data%>%group_by(group_id)%>% mutate(group = sample(1:16,1))
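The sample() call above picks one random cluster per group, so cluster sizes can come out uneven. As a minimal base-R sketch of a deterministic alternative (not the original poster's method), each distinct group_id can be mapped to a cluster number in round-robin order. The toy data reuses the balances from the question; the names n_clusters and cluster_of are illustrative assumptions.

```r
# Toy data in the shape shown in the question
data <- data.frame(
  balance  = c(454452, 5450441, 5444531, 5404051, 5404501, 5404041,
               544251, 254252, 541254, 54123254, 542541, 5442341, 541),
  group_id = c("a", "a", "b", "b", "b", "b", "b", "b", "c", "d", "d", "e", "f"),
  stringsAsFactors = FALSE
)

n_clusters <- 2  # assumed number of clusters (16 in the real use case)

# One cluster number per distinct group_id, assigned in round-robin order
ids <- unique(data$group_id)
cluster_of <- setNames(rep_len(seq_len(n_clusters), length(ids)), ids)

# Look up each row's cluster through its group_id
data$cluster <- unname(cluster_of[data$group_id])
print(data)
```

Every row of a given group_id ends up in the same cluster, and the rows for each core can then be separated with split(data, data$cluster).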

Related

Master-Detail show data SQL

I'm working with SQL Server and I have these 3 tables:
STUDENTS
| id | student |
-------------
| 1 | Ronald |
| 2 | Jenny |
SCORES
| id | score | period | student |
| 1 | 8 | 1 | 1 |
| 2 | 9 | 2 | 1 |
PERIODS
| id | period |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
And I want a query that returns this result:
| student | score1 | score2 | score3 | score4 |
| Ronald | 8 | 9 | null | null |
| Jenny | null | null | null | null |
As you can see, the number of score columns depends on the periods, because sometimes there can be 4 or 3 periods.
I don't know whether I have the wrong idea or whether I should do this in the application, but I would like some help.
You need to PIVOT your data e.g.
select Y.Student, [1] as score1, [2] as score2, [3] as score3, [4] as score4
from (
    -- every student paired with every period, with the score if one exists
    select T.Student, P.[Period], S.Score
    from Students T
    cross join [Periods] P
    left join Scores S on S.[Period] = P.id and S.Student = T.id
) X
pivot
(
    sum(Score)
    for [Period] in ([1],[2],[3],[4])
) Y
Reference: https://learn.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot?view=sql-server-20

How to remove empty cells and reduce columns

I have a table, that looks roughly like this:
| variable | observer1 | observer2 | observer3 | final |
| -------- | --------- | --------- | --------- | ----- |
| case1 | | | | |
| var1 | 1 | 1 | | |
| var2 | 3 | 3 | | |
| var3 | 4 | 5 | | 5 |
| case2 | | | | |
| var1 | 2 | | 2 | |
| var2 | 5 | | 5 | |
| var3 | 1 | | 1 | |
| case3 | | | | |
| var1 | | 2 | 3 | 2 |
| var2 | | 2 | 2 | |
| var3 | | 1 | 1 | |
| case4 | | | | |
| var1 | 1 | | 1 | |
| var2 | 5 | | 5 | |
| var3 | 3 | | 3 | |
There are three columns for the observers, but only two are filled in any given row.
First I want to compute the IRR, so I need a table that has two columns without the empty cells like this:
| variable | observer1 | observer2 |
| -------- | --------- | --------- |
| case1 | | |
| var1 | 1 | 1 |
| var2 | 3 | 3 |
| var3 | 4 | 5 |
| case2 | | |
| var1 | 2 | 2 |
| var2 | 5 | 5 |
| var3 | 1 | 1 |
| case3 | | |
| var1 | 2 | 3 |
| var2 | 2 | 2 |
| var3 | 1 | 1 |
| case4 | | |
| var1 | 1 | 1 |
| var2 | 5 | 5 |
| var3 | 3 | 3 |
I tried to use the tidyverse packages, but I'm not sure how. Some 'ifelse()' magic may be easier.
Is there a clean and easy method to do something like this? Can anybody point me to the right function to use? Or just to a keyword to search for on stackoverflow? I found a lot of methods to remove whole empty columns or rows.
Edit: I removed the link to the original data. It was unnecessary. Thanks to Lamia for his working answer.
Out of your 3 columns observer1, observer2 and observer3, you sometimes have 2 non-NA values, 1 non-NA value, or 3 NA values.
If you want to merge your 3 columns, you could do:
res = data.frame(df$coding,t(apply(df[paste0("observer",1:3)],1,function(x) x[!is.na(x)][1:2])))
The apply function will return for each row the 2 non-NA values if there are 2, one non-NA value and one NA if there is only one value, and two NAs if there is no data in the row.
We then put this result in a dataframe with the first column (coding).
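A minimal runnable sketch of the same idea on toy data follows. The data frame and its column name `variable` (taken from the table in the question rather than the `coding` column of the original data) are assumptions for illustration.

```r
# Toy data: three observer columns, at most two filled per row
df <- data.frame(
  variable  = c("case1", "var1", "var3", "var1", "var1"),
  observer1 = c(NA, 1, 4, 2, NA),
  observer2 = c(NA, 1, 5, NA, 2),
  observer3 = c(NA, NA, NA, 2, 3),
  stringsAsFactors = FALSE
)

obs_cols <- paste0("observer", 1:3)

# For each row keep the first two non-NA values; [1:2] pads with NA if fewer exist
first_two <- t(apply(df[obs_cols], 1, function(x) unname(x[!is.na(x)])[1:2]))

res <- data.frame(
  variable  = df$variable,
  observer1 = first_two[, 1],
  observer2 = first_two[, 2],
  stringsAsFactors = FALSE
)
print(res)
```

Rows that carry no ratings at all (such as the case header rows) come out with NA in both columns.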

Creation of Panel Data set in R

Programmers,
I have some difficulties in structuring my panel data set.
My panel data set, for the moment, has the following structure:
Exemplary here only with T = 2 and N = 3. (My real data set, however, is of size T = 6 and N = 20 000 000 )
Panel data structure 1:
Year | ID | Variable_1 | ... | Variable_k |
1 | 1 | A | ... | B |
1 | 2 | C | ... | D |
1 | 3 | E | ... | F |
2 | 1 | G | ... | H |
2 | 2 | I | ... | J |
2 | 3 | K | ... | L |
The desired structure is:
Panel data structure 2:
Year | ID | Variable_1 | ... | Variable_k |
1 | 1 | A | ... | B |
2 | 1 | G | ... | H |
1 | 2 | C | ... | D |
2 | 2 | I | ... | J |
1 | 3 | E | ... | F |
2 | 3 | K | ... | L |
This data structure represents the classic panel data structure, where the yearly observations over the whole period are structured for all individuals block by block.
My question: Is there any simple and efficient R solution that changes the data structure from Table 1 to Table 2 for very large data sets (data.frame)?
Thank you very much for all responses in advance!!
Enrico
You can reorder the rows of your dataframe using order():
df=df[order(df$ID,df$Year),]
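As a small sketch of the order() approach on the T = 2, N = 3 example from the question (the object names panel and panel_sorted are illustrative, and only Variable_1 is included):

```r
# Panel data structure 1: sorted by Year, then ID
panel <- data.frame(
  Year       = c(1, 1, 1, 2, 2, 2),
  ID         = c(1, 2, 3, 1, 2, 3),
  Variable_1 = c("A", "C", "E", "G", "I", "K"),
  stringsAsFactors = FALSE
)

# Sort by ID first, then by Year within each ID
panel_sorted <- panel[order(panel$ID, panel$Year), ]
rownames(panel_sorted) <- NULL  # drop the old row labels
print(panel_sorted)
```

This makes a sorted copy of the data frame. For the 20 000 000-individual data set, memory may become a concern; data.table's setorder(), which reorders by reference, is one option worth testing in that case.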

How do I analyze Market Basket Output?

I have sales data as below:
+------------+------+-------+
| Receipt ID | Item | Value |
+------------+------+-------+
| 1 | a | 2 |
| 1 | b | 3 |
| 1 | c | 2 |
| 1 | k | 4 |
| 2 | a | 2 |
| 2 | b | 5 |
| 2 | d | 6 |
| 2 | k | 7 |
| 3 | a | 8 |
| 3 | k | 1 |
| 3 | c | 2 |
| 3 | q | 3 |
| 4 | k | 4 |
| 4 | a | 5 |
| 5 | b | 6 |
| 5 | a | 7 |
| 6 | a | 8 |
| 6 | b | 3 |
| 6 | c | 4 |
+------------+------+-------+
Using the Apriori algorithm, I split the resulting rules into separate columns.
For example, I got the output below (I trimmed the support, confidence and lift values). I am only considering rules that map into the columns Target Item, Item1 and Item2, i.e. ({Item1, Item2} -> {Target Item}).
Output is as below:
+-------------+-------+-------+
| Target Item | Item1 | Item2 |
+-------------+-------+-------+
| a | b | |
| a | b | c |
| a | k | |
+-------------+-------+-------+
I am looking to find all the receipts that contain each rule's item combination, and to calculate the sale value of the Target Item in only those receipts, as well as the combined sale value of Item1 and Item2 in those receipts.
The output should be something like the table below (I don't need the receipt IDs):
+-------------+-------+-------+--------------+----------------------+------------------------------+
| Target Item | Item1 | Item2 | Receipt ID's | Value of Target Item | Remaining value(Item1+item2) |
+-------------+-------+-------+--------------+----------------------+------------------------------+
| a | b | | 1,2,5,6 | 2+2+7+8 | 3+5+6+3 |
| a | b | c | 1,6 | 2+8 | (3+3) + (2+4) |
| a | k | | 1,2,3,4 | 2+2+8+5 | 4+7+1+4 |
+-------------+-------+-------+--------------+----------------------+------------------------------+
To reproduce the Apriori rules:
library(arules)
# Sales data: one row per item on a receipt
Data <- data.frame(
  Receipt_ID = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,5,5,6,6,6),
  item = c('a','b','c','k','a','b','d','k','a','k','c','q','k','a','b','a','a','b','c'),
  value = c(2,3,2,4,2,5,6,7,8,1,2,3,4,5,6,7,8,3,4)
)
write.table(Data, "item.csv", sep = ',', row.names = FALSE)
# Read the file back as transactions, one transaction per Receipt_ID
data_frame <- read.transactions(
  file = "item.csv",
  format = "single",
  sep = ",",
  cols = c("Receipt_ID", "item"),
  rm.duplicates = TRUE
)
# Mine rules with the default apriori() parameters
rules_apriori <- apriori(data_frame)
rules_apriori
rules_tab <- as(rules_apriori, "data.frame")
rules_tab
# Split each rule string "{lhs} => {rhs}" and strip the braces
out <- strsplit(as.character(rules_tab$rules), '=>')
rules_tab$rhs <- do.call(rbind, out)[, 2]
rules_tab$lhs <- do.call(rbind, out)[, 1]
rules_tab$rhs <- gsub("\\{", "", rules_tab$rhs)
rules_tab$rhs <- gsub("}", "", rules_tab$rhs)
rules_tab$lhs <- gsub("}", "", rules_tab$lhs)
rules_tab$lhs <- gsub("\\{", "", rules_tab$lhs)
# Two-column result: the target item and the item combination that implies it
rules_final <- cbind(target_item = rules_tab$rhs, item_Combination = rules_tab$lhs)
rules_final
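One possible way to compute the requested totals, sketched in base R under some assumptions: the rules are typed in by hand in the Target Item / Item1 / Item2 layout from the question (NA marks an unused slot), and the helper summarise_rule() and the output column names are hypothetical, not part of arules.

```r
# Sales data from the question
sales <- data.frame(
  Receipt_ID = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,5,5,6,6,6),
  item  = c("a","b","c","k","a","b","d","k","a","k","c","q","k","a","b","a","a","b","c"),
  value = c(2,3,2,4,2,5,6,7,8,1,2,3,4,5,6,7,8,3,4),
  stringsAsFactors = FALSE
)

# Rules in the Target Item / Item1 / Item2 layout (entered by hand for illustration)
rules <- data.frame(
  target = c("a", "a", "a"),
  item1  = c("b", "b", "k"),
  item2  = c(NA, "c", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: totals for one rule over receipts containing all its items
summarise_rule <- function(target, lhs, sales) {
  needed <- c(target, lhs)
  by_receipt <- split(sales$item, sales$Receipt_ID)
  has_all <- vapply(by_receipt, function(items) all(needed %in% items), logical(1))
  hit <- names(by_receipt)[has_all]
  in_hit <- sales$Receipt_ID %in% as.numeric(hit)
  data.frame(
    target       = target,
    lhs          = paste(lhs, collapse = ","),
    receipts     = paste(hit, collapse = ","),
    target_value = sum(sales$value[in_hit & sales$item == target]),
    lhs_value    = sum(sales$value[in_hit & sales$item %in% lhs]),
    stringsAsFactors = FALSE
  )
}

result <- do.call(rbind, lapply(seq_len(nrow(rules)), function(i) {
  lhs <- c(rules$item1[i], rules$item2[i])
  summarise_rule(rules$target[i], lhs[!is.na(lhs)], sales)
}))
print(result)
```

The receipts column is kept only to make the result easy to check by hand; it can be dropped if the IDs are not needed.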

How to subset a dataframe using a column from another dataframe in r?

I have 2 dataframes
Dataframe1:
| Cue | Ass_word | Condition | Freq | Cue_Ass_word |
1 | ACCENDERE | ACCENDINO | A | 1 | ACCENDERE_ACCENDINO
2 | ACCENDERE | ALLETTARE | A | 0 | ACCENDERE_ALLETTARE
3 | ACCENDERE | APRIRE | A | 1 | ACCENDERE_APRIRE
4 | ACCENDERE | ASCENDERE | A | 1 | ACCENDERE_ASCENDERE
5 | ACCENDERE | ATTIVARE | A | 0 | ACCENDERE_ATTIVARE
6 | ACCENDERE | AUTO | A | 0 | ACCENDERE_AUTO
7 | ACCENDERE | ACCENDINO | B | 2 | ACCENDERE_ACCENDINO
8 | ACCENDERE| ALLETTARE | B | 3 | ACCENDERE_ALLETTARE
9 | ACCENDERE| ACCENDINO | C | 2 | ACCENDERE_ACCENDINO
10 | ACCENDERE| ALLETTARE | C | 0 | ACCENDERE_ALLETTARE
Dataframe2:
| Group.1 | x
1 | ACCENDERE_ACCENDINO | 5
13 | ACCENDERE_FUOCO | 22
16 | ACCENDERE_LUCE | 10
24 | ACCENDERE_SIGARETTA | 6
....
I want to exclude from Dataframe1 all the rows that contain words (Cue_Ass_word) that are not reported in the column Group.1 in Dataframe2.
In other words, how can I subset Dataframe1 using the strings reported in Dataframe2$Group.1?
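It's not quite clear what you mean, but is this what you need?
Dataframe1[Dataframe1$Cue_Ass_word %in% Dataframe2$Group.1,]
This keeps only the rows of Dataframe1 whose Cue_Ass_word appears in Dataframe2$Group.1. Wrap the condition in !( ) to keep the rows that do not appear instead.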
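A minimal sketch of %in%-based filtering on toy data, showing both directions. The rows below are a cut-down version of the question's tables, and the names kept and dropped are illustrative.

```r
# Cut-down versions of the two data frames from the question
Dataframe1 <- data.frame(
  Cue_Ass_word = c("ACCENDERE_ACCENDINO", "ACCENDERE_ALLETTARE",
                   "ACCENDERE_APRIRE", "ACCENDERE_ACCENDINO"),
  Freq = c(1, 0, 1, 2),
  stringsAsFactors = FALSE
)
Dataframe2 <- data.frame(
  Group.1 = c("ACCENDERE_ACCENDINO", "ACCENDERE_FUOCO"),
  x = c(5, 22),
  stringsAsFactors = FALSE
)

# TRUE for rows whose Cue_Ass_word is listed in Dataframe2$Group.1
listed <- Dataframe1$Cue_Ass_word %in% Dataframe2$Group.1

kept    <- Dataframe1[listed, ]   # rows that are listed in Group.1
dropped <- Dataframe1[!listed, ]  # rows that are not listed
print(kept)
print(dropped)
```

Note that the column is named Group.1, with a dot. Dataframe2$Group1 returns NULL, so a filter built on it would silently match nothing.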
