I'm trying to aggregate multiple rows by column in a data frame. I succed to use aggregate for one column \o/ but I don't understand how to use it for several columns. I let you an exemple of my data:
Gene_Title ID_Affymetrix GB_Acc.x Gene_Symbol.x Entrez ID_Agl GB_Acc.y Gene_Symbol.y Unigene Ensembl Chr_location
trafficking protein particle complex 4 1429632_at AK005276 Trappc4 60409 10239 NM_021789 Trappc4 Mm.29814 ENSMUST00000170082 chr9:44211918-44211859
aldo-keto reductase family 1, member B3 (aldose reductase) 1437133_x_at AV127085 Akr1b3 11677 22 NM_009658 Akr1b3 Mm.451 ENSMUST00000166583 chr6:34253982-34253930
sodium channel, voltage-gated, type I, alpha 1450120_at AV336781 Scn1a 20265 58 NM_018733 Scn1a Mm.439704 ENSMUST00000094951 chr2:66173557-66173498
sodium channel, voltage-gated, type I, alpha 1450121_at AV336781 Scn1a 20265 58 NM_018733 Scn1a Mm.439704 ENSMUST00000094951 chr2:66173557-66173498
aldo-keto reductase family 1, member B3 (aldose reductase) 1456590_x_at BB469763 Akr1b3 11677 22 NM_009658 Akr1b3 Mm.451 ENSMUST00000166583 chr6:34253982-34253930
dolichol-phosphate (beta-D) mannosyltransferase 2 1415675_at BC008256 Dpm2 13481 33459 NM_010073 Dpm2 Mm.22001 ENSMUST00000150419 chr2:32428766-32428825
proline rich 13 1423686_a_at BC016234 Prr13 66151 4 NM_025385 Prr13 Mm.393955 ENSMUST00000164688 chr15:102291090-102291149
transmembrane protein 2 1424711_at BC019745 Tmem2 83921 23 NM_031997 Tmem2 Mm.329776 ENSMUST00000096194 chr19:21930192-21930251
transmembrane protein 2 1451458_at BC019745 Tmem2 83921 23 NM_031997 Tmem2 Mm.329776 ENSMUST00000096194 chr19:21930192-21930251
lipase, endothelial 1450188_s_at BC020991 Lipg 16891 52 NM_010720 Lipg Mm.299647 ENSMUST00000066532 chr18:75099688-75099629
lipase, endothelial 1421261_at BC020991 Lipg 16891 52 NM_010720 Lipg Mm.299647 ENSMUST00000066532 chr18:75099688-75099629
lipase, endothelial 1421262_at BC020991 Lipg 16891 52 NM_010720 Lipg Mm.299647 ENSMUST00000066532 chr18:75099688-75099629
coatomer protein complex, subunit gamma 1415670_at BC024686 Copg 54161 25829 NM_017477 Copg Mm.258785 ENSMUST00000113607 chr6:87862890-87862949
coatomer protein complex, subunit gamma 1416017_at BC024686 Copg 54161 25829 NM_017477 Copg Mm.258785 ENSMUST00000113607 chr6:87862890-87862949
leucine rich repeat containing 1 1452411_at BG966295 Lrrc1 214345 29 NM_172528 Lrrc1 Mm.28534 ENSMUST00000049755 chr9:77278998-77278939
aldo-keto reductase family 1, member B3 (aldose reductase) 1448319_at NM_009658 Akr1b3 11677 22 NM_009658 Akr1b3 Mm.451 ENSMUST00000166583 chr6:34253982-34253930
ATPase, H+ transporting, lysosomal V0 subunit D1 1415671_at NM_013477 Atp6v0d1 11972 11826 NM_013477 Atp6v0d1 Mm.17708 ENSMUST00000013304 chr8:108048837-108048778
golgi autoantigen, golgin subfamily a, 7 1415672_at NM_020585 Golga7 57437 54944 NM_020585 Golga7 Mm.196269 ENSMUST00000121783 chr8:24351978-24351919
trafficking protein particle complex 4 1415674_a_at NM_021789 Trappc4 60409 10239 NM_021789 Trappc4 Mm.29814 ENSMUST00000170082 chr9:44211918-44211859
phosphoserine phosphatase 1415673_at NM_133900 Psph 100678 57142 NM_133900 Psph Mm.271784 ENSMUST00000031399 chr5:130271500-130271441
Some gene_title (and gene_symbol) are represented several times but with different ID(Affymetrix or Agilent), or with different GB_Acc. In general I want to have only one line per gene and in Ids or GB_Acc or other columns the different values:
Here my data with Id affymetrix:
>f=function(x){return(paste(x,collapse=","))}
>tab4=aggregate(ID_Affymetrix ~ GB_Acc.x+ Gene_Title+GB_Acc.y+Gene_Symbol.x+Entrez+Unigene+Ensembl+Chr_location+ID_Agl,data=tab3,f)
GB_Acc.x Gene_Title GB_Acc.y Gene_Symbol.x Entrez Unigene Ensembl Chr_location ID_Agl ID_Affymetrix
BC016234 proline rich 13 NM_025385 Prr13 66151 Mm.393955 ENSMUST00000164688 chr15:102291090-102291149 4 1423686_a_at
AV127085 aldo-keto reductase family 1, member B3 (aldose reductase) NM_009658 Akr1b3 11677 Mm.451 ENSMUST00000166583 chr6:34253982-34253930 22 1437133_x_at
BB469763 aldo-keto reductase family 1, member B3 (aldose reductase) NM_009658 Akr1b3 11677 Mm.451 ENSMUST00000166583 chr6:34253982-34253930 22 1456590_x_at
NM_009658 aldo-keto reductase family 1, member B3 (aldose reductase) NM_009658 Akr1b3 11677 Mm.451 ENSMUST00000166583 chr6:34253982-34253930 22 1448319_at
BC019745 transmembrane protein 2 NM_031997 Tmem2 83921 Mm.329776 ENSMUST00000096194 chr19:21930192-21930251 23 1424711_at,1451458_at
BG966295 leucine rich repeat containing 1 NM_172528 Lrrc1 214345 Mm.28534 ENSMUST00000049755 chr9:77278998-77278939 29 1452411_at
BC020991 lipase, endothelial NM_010720 Lipg 16891 Mm.299647 ENSMUST00000066532 chr18:75099688-75099629 52 1450188_s_at,1421261_at,1421262_at
AV336781 sodium channel, voltage-gated, type I, alpha NM_018733 Scn1a 20265 Mm.439704 ENSMUST00000094951 chr2:66173557-66173498 58 1450120_at,1450121_at
AK005276 trafficking protein particle complex 4 NM_021789 Trappc4 60409 Mm.29814 ENSMUST00000170082 chr9:44211918-44211859 10239 1429632_at
NM_021789 trafficking protein particle complex 4 NM_021789 Trappc4 60409 Mm.29814 ENSMUST00000170082 chr9:44211918-44211859 10239 1415674_a_at
NM_013477 ATPase, H+ transporting, lysosomal V0 subunit D1 NM_013477 Atp6v0d1 11972 Mm.17708 ENSMUST00000013304 chr8:108048837-108048778 11826 1415671_at
BC024686 coatomer protein complex, subunit gamma NM_017477 Copg 54161 Mm.258785 ENSMUST00000113607 chr6:87862890-87862949 25829 1415670_at,1416017_at
BC008256 dolichol-phosphate (beta-D) mannosyltransferase 2 NM_010073 Dpm2 13481 Mm.22001 ENSMUST00000150419 chr2:32428766-32428825 33459 1415675_at
NM_020585 golgi autoantigen, golgin subfamily a, 7 NM_020585 Golga7 57437 Mm.196269 ENSMUST00000121783 chr8:24351978-24351919 54944 1415672_at
NM_133900 phosphoserine phosphatase NM_133900 Psph 100678 Mm.271784 ENSMUST00000031399 chr5:130271500-130271441 57142 1415673_at
As you can see, for Tmem2, Copg,Lipg and Scn1a I now have several ID_Affymetrix in the same row. For this genes the only difference was on this column. But for Akr1b3 and Trappc4 there was also some difference in th GB_acc.x column.
So in a general way I would like to make an aggregate for each columns (except Gene_Title and Gene_Symbol which normally are always the same for a given gene) and finally have for exemple:
Gene_Tile Gene_Symbol GB_Acc ID_Affy ...
Traffickp Prot complex 4 Trapcc4 AK005276,NM_021789 1429632_at,1415674_a_at ...
If anyone as any idea
Thanks!
EDIT:
this is the dput(head(mydata,20)). There's some errors at the end but I didn't know this function and his goal
structure(list(Gene_Title = structure(c(1L, 1L, 1L, 2L, 3L, 3L,
4L, 5L, 6L, 7L, 7L, 7L, 8L, 9L, 10L, 10L, 11L, 11L, 12L, 12L), .Label = c("aldo-keto reductase family 1, member B3 (aldose reductase)",
"ATPase, H+ transporting, lysosomal V0 subunit D1", "coatomer protein complex, subunit gamma",
"dolichol-phosphate (beta-D) mannosyltransferase 2", "golgi autoantigen, golgin subfamily a, 7",
"leucine rich repeat containing 1", "lipase, endothelial", "phosphoserine phosphatase",
"proline rich 13", "sodium channel, voltage-gated, type I, alpha",
"trafficking protein particle complex 4", "transmembrane protein 2"
), class = "factor"), ID_Affymetrix = structure(c(13L, 14L, 20L,
2L, 1L, 7L, 6L, 3L, 19L, 17L, 8L, 9L, 4L, 10L, 15L, 16L, 5L,
12L, 11L, 18L), .Label = c("1415670_at", "1415671_at", "1415672_at",
"1415673_at", "1415674_a_at", "1415675_at", "1416017_at", "1421261_at",
"1421262_at", "1423686_a_at", "1424711_at", "1429632_at", "1437133_x_at",
"1448319_at", "1450120_at", "1450121_at", "1450188_s_at", "1451458_at",
"1452411_at", "1456590_x_at"), class = "factor"), GB_Acc.x = structure(c(2L,
11L, 4L, 12L, 9L, 9L, 5L, 13L, 10L, 8L, 8L, 8L, 15L, 6L, 3L,
3L, 14L, 1L, 7L, 7L), .Label = c("AK005276", "AV127085", "AV336781",
"BB469763", "BC008256", "BC016234", "BC019745", "BC020991", "BC024686",
"BG966295", "NM_009658", "NM_013477", "NM_020585", "NM_021789",
"NM_133900"), class = "factor"), Gene_Symbol.x = structure(c(1L,
1L, 1L, 2L, 3L, 3L, 4L, 5L, 7L, 6L, 6L, 6L, 9L, 8L, 10L, 10L,
12L, 12L, 11L, 11L), .Label = c("Akr1b3", "Atp6v0d1", "Copg",
"Dpm2", "Golga7", "Lipg", "Lrrc1", "Prr13", "Psph", "Scn1a",
"Tmem2", "Trappc4"), class = "factor"), Entrez = c(11677L, 11677L,
11677L, 11972L, 54161L, 54161L, 13481L, 57437L, 214345L, 16891L,
16891L, 16891L, 100678L, 66151L, 20265L, 20265L, 60409L, 60409L,
83921L, 83921L), ID_Agl = c(22L, 22L, 22L, 11826L, 25829L, 25829L,
33459L, 54944L, 29L, 52L, 52L, 52L, 57142L, 4L, 58L, 58L, 10239L,
10239L, 23L, 23L), GB_Acc.y = structure(c(1L, 1L, 1L, 4L, 5L,
5L, 2L, 7L, 12L, 3L, 3L, 3L, 11L, 9L, 6L, 6L, 8L, 8L, 10L, 10L
), .Label = c("NM_009658", "NM_010073", "NM_010720", "NM_013477",
"NM_017477", "NM_018733", "NM_020585", "NM_021789", "NM_025385",
"NM_031997", "NM_133900", "NM_172528"), class = "factor"), Gene_Symbol.y = structure(c(1L,
1L, 1L, 2L, 3L, 3L, 4L, 5L, 7L, 6L, 6L, 6L, 9L, 8L, 10L, 10L,
12L, 12L, 11L, 11L), .Label = c("Akr1b3", "Atp6v0d1", "Copg",
"Dpm2", "Golga7", "Lipg", "Lrrc1", "Prr13", "Psph", "Scn1a",
"Tmem2", "Trappc4"), class = "factor"), Unigene = structure(c(12L,
12L, 12L, 1L, 4L, 4L, 3L, 2L, 6L, 8L, 8L, 8L, 5L, 10L, 11L, 11L,
7L, 7L, 9L, 9L), .Label = c("Mm.17708", "Mm.196269", "Mm.22001",
"Mm.258785", "Mm.271784", "Mm.28534", "Mm.29814", "Mm.299647",
"Mm.329776", "Mm.393955", "Mm.439704", "Mm.451"), class = "factor"),
Ensembl = structure(c(11L, 11L, 11L, 1L, 7L, 7L, 9L, 8L,
3L, 4L, 4L, 4L, 2L, 10L, 5L, 5L, 12L, 12L, 6L, 6L), .Label = c("ENSMUST00000013304",
"ENSMUST00000031399", "ENSMUST00000049755", "ENSMUST00000066532",
"ENSMUST00000094951", "ENSMUST00000096194", "ENSMUST00000113607",
"ENSMUST00000121783", "ENSMUST00000150419", "ENSMUST00000164688",
"ENSMUST00000166583", "ENSMUST00000170082"), class = "factor"),
Chr_location = structure(c(7L, 7L, 7L, 9L, 8L, 8L, 4L, 10L,
12L, 2L, 2L, 2L, 6L, 1L, 5L, 5L, 11L, 11L, 3L, 3L), .Label = c("chr15:102291090-102291149",
"chr18:75099688-75099629", "chr19:21930192-21930251", "chr2:32428766-32428825",
"chr2:66173557-66173498", "chr5:130271500-130271441", "chr6:34253982-34253930",
"chr6:87862890-87862949", "chr8:108048837-108048778", "chr8:24351978-24351919",
"chr9:44211918-44211859", "chr9:77278998-77278939"), class = "factor")), .Names = c("Gene_Title",
"ID_Affymetrix", "GB_Acc.x", "Gene_Symbol.x", "Entrez", "ID_Agl",
"GB_Acc.y", "Gene_Symbol.y", "Unigene", "Ensembl", "Chr_location"
), row.names = c(NA, 20L), class = "data.frame")
structure(list(Gene_Title = structure(c(1L, 1L, 1L, 2L, 3L, 3L,
4L, 5L, 6L, 7L, 7L, 7L, 8L, 9L, 10L, 10L, 11L, 11L, 12L, 12L), .Label = c("aldo-keto reductase family 1, member B3 (aldose reductase)",
"ATPase, H+ transporting, lysosomal V0 subunit D1", "coatomer protein complex, subunit gamma",
"dolichol-phosphate (beta-D) mannosyltransferase 2", "golgi autoantigen, golgin subfamily a, 7",
"leucine rich repeat containing 1", "lipase, endothelial", "phosphoserine phosphatase",
"proline rich 13", "sodium channel, voltage-gated, type I, alpha",
"trafficking protein particle complex 4", "transmembrane protein 2"
), class = "factor"), ID_Affymetrix = structure(c(13L, 14L, 20L,
2L, 1L, 7L, 6L, 3L, 19L, 17L, 8L, 9L, 4L, 10L, 15L, 16L, 5L,
12L, 11L, 18L), .Label = c("1415670_at", "1415671_at", "1415672_at",
"1415673_at", "1415674_a_at", "1415675_at", "1416017_at", "1421261_at",
"1421262_at", "1423686_a_at", "1424711_at", "1429632_at", "1437133_x_at",
"1448319_at", "1450120_at", "1450121_at", "1450188_s_at", "1451458_at",
"1452411_at", "1456590_x_at"), class = "factor"), GB_Acc.x = structure(c(2L,
11L, 4L, 12L, 9L, 9L, 5L, 13L, 10L, 8L, 8L, 8L, 15L, 6L, 3L,
3L, 14L, 1L, 7L, 7L), .Label = c("AK005276", "AV127085", "AV336781",
"BB469763", "BC008256", "BC016234", "BC019745", "BC020991", "BC024686",
"BG966295", "NM_009658", "NM_013477", "NM_020585", "NM_021789",
"NM_133900"), class = "factor"), Gene_Symbol.x = structure(c(1L,
1L, 1L, 2L, 3L, 3L, 4L, 5L, 7L, 6L, 6L, 6L, 9L, 8L, 10L, 10L,
12L, 12L, 11L, 11L), .Label = c("Akr1b3", "Atp6v0d1", "Copg",
"Dpm2", "Golga7", "Lipg", "Lrrc1", "Prr13", "Psph", "Scn1a",
"Tmem2", "Trappc4"), class = "factor"), Entrez = c(11677L, 11677L,
11677L, 11972L, 54161L, 54161L, 13481L, 57437L, 214345L, 16891L,
16891L, 16891L, 100678L, 66151L, 20265L, 20265L, 60409L, 60409L,
83921L, 83921L), ID_Agl = c(22L, 22L, 22L, 11826L, 25829L, 25829L,
33459L, 54944L, 29L, 52L, 52L, 52L, 57142L, 4L, 58L, 58L, 10239L,
10239L, 23L, 23L), GB_Acc.y = structure(c(1L, 1L, 1L, 4L, 5L,
5L, 2L, 7L, 12L, 3L, 3L, 3L, 11L, 9L, 6L, 6L, 8L, 8L, 10L, 10L
), .Label = c("NM_009658", "NM_010073", "NM_010720", "NM_013477",
"NM_017477", "NM_018733", "NM_020585", "NM_021789", "NM_025385",
"NM_031997", "NM_133900", "NM_172528"), class = "factor"), Gene_Symbol.y = structure(c(1L,
1L, 1L, 2L, 3L, 3L, 4L, 5L, 7L, 6L, 6L, 6L, 9L, 8L, 10L, 10L,
12L, 12L, 11L, 11L), .Label = c("Akr1b3", "Atp6v0d1", "Copg",
"Dpm2", "Golga7", "Lipg", "Lrrc1", "Prr13", "Psph", "Scn1a",
"Tmem2", "Trappc4"), class = "factor"), Unigene = structure(c(12L,
12L, 12L, 1L, 4L, 4L, 3L, 2L, 6L, 8L, 8L, 8L, 5L, 10L, 11L, 11L,
7L, 7L, 9L, 9L), .Label = c("Mm.17708", "Mm.196269", "Mm.22001",
"Mm.258785", "Mm.271784", "Mm.28534", "Mm.29814", "Mm.299647",
"Mm.329776", "Mm.393955", "Mm.439704", "Mm.451"), class = "factor"),
Ensembl = structure(c(11L, 11L, 11L, 1L, 7L, 7L, 9L, 8L,
3L, 4L, 4L, 4L, 2L, 10L, 5L, 5L, 12L, 12L, 6L, 6L), .Label = c("ENSMUST00000013304",
"ENSMUST00000031399", "ENSMUST00000049755", "ENSMUST00000066532",
"ENSMUST00000094951", "ENSMUST00000096194", "ENSMUST00000113607",
"ENSMUST00000121783", "ENSMUST00000150419", "ENSMUST00000164688",
"ENSMUST00000166583", "ENSMUST00000170082"), class = "factor"),
Chr_location = structure(c(7L, 7L, 7L, 9L, 8L, 8L, 4L, 10L,
12L, 2L, 2L, 2L, 6L, 1L, 5L, 5L, 11L, 11L, 3L, 3L), .Label = c("chr15:102291090-102291149",
"chr18:75099688-75099629", "chr19:21930192-21930251", "chr2:32428766-32428825",
"chr2:66173557-66173498", "chr5:130271500-130271441", "chr6:34253982-34253930",
"chr6:87862890-87862949", "chr8:108048837-108048778", "chr8:24351978-24351919",
"chr9:44211918-44211859", "chr9:77278998-77278939"), class = "factor")), .Names = c("Gene_Title",
"ID_Affymetrix", "GB_Acc.x", "Gene_Symbol.x", "Entrez", "ID_Agl",
"GB_Acc.y", "Gene_Symbol.y", "Unigene", "Ensembl", "Chr_location"
), row.names = c(NA, 20L), class = "data.frame")
Erreur dans `?`(dput(head(tab3, 20)), dput(head(tab3, 20))) :
c("pas de documentation de type ‘c(1, 1, 1, 2, 3, 3, 4, 5, 6, 7, 7, 7, 8, 9, 10, 10, 11, 11, 12, 12)’ et de thème ‘dput(head(tab3, 20))’ (ou erreur de traitement de l'aide)", "pas de documentation de type ‘c(13, 14, 20, 2, 1, 7, 6, 3, 19, 17, 8, 9, 4, 10, 15, 16, 5, 12, 11, 18)’ et de thème ‘dput(head(tab3, 20))’ (ou erreur de traitement de l'aide)", "pas de documentation de type ‘c(2, 11, 4, 12, 9, 9, 5, 13, 10, 8, 8, 8, 15, 6, 3, 3, 14, 1, 7, 7)’ et de thème ‘dput(head(tab3, 20))’ (ou erreur de traitement de l'aide)",
"pas de documentation de type ‘c(1, 1, 1, 2, 3, 3, 4, 5, 7, 6, 6, 6, 9, 8, 10, 10, 12, 12, 11, 11)’ et de thème ‘dput(head(tab3, 20))’ (ou erreur de traitement de l'aide)", "pas de documentation de type ‘c(11677, 11677, 11677, 11972, 54161, 54161, 13481, 57437, 214345, 16891, 16891, 16891, 100678, 66151, 20265, 20265, 60409, 60409, 83921, 83921)’ et de thème ‘dput(head(tab3, 20))’ (ou erreur de traitement de l'a
Maybe this is what you're looking for?
library(dplyr)
dfcollapsed <- df %>% # replace df with the name of your data frame
group_by(Gene_Title, Gene_Symbol) %>%
summarise_each(funs(paste(., collapse = ",")))
I didn't test it with your data though, because I couldn't copy and paste it into my session.
Update:
In your data you have two columns Gene_Symbol.x and Gene_Symbol.y which were probably created during a merge. I assume they have the same information, and hence you could adjust the code to:
dfcollapsed <- df %>% # replace df with the name of your data frame
group_by(Gene_Title, Gene_Symbol.x) %>%
summarise_each(funs(paste(., collapse = ",")), -Gene_Symbol.y)
Or, if you only want unique entries in each column (as in #juba's answer) you can write:
dfcollapsed <- df %>% # replace df with the name of your data frame
group_by(Gene_Title, Gene_Symbol.x) %>%
summarise_each(funs(paste(unique(.), collapse = ",")), -Gene_Symbol.y)
Hope that helps.
Maybe the following with aggregate :
f <- function(v) {paste(unique(v), collapse=", ")}
aggregate(tab3, list(tab3$Gene_Title, tab3$Gene_Symbol.x), f)
Related
I want to make a graph in my Shiny App using ggplot2 that plots the line regarding the user selection by topic. In this question I was told how to add those 0s in case given topic didn't have any item (using ankrun's answer). Now I'm looking for fitting this graph in a shiny App and thought about adding a sliderInput where you could choose the topic for the line. This is what I've tried:
# # # global # # #
#1. App
if("shiny" %in% rownames(installed.packages()) == FALSE){ install.packages("shiny") }
library(shiny)
#2. Easier data handling
if("dplyr" %in% rownames(installed.packages()) == FALSE){ install.packages("dplyr") }
library(dplyr)
#3. Graphs
if("ggplot2" %in% rownames(installed.packages()) == FALSE){ install.packages("ggplot2") }
library(ggplot2)
#4. Completion in graphs
if("tidyr" %in% rownames(installed.packages()) == FALSE){ install.packages("tidyr") }
library(tidyr)
# # # ui # # #
ui <- fluidPage(
sidebarPanel(
selectInput("select_topic_timeline", "What topic?",
choices = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"),
selected = c("1", "2", "3", "4","5", "6", "7", "8", "9", "10", "11", "12"),
multiple = T)
),#sidebarPanel
mainPanel(
plotOutput("per_topic_timeline")
) #mainPanel
) #fluidPage
# # # server # # #
server <- function(input, output, session) {
# TIMELINE PER TOPIC PER YEAR
output$per_topic_timeline <- renderPlot({
dtd2 <- structure(list(Topic = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 12L), .Label = c("Topic 1",
"Topic 10", "Topic 11", "Topic 12", "Topic 2", "Topic 3", "Topic 4",
"Topic 5", "Topic 6", "Topic 7", "Topic 8", "Topic 9"), class = "factor"),
Year = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 2L,
3L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 1L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 6L, 7L, 8L,
9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 2L, 3L, 4L, 5L, 6L, 7L, 8L), .Label = c("2011",
"2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019"
), class = "factor"), Count = c(3L, 3L, 3L, 5L, 5L, 11L,
17L, 14L, 4L, 1L, 1L, 4L, 2L, 3L, 9L, 4L, 2L, 1L, 3L, 4L,
5L, 18L, 23L, 19L, 15L, 1L, 5L, 6L, 8L, 11L, 17L, 7L, 1L,
3L, 6L, 4L, 20L, 21L, 18L, 12L, 3L, 1L, 1L, 2L, 5L, 5L, 11L,
5L, 2L, 1L, 1L, 2L, 2L, 5L, 7L, 23L, 9L, 1L, 1L, 2L, 3L,
6L, 4L, 9L, 8L, 1L, 1L, 6L, 2L, 3L, 3L, 1L, 3L, 2L, 5L, 7L,
11L, 11L, 28L, 11L, 2L, 1L, 2L, 2L, 5L, 6L, 5L, 16L, 3L,
4L, 2L, 2L, 7L, 6L, 8L, 6L)), row.names = c(NA, -96L), class = "data.frame")
dtd2 %>%
expand(Topic = factor(Topic, levels = gtools::mixedsort(levels(c(input$select_topic_timeline)))) ,
Year = unique(Year)) %>%
left_join(dtd2) %>%
mutate(Count = replace_na(Count, 0)) %>%
ggplot(aes(x = Year, y = Count), colour = c(input$select_topic_timeline), group = Topic) +
geom_point() +
geom_line() +
labs(x = "Year", y = NULL, title = "Timeline")
})
}
shinyApp(ui,server)
It's not printing any error out, but it's saying Warning: Column "Topic" joining factors with different levels, coercing to character vector and it's not giving any graph out, only an empty gray box. I'm not sure about how to fit the input in the graph code and I must've changed it wrong!
This issue is that you are basically refactoring the dtd2$Topic based on what's in the selectInput, but since the Topic only has 9 levels, and you can select up to 12 levels with selectInput, you are getting errors.
For this reason, I'd suggest instead using #Ronak Shah's answer with tidyr::complete instead.
Once you've completed the data, you should then dplyr::filter by Topic to get the what I believe to be the desired result:
# # # global # # #
#1. App
if("shiny" %in% rownames(installed.packages()) == FALSE){ install.packages("shiny") }
library(shiny)
#2. Easier data handling
if("dplyr" %in% rownames(installed.packages()) == FALSE){ install.packages("dplyr") }
library(dplyr)
#3. Graphs
if("ggplot2" %in% rownames(installed.packages()) == FALSE){ install.packages("ggplot2") }
library(ggplot2)
#4. Completion in graphs
if("tidyr" %in% rownames(installed.packages()) == FALSE){ install.packages("tidyr") }
library(tidyr)
# # # ui # # #
ui <- fluidPage(
sidebarPanel(
selectInput("select_topic_timeline",
label = "What topic?",
choices = as.character(1:12),
selected = as.character(1:12),
multiple = TRUE),
),#sidebarPanel
mainPanel(
plotOutput("per_topic_timeline")
) #mainPanel
) #fluidPage
# # # server # # #
server <- function(input, output, session) {
# TIMELINE PER TOPIC PER YEAR
output$per_topic_timeline <- renderPlot({
dtd2 <- structure(list(Topic = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 12L), .Label = c("Topic 1",
"Topic 10", "Topic 11", "Topic 12", "Topic 2", "Topic 3", "Topic 4",
"Topic 5", "Topic 6", "Topic 7", "Topic 8", "Topic 9"), class = "factor"),
Year = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 2L,
3L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 1L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 6L, 7L, 8L,
9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 2L, 3L, 4L, 5L, 6L, 7L, 8L), .Label = c("2011",
"2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019"
), class = "factor"), Count = c(3L, 3L, 3L, 5L, 5L, 11L,
17L, 14L, 4L, 1L, 1L, 4L, 2L, 3L, 9L, 4L, 2L, 1L, 3L, 4L,
5L, 18L, 23L, 19L, 15L, 1L, 5L, 6L, 8L, 11L, 17L, 7L, 1L,
3L, 6L, 4L, 20L, 21L, 18L, 12L, 3L, 1L, 1L, 2L, 5L, 5L, 11L,
5L, 2L, 1L, 1L, 2L, 2L, 5L, 7L, 23L, 9L, 1L, 1L, 2L, 3L,
6L, 4L, 9L, 8L, 1L, 1L, 6L, 2L, 3L, 3L, 1L, 3L, 2L, 5L, 7L,
11L, 11L, 28L, 11L, 2L, 1L, 2L, 2L, 5L, 6L, 5L, 16L, 3L,
4L, 2L, 2L, 7L, 6L, 8L, 6L)), row.names = c(NA, -96L), class = "data.frame")
dtd2 %>%
complete(Topic, Year = unique(Year), fill = list(Count = 0)) %>%
filter(Topic %in% paste("Topic", input$select_topic_timeline)) %>%
ggplot(aes(x = Year, y = Count, colour = Topic, group = Topic)) +
geom_point() +
geom_line() +
labs(x = "Year", y = NULL, title = "Timeline")
})
}
shinyApp(ui,server)
As a side note, I'd also recommend using the shinyWidgets package for this particular input, specifically shinyWidgets::pickerInput, rather than the vanilla shiny::selectInput. I'll leave that decision up to you though.
I have the following data:
df <- structure(list(IDVar = 1:40, Major.sectors = structure(c(5L,
9L, 3L, 15L, 11L, 7L, 18L, 18L, 18L, 3L, 3L, 3L, 3L, 17L, 3L,
11L, 7L, 17L, 3L, 11L, 3L, 18L, 3L, 17L, 9L, 18L, 9L, 19L, 3L,
11L, 11L, 2L, 5L, 3L, 18L, 17L, 4L, 2L, 3L, 3L), .Label = c("Banks",
"Chemicals, rubber, plastics, non-metallic products", "Construction",
"Education, Health", "Food, beverages, tobacco", "Gas, Water, Electricity",
"Hotels & restaurants", "Insurance companies", "Machinery, equipment, furniture, recycling",
"Metals & metal products", "Other services", "Post & telecommunications",
"Primary sector", "Public administration & defense", "Publishing, printing",
"Textiles, wearing apparel, leather", "Transport", "Wholesale & retail trade",
"Wood, cork, paper"), class = "factor"), Region.in.country = structure(c(15L,
8L, 8L, 8L, 10L, 15L, 19L, 10L, 8L, 10L, 3L, 18L, 4L, 12L, 4L,
15L, 13L, 4L, 15L, 15L, 7L, 15L, 12L, 1L, 7L, 10L, 15L, 8L, 13L,
15L, 12L, 8L, 7L, 15L, 15L, 10L, 8L, 10L, 10L, 15L), .Label = c("Andalucia",
"Aragon", "Asturias", "Canary Islands", "Cantabria", "Castilla-La Mancha",
"Castilla y Leon", "Cataluna", "Ceuta", "Comunidad Valenciana",
"Extremadura", "Galicia", "Islas Baleares", "La Rioja", "Madrid",
"Melilla", "Murcia", "Navarra", "Pais Vasco"), class = "factor"),
EBIT.TA = c(-0.234432635519391, -0.884337466274593, -0.00446559204081373,
0.11109107677028, -0.137203773525798, -0.582114677880617,
0.0190497663203189, -3.04252763094666, 0.113157822682219,
-0.0255533180037229, 0.281767142199724, 0.0326641697396841,
-0.00879974750993553, 0.0542074697816672, -0.112104697294392,
-0.191945591325174, -0.00380586115226597, -0.0363239884169068,
-0.273949107908537, 0.435398668004486, -0.00563436099927988,
-2.75971618056051, -0.1047327709263, 0.151283793741506, -0.0373197549569126,
0.00912639083178201, -0.0386627754065697, -0.018235399636112,
-0.0118104711362467, -0.701299939137125, NA, 0.0191819361175666,
-0.0104887983706721, -0.801677105519484, -0.402194475974272,
-0.124125227730062, 0.143020458476649, -0.601186271451194,
0.0163269364787831, 5.09955167591238), EBIT.TA_l1 = c(-0.443687074746458,
-0.561864166134075, -0.0345769510044604, 0.0282541797531804,
-0.0181173929170762, 0.0147211350970115, 0.0588534950162799,
-1.14097109926961, 0.060100343733096, -0.0386426338471025,
0.049684095221329, 0.0558174150334904, 0.00214962169435867,
0.0399960114646072, 0.0402934579830171, -0.612359147433149,
-0.0115916125659674, 0.00739473610413031, 0.0174576615247567,
0.68624861825246, 0.0305807338940829, -3.88006243913616,
0.0410122725022661, -0.089491343996377, -0.215219123182103,
0.00967853324842811, -0.0336715197882038, 0.362424791356667,
0.221203934329637, -0.654387857513823, 0.0656934439915892,
0.0652005453654772, 0.0339559014267185, 0.0259085077216708,
-0.303606048856146, 0.0280113794301873, 0.109307291990628,
-0.470048555841697, -0.00157699300508027, -0.350519090107081
), EBIT.TA_l2 = c(-0.351308186716873, 0.00159428805074234,
-0.00604587147802615, 0.0761894448922952, -0.00348378141492824,
NA, 0.0346370866793768, -0.552226781084599, 0.00220031803369861,
-0.0285840972149053, 0.065316579236306, 0.4090851643341,
-0.0188362202518351, 0.0403848986306371, 0.091146090480032,
-0.0154168449752466, -0.0694803621032671, 0.0511978643139393,
-0.452924037757731, -0.0091835704914724, 0.0119918914092344,
0.0858960833880717, NA, 0.104901526886479, -0.23096183545392,
-0.0163058345980967, 0.100643431561465, 0.0527859573541712,
0.250207316117438, NA, 0.00193240515291123, 0.0624210741756767,
0.0178136227732972, -0.0321294913646274, -0.0699629484084657,
-0.00417176180400133, 0.209612573099415, 0.0285645570852926,
0.0551624216079071, 0.0172738293439595), Major.sectors.id = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 7L, 7L, 3L, 3L, 3L, 3L, 8L, 3L, 5L,
6L, 8L, 3L, 5L, 3L, 7L, 3L, 8L, 2L, 7L, 2L, 9L, 3L, 5L, 5L,
10L, 1L, 3L, 7L, 8L, 11L, 10L, 3L, 3L), Region.in.country.id = c(1L,
2L, 2L, 2L, 3L, 1L, 4L, 3L, 2L, 3L, 5L, 6L, 7L, 8L, 7L, 1L,
9L, 7L, 1L, 1L, 10L, 1L, 8L, 11L, 10L, 3L, 1L, 2L, 9L, 1L,
8L, 2L, 10L, 1L, 1L, 3L, 2L, 3L, 3L, 1L)), .Names = c("IDVar",
"Major.sectors", "Region.in.country", "EBIT.TA", "EBIT.TA_l1",
"EBIT.TA_l2", "Major.sectors.id", "Region.in.country.id"), row.names = c(NA,
40L), class = "data.frame")
I randomly generate a column of zero and ones for illustration.
x <- 40
df$x<- sample(c(0,1), replace=TRUE, size=x)
What I am trying to do is to do is to drop rows which have zero values based on a few conditons.
:If df$x == 1
and if intersect(region.id, sector.id) == 0 #i.e. there is no data
then drop
So, I want to group_by region and sector and if the intersect between both columns does not exist then drop that observation.
Consider the following image. I am basically looking to delete the intersects of the columns which has not data. So take sector.id: 1 and region.id: 5 there is no data so I want to remove it. (However my data is not grouped like the image below, its as the dput code.
I used NA for missing values in the sample x.
# get ready
set.seed(123) # set seed for reproducibility
df$x <- sample(c(NA,1), 40, replace = TRUE) # sample values
Base solution
# split by ids, check for values, bind together nonempty combinations
dfs_split <- split(df, list(df$Major.sectors.id, df$Region.in.country.id))
has_value <- sapply(dfs_split, function(df) !all(is.na(df$x)))
dfs_nonempty <- dfs_split[has_value]
res <- do.call(rbind, dfs_nonempty)
Explanation:
split divides the data into the groups you specified
sapply applies the test for non-missing values on each group
do.call helps to rbind the groups (which actually form a list)
dplyr solution
This is the cleaner option.
library(dplyr)
res <- df %>%
group_by(Major.sectors.id, Region.in.country.id) %>%
filter(!all(is.na(x)))
Trying to create / store data.frames using a nested for loop.
I have some data on countries in a variable called countries, so USA, UK, Germany etc. which I have labeled them 1,2,3 respectively.
I also have data on specific industries in a variable industries for example textiles, retail, other. Again I have labeled these industries 1,2,3.
What I am trying to do is to create a new data.frame which will take;
country 1, industry 1
country 1, industry 2
country 1, industry 3
country 2, industry 1
country 2, industry 2
country 2, industry 3
country 3, industry 1
country 3, industry 2
country 3, industry 3
etc.
I am hopeing to carry out analysis on each data.frame
what I am currently working with is the following;
m <- 3 # m countries
k <- 3 # k industries
for(i in 1:length(m)){
country.ID <- m[i]
for(j in 1:length(k)){
sector.ID <- k[j]
S1 <- which(DF$COUNTRY.id == country.ID)
S2 <- which(DF$INDUSTRY.id == sector.ID)
rows.2.consider <- intersect(S1, S2)
# Here is where I am trying to save the data.frames for analysis
}
}
If I have gone wrong at any point please point this out. But I am trying to create many data.frames for each country and for each region, i.e. 3 countries * 3 industries in this example would give 9 data.frames
Here some sample code (I am actually using regional data not country data etc but the same pricipal still applies.
#
ratios <- structure(list(IDVar = 1:40, Major.sectors = structure(c(5L,
9L, 3L, 15L, 11L, 7L, 18L, 18L, 18L, 3L, 3L, 3L, 3L, 17L, 3L,
11L, 7L, 17L, 3L, 11L, 3L, 18L, 3L, 17L, 9L, 18L, 9L, 19L, 3L,
11L, 11L, 2L, 5L, 3L, 18L, 17L, 4L, 2L, 3L, 3L), .Label = c("Banks",
"Chemicals, rubber, plastics, non-metallic products", "Construction",
"Education, Health", "Food, beverages, tobacco", "Gas, Water, Electricity",
"Hotels & restaurants", "Insurance companies", "Machinery, equipment, furniture, recycling",
"Metals & metal products", "Other services", "Post & telecommunications",
"Primary sector", "Public administration & defense", "Publishing, printing",
"Textiles, wearing apparel, leather", "Transport", "Wholesale & retail trade",
"Wood, cork, paper"), class = "factor"), Region.in.country = structure(c(15L,
8L, 8L, 8L, 10L, 15L, 19L, 10L, 8L, 10L, 3L, 18L, 4L, 12L, 4L,
15L, 13L, 4L, 15L, 15L, 7L, 15L, 12L, 1L, 7L, 10L, 15L, 8L, 13L,
15L, 12L, 8L, 7L, 15L, 15L, 10L, 8L, 10L, 10L, 15L), .Label = c("Andalucia",
"Aragon", "Asturias", "Canary Islands", "Cantabria", "Castilla-La Mancha",
"Castilla y Leon", "Cataluna", "Ceuta", "Comunidad Valenciana",
"Extremadura", "Galicia", "Islas Baleares", "La Rioja", "Madrid",
"Melilla", "Murcia", "Navarra", "Pais Vasco"), class = "factor"),
EBIT.TA = c(-0.234432635519391, -0.884337466274593, -0.00446559204081373,
0.11109107677028, -0.137203773525798, -0.582114677880617,
0.0190497663203189, -3.04252763094666, 0.113157822682219,
-0.0255533180037229, 0.281767142199724, 0.0326641697396841,
-0.00879974750993553, 0.0542074697816672, -0.112104697294392,
-0.191945591325174, -0.00380586115226597, -0.0363239884169068,
-0.273949107908537, 0.435398668004486, -0.00563436099927988,
-2.75971618056051, -0.1047327709263, 0.151283793741506, -0.0373197549569126,
0.00912639083178201, -0.0386627754065697, -0.018235399636112,
-0.0118104711362467, -0.701299939137125, NA, 0.0191819361175666,
-0.0104887983706721, -0.801677105519484, -0.402194475974272,
-0.124125227730062, 0.143020458476649, -0.601186271451194,
0.0163269364787831, 5.09955167591238), EBIT.TA_l1 = c(-0.443687074746458,
-0.561864166134075, -0.0345769510044604, 0.0282541797531804,
-0.0181173929170762, 0.0147211350970115, 0.0588534950162799,
-1.14097109926961, 0.060100343733096, -0.0386426338471025,
0.049684095221329, 0.0558174150334904, 0.00214962169435867,
0.0399960114646072, 0.0402934579830171, -0.612359147433149,
-0.0115916125659674, 0.00739473610413031, 0.0174576615247567,
0.68624861825246, 0.0305807338940829, -3.88006243913616,
0.0410122725022661, -0.089491343996377, -0.215219123182103,
0.00967853324842811, -0.0336715197882038, 0.362424791356667,
0.221203934329637, -0.654387857513823, 0.0656934439915892,
0.0652005453654772, 0.0339559014267185, 0.0259085077216708,
-0.303606048856146, 0.0280113794301873, 0.109307291990628,
-0.470048555841697, -0.00157699300508027, -0.350519090107081
), EBIT.TA_l2 = c(-0.351308186716873, 0.00159428805074234,
-0.00604587147802615, 0.0761894448922952, -0.00348378141492824,
NA, 0.0346370866793768, -0.552226781084599, 0.00220031803369861,
-0.0285840972149053, 0.065316579236306, 0.4090851643341,
-0.0188362202518351, 0.0403848986306371, 0.091146090480032,
-0.0154168449752466, -0.0694803621032671, 0.0511978643139393,
-0.452924037757731, -0.0091835704914724, 0.0119918914092344,
0.0858960833880717, NA, 0.104901526886479, -0.23096183545392,
-0.0163058345980967, 0.100643431561465, 0.0527859573541712,
0.250207316117438, NA, 0.00193240515291123, 0.0624210741756767,
0.0178136227732972, -0.0321294913646274, -0.0699629484084657,
-0.00417176180400133, 0.209612573099415, 0.0285645570852926,
0.0551624216079071, 0.0172738293439595), Major.sectors.id = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 7L, 7L, 3L, 3L, 3L, 3L, 8L, 3L, 5L,
6L, 8L, 3L, 5L, 3L, 7L, 3L, 8L, 2L, 7L, 2L, 9L, 3L, 5L, 5L,
10L, 1L, 3L, 7L, 8L, 11L, 10L, 3L, 3L), Region.in.country.id = c(1L,
2L, 2L, 2L, 3L, 1L, 4L, 3L, 2L, 3L, 5L, 6L, 7L, 8L, 7L, 1L,
9L, 7L, 1L, 1L, 10L, 1L, 8L, 11L, 10L, 3L, 1L, 2L, 9L, 1L,
8L, 2L, 10L, 1L, 1L, 3L, 2L, 3L, 3L, 1L)), .Names = c("IDVar",
"Major.sectors", "Region.in.country", "EBIT.TA", "EBIT.TA_l1",
"EBIT.TA_l2", "Major.sectors.id", "Region.in.country.id"), row.names = c(NA,
40L), class = "data.frame")
You can do
m <- 3 # m countries
k <- 3 # k industries
d <- data.frame(country=rep(1:m, each=k), industry=rep(1:k, m) )
for a single data.frame
You can split that into 9 data.frames
split(d,d)
One option could be using expand.grid. Prepare data.frame with desired country and industry and then expand the same using expand.grid to generate all possible combinations.
df <- data.frame(c= c("country1","country2", "country3"),
i = c("industry1", "industry2","industry3"))
library(dplyr)
expand.grid(df) %>% arrange(c)
c i
1 country1 industry1
2 country1 industry2
3 country1 industry3
4 country2 industry1
5 country2 industry2
6 country2 industry3
7 country3 industry1
8 country3 industry2
9 country3 industry3
You don't actually need to split data nor create indexes. Can do like this to run analysis for each industry and country:
YourAnalysis <- function(x) mean(x$EBIT.TA)
by(data = ratios, INDICES = list(ratios$Region.in.country, ratios$Major.sectors), FUN = YourAnalysis)
I use the wordcloud2 package to render word clouds. It seems that wordcloud2 does not always display the most frequent words.
I said "not always" because the problem is not permanent. It seems that the results are mostly random.
Code :
library(wordcloud2)
library(htmlwidgets)
DataCloud <- as.character(DataTextAnalysis[,1])
DataCloud <- as.data.frame(table(DataCloud))
DataCloud <- DataCloud[order(DataCloud$Freq, decreasing = TRUE),]
DataCloud <- DataCloud[1:10, ]
wordcloud2(data = DataCloud)
Data :
structure(list(`Theme 1` = structure(c(12L, NA, 2L, 4L, 6L, 7L,
NA, 14L, 6L, 6L, 2L, 7L, 5L, 2L, 2L, 2L, 11L, 12L, 2L, 2L, 10L,
NA, 12L, NA, 2L, 13L, 15L, NA, NA, 10L, NA, 1L, 2L, 16L, 6L,
1L, 7L, 9L, 15L, 3L, 1L, 2L, 2L, 2L, 17L, 2L, 17L, 7L, 3L, 2L,
2L, 8L, 6L), .Label = c("Ambiance", "Autonomie", "Changement régulier de hiérarchie",
"Côté familial", "Défi", "Diversité des tâches", "Faire du bon travail",
"Gérer l humain", "Gestion de projets", "Horaires", "Réglage du finisseur",
"Relation client", "Rencontrer de nouvelles équipes", "Responsabilité",
"Technicité", "Travailler avec la hiérachie", "Travailler en binôme"
), class = "factor"), `Theme 2` = structure(c(NA, NA, 13L, 1L,
14L, NA, NA, 4L, 15L, 14L, 10L, 8L, 8L, 5L, 15L, 4L, 13L, 8L,
6L, NA, 3L, NA, 3L, NA, 11L, 5L, 5L, NA, NA, 9L, NA, 16L, 1L,
7L, 8L, 5L, 19L, 2L, 8L, 11L, 5L, 13L, 11L, 11L, 19L, 5L, 19L,
12L, 11L, 8L, 18L, 17L, 4L), .Label = c("Ambiance", "Amélioration",
"Autonomie", "Confiance", "Diversité des tâches", "Être écouté",
"Evolution continue de l entreprise", "Faire du bon travail",
"Hiver", "Liberté", "Matériel performant", "Partager mon savoir-faire",
"Relation client", "Rencontrer de nouvelles équipes", "Responsabilité",
"Solidarité", "Stimulation", "Tranquille", "Travailler dans ma région"
), class = "factor")), .Names = c("Theme 1", "Theme 2"), row.names = c(NA,
-53L), class = "data.frame")
Reduce the font size so that all words fit the available page space:
wordcloud2(DataCloud, size = .5)
I have an output of apriori function, which mines data and gives set of rules. I want to convert it to data frame for further processing.
The rules object looks like this:
> inspect(output)
lhs rhs support confidence lift
1 {curtosis=(846,1.27e+03]} => {skewness=(-0.254,419]} 0.2611233 0.8044944 2.418776
2 {variance=(892,1.34e+03]} => {notes.class=FALSE} 0.3231218 0.9888393 1.781470
3 {variance=(-0.336,446]} => {notes.class=TRUE} 0.2859227 0.8634361 1.940608
4 {skewness=(837,1.26e+03]} => {notes.class=FALSE} 0.2924872 0.8774617 1.580815
5 {entropy=(-0.155,386],
class=FALSE} => {skewness=(837,1.26e+03]} 0.1597374 0.9521739 2.856522
6 {variance=(-0.336,446],
curtosis=(846,1.27e+03]} => {skewness=(-0.254,419]} 0.1378556 0.8325991 2.503275
We can create rules object using data frame. Data frame looks like this:
> data
variance skewness curtosis entropy notes.class
1 (892,1.34e+03] (837,1.26e+03] (-0.268,424] (386,771] FALSE
2 (892,1.34e+03] (-0.254,419] (424,846] (771,1.16e+03] FALSE
3 (892,1.34e+03] (837,1.26e+03] (-0.268,424] (-0.155,386] FALSE
4 (446,892] (-0.254,419] (846,1.27e+03] (386,771] FALSE
Than we can get output variable using this:
> output <- apriori(data)
There was used arules package. dput(output) gives this:
new("rules"
, lhs = new("itemMatrix"
, data = new("ngCMatrix"
, i = c(8L, 2L, 0L, 5L, 9L, 12L, 0L, 8L, 0L, 3L, 0L, 8L, 8L, 13L, 8L,
10L, 3L, 10L, 8L, 11L, 8L, 13L, 3L, 12L, 2L, 5L, 2L, 6L, 2L,
5L, 2L, 6L, 2L, 10L, 2L, 7L, 2L, 11L, 0L, 3L, 0L, 10L, 0L, 7L,
11L, 13L, 5L, 6L, 6L, 12L, 5L, 10L, 1L, 5L, 4L, 6L, 6L, 13L,
0L, 3L, 8L, 0L, 8L, 13L, 3L, 8L, 13L, 0L, 3L, 13L, 2L, 5L, 6L,
2L, 5L, 12L, 2L, 6L, 12L)
, p = c(0L, 1L, 2L, 3L, 4L, 6L, 8L, 10L, 12L, 14L, 16L, 18L, 20L, 22L,
24L, 26L, 28L, 30L, 32L, 34L, 36L, 38L, 40L, 42L, 44L, 46L, 48L,
50L, 52L, 54L, 56L, 58L, 61L, 64L, 67L, 70L, 73L, 76L, 79L)
, Dim = c(14L, 38L)
, Dimnames = list(NULL, NULL)
, factors = list()
)
, itemInfo = structure(list(labels = structure(c("variance=(-0.336,446]",
"variance=(446,892]", "variance=(892,1.34e+03]", "skewness=(-0.254,419]",
"skewness=(419,837]", "skewness=(837,1.26e+03]", "curtosis=(-0.268,424]",
"curtosis=(424,846]", "curtosis=(846,1.27e+03]", "entropy=(-0.155,386]",
"entropy=(386,771]", "entropy=(771,1.16e+03]", "notes.class=FALSE",
"notes.class=TRUE"), class = "AsIs"), variables = structure(c(5L,
5L, 5L, 4L, 4L, 4L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), .Label = c("curtosis",
"entropy", "notes.class", "skewness", "variance"), class = "factor"),
levels = structure(c(4L, 8L, 12L, 2L, 6L, 10L, 3L, 7L, 11L,
1L, 5L, 9L, 13L, 14L), .Label = c("(-0.155,386]", "(-0.254,419]",
"(-0.268,424]", "(-0.336,446]", "(386,771]", "(419,837]",
"(424,846]", "(446,892]", "(771,1.16e+03]", "(837,1.26e+03]",
"(846,1.27e+03]", "(892,1.34e+03]", "FALSE", "TRUE"), class = "factor")), .Names = c("labels",
"variables", "levels"), row.names = c(NA, -14L), class = "data.frame")
, itemsetInfo = structure(list(), .Names = character(0), row.names = integer(0), class = "data.frame")
)
, rhs = new("itemMatrix"
, data = new("ngCMatrix"
, i = c(3L, 12L, 13L, 12L, 5L, 3L, 8L, 13L, 0L, 3L, 8L, 3L, 3L, 8L,
6L, 5L, 12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 3L, 12L, 5L,
12L, 12L, 13L, 4L, 13L, 3L, 0L, 8L, 12L, 6L, 5L)
, p = 0:38
, Dim = c(14L, 38L)
, Dimnames = list(NULL, NULL)
, factors = list()
)
, itemInfo = structure(list(labels = structure(c("variance=(-0.336,446]",
"variance=(446,892]", "variance=(892,1.34e+03]", "skewness=(-0.254,419]",
"skewness=(419,837]", "skewness=(837,1.26e+03]", "curtosis=(-0.268,424]",
"curtosis=(424,846]", "curtosis=(846,1.27e+03]", "entropy=(-0.155,386]",
"entropy=(386,771]", "entropy=(771,1.16e+03]", "notes.class=FALSE",
"notes.class=TRUE"), class = "AsIs"), variables = structure(c(5L,
5L, 5L, 4L, 4L, 4L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), .Label = c("curtosis",
"entropy", "notes.class", "skewness", "variance"), class = "factor"),
levels = structure(c(4L, 8L, 12L, 2L, 6L, 10L, 3L, 7L, 11L,
1L, 5L, 9L, 13L, 14L), .Label = c("(-0.155,386]", "(-0.254,419]",
"(-0.268,424]", "(-0.336,446]", "(386,771]", "(419,837]",
"(424,846]", "(446,892]", "(771,1.16e+03]", "(837,1.26e+03]",
"(846,1.27e+03]", "(892,1.34e+03]", "FALSE", "TRUE"), class = "factor")), .Names = c("labels",
"variables", "levels"), row.names = c(NA, -14L), class = "data.frame")
, itemsetInfo = structure(list(), .Names = character(0), row.names = integer(0), class = "data.frame")
)
, quality = structure(list(support = c(0.261123267687819, 0.323121808898614,
0.285922684172137, 0.292487235594457, 0.159737417943107, 0.137855579868709,
0.137855579868709, 0.142231947483589, 0.142231947483589, 0.110138584974471,
0.110138584974471, 0.12399708242159, 0.153902261123268, 0.107221006564551,
0.13056163384391, 0.13056163384391, 0.150984682713348, 0.139314369073669,
0.100656455142232, 0.107221006564551, 0.154631655725748, 0.165572574762947,
0.112326768781911, 0.105762217359592, 0.12180889861415, 0.181619256017505,
0.181619256017505, 0.102844638949672, 0.105762217359592, 0.12837345003647,
0.12837345003647, 0.137855579868709, 0.137855579868709, 0.137855579868709,
0.137855579868709, 0.13056163384391, 0.13056163384391, 0.13056163384391
), confidence = c(0.804494382022472, 0.988839285714286, 0.863436123348018,
0.87746170678337, 0.952173913043478, 0.832599118942731, 0.832599118942731,
0.859030837004405, 0.898617511520737, 0.853107344632768, 0.915151515151515,
0.80188679245283, 0.972350230414747, 0.885542168674699, 0.864734299516908,
0.913265306122449, 1, 0.974489795918367, 1, 1, 0.990654205607477,
1, 0.980891719745223, 0.873493975903614, 0.814634146341463, 0.943181818181818,
0.950381679389313, 1, 0.92948717948718, 0.931216931216931, 0.897959183673469,
1, 0.969230769230769, 0.895734597156398, 0.832599118942731, 1,
0.864734299516908, 0.93717277486911), lift = c(2.41877587226493,
1.78146998779801, 1.94060807395104, 1.580814717477, 2.85652173913043,
2.50327498261071, 2.56515369004603, 1.93070701234925, 2.71366653809456,
2.56493458221826, 2.81948927477017, 2.41093594836147, 2.92344773223381,
2.72826587247868, 2.58853870008227, 2.73979591836735, 1.80157687253614,
1.75561827884899, 1.80157687253614, 1.80157687253614, 1.78473970550309,
2.24754098360656, 2.20459434060771, 1.96321350977681, 2.44926187419769,
1.69921455023295, 2.85114503816794, 1.80157687253614, 1.67454260588295,
2.09294821753838, 2.68799572230639, 2.24754098360656, 2.91406882591093,
2.70496064471679, 2.56515369004603, 1.80157687253614, 2.58853870008227,
2.81151832460733)), row.names = c(NA, 38L), .Names = c("support",
"confidence", "lift"), class = "data.frame")
, info = structure(list(data = data, ntransactions = 1371L, support = 0.1,
confidence = 0.8), .Names = c("data", "ntransactions", "support",
"confidence"))
)
We can't duplicate your data from your question (oh, you just added your data as I was typing this! Sorry!), so I'll use the example from the arules package:
library('arules');
data("Adult")
## Mine association rules.
rules <- apriori(Adult,
parameter = list(supp = 0.5, conf = 0.9,
target = "rules"))
Then I can duplicate the stuff output from inspect(rules):
> ruledf = data.frame(
lhs = labels(lhs(rules))$elements,
rhs = labels(rhs(rules))$elements,
rules#quality)
> head(ruledf)
lhs rhs support confidence lift
1 {} {capital-gain=None} 0.9173867 0.9173867 1.0000000
2 {} {capital-loss=None} 0.9532779 0.9532779 1.0000000
3 {hours-per-week=Full-time} {capital-gain=None} 0.5435895 0.9290688 1.0127342
4 {hours-per-week=Full-time} {capital-loss=None} 0.5606650 0.9582531 1.0052191
5 {sex=Male} {capital-gain=None} 0.6050735 0.9051455 0.9866565
6 {sex=Male} {capital-loss=None} 0.6331027 0.9470750 0.9934931
and do stuff like order by decreasing lift:
head(ruledf[order(-ruledf$lift),])
The help for the rules class: http://www.rdocumentation.org/packages/arules/functions/rules-class.html will tell you what you can get from your rules object - I just used that information to build a data frame. If its not exactly what you want, then cook one up using your own recipe!
Run apriori in data Adult
rules <- apriori(Adult, parameter = list(supp = 0.5, conf = 0.9, target =
"rules"))
Inspect LHS, RHS, support, confidence and lift
arules::inspect(rules)
Create a dataframe
df = data.frame(
lhs = labels(lhs(rules)),
rhs = labels(rhs(rules)),
rules#quality)
View top 6 lines in new dataframe
head(df)
This does the trick
rules_dataframe <- as(output, 'data.frame')