I am handling some survey data about how respondents feel about several factors related to workplace culture. It is currently in a long-format tibble called work_culture_data like this:
> print(work_culture_data, n = 21)
# A tibble: 140 × 3
Response_ID Factor Level
<int> <fct> <fct>
1 6 Level_support_colleagues low
2 6 Level_support_community low
3 6 Level_career_prospects low
4 6 Level_career_satisfaction high
5 6 Level_career_impact low
6 6 Level_collaboration high
7 6 Level_assessment_fairness high
8 7 Level_support_colleagues high
9 7 Level_support_community high
10 7 Level_career_prospects very high
11 7 Level_career_satisfaction high
12 7 Level_career_impact high
13 7 Level_collaboration high
14 7 Level_assessment_fairness high
15 8 Level_support_colleagues high
16 8 Level_support_community low
17 8 Level_career_prospects very low
18 8 Level_career_satisfaction high
19 8 Level_career_impact high
20 8 Level_collaboration low
21 8 Level_assessment_fairness low
# … with 119 more rows
# ℹ Use `print(n = ...)` to see more rows
Which can be recreated with this dput() output:
structure(list(Response_ID = c(6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L,
7L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L,
9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 13L,
13L, 13L, 13L, 13L, 13L, 13L, 14L, 14L, 14L, 14L, 14L, 14L, 14L,
15L, 15L, 15L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 16L, 16L,
16L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 18L, 18L, 18L, 18L, 18L,
18L, 18L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 20L, 20L, 20L, 20L,
20L, 20L, 20L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 22L, 22L, 22L,
22L, 22L, 22L, 22L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 24L, 24L,
24L, 24L, 24L, 24L, 24L, 25L, 25L, 25L, 25L, 25L, 25L, 25L),
Factor = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L,
4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L,
7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L,
4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L), levels = c("Level_support_colleagues",
"Level_support_community", "Level_career_prospects", "Level_career_satisfaction",
"Level_career_impact", "Level_collaboration", "Level_assessment_fairness"
), class = "factor"), Level = structure(c(2L, 2L, 2L, 3L,
2L, 3L, 3L, 3L, 3L, 4L, 3L, 3L, 3L, 3L, 3L, 2L, 1L, 3L, 3L,
2L, 2L, 4L, 3L, 2L, 3L, 3L, 3L, 2L, 4L, 3L, 3L, 4L, 3L, 4L,
3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 3L, 3L, 3L,
3L, 2L, 3L, 3L, 2L, 4L, 2L, 2L, 4L, 2L, 4L, 4L, 1L, 3L, 3L,
3L, 3L, 3L, 4L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 2L, 1L, 3L, 3L,
1L, 3L, 2L, 4L, 3L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 3L, 3L, 3L,
4L, 2L, 2L, 2L, 4L, 3L, 2L, 3L, 3L, 4L, 3L, 3L, 3L, 2L, 3L,
3L, 3L, 2L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 3L,
3L, 2L, 4L, 4L, 3L, 4L, 3L, 4L, 2L, 3L, 2L, 2L, 3L, 1L, 2L,
2L), levels = c("very low", "low", "high", "very high"), class = "factor")), row.names = c(NA,
-140L), class = c("tbl_df", "tbl", "data.frame"))
The actual dataset has 2000+ rows representing 400+ responses, and work_culture_data here is a subset of 20 survey responses (20 unique Response_IDs) where they rate (Level factor variable from "very low" to "very high") seven factors (Factor factor variable) about their workplace culture. For example, respondent number 6 thinks their Level_career_prospects is low.
Based on work_culture_data, I'd like to create a 100% stacked bar chart with ggplot2 with the following features:
1. The Factors are renamed in the final graph, such as from Level_career_prospects to "Career prospects". This will be the vertical axis.
2. The stacked bars are horizontal, and I can specify the colors of their components.
3. There will be seven stacked bars altogether, one for each Factor.
4. The stacked bars are made of the proportion of respondents who chose each Level, in order from "very low" to "very high" (four levels in total). Each segment of a stacked bar represents one Level. Each stacked bar adds up to 100%.
5. The horizontal axis has three labelled breaks: 0%, 50%, and 100% from left to right.
6. The stacked bars are ordered from the highest proportion of "very low" to the lowest, from top to bottom.
7. Ideally, I'd like the count of responses shown on each segment of the stacked bars.
I tried to create this plot starting with this line:
work_culture_fig <- ggplot(work_culture_data, aes(y = Factor, x = Level)) +
geom_col()
However, it gave me this output which baffles me:
I don't know where to go from here and am very confused... Should the tibble data frame be widened first?
What did I do wrong? And how do I achieve 1 to 7 above in the final figure?
Thank you.
Not sure if I understood correctly.
library(tidyverse)  # loads dplyr, ggplot2, stringr and forcats

# Start by removing the "Level_" prefix from each factor name
t <- work_culture_data$Factor
t <- gsub("Level_([a-z])", " \\U\\1", t, perl=TRUE)
t <- gsub("^([a-z])", "\\U\\1", t, perl=TRUE)
t <- str_to_title(str_trim(gsub("_", " ", t)))
work_culture_data$Factor <- t
# Change levels
l <- fct_relevel(work_culture_data$Level, c("very high","high","low","very low"))
work_culture_data$Level <- l
# Plot
work_culture_data %>%
  ggplot(aes(x=Factor,fill=Level))+
  geom_bar()+labs(fill="")+ylab("")+
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
I changed the code to show percentages. I also corrected the axes and made it clearer that the bars are proportional.
# Plot
# prop holds the raw count per Factor/Level; position = "fill" rescales each bar to 100%
work_culture_data %>% group_by(Factor,Level) %>% summarize(prop=n()) %>%
  ggplot(aes(y=Factor,x=prop,fill=Level))+
  geom_col(position="fill")+labs(fill="")+ylab("")+
  scale_x_continuous(labels = scales::percent)
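To see what position = "fill" does with those counts, here is a minimal base R sketch on a made-up toy data frame (the toy data and its values are assumptions, not the survey data). It tabulates the counts per Factor and Level and then rescales each row so that it sums to 1, which should be the same arithmetic the plot performs for each bar.

```r
# Hypothetical toy data: two factors rated on the four-level scale
toy <- data.frame(
  Factor = c("A", "A", "A", "A", "B", "B"),
  Level  = factor(c("low", "low", "high", "very high", "low", "high"),
                  levels = c("very low", "low", "high", "very high"))
)

# Counts per Factor (rows) and Level (columns)
counts <- table(toy$Factor, toy$Level)

# Row-wise proportions: every row (one stacked bar) sums to 1
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```

This is also why the original attempt with x = Level looked odd: geom_col() plots the values it is given, so it needs a count or proportion on the x axis rather than the Level labels themselves.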
To change colors manually:
scale_fill_manual(values=sample(colors(),
                                length(unique(work_culture_data$Level))))
This is just a fancy-schmancy way to sample different colors each time, depending on how many unique levels you have in your fill argument. You could just specify the values as c("red","#1CD317",colors()[444],"deeppink"): any mix of color specifications works (a HEX code, a name, an index into all the named colors, or my favorite: DEEP PINK!).
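If you want each Level to keep a fixed color regardless of level order, one option is a named vector. The following is only a sketch: the HEX codes and the toy counts are arbitrary assumptions chosen for illustration.

```r
library(ggplot2)

# Hypothetical color choices, keyed by Level name
level_cols <- c(
  "very low"  = "#d7191c",
  "low"       = "#fdae61",
  "high"      = "#a6d96a",
  "very high" = "#1a9641"
)

# Toy counts standing in for the summarised survey data
toy <- data.frame(
  Factor = rep(c("Career prospects", "Collaboration"), each = 4),
  Level  = factor(rep(names(level_cols), times = 2), levels = names(level_cols)),
  n      = c(1, 3, 5, 1, 2, 2, 4, 2)
)

# Named values are matched to the fill levels by name, not by position
p <- ggplot(toy, aes(x = n, y = Factor, fill = Level)) +
  geom_col(position = "fill") +
  scale_fill_manual(values = level_cols)
```

Because the vector is named, reordering the levels of Level later should not change which color each level gets.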
I've changed the order of Factor outside of ggplot. You could of course keep the original columns by adding new columns instead of overwriting the existing ones.
I also switched from four manual colors to a preset palette. R has really nice color options, and you can familiarize yourself with them over time. If you prefer manual coloring, use the line from the previous answer.
# Start by removing the "Level_" prefix from each factor name
t <- work_culture_data$Factor
t <- gsub("Level_([a-z])", " \\U\\1", t, perl=TRUE)
t <- gsub("^([a-z])", "\\U\\1", t, perl=TRUE)
t <- str_to_title(str_trim(gsub("_", " ", t)))
work_culture_data$Factor <- t
# Change levels
l <- fct_relevel(work_culture_data$Level, c("very high","high","low","very low"))
work_culture_data$Level <- l
# Sort Factor by the count of "very low" (Level)
arng_by <- work_culture_data %>%
  filter(Level=="very low") %>%
  group_by(Factor,Level) %>%
  summarize(prop=n()) %>% arrange(prop) %>% pull(Factor)
f <- fct_relevel(work_culture_data$Factor, arng_by)
work_culture_data$Factor <- f
# Plot
work_culture_data %>% group_by(Factor,Level) %>% summarize(prop=n()) %>%
  ggplot(aes(y=Factor,x=prop,fill=Level))+
  geom_col(position="fill")+labs(fill="")+ylab("")+
  scale_x_continuous(labels = scales::percent)+
  scale_fill_brewer(palette = 11)
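For the remaining points in the question (three axis breaks, counts on the segments, and an ordering that also copes with a Factor that has no "very low" answers), here is one possible sketch. It runs on a small made-up data frame, and pretty_label() is a hypothetical helper, so treat it as a starting point rather than a finished solution.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with the same three columns as work_culture_data
lvls <- c("very low", "low", "high", "very high")
toy <- data.frame(
  Response_ID = rep(1:4, each = 2),
  Factor = rep(c("Level_career_prospects", "Level_collaboration"), times = 4),
  Level = factor(c("very low", "low", "very low", "high",
                   "low", "high", "high", "very high"), levels = lvls)
)

# Count responses per Factor/Level, then convert to within-Factor proportions
counts <- toy %>%
  count(Factor, Level, name = "n") %>%
  group_by(Factor) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Share of "very low" per Factor; a Factor with no such answers gets 0
very_low <- counts %>%
  group_by(Factor) %>%
  summarise(vl = sum(prop[Level == "very low"]), .groups = "drop") %>%
  arrange(vl)

# The last factor level is drawn at the top, so sort ascending
counts$Factor <- factor(counts$Factor, levels = very_low$Factor)

# Hypothetical helper: "Level_career_prospects" -> "Career prospects"
pretty_label <- function(x) {
  x <- gsub("_", " ", sub("^Level_", "", x))
  paste0(toupper(substring(x, 1, 1)), substring(x, 2))
}

p <- ggplot(counts, aes(x = prop, y = Factor, fill = Level)) +
  geom_col(position = position_fill(reverse = TRUE)) +
  geom_text(aes(label = n),
            position = position_fill(vjust = 0.5, reverse = TRUE)) +
  scale_x_continuous(breaks = c(0, 0.5, 1), labels = scales::percent) +
  scale_y_discrete(labels = pretty_label) +
  labs(x = NULL, y = NULL, fill = NULL)
```

Here reverse = TRUE is meant to put "very low" on the left, and vjust = 0.5 centers each count inside its segment. Printing p draws the chart.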
I have the following data:
df <- structure(list(IDVar = 1:40, Major.sectors = structure(c(5L,
9L, 3L, 15L, 11L, 7L, 18L, 18L, 18L, 3L, 3L, 3L, 3L, 17L, 3L,
11L, 7L, 17L, 3L, 11L, 3L, 18L, 3L, 17L, 9L, 18L, 9L, 19L, 3L,
11L, 11L, 2L, 5L, 3L, 18L, 17L, 4L, 2L, 3L, 3L), .Label = c("Banks",
"Chemicals, rubber, plastics, non-metallic products", "Construction",
"Education, Health", "Food, beverages, tobacco", "Gas, Water, Electricity",
"Hotels & restaurants", "Insurance companies", "Machinery, equipment, furniture, recycling",
"Metals & metal products", "Other services", "Post & telecommunications",
"Primary sector", "Public administration & defense", "Publishing, printing",
"Textiles, wearing apparel, leather", "Transport", "Wholesale & retail trade",
"Wood, cork, paper"), class = "factor"), Region.in.country = structure(c(15L,
8L, 8L, 8L, 10L, 15L, 19L, 10L, 8L, 10L, 3L, 18L, 4L, 12L, 4L,
15L, 13L, 4L, 15L, 15L, 7L, 15L, 12L, 1L, 7L, 10L, 15L, 8L, 13L,
15L, 12L, 8L, 7L, 15L, 15L, 10L, 8L, 10L, 10L, 15L), .Label = c("Andalucia",
"Aragon", "Asturias", "Canary Islands", "Cantabria", "Castilla-La Mancha",
"Castilla y Leon", "Cataluna", "Ceuta", "Comunidad Valenciana",
"Extremadura", "Galicia", "Islas Baleares", "La Rioja", "Madrid",
"Melilla", "Murcia", "Navarra", "Pais Vasco"), class = "factor"),
EBIT.TA = c(-0.234432635519391, -0.884337466274593, -0.00446559204081373,
0.11109107677028, -0.137203773525798, -0.582114677880617,
0.0190497663203189, -3.04252763094666, 0.113157822682219,
-0.0255533180037229, 0.281767142199724, 0.0326641697396841,
-0.00879974750993553, 0.0542074697816672, -0.112104697294392,
-0.191945591325174, -0.00380586115226597, -0.0363239884169068,
-0.273949107908537, 0.435398668004486, -0.00563436099927988,
-2.75971618056051, -0.1047327709263, 0.151283793741506, -0.0373197549569126,
0.00912639083178201, -0.0386627754065697, -0.018235399636112,
-0.0118104711362467, -0.701299939137125, NA, 0.0191819361175666,
-0.0104887983706721, -0.801677105519484, -0.402194475974272,
-0.124125227730062, 0.143020458476649, -0.601186271451194,
0.0163269364787831, 5.09955167591238), EBIT.TA_l1 = c(-0.443687074746458,
-0.561864166134075, -0.0345769510044604, 0.0282541797531804,
-0.0181173929170762, 0.0147211350970115, 0.0588534950162799,
-1.14097109926961, 0.060100343733096, -0.0386426338471025,
0.049684095221329, 0.0558174150334904, 0.00214962169435867,
0.0399960114646072, 0.0402934579830171, -0.612359147433149,
-0.0115916125659674, 0.00739473610413031, 0.0174576615247567,
0.68624861825246, 0.0305807338940829, -3.88006243913616,
0.0410122725022661, -0.089491343996377, -0.215219123182103,
0.00967853324842811, -0.0336715197882038, 0.362424791356667,
0.221203934329637, -0.654387857513823, 0.0656934439915892,
0.0652005453654772, 0.0339559014267185, 0.0259085077216708,
-0.303606048856146, 0.0280113794301873, 0.109307291990628,
-0.470048555841697, -0.00157699300508027, -0.350519090107081
), EBIT.TA_l2 = c(-0.351308186716873, 0.00159428805074234,
-0.00604587147802615, 0.0761894448922952, -0.00348378141492824,
NA, 0.0346370866793768, -0.552226781084599, 0.00220031803369861,
-0.0285840972149053, 0.065316579236306, 0.4090851643341,
-0.0188362202518351, 0.0403848986306371, 0.091146090480032,
-0.0154168449752466, -0.0694803621032671, 0.0511978643139393,
-0.452924037757731, -0.0091835704914724, 0.0119918914092344,
0.0858960833880717, NA, 0.104901526886479, -0.23096183545392,
-0.0163058345980967, 0.100643431561465, 0.0527859573541712,
0.250207316117438, NA, 0.00193240515291123, 0.0624210741756767,
0.0178136227732972, -0.0321294913646274, -0.0699629484084657,
-0.00417176180400133, 0.209612573099415, 0.0285645570852926,
0.0551624216079071, 0.0172738293439595), Major.sectors.id = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 7L, 7L, 3L, 3L, 3L, 3L, 8L, 3L, 5L,
6L, 8L, 3L, 5L, 3L, 7L, 3L, 8L, 2L, 7L, 2L, 9L, 3L, 5L, 5L,
10L, 1L, 3L, 7L, 8L, 11L, 10L, 3L, 3L), Region.in.country.id = c(1L,
2L, 2L, 2L, 3L, 1L, 4L, 3L, 2L, 3L, 5L, 6L, 7L, 8L, 7L, 1L,
9L, 7L, 1L, 1L, 10L, 1L, 8L, 11L, 10L, 3L, 1L, 2L, 9L, 1L,
8L, 2L, 10L, 1L, 1L, 3L, 2L, 3L, 3L, 1L)), .Names = c("IDVar",
"Major.sectors", "Region.in.country", "EBIT.TA", "EBIT.TA_l1",
"EBIT.TA_l2", "Major.sectors.id", "Region.in.country.id"), row.names = c(NA,
40L), class = "data.frame")
I randomly generate a column of zero and ones for illustration.
x <- 40
df$x<- sample(c(0,1), replace=TRUE, size=x)
What I am trying to do is drop rows which have zero values, based on a few conditions.
:If df$x == 1
and if intersect(region.id, sector.id) == 0 #i.e. there is no data
then drop
So, I want to group_by region and sector and if the intersect between both columns does not exist then drop that observation.
Consider the following image. I am basically looking to delete the intersections of the columns which have no data. So take sector.id: 1 and region.id: 5: there is no data, so I want to remove it. (However, my data is not grouped like the image below; it is structured as in the dput code.)
I used NA for missing values in the sample x.
# get ready
set.seed(123) # set seed for reproducibility
df$x <- sample(c(NA,1), 40, replace = TRUE) # sample values
Base solution
# split by ids, check for values, bind together nonempty combinations
dfs_split <- split(df, list(df$Major.sectors.id, df$Region.in.country.id))
has_value <- sapply(dfs_split, function(df) !all(is.na(df$x)))
dfs_nonempty <- dfs_split[has_value]
res <- do.call(rbind, dfs_nonempty)
Explanation:
split divides the data into the groups you specified
sapply applies the test for non-missing values on each group
do.call helps to rbind the groups (which actually form a list)
dplyr solution
This is the cleaner option.
library(dplyr)
res <- df %>%
group_by(Major.sectors.id, Region.in.country.id) %>%
filter(!all(is.na(x)))
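If you'd rather stay in base R without splitting and re-binding, the same group-wise test can probably be written with ave(). This is only a sketch on a made-up data frame; the column names sector and region are stand-ins for Major.sectors.id and Region.in.country.id.

```r
# Hypothetical toy data: x is NA where a row carries no value
toy <- data.frame(
  sector = c(1, 1, 2, 2, 3),
  region = c(1, 1, 1, 1, 2),
  x      = c(NA, 1, NA, NA, 1)
)

# For every row, TRUE if its sector/region combination has at least one non-NA x
keep <- ave(!is.na(toy$x), toy$sector, toy$region, FUN = any)

# Keep only the rows of combinations that contain data
res <- toy[keep, ]
print(res)
```

Unlike the split/rbind version, this keeps the original row order and row names.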
I use the wordcloud2 package to render word clouds. It seems that wordcloud2 does not always display the most frequent words.
I said "not always" because the problem is not permanent. It seems that the results are mostly random.
Code :
library(wordcloud2)
library(htmlwidgets)
DataCloud <- as.character(DataTextAnalysis[,1])
DataCloud <- as.data.frame(table(DataCloud))
DataCloud <- DataCloud[order(DataCloud$Freq, decreasing = TRUE),]
DataCloud <- DataCloud[1:10, ]
wordcloud2(data = DataCloud)
Data :
structure(list(`Theme 1` = structure(c(12L, NA, 2L, 4L, 6L, 7L,
NA, 14L, 6L, 6L, 2L, 7L, 5L, 2L, 2L, 2L, 11L, 12L, 2L, 2L, 10L,
NA, 12L, NA, 2L, 13L, 15L, NA, NA, 10L, NA, 1L, 2L, 16L, 6L,
1L, 7L, 9L, 15L, 3L, 1L, 2L, 2L, 2L, 17L, 2L, 17L, 7L, 3L, 2L,
2L, 8L, 6L), .Label = c("Ambiance", "Autonomie", "Changement régulier de hiérarchie",
"Côté familial", "Défi", "Diversité des tâches", "Faire du bon travail",
"Gérer l humain", "Gestion de projets", "Horaires", "Réglage du finisseur",
"Relation client", "Rencontrer de nouvelles équipes", "Responsabilité",
"Technicité", "Travailler avec la hiérachie", "Travailler en binôme"
), class = "factor"), `Theme 2` = structure(c(NA, NA, 13L, 1L,
14L, NA, NA, 4L, 15L, 14L, 10L, 8L, 8L, 5L, 15L, 4L, 13L, 8L,
6L, NA, 3L, NA, 3L, NA, 11L, 5L, 5L, NA, NA, 9L, NA, 16L, 1L,
7L, 8L, 5L, 19L, 2L, 8L, 11L, 5L, 13L, 11L, 11L, 19L, 5L, 19L,
12L, 11L, 8L, 18L, 17L, 4L), .Label = c("Ambiance", "Amélioration",
"Autonomie", "Confiance", "Diversité des tâches", "Être écouté",
"Evolution continue de l entreprise", "Faire du bon travail",
"Hiver", "Liberté", "Matériel performant", "Partager mon savoir-faire",
"Relation client", "Rencontrer de nouvelles équipes", "Responsabilité",
"Solidarité", "Stimulation", "Tranquille", "Travailler dans ma région"
), class = "factor")), .Names = c("Theme 1", "Theme 2"), row.names = c(NA,
-53L), class = "data.frame")
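wordcloud2 appears to skip words that do not fit in the available space, so even the most frequent (and therefore largest) words can be left out. Reduce the font size so that all words fit the available page space: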
wordcloud2(DataCloud, size = .5)
structure(list(Date = structure(c(4L, 4L, 4L, 4L, 4L, 4L, 4L,
5L, 5L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 9L, 10L, 10L,
10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 3L), .Label = c("13/09/14", "14/09/14", "15/09/14",
"16/08/14", "17/08/14", "18/08/14", "23/08/14", "24/08/14", "25/08/14",
"30/08/14", "31/08/14"), class = "factor"), HomeTeam = structure(c(1L,
8L, 11L, 13L, 15L, 19L, 20L, 9L, 12L, 3L, 2L, 4L, 5L, 6L, 14L,
17L, 7L, 16L, 18L, 10L, 3L, 6L, 10L, 12L, 13L, 17L, 20L, 2L,
8L, 18L, 1L, 4L, 5L, 9L, 14L, 15L, 16L, 19L, 11L, 7L), .Label = c("Arsenal",
"Aston Villa", "Burnley", "Chelsea", "Crystal Palace", "Everton",
"Hull", "Leicester", "Liverpool", "Man City", "Man United", "Newcastle",
"QPR", "Southampton", "Stoke", "Sunderland", "Swansea", "Tottenham",
"West Brom", "West Ham"), class = "factor"), AwayTeam = structure(c(5L,
6L, 17L, 7L, 2L, 16L, 18L, 14L, 10L, 4L, 12L, 8L, 20L, 1L, 19L,
3L, 15L, 11L, 13L, 9L, 11L, 4L, 15L, 5L, 16L, 19L, 14L, 7L, 1L,
9L, 10L, 17L, 3L, 2L, 12L, 8L, 18L, 6L, 13L, 20L), .Label = c("Arsenal",
"Aston Villa", "Burnley", "Chelsea", "Crystal Palace", "Everton",
"Hull", "Leicester", "Liverpool", "Man City", "Man United", "Newcastle",
"QPR", "Southampton", "Stoke", "Sunderland", "Swansea", "Tottenham",
"West Brom", "West Ham"), class = "factor"), FTR = structure(c(3L,
2L, 1L, 1L, 1L, 2L, 1L, 3L, 1L, 1L, 2L, 3L, 1L, 2L, 2L, 3L, 2L,
2L, 3L, 3L, 2L, 1L, 1L, 2L, 3L, 3L, 1L, 3L, 2L, 1L, 2L, 3L, 2L,
1L, 3L, 1L, 2L, 1L, 3L, 2L), .Label = c("A", "D", "H"), class = "factor"),
Referee = structure(c(4L, 10L, 9L, 3L, 1L, 12L, 2L, 8L, 7L,
11L, 9L, 6L, 8L, 5L, 15L, 3L, 4L, 7L, 1L, 11L, 2L, 4L, 6L,
10L, 16L, 14L, 9L, 8L, 1L, 13L, 8L, 5L, 9L, 6L, 2L, 11L,
3L, 1L, 13L, 7L), .Label = c("A Taylor", "C Foy", "C Pawson",
"J Moss", "K Friend", "L Mason", "M Atkinson", "M Clattenburg",
"M Dean", "M Jones", "M Oliver", "N Swarbrick", "P Dowd",
"P Tierney", "R East", "R Madley"), class = "factor")), .Names = c("Date",
"HomeTeam", "AwayTeam", "FTR", "Referee"), row.names = c(NA,
40L), class = "data.frame")
In the above dataset I am trying to find the referee who officiated the most matches for each team. For example, who refereed Aston Villa the most in home games, in away games, and in both combined?
Sorry about me being blunt with my question. I did make an attempt.
In order to find out how many times referee J Moss refereed for Arsenal I tried this,
library(dplyr)
# Number of matches J Moss refereed with Arsenal as the away / home team
awayref <- nrow(filter(fd, Referee == "J Moss", AwayTeam == "Arsenal"))
homeref <- nrow(filter(fd, Referee == "J Moss", HomeTeam == "Arsenal"))
total <- homeref + awayref
View(total)
I needed some help with looping it to include all referees and all teams.
We can do
tbl1 <- table(df1$Referee)
tbl1[which.max(tbl1)]
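The lines above give the busiest referee overall. To get one per team, a possible base R sketch is to stack the home and away appearances and tabulate team against referee. The toy fixtures below are invented; only the column names HomeTeam, AwayTeam and Referee come from the question.

```r
# Hypothetical toy fixtures with the question's column names
toy <- data.frame(
  HomeTeam = c("Arsenal", "Chelsea", "Arsenal", "Everton"),
  AwayTeam = c("Chelsea", "Arsenal", "Everton", "Chelsea"),
  Referee  = c("J Moss", "J Moss", "M Dean", "M Dean"),
  stringsAsFactors = FALSE
)

# Stack home and away appearances into one long table
long <- rbind(
  data.frame(Team = toy$HomeTeam, Venue = "home", Referee = toy$Referee),
  data.frame(Team = toy$AwayTeam, Venue = "away", Referee = toy$Referee)
)

# Team x Referee counts over all matches
counts <- table(long$Team, long$Referee)

# Most frequent referee per team (which.max takes the first one on ties)
top_ref <- setNames(colnames(counts)[apply(counts, 1, which.max)],
                    rownames(counts))
print(top_ref)
```

For home-only or away-only results, the same two lines can be run on long[long$Venue == "home", ] or long[long$Venue == "away", ].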
I'm trying to aggregate multiple rows by column in a data frame. I succeeded in using aggregate for one column \o/ but I don't understand how to use it for several columns. Here is an example of my data:
Gene_Title ID_Affymetrix GB_Acc.x Gene_Symbol.x Entrez ID_Agl GB_Acc.y Gene_Symbol.y Unigene Ensembl Chr_location
trafficking protein particle complex 4 1429632_at AK005276 Trappc4 60409 10239 NM_021789 Trappc4 Mm.29814 ENSMUST00000170082 chr9:44211918-44211859
aldo-keto reductase family 1, member B3 (aldose reductase) 1437133_x_at AV127085 Akr1b3 11677 22 NM_009658 Akr1b3 Mm.451 ENSMUST00000166583 chr6:34253982-34253930
sodium channel, voltage-gated, type I, alpha 1450120_at AV336781 Scn1a 20265 58 NM_018733 Scn1a Mm.439704 ENSMUST00000094951 chr2:66173557-66173498
sodium channel, voltage-gated, type I, alpha 1450121_at AV336781 Scn1a 20265 58 NM_018733 Scn1a Mm.439704 ENSMUST00000094951 chr2:66173557-66173498
aldo-keto reductase family 1, member B3 (aldose reductase) 1456590_x_at BB469763 Akr1b3 11677 22 NM_009658 Akr1b3 Mm.451 ENSMUST00000166583 chr6:34253982-34253930
dolichol-phosphate (beta-D) mannosyltransferase 2 1415675_at BC008256 Dpm2 13481 33459 NM_010073 Dpm2 Mm.22001 ENSMUST00000150419 chr2:32428766-32428825
proline rich 13 1423686_a_at BC016234 Prr13 66151 4 NM_025385 Prr13 Mm.393955 ENSMUST00000164688 chr15:102291090-102291149
transmembrane protein 2 1424711_at BC019745 Tmem2 83921 23 NM_031997 Tmem2 Mm.329776 ENSMUST00000096194 chr19:21930192-21930251
transmembrane protein 2 1451458_at BC019745 Tmem2 83921 23 NM_031997 Tmem2 Mm.329776 ENSMUST00000096194 chr19:21930192-21930251
lipase, endothelial 1450188_s_at BC020991 Lipg 16891 52 NM_010720 Lipg Mm.299647 ENSMUST00000066532 chr18:75099688-75099629
lipase, endothelial 1421261_at BC020991 Lipg 16891 52 NM_010720 Lipg Mm.299647 ENSMUST00000066532 chr18:75099688-75099629
lipase, endothelial 1421262_at BC020991 Lipg 16891 52 NM_010720 Lipg Mm.299647 ENSMUST00000066532 chr18:75099688-75099629
coatomer protein complex, subunit gamma 1415670_at BC024686 Copg 54161 25829 NM_017477 Copg Mm.258785 ENSMUST00000113607 chr6:87862890-87862949
coatomer protein complex, subunit gamma 1416017_at BC024686 Copg 54161 25829 NM_017477 Copg Mm.258785 ENSMUST00000113607 chr6:87862890-87862949
leucine rich repeat containing 1 1452411_at BG966295 Lrrc1 214345 29 NM_172528 Lrrc1 Mm.28534 ENSMUST00000049755 chr9:77278998-77278939
aldo-keto reductase family 1, member B3 (aldose reductase) 1448319_at NM_009658 Akr1b3 11677 22 NM_009658 Akr1b3 Mm.451 ENSMUST00000166583 chr6:34253982-34253930
ATPase, H+ transporting, lysosomal V0 subunit D1 1415671_at NM_013477 Atp6v0d1 11972 11826 NM_013477 Atp6v0d1 Mm.17708 ENSMUST00000013304 chr8:108048837-108048778
golgi autoantigen, golgin subfamily a, 7 1415672_at NM_020585 Golga7 57437 54944 NM_020585 Golga7 Mm.196269 ENSMUST00000121783 chr8:24351978-24351919
trafficking protein particle complex 4 1415674_a_at NM_021789 Trappc4 60409 10239 NM_021789 Trappc4 Mm.29814 ENSMUST00000170082 chr9:44211918-44211859
phosphoserine phosphatase 1415673_at NM_133900 Psph 100678 57142 NM_133900 Psph Mm.271784 ENSMUST00000031399 chr5:130271500-130271441
Some Gene_Title (and Gene_Symbol) values appear several times but with different IDs (Affymetrix or Agilent), or with a different GB_Acc. In general I want only one line per gene, with the different values collected in the ID, GB_Acc or other columns.
Here is my data aggregated by the Affymetrix ID:
>f=function(x){return(paste(x,collapse=","))}
>tab4=aggregate(ID_Affymetrix ~ GB_Acc.x+ Gene_Title+GB_Acc.y+Gene_Symbol.x+Entrez+Unigene+Ensembl+Chr_location+ID_Agl,data=tab3,f)
GB_Acc.x Gene_Title GB_Acc.y Gene_Symbol.x Entrez Unigene Ensembl Chr_location ID_Agl ID_Affymetrix
BC016234 proline rich 13 NM_025385 Prr13 66151 Mm.393955 ENSMUST00000164688 chr15:102291090-102291149 4 1423686_a_at
AV127085 aldo-keto reductase family 1, member B3 (aldose reductase) NM_009658 Akr1b3 11677 Mm.451 ENSMUST00000166583 chr6:34253982-34253930 22 1437133_x_at
BB469763 aldo-keto reductase family 1, member B3 (aldose reductase) NM_009658 Akr1b3 11677 Mm.451 ENSMUST00000166583 chr6:34253982-34253930 22 1456590_x_at
NM_009658 aldo-keto reductase family 1, member B3 (aldose reductase) NM_009658 Akr1b3 11677 Mm.451 ENSMUST00000166583 chr6:34253982-34253930 22 1448319_at
BC019745 transmembrane protein 2 NM_031997 Tmem2 83921 Mm.329776 ENSMUST00000096194 chr19:21930192-21930251 23 1424711_at,1451458_at
BG966295 leucine rich repeat containing 1 NM_172528 Lrrc1 214345 Mm.28534 ENSMUST00000049755 chr9:77278998-77278939 29 1452411_at
BC020991 lipase, endothelial NM_010720 Lipg 16891 Mm.299647 ENSMUST00000066532 chr18:75099688-75099629 52 1450188_s_at,1421261_at,1421262_at
AV336781 sodium channel, voltage-gated, type I, alpha NM_018733 Scn1a 20265 Mm.439704 ENSMUST00000094951 chr2:66173557-66173498 58 1450120_at,1450121_at
AK005276 trafficking protein particle complex 4 NM_021789 Trappc4 60409 Mm.29814 ENSMUST00000170082 chr9:44211918-44211859 10239 1429632_at
NM_021789 trafficking protein particle complex 4 NM_021789 Trappc4 60409 Mm.29814 ENSMUST00000170082 chr9:44211918-44211859 10239 1415674_a_at
NM_013477 ATPase, H+ transporting, lysosomal V0 subunit D1 NM_013477 Atp6v0d1 11972 Mm.17708 ENSMUST00000013304 chr8:108048837-108048778 11826 1415671_at
BC024686 coatomer protein complex, subunit gamma NM_017477 Copg 54161 Mm.258785 ENSMUST00000113607 chr6:87862890-87862949 25829 1415670_at,1416017_at
BC008256 dolichol-phosphate (beta-D) mannosyltransferase 2 NM_010073 Dpm2 13481 Mm.22001 ENSMUST00000150419 chr2:32428766-32428825 33459 1415675_at
NM_020585 golgi autoantigen, golgin subfamily a, 7 NM_020585 Golga7 57437 Mm.196269 ENSMUST00000121783 chr8:24351978-24351919 54944 1415672_at
NM_133900 phosphoserine phosphatase NM_133900 Psph 100678 Mm.271784 ENSMUST00000031399 chr5:130271500-130271441 57142 1415673_at
As you can see, for Tmem2, Copg, Lipg and Scn1a I now have several ID_Affymetrix values in the same row. For these genes the only difference was in this column. But for Akr1b3 and Trappc4 there was also a difference in the GB_Acc.x column.
So in general I would like to aggregate every column (except Gene_Title and Gene_Symbol, which are normally always the same for a given gene) and end up with, for example:
Gene_Tile Gene_Symbol GB_Acc ID_Affy ...
Traffickp Prot complex 4 Trapcc4 AK005276,NM_021789 1429632_at,1415674_a_at ...
If anyone has any idea
Thanks!
EDIT:
This is the output of dput(head(mydata, 20)). There are some errors at the end, but I didn't know this function or its purpose.
structure(list(Gene_Title = structure(c(1L, 1L, 1L, 2L, 3L, 3L,
4L, 5L, 6L, 7L, 7L, 7L, 8L, 9L, 10L, 10L, 11L, 11L, 12L, 12L), .Label = c("aldo-keto reductase family 1, member B3 (aldose reductase)",
"ATPase, H+ transporting, lysosomal V0 subunit D1", "coatomer protein complex, subunit gamma",
"dolichol-phosphate (beta-D) mannosyltransferase 2", "golgi autoantigen, golgin subfamily a, 7",
"leucine rich repeat containing 1", "lipase, endothelial", "phosphoserine phosphatase",
"proline rich 13", "sodium channel, voltage-gated, type I, alpha",
"trafficking protein particle complex 4", "transmembrane protein 2"
), class = "factor"), ID_Affymetrix = structure(c(13L, 14L, 20L,
2L, 1L, 7L, 6L, 3L, 19L, 17L, 8L, 9L, 4L, 10L, 15L, 16L, 5L,
12L, 11L, 18L), .Label = c("1415670_at", "1415671_at", "1415672_at",
"1415673_at", "1415674_a_at", "1415675_at", "1416017_at", "1421261_at",
"1421262_at", "1423686_a_at", "1424711_at", "1429632_at", "1437133_x_at",
"1448319_at", "1450120_at", "1450121_at", "1450188_s_at", "1451458_at",
"1452411_at", "1456590_x_at"), class = "factor"), GB_Acc.x = structure(c(2L,
11L, 4L, 12L, 9L, 9L, 5L, 13L, 10L, 8L, 8L, 8L, 15L, 6L, 3L,
3L, 14L, 1L, 7L, 7L), .Label = c("AK005276", "AV127085", "AV336781",
"BB469763", "BC008256", "BC016234", "BC019745", "BC020991", "BC024686",
"BG966295", "NM_009658", "NM_013477", "NM_020585", "NM_021789",
"NM_133900"), class = "factor"), Gene_Symbol.x = structure(c(1L,
1L, 1L, 2L, 3L, 3L, 4L, 5L, 7L, 6L, 6L, 6L, 9L, 8L, 10L, 10L,
12L, 12L, 11L, 11L), .Label = c("Akr1b3", "Atp6v0d1", "Copg",
"Dpm2", "Golga7", "Lipg", "Lrrc1", "Prr13", "Psph", "Scn1a",
"Tmem2", "Trappc4"), class = "factor"), Entrez = c(11677L, 11677L,
11677L, 11972L, 54161L, 54161L, 13481L, 57437L, 214345L, 16891L,
16891L, 16891L, 100678L, 66151L, 20265L, 20265L, 60409L, 60409L,
83921L, 83921L), ID_Agl = c(22L, 22L, 22L, 11826L, 25829L, 25829L,
33459L, 54944L, 29L, 52L, 52L, 52L, 57142L, 4L, 58L, 58L, 10239L,
10239L, 23L, 23L), GB_Acc.y = structure(c(1L, 1L, 1L, 4L, 5L,
5L, 2L, 7L, 12L, 3L, 3L, 3L, 11L, 9L, 6L, 6L, 8L, 8L, 10L, 10L
), .Label = c("NM_009658", "NM_010073", "NM_010720", "NM_013477",
"NM_017477", "NM_018733", "NM_020585", "NM_021789", "NM_025385",
"NM_031997", "NM_133900", "NM_172528"), class = "factor"), Gene_Symbol.y = structure(c(1L,
1L, 1L, 2L, 3L, 3L, 4L, 5L, 7L, 6L, 6L, 6L, 9L, 8L, 10L, 10L,
12L, 12L, 11L, 11L), .Label = c("Akr1b3", "Atp6v0d1", "Copg",
"Dpm2", "Golga7", "Lipg", "Lrrc1", "Prr13", "Psph", "Scn1a",
"Tmem2", "Trappc4"), class = "factor"), Unigene = structure(c(12L,
12L, 12L, 1L, 4L, 4L, 3L, 2L, 6L, 8L, 8L, 8L, 5L, 10L, 11L, 11L,
7L, 7L, 9L, 9L), .Label = c("Mm.17708", "Mm.196269", "Mm.22001",
"Mm.258785", "Mm.271784", "Mm.28534", "Mm.29814", "Mm.299647",
"Mm.329776", "Mm.393955", "Mm.439704", "Mm.451"), class = "factor"),
Ensembl = structure(c(11L, 11L, 11L, 1L, 7L, 7L, 9L, 8L,
3L, 4L, 4L, 4L, 2L, 10L, 5L, 5L, 12L, 12L, 6L, 6L), .Label = c("ENSMUST00000013304",
"ENSMUST00000031399", "ENSMUST00000049755", "ENSMUST00000066532",
"ENSMUST00000094951", "ENSMUST00000096194", "ENSMUST00000113607",
"ENSMUST00000121783", "ENSMUST00000150419", "ENSMUST00000164688",
"ENSMUST00000166583", "ENSMUST00000170082"), class = "factor"),
Chr_location = structure(c(7L, 7L, 7L, 9L, 8L, 8L, 4L, 10L,
12L, 2L, 2L, 2L, 6L, 1L, 5L, 5L, 11L, 11L, 3L, 3L), .Label = c("chr15:102291090-102291149",
"chr18:75099688-75099629", "chr19:21930192-21930251", "chr2:32428766-32428825",
"chr2:66173557-66173498", "chr5:130271500-130271441", "chr6:34253982-34253930",
"chr6:87862890-87862949", "chr8:108048837-108048778", "chr8:24351978-24351919",
"chr9:44211918-44211859", "chr9:77278998-77278939"), class = "factor")), .Names = c("Gene_Title",
"ID_Affymetrix", "GB_Acc.x", "Gene_Symbol.x", "Entrez", "ID_Agl",
"GB_Acc.y", "Gene_Symbol.y", "Unigene", "Ensembl", "Chr_location"
), row.names = c(NA, 20L), class = "data.frame")
Error in `?`(dput(head(tab3, 20)), dput(head(tab3, 20))) :
c("no documentation of type ‘c(1, 1, 1, 2, 3, 3, 4, 5, 6, 7, 7, 7, 8, 9, 10, 10, 11, 11, 12, 12)’ and topic ‘dput(head(tab3, 20))’ (or error in processing help)", "no documentation of type ‘c(13, 14, 20, 2, 1, 7, 6, 3, 19, 17, 8, 9, 4, 10, 15, 16, 5, 12, 11, 18)’ and topic ‘dput(head(tab3, 20))’ (or error in processing help)", "no documentation of type ‘c(2, 11, 4, 12, 9, 9, 5, 13, 10, 8, 8, 8, 15, 6, 3, 3, 14, 1, 7, 7)’ and topic ‘dput(head(tab3, 20))’ (or error in processing help)",
"no documentation of type ‘c(1, 1, 1, 2, 3, 3, 4, 5, 7, 6, 6, 6, 9, 8, 10, 10, 12, 12, 11, 11)’ and topic ‘dput(head(tab3, 20))’ (or error in processing help)", "no documentation of type ‘c(11677, 11677, 11677, 11972, 54161, 54161, 13481, 57437, 214345, 16891, 16891, 16891, 100678, 66151, 20265, 20265, 60409, 60409, 83921, 83921)’ and topic ‘dput(head(tab3, 20))’ (or error in processing he
Maybe this is what you're looking for?
library(dplyr)
dfcollapsed <- df %>% # replace df with the name of your data frame
group_by(Gene_Title, Gene_Symbol) %>%
summarise_each(funs(paste(., collapse = ",")))
I didn't test it with your data though, because I couldn't copy and paste it into my session.
Update:
In your data you have two columns Gene_Symbol.x and Gene_Symbol.y which were probably created during a merge. I assume they have the same information, and hence you could adjust the code to:
dfcollapsed <- df %>% # replace df with the name of your data frame
group_by(Gene_Title, Gene_Symbol.x) %>%
summarise_each(funs(paste(., collapse = ",")), -Gene_Symbol.y)
Or, if you only want unique entries in each column (as in @juba's answer) you can write:
dfcollapsed <- df %>% # replace df with the name of your data frame
group_by(Gene_Title, Gene_Symbol.x) %>%
summarise_each(funs(paste(unique(.), collapse = ",")), -Gene_Symbol.y)
Hope that helps.
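A note for readers on recent dplyr versions: summarise_each() and funs() are deprecated there, and across() is the usual replacement. A minimal sketch of the same idea follows, using three rows copied from the question and shortened, hypothetical column names (Gene_Symbol, GB_Acc, ID_Affy).

```r
library(dplyr)

# Three rows taken from the question's data, with shortened column names
toy <- data.frame(
  Gene_Title  = c("trafficking protein particle complex 4",
                  "trafficking protein particle complex 4",
                  "proline rich 13"),
  Gene_Symbol = c("Trappc4", "Trappc4", "Prr13"),
  GB_Acc      = c("AK005276", "NM_021789", "BC016234"),
  ID_Affy     = c("1429632_at", "1415674_a_at", "1423686_a_at"),
  stringsAsFactors = FALSE
)

# One row per gene; every non-grouping column is collapsed to its unique values
dfcollapsed <- toy %>%
  group_by(Gene_Title, Gene_Symbol) %>%
  summarise(across(everything(), ~ paste(unique(.x), collapse = ",")),
            .groups = "drop")
print(dfcollapsed)
```

Inside a grouped summarise(), across(everything()) leaves the grouping columns alone, so only GB_Acc and ID_Affy are pasted together.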
Maybe the following with aggregate :
f <- function(v) {paste(unique(v), collapse=", ")}
aggregate(tab3, list(tab3$Gene_Title, tab3$Gene_Symbol.x), f)
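Note that the call above also runs f over the grouping columns themselves and labels the groups Group.1 and Group.2. A slightly tidier variant, sketched here on three rows from the question with shortened, hypothetical column names, passes only the columns to collapse and uses a named by list.

```r
# Three rows taken from the question's data, with shortened column names
toy <- data.frame(
  Gene_Title  = c("trafficking protein particle complex 4",
                  "trafficking protein particle complex 4",
                  "proline rich 13"),
  Gene_Symbol = c("Trappc4", "Trappc4", "Prr13"),
  GB_Acc      = c("AK005276", "NM_021789", "BC016234"),
  ID_Affy     = c("1429632_at", "1415674_a_at", "1423686_a_at"),
  stringsAsFactors = FALSE
)

# Collapse the unique values of a column into one comma-separated string
f <- function(v) paste(unique(v), collapse = ",")

# Aggregate only the value columns; the grouping columns keep their names
res <- aggregate(toy[c("GB_Acc", "ID_Affy")],
                 by = toy[c("Gene_Title", "Gene_Symbol")],
                 FUN = f)
print(res)
```

The result has one row per gene, with the original Gene_Title and Gene_Symbol column names preserved.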