For my research I am trying to create a similar graph based around this graph I found in a piece of literature:
My experiment involved the genre-tagging of 10 different songs. I saved the tags (the words people used to describe seperately).
The x-asis should represent all the participants that took part in chronological order. The y-axis should represent how often a word is used in a tag. Consider this sample data:
df <- data.frame(tagid= numeric(0), participantid = numeric(0), tag = character(0))
newRow <-data.frame(tagid=1, participantid=1, tag = "triphop")
df <-rbind(df,newRow)
newRow <-data.frame(tagid=2, participantid=1, tag = "electronic")
df <-rbind(df,newRow)
newRow <-data.frame(tagid=3, participantid=2, tag = "mellow")
df <-rbind(df,newRow)
newRow <-data.frame(tagid=4, participantid=2, tag = "electronic")
df <-rbind(df,newRow)
newRow <-data.frame(tagid=5, participantid=3, tag = "electronic")
df <-rbind(df,newRow)
Tagid 1 and 2 belong to the same participant and should have the same x coordinate. 3 and 4 belong to participant 2 and tagid 5 belongs to participant 3.
For this dataset I'd like to plot a graph like this (excuse the drawing):
The y-axis represents the percentage of participants that have used a specific word to describe this music piece. As 'electronic' is used by all three participants it stays at 100%. 'Triphop' was used by participant 1, but not by participant 2 and 3, decreasing from 100%, to 50%, to 33% at participant 3.
Code is a bit messy, but probably you want something like this ? You need to complete the dataframe so each participantid has rows for all three tag levels. Then, with the cumulative sum of the tag levels and the cumulative sum of participants, you can get the proportion.
df %>%
group_by(participantid, tag) %>%
summarise(n = n()) %>%
complete(tag, nesting(participantid), fill = list(n = 0)) %>%
group_by(tag) %>%
mutate(absolute = cumsum(n)) %>%
ungroup() %>%
mutate(id = rep(1:3, each = length(levels(tag)))) %>%
mutate(proportion = ifelse(absolute / id != 0, absolute / id, NA)) %>%
ggplot(aes(x = participantid, y = proportion, color = tag)) + geom_line(lwd = 1)
Related
I have a data frame in which the first column indicates the work (manager, employee or worker), the second indicates whether the person works at night or not and the last is a household code (if two individuals share the same code then it means that they share the same house).
#Here is the reproductible data :
PCS <- c("worker", "manager","employee","employee","worker","worker","manager","employee","manager","employee")
work_night <- c("Yes","Yes","No", "No","No","Yes","No","Yes","No","Yes")
HHnum <- c(1,1,2,2,3,3,4,4,5,5)
df <- data.frame(PCS,work_night,HHnum)
My problem is that I would like to have a new data frame with households instead of individuals. I would like to group individuals based on HHnum and then merge their answers.
For the variable "PCS" I have new categories based on the combination of answers : Manager+work ="I" ; manager+employee="II", employee+employee=VI, worker+worker=III etc
For the variable "work_night", I would like to apply a score (is both answered Yes then score=2, if one answered YES then score =1 and if both answered No then score = 0).
To be clear, I would like my data frame to look like this :
HHnum PCS work_night
1 "I" 2
2 "VI" 0
3 "III" 1
4 "II" 1
5 "II" 1
How can I do this on R using dplyr ? I know that I need group_by() but then I don't know what to use.
Best,
Victor
Here is one way to do it (though I admit it is pretty verbose). I created a reference dataframe (i.e., combos) in case you had more categories than 3, which is then joined with the main dataframe (i.e., df_new) to bring in the PCS roman numerals.
library(dplyr)
library(tidyr)
# Create a dataframe with all of the combinations of PCS.
combos <- expand.grid(unique(df$PCS), unique(df$PCS))
combos <- unique(t(apply(combos, 1, sort))) %>%
as.data.frame() %>%
dplyr::mutate(PCS = as.roman(row_number()))
# Create another dataframe with the columns reversed (will make it easier to join to the main dataframe).
combos2 <- data.frame(V1 = c(combos$V2), V2 = c(combos$V1), PCS = c(combos$PCS)) %>%
dplyr::mutate(PCS = as.roman(PCS))
combos <- rbind(combos, combos2)
# Get the count of "Yes" for each HHnum group.
# Then, put the PCS into 2 columns to join together with "combos" df.
df_new <- df %>%
dplyr::group_by(HHnum) %>%
dplyr::mutate(work_night = sum(work_night == "Yes")) %>%
dplyr::group_by(grp = rep(1:2, length.out = n())) %>%
dplyr::ungroup() %>%
tidyr::pivot_wider(names_from = grp, values_from = PCS) %>%
dplyr::rename("V1" = 3, "V2" = 4) %>%
dplyr::left_join(combos, by = c("V1", "V2")) %>%
unique() %>%
dplyr::select(HHnum, PCS, work_night)
I have a task that I can't solve. My goal is to be able to figure out how many "families" have children (under 18). I only need the sum of unique familyids and I've tried doing it in R and Excel and can't figure it out.
In my data I have four families and my data is saved on a client level.
data <- data.frame(
"FamilyID" = c(10,10,10,11,11,11,12,12,13,13),
"ClientID" = c(101,102,103,111,112,113,121,122,131,132),
"Age" = c(26,1,5,35,34,1,54,60,17,21)
)
My goal is to have something like this
Metric Count
Families w/ Children 3
Families w/out Children 1
In my actual dataset I have thousands of families so I really appreciate ant help.
How can I do this with dplyr?
library(tidyverse)
counts <- data %>%
group_by(FamilyID) %>%
summarise(number_of_children = sum(Age<= 18), number_of_adults = sum(Age > 18)) %>%
ungroup()
final <- counts %>%
summarise("Families w/ children" = sum(number_of_children > 0), "Families w/o children" = sum(number_of_children < 1)) %>%
gather() %>%
rename("Metric" = key, "Count" = value)
You can try something like this:
data2 <- data %>% group_by(FamilyID) %>%
mutate(children=sum(Age<18)) %>% mutate(children=ifelse(children>=1,1,0))
data2 %>% group_by(children) %>% summarize(n_distinct(FamilyID))
which shows how many distinct Family IDs correspond to 0 children, and how many correspond to at least 1 child.
One option is to use any to distinguish which families have children TRUE/FALSE, followed by dplyr::count
library(dplyr)
data %>%
group_by(FamilyID) %>%
summarize(have_children = any(Age < 18)) %>%
count(have_children)
#------
have_children n
<lgl> <int>
1 FALSE 1
2 TRUE 3
Seeking some advice around the use of ggalluvium to demonstrate the distribution of preferences in Australia.
Context, in Australia we have preferential voting. Say I live in an area with 4 candidates contesting.
The ballot is completed by numbering a box 1-4 according to your party/candidate preference.
The candidate with the lowest proportion of the vote after the first count will be eliminated and their votes will be apportioned to where their voters have indicated on their ballot paper. This process is reiterated until two candidates remain and a candidate is elected when they have greater than 50% of the two party preferred vote.
I'm seeking to visualise the above reiterating distribution process using flow diagram, and ggalluvium.
However I can't quite seem to plot the aesthetics to show the flows feeding votes to candidates in the next count of the votes.
Here's what I get so far:
library(tidyverse)
library(magrittr)
library(ggalluvial)
Load Data
house_of_reps <- read_csv("https://results.aec.gov.au/24310/Website/Downloads/HouseDopByDivisionDownload-24310.csv", skip = 1)
house_of_reps$BallotPosition %<>% as.factor()
house_of_reps$CountNumber %<>% as.factor()
cooper <- house_of_reps %>%
filter(DivisionNm == "Cooper") %>%
spread(CalculationType, CalculationValue) %>%
select(4,9,10,14)
cooper %>% ggplot(aes(x = CountNumber, alluvium = PartyNm, stratum = `Preference Percent`, y = `Preference Percent`, fill = PartyAb)) +
geom_alluvium(aes(fill = PartyAb), decreasing = TRUE) +
geom_stratum(decreasing = TRUE) +
geom_text(stat = "stratum",decreasing = TRUE, aes(label = after_stat(fill))) +
stat_stratum(decreasing = TRUE) +
stat_stratum(geom = "text", aes(label = PartyAb), decreasing = TRUE) +
scale_fill_viridis_d() +
theme_minimal()
Output image
Would appreciate any guidance on how to show where the votes after each subsequent count are flowing to which political party in the next stratum.
Unfortunately your dataset is not well suited for the kind of plot you have in mind. While the plotting itself is easy, to achieve the desired plot involves "some" data wrangling and preparation steps.
The general issue is that your dataset as is does not show the flow of votes from one party to a second. It only shows the overall number of votes a party lost or receivd in each count.
However, as in each step only one party drops out this missing information could be extracted from your data. The basic idea is to split the obs for each party or more precisely each party which drops out in one of the later counts by voter's secondary party preference.
Not sure wether each step is clear but I added some explanations as comments and added a plot of the final structure of the dataset which hopefully makes it clearer what's the final result of all the steps:
library(tidyverse)
library(magrittr)
library(ggalluvial)
# Load Data
house_of_reps <- read_csv("https://results.aec.gov.au/24310/Website/Downloads/HouseDopByDivisionDownload-24310.csv", skip = 1)
house_of_reps$BallotPosition %<>% as.factor()
house_of_reps$CountNumber %<>% as.factor()
cooper <- house_of_reps %>%
filter(DivisionNm == "Cooper") %>%
spread(CalculationType, CalculationValue) %>%
select(count = CountNumber, party = PartyAb, pref = `Preference Count`, trans = `Transfer Count`)
# Helper function to
make_rows <- function(x) {
# Name of party which gets dropped in this period
dropped <- filter(x, trans < 0) %>% pull(party)
if (length(dropped) > 0) {
x <- filter(x, trans >= 0)
# Replacements are added two times. Once for the period where the party drops out,
# and also for the previous period
xdrop <- mutate(x, party = dropped, pref = trans, trans = 0, is_drop = FALSE)
xdrop1 <- mutate(xdrop, count = count - 1, to = party, is_drop = FALSE)
# For the parties to keep or which receive transfered votes have to adjust the number of votes
xkeep <- mutate(x, pref = pref - trans, trans = 0)
bind_rows(xdrop1, xdrop, xkeep)
} else {
x
}
}
cooper1 <- cooper %>%
# First: Convert count to a numeric. Add a "to" variable for second
# party preference or the party where votes are transferred to. This variable
# will later on be mapped on the "fill" aes
mutate(to = party, count = as.numeric(as.character(count))) %>%
group_by(party) %>%
# Add identifier of obs. to drop. Obs. to drop are obs. of parties which
# drop out in the following count
mutate(is_drop = lead(trans, default = 0) < 0) %>%
ungroup() %>%
# Split obs. to be dropped by secondary party preference, i.e. in count 0 the
# obs for party "IND" is replaced by seven obs. reflecting the secondary preference
# for one of the other seven parties
split(.$count) %>%
map(make_rows) %>%
bind_rows() %>%
# Now drop original obs.
filter(!is_drop, pref > 0) %>%
# Add a unique identifier
group_by(count, party) %>%
mutate(id = paste0(party, row_number())) %>%
ungroup() %>%
# To make the flow chart work we have make the dataset complete, i.e. add
# "empty" obs for each type of voter and each count
complete(count, id, fill = list(pref = 0, trans = 0, is_drop = FALSE)) %>%
# Fill up party and "to" columns
mutate(across(c(party, to), ~ if_else(is.na(.), str_extract(id, "[^\\d]+"), .))) %>%
# Filling up the "to" column with last observed value for "to" if any
group_by(id) %>%
mutate(last_id = last(which(party != to)),
to = if_else(count >= last_id & !is.na(last_id), to[last_id], to)) %>%
ungroup()
The final structure of the dataset could be illustrated by means of a tile plot:
cooper1 %>%
add_count(count, party) %>%
ggplot(aes(count, reorder(id, n), fill = to)) +
geom_tile(color = "white")
As I said, after all the cumbersome data wrangling making the flow chart itself is the easiest task and could be achieved like so:
cooper1 %>%
ggplot(aes(x = count, alluvium = id, stratum = to, y = pref, fill = to)) +
geom_flow(decreasing = TRUE) +
geom_stratum(decreasing = TRUE) +
scale_fill_viridis_d() +
theme_minimal()
I have 11 variables in my dataframe. The first is unique identifier of observation (a plane). The second one is a number from 1 to 21 representing flight of a given plane. The rest of the variables are time, velocity, distance, etc.
What I want to do is make new variables for every group (number) of flight e.g. time_1, time_2,..., velocity_1, velocity_2, etc. and consequently, reduce the number of observations (the repeating ones).
I don't really have idea how to start. I was thinking about a mutate function like:
mutate(df, time_1 = ifelse(n_flight == 1, time, NA))
But that would be a lot of typing and a new problem may appear, perhaps.
Basically, you want to convert long to wide data for each variable. You can lapply over these with tidyr::spread in that case. Suppose the data looks like the following:
library(dplyr)
library(tidyr)
df <- data.frame(
ID = c(rep("A", 3), rep("B", 3)),
n_flight = rep(seq(3), 2),
time = seq(19, 24),
velocity = rev(seq(65, 60))
)
Then the following will generate your outcome of interest, as long as you get rid of the extra ID variables.
lapply(
setdiff(names(df), c("ID", "n_flight")), function(x) {
df %>%
select(ID, n_flight, !!x) %>%
tidyr::spread(., key = "n_flight", value = x) %>%
setNames(paste(x, names(.), sep = "_"))
}
) %>%
bind_cols()
Let me know if this wasn't what you were going for.
Fifa2 datasetFirst, I am not a developer and have little experience with R, so please forgive me. I have tried to get this done on my own, but have run out of ideas for filtering a data frame using the 'filter' command.
the data frame has about a dozen or so columns, with one being Grp (meaning Group). This is a FIFA soccer dataset, so the Group in this context means the general position the player is in (Defense, Midfield, Goalkeeper, Forward).
I need to filter this data frame to provide me this exact information:
the Top 4 Defense Players
the Top 4 Midfield Players
the Top 2 Forwards
the Top 1 Goalkeeper
What do I mean by "Top"? It's arranged by the Grp column, which is just a numeric number. So, Top 4 would be like 22,21,21,20 (or something similar because that numeric number could in fact be repeated for different players). The Growth column is the difference between the Potential Column and Overall column, so again just a simple subtraction to find the difference between them.
#Create a subset of the data frame
library(dplyr)
fifa2 <- fifa %>% select(Club,Name,Position,Overall,Potential,Contract.Valid.Until2,Wage2,Value2,Release.Clause2,Grp) %>% arrange(Club)
#Add columns for determining potential
fifa2$Growth <- fifa2$Potential - fifa2$Overall
head(fifa2)
#Find Southampton Players
ClubName <- filter(fifa2, Club == "Southampton") %>%
group_by(Grp) %>% arrange(desc(Growth), .by_group=TRUE) %>%
top_n(4)
ClubName
ClubName2 <- ggplot(ClubName, aes(x=forcats::fct_reorder(Name, Grp),
y=Growth, fill = Grp)) +
geom_bar(stat = "identity", colour = "black") +
coord_flip() + xlab("Player Names") + ylab("Unfilled Growth Potential") +
ggtitle("Southampton Players, Grouped by Position")
ClubName2
That chart produces a list of players that ends up having the Top 4 players in each position (top_n(4)), but I need it further filtered per the logic I described above. How can I achieve this? I tried fooling around with dplyr and that is fairly easy to get rows by Grp name, but don't see how to filter it to the 4-4-2-1 that I need. Any help appreciated.
Sample Output from fifa2 & ClubName (which shows the data sorted by top_n(4):
fifa2_Dataset
This might not be the most elegant solution, but hopefully it works :)
# create dummy data
data_test = data.frame(grp = sample(c("def", "mid", "goal", "front"), 30, replace = T), growth = rnorm(30, 100,10), stringsAsFactors = F)
# create referencetable to give the number of players needed per grp
desired_n = data.frame(grp = c("def", "mid", "goal", "front"), top_n_desired = c(4,4,1,2), stringsAsFactors = F)
# > desired_n
# grp top_n_desired
# 1 def 4
# 2 mid 4
# 3 goal 1
# 4 front 2
# group and arrange, than look up the desired amount of players in the referencetable and select them.
data_test %>% group_by(grp) %>% arrange(desc(growth)) %>%
slice(1:desired_n$top_n_desired[which(first(grp) == desired_n$grp)]) %>%
arrange(grp)
# A bit more readable, but you have to create an additional column in your dataframe
# create additional column with desired amount for the position written in grp of each player
data_test = merge(data_test, desired_n, by = "grp", all.x = T
)
data_test %>% group_by(grp) %>% arrange(desc(growth)) %>%
slice(1:first(top_n_desired)) %>%
arrange(grp)