Apply function within dplyr group - r

I'm trying to calculate the number of seats parties should have in municipal councils using a function from the electoral package, and have a long-format dataframe that is sorted according to municipalities.
However, I seem unable to get the function to work within the groups, and instead get the following error:
Error in seats_ha(parties = mandates$party, votes = mandates$votes, n_seats = 25, : every party name must be unique
I have tried both do() and group_map(), as suggested in this thread: Run a custom function on a data frame in R, by group. summarise() would not work, since the function is expected to return several rows of values rather than one summary value.
I have also tried using the dHondt()-function from the coalitions package, but to no avail, just a different error:
When using do:
Error: Results 1, 2 must be data frames, not integer
When using group_map:
Error: Can't convert an integer vector to function
Does anyone have an idea on how to solve this? :)
Some sample code:
library(tidyverse)
library(electoral)
mandates <- data.frame(municipality = c("A","A","A","B","B","B"),
party = c("1","2","3","1","2","3"),
votes = c(125,522,231,115,321,12),
seats = c(25,25,25,25,25,25))
mandates <- mandates %>% group_by(municipality) %>%
group_map(seats_ha(parties = mandates$party, votes = mandates$votes, n_seats = 25, method = "dhondt"))
Preferably I'd like it to use the seats variable for n_seats, since there are a different number of seats in each municipality, but getting it to work with 25 seats set is a good start.

You can simply use mutate() in this case:
mandates %>% group_by(municipality) %>%
mutate(x = seats_ha(parties = party, votes = votes, n_seats = 25, method = "dhondt"))
# A tibble: 6 x 5
# Groups: municipality [2]
municipality party votes seats x
<fct> <fct> <dbl> <dbl> <int>
1 A 1 125 25 3
2 A 2 522 25 15
3 A 3 231 25 7
4 B 1 115 25 6
5 B 2 321 25 19
6 B 3 12 25 0
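mutate() works whenever the function takes one or more vector arguments and returns a vector of the same length as its input. Refer to the columns by bare name (party, votes) rather than as mandates$party, so that the function only sees the rows of the current group.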
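To see why bare column names matter here, a small sketch compares what length() sees inside a grouped mutate(). The column names n_in_group and n_via_dollar are made up for illustration.

```r
library(dplyr)

mandates <- data.frame(
  municipality = c("A", "A", "A", "B", "B", "B"),
  party        = c("1", "2", "3", "1", "2", "3"),
  votes        = c(125, 522, 231, 115, 321, 12)
)

check <- mandates %>%
  group_by(municipality) %>%
  mutate(
    n_in_group   = length(party),          # bare name: only the current group's rows
    n_via_dollar = length(mandates$party)  # $ access: always the full column
  ) %>%
  ungroup()

check
```

With three parties per municipality, n_in_group is 3 in every row while n_via_dollar is 6. That is consistent with the "every party name must be unique" error in the question: the call using mandates$party handed all six rows to seats_ha() at once.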
If you want to use n_seats as well you could group with respect to municipality and seats (I would assume that the number of seats within each municipality is the same). Therefore:
mandates %>% group_by(municipality, seats) %>%
mutate(x = seats_ha(parties = party, votes = votes, n_seats = seats[1], method = "dhondt"))
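If the function you want to apply returns a data frame rather than a vector, group_modify() is one alternative. The sketch below uses a stand-in allocator, dhondt_seats(), written here only so the example runs without the electoral package; it is a hypothetical helper, not that package's implementation. It reads the seat count from the seats column while grouping by municipality alone.

```r
library(dplyr)

# Hypothetical helper (not from the electoral package): D'Hondt allocation,
# awarding one seat at a time to the party with the largest quotient.
dhondt_seats <- function(votes, n_seats) {
  seats <- integer(length(votes))
  for (i in seq_len(n_seats)) {
    quotients <- votes / (seats + 1)
    winner <- which.max(quotients)
    seats[winner] <- seats[winner] + 1L
  }
  seats
}

mandates <- data.frame(
  municipality = c("A", "A", "A", "B", "B", "B"),
  party        = c("1", "2", "3", "1", "2", "3"),
  votes        = c(125, 522, 231, 115, 321, 12),
  seats        = c(25, 25, 25, 25, 25, 25)
)

result <- mandates %>%
  group_by(municipality) %>%
  # .x holds the rows of the current group; the function returns a data frame
  group_modify(~ data.frame(
    party     = .x$party,
    seats_won = dhondt_seats(.x$votes, .x$seats[1])
  )) %>%
  ungroup()

result
```

For the sample data this gives 3, 15 and 7 seats in municipality A and 6, 19 and 0 in B, matching the output shown in the answer.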

Related

How to summarise unique categorical variable for duplicate rows

Applicant  Age  Date (MM/YY)  Result
Alex       10   10/21         Fail
Alex       10   11/21         Pass
Bryan      21   10/21         Fail
Howard     30   11/21         Pass
So essentially I am working with an exam database that records every exam attempt, so applicants who failed and decided to retake the exam appear more than once.
I have two main goals: (a) Summarise all applicants by Age and other demographics not included here. I have been able to do this by removing the duplicates and then summarising the new dataset.
(b) Show a flow of how many unique applicants took the exam once or more than once, and what the outcomes were. E.g., one applicant took it twice and passed on the second attempt; two applicants took it once, and half of them passed.
This is easy to do with the {tidyverse} packages. To visualize the summarized data, one option is a histogram with color by the final outcome. It's not very interesting with 3 data points, but with a larger dataset might be a good way to look at it.
Also this assumes that Applicant and Age are sufficient to uniquely identify each person. If not, you will need to do something else to uniquely identify them.
library(tidyverse)
d <- data.frame(
Applicant = c("Alex", "Alex", "Bryan", "Howard"),
Age = c(10, 10, 21, 30),
Date = c("10/21", "11/21", "10/21", "11/21"),
Result = c("Fail", "Pass", "Fail", "Pass")
)
e <- d %>%
group_by(Applicant, Age) %>%
summarize(attempts = n(),
final_result = if_else(any(str_detect(Result, "Pass")), "Pass", "Fail"),
.groups = "drop")
e
#> # A tibble: 3 × 4
#> Applicant Age attempts final_result
#> <chr> <dbl> <int> <chr>
#> 1 Alex 10 2 Pass
#> 2 Bryan 21 1 Fail
#> 3 Howard 30 1 Pass
# visualize with histogram by final outcome
e %>%
ggplot(aes(attempts, fill = final_result)) +
geom_histogram(position = "dodge", binwidth = 1)
Created on 2022-10-30 with reprex v2.0.2
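For the "flow" part of the question, one possible follow-up is to count applicants per combination of attempts and final outcome. This is a sketch on the same sample data; the column name applicants is an arbitrary choice.

```r
library(dplyr)

d <- data.frame(
  Applicant = c("Alex", "Alex", "Bryan", "Howard"),
  Age = c(10, 10, 21, 30),
  Date = c("10/21", "11/21", "10/21", "11/21"),
  Result = c("Fail", "Pass", "Fail", "Pass")
)

flow <- d %>%
  group_by(Applicant, Age) %>%
  summarise(
    attempts = n(),
    final_result = if_else(any(Result == "Pass"), "Pass", "Fail"),
    .groups = "drop"
  ) %>%
  # one row per (attempts, final_result) pair, with the number of applicants
  count(attempts, final_result, name = "applicants")

flow
```

On the sample data this yields three rows: one applicant with a single failed attempt, one with a single passed attempt, and one who passed after two attempts.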

Calculate mean and sd of a variable(salary) depending another variable(JobSatisfaction)

I have two columns on the data set and I know I have to use the functions ddply and summarise but I do not know how to start.
Hopefully this will get you started:
data %>%
group_by(Satisfaction) %>%
summarise(Mean = mean(Salary),
SD = sd(Salary))
# A tibble: 7 x 3
Satisfaction Mean SD
<int> <dbl> <dbl>
1 1 12481. 1437.
2 2 31965. 5235.
3 3 45844. 7631.
4 4 69052. 9257.
5 5 79555. 12975.
6 6 100557. 13739.
7 7 111414. 19139.
First, use the group_by verb to group the data by the variable you are interested in. Then, as you alluded to, use the summarise verb to apply a function to each group. You can compute several summaries at once by separating the new columns with commas.
Recall that the %>% pipe operator directs the output of one function to the next as the first argument.
Example data:
set.seed(3)
data <- data.frame(Salary = sapply(rep(1:7,each = 10), function(x){floor(runif(1,x*10000,x*20000))}),
Satisfaction = rep(1:7,each = 10))
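If the real Salary column contains missing values, mean() and sd() return NA for the affected groups unless na.rm = TRUE is set. A minimal sketch on made-up data (the data frame toy and its values are hypothetical):

```r
library(dplyr)

# Made-up data with one missing salary, for illustration only
toy <- data.frame(
  Satisfaction = c(1, 1, 1, 2, 2, 2),
  Salary       = c(10, 20, NA, 30, 50, 40)
)

toy_summary <- toy %>%
  group_by(Satisfaction) %>%
  summarise(
    N    = sum(!is.na(Salary)),          # non-missing observations per group
    Mean = mean(Salary, na.rm = TRUE),
    SD   = sd(Salary, na.rm = TRUE)
  )

toy_summary
```

Reporting N alongside the mean and SD makes it visible how many values each summary is actually based on.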

R Beginner struggling with extremely messy XLSX

I got an XLSX with data from a questionnaire for my master thesis.
The questions and answers for an interviewee are in one row in the second column. The first column contains the date.
The data of the second column comes in a form like this:
"age":"52","height":"170","Gender":"Female",...and so on
I started with:
test12 <- read_xlsx("Testdaten.xlsx")
library(splitstackshape)
test13 <- concat.split(data = test12, split.col= "age", sep =",")
Then I got the questions and the answers as columns, with each question and answer separated by a ":".
E.g. column 1: "age":"52" and column 2: "height":"170".
But the data is so messy that the age column sometimes contains a height question and answer, and in some questionnaires questions and answers appear twice.
I need the questions as variables and the answers as observations, but I have no clue how to get there. I could clean the data in Excel first, but because the columns are not consistent (e.g. some height questions end up in the age column) I see no way to do that, especially as I will regularly get new data formatted the same way.
Here is an example of the data:
A tibble: 5 x 2
partner.createdAt partner.wphg.info
<chr> <chr>
1 2019-11-09T12:13:11.099Z "{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\""
2 2019-11-01T06:43:22.581Z "{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\""
3 2019-11-10T07:59:46.136Z "{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\""
4 2019-11-11T13:01:48.488Z "{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000~
5 2019-11-08T14:54:26.654Z "{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\""
Thank you so much for your time!
You can loop through each entry, splitting at "," as you did, and then loop through the pieces again, splitting at ":".
The result is a set of variable/value pairs, which can all be stacked in long format. Then you just pivot back into columns.
data
Updated the data based on your edit.
data <- tribble(~partner.createdAt, ~partner.wphg.info,
'2019-11-09T12:13:11.099Z', '{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\"',
'2019-11-01T06:43:22.581Z', '{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\"',
'2019-11-10T07:59:46.136Z', '{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\"',
'2019-11-11T13:01:48.488Z', '{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000\"',
'2019-11-08T14:54:26.654Z', '{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\"')
libraries
We need a few here. Or you can just call tidyverse.
library(stringr)
library(purrr)
library(dplyr)
library(tibble)
library(tidyr)
function
This function creates a data frame (tibble) for each record, with one row per question. The first column is the date, the second is the variable, and the third is the value.
clean_record <- function(date, text) {
clean_records <- str_split(text, pattern = ",", simplify = TRUE) %>%
str_remove_all(pattern = "\\\"") %>% # remove double quote
str_remove_all(pattern = "\\{|\\}") %>% # remove curly brackets
str_split(pattern = ":", simplify = TRUE)
tibble(date = as.Date(date), variable = clean_records[,1], value = clean_records[,2])
}
iteration
Now we use pmap_dfr from purrr to loop over the rows, outputting each row with an id variable named record.
This will stack the data as described in the function. The mutate() line converts all variable names to lowercase. The distinct() line will filter out rows that are exact duplicates.
What we do then is just pivot on the variable column. Of course, replace data with whatever you name your data frame.
data_clean <- pmap_dfr(data, ~ clean_record(..1, ..2), .id = "record") %>%
mutate(variable = tolower(variable)) %>%
distinct() %>%
pivot_wider(names_from = variable, values_from = value)
result
The result looks something like this. Note that I reordered some of the fields in the input, and it still works. You are probably not done yet: all columns are now of type character, so you need to decide on the desired type for each one and convert it.
# A tibble: 5 x 10
record date age_years job_des height_cm gender born_in alcoholic knowledge_selfass total_wealth
<chr> <date> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 2019-11-09 50 unemployed 170 female Italy false 5 200000
2 2 2019-11-01 34 self-employed 158 male Germany true 3 10000
3 3 2019-11-10 24 NA 187 male England false 3 150000
4 4 2019-11-11 59 employed 167 female United States false 2 1000000
5 5 2019-11-08 36 employed 180 male Germany false 5 170000
For example, convert age_years to numeric.
data_clean %>%
mutate(age_years = as.numeric(age_years))
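Several columns can be converted in one step with across(). The sketch below uses a small hypothetical stand-in for data_clean (data_clean_demo) and assumes that the numeric fields contain only digits and that alcoholic only takes the values "true" and "false".

```r
library(dplyr)

# Hypothetical stand-in for data_clean: every column is character
data_clean_demo <- tibble(
  age_years = c("50", "34"),
  height_cm = c("170", "158"),
  alcoholic = c("false", "true"),
  born_in   = c("Italy", "Germany")
)

data_typed <- data_clean_demo %>%
  mutate(
    across(c(age_years, height_cm), as.numeric),  # convert numeric-looking columns
    alcoholic = alcoholic == "true"               # character flag to logical
  )

data_typed
```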
I am sure you may run into other things, but this should be a start.
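Since each entry looks like a JSON object, a JSON parser is another option. This is a different technique from the splitting approach above, and it rests on an assumption: that the real strings are complete JSON, including the closing brace that the sample rows lack. The column names created and info are shortened placeholders, and the two records are abbreviated versions of the sample rows.

```r
library(jsonlite)
library(dplyr)

raw <- tibble(
  created = c("2019-11-09T12:13:11.099Z", "2019-11-10T07:59:46.136Z"),
  info = c(
    '{"age_years":"50","job_des":"unemployed","Gender":"female"}',
    '{"age_years":"24","Gender":"male"}'
  )
)

# Parse each string into a one-row data frame; bind_rows() fills
# fields that are missing from a record with NA.
parsed <- bind_rows(lapply(raw$info, function(x) {
  as.data.frame(fromJSON(x), stringsAsFactors = FALSE)
}))

result <- bind_cols(select(raw, created), parsed)
result
```

Because the parser matches fields by name, the order of the fields within a record does not matter, and a field missing from one record simply becomes NA.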

How do I sort an as.data.frame table

To get the frequency of each value within a column, I used the following code:
s = table(students$Sport)
t = as.data.frame(s)
names(t)[1] = 'Sport'
t
Although this works, it gives me a massive list that is not sorted, such as this:
1 Football 20310
2 Rugby 80302
3 Tennis 5123
4 Swimming 73132
… … …
68 Basketball 90391
How would I go about sorting this table, so that the most frequent sport is at the top. Also, is there a way to only display the top 5 options? Rather than all 68 different sports?
Or, alternatively, if there's a better way to approach this.
Any help would be appreciated!
You can use dplyr and do it all in a single pipeline. An example:
library(dplyr)
students = data.frame(sport = c(rep("Football", 200),
rep("Rugby", 130),
rep("Tennis", 100),
rep("Swimming", 40),
rep("Basketball", 10),
rep("Baseball", 300),
rep("Gymnastics", 70)
)
)
students %>% group_by(sport) %>% summarise( n = length(sport)) %>% arrange(desc(n)) %>% top_n(5, n)
# A tibble: 5 x 2
sport n
<fct> <int>
1 Baseball 300
2 Football 200
3 Rugby 130
4 Tennis 100
5 Gymnastics 70
You can use the plyr package's count() function to count the values and their frequencies. It is a more elegant way of doing it than building a table and converting it to a data frame.
library(plyr)
d <- count(students, "Sport")  # students must be a data frame
order() sorts the output; the - sign makes it sort in descending order, and [1:5] keeps the top 5 rows. You can remove [1:5] if you want all entries.
d[order(-d$freq)[1:5],]
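The same idea also works without any extra packages, staying close to the table() / as.data.frame() code in the question. A minimal base R sketch on made-up data (the counts below are invented for illustration):

```r
# Made-up data: six sports with different frequencies
students <- data.frame(
  Sport = c(rep("Football", 200), rep("Rugby", 130), rep("Tennis", 100),
            rep("Swimming", 40), rep("Basketball", 10), rep("Baseball", 300))
)

# Frequency table as a data frame, as in the question
freq <- as.data.frame(table(students$Sport), stringsAsFactors = FALSE)
names(freq) <- c("Sport", "Freq")

# Sort by descending frequency and keep the first five rows
top5 <- head(freq[order(-freq$Freq), ], 5)
top5
```

head() is used instead of [1:5] so that a table with fewer than five categories does not produce NA rows.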

Programmatically create new variables that are sums of nested series of other variables

I have data giving me the percentage of people in some groups who have various levels of educational attainment:
df <- tibble(group = c("A", "B"),
no.highschool = c(20, 10),
high.school = c(70,40),
college = c(10, 40),
graduate = c(0,10))
df
# A tibble: 2 x 5
group no.highschool high.school college graduate
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 20. 70. 10. 0.
2 B 10. 40. 40. 10.
E.g., in group A 70% of people have a high school education.
I want to generate 4 variables that give me the proportion of people in each group with less than each of the 4 levels of education (e.g., lessthan_no.highschool, lessthan_high.school, etc.).
desired df would be:
desired.df <- data.frame(group = c("A", "B"),
no.highschool = c(20, 10),
high.school = c(70,40),
college = c(10, 40),
graduate = c(0,10),
lessthan_no.highschool = c(0,0),
lessthan_high.school = c(20, 10),
lessthan_college = c(90, 50),
lessthan_graduate = c(100, 90))
In my actual data I have many groups and a lot more levels of education. Of course I could do this one variable at a time, but how could I do this programmatically (and elegantly) using tidyverse tools?
I would start by doing something like a mutate_at() inside of a map(), but where I get tripped up is that the list of variables being summed is different for each of the new variables. You could pass in the list of new variables and their corresponding variables to be summed as two lists to a pmap(), but it's not obvious how to generate that second list concisely. Wondering if there's some kind of nesting solution...
Here is a base R solution. Though the question asks for a tidyverse one, considering the dialog in the comments to the question I have decided to post it.
It uses apply and cumsum to do the hard work. Then there are some cosmetic concerns before cbinding into the final result.
tmp <- apply(df[-1], 1, function(x){
s <- cumsum(x)
100*c(0, s[-length(s)])/sum(x)
})
rownames(tmp) <- paste("lessthan", names(df)[-1], sep = "_")
desired.df <- cbind(df, t(tmp))
desired.df
# group no.highschool high.school college graduate lessthan_no.highschool
#1 A 20 70 10 0 0
#2 B 10 40 40 10 0
# lessthan_high.school lessthan_college lessthan_graduate
#1 20 90 100
#2 10 50 90
how could I do this programmatically (and elegantly) using tidyverse tools?
Definitely the first step is to tidy your data. Encoding information (like edu level) in column names is not tidy. When you convert education to a factor, make sure the levels are in the correct order - I used the order in which they appeared in the original data column names.
library(dplyr)
library(tidyr)
tidy_result = df %>% gather(key = "education", value = "n", -group) %>%
mutate(education = factor(education, levels = names(df)[-1])) %>%
group_by(group) %>%
mutate(lessthan_x = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
arrange(group, education)
tidy_result
# # A tibble: 8 x 4
# # Groups: group [2]
# group education n lessthan_x
# <chr> <fct> <dbl> <dbl>
# 1 A no.highschool 20 0
# 2 A high.school 70 20
# 3 A college 10 90
# 4 A graduate 0 100
# 5 B no.highschool 10 0
# 6 B high.school 40 10
# 7 B college 40 50
# 8 B graduate 10 90
This gives us a nice, tidy result. If you want to spread/cast this data into your un-tidy desired.df format, I would recommend using data.table::dcast, as (to my knowledge) the tidyverse does not offer a nice way to spread multiple columns. See Spreading multiple columns with tidyr or How can I spread repeated measures of multiple variables into wide format? for the data.table solution or an inelegant tidyr/dplyr version. Before spreading, you could create a key less_than_x_key = paste("lessthan", education, sep = "_").
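Newer tidyr releases (1.0.0 and later) provide pivot_wider(), which accepts more than one column in values_from, so the wide layout can be produced without leaving the tidyverse. A sketch under that assumption follows; pivot_longer() stands in for gather(), and the n_ and lessthan_ prefixes come from pivot_wider()'s default naming, so the original percentage columns end up with an n_ prefix rather than their original names.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  group         = c("A", "B"),
  no.highschool = c(20, 10),
  high.school   = c(70, 40),
  college       = c(10, 40),
  graduate      = c(0, 10)
)

wide <- df %>%
  # long format keeps the education levels in their original column order
  pivot_longer(-group, names_to = "education", values_to = "n") %>%
  group_by(group) %>%
  # share of the group strictly below each education level
  mutate(lessthan = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
  ungroup() %>%
  # spread both value columns at once
  pivot_wider(names_from = education, values_from = c(n, lessthan))

wide
```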
