Newbie trouble building script for a report - r

I am a newbie to coding and to R. I have been trying to solve a problem for a report that I am drawing up and have hit a wall.
I have spent the last two days trying to find a workable answer and am now at my wit's end.
I have a data frame of student results. The columns are as follows
Student Number
Academic Year eg 2014, 2015 etc
Academic Semester eg Jan or June
Qualification eg Qual1, Qual2 etc
Modules eg Subject1, Subject2 etc. The issue here is that Subject1 may be in Qual1 and Qual2 but Subject2 may only be in Qual1
Result. This is either "Passed" or "FAILED"
I am trying to create a summary/list showing the percentage passed for each module where students were active. Something like this
Year Semester Qualification Module PassRate
2014 Jan Qual1 Subject1 62.54%
2014 Jan Qual1 Subject2 72.81%
.
.
.
2014 July Qual1 Subject1 69.51%
.
.
2014 Jan Qual2 Subject1 42.86%
2014 Jan Qual2 Subject3 55.95%
etc.
I thought that perhaps an IF statement might work, but that seems way too cumbersome. I also looked at a for-each loop, and at combining the two, but I can't figure out how to get it to work. I have tried aggregate, count =, cbind and anything else that I could find from my good friend Google.
I have the following code
AcademicYears <- as.character(unique(unlist(HE_Stats$Year)))
AcademicYears_count <- NROW(AcademicYears)
AcademicSemesters <- as.character(unique(unlist(HE_Stats$ActualSemester)))
AcademicSemesters_count <- NROW(AcademicSemesters)
Qualifications <- as.character(unique(unlist(HE_Stats$Qualification)))
Qualifications_count <- NROW(Qualifications)
Modules <- as.character(unique(unlist(HE_Stats$ModuleCode)))
Modules_count <- NROW(Modules)
df <- HE_Stats %>%
group_by(Year,ActualSemester,Qualification, ModuleCode) %>%
aggregate(cbind(count = AcademicSemesters) ~ AcademicYears,
data = HE_Stats,
FUN = function(AcademicSemesters){NROW(AcademicSemesters)})
The result is that it shows me only one semester per year. My latest plan is to build the matrix column by column.

If you could supply sample data, I would be able to give you a better answer. But say that your data looked something like this (this solution uses the dplyr package):
library(dplyr)
data <- tibble(student_number = c(1, 2, 3, 4, 5, 6),
academic_year = c(2014, 2014, 2014, 2015, 2015, 2015),
semester = c("jan", "jan", "jan","jan", "june", "june"),
qualification = c("qual1", "qual2", "qual1", "qual1", "qual2",
"qual2"),
module = c("subject1", "subject1", "subject1", "subject1",
"subject2", "subject2"),
result = c("passed", "failed", "passed", "passed", "passed",
"failed"))
# A tibble: 6 x 6
student_number academic_year semester qualification module result
<dbl> <dbl> <chr> <chr> <chr> <chr>
1 1 2014 jan qual1 subject1 passed
2 2 2014 jan qual2 subject1 failed
3 3 2014 jan qual1 subject1 passed
4 4 2015 jan qual1 subject1 passed
5 5 2015 june qual2 subject2 passed
6 6 2015 june qual2 subject2 failed
First I would make a logical vector recording whether each student passed:
data <- data %>%
  # the comparison already returns TRUE/FALSE, so no ifelse() is needed
  mutate(pass = result == "passed")
Then summarise the grouped data:
data %>%
group_by(academic_year, semester, qualification, module) %>%
summarise(
pass_rate = (sum(pass)/n())*100
)
To produce:
academic_year semester qualification module pass_rate
<dbl> <chr> <chr> <chr> <dbl>
1 2014 jan qual1 subject1 100
2 2014 jan qual2 subject1 0
3 2015 jan qual1 subject1 100
4 2015 june qual2 subject2 50
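For what it's worth, the same calculation can also be sketched in base R with aggregate(), which the question was already experimenting with. This is only a sketch under assumptions: Year, ActualSemester, Qualification and ModuleCode come from the question's code, but the Result column name and the toy rows are made up for illustration.

```r
# Toy stand-in for HE_Stats; the Result column name and all rows are assumptions
HE_Stats <- data.frame(
  Year           = c(2014, 2014, 2014, 2014),
  ActualSemester = c("Jan", "Jan", "Jan", "June"),
  Qualification  = c("Qual1", "Qual1", "Qual2", "Qual1"),
  ModuleCode     = c("Subject1", "Subject1", "Subject1", "Subject1"),
  Result         = c("Passed", "FAILED", "Passed", "FAILED")
)

# Compare case-insensitively, since the question mixes "Passed" and "FAILED"
HE_Stats$Pass <- toupper(HE_Stats$Result) == "PASSED"

# The mean of a logical vector is the proportion of TRUE values in each group
pass_rates <- aggregate(
  Pass ~ Year + ActualSemester + Qualification + ModuleCode,
  data = HE_Stats,
  FUN  = mean
)

# Format the proportion as a percentage string such as "62.54%"
pass_rates$PassRate <- sprintf("%.2f%%", 100 * pass_rates$Pass)
pass_rates
```

Only combinations that actually occur in the data get a row, so a module that exists in Qual1 but not in Qual2 simply does not appear under Qual2.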


In 0:(b - 1) : numerical expression has 6 elements: only the first used

I've been working on a project which includes time series. My issue is that I don't have information for each year, only for each period. I basically want to duplicate each row for as long as the period lasts: for an n-year period, I want to create (n-1) new rows with exactly the same information. So far, so good.
stack = data.frame(c("ville","commune","université","pole emploi", "ministère","collège"),
c(2014,2015,2016,2014,2015,2014),
c(5,3,2,6,4,1))
colnames(stack) = c("benefit recipient","beginning year", "length of the period")
b = stack$`length of the period`
stack2 = stack[rep(rownames(stack),b),]
Now what I want to do is turn the beginning year into the effective year, adding one year per row within each period. To visualise it, here is some code where I do it manually (plus a screenshot of what I have and what I want in my real project).
stack3 = data.frame(c("ville","ville","ville","ville","ville","commune","commune","commune","université","université","pole emploi","pole emploi","pole emploi","pole emploi","pole emploi","pole emploi", "ministère","ministère","ministère","ministère","collège"),
c(2014,2015,2016,2017,2018,2015,2016,2017,2016,2017,2014,2015,2016,2017,2018,2019,2015,2016,2017,2018,2014),
c(5,5,5,5,5,3,3,3,2,2,6,6,6,6,6,6,4,4,4,4,1))
colnames(stack3) = c("benefit recipient","effective year", "length of the period")
So far, my idea was to split each period into its years and add the values of this new vector to my table. I tried:
c = c(0:(b-1))
But it didn't work; I get the message In 0:(b - 1) : numerical expression has 6 elements: only the first used. It's a shame, because it did exactly what I wanted, but only for the first element...
Do you have any idea how I can solve it?
Thanks a lot for your time!
Solution using lapply() and seq():
b = stack$`beginning year`
c = stack$`length of the period`
# repeat each row once per year of its period (the length, not the beginning year)
stack2 = stack[rep(rownames(stack),c),]
stack2$`beginning year` = unlist(lapply(seq_along(b), function(x) seq(b[x], b[x]+c[x]-1, by=1)))
We can use rowwise
library(dplyr)
library(tidyr)
stack %>%
rowwise %>%
mutate(year = list(`beginning year`:(`beginning year` +
`length of the period` - 1))) %>%
unnest(year)
You can use map2 to create sequence and unnest to create new rows.
library(tidyverse)
stack %>%
mutate(year = map2(`beginning year`, `beginning year` + `length of the period` - 1, seq)) %>%
unnest(year)
# `benefit recipient` `beginning year` `length of the period` year
# <chr> <dbl> <dbl> <int>
# 1 ville 2014 5 2014
# 2 ville 2014 5 2015
# 3 ville 2014 5 2016
# 4 ville 2014 5 2017
# 5 ville 2014 5 2018
# 6 commune 2015 3 2015
# 7 commune 2015 3 2016
# 8 commune 2015 3 2017
# 9 université 2016 2 2016
#10 université 2016 2 2017
# … with 11 more rows
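As for the warning itself: the `:` operator builds a single sequence and only looks at the first element of each operand, which is why 0:(b - 1) worked for the first row only. A base R sketch of a vectorised alternative is sequence(), which concatenates 1:n for every n in a vector. The short column names and the three rows below are simplified stand-ins for the question's data, not the original table.

```r
# Simplified stand-in for the question's data (column names shortened here)
stack <- data.frame(
  recipient = c("ville", "commune", "pole emploi"),
  start     = c(2014, 2015, 2014),
  len       = c(5, 3, 6)
)

# Repeat each row once per year of its period
expanded <- stack[rep(seq_len(nrow(stack)), stack$len), ]

# sequence(c(5, 3, 6)) is 1:5 followed by 1:3 followed by 1:6,
# so this adds the offsets 0, 1, 2, ... within each period to the start year
expanded$year <- expanded$start + sequence(stack$len) - 1
expanded
```

This stays in base R and avoids the explicit lapply(), at the cost of relying on the rows of expanded being in the same order as the periods in stack.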

Better ways to combine Year and Month into Date object using mapply and lubridate

(I actually came up with a solution but that didn't satisfy my desire for simplicity and intuitiveness, therefore here I state my question and solution while waiting for a nice and neat solution.)
I have a data frame with one column for Year and another for Month, where the month is stored as a string:
Country Month Year Type
<fct> <chr> <dbl> <fct>
1 Argentina June 1975 Currency
2 Argentina February 1981 Currency
3 Argentina July 1982 Currency
I am trying to combine the Month and Year columns into a single Date column of class Date.
First Try
My first try was to use mapply, with the help of lubridate and a little function of mine that transforms the month from string to int.
months = c("January", "February", "March", "April", 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December')
month_num = c(1:12)
names(month_num) = months
crisis$Date = mapply(function(y, m){
m = month_num[m]
d = make_date(y,m)
return(d)
},crisis$Year, crisis$Month)
However this didn't turn out to be what I want:
Country Month Year Type Date
<fct> <chr> <dbl> <fct> <list>
1 Argentina June 1975 Currency <date [1]>
2 Argentina February 1981 Currency <date [1]>
3 Argentina July 1982 Currency <date [1]>
4 Argentina September 1986 Currency <date [1]>
since the Date column is a list column rather than a date column.
Some Googling
With some help from this post and some manipulation on unlisting it and turning it back to date object, I managed to get the result I want:
crisis$Date = as_date(unlist(mapply(function(y, m){
m = month_num[m]
d = make_date(y,m)
return(d)
},crisis$Year, crisis$Month, SIMPLIFY = FALSE)))
The result is
Country Month Year Type Date
<fct> <chr> <dbl> <fct> <date>
1 Argentina June 1975 Currency 1975-06-01
2 Argentina February 1981 Currency 1981-02-01
3 Argentina July 1982 Currency 1982-07-01
4 Argentina September 1986 Currency 1986-09-01
This is so far fine to deal with, but I believe there are better solutions.
You can convert month to a number, and then from there to a date:
library(dplyr)
df %>%
  mutate(
    # month.name is base R's vector of English month names
    Month = match(Month, month.name),
    Date = as.Date(paste(Year, Month, "01", sep = "-"))
  ) %>%
  select(-c(Month, Year))
# A tibble: 3 x 3
# Country Type Date
# <chr> <chr> <date>
# 1 Argentina Currency 1975-06-01
# 2 Argentina Currency 1981-02-01
# 3 Argentina Currency 1982-07-01
Does this help?
I provided the dataframe below:
library(tibble)
df <- tibble(
Country = 'Argentina',
Month = c('June', 'February', 'July'),
Year = c(1975, 1981, 1982),
Type = 'Currency'
)
Alternatively, lubridate can parse the month name and year directly:
df$Date <- lubridate::myd(paste(df$Month, df$Year, "1"))
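Another option, given that the question already uses make_date(): that function is vectorised, so it can probably be called on the whole columns at once without mapply. A minimal sketch, assuming the month names are spelled exactly as in base R's month.name; the small crisis data frame here is a made-up stand-in for the real one.

```r
library(lubridate)

# Made-up stand-in for the question's crisis data
crisis <- data.frame(
  Country = "Argentina",
  Month   = c("June", "February", "July"),
  Year    = c(1975, 1981, 1982)
)

# match() turns each month name into its position in month.name (1 to 12);
# make_date() then builds the whole Date vector in one call, with day defaulting to 1
crisis$Date <- make_date(crisis$Year, match(crisis$Month, month.name))
crisis
```

Because no element-by-element function is involved, the result keeps its Date class and there is nothing to unlist.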
So after the help from @Gram and @det, I came up with my own solution.
I am a new learner in R, so I wasn't familiar with the R-ish style of handling data and tried to do everything in one single line of code. Thanks to some tips from Gram's answer, I learned to tidy my code by adding auxiliary columns instead (which is similar to Excel).
Since there might be situations in the future where the correspondence is not simply 1:12 to months, and to keep things general for later use, I create a new data.frame just to store all the information about months:
month_ref = data.frame(num = 1:12, Month = c("January", "February", "March", "April", 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'))
num Month
1 1 January
2 2 February
3 3 March
4 4 April
Now the idea is to "combine" the two data frames, matching the Month column to its number. This is exactly like the VLOOKUP function in Excel, and with help from this post, I now have a data frame with a column of numbers:
# preview the join without overwriting crisis; the full pipeline below joins again
crisis %>%
  inner_join(month_ref, by = "Month")
Country Month Year Type num
<fct> <chr> <dbl> <fct> <int>
1 Argentina June 1975 Currency 6
2 Argentina February 1981 Currency 2
3 Argentina July 1982 Currency 7
4 Argentina September 1986 Currency 9
I can then work with a neat numeric month column, which is much easier and more readable than parsing the month in a custom function inside mutate().
crisis = crisis %>%
inner_join(month_ref, by="Month") %>%
mutate(
Date = lubridate::ymd(paste(Year, num, "01", sep="-"))
) %>%
select(-c(num, Month, Year))
Country Type Date
<fct> <fct> <date>
1 Argentina Currency 1975-06-01
2 Argentina Currency 1981-02-01
3 Argentina Currency 1982-07-01

R include rows conditioned to other variables with `add_row`

I have a data.frame like test. It corresponds to information associated with a registry of firms. year.entry reflects the time period when a firm gets into the registry. items are elements that represent capacity and remain fixed throughout time. It may happen that the firm increases its capacity in a particular year. My aim is to present this information longitudinally.
To do that, I would ideally add rows for the years that are missing between 2010 and 2015. I have tried this with add_row() from tibble, but I am having difficulty making it work.
> test %>% add_row(firm = firm, year.entry == (year.entry)+1, item = item, .before = row_number(year.entry) == n())
Error in eval(expr, envir, enclos) : object 'firm' not found
I wonder whether there is an easier way to solve this problem. The ideal data frame should look like this:
firm year.entry item
<chr> <chr> <int>
1 1-102642692 2010 15
2 1-102642692 2011 15
3 1-102642692 2012 15
4 1-102642692 2013 15
5 1-102642692 2014 15
6 1-102642692 2015 8
test is given by:
test = data.frame(firm = c("1-102642692", "1-102642692"), year.entry = c(2010, 2015), item =c(15,8))
I add a dummy firm that covers every year of the period of interest, so that complete() can give each real firm a row for every year; the missing years are thereby added to the data frame. Then I carry the last observation forward with na.locf(), and finally I remove the dummy firm.
library(dplyr)  # %>%, arrange, group_by, mutate, filter
library(tidyr)  # complete
library(zoo)    # na.locf
comp <- data.frame(firm="test", year.entry= (2009:2016), item=0)
test = data.frame(firm = c("1-102642692", "1-102642692"), year.entry = c(2010, 2015), item =c(15,8))
rbind(test,comp) %>%
  complete(firm,year.entry) %>%
  arrange(firm, year.entry)%>%
  group_by(firm) %>%
  mutate(item = na.locf(item, na.rm=FALSE)) %>%
  filter(firm !="test")
result:
firm year.entry item
<fctr> <dbl> <dbl>
1-102642692 2009 NA
1-102642692 2010 15
1-102642692 2011 15
1-102642692 2012 15
1-102642692 2013 15
1-102642692 2014 15
1-102642692 2015 8
1-102642692 2016 8
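If only the years between each firm's first and last entry are wanted (2010 to 2015 in the question), a variation without the dummy firm might look like the sketch below. It assumes tidyr's full_seq() and fill() in place of the dummy rows and zoo::na.locf(), so it is a swapped-in technique rather than the method above.

```r
library(dplyr)
library(tidyr)

test <- data.frame(
  firm       = c("1-102642692", "1-102642692"),
  year.entry = c(2010, 2015),
  item       = c(15, 8)
)

filled <- test %>%
  group_by(firm) %>%
  # full_seq() spans each firm's own first to last year in steps of 1
  complete(year.entry = full_seq(year.entry, 1)) %>%
  # fill() carries the last non-missing item downwards within each firm
  fill(item) %>%
  ungroup()
filled
```

Because the year range is computed per firm, firms that enter the registry in different years each get only their own span, with no rows before the first entry.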

How do I update data from an incomplete lookup table?

I have a table that uses unique IDs but inconsistent readable names for those IDs. It is more complex than month names, but for the sake of a more simple example, let's say it looks something like this:
demo_frame <- read.table(text=" Month_id Month_name Number
1 Jan 37
2 Feb 63
3 March 9
3 Mar 150
2 February 49", header=TRUE)
Except that they might have spelled "Feb" or "March" eight different ways. I also have a clean data frame that contains consistent names for the names that have variations:
month_lookup <- read.table(text=" Month_id Month_name
2 Feb
3 Mar", header=TRUE)
I want to get to this:
1 Jan 37
2 Feb 63
3 Mar 9
3 Mar 150
2 Feb 49
I tried merge(month_lookup, demo_frame, by = "Month_id") but that dropped all the January values because "Jan" doesn't exist in the lookup table:
Month_id Month_name.x Month_name.y Number
1 2 Feb Feb 63
2 2 Feb February 49
3 3 Mar March 9
4 3 Mar Mar 150
My read of How to replace data.frame column names with string in corresponding lookup table in R is that I ought to be able to use plyr::mapvalues but I'm unclear from examples and documentation on how I'd map the id to the name. I don't just want to say "Replace 'March' with 'Mar'" -- I need to say SET month_name = 'Mar' WHERE month_id = 3 for each value in lookup.
I think you want this.
library(dplyr)
demo_frame <- read.table(text=" Month_id Month_name Number
1 Jan 37
2 Feb 63
3 March 9
3 Mar 150
2 February 49", header=TRUE, stringsAsFactors = FALSE)
month_lookup <- read.table(text=" Month_id Month_name
2 Feb
3 Mar", header=TRUE, stringsAsFactors = FALSE)
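result =
  demo_frame %>%
  rename(bad_month = Month_name) %>%
  # join on the ID explicitly; IDs missing from the lookup get NA for Month_name
  left_join(month_lookup, by = "Month_id") %>%
  # keep the original name wherever the lookup has no entry
  mutate(month_fix = ifelse(is.na(Month_name), bad_month, Month_name))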
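The "SET month_name = 'Mar' WHERE month_id = 3" idea from the question can also be sketched in base R with match(), with no join at all. This assumes both name columns are character vectors rather than factors, which is why the toy data frames below are built with data.frame() instead of the question's read.table() calls.

```r
demo_frame <- data.frame(
  Month_id   = c(1, 2, 3, 3, 2),
  Month_name = c("Jan", "Feb", "March", "Mar", "February"),
  Number     = c(37, 63, 9, 150, 49)
)
month_lookup <- data.frame(
  Month_id   = c(2, 3),
  Month_name = c("Feb", "Mar")
)

# Position of each row's ID in the lookup table; NA when the ID is not listed
idx <- match(demo_frame$Month_id, month_lookup$Month_id)

# Overwrite the name only where the lookup has an entry for that ID
demo_frame$Month_name <- ifelse(
  is.na(idx),
  demo_frame$Month_name,
  month_lookup$Month_name[idx]
)
demo_frame
```

Rows whose ID is missing from the lookup (January here) keep whatever name they already had, which is the behaviour the merge() attempt lost.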

return final row of dataframe - recurring variable names

I want to return the final row for each subsection of a dataframe. I'm aware of the ddply and aggregate functions, but they are not giving the expected output in this case, as the column by which I split the data has recurring names.
For example, in df:
year <- rep(c(2011, 2012, 2013), each=12)
season <- rep(c("Spring", "Summer", "Autumn", "Winter"), each=3)
allseason <- rep(season, 3)
temp <- rnorm(36, mean = 61, sd = 10)
df <- data.frame(year, allseason, temp)
I want to return the final temp reading at the end of every season. When I run either
final1 <- aggregate(df, list(df$allseason), tail, 1)
or
final2 <- ddply(df, .(allseason), tail, 1)
I get only the final 4 seasons (i.e. those of 2013). The function seems to stop there and does not go back to previous years/seasons. My intended output is a data frame with 12 rows * 3 columns.
All help appreciated!
*I notice that in the df created here, the allseasons column is designated as a factor with 4 levels, whereas this is not the case in my original dataframe.
In your ddply code, you just forgot to also group by year:
With plyr:
library(plyr)
ddply(df, .(year, allseason), tail, 1)
Or with dplyr
library(dplyr)
df %>%
group_by(year, allseason) %>%
do(tail(.,1))
Or if you want a base R alternative you can use ave:
df[with(df, ave(year, list(year, allseason), FUN = seq_along)) == 3,]
Result:
# year allseason temp
#1 2011 Autumn 63.40626
#2 2011 Spring 59.69441
#3 2011 Summer 42.33252
#4 2011 Winter 79.10926
#5 2012 Autumn 63.14974
#6 2012 Spring 60.32811
#7 2012 Summer 67.57364
#8 2012 Winter 61.39100
#9 2013 Autumn 50.30501
#10 2013 Spring 61.43044
#11 2013 Summer 55.16605
#12 2013 Winter 69.37070
Note that the output will contain the same rows in each case, only the ordering may differ.
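One caveat on the ave() version: the == 3 hard-codes three readings per season. A base R sketch that does not depend on the group size uses duplicated() with fromLast = TRUE; the seed below is arbitrary and is only there to make the toy data reproducible.

```r
set.seed(1)  # arbitrary seed so the toy data is reproducible
year      <- rep(c(2011, 2012, 2013), each = 12)
allseason <- rep(rep(c("Spring", "Summer", "Autumn", "Winter"), each = 3), 3)
temp      <- rnorm(36, mean = 61, sd = 10)
df        <- data.frame(year, allseason, temp)

# duplicated(..., fromLast = TRUE) flags every row that has a later row
# with the same year/season, so negating it keeps the last row per group
final <- df[!duplicated(df[c("year", "allseason")], fromLast = TRUE), ]
final
```

This keeps the rows in their original order and works even if some seasons have more or fewer than three readings.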
And just to add to @beginneR's answer, your aggregate solution should look like:
aggregate(temp ~ allseason + year, data = df, tail, 1)
# or:
with(df, aggregate(temp, list(allseason, year), tail, 1))
Result:
allseason year temp
1 Autumn 2011 64.51539
2 Spring 2011 45.14341
3 Summer 2011 62.29240
4 Winter 2011 47.97461
5 Autumn 2012 43.16781
6 Spring 2012 80.02419
7 Summer 2012 72.31149
8 Winter 2012 45.58344
9 Autumn 2013 55.92607
10 Spring 2013 52.06778
11 Summer 2013 51.01308
12 Winter 2013 53.22452
