R: Meaning of "\" in Sapply? - r

I have a dataset that looks something like this:
name = c("john", "john", "john", "alex","alex", "tim", "tim", "tim", "ralph", "ralph")
year = c(2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012, 2014, 2016)
my_data = data.frame(name, year)
name year
1 john 2010
2 john 2011
3 john 2012
4 alex 2011
5 alex 2012
6 tim 2010
7 tim 2011
8 tim 2012
9 ralph 2014
10 ralph 2016
I am trying to count the "number of rows with at least one missing (i.e. non-consecutive) year", for example:
# sample output
year count
1 2014, 2016 1
In a previous question (Counting Number of Unique Column Values Per Group), I received an answer - but when I tried to apply this answer, I got the following error:
agg <- aggregate(year ~ name, my_data, c)
agg <- agg$year[sapply(agg$year, \(y) any(diff(y) != 1))]
as.data.frame(table(sapply(agg, paste, collapse = ", ")))
Error: unexpected input .... " ... \"
I think this error might be due to the fact that I am using an older version of R.
Does anyone know if an alternate symbol can be used to replace "\" in R that is supported by older versions of R?
Thanks!

In tidyverse, we may do this as
library(dplyr)
my_data %>%
group_by(name) %>%
filter(any(diff(year) != 1)) %>%
summarise(year = toString(year)) %>%
count(year, name = 'count')
Output:
# A tibble: 1 × 2
year count
<chr> <int>
1 2014, 2016 1
The error in the OP's code comes from the R version. The concise lambda syntax (\(x) as shorthand for function(x)) was only introduced in R 4.1.0, so on older versions write function(y) in place of \(y).
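For older R versions, a minimal base R sketch of the same count with function(y) spelled out might look like the following. It uses split() rather than the aggregate() call from the question (a swapped-in technique, chosen so every group is always a plain vector), and the names years_by_name, has_gap, gaps and result are made up for illustration.

```r
# Sample data from the question
name <- c("john", "john", "john", "alex", "alex", "tim", "tim", "tim", "ralph", "ralph")
year <- c(2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012, 2014, 2016)
my_data <- data.frame(name, year)

# One vector of years per name (split() is used here instead of aggregate())
years_by_name <- split(my_data$year, my_data$name)

# function(y) works in every R version; \(y) needs R 4.1.0 or newer
has_gap <- sapply(years_by_name, function(y) any(diff(y) != 1))

# Keep the groups with a gap and count each distinct run of years
gaps <- years_by_name[has_gap]
result <- as.data.frame(table(sapply(gaps, paste, collapse = ", ")))
names(result) <- c("year", "count")
result
```

Only ralph has non-consecutive years here, so result should hold a single row for "2014, 2016" with a count of 1.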

Related

If statement with three true conditions

This is my data:
Year1 <- c(2015,2013,2012,2018)
Year2 <- c(2017,2015,2014,2020)
my_data <- data.frame(Year1, Year2)
I need an if statement that returns 1 when year 1 equals 2015 OR 2016 AND year 2 is greater than 2016. Currently, my code looks like this:
my_data <- my_data %>%
mutate(Y_2016=ifelse(my_data$Year1==2015|2016 & my_data$Year1>2016,1,0))
But this does not work and only seems to check the condition if Year 2 is greater than 2016, since it returns 1 even for the last row when Year 1 is 2018 and Year 2 is 2020.
Thank you for your help!
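Instead of my_data$Year1==2015|2016, use %in%, as in my_data$Year1 %in% c(2015,2016). The expression Year1==2015|2016 does not compare Year1 against both values; the bare 2016 is treated as TRUE.
There is a typo in my_data$Year1>2016: the second condition should test Year2, not Year1.
As you are using dplyr, you do not need to prefix every variable with my_data$.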
my_data%>%
mutate(Y_2016=ifelse(Year1 %in% c(2015,2016) & Year2>2016,1,0))
Year1 Year2 Y_2016
1 2015 2017 1
2 2013 2015 0
3 2012 2014 0
4 2018 2020 0
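To see why the original condition misbehaves, here is a small base R sketch on the question's vectors. The names wrong and right are chosen only for illustration. Because & binds tighter than |, the asker's expression is read as (Year1 == 2015) | (2016 & Year1 > 2016), and the non-zero number 2016 is coerced to TRUE.

```r
Year1 <- c(2015, 2013, 2012, 2018)
Year2 <- c(2017, 2015, 2014, 2020)

# The asker's condition: parsed as (Year1 == 2015) | (2016 & Year1 > 2016),
# which reduces to Year1 == 2015 | Year1 > 2016
wrong <- Year1 == 2015 | 2016 & Year1 > 2016

# The intended condition: Year1 is 2015 or 2016, and Year2 is after 2016
right <- Year1 %in% c(2015, 2016) & Year2 > 2016

wrong  # TRUE FALSE FALSE TRUE
right  # TRUE FALSE FALSE FALSE
```

The last element of wrong is TRUE only because Year1 is 2018, which explains the unexpected 1 in the final row.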

Newbie trouble building script for a report

I am a newbie to coding and to R. I have been trying to solve a problem for a report that I am drawing and have hit a wall.
I have spent the last two days trying to find a workable answer and am now at my wit's end.
I have a data frame of student results. The columns are as follows
Student Number
Academic Year eg 2014, 2015 etc
Academic Semester eg Jan or June
Qualification eg Qual1, Qual2 etc
Modules eg Subject1, Subject2 etc. The issue here is that Subject1 may be in Qual1 and Qual2 but Subject2 may only be in Qual1
Result. This is either "Passed" or "FAILED"
I am trying to create a summary/list showing the percentage passed for each module where students were active. Something like this
Year Semester Qualification Module PassRate
2014 Jan Qual1 Subject1 62.54%
2014 Jan Qual1 Subject2 72.81%
.
.
.
2014 July Qual1 Subject1 69.51%
.
.
2014 Jan Qual2 Subject1 42.86%
2014 Jan Qual2 Subject3 55.95%
etc.
I thought that perhaps an IF statement might work but that seems way too cumbersome. I also looked at For each but I can't seem to figure how to get it to work or a combination of the above. I have tried aggregate, count =, cbind and anything that i could find from my good friend Google.
I have the following code
AcademicYears <- as.character(unique(unlist(HE_Stats$Year)))
AcademicYears_count <- NROW(AcademicYears)
AcademicSemesters <- as.character(unique(unlist(HE_Stats$ActualSemester)))
AcademicSemesters_count <- NROW(AcademicSemesters)
Qualifications <- as.character(unique(unlist(HE_Stats$Qualification)))
Qualifications_count <- NROW(Qualifications)
Modules <- as.character(unique(unlist(HE_Stats$ModuleCode)))
Modules_count <- NROW(Modules)
df <- HE_Stats %>%
group_by(Year,ActualSemester,Qualification, ModuleCode) %>%
aggregate(cbind(count = AcademicSemesters) ~ AcademicYears,
data = HE_Stats,
FUN = function(AcademicSemesters){NROW(AcademicSemesters)})
the result of this is that it shows me one semester per year. My latest plan is to build the matrix column by column.
If you could supply sample data, I would be able to give you a better answer. But say that your data looked something like this (the solution uses the dplyr package):
library(dplyr)
data <- tibble(student_number = c(1, 2, 3, 4, 5, 6),
academic_year = c(2014, 2014, 2014, 2015, 2015, 2015),
semester = c("jan", "jan", "jan","jan", "june", "june"),
qualification = c("qual1", "qual2", "qual1", "qual1", "qual2",
"qual2"),
module = c("subject1", "subject1", "subject1", "subject1",
"subject2", "subject2"),
result = c("passed", "failed", "passed", "passed", "passed",
"failed"))
# A tibble: 6 x 6
student_number academic_year semester qualification module result
<dbl> <dbl> <chr> <chr> <chr> <chr>
1 1 2014 jan qual1 subject1 passed
2 2 2014 jan qual2 subject1 failed
3 3 2014 jan qual1 subject1 passed
4 4 2015 jan qual1 subject1 passed
5 5 2015 june qual2 subject2 passed
6 6 2015 june qual2 subject2 failed
First I would make a logical vector of whether the subject had passed:
data <- data %>%
mutate(pass = ifelse(result == "passed", TRUE, FALSE))
Then summarise the grouped data:
data %>%
group_by(academic_year, semester, qualification, module) %>%
summarise(
pass_rate = (sum(pass)/n())*100
)
To produce:
academic_year semester qualification module pass_rate
<dbl> <chr> <chr> <chr> <dbl>
1 2014 jan qual1 subject1 100
2 2014 jan qual2 subject1 0
3 2015 jan qual1 subject1 100
4 2015 june qual2 subject2 50
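If the pass rate should be shown as a percentage string like in the question (for example "62.54%"), one possible base R sketch is below. It swaps dplyr for aggregate(), assumes the column names from the sample data above, and the names rates and pass_rate_label are invented for illustration.

```r
# Same sample data as above, as a plain data frame
data <- data.frame(
  academic_year = c(2014, 2014, 2014, 2015, 2015, 2015),
  semester      = c("jan", "jan", "jan", "jan", "june", "june"),
  qualification = c("qual1", "qual2", "qual1", "qual1", "qual2", "qual2"),
  module        = c("subject1", "subject1", "subject1", "subject1", "subject2", "subject2"),
  result        = c("passed", "failed", "passed", "passed", "passed", "failed")
)

# TRUE where the student passed; the mean of a logical vector is the share of TRUEs
data$pass <- data$result == "passed"

# One row per year / semester / qualification / module combination
rates <- aggregate(pass ~ academic_year + semester + qualification + module,
                   data = data, FUN = mean)

# Format the proportion as a percentage with two decimals
rates$pass_rate_label <- sprintf("%.2f%%", 100 * rates$pass)
rates
```

Taking the mean of the logical column gives the same number as sum(pass)/n(), just on a 0 to 1 scale before formatting.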

Drop rows according to condition on different columns

I have a big dataframe where I need to erase rows according to a condition given in each level of a factor (country). I have data for a variable through different years, but where there are duplicated years, I need to go with just one of them. Here is a minimal dataframe:
datos <- data.frame(Country = c(rep("Australia", 4), rep("Belgium", 4)),
Year = c(2010, 2011, 2012, 2012, 2010, 2011, 2011, 2012),
method = c("Method1", "Method1", "Method1", "Method2", "Method1",
"Method1", "Method2", "Method1"))
Now I want R to do the following:
"For each country, in case that there is a repeated Year, erase the row where method is equal to Method1".
Using dplyr, we can group_by Country and Year and use filter to drop the rows where the group has more than one row and method == "Method1".
library(dplyr)
datos %>%
group_by(Country, Year) %>%
filter(!(n() > 1 & method == "Method1"))
# Country Year method
# <fct> <dbl> <fct>
#1 Australia 2010 Method1
#2 Australia 2011 Method1
#3 Australia 2012 Method2
#4 Belgium 2010 Method1
#5 Belgium 2011 Method2
#6 Belgium 2012 Method1
Using the same logic with base R ave
datos[!with(datos, ave(method == "Method1", Country, Year,
FUN = function(x) length(x) > 1 & x)), ]
# Country Year method
#1 Australia 2010 Method1
#2 Australia 2011 Method1
#4 Australia 2012 Method2
#5 Belgium 2010 Method1
#7 Belgium 2011 Method2
#8 Belgium 2012 Method1

Dropping the rows by checking whether it has multiple values in R

I have a data frame in this form;
Year Department Jan Feb ................... Dec
2017 TF 15.15 225.51 .............. 5562.1
2015 CIF ...................................
2013 TTR ....................................
2011 COR ....................
. .............................
. ......................
As a summary, I want to create an algorithm but first I have to make this filtering:
If a department does not have a value for each of the years 2013, 2014, 2015 and 2016, then I want to exclude that department from my data set.
In other words, I want to read each department's data and keep only the departments that have values in the month columns for all four years.
I tried exists and is.na, but the multiple filtering always fails. Another handicap is that filter seems to work for only a single condition, but here I need four conditions: values for all four years must exist to use them in the next step.
Thank you.
I can't find a clear duplicate to this question. Seems like a quick fix with group_by:
library(dplyr)
df <- tibble(Year = c(2013:2016, 2015, 2016),
Department = c(rep('TF', 4), 'CIF', 'TTR'))
df
#> # A tibble: 6 x 2
#> Year Department
#> <dbl> <chr>
#> 1 2013 TF
#> 2 2014 TF
#> 3 2015 TF
#> 4 2016 TF
#> 5 2015 CIF
#> 6 2016 TTR
df %>%
group_by(Department) %>%
mutate(x = Year %in% c(2013:2016),
y = sum(x)) %>%
ungroup() %>%
filter(y == 4)
#> # A tibble: 4 x 4
#> Year Department x y
#> <dbl> <chr> <lgl> <int>
#> 1 2013 TF TRUE 4
#> 2 2014 TF TRUE 4
#> 3 2015 TF TRUE 4
#> 4 2016 TF TRUE 4
A solution using R base:
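df = read.table(text = "Year,Department
2016,TF
2017,TF
2013,CIF
2014,CIF
2015,CIF
2016,CIF
2013,TTR", header = TRUE, sep = ",", stringsAsFactors = FALSE)
# keep only rows in the target years, then count the rows left per department
n = subset(df, Year %in% c(2013, 2014, 2015, 2016))
counts = aggregate(Year ~ Department, data = n, FUN = length)
# keep the departments that have all four target years
df[df$Department %in% counts$Department[counts$Year == 4], ]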
Output:
Year Department
3 2013 CIF
4 2014 CIF
5 2015 CIF
6 2016 CIF
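Both answers above count rows per department, which assumes each year appears at most once per department. A hedged sketch that instead checks directly that every target year is present could look like this. The sample data (with a repeated TTR row) and the names target_years, complete and kept are made up for illustration.

```r
# Hypothetical sample: TF has all four years, TTR repeats 2016
df <- data.frame(
  Year       = c(2013:2016, 2015, 2016, 2016),
  Department = c(rep("TF", 4), "CIF", "TTR", "TTR"),
  stringsAsFactors = FALSE
)

target_years <- 2013:2016

# For each department, TRUE only if every target year occurs at least once
complete <- sapply(split(df$Year, df$Department),
                   function(y) all(target_years %in% y))

# Keep the rows of the departments that passed the check
kept <- df[df$Department %in% names(complete)[complete], ]
kept
```

With this check, repeated years cannot push a department's count up to 4 by accident.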

Remove specific rows from data frame conditional on caseid and year

I'm a beginner in R, so please be gentle :)
I have a dataframe of the following form:
sampleData <- data.frame(id = c(1,1,2,2,3,4,4),
year = c(2010, 2014, 2010, 2014, 2010, 2010, 2014))
sampleData
id year
1 1 2010
2 1 2014
3 2 2010
4 2 2014
5 3 2010
6 4 2010
7 4 2014
I want to exclude every id, which does not have both years.
In this case: id "3" only has year "2010".
Therefore I want to conditionally remove ids, which do not have another row with the missing year.
I hope you guys can understand what I'm looking for :(
thank you in advance!
sampleData <- data.frame(id = c(1,1,2,2,3,4,4),
year = c(2010, 2014, 2010, 2014, 2010, 2010, 2014))
First you count :
library(plyr)
countBy <- ddply(unique(sampleData),
.(id),
summarise,
occurence = length(year) ,
.parallel = FALSE)
Then you subset
sampleData[sampleData$id %in% countBy$id[countBy$occurence > 1],]
We can use ave and check number of rows for each id and select only those rows with length as 2.
sampleData[ave(sampleData$year, sampleData$id, FUN = length) == 2, ]
# id year
#1 1 2010
#2 1 2014
#3 2 2010
#4 2 2014
#6 4 2010
#7 4 2014
If we want to check whether both "2010" and "2014" appear at least once per id, we can do
sampleData[as.logical(ave(sampleData$year, sampleData$id, FUN = function(x)
any(2014 %in% x) & any(2010 %in% x))), ]
Here is a solution with data.table
library("data.table")
sampleData <- data.frame(id = c(1,1,2,2,3,4,4), year = c(2010, 2014, 2010, 2014, 2010, 2010, 2014))
setDT(sampleData)
sampleData[, `:=`(n, .N), by=id][n==2]
In case you want to make your check more explicit, i.e. not just relying on two rows per id but checking whether both "2010" and "2014" appear at least once per id, you can do something like this in base R:
x <- table(sampleData$id, sampleData$year) > 0
x
# 2010 2014
# 1 TRUE TRUE
# 2 TRUE TRUE
# 3 TRUE FALSE
# 4 TRUE TRUE
ids_to_keep <- row.names(x)[rowSums(x[,c("2010", "2014")]) == 2]
ids_to_keep
#[1] "1" "2" "4"
sampleData[sampleData$id %in% ids_to_keep,]
# id year
#1 1 2010
#2 1 2014
#3 2 2010
#4 2 2014
#6 4 2010
#7 4 2014
This approach is longer than the others, but it is also more robust. If the same year can occur more than once per id, or if years other than 2010 and 2014 can appear, approaches that only count the number of occurrences per id may give the wrong result.
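As a small illustration of that point, consider a made-up data frame in which id 3 has the year 2010 recorded twice. The names dup, by_count, present and by_year are chosen only for this sketch.

```r
# Hypothetical data: id 1 has both years, id 3 has 2010 twice and no 2014
dup <- data.frame(id = c(1, 1, 3, 3), year = c(2010, 2014, 2010, 2010))

# Counting rows per id keeps id 3, because it also has two rows
by_count <- dup[ave(dup$year, dup$id, FUN = length) == 2, ]

# Checking which years are actually present drops id 3
present <- table(dup$id, dup$year) > 0
ids_to_keep <- row.names(present)[rowSums(present[, c("2010", "2014")]) == 2]
by_year <- dup[dup$id %in% ids_to_keep, ]

nrow(by_count)  # 4: id 3 slipped through
nrow(by_year)   # 2: only id 1 remains
```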
There is also a nice dplyr solution:
# create the sample dataset
sampleData <- data.frame(id = c(1,1,2,2,3,4,4),
year = c(2010, 2014, 2010, 2014, 2010, 2010, 2014))
# load dplyr library
library(dplyr)
# take the sample dataset
sampleData %>%
# group by id - thus the function within filter will be evaluated for each id
group_by(id) %>%
# filter only ids which were recorded in two separate years
filter(length(unique(year)) == 2)
