Remove specific rows from data frame conditional on caseid and year - r

I'm a beginner in R, so please be gentle :)
I have a dataframe of the following form:
sampleData <- data.frame(id = c(1,1,2,2,3,4,4),
year = c(2010, 2014, 2010, 2014, 2010, 2010, 2014))
sampleData
id year
1 1 2010
2 1 2014
3 2 2010
4 2 2014
5 3 2010
6 4 2010
7 4 2014
I want to exclude every id, which does not have both years.
In this case: id "3" only has year "2010".
Therefore I want to conditionally remove ids, which do not have another row with the missing year.
I hope you guys can understand what I'm looking for :(
thank you in advance!

sampleData <- data.frame(id = c(1,1,2,2,3,4,4),
year = c(2010, 2014, 2010, 2014, 2010, 2010, 2014))
First, count the distinct (id, year) rows per id:
library(plyr)
countBy <- ddply(unique(sampleData),
.(id),
summarise,
occurence = length(year) ,
.parallel = F )
Then subset to the ids that occur more than once:
sampleData[sampleData$id %in% countBy$id[countBy$occurence > 1],]

We can use ave to count the rows for each id and keep only the ids that have exactly 2 rows.
sampleData[ave(sampleData$year, sampleData$id, FUN = length) == 2, ]
# id year
#1 1 2010
#2 1 2014
#3 2 2010
#4 2 2014
#6 4 2010
#7 4 2014
If we want to check whether both 2010 and 2014 appear at least once per id, we can do:
sampleData[as.logical(ave(sampleData$year, sampleData$id, FUN = function(x)
any(2014 %in% x) & any(2010 %in% x))), ]

Here is a solution with data.table
library("data.table")
sampleData <- data.frame(id = c(1,1,2,2,3,4,4), year = c(2010, 2014, 2010, 2014, 2010, 2010, 2014))
setDT(sampleData)
sampleData[, `:=`(n, .N), by=id][n==2]

In case you want to make your check more explicit, i.e. not just relying on two rows per id but checking whether both "2010" and "2014" appear at least once per id, you can do something like this in base R:
x <- table(sampleData$id, sampleData$year) > 0
x
# 2010 2014
# 1 TRUE TRUE
# 2 TRUE TRUE
# 3 TRUE FALSE
# 4 TRUE TRUE
ids_to_keep <- row.names(x)[rowSums(x[,c("2010", "2014")]) == 2]
ids_to_keep
#[1] "1" "2" "4"
sampleData[sampleData$id %in% ids_to_keep,]
# id year
#1 1 2010
#2 1 2014
#3 2 2010
#4 2 2014
#6 4 2010
#7 4 2014
This approach is longer than the others, but it is also more robust. Approaches that only count the number of occurrences per id may fail if the same year can occur more than once per id, or if the data can contain years other than 2010 and 2014.
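To illustrate the robustness point, here is a minimal sketch with a hypothetical data frame `dupData` (not part of the question). In it, id 3 has year 2010 twice and id 5 has 2010 and 2012 but no 2014. Counting rows per id keeps both ids, while the explicit year check drops them.

```r
# Hypothetical data: id 3 has 2010 twice, id 5 has 2010 and 2012 (no 2014)
dupData <- data.frame(id   = c(1, 1, 3, 3, 5, 5),
                      year = c(2010, 2014, 2010, 2010, 2010, 2012))

# Row-count approach: every id has exactly two rows, so nothing is removed
by_count <- dupData[ave(dupData$year, dupData$id, FUN = length) == 2, ]

# Explicit approach: require both 2010 and 2014 to be present for the id
has_year <- table(dupData$id, dupData$year) > 0
ids_to_keep <- row.names(has_year)[rowSums(has_year[, c("2010", "2014")]) == 2]
by_year <- dupData[dupData$id %in% ids_to_keep, ]

unique(by_count$id)  # 1 3 5
unique(by_year$id)   # 1
```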

There is also a nice dplyr solution:
# create the sample dataset
sampleData <- data.frame(id = c(1,1,2,2,3,4,4),
year = c(2010, 2014, 2010, 2014, 2010, 2010, 2014))
# load dplyr library
library(dplyr)
# take the sample dataset
sampleData %>%
# group by id - thus the function within filter will be evaluated for each id
group_by(id) %>%
# filter only ids which were recorded in two separate years
filter(length(unique(year)) == 2)

Related

R: Meaning of "\" in Sapply?

I have a dataset that looks something like this:
name = c("john", "john", "john", "alex","alex", "tim", "tim", "tim", "ralph", "ralph")
year = c(2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012, 2014, 2016)
my_data = data.frame(name, year)
name year
1 john 2010
2 john 2011
3 john 2012
4 alex 2011
5 alex 2012
6 tim 2010
7 tim 2011
8 tim 2012
9 ralph 2014
10 ralph 2016
I am trying to count the "number of rows with at least one missing (i.e. non-consecutive) year", for example:
# sample output
year count
1 2014, 2016 1
In a previous question (Counting Number of Unique Column Values Per Group), I received an answer - but when I tried to apply this answer, I got the following error:
agg <- aggregate(year ~ name, my_data, c)
agg <- agg$year[sapply(agg$year, \(y) any(diff(y) != 1))]
as.data.frame(table(sapply(agg, paste, collapse = ", ")))
Error: unexpected input .... " ... \"
I think this error might be due to the fact that I am using an older version of R.
Does anyone know if an alternate symbol can be used to replace "\" in R that is supported by older versions of R?
Thanks!
In tidyverse, we may do this as
library(dplyr)
my_data %>%
group_by(name) %>%
filter(any(diff(year) != 1)) %>%
summarise(year = toString(year)) %>%
count(year, name = 'count')
-output
# A tibble: 1 × 2
year count
<chr> <int>
1 2014, 2016 1
The error in the OP's code is due to the R version. The concise lambda syntax (\(x) as shorthand for function(x)) was only introduced in R 4.1.0; on older versions, write function(y) instead of \(y).
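For older R versions, the snippet from the question can be written with the `function` keyword instead of the backslash shorthand. A sketch using the question's own data; the only change to the code is `function(y)` in place of `\(y)`:

```r
name <- c("john", "john", "john", "alex", "alex", "tim", "tim", "tim", "ralph", "ralph")
year <- c(2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012, 2014, 2016)
my_data <- data.frame(name, year)

# Collect the years of each name (groups differ in size, so this is a list column)
agg <- aggregate(year ~ name, my_data, c)

# Keep only the groups with a gap between consecutive years;
# function(y) works on every R version, unlike \(y)
agg <- agg$year[sapply(agg$year, function(y) any(diff(y) != 1))]

res <- as.data.frame(table(sapply(agg, paste, collapse = ", ")))
res
```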

If statement with three true conditions

This is my data:
Year1 <- c(2015,2013,2012,2018)
Year2 <- c(2017,2015,2014,2020)
my_data <- data.frame(Year1, Year2)
I need an if statement that returns 1 when year 1 equals 2015 OR 2016 AND year 2 is greater than 2016. Currently, my code looks like this:
my_data <- my_data %>%
mutate(Y_2016=ifelse(my_data$Year1==2015|2016 & my_data$Year1>2016,1,0))
But this does not work and only seems to check the condition if Year 2 is greater than 2016, since it returns 1 even for the last row when Year 1 is 2018 and Year 2 is 2020.
Thank you for your help!
Instead of my_data$Year1==2015|2016, use %in%, as in my_data$Year1 %in% c(2015,2016).
There is also a typo in my_data$Year1>2016: it should test Year2, not Year1.
As you are using dplyr, you do not need to prefix every variable with my_data$.
my_data%>%
mutate(Y_2016=ifelse(Year1 %in% c(2015,2016) & Year2>2016,1,0))
Year1 Year2 Y_2016
1 2015 2017 1
2 2013 2015 0
3 2012 2014 0
4 2018 2020 0
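As for why the original condition returns 1 for the last row, the following base R sketch spells out how the expression is parsed. The vector names `original`, `explicit` and `intended` are introduced here only for illustration.

```r
Year1 <- c(2015, 2013, 2012, 2018)
Year2 <- c(2017, 2015, 2014, 2020)

# `==` and `>` bind tighter than `&`, which binds tighter than `|`, so the
# original condition is read as (Year1 == 2015) | (2016 & (Year1 > 2016)).
# A non-zero number such as 2016 is treated as TRUE in a logical operation.
original <- Year1 == 2015 | 2016 & Year1 > 2016
explicit <- (Year1 == 2015) | (TRUE & (Year1 > 2016))

# The intended condition, written with %in% and testing Year2
intended <- Year1 %in% c(2015, 2016) & Year2 > 2016

original  # TRUE FALSE FALSE  TRUE
intended  # TRUE FALSE FALSE FALSE
```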

Change column values based on factors of other columns

For example, if I have a data frame like this:
df <- data.frame(profit=c(10,10,10), year=c(2010,2011,2012))
profit year
10 2010
10 2011
10 2012
I want to change the value of profit according to the year. For year 2010, I multiply the profit by 3, for year 2011, I multiply it by 4, and for year 2012, by 5, which should give this result:
profit year
30 2010
40 2011
50 2012
How should I approach this? I tried:
inflationtransform <- function(k,v) {
switch(k,
2010,v<-v*3,
2011,v<-v*4,
2012,v<-v*5,
)
}
df$profit <- sapply(df$year,df$profit,inflationtransform)
But it doesn't work. Can someone tell me what to do?
For this particular example, since your factors and years are both ordered and incremented by 1, you could just subtract 2007 from the year column and multiply it by profit.
transform(df, profit = profit * (year - 2007))
# profit year
# 1 30 2010
# 2 40 2011
# 3 50 2012
Otherwise, you could use a lookup vector. This will cover all cases.
lookup <- c("2010" = 3, "2011" = 4, "2012" = 5)
transform(df, profit = profit * lookup[as.character(year)])
# profit year
# 1 30 2010
# 2 40 2011
# 3 50 2012
I wouldn't use switch() unless you really need to. It's not vectorized, and that's where R is most efficient. However, since you ask for it in the comments, here's one way. I find it easier to use a for() loop with switch().
for(i in seq_len(nrow(df))) {
df$profit[i] <- with(df, switch(as.character(year[i]),
"2010" = 3 * profit[i],
"2011" = 4 * profit[i],
"2012" = 5 * profit[i]
))
}
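One caveat with the lookup vector is that a year without an entry yields NA. Here is a minimal sketch of one way to handle that, assuming unmatched years should leave profit unchanged (that default is an assumption, not something the question specifies). The names `df2` and `multiplier` are hypothetical.

```r
# Hypothetical data with a year (2013) that has no entry in the lookup
df2 <- data.frame(profit = c(10, 10, 10, 10),
                  year   = c(2010, 2011, 2012, 2013))
lookup <- c("2010" = 3, "2011" = 4, "2012" = 5)

# Indexing by name returns NA for years without an entry
multiplier <- unname(lookup[as.character(df2$year)])

# Assumed default: leave profit unchanged when no multiplier is defined
multiplier[is.na(multiplier)] <- 1

df2$profit <- df2$profit * multiplier
df2$profit  # 30 40 50 10
```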

Grouping data by specific observations in R

I want to create a new variable that's derived from specific values in my existing variables. My data frame looks something like the following:
year <- c("2010", "2011", "2012", "2013", "2014", "2015")
x <- c(2980, 2955, 3110, 2962, 2566, 3788)
y <- c(2453, 2919, 2930, 2864, 2873, 3031)
df <- data.frame(year, x, y)
More specifically, I want to create a third column, z, that is the ratio of x and y. However, I don't want to create this ratio by simply dividing x by y for each individual year. Instead, I want the values in 2015 (and 2014 etc.) to be an average of this ratio in the three preceding years, i.e. 2014, 2013, and 2012.
I've looked at Wickham's dplyr package and, in particular, the group_by function but I'm stumped because I don't want to group my data by year per se but by each years' three preceding years as illustrated (hopefully) above.
With dplyr and library(zoo):
df_fin<- df %>% mutate( z = rollmeanr(x/y,3,na.pad=TRUE))
I think the column z is what you want but it would be good to have the desired output.
The answers that use zoo::rollmean are all on the correct track, but they have a couple of "off by one" errors in them. First, you actually want zoo::rollmeanr( ..., na.pad=TRUE ) which will correctly pad the output with NA on the left side:
> zoo::rollmeanr( df$x / df$y, 3, na.pad=TRUE )
[1] NA NA 1.0962018 1.0359948 0.9962648 1.0590378
The second "off by one" error arises from the alignment of this vector with the rest of your data. From your description, you want the value for 2015 to be the average of 2014, 2013, and 2012. However, appending the vector above to your table makes the value for 2015 the average of 2015, 2014, and 2013 instead. To correct this, omit the last value in your input to the rolling average and prepend an NA to compensate:
> c( NA, zoo::rollmeanr( head(df$x / df$y,-1), 3, na.pad=TRUE ) )
[1] NA NA NA 1.0962018 1.0359948 0.9962648
Putting it all together using dplyr notation:
df %>% mutate( z = c( NA, zoo::rollmeanr( head(x/y,-1), 3, na.pad=TRUE ) ) )
year x y z
1 2010 2980 2453 NA
2 2011 2955 2919 NA
3 2012 3110 2930 NA
4 2013 2962 2864 1.0962018
5 2014 2566 2873 1.0359948
6 2015 3788 3031 0.9962648
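The same lagged three-year average can also be sketched in base R without zoo. The helper `lagged_mean` is a hypothetical name, and the default window of three comes from the question.

```r
year <- c("2010", "2011", "2012", "2013", "2014", "2015")
x <- c(2980, 2955, 3110, 2962, 2566, 3788)
y <- c(2453, 2919, 2930, 2864, 2873, 3031)
df <- data.frame(year, x, y)

# Mean of the k values preceding position i; NA when fewer than k are available
lagged_mean <- function(v, k = 3) {
  sapply(seq_along(v), function(i) {
    if (i <= k) NA_real_ else mean(v[(i - k):(i - 1)])
  })
}

df$z <- lagged_mean(df$x / df$y)
df
```

Writing the window as `(i - k):(i - 1)` makes the "preceding years only" alignment explicit, which is the part the rolling-mean versions get wrong most easily.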
df$z<-0
for (i in 4:6){
df$z[i]<-mean(df$x[(i-3):(i-1)])/mean(df$y[(i-3):(i-1)])
}
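With a loop like the one above, you get the following. Note that it divides the mean of x by the mean of y over the three preceding years, rather than averaging the yearly ratios, so the values differ slightly from the rollmeanr results: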
year x y z
1 2010 2980 2453 0.000000
2 2011 2955 2919 0.000000
3 2012 3110 2930 0.000000
4 2013 2962 2864 1.089497
5 2014 2566 2873 1.036038
6 2015 3788 3031 0.996654
library(zoo)
library(dplyr)
df %>% mutate(z = x/y, zz = rollmean(z, 3, fill = NA))

Filter data frame by lowest common overlap in categorical variable in R

I have the following data frame:
input<-data.frame(
site=c("1","2","3","1","2","3","4","1","2"),
year=c(rep("2006",3),rep("2010",4),rep("2014",2)
))
site year
1 1 2006
2 2 2006
3 3 2006
4 1 2010
5 2 2010
6 3 2010
7 4 2010
8 1 2014
9 2 2014
I would like to return a list of sites surveyed in 2006, 2010, and 2014; so in the example above only site 1 and 2 would be in the list as they are the only sites that were surveyed in 2006, 2010, and 2014.
Any advice is most appreciated.
You can use ddply to count the number of years that are in your list of years of interest, for each site and then pull the sites that have all three.
library(plyr)
res <- ddply(.data = input, .variables = .(site),
summarize, allthree = all(c("2006","2010","2014") %in% year))
res$site[res$allthree]
If your data may contain other years, this solution should work:
yearsneeded <- c("2006","2010","2014")
names(which(tapply(input$year, input$site, function(x) all(yearsneeded %in% x))))
It may be most straightforward to first cross-tabulate year and site using table(), and to then "apply" the function all to each of the table's rows to find which ones have all non-zero entries, like so:
(tb <- table(input))
# year
# site 2006 2010 2014
# 1 1 1 1
# 2 1 1 1
# 3 1 1 0
# 4 0 1 0
rownames(tb)[apply(tb,1,all)]
# [1] "1" "2"
Or, if you really just care that there should be at least one presence in each of 2006, 2010, and 2014 (even if your data might contain other years), try this:
rownames(tb)[apply(tb[,c("2006", "2010", "2014")], 1, all)]
# [1] "1" "2"
This is another approach (updated). It also works if the original input data frame has more than the 3 years in the example
years <- c(2006,2010,2014) # vector of required years
df <- input[input$year %in% years,] # data frame containing only the required years
# sites that fulfill the criteria; names() returns the site labels rather than row positions
sites <- as.numeric(names(which(rowSums(table(df)) == length(years))))
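Another way to express "surveyed in every required year" is to intersect the site sets of the individual years. A minimal base R sketch; the names `years_needed`, `sites_by_year` and `common_sites` are chosen here for illustration:

```r
input <- data.frame(
  site = c("1", "2", "3", "1", "2", "3", "4", "1", "2"),
  year = c(rep("2006", 3), rep("2010", 4), rep("2014", 2))
)
years_needed <- c("2006", "2010", "2014")

# One character vector of sites per required year
sites_by_year <- lapply(years_needed, function(yr) {
  as.character(input$site[input$year == yr])
})

# Sites that appear in every one of those vectors
common_sites <- Reduce(intersect, sites_by_year)
common_sites  # "1" "2"
```

Because the intersection works on sets, duplicated site/year rows or additional years in the data do not affect the result.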
