Drop rows according to a condition on different columns in R

I have a big data frame where I need to drop rows according to a condition evaluated within each level of a factor (Country). I have data for a variable across several years, but where there are duplicated years, I need to keep just one of them. Here is a minimal data frame:
datos <- data.frame(Country = c(rep("Australia", 4), rep("Belgium", 4)),
                    Year = c(2010, 2011, 2012, 2012, 2010, 2011, 2011, 2012),
                    method = c("Method1", "Method1", "Method1", "Method2", "Method1",
                               "Method1", "Method2", "Method1"))
Now I want R to do the following:
"For each country, in case that there is a repeated Year, erase the row where method is equal to Method1".

Using dplyr, we can group_by Country and Year and filter out the rows where the number of rows in the group (n()) is greater than 1 and method == "Method1".
library(dplyr)
datos %>%
  group_by(Country, Year) %>%
  filter(!(n() > 1 & method == "Method1"))
# Country Year method
# <fct> <dbl> <fct>
#1 Australia 2010 Method1
#2 Australia 2011 Method1
#3 Australia 2012 Method2
#4 Belgium 2010 Method1
#5 Belgium 2011 Method2
#6 Belgium 2012 Method1
Using the same logic with base R's ave:
datos[!with(datos, ave(method == "Method1", Country, Year,
                       FUN = function(x) length(x) > 1 & x)), ]
# Country Year method
#1 Australia 2010 Method1
#2 Australia 2011 Method1
#4 Australia 2012 Method2
#5 Belgium 2010 Method1
#7 Belgium 2011 Method2
#8 Belgium 2012 Method1
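The same logic also ports to data.table; a minimal sketch (not from the original answer, assuming datos as defined above):
library(data.table)

# within each (Country, Year) group, drop rows where the group has duplicates
# and method is "Method1"
as.data.table(datos)[, .SD[!(.N > 1 & method == "Method1")], by = .(Country, Year)]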

Related

R: Count number of new observations compared to previous groups

I would like to know the number of new observations that occurred between groups.
If I have the following data:
Year Observation
2009 A
2009 A
2009 B
2010 A
2010 B
2010 C
I would like the output to be
Year New_Observation_Count
2009 2
2010 1
I am new to R and don't really know how to move forward. I have tried using the count function in the tidyverse package but still can't figure it out.
You can use union with Reduce, accumulating the set of observations seen so far:
y <- split(x$Observation, x$Year)
# lengths() of the accumulated unions gives the cumulative distinct count;
# diff() of that is the number of new observations per year
data.frame(Year = names(y),
           nNew = diff(lengths(Reduce(union, y, NULL, accumulate = TRUE))))
# Year nNew
#1 2009 2
#2 2010 1
Data:
x <- read.table(header=TRUE, text="Year Observation
2009 A
2009 A
2009 B
2010 A
2010 B
2010 C")

Find the closest value for a certain year in R

I have this type of data:
iso3 year UHC cata10
AFG 2010 0.3551409 NA
AFG 2011 0.3496452 NA
AFG 2012 0.3468012 NA
AFG 2013 0.3567721 14.631331
AFG 2014 0.3647436 NA
AFG 2015 0.3717983 NA
AFG 2016 0.3855273 4.837534
AFG 2017 0.3948606 NA
AGO 2011 0.3250651 12.379809
AGO 2012 0.3400455 NA
AGO 2013 0.3397722 NA
AGO 2014 0.3385741 NA
AGO 2015 0.3521086 16.902584
AGO 2016 0.3636765 NA
AGO 2017 0.3764945 NA
and I would like to find the closest value to years 2012 and 2017 (± 2 years, i.e. for 2012 it can be 2010, 2011, 2013 or 2014 data) for the cata10 variable. The output should be:
iso3year_UHC UHC year_cata cata10
AFG 2012 0.3468012 2013 14.631331
AFG 2017 0.3948606 2016 4.837534
AGO 2012 0.3400455 2011 12.379809
AGO 2017 0.3764945 2015 16.902584
EDIT: Note that I should get NA if there is no data 2 years before or after the reference year.
I have tried tons of commands over the past two days but could not manage to find a solution. Could you please advise on the type of commands to try?
Thank you very much,
N.
Here are three approaches. The first is the clearest: it shows that the problem is really an aggregated and filtered self-join, models this directly, and automatically handles the edge case mentioned in the comments without additional code. The second uses an lapply loop to get the desired effect; it involves more tedious manipulation but has the advantage of zero package dependencies. The last works around the fact that dplyr lacks complex self-joins by performing a left join twice.
1) sqldf Using DF, defined reproducibly in the Note at the end, perform a self-join such that the difference in years is -2, -1, 1 or 2, the iso3 codes are the same, and cata10 is not NA in the matching instance; among those rows, use min(...) to find the row with the minimum absolute difference in years. This relies on the SQLite feature that min(...) returns the entire row satisfying the minimizing condition. Finally, take only the 2012 and 2017 rows. SQL's ability to model the constraints in a complex join lets us translate the requirements directly into code.
library(sqldf)
sqldf("select
a.iso3year iso3year_UHC,
a.UHC,
substr(b.iso3year, 5, 8) year_cata,
b.cata10,
substr(a.iso3year, 5, 8) year,
min(abs(substr(a.iso3year, 5, 8) - substr(b.iso3year, 5, 8))) min_value
from DF a
left join DF b on year - year_cata in (-2, -1, 1, 2) and
substr(a.iso3year, 1, 3) = substr(b.iso3year, 1, 3) and
b.cata10 is not null
group by a.iso3year
having year in ('2012', '2017')")[1:4]
giving:
iso3year_UHC UHC year_cata cata10
1 AFG 2012 0.3468012 2013 14.631331
2 AFG 2017 0.3948606 2016 4.837534
3 AGO 2012 0.3400455 2011 12.379809
4 AGO 2017 0.3764945 2015 16.902584
2) Base R This solution uses only base R. We first create year and iso variables by breaking iso3year into two parts. ix is an index into DF giving the rows having 2012 or 2017 as their year. For each of those rows we find the nearest year having a cata10 value and create a row of the output data frame, which lapply returns as a list of rows, L. Finally we rbind those rows together. This is not as straightforward as (1), but it has the advantage of no package dependencies.
to.year <- function(x) as.numeric(substr(x, 5, 8))

year <- to.year(DF$iso3year)
iso <- substr(DF$iso3year, 1, 3)
ix <- which(year %in% c(2012, 2017))

L <- lapply(ix, function(i) {
  DF0 <- na.omit(DF[iso[i] == iso & (year[i] - year) %in% c(-2, -1, 1, 2), ])
  if (nrow(DF0)) {
    with(DF0[which.min(abs(to.year(DF0$iso3year) - year[i])), c("iso3year", "cata10")],
         data.frame(iso3year_UHC = DF$iso3year[i],
                    UHC = DF$UHC[i],
                    year_cata = as.numeric(substr(iso3year, 5, 8)),
                    cata10))
  } else {
    data.frame(iso3year_UHC = DF$iso3year[i],
               UHC = DF$UHC[i],
               year_cata = NA,
               cata10 = NA)
  }
})
do.call("rbind", L)
giving:
iso3year_UHC UHC year_cata cata10
1 AFG 2012 0.3468012 2013 14.631331
2 AFG 2017 0.3948606 2016 4.837534
3 AGO 2012 0.3400455 2011 12.379809
4 AGO 2017 0.3764945 2015 16.902584
3) dplyr/tidyr
First separate iso3year into iso and year columns, giving DF2. Then pick out the 2012 and 2017 rows, giving DF3. Now left join DF3 to DF2 using iso, keeping rows where cata10 in the joined instance is not NA and the absolute difference in years between the two joined data frames is 1 or 2. Then use slice to pick out the row with the least distance in years and select the desired columns, giving DF4. Finally, left join DF3 with DF4, which fills out any rows for which there was no match.
library(dplyr)
library(tidyr)
DF2 <- DF %>%
  separate(iso3year, c("iso", "year"), remove = FALSE, convert = TRUE)

DF3 <- DF2 %>%
  filter(year %in% c(2012, 2017))

DF4 <- DF3 %>%
  left_join(DF2, "iso") %>%
  drop_na(cata10.y) %>%
  filter(abs(year.x - year.y) %in% 1:2) %>%
  group_by(iso3year.x) %>%
  slice(which.min(abs(year.x - year.y))) %>%
  ungroup %>%
  select(iso3year = iso3year.x, UHC = UHC.x, year_cata = year.y, cata10 = cata10.y)

DF3 %>%
  select(iso3year, UHC) %>%
  left_join(DF4, c("iso3year", "UHC"))
giving:
# A tibble: 4 x 4
iso3year UHC year_cata cata10
<chr> <dbl> <int> <dbl>
1 AFG 2012 0.347 2013 14.6
2 AFG 2017 0.395 2016 4.84
3 AGO 2012 0.340 2011 12.4
4 AGO 2017 0.376 2015 16.9
Note
Lines <- "iso3year UHC cata10
AFG 2010 0.3551409 NA
AFG 2011 0.3496452 NA
AFG 2012 0.3468012 NA
AFG 2013 0.3567721 14.631331
AFG 2014 0.3647436 NA
AFG 2015 0.3717983 NA
AFG 2016 0.3855273 4.837534
AFG 2017 0.3948606 NA
AGO 2011 0.3250651 12.379809
AGO 2012 0.3400455 NA
AGO 2013 0.3397722 NA
AGO 2014 0.3385741 NA
AGO 2015 0.3521086 16.902584
AGO 2016 0.3636765 NA
AGO 2017 0.3764945 NA"
DF <- read.csv(text = gsub(" +", ",", Lines), as.is = TRUE)
Here is an answer using dplyr and tidyr:
library(tidyverse)
uhc_comb = read.table(header = T, text = "
iso3 year UHC cata10
AFG 2010 0.3551409 NA
AFG 2011 0.3496452 NA
AFG 2012 0.3468012 NA
AFG 2013 0.3567721 14.631331
AFG 2014 0.3647436 NA
AFG 2015 0.3717983 NA
AFG 2026 0.3855273 4.837534 #Year is 2026 for the example
AFG 2017 0.3948606 NA
AGO 2011 0.3250651 12.379809
AGO 2012 0.3400455 NA
AGO 2013 0.3397722 NA
AGO 2014 0.3385741 NA
AGO 2015 0.3521086 16.902584
AGO 2016 0.3636765 NA
AGO 2017 0.3764945 NA")
uhc_comb2 = uhc_comb %>%
  pivot_longer(cols = c("UHC", "cata10")) %>% # pivot UHC and cata10 to long format as columns "name" and "value"
  filter(!is.na(value)) %>%                   # remove missing
  group_by(iso3, name) %>%                    # for each iso3 and each variable name (UHC and cata10)
  mutate(dist = pmin(abs(year - 2012), abs(year - 2017))) %>% # distance between the year and the nearest target
  # filter(dist <= 2) %>%                     # remove
  top_n(-2, dist) %>%                         # select the minimal distance (in each group)
  mutate(year = ifelse(dist > 2, NA, year),
         value = ifelse(dist > 2, NA, value)) %>% # infer NA if the distance is too high
  select(-dist)                               # discard the now useless variable

uhc_comb2 %>%
  pivot_wider(id_cols = iso3, values_from = c("year", "value")) %>% # pivot to wide again
  unnest() # since there are several values, unnest the list columns from the data frame
This will output some warnings, but they are not significant; I'm not sure it is possible to remove them.
If you want to understand this better, run it line by line. Pivoting tables is tough brain gymnastics at first.
EDIT: this will get you the right output with no warnings:
uhc_comb2 %>%
  pivot_wider(id_cols = iso3,
              values_from = c("year", "value"),
              values_fn = list(value = list, year = list)) %>%
  unnest(cols = c(year_cata10, year_UHC, value_cata10, value_UHC))
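For comparison, a data.table rolling join can also express "nearest year within a window". Below is a minimal sketch for the cata10 lookup only (not from the original answers; it assumes the question's separate iso3/year columns, and UHC could be joined back in the same way):
library(data.table)

# long-format input with separate iso3/year columns, as in the question
dt <- data.table(iso3   = rep(c("AFG", "AGO"), c(8, 7)),
                 year   = c(2010:2017, 2011:2017),
                 cata10 = c(NA, NA, NA, 14.631331, NA, NA, 4.837534, NA,
                            12.379809, NA, NA, NA, 16.902584, NA, NA))

ref <- CJ(iso3 = unique(dt$iso3), year = c(2012, 2017)) # reference years per country

# roll = "nearest" matches each reference year to the closest year that has data
res <- dt[!is.na(cata10)][ref, on = .(iso3, year), roll = "nearest",
                          .(iso3, year = i.year, year_cata = x.year, cata10)]
res[abs(year - year_cata) > 2, c("year_cata", "cata10") := NA] # enforce the +/- 2 window
res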

Merge data with different years in R

I would like to merge two datasets using different years.
My data look like the below, with more than 1,000 firms over a 20-year span.
I want to merge the data to examine the impact of firm A's ratio at year t on firm A's count at year t+1.
Data A
firm year ratio
A 1990 0.2
A 1991 0.3
...
B 1990 0.1
Data B
firm tyear count
A 1990 2
A 1991 6
...
B 1990 4
Expected Output
firm year ratio count
A 1990 0.2 6
Any suggestions for code to merge the data?
Thank you
This should get you started; just make sure you apply the right lag/lead transformation to the table.
library(data.table)

dt.a.years <- data.table(Year = seq(from = 1990, to = 2010, by = 1L))
dt.b.years <- data.table(Year = seq(from = 1990, to = 2010, by = 1L))

dt.merged <- merge(x = dt.a.years,
                   y = dt.b.years[, .(Year, lag.Year = shift(Year, n = 1, fill = NA))],
                   by.x = "Year",
                   by.y = "lag.Year")
> dt.merged
Year Year.y
1: 1990 1991
2: 1991 1992
3: 1992 1993
4: 1993 1994
5: 1994 1995
6: 1995 1996
7: 1996 1997
8: 1997 1998
9: 1998 1999
How about like this:
A$tyear = A$year + 1
AB = merge(A, B, by = c('firm', 'tyear'), all = FALSE)
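The same shift-and-join idea in dplyr, as a sketch (assuming data frames A with firm/year/ratio and B with firm/tyear/count as in the question):
library(dplyr)

A %>%
  mutate(tyear = year + 1) %>%                 # ratio at t should meet count at t + 1
  inner_join(B, by = c("firm", "tyear")) %>%
  select(firm, year, ratio, count)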

How to create a function that calculates and retrieves a value from a table?

I have a table with info on driver infractions and the value of each infraction. I have two columns, value (of infraction) and year (of infraction). For each year of infraction I have several values. Years go from 2000 to 2014.
I need a function that retrieves the total of infractions for a "predetermined" year, i.e., when the user passes a year, it returns only the info for that year. So far I can only manage to get the info for all years at the same time.
I tried this:
total_year <- function(x = infractions$year) {
  # note: the argument x is never used, so the result always covers all years
  aggregate(infractions$value ~ infractions$year_deb, FUN = sum, na.rm = TRUE)
}
Then I type
total_year(2012)
and I get a table of infractions per year listing all years, but I only want the total for 2012.
My table looks like this:
value year
375714 1,00 2011
375715 0,00 2012
375716 0,00 2013
375717 0,00 2014
375738 12,00 2011
375739 7,00 2012
375740 2,00 2013
375741 4,00 2014
375762 23,00 2011
375763 14,00 2012
375764 18,00 2013
375765 7,00 2014
375786 6,00 2011
375787 4,00 2012
375788 2,00 2013
375789 5,00 2014
375810 0,00 2011
375811 0,00 2012
375812 0,00 2013
Here's a solution using dplyr:
Data
set.seed(123)
df <- data.frame(value = sample(c("speeding", "parking", "dui"), 45, replace = TRUE),
                 year = rep(2000:2014, 3))
Function
library(dplyr)
total_year <- function(data, x) {
  data %>%
    filter(year == x) %>%                      # keep only the requested year
    group_by(year) %>%
    summarize(inf = length(unique(value))) %>% # count distinct infraction types
    ungroup
}
Usage
total_year(df, 2014)
# year inf
# <int> <int>
# 1 2014 1
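Note that the solution above counts distinct infraction types in the simulated data. If, as in the original table, value is a numeric amount and you want its total for one year, a minimal base R sketch (assuming the questioner's infractions data frame with numeric value and year columns) would be:
total_year_sum <- function(data, yr) {
  # sum the numeric value column for the single requested year
  sum(data$value[data$year == yr], na.rm = TRUE)
}

total_year_sum(infractions, 2012) # total infraction value for 2012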

Return final row of data frame with recurring variable names

I want to return the final row for each subsection of a dataframe. I'm aware of the ddply and aggregate functions, but they are not giving the expected output in this case, as the column by which I split the data has recurring names.
For example, in df:
year <- rep(c(2011, 2012, 2013), each=12)
season <- rep(c("Spring", "Summer", "Autumn", "Winter"), each=3)
allseason <- rep(season, 3)
temp <- rnorm(36, mean = 61, sd = 10)
df <- data.frame(year, allseason, temp)
I want to return the final temp reading at the end of every season. When I run either
final1 <- aggregate(df, list(df$allseason), tail, 1)
or
final2 <- ddply(df, .(allseason), tail, 1)
I get only the final 4 seasons (i.e. those of 2013). The function seems to stop there and does not go back to previous years/seasons. My intended output is a data frame with 12 rows * 3 columns.
All help appreciated!
*I notice that in the df created here, the allseason column is a factor with 4 levels, whereas this is not the case in my original data frame.
In your ddply code, you only forgot to also group by year:
With plyr:
library(plyr)
ddply(df, .(year, allseason), tail, 1)
Or with dplyr
library(dplyr)
df %>%
  group_by(year, allseason) %>%
  do(tail(., 1))
Or if you want a base R alternative you can use ave:
df[with(df, ave(year, list(year, allseason), FUN = seq_along)) == 3,]
Result:
# year allseason temp
#1 2011 Autumn 63.40626
#2 2011 Spring 59.69441
#3 2011 Summer 42.33252
#4 2011 Winter 79.10926
#5 2012 Autumn 63.14974
#6 2012 Spring 60.32811
#7 2012 Summer 67.57364
#8 2012 Winter 61.39100
#9 2013 Autumn 50.30501
#10 2013 Spring 61.43044
#11 2013 Summer 55.16605
#12 2013 Winter 69.37070
Note that the output will contain the same rows in each case, only the ordering may differ.
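A shorter base R alternative (not in the original answer) that avoids hard-coding the group size of 3 is duplicated() with fromLast = TRUE:
# keep only the last occurrence of each (year, allseason) combination
df[!duplicated(df[c("year", "allseason")], fromLast = TRUE), ]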
And just to add to #beginneR's answer, your aggregate solution should look like:
aggregate(temp ~ allseason + year, data = df, tail, 1)
# or:
with(df, aggregate(temp, list(allseason, year), tail, 1))
Result:
allseason year temp
1 Autumn 2011 64.51539
2 Spring 2011 45.14341
3 Summer 2011 62.29240
4 Winter 2011 47.97461
5 Autumn 2012 43.16781
6 Spring 2012 80.02419
7 Summer 2012 72.31149
8 Winter 2012 45.58344
9 Autumn 2013 55.92607
10 Spring 2013 52.06778
11 Summer 2013 51.01308
12 Winter 2013 53.22452
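For reference, dplyr 1.0+ also offers slice_tail(), which states "last row per group" directly; a minimal sketch:
library(dplyr)

df %>%
  group_by(year, allseason) %>%
  slice_tail(n = 1) %>% # final temp reading in each year/season group
  ungroup()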
