Drop rows in a long dataset by group, based on some condition

I have this df:
library(lubridate)

Date <- c("2020-10-01", "2020-10-02", "2020-10-03", "2020-10-04",
          "2020-10-01", "2020-10-02", "2020-10-03", "2020-10-04",
          "2020-10-01", "2020-10-02", "2020-10-03", "2020-10-04")
Date <- as_date(Date)
Country <- c("USA", "USA", "USA", "USA",
             "Mexico", "Mexico", "Mexico", "Mexico",
             "Japan", "Japan", "Japan", "Japan")
Value_A <- c(0, 40, 0, 0, 25, 29, 34, 0, 20, 25, 27, 0)
df <- data.frame(Date, Country, Value_A)
df
Date Country Value_A
<date> <chr> <dbl>
1 2020-10-01 USA 0
2 2020-10-02 USA 40
3 2020-10-03 USA 0
4 2020-10-04 USA 0
5 2020-10-01 Mexico 25
6 2020-10-02 Mexico 29
7 2020-10-03 Mexico 34
8 2020-10-04 Mexico 0
9 2020-10-01 Japan 20
10 2020-10-02 Japan 25
11 2020-10-03 Japan 27
12 2020-10-04 Japan 0
I'm trying to drop the rows containing zeros, but only if these zeros are in the last two rows of each group of the Country column. So the result would be:
Date Country Value_A
<date> <chr> <dbl>
1 2020-10-01 USA 0
2 2020-10-02 USA 40
5 2020-10-01 Mexico 25
6 2020-10-02 Mexico 29
7 2020-10-03 Mexico 34
9 2020-10-01 Japan 20
10 2020-10-02 Japan 25
11 2020-10-03 Japan 27
I appreciate it if someone can help :)

We can use dplyr (part of the tidyverse) for a few manipulations to get the result: group by Country, sort descending by Date, generate row numbers within each group, and then filter on the condition you described:
library(tidyverse)
df %>%
  group_by(Country) %>%
  arrange(desc(Date)) %>%
  mutate(rn = row_number()) %>%
  filter(!(Value_A == 0 & rn <= 2))
# Date Country Value_A rn
# 1 2020-10-03 Mexico 34 2
# 2 2020-10-03 Japan 27 2
# 3 2020-10-02 USA 40 3
# 4 2020-10-02 Mexico 29 3
# 5 2020-10-02 Japan 25 3
# 6 2020-10-01 USA 0 4
# 7 2020-10-01 Mexico 25 4
# 8 2020-10-01 Japan 20 4
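If you also want the result back in ascending date order and without the helper column, the same pipeline can end with arrange() and select(); this is an optional tidy-up, not part of the original answer:
df %>%
  group_by(Country) %>%
  arrange(desc(Date)) %>%
  mutate(rn = row_number()) %>%
  filter(!(Value_A == 0 & rn <= 2)) %>%
  ungroup() %>%
  arrange(Date) %>%
  select(-rn)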
Another method would be to use rank(desc(Date)), which avoids re-sorting and keeps the original row order:
library(tidyverse)
df %>%
  group_by(Country) %>%
  mutate(rank_date = rank(desc(Date))) %>%
  filter(!(rank_date <= 2 & Value_A == 0))
# Date Country Value_A rank_date
# 1 2020-10-01 USA 0 4
# 2 2020-10-02 USA 40 3
# 3 2020-10-01 Mexico 25 4
# 4 2020-10-02 Mexico 29 3
# 5 2020-10-03 Mexico 34 2
# 6 2020-10-01 Japan 20 4
# 7 2020-10-02 Japan 25 3
# 8 2020-10-03 Japan 27 2
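If the rows are already in ascending date order within each Country (as they are in the example), a variant that needs no helper column at all is to filter on the last two row positions directly. A sketch of my own, not from the answers above:
library(dplyr)
df %>%
  group_by(Country) %>%
  # drop zeros only when they sit in the last two rows of the group
  filter(!(Value_A == 0 & row_number() > n() - 2)) %>%
  ungroup()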

Related

R - calculate annual population conditional on survival in every year

I have a data frame with three columns: birth_year, death_year, gender.
I have to calculate the total number of males and females alive in every year of a given range (1950:1980).
The data frame looks like this:
birth_year death_year gender
1934 1988 male
1922 1993 female
1890 1966 male
1901 1956 male
1946 2009 female
1909 1976 female
1899 1945 male
1887 1949 male
1902 1984 female
A person is alive in year x if death_year > x & birth_year <= x.
The output I am looking for is something like this:
year male female
1950 3 4
1951 2 3
1952 4 3
1953 4 5
.
.
1980 6 3
Thanks!
Does this work:
library(tidyr)
library(purrr)
library(dplyr)
df %>%
  mutate(year = map2(1950, 1980, seq)) %>%
  unnest(year) %>%
  mutate(isalive = case_when(year >= birth_year & year < death_year ~ 1,
                             TRUE ~ 0)) %>%
  group_by(year, gender) %>%
  summarise(alive = sum(isalive)) %>%
  pivot_wider(names_from = gender, values_from = alive) %>%
  print(n = 50)
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 31 x 3
# Groups: year [31]
year female male
<int> <dbl> <dbl>
1 1950 4 3
2 1951 4 3
3 1952 4 3
4 1953 4 3
5 1954 4 3
6 1955 4 3
7 1956 4 2
8 1957 4 2
9 1958 4 2
10 1959 4 2
11 1960 4 2
12 1961 4 2
13 1962 4 2
14 1963 4 2
15 1964 4 2
16 1965 4 2
17 1966 4 1
18 1967 4 1
19 1968 4 1
20 1969 4 1
21 1970 4 1
22 1971 4 1
23 1972 4 1
24 1973 4 1
25 1974 4 1
26 1975 4 1
27 1976 3 1
28 1977 3 1
29 1978 3 1
30 1979 3 1
31 1980 3 1
Data used:
df
# A tibble: 9 x 3
birth_year death_year gender
<dbl> <dbl> <chr>
1 1934 1988 male
2 1922 1993 female
3 1890 1966 male
4 1901 1956 male
5 1946 2009 female
6 1909 1976 female
7 1899 1945 male
8 1887 1949 male
9 1902 1984 female
Here's a simple base R solution. Summing a logical vector gives you the count of people alive, because TRUE counts as 1 and FALSE as 0.
number_alive <- function(range, df){
  sapply(range, function(x) sum((df$death_year > x) & (df$birth_year <= x)))
}

output <- data.frame('year'   = 1950:1980,
                     'female' = number_alive(1950:1980, df[df$gender == 'female', ]),
                     'male'   = number_alive(1950:1980, df[df$gender == 'male', ]))
# year female male
# 1 1950 4 3
# 2 1951 4 3
# 3 1952 4 3
# 4 1953 4 3
# 5 1954 4 3
# 6 1955 4 3
# 7 1956 4 2
# 8 1957 4 2
# 9 1958 4 2
# 10 1959 4 2
# 11 1960 4 2
# 12 1961 4 2
# 13 1962 4 2
# 14 1963 4 2
# 15 1964 4 2
# 16 1965 4 2
# 17 1966 4 1
# 18 1967 4 1
# 19 1968 4 1
# 20 1969 4 1
# 21 1970 4 1
# 22 1971 4 1
# 23 1972 4 1
# 24 1973 4 1
# 25 1974 4 1
# 26 1975 4 1
# 27 1976 3 1
# 28 1977 3 1
# 29 1978 3 1
# 30 1979 3 1
# 31 1980 3 1
This approach uses an ifelse to determine if alive (1) or dead (0). Note that year <= death_year counts the death year itself as alive; use year < death_year to match the strict condition stated in the question.
Data:
df <- "birth_year death_year gender
1934 1988 male
1922 1993 female
1890 1966 male
1901 1956 male
1946 2009 female
1909 1976 female
1899 1945 male
1887 1949 male
1902 1984 female"
df <- read.table(text = df, header = TRUE)
Code:
library(dplyr)
library(tidyr)
library(tibble)
library(purrr)
df %>%
mutate(year = map2(1950,1980, seq)) %>%
unnest(year) %>%
select(year, birth_year, death_year, gender) %>%
mutate(
alive = ifelse(year >= birth_year & year <= death_year, 1, 0)
) %>%
group_by(year, gender) %>%
summarise(
is_alive = sum(alive)
) %>%
pivot_wider(
names_from = gender,
values_from = is_alive
) %>%
select(year, male, female)
Output:
#> # A tibble: 31 x 3
#> # Groups: year [31]
#> year male female
#> <int> <dbl> <dbl>
#> 1 1950 3 4
#> 2 1951 3 4
#> 3 1952 3 4
#> 4 1953 3 4
#> 5 1954 3 4
#> 6 1955 3 4
#> 7 1956 3 4
#> 8 1957 2 4
#> 9 1958 2 4
#> 10 1959 2 4
#> # … with 21 more rows
Created on 2020-11-11 by the reprex package (v0.3.0)
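For comparison, here is a compact base R sketch of my own (not from the answers above) that builds a year-by-person logical matrix using the question's strict condition and sums it by gender:
years <- 1950:1980
# alive[y, i] is TRUE if person i is alive in year y
alive <- outer(years, seq_len(nrow(df)),
               function(y, i) df$birth_year[i] <= y & df$death_year[i] > y)
data.frame(year   = years,
           male   = rowSums(alive[, df$gender == "male"]),
           female = rowSums(alive[, df$gender == "female"]))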

Is there an R function which can undo cumsum() and recreate the original non-cumulative column in a dataset?

For simplicity, I have created a small dummy dataset.
Please note: dates are in yyyy-mm-dd format
Here is dataset DF:
library(dplyr)

DF <- tibble(country = rep(c("France", "England", "Spain"), each = 4),
             date = rep(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"), times = 3),
             visits = c(10, 16, 14, 12, 11, 9, 12, 14, 13, 13, 15, 10))
# A tibble: 12 x 3
country date visits
<chr> <chr> <dbl>
1 France 2020-01-01 10
2 France 2020-02-01 16
3 France 2020-03-01 14
4 France 2020-04-01 12
5 England 2020-01-01 11
6 England 2020-02-01 9
7 England 2020-03-01 12
8 England 2020-04-01 14
9 Spain 2020-01-01 13
10 Spain 2020-02-01 13
11 Spain 2020-03-01 15
12 Spain 2020-04-01 10
Here is dataset DFc:
DFc <- DF %>% group_by(country) %>% mutate(cumulative_visits = cumsum(visits))
# A tibble: 12 x 3
# Groups: country [3]
country date cumulative_visits
<chr> <chr> <dbl>
1 France 2020-01-01 10
2 France 2020-02-01 26
3 France 2020-03-01 40
4 France 2020-04-01 52
5 England 2020-01-01 11
6 England 2020-02-01 20
7 England 2020-03-01 32
8 England 2020-04-01 46
9 Spain 2020-01-01 13
10 Spain 2020-02-01 26
11 Spain 2020-03-01 41
12 Spain 2020-04-01 51
Let's say I only have dataset DFc. Which R functions can I use to recreate the visits column (as shown in dataset DF) and essentially "undo/reverse" cumsum()?
I have been told that I can incorporate the lag() function but I am not sure how to do this.
Also, how would the code change if the dates were spaced weeks apart, rather than months?
Any help would be much appreciated :)
Starting from your toy example:
library(dplyr)
DF <- tibble(country = rep(c("France", "England", "Spain"), each = 4),
             date = rep(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"), times = 3),
             visits = c(10, 16, 14, 12, 11, 9, 12, 14, 13, 13, 15, 10))
DF <- DF %>%
  group_by(country) %>%
  mutate(cumulative_visits = cumsum(visits)) %>%
  ungroup()
I propose two methods:
1. diff
2. lag [as you specifically required]
DF %>%
  group_by(country) %>%
  mutate(decum_visits1 = c(cumulative_visits[1], diff(cumulative_visits)),
         decum_visits2 = cumulative_visits - lag(cumulative_visits, default = 0)) %>%
  ungroup()
#> # A tibble: 12 x 6
#> country date visits cumulative_visits decum_visits1 decum_visits2
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 France 2020-01-01 10 10 10 10
#> 2 France 2020-02-01 16 26 16 16
#> 3 France 2020-03-01 14 40 14 14
#> 4 France 2020-04-01 12 52 12 12
#> 5 England 2020-01-01 11 11 11 11
#> 6 England 2020-02-01 9 20 9 9
#> 7 England 2020-03-01 12 32 12 12
#> 8 England 2020-04-01 14 46 14 14
#> 9 Spain 2020-01-01 13 13 13 13
#> 10 Spain 2020-02-01 13 26 13 13
#> 11 Spain 2020-03-01 15 41 15 15
#> 12 Spain 2020-04-01 10 51 10 10
If one date is missing, say, as in the following example:
DF1 <- DF %>%
  # set to date!
  mutate(date = as.Date(date)) %>%
  # remove one date just for the sake of the example
  filter(date != as.Date("2020-02-01"))
Then I advise you to complete the dates, filling visits with zero and cumulative_visits with the last seen value. Then you can invert the cumsum in the same way as before.
DF1 %>%
  group_by(country) %>%
  # complete the dates and fill visits with zero
  tidyr::complete(date = seq.Date(min(date), max(date), by = "month"),
                  fill = list(visits = 0)) %>%
  # fill cumulative_visits with the last available value
  tidyr::fill(cumulative_visits) %>%
  # invert in the same way as before
  mutate(decum_visits1 = c(cumulative_visits[1], diff(cumulative_visits)),
         decum_visits2 = cumulative_visits - lag(cumulative_visits, default = 0)) %>%
  ungroup()
Here's a generic base R solution. It's sloppy because, as you can see, it doesn't return foo[1] and the output comes back reversed, but both can be fixed. I'll leave that as an exercise for the reader.
foo <- sample(1:20,10)
[1] 16 11 13 5 6 12 19 10 3 4
bar <- cumsum(foo)
[1] 16 27 40 45 51 63 82 92 95 99
rev(bar[-1])-rev(bar[-length(bar)])
[1] 4 3 10 19 12 6 5 13 11
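One way to complete that exercise: prepend the first cumulative value and use diff(), which recovers foo in its original order:
c(bar[1], diff(bar))
[1] 16 11 13  5  6 12 19 10  3  4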

Cumulative count of a character vector

I want to make a cumulative count of country names from a data frame:
df <- data.frame(country = c("Sweden", "Germany", "Sweden", "Sweden", "Germany", "Vietnam"),
                 year = c(1834, 1846, 1847, 1852, 1860, 1865))
I have tried different versions of count(), cumsum() and tally() but can't seem to get it right.
Output should look like:
country year n
Sweden 1834 1
Germany 1846 2
Sweden 1847 2
Sweden 1852 2
Germany 1860 2
Vietnam 1865 3
library(dplyr)

df %>% mutate(count = cumsum(!duplicated(country))) %>% as_tibble()
#> # A tibble: 6 x 3
#> country year count
#> <fctr> <dbl> <int>
#> 1 Sweden 1834 1
#> 2 Germany 1846 2
#> 3 Sweden 1847 2
#> 4 Sweden 1852 2
#> 5 Germany 1860 2
#> 6 Vietnam 1865 3
or
dist_cum <- function(var) {
  sapply(seq_along(var), function(x) length(unique(head(var, x))))
}

df %>% mutate(var2 = dist_cum(country))
#> country year var2
#> 1 Sweden 1834 1
#> 2 Germany 1846 2
#> 3 Sweden 1847 2
#> 4 Sweden 1852 2
#> 5 Germany 1860 2
#> 6 Vietnam 1865 3
You can try this, although note that it counts rows per country/year combination rather than producing the cumulative distinct count asked for:
library(plyr)
df <- data.frame(country = c("Sweden", "Germany", "Sweden", "Sweden", "Germany", "Vietnam", "Germany"),
                 year = c(1834, 1846, 1847, 1852, 1860, 1860, 1865))
counts <- ddply(df, .(df$country, df$year), nrow)
The output is:
> counts
df$country df$year V1
1 Germany 1846 1
2 Germany 1860 2
3 Sweden 1834 1
4 Sweden 1847 1
5 Sweden 1852 1
6 Vietnam 1865 1
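For the cumulative distinct count the question asks for, the duplicated() trick shown in the first answer also works in plain base R, with no packages:
df$n <- cumsum(!duplicated(df$country))
df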

How to flatten data.frame for use with googlevis treemap?

In order to use the treemap function in googleVis, the data needs to be flattened so that each row names a node and its parent. Using their example:
> library(googleVis)
> Regions
Region Parent Val Fac
1 Global <NA> 10 2
2 America Global 2 4
3 Europe Global 99 11
4 Asia Global 10 8
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
However, in the real world this information more frequently looks like this:
> a <- data.frame(
+ scal=c("Global", "Global", "Global", "Global", "Global", "Global", "Global"),
+ cont=c("Europe", "Europe", "Europe", "America", "America", "Asia", "Asia"),
+ country=c("France", "Sweden", "Germany", "Mexico", "USA", "China", "Japan"),
+ val=c(71, 89, 58, 2, 38, 5, 48),
+ fac=c(2,3,10,9,11,1,11))
> a
scal cont country val fac
1 Global Europe France 71 2
2 Global Europe Sweden 89 3
3 Global Europe Germany 58 10
4 Global America Mexico 2 9
5 Global America USA 38 11
6 Global Asia China 5 1
7 Global Asia Japan 48 11
But how best to transform this data?
If we use dplyr, this script will transform the data correctly:
library(dplyr)
# Top level: a single "Global" row whose parent is NA
cbind(NA, a %>% group_by(scal) %>% summarize(val = sum(val), fac = sum(fac))) -> topLev
names(topLev) <- c("Parent", "Region", "val", "fac")
# Middle level: one row per continent, with "Global" as parent
a %>% group_by(scal, cont) %>% summarize(val = sum(val), fac = sum(fac)) %>%
  select(Region = cont, Parent = scal, val, fac) -> midLev
# Bottom level: the countries themselves, with their continent as parent
a[, 2:5] %>% select(Region = country, Parent = cont, val, fac) -> bottomLev
# Stack the three levels and put Region before Parent
bind_rows(topLev, midLev, bottomLev) %>% select(2, 1, 3, 4) -> answer
We can verify this by comparing the two data frames:
> answer
Source: local data frame [11 x 4]
Region Parent val fac
1 Global NA 311 47
2 America Global 40 20
3 Asia Global 53 12
4 Europe Global 218 15
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
> Regions
Region Parent Val Fac
1 Global <NA> 10 2
2 America Global 2 4
3 Europe Global 99 11
4 Asia Global 10 8
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
It's interesting that in the googleVis example the summaries for the continents and the globe aren't the sums of their components (or min/max/mean/normalized values...).
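Once flattened, the result can be passed straight to gvisTreeMap; a minimal sketch (chart options omitted):
library(googleVis)
# idvar/parentvar name the node and parent columns; sizevar and colorvar
# control rectangle size and colour
tm <- gvisTreeMap(answer, idvar = "Region", parentvar = "Parent",
                  sizevar = "val", colorvar = "fac")
plot(tm)  # opens the chart in a browser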

How to remove rows in data frame after frequency tables in R

I have 3 data frames in which I have to find the continents represented by two or fewer countries and remove those countries (rows). The data frames are structured in a manner similar to the data frame x below:
row Country Continent Ranking
1 Kenya Africa 17
2 Gabon Africa 23
3 Spain Europe 04
4 Belgium Europe 03
5 China Asia 10
6 Nigeria Africa 14
7 Holland Europe 01
8 Italy Europe 05
9 Japan Asia 06
First I wanted to know the number of countries per continent, so I did:
x2<-table(x$Continent)
x2
Africa   Asia Europe
     3      2      4
Then I wanted to identify the continents with two or fewer countries:
x3 <- x2[x2 <= 2]
x3
Asia
2
My problem now is how to remove these countries. For the example above it will be the 2 countries in Asia, and I want my final data set to look as presented below:
row Country Continent Ranking
1 Kenya Africa 17
2 Gabon Africa 23
3 Spain Europe 04
4 Belgium Europe 03
5 Nigeria Africa 14
6 Holland Europe 01
7 Italy Europe 05
The number of continents with two or fewer countries will vary among the different data frames, so I need one universal method that I can apply to all of them.
Try
library(dplyr)
x %>%
  group_by(Continent) %>%
  filter(n() > 2)
# row Country Continent Ranking
#1 1 Kenya Africa 17
#2 2 Gabon Africa 23
#3 3 Spain Europe 04
#4 4 Belgium Europe 03
#5 6 Nigeria Africa 14
#6 7 Holland Europe 01
#7 8 Italy Europe 05
Or, using the x2 table from above:
subset(x, Continent %in% names(x2)[x2>2])
# row Country Continent Ranking
#1 1 Kenya Africa 17
#2 2 Gabon Africa 23
#3 3 Spain Europe 04
#4 4 Belgium Europe 03
#6 6 Nigeria Africa 14
#7 7 Holland Europe 01
#8 8 Italy Europe 05
A very easy way with "data.table" would be:
library(data.table)
as.data.table(x)[, N := .N, by = Continent][N > 2]
# row Country Continent Ranking N
# 1: 1 Kenya Africa 17 3
# 2: 2 Gabon Africa 23 3
# 3: 3 Spain Europe 4 4
# 4: 4 Belgium Europe 3 4
# 5: 6 Nigeria Africa 14 3
# 6: 7 Holland Europe 1 4
# 7: 8 Italy Europe 5 4
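Alternatively, if you'd rather not carry the helper column N in the result, a common data.table idiom is to return .SD only for the qualifying groups:
library(data.table)
as.data.table(x)[, if (.N > 2) .SD, by = Continent]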
In base R you can try:
x[with(x, ave(rep(TRUE, nrow(x)), Continent, FUN = function(y) length(y) > 2)), ]
# row Country Continent Ranking
# 1 1 Kenya Africa 17
# 2 2 Gabon Africa 23
# 3 3 Spain Europe 4
# 4 4 Belgium Europe 3
# 6 6 Nigeria Africa 14
# 7 7 Holland Europe 1
# 8 8 Italy Europe 5
