Remove rows conditionally based on NA values in other rows - r

I have a data frame like this:
city year value
<chr> <dbl> <dbl>
1 la 1 NA
2 la 2 NA
3 la 3 NA
4 la 4 20
5 la 5 25
6 nyc 1 18
7 nyc 2 29
8 nyc 3 24
9 nyc 4 17
10 nyc 5 30
I would like to remove any cities that don't have a complete 5 years worth of data. So in this case, I'd like to remove all rows for city la despite the fact that there is data for years 4 and 5, resulting in the following data frame:
city year value
<chr> <dbl> <dbl>
1 nyc 1 18
2 nyc 2 29
3 nyc 3 24
4 nyc 4 17
5 nyc 5 30
Is this possible? Thanks in advance.

In Base R:
subset(df, !ave(value, city, FUN = anyNA))
city year value
6 nyc 1 18
7 nyc 2 29
8 nyc 3 24
9 nyc 4 17
10 nyc 5 30
in Tidyverse
df %>%
group_by(city) %>%
filter(!anyNA(value))
# A tibble: 5 x 3
# Groups: city [1]
city year value
<chr> <int> <int>
1 nyc 1 18
2 nyc 2 29
3 nyc 3 24
4 nyc 4 17
5 nyc 5 30
or even
df %>%
group_by(city) %>%
filter(all(!is.na(value)))

Another base R option with ave
> subset(df, !is.na(ave(value, city)))
city year value
6 nyc 1 18
7 nyc 2 29
8 nyc 3 24
9 nyc 4 17
10 nyc 5 30
or a data.table one
> library(data.table)
> setDT(df)[, .SD[!anyNA(value)], city]
city year value
1: nyc 1 18
2: nyc 2 29
3: nyc 3 24
4: nyc 4 17
5: nyc 5 30

Related

Dividing a column cell with a different number based on number of observations in a panel long format

I have the following data which is of a panel structure. I need to normalize each cell so that the observation for a country is divided by total number of observations for that country divided by total number of observations in the panel structure (here 10 - in my data 1100). Also I have showcased three countries (AL, UK, FR) but I have 92 in total so I need some general formula (mutate: by = country?).
This is my data
df1 <- data_frame(Country =
c("AL","AL","AL","AL","AL","AL","AL","AL","AL","AL",
"UK","UK","UK","UK","UK","UK","UK","UK","UK","UK",
"FR","FR","FR","FR","FR","FR","FR","FR","FR","FR"),
Obs = c(NA,NA,2,3,2,3,2,3,2,NA,1,2,1,2,1,2,1,2,1,2,NA,NA,NA,NA,NA,NA,NA,NA,4,NA))
df1
Country Obs
<chr> <dbl>
1 AL NA
2 AL NA
3 AL 2
4 AL 3
5 AL 2
6 AL 3
7 AL 2
8 AL 3
9 AL 2
10 AL NA
11 UK 1
12 UK 2
13 UK 1
14 UK 2
15 UK 1
16 UK 2
17 UK 1
18 UK 2
19 UK 1
20 UK 2
21 FR NA
22 FR NA
23 FR NA
24 FR NA
25 FR NA
26 FR NA
27 FR NA
28 FR NA
29 FR 4
30 FR NA
Now, what I want is to divide each cell with number of observations available for each country / total obs like so,
df2 <- data_frame(Country =
c("AL","AL","AL","AL","AL","AL","AL","AL","AL","AL",
"UK","UK","UK","UK","UK","UK","UK","UK","UK","UK",
"FR","FR","FR","FR","FR","FR","FR","FR","FR","FR"),
Obs = c(NA,NA,2*7/10,3*7/10,2*7/10,3*7/10,2*7/10,3*7/10,2*7/10,
NA,1*10/10,2*10/10,1*10/10,2*10/10,1*10/10,2*10/10,1*10/10,
2*10/10,1*10/10,2*10/10,NA,NA,NA,NA,NA,NA,NA,NA,4*1/10,NA))
df2
Country Obs
<chr> <dbl>
1 AL NA
2 AL NA
3 AL 1.4
4 AL 3.7
5 AL 2.7
6 AL 3.7
7 AL 2.7
8 AL 3.7
9 AL 2.7
10 AL NA
11 UK 1
12 UK 2
13 UK 1
14 UK 2
15 UK 1
16 UK 2
17 UK 1
18 UK 2
19 UK 1
20 UK 2
21 FR NA
22 FR NA
23 FR NA
24 FR NA
25 FR NA
26 FR NA
27 FR NA
28 FR NA
29 FR 0.4
30 FR NA
I am interested in solving the problem obviously BUT I would really really appreciate it if you could show me how to do this for multiple columns as my original data needs this same operation done for many columns where the country tickers (AL, UK, FR in example) remains the same.
You can do :
library(dplyr)
df1 %>%
group_by(Country) %>%
mutate(Obs = Obs * sum(!is.na(Obs))/n()) %>%
ungroup
# Country Obs
# <chr> <dbl>
# 1 AL NA
# 2 AL NA
# 3 AL 1.4
# 4 AL 2.1
# 5 AL 1.4
# 6 AL 2.1
# 7 AL 1.4
# 8 AL 2.1
# 9 AL 1.4
#10 AL NA
# … with 20 more rows
sum(!is.na(Obs)) counts number of non-NA values in the Country whereas n() gives the number of rows for the Country.
For multiple columns -
df1 %>%
group_by(Country) %>%
mutate(across(col1:col4, ~. * sum(!is.na(.))/n())) %>%
ungroup
This will be applied to col1 to col4 in your dataframe.
Using data.table
library(data.table)
setDT(df1)[, Obs := Obs * mean(!is.na(Obs)), County]
Or using dplyr
library(dplyr)
df1 %>%
group_by(Country) %>%
mutate(Obs = Obs * mean(!is.na(Obs)))

Count number of instances above a varying threshold

I have the 0.95 percentile threshold for temperature for each country. In the example below a week is 4 days. I want to count in a new vector/single-column-dataframe how many days each individual country's temperature is over that country's threshold on a weekly basis.
The country 95% percentile temperatures are:
q95 <- c(26,21,22,20,23)
DailyTempCountry <- data.frame(Date = c("W1D1","W1D2","W1D3","W1D4","W2D1","W2D2","W2D3","W2D4",
"W1D1","W1D2","W1D3","W1D4","W2D1","W2D2","W2D3","W2D4",
"W1D1","W1D2","W1D3","W1D4","W2D1","W2D2","W2D3","W2D4",
"W1D1","W1D2","W1D3","W1D4","W2D1","W2D2","W2D3","W2D4",
"W1D1","W1D2","W1D3","W1D4","W2D1","W2D2","W2D3","W2D4"),
Country = c("AL","AL", "AL", "AL","AL","AL", "AL", "AL",
"BE","BE", "BE", "BE", "BE","BE", "BE", "BE",
"CA","CA", "CA", "CA","CA","CA", "CA", "CA",
"DE","DE", "DE", "DE","DE","DE", "DE", "DE",
"UK","UK", "UK", "UK","UK","UK", "UK", "UK"),
DailyTemp = c(27,25,20,22,20,20,27,27,
24,22,23,18,17,19,20,16,
23,23,23,23,27,26,20,26,
19,18,17,19,16,15,19,18,
20,24,24,20,19,25,19,25))
DailyTempCountry
Date Country DailyTemp
1 W1D1 AL 27
2 W1D2 AL 25
3 W1D3 AL 20
4 W1D4 AL 22
5 W2D1 AL 20
6 W2D2 AL 20
7 W2D3 AL 27
8 W2D4 AL 27
9 W1D1 BE 24
10 W1D2 BE 22
11 W1D3 BE 23
12 W1D4 BE 18
13 W2D1 BE 17
14 W2D2 BE 19
15 W2D3 BE 20
16 W2D4 BE 16
17 W1D1 CA 23
18 W1D2 CA 23
19 W1D3 CA 23
20 W1D4 CA 23
21 W2D1 CA 27
22 W2D2 CA 26
23 W2D3 CA 20
24 W2D4 CA 26
25 W1D1 DE 19
26 W1D2 DE 18
27 W1D3 DE 17
28 W1D4 DE 19
29 W2D1 DE 16
30 W2D2 DE 15
31 W2D3 DE 19
32 W2D4 DE 18
33 W1D1 UK 20
34 W1D2 UK 24
35 W1D3 UK 24
36 W1D4 UK 20
37 W2D1 UK 19
38 W2D2 UK 25
39 W2D3 UK 19
40 W2D4 UK 25
What I want is a vector/column that counts the number of days in that week above the country's threshold like this:
DaysInWeekAboveQ95 <- c(1,2,3,0,4,3,0,0,2,2)
df_right <- data.frame(Week = c("W1","W2","W1","W2","W1","W2","W1","W2","W1","W2"),
DaysInWeekAboveQ95 = c(1,2,3,0,4,3,0,0,2,2))
Week DaysInWeekAboveQ95
1 W1 1
2 W2 2
3 W1 3
4 W2 0
5 W1 4
6 W2 3
7 W1 0
8 W2 0
9 W1 2
10 W2 2
The q95% vector was
q95 <- c(26,21,22,20,23)
so in the first week AL have 1 instance above its threshold value 26. UK have 2 instances above 23 (UK's threshold) in the second week. And so for every country and every week.
I handled a similar problem but where the threshold did not vary by country but was just a constant 30 degrees (where I divide by 7 because seven days in week)
DaysAbove30perWeek <- as.data.frame(tapply(testdlong$value > 30,
ceiling(seq(nrow(testdlong))/7),sum))
Maybe a solution is to loop over countries? However, I can't figure out how to incorporate the specific loop. Other solutions are welcome.
In revised scenario you also need calculating a new column for week too
q95 <- c(26,21,22,20,23)
c_q95 <- data.frame(Country = unique(DailyTempCountry$Country),
threshold = q95)
library(dplyr)
DailyTempCountry %>% left_join(c_q95, by = 'Country') %>%
group_by(Country, Week = substr(Date, 1, 2)) %>%
summarise(days = sum(DailyTemp > threshold), .groups = 'drop')
# A tibble: 10 x 3
Country Week days
<chr> <chr> <int>
1 AL W1 1
2 AL W2 2
3 BE W1 3
4 BE W2 0
5 CA W1 4
6 CA W2 3
7 DE W1 0
8 DE W2 0
9 UK W1 2
10 UK W2 2
Created on 2021-05-05 by the reprex package (v2.0.0)
OP has asked that date variable is in some different format than given in sample data
time <- as.character(20000101:20000130)
> time
[1] "20000101" "20000102" "20000103" "20000104" "20000105" "20000106" "20000107" "20000108" "20000109" "20000110"
[11] "20000111" "20000112" "20000113" "20000114" "20000115" "20000116" "20000117" "20000118" "20000119" "20000120"
[21] "20000121" "20000122" "20000123" "20000124" "20000125" "20000126" "20000127" "20000128" "20000129" "20000130"
library(lubridate)
time <- ymd(time)
# Either ISO week
isoweek(time)
# or week
week(time)
> isoweek(time)
[1] 52 52 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4
> # or week
> week(time)
[1] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5
library(lubridate)
time <- ymd(time)
isoweek(time)
week(time)

grouping data in R and summing by decade

I have the following dataset:
ireland england france year
5 3 2 1920
4 3 4 1921
6 2 1 1922
3 1 5 1930
2 5 2 1931
I need to summarise the data by 1920's and 1930's. So I need total points for ireland, england and france in the 1920-1922 and then another total point for ireland,england and france in 1930,1931.
Any ideas? I have tried but failed.
Dataset:
x <- read.table(text = "ireland england france
5 3 2 1920
4 3 4 1921
6 2 1 1922
3 1 5 1930
2 5 2 1931", header = T)
How about dividing the years by 10 and then summarizing?
library(dplyr)
x %>% mutate(decade = floor(year/10)*10) %>%
group_by(decade) %>%
summarize_all(sum) %>%
select(-year)
# A tibble: 2 x 5
# decade ireland england france
# <dbl> <int> <int> <int>
# 1 1920 15 8 7
# 2 1930 5 6 7
An R base solution
As A5C1D2H2I1M1N2O1R2T1 mentioned, you can use findIntervals() to set corresponding decade for each year and then, an aggregate() to group py decade
txt <-
"ireland england france year
5 3 2 1920
4 3 4 1921
6 2 1 1922
3 1 5 1930
2 5 2 1931"
df <- read.table(text=txt, header=T)
decades <- c(1920, 1930, 1940)
df$decade<- decades[findInterval(df$year, decades)]
aggregate(cbind(ireland,england,france) ~ decade , data = df, sum)
Output:
decade ireland england france
1 1920 15 8 7
2 1930 5 6 7

filter a df with NA to get only individuals that appear more than one time in r

I am using a national survey to run a regression: the survey is conducted every two years and some individual are repeatedly interviewed while others just one time.
Now I want to make the df a panel one (have only the individual that appears more than one time). The df is like this:
year nquest nord nordp sex age
2000 10 1 1 F 40
2000 10 2 2 M 43
2000 30 1 1 M 30
2002 10 1 1 F 42
2002 10 2 2 M 45
2002 10 3 NA F 15
2002 30 1 1 M 32
2004 10 1 1 F 44
2004 10 2 2 M 47
2004 10 3 3 F 17
2004 50 1 NA M 66
where nquest is the code number of the family, nord is the code number of the individual and nordp is the code number that the individual had in the previous survey; when a new individual is interviewed the value in nordp is "missing" (R automatically insert NA). For example the individual 3 of family 10 has nordp=NA in 2002 because it is the first time that she is interviewed, while in 2004 nordp is 3 (because 3 was the number that she had in 2002).
I can't use nord to filter the df because the composition of the family may change (for example in 2002 in family x the mother has nordp=2 (it means that in 2000 nord was 2) and nord=2 but the next year nord could be 1 (for example if she gets divorced) but nordp is still 2).
I tried to filter using this command:
df <- df %>%
group_by(nquest, nordp)
filter(n()>1)
but I don't get the right df because if for the same family there are more than one individual insert (NA) they will be considered as the same person since nordp is NA the first time.
How can I consider also the individual that appears for the first time in a certain year (nordp=NA)? I tried to a create a command using age (the age in t shoul be equal to (age (in t-2) + 2; for example in 2000 age is 20, in 2002 is 22) but it didn't worked.
Consider that the df is composed by thousand observations and I can't check manually.
The final df should be:
year nquest nordp sex age
2000 10 1 F 40
2000 10 2 M 43
2000 30 1 M 30
2002 10 1 F 42
2002 10 2 M 45
2002 10 3 F 15
2002 30 1 M 32
2004 10 1 F 44
2004 10 2 M 47
2004 10 3 F 17
As you can see there are only the individual that appears more than one time and nquest=10 nordp=30 appears three times; with my command it appears just two times because in the first year nordp was NA.
We wish to assign unique IDs to individuals, then filter by the count of unique IDs. The main idea is to chain together the nordp and nord values within each family over years. Here's an idea inspired by Identify groups of linked episodes which chain together. First, load the igraph package, via library(igraph). Then the following function assigns IDs for a given family.
assignID <- function(d) {
fields <- names(d) # store original column names
d$nordp[is.na(d$nordp)] <- seq_len(sum(is.na(d$nordp))) + 100
d$nordp_x <- (d$year-2) * 1000 + d$nordp
d$nord_x <- d$year * 1000 + d$nord
dd <- d[, c("nordp_x", "nord_x")]
gr.test <- graph.data.frame(dd)
links <- data.frame(org_id = unique(unlist(dd)),
id = clusters(gr.test)$membership)
d <- merge(d, links, by.x = "nord_x", by.y = "org_id", all.x = TRUE)
d$uid <- d$nquest * 100 + d$id
d[, c(fields, "uid")]
}
The function can "tell", for example, that
year nordp nord
2000 1 1
2002 1 2
2004 2 3
is the same individual, by chaining together the nordp and nord over the years, and assigns the same unique ID to all 3 rows. So, for example,
assignID(subset(df, nquest == 10))
# year nquest nord nordp sex age dob uid
# 1 2000 10 1 1 F 40 1960 1001
# 2 2000 10 2 2 M 43 1957 1002
# 3 2002 10 1 1 F 42 1960 1001
# 4 2002 10 2 2 M 45 1957 1002
# 5 2002 10 3 101 F 15 1987 1003
# 6 2004 10 1 1 F 44 1960 1001
# 7 2004 10 2 2 M 47 1957 1002
# 8 2004 10 3 3 F 17 1987 1003
gives us an additional column with the uid for each individual.
The remaining steps are straightforward. We split the dataframe by nquest, apply assignID to each subset, and rbind the output:
dd <- do.call(rbind, by(df, df$nquest, assignID))
Then we can just group by uid and filter by count:
dd %>% group_by(uid) %>% filter(n()>1)
# Source: local data frame [10 x 8]
# Groups: uid [4]
# year nquest nord nordp sex age dob uid
# <int> <int> <int> <dbl> <fctr> <int> <int> <dbl>
# 1 2000 10 1 1 F 40 1960 1001
# 2 2000 10 2 2 M 43 1957 1002
# 3 2002 10 1 1 F 42 1960 1001
# 4 2002 10 2 2 M 45 1957 1002
# 5 2002 10 3 101 F 15 1987 1003
# 6 2004 10 1 1 F 44 1960 1001
# 7 2004 10 2 2 M 47 1957 1002
# 8 2004 10 3 3 F 17 1987 1003
# 9 2000 30 1 1 M 30 1970 3001
# 10 2002 30 1 1 M 32 1970 3001

Calculating yearly growth-rates from quarterly, long form data in r

My data takes the following form:
df <- data.frame(Sector=c(rep("A",8),rep("B",8)), Country = c(rep("USA", 16)),
Quarter=rep(1:8,2),Income=20:35)
df2 <- data.frame(Sector=c(rep("A",8),rep("B",8)), Country = c(rep("UK", 16)),
Quarter=rep(1:8,2),Income=32:47)
df <- rbind(df, df2)
What I want to do is to calculate the growth rate from the first quarter each year to the first quarter the second year, within country and sector. In the example above it would be the growth rate from quarter 1 to quarter 5. So for Sector A, in the USA, it would be (24/20)-1=0.2
I then want to append this data to the dataframe as a new column.
I looked at the solutions in:
How calculate growth rate in long format data frame?
But didn't have the r-skills to get it to work if the lag is more then one time-unit. Any suggestions?
ADDITION
So what i want is the growth-rate, that is (24/20)-1=0.2 in the example below. Not 1-(24/20), which I first wrote. The desired output should look something like this:
Sector Country Quarter Income growth
(fctr) (fctr) (int) (int) (dbl)
1 A USA 1 20 NA
2 A USA 2 21 NA
3 A USA 3 22 NA
4 A USA 4 23 NA
5 A USA 5 24 0.2
6 A USA 6 25 0.1904
7 A USA 7 26 0.1818
I think you need something like this:
library(dplyr)
df %>%
#group by sector and country
group_by(Sector, Country) %>%
#calculate growth as (quarter / 5-period-lagged quarter) - 1
mutate(growth = Income / lag(Income, 4) - 1)
Output
Source: local data frame [32 x 5]
Groups: Sector, Country [4]
Sector Country Quarter Income growth
(fctr) (fctr) (int) (int) (dbl)
1 A USA 1 20 NA
2 A USA 2 21 NA
3 A USA 3 22 NA
4 A USA 4 23 NA
5 A USA 5 24 0.2000000
6 A USA 6 25 0.1904762
7 A USA 7 26 0.1818182
8 A USA 8 27 0.1739130
9 B USA 1 28 NA
10 B USA 2 29 NA
.. ... ... ... ... ...
df3 = copy(df)
df3$Quarter = df3$Quarter - 4
df = merge(df,df3,c('Sector','Country','Quarter'), suffixes = c('','_prev'), all.x = T)
df$growth = 1 - (df$Income_prev/df$Income
> df
Sector Country Quarter Income Income_prev growth
1 A USA 1 20 24 -4
2 A USA 2 21 25 -4
3 A USA 3 22 26 -4
4 A USA 4 23 27 -4
5 A USA 5 24 NA NA
6 A USA 6 25 NA NA
7 A USA 7 26 NA NA
8 A USA 8 27 NA NA
9 A UK 1 32 36 -4
10 A UK 2 33 37 -4
11 A UK 3 34 38 -4
12 A UK 4 35 39 -4
13 A UK 5 36 NA NA
14 A UK 6 37 NA NA
15 A UK 7 38 NA NA
16 A UK 8 39 NA NA
17 B USA 1 28 32 -4
18 B USA 2 29 33 -4
19 B USA 3 30 34 -4
20 B USA 4 31 35 -4
21 B USA 5 32 NA NA
22 B USA 6 33 NA NA
23 B USA 7 34 NA NA
24 B USA 8 35 NA NA
25 B UK 1 40 44 -4
26 B UK 2 41 45 -4
27 B UK 3 42 46 -4
28 B UK 4 43 47 -4
29 B UK 5 44 NA NA
30 B UK 6 45 NA NA
31 B UK 7 46 NA NA
32 B UK 8 47 NA NA
>

Resources