R subset dataframe where no observations of certain variables

I have a dataframe that looks like this:
country  sector  data1  data2
France   1       7      .
France   2       10     .
belgium  1       12     7
belgium  2       14     8
I want to drop columns that are missing for a country in all sectors. In this example I would like to drop/exclude data2 because it is missing for sectors 1 and 2 for France. To be clear, I would also be throwing out the values of data2 for belgium in this example.
My expected output would look like this:
country  sector  data1
France   1       7
France   2       10
belgium  1       12
belgium  2       14
data2 is now excluded because all of its values are missing across every sector in France.

We can group by country, create logical columns that flag whether the count of NA elements equals the group size, ungroup, replace the flagged columns with NA, and then remove those columns in select:
library(dplyr)
library(stringr)
df1 %>%
group_by(country) %>%
mutate(across(everything(), ~ sum(is.na(.x)) == n(),
.names = "{.col}_lgl")) %>%
ungroup %>%
mutate(across(names(df1)[-1], ~ if(any(get(str_c(cur_column(),
"_lgl")) )) NA else .x)) %>%
select(c(where(~ !is.logical(.x) && any(complete.cases(.x)))))
Output:
# A tibble: 4 × 3
country sector data1
<chr> <int> <int>
1 France 1 7
2 France 2 10
3 belgium 1 12
4 belgium 2 14
If we don't use group_by, the steps can be simplified as shown in Maël's post, i.e. do the grouping with a base R function inside select; either tapply or ave can work:
df1 %>%
select(where(~ !any(tapply(is.na(.x), df1[["country"]],
FUN = all))))
data
df1 <- structure(list(country = c("France", "France", "belgium", "belgium"
), sector = c(1L, 2L, 1L, 2L), data1 = c(7L, 10L, 12L, 14L), data2 = c(NA,
NA, 7L, 8L)), row.names = c(NA, -4L), class = "data.frame")

In base R:
df1 <- read.table(header = T, text = "country sector data1 data2
France 1 7 NA
France 2 10 2
belgium 1 12 7
belgium 2 14 8")
df2 <- read.table(header = T, text = "country sector data1 data2
France 1 7 NA
France 2 10 NA
belgium 1 12 7
belgium 2 14 8")
df1[!sapply(df1, \(x) any(ave(x, df1$country, FUN = \(y) all(is.na(y)))))]
# country sector data1 data2
# 1 France 1 7 NA
# 2 France 2 10 2
# 3 belgium 1 12 7
# 4 belgium 2 14 8
df2[!sapply(df2, \(x) any(ave(x, df2$country, FUN = \(y) all(is.na(y)))))]
# country sector data1
# 1 France 1 7
# 2 France 2 10
# 3 belgium 1 12
# 4 belgium 2 14
Note: \(x) is shorthand for function(x), available since R 4.1.0.

For a base R solution, you can apply a function over the column names and keep only the columns that contain no NA values at all (note that this check ignores the country grouping):
keep_remove <- sapply(names(data), \(x) all(!is.na(data[[x]])))
data <- data[, keep_remove]
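If the check has to be done per country rather than over the whole column, a minimal base R sketch could look like the following. The helper name drop_group_missing and the example_df object are hypothetical, introduced only for illustration; the data mirror the question's example with the "." values read as NA.

```r
# Hypothetical helper (not from any package): drop every column that is
# entirely NA within at least one group of `group_col`.
drop_group_missing <- function(data, group_col) {
  groups <- data[[group_col]]
  # For each column: is there a group in which all values are NA?
  all_na_somewhere <- vapply(
    data,
    function(x) any(tapply(is.na(x), groups, all)),
    logical(1)
  )
  # drop = FALSE keeps a data frame even if a single column survives
  data[, !all_na_somewhere, drop = FALSE]
}

# Example data assumed from the question
example_df <- data.frame(
  country = c("France", "France", "belgium", "belgium"),
  sector  = c(1L, 2L, 1L, 2L),
  data1   = c(7L, 10L, 12L, 14L),
  data2   = c(NA, NA, 7L, 8L)
)

result <- drop_group_missing(example_df, "country")
names(result)
```

Applying tapply to is.na(x) rather than to x itself keeps the intermediate result logical for every column type, including character columns such as country.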

Related

Is there any way to generate year column from existing column names in R?

I am working with a dataset that has the corresponding year attached to variable names as suffix, e.g. AXOX1991, where AXO is the variable. I am trying to separate the year from the variable label/column names to generate a year column so that the dataset can be analyzed as time-series data.
In other words, the existing dataset looks like:
Country      AXOX1991  AXOX1992  BXOX1991  BXOX1992  CXOX1991  CXOX1992
Afghanistan  1         2         3         4         5         6
USA          6         5         4         3         2         1
And I am trying to create the following:
Country      Year  AXO  BXO  CXO
Afghanistan  1991  1    3    5
Afghanistan  1992  2    4    6
USA          1991  6    4    2
USA          1992  5    3    1
As you can see, X not only acts as the delimiter that divides the variable name and the year, but it is also part of the variable name. Is there any way in R to separate the year from the variable name in existing column names and then to create a year column as shown above?
I have been thinking of workarounds, such as loops, but I haven't gotten very far, and I'm truly stumped. I have more than 900 variable-years, so I want to avoid doing it by hand if possible.
Thank you!
For the sake of completeness, here is a solution using melt() with the new measure() function (introduced with data.table v1.14.1):
library(data.table) # development version 1.14.1
melt(setDT(df), measure.vars = measure(value.name, year,
pattern = "(\\w{3})X(\\d{4})"))
Country year AXO BXO CXO
1: Afghanistan 1991 1 3 5
2: USA 1991 6 4 2
3: Afghanistan 1992 2 4 6
4: USA 1992 5 3 1
Data
library(data.table)
df <- fread("Country AXOX1991 AXOX1992 BXOX1991 BXOX1992 CXOX1991 CXOX1992
Afghanistan 1 2 3 4 5 6
USA 6 5 4 3 2 1")
You can make use of tidyr::pivot_longer -
res <- tidyr::pivot_longer(df, cols = -Country,
names_to = c('.value', 'Year'),
names_pattern = '([A-Z]+)X(\\d+)')
res
# Country Year AXO BXO CXO
# <chr> <chr> <int> <int> <int>
#1 Afghanistan 1991 1 3 5
#2 Afghanistan 1992 2 4 6
#3 USA 1991 6 4 2
#4 USA 1992 5 3 1
data
df <- structure(list(Country = c("Afghanistan", "USA"), AXOX1991 = c(1L,
6L), AXOX1992 = c(2L, 5L), BXOX1991 = 3:4, BXOX1992 = 4:3, CXOX1991 = c(5L,
2L), CXOX1992 = c(6L, 1L)), class = "data.frame", row.names = c(NA, -2L))
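For comparison, the same reshaping can be sketched without any packages. This is only an illustration under two assumptions: every value column ends in "X" followed by a four-digit year, and every stem appears for every year. The object names wide, stem, year and long are made up for the example.

```r
# Example data assumed from the question
wide <- data.frame(
  Country  = c("Afghanistan", "USA"),
  AXOX1991 = c(1L, 6L), AXOX1992 = c(2L, 5L),
  BXOX1991 = c(3L, 4L), BXOX1992 = c(4L, 3L),
  CXOX1991 = c(5L, 2L), CXOX1992 = c(6L, 1L)
)

value_cols <- setdiff(names(wide), "Country")

# The greedy (.*) runs up to the LAST "X" that is followed by four digits,
# so the "X" inside the variable name (AXO) stays in the stem.
stem <- sub("^(.*)X(\\d{4})$", "\\1", value_cols)
year <- as.integer(sub("^(.*)X(\\d{4})$", "\\2", value_cols))

# Build one block of rows per year, then stack the blocks
long <- do.call(rbind, lapply(sort(unique(year)), function(y) {
  block <- wide[, value_cols[year == y], drop = FALSE]
  names(block) <- stem[year == y]
  cbind(Country = wide$Country, Year = y, block)
}))

long <- long[order(long$Country, long$Year), ]
rownames(long) <- NULL
long
```

With 900+ variable-years the package solutions above are likely the more convenient choice; the sketch is mainly meant to show how the regular expression separates the stem from the year.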

Select sample from a grouping variable depending on another grouping in R

I have the following data frame with 1,000 rows: 10 cities, each with 100 rows. I would like to randomly select 10 names per city, sampling by Year, and the 10 sampled names should be spread across the years in the city, i.e. the 10 names for City 1 should not all come from 1996, for instance.
City Year name
1 1 1996 b
2 1 1996 c
3 1 1997 d
4 1 1997 e
...
101 2 1996 f
102 2 1996 g
103 2 1997 h
104 2 1997 i
Desired Final Sample Data
City Year name
1 1 1996 b
2 1 1998 c
3 1 2001 d
...
11 2 1997 g
12 2 1999 h
13 2 2005 b
...
21 3 1998 a
22 3 2010 c
23 3 2005 d
Sample Data
df1 <- data.frame(City = rep(1:10, each = 100),
Year = rep(1996:2015, each = 5),
name = rep(letters[1:25], 40))
I am failing to randomly select the 10 sample names by Year (without repeating years, unless the number of years in a city is less than 10) for all 10 cities. How can I get around this?
The final sample should have 10 names for each city, and years should not repeat unless there are fewer than 10 of them in that city.
Thank you.
First group by City and use sample_n to sample a sub-dataframe.
Then group by City and Year, and sample from name one element per group. Don't forget to set the RNG seed in order to make the result reproducible.
library(dplyr)
set.seed(2020)
df1 %>%
group_by(City) %>%
sample_n(min(n(), 10)) %>%
ungroup() %>%
group_by(City, Year) %>%
summarise(name = sample(name, 1))
#`summarise()` regrouping output by 'City' (override with `.groups` argument)
## A tibble: 4 x 3
## Groups: City [2]
# City Year name
# <int> <int> <chr>
#1 1 1996 b
#2 1 1997 e
#3 2 1996 f
#4 2 1997 h
Data
df1 <- read.table(text = "
City Year name
1 1 1996 b
2 1 1996 c
3 1 1997 d
4 1 1997 e
101 2 1996 f
102 2 1996 g
103 2 1997 h
104 2 1997 i
", header = TRUE)
Edit
Instead of reinventing the wheel, use package sampling, function strata to get an index into the data set and then filter its corresponding rows.
library(dplyr)
library(sampling)
set.seed(2020)
df1 %>%
mutate(row = row_number()) %>%
filter(row %in% strata(df1, stratanames = c('City', 'Year'), size = rep(1, 1000), method = 'srswor')$ID_unit) %>%
select(-row) %>%
group_by(City) %>%
sample_n(10) %>%
arrange(City, Year)
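As a package-free alternative, here is a rough base R sketch of one way to read the requirement: pick up to 10 distinct years per city first, then draw one row from each picked year. The function name sample_city and its n_names argument are hypothetical. When a city has fewer than 10 years, the sketch reuses years to reach 10, so the same row may then be drawn more than once.

```r
set.seed(2020)

# Sample data from the question
df1 <- data.frame(City = rep(1:10, each = 100),
                  Year = rep(1996:2015, each = 5),
                  name = rep(letters[1:25], 40))

# Hypothetical helper: sample n_names rows for one city
sample_city <- function(city_rows, n_names = 10) {
  years <- unique(city_rows$Year)
  picked <- if (length(years) >= n_names) {
    # Enough years: take n_names distinct ones
    years[sample.int(length(years), n_names)]
  } else {
    # Too few years: use each once, then fill up with repeats
    extra <- sample.int(length(years), n_names - length(years), replace = TRUE)
    c(years, years[extra])
  }
  # One randomly chosen row for each picked year
  idx <- vapply(picked, function(y) {
    candidates <- which(city_rows$Year == y)
    candidates[sample.int(length(candidates), 1)]
  }, integer(1))
  city_rows[idx, ]
}

sampled <- do.call(rbind, lapply(split(df1, df1$City), sample_city))
sampled <- sampled[order(sampled$City, sampled$Year), ]
rownames(sampled) <- NULL
head(sampled)
```

Indexing with sample.int() instead of calling sample() on the vector directly avoids the surprise that sample(x, 1) treats a single number x as the range 1:x.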

Using tidyverse and pipes how do I assign fixed rows

Given this dataframe
X1 X2
2001 NA
abc 10
def 12
xo 13
2002 NA
abc 10
efd 22
dd 23
2005 NA
a 30
All the years have NA in X2. My goal is to get this data frame to become
X1 X2 Date
abc 10 2001
def 12 2001
xo 13 2001
abc 10 2002
efd 22 2002
dd 23 2002
a 30 2005
That is, the years became their own column and the NA's have been dropped
What I tried
a = read_csv("given.csv")
a %>% mutate(Date = ifelse(is.na(X2), X1, NA))
This turns the first dataframe to
X1 X2 Date
2001 NA 2001
abc 10 NA
def 12 NA
xo 13 NA
2002 NA 2002
abc 10 NA
efd 22 NA
dd 23 NA
2005 NA 2005
a 30 NA
I'm not sure how to replace the NAs in the Date column with the value above them for each year. After that I think I can just use drop_na and it will be what I want.
Another option:
library(dplyr)
library(zoo)
a %>%
mutate(Date = na.locf(case_when(is.na(X2) ~ X1))) %>%
na.omit
Output:
X1 X2 Date
2 abc 10 2001
3 def 12 2001
4 xo 13 2001
6 abc 10 2002
7 efd 22 2002
8 dd 23 2002
10 a 30 2005
If you want to reset row numbers just use filter(!is.na(X2)) instead of na.omit.
P.S. You can of course just load tidyverse and do something like:
library(tidyverse)
a %>%
mutate(Date = case_when(is.na(X2) ~ X1)) %>%
fill(Date) %>%
drop_na
Note, however, that fill can be slower than the na.locf function from zoo on large data.
We can create a grouping column from the cumulative sum of a flag marking the digits-only elements (^\\d+$) of 'X1', set 'Date' to the first element of 'X1' in each group, ungroup, and remove the NA rows:
library(dplyr)
library(stringr)
a %>%
group_by(grp = cumsum(str_detect(X1, '^\\d+$'))) %>%
mutate(Date = first(X1)) %>%
ungroup %>%
select(-grp) %>%
na.omit
# A tibble: 7 x 3
# X1 X2 Date
# <chr> <int> <chr>
#1 abc 10 2001
#2 def 12 2001
#3 xo 13 2001
#4 abc 10 2002
#5 efd 22 2002
#6 dd 23 2002
#7 a 30 2005
Or using data.table with zoo
library(data.table)
library(zoo)
na.omit(setDT(a)[, Date := na.locf(fifelse(is.na(X2), X1, NA_character_))])
data
a <- structure(list(X1 = c("2001", "abc", "def", "xo", "2002", "abc",
"efd", "dd", "2005", "a"), X2 = c(NA, 10L, 12L, 13L, NA, 10L,
22L, 23L, NA, 30L)), class = "data.frame", row.names = c(NA,
-10L))
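The cumulative-sum idea also works without any packages. The sketch below assumes that the header rows are exactly the rows where X2 is NA and that the first row is a header; the names is_year, block and result are made up for the example.

```r
# Example data from the question
a <- data.frame(
  X1 = c("2001", "abc", "def", "xo", "2002", "abc", "efd", "dd", "2005", "a"),
  X2 = c(NA, 10L, 12L, 13L, NA, 10L, 22L, 23L, NA, 30L)
)

# Header rows are the ones with a missing X2 (assumption)
is_year <- is.na(a$X2)

# Every header starts a new block, so cumsum() numbers the blocks 1, 2, 3, ...
block <- cumsum(is_year)

# Look up each row's header value by its block number
a$Date <- a$X1[is_year][block]

# Drop the header rows and renumber
result <- a[!is_year, ]
rownames(result) <- NULL
result
```

If the data could start with non-header rows, block would contain zeros and the lookup would need an extra guard.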

Tidy Data: Rename columns, get non-NA column names, then gather

I've got a rather ugly bit of data to tidy up and need help! What my data look like now:
countries <- c("Austria", "Belgium", "Croatia")
df <- tibble("age" = c(28,42,19, 67),
"1_recreate_1"=c(NA,15,NA,NA),
"1_recreate_2"=c(NA,10,NA,NA),
"1_recreate_3"=c(NA,8,NA,NA),
"1_recreate_4"=c(NA,4,NA,NA),
"1_fairness" = c(NA, 7, NA, NA),
"1_confidence" = c(NA, 5, NA, NA),
"2_recreate_1"=c(29,NA,NA,30),
"2_recreate_2"=c(20,NA,NA,24),
"2_recreate_3"=c(15,NA,NA,15),
"2_recreate_4"=c(11,NA,NA,9),
"2_fairness" = c(4, NA, NA, 1),
"2_confidence" = c(5, NA, NA, 4),
"3_recreate_1"=c(NA,NA,50,NA),
"3_recreate_2"=c(NA,NA,40,NA),
"3_recreate_3"=c(NA,NA,30,NA),
"3_recreate_4"=c(NA,NA,20,NA),
"3_fairness" = c(NA, NA, 2, NA),
"3_confidence" = c(NA, NA, 2, NA),
"overall" = c(3,3,2,5))
What I need them to look like at the end (hard-coding it):
df <- tibble(age = rep(c(28,42,19,67), each=4),
country = rep(c("Belgium", "Austria", "Croatia", "Belgium"), each=4),
recreate = rep(1:4, times=4),
fairness = rep(c(4,7,2,1), each=4),
confidence = rep(c(5,5,2,4), each=4),
allocation = c(29, 20, 15, 11,
15, 10, 8, 4,
50, 40, 30, 20,
30, 24, 15, 9),
overall = rep(c(3,3,2,5), each=4))
Steps to get there (I think!):
1. Replace the starting numbers for those columns using my list of countries.
The number that starts the string is the index in countries. In other words, 16_recreate_1 would correspond with the 16th country in the vector countries. I think the following code works (though am not sure it's exactly right):
for(i in length(countries):1){
colnames(df) <- str_replace(colnames(df), paste0(i,"_"), paste0(countries[i],"_"))
}
2. Create a new variable called "country" by getting the name of the column(s) that is NOT NA for each row.
I tried a BUNCH of experimentation with which.max and names, but couldn't get it fully functional.
3. Create new variables (recreate_1...recreate_4) that grab the [country_name]_recreate_1...[country_name]_recreate_4 value for each row, whatever country is non-NA for that person.
Maybe rowSums is the way to do this?
4. Make the data long instead of wide
I think this is going to require gather, but I'm not sure how to gather from only the variables country and recreate_1...recreate_4.
I'm so sorry this is so complex. Tidyverse solutions are preferred but any help is greatly appreciated!
A somewhat different tidyverse possibility could be:
df %>%
gather(variable, allocation, na.rm = TRUE) %>%
separate(variable, c("ID", "variable", "recreate"), convert = TRUE) %>%
left_join(data.frame(countries) %>%
mutate(country = countries,
ID = seq_along(countries)) %>%
select(-countries), by = c("ID" = "ID")) %>%
select(-variable, -ID)
recreate allocation country
<int> <dbl> <fct>
1 1 15 Austria
2 2 10 Austria
3 3 8 Austria
4 4 4 Austria
5 1 29 Belgium
6 1 30 Belgium
7 2 20 Belgium
8 2 24 Belgium
9 3 15 Belgium
10 3 15 Belgium
11 4 11 Belgium
12 4 9 Belgium
13 1 50 Croatia
14 2 40 Croatia
15 3 30 Croatia
16 4 20 Croatia
First, this transforms the data from wide to long format, removing the rows with NA. Second, it separates the variable names into three columns. Third, it turns the vector of countries into a data frame and assigns each country a unique ID. Finally, it joins the two and removes the redundant variables.
A solution to the edited question:
df %>%
select(matches("(recreate)")) %>%
rowid_to_column() %>%
gather(var, allocation, -rowid, na.rm = TRUE) %>%
separate(var, c("ID", "var", "recreate"), convert = TRUE) %>%
select(-var) %>%
left_join(data.frame(countries) %>%
mutate(country = countries,
ID = seq_along(countries)) %>%
select(-countries), by = c("ID" = "ID")) %>%
left_join(df %>%
select(-matches("(recreate)")) %>%
rowid_to_column() %>%
gather(var, val, -rowid, na.rm = TRUE) %>%
mutate(var = gsub("[^[:alpha:]]", "", var)) %>%
spread(var, val), by = c("rowid" = "rowid")) %>%
select(-rowid, -ID)
recreate allocation country age confidence fairness overall
<int> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 15 Austria 42 5 7 3
2 2 10 Austria 42 5 7 3
3 3 8 Austria 42 5 7 3
4 4 4 Austria 42 5 7 3
5 1 29 Belgium 28 5 4 3
6 1 30 Belgium 67 4 1 5
7 2 20 Belgium 28 5 4 3
8 2 24 Belgium 67 4 1 5
9 3 15 Belgium 28 5 4 3
10 3 15 Belgium 67 4 1 5
11 4 11 Belgium 28 5 4 3
12 4 9 Belgium 67 4 1 5
13 1 50 Croatia 19 2 2 2
14 2 40 Croatia 19 2 2 2
15 3 30 Croatia 19 2 2 2
16 4 20 Croatia 19 2 2 2
First, this selects the columns that contain recreate and adds a column with the row ID. Second, it follows the steps from the original solution. Third, it selects the columns that do not contain recreate, performs a wide-to-long transformation, removes the number from the column names, and transforms the data back to the original wide format. Finally, it joins the two on row ID and removes the redundant variables.
library(dplyr)
library(tidyr)
df %>% mutate(rid = row_number()) %>%
gather(key, val, -c(age, overall, rid, matches('recreate'))) %>%
mutate(country = sub('(^\\d)_.*', '\\1', key), country = countries[as.numeric(country)]) %>%
filter(!is.na(val)) %>% mutate(key = sub('(^\\d\\_)(.*)', '\\2', key)) %>%
# the column created above is `country` (lower case), so exclude it by that name
spread(key, val) %>%
gather(key = recreate, value = allocation, -c(rid, age, overall, country, confidence, fairness)) %>%
filter(!is.na(allocation)) %>% mutate(recreate = sub('.*_(\\d$)', '\\1', recreate))
Here (^\\d)_.* captures the leading digit (the country index), while .*_(\\d$) captures the trailing digit (the recreate number). Both patterns assume single-digit values.
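Since the question mentions names such as 16_recreate_1, a variant of these patterns that tolerates multi-digit indices may be useful. The following is a small base R sketch; the cols vector and the result names are invented for the illustration, and the patterns assume that every name starts with an index followed by an underscore.

```r
countries <- c("Austria", "Belgium", "Croatia")

# A few column names in the style of the question (illustrative subset)
cols <- c("1_recreate_1", "2_recreate_4", "3_fairness")

# Leading index: one or more digits before the first underscore
country_index <- as.integer(sub("^(\\d+)_.*$", "\\1", cols))
country <- countries[country_index]

# Variable part: everything after the leading index
variable <- sub("^\\d+_", "", cols)

# Trailing number, only for names that actually end in "_<digits>"
recreate <- ifelse(grepl("_\\d+$", variable),
                   sub("^.*_(\\d+)$", "\\1", variable),
                   NA)

data.frame(cols, country, variable, recreate)
```

Anchoring the patterns with ^ and $ keeps an index like 6 from matching inside 16, which is the risk with an unanchored replacement.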

Merging two data frames according to row values

I have two data frames, each with the same two columns: county codes and frequencies. They aren't identical, but some of the county code values show up in both data frames. Like this:
"county_code","freq"
"01011",2
"01051",1
"01073",9
"01077",1
"county_code","freq"
"01011",4
"01056",2
"01073",1
"01088",6
I want to merge them into a new data frame, such that if a county code appears in both data frames, their respective frequencies are added together. If the county code just appears in one or the other of the data frames, I want to add it (and its frequency) to the new data frame unchanged. The result should look like this:
"county_code","freq"
"01011",6
"01051",1
"01056",2
"01073",10
"01077",1
"01088",6
The result doesn't have to be ordered. I tried to use reshape for this, but I wasn't sure that was the right approach. Thoughts?
Combine the two data frames with rbind, then use aggregate to collapse multiple rows with the same county_code:
aggregate(freq~county_code, rbind(d1, d2) , FUN=sum)
## county_code freq
## 1 1011 6
## 2 1051 1
## 3 1073 10
## 4 1077 1
## 5 1056 2
## 6 1088 6
(Using the definitions in MrFlick's answer.)
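The county codes in the question carry leading zeros, which the output above does not show. A sketch that keeps them, assuming the codes are stored as character strings, could look like this (d1, d2 and combined are example names):

```r
# Codes kept as character so the leading zeros survive (assumption)
d1 <- data.frame(
  county_code = c("01011", "01051", "01073", "01077"),
  freq = c(2L, 1L, 9L, 1L),
  stringsAsFactors = FALSE
)
d2 <- data.frame(
  county_code = c("01011", "01056", "01073", "01088"),
  freq = c(4L, 2L, 1L, 6L),
  stringsAsFactors = FALSE
)

# Stack both tables, then sum freq within each county_code
combined <- aggregate(freq ~ county_code, data = rbind(d1, d2), FUN = sum)
combined
```

When reading such codes from a CSV file, the column would also need to be read as character (for example via the colClasses argument of read.csv) for the zeros to be preserved.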
Using base functions, you can do a merge() followed by transform(). Here are your sample input data.frames:
d1 <- data.frame(
county_code = c("1011", "1051", "1073", "1077"),
freq = c(2L, 1L, 9L, 1L)
)
d2 <- data.frame(
county_code = c("1011", "1056", "1073", "1088"),
freq = c(4L, 2L, 1L, 6L)
)
then you would just do
transform(merge(d1, d2, by="county_code", all=TRUE),
freq = rowSums(cbind(freq.x, freq.y), na.rm=TRUE),
freq.x = NULL, freq.y = NULL
)
to get
county_code freq
1 1011 6
2 1051 1
3 1056 2
4 1073 10
5 1077 1
6 1088 6
Here is one way, using dplyr to stack the two data frames and sum the values within each code.
# sample data
country <- c("01011", "01051", "01073", "01077")
value <- c(2,1,9,1)
foo <- data.frame(country, value, stringsAsFactors=F)
country <- c("01011","01056","01073","01088")
value <- c(4,2,1,6)
foo2 <- data.frame(country, value, stringsAsFactors=F)
library(dplyr)
# bind_rows() replaces rbind_list(), which is defunct in current dplyr
ana <- group_by(bind_rows(foo, foo2), country) %>%
summarize(count = sum(value))
ana
country count
1 01011 6
2 01051 1
3 01056 2
4 01073 10
5 01077 1
6 01088 6
The other idea I had was the following.
ana2 <- merge(foo, foo2, all = TRUE, by = "country")
country value.x value.y
1 01011 2 4
2 01051 1 NA
3 01056 NA 2
4 01073 9 1
5 01077 1 NA
6 01088 NA 6
bob2 <- ana2 %>%
rowwise() %>%
mutate(count = sum(value.x,value.y, na.rm = TRUE))
country value.x value.y count
1 01011 2 4 6
2 01051 1 NA 1
3 01056 NA 2 2
4 01073 9 1 10
5 01077 1 NA 1
6 01088 NA 6 6

Resources