Tidy Data: Rename columns, get non-NA column names, then gather - r

I've got a rather ugly bit of data to tidy up and need help! What my data look like now:
countries <- c("Austria", "Belgium", "Croatia")
df <- tibble("age" = c(28,42,19, 67),
"1_recreate_1"=c(NA,15,NA,NA),
"1_recreate_2"=c(NA,10,NA,NA),
"1_recreate_3"=c(NA,8,NA,NA),
"1_recreate_4"=c(NA,4,NA,NA),
"1_fairness" = c(NA, 7, NA, NA),
"1_confidence" = c(NA, 5, NA, NA),
"2_recreate_1"=c(29,NA,NA,30),
"2_recreate_2"=c(20,NA,NA,24),
"2_recreate_3"=c(15,NA,NA,15),
"2_recreate_4"=c(11,NA,NA,9),
"2_fairness" = c(4, NA, NA, 1),
"2_confidence" = c(5, NA, NA, 4),
"3_recreate_1"=c(NA,NA,50,NA),
"3_recreate_2"=c(NA,NA,40,NA),
"3_recreate_3"=c(NA,NA,30,NA),
"3_recreate_4"=c(NA,NA,20,NA),
"3_fairness" = c(NA, NA, 2, NA),
"3_confidence" = c(NA, NA, 2, NA),
"overall" = c(3,3,2,5))
What I need them to look like at the end (hard-coding it):
df <- tibble(age = rep(c(28,42,19,67), each=4),
country = rep(c("Belgium", "Austria", "Croatia", "Belgium"), each=4),
recreate = rep(1:4, times=4),
fairness = rep(c(4,7,2,1), each=4),
confidence = rep(c(5,5,2,4), each=4),
allocation = c(29, 20, 15, 11,
15, 10, 8, 4,
50, 40, 30, 20,
30, 24, 15, 9),
overall = rep(c(3,3,2,5), each=4))
Steps to get there (I think!):
1. Replace the starting numbers for those columns using my list of countries.
The number that starts the string is the index in countries. In other words, 16_recreate_1 would correspond with the 16th country in the vector countries. I think the following code works (though am not sure it's exactly right):
for(i in length(countries):1){
colnames(df) <- str_replace(colnames(df), paste0(i,"_"), paste0(countries[i],"_"))
}
2. Create a new variable called "country" by getting the name of the column(s) that is NOT NA for each row.
I tried a BUNCH of experimentation with which.max and names, but couldn't get it fully functional.
3. Create new variables (recreate_1...recreate_4) that grab the [country_name]_recreate_1...[country_name]_recreate_4 value for each row, whatever country is non-NA for that person.
Maybe rowSums is the way to do this?
4. Make the data long instead of wide
I think this is going to require gather, but I'm not sure how to gather from only the variables country and recreate_1...recreate_4.
I'm so sorry this is so complex. Tidyverse solutions are preferred but any help is greatly appreciated!

A somewhat different tidyverse approach could be:
df %>%
gather(variable, allocation, na.rm = TRUE) %>%
separate(variable, c("ID", "variable", "recreate"), convert = TRUE) %>%
left_join(data.frame(countries) %>%
mutate(country = countries,
ID = seq_along(countries)) %>%
select(-countries), by = c("ID" = "ID")) %>%
select(-variable, -ID)
recreate allocation country
<int> <dbl> <fct>
1 1 15 Austria
2 2 10 Austria
3 3 8 Austria
4 4 4 Austria
5 1 29 Belgium
6 1 30 Belgium
7 2 20 Belgium
8 2 24 Belgium
9 3 15 Belgium
10 3 15 Belgium
11 4 11 Belgium
12 4 9 Belgium
13 1 50 Croatia
14 2 40 Croatia
15 3 30 Croatia
16 4 20 Croatia
First, it reshapes the data from wide to long format, dropping the rows with NA. Second, it splits the variable names into three columns. Third, it turns the vector of countries into a data frame and assigns each country a unique ID. Finally, it joins the two and removes the redundant variables.
A solution to the edited question:
df %>%
select(matches("(recreate)")) %>%
rowid_to_column() %>%
gather(var, allocation, -rowid, na.rm = TRUE) %>%
separate(var, c("ID", "var", "recreate"), convert = TRUE) %>%
select(-var) %>%
left_join(data.frame(countries) %>%
mutate(country = countries,
ID = seq_along(countries)) %>%
select(-countries), by = c("ID" = "ID")) %>%
left_join(df %>%
select(-matches("(recreate)")) %>%
rowid_to_column() %>%
gather(var, val, -rowid, na.rm = TRUE) %>%
mutate(var = gsub("[^[:alpha:]]", "", var)) %>%
spread(var, val), by = c("rowid" = "rowid")) %>%
select(-rowid, -ID)
recreate allocation country age confidence fairness overall
<int> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 15 Austria 42 5 7 3
2 2 10 Austria 42 5 7 3
3 3 8 Austria 42 5 7 3
4 4 4 Austria 42 5 7 3
5 1 29 Belgium 28 5 4 3
6 1 30 Belgium 67 4 1 5
7 2 20 Belgium 28 5 4 3
8 2 24 Belgium 67 4 1 5
9 3 15 Belgium 28 5 4 3
10 3 15 Belgium 67 4 1 5
11 4 11 Belgium 28 5 4 3
12 4 9 Belgium 67 4 1 5
13 1 50 Croatia 19 2 2 2
14 2 40 Croatia 19 2 2 2
15 3 30 Croatia 19 2 2 2
16 4 20 Croatia 19 2 2 2
First, it selects the columns that contain recreate and adds a column with the row ID. Second, it follows the steps from the original solution. Third, it selects the columns that do not contain recreate, reshapes them from wide to long, strips the numbers from the column names, and spreads the data back to wide format. Finally, it joins the two on row ID and removes the redundant variables.

library(dplyr)
library(tidyr)
df %>% mutate(rid = row_number()) %>%
  gather(key, val, -c(age, overall, rid, matches('recreate'))) %>%
  mutate(country = sub('(^\\d)_.*', '\\1', key), country = countries[as.numeric(country)]) %>%
  filter(!is.na(val)) %>% mutate(key = sub('(^\\d\\_)(.*)', '\\2', key)) %>%
  spread(key, val) %>%
  gather(key = recreate, value = allocation, -c(rid, age, overall, country, confidence, fairness)) %>%
  filter(!is.na(allocation)) %>% mutate(recreate = sub('.*_(\\d$)', '\\1', recreate))
Here (^\\d)_.* captures the leading digit, while .*_(\\d$) captures the trailing digit. Since \\d matches a single digit, use \\d+ instead if there are more than nine countries.
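For reference, the same reshaping can be sketched with pivot_longer and pivot_wider, which supersede gather and spread in tidyr 1.0.0 and later. This is only a sketch and not one of the answers above: the helper column rid and the object name long are made up for illustration, and it assumes each row has values for exactly one country.

```r
library(dplyr)
library(tidyr)
library(tibble)

countries <- c("Austria", "Belgium", "Croatia")
df <- tibble("age" = c(28, 42, 19, 67),
             "1_recreate_1" = c(NA, 15, NA, NA), "1_recreate_2" = c(NA, 10, NA, NA),
             "1_recreate_3" = c(NA, 8, NA, NA),  "1_recreate_4" = c(NA, 4, NA, NA),
             "1_fairness" = c(NA, 7, NA, NA),    "1_confidence" = c(NA, 5, NA, NA),
             "2_recreate_1" = c(29, NA, NA, 30), "2_recreate_2" = c(20, NA, NA, 24),
             "2_recreate_3" = c(15, NA, NA, 15), "2_recreate_4" = c(11, NA, NA, 9),
             "2_fairness" = c(4, NA, NA, 1),     "2_confidence" = c(5, NA, NA, 4),
             "3_recreate_1" = c(NA, NA, 50, NA), "3_recreate_2" = c(NA, NA, 40, NA),
             "3_recreate_3" = c(NA, NA, 30, NA), "3_recreate_4" = c(NA, NA, 20, NA),
             "3_fairness" = c(NA, NA, 2, NA),    "3_confidence" = c(NA, NA, 2, NA),
             "overall" = c(3, 3, 2, 5))

long <- df %>%
  mutate(rid = row_number()) %>%                    # hypothetical helper: one id per respondent
  pivot_longer(matches("^\\d+_"),                   # every column that starts with a country index
               names_to = c("id", "item"),
               names_pattern = "^(\\d+)_(.*)$",     # split "2_recreate_1" into "2" and "recreate_1"
               values_drop_na = TRUE) %>%
  mutate(country = countries[as.integer(id)]) %>%   # look the index up in the countries vector
  select(-id) %>%
  pivot_wider(names_from = item, values_from = value) %>%
  pivot_longer(starts_with("recreate_"),            # stack recreate_1 ... recreate_4
               names_to = "recreate",
               names_prefix = "recreate_",
               names_transform = list(recreate = as.integer),
               values_to = "allocation") %>%
  arrange(rid, recreate) %>%
  select(-rid)

long
```

Because the index is matched with \\d+, this should also cope with more than nine countries, and values_drop_na = TRUE is what discards the countries a respondent did not answer for.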

Related

R subset dataframe where no observations of certain variables

I have a dataframe that looks like
country  sector  data1  data2
France   1       7      .
France   2       10     .
belgium  1       12     7
belgium  2       14     8
I want to drop every column that is missing for some country in all of its sectors. In this example I would like to drop data2, because it is missing for both sector 1 and sector 2 in France. To be clear, this also throws out the values of data2 for Belgium.
My expected output would look like
country  sector  data1
France   1       7
France   2       10
belgium  1       12
belgium  2       14
data2 is now excluded because all of its values are missing for every sector in France.
We can group by country, create logical columns that flag where the number of NA elements equals the group size, ungroup, set the flagged columns to NA, and then drop those columns in select:
library(dplyr)
library(stringr)
df1 %>%
group_by(country) %>%
mutate(across(everything(), ~ sum(is.na(.x)) == n(),
.names = "{.col}_lgl")) %>%
ungroup %>%
mutate(across(names(df1)[-1], ~ if(any(get(str_c(cur_column(),
"_lgl")) )) NA else .x)) %>%
select(c(where(~ !is.logical(.x) && any(complete.cases(.x)))))
-output
# A tibble: 4 × 3
country sector data1
<chr> <int> <int>
1 France 1 7
2 France 2 10
3 belgium 1 12
4 belgium 2 14
If we don't use group_by, the steps can be simplified as shown in Maël's post, i.e. do the grouping with a base R function inside select. Either tapply or ave works:
df1 %>%
select(where(~ !any(tapply(is.na(.x), df1[["country"]],
FUN = all))))
data
df1 <- structure(list(country = c("France", "France", "belgium", "belgium"
), sector = c(1L, 2L, 1L, 2L), data1 = c(7L, 10L, 12L, 14L), data2 = c(NA,
NA, 7L, 8L)), row.names = c(NA, -4L), class = "data.frame")
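To make the tapply idea easier to follow, here is a small base R sketch that names the intermediate step. The helper all_na_in_some_group and the object names df2, drop and result are hypothetical, chosen only for this illustration.

```r
df2 <- data.frame(
  country = c("France", "France", "belgium", "belgium"),
  sector  = c(1L, 2L, 1L, 2L),
  data1   = c(7L, 10L, 12L, 14L),
  data2   = c(NA, NA, 7L, 8L)
)

# TRUE if at least one group consists only of NA values in column x
all_na_in_some_group <- function(x, g) {
  any(tapply(is.na(x), g, all))
}

# One logical per column: should this column be dropped?
drop <- vapply(df2, all_na_in_some_group, logical(1), g = df2$country)

result <- df2[!drop]
result
```

A column with only some missing values inside a country is kept; only a column that is entirely NA for at least one country is dropped.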
In base R:
df1 <- read.table(header = T, text = "country sector data1 data2
France 1 7 NA
France 2 10 2
belgium 1 12 7
belgium 2 14 8")
df2 <- read.table(header = T, text = "country sector data1 data2
France 1 7 NA
France 2 10 NA
belgium 1 12 7
belgium 2 14 8")
df1[!sapply(df1, \(x) any(ave(x, df1$country, FUN = \(y) all(is.na(y)))))]
# country sector data1 data2
# 1 France 1 7 NA
# 2 France 2 10 2
# 3 belgium 1 12 7
# 4 belgium 2 14 8
df2[!sapply(df2, \(x) any(ave(x, df2$country, FUN = \(y) all(is.na(y)))))]
# country sector data1
# 1 France 1 7
# 2 France 2 10
# 3 belgium 1 12
# 4 belgium 2 14
Note: \(x) is shorthand for function(x) and is available from R 4.1.0.
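For a base R solution, you can use sapply over the column names to flag the columns that contain no NA at all, and keep only those. Note that this drops a column as soon as it has a single missing value, regardless of country: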
keep_remove <- sapply(names(data), \(x) all(!is.na(data[[x]])))
data <- data[, keep_remove]

Use pivot_longer to separate columns

I have a dataframe that looks like
id = c("1", "2", "3")
IN1999 = c(1, 1, 0)
IN2000 = c(1, 0, 1)
TEST1999 = c(10, 12, NA)
TEST2000 = c(15, NA, 11)
df <- data.frame(id, IN1999, IN2000, TEST1999, TEST2000)
I am trying to use pivot_longer to change it into this form:
id year IN TEST
1 1 1999 1 10
2 1 2000 1 15
3 2 1999 1 12
4 2 2000 0 NA
5 3 1999 0 NA
6 3 2000 1 11
My current code looks like this
df %>%
pivot_longer(col = !id, names_to = c(".value", "year"),
names_sep = 4)
but obviously, by setting names_sep = 4, R cuts IN1999 and IN2000 at the wrong place. How can I set the argument so that R splits the column name just before the last four digits?
The names_sep argument of pivot_longer also accepts a regular expression, which lets you split just before a run of four digits, as in the example below:
library(tidyr)
df |>
pivot_longer(cols = !id, names_to = c(".value", "year"),
names_sep = "(?=\\d{4})")
Output:
# A tibble: 6 × 4
id year IN TEST
<chr> <chr> <dbl> <dbl>
1 1 1999 1 10
2 1 2000 1 15
3 2 1999 1 12
4 2 2000 0 NA
5 3 1999 0 NA
6 3 2000 1 11
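As an alternative sketch, names_pattern can be used instead of names_sep: rather than describing where to cut, it describes the two pieces to keep. The pattern below assumes every column name is a run of letters followed by exactly four digits, and the conversion of year to integer via names_transform is an optional extra that was not asked for in the question.

```r
library(tidyr)

df <- data.frame(id = c("1", "2", "3"),
                 IN1999 = c(1, 1, 0),
                 IN2000 = c(1, 0, 1),
                 TEST1999 = c(10, 12, NA),
                 TEST2000 = c(15, NA, 11))

long <- pivot_longer(
  df,
  cols = !id,
  names_to = c(".value", "year"),
  names_pattern = "^([A-Za-z]+)(\\d{4})$",     # group 1: variable name, group 2: year
  names_transform = list(year = as.integer)    # store year as a number rather than text
)

long
```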

R data frame - fill missing values with condition on another column

In R, I have the following data frame:
Id  Year  Age
1   2000  25
1   2001  NA
1   2002  NA
2   2000  NA
2   2001  30
2   2002  NA
Each Id has at least one row with age filled.
I would like to fill the missing "Age" values with the correct age for each ID.
Expected result:
Id  Year  Age
1   2000  25
1   2001  25
1   2002  25
2   2000  30
2   2001  30
2   2002  30
I've tried using 'fill':
df %>% fill(age)
But not getting the expected results.
Is there a simple way to do this?
The comments were close, you just have to add the .direction
df %>% group_by(Id) %>% fill(Age, .direction="downup")
# A tibble: 6 x 3
# Groups: Id [2]
Id Year Age
<int> <int> <int>
1 1 2000 25
2 1 2001 25
3 1 2002 25
4 2 2000 30
5 2 2001 30
6 2 2002 30
Assuming this is your dataframe
df<-data.frame(id=c(1,1,1,2,2,2),year=c(2000,2001,2002,2000,2001,2002),age=c(25,NA,NA,NA,30,NA))
With the zoo package, you can try
library(zoo)
df<-df[order(df$id,df$age),]
df$age<-na.locf(df$age)
Please see the solution below with the tidyverse library.
library(tidyverse)
dt <- data.frame(Id = rep(1:2, each = 3),
Year = rep(2000:2002, times = 2),
Age = c(25,NA,NA,30,NA,NA))
dt %>% group_by(Id) %>% arrange(Id,Age) %>% fill(Age)
In the code you provided, you didn't use group_by. It is also important to arrange by Id and Age, because fill fills downward by default and arrange puts the NA values last. Consider, for example, this data frame, and compare the result with and without arrange:
dt <- data.frame(Id = rep(1:2, each = 3),
Year = rep(2000:2002, times = 2),
Age = c(NA, 25,NA,NA,30,NA))
dt %>% group_by(Id) %>% fill(Age) # only fills partially
dt %>% group_by(Id) %>% arrange(Id,Age) %>% fill(Age) # does the right job
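To see the difference between the approaches side by side, here is a short sketch on the data from the question. The object names ungrouped and grouped are made up for illustration. It suggests that the ungrouped fill leaks an age from one Id into the next, while a grouped fill with .direction = "downup" fixes this without reordering the rows.

```r
library(dplyr)
library(tidyr)

df <- data.frame(Id = rep(1:2, each = 3),
                 Year = rep(2000:2002, times = 2),
                 Age = c(25, NA, NA, NA, 30, NA))

# Without grouping, the 25 from Id 1 is carried into the first row of Id 2
ungrouped <- df %>% fill(Age)

# Grouped, and filling down then up, each Id only sees its own value
grouped <- df %>%
  group_by(Id) %>%
  fill(Age, .direction = "downup") %>%
  ungroup()

ungrouped$Age
grouped$Age
```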

Calculate change over time with tidy data in R - do you have to spread and gather?

Quick question about calculating a change over time for tidy data. Do I need to spread the data, mutate the variable and then gather the data again (see below), or is there a quicker way to do this keeping the data tidy.
Here is an example:
df <- data.frame(country = c(1, 1, 2, 2),
year = c(1999, 2000, 1999, 2000),
value = c(20, 30, 40, 50))
df
country year value
1 1 1999 20
2 1 2000 30
3 2 1999 40
4 2 2000 50
To calculate the change in value between 1999 and 2000 I would:
library(dplyr)
library(tidyr)
df2 <- df %>%
spread(year, value) %>%
mutate(change.99.00 = `2000` - `1999`) %>%
gather(year, value, c(`1999`, `2000`))
df2
country change.99.00 year value
1 1 10 1999 20
2 2 10 1999 40
3 1 10 2000 30
4 2 10 2000 50
This seems a laborious way to do this. I assume there should be a neat way to do this while keeping the data in narrow, tidy format, by grouping the data or something but I can't think of it and I can't find an answer online.
Is there an easier way to do this?
After grouping by 'country', get the diff of 'value' filtered with the logical expression year %in% 1999:2000
library(dplyr)
df %>%
group_by(country) %>%
mutate(change.99.00 = diff(value[year %in% 1999:2000]))
# A tibble: 4 x 4
# Groups: country [2]
# country year value change.99.00
# <dbl> <dbl> <dbl> <dbl>
#1 1 1999 20 10
#2 1 2000 30 10
#3 2 1999 40 10
#4 2 2000 50 10
NOTE: Here, we assume that the 'year' is not duplicated per 'country'
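One more caveat worth sketching: diff works on the values in row order, so the sign of the result depends on how the rows are sorted. A variant that indexes the two years explicitly does not have that dependency. The example below is only an illustration with the rows of country 1 deliberately swapped; by_diff and by_year are hypothetical names, and it still assumes exactly one row per country and year.

```r
library(dplyr)

# Same data as in the question, but with the rows of country 1 in reverse year order
df <- data.frame(country = c(1, 1, 2, 2),
                 year = c(2000, 1999, 1999, 2000),
                 value = c(30, 20, 40, 50))

# diff() follows row order, so country 1 gets 20 - 30
by_diff <- df %>%
  group_by(country) %>%
  mutate(change.99.00 = diff(value[year %in% 1999:2000])) %>%
  ungroup()

# Indexing each year explicitly is independent of row order
by_year <- df %>%
  group_by(country) %>%
  mutate(change.99.00 = value[year == 2000] - value[year == 1999]) %>%
  ungroup()

by_diff$change.99.00
by_year$change.99.00
```

Alternatively, adding arrange(country, year) before the diff version should give the same result as the explicit version.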

Rolling sum over multiple columns in r

I am working on R with a dataset that looks like this:
test=data.frame("1991" = c(1,5,3), "1992" = c(4,3,3), "1993" = c(10,5,3), "1994" = c(1,1,1), "1995" = c(2,2,6))
test=plyr::rename(test, c("X1991"="1991", "X1992"="1992", "X1993"="1993", "X1994"="1994", "X1995"="1995"))
I want to create variables called Pre1991, Pre1992, Pre1993, ... that store the cumulative values up to and including that year, e.g.
Pre1991 = test$1991
Pre1992 = test$1991 + test$1992
Pre1993 = test$1991 + test$1992 + test$1993
so on.
My real dataset has variables for the years 1900-2017, so I can't do this manually. I tried to write a for loop, but it didn't work.
for (i in 1900:2017){
x = paste0("Pre",i)
df[[x]] = rowSums(df[,(colnames(df)<=i)])
}
Can someone please help to review my code/ suggest other ways to do it? Thanks!
Edit 1:
Thanks so much! And I'm wondering if there's a way that I can use cumsum function in a reverse direction? For example, if I am interested in what happened after that particular year:
Post1991 = test$1992 + test$1993 + test$1994 + test$1995 + ...
Post1992 = test$1993 + test$1994 + test$1995 + ...
Post1993 = test$1994 + test$1995 + ...
This is a little inefficient in that it is converting from a data.frame to a matrix and back, but ...
as.data.frame(t(apply(as.matrix(test), 1, cumsum)))
# 1991 1992 1993 1994 1995
# 1 1 5 15 16 18
# 2 5 8 13 14 16
# 3 3 6 9 10 16
If your data has other columns that are not year-based, such as
test$quux <- LETTERS[3:5]
test
# 1991 1992 1993 1994 1995 quux
# 1 1 4 10 1 2 C
# 2 5 3 5 1 2 D
# 3 3 3 3 1 6 E
then subset on both sides:
test[1:5] <- as.data.frame(t(apply(as.matrix(test[1:5]), 1, cumsum)))
test
# 1991 1992 1993 1994 1995 quux
# 1 1 5 15 16 18 C
# 2 5 8 13 14 16 D
# 3 3 6 9 10 16 E
EDIT
In reverse, just use repeated rev:
as.data.frame(t(apply(as.matrix(test), 1, function(a) rev(cumsum(rev(a)))-a)))
# 1991 1992 1993 1994 1995
# 1 17 13 3 2 0
# 2 11 8 3 2 0
# 3 13 10 7 6 0
Using the tidyverse, we can gather, calculate, and then spread again. For this to work, the data need to be arranged by id and year first.
library(tidyverse)
test <- data.frame("1991" = c(1, 5, 3),
"1992" = c(4, 3, 3),
"1993" = c(10, 5, 3),
"1994" = c(1, 1, 1),
"1995" = c(2, 2, 6))
test <- plyr::rename(test, c("X1991" = "1991",
"X1992" = "1992",
"X1993" = "1993",
"X1994" = "1994",
"X1995" = "1995"))
Forwards
test %>%
mutate(id = 1:nrow(.)) %>% # adding an ID to identify groups
gather(year, value, -id) %>% # wide to long format
arrange(id, year) %>%
group_by(id) %>%
mutate(value = cumsum(value)) %>%
ungroup() %>%
spread(year, value) %>% # long to wide format
select(-id) %>%
setNames(paste0("pre", names(.))) # add prefix to columns
## A tibble: 3 x 5
# pre1991 pre1992 pre1993 pre1994 pre1995
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1. 5. 15. 16. 18.
# 2 5. 8. 13. 14. 16.
# 3 3. 6. 9. 10. 16.
Reverse direction
As your definition specifies, it's not strictly the reverse order; it's the reverse order excluding the current year, which is the cumulative sum of the lagged values.
test %>%
mutate(id = 1:nrow(.)) %>%
gather(year, value, -id) %>%
arrange(id, desc(year)) %>% # using desc() to reverse sorting
group_by(id) %>%
mutate(value = cumsum(lag(value, default = 0))) %>% # lag cumsum
ungroup() %>%
spread(year, value) %>%
select(-id) %>%
setNames(paste0("post", names(.)))
## A tibble: 3 x 5
# post1991 post1992 post1993 post1994 post1995
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 17. 13. 3. 2. 0.
# 2 11. 8. 3. 2. 0.
# 3 13. 10. 7. 6. 0.
We can use rowCumsums from matrixStats
library(matrixStats)
test[] <- rowCumsums(as.matrix(test))
test
# 1991 1992 1993 1994 1995
#1 1 5 15 16 18
#2 5 8 13 14 16
#3 3 6 9 10 16
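Since the question asks for new columns named Pre1991, Post1991 and so on rather than overwriting the year columns, here is a base R sketch that combines the forward and reverse cumulative sums from the first answer and attaches them under those names. The object names m, pre, post and out are made up for this illustration, and it assumes every column of test is a year column.

```r
# check.names = FALSE keeps "1991" as a column name instead of "X1991"
test <- data.frame("1991" = c(1, 5, 3), "1992" = c(4, 3, 3),
                   "1993" = c(10, 5, 3), "1994" = c(1, 1, 1),
                   "1995" = c(2, 2, 6), check.names = FALSE)

years <- names(test)
m <- as.matrix(test)

# Row-wise running total up to and including each year
pre <- t(apply(m, 1, cumsum))
# Row-wise total of all later years (reverse running total minus the year itself)
post <- t(apply(m, 1, function(a) rev(cumsum(rev(a))) - a))

colnames(pre) <- paste0("Pre", years)
colnames(post) <- paste0("Post", years)

out <- cbind(test, pre, post)
out
```

With 118 year columns the same code applies unchanged, because nothing in it refers to a specific year.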

Resources