gather() with two key columns - r

I have a dataset that has two rows of data, and want to tidy them using something like gather() but don't know how to mark both as key columns.
The data looks like:
Country US Canada US
org_id 332 778 920
02-15-20 25 35 54
03-15-20 30 10 60
And I want it to look like
country org_id date purchase_price
US 332 02-15-20 25
Canada 778 02-15-20 35
US 920 02-15-20 54
US 332 03-15-20 30
Canada 778 03-15-20 10
US 920 03-15-20 60
I know gather() can move the country row to a column, for example, but is there a way to move both the country and org_id rows to columns?

Duplicate column names make the data awkward to work with, so I'll rename the second US column first.
names(df)[4] <- 'US_1'
gather() has been superseded by pivot_longer().
This is not a conventional reshape, because the first row (org_id) has to be treated differently from the remaining rows. We can reshape the two parts separately and then join the results into one final data frame.
library(dplyr)
library(tidyr)
df1 <- df %>% slice(-1L) %>% pivot_longer(cols = -Country)
df %>%
slice(1L) %>%
pivot_longer(-Country, values_to = 'org_id') %>%
select(-Country) %>%
inner_join(df1, by = 'name') %>%
rename(Country = name, date = Country) -> result
result
# Country org_id date value
# <chr> <int> <chr> <int>
#1 US 332 02-15-20 25
#2 US 332 03-15-20 30
#3 Canada 778 02-15-20 35
#4 Canada 778 03-15-20 10
#5 US_1 920 02-15-20 54
#6 US_1 920 03-15-20 60
data
df <- structure(list(Country = c("org_id", "02-15-20", "03-15-20"),
US = c(332L, 25L, 30L), Canada = c(778L, 35L, 10L), US = c(920L,
54L, 60L)), class = "data.frame", row.names = c(NA, -3L))

First, we paste the country names and the org_id row together to form the new column names.
library(tidyverse)
data <- set_names(data, paste(names(data), data[1,], sep = "-"))
data
Country-org_id US-332 Canada-778 US-920
1 org_id 332 778 920
2 02-15-20 25 35 54
3 03-15-20 30 10 60
Then, we drop the first row, pivot the table and separate the column name.
df <- data %>%
slice(2:n()) %>%
rename(date = `Country-org_id`) %>%
pivot_longer(cols = -date, values_to = "price") %>%
separate(col = name, into = c("country", "org_id"), sep = "-")
df
# A tibble: 6 x 4
date country org_id price
<chr> <chr> <chr> <int>
1 02-15-20 US 332 25
2 02-15-20 Canada 778 35
3 02-15-20 US 920 54
4 03-15-20 US 332 30
5 03-15-20 Canada 778 10
6 03-15-20 US 920 60
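As a possible shortcut, the pivot and the separate step above can be folded into one call, because pivot_longer can split the column names itself through names_to and names_sep. The following is only a sketch under that assumption: the object names raw and tidy are made up for illustration, and the value column is called purchase_price to match the layout requested in the question.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, including the duplicated "US" column name.
raw <- structure(list(Country = c("org_id", "02-15-20", "03-15-20"),
                      US = c(332L, 25L, 30L), Canada = c(778L, 35L, 10L),
                      US = c(920L, 54L, 60L)),
                 class = "data.frame", row.names = c(NA, -3L))

# Glue the org_id row onto the column names so that they become unique.
names(raw) <- paste(names(raw), unlist(raw[1, ]), sep = "-")

tidy <- raw %>%
  slice(-1) %>%                                  # drop the org_id row
  rename(date = `Country-org_id`) %>%
  pivot_longer(cols = -date,
               names_to = c("country", "org_id"),
               names_sep = "-",                  # split "US-332" into two columns
               values_to = "purchase_price") %>%
  mutate(org_id = as.integer(org_id))            # org_id comes out of the names as text

tidy
```

This relies on "-" not appearing in any country name; a different separator would be needed otherwise.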

Related

How to find duplicate dates within a row in R, and then replace associated values with the mean?

There are some similar questions, however I haven't been able to find the solution for my data:
ID <- c(27,46,72)
Gest1 <- c(27,28,29)
Sys1 <- c(120,123,124)
Dia1 <- c(90,89,92)
Gest2 <- c(29,28,30)
Sys2 <- c(122,130,114)
Dia2 <- c(89,78,80)
Gest3 <- c(32,29,30)
Sys3 <- c(123,122,124)
Dia3 <- c(90,88,89)
Gest4 <- c(33,30,32)
Sys4 <- c(124,123,128)
Dia4 <- c(94,89,80)
df.1 <- data.frame(ID,Gest1,Sys1,Dia1,Gest2,Sys2,Dia2,Gest3,Sys3,
Dia3,Gest4,Sys4,Dia4)
df.1
What I need to do is identify where there are any cases of gestational age duplicates (variables beginning with Gest), and then find the mean of the associated Sys and Dia variables.
Once the mean has been calculated, I need to replace the duplicates with just 1 Gest variable, and the mean of the Sys variable and the mean of the Dia variable. Everything after those duplicates should then be moved up the dataframe.
Here is what it should look like:
df.2
My real data has 25 Gest variables with 25 associated Sys variables and 25 associated Dia variables.
Sorry if this is confusing! I've tried to write an ok question but it is my first time using stack overflow.
Thank you!!
This is easier to manage in long (tidy) format.
With the tidyverse, pivot_longer puts the data into long form. After grouping by ID and Gest, replace the Sys and Dia values with their group mean, so that an ID with more than one row for the same Gest gets the average.
Then keep one row per group with slice and, after regrouping by ID, renumber the remaining measurements.
library(tidyverse)
df.1 %>%
# "\\D+" rather than "\\w+": a greedy "\\w+" would split Gest25 into "Gest2" and "5"
pivot_longer(cols = -ID, names_to = c(".value", "number"), names_pattern = "(\\D+)(\\d+)") %>%
group_by(ID, Gest) %>%
mutate(across(c(Sys, Dia), mean)) %>%
slice(1) %>%
group_by(ID) %>%
mutate(number = row_number())
Output
ID number Gest Sys Dia
<dbl> <int> <dbl> <dbl> <dbl>
1 27 1 27 120 90
2 27 2 29 122 89
3 27 3 32 123 90
4 27 4 33 124 94
5 46 1 28 126. 83.5
6 46 2 29 122 88
7 46 3 30 123 89
8 72 1 29 124 92
9 72 2 30 119 84.5
10 72 3 32 128 80
Note - I would keep in long form - but if you wanted wide again, you can add:
pivot_wider(id_cols = ID, names_from = number, values_from = c(Gest, Sys, Dia))
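To see what the ".value" entry in names_to does in isolation, here is a small sketch on a made-up frame (the object names toy and long and all values are invented for illustration). It uses "\\D+" for the name part of the pattern so that a two-digit suffix such as 12 stays in one piece, which matters when there are 25 sets of columns.

```r
library(tidyr)

# Toy frame with a one-digit and a two-digit suffix.
toy <- data.frame(ID = 1, Gest1 = 27, Sys1 = 120, Gest12 = 29, Sys12 = 122)

long <- pivot_longer(
  toy,
  cols = -ID,
  # ".value": the first capture group names the output columns (Gest, Sys);
  # the second capture group is stored in the "number" column.
  names_to = c(".value", "number"),
  names_pattern = "(\\D+)(\\d+)"
)

long
```

Each distinct suffix becomes one row, and each distinct prefix becomes one column.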
This involves changing the structure of the table into long format, averaging the duplicates, and then reshaping back into the desired wide table:
library(tidyr)
library(dplyr)
df.1 <- data.frame(ID,Gest1,Sys1,Dia1,Gest2,Sys2,Dia2,Gest3,Sys3, Dia3,Gest4,Sys4,Dia4)
# convert data to long format; "(\\d+)" also handles two-digit suffixes such as Gest25
longdf <- df.1 %>% pivot_longer(!ID, names_to = c(".value", "time"), names_pattern = "(\\D+)(\\d+)")
# average duplicate rows, then renumber the remaining measurements within each ID
temp <- longdf %>% group_by(ID, Gest) %>% summarize(Sys = mean(Sys), Dia = mean(Dia)) %>% mutate(time = row_number())
# convert back to wide format (id_cols is named because recent tidyr versions do not accept it positionally)
answer <- temp %>% pivot_wider(id_cols = ID, names_from = time, values_from = c("Gest", "Sys", "Dia"), names_glue = "{.value}{time}")
# restore the original column order
answer <- answer[, names(df.1)]
answer
# A tibble: 3 × 13
# Groups: ID [3]
ID Gest1 Sys1 Dia1 Gest2 Sys2 Dia2 Gest3 Sys3 Dia3 Gest4 Sys4 Dia4
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 27 27 120 90 29 122 89 32 123 90 33 124 94
2 46 28 126. 83.5 29 122 88 30 123 89 NA NA NA
3 72 29 124 92 30 119 84.5 32 128 80 NA NA NA

Pivot wider to one row in R

Here is the sample code that I am using
library(dplyr)
naics <- c("000000","000000",123000,123000)
year <- c(2020,2021,2020,2021)
January <- c(250,251,6,9)
February <- c(252,253,7,16)
March <- c(254,255,8,20)
sample2 <- data.frame (naics, year, January, February, March)
Here is the intended result
Jan2020 Feb2020 March2020 Jan2021 Feb2021 March2021
000000 250 252 254 251 253 255
123000 6 7 8 9 16 20
Is this something that is done with pivot_wider or is it more complex?
Use pivot_wider with the month columns as values_from and year as names_from, and set the column name format with names_glue. If needed, convert naics to row names with column_to_rownames (from tibble).
library(tidyr)
library(tibble)
pivot_wider(sample2, names_from = year, values_from = January:March,
names_glue = "{substr(.value, 1, 3)}{year}")%>%
column_to_rownames('naics')
Output
Jan2020 Jan2021 Feb2020 Feb2021 Mar2020 Mar2021
000000 250 251 252 253 254 255
123000 6 9 7 16 8 20
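The output above groups the columns by month, whereas the question asked for all 2020 columns first and then all 2021 columns. One way to get that order, assuming tidyr 1.2.0 or later is installed, is the names_vary argument of pivot_wider. This is a sketch rather than part of the original answer; the object name wide is made up.

```r
library(tidyr)

sample2 <- data.frame(
  naics    = c("000000", "000000", "123000", "123000"),
  year     = c(2020, 2021, 2020, 2021),
  January  = c(250, 251, 6, 9),
  February = c(252, 253, 7, 16),
  March    = c(254, 255, 8, 20)
)

wide <- pivot_wider(
  sample2,
  names_from = year,
  values_from = January:March,
  names_glue = "{substr(.value, 1, 3)}{year}",
  names_vary = "slowest"   # keep all columns of one year next to each other
)

names(wide)
```

column_to_rownames('naics') can still be appended if row names are wanted.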
With the reshape function from base R:
reshape(sample2, dir = "wide", sep="",
idvar = "naics",
timevar = "year",
new.row.names = unique(naics))[,-1]
# January2020 February2020 March2020 January2021 February2021 March2021
# 000000 250 252 254 251 253 255
# 123000 6 7 8 9 16 20
This takes a longer route than @akrun's answer. I will leave it here in case it gives more intuition for the steps being taken. Otherwise, @akrun's answer is more resource efficient.
sample2 %>%
tidyr::pivot_longer(-c(naics, year), names_to = "month",
values_to = "value") %>%
mutate(Month=paste0(month, year)) %>%
select(-year, - month) %>%
tidyr::pivot_wider(names_from = Month,values_from = value)
# A tibble: 2 x 7
naics January2020 February2020 March2020 January2021 February2021
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 000000 250 252 254 251 253
2 123000 6 7 8 9 16
# ... with 1 more variable: March2021 <dbl>

Turning multiple rows into a string by merging on ID in R

Table1:(there are hundreds of IDs)
participant_id hpo_term year_of_birth affected_relative genome
123 kidney failure 2000 Y 38
123 hand tremor 2000 Y 38
123 kidney transplant 2000 Y 38
432 hypertension 1980 N 37
432 exotropia 1980 N 37
432 scissor gait 1980 N 37
I have two look-up tables:(with hundreds of values in each)
Renal lookup:
kidney failure
kidney transplant
hypertension
Non-renal lookup(with hundreds of values in each):
hand tremor
exotropia
scissor gait
Desired outcome:
participant_id kidney_hpo_term non_kidney_hpo_term year_of_birth affected_relative genome
123 kidney failure;kidney transplant hand tremor 2000 Y 38
432 hypertension exotropia;scissor gait 1980 N 37
Initially I tried:
library(dplyr); library(tidyr)
pt.data %>%
mutate(kidney = hpo_term %in% kidney.hpo) %>%
pivot_wider(names_from = kidney, values_from = hpo_term,
values_fn = function(x)paste(x,collapse = ";"), values_fill = NA) %>%
setNames(c("participant_id","Kidney","Non.kidney"))
with kidney.hpo <- read.delim("kidney_hpo_terms.txt", header = F)
But I get "Error in values_fn[[value]] ; object of type 'closure' is not subsettable"
Not sure what I am doing wrong and your help would be much appreciated.
A few remarks about your data.
First, Table1 repeats information: year_of_birth, affected_relative, and genome are identical on every row of a given participant.
These are better stored in a separate table, which I named table1_short.
As for the question itself, it comes down to checking whether each term is in a lookup vector, which %in% does.
Here is how you could write the code:
library(tidyverse)
table1=read.table(header=T, text="
participant_id hpo_term year_of_birth affected_relative genome
123 'kidney failure' 2000 Y 38
123 'hand tremor' 2000 Y 38
123 'kidney transplant' 2000 Y 38
432 hypertension 1980 N 37
432 exotropia 1980 N 37
432 'scissor gait' 1980 N 37")
table1_short = table1 %>% select(-hpo_term) %>% group_by(participant_id) %>% slice(1)
table1_long = table1 %>% select(1:2)
renal_lookup = c("kidney failure", "kidney transplant", "hypertension")
nonrenal_lookup = c("hand tremor", "exotropia", "scissor gait")
table1_long %>%
group_by(participant_id) %>%
summarise(
kidney_hpo_term = hpo_term[hpo_term %in% renal_lookup] %>% paste(collapse=";"),
non_kidney_hpo_term = hpo_term[hpo_term %in% nonrenal_lookup] %>% paste(collapse=";")
) %>%
left_join(table1_short, by="participant_id")
#> # A tibble: 2 x 6
#> participant_id kidney_hpo_term non_kidney_hpo_term year_of_birth affected_relative genome
#> <int> <chr> <chr> <int> <chr> <int>
#> 1 123 kidney failure;kidney transplant hand tremor 2000 Y 38
#> 2 432 hypertension exotropia;scissor gait 1980 N 37
Created on 2021-05-12 by the reprex package (v2.0.0)
This can be done with dcast from data.table as follows (assuming dtt is Table1 stored as a data.table and kidney_hpo is a character vector of the renal terms):
library(data.table)
# label each term as kidney or non-kidney; the label becomes the new column name
dtt[, group := paste0(
ifelse(hpo_term %in% kidney_hpo, 'kidney', 'non_kidney'), '_hpo_term')]
dcast(dtt, ... ~ group, value.var = 'hpo_term',
fun.aggregate = paste, collapse = ';')
# participant_id year_of_birth affected_relative genome kidney_hpo_term
# 1: 123 2000 Y 38 kidney failure;kidney transplant
# 2: 432 1980 N 37 hypertension
# non_kidney_hpo_term
# 1: hand tremor
# 2: exotropia;scissor gait
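For completeness, the pivot_wider attempt from the question can probably be made to work too. The error message suggests that the installed tidyr version indexes values_fn as a list, so passing a list named after the value column is a reasonable thing to try; I believe this form is accepted by current releases as well. The sketch below rebuilds the example data by hand, and the names pt, group and wide are made up for illustration.

```r
library(dplyr)
library(tidyr)

pt <- data.frame(
  participant_id    = c(123, 123, 123, 432, 432, 432),
  hpo_term          = c("kidney failure", "hand tremor", "kidney transplant",
                        "hypertension", "exotropia", "scissor gait"),
  year_of_birth     = c(2000, 2000, 2000, 1980, 1980, 1980),
  affected_relative = c("Y", "Y", "Y", "N", "N", "N"),
  genome            = c(38, 38, 38, 37, 37, 37),
  stringsAsFactors  = FALSE
)
renal_lookup <- c("kidney failure", "kidney transplant", "hypertension")

wide <- pt %>%
  # the label doubles as the name of the output column
  mutate(group = if_else(hpo_term %in% renal_lookup,
                         "kidney_hpo_term", "non_kidney_hpo_term")) %>%
  pivot_wider(
    names_from = group,
    values_from = hpo_term,
    # named list: one aggregation function per values_from column
    values_fn = list(hpo_term = function(x) paste(x, collapse = ";"))
  )

wide
```

Note that the lookup has to be a character vector here. In the question, kidney.hpo comes from read.delim and is therefore a data frame, so the %in% test would need the column itself (for example kidney.hpo$V1) rather than the whole object.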

Using str_split to fill rows down data frame with number ranges and multiple numbers

I have a dataframe with crop names and their respective FAO codes. Unfortunately, some crop categories, such as 'other cereals', have multiple FAO codes, ranges of FAO codes or even worse - multiple ranges of FAO codes.
Snippet of the dataframe with the different formats for FAO codes.
> FAOCODE_crops
SPAM_full_name FAOCODE
1 wheat 15
2 rice 27
8 other cereals 68,71,75,89,92,94,97,101,103,108
27 other oil crops 260:310,312:339
31 other fibre crops 773:821
Using the following code successfully breaks down these numbers,
unlist(lapply(unlist(strsplit(FAOCODE_crops$FAOCODE, ",")), function(x) eval(parse(text = x))))
[1] 15 27 56 44 79 79 83 68 71 75 89 92 94 97 101 103 108
... but I fail to merge these numbers back into the dataframe, where every FAOCODE gets its own row.
> FAOCODE_crops$FAOCODE <- unlist(lapply(unlist(strsplit(MAPSPAM_crops$FAOCODE, ",")), function(x) eval(parse(text = x))))
Error in `$<-.data.frame`(`*tmp*`, FAOCODE, value = c(15, 27, 56, 44, :
replacement has 571 rows, data has 42
I fully understand why it doesn't merge successfully, but I can't figure out a way to fill the table with a new row for each FAOCODE as idealized below:
SPAM_full_name FAOCODE
1 wheat 15
2 rice 27
8 other cereals 68
8 other cereals 71
8 other cereals 75
8 other cereals 89
And so on...
Any help is greatly appreciated!
We can use separate_rows to split FAOCODE at each comma. After that, we can loop through FAOCODE using map and ~eval(parse(text = .x)) to evaluate each number or range. Finally, we can use unnest to expand the data frame.
library(tidyverse)
dat2 <- dat %>%
separate_rows(FAOCODE, sep = ",") %>%
mutate(FAOCODE = map(FAOCODE, ~eval(parse(text = .x)))) %>%
unnest(cols = FAOCODE)
dat2
# # A tibble: 140 x 2
# SPAM_full_name FAOCODE
# <chr> <dbl>
# 1 wheat 15
# 2 rice 27
# 3 other cereals 68
# 4 other cereals 71
# 5 other cereals 75
# 6 other cereals 89
# 7 other cereals 92
# 8 other cereals 94
# 9 other cereals 97
# 10 other cereals 101
# # ... with 130 more rows
DATA
dat <- read.table(text = " SPAM_full_name FAOCODE
1 wheat 15
2 rice 27
8 'other cereals' '68,71,75,89,92,94,97,101,103,108'
27 'other oil crops' '260:310,312:339'
31 'other fibre crops' '773:821'",
header = TRUE, stringsAsFactors = FALSE)
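Since eval(parse()) runs whatever text is in the column, it may be worth avoiding if the codes come from a file you do not control. Here is a base R sketch that expands the same "a:b" and comma notation by hand. The helper expand_codes and the shortened example data are made up for illustration, and the sketch assumes every piece is either a single integer or a start:end pair.

```r
# Hypothetical helper: expand one FAOCODE string such as "260:262,312"
# into an integer vector without evaluating it as R code.
expand_codes <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]
  unlist(lapply(parts, function(p) {
    bounds <- as.integer(strsplit(trimws(p), ":", fixed = TRUE)[[1]])
    # two bounds mean a range, one bound is a single code
    if (length(bounds) == 2) seq(bounds[1], bounds[2]) else bounds
  }))
}

# Shortened example data (ranges trimmed to keep the output small).
dat <- data.frame(
  SPAM_full_name = c("wheat", "other cereals", "other oil crops"),
  FAOCODE = c("15", "68,71", "260:262,312"),
  stringsAsFactors = FALSE
)

codes <- lapply(dat$FAOCODE, expand_codes)

# Repeat each crop name once per expanded code.
long <- data.frame(
  SPAM_full_name = rep(dat$SPAM_full_name, lengths(codes)),
  FAOCODE = unlist(codes),
  stringsAsFactors = FALSE
)

long
```

The same helper could be used inside map() in the tidyverse pipeline above in place of the eval(parse()) lambda.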

Tidy rows in one data frame based on a condition

I have a question in R programming.
I have a data frame in R with the following data:
Country Year Population Bikes Revenue
Austria 1970 85 NA NA
Austria 1973 86 NA NA
AUSTRIA 1970 NA 56 4567
AUSTRIA 1973 NA 54 4390
I want to summarise this data in order to have the following new data:
Country Year Population Bikes Revenue
Austria 1970 85 56 4567
Austria 1973 86 54 4390
Thus, I need to exclude the repeated years per country and join the Bikes and Revenue columns to the specific year and country.
I would highly appreciate if you could help me with this issue.
Thank you.
One dplyr possibility could be:
df %>%
group_by(Country = toupper(Country), Year) %>%
summarise_all(list(~ sum(.[!is.na(.)])))
Country Year Population Bikes Revenue
<chr> <int> <int> <int> <int>
1 AUSTRIA 1970 85 56 4567
2 AUSTRIA 1973 86 54 4390
Or a combination of dplyr and tidyr:
df %>%
group_by(Country = toupper(Country), Year) %>%
fill(everything(), .direction = "up") %>%
fill(everything(), .direction = "down") %>%
distinct()
Or, if for some reason you need the country names to start with an uppercase letter:
df %>%
mutate(Country = tolower(Country),
Country = paste0(toupper(substr(Country, 1, 1)), substr(Country, 2, nchar(Country)))) %>%
group_by(Country, Year) %>%
summarise_all(list(~ sum(.[!is.na(.)])))
Country Year Population Bikes Revenue
<chr> <int> <int> <int> <int>
1 Austria 1970 85 56 4567
2 Austria 1973 86 54 4390
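One caveat with the sum-based versions: if a country and year ever had two non-missing values in the same column, they would be added together. A variant that keeps the first non-missing value instead could look like the sketch below. It assumes a dplyr version that provides across() (1.0.0 or later), and the object name merged is made up.

```r
library(dplyr)

df <- data.frame(
  Country    = c("Austria", "Austria", "AUSTRIA", "AUSTRIA"),
  Year       = c(1970L, 1973L, 1970L, 1973L),
  Population = c(85L, 86L, NA, NA),
  Bikes      = c(NA, NA, 56L, 54L),
  Revenue    = c(NA, NA, 4567L, 4390L),
  stringsAsFactors = FALSE
)

merged <- df %>%
  group_by(Country = toupper(Country), Year) %>%
  # first non-missing value per column; NA if the group has none
  summarise(across(everything(), ~ .x[!is.na(.x)][1]), .groups = "drop")

merged
```

For the example data this gives the same numbers as the sum-based answers, since each group has exactly one non-missing value per column.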
