Reorganizing a dataframe drastically in R using tidyr

I have a dataframe that consists of vegetation data. Columns are species names and rows are their relative abundances per site. Site, plot code and year are also variables. The data looks like this:
Site Code Year speca specb specc
A A1 2001 0 1 10
A A2 2001 5 5 15
B B1 2001 0 5 20
B B1 2004 15 75 0
C C1 2006 50 0 15
I want the data table to look like this:
species A1_2001 A2_2001 B1_2001 B1_2004 C1_2006
speca 0 5 0 15 50
specb 1 5 5 75 0
specc 10 15 20 0 15
I tried using the tidyr::pivot_longer() function, but this does not give the result I want:
tidyr::pivot_longer(df, 4:length(df), names_to = "species", values_to = "abundance")
Is there a way to achieve this in a code-friendly way, preferably using tidyr (tidyverse)?

We reshape it to 'long' format and then to 'wide' format with pivot_wider:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = starts_with('spec'), names_to = 'species') %>%
  unite(CodeYear, Code, Year) %>%
  select(-Site) %>%
  pivot_wider(names_from = CodeYear, values_from = value)
# A tibble: 3 x 6
# species A1_2001 A2_2001 B1_2001 B1_2004 C1_2006
# <chr> <int> <int> <int> <int> <int>
#1 speca 0 5 0 15 50
#2 specb 1 5 5 75 0
#3 specc 10 15 20 0 15
data
df <- structure(list(Site = c("A", "A", "B", "B", "C"), Code = c("A1",
"A2", "B1", "B1", "C1"), Year = c(2001L, 2001L, 2001L, 2004L,
2006L), speca = c(0L, 5L, 0L, 15L, 50L), specb = c(1L, 5L, 5L,
75L, 0L), specc = c(10L, 15L, 20L, 0L, 15L)), class = "data.frame",
row.names = c(NA,
-5L))

In data.table:
library(data.table)
DT <- data.table(Site = c('A1','A2','B1','B1','C1'),
                 Year = c(2001, 2001, 2001, 2004, 2006),
                 speca = c(0,5,0,15,50),
                 specb = c(1,5,5,75,0),
                 specc = c(10,15,20,0,15))
DT <- melt(DT, id.vars = c('Site', 'Year'),
           measure.vars = c('speca', 'specb', 'specc'), variable.name = 'species')
DT <- dcast(DT, species ~ Site + Year, value.var = c('value'))
> DT
species A1_2001 A2_2001 B1_2001 B1_2004 C1_2006
1: speca 0 5 0 15 50
2: specb 1 5 5 75 0
3: specc 10 15 20 0 15
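A side note, not part of the answer above: since this is a full transpose with the code/year combination as the new headers, data.table::transpose() can also do it in one step once a combined key column exists (make.names needs a reasonably recent data.table). A sketch, run on DT as first created above (before the melt/dcast), where the Site column already holds the plot codes:
# build the combined column key, then transpose the value columns
DT[, code_year := paste(Site, Year, sep = "_")]
transpose(DT[, .(code_year, speca, specb, specc)],
          keep.names = "species", make.names = "code_year")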

You mainly need a pivot_wider() to follow your pivot_longer():
library(tidyverse)
df <- tribble(~Site, ~Code, ~Year, ~speca, ~specb, ~specc,
              "A", "A1", 2001, 0, 1, 10,
              "A", "A2", 2001, 5, 5, 15,
              "B", "B1", 2001, 0, 5, 20,
              "B", "B1", 2004, 15, 75, 0,
              "C", "C1", 2006, 50, 0, 15)
df %>%
  mutate(Code = paste(Code, Year, sep = "_")) %>%
  select(-Site, -Year) %>%
  pivot_longer(starts_with("spec"), names_to = "species", values_to = "abundance") %>%
  pivot_wider(names_from = Code, values_from = abundance)
The result is
# A tibble: 3 x 6
species A1_2001 A2_2001 B1_2001 B1_2004 C1_2006
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 speca 0 5 0 15 50
2 specb 1 5 5 75 0
3 specc 10 15 20 0 15
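A small variation, not in the answers above: pivot_wider() can build the combined names itself when names_from is given several columns (its default names_sep is "_"), so the unite()/paste() step is optional. A sketch with the same df:
df %>%
  pivot_longer(starts_with("spec"), names_to = "species") %>%
  select(-Site) %>%
  pivot_wider(names_from = c(Code, Year), values_from = value)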

Related

Extract data based on time to death

Hi I'm analysing the pattern of spending for individuals before they died. My dataset contains individuals' monthly spending and their dates of death. The dataset looks similar to this:
ID 2018_11 2018_12 2019_01 2019_02 2019_03 2019_04 2019_05 2019_06 2019_07 2019_08 2019_09 2019_10 2019_11 2019_12 2020_01 date_of_death
A 15 14 6 23 23 5 6 30 1 15 6 7 8 30 1 2020-01-02
B 2 5 6 7 7 8 9 15 12 14 31 30 31 0 0 2019-11-15
Each column denotes the month of the year. For example, "2018_11" means November 2018. The number in each cell denotes the spending in that specific month.
I would like to construct a data frame which contains the spending data of each individual in their last 0-12 months. It will look like this:
ID last_12_month last_11_month ...... last_1_month last_0_month date_of_death
A 6 23 30 1 2020-01-02
B 2 5 30 31 2019-11-15
Each individual died at a different time. For example, individual A died on 2020-01-02, so the data for "last_0_month" for this person should be extracted from the column "2020_01", and that for "last_12_month" from "2019_01"; individual B died on 2019-11-15, so the data for "last_0_month" should be extracted from the column "2019_11", and that for "last_12_month" from the column "2018_11".
I will be really grateful for your help.
Using the data.table and lubridate packages:
library(data.table)
library(lubridate)
setDT(dt)
dt <- melt(dt, id.vars = c("ID", "date_of_death"))
dt[, since_death := interval(ym(variable), ymd(date_of_death)) %/% months(1)]
dt <- dcast(dt[since_death %between% c(0, 12)], ID + date_of_death ~ since_death, value.var = "value", fun.aggregate = sum)
setcolorder(dt, c("ID", "date_of_death", rev(names(dt)[3:15])))
setnames(dt, old = names(dt)[3:15], new = paste("last", names(dt)[3:15], "month", sep = "_"))
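The key step is interval(ym(variable), ymd(date_of_death)) %/% months(1), which counts the number of whole months between each spending month and the date of death. A quick sanity check, not part of the original answer:
library(lubridate)
# January 2019 is 12 whole months before a death on 2020-01-02
interval(ym("2019_01"), ymd("2020-01-02")) %/% months(1)
# [1] 12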
Results
dt
# ID date_of_death last_12_month last_11_month last_10_month last_9_month last_8_month last_7_month last_6_month last_5_month last_4_month last_3_month
# 1: A 2020-01-02 6 23 23 5 6 30 1 15 6 7
# 2: B 2019-11-15 2 5 6 7 7 8 9 15 12 14
# last_2_month last_1_month last_0_month
# 1: 8 30 1
# 2: 31 30 31
Data
dt <- structure(list(ID = c("A", "B"), `2018_11` = c(15L, 2L), `2018_12` = c(14L,
5L), `2019_01` = c(6L, 6L), `2019_02` = c(23L, 7L), `2019_03` = c(23L,
7L), `2019_04` = c(5L, 8L), `2019_05` = c(6L, 9L), `2019_06` = c(30L,
15L), `2019_07` = c(1L, 12L), `2019_08` = 15:14, `2019_09` = c(6L,
31L), `2019_10` = c(7L, 30L), `2019_11` = c(8L, 31L), `2019_12` = c(30L,
0L), `2020_01` = 1:0, date_of_death = structure(c(18263L, 18215L
), class = c("IDate", "Date"))), row.names = c(NA, -2L), class = c("data.frame"))
Here you can find a similar approach to the one presented by @RuiBarradas, but using lubridate to extract the difference in months:
library(dplyr)
library(tidyr)
library(lubridate)
# Initial data
df <- structure(list(
ID = c("A", "B"),
`2018_11` = c(15, 2),
`2018_12` = c(14, 5),
`2019_01` = c(6, 6),
`2019_02` = c(23, 7),
`2019_03` = c(23, 7),
`2019_04` = c(5, 8),
`2019_05` = c(6, 9),
`2019_06` = c(30, 15),
`2019_07` = c(1, 12),
`2019_08` = c(15, 14),
`2019_09` = c(6, 31),
`2019_10` = c(7, 30),
`2019_11` = c(8, 31),
`2019_12` = c(30, 0),
`2020_01` = c(1, 0),
date_of_death = c("2020-01-02", "2019-11-15")
),
row.names = c(NA, -2L),
class = "data.frame"
)
# Convert to longer all cols that start with 20 (e.g. 2020, 2021)
df_long <- df %>%
  pivot_longer(starts_with("20"), names_to = "month")
# treatment
df_long <- df_long %>%
  mutate(
    # To date, just in case
    date_of_death = as.Date(date_of_death),
    # Reformat the month labels (former column names) from e.g. 2021_01 to 2021-01-01
    month_fmt = as.Date(paste0(gsub("_", "-", month), "-01")),
    # End of month
    month_fmt = ceiling_date(month_fmt, "month") - days(1),
    # End of month for month of death
    date_of_death_eom = ceiling_date(date_of_death, "month") - days(1),
    # Difference in months (using end-of-month dates)
    month_diff = round(time_length(
      interval(month_fmt, date_of_death_eom), "month"), 0)) %>%
  # Keep only months between 0 and 12
  filter(month_diff %in% 0:12) %>%
  # Create labels for the next step
  mutate(labs = paste0("last_", month_diff, "_month"))
# To wider
end <- df_long %>%
  pivot_wider(
    id_cols = c(ID, date_of_death),
    names_from = labs,
    values_from = value
  )
end
#> # A tibble: 2 x 15
#> ID date_of_death last_12_month last_11_month last_10_month last_9_month
#> <chr> <date> <dbl> <dbl> <dbl> <dbl>
#> 1 A 2020-01-02 6 23 23 5
#> 2 B 2019-11-15 2 5 6 7
#> # ... with 9 more variables: last_8_month <dbl>, last_7_month <dbl>,
#> # last_6_month <dbl>, last_5_month <dbl>, last_4_month <dbl>,
#> # last_3_month <dbl>, last_2_month <dbl>, last_1_month <dbl>,
#> # last_0_month <dbl>
Created on 2022-03-09 by the reprex package (v2.0.1)
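A minor note on the month parsing in the answer above: lubridate::ym(), as used in the data.table answer, parses the "YYYY_MM" labels directly, so the gsub()/paste0() reformatting could also be written as ym(month). For example:
library(lubridate)
ym("2019_01")
#> [1] "2019-01-01"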
Here is a tidyverse solution.
Reshape the data to long format, coerce the date columns to class "Date", use Dirk Eddelbuettel's accepted answer to this question to compute the date differences in months and keep the rows with month differences between 0 and 12.
This grouped long format is probably more useful: I compute means by group and plot the spending of the last 12 months prior to death. But since the question asks for a wide format, the output data set spending12_wide is also created.
options(width=205)
df1 <- read.table(text = "
ID 2018_11 2018_12 2019_01 2019_02 2019_03 2019_04 2019_05 2019_06 2019_07 2019_08 2019_09 2019_10 2019_11 2019_12 2020_01 date_of_death
A 15 14 6 23 23 5 6 30 1 15 6 7 8 30 1 2020-01-02
B 2 5 6 7 7 8 9 15 12 14 31 30 31 0 0 2019-11-15
", header = TRUE, check.names = FALSE)
suppressPackageStartupMessages(library(dplyr))
library(tidyr)
library(ggplot2)
# Dirk's functions
monnb <- function(d) {
  lt <- as.POSIXlt(as.Date(d, origin = "1900-01-01"))
  lt$year*12 + lt$mon
}
# compute a month difference as a difference between two monnb's
diffmon <- function(d1, d2) { monnb(d2) - monnb(d1) }
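# A quick check of the helpers above (not from the original answer):
# January 2019 is 12 months before January 2020
diffmon("2019-01-01", "2020-01-02")
#> [1] 12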
spending12 <- df1 %>%
  pivot_longer(cols = starts_with('20'), names_to = "month") %>%
  mutate(month = as.Date(paste0(month, "_01"), "%Y_%m_%d"),
         date_of_death = as.Date(date_of_death)) %>%
  group_by(ID, date_of_death) %>%
  mutate(diffm = diffmon(month, date_of_death)) %>%
  filter(diffm >= 0 & diffm <= 12)
spending12 %>% summarise(spending = mean(value), .groups = "drop")
#> # A tibble: 2 x 3
#> ID date_of_death spending
#> <chr> <date> <dbl>
#> 1 A 2020-01-02 12.4
#> 2 B 2019-11-15 13.6
spending12_wide <- spending12 %>%
  mutate(month = zoo::as.yearmon(month)) %>%
  pivot_wider(
    id_cols = c(ID, date_of_death),
    names_from = diffm,
    names_glue = "last_{.name}_month",
    values_from = value
  )
spending12_wide
#> # A tibble: 2 x 15
#> # Groups: ID, date_of_death [2]
#> ID date_of_death last_12_month last_11_month last_10_month last_9_month last_8_month last_7_month last_6_month last_5_month last_4_month last_3_month last_2_month last_1_month last_0_month
#> <chr> <date> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 A 2020-01-02 6 23 23 5 6 30 1 15 6 7 8 30 1
#> 2 B 2019-11-15 2 5 6 7 7 8 9 15 12 14 31 30 31
ggplot(spending12, aes(month, value, color = ID)) +
  geom_line() +
  geom_point()
Created on 2022-03-09 by the reprex package (v2.0.1)

Performing pivot_longer() over multiple sets of columns

I am stuck on performing pivot_longer() over multiple sets of columns. Here is the sample dataset:
df <- data.frame(
  id = c(1, 2),
  uid = c("m1", "m2"),
  germ_kg = c(23, 24),
  mineral_kg = c(12, 17),
  perc_germ = c(45, 34),
  perc_mineral = c(78, 10))
I need the output dataframe to look like this
out <- data.frame(
  id = c(1, 1, 2, 2),
  uid = c("m1", "m1", "m2", "m2"),
  crop = c("germ", "mineral", "germ", "mineral"),
  kg = c(23, 12, 24, 17),
  perc = c(45, 78, 34, 10))
library(tidyverse)
df %>%
  # flip germ_kg/mineral_kg to kg_germ/kg_mineral so every name reads <measure>_<crop>
  rename_with(~str_replace(.x, '(.*)_kg', 'kg_\\1')) %>%
  # '.value' spreads the measure part (kg, perc) into its own column; 'crop' keeps the crop name
  pivot_longer(-c(id, uid), names_to = c('.value', 'crop'), names_sep = '_')
# A tibble: 4 x 5
id uid crop kg perc
<dbl> <chr> <chr> <dbl> <dbl>
1 1 m1 germ 23 45
2 1 m1 mineral 12 78
3 2 m2 germ 24 34
4 2 m2 mineral 17 10
If you were to use data.table:
library(data.table)
melt(setDT(df), c('id', 'uid'), patterns(kg = 'kg', perc = 'perc'))
id uid variable kg perc
1: 1 m1 1 23 45
2: 2 m2 1 24 34
3: 1 m1 2 12 78
4: 2 m2 2 17 10
I suspect there might be a simpler way using pivot_longer_spec(), but one tricky thing here is that your column names don't have a consistent ordering of their semantic components. @Onyambu's answer deals with this nicely by fixing it upstream. (A spec-based sketch follows below, after the output.)
library(tidyverse)
df %>%
  pivot_longer(-c(id, uid)) %>%
  separate(name, c("col1", "col2")) %>%                 # only needed
  mutate(crop = if_else(col2 == "kg", col1, col2),      # because name
         meas = if_else(col2 == "kg", col2, col1)) %>%  # structure
  select(id, uid, crop, meas, value) %>%                # is
  pivot_wider(names_from = meas, values_from = value)   # inconsistent
# A tibble: 4 x 5
id uid crop kg perc
<dbl> <chr> <chr> <dbl> <dbl>
1 1 m1 germ 23 45
2 1 m1 mineral 12 78
3 2 m2 germ 24 34
4 2 m2 mineral 17 10
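As hinted above, a spec-based version is possible. A sketch, assuming tidyr >= 1.0 and writing the spec by hand, which sidesteps the inconsistent name ordering: the spec maps each original column to an output column (.value) and a crop label, and pivot_longer_spec() does the rest.
library(tidyverse)
spec <- tribble(
  ~.name,         ~.value, ~crop,
  "germ_kg",      "kg",    "germ",
  "mineral_kg",   "kg",    "mineral",
  "perc_germ",    "perc",  "germ",
  "perc_mineral", "perc",  "mineral"
)
pivot_longer_spec(df, spec)
This should give the same four-row result as the answers above.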

Add a new column with a sum of counts to a dataframe according to information from another dataframe in R

I need help adding a count column to a table called tab1 according to another table, tab2.
Here is the first table:
tab1
Event_Groups Other_column
1 1_G1,2_G2 A
2 2_G1 B
3 4_G4 C
4 7_G5,8_G5,9_G5 D
As you can see, in the Event_Groups column I have two pieces of information (Event and Group numbers separated by a "_"). This information is also found in tab2$Event and tab2$Group, and the idea is, for each element within the rows of tab1 (separated by a comma), to count the number of rows in tab2 where VALUE1 < 10 AND VALUE2 > 30, and then add this count to tab1 in a new column called Sum_count.
Here is the second table:
tab2
Group Event VALUE1 VALUE2
1 G1 1 5 50 <- VALUE1 < 10 & VALUE2 > 30 : count 1
2 G1 2 6 20 <- VALUE2 < 30 : count 0
3 G2 2 50 50 <- VALUE1 > 10 : count 0
4 G3 3 0 0
5 G4 1 0 0
6 G4 4 2 40 <- VALUE1 < 10 & VALUE2 > 30 : count 1
7 G5 7 1 70 <- VALUE1 < 10 & VALUE2 > 30 : count 1
8 G5 8 4 67 <- VALUE1 < 10 & VALUE2 > 30 : count 1
9 G5 9 3 60 <- VALUE1 < 10 & VALUE2 > 30 : count 1
Example:
For the first element of row 1 in tab1 (1_G1), we see in tab2 (row 1) that VALUE1 < 10 & VALUE2 > 30, so I count 1.
For the second element of row 1 (2_G2), we see in tab2 (row 3) that VALUE1 > 10, so I count 0.
And here is the expected resulting tab1 dataframe:
Event_Groups Other_column Sum_count
1_G1,2_G2 A 1
2_G1 B 0
4_G4 C 1
7_G5,8_G5,9_G5 D 3
I do not know if I am clear enough; do not hesitate to ask questions.
Here are the two tables in dput format if it helps:
tab1
structure(list(Event_Groups = structure(1:4, .Label = c("1_G1,2_G2",
"2_G1", "4_G4", "7_G5,8_G5,9_G5"), class = "factor"), Other_column =
structure(1:4, .Label = c("A", "B", "C", "D"), class = "factor")),
class = "data.frame", row.names = c(NA,
-4L))
tab2
structure(list(Group = structure(c(1L, 1L, 2L, 3L, 4L, 4L, 5L,
5L, 5L), .Label = c("G1", "G2", "G3", "G4", "G5"), class = "factor"),
Event = c(1L, 2L, 2L, 3L, 1L, 4L, 7L, 8L, 9L), VALUE1 = c(5L,
6L, 50L, 0L, 0L, 2L, 1L, 4L, 3L), VALUE2 = c(50, 20, 50,
0, 0, 40, 70, 67, 60)), class = "data.frame", row.names = c(NA,
-9L))
Here is one way to do it:
library(dplyr)
library(tidyr)
tab1 %>%
  mutate(Event_Groups = as.character(Event_Groups)) %>%
  separate_rows(Event_Groups, sep = ",") %>%
  left_join(.,
            tab2 %>%
              unite(col = "Event_Groups", Event, Group) %>%
              mutate(count = if_else(VALUE1 < 10 & VALUE2 > 30, 1L, 0L))) %>%
  group_by(Other_column) %>%
  summarise(Event_Groups = paste(unique(Event_Groups), collapse = ","),
            Sum_count = sum(count)) %>%
  select(Event_Groups, everything())
#> Joining, by = "Event_Groups"
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 4 x 3
#> Event_Groups Other_column Sum_count
#> <chr> <fct> <int>
#> 1 1_G1,2_G2 A 1
#> 2 2_G1 B 0
#> 3 4_G4 C 1
#> 4 7_G5,8_G5,9_G5 D 3
Created on 2021-07-29 by the reprex package (v0.3.0)
You can try a tidyverse approach:
library(tidyverse)
tab1 %>%
  rownames_to_column() %>%
  separate_rows(Event_Groups, sep = ",") %>%
  separate(Event_Groups, into = c("Event", "Group"), sep = "_", convert = T) %>%
  left_join(tab2 %>%
              mutate(count = as.numeric(VALUE1 < 10 & VALUE2 > 30)),
            by = c("Event", "Group")) %>%
  unite(Event_Groups, Event, Group) %>%
  group_by(rowname) %>%
  summarise(Event_Groups = toString(Event_Groups),
            Other_column = unique(Other_column),
            count = sum(count))
# A tibble: 4 x 4
rowname Event_Groups Other_column count
<chr> <chr> <chr> <dbl>
1 1 1_G1, 2_G2 A 1
2 2 2_G1 B 0
3 3 4_G4 C 1
4 4 7_G5, 8_G5, 9_G5 D 3
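For comparison, a compact base R sketch, not from the answers above: build Event_Group keys for tab2, flag the rows meeting both conditions, and sum the flags matched by each comma-separated element of tab1.
keys <- paste(tab2$Event, tab2$Group, sep = "_")
ok   <- tab2$VALUE1 < 10 & tab2$VALUE2 > 30
tab1$Sum_count <- sapply(strsplit(as.character(tab1$Event_Groups), ","),
                         function(x) sum(ok[match(x, keys)], na.rm = TRUE))
tab1
This yields the Sum_count values 1, 0, 1, 3 shown in the question.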

How do I cast data into non-equi columns?

I have a dataset of events, grouped by let like so:
set.seed(3)
events <- data.frame(
  let = rep(LETTERS[1:2], each = 3),
  age = c(0, sample(1:20, size = 2),
          0, sample(1:20, size = 2)),
  value = sample(1:100, size = 6))
let age value
1 A 0 61
2 A 4 60
3 A 16 13
4 B 0 29
5 B 8 56
6 B 7 99
How can I cast the data frame so that age becomes multiple columns grouped into weeks? That is, for each column, take the value at the largest age that is less than or equal to 0, 7, 14, and 21 days.
events.cast <- data.frame(
  let = LETTERS[1:2],
  T0_value = c(61,29),
  T1_value = c(60,99),
  T2_value = c(60,56),
  T3_value = c(13,56))
let T0_value T1_value T2_value T3_value
1 A 61 60 60 13
2 B 29 99 56 56
One option is to cut the 'age' into buckets, get the max row by that group and 'let', then reshape into 'wide' format
library(dplyr)
library(tidyr)
library(stringr)
events %>%
  group_by(grp = cut(age, breaks = c(-Inf, 0, 7, 14, 21),
                     labels = str_c("T", 0:3, "_value")), let) %>%
  slice(which.max(value)) %>%
  ungroup %>%
  select(-age) %>%
  group_by(let) %>%
  complete(grp = unique(.$grp)) %>%
  fill(value) %>%
  pivot_wider(names_from = grp, values_from = value)
# A tibble: 2 x 5
# Groups: let [2]
# let T0_value T1_value T2_value T3_value
# <chr> <int> <int> <int> <int>
#1 A 61 60 60 13
#2 B 29 99 56 56
data
events <- structure(list(let = c("A", "A", "A", "B", "B", "B"), age = c(0L,
4L, 16L, 0L, 8L, 7L), value = c(61L, 60L, 13L, 29L, 56L, 99L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

Retrieve column name based on values from cell

I'm trying to accomplish something like what is illustrated in this question.
However, in my situation, there might be multiple cases where two columns evaluate to True:
year cat1 cat2 cat3 ... catN
2000 0 1 1 0
2001 1 0 0 0
2002 0 1 0 1
....
2018 0 1 0 0
In the DF above, year 2000 has both the cat2 and cat3 categories. In this case, how do I create a new row that will hold the second category? Something like this:
year category
2000 cat2
2000 cat3
2001 cat1
2002 cat2
2002 catN
....
2018 cat2
You can use gather from the tidyverse:
library(tidyverse)
data = tribble(
  ~year, ~cat1, ~cat2, ~cat3, ~catN,
  2000, 0, 1, 1, 0,
  2001, 1, 0, 0, 0,
  2002, 0, 1, 0, 1
)
data %>%
  gather(key = "cat", value = "bool", 2:ncol(.)) %>%
  filter(bool == 1)
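gather() still works but is superseded in current tidyr; the same result with pivot_longer(), as a small sketch using the tribble above:
data %>%
  pivot_longer(-year, names_to = "category", values_to = "bool") %>%
  filter(bool == 1) %>%
  select(-bool)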
One way would be to get the row/column indices of all the values which are 1, then subset the year values using the row indices and the column names using the column indices to create a new dataframe.
mat <- which(df[-1] == 1, arr.ind = TRUE)
df1 <- data.frame(year = df$year[mat[, 1]], category = names(df)[-1][mat[, 2]])
df1[order(df1$year), ]
# year category
#2 2000 cat2
#5 2000 cat3
#1 2001 cat1
#3 2002 cat2
#6 2002 catN
#4 2018 cat2
data
df <- structure(list(year = c(2000L, 2001L, 2002L, 2018L), cat1 = c(0L,
1L, 0L, 0L), cat2 = c(1L, 0L, 1L, 1L), cat3 = c(1L, 0L, 0L, 0L
), catN = c(0L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA, -4L))
You can also use melt from reshape2:
new_df = melt(df, id.vars='year')
new_df[new_df$value==1, c('year','variable')]
Data
df = data.frame(year=c(2000,2001),
                cat1=c(0,1),
                cat2=c(1,0),
                cat3=c(1,0))
Output:
year variable
2 2001 cat1
3 2000 cat2
5 2000 cat3
Here is another variation with gather: mutate the columns so that 0 becomes NA, then gather while removing the NA elements with na.rm = TRUE.
library(dplyr)
library(tidyr)
data %>%
  mutate_at(-1, na_if, y = 0) %>%
  gather(category, val, -year, na.rm = TRUE) %>%
  select(-val)
# A tibble: 5 x 2
# year category
# <dbl> <chr>
#1 2001 cat1
#2 2000 cat2
#3 2002 cat2
#4 2000 cat3
#5 2002 catN
data
data <- structure(list(year = c(2000, 2001, 2002), cat1 = c(0, 1, 0),
cat2 = c(1, 0, 1), cat3 = c(1, 0, 0), catN = c(0, 0, 1)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
