Multiple step gathering of columns in R

I have a data.frame like this:
df<-data.frame(Time=c(1:100),Rome_population=c(1:100),Rome_gdp=c(1:100),Rome_LifeLenght=c(1:100),London_population=c(1:100),London_gdp=c(1:100),London_LifeLenght=c(1:100),Berlin_population=c(1:100),Berlin_gdp=c(1:100),Berlin_LifeLenght=c(1:100))
And I would like to have a data.frame like this:
df<-data.frame(Time,City,population,gdp,LifeLenght)
How can I make it? Possibly with tidyr?
Thanks!

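Try:
library(dplyr)
library(tidyr)
df %>%
gather(key, value, Rome_population:Berlin_LifeLenght) %>% # stack all city_stat columns
separate(key, into = c("city", "stat"), sep = "_") %>% # split e.g. "Rome_gdp" into city and stat
spread(stat, value) # one column per stat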
Output:
# A tibble: 300 x 5
Time city gdp LifeLenght population
<int> <chr> <int> <int> <int>
1 1 Berlin 1 1 1
2 1 London 1 1 1
3 1 Rome 1 1 1
4 2 Berlin 2 2 2
5 2 London 2 2 2
6 2 Rome 2 2 2
7 3 Berlin 3 3 3
8 3 London 3 3 3
9 3 Rome 3 3 3
10 4 Berlin 4 4 4
# ... with 290 more rows
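
The same reshaping can probably be done in a single step with pivot_longer(), which supersedes gather()/spread() in current tidyr. The sketch below assumes tidyr 1.0 or later and uses a shortened stand-in for the question's data (the name df_small and the three time points are only for illustration). The special ".value" entry in names_to tells pivot_longer() that the part of each column name after the underscore should become a column of its own.

```r
library(tidyr)

# Shortened stand-in for the question's data (3 time points, 2 cities)
df_small <- data.frame(
  Time = 1:3,
  Rome_population = 1:3, Rome_gdp = 1:3, Rome_LifeLenght = 1:3,
  London_population = 1:3, London_gdp = 1:3, London_LifeLenght = 1:3
)

# Split each column name at "_": the first part goes to "city",
# the second part (".value") becomes the name of an output column
long <- pivot_longer(
  df_small,
  cols = -Time,
  names_to = c("city", ".value"),
  names_sep = "_"
)

long
```

The result has one row per Time and city, with population, gdp and LifeLenght as separate columns.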

Related

Grouping and stacking data

A sample of my data is :
dat <- read.table(text = " ID BC1 DC1 DE1 MN2 DC2 PO2 SA3 BC3 KL3 AA4 AP4 BC4 PO4
1 2 1 2 3 1 3 1 1 3 2 2 2 2
2 3 1 1 2 3 1 1 2 3 1 1 3 2
3 2 3 2 3 2 3 2 1 1 3 1 1 1
4 3 3 1 1 1 1 1 2 2 1 2 1 2", header = TRUE)
I want to get the following table and missing data are blank
ID Group1 Group2 Group3 Group4
1 2 1 2
2 3 1 1
3 2 3 2
4 3 3 1
1 3 1 3
2 2 3 1
3 3 2 3
4 1 1 1
1 1 1 3
2 1 2 3
3 2 1 1
4 1 2 2
1 2 2 2 2
2 1 1 3 2
3 3 1 1 1
4 1 2 1 2
The number at the end of each column name indicates which group the column belongs to. For example, BC1, DC1 and DE1 form the first four rows with their IDs; MN2, DC2 and PO2 form the next four rows with their IDs, and so on.

What about using the row numbers with some pivoting?
library(dplyr)
library(tidyr)
dat |>
pivot_longer(-ID, names_sep = "(?=\\d)", names_to = c(NA, "id")) |>
group_by(ID, id) |>
mutate(name = row_number()) |>
pivot_wider(id_cols = c(ID, id), names_prefix = "Group") |>
arrange(id) |>
ungroup() |>
select(-id)
Or using a map:
library(purrr)
library(dplyr)
paste(1:4) |> # unique(readr::parse_number(names(dat |> select(-ID))))
map(\(x) select(dat, ID, ends_with(x)) |> rename_with(\(nm) paste0("Group", seq_along(nm)), -ID)) |>
bind_rows()
Output:
# A tibble: 16 × 5
ID Group1 Group2 Group3 Group4
<int> <int> <int> <int> <int>
1 1 2 1 2 NA
2 2 3 1 1 NA
3 3 2 3 2 NA
4 4 3 3 1 NA
5 1 3 1 3 NA
6 2 2 3 1 NA
7 3 3 2 3 NA
8 4 1 1 1 NA
9 1 1 1 3 NA
10 2 1 2 3 NA
11 3 2 1 1 NA
12 4 1 2 2 NA
13 1 2 2 2 2
14 2 1 1 3 2
15 3 3 1 1 1
16 4 1 2 1 2
Update 13-01: Now the first solution returns the correct ID (not id) + another approach added.

Would be interesting to see if there is an easier approach:
library(tidyverse)
dat |>
pivot_longer(-ID) |>
mutate(id = str_extract(name, "\\d$")) |>
group_by(ID, id) |>
mutate(name = paste0("Group", row_number())) |>
ungroup() |>
pivot_wider(names_from = name, values_from = value) |>
arrange(id, ID) |>
select(-id)
#> # A tibble: 16 × 5
#> ID Group1 Group2 Group3 Group4
#> <int> <int> <int> <int> <int>
#> 1 1 2 1 2 NA
#> 2 2 3 1 1 NA
#> 3 3 2 3 2 NA
#> 4 4 3 3 1 NA
#> 5 1 3 1 3 NA
#> 6 2 2 3 1 NA
#> 7 3 3 2 3 NA
#> 8 4 1 1 1 NA
#> 9 1 1 1 3 NA
#> 10 2 1 2 3 NA
#> 11 3 2 1 1 NA
#> 12 4 1 2 2 NA
#> 13 1 2 2 2 2
#> 14 2 1 1 3 2
#> 15 3 3 1 1 1
#> 16 4 1 2 1 2

You can rename the data with a specified pattern ("index1_index2"), i.e.
# ID 1_1 1_2 1_3 2_1 2_2 2_3 3_1 3_2 3_3 4_1 4_2 4_3 4_4
# 1 1 2 1 2 3 1 3 1 1 3 2 2 2 2
# 2 2 3 1 1 2 3 1 1 2 3 1 1 3 2
# 3 3 2 3 2 3 2 3 2 1 1 3 1 1 1
# 4 4 3 3 1 1 1 1 1 2 2 1 2 1 2
so that you can add the special element ".value" to names_to when using pivot_longer() to stack multiple columns that are grouped by that pattern.
Code
library(dplyr)
library(tidyr)
dat %>%
rename_with(~ sub('\\D+', '', .x) %>%
paste(., ave(., ., FUN = seq), sep = '_'), -ID) %>%
pivot_longer(-ID, names_to = c("set", ".value"), names_sep = '_') %>%
arrange(set) %>%
select(-set)
Output
# A tibble: 16 × 5
ID `1` `2` `3` `4`
<int> <int> <int> <int> <int>
1 1 2 1 2 NA
2 2 3 1 1 NA
3 3 2 3 2 NA
4 4 3 3 1 NA
5 1 3 1 3 NA
6 2 2 3 1 NA
7 3 3 2 3 NA
8 4 1 1 1 NA
9 1 1 1 3 NA
10 2 1 2 3 NA
11 3 2 1 1 NA
12 4 1 2 2 NA
13 1 2 2 2 2
14 2 1 1 3 2
15 3 3 1 1 1
16 4 1 2 1 2
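
If the output columns should be called Group1, Group2, ... rather than `1`, `2`, ..., one option is to put that prefix into the intermediate names before pivoting. This is only a sketch on a shortened copy of the sample data; the helper vectors set and pos are names made up for this illustration.

```r
library(dplyr)
library(tidyr)

# Shortened copy of the sample data (two IDs, two sets)
dat <- read.table(text = "ID BC1 DC1 DE1 MN2 DC2 PO2
1 2 1 2 3 1 3
2 3 1 1 2 3 1", header = TRUE)

# set: trailing digit of each column name; pos: position of the column within its set
set <- sub("\\D+", "", names(dat)[-1])
pos <- ave(seq_along(set), set, FUN = seq_along)

# Rename to "<set>_Group<pos>", e.g. "1_Group1", "1_Group2", "2_Group1", ...
names(dat)[-1] <- paste0(set, "_Group", pos)

stacked <- dat %>%
  pivot_longer(-ID, names_to = c("set", ".value"), names_sep = "_") %>%
  arrange(set, ID) %>%
  select(-set)

stacked
```

Sets with fewer columns than the widest set get NA in the remaining Group columns, as in the outputs above.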

DPLYR - merging rows together using a column value as a conditional

I have a series of rows in a single dataframe. I'm trying to aggregate the first two rows for each ID, i.e. I want to combine events 1 and 2 for ID 1 into a single row, events 1 and 2 for ID 2 into a single row, etc., but leave event 3 completely untouched.
id <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
event <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
score <- c(3,NA,1,3,NA,2,6,NA,1,8,NA,2,4,NA,1)
score2 <- c(NA,4,1,NA,5,2,NA,0,3,NA,5,6,NA,8,7)
df <- tibble(id, event, score, score2)
# A tibble: 15 x 4
id event score score2
<dbl> <dbl> <dbl> <dbl>
1 1 1 3 NA
2 1 2 NA 4
3 1 3 1 1
4 2 1 3 NA
5 2 2 NA 5
6 2 3 2 2
7 3 1 6 NA
8 3 2 NA 0
9 3 3 1 3
10 4 1 8 NA
11 4 2 NA 5
12 4 3 2 6
13 5 1 4 NA
14 5 2 NA 8
15 5 3 1 7
I've tried :
df_merged<- df %>% group_by (id) %>% summarise_all(funs(min(as.character(.),na.rm=TRUE))),
which aggregates these nicely, but then I struggle to merge these back into the original dataframe/tibble (there are really about 300 different "score" columns in the full dataset, so a right_join is a headache with score.x, score.y, score2.x, score2.y all over the place...)
Ideally, the solution would use dplyr, as the rest of my code runs on it!
EDIT:
Ideally, my expected output would be:
# A tibble: 10 x 4
id event score score2
<dbl> <dbl> <dbl> <dbl>
1 1 1 3 4
3 1 3 1 1
4 2 1 3 5
6 2 3 2 2
7 3 1 6 0
9 3 3 1 3
10 4 1 8 5
12 4 3 2 6
13 5 1 4 8
15 5 3 1 7

We may change the order of NA elements with replace
library(dplyr)
df %>%
group_by(id) %>%
mutate(across(starts_with('score'),
~replace(., 1:2, .[1:2][order(is.na(.[1:2]))]))) %>%
ungroup %>%
filter(if_all(starts_with('score'), Negate(is.na)))
Output:
# A tibble: 10 x 4
id event score score2
<dbl> <dbl> <dbl> <dbl>
1 1 1 3 4
2 1 3 1 1
3 2 1 3 5
4 2 3 2 2
5 3 1 6 0
6 3 3 1 3
7 4 1 8 5
8 4 3 2 6
9 5 1 4 8
10 5 3 1 7

Here is an alternative way to achieve your task with fill from tidyr package:
library(dplyr)
library(tidyr)
df %>%
group_by(id) %>%
fill(everything(), .direction = "down") %>%
fill(everything(), .direction = "up") %>%
slice(1,3)
id event score score2
<dbl> <dbl> <dbl> <dbl>
1 1 1 3 4
2 1 3 1 1
3 2 1 3 5
4 2 3 2 2
5 3 1 6 0
6 3 3 1 3
7 4 1 8 5
8 4 3 2 6
9 5 1 4 8
10 5 3 1 7

How about this?
library(dplyr)
df_e12 <- df %>%
filter(event %in% c(1, 2)) %>%
group_by(id) %>%
mutate(across(starts_with("score"), ~min(.x, na.rm = TRUE))) %>%
ungroup() %>%
distinct(id, .keep_all = TRUE)
df_e3 <- df %>%
filter(event == 3)
df <- bind_rows(df_e12, df_e3) %>%
arrange(id, event)
df
> df
# A tibble: 10 x 4
id event score score2
<dbl> <dbl> <dbl> <dbl>
1 1 1 3 4
2 1 3 1 1
3 2 1 3 5
4 2 3 2 2
5 3 1 6 0
6 3 3 1 3
7 4 1 8 5
8 4 3 2 6
9 5 1 4 8
10 5 3 1 7
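
Another way to express the same idea, sketched under the assumption that event 2 should always be folded into event 1 and that each score column has at most one non-missing value across the two events: relabel event 2 as event 1, then keep the first non-missing value per id and event. The data below is a shortened copy of the question's tibble, and the name merged is only for illustration.

```r
library(dplyr)

# Shortened copy of the question's data (two ids)
df <- tibble(
  id     = c(1, 1, 1, 2, 2, 2),
  event  = c(1, 2, 3, 1, 2, 3),
  score  = c(3, NA, 1, 3, NA, 2),
  score2 = c(NA, 4, 1, NA, 5, 2)
)

merged <- df %>%
  mutate(event = if_else(event == 2, 1, event)) %>%  # treat event 2 as part of event 1
  group_by(id, event) %>%
  # first non-missing value in each score column (NA if every value is missing)
  summarise(across(starts_with("score"), ~ .x[!is.na(.x)][1]), .groups = "drop")

merged
```

Because across() selects the columns by prefix, this should carry over to a few hundred score columns without any joins.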

How to recognize unknown patterns in data frame by row?

I have a data frame where I have agricultural use codes (1-5) for 15 consecutive years. Each row is a polygon representing a field. Ultimately I need R to loop through the rows, recognize patterns of use and tell me their respective frequency. Unfortunately, in my real data set I have over 1 million features, so not all possible patterns are known in advance.
a <- data.frame(replicate(15, sample(0:5, 500, replace = TRUE)))
colnames(a) <- paste0("use",2005:2019)
id <- c(1:500)
a <- cbind(id,a)
id use2005 use2006 use2007 use2008 use2009 use2010 use2011 use2012 use2013 use2014 use2015 ...
1 1 1 1 1 1 2 2 1 4 4 4 ...
2 4 4 4 4 5 5 5 0 5 5 5 ...
3 1 4 3 2 3 2 4 5 1 1 1 ...
4 1 1 1 1 1 2 2 1 4 4 4 ...
5 4 2 2 2 2 5 3 3 3 3 3 ...
So in this arbitrary example, the code should recognize that id 1 & 4 have the same pattern.
In the end I imagine the result to be some sort of frequency distribution to see if there are certain patterns in the agricultural use of my fields.
For example:
1 1 1 1 1 2 1 1 1 3 2 4 1 1 1
[50] - occurs 50 times
5 5 5 5 5 1 1 1 1 4 4 4 2 2 3
[35] - occurs 35 times
and so forth with all existing combinations...
Unfortunately I have no idea how to approach this. I have no experience with pattern recognition.
Thank you!

maybe this?
library(tidyverse)
a[, -1] %>% group_by_all %>% count
# use2005 use2006 use2007 use2008 use2009 use2010 use2011 use2012 use2013 use2014 use2015 n
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 1 1 1 1 1 2 2 1 4 4 4 2
# 2 1 4 3 2 3 2 4 5 1 1 1 1
# 3 4 2 2 2 2 5 3 3 3 3 3 1
# 4 4 4 4 4 5 5 5 0 5 5 5 1
Or, if you want to include the field ids, you could switch to group_by_at, exclude id from the grouping, and then paste the ids together:
a %>%
group_by_at(vars(-id)) %>%
summarise(n = n(), ids = paste(id, collapse= "," ))
# use2005 use2006 use2007 use2008 use2009 use2010 use2011 use2012 use2013 use2014 use2015 n ids
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <chr>
# 1 1 1 1 1 1 2 2 1 4 4 4 2 1,4
# 2 1 4 3 2 3 2 4 5 1 1 1 1 3
# 3 4 2 2 2 2 5 3 3 3 3 3 1 5
# 4 4 4 4 4 5 5 5 0 5 5 5 1 2

Here's an example on how to approach this, using a small example dataset (i.e. the one you posted).
library(tidyverse)
# example dataset
a = read.table(text = "
id use2005 use2006 use2007 use2008 use2009 use2010 use2011 use2012 use2013 use2014 use2015
1 1 1 1 1 1 2 2 1 4 4 4
2 4 4 4 4 5 5 5 0 5 5 5
3 1 4 3 2 3 2 4 5 1 1 1
4 1 1 1 1 1 2 2 1 4 4 4
5 4 2 2 2 2 5 3 3 3 3 3
", header=T)
a %>%
group_nest(id) %>% # for each row
mutate(pattern = map(data, ~paste(.x, collapse = ","))) %>% # create the pattern as a string
unnest(pattern) %>% # unnest pattern column
count(pattern, sort = T) # count patterns
# # A tibble: 4 x 2
# pattern n
# <chr> <int>
# 1 1,1,1,1,1,2,2,1,4,4,4 2
# 2 1,4,3,2,3,2,4,5,1,1,1 1
# 3 4,2,2,2,2,5,3,3,3,3,3 1
# 4 4,4,4,4,5,5,5,0,5,5,5 1
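
For a data set with over a million rows, building the pattern strings in one vectorised call may be noticeably faster than nesting row by row. The base R sketch below makes that assumption and uses a tiny made-up data frame with two year columns; pattern, freq and ids_by_pattern are names chosen for this illustration.

```r
# Tiny made-up example: 6 fields, 2 years
a <- data.frame(
  id      = 1:6,
  use2005 = c(1, 4, 1, 1, 4, 1),
  use2006 = c(1, 4, 4, 1, 2, 1)
)

# One string per row, e.g. "1,1"; paste() is vectorised over the columns
pattern <- do.call(paste, c(a[-1], sep = ","))

# Frequency of each pattern, most common first
freq <- sort(table(pattern), decreasing = TRUE)

# Which field ids share each pattern
ids_by_pattern <- split(a$id, pattern)

freq
ids_by_pattern[["1,1"]]
```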

The dplyr way to get grouped differences

I am trying to figure out the dplyr way to do grouped differences.
Here is some fake data:
df <- crossing(year = seq(1, 4), week = seq(1, 3)) %>%
mutate(value = c(rep(4, 3), rep(3, 3), rep(2, 3), rep(1, 3)))
year week value
<int> <int> <dbl>
1 1 1 4
2 1 2 4
3 1 3 4
4 2 1 3
5 2 2 3
6 2 3 3
7 3 1 2
8 3 2 2
9 3 3 2
10 4 1 1
11 4 2 1
12 4 3 1
What I would like is year1 - year2, year2 - year3, and year3 - year4. The result would look like the following.
year week diffs
<int> <int> <dbl>
1 1 1 1
2 1 2 1
3 1 3 1
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 1
8 3 2 1
9 3 3 1
Edit:
I apologize. I was trying to make a simple reprex, but I messed up a lot.
Please let me know what the proper etiquette is. I don't want to ruffle any feathers.
I did not know that -diff() was a function. What I am actually looking for is the percent difference, ((new-old)/old)*100, and I am not able to find a straightforward way to use diff to get that value.
I am starting from the largest year. Adding an arrange(desc(year)) to the above code is what I have. I would be trimming the smallest year, not the largest.
If this edit is worth a separate question, let me know.

If you don't have missing years for each week:
df %>%
arrange(year) %>%
group_by(week) %>%
mutate(diffs = value - lead(value)) %>%
na.omit() %>% select(-value)
# A tibble: 9 x 3
# Groups: week [3]
# year week diffs
# <int> <int> <dbl>
#1 1 1 1
#2 1 2 1
#3 1 3 1
#4 2 1 1
#5 2 2 1
#6 2 3 1
#7 3 1 1
#8 3 2 1
#9 3 3 1

You can use diff, but it needs adjusting, as it subtracts the other way and returns a vector that's one shorter than what it's passed:
library(tidyverse)
diffed <- crossing(year = seq(1,4),
week = seq(1,3)) %>%
mutate(value = rep(4:1, each = 3)) %>%
group_by(week) %>%
mutate(value = c(-diff(value), NA)) %>%
drop_na(value)
diffed
#> # A tibble: 9 x 3
#> # Groups: week [3]
#> year week value
#> <int> <int> <int>
#> 1 1 1 1
#> 2 1 2 1
#> 3 1 3 1
#> 4 2 1 1
#> 5 2 2 1
#> 6 2 3 1
#> 7 3 1 1
#> 8 3 2 1
#> 9 3 3 1

using dplyr and do:
library(dplyr)
df %>% group_by(week) %>% do(cbind(.[-nrow(.),1:2],diffs=-diff(.$value)))
# # A tibble: 9 x 3
# # Groups: week [3]
# year week diffs
# <int> <int> <dbl>
# 1 1 1 1
# 2 2 1 1
# 3 3 1 1
# 4 1 2 1
# 5 2 2 1
# 6 3 2 1
# 7 1 3 1
# 8 2 3 1
# 9 3 3 1
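
For the percent difference mentioned in the question's edit, lag() may be more direct than diff(). The sketch below assumes that "old" means the previous year and "new" the current year within each week, which is only one reading of the edit; if the comparison should run the other way, lead() can be swapped in for lag(). The names pct and pct_diff are made up for this example.

```r
library(dplyr)
library(tidyr)

df <- crossing(year = 1:4, week = 1:3) %>%
  mutate(value = rep(c(4, 3, 2, 1), each = 3))

pct <- df %>%
  group_by(week) %>%
  arrange(year, .by_group = TRUE) %>%                    # years in ascending order per week
  mutate(pct_diff = (value - lag(value)) / lag(value) * 100) %>%
  ungroup() %>%
  filter(!is.na(pct_diff))                               # the first year has no "old" value

pct
```

Under this reading the smallest year in each week is the one that gets dropped, since it has no earlier year to compare against.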

Dealing with ties using rank (R)

I'm trying to create dummy variable for whether a child is first born, and one for if the child is second born. My data looks something like this
ID MID CMOB CYRB
1 1 1 1991
2 1 7 1989
3 2 1 1985
4 2 11 1985
5 2 9 1994
6 3 4 1992
7 4 2 1992
8 4 10 1983
With ID = child ID, MID = mother ID, CMOB = month of birth and CYRB = year of birth.
For the first born dummy I tried using this:
Identifiers_age <- Identifiers_age %>% group_by(MID) %>%
mutate(first = as.numeric(rank(CYRB) == 1))
But there doesn't seem to be a way of breaking ties by the rank of another column (in this case the desired column being CMOB). Whenever I try using the "ties.method" argument, it tells me the input must be a character vector.
Am I missing something here?

order might be more convenient to use here, from ?order:
order returns a permutation which rearranges its first argument into
ascending or descending order, breaking ties by further arguments.
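# order() returns a permutation rather than ranks; order(order(...)) turns it into ranks,
# which matters for mothers with three or more children
Identifiers_age <- Identifiers_age %>% group_by(MID) %>%
mutate(first = as.numeric(order(order(CYRB, CMOB)) == 1))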
Identifiers_age
#Source: local data frame [8 x 5]
#Groups: MID [4]
# ID MID CMOB CYRB first
# <int> <int> <int> <int> <dbl>
#1 1 1 1 1991 0
#2 2 1 7 1989 1
#3 3 2 1 1985 1
#4 4 2 11 1985 0
#5 5 2 9 1994 0
#6 6 3 4 1992 1
#7 7 4 2 1992 0
#8 8 4 10 1983 1

If we still want to use rank, we can convert 'CYRB' and 'CMOB' into a 'Date', apply rank to it, and then get the binary output from the resulting logical vector
Identifiers_age %>%
group_by(MID) %>%
mutate(first = as.integer(rank(as.Date(paste(CYRB, CMOB, 1,
sep="-"), "%Y-%m-%d"))==1))
# ID MID CMOB CYRB first
# <int> <int> <int> <int> <int>
#1 1 1 1 1991 0
#2 2 1 7 1989 1
#3 3 2 1 1985 1
#4 4 2 11 1985 0
#5 5 2 9 1994 0
#6 6 3 4 1992 1
#7 7 4 2 1992 0
#8 8 4 10 1983 1
Or we can use arithmetic to do this with rank
Identifiers_age %>%
group_by(MID) %>%
mutate(first = as.integer(rank(CYRB + CMOB/12)==1))
# ID MID CMOB CYRB first
# <int> <int> <int> <int> <int>
#1 1 1 1 1991 0
#2 2 1 7 1989 1
#3 3 2 1 1985 1
#4 4 2 11 1985 0
#5 5 2 9 1994 0
#6 6 3 4 1992 1
#7 7 4 2 1992 0
#8 8 4 10 1983 1
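
The question also asks for a second-born dummy. A sketch that covers both, assuming the sample data from the question: order() returns positions rather than ranks, so applying it twice yields each child's birth rank within the mother, with ties in CYRB broken by CMOB. The column names born and second are made up for this illustration.

```r
library(dplyr)

# Sample data from the question
Identifiers_age <- tibble(
  ID   = 1:8,
  MID  = c(1, 1, 2, 2, 2, 3, 4, 4),
  CMOB = c(1, 7, 1, 11, 9, 4, 2, 10),
  CYRB = c(1991, 1989, 1985, 1985, 1994, 1992, 1992, 1983)
)

birth_order <- Identifiers_age %>%
  group_by(MID) %>%
  mutate(
    born   = order(order(CYRB, CMOB)),  # birth rank within mother: year first, then month
    first  = as.integer(born == 1),
    second = as.integer(born == 2)
  ) %>%
  ungroup()

birth_order
```

Children born in the same year and month to the same mother would still need an extra tie-breaker, which this sketch does not handle.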
