Grouping data starting with specific number in R - r

Apologies if the title is unclear. I have data as shown below, where the prefixes 1, 2, 3, ... in the row names are months of various years. I want to gather the values for each month separately for columns a and l.
a l
1-2006 3.498939 0.8523857
1-2007 14.801777 0.2457656
1-2008 6.893728 0.5381691
2-2006 2.090962 0.6764694
2-2007 9.192913 0.8740950
2-2016 5.059505 1.1761113
Structure of data is;
data<-structure(list(a = c(3.49893890760882, 14.8017770056402, 6.89372828391484,
2.0909624091048, 9.19291324208917, 5.05950526612261, 13.1570625271881,
14.9570662205959, 7.72453112976811, 12.9331892673657
), l = c(0.852385662732809,
0.245765570168399, 0.538169092055646, 0.676469362818052, 0.874095005203713,
1.17611132212132, 0.76857056091243, 0.622533767341579, 0.9562200838363,
1.10064589903771, 0.85863722854391
)), class = "data.frame", row.names = c("1-2006",
"1-2007", "1-2008",
"2-2006", "2-2007",
"2-2016",
"3-2015", "3-2016", "3-2017", "3-2018"
))
For example, I want to gather all January rows (1-2005, 1-2006, ...) and all March rows (3-2012, 3-2015, ...) for a, and likewise for l, like this:
january_a
1-2006 3.498939
1-2007 14.801777
1-2008 6.893728
january_l
1-2006 0.8523857
1-2007 0.2457656
1-2008 0.5381691
march_a
3-2012 9.192913
3-2015 5.059505
march_l
3-2012 0.8740950
3-2015 1.1761113

You could add a column which contains only the numerical prefix, and then split on that:
data$prefix <- sub("^(\\d+).*$", "\\1", row.names(data))
data_a <- split(data[,"a"], data$prefix)
data_a
$`1`
[1] 3.498939 14.801777 6.893728
$`2`
[1] 2.090962 9.192913 5.059505
Data:
data <- data.frame(a=c(3.498939, 14.801777, 6.893728, 2.090962, 9.192913, 5.059505),
l=c(0.8523857, 0.2457656, 0.5381691, 0.6764694, 0.8740950, 1.1761113))
row.names(data) <- c("1-2006", "1-2007", "1-2008", "2-2006", "2-2007", "2-2016")
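The split above covers column a only. As a minimal base R sketch of how the same idea could produce pieces named like the january_a / january_l objects in the question, the names `month_num`, `month_lab`, `by_month` and `pieces` below are illustrative assumptions, not part of the original answer:

```r
data <- data.frame(
  a = c(3.498939, 14.801777, 6.893728, 2.090962, 9.192913, 5.059505),
  l = c(0.8523857, 0.2457656, 0.5381691, 0.6764694, 0.8740950, 1.1761113),
  row.names = c("1-2006", "1-2007", "1-2008", "2-2006", "2-2007", "2-2016")
)

# The month number is everything before the first "-" in the row name
month_num <- as.integer(sub("-.*$", "", row.names(data)))

# month.name is a built-in constant ("January", ..., "December")
month_lab <- tolower(month.name[month_num])

# One data frame per month; split() keeps the original row names
by_month <- split(data, month_lab)

# Turn every column of every month into a named vector, then flatten one level
pieces <- unlist(
  lapply(by_month, function(d) lapply(d, setNames, row.names(d))),
  recursive = FALSE
)

# unlist() joins the names with ".", e.g. "january.a"; switch to "january_a"
names(pieces) <- sub("\\.", "_", names(pieces))

pieces$january_a
```

Keeping the results in one named list (rather than creating separate global objects) makes it easy to loop over them later, for example with lapply(pieces, mean).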

Here is another variation using the tidyverse. It returns a list of data frames, one for each combination of month and column ("a" or "l").
library(tidyverse)
data %>%
rownames_to_column('date') %>%
pivot_longer(cols = -date) %>%
separate(date, c('month', 'year'), sep = "-", remove = FALSE) %>%
group_split(month, name)
#[[1]]
# A tibble: 3 x 5
# date month year name value
# <chr> <chr> <chr> <chr> <dbl>
#1 1-2006 1 2006 a 3.50
#2 1-2007 1 2007 a 14.8
#3 1-2008 1 2008 a 6.89
#[[2]]
# A tibble: 3 x 5
# date month year name value
# <chr> <chr> <chr> <chr> <dbl>
#1 1-2006 1 2006 l 0.852
#2 1-2007 1 2007 l 0.246
#3 1-2008 1 2008 l 0.538
#...
#...
This has some additional columns to uniquely identify values which you can remove if not needed.

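Another option is group_split():
library(tibble) # provides rownames_to_column()
library(purrr)
library(dplyr)
library(stringr)
data %>%
rownames_to_column('rn') %>%
select(rn, a) %>%
# group on the month prefix; .keep = FALSE drops the grouping column
group_split(rn = str_remove(rn, '-.*'), .keep = FALSE) %>%
map(flatten_dbl)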
#[[1]]
#[1] 3.498939 14.801777 6.893728
#[[2]]
#[1] 2.090962 9.192913 5.059505
data
data <- data.frame(a=c(3.498939, 14.801777, 6.893728, 2.090962, 9.192913, 5.059505),
l=c(0.8523857, 0.2457656, 0.5381691, 0.6764694, 0.8740950, 1.1761113))
row.names(data) <- c("1-2006", "1-2007", "1-2008", "2-2006", "2-2007", "2-2016")

Related

fill column with values in another columns based on their conditions

I would like to fill a column (KE) with values from other columns (K2007/K2008/K2009) based on a condition (Year). For example, if "Year" is 2007, KE would be 1 (the K2007 value in that row).
Year <- c(2007,2008,2009)
K2007 <- c(1,2,3)
K2008 <- c(4,5,6)
K2009 <- c(7,8,9)
KE <- c(1,5,9)
Thanks in advance,
---2022/06/24 update----
I'm sorry, the data has an additional constraint.
I want to assign values to KE conditional on two more column values in addition to "Year."
Year <- c(2007,2008,2009)
X <- c(10, 20, 30)
Y <- c(40, 50, 60)
K2007 <- c(1,2,3)
K2008 <- c(4,5,6)
K2009 <- c(7,8,9)
KE <- c(1,5,9)
In this case, when Year = 2007, X = 10 and Y = 40, KE will be 1.
I got the desired result with the code below.
test <- ddd %>%
mutate(KE = case_when(
Year == 2007 ~ ke_2007,
Year == 2008 ~ ke_2008,
Year == 2009 ~ ke_2009,
TRUE ~ NA_real_))
Thanks,
If you don't need to keep the original columns you could pivot the data to a tidier long format, remove the "K" from the original column names, and only keep rows in which Year is the same as the old column:
library(tidyr)
library(dplyr)
dat |>
pivot_longer(K2007:K2009, values_to = "KE") |>
mutate(name = sub('.', '', name) |> as.double()) |>
filter(Year == name) |>
select(-name)
#> # A tibble: 3 x 2
#> Year KE
#> <dbl> <dbl>
#> 1 2007 1
#> 2 2008 5
#> 3 2009 9
Created on 2022-06-23 by the reprex package (v2.0.1)
Try this
df <- data.frame(K2007 , K2008 , K2009)
KE <- sapply(seq_along(Year) ,
\(x) df[ x,grep(Year[x] , names(df))])
KE
#[1] 1 5 9
library(tidyverse)
tbl <- tibble(
year = c(2007,2008,2009),
k2007 = c(1,2,3),
k2008 = c(4,5,6),
k2009 = c(7,8,9))
tbl %>%
pivot_longer(-year, values_to = 'ke') %>%
filter(name == str_c('k', year)) %>%
select(-name) %>%
left_join(tbl, ., 'year')
# A tibble: 3 x 5
year k2007 k2008 k2009 ke
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2007 1 4 7 1
2 2008 2 5 8 5
3 2009 3 6 9 9
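A vectorised base R alternative is to index the data with a two-column (row, column) matrix. This is only a sketch, assuming the value columns are named "K" followed by the year exactly as in the question; `dat` and `col_idx` are illustrative names:

```r
dat <- data.frame(
  Year  = c(2007, 2008, 2009),
  K2007 = c(1, 2, 3),
  K2008 = c(4, 5, 6),
  K2009 = c(7, 8, 9)
)

# Position of the K<year> column that matches each row's Year
col_idx <- match(paste0("K", dat$Year), names(dat))

# A two-column index matrix picks exactly one (row, column) cell per row
dat$KE <- as.matrix(dat)[cbind(seq_len(nrow(dat)), col_idx)]

dat$KE
```

If a Year has no matching column, match() returns NA and the corresponding KE is NA, which mirrors the TRUE ~ NA_real_ fallback of a case_when() approach. Note that as.matrix() assumes all columns are numeric here; with mixed column types it would convert everything to character.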

How to create a new column that specifies which range of years a date belongs to (like academic year)?

In some cases, a "year" doesn't necessarily cycle from January 1st. For example, academic year starts at the end of August in the US. Another example is the NBA season.
My question: given data containing a date column, I want to create another column that refers to which period it falls in. For example, consider that we are given the following tib:
library(lubridate, warn.conflicts = FALSE)
library(tibble)
tib <- tibble(my_dates = as_date(c("1999-01-01", "2010-08-09", "2010-09-02", "1995-03-02")))
tib
#> # A tibble: 4 x 1
#> my_dates
#> <date>
#> 1 1999-01-01
#> 2 2010-08-09
#> 3 2010-09-02
#> 4 1995-03-02
and we want to mutate a column that refers to the academic year each date belongs to, provided that the academic year starts on August 31st:
desired_output <-
tib %>%
add_column(belongs_to_school_year = c("1998-1999", "2009-2010", "2010-2011", "1994-1995"))
desired_output
#> # A tibble: 4 x 2
#> my_dates belongs_to_school_year
#> <date> <chr>
#> 1 1999-01-01 1998-1999
#> 2 2010-08-09 2009-2010
#> 3 2010-09-02 2010-2011
#> 4 1995-03-02 1994-1995
How can I create the column belongs_to_school_year using mutate(), based on my_dates?
You can use dplyr and lubridate for this:
desired_output <- tib %>%
mutate(school_year = case_when(month(my_dates) <= 8 ~ paste(year(my_dates)-1, year(my_dates), sep = "-"),
month(my_dates) > 8 ~ paste(year(my_dates), year(my_dates)+1, sep = "-")))
or:
desired_output <- tib %>%
mutate(school_year = if_else(month(my_dates) <= 8,
paste(year(my_dates)-1, year(my_dates), sep = "-"),
paste(year(my_dates), year(my_dates)+1, sep = "-")))
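Both versions above switch over at the month boundary, so a date of August 31st itself would still be assigned to the previous school year. If the cut-off has to be the exact day, a base R helper along these lines could be used inside mutate(). The function name `school_year` and its arguments `start_month` and `start_day` are assumptions made for this sketch:

```r
# Label each date with the "YYYY-YYYY" period it falls in, where a period
# starts on start_month/start_day (default: August 31st)
school_year <- function(dates, start_month = 8L, start_day = 31L) {
  y <- as.integer(format(dates, "%Y"))
  m <- as.integer(format(dates, "%m"))
  d <- as.integer(format(dates, "%d"))

  # TRUE when the date is on or after the start date within its calendar year
  on_or_after_start <- m > start_month | (m == start_month & d >= start_day)

  first_year <- ifelse(on_or_after_start, y, y - 1L)
  paste(first_year, first_year + 1L, sep = "-")
}

my_dates <- as.Date(c("1999-01-01", "2010-08-09", "2010-09-02",
                      "1995-03-02", "2010-08-31"))
school_year(my_dates)
```

With the tibble from the question this would be called as tib %>% mutate(belongs_to_school_year = school_year(my_dates)). Changing start_month and start_day adapts it to other cycles, such as a sports season.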

Can I combine 2 rows in R

A section of dataframe looks like
Streets <- c("Muscow tweede","Muscow NDSM", "kazan Bo", "Kazan Ca")
Hotels<- c(5,9,4,3)
Is there a method to merge Muscow tweede and Muscow ndsm, as well as the two Kazan streets, so that I can find the total number of hotels in the city rather than separate streets?
With dplyr:
library(dplyr)
df %>% group_by(col=tolower(sub(' .*', '', Streets))) %>%
summarize(Hotels=sum(Hotels))
Output:
col Hotels
<chr> <dbl>
1 kazan 7
2 muscow 14
Another way:
library(dplyr)
library(stringr)
tibble(Streets, Hotels) %>%
mutate(Streets = str_to_title(str_extract(Streets, '\\w+'))) %>%
group_by(Streets) %>% summarise(Hotels = sum(Hotels))
# A tibble: 2 x 2
Streets Hotels
<chr> <dbl>
1 Kazan 7
2 Muscow 14
Another way with tapply -
with(df, tapply(Hotels, tools::toTitleCase(sub('\\s.*', '', Streets)), sum))
# Kazan Muscow
# 7 14
df1$City = stringr::str_to_title(stringr::word(Streets, end = 1))
aggregate(Hotels ~ City, data = df1, sum)
City Hotels
1 Kazan 7
2 Muscow 14
Sample data
df1 <- data.frame(
Streets = c("Muscow tweede","Muscow NDSM", "kazan Bo", "Kazan Ca"),
Hotels = c(5,9,4,3))
We can use rowsum from base R
rowsum(Hotels, tools::toTitleCase(trimws(Streets, whitespace = "\\s+.*")))
[,1]
Kazan 7
Muscow 14

How to extract elements from nested list in R

I'm working with json data which I've converted into a tibble with some list columns. I'm trying to extract the useful information from the list columns but am facing issues. If given the following dataset-
mydf <-tibble(
x = c(1, 2, 3),
y = list(list(list(id="id1", title="title1"), list(id="id11", title="title11")),
list(id="id2",title="title2"),
NULL)
)
How can I convert it into the following-
data.frame(x=c(1:3), id = c("id1;id11", "id2", ""), title = c("title1;title11", "title2", ""))
# x id title
#1 1 id1;id11 title1;title11
#2 2 id2 title2
#3 3
Any help is appreciated. Thanks!
I think there are better ways, but this is what I can do for now. For each row, I extracted strings and concatenated them with toString(). Since unnest() creates multiple rows for each row (i.e., 1, 2, and 3 in x), I used summarize() to temporarily combine strings. Then, I separate them using separate().
mydf %>%
unnest(y, keep_empty = TRUE) %>%
rowwise %>%
mutate(y = toString(unlist(y))) %>%
group_by(x) %>%
summarize(string = paste(y, collapse = "_")) %>%
separate(col = string, into = c("id", "title"), sep = "_")
# x id title
# <dbl> <chr> <chr>
#1 1 id1, title1 id11, title11
#2 2 id2 title2
#3 3 "" NA
If the names are consistent as in the example, you can do:
mydf2 <- unlist(mydf)
x <- mydf2[grepl("x", names(mydf2))]
id <- mydf2[grepl("id", names(mydf2))]
title <- mydf2[grepl("title", names(mydf2))]
tibble(x, id, title)
# A tibble: 3 x 3
x id title
<chr> <chr> <chr>
1 1 id1 title1
2 2 id11 title11
3 3 id2 title2
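Neither output above matches the requested layout exactly (one row per x, values joined by ";"). A base R sketch that collapses each cell separately is shown below. It assumes every record uses the field names id and title as in the example; `collapse_field` is a hypothetical helper, and the list column is built as a plain list `y` so that the example runs without packages (with the tibble from the question, mydf$y can be passed instead):

```r
x <- c(1, 2, 3)
y <- list(
  list(list(id = "id1", title = "title1"), list(id = "id11", title = "title11")),
  list(id = "id2", title = "title2"),
  NULL
)

# Collect every value stored under `field` in one cell, however deeply it is
# nested, and join the values with ";" (a NULL or empty cell gives "")
collapse_field <- function(cell, field) {
  flat <- unlist(cell)                      # named character vector, or NULL
  if (is.null(flat)) return("")
  # unlist() may prefix nested names (e.g. "a.id"); keep only the last part
  leaf <- sub("^.*\\.", "", names(flat))
  paste(flat[leaf == field], collapse = ";")
}

result <- data.frame(
  x = x,
  id = vapply(y, collapse_field, character(1), field = "id"),
  title = vapply(y, collapse_field, character(1), field = "title"),
  stringsAsFactors = FALSE
)
result
```

Working cell by cell keeps each row's values together, which avoids the misalignment that arises when the whole column is unlisted at once.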

Efficient manipulation and extraction of data from multiple matrices - means and dates

I have a series of large matrices, and I am still getting used to navigating them in this format and working with functions.
I have minute data for a number of parameters, which I have been able to reduce to daily averages. I would like to align each mean with a date sequence and from there extract the daily average for each year.
For a single matrix I have done it like this:
A <- matrix(c(1:3285),nrow=3)
AA <- sapply(1:1095, function(x) mean(A [,x], na.rm = TRUE))
D <- seq(from = as.Date("2013-01-01"), to = as.Date("2015-12-31"), by= 1)
df <- cbind.data.frame(D,AA)
This gives me the mean of each column aligned to a date for 2013-2015.
library(lubridate)
years <- year(as.Date(df$D, "%d-%b-%y"))
day <- yday(as.Date(df$D, "%d-%b-%y"))
#to get the average of DOY over three years
avg <- as.data.frame(tapply(df$AA,day, mean, na.rm=T)) #gives average value on day of year
#Average for specific DOY for each year
av <- as.data.frame(tapply(df$AA,list(day,years), mean, na.rm=T)) #gets the DOY average per year
#bind to get yearly averages and overall average in a data frame format
DF <- cbind(av,avg)
head(DF)
colnames(DF)[4] <- "avg" #rename ts average column
Now say I have multiple matrices (all the same dimensions, just different parameters) that I want to do this for. Is there an efficient way to loop through them so that I get a data frame (DF) output for each of A-C?
#extra matrices to play with (3285 values each, so they have the same dimensions as A):
B <- matrix(c(3286:6570),nrow=3)
C <- matrix(c(6571:9855),nrow=3)
I have gotten thus far with some initial help on stackoverflow:
#column means for each matrices
vapply(list(A, B, C), colMeans, numeric(1095))
Here's a tidyverse solution. Let
dates <- seq(from = as.Date("2013-01-01"), to = as.Date("2015-12-31"), by = 1)
A <- data.frame(matrix(c(1:3285), ncol = 3, byrow = TRUE))
since I understand that the dates are the same for all the matrices. I also made A long rather than wide (one row per day), which is easier to work with in the tidyverse. You might then prefer the output in the form of
A %>% group_by(year = year(dates), day = yday(dates)) %>%
summarise(dayYearAvg = mean(c(X1, X2, X3))) %>%
group_by(day) %>% mutate(dayAvg = mean(dayYearAvg))
# A tibble: 1,095 x 4
# Groups: day [365]
# year day dayYearAvg dayAvg
# <dbl> <dbl> <dbl> <dbl>
# 1 2013 1 2 1097
# 2 2013 2 5 1100
# 3 2013 3 8 1103
# ...
If not, we get the same as in your example with
A %>% group_by(year = year(dates), day = yday(dates)) %>%
summarise(dayYearAvg = mean(c(X1, X2, X3))) %>%
group_by(day) %>% mutate(dayAvg = mean(dayYearAvg)) %>%
spread(year, dayYearAvg) %>% ungroup %>% select(-day)
# A tibble: 365 x 4
# dayAvg `2013` `2014` `2015`
# <dbl> <dbl> <dbl> <dbl>
# 1 1097 2 1097 2192
# 2 1100 5 1100 2195
# 3 1103 8 1103 2198
# 4 1106 11 1106 2201
# ...
Now let also
B <- data.frame(matrix(c(3285:6569), ncol = 3, byrow = TRUE))
C <- data.frame(matrix(c(6570:9854), ncol = 3, byrow = TRUE))
l <- list(A, B, C)
This gives
map(l, . %>% group_by(year = year(dates), day = yday(dates)) %>%
summarise(dayYearAvg = mean(c(X1, X2, X3))) %>%
group_by(day) %>% mutate(dayAvg = mean(dayYearAvg)) %>%
spread(year, dayYearAvg) %>% ungroup %>% select(-day))
# [[1]]
# A tibble: 365 x 4
# dayAvg `2013` `2014` `2015`
# <dbl> <dbl> <dbl> <dbl>
# 1 1097 2 1097 2192
# 2 1100 5 1100 2195
# ...
# [[2]]
# A tibble: 365 x 4
# dayAvg `2013` `2014` `2015`
# <dbl> <dbl> <dbl> <dbl>
# 1 4381 3286 4381 5476
# 2 4384 3289 4384 5479
# ...
# [[3]]
# A tibble: 365 x 4
# dayAvg `2013` `2014` `2015`
# <dbl> <dbl> <dbl> <dbl>
# 1 7666 6571 7666 8761
# 2 7669 6574 7669 8764
# ...
Here's a tinyverse solution (i.e., no third-party packages) that wraps your process in a function that takes a matrix as input and returns a data frame as output. Then run lapply on a list of matrices.
df_process <- function(mat) {
# CREATE DF AND ADD NEW COLUMNS
df <- within(data.frame(D=seq(from = as.Date("2013-01-01"),
to = as.Date("2015-12-31"), by= 1),
AA=sapply(1:1095, function(x) mean(mat[,x], na.rm=TRUE))),
{
# D is visible inside within(); "%d" is the day of month (use "%j" for day of year)
year <- format(D, "%Y")
day <- format(D, "%d")
})
# CREATE DF WITH TAPPLY CALLS, RENAME COLUMNS
setNames(data.frame(tapply(df$AA, list(df$day, df$year), mean, na.rm=TRUE),
avg = c(tapply(df$AA, df$day, mean, na.rm=TRUE))),
c("2013", "2014", "2015", "avg"))
}
A <- matrix(c(1:3285),nrow=3)
B <- matrix(c(3286:6570),nrow=3)
C <- matrix(c(6571:9855),nrow=3)
# NAMED LIST OF DATA FRAMES
DF_list <- setNames(lapply(list(A, B, C), df_process), c("A", "B", "C"))
all.equal(DF, DF_list$A)
# [1] TRUE
identical(DF, DF_list$A)
# [1] TRUE
Output
lapply(DF_list, head)
# $A
# 2013 2014 2015 avg
# 01 501.5 1596.5 2691.5 1596.5
# 02 504.5 1599.5 2694.5 1599.5
# 03 507.5 1602.5 2697.5 1602.5
# 04 510.5 1605.5 2700.5 1605.5
# 05 513.5 1608.5 2703.5 1608.5
# 06 516.5 1611.5 2706.5 1611.5
# $B
# 2013 2014 2015 avg
# 01 3786.5 4881.5 5976.5 4881.5
# 02 3789.5 4884.5 5979.5 4884.5
# 03 3792.5 4887.5 5982.5 4887.5
# 04 3795.5 4890.5 5985.5 4890.5
# 05 3798.5 4893.5 5988.5 4893.5
# 06 3801.5 4896.5 5991.5 4896.5
# $C
# 2013 2014 2015 avg
# 01 7071.5 8166.5 9261.5 8166.5
# 02 7074.5 8169.5 9264.5 8169.5
# 03 7077.5 8172.5 9267.5 8172.5
# 04 7080.5 8175.5 9270.5 8175.5
# 05 7083.5 8178.5 9273.5 8178.5
# 06 7086.5 8181.5 9276.5 8181.5
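The output above is grouped by day of month ("01" to "31"), whereas the question groups by day of year. A base R sketch that groups by day of year instead is given below. The function name `doy_table` is an assumption for illustration, and it relies on every matrix having one column per date:

```r
dates <- seq(as.Date("2013-01-01"), as.Date("2015-12-31"), by = "day")

# Build a day-of-year table for one matrix: one column per year plus the
# average over all years
doy_table <- function(mat, dates) {
  stopifnot(ncol(mat) == length(dates))
  daily <- colMeans(mat, na.rm = TRUE)       # one mean per column (day)
  year  <- format(dates, "%Y")
  doy   <- as.integer(format(dates, "%j"))   # day of year, 1..366
  per_year <- tapply(daily, list(doy, year), mean, na.rm = TRUE)
  data.frame(per_year,
             avg = as.vector(tapply(daily, doy, mean, na.rm = TRUE)),
             check.names = FALSE)            # keep "2013" etc. as column names
}

A <- matrix(1:3285, nrow = 3)
B <- matrix(3286:6570, nrow = 3)
C <- matrix(6571:9855, nrow = 3)

# Named list of data frames, one per matrix
DF_list <- lapply(list(A = A, B = B, C = C), doy_table, dates = dates)
head(DF_list$A, 3)
```

For A, the first row should read 2, 1097, 2192 for the three years with an average of 1097, which agrees with the tidyverse result earlier in this thread. colMeans() replaces the sapply() loop over columns and is usually much faster on large matrices.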
