create new data frame based on variables conditions of other data frame - r

I am trying to create a new data frame using R from a larger data frame. This is a short version of my large data frame:
df <- data.frame(time = c(0,1,5,10,12,13,20,22,25,30,32,35,39),
t_0_1 = c(20,20,20,120,300,350,400,600,700,100,20,20,20),
t_0_2 = c(20,20,20,20,120,300,350,400,600,700,100,20,20),
t_2_1 = c(20,20,20,20,20,120,300,350,400,600,700,100,20),
t_2_2 = c(20,20,20,20,120,300,350,400,600,700,100,20,20))
In the new data frame, the first variable (height) should hold the number at the end of each variable name in the large data frame (1 and 2). The remaining variables should be named after the number in the middle of those names (0 and 2). For their values, I want to filter the rows where each variable is greater than 300 and calculate the time difference. For example, for variable "t_0_1" the values are greater than 300 from 13 to 25 seconds, so the value in the new data frame should be 12.
The new data frame should look like this:
df_new <- data.frame(height= c(1,2),
"0" = c(12,10),
"2" = c(10,10))
Any help where I should start or how I can do that is very welcome. Thank you!!

You could calculate the time difference for each column with summarise(across(...)), and then transform the data to long.
library(tidyverse)
df %>%
  summarise(across(-time, ~ sum(diff(time[.x > 300])))) %>%
  pivot_longer(everything(), names_to = c(".value", "height"), names_pattern = "t_(.+)_(.+)")
# # A tibble: 2 × 3
# height `0` `2`
# <chr> <dbl> <dbl>
# 1 1 12 10
# 2 2 10 10

Here is a tidyverse solution
library(tidyverse)
df %>%
  pivot_longer(-time) %>%
  separate(name, c(NA, "col", "height"), sep = "_") %>%
  pivot_wider(names_from = "col", names_prefix = "X") %>%
  group_by(height) %>%
  summarise(
    across(starts_with("X"), ~ sum(diff(time[.x > 300]))),
    .groups = "drop")
## A tibble: 2 x 3
# height X0 X2
# <chr> <dbl> <dbl>
#1 1 12 10
#2 2 10 10
Explanation: The idea is to reshape from wide to long, then separate the column names into a (future) column name "col" and a "height". Next, reshape from long to wide, taking the column names from "col" (prefixed with "X"), and finally summarise according to your requirements (i.e. keep only those entries where the value is > 300, and sum the differences in time).
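To make the summarising step concrete, here is a small base R sketch of what sum(diff(time[x > 300])) does for the single column t_0_1 from the question. The helper name `above` is only for illustration.

```r
# Data copied from the question (only one of the value columns is needed here)
time  <- c(0, 1, 5, 10, 12, 13, 20, 22, 25, 30, 32, 35, 39)
t_0_1 <- c(20, 20, 20, 120, 300, 350, 400, 600, 700, 100, 20, 20, 20)

# Time points at which the value is strictly greater than 300
above <- time[t_0_1 > 300]
above
# 13 20 22 25

# Gaps between consecutive time points, and their total
diff(above)
# 7 2 3
duration <- sum(diff(above))
duration
# 12
```

Note that the sum of the differences is simply the last time point minus the first one. If a column rose above 300 in two separate stretches, the gap between the stretches would be counted as well, so this relies on each column exceeding 300 in one contiguous run, as in the example data.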

Related

In R, replace values across time series based on another column

Actually this is linked to my previous question: Replace values across time series columns based on another column
However I need to modify values across a time series data set but based on a condition from the same row but across another set of time series columns. The dataset looks like this:
#there are many more years (yrs) in the data set
product<-c("01","02")
yr1<-c("1","7")
yr2<-c("3","4")
#these follow the number of years
type.yr1<-c("mixed","number")
type.yr2<-c("number","mixed")
#this is a reference column to pull values from in case the type value is "mixed"
mixed.rate<-c("1+5GBP","7+3GBP")
df<-data.frame(product,yr1,yr2,type.yr1,type.yr2,mixed.rate)
Where the value 1 should be replaced by "1+5GBP" and 4 should be "7+3GBP". I am thinking of something like the below -- could anyone please help?
df %>%
mutate(across(c(starts_with('yr'),starts_with('type'), ~ifelse(type.x=="mixed", mixed.rate.x, .x)))
The final result should be:
product<-c("01","02")
yr1<-c("1+5GBP","7")
yr2<-c("3","7+3GBP")
type.yr1<-c("mixed","number")
type.yr2<-c("number","mixed")
mixed.rate<-c("1+5GBP","7+3GBP")
df<-data.frame(product,yr1,yr2,type.yr1,type.yr2,mixed.rate)
If I understand you correctly, I think you might benefit from pivoting longer, replacing the values in a single if_else, and swinging back to wide.
df %>%
  # (\\D+)(\\d+) also splits multi-digit years such as yr10 correctly
  pivot_longer(cols = -c(product, mixed.rate), names_to = c(".value", "year"), names_pattern = "(\\D+)(\\d+)") %>%
  mutate(yr = if_else(type.yr == "mixed", mixed.rate, yr)) %>%
  pivot_wider(names_from = year, values_from = c(yr, type.yr), names_sep = "")
Output:
product mixed.rate yr1 yr2 type.yr1 type.yr2
<chr> <chr> <chr> <chr> <chr> <chr>
1 01 1+5GBP 1+5GBP 3 mixed number
2 02 7+3GBP 7 7+3GBP number mixed
You can use pivot_longer to put all the yr values in one column and the type.yr values in another. Then recode 1 into 1+5GBP and 4 into 7+3GBP where the type.yr column is mixed, and finally use pivot_wider to go back to wide format.
df %>%
  pivot_longer(contains('yr'), names_to = c('.value', 'grp'),
               names_pattern = '(\\D+)(\\d+)') %>%
  mutate(yr = ifelse(type.yr == 'mixed', recode(yr, '1' = '1+5GBP', '4' = '7+3GBP'), yr)) %>%
  # id_cols is named explicitly; newer tidyr versions do not accept it positionally
  pivot_wider(id_cols = c(product, mixed.rate), names_from = grp,
              values_from = c(yr, type.yr), names_sep = '')
# A tibble: 2 x 6
product mixed.rate yr1 yr2 type.yr1 type.yr2
<chr> <chr> <chr> <chr> <chr> <chr>
1 01 1+5GBP 1+5GBP 3 mixed number
2 02 7+3GBP 7 7+3GBP number mixed
If you're happy to use a base R loop instead of dplyr, then the following will produce your required output:
for (i in 1:2) {
  # ifelse() is the base R counterpart of dplyr's if_else()
  df[[paste0('yr', i)]] <- ifelse(df[[paste0('type.yr', i)]] == 'mixed', df[['mixed.rate']], df[[paste0('yr', i)]])
}
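Since the question mentions that there are many more years, the loop can also be driven by the column names instead of a fixed 1:2. The following is a minimal base R sketch under the assumption that every yrN column has a matching type.yrN column; the names `yr_cols` and `is_mixed` are only for illustration.

```r
# Example data from the question, kept as character columns
df <- data.frame(
  product    = c("01", "02"),
  yr1        = c("1", "7"),
  yr2        = c("3", "4"),
  type.yr1   = c("mixed", "number"),
  type.yr2   = c("number", "mixed"),
  mixed.rate = c("1+5GBP", "7+3GBP"),
  stringsAsFactors = FALSE
)

# Find all value columns named yr1, yr2, ... (assumed naming scheme)
yr_cols <- grep("^yr\\d+$", names(df), value = TRUE)

for (yr in yr_cols) {
  # Rows whose matching type column says "mixed"
  is_mixed <- df[[paste0("type.", yr)]] == "mixed"
  # Overwrite only those rows with the reference value
  df[[yr]][is_mixed] <- df$mixed.rate[is_mixed]
}

df[, c("product", yr_cols)]
```

With the example data this should leave yr1 as "1+5GBP", "7" and yr2 as "3", "7+3GBP".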

R: Pivot numeric data from columns to rows based on string in variable name

I have a data set that I want to pivot to long format depending on whether the variable name contains any of the strings in list_a <- c("a", "b", "c") and list_b <- c("usd", "eur", "gbp"). The data set only contains values in one row. I want the values in list_b to become column names and the values in list_a to become row names in the resulting data set. Please see the reproducible example data set below.
I currently solve this by applying the following R code once for each value in list_b, which results in three data frames called "df_usd", "df_eur" and "df_gbp" that I then merge on the column "name". This is cumbersome, and I would very much appreciate help finding a more elegant solution: the variables in list_b change from month to month (list_a stays the same), and updating the existing code manually is both time-consuming and error-prone.
# Current solution for df_usd:
df_usd <- df %>%
select(date, contains("usd")) %>%
pivot_longer(cols = contains(c("a_", "b_", "c_")),
names_to = "name", values_to = "usd") %>% mutate(name = case_when(
str_detect(name, "a_") ~ "a",
str_detect(name, "b_") ~ "b",
str_detect(name, "c_") ~ "c")) %>%
select(-date)
A screenshot of the starting point in Excel
A screenshot of the result I want to achieve in Excel
# Example data to copy and paste into R for easy reproduction of problem:
df <- data.frame (date = c("2020-12-31"),
a_usd = c(1000),
b_usd = c(2000),
c_usd = c(3000),
a_eur = c(100),
b_eur =c(200),
c_eur = c(300),
a_gbp = c(10),
b_gbp = c(20),
c_gbp = c(30))
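One option is to specify names_sep together with names_to in pivot_longer, using ".value" so that the part of each name after the underscore becomes the new column names.
library(dplyr)
library(tidyr) # pivot_longer() lives in tidyr
df %>%
  pivot_longer(cols = -date, names_to = c("grp", ".value"), names_sep = "_")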
-output
# A tibble: 3 x 5
# date grp usd eur gbp
# <chr> <chr> <dbl> <dbl> <dbl>
#1 2020-12-31 a 1000 100 10
#2 2020-12-31 b 2000 200 20
#3 2020-12-31 c 3000 300 30
A base R option using reshape
reshape(
setNames(df, gsub("(\\w+)_(\\w+)", "\\2.\\1", names(df))),
direction = "long",
varying = -1
)
gives
date time usd eur gbp id
1.a 2020-12-31 a 1000 100 10 1
1.b 2020-12-31 b 2000 200 20 1
1.c 2020-12-31 c 3000 300 30 1

Sum duplicated columns in dataframe in R

Hello, I have the following data frame:
colnames(tv_viewing_time) <- c("channel_1", "channel_2", "channel_1", "channel_2")
Each row gives the viewing time for an individual on channel 1 and channel 2. For instance, for individual 1 I get:
tv_viewing_time[1,] <- c(1,2,4,5)
What I would like is a data frame that sums up the values of duplicated columns.
That is, I would get
colnames(tv_viewing_time) <- c("channel_1", "channel_2")
where, for instance, for individual 1 I would get:
tv_viewing_time[1,] <- c(5,7)
because entries in the same row are summed when they belong to duplicated column names.
I have looked for an answer, but none of the suggestions in other threads worked for my data frame.
Note that there are many more duplicated columns, so I am looking for a solution that can be applied efficiently to all my duplicates.
We could use split.default with rowSums
sapply(split.default(tv_viewing_time,
sub("\\.\\d+$", "", names(tv_viewing_time))), rowSums)
-output
# channel_1 channel_2
# 5 7
Or using tidyverse
library(dplyr)
library(tidyr)
library(stringr)
tv_viewing_time %>%
pivot_longer(cols = everything()) %>%
group_by(name = str_remove(name, "\\.\\d+$")) %>%
summarise(value = sum(value)) %>%
pivot_wider(names_from = name, values_from = value)
# A tibble: 1 x 2
# channel_1 channel_2
# <dbl> <dbl>
#1 5 7
data
tv_viewing_time <- data.frame(channel_1 = 1, channel_2 = 2,
channel_1 = 4, channel_2 = 5)
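Because the question asks for one sum per individual, it may help to see the same idea on more than one row. The following is a base R sketch using rowsum() on the transposed data; the second row of values is made up for illustration, and the names `tv`, `base_names` and `res` are not from the original post.

```r
# Two individuals; data.frame() renames the duplicates to channel_1.1, channel_2.1
tv <- data.frame(channel_1 = c(1, 10), channel_2 = c(2, 20),
                 channel_1 = c(4, 30), channel_2 = c(5, 40))

# Strip the ".1", ".2", ... suffixes to recover the original channel names
base_names <- sub("\\.\\d+$", "", names(tv))

# rowsum() adds up rows that share a group label, so transpose first
# (columns become rows), sum by channel, then transpose back
res <- t(rowsum(t(as.matrix(tv)), group = base_names))
res
```

Each row of `res` corresponds to one individual and each column to one channel, so the first row should hold 5 and 7 as in the question.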

Re-Format data with multiple headers to long format with one header row becoming data in a new column

I regularly receive data that is formatted with multiple headers and merged cells (yes..excel). Typically these data come in the form of 2+ merged cells representing sample sites, over the top of a number of observations in columns representing parameters of interest for that site. I am using the "openxlsx" package to read in the data with the read.xlsx function shown below (won't run just for reference):
read.xlsx('Mussels.xlsx',
detectDates = T,
sheet = 2,
fillMergedCells = T,
startRow = 2)
An example: I am currently working with invasive mussel survey data where I have 25 lengths for two species for each of 14 sites, which I've abbreviated here for ease:
lendat <- data.frame(site.a = c("species.1",1,1,1,1),
site.a = c("species.2",2,2,2,2),
site.b = c("species.1",3,3,3,3),
site.b = c("species.2",4,4,4,4),
check.names = F)
I would like to be able to write some code that will re-format these data into long form where the column names become values under a new column named "site", and the first row of data becomes the other column names representing the lengths for each species like this:
data_form <- data.frame(site = c(rep("site.a", 4), rep("site.b",4)),
species.1 = c(1,1,1,1,3,3,3,3),
species.2 = c(2,2,2,2,4,4,4,4))
Update based on @Ronak Shah's answer
Using code from the accepted answer below with the actual data results in a tibble with no data. I discovered that the issue arises with the filter step when decimal values are introduced in the data (actual data contains decimal values). I thought this was a data format issue (example data are all factors) but even when this is true the decimal data are changed into NA's. See example:
lendat <- data.frame(site.a = c("species.1", 1.1,2.2,3,4),
site.a = c("species.2",5,6,7,8),
site.b = c("species.1", 9,10,11,12),
site.b = c("species.2",13,14,15,16),
check.names = F)
str(lendat)
'data.frame': 5 obs. of 4 variables:
$ site.a: Factor w/ 5 levels "1.1","2.2","3",..: 5 1 2 3 4
$ site.a: Factor w/ 5 levels "5","6","7","8",..: 5 1 2 3 4
$ site.b: Factor w/ 5 levels "10","11","12",..: 5 4 1 2 3
$ site.b: Factor w/ 5 levels "13","14","15",..: 5 1 2 3 4
I split the piped code out to go line by line
#Get data in long format
pivot_longer(junk, cols = everything(), names_to = 'site') %>%
#Create a new column with column names
mutate(col = paste0('species', .copy)) %>%
#Remove the values from the first row
filter(!grepl('\\D', value)) %>%
#Remove .copy column which was created
select(-.copy) %>%
#Group by the new column
group_by(col) %>%
#Add a row index
mutate(row = row_number()) %>%
#Get data in wide format
pivot_wider(names_from = col, values_from = value) %>%
#Remove row index
select(-row) %>%
#Arrange data according to site information
arrange(site)
x <- pivot_longer(junk, cols = everything(), names_to = 'site')
x
x <- mutate(x, col = paste0('species', .copy))
x
x <- filter(x, !grepl('\\D', value))
x
x <- select(.data = x, -.copy)
x
x <- group_by(x, col)
x
x <- mutate(x, row = row_number())
x
x <- pivot_wider(x, names_from = col, values_from = value)
x
x <- select(x, -row)
x
x <- arrange(x, site)
x
The code executes but leaves NA's in the final tibble.
Using dplyr and tidyr :
library(dplyr)
library(tidyr)
#Get data in long format
pivot_longer(lendat, cols = everything(), names_to = 'site') %>%
#Create a new column with column names
mutate(col = paste0('species', .copy)) %>%
#Remove the values from the first row
filter(!grepl('[A-Za-z]', value)) %>%
#Remove .copy column which was created
select(-.copy) %>%
#Group by the new column
group_by(col) %>%
#Add a row index
mutate(row = row_number()) %>%
#Get data in wide format
pivot_wider(names_from = col, values_from = value) %>%
#Remove row index
select(-row) %>%
#Arrange data according to site information
arrange(site)
# site species1 species2
# <chr> <chr> <chr>
#1 site.a 1.1 5
#2 site.a 2.2 6
#3 site.a 3 7
#4 site.a 4 8
#5 site.b 9 13
#6 site.b 10 14
#7 site.b 11 15
#8 site.b 12 16
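The only change from the earlier version is the pattern in the filter step. A short base R sketch of the difference, using the first column of the example data as plain character values (the variable names here are for illustration only):

```r
value <- c("species.1", "1.1", "2.2", "3", "4")

# "\\D" matches any non-digit character, which includes the decimal point,
# so decimal values are dropped together with the header row
keep_no_nondigit <- !grepl("\\D", value)
value[keep_no_nondigit]
# "3" "4"

# "[A-Za-z]" only matches letters, so just the header row is dropped
keep_no_letters <- !grepl("[A-Za-z]", value)
value[keep_no_letters]
# "1.1" "2.2" "3" "4"

# The surviving values are still text; convert them if numbers are needed
lengths_num <- as.numeric(value[keep_no_letters])
```

The result columns above are shown as <chr>, so a conversion such as as.numeric() is likely still needed before doing any arithmetic on the lengths.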

Compare column values against another column

I have the following data:
set.seed(1)
data <- data.frame(
id = 1:500, ht_1 = rnorm(500,10:20), ht_2 = rnorm(500,15:25),
ht_3 = rnorm(500,20:30), ht_4 = rnorm(500,25:35),
ht_5 = rnorm(500,20:40)
)
I would like to identify the values in columns ht_1:ht_4 that are greater than the values in column ht_5 (number of observations and means).
For each of these columns, I would then like to replace any values that are greater than ht_5 with ht_5.
Hi you can use the mutate_at function like this:
library(tidyverse)
data %>% as_tibble %>%
mutate_at(vars(paste0("ht_", 1:4)), ~if_else(.x > ht_5, ht_5, .x))
In this case you can also use pmin instead of if_else which should be faster.
data %>% as_tibble %>%
mutate_at(vars(paste0("ht_", 1:4)), ~pmin(.x, ht_5))
To see how many values are greater than ht_5 you can use the summarise_at function:
data %>% as_tibble %>%
summarize_at(vars(paste0("ht_", 1:4)), ~ length(.x[.x > ht_5]))
# A tibble: 1 x 4
ht_1 ht_2 ht_3 ht_4
<int> <int> <int> <int>
1 6 39 131 258
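For comparison, the same counting and capping can be sketched in base R. This uses a tiny made-up data frame rather than the simulated data above so that the result is easy to check by hand; `d`, `cols` and `n_over` are illustrative names.

```r
# Small made-up example: two columns to compare against ht_5
d <- data.frame(ht_1 = c(1, 5, 9),
                ht_2 = c(4, 2, 8),
                ht_5 = c(3, 6, 7))
cols <- c("ht_1", "ht_2")

# Count, per column, how many values exceed ht_5 in the same row
n_over <- sapply(d[cols], function(x) sum(x > d$ht_5))
n_over
# ht_1 ht_2
#    1    2

# Cap each column at ht_5: pmin() takes the element-wise minimum
d[cols] <- lapply(d[cols], pmin, d$ht_5)
d
```

After capping, ht_1 becomes 1, 5, 7 and ht_2 becomes 3, 2, 7, while ht_5 is unchanged.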
