In R, replace values across time series based on another column - r

Actually this is linked to my previous question: Replace values across time series columns based on another column
However I need to modify values across a time series data set but based on a condition from the same row but across another set of time series columns. The dataset looks like this:
#there are many more years (yrs) in the data set
product<-c("01","02")
yr1<-c("1","7")
yr2<-c("3","4")
#these follow the number of years
type.yr1<-c("mixed","number")
type.yr2<-c("number","mixed")
#this is a reference column to pull values from in case the type value is "mixed"
mixed.rate<-c("1+5GBP","7+3GBP")
df<-data.frame(product,yr1,yr2,type.yr1,type.yr2,mixed.rate)
Where the value 1 should be replaced by "1+5GBP" and 4 should be "7+3GBP". I am thinking of something like the below -- could anyone please help?
df %>%
mutate(across(c(starts_with('yr'),starts_with('type'), ~ifelse(type.x=="mixed", mixed.rate.x, .x)))
The final result should be:
product<-c("01","02")
yr1<-c("1+5GBP","7")
yr2<-c("3","7+3GBP")
type.yr1<-c("mixed","number")
type.yr2<-c("number","mixed")
mixed.rate<-c("1+5GBP","7+3GBP")
df<-data.frame(product,yr1,yr2,type.yr1,type.yr2,mixed.rate)

If I understand you correctly, you might benefit from pivoting longer, replacing the values in a single if_else, and pivoting back to wide.
library(dplyr)
library(tidyr)
df %>%
# split e.g. "type.yr1" into the value name ("type.yr") and the year ("1"); \\d+ also copes with yr10 and beyond
pivot_longer(cols = -c(product,mixed.rate), names_to=c(".value", "year"), names_pattern = "(\\D+)(\\d+)") %>%
mutate(yr=if_else(type.yr=="mixed",mixed.rate,yr)) %>%
pivot_wider(names_from=year, values_from=c(yr,type.yr),names_sep = "")
Output:
product mixed.rate yr1 yr2 type.yr1 type.yr2
<chr> <chr> <chr> <chr> <chr> <chr>
1 01 1+5GBP 1+5GBP 3 mixed number
2 02 7+3GBP 7 7+3GBP number mixed

You can use pivot_longer to gather all the yr values into one column and all the type.yr values into another. Then recode 1 into 1+5GBP and 4 into 7+3GBP wherever the type.yr column is mixed, and pivot_wider back:
df %>%
pivot_longer(contains('yr'), names_to = c('.value','grp'),
names_pattern = '(\\D+)(\\d+)') %>%
mutate(yr = ifelse(type.yr == 'mixed', recode(yr, '1' = '1+5GBP', '4' = '7+3GBP'), yr)) %>%
pivot_wider(id_cols = c(product, mixed.rate), names_from = grp,
values_from = c(yr, type.yr), names_sep = '')
# A tibble: 2 x 6
product mixed.rate yr1 yr2 type.yr1 type.yr2
<chr> <chr> <chr> <chr> <chr> <chr>
1 01 1+5GBP 1+5GBP 3 mixed number
2 02 7+3GBP 7 7+3GBP number mixed
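To see how a names_pattern with two capture groups splits column names, the same regular expression can be tried directly with base R's sub(). This is only an illustration under assumptions: yr10 is a hypothetical column standing in for the "many more years" in the real data, and the snippet does not reproduce pivot_longer's internals.

```r
# Column names to split; "yr10" is a hypothetical extra year.
cols <- c("yr1", "type.yr2", "yr10")

# Group 1: leading non-digits (the .value part). Group 2: trailing digits (the year).
pat <- "^(\\D+)(\\d+)$"
value_part <- sub(pat, "\\1", cols, perl = TRUE)
index_part <- sub(pat, "\\2", cols, perl = TRUE)

# A greedy first group with a single-digit second group keeps only the last digit,
# so "yr10" would be split as "yr1" + "0".
greedy_value <- sub("^(.*)(\\d)$", "\\1", cols, perl = TRUE)

print(data.frame(cols, value_part, index_part, greedy_value))
```

With more than nine years, a pattern ending in \\d+ is therefore the safer choice.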

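If you're happy to use base R instead of dplyr, then the following will produce your required output. It uses ifelse, the base R counterpart of dplyr's if_else, and assumes the columns are character (the data.frame() default from R 4.0 onwards):
for (i in 1:2) {
df[,paste0('yr',i)] <- ifelse(df[,paste0('type.yr',i)]=='mixed',df[,'mixed.rate'],df[,paste0('yr',i)])
}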
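Since the real data has many more years, a loop over the column names rather than over 1:2 may be easier to maintain. The following is a minimal base R sketch, assuming every yrN column has a matching type.yrN column and that all columns are character; the names yr_cols, yr and is_mixed are introduced here for illustration only.

```r
# Toy data from the question (character columns).
df <- data.frame(
  product    = c("01", "02"),
  yr1        = c("1", "7"),
  yr2        = c("3", "4"),
  type.yr1   = c("mixed", "number"),
  type.yr2   = c("number", "mixed"),
  mixed.rate = c("1+5GBP", "7+3GBP"),
  stringsAsFactors = FALSE
)

# Pick up every year column by name instead of hardcoding the count.
yr_cols <- grep("^yr\\d+$", names(df), value = TRUE, perl = TRUE)

for (yr in yr_cols) {
  # Rows whose type for this year is "mixed".
  is_mixed <- df[[paste0("type.", yr)]] == "mixed"
  # Overwrite only those rows with the mixed.rate of the same row.
  df[[yr]][is_mixed] <- df$mixed.rate[is_mixed]
}

print(df)
```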

Related

create new data frame based on variables conditions of other data frame

I am trying to create a new data frame using R from a larger data frame. This is a short version of my large data frame:
df <- data.frame(time = c(0,1,5,10,12,13,20,22,25,30,32,35,39),
t_0_1 = c(20,20,20,120,300,350,400,600,700,100,20,20,20),
t_0_2 = c(20,20,20,20,120,300,350,400,600,700,100,20,20),
t_2_1 = c(20,20,20,20,20,120,300,350,400,600,700,100,20),
t_2_2 = c(20,20,20,20,120,300,350,400,600,700,100,20,20))
In the new data frame, the first variable should hold the number at the end of the large data frame's variable names (1 and 2). The remaining variables should be named after the number in the middle of those names (0 and 2). For their values, I want to keep the rows where each variable is greater than 300 and calculate the time difference over those rows. For example, for variable "t_0_1" the values are greater than 300 from 13 to 25 seconds, so the value in the new data frame should be 12.
The new data frame should look like this:
df_new <- data.frame(height= c(1,2),
"0" = c(12,10),
"2" = c(10,10))
Any help where I should start or how I can do that is very welcome. Thank you!!
You could calculate the time difference for each column with summarise(across(...)), and then transform the data to long.
library(tidyverse)
df %>%
summarise(across(-time, ~ sum(diff(time[.x > 300])))) %>%
pivot_longer(everything(), names_to = c(".value", "height"), names_pattern = "t_(.+)_(.+)")
# # A tibble: 2 × 3
# height `0` `2`
# <chr> <dbl> <dbl>
# 1 1 12 10
# 2 2 10 10
Here is a tidyverse solution
library(tidyverse)
df %>%
pivot_longer(-time) %>%
separate(name, c(NA, "col", "height"), sep = "_") %>%
pivot_wider(names_from = "col", names_prefix = "X") %>%
group_by(height) %>%
summarise(
across(starts_with("X"), ~ sum(diff(time[.x > 300]))),
.groups = "drop")
## A tibble: 2 x 3
# height X0 X2
# <chr> <dbl> <dbl>
#1 1 12 10
#2 2 10 10
Explanation: The idea is to reshape from wide to long and separate the column names into a (future) column name "col" and a "height". Then reshape from long to wide, taking column names from "col" (prefixed with "X"), and summarise according to your requirements (i.e. keep only those entries where the value is > 300, and sum the differences in time).
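The summarising expression sum(diff(time[.x > 300])) can be traced by hand for a single column. A small base R sketch using the t_0_1 values from the question (the intermediate names above and steps are introduced here for illustration):

```r
time  <- c(0, 1, 5, 10, 12, 13, 20, 22, 25, 30, 32, 35, 39)
t_0_1 <- c(20, 20, 20, 120, 300, 350, 400, 600, 700, 100, 20, 20, 20)

# Time points at which the value is strictly greater than 300.
above <- time[t_0_1 > 300]   # 13 20 22 25

# Gaps between consecutive qualifying time points.
steps <- diff(above)         # 7 2 3

# The gaps add up to last minus first.
total <- sum(steps)          # 12
print(total)
```

Because the differences telescope to last minus first, a column that rises above 300 in two separate episodes would also have the gap between the episodes counted.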

Reshape long data to multiple wide columns

I have data in a long format that I need to selectively move to wide format.
Here is an example of what I have and what I need
Rating <- c("Overall","Overall_rank","Total","Total_rank")
value <- c(6,1,5,2)
example <- data.frame(Rating,value)
Creates data that looks like this:
Rating value
Overall 6
Overall_rank 1
Total 5
Total_rank 2
However, I want my data to look like:
I tried pivot_wider, but cannot seem to get it.
Does this work for your real situation?
I think the confusion is stemming from calling column 1 "Rating," when really the "rating" values (as I understand it) are contained in rows 1 and 3.
example %>%
separate(Rating, sep = "_", into = c("Category", "type"), fill = "right") %>%
mutate(type = replace(type, is.na(type), "rating")) %>%
pivot_wider(names_from = type, values_from = value)
Category rating rank
<chr> <dbl> <dbl>
1 Overall 6 1
2 Total 5 2
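The same idea (rows without a "_rank" suffix are the ratings, rows with it are the ranks) can also be sketched in base R by splitting the rows into two small data frames and merging them on the category. This assumes the suffix is always exactly "_rank"; the names is_rank, ratings, ranks and wide are introduced here for illustration.

```r
example <- data.frame(
  Rating = c("Overall", "Overall_rank", "Total", "Total_rank"),
  value  = c(6, 1, 5, 2),
  stringsAsFactors = FALSE
)

# Rows whose label ends in "_rank" hold ranks; the others hold ratings.
is_rank <- grepl("_rank$", example$Rating)

ratings <- data.frame(Category = example$Rating[!is_rank],
                      rating   = example$value[!is_rank])
ranks   <- data.frame(Category = sub("_rank$", "", example$Rating[is_rank]),
                      rank     = example$value[is_rank])

# One row per category, with rating and rank side by side.
wide <- merge(ratings, ranks, by = "Category")
print(wide)
```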

Tally()ing Multiple Observations In an Entire Data Frame

I'm having trouble with figuring out how to deal with a column that features several observations that I would like to tally. For example:
HTML/CSS;Java;JavaScript;Python;SQL
This is one of the cells for a column of a data frame and I'd like to tally the occurrences of each programming language. Is this something that should be tackled with str_detect(), with corpus(), or is there another way I'm not seeing?
My goal is to make each one of these languages (HTML, CSS, Java, JavaScript, Python, SQL, etc...) into a column name with the tally of how many times they occur in this column of the data frame.
I feel like I might've phrased this strangely so let me know if you need any clarification.
In tidyverse you can use separate_rows and count.
library(dplyr)
df %>% tidyr::separate_rows(PL, sep = ';') %>% count(PL)
In base R, we can split the string on semi-colon and count with table :
table(unlist(strsplit(df$PL, ';')))
#If you need a dataframe
#stack(table(unlist(strsplit(df$PL, ';'))))
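On the str_detect() idea from the question: pattern detection can overcount when one label is a substring of another, which is one reason splitting on the separator is the safer route. A short base R sketch, using the two demo strings that appear in another answer on this page (grepl stands in for str_detect here):

```r
PL <- c("HTML/CSS;JavaScript;Python;SQL;R",
        "R;HTML/CSS;Java;JavaScript;SQL;R")

# Substring matching: "Java" is also found inside "JavaScript".
java_by_pattern <- sum(grepl("Java", PL, fixed = TRUE))

# Splitting on ";" first compares whole labels only.
counts <- table(unlist(strsplit(PL, ";", fixed = TRUE)))
java_exact <- as.integer(counts["Java"])

print(c(pattern = java_by_pattern, exact = java_exact))
```

Here the pattern approach reports 2 cells containing "Java", while only one cell actually lists Java as a language.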
If you just want a total count of each label, you can use unnest_longer and a grouped count:
# using @DPH's example data
library(dplyr)
library(tidyr)
df %>%
mutate(across(PL, ~ strsplit(.x, ";"))) %>%
unnest_longer(PL) %>%
group_by(PL) %>%
count()
# A tibble: 6 x 2
# Groups: PL [6]
PL n
<chr> <int>
1 HTML/CSS 2
2 Java 1
3 JavaScript 2
4 Python 1
5 R 3
6 SQL 2
If I understood your problem correctly, this would be a solution:
library(dplyr)
library(tidyr)
# demo data
df <- dplyr::tibble(ID = c("Line 1: ","Line 2:"),
PL = c("HTML/CSS;JavaScript;Python;SQL;R","R;HTML/CSS;Java;JavaScript;SQL;R"))
# calculations
df %>%
dplyr::mutate(PLANG = stringr::str_split(PL, ";")) %>%
tidyr::unnest(c(PLANG)) %>%
dplyr::group_by(ID, PLANG) %>%
dplyr::count() %>%
tidyr::pivot_wider(names_from = "PLANG", values_from = "n", values_fill = 0)
ID `HTML/CSS` JavaScript Python R SQL Java
<chr> <int> <int> <int> <int> <int> <int>
1 "Line 1: " 1 1 1 1 1 0
2 "Line 2:" 1 1 0 2 1 1

R. Separate a string into 2 columns at first number

I have a column with a lot of data of the form "Male25" indicating sex and age. I just want to separate the column into two, one with the sex and the other one with the age. What is the best way to do that in R?
We can use separate
library(tidyverse)
as_tibble("Male25" ) %>%
separate(value, into = c("sex", "age"), "(?<=[a-z])(?=[0-9])", convert = TRUE)
# A tibble: 1 x 2
# sex age
#* <chr> <int>
#1 Male 25
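You can try either of these base R methods:
d <- data.frame(a=c("male25","female24","male36","female20"))
# method 1: character matrix; drop the digits for sex, drop the non-digits for age
cbind(a1=gsub("\\d","",d$a),a2=gsub("\\D","",d$a))
# method 2: data frame with a numeric age column (named res to avoid masking base::c)
res <- data.frame(a1=gsub("\\d","",d$a),a2=as.numeric(gsub("\\D","",d$a)))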
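To split literally at the first digit, as the title asks, the position of that digit can be located with regexpr() and the string cut there. A minimal base R sketch, assuming every value contains at least one digit (regexpr() returns -1 otherwise, which would need separate handling):

```r
x <- c("Male25", "Female24", "Male36", "Female20")

# Position of the first digit in each string.
pos <- as.integer(regexpr("[0-9]", x))

# Everything before the first digit is the sex; the rest is the age.
sex <- substr(x, 1, pos - 1)
age <- as.integer(substring(x, pos))

print(data.frame(sex, age))
```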

Long to Wide with Non-Unique Key Combinations in R

I am trying to convert a dataset from long to wide format. Need to do this to feed into another program for analysis purposes. My input data is below:
sdata <- data.frame(c(1,1,1,1,1,1,1,1,1,1,1,1,1),c(1,1,1,1,1,1,1,1,1,2,2,2,2),c("X1","A","B","C","D","X2","A","B","C","X1","A","B","C"),c(81,31,40,5,5,100,8,90,2,50,20,24,6))
col_headings <- c("Orig","Dest","Desc","Estimate")
names(sdata) <- col_headings
Input Data
Depending on the unique Orig-Dest-X1 or Orig-Dest-X2 combination above, the subcategories vary: A, B, C for some, A, B, C, D for others, only A, B for others still. I am trying to get the desired output (code to recreate it in R is below, along with an image of the desired output).
sdata_spread <- data.frame(c(1,1),c(1,2),c(81,50),c(31,20),c(40,24),c(5,6),c(5,NA),c(100,NA),c(8,NA),c(90,NA),c(2,NA))
col_headings <- c("Orig","Dest","X1", "X1_A", "X1_B", "X1_C", "X1_D","X2", "X2_A", "X2_B", "X2_C")
names(sdata_spread) <- col_headings
Desired Output
I tried the following:
sdata_spread <- sdata %>% spread(Desc,Estimate)
The error I got was:
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 6 rows
I also tried the accepted answer given here: Long to wide with no unique key and here: Long to wide format with several duplicates. Circumvent with unique combo of columns but it did not get me the desired output.
Any insights would be much appreciated.
Thanks,
Krishnan
One option is to create a grouping variable based on the occurrence of 'X' as the first character in 'Desc'. Within each group, modify 'Desc' by pasting the group's first element of 'Desc' onto each of the other elements, using a condition in case_when, and then reshape to wide format with pivot_wider. (As of tidyr 1.0.0, spread/gather are superseded by pivot_wider/pivot_longer.)
library(dplyr)
library(tidyr)
library(stringr)
sdata %>%
group_by(grp = cumsum(str_detect(Desc, '^X'))) %>%
mutate(Desc = case_when(row_number() > 1 ~ str_c(first(Desc), Desc, sep="_"),
TRUE ~ as.character(Desc))) %>%
ungroup %>%
select(-grp) %>%
pivot_wider(names_from = Desc, values_from = Estimate)
# A tibble: 2 x 11
# Orig Dest X1 X1_A X1_B X1_C X1_D X2 X2_A X2_B X2_C
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 81 31 40 5 5 100 8 90 2
#2 1 2 50 20 24 6 NA NA NA NA NA
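The grouping step is the core of this approach, so it may help to see it on the Desc column alone. A base R sketch of the same idea (grepl stands in for str_detect; the names is_header, grp, header and new_desc are introduced here for illustration):

```r
Desc <- c("X1", "A", "B", "C", "D", "X2", "A", "B", "C", "X1", "A", "B", "C")

# TRUE on the rows that start a new block ("X1", "X2", ...).
is_header <- grepl("^X", Desc)

# Running count of headers seen so far = block id for every row.
grp <- cumsum(is_header)

# Header label of the block each row belongs to.
header <- Desc[is_header][grp]

# Headers stay as they are; sub-rows get "<header>_<Desc>".
new_desc <- ifelse(is_header, Desc, paste(header, Desc, sep = "_"))
print(data.frame(Desc, grp, new_desc))
```

Once the labels are unique within each Orig-Dest pair, the long-to-wide reshape no longer hits duplicate keys.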
