reshape data into multiple columns using pivot_longer - r

I am using pivot_longer to reshape my data from wide to long format into multiple value columns. I know there are related questions (Pivot_longer 6 columns to 3 columns or Tidy dataset with pivot_longer: Multiple columns into two columns), but I could not find a solution so far, probably because my two columns will be of different class, the first one being POSIXct and the second one is numeric.
Here is a minimal working example:
structure(list(compid = c("AT9130162999", "AT9090003478", "AT9070005375",
"AT9130048156"), iso2c = c("AT", "AT", "AT", "AT"), nace4 = c("7010",
"4211", "2452", "7010"), lastyear = c("2018", "2019", "2019",
"2019"), `Closing date
Last avail. yr` = structure(c(1546214400,
1577750400, 1585612800, 1577750400), tzone = "UTC", class = c("POSIXct",
"POSIXt")), `Closing date
Year - 1` = structure(c(1514678400,
1546214400, 1553990400, 1546214400), tzone = "UTC", class = c("POSIXct",
"POSIXt")), `Closing date
Year - 2` = structure(c(NA,
1514678400, 1522454400, 1514678400), tzone = "UTC", class = c("POSIXct",
"POSIXt")), `Closing date
Year - 3` = structure(c(NA,
1483142400, 1490918400, 1483142400), tzone = "UTC", class = c("POSIXct",
"POSIXt")), `Closing date
Year - 4` = structure(c(NA,
1451520000, 1459382400, 1451520000), tzone = "UTC", class = c("POSIXct",
"POSIXt")), `Closing date
Year - 5` = structure(c(NA,
1419984000, 1427760000, 1419984000), tzone = "UTC", class = c("POSIXct",
"POSIXt")), `Closing date
Year - 6` = structure(c(NA,
1388448000, 1396224000, 1388448000), tzone = "UTC", class = c("POSIXct",
"POSIXt")), `Closing date
Year - 7` = structure(c(NA,
1356912000, 1364688000, 1356912000), tzone = "UTC", class = c("POSIXct",
"POSIXt")), `Closing date
Year - 8` = structure(c(NA,
1325289600, 1333152000, 1325289600), tzone = "UTC", class = c("POSIXct",
"POSIXt")), `Closing date
Year - 9` = structure(c(NA,
1293753600, 1301529600, 1293753600), tzone = "UTC", class = c("POSIXct",
"POSIXt")), operatinginc_last = c(NA, 482813, -94300, NA), operatinginc_year1 = c(NA,
423482, 780400, NA), operatinginc_year2 = c(NA, 404694, 1210300,
NA), ebit_last = c(1060000, 482813, -94300, 351292), ebit_year1 = c(1501000,
423482, 780400, 331415), ebit_year2 = c(NA, 404694, 1210300,
305492), operatingrev_last = c(28463000, 15842418, 13009700,
11742884), operatingrev_year1 = c(NA, 13734462, 13146300, 10682889
), operatingrev_year2 = c(NA, 13734462, 13146300, 10682889)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
So far, I have tried this:
df_l <- df %>%
pivot_longer(., cols = -(starts_with(c("compid","iso2c","nace4","lastyear","Closing"))),
values_to = "value", values_drop_na=T, names_sep = "_", names_to = c("variable","year"))
But now I would also like to reshape all the columns that start with Closing. How do I do (preferably in one step with pivot_longer)?
The expected output should then include a variable, year and value column, but also a closingdate and date column:
compid iso2c nace4 lastyear `closingdate ~ `date ~`variable ~`year ~ `value
<chr> <chr> <chr> <chr> <dttm> <dttm> <dttm> <dttm>
1 AT913~ AT 7010 2018 `Closing date Last avail. yr` 2018-12-31 ebit last 28463000
2 AT913~ AT 7010 2018 `Closing date Year - 1` 2017-12-31 ebit year1 15362687
2 AT913~ AT 7010 2018 `Closing date Year - 1` 2016-12-31 ebit year2 404694

I have no clue how you would do that in one call to pivot_longer, because you have different variables with different schemes. And you ALSO want to pivot to longer the closing date variable. So here it is in two calls with some cleaning of the closing variable.
library(tidyverse)
df_l <- pivot_longer(df, cols = starts_with("Closing"),
values_to = "date", values_drop_na=T, names_to = c("closing")) %>%
pivot_longer(., cols = contains("_"),
values_to = "value", values_drop_na=T, names_sep = '_', names_to = c("variable",'year')) %>%
mutate(closing = str_remove_all(closing,'Closing date') %>%
str_remove_all(.,'[:cntrl:]') %>%
str_squish() %>%
str_trim())

Related

Trouble combining double and character

I am having trouble merging datasets and am going to merging many together so need to figure out a way to automate getting through the following error:
"Error: Can't combine `C:/Users/gabri/AppData/Local/Cache/R/noaa_lcd/2006_72038163885.csv$HourlyWetBulbTemperature` <double> and `C:/Users/gabri/AppData/Local/Cache/R/noaa_lcd/2009_72038163885.csv$HourlyWetBulbTemperature` <character>."
I have examined the data and see that in one of the files some of the NAs are marked by * so I know that is why the problem is there. I would like to add a command that will convert either all to character or all to numeric so that I can merge but when I try adding as.character I receive this error:
Error: Names repair functions can't return `NA` values
Here is the relevant code I am trying to run which produces the error.
library(rnoaa)
library(tidyverse)
library(fs)
super_big_df <- map_df(my_files, read_csv, col_select = c(1,2,21,32,80), col_types = "cTddd", .id = "file")
Here is the output of dput for the relevant columns of the dataset
structure(list(STATION = c(72038163885, 72038163885, 72038163885
), DATE = structure(c(1230768000, 1230769200, 1230770400), tzone = "UTC", class = c("POSIXct",
"POSIXt")), HourlyWetBulbTemperature = c("*", "38", "37"), DailyAverageWetBulbTemperature = c(NA,
NA, NA), MonthlyWetBulb = c(NA, NA, NA)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
structure(list(STATION = c(72038163885, 72038163885, 72038163885
), DATE = structure(c(1146459600, 1146460800, 1146462000), tzone = "UTC", class = c("POSIXct",
"POSIXt")), HourlyWetBulbTemperature = c(NA_real_, NA_real_,
NA_real_), DailyAverageWetBulbTemperature = c(72, NA, NA), MonthlyWetBulb = c(NA,
NA, NA)), row.names = c(NA, -3L), class = c("tbl_df", "tbl",
"data.frame"))
In sum, I am wondering if there is a command I can put in map_df() that will convert everything to be the same (either character or numeric) so that the rest of the command will still run.
Untested, but the best way forward as #GregorThomas suggested is to read it in properly the first time. In this case, it's likely something like:
super_big_df <- map_df(
my_files, read_csv, na = c("", "NA", "*"),
col_select = c(1,2,21,32,80), col_types = "cTddd",
.id = "file")
If you need to fix it after the fact, then you'll need to read them into a list-of-frames, perhaps changing map_df to map,
super_big_df <- map(
my_files, read_csv, na = c("", "NA", "*"),
col_select = c(1,2,21,32,80), col_types = "cTddd",
.id = "file")
bind_rows(super_big_df)
# Error: Can't combine `..1$HourlyWetBulbTemperature` <character> and `..2$HourlyWetBulbTemperature` <double>.
and then something like
library(dplyr) # in case you did not already have it loaded
purrr::map(super_big_df, ~ mutate(., HourlyWetBulbTemperature = suppressWarnings(as.numeric(HourlyWetBulbTemperature)))) %>%
bind_rows()
# # A tibble: 6 x 5
# STATION DATE HourlyWetBulbTemperature DailyAverageWetBulbTemperature MonthlyWetBulb
# <dbl> <dttm> <dbl> <dbl> <lgl>
# 1 72038163885 2009-01-01 00:00:00 NA NA NA
# 2 72038163885 2009-01-01 00:20:00 38 NA NA
# 3 72038163885 2009-01-01 00:40:00 37 NA NA
# 4 72038163885 2006-05-01 05:00:00 NA 72 NA
# 5 72038163885 2006-05-01 05:20:00 NA NA NA
# 6 72038163885 2006-05-01 05:40:00 NA NA NA
The suppressWarnings here is because we know there is a non-number ("*") in that column somewhere. For that one frame, it will fix that column; for other frames, it should be a no-op since the column is already as.numeric.
Note that I hard-coded the name here since we know what it is ahead of time. If there are more columns that need repairing (i.e., you get more errors after fixing this one), then it might be advantageous to go with a more dynamic/programmatic approach (not yet covered here).
Data
super_big_df <- list(
structure(list(STATION = c(72038163885, 72038163885, 72038163885), DATE = structure(c(1230768000, 1230769200, 1230770400), tzone = "UTC", class = c("POSIXct", "POSIXt")), HourlyWetBulbTemperature = c("*", "38", "37"), DailyAverageWetBulbTemperature = c(NA, NA, NA), MonthlyWetBulb = c(NA, NA, NA)), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame")),
structure(list(STATION = c(72038163885, 72038163885, 72038163885), DATE = structure(c(1146459600, 1146460800, 1146462000), tzone = "UTC", class = c("POSIXct", "POSIXt")), HourlyWetBulbTemperature = c(NA_real_, NA_real_, NA_real_), DailyAverageWetBulbTemperature = c(72, NA, NA), MonthlyWetBulb = c(NA, NA, NA)), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
)

Why do I get Error in Error: Problem with `mutate()` input `medication_name`. x Result 1 must be a single string, not a character vector of length 2

I have a data set with another with a list of a nested data.
age_pharma <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8), age_band = c("5_9",
"10_14", "15-19", "20-24", "5_9", "10_14", "15-19", "20-24"),
table = list(structure(list(med_name_one = c("Co-amoxiclav",
"doxycycline"), med_name_two = c(NA, "Gentamicin"), mg_one = c("411 mg",
"120 mg"), mg_two = c(NA, "11280 mg"), datetime = c("2020-01-03 10:08",
"2020-01-01 11:08"), date_time = structure(c(1578046080,
1577876880), tzone = "Europe/London", class = c("POSIXct",
"POSIXt"))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-2L)), structure(list(med_name_one = c("Gentamicin", "Co-trimoxazole"
), med_name_two = c("Co-trimoxazole", NA), mg_one = c("11280 mg",
"8 mg"), mg_two = c("8 mg", NA), datetime = c("2020-01-02 19:08",
"2020-01-08 20:08"), date_time = structure(c(1577992080,
1578514080), tzone = "Europe/London", class = c("POSIXct",
"POSIXt"))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-2L)), structure(list(med_name_one = "Gentamicin", med_name_two = NA_character_,
mg_one = "11280 mg", mg_two = NA_character_, datetime = "2020-01-02 19:08",
date_time = structure(1577992080, tzone = "Europe/London", class = c("POSIXct",
"POSIXt"))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-1L)), structure(list(med_name_one = "Co-trimoxazole", med_name_two = NA_character_,
mg_one = "8 mg", mg_two = NA_character_, datetime = "2020-01-08 20:08",
date_time = structure(1578514080, tzone = "Europe/London", class = c("POSIXct",
"POSIXt"))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-1L)), structure(list(med_name_one = "Sodium Chloride", med_name_two = NA_character_,
mg_one = "411 mg", mg_two = NA_character_, datetime = "2020-01-10 08:08",
date_time = structure(1578643680, tzone = "Europe/London", class = c("POSIXct",
"POSIXt"))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-1L)), structure(list(med_name_one = "Piperacillin", med_name_two = NA_character_,
mg_one = "120 mg", mg_two = NA_character_, datetime = "2020-01-03 09:08",
date_time = structure(1578042480, tzone = "Europe/London", class = c("POSIXct",
"POSIXt"))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-1L)), structure(list(med_name_one = character(0), med_name_two = character(0),
mg_one = character(0), mg_two = character(0), datetime = character(0),
date_time = structure(numeric(0), tzone = "Europe/London", class = c("POSIXct",
"POSIXt"))), class = c("tbl_df", "tbl", "data.frame"), row.names = integer(0)),
structure(list(med_name_one = character(0), med_name_two = character(0),
mg_one = character(0), mg_two = character(0), datetime = character(0),
date_time = structure(numeric(0), tzone = "Europe/London", class = c("POSIXct",
"POSIXt"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = integer(0)))), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
I am trying to map a variable from the list (table). The variable is called med_name_one.
get_medication_name <- function(medication_name_df) {
medication_name <- medication_name_df %>%
dplyr::group_by(id) %>%
dplyr::arrange(datetime) %>%
pull(med_name_one)
}
Here I am applying the function so that I get the med_name_one as a variable.
age_pharma <- mutate(medication_name = purrr::map(age_pharma, get_medication_name))
Yet I do not know why I get this error?
Error: Problem with `mutate()` input `medication_name`.
x Result 1 must be a single string, not a character vector of length 2
ℹ Input `medication_name` is `purrr::map_chr(table, get_medication_name)`.
Run `rlang::last_error()` to see where the error occurred.
Can someone help me understand the error? Also how can I retrieve med_name_one?
Here's one option
get_medication_name <- function(medication_name_df) {
medication_name <- medication_name_df %>%
dplyr::arrange(datetime) %>%
dplyr::summarize(medname = first(med_name_one)) %>%
dplyr::pull(medname)
}
age_pharma %>% mutate(medication_name = purrr::map_chr(table, get_medication_name))
First we had to change the get_medication_name function to handle the case where there are no rows in the table column which is the case in your example.
Then we need to apply the map specifically to the table column.

How to extract time interval data from minute data in r

I am trying to extract rows at 5 minute intervals from 1 minute data. My data looks like this:
structure(list(Date = structure(c(1509408000, 1509408000, 1509408000,
1509408000, 1509408000, 1509408000), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Time = structure(c(-2209021500, -2209021560,
-2209021620, -2209021680, -2209021740, -2209021800), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), O = c(3674, 3675, 3674, 3675, 3675,
3675), H = c(3674, 3675, 3675, 3676, 3676, 3675), L = c(3673,
3674, 3674, 3674, 3675, 3675), C = c(3673, 3674, 3674, 3675,
3675, 3675)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
structure(list(Date = structure(c(1506902400, 1506902400, 1506902400,
1506902400, 1506902400, 1506902400), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Time = structure(c(-2209071300, -2209071360,
-2209071420, -2209071480, -2209071540, -2209071600), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), O = c(3450, 3451, 3451, 3452, 3450,
3449), H = c(3451, 3451, 3451, 3452, 3452, 3451), L = c(3448,
3449, 3449, 3450, 3450, 3449), C = c(3448, 3451, 3450, 3451,
3452, 3450)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
I have looked at:
Create a time interval of 15 minutes from minutely data in R?
How to subset and extract time series by time interval in row
but none do exactly what I want. Maybe I could use this:
substr(t,15,16)=="00".
but I'm not sure how to combine it with filter.
Desired Output: find rows at 30 minute intervals:
You can extract rows with a minute-mark ending in 0 or 5 with
df[substr(format(df$Time, '%M'), 2, 2) %in% c(0, 5),]
# or
df[as.numeric(format(df$Time, '%M')) %% 5 == 0,]
# or
df[grep('[0|5]$', format(df$Time, '%M')),]
With filter:
library(dplyr)
df %>%
filter(substr(format(df$Time, '%M'), 2, 2) %in% c(0, 5))
# or
df %>%
filter(as.numeric(format(df$Time, '%M')) %% 5 == 0)

Estimate overnight returns for many stocks using a for loop and store it in a dataframe with stock names as column names

I am trying to estimate overnight returns for many stocks using a for loop and store it in a dataframe with stock names as column names. The trade has raw intraday data and trade2 has cleaned intraday data. list.namess has stock names. This is my code:
require(xts)
require(highfrequency)
OvernightRet<-list()
list.namess<- list.files(pattern="*.IS Equity")
list.namess<- list.namess[2]
for(Q in 1:length(list.namess)){
trade<-readRDS(list.namess[Q])
trade<-xts(trade[,-1], order.by = trade[,1])
colnames(trade)[c(1,2)]<-c("PRICE", "SIZE")
#Unduplicating
trade2<-do.call(rbind, lapply(split(trade,"days"), mergeTradesSameTimestamp))
trade2<-trade2[,1]
fun.first= function(x) first(x)
fun.last= function(x) last(x)
A=do.call(rbind, lapply(split(trade2, "days"), FUN=fun.first))
B=do.call(rbind, lapply(split(trade2, "days"), FUN=fun.last))
OvernightRetA <- (as.numeric(A)-as.numeric(lag.xts(B)))/as.numeric(lag.xts(B))
colnames(OvernightRetA)<-list.namess[Q]
OvernightRet[[Q]]<-OvernightRetA
}
df.OvernightRet<-do.call(merge, OvernightRet)
However, it gives error, probably because of not being able to rename the OvernightRetA:
Error in `colnames<-`(`*tmp*`, value = "ACEM IS Equity.rds") :
attempt to set 'colnames' on an object with less than two dimensions
In addition: There were 50 or more warnings (use warnings() to see the first 50)
> df.OvernightRet<-do.call(merge, OvernightRet)
Error in as.data.frame(x) : argument "x" is missing, with no default
As trade and trade2 is huge and not appropriate for dput. I am posting given Open(A), Close(B) and list of names (list.namess) for reproducibility of error.
dput(head(A,10))
structure(c(231.9, 236.35, 230, 226.85, 229.05, 225.7, 226.95,
224.55, 227, 234.65), class = c("xts", "zoo"), .indexCLASS = c("POSIXct",
"POSIXt"), .indexTZ = "Asia/Calcutta", tclass = c("POSIXct",
"POSIXt"), tzone = "Asia/Calcutta", Price = 1L, index = structure(c(1459481850,
1459741066, 1459827433, 1459913867, 1460000236, 1460086630, 1460345867,
1460432285, 1460518631, 1460950628), tzone = "Asia/Calcutta", tclass = c("POSIXct",
"POSIXt")), .Dim = c(10L, 1L), .Dimnames = list(NULL, "PRICE"))
dput(head(B,10))
structure(c(235.35, 231.2, 226.1, 229.05, 226.45, 225.75, 224.55,
223.75, 231.1, 228.6), class = c("xts", "zoo"), .indexCLASS = c("POSIXct",
"POSIXt"), .indexTZ = "Asia/Calcutta", tclass = c("POSIXct",
"POSIXt"), tzone = "Asia/Calcutta", Price = 1L, index = structure(c(1459508732,
1459767943, 1459854348, 1459940748, 1460027143, 1460113538, 1460374518,
1460465873, 1460545568, 1460977541), tzone = "Asia/Calcutta", tclass = c("POSIXct",
"POSIXt")), .Dim = c(10L, 1L), .Dimnames = list(NULL, "PRICE"))
dput(list.namess) "ACEM IS Equity.rds"
Kindly help me solve this error.
I believe the problem, as the error message implies, is that you are trying to assign a column header to a single value. You can work around this by changing the line above to:
OvernightRetA <- as.data.frame(as.numeric(A)-as.numeric(lag.xts(B)))/as.numeric(lag.xts(B))

aaply for data.table to find length of intersection of interval

I have data like this:
View(dose_merged)
SUBJECT_Blinded PACKID SACDPDAT SACRTDAT treatment_interval SD_SDAT SD_EDAT
1 1501301 10094 2012-05-26 2012-07-23 58 2012-01-03 2013-01-02
2 1601301 10555 2012-01-03 2012-01-31 28 2012-01-03 2013-01-0
With columns types in data table:
> mapply(class, dose_merged)
$SUBJECT_Blinded
[1] "numeric"
$PACKID
[1] "numeric"
$SACDPDAT
[1] "POSIXct" "POSIXt"
$SACRTDAT
[1] "POSIXct" "POSIXt"
$treatment_interval
[1] "Interval"
attr(,"package")
[1] "lubridate"
$SD_SDAT
[1] "POSIXct" "POSIXt"
$SD_EDAT
[1] "POSIXct" "POSIXt"
I want to determine the length of intersection of intervals: interval(SACDPDAT, SACRTDAT) and interval(SD_SDAT, SD_EDAT).
I am trying this:
dose_merged[,intersect1 := aaply(dose_merged, 1, function(x){intersect(interval(x[3],x[4]), interval(x[8],x[9]))})]
But then I get error message:
Error: error while computing 'x' when choosing method for 'intersect': Error in as.POSIXct.default(start) :
do not know how to convert 'start' to class “POSIXct”
The line
intersect(interval(x[3],x[4]), interval(x[8],x[9]))})
works for specified row x.
Any ideas what I am doing wrong ?
The first two rows of dput(dose_merge):
structure(list(SUBJECT_Blinded = c(1101001, 1101001), PACKID = c(10096,
10595), SACDPDAT = structure(c(1335304800, 1325545200), class = c("POSIXct",
"POSIXt"), tzone = ""), SACRTDAT = structure(c(1340316000, 1327964400
), class = c("POSIXct", "POSIXt"), tzone = ""), treatment_interval = structure(c(58,
28), class = structure("Interval", package = "lubridate")), TS_SDAT = structure(c(NA_real_,
NA_real_), class = c("POSIXct", "POSIXt"), tzone = ""), TS_EDAT = structure(c(NA_real_,
NA_real_), class = c("POSIXct", "POSIXt"), tzone = ""), SD_SDAT = structure(c(1325545200,
1325545200), class = c("POSIXct", "POSIXt"), tzone = ""), SD_EDAT = structure(c(1357081200,
1357081200), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("SUBJECT_Blinded",
"PACKID", "SACDPDAT", "SACRTDAT", "treatment_interval", "TS_SDAT",
"TS_EDAT", "SD_SDAT", "SD_EDAT"), sorted = "SUBJECT_Blinded", class = c("data.table",
"data.frame"), row.names = c(NA, -2L), .internal.selfref = <pointer: 0x0000000002f30788>)

Resources