I have a list containing 180,000 elements, each of which holds data about an investor and a specific traded asset.
I want to save each element of the list as a single data frame called df in a specific folder, "dev/test-data/investors-singleass/", so that I can later apply a function to all the data frames in that folder.
My list has a structure similar to this:
list(`4Z627.004125` = structure(list(investor = c("4Z627", "4Z627",
"4Z627"), asset = c("004125", "004125", "004125"), datetime = c("2015-05-12",
"2015-05-28", "2016-08-19"), Avgprice = c(169.4, 168, 162), operation = c(2000,
1000, -3000), portfolio = c(2000, 3000, 0), last_port = c(0,
2000, 3000), marketprice = c(169.4, 166.5, 161.75), portprice = c(169.4,
168.933333333333, 0), G = c(0, 0, 1), gainminus = c(2, 0, 0),
numasset = c(5, 8, 13)), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame")))
Basically, each element of the list corresponds to an "investor" ID and an "asset" code, for which I then have multiple other columns to work with.
I would do it like this, based on a linked answer:
df1 <- list(`4Z627.004125` = structure(list(investor = c("4Z627", "4Z627",
"4Z627"), asset = c("004125", "004125", "004125"),
datetime = c("2015-05-12", "2015-05-28", "2016-08-19"),
Avgprice = c(169.4, 168, 162),
operation = c(2000, 1000, -3000), portfolio = c(2000, 3000, 0),
last_port = c(0,2000, 3000), marketprice = c(169.4, 166.5, 161.75),
portprice = c(169.4,
168.933333333333, 0), G = c(0, 0, 1), gainminus = c(2, 0, 0),
numasset = c(5, 8, 13)), row.names = c(NA, -3L),
class = c("tbl_df", "tbl", "data.frame")),
`4Z628.004128` = structure(list(investor = c("4Z627", "4Z627",
"4Z627"), asset = c("004125", "004125", "004125"),
datetime = c("2015-05-12", "2015-05-28", "2016-08-19"),
Avgprice = c(169.4, 168, 162),
operation = c(2000, 1000, -3000), portfolio = c(2000, 3000, 0),
last_port = c(0,2000, 3000), marketprice = c(169.4, 166.5, 161.75),
portprice = c(169.4,
168.933333333333, 0), G = c(0, 0, 1), gainminus = c(2, 0, 0),
numasset = c(5, 8, 13)), row.names = c(NA, -3L),
class = c("tbl_df", "tbl", "data.frame")))
library(purrr)
iwalk(df1, ~saveRDS(.x, paste0("dev/test-data/investors-singleass/", .y, '.RData')))
You can get the data back into R with
library(dplyr)
df <- list.files(path = "dev/test-data/investors-singleass/", pattern = "\\.RData$", full.names = TRUE) %>%
map_dfr(readRDS)
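If the goal is to apply a function to each saved data frame rather than bind them all together, a round trip might look like the sketch below. This is only an illustration: the two-element list `toy`, the temporary folder `out_dir`, and the use of `nrow` as the per-file function are all stand-ins that do not come from the question.

```r
library(purrr)

# Hypothetical two-element list standing in for the real 180,000-element one
toy <- list(
  `A.1` = data.frame(investor = "A", asset = "1", operation = c(10, -10)),
  `B.2` = data.frame(investor = "B", asset = "2", operation = c(5, 5, -10))
)

# Temporary folder standing in for "dev/test-data/investors-singleass/"
out_dir <- file.path(tempdir(), "investors-singleass")
dir.create(out_dir, showWarnings = FALSE)

# Save every element under its list name
iwalk(toy, ~ saveRDS(.x, file.path(out_dir, paste0(.y, ".RData"))))

# full.names = TRUE returns paths that readRDS can open from any working directory
files <- list.files(out_dir, pattern = "\\.RData$", full.names = TRUE)

# Apply a function file by file (nrow is a placeholder for the real function)
n_rows <- map_int(files, ~ nrow(readRDS(.x)))
names(n_rows) <- sub("\\.RData$", "", basename(files))
```

Reading one file at a time keeps only a single data frame in memory, which may matter with this many files.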
I have a dataframe with approximately 3 million rows. Each row is assigned a unique ID and has up to 4 dates. I wish to create a set of new columns for month and year (i.e. Jan-21, Feb-21, Mar-21, etc) and assign a value of "0" for each month/year prior to the first date, and then a value of "1" for the month/year containing the date for each ID, and maintain the value of "1" in each subsequent month/year column until the next column that matches the 2nd date.
I understand that it's easier to help me with examples, so I have put together this dput output with an example of what my current data looks like:
structure(list(id = c(1, 2, 3, 4, 5), date1 = structure(c(1623801600,
1615420800, 1654560000, 1620259200, 1615248000), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), date2 = structure(c(1629158400, 1621987200,
1658448000, 1623974400, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
date3 = structure(c(NA, 1630454400, 1662076800, 1647907200,
NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"), date4 = structure(c(NA,
1639008000, NA, NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -5L))
And this is what I would like it to look like:
structure(list(id = c(1, 2, 3, 4, 5), `Mar-21` = c(0, 1, 0, 0,
1), `Apr-21` = c(0, 1, 0, 0, 1), `May-21` = c(0, 2, 0, 1, 1),
`Jun-21` = c(1, 2, 0, 2, 1), `Jul-21` = c(1, 2, 0, 2, 1),
`Aug-21` = c(2, 2, 0, 2, 1), `Sep-21` = c(2, 3, 0, 2, 1),
`Oct-21` = c(2, 3, 0, 2, 1), `Nov-21` = c(2, 3, 0, 2, 1),
`Dec-21` = c(2, 4, 0, 2, 1), `Jan-22` = c(2, 4, 0, 2, 1),
`Feb-22` = c(2, 4, 0, 2, 1), `Mar-22` = c(2, 4, 0, 3, 1),
`Apr-22` = c(2, 4, 0, 3, 1), `May-22` = c(2, 4, 0, 3, 1),
`Jun-22` = c(2, 4, 1, 3, 1), `Jul-22` = c(2, 4, 2, 3, 1),
`Aug-22` = c(2, 4, 2, 3, 1), `Sep-22` = c(2, 4, 3, 3, 1)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -5L))
Just a note that I have this dataset in both wide and long format, in case using it in a long format makes more sense.
Thank you!
This was a fun exercise! I'm sure there are a billion ways to do this more efficiently, but I think this works, and it was a fun puzzle for me. I first put the dates into long format to get a min and max. Then I made a sequence of those dates by month. I then used expand.grid to make all combinations of the months with each ID and joined it to the original data frame. Then I just summed how many of date1 to date4 fell on or before each month in the list. I had to use floor_date to change date1 to date4 to the first of the month. Hopefully this helps!
library(dplyr)
library(lubridate)
library(tidyr)
# Long format, dropping missing dates, to find the overall date range
dat2 <- dat %>%
  tidyr::pivot_longer(cols = -id, values_drop_na = TRUE)
dat_min_max <- data.frame("Min" = min(dat2$value), "Max" = max(dat2$value))
# Monthly sequence from the earliest date; the extra month ensures the month of the latest date is covered
month_seq <- seq(dat_min_max$Min, dat_min_max$Max + months(1), by = "month")
dat3 <- dat %>%
  # Move every date to the first of its month
  mutate(date1 = floor_date(date1, "month"),
         date2 = floor_date(date2, "month"),
         date3 = floor_date(date3, "month"),
         date4 = floor_date(date4, "month")
  ) %>%
  # One row per id and month
  left_join(expand.grid(dat$id, month_seq), by = c("id" = "Var1")) %>%
  rowwise() %>%
  # Count how many of the four dates fall on or before this month
  mutate(c = sum(date1 <= Var2, date2 <= Var2, date3 <= Var2, date4 <= Var2, na.rm = TRUE)) %>%
  mutate(Var2 = format(Var2, "%b-%y")) %>%
  select(-date1, -date2, -date3, -date4) %>%
  tidyr::pivot_wider(names_from = Var2, values_from = c)
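To see the counting step in isolation, here is a minimal base R sketch for a single id. The two dates are those of id 1 in the example data; the month range and the variable names are chosen only for illustration.

```r
# The two non-missing dates of id 1 from the example data
d <- as.Date(c("2021-06-16", "2021-08-17"))

# Move each date to the first of its month (base R stand-in for floor_date)
d_floor <- as.Date(format(d, "%Y-%m-01"))

# A short run of months, chosen only for illustration
months_seq <- seq(as.Date("2021-05-01"), as.Date("2021-09-01"), by = "month")

# For each month, count how many dates fall on or before it
counts <- vapply(seq_along(months_seq),
                 function(i) sum(d_floor <= months_seq[i]),
                 integer(1))
names(counts) <- format(months_seq, "%b-%y")
counts
```

The counts step from 0 to 1 in June and from 1 to 2 in August, matching the first row of the desired output for those months.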
I have a dataset whose column names share a common feature: the part after the comma names the group the column belongs to. Is it possible to create a table with a two-row header? For example, the first header row would be Up, and below it Quantity, Price, Quality. Is it also possible to separate the three groups (Up, Down and Total) with an empty column or a bold border? I know the DT library makes this easier, but I am looking for a solution that uses the shiny library only.
I also found that tags$style() and CSS can help with this, but I am not familiar with CSS.
library(shiny)
df <- structure(list(Year = c(2022L, 2022L, 2022L, 2022L, 2022L),
Week = c(0, 1, 2, 3, 4),
`Quantity, Up` = c(335, 305, 740, 910, 515),
`Price, Up` = c(1, 2, 5, 5, 3),
`Quality, Up` = c(243, 243, 243, 12321.434052, 1706.327462),
`Quantity, Down` = c(-175, -900, -205, -35, 0),
`Price, Down` = c(243, 243, 1219.717851, 902.819827, 0),
`Quality, Down` = c(2834.205418, 243, -1219.717851, 902.819827, 0),
`Quantity, Total` = c(510, 1205, 945, 945, 515),
`Price, Total` = c(431, 32, 423, 342, 243),
`Quality, Total` = c(24, 4, -2, 42, 1706.327462)),
class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L),
groups = structure(list(Year = 2022L, .rows = structure(list(1:5), ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr", "list"))),
class = c("tbl_df", "tbl", "data.frame" ), row.names = c(NA, -1L), .drop = TRUE))
ui <- fluidPage(
# Application title
titlePanel("aFRR"),
# plot graphs
mainPanel(tabsetPanel(
tabPanel("Up",
h3(helpText("aFRR Price and Quantity")),
tableOutput("table_up"))
)
)
)
server <- function(input, output, session) {
output$table_up <- renderTable(df, na = "missing",align = 'c',striped = TRUE
)
}
shinyApp(ui, server)
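One possible direction, sketched under assumptions rather than offered as a finished solution: build the table by hand with the `tags` helpers that shiny exports, putting the group names in a first header row with `colspan` and drawing a left border where each group starts. The helper name `two_row_table`, the small data frame `toy`, and the inline CSS are hypothetical choices made for illustration.

```r
library(shiny)

# Build a two-row-header HTML table from column names of the form "Measure, Group".
two_row_table <- function(df) {
  df <- as.data.frame(df)  # drop any dplyr grouping
  parts   <- strsplit(names(df), ", ", fixed = TRUE)
  measure <- vapply(parts, function(p) p[1], character(1))
  group   <- vapply(parts, function(p) if (length(p) > 1) p[2] else "", character(1))

  runs   <- rle(group)                            # consecutive columns sharing a group
  starts <- cumsum(c(1, head(runs$lengths, -1)))  # first column index of each run
  # Bold left border on the first column of every group except the leading one
  edge <- function(j) if (j %in% starts[-1]) "border-left: 2px solid black;" else NULL

  top <- lapply(seq_along(runs$values), function(k)
    tags$th(runs$values[k], colspan = runs$lengths[k],
            style = paste0("text-align: center;", edge(starts[k]))))
  second <- lapply(seq_along(df), function(j) tags$th(measure[j], style = edge(j)))
  body <- lapply(seq_len(nrow(df)), function(i)
    tags$tr(lapply(seq_along(df), function(j)
      tags$td(format(df[[j]][i]), style = edge(j)))))

  tags$table(class = "table table-striped",
             tags$thead(tags$tr(top), tags$tr(second)),
             tags$tbody(body))
}

# Hypothetical cut-down version of the question's data
toy <- data.frame(Week = 0:1,
                  `Quantity, Up` = c(335, 305), `Price, Up` = c(1, 2),
                  `Quantity, Down` = c(-175, -900), `Price, Down` = c(243, 243),
                  check.names = FALSE)
html <- as.character(two_row_table(toy))
```

In the app, this would presumably replace tableOutput("table_up") with uiOutput("table_up") in the UI and renderTable(...) with renderUI(two_row_table(df)) in the server; number formatting is left to format() here and would likely need adjusting.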
I have a panel (cross-sectional time series) dataset. For each group (defined by (NAICS2, occ_type) in time ym) I have many variables. For each variable I would like to subtract each group's first (dplyr::first) value from every value of that group.
Ultimately I am trying to take the Euclidean distance between each row's vector and the vector of its group's first entry (i.e. sqrt(c_1^2 + ... + c_k^2), where c_j is the difference in the j-th variable).
I was able to create a column equal to the first entry for each group:
df2 <- df %>%
group_by(ym, NAICS2, occ_type) %>%
distinct(ym, NAICS2, occ_type, .keep_all = T) %>%
arrange(occ_type, NAICS2, ym) %>%
select(group_cols(), ends_with("_scf")) %>%
mutate_at(vars(-group_cols(), ends_with("_scf")),
list(first = dplyr::first))
I then tried to include variations of f.diff = . - dplyr::first(.) in the list, but none of those worked. I googled the dot notation for a while as well as first and lag in dplyr timeseries but have not been able to resolve this yet.
Ideally, I unite all variables into a vector for each row first and then take the difference.
df2 <- df %>%
group_by(ym, NAICS2, occ_type) %>%
distinct(ym, NAICS2, occ_type, .keep_all = T) %>%
arrange(occ_type, NAICS2, ym) %>%
select(group_cols(), ends_with("_scf")) %>%
unite(vector, c(-group_cols(), ends_with("_scf")), sep = ',') %>%
# TODO: DISTANCE_BETWEEN_ENTRY_AND_FIRST
mutate(vector.diff = ???)
I expect the output to be a numeric column that contains a distance measure of how different each group's row vector is from its initial row vector.
Here is a sample of the data:
structure(list(ym = c("2007-01-01", "2007-02-01"), NAICS2 = c(0L,
0L), occ_type = c("is_middle_manager", "is_middle_manager"),
Administration_scf = c(344, 250), Agriculture..Horticulture..and.the.Outdoors_scf = c(11,
17), Analysis_scf = c(50, 36), Architecture.and.Construction_scf = c(57,
51), Business_scf = c(872, 585), Customer.and.Client.Support_scf = c(302,
163), Design_scf = c(22, 17), Economics..Policy..and.Social.Studies_scf = c(7,
7), Education.and.Training_scf = c(77, 49), Energy.and.Utilities_scf = c(25,
28), Engineering_scf = c(90, 64), Environment_scf = c(19,
19), Finance_scf = c(455, 313), Health.Care_scf = c(105,
71), Human.Resources_scf = c(163, 124), Industry.Knowledge_scf = c(265,
174), Information.Technology_scf = c(467, 402), Legal_scf = c(21,
17), Maintenance..Repair..and.Installation_scf = c(194, 222
), Manufacturing.and.Production_scf = c(176, 174), Marketing.and.Public.Relations_scf = c(139,
109), Media.and.Writing_scf = c(18, 20), Personal.Care.and.Services_scf = c(31,
16), Public.Safety.and.National.Security_scf = c(14, 7),
Religion_scf = c(0, 0), Sales_scf = c(785, 463), Science.and.Research_scf = c(52,
24), Supply.Chain.and.Logistics_scf = c(838, 455), total_scf = c(5599,
3877)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L), groups = structure(list(ym = c("2007-01-01",
"2007-02-01"), NAICS2 = c(0L, 0L), occ_type = c("is_middle_manager",
"is_middle_manager"), .rows = list(1L, 2L)), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))
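A minimal sketch of one way to get such a distance, assuming the intent is to compare each month with the first month of the same (NAICS2, occ_type) pair. Note that grouping by ym as well makes every group a single row in the sample above, so the first value would equal the row itself. The data frame `toy` and its columns `a_scf` and `b_scf` are hypothetical stand-ins for the real data.

```r
library(dplyr)

# Hypothetical toy data with two "_scf" columns standing in for the real ones
toy <- tibble::tibble(
  ym = c("2007-01-01", "2007-02-01", "2007-01-01", "2007-02-01"),
  NAICS2 = c(0L, 0L, 1L, 1L),
  occ_type = "is_middle_manager",
  a_scf = c(3, 0, 10, 10),
  b_scf = c(4, 0, 5, 8)
)

res <- toy %>%
  ungroup() %>%
  arrange(occ_type, NAICS2, ym) %>%
  # Group WITHOUT ym so that each group spans several months
  group_by(NAICS2, occ_type) %>%
  # Subtract the group's first value from every value, column by column
  mutate(across(ends_with("_scf"), ~ .x - first(.x), .names = "{.col}_diff")) %>%
  ungroup() %>%
  # Euclidean norm of the row of differences
  mutate(vector.diff = sqrt(rowSums(as.matrix(across(ends_with("_diff")))^2)))

res$vector.diff
```

With the real data, ends_with("_scf") would also pick up total_scf, which may or may not be wanted in the distance.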
I researched ways to write this type of list to a file, but I could not get it to work.
Here is my data:
a<-structure(list(X2005 = structure(list(hours = list(c(0.2, 0,
4), c(0.2, 4)), maxx = structure(list(maxh2 = 4, maxh3 = 4), .Names = c("maxh2",
"maxh3"))), .Names = c("hours", "maxx")), X2006 = structure(list(
hours = list(c(1.8, 0, 1), c(1.8, 1)), maxx = structure(list(
maxh2 = 1.8, maxh3 = 1.8), .Names = c("maxh2", "maxh3"
))), .Names = c("hours", "maxx")), X2007 = structure(list(
hours = list(c(4.2, 0, 0), c(4.2, 0)), maxx = structure(list(
maxh2 = 4.2, maxh3 = 4.2), .Names = c("maxh2", "maxh3"
))), .Names = c("hours", "maxx")), X2008 = structure(list(
hours = list(c(0.1, 6, 0), c(3.1, 3)), maxx = structure(list(
maxh2 = 6, maxh3 = 3.1), .Names = c("maxh2", "maxh3"))), .Names = c("hours",
"maxx"))), .Names = c("X2005", "X2006", "X2007", "X2008"))
I need to see this list of lists in an Excel sheet.
We could try this:
write.csv(do.call("rbind",list(unlist(a))),"testme.csv")
You could also try this and do some reshape2ing before export.
write.csv(do.call("cbind",list(unlist(a))),"testme2.csv")
Viewing the structure these yield:
View(do.call("cbind",list(unlist(a))))
Another option, as suggested by @jay.sf:
openxlsx::write.xlsx(do.call("rbind",list(unlist(a))),"testme.xlsx")
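As a possible alternative layout, the flattened vector can be turned into a two-column table with one row per value, which may be easier to read in a spreadsheet than a single very wide row. This is only a sketch: the list `a_small` is a cut-down, hypothetical version of the data above, and the output file name is made up.

```r
# Cut-down version of the nested list, for illustration only
a_small <- list(
  X2005 = list(hours = list(c(0.2, 0, 4), c(0.2, 4)),
               maxx  = list(maxh2 = 4, maxh3 = 4))
)

# unlist() flattens the nesting into a named numeric vector;
# the names record the path to each value
flat <- unlist(a_small)

# One row per value: the path in one column, the number in the other
long <- data.frame(name = names(flat), value = unname(flat))

out_file <- file.path(tempdir(), "testme_long.csv")  # hypothetical file name
write.csv(long, out_file, row.names = FALSE)
```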
Can someone explain to me why this is not working?
tapply(dat$`lagged Date`, INDEX = dat$Location, FUN = diff(dat$`lagged Date`))
I receive the following error:
Error in match.fun(FUN) : 'diff(dat$lagged Date)' is not a
function, character or symbol
structure(list(`lagged Date` = structure(c(1466306880, 1466307060,
1466307240, 1466307420, 1466307600, 1466307780), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Location = c(309, 309, 309, 309, 309,
309), Duration = c(0, 0, 0, 0, 0, 0), Latitude = c(53.50205667,
53.501915, 53.50183667, 53.50178833, 53.50184, 53.50186167),
Longitude = c(-3.354733333, -3.354096667, -3.353838333, -3.353673333,
-3.353711667, -3.353741667), `Number of Records` = c(1, 1,
1, 1, 1, 1), Speed = c(0.9, 0, 0, 0, 0, 0), `Sum of Var` = c(38,
38, 38, 38, 38, 38), check = c(0, 0, 0, 0, 0, 0)), .Names = c("lagged Date",
"Location", "Duration", "Latitude", "Longitude", "Number of Records",
"Speed", "Sum of Var", "check"), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
thank you!
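I'm not sure what you want to achieve, but FUN has to be a function, whereas diff(dat$`lagged Date`) is the result of calling diff, not a function. Passing only diff as the FUN argument works and produces this output: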
tapply(dat$`lagged Date`, INDEX = dat$Location, FUN = diff)
$`309`
Time differences in mins
[1] 3 3 3 3 3
If you want to convert the output into hours, you can do that by selecting only the difftime values from the list that tapply returns and converting those:
as.numeric(tapply(dat$`lagged Date`, INDEX = dat$Location, FUN = diff)[[1]], units = "hours")
Output then looks like this:
[1] 0.05 0.05 0.05 0.05 0.05
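If the conversion should happen inside tapply, FUN can also be an anonymous function that wraps diff. The sketch below rebuilds a three-row version of the data so that it runs on its own; the name `mins` and the shortened data are illustrative assumptions.

```r
# Three-row stand-in for the question's data
dat <- data.frame(
  `lagged Date` = as.POSIXct(c(1466306880, 1466307060, 1466307240),
                             origin = "1970-01-01", tz = "UTC"),
  Location = c(309, 309, 309),
  check.names = FALSE
)

# FUN is a function of the per-location subset x, not a call on the whole column
mins <- tapply(dat$`lagged Date`, INDEX = dat$Location,
               FUN = function(x) as.numeric(diff(x), units = "mins"))

mins[[1]]
```

Each timestamp is 180 seconds after the previous one, so every difference is 3 minutes.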