I have a data frame in R
year group sales
1 2000 1 20
2 2001 1 25
3 2002 1 23
4 2003 1 30
5 2001 2 50
6 2002 2 55
I want to group the data by group, or create some kind of object: one array per group that stores the year and the sales. Then I will try to save it as a JSON file with this structure:
[{"group": 1, "sales":[[2000,20],[2001, 25], [2002,23], [2003, 30]]},
{"group": 2, "sales":[[2001, 50], [2002,55]]}]
Is it possible to do it automatically?
Thanks a lot
We can use data.table to paste the 'year' and 'sales' columns grouped by 'group'. We convert the 'data.frame' to a 'data.table' (setDT(df1)). Grouped by 'group', we use sprintf to paste 'year' and 'sales' together inside square brackets ([]), collapse the output to a single string with toString (a wrapper for paste(..., collapse=', ')), paste the outer [] around it, and use toJSON.
library(jsonlite)
library(data.table)
toJSON(setDT(df1)[, list(sales= paste0('[',toString(sprintf('[%d,%d]',
year, sales)),']')), by = group])
#[{"group":1,"sales":"[[2000,20], [2001,25], [2002,23], [2003,30]]"},
#{"group":2,"sales":"[[2001,50], [2002,55]]"}]
The paste by group can also be done in base R. We split the dataset by the 'group' column to create a list, loop through the list with lapply, and paste the 'year' and 'sales' columns as described above. For each element we create a data.frame with the first value of 'group' and the string from the paste step, rbind the list elements into a single data.frame, and then use toJSON.
toJSON(
do.call(rbind,
lapply(
split(df1, df1$group),
function(x) data.frame(group=x$group[1L],
sales=paste0('[',
toString(sprintf('[%d,%d]', x$year, x$sales)),
']')))))
data
df1 <- structure(list(year = c(2000L, 2001L, 2002L, 2003L, 2001L, 2002L
), group = c(1L, 1L, 1L, 1L, 2L, 2L), sales = c(20L, 25L, 23L,
30L, 50L, 55L)), .Names = c("year", "group", "sales"),
class = "data.frame", row.names = c(NA, -6L))
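Note that both snippets above write 'sales' as a single JSON string (the value is quoted in the printed output), whereas the question asks for nested arrays. Here is a minimal sketch of one way to get real arrays. It relies on jsonlite writing a matrix as an array of row arrays, and the helper name to_entry is made up for illustration:

```r
library(jsonlite)

# Same data as df1 above
df1 <- data.frame(
  year  = c(2000L, 2001L, 2002L, 2003L, 2001L, 2002L),
  group = c(1L, 1L, 1L, 1L, 2L, 2L),
  sales = c(20L, 25L, 23L, 30L, 50L, 55L)
)

# Hypothetical helper: one list per group, with 'sales' as a plain
# two-column matrix (year, sales) stripped of its dimnames
to_entry <- function(x) {
  list(
    group = unbox(x$group[1L]),                       # scalar, not a length-1 array
    sales = unname(as.matrix(x[, c("year", "sales")]))
  )
}

# unname() drops the split names so the result is a JSON array, not an object
entries <- unname(lapply(split(df1, df1$group), to_entry))
json <- toJSON(entries)
json
```

The result can be written to disk with writeLines(json, "sales.json"); the file name here is only an example.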
Since the other answer uses data.table, I thought it would be an interesting exercise to try this in dplyr. The first version is not the optimal way, but it illustrates do(), which I'm not convinced is documented well enough. I have also shown the more appropriate summarise solution.
df <-read.table(textConnection('
year group sales expenses
2000 1 20 19
2001 1 25 19
2002 1 23 20
2003 1 30 15
2001 2 50 27
2002 2 55 30
'),header=TRUE)
library(dplyr)
library(jsonlite)
df %>%
group_by( group ) %>%
do(
sales = group_by(.,year) %>% select(sales) %>% apply(MARGIN=2,identity),
expenses = group_by(.,year) %>% select(expenses) %>% apply(MARGIN=2,identity)
)
df %>%
group_by( group ) %>%
summarise(
sales = list(apply( data.frame(year,sales), MARGIN=2, identity ))
,expenses = list(apply( data.frame(year,expenses), MARGIN=2, identity ))
) %>% jsonlite::toJSON()
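The summarise route can probably be shortened a little. The sketch below is a variation rather than the code above: it assumes cbind() plus unname() is an acceptable substitute for the apply(..., identity) step, since both produce a two-column matrix per group.

```r
library(dplyr)
library(jsonlite)

# Same data as df above
df <- data.frame(
  year     = c(2000L, 2001L, 2002L, 2003L, 2001L, 2002L),
  group    = c(1L, 1L, 1L, 1L, 2L, 2L),
  sales    = c(20L, 25L, 23L, 30L, 50L, 55L),
  expenses = c(19L, 19L, 20L, 15L, 27L, 30L)
)

# One row per group; each list-column cell holds a (year, value) matrix
nested <- df %>%
  group_by(group) %>%
  summarise(
    sales    = list(unname(cbind(year, sales))),
    expenses = list(unname(cbind(year, expenses)))
  )

json <- toJSON(nested)
json
```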
I have a dataframe which contains count for each continent year wise. Below is the dataframe.
# A tibble: 4 x 4
continent year_2020 year_2021 year_2022
<chr> <dbl> <dbl> <dbl>
1 Asia 35 177 350
2 Europe 45 47 84
3 Australia 26 46 58
4 Africa 15 20 25
And this is the R script I used to create the graph
stack %>%
e_charts(continent) %>%
e_bar(year_2020) %>%
e_bar(year_2021) %>%
e_bar(year_2022)
[Figure: bar graph]
My question is how to pass these column names dynamically. The above data frame is a sample dataset, and the number of year columns keeps increasing. My idea is to show a maximum of 3 bars per continent.
What I tried was to have a start year and an end year, so that the bar graph is built from the input and the column names are not hard-coded in the e_bar function.
start_year <- "2020"
end_year <- "2022"
year_val <- paste0("year_",start_year:end_year)
year_val1 <- year_val[1]
year_val2 <- year_val[2]
year_val3 <- year_val[3]
stack %>%
e_charts(continent) %>%
e_bar(sym(year_val1)) %>%
e_bar(sym(year_val2)) %>%
e_bar(sym(year_val3))
But was getting the below error
Error in chr_as_locations():
! Can't subset columns that don't exist.
x Column sym(year_val1) doesn't exist.
I need help with how to pass the year columns dynamically.
Thanks
One option would be to switch to the "underscored" version of e_bar, i.e. e_bar_, which accepts the name of the series as a character string:
library(echarts4r)
stack |>
e_charts(continent) |>
e_bar_(year_val1) |>
e_bar_(year_val2) |>
e_bar_(year_val3)
DATA
stack <- structure(list(continent = c("Asia", "Europe", "Australia", "Africa"), year_2020 = c(35L, 45L, 26L, 15L), year_2021 = c(
177L, 47L,
46L, 20L
), year_2022 = c(350L, 84L, 58L, 25L)), class = "data.frame", row.names = c(
"1",
"2", "3", "4"
))
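Since the number of year columns keeps growing, the three e_bar_ calls can also be generated from a vector of names. The following is only a sketch: it assumes the columns always follow the year_YYYY naming pattern, and it uses Reduce to add one series per name. The echarts4r part is guarded so that the column selection still runs without the package.

```r
# Sample data from the question
stack <- data.frame(
  continent = c("Asia", "Europe", "Australia", "Africa"),
  year_2020 = c(35L, 45L, 26L, 15L),
  year_2021 = c(177L, 47L, 46L, 20L),
  year_2022 = c(350L, 84L, 58L, 25L)
)

# Pick the year columns by name pattern and keep at most the last three
year_cols <- tail(sort(grep("^year_[0-9]{4}$", names(stack), value = TRUE)), 3)

# Add one bar series per selected column, starting from the base chart
if (requireNamespace("echarts4r", quietly = TRUE)) {
  chart <- Reduce(
    function(e, col) echarts4r::e_bar_(e, col),
    year_cols,
    echarts4r::e_charts(stack, continent)
  )
}
```

To drive the selection from a start and end year instead, year_cols could be built with paste0("year_", start_year:end_year) as in the question.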
My current df looks like the following:
WEEK COUNT COUNT2 PERCENTAGE
2017-53 10 15 .05
2018-00 5 10 .1
2018-01 7 9 .1
....
2018-52 10 12 .06
2019-00 6 10 .05
....
What I would like to do is merge the first week of each year (week 00) into the final week of the previous year, summing COUNT, COUNT2, and PERCENTAGE. The pairs I would like to combine are 2017-53 and 2018-00, 2018-52 and 2019-00, and 2019-52 and 2020-00, which should become 2017-53, 2018-52, and 2019-52. My expected output would be the following:
WEEK COUNT COUNT2 PERCENTAGE
2017-53 15 25 .15
2018-01 7 9 .1
....
2018-52 16 22 .11
....
With tidyverse, after converting 'WEEK' to Date class, we arrange by that column, extract the 'year', create a grouping on 'WEEK' based on the difference between adjacent elements of 'year', and then summarise to get the sum of the columns that match 'COUNT' or 'PERCENTAGE'.
library(stringr)
library(lubridate)
library(dplyr) #1.0.0
df1 %>%
mutate(Date = as.Date(str_c(WEEK, "-01"), format = '%Y-%U-%w')) %>%
arrange(Date) %>%
mutate(year = year(Date)) %>%
group_by(WEEK = case_when(lag(year, default = first(year)) - year < 0 ~
lag(WEEK), TRUE ~ WEEK)) %>%
summarise(across(matches("COUNT|PERCENTAGE"), sum))
# A tibble: 3 x 4
# WEEK COUNT COUNT2 PERCENTAGE
# <chr> <int> <int> <dbl>
#1 2017-53 15 25 0.15
#2 2018-01 7 9 0.1
#3 2018-52 16 22 0.11
data
df1 <- structure(list(WEEK = c("2017-53", "2018-00", "2018-01", "2018-52",
"2019-00"), COUNT = c(10L, 5L, 7L, 10L, 6L), COUNT2 = c(15L,
10L, 9L, 12L, 10L), PERCENTAGE = c(0.05, 0.1, 0.1, 0.06, 0.05
)), class = "data.frame", row.names = c(NA, -5L))
You could use colSums() as is shown here, but it's a bit convoluted. I'd recommend using aggregate and pipes, as is shown further down in the same link.
Hope this helps!
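For reference, a base R sketch of the aggregate route could look like the following. It assumes that week 00 of a year should be folded into the latest week label present for the previous year, and that the WEEK strings always have the form YYYY-WW; the variable names are made up for illustration.

```r
df1 <- data.frame(
  WEEK       = c("2017-53", "2018-00", "2018-01", "2018-52", "2019-00"),
  COUNT      = c(10L, 5L, 7L, 10L, 6L),
  COUNT2     = c(15L, 10L, 9L, 12L, 10L),
  PERCENTAGE = c(0.05, 0.1, 0.1, 0.06, 0.05)
)

yr <- as.integer(substr(df1$WEEK, 1, 4))  # year part of the label
wk <- substr(df1$WEEK, 6, 7)              # week part of the label

# Grouping key: week 00 takes the label of the previous year's last week
key <- df1$WEEK
for (i in which(wk == "00")) {
  prev <- df1$WEEK[yr == yr[i] - 1L]
  if (length(prev) > 0) key[i] <- max(prev)
}

# Sum the numeric columns within each key
merged <- aggregate(
  df1[c("COUNT", "COUNT2", "PERCENTAGE")],
  by = list(WEEK = key),
  FUN = sum
)
merged
```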
I have a data frame in R which looks like below
Model Month Demand Inventory
A Jan 10 20
B Feb 30 40
A Feb 40 60
I want the data frame to look like this:
             Jan  Feb
A_Demand      10   40
A_Inventory   20   60
A_coverage
B_Demand           30
B_Inventory        40
B_coverage
A_coverage and B_coverage will be calculated in Excel using a formula. The problem I need help with is pivoting the data frame from its original long format into the wide layout shown above.
I tried to implement the solution from the linked duplicate but I am still having difficulty:
HD_dcast <- reshape(data,idvar = c("Model","Inventory","Demand"),
timevar = "Month", direction = "wide")
Here is a dput of my data:
data <- structure(list(Model = c("A", "B", "A"), Month = c("Jan", "Feb",
"Feb"), Demand = c(10L, 30L, 40L), Inventory = c(20L, 40L, 60L
)), class = "data.frame", row.names = c(NA, -3L))
Thanks
Here's an approach with dplyr and tidyr, two popular R packages for data manipulation:
library(dplyr)
library(tidyr)
data %>%
mutate(coverage = NA_real_) %>%
pivot_longer(-c(Model,Month), names_to = "Variable") %>%
pivot_wider(id_cols = c(Model, Variable), names_from = Month ) %>%
unite(Variable, c(Model,Variable), sep = "_")
## A tibble: 6 x 3
# Variable Jan Feb
# <chr> <dbl> <dbl>
#1 A_Demand 10 40
#2 A_Inventory 20 60
#3 A_coverage NA NA
#4 B_Demand NA 30
#5 B_Inventory NA 40
#6 B_coverage NA NA
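As a side note on the reshape attempt in the question: listing Demand and Inventory in idvar appears to leave nothing to spread across the months. A minimal base R sketch, assuming Model alone identifies a row in the wide result, would be:

```r
data <- data.frame(
  Model     = c("A", "B", "A"),
  Month     = c("Jan", "Feb", "Feb"),
  Demand    = c(10L, 30L, 40L),
  Inventory = c(20L, 40L, 60L)
)

# Only Model is an id; Demand and Inventory get one column per Month
wide <- reshape(data, idvar = "Model", timevar = "Month", direction = "wide")
wide
```

This gives one row per model with columns such as Demand.Jan and Inventory.Feb, which is a different layout from the one requested (one row per model and variable), so the tidyr pipeline above is still the closer match.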
In an Excel file, there are two columns labelled "id" and "date" as in the following data frame:
df <-
structure(
list(
id = c(1L, 2L, 3L, 4L,5L),
date = c("10/2/2013", "-5/3/2015", "-11/-4/2019", "3/10/2019","")
),
.Names = c("id", "date"),
class = "data.frame",
row.names = c(NA,-5L)
)
The "date" column has both dates, e.g. 10/2/2013, and non-date entries, e.g. -5/3/2015 and -11/-4/2019, as well as blank cells. I am looking for a way to read the Excel file into R such that the dates and the non-dates are preserved and the blank cells are replaced by NAs.
I have tried to use the function "read_excel" and argument "col_types" as follows:
df1<- data.frame(read_excel("df.xlsx", col_types = c("numeric", "date")))
However, this reads the dates and replaces the non-dates with NAs. I have tried other options of col_types e.g. "guess" and "skip" but these did not work for me. Any help on this is much appreciated.
Here's an approach using tidyr::separate and dplyr to filter out negative months so that only positive months are converted to "yearmon" data with zoo:
library(tidyverse)
df %>%
separate(date, c("day", "month", "year"),
sep = "/", remove = FALSE, convert = TRUE) %>%
mutate(month = if_else(month < 0, NA_integer_, month)) %>%
mutate(date2 = zoo::as.yearmon(paste(year, month, sep = "-")))
# id date day month year date2
#1 1 10/2/2013 10 2 2013 Feb 2013
#2 2 -5/3/2015 -5 3 2015 Mar 2015
#3 3 -11/-4/2019 -11 NA 2019 <NA>
#4 4 3/10/2019 3 10 2019 Oct 2019
#5 5 NA NA NA <NA>
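If the goal is mainly to keep the original strings, turn blanks into NA, and know which entries are real dates, a base R sketch on the example data frame could be the following. It assumes the column has already been read as character and that the dates are day/month/year, as in the separate() call above; the column names parsed and is_date are made up for illustration.

```r
df <- data.frame(
  id   = 1:5,
  date = c("10/2/2013", "-5/3/2015", "-11/-4/2019", "3/10/2019", ""),
  stringsAsFactors = FALSE
)

# Blank cells become NA; every other string is kept as-is
df$date[df$date == ""] <- NA

# Entries that do not parse as day/month/year come back as NA
df$parsed  <- as.Date(df$date, format = "%d/%m/%Y")
df$is_date <- !is.na(df$parsed)
df
```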
The following data is a very small part from a series of tests before and after a treatment. Right now my data is like this:
Subject Var1 Var2 Var3 Var4
1 A-pre 25 27 23 0
2 A-post 25 26 25 120
3 B-pre 30 28 27 132
4 B-post 30 28 26 140
and I need to reshape it like this:
Subject Var1.pre Var1.post Var2.pre Var2.post Var3.pre Var3.post Var4.pre Var4.post
1 A 25 25 27 26 23 25 0 120
2 B 30 30 28 28 27 26 132 140
I have read many questions on SO and the documentation of data wrangling packages in R such as reshape2, but I could not find anything similar. Any ideas?
Here is the code for replicating the first table:
dat<-structure(list(Subject = structure(c(2L, 1L, 4L, 3L), .Label = c("A-post",
"A-pre", "B-post", "B-pre"), class = "factor"), Var1 = c(25L,
25L, 30L, 30L), Var2 = c(27L, 26L, 28L, 28L), Var3 = c(23L, 25L,
27L, 26L), Var4 = c(0L, 120L, 132L, 140L)), .Names = c("Subject",
"Var1", "Var2", "Var3", "Var4"), row.names = c(NA, -4L), class = "data.frame")
You can use dcast from the devel version of data.table, i.e. v1.9.5, after splitting the 'Subject' column into two with tstrsplit, using '-' as the split. We use dcast to reshape from 'long' to 'wide' format. The dcast function from data.table can take multiple value.var columns, i.e. 'Var1' to 'Var4'.
library(data.table)#v1.9.5+
#convert the data.frame to data.table with `setDT(dat)`
#split the 'Subject' column with tstrsplit and create two columns
setDT(dat)[, c('Subject', 'New') :=tstrsplit(Subject, '-')]
#change the New column class to 'factor' and specify the levels in order
#so that while using dcast we get the 'pre' column before 'post'
dat[, New:= factor(New, levels=c('pre', 'post'))]
#reshape the dataset
dcast(dat, Subject~New, value.var=grep('^Var', names(dat), value=TRUE),sep=".")
# Subject Var1.pre Var1.post Var2.pre Var2.post Var3.pre Var3.post Var4.pre
#1: A 25 25 27 26 23 25 0
#2: B 30 30 28 28 27 26 132
# Var4.post
#1: 120
#2: 140
NOTE: Instructions to install the devel version are here
An option using dplyr/tidyr would be to split the 'Subject' column into two by separate, convert the 'wide' format to 'long' format using gather, unite the 'Var' column (i.e. Var1 to Var4) and 'New' ('VarNew') and spread the 'long' format to 'wide'.
library(dplyr)
library(tidyr)
dat %>%
separate(Subject, into=c('Subject', 'New')) %>% #split to two columns
gather(Var, Val, Var1:Var4)%>% #change from wide to long. Similar to melt
unite(VarNew, Var, New, sep=".") %>% #unite two columns to form a single
spread(VarNew, Val)#change from 'long' to 'wide'
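With more recent versions of tidyr, the gather/unite/spread steps can be collapsed into a single pivot_wider call. This is a sketch rather than a drop-in replacement: it assumes a tidyr version that provides pivot_wider, and it builds 'Subject' as a character column instead of the factor in the question.

```r
library(tidyr)

dat <- data.frame(
  Subject = c("A-pre", "A-post", "B-pre", "B-post"),
  Var1 = c(25L, 25L, 30L, 30L),
  Var2 = c(27L, 26L, 28L, 28L),
  Var3 = c(23L, 25L, 27L, 26L),
  Var4 = c(0L, 120L, 132L, 140L),
  stringsAsFactors = FALSE
)

wide <- dat |>
  # split "A-pre" into Subject = "A" and New = "pre"
  separate(Subject, into = c("Subject", "New"), sep = "-") |>
  # one column per Var/New combination, named like Var1.pre
  pivot_wider(names_from = New, values_from = Var1:Var4, names_sep = ".")
wide
```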