R -> creating data set with repetitive values - r

I got the to solve the following problem:
create a dataset holding the Turnover (runif 500;1000) integer values for your 4 Sales representatives for the last 4 years each salesperson selling 4 different products (Mars, Snickers, Bounty, Milkeyway); additioanlly add a column with the integer CostofSales (runif 50;150) finally calculate the Earnings in an own column. Combine all values into a dataframe
so I did:
Years <- rep(c(2021:2018),16)
Years
Sales <- rep(c("Chris","Lucas","Cara","Bia"),16)
View(Sales)
Product <- rep(c("Mars","Snickers","Bounty","Milkway"),16)
Product
Turnover <- c(runif(64,500,1000))
Turnover
df <- data.frame(Years,Sales,Product,Turnover)
View(df)
But the 'dataframe' is messed up:
Can anyone help me? THANK YOU

Perhaps this is what you are trying for
Years <- rep(c(2021:2018), each=16)
Sales <- rep(rep(c("Chris","Lucas","Cara","Bia"), each=4), 4)
Product <- rep(c("Mars", "Snickers", "Bounty", "Milkway"), 16)
Turnover <- runif(64, 500, 1000)
df <- data.frame(Years,Sales,Product,Turnover)
df[c(4, 8, 12, 16, 20, 24, 28, 32, 36), ]
# Years Sales Product Turnover
# 4 2021 Chris Milkway 964.8695
# 8 2021 Lucas Milkway 799.1933
# 12 2021 Cara Milkway 613.6976
# 16 2021 Bia Milkway 970.3118
# 20 2020 Chris Milkway 598.2047
# 24 2020 Lucas Milkway 951.0657
# 28 2020 Cara Milkway 537.1925
# 32 2020 Bia Milkway 720.0880
# 36 2019 Chris Milkway 759.2236

Related

creating matrix from three column data frame in R [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 11 months ago.
I have a data frame with three columns where each row is unique:
df1
# state val_1 season
# 1 NY 3 winter
# 2 NY 10 spring
# 3 NY 24 summer
# 4 BOS 14 winter
# 5 BOS 26 spring
# 6 BOS 19 summer
# 7 WASH 99 winter
# 8 WASH 66 spring
# 9 WASH 42 summer
I want to create a matrix with the state names for rows and the seasons for columns with val_1 as the values. I have previously used:
library(reshape2)
df <- acast(df1, state ~ season, value.var='val_1')
And it has created the desired matrix with each state name appearing once but for some reason when I have been using acast or dcast recently it automatically defaults to the length function and gives 1's for the values. Can anyone recommend a solution?
data
state <- c('NY', 'NY', 'NY', 'BOS', 'BOS', 'BOS', 'WASH', 'WASH', 'WASH')
val_1 <- c(3, 10, 24, 14, 26, 19, 99, 66, 42)
season <- c('winter', 'spring', 'summer', 'winter', 'spring', 'summer',
'winter', 'spring', 'summer')
df1 <- data.frame(state, val_1, season)
You may define the fun.aggregate=.
library(reshape2)
acast(df1, state~season, value.var = 'val_1', fun.aggregate=sum)
# spring summer winter
# BOS 26 19 14
# NY 10 24 3
# WASH 66 42 99
This also works
library(reshape2)
state = c('NY', 'NY', 'NY', 'BOS', 'BOS', 'BOS', 'WASH', 'WASH', 'WASH')
val_1 = c(3, 10, 24, 14, 26, 19, 99, 66, 42)
season = c('winter', 'spring', 'summer', 'winter', 'spring', 'summer', 'winter', 'spring', 'summer')
df1 = data.frame(state,
val_1,
season)
dcast(df1, state~season, value.var = 'val_1')
#> state spring summer winter
#> 1 BOS 26 19 14
#> 2 NY 10 24 3
#> 3 WASH 66 42 99
Created on 2022-04-08 by the reprex package (v2.0.1)

how to sum conditional functions to grouped rows in R

I so have the following data frame
customerid
payment_month
payment_date
bill_month
charges
1
January
22
January
30
1
February
15
February
21
1
March
2
March
33
1
May
4
April
43
1
May
4
May
23
1
June
13
June
32
2
January
12
January
45
2
February
15
February
56
2
March
2
March
67
2
April
4
April
65
2
May
4
May
54
2
June
13
June
68
3
January
25
January
45
3
February
26
February
56
3
March
30
March
67
3
April
1
April
65
3
June
1
May
54
3
June
1
June
68
(the id data is much larger) I want to calculate payment efficiency using the following function,
efficiency = (amount paid not late / total bill amount)*100
not late is paying no later than the 21st day of the bill's month. (paying January's bill on the 22nd of January is considered as late)
I want to calculate the efficiency of each customer with the expected output of
customerid
effectivity
1
59.90
2
100
3
37.46
I have tried using the following code to calculate for one id and it works. but I want to apply and assign it to the entire group id and summarize it into 1 column (effectivity) and 1 row per ID. I have tried using group by, aggregate and ifelse functions but nothing works. What should I do?
df1 <- filter(df, (payment_month!=bill_month & id==1) | (payment_month==bill_month & payment_date > 21 & id==1) )
df2 <-filter(df, id==1001)
x <- sum(df1$charges)
x <- sum(df2$charges)
100-(x/y)*100
An option using dplyr
library(dplyr)
df %>%
group_by(customerid) %>%
summarise(
effectivity = sum(
charges[payment_date <= 21 & payment_month == bill_month]) / sum(charges) * 100,
.groups = "drop")
## A tibble: 3 x 2
#customerid effectivity
# <int> <dbl>
#1 1 59.9
#2 2 100
#3 3 37.5
df %>%
group_by(customerid) %>%
mutate(totalperid = sum(charges)) %>%
mutate(pay_month_number = match(payment_month , month.name),
bill_month_number = match(bill_month , month.name)) %>%
mutate(nolate = ifelse(pay_month_number > bill_month_number, TRUE, FALSE)) %>%
summarise(efficiency = case_when(nolate = TRUE ~ (charges/totalperid)*100))

How could I split a data.frame?

I have 50 synoptic stations precipitation data from 1986 to 2015.
I need to sort the related information for the period of years from 2007 to 2015 for each station separately. I mean there are three variables:
the station's name
the specific year
the amount of precipitation
I need the result for each station separately.
Does anyone know how to use "split" for this purpose?
May you please write codes from the beginning "read.table"?
If your task is simply to split the dataframe by year you can use split:
split(df, f = df$year)
Illustrative data:
(set.seed(123)
df <- data.frame(
station = sample(LETTERS[1:3],10, replace = T),
year = paste0("201", sample(1:9, 10, replace = T)),
precipitation = sample(333:444, 10, replace = T)
)
Result:
$`2011`
station year precipitation
5 C 2011 406
8 C 2011 399
$`2013`
station year precipitation
7 B 2013 393
9 B 2013 365
$`2015`
station year precipitation
2 C 2015 410
$`2016`
station year precipitation
4 C 2016 444
$`2017`
station year precipitation
3 B 2017 404
$`2019`
station year precipitation
1 A 2019 432
6 A 2019 412
10 B 2019 349

How to find out how many trading days in each month in R?

I have a dataframe like this. The time span is 10 years. Because it's Chinese market data, and China has Lunar Holidays. So each year have different holiday times in terms of the western calendar.
When it is a holiday, the stock market does not open, so it is a non-trading day. Weekends are non-trading days too.
I want to find out which month of which year has the least number of trading days, and most importantly, what number is that.
There are not repeated days.
date change open high low close volume
1 1995-01-03 -1.233 637.72 647.71 630.53 639.88 234518
2 1995-01-04 2.177 641.90 655.51 638.86 653.81 422220
3 1995-01-05 -1.058 656.20 657.45 645.81 646.89 430123
4 1995-01-06 -0.948 642.75 643.89 636.33 640.76 487482
5 1995-01-09 -2.308 637.52 637.55 625.04 625.97 509851
6 1995-01-10 -2.503 616.16 617.60 607.06 610.30 606925
If there are not repeated days, you can count days per month and year by:
library(data.table) "maxx"))), .Names = c("X2005", "X2006", "X2007", "X2008"))
library(lubridate)
dt <- as.data.table(dt)
dt_days <- dt[, .(count_day=.N), by=.(year(date), month(date))]
Then you only need to do this to get the min:
dt_days[count_day==min(count_day)]
The chron and bizdays packages deal with business days but neither actually contains a usable calendar of holidays limiting their usefulness.
We will use chron below assuming you have defined the .Holidays vector of dates that are holidays. (If you run the code below without doing that only weekdays will be regarded as business days as the default .Holidays vector supplied by chron has very few dates in it.) DF has 120 rows (one row for each year/month) and the last line subsets that to just the month in each year having least business days.
library(chron)
library(zoo)
st <- as.yearmon("2001-01")
en <- as.yearmon("2010-12")
ym <- seq(st, en, 1/12) # sequence of year/months of interest
# no of business days in each yearmonth
busdays <- sapply(ym, function(x) {
s <- seq(as.Date(x), as.Date(x, frac = 1), "day")
sum(!is.weekend(s) & !is.holiday(s))
})
# data frame with one row per year/month
yr <- as.integer(ym)
DF <- data.frame(year = yr, month = cycle(ym), yearmon = ym, busdays)
# data frame with one row per year
wx.min <- ave(busdays, yr, FUN = function(x) which.min(x) == seq_along(x))
DF[wx.min == 1, ]
giving:
year month yearmon busdays
2 2001 2 Feb 2001 20
14 2002 2 Feb 2002 20
26 2003 2 Feb 2003 20
38 2004 2 Feb 2004 20
50 2005 2 Feb 2005 20
62 2006 2 Feb 2006 20
74 2007 2 Feb 2007 20
95 2008 11 Nov 2008 20
98 2009 2 Feb 2009 20
110 2010 2 Feb 2010 20

split-apply-combine R

I have a data table with several columns.
Lets say
Location which may include Los Angles, etc.
age_Group, lets say (young, child, teenager), etc.
year = (2000, 2001, ..., 2015)
month = c(jan, ..., dec)
I would like to group_by them and see how many people has spent money
in some intervals, lets say I have intervals of interval_1 = (1, 100), (100, 1000), ..., interval_20=(1000, infinity)
How shall I proceed? What should I do after the following?
data %>% group_by(location, age_Group, year, month)
sample:
location age_gp year month spending
LA child 2000 1 102
LA teen 2000 1 15
LA teen 2000 10 9
NY old 2000 11 1000
NY old 2010 2 1000000
NY teen 2020 3 10
desired output
LA, child, 2000, jan interval_1
LA, child, 2000, feb interval_20
...
NY OLD 2015 Dec interval_1
the last column has to be determined by adding the spending of all people belonging to the same city, age_croup, year, month.
You can first create a new column (spending_cat) using, for example, the cut function. After you can add the new variable as a grouping variable and then you just need to count:
df <- data.frame(group = sample(letters[1:4], size = 1000, replace = T),
spending = rnorm(1000))
df %>%
mutate(spending_cat = cut(spending, breaks = c(-5:5))) %>%
group_by(group, spending_cat) %>%
summarise(n_people = n())
# A tibble: 26 x 3
# Groups: group [?]
group spending_cat n_people
<fct> <fct> <int>
1 a (-3,-2] 6
2 a (-2,-1] 36
3 a (-1,0] 83
4 a (0,1] 78
5 a (1,2] 23
6 a (2,3] 10
7 b (-4,-3] 1
8 b (-3,-2] 4
9 b (-2,-1] 40
10 b (-1,0] 78
# … with 16 more rows

Resources