Sum specific rows in a data frame - r

I created a summary for my data and worked out percentages of occurrences per category.
Now, I want to sum a subset of categories to show their combined value. For example, I want to be able to say that 51.1% of all occurrences are within the categories 30, 60, and 120 days (sum of rows #6, #9, and #3). The data frame is named "Summary_2".
  Category Count Percent
1    1 day     4    3.3%
8   5 days     5    4.1%
4 180 days     8    6.5%
5 240 days     9    7.3%
2  10 days    15   12.2%
3 120 days    18   14.6%
6  30 days    19   15.4%
7 360 days    19   15.4%
9  60 days    26   21.1%
This is a summary of tickets. I want to group the categories arbitrarily, for example to say that 50% of our tickets are resolved within 2 months, 30% are resolved in 180 to 360 days, and 20% are resolved within 10 days.
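One way to get such combined shares in base R is to sum the counts of the chosen categories and divide by the overall total. This is only a sketch: the data frame is rebuilt by hand from the table in the question, and `share_of` is a hypothetical helper name, not something from the original post.

```r
# Rebuild the summary from the question (counts copied from the table).
Summary_2 <- data.frame(
  Category = c("1 day", "5 days", "180 days", "240 days", "10 days",
               "120 days", "30 days", "360 days", "60 days"),
  Count = c(4, 5, 8, 9, 15, 18, 19, 19, 26)
)

# Hypothetical helper: percentage of all occurrences in the given categories.
share_of <- function(df, categories) {
  100 * sum(df$Count[df$Category %in% categories]) / sum(df$Count)
}

share_30_60_120 <- share_of(Summary_2, c("30 days", "60 days", "120 days"))
round(share_30_60_120, 1)
```

Working from the counts gives 63 / 123, which rounds to 51.2% rather than 51.1%; the small difference comes from adding percentages that were already rounded.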

Related

Normalize aggregation results according to number of days per month

I have this table:
Month  nbr_of_days  aggregated_sum
1      25           120
2      28           70
3      30           130
4      31           125
My goal here is to normalize the aggregated sum to an assumed value of 30 (nbr_of_days) per month.
So for the first row, for example, the normalized aggregated_sum would be: 30*120/25=144
How to do this in R?
library(dplyr)
df <- df %>% mutate(normalized_aggregated_sum = 30 * aggregated_sum / nbr_of_days)
Note: while asking the question, I realized how it can be answered.
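For a quick check without dplyr, the same normalization can be sketched in base R. The data frame below is typed in from the table in the question; nothing else is assumed.

```r
# Table from the question, entered by hand.
df <- data.frame(
  Month = 1:4,
  nbr_of_days = c(25, 28, 30, 31),
  aggregated_sum = c(120, 70, 130, 125)
)

# Scale each month's sum to an assumed 30-day month.
df$normalized_aggregated_sum <- 30 * df$aggregated_sum / df$nbr_of_days
df
```

The first row comes out as 144, matching the hand calculation in the question.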

Calculate Cumulative sum for previous 6 months

RECORD  ATTRIBUTE  DATE       MONTH  AMT  CML AMT
1       A          1/1/2021   1      10   10
2       A          2/1/2021   2      10   20
3       A          3/1/2021   3      10   30
4       A          4/1/2021   4      10   40
5       A          5/1/2021   5      10   50
6       A          6/1/2021   6      10   60
7       B          1/1/2021   1      20   20
8       B          3/1/2021   3      20   40
9       B          5/1/2021   5      20   60
10      B          7/1/2021   7      20   80
11      B          9/1/2021   9      20   80
12      B          11/1/2021  11     20   80
13      C          1/1/2021   1      30   30
14      C          8/1/2021   8      30   30
15      C          9/1/2021   9      30   60
I am looking to calculate the cumulative sum (the CML AMT column) of the AMT column over the past 6 months.
The CML AMT column should only look at a window of 6 months.
If there is no other record for the same attribute within a 6-month time frame, it should simply return the AMT value.
I tried the query below, which clearly won't work because the dates/months are not consistent.
Any help will be appreciated.
SUM(AMT)
OVER (PARTITION BY ATTRIBUTE
ORDER BY DATE
ROWS BETWEEN 4 PRECEDING AND CURRENT ROW)
Unfortunately Teradata doesn't support RANGE, but if you only need to sum over a small number of values (six months = up to six rows) you can apply a brute-force approach:
AMT
+
CASE WHEN LAG(DATE,1) OVER (PARTITION BY ATTRIBUTE ORDER BY DATE) >= ADD_MONTHS(DATE,-6)
THEN LAG(AMT,1) OVER (PARTITION BY ATTRIBUTE ORDER BY DATE)
ELSE 0
END
+
CASE WHEN LAG(DATE,2) OVER (PARTITION BY ATTRIBUTE ORDER BY DATE) >= ADD_MONTHS(DATE,-6)
THEN LAG(AMT,2) OVER (PARTITION BY ATTRIBUTE ORDER BY DATE)
ELSE 0
END
+
...
Looks ugly, but it's mostly cut&paste&modify and still a single step in Explain. Other possible solutions would be based on an additional EXPAND ON or time-series aggregation step.
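To sanity-check the expected CML AMT values outside the database, the same window logic (same attribute, date no older than six months before the current row) can be sketched in R. This is not Teradata SQL and not part of the original answer; the data frame name `tickets` and the helper `six_months_before` are made up for illustration, and the date arithmetic is only meant for first-of-month dates like those in the sample.

```r
# Sample data from the question (dates are all the first of a month).
tickets <- data.frame(
  ATTRIBUTE = c(rep("A", 6), rep("B", 6), rep("C", 3)),
  DATE = as.Date(c(
    "2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01", "2021-05-01", "2021-06-01",
    "2021-01-01", "2021-03-01", "2021-05-01", "2021-07-01", "2021-09-01", "2021-11-01",
    "2021-01-01", "2021-08-01", "2021-09-01"
  )),
  AMT = c(rep(10, 6), rep(20, 6), rep(30, 3))
)

# Hypothetical helper: the date six months before d.
six_months_before <- function(d) seq(d, by = "-6 months", length.out = 2)[2]

# For each row, sum AMT over rows of the same attribute whose date lies
# between six months before the row's date and the row's date (inclusive).
tickets$CML_AMT <- vapply(seq_len(nrow(tickets)), function(i) {
  lower <- six_months_before(tickets$DATE[i])
  in_window <- tickets$ATTRIBUTE == tickets$ATTRIBUTE[i] &
    tickets$DATE >= lower &
    tickets$DATE <= tickets$DATE[i]
  sum(tickets$AMT[in_window])
}, numeric(1))

tickets
```

The inclusive lower bound mirrors the `>= ADD_MONTHS(DATE,-6)` comparison in the SQL above, which is why record 10 (B, 7/1/2021) still counts the January row.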

Importing .csv file with tidydata

I am having difficulty importing my data in the way I would like to from a .csv file to tidydata.
My data set is made up of descriptive data (age, country, etc.) and then 15 condition columns that I would like to have in just one column (long format). I have previously tried 'melting' the data in a few ways, but it does not turn out the way I intended. Below are a few things I have tried; I know it is kind of messy. There are quite a few NAs in the data, which seem to be causing an issue. I am trying to create a specific column, "Vignette", which will serve as the collective column for the 15 vignette columns I would like in long format.
head(dat)
ID Frequency Gender Country Continent Age
1 5129615189 At least weekly female France Europe 30-50 years
2 5128877943 At least daily female Spain Europe > 50 years
3 5126775994 At least weekly female Spain Europe 30-50 years
4 5126598863 At least daily male Albania Europe 30-50 years
5 5124909744 At least daily female Ireland Europe > 50 years
6 5122047758 At least weekly female Denmark Europe 30-50 years
Practice Specialty Seniority AMS
1 University public hospital centre Infectious diseases 6-10 years Yes
2 Other public hospital Infectious diseases > 10 years Yes
3 University public hospital centre Intensive care > 10 years Yes
4 University public hospital centre Infectious diseases > 10 years No
5 Private hospial/clinic Clinical microbiology > 10 years Yes
6 University public hospital centre Infectious diseases 0-5 years Yes
Durations V01 V02 V03 V04 V05 V06 V07 V08 V09 V10 V11 V12 V13 V14 V15
1 range 7 2 7 7 7 5 7 14 7 42 42 90 7 NA 5
2 range 7 10 10 5 14 5 7 14 10 42 21 42 14 14 14
3 range 7 5 5 7 14 5 5 13 10 42 42 42 5 0 7
4 range 10 7 7 5 7 10 7 5 7 28 14 42 10 10 7
5 range 7 5 7 7 14 7 7 14 10 42 42 90 10 0 7
6 fixed duration 7 3 3 7 10 10 7 14 7 90 90 90 10 7 7
dat_long %>%
gather(Days, Age, -Vignette)
dat$new_sp = NULL
names(dat) <- gsub("new_sp", "", names(dat))
dat_tidy<-melt(
data=dat,
id=0:180,
variable.name="Vignette",
value.name="Days",
na.rm=TRUE
)
dat_tidy<- mutate(dat_tidy,
Days= sub("^V", "", Days)
)
It keeps saying "Error: id variables not found in data: NA"
I have tried to get rid of NAs but it doesn't seem to do anything.
I am guessing you are loading the melt function from reshape2. I would recommend that you try tidyr, which is basically the next generation of reshape2.
Your error is presumably caused by the argument id=0:180. This is basically asking it to keep columns 0-180 as "identifier" columns, and melt the rest (i.e. create a new row for each value in each column).
When you subset more column indices than there are columns in a data.frame, the non-existing columns are filled in with pure NA - you asked for them, so you get them!
I would recommend loading tidyr, as it is newer. There should be some new verbs in the package that are more intuitive, but I'll give you a solution with the older semantics:
library(tidyr)
dat_tidy <- dat %>% gather('Vignette', 'Days', starts_with('V'))
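# or, a bit more verbose, name the columns explicitly
# (only V01-V04 are shown here; list all 15 to gather every vignette column)
dat_tidy <- dat %>% gather('Vignette', 'Days', V01, V02, V03, V04)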
And check out the comment by @heck1 for asking even better questions.
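The "new verbs" mentioned above most likely refer to `pivot_longer()`, tidyr's newer counterpart to `gather()`. Below is a minimal sketch on a tiny stand-in data frame (`dat_small` is made up for illustration, with only three of the vignette columns); `values_drop_na = TRUE` drops the NA cells, similar in spirit to `na.rm = TRUE` in the melt attempt.

```r
library(tidyr)

# Tiny stand-in for `dat`: two descriptive columns and three vignette columns.
dat_small <- data.frame(
  ID = c(1, 2),
  Country = c("France", "Spain"),
  V01 = c(7, 7),
  V02 = c(2, 10),
  V14 = c(NA, 14)
)

# Stack every column whose name starts with an upper-case "V" into
# a Vignette/Days pair, dropping rows where Days is NA.
dat_long <- pivot_longer(
  dat_small,
  cols = starts_with("V", ignore.case = FALSE),
  names_to = "Vignette",
  values_to = "Days",
  values_drop_na = TRUE
)

dat_long
```

On the real data the same call should work with `dat` in place of `dat_small`, assuming the only columns starting with "V" are V01 to V15, as in the `head(dat)` output.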

last 6 months total in Teradata

I have to calculate the total quantity sold for the last 6 months. For example, in the case of January 2018, I have to calculate the total quantity sold from July to December 2017. This total should be grouped by primary key.
Thanks
Primary Key  Date   qty  last 6 months quantity sold
1            1-Oct  4    0
1            1-Nov  10   4
1            5-Dec  20   14
1            1-Jan  3    34
1            1-Sep  88   0
You can calculate the range using ADD_MONTHS plus TRUNC:
WHERE datecole BETWEEN Trunc(Add_Months(Current_Date, -6), 'mon')
AND Trunc(Current_Date, 'mon') -1
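To see which dates that range covers, the same bounds can be reproduced in R: the first day of the month six months back, and the last day of the previous month. This is only an illustration of the date arithmetic, not Teradata code; `previous_six_months` is a hypothetical helper name.

```r
# Hypothetical helper: bounds of the six full months before ref_date's month.
previous_six_months <- function(ref_date) {
  month_start <- as.Date(format(ref_date, "%Y-%m-01"))  # truncate to month
  list(
    from = seq(month_start, by = "-6 months", length.out = 2)[2],
    to = month_start - 1  # last day of the previous month
  )
}

rng <- previous_six_months(as.Date("2018-01-15"))
rng$from  # 2017-07-01
rng$to    # 2017-12-31
```

For any date in January 2018 this yields July 1 through December 31, 2017, which is the window described in the question.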

Performing the colsum based on row values [duplicate]

Hi, I have 3 data sets that contain items and counts. I need to combine all the data sets and sum the counts by item name. Here is my input.
Df1 <- data.frame(items =c("Cookies", "Candys","Toys","Games"), Counts = c( 10,20,30,5))
Df2 <- data.frame(items =c( "Candys","Cookies","Toys"), Counts = c( 5,21,20))
Df3 <- data.frame(items =c( "Playdows","Gummies","Candys"), Counts = c(10,15,20))
Df_all <- rbind(Df1,Df2,Df3)
Df_all
items Counts
1 Cookies 10
2 Candys 20
3 Toys 30
4 Games 5
5 Candys 5
6 Cookies 21
7 Toys 20
8 Playdows 10
9 Gummies 15
10 Candys 20
I need to combine the rows based on the item values, so that each item appears only once with its counts added together. My output should be
items Counts
1 Cookies 31
2 Candys 45
3 Toys 50
4 Games 5
5 Playdows 10
6 Gummies 15
Could you help me get this output in R?
Use dplyr:
library(dplyr)
result <- Df_all %>% group_by(items) %>% summarize(sum(Counts))
> result
# A tibble: 6 x 2
items `sum(Counts)`
<fct> <dbl>
1 Candys 45.0
2 Cookies 31.0
3 Games 5.00
4 Toys 50.0
5 Gummies 15.0
6 Playdows 10.0
You can use tapply:
tapply(Df_all$Counts, Df_all$items, FUN=sum)
which returns
Candys Cookies Games Toys Gummies Playdows
45 31 5 50 15 10
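If a data frame is preferred over the named vector that tapply returns, base R's `aggregate()` with a formula is another option. A minimal sketch using the input from the question (the result name `totals` is arbitrary):

```r
Df1 <- data.frame(items = c("Cookies", "Candys", "Toys", "Games"), Counts = c(10, 20, 30, 5))
Df2 <- data.frame(items = c("Candys", "Cookies", "Toys"), Counts = c(5, 21, 20))
Df3 <- data.frame(items = c("Playdows", "Gummies", "Candys"), Counts = c(10, 15, 20))
Df_all <- rbind(Df1, Df2, Df3)

# Sum Counts within each item; the result has one row per item.
totals <- aggregate(Counts ~ items, data = Df_all, FUN = sum)
totals
```

The rows come back ordered by item name rather than in order of first appearance, so the row order may differ from the desired output shown in the question even though the totals are the same.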
