Apply mutate using a value from another row - r

I have a table and I would like to add a column that calculates the ratio of each row's value to the previous row's value.
The calculation takes the value on line 1 divided by the value on line 2, and the result is written on line 2.
Example (my attempt, followed by the output I expect):
month <- c(10,11,12,13,14,15)
sell <-c(258356,278958,287928,312254,316287,318999)
df <- data.frame(month, sell)
df %>% mutate(augmentation = sell[month]/sell[month+1])
month sell resultat
1 10 258356 NA
2 11 278958 0.9261466
3 12 287928 0.9688464
4 13 312254 0.9220955
5 14 316287 0.9872489
6 15 318999 0.9914984

dplyr
You can just use lag like this:
library(dplyr)
df %>%
mutate(resultat = lag(sell)/sell)
Output:
month sell resultat
1 10 258356 NA
2 11 278958 0.9261466
3 12 287928 0.9688464
4 13 312254 0.9220955
5 14 316287 0.9872489
6 15 318999 0.9914984
data.table
Another option is using shift:
library(data.table)
setDT(df)[, resultat:= shift(sell)/sell][]
Output:
month sell resultat
1: 10 258356 NA
2: 11 278958 0.9261466
3: 12 287928 0.9688464
4: 13 312254 0.9220955
5: 14 316287 0.9872489
6: 15 318999 0.9914984
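For comparison, the same ratio can be computed without any package. This is a minimal base R sketch; the `previous` vector and the `pct_change` column are names introduced here for illustration and do not appear in the answers above.

```r
month <- c(10, 11, 12, 13, 14, 15)
sell <- c(258356, 278958, 287928, 312254, 316287, 318999)
df <- data.frame(month, sell)

# Shift sell down by one row; the first row has no predecessor, so it gets NA
previous <- c(NA, head(df$sell, -1))

# Same ratio as lag(sell) / sell and shift(sell) / sell
df$resultat <- previous / df$sell

# Hypothetical extra column: percentage change relative to the previous row
df$pct_change <- (df$sell / previous - 1) * 100
df
```

If a growth percentage is what is actually wanted, `pct_change` may be the more natural column to keep, since `resultat` is the inverse of the growth factor.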

Related

Hold current value until non-null value occurs [duplicate]

Hi, I come from a SAS background and I am relatively new to R. I am attempting to convert an existing SAS program into equivalent R code.
I am unsure how to achieve the equivalent of SAS's "retain" and "by" behavior in R.
I have a data frame with two columns: the first is a date column and the second is a numeric value.
The numeric column represents the result of a lab test. The test is conducted semi-regularly, so on some days the data contains null values. The data is ordered by date and the dates are sequential.
Example data looks like this:
Date Result
2017/01/01 15
2017/01/02 NA
2017/01/03 NA
2017/01/04 12
2017/01/05 NA
2017/01/06 13
2017/01/07 11
2017/01/08 NA
I would like to create a third column containing the most recent result.
If the Result column is null, it should be set to the most recent non-null Result; otherwise it should contain the Result value.
My desired output would look like this:
Date Result My_var
2017/01/01 15 15
2017/01/02 NA 15
2017/01/03 NA 15
2017/01/04 12 12
2017/01/05 NA 12
2017/01/06 13 13
2017/01/07 11 11
2017/01/08 NA 11
In SAS I can achieve this with something like the following code snippet:
data my_data;
retain My_var;
set input_data;
by date;
if Result not = . then
my_var = result;
run;
I am stumped as to how to do this in R. I do not think R supports by-group processing as in SAS, or at least I don't know how to set that as an option.
I have naively tried:
my_data <- mutate(input_data, my_var = if(is.na(Result)) {lag(Result)} else {Result})
But I do not think that syntax is correct.
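We can use the na.locf function from the zoo package to fill in the missing values. Passing na.rm = FALSE keeps any leading NA in place, so the result always has the same length as the input column.
library(zoo)
dt$My_var <- na.locf(dt$Result, na.rm = FALSE)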
dt
# Date Result My_var
# 1 2017/01/01 15 15
# 2 2017/01/02 NA 15
# 3 2017/01/03 NA 15
# 4 2017/01/04 12 12
# 5 2017/01/05 NA 12
# 6 2017/01/06 13 13
# 7 2017/01/07 11 11
# 8 2017/01/08 NA 11
Or the fill function from the tidyr package.
library(dplyr)
library(tidyr)
dt <- dt %>%
mutate(My_var = Result) %>%
fill(My_var)
dt
# Date Result My_var
# 1 2017/01/01 15 15
# 2 2017/01/02 NA 15
# 3 2017/01/03 NA 15
# 4 2017/01/04 12 12
# 5 2017/01/05 NA 12
# 6 2017/01/06 13 13
# 7 2017/01/07 11 11
# 8 2017/01/08 NA 11
DATA
dt <- read.table(text = "Date Result
2017/01/01 15
2017/01/02 NA
2017/01/03 NA
2017/01/04 12
2017/01/05 NA
2017/01/06 13
2017/01/07 11
2017/01/08 NA",
header = TRUE, stringsAsFactors = FALSE)
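The question also asks about SAS-style "by" processing. The following is a minimal base R sketch of a forward fill, with a per-group variant using ave(). The helper name `fill_forward` and the `Patient` grouping column are hypothetical and were introduced only for this illustration.

```r
# Carry the last non-NA value forward; leading NAs stay NA
fill_forward <- function(x) {
  idx <- cumsum(!is.na(x))          # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[idx + 1]      # idx == 0 maps to the leading NA slot
}

result <- c(15, NA, NA, 12, NA, 13, 11, NA)
filled <- fill_forward(result)
filled

# Per-group variant: the fill restarts within each (hypothetical) Patient
grouped <- data.frame(
  Patient = c("a", "a", "a", "b", "b", "b"),
  Result  = c(5, NA, 7, NA, 9, NA)
)
grouped$My_var <- ave(grouped$Result, grouped$Patient, FUN = fill_forward)
grouped
```

Because the fill restarts in each group, the first row of Patient "b" stays NA instead of inheriting the last value of Patient "a", which is roughly what a SAS "by" statement combined with resetting the retained variable would give.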

Transpose column and group dataframe [duplicate]

I'm trying to change a data frame in R to group multiple rows by a measurement. The table has a location (km), a size (mm), a count of things in that size bin, a site and a year. I want to take the sizes, make a column from each one (2, 4 and 6 in this example), and place the corresponding count into the row for that location, site and year.
It seems like a combination of transposing and grouping, but I can't figure out a way to accomplish this in R. I've looked at t(), dcast() and aggregate(), but those aren't really close at all.
So I would go from something like this:
df <- data.frame(km=c(rep(32,3),rep(50,3)), mm=rep(c(2,4,6),2), count=sample(1:25,6), site=rep("A", 6), year=rep(2013, 6))
km mm count site year
1 32 2 18 A 2013
2 32 4 2 A 2013
3 32 6 12 A 2013
4 50 2 3 A 2013
5 50 4 17 A 2013
6 50 6 21 A 2013
To this:
km site year mm_2 mm_4 mm_6
1 32 A 2013 18 2 12
2 50 A 2013 3 17 21
Edit: I tried the solution in a suggested duplicate, but it did not work for me; I'm not really sure why. The answer below worked better.
As suggested in the comment above, we can use the sep argument in spread:
library(tidyr)
spread(df, mm, count, sep = "_")
km site year mm_2 mm_4 mm_6
1 32 A 2013 4 20 1
2 50 A 2013 15 14 22
As you mentioned dcast(), here is a method using it.
set.seed(1)
df <- data.frame(km=c(rep(32,3),rep(50,3)),
mm=rep(c(2,4,6),2),
count=sample(1:25,6),
site=rep("A", 6),
year=rep(2013, 6))
library(reshape2)
dcast(df, ... ~ mm, value.var="count")
# km site year 2 4 6
# 1 32 A 2013 13 10 20
# 2 50 A 2013 3 17 1
And if you want a bit of a challenge you can try the base function reshape().
df2 <- reshape(df, v.names="count", idvar="km", timevar="mm", ids="mm", direction="wide")
colnames(df2) <- sub("count.", "mm_", colnames(df2))
df2
# km site year mm_2 mm_4 mm_6
# 1 32 A 2013 13 10 20
# 4 50 A 2013 3 17 1
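In current versions of tidyr, spread() is superseded by pivot_wider(). The following is a sketch of the same reshape, assuming tidyr 1.0.0 or later is installed. It uses the counts shown in the question rather than sample(), so the result is reproducible; the name `wide` is introduced here for illustration.

```r
library(tidyr)

# Same layout as the question, with the counts fixed instead of sampled
df <- data.frame(km = c(rep(32, 3), rep(50, 3)),
                 mm = rep(c(2, 4, 6), 2),
                 count = c(18, 2, 12, 3, 17, 21),
                 site = rep("A", 6),
                 year = rep(2013, 6))

# One column per mm value, prefixed with "mm_", filled with count
wide <- pivot_wider(df,
                    names_from = mm,
                    values_from = count,
                    names_prefix = "mm_")
wide
```

The columns not named in names_from or values_from (km, site and year) act as the row identifiers, which matches the grouping the question asks for.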

Clean way to calculate both group and overall statistics

I would like to calculate the median not only for different groups of my data, but also the median over all groups, and store the result in a single data.frame. While accomplishing each of these tasks separately is easy, I have not found a clean way to do both at the same time.
Right now, I calculate both statistics separately, then join the results, then tidy the data if necessary. Here's an example of what this may look like if I wanted to know the delay per day and per month (the code below uses mean(); the same pattern applies to median()):
library(dplyr)
library(hflights)
data(hflights)
# Calculate both statistics separately
per_day <- hflights %>%
group_by(Year, Month, DayofMonth) %>%
summarise(Delay = mean(ArrDelay, na.rm = TRUE)) %>%
mutate(Interval = "Daily")
per_month <- hflights %>%
group_by(Year, Month) %>%
summarise(Delay = mean(ArrDelay, na.rm = TRUE)) %>%
mutate(Interval = "Monthly", DayofMonth = NA)
# Join into a single data.frame
my_summary <- full_join(per_day, per_month,
by = c("Year", "Month", "DayofMonth", "Interval", "Delay"))
my_summary
# Source: local data frame [377 x 5]
# Groups: Year, Month
#
# Year Month DayofMonth Delay Interval
# 1 2011 1 1 10.067642 Daily
# 2 2011 1 2 10.509745 Daily
# 3 2011 1 3 6.038627 Daily
# 4 2011 1 4 7.970740 Daily
# 5 2011 1 5 4.172650 Daily
# 6 2011 1 6 6.069909 Daily
# 7 2011 1 7 3.907295 Daily
# 8 2011 1 8 3.070140 Daily
# 9 2011 1 9 17.254325 Daily
# 10 2011 1 10 11.040388 Daily
# .. ... ... ... ... ...
Are there better ways to do this?
(Note that in many cases one could easily progressively roll up summaries as pointed out in the Introduction to dplyr. However, this doesn't work for statistics like median, mean etc.)
As a one-off table. This is fairly straightforward in data.table:
require(data.table)
setDT(hflights)[,{
mo_del <- mean(ArrDelay,na.rm=TRUE)
.SD[,.(DailyDelay = mean(ArrDelay,na.rm=TRUE),MonthlyDelay = mo_del),by=DayofMonth]
},by=.(Year,Month)]
# Year Month DayofMonth DailyDelay MonthlyDelay
# 1: 2011 1 1 10.0676417 4.926065
# 2: 2011 1 2 10.5097451 4.926065
# 3: 2011 1 3 6.0386266 4.926065
# 4: 2011 1 4 7.9707401 4.926065
# 5: 2011 1 5 4.1726496 4.926065
# ---
# 361: 2011 12 14 1.0293610 5.013244
# 362: 2011 12 17 -0.1049822 5.013244
# 363: 2011 12 24 -4.1457490 5.013244
# 364: 2011 12 25 -2.2976827 5.013244
# 365: 2011 12 31 46.4846491 5.013244
How it works. The basic syntax is DT[i,j,by].
With by=.(Year,Month), all operations in j are done per "by group."
We can nest another "by group" using the data.table of the current Subset of Data, .SD.
To return columns in j we use .(colname1=col1,colname2=col2,...).
Creating new variables. Alternately, we could create new variables in hflights using := in j.
hflights[,DailyDelay := mean(ArrDelay,na.rm=TRUE),.(Year,Month,DayofMonth)]
hflights[,MonthlyDelay := mean(ArrDelay,na.rm=TRUE),.(Year,Month)]
Then we can view the summary table:
hflights[,.GRP,.(Year,Month,DayofMonth,DailyDelay,MonthlyDelay)]
# Year Month DayofMonth DailyDelay MonthlyDelay .GRP
# 1: 2011 1 1 10.0676417 4.926065 1
# 2: 2011 1 2 10.5097451 4.926065 2
# 3: 2011 1 3 6.0386266 4.926065 3
# 4: 2011 1 4 7.9707401 4.926065 4
# 5: 2011 1 5 4.1726496 4.926065 5
# ---
# 361: 2011 12 14 1.0293610 5.013244 361
# 362: 2011 12 17 -0.1049822 5.013244 362
# 363: 2011 12 24 -4.1457490 5.013244 363
# 364: 2011 12 25 -2.2976827 5.013244 364
# 365: 2011 12 31 46.4846491 5.013244 365
(Something needed to be put in j here, so I used the "by group" code, .GRP.)
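The "create new variables, then summarise" idea above can also be sketched in dplyr. This is only an illustration under stated assumptions: it uses a small made-up data frame (`flights`) instead of hflights, it uses median() as in the question text, and it assumes dplyr 1.0.0 or later for the `.groups` argument.

```r
library(dplyr)

# Hypothetical toy data standing in for hflights
flights <- data.frame(
  Month      = c(1, 1, 1, 2, 2, 2),
  DayofMonth = c(1, 1, 2, 1, 1, 2),
  ArrDelay   = c(10, 20, 60, NA, 5, 15)
)

summary_tbl <- flights %>%
  group_by(Month) %>%
  # Attach the month-level statistic to every row of that month
  mutate(MonthlyDelay = median(ArrDelay, na.rm = TRUE)) %>%
  # Regroup by day, carrying MonthlyDelay along as a grouping column
  group_by(Month, DayofMonth, MonthlyDelay) %>%
  summarise(DailyDelay = median(ArrDelay, na.rm = TRUE), .groups = "drop")

summary_tbl
```

Because the monthly value is computed from the raw rows before the daily summary, this works for statistics such as the median that cannot be rolled up from daily summaries.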

How to calculate top rows from a large data set

I have a dataset with the following columns: flavor, flavorid and unitsold.
Flavor Flavorid unitsold
beans 350 6
creamy 460 2
.
.
.
I want to find the top ten flavors and then calculate the market share for each flavor. My logic is: market share for each flavor = units sold for that flavor divided by total units sold.
How do I implement this? For output I just want two columns: Flavorid and the corresponding market share. Do I need to save the top ten flavors in some table first?
One way is with the dplyr package:
An example data set:
flavor <- rep(letters[1:15],each=5)
flavorid <- rep(1:15,each=5)
unitsold <- 1:75
df <- data.frame(flavor,flavorid,unitsold)
> df
flavor flavorid unitsold
1 a 1 1
2 a 1 2
3 a 1 3
4 a 1 4
5 a 1 5
6 b 2 6
7 b 2 7
8 b 2 8
9 b 2 9
...
...
Solution:
library(dplyr)
df %>%
select(flavorid,unitsold) %>% #select the columns you want
group_by(flavorid) %>% #group by flavorid
summarise(total=sum(unitsold)) %>% #sum the total units sold per id
mutate(marketshare=total/sum(total)) %>% #calculate the market share per id
arrange( desc(marketshare)) %>% #order by marketshare descending
head(10) #pick the 10 first
#and you can add another select(flavorid,marketshare) if you only want those two
Output:
Source: local data frame [10 x 3]
flavorid total marketshare
1 15 365 0.12807018
2 14 340 0.11929825
3 13 315 0.11052632
4 12 290 0.10175439
5 11 265 0.09298246
6 10 240 0.08421053
7 9 215 0.07543860
8 8 190 0.06666667
9 7 165 0.05789474
10 6 140 0.04912281
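For readers who prefer not to load dplyr, here is a base R sketch of the same steps on the same example data. The intermediate names `totals` and `top10` are introduced here for illustration. As in the dplyr answer, the share is computed against all flavors before the top ten are picked, so no intermediate top-ten table is needed.

```r
flavor <- rep(letters[1:15], each = 5)
flavorid <- rep(1:15, each = 5)
unitsold <- 1:75
df <- data.frame(flavor, flavorid, unitsold)

# Total units sold per flavorid
totals <- aggregate(unitsold ~ flavorid, data = df, FUN = sum)

# Share of each flavor in the overall total (all 15 flavors)
totals$marketshare <- totals$unitsold / sum(totals$unitsold)

# Order by share, keep the two requested columns and the first ten rows
top10 <- head(totals[order(-totals$marketshare), c("flavorid", "marketshare")], 10)
top10
```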

cross sectional sub-sets in data.table

I have a data.table which contains multiple columns, which is well represented by the following:
DT <- data.table(date = as.IDate(rep(c("2012-10-17", "2012-10-18", "2012-10-19"), each=10)),
session = c(1,2,3), price = c(10, 11, 12,13,14),
volume = runif(30, min=10, max=1000))
I would like to extract a multi-column table that shows the volume traded at each price in a particular type of session, with each column representing a date.
At present, I extract this data one date at a time using the following:
DT[session==1,][date=="2012-10-17", sum(volume), by=price]
and then bind the columns.
Is there a way of obtaining the end product (a table with each column referring to a particular date) without sticking all the single queries together, as I'm currently doing?
Thanks
Does the following do what you want?
A combination of reshape2 and data.table
library(reshape2)
.DT <- DT[,sum(volume),by = list(price,date,session)][, DATE := as.character(date)]
# reshape2 for casting to wide -- it doesn't seem to like IDate columns, hence
# the character DATE column
dcast(.DT, session + price ~ DATE, value.var = 'V1')
session price 2012-10-17 2012-10-18 2012-10-19
1 1 10 308.9528 592.7259 NA
2 1 11 649.7541 NA 816.3317
3 1 12 NA 502.2700 766.3128
4 1 13 424.8113 163.7651 NA
5 1 14 682.5043 NA 147.1439
6 2 10 NA 755.2650 998.7646
7 2 11 251.3691 695.0153 NA
8 2 12 791.6882 NA 275.4777
9 2 13 NA 111.7700 240.3329
10 2 14 230.6461 817.9438 NA
11 3 10 902.9220 NA 870.3641
12 3 11 NA 719.8441 963.1768
13 3 12 361.8612 563.9518 NA
14 3 13 393.6963 NA 718.7878
15 3 14 NA 871.4986 582.6158
If you just wanted session 1:
dcast(.DT[session == 1L], session + price ~ DATE, value.var = 'V1')
session price 2012-10-17 2012-10-18 2012-10-19
1 1 10 308.9528 592.7259 NA
2 1 11 649.7541 NA 816.3317
3 1 12 NA 502.2700 766.3128
4 1 13 424.8113 163.7651 NA
5 1 14 682.5043 NA 147.1439
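data.table also ships its own dcast() method for data.tables, so the aggregation and the cast can be combined in one call without reshape2. The sketch below assumes a reasonably recent data.table. It uses a small deterministic table instead of runif() so the sums can be checked by hand, and it keeps the character DATE column from the answer above. The name `wide` is introduced here for illustration.

```r
library(data.table)

# Small deterministic stand-in for the question's DT (volume is 1..16)
DT <- data.table(
  date    = as.IDate(rep(c("2012-10-17", "2012-10-18"), each = 8)),
  session = rep(c(1, 1, 2, 2), 4),
  price   = rep(c(10, 11), 8),
  volume  = 1:16
)

# Character copy of the date, used for the new column names
DT[, DATE := as.character(date)]

# Sum volume per session/price, one column per date
wide <- dcast(DT, session + price ~ DATE,
              value.var = "volume", fun.aggregate = sum)
wide
```

To restrict the table to one session, subset first, for example `dcast(DT[session == 1], ...)` with the same remaining arguments.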
