R - Dplyr - How to mutate rows - r

I found that dplyr is speedy and simple for aggregate and summarise data. But I can't find out how to solve the following problem with dplyr.
Given these data frames:
df_2017 <- data.frame(expand.grid(1:195,1:65,1:39),
value = sample(1:1000000,(195*65*39)),
period = rep("2017",(195*65*39)),
stringsAsFactors = F)
df_2017 <- df_2017[sample(1:(195*65*39),450000),]
names(df_2017) <- c("company", "product", "acc_concept", "value", "period")
df_2017$company <- as.character(df_2017$company)
df_2017$product <- as.character(df_2017$product)
df_2017$acc_concept <- as.character(df_2017$acc_concept)
df_2017$value <- as.numeric(df_2017$value)
ratio_df <- data.frame(concept=c("numerator","numerator","numerator","denom", "denom", "denom","name"),
ratio1=c("1","","","4","","","Sales over Assets"),
ratio2=c("1","","","5","6","","Sales over Expenses A + B"), stringsAsFactors = F)
where the columns in df_2017 are:
company = This is a categorical variable with companies from 1 to 195
product = This is a categorical, with home apliance products from 1 to 65. For example, 1 could be equal to irons, 2 to television, etc
acc_concept = This is a categorical variable with accounting concepts from 1 to 39. For example, 1 would be equal to "Sales", 2 to "Total Expenses", 3 to Returns", 4 to "Assets, etc
value = This is a numeric variable, with USD from 1 to 100.000.000
period = Categorical variable. Always 2017
As the expand.grid implies, the combinations of company - product - acc_concept are never duplicated, but, It could happen that certain subjects have not every company - product - acc_concept combinations. That's why the code line "df_2017 <- df_2017[sample(1:195*65*39),450000),]", and that's why the output could turn out into NA (see below).
And where the columns in ratio_df are:
Concept = which acc_concept corresponds to the numerator, which one to
denominator, and which is name of the ratio
ratio1 = acc_concept and name for ratio1
ratio2 = acc_concept and name for ratio2
I want to calculate 2 ratios (ratio_df) between acc_concept, for each product within each company.
For example:
I take the first ratio "acc_concepts" and "name" from ratio_df:
num_acc_concept <- ratio_df[ratio_df$concept == "numerator", 2]
denom_acc_concept <- ratio_df[ratio_df$concept == "denom", 2]
ratio_name <- ratio_df[ratio_df$concept == "name", 2]
Then I calculate the ratio for one product of one company, just to show you want i want to do:
ratio1_value <- sum(df_2017[df_2017$company == 1 & df_2017$product == 1 & df_2017$acc_concept %in% num_acc_concept, 4]) / sum(df_2017[df_2017$company == 1 & df_2017$product == 1 & df_2017$acc_concept %in% denom_acc_concept, 4])
Output:
output <- data.frame(Company="1", Product="1", desc_ratio=ratio_name, ratio_value = ratio1_value, stringsAsFactors = F)
As I said before I want to do this for each product within each company
The output data.frame could be something like this (ratios aren't the true ones because I haven't done the calculations yet):
company product desc_ratio ratio_value
1 1 Sales over Assets 0.9303675
1 3 Sales over Assets 1.30
1 7 Sales over Assets Nan
1 1 Sales over Expenses A + B Inf
1 2 Sales over Expenses A + B 2.32
1 3 Sales over Expenses A + B NA
2
3
and so on...
NaN when ratio is 0 / 0
Inf when ratio is number / 0
NA when there is no data for certain company and product.
I hope I have made myself clear...
Is there any way to solve this row problem with dplyr? Should I cast the df_2017?

Related

How to apply function with multiple conditions on multiple columns to get new conditional columns in R

Hello all a R noob here,
I hope you guys can help me with the following.
I need to transform multiple columns in my dataset to new columns based on the values in the original columns multiple times. This means that for the first transformation I use column 1, 2, 3 and if certain conditions are met the output results a new column with a 1 or a 0, for the second transformation I use columns 4, 5, 6 and the output should be a 1 or a 0 also. I have to do this 18 times. I already wrote a function which succesfully does the transformation if I impute the variables manually, but I would like to apply this function to all the desired columns at once. My desired output would be 18 new columns with 0's and 1's. Finally I will make a last column which will display a 1 if any of the 18 columns is a 1 and a 0 otherwise.
df <- data.frame(admiss1 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
admiss2 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
admiss3 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
visit1 = sample(seq(as.Date('1995/01/01'), as.Date('1996/01/01'), by="day"), 12),
visit2 = sample(seq(as.Date('1997/01/01'), as.Date('1998/01/01'), by="day"), 12),
reason1 = sample(3,12, replace = T),
reason2 = sample(3,12, replace = T),
reason3 = sample(3,12, replace = T))
df$discharge1 <- df$admiss1 + 10
df$discharge2 <- df$admiss2 + 10
df$discharge3 <- df$admiss3 + 10
#every discharge date is 10 days after the admission date for the sake of this example
#now I have the following dataframe
#for the sake of it I included only 3 dates and reasons(instead of 18)
admiss1 admiss2 admiss3 visit1 visit2 reason1 reason2 reason3 discharge1 discharge2 discharge3
1 1990-03-12 1992-04-04 1998-07-31 1995-01-24 1997-10-07 2 1 3 1990-03-22 1992-04-14 1998-08-10
2 1999-05-18 1990-11-25 1995-10-04 1995-03-06 1997-03-13 1 2 1 1999-05-28 1990-12-05 1995-10-14
3 1993-07-16 1998-06-10 1991-07-05 1995-11-06 1997-11-15 1 1 2 1993-07-26 1998-06-20 1991-07-15
4 1991-07-05 1992-06-17 1995-10-12 1995-05-14 1997-05-02 2 1 3 1991-07-15 1992-06-27 1995-10-22
5 1995-08-16 1999-03-08 1992-04-03 1995-02-20 1997-01-03 1 3 3 1995-08-26 1999-03-18 1992-04-13
6 1999-10-07 1991-12-26 1995-05-05 1995-10-24 1997-10-15 3 1 1 1999-10-17 1992-01-05 1995-05-15
7 1998-03-18 1992-04-18 1993-12-31 1995-11-14 1997-06-14 3 2 2 1998-03-28 1992-04-28 1994-01-10
8 1992-08-04 1991-09-16 1992-04-23 1995-05-29 1997-10-11 1 2 3 1992-08-14 1991-09-26 1992-05-03
9 1997-02-20 1990-02-12 1998-03-08 1995-10-09 1997-12-29 1 1 3 1997-03-02 1990-02-22 1998-03-18
10 1992-09-16 1997-06-16 1997-07-18 1995-12-11 1997-01-12 1 2 2 1992-09-26 1997-06-26 1997-07-28
11 1991-01-25 1998-04-07 1999-07-02 1995-12-27 1997-05-28 3 2 1 1991-02-04 1998-04-17 1999-07-12
12 1996-02-25 1993-03-30 1997-06-25 1995-09-07 1997-10-18 1 3 2 1996-03-06 1993-04-09 1997-07-05
admissdate <- function(admis, dis, rsn, vis1, vis2){
xnew <- ifelse(df[eval(substitute(admis))] >= df[eval(substitute(vis1))] & df[eval(substitute(dis))] <= df[eval(substitute(vis2))] & df[eval(substitute(rsn))] == 2, 1, 0)
xnew <- ifelse(df[eval(substitute(admis))] >= df[eval(substitute(vis1))] & df[eval(substitute(admis))] <= df[eval(substitute(vis2))] & df[eval(substitute(dis))] >= df[eval(substitute(vis2))] & df[eval(substitute(rsn))] == 2, 1, xnew)
return(xnew)
}
I wrote this function to generate a 1 if the conditions are true and a 0 if the conditions are false.
-Condition 1: admission date and discharge date are between visit 1 and visit 2 + admission reason is 2.
-Condition 2: admission date is after visit 1 but before visit 2 and the discharge date is after visit 2 with also admission reason 2.
It should return 1 if these conditions are true and 0 if these conditions are false. Eventually, I will end up with 18 new variables with 1's or 0's and will combine them to make one variable with Admission between visit 1 and visit 2 (with reason 2).
If I manually impute the variable names it will work, but I cant make it work for all the variables at once. I tried to make a string vector with all the admiss dates, discharge dates and reasons and tried to transform them with mapply, but this does not work.
admiss <- paste0(rep("admiss", 3), 1:3)
discharge <- paste0(rep("discharge", 3), 1:3)
reason <- paste0(rep("reason", 3), 1:3)
visit1 <- rep("visit1",3)
visit2 <- rep("visit2",3)
mapply(admissdate, admis = admiss, dis = discharge, rsn = reason, vis1 = visit1, vis2 = visit2)
I have also considered lapply but here you have to define an X = ..., which I think I cannot use because I have multiple column that I want to impute, please correct me if I am wrong!
Also I considered using a for loop, but I don't know how to use that with multiple conditions.
Any help would be greatly appreciated!
You can change the function to accept values instead of column names.
admissdate <- function(admis, dis, rsn, vis1, vis2){
xnew <- as.integer(admis >= vis1 & dis <= vis2 & rsn == 2)
xnew <- ifelse(admis >= vis1 & admis <= vis2 & dis >= vis2 & rsn == 2, 1, xnew)
return(xnew)
}
Now create new columns -
admiss <- paste0("admiss", 1:3)
discharge <- paste0("discharge", 1:3)
reason <- paste0("reason", 1:3)
new_col <- paste0('newcol', 1:3)
df[new_col] <- Map(function(x, y, z) admissdate(x, y, z, df$visit1, df$visit2),
df[admiss],df[discharge],df[reason])
#Additional column will be 1 if any of the value in the new column is 1.
df$result <- as.integer(rowSums(df[new_col]) > 0)
df

creating custom function for converting columns of data frame to factor

Hi I am giving labels to my data frame manually like below, I have 800 columns to be labeled , after that I am creating a subset of data frame (sub setting of data have many), then applying that data frame to function for calculation.
labels can be different for all chunks , also its very time taking for creating labels one by one for all chunks.
data<-data.frame(
gender = c(1,2,1,2,1,2,1,2,2,2,2,1,1,2,2,2,2,1,1,1,1,1,2,1,2,1,2,2,2,1,2,1,2,1,2,1,2,2,2),
sector = c(3,3,1,2,5,4,4,4,4,3,3,4,3,4,2,1,4,2,3,4,4,4,3,1,2,1,5,5,4,3,1,4,5,2,3,4,5,1,4),
col1=c(1,1,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,1,1,1,NA,1,1,NA,NA,NA,NA,1,NA,NA,NA,NA,1,NA,1),
col2=c(1,1,1,1,1,NA,NA,NA,NA,1,1,1,1,1,NA,NA,NA,1,1,1,NA,1,1,1,1,1,NA,NA,NA,1,1,1,1,1,1,1,NA,NA,NA),
col3=c(1,1,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,1,1,1,NA,NA,NA,1,NA,NA,1,1,1,1,1,NA,NA,1),
col4=c(1,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA),
col5=c(1,2,1,1,1,2,1,2,2,1,2,NA,1,1,2,2,2,1,1,1,2,NA,2,1,1,1,2,2,2,NA,1,2,2,1,1,1,2,2,2)
)
data$gender<-factor(data$gender, levels=c(1,2), labels=c("Male","female"))
data$sector<-factor(data$sector, levels=c(1,2,3,4,5), labels=c("TX","CA","NY","LA","WA"))
data$col1<-factor(data$col1, levels=1, labels="Sales")
data$col2<-factor(data$col2, levels=1, labels="OPS")
data$col3<-factor(data$col3, levels=1, labels="Management")
data$col4<-factor(data$col4, levels=1, labels="HR")
data$col5<-factor(data$col5, levels=c(1,2), labels=c("Local","Overseas"))
df1<- data
df1$cc1<-1
df2<- subset(df, col5 == 'Local')
df$cc2<-ifelse(df$col5 == 'Local',1,NA)
lst<-list(df$cc1, df$cc2)
ldat<-list("ALL" = df, "Local" =df2)
now I am looking for a function like where I can give a list of labels for eg .
factor_list <- data.frame (colnames=c("col1","col2"....),col_labels =c("sales","OPS"....)
# so here I will be just needed to update factor list then the apply function for labelling
conv_frac <- function(dataset,var_bject){
for(i in 1:ldat)
lapply(factor,ldat(i)) # may be lapply or any thing else
}
# then will apply factor_list
conv_frac(dataset = ldat,var_bject = factor_list)
any solution for this
You can write a function with the help of Map :
conv_frac <- function(dataset,cols, labels_list) {
dataset[cols] <- Map(function(x, y) factor(x, labels = y),
dataset[cols], labels_list)
return(dataset)
}
Now pass dataframe, vector of column names and list of labels to it. For example, if you want to change 3 columns in data say gender, col1 and col2 you can do :
result <- conv_frac(data, c('gender', paste0('col', 1:2)),
list(c('Male', 'Female'), c('one', 'zero'), c('Sales', 'HR')))
head(result)
# gender sector col1 col2 col3 col4 col5
#1 Male 3 zero HR 1 1 1
#2 Female 3 zero HR 1 0 2
#3 Male 1 one HR 0 0 1
#4 Female 2 one HR 0 0 1
#5 Male 5 one HR 0 0 1
#6 Female 4 one Sales 0 0 2
This changes labels of only those columns keeping other columns as it is.

Merge a dataframe by creating subsets based on time period and a unique ID number

I am looking to create a dataframe that lists a unique ID with the movement of n different amounts across a period of m timesteps. I currently generate subsets of each timestep and then merge all these subsets with a separate dataframe that contains just the unique IDs. See below:
set.seed(129)
df1 <- data.frame(
id= c(rep(seq(1:7),3)),
time= c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3),
amount1= runif(21,0,50),
amount2= runif(21,-20,600),
amount3= runif(21,-15,200),
amount4= runif(21,-3,300)
)
df2 <- data.frame(
id = unique(df1$id)
)
sub_1 <- subset(df1, time == 1)
sub_2 <- subset(df1, time == 2)
sub_3 <- subset(df1, time == 3)
df2<-merge(df2,sub_1,by.x = "id",by.y = "id", all=TRUE)
df2<-merge(df2,sub_2,by.x = "id",by.y = "id", all=TRUE)
df2<-merge(df2,sub_3,by.x = "id",by.y = "id", all=TRUE)
#df2
id time.x amount1.x amount2.x amount3.x amount4.x time.y amount1.y amount2.y amount3.y amount4.y time amount1 amount2 amount3 amount4
1 1 1 6.558261 -17.713007 46.477430 195.061597 2 18.5453843 269.7406808 132.588713 80.40133 3 24.943217 488.1025 103.473479 198.51302
2 2 1 15.736044 230.018563 72.604346 -2.513162 2 48.8537058 356.5593748 161.239261 246.25985 3 35.559262 406.4749 66.278064 30.11592
3 3 1 8.057720 386.814867 101.997370 152.269564 2 0.7334493 0.7842648 66.603965 156.12478 3 42.170220 450.0306 195.872986 109.73098
4 4 1 15.575282 527.033563 37.403278 197.529341 2 37.8372445 370.0410836 6.074847 273.46715 3 20.302206 290.0026 -2.101649 112.88488
5 5 1 4.230635 427.294382 112.771237 199.401096 2 15.3735066 376.8945806 104.382371 224.09730 3 8.050933 291.6123 53.660734 270.37200
6 6 1 29.087870 9.330858 129.400932 70.801129 2 38.9966662 421.9258798 -3.891286 290.59259 3 17.919554 581.1735 137.100314 129.78561
7 7 1 4.380303 463.658580 4.120219 56.527016 2 6.0582455 484.4981686 67.820164 72.05615 3 43.556746 170.0745 41.134708 247.99512
I have a major issue with this, as the values of m and n increase this method becomes ugly and long. Is there a cleaner way to do this? Maybe as a one liner so I don't have to make say 15 subsets if m = 15.
Thanks
You just need your original df1 dataset and do this:
library(tidyverse)
df1 %>%
group_split(time) %>% # create your subsets and store them as a list of data frames
reduce(left_join, by = "id") # sequentially join those subsets

Obtaining mean of dataframe by value groups in another dataframe

Once again I consult your wisdom.
I have 2 dataframes of the form:
**data1sample**
ID value
water 3
water 5
fire 1
fire 3
fire 2
air 1
**data2controls**
ID value
water 1
fire 3
air 5
I want to use the values in my control dataframe (data2controls) and know their corresponding percentile in the sample distribution (data1sample). I have to classify each sample by their ID (meaning, get fire control against fire sample, and water against water, etc), but I haven't been able to do so.
I am using the command:
mean(data1sample[data1sample$ID == data2controls$ID,] <= data2controls$value)
but I get the error
In Ops.factor(left, right) : ‘<=’ not meaningful for factors
What I am after is basically the percentile of the value in dataframe2 calculated based on the samples of dataframe1 (I am trying to obtain the percentile as in percentile = mean(data1sample$value(by ID) <= dataframe2$value))
So something like this:
**data2controls**
ID value percentile(based on data1 sample values)
water 1 .30
fire 3 .14
air 5 .1
Please disregard the percentile values, they're just made up to show desired output.
I'd love if someone could give me a hand! Thanks!!
Its hard to answer without the desired output, but I will try to guess it here:
library(dplyr)
data1sample <- data.frame(ID = c("water", "water", "fire", "fire", "fire", "air"), value = c(3,5,1,3,2,1))
data2sample <- data.frame(ID = c("water", "fire", "air"), value = c(1,3,5))
by_ID <- data1sample %>% group_by(ID) %>% summarise(control = mean(value))
data2sample %>% inner_join(by_ID)
#> Joining, by = "ID"
#> ID value control
#> 1 water 1 4
#> 2 fire 3 2
#> 3 air 5 1
This gives the result I think you're after?
for(i in d2$ID){
x <- mean(d1[d1$ID == i & d1$value <= d2[d2$ID == i, 'value'], 'value'])
print(x)
}
Based on the data you provided it returns NaN for water because there are no 'water's that meet your criterion, and so div by 0

R conditional lookup and sum

I have data on college course completions, with estimated numbers of students from each cohort completing after 1, 2, 3, ... 7 years. I want to use these estimates to calculate the total number of students outputting from each College and Course in any year.
The output of students in a given year will be the sum of the previous 7 cohorts outputting after 1, 2, 3, ... 7 years.
For example, the number of students outputting in 2014 from COLLEGE 1, COURSE A is equal to the sum of:
Output of 2013 cohort (College 1, Course A) after 1 year +
Output of 2012 cohort (College 1, Course A) after 2 years +
Output of 2011 cohort (College 1, Course A) after 3 years +
Output of 2010 cohort (College 1, Course A) after 4 years +
Output of 2009 cohort (College 1, Course A) after 5 years +
Output of 2008 cohort (College 1, Course A) after 6 years +
Output of 2007 cohort (College 1, Course A) after 7 years +
So there are two dataframes: a lookup table that contains all the output estimates, and a smaller summary table that I'm trying to modify. I want to update dummy.summary$output with, for each row, the total output based on the above calculation.
The following code will replicate my data pretty well
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
college = rep(rep(paste("College", 1:6), each = 35), 17),
course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
output.year = rep(1:7, 510),
output = sample(x = 10:20, size = 3570, replace=TRUE))
# Summary table to be modified
dummy.summary <- aggregate(x = dummy.lookup["intake"], by = list(dummy.lookup$cohort, dummy.lookup$college, dummy.lookup$course), FUN = mean)
names(dummy.summary)[1:3] <- c("year", "college", "course")
dummy.summary <- dummy.summary[order(dummy.summary$year, dummy.summary$college, dummy.summary$course), ]
dummy.summary$output <- 0
The following code does not work, but shows the approach I've been attempting.
dummy.summary$output <- sapply(dummy.summary$output, function(x){
# empty vector to fill with output values
vec <- c()
# Find relevant output for college + course, from each cohort and exit year
for(j in 1:7){
append(x = vec,
values = dummy.lookup[dummy.lookup$college==dummy.summary[x, "college"] &
dummy.lookup$course==dummy.summary[x, "course"] &
dummy.lookup$cohort==dummy.summary[x, "year"]-j &
dummy.lookup$output.year==j, "output"])
}
# Sum and return total output
sum_vec <- sum(vec)
return(sum_vec)
}
)
I guess it doesn't work because I was hoping to use 'x' in the anonymous function to index particular values of the dummy.summary dataframe. But that clearly isn't happening and is only returning zero for each row, presumably because the starting value of 'x' is zero each time. I don't know if it is possible to access the index position of each value that sapply loops over, and use that to index my summary dataframe.
Is this approach fixable or do I need a completely different approach?
Even if it is fixable, is there a more elegant/faster way to acheive what I'm trying to do?
Thanks in anticipation.
I've just updated your output.year to output.year2 where instead of a value from 1 to 7 it gets a value of a year based on the cohort you have.
I've realised that the output information you want corresponds to the output.year, but the intake information you want corresponds to the cohort. So, I calculate them separately and then I join tables/information. This automatically creates empty (NA that I transform to 0) output info for 1998.
# fix your random sampling
set.seed(24)
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
college = rep(rep(paste("College", 1:6), each = 35), 17),
course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
output.year = rep(1:7, 510),
output = sample(x = 10:20, size = 3570, replace=TRUE))
dummy.lookup$output[dummy.lookup$yr %in% 1:2] <- 0
library(dplyr)
# create result table for output info
dt_output =
dummy.lookup %>%
mutate(output.year2 = output.year+cohort) %>% # update output.year to get a year value
group_by(output.year2, college, course) %>% # for each output year, college, course
summarise(SumOutput = sum(output)) %>% # calculate sum of intake
ungroup() %>%
arrange(college,course,output.year2) %>% # for visualisation purposes
rename(cohort = output.year2) # rename column
# create result for intake info
dt_intake =
dummy.lookup %>%
select(cohort, college, course, intake) %>% # select useful columns
distinct() # keep distinct rows/values
# join info
dt_intake %>%
full_join(dt_output, by=c("cohort","college","course")) %>%
mutate(SumOutput = ifelse(is.na(SumOutput),0,SumOutput)) %>%
arrange(college,course,cohort) %>% # for visualisation purposes
tbl_df() # for printing purposes
# Source: local data frame [720 x 5]
#
# cohort college course intake SumOutput
# (int) (fctr) (fctr) (int) (dbl)
# 1 1998 College 1 Course A 194 0
# 2 1999 College 1 Course A 198 11
# 3 2000 College 1 Course A 223 29
# 4 2001 College 1 Course A 198 45
# 5 2002 College 1 Course A 289 62
# 6 2003 College 1 Course A 163 78
# 7 2004 College 1 Course A 211 74
# 8 2005 College 1 Course A 181 108
# 9 2006 College 1 Course A 277 101
# 10 2007 College 1 Course A 157 109
# .. ... ... ... ... ...

Resources