Hello all, an R noob here.
I hope you can help me with the following.
I need to derive new columns from groups of existing columns in my dataset, and I need to do this many times. For the first transformation I use columns 1, 2 and 3, and if certain conditions are met the output is a new column containing a 1, otherwise a 0. For the second transformation I use columns 4, 5 and 6, and the output should again be a 1 or a 0. I have to do this 18 times. I already wrote a function that successfully does the transformation if I enter the variable names manually, but I would like to apply it to all the desired columns at once. My desired output is 18 new columns of 0's and 1's. Finally, I will make a last column that displays a 1 if any of the 18 columns is a 1 and a 0 otherwise.
df <- data.frame(admiss1 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
admiss2 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
admiss3 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
visit1 = sample(seq(as.Date('1995/01/01'), as.Date('1996/01/01'), by="day"), 12),
visit2 = sample(seq(as.Date('1997/01/01'), as.Date('1998/01/01'), by="day"), 12),
reason1 = sample(3,12, replace = T),
reason2 = sample(3,12, replace = T),
reason3 = sample(3,12, replace = T))
df$discharge1 <- df$admiss1 + 10
df$discharge2 <- df$admiss2 + 10
df$discharge3 <- df$admiss3 + 10
#every discharge date is 10 days after the admission date for the sake of this example
#now I have the following dataframe
#for the sake of it I included only 3 dates and reasons(instead of 18)
admiss1 admiss2 admiss3 visit1 visit2 reason1 reason2 reason3 discharge1 discharge2 discharge3
1 1990-03-12 1992-04-04 1998-07-31 1995-01-24 1997-10-07 2 1 3 1990-03-22 1992-04-14 1998-08-10
2 1999-05-18 1990-11-25 1995-10-04 1995-03-06 1997-03-13 1 2 1 1999-05-28 1990-12-05 1995-10-14
3 1993-07-16 1998-06-10 1991-07-05 1995-11-06 1997-11-15 1 1 2 1993-07-26 1998-06-20 1991-07-15
4 1991-07-05 1992-06-17 1995-10-12 1995-05-14 1997-05-02 2 1 3 1991-07-15 1992-06-27 1995-10-22
5 1995-08-16 1999-03-08 1992-04-03 1995-02-20 1997-01-03 1 3 3 1995-08-26 1999-03-18 1992-04-13
6 1999-10-07 1991-12-26 1995-05-05 1995-10-24 1997-10-15 3 1 1 1999-10-17 1992-01-05 1995-05-15
7 1998-03-18 1992-04-18 1993-12-31 1995-11-14 1997-06-14 3 2 2 1998-03-28 1992-04-28 1994-01-10
8 1992-08-04 1991-09-16 1992-04-23 1995-05-29 1997-10-11 1 2 3 1992-08-14 1991-09-26 1992-05-03
9 1997-02-20 1990-02-12 1998-03-08 1995-10-09 1997-12-29 1 1 3 1997-03-02 1990-02-22 1998-03-18
10 1992-09-16 1997-06-16 1997-07-18 1995-12-11 1997-01-12 1 2 2 1992-09-26 1997-06-26 1997-07-28
11 1991-01-25 1998-04-07 1999-07-02 1995-12-27 1997-05-28 3 2 1 1991-02-04 1998-04-17 1999-07-12
12 1996-02-25 1993-03-30 1997-06-25 1995-09-07 1997-10-18 1 3 2 1996-03-06 1993-04-09 1997-07-05
admissdate <- function(admis, dis, rsn, vis1, vis2){
xnew <- ifelse(df[eval(substitute(admis))] >= df[eval(substitute(vis1))] & df[eval(substitute(dis))] <= df[eval(substitute(vis2))] & df[eval(substitute(rsn))] == 2, 1, 0)
xnew <- ifelse(df[eval(substitute(admis))] >= df[eval(substitute(vis1))] & df[eval(substitute(admis))] <= df[eval(substitute(vis2))] & df[eval(substitute(dis))] >= df[eval(substitute(vis2))] & df[eval(substitute(rsn))] == 2, 1, xnew)
return(xnew)
}
I wrote this function to generate a 1 if the conditions are true and a 0 if the conditions are false.
-Condition 1: admission date and discharge date are between visit 1 and visit 2 + admission reason is 2.
-Condition 2: admission date is after visit 1 but before visit 2 and the discharge date is after visit 2 with also admission reason 2.
It should return 1 if these conditions are true and 0 if these conditions are false. Eventually, I will end up with 18 new variables with 1's or 0's and will combine them to make one variable with Admission between visit 1 and visit 2 (with reason 2).
If I enter the variable names manually it works, but I can't make it work for all the variables at once. I tried building character vectors with all the admission dates, discharge dates and reasons and passing them to mapply, but this does not work.
admiss <- paste0(rep("admiss", 3), 1:3)
discharge <- paste0(rep("discharge", 3), 1:3)
reason <- paste0(rep("reason", 3), 1:3)
visit1 <- rep("visit1",3)
visit2 <- rep("visit2",3)
mapply(admissdate, admis = admiss, dis = discharge, rsn = reason, vis1 = visit1, vis2 = visit2)
I have also considered lapply, but there you have to define a single X = ..., which I think I cannot use because I have multiple columns that I want to pass in. Please correct me if I am wrong!
I also considered using a for loop, but I don't know how to use one with multiple conditions.
Any help would be greatly appreciated!
You can change the function to accept values instead of column names.
admissdate <- function(admis, dis, rsn, vis1, vis2){
xnew <- as.integer(admis >= vis1 & dis <= vis2 & rsn == 2)
xnew <- ifelse(admis >= vis1 & admis <= vis2 & dis >= vis2 & rsn == 2, 1, xnew)
return(xnew)
}
Now create the new columns:
admiss <- paste0("admiss", 1:3)
discharge <- paste0("discharge", 1:3)
reason <- paste0("reason", 1:3)
new_col <- paste0('newcol', 1:3)
df[new_col] <- Map(function(x, y, z) admissdate(x, y, z, df$visit1, df$visit2),
df[admiss],df[discharge],df[reason])
# The additional column is 1 if any of the new columns is 1 in that row.
df$result <- as.integer(rowSums(df[new_col]) > 0)
df
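The question also asks whether a for loop could do this. A minimal sketch of that alternative is shown here, assuming the columns follow the admissN / dischargeN / reasonN naming pattern. The small data frame and its values are made up for illustration (two groups instead of 18), and the function is the one from the answer above.

```r
# Made-up example data: two admission/discharge/reason groups, three patients
df <- data.frame(
  admiss1 = as.Date(c("1995-06-01", "1990-01-01", "1997-12-28")),
  admiss2 = as.Date(c("1999-01-01", "1996-03-01", "1991-01-01")),
  reason1 = c(2, 2, 2),
  reason2 = c(2, 1, 2),
  visit1  = as.Date(c("1995-01-01", "1995-01-01", "1995-01-01")),
  visit2  = as.Date(c("1998-01-01", "1998-01-01", "1998-01-01"))
)
df$discharge1 <- df$admiss1 + 10
df$discharge2 <- df$admiss2 + 10

# Same function as in the answer: it takes values, not column names
admissdate <- function(admis, dis, rsn, vis1, vis2) {
  xnew <- as.integer(admis >= vis1 & dis <= vis2 & rsn == 2)
  ifelse(admis >= vis1 & admis <= vis2 & dis >= vis2 & rsn == 2, 1, xnew)
}

n_groups <- 2  # would be 18 in the real data
for (i in seq_len(n_groups)) {
  # Build the column names for group i and pull the columns with [[ ]]
  df[[paste0("newcol", i)]] <- admissdate(df[[paste0("admiss", i)]],
                                          df[[paste0("discharge", i)]],
                                          df[[paste0("reason", i)]],
                                          df$visit1, df$visit2)
}

# 1 if any of the new columns is 1 in that row
df$result <- as.integer(rowSums(df[paste0("newcol", seq_len(n_groups))]) > 0)
df[c("newcol1", "newcol2", "result")]
```

The loop and the Map call do the same work; Map is shorter, while the loop may be easier to read when the naming pattern is the only thing that varies.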
Hi, I am labelling the columns of my data frame manually, as shown below. I have 800 columns to label. After that I create subsets of the data frame (there are many subsets) and pass them to a function for calculation.
The labels can differ between chunks, and creating them one by one for every chunk is very time-consuming.
data<-data.frame(
gender = c(1,2,1,2,1,2,1,2,2,2,2,1,1,2,2,2,2,1,1,1,1,1,2,1,2,1,2,2,2,1,2,1,2,1,2,1,2,2,2),
sector = c(3,3,1,2,5,4,4,4,4,3,3,4,3,4,2,1,4,2,3,4,4,4,3,1,2,1,5,5,4,3,1,4,5,2,3,4,5,1,4),
col1=c(1,1,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,1,1,1,NA,1,1,NA,NA,NA,NA,1,NA,NA,NA,NA,1,NA,1),
col2=c(1,1,1,1,1,NA,NA,NA,NA,1,1,1,1,1,NA,NA,NA,1,1,1,NA,1,1,1,1,1,NA,NA,NA,1,1,1,1,1,1,1,NA,NA,NA),
col3=c(1,1,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,1,1,1,NA,NA,NA,1,NA,NA,1,1,1,1,1,NA,NA,1),
col4=c(1,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA),
col5=c(1,2,1,1,1,2,1,2,2,1,2,NA,1,1,2,2,2,1,1,1,2,NA,2,1,1,1,2,2,2,NA,1,2,2,1,1,1,2,2,2)
)
data$gender<-factor(data$gender, levels=c(1,2), labels=c("Male","female"))
data$sector<-factor(data$sector, levels=c(1,2,3,4,5), labels=c("TX","CA","NY","LA","WA"))
data$col1<-factor(data$col1, levels=1, labels="Sales")
data$col2<-factor(data$col2, levels=1, labels="OPS")
data$col3<-factor(data$col3, levels=1, labels="Management")
data$col4<-factor(data$col4, levels=1, labels="HR")
data$col5<-factor(data$col5, levels=c(1,2), labels=c("Local","Overseas"))
df<- data
df$cc1<-1
df2<- subset(df, col5 == 'Local')
df$cc2<-ifelse(df$col5 == 'Local',1,NA)
lst<-list(df$cc1, df$cc2)
ldat<-list("ALL" = df, "Local" =df2)
Now I am looking for a function to which I can give a list of labels, e.g.
factor_list <- data.frame (colnames=c("col1","col2"....),col_labels =c("sales","OPS"....))
# so I would only need to update factor_list and then apply the function to do the labelling
conv_frac <- function(dataset,var_bject){
for(i in 1:ldat)
lapply(factor,ldat(i)) # may be lapply or any thing else
}
# then will apply factor_list
conv_frac(dataset = ldat,var_bject = factor_list)
Is there a solution for this?
You can write a function with the help of Map :
conv_frac <- function(dataset,cols, labels_list) {
dataset[cols] <- Map(function(x, y) factor(x, labels = y),
dataset[cols], labels_list)
return(dataset)
}
Now pass the data frame, a vector of column names and a list of labels to it. For example, to relabel three columns of data (gender, col1 and col2) you can do:
result <- conv_frac(data, c('gender', paste0('col', 1:2)),
list(c('Male', 'Female'), c('one', 'zero'), c('Sales', 'HR')))
head(result)
# gender sector col1 col2 col3 col4 col5
#1 Male 3 zero HR 1 1 1
#2 Female 3 zero HR 1 0 2
#3 Male 1 one HR 0 0 1
#4 Female 2 one HR 0 0 1
#5 Male 5 one HR 0 0 1
#6 Female 4 one Sales 0 0 2
This relabels only the listed columns and leaves the other columns as they are.
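Since factor(x, labels = y) assigns the labels to the sorted unique values of x, it can help to state the levels explicitly. Below is a minimal sketch along those lines, driven by a lookup like the factor_list the question asks for and applied to a list of chunks with lapply. The list-based shape of factor_list, the spec argument and the tiny data frame are assumptions made for illustration, not part of the original answer.

```r
# Made-up data in the shape of the question (1 = present, NA = absent)
data <- data.frame(
  col1 = c(1, NA, 1, NA),
  col2 = c(NA, 1, 1, NA),
  col5 = c(1, 2, 2, 1)
)

# Hypothetical lookup: one entry per column, holding its levels and labels
factor_list <- list(
  col1 = list(levels = 1,       labels = "Sales"),
  col2 = list(levels = 1,       labels = "OPS"),
  col5 = list(levels = c(1, 2), labels = c("Local", "Overseas"))
)

conv_frac <- function(dataset, spec) {
  # Only touch columns that are both in the lookup and in this chunk
  cols <- intersect(names(spec), names(dataset))
  dataset[cols] <- Map(function(x, s) factor(x, levels = s$levels, labels = s$labels),
                       dataset[cols], spec[cols])
  dataset
}

# Build the chunks on the raw codes, then label every chunk in one go
chunks <- list(ALL = data, Local = subset(data, col5 == 1))
ldat <- lapply(chunks, conv_frac, spec = factor_list)
str(ldat$Local)
```

With this layout, adding a column to label only means adding one entry to factor_list.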
I am looking to create a dataframe that lists a unique ID with the movement of n different amounts across a period of m timesteps. I currently generate subsets of each timestep and then merge all these subsets with a separate dataframe that contains just the unique IDs. See below:
set.seed(129)
df1 <- data.frame(
id= c(rep(seq(1:7),3)),
time= c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3),
amount1= runif(21,0,50),
amount2= runif(21,-20,600),
amount3= runif(21,-15,200),
amount4= runif(21,-3,300)
)
df2 <- data.frame(
id = unique(df1$id)
)
sub_1 <- subset(df1, time == 1)
sub_2 <- subset(df1, time == 2)
sub_3 <- subset(df1, time == 3)
df2<-merge(df2,sub_1,by.x = "id",by.y = "id", all=TRUE)
df2<-merge(df2,sub_2,by.x = "id",by.y = "id", all=TRUE)
df2<-merge(df2,sub_3,by.x = "id",by.y = "id", all=TRUE)
#df2
id time.x amount1.x amount2.x amount3.x amount4.x time.y amount1.y amount2.y amount3.y amount4.y time amount1 amount2 amount3 amount4
1 1 1 6.558261 -17.713007 46.477430 195.061597 2 18.5453843 269.7406808 132.588713 80.40133 3 24.943217 488.1025 103.473479 198.51302
2 2 1 15.736044 230.018563 72.604346 -2.513162 2 48.8537058 356.5593748 161.239261 246.25985 3 35.559262 406.4749 66.278064 30.11592
3 3 1 8.057720 386.814867 101.997370 152.269564 2 0.7334493 0.7842648 66.603965 156.12478 3 42.170220 450.0306 195.872986 109.73098
4 4 1 15.575282 527.033563 37.403278 197.529341 2 37.8372445 370.0410836 6.074847 273.46715 3 20.302206 290.0026 -2.101649 112.88488
5 5 1 4.230635 427.294382 112.771237 199.401096 2 15.3735066 376.8945806 104.382371 224.09730 3 8.050933 291.6123 53.660734 270.37200
6 6 1 29.087870 9.330858 129.400932 70.801129 2 38.9966662 421.9258798 -3.891286 290.59259 3 17.919554 581.1735 137.100314 129.78561
7 7 1 4.380303 463.658580 4.120219 56.527016 2 6.0582455 484.4981686 67.820164 72.05615 3 43.556746 170.0745 41.134708 247.99512
My main issue with this is that as m and n increase, the method becomes long and ugly. Is there a cleaner way to do this, ideally a one-liner, so that I don't have to make, say, 15 subsets when m = 15?
Thanks
You just need your original df1 dataset and do this:
library(tidyverse)
df1 %>%
group_split(time) %>% # create your subsets and store them as a list of data frames
reduce(left_join, by = "id") # sequentially join those subsets
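If a dependency-free route is preferred, base R's reshape() can produce a similar wide table in one call. This is an alternative technique, not the approach of the answer above; the sketch uses a cut-down version of df1 with two amount columns, and the resulting column names take the form amountN.time instead of the .x/.y suffixes.

```r
set.seed(129)
df1 <- data.frame(
  id      = rep(1:7, 3),
  time    = rep(1:3, each = 7),
  amount1 = runif(21, 0, 50),
  amount2 = runif(21, -20, 600)
)

# One row per id, one set of amount columns per value of time
wide <- reshape(df1, idvar = "id", timevar = "time", direction = "wide")
names(wide)
```

Because the time value is part of each column name, this scales to any m without suffix clashes.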
I have data on college course completions, with estimated numbers of students from each cohort completing after 1, 2, 3, ... 7 years. I want to use these estimates to calculate the total number of students outputting from each College and Course in any year.
The output of students in a given year will be the sum of the previous 7 cohorts outputting after 1, 2, 3, ... 7 years.
For example, the number of students outputting in 2014 from COLLEGE 1, COURSE A is equal to the sum of:
Output of 2013 cohort (College 1, Course A) after 1 year +
Output of 2012 cohort (College 1, Course A) after 2 years +
Output of 2011 cohort (College 1, Course A) after 3 years +
Output of 2010 cohort (College 1, Course A) after 4 years +
Output of 2009 cohort (College 1, Course A) after 5 years +
Output of 2008 cohort (College 1, Course A) after 6 years +
Output of 2007 cohort (College 1, Course A) after 7 years
So there are two dataframes: a lookup table that contains all the output estimates, and a smaller summary table that I'm trying to modify. I want to update dummy.summary$output with, for each row, the total output based on the above calculation.
The following code replicates my data pretty well:
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
college = rep(rep(paste("College", 1:6), each = 35), 17),
course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
output.year = rep(1:7, 510),
output = sample(x = 10:20, size = 3570, replace=TRUE))
# Summary table to be modified
dummy.summary <- aggregate(x = dummy.lookup["intake"], by = list(dummy.lookup$cohort, dummy.lookup$college, dummy.lookup$course), FUN = mean)
names(dummy.summary)[1:3] <- c("year", "college", "course")
dummy.summary <- dummy.summary[order(dummy.summary$year, dummy.summary$college, dummy.summary$course), ]
dummy.summary$output <- 0
The following code does not work, but shows the approach I've been attempting.
dummy.summary$output <- sapply(dummy.summary$output, function(x){
# empty vector to fill with output values
vec <- c()
# Find relevant output for college + course, from each cohort and exit year
for(j in 1:7){
append(x = vec,
values = dummy.lookup[dummy.lookup$college==dummy.summary[x, "college"] &
dummy.lookup$course==dummy.summary[x, "course"] &
dummy.lookup$cohort==dummy.summary[x, "year"]-j &
dummy.lookup$output.year==j, "output"])
}
# Sum and return total output
sum_vec <- sum(vec)
return(sum_vec)
}
)
I guess it doesn't work because I was hoping to use 'x' in the anonymous function to index particular values of the dummy.summary dataframe. But that clearly isn't happening and is only returning zero for each row, presumably because the starting value of 'x' is zero each time. I don't know if it is possible to access the index position of each value that sapply loops over, and use that to index my summary dataframe.
Is this approach fixable or do I need a completely different approach?
Even if it is fixable, is there a more elegant/faster way to achieve what I'm trying to do?
Thanks in anticipation.
I've converted your output.year into output.year2 = cohort + output.year, so instead of a value from 1 to 7 it holds the calendar year in which those students complete.
The output information you want corresponds to output.year2, while the intake information corresponds to the cohort. So I calculate them separately and then join the two tables. This automatically creates empty output info for 1998 (NA, which I replace with 0).
# fix your random sampling
set.seed(24)
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
college = rep(rep(paste("College", 1:6), each = 35), 17),
course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
output.year = rep(1:7, 510),
output = sample(x = 10:20, size = 3570, replace=TRUE))
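# note: dummy.lookup has no column named yr, so this line selects no rows and leaves output unchanged
dummy.lookup$output[dummy.lookup$yr %in% 1:2] <- 0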
library(dplyr)
# create result table for output info
dt_output =
dummy.lookup %>%
mutate(output.year2 = output.year+cohort) %>% # update output.year to get a year value
group_by(output.year2, college, course) %>% # for each output year, college, course
summarise(SumOutput = sum(output)) %>% # calculate sum of output
ungroup() %>%
arrange(college,course,output.year2) %>% # for visualisation purposes
rename(cohort = output.year2) # rename column
# create result for intake info
dt_intake =
dummy.lookup %>%
select(cohort, college, course, intake) %>% # select useful columns
distinct() # keep distinct rows/values
# join info
dt_intake %>%
full_join(dt_output, by=c("cohort","college","course")) %>%
mutate(SumOutput = ifelse(is.na(SumOutput),0,SumOutput)) %>%
arrange(college,course,cohort) %>% # for visualisation purposes
tbl_df() # for printing purposes
# Source: local data frame [720 x 5]
#
# cohort college course intake SumOutput
# (int) (fctr) (fctr) (int) (dbl)
# 1 1998 College 1 Course A 194 0
# 2 1999 College 1 Course A 198 11
# 3 2000 College 1 Course A 223 29
# 4 2001 College 1 Course A 198 45
# 5 2002 College 1 Course A 289 62
# 6 2003 College 1 Course A 163 78
# 7 2004 College 1 Course A 211 74
# 8 2005 College 1 Course A 181 108
# 9 2006 College 1 Course A 277 101
# 10 2007 College 1 Course A 157 109
# .. ... ... ... ... ...
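As for whether the original sapply approach is fixable: it seems so, if sapply iterates over row indices rather than over the values of the output column, and if the sum is taken directly instead of using append() (whose result was never assigned). A minimal base R sketch, using a smaller made-up lookup table (the names lookup and summary_tbl are placeholders for illustration):

```r
# Made-up lookup table: 2 colleges x 2 courses, cohorts 2000-2010, exit years 1-7
set.seed(1)
lookup <- expand.grid(output.year = 1:7,
                      course  = c("Course A", "Course B"),
                      college = c("College 1", "College 2"),
                      cohort  = 2000:2010,
                      stringsAsFactors = FALSE)
lookup$output <- sample(10:20, nrow(lookup), replace = TRUE)

# Summary table: one row per year / college / course
summary_tbl <- unique(lookup[c("cohort", "college", "course")])
names(summary_tbl)[1] <- "year"

# Iterate over row indices so that i can index the summary table
summary_tbl$output <- sapply(seq_len(nrow(summary_tbl)), function(i) {
  # cohort + output.year == year covers cohort (year - j) exiting after j years
  hit <- lookup$college == summary_tbl$college[i] &
         lookup$course  == summary_tbl$course[i] &
         lookup$cohort + lookup$output.year == summary_tbl$year[i]
  sum(lookup$output[hit])
})
head(summary_tbl)
```

This scans the lookup table once per summary row, so the grouped approach above is likely to be faster on large data.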