I have data that looks like this:
Data <- "Person Address Starting.Date Resignation.Date Job
John abc 01.01.2017 03.01.2017 IT
Sarah cde 06.01.2017 06.07.2017 Teacher
Susi bfg 09.06.2017 08.09.2017 secretary"
Data <- read.table(text = Data, header = TRUE)
My goal is to find out how long people stayed in their job before quitting and put that information in a new variable. So I check whether the resignation date falls in a certain date span, using this code:
Data$Span<- ifelse(Data$Resignation.Date>= "01.01.2017" & Data$Resignation.Date <= "31.01.2017", 1,
ifelse(Data$Resignation.Date>= "01.02.2017" & Data$Resignation.Date <= "28.02.2017", 2,
ifelse(Data$Resignation.Date>= "01.03.2017" & Data$Resignation.Date <= "31.03.2017", 3,
ifelse(Data$Resignation.Date>= "01.04.2017" & Data$Resignation.Date <= "30.04.2017", 4,
ifelse(Data$Resignation.Date>="01.05.2017" & Data$Resignation.Date <= "31.05.2017",5,
ifelse(Data$Resignation.Date>="01.06.2017" & Data$Resignation.Date<="30.06.2017",6,
ifelse(Data$Resignation.Date>="01.07.2017" & Data$Resignation.Date<="31.07.2017",7,
ifelse(Data$Resignation.Date>="01.08.2017" & Data$Resignation.Date<="31.08.2017", 8,
ifelse(Data$Resignation.Date>="01.09.2017" & Data$Resignation.Date<="30.09.2017", 9,
ifelse(Data$Resignation.Date>="01.10.2017" & Data$Resignation.Date<="31.10.2017",10,
ifelse(Data$Resignation.Date>="01.11.2017" & Data$Resignation.Date<="30.11.2017", 11,
ifelse(Data$Resignation.Date>="01.12.2017" & Data$Resignation.Date<="31.12.2017",12,999))))))))))))
The data I presented is a subset of people who started working in January. I have subsets for all 12 months of 2017. What I want to do is use the same code for people who started working in February, March, and so on. To do this I would have to alter the code so that the first line starts one month later and every following line is shifted by one month as well. For example, for the February subset it would start with
Data$Resignation.Date>= "01.02.2017" & Data$Resignation.Date <= "28.02.2017", 1,
and end with
ifelse(Data$Resignation.Date>="01.01.2018" & Data$Resignation.Date<="31.01.2018",12,999
Is there any way to do this without copy-pasting the code and making the changes manually for every month? Since the changes follow a systematic pattern, I would think it is possible, but I could not find a solution. I looked in the dplyr package since I thought my problem fit there, but that did not help me. I would be very thankful for any advice. Of course I will happily answer any remaining questions.
P.S.: I am not attached to using the subsets; that was just easier for me to work with since I am not so experienced in R. I filtered the subsets using this code:
Data <- TotalData[TotalData$Starting.Date>= "01.01.2017" & TotalData$Starting.Date <= "31.01.2017",]
I think this code should be sufficient for what you need:
The logic: if the starting date and the resignation date have the same month, the code is 1; otherwise it is the number of whole months between the two dates, i.e. how many months the employee stayed with the company.
library(lubridate)
Data$Starting.Date <- dmy(Data$Starting.Date)
Data$Resignation.Date <- dmy(Data$Resignation.Date)
Data$code <- ifelse(month(Data$Starting.Date) == month(Data$Resignation.Date), 1, (interval(Data$Starting.Date, Data$Resignation.Date) %/% months(1)))
Data:
Data <- structure(list(Person = structure(1:4, .Label = c("John", "johnyy",
"Sarah", "Susi"), class = "factor"), Address = structure(c(1L,
1L, 3L, 2L), .Label = c("abc", "bfg", "cde"), class = "factor"),
Starting.Date = structure(c(17167, 17199, 17172, 17326), class = "Date"),
Resignation.Date = structure(c(17169, 17199, 17353, 17417
), class = "Date"), Job = structure(c(1L, 1L, 3L, 2L), .Label = c("IT",
"secretary", "Teacher"), class = "factor"), code = c(1, 2,
999, 999)), row.names = c(NA, -4L), class = "data.frame")
You could do it with the lubridate package to get the time a person stayed in the company.
library(lubridate)
Data <- "Person Address Starting.Date Resignation.Date Job
John abc 01.01.2017 03.01.2017 IT
Sarah cde 06.01.2017 06.07.2017 Teacher
Susi bfg 09.06.2017 08.09.2017 secretary"
Data <- read.table(text=Data, header = TRUE)
Data$Starting.Date = dmy(Data$Starting.Date)
Data$Resignation.Date = dmy(Data$Resignation.Date)
time.interval <- Data$Starting.Date %--% Data$Resignation.Date
time.period <- as.period(time.interval)
time.period <- month(time.period)
Data$Span <- time.period
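If you would rather stay in base R and reproduce the 1 to 12 coding from the question (calendar months counted from the starting month, 999 otherwise) without any per-month subsets, a minimal sketch could look like this. `month_index` is a helper name I made up, and I'm assuming the dates are always in `dd.mm.yyyy` format:

```r
Data <- read.table(text = "Person Address Starting.Date Resignation.Date Job
John abc 01.01.2017 03.01.2017 IT
Sarah cde 06.01.2017 06.07.2017 Teacher
Susi bfg 09.06.2017 08.09.2017 secretary", header = TRUE, stringsAsFactors = FALSE)

# Parse the dd.mm.yyyy strings into real Date objects
start <- as.Date(Data$Starting.Date, format = "%d.%m.%Y")
end   <- as.Date(Data$Resignation.Date, format = "%d.%m.%Y")

# Hypothetical helper: a running month number, so differences work across years
month_index <- function(d) {
  as.integer(format(d, "%Y")) * 12L + as.integer(format(d, "%m"))
}

# 1 = resigned in the starting month, 2 = the month after, and so on
Data$Span <- month_index(end) - month_index(start) + 1L

# Same fallback code as in the question for anything outside 1..12
Data$Span[Data$Span < 1L | Data$Span > 12L] <- 999L

print(Data[, c("Person", "Span")])
```

Because the span is computed relative to each person's own starting month, the same lines should work on the full data set without splitting it by month first.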
I am currently working on listening data of a music platform in R.
I have a subset (listening.subset) of the total data set. It contains 6 columns (USER, artist, Week, COUNT, user_type, binary).
Each user can either be a focal user, a friend, or a neighbour. There are separate data sets that link focal users to their friends (friend.data) and neighbours (neighbour.data), but I added a column to indicate the type of user.
Now, I have the following for-loop to indicate whether a friend has listened to an artist in the 10 weeks before the focal user has listened to that same artist. If that is the case, the binary column must show a 0, else a 1.
listening.subset$binary <- NA
for (i in 1:count(listening.subset)$n) {
test_user <- listening.subset[i,]
test_week <- test_user$Week
test_artist <- test_user$artist
if (test_user$user_type == "friend") {
foc <- vlookup(test_user$USER, friend.data, result_column = 1, lookup_column = 2)
prior_listen <- listening.subset %>% filter(USER == foc) %>% group_by(artist) %>% filter(test_week >= (Week -10) & test_week <= Week) %>% filter(artist == test_artist)
if (nrow(prior_listen) > 0) {
listening.subset[i,]$binary <- 0
}
else(
listening.subset[i,]$binary <- 1)
}
}
The problem with this for-loop is that it takes too long to apply to the full data set. Therefore, I want to apply vectorization. However, this concept is vague to me, and after reading up on it online I still do not have a clue how I should adjust my code.
I hope someone knows how to use vectorization and could help me.
EDIT1: the total data set contains around 50 million entries. However, I could split it up in 10 data sets of 5 million each.
EDIT2: listening.subset:
"clubanddeform", "HyprMusic", "Peter-182", "komosionmel", "SHHitsKaty",
"Sonik_Villa", "Haalf"), artist = c("Justin Timberlake", "Ediya",
"Lady Gaga", "El Guincho", "Lighthouse Family", "Pidżama Porno",
"The Men", "Modest Mouse", "Com Truise", "April Smith and The Great Picture Show"
), Week = c(197L, 213L, 411L, 427L, 443L, 232L, 431L, 312L, 487L,
416L), COUNT = c(1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 6L, 11L), user_type = c("friend",
"friend", "friend", "friend", "neighbour", "friend", "neighbour",
"friend", "focal", "friend"), binary = c(1, 1, 1, 1, NA, 1, NA,
1, NA, 1)), row.names = c(NA, 10L), class = "data.frame")
Where Week is an indicator for which week the user listened to the particular band (ranging between 1 and 527), and COUNT equals the amount of times the user has listened to that artist in that particular week.
Recap: The binary variable should indicate whether the "friend user" has listened to the same band as the "focal user", in the 10 weeks before the focal user played the band. The social connections can be found in the friend.data, which is depicted below.
structure(list(USER = c("TheMariner", "TheMariner", "TheMariner",
"TheMariner", "TheMariner", "TheMariner", "TheMariner", "TheMariner",
"TheMariner", "TheMariner"), FRIEND = c("npetrovay", "marno25",
"lennonstarr116", "sachmonkey", "andrewripp", "daledrops", "Skittlebite",
"Ego_Trippin", "pistolgal5", "jjollett")), row.names = c(NA,
10L), class = "data.frame")
For each of the 190 focal users (first column), the friends are listed next to it, in the second column.
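One possible way to express the recap above without a row-by-row loop is to replace the per-row lookup with two joins and a single vectorized comparison. This is only a sketch on made-up toy data: the users `f1`, `f2`, `focal1` and the names `links`, `pairs` and `row_id` are hypothetical, and I'm assuming the 10-week window means the friend's week lies between the focal user's week minus 10 and the focal user's week.

```r
# Toy stand-ins for listening.subset and friend.data (hypothetical values)
listening <- data.frame(
  USER      = c("f1", "f2", "focal1", "focal1"),
  artist    = c("A", "B", "A", "B"),
  Week      = c(100L, 50L, 105L, 100L),
  user_type = c("friend", "friend", "focal", "focal"),
  stringsAsFactors = FALSE
)
friends <- data.frame(USER = c("focal1", "focal1"),
                      FRIEND = c("f1", "f2"),
                      stringsAsFactors = FALSE)

# Keep track of the original rows so results can be written back
listening$row_id <- seq_len(nrow(listening))
is_friend <- listening$user_type == "friend"

# Rename so the friend column can serve as the join key
links <- setNames(friends, c("focal", "USER"))

# Step 1: attach the focal user(s) to every friend row
friend_rows <- merge(listening[is_friend, c("row_id", "USER", "artist", "Week")],
                     links, by = "USER")

# Step 2: attach the focal user's listens of the same artist
focal_listens <- setNames(
  listening[listening$user_type == "focal", c("USER", "artist", "Week")],
  c("focal", "artist", "focal_week")
)
pairs <- merge(friend_rows, focal_listens, by = c("focal", "artist"))

# Step 3: one vectorized test of the 10-week window
hit <- pairs$Week >= pairs$focal_week - 10 & pairs$Week <= pairs$focal_week

# 0 if the friend listened in the window, 1 otherwise, NA for non-friends
listening$binary <- NA_real_
listening$binary[is_friend] <- 1
listening$binary[listening$row_id %in% pairs$row_id[hit]] <- 0
```

On 50 million rows the intermediate joins may need a lot of memory, so processing the data in chunks of focal users might still be necessary.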
I am trying to write a loop in R that creates a new variable based on a table of conditional outcomes.
I have four treatment groups (A, B, C, D). Each treatment group pays a different price at three different time periods (day, dinner, night).
Treatment Group   Day Price   Dinnertime Price   Night Price
A                 10          20                 7
B                 11          25                 8
C                 12          30                 9
D                 13          35                 10
The time period is recorded as a given "hour" (day is hours 8-17, dinner is from 17-19 and night is from 19-0 and 0-8).
Person     Hour   Usage
Person 1   1      0
Person 1   2      0
Person 2   20     5
Person 3   17     6
Based on both treatment group (A, B, C and D) and time of day (night, day, dinnertime), I would like to create a new vector of prices.
Ideally, I would create dummy variables for each of the time periods (day, night and dinner) based on these hourly conditions. However, my data set is pretty large (24 observations per person per day) so I'm looking for a more elegant solution.
In plain language, I want this:
if group==A & time==night, then price=7 --> and this information saved in a new variable "price"
Any advice?
Edit: Question is about the loop with two conditions. Is there a way to refer this directly to the data-frame with the treatment groups and tariffs or do I just need to write it manually?
Assuming you have some way of including a column for the group each person belongs to in the dataframe with the transactions, something like this may work for you:
df.pricing <- structure(list(Treatment.Group = c("A", "B", "C", "D"), Day.Price = 10:13,
Dinnertime.Price = c(20L, 25L, 30L, 35L), Night.Price = 7:10),
.Names = c("Treatment.Group", "Day.Price", "Dinnertime.Price", "Night.Price"),
class = "data.frame",
row.names = c(NA, -4L))
df.transactions <- structure(list(Person = c("Person1", "Person1", "Person2", "Person3", "Person4"),
Hour = c(1L, 2L, 20L, 17L, 9L),
Usage = c(0L, 0L, 5L, 6L, 2L)),
.Names = c("Person", "Hour", "Usage"),
class = "data.frame", row.names = c(NA, -5L))
# Add the group that each person belongs to
df.transactions$group <- c("A","A","B","C","D")
# Get the transaction price
df.transactions$price <- apply(df.transactions, 1, function(x){
hour <- as.numeric(x[["Hour"]])
price <- ifelse(hour >= 8 & hour <= 16, df.pricing[df.pricing$Treatment.Group == x[["group"]], "Day.Price"],
ifelse((hour > 16 & hour <= 18), df.pricing[df.pricing$Treatment.Group == x[["group"]], "Dinnertime.Price"],
df.pricing[df.pricing$Treatment.Group == x[["group"]], "Night.Price"]))})
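Since the data set is large, a variant without the row-wise `apply` may be worth trying. The sketch below uses the same hour boundaries as the code above, but looks the prices up with matrix indexing; `period`, `prices`, `row_idx` and `col_idx` are names I introduced for illustration.

```r
df.pricing <- data.frame(Treatment.Group = c("A", "B", "C", "D"),
                         Day.Price = 10:13,
                         Dinnertime.Price = c(20L, 25L, 30L, 35L),
                         Night.Price = 7:10,
                         stringsAsFactors = FALSE)
df.transactions <- data.frame(Person = c("Person1", "Person1", "Person2", "Person3", "Person4"),
                              Hour = c(1L, 2L, 20L, 17L, 9L),
                              Usage = c(0L, 0L, 5L, 6L, 2L),
                              group = c("A", "A", "B", "C", "D"),
                              stringsAsFactors = FALSE)

# Map every hour to the name of the matching price column (same cut-offs as above)
hour <- df.transactions$Hour
period <- ifelse(hour >= 8 & hour <= 16, "Day.Price",
          ifelse(hour > 16 & hour <= 18, "Dinnertime.Price", "Night.Price"))

# Price table as a matrix: one row per treatment group, one column per period
prices <- as.matrix(df.pricing[, c("Day.Price", "Dinnertime.Price", "Night.Price")])

# Row = the person's group, column = the period; index with a two-column matrix
row_idx <- match(df.transactions$group, df.pricing$Treatment.Group)
col_idx <- match(period, colnames(prices))
df.transactions$price <- prices[cbind(row_idx, col_idx)]
```

Everything here operates on whole columns at once, so no explicit loop over the transactions is needed.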
I have a dataset with 2 months of data (February and March). How can I split the data into 59 subsets by day and save each one as a data frame (28 days for Feb and 31 days for Mar)? Preferably each data frame would be named according to its date, i.e. 20140201, 20140202 and so forth.
df <- structure(list(text = structure(c(4L, 6L, 5L, 2L, 8L, 1L), .Label = c(" Terpilih Jadi Maskapai dengan Pelayanan Kabin Pesawat cont",
"booking number ZEPLTQ I want to cancel their flight because they can not together with my wife and kids",
"Can I change for the traveler details because i choose wrongly for the Mr or Ms part",
"cant do it with cards either", "Coming back home AK", "gotta try PNNL",
"Jadwal penerbangan medanjktsblm tangalmasi ada kah", "Me and my Tart would love to flyLoveisintheAir",
"my flight to Bangkok onhas been rescheduled I couldnt perform seat selection now",
"Pls checks his case as money is not credited to my bank acctThanks\n\nCASLTP",
"Processing fee Whatt", "Tacloban bound aboardto get them boats Boats boats boats Tacloban HeartWork",
"thanks I chatted with ask twice last week and told the same thing"
), class = "factor"), created = structure(c(1L, 1L, 2L, 2L, 3L,
3L), .Label = c("1/2/2014", "2/2/2014", "5/2/2014", "6/2/2014"
), class = "factor")), .Names = c("text", "created"), row.names = c(NA,
6L), class = "data.frame")
You don't need to output multiple dataframes. You only need to select/subset them by the month and year of the 'created' field. Here are two ways to do that; the first is simpler if you don't plan on needing any more date arithmetic:
# 1. Leave 'created' a string, just use text substitution to extract its month&date components
df$created_mthyr <- gsub( '([0-9]+/)[0-9]+/([0-9]+)', '\\1\\2', df$created )
# 2. If you need to do arbitrary Date arithmetic, convert 'created' field to Date object
# in this case you need an explicit format-string ('%m' is the month; '%M' would be minutes)
df$created <- as.Date(df$created, '%m/%d/%Y')
# Now you can do either a) split
split(df, df$created_mthyr)
# specifically if you want to assign the output it creates to 3 dataframes:
df1 <- split(df, df$created_mthyr)[[1]]
df2 <- split(df, df$created_mthyr)[[2]]
df5 <- split(df, df$created_mthyr)[[3]]
# ...or else b) do a Split-Apply-Combine and perform arbitrary command on each separate subset. This is very powerful. See plyr/ddply documentation for examples.
require(plyr)
df1 <- dlply(df, .(created_mthyr))[[1]]
df2 <- dlply(df, .(created_mthyr))[[2]]
df5 <- dlply(df, .(created_mthyr))[[3]]
# output looks like this - strictly you might not want to keep 'created','created_mthyr':
> df1
# text created created_mthyr
#1 cant do it with cards either 1/2/2014 1/2014
#2 gotta try PNNL 1/2/2014 1/2014
> df2
#3
#Coming back home AK
#4 booking number ZEPLTQ I want to cancel their flight because they can not together with my wife and kids
# created created_mthyr
#3 2/2/2014 2/2014
#4 2/2/2014 2/2014
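If one subset per day with names like 20140201 is really what is wanted, a named list is usually easier to handle than 59 separate variables. A minimal sketch, assuming the 'created' strings are day/month/year (the question says the data covers February and March) and using shortened placeholder texts instead of the real tweets:

```r
# Toy data with the same 'created' values as in the question
df <- data.frame(text = c("a", "b", "c", "d", "e", "f"),
                 created = c("1/2/2014", "1/2/2014", "2/2/2014",
                             "2/2/2014", "5/2/2014", "5/2/2014"),
                 stringsAsFactors = FALSE)

# Assumption: day comes first, then month
df$date <- as.Date(df$created, format = "%d/%m/%Y")

# One data frame per day, named yyyymmdd
by_day <- split(df, format(df$date, "%Y%m%d"))

names(by_day)
by_day[["20140201"]]
```

Individual days can then be pulled out by name, e.g. `by_day[["20140205"]]`, or processed in one go with `lapply(by_day, ...)`.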
I have a dataframe with some observations of when lines were attached to IDs.
I need the period of time in days when each ID had a line/catheter attached.
Here is my dput return:
structure(list(ID = c(487622L, 487622L, 487639L, 487639L, 489027L,
489027L, 489027L, 491858L, 491858L, 491858L, 491858L, 491858L,
491858L), Line = c("Central Venous Line", "Central Venous Line",
"Central Venous Line", "Peripherally Inserted Central Catheter (PICC)",
"Haemodialysis Catheter", "Peripherally Inserted Central Catheter (PICC)",
"Haemodialysis Catheter", "Central Venous Line", "Haemodialysis Catheter",
"Central Venous Line", "Haemodialysis Catheter", "Central Venous Line",
"Peripherally Inserted Central Catheter (PICC)"), Start = structure(c(1362528000,
1363219200, 1362268800, 1363219200, 1364774400, 1365120000, 1365465600,
1364688000, 1364688000, 1365724800, 1365724800, 1366848000, 1369353600
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), End = structure(c(1362787200,
1363824000, 1363305600, 1363737600, 1365465600, 1366675200, 1365638400,
1365724800, 1365724800, 1366329600, 1366848000, 1367539200, 1369612800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Days = c("3.095138889",
"7.045138889", "11.87777778", "5.736111111", "7.850694444", "18.02083333",
"1.813888889", "12.32986111", "12.71388889", "6.782638889", "13.14027778",
"7.718055556", "3.397222222"), dateOrder = c(1L, 2L, 1L, 2L,
1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L)), .Names = c("ID", "Line",
"Start", "End", "Days", "dateOrder"), row.names = 79:91, class = "data.frame")
Here is the catch. It does not matter if an ID has more than one line/catheter. I just need to take the earliest start date for each ID, the latest end date for each ID, and calculate the number of continuous days each ID has a line/catheter attached.
The problem is confounded by some cases, e.g. ID 491858. This individual had a line removed (dateOrder = 5) on 2013-05-03 and reinserted on 2013-05-24 for just over 3 days.
How I intended to handle this is to subtract the gap (number of days) from the number of days of continuous time between min(Start Date) and max(end date).
There are over 20,000 records in the data set.
Here is what I have done so far:
Converted the DF to a list of DFs based on ID.
I intended to apply a function to each DF something as follows:
If the difference in time (days) between subsequent start date and previous end date for each row exceeds 0, then add TRUE or some arbitrary column value to each data frame.
function(y){
for (i in length(y)){
if(difftime(y$Start[i+1], y$End[i], units='days') > 0){
y$test <- TRUE}
}
}
Any help would be greatly appreciated.
Thanks.
UPDATE
Ignore the days column. It is of no use. I intend to aggregate month line counts from the unique cases.
I guess something like this might help, unless I've misunderstood something:
unlist(lapply(split(DF, DF$ID),
function(x) { totaldays <- max(x$End) - min(x$Start);
x$Start <- c(x$Start[-1], NA);
res <- difftime(x$Start[-length(x$Start)], x$End[-length(x$Start)], units = "days");
res <- res[res > 0];
res <- ifelse(length(res) == 0, 0, res);
return(as.numeric(totaldays - res)) }))
#487622 487639 489027 491858
# 10 17 22 36
DF is your dput.
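The `ifelse` in the function above only keeps the first gap, so an ID with several line-free periods would be overcounted. A sketch that subtracts every gap, shown on the rows of ID 491858 from the dput (`attached_days` is a hypothetical helper name, and I'm assuming whole seconds since the epoch as in the POSIXct columns):

```r
# Start/End of ID 491858, copied from the dput in the question
start <- as.POSIXct(c(1364688000, 1364688000, 1365724800,
                      1365724800, 1366848000, 1369353600),
                    origin = "1970-01-01", tz = "UTC")
end   <- as.POSIXct(c(1365724800, 1365724800, 1366329600,
                      1366848000, 1367539200, 1369612800),
                    origin = "1970-01-01", tz = "UTC")

# Hypothetical helper: days between first start and last end, minus all gaps
attached_days <- function(start, end) {
  ord <- order(start)
  s <- as.numeric(start[ord])   # seconds since epoch
  e <- as.numeric(end[ord])
  # Latest end seen so far, so overlapping lines do not create false gaps
  covered_until <- cummax(e)
  # A gap exists where the next line starts after everything before it has ended
  gaps <- pmax(s[-1] - covered_until[-length(e)], 0)
  (max(e) - min(s) - sum(gaps)) / 86400
}

attached_days(start, end)
```

For the full data this could be applied per ID, for example with `sapply(split(DF, DF$ID), function(x) attached_days(x$Start, x$End))`.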
If I understand correctly, you want the total amount of days that the catheter was present. To do that, I would use plyr
#assume df is your dput object
library(plyr)
day.summary <- ddply(df, "ID", function(x) data.frame(total.days = sum(as.numeric(x$Days))))
print(day.summary)
ID total.days
1 487622 10.14028
2 487639 17.61389
3 489027 27.68542
4 491858 56.08194
A novice R user here. I have a data set formatted like:
Date Temp Month
1-Jan-90 10.56 1
2-Jan-90 11.11 1
3-Jan-90 10.56 1
4-Jan-90 -1.67 1
5-Jan-90 0.56 1
6-Jan-90 10.56 1
7-Jan-90 12.78 1
8-Jan-90 -1.11 1
9-Jan-90 4.44 1
10-Jan-90 10.00 1
In R syntax:
datacl <- structure(list(Date = structure(1:10, .Label = c("1990/01/01",
"1990/01/02", "1990/01/03", "1990/01/04", "1990/01/05", "1990/01/06",
"1990/01/07", "1990/01/08", "1990/01/09", "1990/01/10"), class = "factor"),
Temp = c(10.56, 11.11, 10.56, -1.67, 0.56, 10.56, 12.78,
-1.11, 4.44, 10), Month = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L)), .Names = c("Date", "Temp", "Month"), class = "data.frame", row.names = c(NA,
-10L))
I would like to subset the data for a particular month, apply a change factor to the temp, and then save the results. So I have something like
idx <- subset(datacl, Month == 1) # Index
results[idx[,2],1] = idx[,2]+change # change applied to only index values
but I keep getting an error like
Error in results[idx[, 2], 1] = idx[, 2] + change:
only 0's may be mixed with negative subscripts
Any help would be appreciated.
First, give the change factor a value:
change <- 1
Now, here is how to create an index:
# one approach to subsetting is to create a logical vector:
jan.idx <- datacl$Month == 1
# alternatively the which function returns numeric indices:
jan.idx2 <- which(datacl$Month == 1)
If you want just the subset of data from January,
jandata <- datacl[jan.idx,]
transformed.jandata <- transform(jandata, Temp = Temp + change)
To keep the entire data frame, but only add the change factor to Jan temps:
datacl$Temp[jan.idx] <- datacl$Temp[jan.idx] + change
First, note that subset does not produce an index, it produces a subset of your original dataframe containing all rows with Month == 1.
Then when you are doing idx[,2], you are selecting out the Temp column.
results[idx[,2],1] = idx[,2] + change
But then you are using these as an index into results, i.e. you're using them as row numbers. Row numbers can't be things like 10.56 or -1.11, hence your error. Also, you're selecting the first column of results which is Date and trying to add temperatures to it.
There are a few ways you can do this.
You can create a logical index that is TRUE for a row with Month == 1 and FALSE otherwise like so:
idx <- datacl$Month == 1
Then you can use that index to select the rows in datacl you want to modify (this is what you were trying to do originally, I think):
datacl$Temp[idx] <- datacl$Temp[idx] + change # or 'results' instead of 'datacl'?
Note that datacl$Temp[idx] selects the Temp column of datacl and the idx rows.
You could also do
datacl[idx,'Temp']
or
datacl[idx,2] # as Temp is the second column.
If you only want results to be the subset where Month == 1, try:
results <- subset(datacl, Month == 1)
results$Temp <- results$Temp + change
This is because results only contains the rows you are interested in, so there's no need to do subsetting.
Personally, I would use ifelse() and leverage the syntactic beauty that is within() for a nice one liner datacl <- within(datacl, Temp <- ifelse(Month == 1, Temp + change,Temp)). Well, I said one liner, but you'd need to define change somewhere else too.