How do I avoid using for-loops - r

I am currently working on listening data of a music platform in R.
I have a subset (listening.subset) of the total data set. It contains 6 columns (USER, artist, Week, COUNT, user_type, binary).
Each user can either be a focal user, a friend, or a neighbour. There are separate data sets that link focal users to their friends (friend.data) and neighbours (neighbour.data), but I added a column to indicate the type of user.
Now, I have the following for-loop to indicate whether a friend has listened to an artist in the 10 weeks before the focal user has listened to that same artist. If that is the case, the binary column must show a 0, else a 1.
listening.subset$binary <- NA
for (i in 1:count(listening.subset)$n) {
  test_user <- listening.subset[i,]
  test_week <- test_user$Week
  test_artist <- test_user$artist
  if (test_user$user_type == "friend") {
    foc <- vlookup(test_user$USER, friend.data, result_column = 1, lookup_column = 2)
    prior_listen <- listening.subset %>% filter(USER == foc) %>% group_by(artist) %>% filter(test_week >= (Week -10) & test_week <= Week) %>% filter(artist == test_artist)
    if (nrow(prior_listen) > 0) {
      listening.subset[i,]$binary <- 0
    }
    else(
      listening.subset[i,]$binary <- 1)
  }
}
The problem with this for-loop is that it takes too long to apply to the full data set. Therefore, I want to apply vectorization. However, this concept is vague to me, and after reading up on it online I still do not know how I should adjust my code.
I hope someone knows how to use vectorization and could help me.
EDIT1: the total data set contains around 50 million entries. However, I could split it up in 10 data sets of 5 million each.
EDIT2: listening.subset:
"clubanddeform", "HyprMusic", "Peter-182", "komosionmel", "SHHitsKaty",
"Sonik_Villa", "Haalf"), artist = c("Justin Timberlake", "Ediya",
"Lady Gaga", "El Guincho", "Lighthouse Family", "Pidżama Porno",
"The Men", "Modest Mouse", "Com Truise", "April Smith and The Great Picture Show"
), Week = c(197L, 213L, 411L, 427L, 443L, 232L, 431L, 312L, 487L,
416L), COUNT = c(1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 6L, 11L), user_type = c("friend",
"friend", "friend", "friend", "neighbour", "friend", "neighbour",
"friend", "focal", "friend"), binary = c(1, 1, 1, 1, NA, 1, NA,
1, NA, 1)), row.names = c(NA, 10L), class = "data.frame")
Here, Week indicates the week in which the user listened to the particular band (ranging between 1 and 527), and COUNT is the number of times the user listened to that artist in that particular week.
Recap: The binary variable should indicate whether the "friend user" has listened to the same band as the "focal user", in the 10 weeks before the focal user played the band. The social connections can be found in the friend.data, which is depicted below.
structure(list(USER = c("TheMariner", "TheMariner", "TheMariner",
"TheMariner", "TheMariner", "TheMariner", "TheMariner", "TheMariner",
"TheMariner", "TheMariner"), FRIEND = c("npetrovay", "marno25",
"lennonstarr116", "sachmonkey", "andrewripp", "daledrops", "Skittlebite",
"Ego_Trippin", "pistolgal5", "jjollett")), row.names = c(NA,
10L), class = "data.frame")
For each of the 190 focal users (first column), the friends are listed next to it, in the second column.
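One possible way to vectorise the loop, as a rough sketch rather than a tested solution for the full data: replace the per-row lookup with a single join between the friends' rows and the focal users' rows, then test the week window on the joined table. The toy data below is hypothetical, base R merge() stands in for the dplyr pipeline, and the sketch assumes each friend maps to one focal user (which is what the vlookup call in the loop effectively does).

```r
# Hypothetical toy data with the same column names as in the question
listening <- data.frame(
  USER      = c("f1", "f1", "f2", "foc", "foc"),
  artist    = c("X", "Y", "X", "X", "Y"),
  Week      = c(100L, 50L, 200L, 105L, 90L),
  user_type = c("friend", "friend", "friend", "focal", "focal"),
  stringsAsFactors = FALSE
)
friend.data <- data.frame(USER = c("foc", "foc"), FRIEND = c("f1", "f2"),
                          stringsAsFactors = FALSE)

# Friend rows, tagged with their original row number and their focal user
is_friend      <- listening$user_type == "friend"
friends        <- listening[is_friend, ]
friends$row_id <- which(is_friend)
friends$focal  <- friend.data$USER[match(friends$USER, friend.data$FRIEND)]

# Focal rows, renamed so they can be joined on (focal user, artist)
focal <- listening[listening$user_type == "focal", c("USER", "artist", "Week")]
names(focal) <- c("focal", "artist", "focal_week")

# One join instead of one filter per row
pairs <- merge(friends, focal, by = c("focal", "artist"))

# Same window as the loop: friend's week within the 10 weeks up to the focal week
in_window <- pairs$Week >= pairs$focal_week - 10 & pairs$Week <= pairs$focal_week
hit_rows  <- unique(pairs$row_id[in_window])

# 0 if a matching listen exists, 1 otherwise, NA for non-friend rows (as in the loop)
listening$binary <- NA_real_
listening$binary[is_friend] <- 1
listening$binary[hit_rows]  <- 0
```

With 50 million rows the join can become large, so running it per chunk of focal users (or with a keyed join in data.table) may be worth considering.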

Related

Populate vectors with 2 for loops

I may be misunderstanding how for loops work, but I'm having a hard time comprehending why the current code doesn't populate the vectors (they evidently remain NA, although the code itself runs). I imagine there may also be a way to subset all of this information using ifelse(), but I'm experiencing "coder's block".
Issue (elaborated): I am trying to code a running Electoral College projection based on a betting market from the 2008 presidential cycle, over the final 90 days until Election Day. I justify using two for loops because the code needs to check conditional statements on a particular day and add a particular value to a preexisting sum on that day. In other words, if the betting price for Obama is higher than for McCain on a particular day for a particular state, that state's electoral votes are awarded to Obama on that day, and vice versa. Again, the code runs, but the vectors apparently remain NA.
Key of Relevant Variables
EV, electoral votes of that particular state
X, a unique value assigned to each observation
day, date class
PriceD, betting price for the Dem candidate
PriceR, betting price for the Rep candidate
DaysToEday, a numeric value indicating the difference between variable day and election day (2008-11-04)
Code in Question
Obama08.ECvotesByDay <- McCain08.ECvotesByDay <- rep(NA, 90)
for (i in 1:90) {
  for (j in 1:nrow(subset(mpres08, mpres08$DaysToEday <= 90))){
    if(mpres08$PriceD[j] > mpres08$PriceR[j]) {
      Obama08.ECvotesByDay[i] <- Obama08.ECvotesByDay[i]+mpres08$EV[j]
    }
    else {
      McCain08.ECvotesByDay[i] <- McCain08.ECvotesByDay[i]+mpres08$EV[j]
    }
  }
}
dput of Data (five rows)
structure(list(state = c("AK", "AK", "AK", "AK", "AK"), state.name = c("Alaska",
"Alaska", "Alaska", "Alaska", "Alaska"), Obama = c(38L, 38L,
38L, 38L, 38L), McCain = c(59L, 59L, 59L, 59L, 59L), EV = c(3L,
3L, 3L, 3L, 3L), X = c(24073L, 25195L, 8773L, 25603L, 25246L),
day = structure(c(13937, 13959, 13637, 13967, 13960), class = "Date"),
PriceD = c(7.5, 7.5, 10, 8, 7.5), VolumeD = c(0L, 0L, 0L,
0L, 0L), PriceR = c(92.5, 92.5, 90, 92, 92.5), VolumeR = c(0L,
0L, 0L, 0L, 0L), DaysToEday = c(250, 228, 550, 220, 227)), row.names = c(NA,
5L), class = "data.frame")
You are adding a number to NA, and for R the result is NA.
Obama08.ECvotesByDay[i] and McCain08.ECvotesByDay[i] are initialised with NA. In R, if you try to do arithmetic with NA it stays NA (e.g. NA + 1 results in NA). Depending on what is a neutral result for you, you could initialise the vectors in the beginning with 0:
Obama08.ECvotesByDay <- McCain08.ECvotesByDay <- rep(0, 90)
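To illustrate the point, here is a small sketch of how NA propagates through a running sum and how a zero-initialised vector avoids it. The second half shows one possible loop-free way to get per-day totals with tapply(). The toy data frame is hypothetical and only borrows the column names from the question.

```r
# NA + number stays NA, so a sum started from NA never becomes a number
votes_na <- rep(NA, 3)
votes_na[1] <- votes_na[1] + 3

# Starting from 0 lets the sum accumulate
votes_zero <- rep(0, 3)
votes_zero[1] <- votes_zero[1] + 3

# Hypothetical toy data: two states observed on two days
toy <- data.frame(
  DaysToEday = c(1, 1, 2, 2),
  EV         = c(3L, 55L, 3L, 55L),
  PriceD     = c(7.5, 80, 60, 80),
  PriceR     = c(92.5, 20, 40, 20)
)

# Electoral votes go to the candidate with the higher price, summed per day
obama_by_day  <- tapply(ifelse(toy$PriceD > toy$PriceR, toy$EV, 0), toy$DaysToEday, sum)
mccain_by_day <- tapply(ifelse(toy$PriceD > toy$PriceR, 0, toy$EV), toy$DaysToEday, sum)
```

Grouping by DaysToEday also makes each day's total depend on that day's rows only, which the inner loop in the question does not do.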

BradleyTerry2 package missing one player in model results

I have data on 23 'players'. Some of them played against each other (but not every possible pair) one or multiple times. The dataset I have (see dput below) includes the number of times one player won and lost against another player. I use it to fit a BT model using BradleyTerry2 package. The issue I have is that the model gives me the coefficients for 22 players not 23. Can anyone help me figure out what the problem is, please?
Below is the dput of my data (head)
structure(list(player1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("a12TTT.pdf",
"a15.pdf", "a17.pdf", "a18.pdf", "a21.pdf", "a2TTT.pdf", "a5.pdf",
"B11.pdf", "B12.pdf", "B13.pdf", "B22.pdf", "B24.pdf", "B4.pdf",
"B7.pdf", "B8.pdf", "cw10-1.pdf", "cw15-1TTT.pdf", "cw17-1.pdf",
"cw18.pdf", "cw3.pdf", "cw4.pdf", "cw7_1TTT.pdf", "cw13-1.pdf"
), class = "factor"), player2 = structure(c(4L, 5L, 8L, 9L, 10L,
12L), .Label = c("a12TTT.pdf", "a15.pdf", "a17.pdf", "a18.pdf",
"a21.pdf", "a2TTT.pdf", "a5.pdf", "B11.pdf", "B12.pdf", "B13.pdf",
"B22.pdf", "B24.pdf", "B4.pdf", "B7.pdf", "B8.pdf", "cw10-1.pdf",
"cw15-1TTT.pdf", "cw17-1.pdf", "cw18.pdf", "cw3.pdf", "cw4.pdf",
"cw7_1TTT.pdf", "cw13-1.pdf"), class = "factor"), win1 = c(0,
1, 1, 1, 2, 0), win2 = c(1, 1, 0, 1, 0, 2)), row.names = c(NA,
6L), class = "data.frame")
The code I am using:
BTm(cbind(win1,win2), player1, player2, data= prep)
I also tried
BTm(cbind(win1,win2), player1, player2, ~player, id="player", data= prep)
And it gives me the same result (i.e. the same player is missing, and the 22 coefficients for the rest are the same).
If that is relevant, I created 'prep' using the below code.
prep<-countsToBinomial(table(ju$winner, ju$loser))
ju$winner and ju$loser are two columns in which rows are individual games and the winner is in the first column.
I also tried the following code to fit the model:
BTm(1, p1, p2, data=ju)
In this case p1 and p2 are the same as the columns winner and loser, but transformed so as to have the same factor levels (so that the function would work). I am not sure I used this alternative correctly, and I mention it because in this case I also have one player missing (although a different one).
After reading more carefully the documentation for the package, I found that when estimating the model the function removes one script/player/contestant as a reference. Its value is always 0. So my understanding is that if you want to do any further analysis, you have to find what player was removed and reintroduce it in the data frame with the value for its ability 0.
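A minimal sketch of reintroducing the reference player, in base R only. The coefficient vector below is hypothetical; it assumes the fitted coefficients are named with a ".." prefix in front of the player name, which is how they appeared for me but should be checked against names(coef(model)). The package also provides BTabilities(), which, as far as I can tell, returns the abilities with the reference player included.

```r
# All player names (a shortened, hypothetical list)
players <- c("a12TTT.pdf", "a15.pdf", "a17.pdf")

# Hypothetical stand-in for coef(model): the reference player is absent
coefs <- c("..a15.pdf" = 0.4, "..a17.pdf" = -0.2)

# Start every ability at 0, then overwrite the ones the model estimated
abilities <- setNames(rep(0, length(players)), players)
abilities[sub("^\\.\\.", "", names(coefs))] <- coefs

# The player left at 0 is the reference
reference <- setdiff(players, sub("^\\.\\.", "", names(coefs)))
```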

Check for Date in Datespan for multiple times

I have data that looks like this:
Data <- "Person Address Starting.Date Resignation.Date Job
John abc 01.01.2017 03.01.2017 IT
Sarah cde 06.01.2017 06.07.2017 Teacher
Susi bfg 09.06.2017 08.09.2017 secretary"
Data <- read.table(text=Data, header = TRUE)
My goal is to find out how long people stayed in their job before quitting and to put that information in a new variable. So I check whether the resignation date falls in a certain date span, using this code:
Data$Span<- ifelse(Data$Resignation.Date>= "01.01.2017" & Data$Resignation.Date <= "31.01.2017", 1,
ifelse(Data$Resignation.Date>= "01.02.2017" & Data$Resignation.Date <= "28.02.2017", 2,
ifelse(Data$Resignation.Date>= "01.03.2017" & Data$Resignation.Date <= "31.03.2017", 3,
ifelse(Data$Resignation.Date>= "01.04.2017" & Data$Resignation.Date <= "30.04.2017", 4,
ifelse(Data$Resignation.Date>="01.05.2017" & Data$Resignation.Date <= "31.05.2017",5,
ifelse(Data$Resignation.Date>="01.06.2017" & Data$Resignation.Date<="30.06.2017",6,
ifelse(Data$Resignation.Date>="01.07.2017" & Data$Resignation.Date<="31.07.2017",7,
ifelse(Data$Resignation.Date>="01.08.2017" & Data$Resignation.Date<="31.08.2017", 8,
ifelse(Data$Resignation.Date>="01.09.2017" & Data$Resignation.Date<="30.09.2017", 9,
ifelse(Data$Resignation.Date>="01.10.2017" & Data$Resignation.Date<="31.10.2017",10,
ifelse(Data$Resignation.Date>="01.11.2017" & Data$Resignation.Date<="30.11.2017", 11,
ifelse(Data$Resignation.Date>="01.12.2017" & Data$Resignation.Date<="31.12.2017",12,999))))))))))))
The data I presented is a subset of people who started working in January. I have subsets for all 12 months of 2017. I want to use the same code for people who started working in February, March, and so on. To do this, I would have to shift every line of the code forward by one month for each later subset. For example, for the February subset it would start with
Data$Resignation.Date>= "01.02.2017" & Data$Resignation.Date <= "28.02.2017", 1,
and end with
ifelse(Data$Resignation.Date>="01.01.2018" & Data$Resignation.Date<="31.01.2018",12,999
Is there any way to do this without copy-pasting the code and making the changes manually for every month? Since the changes follow a systematic pattern, I would think it is possible, but I could not find a solution. I looked in the dplyr package since I thought my problem fits there, but that did not help. I would be very thankful for any advice, and I will happily answer any remaining questions.
P.S.: I am not attached to using the subsets; they were just easier for me to work with since I am not very experienced in R. I filtered the subsets using this code:
Data <- TotalData[TotalData$Starting.Date>= "01.01.2017" & TotalData$Starting.Date <= "31.01.2017",]
I think this code should be sufficient for your task:
The logic: if the start date and the resignation date fall in the same calendar month, the result is 1; otherwise it is the number of whole months the employee stayed at the company.
library(lubridate)
Data$Starting.Date <- dmy(Data$Starting.Date)
Data$Resignation.Date <- dmy(Data$Resignation.Date)
Data$code <- ifelse(month(Data$Starting.Date) == month(Data$Resignation.Date), 1, (interval(Data$Starting.Date, Data$Resignation.Date) %/% months(1)))
Data :-
Data <- structure(list(Person = structure(1:4, .Label = c("John", "johnyy",
"Sarah", "Susi"), class = "factor"), Address = structure(c(1L,
1L, 3L, 2L), .Label = c("abc", "bfg", "cde"), class = "factor"),
Starting.Date = structure(c(17167, 17199, 17172, 17326), class = "Date"),
Resignation.Date = structure(c(17169, 17199, 17353, 17417
), class = "Date"), Job = structure(c(1L, 1L, 3L, 2L), .Label = c("IT",
"secretary", "Teacher"), class = "factor"), code = c(1, 2,
999, 999)), row.names = c(NA, -4L), class = "data.frame")
You could do it with the lubridate package to get the time a person stayed in the company.
library(lubridate)
Data <- "Person Address Starting.Date Resignation.Date Job
John abc 01.01.2017 03.01.2017 IT
Sarah cde 06.01.2017 06.07.2017 Teacher
Susi bfg 09.06.2017 08.09.2017 secretary"
Data <- read.table(text=Data, header = TRUE)
Data$Starting.Date = dmy(Data$Starting.Date)
Data$Resignation.Date = dmy(Data$Resignation.Date)
time.interval <- Data$Starting.Date %--% Data$Resignation.Date
time.period <- as.period(time.interval)
time.period <- month(time.period)
Data$Span <- time.period
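If the aim is to reproduce the numbering from the question (1 for a resignation in the starting month, 2 for the following month, and so on) without one subset per month, a possible sketch in base R is to count calendar months between the two dates. The helper name month_index is made up for illustration, and the sketch assumes the dates are in the dd.mm.yyyy format shown in the question.

```r
start <- as.Date(c("01.01.2017", "06.01.2017", "09.06.2017"), format = "%d.%m.%Y")
end   <- as.Date(c("03.01.2017", "06.07.2017", "08.09.2017"), format = "%d.%m.%Y")

# Hypothetical helper: number of calendar months since 1900 for each date
month_index <- function(d) {
  lt <- as.POSIXlt(d)
  lt$year * 12L + lt$mon
}

# 1 = resigned in the starting month, 2 = the month after, and so on
span <- month_index(end) - month_index(start) + 1L
```

Because the count is relative to each person's own starting month, the same line works for every starting month, including spans that run into the next year.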

Generate new variable based on table of conditions

I am trying to write a loop in R that creates a new variable based on a table of conditional outcomes.
I have four treatment groups (A, B, C, D). Each treatment group pays a different price at three different time periods (day, dinner, night).
Treatment Group   Day Price   Dinnertime Price   Night Price
A                 10          20                 7
B                 11          25                 8
C                 12          30                 9
D                 13          35                 10
The time period is recorded as a given "hour" (day is hours 8-17, dinner is from 17-19 and night is from 19-0 and 0-8).
Person     Hour   Usage
Person 1   1      0
Person 1   2      0
Person 2   20     5
Person 3   17     6
Based on both treatment group (A, B, C and D) and time of day (night, day, dinnertime), I would like to create a new vector of prices.
Ideally, I would create dummy variables for each of the time periods (day, night and dinner) based on these hourly conditions. However, my data set is pretty large (24 observations per person per day) so I'm looking for a more elegant solution.
In plain language, I want this:
if group==A & time==night, then price=7 --> and this information saved in a new variable "price"
Any advice?
Edit: Question is about the loop with two conditions. Is there a way to refer this directly to the data-frame with the treatment groups and tariffs or do I just need to write it manually?
Assuming you have some way of adding a column for the group each person belongs to in the data frame with the transactions, something like this may work for you.
df.pricing <- structure(list(Treatment.Group = c("A", "B", "C", "D"), Day.Price = 10:13,
Dinnertime.Price = c(20L, 25L, 30L, 35L), Night.Price = 7:10),
.Names = c("Treatment.Group", "Day.Price", "Dinnertime.Price", "Night.Price"),
class = "data.frame",
row.names = c(NA, -4L))
df.transactions <- structure(list(Person = c("Person1", "Person1", "Person2", "Person3", "Person4"),
Hour = c(1L, 2L, 20L, 17L, 9L),
Usage = c(0L, 0L, 5L, 6L, 2L)),
.Names = c("Person", "Hour", "Usage"),
class = "data.frame", row.names = c(NA, -5L))
# Add the group that each person belongs to
df.transactions$group <- c("A","A","B","C","D")
# Get the transaction price
df.transactions$price <- apply(df.transactions, 1, function(x){
  hour <- as.numeric(x[["Hour"]])
  price <- ifelse(hour >= 8 & hour <= 16, df.pricing[df.pricing$Treatment.Group == x[["group"]], "Day.Price"],
           ifelse((hour > 16 & hour <= 18), df.pricing[df.pricing$Treatment.Group == x[["group"]], "Dinnertime.Price"],
                  df.pricing[df.pricing$Treatment.Group == x[["group"]], "Night.Price"]))})
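Since the data set is large, a row-wise apply() may be slow. As an alternative sketch (not the approach above, just another option), the price table can be turned into a matrix with the groups as row names, and each transaction's price looked up by (group, period) in one vectorised step. The hour boundaries are the same as in the code above; the variable names are made up for the example.

```r
pricing <- data.frame(
  Treatment.Group  = c("A", "B", "C", "D"),
  Day.Price        = 10:13,
  Dinnertime.Price = c(20L, 25L, 30L, 35L),
  Night.Price      = 7:10,
  stringsAsFactors = FALSE
)
transactions <- data.frame(
  Person = c("Person1", "Person1", "Person2", "Person3", "Person4"),
  Hour   = c(1L, 2L, 20L, 17L, 9L),
  group  = c("A", "A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Price matrix: rows are groups, columns are the three price periods
price_mat <- as.matrix(pricing[, -1])
rownames(price_mat) <- pricing$Treatment.Group

# Classify every hour at once into the matching price column
period <- ifelse(transactions$Hour >= 8 & transactions$Hour <= 16, "Day.Price",
          ifelse(transactions$Hour >= 17 & transactions$Hour <= 18, "Dinnertime.Price",
                 "Night.Price"))

# A two-column character matrix indexes price_mat by (row name, column name)
transactions$price <- price_mat[cbind(transactions$group, period)]
```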

Identify gaps in a continuous time period

I have a dataframe with some observations of when lines were attached to IDs.
I need the period of time in days when each ID had a line/catheter attached.
Here is my dput return:
structure(list(ID = c(487622L, 487622L, 487639L, 487639L, 489027L,
489027L, 489027L, 491858L, 491858L, 491858L, 491858L, 491858L,
491858L), Line = c("Central Venous Line", "Central Venous Line",
"Central Venous Line", "Peripherally Inserted Central Catheter (PICC)",
"Haemodialysis Catheter", "Peripherally Inserted Central Catheter (PICC)",
"Haemodialysis Catheter", "Central Venous Line", "Haemodialysis Catheter",
"Central Venous Line", "Haemodialysis Catheter", "Central Venous Line",
"Peripherally Inserted Central Catheter (PICC)"), Start = structure(c(1362528000,
1363219200, 1362268800, 1363219200, 1364774400, 1365120000, 1365465600,
1364688000, 1364688000, 1365724800, 1365724800, 1366848000, 1369353600
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), End = structure(c(1362787200,
1363824000, 1363305600, 1363737600, 1365465600, 1366675200, 1365638400,
1365724800, 1365724800, 1366329600, 1366848000, 1367539200, 1369612800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Days = c("3.095138889",
"7.045138889", "11.87777778", "5.736111111", "7.850694444", "18.02083333",
"1.813888889", "12.32986111", "12.71388889", "6.782638889", "13.14027778",
"7.718055556", "3.397222222"), dateOrder = c(1L, 2L, 1L, 2L,
1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L)), .Names = c("ID", "Line",
"Start", "End", "Days", "dateOrder"), row.names = 79:91, class = "data.frame")
Here is the catch. It does not matter if an ID has more than one line/catheter. I just need to take the earliest start date for each ID, the latest end date for each ID, and calculate the number of continuous days each ID has a line/catheter attached.
The problem is complicated by some cases, e.g. ID 491858. This individual had a line removed (dateOrder = 5) on 2013-05-03 and reinserted on 2013-05-24 for just over 3 days.
How I intended to handle this is to subtract the gap (number of days) from the number of days of continuous time between min(Start Date) and max(end date).
There are over 20,000 records in the data set.
Here is what I have done so far:
Converted the DF to a list of DFs based on ID.
I intended to apply a function to each DF something as follows:
If the difference in time (days) between subsequent start date and previous end date for each row exceeds 0, then add TRUE or some arbitrary column value to each data frame.
function(y){
  for (i in length(y)){
    if(difftime(y$Start[i+1], y$End[i], units='days') > 0){
      y$test <- TRUE}
  }
}
Any help would be greatly appreciated.
Thanks.
UPDATE
Ignore the days column. It is of no use. I intend to aggregate month line counts from the unique cases.
I guess something like this might help, unless I've misunderstood something:
unlist(lapply(split(DF, DF$ID), function(x) {
  totaldays <- max(x$End) - min(x$Start)
  x$Start <- c(x$Start[-1], NA)
  res <- difftime(x$Start[-length(x$Start)], x$End[-length(x$Start)], units = "days")
  res <- res[res > 0]
  res <- ifelse(length(res) == 0, 0, res)
  return(as.numeric(totaldays - res))
}))
#487622 487639 489027 491858
# 10 17 22 36
DF is your dput.
If I understand correctly, you want the total number of days that a catheter was present. To do that, I would use plyr:
#assume df is your dput object
library(plyr)
day.summary <- ddply(df, "ID", function(x) data.frame(total.days = sum(as.numeric(x$Days))))
print(day.summary)
ID total.days
1 487622 10.14028
2 487639 17.61389
3 489027 27.68542
4 491858 56.08194
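Another way to look at it, sketched below under my own assumptions: merge overlapping intervals per ID and add up the length of each merged block. This counts overlapping lines once and leaves out every gap, however many there are. The function name covered_days is made up, and the example uses only the first two IDs from the dput, with the timestamps written as dates.

```r
# Total days covered by a set of possibly overlapping [start, end] intervals
covered_days <- function(start, end) {
  ord <- order(start)
  s <- as.numeric(start[ord])
  e <- as.numeric(end[ord])
  # Latest end date seen so far, in start order
  run_end <- cummax(e)
  # A new block begins when a start lies after every earlier end (a gap)
  new_block <- c(TRUE, s[-1] > run_end[-length(run_end)])
  block <- cumsum(new_block)
  sum(tapply(e, block, max) - tapply(s, block, min))
}

lines <- data.frame(
  ID    = c(487622L, 487622L, 487639L, 487639L),
  Start = as.Date(c("2013-03-06", "2013-03-14", "2013-03-03", "2013-03-14")),
  End   = as.Date(c("2013-03-09", "2013-03-21", "2013-03-15", "2013-03-20"))
)

days_by_id <- sapply(split(lines, lines$ID),
                     function(d) covered_days(d$Start, d$End))
```

For these two IDs the result is 10 and 17 days, the same as the first answer's output for them.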
