Unable to figure out nested for loops in R
I am unable to figure out how to use nested for loops in R to solve my problem. Here's a miniature version of what I'm trying to solve:
I have two files, test1 and test2, which look like this:
head(test1)
Date Settlement
2008-08-28 138.29
2008-08-29 135.34
2008-09-01 135.23
2008-09-02 123.36
2008-09-03 126.41
2008-09-04 128.68
2008-09-05 123.70
2008-09-08 124.60
2008-09-09 122.33
2008-09-10 120.85
2008-09-11 120.15
2008-09-12 121.17
2008-09-15 118.97
2008-09-16 114.90
2008-09-17 115.78
2008-09-18 115.60
2008-09-19 115.90
2008-09-22 120.49
2008-09-23 124.10
And here is test2:
test2
X1 X2 X3
2008-08-31 2008-09-05 2008-09-11
2008-09-05 2008-09-11 2008-09-14
2008-09-11 2008-09-14 2008-09-18
2008-09-14 2008-09-18 2009-09-22
The logic that I need to put in is:
Select Dates [1,1] and [1,2] from test2
Find all Settlement Prices between those 2 dates in test1
Get average of those prices, place it in [1,1] of a new dataframe.
Repeat by moving across the columns, and then down the rows, to fill the new dataframe.
The end-result of this would look like this:
X1 X2
128.42 122.87
122.87 120.66
120.66 116.55
116.55 115.75
So, the 1st value in X1 is the average of Settlement prices between 31-Aug-08 (inclusive) and 5-Sep-08 (exclusive), the 1st value in X2 is the average of Settlement prices between 5-Sep-08 (inclusive) and 11-Sep-08 (exclusive), and so on for the rows below.
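To make the first cell concrete, here is a hand-computed check against test1 (the four Settlement values with 2008-08-31 <= Date < 2008-09-05):

# settlements on 2008-09-01 to 2008-09-04
mean(c(135.23, 123.36, 126.41, 128.68))
# [1] 128.42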
Here's my code that works (if I pass it fixed dates from test2 as given below):
library(dplyr)   # for %>%, group_by, summarise, filter

temp1 <- test1 %>%
  group_by(Date >= test2$X1[1] & Date < test2$X2[1]) %>%
  summarise(AvgPrice2 = mean(Settlement, na.rm = T))
temp1 <- filter(temp1, temp1[,1] == TRUE)
However, no matter what I try (over the last 3 days!), I cannot figure out how to put this into a for loop. I even tried rollapply and sapply, but couldn't get anything to work. The code need not be time efficient; I just need to automate this process.
I have been working with R for some time, but clearly this is a problem for advanced users. I would deeply appreciate any help on this.
Many thanks in advance.
I would use an SQL-like approach through the sqldf package (which lets you apply SQL syntax to your data.frames).
ds = data.frame(Date = c("2008-08-28", "2008-08-29", "2008-09-01", "2008-09-02", "2008-09-03", "2008-09-04", "2008-09-05", "2008-09-08", "2008-09-09", "2008-09-10", "2008-09-11", "2008-09-12", "2008-09-15", "2008-09-16", "2008-09-17", "2008-09-18", "2008-09-19", "2008-09-22", "2008-09-23"),
Settlement = c(138.29, 135.34, 135.23, 123.36, 126.41, 128.68, 123.70, 124.60, 122.33, 120.85, 120.15, 121.17, 118.97, 114.90, 115.78, 115.60, 115.90, 120.49, 124.10))
dr = data.frame(d1=c("2008-08-31", "2008-09-05", "2008-09-11", "2008-09-14"),
d2=c("2008-09-05", "2008-09-11", "2008-09-14", "2008-09-18"),
d3=c("2008-09-11", "2008-09-14", "2008-09-18", "2009-09-22"))
# add a variable which I will use to identify the rows
dr$g = 1:NROW(dr);
library(sqldf);
output = sqldf("SELECT dr.g, AVG(s1.Settlement) AS X1, AVG(s2.Settlement) AS X2
FROM dr
JOIN ds AS s1 ON dr.d1 <= s1.Date AND s1.Date < dr.d2
JOIN ds AS s2 ON dr.d2 <= s2.Date AND s2.Date < dr.d3
GROUP BY dr.g");
I found the suggested package in this post. In the same post another user suggested the data.table package, but I don't feel as confident with data.table syntax as with SQL :)
The documentation of sqldf and some usage examples can be found on the GitHub project page.
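For reference, a rough data.table equivalent would use non-equi joins. I have not run this against the example here, so treat it as a sketch; it assumes the ds and dr data.frames defined above and simply averages Settlement inside each date window:

library(data.table)
ds_dt <- as.data.table(ds)
dr_dt <- as.data.table(dr)
# comparisons are safer on real dates than on strings
ds_dt[, Date := as.IDate(Date)]
dr_dt[, c("d1", "d2", "d3") := lapply(.SD, as.IDate), .SDcols = c("d1", "d2", "d3")]
# non-equi join: for each row of dr, average Settlement with d1 <= Date < d2 (and d2 <= Date < d3)
x1 <- ds_dt[dr_dt, on = .(Date >= d1, Date < d2), mean(Settlement), by = .EACHI]$V1
x2 <- ds_dt[dr_dt, on = .(Date >= d2, Date < d3), mean(Settlement), by = .EACHI]$V1
data.table(X1 = x1, X2 = x2)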
I'm not sure I got it right; one of my results is different from the one in your wanted output. First, make sure the dates are of class Date.
test1$Date <- as.Date(test1$Date)
test2$X1 <- as.Date(test2$X1)
test2$X2 <- as.Date(test2$X2)
test2$X3 <- as.Date(test2$X3)
Now, for the computations you've described.
res1 <- numeric(nrow(test2))
res2 <- numeric(nrow(test2))
for(i in seq_len(nrow(test2))){
  # average of Settlement prices with X1 <= Date < X2
  inx <- test2$X1[i] <= test1$Date & test1$Date < test2$X2[i]
  res1[i] <- mean(test1$Settlement[inx])
  # average of Settlement prices with X2 <= Date < X3
  inx <- test2$X2[i] <= test1$Date & test1$Date < test2$X3[i]
  res2[i] <- mean(test1$Settlement[inx])
}
result <- data.frame(X1 = res1, X2 = res2)
result
X1 X2
1 128.42 122.8700
2 122.87 120.6600
3 120.66 116.5500
4 116.55 119.0225
The value that is different is the very last one, result$X2[4]. Your output is 115.75 and here it's 119.0225.
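Since the title asks about nested for loops and your real test2 has more columns, the same idea extends with an explicit loop over adjacent column pairs. A minimal sketch, assuming neighbouring columns of test2 always define consecutive windows:

res <- matrix(NA_real_, nrow = nrow(test2), ncol = ncol(test2) - 1)
for (j in seq_len(ncol(test2) - 1)) {      # loop over adjacent column pairs
  for (i in seq_len(nrow(test2))) {        # loop over rows of test2
    inx <- test2[[j]][i] <= test1$Date & test1$Date < test2[[j + 1]][i]
    res[i, j] <- mean(test1$Settlement[inx], na.rm = TRUE)
  }
}
result_all <- as.data.frame(res)
names(result_all) <- paste0("X", seq_len(ncol(res)))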
Your data
Ensuring dates are Dates
library(lubridate)
test1 = data.frame(Date = ymd(c("2008-08-28", "2008-08-29", "2008-09-01", "2008-09-02", "2008-09-03", "2008-09-04", "2008-09-05", "2008-09-08", "2008-09-09", "2008-09-10", "2008-09-11", "2008-09-12", "2008-09-15", "2008-09-16", "2008-09-17", "2008-09-18", "2008-09-19", "2008-09-22", "2008-09-23")),
Settlement = c(138.29, 135.34, 135.23, 123.36, 126.41, 128.68, 123.70, 124.60, 122.33, 120.85, 120.15, 121.17, 118.97, 114.90, 115.78, 115.60, 115.90, 120.49, 124.10))
test2 = data.frame(d1=ymd(c("2008-08-31", "2008-09-05", "2008-09-11", "2008-09-14")),
d2=ymd(c("2008-09-05", "2008-09-11", "2008-09-14", "2008-09-18")),
d3=ymd(c("2008-09-11", "2008-09-14", "2008-09-18", "2009-09-22")))
tidyverse solution
library(tidyverse)
result <- map_df(1:nrow(test2), ~data.frame(X1=(filter(test1, Date >= test2$d1[.x] & Date < test2$d2[.x]) %>% summarise(m=mean(Settlement)))$m,
X2=(filter(test1, Date >= test2$d2[.x] & Date < test2$d3[.x]) %>% summarise(m=mean(Settlement)))$m))
Output
X1 X2
1 128.42 122.8700
2 122.87 120.6600
3 120.66 116.5500
4 116.55 119.0225
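If the real test2 has more than three date columns, the same idea can be written once over all adjacent column pairs. This is only a sketch (the helper win_mean is made up here) and assumes each pair of neighbouring columns defines one averaging window:

library(purrr)
# hypothetical helper: mean Settlement in [lo, hi) for parallel vectors of dates
win_mean <- function(lo, hi) {
  map2_dbl(lo, hi, function(a, b) mean(test1$Settlement[test1$Date >= a & test1$Date < b]))
}
cols <- seq_len(ncol(test2) - 1)
result_all <- as.data.frame(setNames(lapply(cols, function(j) win_mean(test2[[j]], test2[[j + 1]])),
                                     paste0("X", cols)))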
Thanks a lot for all the answers. I tried all of them, but none seemed to fit my needs, given that the files above were a miniaturised version of the actual files; coding by column names or splitting the data manually into rows didn't seem like a good option for me.
But I finally figured out what'll work nicely in this case:
library(lubridate)
library(dplyr)   # filter() below comes from dplyr

Makingrows <- function(test1, test2, j){
  res <<- NULL
  m1 = nrow(test2)
  for(i in 1:m1){
    d1 <- ymd(test2[i, j])
    d2 <- ymd(test2[i, j+1])
    X1 <- filter(test1, Date < d2 & Date >= d1)
    res[i] <- mean(X1$Settlement, na.rm = T)
  }
  return(res)
}

mcol1 <- ncol(test2) - 1
finalres <- lapply(1:mcol1, function(x) Makingrows(test1, test2, x))
finalres <- as.data.frame(finalres)
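A small optional tidy-up of the same function, just as a sketch: keeping res local avoids the global assignment, and naming the output columns makes the result easier to read.

Makingrows <- function(test1, test2, j) {
  res <- numeric(nrow(test2))          # local result vector instead of res <<- NULL
  for (i in seq_len(nrow(test2))) {
    d1 <- ymd(test2[i, j])
    d2 <- ymd(test2[i, j + 1])
    X1 <- filter(test1, Date < d2 & Date >= d1)
    res[i] <- mean(X1$Settlement, na.rm = TRUE)
  }
  res
}
cols <- seq_len(ncol(test2) - 1)
finalres <- as.data.frame(setNames(lapply(cols, function(x) Makingrows(test1, test2, x)),
                                   paste0("X", cols)))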
And yes, I was also getting the last value as 119.02, and I realised that I had mistakenly put the year as 2009 in the last cell of the test2 file. Because of this, the code was picking up all the values up to the end of the data.
Thanks a lot everyone. I hope you'll agree with me as I mark this as the answer to my question.
Related
Looping row numbers from one dataframe to create new data using logical operations in R
I would like to extract a dataframe that shows how many years it takes for the NInd variable (dataset p1) to recover after some culling happens, which is shown in dataframe e1. I have the following datasets (mine are much bigger, but just to give you something to play with):

# Dataset 1
Batch <- c(2,2,2,2,2,2,2,2,2,2)
Rep <- c(0,0,0,0,0,0,0,0,0,0)
Year <- c(0,0,1,1,2,2,3,3,4,4)
RepSeason <- c(0,0,0,0,0,0,0,0,0,0)
PatchID <- c(17,25,19,16,21,24,23,20,18,33)
Species <- c(0,0,0,0,0,0,0,0,0,0)
Selected <- c(1,1,1,1,1,1,1,1,1,1)
Nculled <- c(811,4068,1755,449,1195,1711,619,4332,457,5883)
e1 <- data.frame(Batch,Rep,Year,RepSeason,PatchID,Species,Selected,Nculled)

# Dataset 2
Batch <- c(2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
Rep <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
Year <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2)
RepSeason <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
PatchID <- c(17,25,19,16,21,24,23,20,18,33,17,25,19,16,21,24,23,20,18,33,17,25,19,16,21,24,23,20,18,33)
Ncells <- c(6,5,6,4,4,5,6,5,5,5,6,5,6,4,4,5,6,7,3,5,4,4,3,3,4,4,5,5,6,4)
Species <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
NInd <- c(656,656,262,350,175,218,919,218,984,875,700,190,93,127,52,54,292,12,43,68,308,1000,98,29,656,656,262,350,175,300)
p1 <- data.frame(Batch, Rep, Year, RepSeason, PatchID, Ncells, Species, NInd)

The dataset called e1 shows only those years where some culling happened to the population on a specific PatchID. I have created the following script that basically uses each row from e1 to create a Recovery number. Maybe there is an easier way to get to the end, but this is the one I managed to get. When you run this, you are working on ONE row of e1, so we focus on the first PatchID encountered, do some calculations to match it up with p1, and finally get a number named Recovery. Now, the thing is my dataframe has 50,000 rows, so doing this over and over looks quite tedious. That's where I thought a loop might be useful, but I have tried and had no luck making it work at all.

library(dplyr)   # for %>%, select, slice, filter

# here is where I would like the loop
e2 <- e1[1,] # Trial for one row only, but the idea is a loop that keeps doing what comes next for each row
e3 <- e2 %>% select(1,2,4,5)
p2 <- p1[,c(1,2,4,5,3,6,7,8)] # Re-order
row2 <- which(apply(p2, 1, function(x) return(all(x == e3))))
p3 <- p1 %>% slice(row2) # all years with that particular patch in that particular Batch

# How many times was this patch culled during this replicate?
e4 <- e2[,c(1,2,4,5,3,6,7,8)]
e4 <- e4 %>% select(1,2,3,4)
c_batch <- e1[,c(1,2,4,5,3,6,7,8)]
row <- which(apply(c_batch, 1, function(x) return(all(x == e4))))
c4 <- c_batch %>% slice(row)

# Number of years to recover to 95% of what it had before the cull
c5 <- c4[1,] # extract the first time it was culled
c5 <- c5 %>% select(1:5)
row3 <- which(apply(p2, 1, function(x) return(all(x == c5))))
Before <- p2 %>% slice(row3)
NInd <- Before[,8] # Number of individuals before culling
Year2 <- Before[,5] # Year in which the first culling happened (the count actually corresponds to individuals before culling, as the Pop file is produced during reproduction, while the Cull file is produced after!)
Percent <- (95*NInd)/100 # 95% recovery corresponds to having 95% of NInd from BEFORE the cull happened (Year2)
After <- p3 %>% filter(NInd >= Percent & Year > Year2) # rows that match number of individuals and Year
After2 <- After[1,] # we just want the first year where the recovery was successfully achieved
Recovery <- After2$Year - Before$Year # no. of years to reach 95% of the population immediately before the cull

I reckon that the end would have to change somehow to tell R that we are creating a dataframe with the Recovery, something like:

Batch <- c(1,1,2,2)
Rep <- c(0,0,0,0)
PatchID <- c(17,25,30,12)
Recovery <- c(1,2,1,5)
Final <- data.frame(Batch, Rep, PatchID, Recovery)

Would that be possible? Or is this just too messed up and I should maybe try a different way?
Does the following solve the problem correctly? I have first added a unique ID to your data.frames to allow matching of the cull and population files (this saves most of your complicated look-up code):

# Add a unique ID for the patch/replicate etc. (as done in the example code)
e1$RepID = paste(e1$Batch, e1$Rep, e1$RepSeason, e1$PatchID, sep = ":")
p1$RepID = paste(p1$Batch, p1$Rep, p1$RepSeason, p1$PatchID, sep = ":")

If you want a quick overview of the number of times each patch was culled, the new RepID makes this easy:

# How many times was each patch culled?
table(p1$RepID)

Then you want a loop to check the recovery time after each cull. My solution uses an sapply loop (which also retains the RepIDs so you can match to other metadata later):

sapply(unique(e1$RepID), function(rep_id){
  all_cull_events = e1[e1$RepID == rep_id, , drop = F]
  first_year = order(all_cull_events$Year)[1] # The first cull year (assuming data might not be in temporal order)
  first_cull_event = all_cull_events[first_year, ] # The row corresponding to the first cull event
  population_counts = p1[p1$RepID == first_cull_event$RepID, ] # The population counts for this plot/replicate
  population_counts = population_counts[order(population_counts$Year), ] # Order by year (assuming data might not be in temporal order)
  pop_at_first_cull_event = population_counts[population_counts$Year == first_cull_event$Year, "NInd"]
  population_counts_after_cull = population_counts[population_counts$Year > first_cull_event$Year, , drop = F]
  years_to_recovery = which(population_counts_after_cull$NInd >= (pop_at_first_cull_event * .95))[1] # First year to pass 95% threshold
  return(years_to_recovery)
})

2:0:0:17 2:0:0:25 2:0:0:19 2:0:0:16 2:0:0:21 2:0:0:24 2:0:0:23 2:0:0:20 2:0:0:18 2:0:0:33
       1        2        1       NA       NA       NA       NA       NA       NA       NA

(The output contains some NAs because the first cull year was outside the range of population counts in the data you gave us.)
Please check this against your expected output though. There were some aspects of the question and example code that were not clear (see comments).
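For comparison, here is a rough dplyr sketch of the same logic. It is not checked against the full data and assumes the RepID columns created above, plus that a population count exists for the first cull year:

library(dplyr)

first_cull <- e1 %>%
  group_by(RepID) %>%
  summarise(cull_year = min(Year), .groups = "drop")

p1 %>%
  inner_join(first_cull, by = "RepID") %>%
  arrange(RepID, Year) %>%
  group_by(RepID) %>%
  summarise(
    baseline          = NInd[Year == first(cull_year)][1],                       # count in the cull year
    years_to_recovery = which(NInd[Year > first(cull_year)] >= 0.95 * baseline)[1],
    .groups = "drop"
  )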
Filtering a data frame in R using a for loop
I am new to R and am running into difficulty with more advanced filtering. I have a data frame containing 1500 rows of people in households, and I need to filter out everyone who is part of a household where at least 1 person is older than 24. For example, in the sample set below I would only want to keep rows 3, 4, and 5.

PersonalID DOB        HouseholdID
1          1961-04-15 123
2          2017-01-12 123
3          2000-01-02 122
4          2001-03-05 122
5          1996-08-22 122

Initially I just filtered to get a new data frame with everyone in that age range, and then filtered the original data frame again (and again and again and so on...) with each HouseholdID of someone under 25, to check if anyone else with that HouseholdID is over 24. Whenever I'm doing the same thing over and over it seems like there's probably a way to use a function instead, but I'm having a hard time coming up with one that works. This is my current attempt, but I know there's plenty wrong with it:

UNDER25df <- filter(df, DOB >= "yyyy-mm-dd")
for (UNDER25df$HouseholdID in df) {
  if (all(df$DOB >= "yyyy-mm-dd")) {
    view(filter(df, HouseholdID == "$HouseholdID"))
  }
}

The error I get is:

unexpected '}' in "}"

but I'm pretty sure that I can nest an if statement in a for loop in R, and that I've been careful about the positioning of the brackets, so I don't know exactly what it's referring to. What I'm not sure of is whether I can iterate through a data frame in this way, or if this even makes sense. I've read that vectorising might be better in general for advanced filtering, but I tried to read the documentation on it and couldn't really see how to make the jump to this problem. Does anyone have a suggestion or a direction I should be looking in?
You do not need a loop for this. Try

library(lubridate)
library(dplyr)
set.seed(1)
df <- tibble(DOB = Sys.Date() - sample(3000:12000, 6),
             personalID = 1:6,
             HouseholdID = c(1,1,2,2,2,3))
df$DOB

# grab HouseholdID from all persons that are at least 24
oldies <- df[(lubridate::today() - lubridate::ymd(df$DOB)) > years(24), "HouseholdID", TRUE]

# base R way
oldies <- df[as.Date(df$DOB) > as.Date("1993-2-10"), "HouseholdID", TRUE]

# household members in a household with someone 24 or older
df %>% filter(HouseholdID %in% oldies)

# household members in a household with no one 24 or older
df %>% filter(!(HouseholdID %in% oldies))
I am not sure if you want to keep the rows, grouped by ID, where all users are 24 years old or younger. If so, then maybe you can try the code below:

library(lubridate)
dfout <- subset(df, ave(floor(time_length(Sys.Date() - as.Date(DOB), "years")) <= 24, HouseholdID, FUN = all))

If you really want to use a for loop to do it, then below is an example:

dfout <- data.frame()
for (id in unique(df$HouseholdID)) {
  subdf <- subset(df, HouseholdID == id)
  if (with(subdf, all(floor(time_length(Sys.Date() - as.Date(DOB), "years")) <= 24))) {
    dfout <- rbind(dfout, subdf)
  }
}

Both approaches above give the result shown below:

> dfout
  PersonalID        DOB HouseholdID
3          3 2000-01-02         122
4          4 2001-03-05         122
5          5 1996-08-22         122

DATA

df <- structure(list(PersonalID = 1:5,
                     DOB = c("1961-04-15", "2017-01-12", "2000-01-02", "2001-03-05", "1996-08-22"),
                     HouseholdID = c(123L, 123L, 122L, 122L, 122L)),
                class = "data.frame", row.names = c(NA, -5L))
I am not sure if you want to select households where all the people are above 24, or where at least one person is above 24. In any case, you can use subset with ave:

subset(df, ave(as.integer(format(Sys.Date(), "%Y")) - as.integer(format(DOB, "%Y")) >= 24,
               HouseholdID, FUN = any))

This selects households where at least one person is above 24. If you want to select households where all the people are above 24, use all instead of any in the FUN argument. Similarly, using dplyr, we can do:

library(dplyr)
df %>%
  group_by(HouseholdID) %>%
  filter(any(as.integer(format(Sys.Date(), "%Y")) - as.integer(format(DOB, "%Y")) >= 24))
Different methods to expand R data
I have the following data, and I would like to expand it. For example, if June has two successes and one failure, my dataset should look like:

month | is_success
------------------
6     | T
6     | T
6     | F

The dataset is as follows:

# Months from July to December
months <- 7:12
# Number of successes (failures) for each month
successes <- c(11,22,12,7,6,13)
failures <- c(20,19,11,16,13,10)

A sample solution is as follows:

dataset <- data.frame()
for (i in 1:length(months)) {
  dataset <- rbind(dataset, cbind(rep(months[i], successes[i]), rep(T, successes[i])))
  dataset <- rbind(dataset, cbind(rep(months[i], failures[i]), rep(F, failures[i])))
}
names(dataset) <- c("months", "is_success")
dataset[,"is_success"] <- as.factor(dataset[,"is_success"])

Question: What are the different ways to rewrite this code? I am looking for a comprehensive solution with different but efficient ways (matrix, loop, apply). Thank you!
Here is one way with rep. Create a dataset with 'months' and 'is_success' based on replication of 1 and 0. Then replicate the rows by the values of 'successes' and 'failures', order if necessary, and set the row names to NULL.

d1 <- data.frame(months, is_success = factor(rep(c(1, 0), each = length(months))))
d2 <- d1[rep(1:nrow(d1), c(successes, failures)),]
d2 <- d2[order(d2$months),]
row.names(d2) <- NULL

Now, we check whether this is equal to the data created from the for loop:

all.equal(d2, dataset, check.attributes = FALSE)
#[1] TRUE

Or, as @thelatemail suggested, 'd1' can be created with expand.grid:

d1 <- expand.grid(month=months, is_success=1:0)
Using mapply you can try this:

createdf <- function(month, successes, failures){
  data.frame(month = rep(x = month, (successes + failures)),
             is_success = c(rep(x = T, successes),
                            rep(x = F, failures))
  )
}

Now create a list of the required data.frames:

lofdf <- mapply(FUN = createdf, months, successes, failures, SIMPLIFY = F)

And then combine using the plyr package's ldply function:

library(plyr)
resdf <- ldply(lofdf, .fun = data.frame)
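One more compact option, as a sketch: build the whole table directly with rep() on a small month/outcome/count table (tidyr::uncount() on the same counts table would give an equivalent result).

counts <- data.frame(months     = rep(months, 2),
                     is_success = rep(c(TRUE, FALSE), each = length(months)),
                     n          = c(successes, failures))
# repeat each month/outcome row n times
dataset2 <- data.frame(months     = rep(counts$months, counts$n),
                       is_success = factor(rep(counts$is_success, counts$n)))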
R: missing in action with dates/factors
I'm a bit lost in all this formatting and can't make my age function work; I'm not sure whether the format option applies to the source or the destination. My goal is to attach an AGE column. dx$BIRTH_DATE is defined exactly as in my source data, a factor in this format that I cannot change. I don't want to use lubridate for now, as I need to migrate this to a different environment. Is it possible? Thanks much for your help!! Mario

age_years <- function(from, to) {
  lt <- as.POSIXlt(c(from, to))
  age <- lt$year[2] - lt$year[1]
  mons <- lt$mon + lt$mday/50
  if(mons[2] < mons[1]) age <- age - 1
  age
}

today <- Sys.Date() #, format="%m/%d/%Y")
class(today)
age_years("1988-06-30", "2003-07-12")
age_years("1988-06-30", date) ### doesn't work ???

as.character(dx$BIRTH_DATE)
as.Date(dx$BIRTH_DATE)

dx <- data.frame(ID = factor(c(1,2,3)),
                 BIRTH_DATE = c("1/11/1953","2/12/1977","3/13/2000"),
                 FEM = (c(11,22,33)))
dx
str(dx)

### <#>><
dx$AGE <- age_years(as.Date(dx$BIRTH_DATE), today) ## Age=1 ?????
Using your code, here is a solution for you. You had two major problems: first, your BIRTH_DATE format was not being declared (as noted by Pierre); second, you were using only the second and first year to compute the age, rather than using the entire data frame. Now from and to are treated as vectors to compute the ages. Hope this helps!

age_years <- function(from, to) {
  from <- as.POSIXlt(from)
  to <- as.POSIXlt(to)
  age <- to$year - from$year
  mons <- from$mon + from$mday/50
  if(mons[2] < mons[1]) age <- age - 1
  age
}

today <- Sys.Date() #, format="%m/%d/%Y")
dx <- data.frame(ID = factor(c(1,2,3)),
                 BIRTH_DATE = c("1/11/1953","2/12/1977","3/13/2000"),
                 FEM = (c(11,22,33)))
dx$AGE <- age_years(from = as.Date(dx$BIRTH_DATE, format = "%m/%d/%Y"), to = today)

> dx
  ID BIRTH_DATE FEM AGE
1  1  1/11/1953  11  62
2  2  2/12/1977  22  38
3  3  3/13/2000  33  15
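As an alternative sketch (base R only, not part of the original answer), age in completed years can also be computed fully vectorised by checking whether the birthday has occurred yet:

age_years2 <- function(from, to) {
  from <- as.POSIXlt(from)
  to   <- as.POSIXlt(to)
  # subtract one year when the birthday has not yet occurred in `to`'s year
  not_yet <- (to$mon < from$mon) | (to$mon == from$mon & to$mday < from$mday)
  (to$year - from$year) - not_yet
}

dx$AGE2 <- age_years2(as.Date(dx$BIRTH_DATE, format = "%m/%d/%Y"), Sys.Date())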
Calculate marginal totals as a function within a ddply call
I have been working on a file to calculate hospital infection rates. I want to standardise the infection rates to yearly procedure counts. The data are located here because it is too big for dput. SSI is the number of surgical infections (1 = infected, 0 = not infected), Procedure is the type of procedure, and Year has been derived using lubridate.

library(plyr)
fname <- "https://raw.github.com/johnmarquess/some.data/master/hospG.csv"
download.file(fname, destfile='hospG.csv', method='wget')
hospG <- read.csv('hospG.csv')

Inf_table <- ddply(hospG, "Year", summarise,
                   Infections = sum(SSI == 1),
                   Procedures = length(Procedure),
                   PropInf = round(Infections/Procedures * 100, 2)
)

This gives me the number of infections, procedures, and proportion infected per year for this hospital. What I would like is an additional column with the standardised proportion infected. The long way to do this outside Inf_table is:

s1 <- sum(Inf_table$Infections)
s2 <- sum(Inf_table$Procedures)
Expected_prop_inf <- Inf_table$Procedures * s1/s2

Is there a way to get ddply to do this? I tried making a function with the calculation to produce Expected_prop_inf, but I did not get very far. Thanks for any help offered.
It's more difficult with ddply because you are dividing by a number outside the grouping. Better to do it with base R:

# base
> with(Inf_table, Procedures*(sum(Infections)/sum(Procedures)))
[1] 17.39184 17.09623 23.00847 20.84065 24.83141 24.83141

rather than with ddply, which is not so natural:

# NB note .(Year) is unique for every row, you might also use rownames
> s1 <- sum(Inf_table$Infections)
> s2 <- sum(Inf_table$Procedures)
> ddply(Inf_table, .(Year), summarise, Procedures*(s1/s2))
  Year      ..1
1 2001 17.39184
2 2002 17.09623
3 2003 23.00847
4 2004 20.84065
5 2005 24.83141
6 2006 24.83141
Here is a solution to aggregate using data.table. I'm not sure if it's possible to do it in one step.

require("data.table")
fname <- "https://raw.github.com/johnmarquess/some_data/master/hospG.csv"
hospG <- read.csv(fname)
DT <- as.data.table(hospG)   # the code below assumes hospG as a data.table named DT

Inf_table <- DT[, {Infections = sum(SSI == 1)
                   Procedures = length(Procedure)
                   PropInf = round(Infections/Procedures * 100, 2)
                   list(Infections = Infections,
                        Procedures = Procedures,
                        PropInf = PropInf)
                  }, by = Year]

Inf_table[, Expected_prop_inf := list(Procedures * sum(Infections)/sum(Procedures))]
tables()

The added bonus of this approach is that you are not creating another data.table in the second step; a new column of the existing data.table is created. This is relevant in case your datasets are bigger.
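For completeness, the same post-hoc column can also be added with dplyr in one step (a sketch, using the Inf_table built above):

library(dplyr)
Inf_table <- Inf_table %>%
  mutate(Expected_prop_inf = Procedures * sum(Infections) / sum(Procedures))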