How to run a loop in R to find a unique combination of numbers within a range of 7?

I have a dataset which looks something like this:-
Key Days
A 1
A 2
A 3
A 8
A 9
A 36
A 37
B 14
B 15
B 44
B 45
I would like to split each key's rows into groups based on the days, in windows of 7. For example:
Key Days
A 1
A 2
A 3
Key Days
A 8
A 9
Key Days
A 36
A 37
Key Days
B 14
B 15
Key Days
B 44
B 45
I could use ifelse and specify buckets of 1-7, 8-14, etc. up to 63-70 (the maximum possible value of days). However, the issue lies with the days column: there are many cases where days that belong together straddle a bucket boundary. Take days 14-15 as an example, which would fall into two brackets if split using the ifelse logic (7-14 & 15-21).
The ideal method of splitting this would be to identify a day, add 7 to it, and check how many rows of data actually fall inside that window. I think we need a loop for this. I could do it in Excel, but I have 20,000 rows of data across 2,000 keys, hence I'm using R. I need a loop that checks each key value and, for each key, checks the day values and buckets them into groups of 7, anchored at the first day value of each range.

We create a grouping variable by applying %/% on the 'Days' column and then split the dataset into a list based on that 'grp'.
grp <- df$Days %/% 7
split(df, factor(grp, levels = unique(grp)))
#$`0`
# Key Days
#1 A 1
#2 A 2
#3 A 3
#$`1`
# Key Days
#4 A 8
#5 A 9
#$`5`
# Key Days
#6 A 36
#7 A 37
#$`2`
# Key Days
#8 B 14
#9 B 15
#$`6`
# Key Days
#10 B 44
#11 B 45
Update
If we need to split by 'Key' also
lst <- split(df, list(factor(grp, levels = unique(grp)), df$Key), drop=TRUE)
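For reference, here is a minimal sketch of the "anchor on the first day of each range" logic described in the question, which differs from the %/% approach in that each 7-day window starts at the first observed day rather than at fixed multiples of 7. It assumes df is sorted by Key and Days; new_bucket is an illustrative helper, not a standard function.
new_bucket <- function(days) {
  grp <- integer(length(days))
  start <- days[1]   # anchor the first window at the first observed day
  g <- 1
  for (i in seq_along(days)) {
    if (days[i] - start > 6) {   # outside the current 7-day window: open a new one
      start <- days[i]
      g <- g + 1
    }
    grp[i] <- g
  }
  grp
}
df <- df[order(df$Key, df$Days), ]
df$grp2 <- ave(df$Days, df$Key, FUN = new_bucket)
split(df, list(df$Key, df$grp2), drop = TRUE)
On the sample data this reproduces the desired grouping: A {1,2,3}, A {8,9}, A {36,37}, B {14,15}, B {44,45}.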

Related

Selecting later date observation in panel data in R

I have the following panel data in R:
ID_column<- c("A","A","A","A","B","B","B","B")
Date_column<-c(20040131, 20041231,20051231,20061231, 20051231, 20061231, 20071231, 20081231)
Price_column<-c(12,13,17,19,35,38,39,41)
Data<- data.frame(ID_column, Date_column, Price_column)
#The data looks like this:
ID_column Date_column Price_column
1: A 20040131 12
2: A 20041231 13
3: A 20051231 17
4: A 20061231 19
5: B 20051231 35
6: B 20061231 38
7: B 20071231 39
8: B 20081231 41
My next aim would be to convert the Date column which is currently in a numeric YYYYMMDD format into YYYY by simply taking the first four digits of each entry in the data column as follows:
Data$Date_column<- substr(Data$Date_column,1,4)
#The data then looks like:
ID_column Date_column Price_column
1 A 2004 12
2 A 2004 13
3 A 2005 17
4 A 2006 19
5 B 2005 35
6 B 2006 38
7 B 2007 39
8 B 2008 41
My ultimate goal is to employ the plm package for panel data regression, but when applying pdata.frame to set the ID and Time variables as indices, I get error messages about duplicate ID/Time pairs (in this case rows 1 and 2, which would both be tagged A, 2004). To solve this, I would like to delete row 1 in the original data and keep only the newer observation from the year 2004. This would then provide me with unique ID/Time pairs across the whole data.
Therefore I was hoping someone could help me out with a loop or a package suggestion that keeps only the row with the newer/later observation within a year, where this occurs, in a way that also scales to larger data sets. I believe this involves a couple of conditional steps that I am having difficulty putting together. A loop that evaluates whether the first four digits of consecutive date observations are identical and then deletes the one with the "smaller" date (keeping the "larger" one) would do it, but my experience with loops is very limited.
Kind regards and thank you!
I'd recommend keeping Date_column as a reference for picking the later observation, and creating a new column holding only the year, since you want the latest observation for each year.
library(dplyr)

Data$year <- substr(Data$Date_column, 1, 4)
Data$Date_column <- lubridate::ymd(Data$Date_column)

Data %>%
  arrange(desc(Date_column)) %>%
  distinct(ID_column, year, .keep_all = TRUE) %>%
  arrange(Date_column)
ID_column Date_column Price_column year
1 A 2004-12-31 13 2004
2 A 2005-12-31 17 2005
3 B 2005-12-31 35 2005
4 A 2006-12-31 19 2006
5 B 2006-12-31 38 2006
6 B 2007-12-31 39 2007
Since we arranged by the actual date in descending order, the rows dropped for each unique ID/year combination are guaranteed to be the older ones. Reverse the arrangement to keep the oldest occurrence instead.
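A base R alternative sketch, in case you prefer to avoid dplyr/lubridate; it assumes Data still holds the numeric YYYYMMDD dates as built above, so the maximum date within each ID/year pair is the latest observation.
Data$year <- substr(Data$Date_column, 1, 4)
# keep the row(s) whose date equals the per-ID/year maximum
keep <- ave(Data$Date_column, Data$ID_column, Data$year, FUN = max) == Data$Date_column
Data[keep, ]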

switching list elements with dataframe rows

Consider my list IDs that has a dataframe of behaviours in each one:
IDs <- list(
  Dave = data.frame(Behaviour = c("Aggression", "Interaction", "Nursing"),
                    number = c(20, 10, 5),
                    duration = c(60, 39, 27)),
  James = data.frame(Behaviour = c("Aggression", "Interaction"),
                     number = c(21, 30),
                     duration = c(30, 49))
)
IDs
$Dave
Behaviour number duration
1 Aggression 20 60
2 Interaction 10 39
3 Nursing 5 27
$James
Behaviour number duration
1 Aggression 21 30
2 Interaction 30 49
Note that James does not exhibit any nursing behaviour, so the two list elements have different numbers of rows.
I want to swap the list elements with the data frame rows, so that I have a list keyed by behaviour, each element a data frame with an ID column. It should look like this:
$Aggression
ID number duration
1 Dave 20 60
2 James 21 30
$Interaction
ID number duration
1 Dave 10 39
2 James 30 49
$Nursing
ID number duration
1 Dave 5 27
I thought that it could be achieved with reshape2::melt, but I wasn't able to get further than melt(IDs, id = "Behaviour").
Any ideas?
Generally you can do it in two steps:
turning the list into a single data.frame/data.table
splitting it based on Behaviour
You can do it like this, for example:
dt <- data.table::rbindlist(IDs, idcol = "ID")
# or: dt <- dplyr::bind_rows(IDs, .id = "ID")
split(dt, dt$Behaviour)
Note:
If you don't want the Behaviour column in the result and you used the data.table approach, you can modify the split to:
split(dt[,!"Behaviour"], dt$Behaviour)
Try this:
tmp <- data.frame(ID = rep(names(IDs), vapply(IDs, nrow, 1L)),
                  do.call(rbind, IDs),
                  row.names = NULL)
split(tmp[-2], tmp$Behaviour)
#$Aggression
# ID number duration
#1 Dave 20 60
#4 James 21 30
#$Interaction
# ID number duration
#2 Dave 10 39
#5 James 30 49
#$Nursing
# ID number duration
#3 Dave 5 27
Or using base R
d1 <- do.call(rbind, Map(cbind, id = names(IDs), IDs))
split(d1, d1$Behaviour)
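If you also want the result shaped exactly like the desired output above, i.e. without the Behaviour column and with clean row numbers, a small cosmetic sketch (assuming d1 as built in the previous snippet):
out <- split(d1[, c("id", "number", "duration")], d1$Behaviour)
lapply(out, function(x) { rownames(x) <- NULL; x })   # reset the row names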

R - How to sum a column based on date range? [duplicate]

This question already has an answer here:
R // Sum by based on date range
Suppose I have df1 like this:
Date Var1
01/01/2015 1
01/02/2015 4
....
07/24/2015 1
07/25/2015 6
07/26/2015 23
07/27/2015 15
Q1: Sum of Var1 over the previous 3 days of 7/27/2015 (not including 7/27).
Q2: Sum of Var1 over the previous 3 days of 7/25/2015 (not the last row); basically, I want to choose any day as the reference day and then calculate the rolling sum.
As suggested in one of the comments in the link referenced by @SeñorO, with a little bit of work you can use zoo::rollsum:
library(zoo)
set.seed(42)
df <- data.frame(d = seq.POSIXt(as.POSIXct('2015-01-01'), as.POSIXct('2015-02-14'), by = 'days'),
                 x = sample(20, size = 45, replace = TRUE))
k <- 3
df$sum3 <- c(0, cumsum(df$x[1:(k-1)]),
             head(zoo::rollsum(df$x, k = k), n = -1))
df
## d x sum3
## 1 2015-01-01 16 0
## 2 2015-01-02 12 16
## 3 2015-01-03 15 28
## 4 2015-01-04 15 43
## 5 2015-01-05 17 42
## 6 2015-01-06 10 47
## 7 2015-01-07 11 42
The c(0, cumsum(...)) part fills the first k rows, which rollsum() cannot supply: row 1 has no previous days (hence 0), and rows 2 through k get the cumulative sum of the days before them (rollsum(x, k) returns a vector of length length(x) - k + 1). The head(..., n = -1) discards the last element, shifting the sums down one row, because the nth entry should sum the previous 3 days and not include its own row.
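For Q2 (an arbitrary reference day rather than the last row), a small helper sketch, assuming df as built above; sum_prev_k is an illustrative name, not part of zoo:
sum_prev_k <- function(df, ref_date, k = 3) {
  ref <- as.POSIXct(ref_date)
  # previous k days, excluding the reference day itself
  in_window <- df$d >= ref - k * 86400 & df$d < ref
  sum(df$x[in_window])
}
sum_prev_k(df, '2015-01-25')   # sums x over 2015-01-22 .. 2015-01-24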

Turning one row into multiple rows in r [duplicate]

This question already has answers here:
Combine Multiple Columns Into Tidy Data [duplicate]
In R, I have data where each person has multiple session dates and scores on some tests, all in one row. I would like to change it so that I have multiple rows with the person's info, each holding one session date and its corresponding test scores, for every person. Also, each person may have completed a different number of sessions.
Ex:
ID Name Session1Date Score Score Session2Date Score Score
23 sjfd 20150904 2 3 20150908 5 7
28 addf 20150905 3 4 20150910 6 8
To:
ID Name SessionDate Score Score
23 sjfd 20150904 2 3
23 sjfd 20150908 5 7
28 addf 20150905 3 4
28 addf 20150910 6 8
You can use melt from the devel version of data.table, i.e. v1.9.5+, which can take multiple 'measure' columns specified as a list of patterns.
library(data.table)#v1.9.5+
melt(setDT(df1), measure = patterns("Date$", "Score(\\.2)*$", "Score\\.[13]"))
# ID Name variable value1 value2 value3
#1: 23 sjfd 1 20150904 2 3
#2: 28 addf 1 20150905 3 4
#3: 23 sjfd 2 20150908 5 7
#4: 28 addf 2 20150910 6 8
Or using reshape from base R, we can specify the direction as 'long' and varying as a list of column indices:
res <- reshape(df1, idvar = c('ID', 'Name'),
               varying = list(c(3,6), c(4,7), c(5,8)),
               direction = 'long')
res
# ID Name time Session1Date Score Score.1
#23.sjfd.1 23 sjfd 1 20150904 2 3
#28.addf.1 28 addf 1 20150905 3 4
#23.sjfd.2 23 sjfd 2 20150908 5 7
#28.addf.2 28 addf 2 20150910 6 8
If needed, the rownames can be changed
row.names(res) <- NULL
Update
If the columns follow a specific order i.e. 3rd grouped with 6th, 4th with 7th, 5th with 8th, we can create a matrix of column index and then split to get the list for the varying argument in reshape.
m1 <- matrix(3:8,ncol=2)
lst <- split(m1, row(m1))
reshape(df1, idvar=c('ID', 'Name'), varying=lst, direction='long')
If your data frame is named data, use this:
data1 <- data[1:5]
data2 <- data[c(1,2,6,7,8)]
newdata <- rbind(data1,data2)
This works for the example you've given, but note that base rbind() requires matching column names, so you will have to align the names in data1 and data2 first; see the sketch below.
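A minimal sketch of that renaming step, using the column positions from the example (the neutral name SessionDate is illustrative):
data1 <- data[1:5]
data2 <- data[c(1, 2, 6, 7, 8)]
names(data2) <- names(data1)         # base rbind() needs identical column names
newdata <- rbind(data1, data2)
names(newdata)[3] <- "SessionDate"   # optional: a neutral name for the date column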

Assign rows to a group based on spatial neighborhood and temporal criteria in R

I have an issue that I just cannot seem to sort out. I have a dataset that was derived from a raster in arcgis. The dataset represents every fire occurrence during a 10-year period. Some raster cells had multiple fires within that time period (and, thus, will have multiple rows in my dataset) and some raster cells will not have had any fire (and, thus, will not be represented in my dataset). So, each row in the dataset has a column number (sequential integer) and a row number assigned to it that corresponds with the row and column ID from the raster. It also has the date of the fire.
I would like to assign a unique ID (fire_ID) to all of the fires that are within 4 days of each other and in adjacent pixels from one another (within the 8-cell neighborhood) and put this into a new column.
To clarify, if there were an observation from row 3, col 3, Jan 1, 2000 and another from row 2, col 4, Jan 4, 2000, those observations would be assigned the same fire_ID.
Below is a sample dataset with "rows", which are the row IDs of the raster, "cols", which are the column IDs of the raster, and "dates" which are the dates the fire was detected.
rows<-sample(seq(1,50,1),600, replace=TRUE)
cols<-sample(seq(1,50,1),600, replace=TRUE)
dates<-sample(seq(from=as.Date("2000/01/01"), to=as.Date("2000/02/01"), by="day"),600, replace=TRUE)
fire_df<-data.frame(rows, cols, dates)
I've tried sorting the data by row, then column, then date, and looping through: carrying the previous fire_ID forward when the row and column IDs are within one and the date is within 4 days, and starting a new fire_ID otherwise. This obviously doesn't work, because fires that should share a fire_ID are assigned different ones whenever observations belonging to a different fire sit between them in the sorted list.
fire_df2 <- fire_df[order(fire_df$rows, fire_df$cols, fire_df$dates), ]
fire_ID <- numeric(length = nrow(fire_df2))
fire_ID[1] <- 1
for (i in 2:nrow(fire_df2)) {
  fire_ID[i] <- ifelse(
    abs(fire_df2$rows[i] - fire_df2$rows[i-1]) <= 1 &
      abs(fire_df2$cols[i] - fire_df2$cols[i-1]) <= 1 &
      abs(as.numeric(fire_df2$dates[i] - fire_df2$dates[i-1])) <= 4,
    fire_ID[i-1],
    i)
}
length(unique(fire_ID))
fire_df2$fire_ID<-fire_ID
Please let me know if you have any suggestions.
I think this task requires something along the lines of hierarchical clustering.
Note, however, that there will necessarily be some degree of arbitrariness in the ids: it is entirely possible that a cluster of fires spans more than 4 days in total, yet every fire is within 4 days of some other fire in that cluster (and thus should share the same id).
library(dplyr)
# Create the distances
fire_dist <- fire_df %>%
  # Normalize dates so a 4-day gap corresponds to a distance of 1
  mutate(norm_dates = as.numeric(dates) / 4) %>%
  # Only keep the three variables of interest
  select(rows, cols, norm_dates) %>%
  # Compute distance using the L-infinity norm (maximum)
  dist(method = "maximum")
# Do hierarchical clustering with the "single" agglomeration method
fire_clust <- hclust(fire_dist, method = "single")
# Cut the tree at height 1 and obtain groups
group_id <- cutree(fire_clust, h = 1)
# First attach the group ids back to the data frame
fire_df2 <- cbind(fire_df, group_id) %>%
  # Then sort the data
  arrange(group_id, dates, rows, cols)
# Print the first 10 records
fire_df2[1:10, ]
(Make sure you have the dplyr package installed; you can run install.packages("dplyr", dep = TRUE) if not. It is a very good and very popular package for data manipulation.)
A couple of simple tests:
Test #1. The same forest fire moving.
rows<-1:6
cols<-1:6
dates<-seq(from=as.Date("2000/01/01"), to=as.Date("2000/01/06"), by="day")
fire_df<-data.frame(rows, cols, dates)
gives me this:
rows cols dates group_id
1 1 1 2000-01-01 1
2 2 2 2000-01-02 1
3 3 3 2000-01-03 1
4 4 4 2000-01-04 1
5 5 5 2000-01-05 1
6 6 6 2000-01-06 1
Test #2. 6 different random forest fires.
set.seed(1234)
rows<-sample(seq(1,50,1),6, replace=TRUE)
cols<-sample(seq(1,50,1),6, replace=TRUE)
dates<-sample(seq(from=as.Date("2000/01/01"), to=as.Date("2000/02/01"), by="day"),6, replace=TRUE)
fire_df<-data.frame(rows, cols, dates)
output:
rows cols dates group_id
1 6 1 2000-01-10 1
2 32 12 2000-01-30 2
3 31 34 2000-01-10 3
4 32 26 2000-01-27 4
5 44 35 2000-01-10 5
6 33 28 2000-01-09 6
Test #3: one expanding forest fire
dates <- seq(from = as.Date("2000/01/01"), to = as.Date("2000/01/06"), by = "day")
rows_start <- 50
cols_start <- 50
fire_df <- data.frame(dates = dates) %>%
  rowwise() %>%
  do({
    diff <- as.numeric(.$dates - as.Date("2000/01/01"))
    expand.grid(rows = seq(rows_start - diff, rows_start + diff),
                cols = seq(cols_start - diff, cols_start + diff),
                dates = .$dates)
  })
gives me:
rows cols dates group_id
1 50 50 2000-01-01 1
2 49 49 2000-01-02 1
3 49 50 2000-01-02 1
4 49 51 2000-01-02 1
5 50 49 2000-01-02 1
6 50 50 2000-01-02 1
7 50 51 2000-01-02 1
8 51 49 2000-01-02 1
9 51 50 2000-01-02 1
10 51 51 2000-01-02 1
and so on. (All records identified correctly to belong to the same forest fire.)
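A quick sketch of the chaining caveat noted earlier: three fires in the same cell on Jan 1, Jan 4, and Jan 8 all receive the same group_id under single linkage, even though the first and last are 7 days apart, because each link in the chain is within 4 days of the next.
chain_df <- data.frame(rows  = c(3, 3, 3),
                       cols  = c(3, 3, 3),
                       dates = as.Date(c("2000-01-01", "2000-01-04", "2000-01-08")))
# Re-running the clustering pipeline above on chain_df yields a single group.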
