R ddply: sum a value with the value from the next row

I want to add the value of a column in one row to its value in the next row.
> df
+----+------+--------+------+
| id | Val | Factor | Col |
+----+------+--------+------+
| 1 | 15 | 1 | 7 |
| 3 | 20 | 1 | 4 |
| 2 | 35 | 2 | 8 |
| 7 | 35 | 1 | 12 |
| 5 | 40 | 1 | 11 |
| 6 | 45 | 2 | 13 |
| 4 | 55 | 1 | 4 |
| 8 | 60 | 1 | 7 |
| 9 | 15 | 2 | 12 |
..........
I would like the mean of Row$Val + nextRow$Val, where the next row is identified by its id and Col. I can't assume that the id or Col values are consecutive.
I am using ddply to summarize my df. I have tried
> ddply(df, .(Factor), summarize,
max(Val),
sum(Val),
mean(Val + df[df$id == id+1 & df$Col = Col]$Val)
)
> "longer object length is not a multiple of shorter object length"

You can build a vector of values with
sapply(df$id, function(x){mean(c(
subset(df, id == x, select = Val, drop = TRUE),
subset(df, id == x+1, select = Val, drop = TRUE)
))})
You could simplify, but I tried to make it as readable as possible.

You can use rollapply from the zoo package. Since you want the mean of only two consecutive rows, you can try
library(zoo)
rollapply(df[order(df$id), 2], 2, function(x) sum(x)/2)
#[1] 25.0 27.5 37.5 47.5 42.5 40.0 47.5 37.5

You can do something like this with dplyr package:
library(dplyr)
df <- arrange(df, id)
mean(df$Val + lead(df$Val), na.rm = TRUE)
[1] 76.25
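If the goal is one value per Factor, as in the original ddply call, the same idea can be applied within each group. The following is a minimal base R sketch, assuming that "next row" means the row with the next larger id inside the same Factor group (the question leaves this open). The helper name `pair_sum_mean` is hypothetical, and the data frame only contains the nine rows shown in the question.

```r
# The nine rows shown in the question
df <- data.frame(
  id     = c(1, 3, 2, 7, 5, 6, 4, 8, 9),
  Val    = c(15, 20, 35, 35, 40, 45, 55, 60, 15),
  Factor = c(1, 1, 2, 1, 1, 2, 1, 1, 2),
  Col    = c(7, 4, 8, 12, 11, 13, 4, 7, 12)
)

# Sort by id so that "next row" means "next larger id"
df <- df[order(df$id), ]

# Hypothetical helper: mean of Val[i] + Val[i + 1] over consecutive pairs
pair_sum_mean <- function(v) {
  if (length(v) < 2) return(NA_real_)  # no pair available
  mean(head(v, -1) + tail(v, -1))
}

# One value per Factor group
res <- tapply(df$Val, df$Factor, pair_sum_mean)
print(res)
# Factor 1: 75, Factor 2: 70
```

Returning NA for groups with a single row avoids the recycling warning that the original attempt ran into.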

Related

Adding and subtracting values from columns in dataframe

I've a dataframe as under:
+-----+---------+-----+-----+-----------------+----------+----------+------------+
| ID | CURRENT | JAN | FEB | CURRENT_IN_2018 | JAN_2018 | FEB_2018 | UNITS_SWAP |
+-----+---------+-----+-----+-----------------+----------+----------+------------+
| 123 | 2 | 3 | 4 | 5 | 6 | 7 | 12 |
| 456 | 1 | 5 | 0 | 4 | 8 | 6 | 6 |
+-----+---------+-----+-----+-----------------+----------+----------+------------+
What I'm trying to do here is subtract the number in UNITS_SWAP from CURRENT_IN_2018, JAN_2018 and FEB_2018 sequentially until the number in UNITS_SWAP reaches zero. Also while doing this, add the number of subtracted UNITS_SWAP from each row to their respective matching rows, for instance if 5 units are deducted from current_in_2018 then add 5 units in CURRENT, so on and so forth for JAN and FEB such that the end result is as under:
+-----+---------+-----+-----+-----------------+----------+----------+------------+
| ID | CURRENT | JAN | FEB | CURRENT_IN_2018 | JAN_2018 | FEB_2018 | UNITS_SWAP |
+-----+---------+-----+-----+-----------------+----------+----------+------------+
| 123 | 7 | 9 | 5 | 0 | 0 | 6 | 0 |
| 456 | 5 | 7 | 0 | 0 | 6 | 6 | 0 |
+-----+---------+-----+-----+-----------------+----------+----------+------------+
Script to load the data:
DF <- data.frame(ID = c(123,456),
CURRENT = c(2,1),
JAN = c(3,5),
FEB=c(4,0),
CURRENT_2018 = c(5,4),
JAN_2018 = c(6,8),
FEB_2018=c(7,6),
UNITS_SWAP =c(12,6))
You could do this, though note that it will overwrite your source DF:
cols <- c('CURRENT', 'JAN', 'FEB')
for (i in seq_len(nrow(DF))) {
  for (col in cols) {
    # stop early once this row has nothing left to swap
    if (DF[i, 'UNITS_SWAP'] <= 0) break
    col_2018 <- paste0(col, '_2018')
    # move as many units as this column can supply
    excess <- min(DF[i, 'UNITS_SWAP'], DF[i, col_2018])
    DF[i, col] <- DF[i, col] + excess
    DF[i, col_2018] <- DF[i, col_2018] - excess
    DF[i, 'UNITS_SWAP'] <- DF[i, 'UNITS_SWAP'] - excess
  }
}
Because your columns have a definite structure (column matching column_2018), we just need to run through them in the order you specified, and paste _2018 to get the relevant matching column.
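For larger data frames the row-by-row loop can be avoided. The following is a sketch of a vectorised alternative, not the answer's own method: for each row it computes how many units the earlier _2018 columns could already supply, and from that how many units each column has to give up. The names `avail`, `before` and `moved` are introduced here for illustration only.

```r
DF <- data.frame(ID = c(123, 456),
                 CURRENT = c(2, 1), JAN = c(3, 5), FEB = c(4, 0),
                 CURRENT_2018 = c(5, 4), JAN_2018 = c(6, 8), FEB_2018 = c(7, 6),
                 UNITS_SWAP = c(12, 6))

cols <- c("CURRENT", "JAN", "FEB")
cols_2018 <- paste0(cols, "_2018")

# Units available in each _2018 column (one row per ID)
avail <- as.matrix(DF[cols_2018])

# Units that the columns to the left could already supply
before <- t(apply(avail, 1, cumsum)) - avail

# Units actually moved out of each _2018 column
moved <- pmin(avail, pmax(DF$UNITS_SWAP - before, 0))

# Apply the transfer column by column
for (j in seq_along(cols)) {
  DF[[cols[j]]]      <- DF[[cols[j]]] + moved[, j]
  DF[[cols_2018[j]]] <- DF[[cols_2018[j]]] - moved[, j]
}
DF$UNITS_SWAP <- DF$UNITS_SWAP - rowSums(moved)

print(DF)
```

If UNITS_SWAP exceeds the total available in a row, the remainder simply stays in UNITS_SWAP.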

Data imputation for empty subsetted dataframes in R

I'm trying to build a function in R in which I can subset my raw dataframe according to some specifications, and thereafter convert this subsetted dataframe into a proportion table.
Unfortunately, some of these subsets yield an empty dataframe, because for some particular specifications I do not have data; hence no proportion table can be calculated. So, what I would like to do is take the closest time step for which I have a non-empty subset and use it as input in place of the empty one.
Here are some details about my dataframe and function:
My raw dataframe looks roughly as follows:
| year | quarter | area | time_comb | no_individuals | lenCls | age |
|------|---------|------|-----------|----------------|--------|-----|
| 2005 | 1 | 24 | 2005.1.24 | 8 | 380 | 3 |
| 2005 | 2 | 24 | 2005.2.24 | 4 | 490 | 2 |
| 2005 | 1 | 24 | 2005.1.24 | 3 | 460 | 6 |
| 2005 | 1 | 21 | 2005.1.21 | 25 | 400 | 2 |
| 2005 | 2 | 24 | 2005.2.24 | 1 | 680 | 6 |
| 2005 | 2 | 21 | 2005.2.21 | 2 | 620 | 5 |
| 2005 | 3 | 21 | 2005.3.21 | NA | NA | NA |
| 2005 | 1 | 21 | 2005.1.21 | 1 | 510 | 5 |
| 2005 | 1 | 24 | 2005.1.24 | 1 | 670 | 4 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 750 | 4 |
| 2006 | 4 | 24 | 2006.4.24 | 1 | 660 | 8 |
| 2006 | 2 | 24 | 2006.2.24 | 8 | 540 | 3 |
| 2006 | 2 | 24 | 2006.2.24 | 4 | 560 | 3 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 250 | 2 |
| 2006 | 3 | 22 | 2006.3.22 | 1 | 520 | 2 |
| 2006 | 2 | 24 | 2006.2.24 | 1 | 500 | 2 |
| 2006 | 2 | 22 | 2006.2.22 | NA | NA | NA |
| 2006 | 2 | 21 | 2006.2.21 | 3 | 480 | 2 |
| 2006 | 1 | 24 | 2006.1.24 | 1 | 640 | 5 |
| 2007 | 4 | 21 | 2007.4.21 | 2 | 620 | 3 |
| 2007 | 2 | 21 | 2007.2.21 | 1 | 430 | 3 |
| 2007 | 4 | 22 | 2007.4.22 | 14 | 410 | 2 |
| 2007 | 1 | 24 | 2007.1.24 | NA | NA | NA |
| 2007 | 2 | 24 | 2007.2.24 | NA | NA | NA |
| 2007 | 3 | 24 | 2007.3.24 | NA | NA | NA |
| 2007 | 4 | 24 | 2007.4.24 | NA | NA | NA |
| 2007 | 3 | 21 | 2007.3.21 | 1 | 560 | 4 |
| 2007 | 1 | 21 | 2007.1.21 | 7 | 300 | 3 |
| 2007 | 3 | 23 | 2007.3.23 | 1 | 640 | 5 |
Here year, quarter and area refer to a particular time (year and quarter) and area for which a number of individuals were measured (no_individuals). For example, the first row says that in the first quarter of 2005 in area 24 I had 8 individuals belonging to a length class (lenCls) of 380 mm with age=3. It is worth mentioning that for a particular year, quarter and area combination I can have different length classes and ages (thus, multiple rows)!
So what I want to do is basically to subset the raw dataframe for a particular year, quarter and area combination, and from that combination calculate a proportion table based on the number of individuals in each length class.
So far my basic function looks as follows:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # repeat each row once per measured individual
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  # plot before returning; code placed after return() never runs
  if(alkplot==TRUE){
    alkPlot(key,"area",xlab="Age")
  }
  return(key)
}
From the example dataset above, one can see that for year=2005 & quarter=3 & area=21, I do not have any measured individuals. Yet, for the same area AND year I have data for quarters 1 and 2. The most reasonable assumption would be to take the subset from the closest time step (here quarter 2 with the same area and year), and replace the NA in the columns "no_individuals", "lenCls" and "age" accordingly.
Note also that for some cases I do not have data for a particular year! In the example above, one can see this by looking into area 24 from year 2007. In this case I can not borrow the information from the nearest quarter, and would need to borrow from the previous year instead. This would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter 1, and so on and so forth.
I have tried to include this in my function by specifying some extra rules, but due to my poor programming skills I didn't make any progress.
So, any help here will be very much appreciated.
Here is my LAK function, which I'm trying to update:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # In case of empty dataset (no rows, or rows that only hold NA)
  if(nrow(sALK)==0 || any(rowSums(is.na(sALK)) > 0)){
    warning("Empty subset combination; data will be subsetted based on the nearest timestep combination")
    # FIXME: include imputation rules here
  }
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  # plot before returning; code placed after return() never runs
  if(alkplot==TRUE){
    alkPlot(key,"area",xlab="Age")
  }
  return(key)
}
So, I finally came up with a partial solution to my problem and will include my function here in case it is of interest to someone:
LAK <- function(df, Year="2005", Quarter="1", Area="22",alkplot=T){
require(FSA)
# subset alk by year, quarter, area and species
sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
print(sALK)
if(nrow(sALK)==1){
warning("Empty subset combination; data has been subsetted to the nearest input combination")
syear <- unique(as.numeric(as.character(sALK$year)))
sarea <- unique(as.numeric(as.character(sALK$area)))
sALK2 <- subset(df, year==syear & area==sarea)
vals <- as.data.frame(table(sALK2$comb_index))
colnames(vals)[1] <- "comb_index"
idx <- which(vals$Freq>1)
quarterId <- as.numeric(as.character(vals[idx,"comb_index"]))
imput <- subset(df,year==syear & area==sarea & comb_index==quarterId)
dfexp2 <- imput[rep(seq(nrow(imput)), imput$no_at_length_age), 1:ncol(imput)]
raw2 <- t(table(dfexp2$lenCls, dfexp2$age))
key2 <- round(prop.table(raw2, margin=1), 3)
print(key2)
if(alkplot==TRUE){
alkPlot(key2,"area",xlab="Age")
}
} else {
dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_at_length_age), 1:ncol(sALK)]
raw <- t(table(dfexp$lenCls, dfexp$age))
key <- round(prop.table(raw, margin=1), 3)
print(key)
if(alkplot==TRUE){
alkPlot(key,"area",xlab="Age")
}
}
}
This solves my problem when I have data for at least one quarter of a particular Year & Area combination. Yet, I'm still struggling to figure out how to handle the case where I do not have any data for a particular Year & Area combination. In this case I need to borrow data from the closest Year that contains data for all the quarters for the same area.
In the example above, this would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter=1, and so on.
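The borrowing rule described above could be sketched as a separate lookup step that runs before the proportion table is built. This is only a sketch under stated assumptions: the helper name `borrow_subset` is hypothetical, the priority is assumed to be "same year and area, closest quarter" first and "same quarter and area, closest year" second, and ties go to the earlier quarter or year. The toy data frame is a small excerpt in the style of the table above, not the full dataset.

```r
# Toy excerpt with the same columns as the raw dataframe
df <- data.frame(
  year           = c(2005, 2005, 2005, 2006, 2006, 2007, 2007),
  quarter        = c(1, 2, 3, 1, 2, 1, 2),
  area           = c(21, 21, 21, 24, 24, 24, 24),
  no_individuals = c(25, 2, NA, 1, 8, NA, NA),
  lenCls         = c(400, 620, NA, 640, 540, NA, NA),
  age            = c(2, 5, NA, 5, 3, NA, NA)
)

# Hypothetical helper: rows to use for a (Year, Quarter, Area) request,
# borrowed from a neighbouring time step when the request has no data
borrow_subset <- function(df, Year, Quarter, Area) {
  Year <- as.numeric(Year); Quarter <- as.numeric(Quarter); Area <- as.numeric(Area)
  # rows of this area that actually hold measurements
  has_data <- df[df$area == Area & !is.na(df$no_individuals), ]

  exact <- has_data[has_data$year == Year & has_data$quarter == Quarter, ]
  if (nrow(exact) > 0) return(exact)

  # 1) same year: closest quarter (ties go to the earlier quarter)
  same_year <- has_data[has_data$year == Year, ]
  if (nrow(same_year) > 0) {
    q <- sort(unique(same_year$quarter))
    best_q <- q[which.min(abs(q - Quarter))]
    return(same_year[same_year$quarter == best_q, ])
  }

  # 2) same quarter: closest year (ties go to the earlier year)
  same_quarter <- has_data[has_data$quarter == Quarter, ]
  if (nrow(same_quarter) > 0) {
    y <- sort(unique(same_quarter$year))
    best_y <- y[which.min(abs(y - Year))]
    return(same_quarter[same_quarter$year == best_y, ])
  }

  has_data[0, ]  # nothing to borrow from
}

borrow_subset(df, 2005, 3, 21)  # falls back to 2005, quarter 2
borrow_subset(df, 2007, 1, 24)  # falls back to 2006, quarter 1
```

The returned rows could then be expanded and tabulated exactly as in the function above; a zero-row result signals that no suitable neighbour exists.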
I don't know if you have ever encountered MICE, but it is a pretty cool and comprehensive tool for variable imputation. It also allows you to see how the imputed data is distributed so that you can choose the method most suited for your problem. Check this brief explanation and the original package description

Random sample by group and filtering on the basis of result

I have a dataframe that is generated by the following code
l_ids = c(1, 1, 1, 2, 2, 2, 2)
l_months = c(5, 5, 5, 88, 88, 88, 88)
l_calWeek = c(201708, 201709, 201710, 201741, 201742, 201743, 201744)
value = c(5, 6, 3, 99, 100, 1001, 1002)
dat <- setNames(data.frame(cbind(l_ids, l_months, l_calWeek, value)),
c("ids", "months", "calWeek", "value"))
and looks like this:
+----+-------+----------+-------+
| Id | Month | Cal Week | Value |
+----+-------+----------+-------+
| 1 | 5 | 201708 | 4.5 |
| 1 | 5 | 201709 | 5 |
| 1 | 5 | 201710 | 6 |
| 2 | 88 | 201741 | 75 |
| 2 | 88 | 201742 | 89 |
| 2 | 88 | 201743 | 90 |
| 2 | 88 | 201744 | 51 |
+----+-------+----------+-------+
I would like to randomly sample a calendar week from each id-month group (the months are not calendar months). Then I would like to keep, within each id-month group, all rows up to and including the sampled calendar week.
An example output could be: suppose the sampling returned cal week 201743 for the group id=2 and month=88 and 201709 for the group id=1 and month=5, then the final output should be
+----+-------+----------+-------+
| Id | Month | Cal Week | Value |
+----+-------+----------+-------+
| 1 | 5 | 201708 | 4.5 |
| 1 | 5 | 201709 | 5 |
| 2 | 88 | 201741 | 75 |
| 2 | 88 | 201742 | 89 |
| 2 | 88 | 201743 | 90 |
+----+-------+----------+-------+
I tried to work with dplyr's sample_n function (which gives me the random calendar week by id-month group), but then I do not know how to get all calendar weeks prior to that date. Can you help me with this? If possible, I would like to work with dplyr.
Please let me know in case you need further information.
Many thanks
require(dplyr)
set.seed(1) # when sampling please set.seed
sampled <- dat %>% group_by(ids) %>% do(., sample_n(.,1))
sampled_day <- sampled$calWeek
dat %>% group_by(ids) %>%
mutate(max_day = which(calWeek %in% sampled_day)) %>%
filter(row_number() <= max_day)
#You can also just filter directly with row_number() <= which(calWeek %in% sampled_day)
# A tibble: 3 x 4
# Groups: ids [2]
ids months calWeek value
<dbl> <dbl> <dbl> <dbl>
1 1.00 5.00 201708 5.00
2 2.00 88.0 201741 99.0
3 2.00 88.0 201742 100
This depends on the row order! So make sure to arrange by day first. You'll need to think about ties, though. I have edited my previous answer and simply filtered with <=
That should do the trick:
sample_and_get_below <- function(df, when, size){
res <- filter(df, calWeek == when) %>%
sample_n(size)
filter(df, calWeek > when) %>%
rbind(res, .)
}
sample_and_get_below(dat, 201741, 1)
ids months calWeek value
1 2 88 201741 99
2 2 88 201742 100
3 2 88 201743 1001
4 2 88 201744 1002
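The request in the question (sample one week per id-month group, then keep everything up to and including it) can also be written as a single grouped filter. This is a minimal sketch assuming dplyr is installed and that calWeek values sort chronologically as numbers. It indexes with `sample.int(n(), 1)` rather than calling `sample(calWeek, 1)`, because `sample()` on a single number samples from 1 to that number.

```r
library(dplyr)

dat <- data.frame(
  ids     = c(1, 1, 1, 2, 2, 2, 2),
  months  = c(5, 5, 5, 88, 88, 88, 88),
  calWeek = c(201708, 201709, 201710, 201741, 201742, 201743, 201744),
  value   = c(5, 6, 3, 99, 100, 1001, 1002)
)

set.seed(1)  # make the random draw reproducible

result <- dat %>%
  group_by(ids, months) %>%
  # draw one week per group and keep all rows up to and including it
  filter(calWeek <= calWeek[sample.int(n(), 1)]) %>%
  ungroup()

print(result)
```

Because the comparison is on calWeek itself, the result does not depend on the row order of dat.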

Calculate a mean and deal with NA

I got a dataset (df) that looks like this:
LETTER | VALUE |
A | 2 |
A | 3 |
B | 4 |
B | NA |
B | 6 |
B | NA |
C | NA |
C | NA |
I'm looking for a way to create a second dataset (new_df) based on the mean of VALUE for each LETTER. But I also need to see which letters have only NA values.
new_df should look like this:
LETTER | VALUE |
A | 2,5 |
B | 5 |
C | NA |
Here is the code I tried :
new_df <- aggregate(as.numeric(VALUE) ~ LETTER, df, mean)
The issue is that it omits NA and only returns this:
LETTER | VALUE |
A | 2,5 |
B | 5 |
Can you please help?
You can override the defaults of aggregate():
aggregate(as.numeric(VALUE) ~ LETTER, df, function(x) mean(x, na.rm=TRUE),
na.action = na.pass)
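One detail to be aware of: with na.rm=TRUE, a group that contains only NA values yields NaN rather than NA, because the mean of an empty numeric vector is NaN. The following sketch returns NA explicitly for such groups. The data frame is rebuilt from the table in the question, and using VALUE directly (instead of as.numeric(VALUE)) assumes the column is already numeric.

```r
df <- data.frame(
  LETTER = c("A", "A", "B", "B", "B", "B", "C", "C"),
  VALUE  = c(2, 3, 4, NA, 6, NA, NA, NA)
)

# Mean that ignores NA, but reports NA when a group has no values at all
mean_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)
}

new_df <- aggregate(VALUE ~ LETTER, data = df, FUN = mean_or_na,
                    na.action = na.pass)
print(new_df)
# A: 2.5, B: 5, C: NA
```

Keeping na.action = na.pass is what stops aggregate() from dropping the NA rows before the function is applied.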

Create new variable based on the order of values in other columns

I have a data frame relative to accesses to a website. Several accesses per day, with different possible actions and descriptions of the actions
People | Date | Time | Action | Descr |
| | | | |
j | 01/01/2010 | 10:13 | X | A |
j | 01/01/2010 | 10:15 | Y | B |
j | 02/01/2010 | 14:15 | Z | C |
j | 03/01/2010 | 11:45 | X | D |
j | 03/01/2010 | 13:56 | X | E |
j | 03/01/2010 | 18:43 | Z | F |
j | 03/01/2010 | 18:44 | X | A |
After reducing the data frame to a balanced daily panel data, I need to create variables such that:
-the value of the first variable (FirstX) must be equal to the description (Descr) of the first Action = X of the day (if available) and zero otherwise
-the value of the second variable must be equal to the description of the second Action = X of the day and zero otherwise
-so on
Once I transformed it into a balanced daily panel (which I can do) I need to have a final result which looks like this:
People | Date |Accesses| First X|Second X| Third X| Fourth X |
| | | | | | |
j | 01/01/2010 | 2 | A | 0 | 0 | 0 |
j | 02/01/2010 | 1 | 0 | 0 | 0 | 0 |
j | 03/01/2010 | 4 | D | E | A | 0 |
You can do it using the dplyr package:
library(dplyr)
df %>%
group_by(People,Date) %>%
summarise(Accesses = n(),
FirstX = ifelse(sum(Action=="X")>=1,Descr[Action=="X"][1],"0"),
SecondX = ifelse(sum(Action=="X")>=2,Descr[Action=="X"][2],"0"),
ThirdX = ifelse(sum(Action=="X")>=3,Descr[Action=="X"][3],"0"),
FourthX = ifelse(sum(Action=="X")>=4,Descr[Action=="X"][4],"0"))
This returns:
People Date Accesses FirstX SecondX ThirdX FourthX
<chr> <chr> <int> <chr> <chr> <chr> <chr>
1 j 01/01/2010 2 A 0 0 0
2 j 02/01/2010 1 0 0 0 0
3 j 03/01/2010 4 D E A 0
Note that you cannot have numeric 0s and characters in the same vector, so I put character 0s in the FirstX, SecondX, .. columns.
I found a solution myself. I post it here in case this is useful to somebody.
# create temp variable to be used for the count (just a vector of all the
# numbers from 1 to N)
subset$temp_var1<-c(1:N)
# generate a variable which starts counting from one and starts again
# every time "date" or "people" change
subset$count<-ave(subset$temp_var1 , subset$date ,
subset$people , FUN = seq_along)
#drop variable "Action"
subset<-subset( subset, select=c("people" , "date" ,
"descr" , "count"))
#reshape
subset_wide<-reshape(subset , idvar=c("people" , "date") ,
timevar = "count" , direction = "wide")
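The ave() and reshape() idea can be made self-contained for the example data in the question. This is a sketch under a few assumptions: only rows with Action == "X" are numbered, the number of X columns is whatever occurs in the data (three here, so there is no fourth column), and missing entries are filled with the character "0" as in the dplyr answer. The object names `x_rows`, `wide`, `accesses` and `out` are introduced here for illustration.

```r
df <- data.frame(
  People = "j",
  Date   = c("01/01/2010", "01/01/2010", "02/01/2010", "03/01/2010",
             "03/01/2010", "03/01/2010", "03/01/2010"),
  Time   = c("10:13", "10:15", "14:15", "11:45", "13:56", "18:43", "18:44"),
  Action = c("X", "Y", "Z", "X", "X", "Z", "X"),
  Descr  = c("A", "B", "C", "D", "E", "F", "A")
)

# Keep only the X actions and number them within each person and day
x_rows <- df[df$Action == "X", c("People", "Date", "Descr")]
x_rows$k <- ave(seq_along(x_rows$Descr), x_rows$People, x_rows$Date,
                FUN = seq_along)

# One column per occurrence: Descr.1, Descr.2, ...
wide <- reshape(x_rows, idvar = c("People", "Date"),
                timevar = "k", direction = "wide")

# Number of accesses per person and day, including days without any X
accesses <- aggregate(Action ~ People + Date, data = df, FUN = length)
names(accesses)[3] <- "Accesses"

# Combine and fill the gaps with a character "0"
out <- merge(accesses, wide, by = c("People", "Date"), all.x = TRUE)
out[is.na(out)] <- "0"
print(out)
```

Days without any X action survive because the access counts are computed from the full data and merged with all.x = TRUE.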
