Aggregating over 2 columns of a dataframe in R

My dataframe is as follows:
TreeID  Species  PlotNo  Basalarea
12345   A        1       120
13242   B        7       310
14567   D        8       250
13245   B        1       305
13426   B        1       307
13289   A        3       118
I used
newdata <- aggregate(Basalarea ~ PlotNo + Species, data, sum, na.rm = TRUE)
to aggregate all the values such that
newdata
Species  PlotNo  Basalarea
A        1       120
A        3       118
B        1       some value
B        7       310
D        8       250
This is great, but I would like a dataframe such that
PlotNo  A    B           D
1       120  some value  0
3       118  0           0
7       0    310         0
8       0    0           250
How do I obtain the above dataframe?
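For reference, a reproducible version of the example data (the answers below call it df1); this reconstruction from the table above is my own assumption:
# assumed reconstruction of the question's data
df1 <- data.frame(TreeID = c(12345, 13242, 14567, 13245, 13426, 13289),
                  Species = c("A", "B", "D", "B", "B", "A"),
                  PlotNo = c(1, 7, 8, 1, 1, 3),
                  Basalarea = c(120, 310, 250, 305, 307, 118))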

We can use dcast from reshape2 to convert from long to wide format, specifying fun.aggregate as sum.
library(reshape2)
dcast(df1, PlotNo ~ Species, value.var = 'Basalarea', fun.aggregate = sum)
# PlotNo A B D
#1 1 120 612 0
#2 3 118 0 0
#3 7 0 310 0
#4 8 0 0 250
Or a base R option would be xtabs, which by default sums 'Basalarea' over the combinations of 'PlotNo' and 'Species'.
xtabs(Basalarea~PlotNo+Species, df1)
# Species
#PlotNo A B D
# 1 120 612 0
# 3 118 0 0
# 7 0 310 0
# 8 0 0 250
Another base R option is tapply:
with(df1, tapply(Basalarea, list(PlotNo, Species), FUN=sum))
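Note that, unlike dcast and xtabs, tapply returns NA rather than 0 for PlotNo/Species combinations that never occur in the data. If zeros are wanted, they can be filled in afterwards, for example:
# tapply leaves NA for absent combinations; replace them with 0
tab <- with(df1, tapply(Basalarea, list(PlotNo, Species), FUN = sum))
tab[is.na(tab)] <- 0
tab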

Related

R: Lag value of previously calculated row

I am trying to use a lagged value from the previous row, which itself needs to be calculated from the row before it (unless it is the first entry).
I was trying something similar to:
library(dplyr)

test <- data.frame(account_id = c(123,123,123,123,444,444,444,444),
                   entry = c(1,2,3,4,1,2,3,4),
                   beginning_balance = c(100,0,0,0,200,0,0,0),
                   deposit = c(10,20,5,8,10,12,20,4),
                   running_balance = c(0,0,0,0,0,0,0,0))
test2 <- test %>%
  group_by(account_id) %>%
  mutate(running_balance = if_else(entry == 1, beginning_balance + deposit,
                                   lag(running_balance) + deposit))
print(test2)
The running balance should be 110, 130, 135, 143, 210, 222, 242, 246.
For each account_id you can add the first beginning_balance to the cumulative sum of deposit.
library(dplyr)
test %>%
  group_by(account_id) %>%
  mutate(running_balance = first(beginning_balance) + cumsum(deposit))
# account_id entry beginning_balance deposit running_balance
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 123 1 100 10 110
#2 123 2 0 20 130
#3 123 3 0 5 135
#4 123 4 0 8 143
#5 444 1 200 10 210
#6 444 2 0 12 222
#7 444 3 0 20 242
#8 444 4 0 4 246
Same thing using data.table:
library(data.table)
setDT(test)[, running_balance := first(beginning_balance) + cumsum(deposit), account_id]
Using a for-loop over each unique account_id, adding the cumulative sum within each id:
for (i in unique(test$account_id)) {
  idx <- test$account_id == i
  test$running_balance[idx] <- cumsum(test$beginning_balance[idx] + test$deposit[idx])
}
print(test)
account_id entry beginning_balance deposit running_balance
1 123 1 100 10 110
2 123 2 0 20 130
3 123 3 0 5 135
4 123 4 0 8 143
5 444 1 200 10 210
6 444 2 0 12 222
7 444 3 0 20 242
8 444 4 0 4 246
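As an aside, the reason the original if_else/lag attempt fails is that mutate computes the whole column in one vectorized step, so lag(running_balance) still refers to the original column of zeros rather than to values computed earlier in the same call. If you want the literal row-by-row formulation, a sketch using purrr::accumulate (an extra dependency, not used in the answers above) would be:
library(dplyr)
library(purrr)
# accumulate() carries each intermediate sum forward, starting from the
# first beginning_balance; [-1] drops the .init seed value
test %>%
  group_by(account_id) %>%
  mutate(running_balance = accumulate(deposit, `+`,
                                      .init = first(beginning_balance))[-1]) %>%
  ungroup()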

After imputation, how to round to nearest level of a factor

I have imputed missing values in column q1 of the following data frame. q1 is a factor with the levels 0, 20, 40, 60, 80, 100. I need to round q1 to the nearest factor level; the imputed values 89 and 85 need to be rounded. Does anyone have a solution for this problem? Thanks in advance!
id <- rep(c(300,450), each = 6)
visit <- rep(1:6, 2)
trt <- rep(c(0,"A",0,"B",0,"C"), 2)
q1 <- c(0,100,0,89,0,60,0,85,0,40,0,20)
df <- data.frame(id, visit, trt, q1)
df
   id visit trt  q1
1  300     1   0   0
2  300     2   A 100
3  300     3   0   0
4  300     4   B  89
5  300     5   0   0
6  300     6   C  60
7  450     1   0   0
8  450     2   A  85
9  450     3   0   0
10 450     4   B  40
11 450     5   0   0
12 450     6   C  20
Maybe you could use plyr's round_any function here, which rounds to the nearest multiple of 20 in this case.
plyr::round_any(df$q1, 20)
#[1] 0 100 0 80 0 60 0 80 0 40 0 20
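If q1 is genuinely stored as a factor (as described) rather than numeric, convert it through character first, since as.numeric() on a factor returns the level indices rather than the values; a sketch:
# factor -> character -> numeric before rounding, then restore the levels
q1_num <- as.numeric(as.character(df$q1))
df$q1 <- factor(plyr::round_any(q1_num, 20),
                levels = c(0, 20, 40, 60, 80, 100))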

Average columns based on other column value and number of rows in R

I'm using R and am trying to create a new dataframe of averaged results from another dataframe, based on the values in column A. To demonstrate my goal, here is some data:
set.seed(1981)
df <- data.frame(A = sample(c(0,1), replace = TRUE, size = 100),
                 B = round(runif(100), digits = 4),
                 C = sample(1:1000, 100, replace = TRUE))
head(df, 30)
A B C
0 0.6739 459
1 0.5466 178
0 0.154 193
0 0.41 206
1 0.7526 791
1 0.3104 679
1 0.739 434
1 0.421 171
0 0.3653 577
1 0.4035 739
0 0.8796 147
0 0.9138 37
0 0.7257 350
1 0.2125 779
0 0.1502 495
1 0.2972 504
0 0.2406 245
1 0.0325 613
0 0.8642 539
1 0.1096 630
1 0.2113 363
1 0.277 974
0 0.0485 755
1 0.0553 412
0 0.509 24
0 0.2934 795
0 0.0725 413
0 0.8723 606
0 0.3192 591
1 0.5557 177
I need to reduce the size of the data by calculating the average of columns B and C over runs of rows where the value in column A stays consecutively the same, up to a maximum of 3 rows per group. If the value of A remains 1 or 0 for more than 3 rows, the run rolls over into the next row of the new dataframe, as you can see below.
The new dataframe requires the following columns:
Value of A  B.Av         C.Av         No. of rows used
0           0.6739       459          1
1           0.5466       178          1
0           0.282        199.5        2
1           0.600666667  634.6666667  3
1           0.421        171          1
0           0.3653       577          1
1           0.4035       739          1
0           0.8397       178          3
1           0.2125       779          1
0           0.1502       495          1
1           0.2972       504          1
0           0.2406       245          1
1           0.0325       613          1
0           0.8642       539          1
1           0.1993       655.6666667  3
0           0.0485       755          1
1           0.0553       412          1
0           0.291633333  410.6666667  3
0           0.59575      598.5        2
1           0.5557       177          1
I haven't managed to find a similar scenario to mine while searching Stack Overflow, so any help would be really appreciated.
Here is a base-R solution:
## define a function to split a run-length into chunks of at most 3
split.3 <- function(l, v) {
  o <- c(values = v, lengths = min(l, 3))
  while (l > 3) {
    l <- l - 3
    o <- rbind(o, c(values = v, lengths = min(l, 3)))
  }
  return(o)
}
## compute the run-length encoding of column A
rl <- rle(df$A)
## apply split.3 to the run-length encoding
## the first column of vl are the values of column A
## the second column of vl are the corresponding run-length limited to 3
## SIMPLIFY=FALSE keeps the result a list, so rbind also works when every run is short
vl <- do.call(rbind, mapply(split.3, rl$lengths, rl$values, SIMPLIFY = FALSE))
## compute the begin and end row indices of df for each value of A to average
fin <- cumsum(vl[,2])
beg <- fin - vl[,2] + 1
## compute the averages
out <- do.call(rbind, lapply(seq_along(beg), function(i)
  data.frame(`Value of A` = vl[i, 1],
             B.Av = mean(df$B[beg[i]:fin[i]]),
             C.Av = mean(df$C[beg[i]:fin[i]]),
             `No. of rows used` = fin[i] - beg[i] + 1)))
out
## Value.of.A B.Av C.Av No..of.rows.used
##1 0 0.6739000 459.0000 1
##2 1 0.5466000 178.0000 1
##3 0 0.2820000 199.5000 2
##4 1 0.6006667 634.6667 3
##5 1 0.4210000 171.0000 1
##6 0 0.3653000 577.0000 1
##7 1 0.4035000 739.0000 1
##8 0 0.8397000 178.0000 3
##9 1 0.2125000 779.0000 1
##10 0 0.1502000 495.0000 1
##11 1 0.2972000 504.0000 1
##12 0 0.2406000 245.0000 1
##13 1 0.0325000 613.0000 1
##14 0 0.8642000 539.0000 1
##15 1 0.1993000 655.6667 3
##16 0 0.0485000 755.0000 1
##17 1 0.0553000 412.0000 1
##18 0 0.2916333 410.6667 3
##19 0 0.5957500 598.5000 2
##20 1 0.5557000 177.0000 1
Here is a data.table solution:
library(data.table)
setDT(df)
# create two group variables: runs of consecutive A values, and
# within each run, blocks of at most three rows
(df[, rleid := rleid(A)][, threeWindow := ((1:.N) - 1) %/% 3, rleid]
  # calculate the average of the columns grouped by the above two variables
  [, c(.N, lapply(.SD, mean)), .(rleid, threeWindow)]
  # drop the group variables
  [, `:=`(rleid = NULL, threeWindow = NULL)][])
# N A B C
#1: 1 0 0.6739000 459.0000
#2: 1 1 0.5466000 178.0000
#3: 2 0 0.2820000 199.5000
#4: 3 1 0.6006667 634.6667
#5: 1 1 0.4210000 171.0000
#6: 1 0 0.3653000 577.0000
#7: 1 1 0.4035000 739.0000
#8: 3 0 0.8397000 178.0000
#9: 1 1 0.2125000 779.0000
#10: 1 0 0.1502000 495.0000
#11: 1 1 0.2972000 504.0000
#12: 1 0 0.2406000 245.0000
#13: 1 1 0.0325000 613.0000
#14: 1 0 0.8642000 539.0000
#15: 3 1 0.1993000 655.6667
#16: 1 0 0.0485000 755.0000
#17: 1 1 0.0553000 412.0000
#18: 3 0 0.2916333 410.6667
#19: 2 0 0.5957500 598.5000
#20: 1 1 0.5557000 177.0000
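For completeness, here is a sketch of the same two-step grouping with dplyr (not part of the original answers; assumes dplyr >= 1.0 for the .groups argument):
library(dplyr)
df %>%
  mutate(run = cumsum(A != lag(A, default = A[1]))) %>%  # runs of consecutive A
  group_by(run) %>%
  mutate(block = (row_number() - 1) %/% 3) %>%           # blocks of at most 3 rows
  group_by(run, block) %>%
  summarise(A = first(A), B.Av = mean(B), C.Av = mean(C),
            rows = n(), .groups = "drop") %>%
  select(-run, -block)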

How to remove rows based on distance from an average of column and max of another column

Consider this toy data frame. I would like to create a new data frame keeping only rows where birds is below the average of birds, and where wolfs is less than its third-largest value (i.e. below the two values that come right after the maximum). So from this data frame I should get only rows 543, 608, 987, 225, 988, 556.
I used these two lines of code for the first constraint, but couldn't find a solution for the second constraint.
df$filt <- ifelse(df$birds < mean(df$birds), 1, 0)
df1 <- df[which(df$filt == 1), ]
How can I create the second constraint?
Here is the toy dataframe:
df <- read.table(text = "userid target birds wolfs
222 1 9 7
444 1 8 4
234 0 2 8
543 1 2 3
678 1 8 3
987 0 1 2
294 1 7 1
608 0 1 5
123 1 9 7
321 1 8 7
226 0 2 7
556 0 2 3
334 1 6 3
225 0 1 1
999 0 3 9
988 0 1 1 ",header = TRUE)
subset(df, birds < mean(birds) & wolfs < sort(unique(wolfs), decreasing = TRUE)[3])
## userid target birds wolfs
## 4 543 1 2 3
## 6 987 0 1 2
## 8 608 0 1 5
## 12 556 0 2 3
## 14 225 0 1 1
## 16 988 0 1 1
Here is a solution, though maybe some constraints are not clear to me, because it might fit a different set of rows than your desired output.
avbi <- mean(df$birds)
ttw <- sort(df$wolfs, decreasing = T)[3]
df[df$birds < avbi & df$wolfs < ttw , ]
userid target birds wolfs
4 543 1 2 3
6 987 0 1 2
8 608 0 1 5
12 556 0 2 3
14 225 0 1 1
16 988 0 1 1
Or with dplyr:
library(dplyr)
df %>% filter(birds < avbi & wolfs < ttw)
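Note the two answers define the cutoff slightly differently: the first uses the third-largest distinct value of wolfs, the second the third-largest value including ties. With this data both happen to equal 7:
sort(unique(df$wolfs), decreasing = TRUE)[3]  # third-largest distinct value: 7
sort(df$wolfs, decreasing = TRUE)[3]          # third-largest including ties: also 7 here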

R: Running sum of changed column values within groups

I have data that looks like this:
df <- read.table(textConnection(
"ID DATE UNIT
100 1/5/2005 4
100 2/6/2006 4
100 3/7/2007 5
100 4/7/2008 5
100 5/9/2009 6
101 1/5/2005 1
101 2/6/2006 1
101 3/7/2007 1
101 4/7/2008 1
102 1/3/2010 3
102 4/5/2010 4
102 5/9/2011 3
102 6/7/2011 5
102 10/10/2012 5
103 1/5/2005 1
103 1/6/2010 2"),header=TRUE)
I want to group by ID, sort each group by DATE, and create another column that is a running count of the number of times the UNIT variable has changed for each given ID variable. So I want an output that looks like this:
ID DATE UNIT CHANGES
100 1/5/2005 4 0
100 2/6/2006 4 0
100 3/7/2007 5 1
100 4/7/2008 5 1
100 5/9/2009 6 2
101 1/5/2005 1 0
101 2/6/2006 1 0
101 3/7/2007 1 0
101 4/7/2008 1 0
102 1/3/2010 3 0
102 4/5/2010 4 1
102 5/9/2011 3 2
102 6/7/2011 5 3
102 10/10/2012 5 3
103 1/5/2005 1 0
103 1/6/2010 2 1
You could also do this in base R, using order to sort the observations and ave to compute the grouped values:
df$DATE <- as.Date(df$DATE, "%m/%d/%Y")
df <- df[order(df$ID, df$DATE),]
df$CHANGES <- ave(df$UNIT, df$ID, FUN=function(x) c(0, cumsum(diff(x) != 0)))
df
# ID DATE UNIT CHANGES
# 1 100 2005-01-05 4 0
# 2 100 2006-02-06 4 0
# 3 100 2007-03-07 5 1
# 4 100 2008-04-07 5 1
# 5 100 2009-05-09 6 2
# 6 101 2005-01-05 1 0
# 7 101 2006-02-06 1 0
# 8 101 2007-03-07 1 0
# 9 101 2008-04-07 1 0
# 10 102 2010-01-03 3 0
# 11 102 2010-04-05 4 1
# 12 102 2011-05-09 3 2
# 13 102 2011-06-07 5 3
# 14 102 2012-10-10 5 3
# 15 103 2005-01-05 1 0
# 16 103 2010-01-06 2 1
Using dplyr.
First I'm converting your DATE column to a date, assuming it's in format m/d/y (if not, change the "%m/%d/%Y" to "%d/%m/%Y"):
df$DATE <- as.Date(df$DATE, "%m/%d/%Y")
Now the code:
library(dplyr)
df %>%
  group_by(ID) %>%
  arrange(DATE) %>%
  mutate(CHANGES = c(0, cumsum(na.omit(UNIT != lag(UNIT, 1)))))
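A data.table version of the same idea (my own sketch, not part of the original answers) uses rleid, which restarts its run counter within each by group; it assumes DATE has already been converted with as.Date as above:
library(data.table)
setDT(df)
df <- df[order(ID, DATE)]
# rleid(UNIT) numbers the runs of UNIT within each ID starting at 1,
# so subtracting 1 gives the running count of changes
df[, CHANGES := rleid(UNIT) - 1L, by = ID]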
