Calculating mean value for preceding rows and within groups

x<-c("A","B")
y<-c(1:10)
dat<-expand.grid(visit=y,site=x)
I would like to get a column that has the mean value of visit over the preceding rows within each site. The first visit within each site will have no value.
So example of returned data
visit site mean
1 A
2 A 1
3 A 1.5
4 A 2
5 A 2.5
6 A 3
1 B etc..

Using y <- 1:6 when building dat here, to match the example in the question.
You can get the running averages with by and cumsum:
with(dat, by(visit, site, FUN=function(x) cumsum(x)/1:length(x)))
## site: A
## [1] 1.0 1.5 2.0 2.5 3.0 3.5
## -----------------------------------------------------------------------------------------------------
## site: B
## [1] 1.0 1.5 2.0 2.5 3.0 3.5
These are almost what you want: you want them shifted by one, without the last entry. That's easy enough to do (if a slightly odd requirement).
with(dat, by(visit, site, FUN=function(x) c(NA, head(cumsum(x)/1:length(x), -1))))
## site: A
## [1] NA 1.0 1.5 2.0 2.5 3.0
## -----------------------------------------------------------------------------------------------------
## site: B
## [1] NA 1.0 1.5 2.0 2.5 3.0
And you can easily present these in a single column with unlist:
dat$mean <- unlist(with(dat, by(visit, site, FUN=function(x) c(NA, head(cumsum(x)/1:length(x), -1)))))
dat
## visit site mean
## 1 1 A NA
## 2 2 A 1.0
## 3 3 A 1.5
## 4 4 A 2.0
## 5 5 A 2.5
## 6 6 A 3.0
## 7 1 B NA
## 8 2 B 1.0
## 9 3 B 1.5
## 10 4 B 2.0
## 11 5 B 2.5
## 12 6 B 3.0
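For comparison, dplyr has cummean(), which makes the shift-by-one explicit with lag(); this is an equivalent sketch, not part of the original answer:
library(dplyr)
dat %>%
  group_by(site) %>%
  mutate(mean = lag(cummean(visit))) %>%  # lag() supplies the leading NA
  ungroup()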

Setting values to NA in one column based on conditions in another column

Here's a simplified mock dataframe:
df1 <- data.frame(amb = c(2.5,3.6,2.1,2.8,3.4,3.2,1.3,2.5,3.2),
                  warm = c(3.6,5.3,2.1,6.3,2.5,2.1,2.4,6.2,1.5),
                  sensor = c(1,1,1,2,2,2,3,3,3))
I'd like to set all values in the "amb" column to NA if they're in sensor 1, but retain the values in the "warm" column for sensor 1. Here's what I'd like the final output to look like:
amb warm sensor
NA 3.6 1
NA 5.3 1
NA 2.1 1
2.8 6.3 2
3.4 2.5 2
3.2 2.1 2
1.3 2.4 3
2.5 6.2 3
3.2 1.5 3
Using R version 4.0.2, Mac OS X 10.13.6
A possible solution, based on dplyr:
library(dplyr)
df1 %>%
  mutate(amb = ifelse(sensor == 1, NA, amb))
#> amb warm sensor
#> 1 NA 3.6 1
#> 2 NA 5.3 1
#> 3 NA 2.1 1
#> 4 2.8 6.3 2
#> 5 3.4 2.5 2
#> 6 3.2 2.1 2
#> 7 1.3 2.4 3
#> 8 2.5 6.2 3
#> 9 3.2 1.5 3
This is also neatly handled by the vectorized replacement function is.na<-:
is.na(df1$amb) <- df1$sensor %in% c(1)  # the c() isn't strictly needed here
To be more careful about testing equality between floating-point numbers, the condition can use a tolerance instead of an exact comparison:
is.na(df1$amb) <- abs(df1$sensor - 1) < .Machine$double.eps^0.5
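For completeness, plain indexed assignment in base R does the same thing in one line (an equivalent sketch, not part of the original answers):
df1$amb[df1$sensor == 1] <- NA  # set amb to NA wherever sensor is 1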

Add a column that iterates/ counts every time a sequence resets

I have a dataframe with a column that increases with every row and periodically (though not regularly) resets back to 1.
I'd like to track/count these resets in a separate column. This for-loop example does exactly what I want, but is incredibly slow when applied to large datasets. Is there a better, quicker, more R-like way to do this same operation?
ColA<-seq(1,20)
ColB<-rep(seq(1,5),4)
DF<-data.frame(ColA, ColB)
DF$ColC<-NA
DF[1,'ColC']<-1
# Removing row 15 and changing row 5's ColB to 0.1, per comments on the answer
DF <- DF[-15, ]
DF[5, 2] <- 0.1
for(i in seq(1, nrow(DF) - 1)){
  print(i)
  MyRow <- DF[i + 1, ]
  if(MyRow$ColB < DF[i, 'ColB']){
    DF[i + 1, "ColC"] <- DF[i, "ColC"] + 1
  } else {
    DF[i + 1, "ColC"] <- DF[i, "ColC"]
  }
}
No need for a loop here. We can just use the vectorized cumsum. This ought to be faster:
DF$ColC<-cumsum(DF$ColB==1)
DF
To handle reset values that vary, as long as each reset drops below the previous value, use cumsum(ColB < lag(ColB)) (lag() here is dplyr's):
library(dplyr)
DF %>% mutate(ColC = cumsum(ColB < lag(ColB, default = Inf)))
ColA ColB ColC
1 1 1.0 1
2 2 2.0 1
3 3 3.0 1
4 4 4.0 1
5 5 0.1 2
6 6 1.0 2
7 7 2.0 2
8 8 3.0 2
9 9 4.0 2
10 10 5.0 2
11 11 1.0 3
12 12 2.0 3
13 13 3.0 3
14 14 4.0 3
16 16 1.0 4
17 17 2.0 4
18 18 3.0 4
19 19 4.0 4
20 20 5.0 4
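If dplyr isn't available, a base-R sketch of the same idea flags the drops with diff():
DF$ColC <- cumsum(c(TRUE, diff(DF$ColB) < 0))  # the leading TRUE starts the count at 1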

Counting the NA's in a part of a row in data.table

I have a dataset df of which the structure looks similar to the example below:
nr countrycode questionA questionB questionC WeightquestionA WeightquestionB WeightquestionC
1 NLD 2 1 4 0.6 0.2 0.2
2 NLD NA 4 NA 0.4 0.4 0.2
3 NLD 4 4 1 0.2 0.2 0.6
4 BLG 1 NA 1 0.1 0.5 0.4
5 BLG 5 3 5 0.2 0.2 0.6
The questions A, B and C relate to a similar topic and as a result I would like to create an average score for all questions, taking into account the importance of each question (WeightquestionA WeightquestionB WeightquestionC).
Currently I have manually calculated the average score.
(questionA*WeightquestionA) + (questionB*WeightquestionB) + (questionC*WeightquestionC)
This would not be an insurmountable problem were it not for the NA's (for which: no they cannot be removed). As a result I would like to automate the process.
I am currently thinking of using sum(!is.na()) for counting the non-NA's in each question (A,B,C) for each row (1 through 5) and putting that value into a new column.
With data.table I however always have trouble getting the syntax right. I believe it should be something like:
df[, NonNA:=sum(!is.na(questionA + questionB + questionC))]
But this sums all NA's in the column, instead of for each row. How should I write the syntax to calculate per row?
I would like to refer to the columns separately by name, because they are not next to each other in the actual df.
Desired output:
nr countrycode qA qB qC WeightquestionA WeightquestionB WeightquestionC NonNA
1 NLD 2 1 4 0.6 0.2 0.2 3
2 NLD NA 4 NA 0.4 0.4 0.2 1
3 NLD 4 4 1 0.2 0.2 0.6 3
4 BLG 1 NA 1 0.1 0.5 0.4 2
5 BLG 5 3 5 0.2 0.2 0.6 3
Using data.table, you could do this:
df[, NonNA := sum(!is.na(questionA), !is.na(questionB), !is.na(questionC)), by = .(nr)]
A base solution:
df$nonNA <- rowSums(!is.na(df[,c("questionA", "questionB", "questionC")]))
Another alternative with recommendation from snoram:
df[, NonNA := rowSums(!is.na(.SD)),
.SDcols=paste0("question", LETTERS[1:3])]
And also:
df[, NonNA := Reduce(function(x, y) x + !is.na(y), .SD, init=rep(0L, .N)),
.SDcols=paste0("question", LETTERS[1:3])]
We can count the non-NA values (for columns questionA, questionB and questionC, i.e. column numbers 3 to 5) using apply as below:
df$nonNA=apply(df[,3:5], 1, function(x) length(which(!is.na(x))))
or (suggestion from snoram)
df$nonNA=apply(df[,3:5], 1, function(x) sum(!is.na(x)))
Sample output:
questionA questionB questionC nonNA
1 2 1 4 3
2 NA 4 NA 1
3 4 4 1 3
4 1 NA 1 2
5 5 3 5 3
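Since the NA count is usually a stepping stone toward the weighted average itself, here is a sketch (not from the original answers) that renormalises the weights over the non-missing questions, using data.table's ..cols column selection:
qcols <- paste0("question", LETTERS[1:3])
wcols <- paste0("Weightquestion", LETTERS[1:3])
q <- as.matrix(df[, ..qcols])
w <- as.matrix(df[, ..wcols])
w[is.na(q)] <- NA  # drop the weight wherever the answer is missing
df[, score := rowSums(q * w, na.rm = TRUE) / rowSums(w, na.rm = TRUE)]
Row 2, for example, gets score (4 * 0.4) / 0.4 = 4, using only questionB.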

How to get the proportions of data with respect to two variables in R?

I have 4 columns: Vehicle ID, Vehicle Class, Vehicle Length and Vehicle Width. Every vehicle has a unique vehicle ID (e.g. 2, 4, 5, ...) and the data was collected every 0.1 seconds, which means that vehicle IDs are repeated in the Vehicle ID column for the number of times they were observed. There are three vehicle classes (1 = motorcycles, 2 = cars, 3 = trucks) in the Vehicle Class column, and the lengths and widths are in their respective columns against every vehicle ID.
I want to subset the data by vehicle class and then find the proportions of each vehicle model (unique length and width) within every class. For example, for Vehicle Class = 2, i.e. cars, I want to find the different models of cars (unique length and width) and their proportions with respect to the total number of cars. Here is what I have done so far.
To subset data by Vehicle Class
cars <- subset(b, b$'Vehicle class'==2)
trucks <- subset(b, b$'Vehicle class'==3)
motorcycles <- subset(b, b$'Vehicle class'==1)
To find the number of cars
numofcars <- length(unique(cars$'Vehicle ID')) # 2830
numoftrucks <- length(unique(trucks$'Vehicle ID')) # 137
numofmotorcycles <- length(unique(motorcycles$'Vehicle ID'))# 45
The above code worked but I could not find the proportions by using the code below:
by (cars, INDICES=cars$'Vehicle Length', FUN=table(cars$'Vehicle width'))
R gives an error stating that it could not find 'FUN'. Please help me in finding the proportions of each model within all classes of vehicles.
EDIT (Sample Input)
Vehicle ID Vehicle Class Vehicle Length Vehicle Width
2 2 13.5 4.5
2 2 13.5 4.5
2 2 13.5 4.5
2 2 13.5 4.5
3 2 13.5 4.0
3 2 13.5 4.0
3 2 13.5 4.0
3 2 13.5 4.0
4 2 10.0 4.5
4 2 10.0 4.5
4 2 10.0 4.5
4 2 10.0 4.5
5 3 23.0 4.5
5 3 23.0 4.5
5 3 23.0 4.5
5 3 23.0 4.5
6 3 76.5 4.5
6 3 76.5 4.5
6 3 76.5 4.5
6 3 76.5 4.5
6 3 76.5 4.5
7 1 10.0 3.0
7 1 10.0 3.0
7 1 10.0 3.0
7 1 10.0 3.0
8 2 13.5 5.5
8 2 13.5 5.5
8 2 13.5 5.5
8 2 13.5 5.5
Note that in this input: Total number of cars=4, trucks=2, motorcycles=1
Sample Output
Group: cars
VehicleLength VehicleWidth Proportion
13.5 4.5 0.25
13.5 4.0 0.25
13.5 5.5 0.25
10.0 4.5 0.25
Group: trucks
VehicleLength VehicleWidth Proportion
23.0 4.5 0.5
76.5 4.5 0.5
Group: motorcycles
VehicleLength VehicleWidth Proportion
10.0 3.0 1.0
The error occurs because FUN=table(cars$'Vehicle width') passes the result of calling table, not a function object; by() needs a function for FUN. From what I understand, you want something along the lines of this -
library(data.table)
dt <- data.table(df)
dt2 <- dt[,
  list(ClassLengthWidthFreq = .N),
  by = c('VehicleClass', 'VehicleLength', 'VehicleWidth')
]
dt2[,
  ClassLengthWidthFreqProportion := ClassLengthWidthFreq / sum(ClassLengthWidthFreq),
  by = 'VehicleClass'
]
Output -
> dt2
VehicleClass VehicleLength VehicleWidth ClassLengthWidthFreq ClassLengthWidthFreqProportion
1: 2 13.5 4.5 4 0.2500000
2: 2 13.5 4.0 4 0.2500000
3: 2 10.0 4.5 4 0.2500000
4: 3 23.0 4.5 4 0.4444444
5: 3 76.5 4.5 5 0.5555556
6: 1 10.0 3.0 4 1.0000000
7: 2 13.5 5.5 4 0.2500000
If this isn't quite what you want, compare against the sample input and output above.
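One caveat: because each vehicle appears once per 0.1-second observation, these proportions are weighted by observation counts, which is why the trucks come out as 0.44/0.56 rather than the 0.5/0.5 expected over unique vehicles. To count each vehicle once, deduplicate by the ID column before aggregating; a sketch assuming that column is named VehicleID:
dt <- unique(dt, by = 'VehicleID')  # keep one row per vehicle
and then run the two aggregation steps above on the deduplicated table.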

How to use ddply to add a column to a data frame?

I have a data frame that looks like this:
site date var dil
1 A 7.4 2
2 A 6.5 2
1 A 7.3 3
2 A 7.3 3
1 B 7.1 1
2 B 7.7 2
1 B 7.7 3
2 B 7.4 3
I need to add a column called wt to this dataframe that contains the weighting factor needed to calculate the weighted mean. This weighting factor has to be derived for each combination of site and date.
The approach I'm using is to first build a function that calculates the weighting factor:
weight <- function(dil){
  dil/sum(dil)
}
then apply the function for each combination of site and date
df$wt <- ddply(df, .(date,site), .fun=weight)
but I get this error message:
Error in FUN(X[[1L]], ...) :
only defined on a data frame with all numeric variables
You are almost there. Modify your code to use the transform function. This allows you to add columns to the data.frame inside ddply:
library(plyr)
weight <- function(x) x/sum(x)
ddply(df, .(date, site), transform, weight = weight(dil))
site date var dil weight
1 1 A 7.4 2 0.40
2 1 A 7.3 3 0.60
3 2 A 6.5 2 0.40
4 2 A 7.3 3 0.60
5 1 B 7.1 1 0.25
6 1 B 7.7 3 0.75
7 2 B 7.7 2 0.40
8 2 B 7.4 3 0.60
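For reference, dplyr's grouped mutate gives the same weights and keeps the requested column name wt (an equivalent sketch, not part of the original answer):
library(dplyr)
df %>%
  group_by(date, site) %>%
  mutate(wt = dil / sum(dil)) %>%
  ungroup()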
