taking average by groups, excluding NA value - r

I'm struggling to find a way to aggregate my data frame by taking the mean while ignoring NA values, but with the end result still showing NA where a group has no values at all.
The data table looks, for instance, like this:
Guar1 Bucket2 1 2 3 4 Total Month
10 -10 NA NA NA NA 0 201110
10 -0.2 0 9.87 8.42 0 18.29 201110
10 0 0.81 7.49 3.32 5.92 17.54 201110
10 0.4 0 0 NA 0 0 201110
10 999 0.73 7.57 4.61 0.77 13.68 201110
20 -10 NA NA NA NA 0 201110
20 -0.2 NA NA 100 NA 100 201110
20 0 NA 0 0 0 0 201110
20 0.4 1.39 3.13 14.04 2.98 21.54 201110
20 999 1.38 3.11 17.08 2.97 24.54 201110
999 999 1.06 5.44 8.61 1.52 16.63 201110
10 -10 NA NA NA NA 0 201111
10 -0.2 0 0 8.54 0 8.54 201111
10 0 1.87 6.12 16.6 0 24.59 201111
10 0.4 0 0 0 1.47 1.47 201111
10 999 1.68 5.82 13.15 1.67 22.32 201111
20 -10 NA NA NA NA 0 201111
20 -0.2 NA 0 NA NA 0 201111
20 0 NA NA 0 0 0 201111
20 0.4 2.29 5.38 14.91 14.18 36.76 201111
20 999 2.29 5.35 13.09 14.1 34.83 201111
And the final table should look like this:
Guar1 Bucket2 1 2 3 4 Total
10 -10 NA NA NA NA 0
10 -0.2 0 4.935 8.48 0 13.415
10 0 1.34 6.805 9.96 2.96 21.065
10 0.4 0 0 0 0.735 0.735
10 999 1.205 6.695 8.88 1.22 18
20 -10 NA NA NA NA 0
20 -0.2 NA 0 100 NA 50
20 0 NA 0 0 0 0
20 0.4 1.84 4.255 14.475 8.58 29.15
20 999 1.835 4.23 15.085 8.535 29.685
999 999 1.06 5.44 8.61 1.52 16.63
I've tried
aggregate(.~ Guar1+Bucket2, df, mean, na.rm = FALSE)
but it then excludes every row containing NA from the final table.
And if I set all the NA values in df to 0, I would not get the desired average.
I hope that someone can help me with this. Thanks!

Check this example with the dplyr package.
You can group by more than one variable. dplyr is great for editing and summarising data.
dataFrame <- data.frame(group = c("a","a","a", "b","b","b"), value = c(1,2,NA,NA,NA,3))
library("dplyr")
df <- dataFrame %>%
  group_by(group) %>%
  summarise(Mean = mean(value, na.rm = TRUE))
Output
# A tibble: 2 × 2
group Mean
<fctr> <dbl>
1 a 1.5
2 b 3.0

To keep the rows containing NA from being removed, use na.action = na.pass. Combined with na.rm = TRUE passed on to mean, this ensures that only the non-NA elements are used to compute the mean.
aggregate(.~ Guar1+Bucket2, df, mean, na.rm =TRUE, na.action = na.pass)
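One caveat with the call above: for a group in which every value is NA, mean(x, na.rm = TRUE) returns NaN rather than NA. A minimal sketch of a workaround, assuming a small made-up data frame in the shape of the question (the column names X1 and X2 and the helper mean_or_na are hypothetical, not from the original post):

```r
# Hypothetical helper: mean of the non-missing values,
# but NA (not NaN) when every value in the group is missing
mean_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)
}

# Made-up rows that mimic the structure of the question's table
df <- data.frame(
  Guar1   = c(10, 10, 10, 10, 20, 20),
  Bucket2 = c(-10, -10, -0.2, -0.2, -0.2, -0.2),
  X1      = c(NA, NA, 0, 0, NA, NA),
  X2      = c(NA, NA, 9.87, 0, NA, 0)
)

# na.pass keeps the rows with NA so that every group reaches the helper
res <- aggregate(. ~ Guar1 + Bucket2, data = df, FUN = mean_or_na,
                 na.action = na.pass)
print(res)
```

Groups with at least one value get the mean of their non-missing values, and all-NA groups stay NA, which appears to be what the expected table asks for.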

Related

R Upsampling a time series in a dataframe filling missing values

I have collected data from two instruments: one is sampled at 10 Hz and the other at 100 Hz.
I would like to upsample the 10 Hz data to 100 Hz in one data frame, so that I can then align and merge the two data frames.
The example data frame is:
DeltaT Speed Acc HR Player
48860,7 0,03 -0,05 0 Player1
48860,8 0,02 -0,05 0 Player1
48860,9 0,02 -0,04 0 Player1
48861,0 0,02 -0,03 0 Player1
48861,1 0,01 -0,02 0 Player1
Is there a package function that can help me create data between two points?
Manually with the approx function:
dt<- read.table(text=gsub(",", ".", 'DeltaT Speed Acc HR Player
48860,7 0,03 -0,05 0 Player1
48860,8 0,02 -0,05 0 Player1
48860,9 0,02 -0,04 0 Player1
48861,0 0,02 -0,03 0 Player1
48861,1 0,01 -0,02 0 Player1', fixed = TRUE),header=T)
upsampleDeltaT=seq(from=min(dt$DeltaT),to=max(dt$DeltaT),by=.01)
Speed<-approx(dt$DeltaT,dt$Speed,upsampleDeltaT)$y
Acc<-approx(dt$DeltaT,dt$Acc,upsampleDeltaT)$y
HR<-approx(dt$DeltaT,dt$HR,upsampleDeltaT)$y
Player <- rep(dt$Player,c(rep(10,nrow(dt)-1),1))
data.frame(upsampleDeltaT,Speed,Acc,HR,Player)
#> upsampleDeltaT Speed Acc HR Player
#> 1 48860.70 0.030 -0.050 0 Player1
#> 2 48860.71 0.029 -0.050 0 Player1
#> 3 48860.72 0.028 -0.050 0 Player1
#> 4 48860.73 0.027 -0.050 0 Player1
#> 5 48860.74 0.026 -0.050 0 Player1
#> 6 48860.75 0.025 -0.050 0 Player1
#> 7 48860.76 0.024 -0.050 0 Player1
#> 8 48860.77 0.023 -0.050 0 Player1
#> 9 48860.78 0.022 -0.050 0 Player1
#> 10 48860.79 0.021 -0.050 0 Player1
#> 11 48860.80 0.020 -0.050 0 Player1
#> 12 48860.81 0.020 -0.049 0 Player1
#> 13 48860.82 0.020 -0.048 0 Player1
#> 14 48860.83 0.020 -0.047 0 Player1
#> 15 48860.84 0.020 -0.046 0 Player1
#> 16 48860.85 0.020 -0.045 0 Player1
#> 17 48860.86 0.020 -0.044 0 Player1
#> 18 48860.87 0.020 -0.043 0 Player1
#> 19 48860.88 0.020 -0.042 0 Player1
#> 20 48860.89 0.020 -0.041 0 Player1
#> 21 48860.90 0.020 -0.040 0 Player1
#> 22 48860.91 0.020 -0.039 0 Player1
#> 23 48860.92 0.020 -0.038 0 Player1
#> 24 48860.93 0.020 -0.037 0 Player1
#> 25 48860.94 0.020 -0.036 0 Player1
#> 26 48860.95 0.020 -0.035 0 Player1
#> 27 48860.96 0.020 -0.034 0 Player1
#> 28 48860.97 0.020 -0.033 0 Player1
#> 29 48860.98 0.020 -0.032 0 Player1
#> 30 48860.99 0.020 -0.031 0 Player1
#> 31 48861.00 0.020 -0.030 0 Player1
#> 32 48861.01 0.019 -0.029 0 Player1
#> 33 48861.02 0.018 -0.028 0 Player1
#> 34 48861.03 0.017 -0.027 0 Player1
#> 35 48861.04 0.016 -0.026 0 Player1
#> 36 48861.05 0.015 -0.025 0 Player1
#> 37 48861.06 0.014 -0.024 0 Player1
#> 38 48861.07 0.013 -0.023 0 Player1
#> 39 48861.08 0.012 -0.022 0 Player1
#> 40 48861.09 0.011 -0.021 0 Player1
#> 41 48861.10 0.010 -0.020 0 Player1
library(data.table)
library(zoo)
set.seed(123)
# 10Hz and 100Hz sample data
DT10 <- data.table(time = seq(0,1, by = 0.1), value = sample(1:10, 11, replace = TRUE))
DT100 <- data.table(time = seq(0,1, by = 0.01), value = sample(1:10, 101, replace = TRUE))
# you should use setDT() if your data is not already data.table format
# join the DT10 to DT100
DT100[DT10, value2 := i.value, on = .(time)]
# interpolate NA-values
DT100[, value2_inter := zoo::na.approx(value2)]
#output
head(DT100, 31)
# time value value2 value2_inter
# 1: 0.00 3 3 3.0
# 2: 0.01 9 NA 3.0
# 3: 0.02 9 NA 3.0
# 4: 0.03 9 NA 3.0
# 5: 0.04 3 NA 3.0
# 6: 0.05 8 NA 3.0
# 7: 0.06 10 NA 3.0
# 8: 0.07 7 NA 3.0
# 9: 0.08 10 NA 3.0
# 10: 0.09 9 NA 3.0
# 11: 0.10 3 3 3.0
# 12: 0.11 4 NA 3.7
# 13: 0.12 1 NA 4.4
# 14: 0.13 7 NA 5.1
# 15: 0.14 5 NA 5.8
# 16: 0.15 10 NA 6.5
# 17: 0.16 7 NA 7.2
# 18: 0.17 9 NA 7.9
# 19: 0.18 9 NA 8.6
# 20: 0.19 10 NA 9.3
# 21: 0.20 7 10 10.0
# 22: 0.21 5 NA 9.8
# 23: 0.22 7 NA 9.6
# 24: 0.23 5 NA 9.4
# 25: 0.24 6 NA 9.2
# 26: 0.25 9 NA 9.0
# 27: 0.26 2 NA 8.8
# 28: 0.27 5 NA 8.6
# 29: 0.28 8 NA 8.4
# 30: 0.29 2 NA 8.2
# 31: 0.30 1 NA 8.0
# time value value2 value2_inter
Have a look at the approx function. It linearly interpolates the data at new points.
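As a rough sketch of how approx could be wrapped for a whole data frame: the helper name upsample, its arguments, and the toy data below are assumptions made for illustration, not part of any package. It uses only base R and interpolates every numeric column onto a finer time grid.

```r
# Hypothetical helper: linearly interpolate all numeric columns of `d`
# (except the time column) onto a regular grid with step `by`
upsample <- function(d, time_col, by) {
  grid <- seq(min(d[[time_col]]), max(d[[time_col]]), by = by)
  num  <- setdiff(names(d)[sapply(d, is.numeric)], time_col)
  out  <- lapply(d[num], function(y) approx(d[[time_col]], y, xout = grid)$y)
  data.frame(setNames(list(grid), time_col), out)
}

# Toy 10 Hz data (made up), upsampled to 100 Hz
d10  <- data.frame(t = c(0, 0.1, 0.2), speed = c(0.03, 0.02, 0.02))
d100 <- upsample(d10, "t", by = 0.01)
head(d100)
```

Non-numeric columns such as Player are not handled here; they would need to be repeated or joined back separately, as in the manual answer above.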

R - Delete Observations if More Than 25% of a Group

This is my first post! I started using R about a year ago and I have learned a lot from this sub over the last few months! Thanks for all of your help so far.
Here is what I am trying to do:
• Group Data by POS
• Within each POS group, no ORG should represent more than 25% of the dataset
• If an ORG represents more than 25% of an observation (column), the value furthest from the mean should be deleted. I think this would loop until the data from that ORG are less than 25% of the observation.
I am not sure how to approach this problem, as I am not too familiar with R functions. I am assuming this would require a function.
Here is the sample dataset:
print(Example)
# A tibble: 18 x 13
Org Pos obv1 obv2 obv3 obv4 obv5 obv6 obv7 obv8 obv9 obv10 obv11
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 34.6 26.2 43.1 NA NA NA NA NA NA NA NA
2 2 1 18.7 15.5 23.4 NA NA NA NA NA NA NA NA
3 3 1 16.2 14.4 21.7 NA NA NA NA NA NA NA 1.32
4 3 1 20.0 15.5 23.4 NA NA 1.32 2.78 1.44 NA NA 1.89
5 3 1 2.39 16.9 24.1 NA NA 1.13 1.52 1.12 NA NA 2.78
6 3 1 24.3 15.4 24.6 NA NA 1.13 1.89 1.13 NA NA 1.51
7 6 1 16.7 16.0 23.4 0.19 NA 0.83 1.3 0.94 1.78 2.15 1.51
8 6 1 18.7 16.4 25.8 0.19 NA 1.22 1.4 0.97 1.93 2.35 1.51
9 6 1 19.3 16.4 25.8 0.19 NA 1.22 1.4 0.97 1.93 2.35 1.51
10 7 1 23.8 18.6 28.6 NA NA NA NA NA NA NA NA
11 12 2 28.8 24.4 39.7 NA NA 1.13 1.89 1.32 2.46 3.21 NA
12 13 2 24.6 19.6 29.4 0.16 NA 3.23 3.23 2.27 NA NA NA
13 14 2 18.4 15.5 24.8 NA NA 2.27 3.78 1.13 3.46 4.91 2.78
14 15 2 23.8 24.4 39.7 NA NA NA NA NA NA NA NA
15 15 2 25.8 24.4 39.7 NA NA NA NA NA NA NA NA
16 16 2 18.9 17.4 26.9 0.15 NA NA 1.89 2.99 NA NA 1.51
17 16 2 22.1 17.3 26.9 NA NA NA 2.57 0.94 NA NA 1.51
18 16 2 24.3 19.6 28.5 0.15 NA NA 1.51 1.32 NA NA 2.27
The result would look something like this:
print(Result)
# A tibble: 18 x 13
Org Pos obv1 obv2 obv3 obv4 obv5 obv6 obv7 obv8 obv9 obv10 obv11
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 34.6 26.2 43.1 NA NA NA NA NA NA NA NA
2 2 1 18.7 15.5 23.4 NA NA NA NA NA NA NA NA
3 3 1 NA NA NA NA NA NA NA NA NA NA NA
4 3 1 20.0 15.5 23.4 NA NA 1.32 2.78 1.44 NA NA NA
5 3 1 NA NA NA NA NA NA NA NA NA NA NA
6 3 1 NA NA NA NA NA NA NA NA NA NA 1.51
7 6 1 16.7 16.0 23.4 0.19 NA NA NA NA NA NA NA
8 6 1 NA NA NA NA NA 1.22 1.4 0.97 1.93 2.35 1.51
9 6 1 19.3 16.4 25.8 NA NA NA NA NA NA NA NA
10 7 1 23.8 18.6 28.6 NA NA NA NA NA NA NA NA
11 12 2 28.8 24.4 39.7 NA NA 1.13 1.89 1.32 2.46 3.21 NA
12 13 2 24.6 19.6 29.4 0.16 NA 3.23 3.23 2.27 NA NA NA
13 14 2 18.4 15.5 24.8 NA NA 2.27 3.78 1.13 3.46 4.91 2.78
14 15 2 NA NA NA NA NA NA NA NA NA NA NA
15 15 2 25.8 24.4 39.7 NA NA NA NA NA NA NA NA
16 16 2 NA NA NA NA NA NA 1.89 2.99 NA NA NA
17 16 2 22.1 17.3 26.9 NA NA NA 2.57 0.94 NA NA 1.51
18 16 2 NA NA NA NA NA NA NA NA NA NA NA
Any advice would be appreciated. Thanks!

How to convert a variable whose numbers are read as a factor so that they are read as numbers? [duplicate]

This question already has answers here:
How to convert a factor to integer\numeric without loss of information?
(12 answers)
Closed 4 years ago.
First of all, I'm new to R and just learning.
I have a data frame and I want to make some plots with two variables. One of these variables is read as a factor, but it contains real numbers. The variable is a percentage, and I want to plot it against some municipalities. How can I convert these values to numeric?
I tried the following code, because the guide I'm reading says to convert factors to numeric with the function as.numeric(), but the result is totally different numbers.
For example:
#the data frame is valle.abu2
valle.abu2$Porcentaje.de.Excedencias
#then
as.numeric(valle.abu2$Porcentaje.de.Excedencias)
valle.abu2$Porcentaje.de.Excedencias
[1] 1.3 0.04 1.6 0 0 0 0.31 0.61 0 2.31 3.6 8.04 0 7.18 0 5.88 1.35 0
[19] 2.56 0 3.2 0 0 0 0 0 0.05 0.32 0 5.23 0 0 0 0 0 0
[37] 0 5.42 5.54 11.44 0 2.51 0 4.88 0 3.45 0 2.78 2.7 0 4.39 0 0 0
[55] 0 3.99 3.42 6.01 0 5.52 0 0.04 0 0.46 0.34 0 4.63 0 14.65 2.91 5.9 4.17
[73] 0 0 0 0 0 0 1.15 1.52 9.17 2.22 3.82 0 0 0 0 7.04 3.57 12.5
[91] 0 0 0 0.72 1.32 0 9.88 2.63 0 0 0 0 0 0 37.57
134 Levels: 0 0.03 0.04 0.05 0.06 0.07 0.09 0.1 0.11 0.14 0.15 0.23 0.27 0.29 0.31 0.32 0.33 0.34 0.42
as.numeric(valle.abu2$Porcentaje.de.Excedencias)
[1] 42 3 48 1 1 1 15 25 1 69 92 129 1 127 1 120 44 1 71 1 86 1 1 1 1 1 4
[28] 16 1 115 1 1 1 1 1 1 1 116 118 59 1 70 1 108 1 90 1 75 73 1 103 1 1 1
[55] 1 97 89 122 1 117 1 3 1 21 18 1 104 1 64 77 121 101 1 1 1 1 1 1 39 47 131
[82] 67 96 1 1 1 1 126 91 60 1 1 1 28 43 1 134 72 1 1 1 1 1 1 98
Try:
as_numeric_factor <- function(x){
as.numeric(levels(x))[x]
}
as_numeric_factor(valle.abu2$Porcentaje.de.Excedencias)
Explanation.
The help page ?factor section Warning includes two different ways of doing what the question asks for, and states that one of them is more efficient.
To transform a factor f to approximately its original numeric values,
as.numeric(levels(f))[f] is recommended and slightly more efficient than
as.numeric(as.character(f)).
Here is a simple speed test. set.seed is not needed since the results of interest are the timings, not the computed values.
library(microbenchmark)
library(ggplot2)
as_numeric_factor2 <- function(x){
as.numeric(as.character(x))
}
f <- factor(rnorm(1e4))
mb <- microbenchmark(
levl = as_numeric_factor(f),
char = as_numeric_factor2(f)
)
autoplot(mb)
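To see why as.numeric() alone gives "totally different numbers", here is a small base R illustration on a made-up factor (the values are loosely taken from the question, but the vector itself is an assumption):

```r
# A factor whose labels look like numbers
f <- factor(c("1.3", "0.04", "1.6", "0", "0.04"))

levels(f)                     # the sorted labels: "0" "0.04" "1.3" "1.6"

# as.numeric() on a factor returns the internal integer codes,
# i.e. the position of each value within levels(f)
codes <- as.numeric(f)

# Both of these recover the numbers the labels represent
vals1 <- as.numeric(levels(f))[f]
vals2 <- as.numeric(as.character(f))

print(codes)
print(vals1)
```

The codes are positions in the level table, which is why the 0 entries in the question all show up as 1.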

Cumulative sum based on factor on R

I have the following dataset, and I need to accumulate a running sum of the value while the factor is 0, and then put the accumulated sum in the row where I find factor != 0.
I've tried the loop below, but it didn't work at all.
for(i in dataset$Variable.1) {
ifelse(dataset$Factor == 0,
dataset$teste <- dataset$Variable.1 + i,
dataset$teste <- dataset$Variable.1)
i<- dataset$Variable.1
print(i)
}
Any ideas?
Below is an example of the dataset. I wish to get the "Result" column.
In the real one, I also have a negative factor (-1).
Date Factor Variable.1 Result
1 03/02/2018 0 0.75 0.75
2 04/02/2018 0 0.75 1.50
3 05/02/2018 1 0.96 2.46
4 06/02/2018 1 0.76 0.76
5 07/02/2018 0 1.35 1.35
6 08/02/2018 1 0.70 2.05
7 09/02/2018 1 2.02 2.02
8 10/02/2018 0 0.00 0.00
9 11/02/2018 0 0.00 0.00
10 12/02/2018 0 0.20 0.20
11 13/02/2018 0 0.13 0.33
12 14/02/2018 0 1.64 1.97
13 15/02/2018 0 0.03 2.00
14 16/02/2018 1 0.51 2.51
15 17/02/2018 1 0.00 0.00
16 18/02/2018 0 0.00 0.00
17 19/02/2018 0 0.83 0.83
18 20/02/2018 1 0.42 1.25
19 21/02/2018 1 0.17 0.17
20 22/02/2018 1 0.97 0.97
21 23/02/2018 0 0.92 0.92
22 24/02/2018 0 0.00 0.92
23 25/02/2018 0 0.00 0.92
24 26/02/2018 1 0.19 1.11
25 27/02/2018 1 0.87 0.87
26 28/02/2018 1 0.85 0.85
27 01/03/2018 1 1.95 1.95
28 02/03/2018 1 0.54 0.54
29 03/03/2018 1 0.00 0.00
30 04/03/2018 0 0.00 0.00
31 05/03/2018 0 1.17 1.17
32 06/03/2018 1 0.25 1.42
33 07/03/2018 1 1.45 1.45
Thanks In advance.
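If you want to stick with the for-loop, you can try this code:
DF$Result <- NA
prev <- 0   # running sum carried over from the previous row
for(i in seq_len(nrow(DF))){
  DF$Result[i] <- DF$Variable.1[i] + prev
  if(DF$Factor[i] == 1)
    prev <- 0              # restart the sum after a row with Factor == 1
  else
    prev <- DF$Result[i]   # keep accumulating
}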
Iteratively, try something like:
a=as.data.frame(cbind(Factor=c(0,0,1,1,0,1,1,
                               rep(0,3),1),
                      Variable.1=c(0.75,0.75,0.96,0.71,1.35,0.7,
                                   0.75,0.96,0.71,1.35,0.7)))
Result=0
aux=NULL
for (i in 1:nrow(a)){
  if (a$Factor[i]==0){
    Result=Result+a$Variable.1[i]
    aux=c(aux,Result)
  } else{
    Result=Result+a$Variable.1[i]
    aux=c(aux,Result)
    Result=0
  }
}
a$Results=aux
a
Factor Variable.1 Results
1 0 0.75 0.75
2 0 0.75 1.50
3 1 0.96 2.46
4 1 0.71 0.71
5 0 1.35 1.35
6 1 0.70 2.05
7 1 0.75 0.75
8 0 0.96 0.96
9 0 0.71 1.67
10 0 1.35 3.02
11 1 0.70 3.72
A possibility using tidyverse and data.table:
library(dplyr)
library(data.table) # for rleid()
df %>%
  mutate(temp = ifelse(Factor == 1 & lag(Factor) == 1, NA, 1), #Marking the rows after the first 1 in "Factor" as NA
         temp = ifelse(!is.na(temp), rleid(temp), NA)) %>% #Run length along non-NA values
  group_by(temp) %>% #Grouping by run length
  mutate(Result = ifelse(!is.na(temp), cumsum(Variable.1), Variable.1)) %>% #Cumulative sum of desired rows
  ungroup() %>%
  select(-temp) #Removing the redundant variable
Date Factor Variable.1 Result
<chr> <int> <dbl> <dbl>
1 03/02/2018 0 0.750 0.750
2 04/02/2018 0 0.750 1.50
3 05/02/2018 1 0.960 2.46
4 06/02/2018 1 0.760 0.760
5 07/02/2018 0 1.35 1.35
6 08/02/2018 1 0.700 2.05
7 09/02/2018 1 2.02 2.02
8 10/02/2018 0 0. 0.
9 11/02/2018 0 0. 0.
10 12/02/2018 0 0.200 0.200
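A loop-free base R variant is also possible. This is only a sketch under one assumption that the question does not state outright: any non-zero factor (1 or -1) closes the current run, and the sum restarts on the following row. The data frame below is made up in the shape of the question's data, and run_id is a hypothetical name.

```r
# Made-up rows in the shape of the question's data, including one -1
DF <- data.frame(
  Factor     = c(0, 0, 1, 1, 0, 1, -1, 0),
  Variable.1 = c(0.75, 0.75, 0.96, 0.76, 1.35, 0.70, 2.02, 0.20)
)

# A new run starts on the row AFTER any non-zero factor:
# shift the "non-zero" flag down by one row and count the flags so far
run_id <- cumsum(c(FALSE, head(DF$Factor != 0, -1)))

# Cumulative sum of Variable.1, restarted within each run
DF$Result <- ave(DF$Variable.1, run_id, FUN = cumsum)
print(DF)
```

Because the grouping depends only on the previous row's factor, this also works when a single non-zero factor is followed directly by a 0.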

using melt() does not sort by set ID

I have the data set:
Time a b
[1,] 0 5.06 9.60
[2,] 4 9.57 4.20
[3,] 8 1.78 3.90
[4,] 12 2.21 3.90
[5,] 16 4.10 5.84
[6,] 20 2.81 8.10
[7,] 24 2.70 1.18
[8,] 36 52.00 5.68
[9,] 48 NA 6.66
And I would like to reshape it to:
Time variable value
0 a 5.06
4 a 9.57
8 a 1.78
...
0 b 9.60
4 b 4.20
8 b 3.90
...
The code I am using is:
library(reshape2)
Time <- c(0,4,8,12,16,20,24,36,48)
a <- c(5.06,9.57,1.78,2.21,4.1,2.81,2.7,52,NA)
b <- c(9.6,4.2,3.9,3.9,5.84,8.1,1.18,5.68,6.66)
Mono <- cbind(Time,a,b)
mono <- melt(Mono,id="Time",na.rm=F)
Which produces:
Var1 Var2 value
1 1 Time 0.00
2 2 Time 4.00
3 3 Time 8.00
4 4 Time 12.00
5 5 Time 16.00
6 6 Time 20.00
7 7 Time 24.00
8 8 Time 36.00
9 9 Time 48.00
10 1 a 5.06
11 2 a 9.57
12 3 a 1.78
13 4 a 2.21
14 5 a 4.10
15 6 a 2.81
16 7 a 2.70
17 8 a 52.00
18 9 a NA
19 1 b 9.60
20 2 b 4.20
21 3 b 3.90
22 4 b 3.90
23 5 b 5.84
24 6 b 8.10
25 7 b 1.18
26 8 b 5.68
27 9 b 6.66
I'm sure it's a small error, but I can't figure it out. It's especially frustrating because I've used melt() without problems many times before. How can I fix the code to produce the table I'm looking for?
Thanks for your help!
Use tidyr::gather() to move from wide to long format.
> df <- data.frame(time = seq(0,20,5),
a = rnorm(5,0,1),
b = rnorm(5,0,1))
> library(tidyr)
> gather(df, variable, value, -time)
time variable value
1 0 a 1.5406529
2 5 a 1.5048055
3 10 a -1.1138529
4 15 a -0.1199039
5 20 a -1.7052608
6 0 b -1.1976938
7 5 b 0.7997127
8 10 b 1.1940454
9 15 b 0.5177981
10 20 b 0.6725264
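If you would rather stay with melt(), the likely cause of the output in the question is that cbind() returns a matrix, so melt() treats it as an array and ignores the id argument. A minimal sketch using the question's own vectors, with data.frame() swapped in for cbind():

```r
library(reshape2)

Time <- c(0, 4, 8, 12, 16, 20, 24, 36, 48)
a <- c(5.06, 9.57, 1.78, 2.21, 4.1, 2.81, 2.7, 52, NA)
b <- c(9.6, 4.2, 3.9, 3.9, 5.84, 8.1, 1.18, 5.68, 6.66)

# data.frame() instead of cbind(): melt() then uses its data frame
# method, which honours id.vars
Mono <- data.frame(Time, a, b)
mono <- melt(Mono, id.vars = "Time", na.rm = FALSE)
head(mono, 3)
```

The result has the columns Time, variable and value, with all rows for a followed by all rows for b.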
