R: apply simple function to specific columns by grouped variable

I have a data set with 2 observations for each person.
There are more than 100 variables in the data set.
I would like to fill in the missing data for each person with the available data for the same variable. I can do this manually with the dplyr mutate function, but it would be cumbersome to do that for all the variables that need to be filled in.
Here is what I tried, but it failed:
> # Here's data example
> # https://www.dropbox.com/s/a0bc69xgxhaeguc/data_xlsc.xlsx?dl=0
> # I have already attached it to my working space
>
> names(data)
[1] "ID" "Age" "var1" "var2" "var3" "var4" "var5" "var6" "var7" "var8" "var9"
> head(data)
Source: local data frame [6 x 11]
ID Age var1 var2 var3 var4 var5 var6 var7 var8 var9
1 1 50 27.5 1.83 92.0 NA NA NA NA NA 5.1
2 1 NA NA NA NA 3.54 30.2 27.9 64.34 60.8 NA
3 2 51 33.7 1.77 105.6 NA NA NA NA NA 5.2
4 2 NA NA NA NA 4.05 36.4 38.7 67.75 63.7 NA
5 3 43 26.3 1.84 89.1 NA NA NA NA NA 4.8
6 3 NA NA NA NA 3.77 24.4 21.9 67.97 64.2 NA
> # As you can see above, for each person (ID) there are missing values for age and other variables.
> # I'd like to fill in missing data with the available data for each variable, for each ID
>
> #These are the variables that I need to fill in
> desired_variables <- names(data[,2:11])
>
> # this is my attempt that failed
>
> data2 <- data %>% group_by(ID) %>%
+ do(
+ for (i in seq_along(desired_variables)) {
+ i=max(i, na.rm=T)
+ }
+ )
Error: Results are not data frames at positions: 1, 2, 3
Desired output for the first person:
ID Age var1 var2 var3 var4 var5 var6 var7 var8 var9
1 1 50 27.5 1.83 92.0 3.54 30.2 27.9 64.34 60.8 5.1
2 1 50 27.5 1.83 92.0 3.54 30.2 27.9 64.34 60.8 5.1

Here's a possible data.table solution. (The do() attempt above fails because dplyr's do() expects the expression to return a data frame for each group, whereas a for loop returns NULL.)
library(data.table)
setattr(data, "class", "data.frame") ## If your data is of `tbl_df` class
setDT(data)[, (desired_variables) := lapply(.SD, max, na.rm = TRUE), by = ID] ## you can also use `.SDcols` if you want to specify specific columns
data
# ID Age var1 var2 var3 var4 var5 var6 var7 var8 var9
# 1: 1 50 27.5 1.83 92.0 3.54 30.2 27.9 64.34 60.8 5.1
# 2: 1 50 27.5 1.83 92.0 3.54 30.2 27.9 64.34 60.8 5.1
# 3: 2 51 33.7 1.77 105.6 4.05 36.4 38.7 67.75 63.7 5.2
# 4: 2 51 33.7 1.77 105.6 4.05 36.4 38.7 67.75 63.7 5.2
# 5: 3 43 26.3 1.84 89.1 3.77 24.4 21.9 67.97 64.2 4.8
# 6: 3 43 26.3 1.84 89.1 3.77 24.4 21.9 67.97 64.2 4.8
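Since the question mentions dplyr, here is a rough equivalent in current dplyr for comparison (a sketch, not from the original answer; it assumes each ID has at least one non-missing value per column, otherwise max(..., na.rm = TRUE) returns -Inf with a warning):
library(dplyr)

data2 <- data %>%
  group_by(ID) %>%
  ## per group, replace each value with the group maximum, which fills the NAs here
  mutate(across(all_of(desired_variables), ~ max(.x, na.rm = TRUE))) %>%
  ungroup()
If the goal is simply to carry observed values within each person rather than take a per-group maximum, tidyr::fill() with .direction = "downup" on the grouped data is another option.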

Related

How to use rollapplyr while ignoring NA values?

I have weather data with NAs scattered sporadically throughout, and I want to calculate rolling means. I have been using the rollapplyr function from zoo, but even though I include partial = TRUE, it still produces an NA whenever, for example, there is an NA in one of the 30 values to be averaged.
Here is the formula:
weather_rolled <- weather %>%
  mutate(maxt30 = rollapplyr(max_temp, 30, mean, partial = TRUE))
Here's my data:
A tibble: 7,160 x 11
station_name date max_temp avg_temp min_temp rainfall rh avg_wind_speed dew_point avg_bare_soil_temp total_solar_rad
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 VEGREVILLE 2019-01-01 0.9 -7.9 -16.6 1 81.7 20.2 -7.67 NA NA
2 VEGREVILLE 2019-01-02 5.5 1.5 -2.5 0 74.9 13.5 -1.57 NA NA
3 VEGREVILLE 2019-01-03 3.3 -0.9 -5 0.5 80.6 10.1 -3.18 NA NA
4 VEGREVILLE 2019-01-04 -1.1 -4.7 -8.2 5.2 92.1 8.67 -4.76 NA NA
5 VEGREVILLE 2019-01-05 -3.8 -6.5 -9.2 0.2 92.6 14.3 -6.81 NA NA
6 VEGREVILLE 2019-01-06 -3 -4.4 -5.9 0 91.1 16.2 -5.72 NA NA
7 VEGREVILLE 2019-01-07 -5.8 -12.2 -18.5 0 75.5 30.6 -16.9 NA NA
8 VEGREVILLE 2019-01-08 -17.4 -21.6 -25.7 1.2 67.8 16.1 -26.1 NA NA
9 VEGREVILLE 2019-01-09 -12.9 -15.1 -17.4 0.2 71.5 14.3 -17.7 NA NA
10 VEGREVILLE 2019-01-10 -13.2 -17.9 -22.5 0.4 80.2 3.38 -21.8 NA NA
# ... with 7,150 more rows
Essentially, whenever an NA appears partway through, it produces a long run of NAs in the rolling mean. I still want to calculate the rolling mean within that window, ignoring the NAs. Does anyone know a way to get around this? I have been searching online for hours to no avail.
Thanks!
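One likely way around this (a sketch, not tested on the data above): rollapply() forwards extra arguments to the applied function, so mean() can be told to drop NAs:
library(dplyr)
library(zoo)

weather_rolled <- weather %>%
  ## na.rm = TRUE is passed through to mean(); partial = TRUE is an argument of rollapplyr()
  mutate(maxt30 = rollapplyr(max_temp, 30, mean, na.rm = TRUE, partial = TRUE))
An equivalent form is rollapplyr(max_temp, 30, function(x) mean(x, na.rm = TRUE), partial = TRUE). Note that a window made up entirely of NAs will still return NaN.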

aov throws "could not find function 'Error'" when used in the context of group_by

I'm trying to run a repeated measures ANCOVA. The following code works fine:
tidy(aov(FA ~ sex * study + Error(PartID), data = DTI.TRACTlong))
Here FA is a continuous measure, sex and study are factors (study indicates time 1 or time 2), and PartID is the individual id. However, I have to run this analysis for a number of regions (ROI) under two different conditions (harmonized vs. not). This seems easy enough using the tidyverse with group_by (see below), but it throws Error in Error(.$PartID) : could not find function "Error". Any idea what is happening here? Why is the Error function recognized when aov is used on its own but not when used with group_by?
group_by(harmonize, ROI) %>%
  do(tidy(aov(.$FA ~ .$GOBS_Gender * .$study + Error(.$PartID))))
Specify the data and remove the .$
library(broom)
library(dplyr)
...
group_by(harmonize, ROI) %>%
  do(tidy(aov(FA ~ sex * study + Error(PartID), data = .)))
Another option may be to use nest_by
library(tidyr)
...
nest_by(harmonize, ROI) %>%
  mutate(out = list(tidy(aov(FA ~ sex * study + Error(PartID), data = data)))) %>%
  select(-data) %>%
  unnest(c(out))
Using a reproducible example
> data(npk)
> npk$grp <- rep(c('a', 'b'), each = 12)
-do method
> npk %>%
group_by(grp) %>%
do(tidy(aov(yield ~ N*P*K + Error(block), data = .)))
# A tibble: 18 x 8
# Groups: grp [2]
grp stratum term df sumsq meansq statistic p.value
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a block N:P:K 1 69.0 69.0 3.12 0.328
2 a block Residuals 1 22.1 22.1 NA NA
3 a Within N 1 119. 119. 6.89 0.0786
4 a Within P 1 0.270 0.270 0.0156 0.908
5 a Within K 1 58.1 58.1 3.36 0.164
6 a Within N:P 1 12.2 12.2 0.705 0.463
7 a Within N:K 1 23.4 23.4 1.35 0.329
8 a Within P:K 1 44.6 44.6 2.58 0.207
9 a Within Residuals 3 51.8 17.3 NA NA
10 b block N:P:K 1 29.3 29.3 0.431 0.630
11 b block Residuals 1 67.9 67.9 NA NA
12 b Within N 1 73.0 73.0 57.3 0.00478
13 b Within P 1 21.3 21.3 16.7 0.0264
14 b Within K 1 38.2 38.2 29.9 0.0120
15 b Within N:P 1 8.52 8.52 6.68 0.0814
16 b Within N:K 1 31.5 31.5 24.7 0.0156
17 b Within P:K 1 47.3 47.3 37.1 0.00888
18 b Within Residuals 3 3.82 1.27 NA NA
-group_modify #andrew reece
> npk %>%
group_by(grp) %>%
group_modify(~aov(yield ~ N*P*K + Error(block), data = .) %>%
tidy)
# A tibble: 18 x 8
# Groups: grp [2]
grp stratum term df sumsq meansq statistic p.value
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a block N:P:K 1 69.0 69.0 3.12 0.328
2 a block Residuals 1 22.1 22.1 NA NA
3 a Within N 1 119. 119. 6.89 0.0786
4 a Within P 1 0.270 0.270 0.0156 0.908
5 a Within K 1 58.1 58.1 3.36 0.164
6 a Within N:P 1 12.2 12.2 0.705 0.463
7 a Within N:K 1 23.4 23.4 1.35 0.329
8 a Within P:K 1 44.6 44.6 2.58 0.207
9 a Within Residuals 3 51.8 17.3 NA NA
10 b block N:P:K 1 29.3 29.3 0.431 0.630
11 b block Residuals 1 67.9 67.9 NA NA
12 b Within N 1 73.0 73.0 57.3 0.00478
13 b Within P 1 21.3 21.3 16.7 0.0264
14 b Within K 1 38.2 38.2 29.9 0.0120
15 b Within N:P 1 8.52 8.52 6.68 0.0814
16 b Within N:K 1 31.5 31.5 24.7 0.0156
17 b Within P:K 1 47.3 47.3 37.1 0.00888
18 b Within Residuals 3 3.82 1.27 NA NA
-nest_by method
> npk %>%
nest_by(grp) %>%
mutate(out = list(tidy(aov(yield ~ N*P*K + Error(block),
data = data)))) %>%
select(-data) %>%
unnest(out)
# A tibble: 18 x 8
# Groups: grp [2]
grp stratum term df sumsq meansq statistic p.value
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a block N:P:K 1 69.0 69.0 3.12 0.328
2 a block Residuals 1 22.1 22.1 NA NA
3 a Within N 1 119. 119. 6.89 0.0786
4 a Within P 1 0.270 0.270 0.0156 0.908
5 a Within K 1 58.1 58.1 3.36 0.164
6 a Within N:P 1 12.2 12.2 0.705 0.463
7 a Within N:K 1 23.4 23.4 1.35 0.329
8 a Within P:K 1 44.6 44.6 2.58 0.207
9 a Within Residuals 3 51.8 17.3 NA NA
10 b block N:P:K 1 29.3 29.3 0.431 0.630
11 b block Residuals 1 67.9 67.9 NA NA
12 b Within N 1 73.0 73.0 57.3 0.00478
13 b Within P 1 21.3 21.3 16.7 0.0264
14 b Within K 1 38.2 38.2 29.9 0.0120
15 b Within N:P 1 8.52 8.52 6.68 0.0814
16 b Within N:K 1 31.5 31.5 24.7 0.0156
17 b Within P:K 1 47.3 47.3 37.1 0.00888
18 b Within Residuals 3 3.82 1.27 NA NA

Add -0.5 to a value below 0 and add 0.5 to value above 0 in r

I may have a strange question... I have a dataframe like the one below:
Station Mean_length Diff
1 AMEL 28.1 -2.91
2 AMRU 21.1 -9.90
3 BALG 31.0 0
4 BORK 30.1 -0.921
5 BUSU 22.6 -8.38
6 CADZ 28.5 2.46
7 DOLL 27.9 -3.07
8 EGMO 28.3 -2.69
9 EIER 30.8 0.233
10 FANO 23.1 -7.89
Now, from the column "Diff", I want to derive a new column in which I add -0.5 to each value below 0 and add 0.5 to each value above 0.
So I would get a new dataframe like this:
Station Mean_length Diff Diff05
1 AMEL 28.1 -2.91 -3.41 (-0.5)
2 AMRU 21.1 -9.90 -13.8 (-0.5)
3 BALG 31.0 0 0.5 (+0.5)
4 BORK 30.1 -0.921 -1.421 (-0.5)
5 BUSU 22.6 -8.38 -8.88 (-0.5)
6 CADZ 28.5 2.46 2.96 (+0.5)
7 DOLL 27.9 -3.07 -3.57 (-0.5)
8 EGMO 28.3 -2.69 -3.19 (-0.5)
9 EIER 30.8 0.233 0.733 (+0.5)
10 FANO 23.1 -7.89 -8.39 (-0.5)
How can I tackle this? Is this possible in dplyr, perhaps with the ifelse function, by recognizing values that have a '-' in front of them?
Thank you in advance!
Another way:
df$Diff05 <- df$Diff + 0.5 * sign(df$Diff)
Station Mean_length Diff Diff05
1 AMEL 28.1 -2.910 -3.410
2 AMRU 21.1 -9.900 -10.400
3 BALG 31.0 0.000 0.000
4 BORK 30.1 -0.921 -1.421
5 BUSU 22.6 -8.380 -8.880
6 CADZ 28.5 2.460 2.960
7 DOLL 27.9 -3.070 -3.570
8 EGMO 28.3 -2.690 -3.190
9 EIER 30.8 0.233 0.733
10 FANO 23.1 -7.890 -8.390
You could also use df$Diff + (df$Diff > 0) - 0.5. Note that both of these approaches treat an exact 0 differently from the desired output: sign(0) is 0, so a Diff of 0 stays at 0, and (0 > 0) - 0.5 yields -0.5, whereas the desired output adds +0.5.
Does this work:
library(dplyr)
df %>% mutate(Diff05 = if_else(Diff < 0, Diff - 0.5, Diff + 0.5))
# A tibble: 10 x 4
station Mean_length Diff Diff05
<chr> <dbl> <dbl> <dbl>
1 AMEL 28.1 -2.91 -3.41
2 AMRU 21.1 -9.9 -10.4
3 BALG 31 0 0.5
4 BORK 30.1 -0.921 -1.42
5 BUSU 22.6 -8.38 -8.88
6 CADZ 28.5 2.46 2.96
7 DOLL 27.9 -3.07 -3.57
8 EGMO 28.3 -2.69 -3.19
9 EIER 30.8 0.233 0.733
10 FANO 23.1 -7.89 -8.39
The logical way
df$Diff05 <- ifelse(test = df$Diff < 0, yes = df$Diff - 0.5, no = df$Diff + 0.5)

Apply row-wise transformation in R so that total percentage for each row will be 100%

I have a data frame like this:
df <- structure(list(groups = c("group1", "group2", "group3", "group4", "group5"),
                     A = c(28.6, 26.7, 29.1, 23.1, 1.0),
                     B = c(24.5, 22.3, 23.9, 20.2, 1.5),
                     C = c(12.1, 11.2, 12.1, 11.7, 1.5),
                     D = c(9.4, 7.0, 9.0, 8.7, 1.1)),
                class = "data.frame",
                row.names = c("1", "2", "3", "4", "5"))
  groups    A    B    C    D
1 group1 28.6 24.5 12.1  9.4
2 group2 26.7 22.3 11.2  7.0
3 group3 29.1 23.9 12.1  9.0
4 group4 23.1 20.2 11.7  8.7
5 group5  1.0  1.5  1.5  1.1
The values in the dataframe are percentages. I would like to rescale each row so that its total is 100%. The output would look similar to this (by the way, I calculated the expected output by hand, so it may not be as accurate as a computed result):
groups A B C D
1 group1 38.3 32.8 16.2 12.6
2 group2 39.7 33.1 16.7 10.4
3 group3 39.7 32.6 16.4 11.3
4 group4 36.3 31.5 18.9 13.3
How should I do it? Thank you!
You can use proportions() (base R >= 4.0.0; prop.table() in earlier versions) to get the row percentages.
proportions(as.matrix(df[1:4,-1]), 1) * 100
# A B C D
#1 38.33780 32.84182 16.21984 12.60054
#2 39.73214 33.18452 16.66667 10.41667
#3 39.27126 32.25371 16.32928 12.14575
#4 36.26374 31.71115 18.36735 13.65777
If you want to do this in the dplyr context:
df %>%
  rowwise() %>%
  mutate(sm = sum(c_across(-groups))) %>%
  mutate(across(A:D, function(x) x / sm) * 100) %>%
  select(-sm)
## A tibble: 5 x 5
## Rowwise:
# groups A B C D
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 group1 38.3 32.8 16.2 12.6
#2 group2 39.7 33.2 16.7 10.4
#3 group3 39.3 32.3 16.3 12.1
#4 group4 36.3 31.7 18.4 13.7
#5 group5 19.6 29.4 29.4 21.6
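A compact base-R alternative (a sketch; it assumes the numeric columns are everything except the first, groups, column):
df[-1] <- df[-1] / rowSums(df[-1]) * 100
df
This relies on the rowSums() vector being recycled column by column, so every value is divided by its own row total.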

Calculate medians of rows in a grouped dataframe

I have a dataframe containing multiple entries per week. It looks like this:
Week t_10 t_15 t_18 t_20 t_25 t_30
1 51.4 37.8 25.6 19.7 11.9 5.6
2 51.9 37.8 25.8 20.4 12.3 6.2
2 52.4 38.5 26.2 20.5 12.3 6.1
3 52.2 38.6 26.1 20.4 12.4 5.9
4 52.2 38.3 26.1 20.2 12.1 5.9
4 52.7 38.4 25.8 20.0 12.1 5.9
4 51.1 37.8 25.7 20.0 12.2 6.0
4 51.9 38.0 26.0 19.8 12.0 5.8
The weeks have different numbers of entries, ranging from a single entry up to four entries per week.
I want to calculate the median of each week for all the variables (t_10 through t_30) and output them in a new dataframe. NA cells have already been omitted from the original dataframe. I have tried different approaches with the ddply function from the plyr package, but to no avail so far.
We could use summarise_at for multiple columns
library(dplyr)
colsToKeep <- c("t_10", "t_30")
df1 %>%
  group_by(Week) %>%
  summarise_at(vars(colsToKeep), median)
# A tibble: 4 x 3
# Week t_10 t_30
# <int> <dbl> <dbl>
#1 1 51.40 5.60
#2 2 52.15 6.15
#3 3 52.20 5.90
#4 4 52.05 5.90
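summarise_at() is superseded in current dplyr; an equivalent sketch with across() (same df1 and colsToKeep as above):
df1 %>%
  group_by(Week) %>%
  summarise(across(all_of(colsToKeep), median))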
Specify the variables to keep in colsToKeep and store the input table in d:
library(tidyverse)
colsToKeep <- c("t_10", "t_30")
gather(d, variable, value, -Week) %>%
  filter(variable %in% colsToKeep) %>%
  group_by(Week, variable) %>%
  summarise(median = median(value))
# A tibble: 8 x 3
# Groups: Week [4]
Week variable median
<int> <chr> <dbl>
1 1 t_10 51.40
2 1 t_30 5.60
3 2 t_10 52.15
4 2 t_30 6.15
5 3 t_10 52.20
6 3 t_30 5.90
7 4 t_10 52.05
8 4 t_30 5.90
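gather() is likewise superseded; a pivot_longer() sketch that produces the same long-format summary (assuming the same d and colsToKeep):
library(dplyr)
library(tidyr)

pivot_longer(d, -Week, names_to = "variable", values_to = "value") %>%
  filter(variable %in% colsToKeep) %>%
  group_by(Week, variable) %>%
  summarise(median = median(value), .groups = "drop")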
You can also use the aggregate function, with Week as the grouping variable on the right-hand side of the formula:
newdf <- aggregate(. ~ Week, data = df, FUN = median)
