Aggregate by group AND add column to data frame in R [duplicate]

This question already has answers here:
Calculate group mean, sum, or other summary stats. and assign column to original data
(4 answers)
Closed 4 years ago.
For a sample dataframe:
df1 <- structure(list(place = c("a", "a", "b", "b", "b", "b", "c", "c",
"c", "d", "d"), animal = c("cat", "bear", "cat", "bear", "pig",
"goat", "cat", "bear", "goat", "goat", "bear"), number = c(5,
6, 7, 4, 5, 6, 8, 5, 3, 7, 4)), .Names = c("place", "animal",
"number"), row.names = c(NA, -11L), spec = structure(list(cols = structure(list(
place = structure(list(), class = c("collector_character",
"collector")), animal = structure(list(), class = c("collector_character",
"collector")), number = structure(list(), class = c("collector_integer",
"collector"))), .Names = c("place", "animal", "number")),
default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"), class = c("tbl_df",
"tbl", "data.frame"))
I want to create a variable 'sum' which sums the 'number' column by 'place' (regardless of animal), and adds it to the dataframe.
The command below:
df1$sum <- aggregate(df1$number, by=list(Category=df1$place), FUN=sum)
... computes the sums, but aggregate returns only one row per place, so the result cannot be assigned back to the 11-row data frame (hence the error):
Error in `$<-.data.frame`(`*tmp*`, sum, value = list(Category = c("a", :
replacement has 4 rows, data has 11
Any ideas how I add this extra column onto my dataframe?

Since you have a tibble, here is a dplyr solution first, followed by a base R version.
using dplyr:
df1 %>%
  group_by(place) %>%
  mutate(sum_num = sum(number))
# A tibble: 11 x 4
# Groups: place [4]
place animal number sum_num
<chr> <chr> <dbl> <dbl>
1 a cat 5 11
2 a bear 6 11
3 b cat 7 22
4 b bear 4 22
5 b pig 5 22
6 b goat 6 22
7 c cat 8 16
8 c bear 5 16
9 c goat 3 16
10 d goat 7 11
11 d bear 4 11
using base R:
df1$sum_num <- ave(df1$number, df1$place, FUN = sum)
# A tibble: 11 x 4
place animal number sum_num
<chr> <chr> <dbl> <dbl>
1 a cat 5 11
2 a bear 6 11
3 b cat 7 22
4 b bear 4 22
5 b pig 5 22
6 b goat 6 22
7 c cat 8 16
8 c bear 5 16
9 c goat 3 16
10 d goat 7 11
11 d bear 4 11
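If you prefer the aggregate() approach from the question, another option (a sketch of the same idea) is to compute one sum per place and then merge that summary back onto the original rows:
# one row per place with the total number
sums <- aggregate(number ~ place, data = df1, FUN = sum)
names(sums)[2] <- "sum_num"
# join the totals back onto every original row (note: merge() may reorder rows)
df1_merged <- merge(df1, sums, by = "place")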

Related

R - apply function on two files in folders with for loop or lapply and save results in one dataframe

I have a data set in "data" with 20 folders, which are identical in their structure. The only difference at the folder level is their names (from "1" to "20"); please see the pattern below. The files always have the same file names and the same column structure. The column length of the .csv files may differ between folders, but not between the .csv files within the same folder. There are no missing values in the data frames. I want to work with the "mean" column from each file.
Data structure
data
- 1 (folder)
- alpha (file)
- mean (column)
- .... (more columns)
- beta (file)
- mean (column)
- .... (more columns)
- ... (more files)
- 2 (folder)
- alpha (file)
- mean (column)
- .... (more columns)
- beta (file)
- mean (column)
- .... (more columns)
- ... (more files)
- ... (more folders with the same structure)
I would like to compare the mean from alpha to the mean from beta within each folder. In the end, however, I would like one dataframe that combines the results from all individual folders, so that I can create faceted boxplots and descriptive statistics from it.
I am still new to R and apparently lack the skills for this (sorry for the complicated code and my English). I can perform the task manually for one folder at a time, but I cannot put the findings together with a for loop or lapply solution.
I have found many threads on merging data frames, but none that first apply a function to two files from the same folder. I hope I have produced a workable minimal example with two data frames from each of two folders.
library(plyr)
library(tidyverse)
alpha1 <- read_csv('data/1/alpha.csv')
beta1 <- read_csv('data/1/beta.csv')
alpha2 <- read_csv('data/2/alpha.csv')
beta2 <- read_csv('data/2/beta.csv')
Folder 1
alpha1 <- structure(list(Name = c("A", "B", "C", "D", "E", "F", "G", "H",
"I", "J", "K"), mean = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -11L), spec = structure(list(
cols = list(Name = structure(list(), class = c("collector_character",
"collector")), mean = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
beta1 <- structure(list(Name = c("A", "B", "C", "D", "E", "F", "G", "H",
"I", "J", "K"), mean = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -11L), spec = structure(list(
cols = list(Name = structure(list(), class = c("collector_character",
"collector")), mean = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
alpha_mean <- alpha1 %>% select(mean_alpha = mean)
alphabeta <- alpha_mean %>% add_column(mean_beta = beta1$mean)
alphabeta_table <- ddply(alphabeta, .(), transform, alphabeta = (mean_alpha/mean_beta))
alphabeta_table
.id mean_alpha mean_beta alphabeta
1 <NA> 1 2 0.5000000
2 <NA> 2 3 0.6666667
3 <NA> 3 4 0.7500000
4 <NA> 4 5 0.8000000
5 <NA> 5 6 0.8333333
6 <NA> 6 7 0.8571429
7 <NA> 7 8 0.8750000
8 <NA> 8 9 0.8888889
9 <NA> 9 10 0.9000000
10 <NA> 10 11 0.9090909
11 <NA> 11 12 0.9166667
Folder 2
alpha2 <- structure(list(Name = c("A", "B", "C", "D", "E", "F", "G", "H",
"I", "J", "K", "L", "M"), mean = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -13L), spec = structure(list(
cols = list(Name = structure(list(), class = c("collector_character",
"collector")), mean = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
beta2 <- structure(list(Name = c("A", "B", "C", "D", "E", "F", "G", "H",
"I", "J", "K", "L", "M"), mean = c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -13L), spec = structure(list(
cols = list(Name = structure(list(), class = c("collector_character",
"collector")), mean = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
alpha2_mean <- alpha2 %>% select(mean_alpha = mean)
alphabeta2 <- alpha2_mean %>% add_column(mean_beta = beta2$mean)
alphabeta2_table <- ddply(alphabeta2, .(), transform, alphabeta = (mean_alpha/ mean_beta))
alphabeta2_table
.id mean_alpha mean_beta alphabeta
1 <NA> 2 3 0.6666667
2 <NA> 3 4 0.7500000
3 <NA> 4 5 0.8000000
4 <NA> 5 6 0.8333333
5 <NA> 6 7 0.8571429
6 <NA> 7 8 0.8750000
7 <NA> 8 9 0.8888889
8 <NA> 9 10 0.9000000
9 <NA> 10 11 0.9090909
10 <NA> 11 12 0.9166667
11 <NA> 12 13 0.9230769
12 <NA> 13 14 0.9285714
13 <NA> 14 15 0.9333333
Desired output
My desired output would be:
.id mean_alpha mean_beta alphabeta
1 1 1 2 0.5000000
2 1 2 3 0.6666667
3 1 3 4 0.7500000
4 1 4 5 0.8000000
5 1 5 6 0.8333333
6 1 6 7 0.8571429
7 1 7 8 0.8750000
8 1 8 9 0.8888889
9 1 9 10 0.9000000
10 1 10 11 0.9090909
11 1 11 12 0.9166667
1 2 2 3 0.6666667
2 2 3 4 0.7500000
3 2 4 5 0.8000000
4 2 5 6 0.8333333
5 2 6 7 0.8571429
6 2 7 8 0.8750000
7 2 8 9 0.8888889
8 2 9 10 0.9000000
9 2 10 11 0.9090909
10 2 11 12 0.9166667
11 2 12 13 0.9230769
12 2 13 14 0.9285714
13 2 14 15 0.9333333
1 3 ... ... ...
2 3 ... ... ...
...
Thank you for any help!
Try this solution:
Get all the folders using list.dirs.
For each folder, read the "alpha" and "beta" files and return a three-column tibble with the alpha, beta and alphabeta values.
Bind all the dataframes with an id column so you know which folder each value came from.
# all folder paths directly under data/
all_folders <- list.dirs('data', recursive = FALSE, full.names = TRUE)
result <- purrr::map_df(all_folders, function(x) {
  # "alpha.csv" sorts before "beta.csv", so [1] is alpha and [2] is beta
  all_files <- list.files(x, full.names = TRUE, pattern = 'alpha|beta')
  df1 <- read.csv(all_files[1])
  df2 <- read.csv(all_files[2])
  tibble::tibble(alpha = df1$mean, beta = df2$mean, alphabeta = alpha / beta)
}, .id = "id")
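Note that .id records each folder's position in all_folders, and because list.dirs() returns paths in lexical order ("1", "10", "11", ..., "2", "20", ...), that position will not generally match the folder name. One way around this (a sketch, assuming the same directory layout as above) is to name the vector of paths by folder before mapping, so the id column carries the folder names themselves:
result <- purrr::map_df(
  purrr::set_names(all_folders, basename(all_folders)),
  function(x) {
    files <- list.files(x, full.names = TRUE, pattern = 'alpha|beta')
    alpha <- read.csv(files[1])
    beta  <- read.csv(files[2])
    tibble::tibble(mean_alpha = alpha$mean,
                   mean_beta  = beta$mean,
                   alphabeta  = mean_alpha / mean_beta)
  },
  .id = "id"
)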

Fill multiple columns in an R dataframe [duplicate]

This question already has answers here:
Complete dataframe with missing combinations of values
(2 answers)
Closed 2 years ago.
I have a dataframe called flu that is a count of cases (n) by group per week.
flu <- structure(list(isoweek = c(1, 1, 2, 2, 3, 3, 4, 5, 5), group = c("fluA",
"fluB", "fluA", "fluB", "fluA", "fluB", "fluA", "fluA", "fluB"
), n = c(5, 6, 3, 5, 12, 14, 6, 23, 25)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L), spec = structure(list(
cols = list(isoweek = structure(list(), class = c("collector_double",
"collector")), group = structure(list(), class = c("collector_character",
"collector")), n = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
In the data set, weeks with zero cases are simply not reported, so there are no NA values to work with.
I have identified a fix that fills in the missing week/group combinations with zeros:
flu %>% complete(isoweek, nesting(group), fill = list(n = 0))
My problem is that this only works for the weeks that appear in the data. For example, if no cases are reported at weeks 6, 7, 8, etc., those weeks are missing entirely.
How can I extend this fill-down process so that the data frame covers isoweeks 6 to 10 (for example), with a corresponding fluA and fluB row and a zero value for each isoweek/group pair?
You can expand multiple columns in complete. Say you need data up to week 8; you can do:
tidyr::complete(flu, isoweek = 1:8, group, fill = list(n = 0))
# A tibble: 16 x 3
# isoweek group n
# <dbl> <chr> <dbl>
# 1 1 fluA 5
# 2 1 fluB 6
# 3 2 fluA 3
# 4 2 fluB 5
# 5 3 fluA 12
# 6 3 fluB 14
# 7 4 fluA 6
# 8 4 fluB 0
# 9 5 fluA 23
#10 5 fluB 25
#11 6 fluA 0
#12 6 fluB 0
#13 7 fluA 0
#14 7 fluB 0
#15 8 fluA 0
#16 8 fluB 0
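If you would rather not hard-code the final week, one option (a minimal sketch, assuming you know the last week of the reporting period, here called last_week) is to build the week sequence inside complete with tidyr::full_seq:
library(tidyr)
last_week <- 10  # hypothetical end of the reporting period
flu %>%
  complete(isoweek = full_seq(c(isoweek, last_week), period = 1),
           group,
           fill = list(n = 0))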

Generate column id

I am working with log data, trying to find the round number of each event. The start of a round is signaled by action == "start". I want to create an "action.round" column that tells me which round each event corresponds to.
I have data such this:
library(readr)  # read_table2() comes from readr
data <- read_table2("Id action
A start
A na
A start
A na
A na
A na
A na
A start
B start
B na
B start
B na
B start
B na"
I am trying to create an output such as this:
output <- read_table2("Id action action.round
A start 1
A na 1
A start 2
A na 2
A na 2
A na 2
A na 2
A start 3
B start 1
B na 1
B start 2
B na 2
B start 3
B na 3")
So far, I have been able to get part of the output by using row_number(), like this:
data %>%
  mutate(round.start = case_when(action == "start" ~ "start", TRUE ~ "NA")) %>%
  ungroup() %>%
  group_by(Id, round.start) %>%
  mutate(action.round = row_number())
But now I would like to fill in the round number that corresponds to round.start == "start", so that I know which round number each row actually corresponds to (see desired output above).
You could use cumsum after grouping by Id.
library(dplyr)
data %>% group_by(Id) %>% mutate(action.round = cumsum(action == "start"))
# Id action action.round
# <chr> <chr> <int>
# 1 A start 1
# 2 A na 1
# 3 A start 2
# 4 A na 2
# 5 A na 2
# 6 A na 2
# 7 A na 2
# 8 A start 3
# 9 B start 1
#10 B na 1
#11 B start 2
#12 B na 2
#13 B start 3
#14 B na 3
This can be done in base R
data$action.round <- with(data, ave(action == "start", Id, FUN = cumsum))
and data.table as well
library(data.table)
setDT(data)[, action.round := cumsum(action == "start"), Id]
data
data <- structure(list(Id = c("A", "A", "A", "A", "A", "A", "A", "A",
"B", "B", "B", "B", "B", "B"), action = c("start", "na", "start",
"na", "na", "na", "na", "start", "start", "na", "start", "na",
"start", "na")), row.names = c(NA, -14L), spec = structure(list(
cols = list(Id = structure(list(), class = c("collector_character",
"collector")), action = structure(list(), class = c("collector_character",
"collector")), action.round = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))

Joining two dataframes to remove NaN values in the first dataframe

I would like to merge two dataframe columns.
I have df1, which has a specific column (df$col1). This column has rows 1-100, and certain rows have NA values (let's say rows 10, 15, 20, 50, 69).
Dataframe 2 has only rows 10, 15, 20, 50, 69.
Is it possible to merge df2 into df$col1 such that only the NA values in df$col1 are filled from df2, matching on the index number of each dataset?
I tried this, but instead got a dataframe that did not look anything like what I want:
merge(brfss2$pa1min_, df, by.x = 1, by.y = 1, all.x = TRUE, all.y = TRUE)
Here are the two dataframes
Dataframe1:
1 NA
2 110
3 NA
4 35
5 NA
6 120
7 280
8 30
9 240
10 260
11 322
12 NA
Dataframe 2:
1 2127.6
3 1403.0
5 198.0
12 112.8
A different method: I imported your data and gave the columns names:
df <- structure(list(col1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
), col2 = c(NA, 110, NA, 35, NA, 120, 280, 30, 240, 260, 322,
NA)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-12L), spec = structure(list(cols = list(col1 = structure(list(), class = c("collector_double",
"collector")), col2 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 2), class = "col_spec"))
df2 <- structure(list(col1 = c(1, 3, 5, 12), col2 = c(2127.6, 1403,
198, 112.8)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L), spec = structure(list(cols = list(
col1 = structure(list(), class = c("collector_double", "collector"
)), col2 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 2), class = "col_spec"))
Using the tidyverse, you can merge and then add a new column that takes whichever value is not NA:
library(tidyverse)
df %>%
  merge(df2, by = "col1", all.x = TRUE) %>%
  mutate(new_col = if_else(is.na(col2.x), col2.y, col2.x)) %>%
  select(new_col)
new_col
1 2127.6
2 110.0
3 1403.0
4 35.0
5 198.0
6 120.0
7 280.0
8 30.0
9 240.0
10 260.0
11 322.0
12 112.8
I wrote the package safejoin, which solves this very succinctly:
# devtools::install_github("moodymudskipper/safejoin")
safe_left_join(df1,df2, by = "col1", conflict = dplyr::coalesce)
# # A tibble: 12 x 2
# col1 col2
# <dbl> <dbl>
# 1 1 2128.
# 2 2 110
# 3 3 1403
# 4 4 35
# 5 5 198
# 6 6 120
# 7 7 280
# 8 8 30
# 9 9 240
# 10 10 260
# 11 11 322
# 12 12 113.
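If you prefer not to add a dependency, a minimal base R sketch (assuming, as in the structures above, that df holds the column with NAs as col2 and df2 holds the replacement values keyed by col1):
# patch only the NA entries of df$col2 with the matching values from df2
na_rows <- is.na(df$col2)
df$col2[na_rows] <- df2$col2[match(df$col1[na_rows], df2$col1)]
df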

Reorder, exclude a column and keep others in R?

Here is my toy dataframe:
df <- structure(list(a = c(1, 2), b = c(3, 4), c = c(5, 6), d = c(7,
8)), .Names = c("a", "b", "c", "d"), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
Now I want to reorder the columns, excluding one and keeping the others:
df %>% select(-a, d, everything())
I want my df to be:
d b c
7 3 5
8 4 6
I get the following:
b c d a
<dbl> <dbl> <dbl> <dbl>
1 3 5 7 1
2 4 6 8 2
Keep the -a last in the select. Even though we removed a at the beginning, the everything() at the end still picks up all the column names of the whole dataset, so a comes back.
df %>%
  select(d, everything(), -a)
# A tibble: 2 x 3
# d b c
# <dbl> <dbl> <dbl>
#1 7 3 5
#2 8 4 6
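Alternatively, on dplyr 1.0.0 or later, you can drop the column first and then move d to the front with relocate (a sketch producing the same result):
df %>%
  select(-a) %>%
  relocate(d)
# A tibble: 2 x 3
#       d     b     c
#   <dbl> <dbl> <dbl>
# 1     7     3     5
# 2     8     4     6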
