How to sum a variable based on factor? - r

Here is an example of my data:
Type <- c('A','A','A','A','B','B','C','D')
Name <- c('DK', 'MO', 'OM', 'LSO', 'GOP', 'ADG','BFC','TMD')
Value <- c(3,2,5,3,6,5,7,6)
Dat <- data.frame(Type, Name,Value)
Dat
Type Name Value
1 A DK 3
2 A MO 2
3 A OM 5
4 A LSO 3
5 B GOP 6
6 B ADG 5
7 C BFC 7
8 D TMD 6
What I'm trying to get is the sum of the value when Type=A. In this case, it is 13. I found some similar examples by applying dplyr, but I don't need the type nor the name. Please help and thank you!

Using dplyr you would use group_by to group each type or if you only want type A you could filter where Type == A. Then in both cases you would summarize by the sum of the value. I've shown both examples below.
library(dplyr)
Type <- c('A','A','A','A','B','B','C','D')
Name <- c('DK', 'MO', 'OM', 'LSO', 'GOP', 'ADG','BFC','TMD')
Value <- c(3,2,5,3,6,5,7,6)
Dat <- data.frame(Type, Name,Value)
Dat
res1 <- Dat %>%
group_by(Type) %>%
summarize(sum(Value))
res1
# Source: local data frame [4 x 2]
#
# Type sum(Value)
# (fctr) (dbl)
#1 A 13
#2 B 11
#3 C 7
#4 D 6
res2 <- Dat %>%
filter(Type == "A") %>%
summarize(sum(Value))
res2
# sum(Value)
#1 13

Related

Matching old and new column names in R [duplicate]

I have a tibble which has column names containing spaces & special characters which make it a hassle to work with. I want to change these column names to easier to use names while I'm working with the data, and then change them back to the original names at the end for display. Ideally, I want to be able to do this as part of a pipe, however I haven't figured out how to do it with rename_with().
Sample data:
df <- tibble(oldname1 = seq(1:10),
oldname2 = letters[seq(1:10)],
oldname3 = LETTERS[seq(1:10)])
cols_lookup <- tibble(old_names = c("oldname4", "oldname2", "oldname1"),
new_names = c("newname4", "newname2", "newname1"))
Desired output:
> head(df_renamed)
# A tibble: 6 x 3
newname1 newname2 oldname3
<int> <chr> <chr>
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
Some columns are removed & reordered during this work so when converting them back there will be entries in the cols_lookup table which are no longer in df. There are also new columns created in df which I want to remain named the same.
I am aware there are similar questions which have already been asked, however the answers either don't work well with tibbles or in a pipe (eg. those using match()), or don't work if the columns aren't all present in the same order in both tables.
We can use rename_at. From the master lookup table, filter the rows where the names of dataset have a match (filtered_lookup), then use that in rename_at where we specify the 'old_names' in vars and replace with the 'new_names'
library(dplyr)
filtered_lookup <- cols_lookup %>%
filter(old_names %in% names(df))
df %>%
rename_at(vars(filtered_lookup$old_names), ~ filtered_lookup$new_names)
Or using rename_with, use the same logic
df %>%
rename_with(.fn = ~filtered_lookup$new_names, .cols = filtered_lookup$old_names)
Or another option is rename with splicing (!!!) from a named vector
library(tibble)
df %>%
rename(!!! deframe(filtered_lookup[2:1]))
You can use rename_ with setnames
cols_lookup <- tibble(old_names = c("oldname3", "oldname2", "oldname1"),
new_names = c("newname3", "newname2", "newname1"))
df
rename_(df, .dots=setNames(cols_lookup$old_names, cols_lookup$new_names))
Output:
# A tibble: 10 x 3
newname1 newname2 newname3
<int> <chr> <chr>
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
7 7 g G
8 8 h H
9 9 i I
10 10 j J

Rename columns using values from separate dataframe

I have a tibble which has column names containing spaces & special characters which make it a hassle to work with. I want to change these column names to easier to use names while I'm working with the data, and then change them back to the original names at the end for display. Ideally, I want to be able to do this as part of a pipe, however I haven't figured out how to do it with rename_with().
Sample data:
df <- tibble(oldname1 = seq(1:10),
oldname2 = letters[seq(1:10)],
oldname3 = LETTERS[seq(1:10)])
cols_lookup <- tibble(old_names = c("oldname4", "oldname2", "oldname1"),
new_names = c("newname4", "newname2", "newname1"))
Desired output:
> head(df_renamed)
# A tibble: 6 x 3
newname1 newname2 oldname3
<int> <chr> <chr>
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
Some columns are removed & reordered during this work so when converting them back there will be entries in the cols_lookup table which are no longer in df. There are also new columns created in df which I want to remain named the same.
I am aware there are similar questions which have already been asked, however the answers either don't work well with tibbles or in a pipe (eg. those using match()), or don't work if the columns aren't all present in the same order in both tables.
We can use rename_at. From the master lookup table, filter the rows where the names of dataset have a match (filtered_lookup), then use that in rename_at where we specify the 'old_names' in vars and replace with the 'new_names'
library(dplyr)
filtered_lookup <- cols_lookup %>%
filter(old_names %in% names(df))
df %>%
rename_at(vars(filtered_lookup$old_names), ~ filtered_lookup$new_names)
Or using rename_with, use the same logic
df %>%
rename_with(.fn = ~filtered_lookup$new_names, .cols = filtered_lookup$old_names)
Or another option is rename with splicing (!!!) from a named vector
library(tibble)
df %>%
rename(!!! deframe(filtered_lookup[2:1]))
You can use rename_ with setnames
cols_lookup <- tibble(old_names = c("oldname3", "oldname2", "oldname1"),
new_names = c("newname3", "newname2", "newname1"))
df
rename_(df, .dots=setNames(cols_lookup$old_names, cols_lookup$new_names))
Output:
# A tibble: 10 x 3
newname1 newname2 newname3
<int> <chr> <chr>
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
7 7 g G
8 8 h H
9 9 i I
10 10 j J

Check if values of one dataframe exist in another dataframe in exact order

I have 1 dataframe of data and multiple "reference" dataframes. I'm trying to automate checking if values of the dataframe match the values of the reference dataframes. Importantly, the values must also be in the same order as the values in the reference dataframes. These columns are of the columns of importance, but my real dataset contains many more columns.
Below is a toy dataset.
Dataframe
group type value
1 A Teddy
1 A William
1 A Lars
2 B Dolores
2 B Elsie
2 C Maeve
2 C Charlotte
2 C Bernard
Reference_A
type value
A Teddy
A William
A Lars
Reference_B
type value
B Elsie
B Dolores
Reference_C
type value
C Maeve
C Hale
C Bernard
For example, in the toy dataset, group1 would score 1.0 (100% correct) because all its values in A match the values and order of values of An in reference_A. However, group2 would score 0.0 because the values in B are out of order compared to reference_B and 0.66 because 2/3 values in C match the values and order of values in reference_C.
Desired output
group type score
1 A 1.0
2 B 0.0
2 C 0.66
This was helpful, but does not take order into account:
Check whether values in one data frame column exist in a second data frame
Update: Thank you to everyone that has provided solutions! These solutions are great for the toy dataset, but have not yet been adaptable to datasets with more columns. Again, like I wrote in my post, the columns that I've listed above are of importance — I'd prefer to not drop the unneeded columns if necessary.
We may also do this with mget to return a list of data.frames, bind them together, and do a group by mean of logical vector
library(dplyr)
mget(ls(pattern = '^Reference_[A-Z]$')) %>%
bind_rows() %>%
bind_cols(df1) %>%
group_by(group, type = type...1) %>%
summarise(score = mean(value...2 == value...5))
# Groups: group [2]
# group type score
# <int> <chr> <dbl>
#1 1 A 1
#2 2 B 0
#3 2 C 0.667
This is another tidyverse solution. Here, I am adding a counter (i.e. rowname) to both reference and data. Then I join them together on type and rowname. At the end, I summarize them on type to get the desired output.
library(dplyr)
library(purrr)
library(tibble)
list(`Reference A`, `Reference B`, `Reference C`) %>%
map(., rownames_to_column) %>%
bind_rows %>%
left_join({Dataframe %>%
group_split(type) %>%
map(., rownames_to_column) %>%
bind_rows},
. , by=c("type", "rowname")) %>%
group_by(type) %>%
dplyr::summarise(group = head(group,1),
score = sum(value.x == value.y)/n())
#> # A tibble: 3 x 3
#> type group score
#> <chr> <int> <dbl>
#> 1 A 1 1
#> 2 B 2 0
#> 3 C 2 0.667
Here's a "tidy" method:
library(dplyr)
# library(purrr) # map2_dbl
Reference <- bind_rows(Reference_A, Reference_B, Reference_C) %>%
nest_by(type, .key = "ref") %>%
ungroup()
Reference
# # A tibble: 3 x 2
# type ref
# <chr> <list<tbl_df[,1]>>
# 1 A [3 x 1]
# 2 B [2 x 1]
# 3 C [3 x 1]
Dataframe %>%
nest_by(group, type, .key = "data") %>%
left_join(Reference, by = "type") %>%
mutate(
score = purrr::map2_dbl(data, ref, ~ {
if (length(.x) == 0 || length(.y) == 0) return(numeric(0))
if (length(.x) != length(.y)) return(0)
sum((is.na(.x) & is.na(.y)) | .x == .y) / length(.x)
})
) %>%
select(-data, -ref) %>%
ungroup()
# # A tibble: 3 x 3
# group type score
# <int> <chr> <dbl>
# 1 1 A 1
# 2 2 B 0
# 3 2 C 0.667

Tidyeval: apply function to data frames extracted from list

This is a simplified version of a problem involving a large list containing complex tables. I want to extract the tables from the list and apply a function to each one. Here we can create a simple list containing small named data frames:
library(tidyverse)
table_names <- c('dfA', 'dfB', 'dfC')
dfA <- tibble(a = 1:3, b = 4:6, c = 7:9)
dfB <- tibble(a = 10:12, b = 13:15, c = 16:18)
dfC <- tibble(a = 19:21, b = 22:24, c = 25:27)
df_list <- list(dfA, dfB, dfC) %>% setNames(table_names)
Here is a simplified example of the kind of operation I would like to apply:
dfA_mod <- df_list$dfA %>%
mutate(name = 'dfA') %>%
select(name, everything())
In this example, I extract one of three tables in the list df_list$dfA, create a new column with the same value in each row mutate(name = 'dfA'), and re-order the columns so that the new column appears in the left-most position select(name, everything()). The resulting object is assigned to dfA_mod.
To solve the larger problem, I want to use one of the purrr::map() variants to apply the function over the character vector table_names, which was initiated in the first block of code above. The elements of table_names serve two purposes: 1) naming the tables held in the list; and 2) supplying values for the name column in the modified table.
I could write a function such as:
fun <- function(x) {
df_list$x %>%
mutate(name = x) %>%
select(name, everything()) %>%
assign(paste0(x, '_mod'), ., envir = .GlobalEnv)
}
And then use map() to create a new list of modified tables:
new_list <- df_list %>% map(table_name, fun(x))
But of course this code does not work, with the main obstacle being (for me at least) figuring out how to quote and unquote the right terms within the function. I'm a beginner at tidy evaluation, and I could use some help in specifying the function and using map properly.
Here is the desired output (for one modified table):
# A tibble: 3 x 4
name a b c
<chr> <int> <int> <int>
1 dfA 1 4 7
2 dfA 2 5 8
3 dfA 3 6 9
Thanks in advance for any help!
We can use purrr::imap which passes data in the list as well as name of the list
library(dplyr)
library(purrr)
df_out <- imap(df_list, ~.x %>% mutate(name = .y) %>% select(name, everything()))
df_out
#$dfA
# A tibble: 3 x 4
# name a b c
# <chr> <int> <int> <int>
#1 dfA 1 4 7
#2 dfA 2 5 8
#3 dfA 3 6 9
#$dfB
# A tibble: 3 x 4
# name a b c
# <chr> <int> <int> <int>
#1 dfB 10 13 16
#....
#....
This gives a list of desired dataframes, if you want them as separate dataframes, you can do
names(df_out) <- paste0(names(df_out), "_mod")
list2env(df_out, .GlobalEnv)
We can also do it using base R Map
df_out <- Map(function(x, y) transform(x, name = y)[c('name', names(x))],
df_list, names(df_list))
and give list names same as above.
We can convert it to a single data.frame with map while passing the .id
library(purrr)
map_dfr(df_list, I, .id = 'name')
Or with bind_rows
library(dplyr)
bind_rows(df_list, .id = 'name')
# A tibble: 9 x 4
# name a b c
# <chr> <int> <int> <int>
#1 dfA 1 4 7
#2 dfA 2 5 8
#3 dfA 3 6 9
#4 dfB 10 13 16
#5 dfB 11 14 17
#6 dfB 12 15 18
#7 dfC 19 22 25
#8 dfC 20 23 26
#9 dfC 21 24 27

R: aggregate by all factor levels (present and not present)

I can aggregate a data.frame trivially with dplyr with the following:
z <- data.frame(a = rnorm(20), b = rep(letters[1:4], each = 5))
library(dplyr)
z %>%
group_by(b) %>%
summarise(out = n())
Source: local data frame [4 x 2]
b out
(fctr) (int)
1 a 5
2 b 5
3 c 5
4 d 5
However, sometimes a dataset may be missing a factor. In which case I would like the output to be 0.
For example, let's say the typical dataset should have 5 groups.
z$b <- factor(z$b, levels = letters[1:5])
But clearly there aren't any in this particular but could be in another. How can I aggregate this data so the length for missing factors is 0.
Desired output:
Source: local data frame [4 x 2]
b out
(fctr) (int)
1 a 5
2 b 5
3 c 5
4 d 5
5 e 0
One way to approach this is to use complete from "tidyr". You have to use mutate first to factor column "b":
library(dplyr)
library(tidyr)
z %>%
mutate(b = factor(b, letters[1:5])) %>%
group_by(b) %>%
summarise(out = n()) %>%
complete(b, fill = list(out = 0))
# Source: local data frame [5 x 2]
#
# b out
# (fctr) (dbl)
# 1 a 5
# 2 b 5
# 3 c 5
# 4 d 5
# 5 e 0
A workaround is to join with a table containing all levels:
z <- full_join(z, data.frame(b=levels(z$b))
This will set all the missing rows for your analysis variables to NA, which in the general case would make more sense than setting them to zero. You can change them to zero if necessary with z[is.na(z)] <- 0.
You could use xtabs:
xtabs(a ~ b, z)
This aggregates z$b rather than just counting levels in z$a as in your example, but that's easily achieved with table:
table(z$a)

Resources