Set name of variable defined in formula - r

This just popped into my head.
Let's take this example from a recent question:
data:
df1<-
structure(list(Year = c(2015L, 2015L, 2015L, 2015L, 2016L, 2016L,
2016L, 2016L), Category = c("a", "1", "2", "3", "1", "2", "3",
"1"), Value = c(2L, 3L, 2L, 1L, 7L, 2L, 1L, 1L)), row.names = c(NA,
-8L), class = "data.frame")
code:
aggregate( Value ~ Year + c(MY_NAME = c("OneTwo", "three")[Category %in% 1:2 + 1]), data=df1, FUN=sum )
current output (note the long, ugly name of the new variable):
# Year c(MY_NAME = c("OneTwo", "three")[Category %in% 1:2 + 1]) Value
#1 2015 OneTwo 3
#2 2016 OneTwo 1
#3 2015 three 5
#4 2016 three 10
desired output:
# Year MY_NAME Value
#1 2015 OneTwo 3
#2 2016 OneTwo 1
#3 2015 three 5
#4 2016 three 10
please note:
One could (possibly should) declare a new variable.
This question is about how to set the name of the new variable DIRECTLY, by adding code to the one-liner in the code: section.

Instead of c, we need cbind, which results in a one-column matrix with the column name 'MY_NAME', whereas c returns a named vector whose element names are uniquified variants of 'MY_NAME' ('MY_NAME1', 'MY_NAME2', ...).
aggregate( Value ~ Year +
cbind(MY_NAME = c("OneTwo", "three")[Category %in% 1:2 + 1]), data=df1, FUN=sum )
# Year MY_NAME Value
#1 2015 OneTwo 3
#2 2016 OneTwo 1
#3 2015 three 5
#4 2016 three 10
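A quick illustration of the naming difference, outside of aggregate (a minimal sketch):
x <- c("OneTwo", "three")
c(MY_NAME = x)      # named vector; the names become MY_NAME1, MY_NAME2
cbind(MY_NAME = x)  # 2 x 1 character matrix with column name MY_NAME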
The ?aggregate help page mentions the use of cbind in the formula method:
formula - a formula, such as y ~ x or cbind(y1, y2) ~ x1 + x2, where
the y variables are numeric data to be split into groups according to
the grouping x variables (usually factors).
An option with the tidyverse would be:
library(dplyr)
df1 %>%
  group_by(Year, MY_NAME = c("OneTwo", "three")[Category %in% 1:2 + 1]) %>%
  summarise(Value = sum(Value))
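If you are on dplyr 1.0 or later (an assumption about your setup), you can also drop the grouping explicitly, which silences the regrouping message:
df1 %>%
  group_by(Year, MY_NAME = c("OneTwo", "three")[Category %in% 1:2 + 1]) %>%
  summarise(Value = sum(Value), .groups = "drop")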

1) aggregate.data.frame Use aggregate.data.frame rather than aggregate.formula:
by <- with(df1,
  list(
    Year = Year,
    MY_NAME = c("OneTwo", "three")[Category %in% 1:2 + 1]
  )
)
aggregate(df1["Value"], by, FUN = sum)
giving:
Year MY_NAME Value
1 2015 OneTwo 3
2 2016 OneTwo 1
3 2015 three 5
4 2016 three 10
2) two-step It might be a bit cleaner to split this into two parts: (1) create a new data frame in which Category is transformed, and (2) perform the aggregate.
df2 <- transform(df1, MY_NAME = c("OneTwo", "three")[Category %in% 1:2 + 1])
aggregate(Value ~ Year + MY_NAME, df2, sum)
2a) Or, expressing (2) in terms of a magrittr pipeline:
library(magrittr)
df1 %>%
  transform(MY_NAME = c("OneTwo", "three")[Category %in% 1:2 + 1]) %>%
  aggregate(Value ~ Year + MY_NAME, ., sum)

Related

Finding minimum by groups and among columns

I am trying to find the minimum value among different columns and group.
A small sample of my data looks something like this:
group cut group_score_1 group_score_2
1 a 1 3 5.0
2 b 2 2 4.0
3 a 0 2 2.5
4 b 3 5 4.0
5 a 2 3 6.0
6 b 1 5 1.0
I want to group by group and, for each group, find the row containing the minimum score across both group_score columns, and also get the name of the column that contains that minimum (group_score_1 or group_score_2).
So basically my result should be something like this:
group cut group_score_1 group_score_2
1 a 0 2 2.5
2 b 1 5 1.0
I tried a few ideas and eventually ended up splitting the data into several new data frames, filtering by group, selecting the relevant columns, and then using which.min(), but I'm sure there's a much more efficient way to do it. Not sure what I am missing.
We can use data.table methods
library(data.table)
setDT(df)[df[, .I[which.min(do.call(pmin, .SD))],
             group, .SDcols = patterns('^group_score')]$V1]
# group cut group_score_1 group_score_2
#1: a 0 2 2.5
#2: b 1 5 1.0
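To see what do.call(pmin, .SD) computes, here is the same row-wise minimum on the sample data (a minimal sketch):
do.call(pmin, df[, c("group_score_1", "group_score_2")])
#[1] 3 2 2 4 3 1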
For each group, you can calculate the minimum value and keep the rows where that value appears in one of the columns.
library(dplyr)
df %>%
  group_by(group) %>%
  filter({tmp = min(group_score_1, group_score_2);
          group_score_1 == tmp | group_score_2 == tmp})
# group cut group_score_1 group_score_2
# <chr> <int> <int> <dbl>
#1 a 0 2 2.5
#2 b 1 5 1
The above works well when you have only two group_score columns. If you have many such columns, it is impractical to list each of them as group_score_1 == tmp | group_score_2 == tmp and so on. In that case, get the data in long format, find the cut value corresponding to the minimum value, and join back to the original data. This assumes cut is unique within each group.
df %>%
  tidyr::pivot_longer(cols = starts_with('group_score')) %>%
  group_by(group) %>%
  summarise(cut = cut[which.min(value)]) %>%
  left_join(df, by = c("group", "cut"))
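If cut is not unique within a group, a sketch that avoids the join (assuming dplyr >= 1.0 for slice_min) computes the row minimum first and keeps one row per group:
library(dplyr)
df %>%
  mutate(row_min = do.call(pmin, select(., starts_with("group_score")))) %>%
  group_by(group) %>%
  slice_min(row_min, n = 1, with_ties = FALSE) %>%
  select(-row_min)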
Here is a base R option using pmin + ave + subset
subset(
  df,
  as.logical(ave(
    do.call(pmin, df[grep("group_score_\\d+", names(df))]),
    group,
    FUN = function(x) x == min(x)
  ))
)
which gives
group cut group_score_1 group_score_2
3 a 0 2 2.5
6 b 1 5 1.0
Data
> dput(df)
structure(list(group = c("a", "b", "a", "b", "a", "b"), cut = c(1L,
2L, 0L, 3L, 2L, 1L), group_score_1 = c(3L, 2L, 2L, 5L, 3L, 5L
), group_score_2 = c(5, 4, 2.5, 4, 6, 1)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

Count number of occurrences for every column in dataframe

I have a dataframe with an unknown number of columns (it can change frequently), and for every column I need to count the number of observations for a given ID and Year, creating a custom "n" column that tells me how many observations were made for that specific column.
I have tried:
library(dplyr)
count <- tally(group_by(final_database,ID,Year))
But that counts unique combinations of ID + Year, while I need to know how many times each ID was observed over the years for each characteristic. Example:
ID Year CHAR1 n_CHAR1
A  2016     0       3
A  2017     5       3
A  2018     2       3
A  2019             3
B  2016     1       2
B  2017             2
B  2018             2
B  2019     1       2
And so on for all characteristics. I would insert the "n_CHAR" columns into the original dataframe.
It doesn't need to be tidy.
Thanks!
Try:
transform(final_database, n_CHAR1 = ave(CHAR1, ID, FUN = function(x) sum(x != "")))
If the blank rows are actually NA, then just replace sum(x != "") with sum(!is.na(x)).
Edit:
If you need multiple n columns for multiple CHAR columns, you could do:
library(dplyr)
final_database %>%
  group_by(ID) %>%
  mutate_at(vars(starts_with("CHAR")),
            list(n = ~ sum(. != "")))
This example assumes that all the relevant columns start with the string CHAR (e.g. CHAR1, CHAR2, CHAR3, etc.).
If the columns you're referring to run from the 3rd column to the last, then you can do:
library(dplyr)
finalDatabase <- final_database %>%
  group_by(ID) %>%
  mutate_at(vars(3:ncol(.)),  # if there are few other vars, vars(-ID, -Year) also works, as suggested by @camille
            list(n = ~ sum(. != ""))) %>%
  select(ID, Year, ends_with("_n"))
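mutate_at is superseded in current dplyr; assuming dplyr 1.0 or later, the same idea with across() can also name the new columns n_CHAR1, n_CHAR2, ... directly:
library(dplyr)
final_database %>%
  group_by(ID) %>%
  mutate(across(starts_with("CHAR"), ~ sum(.x != ""), .names = "n_{.col}"))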
We can also do this with data.table:
library(data.table)
setDT(df)[, n_CHAR1 := sum(CHAR1 != ""), by = "ID"]
Output:
   ID Year CHAR1 n_CHAR1
1:  A 2016     0       3
2:  A 2017     5       3
3:  A 2018     2       3
4:  A 2019             3
5:  B 2016     1       2
6:  B 2017             2
7:  B 2018             2
8:  B 2019     1       2
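If there are several CHAR columns, a data.table sketch of the same idea (column names assumed to start with CHAR) loops over .SD:
library(data.table)
cols <- grep("^CHAR", names(df), value = TRUE)
setDT(df)[, paste0("n_", cols) := lapply(.SD, function(x) sum(x != "")),
          by = ID, .SDcols = cols]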
Data:
df <- structure(list(ID = c("A", "A", "A", "A", "B", "B", "B", "B"),
Year = c(2016L, 2017L, 2018L, 2019L, 2016L, 2017L, 2018L,
2019L), CHAR1 = c("0", "5", "2", "", "1", "", "", "1")), row.names = c(NA,
-8L), class = "data.frame")

Count # of IDs that meet both criteria

I have a dataset that has two columns. One is userid, the other is company type, like below:
userid company.type
1 A
2 A
3 C
1 B
2 B
3 B
4 A
I want to know how many unique userids there are with company.type of A and B, or A and C (but not B and C).
I'm assuming it's some sort of aggregate function, but I'm not sure how to place the qualifier that company.type has to be A and B or A and C only.
We can do this with base R using table
tbl <- table(df) > 0
sum(((tbl[, 1] & tbl[, 2]) | (tbl[, 1] & tbl[, 3])) & !(tbl[, 2] & tbl[, 3]))
#[1] 2
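For reference, the intermediate logical matrix tbl has one row per userid and one column per company.type:
tbl
#       company.type
# userid     A     B     C
#      1  TRUE  TRUE FALSE
#      2  TRUE  TRUE FALSE
#      3 FALSE  TRUE  TRUE
#      4  TRUE FALSE FALSE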
Here's an idea with dplyr. setequal checks if two vectors are composed of the same elements, regardless of ordering:
library(dplyr)
df %>%
  group_by(userid) %>%
  summarize(temp = setequal(company.type, c("A", "B")) |
                   setequal(company.type, c("A", "C"))) %>%
  pull(temp) %>%
  sum()
# [1] 2
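The same setequal logic in base R, splitting company.type by userid (a minimal sketch):
sum(sapply(split(df$company.type, df$userid),
           function(x) setequal(x, c("A", "B")) || setequal(x, c("A", "C"))))
# [1] 2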
Data:
df <- structure(list(userid = c(1L, 2L, 3L, 1L, 2L, 3L, 4L), company.type = c("A",
"A", "C", "B", "B", "B", "A")), .Names = c("userid", "company.type"
), class = "data.frame", row.names = c(NA, -7L))
See: Check whether two vectors contain the same (unordered) elements in R
Sort DF and reduce it to one row per userid with a types column consisting of a comma-separated string of company types. Then filter it using the indicated condition. Finally use tally to get the number of rows left after filtering. To get the details omit the tally line.
library(dplyr)
DF %>%
  arrange(userid, company.type) %>%
  group_by(userid) %>%
  summarize(types = toString(company.type)) %>%
  ungroup %>%
  filter(grepl("A.*B|A.*C", types) & !grepl("B.*C", types)) %>%
  tally
giving:
# A tibble: 1 x 1
n
<int>
1 2
Note
The input used, in reproducible form, is:
Lines <- "userid company.type
1 A
2 A
3 C
1 B
2 B
3 B
4 A"
DF <- read.table(text = Lines, header = TRUE)

How to sanitize a df according to specific variable values?

I have two data frames. dfOne is made like this:
X Y Z T J
3 4 5 6 1
1 2 3 4 1
5 1 2 5 1
and dfTwo is made like this
C.1 C.2
X Z
Y T
I want to obtain a new dataframe keeping only the rows where the X, Y, Z, T values are simultaneously greater than specific thresholds.
Example. I need simultaneously (in the same row):
X, Y > 2
Z, T > 4
I need to use the second data frame to reach my objective; I expect something like:
dfTwo$C.1>2
so the result would be a new dataframe with this structure:
X Y Z T J
3 4 5 6 1
How could I do it?
Here is a base R method with Map and Reduce.
# build lookup table of thresholds relative to variable name
vals <- setNames(c(2, 2, 4, 4), unlist(dat2))
# subset data.frame
dat[Reduce("&", Map(">", dat[names(vals)], vals)), ]
X Y Z T J
1 3 4 5 6 1
Here, Map returns a list of length 4 with logical variables corresponding to each comparison. This list is passed to Reduce which returns a single logical vector with length corresponding to the number of rows in the data.frame, dat. This logical vector is used to subset dat.
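To make the intermediate objects concrete:
str(Map(">", dat[names(vals)], vals))
# List of 4
#  $ X: logi [1:3] TRUE FALSE TRUE
#  $ Y: logi [1:3] TRUE FALSE FALSE
#  $ Z: logi [1:3] TRUE FALSE FALSE
#  $ T: logi [1:3] TRUE FALSE TRUE
Reduce("&", Map(">", dat[names(vals)], vals))
# [1]  TRUE FALSE FALSE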
data
dat <-
structure(list(X = c(3L, 1L, 5L), Y = c(4L, 2L, 1L), Z = c(5L,
3L, 2L), T = c(6L, 4L, 5L), J = c(1L, 1L, 1L)), .Names = c("X",
"Y", "Z", "T", "J"), class = "data.frame", row.names = c(NA,
-3L))
dat2 <-
structure(list(C.1 = structure(1:2, .Label = c("X", "Y"), class = "factor"),
C.2 = structure(c(2L, 1L), .Label = c("T", "Z"), class = "factor")), .Names = c("C.1",
"C.2"), class = "data.frame", row.names = c(NA, -2L))
We can use the purrr package
Here is the input data.
# Data frame from lmo's solution
dat <-
structure(list(X = c(3L, 1L, 5L), Y = c(4L, 2L, 1L), Z = c(5L,
3L, 2L), T = c(6L, 4L, 5L), J = c(1L, 1L, 1L)), .Names = c("X",
"Y", "Z", "T", "J"), class = "data.frame", row.names = c(NA,
-3L))
# A numeric vector to show the threshold values
# Notice that columns without any requirements need NA
vals <- c(X = 2, Y = 2, Z = 4, T = 4, J = NA)
Here is the implementation
library(purrr)
map2_dfc(dat, vals, ~ifelse(.x > .y | is.na(.y), .x, NA)) %>% na.omit()
# A tibble: 1 x 5
X Y Z T J
<int> <int> <int> <int> <int>
1 3 4 5 6 1
map2_dfc loops over each column in dat and the corresponding value in vals in parallel, applying the supplied function. ~ifelse(.x > .y | is.na(.y), .x, NA) means: if the number in the column is larger than the corresponding value in vals, or the vals entry is NA, keep the original value from the column; otherwise, replace it with NA. The output of map2_dfc(dat, vals, ~ifelse(.x > .y | is.na(.y), .x, NA)) is a data frame with NA values in the rows where the condition is not met. Finally, na.omit removes those rows.
Update
Here I demonstrate how to convert the dfTwo dataframe to the vals vector in my example.
First, let's create the dfTwo data frame.
dfTwo <- read.table(text = "C.1 C.2
X Z
Y T",
header = TRUE, stringsAsFactors = FALSE)
dfTwo
C.1 C.2
1 X Z
2 Y T
To complete the task, I load the dplyr and tidyr packages.
library(dplyr)
library(tidyr)
Now I begin the transformation of dfTwo. The first step is to use the stack function to convert the format.
dfTwo2 <- dfTwo %>%
  stack() %>%
  setNames(c("Col", "Group")) %>%
  mutate(Group = as.character(Group))
dfTwo2
Col Group
1 X C.1
2 Y C.1
3 Z C.2
4 T C.2
The second step is to add the threshold information. One way to do this is to create a look-up table showing the association between Group and Value
threshold_df <- data.frame(Group = c("C.1", "C.2"),
                           Value = c(2, 4),
                           stringsAsFactors = FALSE)
threshold_df
Group Value
1 C.1 2
2 C.2 4
And then we can use the left_join function to combine the data frames.
dfTwo3 <- dfTwo2 %>% left_join(threshold_df, by = "Group")
dfTwo3
Col Group Value
1 X C.1 2
2 Y C.1 2
3 Z C.2 4
4 T C.2 4
The third step: notice that there is a column called J which does not need any threshold, so we need to add this information to dfTwo3. We can use the complete function from tidyr. The following code completes the data frame by adding any Col present in dat but not in dfTwo3, with NA as the Value.
dfTwo4 <- dfTwo3 %>% complete(Col = colnames(dat))
dfTwo4
# A tibble: 5 x 3
Col Group Value
<chr> <chr> <dbl>
1 J <NA> NA
2 T C.2 4
3 X C.1 2
4 Y C.1 2
5 Z C.2 4
The fourth step is to arrange dfTwo4 in the right order. We can achieve this by turning Col into a factor whose levels follow the order of the column names in dat.
dfTwo5 <- dfTwo4 %>%
  mutate(Col = factor(Col, levels = colnames(dat))) %>%
  arrange(Col) %>%
  mutate(Col = as.character(Col))
dfTwo5
# A tibble: 5 x 3
Col Group Value
<chr> <chr> <dbl>
1 X C.1 2
2 Y C.1 2
3 Z C.2 4
4 T C.2 4
5 J <NA> NA
We are almost there. Now we can create vals from dfTwo5.
vals <- dfTwo5$Value
names(vals) <- dfTwo5$Col
vals
X Y Z T J
2 2 4 4 NA
Now we are ready to use the purrr package to filter the data.
The above is a breakdown of the steps. We can combine them into the following code for simplicity.
library(dplyr)
library(tidyr)
threshold_df <- data.frame(Group = c("C.1", "C.2"),
                           Value = c(2, 4),
                           stringsAsFactors = FALSE)
dfTwo2 <- dfTwo %>%
  stack() %>%
  setNames(c("Col", "Group")) %>%
  mutate(Group = as.character(Group)) %>%
  left_join(threshold_df, by = "Group") %>%
  complete(Col = colnames(dat)) %>%
  mutate(Col = factor(Col, levels = colnames(dat))) %>%
  arrange(Col) %>%
  mutate(Col = as.character(Col))
vals <- dfTwo2$Value
names(vals) <- dfTwo2$Col
Another base R option hard-codes each threshold and intersects the row indices that satisfy each condition:
dfOne[Reduce(intersect, list(which(dfOne["X"] > 2),
                             which(dfOne["Y"] > 2),
                             which(dfOne["Z"] > 4),
                             which(dfOne["T"] > 4))), ]
# X Y Z T J
#1 3 4 5 6 1
Or iteratively (so fewer inequalities are tested):
vals = c(X = 2, Y = 2, Z = 4, T = 4) # from @lmo's answer
dfOne[Reduce(intersect, lapply(names(vals), function(x) which(dfOne[x] > vals[x]))),]
# X Y Z T J
#1 3 4 5 6 1
I'm writing this assuming that the second DF is meant to categorize the fields in the first DF. It's way simpler if you don't need to use the second one to define the conditions:
dfNew = dfOne[dfOne$X > 2 & dfOne$Y > 2 & dfOne$Z > 4 & dfOne$T > 4, ]
Or, using dplyr:
library(dplyr)
dfNew = dfOne %>% filter(X > 2 & Y > 2 & Z > 4 & T > 4)
In case that's all you need, I'll save this comment while I poke at the more complicated version of the question.

Combine result from top_n with an "Other" category in dplyr

I have a data frame dat1
Country Count
1 AUS 1
2 NZ 2
3 NZ 1
4 USA 3
5 AUS 1
6 IND 2
7 AUS 4
8 USA 2
9 JPN 5
10 CN 2
First I want to sum "Count" per "Country". Then the top 3 total counts per country should be combined with an additional row "Others", which is the sum of countries which are not part of top 3.
The expected outcome therefore would be:
Country Count
1 AUS 6
2 JPN 5
3 USA 5
4 Others 7
I have tried the below code, but could not figure out how to place the "Others" row.
dat1 %>%
  group_by(Country) %>%
  summarise(Count = sum(Count)) %>%
  arrange(desc(Count)) %>%
  top_n(3)
This code currently gives:
Country Count
1 AUS 6
2 JPN 5
3 USA 5
Any help would be greatly appreciated.
dat1 <- structure(list(Country = structure(c(1L, 5L, 5L, 6L, 1L, 3L,
1L, 6L, 4L, 2L), .Label = c("AUS", "CN", "IND", "JPN", "NZ",
"USA"), class = "factor"), Count = c(1L, 2L, 1L, 3L, 1L, 2L,
4L, 2L, 5L, 2L)), .Names = c("Country", "Count"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
Instead of top_n, this seems like a good case for the convenience function tally. It uses summarise, sum and arrange under the hood.
Then use factor to create an "Other" category. Use the levels argument to set "Other" as the last level, so that "Other" is placed last in the table (and in any subsequent plot of the result).
If "Country" is factor in your original data, you may wrap Country[1:3] in as.character.
group_by(dat1, Country) %>%
  tally(Count, sort = TRUE) %>%
  group_by(Country = factor(c(Country[1:3], rep("Other", n() - 3)),
                            levels = c(Country[1:3], "Other"))) %>%
  tally(n)
# Country n
# (fctr) (int)
#1 AUS 6
#2 JPN 5
#3 USA 5
#4 Other 7
You can use fct_lump from the forcats library:
library(dplyr)
library(forcats)
dat1 %>%
  group_by(fct_lump(Country, n = 3, w = Count)) %>%
  summarize(Count = sum(Count))
This should do it. You can also change the "Other" label using the other_level parameter inside fct_lump.
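For instance, to label the lumped level "Others" (a sketch reusing dat1 from above):
dat1 %>%
  group_by(Country = fct_lump(Country, n = 3, w = Count, other_level = "Others")) %>%
  summarize(Count = sum(Count))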
We could do this in two steps: first create a sorted data frame, and then rbind the top three rows with a summary of the remaining rows:
d <- dat1 %>% group_by(Country) %>% summarise(Count = sum(Count)) %>% arrange(desc(Count))
rbind(top_n(d, 3),
      slice(d, 4:n()) %>% summarise(Country = "other", Count = sum(Count)))
output
Country Count
(fctr) (int)
1 AUS 6
2 JPN 5
3 USA 5
4 other 7
Here is an option using data.table. We convert the 'data.frame' to a 'data.table' (setDT(dat1)); grouped by 'Country', we get the sum of 'Count', then order by 'Count' and rbind the first three observations with a list containing 'Others' and the sum of 'Count' for the remaining observations.
library(data.table)
setDT(dat1)[, list(Count = sum(Count)), Country][order(-Count),
  rbind(.SD[1:3], list(Country = 'Others', Count = sum(.SD[[2]][4:.N])))]
# Country Count
#1: AUS 6
#2: USA 5
#3: JPN 5
#4: Others 7
Or using base R
d1 <- aggregate(. ~ Country, dat1, FUN = sum)
i1 <- order(-d1$Count)
rbind(d1[i1, ][1:3, ],
      data.frame(Country = 'Others', Count = sum(d1$Count[i1][4:nrow(d1)])))
You could even use xtabs() and manipulate the result. This is a base R answer.
s <- sort(xtabs(Count ~ ., dat1), decreasing = TRUE)
setNames(
  as.data.frame(as.table(c(head(s, 3), Others = sum(tail(s, -3))))),
  names(dat1)
)
# Country Count
# 1 AUS 6
# 2 JPN 5
# 3 USA 5
# 4 Others 7
A function some might find useful:
library(dplyr)  # provides %>% (and mutate, used below)
top_cases = function(v, top, other = 'other'){
  cv = class(v)
  v = as.character(v)
  v[factor(v, levels = top) %>% is.na()] = other
  if(cv == 'factor') v = factor(v, levels = c(top, other))
  v
}
For example:
> table(state.region)
state.region
Northeast South North Central West
9 16 12 13
> top_cases(state.region, c('South','West'), 'North') %>% table()
.
South West North
16 13 21
iris %>% mutate(Species = top_cases(Species, c('setosa','versicolor')))
For those interested in lumping every category below some percentage share into an 'other' category, here's some code.
Here, any values less than 5% go into the 'other' category; the 'other' row sums their share, and its label records how many categories were aggregated into it.
othernum <- nrow(sub[(sub$value < .05), ])
sub <- subset(sub, value > .05)
toplot <- rbind(sub, c(paste("Other (", othernum, " types)", sep = ""), 1 - sum(sub$value)))
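A self-contained sketch of the same idea, with a hypothetical props data frame of category shares (here the value column stays numeric by rbinding a one-row data frame instead of a character vector):
props <- data.frame(category = c("a", "b", "c", "d", "e"),
                    value = c(0.40, 0.30, 0.20, 0.06, 0.04))
othernum <- nrow(props[props$value < .05, ])   # how many categories get lumped
keep <- subset(props, value > .05)
toplot <- rbind(keep,
                data.frame(category = paste0("Other (", othernum, " types)"),
                           value = 1 - sum(keep$value)))
toplot
#          category value
# 1               a  0.40
# 2               b  0.30
# 3               c  0.20
# 4               d  0.06
# 5 Other (1 types)  0.04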
