I see a lot of examples of how to count values for one column. I can't find a solution for counting for several columns.
I have data like
city col1 col2 col3 col4
I want to group by city and count unique values in col1, col2, col3...
aggregate(. ~ city, hh2, function(x) length(unique(x)))
I can count using aggregate, but it replaces city names with numbers and it's unclear how to revert it.
Here's an approach using dplyr::across, which is a handy way to calculate across multiple columns:
my_data <- data.frame(
city = c(rep("A", 3), rep("B", 3)),
col1 = 1:6,
col2 = 0,
col3 = c(1:3, 4, 4, 4),
col4 = 1:2
)
library(dplyr)
my_data %>%
group_by(city) %>%
summarize(across(col1:col4, n_distinct))
# A tibble: 2 x 5
city col1 col2 col3 col4
* <chr> <int> <int> <int> <int>
1 A 3 1 3 2
2 B 3 1 1 2
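For comparison, here is a minimal base R sketch of the same count using the aggregate call from the question. It reuses the toy my_data from this answer, not the asker's hh2. On this data the city labels come back unchanged; whether that holds for hh2 depends on its column types, which the question does not show.

```r
# Toy data copied from the answer above (not the asker's real `hh2`).
my_data <- data.frame(
  city = c(rep("A", 3), rep("B", 3)),
  col1 = 1:6,
  col2 = 0,
  col3 = c(1:3, 4, 4, 4),
  col4 = 1:2,
  stringsAsFactors = FALSE
)

# Count distinct values of every non-grouping column within each city.
res <- aggregate(. ~ city, data = my_data,
                 FUN = function(x) length(unique(x)))
print(res)
```

If the grouping column does come back as codes on real data, inspecting str() of the input before aggregating is a reasonable first check.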
Looks to me like tidy data is what you're after. Here's an example with the tidyverse and a subset of the mpg data set from ggplot2.
library(tidyverse)
data <- mpg[c("model", 'cty', 'hwy')]
head(data) #to see the initial data layout.
data %>%
pivot_longer(cols = c('cty', 'hwy'), names_to = 'cat', values_to = 'values') %>%
group_by(model, cat) %>%
summarise(avg = mean(values))
I have a table with ID, Category and amount with a few thousand records.
data:
df1 <- data.frame(
ID = c('V1', 'V1', 'V1', 'V3', 'V3', 'V3', 'V4', 'V5','V5','V5'),
Category = c('a', 'a', 'a', 'a', 'b', 'b', 'a', 'b', 'c', 'c'),
Amount = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1))
Using dplyr, I want to group by ID and Category, sum the total amount per group, then filter the results to keep only the IDs that appear in more than one category.
result:
ID Category Amount_Sum
V3 a 1
V3 b 2
V5 b 1
V5 c 2
I have the following code, which groups and sums, but I can't work out how to filter for IDs that appear in multiple groups.
code:
x <- df1 %>%
group_by(ID, Category) %>%
summarize(CNT = n(), amount = sum(Amount)) %>%
filter(????????)
Using n_distinct on the Category should give you your desired result:
library(dplyr)
df1 %>%
group_by(ID, Category) %>%
summarize(CNT = n(), amount = sum(Amount)) %>%
filter(n_distinct(Category) > 1) %>%
ungroup()
returns
# A tibble: 4 x 4
ID Category CNT amount
<chr> <chr> <int> <dbl>
1 V3 a 1 1
2 V3 b 2 2
3 V5 b 1 1
4 V5 c 2 2
You can also use a combination of length and unique to filter as well:
library(dplyr)
df1 %>%
group_by(ID, Category) %>%
summarize(CNT = n(), amount = sum(Amount)) %>%
filter(length(unique(Category)) > 1)
Output
ID Category CNT amount
<chr> <chr> <int> <dbl>
1 V3 a 1 1
2 V3 b 2 2
3 V5 b 1 1
4 V5 c 2 2
Or here is a base R option using aggregate to do the summary, then using ave to do the filtering. Here, Amount is the variable that we want to apply 2 functions to (i.e., length and sum), but we want to do that for each group (ID and Category). aggregate will return a matrix with the results in 2 columns. So, to integrate those with the rest of the dataframe, we can use do.call to bind each of those columns to the dataframe. Then, we can rename the columns with the desired column names with setNames.
df1_output <-
setNames(do.call(
data.frame,
aggregate(
Amount ~ ID + Category,
data = df1,
FUN = function(x)
c(CNT = length(x), amount = sum(x))
)
), c(names(df1[1:2]), "CNT", "amount"))
df1_output[with(df1_output, ave(Category, ID, FUN = function(x) length(unique(x))) > 1),]
How can I remove an entire group if one of its values is NA? For example, remove category B because it contains NA.
library(dplyr)
tbl = tibble(category = c("A", "A", "B", "B"),
values = c(2, 3, 1, NA))
We can use filter after grouping by 'category'
library(dplyr)
tbl %>%
group_by(category) %>%
filter(!any(is.na(values))) %>%
ungroup
-output
# A tibble: 2 x 2
category values
<chr> <dbl>
1 A 2
2 A 3
tbl %>%
filter(!category %in% category[is.na(values)])
Output
category values
<chr> <dbl>
1 A 2
2 A 3
tbl %>%
group_by(category) %>%
filter(all(!is.na(values)))
category values
<chr> <dbl>
1 A 2
2 A 3
You can find the categories that have at least one NA value and exclude them.
subset(tbl, !category %in% unique(category[is.na(values)]))
# category values
# <chr> <dbl>
#1 A 2
#2 A 3
If you prefer dplyr::filter.
library(dplyr)
tbl %>% filter(!category %in% unique(category[is.na(values)]))
When I execute the following code:
data_ikea_wider <- data_ikea_longer %>%
pivot_wider(id_cols = c(Record_no
, Geography
, City
, Country
, City.Country
, Year)
, names_from = Category, values_from = Value)
The columns just have n/a's as shown in the attached print screen.
What am I doing wrong? Thanks!
We could use dcast from data.table
library(data.table)
dcast(setDT(dat), col1 ~ col2, value.var = 'val')
Getting NAs from a pivot is not unexpected: it means that not every combination of your id columns has a row for every value of the names_from column.
For example,
dat <- data.frame(col1 = c(1,1,2), col2 = c('a', 'b', 'a'), val = 1:3)
dat
# col1 col2 val
# 1 1 a 1
# 2 1 b 2
# 3 2 a 3
If we want to pivot keeping col1 as an id, and turning col2 values into new columns, then it should be apparent that we'll end up with two rows (ids 1 and 2), and two new columns (a and b) to replace col2 and val. Unfortunately, since we only have three rows, the 2 rows x 2 columns = 4 cells cannot be completely filled with 3 values, so one will be NA:
pivot_wider(dat, id_cols = col1, names_from = col2, values_from = val)
# # A tibble: 2 x 3
# col1 a b
# <dbl> <int> <int>
# 1 1 1 2
# 2 2 3 NA
If you see this and are surprised, thinking that you actually have the data ... then you should check your data importing and filtering to make sure you did not inadvertently remove it (or it was not provided initially).
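One way to see in advance which cells will come out as NA is to tabulate the id/name combinations before pivoting. This is a minimal base R sketch on the toy dat from this answer; the names counts and missing_cells are made up for illustration.

```r
# Toy data copied from the answer above.
dat <- data.frame(col1 = c(1, 1, 2), col2 = c("a", "b", "a"), val = 1:3)

# Count how many rows exist for each (col1, col2) combination.
counts <- as.data.frame(table(col1 = dat$col1, col2 = dat$col2))

# Combinations with zero rows are the cells that become NA after pivoting.
missing_cells <- counts[counts$Freq == 0, ]
print(missing_cells)
```

A Freq above 1 in the same table would point to duplicated combinations, which are a separate pivoting problem from missing ones.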
I am struggling with a collapse of my data.
Basically my data consists of multiple indicators with multiple observations for each year. I want to convert this to one observation for each indicator for each country.
I have a rank indicator that specifies the order in which the observations have to be chosen.
Basically the observation with the first rank (thus 1 instead of 2) has to be chosen, as long as for that rank the value is not NA.
An additional question: The years in my dataset vary over time, thus is there a way to make the code dynamic in the sense that it applies the code to all column names from 1990 to 2025 when they exist?
df <- data.frame(country.code = c(1,1,1,1,1,1,1,1,1,1,1,1),
id = as.factor(c("GDP", "GDP", "GDP", "GDP", "CA", "CA", "CA", "GR", "GR", "GR", "GR", "GR")),
`1999` = c(NA,NA,NA, 1000,NA,NA, 100,NA,NA, NA,NA,22),
`2000` = c(NA,NA,1, 2,NA,1, 2,NA,1000, 12,13,2),
`2001` = c(3,100,1, 3,100,20, 1,1,44, 65,NA,NA),
rank = c(1, 2 , 3 , 4 , 1, 2, 3, 1, 3, 2, 4, 5))
The result should be the following dataset:
result <- data.frame(country.code = c(1, 1, 1),
id = as.factor(c("GDP", "CA", "GR")),
`1999`= c(1000, 100, 22),
`2000`= c(1, 1, 12),
`2001`= c(3, 100, 1))
I attempted the following solution (but it does not work given the NAs in the data, and I would have to specify each column):
test <- df %>% group_by(country.code, id) %>%
summarise(test1999 = `1999`[which.min(rank)])
I don't see how to tell R to skip the rows where column 1999 is NA.
We can subset using the minimum rank among the non-NA values of a column, e.g. x[rank==min(rank[!is.na(x)])].
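To make the indexing concrete, here is a small base R sketch applied to single vectors taken from the question's data. The helper name pick_by_rank is hypothetical, and returning NA for an all-NA column is an assumption made for this sketch.

```r
# Hypothetical helper: return the value whose rank is lowest among non-NA entries.
pick_by_rank <- function(x, rank) {
  # Assumption: an all-NA column yields NA rather than an error.
  if (all(is.na(x))) return(NA_real_)
  x[rank == min(rank[!is.na(x)])]
}

# GDP rows, year 2000: ranks 1 and 2 are NA, so rank 3 is chosen.
gdp_2000 <- pick_by_rank(c(NA, NA, 1, 2), c(1, 2, 3, 4))

# GR rows, year 2000: rank 1 is NA, so rank 2 (value 12) is chosen.
gr_2000 <- pick_by_rank(c(NA, 1000, 12, 13, 2), c(1, 3, 2, 4, 5))

print(c(gdp_2000, gr_2000))
```

Both values agree with the expected result data frame in the question (1 for GDP and 12 for GR in 2000).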
An additional question: The years in my dataset vary over time,....
With summarise_at, vars and matches can select every column whose name contains 4 digits (e.g. 1990-2025) via the regular expression [0-9]{4} (a digit "0-9" repeated exactly 4 times), and funs applies the above procedure to each selected column.
library(dplyr)
df %>% group_by(country.code,id) %>%
summarise(`1999` = `1999`[rank==ifelse(all(is.na(`1999`)),1, min(rank[!is.na(`1999`)]))])
df %>% group_by(country.code,id) %>%
summarise_at(vars(matches("[0-9]{4}")),funs(.[rank==ifelse(all(is.na(.)), 1, min(rank[!is.na(.)]))]))
# A tibble: 3 x 5
# Groups: country.code [?]
country.code id `1999` `2000` `2001`
<dbl> <fct> <dbl> <dbl> <dbl>
1 1 CA 100 1 100
2 1 GDP 1000 1 3
3 1 GR 22 12 1
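The column-selection step can be checked on its own with base R. The sketch below uses a made-up vector of column names covering both naming styles seen in this thread. The pattern here is anchored, whereas matches("[0-9]{4}") is not, so it matches any name that contains four consecutive digits anywhere. The numeric filter then restricts the selection to the 1990-2025 window from the question.

```r
# Hypothetical column names, mixing both naming styles seen in this thread.
cols <- c("country.code", "id", "X1999", "X2000", "2001", "X2030", "rank")

# Keep names that are exactly four digits, optionally prefixed with "X".
year_cols <- grep("^X?[0-9]{4}$", cols, value = TRUE)

# Strip the optional prefix and keep only years in the 1990-2025 window.
years <- as.integer(sub("^X", "", year_cols))
year_cols_in_range <- year_cols[years >= 1990 & years <= 2025]
print(year_cols_in_range)
```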
Here is one option that uses tidyr::fill to replace the NAs by the first non-NA value after we arranged the data by id and rank. It might not be the most efficient approach because we first gather and then spread the data again.
library(tidyverse)
df %>%
arrange(id, rank) %>%
gather(key, value, X1999:X2001) %>%
tidyr::fill(value, .direction = "up") %>%
spread(key, value) %>%
group_by(id) %>%
slice(1) %>%
ungroup()
# A tibble: 3 x 6
# country.code id rank X1999 X2000 X2001
# <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
#1 1 CA 1 100 1 100
#2 1 GDP 1 1000 1 3
#3 1 GR 1 22 12 1
NOTE: the column names here are X1999, X2000, etc., probably not 1999, 2000, etc. as in your data, but the code is easily adapted.
You can reshape the data frame to long form, remove the NAs, select the value with the minimum rank in each group, and spread back to wide form.
library(dplyr) # for %>%, filter, group_by, arrange, summarise, first
library(tidyr)
test <- df %>%
gather("Year", "Value", X1999:X2001) %>%
filter(!is.na(Value))%>%
group_by(country.code, id, Year) %>%
arrange(rank)%>%
summarise(first(Value)) %>%
spread(Year, `first(Value)`)
Here is my example
mydf<-data.frame('col_1' = c('A','A','B','B'), 'col_2' = c(100,NA, 90,30))
I would like to group by col_1 and count non-NA elements in col_2
I would like to do it with dplyr. Here is what I tried:
mydf %>% group_by(col_1) %>% summarise_each(funs(!is.na(col_2)))
mydf %>% group_by(col_1) %>% mutate(non_na_count = length(col_2, na.rm=TRUE))
mydf %>% group_by(col_1) %>% mutate(non_na_count = count(col_2, na.rm=TRUE))
Nothing worked. Any suggestions?
You can use this
mydf %>% group_by(col_1) %>% summarise(non_na_count = sum(!is.na(col_2)))
# A tibble: 2 x 2
col_1 non_na_count
<fctr> <int>
1 A 1
2 B 2
We can filter the NA elements in 'col_2' and then do a count of 'col_1'
mydf %>%
filter(!is.na(col_2)) %>%
count(col_1)
# A tibble: 2 x 2
# col_1 n
# <fctr> <int>
#1 A 1
#2 B 2
or using data.table
library(data.table)
setDT(mydf)[, .(non_na_count = sum(!is.na(col_2))), col_1]
Or with aggregate from base R
aggregate(cbind(col_2 = !is.na(col_2))~col_1, mydf, sum)
# col_1 col_2
#1 A 1
#2 B 2
Or using table
table(mydf$col_1[!is.na(mydf$col_2)])
library(knitr)
library(dplyr)
mydf <- data.frame("col_1" = c("A", "A", "B", "B"),
"col_2" = c(100, NA, 90, 30))
mydf %>%
group_by(col_1) %>%
select_if(function(x) any(is.na(x))) %>%
summarise_all(funs(sum(is.na(.)))) -> NA_mydf
kable(NA_mydf)