How to get column mean grouped by row labels in R dataframe? - r

I have a dataframe that looks like this
Fruit    2021  2022
Apples   12    29
Bananas  11    31
Apples   44    55
Oranges  30    73
Oranges  19    82
Bananas  24    78
The Fruit names are not ordered so I can't group them by taking n at a time, they're listed randomly. I need to get the mean of fruits sold in 2021 & 2022 as well as mean sold for apples, oranges & bananas for each year separately.
My code is
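# Names that start with a digit are not valid R names, so quote them with backticks
y2021 <- c(mean(df$`2021`), sd(df$`2021`))
y2022 <- c(mean(df$`2022`), sd(df$`2022`))
measure <- c('mean','standard deviation')
df1 <- data.table(Measure = measure, `2021` = y2021, `2022` = y2022)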
and output looks like this:
Measure             2021  2022
mean                23.3  58
standard deviation  12.4  23.3
But I'm not sure where to start with grouping the rows by name. I need to get something that looks like this
Measure             2021  Apples  Bananas  Oranges  2022  Apples  Bananas  Oranges
mean                23.3                            58
standard deviation  12.4                            23.3
(with the appropriate numbers in the blank spaces)

I suggest that a long format would serve you better in the long run, and this summary is a first step toward it. The code below (with your data frame named quux) computes only the mean; it is not hard to repeat it for sd and combine the two:
fruits <- c(NA, "Apples", "Oranges", "Bananas")
lapply(quux[,-1], function(yr) stack(sapply(fruits, function(z) mean(yr[is.na(z) | quux$Fruit %in% z])))) |>
  dplyr::bind_rows(.id = "year")
#   year   values     ind
# 1 2021 23.33333    <NA>
# 2 2021 28.00000  Apples
# 3 2021 24.50000 Oranges
# 4 2021 17.50000 Bananas
# 5 2022 58.00000    <NA>
# 6 2022 42.00000  Apples
# 7 2022 77.50000 Oranges
# 8 2022 54.50000 Bananas
Here NA in ind marks the mean over all fruits; every other row is the mean for the fruit named in ind.
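To illustrate the "repeat for sd and combine" step, here is a minimal base R sketch that computes both measures in one pass. The data frame is rebuilt from the question's table, the name quux follows the answer above, and the output column names (year, fruit, mean, sd) are my own choice rather than anything from the original answer.

```r
# Sample data copied from the question; quux is the name used in the answer above
quux <- data.frame(
  Fruit = c("Apples", "Bananas", "Apples", "Oranges", "Oranges", "Bananas"),
  `2021` = c(12, 11, 44, 30, 19, 24),
  `2022` = c(29, 31, 55, 73, 82, 78),
  check.names = FALSE
)
fruits <- c(NA, "Apples", "Oranges", "Bananas")

# One row per year/fruit combination; NA in fruit means "all fruits"
res <- do.call(rbind, lapply(names(quux)[-1], function(yr) {
  do.call(rbind, lapply(fruits, function(z) {
    keep <- is.na(z) | quux$Fruit %in% z   # all rows when z is NA
    v <- quux[[yr]][keep]
    data.frame(year = yr, fruit = z, mean = mean(v), sd = sd(v))
  }))
}))
res
```

The result stays in long form (one row per year and fruit), which is easy to reshape later if a wide layout is really needed.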

If you put your data in long form, you could use the aggregate function:
a <- aggregate(value ~ year + fruit, data=df, FUN=function(x) c(sd = sd(x), mean = mean(x)))
Here value is a column you would create to hold the numbers that are now under 2021 and 2022, and year is a new column holding 2021 or 2022 accordingly. Long form is almost always the way to go in R.
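As a rough sketch of the reshaping this answer describes, the wide table can be stacked into long form with base R alone and then passed to aggregate. The names long, fruit, year and value are assumptions chosen to match the formula above; the data is copied from the question.

```r
# Wide data from the question (check.names = FALSE keeps the 2021/2022 names)
df <- data.frame(
  Fruit = c("Apples", "Bananas", "Apples", "Oranges", "Oranges", "Bananas"),
  `2021` = c(12, 11, 44, 30, 19, 24),
  `2022` = c(29, 31, 55, 73, 82, 78),
  check.names = FALSE
)

# Stack the two year columns into one value column
long <- data.frame(
  fruit = rep(df$Fruit, times = 2),
  year  = rep(c("2021", "2022"), each = nrow(df)),
  value = c(df$`2021`, df$`2022`)
)

# Mean and sd per year and fruit; aggregate returns a matrix column here
res <- aggregate(value ~ year + fruit, data = long,
                 FUN = function(x) c(mean = mean(x), sd = sd(x)))

# Flatten the matrix column into ordinary columns value.mean and value.sd
res <- do.call(data.frame, res)
res
```

Dropping fruit from the formula (value ~ year) would give the per-year totals across all fruits.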

We may use
library(dplyr)
library(tidyr)
library(data.table)
library(stringr)
df1 %>%
  pivot_longer(cols = where(is.numeric), names_to = 'year') %>%
  as.data.table %>%
  cube( .(Mean = mean(value), SD = sd(value)),
    by = c("Fruit", "year")) %>%
  filter(!if_all(Fruit:year, is.na)) %>%
  unite(Fruit, Fruit, year, sep = "_", na.rm = TRUE) %>%
  filter(str_detect(Fruit, "_|\\d+")) %>%
  data.table::transpose(make.names = "Fruit", keep.names = "Measure")
-output
   Measure Apples_2021 Apples_2022 Bananas_2021 Bananas_2022 Oranges_2021 Oranges_2022     2021     2022
1:    Mean    28.00000    42.00000    17.500000     54.50000    24.500000    77.500000 23.33333 58.00000
2:      SD    22.62742    18.38478     9.192388     33.23402     7.778175     6.363961 12.42041 23.57965
Or if we want the duplicate column names
df1 %>%
  pivot_longer(cols = where(is.numeric), names_to = 'year') %>%
  as.data.table %>%
  cube( .(Mean = mean(value), SD = sd(value)), by = c("Fruit", "year")) %>%
  mutate(Fruit = coalesce(Fruit, year)) %>%
  drop_na(year) %>%
  arrange(year, str_detect(Fruit, '\\d{4}', negate = TRUE)) %>%
  select(-year) %>%
  data.table::transpose(make.names = "Fruit", keep.names = "Measure")
-output
   Measure     2021   Apples   Bananas   Oranges     2022   Apples  Bananas   Oranges
1:    Mean 23.33333 28.00000 17.500000 24.500000 58.00000 42.00000 54.50000 77.500000
2:      SD 12.42041 22.62742  9.192388  7.778175 23.57965 18.38478 33.23402  6.363961
data
df1 <- structure(list(Fruit = c("Apples", "Bananas", "Apples", "Oranges",
"Oranges", "Bananas"), `2021` = c(12L, 11L, 44L, 30L, 19L, 24L
), `2022` = c(29L, 31L, 55L, 73L, 82L, 78L)),
class = "data.frame", row.names = c(NA,
-6L))

Related

How can I plot a dataframe with 10 columns?

I have a dataset looks like this
year china India United state ....
2020 30 40 50
2021 20 30 60
2022 34 20 40
....
I have 10 columns and more than 50 rows in this dataframe. I have to plot them in one graph to show the movement of different countries.
So I think a line graph would suit the purpose, but I don't know how I should do the visualisation.
I think I should change the dataframe format first and then start the visualisation. How should I do it?
Pivot (reshape from wide to long) then plot with groups.
dat <- structure(list(year = 2020:2022, China = c(30L, 20L, 34L), India = c(40L, 30L, 20L), UnitedStates = c(50L, 60L, 40L)), class = "data.frame", row.names = c(NA, -3L))
datlong <- reshape2::melt(dat, "year", variable.name = "country", value.name = "value")
datlong
# year country value
# 1 2020 China 30
# 2 2021 China 20
# 3 2022 China 34
# 4 2020 India 40
# 5 2021 India 30
# 6 2022 India 20
# 7 2020 UnitedStates 50
# 8 2021 UnitedStates 60
# 9 2022 UnitedStates 40
### or using tidyr::
tidyr::pivot_longer(dat, -year, names_to = "country", values_to = "value")
Once reshaped, just group= (and optionally color=) lines:
library(ggplot2)
ggplot(datlong, aes(year, value, color = country)) +
geom_line(aes(group = country))
If you have many more years, the decimal years on the axis (such as 2020.5) will likely sort themselves out. Alternatively, you can control this by converting year to a Date-class column and forcing the display with scale_x_date.
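A minimal sketch of the Date conversion mentioned above, assuming each year is mapped to 1 January of that year (that mapping, and the column name date, are my own choices). The long data is retyped from the output shown earlier.

```r
library(ggplot2)

# Long-format data, same values as datlong above
datlong <- data.frame(
  year    = rep(2020:2022, times = 3),
  country = rep(c("China", "India", "UnitedStates"), each = 3),
  value   = c(30, 20, 34, 40, 30, 20, 50, 60, 40)
)

# Assumption: represent each year by its first day so the axis is a Date
datlong$date <- as.Date(paste0(datlong$year, "-01-01"))

# One break per year, labelled with the four-digit year only
p <- ggplot(datlong, aes(date, value, color = country)) +
  geom_line(aes(group = country)) +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y")
p
```

With 50 or more years, a wider spacing such as date_breaks = "5 years" would probably read better.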

Transform to wide format from long in R

I have a data frame in R which looks like below
Model Month Demand Inventory
A Jan 10 20
B Feb 30 40
A Feb 40 60
I want the data frame to look
Jan Feb
A_Demand 10 40
A_Inventory 20 60
A_coverage
B_Demand 30
B_Inventory 40
B_coverage
A_coverage and B_coverage will be calculated in Excel using a formula. The problem I need help with is pivoting the data frame from its original long format into the wide format shown above.
I tried to implement the solution from the linked duplicate but I am still having difficulty:
HD_dcast <- reshape(data,idvar = c("Model","Inventory","Demand"),
timevar = "Month", direction = "wide")
Here is a dput of my data:
data <- structure(list(Model = c("A", "B", "A"), Month = c("Jan", "Feb",
"Feb"), Demand = c(10L, 30L, 40L), Inventory = c(20L, 40L, 60L
)), class = "data.frame", row.names = c(NA, -3L))
Thanks
Here's an approach with dplyr and tidyr, two popular R packages for data manipulation:
library(dplyr)
library(tidyr)
data %>%
mutate(coverage = NA_real_) %>%
pivot_longer(-c(Model,Month), names_to = "Variable") %>%
pivot_wider(id_cols = c(Model, Variable), names_from = Month ) %>%
unite(Variable, c(Model,Variable), sep = "_")
## A tibble: 6 x 3
# Variable Jan Feb
# <chr> <dbl> <dbl>
#1 A_Demand 10 40
#2 A_Inventory 20 60
#3 A_coverage NA NA
#4 B_Demand NA 30
#5 B_Inventory NA 40
#6 B_coverage NA NA

How to summarize the top 3 highest values in a dataset when there are ties

I have a data frame (my_data) and want to calculate the sum of only the 3 highest values even though there might be ties. I am quite new to R and I've used dplyr.
A tibble: 15 x 3
city month number
<chr> <chr> <dbl>
1 Lund jan 12
2 Lund feb 12
3 Lund mar 18
4 Lund apr 28
5 Lund may 28
6 Stockholm jan 15
7 Stockholm feb 15
8 Stockholm mar 30
9 Stockholm apr 30
10 Stockholm may 10
11 Uppsala jan 22
12 Uppsala feb 30
13 Uppsala mar 40
14 Uppsala apr 60
15 Uppsala may 30
This is the code I have tried:
# For each city, count the top 3 of variable number
my_data %>% group_by(city) %>% top_n(3, number) %>% summarise(top_nr = sum(number))
The expected (wanted) output is:
# A tibble: 3 x 2
city top_nr
<chr> <dbl>
1 Lund 86
2 Stockholm 75
3 Uppsala 130
but the actual R output is:
# A tibble: 3 x 2
city top_nr
<chr> <dbl>
1 Lund 86
2 Stockholm 90
3 Uppsala 160
It seems like if there are ties, all tied values are included in the summation. I wanted only 3 unique instances with highest values to be counted.
Any help would be much appreciated! :)
We can use distinct to remove the duplicated elements first. top_n keeps ties: if the cutoff value is duplicated, every row that shares it is kept, so more than n rows can come back.
my_data %>%
distinct(city, number, .keep_all = TRUE) %>%
group_by(city) %>%
top_n(3, number) %>%
summarise(top_nr = sum(number))
Update
Based on the OP's updated expected output: after top_n (whose output is not sorted), arrange 'number' in descending order within each city and sum the first 3 values.
my_data %>%
group_by(city) %>%
top_n(3, number) %>%
arrange(city, desc(number)) %>%
summarise(number = sum(head(number, 3)))
# A tibble: 3 x 2
# city number
# <chr> <int>
#1 Lund 74
#2 Stockholm 75
#3 Uppsala 130
data
my_data <- structure(list(city = c("Lund", "Lund", "Lund", "Lund", "Lund",
"Stockholm", "Stockholm", "Stockholm", "Stockholm", "Stockholm",
"Uppsala", "Uppsala", "Uppsala", "Uppsala", "Uppsala"), month = c("jan",
"feb", "mar", "apr", "may", "jan", "feb", "mar", "apr", "may",
"jan", "feb", "mar", "apr", "may"), number = c(12L, 12L, 18L,
28L, 28L, 15L, 15L, 30L, 30L, 10L, 22L, 30L, 40L, 60L, 30L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"))
Life might be way simpler without top_n():
my_data %>%
group_by(city) %>%
summarize(
top_nr = sum(tail(sort(number), 3))
)
This tidyverse (actually, dplyr) solution is almost equal to akrun's, but filters the dataframe instead of getting the top_n.
library(tidyverse)
my_data %>%
group_by(city) %>%
arrange(desc(number), .by_group = TRUE) %>%
filter(row_number() %in% 1:3) %>%
summarise(top_nr = sum(number))
## A tibble: 3 x 2
# city top_nr
# <chr> <int>
#1 Lund 74
#2 Stockholm 75
#3 Uppsala 130
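For what it's worth, newer versions of dplyr (1.0.0 and later) also offer slice_max(), whose with_ties argument addresses the tie problem directly. This is only a sketch of that alternative, not something any of the answers above used; the data is retyped from the question.

```r
library(dplyr)

# Data from the question
my_data <- data.frame(
  city   = rep(c("Lund", "Stockholm", "Uppsala"), each = 5),
  month  = rep(c("jan", "feb", "mar", "apr", "may"), times = 3),
  number = c(12, 12, 18, 28, 28, 15, 15, 30, 30, 10, 22, 30, 40, 60, 30)
)

# with_ties = FALSE returns exactly n rows per group, even when values tie
top3 <- my_data %>%
  group_by(city) %>%
  slice_max(number, n = 3, with_ties = FALSE) %>%
  summarise(top_nr = sum(number))
top3
```

This sums 28 + 28 + 18 for Lund, 30 + 30 + 15 for Stockholm and 60 + 40 + 30 for Uppsala, matching the 74 / 75 / 130 shown above.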

how to assign words to a number in a dataframe

I have the dataframe below with numbers in two of its columns, and I need to replace those numbers with strings from a separate reference dataset.
Dataset 1:
lhs rhs
32,39,6 65
39,6,65 32
14,16,26 15
16,20,4 26
16,26,33 4
53 31
Dataset 2:
id name
4 yougurt
6 coffee
14 cream chese
15 meat spreads
16 butter
20 whole milk
26 condensed milk
31 curd
32 flour
39 rolls
53 sugar
65 soda
Expected output:
lhs rhs
flour, rolls, coffee soda
rolls, coffee, soda flour
cream chease, butter, condensed milk meat spreads
A solution using dplyr and tidyr. dat is the final output. The key is to use separate_rows to expand the lhs and then conduct left_join twice.
library(dplyr)
library(tidyr)
dat <- dat1 %>%
separate_rows(lhs, convert = TRUE) %>%
left_join(dat2, by = c("lhs" = "id")) %>%
left_join(dat2, by = c("rhs" = "id")) %>%
drop_na(name.x) %>%
group_by(name.y) %>%
summarise(lhs = paste0(name.x, collapse = ", ")) %>%
ungroup() %>%
select(lhs, rhs = name.y)
dat
# # A tibble: 6 x 2
# lhs rhs
# <chr> <chr>
# 1 butter, whole milk, yougurt condensed milk
# 2 sugar curd
# 3 rolls, coffee, soda flour
# 4 cream chese, butter, condensed milk meat spreads
# 5 flour, rolls, coffee soda
# 6 butter, condensed milk yougurt
DATA
dat1 <- read.table(text = "lhs rhs
'32,39,6' 65
'39,6,65' 32
'14,16,26' 15
'16,20,4' 26
'16,26,33' 4
53 31 ",
stringsAsFactors = FALSE, header = TRUE)
dat2 <- read.table(text = "id name
4 yougurt
6 coffee
14 'cream chese'
15 'meat spreads'
16 butter
20 'whole milk'
26 'condensed milk'
31 curd
32 flour
39 rolls
53 sugar
65 soda",
header = TRUE, stringsAsFactors = FALSE)
Another option. Here d1 is your first data frame and d2 your second.
library(tidyverse)
d1 %>% separate(lhs, sep = ',', into = c('v1', 'v2', 'v3')) %>%
mutate_all(as.numeric) %>%
left_join(d2, by = c('v1'='id')) %>%
left_join(d2, by = c('v2'='id')) %>%
left_join(d2, by = c('v3'='id')) %>%
left_join(d2, by = c('rhs'='id')) %>%
unite(lhs, name.x, name.y, name.x.x, sep = ',') %>%
mutate(lhs = str_replace_all(lhs, ',NA', '')) %>%
select(lhs, rhs = name.y.y)
OR, as pointed out by @Moody_Mudskipper in the comments
d1 %>% separate(lhs, sep = ',', into = c('v1', 'v2', 'v3')) %>%
mutate_all(as.numeric) %>%
lmap(~setNames(left_join(setNames(.x, "id"), d2)[2], names(.x))) %>%
unite(lhs, v1, v2, v3, sep = ', ') %>%
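mutate(lhs = str_replace_all(lhs, ', NA', '')) %>%  # separator above is ', ', so match ', NA'
select(lhs, rhs)  # lmap keeps the original column names, so rhs already holds the names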
lhs rhs
1 flour, rolls, coffee soda
2 rolls, coffee, soda flour
3 cream chese, butter, condensed milk meat spreads
4 butter, whole milk, yougurt condensed milk
5 butter, condensed milk yougurt
6 sugar curd
This is almost the same as www's answer, but appears to be a little faster. Apparently strsplit followed by unnest is faster than separate_rows.
require(tidyverse)
df1 %>%
mutate(lhs = sapply(lhs, strsplit, ',')) %>%
unnest %>%
mutate_at(c('lhs', 'rhs'), as.numeric) %>%
left_join(df2, by = c('lhs'= 'id')) %>%
left_join(df2, by = c('rhs'= 'id')) %>%
group_by(name.y) %>%
summarize(name.x = paste(name.x, collapse = ', ')) %>%
rename(rhs = name.y, lhs = name.x)
Then there's the data.table solution, which is much faster.
require(data.table)
setDT(df1)
df1[, .(lhs = unlist(strsplit(lhs, ','))), rhs] %>%
.[, lapply(.SD, as.numeric)] %>%
merge(df2, by.x = 'lhs', by.y = 'id') %>%
merge(df2, by.x = 'rhs', by.y = 'id') %>%
.[, .(lhs = paste0(name.x, collapse = ',')), by = .(rhs = name.y)]
Benchmark
# Results
# Unit: relative
# expr min lq mean median uq max neval
# useDT() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 300
# UseUnnest() 5.570704 5.632532 5.274552 5.374714 5.042518 9.254190 300
# UseSeparateRows() 8.640615 8.356889 7.661669 7.939593 7.401666 7.896038 300
# Method
require(tidyverse)
require(data.table)
df1 <- fread("
lhs rhs
32,39,6 65
39,6,65 32
14,16,26 15
16,20,4 26
16,26,33 4
53 31
")
df2 <- fread("
id name
4 yougurt
6 coffee
14 cream_chese
15 meat_spreads
16 butter
20 whole_milk
26 condensed_milk
31 curd
32 flour
39 rolls
53 sugar
65 soda
")
useDT <- function(x){
df1[, lapply(sapply(lhs, strsplit, ','), unlist), rhs] %>%
setNames(c('rhs', 'lhs')) %>%
.[, `:=`(lhs = as.numeric(lhs),
rhs = as.numeric(rhs))] %>%
.[df2, on = c('lhs'= 'id')] %>%
.[df2, on = c('rhs'= 'id')] %>%
.[, .(lhs = paste0(name, collapse = ',')), by = i.name] %>%
.[lhs != 'NA', .(lhs, rhs = i.name)]
}
UseUnnest <- function(x){
df1 %>%
mutate(lhs = sapply(lhs, strsplit, ',')) %>%
unnest %>%
mutate_at(c('lhs', 'rhs'), as.numeric) %>%
left_join(df2, by = c('lhs'= 'id')) %>%
left_join(df2, by = c('rhs'= 'id')) %>%
group_by(name.y) %>%
summarize(name.x = paste(name.x, collapse = ', ')) %>%
rename(rhs = name.y, lhs = name.x)
}
UseSeparateRows <- function(x){
df1 %>%
separate_rows(lhs, convert = TRUE) %>%
left_join(df2, by = c("lhs" = "id")) %>%
left_join(df2, by = c("rhs" = "id")) %>%
drop_na(name.x) %>%
group_by(name.y) %>%
summarise(lhs = paste0(name.x, collapse = ", ")) %>%
ungroup() %>%
select(lhs, rhs = name.y)
}
# microbenchmark is not attached above, so call it with its namespace
microbenchmark::microbenchmark(useDT(), UseUnnest(), UseSeparateRows(), times = 300, unit = 'relative')
Here is an option using just base R and mapping the numeric values to factor labels.
Split the string, map the labels to the values and then collapse the labels back into a string.
df<-structure(list(id = c(4L, 6L, 14L, 15L, 16L, 20L, 26L, 31L, 32L,
39L, 53L, 65L), name = c("yougurt", "coffee", "cream cheese",
"meat spreads", "butter", "whole milk", "condensed milk", "curd",
"flour", "rolls", "sugar", "soda")), .Names = c("id", "name"),
class = "data.frame", row.names = c(NA, -12L))
input<-structure(list(lhs = c("32,39,6", "39,6,65", "14,16,26", "16,20,4",
"16,26,33", "53"), rhs = c(65L, 32L, 15L, 26L, 4L, 31L)),
.Names = c("lhs", "rhs"), class = "data.frame", row.names = c(NA, -6L))
#new left hand side
newlhs<-sapply(as.character(input$lhs), function(x){
strs<-unlist(strsplit(x, ","))
f<-factor(strs, levels=df$id, labels=df$name)
paste(f, collapse = ", ")
})
#new right hand side
newrhs<-sapply(as.character(input$rhs), function(x){
strs<-unlist(strsplit(x, ","))
f<-factor(strs, levels=df$id, labels=df$name)
paste(f, collapse = ", ")
})
answer<-data.frame(newlhs, newrhs)
row.names(answer)<-NULL #remove rownames
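A variation on the same base R idea, sketched with a named character vector as the lookup table instead of factor labels. The helper name ids_to_names is hypothetical, and dropping ids that have no match (such as 33) is my assumption, chosen to mirror the dplyr output earlier in the thread.

```r
# Data from the question
dat2 <- data.frame(
  id = c(4, 6, 14, 15, 16, 20, 26, 31, 32, 39, 53, 65),
  name = c("yougurt", "coffee", "cream chese", "meat spreads", "butter",
           "whole milk", "condensed milk", "curd", "flour", "rolls",
           "sugar", "soda")
)
dat1 <- data.frame(
  lhs = c("32,39,6", "39,6,65", "14,16,26", "16,20,4", "16,26,33", "53"),
  rhs = c(65, 32, 15, 26, 4, 31)
)

# Named vector: names are the ids (as strings), values are the labels
lookup <- setNames(dat2$name, dat2$id)

# Hypothetical helper: split on commas, look each id up, drop unmatched ids
ids_to_names <- function(x) {
  vapply(strsplit(as.character(x), ",", fixed = TRUE), function(ids) {
    nm <- unname(lookup[ids])
    paste(nm[!is.na(nm)], collapse = ", ")
  }, character(1))
}

result <- data.frame(lhs = ids_to_names(dat1$lhs),
                     rhs = ids_to_names(dat1$rhs))
result
```

Because the same helper handles both columns, the duplicated sapply blocks collapse into a single function.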
Not so idiomatic but I win the code golf :) :
as.data.frame(lapply(dat1, function(x){
for (i in seq(nrow(dat2))) x <- gsub(paste0("(^|,)",dat2$id[i],"(,|$)"),
paste0("\\1",dat2$name[i],"\\2"),x)
x}))
# lhs rhs
# 1 flour,rolls,coffee soda
# 2 rolls,coffee,soda flour
# 3 cream chese,butter,condensed milk meat spreads
# 4 butter,whole milk,yougurt condensed milk
# 5 butter,condensed milk,33 yougurt
# 6 sugar curd
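May fail if the names in the 2nd dataset themselves contain numbers, since a later substitution could then match inside a name that was already inserted.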

aggregate all columns in r

This is how I total up the students' grades; there are more columns than shown here:
test<-with(data,table(Student,Subject))
test <- cbind(test)
test <- as.data.frame(test)
row.names Maths Science English Geography History Art ...
1 George 64 70 40 50 60 70
2 Anna 40 20 65 54 30 50
3 Scott 30 64 30 40 50 20
...
Summarize <- data.frame(
aggregate(.~Maths, data=test, min),
aggregate(.~English, data=test, max),
aggregate(.~Science, data=test, mean))
Is there a way to select all the columns at once and aggregate them (range and mean) into a new dataframe?
Min Mean Max
Maths 30 60 90
Science
English
Geography
...
Thanks in advance !
Try:
library(dplyr)
library(tidyr)
df %>%
summarise_each(funs(min=min(., na.rm=TRUE), max=max(., na.rm=TRUE),
mean=mean(., na.rm=TRUE)), -Student) %>%
gather(Var, Value, Maths_min:Art_mean) %>%
separate(Var, c("Subject", "Var")) %>%
spread(Var, Value)
# Subject max mean min
#1 Art 70 46.66667 20
#2 English 65 45.00000 30
#3 Geography 54 48.00000 40
#4 History 60 46.66667 30
#5 Maths 64 44.66667 30
#6 Science 70 51.33333 20
Update
Or you could use aggregate with melt
library(reshape2)
res <- aggregate(value~variable, melt(df, id="Student"),
FUN=function(x) c(Min=min(x, na.rm=TRUE), Mean=mean(x, na.rm=TRUE),
Max=max(x, na.rm=TRUE)))
res1 <- do.call(`data.frame`, res)
colnames(res1) <- gsub(".*\\.", "", colnames(res1))
res1
# variable Min Mean Max
#1 Maths 30 44.66667 64
#2 Science 20 51.33333 70
#3 English 30 45.00000 65
#4 Geography 40 48.00000 54
#5 History 30 46.66667 60
#6 Art 20 46.66667 70
Or using only base R
res2 <- do.call(`data.frame`,
aggregate(values~ind, stack(df, select=-1),
FUN=function(x) c(Min=min(x, na.rm=TRUE), Mean=mean(x, na.rm=TRUE),
Max=max(x, na.rm=TRUE))))
colnames(res2) <- gsub(".*\\.", "", colnames(res2))
res2
# ind Min Mean Max
#1 Art 20 46.66667 70
#2 English 30 45.00000 65
#3 Geography 40 48.00000 54
#4 History 30 46.66667 60
#5 Maths 30 44.66667 64
#6 Science 20 51.33333 70
data
df <- structure(list(Student = c("George", "Anna", "Scott"), Maths = c(64L,
40L, 30L), Science = c(70L, 20L, 64L), English = c(40L, 65L,
30L), Geography = c(50L, 54L, 40L), History = c(60L, 30L, 50L
), Art = c(70L, 50L, 20L)), .Names = c("Student", "Maths", "Science",
"English", "Geography", "History", "Art"), class = "data.frame", row.names = c(NA,
-3L))
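The first answer above relies on summarise_each(), funs(), gather() and spread(), which are deprecated or superseded in current dplyr and tidyr releases. A rough equivalent with pivot_longer() might look like the sketch below; the column name Score is my own choice, and the data is the df from this answer.

```r
library(dplyr)
library(tidyr)

# Same data as df above
df <- data.frame(
  Student   = c("George", "Anna", "Scott"),
  Maths     = c(64, 40, 30),
  Science   = c(70, 20, 64),
  English   = c(40, 65, 30),
  Geography = c(50, 54, 40),
  History   = c(60, 30, 50),
  Art       = c(70, 50, 20)
)

# Reshape to one row per student/subject, then summarise per subject
res <- df %>%
  pivot_longer(-Student, names_to = "Subject", values_to = "Score") %>%
  group_by(Subject) %>%
  summarise(
    Min  = min(Score, na.rm = TRUE),
    Mean = mean(Score, na.rm = TRUE),
    Max  = max(Score, na.rm = TRUE)
  )
res
```

Going long first means new subject columns are picked up automatically, with no column range to maintain by hand.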
