I have a data frame in R which looks like this:
Model Month Demand Inventory
A Jan 10 20
B Feb 30 40
A Feb 40 60
I want the data frame to look like this:
Jan Feb
A_Demand 10 40
A_Inventory 20 60
A_coverage
B_Demand 30
B_Inventory 40
B_coverage
A_coverage and B_coverage will be calculated in Excel using a formula. What I need help with is pivoting the data frame from long format (the original) to the wide format shown above.
I tried to implement the solution from the linked duplicate but I am still having difficulty:
HD_dcast <- reshape(data,idvar = c("Model","Inventory","Demand"),
timevar = "Month", direction = "wide")
Here is a dput of my data:
data <- structure(list(Model = c("A", "B", "A"), Month = c("Jan", "Feb",
"Feb"), Demand = c(10L, 30L, 40L), Inventory = c(20L, 40L, 60L
)), class = "data.frame", row.names = c(NA, -3L))
Thanks
Here's an approach with dplyr and tidyr, two popular R packages for data manipulation:
library(dplyr)
library(tidyr)
data %>%
mutate(coverage = NA_real_) %>%
pivot_longer(-c(Model,Month), names_to = "Variable") %>%
pivot_wider(id_cols = c(Model, Variable), names_from = Month ) %>%
unite(Variable, c(Model,Variable), sep = "_")
## A tibble: 6 x 3
# Variable Jan Feb
# <chr> <dbl> <dbl>
#1 A_Demand 10 40
#2 A_Inventory 20 60
#3 A_coverage NA NA
#4 B_Demand NA 30
#5 B_Inventory NA 40
#6 B_coverage NA NA
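To see what the two pivots do, it can help to run them separately. The sketch below is only an illustration of the same idea without the coverage column and without unite; the intermediate names long and wide and the values_to column name are my own choices, not part of the answer above.

```r
library(dplyr)
library(tidyr)

# Same data as the dput in the question
data <- data.frame(
  Model = c("A", "B", "A"),
  Month = c("Jan", "Feb", "Feb"),
  Demand = c(10L, 30L, 40L),
  Inventory = c(20L, 40L, 60L)
)

# Step 1: stack Demand and Inventory into one column,
# giving one row per Model / Month / Variable combination
long <- data %>%
  pivot_longer(-c(Model, Month), names_to = "Variable", values_to = "value")

# Step 2: spread the months out into columns;
# Model and Variable together identify each output row
wide <- long %>%
  pivot_wider(names_from = Month, values_from = value)

long
wide
```

Combinations that do not occur in the data (for example model B in Jan) come out as NA in the wide result.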
I have a data frame (my_data) and want to calculate the sum of only the 3 highest values even though there might be ties. I am quite new to R and I've used dplyr.
A tibble: 15 x 3
city month number
<chr> <chr> <dbl>
1 Lund jan 12
2 Lund feb 12
3 Lund mar 18
4 Lund apr 28
5 Lund may 28
6 Stockholm jan 15
7 Stockholm feb 15
8 Stockholm mar 30
9 Stockholm apr 30
10 Stockholm may 10
11 Uppsala jan 22
12 Uppsala feb 30
13 Uppsala mar 40
14 Uppsala apr 60
15 Uppsala may 30
This is the code I have tried:
# For each city, count the top 3 of variable number
my_data %>% group_by(city) %>% top_n(3, number) %>% summarise(top_nr = sum(number))
The expected (wanted) output is:
# A tibble: 3 x 2
city top_nr
<chr> <dbl>
1 Lund 74
2 Stockholm 75
3 Uppsala 130
but the actual R output is:
# A tibble: 3 x 2
city top_nr
<chr> <dbl>
1 Lund 74
2 Stockholm 90
3 Uppsala 160
It seems like if there are ties, all tied values are included in the summation. I wanted only 3 unique instances with highest values to be counted.
Any help would be much appreciated! :)
We can use distinct to remove duplicated values within each city first. top_n keeps every row that ties with the n-th value, so with ties it can return more than n rows per group.
my_data %>%
distinct(city, number, .keep_all = TRUE) %>%
group_by(city) %>%
top_n(3, number) %>%
summarise(top_nr = sum(number))
Update
Based on the OP's updated expected output: after top_n (whose result is not sorted), arrange 'number' in descending order within each city and sum the first 3 values.
my_data %>%
group_by(city) %>%
top_n(3, number) %>%
arrange(city, desc(number)) %>%
summarise(number = sum(head(number, 3)))
# A tibble: 3 x 2
# city number
# <chr> <int>
#1 Lund 74
#2 Stockholm 75
#3 Uppsala 130
data
my_data <- structure(list(city = c("Lund", "Lund", "Lund", "Lund", "Lund",
"Stockholm", "Stockholm", "Stockholm", "Stockholm", "Stockholm",
"Uppsala", "Uppsala", "Uppsala", "Uppsala", "Uppsala"), month = c("jan",
"feb", "mar", "apr", "may", "jan", "feb", "mar", "apr", "may",
"jan", "feb", "mar", "apr", "may"), number = c(12L, 12L, 18L,
28L, 28L, 15L, 15L, 30L, 30L, 10L, 22L, 30L, 40L, 60L, 30L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"))
Life might be way simpler without top_n():
my_data %>%
group_by(city) %>%
summarize(
top_nr = sum(tail(sort(number), 3))
)
This tidyverse (actually, dplyr) solution is almost the same as akrun's, but it filters rows by their position within each group instead of calling top_n.
library(tidyverse)
my_data %>%
group_by(city) %>%
arrange(desc(number), .by_group = TRUE) %>%
filter(row_number() %in% 1:3) %>%
summarise(top_nr = sum(number))
## A tibble: 3 x 2
# city top_nr
# <chr> <int>
#1 Lund 74
#2 Stockholm 75
#3 Uppsala 130
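For the top-3 question above, dplyr 1.0.0 and later also provide slice_max(), which supersedes top_n() and has a with_ties argument. The following is a minimal sketch, assuming such a dplyr version is installed; the data frame is rebuilt from the values in the question and the name top3 is mine.

```r
library(dplyr)

# Same values as my_data in the question
my_data <- data.frame(
  city = rep(c("Lund", "Stockholm", "Uppsala"), each = 5),
  month = rep(c("jan", "feb", "mar", "apr", "may"), times = 3),
  number = c(12, 12, 18, 28, 28, 15, 15, 30, 30, 10, 22, 30, 40, 60, 30)
)

top3 <- my_data %>%
  group_by(city) %>%
  # with_ties = FALSE returns exactly n rows per group, even when values tie
  slice_max(number, n = 3, with_ties = FALSE) %>%
  summarise(top_nr = sum(number))

top3
```

With the default with_ties = TRUE, slice_max() behaves like top_n() and keeps all tied rows, which is what inflated the sums for Stockholm and Uppsala in the question.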
I have the data frame below, with numbers in two of its columns, and I need to replace those numbers with strings from a reference dataset.
Dataset 1:
lhs rhs
32,39,6 65
39,6,65 32
14,16,26 15
16,20,4 26
16,26,33 4
53 31
Dataset 2:
id name
4 yougurt
6 coffee
14 cream chese
15 meat spreads
16 butter
20 whole milk
26 condensed milk
31 curd
32 flour
39 rolls
53 sugar
65 soda
Expected output:
lhs rhs
flour, rolls, coffee soda
rolls, coffee, soda flour
cream chese, butter, condensed milk meat spreads
A solution using dplyr and tidyr. dat is the final output. The key is to use separate_rows to expand the lhs and then conduct left_join twice.
library(dplyr)
library(tidyr)
dat <- dat1 %>%
separate_rows(lhs, convert = TRUE) %>%
left_join(dat2, by = c("lhs" = "id")) %>%
left_join(dat2, by = c("rhs" = "id")) %>%
drop_na(name.x) %>%
group_by(name.y) %>%
summarise(lhs = paste0(name.x, collapse = ", ")) %>%
ungroup() %>%
select(lhs, rhs = name.y)
dat
# # A tibble: 6 x 2
# lhs rhs
# <chr> <chr>
# 1 butter, whole milk, yougurt condensed milk
# 2 sugar curd
# 3 rolls, coffee, soda flour
# 4 cream chese, butter, condensed milk meat spreads
# 5 flour, rolls, coffee soda
# 6 butter, condensed milk yougurt
DATA
dat1 <- read.table(text = "lhs rhs
'32,39,6' 65
'39,6,65' 32
'14,16,26' 15
'16,20,4' 26
'16,26,33' 4
53 31 ",
stringsAsFactors = FALSE, header = TRUE)
dat2 <- read.table(text = "id name
4 yougurt
6 coffee
14 'cream chese'
15 'meat spreads'
16 butter
20 'whole milk'
26 'condensed milk'
31 curd
32 flour
39 rolls
53 sugar
65 soda",
header = TRUE, stringsAsFactors = FALSE)
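Note that the pipeline above groups by name.y, so the rows come back ordered by rhs name rather than in their input order, and two input rows with the same rhs would be merged. If the original row order matters, a row-wise lookup with match() is one alternative. This is a base R sketch under the same data; the helper lookup and the result name out are hypothetical names of mine.

```r
# Same data as dat1 and dat2 above, built directly
dat1 <- data.frame(
  lhs = c("32,39,6", "39,6,65", "14,16,26", "16,20,4", "16,26,33", "53"),
  rhs = c(65, 32, 15, 26, 4, 31),
  stringsAsFactors = FALSE
)
dat2 <- data.frame(
  id = c(4, 6, 14, 15, 16, 20, 26, 31, 32, 39, 53, 65),
  name = c("yougurt", "coffee", "cream chese", "meat spreads", "butter",
           "whole milk", "condensed milk", "curd", "flour", "rolls",
           "sugar", "soda"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: translate ids to names, NA for unknown ids
lookup <- function(ids) dat2$name[match(ids, dat2$id)]

out <- data.frame(
  lhs = vapply(strsplit(dat1$lhs, ","), function(ids) {
    nm <- lookup(as.numeric(ids))
    # drop ids that have no entry in dat2 (33 in this data)
    paste(nm[!is.na(nm)], collapse = ", ")
  }, character(1)),
  rhs = lookup(dat1$rhs),
  stringsAsFactors = FALSE
)

out
```

Each input row maps to exactly one output row here, so the result has the same six rows in the same order as dat1.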
Another option. Here d1 is your first data frame and d2 your second.
library(tidyverse)
d1 %>% separate(lhs, sep = ',', into = c('v1', 'v2', 'v3')) %>%
mutate_all(as.numeric) %>%
left_join(d2, by = c('v1'='id')) %>%
left_join(d2, by = c('v2'='id')) %>%
left_join(d2, by = c('v3'='id')) %>%
left_join(d2, by = c('rhs'='id')) %>%
unite(lhs, name.x, name.y, name.x.x, sep = ',') %>%
mutate(lhs = str_replace_all(lhs, ',NA', '')) %>%
select(lhs, rhs = name.y.y)
OR, as pointed out by @Moody_Mudskipper in the comments:
d1 %>% separate(lhs, sep = ',', into = c('v1', 'v2', 'v3')) %>%
mutate_all(as.numeric) %>%
lmap(~setNames(left_join(setNames(.x, "id"), d2)[2], names(.x))) %>%
unite(lhs, v1, v2, v3, sep = ', ') %>%
mutate(lhs = str_replace_all(lhs, ', NA', '')) %>%
select(lhs, rhs)
lhs rhs
1 flour, rolls, coffee soda
2 rolls, coffee, soda flour
3 cream chese, butter, condensed milk meat spreads
4 butter, whole milk, yougurt condensed milk
5 butter, condensed milk yougurt
6 sugar curd
This is almost the same as www's answer, but appears to be a little faster. Apparently using strsplit and unnest is faster than separate_rows.
require(tidyverse)
df1 %>%
mutate(lhs = sapply(lhs, strsplit, ',')) %>%
unnest %>%
mutate_at(c('lhs', 'rhs'), as.numeric) %>%
left_join(df2, by = c('lhs'= 'id')) %>%
left_join(df2, by = c('rhs'= 'id')) %>%
group_by(name.y) %>%
summarize(name.x = paste(name.x, collapse = ', ')) %>%
rename(rhs = name.y, lhs = name.x)
Then there's the data.table solution, which is much faster.
require(data.table)
setDT(df1)
df1[, .(lhs = unlist(strsplit(lhs, ','))), rhs] %>%
.[, lapply(.SD, as.numeric)] %>%
merge(df2, by.x = 'lhs', by.y = 'id') %>%
merge(df2, by.x = 'rhs', by.y = 'id') %>%
.[, .(lhs = paste0(name.x, collapse = ',')), by = .(rhs = name.y)]
Benchmark
# Results
# Unit: relative
# expr min lq mean median uq max neval
# useDT() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 300
# UseUnnest() 5.570704 5.632532 5.274552 5.374714 5.042518 9.254190 300
# UseSeparateRows() 8.640615 8.356889 7.661669 7.939593 7.401666 7.896038 300
# Method
require(tidyverse)
require(data.table)
require(microbenchmark)
df1 <- fread("
lhs rhs
32,39,6 65
39,6,65 32
14,16,26 15
16,20,4 26
16,26,33 4
53 31
")
df2 <- fread("
id name
4 yougurt
6 coffee
14 cream_chese
15 meat_spreads
16 butter
20 whole_milk
26 condensed_milk
31 curd
32 flour
39 rolls
53 sugar
65 soda
")
useDT <- function(x){
df1[, lapply(sapply(lhs, strsplit, ','), unlist), rhs] %>%
setNames(c('rhs', 'lhs')) %>%
.[, `:=`(lhs = as.numeric(lhs),
rhs = as.numeric(rhs))] %>%
.[df2, on = c('lhs'= 'id')] %>%
.[df2, on = c('rhs'= 'id')] %>%
.[, .(lhs = paste0(name, collapse = ',')), by = i.name] %>%
.[lhs != 'NA', .(lhs, rhs = i.name)]
}
UseUnnest <- function(x){
df1 %>%
mutate(lhs = sapply(lhs, strsplit, ',')) %>%
unnest %>%
mutate_at(c('lhs', 'rhs'), as.numeric) %>%
left_join(df2, by = c('lhs'= 'id')) %>%
left_join(df2, by = c('rhs'= 'id')) %>%
group_by(name.y) %>%
summarize(name.x = paste(name.x, collapse = ', ')) %>%
rename(rhs = name.y, lhs = name.x)
}
UseSeparateRows <- function(x){
df1 %>%
separate_rows(lhs, convert = TRUE) %>%
left_join(df2, by = c("lhs" = "id")) %>%
left_join(df2, by = c("rhs" = "id")) %>%
drop_na(name.x) %>%
group_by(name.y) %>%
summarise(lhs = paste0(name.x, collapse = ", ")) %>%
ungroup() %>%
select(lhs, rhs = name.y)
}
microbenchmark(useDT(), UseUnnest(), UseSeparateRows(), times = 300, unit = 'relative')
Here is an option using just base R and mapping the numeric values to factor labels.
Split the string, map the labels to the values and then collapse the labels back into a string.
df<-structure(list(id = c(4L, 6L, 14L, 15L, 16L, 20L, 26L, 31L, 32L,
39L, 53L, 65L), name = c("yougurt", "coffee", "cream cheese",
"meat spreads", "butter", "whole milk", "condensed milk", "curd",
"flour", "rolls", "sugar", "soda")), .Names = c("id", "name"),
class = "data.frame", row.names = c(NA, -12L))
input<-structure(list(lhs = c("32,39,6", "39,6,65", "14,16,26", "16,20,4",
"16,26,33", "53"), rhs = c(65L, 32L, 15L, 26L, 4L, 31L)),
.Names = c("lhs", "rhs"), class = "data.frame", row.names = c(NA, -6L))
#new left hand side
newlhs<-sapply(as.character(input$lhs), function(x){
strs<-unlist(strsplit(x, ","))
f<-factor(strs, levels=df$id, labels=df$name)
paste(f, collapse = ", ")
})
#new right hand side
newrhs<-sapply(as.character(input$rhs), function(x){
strs<-unlist(strsplit(x, ","))
f<-factor(strs, levels=df$id, labels=df$name)
paste(f, collapse = ", ")
})
answer<-data.frame(newlhs, newrhs)
row.names(answer)<-NULL #remove rownames
Not so idiomatic but I win the code golf :) :
as.data.frame(lapply(dat1, function(x){
for (i in seq(nrow(dat2))) x <- gsub(paste0("(^|,)",dat2$id[i],"(,|$)"),
paste0("\\1",dat2$name[i],"\\2"),x)
x}))
# lhs rhs
# 1 flour,rolls,coffee soda
# 2 rolls,coffee,soda flour
# 3 cream chese,butter,condensed milk meat spreads
# 4 butter,whole milk,yougurt condensed milk
# 5 butter,condensed milk,33 yougurt
# 6 sugar curd
This may fail if the names in the second dataset contain numbers.
This is how I total up the grades (there are more columns than shown here):
test<-with(data,table(Student,Subject))
test <- cbind(test)
test <- as.data.frame(test)
row.names Maths Science English Geography History Art ...
1 George 64 70 40 50 60 70
2 Anna 40 20 65 54 30 50
3 Scott 30 64 30 40 50 20
...
Summarize <- data.frame(
aggregate(.~Maths, data=test, min),
aggregate(.~English, data=test, max),
aggregate(.~Science, data=test, mean))
Is there a way to select all the columns at once and aggregate each of them (range and mean) into a new data frame, like this?
Min Mean Max
Maths 30 60 90
Science
English
Geography
...
Thanks in advance !
Try:
library(dplyr)
library(tidyr)
df %>%
summarise_each(funs(min=min(., na.rm=TRUE), max=max(., na.rm=TRUE),
mean=mean(., na.rm=TRUE)), -Student) %>%
gather(Var, Value, Maths_min:Art_mean) %>%
separate(Var, c("Subject", "Var")) %>%
spread(Var, Value)
# Subject max mean min
#1 Art 70 46.66667 20
#2 English 65 45.00000 30
#3 Geography 54 48.00000 40
#4 History 60 46.66667 30
#5 Maths 64 44.66667 30
#6 Science 70 51.33333 20
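Recent dplyr versions deprecate summarise_each() and funs(), and tidyr now recommends pivot_longer() over gather(). A sketch of the same summary in the newer style follows, assuming dplyr and tidyr 1.0.0 or later; the data frame is rebuilt from the dput at the end of this answer and the name res is mine.

```r
library(dplyr)
library(tidyr)

# Same values as the df defined in the data section of this answer
df <- data.frame(
  Student = c("George", "Anna", "Scott"),
  Maths = c(64, 40, 30),
  Science = c(70, 20, 64),
  English = c(40, 65, 30),
  Geography = c(50, 54, 40),
  History = c(60, 30, 50),
  Art = c(70, 50, 20)
)

res <- df %>%
  # one row per Student / Subject pair
  pivot_longer(-Student, names_to = "Subject", values_to = "Value") %>%
  group_by(Subject) %>%
  summarise(
    Min = min(Value, na.rm = TRUE),
    Mean = mean(Value, na.rm = TRUE),
    Max = max(Value, na.rm = TRUE)
  )

res
```

Reshaping to long format first means the subject columns never have to be listed by name, so the same code works however many subjects there are.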
Update
Or you could use aggregate with melt
library(reshape2)
res <- aggregate(value~variable, melt(df, id="Student"),
FUN=function(x) c(Min=min(x, na.rm=TRUE), Mean=mean(x, na.rm=TRUE),
Max=max(x, na.rm=TRUE)))
res1 <- do.call(`data.frame`, res)
colnames(res1) <- gsub(".*\\.", "", colnames(res1))
res1
# variable Min Mean Max
#1 Maths 30 44.66667 64
#2 Science 20 51.33333 70
#3 English 30 45.00000 65
#4 Geography 40 48.00000 54
#5 History 30 46.66667 60
#6 Art 20 46.66667 70
Or using only base R
res2 <- do.call(`data.frame`,
aggregate(values~ind, stack(df, select=-1),
FUN=function(x) c(Min=min(x, na.rm=TRUE), Mean=mean(x, na.rm=TRUE),
Max=max(x, na.rm=TRUE))))
colnames(res2) <- gsub(".*\\.", "", colnames(res2))
res2
# ind Min Mean Max
#1 Art 20 46.66667 70
#2 English 30 45.00000 65
#3 Geography 40 48.00000 54
#4 History 30 46.66667 60
#5 Maths 30 44.66667 64
#6 Science 20 51.33333 70
data
df <- structure(list(Student = c("George", "Anna", "Scott"), Maths = c(64L,
40L, 30L), Science = c(70L, 20L, 64L), English = c(40L, 65L,
30L), Geography = c(50L, 54L, 40L), History = c(60L, 30L, 50L
), Art = c(70L, 50L, 20L)), .Names = c("Student", "Maths", "Science",
"English", "Geography", "History", "Art"), class = "data.frame", row.names = c(NA,
-3L))