Barplot dplyr summarized values - r

I have data from a top 3 ranking. I'm trying to create a plot that would have on the x axis the column name (cost/product), and the y value be the frequency (ideally relative frequency but I'm not sure how to get that in dplyr).
I'm trying to create this in plotly from values summarized in dplyr. I have a dplyr data frame that looks something like this:
likelyReasonFreq<- LikelyRenew_Reason %>%
filter(year==3)%>%
filter(status==1)%>%
summarize(costC = count(cost),
productsC = count(products))
> likelyReasonFreq
costC.x costC.freq productsC.x productsC.freq
1 1 10 1 31
2 2 11 2 40
3 3 17 3 30
4 NA 149 NA 86
I'm trying to create a barplot that shows the total (summed) frequency for cost and for products. So the frequency for cost would be the number of times it was ranked 1, 2, or 3, so 38. Essentially I'm summing rows 1:3, not including NA values (for products it would be 101).
I'm not sure how to go about this, any ideas??
below is the variable likelyReasonFreq
> dput(head(likelyReasonFreq))
structure(list(costC = structure(list(x = c(1, 2, 3, NA), freq = c(10L,
11L, 17L, 149L)), .Names = c("x", "freq"), row.names = c(NA,
4L), class = "data.frame"), productsC = structure(list(x = c(1,
2, 3, NA), freq = c(31L, 40L, 30L, 86L)), .Names = c("x", "freq"
), row.names = c(NA, 4L), class = "data.frame")), .Names = c("costC",
"productsC"), row.names = c(NA, 4L), class = "data.frame")
I appreciate any advice!

Your data structure is a little awkward to work with: each column is itself a data frame, which you can see by running str or glimpse on it. You can fix this as shown below and then plot it.
> str(df)
'data.frame': 4 obs. of 2 variables:
$ costC :'data.frame': 4 obs. of 2 variables:
..$ x : num 1 2 3 NA
..$ freq: int 10 11 17 149
$ productsC:'data.frame': 4 obs. of 2 variables:
..$ x : num 1 2 3 NA
..$ freq: int 31 40 30 86
Code to follow for plotting:
library(ggplot2)
library(tidyverse)
df <- df %>% map(unnest) %>% bind_rows(.id="Name") %>% na.omit() #fixing the structure of column taken as a set of two separate columns
df %>%
ggplot(aes(x=Name, y= freq)) +
geom_col()
I hope this is what is expected, although I am not entirely sure of it.
Input data given:
df <- structure(list(costC = structure(list(x = c(1, 2, 3, NA), freq = c(10L,
11L, 17L, 149L)), .Names = c("x", "freq"), row.names = c(NA,
4L), class = "data.frame"), productsC = structure(list(x = c(1,
2, 3, NA), freq = c(31L, 40L, 30L, 86L)), .Names = c("x", "freq"
), row.names = c(NA, 4L), class = "data.frame")), .Names = c("costC",
"productsC"), row.names = c(NA, 4L), class = "data.frame")
Added after OP request:
Here, I have not removed the NAs; instead I have replaced them with a new value '4'. To get the relative frequency within each group, I have used cumsum to add up the counts for ranks 1 to 3 and then divided by the total of that group (including the former NA row).
df <- df %>% map(unnest) %>% bind_rows(.id="Name")
df[is.na(df$x),"x"] <- 4
df %>%
group_by(Name) %>%
mutate(sum_Freq = sum(freq), cum_Freq = cumsum(freq)) %>%
filter(x == 3) %>%
mutate(new_x = cum_Freq*100/sum_Freq) %>%
ggplot(aes(x=Name, y = new_x)) +
geom_col()
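For reference, the two totals the question asks for (38 for cost, 101 for products) can also be checked without any packages. This is only a minimal base R sketch, assuming the nested structure from the question's dput() output; the names ranked_total and ranked_share are made up for illustration.

```r
# Nested data frame rebuilt from the question's dput() output
df <- structure(list(
  costC = structure(list(x = c(1, 2, 3, NA), freq = c(10L, 11L, 17L, 149L)),
                    class = "data.frame", row.names = c(NA, 4L)),
  productsC = structure(list(x = c(1, 2, 3, NA), freq = c(31L, 40L, 30L, 86L)),
                        class = "data.frame", row.names = c(NA, 4L))),
  class = "data.frame", row.names = c(NA, 4L))

# Sum the counts for ranks 1-3 only (rows where x is not NA)
ranked_total <- sapply(df, function(d) sum(d$freq[!is.na(d$x)]))

# Share of all rows (including the NA row) that carry a rank
ranked_share <- ranked_total / sapply(df, function(d) sum(d$freq))

# One bar per column name, height = relative frequency of a top-3 ranking
barplot(ranked_share, ylab = "Relative frequency of a top-3 ranking")
```

Passing ranked_total instead of ranked_share to barplot() would give the summed counts rather than the relative frequencies.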

Related

How to merge data frame by picking up selected values based on some logical criteria?

I have 3 data frames with a similar structure, and I am trying to fill a 4th data frame with values from the first 3 data frames based on a logical condition.
My data frame 1
Account id Value $ RMSE
1 500 10
2 7000 15
3 1900 20
My data frame 2
Account id Value $ RMSE
1 400 5
2 8000 18
3 1700 18
My data frame 3
Account id Value $ RMSE
1 500 10
2 2000 25
3 5000 0.2
My desired result is (Value picked up from the data frame which has the lowest corresponding RMSE)
Account id Value $
1 400
2 7000
3 5000
Request your help on how to merge.
For this problem, first bind all your data frames by row. After that you can use tidyverse functions to filter within each group defined by the account id. Here is the code with a tidyverse approach:
library(tidyverse)
#Code
ndf <- do.call(bind_rows,list(df1,df2,df3)) %>%
group_by(Account.id) %>%
filter(RMSE==min(RMSE)) %>% select(Account.id,Value) %>%
arrange(Account.id)
Output:
# A tibble: 3 x 2
# Groups: Account.id [3]
Account.id Value
<int> <int>
1 1 400
2 2 7000
3 3 5000
Some data used:
#Data 1
df1 <- structure(list(Account.id = 1:3, Value = c(500L, 7000L, 1900L
), RMSE = c(10L, 15L, 20L)), class = "data.frame", row.names = c(NA,
-3L))
#Data 2
df2 <- structure(list(Account.id = 1:3, Value = c(400L, 8000L, 1700L
), RMSE = c(5L, 18L, 18L)), class = "data.frame", row.names = c(NA,
-3L))
#Data 3
df3 <- structure(list(Account.id = 1:3, Value = c(500L, 2000L, 5000L
), RMSE = c(10, 25, 0.2)), class = "data.frame", row.names = c(NA,
-3L))
An option with data.table
library(data.table)
rbindlist(list(df1, df2, df3))[, .(Value = Value[which.min(RMSE)]), .(Account.id)]
# Account.id Value
#1: 1 400
#2: 2 7000
#3: 3 5000
Or with tidyverse using slice_min after binding the datasets together with bind_rows
library(dplyr)
bind_rows(df1, df2, df3) %>%
group_by(Account.id) %>%
slice_min(RMSE) %>%
select(-RMSE)
# A tibble: 3 x 2
# Groups: Account.id [3]
# Account.id Value
# <int> <int>
#1 1 400
#2 2 7000
#3 3 5000
df1 <- structure(list(Account.id = 1:3, Value = c(500L, 7000L, 1900L
), RMSE = c(10L, 15L, 20L)), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(Account.id = 1:3, Value = c(400L, 8000L, 1700L
), RMSE = c(5L, 18L, 18L)), class = "data.frame", row.names = c(NA,
-3L))
df3 <- structure(list(Account.id = 1:3, Value = c(500L, 2000L, 5000L
), RMSE = c(10, 25, 0.2)), class = "data.frame", row.names = c(NA,
-3L))
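One difference between these approaches only shows up when two data frames have the same lowest RMSE for an account: filter(RMSE == min(RMSE)) and slice_min() with its defaults keep every tied row, while which.min() keeps only the first. A small sketch of this, assuming dplyr 1.0.0 or later; the tied data frame below is made up for illustration and is not from the question.

```r
library(dplyr)

# Hypothetical data: account 1 has two rows with the same lowest RMSE
tied <- data.frame(Account.id = c(1L, 1L, 2L),
                   Value = c(500L, 400L, 7000L),
                   RMSE = c(5, 5, 15))

# Keeps every row at the group minimum, so account 1 appears twice
keep_all <- tied %>%
  group_by(Account.id) %>%
  filter(RMSE == min(RMSE)) %>%
  ungroup()

# Keeps exactly one row per account (the first one at the minimum)
keep_one <- tied %>%
  group_by(Account.id) %>%
  slice_min(RMSE, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Which behaviour is right depends on whether a tie should produce one value or several; the question does not say.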
A base R option is using merge + aggregate
merge(
df <- do.call(rbind, list(df1, df2, df3)),
aggregate(RMSE ~ Account.id, df, min)
)[c("Account.id","Value")]
which gives
Account.id Value
1 1 400
2 2 7000
3 3 5000

Compare two dataframes in R

I have two dataframes in R and want to compare their rows entry by entry. For each row that appears in both dataframes, I want to check whether the first entry, second entry, etc. of the row in the first dataframe is bigger than the corresponding entry of the row in the second dataframe. It should then give me TRUE if all entries are bigger and the differences are in the interval (0,2). It looks like this.
Dataframe 1
Letter 2011 2012 2013
A 2 3 5
B 6 6 6
C 5 4 8
Dataframe 2
Letter 2011 2012 2013
A 1 1 4
C 5 5 5
Result for example like this (comparing rows A and A and C and C)
Letter 2011 2012 2013
A 1 2 1 TRUE - all ok
C 0 -1 3 FALSE - the second entry of the first table is smaller, and the third entry of the first table is much bigger.
One approach could be to convert the data to long format, perform an inner_join, subtract the values, check if all the values are in range, and get the data back in wide format.
library(dplyr)
library(tidyr)
df1 %>% pivot_longer(cols = -Letter) %>%
inner_join(df2 %>% pivot_longer(cols = -Letter), by = c("Letter", "name")) %>%
mutate(value = value.x - value.y) %>%
group_by(Letter) %>%
mutate(check = all(between(value, 0, 2))) %>%
select(-value.x, -value.y) %>%
pivot_wider()
# Letter check `2011` `2012` `2013`
# <chr> <lgl> <int> <int> <int>
#1 A TRUE 1 2 1
#2 C FALSE 0 -1 3
data
df1 <- structure(list(Letter = c("A", "B", "C"), `2011` = c(2L, 6L,5L),
`2012` = c(3L, 6L, 4L), `2013` = c(5L, 6L, 8L)), row.names = c(NA, -3L),
class = "data.frame")
df2 <- structure(list(Letter = c("A", "C"), `2011` = c(1L, 5L), `2012` = c(1L,
5L), `2013` = 4:5), row.names = c(NA, -2L), class = "data.frame")
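The same comparison can be sketched in base R by lining up the shared rows and subtracting the two numeric blocks as matrices. This is just one possible alternative, assuming the df1 and df2 from the data section above; like between(), it treats the range as inclusive at both ends, which matches the expected result in the question.

```r
df1 <- structure(list(Letter = c("A", "B", "C"), `2011` = c(2L, 6L, 5L),
                      `2012` = c(3L, 6L, 4L), `2013` = c(5L, 6L, 8L)),
                 row.names = c(NA, -3L), class = "data.frame")
df2 <- structure(list(Letter = c("A", "C"), `2011` = c(1L, 5L),
                      `2012` = c(1L, 5L), `2013` = 4:5),
                 row.names = c(NA, -2L), class = "data.frame")

# Rows present in both data frames, and the columns to compare
common <- intersect(df1$Letter, df2$Letter)
years <- setdiff(names(df1), "Letter")

# Align both data frames on the shared letters and subtract element-wise
m1 <- as.matrix(df1[match(common, df1$Letter), years])
m2 <- as.matrix(df2[match(common, df2$Letter), years])
diffs <- m1 - m2

# TRUE only if every difference in the row lies in [0, 2]
in_range <- unname(apply(diffs >= 0 & diffs <= 2, 1, all))

result <- data.frame(Letter = common, diffs, check = in_range,
                     check.names = FALSE, row.names = NULL)
```

check.names = FALSE keeps the year column names as 2011, 2012, 2013 instead of prefixing them with X.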

How to get this dcast'able long table in R?

I am trying to apply dcast on a long table, continuing from the thread answer How to get this data structure in R?
Code
dat.m <- structure(c(150L, 60L, 41L, 61L, 0L, 0L), .Dim = c(3L, 2L), .Dimnames = list(
c("ave_max", "ave", "lepo"), NULL))
library("ggplot2")
library("data.table")
dat.m <- melt(as.data.table(dat.m, keep.rownames = "Vars"), id.vars = "Vars") # https://stackoverflow.com/a/44128640/54964
dat.m
print("New step")
# http://stackoverflow.com/a/44090815/54964
minmax <- dat.m[dat.m$Vars %in% c("ave_max","lepo"), ]
absol <- dat.m[dat.m$Vars %in% c("ave"), ]
#minm <- dcast(minmax, Vars ~ variable)
minm <- dcast(minmax, Vars ~ ...)
absol <- merge(absol, minm, by = "Vars", all.x = T)
absol
#Test function
ggplot(absol, aes(x = Vars, y = value, fill = variable)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = lepo, ymax = ave_max), width = .25)
Output
dcast, melt
Vars variable value
1: ave_max V1 150
2: ave V1 60
3: lepo V1 41
4: ave_max V2 61
5: ave V2 0
6: lepo V2 0
[1] "New step"
Vars variable value V1 V2
1: ave V1 60 NA NA
2: ave V2 0 NA NA
Error in FUN(X[[i]], ...) : object 'lepo' not found
Calls: <Anonymous> ... by_layer -> f -> <Anonymous> -> f -> lapply -> FUN -> FUN
Execution halted
Expected output: to pass the test function ggplot
Testing Uwe's proposal
Aim is to get to this data structure
dat.m <- structure(c(150L, 60L, 41L, 61L, 0L, 0L), .Dim = c(3L, 2L), .Dimnames = list(c("ave_max", "ave", "lepo"), NULL))
from this data structure
dat.m <- structure(list(ave_max = c(15L, 6L), ave = c(6L, NA), lepo = c(4L, NA)), .Names = c("ave_max", "ave", "lepo"), class = "data.frame", row.names = c(NA, -2L))
Attempts
dat.m <- structure(list(ave_max = c(15L, 6L), ave = c(6L, NA), lepo = c(4L, NA)), .Names = c("ave_max", "ave", "lepo"), class = "data.frame", row.names = c(NA, -2L))
# ...
Code and output
dat.m <- setDT(dat.m)
Output wrong
ave_max ave lepo
1: 15 6 4
2: 6 NA NA
Classes ‘data.table’ and 'data.frame': 2 obs. of 3 variables:
$ ave_max: int 15 6
$ ave : int 6 NA
$ lepo : int 4 NA
- attr(*, ".internal.selfref")=<externalptr>
Code and output
dat.m <- as.matrix(dcast(melt(setDT(dat.m), measure.vars = names(dat.m)), variable ~ rowid(variable))[, variable := NULL]);
dimnames(dat.m) <- list(names(dat.m), NULL);
Output wrong
Error in `:=`(variable, NULL) :
Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways.
See help(":=").
R: 3.4.0 (backports)
OS: Debian 8.7.
The OP has edited his question and is supplying the data as a data.frame:
dat.df <- structure(list(ave_max = c(15L, 6L), ave = c(6L, NA), lepo = c(4L, NA)),
.Names = c("ave_max", "ave", "lepo"), class = "data.frame",
row.names = c(NA, -2L))
dat.df
# ave_max ave lepo
#1 15 6 4
#2 6 NA NA
class(dat.df)
#[1] "data.frame"
He is now asking to transform this data.frame into a matrix which is similar to the one used as input data in this answer.
This can be achieved by using data.table:
library(data.table) # CRAN version 1.10.4 used
# transpose the input data frame, use rowid() to create columns,
# remove a character column to ensure matrix will be of type integer,
# finally, coerce to matrix
dat.m2 <- as.matrix(
data.table::dcast(
data.table::melt(setDT(dat.df), measure.vars = names(dat.df)),
variable ~ rowid(variable)
)[, variable := NULL]
)
# add row names, remove column names
dimnames(dat.m2) <- list(names(dat.df), NULL)
dat.m2
# [,1] [,2]
#ave_max 15 6
#ave 6 NA
#lepo 4 NA
str(dat.m2)
# int [1:3, 1:2] 15 6 4 6 NA NA
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:3] "ave_max" "ave" "lepo"
# ..$ : NULL
class(dat.m2)
#[1] "matrix"
Edit: I've amended the above code to use the double colon operator to explicitly state the namespace from which melt() and dcast() should be taken. Normally, this wouldn't be necessary as data.table is already loaded. However, the OP is reporting issues which might be caused by package reshape2 being loaded after data.table. The data.table package has its own faster implementations of reshape2::dcast() and reshape2::melt(). When both packages have been loaded for some reason, name clashes might occur.
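If name clashes between packages are the problem, the data.frame-to-matrix step can probably be done without melt() and dcast() at all. A minimal base R sketch, assuming the same dat.df as above (rebuilt here with data.frame() so the block is self-contained):

```r
# Same values as the dat.df shown above
dat.df <- data.frame(ave_max = c(15L, 6L), ave = c(6L, NA), lepo = c(4L, NA))

# All columns are integer, so as.matrix() gives an integer matrix;
# t() turns the column names into row names
dat.m3 <- t(as.matrix(dat.df))

# Make the dimnames explicit: row names from the columns, no column names
dimnames(dat.m3) <- list(names(dat.df), NULL)
```

This only works cleanly because every column has the same type; with mixed column types as.matrix() would coerce everything to character.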
The OP has supplied data as a matrix:
dat.m <- structure(c(150L, 60L, 41L, 61L, 0L, 0L), .Dim = c(3L, 2L), .Dimnames = list(
c("ave_max", "ave", "lepo"), NULL))
# dat.m
# [,1] [,2]
#ave_max 150 61
#ave 60 0
#lepo 41 0
class(dat.m)
#[1] "matrix"
For this data set, the OP wants to use ggplot2 to create a bar chart with error bars where the height of the bars is given by the values of ave and the lower and upper limits of the error bars by lepo and ave_max, resp., in each column.
As ggplot2 expects data to be supplied as data.frame the data needs to be transformed. For this, data.table is used:
library(data.table) # CRAN version 1.10.4 used
# convert to data.table & transpose
transposed <- dcast(melt(as.data.table(dat.m, keep.rownames = "Vars"),
id.vars = "Vars"), variable ~ ...)
setnames(transposed, "variable", "Vars")
library(ggplot2)
ggplot(transposed, aes(x = Vars, y = ave, ymin = lepo, ymax = ave_max)) +
geom_col() +
geom_errorbar(width = .25)
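The melt/dcast pair above amounts to a transpose, so a base R version is also possible. A sketch under that assumption; the labels V1 and V2 are generated here to mimic the names that as.data.table() gives the unnamed matrix columns.

```r
dat.m <- structure(c(150L, 60L, 41L, 61L, 0L, 0L), .Dim = c(3L, 2L),
                   .Dimnames = list(c("ave_max", "ave", "lepo"), NULL))

# Transpose so that ave_max, ave and lepo become columns
transposed <- as.data.frame(t(dat.m))

# Label each original matrix column (V1, V2) for use on the x axis
transposed$Vars <- paste0("V", seq_len(nrow(transposed)))
```

The resulting data frame has the columns ave_max, ave, lepo and Vars, so it can be passed to the same ggplot() call as the data.table version.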

Manipulating all split data sets

I'm drawing a blank-- I have 51 sets of split data from a data frame that I had, and I want to take the mean of the height of each set.
print(dataset)
$`1`
ID Species Plant Height
1 A 1 42.7
2 A 1 32.5
$`2`
ID Species Plant Height
3 A 2 43.5
4 A 2 54.3
5 A 2 45.7
...
...
...
$`51`
ID Species Plant Height
134 A 51 52.5
135 A 51 61.2
I know how to run each individually, but with 51 split sections, it would take me ages.
I thought that
mean(dataset[,4])
might work, but it says that I have the wrong number of dimensions. I get now why that is incorrect, but I am no closer to figuring out how to average all of the heights.
The dataset is a list. We could use lapply/sapply/vapply etc to loop through the list elements and get the mean of the 'Height' column. Using vapply, we can specify the class and length of the output (numeric(1)). This will be useful for debugging.
vapply(dataset, function(x) mean(x[,4], na.rm=TRUE), numeric(1))
# 1 2 51
#37.60000 47.83333 56.85000
Or another option (if the data.frames in the list have the same column names and number of columns) would be to use rbindlist from data.table with the option idcol=TRUE to generate a single data.table. The '.id' column shows the names of the list elements. We group by '.id' and get the mean of the Height.
library(data.table)
rbindlist(dataset, idcol=TRUE)[, list(Mean=mean(Height, na.rm=TRUE)), by = .id]
# .id Mean
#1: 1 37.60000
#2: 2 47.83333
#3: 51 56.85000
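Or a similar option as above is bind_rows from library(dplyr) with .id = ".id" to return a single dataset with the '.id' column (the original version of this answer used unnest(dataset, .id) from tidyr, which relies on an older tidyr interface and may not work in current releases). Grouped by '.id', we summarise to get the mean of 'Height'.
library(dplyr)
bind_rows(dataset, .id = ".id") %>%
group_by(.id) %>%
summarise(Mean= mean(Height, na.rm=TRUE))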
# .id Mean
#1 1 37.60000
#2 2 47.83333
#3 51 56.85000
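The syntax for plyr is
library(plyr)
# build a single data frame with an '.id' column holding the list names
df1 <- dplyr::bind_rows(dataset, .id = ".id")
ddply(df1, .(.id), summarise, Mean=mean(Height, na.rm=TRUE))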
# .id Mean
#1 1 37.60000
#2 2 47.83333
#3 51 56.85000
data
dataset <- structure(list(`1` = structure(list(ID = 1:2, Species = c("A",
"A"), Plant = c(1L, 1L), Height = c(42.7, 32.5)), .Names = c("ID",
"Species", "Plant", "Height"), class = "data.frame", row.names = c(NA,
-2L)), `2` = structure(list(ID = 3:5, Species = c("A", "A", "A"
), Plant = c(2L, 2L, 2L), Height = c(43.5, 54.3, 45.7)), .Names = c("ID",
"Species", "Plant", "Height"), class = "data.frame", row.names = c(NA,
-3L)), `51` = structure(list(ID = 134:135, Species = c("A", "A"
), Plant = c(51L, 51L), Height = c(52.5, 61.2)), .Names = c("ID",
"Species", "Plant", "Height"), class = "data.frame", row.names = c(NA,
-2L))), .Names = c("1", "2", "51"))
This also works, though it uses dplyr.
library(dplyr)
# 'section' is the position of each element in the list (1, 2, 3, ...), not its name
seq_along(dataset) %>%
lapply(function(i)
dataset[[i]] %>%
mutate(section = i ) ) %>%
bind_rows %>%
group_by(section) %>%
summarize(mean_height = mean(Height) )
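If the original, unsplit data frame is still available, the split step may not be needed at all: grouping by the splitting variable gives the same means directly. A base R sketch, assuming the data were split on the Plant column (which is what the printed list suggests, but the question does not state it); the unsplit data frame is rebuilt here with do.call(rbind, ...) only so the block is self-contained.

```r
# The three list elements shown in the data section above
dataset <- list(
  `1` = data.frame(ID = 1:2, Species = "A", Plant = 1L,
                   Height = c(42.7, 32.5)),
  `2` = data.frame(ID = 3:5, Species = "A", Plant = 2L,
                   Height = c(43.5, 54.3, 45.7)),
  `51` = data.frame(ID = 134:135, Species = "A", Plant = 51L,
                    Height = c(52.5, 61.2)))

# Stand-in for the data frame as it was before splitting
original <- do.call(rbind, dataset)

# Mean height per plant, computed without a split list
by_plant <- aggregate(Height ~ Plant, data = original, FUN = mean)
```

tapply(original$Height, original$Plant, mean) would give the same numbers as a named vector instead of a data frame.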

R: reshape data from column to row and add additional data based on name [duplicate]

This question already has answers here:
Transpose a data frame
(6 answers)
Closed 7 years ago.
I am looking for a way to reshape the following sample data
data <- structure(list(id = c(2L, 5L, 7L), name = structure(1:3, .Label = c("Test1","Test10", "Test8"), class = "factor"), source = structure(c(1L,3L, 2L), .Label = c("A", "T", "Z"), class = "factor")), .Names = c("id", "name", "source"), class = "data.frame", row.names = c(NA, -3L))
id name source
1 2 Test1 A
2 5 Test10 Z
3 7 Test8 T
into the following structure
row.names 1.1 2.1 3.1
id 2 5 7
name Test1 Test10 Test8
source A Z T
and how could I add a second data.frame say, data2 based on the name to data (only the data that contains a matching name)?
data2 <- structure(list(name = structure(1L, .Label = "adddata", class = "factor"), Test1 = 10L, Test10 = 12L, Test8 = 17L, Test12 = 7L), .Names = c("name", "Test1", "Test10", "Test8", "Test12"), class = "data.frame", row.names = c(NA, -1L))
data2
name Test1 Test10 Test8 Test12
1 adddata 10 12 17 7
So that in the end something like the following data.frame which contains only matching names (Test12 from data2 is left out) is the case
datanew
row.names 1.1 2.1 3.1
1 id 2 5 7
2 name Test1 Test10 Test8
3 source A Z T
4 adddata 10 12 17
EDIT
I just realised that my input data contains nested lists like this. Is there a way to implement this?
data <- structure(list(`1.1` = structure(list(id = structure(2, .Dim = c(1L, 1L)), name = structure("Test1", .Dim = c(1L, 1L)), source = structure("A", .Dim = c(1L, 1L))), .Names = c("id", "name", "source")), `2.1` = structure(list(id = structure(5, .Dim = c(1L, 1L)), name = structure("Test10", .Dim = c(1L, 1L)), source = structure("Z", .Dim = c(1L, 1L))), .Names = c("id", "name", "source")), `3.1` = structure(list(id = structure(7, .Dim = c(1L, 1L)), name = structure("Test8", .Dim = c(1L, 1L)), source = structure("T", .Dim = c(1L, 1L))), .Names = c("id", "name", "source"))), .Names = c("1.1", "2.1", "3.1"), class = "data.frame", row.names = c("id", "name", "source"))
'data.frame': 3 obs. of 3 variables:
$ 1.1:List of 3
..$ id : num [1, 1] 2
..$ name : chr [1, 1] "Test1"
..$ source: chr [1, 1] "A"
$ 2.1:List of 3
..$ id : num [1, 1] 5
..$ name : chr [1, 1] "Test10"
..$ source: chr [1, 1] "Z"
$ 3.1:List of 3
..$ id : num [1, 1] 7
..$ name : chr [1, 1] "Test8"
..$ source: chr [1, 1] "T"
You could transpose the first dataset ('data') and rbind the output ('d1') with the columns of 'data2' that we subset by matching the column 'name' in 'data' against the column names of 'data2'.
d1 <- as.data.frame(t(data), stringsAsFactors=FALSE)
res <- rbind(d1, setNames(data2[match(data$name, names(data2))], names(d1)))
rownames(res)[4] <- as.character(data2$name)
res
# V1 V2 V3
#id 2 5 7
#name Test1 Test10 Test8
#source A Z T
#adddata 10 12 17
Or another option is join from data.table
library(data.table)#v1.9.5+
DT <- setDT(data)[melt(data2, id.var='name', value.name='adddata',
variable.name='name')[-1], on='name', nomatch=0]
DT
# id name source adddata
#1: 2 Test1 A 10
#2: 5 Test10 Z 12
#3: 7 Test8 T 17
I would keep in this format rather than transpose it as the columns are of different classes. If we transpose, the numeric and non-numeric elements get mixed together in a column and the class will be either factor or character
t(DT)
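The class mixing mentioned above is easy to see on a small example. A minimal sketch with a made-up two-column data frame (mixed is not part of the original data):

```r
# Hypothetical data frame with one integer and one character column
mixed <- data.frame(id = c(2L, 5L, 7L),
                    name = c("Test1", "Test10", "Test8"),
                    stringsAsFactors = FALSE)

# t() goes through as.matrix(), and a matrix can hold only one type,
# so the integer ids are converted to character strings
tm <- t(mixed)

is.integer(mixed$id)  # the ids are integers in the data frame
is.character(tm)      # after transposing, everything is character
```

Any arithmetic on the ids would therefore need an explicit as.integer() or as.numeric() after the transpose.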
Update
Based on the edited "data", we can unlist the list and convert to 'data.frame'. Then, we can use the steps as before.
data <- setNames(as.data.frame(matrix(unlist(data), ncol=3,
byrow=TRUE)), row.names(data))
DT <- setDT(data)[melt(data2, id.var='name', value.name='adddata',
variable.name='name')[-1], on='name', nomatch=0]
DT
# id name source adddata
#1: 2 Test1 A 10
#2: 5 Test10 Z 12
#3: 7 Test8 T 17
