Constructing a Boxplot from a dataframe consisting of Multivalue columns


Suppose that we have a dataframe in which one of the columns represents a list of numerical data entries.
"ID","Costs"
"tim","1, 2, 3, 4, 5, 6, 7, 8"
"ryan","8, 7, 6, 5, 4, 3, 2, 1"
"bob","1, 3, 5, 7, 9, 11, 13, 15"
If I wanted to construct a boxplot of costs with respect to ID, how would I approach doing so?

A base R solution is pretty much a one-liner, since boxplot() will accept a list as input:
boxplot(lapply(strsplit(dat$Costs, ",\\s+"), as.numeric), names=dat$ID)
dat in this case being:
dat <- structure(list(ID = c("tim", "ryan", "bob"), Costs = c("1, 2, 3, 4, 5, 6, 7, 8",
"8, 7, 6, 5, 4, 3, 2, 1", "1, 3, 5, 7, 9, 11, 13, 15")), .Names = c("ID",
"Costs"), class = "data.frame", row.names = c(NA, -3L))
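To see what the one-liner does, it can help to unpack it into steps. The following is a minimal sketch rather than part of the original answer: the intermediate name `costs` is illustrative, the split pattern is loosened to `",\\s*"` so a missing space after a comma is tolerated, and `plot = FALSE` is added so that boxplot() returns its statistics instead of drawing.

```r
dat <- data.frame(
  ID = c("tim", "ryan", "bob"),
  Costs = c("1, 2, 3, 4, 5, 6, 7, 8",
            "8, 7, 6, 5, 4, 3, 2, 1",
            "1, 3, 5, 7, 9, 11, 13, 15"),
  stringsAsFactors = FALSE
)

# Split each string on a comma followed by optional whitespace,
# then convert the pieces to numbers: one numeric vector per row.
costs <- lapply(strsplit(dat$Costs, ",\\s*"), as.numeric)
names(costs) <- dat$ID

# boxplot() accepts a named list. With plot = FALSE it only returns the
# box statistics; drop that argument to draw the chart.
bp <- boxplot(costs, plot = FALSE)
bp$names       # group labels, in row order
bp$stats[3, ]  # third row of the statistics matrix holds the medians
```

Naming the list elements up front has the same effect as passing names = dat$ID to boxplot().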

Assuming that the data are as given in your example, i.e. column Costs contains quoted characters separated by comma + space:
df1 <- read.csv(text = '"ID","Costs"
"tim","1, 2, 3, 4, 5, 6, 7, 8"
"ryan","8, 7, 6, 5, 4, 3, 2, 1"
"bob","1, 3, 5, 7, 9, 11, 13, 15"',
header = TRUE,
stringsAsFactors = FALSE)
Then you can separate the values using unnest, convert to numeric and plot:
library(tidyverse)
df1 %>%
  mutate(Costs = str_split(Costs, ", ")) %>%  # list-column of character vectors
  unnest(Costs) %>%                            # one row per value
  mutate(Costs = as.numeric(Costs)) %>%
  ggplot(aes(ID, Costs)) +
  geom_boxplot()

If you want a base solution, here's one possibility:
boxplot(values ~ ind,
        data = stack(                      # stack() converts wide to long
          data.frame(
            apply(df1, 1, function(r)
              setNames(list(scan(text = r[2], sep = ",")),  # numeric Costs
                       r[1])))))                            # named by 'ID'
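The same wide-to-long reshaping can also be written out explicitly in base R, which may be easier to read than the nested call. This is a sketch assuming the values are comma-separated as in the question; the names `vals` and `long` are illustrative, and `plot = FALSE` is only there so the call returns statistics without opening a graphics device.

```r
df1 <- read.csv(text = '"ID","Costs"
"tim","1, 2, 3, 4, 5, 6, 7, 8"
"ryan","8, 7, 6, 5, 4, 3, 2, 1"
"bob","1, 3, 5, 7, 9, 11, 13, 15"',
                header = TRUE,
                stringsAsFactors = FALSE)

# One character vector of values per row of df1
vals <- strsplit(df1$Costs, ",\\s*")

# Repeat each ID once per value so the two columns line up
long <- data.frame(
  ID = rep(df1$ID, lengths(vals)),
  Costs = as.numeric(unlist(vals))
)

# Formula interface: one box per ID (drop plot = FALSE to draw it)
bp <- boxplot(Costs ~ ID, data = long, plot = FALSE)
```

With the formula interface the groups are ordered by factor level, so the boxes appear alphabetically by ID rather than in row order.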

Related

Display only specific information in hovertext of radarchart

I created the radar chart below and would like to know whether it is possible to exclude every person whose value is 6 from the hover text.
library(radarchart)
# Using the data frame interface
chartJSRadar(scores=skills)
# Or using a list interface
labs <- c("Communicator", "Data Wangler", "Programmer", "Technologist", "Modeller", "Visualizer")
scores <- list("Rich" = c(9, 7, 4, 5, 3, 7),
"Andy" = c(7, 6, 6, 2, 6, 9),
"Aimee" = c(6, 5, 8, 4, 7, 6))
# Default settings
chartJSRadar(scores=scores, labs=labs)

Code to analyze relationships between responses to different ranking questions on a survey

My goal is to find much simpler code, which can generalize, that shows the relationships between responses to two survey questions. In the MWE, one question asked respondents to rank eight marketing selections from 1 to 8 and the other asked them to rank nine attribute selections from 1 to 9. Higher rankings indicate the respondent favored the selection more. Here is the data frame.
structure(list(Email = c("a", "b", "c", "d", "e", "f", "g", "h",
"i"), Ads = c(2, 1, 1, 1, 1, 2, 1, 1, 1), Alumni = c(3, 2, 2,
3, 2, 3, 2, 2, 2), Articles = c(6, 4, 3, 2, 3, 4, 3, 3, 3), Referrals = c(4,
3, 4, 8, 7, 8, 8, 6, 4), Speeches = c(7, 7, 6, 7, 4, 7, 4, 5,
5), Updates = c(8, 6, 6, 5, 5, 5, 5, 7, 6), Visits = c(5, 8,
7, 6, 6, 6, 6, 4, 8), `Business Savvy` = c(10, 6, 10, 10, 4,
4, 6, 8, 9), Communication = c(4, 3, 8, 3, 3, 9, 7, 6, 7), Experience = c(7,
7, 7, 9, 2, 8, 5, 9, 5), Innovation = c(2, 1, 4, 2, 1, 2, 2,
1, 1), Nearby = c(3, 2, 2, 1, 5, 3, 3, 2, 2), Personal = c(8,
10, 6, 8, 6, 10, 4, 3, 3), Rates = c(9, 5, 9, 6, 9, 7, 10, 5,
4), `Staffing Model` = c(6, 8, 5, 5, 7, 5, 8, 7, 8), `Total Cost` = c(5,
4, 3, 7, 8, 6, 9, 4, 6)), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
If numeric rankings cannot be used for my solution to calculating relationships (correlations), please correct me.
Hoping they can be used, I arrived at the following plodding code, which I hope calculates the correlation matrix of each method selection against each attribute selection.
library(psych)
dataframe2 <- psych::corr.test(dataframe[ , c(2, 9:17)])[[1]][1:10] # the first method vs all attributes
dataframe3 <- psych::corr.test(dataframe[ , c(3, 9:17)])[[1]][1:10] # the 2nd method vs all attributes and so on
dataframe4 <- psych::corr.test(dataframe[ , c(4, 9:17)])[[1]][1:10]
dataframe5 <- psych::corr.test(dataframe[ , c(5, 9:17)])[[1]][1:10]
dataframe6 <- psych::corr.test(dataframe[ , c(6, 9:17)])[[1]][1:10]
dataframe7 <- psych::corr.test(dataframe[ , c(7, 9:17)])[[1]][1:10]
dataframe8 <- psych::corr.test(dataframe[ , c(8, 9:17)])[[1]][1:10]
# create a dataframe from the rbinded rows
bind <- data.frame(rbind(dataframe2, dataframe3, dataframe4, dataframe5, dataframe6, dataframe7, dataframe8))
Rename rows and columns:
colnames(bind) <- c("Sel", colnames(dataframe[9:17]))
rownames(bind) <- colnames(dataframe[2:8])
How can I accomplish the above more efficiently?
By the way, the bind data frame also allows one to produce a heat map with the DataExplorer package.
library(DataExplorer)
DataExplorer::plot_correlation(bind)

[Summary]
In the scope of our discussion, there are two ways to get the correlation data.
Use stats::cor, i.e., cor(subset(dataframe, select = -Email))
Use psych::corr.test, i.e., corr.test(subset(dataframe, select = -Email))[[1]]
Then you may subset the correlation matrix with the desired rows and columns.
To use DataExplorer::plot_correlation, you can simply call plot_correlation(dataframe, type = "c"). Note: the resulting heat map includes correlations for all columns, so you can ignore the columns that are not of interest.
[Original Answer]
## Create data
dataframe <- structure(
list(
Email = c("a", "b", "c", "d", "e", "f", "g", "h", "i"),
Ads = c(2, 1, 1, 1, 1, 2, 1, 1, 1),
Alumni = c(3, 2, 2, 3, 2, 3, 2, 2, 2),
Articles = c(6, 4, 3, 2, 3, 4, 3, 3, 3),
Referrals = c(4, 3, 4, 8, 7, 8, 8, 6, 4),
Speeches = c(7, 7, 6, 7, 4, 7, 4, 5, 5),
Updates = c(8, 6, 6, 5, 5, 5, 5, 7, 6),
Visits = c(5, 8, 7, 6, 6, 6, 6, 4, 8),
`Business Savvy` = c(10, 6, 10, 10, 4, 4, 6, 8, 9),
Communication = c(4, 3, 8, 3, 3, 9, 7, 6, 7),
Experience = c(7, 7, 7, 9, 2, 8, 5, 9, 5),
Innovation = c(2, 1, 4, 2, 1, 2, 2, 1, 1),
Nearby = c(3, 2, 2, 1, 5, 3, 3, 2, 2),
Personal = c(8, 10, 6, 8, 6, 10, 4, 3, 3),
Rates = c(9, 5, 9, 6, 9, 7, 10, 5, 4),
`Staffing Model` = c(6, 8, 5, 5, 7, 5, 8, 7, 8),
`Total Cost` = c(5, 4, 3, 7, 8, 6, 9, 4, 6)
),
row.names = c(NA, -9L),
class = c("tbl_df", "tbl", "data.frame")
)
Following your example strictly, we can do the following:
## Calculate correlation
df2 <- subset(dataframe, select = -Email)
marketing_selections <- names(df2)[1:7]
attribute_selections <- names(df2)[8:16]
corr_matrix <- psych::corr.test(df2)[[1]]
bind <- subset(corr_matrix,
subset = rownames(corr_matrix) %in% marketing_selections,
select = attribute_selections)
DataExplorer::plot_correlation(bind)
WARNING
However, is this what you really want? psych::corr.test already returns the correlation matrix, and DataExplorer::plot_correlation computes correlations again on whatever data it is given. Passing bind to it therefore plots the correlation of the correlations.
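A more direct route to the method-by-attribute block is to hand the two groups of columns to stats::cor as separate x and y arguments, which returns only the cross-correlations and avoids both the repeated calls and the subsetting step. The sketch below is an illustration, not the original answer's code: the name `cross_cor` is made up, the column positions follow the question's data, and method = "spearman" is an assumption on my part that a rank correlation suits ranking data.

```r
dataframe <- data.frame(
  Email = c("a", "b", "c", "d", "e", "f", "g", "h", "i"),
  Ads = c(2, 1, 1, 1, 1, 2, 1, 1, 1),
  Alumni = c(3, 2, 2, 3, 2, 3, 2, 2, 2),
  Articles = c(6, 4, 3, 2, 3, 4, 3, 3, 3),
  Referrals = c(4, 3, 4, 8, 7, 8, 8, 6, 4),
  Speeches = c(7, 7, 6, 7, 4, 7, 4, 5, 5),
  Updates = c(8, 6, 6, 5, 5, 5, 5, 7, 6),
  Visits = c(5, 8, 7, 6, 6, 6, 6, 4, 8),
  `Business Savvy` = c(10, 6, 10, 10, 4, 4, 6, 8, 9),
  Communication = c(4, 3, 8, 3, 3, 9, 7, 6, 7),
  Experience = c(7, 7, 7, 9, 2, 8, 5, 9, 5),
  Innovation = c(2, 1, 4, 2, 1, 2, 2, 1, 1),
  Nearby = c(3, 2, 2, 1, 5, 3, 3, 2, 2),
  Personal = c(8, 10, 6, 8, 6, 10, 4, 3, 3),
  Rates = c(9, 5, 9, 6, 9, 7, 10, 5, 4),
  `Staffing Model` = c(6, 8, 5, 5, 7, 5, 8, 7, 8),
  `Total Cost` = c(5, 4, 3, 7, 8, 6, 9, 4, 6),
  check.names = FALSE  # keep the column names that contain spaces
)

# Columns 2:8 are the marketing methods, 9:17 the attributes.
# cor(x, y) correlates every column of x with every column of y.
cross_cor <- cor(dataframe[2:8], dataframe[9:17], method = "spearman")

dim(cross_cor)       # 7 methods by 9 attributes
round(cross_cor, 2)  # rows are methods, columns are attributes
```

Because the result is already a correlation matrix, it should be drawn with a plain heat map function such as stats::heatmap rather than passed to a function that computes correlations again.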

Selecting columns using ends_with helper and a vector of string names

I have a data frame, in wide format, with each column representing one questionnaire item for one particular version of a questionnaire for a particular time point (repeated measures design).
My data would look something like the following:
df <- data.frame(id = c(1:5), t1_QOL_child_Q1 = c(5, 3, 6, 2, 7), t1_QOL_child_Q2 = c(5, 2, 3, 7, 1), t1_QOL_child_Q3 = c(7, 7, 6, 2, 5), t1_QOL_child_joy = c(9,9, 5, 3, 6), t1_QOL_teen_Q1 = c(5, 3, 6, 2, 7), t1_QOL_teen_Q2 = c(5, 2, 3, 7, 1), t1_QOL_teen_Q3 = c(7, 7, 6, 2, 5), t1_QOL_teen_joy = c(5, 7, 4, 7, 9), t1_QOL_adult_Q1 = c(5, 3, 6, 2, 7), t1_QOL_adult_Q2 = c(5, 2, 3, 7, 1), t1_QOL_adult_Q3 = c(7, 7, 6, 2, 5), t1_QOL_adult_joy = c(6, 5, 3, 3, 2), t2_QOL_child_Q1 = c(5, 3, 6, 2, 7), t2_QOL_child_Q2 = c(5, 2, 3, 7, 1), t2_QOL_child_Q3 = c(7, 7, 6, 2, 5), t2_QOL_child_joy = c(9,9, 5, 3, 6), t2_QOL_teen_Q1 = c(5, 3, 6, 2, 7), t2_QOL_teen_Q2 = c(5, 2, 3, 7, 1), t2_QOL_teen_Q3 = c(7, 7, 6, 2, 5), t2_QOL_teen_joy = c(5, 7, 4, 7, 9), t2_QOL_adult_Q1 = c(5, 3, 6, 2, 7), t2_QOL_adult_Q2 = c(5, 2, 3, 7, 1), t2_QOL_adult_Q3 = c(7, 7, 6, 2, 5), t2_QOL_adult_joy = c(6, 5, 3, 3, 2))
For example, column t1_QOL_child_Q1 would mean Question 1 (Q1) of the child version (child) of Quality of Life (QOL) questionnaire, with time point 1 (t1) data.
I want to select only the subscales, i.e. the columns whose suffix differs from the regular item labels. In the sample data above, these are the columns ending with "joy".
I have over 3000 columns and many more suffixes and it would be a pain to use the following:
select(df, ends_with("joy"), ends_with(<another suffix>), ends_with(<another suffix>))
I have thought of putting all the potential suffixes in a string vector, and use the vector as an input to the ends_with function, but ends_with could only take a single string instead of a vector of strings.
I have searched Stack Overflow and found a solution that can accommodate a small vector of strings:
select(df, sapply(vector_of_strings, starts_with))
However, I have too many suffixes in my vector of strings and the following error message resulted from it: Error: sapply(vector_of_strings, ends_with) must resolve to integer column positions, not a list
Help appreciated. Thanks!

We can use a single matches() call with multiple patterns separated by |, anchored with $ so that they only match at the end of the column name:
df %>%
select(matches("(joy|Q2)$"))
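With many suffixes, the pattern does not have to be typed by hand; it can be assembled from the vector of suffixes. The sketch below uses only base R so it runs without dplyr, and the names `suffixes`, `pattern` and `cols` are illustrative. It assumes the suffixes contain no regular-expression metacharacters (such as . or +); those would need escaping first.

```r
# Small stand-in for the wide questionnaire data
df <- data.frame(
  id = 1:5,
  t1_QOL_child_Q1 = c(5, 3, 6, 2, 7),
  t1_QOL_child_Q2 = c(5, 2, 3, 7, 1),
  t1_QOL_child_joy = c(9, 9, 5, 3, 6),
  t2_QOL_teen_Q2 = c(5, 2, 3, 7, 1),
  t2_QOL_teen_joy = c(5, 7, 4, 7, 9)
)

suffixes <- c("joy", "Q2")

# Collapse the suffixes into one alternation anchored at the end of the name
pattern <- paste0("(", paste(suffixes, collapse = "|"), ")$")

# Base R selection by column name
cols <- grep(pattern, names(df), value = TRUE)
selected <- df[cols]
```

The same `pattern` string can be passed to select(df, matches(pattern)). Recent versions of tidyselect also document that ends_with() accepts a character vector and takes the union of the matches, so select(df, ends_with(suffixes)) may work directly, depending on the installed version.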

"Object 'freq' not found" error applying colour in UpSetR

If I run this reprex, I get the required output:
``` r
library(UpSetR)
listInput <- list(one = c(1, 2, 3, 5, 7, 8, 11, 12, 13),
two = c(1, 2, 4, 5, 10),
three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))
upset(fromList(listInput), order.by = "freq")
```
If I apply a colour, I get the following error.
``` r
library(UpSetR)
listInput <- list(one = c(1, 2, 3, 5, 7, 8, 11, 12, 13),
two = c(1, 2, 4, 5, 10),
three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))
upset(fromList(listInput), order.by = "freq",
queries = list(list(query = intersects, params = list("one"), color = "orange", active = T)))
#> Error in eval(expr, envir, enclos): object 'freq' not found
```
I've looked at the colouring "example 5" in the vignettes, but can't spot my misstep.

As a workaround, add a column of integers to the data frame that is passed to upset:
library(UpSetR)
listInput <- list(one = c(1, 2, 3, 5, 7, 8, 11, 12, 13),
two = c(1, 2, 4, 5, 10),
three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))
df <- fromList(listInput)
df$n <- sample(1:nrow(df))
upset(df, order.by = "freq",
queries = list(list(query = intersects,
params = list("one"),
color = "orange")))

Is there a simple way of pairing unique data points in a data frame?

I want to extract pairs of data from a data frame, where they are paired with data that is not in their own column. Each number in column 1 is paired with all the numbers to the right of that column. Likewise numbers in column 2 are only paired with numbers in columns 3 or above.
I have created a script that does it using a bird's nest of 'for' loops but I feel there should be a more elegant way to do it.
Example data:
structure(list(A = 1:3, B = 4:6, C = 7:9), .Names = c("A", "B",
"C"), class = "data.frame", row.names = c(NA, -3L))
Desired output:
structure(list(X1 = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3,
3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6), X2 = c(4, 5, 6, 7,
8, 9, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 9, 7, 8, 9, 7, 8, 9, 7,
8, 9)), .Names = c("X1", "X2"), row.names = c(NA, 27L), class = "data.frame")

Here's an approach using the data.table package and its very efficient CJ and rbindlist functions (assuming your data set is called df):
library(data.table)
res <- rbindlist(lapply(seq_len(length(df) - 1),
function(i) CJ(df[, i], unlist(df[, -(seq_len(i))]))))
You could then set your column names by reference (if you insist on "X1" and "X2") using setnames
setnames(res, 1:2, c("X1", "X2"))
You can also convert back to data.frame by reference (if you want to match your desired output "exactly") by using setDF()
setDF(res)

Here df is the input dataset
out1 <- do.call(rbind,lapply(1:(ncol(df)-1), function(i) {
x1 <- df[,i:(ncol(df))]
Un1 <-unique(unlist(x1[,-1]))
data.frame(X1=rep(x1[,1], each=length(Un1)), X2= Un1)}))
all.equal(out, out1) #if `out` is the expected output
#[1] TRUE

Another approach:
res <- do.call(rbind, unlist(lapply(seq(ncol(dat) - 1), function(x)
lapply(seq(x + 1, ncol(dat)), function(y)
"names<-"(expand.grid(dat[c(x, y)]), c("X1", "X2")))),
recursive = FALSE))
where dat is the name of your data frame.
You can sort the result with this command:
res[order(res[[1]], res[[2]]), ]
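The nested lapply calls above enumerate every pair of columns (x, y) with x to the left of y. A variation on that idea, sketched here as an illustration rather than as one of the posted answers, lets combn() generate the column pairs; the name `pairs` and the use of `df` for the example data are illustrative.

```r
df <- data.frame(A = 1:3, B = 4:6, C = 7:9)

# combn() visits each pair of column positions (i < j) exactly once.
# For each pair, expand.grid() builds all combinations of the two columns.
pairs <- combn(ncol(df), 2, function(idx) {
  setNames(expand.grid(df[[idx[1]]], df[[idx[2]]]), c("X1", "X2"))
}, simplify = FALSE)

# Stack the per-pair results and sort to match the desired output
res <- do.call(rbind, pairs)
res <- res[order(res$X1, res$X2), ]
rownames(res) <- NULL
```

Unlike the answer that calls unique(), this keeps duplicate values if a column contains any, so the two approaches only agree when the values within each column are distinct, as in the example data.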
