How can I reshape this data? - r

I have some data to reshape in R but cannot figure out how.
Here is the scenario:
I have data like this
a<- c("exam1", "exam2", "exam3","exam4")
date1<- c(8.2,4.3,6.7,3.9)
date2<- c(11.2,9.3,6.5,4.1)
date3<- c(8.2,9.1,4.3,4.4)
dr.df.a <- cbind(a,date1,date2,date3)
a date1 date2 date3
[1,] "exam1" "8.2" "11.2" "8.2"
[2,] "exam2" "4.3" "9.3" "9.1"
[3,] "exam3" "6.7" "6.5" "4.3"
[4,] "exam4" "3.9" "4.1" "4.4"
b<- c("exam1", "exam2", "exam3","exam4")
date1<- c(8.6,14.3,6.7,13.9)
date2<- c(11.2,8.3,16.5,14.1)
date3<- c(4.2,9.1,4.3,14.4)
dr.df.b <- cbind(b,date1,date2,date3)
b date1 date2 date3
[1,] "exam1" "8.6" "11.2" "4.2"
[2,] "exam2" "14.3" "8.3" "9.1"
[3,] "exam3" "6.7" "16.5" "4.3"
[4,] "exam4" "13.9" "14.1" "14.4"
mylist <- list(dr.df.a, dr.df.b)
The example is for reproducibility purposes. I receive the data in this format (dr.df.a and dr.df.b), and there are multiple such objects in a list.
Now I need to reshape each one into a single row with variable names like
exam1_date1, exam1_date2, exam1_date3, exam2_date1, exam2_date2, ... and so on. Essentially, I would like a data frame with one such row for every element of the list.
How can I reshape this data, and which function should I use?

Try this:
library(reshape2)
# convert the first row (the one defined by variable 'a' in post) into column names
dr.df.2 <- setNames(dr.df[-1,], dr.df[1, ])
m <- melt(dr.df.2)
d <- dcast(m, 1 ~ ...)[-1]
names(d) <- sub("_", "_exam", names(d)) # fix up names (optional)
Giving this:
> d
date1_exam1 date1_exam2 date1_exam3 date1_exam4 date2_exam1 date2_exam2
1 8.2 4.3 6.7 3.9 11.2 9.3
date2_exam3 date2_exam4 date3_exam1 date3_exam2 date3_exam3 date3_exam4
1 6.5 4.1 8.2 9.1 4.3 4.4
UPDATE: simplified dcast formula

If your dr.df object were a data.frame instead of a matrix, you could easily create a named vector, as demonstrated below:
Create your data, but as a data.frame this time:
a <- c("exam1", "exam2", "exam3","exam4")
date1 <- c(8.2,4.3,6.7,3.9)
date2 <- c(11.2,9.3,6.5,4.1)
date3 <- c(8.2,9.1,4.3,4.4)
dr.df <- rbind(date1, date2, date3)
colnames(dr.df) <- a
dr.df <- as.data.frame(dr.df)
dr.df
# exam1 exam2 exam3 exam4
# date1 8.2 4.3 6.7 3.9
# date2 11.2 9.3 6.5 4.1
# date3 8.2 9.1 4.3 4.4
The "reshaping" step
You can now simply use stack to get the data in a long form.
dr.dfL <- data.frame(stack(dr.df), date = rownames(dr.df))
The values for the vector you want are in the "values" column, and the names for those values can be obtained using paste.
setNames(dr.dfL$values, paste(dr.dfL$ind, dr.dfL$date, sep = "_"))
# exam1_date1 exam1_date2 exam1_date3 exam2_date1 exam2_date2 exam2_date3
# 8.2 11.2 8.2 4.3 9.3 9.1
# exam3_date1 exam3_date2 exam3_date3 exam4_date1 exam4_date2 exam4_date3
# 6.7 6.5 4.3 3.9 4.1 4.4
Note that the result here is just a named vector, not a data.frame, as in the other answers.
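The question actually has a list of character matrices (built with cbind), and wants one row per list element. Here is a minimal base R sketch of that, assuming every matrix has the exam labels in its first column and the date columns after it. The helper name to_wide_row is made up for illustration.

```r
# Rebuild the data from the question (character matrices, because of cbind)
a <- c("exam1", "exam2", "exam3", "exam4")
dr.df.a <- cbind(a, date1 = c(8.2, 4.3, 6.7, 3.9),
                 date2 = c(11.2, 9.3, 6.5, 4.1),
                 date3 = c(8.2, 9.1, 4.3, 4.4))
dr.df.b <- cbind(b = a, date1 = c(8.6, 14.3, 6.7, 13.9),
                 date2 = c(11.2, 8.3, 16.5, 14.1),
                 date3 = c(4.2, 9.1, 4.3, 14.4))
mylist <- list(dr.df.a, dr.df.b)

# Hypothetical helper: turn one matrix into a named numeric vector
to_wide_row <- function(m) {
  exams <- m[, 1]
  vals  <- m[, -1, drop = FALSE]
  # t() makes the values come out exam by exam: exam1's dates first, then exam2's, ...
  out <- as.numeric(t(vals))
  names(out) <- paste(rep(exams, each = ncol(vals)), colnames(vals), sep = "_")
  out
}

# One row per list element
wide <- as.data.frame(do.call(rbind, lapply(mylist, to_wide_row)))
```

Here wide has 2 rows and 12 columns named exam1_date1, exam1_date2, and so on. The as.numeric() call is needed because cbind with a character column turns every value into a string.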

You can use reshape from base R:
new <- reshape(dr, varying = list(c("date1","date2","date3")), direction = "long")
new$newname <- apply(new, 1, function(x) paste(x[1],paste("date",x[2],sep=""),sep="_"))
new <- new[,c("date1","newname")]
names(new) <- c("info","exam")
Outputs:
> new
info exam
1.1 8.2 exam1_date1
2.1 4.3 exam2_date1
3.1 6.7 exam3_date1
4.1 3.9 exam4_date1
1.2 11.2 exam1_date2
2.2 9.3 exam2_date2
3.2 6.5 exam3_date2
4.2 4.1 exam4_date2
1.3 8.2 exam1_date3
2.3 9.1 exam2_date3
3.3 4.3 exam3_date3
4.3 4.4 exam4_date3

Related

How to obtain values (e.g. median) from a boxplot in r?

I’ve plotted a boxplot for PM2.5 levels per year.
Boxplot(PM2.5~year, data=subset(dat, hour==12), las=1)
How can I extract values such as the median from the boxplots?
The default boxplot function returns its summaries invisibly; you just have to assign the result to a variable:
res <- boxplot(Sepal.Length ~ Species, data=iris)
Within res there exists an element stats:
> res$stats
[,1] [,2] [,3]
[1,] 4.3 4.9 5.6
[2,] 4.8 5.6 6.2
[3,] 5.0 5.9 6.5
[4,] 5.2 6.3 6.9
[5,] 5.8 7.0 7.9
These are the five-number summaries of the boxes, one column per group: lower whisker, lower hinge, median, upper hinge, and upper whisker. The median is the middle row, so:
> res$stats[3,]
[1] 5.0 5.9 6.5
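If you only want the numbers, a small sketch along the same lines: plot = FALSE skips the drawing, and the names element of the result can be used to label the medians. The variable name medians is just for illustration.

```r
# Compute the boxplot statistics without drawing anything
res <- boxplot(Sepal.Length ~ Species, data = iris, plot = FALSE)

# Row 3 of res$stats holds the medians; res$names holds the group labels
medians <- setNames(res$stats[3, ], res$names)
medians
```

This gives a named vector with one median per group, which is easier to read than the bare matrix row.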

Plotting sales over time in R

I am trying to show the top 100 sales on a scatterplot by year. I used the below code to take top 100 games according to sales and then set it as a data frame.
top100 <- head(sort(games$NA_Sales,decreasing=TRUE), n = 100)
as.data.frame(top100)
I then tried to plot this with the below code:
ggplot(top100)+
aes(x=Year, y = Global_Sales) +
geom_point()
I get the error below when using the subset top100:
Error: data must be a data frame, or other object coercible by fortify(), not a numeric vector
If I use the actual games dataset, I get the plot attached.
Any ideas?
As pointed out in the comments by @CMichael, you have several issues in your code.
In the absence of a reproducible example, I used the iris dataset to explain what is wrong with it.
top100 <- head(sort(games$NA_Sales,decreasing=TRUE), n = 100)
By doing that you are only extracting a single column.
The same command with the iris dataset:
> head(sort(iris$Sepal.Length, decreasing = TRUE), n = 20)
[1] 7.9 7.7 7.7 7.7 7.7 7.6 7.4 7.3 7.2 7.2 7.2 7.1 7.0 6.9 6.9 6.9 6.9 6.8 6.8 6.8
So, first, you no longer have two dimensions to plot in ggplot2. Second, the column names are not kept during the extraction, so you cannot then ask ggplot2 to plot Year and Global_Sales.
So, to solve your issue, you can do the following (here with the iris dataset):
top100 = as.data.frame(head(iris[order(iris$Sepal.Length, decreasing = TRUE), 1:2], n = 100))
And you get a data.frame of this type:
> str(top100)
'data.frame': 100 obs. of 2 variables:
$ Sepal.Length: num 7.9 7.7 7.7 7.7 7.7 7.6 7.4 7.3 7.2 7.2 ...
$ Sepal.Width : num 3.8 3.8 2.6 2.8 3 3 2.8 2.9 3.6 3.2 ...
> head(top100)
Sepal.Length Sepal.Width
132 7.9 3.8
118 7.7 3.8
119 7.7 2.6
123 7.7 2.8
136 7.7 3.0
106 7.6 3.0
And then you can plot it:
library(ggplot2)
ggplot(top100, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
Warning: based on what you provided in your example, I would suggest:
top100 <- as.data.frame(head(games[order(games$NA_Sales,decreasing=TRUE),c("Year","Global_Sales")], 100))
However, if this does not work for you, consider providing a reproducible example of your dataset (see "How to make a great R reproducible example").
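One more detail from the question: as.data.frame(top100) on its own line only prints a converted copy, because the result is never assigned. A small sketch of the difference, using a made-up stand-in for games (the values are invented purely for illustration):

```r
# Made-up stand-in for the asker's `games` data; values are illustrative only
games <- data.frame(
  Year         = c(2001, 2002, 2003, 2004, 2005),
  NA_Sales     = c(5, 9, 2, 7, 4),
  Global_Sales = c(10, 20, 6, 15, 9)
)

# Original approach: sort() on one column returns a bare numeric vector
top_vec <- head(sort(games$NA_Sales, decreasing = TRUE), n = 3)
as.data.frame(top_vec)    # converted copy is printed, then thrown away
is.data.frame(top_vec)    # FALSE: top_vec itself is unchanged

# Suggested approach: order the rows and keep the columns needed for plotting
top_df <- head(games[order(games$NA_Sales, decreasing = TRUE),
                     c("Year", "Global_Sales")], n = 3)
is.data.frame(top_df)     # TRUE
```

top_df keeps the Year and Global_Sales columns, so it can be passed straight to ggplot().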

Extract chart data from powerpoint slides

Given a PowerPoint file with a chart containing chart data, how can I extract the chart data as a data frame? That is, given the tempf.pptx file, how can I retrieve the iris dataset?
library(magrittr)
library(mschart)
library(officer)
linec <- ms_linechart(data = iris, x = "Sepal.Length",
y = "Sepal.Width", group = "Species")
linec <- chart_ax_y(linec, num_fmt = "0.00", rotation = -90)
doc <- read_pptx()
doc <- add_slide(doc, layout = "Title and Content", master = "Office Theme")
doc <- ph_with_chart(doc, chart = linec)
print(doc, target = tempf.pptx <- tempfile(fileext = ".pptx"))
Another approach would be to directly import the xlsx file associated with the chart:
tempdir <- tempfile()
officer::unpack_folder(tempf.pptx, tempdir)
xl_file <- list.files(tempdir, recursive = TRUE, full.names = TRUE, pattern = "\\.xlsx$")
readxl::read_excel(xl_file)
Note: this code only works because there is only one dataset in the pptx file. If there were more than one, the relationships file *.xml.rels should be read to be sure we import the correct xlsx file (the xlsx reference is stored in ppt/charts/_rels/chart_file_title.xml.rels).
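For the multi-chart case, reading the workbook reference out of a .rels file could look roughly like the sketch below. The XML string is a hypothetical example of what such a file may contain (the Id and Target values are made up); in practice you would pass read_xml() the path of the real .rels file inside the unpacked folder.

```r
library(xml2)

# Hypothetical contents of a chart .rels file; Id and Target are made up
rels_txt <- '<?xml version="1.0" encoding="UTF-8"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/package" Target="../embeddings/data1.xlsx"/>
</Relationships>'

rels <- read_xml(rels_txt)
xml_ns_strip(rels)  # drop the default namespace so plain element names match in XPath

# Collect every relationship target and keep the ones pointing at a workbook
targets <- xml_attr(xml_find_all(rels, ".//Relationship"), "Target")
xlsx_target <- targets[grepl("\\.xlsx$", targets)]
```

The target is relative to the chart's folder, so it still has to be resolved against the unpacked directory before handing it to readxl::read_excel().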
"Cut and paste" is a seriously flawed anti-pattern for reproducible code & analyses or automation (all things we strive for in data science workflows).
This is starter code that gets you to the data elements (but you still have some "roll up your sleeves" work to do):
library(xml2)
library(magrittr)
# temp holding space for the unzipped PPTX
td <- tempfile("dir")
# unzip it and keep file names
fils <- unzip(tempf.pptx, exdir = td)
# look for chart XML files
charts <- fils[grepl("chart.*\\.xml$", fils)]
# read in the first one
chart <- read_xml(charts[1])
Now that we have found and read in a chart XML file, let's see if we can figure out which kind of chart it is:
# find charts in the XML (i don't know if there can be more than one per-XML file)
(embedded_charts <- xml_find_all(chart, ".//c:chart/c:plotArea"))
## {xml_nodeset (1)}
## [1] <c:plotArea xmlns:c="http://schemas.openxmlformats.org/drawingml/200 ...
# get the node root of the first one (again, i'm not sure if there can be more than one)
(first_embed <- embedded_charts[1])
## {xml_nodeset (1)}
## [1] <c:plotArea xmlns:c="http://schemas.openxmlformats.org/drawingml/200 ...
# use it to get the kind of chart so we can target the values with it
(xml_children(first_embed) %>%
xml_name() %>%
grep("Chart", ., value=TRUE) -> embed_kind)
## [1] "lineChart"
Now we can try to find the data series for that chart.
(target <- xml_find_first(first_embed, sprintf(".//c:%s", embed_kind)))
## {xml_nodeset (1)}
## [1] <c:lineChart>\n <c:grouping val="standard"/>\n <c:varyColors val=" ...
# extract "column" metadata
col_refs <- xml_find_all(target, ".//c:ser/c:tx/c:strRef")
(xml_find_all(col_refs, ".//c:f") %>%
sapply(xml_text) -> col_specs)
## [1] "sheet1!$B$1" "sheet1!$C$1" "sheet1!$D$1"
(xml_find_all(col_refs, ".//c:v") %>%
sapply(xml_text))
## [1] "setosa" "versicolor" "virginica"
Extract "X" metadata & data:
x_val_refs <- xml_find_all(target, ".//c:cat")
(lapply(x_val_refs, xml_find_all, ".//c:f") %>%
sapply(xml_text) -> x_val_specs)
## [1] "sheet1!$A$2:$A$36" "sheet1!$A$2:$A$36" "sheet1!$A$2:$A$36"
(lapply(x_val_refs, xml_find_all, ".//c:v") %>%
sapply(xml_double) -> x_vals)
## [,1] [,2] [,3]
## [1,] 4.3 4.3 4.3
## [2,] 4.4 4.4 4.4
## [3,] 4.5 4.5 4.5
## [4,] 4.6 4.6 4.6
## [5,] 4.7 4.7 4.7
## [6,] 4.8 4.8 4.8
## [7,] 4.9 4.9 4.9
## [8,] 5.0 5.0 5.0
## [9,] 5.1 5.1 5.1
## [10,] 5.2 5.2 5.2
## [11,] 5.3 5.3 5.3
## [12,] 5.4 5.4 5.4
## [13,] 5.5 5.5 5.5
## [14,] 5.6 5.6 5.6
## [15,] 5.7 5.7 5.7
## [16,] 5.8 5.8 5.8
## [17,] 5.9 5.9 5.9
## [18,] 6.0 6.0 6.0
## [19,] 6.1 6.1 6.1
## [20,] 6.2 6.2 6.2
## [21,] 6.3 6.3 6.3
## [22,] 6.4 6.4 6.4
## [23,] 6.5 6.5 6.5
## [24,] 6.6 6.6 6.6
## [25,] 6.7 6.7 6.7
## [26,] 6.8 6.8 6.8
## [27,] 6.9 6.9 6.9
## [28,] 7.0 7.0 7.0
## [29,] 7.1 7.1 7.1
## [30,] 7.2 7.2 7.2
## [31,] 7.3 7.3 7.3
## [32,] 7.4 7.4 7.4
## [33,] 7.6 7.6 7.6
## [34,] 7.7 7.7 7.7
## [35,] 7.9 7.9 7.9
Extract "Y" metadata and data:
y_val_refs <- xml_find_all(target, ".//c:val")
(lapply(y_val_refs, xml_find_all, ".//c:f") %>%
sapply(xml_text) -> y_val_specs)
## [1] "sheet1!$B$2:$B$36" "sheet1!$C$2:$C$36" "sheet1!$D$2:$D$36"
(lapply(y_val_refs, xml_find_all, ".//c:v") %>%
sapply(xml_double) -> y_vals)
## [[1]]
## [1] 3.0 3.2 2.3 3.2 3.2 3.0 3.6 3.3 3.8 4.1 3.7 3.4 3.5 3.8 4.0
##
## [[2]]
## [1] 2.4 2.3 2.5 2.7 3.0 2.6 2.7 2.8 2.6 3.2 3.4 3.0 2.9 2.3 2.9 2.8 3.0
## [18] 3.1 2.8 3.1 3.2
##
## [[3]]
## [1] 2.5 2.8 2.5 2.7 3.0 3.0 2.6 3.4 2.5 3.1 3.0 3.0 3.2 3.1 3.0 3.0 2.9
## [18] 2.8 3.0 3.0 3.8
# see if there are X & Y titles
title_nodes <- xml_find_all(first_embed, ".//c:title")
(lapply(title_nodes, xml_find_all, ".//a:t") %>%
sapply(xml_text) -> titles)
## [1] "Sepal.Length" "Sepal.Width"
Unlike the impetus behind my docxtractr package (for getting tables out of Word docs), I haven't seen much call for this particular need, so I'm not sure there will be a package for the above idiom in the near future.
I don't know of a way to get the data from within R, but you could open up the pptx file, right-click the chart, and select "Edit Data" to see the underlying data in table form. You could then copy and paste it into an R data frame using the handy datapasta package.

referencing an xts object with a matrix

I have a 207x7 xts object (called temp). I have a 207x3 matrix (called ac.topn), each row of which contains the columns I'd like from the corresponding row in the xts object.
For example, given the following top two rows of temp and ac.topn,
temp
v1 v2 v3 v4 v5 v6 v7
1997-09-30 14.5 8.7 -5.8 2.6 4.7 1.9 17.2
1997-10-31 6.0 -2.0 -25.7 2.9 4.9 9.6 8.4
head(ac.topn)
Rank1 Rank2 Rank3
1997-09-30 7 4 2
1997-10-31 6 5 7
I would like to get the result:
1997-09-30 17.2 2.6 8.7 (elements 7, 4, and 2 from the first row of temp)
1997-10-31 9.6 4.9 8.4 (elements 6, 5, 7 from the second row of temp)
My first attempt was temp[,ac.topn]. I've browsed for help, but am struggling to word my request effectively.
Thank you.
Well, this works, but I've got to think there's a better way...
result <- do.call(rbind,lapply(index(temp),function(i)temp[i,ac.topn[i]]))
colnames(result) <- colnames(ac.topn)
result
# Rank1 Rank2 Rank3
# 1997-09-30 17.2 2.6 8.7
# 1997-10-31 9.6 4.9 8.4
You may subset a matrix version of the xts object, using indexing via a numeric matrix:
m <- as.matrix(temp)
cols <- as.vector(ac.topn)
rows <- rep(1:nrow(ac.topn), ncol(ac.topn))
vals <- m[cbind(rows, cols)]
xts(x = matrix(vals, nrow = nrow(temp)), order.by = index(temp))
# [,1] [,2] [,3]
# 1997-09-30 17.2 2.6 8.7
# 1997-10-31 9.6 4.9 8.4
However, I say the same as @jlhoward: I've got to think there's a better way...
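For anyone who wants to see the matrix-indexing trick in isolation, here is a self-contained sketch in base R, using plain matrices in place of the xts object and only the two rows shown in the question. The names temp_m, idx and picked are just for illustration.

```r
# The two rows from the question, as plain matrices (no xts needed for the idea)
temp_m <- matrix(c(14.5,  8.7,  -5.8, 2.6, 4.7, 1.9, 17.2,
                    6.0, -2.0, -25.7, 2.9, 4.9, 9.6,  8.4),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("1997-09-30", "1997-10-31"), paste0("v", 1:7)))
ac.topn <- matrix(c(7, 4, 2,
                    6, 5, 7),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(rownames(temp_m), paste0("Rank", 1:3)))

# A two-column matrix of (row, column) pairs picks one cell per pair
idx <- cbind(as.vector(row(ac.topn)), as.vector(ac.topn))

# Reshape the picked values back to the layout of ac.topn
picked <- matrix(temp_m[idx], nrow = nrow(ac.topn), dimnames = dimnames(ac.topn))
picked
```

row(ac.topn) produces the same row indices as the rep(1:nrow(ac.topn), ncol(ac.topn)) call above, and both as.vector() calls walk the matrices column by column, so the pairs line up.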

Apply formula for between species comparison

I have a data frame laid out in the following manner:
Species Trait.p Trait.y Trait.z
a 20.1 7.2 14.1
b 20.4 8.3 15.2
b 19.2 6.8 13.9
I would like to compute, for each species combination, (Xa) - (Xb), where X is the trait value, the letter is the species, and Xa > Xb. That is, the larger value in each species combination has to come first, and the calculation is done for every trait.
Would this be a multi-step process?
An example output could be
Combination Trait.p Trait.y Trait.z
a/b 0.3 1.1 1.1
I assumed you choose the largest value, but David brings up a good point. I doubt this is the best approach, but I think it gives you what you're after. Note I added a c, as I'm sure your problem is a bit more complex than just a and b:
dat <- read.table(text="Species Trait.p Trait.y Trait.z
a 20.1 7.2 14.1
b 20.4 8.3 15.2
b 19.2 6.8 13.9
c 14.2 3.8 11.9", header=TRUE)
li <- lapply(split(dat, dat$Species), function(x) apply(x[, -1], 2, max))
com <- expand.grid(names(li), names(li))
inds <- com[com[, 1] != com[, 2], ]
inds <- t(apply(inds, 1, sort))
inds <- inds[!duplicated(inds), ]
ans <- lapply(1:nrow(inds), function(i) {
abs(li[[inds[i, 1]]]-li[[inds[i, 2]]])
})
cbind(Combination = paste(inds[, 1], inds[, 2], sep="/"),
as.data.frame(do.call(rbind, ans)))
This gives us:
Combination Trait.p Trait.y Trait.z
1 a/b 0.3 1.1 1.1
2 a/c 5.9 3.4 2.2
3 b/c 6.2 4.5 3.3
Sorry for the lack of annotation but I'm heading to class.
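A possibly shorter route to the same table, sketched under the same assumption as above (each species is represented by its per-trait maximum): aggregate() gets the maxima and combn() enumerates each unordered pair once, which replaces the expand.grid/sort/duplicated steps.

```r
dat <- read.table(text = "Species Trait.p Trait.y Trait.z
a 20.1 7.2 14.1
b 20.4 8.3 15.2
b 19.2 6.8 13.9
c 14.2 3.8 11.9", header = TRUE)

# Per-species maximum of every trait (an assumption about what Xa and Xb mean)
maxes  <- aggregate(. ~ Species, data = dat, FUN = max)
traits <- as.matrix(maxes[, -1])
sp     <- as.character(maxes$Species)

# Each column of idx is one unordered pair of row positions in maxes
idx <- combn(nrow(maxes), 2)

# abs() guarantees larger minus smaller for every trait
diffs <- abs(traits[idx[1, ], , drop = FALSE] - traits[idx[2, ], , drop = FALSE])
rownames(diffs) <- NULL

result <- data.frame(Combination = paste(sp[idx[1, ]], sp[idx[2, ]], sep = "/"), diffs)
result
```

This should reproduce the a/b, a/c and b/c rows shown above.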
