Creating a Matrix in R from a dataset

I am trying to convert data provided to us in a CSV into a matrix. We have saved the data as an object (us_quarters). It's a simple dataset containing the name of a state, then the number of quarters produced at two separate mints for that state.
State DenverMint PhillyMint
Delaware 401424 373400
one row for each state.
I am trying to create a side by side barplot of this data, and first need to convert the data into a matrix to work with it. The issue I seem to be struggling with is the fact that the state itself is a column so when I try to convert I end up with a jumbled mess of character values and integer values stored in massive lists.
x <- matrix(us_quarters,ncol=3, byrow = TRUE)
colnames(x) <- c("State", "DenverMint", "Phillymint")
x
produces this result
State DenverMint Phillymint
[1,] character,50 integer,50 integer,50
Everything I am trying to do requires the data to be formatted in this matrix in order to work with it properly and I am at a total loss as to how to proceed. Any thoughts are much appreciated.

Could you use pivot_longer to group Denver and Philly mint?
library(tidyverse) # for tribble, pivot_longer and ggplot
df <- tribble(~state, ~den_mint, ~philly_mint,
'delaware', 401424,373400,
'newyork', 460858, 494023)
df %>%
mutate(state = as.factor(state)) %>%
pivot_longer(cols = c("den_mint", "philly_mint"), names_to = "mint", values_to = "count") %>%
ggplot(aes(mint, count)) +
geom_col() +
facet_wrap(~state) +
coord_flip()
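If the goal is the base-R side-by-side barplot that the question asks for, the state names can be moved into row names so that the matrix holds only numbers. This is a minimal sketch, assuming us_quarters has the three columns shown in the question; the two rows are the sample values quoted above, not the real dataset.

```r
# Stand-in for the real us_quarters object (sample values from the thread)
us_quarters <- data.frame(
  State      = c("Delaware", "New York"),
  DenverMint = c(401424, 460858),
  PhillyMint = c(373400, 494023)
)

# Keep only the numeric columns in the matrix; store the states as row names
x <- as.matrix(us_quarters[, c("DenverMint", "PhillyMint")])
rownames(x) <- us_quarters$State

# barplot() draws one group of bars per matrix column, so transpose
# to get one group per state with one bar per mint
barplot(t(x), beside = TRUE, legend.text = TRUE)
```

Calling matrix() directly on a data frame treats it as a list of columns, which is why the original attempt printed character,50 and integer,50 cells.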


Nested lists multiple levels get elements by name

I have a nested list with multiple levels and I want to extract elements by name.
I have a dataset with many metrics. The head looks like:
Metric Url
Domestic https://www.ons.gov.uk/economy/grossdomesticproductgdp/timeseries/njiq/ukea/data
International https://www.ons.gov.uk/economy/grossvalueaddedgva/timeseries/abml/pn2/data
I retrieve JSON for all Url:
metrics <- lapply(dataset$Url, function(i) fromJSON(content(GET(i), as = 'text')))
I get from the JSON the sublists with years
metrics_years <- lapply(metrics, function (i) i$years)
Now I have a nested list with three levels. I get what I want for one metric if I do:
sapply(metrics_years[["Domestic"]], '[[', 'year')
But I cannot type for all of them. How can I obtain the same for all metrics without typing them one by one:
sapply(metrics_years[["Domestic"]], '[[', 'year')
sapply(metrics_years[["International"]], '[[', 'year')
...
There isn't very much data in these files. I'm not sure what you're expecting. There are data frames called years and quarters in both lists. They both contain dates and values. Most of the content is NA. Each is a time-series data set starting in 1962. You don't need httr, either. fromJSON accepts a URL as a source without content or GET: m <- map(1:2, ~fromJSON(d[.x], flatten = T)), where d is a vector with the two URLs. I think that's what @John Nielsen was stating. I used map, but lapply works fine, too.
You're using brackets, but you don't have to.
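library(tidyverse) # for map and ggplot
library(jsonlite)
# d holds the two URLs from the question
d <- c("https://www.ons.gov.uk/economy/grossdomesticproductgdp/timeseries/njiq/ukea/data",
"https://www.ons.gov.uk/economy/grossvalueaddedgva/timeseries/abml/pn2/data")
# get the JSON data and flatten it
mm <- map(1:2, ~fromJSON(d[.x], flatten = T))
# name each list, based on the URL
names(mm) <- c("Domestic", "International")
attributes(mm) # validate change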
You can access the content with the environment pane, to get an idea of what you have.
You can use $ to access the content or [[]]. You can use numbers or names with [[]].
all.equal(mm[[1]][[1]], mm$Domestic$years)
# [1] TRUE
all.equal(mm[["Domestic"]][["years"]], mm$Domestic$years)
# [1] TRUE
To extract the data or to make a data frame-only object, you just need to assign it to an object name.
newDF <- mm$Domestic$years
You don't have to create a new object, though. If you wanted to plot the Domestic years data frame, you can use it as it is.
mm$Domestic$years %>%
mutate(year = as.integer(year),
value = as.numeric(value)) %>%
ggplot(aes(year, value)) + geom_path() + theme_bw()
The Domestic quarters data
mm$Domestic$quarters %>%
mutate(value = as.numeric(value)) %>%
ggplot(aes(seq_along(value), value)) + geom_path() +
scale_x_continuous(labels = mm$Domestic$quarters[seq(1, 220, 50), ]$date,
breaks = seq(1, 220, 50)) +
xlab("") + theme_bw()
You can do the same thing with the International data.
mm$International$years %>%
mutate(year = as.integer(year),
value = as.numeric(value)) %>%
ggplot(aes(year, value)) + geom_path() + theme_bw()
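To pull the same element out of every metric without typing each name, lapply can apply the extraction function to the whole named list at once. This is a minimal sketch with made-up stand-in data; it assumes each metric's years element is a data frame with a year column, as described above.

```r
# Made-up stand-ins for the per-metric "years" data frames
metrics_years <- list(
  Domestic = data.frame(year = c("1962", "1963"), value = c("1", "2"),
                        stringsAsFactors = FALSE),
  International = data.frame(year = c("1962", "1963"), value = c("3", "4"),
                             stringsAsFactors = FALSE)
)

# Apply `[[` with the argument "year" to every element;
# the result keeps the metric names
years_by_metric <- lapply(metrics_years, `[[`, "year")
years_by_metric$International
```

The result is a named list, so any single metric is still reachable with $ or [[ ]].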

R Function to Create Custom Data Frames from Larger Data Frame

Ok, So I found somewhat similar questions asked of this already, but I'm not quite getting it. So, here is my example. I have a very large table of data that has a basic setup like the small example data below. I will try to explain very clearly what I am wanting to do. I'm guessing maybe it's easier to do than I think, but I'm not really good at creating functions or for-loops at this point, and I'm guessing that's what I need. So here is the basic setup for my data.
test_year <- c(2019,2019,2019,2020,2020,2020,2021,2021,2021)
SN <- c(1001,1002,1003,1004,1005,1006,1007,1008,1009)
Owner <- c("Adam","Bob","Bob","Carl","Adam","Bob","Adam","Carl","Adam")
ObsA <- c(0,0,1,1,0,1,1,NA,1)
ObsB <- c(1,1,1,0,0,0,0,0,1)
ObsC <- c(0,0,0,0,1,1,0,0,0)
df <- data.frame(test_year, SN, Owner, ObsA, ObsB, ObsC)
From this, I need to be able to create smaller data frames by selecting individual observation columns. So if this were a small data set:
df_A <- df %>% select(test_year, SN, Owner, ObsA)
and then have a data frame for each of the other observations. And yes, it is easier to select the columns that I want versus the columns I don't want as most of the columns selected will be standard, and I just need to change which observation is picked out of over 40 in my real data.
From these smaller data frames, I will be doing numerous other operations including making multiple tables and graphs. As examples, the following are similar to the types of graphs I will make (with some additional formatting that is simple enough). Notice too in these graphs a title that is based on (though not identical to), the column selected.
df_A[is.na(df_A)] = 0
df_A
df_A %>% group_by(test_year) %>%
summarize(n = n(), obs = sum(ObsA)) %>%
ggplot(aes(x = test_year, y = 100*obs/n)) +
ggtitle("Observation A") +
geom_point()
df_A %>% group_by(Owner) %>%
summarize(n = n(), obs = sum(ObsA)) %>%
ggplot(aes(x = Owner, y = 100*obs/n)) +
ggtitle("Observation A") +
geom_bar(stat = "identity") +
coord_flip() +
scale_x_discrete()
As I said, additional analysis will also need to be done. So, I'm needing help figuring out how I can structure a function to do what it is I'm wanting to do. Thanks!
Here is a way to return a list of plots.
Split all the 'Obs' columns in a list of dataframes, use imap to pass dataframe along with the column name (to use it as title).
library(tidyverse)
common_cols <- 1:3
df[is.na(df)] = 0
list_plots <- df %>%
select(starts_with('Obs')) %>%
split.default(names(.)) %>%
imap(~{
tmp <- df[common_cols] %>% bind_cols(.x)
tmp %>% group_by(test_year) %>%
summarize(n = n(), obs = sum(.data[[.y]])) %>%
ggplot(aes(x = factor(test_year), y = 100*obs/n)) +
geom_point() +
labs(x = 'Year', y = 'ratio', title = .y)
})
Individual plots can be accessed by list_plots[[1]],list_plots[[2]] etc.
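If the smaller data frames themselves are needed for further analysis, a small helper can build one per observation column. This is only a sketch: select_obs is a hypothetical name, and it uses base-R subsetting rather than select() so that it runs without extra packages.

```r
# Sample data from the question
df <- data.frame(
  test_year = c(2019, 2019, 2019, 2020, 2020, 2020, 2021, 2021, 2021),
  SN        = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009),
  Owner     = c("Adam", "Bob", "Bob", "Carl", "Adam", "Bob", "Adam", "Carl", "Adam"),
  ObsA      = c(0, 0, 1, 1, 0, 1, 1, NA, 1),
  ObsB      = c(1, 1, 1, 0, 0, 0, 0, 0, 1),
  ObsC      = c(0, 0, 0, 0, 1, 1, 0, 0, 0)
)

# Hypothetical helper: keep the standard columns plus one observation column
select_obs <- function(data, obs_col,
                       common_cols = c("test_year", "SN", "Owner")) {
  out <- data[, c(common_cols, obs_col)]
  # Treat missing observations as 0, as the question does
  out[[obs_col]][is.na(out[[obs_col]])] <- 0
  out
}

# One data frame per observation column, named after that column
obs_names <- grep("^Obs", names(df), value = TRUE)
obs_dfs <- lapply(setNames(obs_names, obs_names),
                  function(nm) select_obs(df, nm))
obs_dfs$ObsA
```

Each element of obs_dfs can then be passed to the same summarising or plotting code, with its name available for the title.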

make multiple separate stacked barplots from one data frame

I would like to create multiple grouped and stacked barplots from several data frames and be able to export the plots in a single PDF file.
I have several data frames with the same format but varying values. For each data frame I would like to create multiple stacked and grouped bar plots. Ideally the bar plots of the same group from the data frames should be placed next to each other and share the same Y-axis length (in order to easily visually compare the data frames).
Here is an example of what my data looks like:
data1 <- data.frame(group=c('A','A','A','A','B','B','B','B','C','C','C','C'),
Year=c('2012','2013','214','2015','2012','2013','214','2015','2012','2013','214','2015'),
Fruit=c(5,3,6,3,5,4,2,2,3,4,6,2),
Vegetables=c(3,6,1,4,8,9,43,2,1,5,0,1),
Rice=c(20,23,53,12,45,5,23,12,32,41,54,32))
data2 <- data.frame(group=c('A','A','A','A','B','B','B','B','C','C','C','C'),
Year=c('2012','2013','214','2015','2012','2013','214','2015','2012','2013','214','2015'),
Fruit=c(2,4,5,2,3,9,4,7,5,7,4,7),
Vegetables=c(9,7,8,7,4,3,0,0,2,3,5,6),
Rice=c(23,12,32,41,54,32,20,23,53,12,45,5))
I started by formatting the tables like this:
data1 <- pivot_longer(data1, cols = 3:5, names_to = 'Type', values_to = 'value')
data2 <- pivot_longer(data2, cols = 3:5, names_to = 'Type', values_to = 'value')
My attempts to use ggplot to create the desired PDF have so far failed. I made several different attempts but could not get near the desired PDF. I found instructions on how to create several plots for one data frame, or grouped plots, or stacked plots, but never the combination of all three.
If possible the PDF I would like to get for this example should look like this:
In total 6 plots: left 3 plots data1, right 3 plots data2; Group A row1, Group B row2, Group C row3 (if possible same y axis length in one row/Group)
All bar plots: x-axis= years, y-axis= value / 1 stacked bar per year with colors matching Type (Fruit, Vegetable, Rice)
Group name per row
data source(data1, data2) per column
legend with Types (Fruit, Vegetable, Rice)
Q1. Is something like this possible, or would one have to create two PDFs (one for each data frame, here: data1 and data2)?
Q2. Is it possible to format the code in a way to automatically adjust the number of plots needed according to the data frames, adjust the PDF file size automatically, and create a new page if necessary? (In reality I have 5 data frames and 13 Groups; this may however change with time.)
I know this is quite a difficult code to write. I have spent two working days on this already though, which is why I am now asking for help here. I will try again tomorrow and post any possible progress here.
Thank you very much for any suggestions
This code should produce the desired plot (or at least something really close). The two critical steps are: 1) joining all the data frames into a single one, using bind_rows, and 2) using facet_grid to define the panel layout according to two variables (group and id).
library(tidyverse)
# Combine the data
# (data1 and data2 must still be in their original wide format here,
# i.e. before any pivot_longer call)
# id column contains the number of the data frame the data comes from
df <- bind_rows(data1, data2, .id = "id")
df %>%
# Change to long format, add 1 to the columns number, as we now added id column
pivot_longer(cols = 4:6,
names_to = 'Type',
values_to = 'value') %>%
# Transform value to x / 1
mutate_at(vars(value), function(x) x / 1) %>%
# Do plot
ggplot(aes(x = Year,
y = value,
fill = Type)) +
# columns
geom_col()+
# Facets by two factors, groups and data source (id)
facet_grid(group ~ id)
# Save plot to pdf
ggsave("my_plot.pdf",
device = "pdf",
height = 15,
width = 20,
units = "cm",
dpi = 300)
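On the request for a shared y-axis within each group row and a page size that adapts to the data, here is a possible sketch. It assumes the data frames are collected in a named list; the inputs are a subset of the sample values from the question, and the per-panel sizes (8 cm by 5 cm) are arbitrary choices. With facet_grid, scales = "free_y" lets each row of panels have its own y-axis while panels in the same row share it.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Reduced inputs in the same wide layout as data1 and data2
data1 <- data.frame(group = c("A", "A", "B", "B"),
                    Year  = c("2012", "2013", "2012", "2013"),
                    Fruit = c(5, 3, 5, 4),
                    Rice  = c(20, 23, 45, 5))
data2 <- data.frame(group = c("A", "A", "B", "B"),
                    Year  = c("2012", "2013", "2012", "2013"),
                    Fruit = c(2, 4, 3, 9),
                    Rice  = c(23, 12, 54, 32))

# A named list lets bind_rows() fill the "source" column with the names
all_data <- bind_rows(list(data1 = data1, data2 = data2), .id = "source") %>%
  pivot_longer(cols = c(Fruit, Rice), names_to = "Type", values_to = "value")

p <- ggplot(all_data, aes(x = Year, y = value, fill = Type)) +
  geom_col() +                                   # stacked by default
  facet_grid(group ~ source, scales = "free_y")  # y-axis shared within a row

# Scale the page with the number of panels (sizes are arbitrary assumptions)
n_groups  <- n_distinct(all_data$group)
n_sources <- n_distinct(all_data$source)
out_file  <- file.path(tempdir(), "stacked_bars.pdf")
ggsave(out_file, plot = p,
       width = 8 * n_sources, height = 5 * n_groups,
       units = "cm", limitsize = FALSE)
```

Adding another data frame to the list or another group to the data adds a column or row of panels without further code changes.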

How to make dot plot with multiple data points for single variable?

I would like to create a dot plot for my data set. I know how to create a normal dot plot for treatment comparisons or similar data sets using ggplot. I have the following data and would like to create a dot plot with three different colors. Please suggest how to prepare the data for this dot plot. If I had a single data point in NP and P, it would be easy to plot, as I have already worked with similar data, but I have no idea how to handle this kind of data. I can use ggplot from R.
The variable W always has a single data point, while NP and P have varying numbers of data points, i.e. sometimes one in NP and sometimes three, and the same with variable P, as shown in the table.
Here is the screen shot for my data.
Sorry for my language
I agree my data is a mess. I googled and did some coding to get the plot. I used the tidyverse and dplyr packages to get the plot, but there is still a problem with the y-axis: it is very cluttered. I used the following code
d <- read.table("Data1.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
df <- data.frame(d)
df <- df %>%
mutate(across(everything(), as.character)) %>%
pivot_longer(!ID, names_to="colid", values_to="val") %>%
separate_rows(val, sep="\t", convert=TRUE) %>%
mutate(ID = as_factor(ID))
Then I plot the graph with ggplot
ggplot(df, aes(x=ID, y=val, color=colid))+geom_point(size=1.5) +theme(axis.text.x = element_text(angle = 90))
The output is this. I tried to adjust Y-axis with ylim and scale_y_discrete() but nothing worked. Please suggest a way to rectify it.
This contains many necessary steps for data cleaning, as suggested by user Dan Adams in the comment. This was kind of fun, and it helped me procrastinate my own thesis.
I am using a function from a very famous thread which offers a way to split columns when the number of resulting columns is unknown.
P.S. The way you shared the data was less than ideal.
#your data is unreadable without this awesome package
# devtools::install_github("alistaire47/read.so")
library(tidyverse)
df <- read.so::read_md("|ID| |W| |NP| |P|
|:-:| |:-:| |:-:| |:-:|
|1| |4.161| |1.3,1.5| |1.5,2.8|
|2| |0.891| |1.33,1.8,1.79| |1.6|
|3| |7.91| |4.3| |0.899,1.43,0.128|
|40| |2.1| |1.4,0.99,7.9,0.32| |0.6,0.5,1.57|") %>%select(-starts_with("x"))
#> Warning: Missing column names filled in: 'X2' [2], 'X4' [4], 'X6' [6]
# from this thread https://stackoverflow.com/a/47060452/7941188
split_into_multiple <- function(column, pattern = ", ", into_prefix){
cols <- str_split_fixed(column, pattern, n = Inf)
cols[which(cols == "")] <- NA
cols <- as_tibble(cols, .name_repair = "unique") # as.tibble() is deprecated
m <- dim(cols)[2]
names(cols) <- paste(into_prefix, 1:m, sep = "_")
cols
}
# apply this over the columns of interest
ls_cols <- lapply(c("NP", "P"), function(x) split_into_multiple(df[[x]], pattern = ",", into_prefix = x))
# bind it to the single columns of the old data frame
# convert character columns to numeric
# apply pivot longer twice (there might be more direct options, but I won't be
# bothered to do too much here)
df_new <-
bind_cols(df[c("ID", "W")], ls_cols) %>%
pivot_longer(cols = c(-ID,-W), names_sep = "_", names_to = c(".value", "value")) %>%
mutate(across(c(P, NP), as.numeric)) %>%
select(-value) %>%
pivot_longer(W:P, names_to = c("var"), values_to = "value")
# The new tidy data can easily be plotted
ggplot(df_new, aes(ID, value, color = var)) +
geom_point()
#> Warning: Removed 13 rows containing missing values (geom_point).

How do I label my rows and columns in order to work better with them?

I have a dataset with the emissions of Canada:
and I would like to label the first row "years" and the second "emissions".
For example, if I don't do this, how can I name my variables in aes() in the ggplot() function:
ggplot(CAN_emissions, aes(___, ___))
To add a name to the first row, we can use rownames(CAN_emissions) <- "emissions" though this won't help much as the years data points are in the column titles, not in a row of their own.
Generally speaking, you'll struggle to plot the data while it is in a 'wide' format like this. A better solution is to convert all of the year columns into rows. The problem of row names will then disappear. We can do this like so:
library(tidyr)
library(dplyr)
library(ggplot2)
CAN_emissions <- CAN_emissions %>%
pivot_longer(-country, names_to = "year", values_to = "emissions")
The data can then be plotted directly:
ggplot(CAN_emissions, aes(x = year, y = emissions)) + geom_point()
Data:
CAN_emissions <- tibble(
country = c("Canada"),
`1800` = 0.00568,
`1801` = 0.00561,
`1802` = 0.00555
)
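One optional refinement: after pivoting, the year column is character because it comes from the column names, so ggplot treats it as a discrete axis. A minimal sketch, assuming tidyr 1.1.0 or later for the names_transform argument, converts it to integer during the pivot (long is just an illustrative object name):

```r
library(tidyr)
library(tibble)

# Same sample data as above
CAN_emissions <- tibble(
  country = "Canada",
  `1800` = 0.00568,
  `1801` = 0.00561,
  `1802` = 0.00555
)

# Convert the former column names to integers while reshaping
long <- pivot_longer(
  CAN_emissions,
  cols = -country,
  names_to = "year",
  values_to = "emissions",
  names_transform = list(year = as.integer)
)
long
```

With an integer year, aes(x = year, y = emissions) gives a continuous x-axis, which also makes geom_line() work without extra grouping.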
