Restructure csv data for r and ggplot2


I'm new to R and ggplot2. I have a csv file with beverage consumption data. The first column is the Year, and the next 9 columns are beverage types (coffee, tea, soda, etc.), each holding the consumption amount for that row's year. The data covers a 41-year period. I've been researching this and trying many things. I can easily create a dot plot for any one type of beverage with ggplot.
However, I want to create a stack of horizontal dot plots, each with Year on the x axis. So there'd be a plot for coffee, and then right below it, one for tea, etc. I think I want to use facets. I'm also thinking I want to restructure my data so it has 3 columns: one for year, one for "category" (i.e., coffee, tea, soda, etc.), and one for the value. My thinking is that once I get the data in that form, faceting should be straightforward.
The problem is that I can't figure out how to get my data into that form. Here is how the first few rows of the data look:
Year  Whole Milk  Other Milk  Total Milk  Tea  Coffee  Diet Soda  Regular Soda  Total Soda  Juice
1970  25.5        5.8         31.3        6.8  33.4    2.1        22.2          24.3        5.5
1971  25          6.3         31.3        7.2  32.2    2.2        23.3          25.5        5.8
1972  24.1        6.9         31          7.3  33.6    2.3        23.9          26.2        6
Can someone help me?
dput of the data is:
structure(list(Year = 1970:1972, `Whole Milk` = c(25.5, 25, 24.1
), `Other Milk` = c(5.8, 6.3, 6.9), `Total Milk` = c(31.3, 31.3,
31), Tea = c(6.8, 7.2, 7.3), Coffee = c(33.4, 32.2, 33.6), `Diet Soda` = c(2.1,
2.2, 2.3), `Regular Soda` = c(22.2, 23.3, 23.9), `Total Soda` = c(24.3,
25.5, 26.2), Juice = c(5.5, 5.8, 6)), .Names = c("Year", "Whole Milk",
"Other Milk", "Total Milk", "Tea", "Coffee", "Diet Soda", "Regular Soda",
"Total Soda", "Juice"), class = "data.frame", row.names = c(NA,
-3L))

I have a little saying that I use often for ggplot2: "When in doubt, melt". The reshape package has a function, melt(), that does exactly this.
tmp <- structure(list(Year = 1970:1972, `Whole Milk` = c(25.5, 25, 24.1
), `Other Milk` = c(5.8, 6.3, 6.9), `Total Milk` = c(31.3, 31.3,
31), Tea = c(6.8, 7.2, 7.3), Coffee = c(33.4, 32.2, 33.6), `Diet Soda` = c(2.1,
2.2, 2.3), `Regular Soda` = c(22.2, 23.3, 23.9), `Total Soda` = c(24.3,
25.5, 26.2), Juice = c(5.5, 5.8, 6)), .Names = c("Year", "Whole Milk",
"Other Milk", "Total Milk", "Tea", "Coffee", "Diet Soda", "Regular Soda",
"Total Soda", "Juice"), class = "data.frame", row.names = c(NA,
-3L))
library(reshape)
melt(tmp, id.vars="Year")
Year variable value
1 1970 Whole Milk 25.5
2 1971 Whole Milk 25.0
3 1972 Whole Milk 24.1
4 1970 Other Milk 5.8
5 1971 Other Milk 6.3
6 1972 Other Milk 6.9
7 1970 Total Milk 31.3
8 1971 Total Milk 31.3
9 1972 Total Milk 31.0
10 1970 Tea 6.8
11 1971 Tea 7.2
12 1972 Tea 7.3
13 1970 Coffee 33.4
...
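The answer stops at the reshaped data, so here is a minimal sketch of the faceting step the question asks about. It assumes the reshape2 package (the successor to reshape, whose melt() accepts variable.name and value.name) and uses only three of the beverage columns. The column names category and consumption are illustrative choices, not from the original post.

```r
library(reshape2)
library(ggplot2)

# A subset of the question's data, kept small for illustration
tmp <- data.frame(
  Year = 1970:1972,
  Tea = c(6.8, 7.2, 7.3),
  Coffee = c(33.4, 32.2, 33.6),
  Juice = c(5.5, 5.8, 6)
)

# Long format: one row per (Year, category) pair
long <- melt(tmp, id.vars = "Year",
             variable.name = "category", value.name = "consumption")

# One panel per beverage, stacked vertically, all sharing the Year axis
p <- ggplot(long, aes(x = Year, y = consumption)) +
  geom_point() +
  facet_grid(category ~ ., scales = "free_y")
p
```

scales = "free_y" lets each beverage use its own y range. Drop it if the panels should be directly comparable.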

Related

How to calculate mean of column, then paste mean value as row value in another data frame in R?

I have 36 data frames that each contain columns titled "lon", "lat", and "bottom_temp". Each data frame represents data from one year between 1980 and 2015. I have a separate data frame called "month3_avg_box" that contains two columns: "year" and "avg_bottom_temp". The year column of "month3_avg_box" contains one row for each year from 1980 to 2015. I would like to find the mean of the "bottom_temp" column in each of the 36 data frames and place each mean in the corresponding row of "month3_avg_box". Here is a mini example of what I'd like:
1980_df:
lon lat bottom_temp
-75.61 39.1 11.6
-75.60 39.1 11.5
-75.59 39.1 11.6
-75.58 39.1 11.7
(mean of bottom_temp column for 1980_df = 11.6)
1981_df:
lon lat bottom_temp
-75.57 39.1 11.9
-75.56 39.1 11.9
-75.55 39.1 12.0
-75.54 39.1 11.8
(mean of bottom_temp column for 1981_df = 11.9)
1982_df:
lon lat bottom_temp
-75.57 39.1 11.6
-75.56 39.1 11.7
-75.55 39.1 11.9
-75.54 39.1 11.2
(mean of bottom_temp column for 1982_df = 11.6)
Now, I'd like to take these averages and put them into my "month3_avg_box" data frame so it looks like:
month3_avg_box:
Year Avg_bottom_temp
1980 11.6
1981 11.9
1982 11.6
Does this make sense? How can I do this?

We can collect the datasets in a named list, bind them into one data frame with a 'Year' column taken from the list names, and then compute the mean by group:
library(dplyr)
library(stringr)
lst(`1980_df`, `1981_df`, `1982_df`) %>%
  bind_rows(.id = 'Year') %>%
  group_by(Year = str_remove(Year, '_df')) %>%
  summarise(Avg_bottom_temp = mean(bottom_temp))
Output:
# A tibble: 3 × 2
Year Avg_bottom_temp
<chr> <dbl>
1 1980 11.6
2 1981 11.9
3 1982 11.6
Data:
`1980_df` <- structure(list(lon = c(-75.61, -75.6, -75.59, -75.58), lat = c(39.1,
39.1, 39.1, 39.1), bottom_temp = c(11.6, 11.5, 11.6, 11.7)), class = "data.frame", row.names = c(NA,
-4L))
`1981_df` <- structure(list(lon = c(-75.57, -75.56, -75.55, -75.54), lat = c(39.1,
39.1, 39.1, 39.1), bottom_temp = c(11.9, 11.9, 12, 11.8)), class = "data.frame", row.names = c(NA,
-4L))
`1982_df` <- structure(list(lon = c(-75.57, -75.56, -75.55, -75.54), lat = c(39.1,
39.1, 39.1, 39.1), bottom_temp = c(11.6, 11.7, 11.9, 11.2)), class = "data.frame", row.names = c(NA,
-4L))
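With 36 data frames, typing each name into lst() gets tedious. The following base R sketch shows one way to collect them by name instead. It assumes the objects follow the `<year>_df` naming pattern from the question and already exist in the calling environment. The three small data frames here are stand-ins holding only the bottom_temp column.

```r
# Stand-ins for the yearly data frames (values from the question)
`1980_df` <- data.frame(bottom_temp = c(11.6, 11.5, 11.6, 11.7))
`1981_df` <- data.frame(bottom_temp = c(11.9, 11.9, 12.0, 11.8))
`1982_df` <- data.frame(bottom_temp = c(11.6, 11.7, 11.9, 11.2))

years <- 1980:1982  # would be 1980:2015 for the full set

# mget() looks up several objects by name and returns them as a named list
dfs <- mget(paste0(years, "_df"))

# One mean per data frame, in the same order as `years`
month3_avg_box <- data.frame(
  year = years,
  avg_bottom_temp = unname(vapply(dfs, function(d) mean(d$bottom_temp), numeric(1)))
)
month3_avg_box
```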

Descending order in ggplot bar_col

Below is the dataset,
# A tibble: 449 x 7
`Country or Area` `Region 1` Year Rate MinCI MaxCI Average
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Southern Asia 2011 4.2 2.6 6.2 4.4
2 Afghanistan Southern Asia 2016 5.5 3.4 8.1 5.75
3 Aland Islands Northern Europe NA NA NA NA NA
4 Albania Southern Europe 2011 18.8 14.8 23 18.9
5 Albania Southern Europe 2016 21.7 17 26.7 21.8
6 Algeria Northern Africa 2011 24 19.9 28.4 24.2
7 Algeria Northern Africa 2016 27.4 22.5 32.7 27.6
8 American Samoa Polynesia NA NA NA NA NA
9 Andorra Southern Europe 2011 24.6 19.8 29.8 24.8
10 Andorra Southern Europe 2016 25.6 20.1 31.3 25.7
I need to draw a bar chart (geom_col) from the above dataset to compare the average obesity rate of each region, with the bars ordered from highest to lowest.
I have also calculated the Average obesity rate, as shown above.
Below is the code I used to generate the plot, but I am unable to figure out how to order the bars from highest to lowest.
region_plot <- ggplot(continent) + aes(x = continent$`Region 1`, y = continent$Average, fill = Average) +
geom_col() +
xlab("Region") + ylab("Average Obesity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle("Average obesity rate of each region")
region_plot

Your data has multiple rows per region, so to show the average per region you must compute it first and then plot. You can do that with dplyr using group_by() and summarise(). Your shared data is limited, and any NA values should be removed before plotting. The code below uses part of the shared data, so be careful with column names when using your real data. The reorder() function arranges the bars:
library(dplyr)
library(ggplot2)
# Average per region, drop regions with no data, order bars by descending average
df %>% group_by(Region) %>%
  summarise(Avg = mean(Average, na.rm = TRUE)) %>%
  filter(!is.na(Avg)) %>%
  ggplot(aes(x = reorder(Region, -Avg), y = Avg, fill = Region)) +
  geom_col() +
  xlab("Region") + ylab("Average Obesity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Average obesity rate of each region")
Some data used:
#Data
df <- structure(list(Region = c("Southern Asia", "Southern Asia", "Northern Europe",
"Southern Europe", "Southern Europe", "Northern Africa", "Northern Africa",
"Polynesia", "Southern Europe", "Southern Europe"), Year = c(2011L,
2016L, NA, 2011L, 2016L, 2011L, 2016L, NA, 2011L, 2016L), Rate = c(4.2,
5.5, NA, 18.8, 21.7, 24, 27.4, NA, 24.6, 25.6), MinCI = c(2.6,
3.4, NA, 14.8, 17, 19.9, 22.5, NA, 19.8, 20.1), MaxCI = c(6.2,
8.1, NA, 23, 26.7, 28.4, 32.7, NA, 29.8, 31.3), Average = c(4.4,
5.75, NA, 18.9, 21.8, 24.2, 27.6, NA, 24.8, 25.7)), row.names = c(NA,
-10L), class = "data.frame")
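To see why reorder(Region, -Avg) produces descending bars, it can help to look at the factor levels it creates, since ggplot2 draws discrete x values in level order. This is a small sketch using the per-region means of the sample data above. The variable names are illustrative.

```r
# Per-region means computed from the sample data above
region <- c("Southern Asia", "Southern Europe", "Northern Africa")
avg <- c(5.075, 22.8, 25.9)

# reorder() sorts the levels by increasing value of the second argument,
# so negating the average puts the largest average first
ordered_region <- reorder(region, -avg)
levels(ordered_region)
```

The levels come out as Northern Africa, Southern Europe, Southern Asia, which is the left-to-right bar order.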

The problem can be solved by summarising the data first and sorting the result by Average. Then coerce Region 1 to a factor whose levels follow that sorted order, so that ggplot2 draws the bars in the same order.
library(ggplot2)
library(dplyr)
continent %>%
group_by(`Region 1`) %>%
summarise(Average = mean(Average, na.rm = TRUE)) %>%
arrange(desc(Average)) %>%
mutate(`Region 1` = factor(`Region 1`, levels = unique(`Region 1`))) %>%
ggplot(aes(x = `Region 1`, y = Average, fill = Average)) +
geom_col() +
xlab("Region") + ylab("Average Obesity") +
ggtitle("Average obesity rate of each region") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) -> region_plot
region_plot
Data
continent <- read.table(text = "
'Country or Area' 'Region 1' Year Rate MinCI MaxCI Average
1 Afghanistan 'Southern Asia' 2011 4.2 2.6 6.2 4.4
2 Afghanistan 'Southern Asia' 2016 5.5 3.4 8.1 5.75
3 'Aland Islands' 'Northern Europe' NA NA NA NA NA
4 Albania 'Southern Europe' 2011 18.8 14.8 23 18.9
5 Albania 'Southern Europe' 2016 21.7 17 26.7 21.8
6 Algeria 'Northern Africa' 2011 24 19.9 28.4 24.2
7 Algeria 'Northern Africa' 2016 27.4 22.5 32.7 27.6
8 'American Samoa' Polynesia NA NA NA NA NA
9 Andorra 'Southern Europe' 2011 24.6 19.8 29.8 24.8
10 Andorra 'Southern Europe' 2016 25.6 20.1 31.3 25.7
", header = TRUE, check.names = FALSE)

Use dplyr to filter dataframe by 32 conditions stored in a 2nd dataframe

Let me dive right into a reproducible example here:
Here is the dataframe with these "possession" conditions to be met for each team:
poss_quantiles <- structure(list(conferenceId = c("A10", "AAC", "ACC", "AE", "AS",
"BIG10", "BIG12", "BIGEAST", "BIGSKY", "BIGSOUTH", "BIGWEST",
"COLONIAL", "CUSA", "HORIZON", "IVY", "MAAC", "MAC", "MEAC",
"MVC", "MWC", "NE", "OVC", "PAC12", "PATRIOT", "SEC", "SOUTHERN",
"SOUTHLAND", "SUMMIT", "SUNBELT", "SWAC", "WAC", "WCC"), values = c(25.5,
33.625, 57.65, 16, 20.9, 48.55, 63.9, 45, 17.95, 28, 11, 24.4,
23.45, 10.5, 16, 12.275, 31.5, 10.95, 21.425, 36.8999999999999,
31.025, 18.1, 23.7, 19.675, 52.9999999999997, 24.5, 15, 27.5,
12.6, 17.75, 13, 33)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -32L))
> head(poss_quantiles)
# A tibble: 6 x 2
conferenceId values
<chr> <dbl>
1 A10 25.5
2 AAC 33.6
3 ACC 57.6
4 AE 16
5 AS 20.9
6 BIG10 48.5
My main dataframe looks as follows:
> head(stats_df)
# A tibble: 6 x 8
season teamId teamName teamMarket conferenceName conferenceId possessions games
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <int>
1 1819 AFA Falcons Air Force Mountain West MWC 75 2
2 1819 AKR Zips Akron Mid-American MAC 46 3
3 1819 ALA Crimson Tide Alabama Southeastern SEC 90.5 6
4 1819 ARK Razorbacks Arkansas Southeastern SEC 71.5 5
5 1819 ARK Razorbacks Arkansas Southeastern SEC 42.5 5
6 1819 ASU Sun Devils Arizona State Pacific 12 PAC12 91.5 7
> dim(stats_df)
[1] 6426 500
I need to filter the main dataframe stats_df so that only rows whose possessions value is greater than their conference's value in the poss_quantiles dataframe are kept. I am struggling to figure out the best way to do this with dplyr.

I believe the following is what the question asks for.
I have made up a dataset to test the code. Posted at the end.
library(dplyr)
stats_df %>%
  inner_join(poss_quantiles, by = "conferenceId") %>%  # attach each conference's threshold
  filter(possessions > values) %>%                     # keep rows above the threshold
  select(-values)                                      # drop the helper column
# conferenceId possessions otherCol oneMoreCol
#1 s 119.63695 -1.2519859 1.3853352
#2 d 82.68660 -0.4968500 0.1954866
#3 b 103.58936 -1.0149620 0.9405918
#4 o 139.69607 -0.1623095 0.4832004
#5 q 76.06736 0.5630558 0.1319336
#6 x 86.19777 -0.7733534 2.3939706
#7 p 135.80127 -1.1578085 0.2037951
#8 t 136.05944 1.7770844 0.5145781
Data creation code.
set.seed(1234)
poss_quantiles <- data.frame(conferenceId = letters[sample(26, 20)],
values = runif(20, 50, 100),
stringsAsFactors = FALSE)
stats_df <- data.frame(conferenceId = letters[sample(26, 20)],
possessions = runif(20, 10, 150),
otherCol = rnorm(20),
oneMoreCol = rexp(20),
stringsAsFactors = FALSE)
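The same per-conference threshold lookup can also be done without a join. The sketch below uses base R's match() to find each row's threshold. The two small data frames are stand-ins built from a few rows and values shown in the question, not the full data.

```r
# Stand-in thresholds for three conferences (values rounded from the question)
poss_quantiles <- data.frame(
  conferenceId = c("MWC", "MAC", "SEC"),
  values = c(36.9, 31.5, 53.0)
)

# Stand-in for the main data frame (rows taken from head(stats_df))
stats_df <- data.frame(
  teamId = c("AFA", "AKR", "ALA", "ARK", "ARK"),
  conferenceId = c("MWC", "MAC", "SEC", "SEC", "SEC"),
  possessions = c(75, 46, 90.5, 71.5, 42.5)
)

# Look up the threshold for each row's conference
threshold <- poss_quantiles$values[
  match(stats_df$conferenceId, poss_quantiles$conferenceId)
]

# which() drops rows whose conference has no threshold (NA comparison)
filtered <- stats_df[which(stats_df$possessions > threshold), ]
filtered
```

Here the second ARK row (42.5 possessions) falls below the SEC threshold and is removed, while the other four rows are kept.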

How to specify ID variables in dcast?

Following is a sample of the data I have
datahave
# A tibble: 6 x 6
YEAR SCHOOL_NAME CONTENT_AREA BELOW_BASIC_PCT BASIC_PCT ADVANCED_PCT
<dbl> <chr> <chr> <chr> <chr> <chr>
1 2015 5TH AND 6TH GRADE CTR. Eng. Language Arts 38.1 28.3 10.1
2 2015 5TH AND 6TH GRADE CTR. Mathematics 39 30.3 14.6
3 2015 5TH AND 6TH GRADE CTR. Science 25.4 41.7 12.3
4 2015 6TH GRADE CENTER Eng. Language Arts 7.6 27.8 21.8
5 2015 6TH GRADE CENTER Mathematics 19.100000000000001 37.700000000000003 17.5
6 2015 7th and 8th Grade Center Eng. Language Arts 52.1 27.4 1.7
Following is a reproducible example similar to this
school<-c("A","A",'A','B','B','B')
content_area<-c('english','math','science','english','math','science')
below_basic<-c(20,30,40,10,15,20)
advanced<-c(2,5,3,1,2.5,1.5)
df<-data.frame(school,content_area,below_basic,advanced)
df
and ran the following code on the above
library(reshape2)
dcast(melt(df), school ~ content_area + variable)
This gives me the desired output, and melt reports "Using school, content_area as id variables".
However when I run the same code on the original dataset
dcast(melt(datahave), SCHOOL_NAME ~ CONTENT_AREA + variable)
melt instead reports "Using SCHOOL_NAME, CONTENT_AREA, BELOW_BASIC_PCT, BASIC_PCT, ADVANCED_PCT as id variables".
How do I specify which columns are used as the ID variables, so that I get an output similar to the reproducible example?

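We can specify id.vars in melt. Otherwise melt picks the id variables automatically based on column type: character and factor columns become id variables. In datahave the percentage columns are stored as character (<chr>), which is why they were picked up as ids.
library(reshape2)
dcast(melt(datahave, id.vars = c("YEAR", "SCHOOL_NAME", "CONTENT_AREA")),
      SCHOOL_NAME ~ CONTENT_AREA + variable)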
# SCHOOL_NAME Eng. Language Arts_BELOW_BASIC_PCT Eng. Language Arts_BASIC_PCT
#1 5TH AND 6TH GRADE CTR. 38.1 28.3
#2 6TH GRADE CENTER 7.6 27.8
#3 7th and 8th Grade Center 52.1 27.4
# Eng. Language Arts_ADVANCED_PCT Mathematics_BELOW_BASIC_PCT Mathematics_BASIC_PCT Mathematics_ADVANCED_PCT
#1 10.1 39.0 30.3 14.6
#2 21.8 19.1 37.7 17.5
#3 1.7 NA NA NA
# Science_BELOW_BASIC_PCT Science_BASIC_PCT Science_ADVANCED_PCT
#1 25.4 41.7 12.3
#2 NA NA NA
#3 NA NA NA
recast() wraps melt and dcast in a single call and can be used as well:
recast(datahave, id.var = c("YEAR", "SCHOOL_NAME", "CONTENT_AREA"),
SCHOOL_NAME ~ CONTENT_AREA + variable)
Data:
datahave <- structure(list(YEAR = c(2015L, 2015L, 2015L, 2015L, 2015L, 2015L
), SCHOOL_NAME = c("5TH AND 6TH GRADE CTR.", "5TH AND 6TH GRADE CTR.",
"5TH AND 6TH GRADE CTR.", "6TH GRADE CENTER", "6TH GRADE CENTER",
"7th and 8th Grade Center"), CONTENT_AREA = c("Eng. Language Arts",
"Mathematics", "Science", "Eng. Language Arts", "Mathematics",
"Eng. Language Arts"), BELOW_BASIC_PCT = c(38.1, 39, 25.4, 7.6,
19.1, 52.1), BASIC_PCT = c(28.3, 30.3, 41.7, 27.8, 37.7, 27.4
), ADVANCED_PCT = c(10.1, 14.6, 12.3, 21.8, 17.5, 1.7)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
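Note that the question's tibble shows the percentage columns as <chr>, while the data above stores them as numbers. If the real columns are character, one option is to convert them before melting so the cast result is numeric. The sketch below assumes reshape2 and uses a small made-up data frame (datahave_chr, with shortened school names) rather than the original data.

```r
library(reshape2)

# Hypothetical stand-in: percentage columns stored as character, as in the question
datahave_chr <- data.frame(
  YEAR = c(2015, 2015, 2015),
  SCHOOL_NAME = c("A", "A", "B"),
  CONTENT_AREA = c("Mathematics", "Science", "Mathematics"),
  BELOW_BASIC_PCT = c("39", "25.4", "19.100000000000001"),
  BASIC_PCT = c("30.3", "41.7", "37.700000000000003"),
  stringsAsFactors = FALSE
)

# Convert every *_PCT column from character to numeric
pct_cols <- grep("_PCT$", names(datahave_chr), value = TRUE)
datahave_chr[pct_cols] <- lapply(datahave_chr[pct_cols], as.numeric)

# Melt with explicit id variables, then cast to one row per school
long <- melt(datahave_chr, id.vars = c("YEAR", "SCHOOL_NAME", "CONTENT_AREA"))
wide <- dcast(long, SCHOOL_NAME ~ CONTENT_AREA + variable, value.var = "value")
wide
```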

Grouping variable with barplot in RStudio

Income1.csv
Age.Group X X.1 X.2 X.3 X.4
1 Income 16-24 25-34 35-44 45-54 55+
2 Low 13.9 17.4 14.9 11.9 10.9
3 Medium 26.3 46.9 42.2 30.7 21.5
4 High 11.6 19.7 22.4 17.4 6.7
How do you create a grouped barplot with the height as Age? The picture below is what I want to create.

Read your data (this is the output of dput(d)):
d <- structure(list(Income = structure(c(2L, 3L, 1L), .Label = c("High",
"Low", "Medium"), class = "factor"), `16-24` = c(13.9, 26.3,
11.6), `25-34` = c(17.4, 46.9, 19.7), `35-44` = c(14.9, 42.2,
22.4), `45-54` = c(11.9, 30.7, 17.4), `55+` = c(10.9, 21.5, 6.7
)), .Names = c("Income", "16-24", "25-34", "35-44", "45-54",
"55+"), class = "data.frame", row.names = c(NA, -3L))
Plot your data. Setting beside = TRUE draws the bars side by side rather than stacked.
barplot(as.matrix(d[,-1]), beside = TRUE, legend.text = d$Income)
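The printed Income1.csv suggests the real column names sit in the file's second line, which is why they show up as the first data row. The sketch below shows one way to read such a file so the age groups become column names. The csv_text contents are an assumption about what the file looks like, reconstructed from the printout in the question.

```r
# Assumed file contents: a title line followed by the real header line
csv_text <- "Age.Group,,,,,
Income,16-24,25-34,35-44,45-54,55+
Low,13.9,17.4,14.9,11.9,10.9
Medium,26.3,46.9,42.2,30.7,21.5
High,11.6,19.7,22.4,17.4,6.7"

# skip = 1 drops the title line; check.names = FALSE keeps names like "16-24"
d <- read.csv(text = csv_text, skip = 1, check.names = FALSE)

# Matrix of bar heights: one row per income level, one column per age group
heights <- as.matrix(d[, -1])
rownames(heights) <- d$Income

# Grouped bars: one cluster per age group, one bar per income level
barplot(heights, beside = TRUE, legend.text = rownames(heights))
```

For a file on disk, read.csv("Income1.csv", skip = 1, check.names = FALSE) would take the place of the text = argument.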
