I am trying to create a box plot with an R script from a tab-delimited file, "New.txt", of the following type, where the number of rows and columns will vary:
Chr Start End Name 18NGS31 18MPD168 18NGS21 18NGS29 18NGS33 18NGS38
chr9 1234 1234 ABL1 1431 1 1112 1082 1809 1647
chr9 2345 2345 ASXL1 3885 37 3578 1974 2921 3559
chr9 3456 3456 ETV6 3235 188 2911 1578 2344 2673
chr9 4567 4567 MYD88 3198 187 2860 1547 2289 2621
After skipping the first four columns, I create a box plot in R from the 5th column onwards using the following commands:
file <- "new.txt"
x=read.table(file,skip=1)
boxplot(x$V5,x$V6,x$V7,x$V9,x$V10,x$V11,col=rainbow(54),xlab="abc",ylab="Coverage",main="Coverage Metrics")
And I am getting the following box plot:
[Image: resulting box plot]
I want to modify this command so that it handles any number of columns present in the tab-delimited file and labels each box with its column header.
I recommend reshaping from wide to long.
Here is a minimal example using ggplot2:
# Sample data
df <- data.frame(id = paste0("id", 1:100), matrix(rnorm(1000), ncol = 10))
library(dplyr)
library(tidyr)
library(ggplot2)
df %>%
    gather(key, value, -id) %>%
    mutate(key = factor(key, levels = paste0("X", 1:10))) %>%
    ggplot(aes(x = key, y = value)) +
    geom_boxplot()
Explanation: Reshaping from wide to long stores the column names in a new column key and its values in value; we can then simply map key to x. This works for an arbitrary number of columns.
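To make the wide-to-long idea concrete without any packages, here is a minimal sketch using base R's stack(). The data frame wide and its values are made up for illustration; stack() names its output columns values and ind, which play the roles of value and key above.

```r
# Hypothetical wide data: one id column plus three measurement columns
wide <- data.frame(id = c("id1", "id2"),
                   X1 = c(0.5, 1.5),
                   X2 = c(2.0, 3.0),
                   X3 = c(4.0, 5.0))

# stack() puts every measurement in `values` and the name of the
# column it came from in the factor `ind`
long <- stack(wide[, -1])

# carry the id along: it repeats once per measurement column
long$id <- rep(wide$id, times = ncol(wide) - 1)

print(long)
```

Each row of long now holds a single measurement, so a plotting function only needs to know which column is the group (ind) and which is the value (values), however many columns the wide table had.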
Update
Using your sample data
df <- read.table(text =
"Chr Start End Name 18NGS31 18MPD168 18NGS21 18NGS29 18NGS33 18NGS38
chr9 1234 1234 ABL1 1431 1 1112 1082 1809 1647
chr9 2345 2345 ASXL1 3885 37 3578 1974 2921 3559
chr9 3456 3456 ETV6 3235 188 2911 1578 2344 2673
chr9 4567 4567 MYD88 3198 187 2860 1547 2289 2621", header = TRUE, check.names = FALSE)  # check.names = FALSE keeps headers such as 18NGS31 unchanged
df %>%
    gather(key, value, -Chr, -Start, -End, -Name) %>%
    ggplot(aes(x = key, y = value, fill = key)) +
    geom_boxplot()
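If ggplot2 is not required, base R can also draw one box per column without reshaping, because boxplot() accepts a data frame and labels each box with the column name. The sketch below assumes the first four columns are always the annotation columns to skip; the object names x and samples and the axis label "Sample" are illustrative only.

```r
# Sample rows from the question; for the real file one would use something like
# read.table("New.txt", header = TRUE, sep = "\t", check.names = FALSE)
x <- read.table(text =
"Chr Start End Name 18NGS31 18MPD168 18NGS21 18NGS29 18NGS33 18NGS38
chr9 1234 1234 ABL1 1431 1 1112 1082 1809 1647
chr9 2345 2345 ASXL1 3885 37 3578 1974 2921 3559
chr9 3456 3456 ETV6 3235 188 2911 1578 2344 2673
chr9 4567 4567 MYD88 3198 187 2860 1547 2289 2621",
                header = TRUE, check.names = FALSE)

# Drop the four annotation columns; whatever remains is plotted
samples <- x[, -(1:4)]

# One box per remaining column, labelled with its header;
# las = 2 rotates the labels so long sample names stay readable
bp <- boxplot(samples,
              col = rainbow(ncol(samples)),
              xlab = "Sample", ylab = "Coverage",
              main = "Coverage Metrics", las = 2)
```

Because the columns are selected by position rather than by name, the same call works for any number of sample columns.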
I have a dataframe that shows the number of car sales in each country for the years 2000 to 2020. I wish to plot a line graph to show how the number of car sales has changed over time for only a specific country/row, with year on the x axis and sales on the y axis. How would I do this using ggplot?
Perhaps you want this:
#toy_data
sales
#> Country 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
#> 2 A 1002 976 746 147 1207 627 157 1481 1885 1908 392
#> 3 B 846 723 1935 176 1083 636 1540 1692 899 607 1446
#> 4 C 1858 139 1250 121 1520 199 864 238 1109 1029 937
#> 5 D 534 1203 1759 553 1765 1784 1410 420 606 467 1391
library(tidyverse)
#for all countries
sales %>% pivot_longer(!Country, names_to = 'year', values_to = 'sales') %>%
  mutate(year = as.numeric(year)) %>%
  ggplot(aes(x = year, y = sales, color = Country)) +
  geom_line()
#for one country
sales %>% pivot_longer(!Country, names_to = 'year', values_to = 'sales') %>%
  mutate(year = as.numeric(year)) %>%
  filter(Country == 'A') %>%
  ggplot(aes(x = year, y = sales)) +
  geom_line()
Created on 2021-06-07 by the reprex package (v2.0.0)
Suppose you have a data frame that looks like this:
#make dummy df
df <- matrix(sample(1:100, 63), ncol=21, nrow=3)
rownames(df) <- c("UK", "US", "UAE")
colnames(df) <- 2000:2020
Here I generated some random data for 21 years between 2000 and 2020, and for three countries. To get a line plot with ggplot for UK, I did:
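library(ggplot2)
# column names are character, so convert the years to numeric to get a continuous x axis
data_uk <- data.frame(year=as.numeric(colnames(df)), sales=df["UK",], row.names=NULL)
ggplot(data=data_uk, aes(x=year, y=sales, group=1)) + geom_point() + geom_line()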
Example plot
Is there a way to sum variables (e.g. sales and units) for all unique variable names (brands like Coke and Pepsi) within a dataframe?
To help, here is some example data.
set.seed(123)
period <- seq(as.Date('2021/01/01'), as.Date('2021/01/07'), by="day")
Coke_Regular_Units <- sample(1000:2000, 7, replace = TRUE)
Coke_Diet_Units <- sample(1000:2000, 7, replace = TRUE)
Coke_Regular_Sales <- sample(500:1000,7, replace = TRUE)
Coke_Diet_Sales <- sample(500:1000, 7, replace = TRUE)
Pepsi_Regular_Units <- sample(1000:2000, 7, replace = TRUE)
Pepsi_Diet_Units <- sample(1000:2000, 7, replace = TRUE)
Pepsi_Regular_Sales <- sample(500:1000, 7, replace = TRUE)
Pepsi_Diet_Sales <- sample(500:1000, 7, replace = TRUE)
df <- data.frame(period, Coke_Regular_Units, Coke_Diet_Units, Coke_Regular_Sales, Coke_Diet_Sales,
                 Pepsi_Regular_Units, Pepsi_Diet_Units, Pepsi_Regular_Sales, Pepsi_Diet_Sales)
> head(df)
period Coke_Regular_Units Coke_Diet_Units Coke_Regular_Sales Coke_Diet_Sales Pepsi_Regular_Units
1 2021-01-01 1414 1117 589 847 1425
2 2021-01-02 1462 1298 590 636 1648
3 2021-01-03 1178 1228 755 976 1765
4 2021-01-04 1525 1243 696 854 1210
5 2021-01-05 1194 1013 998 827 1931
6 2021-01-06 1937 1373 590 525 1589
Pepsi_Diet_Units Pepsi_Regular_Sales Pepsi_Diet_Sales
1 1554 608 943
2 1870 762 808
3 1372 892 634
4 1843 924 808
5 1142 829 910
6 1543 522 723
I would like code that automatically calculates Coke_Sales, Coke_Units, Pepsi_Sales, Pepsi_Units, Regular_Sales and Diet_Units.
I am currently doing it like this for each variable
library(dplyr)
df$Coke_Sales <- rowSums(Filter(is.numeric, select(df, (matches("Coke") & matches("Sales")))))
df$Coke_Units <- rowSums(Filter(is.numeric, select(df, (matches("Coke") & matches("Units")))))
This is ok for a small number of variables, but I need to do this for 100s of variables. Is there any function that enables this? It would need to automatically find the unique variable names like Coke, Pepsi, Diet and Regular. The metric is the last part of the variable name, so doesn't necessarily need to auto-find this but would be great. If it makes it any easier, it would be ok to specify the metrics as there are only 3 metrics at most, but there are hundreds of brands.
If it can't be automated, is there a way it can be simplified, where I specify the variables required? Not perfect, but still an improvement. For example, including these lines of code to specify the variables to sum and the metrics required:
VarsToSum <- c("Coke", "Pepsi", "Diet", "Regular")
Metrics <- c("Sales", "Units")
If it can't be accomplished that way either, maybe I need to break it into smaller steps; any tips would be great. I am trying to think how to do it: should I find the unique names before the "_" separator, then calculate "Sales" and "Units" for those unique names? Would this be the best way to do it? Or should I reshape the data? Are there any other routes to get there?
Any help, or directions how to achieve this would be greatly appreciated. Thanks
Here is a data.table approach:
library( data.table )
setDT(df) #make it a data.table
#melt to long
ans <- melt( df, id.vars = "period", variable.factor = FALSE )
#split variable to 3 new columns
ans[, c("brand", "type", "what") := tstrsplit( variable, "_" ) ]
# > head(ans)
# period variable value brand type what
# 1: 2021-01-01 Coke_Regular_Units 1414 Coke Regular Units
# 2: 2021-01-02 Coke_Regular_Units 1462 Coke Regular Units
# 3: 2021-01-03 Coke_Regular_Units 1178 Coke Regular Units
# 4: 2021-01-04 Coke_Regular_Units 1525 Coke Regular Units
# 5: 2021-01-05 Coke_Regular_Units 1194 Coke Regular Units
# 6: 2021-01-06 Coke_Regular_Units 1937 Coke Regular Units
#summarise however you like
ans[, .(total = sum(value) ), by = .(brand, type, what)]
# brand type what total
# 1: Coke Regular Units 10527
# 2: Coke Diet Units 8936
# 3: Coke Regular Sales 5158
# 4: Coke Diet Sales 5171
# 5: Pepsi Regular Units 11160
# 6: Pepsi Diet Units 10813
# 7: Pepsi Regular Sales 5447
# 8: Pepsi Diet Sales 5491
Using outer to paste the name parts into patterns, and grep to find the matching columns:
sapply(outer(c("Coke", "Pepsi"), c("Sales", "Units"), paste, sep=".*"), function(x)
rowSums(df[grep(x, names(df))]))
# Coke.*Sales Pepsi.*Sales Coke.*Units Pepsi.*Units
# [1,] 1436 1551 2531 2979
# [2,] 1226 1570 2760 3518
# [3,] 1731 1526 2406 3137
# [4,] 1550 1732 2768 3053
# [5,] 1825 1739 2207 3073
# [6,] 1115 1245 3310 3132
# [7,] 1446 1575 3481 3081
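The brand and metric vectors in that call are typed by hand. As a rough sketch of how they might be derived from the column names instead, assuming every summed column is named Brand_Type_Metric (a column such as period would have to be dropped first), one could split the names on "_" and loop over the unique parts. The small data frame and the names parts, groups, metrics and totals are illustrative, not from the question.

```r
# Small hypothetical data with the same naming scheme as the question
df <- data.frame(
  Coke_Regular_Units  = c(10, 20),
  Coke_Diet_Units     = c(1, 2),
  Coke_Regular_Sales  = c(5, 6),
  Coke_Diet_Sales     = c(3, 4),
  Pepsi_Regular_Units = c(100, 200),
  Pepsi_Diet_Units    = c(7, 8),
  Pepsi_Regular_Sales = c(50, 60),
  Pepsi_Diet_Sales    = c(9, 11)
)

# One row per column name: brand, type, metric
parts <- do.call(rbind, strsplit(names(df), "_", fixed = TRUE))

# Everything before the metric is a candidate grouping name
groups  <- unique(c(parts[, 1], parts[, 2]))   # Coke, Pepsi, Regular, Diet
metrics <- unique(parts[, 3])                  # Units, Sales

# Every group/metric combination
combos <- expand.grid(group = groups, metric = metrics,
                      stringsAsFactors = FALSE)

# Row-wise sum of all columns that mention the group and end in the metric
totals <- mapply(function(g, m) {
  keep <- (parts[, 1] == g | parts[, 2] == g) & parts[, 3] == m
  rowSums(df[, keep, drop = FALSE])
}, combos$group, combos$metric)
colnames(totals) <- paste(combos$group, combos$metric, sep = "_")

print(totals)
```

Matching on the split parts rather than on a regular expression avoids accidental hits when one brand name happens to be a substring of another.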
Here's a solution similar in spirit to that of @Wimpel, but with the tidyverse:
library(tidyverse)
summary_df <-
  df %>%
  pivot_longer(cols = ends_with("Sales") | ends_with("Units"),
               names_to = c("brand", "type", ".value"),
               names_pattern = "(.*)_(.*)_(.*)") %>%
  group_by(brand) %>%
  summarize(Sales = sum(Sales),
            Units = sum(Units)) %>%
  pivot_wider(names_from = "brand",
              values_from = c("Sales", "Units"),
              names_glue = "{brand}_{.value}")
summary_df
# # A tibble: 1 x 4
# Coke_Sales Pepsi_Sales Coke_Units Pepsi_Units
# <int> <int> <int> <int>
# 1 10329 10938 19463 21973
I have analyzed current data (ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide) from the European Centre for Disease Prevention and Control, which keeps track of COVID-19 cases across months and countries. I would like to gain insights into the spread of active cases, and also into how they relate to deaths from the disease. My goal: to create a variable that stores the percentage of deaths per total infections for every day in March, by country.
Here is my code:
library(readxl)
library(dplyr)  # provides %>%, filter() and select() used below
d <- read_excel("C:/Users/hanna/Downloads/COVID-19-geographic-disbtribution-worldwide.xlsx")
#View(d)
corona_de <- d %>% filter(`Countries and territories` == "Germany" & Month == 3)
# explore the data
library(skimr)
skim(corona_de)
library(ggplot2)
ggplot(corona_de, aes (x = Day, y = Cases)) +
geom_line(color = "red")+ theme_classic()
# Germany, England, Spain, Italy, France, Austria
corona <- d %>% filter(`Countries and territories` == "Germany" |
`Countries and territories` == "France" |
`Countries and territories` == "Italy" |
`Countries and territories` == "Spain") #filter for month later %>% filter(Month == 3)
#----------------------------------------------------------------
# Preprocess data and create cumulative and percent variables
#----------------------------------------------------------------
# format dates
library(lubridate)
corona$DateRep <-as.Date(corona$DateRep,"%Y-%m-%d UTC")
# restrict to March
corona_march <- corona %>% filter(Month == 3)
# store in list for later
dates <- corona_march$DateRep
# store countries list to loop through
countries <- unique(corona$`Countries and territories`)
#create empty objects
active_cases<- NULL
deaths_cum <- NULL
active_percent <- NULL
death_percent <- NULL
#loop through number of countries
for (c in 1:4){
current_country <- subset(corona_march, `Countries and territories` == countries[c])
# loop through days of March
for (i in 25:1){
# get new cases, deaths and population size for that day
current_interval = current_country %>% filter(DateRep >= dates[i])
current_case = current_interval %>% select(Cases)
current_death = current_interval %>% select(Deaths)
pop = current_country %>% filter(DateRep == dates[i]) %>% select(Pop_Data.2018)
# calculate cumulative cases, deaths and percent active
active_cum = sum(current_case$Cases)
percent_active = active_cum / pop[[1]]
cum_death = sum(current_death)
# avoid scientific notation
options("scipen"=100, "digits"=7)
percent_death = cum_death / pop[[1]]
# store variables in list
active_cases <- append(active_cases,active_cum)
deaths_cum <- append(deaths_cum,cum_death)
active_percent <- append(active_percent,percent_active)
percent_death <- append(death_percent, percent_death)
}
}
Surprisingly, everything works fine except for the percent_death variable. For the cumulative deaths, the output looks like this:
deaths_cum
[1] 1098 1098 1098 1097 1096 1096 1093 1091 1090 1081 1070 1067 1052 1039 1021 1009 973
[18] 952 925 856 728 650 538 426 240 149 149 149 149 149 149 149 149 149
[35] 149 147 147 146 144 144 141 137 136 136 136 106 104 82 55 23 6799
[52] 6791 6785 6768 6740 6713 6672 6623 6587 6454 6356 6189 5993 5804 5552 5379 5009 4662
[69] 4315 3842 3413 2788 1993 1344 743 2696 2696 2696 2696 2696 2695 2693 2691 2691 2691
[86] 2668 2661 2649 2612 2575 2560 2408 2387 2205 2098 1929 1694 1370 976 514
But for the percent_death variable, it seems to stop after 1 iteration:
> percent_death
[1] 0.00001100083
Any idea what happened? Why does the append function work for all of the variables except for small numbers? Is there a smarter way to do it?
I have now found a neat way to perform the computation with data.table, which I would not have found without the initial advice from Gregor Thomas. The changes are:
change object names to more distinguishable ones
avoid scientific notation of small values using options()
use data.table instead of several loops to reduce complexity and make the code cleaner
corona <- d %>% filter(`Countries and territories` %in% c("Germany", "France", "Italy", "Spain") )
corona_march <- corona %>% filter(Month == 3)
library(data.table)
corona_table <- data.table(corona_march)
# sort chronologically within each country so that cumsum() accumulates forward in time
setorder(corona_table, `Countries and territories`, Day)
# for each country in corona_march, calculate the cumulative cases and cumulative deaths
# (group by country only: also grouping by Day leaves one row per group, so nothing accumulates)
corona_table[, Active_cases := cumsum(Cases), by = .(`Countries and territories`)]
corona_table[, Deaths_cumulative := cumsum(Deaths), by = .(`Countries and territories`)]
# avoid scientific notation for percentages
options("scipen"=100, "digits"=7)
# the ratios are computed row by row, so no grouping is needed
corona_table[, Percent_active := Active_cases/Pop_Data.2018]
corona_table[, Percent_death := Deaths_cumulative/Active_cases]
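As for why the loop in the question ended up with a single value: it looks like the last append() assigns its result to percent_death (the per-iteration scalar) instead of death_percent (the accumulator), so death_percent stays NULL and percent_death is overwritten on every pass. This has nothing to do with the numbers being small. A minimal sketch of the effect, with made-up values in place of the real data:

```r
death_percent <- NULL   # accumulator, meant to collect one value per iteration

for (i in 1:3) {
  percent_death <- i / 100   # stand-in for the value computed in each iteration
  # Same pattern as in the question: the result is stored in the scalar,
  # so death_percent is never updated and stays NULL
  percent_death <- append(death_percent, percent_death)
}
buggy <- percent_death   # only the value from the last iteration survives

# Assigning back to the accumulator collects every iteration
fixed <- NULL
for (i in 1:3) {
  percent_death <- i / 100
  fixed <- append(fixed, percent_death)
}

print(buggy)
print(fixed)
```

So the loop did run to the end; the single number printed is simply the value from its final iteration.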
I have two data frames, LF and HF:
head(LF)
Year SS SS_CQT SRP SRP_CQT TDP TDP_CQT TP TP_CQT
1 2009 386.18 1164.3966 4586 12.30089 5285 14.23955 6707 18.17906
2 2010 268.72 884.9963 4354 13.37728 4927 15.20045 6078 18.81523
3 2011 347.61 746.7686 6924 12.25466 7917 13.84788 9302 16.93291
4 2012 170.68 1218.6758 2471 16.39350 3006 19.60066 3670 24.18561
head(HF)
Year SS SS_CQT SRP SRP_CQT TDP TDP_CQT TP TP_CQT
1 2009 184.44 4055.367 535 11.53037 621 13.50632 1175 25.82282
2 2010 118.08 2726.272 737 14.44196 868 16.92781 1236 24.56522
3 2011 119.90 2208.308 663 10.19803 742 11.42253 1086 17.36818
4 2012 554.07 11913.003 2413 45.44719 2781 52.90863 4290 85.87746
5 2013 165.32 5926.628 424 15.93962 461 17.16547 873 31.70556
The following relationship holds for the above data frames: LF$SS + HF$SS = total load
I want to plot the proportion (%) of LF and HF for each column variable using the two data frames, as shown below.
Your help would be appreciated.
Here is an approach:
library(tidyverse)
lf %>%
  mutate(col = "lf") %>% #add column to lf specifying the data frame
  bind_rows(hf %>% #bind rows of hf
              mutate(col = "hf")) %>% #add column to hf specifying the data frame
  gather(key, value, 2:9) %>% #convert to long format
  group_by(key, Year) %>% #group by variable and year
  mutate(ratio = value/sum(value)) %>% #calculate the desired ratio
  ggplot()+
  geom_area(aes(x = Year, y = ratio, fill = col)) + #pretty much self explanatory
  facet_wrap(~key) +
  scale_y_continuous(labels = scales::percent)
data:
lf <- read.table(text = "Year SS SS_CQT SRP SRP_CQT TDP TDP_CQT TP TP_CQT
1 2009 386.18 1164.3966 4586 12.30089 5285 14.23955 6707 18.17906
2 2010 268.72 884.9963 4354 13.37728 4927 15.20045 6078 18.81523
3 2011 347.61 746.7686 6924 12.25466 7917 13.84788 9302 16.93291
4 2012 170.68 1218.6758 2471 16.39350 3006 19.60066 3670 24.18561", header = T)
hf <- read.table(text = "Year SS SS_CQT SRP SRP_CQT TDP TDP_CQT TP TP_CQT
1 2009 184.44 4055.367 535 11.53037 621 13.50632 1175 25.82282
2 2010 118.08 2726.272 737 14.44196 868 16.92781 1236 24.56522
3 2011 119.90 2208.308 663 10.19803 742 11.42253 1086 17.36818
4 2012 554.07 11913.003 2413 45.44719 2781 52.90863 4290 85.87746", header = T)
I have removed the last row from hf so that it matches the number of rows in lf.
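To sanity-check the ratio that the pipeline computes, the same proportion can be worked out directly for a single variable. This is only a sketch for the SS column, using the first four years from the question and the stated relationship LF$SS + HF$SS = total load; the names total and prop are illustrative.

```r
# SS values for 2009-2012, copied from the question
lf <- data.frame(Year = 2009:2012, SS = c(386.18, 268.72, 347.61, 170.68))
hf <- data.frame(Year = 2009:2012, SS = c(184.44, 118.08, 119.90, 554.07))

# Total load per year, then each data frame's share of it
total <- lf$SS + hf$SS
prop  <- data.frame(Year = lf$Year,
                    LF = lf$SS / total,
                    HF = hf$SS / total)

# The two shares add up to 1 in every year
print(round(prop, 3))
```

For 2009, for instance, the LF share is 386.18 / (386.18 + 184.44), roughly 0.68, which is the height of the lf band in the SS facet for that year.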
My answer doesn't differ much from @missuse's, except that it skips the need to calculate proportions.
For ggplot, you generally want data in long shape, so after binding the two data frames and marking which data frame observations come from (creating the type column in mutate), you should gather the data. In geom_area, using position = position_fill() calculates proportions within each facet, rather than you needing to do this manually.
library(tidyverse)
lf <- read.table(text = "Year SS SS_CQT SRP SRP_CQT TDP TDP_CQT TP TP_CQT
1 2009 386.18 1164.3966 4586 12.30089 5285 14.23955 6707 18.17906
2 2010 268.72 884.9963 4354 13.37728 4927 15.20045 6078 18.81523
3 2011 347.61 746.7686 6924 12.25466 7917 13.84788 9302 16.93291
4 2012 170.68 1218.6758 2471 16.39350 3006 19.60066 3670 24.18561", header = T)
hf <- read.table(text = "Year SS SS_CQT SRP SRP_CQT TDP TDP_CQT TP TP_CQT
1 2009 184.44 4055.367 535 11.53037 621 13.50632 1175 25.82282
2 2010 118.08 2726.272 737 14.44196 868 16.92781 1236 24.56522
3 2011 119.90 2208.308 663 10.19803 742 11.42253 1086 17.36818
4 2012 554.07 11913.003 2413 45.44719 2781 52.90863 4290 85.87746", header = T)
df <- bind_rows(
  lf %>% mutate(type = "LF"),
  hf %>% mutate(type = "HF")
) %>%
  gather(key = measure, value = value, -Year, -type)
ggplot(df, aes(x = Year, y = value, fill = type)) +
  geom_area(position = position_fill()) +
  facet_wrap(~ measure) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = c(HF = "darkorange", LF = "slateblue"))
Created on 2018-05-20 by the reprex package (v0.2.0).
Can I ask for the right command in [R] to add the values in two data.frame objects to produce a third "aggregate" data.frame? Here's the data from this post:
Mercy Hospital
Type A B C D E All
Operations 359 1836 299 2086 149 4729
Successful 292 1449 179 434 13 2366
and...
Hope Hospital
Type A B C D E All
Operations 88 514 222 86 45 955
Successful 70 391 113 12 2 588
The way I am doing it is long and cumbersome:
rbind(Hope[1,] + Mercy[1,], Hope[2,] + Mercy[2,])
A B C D E All
Operations 447 2350 521 2172 194 5684
Successful 362 1840 292 446 15 2954
Here is a way to do it with reshaping
library(dplyr)
library(tidyr)
list(hope = Hope, mercy = Mercy) %>%
  bind_rows(.id = "hospital") %>%
  gather(variable, value, -hospital, -Type) %>%
  group_by(Type, variable) %>%
  summarize(value = sum(value)) %>%
  spread(variable, value)
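When both tables hold the same rows and columns, base R arithmetic on data frames may be enough. The sketch below assumes Type is an ordinary column (as the gather() call above implies) and aligns the second table to the first by Type and by column name before adding, in case the order differs. The names value_cols, Mercy_aligned and total are illustrative.

```r
# Data from the question, with Type stored as a regular column (an assumption)
Mercy <- data.frame(Type = c("Operations", "Successful"),
                    A = c(359, 292), B = c(1836, 1449), C = c(299, 179),
                    D = c(2086, 434), E = c(149, 13), All = c(4729, 2366))
Hope  <- data.frame(Type = c("Operations", "Successful"),
                    A = c(88, 70), B = c(514, 391), C = c(222, 113),
                    D = c(86, 12), E = c(45, 2), All = c(955, 588))

# Numeric columns to add up
value_cols <- setdiff(names(Hope), "Type")

# Put Mercy's rows and columns in the same order as Hope's
Mercy_aligned <- Mercy[match(Hope$Type, Mercy$Type), value_cols]

# "+" on two data frames of equal dimensions adds them cell by cell
total <- cbind(Hope["Type"], Hope[value_cols] + Mercy_aligned)
print(total)
```

The reshaping pipeline generalises more easily to many tables, while this version keeps the original wide layout with no extra packages.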