I have a dataset that looks like this:
year china India United state ....
2020 30 40 50
2021 20 30 60
2022 34 20 40
....
I have 10 columns and more than 50 rows in this dataframe. I have to plot them in one graph to show how the different countries move over time.
So I think a line graph would be good for the purpose, but I don't know how I should do the visualisation.
I think I should change the dataframe format and then start the visualisation. How should I do it?
Pivot (reshape from wide to long) then plot with groups.
dat <- structure(list(year = 2020:2022, China = c(30L, 20L, 34L), India = c(40L, 30L, 20L), UnitedStates = c(50L, 60L, 40L)), class = "data.frame", row.names = c(NA, -3L))
datlong <- reshape2::melt(dat, "year", variable.name = "country", value.name = "value")
datlong
# year country value
# 1 2020 China 30
# 2 2021 China 20
# 3 2022 China 34
# 4 2020 India 40
# 5 2021 India 30
# 6 2022 India 20
# 7 2020 UnitedStates 50
# 8 2021 UnitedStates 60
# 9 2022 UnitedStates 40
### or using tidyr::
tidyr::pivot_longer(dat, -year, names_to = "country", values_to = "value")
Once reshaped, just group= (and optionally color=) lines:
library(ggplot2)
ggplot(datlong, aes(year, value, color = country)) +
geom_line(aes(group = country))
With only a few years, the axis is labelled with decimal years (such as 2020.5); if you have many more years, this will likely smooth out. Alternatively, you can control it by converting year to a Date class and forcing the display with scale_x_date.
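The Date conversion mentioned above could look roughly like this. It is only a sketch: placing each observation on 1 January of its year is an assumption, and the `date` column name is made up for the example.

```r
library(ggplot2)

# Same toy data as above, already in long form.
datlong <- data.frame(
  year    = rep(2020:2022, times = 3),
  country = rep(c("China", "India", "UnitedStates"), each = 3),
  value   = c(30, 20, 34, 40, 30, 20, 50, 60, 40)
)

# Assumption: each yearly value is placed on 1 January of that year.
datlong$date <- as.Date(paste0(datlong$year, "-01-01"))

p <- ggplot(datlong, aes(date, value, color = country)) +
  geom_line(aes(group = country)) +
  # One break per year, labelled with the four-digit year only.
  scale_x_date(date_breaks = "1 year", date_labels = "%Y")
print(p)
```

With 50 or more years, a wider spacing such as date_breaks = "5 years" would probably be easier to read.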
I have a dataframe that looks like this
Fruit    2021  2022
Apples   12    29
Bananas  11    31
Apples   44    55
Oranges  30    73
Oranges  19    82
Bananas  24    78
The Fruit names are not ordered so I can't group them by taking n at a time, they're listed randomly. I need to get the mean of fruits sold in 2021 & 2022 as well as mean sold for apples, oranges & bananas for each year separately.
My code is
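library(data.table)
# names that start with a digit must be quoted with backticks
`2021` <- c(mean(df$`2021`), sd(df$`2021`))
`2022` <- c(mean(df$`2022`), sd(df$`2022`))
measure <- c('mean','standard deviation')
df1 <- data.table(measure, `2021` = `2021`, `2022` = `2022`)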
and output looks like this:
Measure             2021  2022
mean                23.3  58
standard deviation  12.4  23.3
But I'm not sure where to start with grouping the rows by name. I need to get something that looks like this
Measure             2021  Apples  Bananas  Oranges  2022  Apples  Bananas  Oranges
mean                23.3                            58
standard deviation  12.4                            23.3
(with the appropriate numbers in the blank spaces)
I suggest the data might be better (in the long run) in a long format, and this summary is a step in that direction. This computes only the mean; it is not hard to repeat for sd and combine the results (quux is the data frame from the question):
fruits <- c(NA, "Apples", "Oranges", "Bananas")
lapply(quux[,-1], function(yr) stack(sapply(fruits, function(z) mean(yr[is.na(z) | quux$Fruit %in% z])))) |>
dplyr::bind_rows(.id = "year")
# year values ind
# 1 2021 23.33333 <NA>
# 2 2021 28.00000 Apples
# 3 2021 24.50000 Oranges
# 4 2021 17.50000 Bananas
# 5 2022 58.00000 <NA>
# 6 2022 42.00000 Apples
# 7 2022 77.50000 Oranges
# 8 2022 54.50000 Bananas
where NA in ind indicates all fruits, otherwise the individual fruit labeled.
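One possible way to repeat this for sd and combine both measures, as a base-R sketch. The helper name summarise_year and the result column names are made up for the example; quux is rebuilt here from the question's table.

```r
# The question's data; check.names = FALSE keeps "2021"/"2022" as column names.
quux <- data.frame(
  Fruit  = c("Apples", "Bananas", "Apples", "Oranges", "Oranges", "Bananas"),
  `2021` = c(12, 11, 44, 30, 19, 24),
  `2022` = c(29, 31, 55, 73, 82, 78),
  check.names = FALSE
)
fruits <- c(NA, "Apples", "Oranges", "Bananas")  # NA stands for "all fruits"

# Hypothetical helper: mean and sd of one year's column, per entry of `fruits`.
summarise_year <- function(yr) {
  do.call(rbind, lapply(fruits, function(z) {
    x <- yr[is.na(z) | quux$Fruit %in% z]
    data.frame(fruit = z, mean = mean(x), sd = sd(x))
  }))
}

# One block of rows per year column, stacked into a single long data frame.
res <- do.call(rbind, lapply(names(quux)[-1], function(nm) {
  data.frame(year = nm, summarise_year(quux[[nm]]))
}))
res
```

As before, a missing fruit label marks the row that covers all fruits for that year.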
If you put your data in long form, you could use the aggregate function:
a <- aggregate(value ~ year + fruit, data = df, FUN = function(x) c(sd = sd(x), mean = mean(x)))
Here value is a column you would create to hold the values that are now under 2021 and 2022, and year is a new column that holds 2021 or 2022 accordingly. Long form is almost always the way to go in R.
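A minimal base-R sketch of building that long form by hand and then calling aggregate. The names wide, fruit, year and value are assumptions chosen to match the formula in this answer.

```r
# The question's data in its original wide layout.
wide <- data.frame(
  Fruit  = c("Apples", "Bananas", "Apples", "Oranges", "Oranges", "Bananas"),
  `2021` = c(12, 11, 44, 30, 19, 24),
  `2022` = c(29, 31, 55, 73, 82, 78),
  check.names = FALSE
)

# Long form: one row per fruit entry and year.
df <- data.frame(
  fruit = rep(wide$Fruit, times = 2),
  year  = rep(c("2021", "2022"), each = nrow(wide)),
  value = c(wide$`2021`, wide$`2022`)
)

# Because FUN returns two numbers, a$value is a two-column matrix (mean, sd).
a <- aggregate(value ~ year + fruit, data = df,
               FUN = function(x) c(mean = mean(x), sd = sd(x)))
a
```

Dropping fruit from the formula (value ~ year) would give the per-year figures across all fruits.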
We may use
library(dplyr)
library(tidyr)
library(data.table)
library(stringr)
df1 %>%
pivot_longer(cols = where(is.numeric), names_to = 'year') %>%
as.data.table %>%
cube( .(Mean = mean(value), SD = sd(value)),
by = c("Fruit", "year")) %>%
filter(!if_all(Fruit:year, is.na)) %>%
unite(Fruit, Fruit, year, sep = "_", na.rm = TRUE) %>%
filter(str_detect(Fruit, "_|\\d+")) %>%
data.table::transpose(make.names = "Fruit", keep.names = "Measure")
-output
Measure Apples_2021 Apples_2022 Bananas_2021 Bananas_2022 Oranges_2021 Oranges_2022 2021 2022
1: Mean 28.00000 42.00000 17.500000 54.50000 24.500000 77.500000 23.33333 58.00000
2: SD 22.62742 18.38478 9.192388 33.23402 7.778175 6.363961 12.42041 23.57965
Or if we want the duplicate column names
df1 %>%
pivot_longer(cols = where(is.numeric), names_to = 'year') %>%
as.data.table %>%
cube( .(Mean = mean(value), SD = sd(value)), by = c("Fruit", "year")) %>%
mutate(Fruit = coalesce(Fruit, year)) %>%
drop_na(year) %>%
arrange(year, str_detect(Fruit, '\\d{4}', negate = TRUE)) %>%
select(-year) %>%
data.table::transpose(make.names = "Fruit", keep.names = "Measure")
-output
Measure 2021 Apples Bananas Oranges 2022 Apples Bananas Oranges
1: Mean 23.33333 28.00000 17.500000 24.500000 58.00000 42.00000 54.50000 77.500000
2: SD 12.42041 22.62742 9.192388 7.778175 23.57965 18.38478 33.23402 6.363961
data
df1 <- structure(list(Fruit = c("Apples", "Bananas", "Apples", "Oranges",
"Oranges", "Bananas"), `2021` = c(12L, 11L, 44L, 30L, 19L, 24L
), `2022` = c(29L, 31L, 55L, 73L, 82L, 78L)),
class = "data.frame", row.names = c(NA,
-6L))
To be honest, I am completely stuck, I'm not quite sure how to phrase the title either.
I have two datasets; let's say they look something like this:
Dataset1 (ie GDP related):
Year  Country
2000  Austria
2001  Austria
2000  Belgium
2001  Belgium
Dataset2 (TAX-related):
Year  Austria  Belgium
2000  55       48
2001  51       45
So what I would like is to write some sort of function/loop that essentially says:
if a value of the country variable in dataset1 is also a column name in dataset2, use those observations.
Then, conditional on the year and country, I want to create a new variable in dataset1 called tax and fill it with the country's tax rate from dataset2.
So for instance, Austria (an observation in dataset1) is also the name of a variable in dataset2, so I want to take its tax rate from dataset2 and apply 55 for year 2000 and 51 for 2001 in dataset1. And this should go on for all countries and years.
And should thus look like
Dataset1 (ie GDP related):
Year  Country  Tax
2000  Austria  55
2001  Austria  51
2000  Belgium  48
2001  Belgium  45
My dataset is quite big, so it is much preferred if I have some sort of algorithm for this
Thanks!
Assuming the first dataset has more columns, reshape the second dataset to long format with pivot_longer, then join it to the first one (left_join), matching on 'Year' and 'Country'.
library(dplyr)
library(tidyr)
df2 %>%
pivot_longer(cols = -Year, names_to = 'Country', values_to = 'Tax') %>%
left_join(df1, ., by = c('Year', 'Country'))
-output
Year Country Tax
1 2000 Austria 55
2 2001 Austria 51
3 2000 Belgium 48
4 2001 Belgium 45
data
df1 <- structure(list(Year = c(2000L, 2001L, 2000L, 2001L), Country = c("Austria",
"Austria", "Belgium", "Belgium")), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(Year = 2000:2001, Austria = c(55L, 51L), Belgium = c(48L,
45L)), class = "data.frame", row.names = c(NA, -2L))
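If you would rather not add packages, the same lookup can be sketched in base R with matrix indexing: find the row of df2 by year and the column by country name, then pull out that cell. This is an alternative to the join above, not part of the original answer, and the name tax_mat is made up for the example.

```r
df1 <- data.frame(Year = c(2000L, 2001L, 2000L, 2001L),
                  Country = c("Austria", "Austria", "Belgium", "Belgium"))
df2 <- data.frame(Year = 2000:2001, Austria = c(55L, 51L), Belgium = c(48L, 45L))

# Tax rates only, as a matrix: rows follow df2$Year, columns are countries.
tax_mat <- as.matrix(df2[-1])

# Row index = position of the year, column index = position of the country.
# A country or year missing from df2 yields NA instead of an error.
idx <- cbind(match(df1$Year, df2$Year),
             match(df1$Country, colnames(tax_mat)))
df1$Tax <- tax_mat[idx]
df1
```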
This should also work:
library(dplyr)
library(tidyr)
df2 %>%
# pivot_longer(-Year) %>% first solution
pivot_longer(cols = -Year, names_to = 'Country', values_to = 'Tax') %>% # taken from @akrun
arrange(Country)
Year Country Tax
<int> <chr> <int>
1 2000 Austria 55
2 2001 Austria 51
3 2000 Belgium 48
4 2001 Belgium 45
I have the following DataFrame in R:
Y ... Price Year Quantity Country
010190 ... 4781 2021 4 Germany
010190 ... 367 2021 3 Germany
010190 ... 4781 2021 6 France
010190 ... 250 2021 3 France
020190 ... 690 2021 NA USA
020190 ... 10 2021 6 USA
...... ... .... .. ...
217834 ... 56 2021 3 USA
217834 ... 567 2021 9 USA
As you can see, the numbers in the Y column start with 01.., 02..., 21... I want to aggregate these rows from 6 digits to 2 digits, grouping by the categorical columns (e.g. Country and Year) and summing numerical columns like Quantity and Price. I also want rows with NAs to be taken into account during the calculation. So, in the end I want this kind of output:
Y Price Year Quantity Country
01 5148 2021 7 Germany
01 5031 2021 9 France
02 700 2021 6 USA
.. .... ... .... ...
21 623 2021 12 USA
You can use group_by and summarize from dplyr
library(dplyr)
df %>%
mutate(Y = sprintf(as.numeric(factor(Y, unique(Y))), fmt = '%02d')) %>%
group_by(Y, Year, Country) %>%
summarize(across(where(is.numeric), sum))
#> # A tibble: 4 x 5
#> # Groups: Y, Year [3]
#> Y Year Country Price Quantity
#> <chr> <int> <chr> <int> <int>
#> 1 01 2021 France 5031 9
#> 2 01 2021 Germany 5148 7
#> 3 02 2021 USA 700 NA
Update, as requested:
library(dplyr)
df %>%
mutate(Y = substr(Y, 1, 2)) %>%
group_by(Y, Year, Country) %>%
summarise(across(c(Price, Quantity), ~sum(., na.rm = TRUE)))
We could use substr to get the first two characters from Y and group_by and summarise() with sum()
library(dplyr)
df %>%
mutate(Y = substr(Y, 1, 2)) %>%
group_by(Y, Year, Country) %>%
summarise(Price = sum(Price, na.rm = TRUE),
Quantity = sum(Quantity, na.rm = TRUE)
)
Y Year Country Price Quantity
<chr> <dbl> <chr> <dbl> <dbl>
1 01 2021 France 5031 9
2 01 2021 Germany 5148 7
3 02 2021 USA 700 6
4 21 2021 USA 623 12
Using aggregate and the substring of Y.
aggregate(cbind(Quantity, Price) ~ Y + Year + Country,
transform(dat, Y=substr(Y, 1, 2)), sum)
# Y Year Country Quantity Price
# 1 10 2021 France 9 5031
# 2 10 2021 Germany 7 5148
# 3 20 2021 USA 7 700
# 4 21 2021 USA 12 623
Data:
dat <- structure(list(Y = c(10190L, 10190L, 10190L, 10190L, 20190L,
20190L, 217834L, 217834L), foo = c("...", "...", "...", "...",
"...", "...", "...", "..."), Price = c(4781L, 367L, 4781L, 250L,
690L, 10L, 56L, 567L), Year = c(2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L), model = c(NA, NA, NA, NA, NA, NA, "Tesla",
"Tesla"), Quantity = c(4L, 3L, 6L, 3L, 1L, 6L, 3L, 9L), Country = c("Germany",
"Germany", "France", "France", "USA", "USA", "USA", "USA")), class = "data.frame", row.names = c(NA,
-8L))
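Note that this sample data stores Y as an integer, so the leading zero is lost and substr returns "10" and "20" rather than "01" and "02". A possible adjustment, sketched under two assumptions: codes are always 6 digits wide, and Quantity has the NA from the question restored (the data above uses 1 in that cell). The column name Y2 is made up for the example.

```r
dat <- data.frame(
  Y        = c(10190L, 10190L, 10190L, 10190L, 20190L, 20190L, 217834L, 217834L),
  Price    = c(4781L, 367L, 4781L, 250L, 690L, 10L, 56L, 567L),
  Year     = 2021L,
  Quantity = c(4L, 3L, 6L, 3L, NA, 6L, 3L, 9L),  # NA as in the question
  Country  = c("Germany", "Germany", "France", "France", "USA", "USA", "USA", "USA")
)

# Assumption: codes are 6 digits wide, so zero-pad before taking the prefix.
dat$Y2 <- substr(sprintf("%06d", dat$Y), 1, 2)

# na.pass keeps rows that contain an NA; na.rm = TRUE is passed on to sum().
res <- aggregate(cbind(Quantity, Price) ~ Y2 + Year + Country, data = dat,
                 FUN = sum, na.rm = TRUE, na.action = na.pass)
res
```

Without na.action = na.pass, the formula method drops the whole row that has the NA, so its Price of 690 would be lost from the sum as well.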
I have a data frame in R which looks like below
Model Month Demand Inventory
A Jan 10 20
B Feb 30 40
A Feb 40 60
I want the data frame to look like this:
Jan Feb
A_Demand 10 40
A_Inventory 20 60
A_coverage
B_Demand 30
B_Inventory 40
B_coverage
A_coverage and B_coverage will be calculated in Excel using a formula. The problem I need help with is how to pivot the data frame from its original format into the one shown above.
I tried to implement the solution from the linked duplicate but I am still having difficulty:
HD_dcast <- reshape(data,idvar = c("Model","Inventory","Demand"),
timevar = "Month", direction = "wide")
Here is a dput of my data:
data <- structure(list(Model = c("A", "B", "A"), Month = c("Jan", "Feb",
"Feb"), Demand = c(10L, 30L, 40L), Inventory = c(20L, 40L, 60L
)), class = "data.frame", row.names = c(NA, -3L))
Thanks
Here's an approach with dplyr and tidyr, two popular R packages for data manipulation:
library(dplyr)
library(tidyr)
data %>%
mutate(coverage = NA_real_) %>%
pivot_longer(-c(Model,Month), names_to = "Variable") %>%
pivot_wider(id_cols = c(Model, Variable), names_from = Month ) %>%
unite(Variable, c(Model,Variable), sep = "_")
## A tibble: 6 x 3
# Variable Jan Feb
# <chr> <dbl> <dbl>
#1 A_Demand 10 40
#2 A_Inventory 20 60
#3 A_coverage NA NA
#4 B_Demand NA 30
#5 B_Inventory NA 40
#6 B_coverage NA NA
My sample data looks like:
time state district count category
2018-01-01 Telangana Nalgonda 17 Water
2018-01-01 Telangana Nalgonda 8 Irrigation
2018-01-01 Telangana Nalgonda 52 Seeds
2018-01-01 Telangana Nalgonda 28 Electricity
2018-01-01 Telangana Nalgonda 27 Storage
2018-01-01 Telangana Nalgonda 12 Pesticides
I have around 2 years of monthly data for different states and districts.
I would like to reshape the data to wide format.
Tried:
one <- reshape(dataset,idvar = c("time","state","district"),v.names = names(dataset$category),
timevar = "count"
, direction = "wide")
Expected Output :
time state district Water Irrigation Seeds Electricity Storage Pesticides
2018-01-01 Telangana Nalgonda 17 8 52 28 27 12
I'm not sure exactly how reshape works. I've seen many examples but couldn't find a clear explanation.
Can someone let me know what I'm doing wrong?
We could use gather and spread
library(dplyr)
library(tidyr)
df %>%
gather(key, value, count) %>%
spread(category, value) %>%
select(-key)
# time state district Electricity Irrigation Pesticides Seeds Storage Water
#1 2018-01-01 Telangana Nalgonda 28 8 12 52 27 17
We can use data.table
library(data.table)
dcast(setDT(df1), time + state + district + rowid(count) ~
category, value.var = 'count')
# time state district count Electricity Irrigation Pesticides Seeds Storage Water
#1: 2018-01-01 Telangana Nalgonda 1 28 8 12 52 27 17
data
df1 <- structure(list(time = c("2018-01-01", "2018-01-01", "2018-01-01",
"2018-01-01", "2018-01-01", "2018-01-01"), state = c("Telangana",
"Telangana", "Telangana", "Telangana", "Telangana", "Telangana"
), district = c("Nalgonda", "Nalgonda", "Nalgonda", "Nalgonda",
"Nalgonda", "Nalgonda"), count = c(17L, 8L, 52L, 28L, 27L, 12L
), category = c("Water", "Irrigation", "Seeds", "Electricity",
"Storage", "Pesticides")), class = "data.frame", row.names = c(NA,
-6L))
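As for what went wrong in the original reshape call: timevar should name the column whose values become the new column names (category), and v.names the column that holds the values (count). A minimal base-R sketch on the same sample data, where stripping the "count." prefix is just a cosmetic choice:

```r
df1 <- data.frame(
  time     = "2018-01-01",
  state    = "Telangana",
  district = "Nalgonda",
  count    = c(17L, 8L, 52L, 28L, 27L, 12L),
  category = c("Water", "Irrigation", "Seeds", "Electricity", "Storage", "Pesticides")
)

wide <- reshape(df1,
                idvar     = c("time", "state", "district"),
                timevar   = "category",  # values become the new column names
                v.names   = "count",     # column that supplies the cell values
                direction = "wide")

# reshape() names the new columns count.Water, count.Irrigation, ...
names(wide) <- sub("^count\\.", "", names(wide))
wide
```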