table restructure split R [closed] - r

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 4 years ago.
I have a table with a variable Technology, which takes the values "All Renewables", "Biomass", "Solar", "Offshore wind", "Onshore wind" and "Wind".
I would like "All Renewables" to be split into "Biomass", "Solar", "Offshore wind" and "Onshore wind", and "Wind" to be split into "Offshore wind" and "Onshore wind".
The table looks approximately as follows:
Table
Year Country Technology Changes
2000 A Solar 1
2000 A Wind 2
2000 A Onshore wind 2
2000 A All Renewables 3
It should look as follows after the re-structuring:
Table
Year Country Technology Changes
2000 A Solar 1
2000 A Onshore wind 2
2000 A Offshore wind 2
2000 A Onshore wind 3
2000 A Biomass 3
2000 A Solar 3
2000 A Onshore wind 3
2000 A Offshore wind 3
If anybody could help, I would be really really thankful.
Sarah

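You could rename the factor levels and use tidyr::separate_rows (this assumes Technology is a factor):
library(tidyverse)
# New labels, in the order of levels(df$Technology):
# "All Renewables", "Onshore wind", "Solar", "Wind"
lvls <- c(
"Biomass, Solar, Offshore wind, Onshore wind",
"Onshore wind",
"Solar",
"Offshore wind, Onshore wind")
levels(df$Technology) <- lvls
# Split comma-separated entries into one row per technology
df %>% separate_rows(Technology, sep = ", ") %>%
# Keep a single copy of each duplicated row
group_by_all() %>%
slice(1) %>%
ungroup() %>%
arrange(Changes)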
## A tibble: 7 x 4
# Year Country Technology Changes
# <int> <fct> <chr> <int>
#1 2000 A Solar 1
#2 2000 A Offshore wind 2
#3 2000 A Onshore wind 2
#4 2000 A Biomass 3
#5 2000 A Offshore wind 3
#6 2000 A Onshore wind 3
#7 2000 A Solar 3
Explanation: We redefine factor levels such that "All Renewables" becomes "Biomass, Solar, Offshore wind, Onshore wind" and "Wind" becomes "Offshore wind, Onshore wind". Then we use tidyr::separate_rows to split entries with a comma into separate rows. All that remains are removal of duplicates and re-ordering of rows.
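The same expand-then-deduplicate idea can be sketched in base R without factors. This is only an illustration under stated assumptions: the names split_map and expand_one are made up here, and the data frame is retyped from the question.

```r
# Data retyped from the question; Technology is a plain character column.
df <- data.frame(
  Year = 2000, Country = "A",
  Technology = c("Solar", "Wind", "Onshore wind", "All Renewables"),
  Changes = c(1, 2, 2, 3),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: aggregate technology -> its components.
split_map <- list(
  "All Renewables" = c("Biomass", "Solar", "Offshore wind", "Onshore wind"),
  "Wind" = c("Offshore wind", "Onshore wind")
)

# Return the components of an aggregate, or the technology itself otherwise.
expand_one <- function(tech) {
  if (tech %in% names(split_map)) split_map[[tech]] else tech
}

parts <- lapply(df$Technology, expand_one)

# Repeat each row once per component, then fill in the component names.
out <- df[rep(seq_len(nrow(df)), lengths(parts)), ]
out$Technology <- unlist(parts)

# Drop exact duplicates (e.g. "Onshore wind" with Changes 2 appears twice).
out <- unique(out)
rownames(out) <- NULL
out
```

Whether duplicates should be dropped depends on what the Changes values mean; remove the unique() call to keep them.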
Sample data
df <- read.table(text =
"Year Country Technology Changes
2000 A 'Solar' 1
2000 A 'Wind' 2
2000 A 'Onshore wind' 2
2000 A 'All Renewables' 3", header = TRUE, stringsAsFactors = TRUE)

This is just a question of merging (here with dplyr from the tidyverse):
library(dplyr)
# Your data:
df <- read.csv(textConnection("Y, A, B, C
2000,A,Solar,1
2000,A,Wind,2
2000,A,Onshore wind,2
2000,A,All Renewables,3"),stringsAsFactors=FALSE)
# Your synonyms (one row per replacement):
synonyms <- read.csv(textConnection("B, D
All Renewables,Biomass
All Renewables,Solar
All Renewables,Offshore wind
All Renewables,Onshore wind
Wind,Offshore wind
Wind,Onshore wind"),stringsAsFactors=FALSE)
# Join on B, use the replacement D where one exists, then drop D
df %>% left_join(synonyms,by="B") %>% mutate(B=coalesce(D,B)) %>% select(-D)
# Y A B C
#1 2000 A Solar 1
#2 2000 A Offshore wind 2
#3 2000 A Onshore wind 2
#4 2000 A Onshore wind 2
#5 2000 A Biomass 3
#6 2000 A Solar 3
#7 2000 A Offshore wind 3
#8 2000 A Onshore wind 3

Related

Calculate total number of distinct users across years, cumulatively

Let's say I have a data.frame like so:
user_df = read.table(text = "id industry year
1 Government 1999
2 Government 1999
3 Government 1999
4 Private 1999
5 NGO 1999
1 Government 2000
2 Government 2000
3 Government 2000
4 Government 2000
1 Government 2001
5 Government 2001
2 Private 2001
3 Private 2001
4 Private 2001", header = T)
For each user I have a unique id, industry, and year.
I'm trying to compute a cumulative count of the people who have ever worked in Government, so the cumulative count should be the total number of unique users for that year and all preceding years.
I know I can do an ordinary cumulative sum like so:
user_df %>% group_by(year, industry) %>% summarize(cum_sum = cumsum(n_distinct(id)))
year industry cum_sum
<int> <chr> <int>
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 2
6 2001 Private 3
However, this isn't what I want since the sums in the year 2000 and 2001 will include people who have already been included in 1999. I want each year to be a cumulative count of the total number of unique users that have ever worked in Government at a given year. I couldn't figure out the right way to do this in dplyr.
So the correct output should look like:
year industry cum_sum
<int> <chr> <int>
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 5
6 2001 Private 3
One option might be:
user_df %>%
group_by(industry) %>%
mutate(cum_sum = cumsum(!duplicated(id))) %>%
group_by(year, industry) %>%
summarise(cum_sum = max(cum_sum))
year industry cum_sum
<int> <fct> <int>
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 5
6 2001 Private 3
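To see why cumsum(!duplicated(id)) gives a running count of distinct users, here is a small base R sketch restricted to the Government rows of the question's data. It assumes the rows are already sorted by year, which the trick relies on; the names gov and per_year are illustrative.

```r
# Government rows from the question, in year order.
gov <- data.frame(
  id   = c(1, 2, 3,  1, 2, 3, 4,  1, 5),
  year = c(1999, 1999, 1999,  2000, 2000, 2000, 2000,  2001, 2001)
)

# !duplicated(id) is TRUE only the first time an id appears,
# so its cumulative sum counts distinct ids seen so far.
gov$cum_sum <- cumsum(!duplicated(gov$id))

# The last (largest) running value within each year is that year's total.
per_year <- tapply(gov$cum_sum, gov$year, max)
per_year
```

If the data is not sorted by year, order it first; otherwise an id would be counted in the year where its row happens to appear first rather than in its earliest year.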
1) sqldf This can be implemented through a complex self-join in SQL. It joins each row to the rows having the same industry and the same or an earlier year, and then groups them by year and industry, counting the distinct ids.
library(sqldf)
sqldf("select a.year, a.industry, count(distinct b.id) cum_sum
from user_df a
left join user_df b on b.industry = a.industry and b.year <= a.year
group by a.year, a.industry")
giving:
year industry cum_sum
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 5
6 2001 Private 3
2) base A base-only solution is formed by merging the data frame with itself on industry, then subsetting to the same or an earlier year, and aggregating over industry and year. This is inefficient because, unlike the SQL statement, which filters as it joins, it creates the entire join before filtering it down; however, if your data is not too large this may be sufficient.
m <- merge(user_df, user_df, by = "industry")
s <- subset(m, year.y <= year.x)
ag <- aggregate(id.y ~ industry + year.x, s, function(x) length(unique(x)))
names(ag) <- sub("\\..*", "", names(ag))
ag
giving:
industry year id
1 Government 1999 3
2 NGO 1999 1
3 Private 1999 1
4 Government 2000 4
5 Government 2001 5
6 Private 2001 3

How to take an attribute from one dataframe and add it to another [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 2 years ago.
Suppose I have 2 different datasets for countries. Both have same countries, but slightly different:
dataset A:
col1 covid_cases region
russia 2 2
israel 3 1
russia 2 3
russia 2 4
russia 2 1
russia 2 6
dataset B:
col1 covid_cases income
russia 2 low
russia 2 low
israel 3 high
The region column and income column are independent.
In my original datasets I have 100 countries.
What's an efficient way to get this type of dataset:
col1 covid_cases region income
russia 2 2 low
israel 3 1 high
russia 2 3 low
russia 2 4 low
russia 2 1 low
russia 2 6 low
So the order of rows in the dataset doesn't matter. I'm not interested in simply taking one column from one dataset and adding it to another. I'm interested in adding the income column so that its values match each country's income, just like in dataset B.
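Maybe try this (df1 is dataset A and df2 is dataset B):
library(dplyr)
# Keep one row per country from df2, then join its income onto df1
newdf <- df1 %>% left_join(df2 %>% select(-c(covid_cases)) %>%
filter(!duplicated(col1)), by = "col1")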
Output:
col1 covid_cases region income
1 russia 2 2 low
2 israel 3 1 high
3 russia 2 3 low
Using your new dataframes, the code will work too:
col1 covid_cases region income
1 russia 2 2 low
2 israel 3 1 high
3 russia 2 3 low
4 russia 2 4 low
5 russia 2 1 low
6 russia 2 6 low
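For comparison, a base R sketch of the same lookup using match(), which keeps the row order of dataset A. The data frames are retyped from the question; the names df1, df2 and lookup are assumptions for illustration.

```r
# Dataset A and dataset B, retyped from the question.
df1 <- data.frame(
  col1 = c("russia", "israel", "russia", "russia", "russia", "russia"),
  covid_cases = c(2, 3, 2, 2, 2, 2),
  region = c(2, 1, 3, 4, 1, 6),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  col1 = c("russia", "russia", "israel"),
  covid_cases = c(2, 2, 3),
  income = c("low", "low", "high"),
  stringsAsFactors = FALSE
)

# One row per country, so each country maps to exactly one income value.
lookup <- unique(df2[, c("col1", "income")])

# match() returns, for each row of df1, the position of its country in lookup.
df1$income <- lookup$income[match(df1$col1, lookup$col1)]
df1
```

This assumes each country has a single income value in dataset B; if a country had conflicting values, match() would silently use the first one.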

Aggregating observations based on category in R

I have a set of agricultural data in R that looks something like this:
State District Year Crop Production Area
1 State A District 1 2000 Banana 1254.00 2000.00
2 State A District 1 2000 Apple 175.00 176.00
3 State A District 1 2000 Wheat 321.00 641.00
4 State A District 1 2000 Rice 1438.00 175.00
5 State A District 1 2000 Cashew 154.00 1845.00
6 State A District 1 2000 Peanut 2076.00 439.00
7 State B District 2 2000 Banana 3089.00 1987.00
8 State B District 2 2000 Apple 309.00 302.00
9 State B District 2 2000 Wheat 401.00 230.00
10 State B District 2 2000 Rice 1832.00 2134.00
11 State B District 2 2000 Cashew 991.00 1845.00
12 State B District 2 2000 Peanut 2311.00 1032.00
I want to aggregate the area and production values by crop type, but keep the state, district and year details, so that it would look something like:
State District Year Crop Production Area
1 State A District 1 2000 Fruit 1429.00 2176.00
2 State A District 1 2000 Grain 1759.00 816.00
3 State A District 1 2000 Nut 2230.00 2284.00
4 State B District 2 2000 Fruit 3398.00 2289.00
5 State B District 2 2000 Grain 2233.00 2364.00
6 State B District 2 2000 Nut 3302.00 2877.00
What's the best way to go about this?
Using dplyr & forcats:
library(dplyr)
library(forcats)
df %>%
# Map each crop to its category
mutate(Crop = fct_recode(Crop, Fruit = "Apple", Fruit = "Banana",
Grain = "Wheat", Grain = "Rice",
Nut = "Cashew", Nut = "Peanut")) %>%
# Group by the recoded category and sum, as in the desired output
group_by(State, District, Year, Crop) %>%
summarize(Production = sum(Production),
Area = sum(Area))
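A base R sketch of the recode-then-aggregate idea, summing within each category as in the desired output. Only the State A rows from the question are used to keep it short, and the lookup vector crop_type is a name chosen here for illustration.

```r
# State A rows retyped from the question.
df <- data.frame(
  State = "State A", District = "District 1", Year = 2000,
  Crop = c("Banana", "Apple", "Wheat", "Rice", "Cashew", "Peanut"),
  Production = c(1254, 175, 321, 1438, 154, 2076),
  Area = c(2000, 176, 641, 175, 1845, 439),
  stringsAsFactors = FALSE
)

# Named vector used as a lookup table: crop -> category.
crop_type <- c(Banana = "Fruit", Apple = "Fruit",
               Wheat = "Grain", Rice = "Grain",
               Cashew = "Nut", Peanut = "Nut")

# Replace each crop with its category.
df$Crop <- unname(crop_type[df$Crop])

# Sum Production and Area within State, District, Year and category.
out <- aggregate(cbind(Production, Area) ~ State + District + Year + Crop,
                 data = df, FUN = sum)
out
```

A crop missing from the lookup would become NA and be dropped by aggregate(), so it is worth checking anyNA(df$Crop) on real data.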

How to create a loop for sum calculations which then are inserted into a new row?

I have tried to find a solution via similar topics, but haven't found anything suitable. This may be due to the search terms I have used. If I have missed something, please accept my apologies.
Here is an excerpt of my data UN_ (the provided sample should be sufficient):
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
AT 1990 Total 7.869005
AT 1991 1 1.484667
AT 1991 2 1.001578
AT 1991 3 4.625927
AT 1991 4 2.515453
AT 1991 5 2.702081
AT 1991 Total 8.249567
....
BE 1994 1 3.008115
BE 1994 2 1.550344
BE 1994 3 1.080667
BE 1994 4 1.768645
BE 1994 5 7.208295
BE 1994 Total 1.526016
BE 1995 1 2.958820
BE 1995 2 1.571759
BE 1995 3 1.116049
BE 1995 4 1.888952
BE 1995 5 7.654881
BE 1995 Total 1.547446
....
What I want to do is add another row with UN_$sector = Residual. The value of Residual will be (UN_$sector = Total) - (the sum of column UN for the sectors c("1", "2", "3", "4", "5")) for a given year AND country.
This is how it should look:
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
----> AT 1990 Residual TO BE CALCULATED
AT 1990 Total 7.869005
As I don't want to write many, many lines of code I'm looking for a way to automate this. I was told about loops, but can't really follow the concept at the moment.
Thank you very much for any type of help!!
Best,
Constantin
PS: (for Parfait)
country year sector UN ETS
UK 2012 1 190336512 NA
UK 2012 2 18107910 NA
UK 2012 3 8333564 NA
UK 2012 4 11269017 NA
UK 2012 5 2504751 NA
UK 2012 Total 580957306 NA
UK 2013 1 177882200 NA
UK 2013 2 20353347 NA
UK 2013 3 8838575 NA
UK 2013 4 11051398 NA
UK 2013 5 2684909 NA
UK 2013 Total 566322778 NA
Consider calculating residual first and then stack it with other pieces of data:
# CALCULATE RESIDUALS BY MERGED COLUMNS
agg <- within(merge(aggregate(UN ~ country + year, data = subset(df, sector!='Total'), sum),
aggregate(UN ~ country + year, data = subset(df, sector=='Total'), sum),
by=c("country", "year")),
{UN <- UN.y - UN.x
sector = 'Residual'})
# ROW BIND DIFFERENT PIECES
final_df <- rbind(subset(df, sector!='Total'),
agg[c("country", "year", "sector", "UN")],
subset(df, sector=='Total'))
# ORDER ROWS AND RESET ROWNAMES
final_df <- with(final_df, final_df[order(country, year, as.character(sector)),])
row.names(final_df) <- NULL
final_df
# country year sector UN
# 1 AT 1990 1 1.407555
# 2 AT 1990 2 1.037137
# 3 AT 1990 3 4.769618
# 4 AT 1990 4 2.455139
# 5 AT 1990 5 2.238618
# 6 AT 1990 Residual -4.039062
# 7 AT 1990 Total 7.869005
# 8 AT 1991 1 1.484667
# 9 AT 1991 2 1.001578
# 10 AT 1991 3 4.625927
# 11 AT 1991 4 2.515453
# 12 AT 1991 5 2.702081
# 13 AT 1991 Residual -4.080139
# 14 AT 1991 Total 8.249567
# 15 BE 1994 1 3.008115
# 16 BE 1994 2 1.550344
# 17 BE 1994 3 1.080667
# 18 BE 1994 4 1.768645
# 19 BE 1994 5 7.208295
# 20 BE 1994 Residual -13.090050
# 21 BE 1994 Total 1.526016
# 22 BE 1995 1 2.958820
# 23 BE 1995 2 1.571759
# 24 BE 1995 3 1.116049
# 25 BE 1995 4 1.888952
# 26 BE 1995 5 7.654881
# 27 BE 1995 Residual -13.643015
# 28 BE 1995 Total 1.547446
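As a sanity check on the first Residual value in the output above, the arithmetic for a single country and year (AT, 1990) can be worked by hand in base R. The object name at_1990 is illustrative; the numbers are copied from the question.

```r
# AT 1990 rows from the question.
at_1990 <- data.frame(
  sector = c("1", "2", "3", "4", "5", "Total"),
  UN = c(1.407555, 1.037137, 4.769618, 2.455139, 2.238618, 7.869005),
  stringsAsFactors = FALSE
)

is_total <- at_1990$sector == "Total"

# Residual = Total - sum of the individual sectors.
residual <- at_1990$UN[is_total] - sum(at_1990$UN[!is_total])
residual
```

The sectors sum to 11.908067, so the residual is 7.869005 - 11.908067 = -4.039062, matching row 6 of the output.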
I think there are multiple ways you can do this. What I may recommend is to take advantage of the tidyverse suite of packages which includes dplyr.
Without getting too far into what dplyr and the tidyverse can achieve, four dplyr functions are enough here: group_by(...), summarise(...), arrange(...) and bind_rows(...). There are also plenty of good tutorials, cheat sheets, and documentation for all tidyverse packages.
Although it is less and less relevant these days, we generally want to avoid for loops in R. Therefore, we will create a new data frame which contains all of the Residual values and then bring it back into your original data frame.
Step 1: Calculating all residual values
For each country and year, we want the Total value minus the sum of the UN values of the individual sectors. We can achieve this with:
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = UN[sector == 'Total'] - sum(UN[sector != 'Total'], na.rm = TRUE))
Step 2: Add sector column to res_UN with value 'Residual'
Step 1 yields a data frame which contains country, year, and UN. We now need to add a column sector with the value 'Residual' to satisfy your specifications.
res_UN$sector = 'Residual'
Step 3: Add res_UN back to UN_ and order accordingly
res_UN and UN_ now share the country, year, sector and UN columns, so they can be added back together.
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
Piecing this all together should answer your question, and it can be achieved in a couple of lines!
TLDR:
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = UN[sector == 'Total'] - sum(UN[sector != 'Total'], na.rm = TRUE))
res_UN$sector = 'Residual'
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)

How to do Group By Rollup in R? (Like SQL)

I have a dataset and I want to perform something like Group By Rollup like we have in SQL for aggregate values.
Below is a reproducible example. I know aggregate works really well as explained here but not a satisfactory fit for my case.
year<- c('2016','2016','2016','2016','2017','2017','2017','2017')
month<- c('1','1','1','1','2','2','2','2')
region<- c('east','west','east','west','east','west','east','west')
sales<- c(100,200,300,400,200,400,600,800)
df<- data.frame(year,month,region,sales)
df
year month region sales
1 2016 1 east 100
2 2016 1 west 200
3 2016 1 east 300
4 2016 1 west 400
5 2017 2 east 200
6 2017 2 west 400
7 2017 2 east 600
8 2017 2 west 800
Now what I want to do is aggregate (sum by year, month and region) and add the new aggregate rows to the existing data frame.
E.g. there should be two additional rows like below, with the new region name 'USA' for the aggregated rows:
year month region sales
1 2016 1 east 400
2 2016 1 west 600
3 2016 1 USA 1000
4 2017 2 east 800
5 2017 2 west 1200
6 2017 2 USA 2000
I have figured out a way (below) but I am very sure that there exists an optimum solution for this OR a better workaround than mine
df1<- setNames(aggregate(df$sales, by=list(df$year,df$month, df$region), FUN=sum),
c('year','month','region', 'sales'))
df2<- setNames(aggregate(df$sales, by=list(df$year,df$month), FUN=sum),
c('year','month', 'sales'))
df2$region<- 'USA' ## added a new column- region- for total USA
df2<- df2[, c('year','month','region', 'sales')] ## reordering the columns of df2
df3<- rbind(df1,df2)
df3<- df3[order(df3$year,df3$month,df3$region),] ## order by
rownames(df3)<- NULL ## renumbered the rows after order by
df3
Thanks for the support!
melt/dcast in the reshape2 package can do subtotalling. After running dcast we replace "(all)" in the month column with the month using na.locf from the zoo package:
library(reshape2)
library(zoo)
m <- melt(df, measure.vars = "sales")
dout <- dcast(m, year + month + region ~ variable, fun.aggregate = sum, margins = "month")
dout$month <- na.locf(replace(dout$month, dout$month == "(all)", NA))
giving:
> dout
year month region sales
1 2016 1 east 400
2 2016 1 west 600
3 2016 1 (all) 1000
4 2017 2 east 800
5 2017 2 west 1200
6 2017 2 (all) 2000
In the recent development version of data.table (1.10.5) you can use a new feature called "grouping sets" to produce subtotals:
library(data.table)
setDT(df)
res = groupingsets(df, .(sales=sum(sales)), sets=list(c("year","month"), c("year","month","region")), by=c("year","month","region"))
setorder(res, na.last=TRUE)
res
# year month region sales
#1: 2016 1 east 400
#2: 2016 1 west 600
#3: 2016 1 NA 1000
#4: 2017 2 east 800
#5: 2017 2 west 1200
#6: 2017 2 NA 2000
You can substitute NA to USA using res[is.na(region), region := "USA"].
plyr::ddply(df, c("year", "month", "region"), plyr::summarise, sales = sum(sales))
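For readers who prefer to stay in base R, the subtotal-then-stack idea can be written fairly compactly with the formula interface of aggregate(). This is a sketch of the same approach as the question's workaround, not a true ROLLUP; the names by_region and totals are illustrative.

```r
# Data from the question, with character columns.
df <- data.frame(
  year = rep(c("2016", "2017"), each = 4),
  month = rep(c("1", "2"), each = 4),
  region = rep(c("east", "west"), times = 4),
  sales = c(100, 200, 300, 400, 200, 400, 600, 800),
  stringsAsFactors = FALSE
)

# Detail level: one row per year, month and region.
by_region <- aggregate(sales ~ year + month + region, data = df, FUN = sum)

# Subtotal level: one row per year and month, labelled as region "USA".
totals <- aggregate(sales ~ year + month, data = df, FUN = sum)
totals$region <- "USA"

# Stack both levels, with the "USA" subtotal last within each year and month.
out <- rbind(by_region, totals[names(by_region)])
out <- out[order(out$year, out$month, out$region == "USA", out$region), ]
rownames(out) <- NULL
out
```

Ordering on out$region == "USA" puts the subtotal after the regional rows, because FALSE sorts before TRUE.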

Resources