Aggregating observations based on category in R

I have a set of agricultural data in R that looks something like this:
State District Year Crop Production Area
1 State A District 1 2000 Banana 1254.00 2000.00
2 State A District 1 2000 Apple 175.00 176.00
3 State A District 1 2000 Wheat 321.00 641.00
4 State A District 1 2000 Rice 1438.00 175.00
5 State A District 1 2000 Cashew 154.00 1845.00
6 State A District 1 2000 Peanut 2076.00 439.00
7 State B District 2 2000 Banana 3089.00 1987.00
8 State B District 2 2000 Apple 309.00 302.00
9 State B District 2 2000 Wheat 401.00 230.00
10 State B District 2 2000 Rice 1832.00 2134.00
11 State B District 2 2000 Cashew 991.00 1845.00
12 State B District 2 2000 Peanut 2311.00 1032.00
I want to aggregate the area and production values by crop type, but keep the state, district and year details, so that it would look something like:
State District Year Crop Production Area
1 State A District 1 2000 Fruit 1429.00 2176.00
2 State A District 1 2000 Grain 1759.00 816.00
3 State A District 1 2000 Nut 2230.00 2284.00
4 State B District 2 2000 Fruit 3398.00 2289.00
5 State B District 2 2000 Grain 2233.00 2364.00
6 State B District 2 2000 Nut 3302.00 2877.00
What's the best way to go about this?

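Using dplyr & forcats:
library(dplyr)
library(forcats)
df %>%
  # collapse the individual crops into broader crop types
  mutate(Crop = fct_recode(Crop, Fruit = "Apple", Fruit = "Banana",
                           Grain = "Wheat", Grain = "Rice",
                           Nut = "Cashew", Nut = "Peanut")) %>%
  # group by the recoded Crop column so that crops of the same type are combined
  group_by(State, District, Year, Crop) %>%
  # the desired output holds totals, so sum rather than average
  summarize(Production = sum(Production),
            Area = sum(Area))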
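The same recode-then-sum idea can also be sketched in base R. This is only a minimal sketch: the data frame below retypes the sample rows from the question, and the names crop_lookup and CropType are made up for illustration.

```r
# Sample rows from the question
df <- data.frame(
  State      = rep(c("State A", "State B"), each = 6),
  District   = rep(c("District 1", "District 2"), each = 6),
  Year       = 2000,
  Crop       = rep(c("Banana", "Apple", "Wheat", "Rice", "Cashew", "Peanut"), 2),
  Production = c(1254, 175, 321, 1438, 154, 2076, 3089, 309, 401, 1832, 991, 2311),
  Area       = c(2000, 176, 641, 175, 1845, 439, 1987, 302, 230, 2134, 1845, 1032),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: crop name -> crop type
crop_lookup <- c(Banana = "Fruit", Apple = "Fruit",
                 Wheat = "Grain", Rice = "Grain",
                 Cashew = "Nut", Peanut = "Nut")

# as.character() guards against Crop being a factor, which would
# otherwise index the lookup by integer code instead of by name
df$CropType <- unname(crop_lookup[as.character(df$Crop)])

# Sum Production and Area within each State/District/Year/CropType
result <- aggregate(cbind(Production, Area) ~ State + District + Year + CropType,
                    data = df, FUN = sum)
result <- result[order(result$State, result$CropType), ]
result
```

Keeping the mapping in a named vector means new crops can be added in one place without touching the aggregation step.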

Related

Calculate total number of distinct users across years, cumulatively

Let's say I have a data.frame like so:
user_df = read.table(text = "id industry year
1 Government 1999
2 Government 1999
3 Government 1999
4 Private 1999
5 NGO 1999
1 Government 2000
2 Government 2000
3 Government 2000
4 Government 2000
1 Government 2001
5 Government 2001
2 Private 2001
3 Private 2001
4 Private 2001", header = T)
For each user I have a unique id, industry, and year.
I'm trying to compute a cumulative count of the people who have ever worked in Government, so the cumulative count should be the total number of unique users for that year and all preceding years.
I know I can do an ordinary cumulative sum like so:
user_df %>% group_by(year, industry) %>% summarize(cum_sum = cumsum(n_distinct(id)))
year industry cum_sum
<int> <chr> <int>
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 2
6 2001 Private 3
However, this isn't what I want since the sums in the year 2000 and 2001 will include people who have already been included in 1999. I want each year to be a cumulative count of the total number of unique users that have ever worked in Government at a given year. I couldn't figure out the right way to do this in dplyr.
So the correct output should look like:
year industry cum_sum
<int> <chr> <int>
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 5
6 2001 Private 3

One option might be:
user_df %>%
group_by(industry) %>%
mutate(cum_sum = cumsum(!duplicated(id))) %>%
group_by(year, industry) %>%
summarise(cum_sum = max(cum_sum))
year industry cum_sum
<int> <fct> <int>
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 5
6 2001 Private 3
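To see why cumsum(!duplicated(id)) works, here is a small base R sketch on the Government rows alone (retyped from the question's data). It assumes the rows are already sorted by year; on unsorted data the running total would not line up with the years.

```r
# Government rows from the question, in year order
gov <- data.frame(
  id   = c(1, 2, 3, 1, 2, 3, 4, 1, 5),
  year = c(1999, 1999, 1999, 2000, 2000, 2000, 2000, 2001, 2001)
)

# TRUE the first time an id appears, FALSE for every repeat
first_seen <- !duplicated(gov$id)

# Running count of distinct ids seen so far
running_total <- cumsum(first_seen)

# The last (largest) running total within each year is the cumulative count
cum_by_year <- tapply(running_total, gov$year, max)
cum_by_year
# 1999 2000 2001
#    3    4    5
```

Grouping by industry first, as in the answer above, simply repeats this calculation separately for each industry.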

1) sqldf This can be implemented through a complex self-join in sql. This joins each row to the rows having the same industry and same year or before and then groups them by year and industry counting the distinct id's.
library(sqldf)
sqldf("select a.year, a.industry, count(distinct b.id) cum_sum
from user_df a
left join user_df b on b.industry = a.industry and b.year <= a.year
group by a.year, a.industry")
giving:
year industry cum_sum
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 5
6 2001 Private 3
2) base A base-only solution merges the data frame with itself on industry, subsets to the rows where the joined year is the same or earlier, and then aggregates over industry and year. This is inefficient because, unlike the SQL statement, which filters as it joins, it creates the entire join before filtering it down; however, if your data is not too large this may be sufficient.
m <- merge(user_df, user_df, by = "industry")
s <- subset(m, year.y <= year.x)
ag <- aggregate(id.y ~ industry + year.x, s, function(x) length(unique(x)))
names(ag) <- sub("\\..*", "", names(ag))
ag
giving:
industry year id
1 Government 1999 3
2 NGO 1999 1
3 Private 1999 1
4 Government 2000 4
5 Government 2001 5
6 Private 2001 3

How do I get the sum of frequency count based on two columns?

Assuming that the dataframe is stored as someData, and is in the following format:
ID Team Games Medal
1 Australia 1992 Summer NA
2 Australia 1994 Summer Gold
3 Australia 1992 Summer Silver
4 United States 1991 Winter Gold
5 United States 1992 Summer Bronze
6 Singapore 1991 Summer NA
How would I count the frequencies of the medals by Team, while excluding NA as a value? At the same time, the total frequency for each country should be summed, rather than displayed separately for Gold, Silver and Bronze.
In other words, I am trying to display the total number of medals PER country, with the exception of NA.
I have tried something like this:
library(plyr)
counts <- ddply(someData, .(someData$Team, someData$Medal), nrow)
names(counts) <- c("Country", "Medal", "Freq")
counts
But this just gives me a massive table of every medal for every country separately, including NA.
What I would like to do is the following:
Australia 2
United States 2
Any help would be greatly appreciated.
Thank you!

We can use count, after filtering out the rows where Medal is NA:
library(dplyr)
someData %>%
filter(!is.na(Medal)) %>%
count(Team)
# A tibble: 2 x 2
# Team n
# <fct> <int>
#1 Australia 2
#2 United States 2

You can do that in base R with table and colSums
colSums(table(someData$Medal, someData$Team))
Australia Singapore United States
2 0 2
Data
someData = read.table(text="ID Team Games Medal
1 Australia '1992 Summer' NA
2 Australia '1994 Summer' Gold
3 Australia '1992 Summer' Silver
4 'United States' '1991 Winter' Gold
5 'United States' '1992 Summer' Bronze
6 Singapore '1991 Summer' NA",
header=TRUE)
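The base R version works because table() leaves out NA values by default (useNA = "no"), so the column sums only count real medals. A minimal sketch, with the data retyped as a plain data frame and a final step (my addition, not part of the answer above) that drops teams with no medals:

```r
someData <- data.frame(
  Team  = c("Australia", "Australia", "Australia",
            "United States", "United States", "Singapore"),
  Medal = c(NA, "Gold", "Silver", "Gold", "Bronze", NA),
  stringsAsFactors = FALSE
)

# Rows = medal type, columns = team; NA medals are not tabulated
medal_table <- table(someData$Medal, someData$Team)

# Total medals per team, regardless of medal type
totals <- colSums(medal_table)

# Keep only the teams that actually won something
won <- totals[totals > 0]
won
```

With this data, won keeps Australia and United States with 2 medals each, matching the output requested in the question.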

table restructure split R [closed]

I have a table with a variable Technology, which includes "All Renewables", "Biomass", "Solar", "Offshore wind", "Onshore wind" and "Wind".
I would like "All Renewables" to be split into "Biomass", "Solar", "Offshore wind" and "Onshore wind", and "Wind" to be split into "Offshore wind" and "Onshore wind".
The table looks approximately as follows:
Table
Year Country Technology Changes
2000 A Solar 1
2000 A Wind 2
2000 A Onshore wind 2
2000 A All Renewables 3
It should look as follows after the re-structuring:
Table
Year Country Technology Changes
2000 A Solar 1
2000 A Onshore wind 2
2000 A Offshore wind 2
2000 A Onshore wind 3
2000 A Biomass 3
2000 A Solar 3
2000 A Onshore wind 3
2000 A Offshore wind 3
If anybody could help, I would be really really thankful.
Sarah

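You could rename factor levels and use tidyr::separate_rows
# Technology must be a factor here; since R 4.0.0 read.table() returns
# character columns unless stringsAsFactors = TRUE is set.
# The new levels must follow the order of levels(df$Technology), which is
# alphabetical: "All Renewables", "Onshore wind", "Solar", "Wind".
lvls <- c(
"Biomass, Solar, Offshore wind, Onshore wind",
"Onshore wind",
"Solar",
"Offshore wind, Onshore wind")
levels(df$Technology) <- lvls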
library(tidyverse)
df %>% separate_rows(Technology, sep = ", ") %>%
group_by_all() %>%
slice(1) %>%
ungroup() %>%
arrange(Changes)
## A tibble: 7 x 4
# Year Country Technology Changes
# <int> <fct> <chr> <int>
#1 2000 A Solar 1
#2 2000 A Offshore wind 2
#3 2000 A Onshore wind 2
#4 2000 A Biomass 3
#5 2000 A Offshore wind 3
#6 2000 A Onshore wind 3
#7 2000 A Solar 3
Explanation: We redefine factor levels such that "All Renewables" becomes "Biomass, Solar, Offshore wind, Onshore wind" and "Wind" becomes "Offshore wind, Onshore wind". Then we use tidyr::separate_rows to split entries with a comma into separate rows. All that remains are removal of duplicates and re-ordering of rows.
Sample data
df <- read.table(text =
"Year Country Technology Changes
2000 A 'Solar' 1
2000 A 'Wind' 2
2000 A 'Onshore wind' 2
2000 A 'All Renewables' 3", header = TRUE, stringsAsFactors = TRUE)
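Because assigning levels by position depends on their alphabetical order, a version with an explicit mapping may be easier to check. The following is only a base R sketch; split_map and the other helper names are made up for illustration, and it assumes Technology is stored as character.

```r
df <- data.frame(
  Year = 2000,
  Country = "A",
  Technology = c("Solar", "Wind", "Onshore wind", "All Renewables"),
  Changes = c(1, 2, 2, 3),
  stringsAsFactors = FALSE
)

# Hypothetical mapping: aggregate technology -> its component technologies
split_map <- list(
  "All Renewables" = c("Biomass", "Solar", "Offshore wind", "Onshore wind"),
  "Wind"           = c("Offshore wind", "Onshore wind")
)

# For each row, either the component technologies or the original value
expanded_tech <- lapply(df$Technology, function(tech) {
  if (tech %in% names(split_map)) split_map[[tech]] else tech
})

# Repeat each row once per component, then fill in the component names
out <- df[rep(seq_len(nrow(df)), lengths(expanded_tech)), ]
out$Technology <- unlist(expanded_tech)

# Drop exact duplicates (e.g. "Onshore wind" with Changes = 2 appears twice)
out <- unique(out)
rownames(out) <- NULL
out
```

On the sample data this yields seven rows, the same count as the separate_rows approach once duplicates are removed.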

Just a question of merging (with the tidyverse):
library(dplyr)
# Your data:
df <- read.csv(textConnection("Y, A, B, C
2000,A,Solar,1
2000,A,Wind,2
2000,A,Onshore wind,2
2000,A,All Renewables,3"),stringsAsFactors=FALSE)
# Your synonyms (named synonyms rather than c, to avoid masking base::c):
synonyms <- read.csv(textConnection("B, D
All Renewables,Biomass
All Renewables,Solar
All Renewables,Offshore wind
All Renewables,Onshore wind
Wind,Offshore wind
Wind,Onshore wind"),stringsAsFactors=FALSE)
# Join on B, then replace B with its synonym D wherever one exists
df %>% left_join(synonyms,by="B") %>% mutate(B=coalesce(D,B)) %>% select(-D)
# Y A B C
#1 2000 A Solar 1
#2 2000 A Offshore wind 2
#3 2000 A Onshore wind 2
#4 2000 A Onshore wind 2
#5 2000 A Biomass 3
#6 2000 A Solar 3
#7 2000 A Offshore wind 3
#8 2000 A Onshore wind 3

From monadic to dyadic data in R

For the sake of simplicity, let's say I have a dataset at the country-year level that lists organizations that received aid from a government, how much money they received, and the type of project. The data frame has "space" for 10 organizations each year, but not every government subsidizes that many organizations each year, so there are a lot of blank spaces. Moreover, the organizations do not follow any order: one organization can be in the first spot one year and be coded in the second spot the next. The data looks like this:
> State Year Org1 Aid1 Proj1 Org2 Aid2 Proj2 Org3 Aid3 Proj3 Org4 Aid4 Proj4 ...
Italy 2000 A 1000 Arts B 500 Arts C 300 Social
Italy 2001 B 700 Social A 1000 Envir
Italy 2002 A 1000 Arts C 300 Envir
UK 2000
UK 2001 Z 2000 Social
UK 2002 Z 2000 Social
...
I'm trying to transform this into dyadic data, which would look like this:
> State Org Year Aid Proj
Italy A 2000 1000 Arts
Italy A 2001 1000 Envir
Italy A 2002 1000 Arts
Italy B 2000 500 Arts
Italy B 2001 700 Social
Italy C 2000 300 Social
Italy C 2002 300 Envir
UK Z 2001 2000 Social
...
I'm using R, and the best way I could find was building a pre-defined set of possible dyads (using something like expand.grid(unique(State), unique(Org))) and then looping through the data, finding the corresponding column and filling the data frame. But I don't think this is the most effective method, so I was wondering whether there is a better way. I thought about dplyr or reshape but can't find a solution.
I know this is a recurring question, but couldn't really find an answer. The most similar question is this one, but it's not exactly the same.
Thanks a lot in advance.

Since you did not use dput, I will try and make some data that resemble yours:
dat = data.frame(State = rep(c("Italy", "UK"), 3),
Year = rep(c(2014, 2015, 2016), 2),
Org1 = letters[1:6],
Aid1 = sample(800:1000, 6),
Proj1 = rep(c("A", "B"), 3),
Org2 = letters[7:12],
Aid2 = sample(600:700, 6),
Proj2 = rep(c("C", "D"), 3),
stringsAsFactors = FALSE)
dat
# State Year Org1 Aid1 Proj1 Org2 Aid2 Proj2
# 1 Italy 2014 a 910 A g 658 C
# 2 UK 2015 b 926 B h 681 D
# 3 Italy 2016 c 834 A i 625 C
# 4 UK 2014 d 858 B j 620 D
# 5 Italy 2015 e 831 A k 650 C
# 6 UK 2016 f 821 B l 687 D
Next I gather the data and then use extract to make 2 new columns and then spread it all again:
library(tidyr)
library(dplyr)
dat %>%
gather(key, value, -c(State, Year)) %>%
extract(key, into = c("key", "num"), "([A-Za-z]+)([0-9]+)") %>%
spread(key, value, convert = TRUE) %>%  # gather() made every value character; convert = TRUE restores numeric Aid
select(-num)
# State Year Aid Org Proj
# 1 Italy 2014 910 a A
# 2 Italy 2014 658 g C
# 3 Italy 2015 831 e A
# 4 Italy 2015 650 k C
# 5 Italy 2016 834 c A
# 6 Italy 2016 625 i C
# 7 UK 2014 858 d B
# 8 UK 2014 620 j D
# 9 UK 2015 926 b B
# 10 UK 2015 681 h D
# 11 UK 2016 821 f B
# 12 UK 2016 687 l D
Is this the desired output?
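With tidyr 1.0.0 or later, the gather/extract/spread chain can be condensed into a single pivot_longer() call. This is a sketch under assumptions: the data frame below is a cut-down, hand-typed version of the question's layout with blanks stored as NA, and the column name slot is made up for illustration.

```r
library(tidyr)

dat <- data.frame(
  State = c("Italy", "Italy", "UK"),
  Year  = c(2000, 2001, 2001),
  Org1  = c("A", "B", "Z"),
  Aid1  = c(1000, 700, 2000),
  Proj1 = c("Arts", "Social", "Social"),
  Org2  = c("B", "A", NA),
  Aid2  = c(500, 1000, NA),
  Proj2 = c("Arts", "Envir", NA),
  stringsAsFactors = FALSE
)

# ".value" sends the letter part of each name (Org, Aid, Proj) to its own
# output column; the digit part goes into the helper column "slot".
long <- pivot_longer(
  dat,
  cols = -c(State, Year),
  names_to = c(".value", "slot"),
  names_pattern = "([A-Za-z]+)([0-9]+)",
  values_drop_na = TRUE   # drop slots where Org, Aid and Proj are all empty
)

long$slot <- NULL
long <- long[order(long$State, long$Org, long$Year), ]
long
```

Because each original column keeps its own type, Aid stays numeric and no conversion step is needed afterwards.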

Aggregate column R

I am new here and have a problem
Year Market Winner BID
1 1990 ABC Apple 0.1260
2 1990 ABC Apple 0.1395
3 1990 EFG Pear 0.1350
4 1991 EFG Apple 0.1113
5 1991 EFG Orange 0.1094
For each year and separately for the two markets (i.e., ABC, EFG), examine the
combined data for Apple and Pear on the bid price variable BID for presence
of potential outliers. Identify instances where you observe the presence of
potential outliers.
I managed to separate the data by year only
y <- c(1, seq(300))
year1991 <- subset(X, y < 39)
year1991
Year1991 <- year1991[, c(1,2,3,5)]
Year1991
Now I need help with the right R command to select (view) only the ABC rows
of the Market column, while the other columns' values remain.
Is it possible to do multiple separations at one time, or should it be done step by step?
Could you give me a tip on how to exclude rows if I want to view the data in such
a manner:
Year Market Winner BID
1 1990 ABC Apple 0.1260
2 1990 ABC Apple 0.1395
Year Market Winner BID
1 1990 EFG Pear 0.1350
Like trying to split the 'Market' but still see the whole list of values
Thanks in advance :)

> df
Year Market Winner BID
1 1990 ABC Apple 0.1260
2 1990 ABC Apple 0.1395
3 1990 EFG Pear 0.1350
4 1991 EFG Apple 0.1113
5 1991 EFG Orange 0.1094
library(plyr)
# Then you can break up the data into chunks of year x market.
# I split your data.frame into a list. You can do further things with that list.
# alternatively, you can use ddply and add a function to do your hw bit and collate all
# results back into a final data.frame. This should be a helpful start.
> dlply(df, .(Year,Market))
$`1990.ABC`
Year Market Winner BID
1 1990 ABC Apple 0.1260
2 1990 ABC Apple 0.1395
$`1990.EFG`
Year Market Winner BID
3 1990 EFG Pear 0.135
$`1991.EFG`
Year Market Winner BID
4 1991 EFG Apple 0.1113
5 1991 EFG Orange 0.1094
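The same split can be done in base R without plyr. The sketch below also shows one possible outlier check per group; using boxplot.stats(), which flags values beyond 1.5 times the interquartile range, is my assumption about what counts as a "potential outlier", not something stated in the question.

```r
df <- data.frame(
  Year   = c(1990, 1990, 1990, 1991, 1991),
  Market = c("ABC", "ABC", "EFG", "EFG", "EFG"),
  Winner = c("Apple", "Apple", "Pear", "Apple", "Orange"),
  BID    = c(0.1260, 0.1395, 0.1350, 0.1113, 0.1094),
  stringsAsFactors = FALSE
)

# View only the ABC market, keeping all columns
only_abc <- df[df$Market == "ABC", ]

# Keep the combined Apple and Pear rows, then split by year and market;
# drop = TRUE omits year/market combinations that have no rows
apple_pear <- df[df$Winner %in% c("Apple", "Pear"), ]
groups <- split(apple_pear, list(apple_pear$Year, apple_pear$Market), drop = TRUE)

# Bids flagged as potential outliers within each group (1.5 * IQR rule)
flagged <- lapply(groups, function(g) boxplot.stats(g$BID)$out)
flagged
```

On this five-row sample no bid is flagged, since each group holds at most two values; the check only becomes informative on the full data set.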