How do I get the sum of frequency count based on two columns? - r

Assuming that the dataframe is stored as someData, and is in the following format:
ID Team Games Medal
1 Australia 1992 Summer NA
2 Australia 1994 Summer Gold
3 Australia 1992 Summer Silver
4 United States 1991 Winter Gold
5 United States 1992 Summer Bronze
6 Singapore 1991 Summer NA
How would I count the frequencies of the medal, based on the Team - while excluding NA as an variable. But at the same time, the total frequency of each country should be summed, rather than displayed separately for Gold, Silver and Bronze.
In other words, I am trying to display the total number of medals PER country, with the exception of NA.
I have tried something like this:
library(plyr)
counts <- ddply(olympics, .(olympics$Team, olympics$Medal), nrow)
names(counts) <- c("Country", "Medal", "Freq")
counts
But this just gives me a massive table of every medal for every country separately, including NA.
What I would like to do is the following:
Australia 2
United States 2
Any help would be greatly appreciated.
Thank you!

We can use count
library(dplyr)
df1 %>%
filter(!is.na(Medal)) %>%
count(Team)
# A tibble: 2 x 2
# Team n
# <fct> <int>
#1 Australia 2
#2 United States 2

You can do that in base R with table and colSums
colSums(table(someData$Medal, someData$Team))
Australia Singapore United States
2 0 2
Data
someData = read.table(text="ID Team Games Medal
1 Australia '1992 Summer' NA
2 Australia '1994 Summer' Gold
3 Australia '1992 Summer' Silver
4 'United States' '1991 Winter' Gold
5 'United States' '1992 Summer' Bronze
6 Singapore '1991 Summer' NA",
header=TRUE)

Related

Complex Grouped Bar Chart

I would really like to learn how to use R, but I'm still struggling with basic things. I need to make a Bar graph where column are grouped into four variables. This is a simplified matrix of my data:
REGION AREA AGE LOCALS FOREIGNER
1 USA CITY OLD 30.7485876 3.5254237
2 USA CITY YOUNG 51.1666667 1.1666667
3 USA COUNTRY OLD 6.1666667 1.8333333
4 USA COUNTRY YOUNG 14.0000000 2.5000000
5 EUROPE CITY OLD 4.5000000 8.8333333
6 EUROPE CITY YOUNG 0.6680672 18.7044818
7 EUROPE COUNTRY OLD 56.5000000 0.8333333
8 EUROPE COUNTRY YOUNG 59.8333333 0.6666667
9 ASIA CITY OLD 28.6666667 6.1666667
10 ASIA CITY YOUNG 25.8333333 7.3333333
11 ASIA COUNTRY OLD 3.0494232 18.1195224
12 ASIA COUNTRY YOUNG 2.1666667 21.5000000
And this is the results that I would like to obtain with R (made with excel):
I've spent a lot of lime looking online but I can find codes for just two variables. Could someone help me to do this?
Not exactly what you asked for but here goes.
data <- read.table(textConnection("
REGION AREA AGE LOCALS FOREIGNER
1 USA CITY OLD 30.7485876 3.5254237
2 USA CITY YOUNG 51.1666667 1.1666667
3 USA COUNTRY OLD 6.1666667 1.8333333
4 USA COUNTRY YOUNG 14.0000000 2.5000000
5 EUROPE CITY OLD 4.5000000 8.8333333
6 EUROPE CITY YOUNG 0.6680672 18.7044818
7 EUROPE COUNTRY OLD 56.5000000 0.8333333
8 EUROPE COUNTRY YOUNG 59.8333333 0.6666667
9 ASIA CITY OLD 28.6666667 6.1666667
10 ASIA CITY YOUNG 25.8333333 7.3333333
11 ASIA COUNTRY OLD 3.0494232 18.1195224
12 ASIA COUNTRY YOUNG 2.1666667 21.5000000"), header = TRUE)
data <- as.data.frame(data)
library(tidyr)
data <- data %>%
gather(LOC_FOR, VALUE, -REGION, -AREA, -AGE) #If you want to change the name "LOC_FOR" to something else do it here.
library(ggplot2)
ggplot(data, aes(x = AGE, y = VALUE, fill = LOC_FOR)) +
geom_bar(position = 'dodge', stat = 'identity') +
facet_grid(~REGION + AREA)

create a variable in a dataframe based on another matrix on R

I am having some problems with the following task
I have a data frame of this type with 99 different countries for thousands of IDs
ID Nationality var 1 var 2 ....
1 Italy //
2 Eritrea //
3 Italy //
4 USA
5 France
6 France
7 Eritrea
....
I want to add a variable corresponding to a given macroregion of Nationality
so I created a matrix of this kind with the rule to follow
Nationality Continent
Italy Europe
Eritrea Africa
Usa America
France Europe
Germany Europe
....
I d like to obtain this
ID Nationality var 1 var 2 Continent
1 Italy // Europe
2 Eritrea // Africa
3 Italy // Europe
4 USA America
5 France Europe
6 France Europe
7 Eritrea Africa
....
I was trying with this command
datasubset <- merge(dataset , continent.matrix )
but it doesn't work, it reports the following error
Error: cannot allocate vector of size 56.6 Mb
that seems very strange to me, also trying to apply this code to a subset it doesn't work. do you have any suggestion on how to proceed?
thank you very much in advance for your help, I hope my question doesn't sound too trivial, but I am quite new to R
You can do this with the left_join function (dplyr's library):
library(dplyr)
df <- tibble(ID=c(1,2,3),
Nationality=c("Italy", "Usa", "France"),
var1=c("a", "b", "c"),
var2=c(4,5,6))
nat_cont <- tibble(Nationality=c("Italy", "Eritrea", "Usa", "Germany", "France"),
Continent=c("Europe", "Africa", "America", "Europe", "Europe"))
df_2 <- left_join(df, nat_cont, by=c("Nationality"))
The output:
> df_2
# A tibble: 3 x 5
ID Nationality var1 var2 Continent
<dbl> <chr> <chr> <dbl> <chr>
1 1 Italy a 4 Europe
2 2 Usa b 5 America
3 3 France c 6 Europe

Find the nth largest value based on criteria [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 4 years ago.
This is the basically same problem I had in Excel a few days ago (Excel - find nth largest value based on criteria), but this time in R (the data set contains half a million entries and that is more than Excel seems to be able to handle).
I have a table that looks like this that I have imported from Excel:
Country Region Code Product name Year Value
Sweden Stockholm 123 Apple 1991 244
Sweden Kirruna 123 Apple 1987 100
Japan Kyoto 543 Pie 1987 544
Denmark Copenhagen 123 Apple 1998 787
Denmark Copenhagen 123 Apple 1987 100
Denmark Copenhagen 543 Pie 1991 320
Denmark Copenhagen 126 Candy 1999 200
Sweden Gothenburg 126 Candy 2013 300
Sweden Gothenburg 157 Tomato 1987 150
Sweden Stockholm 125 Juice 1987 250
Sweden Kirruna 187 Banana 1998 310
Japan Kyoto 198 Ham 1987 157
Japan Kyoto 125 Juice 1987 550
Japan Tokyo 125 Juice 1991 100
What I want to do is to make a code that can give me the sum of the nth largest value of products that have been sold in a specific country. For instance, the most sold product in Sweden is Apple so I want to code to find that apple is the most sold product (in total, which is what I am interested in) and then summaries all the values of the sold apples in the country Sweden, 344.
I also want to be able to find the nth largest value based on both country and year. That is, if I am looking for the most sold product in Sweden in the year 2013, it should return the product Candy and the value 300.
Solution for your first question (find most sold product per country, summarise value for this product) using dplyr:
library(tidyverse)
df %>%
group_by(Country, Product_name) %>%
summarise(sum_value = sum(Value, na.rm = TRUE)) %>%
ungroup() %>%
group_by(Country) %>%
filter(sum_value == max(sum_value))
# A tibble: 3 x 3
# Groups: Country [3]
Country Product_name sum_value
<fctr> <fctr> <int>
1 Denmark Apple 887
2 Japan Juice 650
3 Sweden Apple 344
Solution for second question (show nth most sold products per country and year, sum value):
df %>%
group_by(Country, Product_name, Year) %>%
summarise(sum_value = sum(Value, na.rm = TRUE)) %>%
ungroup() %>%
group_by(Country, Year) %>%
arrange(desc(sum_value), .by_group = TRUE) %>%
slice(., 1:2)
Had to change the data a bit to get a decent output, so here's the output with all years set to 1987 (change the 2 in the 1:2 within the last row for a different n):
# A tibble: 6 x 4
# Groups: Country, Year [3]
Country Product_name Year sum_value
<fctr> <fctr> <int> <int>
1 Denmark Apple 1987 887
2 Denmark Pie 1987 320
3 Japan Juice 1987 650
4 Japan Pie 1987 544
5 Sweden Apple 1987 344
6 Sweden Banana 1987 310

Having trouble merging/joining two datasets on two variables in R

I realize there have already been many asked and answered questions about merging datasets here, but I've been unable to find one that addresses my issue.
What I'm trying to do is merge to datasets using two variables and keeping all data from each. I've tried merge and all of the join operations from dplyr, as well as cbind and have not gotten the result I want. Usually what happens is that one column from one of the datasets gets overwritten with NAs. Another thing that will happen, as when I do full_join in dplyr or all = TRUE in merge is that I get double the number of rows.
Here's my data:
Primary_State Primary_County n
<fctr> <fctr> <int>
1 AK 12
2 AK Aleutians West 1
3 AK Anchorage 961
4 AK Bethel 1
5 AK Fairbanks North Star 124
6 AK Haines 1
Primary_County Primary_State Population
1 Autauga AL 55416
2 Baldwin AL 208563
3 Barbour AL 25965
4 Bibb AL 22643
5 Blount AL 57704
6 Bullock AL 10362
So I want to merge or join based on Primary_State and Primary_County, which is necessary because there are a lot of duplicate county names in the U.S. and retain the data from both n and Population. From there I can then divide the Population by n and get a per capita figure for each county. I just can't figure out how to do it and keep all of the data, so any help would be appreciated. Thanks in advance!
EDIT: Adding code examples of what I've already described above.
This code (as well as left_join):
countyPerCap <- merge(countyLicense, countyPops, all.x = TRUE)
Produces this:
Primary_State Primary_County n Population
1 AK 12 NA
2 AK Aleutians West 1 NA
3 AK Anchorage 961 NA
4 AK Bethel 1 NA
5 AK Fairbanks North Star 124 NA
6 AK Haines 1 NA
This code:
countyPerCap <- right_join(countyLicense, countyPops)
Produces this:
Primary_State Primary_County n Population
<chr> <chr> <int> <int>
1 AL Autauga NA 55416
2 AL Baldwin NA 208563
3 AL Barbour NA 25965
4 AL Bibb NA 22643
5 AL Blount NA 57704
6 AL Bullock NA 10362
Hope that's helpful.
EDIT: This is what happens with the following code:
countyPerCap <- merge(countyLicense, countyPops, all = TRUE)
Primary_State Primary_County n Population
1 AK 12 NA
2 AK Aleutians East NA 3296
3 AK Aleutians West 1 NA
4 AK Aleutians West NA 5647
5 AK Anchorage 961 NA
6 AK Anchorage NA 298192
It duplicates state and county and then adds n to one record and Population in another. Is there a way to deduplicate the dataset and remove the NAs?
We can give column names in merge by mentioning "by" in merge statement
merge(x,y, by=c(col1, col2 names))
in merge statement
I figured it out. There were trailing whitespaces in the Census data's county names, so they weren't matching with the other dataset's county names. (Note to self: Always check that factors match when trying to merge datasets!)
trim.trailing <- function (x) sub("\\s+$", "", x)
countyPops$Primary_County <- trim.trailing(countyPops$Primary_County)
countyPerCap <- full_join(countyLicense, countyPops,
by=c("Primary_State", "Primary_County"), copy=TRUE)
Those three lines did the trick. Thanks everyone!

R factor with overlapping level ranges

Hi I am struggling about a problem since coupple of days and haven't found any answer yet.
Supposed I am having a dataset with columns: Country, Population. The Country is incoded in Numbers, so the raw dataset looks like this:
df <- data.frame(country=c(1,2,3,4,5,6), population=c(10000,20000,30000,4000,50000,60000))
df
country population
1 1 10000
2 2 20000
3 3 30000
4 4 4000
5 5 50000
6 6 60000
I want country to be a factor with the following levels: France, Germany, Canada, USA, India, China and Europe, America, Asia.
So to say a factor combinig:
df$country <- factor(df$country, labels = c("France", "Germany", "Canada", "USA", "India", "Asia"))
df
country population
1 France 10000
2 Germany 20000
3 Canada 30000
4 USA 4000
5 India 50000
6 Asia 60000
and
df$country <- cut(df$country, breaks = c(0,2,4,6),labels = c("Europe", "America", "Asia"))
df
country population
1 Europe 10000
2 Europe 20000
3 America 30000
4 America 4000
5 Asia 50000
6 Asia 60000
My aim is to do something like:
tapply(df$population, df$country, sum)
with a result like this:
France Germany Canada USA India China Europe America Asia
10000 20000 30000 4000 50000 60000 30000 34000 110000
Is there a way to this, without creating a third column in the data frame?
I hope it is understandble, what my problem is.
I already tried interaction() but thats not what I want.
So the following function from the plyr-Package divides your data frame into sub-data-frames (one sub-data-frame per country) and then sums up the population values. The t function just transverses your data frame.
> library(plyr)
> neu <- ddply(df, .(country), Summe = sum(population))
> t(neu)

Resources