R factor with overlapping level ranges

Hi, I have been struggling with a problem for a couple of days and haven't found an answer yet.
Suppose I have a dataset with the columns Country and Population. Country is encoded as numbers, so the raw dataset looks like this:
df <- data.frame(country=c(1,2,3,4,5,6), population=c(10000,20000,30000,4000,50000,60000))
df
country population
1 1 10000
2 2 20000
3 3 30000
4 4 4000
5 5 50000
6 6 60000
I want country to be a factor with the following levels: France, Germany, Canada, USA, India, China and Europe, America, Asia.
That is, a factor combining:
df$country <- factor(df$country, labels = c("France", "Germany", "Canada", "USA", "India", "China"))
df
country population
1 France 10000
2 Germany 20000
3 Canada 30000
4 USA 4000
5 India 50000
6 China 60000
and
df$country <- cut(df$country, breaks = c(0,2,4,6),labels = c("Europe", "America", "Asia"))
df
country population
1 Europe 10000
2 Europe 20000
3 America 30000
4 America 4000
5 Asia 50000
6 Asia 60000
My aim is to do something like:
tapply(df$population, df$country, sum)
with a result like this:
France Germany Canada USA India China Europe America Asia
10000 20000 30000 4000 50000 60000 30000 34000 110000
Is there a way to do this without creating a third column in the data frame?
I hope my problem is understandable.
I have already tried interaction(), but that's not what I want.

The ddply function from the plyr package splits your data frame into sub-data-frames (one per country) and then sums the population values in each. The t function just transposes the result.
library(plyr)
neu <- ddply(df, .(country), summarise, Summe = sum(population))
t(neu)
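If the goal is only the combined table, one possible way to get it without adding a column is to call tapply once per grouping and concatenate the two results. This is a minimal base R sketch on the question's data; the names by_country, by_continent and combined are illustrative, and "China" is assumed to be the label for code 6.

```r
df <- data.frame(country = c(1, 2, 3, 4, 5, 6),
                 population = c(10000, 20000, 30000, 4000, 50000, 60000))

# Sum per country: label the numeric codes on the fly (df itself is not modified)
by_country <- tapply(
  df$population,
  factor(df$country, labels = c("France", "Germany", "Canada", "USA", "India", "China")),
  sum
)

# Sum per continent: bin the same numeric codes with cut()
by_continent <- tapply(
  df$population,
  cut(df$country, breaks = c(0, 2, 4, 6), labels = c("Europe", "America", "Asia")),
  sum
)

# Concatenate both named results into one named vector
combined <- c(by_country, by_continent)
print(combined)
```

Both groupings are computed from the raw numeric codes, so df$country has to stay numeric for this to work.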

Related

Complex Grouped Bar Chart

I would really like to learn how to use R, but I'm still struggling with basic things. I need to make a bar graph where the columns are grouped by four variables. This is a simplified version of my data:
REGION AREA AGE LOCALS FOREIGNER
1 USA CITY OLD 30.7485876 3.5254237
2 USA CITY YOUNG 51.1666667 1.1666667
3 USA COUNTRY OLD 6.1666667 1.8333333
4 USA COUNTRY YOUNG 14.0000000 2.5000000
5 EUROPE CITY OLD 4.5000000 8.8333333
6 EUROPE CITY YOUNG 0.6680672 18.7044818
7 EUROPE COUNTRY OLD 56.5000000 0.8333333
8 EUROPE COUNTRY YOUNG 59.8333333 0.6666667
9 ASIA CITY OLD 28.6666667 6.1666667
10 ASIA CITY YOUNG 25.8333333 7.3333333
11 ASIA COUNTRY OLD 3.0494232 18.1195224
12 ASIA COUNTRY YOUNG 2.1666667 21.5000000
This is the result I would like to obtain with R (the original chart was made in Excel).
I've spent a lot of time looking online, but I can only find code for two variables. Could someone help me do this?
Not exactly what you asked for but here goes.
data <- read.table(textConnection("
REGION AREA AGE LOCALS FOREIGNER
1 USA CITY OLD 30.7485876 3.5254237
2 USA CITY YOUNG 51.1666667 1.1666667
3 USA COUNTRY OLD 6.1666667 1.8333333
4 USA COUNTRY YOUNG 14.0000000 2.5000000
5 EUROPE CITY OLD 4.5000000 8.8333333
6 EUROPE CITY YOUNG 0.6680672 18.7044818
7 EUROPE COUNTRY OLD 56.5000000 0.8333333
8 EUROPE COUNTRY YOUNG 59.8333333 0.6666667
9 ASIA CITY OLD 28.6666667 6.1666667
10 ASIA CITY YOUNG 25.8333333 7.3333333
11 ASIA COUNTRY OLD 3.0494232 18.1195224
12 ASIA COUNTRY YOUNG 2.1666667 21.5000000"), header = TRUE)
data <- as.data.frame(data)
library(tidyr)
# Reshape to long format; to rename "LOC_FOR", change it here.
data <- data %>%
  gather(LOC_FOR, VALUE, -REGION, -AREA, -AGE)
library(ggplot2)
ggplot(data, aes(x = AGE, y = VALUE, fill = LOC_FOR)) +
  geom_bar(position = 'dodge', stat = 'identity') +
  facet_grid(~REGION + AREA)

create a variable in a dataframe based on another matrix on R

I am having some problems with the following task.
I have a data frame like this, with 99 different countries and thousands of IDs:
ID Nationality var 1 var 2 ....
1 Italy //
2 Eritrea //
3 Italy //
4 USA
5 France
6 France
7 Eritrea
....
I want to add a variable giving the macroregion that corresponds to each Nationality,
so I created a matrix like this with the mapping to follow:
Nationality Continent
Italy Europe
Eritrea Africa
Usa America
France Europe
Germany Europe
....
I'd like to obtain this:
ID Nationality var 1 var 2 Continent
1 Italy // Europe
2 Eritrea // Africa
3 Italy // Europe
4 USA America
5 France Europe
6 France Europe
7 Eritrea Africa
....
I tried this command:
datasubset <- merge(dataset , continent.matrix )
but it doesn't work and reports the following error:
Error: cannot allocate vector of size 56.6 Mb
That seems very strange to me. Applying the same code to a subset doesn't work either. Do you have any suggestions on how to proceed?
Thank you very much in advance for your help. I hope my question doesn't sound too trivial, but I am quite new to R.
You can do this with the left_join function from the dplyr package:
library(dplyr)
df <- tibble(ID = c(1, 2, 3),
             Nationality = c("Italy", "Usa", "France"),
             var1 = c("a", "b", "c"),
             var2 = c(4, 5, 6))
nat_cont <- tibble(Nationality = c("Italy", "Eritrea", "Usa", "Germany", "France"),
                   Continent = c("Europe", "Africa", "America", "Europe", "Europe"))
df_2 <- left_join(df, nat_cont, by = c("Nationality"))
The output:
> df_2
# A tibble: 3 x 5
ID Nationality var1 var2 Continent
<dbl> <chr> <chr> <dbl> <chr>
1 1 Italy a 4 Europe
2 2 Usa b 5 America
3 3 France c 6 Europe
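A base R alternative, sketched here under the assumption that the lookup table has one row per Nationality, is to index the lookup with match(). It adds a single column in place and never builds a merged copy, which may help when memory is tight. It is also worth checking that both objects share a column name: when they share none, merge() falls back to a Cartesian product, which could be one explanation for the allocation error. The names dataset and continent_lookup and their contents are illustrative.

```r
dataset <- data.frame(
  ID = 1:4,
  Nationality = c("Italy", "Eritrea", "Italy", "France"),
  stringsAsFactors = FALSE
)

continent_lookup <- data.frame(
  Nationality = c("Italy", "Eritrea", "Usa", "France", "Germany"),
  Continent = c("Europe", "Africa", "America", "Europe", "Europe"),
  stringsAsFactors = FALSE
)

# match() gives, for every row of dataset, the row of the lookup with the same
# Nationality (NA when there is none, e.g. "USA" vs "Usa": matching is case-sensitive)
row_in_lookup <- match(dataset$Nationality, continent_lookup$Nationality)
dataset$Continent <- continent_lookup$Continent[row_in_lookup]

print(dataset)
```

Rows whose Nationality is missing from the lookup get NA in Continent, so unmatched spellings are easy to spot with is.na(dataset$Continent).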

How do I get the sum of frequency count based on two columns?

Assume the data frame is stored as someData and has the following format:
ID Team Games Medal
1 Australia 1992 Summer NA
2 Australia 1994 Summer Gold
3 Australia 1992 Summer Silver
4 United States 1991 Winter Gold
5 United States 1992 Summer Bronze
6 Singapore 1991 Summer NA
How would I count the medals by Team while excluding NA values? The count for each country should be a single total rather than separate counts for Gold, Silver and Bronze.
In other words, I am trying to display the total number of medals per country, ignoring NA.
I have tried something like this:
library(plyr)
counts <- ddply(olympics, .(olympics$Team, olympics$Medal), nrow)
names(counts) <- c("Country", "Medal", "Freq")
counts
But this just gives me a massive table of every medal for every country separately, including NA.
What I would like to do is the following:
Australia 2
United States 2
Any help would be greatly appreciated.
Thank you!
We can use count from dplyr after filtering out the NA rows:
library(dplyr)
someData %>%
  filter(!is.na(Medal)) %>%
  count(Team)
# A tibble: 2 x 2
# Team n
# <fct> <int>
#1 Australia 2
#2 United States 2
You can do that in base R with table and colSums
colSums(table(someData$Medal, someData$Team))
Australia Singapore United States
2 0 2
Data
someData = read.table(text="ID Team Games Medal
1 Australia '1992 Summer' NA
2 Australia '1994 Summer' Gold
3 Australia '1992 Summer' Silver
4 'United States' '1991 Winter' Gold
5 'United States' '1992 Summer' Bronze
6 Singapore '1991 Summer' NA",
header=TRUE)
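If teams without any medal should not appear at all, as in the desired output, one option is to drop the NA rows before tabulating. This is a small sketch on the sample data above; medal_counts is an illustrative name, and stringsAsFactors = FALSE is set explicitly so that Team is a character column (a factor would keep Singapore with a count of 0).

```r
someData <- read.table(text = "ID Team Games Medal
1 Australia '1992 Summer' NA
2 Australia '1994 Summer' Gold
3 Australia '1992 Summer' Silver
4 'United States' '1991 Winter' Gold
5 'United States' '1992 Summer' Bronze
6 Singapore '1991 Summer' NA",
header = TRUE, stringsAsFactors = FALSE)

# Keep only the rows that actually have a medal, then count rows per team
medal_counts <- table(someData$Team[!is.na(someData$Medal)])
print(medal_counts)
```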

Merge two data frames by a max number condition in r

I have a data frame df1 with the major cities that had the most visitors in 2011.
df1:
Country City Visitors_2011
UK London 100000
USA Washington D.C 200000
USA New York 100000
France Paris 100000
The other data frame df2 consists of top visited cities in the country for 2012:
df2:
Country City Visitors_2012
USA Washington D.C 200000
USA New York 100000
USA Los Angeles 100000
UK London 100000
UK Manchester 100000
France Paris 100000
France Nice 100000
The output I need is df3, shown below.
Logic: to obtain df3, merge df1 and df2 by Country and City; if a city from df2 is not in df1, add its visitor count to the biggest city of the same country in df1.
Example: the Los Angeles visitor count is added to Washington D.C because Los Angeles is not present in df1 and Washington D.C has more visitors (2012) than New York.
df3:
Country City Visitors_2011 Visitors_2012
UK London 100000 200000
USA Washington D.C 200000 300000
USA New York 100000 100000
France Paris 100000 200000
Can anyone point me to the right direction?
Assume df1.txt and df2.txt contain your space-delimited data frames (with spaces inside city names replaced by underscores, as in the output below).
Here is a solution in base R:
df1 <- read.table("df1.txt", header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table("df2.txt", header = TRUE, stringsAsFactors = FALSE)
# Merge with all = TRUE, see ?merge
df <- merge(df1, df2, all = TRUE)
# Deal with missing values, one country at a time
tmp <- lapply(split(df, df$Country), function(x) {
  # Make sure NAs are at the bottom
  x <- x[order(x$Visitors_2011), ]
  # Select first max Visitors_2012 entry
  idx <- which.max(x$Visitors_2012)
  # Add the Visitors_2012 of cities missing from df1 to the max entry
  x$Visitors_2012[idx] <- x$Visitors_2012[idx] + sum(x$Visitors_2012[is.na(x$Visitors_2011)])
  # Return only the cities that are present in df1
  x[!is.na(x$Visitors_2011), ]
})
# Bind list entries into a data frame
df <- do.call(rbind, tmp)
print(df)
Country City Visitors_2011 Visitors_2012
France France Paris 100000 200000
UK UK London 100000 200000
USA.6 USA New_York 100000 100000
USA.7 USA Washington_D.C 200000 300000
A dplyr approach:
library(dplyr)
max.cities <- df1 %>%
  group_by(Country) %>%
  summarise(City = City[which.max(Visitors_2011)])
result <- df2 %>%
  mutate(City = ifelse(City %in% df1$City, City,
                       max.cities$City[match(Country, max.cities$Country)])) %>%
  group_by(Country, City) %>%
  summarise(Visitors_2012 = sum(Visitors_2012)) %>%
  left_join(df1, ., by = c("Country", "City"))
Notes:
First, group_by Country in df1, compute the City with the most visitors in each group, and store the result in a separate data frame, max.cities.
mutate the City column in df2 so that if the City is in df1, the name is unchanged; otherwise, the City from max.cities that matches the Country is used.
Once the City has been suitably modified, group_by both Country and City and sum up the Visitors_2012.
Finally, left_join with df1 by c("Country", "City") to get the final result.
The result using your posted data is as expected:
print(result)
## Country City Visitors_2011 Visitors_2012
##1 UK London 100000 200000
##2 USA Washington D.C 200000 300000
##3 USA New York 100000 100000
##4 France Paris 100000 200000
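For readers who want to try the logic without input files or dplyr, here is a self-contained base R sketch of the same steps as the dplyr answer (reassign each unknown city to its country's biggest 2011 city, sum, then join). The data frames are retyped from the question, and the helper names top_city, in_df1, target and totals are illustrative.

```r
df1 <- data.frame(
  Country = c("UK", "USA", "USA", "France"),
  City = c("London", "Washington D.C", "New York", "Paris"),
  Visitors_2011 = c(100000, 200000, 100000, 100000),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Country = c("USA", "USA", "USA", "UK", "UK", "France", "France"),
  City = c("Washington D.C", "New York", "Los Angeles", "London",
           "Manchester", "Paris", "Nice"),
  Visitors_2012 = c(200000, 100000, 100000, 100000, 100000, 100000, 100000),
  stringsAsFactors = FALSE
)

# City with the most 2011 visitors in each country (named by country)
top_city <- sapply(split(df1, df1$Country),
                   function(x) x$City[which.max(x$Visitors_2011)])

# A df2 row keeps its city only if the (Country, City) pair exists in df1
in_df1 <- paste(df2$Country, df2$City) %in% paste(df1$Country, df1$City)
target <- ifelse(in_df1, df2$City, unname(top_city[df2$Country]))

# Sum 2012 visitors per (Country, reassigned City), then attach to df1
totals <- aggregate(Visitors_2012 ~ Country + City,
                    data = transform(df2, City = target), FUN = sum)
df3 <- merge(df1, totals, by = c("Country", "City"), all.x = TRUE)
print(df3)
```

Note that merge() sorts the result by Country and City, so the row order differs from df1.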

Manipulating data.frames

I have a sample survey sheet with demographic data. One of the columns is country (a factor) and another is annual income. I need to calculate the average income for each country and store it in a new data.frame with the country and the corresponding mean. It should be simple, but I am lost. The data looks something like this:
Country Income($) Education ... ... ...
1. USA 90000 Phd
2. UK 94000 Undergrad
3. USA 94000 Highschool
4. UK 87000 Phd
5. Russia 77000 Undergrad
6. Norway 60000 Masters
7. Korea 90000 Phd
8. USA 110000 Masters
.
.
I need a final result like:
USA UK Russia ...
98000 90500 77000
Thank You.
data example:
dat <- read.table(text="Country Income Education
USA 90000 Phd
UK 94000 Undergrad
USA 94000 Highschool
UK 87000 Phd
Russia 77000 Undergrad
Norway 60000 Masters
Korea 90000 Phd
USA 110000 Masters",header=TRUE)
With plyr, assuming your data is called dat:
library(plyr)
# The function returns a bare number, so the mean lands in an automatically named column
newdf <- ddply(dat, .(Country), function(x) Countrymean = mean(x$Income))
# To get a column named Income instead:
# newdf <- ddply(dat, .(Country), function(x) data.frame(Income = mean(x$Income)))
Or with aggregate:
newdf <- aggregate(Income ~ Country, data = dat, FUN = mean)
For the output you show at the end, maybe tapply?
tapply(dat$Income, dat$Country, mean)
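tapply returns a named vector rather than a data frame. If the data.frame form asked for in the question is needed as well, the names and values can be put back into two columns. A short sketch on the sample data; means and newdf are illustrative names.

```r
dat <- read.table(text = "Country Income Education
USA 90000 Phd
UK 94000 Undergrad
USA 94000 Highschool
UK 87000 Phd
Russia 77000 Undergrad
Norway 60000 Masters
Korea 90000 Phd
USA 110000 Masters", header = TRUE)

# Named vector: one mean income per country
means <- tapply(dat$Income, dat$Country, mean)

# Same information as a two-column data frame
newdf <- data.frame(Country = names(means), Income = as.vector(means))
print(newdf)
```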
