Manipulating data.frames - r

I have a sample survey sheet; something like demographic. One of the columns is country (factor) another is annual income. Now, I need to calculate average of each country and store in new data.frame with country and corresponding mean. It should be simple but I am lost. The data is something like the one shown below:
Country Income($) Education ... ... ...
1. USA 90000 Phd
2. UK 94000 Undergrad
3. USA 94000 Highschool
4. UK 87000 Phd
5. Russia 77000 Undergrad
6. Norway 60000 Masters
7. Korea 90000 Phd
8. USA 110000 Masters
.
.
I need a final result like:
USA UK Russia ...
98000 90000 75000
Thank You.

data example:
dat <- read.table(text="Country Income Education
USA 90000 Phd
UK 94000 Undergrad
USA 94000 Highschool
UK 87000 Phd
Russia 77000 Undergrad
Norway 60000 Masters
Korea 90000 Phd
USA 110000 Masters",header=TRUE)
Do what you want with plyr :
if your data is called dat:
library(plyr)
newdf <- ddply(dat, .(Country), function(x) Countrymean = mean(x$Income))
# newdf <- ddply(dat, .(Country), function(x) data.frame(Income = mean(x$Income)))
and aggregate:
newdf <- aggregate(Income ~ Country, data = dat, FUN = mean)
for the output you show at the end maybe tapply?
tapply(dat$Income, dat$Country, mean)

Related

How do I get the sum of frequency count based on two columns?

Assuming that the dataframe is stored as someData, and is in the following format:
ID Team Games Medal
1 Australia 1992 Summer NA
2 Australia 1994 Summer Gold
3 Australia 1992 Summer Silver
4 United States 1991 Winter Gold
5 United States 1992 Summer Bronze
6 Singapore 1991 Summer NA
How would I count the frequencies of the medal, based on the Team - while excluding NA as an variable. But at the same time, the total frequency of each country should be summed, rather than displayed separately for Gold, Silver and Bronze.
In other words, I am trying to display the total number of medals PER country, with the exception of NA.
I have tried something like this:
library(plyr)
counts <- ddply(olympics, .(olympics$Team, olympics$Medal), nrow)
names(counts) <- c("Country", "Medal", "Freq")
counts
But this just gives me a massive table of every medal for every country separately, including NA.
What I would like to do is the following:
Australia 2
United States 2
Any help would be greatly appreciated.
Thank you!
We can use count
library(dplyr)
df1 %>%
filter(!is.na(Medal)) %>%
count(Team)
# A tibble: 2 x 2
# Team n
# <fct> <int>
#1 Australia 2
#2 United States 2
You can do that in base R with table and colSums
colSums(table(someData$Medal, someData$Team))
Australia Singapore United States
2 0 2
Data
someData = read.table(text="ID Team Games Medal
1 Australia '1992 Summer' NA
2 Australia '1994 Summer' Gold
3 Australia '1992 Summer' Silver
4 'United States' '1991 Winter' Gold
5 'United States' '1992 Summer' Bronze
6 Singapore '1991 Summer' NA",
header=TRUE)

Merge two data frames by a max number condition in r

Cheers, I have a data frame df1 with the Major City with max visitors in 2011.
df1:
Country City Visitors_2011
UK London 100000
USA Washington D.C 200000
USA New York 100000
France Paris 100000
The other data frame df2 consists of top visited cities in the country for 2012:
df2:
Country City Visitors_2012
USA Washington D.C 200000
USA New York 100000
USA Las Angeles 100000
UK London 100000
UK Manchester 100000
France Paris 100000
France Nice 100000
The Output I would need is:
Logic: To obtain df3, merge df1 and df2 by Country and City and if you can't find city in df1 then add that volume to biggest city in df1.
Example: Los Angeles visitor count here is added to Washington D.C because Los Angeles is not present in df1 and Washington D.C has more visitors(2012) than New York.
df3:
Country City Visitors_2011 Visitors_2012
UK London 100000 200000
USA Washington D.C 200000 300000
USA New York 100000 100000
France Paris 100000 200000
Can anyone point me to the right direction?
Assume df1.txt and df2.txt contain your space-delimited dataframes.
Here is a solution in base R:
df1 <- read.table("df1.txt", header = T, stringsAsFactors = F);
df2 <- read.table("df2.txt", header = T, stringsAsFactors = F);
# Merge with all = TRUE, see ?merge
df <- merge(df1, df2, all = TRUE);
# Deal with missing values
tmp <- lapply(split(df, df$Country), function(x) {
# Make sure NA's are at the bottom
x <- x[order(x$Visitors_2011), ];
# Select first max Visitors_2012 entry
idx <- which.max(x$Visitors_2012);
# Add any NA's to max entry
x$Visitors_2012[idx] <- x$Visitors_2012[idx] + sum(x$Visitors_2012[is.na(x$Visitors_2011)]);
# Return dataframe
return(x[!is.na(x$Visitors_2011), ])});
# Bind list entries into dataframe
df <- do.call(rbind, tmp);
print(df);
Country City Visitors_2011 Visitors_2012
France France Paris 100000 200000
UK UK London 100000 200000
USA.6 USA New_York 100000 100000
USA.7 USA Washington_D.C 200000 300000
A dplyr approach:
library(dplyr)
max.cities <- df1 %>% group_by(Country) %>% summarise(City = City[which.max(Visitors_2011)])
result <- df2 %>% mutate(City=ifelse(City %in% df1$City, City,
max.cities$City[match(Country, max.cities$Country)])) %>%
group_by(Country,City) %>%
summarise(Visitors_2012=sum(Visitors_2012)) %>%
left_join(df1,., by=c("Country", "City"))
Notes:
First, compute the City that has the max visitors group_by Country in df1 and set that to a separate data frame max.cities.
mutate the City column in df2 so that if the City is in df1, then the name is unchanged; otherwise, the City from max.cites that matches the Country is used.
Once the City has been suitably modified, group_by both Country and City and sum up the Visitors_2012.
Finally, left_join with df1 by c("Country", "City") to get the final result.
The result using your posted data is as expected:
print(result)
## Country City Visitors_2011 Visitors_2012
##1 UK London 100000 200000
##2 USA Washington D.C 200000 300000
##3 USA New York 100000 100000
##4 France Paris 100000 200000

R factor with overlapping level ranges

Hi I am struggling about a problem since coupple of days and haven't found any answer yet.
Supposed I am having a dataset with columns: Country, Population. The Country is incoded in Numbers, so the raw dataset looks like this:
df <- data.frame(country=c(1,2,3,4,5,6), population=c(10000,20000,30000,4000,50000,60000))
df
country population
1 1 10000
2 2 20000
3 3 30000
4 4 4000
5 5 50000
6 6 60000
I want country to be a factor with the following levels: France, Germany, Canada, USA, India, China and Europe, America, Asia.
So to say a factor combinig:
df$country <- factor(df$country, labels = c("France", "Germany", "Canada", "USA", "India", "Asia"))
df
country population
1 France 10000
2 Germany 20000
3 Canada 30000
4 USA 4000
5 India 50000
6 Asia 60000
and
df$country <- cut(df$country, breaks = c(0,2,4,6),labels = c("Europe", "America", "Asia"))
df
country population
1 Europe 10000
2 Europe 20000
3 America 30000
4 America 4000
5 Asia 50000
6 Asia 60000
My aim is to do something like:
tapply(df$population, df$country, sum)
with a result like this:
France Germany Canada USA India China Europe America Asia
10000 20000 30000 4000 50000 60000 30000 34000 110000
Is there a way to this, without creating a third column in the data frame?
I hope it is understandble, what my problem is.
I already tried interaction() but thats not what I want.
So the following function from the plyr-Package divides your data frame into sub-data-frames (one sub-data-frame per country) and then sums up the population values. The t function just transverses your data frame.
> library(plyr)
> neu <- ddply(df, .(country), Summe = sum(population))
> t(neu)

R. How to add sum row in data frame

I know this question is very elementary, but I'm having a trouble adding an extra row to show summary of the row.
Let's say I'm creating a data.frame using the code below:
name <- c("James","Kyle","Chris","Mike")
nationality <- c("American","British","American","Japanese")
income <- c(5000,4000,4500,3000)
x <- data.frame(name,nationality,income)
The code above creates the data.frame below:
name nationality income
1 James American 5000
2 Kyle British 4000
3 Chris American 4500
4 Mike Japanese 3000
What I'm trying to do is to add a 5th row and contains: name = "total", nationality = "NA", age = total of all rows. My desired output looks like this:
name nationality income
1 James American 5000
2 Kyle British 4000
3 Chris American 4500
4 Mike Japanese 3000
5 Total NA 16500
In a real case, my data.frame has more than a thousand rows, and I need efficient way to add the total row.
Can some one please advice? Thank you very much!
We can use rbind
rbind(x, data.frame(name='Total', nationality=NA, income = sum(x$income)))
# name nationality income
#1 James American 5000
#2 Kyle British 4000
#3 Chris American 4500
#4 Mike Japanese 3000
#5 Total <NA> 16500
using index.
name <- c("James","Kyle","Chris","Mike")
nationality <- c("American","British","American","Japanese")
income <- c(5000,4000,4500,3000)
x <- data.frame(name,nationality,income, stringsAsFactors=FALSE)
x[nrow(x)+1, ] <- c('Total', NA, sum(x$income))
UPDATE: using list
x[nrow(x)+1, ] <- list('Total', NA, sum(x$income))
x
# name nationality income
# 1 James American 5000
# 2 Kyle British 4000
# 3 Chris American 4500
# 4 Mike Japanese 3000
# 5 Total <NA> 16500
sapply(x, class)
# name nationality income
# "character" "character" "numeric"
If you want the exact row as you put in your post, then the following should work:
newdata = rbind(x, data.frame(name='Total', nationality='NA', income = sum(x$income)))
I though agree with Jaap that you may not want this row to add to the end. In case you need to load the data and use it for other analysis, this will add to unnecessary trouble. However, you may also use the following code to remove the added row before other analysis:
newdata = newdata[-newdata$name=='Total',]

Summarize data using doBy package at region level

I have a dataset Data as below,
Region Country Market Price
EUROPE France France 30.4502
EUROPE Israel Israel 5.14110965
EUROPE France France 8.99665
APAC CHINA CHINA 2.6877232
APAC INDIA INDIA 60.9004
AFME SL SL 54.1729685
LA BRAZIL BRAZIL 56.8606917
EUROPE RUSSIA RUSSIA 11.6843732
APAC BURMA BURMA 63.5881232
AFME SA SA 115.0733685
I would like to summarize the data at Region level and get the SUM of Price at every Region Level.
I want the ouput to be Like below.
Data Output
Region Country Price
EUROPE France 30.4502
EUROPE Israel 5.14110965
EUROPE France 8.99665
EUROPE RUSSIA 11.6843732
Europe 56.27233285
APAC BURMA 63.5881232
APAC CHINA 2.6877232
APAC INDIA 60.9004
Apac 127.1762464
AFME BAHARAIN 54.1729685
AFME SA 115.0733685
AFME 169.246337
LA BRAZIL 56.8606917
LA 56.8606917
I have used summaryBy function of doBy package, i have tried the code below.
summaryBy
myfun1 <- function(x){c(s=Sum(x)}
DB= summaryBy(Data$Price ~Region + Country , data=Data, FUN=myfun1)
Anyhelp on this regard is very much appreciated.
You can do this by using dplyr to generate a summary table:
library(dplyr)
totals <- data %>% group_by(Region) %>% summarise(Country="",Price=sum(Price))
And then merging the summary with the rest of the data:
summary <- rbind(data[-3], totals)
Then you can sort by Region to put the summary with the region:
summary <- summary %>% arrange(Region)
Output:
Region Country Price
1 AFME SL 54.1730
2 AFME SA 115.0734
3 AFME 169.2463
4 APAC CHINA 2.6877
5 APAC INDIA 60.9004
6 APAC BURMA 63.5881
7 APAC 127.1762
8 EUROPE France 30.4502
9 EUROPE Israel 5.1411
10 EUROPE France 8.9967
11 EUROPE RUSSIA 11.6844
12 EUROPE 56.2723
13 LA BRAZIL 56.8607
14 LA 56.8607
You have to split data by Region factor and sum Price for each factor
lapply(split(data, data$Region), function(x) sum(x$Price))
Or, if you need to present result as you have shown:
totals = lapply(split(data, data$Region), function(x) rbind(x,data.frame(Region=unique(x$Region), Country="", Market="", Price=sum(x$Price))))
do.call(rbind, totals)

Resources