How to add a sum row in a data frame - r

I know this question is very elementary, but I'm having trouble adding an extra row to show a summary of the rows.
Let's say I'm creating a data.frame using the code below:
name <- c("James","Kyle","Chris","Mike")
nationality <- c("American","British","American","Japanese")
income <- c(5000,4000,4500,3000)
x <- data.frame(name,nationality,income)
The code above creates the data.frame below:
name nationality income
1 James American 5000
2 Kyle British 4000
3 Chris American 4500
4 Mike Japanese 3000
What I'm trying to do is add a 5th row that contains: name = "Total", nationality = NA, income = the total of all incomes. My desired output looks like this:
name nationality income
1 James American 5000
2 Kyle British 4000
3 Chris American 4500
4 Mike Japanese 3000
5 Total NA 16500
In my real case, the data.frame has more than a thousand rows, and I need an efficient way to add the total row.
Can someone please advise? Thank you very much!

We can use rbind
rbind(x, data.frame(name='Total', nationality=NA, income = sum(x$income)))
# name nationality income
#1 James American 5000
#2 Kyle British 4000
#3 Chris American 4500
#4 Mike Japanese 3000
#5 Total <NA> 16500

Using an index:
name <- c("James","Kyle","Chris","Mike")
nationality <- c("American","British","American","Japanese")
income <- c(5000,4000,4500,3000)
x <- data.frame(name,nationality,income, stringsAsFactors=FALSE)
x[nrow(x)+1, ] <- c('Total', NA, sum(x$income))
UPDATE: use a list instead of c(), so the numeric income column is not coerced to character:
x[nrow(x)+1, ] <- list('Total', NA, sum(x$income))
x
# name nationality income
# 1 James American 5000
# 2 Kyle British 4000
# 3 Chris American 4500
# 4 Mike Japanese 3000
# 5 Total <NA> 16500
sapply(x, class)
# name nationality income
# "character" "character" "numeric"
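The list version matters because `c()` flattens mixed types into a single atomic vector, coercing everything to a common type (here character), while `list()` preserves each element's type. A quick sketch of the difference:

```r
# c() flattens to one atomic vector, coercing everything to character
mixed <- c('Total', NA, 16500)
class(mixed)            # "character" -- the number was silently coerced

# list() keeps each element's own type
kept <- list('Total', NA, 16500)
sapply(kept, class)     # "character" "logical" "numeric"
```

This is why the `c()` assignment above turns the income column into character, while the `list()` assignment keeps it numeric.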

If you want the exact row as you put in your post, then the following should work:
newdata = rbind(x, data.frame(name='Total', nationality='NA', income = sum(x$income)))
I do, though, agree with Jaap that you may not want to add this row at the end: if you need to load the data and use it for other analyses, the extra row will cause unnecessary trouble. However, you can use the following code to remove the added row before any other analysis:
newdata = newdata[newdata$name != 'Total', ]

Related

If/Else statement in R

I have two dataframes in R:
city price bedroom
San Jose 2000 1
Barstow 1000 1
NA 1500 1
Code to recreate:
data = data.frame(city = c('San Jose', 'Barstow', NA), price = c(2000, 1000, 1500), bedroom = c(1, 1, 1))
and:
Name Density
San Jose 5358
Barstow 547
Code to recreate:
population_density = data.frame(Name=c('San Jose', 'Barstow'), Density=c(5358, 547));
I want to create an additional column named city_type in the data dataset based on a condition: if the city's population density is above 1000 it is Urban, below 1000 it is Suburb, and NA stays NA.
city price bedroom city_type
San Jose 2000 1 Urban
Barstow 1000 1 Suburb
NA 1500 1 NA
I am using a for loop for conditional flow:
for (row in 1:length(data)) {
  if (is.na(data[row, 'city'])) {
    data[row, 'city_type'] = NA
  } else if (population[population$Name == data[row, 'city'], ]$Density >= 1000) {
    data[row, 'city_type'] = 'Urban'
  } else {
    data[row, 'city_type'] = 'Suburb'
  }
}
The for loop runs with no error on my original dataset of over 20,000 observations; however, it yields a lot of wrong results (NA for the most part).
What has gone wrong here, and how can I do better to achieve my desired result?
I have become quite a fan of dplyr pipelines for this type of join/filter/mutate workflow. So here is my suggestion:
library(dplyr)
# I had to add that extra "NA" there, did you not? Hm...
data <- data.frame(city = c('San Jose', 'Barstow', NA), price = c(2000,1000, 500), bedroom = c(1,1,1))
population <- data.frame(Name=c('San Jose', 'Barstow'), Density=c(5358, 547));
data %>%
  # join the two dataframes by matching up the city name columns
  left_join(population, by = c("city" = "Name")) %>%
  # add your new column based on the desired condition
  mutate(city_type = ifelse(Density >= 1000, "Urban", "Suburb"))
Output:
city price bedroom Density city_type
1 San Jose 2000 1 5358 Urban
2 Barstow 1000 1 547 Suburb
3 <NA> 500 1 NA <NA>
Using ifelse, create city_type in population_density, then use match to carry it over to data:
population_density$city_type <- ifelse(population_density$Density > 1000, 'Urban', 'Suburb')
data$city_type <- population_density$city_type[match(data$city, population_density$Name)]
data
city price bedroom city_type
1 San Jose 2000 1 Urban
2 Barstow 1000 1 Suburb
3 <NA> 1500 1 <NA>
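For completeness, the loop in the question most likely misfires because `1:length(data)` iterates over the number of columns, not rows (and the loop body references `population` while the recreate code defines `population_density`). A corrected base-R version, assuming the lookup table is the `population_density` frame from the question, might look like:

```r
data <- data.frame(city = c('San Jose', 'Barstow', NA),
                   price = c(2000, 1000, 1500), bedroom = c(1, 1, 1))
population_density <- data.frame(Name = c('San Jose', 'Barstow'),
                                 Density = c(5358, 547))

for (row in seq_len(nrow(data))) {   # nrow(), not length()
  if (is.na(data[row, 'city'])) {
    data[row, 'city_type'] <- NA
  } else if (population_density$Density[population_density$Name == data[row, 'city']] >= 1000) {
    data[row, 'city_type'] <- 'Urban'
  } else {
    data[row, 'city_type'] <- 'Suburb'
  }
}
data$city_type   # "Urban" "Suburb" NA
```

That said, the join-based answers above avoid the loop entirely and scale much better to 20,000 rows.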

Subsetting rows of a dataframe when respondent number is duplicated in column

I have a huge dataset which is partly pooled cross section and partly panel data:
Year Country Respnr Power Nr
1 2000 France 1 1213 1
2 2001 France 2 1234 2
3 2000 UK 3 1726 3
4 2001 UK 3 6433 4
I would like to filter the panel data from the combined data and tried the following:
> anyDuplicated(df$Respnr)
[1] 45047 # out of 340,000
dfpanel<- subset(df, duplicated(df$Respnr) == TRUE)
The new df is however reduced to zero observations. The following led to the expected amount of observations:
dfpanel<- subset(df, Nr < 3)
Any idea what could be the issue?
Although I have not figured out why the previous attempt did not work, the following does provide a working solution. I have simply split the previous approach into two steps. The solution adds a column panel, which in my case is actually a welcome addition:
df$panel <- duplicated(df$Respnr)
dfpanel <- subset(df, panel)
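One caveat worth knowing: `duplicated()` flags only the second and later occurrences, so the first row of each repeated Respnr is left out of the subset. To keep every row belonging to a respondent who appears more than once, flag in both directions. A small sketch with the example data:

```r
df <- data.frame(Year = c(2000, 2001, 2000, 2001),
                 Country = c('France', 'France', 'UK', 'UK'),
                 Respnr = c(1, 2, 3, 3))
# TRUE for any row whose Respnr occurs more than once, first occurrence included
df$panel <- duplicated(df$Respnr) | duplicated(df$Respnr, fromLast = TRUE)
dfpanel <- subset(df, panel)   # both UK rows for respondent 3
```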

Conditional Counting in Data Tables

Suppose I have the following data table:
hs_code country city company
1: apples Canada Calgary West Jet
2: apples Canada Calgary United
3: apples US Los Angeles Alaska
4: apples US Chicago Alaska
5: oranges Korea Seoul West Jet
6: oranges China Shanghai John's Freight Co
7: oranges China Harbin John's Freight Co
8: oranges China Ningbo John's Freight Co
Output:
hs_code countries city company
1: apples 2 1,2 2,1,1
2: oranges 2 1,3 1,1,1,1
The logic is as follows:
For each good, I want the first column to summarize the number of unique countries. For apples it is 2. Based on this value, I want a 2-tuple in the city column that summarizes the number of unique cities for each country. So, since there is only one unique city for Canada and two for the US, the value becomes (1,2). Notice that the sum of this tuple is 3. Finally, in the company column, I want a 3-tuple that summarizes the number of unique companies per (country, city) combination. So, since there are West Jet and United for the (Canada, Calgary) pair, I assign a 2. The next two values are 1 and 1 because Los Angeles and Chicago each have only one transportation company listed.
I understand this is pretty confusing and involved. But any help would be greatly appreciated. I've tried using data table methods such as
DT[, countries := uniqueN(country), by = .(hs_code)]
DT[, cities := uniqueN(city), by = .(hs_code, country)]
but I'm not sure how to get this conveniently into a list form into a data.table recursively.
Thanks!
Well, this is some sort of nested transformation that you can do in three steps:
dt[, .(companies = length(unique(company))), by = .(hs_code, country, city)
   ][, .(cities = length(unique(city)),
         companies = paste0(companies, collapse = ",")),
     by = .(hs_code, country)
   ][, .(countries = length(unique(country)),
         cities = paste0(cities, collapse = ","),
         companies = paste0(companies, collapse = ",")),
     by = hs_code]
# hs_code countries cities companies
# 1: apples 2 1,2 2,1,1
# 2: oranges 2 1,3 1,1,1,1
You can use .SD[] notation to create subgroups with more granular grouping (the list columns are named cities and companies here, so the result does not end up with two columns both called company):
dt[, .(
  countries = uniqueN(country),
  cities = list(.SD[, uniqueN(city), by = country]$V1),
  companies = list(.SD[, uniqueN(company), by = .(country, city)]$V1)
), by = .(hs_code)]
#    hs_code countries cities companies
# 1: apples 2 1,2 2,1,1
# 2: oranges 2 1,3 1,1,1,1

R merge dataframes with different columns with order

I have the three large data frames below and want to merge them into one data frame while preserving the row order.
df 1:
First Name Last Name
John Langham
Paul McAuley
Steven Hutchison
Sean Hamilton
N N
df2:
First Name Wage Location
John 500 HK
Paul 600 NY
Steven 1900 LDN
Sean 800 TL
N N N
df3:
Last Name Time
Langham 8
McAuley 9
Hutchison 12
Hamilton 7
N N
desired output:
First Name Last Name Wage Location Time
John Langham 500 HK 8
Paul McAuley 600 NY 9
Steven Hutchison 1900 LDN 12
Sean Hamilton 800 TL 7
N N N N N
I know how to merge df1 and df2, but merging that result with df3 by the second column changes the row order from my desired output, so I want to ask: is there any recommendation? Thank you.
If you are trying to preserve the order as it appears in df1, create a column to memorialize that order, then use it again to set the order of the final merged result (called merged here, to avoid clashing with the input df3):
# record the order
df1$original_order <- 1:nrow(df1)
# then do your merges...
# ...
# then restore the df1 order to the merged result
merged <- merged[order(merged$original_order), ]
If you want, you can then also get rid of that column:
merged$original_order <- NULL
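Putting that idea together with the question's data, here is a runnable sketch (column names are simplified to drop the spaces, and the merged result is held in its own variable so it does not clash with the input df3):

```r
df1 <- data.frame(FirstName = c('John', 'Paul', 'Steven', 'Sean'),
                  LastName = c('Langham', 'McAuley', 'Hutchison', 'Hamilton'))
df2 <- data.frame(FirstName = c('John', 'Paul', 'Steven', 'Sean'),
                  Wage = c(500, 600, 1900, 800),
                  Location = c('HK', 'NY', 'LDN', 'TL'))
df3 <- data.frame(LastName = c('Langham', 'McAuley', 'Hutchison', 'Hamilton'),
                  Time = c(8, 9, 12, 7))

df1$original_order <- seq_len(nrow(df1))
merged <- merge(df1, df2, by = 'FirstName')      # merge() sorts by the key,
merged <- merge(merged, df3, by = 'LastName')    # scrambling the row order
merged <- merged[order(merged$original_order), ] # restore df1's order
merged$original_order <- NULL
merged$FirstName   # "John" "Paul" "Steven" "Sean"
```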

How can I count the number of instances a value occurs within a subgroup in R?

I have a data frame that I'm working with in R, and am trying to check how many times a value occurs within its larger, associated group. Specifically, I'm trying to count the number of cities that are listed for each particular country.
My data look something like this:
City Country
=========================
New York US
San Francisco US
Los Angeles US
Paris France
Nantes France
Berlin Germany
It seems that table() is the way to go, but I can't quite figure it out — how can I find out how many cities are listed for each country? That is to say, how can I find out how many fields in one column are associated with a particular value in another column?
EDIT:
I'm hoping for something along the lines of
3 US
2 France
1 Germany
I guess you can try table.
table(df$Country)
# France Germany US
# 2 1 3
Or using data.table
library(data.table)
setDT(df)[, .N, by=Country]
# Country N
#1: US 3
#2: France 2
#3: Germany 1
Or
library(plyr)
count(df$Country)
# x freq
#1 France 2
#2 Germany 1
#3 US 3
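If you also want the result ordered by count, as in the desired output, one base-R way is to convert the table to a data frame and sort it (the column name Cities below is just an illustrative choice):

```r
df <- data.frame(City = c('New York', 'San Francisco', 'Los Angeles',
                          'Paris', 'Nantes', 'Berlin'),
                 Country = c('US', 'US', 'US', 'France', 'France', 'Germany'))
counts <- as.data.frame(table(df$Country))
names(counts) <- c('Country', 'Cities')
counts <- counts[order(-counts$Cities), ]   # descending, matching the desired output
# Country: US, France, Germany; Cities: 3, 2, 1
```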
