Conditional Counting in Data Tables in R

Suppose I have the following data table:
   hs_code country        city           company
1:  apples  Canada     Calgary          West Jet
2:  apples  Canada     Calgary            United
3:  apples      US Los Angeles            Alaska
4:  apples      US     Chicago            Alaska
5: oranges   Korea       Seoul          West Jet
6: oranges   China    Shanghai John's Freight Co
7: oranges   China      Harbin John's Freight Co
8: oranges   China      Ningbo John's Freight Co
Output:
   hs_code countries city company
1:  apples         2  1,2   2,1,1
2: oranges         2  1,3 1,1,1,1
The logic is as follows:
For each good, I want the first column to summarize the number of unique countries; for apples it is 2. Based on this value, I want a 2-tuple in the city column that summarizes the number of unique cities for each country. Since there is only one unique city for Canada and two for the US, the value becomes (1,2). Notice that the sum of this tuple is 3. Finally, in the company column, I want a 3-tuple that summarizes the number of unique companies per (country, city) combination. Since there are West Jet and United for the (Canada, Calgary) pair, I assign a 2. The next two values are 1 and 1 because Los Angeles and Chicago each have only one transportation company listed.
I understand this is pretty confusing and involved, but any help would be greatly appreciated. I've tried data.table expressions such as
DT[, countries := uniqueN(country), by = .(hs_code)]
DT[, cities := uniqueN(city), by = .(hs_code, country)]
but I'm not sure how to conveniently collect this into list form in a data.table recursively.
Thanks!
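For reference, the sample data above can be rebuilt as a data.table (a sketch, transcribed from the printed table):
library(data.table)
dt <- data.table(
  hs_code = rep(c("apples", "oranges"), each = 4),
  country = c("Canada", "Canada", "US", "US", "Korea", "China", "China", "China"),
  city    = c("Calgary", "Calgary", "Los Angeles", "Chicago",
              "Seoul", "Shanghai", "Harbin", "Ningbo"),
  company = c("West Jet", "United", "Alaska", "Alaska", "West Jet",
              "John's Freight Co", "John's Freight Co", "John's Freight Co")
)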

Well, this is a nested transformation that you can do in three steps, aggregating from the innermost grouping outward:
dt[, .(companies = length(unique(company))), by = .(hs_code, country, city)
   ][, .(cities = length(unique(city)),
         companies = paste0(companies, collapse = ",")), by = .(hs_code, country)
   ][, .(countries = length(unique(country)),
         cities = paste0(cities, collapse = ","),
         companies = paste0(companies, collapse = ",")), by = hs_code]
#    hs_code countries cities companies
# 1:  apples         2    1,2     2,1,1
# 2: oranges         2    1,3   1,1,1,1

You can use .SD[] notation to create subgroups with more granular grouping:
dt[, .(
  countries = uniqueN(country),
  cities    = c(.SD[, uniqueN(city), by = .(country)][, .(V1)]),
  companies = c(.SD[, uniqueN(company), by = .(country, city)][, .(V1)])
), by = .(hs_code)]
#    hs_code countries cities companies
# 1:  apples         2    1,2     2,1,1
# 2: oranges         2    1,3   1,1,1,1
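Note that cities and companies come back as list columns here, which data.table prints comma-separated. If you prefer the actual strings of the first approach, you can flatten them afterwards (a sketch; res stands for the result of the query above):
res[, `:=`(
  cities    = sapply(cities, paste, collapse = ","),
  companies = sapply(companies, paste, collapse = ",")
)]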

Related

New Column Based on Conditions

To set the scene, I have a dataset where two of the columns have been mixed up. To give a simple example:
df1 <- data.frame(Name = c("Bob", "John", "Mark", "Will"),
                  City = c("Apple", "Paris", "Orange", "Berlin"),
                  Fruit = c("London", "Pear", "Madrid", "Orange"))
df2 <- data.frame(Cities = c("Paris", "London", "Berlin", "Madrid", "Moscow", "Warsaw"))
As a result, we have two small data sets:
> df1
  Name   City  Fruit
1  Bob  Apple London
2 John  Paris   Pear
3 Mark Orange Madrid
4 Will Berlin Orange
> df2
  Cities
1  Paris
2 London
3 Berlin
4 Madrid
5 Moscow
6 Warsaw
My aim is to create a new column where the cities are in the correct place, using df2 as a reference. I am a bit new to R, so I don't know how this would work. I don't really know where to even start with this sort of problem. My full dataset is much larger, so it would be good to have an efficient method of unpicking this issue!
If only the 'City' values are misplaced, we can loop over the rows, build a logical vector of which values match 'Cities' in 'df2', and reassemble each row with the matched value placed second:
df1[] <- t(apply(df1, 1, function(x) {
  i1 <- x %in% df2$Cities   # which of the row's values are real cities
  i2 <- !i1
  x1 <- x[i2]               # the non-city values, in original order
  c(x1[1], x[i1], x1[2])    # reassemble as Name, City, Fruit
}))
Output:
> df1
  Name   City  Fruit
1  Bob London  Apple
2 John  Paris   Pear
3 Mark Madrid Orange
4 Will Berlin Orange
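Since the question mentions a much larger dataset, a vectorized base R alternative may also be worth sketching; it swaps the two columns where needed (an assumption here: exactly one of City/Fruit per row is a real city; this is not part of the original answer):
city_ok   <- df1$City %in% df2$Cities             # TRUE where City already holds a city
tmp_city  <- ifelse(city_ok, df1$City, df1$Fruit) # take the value that is a city
tmp_fruit <- ifelse(city_ok, df1$Fruit, df1$City) # and the one that is not
df1$City  <- tmp_city
df1$Fruit <- tmp_fruit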
Using the dplyr package, here is a solution that looks up the two City and Fruit values in df1 and takes the one that exists in the df2 city list. If neither of the two is a city name, an empty string is returned; you can replace that with anything you prefer.
library(dplyr)
df1$corrected_City <- case_when(df1$City %in% df2$Cities ~ df1$City,
                                df1$Fruit %in% df2$Cities ~ df1$Fruit,
                                TRUE ~ "")
Output: a new column is created, as you wanted, with the correct city name on each row.
> df1
  Name   City  Fruit corrected_City
1  Bob  Apple London         London
2 John  Paris   Pear          Paris
3 Mark Orange Madrid         Madrid
4 Will Berlin Orange         Berlin
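If you would rather mark unmatched rows with NA instead of an empty string, the fallback branch can be changed (a small variation on the answer above):
df1$corrected_City <- case_when(df1$City %in% df2$Cities ~ df1$City,
                                df1$Fruit %in% df2$Cities ~ df1$Fruit,
                                TRUE ~ NA_character_)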
Another way is:
library(dplyr)
library(tidyr)
df1 %>%
  mutate(across(1:3, ~ case_when(. %in% df2$Cities ~ .), .names = 'new_{col}')) %>%
  unite(New_Col, starts_with('new'), na.rm = TRUE, sep = ' ')
  Name   City  Fruit New_Col
1  Bob  Apple London  London
2 John  Paris   Pear   Paris
3 Mark Orange Madrid  Madrid
4 Will Berlin Orange  Berlin

R Identifying Dataframe Change Patterns by Groups

I have a dataframe that looks like the one below:
person year location    salary
Harry  2002 Los Angeles $2000
Harry  2006 Boston      $3000
Harry  2007 Los Angeles $2500
Peter  2001 New York    $2000
Peter  2002 New York    $2300
Lily   2007 New York    $7000
Lily   2008 Boston      $2300
Lily   2011 New York    $4000
Lily   2013 Boston      $3300
I want to identify a pattern at the person level: who moved out of a location and came back later. For example, Harry moved out of Los Angeles and came back later. Lily moved out of New York and came back later; we can also say she moved out of Boston and came back later. I am only interested in who has this pattern and do not care about the number of back-and-forth moves. Therefore, ideally, the output would look like:
person move_back (yes/no)
Harry  1
Peter  0
Lily   1
With the help of data.table's rleid you can do:
library(dplyr)
df %>%
  arrange(person, year) %>%
  group_by(person) %>%
  mutate(val = data.table::rleid(location)) %>%
  arrange(person, location) %>%
  group_by(location, .add = TRUE) %>%
  summarise(move_back = any(val != lag(val, default = first(val)))) %>%
  summarise(move_back = as.integer(any(move_back)))
#   person move_back
#   <chr>      <int>
# 1 Harry          1
# 2 Lily           1
# 3 Peter          0
You could use rle to identify situations where there are one or more instances of repeats. (Note that your Lily example has two repeats.)
lapply(split(dat, dat$person), function(x) duplicated(rle(x$location)$values))
$Harry
[1] FALSE FALSE  TRUE

$Lily
[1] FALSE FALSE  TRUE  TRUE

$Peter
[1] FALSE
You could use sapply with sum or any to determine the number of move-backs, or whether any move-backs occurred, as sketched below. If you only want to know if there's a move-back to the first site, then the logic would be different.
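For instance (a sketch building on the same rle idea):
sapply(split(dat, dat$person),
       function(x) sum(duplicated(rle(x$location)$values)))  # number of move-backs
# Harry  Lily Peter
#     1     2     0
sapply(split(dat, dat$person),
       function(x) any(duplicated(rle(x$location)$values)))  # any move-back at all
# Harry  Lily Peter
#  TRUE  TRUE FALSE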
A slightly different data.table method, based on joins and row number (.I).
Basically, I'm flagging every row whose location for a given person also appears on an earlier, non-adjacent row, then aggregating.
library(data.table)
setDT(dat)
dat[, rn := .I]        # current row number
dat[, rnp1 := .I + 1]  # row number of the following row
# flag rows whose (person, location) also occurs on an earlier, non-adjacent row
dat[dat, on = .(person, location, rn > rnp1), back := TRUE]
dat[, .(move_back = any(back, na.rm = TRUE)), by = person]
#    person move_back
# 1:  Harry      TRUE
# 2:  Peter     FALSE
# 3:   Lily      TRUE
Where dat was:
dat <- read.csv(text="person,year,location,salary
Harry,2002,Los Angeles,$2000
Harry,2006,Boston,$3000
Harry,2007,Los Angeles,$2500
Peter,2001,New York,$2000
Peter,2002,New York,$2300
Lily,2007,New York,$7000
Lily,2008,Boston,$2300
Lily,2011,New York,$4000
Lily,2013,Boston,$3300", header=TRUE)

R: remove rows that are duplicate in two columns and different in a third

I would like to find rows that match in two columns but differ in a third and retain only one of these lines. So for example:
animal_couples <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  species = c("Cat", "Cat", "Cat", "Cat", "Cat", "Dog", "Dog", "Dog",
              "Fish", "Fish", "Fish", "Fish"),
  partner = c("Cat", "Cat", "Cat", "Cat", "Cat", "Cat", "Dog", "Dog",
              "Dog", "Dog", "Badger", "Badger"),
  location = c("Germany", "Germany", "Iceland", "France", "France", "Iceland",
               "Greece", "Greece", "Germany", "Germany", "France", "Spain")
)
A row can match in 'species' and 'partner' so long as it also matches in 'location'. So the first two rows in this df are fine as Germany and Germany are the same. The next three rows are then removed. So the final df should be:
animal_couples_after <- data.frame(
  ID = c(1, 2, 6, 7, 8, 9, 10, 11),
  species = c("Cat", "Cat", "Dog", "Dog", "Dog", "Fish", "Fish", "Fish"),
  partner = c("Cat", "Cat", "Cat", "Dog", "Dog", "Dog", "Dog", "Badger"),
  location = c("Germany", "Germany", "Iceland", "Greece", "Greece",
               "Germany", "Germany", "France")
)
The real dataset is quite large so I don't think looping through each row would be an option.
Thanks a lot for your help.
You could try:
library(data.table)
library(data.table)
setDT(animal_couples)[, idx := rleid(location), by = .(species, partner)
                      ][idx == 1, ][, idx := NULL]
Output:
   ID species partner location
1:  1     Cat     Cat  Germany
2:  2     Cat     Cat  Germany
3:  6     Dog     Cat  Iceland
4:  7     Dog     Dog   Greece
5:  8     Dog     Dog   Greece
6:  9    Fish     Dog  Germany
7: 10    Fish     Dog  Germany
8: 11    Fish  Badger   France
Or, in shortened form:
setDT(animal_couples)[, .SD[rleid(location) == 1], by = .(species, partner)]
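For comparison, an equivalent with dplyr (a sketch, not from the original answer; it still borrows data.table::rleid for the run ids):
library(dplyr)
animal_couples %>%
  group_by(species, partner) %>%
  filter(data.table::rleid(location) == 1) %>%
  ungroup()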

R: How to add a sum row in a data frame

I know this question is very elementary, but I'm having trouble adding an extra row that shows a summary of the data.
Let's say I'm creating a data.frame using the code below:
name <- c("James","Kyle","Chris","Mike")
nationality <- c("American","British","American","Japanese")
income <- c(5000,4000,4500,3000)
x <- data.frame(name,nationality,income)
The code above creates the data.frame below:
   name nationality income
1 James    American   5000
2  Kyle     British   4000
3 Chris    American   4500
4  Mike    Japanese   3000
What I'm trying to do is add a 5th row that contains: name = "Total", nationality = NA, income = the total of all rows. My desired output looks like this:
   name nationality income
1 James    American   5000
2  Kyle     British   4000
3 Chris    American   4500
4  Mike    Japanese   3000
5 Total          NA  16500
In the real case, my data.frame has more than a thousand rows, and I need an efficient way to add the total row.
Can someone please advise? Thank you very much!
We can use rbind:
rbind(x, data.frame(name = 'Total', nationality = NA, income = sum(x$income)))
#    name nationality income
# 1 James    American   5000
# 2  Kyle     British   4000
# 3 Chris    American   4500
# 4  Mike    Japanese   3000
# 5 Total        <NA>  16500
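If you work in the tidyverse, tibble::add_row can do the same in one call (an alternative sketch, not from the original answers):
library(tibble)
add_row(x, name = "Total", nationality = NA, income = sum(x$income))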
Using indexing:
name <- c("James","Kyle","Chris","Mike")
nationality <- c("American","British","American","Japanese")
income <- c(5000,4000,4500,3000)
x <- data.frame(name,nationality,income, stringsAsFactors=FALSE)
x[nrow(x)+1, ] <- c('Total', NA, sum(x$income))
UPDATE: using a list instead preserves the column types:
x[nrow(x) + 1, ] <- list('Total', NA, sum(x$income))
x
#    name nationality income
# 1 James    American   5000
# 2  Kyle     British   4000
# 3 Chris    American   4500
# 4  Mike    Japanese   3000
# 5 Total        <NA>  16500
sapply(x, class)
#        name nationality      income
# "character" "character"   "numeric"
If you want the exact row as you put in your post (with the literal string 'NA'), then the following should work:
newdata <- rbind(x, data.frame(name = 'Total', nationality = 'NA', income = sum(x$income)))
I do agree with Jaap, though, that you may not want to add this row at the end: if you need to load the data and use it for other analysis, it will cause unnecessary trouble. However, you may use the following code to remove the added row before other analysis:
newdata <- newdata[newdata$name != 'Total', ]
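Alternatively, since the total sits in the last row, you can drop it by position (a one-line sketch):
newdata <- newdata[-nrow(newdata), ]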

How can I count the number of instances a value occurs within a subgroup in R?

I have a data frame that I'm working with in R, and am trying to check how many times a value occurs within its larger, associated group. Specifically, I'm trying to count the number of cities that are listed for each particular country.
My data look something like this:
City          Country
=====================
New York      US
San Francisco US
Los Angeles   US
Paris         France
Nantes        France
Berlin        Germany
It seems that table() is the way to go, but I can't quite figure it out — how can I find out how many cities are listed for each country? That is to say, how can I find out how many fields in one column are associated with a particular value in another column?
EDIT:
I'm hoping for something along the lines of
3 US
2 France
1 Germany
I guess you can try table:
table(df$Country)
#  France Germany      US
#       2       1       3
Or using data.table
library(data.table)
setDT(df)[, .N, by = Country]
#    Country N
# 1:      US 3
# 2:  France 2
# 3: Germany 1
Or
library(plyr)
count(df$Country)
#         x freq
# 1  France    2
# 2 Germany    1
# 3      US    3
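Or with dplyr (an additional alternative, not in the original answers):
library(dplyr)
count(df, Country)
#   Country n
# 1  France 2
# 2 Germany 1
# 3      US 3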
