Computing frequency of membership in R's data.frame

I have the following data.frame:
authors <- data.frame(
  surname     = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
  nationality = c("US", "Australia", "US", "UK", "Australia"),
  deceased    = c("yes", rep("no", 3), "noinfo"))
which produces this output:
surname nationality deceased
1 Tukey US yes
2 Venables Australia no
3 Tierney US no
4 Ripley UK no
5 McNeil Australia noinfo
What I want to do is to get the frequency of deceased by nationality.
Yielding this output:
US yes 1
US no 1
US noinfo 0
Australia yes 0
Australia no 1
Australia noinfo 1
UK yes 0
UK no 1
UK noinfo 0
At the moment I can only display the statistics through tables.
stat <- table(authors)
I'm not sure how to proceed with accessing the elements of the table.
Advice would be appreciated.

You need to call table() on just the columns you want the occurrences for:
table( authors[ c("nationality" , "deceased" ) ] )
# deceased
#nationality no noinfo yes
# Australia 1 1 0
# UK 1 0 0
# US 1 0 1
And to get the exact output you want, turn it into a data.frame:
data.frame( table( authors[ c("nationality" , "deceased" ) ] ) )
# nationality deceased Freq
#1 Australia no 1
#2 UK no 1
#3 US no 1
#4 Australia noinfo 1
#5 UK noinfo 0
#6 US noinfo 0
#7 Australia yes 0
#8 UK yes 0
#9 US yes 1
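If the grouped-by-nationality ordering shown in the question matters, the Freq table can be reordered after the fact. A base-R sketch, reusing the question's authors data:

```r
# Build the frequency table (zero combinations included), then sort it
# so rows are grouped by nationality, as in the question's desired output.
authors <- data.frame(
  surname     = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
  nationality = c("US", "Australia", "US", "UK", "Australia"),
  deceased    = c("yes", rep("no", 3), "noinfo"))

freq <- as.data.frame(table(authors[c("nationality", "deceased")]))
freq <- freq[order(freq$nationality, freq$deceased), ]
freq
```

Note that factor levels sort alphabetically, so "Australia" comes first; relevel the factors if the US-first order from the question is required.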

Related

Find frequencies of all factor combinations of all column combinations

I have a dataframe with n variables whose values are all factors. Now I would like to select m columns from this dataframe (m < n) and find the frequencies of all factor combinations of all possible columns selected.
I have looked this up, but I only found how to find frequencies of factor combinations when specific columns are chosen. In my case, there could be many combinations of columns, since m < n.
Here is our data; all variables have factor values.
company <- data.frame("country"  = c("USA", "China", "France", "Germany"),
                      "category" = c("C-corp", "S-corp", "C-corp", "LLC"),
                      "Type"     = c("Public", "Private", "Private", "Private"),
                      "Profit"   = c("High", "High", "High", "Low"))
Now I want to select 2 columns (m = 2) and find out the frequency of the factor combinations of all of the possible variables selected.
In this case, I can have "country = USA & category = S-Corp", "country = USA & category = C-Corp", "country = China & category = LLC". But I could also select other columns and have "country = USA & Profit = Low", "country = China & Type = Public". I want to know the frequencies of all of these combinations.
Edit: My expected output is something like
country = USA, category = C-corp freq 1
country = USA, category = S-corp freq 0
country = USA, category = LLC freq 0
country = China, category = LLC freq 0
country = France, category = C-corp freq 1
country = USA, Type = Public freq 1
country = China, Type = Public freq 0
Type = Private, Profit = High freq 2
Type = Public, category = LLC freq 0
Type = Private, Profit = Low freq 1
If I need to select 2 columns, I need all the possible column combinations, orders don't matter
The combinations part sounds like expand.grid():
expand.grid(company[, 1:2])
country category
1 USA C-corp
2 China C-corp
3 France C-corp
4 Germany C-corp
5 USA S-corp
6 China S-corp
7 France S-corp
8 Germany S-corp
9 USA C-corp
10 China C-corp
11 France C-corp
12 Germany C-corp
13 USA LLC
14 China LLC
15 France LLC
16 Germany LLC
# or, if you want all 4 columns with all countries, do a cross join:
merge(company[, 1, drop = FALSE], company[, -1], by = NULL)
# or, if you want all 4 columns with all possible results, do expand.grid without subsetting:
expand.grid(company)
The second part sounds like table(). You can perform it directly on the company data.frame:
table(company)
, , Type = Private, Profit = High
category
country C-corp LLC S-corp
China 0 0 1
France 1 0 0
Germany 0 0 0
USA 0 0 0
, , Type = Public, Profit = High
category
country C-corp LLC S-corp
China 0 0 0
France 0 0 0
Germany 0 0 0
USA 1 0 0
, , Type = Private, Profit = Low
category
country C-corp LLC S-corp
China 0 0 0
France 0 0 0
Germany 0 1 0
USA 0 0 0
, , Type = Public, Profit = Low
category
country C-corp LLC S-corp
China 0 0 0
France 0 0 0
Germany 0 0 0
USA 0 0 0
You can do this using a nested loop over the columns, calling the table function on each pair:
for (j in 1:ncol(company)) {
  for (i in 1:ncol(company)) {
    print(table(company[[j]], company[[i]]))
  }
}
It's ugly and has a lot of duplicates, but it's quick and easy for your purposes.
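To cover every unordered pair of columns exactly once (rather than the duplicated j/i pairs from the loop), combn() can enumerate the column-name pairs. A sketch, assuming m = 2:

```r
# Enumerate each pair of column names once, then tabulate the factor
# combinations for that pair; zero-count combinations are kept by table().
company <- data.frame("country"  = c("USA", "China", "France", "Germany"),
                      "category" = c("C-corp", "S-corp", "C-corp", "LLC"),
                      "Type"     = c("Public", "Private", "Private", "Private"),
                      "Profit"   = c("High", "High", "High", "Low"))

pairs <- combn(names(company), 2, simplify = FALSE)
pair_freqs <- lapply(pairs, function(cols) as.data.frame(table(company[cols])))
```

pair_freqs is a list with one frequency data.frame per column pair; for m > 2 the same pattern works by changing the 2 in combn().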

Convert Panel Data to Long in R

My current data is for missiles between 1920 and 2018. The goal is to measure a nation’s ability to deploy missiles of different kinds for each year from 1920 to 2018. The problem that arises is that the data has multiple observations per nation, and often per year. This creates issues because, for instance, if a nation adopted a missile in 1970 that is Air to Air and imported, then developed one in 1980 that is Air to Air and Air to Ground and produced domestically, that change needs to be reflected. The goal is to have a unique row/observation for each year for every nation. It should also be noted that it is assumed that if a nation can produce Air to Air in 1970, for instance, it can do so until 2018.
Current:
YearAcquired CountryCode CountryName Domestic AirtoAir
2014 670 Saudi Arabia 0 1
2017 670 Saudi Arabia 1 1
2016 2 United States 1 1
Desired:
YearAcquired CountryCode CountryName Domestic AirtoAir
2014 670 Saudi Arabia 0 1
2015 670 Saudi Arabia 0 1
2016 670 Saudi Arabia 0 1
2017 670 Saudi Arabia 1 1
2018 670 Saudi Arabia 1 1
2016 2 United States 0 1
2017 2 United States 0 1
2018 2 United States 0 1
Note: There are many entries, so I would like it to generate rows from 1920 to 2018 for every country, even if a country would have straight zeroes. That is not necessary, but it would be a great bonus!
You can do this via several steps:
1. Create the combination of all years and countries (a CROSS JOIN in SQL)
2. LEFT JOIN these combinations with the available data
3. Use a function like zoo::na.locf() to replace NA values with the last known ones per country
The first step is common:
df <- read.table(text = 'YearAcquired CountryCode CountryName Domestic AirtoAir
2014 670 "Saudi Arabia" 0 1
2017 670 "Saudi Arabia" 1 1
2016 2 "United States" 1 1', header = TRUE, stringsAsFactors = FALSE)
combinations <- merge(data.frame(YearAcquired = seq(1920, 2018, 1)),
unique(df[,2:3]), by = NULL)
For steps 2 and 3, here is a solution using dplyr:
library(dplyr)
library(zoo)
df <- left_join(combinations, df) %>%
group_by(CountryCode) %>%
mutate(Domestic = na.locf(Domestic, na.rm = FALSE),
AirtoAir = na.locf(AirtoAir, na.rm = FALSE))
And one solution using data.table:
library(data.table)
library(zoo)
setDT(df)
setDT(combinations)
df <- df[combinations, on = c("YearAcquired", "CountryCode", "CountryName")]
df <- df[, na.locf(.SD, na.rm = FALSE), by = "CountryCode"]
You could create a new dataframe using the country names and codes available and perform a left join with your existing data. This would give you 1920 to 2018 for each country and code, leaving NAs where you don't have data available, but you could easily replace them given how you want your data structured.
library(dplyr)
# df is your initial dataframe
countries <- unique(df[, c("CountryName", "CountryCode")])
new_df <- merge(data.frame(YearAcquired = seq(1920, 2018, 1)),
                countries, by = NULL)  # every year crossed with every country
new_df <- left_join(new_df, df)
Using tidyverse (dplyr and tidyr)...
If you only need to fill in internal years per country...
df <- read.table(header = TRUE, as.is = TRUE, text = "
YearAcquired countrycode CountryName Domestic AirtoAir
2014 670 'Saudi Arabia' 0 1
2017 670 'Saudi Arabia' 1 1
2016 2 'United States' 1 1
")
library(dplyr)
library(tidyr)
df %>%
group_by(countrycode) %>%
complete(YearAcquired = full_seq(YearAcquired, 1), countrycode, CountryName) %>%
arrange(countrycode, YearAcquired) %>%
fill(Domestic, AirtoAir)
#> # A tibble: 5 x 5
#> # Groups: countrycode [2]
#> YearAcquired countrycode CountryName Domestic AirtoAir
#> <dbl> <int> <chr> <int> <int>
#> 1 2016 2 United States 1 1
#> 2 2014 670 Saudi Arabia 0 1
#> 3 2015 670 Saudi Arabia 0 1
#> 4 2016 670 Saudi Arabia 0 1
#> 5 2017 670 Saudi Arabia 1 1
If you want to expand each country to all years found in the dataset...
df <- read.table(header = TRUE, as.is = TRUE, text = "
YearAcquired countrycode CountryName Domestic AirtoAir
2014 670 'Saudi Arabia' 0 1
2017 670 'Saudi Arabia' 1 1
2016 2 'United States' 1 1
")
library(dplyr)
library(tidyr)
df %>%
complete(YearAcquired = full_seq(YearAcquired, 1),
nesting(countrycode, CountryName)) %>%
group_by(countrycode) %>%
arrange(countrycode, YearAcquired) %>%
fill(Domestic, AirtoAir) %>%
mutate_at(vars(Domestic, AirtoAir), funs(if_else(is.na(.), 0L, .)))
#> # A tibble: 8 x 5
#> # Groups: countrycode [2]
#> YearAcquired countrycode CountryName Domestic AirtoAir
#> <dbl> <int> <chr> <int> <int>
#> 1 2014 2 United States 0 0
#> 2 2015 2 United States 0 0
#> 3 2016 2 United States 1 1
#> 4 2017 2 United States 1 1
#> 5 2014 670 Saudi Arabia 0 1
#> 6 2015 670 Saudi Arabia 0 1
#> 7 2016 670 Saudi Arabia 0 1
#> 8 2017 670 Saudi Arabia 1 1

Use DocumentTermMatrix in R with 'dictionary' parameter

I want to use R for text classification. I use DocumentTermMatrix to return the matrix of words:
library(tm)
crude <- "japan korea usa uk albania azerbaijan"
corps <- Corpus(VectorSource(crude))
dtm <- DocumentTermMatrix(corps)
inspect(dtm)
words <- c("australia", "korea", "uganda", "japan", "argentina", "turkey")
test <- DocumentTermMatrix(corps, control=list(dictionary = words))
inspect(test)
The first inspect(dtm) works as expected, with this result:
Terms
Docs albania azerbaijan japan korea usa
1 1 1 1 1 1
But the second inspect(test) shows this result:
Terms
Docs argentina australia japan korea turkey uganda
1 0 1 0 1 0 0
While the expected result is:
Terms
Docs argentina australia japan korea turkey uganda
1 0 0 1 1 0 0
Is this a bug, or am I using it the wrong way?
Corpus() seems to have a bug when indexing word frequency.
Use VCorpus() instead, this will give you the expected result.
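A minimal sketch of the suggested fix applied to the question's code, with VCorpus() swapped in for Corpus():

```r
library(tm)

crude <- "japan korea usa uk albania azerbaijan"
corps <- VCorpus(VectorSource(crude))  # VCorpus instead of Corpus

words <- c("australia", "korea", "uganda", "japan", "argentina", "turkey")
test  <- DocumentTermMatrix(corps, control = list(dictionary = words))
inspect(test)  # japan and korea should now be counted, the other terms 0
```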

Combining 2 columns in R, prioritizing one of them

I know nothing of R. I have a data.frame with 2 columns, both of which are about the sex of the animals, but one of them has some corrections and the other doesn't.
My desired data.frame would be like this:
id sex father mother birth.date farm
0 1 john ray 05/06/94 1
1 1 doug ana 18/02/93 NA
2 2 bryan kim 21/03/00 3
But I got to this data.frame by using merge on 2 other data.frames:
id sex.x father mother birth.date sex.y farm
0 2 john ray 05/06/94 1 1
1 1 doug ana 18/02/93 NA NA
2 2 bryan kim 21/03/00 2 3
data.frame 1 or Animals (Has the wrong sex for some animals)
id sex father mother birth.date
0 2 john ray 05/06/94
1 1 doug ana 18/02/93
2 2 bryan kim 21/03/00
data.frame 2 or Farm (Has the correct sex):
id farm sex
0 1 1
2 3 2
The code I used was: Animals_Farm <- merge(Animals, Farm, by = "id", all.x = TRUE)
I need to combine the 2 sex columns into one, prioritizing sex.y. How do I do that?
If I correctly understand your example, you have a situation similar to what I show below, based on the example from the merge function.
> (authors <- data.frame(
surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
nationality = c("US", "Australia", "US", "UK", "Australia"),
deceased = c("yes", rep("no", 3), "yes")))
surname nationality deceased
1 Tukey US yes
2 Venables Australia no
3 Tierney US no
4 Ripley UK no
5 McNeil Australia yes
> (books <- data.frame(
name = I(c("Tukey", "Venables", "Tierney",
"Ripley", "Ripley", "McNeil", "R Core")),
title = c("Exploratory Data Analysis",
"Modern Applied Statistics ...", "LISP-STAT",
"Spatial Statistics", "Stochastic Simulation",
"Interactive Data Analysis",
"An Introduction to R"),
deceased = c("yes", rep("no", 6))))
name title deceased
1 Tukey Exploratory Data Analysis yes
2 Venables Modern Applied Statistics ... no
3 Tierney LISP-STAT no
4 Ripley Spatial Statistics no
5 Ripley Stochastic Simulation no
6 McNeil Interactive Data Analysis no
7 R Core An Introduction to R no
> (m1 <- merge(authors, books, by.x = "surname", by.y = "name"))
surname nationality deceased.x title deceased.y
1 McNeil Australia yes Interactive Data Analysis no
2 Ripley UK no Spatial Statistics no
3 Ripley UK no Stochastic Simulation no
4 Tierney US no LISP-STAT no
5 Tukey US yes Exploratory Data Analysis yes
6 Venables Australia no Modern Applied Statistics ... no
Here authors might represent your first dataframe and books your second, and deceased might be the value that is in both dataframes but only up to date in one of them (authors).
The easiest way to only include the correct value of deceased would be to simply exclude the incorrect one from the merge.
> (m2 <- merge(authors, books[names(books) != "deceased"],
by.x = "surname", by.y = "name"))
surname nationality deceased title
1 McNeil Australia yes Interactive Data Analysis
2 Ripley UK no Spatial Statistics
3 Ripley UK no Stochastic Simulation
4 Tierney US no LISP-STAT
5 Tukey US yes Exploratory Data Analysis
6 Venables Australia no Modern Applied Statistics ...
The line of code books[names(books) != "deceased"] simply subsets the dataframe books to remove the deceased column leaving only the correct deceased column from authors in the final merge.
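If re-running the merge is not an option, the two columns from the question's Animals_Farm can also be combined after the fact, preferring sex.y where it is present. A sketch using the merged values shown in the question:

```r
# Merged data as shown in the question (sex.y is NA where the farm
# table had no matching row).
Animals_Farm <- data.frame(id    = 0:2,
                           sex.x = c(2, 1, 2),
                           sex.y = c(1, NA, 2))

# Take sex.y when available, otherwise fall back to sex.x.
Animals_Farm$sex <- ifelse(is.na(Animals_Farm$sex.y),
                           Animals_Farm$sex.x,
                           Animals_Farm$sex.y)
Animals_Farm$sex.x <- Animals_Farm$sex.y <- NULL
```

dplyr users can write the same thing as mutate(sex = coalesce(sex.y, sex.x)).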

Create count per item by year/decade

I have data in a data.table that is as follows:
> x <- df[sample(nrow(df), 10), ]
> x
Importer Exporter Date
1: Ecuador United Kingdom 2004-01-13
2: Mexico United States 2013-11-19
3: Australia United States 2006-08-11
4: United States United States 2009-05-04
5: India United States 2007-07-16
6: Guatemala Guatemala 2014-07-02
7: Israel Israel 2000-02-22
8: India United States 2014-02-11
9: Peru Peru 2007-03-26
10: Poland France 2014-09-15
I am trying to create summaries so that, given a time period (say a decade), I can find the number of times each country appears as Importer and Exporter. So, in the above example the desired output when dividing up by decade should be something like:
Decade Country.Name Importer.Count Exporter.Count
2000 Ecuador 1 0
2000 Mexico 1 1
2000 Australia 1 0
2000 United States 1 3
.
.
.
2010 United States 0 2
.
.
.
So far, I have tried the aggregate and data.table methods suggested by the post here, but both of them seem to just give me counts of the number of Importers/Exporters per year (or decade, which I am more interested in).
> x$Decade <- year(x$Date) - year(x$Date) %% 10
> importer_per_yr<-aggregate(Importer ~ Decade, FUN=length, data=x)
> importer_per_yr
Decade Importer
2 2000 6
3 2010 4
Considering that aggregate uses the formula interface, I tried adding another criterion, but got the following error:
> importer_per_yr<-aggregate(Importer~ Decade + unique(Importer), FUN=length, data=x)
Error in model.frame.default(formula = Importer ~ Decade + :
variable lengths differ (found for 'unique(Importer)')
Is there a way to create the summary according to the decade and the importer/ exporter? It does not matter if the summary for importer and exporter are in different tables.
We can do this using data.table methods: create the 'Decade' column by assignment (:=), then melt the data from 'wide' to 'long' format by specifying the measure columns, and reshape it back to 'wide' using dcast with length as the fun.aggregate.
x[, Decade:= year(Date) - year(Date) %%10]
dcast(melt(x, measure = c("Importer", "Exporter"), value.name = "Country"),
Decade + Country~variable, length)
# Decade Country Importer Exporter
# 1: 2000 Australia 1 0
# 2: 2000 Ecuador 1 0
# 3: 2000 India 1 0
# 4: 2000 Israel 1 1
# 5: 2000 Peru 1 1
# 6: 2000 United Kingdom 0 1
# 7: 2000 United States 1 3
# 8: 2010 France 0 1
# 9: 2010 Guatemala 1 1
#10: 2010 India 1 0
#11: 2010 Mexico 1 0
#12: 2010 Poland 1 0
#13: 2010 United States 0 2
I think with() will work with aggregate() in base R:
my.data <- read.csv(text = '
Importer, Exporter, Date
Ecuador, United Kingdom, 2004-01-13
Mexico, United States, 2013-11-19
Australia, United States, 2006-08-11
United States, United States, 2009-05-04
India, United States, 2007-07-16
Guatemala, Guatemala, 2014-07-02
Israel, Israel, 2000-02-22
India, United States, 2014-02-11
Peru, Peru, 2007-03-26
Poland, France, 2014-09-15
', header = TRUE, stringsAsFactors = TRUE, strip.white = TRUE)
my.data$my.Date <- as.Date(my.data$Date, format = "%Y-%m-%d")
my.data <- data.frame(my.data,
year = as.numeric(format(my.data$my.Date, format = "%Y")),
month = as.numeric(format(my.data$my.Date, format = "%m")),
day = as.numeric(format(my.data$my.Date, format = "%d")))
my.data$my.decade <- my.data$year - (my.data$year %% 10)
importer.count <- with(my.data, aggregate(cbind(count = Importer) ~ my.decade + Importer, FUN = function(x) { NROW(x) }))
exporter.count <- with(my.data, aggregate(cbind(count = Exporter) ~ my.decade + Exporter, FUN = function(x) { NROW(x) }))
colnames(importer.count) <- c('my.decade', 'country', 'importer.count')
colnames(exporter.count) <- c('my.decade', 'country', 'exporter.count')
my.counts <- merge(importer.count, exporter.count, by = c('my.decade', 'country'), all = TRUE)
my.counts$importer.count[is.na(my.counts$importer.count)] <- 0
my.counts$exporter.count[is.na(my.counts$exporter.count)] <- 0
my.counts
# my.decade country importer.count exporter.count
# 1 2000 Australia 1 0
# 2 2000 Ecuador 1 0
# 3 2000 India 1 0
# 4 2000 Israel 1 1
# 5 2000 Peru 1 1
# 6 2000 United States 1 3
# 7 2000 United Kingdom 0 1
# 8 2010 Guatemala 1 1
# 9 2010 India 1 0
# 10 2010 Mexico 1 0
# 11 2010 Poland 1 0
# 12 2010 United States 0 2
# 13 2010 France 0 1
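The same stack-then-count idea can also be done with nothing but base R, by stacking the two country columns into long format and tabulating. A sketch on a subset of the question's rows:

```r
# Three of the question's rows, enough to show the shape of the result.
x <- data.frame(
  Importer = c("Ecuador", "United States", "India"),
  Exporter = c("United Kingdom", "United States", "United States"),
  Date     = as.Date(c("2004-01-13", "2009-05-04", "2014-02-11")))

x$Decade <- as.integer(format(x$Date, "%Y")) %/% 10 * 10

# Stack Importer and Exporter into one Country column with a Role label.
long <- data.frame(Decade  = rep(x$Decade, 2),
                   Role    = rep(c("Importer", "Exporter"), each = nrow(x)),
                   Country = c(x$Importer, x$Exporter))

counts <- as.data.frame(table(Decade  = long$Decade,
                              Country = long$Country,
                              Role    = long$Role))
# One row per Decade/Country, with Freq.Importer and Freq.Exporter columns.
wide <- reshape(counts, idvar = c("Decade", "Country"),
                timevar = "Role", direction = "wide")
```

Countries that never appear in a decade get explicit zero counts, because table() keeps all factor-level combinations.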
