Find frequencies of all factor combinations of all column combinations - r

I have a data frame with n variables whose values are all factors. I would like to select m columns from this data frame (m < n) and find the frequencies of all factor combinations for every possible selection of m columns.
I have searched, but I only found how to get frequencies of factor combinations when specific columns are chosen. In my case, there could be many combinations of columns since m < n.
Here is our data; all variables have factor values.
company <- data.frame("country" = c("USA", "China", 'France', "Germany"),
"category" = c("C-corp", "S-corp", "C-corp", "LLC"),
"Type" = c("Public", "Private", "Private", "Private"),
"Profit" = c("High", "High", "High", "Low"))
Now I want to select 2 columns (m = 2) and find the frequency of every factor combination for every possible pair of columns.
In this case, I can have "country = USA & category = S-corp", "country = USA & category = C-corp", "country = China & category = LLC". But I could also select other columns and have "country = USA & Profit = Low", "country = China & Type = Public". I want to know the frequencies of all of these combinations.
Edit: My expected output is something like
country = USA, category = C-corp freq 1
country = USA, category = S-corp freq 0
country = USA, category = LLC freq 0
country = China, category = LLC freq 0
country = France, category = C-corp freq 1
country = USA, Type = Public freq 1
country = China, Type = Public freq 0
Type = Private, Profit = High freq 2
Type = Public, category = LLC freq 0
Type = Private, Profit = Low freq 1
If I need to select 2 columns, I need all the possible column combinations; order doesn't matter.

The combinations part sounds like expand.grid():
expand.grid(company[, 1:2])
country category
1 USA C-corp
2 China C-corp
3 France C-corp
4 Germany C-corp
5 USA S-corp
6 China S-corp
7 France S-corp
8 Germany S-corp
9 USA C-corp
10 China C-corp
11 France C-corp
12 Germany C-corp
13 USA LLC
14 China LLC
15 France LLC
16 Germany LLC
# or if you want 4 columns with all countries, do a cross join:
merge(company[, 1, drop = FALSE], company[, -1], by = NULL)
# or if you want 4 columns with all possible results, do expand.grid without subsetting:
expand.grid(company)
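Note that expand.grid() combines the values it is given, so a value that occurs twice in a column (C-corp in category) produces repeated rows, as in rows 9 to 12 of the output above. A minimal sketch, assuming only distinct combinations are wanted, is to pass the unique values of each column; combos is a name chosen here for illustration:

```r
company <- data.frame(
  country  = c("USA", "China", "France", "Germany"),
  category = c("C-corp", "S-corp", "C-corp", "LLC"),
  Type     = c("Public", "Private", "Private", "Private"),
  Profit   = c("High", "High", "High", "Low"),
  stringsAsFactors = TRUE
)

# lapply() over the selected columns returns a named list of distinct values,
# which expand.grid() accepts as a single list argument
combos <- expand.grid(lapply(company[, c("country", "category")], unique))

nrow(combos)  # 4 countries x 3 categories
```

This gives every distinct country/category pair exactly once, whether or not it occurs in the data.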
The second part sounds like table(). You can perform it directly on the company data.frame:
table(company)
, , Type = Private, Profit = High
         category
country   C-corp LLC S-corp
  China        0   0      1
  France       1   0      0
  Germany      0   0      0
  USA          0   0      0
, , Type = Public, Profit = High
         category
country   C-corp LLC S-corp
  China        0   0      0
  France       0   0      0
  Germany      0   0      0
  USA          1   0      0
, , Type = Private, Profit = Low
         category
country   C-corp LLC S-corp
  China        0   0      0
  France       0   0      0
  Germany      0   1      0
  USA          0   0      0
, , Type = Public, Profit = Low
         category
country   C-corp LLC S-corp
  China        0   0      0
  France       0   0      0
  Germany      0   0      0
  USA          0   0      0

You can do this using a nested loop of the table function:
for (j in seq_len(ncol(company))) {
  for (i in seq_len(ncol(company))) {
    # cross-tabulate column j against column i
    print(table(company[[j]], company[[i]]))
  }
}
It's ugly and has a lot of duplicates, but it's quick and easy for your purposes.
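To avoid the duplicates and the self-pairings, the column pairs can be generated with combn() instead of a double loop. The following is a minimal sketch rather than the original answerer's code: m, col_sets and freq_list are names chosen here for illustration, and stringsAsFactors = TRUE is added so the columns are factors in R 4.0 and later.

```r
company <- data.frame(
  country  = c("USA", "China", "France", "Germany"),
  category = c("C-corp", "S-corp", "C-corp", "LLC"),
  Type     = c("Public", "Private", "Private", "Private"),
  Profit   = c("High", "High", "High", "Low"),
  stringsAsFactors = TRUE
)

m <- 2  # number of columns to select at a time

# every unordered selection of m column names
col_sets <- combn(names(company), m, simplify = FALSE)

# one long-format frequency table per selection, zero counts included
freq_list <- lapply(col_sets, function(cols) {
  as.data.frame(table(company[cols]), responseName = "freq")
})
names(freq_list) <- vapply(col_sets, paste, character(1), collapse = " x ")

freq_list[["Type x Profit"]]
```

Each element of freq_list has one row per combination of factor levels, so combinations that never occur appear with freq 0, as in the expected output. Changing m gives the tables for triples or larger selections.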

Related

How do you match a numeric value to a categorical value in another data set

I have two data sets. One with a numeric value assigned to individual categorical variables (country name) and a second with survey responses including a person's nationality. How do I assign the numeric value to a new column in the survey dataset with matching nationality/country name?
Here is the head of data set 1 (my.data1):
           EN   HCI
1 South Korea 0.845
2          UK 0.781
3         USA 0.762
Here is the head of data set 2 (my.data2):
  Nationality OIS IR
1 South Korea   2  2
2 South Korea   3  3
3         USA   3  4
4          UK   3  3
I would like to make it look like this:
  Nationality OIS IR   HCI
1 South Korea   2  2 0.845
2 South Korea   3  3 0.845
3         USA   3  4 0.762
4          UK   3  3 0.781
I have tried this but unsuccessfully:
my.data2$HCI <- NA
for (i in i:nrow(my.data2)) {
my.data2$HCI[i] <- my.data1$HCI[my.data1$EN == my.data2$Nationality[i]]
}
We can use a left_join
library(dplyr)
left_join(my.data2, my.data1, by = c("Nationality" = "EN"))
Or with merge from base R
merge(my.data2, my.data1, by.x = "Nationality", by.y = "EN", all.x = TRUE)
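Because merge() sorts the result by the key column by default, the rows may come back in a different order than in the survey data. If the original row order matters, a lookup with match() is another option. This is a minimal sketch that rebuilds the two example data sets from the question:

```r
my.data1 <- data.frame(
  EN  = c("South Korea", "UK", "USA"),
  HCI = c(0.845, 0.781, 0.762)
)
my.data2 <- data.frame(
  Nationality = c("South Korea", "South Korea", "USA", "UK"),
  OIS = c(2, 3, 3, 3),
  IR  = c(2, 3, 4, 3)
)

# match() returns, for each Nationality, the row position of that country in my.data1;
# nationalities without a match get NA
my.data2$HCI <- my.data1$HCI[match(my.data2$Nationality, my.data1$EN)]

my.data2
```

The survey rows stay in place and only the new HCI column is added.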

If/Else statement in R

I have two dataframes in R:
city price bedroom
San Jose 2000 1
Barstow 1000 1
NA 1500 1
Code to recreate:
data = data.frame(city = c('San Jose', 'Barstow'), price = c(2000,1000, 1500), bedroom = c(1,1,1))
and:
Name Density
San Jose 5358
Barstow 547
Code to recreate:
population_density = data.frame(Name=c('San Jose', 'Barstow'), Density=c(5358, 547));
I want to create an additional column named city_type in the data dataset based on a condition: if the city's population density is above 1000 it is Urban, if it is below 1000 it is Suburb, and NA stays NA.
city price bedroom city_type
San Jose 2000 1 Urban
Barstow 1000 1 Suburb
NA 1500 1 NA
I am using a for loop for conditional flow:
for (row in 1:length(data)) {
if (is.na(data[row,'city'])) {
data[row, 'city_type'] = NA
} else if (population[population$Name == data[row,'city'],]$Density>=1000) {
data[row, 'city_type'] = 'Urban'
} else {
data[row, 'city_type'] = 'Suburb'
}
}
The for loop runs with no error in my original dataset with over 20000 observations; however, it yields a lot of wrong results (it yields NA for the most part).
What has gone wrong here and how can I do better to achieve my desired result?
I have become quite a fan of dplyr pipelines for this type of join/filter/mutate workflow. So here is my suggestion:
library(dplyr)
# Note: city needs an explicit NA so that all three columns have three values
data <- data.frame(city = c('San Jose', 'Barstow', NA), price = c(2000,1000, 500), bedroom = c(1,1,1))
population <- data.frame(Name=c('San Jose', 'Barstow'), Density=c(5358, 547));
data %>%
# join the two dataframes by matching up the city name columns
left_join(population, by = c("city" = "Name")) %>%
# add your new column based on the desired condition
mutate(
city_type = ifelse(Density >= 1000, "Urban", "Suburb")
)
Output:
city price bedroom Density city_type
1 San Jose 2000 1 5358 Urban
2 Barstow 1000 1 547 Suburb
3 <NA> 500 1 NA <NA>
Use ifelse to create city_type in population_density, then use match to look it up for each city:
population_density$city_type=ifelse(population_density$Density>1000,'Urban','Suburb')
data$city_type=population_density$city_type[match(data$city,population_density$Name)]
data
city price bedroom city_type
1 San Jose 2000 1 Urban
2 Barstow 1000 1 Suburb
3 <NA> 1500 1 <NA>
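As for what went wrong in the loop from the question: length(data) on a data frame returns the number of columns, not the number of rows, so 1:length(data) only visits the first few rows and every other row of city_type stays NA. The loop also refers to population, while the data frame created in the question is named population_density. A corrected loop might look like the sketch below; density is a helper name introduced here, and the vectorised answers above remain the better choice for 20000 rows.

```r
data <- data.frame(
  city    = c("San Jose", "Barstow", NA),
  price   = c(2000, 1000, 1500),
  bedroom = c(1, 1, 1)
)
population_density <- data.frame(
  Name    = c("San Jose", "Barstow"),
  Density = c(5358, 547)
)

data$city_type <- NA_character_

# seq_len(nrow(data)) iterates over rows; length(data) would count columns
for (row in seq_len(nrow(data))) {
  # NA when the city is missing or not present in population_density
  density <- population_density$Density[match(data$city[row], population_density$Name)]
  if (!is.na(density)) {
    data$city_type[row] <- if (density >= 1000) "Urban" else "Suburb"
  }
}

data
```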

Convert Panel Data to Long in R

My current data covers missiles between 1920 and 2018. The goal is to measure a nation’s ability to deploy missiles of different kinds for each year from 1920 to 2018. The problem is that the data has multiple observations per nation, and often per year. For instance, if a nation adopted an imported Air to Air missile in 1970 and then, in 1980, developed a domestically produced one that is both Air to Air and Air to Ground, that change needs to be reflected. The goal is to have a unique row/observation for each year for every nation. It is also assumed that once a nation can produce a capability (Air to Air in 1970, for instance), it retains it until 2018.
Current:
YearAcquired CountryCode CountryName Domestic AirtoAir
2014 670 Saudi Arabia 0 1
2017 670 Saudi Arabia 1 1
2016 2 United States 1 1
Desired:
YearAcquired CountryCode CountryName Domestic AirtoAir
2014 670 Saudi Arabia 0 1
2015 670 Saudi Arabia 0 1
2016 670 Saudi Arabia 0 1
2017 670 Saudi Arabia 1 1
2018 670 Saudi Arabia 1 1
2016 2 United States 1 1
2017 2 United States 1 1
2018 2 United States 1 1
Note: There are many entries, so I would like it to generate rows from 1920 to 2018 for every country, even if they will have straight zeroes. That is not necessary, but it would be a great bonus!
You can do this via several steps:
1. Create the combination of all years and countries (a CROSS JOIN in SQL)
2. LEFT JOIN these combinations with the available data
3. Use a function like zoo::na.locf() to replace NA values by the last known ones per country.
The first step is common:
df <- read.table(text = 'YearAcquired CountryCode CountryName Domestic AirtoAir
2014 670 "Saudi Arabia" 0 1
2017 670 "Saudi Arabia" 1 1
2016 2 "United States" 1 1', header = TRUE, stringsAsFactors = FALSE)
combinations <- merge(data.frame(YearAcquired = seq(1920, 2018, 1)),
unique(df[,2:3]), by = NULL)
For steps 2 and 3, here is a solution using dplyr:
library(dplyr)
library(zoo)
df <- left_join(combinations, df) %>%
group_by(CountryCode) %>%
mutate(Domestic = na.locf(Domestic, na.rm = FALSE),
AirtoAir = na.locf(AirtoAir, na.rm = FALSE))
And one solution using data.table:
library(data.table)
library(zoo)
setDT(df)
setDT(combinations)
df <- df[combinations, on = c("YearAcquired", "CountryCode", "CountryName")]
df <- df[, na.locf(.SD, na.rm = FALSE), by = "CountryCode"]
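The same three steps (cross join, left join, carry the last value forward per country) can also be sketched in base R without zoo. This is an illustration under stated assumptions, not part of the original answer: locf, grid and panel are names introduced here, the year range is shortened to 2014 to 2018 to keep the result small, and years before a country's first record are set to 0.

```r
df <- data.frame(
  YearAcquired = c(2014, 2017, 2016),
  CountryCode  = c(670, 670, 2),
  CountryName  = c("Saudi Arabia", "Saudi Arabia", "United States"),
  Domestic     = c(0, 1, 1),
  AirtoAir     = c(1, 1, 1)
)

# carry the last non-NA value forward; leading NAs stay NA
locf <- function(x) {
  idx <- cumsum(!is.na(x))        # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[idx + 1]    # position 1 holds NA for "none seen yet"
}

# step 1: every year paired with every country (cross join)
grid <- merge(data.frame(YearAcquired = seq(2014, 2018, by = 1)),
              unique(df[, c("CountryCode", "CountryName")]), by = NULL)

# step 2: left join the observed data onto the grid
panel <- merge(grid, df, all.x = TRUE)
panel <- panel[order(panel$CountryCode, panel$YearAcquired), ]

# step 3: fill forward within each country, then zero-fill the years before the first record
for (v in c("Domestic", "AirtoAir")) {
  filled <- ave(panel[[v]], panel$CountryCode, FUN = locf)
  filled[is.na(filled)] <- 0
  panel[[v]] <- filled
}

panel
```

Replacing seq(2014, 2018, by = 1) with seq(1920, 2018, by = 1) gives the full range asked for in the question.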
You could create a new dataframe using the country names and codes available and perform a left join with your existing data. This would give you 1920 to 2018 for each country and code, leaving NA's in where you don't have data available but you could easily replace them given how you want your data structured.
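# df is your initial dataframe
library(dplyr)
# one row per distinct country
countries <- unique(df[, c("CountryName", "CountryCode")])
# pair every year with every country (cross join)
new_df <- merge(data.frame(YearAcquired = seq(1920, 2018, 1)), countries, by = NULL)
new_df <- left_join(new_df, df, by = c("YearAcquired", "CountryName", "CountryCode"))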
Using tidyverse (dplyr and tidyr)...
If you only need to fill in internal years per country...
df <- read.table(header = TRUE, as.is = TRUE, text = "
YearAcquired countrycode CountryName Domestic AirtoAir
2014 670 'Saudi Arabia' 0 1
2017 670 'Saudi Arabia' 1 1
2016 2 'United States' 1 1
")
library(dplyr)
library(tidyr)
df %>%
group_by(countrycode) %>%
# countrycode is the grouping column, so it is not listed again here (recent tidyr versions reject that)
complete(YearAcquired = full_seq(YearAcquired, 1), CountryName) %>%
arrange(countrycode, YearAcquired) %>%
fill(Domestic, AirtoAir)
#> # A tibble: 5 x 5
#> # Groups: countrycode [2]
#> YearAcquired countrycode CountryName Domestic AirtoAir
#> <dbl> <int> <chr> <int> <int>
#> 1 2016 2 United States 1 1
#> 2 2014 670 Saudi Arabia 0 1
#> 3 2015 670 Saudi Arabia 0 1
#> 4 2016 670 Saudi Arabia 0 1
#> 5 2017 670 Saudi Arabia 1 1
If you want to expand each country to all years found in the dataset...
df <- read.table(header = TRUE, as.is = TRUE, text = "
YearAcquired countrycode CountryName Domestic AirtoAir
2014 670 'Saudi Arabia' 0 1
2017 670 'Saudi Arabia' 1 1
2016 2 'United States' 1 1
")
library(dplyr)
library(tidyr)
df %>%
complete(YearAcquired = full_seq(YearAcquired, 1),
nesting(countrycode, CountryName)) %>%
group_by(countrycode) %>%
arrange(countrycode, YearAcquired) %>%
fill(Domestic, AirtoAir) %>%
mutate(across(c(Domestic, AirtoAir), ~ coalesce(.x, 0L)))
#> # A tibble: 8 x 5
#> # Groups: countrycode [2]
#> YearAcquired countrycode CountryName Domestic AirtoAir
#> <dbl> <int> <chr> <int> <int>
#> 1 2014 2 United States 0 0
#> 2 2015 2 United States 0 0
#> 3 2016 2 United States 1 1
#> 4 2017 2 United States 1 1
#> 5 2014 670 Saudi Arabia 0 1
#> 6 2015 670 Saudi Arabia 0 1
#> 7 2016 670 Saudi Arabia 0 1
#> 8 2017 670 Saudi Arabia 1 1

Create count per item by year/decade

I have data in a data.table that is as follows:
> x<-df[sample(nrow(df), 10),]
> x
> Importer Exporter Date
1: Ecuador United Kingdom 2004-01-13
2: Mexico United States 2013-11-19
3: Australia United States 2006-08-11
4: United States United States 2009-05-04
5: India United States 2007-07-16
6: Guatemala Guatemala 2014-07-02
7: Israel Israel 2000-02-22
8: India United States 2014-02-11
9: Peru Peru 2007-03-26
10: Poland France 2014-09-15
I am trying to create summaries so that, given a time period (say a decade), I can find the number of times each country appears as Importer and as Exporter. In the above example, the desired output when dividing up by decade should be something like:
Decade Country.Name Importer.Count Exporter.Count
2000 Ecuador 1 0
2000 Mexico 1 1
2000 Australia 1 0
2000 United States 1 3
.
.
.
2010 United States 0 2
.
.
.
So far, I have tried with aggregate and data.table methods as suggested by the post here, but both of them seem to just give me counts of the number Importers/Exporters per year (or decade as I am more interested in that).
> x$Decade<-year(x$Date)-year(x$Date)%%10
> importer_per_yr<-aggregate(Importer ~ Decade, FUN=length, data=x)
> importer_per_yr
Decade Importer
2 2000 6
3 2010 4
Considering that aggregate uses the formula interface, I tried adding another criterion, but got the following error:
> importer_per_yr<-aggregate(Importer~ Decade + unique(Importer), FUN=length, data=x)
Error in model.frame.default(formula = Importer ~ Decade + :
variable lengths differ (found for 'unique(Importer)')
Is there a way to create the summary according to the decade and the importer/ exporter? It does not matter if the summary for importer and exporter are in different tables.
We can do this using data.table methods: create the 'Decade' column by assignment (:=), melt the data from 'wide' to 'long' format by specifying the measure columns, then reshape it back to 'wide' with dcast, using length as the fun.aggregate.
x[, Decade:= year(Date) - year(Date) %%10]
dcast(melt(x, measure = c("Importer", "Exporter"), value.name = "Country"),
Decade + Country~variable, length)
# Decade Country Importer Exporter
# 1: 2000 Australia 1 0
# 2: 2000 Ecuador 1 0
# 3: 2000 India 1 0
# 4: 2000 Israel 1 1
# 5: 2000 Peru 1 1
# 6: 2000 United Kingdom 0 1
# 7: 2000 United States 1 3
# 8: 2010 France 0 1
# 9: 2010 Guatemala 1 1
#10: 2010 India 1 0
#11: 2010 Mexico 1 0
#12: 2010 Poland 1 0
#13: 2010 United States 0 2
I think this will work using with() and aggregate() in base R:
my.data <- read.csv(text = '
Importer, Exporter, Date
Ecuador, United Kingdom, 2004-01-13
Mexico, United States, 2013-11-19
Australia, United States, 2006-08-11
United States, United States, 2009-05-04
India, United States, 2007-07-16
Guatemala, Guatemala, 2014-07-02
Israel, Israel, 2000-02-22
India, United States, 2014-02-11
Peru, Peru, 2007-03-26
Poland, France, 2014-09-15
', header = TRUE, stringsAsFactors = TRUE, strip.white = TRUE)
my.data$my.Date <- as.Date(my.data$Date, format = "%Y-%m-%d")
my.data <- data.frame(my.data,
year = as.numeric(format(my.data$my.Date, format = "%Y")),
month = as.numeric(format(my.data$my.Date, format = "%m")),
day = as.numeric(format(my.data$my.Date, format = "%d")))
my.data$my.decade <- my.data$year - (my.data$year %% 10)
importer.count <- with(my.data, aggregate(cbind(count = Importer) ~ my.decade + Importer, FUN = function(x) { NROW(x) }))
exporter.count <- with(my.data, aggregate(cbind(count = Exporter) ~ my.decade + Exporter, FUN = function(x) { NROW(x) }))
colnames(importer.count) <- c('my.decade', 'country', 'importer.count')
colnames(exporter.count) <- c('my.decade', 'country', 'exporter.count')
my.counts <- merge(importer.count, exporter.count, by = c('my.decade', 'country'), all = TRUE)
my.counts$importer.count[is.na(my.counts$importer.count)] <- 0
my.counts$exporter.count[is.na(my.counts$exporter.count)] <- 0
my.counts
# my.decade country importer.count exporter.count
# 1 2000 Australia 1 0
# 2 2000 Ecuador 1 0
# 3 2000 India 1 0
# 4 2000 Israel 1 1
# 5 2000 Peru 1 1
# 6 2000 United States 1 3
# 7 2000 United Kingdom 0 1
# 8 2010 Guatemala 1 1
# 9 2010 India 1 0
# 10 2010 Mexico 1 0
# 11 2010 Poland 1 0
# 12 2010 United States 0 2
# 13 2010 France 0 1
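A shorter base R variant of the same idea is to stack importers and exporters into one long data frame with indicator columns and sum them per decade and country. This is a sketch under the assumption that the ten sample rows from the question are the whole data set; trade, long and counts are names chosen here for illustration.

```r
trade <- data.frame(
  Importer = c("Ecuador", "Mexico", "Australia", "United States", "India",
               "Guatemala", "Israel", "India", "Peru", "Poland"),
  Exporter = c("United Kingdom", "United States", "United States", "United States",
               "United States", "Guatemala", "Israel", "United States", "Peru", "France"),
  Date = as.Date(c("2004-01-13", "2013-11-19", "2006-08-11", "2009-05-04", "2007-07-16",
                   "2014-07-02", "2000-02-22", "2014-02-11", "2007-03-26", "2014-09-15"))
)

# decade = year rounded down to a multiple of 10
yr <- as.integer(format(trade$Date, "%Y"))
trade$Decade <- yr - yr %% 10

# one row per appearance of a country, flagged by role
long <- rbind(
  data.frame(Decade = trade$Decade, Country = trade$Importer,
             Importer.Count = 1, Exporter.Count = 0),
  data.frame(Decade = trade$Decade, Country = trade$Exporter,
             Importer.Count = 0, Exporter.Count = 1)
)

# sum the flags per decade and country
counts <- aggregate(cbind(Importer.Count, Exporter.Count) ~ Decade + Country,
                    data = long, FUN = sum)
counts <- counts[order(counts$Decade, counts$Country), ]

counts
```

Because both roles are stacked before aggregating, no merge or NA replacement is needed afterwards.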

Computing frequency of membership in R's data.frame

I have the following data.frame:
authors <- data.frame(
surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
nationality = c("US", "Australia", "US", "UK", "Australia"),
deceased = c("yes", rep("no", 3),"noinfo"))
which produces this output:
surname nationality deceased
1 Tukey US yes
2 Venables Australia no
3 Tierney US no
4 Ripley UK no
5 McNeil Australia noinfo
What I want to do is to get the frequency of deceased by nationality.
Yielding this output:
US yes 1
US no 1
US noinfo 0
Australia yes 0
Australia no 1
Australia noinfo 1
UK yes 0
UK no 1
UK noinfo 0
At the moment I can only display the statistics through tables.
stat <- table(authors)
I'm not sure how to proceed from here and access the elements of the table.
Advice would be appreciated.
You need to table on the things you want the occurrence for...
table( authors[ c("nationality" , "deceased" ) ] )
#            deceased
#nationality no noinfo yes
#  Australia  1      1   0
#  UK         1      0   0
#  US         1      0   1
And to get the exact output you want... turn it into a data.frame....
data.frame( table( authors[ c("nationality" , "deceased" ) ] ) )
# nationality deceased Freq
#1 Australia no 1
#2 UK no 1
#3 US no 1
#4 Australia noinfo 1
#5 UK noinfo 0
#6 US noinfo 0
#7 Australia yes 0
#8 UK yes 0
#9 US yes 1
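On the question of accessing individual elements: a two-way table can be indexed by level name like a matrix. A small sketch using the authors data from the question (tab is a name chosen here for illustration):

```r
authors <- data.frame(
  surname     = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
  nationality = c("US", "Australia", "US", "UK", "Australia"),
  deceased    = c("yes", rep("no", 3), "noinfo")
)

tab <- table(authors[c("nationality", "deceased")])

tab["US", "yes"]      # a single count, by row and column level
tab["Australia", ]    # all deceased counts for one nationality
tab[, "no"]           # one deceased level across all nationalities
```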
