Related
I have a large DF with certain columns that have a vector of character values as below. The number of columns varies from dataset to dataset as well as the number of character vectors it holds also varies.
ID Country1 Country2 Country3
1 1 Argentina, Japan,USA,Poland, Argentina,USA Pakistan
2 2 Colombia, Mexico,Uruguay,Dutch Mexico,Uruguay Afganisthan
3 3 Argentina, Japan,USA,NA Japan Khazagistan
4 4 Colombia, Mexico,Uruguay,Dutch Colombia, Dutch North Korea
5 5 India, China China Iran
Would like to match them one-to-one with another string vector as below
vals_to_find <-c("Argentina","USA","Mexico")
If, a column/row matches to anyone of the strings passed would like to retain that column and row. Remove duplicates, and finally remove those values that do not match.
the desired output is as follows
ID Countries.found
1 1 Argentina, USA
2 2 Mexico
3 3 Argentina, USA
4 4 Mexico
data
dput(df)
structure(list(ID = 1:5, Country1 = c("Argentina, Japan,USA,Poland,",
"Colombia, Mexico,Uruguay,Dutch", "Argentina, Japan,USA,NA",
"Colombia, Mexico,Uruguay,Dutch", "India, China"), Country2 = c("Argentina,USA",
"Mexico,Uruguay", "Japan", "Colombia, Dutch", "China"), Country3 = c("Pakistan",
"Afganisthan", "Khazagistan", "North Korea", "Iran")), class = "data.frame", row.names = c(NA,
-5L))
dput(df_out)
structure(list(ID = 1:4, Countries.found = c("Argentina, USA",
"Mexico", "Argentina, USA", "Mexico")), class = "data.frame", row.names = c(NA,
-4L))
Instead of a each column as a vector, if the file is read as one value per column. Then, was able do it as below
dput(df_out)
structure(list(ID = 1:5, X1 = c("Argentina", "Colombia", "Argentina",
"Colombia", "India"), X2 = c("Japan", "Mexico", "Japan", "Mexico",
"China"), X3 = c("USA", "Uruguay", "USA", "Uruguay", NA), X4 = c("Poland",
"Dutch", NA, "Dutch", NA), X5 = c("Argentina", "Mexico", "Japan",
"Colombia", "China"), X6 = c("USA", "Uruguay", NA, "Dutch", NA
), X7 = c("Pakistan", "Afganisthan", "Khazagistan", "North Korea",
"Iran")), class = "data.frame", row.names = c(NA, -5L))
df_out %>%
dplyr::select(
where(~ !all(is.na(.x)))
) %>%
dplyr::select(c(1, where(~ any(.x %in% vals_to_find)))) %>%
dplyr::mutate(dplyr::across(
tidyselect::starts_with("X"),
~ vals_to_find[match(., vals_to_find)]
)) %>%
tidyr::unite("countries_found", tidyselect::starts_with("X"),
sep = " | ", remove = TRUE, na.rm = TRUE
)
Output
ID countries_found
1 1 Argentina | USA | Argentina | USA
2 2 Mexico | Mexico
3 3 Argentina | USA
4 4 Mexico
unite the "Country" columns, then create a long vector by separating the values into rows, get all distinct values per ID, filter only those who are in vals_to_find, and summarise each countries.found toString.
library(tidyr)
library(dplyr)
df %>%
unite("Country", starts_with("Country"), sep = ",") %>%
separate_rows(Country) %>%
distinct(ID, Country) %>%
filter(Country %in% vals_to_find) %>%
group_by(ID) %>%
summarise(Countries.found = toString(Country))
output
# A tibble: 4 × 2
ID Countries.found
<int> <chr>
1 1 Argentina, USA
2 2 Mexico
3 3 Argentina, USA
4 4 Mexico
We may use
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(across(starts_with("Country"),
~ str_extract_all(.x, str_c(vals_to_find, collapse = "|")))) %>%
pivot_longer(cols = -ID, names_to = NULL,
values_to = 'Countries.found') %>%
unnest(Countries.found) %>%
distinct %>%
group_by(ID) %>%
summarise(Countries.found = toString(Countries.found))
-output
# A tibble: 4 × 2
ID Countries.found
<int> <chr>
1 1 Argentina, USA
2 2 Mexico
3 3 Argentina, USA
4 4 Mexico
I am trying to do hot encoding for a subset of df columns in R,
One hot encoding is a process by which categorical variables are converted into a form that could be provided to ML algorithms to do a better job in prediction by converting string columns to binary columns for each string in that column.
Supose we are having a df that looks like this:
mes work_location birth_place
01/01/2000 China Chile
01/02/2000 Mexico Japan
01/03/2000 China Chile
01/04/2000 China Argentina
01/05/2000 USA Poland
01/06/2000 Mexico Poland
01/07/2000 USA Finland
01/08/2000 USA Finland
01/09/2000 Japan Norway
01/10/2000 Japan Kenia
01/11/2000 Japan Mali
01/12/2000 India Mali
Here's the code to hot encode :
## function to hot-encode ##
columna_dummy <- function(df, columna) {
df %>%
mutate_at(columna, ~paste(columna, eval(as.symbol(columna)), sep = "_")) %>%
mutate(valor = 1) %>%
spread(key = columna, value = valor, fill = 0)
}
## selecting columns ##
columnas <- c("work_location", "birth_place")
## applying loop to repeat columna_dummy function for each df column ##
for(i in 1:length(columnas)){
new_dataset <- columna_dummy(df, i)
}
Console output:
Error: Problem with `mutate()` input `mes`.
x objeto '1' no encontrado
i Input `mes` is `(structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
Run `rlang::last_error()` to see where the error occurred.
Called from: signal_abort(cnd)
Column mes it's a date class column, however it is not included into columns atomic vector
and it still raises the above error,
Expected output should look somewhat like this for each string in selected string df column:
(I could not add every single column, but work_location_China it's an example of
how columns should look)
mes work_location birth_place work_location_China
01/01/2000 China Chile 1
01/02/2000 Mexico Japan 0
01/03/2000 China Chile 1
01/04/2000 China Argentina 1
01/05/2000 USA Poland 0
01/06/2000 Mexico Poland 0
01/07/2000 USA Finland 0
01/08/2000 USA Finland 0
01/09/2000 Japan Norway 0
01/10/2000 Japan Kenia 0
01/11/2000 Japan Mali 0
01/12/2000 India Mali 0
Is there any other way to apply this loop?
As we are passing strings, an option is to select the column (select can take both quoted/unquoted), create a column of 1s ('valor') and a row number column ('rn'), then do the reshaping from 'long' to 'wide' (pivot_wider)
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
columna_dummy <- function(df, columna) {
df %>%
select(columna) %>%
mutate(valor = 1, rn = row_number()) %>%
pivot_wider(names_from = all_of(columna),
values_from = valor, values_fill = 0) %>%
select(-rn)
}
-testing
For more than one column, an option is to loop over the column names of interest with map, apply the function and bind them with _dfc and bind with the original dataset (bind_cols)
out <- imap_dfc(setNames(c("work_location", "birth_place"),
c("work_location", "birth_place")) , ~ {
nm1 <- as.character(.y)
columna_dummy(df = df, columna = .x) %>%
rename_all(~ str_c(nm1, ., sep="_"))
}) %>%
bind_cols(df, .)
-output
head(out, 2)
# mes work_location birth_place work_location_China work_location_Mexico work_location_USA work_location_Japan
#1 01/01/2000 China Chile 1 0 0 0
#2 01/02/2000 Mexico Japan 0 1 0 0
# work_location_India birth_place_Chile birth_place_Japan birth_place_Argentina birth_place_Poland birth_place_Finland
#1 0 1 0 0 0 0
#2 0 0 1 0 0 0
# birth_place_Norway birth_place_Kenia birth_place_Mali
#1 0 0 0
#2 0 0 0
data
df <- structure(list(mes = c("01/01/2000", "01/02/2000", "01/03/2000",
"01/04/2000", "01/05/2000", "01/06/2000", "01/07/2000", "01/08/2000",
"01/09/2000", "01/10/2000", "01/11/2000", "01/12/2000"), work_location = c("China",
"Mexico", "China", "China", "USA", "Mexico", "USA", "USA", "Japan",
"Japan", "Japan", "India"), birth_place = c("Chile", "Japan",
"Chile", "Argentina", "Poland", "Poland", "Finland", "Finland",
"Norway", "Kenia", "Mali", "Mali")), class = "data.frame",
row.names = c(NA,
-12L))
By using purrr library I solved the issue:
## data ##
df <- structure(list(mes = c("01/01/2000", "01/02/2000", "01/03/2000",
"01/04/2000", "01/05/2000", "01/06/2000", "01/07/2000", "01/08/2000",
"01/09/2000", "01/10/2000", "01/11/2000", "01/12/2000"), work_location = c("China",
"Mexico", "China", "China", "USA", "Mexico", "USA", "USA", "Japan",
"Japan", "Japan", "India"), birth_place = c("Chile", "Japan",
"Chile", "Argentina", "Poland", "Poland", "Finland", "Finland",
"Norway", "Kenia", "Mali", "Mali")), class = "data.frame",
row.names = c(NA,
-12L))
## function to hot-encode ##
columna_dummy <- function(df, columna) {
df %>%
mutate_at(columna, ~paste(columna, eval(as.symbol(columna)), sep = "_")) %>%
mutate(valor = 1) %>%
spread(key = columna, value = valor, fill = 0)
}
## vector of columns ##
columnas <- c("work_location", "birth_place")
## hot_encoded_dataset ##
library(purrr)
hot_encoded_dataset <- purrr :: map(columnas , columna_dummy, df = df) %>%
reduce(inner_join)
The code below does what I want for a simple table. The mapping that takes place in the statement with on works perfectly. But I also have the situation with multiple countries that need to be assigned potentially to multiple regions and the result stored in the regions column is more challenging
library(data.table)
testDT <- data.table(country = c("Algeria", "Egypt", "United States", "Brazil"))
testDTcomplicated <- data.table(country = c("Algeria, Ghana, Sri Lanka", "Egypt", "United States, Argentina", "Brazil"))
regionLookup <- data.table(countrylookup = c("Algeria", "Argentina", "Egypt", "United States", "Brazil", "Ghana", "Sri Lanka"), regionVal = c("Africa", "South America", "Africa", "North America", "South America", "Africa", "Asia"))
testDT[regionLookup, region := regionVal, on = c(country = "countrylookup")]
> testDT
country region
1: Algeria Africa
2: Egypt Africa
3: United States North America
4: Brazil South America
I'd like to have testDTcomplicated look like the following
> testDT
country region
1: Algeria, Ghana, Sri Lanka Africa, Africa, Asia
2: Egypt Africa
3: United States, Argentina, Brazil North America, South America, South America
4: Brazil South America
You could split the data on comma and get each country in a separate row, join the data with regionLookup and collapse them again in one value in a comma-separated string.
library(data.table)
testDTcomplicated[, row := seq_len(.N)]
new <- splitstackshape::cSplit(testDTcomplicated, 'country', ',',
direction = 'long')[regionLookup, region := regionVal,
on = c(country = "countrylookup")]
new <- new[, lapply(.SD, toString), row][,row:=NULL]
new
# country region
#1: Algeria, Ghana, Sri Lanka Africa, Africa, Asia
#2: Egypt Africa
#3: United States, Argentina North America, South America
#4: Brazil South America
Same logic in dplyr can be implemented as :
library(dplyr)
testDTcomplicated %>%
mutate(row = row_number()) %>%
tidyr::separate_rows(country, sep = ", ") %>%
left_join(regionLookup, by = c("country" = "countrylookup")) %>%
group_by(row) %>%
summarise(across(.fns = toString))
I have a hard time to figure out why I can't map the right country on the map. I have gone through all of my codes, I still don't understand why is not working right.
If you see any problems, please let me know. I appreciate.
This is dataset
Country Total Code
1 China 34620 CHN
2 Japan 89 JPN
3 Singapore 40 SGP
4 Thailand 32 THA
5 Hong Kong 26 HKG
6 S. Korea 24 KOR
7 Taiwan 17 TWN
8 Malaysia 16 MYS
9 Australia 15 AUS
10 Germany 14 DEU
11 Vietnam 13 VNM
12 USA 12 USA
13 France 11 FRA
14 Macao 10 MAC
15 U.A.E. 7 ARE
16 Canada 7 CAN
17 Philippines 3 PHL
18 India 3 IND
19 Italy 3 ITA
20 U.K. 3 GBR
21 Russia 2 RUS
22 Finland 1 FIN
23 Sri Lanka 1 LKA
24 Sweden 1 SWE
25 Nepal 1 NPL
26 Cambodia 1 KHM
27 Spain 1 ESP
28 Belgium 1 BEL
library(leaflet)
library(maps)
library(maptools)
case <- read.csv("Cases_02072020_v1.csv",stringsAsFactors = FALSE)
Country = map("world", fill = TRUE, plot = FALSE, regions=iso.expand(case$Code,regex = TRUE))
IDs <- sapply(strsplit(Country$names, ":"), function(x) x[1])
Country <- map2SpatialPolygons(Country,
IDs=IDs,
proj4string=CRS("+proj=longlat +datum=WGS84"))
pal <- colorNumeric(
palette = "Blues",
domain = as.numeric(case$Total))
case$labels <- sprintf(
"<strong>Country:%s</strong><br/>Total:%g",
case$Country, case$Total)%>% lapply(htmltools::HTML)
leaflet(Country) %>% addTiles() %>%
addPolygons(fillOpacity = 0.6, smoothFactor = 0.5, stroke = TRUE, weight = 1,
color = pal(as.numeric(case$Total)),
label = case$labels)
You need to repeat each row of your case data frame so that the countries match each individual polygon on the map. This means ensuring you order them correctly and also you need to incorporate Macao and Hong Kong into China (or change the way you split IDs to handle them there).
Here is a full working version:
library(leaflet)
library(maps)
library(maptools)
case <- structure(list(Country = c("China", "Japan", "Singapore", "Thailand",
"Hong Kong", "S. Korea", "Taiwan", "Malaysia", "Australia", "Germany",
"Vietnam", "USA", "France", "Macao", "U.A.E.", "Canada", "Philippines",
"India", "Italy", "U.K.", "Russia", "Finland", "Sri Lanka", "Sweden",
"Nepal", "Cambodia", "Spain", "Belgium"), Total = c(34620, 89,
40, 32, 26, 24, 17, 16, 15, 14, 13, 12, 11, 10, 7, 7, 3, 3, 3,
3, 2, 1, 1, 1, 1, 1, 1, 1), Code = c("CHN", "JPN", "SGP", "THA",
"HKG", "KOR", "TWN", "MYS", "AUS", "DEU", "VNM", "USA", "FRA",
"MAC", "ARE", "CAN", "PHL", "IND", "ITA", "GBR", "RUS", "FIN",
"LKA", "SWE", "NPL", "KHM", "ESP", "BEL")), row.names = c(NA,
-28L), class = "data.frame")
case <- case[order(case$Country), ]
Country = map("world", fill = TRUE, plot = FALSE, regions = iso.expand(case$Code,regex = F))
IDs <- Country$names
Country <- map2SpatialPolygons(Country,
IDs=IDs,
proj4string=CRS("+proj=longlat +datum=WGS84"))
case[nrow(case) + 1, ] <- case[case$Code == "ESP",]
case$Country[nrow(case)] <- "Canary Islands"
case$Country[case$Country == "S. Korea"] <- "South Korea"
case$Country[case$Country == "U.K."] <- "UK"
case$Country[case$Country == "U.A.E."] <- "United Arab Emirates"
case$Total[case$Country == "China"] <- case$Total[case$Country == "China"] +
case$Total[case$Country == "Hong Kong"] +
case$Total[case$Country == "Macao"]
case <- case[-which(case$Country == "Hong Kong"), ]
case <- case[-which(case$Country == "Macao"), ]
case <- case[order(case$Country), ]
reps <- as.numeric(table(sapply(strsplit(IDs, ":"), function(x) x[1])))
case <- do.call(rbind, mapply(function(x, y){ x[rep(1,y),]},
split(case, case$Country),
reps,
SIMPLIFY = F))
pal <- colorNumeric(
palette = "Blues",
domain = as.numeric(case$Total))
case$labels <- sprintf(
"<strong>Country:%s</strong><br/>Total:%g",
case$Country, case$Total)%>% lapply(htmltools::HTML)
leaflet(Country) %>% addTiles() %>%
addPolygons(fillOpacity = 0.6, smoothFactor = 0.5, stroke = TRUE, weight = 1,
color = pal(as.numeric(case$Total)),
label = case$labels)
Obviously this is just a snapshot, but you can see China is coloured correctly.
Update
Thanks to #jazzurro for his anwer. It made me realize that the duplicates may just complicate things. I hope by keeping only unique values/row simplifies the task.*
df <- data.frame(ID = c(1,2,3,4,5),
CTR1 = c("England", "England", "England", "China", "Sweden"),
CTR2 = c("England", "China", "China", "England", NA),
CTR3 = c("USA", "USA", "USA", "USA", NA),
CTR4 = c(NA, NA, NA, NA, NA),
CTR5 = c(NA, NA, NA, NA, NA),
CTR6 = c(NA, NA, NA, NA, NA))
ID CTR1 CTR2 CTR3 CTR4 CTR5 CTR6
1 England China USA
2 England China USA
3 England China USA
4 China England USA
5 Sweden
It is still the goal to create a co-occurrence matrix (now) based on the following four conditions:
Single observations without additional observations by ID/row are not considered, i.e. a row with only a single country once is counted as 0.
A combination/co-occurrence should be counted as 1.
Being in a combination results in counting as a self-combination as well (USA-USA), i.e. a value of 1 is assigned.
There is no value over 1 assigned to a combination by row/ID.
Aspired Result
China England USA Sweden
China 4 4 4 0
England 4 4 4 0
USA 4 4 4 0
Sweden 0 0 0 0
*I've used the code from here to remove all non-unique observations.
Original Post
Assume I have a data set with a low two digit number of columns (some NA/empty) and more than 100.000 rows, represented by the following example dataframe
df <- data.frame(ID = c(1,2,3,4,5),
CTR1 = c("England", "England", "England", "China", "England"),
CTR2 = c("England", "China", "China", "England", NA),
CTR3 = c("England", "China", "China", "England", NA),
CTR4 = c("China", "USA", "USA", "China", NA),
CTR5 = c("USA", "England", "USA", "USA", NA),
CTR6 = c("England", "China", "USA", "England", NA))
df
ID CTR1 CTR2 CTR3 CTR4 CTR5 CTR6
1 England England England China USA England
2 England China China USA England China
3 England China China USA USA USA
4 China England England China USA England
5 England
and I want to count the co-occurrences by ID/row to get a co-occurrence matrix that sums up the co-occurence by ID/row only once, meaning that no value over 1 will be allocated to a combination (i.e. assign a value of 1 for the existence of a co-occurrence independent of in-row frequencies and order, assign a value of 0 for no co-occurrence/combination by ID/row),
1 England-England-England => 1
2 England-England => 1
3 England-China => 1
4 England- => 0
Another important aspects regards the counting of observations that appear once in a row but in combination with others, e.g. USA in row 1. They should get a value of 1 for their own co-occurrence (as they are in a combination even though not with themselves) so that the combination USA-USA also gets a value of 1 assigned.
1 England England England China USA England
USA-USA => 1
China-China => 1
USA-China => 1
England-England => 1
England-USA => 1
England-China => 1
Due to the fact that row count should not >1 for a combination by row/ID, this results to:
China England USA
China 1 1 1
England 1 1 1
USA 1 1 1
This should lead to the following result based on the example dataframe, where a value of 4 is assigned to each combination based on the fact that each combination has occured at least in four rows and each string is part of a combination of the original dataframe:
China England USA
China 4 4 4
England 4 4 4
USA 4 4 4
So there are five conditions for counting:
Single observations without additional observations by ID/row are not considered, i.e. a row with only a single country once is not counted.
A combination should be counted as 1.
Observations occuring more than once do not contribute to a higher value for the interaction, i.e. several occurrences of the same country do not matter.
Being in a combination (even in the case the same country does not appear twice in a row) results in counting as a self-combination, i.e. a value of 1 is assigned.
There is no value over 1 assigned to a combination by row/ID.
I've tried to implement this by using dplyr, data.table, base aggregate or plyr adjusting code from [1], [2], [3], [4], [5] and [6] but as I don't care about order within a row but I also don't want to sum up all combinations within a row, I haven't got the aspired result so far.
I'm a novice in R. Any help is very much appreciated.
DATA
I modified your data so that data can represent your actual situation.
# ID CTR1 CTR2 CTR3 CTR4 CTR5 CTR6
#1: 1 England England England China USA England
#2: 2 England China China USA England China
#3: 3 England China China USA USA USA
#4: 4 China England England China USA England
#5: 5 Sweden <NA> <NA> <NA> <NA>
df <- structure(list(ID = c(1, 2, 3, 4, 5), CTR1 = c("England", "England",
"England", "China", "Sweden"), CTR2 = c("England", "China", "China",
"England", NA), CTR3 = c("England", "China", "China", "England",
NA), CTR4 = c("China", "USA", "USA", "China", NA), CTR5 = c("USA",
"England", "USA", "USA", ""), CTR6 = c("England", "China", "USA",
"England", NA)), class = c("data.table", "data.frame"), row.names = c(NA,
-5L))
UPDATE
After seeing the OP's previous question, I got a clear picture in my mind. I think this is what you want, Seb.
# Transform the data to long-format data. Remove rows that have zero character (i.e, "") or NA.
melt(setDT(df), id.vars = "ID", measure = patterns("^CTR"))[nchar(value) > 0 & complete.cases(value)] -> foo
# Get distinct value (country) in each ID group (each row)
unique(foo, by = c("ID", "value")) -> foo2
# https://stackoverflow.com/questions/13281303/creating-co-occurrence-matrix
# Seeing this question, you want to create a matrix with crossprod().
crossprod(table(foo2[, c(1,3)])) -> mymat
# Finally, you need to change diagonal values. If a value is equal to one,
# change it to zero. Otherwise, keep the original value.
diag(mymat) <- ifelse(diag(mymat) <= 1, 0, mymat)
#value
#value China England Sweden USA
#China 4 4 0 4
#England 4 4 0 4
#Sweden 0 0 0 0
#USA 4 4 0 4
Here is an option using base::table:
#get paired combi and remove those from same country
pairsDF <- as.data.frame(do.call(rbind,
by(df, df$ID, function(x) t(combn(unlist(x[-1L]), 2L)))))
#tabulate pairs
duppairs <- rbind(pairsDF, data.frame(V1=pairsDF$V2, V2=pairsDF$V1))
tab <- table(duppairs, useNA="no")
#set diagonals to be the count of countries if count is at least 2
cnt <- c(table(unlist(df[-1L])))
cnt[cnt==1L] <- 0L
diag(tab) <- cnt[names(diag(tab))]
output:
V2
V1 China England Sweden USA
China 4 4 0 4
England 4 4 0 4
Sweden 0 0 0 0
USA 4 4 0 4
data:
df <- data.frame(ID = c(1,2,3,4,5),
CTR1 = c("England", "England", "England", "China", "Sweden"),
CTR2 = c("China", "China", "China", "England", NA),
CTR3 = c("USA", "USA", "USA", "USA", NA),
CTR4 = c(NA, NA, NA, NA, NA),
CTR5 = c(NA, NA, NA, NA, NA),
CTR6 = c(NA, NA, NA, NA, NA))