I am trying to do hot encoding for a subset of df columns in R,
One hot encoding is a process by which categorical variables are converted into a form that could be provided to ML algorithms to do a better job in prediction by converting string columns to binary columns for each string in that column.
Supose we are having a df that looks like this:
mes work_location birth_place
01/01/2000 China Chile
01/02/2000 Mexico Japan
01/03/2000 China Chile
01/04/2000 China Argentina
01/05/2000 USA Poland
01/06/2000 Mexico Poland
01/07/2000 USA Finland
01/08/2000 USA Finland
01/09/2000 Japan Norway
01/10/2000 Japan Kenia
01/11/2000 Japan Mali
01/12/2000 India Mali
Here's the code to hot encode :
## function to hot-encode ##
columna_dummy <- function(df, columna) {
df %>%
mutate_at(columna, ~paste(columna, eval(as.symbol(columna)), sep = "_")) %>%
mutate(valor = 1) %>%
spread(key = columna, value = valor, fill = 0)
}
## selecting columns ##
columnas <- c("work_location", "birth_place")
## applying loop to repeat columna_dummy function for each df column ##
for(i in 1:length(columnas)){
new_dataset <- columna_dummy(df, i)
}
Console output:
Error: Problem with `mutate()` input `mes`.
x objeto '1' no encontrado
i Input `mes` is `(structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
Run `rlang::last_error()` to see where the error occurred.
Called from: signal_abort(cnd)
Column mes it's a date class column, however it is not included into columns atomic vector
and it still raises the above error,
Expected output should look somewhat like this for each string in selected string df column:
(I could not add every single column, but work_location_China it's an example of
how columns should look)
mes work_location birth_place work_location_China
01/01/2000 China Chile 1
01/02/2000 Mexico Japan 0
01/03/2000 China Chile 1
01/04/2000 China Argentina 1
01/05/2000 USA Poland 0
01/06/2000 Mexico Poland 0
01/07/2000 USA Finland 0
01/08/2000 USA Finland 0
01/09/2000 Japan Norway 0
01/10/2000 Japan Kenia 0
01/11/2000 Japan Mali 0
01/12/2000 India Mali 0
Is there any other way to apply this loop?
As we are passing strings, an option is to select the column (select can take both quoted/unquoted), create a column of 1s ('valor') and a row number column ('rn'), then do the reshaping from 'long' to 'wide' (pivot_wider)
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
columna_dummy <- function(df, columna) {
df %>%
select(columna) %>%
mutate(valor = 1, rn = row_number()) %>%
pivot_wider(names_from = all_of(columna),
values_from = valor, values_fill = 0) %>%
select(-rn)
}
-testing
For more than one column, an option is to loop over the column names of interest with map, apply the function and bind them with _dfc and bind with the original dataset (bind_cols)
out <- imap_dfc(setNames(c("work_location", "birth_place"),
c("work_location", "birth_place")) , ~ {
nm1 <- as.character(.y)
columna_dummy(df = df, columna = .x) %>%
rename_all(~ str_c(nm1, ., sep="_"))
}) %>%
bind_cols(df, .)
-output
head(out, 2)
# mes work_location birth_place work_location_China work_location_Mexico work_location_USA work_location_Japan
#1 01/01/2000 China Chile 1 0 0 0
#2 01/02/2000 Mexico Japan 0 1 0 0
# work_location_India birth_place_Chile birth_place_Japan birth_place_Argentina birth_place_Poland birth_place_Finland
#1 0 1 0 0 0 0
#2 0 0 1 0 0 0
# birth_place_Norway birth_place_Kenia birth_place_Mali
#1 0 0 0
#2 0 0 0
data
df <- structure(list(mes = c("01/01/2000", "01/02/2000", "01/03/2000",
"01/04/2000", "01/05/2000", "01/06/2000", "01/07/2000", "01/08/2000",
"01/09/2000", "01/10/2000", "01/11/2000", "01/12/2000"), work_location = c("China",
"Mexico", "China", "China", "USA", "Mexico", "USA", "USA", "Japan",
"Japan", "Japan", "India"), birth_place = c("Chile", "Japan",
"Chile", "Argentina", "Poland", "Poland", "Finland", "Finland",
"Norway", "Kenia", "Mali", "Mali")), class = "data.frame",
row.names = c(NA,
-12L))
By using purrr library I solved the issue:
## data ##
df <- structure(list(mes = c("01/01/2000", "01/02/2000", "01/03/2000",
"01/04/2000", "01/05/2000", "01/06/2000", "01/07/2000", "01/08/2000",
"01/09/2000", "01/10/2000", "01/11/2000", "01/12/2000"), work_location = c("China",
"Mexico", "China", "China", "USA", "Mexico", "USA", "USA", "Japan",
"Japan", "Japan", "India"), birth_place = c("Chile", "Japan",
"Chile", "Argentina", "Poland", "Poland", "Finland", "Finland",
"Norway", "Kenia", "Mali", "Mali")), class = "data.frame",
row.names = c(NA,
-12L))
## function to hot-encode ##
columna_dummy <- function(df, columna) {
df %>%
mutate_at(columna, ~paste(columna, eval(as.symbol(columna)), sep = "_")) %>%
mutate(valor = 1) %>%
spread(key = columna, value = valor, fill = 0)
}
## vector of columns ##
columnas <- c("work_location", "birth_place")
## hot_encoded_dataset ##
library(purrr)
hot_encoded_dataset <- purrr :: map(columnas , columna_dummy, df = df) %>%
reduce(inner_join)
Related
I have a large DF with certain columns that have a vector of character values as below. The number of columns varies from dataset to dataset as well as the number of character vectors it holds also varies.
ID Country1 Country2 Country3
1 1 Argentina, Japan,USA,Poland, Argentina,USA Pakistan
2 2 Colombia, Mexico,Uruguay,Dutch Mexico,Uruguay Afganisthan
3 3 Argentina, Japan,USA,NA Japan Khazagistan
4 4 Colombia, Mexico,Uruguay,Dutch Colombia, Dutch North Korea
5 5 India, China China Iran
Would like to match them one-to-one with another string vector as below
vals_to_find <-c("Argentina","USA","Mexico")
If, a column/row matches to anyone of the strings passed would like to retain that column and row. Remove duplicates, and finally remove those values that do not match.
the desired output is as follows
ID Countries.found
1 1 Argentina, USA
2 2 Mexico
3 3 Argentina, USA
4 4 Mexico
data
dput(df)
structure(list(ID = 1:5, Country1 = c("Argentina, Japan,USA,Poland,",
"Colombia, Mexico,Uruguay,Dutch", "Argentina, Japan,USA,NA",
"Colombia, Mexico,Uruguay,Dutch", "India, China"), Country2 = c("Argentina,USA",
"Mexico,Uruguay", "Japan", "Colombia, Dutch", "China"), Country3 = c("Pakistan",
"Afganisthan", "Khazagistan", "North Korea", "Iran")), class = "data.frame", row.names = c(NA,
-5L))
dput(df_out)
structure(list(ID = 1:4, Countries.found = c("Argentina, USA",
"Mexico", "Argentina, USA", "Mexico")), class = "data.frame", row.names = c(NA,
-4L))
Instead of a each column as a vector, if the file is read as one value per column. Then, was able do it as below
dput(df_out)
structure(list(ID = 1:5, X1 = c("Argentina", "Colombia", "Argentina",
"Colombia", "India"), X2 = c("Japan", "Mexico", "Japan", "Mexico",
"China"), X3 = c("USA", "Uruguay", "USA", "Uruguay", NA), X4 = c("Poland",
"Dutch", NA, "Dutch", NA), X5 = c("Argentina", "Mexico", "Japan",
"Colombia", "China"), X6 = c("USA", "Uruguay", NA, "Dutch", NA
), X7 = c("Pakistan", "Afganisthan", "Khazagistan", "North Korea",
"Iran")), class = "data.frame", row.names = c(NA, -5L))
df_out %>%
dplyr::select(
where(~ !all(is.na(.x)))
) %>%
dplyr::select(c(1, where(~ any(.x %in% vals_to_find)))) %>%
dplyr::mutate(dplyr::across(
tidyselect::starts_with("X"),
~ vals_to_find[match(., vals_to_find)]
)) %>%
tidyr::unite("countries_found", tidyselect::starts_with("X"),
sep = " | ", remove = TRUE, na.rm = TRUE
)
Output
ID countries_found
1 1 Argentina | USA | Argentina | USA
2 2 Mexico | Mexico
3 3 Argentina | USA
4 4 Mexico
unite the "Country" columns, then create a long vector by separating the values into rows, get all distinct values per ID, filter only those who are in vals_to_find, and summarise each countries.found toString.
library(tidyr)
library(dplyr)
df %>%
unite("Country", starts_with("Country"), sep = ",") %>%
separate_rows(Country) %>%
distinct(ID, Country) %>%
filter(Country %in% vals_to_find) %>%
group_by(ID) %>%
summarise(Countries.found = toString(Country))
output
# A tibble: 4 × 2
ID Countries.found
<int> <chr>
1 1 Argentina, USA
2 2 Mexico
3 3 Argentina, USA
4 4 Mexico
We may use
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(across(starts_with("Country"),
~ str_extract_all(.x, str_c(vals_to_find, collapse = "|")))) %>%
pivot_longer(cols = -ID, names_to = NULL,
values_to = 'Countries.found') %>%
unnest(Countries.found) %>%
distinct %>%
group_by(ID) %>%
summarise(Countries.found = toString(Countries.found))
-output
# A tibble: 4 × 2
ID Countries.found
<int> <chr>
1 1 Argentina, USA
2 2 Mexico
3 3 Argentina, USA
4 4 Mexico
I am getting an error while trying to filter data in 1st df based on countries available in 2nd df by using pipe operator in conditional statement.
Referencing countries df
Overall_top5
########### output ###########
continent country gdpPercap
Africa Botswana 8090
Africa Equatorial Guinea 20500
Africa Gabon 19600
Africa Libya 12100
Africa Mauritius 10900
Americas Canada 51600
Americas Chile 15100
Americas Trinidad and Tobago 17100
main df
gap_longer
########### output #############
country year gdpPercap continent
Australia 2019 57100 Oceania
Botswana 2019 8090 Africa
Canada 2019 51600 Americas
Chile 2019 15100 Americas
Denmark 2019 65100 Europe
Error: When I try below code it gives me errors:
gap_longer %>%
filter(year == 2019,
country %in% Overall_top5 %>% select(country) )
Error: Problem with `filter()` input `..1`. x no applicable method for 'select_' applied to an object of class "logical" i Input `..1` is `country %in% Overall_top5 %>% select(country)`. Run `rlang::last_error()` to see where the error occurred.
How can I run this using pipes ? I am able to run this using base R but don't know how to fix it using pipes .
gap_longer %>%
filter(year == 2019,
country %in% Overall_top5$country )
Raw data
Overall_top5 <- structure(list(continent = c("Africa", "Africa", "Africa", "Africa", "Africa", "Americas", "Americas", "Americas"), country = c("Botswana", "Equatorial Guinea", "Gabon", "Libya", "Mauritius", "Canada", "Chile", "Trinidad and Tobago"), gdpPercap = c(8090L, 20500L, 19600L, 12100L, 10900L, 51600L, 15100L, 17100L)), row.names = c(NA, -8L), class = "data.frame")
gap_longer <- structure(list(country = c("Australia", "Botswana", "Canada", "Chile", "Denmark"), year = c(2019L, 2019L, 2019L, 2019L, 2019L), gdpPercap = c(57100L, 8090L, 51600L, 15100L, 65100L), continent = c("Oceania", "Africa", "Americas", "Americas", "Europe")), class = "data.frame", row.names = c(NA, -5L))
First, you want to use pull rather than select as select will return a data frame rather than a vector (but that doesn't solve your problem).
Your problem comes from precedence. In your example, %in% is evaluated first, then %>%. To fix this, use parentheses.
gap_longer %>%
filter(
year == 2019,
country %in% (Overall_top5 %>% pull(country))
)
#> # A tibble: 3 x 4
#> country year gdpPercap continent
#> <chr> <dbl> <dbl> <chr>
#> 1 Botswana 2019 8090 Africa
#> 2 Canada 2019 51600 Americas
#> 3 Chile 2019 15100 Americas
Parenthesis problem. Try:
gap_longer %>%
filter(year == 2019, country %in% Overall_top5$country) %>%
select(country)
or, if you want a vector of country names, not a data frame:
gap_longer %>%
filter(year == 2019, country %in% Overall_top5$country) %>%
pull(country)
I have a hard time to figure out why I can't map the right country on the map. I have gone through all of my codes, I still don't understand why is not working right.
If you see any problems, please let me know. I appreciate.
This is dataset
Country Total Code
1 China 34620 CHN
2 Japan 89 JPN
3 Singapore 40 SGP
4 Thailand 32 THA
5 Hong Kong 26 HKG
6 S. Korea 24 KOR
7 Taiwan 17 TWN
8 Malaysia 16 MYS
9 Australia 15 AUS
10 Germany 14 DEU
11 Vietnam 13 VNM
12 USA 12 USA
13 France 11 FRA
14 Macao 10 MAC
15 U.A.E. 7 ARE
16 Canada 7 CAN
17 Philippines 3 PHL
18 India 3 IND
19 Italy 3 ITA
20 U.K. 3 GBR
21 Russia 2 RUS
22 Finland 1 FIN
23 Sri Lanka 1 LKA
24 Sweden 1 SWE
25 Nepal 1 NPL
26 Cambodia 1 KHM
27 Spain 1 ESP
28 Belgium 1 BEL
library(leaflet)
library(maps)
library(maptools)
case <- read.csv("Cases_02072020_v1.csv",stringsAsFactors = FALSE)
Country = map("world", fill = TRUE, plot = FALSE, regions=iso.expand(case$Code,regex = TRUE))
IDs <- sapply(strsplit(Country$names, ":"), function(x) x[1])
Country <- map2SpatialPolygons(Country,
IDs=IDs,
proj4string=CRS("+proj=longlat +datum=WGS84"))
pal <- colorNumeric(
palette = "Blues",
domain = as.numeric(case$Total))
case$labels <- sprintf(
"<strong>Country:%s</strong><br/>Total:%g",
case$Country, case$Total)%>% lapply(htmltools::HTML)
leaflet(Country) %>% addTiles() %>%
addPolygons(fillOpacity = 0.6, smoothFactor = 0.5, stroke = TRUE, weight = 1,
color = pal(as.numeric(case$Total)),
label = case$labels)
You need to repeat each row of your case data frame so that the countries match each individual polygon on the map. This means ensuring you order them correctly and also you need to incorporate Macao and Hong Kong into China (or change the way you split IDs to handle them there).
Here is a full working version:
library(leaflet)
library(maps)
library(maptools)
case <- structure(list(Country = c("China", "Japan", "Singapore", "Thailand",
"Hong Kong", "S. Korea", "Taiwan", "Malaysia", "Australia", "Germany",
"Vietnam", "USA", "France", "Macao", "U.A.E.", "Canada", "Philippines",
"India", "Italy", "U.K.", "Russia", "Finland", "Sri Lanka", "Sweden",
"Nepal", "Cambodia", "Spain", "Belgium"), Total = c(34620, 89,
40, 32, 26, 24, 17, 16, 15, 14, 13, 12, 11, 10, 7, 7, 3, 3, 3,
3, 2, 1, 1, 1, 1, 1, 1, 1), Code = c("CHN", "JPN", "SGP", "THA",
"HKG", "KOR", "TWN", "MYS", "AUS", "DEU", "VNM", "USA", "FRA",
"MAC", "ARE", "CAN", "PHL", "IND", "ITA", "GBR", "RUS", "FIN",
"LKA", "SWE", "NPL", "KHM", "ESP", "BEL")), row.names = c(NA,
-28L), class = "data.frame")
case <- case[order(case$Country), ]
Country = map("world", fill = TRUE, plot = FALSE, regions = iso.expand(case$Code,regex = F))
IDs <- Country$names
Country <- map2SpatialPolygons(Country,
IDs=IDs,
proj4string=CRS("+proj=longlat +datum=WGS84"))
case[nrow(case) + 1, ] <- case[case$Code == "ESP",]
case$Country[nrow(case)] <- "Canary Islands"
case$Country[case$Country == "S. Korea"] <- "South Korea"
case$Country[case$Country == "U.K."] <- "UK"
case$Country[case$Country == "U.A.E."] <- "United Arab Emirates"
case$Total[case$Country == "China"] <- case$Total[case$Country == "China"] +
case$Total[case$Country == "Hong Kong"] +
case$Total[case$Country == "Macao"]
case <- case[-which(case$Country == "Hong Kong"), ]
case <- case[-which(case$Country == "Macao"), ]
case <- case[order(case$Country), ]
reps <- as.numeric(table(sapply(strsplit(IDs, ":"), function(x) x[1])))
case <- do.call(rbind, mapply(function(x, y){ x[rep(1,y),]},
split(case, case$Country),
reps,
SIMPLIFY = F))
pal <- colorNumeric(
palette = "Blues",
domain = as.numeric(case$Total))
case$labels <- sprintf(
"<strong>Country:%s</strong><br/>Total:%g",
case$Country, case$Total)%>% lapply(htmltools::HTML)
leaflet(Country) %>% addTiles() %>%
addPolygons(fillOpacity = 0.6, smoothFactor = 0.5, stroke = TRUE, weight = 1,
color = pal(as.numeric(case$Total)),
label = case$labels)
Obviously this is just a snapshot, but you can see China is coloured correctly.
Update
Thanks to #jazzurro for his anwer. It made me realize that the duplicates may just complicate things. I hope by keeping only unique values/row simplifies the task.*
df <- data.frame(ID = c(1,2,3,4,5),
CTR1 = c("England", "England", "England", "China", "Sweden"),
CTR2 = c("England", "China", "China", "England", NA),
CTR3 = c("USA", "USA", "USA", "USA", NA),
CTR4 = c(NA, NA, NA, NA, NA),
CTR5 = c(NA, NA, NA, NA, NA),
CTR6 = c(NA, NA, NA, NA, NA))
ID CTR1 CTR2 CTR3 CTR4 CTR5 CTR6
1 England China USA
2 England China USA
3 England China USA
4 China England USA
5 Sweden
It is still the goal to create a co-occurrence matrix (now) based on the following four conditions:
Single observations without additional observations by ID/row are not considered, i.e. a row with only a single country once is counted as 0.
A combination/co-occurrence should be counted as 1.
Being in a combination results in counting as a self-combination as well (USA-USA), i.e. a value of 1 is assigned.
There is no value over 1 assigned to a combination by row/ID.
Aspired Result
China England USA Sweden
China 4 4 4 0
England 4 4 4 0
USA 4 4 4 0
Sweden 0 0 0 0
*I've used the code from here to remove all non-unique observations.
Original Post
Assume I have a data set with a low two digit number of columns (some NA/empty) and more than 100.000 rows, represented by the following example dataframe
df <- data.frame(ID = c(1,2,3,4,5),
CTR1 = c("England", "England", "England", "China", "England"),
CTR2 = c("England", "China", "China", "England", NA),
CTR3 = c("England", "China", "China", "England", NA),
CTR4 = c("China", "USA", "USA", "China", NA),
CTR5 = c("USA", "England", "USA", "USA", NA),
CTR6 = c("England", "China", "USA", "England", NA))
df
ID CTR1 CTR2 CTR3 CTR4 CTR5 CTR6
1 England England England China USA England
2 England China China USA England China
3 England China China USA USA USA
4 China England England China USA England
5 England
and I want to count the co-occurrences by ID/row to get a co-occurrence matrix that sums up the co-occurence by ID/row only once, meaning that no value over 1 will be allocated to a combination (i.e. assign a value of 1 for the existence of a co-occurrence independent of in-row frequencies and order, assign a value of 0 for no co-occurrence/combination by ID/row),
1 England-England-England => 1
2 England-England => 1
3 England-China => 1
4 England- => 0
Another important aspects regards the counting of observations that appear once in a row but in combination with others, e.g. USA in row 1. They should get a value of 1 for their own co-occurrence (as they are in a combination even though not with themselves) so that the combination USA-USA also gets a value of 1 assigned.
1 England England England China USA England
USA-USA => 1
China-China => 1
USA-China => 1
England-England => 1
England-USA => 1
England-China => 1
Due to the fact that row count should not >1 for a combination by row/ID, this results to:
China England USA
China 1 1 1
England 1 1 1
USA 1 1 1
This should lead to the following result based on the example dataframe, where a value of 4 is assigned to each combination based on the fact that each combination has occured at least in four rows and each string is part of a combination of the original dataframe:
China England USA
China 4 4 4
England 4 4 4
USA 4 4 4
So there are five conditions for counting:
Single observations without additional observations by ID/row are not considered, i.e. a row with only a single country once is not counted.
A combination should be counted as 1.
Observations occuring more than once do not contribute to a higher value for the interaction, i.e. several occurrences of the same country do not matter.
Being in a combination (even in the case the same country does not appear twice in a row) results in counting as a self-combination, i.e. a value of 1 is assigned.
There is no value over 1 assigned to a combination by row/ID.
I've tried to implement this by using dplyr, data.table, base aggregate or plyr adjusting code from [1], [2], [3], [4], [5] and [6] but as I don't care about order within a row but I also don't want to sum up all combinations within a row, I haven't got the aspired result so far.
I'm a novice in R. Any help is very much appreciated.
DATA
I modified your data so that data can represent your actual situation.
# ID CTR1 CTR2 CTR3 CTR4 CTR5 CTR6
#1: 1 England England England China USA England
#2: 2 England China China USA England China
#3: 3 England China China USA USA USA
#4: 4 China England England China USA England
#5: 5 Sweden <NA> <NA> <NA> <NA>
df <- structure(list(ID = c(1, 2, 3, 4, 5), CTR1 = c("England", "England",
"England", "China", "Sweden"), CTR2 = c("England", "China", "China",
"England", NA), CTR3 = c("England", "China", "China", "England",
NA), CTR4 = c("China", "USA", "USA", "China", NA), CTR5 = c("USA",
"England", "USA", "USA", ""), CTR6 = c("England", "China", "USA",
"England", NA)), class = c("data.table", "data.frame"), row.names = c(NA,
-5L))
UPDATE
After seeing the OP's previous question, I got a clear picture in my mind. I think this is what you want, Seb.
# Transform the data to long-format data. Remove rows that have zero character (i.e, "") or NA.
melt(setDT(df), id.vars = "ID", measure = patterns("^CTR"))[nchar(value) > 0 & complete.cases(value)] -> foo
# Get distinct value (country) in each ID group (each row)
unique(foo, by = c("ID", "value")) -> foo2
# https://stackoverflow.com/questions/13281303/creating-co-occurrence-matrix
# Seeing this question, you want to create a matrix with crossprod().
crossprod(table(foo2[, c(1,3)])) -> mymat
# Finally, you need to change diagonal values. If a value is equal to one,
# change it to zero. Otherwise, keep the original value.
diag(mymat) <- ifelse(diag(mymat) <= 1, 0, mymat)
#value
#value China England Sweden USA
#China 4 4 0 4
#England 4 4 0 4
#Sweden 0 0 0 0
#USA 4 4 0 4
Here is an option using base::table:
#get paired combi and remove those from same country
pairsDF <- as.data.frame(do.call(rbind,
by(df, df$ID, function(x) t(combn(unlist(x[-1L]), 2L)))))
#tabulate pairs
duppairs <- rbind(pairsDF, data.frame(V1=pairsDF$V2, V2=pairsDF$V1))
tab <- table(duppairs, useNA="no")
#set diagonals to be the count of countries if count is at least 2
cnt <- c(table(unlist(df[-1L])))
cnt[cnt==1L] <- 0L
diag(tab) <- cnt[names(diag(tab))]
output:
V2
V1 China England Sweden USA
China 4 4 0 4
England 4 4 0 4
Sweden 0 0 0 0
USA 4 4 0 4
data:
df <- data.frame(ID = c(1,2,3,4,5),
CTR1 = c("England", "England", "England", "China", "Sweden"),
CTR2 = c("China", "China", "China", "England", NA),
CTR3 = c("USA", "USA", "USA", "USA", NA),
CTR4 = c(NA, NA, NA, NA, NA),
CTR5 = c(NA, NA, NA, NA, NA),
CTR6 = c(NA, NA, NA, NA, NA))
I have a data frame in R that has x and y values. There are NA values in x I would like to fill with specific strings dependent upon the y values.
Example data frame:
df1 = data.frame(x = c("Canada", "United States", NA, NA, NA),
y = c("CAN", "USA", "TWN", "TWN", "ARG"))
Expected result:
df2 = data.frame(x = c("Canada", "United States", "Taiwan", "Taiwan", NA),
y = c("CAN", "USA", "TWN", "TWN", "ARG"))
I have tried
df2 <- df1 %>% transform(df1, x = ifelse(y == "TWN", "Taiwan", x))
but this removes all values of x and y for all other observations not containing "TWN". Most of the other examples I've found on here are replacing all NA values and I can't seem to figure out how to do it conditionally upon strings in another column. This will need to be done with very large data frames so efficiency is ideal.
I think a better approach here would be to maintain a data frame which has all the keys and replacements which you intend to use should a full country name be missing:
df1 <- data.frame(x=c("Canada", "United States", NA, NA, NA),
y=c("CAN", "USA", "TWN", "TWN", "ARG"), stringsAsFactors=FALSE)
df2 <- data.frame(y=c("TWN", "ARG"), value=c("Taiwan", "Argentina"),
stringsAsFactors=FALSE)
result <- merge(df1, df2, by="y", all=TRUE)
result$x <- ifelse(is.na(result$x), result$value, result$x)
result <- result[, c("y", "x")]
result
y x
1 ARG Argentina
2 CAN Canada
3 TWN Taiwan
4 TWN Taiwan
5 USA United States
You first need to augment the levels of the factor column before you use "[" with logical indexing for the joint condion of 'x' being NA and 'y' being TWN.
> levels(df1$x) <- c(levels(df1$x), 'Taiwan')
> df1[is.na(df1$x) & df1$y=='TWN', 'x'] <- 'Taiwan'
> df1
x y
1 Canada CAN
2 United States USA
3 Taiwan TWN
4 Taiwan TWN
5 <NA> ARG
The simplest approach would be to use data.table library in R
library(data.table)
setDT(df1)
df1[y=="TWN",x:="TAIWAN"]
That should work like a pro.
I like to use a lookUpTable (named character) for those problems.
lookUpTable = c("Canada", "United States", "Taiwan", "Argentina")
names(lookUpTable) = c("CAN", "USA", "TWN", "ARG")
# CAN USA TWN ARG
#"Canada" "United States" "Taiwan" "Argentina"
df1$x <- lookUpTable[df1$y]
# x y
#1 United States CAN
#2 Argentina USA
#3 Taiwan TWN
#4 Taiwan TWN
#5 Canada ARG
data:
df1 = data.frame(x = c("Canada", "United States", NA, NA, NA),
y = c("CAN", "USA", "TWN", "TWN", "ARG"))