Subset of data frame:
country1 country2
Japan Japan
Netherlands <NA>
<NA> <NA>
Brazil Brazil
Russian Federation <NA>
<NA> <NA>
<NA> United States of America
Germany Germany
Ukraine <NA>
Japan Japan
<NA> Russian Federation
<NA> United States of America
France France
New Zealand New Zealand
Japan <NA>
I have two character vectors, country1 and country2, which I would like to merge into a new column. No row in my dataset has two different countries, but some rows hold the same country in both columns, and I want it to appear only once. I also want to omit the NAs, so that each value in the new column contains only the country string. A few rows have NA in both columns; these should simply stay NA in the new column. What is the best way to tackle this?
I've made a minor modification to the function in the top voted answer here to a similar question, changing the separator from a comma to an empty string.
However, this leaves the duplication issue unsolved:
country1 country2 merge
Japan Japan JapanJapan
Netherlands <NA> Netherlands
<NA> <NA> <NA>
Brazil Brazil BrazilBrazil
Russian Federation <NA> Russian Federation
<NA> <NA> <NA>
<NA> United States of America United States of America
Germany Germany GermanyGermany
Ukraine <NA> Ukraine
Japan Japan JapanJapan
<NA> Russian Federation Russian Federation
<NA> United States of America United States of America
France France FranceFrance
New Zealand New Zealand New ZealandNew Zealand
Japan <NA> Japan
Since you specified dplyr, here's a one-liner with it:
df <- dplyr::mutate(df, merge = dplyr::if_else(is.na(country1), country2, country1))
Data
country1 <- c("Japan", "Netherlands", NA, "Brazil", "Russian Federation", NA, NA, "Germany", "Ukraine", "Japan", NA, NA, "France", "New Zealand", "Japan")
country2 <- c("Japan", NA, NA, "Brazil", NA, NA, "United States of America", "Germany", NA, "Japan", "Russian Federation", "United States of America", "France", "New Zealand", NA)
df <- data.frame(country1, country2, stringsAsFactors = FALSE)
Since you said you have character vectors, then:
library(dplyr)  # coalesce() is provided by dplyr
coalesce(country1,country2)
[1] "Japan" "Netherlands" NA
[4] "Brazil" "Russian Federation" NA
[7] "United States of America" "Germany" "Ukraine"
[10] "Japan" "Russian Federation" "United States of America"
[13] "France" "New Zealand" "Japan"
If it is a data frame, just do coalesce(!!!df)
You can also replace the NA values in the first column with the values from the second:
df$country1[is.na(df$country1)] <- df$country2[is.na(df$country1)]
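If overwriting country1 is not wanted, the same idea can write to a new column instead. The following is a minimal base R sketch on a shortened copy of the data. The conflict check and the name `conflict` are illustrative assumptions, added only because the question states that no row holds two different countries.

```r
# Shortened copy of the example data (first two rows plus two NA cases)
country1 <- c("Japan", "Netherlands", NA, NA)
country2 <- c("Japan", NA, NA, "United States of America")
df <- data.frame(country1, country2, stringsAsFactors = FALSE)

# Assumption from the question: when both columns are filled, they agree
conflict <- !is.na(df$country1) & !is.na(df$country2) &
  df$country1 != df$country2
stopifnot(!any(conflict))

# Take country1 where present, otherwise country2; NA in both stays NA
df$merge <- ifelse(is.na(df$country1), df$country2, df$country1)
df
```

Because the first non-missing value is taken rather than pasted, rows such as Japan/Japan yield a single "Japan".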
Related
I have this dataset below
Country Sales
France 12000
Germany 2400
Italy 1000
Belgium 500
Could you please help with code to append a common word to each value in Country? I have tried everything I can think of. Here is the intended output:
Country Sales
France - Europe 12000
Germany - Europe 2400
Italy - Europe 1000
Belgium - Europe 500
Thanks for your help.
country_data <-
data.frame(
country = c(
"France",
"Germany",
"Italy",
"Belgium"
),
sales = c(
12000,
2400,
1000,
500
)
)
country_data_2 <-
country_data |>
dplyr::mutate(
continent = dplyr::case_when(
country %in% c("France", "Germany", "Italy", "Belgium") ~ "Europe",
country %in% c("Egypt", "South Africa", "Morocco") ~ "Africa",
country %in% c("Canada", "Mexico", "United States") ~ "North America"
# ...
)
) |>
dplyr::transmute(
country = paste(
country,
continent,
sep = " - "
),
sales = sales
)
country_data_2
#> country sales
#> 1 France - Europe 12000
#> 2 Germany - Europe 2400
#> 3 Italy - Europe 1000
#> 4 Belgium - Europe 500
Created on 2022-11-08 with reprex v2.0.2
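If every row really belongs to the same continent, as in the question's four-row example, a lookup is not strictly needed. A minimal base R sketch, assuming the fixed label "Europe" applies to all rows:

```r
country_data <- data.frame(
  country = c("France", "Germany", "Italy", "Belgium"),
  sales = c(12000, 2400, 1000, 500),
  stringsAsFactors = FALSE
)

# Append the same suffix to every country name (assumes all rows are European)
country_data$country <- paste(country_data$country, "Europe", sep = " - ")
country_data
```

The case_when() approach above is the better fit once the data contain countries from more than one continent.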
I have a dataframe like this:
structure(list(from = c("China", "China", "Canada", "Canada",
"USA", "China", "Trinidad and Tobago", "China", "USA", "USA"),
to = c("Japan", "Japan", "USA", "USA", "Japan", "USA", "USA",
"Rep. of Korea", "Canada", "Japan"), weight = c(4766781396,
4039683737, 3419468319, 3216051707, 2535151299, 2513604035,
2303474559, 2096033823, 2091906420, 2066357443)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -10L), groups = structure(list(
from = c("Canada", "China", "China", "China", "Trinidad and Tobago",
"USA", "USA"), to = c("USA", "Japan", "Rep. of Korea", "USA",
"USA", "Canada", "Japan"), .rows = structure(list(3:4, 1:2,
8L, 6L, 7L, 9L, c(5L, 10L)), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -7L), .drop = TRUE))
I would like to compute the absolute value of the difference in the weight column, grouped by from and to.
I'm trying the function aggregate(), but it seems to work for means and sums and not for differences. For example (df is the name of my dataframe):
aggregate(weight~from+to, data = df, FUN=mean)
which produces:
from to weight
1 USA Canada 2091906420
2 China Japan 4403232567
3 USA Japan 2300754371
4 China Rep. of Korea 2096033823
5 Canada USA 3317760013
6 China USA 2513604035
7 Trinidad and Tobago USA 2303474559
EDIT. The desired result is instead
from to weight
1 USA Canada 2091906420
2 China Japan 727097659
3 USA Japan 468793856
4 China Rep. of Korea 2096033823
5 Canada USA 203416612
6 China USA 2513604035
7 Trinidad and Tobago USA 2303474559
As we can see, the country pairs that appear twice in the columns from and to are collapsed into a single row, with the difference between their weights in the weight column. E.g.,
from to weight
China Japan 4766781396
China Japan 4039683737
become
from to weight
China Japan 727097659
because
> 4766781396-4039683737
[1] 727097659
The difference should be positive (this is why I wrote "the absolute value of the difference of the weights").
Pairs of countries that appear in only one row of df remain unchanged, e.g.
from to weight
7 Trinidad and Tobago USA 2303474559
Assuming at most 2 values per group and that the order of the difference is not important:
aggregate(weight~from+to, data=df, FUN=function(x){
abs(ifelse(length(x)==1,x,diff(x)))
})
from to weight
1 USA Canada 2091906420
2 China Japan 727097659
3 USA Japan 468793856
4 China Rep. of Korea 2096033823
5 Canada USA 203416612
6 China USA 2513604035
7 Trinidad and Tobago USA 2303474559
Is the following what you are looking for?
f <- function(x) abs(x[2] - x[1])
aggregate(weight ~ from + to, data = df, FUN = f)
#> from to weight
#> 1 USA Canada NA
#> 2 China Japan 727097659
#> 3 USA Japan 468793856
#> 4 China Rep. of Korea NA
#> 5 Canada USA 203416612
#> 6 China USA NA
#> 7 Trinidad and Tobago USA NA
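The function f above returns NA for pairs that occur only once, whereas the desired result keeps their original weight. A minimal sketch that handles both cases with an explicit branch follows; the names `trade` and `abs_diff_or_self` are hypothetical, and the data are re-entered as a plain data frame under the assumption that each pair occurs at most twice.

```r
trade <- data.frame(
  from = c("China", "China", "Canada", "Canada", "USA", "China",
           "Trinidad and Tobago", "China", "USA", "USA"),
  to = c("Japan", "Japan", "USA", "USA", "Japan", "USA", "USA",
         "Rep. of Korea", "Canada", "Japan"),
  weight = c(4766781396, 4039683737, 3419468319, 3216051707, 2535151299,
             2513604035, 2303474559, 2096033823, 2091906420, 2066357443),
  stringsAsFactors = FALSE
)

# Keep a single weight as is; otherwise return the absolute difference.
# Assumption: no (from, to) pair has more than two rows.
abs_diff_or_self <- function(x) {
  stopifnot(length(x) <= 2)
  if (length(x) == 1) x else abs(diff(x))
}

result <- aggregate(weight ~ from + to, data = trade, FUN = abs_diff_or_self)
result
```

For groups with more than two rows the notion of "difference" would need to be defined first, which is why the sketch stops with an error in that case.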
I am trying to automatically match two dataframes with a (for?) loop.
> df_key
country_election keyword1 keyword2 keyword3 keyword4 keyword5
1 France Paris Rome Madrid London Marseille
2 Spain Valencia Berlin Manchester Zurich Milan
> df_country
city country
1 Paris France
2 Rome Italy
3 Madrid Spain
4 London United Kingdom
5 Marseille France
6 Valencia Spain
7 Berlin Germany
8 Manchester United Kingdom
9 Zurich Switzerland
10 Milan Italy
In this example I would like to match every keyword in df_key with df_country to add country columns.
country_election keyword1 country_1 keyword2 country_2 keyword3 country_3
1 France Paris France Rome Italy Madrid Spain
2 Spain Valencia Spain Berlin Germany Manchester United Kingdom
Finally, I'd also like to have a series of dummy variables checking whether country_i is equal to country_election. Thanks a lot for your help.
df_key <- structure(list(country_election = c("France", "Spain"), keyword1 = c("Paris", "Valencia"),
keyword2 = c("Rome", "Berlin"), keyword3 = c("Madrid", "Manchester"), keyword4 = c("London", "Zurich"),
keyword5 = c("Marseille", "Milan")), class = "data.frame", row.names = c(NA, -2L))
df_country <- structure(list(city = c("Paris", "Rome", "Madrid", "London", "Marseille", "Valencia",
"Berlin", "Manchester", "Zurich", "Milan"), country = c("France", "Italy", "Spain", "United Kingdom",
"France", "Spain", "Germany", "United Kingdom", "Switzerland", "Italy")),
class = "data.frame", row.names = c(NA, -10L))
You can match the city names, extract the country and create new columns. If the column order is important, extract the numeric part from it and order the data.
cols <- sub('keyword', 'country', names(df_key[-1]))
df_key[cols] <- df_country$country[match(as.matrix(df_key[-1]), df_country$city)]
df_key[order(as.numeric(sub('\\D+', '', names(df_key))), na.last = FALSE)]
# country_election keyword1 country1 keyword2 country2 keyword3
#1 France Paris France Rome Italy Madrid
#2 Spain Valencia Spain Berlin Germany Manchester
# country3 keyword4 country4 keyword5 country5
#1 Spain London United Kingdom Marseille France
#2 United Kingdom Zurich Switzerland Milan Italy
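The question also asks for dummy variables that flag whether each matched country equals country_election. A minimal base R sketch on the first three keyword columns follows; the column prefix `same` is a hypothetical name chosen for illustration.

```r
df_key <- data.frame(
  country_election = c("France", "Spain"),
  keyword1 = c("Paris", "Valencia"),
  keyword2 = c("Rome", "Berlin"),
  keyword3 = c("Madrid", "Manchester"),
  stringsAsFactors = FALSE
)
df_country <- data.frame(
  city = c("Paris", "Rome", "Madrid", "Valencia", "Berlin", "Manchester"),
  country = c("France", "Italy", "Spain", "Spain", "Germany", "United Kingdom"),
  stringsAsFactors = FALSE
)

key_cols <- grep("^keyword", names(df_key), value = TRUE)
country_cols <- sub("keyword", "country", key_cols)
same_cols <- sub("keyword", "same", key_cols)  # hypothetical dummy names

# Look up the country of each keyword city
df_key[country_cols] <- lapply(df_key[key_cols], function(k) {
  df_country$country[match(k, df_country$city)]
})

# 1 if the matched country equals country_election, 0 otherwise
df_key[same_cols] <- lapply(df_key[country_cols], function(cn) {
  as.integer(cn == df_key$country_election)
})
df_key
```

A city that is missing from df_country gives NA in both the country column and its dummy.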
The code below does what I want for a simple table: the join in the statement that uses on works perfectly. However, I also have cases where a single cell holds multiple countries that may need to be assigned to multiple regions, and storing that result in the region column is more challenging.
library(data.table)
testDT <- data.table(country = c("Algeria", "Egypt", "United States", "Brazil"))
testDTcomplicated <- data.table(country = c("Algeria, Ghana, Sri Lanka", "Egypt", "United States, Argentina", "Brazil"))
regionLookup <- data.table(countrylookup = c("Algeria", "Argentina", "Egypt", "United States", "Brazil", "Ghana", "Sri Lanka"), regionVal = c("Africa", "South America", "Africa", "North America", "South America", "Africa", "Asia"))
testDT[regionLookup, region := regionVal, on = c(country = "countrylookup")]
> testDT
country region
1: Algeria Africa
2: Egypt Africa
3: United States North America
4: Brazil South America
I'd like to have testDTcomplicated look like the following
> testDTcomplicated
country region
1: Algeria, Ghana, Sri Lanka Africa, Africa, Asia
2: Egypt Africa
3: United States, Argentina North America, South America
4: Brazil South America
4: Brazil South America
You could split the data on comma and get each country in a separate row, join the data with regionLookup and collapse them again in one value in a comma-separated string.
library(data.table)
testDTcomplicated[, row := seq_len(.N)]
new <- splitstackshape::cSplit(testDTcomplicated, 'country', ',',
direction = 'long')[regionLookup, region := regionVal,
on = c(country = "countrylookup")]
new <- new[, lapply(.SD, toString), row][,row:=NULL]
new
# country region
#1: Algeria, Ghana, Sri Lanka Africa, Africa, Asia
#2: Egypt Africa
#3: United States, Argentina North America, South America
#4: Brazil South America
The same logic can be implemented in dplyr as:
library(dplyr)
testDTcomplicated %>%
mutate(row = row_number()) %>%
tidyr::separate_rows(country, sep = ", ") %>%
left_join(regionLookup, by = c("country" = "countrylookup")) %>%
group_by(row) %>%
summarise(across(everything(), toString))
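The split, look up, collapse steps can also be sketched in base R without extra packages. This is only an illustration under the assumption that countries within a cell are separated by a comma plus optional spaces; the named vector `lookup` is a hypothetical stand-in for regionLookup.

```r
countries <- c("Algeria, Ghana, Sri Lanka", "Egypt",
               "United States, Argentina", "Brazil")

# Hypothetical lookup: names are countries, values are regions
lookup <- c(
  "Algeria" = "Africa", "Argentina" = "South America", "Egypt" = "Africa",
  "United States" = "North America", "Brazil" = "South America",
  "Ghana" = "Africa", "Sri Lanka" = "Asia"
)

# Split each cell, map every country to its region, collapse back to one string
region <- vapply(
  strsplit(countries, ",\\s*"),
  function(parts) toString(unname(lookup[parts])),
  character(1)
)

result <- data.frame(country = countries, region = region,
                     stringsAsFactors = FALSE)
result
```

A country that is absent from the lookup shows up as "NA" inside the collapsed string, so it may be worth checking for unmatched names first.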
My aim is to create a table that summarizes the countries featured in my sample. This table should only have two rows, a first row with different columns for each region and a second row with country names that are located in the respective region.
To give you an example, this is what my data.frame XYZ looks like:
    wvs5red2.s003names   wvs5red2.regiondummies
21  "Hong Kong"          Asian Tigers
45  "South Korea"        Asian Tigers
49  "Taiwan"             Asian Tigers
66  "China"              East Asia & Pacific
80  "Indonesia"          East Asia & Pacific
86  "Malaysia"           East Asia & Pacific
My aim is to obtain a table that looks similar to this:
region      Asian Tigers                      East Asia & Pacific
countries   Hong Kong, South Korea, Taiwan    China, Indonesia, etc.
Do you have any idea how to obtain such a table? It took me hours searching for something similar.
Simplest way is tapply:
XYZ <- structure(list(
names = structure(c(2L, 5L, 6L, 1L, 3L, 4L), .Label = c("China", "Hong Kong", "Indonesia", "Malaysia", "South Korea", "Taiwan"), class = "factor"),
region = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("Asian Tigers", "East Asia & Pacific"), class = "factor")),
.Names = c("names", "region"), row.names = c(NA, -6L), class = "data.frame")
tapply(XYZ$names, XYZ$region, paste, collapse=", ")
# Asian Tigers East Asia & Pacific
# "Hong Kong, South Korea, Taiwan" "China, Indonesia, Malaysia"
Recreate the data:
dat <- data.frame(
country = c("Hong Kong", "South Korea", "Taiwan", "China", "Indonesia", "Malaysia"),
region = c(rep("Asian Tigers", 3), rep("East Asia & Pacific", 3))
)
dat
country region
1 Hong Kong Asian Tigers
2 South Korea Asian Tigers
3 Taiwan Asian Tigers
4 China East Asia & Pacific
5 Indonesia East Asia & Pacific
6 Malaysia East Asia & Pacific
Use ddply in package plyr combined with paste to summarise the data:
library(plyr)
ddply(dat, .(region), function(x)paste(x$country, collapse= ","))
region V1
1 Asian Tigers Hong Kong,South Korea,Taiwan
2 East Asia & Pacific China,Indonesia,Malaysia
First create data:
> country<-c("Hong Kong","Taiwan","China","Indonesia")
> region<-rep(c("Asian Tigers","East Asia & Pacific"),each=2)
> df<-data.frame(country=country,region=region)
Then run through column region and gather all the countries. We can use tapply, but I will use dlply from package plyr, since it retains list names.
> ll<-dlply(df,~region,function(d)paste(d$country,collapse=","))
> ll
$`Asian Tigers`
[1] "Hong Kong,Taiwan"
$`East Asia & Pacific`
[1] "China,Indonesia"
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
region
1 Asian Tigers
2 East Asia & Pacific
Now convert the list to the data.frame using do.call. Since we need nice names we need to pass argument check.names=FALSE:
> ll$check.names <- FALSE
> do.call("data.frame",ll)
Asian Tigers East Asia & Pacific
1 Hong Kong,Taiwan China,Indonesia
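The same one-row table can be sketched in base R without plyr, using split() to group and do.call() with check.names = FALSE to keep the region names as column headers. The names `by_region` and `wide` are illustrative only.

```r
dat <- data.frame(
  country = c("Hong Kong", "South Korea", "Taiwan",
              "China", "Indonesia", "Malaysia"),
  region = rep(c("Asian Tigers", "East Asia & Pacific"), each = 3),
  stringsAsFactors = FALSE
)

# One comma-separated string of countries per region (named character vector)
by_region <- vapply(split(dat$country, dat$region), paste, character(1),
                    collapse = ", ")

# Turn the named vector into a one-row data frame, keeping names with spaces
wide <- do.call(
  data.frame,
  c(as.list(by_region), list(check.names = FALSE, stringsAsFactors = FALSE))
)
wide
```

Without check.names = FALSE, data.frame() would rewrite the headers into syntactically valid names such as Asian.Tigers.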