R: extract values and insert them into 3 existing columns

I have a large dataset like the following, and I'm trying to fill 3 columns with values based on the column Country.
Country<-c("Asia","Africa - Benin (Cotonou)",
"Europe - France (Paris)","Asia - China(Shanghai)", "Europe - United Kingdom (London)", "Europe - France (Orléans)"
, "Afrique - Togo (Lomé)", "Afrique - Sénégal (Dakar)", "Asia - Pakistan (Rahim Yar Khan)")
ID<-c(1,2,3,4,5,6,7,8,9)
mydata<-data.frame(ID,Country)
> mydata
> ID Country col1 col2 col3
> 1 1 Asia
> 2 2 Africa - Benin (Cotonou)
> 3 3 Europe - France (Paris)
> 4 4 Asia - China(Shanghai)
> 5 5 Europe - United Kingdom (London)
> 6 6 Europe - France (Orléans)
> 7 7 Afrique - Togo (Lomé)
> 8 8 Afrique - Sénégal (Dakar)
> 9 9 Asia - Pakistan (Rahim Yar Khan)
I tried the following, but I'm having an issue with the regular expression:
library(tidyr)
mydata <- mydata %>% separate(col = "Country", into = c("Col1", "Col2", "Col3"), remove = FALSE, fill = "right")
The result that I get is the following:
ID Country Col1 Col2 Col3
1 Asia Asia <NA> <NA>
2 Africa - Benin (Cotonou) Africa Benin Cotonou
3 Europe - France (Paris) Europe France Paris
4 Asia - China(Shanghai) Asia China Shanghai
5 Europe - United Kingdom (London) Europe United Kingdom
6 Europe - France (Orléans) Europe France Orl
7 Afrique - Togo (Lomé) Afrique Togo L
8 Afrique - Sénégal (Dakar) Afrique S n
9 Asia - Pakistan (Rahim Yar Khan) Asia Pakistan Rahim
Some parts are missing in column 3 for rows 5, 6, 7, 8 and 9.
The result that I want is the following:
ID Country Col1 Col2 Col3
1 Asia Asia <NA> <NA>
2 Africa - Benin (Cotonou) Africa Benin Cotonou
3 Europe - France (Paris) Europe France Paris
4 Asia - China(Shanghai) Asia China Shanghai
5 Europe - United Kingdom (London) Europe United Kingdom London
6 Europe - France (Orléans) Europe France Orléans
7 Afrique - Togo (Lomé) Afrique Togo Lomé
8 Afrique - Sénégal (Dakar) Afrique Sénégal Dakar
9 Asia - Pakistan (Rahim Yar Khan) Asia Pakistan Rahim Yar Khan
Any suggestions on how to do this?

This is my first contribution, so please forgive me if I am wrong.
I did it this way. It may not be the easiest way, but it worked:
library(dplyr)
library(stringr)
library(tidyr)

mydata %>%
  separate(col = "Country",
           sep = "[\\(-]",   # split on "(" or "-"
           into = c("Col1", "Col2", "Col3"),
           remove = FALSE,
           fill = "right") %>%
  mutate(Col3 = str_remove(Col3, "\\)"))   # drop the closing parenthesis

Update: to remove the extra spaces we could add this line at the end of the code:
mutate(across(starts_with("col"), str_squish))
We could replace the first separator - by (, so that only one separator remains.
Afterwards, separate on it and finally remove the remaining ).
library(dplyr)
library(stringr)
library(tidyr)
ID col1 col2 col3
1 1 Asia <NA> <NA>
2 2 Africa Benin Cotonou
3 3 Europe France Paris
4 4 Asia China Shanghai
5 5 Europe United Kingdom London
6 6 Europe France Orléans
7 7 Afrique Togo Lomé
8 8 Afrique Sénégal Dakar
9 9 Asia Pakistan Rahim Yar Khan
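The three steps described in this answer can be sketched as follows. This is an assumption about how they might be implemented, not the answerer's original code: the choice of str_replace, str_remove, and str_squish, and the cut-down four-row mydata, are made up for illustration.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Small subset of the question's data, for illustration only
mydata <- data.frame(
  ID = 1:4,
  Country = c("Asia", "Africa - Benin (Cotonou)",
              "Asia - China(Shanghai)", "Europe - United Kingdom (London)")
)

result <- mydata %>%
  # Step 1: turn the first " - " into "(" so only one separator remains
  mutate(Country = str_replace(Country, " - ", "(")) %>%
  # Step 2: split on "("; rows with fewer pieces are padded with NA
  separate(Country, into = c("col1", "col2", "col3"),
           sep = "\\(", fill = "right") %>%
  # Step 3: drop the remaining ")" and squeeze stray whitespace
  mutate(across(col1:col3, ~ str_squish(str_remove(.x, "\\)"))))

print(result)
```

Because str_replace only touches the first match, a second " - " later in the string would be left alone.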

tidyr::separate splits text into columns on a delimiter. By default the delimiter is any run of non-alphanumeric characters (sep = "[^[:alnum:]]+"), so spaces, hyphens, and parentheses all act as separators, and any pieces beyond the third are dropped. You can use the extra argument to merge all the remaining text into the 3rd column like so:
mydata %>%
separate(Country,
into = c("Col1", "Col2", "Col3"),
extra = "merge")
ID Col1 Col2 Col3
1 1 Asia <NA> <NA>
2 2 Africa Benin Cotonou)
3 3 Europe France Paris)
4 4 Asia China Shanghai)
5 5 Europe United Kingdom (London)
6 6 Europe France Orléans)
7 7 Afrique Togo Lomé)
8 8 Afrique Sénégal Dakar)
9 9 Asia Pakistan Rahim Yar Khan)
Warning message:
Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
However, with this we get an unnecessary ) at the end. You can either remove this via a mutate or instead of separate use tidyr::extract that allows extracting based on regex:
mydata %>%
extract(Country,
into = c("Col1", "Col2", "Col3"),
regex = "([[:alnum:]]+) - ([[:alnum:]]+) ?\\((.*)\\)")
ID Col1 Col2 Col3
1 1 <NA> <NA> <NA>
2 2 Africa Benin Cotonou
3 3 Europe France Paris
4 4 Asia China Shanghai
5 5 <NA> <NA> <NA>
6 6 Europe France Orléans
7 7 Afrique Togo Lomé
8 8 Afrique Sénégal Dakar
9 9 Asia Pakistan Rahim Yar Khan
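In the output above, rows 1 and 5 come back as NA because the pattern requires all three parts and [[:alnum:]]+ cannot match the space in "United Kingdom". One possible way around this, sketched here with stringr::str_match instead of extract, is to make the second and third parts optional and to allow spaces inside them. The pattern is an assumption that fits the sample data; it would not cope with a hyphen inside the first part.

```r
library(stringr)

# Subset of the question's data, for illustration only
country <- c("Asia", "Africa - Benin (Cotonou)", "Asia - China(Shanghai)",
             "Europe - United Kingdom (London)",
             "Asia - Pakistan (Rahim Yar Khan)")

# Group 1: text up to the first "-" or "(" (lazy, so no trailing space)
# Group 2: optional text after " - " and before "("
# Group 3: optional text inside the parentheses
pattern <- "^([^\\-\\(]+?)(?:\\s*-\\s*([^\\(]+?))?\\s*(?:\\((.*)\\))?$"

parts <- str_match(country, pattern)

result <- data.frame(
  Country = country,
  Col1 = parts[, 2],
  Col2 = parts[, 3],
  Col3 = parts[, 4]
)

print(result)
```

str_match returns NA for optional groups that do not take part in the match, which gives the NA values wanted for the bare "Asia" row.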

library(dplyr)
library(tidyr)
mydata %>%
separate(Country, into = c("col1", "col2", "col3"), '( - | ?\\()', remove = FALSE) %>%
mutate(col3 = gsub(')', '', col3))
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
#> ID Country col1 col2 col3
#> 1 1 Asia Asia <NA> <NA>
#> 2 2 Africa - Benin (Cotonou) Africa Benin Cotonou
#> 3 3 Europe - France (Paris) Europe France Paris
#> 4 4 Asia - China(Shanghai) Asia China Shanghai
#> 5 5 Europe - United Kingdom (London) Europe United Kingdom London
#> 6 6 Europe - France (Orléans) Europe France Orléans
#> 7 7 Afrique - Togo (Lomé) Afrique Togo Lomé
#> 8 8 Afrique - Sénégal (Dakar) Afrique Sénégal Dakar
#> 9 9 Asia - Pakistan (Rahim Yar Khan) Asia Pakistan Rahim Yar Khan

A data.table solution:
require(data.table)
setDT(mydata)
splitCountry <- function( c_str ) {
vec <- trimws(unlist(strsplit(as.character(c_str),"[[:punct:]]")))
col1 <- vec[1]
col2 <- vec[2]
col3 <- vec[3]
return(list(col1,
col2,
col3))
}
mydata[,c('col1','col2','col3'):=splitCountry(Country),by=Country]

Related

Creating a new column when two columns satisfy certain conditions in R

My data is like this:
country supporter1 supporter2 supporter3 supporter4 supporter5
USA Albania Germany USA NA NA
France USA France NA NA NA
UK UK Chile Peru NA NA
Germany USA Iran Mexico India Pakistan
USA China Spain NA NA NA
Cuba Cuba UK Germany South Korea NA
China Russia NA NA NA NA
What I want to do is to create a new variable when the country column and one of the remaining supporter columns (supporter 1, supporter 2, supporter 3, supporter 4, and supporter 5) are the same (for instance country France and supporter2 France are the same). In this case, the new variable should take 1, 0 otherwise.
I expect to have this:
country supporter1 supporter2 supporter3 supporter4 supporter5 new variable
USA Albania Germany USA NA NA 1
France USA France NA NA NA 1
UK UK Chile Peru NA NA 1
Germany USA Iran Mexico India Pakistan 0
USA China Spain NA NA NA 0
Cuba Cuba UK Germany South Korea NA 1
China Russia NA NA NA NA 0
Update: a dplyr-only solution using if_any:
library(dplyr)
df %>%
rowwise() %>%
mutate(new_var = as.integer(as.logical(if_any(starts_with("supporter"), ~ . %in% country))))
country supporter1 supporter2 supporter3 supporter4 supporter5 new_var
<chr> <chr> <chr> <chr> <chr> <chr> <int>
1 USA Albania Germany USA NA NA 1
2 France USA France NA NA NA 1
3 UK UK Chile Peru NA NA 1
4 Germany USA Iran Mexico India Pakistan 0
5 USA China Spain NA NA NA 0
6 Cuba Cuba UK Germany South Korea NA 1
7 China Russia NA NA NA NA 0
First answer (also correct):
Here is one possible solution:
calculate rowwise
check in cols supporter1 to supporter5 if country is included
unite all new columns to one and with an ifelse statement take 1 or 0
library(dplyr)
library(stringr)
library(tidyr)
df %>%
rowwise() %>%
mutate(across(supporter1:supporter5, ~ifelse(. %in% country, 1,0), .names = "new_{col}")) %>%
unite(New_Col, starts_with('new'), na.rm = TRUE, sep = ' ') %>%
mutate(New_Col = ifelse(str_detect(New_Col, "1"), 1,0))
country supporter1 supporter2 supporter3 supporter4 supporter5 New_Col
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 USA Albania Germany USA NA NA 1
2 France USA France NA NA NA 1
3 UK UK Chile Peru NA NA 1
4 Germany USA Iran Mexico India Pakistan 0
5 USA China Spain NA NA NA 0
6 Cuba Cuba UK Germany South Korea NA 1
7 China Russia NA NA NA NA 0
Here is a base R solution.
First, mapply checks each supporter* column against country for equality. Comparisons involving NA are treated as FALSE. Then rowSums/as.integer turns rows with at least one TRUE into 1, and all other rows into 0.
eq <- mapply(\(x, y){x == y & !is.na(x)}, df1[-1], df1[1])
as.integer(rowSums(eq) != 0)
#[1] 1 1 1 0 0 1 0
df1$new_variable <- as.integer(rowSums(eq) != 0)
Data
df1 <- read.table(text = "
country supporter1 supporter2 supporter3 supporter4 supporter5
USA Albania Germany USA NA NA
France USA France NA NA NA
UK UK Chile Peru NA NA
Germany USA Iran Mexico India Pakistan
USA China Spain NA NA NA
Cuba Cuba UK Germany 'South Korea' NA
China Russia NA NA NA NA
", header = TRUE)
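The same row-wise equality check can also be sketched without mapply, by comparing the supporter columns with the country vector directly and letting rowSums skip the NA comparisons. The data frame below is rebuilt by hand from the question, and the names sup_cols and new_variable are illustrative only.

```r
# Data from the question, with real NA values
df1 <- data.frame(
  country    = c("USA", "France", "UK", "Germany", "USA", "Cuba", "China"),
  supporter1 = c("Albania", "USA", "UK", "USA", "China", "Cuba", "Russia"),
  supporter2 = c("Germany", "France", "Chile", "Iran", "Spain", "UK", NA),
  supporter3 = c("USA", NA, "Peru", "Mexico", NA, "Germany", NA),
  supporter4 = c(NA, NA, NA, "India", NA, "South Korea", NA),
  supporter5 = c(NA, NA, NA, "Pakistan", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Every column except "country" is treated as a supporter column
sup_cols <- setdiff(names(df1), "country")

# Comparing a data frame with a vector of length nrow() recycles the
# vector down each column, giving a logical matrix (NA where a supporter is NA)
eq <- df1[sup_cols] == df1$country

# na.rm = TRUE makes the NA comparisons count as "no match"
df1$new_variable <- as.integer(rowSums(eq, na.rm = TRUE) > 0)

print(df1$new_variable)
```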
Another solution is to check, row by row, whether country appears in one of the supporter columns (here the missing values are stored as the string "NA"):
df <- data.frame(country=c("USA","France","UK","Germany","USA","Cuba","China"),
supporter1=c("Albania","USA","UK","USA","China","Cuba","Russia"),
supporter2=c("Germany","France","Chile","Iran","Spain","UK","NA"),
supporter3=c("USA","NA","Peru","Mexico","NA","Germany","NA"),
supporter4=c("NA","NA","NA","India","NA","South Korea","NA"),
supporter5=c("NA","NA","NA","Pakistan","NA","NA","NA"))
That would give:
df$new <- sapply(seq(1,nrow(df)), function(x) ifelse(df$country[x] %in% df[x,2:6],1,0))
> df$new
[1] 1 1 1 0 0 1 0

Create a new column from conditions

I have a dataframe with information of some countries and states like this:
data.frame("state1"= c(NA,NA,"Beijing","Beijing","Schleswig-Holstein","Moskva",NA,"Moskva",NA,"Berlin"),
"country1"=c("Spain","Spain","China","China","Germany","Russia","Germany","Russia","Germany","Germany"),
"state2"= c(NA,NA,"Beijing",NA,NA,NA,"Moskva",NA,NA,NA),
"country2"=c("Germany","Germany","China","Germany","","Ukraine","Russia","Germany","Ukraine","" ),
"state3"= c(NA,NA,NA,NA,"Schleswig-Holstein",NA,NA,NA,NA,"Berlin"),
"country3"=c("Spain","Spain","Germany","Germany","Germany","Germany","Germany","Germany","Germany","Germany"))
Now I would like to create a new column containing the German states (the result would look like the one below).
When at least one of the three state variables is a German state, it should be assigned to the new variable.
data.frame("GE_State"=c(NA,NA,NA,NA, "Schleswig-Holstein",NA,NA,NA,NA,"Berlin"))
Please help a beginner with setting up the condition.
Thank you in advance!
Using dplyr::mutate() with case_when() works, although I suspect there should be a more efficient way using across()
library(dplyr)
df %>%
mutate(GE_state = case_when(country1 == "Germany" & !is.na(state1) ~ state1,
country2 == "Germany" & !is.na(state2) ~ state2,
country3 == "Germany" & !is.na(state3) ~ state3,
TRUE ~ NA_character_))
#> state1 country1 state2 country2 state3 country3
#> 1 <NA> Spain <NA> Germany <NA> Spain
#> 2 <NA> Spain <NA> Germany <NA> Spain
#> 3 Beijing China Beijing China <NA> Germany
#> 4 Beijing China <NA> Germany <NA> Germany
#> 5 Schleswig-Holstein Germany <NA> Schleswig-Holstein Germany
#> 6 Moskva Russia <NA> Ukraine <NA> Germany
#> 7 <NA> Germany Moskva Russia <NA> Germany
#> 8 Moskva Russia <NA> Germany <NA> Germany
#> 9 <NA> Germany <NA> Ukraine <NA> Germany
#> 10 Berlin Germany <NA> Berlin Germany
#> GE_state
#> 1 <NA>
#> 2 <NA>
#> 3 <NA>
#> 4 <NA>
#> 5 Schleswig-Holstein
#> 6 <NA>
#> 7 <NA>
#> 8 <NA>
#> 9 <NA>
#> 10 Berlin
Created on 2021-03-31 by the reprex package (v1.0.0)
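The case_when() call repeats the same condition for each of the three state/country pairs. One way to avoid that repetition, sketched here in base R rather than with across(), is to loop over the pair suffixes and keep the first German state found in each row. It assumes the columns follow the state<i>/country<i> naming of the question; ge_state and hit are illustrative names.

```r
df <- data.frame(
  state1   = c(NA, NA, "Beijing", "Beijing", "Schleswig-Holstein", "Moskva", NA, "Moskva", NA, "Berlin"),
  country1 = c("Spain", "Spain", "China", "China", "Germany", "Russia", "Germany", "Russia", "Germany", "Germany"),
  state2   = c(NA, NA, "Beijing", NA, NA, NA, "Moskva", NA, NA, NA),
  country2 = c("Germany", "Germany", "China", "Germany", "", "Ukraine", "Russia", "Germany", "Ukraine", ""),
  state3   = c(NA, NA, NA, NA, "Schleswig-Holstein", NA, NA, NA, NA, "Berlin"),
  country3 = c("Spain", "Spain", "Germany", "Germany", "Germany", "Germany", "Germany", "Germany", "Germany", "Germany"),
  stringsAsFactors = FALSE
)

ge_state <- rep(NA_character_, nrow(df))

for (i in 1:3) {
  state   <- df[[paste0("state", i)]]
  country <- df[[paste0("country", i)]]
  # Fill only rows that are still empty, have a state, and are in Germany;
  # %in% is used instead of == so a missing country never yields NA
  hit <- is.na(ge_state) & !is.na(state) & country %in% "Germany"
  ge_state[hit] <- state[hit]
}

df$GE_State <- ge_state
print(df$GE_State)
```

As with case_when(), earlier pairs take priority when several pairs in a row qualify.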
I think you want cbind() here:
df1 <- cbind(df1, df2)
Data:
df1 <- <your first data frame>
df2 <- data.frame("GE_State"=c(NA,NA,NA,NA, "Schleswig-Holstein",NA,NA,NA,NA,"Berlin"))

Reformat input-output table in R

I have an input output table with the origin (input field) as rows and the destination (output field) as columns. Here's an example:
Mexico Thailand Vietnam
USA 0 3 6
Italy 3 7 8
France 9 3 1
Germany 3 6 7
I want to convert the table so that the origin is in column1, destination is in column 2, and value is in column 3 so that it would look like this:
origin destination value
USA Mexico 0
USA Thailand 3
USA Vietnam 6
Italy Mexico 3
Italy Thailand 7
Italy Vietnam 8
France Mexico 9
France Thailand 3
France Vietnam 1
Germany Mexico 3
Germany Thailand 6
Germany Vietnam 7
There is a simple solution using the melt function from the reshape2 package:
#sample data
Mexico<-c(0, 3, 9,3)
Thailand <-c(3, 7, 3, 6)
Vietnam <-c(6, 8, 1, 7)
names<-c("USA", "Italy", "France", "Germany")
df<-data.frame(names, Mexico, Thailand, Vietnam)
library(reshape2)
melt(df, id.vars = "names")
The package "tidyr" has similar functionality:
library(tidyr)
gather(df, key = "destination", value = "value", -names)
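In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). A minimal sketch with the same sample data follows; the first column is called origin here instead of names, which is a naming choice for this example only. The rows come out grouped by origin, which matches the layout asked for in the question.

```r
library(tidyr)

# Same sample values as above; "origin" replaces the "names" column
df <- data.frame(
  origin   = c("USA", "Italy", "France", "Germany"),
  Mexico   = c(0, 3, 9, 3),
  Thailand = c(3, 7, 3, 6),
  Vietnam  = c(6, 8, 1, 7)
)

# Every column except "origin" is stacked into destination/value pairs
long <- pivot_longer(df, cols = -origin,
                     names_to = "destination", values_to = "value")

print(long)
```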

How to flatten a data.frame for use with a googleVis treemap?

In order to use the treemap function on googleVis, data needs to be flattened into two columns. Using their example:
> library(googleVis)
> Regions
Region Parent Val Fac
1 Global <NA> 10 2
2 America Global 2 4
3 Europe Global 99 11
4 Asia Global 10 8
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
However, in the real world this information more frequently looks like this:
> a <- data.frame(
+ scal=c("Global", "Global", "Global", "Global", "Global", "Global", "Global"),
+ cont=c("Europe", "Europe", "Europe", "America", "America", "Asia", "Asia"),
+ country=c("France", "Sweden", "Germany", "Mexico", "USA", "China", "Japan"),
+ val=c(71, 89, 58, 2, 38, 5, 48),
+ fac=c(2,3,10,9,11,1,11))
> a
scal cont country val fac
1 Global Europe France 71 2
2 Global Europe Sweden 89 3
3 Global Europe Germany 58 10
4 Global America Mexico 2 9
5 Global America USA 38 11
6 Global Asia China 5 1
7 Global Asia Japan 48 11
But how can this data be transformed most efficiently?
If we use dplyr, this script will transform the data correctly:
library(dplyr)
cbind(NA,a %>% group_by(scal) %>% summarize(val=sum(val),fac=sum(fac))) -> topLev
names(topLev) <- c("Parent","Region","val","fac")
a %>% group_by(scal,cont) %>% summarize(val=sum(val),fac=sum(fac)) %>%
select(Region=cont,Parent=scal,val,fac) -> midLev
a[,2:5] %>% select(Region=country,Parent=cont,val,fac) -> bottomLev
bind_rows(topLev,midLev,bottomLev) %>% select(2,1,3,4) -> answer
We can verify this by comparing dataframes:
> answer
Source: local data frame [11 x 4]
Region Parent val fac
1 Global NA 311 47
2 America Global 40 20
3 Asia Global 53 12
4 Europe Global 218 15
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
> Regions
Region Parent Val Fac
1 Global <NA> 10 2
2 America Global 2 4
3 Europe Global 99 11
4 Asia Global 10 8
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
Interestingly, in the googleVis example the summaries for the continents and the globe aren't the sums of their components (nor their min, max, mean, or a normalized value).

How to remove rows in data frame after frequency tables in R

I have 3 data frames in which I have to find the continents with fewer than 3 countries and remove those countries (rows). The data frames are structured similarly to the data frame x below:
row Country Continent Ranking
1 Kenya Africa 17
2 Gabon Africa 23
3 Spain Europe 04
4 Belgium Europe 03
5 China Asia 10
6 Nigeria Africa 14
7 Holland Europe 01
8 Italy Europe 05
9 Japan Asia 06
First I wanted to know the number of countries per continent, so I did
x2<-table(x$Continent)
x2
Africa Europe Asia
3 4 2
Then I wanted to identify the continents with fewer than 3 countries
x3 <- x2[x2 < 3]
x3
Asia
2
My problem now is how to remove these countries. In the example above, these are the 2 countries in Asia, and I want my final data set to look like this:
row Country Continent Ranking
1 Kenya Africa 17
2 Gabon Africa 23
3 Spain Europe 04
4 Belgium Europe 03
5 Nigeria Africa 14
6 Holland Europe 01
7 Italy Europe 05
The number of continents with fewer than 3 countries will vary among the different data frames, so I need one universal method that I can apply to all of them.
Try
library(dplyr)
x %>%
group_by(Continent) %>%
filter(n()>2)
# row Country Continent Ranking
#1 1 Kenya Africa 17
#2 2 Gabon Africa 23
#3 3 Spain Europe 04
#4 4 Belgium Europe 03
#5 6 Nigeria Africa 14
#6 7 Holland Europe 01
#7 8 Italy Europe 05
Or, using x2:
subset(x, Continent %in% names(x2)[x2>2])
# row Country Continent Ranking
#1 1 Kenya Africa 17
#2 2 Gabon Africa 23
#3 3 Spain Europe 04
#4 4 Belgium Europe 03
#6 6 Nigeria Africa 14
#7 7 Holland Europe 01
#8 8 Italy Europe 05
A very easy way with "data.table" would be:
library(data.table)
as.data.table(x)[, N := .N, by = Continent][N > 2]
# row Country Continent Ranking N
# 1: 1 Kenya Africa 17 3
# 2: 2 Gabon Africa 23 3
# 3: 3 Spain Europe 4 4
# 4: 4 Belgium Europe 3 4
# 5: 6 Nigeria Africa 14 3
# 6: 7 Holland Europe 1 4
# 7: 8 Italy Europe 5 4
In base R you can try:
x[with(x, ave(rep(TRUE, nrow(x)), Continent, FUN = function(y) length(y) > 2)), ]
# row Country Continent Ranking
# 1 1 Kenya Africa 17
# 2 2 Gabon Africa 23
# 3 3 Spain Europe 4
# 4 4 Belgium Europe 3
# 6 6 Nigeria Africa 14
# 7 7 Holland Europe 1
# 8 8 Italy Europe 5
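Since the question asks for one method that works across several data frames, any of the filters above can be wrapped in a small function and applied with lapply. The sketch below uses base R; the function name drop_small_continents, its min_countries argument (defaulting to 3, to match the > 2 used in the answers), and the second example data frame are all assumptions made for illustration.

```r
x <- data.frame(
  Country   = c("Kenya", "Gabon", "Spain", "Belgium", "China",
                "Nigeria", "Holland", "Italy", "Japan"),
  Continent = c("Africa", "Africa", "Europe", "Europe", "Asia",
                "Africa", "Europe", "Europe", "Asia"),
  Ranking   = c(17, 23, 4, 3, 10, 14, 1, 5, 6),
  stringsAsFactors = FALSE
)

# Keep only rows whose continent occurs at least min_countries times
drop_small_continents <- function(df, min_countries = 3) {
  counts <- table(df$Continent)
  keep   <- names(counts)[counts >= min_countries]
  df[df$Continent %in% keep, , drop = FALSE]
}

# A made-up second data frame, to show the same call on several inputs
frames  <- list(first = x, second = x[x$Continent != "Europe", ])
cleaned <- lapply(frames, drop_small_continents)

print(cleaned$first)
```

Each element of cleaned is the filtered version of the matching input, with the original row names preserved.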
