I have 3 data frames in which I have to find the continents represented by 2 or fewer countries and remove those countries (rows). The data frames are structured in a manner similar to the data frame called x below:
row Country Continent Ranking
1 Kenya Africa 17
2 Gabon Africa 23
3 Spain Europe 04
4 Belgium Europe 03
5 China Asia 10
6 Nigeria Africa 14
7 Holland Europe 01
8 Italy Europe 05
9 Japan Asia 06
First I wanted to know the number of countries per continent, so I did:
x2 <- table(x$Continent)
x2
Africa Europe Asia
3 4 2
Then I wanted to identify the continents with 2 or fewer countries:
x3 <- x2[x2 <= 2]
x3
Asia
2
My problem now is how to remove these countries. For the example above that means the 2 countries in Asia, and I want my final data set to look like the one presented below:
row Country Continent Ranking
1 Kenya Africa 17
2 Gabon Africa 23
3 Spain Europe 04
4 Belgium Europe 03
5 Nigeria Africa 14
6 Holland Europe 01
7 Italy Europe 05
The number of continents with 2 or fewer countries will vary among the different data frames, so I need one universal method that I can apply to all of them.
Try
library(dplyr)
x %>%
  group_by(Continent) %>%
  filter(n() > 2)
# row Country Continent Ranking
#1 1 Kenya Africa 17
#2 2 Gabon Africa 23
#3 3 Spain Europe 04
#4 4 Belgium Europe 03
#5 6 Nigeria Africa 14
#6 7 Holland Europe 01
#7 8 Italy Europe 05
Or, using the x2 table computed above:
subset(x, Continent %in% names(x2)[x2 > 2])
# row Country Continent Ranking
#1 1 Kenya Africa 17
#2 2 Gabon Africa 23
#3 3 Spain Europe 04
#4 4 Belgium Europe 03
#6 6 Nigeria Africa 14
#7 7 Holland Europe 01
#8 8 Italy Europe 05
A very easy way with "data.table" would be:
library(data.table)
as.data.table(x)[, N := .N, by = Continent][N > 2]
# row Country Continent Ranking N
# 1: 1 Kenya Africa 17 3
# 2: 2 Gabon Africa 23 3
# 3: 3 Spain Europe 4 4
# 4: 4 Belgium Europe 3 4
# 5: 6 Nigeria Africa 14 3
# 6: 7 Holland Europe 1 4
# 7: 8 Italy Europe 5 4
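If you don't want to keep the helper column N in the result, one option (my addition, not part of the original answer) is to drop it at the end of the chain:
library(data.table)
# same idea as above, but removing the helper count column afterwards
as.data.table(x)[, N := .N, by = Continent][N > 2][, N := NULL][]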
In base R you can try:
x[with(x, ave(rep(TRUE, nrow(x)), Continent, FUN = function(y) length(y) > 2)), ]
# row Country Continent Ranking
# 1 1 Kenya Africa 17
# 2 2 Gabon Africa 23
# 3 3 Spain Europe 4
# 4 4 Belgium Europe 3
# 6 6 Nigeria Africa 14
# 7 7 Holland Europe 1
# 8 8 Italy Europe 5
Related
I have a large dataset like the following and I'm trying to add values to 3 columns based on the column Country.
Country<-c("Asia","Africa - Benin (Cotonou)",
"Europe - France (Paris)","Asia - China(Shanghai)", "Europe - United Kingdom (London)", "Europe - France (Orléans)"
, "Afrique - Togo (Lomé)", "Afrique - Sénégal (Dakar)", "Asia - Pakistan (Rahim Yar Khan)")
ID<-c(1,2,3,4,5,6,7,8,9)
mydata<-data.frame(ID,Country)
> mydata
ID Country col1 col2 col3
1 1 Asia
2 2 Africa - Benin (Cotonou)
3 3 Europe - France (Paris)
4 4 Asia - China(Shanghai)
5 5 Europe - United Kingdom (London)
6 6 Europe - France (Orléans)
7 7 Afrique - Togo (Lomé)
8 8 Afrique - Sénégal (Dakar)
9 9 Asia - Pakistan (Rahim Yar Khan)
I tried the following, but I'm having issues with the regular expression:
library(tidyr)
mydata <- mydata %>% separate(col = "Country", into = c("Col1", "Col2", "Col3"), remove = FALSE, fill = "right")
The result that I get is the following:
ID Country Col1 Col2 Col3
1 Asia Asia <NA> <NA>
2 Africa - Benin (Cotonou) Africa Benin Cotonou
3 Europe - France (Paris) Europe France Paris
4 Asia - China(Shanghai) Asia China Shanghai
5 Europe - United Kingdom (London) Europe United Kingdom
6 Europe - France (Orléans) Europe France Orl
7 Afrique - Togo (Lomé) Afrique Togo L
8 Afrique - Sénégal (Dakar) Afrique S n
9 Asia - Pakistan (Rahim Yar Khan) Asia Pakistan Rahim
Some parts are missing in Col3 in rows 5, 6, 7, 8 and 9.
The result that I want is the following:
ID Country Col1 Col2 Col3
1 Asia Asia <NA> <NA>
2 Africa - Benin (Cotonou) Africa Benin Cotonou
3 Europe - France (Paris) Europe France Paris
4 Asia - China(Shanghai) Asia China Shanghai
5 Europe - United Kingdom (London) Europe United Kingdom London
6 Europe - France (Orléans) Europe France Orléans
7 Afrique - Togo (Lomé) Afrique Togo Lomé
8 Afrique - Sénégal (Dakar) Afrique Sénégal Dakar
9 Asia - Pakistan (Rahim Yar Khan) Asia Pakistan Rahim Yar Khan
Any suggestion on how to do this?
This is my first contribution, so please forgive me if I am wrong.
I did it this way; it may not be the easiest way, but it worked:
library(dplyr)
library(tidyr)
library(stringr)

mydata %>%
  separate(col = "Country",
           sep = "[\\(-]",
           into = c("Col1", "Col2", "Col3"),
           remove = FALSE,
           fill = "right") %>%
  mutate(Col3 = str_remove(Col3, "\\)"))
Update: to remove the extra spaces we could add this line at the end of the code:
mutate(across(starts_with("col"), str_squish))
We could replace the first separator " - " with "(", so that we are left with a single kind of separator. Afterwards we do separate and finally remove the remaining ")"; a sketch of this follows the library calls below.
library(dplyr)
library(stringr)
library(tidyr)
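A sketch of that pipeline (the exact code was not shown; this is one way to implement the steps described above, with a final str_trim added as an assumption to drop the leftover spaces):
mydata %>%
  mutate(Country = str_replace(Country, " - ", "(")) %>%   # first separator becomes "("
  separate(Country, into = c("col1", "col2", "col3"),
           sep = "\\(", fill = "right") %>%                # now only one separator
  mutate(col3 = str_remove(col3, "\\)"),                   # drop the remaining ")"
         across(col1:col3, str_trim))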
ID col1 col2 col3
1 1 Asia <NA> <NA>
2 2 Africa Benin Cotonou
3 3 Europe France Paris
4 4 Asia China Shanghai
5 5 Europe United Kingdom London
6 6 Europe France Orléans
7 7 Afrique Togo Lomé
8 8 Afrique Sénégal Dakar
9 9 Asia Pakistan Rahim Yar Khan
tidyr::separate splits text into columns based on a delimiter (by default any sequence of non-alphanumeric characters), so it separates on spaces as well. You can use the extra argument to merge all the remaining text into the 3rd column, like so:
mydata %>%
separate(Country,
into = c("Col1", "Col2", "Col3"),
extra = "merge")
ID Col1 Col2 Col3
1 1 Asia <NA> <NA>
2 2 Africa Benin Cotonou)
3 3 Europe France Paris)
4 4 Asia China Shanghai)
5 5 Europe United Kingdom (London)
6 6 Europe France Orléans)
7 7 Afrique Togo Lomé)
8 8 Afrique Sénégal Dakar)
9 9 Asia Pakistan Rahim Yar Khan)
Warning message:
Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
However, with this we get an unnecessary ) at the end. You can either remove it via a mutate or, instead of separate, use tidyr::extract, which extracts based on a regex:
mydata %>%
extract(Country,
into = c("Col1", "Col2", "Col3"),
regex = "([[:alnum:]]+) - ([[:alnum:]]+) ?\\((.*)\\)")
ID Col1 Col2 Col3
1 1 <NA> <NA> <NA>
2 2 Africa Benin Cotonou
3 3 Europe France Paris
4 4 Asia China Shanghai
5 5 <NA> <NA> <NA>
6 6 Europe France Orléans
7 7 Afrique Togo Lomé
8 8 Afrique Sénégal Dakar
9 9 Asia Pakistan Rahim Yar Khan
library(dplyr)
library(tidyr)
mydata %>%
  separate(Country, into = c("col1", "col2", "col3"),
           sep = '( - | ?\\()', remove = FALSE) %>%
  mutate(col3 = gsub(')', '', col3))
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
#> ID Country col1 col2 col3
#> 1 1 Asia Asia <NA> <NA>
#> 2 2 Africa - Benin (Cotonou) Africa Benin Cotonou
#> 3 3 Europe - France (Paris) Europe France Paris
#> 4 4 Asia - China(Shanghai) Asia China Shanghai
#> 5 5 Europe - United Kingdom (London) Europe United Kingdom London
#> 6 6 Europe - France (Orléans) Europe France Orléans
#> 7 7 Afrique - Togo (Lomé) Afrique Togo Lomé
#> 8 8 Afrique - Sénégal (Dakar) Afrique Sénégal Dakar
#> 9 9 Asia - Pakistan (Rahim Yar Khan) Asia Pakistan Rahim Yar Khan
A data.table solution:
require(data.table)
setDT(mydata)

# Split a country string on any punctuation ("-", "(", ")") and return the
# first three trimmed pieces, one list element per new column
splitCountry <- function(c_str) {
  vec <- trimws(unlist(strsplit(as.character(c_str), "[[:punct:]]")))
  list(vec[1], vec[2], vec[3])
}

mydata[, c('col1', 'col2', 'col3') := splitCountry(Country), by = Country]
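A more compact variant (my own variation, not from the original answer) uses data.table::tstrsplit to do the split in one vectorised step, reusing the setDT(mydata) call above:
# split on " - ", "(" or ")" in one pass, then trim the leftover whitespace
mydata[, c('col1', 'col2', 'col3') := lapply(
  tstrsplit(as.character(Country), " - |\\(|\\)"), trimws)]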
Given a data frame like this:
country rest count
Argentina pizza 26
Argentina asador 22
Brazil feijoada 52
Brazil pizza 67
Germany pizza 22
Germany biergarten 52
Germany kebab 20
Let's suppose we want every unique value in the 'rest' column to be represented for every country in the data frame, even where there is no value. My desired output would look like this:
country rest count
Argentina pizza 26
Argentina asador 22
Argentina feijoada 0
Argentina biergarten 0
Argentina kebab 0
Brazil pizza 67
Brazil feijoada 52
Brazil asador 0
Brazil biergarten 0
Brazil kebab 0
Germany pizza 22
Germany biergarten 52
Germany kebab 20
Germany asador 0
Germany feijoada 0
Is there any simple way to reach this output through dplyr?
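For reference, the example data can be rebuilt like this (the name dat is an assumption, chosen to match the answer below):
dat <- data.frame(
  country = c("Argentina", "Argentina", "Brazil", "Brazil", "Germany", "Germany", "Germany"),
  rest    = c("pizza", "asador", "feijoada", "pizza", "pizza", "biergarten", "kebab"),
  count   = c(26, 22, 52, 67, 22, 52, 20)
)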
tidyr::complete(dat, country, rest, fill=list(count=0))
# # A tibble: 15 x 3
# country rest count
# <chr> <chr> <dbl>
# 1 Argentina asador 22
# 2 Argentina biergarten 0
# 3 Argentina feijoada 0
# 4 Argentina kebab 0
# 5 Argentina pizza 26
# 6 Brazil asador 0
# 7 Brazil biergarten 0
# 8 Brazil feijoada 52
# 9 Brazil kebab 0
# 10 Brazil pizza 67
# 11 Germany asador 0
# 12 Germany biergarten 52
# 13 Germany feijoada 0
# 14 Germany kebab 20
# 15 Germany pizza 22
This question already has answers here: How to sum a variable by group (18 answers) and Aggregate / summarize multiple variables per group (e.g. sum, mean) (10 answers).
My data set sometimes contains multiple observations for the same year as below.
id country ccode year region protest protestnumber duration
201990001 Canada 20 1990 North America 1 1 1
201990002 Canada 20 1990 North America 1 2 1
201990003 Canada 20 1990 North America 1 3 1
201990004 Canada 20 1990 North America 1 4 57
201990005 Canada 20 1990 North America 1 5 2
201990006 Canada 20 1990 North America 1 6 1
201991001 Canada 20 1991 North America 1 1 8
201991002 Canada 20 1991 North America 1 2 5
201992001 Canada 20 1992 North America 1 1 2
201993001 Canada 20 1993 North America 1 1 1
201993002 Canada 20 1993 North America 1 2 62
201994001 Canada 20 1994 North America 1 1 1
201994002 Canada 20 1994 North America 1 2 1
201995001 Canada 20 1995 North America 1 1 1
201995002 Canada 20 1995 North America 1 2 1
201996001 Canada 20 1996 North America 1 1 1
201997001 Canada 20 1997 North America 1 1 13
201997002 Canada 20 1997 North America 1 2 16
I need to sum up all observations for the same year so that I receive one value per year in every numeric column. I want to iterate this through the whole data set for all years and countries. Any help is much appreciated. Thank you!
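The linked duplicates cover this in detail; a minimal sketch with dplyr (assuming the data frame is called df and that only the numeric columns protest, protestnumber and duration should be summed, with id dropped) would be:
library(dplyr)
# one row per country and year, summing the numeric protest columns
df %>%
  group_by(country, ccode, year, region) %>%
  summarise(across(c(protest, protestnumber, duration), sum), .groups = "drop")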
I have a data frame like this:
indx country year death value
1 1 Italy 2000 hiv 1
2 1 Italy 2001 hiv 2
3 1 Italy 2005 hiv 3
4 1 Italy 2000 cancer 4
5 1 Italy 2001 cancer 5
6 1 Italy 2002 cancer 6
7 1 Italy 2003 cancer 7
8 1 Italy 2004 cancer 8
9 1 Italy 2005 cancer 9
10 4 France 2000 hiv 10
11 4 France 2004 hiv 11
12 4 France 2005 hiv 12
13 4 France 2001 cancer 13
14 4 France 2002 cancer 14
15 4 France 2003 cancer 15
16 4 France 2004 cancer 16
17 2 Spain 2000 hiv 17
18 2 Spain 2001 hiv 18
19 2 Spain 2002 hiv 19
20 2 Spain 2003 hiv 20
21 2 Spain 2004 hiv 21
22 2 Spain 2005 hiv 22
23 2 Spain ... ... ...
indx is a value linked to the country (same country = same indx).
In this example I used only 3 countries (country) and 2 diseases (death); in the original data frame there are many more.
I would like to have one row for each country for each disease from 2000 to 2005.
What I would like to get is:
indx country year death value
1 1 Italy 2000 hiv 1
2 1 Italy 2001 hiv 2
3 1 Italy 2002 hiv NA
4 1 Italy 2003 hiv NA
5 1 Italy 2004 hiv NA
6 1 Italy 2005 hiv 3
7 1 Italy 2000 cancer 4
8 1 Italy 2001 cancer 5
9 1 Italy 2002 cancer 6
10 1 Italy 2003 cancer 7
11 1 Italy 2004 cancer 8
12 1 Italy 2005 cancer 9
13 4 France 2000 hiv 10
14 4 France 2001 hiv NA
15 4 France 2002 hiv NA
16 4 France 2003 hiv NA
17 4 France 2004 hiv 11
18 4 France 2005 hiv 12
19 4 France 2000 cancer NA
20 4 France 2001 cancer 13
21 4 France 2002 cancer 14
22 4 France 2003 cancer 15
23 4 France 2004 cancer 16
24 4 France 2005 cancer NA
25 2 Spain 2000 hiv 17
26 2 Spain 2001 hiv 18
27 2 Spain 2002 hiv 19
28 2 Spain 2003 hiv 20
29 2 Spain 2004 hiv 21
30 2 Spain 2005 hiv 22
31 2 Spain ... ... ...
I.e. I would like to add lines with value = NA for the missing years, for each country and each disease.
For example, data on HIV in Italy is missing between 2002 and 2004, so I add these lines with value = NA.
How can I do that?
For a reproducible example:
indx <- c(rep(1, times=9), rep(4, times=7), rep(2, times=6))
country <- c(rep("Italy", times=9), rep("France", times=7), rep("Spain", times=6))
year <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death <- c(rep("hiv", times=3), rep("cancer", times=6), rep("hiv", times=3), rep("cancer", times=4), rep("hiv", times=6))
value <- c(1:22)
dfl <- data.frame(indx, country, year, death, value)
Using base R, you could do:
# setDF(dfl) # run this first if you have a data.table
merge(expand.grid(lapply(dfl[c("country", "death", "year")], unique)), dfl, all.x = TRUE)
This first creates all combinations of the unique values in country, death, and year and then merges them with the original data; combinations that were not in the original data get NA for the remaining columns.
In the package tidyr, there's a special function that does this for you with a single command:
library(tidyr)
complete(dfl, country, year, death)
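One thing to note (my observation, not part of the original answer): indx is not listed in that call, so it will be NA in the newly added rows. If indx should stay paired with country, tidyr::nesting lets the two columns expand together:
library(tidyr)
# expand year and death fully, but keep indx/country pairs as they occur in the data
complete(dfl, nesting(indx, country), year, death)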
Here is a longer base R method (the question's dfl is referred to as df below). You create two new data.frames: one that contains all combinations of country, year, and death, and a second that contains an index key.
# get data.frame with every combination of country, year, and death
dfNew <- with(df, expand.grid("country"=unique(country), "year"=unique(year),
"death"=unique(death)))
# get index key
indexKey <- unique(df[, c("indx", "country")])
# merge these together
dfNew <- merge(indexKey, dfNew, by="country")
# merge onto original data set
dfNew <- merge(df, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)
This returns
dfNew
indx country year death value
1 1 Italy 2000 cancer 4
2 1 Italy 2000 hiv 1
3 1 Italy 2001 cancer 5
4 1 Italy 2001 hiv 2
5 1 Italy 2002 cancer 6
6 1 Italy 2002 hiv NA
7 1 Italy 2003 cancer 7
8 1 Italy 2003 hiv NA
9 1 Italy 2004 cancer 8
10 1 Italy 2004 hiv NA
11 1 Italy 2005 cancer 9
12 1 Italy 2005 hiv 3
13 2 Spain 2000 cancer NA
14 2 Spain 2000 hiv 17
15 2 Spain 2001 cancer NA
...
If df is a data.table, here are the corresponding lines of code:
# CJ is a cross-join
setkey(df, country, year, death)
dfNew <- df[CJ(country, year, death, unique=TRUE),
.(country, year, death, value)]
indexKey <- unique(df[, .(indx, country)])
dfNew <- merge(indexKey, dfNew, by="country")
dfNew <- merge(df, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)
Note that, rather than using CJ, it is also possible to use expand.grid as in the data.frame version:
dfNew <- df[, expand.grid("country"=unique(country), "year"=unique(year),
"death"=unique(death))]
tidyr::complete helps create all combinations of the variables you pass it, but if you have two columns that carry the same information (like indx and country here), it will over-expand or leave NAs where you don't want them. As a workaround you can use dplyr grouping (df %>% group_by(indx, country) %>% complete(death, year)) or just merge the two columns into one temporarily:
library(tidyr)
# merge indx and country into a single column so they won't over-expand
df %>% unite(indx_country, indx, country) %>%
# fill in missing combinations of new column, death, and year
complete(indx_country, death, year) %>%
# separate indx and country back to how they were
separate(indx_country, c('indx', 'country'))
# Source: local data frame [36 x 5]
#
# indx country death year value
# (chr) (chr) (fctr) (int) (int)
# 1 1 Italy cancer 2000 4
# 2 1 Italy cancer 2001 5
# 3 1 Italy cancer 2002 6
# 4 1 Italy cancer 2003 7
# 5 1 Italy cancer 2004 8
# 6 1 Italy cancer 2005 9
# 7 1 Italy hiv 2000 1
# 8 1 Italy hiv 2001 2
# 9 1 Italy hiv 2002 NA
# 10 1 Italy hiv 2003 NA
# .. ... ... ... ... ...
In order to use the treemap function (gvisTreeMap) in googleVis, the hierarchy needs to be flattened into two columns, a node and its parent. Using their example:
> library(googleVis)
> Regions
Region Parent Val Fac
1 Global <NA> 10 2
2 America Global 2 4
3 Europe Global 99 11
4 Asia Global 10 8
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
However, in the real world this information more frequently looks like this:
> a <- data.frame(
+ scal=c("Global", "Global", "Global", "Global", "Global", "Global", "Global"),
+ cont=c("Europe", "Europe", "Europe", "America", "America", "Asia", "Asia"),
+ country=c("France", "Sweden", "Germany", "Mexico", "USA", "China", "Japan"),
+ val=c(71, 89, 58, 2, 38, 5, 48),
+ fac=c(2,3,10,9,11,1,11))
> a
scal cont country val fac
1 Global Europe France 71 2
2 Global Europe Sweden 89 3
3 Global Europe Germany 58 10
4 Global America Mexico 2 9
5 Global America USA 38 11
6 Global Asia China 5 1
7 Global Asia Japan 48 11
But how to transform this data most efficiently?
If we use dplyr, this script will transform the data correctly:
library(dplyr)
cbind(NA,a %>% group_by(scal) %>% summarize(val=sum(val),fac=sum(fac))) -> topLev
names(topLev) <- c("Parent","Region","val","fac")
a %>% group_by(scal,cont) %>% summarize(val=sum(val),fac=sum(fac)) %>%
select(Region=cont,Parent=scal,val,fac) -> midLev
a[,2:5] %>% select(Region=country,Parent=cont,val,fac) -> bottomLev
bind_rows(topLev,midLev,bottomLev) %>% select(2,1,3,4) -> answer
We can verify this by comparing dataframes:
> answer
Source: local data frame [11 x 4]
Region Parent val fac
1 Global NA 311 47
2 America Global 40 20
3 Asia Global 53 12
4 Europe Global 218 15
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
> Regions
Region Parent Val Fac
1 Global <NA> 10 2
2 America Global 2 4
3 Europe Global 99 11
4 Asia Global 10 8
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
Interesting that the summaries for the continents and the globe in the original Regions example aren't the sum of their components (or a min/max/average/normalised value...).
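To close the loop (my addition, not shown in the original answer), the flattened answer data frame can then be passed to gvisTreeMap just like the package's Regions example:
library(googleVis)
# build the treemap from the flattened hierarchy computed above
tm <- gvisTreeMap(answer, idvar = "Region", parentvar = "Parent",
                  sizevar = "val", colorvar = "fac")
plot(tm)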