In the following data.df, rows 2 and 3 are identical, and row 4 differs from them only in its mean.
iso3 dest code year uv mean
1 ALB AUT 490700 2002 14027.2433 427387.640
2 ALB BGR 490700 2002 1215.5613 11886.494
3 ALB BGR 490700 2002 1215.5613 11886.494
4 ALB BGR 490700 2002 1215.5613 58069.405
5 ALB BGR 843050 2002 677.9827 4272.176
6 ALB BGR 851030 2002 31004.0946 32364.379
7 ALB HRV 392329 2002 1410.0072 6970.930
Is there an easy way to find these matching rows automatically?
I found this subject, which seems to answer the question, but I do not understand how duplicated() works...
What I would like is a "simple" command where I can specify which columns must have identical values across rows,
something like: function(data.df, c(iso3, dest, code, year, uv, mean))
to find the exactly identical rows, and function(data.df, c(iso3, dest, code, year, uv)) to find the "quasi" identical rows.
The expected result would be something like this, in the first case:
2 ALB BGR 490700 2002 1215.5613 11886.494
3 ALB BGR 490700 2002 1215.5613 11886.494
and in the second one:
2 ALB BGR 490700 2002 1215.5613 11886.494
3 ALB BGR 490700 2002 1215.5613 11886.494
4 ALB BGR 490700 2002 1215.5613 58069.405
Any ideas?
We could write a function and then pass columns which we want to consider.
get_duplicated_rows <- function(df, cols) {
df[duplicated(df[cols]) | duplicated(df[cols], fromLast = TRUE), ]
}
get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv","mean"))
# iso3 dest code year uv mean
#2 ALB BGR 490700 2002 1215.6 11886
#3 ALB BGR 490700 2002 1215.6 11886
get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv"))
# iso3 dest code year uv mean
#2 ALB BGR 490700 2002 1215.6 11886
#3 ALB BGR 490700 2002 1215.6 11886
#4 ALB BGR 490700 2002 1215.6 58069
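Since the question asks how duplicated() works, here is a minimal sketch on a toy vector (the vector x and the variable names are made up for illustration). duplicated() flags only the second and later occurrences of a value, which is why the function above ORs it with a second pass run from the end (fromLast = TRUE): together the two passes flag every member of a repeated group.

```r
x <- c("a", "b", "b", "b", "c")

# Forward pass: the first "b" is not flagged, only the later ones
forward <- duplicated(x)
# FALSE FALSE  TRUE  TRUE FALSE

# Backward pass: the last "b" is not flagged, only the earlier ones
backward <- duplicated(x, fromLast = TRUE)
# FALSE  TRUE  TRUE FALSE FALSE

# Either pass: every element that has a twin somewhere in x
either <- forward | backward
# FALSE  TRUE  TRUE  TRUE FALSE
```

Applied to a data frame, duplicated() compares whole rows of the columns it is given, which is how df[cols] restricts the comparison to a subset.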
You can find the quasi-duplicates by checking each column separately for repeated values and then keeping the rows whose count of duplicated columns (the row sum) exceeds a chosen cutoff.
toread <- " iso3 dest code year uv mean
ALB AUT 490700 2002 14027.2433 427387.640
ALB BGR 490700 2002 1215.5613 11886.494
ALB BGR 490700 2002 1215.5613 11886.494
ALB BGR 490700 2002 1215.5613 58069.405
ALB BGR 843050 2002 677.9827 4272.176
ALB BGR 851030 2002 31004.0946 32364.379
ALB HRV 392329 2002 1410.0072 6970.930"
# read.table() can parse the string directly, so no connection has to be opened or closed
df <- read.table(text = toread, header = TRUE)
get_quasi_duplicated_rows <- function(df, cols, cut){
result <- matrix(nrow = nrow(df), ncol = length(cols))
colnames(result) <- cols
for(col in cols){
dup <- duplicated(df[col]) | duplicated(df[col], fromLast = TRUE)
result[ , col] <- dup
}
return(df[which(rowSums(result) > cut), ])
}
get_quasi_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv","mean"), 4)
iso3 dest code year uv mean
2 ALB BGR 490700 2002 1215.561 11886.49
3 ALB BGR 490700 2002 1215.561 11886.49
4 ALB BGR 490700 2002 1215.561 58069.40
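Using the dplyr, rlang, and janitor packages (get_dupes() comes from janitor), we can achieve this.
Solution:
library(dplyr)
library(janitor)
find_dupes <- function(df, cols) {
  # turn the column-name strings into symbols and splice them into get_dupes()
  df %>% get_dupes(!!!rlang::syms(cols))
}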
Output-
1st Case-
> cols
[1] "iso3" "dest" "code" "year" "uv"
> find_dupes(df, cols)
# A tibble: 3 x 7
iso3 dest code year uv dupe_count mean
<fct> <fct> <int> <int> <dbl> <int> <dbl>
1 ALB BGR 490700 2002 1216. 3 11886.
2 ALB BGR 490700 2002 1216. 3 11886.
3 ALB BGR 490700 2002 1216. 3 58069.
2nd Case-
> cols
[1] "iso3" "dest" "code" "year" "uv" "mean"
> find_dupes(df,cols)
# A tibble: 2 x 7
iso3 dest code year uv mean dupe_count
<fct> <fct> <int> <int> <dbl> <dbl> <int>
1 ALB BGR 490700 2002 1216. 11886. 2
2 ALB BGR 490700 2002 1216. 11886. 2
Note:
rlang::syms() takes strings as input and turns them into symbols. Unlike as.name(), it converts the strings to the native encoding first. This is necessary because symbols silently drop the encoding mark of strings.
To pass a vector of column names to a dplyr function, we use syms().
!!! unquotes the list of symbols and splices them into the call as separate arguments.
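As a rough illustration of what syms() and !!! do together, the sketch below splices two column names into dplyr::count() on the built-in mtcars data. mtcars and the chosen columns are stand-ins for illustration, not part of the original data.

```r
library(dplyr)

cols <- c("cyl", "gear")

# syms() turns the strings into a list of symbols: list(cyl, gear)
col_syms <- rlang::syms(cols)

# !!! splices that list in, so this call behaves like count(mtcars, cyl, gear)
spliced <- mtcars %>% count(!!!col_syms)
direct  <- mtcars %>% count(cyl, gear)
```

Both calls should give the same table, which is the point: the character vector ends up being treated as if the column names had been typed in by hand.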
We can use group_by_all and then filter for the groups that contain more than one row:
library(dplyr)
df1 %>%
group_by_all() %>%
filter(n() > 1)
# A tibble: 2 x 6
# Groups: iso3, dest, code, year, uv, mean [1]
# iso3 dest code year uv mean
# <chr> <chr> <int> <int> <dbl> <dbl>
#1 ALB BGR 490700 2002 1216. 11886.
#2 ALB BGR 490700 2002 1216. 11886.
If only a subset of columns should be compared, use group_by_at:
df1 %>%
group_by_at(vars(iso3, dest, code, year, uv)) %>%
filter(n() > 1)
# A tibble: 3 x 6
# Groups: iso3, dest, code, year, uv [1]
# iso3 dest code year uv mean
# <chr> <chr> <int> <int> <dbl> <dbl>
#1 ALB BGR 490700 2002 1216. 11886.
#2 ALB BGR 490700 2002 1216. 11886.
#3 ALB BGR 490700 2002 1216. 58069.
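In dplyr 1.0.0 and later, group_by_all() and group_by_at() are superseded by across(). Below is a minimal sketch of the same filter with the columns given as a character vector. It assumes a df1 rebuilt from the first four rows of the question's data, and the names cols and quasi_dupes are made up for the example.

```r
library(dplyr)

# First four rows of the question's data, rebuilt so the block runs on its own
df1 <- data.frame(
  iso3 = "ALB",
  dest = c("AUT", "BGR", "BGR", "BGR"),
  code = 490700L,
  year = 2002L,
  uv   = c(14027.2433, 1215.5613, 1215.5613, 1215.5613),
  mean = c(427387.640, 11886.494, 11886.494, 58069.405)
)

# Columns that must match for two rows to count as "quasi" duplicates
cols <- c("iso3", "dest", "code", "year", "uv")

quasi_dupes <- df1 %>%
  group_by(across(all_of(cols))) %>%  # group on the named columns only
  filter(n() > 1) %>%                 # keep groups with more than one row
  ungroup()
```

Passing names(df1) as cols would reproduce the exact-duplicate case.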
I am using the dataset who (available in the tidyr package), which for 34 years counts the number of TB cases registered for 56 groups (combinations of gender, age and method of testing) for a number of countries. There is one row per country per year, and the first 4 columns hold the year, country name and similar identifiers.
I want to calculate the sum of new cases per country per year, but I just can't make it work.
I was expecting that something like
group_by(who, country) %>% summarise(count = rowsum(.[5:60]))
would work, but it doesn't.
Can anyone help me understand why it doesn't work, and what to do instead?
You're missing a first step, which is to gather the data into a 'tidy' format. Try this:
who%>%
gather(key=type,value=cases,-country,-iso2,-iso3,-year)%>%
filter(!is.na(cases))%>%
group_by(country,year)%>%
summarise(sum(cases))
Which gives output:
# A tibble: 3,484 × 3
# Groups: country [219]
country year `sum(cases)`
<chr> <int> <int>
1 Afghanistan 1997 128
2 Afghanistan 1998 1778
3 Afghanistan 1999 745
4 Afghanistan 2000 2666
5 Afghanistan 2001 4639
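One likely reason the original attempt fails is that base R's rowsum() (lowercase) is a different function from rowSums(). rowsum() adds up rows within groups and requires a group argument, while rowSums() returns one total per row. A small sketch on a made-up matrix (m and the group labels are purely illustrative):

```r
# Toy matrix: rows are (1, 4), (2, 5), (3, 6)
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))

# rowSums(): one total per row
per_row <- rowSums(m)
# 5 7 9

# rowsum(): column totals within each group of rows
per_group <- rowsum(m, group = c("x", "x", "y"))
#   a b
# x 3 9
# y 3 6
```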
library(tidyverse)
(long_who <- who |> pivot_longer(cols = -(1:4)))
long_who |> filter(startsWith(name,"new")) |> # keep only the new-case columns, not things like "Population"
group_by(country) |>
summarise(sum_of_new_ = sum(value,na.rm=TRUE))
A base R approach:
data.frame(who[,c("country", "year")],
cnt = rowSums(who[5:60], na.rm = TRUE))
#>       country year cnt
#> 1 Afghanistan 1980 0
#> 2 Afghanistan 1981 0
#> 3 Afghanistan 1982 0
#> 4 Afghanistan 1983 0
#> 5 Afghanistan 1984 0
#> 6 Afghanistan 1985 0
You could also do without the long format by using rowSums and across:
library(dplyr)
who |>
group_by(country, year) |>
summarise(count = rowSums(across(5:58), na.rm = TRUE)) |>
ungroup()
Alternatives to across(5:58):
across(starts_with("new"))
across(-(1:4))
Output:
# A tibble: 20 × 3
# Groups: country [1]
country year count
<chr> <int> <dbl>
1 Afghanistan 1980 0
2 Afghanistan 1981 0
3 Afghanistan 1982 0
4 Afghanistan 1983 0
5 Afghanistan 1984 0
6 Afghanistan 1985 0
7 Afghanistan 1986 0
8 Afghanistan 1987 0
9 Afghanistan 1988 0
10 Afghanistan 1989 0
11 Afghanistan 1990 0
12 Afghanistan 1991 0
13 Afghanistan 1992 0
14 Afghanistan 1993 0
15 Afghanistan 1994 0
16 Afghanistan 1995 0
17 Afghanistan 1996 0
18 Afghanistan 1997 128
19 Afghanistan 1998 1778
20 Afghanistan 1999 745
I have the following dataset with country, year, and GDP:
What I have
Country      Year  GDP
Afghanistan  1950  $123
Afghanistan  1951  $123
Afghanistan  2019  $123
Australia    1945  $123
Australia    2021  $123
What I need is to create or delete rows so that each country has one row per year from 1948 to 2021. For example, for Afghanistan I need to create the rows for 1948 to 1949 and for 2021 with a null GDP, and for Australia I need to delete the 1945 row and create everything in between.
This isn't my exact dataset; I have 200+ countries, each with different years. Is there an easy way to do this?
What I need
Country      Year  GDP
Afghanistan  1948  NA
...          ...   ...
Afghanistan  2021  NA
Australia    1948  $123
...          ...   ...
Australia    2021  $123
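We can use complete to create the missing combinations and fill the missing GDP values with 0:
library(dplyr)
library(tidyr)
# the replacement values go in the named fill argument; 0 assumes GDP is numeric
complete(df1, Country, Year = 1948:2021, fill = list(GDP = 0)) %>%
arrange(Country)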
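One thing to keep in mind: complete() only adds missing combinations. It does not drop existing rows that fall outside the requested range (such as Australia's 1945 row), so a filter is still needed for those. A minimal sketch on made-up toy data with a shorter year range (toy, kept and trimmed are illustrative names, and the GDP values are invented):

```r
library(dplyr)
library(tidyr)

# Toy data: country B has one row (1945) outside the target range 1948-1952
toy <- tibble(
  Country = c("A", "A", "B", "B"),
  Year    = c(1950L, 1951L, 1945L, 1952L),
  GDP     = c(10, 11, 20, 21)
)

# complete() alone: 2 countries x 5 years, plus the untouched 1945 row
kept <- toy %>%
  complete(Country, Year = 1948:1952)

# filter first, then complete(): exactly 2 countries x 5 years
trimmed <- toy %>%
  filter(between(Year, 1948, 1952)) %>%
  complete(Country, Year = 1948:1952)

nrow(kept)     # 11
nrow(trimmed)  # 10
```

Note that filtering first would drop a country entirely if none of its rows fall inside the range, so for real data it may be safer to complete first and filter afterwards.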
We can use complete, then filter, and finally replace_na.
library(dplyr)
library(tidyr)
df <-read.table(header=TRUE, text="Country Year GDP
Afghanistan 1950 $123
Afghanistan 1951 $123
Afghanistan 2019 $123
Australia 1945 $123
Australia 2021 $123")
df <- df %>%
complete(Year = 1948:2021, Country) %>%
filter(between(Year, 1948, 2021)) %>%
replace_na(list(GDP = "0")) %>% # GDP is a character column, so the replacement is a string too
arrange(Country)
head(df)
tail(df)
> print(head(df))
# A tibble: 6 x 3
Year Country GDP
<int> <chr> <chr>
1 1948 Afghanistan 0
2 1949 Afghanistan 0
3 1950 Afghanistan $123
4 1951 Afghanistan $123
5 1952 Afghanistan 0
6 1953 Afghanistan 0
> print(tail(df))
# A tibble: 6 x 3
Year Country GDP
<int> <chr> <chr>
1 2016 Australia 0
2 2017 Australia 0
3 2018 Australia 0
4 2019 Australia 0
5 2020 Australia 0
6 2021 Australia $123
Created on 2021-09-26 by the reprex package (v2.0.1)
library(tidyr)
library(dplyr)
df <-
tibble::tribble(
~Country, ~Year, ~GDP,
"Afghanistan", 1950L, "$123",
"Afghanistan", 1951L, "$123",
"Afghanistan", 2019L, "$123",
"Australia", 1945L, "$123",
"Australia", 2021L, "$123"
)
df %>%
filter(Year >= 1948 & Year <= 2021) %>%
complete(Year = 1948:2021,Country) %>%
arrange(Country)
# A tibble: 148 x 3
Year Country GDP
<int> <chr> <chr>
1 1948 Afghanistan NA
2 1949 Afghanistan NA
3 1950 Afghanistan $123
4 1951 Afghanistan $123
5 1952 Afghanistan NA
6 1953 Afghanistan NA
7 1954 Afghanistan NA
8 1955 Afghanistan NA
9 1956 Afghanistan NA
10 1957 Afghanistan NA
# ... with 138 more rows
Here is a solution with complete and coalesce
library(dplyr)
library(tidyr)
df %>%
complete(Year = 1948:2021, Country) %>%
arrange(Country, Year) %>%
mutate(GDP = coalesce(GDP, "0"))
# A tibble: 149 x 3
Year Country GDP
<int> <chr> <chr>
1 1948 Afghanistan 0
2 1949 Afghanistan 0
3 1950 Afghanistan $123
4 1951 Afghanistan $123
5 1952 Afghanistan 0
6 1953 Afghanistan 0
7 1954 Afghanistan 0
8 1955 Afghanistan 0
9 1956 Afghanistan 0
10 1957 Afghanistan 0
# … with 139 more rows
I would like to split a variable called country conditional on whether it has a year in it (Albania2009 vs. Albania).
In addition, where the variable does not have a year (i.e. Albania), I would like to copy the country name to cname and manually put a year in cyear.
idstd id xxx id1 country
<dbl+> <dbl> <dbl+lbl> <dbl+lbl> <chr>
1 445801 NA NA 7 Albania2009
2 542384 4616555 1163 7 Albania
3 445801 NA NA 7 Albania2009
4 542384 4616555 1163 7 Albania
I first tried myself, making use of the fact that id is NA when country has a year in it:
CAmerica0306P$cyear <- NA
CAmerica0306P$cname <- NA
for (i in 1:nrow(df)) {
if (df$id[i]==NA) {
df[i,] <- separate(df, country[i], into = c("cname", "cyear"), -4)
} else {
df$cyear[i,] <- 2001
df$cname[i,] <- df$country[i,]
}
}
But it splits everything. After checking Stack Overflow, I tried:
df <- df %>%
extract(country, into=c("cname", "cyear"), regex="^(?=.{1,7}$)([a-zA-Z]+)([0-9].*)$", remove=FALSE)
but it does not fill the cells (still NA's).
Desired output:
idstd id xxx id1 country cyear cname
<dbl+> <dbl> <dbl+lbl> <dbl+lbl> <chr> <dbl>
1 445801 NA NA 7 Albania 2009 Albania
2 542384 4616555 1163 7 Albania 2001 Albania
3 445801 NA NA 7 Albania 2009 Albania
4 542384 4616555 1163 7 Albania 2001 Albania
Any suggestions?
Example data: (you should provide ready to use data)
df1<-
data.frame(country = I(paste0("Albania",c("",2007:2012,""))) )
code:
df1$cname <- sub("\\d+$", "", df1$country) # remove the trailing digits
df1$cyear <- gsub("[^0-9]", "", df1$country) # remove everything that is not a digit
df1$cyear[df1$cyear == ""] <- 2001 # where no year is present, insert 2001
df1$country <- df1$cname
result:
# country cname cyear
#1 Albania Albania 2001
#2 Albania Albania 2007
#3 Albania Albania 2008
#4 Albania Albania 2009
#5 Albania Albania 2010
#6 Albania Albania 2011
#7 Albania Albania 2012
#8 Albania Albania 2001
I'm trying to merge two datasets by year and country. The first dataset (df = GNIPC) represents gross national income per capita for every country from 1980 to 2008.
Country Year GNIpc
(chr) (dbl) (dbl)
1 Afghanistan 1990 NA
2 Afghanistan 1991 NA
3 Afghanistan 1992 2010
4 Afghanistan 1993 NA
5 Afghanistan 1994 12550
6 Afghanistan 1995 NA
The second dataset (df = sanctions) represents the imposition of economic sanctions from 1946 to present day.
country imposition sanctiontype sanctions_period
(chr) (dbl) (chr) (chr)
1 Afghanistan 1 1 6 8 1997-2001
2 Afghanistan 1 7 1979-1979
3 Afghanistan 1 4 7 1995-2002
4 Albania 1 2 8 2005-2005
5 Albania 1 7 2005-2006
6 Albania 1 8 2004-2005
I would like to merge the two datasets so that for every GNI year I know whether sanctions were present in the country or not. For the GNI years that are not in the sanctions_period the value would be 0, and for those that are it would be 1. This is what I want it to look like:
Country Year GNIpc Imposition sanctiontype
(chr) (dbl) (dbl) (dbl) (chr)
1 Afghanistan 1990 NA 0 NA
2 Afghanistan 1991 NA 0 NA
3 Afghanistan 1992 2010 0 NA
4 Afghanistan 1993 NA 0 NA
5 Afghanistan 1994 12550 0 NA
6 Afghanistan 1995 NA 1 4 7
Some example data:
df1 <- data.frame(country = c('Afghanistan', 'Turkey'),
imposition = c(1, 0),
sanctiontype = c('1 6 8', '4'),
sanctions_period = c('1997-2001', '2012-ongoing')
)
country imposition sanctiontype sanctions_period
1 Afghanistan 1 1 6 8 1997-2001
2 Turkey 0 4 2012-ongoing
The "sanctions_period" column can be transformed with dplyr and tidyr:
library(tidyr)
library(dplyr)
df.new <- separate(df1, sanctions_period, c('start', 'end'), remove = FALSE) %>%
mutate(end = ifelse(end == 'ongoing', '2016', end)) %>%
mutate(start = as.numeric(start), end = as.numeric(end)) %>%
group_by(country, sanctions_period) %>%
do(data.frame(country = .$country, imposition = .$imposition, sanctiontype = .$sanctiontype, year = .$start:.$end))
sanctions_period country imposition sanctiontype year
<fctr> <fctr> <dbl> <fctr> <int>
1 1997-2001 Afghanistan 1 1 6 8 1997
2 1997-2001 Afghanistan 1 1 6 8 1998
3 1997-2001 Afghanistan 1 1 6 8 1999
4 1997-2001 Afghanistan 1 1 6 8 2000
5 1997-2001 Afghanistan 1 1 6 8 2001
6 2012-ongoing Turkey 0 4 2012
7 2012-ongoing Turkey 0 4 2013
8 2012-ongoing Turkey 0 4 2014
9 2012-ongoing Turkey 0 4 2015
10 2012-ongoing Turkey 0 4 2016
From there, it should be easy to merge with your first data frame. Note that your first data frame capitalizes Country and Year, while the second doesn't.
df.merged <- merge(df.first, df.new, by.x = c('Country', 'Year'), by.y = c('country', 'year'))
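merge() performs an inner join by default, so GNI years without a matching sanction year would be dropped. To keep every GNI row and code the unmatched years as 0, as the question asks, something along these lines should work. The two small data frames below are made-up stand-ins for the real data, and the GNIpc values are invented.

```r
# Toy stand-in for the GNI data (values are invented)
gni <- data.frame(
  Country = "Afghanistan",
  Year    = 1995:1998,
  GNIpc   = c(NA, NA, 100, 110)
)

# Toy stand-in for the sanctions data after expansion to one row per year
sanction_years <- data.frame(
  country      = "Afghanistan",
  year         = 1997:2001,
  imposition   = 1,
  sanctiontype = "1 6 8"
)

# all.x = TRUE keeps every GNI row, matched or not (a left join)
merged <- merge(gni, sanction_years,
                by.x = c("Country", "Year"),
                by.y = c("country", "year"),
                all.x = TRUE)

# Years with no sanction row come back as NA; recode them to 0
merged$imposition[is.na(merged$imposition)] <- 0
```

Here 1995 and 1996 end up with imposition 0 and an NA sanctiontype, while 1997 and 1998 carry the sanction details.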
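Using dplyr, joining against the expanded one-row-per-year sanctions data (df.new above), since the raw sanctions table has no year column:
left_join(GNIPC, df.new, by = c("Country" = "country", "Year" = "year")) %>%
select(Country, Year, GNIpc, imposition, sanctiontype)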
I have a data frame called df with 10 variables.
df contains a list of countries along with their gdp, unemployment level, whether they have been colonised (as TRUE), etc.
For each of the variables gdp, unemp level and colonised, I know there are a number of NAs.
Is there a command that lists the names of the countries that have NAs? For example, suppose the UK has NA for gdp but has values for unemp and colonised, and France has gdp and unemp but NA for colonised.
Is there a command that will return the UK and France because they have NAs?
My data:
destination origin sum gdp.diff unemployment.diff
1 Albania Azerbaijan 2 27 8.467610
2 Albania Congo 1 -21 NA
3 Albania Dem. Rep. of the Congo 1 -80 13.437610
4 Albania Eritrea 21 -66 NA
5 Albania Iran (Islamic Rep. of) 279 5 2.997610
6 Albania Mali 1 -68 6.137609
So I need Albania to appear in the list because it has an NA for unemployment.diff
Using complete.cases:
#dummy data
df <- data.frame(country = letters[1:3],
gdp = c(1,NA,2),
unemployment = c(1,2,3),
colonised = c(T,F,NA))
df
# country gdp unemployment colonised
# 1 a 1 1 TRUE
# 2 b NA 2 FALSE
# 3 c 2 3 NA
df[ !complete.cases(df), ]
# country gdp unemployment colonised
# 2 b NA 2 FALSE
# 3 c 2 3 NA
# check for NAs on one column
df[ is.na(df$gdp), ]
# country gdp unemployment colonised
# 2 b NA 2 FALSE
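To get just the country names rather than the full rows, the country column can be indexed with the same logical vector. A short sketch using the dummy data above, rebuilt here so the block runs on its own; countries_with_na and cols_with_na are made-up names.

```r
# Same dummy data as above
df <- data.frame(country = c("a", "b", "c"),
                 gdp = c(1, NA, 2),
                 unemployment = c(1, 2, 3),
                 colonised = c(TRUE, FALSE, NA),
                 stringsAsFactors = FALSE)

# Names of the countries with at least one NA in any column;
# unique() matters when a country appears on several rows
countries_with_na <- unique(df$country[!complete.cases(df)])
# "b" "c"

# Which columns contain NAs at all
cols_with_na <- names(df)[colSums(is.na(df)) > 0]
# "gdp" "colonised"
```

For the data in the question, the same idea would use the destination column in place of country.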