Dummy indicator for cross country ID repetitions - r

Here is my problem: I have a table with isin and countries, like this:
ISIN COUNTRY
XX0001 ITALY
XX0002 FRANCE
XX0003 ITALY
XX0001 FRANCE
XX0002 ITALY
XX0004 FRANCE
I would like to create a new column with an indicator taking value 1 if the same ISIN appears in both countries, and 0 otherwise.
ISIN COUNTRY INDICATOR
XX0001 ITALY 1
XX0002 FRANCE 1
XX0003 ITALY 0
XX0001 FRANCE 1
XX0002 ITALY 1
XX0004 FRANCE 0
I am working in TIBCO Spotfire, which can also run native R code.
Data
df1 <- structure(list(ISIN = c("XX0001", "XX0002", "XX0003", "XX0001", "XX0002", "XX0004"),
COUNTRY = c("ITALY", "FRANCE", "ITALY", "FRANCE", "ITALY", "FRANCE")),
.Names = c("ISIN", "COUNTRY"), class = "data.frame",
row.names = c(NA, -6L))

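We can use duplicated, scanning from both ends so that every occurrence of a repeated ISIN is flagged (this assumes an ISIN is never listed twice for the same country)
df1$INDICATOR <- as.integer(duplicated(df1$ISIN) | duplicated(df1$ISIN, fromLast = TRUE))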
df1$INDICATOR
#[1] 1 1 0 1 1 0
Or using data.table
library(data.table)
setDT(df1)[, INDICATOR := +(uniqueN(COUNTRY) > 1), by = ISIN]
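If data.table is not available (for example inside Spotfire), the same per-ISIN count of distinct countries can be sketched in base R with ave. This is only an illustration: the data frame is rebuilt from the question's sample, and n_countries is a name introduced here, not part of the original answer.

```r
# Sample data from the question, rebuilt as a plain data.frame.
df1 <- data.frame(
  ISIN    = c("XX0001", "XX0002", "XX0003", "XX0001", "XX0002", "XX0004"),
  COUNTRY = c("ITALY", "FRANCE", "ITALY", "FRANCE", "ITALY", "FRANCE"),
  stringsAsFactors = FALSE
)

# For each ISIN, count how many distinct countries it appears in.
# ave() is given row indices so the result stays integer.
n_countries <- ave(
  seq_along(df1$ISIN), df1$ISIN,
  FUN = function(i) length(unique(df1$COUNTRY[i]))
)

# 1 when the ISIN shows up in more than one country, 0 otherwise.
df1$INDICATOR <- as.integer(n_countries > 1)
df1$INDICATOR
#[1] 1 1 0 1 1 0
```

Unlike a plain duplicated check, this still gives 0 when an ISIN is repeated within a single country.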

Related

Identifying matching observations in dyadic data in R

Hello everyone,
I am struggling with the following issue. Currently, I have a dataset looking like this:
living_in from Year stock
Austria Australia 2014 2513
Austria Australia 2013 2000
Germany Austria 2010 6000
Australia Austria 2014 3000
Austria Australia 1993 NA
Now I would like to identify all observations that fulfill the following criteria:
Should be from same year
Should contain the same country pairs in that year
Should not contain NA
For instance, I want to find all observations for combinations of two countries like Austria-Australia and Australia-Austria within the same year that contain values. This is due to the fact that some combinations in a given year in the dataset have only one value for stock not two. I want to remove those.
What is the best way to proceed here? Many thanks in advance!
P.S. I have about 14 country pairs in my dataset that need this kind of identification
A helpful output might be something like this.
living_in from Year stock dummy
Austria Australia 2014 2513 1
Austria Australia 2013 2000 0
Germany Austria 2010 6000 0
Australia Austria 2014 3000 1
Austria Australia 1993 NA 0
For each pair of countries, irrespective of order (A-B is the same as B-A), assign 1 to the dummy column if the pair has more than one row for the same Year and all of its stock values are non-NA; otherwise assign 0.
library(dplyr)
df %>%
group_by(col1 = pmin(living_in, from), col2 = pmax(living_in, from), Year) %>%
mutate(dummy = as.integer(n() > 1 && all(!is.na(stock)))) %>%
ungroup() %>%
select(-col1, -col2)
# living_in from Year stock dummy
# <chr> <chr> <int> <int> <int>
#1 Austria Australia 2014 2513 1
#2 Austria Australia 2013 2000 0
#3 Germany Austria 2010 6000 0
#4 Australia Austria 2014 3000 1
#5 Austria Australia 1993 NA 0
data
df <- structure(list(living_in = c("Austria", "Austria", "Germany",
"Australia", "Austria"), from = c("Australia", "Australia", "Austria",
"Austria", "Australia"), Year = c(2014L, 2013L, 2010L, 2014L,
1993L), stock = c(2513L, 2000L, 6000L, 3000L, NA)),
class = "data.frame", row.names = c(NA, -5L))
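For anyone who prefers to avoid dplyr, here is a rough base R sketch of the same idea: build an order-independent key from the two country columns and the year, then flag keys that have more than one row and no missing stock. The name pair_key is introduced here for illustration only.

```r
# Sample data from the question.
df <- data.frame(
  living_in = c("Austria", "Austria", "Germany", "Australia", "Austria"),
  from      = c("Australia", "Australia", "Austria", "Austria", "Australia"),
  Year      = c(2014L, 2013L, 2010L, 2014L, 1993L),
  stock     = c(2513L, 2000L, 6000L, 3000L, NA),
  stringsAsFactors = FALSE
)

# Order-independent key: alphabetically smaller country first, then the year.
pair_key <- paste(pmin(df$living_in, df$from),
                  pmax(df$living_in, df$from),
                  df$Year, sep = "|")

# Within each key: 1 if there is more than one row and no NA in stock.
df$dummy <- ave(seq_len(nrow(df)), pair_key, FUN = function(i) {
  as.integer(length(i) > 1 && !anyNA(df$stock[i]))
})

# Keep only the complete pairs.
complete_pairs <- df[df$dummy == 1L, ]
```

With the sample data, only the two 2014 Austria/Australia rows end up in complete_pairs.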

How to modify data frame in R based on one unique column

I have a data frame that looks like this.
Data
Denmark MG301
Denmark MG302
Australia MG301
Australia MG302
Sweden MG100
Sweden MG120
I need to make a new data frame based on the unique values of the 2nd column, removing the Denmark rows whose values are repeated elsewhere. The result should look like this:
Data
Australia MG301
Australia MG302
Sweden MG100
Sweden MG120
Regards
Update after clarification:
This code keeps all distinct values in column2:
distinct(df, code, .keep_all = TRUE)
Output:
1 Denmark MG301
2 Australia MG302
3 Sweden MG100
4 Sweden MG120
First answer:
I am not quite sure. But it gives the desired output:
df %>%
filter(country != "Denmark")
Output:
country code
<chr> <chr>
1 Australia MG301
2 Australia MG302
3 Sweden MG100
4 Sweden MG120
data:
df<- tribble(
~country, ~code,
"Denmark", "MG301",
"Denmark", "MG301",
"Australia", "MG301",
"Australia", "MG302",
"Sweden", "MG100",
"Sweden", "MG120")
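In base R, the following code removes every row with "Denmark" in the first column, plus any row whose 2nd-column value is duplicated within its 1st-column group. Here df1 is assumed to hold the question's data with the default column names V1 and V2.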
i <- df1$V1 != "Denmark"
j <- as.logical(ave(df1$V2, df1$V1, FUN = duplicated))
df1[i & !j, ]
# V1 V2
#3 Australia MG301
#4 Australia MG302
#5 Sweden MG100
#6 Sweden MG120
Do you just want distinct rows? Then this may help:
df <- data.frame(A = c("denmark", "denmark", "Australia", "Australia", "Sweden", "Sweden"), B = c("MG301","MG302","MG301","MG302","MG100","MG100"))
df %>% distinct()
A B
1 denmark MG301
2 denmark MG302
3 Australia MG301
4 Australia MG302
5 Sweden MG100
Or do you want this?
df %>%
group_by(B) %>%
dplyr::summarise(A = first(A))
B A
* <chr> <chr>
1 MG100 Sweden
2 MG301 denmark
3 MG302 denmark
Use duplicated together with the ! (negation) operator to drop the rows whose value in that column is duplicated.
To show a more complicated case, I am adding one Denmark row whose code is not duplicated and hence should not be filtered out.
df<- tribble(
~country, ~code,
"Denmark", "MG301",
"Denmark", "MG302",
'Denmark', "MG303",
"Australia", "MG301",
"Australia", "MG302",
"Sweden", "MG100",
"Sweden", "MG120")
# A tibble: 7 x 2
country code
<chr> <chr>
1 Denmark MG301
2 Denmark MG302
3 Denmark MG303
4 Australia MG301
5 Australia MG302
6 Sweden MG100
7 Sweden MG120
df %>%
mutate(d = duplicated(code)) %>%
group_by(code) %>%
mutate(d = sum(d)) %>% ungroup() %>%
filter(!(d > 0 & country == 'Denmark'))
# A tibble: 5 x 3
country code d
<chr> <chr> <int>
1 Denmark MG303 0
2 Australia MG301 1
3 Australia MG302 1
4 Sweden MG100 0
5 Sweden MG120 0
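The same filter can be sketched in base R without the helper column. Note the slightly different rule, which is my assumption about what is wanted: a Denmark row is dropped only when its code also appears under another country. The names codes_elsewhere, drop and result are introduced here for illustration.

```r
# Same seven-row example as above, as a plain data.frame.
df <- data.frame(
  country = c("Denmark", "Denmark", "Denmark",
              "Australia", "Australia", "Sweden", "Sweden"),
  code    = c("MG301", "MG302", "MG303",
              "MG301", "MG302", "MG100", "MG120"),
  stringsAsFactors = FALSE
)

# Codes that occur under any country other than Denmark.
codes_elsewhere <- df$code[df$country != "Denmark"]

# Drop a Denmark row only if its code is also used elsewhere.
drop <- df$country == "Denmark" & df$code %in% codes_elsewhere
result <- df[!drop, ]
```

On this example result has the same five rows as the dplyr version, with Denmark MG303 kept.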

How do I rename the values in my column, as I have misspelt them and can't rename them in R or Colab

I have a data frame that was given to me. Under the column titled state, there are two entries with the same name but different capitalisation, i.e. one is "London" and the other is "LONDON". How can I rename "LONDON" to "London" so that they are totalled together and not separately? As a reminder, I am trying to change the values in the column, not the name of the column.
You can use the following code. Here df is your current data frame, in which you want to replace "LONDON" with "London".
df <- data.frame(Country = c("US", "UK", "Germany", "Brazil","US", "Brazil", "UK", "Germany"),
State = c("NY", "London", "Bavaria", "SP", "CA", "RJ", "LONDON", "Berlin"),
Candidate = c(1:8))
print(df)
output
Country State Candidate
1 US NY 1
2 UK London 2
3 Germany Bavaria 3
4 Brazil SP 4
5 US CA 5
6 Brazil RJ 6
7 UK LONDON 7
8 Germany Berlin 8
Then run the following code to set State to "London" in every row where it equals "LONDON"
df[df$State == "LONDON", "State"] <- "London"
Now the output will be
Country State Candidate
1 US NY 1
2 UK London 2
3 Germany Bavaria 3
4 Brazil SP 4
5 US CA 5
6 Brazil RJ 6
7 UK London 7
8 Germany Berlin 8
Maybe you could try using the case_when function. I would do something like this:
library(dplyr)
data <- mutate(data, State_def = case_when(State == "LONDON" ~ "London",
                                           TRUE ~ State))  # every other value is kept unchanged
I might misunderstand, but I think it should be as simple as this:
x$state <- sub( "LONDON", "London", x$state, fixed=TRUE )
This should change LONDON to London
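Since the goal is to total the two spellings together, here is a small base R sketch that does the recode and then the totals. It reuses the State and Candidate columns from the example above; treating Candidate as the quantity to sum is my assumption, and totals is a name introduced here.

```r
# Example data with the two spellings of London.
df <- data.frame(
  State     = c("NY", "London", "Bavaria", "SP", "CA", "RJ", "LONDON", "Berlin"),
  Candidate = 1:8,
  stringsAsFactors = FALSE
)

# Recode any capitalisation of "london" to the single spelling "London".
df$State[toupper(df$State) == "LONDON"] <- "London"

# Total Candidate per State; both London rows now fall into one group.
totals <- aggregate(Candidate ~ State, data = df, FUN = sum)
```

After the recode, totals has a single London row whose Candidate value is the sum of the two original rows.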

create a variable in a dataframe based on another matrix on R

I am having some problems with the following task
I have a data frame of this type with 99 different countries for thousands of IDs
ID Nationality var 1 var 2 ....
1 Italy //
2 Eritrea //
3 Italy //
4 USA
5 France
6 France
7 Eritrea
....
I want to add a variable giving the macroregion that corresponds to each Nationality,
so I created a matrix like this one that holds the rule to follow
Nationality Continent
Italy Europe
Eritrea Africa
Usa America
France Europe
Germany Europe
....
I'd like to obtain this
ID Nationality var 1 var 2 Continent
1 Italy // Europe
2 Eritrea // Africa
3 Italy // Europe
4 USA America
5 France Europe
6 France Europe
7 Eritrea Africa
....
I was trying with this command
datasubset <- merge(dataset , continent.matrix )
but it doesn't work, it reports the following error
Error: cannot allocate vector of size 56.6 Mb
That seems very strange to me. Applying the same code to a subset doesn't work either. Do you have any suggestions on how to proceed?
Thank you very much in advance for your help. I hope my question doesn't sound too trivial, but I am quite new to R.
You can do this with the left_join function from the dplyr package:
library(dplyr)
df <- tibble(ID=c(1,2,3),
Nationality=c("Italy", "Usa", "France"),
var1=c("a", "b", "c"),
var2=c(4,5,6))
nat_cont <- tibble(Nationality=c("Italy", "Eritrea", "Usa", "Germany", "France"),
Continent=c("Europe", "Africa", "America", "Europe", "Europe"))
df_2 <- left_join(df, nat_cont, by=c("Nationality"))
The output:
> df_2
# A tibble: 3 x 5
ID Nationality var1 var2 Continent
<dbl> <chr> <chr> <dbl> <chr>
1 1 Italy a 4 Europe
2 2 Usa b 5 America
3 3 France c 6 Europe
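If you would rather stay in base R, a lookup with match is another option, and it may be lighter on memory than merge because it only builds an index vector. This is a sketch under assumptions: the data is a small stand-in for the question's tables, continent_lookup and idx are names introduced here, and the comparison is done in upper case because the question spells the same country as both "USA" and "Usa".

```r
# Small stand-in for the question's data.
dataset <- data.frame(
  ID          = 1:7,
  Nationality = c("Italy", "Eritrea", "Italy", "USA",
                  "France", "France", "Eritrea"),
  stringsAsFactors = FALSE
)

# Lookup table: one row per nationality.
continent_lookup <- data.frame(
  Nationality = c("Italy", "Eritrea", "Usa", "France", "Germany"),
  Continent   = c("Europe", "Africa", "America", "Europe", "Europe"),
  stringsAsFactors = FALSE
)

# Position of each nationality in the lookup table (NA if not found).
# Upper-casing both sides lets "USA" match "Usa".
idx <- match(toupper(dataset$Nationality),
             toupper(continent_lookup$Nationality))

# Pull the continent for every row; the row order of dataset is unchanged.
dataset$Continent <- continent_lookup$Continent[idx]
```

Any nationality missing from the lookup table gets NA in Continent, which makes gaps in the rule easy to spot.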

Calculate duration/difference between first and n rows that match on column value

I'm trying to calculate the difference/duration between the first and nth rows of a data frame that match on one column. I want to place that value in a new column "duration". Sample data is below.
y <- data.frame(c("USA", "USA", "USA", "France", "France", "Mexico", "Mexico", "Mexico"), c(1992, 1993, 1994, 1989, 1990, 1999, 2000, 2001))
colnames(y) <- c("Country", "Year")
y$Year <- as.integer(y$Year) # this is to match the class of my actual data
My desired result is:
1992 USA 0
1993 USA 1
1994 USA 2
1989 France 0
1990 France 1
1999 Mexico 0
2000 Mexico 1
2001 Mexico 2
I've tried using dplyr's group_by and mutate
y <- y %>% group_by(Country) %>% mutate(duration = Year - lag(Year))
but I can only get the actual lag year (e.g. 1999) or only calculate the difference between sequential rows getting me either NA for the first row of a country or 1 for all other rows with the same country. Many q & a's focus on difference between sequential rows and not between the first and n rows.
Thoughts?
This can be done by subtracting the first 'Year' of each group from the 'Year' column after grouping by 'Country'.
y %>%
group_by(Country) %>%
mutate(duration = Year - first(Year))
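A base R equivalent, in case dplyr is not loaded, is sketched below with ave. It assumes the rows are already ordered by Year within each Country, as in the sample data.

```r
# Sample data from the question.
y <- data.frame(
  Country = c("USA", "USA", "USA", "France", "France",
              "Mexico", "Mexico", "Mexico"),
  Year    = c(1992L, 1993L, 1994L, 1989L, 1990L, 1999L, 2000L, 2001L),
  stringsAsFactors = FALSE
)

# Within each Country, subtract the first Year from every Year.
y$duration <- ave(y$Year, y$Country, FUN = function(x) x - x[1])
y$duration
#[1] 0 1 2 0 1 0 1 2
```

If the rows might not be sorted, replacing x[1] with min(x) measures from the earliest year instead.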
