r - How to track changes in rows of a data frame with characters?

Following up on my last question, I am now looking for a way to track changes within a data frame of character values.
Suppose I have the following data frame df:
df=data.frame(ID=c(123100,123200,123300,123400,123500),"2014"=c("Germany","Germany","Germany","Italy","Austria"),"2015"=c("Germany","Germany","Germany","Italy","Austria"),"2016"=c("Italy","Germany","Germany","Italy","Germany"), "2017"=c("Italy","Germany","Germany","Italy","Germany"), "2018"=c("Italy","Austria","Germany","Italy","Germany") )
Now I want to find out for which ID the data changed, and in which year. For example, in 2016 ID 123100 changed from Germany to Italy. I would like to add new columns for change (1 = change, 0 or NA = no change), year of change, old value and new value. The challenge is that the real dataset contains thousands of different values instead of just three countries, so I need a solution that does not require listing the distinct values beforehand.
In the end it should look like this:
df_final=data.frame(ID=c(123100,123200,123300,123400,123500),"2014"=c("Germany","Germany","Germany","Italy","Austria"),"2015"=c("Germany","Germany","Germany","Italy","Austria"),"2016"=c("Italy","Germany","Germany","Italy","Germany"), "2017"=c("Italy","Germany","Germany","Italy","Germany"), "2018"=c("Italy","Austria","Germany","Italy","Germany"), "change"=c(1,1,0,0,1),
"year"=c(2016, 2018, 0, 0, 2016), "before"=c("Germany","Germany",0,0,"Austria"), "after"=c("Italy", "Austria", 0, 0, "Germany"))
I couldn't find any satisfying solution on here, so I hope you can help me.

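Try this (it needs dplyr):
library(dplyr)
# change: 1 if any later year differs from the 2014 value, otherwise 0
# year: name of the first column whose value differs from the 2014 value
# before/after: the 2014 and 2018 values, which assumes at most one change per row
df |> rowwise() |> mutate(change = case_when(all(c_across(X2015:X2018) == X2014) ~ 0 , TRUE ~ 1) ,
year = colnames(df)[-1][which(c_across(X2014) != c_across(X2014:X2018))[1]] ) |>
ungroup() |> mutate(before = ifelse(change == 1 , X2014 ,NA) ,
after = ifelse(change == 1 , X2018 ,NA))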
output
# A tibble: 5 × 10
ID X2014 X2015 X2016 X2017 X2018 change year before after
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 123100 Germany Germany Italy Italy Italy 1 X2016 Germany Italy
2 123200 Germany Germany Germany Germany Austria 1 X2018 Germany Austria
3 123300 Germany Germany Germany Germany Germany 0 NA NA NA
4 123400 Italy Italy Italy Italy Italy 0 NA NA NA
5 123500 Austria Austria Germany Germany Germany 1 X2016 Austria Germany

Not elegant, but you can use rle to get the lengths and values of the runs in a vector. I used plyr::ldply to run rle on each row.
library(plyr)
columns <- c("X2014", "X2015", "X2016", "X2017", "X2018")
output <- ldply(seq_len(nrow(df)), function(x){
  # rle expects an atomic vector, so flatten the row to character first
  rle_output <- rle(as.character(unlist(df[x, columns])))
  # a single run means the value never changed
  if(length(rle_output$lengths) == 1) return(data.frame(change=0))
  change = 1
  # the first change happens in the column right after the first run ends
  year = columns[rle_output$lengths[1] + 1]
  before = rle_output$values[1]
  after = rle_output$values[2]
  data.frame(change, year, before, after)
})
cbind(df, output)
ID X2014 X2015 X2016 X2017 X2018 change year before after
1 123100 Germany Germany Italy Italy Italy 1 X2016 Germany Italy
2 123200 Germany Germany Germany Germany Austria 1 X2018 Germany Austria
3 123300 Germany Germany Germany Germany Germany 0 <NA> <NA> <NA>
4 123400 Italy Italy Italy Italy Italy 0 <NA> <NA> <NA>
5 123500 Austria Austria Germany Germany Germany 1 X2016 Austria Germany
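An alternative is to compare every year with the year before it, which avoids naming the year columns or the possible values up front. The following is only a base R sketch under a few assumptions: every column except ID is a year column, the columns are in chronological order, and only the first change per row is of interest. The helper names (year_cols, vals, changed, first_change) are made up for this illustration.

```r
# Sample data from the question; check.names = FALSE keeps "2014" instead of "X2014"
df <- data.frame(
  ID = c(123100, 123200, 123300, 123400, 123500),
  "2014" = c("Germany", "Germany", "Germany", "Italy", "Austria"),
  "2015" = c("Germany", "Germany", "Germany", "Italy", "Austria"),
  "2016" = c("Italy", "Germany", "Germany", "Italy", "Germany"),
  "2017" = c("Italy", "Germany", "Germany", "Italy", "Germany"),
  "2018" = c("Italy", "Austria", "Germany", "Italy", "Germany"),
  check.names = FALSE, stringsAsFactors = FALSE
)

year_cols <- setdiff(names(df), "ID")   # assumption: all other columns are years
vals <- as.matrix(df[year_cols])        # character matrix, one row per ID

# TRUE where a year's value differs from the previous year's value
changed <- vals[, -1, drop = FALSE] != vals[, -ncol(vals), drop = FALSE]

# Position of the first change per row (NA when the row never changes)
first_change <- apply(changed, 1, function(r) which(r)[1])

rows <- seq_len(nrow(vals))
df$change <- as.integer(!is.na(first_change))
df$year   <- as.integer(year_cols[first_change + 1])
df$before <- vals[cbind(rows, first_change)]       # value just before the change
df$after  <- vals[cbind(rows, first_change + 1)]   # value just after the change
df
```

Rows without any change get NA in year, before and after. If a row can change more than once, this sketch reports only the first change.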

Related

Replacing NA values with values from neighbouring rows [duplicate]

This question already has answers here:
Complete column with group_by and complete
(2 answers)
Closed 1 year ago.
I need some help filling cells that contain NA values with values that are already present in the surrounding rows.
I currently have a panel dataset of investors and their activities. Some of the rows were missing, so I have completed the panel to include these rows, replacing the financial deal information with '0' values.
The other variables relate to wider firm characteristics, such as region and strategy. I am unsure how to replicate these for each firm.
This is my code so far.
df <- df %>%
group_by(investor) %>%
mutate(min = min(dealyear, na.rm = TRUE),
max = max(dealyear, na.rm = TRUE)) %>%
complete(investor, dealyear = min:max, fill = list(counttotal=0, countgreen=0, countbrown=0))
An example of data before completion - notice year 2004 is missing.
investor dealyear dealcounts strategy region
123IM    2002     5          buyout   europe
123IM    2003     5          buyout   europe
123IM    2005     5          buyout   europe
123IM    2006     5          buyout   europe
Example of data after completion, with the missing row added in:
investor dealyear dealcounts strategy region
123IM    2002     5          buyout   europe
123IM    2003     5          buyout   europe
123IM    2004     0          NA       NA
123IM    2005     5          buyout   europe
123IM    2006     5          buyout   europe
How would I go about replacing these NA values with the corresponding information for each investment firm?
Many thanks
Rory
You may use complete with group_by as -
library(dplyr)
library(tidyr)
df %>%
group_by(investor) %>%
complete(dealyear = min(dealyear):max(dealyear),
fill = list(dealcounts = 0)) %>%
ungroup
# investor dealyear dealcounts strategy region
# <chr> <int> <dbl> <chr> <chr>
#1 123IM 2002 5 buyout europe
#2 123IM 2003 5 buyout europe
#3 123IM 2004 0 NA NA
#4 123IM 2005 5 buyout europe
#5 123IM 2006 5 buyout europe
If you want to replace the NA values in the strategy and region columns, you can use fill, which by default carries the last non-missing value downward within each group.
df %>%
group_by(investor) %>%
complete(dealyear = min(dealyear):max(dealyear),
fill = list(dealcounts = 0)) %>%
fill(strategy, region) %>%
ungroup
# investor dealyear dealcounts strategy region
# <chr> <int> <dbl> <chr> <chr>
#1 123IM 2002 5 buyout europe
#2 123IM 2003 5 buyout europe
#3 123IM 2004 0 buyout europe
#4 123IM 2005 5 buyout europe
#5 123IM 2006 5 buyout europe
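Filling downward leaves an NA in place when the first row of a group is the one that is missing. As a hedged sketch, fill's .direction argument can be set to "downup" to fill in both directions. The deals data below is invented for illustration and is not the asker's dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical panel: investor "A" lacks 2003, and "B" has no strategy in its first year
deals <- tibble(
  investor   = c("A", "A", "B", "B"),
  dealyear   = c(2002L, 2004L, 2010L, 2011L),
  dealcounts = c(5, 3, 2, 4),
  strategy   = c("buyout", "buyout", NA, "venture")
)

completed <- deals %>%
  group_by(investor) %>%
  # add the missing years per investor, with zero deals in the added rows
  complete(dealyear = min(dealyear):max(dealyear),
           fill = list(dealcounts = 0)) %>%
  # fill down first, then up, so a leading NA is also replaced
  fill(strategy, .direction = "downup") %>%
  ungroup()

completed
```

This only makes sense when a firm characteristic is constant within the investor. If strategy can change over time, filling in both directions may attach the wrong value to early years.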

How to create a grouping (factor) variable in my data frame based on a categorical variable in R

I want to create a factor variable in my data frame based on a categorical variable.
My data:
# A tibble: 159 x 3
name.country gpd rate_suicide
<chr> <dbl> <dbl>
1 Afghanistan 2129. 6.4
2 Albania 12003. 5.6
3 Algeria 11624. 3.3
4 Angola 7103. 8.9
5 Antigua and Barbuda 19919. 0.5
6 Argentina 20308. 9.1
7 Armenia 10704. 5.7
8 Australia 47350. 11.7
9 Austria 52633. 11.4
10 Azerbaijan 14371. 2.6
# ... with 149 more rows
I want to create a factor variable region with the following levels:
region <- c('Asian', 'Europe', 'South America', 'North America', 'Africa')
region = factor(region, levels = c('Asian', 'Europe', 'South America', 'North America', 'Africa'))
I want to do this with the dplyr package, choosing the factor level depending on name.country, but it doesn't work. Example:
if (new_data$name.country[new_data$name.country == "N"]) {
mutate(new_data, region_ = region[1])
}
How can I solve this problem?
Here is how I would approach your problem:
Create a reproducible example (see How to make a great R reproducible example). Since you already have the data, use dput to make it easier for other people to recreate your data in their environment.
dput(yourdf)
structure(list(name.country = c("Afghanistan", "Albania", "Algeria"
), gpd = c(2129L, 12003L, 11624L), rate_suicide = c(6.4, 5.6,
3.3)), class = "data.frame", row.names = c(NA, -3L))
raw_data<-structure(list(name.country = c("Afghanistan", "Albania", "Algeria"
), gpd = c(2129L, 12003L, 11624L), rate_suicide = c(6.4, 5.6,
3.3)), class = "data.frame", row.names = c(NA, -3L))
Define vectors that specify your regions
Use case_when to separate countries into regions
Use as.factor to convert your character variable to a factor
asia=c("Afghanistan","India","...","Rest of countries in Asia")
europe=c("Albania","France","...","Rest of countries in Europe")
africa=c("Algeria","Egypt","...","Rest of countries in Africa")
library(dplyr)
df<-raw_data %>%
mutate(region=case_when(
name.country %in% asia ~ "asia",
name.country %in% europe ~ "europe",
name.country %in% africa ~ "africa",
TRUE ~ "other"
)) %>%
mutate(region=region %>% as.factor())
You can check that your variable region is a factor using str
str(df)
'data.frame': 3 obs. of 4 variables:
$ name.country: chr "Afghanistan" "Albania" "Algeria"
$ gpd : int 2129 12003 11624
$ rate_suicide: num 6.4 5.6 3.3
$ region : Factor w/ 3 levels "africa","asia",..: 2 3 1
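Note that as.factor orders the levels alphabetically, while the question asks for a fixed set of levels in a given order. A small base R sketch of that variant follows. It assumes a hand-made named vector (region_lookup, hypothetical and deliberately incomplete) that maps country names to regions, and it uses "Asia" rather than the question's "Asian" label.

```r
# Hypothetical lookup: names are countries, values are regions (only a few shown)
region_lookup <- c(Afghanistan = "Asia", Albania = "Europe", Algeria = "Africa")

# Level order as requested in the question
region_levels <- c("Asia", "Europe", "South America", "North America", "Africa")

raw_data <- data.frame(
  name.country = c("Afghanistan", "Albania", "Algeria", "Atlantis"),
  stringsAsFactors = FALSE
)

# Index the lookup by country name; unname() drops the names carried over from it.
# Countries missing from the lookup ("Atlantis" here) become NA.
raw_data$region <- factor(unname(region_lookup[raw_data$name.country]),
                          levels = region_levels)

levels(raw_data$region)
```

Passing levels explicitly keeps regions that do not occur in the data (here the two Americas) as empty levels, which can matter for tables and plots.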
Here is a working example that combines data from the question with a file of countries and region information from Github. H/T to Luke Duncalfe for maintaining the region data, which is:
...a combination of the Wikipedia ISO-3166 article for alpha and numeric country codes and the UN Statistics site for countries' regional and sub-regional codes.
regionFile <- "https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv"
regionData <- read.csv(regionFile,header=TRUE)
textFile <- "rowID|country|gdp|suicideRate
1|Afghanistan|2129.|6.4
2|Albania|12003.|5.6
3|Algeria|11624.|3.3
4|Angola|7103.|8.9
5|Antigua and Barbuda|19919.|0.5
6|Argentina|20308.|9.1
7|Armenia|10704.|5.7
8|Australia|47350.|11.7
9|Austria|52633.|11.4
10|Azerbaijan|14371.|2.6"
data <- read.csv(text=textFile,sep="|")
library(dplyr)
data %>%
left_join(.,regionData,by = c("country" = "name"))
...and the output:
rowID country gdp suicideRate alpha.2 alpha.3 country.code
1 1 Afghanistan 2129 6.4 AF AFG 4
2 2 Albania 12003 5.6 AL ALB 8
3 3 Algeria 11624 3.3 DZ DZA 12
4 4 Angola 7103 8.9 AO AGO 24
5 5 Antigua and Barbuda 19919 0.5 AG ATG 28
6 6 Argentina 20308 9.1 AR ARG 32
7 7 Armenia 10704 5.7 AM ARM 51
8 8 Australia 47350 11.7 AU AUS 36
9 9 Austria 52633 11.4 AT AUT 40
10 10 Azerbaijan 14371 2.6 AZ AZE 31
iso_3166.2 region sub.region intermediate.region
1 ISO 3166-2:AF Asia Southern Asia
2 ISO 3166-2:AL Europe Southern Europe
3 ISO 3166-2:DZ Africa Northern Africa
4 ISO 3166-2:AO Africa Sub-Saharan Africa Middle Africa
5 ISO 3166-2:AG Americas Latin America and the Caribbean Caribbean
6 ISO 3166-2:AR Americas Latin America and the Caribbean South America
7 ISO 3166-2:AM Asia Western Asia
8 ISO 3166-2:AU Oceania Australia and New Zealand
9 ISO 3166-2:AT Europe Western Europe
10 ISO 3166-2:AZ Asia Western Asia
region.code sub.region.code intermediate.region.code
1 142 34 NA
2 150 39 NA
3 2 15 NA
4 2 202 17
5 19 419 29
6 19 419 5
7 142 145 NA
8 9 53 NA
9 150 155 NA
10 142 145 NA
At this point one can decide whether to use the region, sub region, or intermediate region and convert it to a factor.
We can set region to a factor by adding a mutate() function to the dplyr pipeline:
data %>%
left_join(.,regionData,by = c("country" = "name")) %>%
mutate(region = factor(region)) -> mergedData
At this point mergedData$region is a factor.
str(mergedData$region)
table(mergedData$region)
> str(mergedData$region)
Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 2 3 5 4 3
> table(mergedData$region)
Africa Americas Asia Europe Oceania
2 2 3 2 1
Now the data is ready for further analysis. We will generate a table of average suicide rates by region.
library(knitr) # for kable
mergedData %>% group_by(region) %>%
summarise(suicideRate = mean(suicideRate)) %>%
kable(.)
...and the output:
|region | suicideRate|
|:--------|-----------:|
|Africa | 6.1|
|Americas | 4.8|
|Asia | 4.9|
|Europe | 8.5|
|Oceania | 11.7|

Create a variable in a data frame based on another matrix in R

I am having some problems with the following task.
I have a data frame of this type, with 99 different countries for thousands of IDs:
ID Nationality var 1 var 2 ....
1 Italy //
2 Eritrea //
3 Italy //
4 USA
5 France
6 France
7 Eritrea
....
I want to add a variable giving the macro-region that corresponds to each Nationality,
so I created a matrix of this kind that holds the mapping:
Nationality Continent
Italy Europe
Eritrea Africa
Usa America
France Europe
Germany Europe
....
I'd like to obtain this:
ID Nationality var 1 var 2 Continent
1 Italy // Europe
2 Eritrea // Africa
3 Italy // Europe
4 USA America
5 France Europe
6 France Europe
7 Eritrea Africa
....
I was trying with this command
datasubset <- merge(dataset , continent.matrix )
but it doesn't work, it reports the following error
Error: cannot allocate vector of size 56.6 Mb
That seems very strange to me; the same code also fails when I apply it to a subset. Do you have any suggestions on how to proceed?
Thank you very much in advance for your help. I hope my question doesn't sound too trivial, but I am quite new to R.
You can do this with the left_join function from the dplyr package:
library(dplyr)
df <- tibble(ID=c(1,2,3),
Nationality=c("Italy", "Usa", "France"),
var1=c("a", "b", "c"),
var2=c(4,5,6))
nat_cont <- tibble(Nationality=c("Italy", "Eritrea", "Usa", "Germany", "France"),
Continent=c("Europe", "Africa", "America", "Europe", "Europe"))
df_2 <- left_join(df, nat_cont, by=c("Nationality"))
The output:
> df_2
# A tibble: 3 x 5
ID Nationality var1 var2 Continent
<dbl> <chr> <chr> <dbl> <chr>
1 1 Italy a 4 Europe
2 2 Usa b 5 America
3 3 France c 6 Europe
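One thing to watch for in the question's data is that the same country is spelled "USA" in the data frame and "Usa" in the mapping, so an exact join would leave that row without a continent. A minimal base R sketch, assuming that case is the only difference between the spellings, uses match on lower-cased keys and lists any nationality that still has no counterpart. The data below is a shortened, invented version of the question's tables.

```r
df <- data.frame(ID = 1:4,
                 Nationality = c("Italy", "USA", "France", "Eritrea"),
                 stringsAsFactors = FALSE)
nat_cont <- data.frame(
  Nationality = c("Italy", "Eritrea", "Usa", "Germany", "France"),
  Continent   = c("Europe", "Africa", "America", "Europe", "Europe"),
  stringsAsFactors = FALSE
)

# Position of each nationality in the lookup table, ignoring case
idx <- match(tolower(df$Nationality), tolower(nat_cont$Nationality))
df$Continent <- nat_cont$Continent[idx]

# Nationalities with no entry in the lookup table (none in this example)
unmatched <- unique(df$Nationality[is.na(idx)])

df
```

match returns one position per row of df, so the result always has the same number of rows as df, even if the lookup table accidentally contains a country twice.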

Column manipulation in R - matching correct names

I have a data.frame with multiple columns and thousands of rows. Its head looks like this:
|year |state_name|idealPoint| vote_no| vote_yes|
|:--------------|---------:|---------:|---------:|---------:|
|1971 | China | -25.0000| 31.0000| 45.4209|
|1972 | China | -26.2550| 38.2974| 45.4209|
|1973 | China | 28.2550| 35.2974| 45.4209|
|1994 | Czech | 27.2550| 34.2974| 45.4209|
As you can see, not all countries (there are 196 of them) started voting at the UN in the same year.
What I want to do is create a new column in my data.frame (votes) that holds the absolute difference between China's ideal point and the Czech ideal point for a given year. I know how to create the new column with dplyr, but how do I pick the correct pair of countries from the list of 196? (Years in which only one of the two had joined can then be deleted manually, I think.)
The final output should be a new data.frame (or new columns in votes) looking like this, assuming for instance that China's ideal point in 1994 was 2.2550:
|year |state_name|idealPoint|Abs.Difference China_Czech|
|:--------------|---------:|---------:|-------------------------:|
|1971 | China | -25.0000| NA |
|1972 | China | -26.2550| NA |
|1973 | China | 28.2550| NA |
|1994 | Czech | 27.2550| 25.0000 |
Code:
df1 <- data.frame(year = c(1994,1995,1996,1997,1994,1995,1996,1997),
state_name = c("China","China","China","China","Czech_Republic","Czech_Republic","Czech_Republic","Czech_Republic"),
idealpoints = c(-25.0000,-26.2550,28.2550,27.2550,-27.0000,-28.2550,29.2550,22.2550),
vote_no = c(31.0000,38.2974,35.2974,34.2974,33.0000,36.2974,37.2974,38.2974),
vote_yes = c(45.4209,45.4209,45.4209,45.4209,45.4209,45.4209,45.4209,45.4209))
china_df <- df1[df1$state_name == "China",]
czech_df <- df1[df1$state_name == "Czech_Republic",]
china_czech_merge <- merge(china_df,czech_df,by = "year")
china_czech_merge$Abs_diff <- abs(china_czech_merge$idealpoints.x - china_czech_merge$idealpoints.y)
Output:
year state_name.x idealpoints.x vote_no.x vote_yes.x state_name.y idealpoints.y vote_no.y vote_yes.y Abs_diff
1 1994 China -25.000 31.0000 45.4209 Czech_Republic -27.000 33.0000 45.4209 2
2 1995 China -26.255 38.2974 45.4209 Czech_Republic -28.255 36.2974 45.4209 2
3 1996 China 28.255 35.2974 45.4209 Czech_Republic 29.255 37.2974 45.4209 1
4 1997 China 27.255 34.2974 45.4209 Czech_Republic 22.255 38.2974 45.4209 5
I think this will work for you.
Thanks
Does this perhaps solve your problem?
library(tibble)
library(dplyr)
a <- tribble(
~year, ~ctry, ~vote,
1994, "China", 5,
1995, "China", 100,
1996, "China", 600,
1997, "China", 45,
1998, "China", 9,
1994, "Czech_Republic", 1,
1995, "Czech_Republic", 5,
1996, "Czech_Republic", 100,
1997, "Czech_Republic", 40,
1998, "Czech_Republic", 6,
)
a %>%
group_by(year) %>%
mutate(foo = abs(lag(lead(vote) - vote)))
Output:
# A tibble: 10 x 4
# Groups: year [5]
year ctry vote foo
<dbl> <chr> <dbl> <dbl>
1 1994 China 5 NA
2 1995 China 100 NA
3 1996 China 600 NA
4 1997 China 45 NA
5 1998 China 9 NA
6 1994 Czech_Republic 1 4
7 1995 Czech_Republic 5 95
8 1996 Czech_Republic 100 500
9 1997 Czech_Republic 40 5
10 1998 Czech_Republic 6 3
You'll have to filter down the data to fit your needs, e.g. by country.
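The lag/lead approach depends on the row order within each year, which gets fragile once more than two countries are present. As a hedged alternative sketch, the two countries can be selected by name inside summarise. The votes data frame and the pair_diff name are invented for this illustration.

```r
library(dplyr)

# Hypothetical data with a third country mixed in
votes <- data.frame(
  year = c(1994, 1995, 1994, 1995, 1995),
  ctry = c("China", "China", "Czech_Republic", "Czech_Republic", "France"),
  vote = c(5, 100, 1, 5, 50)
)

pair_diff <- votes %>%
  filter(ctry %in% c("China", "Czech_Republic")) %>%
  group_by(year) %>%
  # [1] yields NA when one of the two countries has no row in that year
  summarise(abs_diff = abs(vote[ctry == "China"][1] -
                           vote[ctry == "Czech_Republic"][1]),
            .groups = "drop")

pair_diff
```

Years in which only one of the two countries voted end up with NA, which matches the NA rows in the desired output of the question.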

Join 2 dataframes together if two columns match

I have 2 dataframes:
CountryPoints
From.country To.Country points
Belgium Finland 4
Belgium Germany 5
Malta Italy 12
Malta UK 1
and another dataframe with neighbouring/bordering countries:
From.country To.Country
Belgium Finland
Belgium Germany
Malta Italy
I would like to add another column to CountryPoints called Neighbour (Y/N), depending on whether the From/To pair is found in the neighbouring/bordering countries dataframe. Is this possible? It is a kind of join, but the result should be a boolean column.
The result should be:
From.country To.Country points Neighbour
Belgium Finland 4 Y
Belgium Germany 5 Y
Malta Italy 12 Y
Malta UK 1 N
In the question below it shows how you can merge but it doesn't show how you can add that extra boolean column
Two alternative approaches:
1) with base R:
# compare the (From, To) pair as a single key, so a row only counts when both columns agree
idx <- paste(df1$From.country, df1$To.Country, sep = "|") %in%
paste(df2$From.country, df2$To.Country, sep = "|")
df1$Neighbour <- c('N','Y')[1 + idx]
2) with data.table:
library(data.table)
setDT(df1)
setDT(df2)
df1[, Neighbour := 'N'][df2, on = .(From.country, To.Country), Neighbour := 'Y'][]
which both give (data.table-output shown):
From.country To.Country points Neighbour
1: Belgium Finland 4 Y
2: Belgium Germany 5 Y
3: Malta Italy 12 Y
4: Malta UK 1 N
Borrowing the idea from this post:
df1$Neighbour <- duplicated(rbind(df2[, 1:2], df1[, 1:2]))[ -seq_len(nrow(df2)) ]
df1
# From.country To.Country points Neighbour
# 1 Belgium Finland 4 TRUE
# 2 Belgium Germany 5 TRUE
# 3 Malta Italy 12 TRUE
# 4 Malta UK 1 FALSE
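What about something like this? Note that sorting each pair means the direction is ignored, so Finland/Belgium would also count as a match for Belgium/Finland.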
sortpaste <- function(x) paste0(sort(x), collapse = "_");
df1$Neighbour <- apply(df1[, 1:2], 1, sortpaste) %in% apply(df2[, 1:2], 1, sortpaste)
# From.country To.Country points Neighbour
#1 Belgium Finland 4 TRUE
#2 Belgium Germany 5 TRUE
#3 Malta Italy 12 TRUE
#4 Malta UK 1 FALSE
Sample data
df1 <- read.table(text =
"From.country To.Country points
Belgium Finland 4
Belgium Germany 5
Malta Italy 12
Malta UK 1", header = T)
df2 <- read.table(text =
"From.country To.Country
Belgium Finland
Belgium Germany
Malta Italy", header = T)
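Since the question describes the task as a kind of join, here is one more sketch that does it with base R's merge: give the neighbour table a flag column, left-join it onto the points table, and turn the resulting NAs into "N". It assumes the direction of the pair matters; the flags and res names are made up for this example.

```r
df1 <- data.frame(From.country = c("Belgium", "Belgium", "Malta", "Malta"),
                  To.Country   = c("Finland", "Germany", "Italy", "UK"),
                  points       = c(4, 5, 12, 1),
                  stringsAsFactors = FALSE)
df2 <- data.frame(From.country = c("Belgium", "Belgium", "Malta"),
                  To.Country   = c("Finland", "Germany", "Italy"),
                  stringsAsFactors = FALSE)

# One row per neighbouring pair, flagged "Y"; unique() guards against duplicate pairs
flags <- unique(df2[c("From.country", "To.Country")])
flags$Neighbour <- "Y"

# all.x = TRUE keeps every row of df1; pairs without a match get NA
res <- merge(df1, flags, by = c("From.country", "To.Country"), all.x = TRUE)
res$Neighbour[is.na(res$Neighbour)] <- "N"

res
```

merge sorts the result by the key columns, so the row order can differ from df1 when the input is not already sorted.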
