Computing correlations between variables in 2 dataframes - r

I am trying to compute the correlation of each of the countries below with the USA. My first data frame is relatively large (80+ variables and 3000+ observations), so I want to automate this in R instead of using Excel.
Specifically, I want to correlate each country in the first df (Germany, Italy, Japan and more) with USA in the second df: Germany - USA, Italy - USA, Japan - USA and so on.
I am not sure how to begin: should I loop over every column in the first table and correlate it with USA in the second? Any help is much appreciated.
Thanks!
df1
Date        Germany  Italy  Japan  More countries...
01-01-2020  1000     200    2304   More numbers...
01-02-2020  2000     389    2098   More numbers...
and on and on
df2
Date        USA
01-01-2020  500
01-02-2020  600
and on and on

You could use this approach:
library(dplyr);library(magrittr)
countries = c("Germany", "Italy", "Japan")
left_join(df1, df2) %>% summarise(across(all_of(countries), ~cor(., USA)))
or, as the OP did not have access to the latest version of dplyr and across():
left_join(df1, df2) %>% summarise_at(countries, ~cor(., USA))
left_join merges df1 and df2 together so that the dates always match up with one another
summarise allows you to perform column-wise operations
across selects the columns that you want to correlate with USA
~cor(., USA) says take each country and perform the correlation with USA
Germany Italy Japan
<dbl> <dbl> <dbl>
1 -0.393 -0.147 -0.214
Thank you Damien Georges for the data.

Something like that should do the trick:
library(dplyr)
df1 <-
tibble(
date = 2001:2010,
Germany = runif(10),
Italy = runif(10),
Japan = runif(10)
)
df2 <-
tibble(
date = 2001:2010,
USA = runif(10)
)
df.cor <-
df1 %>%
summarise(across(-one_of('date'), ~ cor(.x, df2$USA)))
df.cor
Note: cor() pairs values by row position, so you have to be sure that the dates are consistent between df1 and df2. You can use a join function (e.g. left_join) to ensure this.
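As a minimal sketch of the note above, the example below joins on date before computing the correlations. The data are made up, and the partially overlapping date ranges are an assumption chosen to show why the join matters; inner_join keeps only the dates present in both tables.

```r
library(dplyr)

set.seed(1)
# Toy data: df2 covers a different (partly overlapping) range of dates than df1
df1 <- tibble(date = 2001:2010, Germany = runif(10), Italy = runif(10))
df2 <- tibble(date = 2003:2012, USA = runif(10))

# Keep only the dates present in both tables, so rows are aligned by date
joined <- inner_join(df1, df2, by = "date")

# Correlate every column except date and USA with USA
cors <- joined %>%
  summarise(across(-c(date, USA), ~ cor(.x, USA)))

cors
```

Calling cor(.x, df2$USA) directly on these two tables would pair values from different years, which is the situation the join avoids.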

Here are two base R solutions, depending on the final format you want.
Both with the new pipe operator, introduced in R 4.1.0.
df2[-1] |> cor(df1[-1]) |> as.data.frame()
# Germany Italy Japan
#USA 0.3161338 0.5483885 0.1725733
df1[-1] |> cor(df2[-1]) |> as.data.frame()
# USA
#Germany 0.3161338
#Italy 0.5483885
#Japan 0.1725733
More traditional but equivalent versions:
as.data.frame(cor(df2[-1], df1[-1]))
as.data.frame(cor(df1[-1], df2[-1]))
Data
Data creation code borrowed from Damien Georges.
set.seed(2021)
df1 <-
data.frame(
date = 2001:2010,
Germany = runif(10),
Italy = runif(10),
Japan = runif(10)
)
df2 <-
data.frame(
date = 2001:2010,
USA = runif(10)
)
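The question asked whether to loop over every column. A rough base R sketch of that idea is shown below, assuming the same toy data as above: merge() aligns the two tables by date, and sapply() applies cor() to each country column in turn. The names both and country_cols are only illustrative.

```r
set.seed(2021)
df1 <- data.frame(date = 2001:2010, Germany = runif(10),
                  Italy = runif(10), Japan = runif(10))
df2 <- data.frame(date = 2001:2010, USA = runif(10))

# Align the two tables by date first
both <- merge(df1, df2, by = "date")

# Every column except the key and USA is treated as a country
country_cols <- setdiff(names(both), c("date", "USA"))

# "Loop" over the country columns, correlating each one with USA
cors <- sapply(both[country_cols], cor, y = both$USA)

# Named numeric vector, sorted from strongest positive correlation down
sort(cors, decreasing = TRUE)
```

The result should match the cor(df1[-1], df2[-1]) version, just returned as a named vector instead of a matrix.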

Related

How to merge two rows so 2 years of data is represented in one row

This is a cutout of my dataframe
I have a dataframe with two different variables that are recorded one year apart from each other. I would like to combine, for example, 2007 and 2008 into one row containing both variables and name it Denmark2007/8.
I have about 300 rows to do this with and cannot find a command that will do it, and typing it manually is out of the question.
I have looked at everything from merge() to colSums(), and I am lost.
While one can debate whether a wide format data frame will be easiest to use in subsequent analysis steps, the tricky part of this request is that the names of countries may include multiple words. This means that a simpler solution like tidyr::separate() with sep = " " isn't feasible.
Here is a solution that uses length of each Country to extract the last 4 characters into a Year column, and everything before the final space as Country.
For the purposes of this example, v1 represents the odd year data, and v2 represents the even year data.
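To illustrate the length-based extraction described above, here is a small base R sketch on two example strings. It assumes every value ends in a space followed by a 4-digit year; the vector name country_year is only for illustration.

```r
country_year <- c("Denmark 2007", "United Kingdom 2008")

n <- nchar(country_year)

# Everything before the final space: drop the last 5 characters (" 2007")
country <- substr(country_year, 1, n - 5)

# The last 4 characters are the year
year <- as.numeric(substr(country_year, n - 3, n))

country
year
```

Because the split is positional, country names containing spaces are handled the same way as single-word names.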
Refactored Solution
After coding the tidyverse friendly answer (see below), I realized I could simplify the original solution by starting with the long form tidy data, splitting it into even and odd years, renaming columns and then merging by year.
First, we create data based on the graphic in the original post, and add a couple of rows for a country whose name includes multiple words.
textData <- "v1,Country,v2
0.93181,Denmark 2007,NA
NA,Denmark 2008,5.519108
0.64285,Denmark 2009,NA
NA,Denmark 2010,4.93885
.55260,Denmark 2011,NA
NA,Denmark 2012,5.101908
0.13187,United Kingdom 2007,NA
NA,United Kingdom 2008,3.18781"
df <- read.csv(text = textData)
After reading the data into a data frame, we extract the last 4 characters from the Country column to create Year, merge v1 and v2 into a single column, add a yearType column, and use it to split the data into even and odd years.
library(dplyr)
library(stringr)
df %>%
mutate(countryLength = str_length(Country),
countryName = substr(Country,1,countryLength - 5),
Year = as.numeric(substr(Country,countryLength - 3,countryLength)),
value = if_else(!is.na(v1),v1,v2),
yearType = if_else(Year %% 2 == 0,"Even","Odd")) %>%
select(!c(Country,countryLength,v1,v2)) %>%
rename(Country = countryName) %>%
split(.$yearType) -> dataList
Having split the data into two data frames, we now rename columns in the even year data frame, subtract 1 from Year to merge with the odd numbered year data, join with the odd numbered year data, rename a few columns and add a column for the even numbered years.
dataList$Even %>%
rename(EvenYearValue = value) %>%
mutate(Year = Year - 1) %>%
select(-yearType) %>%
full_join(dataList$Odd,by = c("Country","Year")) %>%
rename(OddYearValue = value,
OddYear = Year) %>%
mutate(EvenYear = OddYear + 1) %>% select(-yearType)
...and the output:
Country OddYear EvenYearValue OddYearValue EvenYear
1 Denmark 2007 5.519108 0.93181 2008
2 Denmark 2009 4.938850 0.64285 2010
3 Denmark 2011 5.101908 0.55260 2012
4 United Kingdom 2007 3.187810 0.13187 2008
If it is absolutely required to append the start and end years to the Country column, that can be accomplished as follows.
dataList$Even %>%
rename(EvenYearValue = value) %>%
mutate(Year = Year - 1) %>%
select(-yearType) %>%
full_join(dataList$Odd,by = c("Country","Year")) %>%
rename(OddYearValue = value,
OddYear = Year) %>%
mutate(EvenYear = OddYear + 1) %>% select(-yearType) %>%
# modify the Country name to include years
mutate(Country = paste(Country,OddYear,"-",EvenYear))
...and the output:
Country OddYear EvenYearValue OddYearValue EvenYear
1 Denmark 2007 - 2008 2007 5.519108 0.93181 2008
2 Denmark 2009 - 2010 2009 4.938850 0.64285 2010
3 Denmark 2011 - 2012 2011 5.101908 0.55260 2012
4 United Kingdom 2007 - 2008 2007 3.187810 0.13187 2008
Original Solution
First, we convert the graphic from the question into usable data, and include a couple of rows for a country name that contains multiple words.
textData <- "v1,Country,v2
0.93181,Denmark 2007,NA
NA,Denmark 2008,5.519108
0.64285,Denmark 2009,NA
NA,Denmark 2010,4.93885
.55260,Denmark 2011,NA
NA,Denmark 2012,5.101908
0.13187,United Kingdom 2007,NA
NA,United Kingdom 2008,3.18781"
df <- read.csv(text = textData)
Next, we load a couple of packages, create a column to count the number of characters in each row of Country, and use it to separate Year from countryName. We also drop the intermediary columns created during this operation and save the result to yearlyData.
library(dplyr)
library(stringr)
df %>%
mutate(countryLength = str_length(Country),
countryName = substr(Country,1,countryLength - 5),
Year = as.numeric(substr(Country,countryLength - 3,countryLength))) %>%
select(!c(Country,countryLength)) %>%
rename(Country = countryName) -> yearlyData
At this point we separate the even years data into another data frame, drop the v1 variable, and subtract 1 from Year so we can merge it with the data for odd numbered years.
yearlyData %>%
filter(Year %% 2 == 0) %>%
select(-v1) %>%
mutate( Year = Year - 1) -> evenYears
Next, we read the yearly data, filter() out the rows for even numbered years, merge in the evenYears data frame via full_join(), rename a few columns and generate a new column for the even numbered years.
yearlyData %>%
filter(Year %% 2 == 1) %>%
rename(OddYearValue = v1) %>%
select(-v2) %>%
full_join(.,evenYears,by = c("Year","Country")) %>%
rename(EvenYearValue = v2,
OddYear = Year) %>%
mutate(EvenYear = OddYear + 1)
...and the output:
OddYearValue Country OddYear EvenYearValue EvenYear
1 0.93181 Denmark 2007 5.519108 2008
2 0.64285 Denmark 2009 4.938850 2010
3 0.55260 Denmark 2011 5.101908 2012
4 0.13187 United Kingdom 2007 3.187810 2008
NOTE: the tidy data specification asserts that each column in a data frame should contain one and only one variable, so we did not combine OddYear, EvenYear and Country into a single column as requested in the original post.
A tidy friendly solution
In the classic article on this topic, Hadley Wickham defines two forms of tidy data, narrow / long form and wide form.
The following solution creates a tidy data long form data frame, where each row in the resulting table is one value for each combination of Country and Year.
textData <- "v1,Country,v2
0.93181,Denmark 2007,NA
NA,Denmark 2008,5.519108
0.64285,Denmark 2009,NA
NA,Denmark 2010,4.93885
.55260,Denmark 2011,NA
NA,Denmark 2012,5.101908
0.13187,United Kingdom 2007,NA
NA,United Kingdom 2008,3.18781"
df <- read.csv(text = textData)
library(dplyr)
library(stringr)
df %>%
mutate(countryLength = str_length(Country),
countryName = substr(Country,1,countryLength - 5),
Year = as.numeric(substr(Country,countryLength - 3,countryLength)),
value = if_else(!is.na(v1),v1,v2)) %>%
select(!c(Country,countryLength,v1,v2)) %>%
rename(Country = countryName) -> yearlyData
yearlyData
...and the output:
> yearlyData
Country Year value
1 Denmark 2007 0.931810
2 Denmark 2008 5.519108
3 Denmark 2009 0.642850
4 Denmark 2010 4.938850
5 Denmark 2011 0.552600
6 Denmark 2012 5.101908
7 United Kingdom 2007 0.131870
8 United Kingdom 2008 3.187810
Ironically, given the input data, it's much easier to create a long form tidy data frame than it is to format the data as requested in the original post.
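For comparison, here is a hedged sketch of going from the long form back to one row per pair of years with tidyr::pivot_wider(). It rebuilds the long data by hand and assumes, as in the example data, that each pair is an odd year followed by the next even year; paired and yearType are illustrative names.

```r
library(dplyr)
library(tidyr)

# Long form data, typed in to match the yearlyData output shown above
yearlyData <- data.frame(
  Country = c(rep("Denmark", 6), rep("United Kingdom", 2)),
  Year    = c(2007, 2008, 2009, 2010, 2011, 2012, 2007, 2008),
  value   = c(0.93181, 5.519108, 0.64285, 4.93885,
              0.5526, 5.101908, 0.13187, 3.18781)
)

paired <- yearlyData %>%
  mutate(
    # Label each row by the column it should end up in
    yearType = if_else(Year %% 2 == 0, "EvenYearValue", "OddYearValue"),
    # Even years share a key with the odd year just before them
    OddYear  = if_else(Year %% 2 == 0, Year - 1, Year)
  ) %>%
  select(-Year) %>%
  pivot_wider(names_from = yearType, values_from = value) %>%
  mutate(EvenYear = OddYear + 1)

paired
```

This should give the same four rows as the refactored solution, without splitting the data into separate even and odd data frames.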
I had to make some very specific assumptions about the whole dataset. I hope they also apply to the rest of the table:
You always merge exactly two rows together (never more than two).
The format of the column with Country + year is always the same.
Earlier years (e.g. 2007) always have a non-NA value in the first column and an NA in the last column, while the opposite is true for later years (e.g. 2008).
If these assumptions hold, one approach is to first create two tibbles, each containing Country and year plus only the non-NA values of v1 and v2, respectively (i.e. dropping all the NAs).
Then you add one year to the tibble containing v1, and finally perform an inner join on the year.
To make the code more readable and avoid repetition, I created a function that takes care of the string extraction.
# Import data and libraries
library(dplyr)
library(tidyr)
library(stringr)
df <- tribble(
~v1,~Country,~v2,
#--|--|---
0.93181,"Denmark 2007",NA,
NA,"Denmark 2008",5.519108,
0.64285,"Denmark 2009",NA,
NA,"Denmark 2010",4.93885,
0.55260,"Denmark 2011",NA,
NA,"Denmark 2012",5.101908,
0.13187,"New Zealand 2007",NA,
NA,"New Zealand 2008",3.187819
)
# Regular expressions to extract year and country from the Country column
regexp_year <- "[[:digit:]]+"
regexp_country <- "[[:alpha:]\\s]+"
# Function that carries out the string extraction from the `Country` column
do_separate_df <- function(df) {
df %>%
mutate(year = str_extract(Country,regexp_year) %>% as.numeric()) %>%
mutate(Country = str_extract(Country,regexp_country) %>% str_trim()) # trim the trailing space left before the year
}
# Tibble with non-NA values in v1 (earlier year)
df_v1 <- df %>%
select(v1,Country) %>%
drop_na %>%
do_separate_df()
# Tibble with non-NA values in v2 (later year)
df_v2 <- df %>%
select(Country,v2) %>%
drop_na %>%
do_separate_df()
# Join on df_v1$year + 1 = df_v2$year
df_combined <-inner_join(
df_v1 %>% mutate(year_to_match = year + 1),
df_v2,
by=c("year_to_match" = "year", "Country")
) %>%
mutate(Country = paste(Country, year, year + 1, sep = " ")) %>%
relocate(Country) %>%
select(-c(year,year_to_match))
df_combined
Country                v1       v2
Denmark 2007 2008      0.93181  5.519108
Denmark 2009 2010      0.64285  4.938850
Denmark 2011 2012      0.55260  5.101908
New Zealand 2007 2008  0.13187  3.187819

Merge Data Frames of different years

I'm trying to merge some dataframes using R. You can find the data here: https://www.kaggle.com/mathurinache/world-happiness-report.
There are 6 dataframes, one for each year (2015-2020).
Is there any way of merging these dataframes using the year as a new column?
Ex:
Year Country Region
2015 Switzerland Western Europe ...
2016 Switzerland Western Europe ...
2017 Switzerland Western Europe ...
.
.
.
.
You first need to clean the data so that it contains the same columns. I gave it a shot, but it's not perfect yet. You still need to figure out how healthy life expectancy is defined across each year. In 2020 it seems to be a number in years, in 2019 it seems to be a standardized value and in the other years it is probably the proportion relative to the real life expectancy. Further, the log transformation of GDP in 2020 is my best guess, so no guarantee that this brings the 2020 data on the same scale as the rest of the data.
library(tidyverse)
mypath <- "Insert/Your/Path/Here/"
file_ls <- paste0(mypath, list.files(path = mypath, pattern = "\\.csv$")) # pattern is a regular expression, not a glob
dat_ls <- tibble(year = 2015:2020, # setup a nested tibble
data = set_names(map(file_ls, ~ read.csv(.x) %>% as_tibble), year)) %>%
# mutate the 2020 data to match the other years
mutate(data = map_at(data, "2020",
~ mutate(.x,
Rank = rank(desc(Ladder.score)),
Logged.GDP.per.capita = log(exp(Logged.GDP.per.capita), exp(8)))
)) %>%
# enter here more map_at calls to transform the life expectancy column
# ...
# switch to rowwise, so that `map` etc is no longer needed
rowwise(year) %>%
# select names in each data set in given order and then rename them with set_names
mutate(data2 = list(select(data,
contains("country"),
contains("rank"),
contains("score") & !contains("Dystopia") & !contains("standard"),
contains("gdp") & !contains("explained"),
contains("expectancy") & !contains("explained")
) %>%
set_names(., c("country",
"rank",
"happiness_score",
"gdp_per_capita",
"life_expectancy"))
)) %>%
select(year, data2) %>%
unnest(data2)
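Once the yearly tables share column names, the stacking step itself can be quite short. The sketch below uses two tiny made-up tables (the scores and the extra Region column are toy values, not taken from the Kaggle files) to show dplyr::bind_rows() with .id, which turns the list names into a Year column.

```r
library(dplyr)

# Toy stand-ins for the yearly files; values are invented for illustration
yearly <- list(
  "2015" = data.frame(Country = c("Switzerland", "Iceland"),
                      Score = c(7.5, 7.4)),
  "2016" = data.frame(Country = c("Denmark", "Switzerland"),
                      Score = c(7.6, 7.5),
                      Region = c("Western Europe", "Western Europe"))
)

# Stack the tables; the list names become the Year column
combined <- bind_rows(yearly, .id = "Year") %>%
  mutate(Year = as.integer(Year))

combined
```

Columns that exist in only some years (Region here) are filled with NA for the other years, which is why harmonising the column names first, as in the answer above, still matters.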

Searching and using databases

df <- read.csv ('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv')
df8 <- read.csv ('https://raw.githubusercontent.com/hirenvadher954/Worldometers-Scraping/master/countries.csv')
In the first dataset, the countries are divided into continents.
The second dataset contains country and population information.
How can I combine the population information in dataset 2 according to the continent information in dataset 1?
The problem is that the first dataset lists countries by continent, while the second dataset lists countries and their populations. What I need is the population of each continent, e.g. Europe = 400 million, Asia = 2.4 billion. Thank you.
Using the dplyr package all you have to do is join by a common variable, in this case country name. Since in one data frame the name is called countryName and in the other one country_name, we just have to specify that they in fact belong to the same variable.
library(dplyr)
library(stringr)
df %>%
left_join(df8, by = c("countryName" = "country_name")) %>%
mutate(population = as.numeric(str_remove_all(population, ","))) %>%
group_by(countryName) %>%
slice_tail(n = 1) %>%
group_by(region) %>%
summarize(population = sum(population, na.rm = TRUE))
# A tibble: 5 x 2
region population
* <chr> <dbl>
1 Africa 1304908713
2 Americas 1019607512
3 Asia 4592311527
4 Europe 738083720
5 Oceania 40731992
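A join on country names only works where the names are spelled identically in both tables. As a hedged sketch with made-up data (the populations are rounded toy values and the spelling mismatch is invented for illustration), anti_join() can list the countries that would silently end up with a missing population.

```r
library(dplyr)

# Toy stand-ins for the two datasets; values are invented for illustration
cases <- data.frame(
  countryName = c("France", "Germany", "Japan", "United States"),
  region      = c("Europe", "Europe", "Asia", "Americas")
)
pop <- data.frame(
  country_name = c("France", "Germany", "Japan", "USA"),
  population   = c("65,000,000", "83,000,000", "126,000,000", "331,000,000")
)

# Countries in `cases` with no matching name in `pop`
unmatched <- anti_join(cases, pop, by = c("countryName" = "country_name"))

# Same join-and-sum pattern as above, on the toy data
by_region <- cases %>%
  left_join(pop, by = c("countryName" = "country_name")) %>%
  mutate(population = as.numeric(gsub(",", "", population))) %>%
  group_by(region) %>%
  summarise(population = sum(population, na.rm = TRUE))

unmatched
by_region
```

Here the unmatched country contributes nothing to its region's total because of na.rm = TRUE, so checking unmatched before trusting the sums seems worthwhile.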

How can I convert this dataframe into a multiple time series object in R?

I'm trying to clean up some data (https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv) regarding the COVID19 Novel Coronavirus to do various types of analysis (ie. create a chart of countries with 100 cases over time, or track the death-rate over time per country). I used data which had the dates as columns and countries as rows. I transposed the Dataframe so that I got a column for each country and a single column of dates as shown below.
I have attempted to read this dataframe in as a time series object through the following code:
covid19ts = ts(covid19, frequency = 365, start = c(2020,22))
The result is the following. Instead of getting dates as my index column I get a number from 1 - 47 (the number of days recorded). This results in me being unable to create charts or do any meaningful analysis.
I have also tried the following code using the lubridate package with the same results:
covid19ts = ts(covid19, frequency = 365, start= decimal_date(as.Date("2020-01-22")))
How can I make my ts dates into the actual dates for charting and analysis?
Or is there a completely different approach I could be using which would be better for the analysis I'm trying to do?
Thank you for your help.
You could keep the data as a dataframe and do useful plotting. Maybe get the data in long format.
library(tidyverse)
df <- read.csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv', check.names = FALSE)
df1 <- df %>% pivot_longer(cols = -(1:4))
head(df1)
# A tibble: 6 x 6
# `Province/State` `Country/Region` Lat Long name value
# <fct> <fct> <dbl> <dbl> <chr> <int>
#1 Anhui Mainland China 31.8 117. 1/22/20 1
#2 Anhui Mainland China 31.8 117. 1/23/20 9
#3 Anhui Mainland China 31.8 117. 1/24/20 15
#4 Anhui Mainland China 31.8 117. 1/25/20 39
#5 Anhui Mainland China 31.8 117. 1/26/20 60
#6 Anhui Mainland China 31.8 117. 1/27/20 70
If you want to convert the data into time-series as shown in your post, you could do :
df2 <- df1 %>%
group_by(`Country/Region`, name) %>%
summarise(value = sum(value)) %>%
pivot_wider(names_from = `Country/Region`, values_from = value,
values_fill = list(value = 0))
ts_data <- xts::xts(df2[-1], as.Date(df2$name, "%m/%d/%y"))
An alternative solution suggested by @G. Grothendieck relying on zoo is
library(zoo)
z <- read.zoo(df1[c(2, 5:6)], index = "name", split = "Country/Region",
format = "%m/%d/%y", aggregate = sum) # two-digit years, so %y rather than %Y
read.zoo avoids all the explicit aggregating and reshaping by tidyverse. We can then use autoplot function to plot this zoo object.
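A ts object stores only a numeric start and frequency rather than calendar dates, whereas xts and zoo keep a real Date index. The following is a small sketch with invented counts (CountryA and CountryB are placeholder names) showing the date parsing used above and date-based subsetting on the resulting xts object.

```r
library(xts)

# Column headers in the source file look like "1/22/20" (two-digit year)
raw_dates <- c("1/22/20", "1/23/20", "1/24/20", "1/25/20")
dates <- as.Date(raw_dates, format = "%m/%d/%y")

# Toy counts, invented for illustration
toy_counts <- data.frame(CountryA = c(1, 3, 6, 10),
                         CountryB = c(0, 0, 2, 5))

toy_xts <- xts(toy_counts, order.by = dates)

# The index is made of real dates, not 1..n
index(toy_xts)

# Everything from 24 January 2020 onwards
toy_xts["2020-01-24/"]
```

With a Date index in place, plotting functions can label the time axis with actual dates.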
Rather than use ts or xts objects, this is best suited to a tsibble format like this.
library(tidyverse)
library(tsibble)
library(feasts)
covid19 <- read_csv("time_series_19-covid-Confirmed.csv") %>%
pivot_longer(cols = -(1:4)) %>%
mutate(date = lubridate::mdy(name)) %>%
select(-name) %>%
rename(
"Region" = `Province/State`,
"Country" = `Country/Region`
) %>%
as_tsibble(key = c(Region, Country), index = date)
# Plot by country
covid19 %>%
filter(Country %in% c("China", "Italy", "Iran", "South Korea")) %>%
group_by(Country) %>%
summarise(value = sum(value)) %>%
autoplot(value)
Created on 2020-03-09 by the reprex package (v0.3.0)

Aggregate on a xts object by matching column names with grouping variables present in another dataframe in R

I have a time series object suppose :
library(xts)
exposure <- xts(Google = c(100, 200,300,400,500,600,700,800),
Apple = c(10, 20,30,40,50,60,70,80),
Audi = c(1,2,3,4,5,6,7,8),
BMW = c(1000, 2000,3000,4000,5000,6000,7000,8000),
AENA = c(50,51,52,53,54,55,56,57,58),
order.by = Sys.Date() - 1:8)
I have a dataframe :
map <- data.frame(Company = c("Google", "Apple", " Audi", "BMW", " AENA"),
Country = c("US", "US", " GERMANY", "GERMANY", " SPAIN"))
I want to aggregate in exposure object based on the country to which the companies are mapped. Basically my output will be a xts object with same index as exposure but column names would be that of US, GERMANY, SPAIN. For example for a particular date under US column I would want sum of exposures for Google and Apple for that date.
Any help is welcome.
I think there was an error with your original data specification. This is a way to do it by first moving it out of the xts format and then back into it again.
data
I made a few changes to how the xts object is created. I also cleaned some mistaken spaces up.
library(xts)
df <- data.frame(Google = c(100, 200,300,400,500,600,700,800),
Apple = c(10, 20,30,40,50,60,70,80),
Audi = c(1,2,3,4,5,6,7,8),
BMW = c(1000, 2000,3000,4000,5000,6000,7000,8000),
AENA = c(50,51,52,53,54,55,56,57))
exposure <- xts(df, order.by = Sys.Date() - 1:8)
map <- data.frame(Company = c("Google", "Apple", "Audi", "BMW", "AENA"),
Country = c("US", "US", "GERMANY", "GERMANY", "SPAIN"),
stringsAsFactors = FALSE)
aggregation
I use tbl2xts to convert the format. Then, we use dplyr and tidyr to pivot the data to a long format, join in the Country to each Company, and summarize over the Country. We then convert back to xts, spreading the data wide by Country.
library(tbl2xts)
library(dplyr)
library(tidyr)
xts_tbl(exposure) %>%
pivot_longer(-date, names_to = "Company") %>%
left_join(map, by = "Company") %>%
group_by(date, Country) %>%
summarize(value = sum(value)) %>%
ungroup() %>%
tbl_xts(spread_by = "Country")
result
GERMANY SPAIN US
2019-10-28 8008 57 880
2019-10-29 7007 56 770
2019-10-30 6006 55 660
2019-10-31 5005 54 550
2019-11-01 4004 53 440
2019-11-02 3003 52 330
2019-11-03 2002 51 220
2019-11-04 1001 50 110
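If adding tbl2xts is not an option, the same aggregation can be sketched with xts and base R only. This is an alternative to, not a restatement of, the approach above: it looks up each column's country with match() and sums the matching columns with rowSums(). The fixed date 2019-11-05 replaces Sys.Date() purely so the example is reproducible.

```r
library(xts)

df <- data.frame(Google = c(100, 200, 300, 400, 500, 600, 700, 800),
                 Apple  = c(10, 20, 30, 40, 50, 60, 70, 80),
                 Audi   = c(1, 2, 3, 4, 5, 6, 7, 8),
                 BMW    = c(1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000),
                 AENA   = c(50, 51, 52, 53, 54, 55, 56, 57))
# Fixed reference date (an assumption) instead of Sys.Date()
exposure <- xts(df, order.by = as.Date("2019-11-05") - 1:8)

map <- data.frame(Company = c("Google", "Apple", "Audi", "BMW", "AENA"),
                  Country = c("US", "US", "GERMANY", "GERMANY", "SPAIN"),
                  stringsAsFactors = FALSE)

# Plain numeric matrix, rows already sorted by date by xts
m <- coredata(exposure)

# Country of each column, in column order
groups <- map$Country[match(colnames(m), map$Company)]

# One summed column per country
agg <- sapply(unique(groups),
              function(g) rowSums(m[, groups == g, drop = FALSE]))

by_country <- xts(agg, order.by = index(exposure))
by_country
```

The columns come out in the order the countries first appear in the mapping (US, GERMANY, SPAIN) rather than alphabetically.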
