Reshaping line organized data in a time series [duplicate] - r

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 3 years ago.
I'm working with a time-series tibble that is organized in the following way:
Country<- ('Somalia')
'1961'<- 2999
'1962'<- 2917
'1963'<- 1853
df <- data.frame(Country, `1961`, `1962`, `1963`)
df
The problem is that it is extremely hard to work with data organized this way, since the only way to access the values I want (the numbers under the year column names) is to refer to each year column individually.
Is there a simple way to organize them in a tidy way, such as:
x <- 'Somalia'
y <- c('1961', '1962', '1963')
z <- c(2999, 2917, 1853)
df <- data.frame(x, y, z)
df
Without having to manually rebuild the entire dataset?

> library(tidyverse)
> df %>%
gather(Year, Value, -Country)
Country Year Value
1 Somalia 1961 2999
2 Somalia 1962 2917
3 Somalia 1963 1853
where df is
df <- data.frame(Country = "Somalia",
`1961` = 2999,
`1962` = 2917,
`1963` = 1853,
check.names = FALSE)
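In current tidyr, gather() is superseded by pivot_longer(). A minimal sketch of the same reshape, assuming tidyr 1.0.0 or later is installed; the object name long is only illustrative:

```r
library(tidyr)

# Same one-row wide data frame as in the answer above
df <- data.frame(Country = "Somalia",
                 `1961` = 2999,
                 `1962` = 2917,
                 `1963` = 1853,
                 check.names = FALSE)

# Every column except Country becomes a (Year, Value) pair
long <- pivot_longer(df, cols = -Country,
                     names_to = "Year", values_to = "Value")

# Column names are strings, so convert Year if numeric years are needed
long$Year <- as.integer(long$Year)
long
```

The explicit conversion of Year is optional; it is included here because the former column names arrive as character strings.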

Related

How to merge two rows so 2 years of data is represented in one row

This is a cutout of my dataframe
I have a dataframe with two different variables that are recorded one year apart from each other. I would like to combine, for example, 2007 and 2008 into one row containing both variables, and name it Denmark2007/8.
I have about 300 rows to do this with. I cannot find a command that will do this, and typing it manually is out of the question.
I have looked at everything from merge() to colSums(), and I am lost.
While one can debate whether a wide format data frame will be easiest to use in subsequent analysis steps, the tricky part of this request is that the names of countries may include multiple words. This means that a simpler solution like tidyr::separate() with sep = " " isn't feasible.
Here is a solution that uses length of each Country to extract the last 4 characters into a Year column, and everything before the final space as Country.
For the purposes of this example, v1 represents the odd year data, and v2 represents the even year data.
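The same extraction can also be sketched with a regular expression anchored at the end of the string, which avoids computing string lengths. This is only an illustration, under the assumption that every value ends in a space followed by a four-digit year; the vectors x, country and year are hypothetical names, not part of the solution below.

```r
# Toy input in the same "Country Year" shape as the question
x <- c("Denmark 2007", "United Kingdom 2008")

# Drop the trailing space and four digits to get the country name
country <- sub(" [0-9]{4}$", "", x)

# Capture the trailing four digits as the year
year <- as.numeric(sub("^.* ([0-9]{4})$", "\\1", x))

data.frame(Country = country, Year = year)
```

Because the pattern is anchored with $, spaces inside a multi-word country name are left untouched.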
Refactored Solution
After coding the tidyverse friendly answer (see below), I realized I could simplify the original solution by starting with the long form tidy data, splitting it into even and odd years, renaming columns and then merging by year.
First, we create data based on the graphic in the original post, and add a couple of rows for a country whose name includes multiple words.
textData <- "v1,Country,v2
0.93181,Denmark 2007,NA
NA,Denmark 2008,5.519108
0.64285,Denmark 2009,NA
NA,Denmark 2010,4.93885
.55260,Denmark 2011,NA
NA,Denmark 2012,5.101908
0.13187,United Kingdom 2007,NA
NA,United Kingdom 2008,3.18781"
df <- read.csv(text = textData)
After reading the data into a data frame, we extract the last 4 characters from the Country column to create Year, merge v1 and v2 into a single column, add a yearType column, and use it to split the data into even and odd years.
library(dplyr)
library(stringr)
df %>%
mutate(countryLength = str_length(Country),
countryName = substr(Country,1,countryLength - 5),
Year = as.numeric(substr(Country,countryLength - 4,countryLength)),
value = if_else(!is.na(v1),v1,v2),
yearType = if_else(Year %% 2 == 0,"Even","Odd")) %>%
select(!c(Country,countryLength,v1,v2)) %>%
rename(Country = countryName) %>%
split(.$yearType) -> dataList
Having split the data into two data frames, we now rename columns in the even year data frame, subtract 1 from Year to merge with the odd numbered year data, join with the odd numbered year data, rename a few columns and add a column for the even numbered years.
dataList$Even %>%
rename(EvenYearValue = value) %>%
mutate(Year = Year - 1) %>%
select(-yearType) %>%
full_join(dataList$Odd,by = c("Country","Year")) %>%
rename(OddYearValue = value,
OddYear = Year) %>%
mutate(EvenYear = OddYear + 1) %>% select(-yearType)
...and the output:
Country OddYear EvenYearValue OddYearValue EvenYear
1 Denmark 2007 5.519108 0.93181 2008
2 Denmark 2009 4.938850 0.64285 2010
3 Denmark 2011 5.101908 0.55260 2012
4 United Kingdom 2007 3.187810 0.13187 2008
If it is absolutely required to append the start and end years to the Country column, that can be accomplished as follows.
dataList$Even %>%
rename(EvenYearValue = value) %>%
mutate(Year = Year - 1) %>%
select(-yearType) %>%
full_join(dataList$Odd,by = c("Country","Year")) %>%
rename(OddYearValue = value,
OddYear = Year) %>%
mutate(EvenYear = OddYear + 1) %>% select(-yearType) %>%
# modify the Country name to include years
mutate(Country = paste(Country,OddYear,"-",EvenYear))
...and the output:
Country OddYear EvenYearValue OddYearValue EvenYear
1 Denmark 2007 - 2008 2007 5.519108 0.93181 2008
2 Denmark 2009 - 2010 2009 4.938850 0.64285 2010
3 Denmark 2011 - 2012 2011 5.101908 0.55260 2012
4 United Kingdom 2007 - 2008 2007 3.187810 0.13187 2008
Original Solution
First, we convert the graphic from the question into usable data, and include a couple of rows for a country name that contains multiple words.
textData <- "v1,Country,v2
0.93181,Denmark 2007,NA
NA,Denmark 2008,5.519108
0.64285,Denmark 2009,NA
NA,Denmark 2010,4.93885
.55260,Denmark 2011,NA
NA,Denmark 2012,5.101908
0.13187,United Kingdom 2007,NA
NA,United Kingdom 2008,3.18781"
df <- read.csv(text = textData)
Next, we load a couple of packages, create a column to count the number of characters in each row of Country, and use it to separate Year from countryName. We also drop the intermediary columns created during this operation and save the result to yearlyData.
library(dplyr)
library(stringr)
df %>%
mutate(countryLength = str_length(Country),
countryName = substr(Country,1,countryLength - 5),
Year = as.numeric(substr(Country,countryLength - 4,countryLength))) %>%
select(!c(Country,countryLength)) %>%
rename(Country = countryName) -> yearlyData
At this point we separate the even years data into another data frame, drop the v1 variable, and subtract 1 from Year so we can merge it with the data for odd numbered years.
yearlyData %>%
filter(Year %% 2 == 0) %>%
select(-v1) %>%
mutate( Year = Year - 1) -> evenYears
Next, we read the yearly data, filter() out the rows for even numbered years, merge in the evenYears data frame via full_join(), rename a few columns and generate a new column for the even numbered years.
yearlyData %>%
filter(Year %% 2 == 1) %>%
rename(OddYearValue = v1) %>%
select(-v2) %>%
full_join(.,evenYears,by = c("Year","Country")) %>%
rename(EvenYearValue = v2,
OddYear = Year) %>%
mutate(EvenYear = OddYear + 1)
...and the output:
OddYearValue Country OddYear EvenYearValue EvenYear
1 0.93181 Denmark 2007 5.519108 2008
2 0.64285 Denmark 2009 4.938850 2010
3 0.55260 Denmark 2011 5.101908 2012
4 0.13187 United Kingdom 2007 3.187810 2008
NOTE: the tidy data specification asserts that each column in a data frame should contain one and only one variable, so we did not combine OddYear, EvenYear and Country into a single column as requested in the original post.
A tidy friendly solution
In the classic article on this topic, Hadley Wickham defines two forms of tidy data, narrow / long form and wide form.
The following solution creates a tidy data long form data frame, where each row in the resulting table is one value for each combination of Country and Year.
textData <- "v1,Country,v2
0.93181,Denmark 2007,NA
NA,Denmark 2008,5.519108
0.64285,Denmark 2009,NA
NA,Denmark 2010,4.93885
.55260,Denmark 2011,NA
NA,Denmark 2012,5.101908
0.13187,United Kingdom 2007,NA
NA,United Kingdom 2008,3.18781"
df <- read.csv(text = textData)
library(dplyr)
library(stringr)
df %>%
mutate(countryLength = str_length(Country),
countryName = substr(Country,1,countryLength - 5),
Year = as.numeric(substr(Country,countryLength - 4,countryLength)),
value = if_else(!is.na(v1),v1,v2)) %>%
select(!c(Country,countryLength,v1,v2)) %>%
rename(Country = countryName) -> yearlyData
yearlyData
...and the output:
> yearlyData
Country Year value
1 Denmark 2007 0.931810
2 Denmark 2008 5.519108
3 Denmark 2009 0.642850
4 Denmark 2010 4.938850
5 Denmark 2011 0.552600
6 Denmark 2012 5.101908
7 United Kingdom 2007 0.131870
8 United Kingdom 2008 3.187810
Ironically, given the input data, it's much easier to create a long form tidy data frame than it is to format the data as requested in the original post.
I had to make very specific assumptions about the whole dataset. I hope they apply also to the rest of the table:
You always merge two rows together (not more than two)
The format of the column with Country + year is always the same
Older years (e.g. 2007) always have a non-NA value in the first column and an NA in the last column, while the opposite is true for more recent years (e.g. 2008).
If these assumptions hold, I thought to work it out by first creating two tibbles containing the columns for Country and year, only the non-NA values in v1 and v2, respectively (i.e. dropping all the NAs).
Then you add one year to the tibble containing v1, and finally perform an inner join on the year.
To make it more readable and to avoid repeating code, I created a function that takes care of the string extraction.
# Import data and libraries
library(dplyr)
library(tidyr)
library(stringr)
df <- tribble(
~v1,~Country,~v2,
#--|--|---
0.93181,"Denmark 2007",NA,
NA,"Denmark 2008",5.519108,
0.64285,"Denmark 2009",NA,
NA,"Denmark 2010",4.93885,
0.55260,"Denmark 2011",NA,
NA,"Denmark 2012",5.101908,
0.13187,"New Zealand 2007",NA,
NA,"New Zealand 2008",3.187819
)
# Regular expressions to extract year and country from the Country column
regexp_year <- "[[:digit:]]+"
regexp_country <- "[[:alpha:]\\s]+"
# Function that carries out the string extraction from the `Country` column
do_separate_df <- function(df) {
df %>%
mutate(year = str_extract(Country,regexp_year) %>% as.numeric()) %>%
# str_trim() removes the trailing space that the pattern captures before the year
mutate(Country = str_trim(str_extract(Country,regexp_country)))
}
# Tibble with non-NA values in v1 (earlier year)
df_v1 <- df %>%
select(v1,Country) %>%
drop_na %>%
do_separate_df()
# Tibble with non-NA values in v2 (later year)
df_v2 <- df %>%
select(Country,v2) %>%
drop_na %>%
do_separate_df()
# Join on df_v1$year + 1 = df_v2$year
df_combined <-inner_join(
df_v1 %>% mutate(year_to_match = year + 1),
df_v2,
by=c("year_to_match" = "year", "Country")
) %>%
mutate(Country = paste(Country, year, year + 1, sep = " ")) %>%
relocate(Country) %>%
select(-c(year,year_to_match))
df_combined
Country                  v1       v2
Denmark 2007 2008        0.93181  5.519108
Denmark 2009 2010        0.64285  4.938850
Denmark 2011 2012        0.55260  5.101908
New Zealand 2007 2008    0.13187  3.187819

How do you output results of a loop applied to a data frame into a new data frame? [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
How to select the rows with maximum values in each group with dplyr? [duplicate]
(6 answers)
Closed 1 year ago.
I am working on a data frame with the following columns - country, factor, year, number of technicians.
What I want to achieve: create a new data frame containing only the most recent data (i.e. highest year) for each country (NOTE: there are several rows of data for each country).
I have created a function to isolate the country and year as follows. (NOTE: main dataset = numeric)
#Function to isolate 1 country
one_country1 <- function(x) {
a <- numeric %>% filter(Factor == x)
return(a) }
#Function to isolate latest year
latest_country <- function(y){
b <- y %>% filter(Year == max(Year))
return(b) }
#Function to isolate both country and latest year
best_data <- function(z){
G <- latest_country(one_country1(z))
return(G)}
I then made this into a for loop to apply it to each country as follows.
z <- 1
loop_data <- for(z in 1:114){
print(best_data(z))}
This produces the correct data but it is in a strange format that is not a data frame. When I try 'typeof' it says 'NULL' and I can't seem to convert it into a data frame using simple as.data.frame functions or starting with an empty data frame and incorporating rbind.data.frame into the function. The results appear as follows:
Country Factor Year Technicians
1 Yemen 112 2010 1809
Country Factor Year Technicians
1 Zambia 113 2012 1126
Country Factor Year Technicians
1 Zimbabwe 114 2018 1126
typeof(loop_data) = NULL
Any help on how to rework this code to output a data frame would be much appreciated! I only started learning R a couple of weeks ago, so please forgive how amateurish and untidy the code may be!
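library(dplyr)
# No loop is needed: group by country and keep the rows with each group's latest year
loop_data <- df %>%
  group_by(Country) %>%
  filter(Year == max(Year)) %>%
  ungroup()
alternatively, if you want to keep the loop, store each country's result in a list and bind the pieces afterwards
Country <- unique(df$Country)
results <- list()
for (country in Country) {
  # one small data frame per country, kept under the country's name
  results[[country]] <- df %>%
    filter(Country == country) %>%
    filter(Year == max(Year))
}
loop_data <- bind_rows(results)
or, to keep exactly one row per country even when the latest year is tied
results <- list()
for (country in Country) {
  results[[country]] <- df %>%
    filter(Country == country) %>%
    slice(which.max(Year))
}
loop_data <- bind_rows(results)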
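For context on the NULL in the question: a for loop in R is evaluated for its side effects and returns NULL, and print() only writes to the console, so nothing is ever collected. A base R sketch of the collect-then-bind pattern, using a small made-up data frame (the earlier-year rows and their values are invented for illustration):

```r
# Toy data standing in for the asker's data frame (earlier-year rows are made up)
df <- data.frame(
  Country = c("Yemen", "Yemen", "Zambia", "Zambia"),
  Year = c(2008, 2010, 2012, 2009),
  Technicians = c(1500, 1809, 1126, 900)
)

# Assigning the value of a for loop always yields NULL
res <- for (i in 1:2) i
is.null(res)

# Build one small data frame per country, then stack them into one
pieces <- lapply(split(df, df$Country),
                 function(d) d[d$Year == max(d$Year), ])
latest <- do.call(rbind, pieces)
latest
```

split() plus lapply() plays the role of the loop here, and do.call(rbind, ...) turns the resulting list into a single data frame.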

Merge Data Frames of different years

I'm trying to merge some dataframes using R. You can find the data here: https://www.kaggle.com/mathurinache/world-happiness-report.
There are 6 dataframes, one for each year (2015-2020).
Is there any way of merging these dataframes using the year as a new column?
Ex:
Year Country Region
2015 Switzerland Western Europe ...
2016 Switzerland Western Europe ...
2017 Switzerland Western Europe ...
.
.
.
.
You first need to clean the data so that it contains the same columns. I gave it a shot, but it's not perfect yet. You still need to figure out how healthy life expectancy is defined across each year. In 2020 it seems to be a number in years, in 2019 it seems to be a standardized value and in the other years it is probably the proportion relative to the real life expectancy. Further, the log transformation of GDP in 2020 is my best guess, so no guarantee that this brings the 2020 data on the same scale as the rest of the data.
library(tidyverse)
mypath <- "Insert/Your/Path/Here/"
# list.files() takes a regular expression, not a glob, so match names ending in .csv
file_ls <- paste0(mypath, list.files(path = mypath, pattern = "\\.csv$"))
dat_ls <- tibble(year = 2015:2020, # setup a nested tibble
data = set_names(map(file_ls, ~ read.csv(.x) %>% as_tibble), year)) %>%
# mutate the 2020 data to match the other years
mutate(data = map_at(data, "2020",
~ mutate(.x,
Rank = rank(desc(Ladder.score)),
Logged.GDP.per.capita = log(exp(Logged.GDP.per.capita), exp(8)))
)) %>%
# enter here more map_at calls to transform the life expectancy column
# ...
# switch to rowwise, so that `map` etc is no longer needed
rowwise(year) %>%
# select names in each data set in given order and then rename them with set_names
mutate(data2 = list(select(data,
contains("country"),
contains("rank"),
contains("score") & !contains("Dystopia") & !contains("standard"),
contains("gdp") & !contains("explained"),
contains("expectancy") & !contains("explained")
) %>%
set_names(., c("country",
"rank",
"happiness_score",
"gdp_per_capita",
"life_expectancy"))
)) %>%
select(year, data2) %>%
unnest(data2)
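Once the yearly tables share the same columns, the stacking step itself can be sketched with dplyr::bind_rows() and its .id argument. The two data frames below are tiny invented stand-ins for the cleaned yearly files, and the names y2015, y2016 and combined are hypothetical:

```r
library(dplyr)

# Two tiny stand-ins for the cleaned yearly files (values are invented)
y2015 <- data.frame(Country = c("Switzerland", "Iceland"), Score = c(7.5, 7.4))
y2016 <- data.frame(Country = c("Denmark", "Switzerland"), Score = c(7.6, 7.5))

# Name each list element by its year; .id turns those names into a column
combined <- bind_rows(list(`2015` = y2015, `2016` = y2016), .id = "Year") %>%
  mutate(Year = as.integer(Year))
combined
```

bind_rows() fills columns missing from one of the inputs with NA, which is why harmonizing column names first, as the answer does, matters.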

Convert aggregate result into summary table [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 3 years ago.
Given this data frame, which is the result of a sum aggregation on the data,
how can I transform it into the usual table layout, like the result of table(), in order to plot it properly?
For a clearer picture of the desired output, it should be something like this:
Mountain Bikes Road Bikes
2005 130694 708713
2006 168445 1304031
2007 0 561122
I even tried something silly like calculating the values individually and then combining them, but the result is still a data frame, so it treats the first column as values instead of headers.
A solution using dplyr and tidyr.
library(dplyr)
library(tidyr)
dat2 <- dat %>%
group_by(Category, Year) %>%
summarize(SUM = sum(x)) %>%
spread(Category, SUM, fill = 0)
dat2
# # A tibble: 3 x 3
# Year `Mountain Bikes` `Road Bikes`
# <dbl> <dbl> <dbl>
# 1 2005 130694 708713
# 2 2006 168445 1304031
# 3 2007 0 561122
DATA
dat <- data.frame(Category = paste(c("Mountain", "Road", "Mountain",
"Road", "Road"), "Bikes", sep = " "),
Year = c(2005, 2005, 2006, 2006, 2007),
x = c(130694, 708713, 168445, 1304031, 561122))
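spread() is superseded in current tidyr; the same long-to-wide step can be sketched with pivot_wider(). This assumes tidyr 1.1.0 or later (for the scalar values_fill) and dplyr 1.0.0 or later (for .groups); dat_wide is an illustrative name:

```r
library(dplyr)
library(tidyr)

# Same data as in the DATA section above
dat <- data.frame(Category = paste(c("Mountain", "Road", "Mountain",
                                     "Road", "Road"), "Bikes", sep = " "),
                  Year = c(2005, 2005, 2006, 2006, 2007),
                  x = c(130694, 708713, 168445, 1304031, 561122))

dat_wide <- dat %>%
  group_by(Category, Year) %>%
  summarise(SUM = sum(x), .groups = "drop") %>%
  # one column per Category; missing combinations become 0
  pivot_wider(names_from = Category, values_from = SUM, values_fill = 0)
dat_wide
```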

apply function to grouped rows in dataframe [duplicate]

This question already has answers here:
Split dataframe using two columns of data and apply common transformation on list of resulting dataframes
(3 answers)
Closed 5 years ago.
I have created a function that computes a number of biological statistics, such as species range edges. Here is a simplified version of the function:
range_stats <- function(rangedf, lat, lon, weighting, na.rm=T){
cent_lat <- weighted.mean(x=rangedf[,lat], w=rangedf[,weighting], na.rm=T)
cent_lon <- weighted.mean(x=rangedf[,lon], w=rangedf[,weighting], na.rm=T)
out <- data.frame(cent_lat, cent_lon)
return(out)
}
I would like to apply this to a large dataframe where every row is an observation of a species. As such, I want the function to group rows by a specified set of columns, and then compute these statistics for each group. Here is a test dataframe:
LATITUDE <- c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313)
LONGITUDE <- c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837)
BIOMASS <- c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017)
SPECIES <- c('Abudefduf abdominalis','Abudefduf abdominalis','Abudefduf abdominalis','Chaetodon lunulatus','Chaetodon lunulatus','Chaetodon lunulatus')
YEAR <- c('2005', '2005', '2014', '2009', '2009', '2015')
testdf <- data.table(LATITUDE, LONGITUDE, BIOMASS, SPECIES, YEAR)
I want to apply this function to every unique combination of species and year to calculate summary statistics, i.e., the following:
testresult <- testdf %>%
group_by(SPECIES, YEAR) %>%
range_stats(lat="LATITUDE",lon="LONGITUDE",weighting="BIOMASS",na.rm=T)
However, the code above does not work (I get a (list) object cannot be coerced to type 'double' error) and I am not sure how else to approach the problem.
Since you tagged the question with dplyr and purrr, I assume you are interested in a tidyverse solution, so that is what I will demonstrate below.
First, your range_stats is problematic. This is why you got the error message. The weighted.mean is expecting a vector for both the x and w argument. However, if rangedf is a tibble, the way you subset the tibble, such as rangedf[,lat] will still return a one-column tibble. A better way is to use pull from the dplyr package.
library(tidyverse)
range_stats <- function(rangedf, lat, lon, weighting, na.rm=T){
cent_lat <- weighted.mean(x = rangedf %>% pull(lat),
w = rangedf %>% pull(weighting), na.rm=T)
cent_lon <- weighted.mean(x = rangedf %>% pull(lon),
w = rangedf %>% pull(weighting), na.rm=T)
out <- data.frame(cent_lat, cent_lon)
return(out)
}
Next, the way you created the data frame is OK, but data.table() comes from the data.table package, so you create a data.table, not a tibble. I assumed you want a tidyverse approach, so I changed data.table() to tibble() (the older data_frame() alias is deprecated) as follows.
LATITUDE <- c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313)
LONGITUDE <- c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837)
BIOMASS <- c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017)
SPECIES <- c('Abudefduf abdominalis','Abudefduf abdominalis','Abudefduf abdominalis','Chaetodon lunulatus','Chaetodon lunulatus','Chaetodon lunulatus')
YEAR <- c('2005', '2005', '2014', '2009', '2009', '2015')
testdf <- tibble(LATITUDE, LONGITUDE, BIOMASS, SPECIES, YEAR)
Now, you said you want to apply the range_stats function to each combination of SPECIES and YEAR. One approach is to split the data frame into a list of data frames and use a function from the lapply family. But here I want to show you how to use the map family of functions to achieve this task, as map is from the purrr package, which is part of the tidyverse.
We can first create a group index based on SPECIES and YEAR.
testdf2 <- testdf %>%
mutate(Group = group_indices(., SPECIES, YEAR))
testdf2
# A tibble: 6 x 6
LATITUDE LONGITUDE BIOMASS SPECIES YEAR Group
<dbl> <dbl> <dbl> <chr> <chr> <int>
1 27.91977 -175.8617 4.3540488 Abudefduf abdominalis 2005 1
2 21.29066 -157.8645 0.2406332 Abudefduf abdominalis 2005 1
3 26.06340 -173.9593 0.2406332 Abudefduf abdominalis 2014 2
4 28.38918 -178.3571 2.1419699 Chaetodon lunulatus 2009 3
5 25.97517 -173.9679 0.3451426 Chaetodon lunulatus 2009 3
6 27.96313 -175.7837 1.0946017 Chaetodon lunulatus 2015 4
As you can see, Group is a new column showing the index number. Now we can split the data frame based on Group, and then use map_dfr to apply the range_stats function.
testresult <- testdf2 %>%
split(.$Group) %>%
map_dfr(range_stats, lat = "LATITUDE",lon = "LONGITUDE",
weighting = "BIOMASS", na.rm = TRUE, .id = "Group")
testresult
Group cent_lat cent_lon
1 1 27.57259 -174.9191
2 2 26.06340 -173.9593
3 3 28.05418 -177.7480
4 4 27.96313 -175.7837
Notice that map_dfr automatically binds the output list of data frames into a single data frame. .id = "Group" means we want to create a column called Group based on the names of the list elements.
I separated the process into two steps, but of course they can be all in one pipeline as follows.
testresult <- testdf %>%
mutate(Group = group_indices(., SPECIES, YEAR)) %>%
split(.$Group) %>%
map_dfr(range_stats, lat = "LATITUDE",lon = "LONGITUDE",
weighting = "BIOMASS", na.rm = TRUE, .id = "Group")
If you want, testresult can be merged with testdf using left_join, but I will stop here as testresult is probably already the desired output you want. I hope this helps.
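As a hedged alternative, recent dplyr releases may emit deprecation warnings for calling group_indices() this way; there, the per-group application can be sketched with group_by() plus group_modify(), which also keeps SPECIES and YEAR in the result. This assumes dplyr 0.8.1 or later. The range_stats() below is a simplified rewrite (it indexes with [[ ]] and drops the unused na.rm argument), not the exact function from the question:

```r
library(dplyr)

# Simplified rewrite of range_stats(); [[ ]] returns a plain vector
# for data frames and tibbles alike
range_stats <- function(rangedf, lat, lon, weighting) {
  cent_lat <- weighted.mean(rangedf[[lat]], w = rangedf[[weighting]], na.rm = TRUE)
  cent_lon <- weighted.mean(rangedf[[lon]], w = rangedf[[weighting]], na.rm = TRUE)
  data.frame(cent_lat, cent_lon)
}

testdf <- tibble(
  LATITUDE = c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313),
  LONGITUDE = c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837),
  BIOMASS = c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017),
  SPECIES = rep(c("Abudefduf abdominalis", "Chaetodon lunulatus"), each = 3),
  YEAR = c("2005", "2005", "2014", "2009", "2009", "2015")
)

# group_modify() calls the function once per group; .x is that group's rows
testresult <- testdf %>%
  group_by(SPECIES, YEAR) %>%
  group_modify(~ range_stats(.x, "LATITUDE", "LONGITUDE", "BIOMASS")) %>%
  ungroup()
testresult
```

The result has one row per SPECIES and YEAR combination, so no separate Group index or join back to the original data is needed.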
Fundamentally, the main issue involves weighted.mean() where you are passing a dataframe object and not a vector that can be coerced to double. To fix within method, simply change:
x=rangedf[,lat]
To double brackets:
x=rangedf[[lat]]
Adjusted method:
range_stats <- function(rangedf, lat, lon, weighting, na.rm=T){
cent_lat <- weighted.mean(x=rangedf[[lat]], w=rangedf[[weighting]], na.rm=T)
cent_lon <- weighted.mean(x=rangedf[[lon]], w=rangedf[[weighting]], na.rm=T)
out <- data.frame(cent_lat, cent_lon)
return(out)
}
As for the overall group-by computation, forgive me for bypassing dplyr and data.table, which you use; consider instead base R's underutilized but useful function, by().
The challenge with your current setup is that range_stats() returns a two-column data.frame, while dplyr's group_by() expects one aggregation vector operation. by(), however, passes data frame subsets (sliced by the grouping factors) into a defined function and returns a list of data.frames, which you can then rbind into one final data frame:
df_List <- by(testdf, testdf[, c("SPECIES", "YEAR")], FUN=function(df)
data.frame(species=df$SPECIES[1],
year=df$YEAR[1],
range_stats(df,"LATITUDE","LONGITUDE","BIOMASS"))
)
finaldf <- do.call(rbind, df_List)
finaldf
# species year cent_lat cent_lon
# 1 Abudefduf abdominalis 2005 27.57259 -174.9191
# 2 Chaetodon lunulatus 2009 28.05418 -177.7480
# 3 Abudefduf abdominalis 2014 26.06340 -173.9593
# 4 Chaetodon lunulatus 2015 27.96313 -175.7837
