I have several data frames. In each one, the first column holds the country names and the remaining columns are the years for which an indicator was measured. The country column and the year columns are the same in every data frame. What are the ways to merge the datasets by the first column? How can I merge them into a multidimensional array? Dataset example:
country name   2005    ....   2020
Aruba          23591
Angola         1902

country name   2005    ....   2020
Aruba          -8.8
Angola         -3.5
Doing a full_join
library(dplyr)
full_join(DataSet1,DataSet2, by = 'country name')
changes the name of the columns and the data is not accessible.
1) Assuming the data frames shown in the Note at the end, we can use bind_rows:
library(dplyr)
bind_rows(DF1, DF2, .id = "id")
giving the following which takes all the rows from both data frames and identifies which data frame each row came from.
id countryName 2005 2006
1 1 Aruba 1 2
2 1 Angola 3 4
3 2 Aruba 11 12
4 2 Angola 13 14
2) Another possibility is to create a 3d array
library(abind)
a <- abind(DF1[-1], DF2[-1], along = 3, new.names = list(DF1$countryName,NULL,1:2))
a
giving this 3d array where the dimensions correspond to the country name, the year and the originating data.frame.
, , 1
2005 2006
Aruba 1 2
Angola 3 4
, , 2
2005 2006
Aruba 11 12
Angola 13 14
We can get various slices:
> a["Aruba",,]
1 2
2005 1 11
2006 2 12
> a[,"2005",]
1 2
Aruba 1 11
Angola 3 13
> a[,,2]
2005 2006
Aruba 11 12
Angola 13 14
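For comparison, the same 3d array can probably be built in base R without abind. This is only a sketch: it assumes both data frames list the countries in the same order and have identical year columns (checked with stopifnot below), and the slice labels "1" and "2" as well as the name a2 are arbitrary choices made here.

```r
# Toy data from the Note at the end of this answer
DF1 <- data.frame(countryName = c("Aruba", "Angola"),
                  `2005` = c(1L, 3L), `2006` = c(2L, 4L),
                  check.names = FALSE)
DF2 <- data.frame(countryName = c("Aruba", "Angola"),
                  `2005` = c(11L, 13L), `2006` = c(12L, 14L),
                  check.names = FALSE)

# Assumption: rows and columns already line up across the data frames
stopifnot(identical(DF1$countryName, DF2$countryName),
          identical(names(DF1), names(DF2)))

# Stack the two value matrices along a third dimension
a2 <- array(
  c(as.matrix(DF1[-1]), as.matrix(DF2[-1])),
  dim = c(nrow(DF1), ncol(DF1) - 1, 2),
  dimnames = list(DF1$countryName, names(DF1)[-1], c("1", "2"))
)

a2["Aruba", "2005", "2"]  # value for Aruba, 2005, second data frame
```

If the countries are not in the same order, one of the data frames would first need to be reordered, for example with match() on the country column.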
Note
DF1 <- structure(list(countryName = c("Aruba", "Angola"), `2005` = c(1L,
3L), `2006` = c(2L, 4L)), class = "data.frame", row.names = c(NA, -2L))
DF2 <- structure(list(countryName = c("Aruba", "Angola"), `2005` = c(11L,
13L), `2006` = c(12L, 14L)), class = "data.frame", row.names = c(NA, -2L))
> DF1
countryName 2005 2006
1 Aruba 1 2
2 Angola 3 4
> DF2
countryName 2005 2006
1 Aruba 11 12
2 Angola 13 14
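On the full_join problem raised in the question: the columns are renamed because both data frames share the same year column names, so dplyr appends .x and .y to tell them apart. A minimal sketch, assuming the DF1 and DF2 from the Note, of choosing the suffixes yourself; the labels _ind1 and _ind2 are made up here for illustration.

```r
library(dplyr)

DF1 <- data.frame(countryName = c("Aruba", "Angola"),
                  `2005` = c(1L, 3L), `2006` = c(2L, 4L),
                  check.names = FALSE)
DF2 <- data.frame(countryName = c("Aruba", "Angola"),
                  `2005` = c(11L, 13L), `2006` = c(12L, 14L),
                  check.names = FALSE)

# suffix replaces the default ".x"/".y" added to clashing column names
wide <- full_join(DF1, DF2, by = "countryName",
                  suffix = c("_ind1", "_ind2"))

names(wide)

# Names that start with a digit need backticks when used with $
wide$`2005_ind2`
```

The data stays accessible in this layout, but every column has to be quoted with backticks, which is one reason the long format from bind_rows or the 3d array may be easier to work with.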
Related
I've looked around but I can't find an answer to this!
I've imported a large number of datasets to R.
Each dataset contains information for a single year (ex. df_2012, df_2013, df_2014 etc).
All the datasets have the same variables/columns (ex. varA_2012 in df_2012 corresponds to varA_2013 in df_2013).
I want to create a df with my id variable and varA_2012, varB_2012, varA_2013, varB_2013, varA_2014, varB_2014 etc
I'm trying to create a loop that helps me extract the few columns that I'm interested in (varA_XXXX, varB_XXXX) in each data frame and then do a full join based on my id var.
I haven't used R in a very long time...
So far, I've tried this:
id <- c("France", "Belgium", "Spain")
varA_2012 <- c(1,2,3)
varB_2012 <- c(7,2,9)
varC_2012 <- c(1,56,0)
varD_2012 <- c(13,55,8)
varA_2013 <- c(34,3,56)
varB_2013 <- c(2,53,5)
varC_2013 <- c(24,3,45)
varD_2013 <- c(27,13,8)
varA_2014 <- c(9,10,5)
varB_2014 <- c(95,30,75)
varC_2014 <- c(99,0,51)
varD_2014 <- c(9,40,1)
df_2012 <-data.frame(id, varA_2012, varB_2012, varC_2012, varD_2012)
df_2013 <-data.frame(id, varA_2013, varB_2013, varC_2013, varD_2013)
df_2014 <-data.frame(id, varA_2014, varB_2014, varC_2014, varD_2014)
year = c(2012:2014)
for(i in 1:length(year)) {
df_[i] <- df_[I][df_[i]$id, df_[i]$varA_[i], df_[i]$varB_[i], ]
list2env(df_[i], .GlobalEnv)
}
panel_df <- Reduce(function(x, y) merge(x, y, by="if"), list(df_2012, df_2013, df_2014))
I know that there are probably loads of errors in here.
Here are a couple of options; however, it's unclear what you want the expected output to look like.
If you want a wide format, then we can use tidyverse to do:
library(tidyverse)
results <-
  map(list(df_2012, df_2013, df_2014), function(x)
    x %>% dplyr::select(id, starts_with("varA"), starts_with("varB"))) %>%
  # full_join keeps every id; left_join has no all = argument
  reduce(function(x, y)
    full_join(x, y, by = "id"))
Output
id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 France 1 7 34 2 9 95
2 Belgium 2 2 3 53 10 30
3 Spain 3 9 56 5 5 75
However, if you need it in a long format, then we could pivot the data:
results %>%
pivot_longer(-id, names_to = c("variable", "year"), names_sep = "_")
Output
id variable year value
<chr> <chr> <chr> <dbl>
1 France varA 2012 1
2 France varB 2012 7
3 France varA 2013 34
4 France varB 2013 2
5 France varA 2014 9
6 France varB 2014 95
7 Belgium varA 2012 2
8 Belgium varB 2012 2
9 Belgium varA 2013 3
10 Belgium varB 2013 53
11 Belgium varA 2014 10
12 Belgium varB 2014 30
13 Spain varA 2012 3
14 Spain varB 2012 9
15 Spain varA 2013 56
16 Spain varB 2013 5
17 Spain varA 2014 5
18 Spain varB 2014 75
Or if using base R for the wide format, then we can do:
results <-
lapply(list(df_2012, df_2013, df_2014), function(x)
subset(x, select = c("id", names(x)[startsWith(names(x), "varA")], names(x)[startsWith(names(x), "varB")])))
results <-
Reduce(function(x, y)
merge(x, y, all = TRUE, by = "id"), results)
From your initial for loop attempt, it seems the code below may help
> (df <- Reduce(merge, list(df_2012, df_2013, df_2014)))[grepl("^(id|var(A|B))",names(df))]
id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 Belgium 2 2 3 53 10 30
2 France 1 7 34 2 9 95
3 Spain 3 9 56 5 5 75
I have some data that is structured something like this:
ID Region Value
1 Europe 8
2 Europe: Class 1 6
3 Asia: System 2 6
4 North America 7
5 Europe: System 1 5
6 Africa 7
7 Africa: Class 2 5
8 South America 9
9 Europe: System 1 3
10 Europe 7
What I want to do is create a new column called Class which adds instances of where "Class" AND "System" are mentioned in the Region column - if it's not clear what I mean, take a look at my expected output below. I know this can be done with the separate function but I think you can only specify one value for the separator part of the code. E.g. sep = ": Class" will only split instances that mention "class" but I also want to split any instances where "system" is mentioned too. Can this be done in one line of code, or do I need to do something a bit more complicated here? Here's how my final data should look:
ID Region Class Value
1 Europe 8
2 Europe 1 6
3 Asia 2 6
4 North America 7
5 Europe 1 5
6 Africa 7
7 Africa 2 5
8 South America 9
9 Europe 1 3
10 Europe 7
Please note, I want to remove any reference to "class" or "system" (including colons) from the Region column, and simply add the numerical value to a new Class column.
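You can do it with base functions by using strsplit with a regular expression that treats either ": System " or ": Class " (including the trailing space) as the delimiter. Using vapply rather than lapply keeps Region and Class as ordinary vectors instead of list columns:
splitted = strsplit(df$Region, ": (Class|System) ")
# first piece is the region; second piece (if any) is the number
df$Region = vapply(splitted, function(x) x[1], character(1))
df$Class = as.integer(vapply(splitted, function(x) x[2], character(1)))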
The result is:
> df
ID Region Value Class
1 1 Europe 8 NA
2 2 Europe 6 1
3 3 Asia 6 2
4 4 North America 7 NA
5 5 Europe 5 1
6 6 Africa 7 NA
7 7 Africa 5 2
8 8 South America 9 NA
9 9 Europe 3 1
10 10 Europe 7 NA
You can use str_extract to extract the number and str_remove to drop the text that you don't want.
library(dplyr)
library(stringr)
df %>%
mutate(Class = str_extract(Region, '(?<=(Class|System)\\s)\\d+'),
Region = str_remove(Region, ':\\s*(Class|System)\\s*\\d+'))
# ID Region Value Class
#1 1 Europe 8 <NA>
#2 2 Europe 6 1
#3 3 Asia 6 2
#4 4 North America 7 <NA>
#5 5 Europe 5 1
#6 6 Africa 7 <NA>
#7 7 Africa 5 2
#8 8 South America 9 <NA>
#9 9 Europe 3 1
#10 10 Europe 7 <NA>
str_extract extracts the number which comes after 'Class' or 'System'. If these words are not present then it returns NA.
str_remove removes a colon followed by zero or more whitespace characters (\\s*), followed by either 'Class' or 'System' and a number (\\d+).
data
It is easier to help if you provide data in a reproducible format which is easier to copy.
df <- structure(list(ID = 1:10, Region = c("Europe", "Europe: Class 1",
"Asia: System 2", "North America", "Europe: System 1", "Africa",
"Africa: Class 2", "South America", "Europe: System 1", "Europe"
), Value = c(8L, 6L, 6L, 7L, 5L, 7L, 5L, 9L, 3L, 7L)),
class = "data.frame", row.names = c(NA, -10L))
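On the question's point about separate accepting only one separator: the sep argument of tidyr::separate is interpreted as a regular expression, so a single call can probably handle both words. A minimal sketch, assuming the df above and that tidyr is installed; the name out is chosen here for illustration.

```r
library(tidyr)

df <- data.frame(
  ID = 1:10,
  Region = c("Europe", "Europe: Class 1", "Asia: System 2", "North America",
             "Europe: System 1", "Africa", "Africa: Class 2", "South America",
             "Europe: System 1", "Europe"),
  Value = c(8L, 6L, 6L, 7L, 5L, 7L, 5L, 9L, 3L, 7L),
  stringsAsFactors = FALSE
)

# sep is a regex: a colon, then "Class" or "System", with optional spaces.
# fill = "right" puts NA in Class when a row has no separator.
out <- separate(df, Region, into = c("Region", "Class"),
                sep = ":\\s*(Class|System)\\s*", fill = "right")
out
```

This leaves the columns in the order ID, Region, Class, Value, which matches the layout requested in the question. Class comes back as character; adding convert = TRUE to the call should turn it into integers.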
I have a dataframe of about 8,000 country-year observations. I want to model the correlates of an event that occurs in certain country-years. To do so properly, I need to drop observations after the event starts until it ends. The events can vary in length from less than one year to 30 years. In my df, I have a column that identifies the threshold_year and termination_year for each event. This column obviously contains many NAs for those countries and years that do not experience the event.
How do I drop observations that fall between the threshold and termination years for specific countries? I have tried the approach from "filtering observations from time series conditionally by group", but it yields an empty dataset.
See code I have attempted below. (BTW, this is my first question on SO).
df <- structure(list(country_id = c(475, 150, 475, 475, 475, 475, 475, 150, 475, 475, 475), year = c(1962, 1967, 1964, 1965, 1966, 1967, 1968, 1968, 1970, 1971, 1972), event = c(0L, 0L, 0L, 0L, 1L, 3L, 0L, 0L, 0L, 0L, 0L), threshold_year = c(NA, NA, NA, NA, 1966, 1967, NA, NA, NA, NA, NA), termination_year = c(NA, NA, NA, NA, 1966, 1970, NA, NA, NA, NA, NA)), .Names = c("country_id", "year", "event", "threshold_year", "termination_year"), row.names = 90:100, class = "data.frame")
df2 <- df %>%
group_by(country_id) %>%
filter(year<=threshold_year & year>termination_year)
I expect a smaller df, perhaps with about 7,000 observations. My attempts typically produce 0 observations.
EDIT
I discovered an inelegant and clumsy process for resolving this issue. I joined my complete dataframe with my threshold dataframe by country only, not year. This adds a column with threshold and termination years for every country that has an event. It also creates a lot of duplicates, but that doesn't matter. Since I no longer have NAs in my threshold and termination columns, I can easily code a dummy variable for each observation to determine whether it falls within the threshold and termination years. I can also concatenate country ids and years into a country-year key. Once I subset my lengthy dataframe by whether the dummy = 1, I can easily create a list of all country-years that need to be dropped. I then go back to my original data and threshold data set, left_join by both country and year, and subset this data by !(id_year %in% drops).
df_drops <- left_join(df, threshold_df, by=c("id"="id"))
df_drops$drops <- ifelse(df_drops$year>df_drops$threshold_year & df_drops$year<=df_drops$termination_year, 1,0)
df_drops$obs_to_drop <- ifelse(df_drops$drops==1, paste(df_drops$id,df_drops$year, sep="_"), NA)
drops <- unique(df_drops$obs_to_drop)
df2 <- left_join(df, threshold_df,by=c("id"="id","year"="threshold_year"))
df2$id_year <- paste(df2$id,df2$year,sep="_")
df3 <- subset(df2, !(df2$id_year %in% drops))
I am assuming that you have a list of thresholds that are specific to each group. If so, you can put the thresholds in a new data frame, then merge them with your original country-year data frame, and finally filter. My toy example below assumes that the end date is 2 years after the start date.
library(dplyr)
df <- data.frame(country=rep(letters[1:20],each=50),
                 year=sample(1999:2018,50,TRUE))
threshold <- data.frame(country=letters[1:20],
                        start=as.numeric(sample(1999:2016,20,TRUE))) %>%
  mutate(end=start + 2)
# attach each country's start/end, then keep the years inside that window
df %>% left_join(threshold, by = "country") %>%
  filter(year>=start & year<=end)
country year start end
1 a 2016 2014 2016
2 a 2016 2014 2016
3 a 2014 2014 2016
4 a 2015 2014 2016
5 a 2015 2014 2016
6 a 2015 2014 2016
7 a 2014 2014 2016
8 b 2006 2004 2006
9 b 2004 2004 2006
10 b 2004 2004 2006
11 b 2005 2004 2006
12 b 2006 2004 2006
13 b 2004 2004 2006
14 b 2004 2004 2006
15 b 2006 2004 2006
16 b 2006 2004 2006
17 b 2006 2004 2006
18 b 2006 2004 2006
19 c 2010 2008 2010
20 c 2009 2008 2010
...
I think my guess was right, simply adding | is.na(threshold_year) is enough, at least for the sample data provided.
df %>% group_by(country_id) %>%
filter((year <= threshold_year & year > termination_year) | is.na(threshold_year))
# # A tibble: 9 x 5
# # Groups: country_id [2]
# country_id year event threshold_year termination_year
# <dbl> <dbl> <int> <dbl> <dbl>
# 1 475 1962 0 NA NA
# 2 150 1967 0 NA NA
# 3 475 1964 0 NA NA
# 4 475 1965 0 NA NA
# 5 475 1968 0 NA NA
# 6 150 1968 0 NA NA
# 7 475 1970 0 NA NA
# 8 475 1971 0 NA NA
# 9 475 1972 0 NA NA
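Note that the filter above removes only the rows that carry their own threshold_year; a later year that falls inside an earlier event's window (such as 1968 for country 475) stays in. A possible base R sketch of dropping those rows as well, assuming the question's sample df and the same convention as the question's EDIT (drop years after the threshold up to and including the termination year). The names windows, in_window and kept are made up here.

```r
df <- data.frame(
  country_id = c(475, 150, 475, 475, 475, 475, 475, 150, 475, 475, 475),
  year = c(1962, 1967, 1964, 1965, 1966, 1967, 1968, 1968, 1970, 1971, 1972),
  event = c(0L, 0L, 0L, 0L, 1L, 3L, 0L, 0L, 0L, 0L, 0L),
  threshold_year = c(NA, NA, NA, NA, 1966, 1967, NA, NA, NA, NA, NA),
  termination_year = c(NA, NA, NA, NA, 1966, 1970, NA, NA, NA, NA, NA)
)

# One row per event window: country, start and end year
windows <- unique(df[!is.na(df$threshold_year),
                     c("country_id", "threshold_year", "termination_year")])

# For every observation, check whether its year lies in any window
# of the same country (after the threshold, up to the termination year)
in_window <- mapply(function(id, yr) {
  w <- windows[windows$country_id == id, ]
  any(yr > w$threshold_year & yr <= w$termination_year)
}, df$country_id, df$year)

kept <- df[!in_window, ]
kept
```

On the sample data this drops 1968 and 1970 for country 475 and keeps the event-onset years 1966 and 1967. Whether the onset year itself should stay is a modelling decision; changing > to >= would drop it too.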
I have a dataset in R which I am trying to aggregate by column level and year which looks like this:
City State Year Status Year_repealed PolicyNo
Pitt PA 2001 InForce 6
Phil. PA 2001 Repealed 2004 9
Pitt PA 2002 InForce 7
Pitt PA 2005 InForce 2
What I would like to create is where for each Year, I aggregate the PolicyNo across states taking into account the date the policy was repealed. The results I would then get is:
Year State PolicyNo
2001 PA 15
2002 PA 22
2003 PA 22
2004 PA 12
2005 PA 14
I am not sure how to go about splitting and aggregating the data conditional on the repeal date, and was wondering if there is a way to achieve this in R easily.
It may help you to break this up into two distinct problems.
1. Get a table that shows the change in PolicyNo in every city-state-year.
2. Summarize that table to show the PolicyNo in each state-year.
To accomplish (1) we add the missing years with NA PolicyNo, and add repeals as negative PolicyNo observations.
library(dplyr)
df = structure(list(City = c("Pitt", "Phil.", "Pitt", "Pitt"), State = c("PA", "PA", "PA", "PA"), Year = c(2001L, 2001L, 2002L, 2005L), Status = c("InForce", "Repealed", "InForce", "InForce"), Year_repealed = c(NA, 2004L, NA, NA), PolicyNo = c(6L, 9L, 7L, 2L)), .Names = c("City", "State", "Year", "Status", "Year_repealed", "PolicyNo"), class = "data.frame", row.names = c(NA, -4L))
repeals = df %>%
filter(!is.na(Year_repealed)) %>%
mutate(Year = Year_repealed, PolicyNo = -1 * PolicyNo)
repeals
# City State Year Status Year_repealed PolicyNo
# 1 Phil. PA 2004 Repealed 2004 -9
all_years = expand.grid(City = unique(df$City), State = unique(df$State),
Year = 2001:2005)
df = bind_rows(df, repeals, all_years)
# City State Year Status Year_repealed PolicyNo
# 1 Pitt PA 2001 InForce NA 6
# 2 Phil. PA 2001 Repealed 2004 9
# 3 Pitt PA 2002 InForce NA 7
# 4 Pitt PA 2005 InForce NA 2
# 5 Phil. PA 2004 Repealed 2004 -9
# 6 Pitt PA 2001 <NA> NA NA
# 7 Phil. PA 2001 <NA> NA NA
# 8 Pitt PA 2002 <NA> NA NA
# 9 Phil. PA 2002 <NA> NA NA
# 10 Pitt PA 2003 <NA> NA NA
# 11 Phil. PA 2003 <NA> NA NA
# 12 Pitt PA 2004 <NA> NA NA
# 13 Phil. PA 2004 <NA> NA NA
# 14 Pitt PA 2005 <NA> NA NA
# 15 Phil. PA 2005 <NA> NA NA
Now the table shows every city-state-year and incorporates repeals. This is a table we can summarize.
df = df %>%
group_by(Year, State) %>%
summarize(annual_change = sum(PolicyNo, na.rm = TRUE))
df
# Source: local data frame [5 x 3]
# Groups: Year [?]
#
# Year State annual_change
# <int> <chr> <dbl>
# 1 2001 PA 15
# 2 2002 PA 7
# 3 2003 PA 0
# 4 2004 PA -9
# 5 2005 PA 2
That gets us PolicyNo change in each state-year. A cumulative sum over the changes gets us levels.
df = df %>%
ungroup() %>%
mutate(PolicyNo = cumsum(annual_change))
df
# # A tibble: 5 × 4
# Year State annual_change PolicyNo
# <int> <chr> <dbl> <dbl>
# 1 2001 PA 15 15
# 2 2002 PA 7 22
# 3 2003 PA 0 22
# 4 2004 PA -9 13
# 5 2005 PA 2 15
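The sample data has a single state, so a plain cumsum after ungroup() is enough. With more than one state, the running total would presumably need to restart for each state. A minimal sketch under that assumption; the second state "OH" and all of its numbers are invented purely for illustration, as are the names changes and policy_levels.

```r
library(dplyr)

# Hypothetical annual changes for two states (OH values are made up)
changes <- data.frame(
  Year = rep(2001:2003, times = 2),
  State = rep(c("PA", "OH"), each = 3),
  annual_change = c(15, 7, 0, 4, -1, 2)
)

policy_levels <- changes %>%
  arrange(State, Year) %>%      # cumsum depends on row order
  group_by(State) %>%           # restart the running total per state
  mutate(PolicyNo = cumsum(annual_change)) %>%
  ungroup()

policy_levels
```

Sorting by Year before taking the cumulative sum matters because cumsum works on rows in their current order.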
With the data.table package you could do it as follows:
library(data.table)
# dat is the data frame from the question
melt(setDT(dat),
measure.vars = c(3,5),
value.name = 'Year',
value.factor = FALSE)[!is.na(Year)
][variable == 'Year_repealed', PolicyNo := -1*PolicyNo
][CJ(Year = min(Year):max(Year), State = State, unique = TRUE), on = .(Year, State)
][is.na(PolicyNo), PolicyNo := 0
][, .(PolicyNo = sum(PolicyNo)), by = .(Year, State)
][, .(Year, State, PolicyNo = cumsum(PolicyNo))]
The result of the above code:
Year State PolicyNo
1: 2001 PA 15
2: 2002 PA 22
3: 2003 PA 22
4: 2004 PA 13
5: 2005 PA 15
As you can see, several steps are needed to reach the desired end result:
First, convert to a data.table (setDT(dat)), reshape it into long format, and remove the rows with no Year.
Then make the PolicyNo value negative for the rows where variable is 'Year_repealed'.
With a cross join (CJ), make sure that all the years for each state are present, and convert the NA values in the PolicyNo column to zero.
Finally, summarise by year and state and take a cumulative sum of the result.
How can I 'unpivot' a table? What is the proper technical term for this?
UPDATE: The term is called melt
I have a data frame for countries and data for each year
Country 2001 2002 2003
Nigeria 1 2 3
UK 2 NA 1
And I want to have something like
Country Year Value
Nigeria 2001 1
Nigeria 2002 2
Nigeria 2003 3
UK 2001 2
UK 2002 NA
UK 2003 1
I still can't believe I beat Andrie with an answer. :)
> library(reshape)
> my.df <- read.table(text = "Country 2001 2002 2003
+ Nigeria 1 2 3
+ UK 2 NA 1", header = TRUE)
> my.result <- melt(my.df, id = c("Country"))
> my.result[order(my.result$Country),]
Country variable value
1 Nigeria X2001 1
3 Nigeria X2002 2
5 Nigeria X2003 3
2 UK X2001 2
4 UK X2002 NA
6 UK X2003 1
The base R reshape approach for this problem is pretty ugly, particularly since the names aren't in a form that reshape likes. It would be something like the following, where the first setNames line modifies the column names into something that reshape can make use of.
reshape(
setNames(mydf, c("Country", paste0("val.", c(2001, 2002, 2003)))),
direction = "long", idvar = "Country", varying = 2:ncol(mydf),
sep = ".", new.row.names = seq_len(prod(dim(mydf[-1]))))
A better alternative in base R is to use stack, like this:
cbind(mydf[1], stack(mydf[-1]))
# Country values ind
# 1 Nigeria 1 2001
# 2 UK 2 2001
# 3 Nigeria 2 2002
# 4 UK NA 2002
# 5 Nigeria 3 2003
# 6 UK 1 2003
There are also new tools for reshaping data now available, like the "tidyr" package, which gives us gather. Of course, the tidyr:::gather_.data.frame method just calls reshape2::melt, so this part of my answer doesn't necessarily add much except introduce the newer syntax that you might be encountering in the Hadleyverse.
library(tidyr)
gather(mydf, year, value, `2001`:`2003`) ## Note the backticks
# Country year value
# 1 Nigeria 2001 1
# 2 UK 2001 2
# 3 Nigeria 2002 2
# 4 UK 2002 NA
# 5 Nigeria 2003 3
# 6 UK 2003 1
All three options here would need reordering of rows if you want the row order you showed in your question.
A fourth option would be to use merged.stack from my "splitstackshape" package. Like base R's reshape, you'll need to modify the column names to something that includes a "variable" and "time" indicator.
library(splitstackshape)
merged.stack(
setNames(mydf, c("Country", paste0("V.", 2001:2003))),
var.stubs = "V", sep = ".")
# Country .time_1 V
# 1: Nigeria 2001 1
# 2: Nigeria 2002 2
# 3: Nigeria 2003 3
# 4: UK 2001 2
# 5: UK 2002 NA
# 6: UK 2003 1
Sample data
mydf <- structure(list(Country = c("Nigeria", "UK"), `2001` = 1:2, `2002` = c(2L,
NA), `2003` = c(3L, 1L)), .Names = c("Country", "2001", "2002",
"2003"), row.names = 1:2, class = "data.frame")
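In more recent versions of tidyr, pivot_longer is the documented successor to gather. A minimal sketch, assuming the sample mydf above and a tidyr version that provides pivot_longer; the column names Year and Value are taken from the question's desired output, and long is a name chosen here.

```r
library(tidyr)

mydf <- data.frame(Country = c("Nigeria", "UK"),
                   `2001` = 1:2, `2002` = c(2L, NA), `2003` = c(3L, 1L),
                   check.names = FALSE)

# Every column except Country becomes a (Year, Value) pair
long <- pivot_longer(mydf, cols = -Country,
                     names_to = "Year", values_to = "Value")

# Column names arrive as text, so convert Year to a number if needed
long$Year <- as.integer(long$Year)
long
```

Unlike the options above, this keeps the rows grouped by country, so no reordering should be needed to match the layout shown in the question.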
You can use the melt command from the reshape package. See here: http://www.statmethods.net/management/reshape.html
Probably something like melt(myframe, id=c('Country'))