Convert Dataframe to key value pair list in R [duplicate] - r

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 6 years ago.
How can I 'unpivot' a table? What is the proper technical term for this?
UPDATE: The term is called melt
I have a data frame for countries and data for each year
Country 2001 2002 2003
Nigeria 1 2 3
UK 2 NA 1
And I want to have something like
Country Year Value
Nigeria 2001 1
Nigeria 2002 2
Nigeria 2003 3
UK 2001 2
UK 2002 NA
UK 2003 1

I still can't believe I beat Andrie with an answer. :)
> library(reshape)
> my.df <- read.table(text = "Country 2001 2002 2003
+ Nigeria 1 2 3
+ UK 2 NA 1", header = TRUE)
> my.result <- melt(my.df, id = c("Country"))
> my.result[order(my.result$Country),]
Country variable value
1 Nigeria X2001 1
3 Nigeria X2002 2
5 Nigeria X2003 3
2 UK X2001 2
4 UK X2002 NA
6 UK X2003 1
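The X prefix in the variable column comes from read.table(), which by default (check.names = TRUE) turns the column names 2001, 2002 and 2003 into syntactically valid R names. A minimal base R sketch for turning the melted labels back into numeric years follows. The data frame is rebuilt by hand here to mirror the output above, and the Year column name is my own choice, not something melt produces.

```r
# Hand-built copy of the melted result shown above (an assumption for
# illustration; in practice this would be the output of melt()).
my.result <- data.frame(
  Country  = rep(c("Nigeria", "UK"), times = 3),
  variable = factor(rep(c("X2001", "X2002", "X2003"), each = 2)),
  value    = c(1, 2, 2, NA, 3, 1),
  stringsAsFactors = FALSE
)

# Strip the leading "X" that read.table() added and convert to integer.
my.result$Year <- as.integer(sub("^X", "", as.character(my.result$variable)))

# Sort by country, then year, to match the layout asked for in the question.
my.result <- my.result[order(my.result$Country, my.result$Year),
                       c("Country", "Year", "value")]
my.result
```

Reading the data with read.table(..., check.names = FALSE) would avoid the prefix in the first place, at the cost of having to quote the column names with backticks.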

The base R reshape approach for this problem is pretty ugly, particularly since the names aren't in a form that reshape likes. It would be something like the following, where the first setNames line modifies the column names into something that reshape can make use of.
reshape(
setNames(mydf, c("Country", paste0("val.", c(2001, 2002, 2003)))),
direction = "long", idvar = "Country", varying = 2:ncol(mydf),
sep = ".", new.row.names = seq_len(prod(dim(mydf[-1]))))
A better alternative in base R is to use stack, like this:
cbind(mydf[1], stack(mydf[-1]))
# Country values ind
# 1 Nigeria 1 2001
# 2 UK 2 2001
# 3 Nigeria 2 2002
# 4 UK NA 2002
# 5 Nigeria 3 2003
# 6 UK 1 2003
There are also new tools for reshaping data now available, like the "tidyr" package, which gives us gather. Of course, the tidyr:::gather_.data.frame method just calls reshape2::melt, so this part of my answer doesn't necessarily add much except introduce the newer syntax that you might be encountering in the Hadleyverse.
library(tidyr)
gather(mydf, year, value, `2001`:`2003`) ## Note the backticks
# Country year value
# 1 Nigeria 2001 1
# 2 UK 2001 2
# 3 Nigeria 2002 2
# 4 UK 2002 NA
# 5 Nigeria 2003 3
# 6 UK 2003 1
All three options here would need reordering of rows if you want the row order you showed in your question.
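The reordering mentioned here can be sketched in base R on top of the stack approach. This is only an illustration: the column names Year and Value are taken from the question's desired output, and the sample data is re-created inline so the snippet runs on its own.

```r
# Sample data with non-syntactic column names, as in the question.
mydf <- data.frame(Country = c("Nigeria", "UK"),
                   `2001` = 1:2, `2002` = c(2L, NA), `2003` = c(3L, 1L),
                   check.names = FALSE, stringsAsFactors = FALSE)

# stack() returns the columns "values" and "ind"; rename them.
long <- cbind(mydf[1], stack(mydf[-1]))
names(long) <- c("Country", "Value", "Year")

# "ind" is a factor, so go through character before converting to integer.
long$Year <- as.integer(as.character(long$Year))

# Order by country, then year, and put the columns in the requested order.
long <- long[order(long$Country, long$Year), c("Country", "Year", "Value")]
row.names(long) <- NULL
long
```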
A fourth option would be to use merged.stack from my "splitstackshape" package. Like base R's reshape, you'll need to modify the column names to something that includes a "variable" and "time" indicator.
library(splitstackshape)
merged.stack(
setNames(mydf, c("Country", paste0("V.", 2001:2003))),
var.stubs = "V", sep = ".")
# Country .time_1 V
# 1: Nigeria 2001 1
# 2: Nigeria 2002 2
# 3: Nigeria 2003 3
# 4: UK 2001 2
# 5: UK 2002 NA
# 6: UK 2003 1
Sample data
mydf <- structure(list(Country = c("Nigeria", "UK"), `2001` = 1:2, `2002` = c(2L,
NA), `2003` = c(3L, 1L)), .Names = c("Country", "2001", "2002",
"2003"), row.names = 1:2, class = "data.frame")

You can use the melt command from the reshape package. See here: http://www.statmethods.net/management/reshape.html
Probably something like melt(myframe, id=c('Country'))
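For readers on a current tidyr release, gather() has been superseded by pivot_longer(). The following is a sketch of the equivalent call, assuming tidyr 1.0.0 or later is installed. The output names Year and Value are my choice to match the question, and the sample data is re-created inline.

```r
library(tidyr)

# Sample data with the year columns kept as-is (check.names = FALSE).
mydf <- data.frame(Country = c("Nigeria", "UK"),
                   `2001` = 1:2, `2002` = c(2L, NA), `2003` = c(3L, 1L),
                   check.names = FALSE, stringsAsFactors = FALSE)

# Every column except Country becomes a (Year, Value) pair.
long <- pivot_longer(mydf, cols = -Country,
                     names_to = "Year", values_to = "Value")

# Column names arrive as character strings; convert them to integers.
long$Year <- as.integer(long$Year)
long
```

Unlike melt and gather, pivot_longer() keeps the rows of each country together, so no reordering is needed to get the layout shown in the question.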

Related

How to create a loop for sum calculations which then are inserted into a new row?

I have tried to find a solution via similar topics, but haven't found anything suitable. This may be due to the search terms I have used. If I have missed something, please accept my apologies.
Here is an excerpt of my data UN_ (the provided sample should be sufficient):
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
AT 1990 Total 7.869005
AT 1991 1 1.484667
AT 1991 2 1.001578
AT 1991 3 4.625927
AT 1991 4 2.515453
AT 1991 5 2.702081
AT 1991 Total 8.249567
....
BE 1994 1 3.008115
BE 1994 2 1.550344
BE 1994 3 1.080667
BE 1994 4 1.768645
BE 1994 5 7.208295
BE 1994 Total 1.526016
BE 1995 1 2.958820
BE 1995 2 1.571759
BE 1995 3 1.116049
BE 1995 4 1.888952
BE 1995 5 7.654881
BE 1995 Total 1.547446
....
What I want to do is add another row with UN_$sector = Residual. The residual value will be (UN_$sector = Total) - (the sum of column UN for the sectors c("1", "2", "3", "4", "5")) for a given year AND country.
This is how it should look:
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
----> AT 1990 Residual TO BE CALCULATED
AT 1990 Total 7.869005
As I don't want to write many, many lines of code I'm looking for a way to automate this. I was told about loops, but can't really follow the concept at the moment.
Thank you very much for any type of help!!
Best,
Constantin
PS: (for Parfait)
country year sector UN ETS
UK 2012 1 190336512 NA
UK 2012 2 18107910 NA
UK 2012 3 8333564 NA
UK 2012 4 11269017 NA
UK 2012 5 2504751 NA
UK 2012 Total 580957306 NA
UK 2013 1 177882200 NA
UK 2013 2 20353347 NA
UK 2013 3 8838575 NA
UK 2013 4 11051398 NA
UK 2013 5 2684909 NA
UK 2013 Total 566322778 NA
Consider calculating residual first and then stack it with other pieces of data:
# CALCULATE RESIDUALS BY MERGED COLUMNS
agg <- within(merge(aggregate(UN ~ country + year, data = subset(df, sector!='Total'), sum),
aggregate(UN ~ country + year, data = subset(df, sector=='Total'), sum),
by=c("country", "year")),
{UN <- UN.y - UN.x
sector = 'Residual'})
# ROW BIND DIFFERENT PIECES
final_df <- rbind(subset(df, sector!='Total'),
agg[c("country", "year", "sector", "UN")],
subset(df, sector=='Total'))
# ORDER ROWS AND RESET ROWNAMES
final_df <- with(final_df, final_df[order(country, year, as.character(sector)),])
row.names(final_df) <- NULL
final_df
# country year sector UN
# 1 AT 1990 1 1.407555
# 2 AT 1990 2 1.037137
# 3 AT 1990 3 4.769618
# 4 AT 1990 4 2.455139
# 5 AT 1990 5 2.238618
# 6 AT 1990 Residual -4.039062
# 7 AT 1990 Total 7.869005
# 8 AT 1991 1 1.484667
# 9 AT 1991 2 1.001578
# 10 AT 1991 3 4.625927
# 11 AT 1991 4 2.515453
# 12 AT 1991 5 2.702081
# 13 AT 1991 Residual -4.080139
# 14 AT 1991 Total 8.249567
# 15 BE 1994 1 3.008115
# 16 BE 1994 2 1.550344
# 17 BE 1994 3 1.080667
# 18 BE 1994 4 1.768645
# 19 BE 1994 5 7.208295
# 20 BE 1994 Residual -13.090050
# 21 BE 1994 Total 1.526016
# 22 BE 1995 1 2.958820
# 23 BE 1995 2 1.571759
# 24 BE 1995 3 1.116049
# 25 BE 1995 4 1.888952
# 26 BE 1995 5 7.654881
# 27 BE 1995 Residual -13.643015
# 28 BE 1995 Total 1.547446
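As a sanity check on the first residual in the output above, the arithmetic for AT 1990 can be reproduced directly. This small base R sketch uses only the six values quoted in the question; the object names are my own.

```r
# The six AT 1990 rows from the question.
at_1990 <- data.frame(
  sector = c("1", "2", "3", "4", "5", "Total"),
  UN     = c(1.407555, 1.037137, 4.769618, 2.455139, 2.238618, 7.869005),
  stringsAsFactors = FALSE
)

is_total <- at_1990$sector == "Total"

# Residual = Total - sum of the individual sectors
#          = 7.869005 - 11.908067
residual <- at_1990$UN[is_total] - sum(at_1990$UN[!is_total])
round(residual, 6)  # -4.039062
```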
I think there are multiple ways you can do this. I would recommend taking advantage of the tidyverse suite of packages, which includes dplyr.
Without getting too far into what dplyr and the tidyverse can do, the functions that matter here are dplyr's group_by(...), summarise(...), arrange(...) and bind_rows(...). There are also plenty of good tutorials, cheat sheets, and documentation for all tidyverse packages.
Although it is less and less relevant these days, we generally want to avoid for loops in R. Therefore, we will create a new data frame which contains all of the Residual values and then bring it back into your original data frame.
Step 1: Calculating all residual values
For each country and year, we want the Total value minus the sum of the UN values of the individual sectors (this assumes exactly one Total row per country and year). We can compute this with
library(dplyr)
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = UN[sector == 'Total'] - sum(UN[sector != 'Total'], na.rm = TRUE))
Step 2: Add sector column to res_UN with value 'Residual'
Step 1 yields a data frame which contains country, year, and UN. We now need to add a sector column with the value 'Residual' to satisfy your specifications.
res_UN$sector = 'Residual'
Step 3: Add res_UN back to UN_ and order accordingly
res_UN and UN_ now have the same columns, so they can be added back together.
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
Piecing this all together should answer your question, and it can be achieved in a couple of lines!
TLDR:
library(dplyr)
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = UN[sector == 'Total'] - sum(UN[sector != 'Total'], na.rm = TRUE))
res_UN$sector = 'Residual'
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
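Since the question mentions loops, here is the same computation written as an explicit loop over country-year groups, for comparison with the vectorised answers. It is a base R sketch under the assumption that every group has exactly one Total row; the names groups, residual_rows and with_residual are illustrative only, and the data is the AT excerpt from the question.

```r
# AT rows for 1990 and 1991, copied from the question.
UN_ <- data.frame(
  country = rep("AT", 12),
  year    = rep(c(1990, 1991), each = 6),
  sector  = rep(c("1", "2", "3", "4", "5", "Total"), times = 2),
  UN      = c(1.407555, 1.037137, 4.769618, 2.455139, 2.238618, 7.869005,
              1.484667, 1.001578, 4.625927, 2.515453, 2.702081, 8.249567),
  stringsAsFactors = FALSE
)

# One iteration per distinct (country, year) pair.
groups <- unique(UN_[c("country", "year")])
residual_rows <- vector("list", nrow(groups))
is_total <- UN_$sector == "Total"

for (i in seq_len(nrow(groups))) {
  in_group <- UN_$country == groups$country[i] & UN_$year == groups$year[i]
  residual_rows[[i]] <- data.frame(
    country = groups$country[i],
    year    = groups$year[i],
    sector  = "Residual",
    # Total minus the sum of the individual sectors in this group.
    UN      = UN_$UN[in_group & is_total] - sum(UN_$UN[in_group & !is_total]),
    stringsAsFactors = FALSE
  )
}

# Append the new rows and sort so that Residual sits just before Total.
with_residual <- rbind(UN_, do.call(rbind, residual_rows))
with_residual <- with_residual[order(with_residual$country,
                                     with_residual$year,
                                     with_residual$sector), ]
row.names(with_residual) <- NULL
with_residual
```

The loop body is the whole idea: pick one group, compute one number, store one row. The grouped approaches do the same work without the bookkeeping.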

How do I avoid a slow loop with large data set?

Consider this data set:
> DATA <- data.frame(Agreement_number = c(1,1,1,1,2,2,2,2),
+ country = c("Canada","Canada", "USA", "USA", "Canada","Canada", "USA", "USA"),
+ action = c("signature", "ratification","signature", "ratification", "signature", "ratification","signature", "ratification"),
+ signature_date = c(2000,NA,2000,NA, 2001, NA, 2002, NA),
+ ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002))
> DATA
Agreement_number country action signature_date ratification_date
1 Canada signature 2000 NA
1 Canada ratification NA 2001
1 USA signature 2000 NA
1 USA ratification NA 2002
2 Canada signature 2001 NA
2 Canada ratification NA 2001
2 USA signature 2002 NA
2 USA ratification NA 2002
As you can see, half of the rows have duplicate information. For a small data set like this it is really easy to remove duplicates. I could use the coalesce function (dplyr package), get rid of the "action" column and then erase all the irrelevant rows, though there are many other ways. The final result should look like this:
> DATA <- data.frame( Agreement_number = c(1,1,2,2),
+ country = c("Canada", "USA", "Canada","USA"),
+ signature_date = c(2000,2000,2001,2002),
+ ratification_date = c(2001, 2002, 2001, 2002))
> DATA
Agreement_number country signature_date ratification_date
1 Canada 2000 2001
1 USA 2000 2002
2 Canada 2001 2001
2 USA 2002 2002
The problem is that my real data set is MUCH bigger (102000 x 270) and there are many more variables. The real data is also more irregular and there are more absent values. The coalesce function seems very slow. The best loop I could make so far still takes up to 5-10 minutes to run.
Is there a simple way of doing this which would be faster? I have the feeling that there must be some function in R for that kind of operation, but I couldn't find any.
I think you need dcast. The version in the data.table library calls itself "fast", and in my experience, it is speedy on large datasets.
First, let's create one column which is either the signature_date or ratification_date, depending on the action
library(data.table)
setDT(DATA)[, date := ifelse(action == "ratification", ratification_date, signature_date)]
Now, let's cast it so that the action are the columns and the value is the date
wide <- dcast(DATA, Agreement_number + country ~ action, value.var = 'date')
So wide looks like this
Agreement_number country ratification signature
1 1 Canada 2001 2000
2 1 USA 2002 2000
3 2 Canada 2001 2001
4 2 USA 2002 2002
The OP has said that his production data has 100 k rows x 270 columns and that speed is a concern. Therefore, I suggest using data.table.
I'm aware that Harland has also proposed using data.table and dcast(), but the solution below takes a different approach. It puts the rows in the correct order and copies the ratification_date to the signature row. After some clean-up we get the desired result.
library(data.table)
# coerce to data.table,
# make sure that the actions are ordered properly, not alphabetically
setDT(DATA)[, action := ordered(action, levels = c("signature", "ratification"))]
# order the rows to make sure that signature row and ratification row are
# subsequent for each agreement and country
setorder(DATA, Agreement_number, country, action)
# copy the ratification date from the row below but only within each group
result <- DATA[, ratification_date := shift(ratification_date, type = "lead"),
by = c("Agreement_number", "country")][
# keep only signature rows, remove action column
action == "signature"][, action := NULL]
result
Agreement_number country signature_date ratification_date dummy1 dummy2
1: 1 Canada 2000 2001 2 D
2: 1 USA 2000 2002 3 A
3: 2 Canada 2001 2001 1 B
4: 2 USA 2002 2002 4 C
Data
The OP has mentioned that his production data has 270 columns. To simulate this I've added two dummy columns:
set.seed(123L)
DATA <- data.frame(Agreement_number = c(1,1,1,1,2,2,2,2),
country = c("Canada","Canada", "USA", "USA", "Canada","Canada", "USA", "USA"),
action = c("signature", "ratification","signature", "ratification", "signature", "ratification","signature", "ratification"),
signature_date = c(2000,NA,2000,NA, 2001, NA, 2002, NA),
ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002),
dummy1 = rep(sample(4), each = 2L),
dummy2 = rep(sample(LETTERS[1:4]), each = 2L))
Note that set.seed() is used for repeatable results when sampling.
Agreement_number country action signature_date ratification_date dummy1 dummy2
1 1 Canada signature 2000 NA 2 D
2 1 Canada ratification NA 2001 2 D
3 1 USA signature 2000 NA 3 A
4 1 USA ratification NA 2002 3 A
5 2 Canada signature 2001 NA 1 B
6 2 Canada ratification NA 2001 1 B
7 2 USA signature 2002 NA 4 C
8 2 USA ratification NA 2002 4 C
Addendum: dcast() with additional columns
Harland has suggested using data.table and dcast(). Besides several other flaws in his answer, it doesn't handle the additional columns the OP has mentioned.
The dcast() approach below will return also the additional columns:
library(data.table)
# coerce to data table
setDT(DATA)[, action := ordered(action, levels = c("signature", "ratification"))]
# use already existing column to "coalesce" dates
DATA[action == "ratification", signature_date := ratification_date]
DATA[, ratification_date := NULL]
# dcast from long to wide form, note that ... refers to all other columns
result <- dcast(DATA, Agreement_number + country + ... ~ action,
value.var = "signature_date")
result
Agreement_number country dummy1 dummy2 signature ratification
1: 1 Canada 2 D 2000 2001
2: 1 USA 3 A 2000 2002
3: 2 Canada 1 B 2001 2001
4: 2 USA 4 C 2002 2002
Note that this approach will change the order of columns.
Here is another data.table solution using uwe-block's data.frame. It is similar to uwe-block's method, but uses max to collapse the data.
# convert data.frame to data.table and factor variables to character variables
library(data.table)
setDT(DATA)[, names(DATA) := lapply(.SD,
function(x) if(is.factor(x)) as.character(x) else x)]
# collapse data set, by agreement and country. Take max of remaining variables.
DATA[, lapply(.SD, max, na.rm=TRUE), by=.(Agreement_number, country)][,action := NULL][]
The lapply runs through variables not included in the by statement and calculates the maximum after removing NA values. The next link in the chain drops the unneeded action variable and the final (unnecessary) link prints the output.
This returns
Agreement_number country signature_date ratification_date dummy1 dummy2
1: 1 Canada 2000 2001 2 D
2: 1 USA 2000 2002 3 A
3: 2 Canada 2001 2001 1 B
4: 2 USA 2002 2002 4 C
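The collapse idea used in the answers above does not depend on data.table. For illustration only, here is the same logic in base R with aggregate(), taking the first non-missing value per group. This is a sketch with a helper name (first_non_na) of my own, and it makes no claim about speed; on a 102000 x 270 table the data.table versions are likely the better choice.

```r
# The example data from the question.
DATA <- data.frame(
  Agreement_number  = c(1, 1, 1, 1, 2, 2, 2, 2),
  country           = c("Canada", "Canada", "USA", "USA",
                        "Canada", "Canada", "USA", "USA"),
  action            = rep(c("signature", "ratification"), times = 4),
  signature_date    = c(2000, NA, 2000, NA, 2001, NA, 2002, NA),
  ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first non-missing value of a vector (NA if none).
first_non_na <- function(x) x[!is.na(x)][1]

# na.action = na.pass keeps the rows containing NA so FUN can see them.
collapsed <- aggregate(
  cbind(signature_date, ratification_date) ~ Agreement_number + country,
  data = DATA, FUN = first_non_na, na.action = na.pass
)

# aggregate() sorts by the last grouping variable first; reorder for display.
collapsed <- collapsed[order(collapsed$Agreement_number, collapsed$country), ]
row.names(collapsed) <- NULL
collapsed
```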

Canonical way to reduce number of ID variables in wide-format data

I have data organized by two ID variables, Year and Country, like so:
Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8
I'd like to keep Year as an ID variable, but create multiple columns for VarA and VarB, one for each value of Country (I'm not picky about column order), to make the following table:
Year VarA.Canada VarA.USA VarB.Canada VarB.USA
2014 0 NA 10 NA
2015 6 1 5 3
2016 7 2 8 2
I managed to do this with the following code:
require(data.table)
require(reshape2)
data <- as.data.table(read.table(header=TRUE, text='Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8'))
molten <- melt(data, id.vars=c('Year', 'Country'))
molten[,variable:=paste(variable, Country, sep='.')]
recast <- dcast(molten, Year ~ variable)
But this seems a bit hacky (especially editing the default-named variable field). Can I do it with fewer function calls? Ideally I could just call one function, specifying the columns to drop as IDs and the formula for creating new variable names.
Using dcast you can cast multiple value.vars at once (from data.table v1.9.6 on). Try:
dcast(data, Year ~ Country, value.var = c("VarA","VarB"), sep = ".")
# Year VarA.Canada VarA.USA VarB.Canada VarB.USA
#1: 2014 0 NA 10 NA
#2: 2015 6 1 5 3
#3: 2016 7 2 8 2
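If data.table is not available, base R's reshape() can produce the same wide layout in one call. This is a sketch for comparison rather than a recommendation. It works on a plain data.frame, and the column order differs from the dcast output because the countries appear in order of first occurrence.

```r
# Same example data as in the question, read as a plain data.frame.
data <- read.table(header = TRUE, stringsAsFactors = FALSE,
                   text = "Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8")

# Year stays as the ID; every other column is spread across Country,
# giving names of the form VarA.USA, VarB.USA, VarA.Canada, VarB.Canada.
wide <- reshape(data, idvar = "Year", timevar = "Country", direction = "wide")

# Rows come out in order of first appearance; sort them by year.
wide <- wide[order(wide$Year), ]
row.names(wide) <- NULL
wide
```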

R: conditional aggregate based on factor level and year

I have a dataset in R which I am trying to aggregate by column level and year which looks like this:
City State Year Status Year_repealed PolicyNo
Pitt PA 2001 InForce 6
Phil. PA 2001 Repealed 2004 9
Pitt PA 2002 InForce 7
Pitt PA 2005 InForce 2
What I would like to create is where for each Year, I aggregate the PolicyNo across states taking into account the date the policy was repealed. The results I would then get is:
Year State PolicyNo
2001 PA 15
2002 PA 22
2003 PA 22
2004 PA 12
2005 PA 14
I am not sure how to go about splitting and aggregating the data conditional on the repeal date and was wondering if there is a way to achieve this in R easily.
It may help you to break this up into two distinct problems.
1. Get a table that shows the change in PolicyNo in every city-state-year.
2. Summarize that table to show the PolicyNo in each state-year.
To accomplish (1) we add the missing years with NA PolicyNo, and add repeals as negative PolicyNo observations.
library(dplyr)
df = structure(list(City = c("Pitt", "Phil.", "Pitt", "Pitt"), State = c("PA", "PA", "PA", "PA"), Year = c(2001L, 2001L, 2002L, 2005L), Status = c("InForce", "Repealed", "InForce", "InForce"), Year_repealed = c(NA, 2004L, NA, NA), PolicyNo = c(6L, 9L, 7L, 2L)), .Names = c("City", "State", "Year", "Status", "Year_repealed", "PolicyNo"), class = "data.frame", row.names = c(NA, -4L))
repeals = df %>%
filter(!is.na(Year_repealed)) %>%
mutate(Year = Year_repealed, PolicyNo = -1 * PolicyNo)
repeals
# City State Year Status Year_repealed PolicyNo
# 1 Phil. PA 2004 Repealed 2004 -9
all_years = expand.grid(City = unique(df$City), State = unique(df$State),
Year = 2001:2005)
df = bind_rows(df, repeals, all_years)
# City State Year Status Year_repealed PolicyNo
# 1 Pitt PA 2001 InForce NA 6
# 2 Phil. PA 2001 Repealed 2004 9
# 3 Pitt PA 2002 InForce NA 7
# 4 Pitt PA 2005 InForce NA 2
# 5 Phil. PA 2004 Repealed 2004 -9
# 6 Pitt PA 2001 <NA> NA NA
# 7 Phil. PA 2001 <NA> NA NA
# 8 Pitt PA 2002 <NA> NA NA
# 9 Phil. PA 2002 <NA> NA NA
# 10 Pitt PA 2003 <NA> NA NA
# 11 Phil. PA 2003 <NA> NA NA
# 12 Pitt PA 2004 <NA> NA NA
# 13 Phil. PA 2004 <NA> NA NA
# 14 Pitt PA 2005 <NA> NA NA
# 15 Phil. PA 2005 <NA> NA NA
Now the table shows every city-state-year and incorporates repeals. This is a table we can summarize.
df = df %>%
group_by(Year, State) %>%
summarize(annual_change = sum(PolicyNo, na.rm = TRUE))
df
# Source: local data frame [5 x 3]
# Groups: Year [?]
#
# Year State annual_change
# <int> <chr> <dbl>
# 1 2001 PA 15
# 2 2002 PA 7
# 3 2003 PA 0
# 4 2004 PA -9
# 5 2005 PA 2
That gets us PolicyNo change in each state-year. A cumulative sum over the changes gets us levels.
df = df %>%
group_by(State) %>% # accumulate within each state so totals do not mix across states
mutate(PolicyNo = cumsum(annual_change)) %>%
ungroup()
df
# # A tibble: 5 × 4
# Year State annual_change PolicyNo
# <int> <chr> <dbl> <dbl>
# 1 2001 PA 15 15
# 2 2002 PA 7 22
# 3 2003 PA 0 22
# 4 2004 PA -9 13
# 5 2005 PA 2 15
With the data.table package you could do it as follows:
melt(setDT(dat),
measure.vars = c(3,5),
value.name = 'Year',
value.factor = FALSE)[!is.na(Year)
][variable == 'Year_repealed', PolicyNo := -1*PolicyNo
][CJ(Year = min(Year):max(Year), State = State, unique = TRUE), on = .(Year, State)
][is.na(PolicyNo), PolicyNo := 0
][, .(PolicyNo = sum(PolicyNo)), by = .(Year, State)
][, .(Year, State, PolicyNo = cumsum(PolicyNo))]
The result of the above code:
Year State PolicyNo
1: 2001 PA 15
2: 2002 PA 22
3: 2003 PA 22
4: 2004 PA 13
5: 2005 PA 15
As you can see, several steps are needed to get to the desired end result:
First, you convert to a data.table (setDT(dat)), reshape it into long format, and remove the rows with no Year.
Then you make PolicyNo negative for the rows that come from 'Year_repealed'.
With a cross join (CJ) you make sure that all the years for each state are present, and you convert the NA values in the PolicyNo column to zero.
Finally, you summarise by year and state and take a cumulative sum of the result.
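The result can be cross-checked without any reshaping by counting, for each year, the policies that are in force in that year. This base R sketch assumes a policy counts from its Year up to, but not including, its Year_repealed, which is the convention both answers above use. The names grid and active are my own.

```r
# The four policies from the question.
df <- data.frame(
  City          = c("Pitt", "Phil.", "Pitt", "Pitt"),
  State         = c("PA", "PA", "PA", "PA"),
  Year          = c(2001L, 2001L, 2002L, 2005L),
  Year_repealed = c(NA, 2004L, NA, NA),
  PolicyNo      = c(6L, 9L, 7L, 2L),
  stringsAsFactors = FALSE
)

# Every state-year combination in the observed range.
grid <- expand.grid(Year = min(df$Year):max(df$Year),
                    State = unique(df$State),
                    stringsAsFactors = FALSE)

# For each state-year, sum the policies enacted by then and not yet repealed.
grid$PolicyNo <- mapply(function(y, s) {
  active <- df$State == s & df$Year <= y &
    (is.na(df$Year_repealed) | df$Year_repealed > y)
  sum(df$PolicyNo[active])
}, grid$Year, grid$State)

grid  # PolicyNo: 15, 22, 22, 13, 15
```

This gives 13 and 15 for 2004 and 2005, in agreement with both answers, rather than the 12 and 14 listed in the question.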

Long Format Function [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
faster way to create variable that aggregates a column by id
I am having trouble with a project. I created a dataframe (called dat) in long format (I copied the first 3 rows below) and I want to calculate, for example, the mean of the Pretax Income of all banks in the United States for the years 2000 to 2011. How would I do that? I have hardly any experience in R. I am sorry if the answer is too obvious, but I couldn't find anything and I have already spent a lot of time on the project. Thank you in advance!
KeyItem Bank Country Year Value
1 Pretax Income WELLS_FARGO_&_COMPANY UNITED STATES 2011 2.365600e+10
2 Total Assets WELLS_FARGO_&_COMPANY UNITED STATES 2011 1.313867e+12
3 Total Liabilities WELLS_FARGO_&_COMPANY UNITED STATES 2011 1.172180e+12
The following should get you started. You basically need to do two things: subset, and aggregate. I'll demonstrate a base R solution and a data.table solution.
First, some sample data.
set.seed(1) # So you can reproduce my results
dat <- data.frame(KeyItem = rep(c("Pretax", "TotalAssets", "TotalLiabilities"),
times = 30),
Bank = rep(c("WellsFargo", "BankOfAmerica", "ICICI"),
each = 30),
Country = rep(c("UnitedStates", "India"), times = c(60, 30)),
Year = rep(c(2000:2009), each = 3, times = 3),
Value = runif(90, min=300, max=600))
Let's aggregate mean of the "Pretax" values by "Country" and "Year", but only for the years 2001 to 2005.
aggregate(Value ~ Country + Year,
dat[dat$KeyItem == "Pretax" & dat$Year >= 2001 & dat$Year <=2005, ],
mean)
# Country Year Value
# 1 India 2001 399.7184
# 2 UnitedStates 2001 464.1638
# 3 India 2002 443.5636
# 4 UnitedStates 2002 560.8373
# 5 India 2003 562.5964
# 6 UnitedStates 2003 370.9591
# 7 India 2004 404.0050
# 8 UnitedStates 2004 520.4933
# 9 India 2005 567.6595
# 10 UnitedStates 2005 493.0583
Here's the same thing in data.table
library(data.table)
DT <- data.table(dat, key = "Country,Bank,Year")
subset(DT, KeyItem == "Pretax")[Year %between% c(2001, 2005),
mean(Value), by = list(Country, Year)]
# Country Year V1
# 1: India 2001 399.7184
# 2: India 2002 443.5636
# 3: India 2003 562.5964
# 4: India 2004 404.0050
# 5: India 2005 567.6595
# 6: UnitedStates 2001 464.1638
# 7: UnitedStates 2002 560.8373
# 8: UnitedStates 2003 370.9591
# 9: UnitedStates 2004 520.4933
# 10: UnitedStates 2005 493.0583
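The question asks for one overall mean (Pretax Income, United States, 2000 to 2011) rather than a table by year. That is just a subset followed by mean(). The sketch below uses a tiny made-up data frame with simplified labels so the result is easy to verify by hand; the values and bank names are placeholders, not real data.

```r
# Made-up rows: only the first and third satisfy all three conditions.
dat <- data.frame(
  KeyItem = c("Pretax", "TotalAssets", "Pretax", "Pretax", "Pretax"),
  Bank    = c("BankA", "BankA", "BankB", "BankC", "BankA"),
  Country = c("UnitedStates", "UnitedStates", "UnitedStates",
              "India", "UnitedStates"),
  Year    = c(2005, 2005, 2011, 2005, 1999),
  Value   = c(100, 999, 300, 500, 700),
  stringsAsFactors = FALSE
)

# Logical filter: right item, right country, year within 2000-2011.
keep <- dat$KeyItem == "Pretax" &
  dat$Country == "UnitedStates" &
  dat$Year >= 2000 & dat$Year <= 2011

us_pretax_mean <- mean(dat$Value[keep], na.rm = TRUE)
us_pretax_mean  # 200, the mean of 100 and 300
```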

Resources