Reshaping a dataframe in R - r

I need some help to re-design the output of a function that comes through an R package.
My scope is to reshape a dataframe called output_IMFData in a way that look very similar to the shape of output_imfr. The codes of a MWE reproducing these dataframes are:
library(imfr)
output_imfr <- imf_data(database_id="IFS", indicator="IAD_BP6_USD", country = "", start = 2010, end = 2014, freq = "A", return_raw =FALSE, print_url = T, times = 3)
and for output_IMFData
library(IMFData)
databaseID <- "IFS"
startdate <- "2010"
enddate <- "2014"
checkquery <- FALSE
queryfilter <- list(CL_FREA = "A", CL_AREA_IFS = "", CL_INDICATOR_IFS = "IAD_BP6_USD")
output_IMFData <- CompactDataMethod(databaseID, queryfilter, startdate, enddate,
checkquery)
the output from output_IMFData looks like this:
But, I want to redesign this dataframe to look like the output of output_imfr:
Sadly, I am not that advanced user and could not find something that can help me. My basic problem in converting the shape of output_IMFData to the shape of the second ``panel-data-looking" dataframework is that I don't know how to handle the Obs in output_IMFData in a way that cannot lose the "correspondence" with the reference code #REF-AREA in output_IMFData. That is, in column #REF-AREA there are codes of country names and the column in Obs has their respective time series data. This is very cumbersome way of working with panel data, and therefore I want to reshape that dataframe to the much nicer form of output_imfr dataframe.

The data of interest are stored in a list in the column Obs. Here is a dplyr solution to split the data, crack open the list, then stitch things back together.
longData <-
output_IMFData %>%
split(1:nrow(.)) %>%
lapply(function(x){
data.frame(
iso2c = x[["#REF_AREA"]]
, x$Obs
)
}) %>%
bind_rows()
head(longData)
gives:
iso2c X.TIME_PERIOD X.OBS_VALUE X.OBS_STATUS
1 FJ 2010 47.2107721901621 <NA>
2 FJ 2011 48.28347 <NA>
3 FJ 2012 51.0823499999999 <NA>
4 FJ 2013 157.015648875072 <NA>
5 FJ 2014 186.623232882226 <NA>
6 AW 2010 616.664804469274 <NA>

Here's another approach:
NewDataFrame <- data.frame(iso2c=character(),
year=numeric(),
IAD_BP6_USD=character(),
stringsAsFactors=FALSE)
newrow = 1
for(i in 1:nrow(output_IMFData)) { # for each row of your cludgy df
for(j in 1:length(output_IMFData$Obs[[i]]$`#TIME_PERIOD`)) { # for each year
NewDataFrame[newrow,'iso2c']<-output_IMFData[i, '#REF_AREA']
NewDataFrame[newrow,'year']<-output_IMFData$Obs[[i]]$`#TIME_PERIOD`[j]
NewDataFrame[newrow,'IAD_BP6_USD']<-output_IMFData$Obs[[i]]$`#OBS_VALUE`[j]
newrow<-newrow + 1 # increment down a row
}
}

Related

Merging listed data.frames from a data.frame

I'm querying data from the IMF using the package IMFData (https://github.com/mingjerli/IMFData).
The problem is that I do not know how to improve my code. There are 25 countries that I would like to extract data. I am sure my code is not the best to quickly create the desired dataframe (XDF, in my code).
library(IMFData)
databaseID <- "DOT"
startdate = "2013-01-01"
enddate = "2019-12-31"
checkquery = FALSE
queryfilter <- list(CL_FREQ = "M", CL_AREA_DOT = c("CA", "JP"), CL_INDICATOR_DOT = "TXG_FOB_USD", CL_COUNTERPART_AREA_DOT = "W00")
EXPORTS <- CompactDataMethod(databaseID, queryfilter, startdate, enddate, checkquery)
# I would like to improve this part:
XDF.1 <- EXPORTS$Obs[[1]]
XDF.2 <- EXPORTS$Obs[[2]]
XDF <- dplyr::left_join(XDF.1, XDF.2, by=c("#TIME_PERIOD"))
colnames(XDF) <- c("Date", "Canada", "Japan")
Here is a base R approach that will get you all 227 areas with data.
First, get all the areas available with DataStructureMethod. Then split the list into sets of 25 areas so that the API won't fail. Create a new empty list to hold the returned data. Next, use a for loop to iterate over all the area sets and store the results into a list element.
library(IMFData)
databaseID <- "DOT"
startdate = "2013-01-01"
enddate = "2019-12-31"
areas <- DataStructureMethod("DOT")$CL_COUNTERPART_AREA_DOT$CodeValue
areas.list <- split(areas, ceiling(seq_along(areas)/25))
result.list <- list()
for(i in seq_along(areas.list)) {
filter <- list(CL_FREQ = "M", CL_AREA_DOT = areas.list[[i]], CL_INDICATOR_DOT = "TXG_FOB_USD", CL_COUNTERPART_AREA_DOT = "W00")
result.list[[i]] <- CompactDataMethod(databaseID, filter, startdate, enddate)
}
Now that we have all the data, we can extract the #OBS_VALUE from each area. So we can keep up with which is which, we will assign the column names to #REF_AREA. Then all we need to do is cbind all the areas together and add a time period column.
result <- sapply(result.list,function(x){y <- sapply(x$Obs,function(y){y[['#OBS_VALUE']]}); colnames(y) <- x[["#REF_AREA"]]; y})
result <- do.call(cbind,result)
result <- cbind(timeperiod = result.list[[1]]$Obs[[1]][['#TIME_PERIOD']],result)
result[1:10,1:10]
timeperiod BB BM AF BS AL AW BD BZ AO
[1,] "2013-01" "28.609779" "2.763473" "37.545734" "140.793072" "182.268383" "15.248257" "2135.314764" "26.993657" "5738.361548"
[2,] "2013-02" "31.408923" "2.588724" "23.319418" "51.207085" "160.256056" "13.357883" "1883.921679" "31.959256" "5093.785673"
[3,] "2013-03" "26.490062" "2.161194" "34.313418" "116.533489" "187.347118" "11.807801" "2074.639533" "36.975964" "5836.777823"
[4,] "2013-04" "30.969022" "6.541486" "27.46926" "79.9772" "199.063249" "15.363928" "1996.029477" "39.84747" "4953.276187"
[5,] "2013-05" "27.633188" "3.030127" "32.675746" "765.5369" "221.793898" "13.232063" "2247.850876" "73.201747" "5425.804703"
[6,] "2013-06" "24.064953" "2.816781" "29.454347" "60.756462" "201.765833" "13.698186" "2291.680871" "32.821853" "5271.431577"
[7,] "2013-07" "26.25563" "2.657042" "15.540238" "95.12846" "233.746903" "14.499091" "2359.924118" "33.763333" "5666.628083"
[8,] "2013-08" "26.85187" "2.883294" "21.369248" "74.317362" "180.045606" "15.545374" "1985.100494" "31.342921" "5557.632778"
[9,] "2013-09" "25.025515" "3.368449" "26.061924" "89.380055" "211.352443" "12.323627" "2441.630301" "25.107398" "5558.266666"
[10,] "2013-10" "34.040048" "3.249082" "49.352241" "128.44329" "227.724296" "17.172523" "2131.788729" "28.489788" "5411.943251"
As you probably know, the names of those areas are available in DataStructureMethod("DOT")$CL_COUNTERPART_AREA_DOT.
You might want to create a list and combine together thanks to the Reduce function (or purrr:::reduce). I often come back to this post:
Assuming you have a list of X.DFs dataframe named X_DF, you can do:
Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "i", all.x = TRUE),
list_XDF)
or, if you prefer tidyverse syntax:
list_XDF %>% purrr::reduce(left_join, by=c("#TIME_PERIOD"))

How to identify observations with multiple matching patterns and create another variable in R?

I am trying to create a broad industry category from detailed categories in my data. I am wondering where am I going wrong in creating this with grepl in R?
My example data is as follows:
df <- data.frame(county = c(01001, 01002, 02003, 04004, 08005, 01002, 02003, 04004),
ind = c("0700","0701","0780","0980","1000","1429","0840","1500"))
I am trying to create a variable called industry with 2 levels (e.g., agri, manufacturing) with the help of grepl or str_replace commands in R.
I have tried this:
newdf$industry <- ""
newdf[df$ind %>% grepl(c("^07|^08|^09", levels(df$ind), value = TRUE)), "industry"] <- "Agri"
But this gives me the following error:
argument 'pattern' has length > 1 and only the first element will be used
I want to get the following dataframe as my result:
newdf <- data.frame(county = c(01001, 01002, 02003, 04004, 08005, 01002, 02003, 04004),
ind = c("0700","0701","0780","0980","1000","1429","0840","1500"),
industry = c("Agri", "Agri", "Agri", "Agri", "Manufacturing", "Manufacturing", "Agri", "Manufacturing"))
So my question is this, how do I specify if variable 'ind' starts with 07,08 or 09, my industry variable will take the value 'agri', if 'ind' starts with 10, 14 or 15, industry will be 'manufacturing'? Needless to say, there is a huge list of industry codes that I am trying to crunch in 10 categories, so looking for a solution which will help me do it with pattern recognition.
Any help is appreciated! Thanks!
Try this:
newdf = df %>%
mutate(industry = ifelse(str_detect(string = ind,
pattern = '^07|^08|^09'),
'Agri',
'Manufacturing'))
This works, using ifelse() to add desired column to df data.frame
df$industry <- ifelse(grepl(paste0("^", c('07','08','09'), collapse = "|"), df$ind), "Agri", "Manufacturing")
> df
county ind industry
1 1001 0700 Agri
2 1002 0701 Agri
3 2003 0780 Agri
4 4004 0980 Agri
5 8005 1000 Manufacturing
6 1002 1429 Manufacturing
7 2003 0840 Agri
8 4004 1500 Manufacturing

Creating a new unique dataset from dates and categories in R

I have a dataframe that has OrderDate and MajorCategory as the two variables. OrderDates range from 2005-01-01 to 2007-12-31, and MajorCategory runs from 1 to 73 with around 35.5 million entries. Each OrderDate references a specific order, which has an ID number and also is attributed to a specific MajorCategory. I am trying to create a dataframe to show each unique OrderDate and the count of each MajorCategory that was ordered on that date.
The dataset currently looks something like:
OrderDate MajorCategory
2005-12-12 66
2005-12-12 66
2006-03-28 43
2006-05-16 66
I have separated the unique OrderDate (after changing the class to Date) into its own dataframe by using:
OD <- as.data.frame(unique(DMEFLines3Dataset2$OrderDate))
OD <- as.data.frame(sort(OD$`unique(DMEFLines3Dataset2$OrderDate)`))
I'm not sure how to get the MajorCategory to show me a count for each date. So the desired output would be something like:
OD MC_1 MC_2
2005-01-01 4 6
2005-01-02 7 45
2005-01-03 3 23
where OD is the Order Date and MC_X is the MajorCategory's order count per date (MC_1 to MC_73).
I tried using for loops, frequency, and count, but I can't seem to figure it out.
I am not an R expert, and if given the option I would try to aggregate the data as needed in a different language and then load the aggregated data into an R data frame for any further analysis.
I have done something close to what you are asking by calculating ROC graphs from the output of a third party naive bayes model which consisted of appointment detail grouped by departments. Tweaking my code a bit, I was able to get a dataframe with counts of an identifier grouped by date, which seems to be structured the way you are asking for.
library(RODBC)
dbConnection <- 'Driver={SQL Server};Server=SERVERNAME;Database=DBName;Trusted_Connection=yes'
channel <- odbcDriverConnect(dbConnection)
InputDataSet <- sqlQuery(channel, "
SELECT OrderID, OrderDate, MajorCategory from [dbo].[myDataSet];"
)
results <- data.frame("date", "ordCount")
names(results) <- c("date", "ordCount")
for (dt in unique(InputDataSet$OrderDate)) {
ordCount <- 0
filteredSet = InputDataSet[InputDataSet$OrderDate == dt,]
for (mc in unique(filteredSet$MajorCategory)) {
ordCount <- ordCount+1
}
df <- data.frame(dt, ordCount)
names(df) <- c("date", "ordCount")
results <- rbind(df, results)
}
results
library(tidyverse)
df1 <- df %>%
group_by(OrderDate, MajorCategory) %>%
tally() %>%
mutate(MajorCategory = paste("MC", MajorCategory, sep="_")) %>%
spread(MajorCategory, n)
df1
Output is:
OrderDate MC_43 MC_66 MC_67
1 2005-12-12 NA 2 1
2 2006-03-28 1 NA NA
3 2006-05-16 NA 1 NA
Sample data:
df <- structure(list(OrderDate = c("2005-12-12", "2005-12-12", "2005-12-12",
"2006-03-28", "2006-05-16"), MajorCategory = c(66L, 66L, 67L,
43L, 66L)), .Names = c("OrderDate", "MajorCategory"), class = "data.frame", row.names = c(NA,
-5L))
OrderDate<- as.Date(c('2005-12-12','2005-12-12','2006-03-28','2006-05-16','2005-03-04','2005-12-12'))
MajorCategory<- as.numeric(c(66, 66, 43, 66, 43, 1))
OD=data.frame(OrderDate,MajorCategory)
out <- split(OD, OD$MajorCategory)
count=lapply(out, function(x) aggregate(x$MajorCategory, FUN = length, by = list(x$OrderDate)))

Frequency tables by groups with weighted data in R

I wish to calculate two kind of frequency tables by groups with weighted data.
You can generate reproducible data with the following code :
Data <- data.frame(
country = sample(c("France", "USA", "UK"), 100, replace = TRUE),
migrant = sample(c("Native", "Foreign-born"), 100, replace = TRUE),
gender = sample (c("men", "women"), 100, replace = TRUE),
wgt = sample(100),
year = sample(2006:2007)
)
Firstly, I try to calculate a frequency table of migrant status (Native VS Foreign-born) by country and year. I wrote the following code using the packages questionr and plyr :
db2006 <- subset (Data, year == 2006)
db2007 <- subset (Data, year == 2007)
result2006 <- as.data.frame(cprop(wtd.table(db2006$migrant, db2006$country, weights=db2006$wgt),total=FALSE))
result2007 <- as.data.frame(cprop(wtd.table(db2007$migrant, db2007$country, weights=db2007$wgt),total=FALSE))
result2006<-rename (result2006, c(Freq = "y2006"))
result2007<-rename (result2007, c(Freq = "y2007"))
result <- merge(result2006, result2007, by = c("Var1","Var2"))
In my real database, I have 10 years so it takes times to apply this code for all the years. Does anyone know a faster way to do it ?
I also wish to calculate the share of women and men among migrant status by country and year. I am looking for something like :
Var1 Var2 Var3 y2006 y2007
Foreign born France men 52 55
Foreign born France women 48 45
Native France men 51 52
Native France women 49 48
Foreign born UK men 60 65
Foreign born UK women 40 35
Native UK men 48 50
Native UK women 52 50
Does anyone have an idea of how I can get these results?
You could do this by: making a function with the code you've already written; using lapply to iterate that function over all years in your data; then using Reduce and merge to collapse the resulting list into one data frame. Like this:
# let's make your code into a function called 'tallyho'
tallyho <- function(yr, data) {
require(dplyr)
require(questionr)
DF <- filter(data, year == yr)
result <- with(DF, as.data.frame(cprop(wtd.table(migrant, country, weights = wgt), total = FALSE)))
# rename the last column by year
names(result)[length(names(result))] <- sprintf("y%s", year)
return(result)
}
# now iterate that function over all years in your original data set, then
# use Reduce and merge to collapse the resulting list into a data frame
NewData <- lapply(unique(Data$year), function(x) tallyho(x, Data)) %>%
Reduce(function(...) merge(..., all=T), .)

Performing Operations on a Subset Using Data Table

I have a survey data set in wide form. For a particular question, a set of variables was created in the raw data to represent different the fact that the survey question was asked on a particular month.
I wish to create a new set of variables that have month-invariant names; the value of these variables will correspond to the value of a month-variant question for the month observed.
Please see an example / fictitious data set:
require(data.table)
data <- data.table(month = rep(c('may', 'jun', 'jul'), each = 5),
may.q1 = rep(c('yes', 'no', 'yes'), each = 5),
jun.q1 = rep(c('breakfast', 'lunch', 'dinner'), each = 5),
jul.q1 = rep(c('oranges', 'apples', 'oranges'), each = 5),
may.q2 = rep(c('econ', 'math', 'science'), each = 5),
jun.q2 = rep(c('sunny', 'foggy', 'cloudy'), each = 5),
jul.q2 = rep(c('no rain', 'light mist', 'heavy rain'), each = 5))
In this survey, there are really only two questions: "q1" and "q2". Each of these questions is repeatedly asked for several months. However, the observation contains a valid response only if the month observed in the data matches up with the survey question for a particular month.
For example: "may.q1" is observed as "yes" for any observation in "May". I would like a new "Q1" variable to represent "may.q1", "jun.q1", and "jul.q1". The value of "Q1" will take on the value of "may.q1" when the month is "may", and the value of "Q1" will take on the value of "jun.q1" when the month is "jun".
If I were to try and do this by hand using data table, I would want something like:
mdata <- data[month == 'may', c('month', 'may.q1', 'may.q2'), with = F]
setnames(mdata, names(mdata), gsub('may\\.', '', names(mdata)))
I would want this repeated "by = month".
If I were to use the "plyr" package for a data frame, I would solve using the following approach:
require(plyr)
data <- data.frame(data)
mdata <- ddply(data, .(month), function(dfmo) {
dfmo <- dfmo[, c(1, grep(dfmo$month[1], names(dfmo)))]
names(dfmo) <- gsub(paste0(dfmo$month[1], '\\.'), '', names(dfmo))
return(dfmo)
})
Any help using a data.table method would be greatly appreciated, as my data are large. Thank you.
A different way to illustrate :
data[, .SD[,paste0(month,c(".q1",".q2")), with=FALSE], by=month]
month may.q1 may.q2
1: may yes econ
2: may yes econ
3: may yes econ
4: may yes econ
5: may yes econ
6: jun lunch foggy
7: jun lunch foggy
8: jun lunch foggy
9: jun lunch foggy
10: jun lunch foggy
11: jul oranges heavy rain
12: jul oranges heavy rain
13: jul oranges heavy rain
14: jul oranges heavy rain
15: jul oranges heavy rain
But note the column names come from the first group (can rename afterwards using setnames). And it may not be the most efficient if there are a great number of columns with only a few needed. In that case Arun's solution melting to long format should be faster.
Edit: Seems very inefficient on bigger data. Check out #MatthewDowle's answer for a really fast and neat solution.
Here's a solution using data.table.
dd <- melt.dt(data, id.var=c("month"))[month == gsub("\\..*$", "", ind)][,
ind := gsub("^.*\\.", "", ind)][, split(values, ind), by=list(month)]
The function melt.dt is a small function (still more improvements to be made) I wrote to melt a data.table similar to that of the melt function in plyr (copy/paste this function shown below before trying out the code above).
melt.dt <- function(DT, id.var) {
stopifnot(inherits(DT, "data.table"))
measure.var <- setdiff(names(DT), id.var)
ind <- rep.int(measure.var, rep.int(nrow(DT), length(measure.var)))
m1 <- lapply(c("list", id.var), as.name)
m2 <- as.call(lapply(c("factor", "ind"), as.name))
m3 <- as.call(lapply(c("c", measure.var), as.name))
quoted <- as.call(c(m1, ind = m2, values = m3))
DT[, eval(quoted)]
}
The idea: First melt the data.table with id.var = month column. Now, all your melted column names are of the form month.question. So, by removing ".question" from this melted column and equating with month column, we can remove all unnecessary entries. Once we did this, we don't need the "month." in the melted column "ind" anymore. So, we use gsub to remove "month." to retain just q1, q2 etc.. After this, we have to reshape (or cast) it. This is done by grouping by month and splitting the values column by ind (which has either q1 or q2. So, you'll get 2 columns for every month (which is then stitched together) to get your desired output.
What about something like this
data <- data.table(
may.q1 = rep(c('yes', 'no', 'yes'), each = 5),
jun.q1 = rep(c('breakfast', 'lunch', 'dinner'), each = 5),
jul.q1 = rep(c('oranges', 'apples', 'oranges'), each = 5),
may.q2 = rep(c('econ', 'math', 'science'), each = 5),
jun.q2 = rep(c('sunny', 'foggy', 'cloudy'), each = 5),
jul.q2 = rep(c('no rain', 'light mist', 'heavy rain'), each = 5)
)
tmp <- reshape(data, direction = "long", varying = 1:6, sep = ".", timevar = "question")
str(tmp)
## Classes ‘data.table’ and 'data.frame': 30 obs. of 5 variables:
## $ question: chr "q1" "q1" "q1" "q1" ...
## $ may : chr "yes" "yes" "yes" "yes" ...
## $ jun : chr "breakfast" "breakfast" "breakfast" "breakfast" ...
## $ jul : chr "oranges" "oranges" "oranges" "oranges" ...
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
If you want to go further and melting this data again you can use the melt package
require(reshape2)
## remove the id column if you want (id is the last col so ncol(tmp))
res <- melt(tmp[,-ncol(tmp), with = FALSE], measure.vars = c("may", "jun", "jul"), value.name = "response", variable.name = "month")
str(res)
## 'data.frame': 90 obs. of 3 variables:
## $ question: chr "q1" "q1" "q1" "q1" ...
## $ month : Factor w/ 3 levels "may","jun","jul": 1 1 1 1 1 1 1 1 1 1 ...
## $ response: chr "yes" "yes" "yes" "yes" ...

Resources