I am trying to download data directly from the OECD website using the OECD package in R. I'm specifically trying to download data from the industrial production dataset (https://data.oecd.org/industry/industrial-production.htm) for South Africa. I believe that the codes for the dataset itself and for South Africa are MEI_REAL and ZAF.
However, when I try to run the following
df <- get_dataset("MEI_REAL",
                  filter = "ZAF",
                  start_time = 2019, end_time = 2020)
I get the following error
Error in rsdmx::readSDMX(url) :
HTTP request failed with status: 400 Bad Request
Can anyone advise on what I'm doing wrong? I've never used this package before so I'm struggling to figure it out.
TIA
To actually use the filter it has to be something like
df <- get_dataset("MEI_REAL",
                  filter = list(c(), "ZAF"),
                  start_time = 2019, end_time = 2020)
as the country codes are in the second column, but I don't know how you can know this in advance without just downloading the whole dataset first (just drop the filter argument entirely to do that).
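One way to find the dimension order up front, without pulling the whole dataset, is to query the dataset structure first. A minimal sketch, assuming the get_data_structure() helper from the OECD package:

library(OECD)

# List the dataset's dimensions (and their order), plus the valid codes
# for each one, so you know which slot in filter = list(...) is LOCATION
struct <- get_data_structure("MEI_REAL")
struct$VAR_DESC        # dimension ids in the order the filter expects
head(struct$LOCATION)  # valid country codes for the LOCATION dimension, e.g. ZAF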
The code in the question fails because LOCATION is the second column in the data frame, and the filter = statement does not account for this. We can fix the request by adding NULL to the list() passed to the filter = argument.
library(OECD)
# filter on second column
saProduction <- get_dataset("MEI_REAL",
                            filter = list(NULL, "ZAF"),
                            start_time = 2019,
                            end_time = 2020)
head(saProduction)
...and the output:
> head(saProduction)
SUBJECT LOCATION FREQUENCY TIME_FORMAT UNIT POWERCODE REFERENCEPERIOD obsTime
1 PRMNTO01 ZAF A P1Y IDX 0 2015_100 2019
2 PRMNTO01 ZAF Q P3M IDX 0 2015_100 2019-Q1
3 PRMNTO01 ZAF Q P3M IDX 0 2015_100 2019-Q2
4 PRMNTO01 ZAF Q P3M IDX 0 2015_100 2019-Q3
5 PRMNTO01 ZAF Q P3M IDX 0 2015_100 2019-Q4
6 PRMNTO01 ZAF Q P3M IDX 0 2015_100 2020-Q1
obsValue
1 100.46670
2 100.97510
3 101.30840
4 100.04170
5 99.47495
6 97.40812
How did I figure out that the right entry in the list is NULL? Looking at the arguments for get_dataset(), we see that NULL is a valid value for filter =, so I inferred that I could use it as a value inside the filter list().
For anyone who encounters the same problem: my "solution" is to try the code you want to filter on at different positions in the filter list. E.g.:
First try:
filter_GDP <- list("B1_GE")
GDP <- get_dataset("QNA", filter_GDP)
This returns an error, so I move the code to the next position in the filter list (by prepending NULL):
filter_GDP <- list(NULL, "B1_GE")
GDP <- get_dataset("QNA", filter_GDP)
This works, so now you can work out the positions of the other dimensions you want to filter on; in my case:
filter_GDP <- list(NULL,"B1_GE", "CQRSA", "Q")
GDP <- get_dataset("QNA", filter_GDP)
My goal is to plot a map with each point representing the year of the highest measured value. For that I need the year as one value and the station name as the row name.
I get to the point where I have the year of the maximum value for each station, but I don't know how to get the station name as the row name.
My example is the following:
library(dplyr)      # needed for the %>% pipe
library(lubridate)

set.seed(123)
df1 <- data.frame(replicate(6, sample(0:200, 2500, rep = TRUE)))
date_df1 <- seq(as.Date("1995-01-01"), by = "day", length.out = 2500)
test_sto <- cbind(date_df1, df1)
test_sto$date_df1 <- as.Date(test_sto$date_df1)
test_sto <- test_sto %>%
  dplyr::mutate(year  = lubridate::year(date_df1),
                month = lubridate::month(date_df1),
                day   = lubridate::day(date_df1))
This is my data frame. I then applied the following steps.
To get all values above the threshold for each year and station:
test_year <- aggregate.data.frame(x = test_sto[2:7] > 120, by = list(test_sto$year), FUN = sum, na.rm = TRUE)
This works as it should; the next step is the following:
m <- ncol(test_year)
Value <- rep(NA, m)
for (j in 2:m) {
  idx <- which.max(test_year[, j])
  Value[j] <- test_year[, 1][idx]
}
test_test <- Value[2:m]
At the end of this, I get the following table:
     x
1 1996
2 1996
3 1998
4 1996
5 1999
6 1999
But instead of the 1, 2, 3, 4, 5, ... I need my column names there (X1, X2, X3, etc.):
      x
X1 1996
X2 1996
X3 1998
X4 1996
X5 1999
X6 1999
but this is the point where I'm struggling.
I tried the following:
test_year$max<-apply(test_year[2:7], 1, FUN = max)
apply(test_year[2:7], 2, FUN = max)
test_year2<-subset(test_year, ncol(2:7) == max(ncol(2:7)))
But I just get the following warning:
In max(ncol(2:7)) : no non-missing arguments to max; returning -Inf
Maybe someone knows a workaround! Thanks in advance!
The 'test_test' object is just a vector. Its size is characterized by its length, and as a one-dimensional object it doesn't have a row.names attribute. But we can give it a names attribute:
names(test_test) <- colnames(test_year)[-1]
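As an alternative sketch that carries the station names along from the start (keeping the aggregate result from the question, and assuming its first column holds the years):

# For each station column, pick the year with the largest count of
# exceedances; sapply() keeps the column names, so the result is named
year_of_max <- sapply(test_year[-1], function(col) test_year[[1]][which.max(col)])
year_of_max
#>   X1   X2   X3   X4   X5   X6
#> 1996 1996 1998 1996 1999 1999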
I have a data frame df1 of import data for 397 different industries over 17 years and several different exporting countries/regions.
> head(df1)
year importer exporter imports sic87dd
2300 1991 USA CAN 9.404848e+05 2011
2301 1991 USA CAN 2.259720e+04 2015
2302 1991 USA CAN 5.459608e+02 2021
2303 1991 USA CAN 1.173237e+04 2022
2304 1991 USA CAN 2.483033e+04 2023
2305 1991 USA CAN 5.353975e+00 2024
However, I want the sum of all imports for a given industry and a given year, regardless of where they came from. (The importer is always the US, sic87dd is a code that uniquely identifies the 397 industries)
So far I have tried the following code, which works correctly but is terribly inefficient and takes ages to run.
sic87dd <- unique(df1$sic87dd)
year <- unique(df1$year)
df2 <- data.frame("sic87dd" = rep(sic87dd, each = 17), "year" = rep(year, 397), imports = rep(0, 6749))
i <- 1
j <- 1
while(i <= nrow(df2)){
while(j <= nrow(df1)){
if((df1$sic87dd[j] == df2$sic87dd[i]) == TRUE & (df1$year[j] == df2$year[i]) == TRUE){
df2$imports[i] <- df2$imports[i] + df1$imports[j]
}
j <- j + 1
}
i <- i + 1
j <- 1
}
Is there a more efficient way to do this? I have seen some questions here that were somewhat similar and suggested the use of the data.table package, but I can't figure out how to make it work in my case.
Any help is appreciated.
There is a simple solution using dplyr:
First, you'll need to set your industry field as a factor (I'm assuming this entire field consists of a 4 digit number):
df1$sic87dd <- as.factor(df1$sic87dd)
Next, use the group_by command (grouping by both industry and year, since you want a total per industry per year) and summarise:
library(dplyr)

df1 %>%
  group_by(sic87dd, year) %>%
  summarise(total_imports = sum(imports))
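Since the question mentions data.table, here is a sketch of the equivalent aggregation with that package:

library(data.table)

setDT(df1)  # convert df1 to a data.table in place
df2 <- df1[, .(imports = sum(imports)), by = .(sic87dd, year)]
head(df2)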
I have tried to find an answer to what appears to be a simple question but without any success.
I want to create a function which would operate on different variables for different data frames. All that the function needs to do is search for the value "don't know" and replace it with NA. I would do this manually as follows:
raw.df$S8[raw.df$S8 == "Don't know"] <- NA
As an exercise in learning R I would like to do this with a function, but I cannot find a way to reference the inputs to the function.
In this example code I cannot even create a vector which is a copy of the dataframe variable I want to recode - it is coming out as NULL. So until I know how to do this part, I can't progress to recoding values as NA.
> NADK <- function(df,x) {
+ DDD <<- df$x
+ }
>
> NADK(raw.df, S8)
> DDD
NULL
I am assuming that I cannot just use df$x and expect R to know that this is coming from the function inputs?
Rather than writing a function which hardwires in "Don't know" it seems more flexible to have that as an argument to the function. Something like:
to.na <- function(df, x, na.string) {
  df[x][df[x] == na.string] <- NA  # replace matching values in column x with NA
  df
}
This returns the altered dataframe.
For example, if
df <- data.frame(Name = c("Larry", "Curly", "Moe"),
                 BirthYear = c(1900, 1910, 1920),
                 DeathYear = c("1950", "1960", "Not dead"))
So that df is
Name BirthYear DeathYear
1 Larry 1900 1950
2 Curly 1910 1960
3 Moe 1920 Not dead
Then:
> df <- to.na(df,"DeathYear","Not dead")
> df
Name BirthYear DeathYear
1 Larry 1900 1950
2 Curly 1910 1960
3 Moe 1920 <NA>
If you are reading the dataframe from a file by using read.table (or associated functions like read.csv) then you might be able to avoid the problem to begin with by using the parameter na.strings. See ?read.table for details.
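For example, a small sketch (the file name below is just a placeholder for wherever your data lives):

# Any value listed in na.strings is turned into NA as the file is read
raw.df <- read.csv("survey_data.csv", na.strings = c("Don't know", ""))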
I am using trade data (FAO) which I would like to turn into matrices (per Item and Year). Therefore I've done a split:
# import is the original df
import_YI <- split(import, list(import$Item, import$Year))
import_YI_lap <- lapply(seq_along(import_YI), function(x) as.data.frame(import_YI[[x]])[, 1:11])
and the data looks like this (you can find test data at the end):
[[1]]
RC PC Item Year Value
Argentina Chile Almonds 1996 1108
Algeria Spain Almonds 1996 1
....
[[2]]
....
[[3]]
....
[[n]]
I used the cast function (below) to create a matrix for almonds in 2012:
# import_almonds2012 is a test subset of the import df (import values for almonds in 2012)
library(reshape)  # provides cast()
RCPC <- cast(RC ~ PC, data = import_almonds2012, value = "Value")
Now my question: how can I make matrices for all Items/Years (~100 items and 17 years!) from the import_YI_lap list? My problem is that I don't know (1) how to work with the elements of this list ([[1]], [[2]], ...), or whether there is a better way to split the data or to save the split data frames into objects, and (2) how to create all the needed matrices without copying thousands of lines of code. Loops? If yes, how?
Here is a test dataset:
import <- data.frame(RC = c("DE", "IT", "USA"),
                     PC = c("BRA", "ARG"),
                     Item = c("Almonds", "Apples"),
                     Year = c(1996, 1997, 1998),
                     Value = c(1, 5, 3, 2, 8, 3))
import_YI <- split(import, list(import$Item, import$Year))
import_YI_lap <- lapply(seq_along(import_YI), function(x) as.data.frame(import_YI[[x]])[, 1:5])
import_YI_lap
It's difficult to test without data, but you can try this:
do.call(rbind, import_YI_lap)
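If the goal is one RC x PC matrix per Item/Year combination, a minimal sketch is to run the same cast() call from the question over every element of the split list (assuming cast() comes from the reshape package):

library(reshape)

# One cast per Item/Year combination; the list keeps the "Item.Year"
# names created by split(), so each result is easy to look up later
matrices <- lapply(import_YI, function(d) cast(RC ~ PC, data = d, value = "Value"))

matrices[["Almonds.1996"]]  # e.g. the almonds matrix for 1996

With the full FAO data, Item/Year combinations that have no rows may need dropping first (split(..., drop = TRUE)) so that cast() is not handed an empty data frame.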
I hope this is not too much of a newbie question.
I am trying to subset rows from the GDP UK dataset that can be downloaded from here:
http://www.ons.gov.uk/ons/site-information/using-the-website/time-series/index.html
The data frame looks more or less like this:
X ABMI
1 1948 283297
2 1949 293855
3 1950 304395
....
300 2013 Q2 381318
301 2013 Q3 384533
302 2013 Q4 387138
303 2014 Q1 390235
The thing is that for my analysis I only need the data for the years 2004-2013 and I am interested in one result per year, so I wanted to get every fourth row of the dataset between rows 263 and 303.
On the basis of the following websites:
https://stat.ethz.ch/pipermail/r-help/2008-June/165634.html
(plus a few that I cannot quote due to the link limit)
I tried the following, each time getting some error message:
> GDPUKodd <- seq(GDPUKsubset[263:302,], by = 4)
Error in seq.default(GDPUKsubset[263:302, ], by = 4) :
argument 'from' must have length 1
> OddGDPUK <- GDPUK[seq(263, 302, by = 4)]
Error in `[.data.frame`(GDPUK, seq(263, 302, by = 4)) :
undefined columns selected
> OddGDPUKprim <- GDPUK[seq(263:302), by = 4]
Error in `[.data.frame`(GDPUK, seq(263:302), by = 4) :
unused argument (by = 4)
> OddGDPUK <- GDPUK[seq(from=263, to=302, by = 4)]
Error in `[.data.frame`(GDPUK, seq(from = 263, to = 302, by = 4)) :
undefined columns selected
> OddGDPUK <- GDPUK[seq(from=GDPUK[263,] to=GDPUK[302,] by = 4)]
Error: unexpected symbol in "OddGDPUK <- GDPUK[seq(from=GDPUK[263,] to"
> GDPUK[seq(1,nrows(GDPUK),by=4),]
Error in seq.default(1, nrows(GDPUK), by = 4) :
could not find function "nrows"
To cut a long story short: help!
Instead of trying to extract data based on row ids, you can use the subset function with appropriate filters based on the values.
For example if your data frame has a year column with values 1948...2014 and a quarter column with values Q1..Q4, then you can get the right subset with:
subset(data, year >= 2004 & year <= 2013 & quarter == 'Q1')
UPDATE
I see your source data is dirty, with no proper year and quarter columns. You can clean it like this:
x <- read.csv('http://www.ons.gov.uk/ons/datasets-and-tables/downloads/csv.csv?dataset=pgdp&cdid=ABMI')
x$ABMI <- as.numeric(as.character(x$ABMI))          # value column to numeric
x$year <- as.numeric(gsub('[^0-9].*', '', x$X))     # keep only the leading year digits
x$quarter <- gsub('[0-9]{4} (Q[1-4])', '\\1', x$X)  # "2013 Q2" -> "Q2"
subset(x, year >= 2004 & year <= 2013 & quarter == 'Q1')
Your code GDPUK[seq(1,nrows(GDPUK),by=4),] actually works quite well for these purposes. The only thing you need to change is nrows to nrow.
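For completeness, a sketch of the row-index approach from the question, assuming rows 263-302 hold the 2004-2013 observations of interest. Note the trailing comma inside the brackets; its absence is what produced the "undefined columns selected" errors above:

# Every fourth row between rows 263 and 302; the comma after the index
# keeps all columns (indexing a data frame without it selects columns, not rows)
OddGDPUK <- GDPUK[seq(from = 263, to = 302, by = 4), ]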