Simple function to classify NA values - r

I have tried to find an answer to what appears to be a simple question but without any success.
I want to create a function that operates on different variables in different data frames. All the function needs to do is search for the value "Don't know" and replace it with NA. I would do this manually as follows:
raw.df$S8[raw.df$S8 == "Don't know"] <- NA
As an exercise in learning R I would like to do this by function but cannot find a way to reference the inputs to the function.
In this example code I cannot even create a vector which is a copy of the dataframe variable I want to recode - it is coming out as NULL. So until I know how to do this part, I can't progress to recoding values as NA.
> NADK <- function(df,x) {
+ DDD <<- df$x
+ }
>
> NADK(raw.df, S8)
> DDD
NULL
I am assuming that I cannot write df$x and expect R to know that df and x come from the function inputs?

Rather than writing a function which hardwires in "Don't know" it seems more flexible to have that as an argument to the function. Something like:
to.na <- function(df,x,na.string){
df[x][df[x] == na.string] <- NA
df
}
This returns the altered dataframe.
For example, if
df <- data.frame(Name = c("Larry", "Curly", "Moe"),BirthYear = c(1900, 1910, 1920), DeathYear = c("1950", "1960", "Not dead"))
So that df is
Name BirthYear DeathYear
1 Larry 1900 1950
2 Curly 1910 1960
3 Moe 1920 Not dead
Then:
> df <- to.na(df,"DeathYear","Not dead")
> df
Name BirthYear DeathYear
1 Larry 1900 1950
2 Curly 1910 1960
3 Moe 1920 <NA>
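As for why DDD came out NULL in the question: `$` takes the name after it literally, so df$x looks for a column that is actually called x and ignores the value of the argument x. Passing the column name as a string and indexing with [[ ]] avoids this. The following is a minimal sketch; raw.df here is a made-up stand-in for the asker's data, and bad_copy, good_copy and nadk are hypothetical names used only for illustration.

```r
# Hypothetical stand-in for the asker's raw.df
raw.df <- data.frame(S8 = c("Yes", "Don't know", "No"),
                     stringsAsFactors = FALSE)

# df$x looks for a column literally named "x", so this returns NULL
bad_copy <- function(df, x) df$x

# df[[x]] uses the value of x, which must be a column name given as a string
good_copy <- function(df, x) df[[x]]

# Recode one column and return the altered data frame
nadk <- function(df, x, na.string = "Don't know") {
  df[[x]][df[[x]] %in% na.string] <- NA
  df
}

is.null(bad_copy(raw.df, "S8"))
good_copy(raw.df, "S8")
recoded <- nadk(raw.df, "S8")
recoded
```

Note that the column is passed as "S8" (quoted), not as the bare name S8 used in the question.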
If you are reading the dataframe from a file by using read.table (or associated functions like read.csv) then you might be able to avoid the problem to begin with by using the parameter na.strings. See ?read.table for details.
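A small sketch of the na.strings approach, reading from an inline string instead of a file so that it is self-contained (the CSV content is made up to mirror the example above):

```r
# Inline CSV standing in for a file on disk
csv_text <- "Name,BirthYear,DeathYear
Larry,1900,1950
Curly,1910,1960
Moe,1920,Not dead"

# Treat both "NA" and "Not dead" as missing while reading
df <- read.csv(text = csv_text, na.strings = c("NA", "Not dead"))
df
str(df$DeathYear)
```

A side effect worth knowing: because "Not dead" is turned into NA at read time, DeathYear can be read as a numeric column instead of as text.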


downloading data from OECD website using the OECD package in R

I am trying to download data directly from the OECD website using the OECD package in R. I'm specifically trying to download data from the industrial production dataset (https://data.oecd.org/industry/industrial-production.htm) for South Africa. I believe that the codes for the dataset itself and for South Africa are MEI_REAL and ZAF.
However, when I try to run the following
df <- get_dataset("MEI_REAL",
filter = 'ZAF',
start_time = 2019, end_time = 2020)
I get the following error
Error in rsdmx::readSDMX(url) :
HTTP request failed with status: 400 Bad Request
Can anyone advise on what I'm doing wrong? I've never used this package before so I'm struggling to figure it out.
TIA
To actually use the filter it has to be something like
df <- get_dataset("MEI_REAL",
filter = list(c(),'ZAF'),
start_time = 2019, end_time = 2020)
as the country codes are in the second column, but I don't know how you can know this in advance without just downloading the whole dataset first (just drop the filter argument entirely to do that).
The code in the question fails because LOCATION is the second column in the data frame, and the filter = statement does not account for this. We can fix the request by adding NULL to the list() passed to filter= argument.
library(OECD)
# filter on second column
saProduction <- get_dataset("MEI_REAL",
filter = list(NULL,"ZAF"),
start_time = 2019,
end_time = 2020)
head(saProduction)
...and the output:
> head(saProduction)
SUBJECT LOCATION FREQUENCY TIME_FORMAT UNIT POWERCODE REFERENCEPERIOD obsTime
1 PRMNTO01 ZAF A P1Y IDX 0 2015_100 2019
2 PRMNTO01 ZAF Q P3M IDX 0 2015_100 2019-Q1
3 PRMNTO01 ZAF Q P3M IDX 0 2015_100 2019-Q2
4 PRMNTO01 ZAF Q P3M IDX 0 2015_100 2019-Q3
5 PRMNTO01 ZAF Q P3M IDX 0 2015_100 2019-Q4
6 PRMNTO01 ZAF Q P3M IDX 0 2015_100 2020-Q1
obsValue
1 100.46670
2 100.97510
3 101.30840
4 100.04170
5 99.47495
6 97.40812
How did I figure out that the right entry in the list is NULL? Looking at the arguments for get_dataset(), NULL is a valid value for filter =, so I inferred that I could also use it as an element of the filter list().
For anyone who encounters the same problem: my "solution" is to try the dataframe ID you want at different positions in the filter. E.g.:
First try:
filter_GDP <- list("B1_GE")
GDP <- get_dataset("QNA", filter_GDP)
This gives back an error, so I moved the dataframe ID to the next position in the filter list (by including NULL):
filter_GDP <- list(NULL, "B1_GE")
GDP <- get_dataset("QNA", filter_GDP)
This works, so now you can look up the positions of the other parameters you want to filter, in my case:
filter_GDP <- list(NULL,"B1_GE", "CQRSA", "Q")
GDP <- get_dataset("QNA", filter_GDP)

How can I concatenate one row to the previous row in a data frame, if a condition is met?

I am an intermediate user of R and have a data set of ~850,000 rows that was edited in Stata and saved as a CSV, but about 0.01% of the rows got split onto the following row after column 11. I am trying to get the file back to its original form, with no split rows. I was using typeof() on column 4 as the condition, but someone below pointed out this won't work. I tested this, and every value in the data frame does indeed come back as type "integer". Maybe this would work if I changed the type of the column, but here is what I tried:
wages <- for (i in wages) {
if(typeof(wages[i,4]) == "integer") {
cat(i-1, i)
}
}
All I get is NAs.
When trying:
for (i in wages) {
if(typeof(i[ ,4]) == "integer") {
append(i-1, i, after = length(i-1))
}
}
it says:
Error in `[.default`(i, , 4) : incorrect number of dimensions
I have spent hours searching for solutions and trying different methods with no success. Thanks in advance for any help.
Snippet of data:
WD County_Name State_Name Cons_Code constructiondescription wagegroup Rate_Effective_Date hourly
113352 CO20190006 Adams Colorado Highway SUCO2011-001 9/15/2011 22.67
113353 CO20190004 Adams Colorado Residential PLUM0058-011 7/1/2018 32.75
113354 (pipefitters exclude hvac pipe) SOUTHWEST CO 8001 METRO 1352 100335 plumber
113355 CO20190004 Adams Colorado Residential PLUM0145-005 8/1/2016 24.58
fringe Rate_Type Craft_Title region st_abbr stcnty_fips mr supergrp
113352 8.73 Open power equipment operator: broom/sweeper arapahoe SOUTHWEST CO 8001 METRO 1352
113353 14.85 CBA plumber/pipefitter (plumbers include hvac pipe) NA NA
113354 1 NA NA
113355 10.47 CBA plumber (plumbers include hvac pipe) & pipefitters (exclude hvac pipe) SOUTHWEST CO 8001 METRO 1352
group key_craft key
113352 100335 operator 1
113353 NA NA
113354 NA NA
113355 100335 plumber 1
Reproducible data:
data <- data.frame(c("CO20190006","CO20190004","(pipefitters exclude hvac pipe)","CO20190004"), #1
c("Adams","Adams","SOUTHWEST","Adams"), #2
c("Colorado","Colorado","CO","Colorado"), #3
c("Highway","Residential","8001","Residential"), #4
c("","","METRO",""), #5
c("SUCO2011-001","PLUM0058-011","1352","PLUM0145-005"), #6
c("9/15/2011","7/1/2018","100335","8/1/2016"), #7
c("22.67","32.75","plumber","24.58"), #8
c("8.73","14.85","1","10.47"), #9
c("Open","CBA","","CBA"), #10
c("power equipment operator: broom/sweeper arapahoe","plumber/pipefitter (plumbers include hvac pipe)","",
"plumber (plumbers include hvac pipe) & pipefitters (exclude hvac pipe)"), #11
c("SOUTHWEST","","","SOUTHWEST"), #12
c("CO","","","CO"), #13
c("8001",NA,NA,"8001"), #14
c("METRO","","","METRO"), #15
c("1352",NA,NA,"1352"), #16
c("100335",NA,NA,"100335"), #17
c("operator","","","plumber"), #18
c("1",NA,NA,"1")) #19
colnames(data) <- c("WD","County_Name","State_Name","Cons_Code","constructiondescription","wagegroup","Rate_Effective_Date",
"hourly","fringe","Rate_Type","Craft_Title","region","st_abbr","stcnty_fips","mr","supergrp","group",
"key_craft","key")
The following solution should do the job:
new_data <- NULL
i <- 1
while (i <= nrow(data)) {
new_data <- rbind(new_data, data[i, ])
if (all(is.na(data[i, c(14, 16, 17, 19)]))) {
if (!paste(data[i,11], data[i+1,1]) %in% levels(new_data[,11])) {
levels(new_data[,11]) <- c(
levels(new_data[,11]), paste(data[i,11], data[i+1,1])
)
}
new_data[nrow(new_data),11] <- paste(data[i,11], data[i+1,1])
new_data[nrow(new_data),] <- cbind(new_data[nrow(new_data),1:11], data[i+1,2:9])
i <- i + 2
} else {i <- i + 1}
}
Note that, as it stands, your data frame (DF) stores strings as factors, because in R versions before 4.0.0 data.frame() uses stringsAsFactors = TRUE by default (from R 4.0.0 onwards the default is FALSE and the strings stay as character vectors). See ?factor and ?levels for more about factors and their levels.
Therefore, in the code above, we first add a new row to the clean new_data:
new_data <- rbind(new_data, data[i, ])
Then we test whether that row is split by checking if there are NAs in columns 14, 16, 17, and 19:
if (all(is.na(data[i, c(14, 16, 17, 19)])))
If so, in order for us to be able to merge the cell in column 11 of the split row with the 1st cell of the following row, we first need to check whether that level already exists in that column and if not:
if (!paste(data[i,11], data[i+1,1]) %in% levels(new_data[,11]))
it needs to be added to the list of levels before merging:
levels(new_data[,11]) <- c(levels(new_data[,11]), paste(data[i,11], data[i+1,1]))
And then, finally, the merging (to complete the cell in column 11 of the split row) can be done:
new_data[nrow(new_data),11] <- paste(data[i,11], data[i+1,1])
After that, the remaining missing columns are added to the split row in question:
new_data[nrow(new_data),] <- cbind(new_data[nrow(new_data),1:11], data[i+1,2:9])
Version LITE
I suspect that all this checking for factor levels and adding new ones takes some extra time, so I propose a second version of the code, which converts column 11 to plain character instead of a factor. I think that makes sense for this particular data set, as that column does not seem to be intended as a factor anyway. That way, all the level checking and adding can be skipped:
data[,11] <- as.character(data[,11])
new_data <- NULL
i <- 1
while (i <= nrow(data)) {
new_data <- rbind(new_data, data[i, ])
if (all(is.na(data[i, c(14, 16, 17, 19)]))) {
new_data[nrow(new_data),11] <- paste(data[i,11], data[i+1,1])
new_data[nrow(new_data),] <- cbind(new_data[nrow(new_data),1:11], data[i+1,2:9])
i <- i + 2
} else {i <- i + 1}
}
Let me know if this improved the speed!
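If the loop is still slow on ~850,000 rows, a likely cause is that rbind() inside the loop copies new_data on every iteration. Below is a vectorised sketch of the same idea on a small toy data frame (4 columns instead of 19). It is not tested on the real file: merge_split_rows and its arguments are hypothetical names, the toy data is invented, and it assumes, like the loop above, that a split row is recognised by NAs in the check columns and that its continuation is always the very next row.

```r
# Toy data: row 2 is split after the "text" column; row 3 is its continuation,
# holding the rest of the text in column 1 and the missing values in columns 2:3
d <- data.frame(
  id     = c("A1", "A2", "(part two)", "A3"),
  text   = c("full text", "part one", "WEST", "other text"),
  region = c("WEST", "", "8001", "EAST"),
  code   = c("8001", NA, NA, "8002"),
  stringsAsFactors = FALSE
)

merge_split_rows <- function(d, text_col, na_cols, tail_cols, src_cols) {
  # Rows where every check column is NA (split rows and their continuations)
  na_row <- rowSums(!is.na(d[, na_cols, drop = FALSE])) == 0
  # Position of each row within its run of consecutive equal na_row values
  run_id <- cumsum(c(TRUE, diff(na_row) != 0))
  pos <- ave(seq_along(na_row), run_id, FUN = seq_along)
  # Odd positions within an NA run are the split rows; the row after each is its continuation
  first <- which(na_row & pos %% 2 == 1)
  first <- first[first < nrow(d)]
  if (length(first) == 0) return(d)
  # Complete the text cell, then fill the missing tail columns from the continuation row
  d[first, text_col] <- paste(d[first, text_col], d[first + 1, 1])
  d[first, tail_cols] <- d[first + 1, src_cols]
  # Drop the continuation rows
  d[-(first + 1), , drop = FALSE]
}

fixed <- merge_split_rows(d, text_col = 2, na_cols = 4, tail_cols = 3:4, src_cols = 2:3)
fixed
```

For the data in the question, the corresponding call would presumably use text_col = 11, na_cols = c(14, 16, 17, 19), tail_cols = 12:19 and src_cols = 2:9, with the columns converted to character first.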

Importing Dataframe in R

I'm new to R so please forgive the repetitive question. I was trying to do this in Access (I know) but unfortunately the application kept crashing.
I have a data frame of 78k records that I imported from a CSV. It should form a tree-like structure, although there may not be a natural root, as this is a subset of the entire org.
POS_NUM|TITLE|REPORT_TO_POS_NUM
1234 Bob 789
5698 Jim 1234
8976 Frank 1653
This should form a loose tree relationship:
Bob
\ Jim
Frank
Essentially I need this to calculate the number of sub-reports for each person and the number of direct reports, as well as some other recursive functions.
EDIT
Right now I'm attempting to simply loop through my table
treeDataOne <- read.csv(file="File1.csv", header=TRUE, stringsAsFactors=FALSE, sep=",")
treeDataTwo <- read.csv(file="File2.csv",header=TRUE, stringsAsFactors=FALSE, sep=",") #Same columns, different data
treeDataAll <- rbind(treeDataOne, treeDataTwo) #Merge data, this seems to work
#Adding new columns to store data
treeDataAll['DIRECT_REPORTS'] <- 0
treeDataAll['INDIRECT_REPORTS'] <- 0
treeDataAll['DIVISION'] <- ""
treeDataAll['BRANCH'] <- ""
treeDataAll['PROCESSED'] <- FALSE
I'm now trying to iterate over every record and calculate the direct reports.
So in pseudocode it should be:
for i in treeDataAll{
i.DIRECT_REPORTS = nrow(where REPORT_TO_POS_NUM = i.pos_num)
}
library(data.table)
setDT(treeDataAll)
funky <- function(x){
nrow(treeDataAll[REPORT_TO_POS_NUM == x])
}
treeDataAll[, DIR_REPORTS := funky(POS_NUM), by = POS_NUM]
treeDataAll[]
# POS_NUM TITLE REPORT_TO_POS_NUM DIR_REPORTS
# 1: 1234 Bob 789 1
# 2: 5698 Jim 1234 0
# 3: 8976 Frank 1653 0
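For 78k rows, counting with one subset per row may be slow. A base R sketch that counts direct reports in one pass, and then walks up the reporting chain level by level to get the indirect ones, is shown below. It assumes POS_NUM values are unique and that the hierarchy has no cycles; the fourth row (Ann) is a hypothetical addition to the question's sample so that there is something indirect to count.

```r
# Sample from the question plus one hypothetical extra row (Ann reports to Jim)
tree <- data.frame(
  POS_NUM = c(1234, 5698, 8976, 4321),
  TITLE = c("Bob", "Jim", "Frank", "Ann"),
  REPORT_TO_POS_NUM = c(789, 1234, 1653, 5698),
  stringsAsFactors = FALSE
)
n <- nrow(tree)

# Row index of each person's manager; NA when the manager is not in the data
parent_idx <- match(tree$REPORT_TO_POS_NUM, tree$POS_NUM)

# Direct reports: how often each row appears as somebody's manager (NAs are ignored)
tree$DIRECT_REPORTS <- tabulate(parent_idx, nbins = n)

# All reports: credit every person to each of their ancestors, one level at a time
all_reports <- integer(n)
ancestor <- parent_idx
for (level in seq_len(n)) {          # at most n levels, so this always terminates
  if (all(is.na(ancestor))) break
  all_reports <- all_reports + tabulate(ancestor, nbins = n)
  ancestor <- parent_idx[ancestor]   # move everyone one level further up
}
tree$INDIRECT_REPORTS <- all_reports - tree$DIRECT_REPORTS
tree
```

Here Bob ends up with one direct report (Jim) and one indirect report (Ann).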

Quickly create new columns in dataframe using lists - R

I have data containing quotations of indexes (S&P500, CAC40, ...) for every 5 minutes of the last 3 years, which makes it quite large. I am trying to create new columns containing the performance of each index at each time (i.e. (quotation at [TIME] / quotation at yesterday's close) - 1). I began this way (my data is named temp):
listIndexes<-list("CAC","SP","MIB") # there are a lot more
listTime<-list(900,905,910,...1735) # every 5 minutes
for (j in 1:length(listTime)){
Time<-listTime[j]
for (i in 1:length(listIndexes)) {
Index<-listIndexes[i]
temp[[paste0(Index,"perf",Time)]]<-temp[[paste0(Index,Time)]]/temp[[paste0(Index,"close")]]-1
# other stuff to do but with the same concept
}
}
but it takes quite long to run. Is there a way to get rid of the for loop(s) or to make the creation of those variables quicker? I read some stuff about the apply functions and their variants, but I do not see if and how they should be used here.
My data looks like this :
date CACcloseyesterday CAC1000 CAC1005 ... CACclose ... SP1000 ... SPclose
20140105 3999 4000 4001.2 4005 .... 2000 .... 2003
20140106 4005 4004 4003.5 4002 .... 2005 .... 2002
...
and my desired output would be a new column (more exactly, a new column for each time and each index) added to temp:
date CACperf1000 CACperf1005... SPperf1000...
20140106 (4004/4005)-1 (4003.5/4005)-1 .... (2005/2003)-1 # the close used is the one of the day before
and likewise for the following days.
I wrote (4004/4005)-1 just to show the calculation, but the result should be a number: -0.0002496879
It looks like you want to generate every combination of Index and Time. Each Index-Time combination is a column in temp and you want to calculate a new perf column by comparing each Index-Time column against a specific Index close column. And your problem is that you think there should be an easier (less error-prone) way to do this.
We can remove one of the for-loops by generating all the necessary column names beforehand using something like expand.grid.
listIndexes <-list("CAC","SP","MIB")
listTime <- list(900, 905, 910, 915, 920)
df <- expand.grid(Index = listIndexes, Time = listTime,
stringsAsFactors = FALSE)
df$c1 <- paste0(df$Index, "perf", df$Time)
df$c2 <- paste0(df$Index, df$Time)
df$c3 <- paste0(df$Index, "close")
head(df)
#> Index Time c1 c2 c3
#> 1 CAC 900 CACperf900 CAC900 CACclose
#> 2 SP 900 SPperf900 SP900 SPclose
#> 3 MIB 900 MIBperf900 MIB900 MIBclose
#> 4 CAC 905 CACperf905 CAC905 CACclose
#> 5 SP 905 SPperf905 SP905 SPclose
#> 6 MIB 905 MIBperf905 MIB905 MIBclose
Then only one loop is required, and it's for iterating over each batch of column names and doing the calculation.
for (row_i in seq_len(nrow(df))) {
this_row <- df[row_i, ]
temp[[this_row$c1]] <- temp[[this_row$c2]] / temp[[this_row$c3]] - 1
}
An alternative solution would be to reshape your data into a form that makes this transformation much simpler. For instance, you could convert it into a long, tidy format with columns for Date, Index, Time, Value and ClosingValue, and then operate directly on just the two relevant columns (Value and ClosingValue).
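A base R sketch of that long format, on a small made-up version of temp. The column-name pattern (upper-case index name followed by the time) and the names quote_cols, index_of, time_of and long are assumptions for illustration. Like the loop above, it divides by the same row's "close" column; to use the previous day's close instead, point ClosingValue at that column.

```r
# Small made-up stand-in for temp
temp <- data.frame(
  date = c(20140105, 20140106),
  CAC900 = c(4000, 4004),
  CAC905 = c(4001.2, 4003.5),
  CACclose = c(4005, 4002),
  SP900 = c(2000, 2005),
  SPclose = c(2003, 2002)
)

# Quote columns look like <INDEX><TIME>, e.g. CAC900
quote_cols <- grep("^[A-Z]+[0-9]+$", names(temp), value = TRUE)
index_of <- sub("[0-9]+$", "", quote_cols)              # "CAC", "CAC", "SP"
time_of <- as.integer(sub("^[A-Z]+", "", quote_cols))   # 900, 905, 900

# Stack the quote columns, pairing each value with its index's close on the same row
long <- data.frame(
  Date = rep(temp$date, times = length(quote_cols)),
  Index = rep(index_of, each = nrow(temp)),
  Time = rep(time_of, each = nrow(temp)),
  Value = unlist(temp[quote_cols], use.names = FALSE),
  ClosingValue = unlist(temp[paste0(index_of, "close")], use.names = FALSE),
  stringsAsFactors = FALSE
)

# One vectorised operation replaces both loops
long$Perf <- long$Value / long$ClosingValue - 1
long
```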

R: How to create multiple matrices of split data?

I am using trade data (FAO) which I would like to turn into matrices (per Item and Year). Therefore I've done a split:
# import is the original df
import_YI <- split(import, list(import$Item, import$Year))
import_YI_lap <- lapply(seq_along(import_YI), function(x) as.data.frame(import_YI[[x]])[, 1:11])
and the data looks like this (you can find test data at the end) :
[[1]]
RC PC Item Year Value
Argentina Chile Almonds 1996 1108
Algeria Spain Almonds 1996 1
....
[[2]]
....
[[3]]
....
[[n]]
I used the cast function (below) to create a matrix for almonds in 2012:
# import_almonds2012 is a test subset from import df (with import values for almonds in 2012)
RCPC <- cast(RC ~ PC, data =import_almonds2012, value = "Value")
Now my question: how can I make matrices for all Items/Years (~100 Items and 17 years!) from the import_YI_lap list? My problem is that I don't know (1) how to work with the elements of this list ([[1]], [[2]], ...). Or is there a better way to split the data, or to save the split data frames into objects? And (2) how do I create all the needed matrices without copying thousands of lines of code? Loops? If yes, how?
Here is a test dataset:
import<- data.frame(RC=c("DE", "IT", "USA"),
PC = c("BRA", "ARG"),
Item = c("Almonds", "Apples"),
Year = c(1996,1997,1998),
Value = c(1,5,3,2,8,3))
import_YI <- split(import, list(import$Item, import$Year))
import_YI_lap <- lapply(seq_along(import_YI), function(x) as.data.frame(import_YI[[x]])[, 1:5])
import_YI_lap
It's difficult to test without data, but you can try this:
do.call(rbind,import_YI_lap)
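If the goal is one RC-by-PC matrix per Item/Year, another option is to keep the split list and apply the cross-tabulation to every element with lapply(). The sketch below uses the test dataset from the question and xtabs() from base R in place of cast(); the name matrices is made up for illustration.

```r
# Test dataset from the question (shorter vectors are recycled to 6 rows)
import <- data.frame(RC = c("DE", "IT", "USA"),
                     PC = c("BRA", "ARG"),
                     Item = c("Almonds", "Apples"),
                     Year = c(1996, 1997, 1998),
                     Value = c(1, 5, 3, 2, 8, 3))

# One data frame per Item/Year; drop = TRUE skips combinations with no rows
import_YI <- split(import, list(import$Item, import$Year), drop = TRUE)

# One RC x PC table per list element, with Value summed in each cell
matrices <- lapply(import_YI, function(d) {
  xtabs(Value ~ RC + PC, data = d, drop.unused.levels = TRUE)
})

names(matrices)               # elements are named like "Almonds.1996"
matrices[["Almonds.1996"]]    # access a single matrix by name
```

Elements of a list are accessed with [[ ]], either by position (matrices[[1]]) or by name, so there is no need to save each piece into a separate object.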
