How to build a subset query using a loop in R?

I'm trying to subset a big table across a number of columns, keeping only the rows where State_2009, State_2010, State_2011, etc. do not equal the value "Unknown".
My instinct was to do something like this (coming from a JS background), where I either build the query in a loop or continually subset the data in a loop, referencing the year as a variable.
mysubset <- data
for(i in 2009:2016){
mysubset <- subset(mysubset, paste("State_",i," != Unknown",sep=""))
}
But this doesn't work, at least because paste returns a string, giving me the error 'subset' must be logical.
Is there a better way to do this?

Using dplyr with the filter_ function should get you the correct output (note that filter_ is deprecated in recent dplyr releases, so this applies to older versions):
library(dplyr)
mysubset <- data
for(i in 2009:2016)
{
mysubset <- mysubset %>%
filter_(paste("State_",i," != \"Unknown\"", sep = ""))
}

To add to Matt's answer, you could also do it like this:
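cols <- paste0("State_", 2009:2016)
# Row indices of every cell equal to "Unknown" in the State_ columns
inds <- which(mysubset[, cols] == "Unknown", arr.ind = TRUE)[, 1]
# Negative indexing with an empty vector would drop every row, so guard the no-match case
if (length(inds) > 0) mysubset <- mysubset[-unique(inds), ]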
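For readers who prefer the loop shape from the question, here is a minimal base R sketch that accumulates a logical mask instead of building a query string. The toy data frame and its values are assumptions made up for illustration, and only two years are used to keep it short.

```r
# Toy data standing in for the asker's table (hypothetical values)
data <- data.frame(
  id         = 1:4,
  State_2009 = c("NY", "Unknown", "CA", "TX"),
  State_2010 = c("NY", "NJ", "Unknown", "TX"),
  stringsAsFactors = FALSE
)

keep <- rep(TRUE, nrow(data))        # start by keeping every row
for (i in 2009:2010) {
  col  <- paste0("State_", i)        # column name built as a string
  vals <- data[[col]]                # [[ ]] accepts a string, unlike subset()
  # Rows with NA are dropped too, which mirrors what subset() does
  keep <- keep & !is.na(vals) & vals != "Unknown"
}
mysubset <- data[keep, ]
```

The key difference from the original attempt is that the loop combines logical vectors, which is what row indexing expects, rather than pasting together text.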

Related

R function used to rename columns of a data frames

I have a data frame, say acs10. I need to relabel the columns. To do so, I created another data frame, named labelName, with two columns: the first column contains the old column names, and the second column contains the names I want to use, like the table below:
column_1     column_2
oldLabel1    newLabel1
oldLabel2    newLabel2
Then, I wrote a for loop to change the column names:
for (i in seq_len(nrow(labelName))){
names(acs10)[names(acs10) == labelName[i,1]] <- labelName[i,2]}
This works.
However, when I tried to put the for loop into a function, because I need to rename column names for other data frames as well, the function failed. The function I wrote looks like below:
renameDF <- function(dataF,varName){
for (i in seq_len(nrow(varName))){
names(dataF)[names(dataF) == varName[i,1]] <- varName[i,2]
print(varName[i,1])
print(varName[i,2])
print(names(dataF))
}
}
renameDF(acs10, labelName)
where dataF is the data frame whose names I need to change, and varName is another data frame where old variable names and new variable names are paired. I used print(names(dataF)) to debug, and the printout suggests that the function works. However, calling the function does not actually change the column names. I suspect it has something to do with scope, but I want to know how to make it work.
In your function you need to return the changed dataframe.
renameDF <- function(dataF,varName){
for (i in seq_len(nrow(varName))){
names(dataF)[names(dataF) == varName[i,1]] <- varName[i,2]
}
return(dataF)
}
You can also simplify this and avoid the for loop by using match:
renameDF <- function(dataF,varName){
names(dataF) <- varName[[2]][match(names(dataF), varName[[1]])]
return(dataF)
}
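One thing to keep in mind with match: it returns NA for any column name that has no entry in the lookup table, so those columns would end up with NA as their name. The sketch below is one possible way to leave unmatched names alone; renameKeep and the toy data frames are hypothetical names used only for illustration.

```r
renameKeep <- function(dataF, varName) {
  idx <- match(names(dataF), varName[[1]])   # NA where a column has no lookup entry
  hit <- !is.na(idx)                         # columns that do have a new name
  names(dataF)[hit] <- as.character(varName[[2]][idx[hit]])
  dataF                                      # return the modified copy
}

# Hypothetical lookup table and data frame
labelName <- data.frame(column_1 = c("oldLabel1", "oldLabel2"),
                        column_2 = c("newLabel1", "newLabel2"),
                        stringsAsFactors = FALSE)
acs10 <- data.frame(oldLabel1 = 1, other = 2, oldLabel2 = 3)

# R passes arguments by value, so the result must be assigned back
acs10 <- renameKeep(acs10, labelName)
```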
This should do the whole thing in one line.
colnames(acs10)[colnames(acs10) %in% labelName$column_1] <- labelName$column_2[match(colnames(acs10)[colnames(acs10) %in% labelName$column_1], labelName$column_1)]
This will also work when a column name isn't in the data dictionary, but it's a bit more convoluted:
library(tibble)
df <- tribble(~column_1,~column_2,
"oldLabel1", "newLabel1",
"oldLabel2", "newLabel2")
d <- tibble(oldLabel1 = NA, oldLabel2 = NA, oldLabel3 = NA)
fun <- function(dat, dict) {
names(dat) <- sapply(names(dat), function(x) ifelse(x %in% dict$column_1, dict[dict$column_1 == x,]$column_2, x))
dat
}
fun(d, df)
You can create a function containing just one line of code.
renameDF <- function(df, varName){
setNames(df,varName[[2]][pmatch(names(df),varName[[1]])])
}

Turn index-based loop into name-based function

I have at my disposal a clean dataframe (1500 rows x 297 columns, named 'Data' - very inspiring) with both numeric and factor columns. However, as is often the case, my factors were encoded as numbers (each number representing a level), hence a dataframe full of numeric vectors.
To overcome this I also have a second dataframe (VarLabels), containing information about the columns of the first dataframe (which has... 297 rows, as you would imagine). In there, one specific column (VarLabels$TypeVar) helps me define what the data class should be in the main dataframe.
I wrote the following piece of code, which might not be optimal but proved to work so far:
(NB: as you can see, for data labelled 'MIX' I wish to create a copy to have one numeric and one factor)
nbcol <- ncol(Data)
indexcol <- which(colnames(VarLabels) == "TypeVar")
for(i in 1:nbcol){
if (colnames(Data)[[i]] %in% VarLabels$VarName){
if (VarLabels[i,indexcol] == "Quant"){
Data[[i]] <- as.numeric(Data[[i]])
} else if (VarLabels[i,indexcol] == "Qual") {
Data[[i]] <- as.character(Data[[i]])
Data[[i]] <- as.factor(Data[[i]])
} else if (VarLabels[i,indexcol] == "Mix") {
Data <- cbind(Data, Data[[i]])
Data[[i]] <- as.character(Data[[i]])
Data[[i]] <- as.factor(Data[[i]])
Data[[ncol(Data)]] <- as.numeric(Data[[ncol(Data)]])
colnames(Data)[[ncol(Data)]] <- paste(colnames(Data)[[i]], "Num", sep = "_")
} else {
Data[[i]] <- as.numeric(Data[[i]])
}
} else {
}
}
Do you have a neater solution, possibly using a function to reduce the number of code lines / using names instead of column index? (which may be risky if order changes in one of the two dataframes) I recently got into R and am still struggling with user-defined functions.
I read other related topics like:
Change all columns from factor to numeric in R
Function to change class of columns in R to match the class of an other dataset
Convert type of multiple columns of a dataframe at once
How do I get the classes of all columns in a data frame?
but could not apply the answers to my own problem. Any idea how to make things simple? (if possible!)
The following function does what the question asks for.
It matches input data set X column names with the new column types with a sequence of which/match statements, without needing loops. The coercion is performed with lapply loops.
The test data set is the built-in data set mtcars.
coerceCols <- function(X, VarLabels){
i <- which(VarLabels$TypeVar == "Qual")
j <- match(VarLabels$VarName[i], names(X))
X[j] <- lapply(X[j], factor)
i <- which(VarLabels$TypeVar == "Mix")
j <- match(VarLabels$VarName[i], names(X))
tmp <- X[j]
names(tmp) <- paste(names(tmp), "Num", sep = "_")
X[j] <- lapply(X[j], factor)
cbind(X, tmp)
}
Data <- mtcars
VarLabels <- data.frame(VarName = names(mtcars),
TypeVar = c("Quant", "Mix", "Quant",
"Quant", "Quant", "Quant",
"Quant", "Qual", "Qual",
"Mix", "Mix"),
stringsAsFactors = FALSE)
coerceCols(Data, VarLabels)
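As a complement, here is a minimal sketch of a purely name-driven variant that also coerces "Quant" columns to numeric, as the loop in the question does. coerceByName is a hypothetical name, and the three-row VarLabels is a cut-down example; columns of X that have no label are left untouched.

```r
coerceByName <- function(X, VarLabels) {
  for (k in seq_len(nrow(VarLabels))) {
    nm   <- VarLabels$VarName[k]
    type <- VarLabels$TypeVar[k]
    if (!(nm %in% names(X))) next            # skip labels with no matching column
    if (type == "Mix") {
      # keep a numeric copy under "<name>_Num" before converting to factor
      X[[paste(nm, "Num", sep = "_")]] <- as.numeric(X[[nm]])
    }
    if (type %in% c("Qual", "Mix")) {
      X[[nm]] <- factor(X[[nm]])
    } else {
      X[[nm]] <- as.numeric(X[[nm]])         # "Quant" and any other label
    }
  }
  X
}

# Cut-down label table for illustration, using the built-in mtcars data
VarLabels <- data.frame(VarName = c("mpg", "cyl", "am"),
                        TypeVar = c("Quant", "Mix", "Qual"),
                        stringsAsFactors = FALSE)
out <- coerceByName(mtcars, VarLabels)
```

Because every lookup goes through the column name, the result does not depend on the two data frames being in the same order.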

How to filter for 'any value' in R?

Strange question, but how do I filter such that all rows are returned for a dataframe? For example, say you have the following dataframe:
Pts <- floor(runif(20, 0, 4))
Name <- c(rep("Adam",5), rep("Ben",5), rep("Charlie",5), rep("Daisy",5))
df <- data.frame(Pts, Name)
And say you want to set up a predetermined filter for this dataframe, for example:
Ptsfilter <- c("2", "1")
Which you will then run through the dataframe, to get your new filtered dataframe
dffil <- df[df$Pts %in% Ptsfilter, ]
At times, however, you don't want the dataframe to be filtered at all, and in the interests of automation and minimising workload, you don't want to have to go back and remove/comment-out every instance of this filter. You just want to be able to adjust the Ptsfilter value such that no rows will be filtered out of the dataframe, when that line of code is run.
I have experimented/guessed with things like:
Ptsfilter <- c("")
Ptsfilter <- c(" ")
Ptsfilter <- c()
to no avail.
Is there a value I can enter for Ptsfilter that will achieve this goal?
You might need to define a function to do this for you.
filterDF = function(df,filter){
if(length(filter)>0){
return(df[df$Pts %in% filter, ])
}
else{
return(df)
}
}
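A short usage sketch of this idea, under the assumption that an empty filter (c(), which is NULL) should mean "keep everything". The small data frame is made up for illustration, and the function body is a condensed restatement of the one above.

```r
# Condensed version of the helper: an empty filter returns the data unchanged
filterDF <- function(df, filter) {
  if (length(filter) > 0) df[df$Pts %in% filter, ] else df
}

# Hypothetical data with fixed values instead of runif()
df <- data.frame(Pts  = c(0, 1, 2, 3, 1),
                 Name = c("Adam", "Adam", "Ben", "Ben", "Charlie"))

some_rows <- filterDF(df, c("2", "1"))   # keeps rows whose Pts is 1 or 2
all_rows  <- filterDF(df, c())           # empty filter: nothing is removed
```

With this in place, switching the filter off only requires setting Ptsfilter to c() in one spot.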

Filtering a df by date in a loop within a function

I've been having some trouble with what I would think is a pretty simple issue. I have a function which takes as an input a list of dataframes and a date range, filters the dataframes by date, and then does some other stuff. Simply put, it looks like this:
my_function <- function(df_list, date_range = c(min, max)) {
for(i in 1:length(df_list)) {
df_list[[i]] <- df_list[[i]][df_list[[i]]$date >= as.Date(date_range[1])]
df_list[[i]] <- df_list[[i]][df_list[[i]]$date <= as.Date(date_range[2])]
}
etc
}
With the above, I get an error undefined columns selected. I've also tried with filters and lapply, as in:
lapply(df_list, function(df) {
df <- filter(df, week >= as.Date(date_range[1]))
df <- filter(df, week <= as.Date(date_range[2]))
})
Which doesn't give an error but still doesn't work.
I feel like this isn't as hard as I'm making it. Any suggestions?
I would use lapply for this. You could achieve what you want using:
date_range <- as.Date(date_range) # no need to do this on every iteration of the lapply function
df_list <- lapply(df_list,
function(x) x[x$week >= date_range[1] & x$week <= date_range[2], ])
As @Frank has pointed out, the main issue with your first piece of code is that you're missing a trailing comma when subsetting, so R tries to select columns instead of rows. The second should work, but you need to assign the result back to df_list.
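To make the lapply approach concrete, here is a small self-contained sketch. The list contents, the dates, and the helper name filter_by_date are assumptions for illustration; the date column is called week, as in the lapply code in the question.

```r
# Hypothetical list of data frames, each with a Date column named week
df_list <- list(
  a = data.frame(week = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")), x = 1:3),
  b = data.frame(week = as.Date(c("2020-01-15", "2020-04-01")), x = 4:5)
)

filter_by_date <- function(df_list, date_range) {
  date_range <- as.Date(date_range)          # convert once, outside lapply
  lapply(df_list, function(x) {
    # note the trailing comma: the condition selects rows, all columns are kept
    x[x$week >= date_range[1] & x$week <= date_range[2], , drop = FALSE]
  })
}

# lapply returns a new list, so the result has to be assigned back
df_list <- filter_by_date(df_list, c("2020-01-10", "2020-02-15"))
```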

Delete rows with blank values in one particular column

I am working on a large dataset, with some rows with NAs and others with blanks:
df <- data.frame(ID = c(1:7),
home_pc = c("","CB4 2DT", "NE5 7TH", "BY5 8IB", "DH4 6PB","MP9 7GH","KN4 5GH"),
start_pc = c(NA,"Home", "FC5 7YH","Home", "CB3 5TH", "BV6 5PB",NA),
end_pc = c(NA,"CB5 4FG","Home","","Home","",NA))
How do I remove the NAs and blanks in one go (in the start_pc and end_pc columns)? I have in the past used:
df<- df[-which(is.na(df$start_pc)), ]
... to remove the NAs - is there a similar command to remove the blanks?
df[!(is.na(df$start_pc) | df$start_pc==""), ]
It is the same construct - simply test for empty strings rather than NA:
Try this:
df <- df[-which(df$start_pc == ""), ]
In fact, looking at your code, you don't need the which, but use the negation instead, so you can simplify it to:
df <- df[!(df$start_pc == ""), ]
df <- df[!is.na(df$start_pc), ]
And, of course, you can combine these two statements as follows:
df <- df[!(df$start_pc == "" | is.na(df$start_pc)), ]
And simplify it even further with with:
df <- with(df, df[!(start_pc == "" | is.na(start_pc)), ])
You can also test for non-zero string length using nzchar.
df <- with(df, df[nzchar(start_pc) & !is.na(start_pc), ])
Disclaimer: I didn't test any of this code. Please let me know if there are syntax errors anywhere
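An elegant solution with dplyr would be the following (newer dplyr releases may no longer accept a whole data frame in na_if(), in which case it has to be applied column by column):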
df %>%
# recode empty strings "" by NAs
na_if("") %>%
# remove NAs
na.omit
An alternative is to remove the rows with blanks in one variable:
df <- subset(df, VAR != "")
An easy approach would be making all the blank cells NA and only keeping complete cases. You might also look for na.omit examples. It is a widely discussed topic.
df[df==""]<-NA
df<-df[complete.cases(df),]
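Since the question asks about start_pc and end_pc only, here is a minimal base R sketch that restricts the check to those two columns, so a blank home_pc does not cause a row to be dropped. The names cols, bad and df_clean are just illustrative.

```r
df <- data.frame(ID = 1:7,
                 home_pc  = c("", "CB4 2DT", "NE5 7TH", "BY5 8IB", "DH4 6PB", "MP9 7GH", "KN4 5GH"),
                 start_pc = c(NA, "Home", "FC5 7YH", "Home", "CB3 5TH", "BV6 5PB", NA),
                 end_pc   = c(NA, "CB5 4FG", "Home", "", "Home", "", NA),
                 stringsAsFactors = FALSE)

cols <- c("start_pc", "end_pc")

# Logical matrix: TRUE where a cell is NA or an empty string
bad <- is.na(df[cols]) | df[cols] == ""

# Keep only the rows with no bad cell in the selected columns
df_clean <- df[rowSums(bad) == 0, ]
```

Testing is.na first matters here: for an NA cell the comparison with "" is itself NA, but TRUE | NA evaluates to TRUE, so the mask never contains NA.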
