Extract all rows from data frame matching a certain condition - r

I have a data frame in R in which one of the columns contains state abbreviations such as 'AL', 'MD', etc.
Say I wanted to extract the data for state = 'AL'. The expression
dataframe['AL',] only seems to return one row, whereas there are multiple rows for this state.
Can someone help me understand the error in this approach?

This should work:
mydataframe[mydataframe$state == "AL",]
or, if you want more than one state:
mydataframe[mydataframe$state %in% c("AL","MD"),]
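To see why dataframe['AL',] returns only one row: a character index in the row position of [ is matched against row names, not against column values, and row names are unique. A minimal sketch with a toy data frame (the column names here are invented):

```r
# Toy data frame: "state" is a column, while the row names are "1", "2", "3"
df <- data.frame(state = c("AL", "MD", "AL"), value = 1:3)

df["AL", ]               # row-NAME lookup: no row is named "AL", so one row of NAs
df[df$state == "AL", ]   # logical test on the column: returns both matching rows
```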

In R, there are always multiple ways to do something. We'll illustrate three different techniques that can be used to subset data in a data frame based on a logical condition.
We'll use data from the 2012 U.S. Hospital Compare Database. We'll check to see whether the data has already been downloaded to disk, and if not, download and unzip it.
if(!file.exists("outcome-of-care-measures.zip")){
  dlMethod <- "curl"
  if(substr(Sys.getenv("OS"),1,7) == "Windows") dlMethod <- "wininet"
  url <- "https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2FProgAssignment3-data.zip"
  download.file(url, destfile = "outcome-of-care-measures.zip",
                method = dlMethod, mode = "wb")
  unzip(zipfile = "outcome-of-care-measures.zip")
}
## read outcome data & keep hospital name, state, and some
## mortality rates. Notice that here we use the extract operator
## to subset columns instead of rows
theData <- read.csv("outcome-of-care-measures.csv",
                    colClasses = "character")[, c(2,7,11,17,23)]
This first technique matches the one from the other answer, but we illustrate it with both $ and [[ forms of the extract operator during the subset operation.
# technique 1: extract operator
aSubset <- theData[theData$State == "AL",]
table(aSubset$State)
AL
98
aSubset <- theData[theData[["State"]] == "AL",]
table(aSubset$State)
AL
98
Next, we can subset by using a Base R function, such as subset().
# technique 2: subset() function
aSubset <- subset(theData,State == "AL")
table(aSubset$State)
AL
98
Finally, for the tidyverse fans, we'll use dplyr::filter().
# technique 3: dplyr::filter()
aSubset <- dplyr::filter(theData,State == "AL")
table(aSubset$State)
AL
98
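All three techniques can be checked against each other without downloading anything. Here is a self-contained sketch with a toy data frame (invented values; the State column name mirrors the example above, and dplyr is assumed to be installed):

```r
# Toy stand-in for the hospital data, so no download is needed
theData <- data.frame(State = c("AL", "MD", "AL", "TX"),
                      Rate  = c(14.2, 15.1, 13.8, 16.0),
                      stringsAsFactors = FALSE)

s1 <- theData[theData$State == "AL", ]        # technique 1: extract operator
s2 <- subset(theData, State == "AL")          # technique 2: subset()
s3 <- dplyr::filter(theData, State == "AL")   # technique 3: dplyr::filter()
```

All three return the same two "AL" rows; they differ only in syntax and in whether row names are preserved.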

Related

Subset DF with for-loop to run each subset through a function

I have a data set of plant demographics from 5 years across 10 sites with a total of 37 transects within the sites. Below is a link to a GoogleDoc with some of the data:
https://docs.google.com/spreadsheets/d/1VT-dDrTwG8wHBNx7eW4BtXH5wqesnIDwKTdK61xsD0U/edit?usp=sharing
In total, I have 101 unique combinations.
I need to subset each unique set of data, so that I can run each through some code. This code will give me one column of output that I need to add back to the original data frame so that I can run LMs on the entire data set. I had hoped to write a for-loop where I could subset each unique combination, run the code on each, and then append the output for each model back onto the original dataset. My attempts at writing a subset loop have all failed to produce even a simple output.
I created a column, "SiteTY", with unique Site, Transect, Year combinations. So "PWR 832015" is site PWR Transect 83 Year 2015. I tried to use that to loop through and fill an empty matrix, as proof of concept.
transect=unique(dat$SiteTY)
ntrans=length(transect)
tmpout=matrix(NA, nrow=ntrans, ncol=2)
for (i in 1:ntrans) {
  df=subset(dat, SiteTY==i)
  tmpout[i,]=(unique(df$SiteTY))
}
When I do this, I notice that df has no observations. If I replace "i" with a known value (like PWR 832015) and run each line of the for-loop individually, it populates correctly. If I use is.factor() for i or PWR 832015, both return FALSE.
This particular code also gives me the error:
Error in `[<-`(`*tmp*`, , i, value = mean(df$Year)) : subscript out of bounds
I can only assume this happens because the data frame is empty.
I've read enough SO posts to know that for-loops are tricky, but I've tried more iterations than I can remember to try to make this work in the last 3 years to no avail.
Any tips on loops or ways to avoid them while getting the output I need would be appreciated.
Per your needs ("I need to subset each unique set of data, run a function, take the output and calculate a new value"), consider two routes:
Using ave if your function expects and returns a single numeric column.
Using by if your function expects a data frame and returns anything.
ave
Returns a grouped aggregate as an inline column, with the group's value repeated for every row of the group. Below, with() is used to avoid repeated dat$ references.
# BY SITE GROUPING
dat$New_Column <- with(dat, ave(Numeric_Column, Site, FUN=myfunction))
# BY SITE AND TRANSECT GROUPINGS
dat$New_Column <- with(dat, ave(Numeric_Column, Site, Transect, FUN=myfunction))
# BY SITE AND TRANSECT AND YEAR GROUPINGS
dat$New_Column <- with(dat, ave(Numeric_Column, Site, Transect, Year, FUN=myfunction))
by
Returns a named list of whatever your function returns for each grouping. With more than one grouping variable, every possible combination is attempted, so tryCatch is used to absorb the errors your myfunction may raise on the empty subsets.
# BY SITE GROUPING
obj_list <- by(dat, dat$Site, function(sub) {
  myfunction(sub) # RUN ANY OPERATION ON sub DATA FRAME
})
# BY SITE AND TRANSECT GROUPINGS
obj_list <- by(dat, dat[c("Site", "Transect")], function(sub) {
  tryCatch(myfunction(sub),
           error = function(e) NULL)
})
# BY SITE AND TRANSECT AND YEAR GROUPINGS
obj_list <- by(dat, dat[c("Site", "Transect", "Year")], function(sub) {
  tryCatch(myfunction(sub),
           error = function(e) NULL)
})
# FILTERS OUT ALL NULLs (I.E., NO LENGTH)
obj_list <- Filter(length, obj_list)
# BUILDS SINGLE OUTPUT IF MATRIX OR DATA FRAME
final_obj <- do.call(rbind, obj_list)
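As a concrete sketch of both routes, using toy data and a stand-in myfunction (the values, column names, and function here are invented; yours will differ):

```r
# Toy data standing in for the plant demographics
dat <- data.frame(Site     = rep(c("PWR", "DML"), each = 4),
                  Transect = rep(c(83, 90, 25, 26), each = 2),
                  Year     = rep(c(2015, 2016), times = 4),
                  Count    = c(5, 7, 3, 8, 2, 9, 4, 6))
myfunction <- function(x) mean(x)   # stand-in for your real function

# Route 1, ave: one value per group, repeated down the rows
dat$GroupMean <- with(dat, ave(Count, Site, Transect, FUN = myfunction))

# Route 2, by: run on each sub-data-frame, skip empty combinations, then bind
obj_list <- by(dat, dat[c("Site", "Transect")], function(sub) {
  if (nrow(sub) == 0) return(NULL)   # by() visits every Site x Transect combo
  data.frame(Site = sub$Site[1], Transect = sub$Transect[1],
             GroupMean = myfunction(sub$Count))
})
final_obj <- do.call(rbind, Filter(length, obj_list))
```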
Here's another approach using the dplyr library, in which I'm creating a data.frame of summary statistics for each group and then just joining it back on:
library(dplyr)
# Group by species (site, transect, etc) and summarise
species_summary <- iris %>%
  group_by(Species) %>%
  summarise(mean.Sepal.Length = mean(Sepal.Length),
            mean.Sepal.Width = mean(Sepal.Width))
# A data.frame with one row per species, one column per statistic
species_summary
# Join the summary stats back onto the original data
iris_plus <- iris %>% left_join(species_summary, by = "Species")
head(iris_plus)
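If the only goal is the new column, the join can be skipped entirely: mutate() inside group_by() writes the per-group statistic straight onto each row (same iris example; this is a variant of the approach above, not the author's original code):

```r
library(dplyr)

# mutate() inside group_by() attaches the group statistic to every row,
# so no separate summary table or join is needed
iris_plus <- iris %>%
  group_by(Species) %>%
  mutate(mean.Sepal.Length = mean(Sepal.Length)) %>%
  ungroup()
```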

How to test for an absent row/value in a dataframe to help transpose part of it?

I have a dataframe containing data on repeatedly sampled individuals and days alive. Some individuals were not sampled on every day alive. I want to move the data from being row oriented (each individual&day alive being a row) to column oriented (one row for an individual, each column holding data for each day alive).
However, the code I am running exits with an error when an individual does not have a row for a certain day alive in the first DF, because there is a column for that day alive in the second DF. I haven't found a good way to test for the absence of a row/value in the first DF: subsetting for a missing row yields a numeric of length 0 (i.e., numeric(0)), and performing logical tests against such a variable doesn't give a logical answer (TRUE or FALSE), it just yields logical(0).
What follows is a simplified example of what I'm trying to do. I know there may be other ways to deal with some of the larger data movements I'm doing, but would like to do it this way if possible. The code below will get stuck when individual=B and dayAlive=2, because there is no dayAlive=2 for that individual. I'd like to be able to test for the absence of a row like this and then insert an NA or something else into the second data frame cell where that data would go.
# Initialize data in row format in first data frame:
v1<-c("A",1,1.3)
v2<-c("A",2,1.8)
v3<-c("A",3,2.4)
v4<-c("B",1,0.8)
v5<-c("B",3,1.7)
first_DF<-data.frame(matrix(c(v1,v2,v3,v4,v5),ncol=3, nrow=5,byrow=TRUE,dimnames=list(NULL,c("Individual","DayAlive","Length"))), stringsAsFactors=FALSE)
# Convert to column format in second data frame:
individual_IDs<-unique(first_DF$"Individual")
days_alive<-unique(first_DF$"DayAlive")
# Initialize second DF by subsetting a single row for each individual from the first DF
second_DF<-data.frame(first_DF[which(first_DF$"Individual" %in% individual_IDs & first_DF$"DayAlive" %in% 1),1], stringsAsFactors=FALSE)
names(second_DF)<-"Individual"
initial_DF_width<-dim(second_DF)[2]
# Move 'Length' data into the columns as each 'day alive' column is created:
for(i in 1:length(days_alive)){
  current_day <- days_alive[i]
  second_DF <- cbind(second_DF, matrix(ncol=1, nrow=nrow(second_DF),
                     dimnames=list(NULL, paste("Day ",current_day," Length"))))
  for(j in 1:length(individual_IDs)){
    current_individualID <- individual_IDs[j]
    length <- first_DF[which(first_DF$"Individual" %in% current_individualID & first_DF$"DayAlive" %in% current_day),"Length"]
    second_DF[j,i+initial_DF_width] <- length
  }
}
This is the error it throws:
Error in `[<-.data.frame`(`*tmp*`, j, i + initial_DF_width, value = character(0)) :
  replacement has length zero
(In my real code I had converted that data to numeric but didn't bother here).
You should look into the reshape2 package. Try this:
library('reshape2')
dcast(first_DF, Individual ~ DayAlive)
# Individual 1 2 3
# 1 A 1.3 1.8 2.4
# 2 B 0.8 <NA> 1.7
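One detail worth noting: without an explicit value.var, dcast() guesses which column holds the values and prints a message saying so. A self-contained sketch of the same cast, reconstructing the question's data with Length already numeric and naming the value column explicitly (assumes reshape2 is installed):

```r
library(reshape2)

# Reconstruction of first_DF from the question, with Length as numeric
first_DF <- data.frame(Individual = c("A", "A", "A", "B", "B"),
                       DayAlive   = c(1, 2, 3, 1, 3),
                       Length     = c(1.3, 1.8, 2.4, 0.8, 1.7),
                       stringsAsFactors = FALSE)

# Naming value.var explicitly avoids dcast()'s guessing message
second_DF <- dcast(first_DF, Individual ~ DayAlive, value.var = "Length")
```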
Since you said you wanted to do it your way if possible, I've also edited your nested loop to work. However I would not advise doing it this way. Most people will tell you that nested loops in R are usually not the best idea, and that's definitely true in this case.
for(i in 1:length(days_alive)){
  current_day <- days_alive[i]
  second_DF <- cbind(second_DF, matrix(ncol=1, nrow=nrow(second_DF),
                     dimnames=list(NULL, paste("Day ",current_day," Length"))))
  for(j in 1:length(individual_IDs)){
    current_individualID <- individual_IDs[j]
    # I changed "length" to "length2" to avoid confusion with the
    # function length(). You also don't need which() here.
    length2 <- first_DF[first_DF$Individual %in% current_individualID
                        & first_DF$DayAlive %in% current_day, "Length"]
    if (length(length2) > 0) {
      second_DF[j, i + initial_DF_width] <- length2
    } else {
      second_DF[j, i + initial_DF_width] <- NA
    }
  }
}
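For completeness, base R can produce the same wide layout with reshape(), with no package and no nested loop; a sketch reconstructing the question's data:

```r
# Reconstruction of the question's data, with Length numeric
first_DF <- data.frame(Individual = c("A", "A", "A", "B", "B"),
                       DayAlive   = c(1, 2, 3, 1, 3),
                       Length     = c(1.3, 1.8, 2.4, 0.8, 1.7),
                       stringsAsFactors = FALSE)

# One row per Individual, one Length.<day> column per DayAlive; gaps become NA
wide <- reshape(first_DF, idvar = "Individual", timevar = "DayAlive",
                direction = "wide")
```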

R - Deleting rows from one data set which are present in another data set

In R I have two data sets. One has all the data; let's call this data set LARGE, where we have one column labelled idnumber. The other data set, a reduced version of LARGE produced by criteria I have applied, also has the column labelled idnumber.
From the data set LARGE I would like to exclude all records whose idnumber appears in the reduced version.
This is what I have thought of: unmatched <- LARGE[which(LARGE$idnumber not in reduced$idnumber)], but I don't know how to code 'not in' in R.
You are describing an anti-join:
library(dplyr)
LARGE <- data.frame(idnumber = 1:100, Y = rnorm(100))
reduced <- LARGE[sample(nrow(LARGE), 42),]
unmatched <- anti_join(LARGE, reduced)
And to use a "not in" binary function in general, you can apply the following function:
`%notin%` <- function(x,y) !(x %in% y)
3 %notin% c(3,5)
# [1] FALSE
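An equivalent, slightly more idiomatic definition uses Negate(), which returns the logical complement of an existing function:

```r
# Negate() builds the complement of %in% directly
`%notin%` <- Negate(`%in%`)

3 %notin% c(3, 5)   # FALSE
4 %notin% c(3, 5)   # TRUE
```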
Following Cole's answer, google "R not in operator".
Easiest way:
LARGE[!(LARGE$idnumber %in% reduced$idnumber),]
%in% finds all the cases where the id numbers match; the ! at the start negates that, i.e., keeps the rows where they don't.

R: Split-Apply-Combine... Apply Functions via Aggregate to Row-Bound Data Frames Subset by Class

Update: My NOAA GHCN-Daily weather station data functions have since been cleaned and merged into the rnoaa package, available on CRAN or here: https://github.com/ropensci/rnoaa
I'm designing an R function to calculate statistics across a data set comprised of multiple data frames. In short, I want to pull data frames by class based on a reference data frame containing the names. I then want to apply statistical functions to values for the metrics listed for each given day. In effect, I want to call and then overlay a list of data frames to calculate functions on a vector of values for every unique date and metric where values are not NA.
The data frames are iteratively read into the workspace from file based on a class variable, using the by function. After importing the files for a given class, I want to rbind() the data frames for that class and each user-defined metric within a range of years. I then want to apply a concatenation of user-provided statistical functions to each metric within a class that corresponds to a given value for the year, month, and day (e.g., the mean [function] low temperature [class] on July 1st, 1990 [date] reported across all locations [data frames] within a given region [class]).
I want the end result to be new data frames containing values for every date within a region and year range, for each metric and statistical function applied. I am very close to having this result using the aggregate() function, but I am having trouble getting reasonable results out of it: it is currently outputting NA and NaN for most functions other than the mean temperature. Any advice would be much appreciated! Here is my code thus far:
# Example parameters
w <- c("mean","sd","scale") # Statistical functions to apply
x <- "C:/Data/" # Folder location of CSV files
y <- c("MaxTemp","AvgTemp","MinTemp") # Metrics to subset the data
z <- c(1970:2000) # Year range to subset the data
CSVstnClass <- data.frame(CSVstations,CSVclasses)
by(CSVstnClass, CSVstnClass[,2], function(a){ # Station list by class
  suppressWarnings(assign(paste(a[,2]),paste(a[,1]),envir=.GlobalEnv))
  apply(a, 1, function(b){ # Data frame list, row-wise
    classData <- data.frame()
    sapply(y, function(d){ # Element list
      CSV_DF <- read.csv(paste(x,b[2],"/",b[1],".csv",sep="")) # Read in CSV files as data frames
      CSV_DF1 <- CSV_DF[!is.na("Value")]
      CSV_DF2 <- CSV_DF1[which(CSV_DF1$Year %in% z & CSV_DF1$Element == d),]
      assign(paste(b[2],"_",d,sep=""),CSV_DF2,envir=.GlobalEnv)
      if(nrow(CSV_DF2) > 0){ # Remove empty data frames
        classData <<- rbind(classData,CSV_DF2) # Bind all data frames by row for a class and element
        assign(paste(b[2],"_",d,"_bound",sep=""),classData,envir=.GlobalEnv)
        sapply(w, function(g){ # Function list
          # Aggregate results of bound data frame for each unique date
          dataFunc <- aggregate(Value~Year+Month+Day+Element,data=classData,FUN=g,na.action=na.pass)
          assign(paste(b[2],"_",d,"_",g,sep=""),dataFunc,envir=.GlobalEnv)
        })
      }
    })
  })
})
I think I am pretty close, but I am not sure if rbind() is performing properly, nor why the aggregate() function is outputting NA and NaN for so many metrics. I was concerned that the data frames were not being bound together or that missing values were not being handled well by some of the statistical functions. Thank you in advance for any advice you can offer.
Cheers,
Adam
You've tackled this problem in a way that makes it very hard to debug. I'd recommend switching things around so you can more easily check each step. (Using informative variable names also helps!) The code is unlikely to work as is, but it should be much easier to work iteratively, checking that each step has succeeded before continuing to the next.
paths <- dir("C:/Data/", pattern = "\\.csv$", full.names = TRUE)
# Read in CSV files as data frames
raw <- lapply(paths, read.csv, stringsAsFactors = FALSE)
# Extract needed rows
filter_metrics <- c("MaxTemp", "AvgTemp", "MinTemp")
filter_years <- 1970:2000
filtered <- lapply(raw, subset,
                   !is.na(Value) & Year %in% filter_years & Element %in% filter_metrics)
# Drop any empty data frames
rows <- vapply(filtered, nrow, integer(1))
filtered <- filtered[rows > 0]
# Compute aggregates
my_aggregate <- function(df, fun) {
  aggregate(Value ~ Year + Month + Day + Element, data = df, FUN = fun,
            na.action = na.pass)
}
means <- lapply(filtered, my_aggregate, mean)
sds <- lapply(filtered, my_aggregate, sd)
scales <- lapply(filtered, my_aggregate, scale)
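One small, optional addition to this pipeline: naming the path vector up front makes every downstream lapply() result traceable to its source file. A toy sketch using a temporary directory in place of C:/Data/ (the file names and contents here are invented):

```r
# Toy sketch: two invented CSVs in a temporary directory stand in for C:/Data/
tmp <- file.path(tempdir(), "station-demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(Value = 1:3), file.path(tmp, "stationA.csv"), row.names = FALSE)
write.csv(data.frame(Value = 4:6), file.path(tmp, "stationB.csv"), row.names = FALSE)

paths <- dir(tmp, pattern = "\\.csv$", full.names = TRUE)
names(paths) <- basename(paths)   # lapply() results inherit these names
raw <- lapply(paths, read.csv)
```

After this, names(raw), and the names of every list built from it, identify the originating station file.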

Removing rows in data frame using get function

Suppose I have following data frame:
mydataframe <- data.frame(ID=c(1,2,NA,4,5,NA),score=11:16)
I want to get following data frame at the end:
mydataframe[-which(is.na(mydataframe$ID)),]
I need to do this kind of cleaning (and other similar manipulations) with many other data frames. So, I decided to assign a name to mydataframe, and variable of interest.
dbname <- "mydataframe"
varname <- "ID"
attach(get(dbname))
I get an error in the following line, understandably.
get(dbname) <- get(dbname)[-which(is.na(get(varname))),]
detach(get(dbname))
How can I solve this? (I don't want to assign to a new data frame, even though that seems to be the only solution right now. I will use "dbname" many times afterwards.)
Thanks in advance.
There is no get<- function, and there is no get(colname) function (since colnames are not first class objects), but there is an assign() function:
assign(dbname, get(dbname)[!is.na( get(dbname)[varname] ), ] )
You also do not want to use -which(.). It happened to work here because some rows matched the condition. It will bite you, however, whenever no rows match: which() then returns integer(0), and vec[-integer(0)] is the same as vec[integer(0)], so you get back nothing when you should get back everything. Only use which() for "positive" choices.
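The pitfall is easy to demonstrate on a toy vector with no NAs:

```r
vec <- c(1, 2, 3)                # toy vector containing no NAs

vec[!is.na(vec)]                 # logical form: keeps all three values
vec[-which(is.na(vec))]          # which() gives integer(0), so this is EMPTY
length(vec[-which(is.na(vec))])  # 0
```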
As #Dason suggests, lists are made for this sort of work.
E.g.:
# make a list with all your data.frames in it
# (just repeating the one data.frame 3x for this example)
alldfs <- list(mydataframe,mydataframe,mydataframe)
# apply your function to all the data.frames in the list
# have replaced original function in line with #DWin and #flodel's comments
# pointing out issues with using -which(...)
lapply(alldfs, function(x) x[!is.na(x$ID),])
The suggestion to use a list of data frames is good, but I think people are assuming that you're in a situation where all the data frames are loaded simultaneously. This might not necessarily be the case, eg if you're working on a number of projects and just want some boilerplate code to use in all of them.
Something like this should fit the bill.
stripNAs <- function(df, var) df[!is.na(df[[var]]), ]
mydataframe <- stripNAs(mydataframe, "ID")
cars <- stripNAs(cars, "speed")
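If rows must be complete across several columns at once, complete.cases() generalizes the same idea (stripNAsMulti is a hypothetical extension of the function above, not part of the original answer):

```r
stripNAs <- function(df, var) df[!is.na(df[[var]]), ]

# Hypothetical generalization: drop rows with an NA in ANY of the named columns
stripNAsMulti <- function(df, vars) df[complete.cases(df[vars]), ]

mydataframe <- data.frame(ID = c(1, 2, NA, 4, 5, NA), score = 11:16)
nrow(stripNAs(mydataframe, "ID"))                    # 4
nrow(stripNAsMulti(mydataframe, c("ID", "score")))   # 4
```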
I can totally understand your need for this, since I also frequently need to cycle through a set of data frames. I believe the following code should help you out:
mydataframe <- data.frame(ID=c(1,2,NA,4,5,NA),score=11:16)
#define target dataframe and varname
dbname <- "mydataframe"
varname <- "ID"
tmp.df <- get(dbname) #get df and give it a temporary name
col.focus <- which(colnames(tmp.df) == varname) #define the column of focus
tmp.df <- tmp.df[which(!is.na(tmp.df[,col.focus])),] #keep only the rows where the column of focus is not NA
#Result
ID score
1 1 11
2 2 12
4 4 14
5 5 15
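One step is still missing if the goal is to overwrite the data frame that dbname refers to: write tmp.df back with assign(), as in the accepted answer. A sketch of the full round trip (using the logical form rather than -which()):

```r
mydataframe <- data.frame(ID = c(1, 2, NA, 4, 5, NA), score = 11:16)
dbname  <- "mydataframe"
varname <- "ID"

tmp.df <- get(dbname)                           # fetch the data frame by name
tmp.df <- tmp.df[!is.na(tmp.df[[varname]]), ]   # drop rows with NA in the column of focus
assign(dbname, tmp.df)                          # write back under the original name
nrow(mydataframe)   # 4
```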
