Subset data.frame based on a list

Sample Data : https://www.dropbox.com/s/f3l2uub1cttwmf2/test.csv?dl=0
I need to subset this data.frame to only those county codes (fips) that appear in another dataset. I have a list of all the fips codes from the other dataset and am trying to remove every row whose fips is not in that list, but am not having much luck.
This small sample dataset contains three fips codes (8009, 8011, 8013); how would I keep only 8009 and 8011, given that the codes to keep come from a list?
Here's what I've tried:
prism.dd <- prism.d[(prism.d$fips %in% fips) ,]
Where fips is a list of 779 fips to keep:
fips <- unique(DustBowlData_Pre$fips)
But it just returns the same number of rows. A solution with data.table would be preferred, but whatever works best is also fine.
Thanks!
Edit: update in response to akrun's request:
Output of dput(head(fips)):
c(8009L, 8011L, 8013L, 8017L, 8035L, 8039L)
Update: str(prism.d)
Classes ‘data.table’ and 'data.frame': 52802 obs. of 3 variables:
$ fips: int 30061 30063 30077 30049 30013 30059 30045 30027 30069 30033 ...
$ Year: int 1910 1910 1910 1910 1910 1910 1910 1910 1910 1910 ...
$ ppt : num 87 64.2 52.4 46.6 34.9 ...
- attr(*, ".internal.selfref")=<externalptr>
Solution:
setkey(setDT(prism.d), fips)
fips <- unique(DustBowlData_Pre$fips)
fips <- data.table(fips)
Subpr <- prism.d[fips]
Thanks @akrun! This worked perfectly. I really need to learn data.table.

You could try using data.table
library(data.table)
setkey(setDT(prism.d), fips)
fips <- c(8009, 8011)
fips1 <- data.table(fips)
Subpr <- prism.d[fips1]
Update
I think the previous code didn't work because I assumed the dataset was a data.frame when it is in fact a data.table. Try:
fips2 <- fips # rename, because `prism.d` already has a column named `fips`
prism.d[fips %in% fips2]
data
prism.d <- read.csv('test-1.csv')
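For reference, a minimal self-contained version of the two approaches (toy values copied from the sample above):
library(data.table)
prism.d <- data.table(fips = c(8009L, 8011L, 8013L),
                      Year = 1910L,
                      ppt  = c(87, 64.2, 52.4))
keep <- c(8009L, 8011L)
prism.d[fips %in% keep]   # logical subset: keeps only 8009 and 8011
setkey(prism.d, fips)
prism.d[J(keep)]          # keyed join, equivalent to the fips1 join above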

Related

Pivottabler tallying number of cases in dataset, not their values

I am working through creating pivot tables with the pivottabler package to summarise frequencies of rock art classes by location. The data I am summarising are from published papers; they are stored in an RDS file created in R and look like this:
> head(cyp_art_freq)
Class Location value
1: Figurative Princess Charlotte Bay 347
2: Track Princess Charlotte Bay 35
3: Non-Figurative Princess Charlotte Bay 18
4: Figurative Mitchell-Palmer and Chillagoe 320
5: Track Mitchell-Palmer and Chillagoe 79
6: Non-Figurative Mitchell-Palmer and Chillagoe 1002
> str(cyp_art_freq)
Classes ‘data.table’ and 'data.frame': 12 obs. of 3 variables:
Class : chr "Figurative" "Track" "Non-Figurative" "Figurative" ...
Location: chr "Princess Charlotte Bay" "Princess Charlotte Bay" "Princess Charlotte Bay" "Mitchell-Palmer and Chillagoe" ...
value : num 347 35 18 320 79 ...
attr(*, ".internal.selfref")=<externalptr>
The problem is that pivottabler does not sum the contents of the 'value' column. Instead, it counts the number of rows/cases: the resulting table (screenshot below) reports a total of 12 cases when the total should be in the thousands. I think this relates to the 'value' column, which is already a count aggregated from a larger dataset. I've tried pivot_longer and pivot_wider, changed data types, and used CSVs instead of RDS for import (and more).
The code block I'm using works with the built-in bhmtrains dataset and with my other datasets, but I suspect I can either tell pivottabler to tally the contents of the 'value' column or just expand the underlying dataset.
How might I ensure that the 'Count' columns actually sum the contents of the input 'value' column? I hope that is clear, and thanks for any suggestions on how to address this issue.
table01 <- PivotTable$new()
table01$addData(cyp_art_freq)
table01$addColumnDataGroups("Class", totalCaption = "Total")
table01$defineCalculation(calculationName="Count", summariseExpression="n()", caption="Count", visible=TRUE)
filterOverrides <- PivotFilterOverrides$new(table01, keepOnlyFiltersFor="Count")
table01$defineCalculation(calculationName="TOCTotal", filters=filterOverrides,
summariseExpression="n()", caption="TOC Total", visible=FALSE)
table01$defineCalculation(calculationName="PercentageAllMotifs", type="calculation",
basedOn=c("Count", "TOCTotal"),
calculationExpression="values$Count/values$TOCTotal*100",
format="%.1f %%", caption="Percent")
table01$addRowDataGroups("Location")
table01$theme <- "compact"
table01$renderPivot()
table01$evaluatePivot()
(Screenshot of the pivot table returned by this code omitted.)
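One likely fix (a hedged sketch, not from the original thread): pivottabler's summariseExpression takes a dplyr summarise expression, so replacing n() with sum(value) should tally the pre-aggregated counts instead of counting rows:
library(pivottabler)
table01 <- PivotTable$new()
table01$addData(cyp_art_freq)
table01$addColumnDataGroups("Class", totalCaption = "Total")
# sum(value) sums the stored counts rather than counting rows with n()
table01$defineCalculation(calculationName = "Count",
                          summariseExpression = "sum(value)",
                          caption = "Count")
table01$addRowDataGroups("Location")
table01$renderPivot()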

which() function in R - after sorting in descending order, issues matching with duplicate values

I'm trying to find the next closest store from a matrix of store IDs, zip codes, and longitude/latitude coordinates for each zip code. Trouble happens when there is more than one store per zip code and the script doesn't know how to order two identical values (store x is 10 miles away, store y is 10 miles away, and it returns c(x, y) instead of x, y or y, x). I need my code to list both of them, in arbitrary order, since they are the same distance away.
I'm thinking there can likely be modifications to the which() function, but I'm not having any luck.
Note that the code runs for all stores; only the 100 or so stores that share a zip code with another store get tripped up. I'd love not to have to go through and edit the CSV manually.
library(data.table)
library(zipcode)
library(geosphere)
source <- read.csv("C:\\Users\\mcan\\Desktop\\Projects\\Closest Store\\Site and Zip.csv", header=TRUE, sep=",") # read the source file
zip<-source[,2] #break apart the source zip codes
ID<-source[,1] #break apart the IDs
zip<-clean.zipcodes(zip) #clean up the zipcodes
CleanedData<-data.frame(ID,zip) #combine the IDs and cleaned Zip codes
CleanedData<-merge(x=CleanedData,y=zipcode,by="zip",all.x=TRUE) #dataset of store IDs, zipcodes, and their long/lat positions
setDT(CleanedData) #set data frame to data table
storeDistances <- distm(CleanedData[,.(longitude,latitude)],CleanedData[,.(longitude,latitude)]) #matrix between long/lat points of all stores in list
colnames(storeDistances) <- rownames(storeDistances) <- CleanedData[,ID]
whatsClosest <- function(number=1){
apply(storeDistances,1,function(x) (colnames(storeDistances)[which(x==sort(x)[number+1])])) # sorts distances in ascending order, picks the (number+1)th smallest, and matches it to a store ID
}
CleanedData[,firstClosestSite:=whatsClosest(1)] #looks for 1st closest store
CleanedData[,secondClosestSite:=whatsClosest(2)] #looks for 2nd closest store
CleanedData[,thirdClosestSite:=whatsClosest(3)] #looks for 3rd closest store
Data set format:
Classes ‘data.table’ and 'data.frame': 1206 obs. of 9 variables:
$ zip : Factor w/ 1182 levels "01234","02345",..: 1 2 3 4 5 6 7 8 9 10 ...
$ ID : int 11111 12222 13333 10528 ...
$ city : chr "Boston" "Somerville" "Cambridge" "Weston" ...
$ state : chr "MA" "MA" "MA" "MA" ...
$ latitude : num 40.0 41.0 42.0 43.0 ...
$ longitude : num -70.0 -70.1 -70.2 -70.3 -70.4 ...
$ firstClosestSite :List of 1206
..$ : chr "12345"
$ secondClosestSite :List of 1206
..$ : chr "12344"
$ thirdClosestSite :List of 1206
..$ : chr "12343"
The issue comes with firstClosestSite and secondClosestSite: they are sorted by distance, but if the distance is the same because two stores are in the same zip code, the which() function (I think) doesn't account for this, so you get this awkward concatenation in the CSV:
StoreID Zip City State Longitude Latitude FirstClosestSite
11222 11000 Boston MA 40.0 -70.0 c("11111", "12222")
SecondClosestSite ThirdClosestSite
c("11111", "12222") 13333
Example of how the distance matrix is formed (store IDs in first row and column, with the matrix values being the distance between store IDs):
11111 22222 33333 44444 55555 66666
11111 0 6000 32000 36000 28000 28000
22222 6000 0 37500 40500 32000 32000
33333 32000 37500 0 11000 6900 6900
44444 36000 40500 11000 0 8900 8900
55555 28000 32000 6900 8900 0 0
66666 28000 32000 6900 8900 0 0
The issue is the duplicates in each row: which() returns both tied matches, so it can't say which of two equidistant stores (here 55555 and 66666, seen from 11111) comes first.
Here is my attempt at a solution. Everything up until the line with colnames(storeDistances) <- ... stays the same. After that, you should replace the code with the following:
whatsClosestList <- sapply(as.data.frame(storeDistances), function(x) list(data.frame(distance = x, store = rownames(storeDistances), stringsAsFactors = F)))
# Get the names of the stores
# this step is necessary because lapply doesn't allow us
# to access the list names
storeNames = names(whatsClosestList)
# Iterate through each store's data frame using storeNames
# and delete the distance to itself
whatsClosestListRemoveSelf <- lapply(storeNames, function(name) {
df <- whatsClosestList[[name]]
df <- df[!df$store == name,]
})
# The previous step got rid of the store names in the list,
# so we add them again here
names(whatsClosestListRemoveSelf) <- storeNames
whatsClosestOrderedList <- lapply(whatsClosestListRemoveSelf, function(df) { df[order(df$distance),] })
whatsClosestTopThree <- lapply(whatsClosestOrderedList, function(df) { df$store[1:3] })
firstClosestSite <- lapply(whatsClosestTopThree, function(x) { x[1]} )
secondClosestSite <- lapply(whatsClosestTopThree, function(x) { x[2]} )
thirdClosestSite <- lapply(whatsClosestTopThree, function(x) { x[3]} )
CleanedData[,firstClosestSite:=firstClosestSite] #looks for 1st closest store in list
CleanedData[,secondClosestSite:=secondClosestSite] #looks for 2nd closest store in list
CleanedData[,thirdClosestSite:=thirdClosestSite] #looks for 3rd closest store in list
Basically, instead of searching only for the (first, second, third) closest site, I create for each store a data frame of its distances to all other stores. I then order these data frames and extract the three closest stores, which sometimes include ties (if tied, they keep their original order, effectively by store name). Then you just extract a list with the firstClosestSite, secondClosestSite, etc. for each store, which is what gets assigned into CleanedData. Hope it works!
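A more compact variant of the same idea (a sketch, untested against the full dataset): order() gives every tied store its own rank position, so each rank maps to exactly one store ID and no concatenation occurs:
whatsClosest <- function(number = 1) {
  apply(storeDistances, 1, function(x) {
    # order(x)[1] is the store itself (distance 0); ties further out are
    # broken by column position, so each rank is a single store
    colnames(storeDistances)[order(x)[number + 1]]
  })
}
CleanedData[, firstClosestSite  := whatsClosest(1)]
CleanedData[, secondClosestSite := whatsClosest(2)]
CleanedData[, thirdClosestSite  := whatsClosest(3)]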

Dynamically Changing Data Type for a Data Frame

I have a set of data frames belonging to many countries, each consisting of 3 variables (year, AI, OAD). The example for Zimbabwe is shown below:
> str(dframe_Zimbabwe_1955_1970)
'data.frame': 16 obs. of 3 variables:
$ year: chr "1955" "1956" "1957" "1958" ...
$ AI : chr "11.61568161" "11.34114927" "11.23639317" "11.18841409" ...
$ OAD : chr "5.740789488" "5.775882473" "5.800441036" "5.822536579" ...
I am trying to change the variables' data types to those shown below so that I can fit a linear model using lm(dframe_Zimbabwe_1955_1970$AI ~ dframe_Zimbabwe_1955_1970$year).
> str(dframe_Zimbabwe_1955_1970)
'data.frame': 16 obs. of 3 variables:
$ year: int 1955 1956 1957 1958 ...
$ AI : num 11.61568161 11.34114927 11.23639317 11.18841409 ...
$ OAD : num 5.740789488 5.775882473 5.800441036 5.822536579 ...
The static code below is able to change AI from character (chr) to numeric (num):
dframe_Zimbabwe_1955_1970$AI <- as.numeric(dframe_Zimbabwe_1955_1970$AI)
However, when I try to automate the code as below, AI still remains character (chr):
countries <- c('Zimbabwe', 'Afghanistan', ...)
for (country in countries) {
assign(paste('dframe_',country,'_1955_1970$AI', sep=''), eval(parse(text = paste('as.numeric(dframe_',country,'_1955_1970$AI)', sep=''))))
}
Can you advise on what I could have done wrong?
Thanks.
42: Your code doesn't work as written, but with some edits it will. In addition to the missing parentheses and the wrong sep, you can't use $'column name' inside assign(), but you don't need it anyway:
for (country in countries) {
new_val <- get(paste( 'dframe_',country,'_1955_1970', sep=''))
new_val[] <- lapply(new_val, as.numeric) # the '[]' on LHS keeps dataframe
assign(paste('dframe_',country,'_1955_1970', sep=''), new_val)
remove(new_val)
}
Proof that it works:
dframe_Zimbabwe_1955_1970 <- data.frame(year = c("1955", "1956", "1957"),
AI = c("11.61568161", "11.34114927", "11.23639317"),
OAD = c("5.740789488", "5.775882473", "5.800441036"),
stringsAsFactors = F)
str(dframe_Zimbabwe_1955_1970)
'data.frame': 3 obs. of 3 variables:
$ year: chr "1955" "1956" "1957"
$ AI : chr "11.61568161" "11.34114927" "11.23639317"
$ OAD : chr "5.740789488" "5.775882473" "5.800441036"
countries <- 'Zimbabwe'
for (country in countries) {
new_val <- get(paste( 'dframe_',country,'_1955_1970', sep=''))
new_val[] <- lapply(new_val, as.numeric) # the '[]' on LHS keeps dataframe
assign(paste('dframe_',country,'_1955_1970', sep=''), new_val)
remove(new_val)
}
str(dframe_Zimbabwe_1955_1970)
'data.frame': 3 obs. of 3 variables:
$ year: num 1955 1956 1957
$ AI : num 11.6 11.3 11.2
$ OAD : num 5.74 5.78 5.8
It's going to be considered fairly ugly code by the purists, but perhaps this:
for (country in countries) {
new_val <- get(paste('dframe_',country,'_1955_1970', sep=''))
new_val[] <- lapply(new_val, as.numeric) # the '[]' on LHS keeps dataframe
assign(paste('dframe_',country,'_1955_1970', sep=''), new_val)
}
Using the get('obj_name') function is considered cleaner than eval(parse(text=...)). This would all be handled more naturally in R had you assembled these data frames in a list, as sketched below.
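For illustration, a sketch of that list-based version (the two country data frames here are hypothetical stand-ins):
# collect the per-country data frames into one named list
dframes <- list(
  Zimbabwe    = dframe_Zimbabwe_1955_1970,
  Afghanistan = dframe_Afghanistan_1955_1970
)
# convert every column of every data frame to numeric in one pass
dframes <- lapply(dframes, function(df) {
  df[] <- lapply(df, as.numeric)  # '[]' keeps the data.frame structure
  df
})
str(dframes$Zimbabwe)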

arrange multiple graphs using a for loop in ggplot2

I want to produce a pdf which shows multiple graphs, one for each NetworkTrackingPixelId.
I have a data frame similar to this:
> head(data)
NetworkTrackingPixelId Name Date Impressions
1 2421 Rubicon RTB 2014-02-16 168801
2 2615 Google RTB 2014-02-16 1215235
3 3366 OpenX RTB 2014-02-16 104419
4 3606 AppNexus RTB 2014-02-16 170757
5 3947 Pubmatic RTB 2014-02-16 68690
6 4299 Improve Digital RTB 2014-02-16 701
I was thinking to use a script similar to the one below:
# create a vector which stores the NetworkTrackingPixelIds
tp <- data %.%
group_by(NetworkTrackingPixelId) %.%
select(NetworkTrackingPixelId)
# create a for loop to print the line graphs
for (i in tp) {
print(ggplot(data[which(data$NetworkTrackingPixelId == i), ], aes(x = Date, y = Impressions)) + geom_point() + geom_line())
}
I was expecting this command to produce many graphs, one for each NetworkTrackingPixelId. Instead the result is an unique graph which aggregate all the NetworkTrackingPixelIds.
Another thing I've noticed is that the variable tp is not a real vector.
> is.vector(tp)
[1] FALSE
Even if I force it..
tp <- as.vector(data %.%
group_by(NetworkTrackingPixelId) %.%
select(NetworkTrackingPixelId))
> is.vector(tp)
[1] FALSE
> str(tp)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 1397 obs. of 1 variable:
$ NetworkTrackingPixelId: int 2421 2615 3366 3606 3947 4299 4429 4786 6046 6286 ...
- attr(*, "vars")=List of 1
..$ : symbol NetworkTrackingPixelId
- attr(*, "drop")= logi TRUE
- attr(*, "indices")=List of 63
..$ : int 24 69 116 162 205 253 302 351 402 454 ...
..$ : int 1 48 94 140 184 232 281 330 380 432 ...
[I've cut a bit this output]
- attr(*, "group_sizes")= int 29 29 2 16 29 1 29 29 29 29 ...
- attr(*, "biggest_group_size")= int 29
- attr(*, "labels")='data.frame': 63 obs. of 1 variable:
..$ NetworkTrackingPixelId: int 8799 2615 8854 8869 4786 7007 3947 9109 9126 9137 ...
..- attr(*, "vars")=List of 1
.. ..$ : symbol NetworkTrackingPixelId
Since I don't have your dataset, I will use the mtcars dataset to illustrate how to do this using dplyr and data.table. Both packages are the finest examples of the split-apply-combine paradigm in rstats. Let me explain:
Step 1: Split data by gear
dplyr uses the function group_by
data.table uses the argument by
Step 2: Apply a function
dplyr uses do, to which you can pass a function that operates on each piece x
data.table evaluates the expression supplied to it within the context of each piece
Step 3: Combine
There is no combine step here, since we save the charts created to file.
library(dplyr)
mtcars %.%
group_by(gear) %.%
do(function(x){ggsave(
filename = sprintf("gear_%s.pdf", unique(x$gear)), qplot(wt, mpg, data = x)
)})
library(data.table)
mtcars_dt = data.table(mtcars)
mtcars_dt[,ggsave(
filename = sprintf("gear_%s.pdf", unique(gear)), qplot(wt, mpg)),
by = gear
]
UPDATE: To save all files into one pdf, here is a quick solution.
plots = mtcars %.%
group_by(gear) %.%
do(function(x) {
qplot(wt, mpg, data = x)
})
pdf('all.pdf')
invisible(lapply(plots, print))
dev.off()
I recently had a project that required producing an individual PNG for each record. I found I got a huge speed-up from some pretty simple parallelization. I am not sure whether this is more performant than the dplyr or data.table techniques, but it may be worth trying:
require(foreach)
require(doParallel)
workers <- makeCluster(4)
registerDoParallel(workers)
foreach(i = seq(1, length(mtcars$gear)), .packages=c('ggplot2')) %dopar% {
j <- qplot(wt, mpg, data = mtcars[i,])
png(file=paste(getwd(), '/images/',mtcars[i, c('gear')],'.png', sep=''))
print(j)
dev.off()
}
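One small addition (not in the original answer): release the worker processes when you're done, otherwise they linger in the background.
stopCluster(workers)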
Unless I'm missing something, generating plots by a subsetting variable is very simple. You can use split(...) to split the original data into a list of data frames by NetworkTrackingPixelId, and then pass those to ggplot using lapply(...). Most of the code below is just to create a sample dataset.
# create sample data
set.seed(1)
names <- c("Rubicon","Google","OpenX","AppNexus","Pubmatic")
dates <- as.Date("2014-02-16")+1:10
df <- data.frame(NetworkTrackingPixelId=rep(1:5,each=10),
Name=sample(names,50,replace=T),
Date=dates,
Impressions=sample(1000:10000,50))
# end create sample data
pdf("plots.pdf")
lapply(split(df,df$NetworkTrackingPixelId),
function(gg) ggplot(gg,aes(x = Date, y = Impressions)) +
geom_point() + geom_line()+
ggtitle(paste("NetworkTrackingPixelId:", unique(gg$NetworkTrackingPixelId))))
dev.off()
This generates a pdf containing 5 plots, one for each NetworkTrackingPixelId.
I think you would be better off writing a function for plotting, then using lapply for every Network Tracking Pixel.
For example, your function might look like:
plot.function <- function(ntpid){
sub <- subset(dataset, NetworkTrackingPixelId == ntpid)
ggobj <- ggplot(data=sub, aes(...)) + geom...
ggsave(filename=sprintf("%s.pdf", ntpid), plot=ggobj)
}
It would be helpful if you posted a reproducible example, but I hope this works! Not sure about the vector issue, though.
Cheers!
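As for the vector issue: group_by() and select() return a data frame (a grouped tbl), not a vector, which is why is.vector(tp) is FALSE. A plain vector of the ids is simply:
tp <- unique(data$NetworkTrackingPixelId)
is.vector(tp)  # TRUE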

Reading durations

I have a CSV file containing times per competitor for each section of a triathlon. I am having trouble reading the data in a form R can use. Here is an example of how the data looks (I've removed some columns for clarity):
"Place","Division","Gender","Swim","T1","Bike","T2","Run","Finish"
1, "40-49","M","7:45","0:55","27:07","0:29","18:53","55:07"
2, "UNDER 18","M","5:41","0:28","30:41","0:28","18:38","55:55"
3, "40-49","M","6:27","0:26","29:24","0:40","20:16","57:11"
4, "40-49","M","7:57","0:35","29:19","0:23","19:20","57:32"
5, "40-49","M","6:28","0:32","31:00","0:34","19:19","57:51"
6, "40-49","M","7:42","0:30","30:02","0:37","19:11","58:02"
....
250 ,"18-29","F","13:20","3:23","1:06:40","1:19","38:00","2:02:40"
251 ,"30-39","F","13:01","2:42","1:02:12","1:20","43:45","2:02:58"
252 ,50 ,"F","20:45","1:33","58:09","3:17","40:14","2:03:56"
253 ,"30-39","M","13:14","1:14","DNF","1:11","25:10","DNF bike"
254 ,"40-49","M","10:04","1:41","56:36","2:32",,"D.N.F"
My first naive attempt to plot the data went like this.
> tri <- read.csv(file.choose(), header=TRUE, as.is=TRUE)
> pairs(~ Bike + Run + Swim, data=tri)
The times are not being imported in a sensible way so the charts don't make sense.
I have found the difftime type and have tried to use it to parse the times in the data file.
There are some rows with DNF or similar in place of times; I'm happy for rows with times that can't be parsed to be discarded. There are two formats for the times: "%M:%S" and "%H:%M:%S".
I think I need to create a new data frame from the data but I am having trouble parsing the times. This is what I have so far.
> tri <- read.csv(file.choose(), header=TRUE, as.is=TRUE)
> str(tri)
'data.frame': 254 obs. of 12 variables:
$ Place : num 1 2 3 4 5 6 7 8 9 10 ...
$ Race.. : num 237 274 268 226 267 247 264 257 273 272 ...
$ First.Name: chr ** removed names ** ...
$ Last.Name : chr ** removed names ** ...
$ Division : chr "40-49" "UNDER 18" "40-49" "40-49" ...
$ Gender : chr "M" "M" "M" "M" ...
$ Swim : chr "7:45" "5:41" "6:27" "7:57" ...
$ T1 : chr "0:55" "0:28" "0:26" "0:35" ...
$ Bike : chr "27:07" "30:41" "29:24" "29:19" ...
$ T2 : chr "0:29" "0:28" "0:40" "0:23" ...
$ Run : chr "18:53" "18:38" "20:16" "19:20" ...
$ Finish : chr "55:07" "55:55" "57:11" "57:32" ...
> as.numeric(as.difftime(tri$Bike, format="%M:%S"), units="secs")
This converts all the times that are under one hour, but for times over an hour the hours are interpreted as minutes. Substituting "%H:%M:%S" for "%M:%S" parses times over an hour but produces NA otherwise. What is the best way to convert both types of times?
EDIT: Adding a simple example as requested.
> times <- c("27:07", "1:02:12", "DNF")
> as.numeric(as.difftime(times, format="%M:%S"), units="secs")
[1] 1627 62 NA
> as.numeric(as.difftime(times, format="%H:%M:%S"), units="secs")
[1] NA 3732 NA
The output I would like would be 1627 3732 NA
Here's a quick hack at a solution, although there may be a better one:
cdifftime <- function(x) {
x2 <- gsub("^([0-9]+:[0-9]+)$","00:\\1",x) ## prepend 00: to %M:%S elements
res <- as.difftime(x2,format="%H:%M:%S")
units(res) <- "secs"
as.numeric(res)
}
times <- c("27:07", "1:02:12", "DNF")
cdifftime(times)
## [1] 1627 3732 NA
You can apply this to the relevant columns:
tri[4:9] <- lapply(tri[4:9],cdifftime)
A couple of notes from trying to replicate your example:
you may want to use na.strings="DNF" to set "did not finish" values to NA automatically
you need to make sure strings are not read in as factors, e.g. (1) set options(stringsAsFactors=FALSE); (2) pass stringsAsFactors=FALSE to read.csv; (3) use as.is=TRUE, ditto. A combined sketch follows.
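Putting both notes together, the read step might look like this (a sketch: the na.strings values are copied from the sample rows above, and columns 4:9 are Swim through Finish in the 9-column sample):
tri <- read.csv(file.choose(), header = TRUE, as.is = TRUE,
                na.strings = c("DNF", "D.N.F", "DNF bike"))
tri[4:9] <- lapply(tri[4:9], cdifftime)  # parse the six duration columns
pairs(~ Bike + Run + Swim, data = tri)   # now plots sensible numeric seconds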
