R: Create Stock Indicator from OHLC data

I have OHLC (Open/High/Low/Close) data, which can be obtained from any finance API.
I want to create a target indicator (-1, 0, 1) on which I will build a stock classification model.
To create this target variable, I need an intermediate indicator, log(tomorrow's CLOSE/today's CLOSE), which takes values in (-Inf, Inf).
From it I want to create labels=c(-1, 0, 1) using breaks=c(-Inf, range_start, range_end, Inf).
My first question is how to create this target variable without looking into future data: the formula log(tomorrow's CLOSE/today's CLOSE) looks one step ahead, which is wrong, so I want to shift the dataframe/inputs backward by one row and treat today as tomorrow, and so on.
Then I calculate the target category (-1, 0, 1) based on the range_start, range_end and breaks I define.
My second question is how to best choose these threshold values; I am using -0.0015 and 0.0015 for now.
I need some comments and suggestions here, thanks.
masterDF_close <- masterDF %>% dplyr::select('Date', 'Close')
# create a one-row NA matrix with the same number of columns as the data
temprow <- matrix(rep.int(NA, length(masterDF)), nrow = 1, ncol = length(masterDF))
# make it a data.frame and give cols the same names as the data
newrow <- data.frame(temprow)
colnames(newrow) <- colnames(masterDF)
# rbind the empty row on top of the data
masterDF <- rbind(newrow, masterDF)
###View(masterDF)
temprow2 <- matrix(rep.int(NA, length(masterDF_close)), nrow = 1, ncol = length(masterDF_close))
# make it a data.frame and give cols the same names as the data
newrow2 <- data.frame(temprow2)
colnames(newrow2) <- colnames(masterDF_close)
# rbind the empty row to the bottom of the close-only copy
masterDF_close <- rbind(masterDF_close, newrow2)
# after the two rbinds, Close_unshifted at row t is the close one day after Close at row t
masterDF['Close_unshifted'] = masterDF_close$Close
###View(masterDF)
# Shifting data backwards, treating today's Close as tomorrow's Close and yesterday's Close as today's Close
# close <- masterDF$Close
# lead_close <- lag(close, k = -1)
#
# close[1:10]
# lead_close[1:10]
#
# log(close/lead_close)
#
# plot(log(close/lead_close))
masterDF['TargetIndicator'] <- log(masterDF$Close_unshifted/masterDF$Close)
###View(masterDF)
masterDF = masterDF[-1,]
masterDF$TargetIndicator[is.na(masterDF$TargetIndicator)] <- 0
masterDF_ <- masterDF %>% mutate(category = cut(TargetIndicator,
                                                breaks = c(-Inf, range_start, range_end, Inf),
                                                labels = c(-1, 0, 1)))
These are the two operations I am doing in the code.
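For reference, the same two operations can be written more compactly with dplyr::lead(), which avoids the NA-row bookkeeping entirely. A sketch, assuming masterDF and the range_start/range_end thresholds from above; the quantile line is only one possible, data-driven way to pick the thresholds:
library(dplyr)
masterDF_ <- masterDF %>%
  mutate(TargetIndicator = coalesce(log(lead(Close) / Close), 0)) %>%  # log(tomorrow/today); the last row has no tomorrow, so it becomes 0
  mutate(category = cut(TargetIndicator,
                        breaks = c(-Inf, range_start, range_end, Inf),
                        labels = c(-1, 0, 1)))
# symmetric empirical quantiles give roughly balanced -1/0/1 classes:
# qs <- quantile(masterDF_$TargetIndicator, probs = c(1/3, 2/3))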


Looping row numbers from one dataframe to create new data using logical operations in R

I would like to extract a dataframe that shows how many years it takes for the NInd variable (dataset p1) to recover after the culling events shown in dataframe e1.
I have the following datasets (mine are much bigger, but just to give you something to play with):
# Dataset 1
Batch <- c(2,2,2,2,2,2,2,2,2,2)
Rep <- c(0,0,0,0,0,0,0,0,0,0)
Year <- c(0,0,1,1,2,2,3,3,4,4)
RepSeason <- c(0,0,0,0,0,0,0,0,0,0)
PatchID <- c(17,25,19,16,21,24,23,20,18,33)
Species <- c(0,0,0,0,0,0,0,0,0,0)
Selected <- c(1,1,1,1,1,1,1,1,1,1)
Nculled <- c(811,4068,1755,449,1195,1711,619,4332,457,5883)
e1 <- data.frame(Batch,Rep,Year,RepSeason,PatchID,Species,Selected,Nculled)
# Dataset 2
Batch <- c(2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
Rep <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
Year <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2)
RepSeason <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
PatchID <- c(17,25,19,16,21,24,23,20,18,33,17,25,19,16,21,24,23,20,18,33,17,25,19,16,21,24,23,20,18,33)
Ncells <- c(6,5,6,4,4,5,6,5,5,5,6,5,6,4,4,5,6,7,3,5,4,4,3,3,4,4,5,5,6,4)
Species <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
NInd <- c(656,656,262,350,175,218,919,218,984,875,700,190,93,127,52,54,292,12,43,68,308,1000,98,29,656,656,262,350,175,300)
p1 <- data.frame(Batch, Rep, Year, RepSeason, PatchID, Ncells, Species, NInd)
The dataset called e1 shows only those years where some culling happened to the population in a specific PatchID.
I have created the following script that basically uses each row from e1 to create a Recovery number. Maybe there is an easier way to get to the end, but this is the one I managed to get...
When you run this, you are working on ONE row of e1, so we focus on the first PatchID encountered and then do some calculations to match it up with p1, and finally I get a number named Recovery.
Now, the thing is my dataframe has 50,000 rows, so doing this over and over looks quite tedious. That's where I thought a loop might be useful, but I have tried and had no luck making it work at all...
# here is where I would like the loop
e2 <- e1[1, ]  # Trial for one row only, but the idea is to have a loop here that does what comes next for each row
e3 <- e2 %>%
  select(1, 2, 4, 5)
p2 <- p1[, c(1, 2, 4, 5, 3, 6, 7, 8)]  # Re-order
row2 <- which(apply(p2, 1, function(x) return(all(x == e3))))
p3 <- p1 %>%
  slice(row2)  # all years with that particular patch in that particular Batch
# How many times was this patch culled during this replicate?
e4 <- e2[, c(1, 2, 4, 5, 3, 6, 7, 8)]
e4 <- e4 %>%
  select(1, 2, 3, 4)
c_batch <- e1[, c(1, 2, 4, 5, 3, 6, 7, 8)]
row <- which(apply(c_batch, 1, function(x) return(all(x == e4))))
c4 <- c_batch %>%
  slice(row)
# Number of years to recover to 95% of what it had before the cull
c5 <- c4[1, ]  # extract the first time it was culled
c5 <- c5 %>%
  select(1:5)
row3 <- which(apply(p2, 1, function(x) return(all(x == c5))))
Before <- p2 %>%
  slice(row3)
NInd <- Before[, 8]   # number of individuals before culling
Year2 <- Before[, 5]  # Year when the first culling happened (that number actually corresponds to individuals before culling, as the Pop file is written during reproduction, while the Cull file is written after!)
Percent <- (95 * NInd) / 100  # the 95% recovery we want corresponds to 95% of NInd BEFORE the cull happened (Year2)
After <- p3 %>%
  filter(NInd >= Percent & Year > Year2)  # rows that match number of individuals and Year
After2 <- After[1, ]  # we just want the first year where the recovery was successfully achieved
Recovery <- After2$Year - Before$Year
# no. of years to reach 95% of the population immediately before the cull
I reckon that the end would have to change somehow to tell R that we are creating a dataframe with the Recovery, something like:
Batch <- c(1,1,2,2)
Rep <- c(0,0,0,0)
PatchID <- c(17,25,30,12)
Recovery <- c(1,2,1,5)
Final <- data.frame(Batch, Rep, PatchID, Recovery)
Would that be possible? Or is this just too messed up and should I try a different way?
Does the following solve the problem correctly?
I have first added a unique ID to your data.frames to allow matching of the cull and population files (this saves most of your complicated look-up code):
# Add a unique ID for the patch/replicate etc. (as done in the example code)
e1$RepID = paste(e1$Batch, e1$Rep, e1$RepSeason, e1$PatchID, sep = ":")
p1$RepID = paste(p1$Batch, p1$Rep, p1$RepSeason, p1$PatchID, sep = ":")
If you want a quick overview of the number of times each patch was culled, the new RepID makes this easy:
# How many times was each patch culled?
table(e1$RepID)
Then you want a loop to check the recovery time after each cull.
My solution uses an sapply loop (which also retains the RepIDs so you can match to other metadata later):
sapply(unique(e1$RepID), function(rep_id){
  all_cull_events = e1[e1$RepID == rep_id, , drop = FALSE]
  first_year = order(all_cull_events$Year)[1]  # index of the first cull year (the data might not be in temporal order)
  first_cull_event = all_cull_events[first_year, ]  # the row corresponding to the first cull event
  population_counts = p1[p1$RepID == first_cull_event$RepID, ]  # the population counts for this patch/replicate
  population_counts = population_counts[order(population_counts$Year), ]  # order by year (again, data might not be in temporal order)
  pop_at_first_cull_event = population_counts[population_counts$Year == first_cull_event$Year, "NInd"]
  population_counts_after_cull = population_counts[population_counts$Year > first_cull_event$Year, , drop = FALSE]
  years_to_recovery = which(population_counts_after_cull$NInd >= (pop_at_first_cull_event * 0.95))[1]  # first year to pass the 95% threshold
  return(years_to_recovery)
})
2:0:0:17 2:0:0:25 2:0:0:19 2:0:0:16 2:0:0:21 2:0:0:24 2:0:0:23 2:0:0:20 2:0:0:18 2:0:0:33
1 2 1 NA NA NA NA NA NA NA
(The output contains some NAs because the first cull year was outside the range of population counts in the data you gave us)
Please check this against your expected output though. There were some aspects of the question and example code that were not clear (see comments).
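If you also want the Final-style data.frame from the question, the RepID names can be split back apart. A sketch, assuming the sapply result above was assigned to a variable recovery_times:
# split "Batch:Rep:RepSeason:PatchID" back into its components
id_parts <- do.call(rbind, strsplit(names(recovery_times), ":", fixed = TRUE))
Final <- data.frame(Batch    = as.numeric(id_parts[, 1]),
                    Rep      = as.numeric(id_parts[, 2]),
                    PatchID  = as.numeric(id_parts[, 4]),
                    Recovery = unname(recovery_times))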

How to count and plot a cumulative number over a date range by groups

I want to find the best way to plot a chart showing the cumulative number of individuals in a group based on the date they came into the group as well as the date they may have left the group. This would be within the minimum and maximum date ranges of the date values. Each row is a person.
group_id Date_started Date_exit
1 2005-06-23 NA
1 2013-03-17 2013-09-20
2 2019-10-24 NA
3 2019-11-27 2019-11-27
4 2019-08-14 NA
3 2018-10-17 NA
4 2018-04-13 2019-10-12
1 2019-07-10 NA
I've considered creating a new data frame with a row per day within the min/max range and then applying some kind of function to tally the group totals for each row (adding to and subtracting from a running total based on whether there is a new value in either of the columns), but I'm not sure, one, whether that's the best way to approach the problem and, two, how to practically run the cumulative count function.
Ultimately, though, I want to plot this as a line chart so I can see the trends over time for each group, as I suspect one or more of them are much more volatile in overall numbers. So again, I'm not sure if ggplot2 already has something in place to handle this.
As you mentioned, you will need to create a dataframe with the desired dates and count, for each group, how many individuals are in the group.
I quickly put this together, so I'm sure there's a more optimal solution, but it should be what you're looking for.
library(ggplot2)
library(reshape2) # for melt
# your data
test <- read.table(
  text = "group_id,Date_started,Date_exit
1,2005-06-23,NA
1,2013-03-17,2013-09-20
2,2019-10-24,NA
3,2019-11-27,2019-11-27
4,2019-08-14,NA
3,2018-10-17,NA
4,2018-04-13,2019-10-12
1,2019-07-10,NA",
  header = TRUE, sep = ",", stringsAsFactors = FALSE
)
# make date series
from <- min(as.POSIXct(test$Date_started))
to <- max(as.POSIXct(test$Date_started))
datebins <- seq(from, to, by = "1 month")
d_between <- function(d, ds, de){
  if(ds <= d & (de > d | is.na(de)))
    return(TRUE)
  return(FALSE)
}
# make df to plot
df <- data.frame(dates = datebins)
df[,paste0("g", unique(test$group_id))] <- 0
for(i in seq_len(nrow(df))){
  for(j in seq_len(nrow(test))){
    gid <- paste0("g", test$group_id[j])
    df[i, gid] <- df[i, gid] + d_between(df$dates[i], test$Date_started[j], test$Date_exit[j])
  }
}
# plot
ggplot(melt(df, id.vars = "dates"), aes(dates, value, color = variable)) +
  geom_line(size = 1) + theme_bw()
This gives a line chart with one colored line per group (plot image not shown here).
Feel free to play with the date bins (in seq()) as necessary.
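For instance, weekly bins only need a different by argument (an illustrative tweak):
datebins <- seq(from, to, by = "1 week")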
EDIT: for loop explanation
for(i in seq_len(nrow(df))){
  for(j in seq_len(nrow(test))){
    gid <- paste0("g", test$group_id[j])
    df[i, gid] <- df[i, gid] + d_between(df$dates[i], test$Date_started[j], test$Date_exit[j])
  }
}
The first loop iterates over the chosen dates.
For each date, the second for loop goes through the dataframe of interest (test) and uses the custom d_between() function to determine whether or not an individual is part of the group. That function returns a boolean (which translates to 0/1). The value 0 or 1 is then added to the df dataframe's column for the appropriate group (via gid) at the date being checked (row i).
Note that I'm considering individuals as part of the group as soon as they join (ds <= d), but not a part of the group on the day they quit (de > d).
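If the double loop gets slow on larger data, the same counts can be computed without it. A vectorized sketch under the same assumptions (test and datebins as defined above):
# count, per group, the members active at each date bin
counts <- sapply(split(test, test$group_id), function(g)
  sapply(datebins, function(d)
    sum(as.POSIXct(g$Date_started) <= d &
          (is.na(g$Date_exit) | as.POSIXct(g$Date_exit) > d))))
df2 <- data.frame(dates = datebins, counts)  # one column per group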

How to efficiently calculate distance between GPS points in one dataset and GPS points in another data set using data.table

I am facing a coding (optimization) problem in R. I have a long data set with GPS coordinates (lon, lat, timestamp) and for every row I need to check whether the location is near a bus stop. I have a .csv file with all the bus stops (in the Netherlands). The GPS coordinates file is millions of entries long, but could be split if necessary. The bus stop dataset is around 5500 entries long.
Using the code and tips given on, inter alia, these pages:
1) How to efficiently calculate distance between pair of coordinates using data.table :=
2) Using a simple for loop on spatial data
3) Calculate distance between two latitude-longitude points? (Haversine formula)
4) Fastest way to determine COUNTRY from millions of GPS coordinates [R]
I was able to construct code that works, but it is (too) slow. I was wondering if someone can help me with a faster data.table implementation, or can point out where the bottleneck in my code is. Is it the spDistsN1() function, or maybe the combination of apply() and melt()? I am most comfortable in R, but open to other software (as long as it is open source).
Due to privacy concerns I cannot upload the full dataset, but this is a (small) reproducible example that is not too different from how the real data looks.
# packages:
library(data.table)
library(tidyverse)
library(sp)
# create GPS data
number_of_GPS_coordinates <- 20000
set.seed(1)
gpsdata <- as.data.frame(cbind(id  = 1:number_of_GPS_coordinates,
                               lat = runif(number_of_GPS_coordinates, 50.5, 53.5),
                               lon = runif(number_of_GPS_coordinates, 4, 7)))
# create some busstop data. In this case only 2000 bus stops
set.seed(1)
number_of_bus_stops <- 2000
stop<-as.data.frame(gpsdata[sample(nrow(gpsdata), number_of_bus_stops), -1]) # of course do not keep id variable
stop$lat<-stop$lat+rnorm(number_of_bus_stops,0,.0005)
stop$lon<-stop$lon+rnorm(number_of_bus_stops,0,.0005)
busdata.data<-cbind(stop, name=replicate(number_of_bus_stops, paste(sample(LETTERS, 15, replace=TRUE), collapse="")))
names(busdata.data) <- c("latitude_bustops", "longitude_bustops", "name")
Download the real bus stop data if you want; it is kind of hard to reproduce a random sample of this.
#temp <- tempfile()
#download.file("http://data.openov.nl/haltes/stops.csv.gz", temp) #1.7MB
#gzfile(temp, 'rt')
#busstopdata <- read.csv(temp, stringsAsFactors = FALSE)
#unlink(temp)
#bus_stops <- fread("bus_stops.csv")
#busdata.data <- busstopdata %>%
# mutate(latitude_bustops = latitude)%>%
# mutate(longitude_bustops = longitude)%>%
# dplyr::select(name, latitude_bustops, longitude_bustops)
Here is the code I use now to calculate distances. It works, but it is pretty slow:
countDataPoints3 <- function(p) {
  distances <- spDistsN1(data.matrix(gpsdata[, c("lon", "lat")]),
                         p,
                         longlat = TRUE)  # in km
  return(which(distances <= .2))  # distance is now set to 200 meters
}
# code to check per data point if a bus stop is near and save this per bus stop in a list entry
datapoints.by.bustation <- apply(data.matrix(busdata.data[,c("longitude_bustops","latitude_bustops")]), 1, countDataPoints3)
# rename list entries
names(datapoints.by.bustation) <- busdata.data$name
# melt list into one big data.frame
long.data.frame.busstops <- melt(datapoints.by.bustation)
# now switch to data.table grammar to speed up process
# set data.table
setDT(gpsdata)
gpsdata[, rowID := 1:nrow(gpsdata)]
setkey(gpsdata, key = "rowID")
setDT(long.data.frame.busstops)
# merge the data, and filter non-unique entries
setkey(long.data.frame.busstops, key = "value")
GPS.joined <- merge(x = gpsdata, y = long.data.frame.busstops, by.x= "rowID", by.y= "value", all.x=TRUE)
GPS.joined.unique <- unique(GPS.joined, by="id") # mak
# this last part of the code makes sure that if there is more than one bus stop nearby, these bus stops are put in a list
# instead of adding rows and making the final data.frame longer than the original one
GPS.joined.unique2 <- setDT(GPS.joined.unique)[order(id, L1), list(L1=list(L1)), by=id]
GPS.joined.unique2[, nearby := TRUE][is.na(L1), nearby := FALSE] # add a dummy to check if any bus stop is nearby.
# makes sense:
as_tibble(GPS.joined.unique2) %>%
  summarize(sum = sum(nearby))
Consider a slicing method: first cut by close latitudes and close longitudes, in this case 0.5 degrees of latitude and 0.5 degrees of longitude (which is still about a 60 km disc). We can use data.table's superb support of rolling joins.
The following takes a few milliseconds for 20,000 entries and only a few seconds for 2M entries.
library(data.table)
library(hutils)
library(magrittr)  # for the %>% pipe used below
setDT(gpsdata)
setDT(busdata.data)
# just to take note of the originals
gps_orig <- copy(gpsdata)
busdata.orig <- copy(busdata.data)
setkey(gpsdata, lat)
gpsdata[, gps_lat := lat + 0]
gpsdata[, gps_lon := lon + 0]
busdata.data[, lat := latitude_bustops + 0]
busdata.data[, lon := longitude_bustops + 0]
setkey(busdata.data, lat)
gpsID_by_lat <-
gpsdata[, .(id), keyby = "lat"]
By_latitude <-
busdata.data[gpsdata,
on = "lat",
# within 0.5 degrees of latitude
roll = 0.5,
# +/-
rollends = c(TRUE, TRUE),
# and remove those beyond 0.5 degrees
nomatch=0L] %>%
.[, .(id_lat = id,
name_lat = name,
bus_lat = latitude_bustops,
bus_lon = longitude_bustops,
gps_lat,
gps_lon),
keyby = .(lon = gps_lon)]
setkey(busdata.data, lon)
By_latlon <-
busdata.data[By_latitude,
on = c("name==name_lat", "lon"),
# within 0.5 degrees of longitude
roll = 0.5,
# +/-
rollends = c(TRUE, TRUE),
# and remove those beyond 0.5 degrees
nomatch=0L]
By_latlon[, distance := haversine_distance(lat1 = gps_lat,
lon1 = gps_lon,
lat2 = bus_lat,
lon2 = bus_lon)]
By_latlon[distance < 0.2]
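A quick sanity check of the rolling-join result (a hypothetical snippet; the GPS id is carried through as id_lat):
# number of GPS points with at least one bus stop within 200 m
uniqueN(By_latlon[distance < 0.2], by = "id_lat")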
This is the function I came up with so far. @Dave2e, thanks. It is already an awful lot faster than what I had. There is still clearly room for a lot of improvement, but as it stands it is fast enough for my analysis now. I only slice by latitude and not longitude. The only reason for that is that it makes indexing and then looping over indices really easy, but more speed could be gained by also indexing on longitude. Also, real GPS data tend to contain many duplicate values (same lon/lat, different timestamp), so the code would be more efficient if it took this into account. Maybe I will work on that in the future.
# this app could be much faster if it would filter by duplicate GPS coordinates
check_if_close <- function(dataset1 = GPS.Utrecht.to.Gouda,
dataset2 = bus_stops,
n.splits = 500,
desired.dist = .2){
# dataset1 needs at least the columns
# - "id",
# - "device_id"
# - "latitude"
# - "longitude"
# dataset2 needs at least the columns
# - "id",
# - "name"
# - "latitude"
# - "longitude"
# these are the average coordinates of the Netherlands. A change of .0017 in latitude leads to a change of 189 meters
# spDistsN1(matrix(c(5.2913, 52.1326), ncol=2), matrix(c(5.2913, 52.1326+.0017), ncol=2), longlat=TRUE)*1000
# [1] 189.1604
# this means the latitude slices we cut (the subsection of) the Netherlands into have to be at least .0017 wide.
# if we look at the Netherlands as a whole, this means we can use at most (53.5-50.5)/.0017 = 1765 slices.
# if we look at only a small subsection (because we are looking at a single trip, for example) we need far fewer slices.
# 1) we only select the variables we need from dataset 1
dataset1 <- setDT(dataset1)[,c("id", "device_id", "latitude", "longitude")]
setnames(dataset1, old = c("id", "latitude", "longitude"), new = c("id_dataset1", "latitude_gps", "longitude_gps"))
# 2) we only select the variables we need from dataset 2
dataset2 <- setDT(dataset2)[,c("id", "name", "latitude", "longitude")]
setnames(dataset2, old = c("id", "latitude", "longitude"), new = c("id_dataset2", "latitude_feature", "longitude_feature"))
# 3) only keep the subset of dataset2 that falls within dataset1.
# There is no reason to check features that already fall outside the GPS coordinates of the trip we want to check.
# We do add a 0.01 point margin around it to be on the safe side: maybe a feature falls just outside the GPS coordinates,
# but is still near a GPS point
dataset2 <- dataset2[latitude_feature %between% (range(dataset1$latitude_gps) + c(-0.01, +0.01))
                     & longitude_feature %between% (range(dataset1$longitude_gps) + c(-0.01, +0.01)), ]
# 4) we cut dataset2 into slices on the latitude dimension
# some trial and error is involved in getting the right amount: with too many slices you get a large, redundant number of empty values,
# with too few you need to check too many GPS-to-feature distances per slice
dataset2[, range2 := as.numeric(Hmisc::cut2(dataset2$latitude_feature, g=n.splits))]
# 5) calculate the ranges of the slices we just created
ranges <- dataset2[,list(Min=min(latitude_feature), Max= max(latitude_feature)), by=range2][order(range2)]
setnames(ranges, old = c("range2", "Min", "Max"), new = c("latitude_range", "start", "end"))
# 6) now we assign to which slice every GPS coordinate in dataset1 belongs
# this is super fast when using data.table grammar
elements1 <- dataset1$latitude_gps
ranges <- setDT(ranges)[data.table(elements1), on = .(start <= elements1, end >=elements1)]
ranges[, rowID := seq_len(.N)]
dataset1[,rowID := seq_len(.N)]
setkey(dataset1, rowID)
setkey(ranges, rowID)
dataset1<-dataset1[ranges]
# 7) this is the actual function we use to check if a datapoint is nearby.
# potentially there are faster functions to do this??
checkdatapoint <- function(p, h, dist = desired.dist) {
  distances <- spDistsN1(data.matrix(filter(dataset1, latitude_range == h)[, c("longitude_gps", "latitude_gps")]),
                         p,
                         longlat = TRUE)  # in km
  return(which(distances <= dist))  # distance is now set to 200 meters
}
# 8) we assign an ID to dataset1, restarting at every slice.
# we need this to later match the data again
dataset1[, ID2 := sequence(.N), by = latitude_range]
# 9) here we loop over all the slices and for every point check if there is a feature nearby in the slice it falls in.
# to be on the safe side we also check the slices left and right of it, to make sure we do not miss features that
# are nearby but fall in a different slice.
# 9a: create an empty list we fill with dataframes later
TT<-vector("list", length=n.splits)
# 9b: loop over the number of slices using above defined function
for(i in 1:n.splits){
  datapoints.near.feature <- apply(data.matrix(dataset2[range2 %in% c(i-1, i, i+1), c("longitude_feature", "latitude_feature")]), 1, checkdatapoint, h = i)
  # 9c: if in that slice there was no match between a GPS coordinate and a nearby feature, we create an empty list entry
  if(class(datapoints.near.feature) == "integer" | class(datapoints.near.feature) == "matrix"){
    TT[[i]] <- NULL
  } else {
    # 9d: if there was a match we get a named list of data points
    names(datapoints.near.feature) <- dataset2[range2 %in% c(i-1, i, i+1), name]
    # 9e: then we 'melt' this list into a data.frame
    temp <- melt(datapoints.near.feature)
    # 9f: then we transform it into a data.table and change the names
    setDT(temp)
    setnames(temp, old = c("value", "L1"), new = c("value", "feature_name"))
    # 9h: then we only select the data points in dataset1 that fall in the current slice, give them an
    # ID and merge them with the file of nearby bus stops
    gpsdata.f <- dataset1[latitude_range == i, ]
    gpsdata.f[, rowID2 := seq_len(.N)]
    setkey(gpsdata.f, key = "rowID2")
    setkey(temp, key = "value")
    GPS.joined.temp <- merge(x = gpsdata.f, y = temp, by.x = "rowID2", by.y = "value", all.x = TRUE)
    # 9i: we only keep the unique entries and for every slice save them to the list
    GPS.joined.unique.temp <- unique(GPS.joined.temp, by = c("id_dataset1", "feature_name"))
    TT[[i]] <- GPS.joined.unique.temp
    cat(paste0(round(i/n.splits*100), '% completed'), " \r"); flush.console()
    #cat(i/n.splits*100, " \r"); flush.console()
  }
}
# 10) now we left join the original dataset and the data points that are near a feature
finallist<- merge(x = dataset1,
y = rbindlist(TT[vapply(TT, Negate(is.null), NA)]),
by.x= "id_dataset1",
by.y= "id_dataset1",
all.x=TRUE)
# 11) we add a new logical variable to check if any bus stop is near
finallist[, nearby := TRUE][is.na(feature_name), nearby := FALSE] # add a dummy to check if any bus stop is nearby.
# 12) if a point is near multiple features at once, these are listed in a vector
# instead of having duplicate rows with the same id but different features
finallist <- unique(setDT(finallist)[order(id_dataset1, feature_name), list(feature_name=list(feature_name), id=id_dataset1, lat=latitude_gps.x, lon=longitude_gps.x, nearby=nearby), by=id_dataset1], by="id_dataset1")
return(finallist)
}
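To illustrate the duplicate-coordinate remark above, a minimal hypothetical sketch (not part of the function): run the expensive check once per unique coordinate pair, then join the result back onto all rows.
library(data.table)
gps <- data.table(id  = 1:6,
                  lat = c(52.1, 52.1, 52.3, 52.3, 52.3, 52.9),
                  lon = c(5.29, 5.29, 5.10, 5.10, 5.10, 5.50))
coords <- unique(gps[, .(lat, lon)])      # 3 unique pairs instead of 6 rows
coords[, nearby := c(TRUE, FALSE, TRUE)]  # stand-in for the expensive distance check
gps_flagged <- coords[gps, on = c("lat", "lon")]  # flags joined back onto all 6 rows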

Difficulty combining lists, characters, and numbers into data frame

I'm lost on how to combine my data into a usable data frame. I have a list of lists of character and number vectors. Here is a working example of my code so far:
remove(list=ls())
# Headers for each of my column names
headers <- c("name","p","c","prophylaxis","control","inclusion","exclusion","conversion excluded","infection criteria","age criteria","mean age","age sd")
#_name = author and year
#_p = no. in experimental arm.
#_c = no. in control arm
#_abx = antibiotic used
#_con = control used
#_inc = inclusion criteria
#_exc = exclusion criteria
#_coexc = was conversion to open excluded?
#_infxn = infection criteria
#_agecrit = age criteria
#_agemean = mean age of study
#_agesd = sd age of study
# Passos 2016
passos_name <- c("Passos","2016")
passos_p <- 50
passos_c <- 50
passos_abx <- "cefazolin 1g at induction"
passos_con <- "none"
passos_inc <- c("elective LC","symptomatic cholelithiasis","low risk")
passos_exc <- c("renal impairment","hepatic impairment","immunosuppression","regular steroid use","antibiotics within 48H","acute cholecystitis","choledocolithiasis")
passos_coexc <- TRUE
passos_infxn <- c("temperature >37.8C","tachycardia","asthenia","local pain","local purulence")
passos_agecrit <- NULL
passos_agemean <- 48
passos_agesd <- 13.63
passos <- list(passos_name,passos_p,passos_c,passos_abx,passos_con,passos_inc,passos_exc,passos_coexc,passos_infxn,passos_agecrit,passos_agemean,passos_agesd)
names(passos) <- headers
# Darzi 2016
darzi_name <- c("Darzi","2016")
darzi_p <- 182
darzi_c <- 247
darzi_abx <- c("cefazolin 1g 30min prior to induction","cefazolin 1g 6H after induction","cefazolin 1g 12H after induction")
darzi_con <- "NaCl"
darzi_inc <- c("elective LC","first time abdominal surgery")
darzi_exc <- c("antibiotics within 7 days","immunosuppression","acute cholecystitis","choledocolithiasis","cholangitis","obstructive jaundice",
"pancreatitis","previous biliary tract surgery","previous ERCP","DM","massive intraoperative bleeding","antibiotic allergy","major thalassemia",
"empyema")
darzi_coexc <- TRUE
darzi_infxn <- c("temperature >38C","local purulence","intra-abdominal collection")
darzi_agecrit <- c(">18", "<75")
darzi_agemean <- 43.75
darzi_agesd <- 13.30
darzi <- list(darzi_name,darzi_p,darzi_c,darzi_abx,darzi_con,darzi_inc,darzi_exc,darzi_coexc,darzi_infxn,darzi_agecrit,darzi_agemean,darzi_agesd)
names(darzi) <- headers
# Matsui 2014
matsui_name <- c("Matsui","2014")
matsui_p <- 504
matsui_c <- 505
matsui_abx <- c("cefazolin 1g at induction","cefazolin 1g 12H after induction","cefazolin 1g 24H after induction")
matsui_con <- "none"
matsui_inc <- "elective LC"
matsui_exc <- c("emergent","concurrent surgery","regular insulin use","regular steroid use","antibiotic allergy","HD","antibiotics within 7 days","hepatic impairment","chemotherapy")
matsui_coexc <- FALSE
matsui_infxn <- c("local purulence","intra-abdominal collection","distant infection","temperature >38C")
matsui_agecrit <- ">18"
matsui_agemean <- NULL
matsui_agesd <- NULL
matsui <- list(matsui_name,matsui_p,matsui_c,matsui_abx,matsui_con,matsui_inc,matsui_exc,matsui_coexc,matsui_infxn,matsui_agecrit,matsui_agemean,matsui_agesd)
names(matsui) <- headers
# Find unique exclusion criteria in order to create the list of all possible levels
exc <- ls()[grepl("_exc",ls())]
exclist <- sapply(exc,get)
exc.levels <- unique(unlist(exclist,use.names = F))
# Find unique inclusion criteria in order to create the list of all possible levels
inc <- ls()[grepl("_inc",ls())]
inclist <- sapply(inc,get)
inc.levels <- unique(unlist(inclist,use.names = F))
# Find unique antibiotics order to create the list of all possible levels
abx <- ls()[grepl("_abx",ls())]
abxlist <- sapply(abx,get)
abx.levels <- unique(unlist(abxlist,use.names = F))
# Find unique controls in order to create the list of all possible levels
con <- ls()[grepl("_con",ls())]
conlist <- sapply(con,get)
con.levels <- unique(unlist(conlist,use.names = F))
# Find unique age criteria in order to create the list of all possible levels
agecrit <- ls()[grepl("_agecrit",ls())]
agecritlist <- sapply(agecrit,get)
agecrit.levels <- unique(unlist(agecritlist,use.names = F))
I have been struggling with:
1) Turning each of the _exc, _inc, _abx, _con, _agecrit lists into factors using the levels generated at the end of the code block. I have been trying to use a for loop such as:
for (x in exc) {
as.name(x) <- factor(get(x),levels = exc.levels)
}
This only creates a variable, x, that stores the last parsed list as a factor.
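For what it's worth, the usual way to make such a loop write back to the variable named by the string is assign() (a sketch; whether per-variable factors are the right container is a separate question):
for (x in exc) {
  assign(x, factor(get(x), levels = exc.levels))  # writes the factor back to the variable named in x
}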
2) Combining all of my data into a data frame formatted as such:
name, p, c, prophylaxis, control, inclusion, exclusion, conversion excluded, infection criteria, age criteria, mean age, age sd
"Passos 2016", 50, 50, "cefazolin 1g at induction", "none", ["elective LC","symptomatic cholelithiasis","low risk"], ["renal impairment","hepatic impairment","immunosuppression","regular steroid use","antibiotics within 48H","acute cholecystitis","choledocolithiasis"], TRUE, ["temperature >37.8C","tachycardia","asthenia","local pain","local purulence"], NULL, 48, 13.63
...
# [] = factors
# columns correspond to each studies variables (i.e. passos_name, passos_p, passos_c, etc..)
# rows correspond to each study (i.e., passos, darzi, matsui)
I have tried various solutions on StackOverflow, but have not found any that work; for example:
studies <- list(passos,darzi,matsui,ruangsin,turk,naqvi,hassan,sharma,uludag,yildiz,kuthe,koc,maha,tocchi,higgins,mahmoud,kumar)
library(data.table)
rbindlist(lapply(studies,as.data.frame.list))
I suspect my data may not be exactly amenable to a data frame, primarily because of trying to store a list of factors in a column. Is that allowed? If not, how is this type of data normally stored? My goal is to be able to meaningfully compare these various criteria across studies.
This is too long for a comment, so I turn it into an "answer":
To start with, have a look at what happens here:
data.frame(name = "Passos, 2016", p = 50)
          name  p
1 Passos, 2016 50
data.frame(name = c("Passos", "2016"), p = 50)
    name  p
1 Passos 50
2   2016 50
In the first one, we created a dataframe with the column "name", which contained one entry "Passos, 2016", i.e. one character string containing both pieces of information, and the column "p". All fine. Now, in the second version, I specified the column "name" as you did above, using c("Passos", "2016"). This is a two-element vector, and hence we get two rows in the dataframe: one with name Passos, one with name 2016, and the column p gets recycled.
Clearly, the latter is probably not what you intended. But it works anyway because R just recycles the shorter vector. Now, what do you think happens if I add a vector that contains three elements?
And this highlights the main issue with what you are doing: you are trying to build a dataframe from many vectors with different lengths. In some cases this is fine, namely when you want the shorter vector to be repeated (in R speech, we call this "recycling"), but it does not look like something you want to do here.
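For instance, with made-up lengths:
data.frame(name = c("Passos", "2016"), inc = c("a", "b", "c"))
# Error in data.frame(name = c("Passos", "2016"), inc = c("a", "b", "c")) :
#   arguments imply differing number of rows: 2, 3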
So, my recommendation would be this: try to imagine a matrix and make sure you understand what each element (row and column) is supposed to be. Then specify your data accordingly. If in doubt, look up "tidy data".
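For example, a tidy layout for the criteria would use one row per study/criterion pair instead of list-valued cells (a sketch with illustrative values):
exclusions <- data.frame(
  study     = c("Passos 2016", "Passos 2016", "Darzi 2016"),
  criterion = c("renal impairment", "immunosuppression", "immunosuppression")
)
# comparing criteria across studies then becomes a simple cross-tabulation
table(exclusions$criterion, exclusions$study)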

Calculating grades in R

I am calculating final averages for a course. There are about 500 students, and the grades are organized into a .csv file. Column headers include:
Name, HW1, ..., HW10, Quiz1, ..., Quiz5, Exam1, Exam2, Final
Each is weighted differently, and that shouldn't be an issue to program. However, the lowest 2 HW and the lowest Quiz are dropped for each student. How could I program this in R? Note that the HW/Quiz dropped may differ between students (i.e. Student A has HW2, HW5, Quiz2 dropped, while Student B has HW4, HW8, Quiz1 dropped).
Here is a simpler solution. The sum_after_drop function takes a vector x, drops the i lowest scores, and sums up the rest. We invoke this function for each row in the dataset. ddply is overkill for this job, but keeps things simple. You should be able to do this with apply, except that you would then have to convert the end result to a data frame.
The actual grade calculations can then be carried out on dd2. Note that using the cut function with breaks is a simple way to get letter grades from the total scores.
library(plyr)
sum_after_drop <- function(x, i){
  sum(sort(unlist(x))[-(1:i)])  # drop the i lowest scores, sum the rest
}
nms <- names(dd)  # dd = the gradebook data frame
dd2 <- ddply(dd, .(Name), function(d){
  hw = sum_after_drop(d[, grepl("HW", nms)], 2)    # drop the lowest two homeworks
  qz = sum_after_drop(d[, grepl("Quiz", nms)], 1)  # drop the lowest quiz
  data.frame(hw = hw, qz = qz)
})
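For instance, the cut() idea mentioned above could look like this (the grade boundaries are illustrative):
totals <- c(55, 68, 83, 91)  # made-up total scores
cut(totals,
    breaks = c(-Inf, 60, 70, 80, 90, Inf),
    labels = c("F", "D", "C", "B", "A"))
# [1] F D B A
# Levels: F D C B A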
Here's a sketch of how you could approach it using the reshape2 package and base functions.
# sample data
set.seed(734)
dd <- data.frame(
  Name  = letters[1:20],
  HW1   = rpois(20, 7),
  HW2   = rpois(20, 7),
  HW3   = rpois(20, 7),
  Quiz1 = rpois(20, 15),
  Quiz2 = rpois(20, 15),
  Quiz3 = rpois(20, 15)
)
Now I convert it to long format and split apart the field names
require(reshape2)
mm <- melt(dd, "Name")
mm <- cbind(mm,
            colsplit(gsub("(\\D+)(\\d+)", "\\1:\\2", mm$variable, perl = TRUE), ":",
                     names = c("type", "number")))
Now I can use by() to get a data.frame for each name and do the rest of the calculations. Here I just drop the lowest homework and the lowest quiz, and I give homework a weight of .2 and quizzes a weight of .8 (assuming all homeworks were worth 15 pts and quizzes 25 pts).
grades <- unclass(by(mm, mm$Name, function(x) {
  hw <- tail(sort(x$value[x$type == "HW"]), -1)
  quiz <- tail(sort(x$value[x$type == "Quiz"]), -1)
  (sum(hw) * .2 + sum(quiz) * .8) / (length(hw) * 15 * .2 + length(quiz) * 25 * .8)
}))
attr(grades, "call") <- NULL  # get rid of crud from by()
grades
Let's check our work. Look at student "c"
Name HW1 HW2 HW3 Quiz1 Quiz2 Quiz3
c 6 9 7 21 20 14
Their grade should be
((9+7)*.2+(21+20)*.8) / ((15+15)*.2 + (25+25)*.8) = 0.7826087
and in fact, we see
grades["c"] == 0.7826087
Here's a solution with dplyr. It ranks the scores by student and type of assignment (i.e. calculates the rank order of all of student 1's homeworks, etc.), then filters out the lowest 1 (or 2, or whatever). dplyr's syntax is pretty intuitive, so you should be able to walk through the code fairly easily.
# Load libraries
library(reshape2)
library(dplyr)
# Sample data
grades <- data.frame(name=c("Sally", "Jim"),
HW1=c(10, 9),
HW2=c(10, 5),
HW3=c(5, 10),
HW4=c(6, 9),
HW5=c(8, 9),
Quiz1=c(9, 5),
Quiz2=c(9, 10),
Quiz3=c(10, 8),
Exam1=c(95, 96))
# Melt into long form
grades.long <- melt(grades, id.vars = "name", variable.name = "graded.name") %>%
  mutate(graded.type = factor(sub("\\d+", "", graded.name)))
grades.long
# Remove the lowest scores for each graded type
grades.filtered <- grades.long %>%
  group_by(name, graded.type) %>%
  mutate(ranked.score = rank(value, ties.method = "first")) %>%  # Rank all the scores
  filter((ranked.score > 2 & graded.type == "HW") |    # Ignore the lowest two HWs
         (ranked.score > 1 & graded.type == "Quiz") |  # Ignore the lowest quiz
         (graded.type == "Exam"))
grades.filtered
# Calculate the average for each graded type
grade.totals <- grades.filtered %>%
  group_by(name, graded.type) %>%
  summarize(total = mean(value))
grade.totals
# Unmelt, just for fun
final.grades <- dcast(grade.totals, name ~ graded.type, value.var = "total")
final.grades
You could technically add the summarize(total = mean(value)) step to grades.filtered rather than making a separate grade.totals data frame; I separated them into multiple data frames for didactic reasons.
