Summing values for a month in R

Please see a data sample as follows:
3326 2015-03-03 Wm Eu Apple 2L 60
3327 2015-03-03 Tp Euro 2 Layer 420
3328 2015-03-03 Tpe 3-Layer 80
3329 2015-03-03 14/3 Bgs 145
3330 2015-03-04 T/P 196
3331 2015-03-04 Wm Eu Apple 2L 1,260
3332 2015-03-04 Tp Euro 2 Layer 360
3333 2015-03-04 14/3 Bgs 1,355
Currently, graphing this data creates a really horrible graph because the number of cartons changes so rapidly by day. It would make more sense to sum the cartons by month so that each data point represents a sum for that month rather than an individual day. The current range of the data is 11/01/2008-04/01/2015.
This is the code that I am using to graph (which may or may not be relevant for this):
ggvis(myfile, ~Shipment.Date, ~ctns) %>%
  layer_lines()
Shipment.Date is column 2 in the data set and ctns is the 4th column.
I don't know much about R and have given it a few tries with some code that I have found here, but I don't think I have found a problem similar enough to match the code. My idea is to create a new table, sum Act. Ctns for the month, save it as that new table, and graph from there.
Thanks for any assistance! :)

Do you need this?
data.aggregated <- aggregate(list(new.value = data$value),
                             by = list(date.time = cut(data$date.time, breaks = "1 month")),
                             FUN = sum)
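Applied to the asker's data, a minimal sketch might look like the following (Shipment.Date and ctns are the column names given in the question; myfile and the comma-cleaning step are assumptions):
library(ggvis)
# If ctns was read as text with thousands separators ("1,260"), clean it first
myfile$ctns <- as.numeric(gsub(",", "", myfile$ctns))
myfile$Shipment.Date <- as.Date(myfile$Shipment.Date)
# Sum cartons within each calendar month
monthly <- aggregate(list(ctns = myfile$ctns),
                     by = list(month = cut(myfile$Shipment.Date, breaks = "1 month")),
                     FUN = sum)
monthly$month <- as.Date(monthly$month)
# Graph the monthly totals instead of the daily values
ggvis(monthly, ~month, ~ctns) %>%
  layer_lines()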

Related

RStudio: Separate YYYY-MM-DD into Individual Columns

I am fairly new to R and I am pulling my hair out trying to do what is probably something super simple.
I downloaded the crime data for Los Angeles from 2010 - 2019. There are 2,114,010 rows of data. Right now, it is called 'df' in my Global Environment area.
I want to manipulate one specific column titled "Occurred" - which is a date reference to when the crime occurred.
Right now, it is set up as YYYY-MM-DD (i.e., 2010-02-20).
I am trying to separate all three into individual columns. I have Googled, and Googled, and Googled and tried and tried and tried things from this forum and StackExchange and just cannot get it to work.
I have tried lubridate and followed instructions from other answers, but it simply won't create new columns (one each for Year, Month, Day).
Here is a bit of the reprex from the dataset ... I did not include all of the different variables, because they aren't the issue.
As mentioned, I am trying to separate 'occurred' into individual Year, Month, and Day columns.
> head(df, 10)[c('dr_no','occurred','time','area_name')]
dr_no occurred time area_name
1 1307355 2010-02-20 1350 Newton
2 11401303 2010-09-12 45 Pacific
3 70309629 2010-08-09 1515 Newton
4 90631215 2010-01-05 150 Hollywood
5 100100501 2010-01-02 2100 Central
6 100100506 2010-01-04 1650 Central
7 100100508 2010-01-07 2005 Central
8 100100509 2010-01-08 2100 Central
9 100100510 2010-01-09 230 Central
10 100100511 2010-01-06 2100 Central
We can do this with tidyverse and lubridate
library(dplyr)
library(lubridate)
df <- df %>%
  mutate(occurred = as.Date(occurred),
         year  = year(occurred),
         month = month(occurred),
         day   = day(occurred))
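If the goal is literally to split the text into three columns, tidyr::separate() is an alternative sketch (assuming occurred is still stored as "YYYY-MM-DD" text; remove = FALSE keeps the original column):
library(tidyr)
df <- df %>%
  separate(occurred, into = c("year", "month", "day"),
           sep = "-", remove = FALSE, convert = TRUE)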

brownian.bridge slow calculation and Error in area.grid[1, 1] : incorrect number of dimensions

I am trying to calculate some BBMM.contours for caribou during a movement period in northern Canada.
I am still in the exploratory phase of using this function and have worked through some tutorials that worked fine, but now that I am trying my sample data, the brownian.bridge function seems to be taking an eternity.
I understand that this is a function that can take a long time to calculate, but I have tried subsetting my data to include fewer and fewer locations, simply to see whether the end product is what I want before committing to running the dataset with thousands of locations. Currently I have only 34 locations in the subset, and I have waited overnight for it to run without completion.
When I used some practice Panther location data with 1000 locations it took under a minute to run, so I am thinking there is something wrong with my code or my data.
Any help working through this would be greatly appreciated.
#Load data
data <- X2017loc
#Used to sort data in code below for all caribou
data$DT <- as.POSIXct(data$TimeStamp, format = '%Y-%m-%d %H:%M:%S')
#Sort data
data <- data[order(data$SAMPLED_ANIMAL_ID, data$DT), ]
#TIME DIFF NECESSARY IN BBMM CODE
###Joel is not sure about this part... Timelag is maybe time until GPS upload???
timediff <- diff(data$DT)
data <- data[-1, ]
data$timelag <- as.numeric(abs(timediff))
#Set timelag
data <- data[-1, ] #Remove first record with wrong timelag
data$SAMPLED_ANIMAL_ID <- factor(data$SAMPLED_ANIMAL_ID)
data <- data[!is.na(data$timelag), ]
data$LONGITUDE <- as.numeric(data$LONGITUDE)
data$LATITUDE <- as.numeric(data$LATITUDE)
BBMM <- brownian.bridge(x = data$LONGITUDE, y = data$LATITUDE, time.lag = data$timelag,
                        location.error = 6, cell.size = 30)
bbmm.summary(BBMM)
Additional information:
Timelag is in seconds and
Collars have 6m location error
I am not certain what the cell.size refers to and how I should determine this number.
SAMPLED_ANIMAL_ID LONGITUDE LATITUDE TimeStamp timelag
218 -143.3138219 68.2468358 2017-05-01 02:00 18000
218 -143.1637592 68.2687447 2017-05-01 07:00 18000
218 -143.0699697 68.3082906 2017-05-01 12:00 18000
218 -142.8352869 68.3182258 2017-05-01 17:00 18000
218 -142.7707111 68.2892111 2017-05-01 22:00 18000
218 -142.5362769 68.3394269 2017-05-02 03:00 18000
218 -142.4734997 68.3459528 2017-05-02 08:00 18000
218 -142.3682272 68.3801822 2017-05-02 13:00 18000
218 -142.2198042 68.4023253 2017-05-02 18:00 18000
218 -142.0235464 68.3968672 2017-05-02 23:00 18000
I would suggest using cell.size = 100 instead of area.grid, since for area.grid you would have to define a unique rectangular grid for all animals (which could increase compute time).
OK, I have answered my original question: I was missing the following code to reproject the lat/long coordinates to UTM.
data <- SpatialPoints(data[, c("LONGITUDE", "LATITUDE")], proj4string = CRS("+proj=longlat +ellps=WGS84"))
data <- spTransform(data, CRS("+proj=utm +zone=7 +ellps=WGS84"))
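A minimal sketch of how the projected coordinates might then be fed back into brownian.bridge (the object names pts, xy, and the saved timelag vector are assumptions; with UTM coordinates, location.error and cell.size are both in metres):
library(sp)
library(BBMM)
timelag <- data$timelag                      # save before converting data to SpatialPoints
pts <- SpatialPoints(data[, c("LONGITUDE", "LATITUDE")],
                     proj4string = CRS("+proj=longlat +ellps=WGS84"))
pts <- spTransform(pts, CRS("+proj=utm +zone=7 +ellps=WGS84"))
xy  <- coordinates(pts)                      # easting/northing in metres
BBMM <- brownian.bridge(x = xy[, 1], y = xy[, 2], time.lag = timelag,
                        location.error = 6,  # 6 m collar error
                        cell.size = 100)     # 100 m cells, as suggested above
bbmm.summary(BBMM)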

Many dataframes, different row lengths, similar columns and dataframe titles, how to bind?

This takes a bit to explain, and the post itself may be a bit too long to be answered.
I have MANY data frames of individual chess players and their specific ratings at points in time.
Here is what my data looks like. Please forgive my poor formatting in separating the datasets; Carlsen and Nakamura are separate dataframes.
Player1
Nakamura, Hikaru Year
2364 2001-01-01
2430 2002-01-01
2520 2003-01-01
2571 2004-01-01
2613 2005-01-01
2644 2006-01-01
2651 2007-01-01
2670 2008-01-01
2699 2009-01-01
2708 2010-01-01
2751 2011-01-01
2759 2012-01-01
2769 2013-01-01
2789 2014-01-01
2776 2015-01-01
2787 2016-01-01
Player2
Carlsen, Magnus Year
2127 2002-01-01
2279 2003-01-01
2484 2004-01-01
2553 2005-01-01
2625 2006-01-01
2690 2007-01-01
2733 2008-01-01
2776 2009-01-01
2810 2010-01-01
2814 2011-01-01
2835 2012-01-01
2861 2013-01-01
2872 2014-01-01
2862 2015-01-01
2844 2016-01-01
Between the data above and the chart below, I've deleted two columns and reassigned an observation as a column title.
Hikaru Nakamura/Magnus Carlsen's chess rating over time
Hikaru's data is assigned to a dataframe, Player1.
Magnus's data is assigned to a dataframe, Player2.
What I want to be able to do is get what you see below, a dataframe of them combined.
The code I used to produce this frame is
merged<- merge(Player1, Player2, by = c("Year"), all = TRUE)
Now, this is all fun and dandy for two data sets, but I am having very annoying difficulties adding more players to this combined data set.
For example, maybe I would like to add 5, 10, or 15 more players to this set. Examples of these players would be Kramnik, Anand, and Gelfand (famous chess players). As you'd expect, for 5 players the dataframe would have 6 columns, 10 would have 11, and 15 would have 16, all ordered nicely by the Year variable.
Fortunately, the number of observations for each Player is less than 100 always. Also, each individual player is assigned his/her own dataset.
For example,
Nakamura is the Player1 dataframe
Carlsen is the Player2 dataframe
Kramnik is the Player3 dataframe
Anand is the Player4 dataframe
Gelfand is the Player5 dataframe
all of which I have created using a for loop assigning process using this code
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
  assign(paste("Player", i, sep = ""),
         subset(Timed_set_filtered, Name == unique(Timed_set_filtered$Name)[i]))
}
I don't want to write out something like below:
merged<- merge(Player1, Player2,.....Player99 ,Player100, by = c("Year"), all = TRUE)
I want to be able to merge all 5, 10, 15...i number of Player"i" objects that I created in the loop together by Year.
Also, once it leaves the loop initially, each dataset looks like this.
So what ends up happening is that I assign all of the data sets to a list by using the following snippet:
lst <- mget(ls(pattern = '^Player\\d+'))
list2env(lapply(lst, `[`, -2), envir = .GlobalEnv)
lst <- mget(ls(pattern = '^Player\\d+'))
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
  names(lst[[i]])[names(lst[[i]]) == 'Rating'] <- eval(unique(Timed_set_filtered$Name)[i])
}
This is what my list looks like.
Is there a way to build a table merged by Year, so that it [cbinds, bind_cols, merges, etc.] each of the Player"i" dataframes in my list, which are not necessarily equal in length, in such a way that I get a combined/merged set like the one you saw below the merged(Player1, Player2) set?
Here is the diagram again, but it would have to be for many players, not just Carlsen and Nakamura.
Also, is there a way I can avoid using the list approach and just straight up do
names(Player"i") [names(Player"i") == 'Rating'] <- eval(unique(Timed_set_filtered$Name)[i])
which just renames the titles of all of the dataframes that start with "Player".
merge(player1, player2, player3,...., player99, player100, by = c("YEAR"), all = TRUE)
which would merge all of the "Player""i" datasets?
If anything is unclear, please mention it.
It was pretty funny that one line of code did the trick. After I assigned all of Player1, Player2, ..., Player i into the list, I just joined all of the sets contained in the list by Year.
The for loop that generates all of the unique datasets:
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
  assign(paste("Player", i, sep = ""),
         subset(Timed_set_filtered, Name == unique(Timed_set_filtered$Name)[i]))
}
Put them into a list:
lst <- mget(ls(pattern='^Player\\d+'))
Merge, or join, by the common value:
library(plyr)
df <- join_all(lst, by = 'Year')
Unfortunately, unlike merge(datasets..., all = TRUE), it drops certain observations for an unknown reason; I will have to see why this happens.
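A likely explanation, sketched here as an assumption: join_all() defaults to type = "left", which keeps only the Years present in the first data frame in the list. A full join, or a Reduce() over merge(), keeps every Year from every player:
library(plyr)
# Full join keeps all Years, mirroring merge(..., all = TRUE)
df <- join_all(lst, by = 'Year', type = 'full')
# Base-R equivalent over the whole list
df <- Reduce(function(x, y) merge(x, y, by = 'Year', all = TRUE), lst)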

R Programming - Sum Elements of Rows with Common Values

Hello, and thank you in advance for your assistance.
(PLEASE note the comments section for additional insight: i.e., the COST column in the example below was added to this question; Simon provides a great answer, but the COST column itself is not represented in his data response, although the function he provides works with the COST column.)
I have a data set, let's call it 'data', which looks like this:
NAME DATE COLOR PAID COST
Jim 1/1/2013 GREEN 150 100
Jim 1/2/2013 GREEN 50 25
Joe 1/1/2013 GREEN 200 150
Joe 1/2/2013 GREEN 25 10
What I would like to do is sum the PAID (and COST) elements of the records with the same NAME value and reduce the number of rows (as in this example) to 2, such that my new data frame looks like this:
NAME DATE COLOR PAID COST
Jim 1/2/2013 GREEN 200 125
Joe 1/2/2013 GREEN 225 160
As far as the dates are concerned, I don't really care about which one survives the summation process.
I've gotten as far as rowSums(data), but I'm not exactly certain how to use it. Any help would be greatly appreciated.
aggregate is the function you are looking for:
aggregate(cbind(PAID, COST) ~ NAME + COLOR, data = data, FUN = sum)
# NAME PAID
# 1 Jim 200
# 2 Joe 225
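As the note in the question points out, the output as posted omits the COLOR and COST columns; run on the sample data shown above, the same call should return all four (a sketch of the expected result):
aggregate(cbind(PAID, COST) ~ NAME + COLOR, data = data, FUN = sum)
#   NAME COLOR PAID COST
# 1  Jim GREEN  200  125
# 2  Joe GREEN  225  160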

easy way to subset data into bins

I have a data frame, seen below, with over 1000 rows. I would like to subset the data into bins at 1 m intervals (0-1 m, 1-2 m, etc.). Is there an easy way to do this without finding the minimum depth and using the subset command multiple times to place the data into the appropriate bins?
Temp..°C. Depth..m. Light time date
1 17.31 -14.8 255 09:08 2012-06-19
2 16.83 -21.5 255 09:13 2012-06-19
3 17.15 -20.2 255 09:17 2012-06-19
4 17.31 -18.8 255 09:22 2012-06-19
5 17.78 -13.4 255 09:27 2012-06-19
6 17.78 -5.4 255 09:32 2012-06-19
Assuming that the name of your data frame is df, do the following:
split(df, findInterval(df$Depth..m., floor(min(df$Depth..m.)):0))
You will then get a list where each element is a data frame containing the rows that have Depth..m. within a particular 1 m interval.
Notice, however, that empty bins will be removed. If you want to keep them, you can use cut instead of findInterval. The reason is that findInterval returns an integer vector, making it impossible for split to know what the set of valid bins is; it only knows the values it has seen and discards the rest. cut, on the other hand, returns a factor, which has all valid bins defined as levels.
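A minimal sketch of the cut() variant (assuming your data frame is df, as above); because cut() returns a factor with every bin as a level, empty bins survive as empty data frames in the result of split():
breaks <- floor(min(df$Depth..m.)):0               # 1 m break points, e.g. -22, -21, ..., 0
split(df, cut(df$Depth..m., breaks = breaks))      # empty intervals appear as empty data frames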
