Aggregating a CSV file of day-wise data into monthly data in R

I have a CSV file named crime.csv, shown below:
OFFENSE_CODE OFFENSE_TYPE OFFENSE_DESCRIPTION DISTRICT DAY YEAR MONTH STREET NO_OF_CRIME
1106 Confidence Games FRAUD - CREDIT CARD / ATM FRAUD B2 1 2018 2 WASHINGTON ST 1
3201 Property Lost PROPERTY - LOST B2 15 2018 3 ELM HILL AVE 1
1001 Counterfeiting FORGERY / COUNTERFEITING D4 9 2018 1 TREMONT ST 1
2629 Harassment HARASSMENT E5 1 2018 1 CROWN POINT DR 1
1001 Counterfeiting FORGERY / COUNTERFEITING E5 8 2018 4 REDGATE RD 1
1106 Confidence Games FRAUD - CREDIT CARD / ATM FRAUD D4 22 2018 2 BOYLSTON ST 1
2629 Harassment HARASSMENT B2 9 2017 10 AKRON ST 1
1102 Fraud FRAUD - FALSE PRETENSE / SCHEME A7 25 2018 4 LIVERPOOL ST 1
3201 Property Lost PROPERTY - LOST D14 1 2018 1 FIDELIS WAY 1
1106 Confidence Games FRAUD - CREDIT CARD / ATM FRAUD E5 12 2018 4 SPRING ST 1
3201 Property Lost PROPERTY - LOST A1 30 2018 4 NASHUA ST 1
I need to aggregate this data into monthly data based on OFFENSE_CODE, so that NO_OF_CRIME is summed for each month. Any help would be really great.

Assuming x is the name of the dataframe:
aggregate(x$NO_OF_CRIME, by = list(x$OFFENSE_CODE, x$MONTH), FUN = sum)
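The formula interface of aggregate() does the same thing while keeping readable column names in the result; adding YEAR to the grouping avoids mixing the same month from different years, since the sample data contains both 2017 and 2018 (a sketch against the same data frame x):

aggregate(NO_OF_CRIME ~ OFFENSE_CODE + YEAR + MONTH, data = x, FUN = sum)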

We can use the tidyverse:
library(tidyverse)
df1 %>%
  group_by(OFFENSE_CODE, YEAR, MONTH) %>%
  mutate(SUM_NO_OF_CRIME = sum(NO_OF_CRIME))
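Note that mutate() keeps every original row and attaches the group sum to each of them. If you instead want one row per offense code and month, summarise() collapses each group (same df1 assumption as above):

df1 %>%
  group_by(OFFENSE_CODE, YEAR, MONTH) %>%
  summarise(SUM_NO_OF_CRIME = sum(NO_OF_CRIME), .groups = "drop")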


List of data frames: creating a new column with normalisation values for each data frame

I'm new to R and mostly work with data frames. A frequent task is to normalise counts for several parameters from several data frames. I have a demo dataset:
dataset:
Season Product Quality Sales
Winter Apple   bad       345
Winter Apple   good       13
Winter Potato  bad        23
Winter Potato  good       66
Winter Beer    bad       345
Winter Beer    good       34
Summer Apple   bad        88
Summer Apple   good       90
Summer Potato  bad       123
Summer Potato  good      457
Summer Beer    bad        44
Summer Beer    good      546
What I want to do is add a column "FC" (fold change) for "Sales". FC must be calculated for each "Season" and "Product" according to "Quality", with "bad" as the baseline.
Desired result:
Season Product Quality Sales    FC
Winter Apple   bad       345  1.00
Winter Apple   good       13  0.04
Winter Potato  bad        23  1.00
Winter Potato  good       66  2.87
Winter Beer    bad       345  1.00
Winter Beer    good       34  0.10
Summer Apple   bad        88  1.00
Summer Apple   good       90  1.02
Summer Potato  bad       123  1.00
Summer Potato  good      457  3.72
Summer Beer    bad        44  1.00
Summer Beer    good      546 12.41
One way to do it is to filter first by "Season" and then by "Product" (e.g. creating a subset data frame subset_winter_apple) and then calculate FC like this:
subset_winter_apple$FC = subset_winter_apple$Sales / subset_winter_apple$Sales[1]
Later on, I can combine all the subset data frames again, e.g. using rbind(), to reconstitute the original data frame with the FC column. However, this is highly inefficient, so I thought of splitting the data frame into a list:
split(
  dataset,
  list(dataset$Season, dataset$Product)
)
However, now I struggle with the normalisation (FC calculation), as I do not know how to reference the first cell value of "Sales" in each data frame of the list so that each value in that column is normalised within its own data frame. Using lapply, I did manage to calculate an FC value for the list, but it is an exact copy of the first data frame's values in every listed data frame:
lapply(
  dataset,
  function(DF){ DF$FC = dataset[[1]]$Sales / dataset[[1]]$Sales[1]; DF }
)
Clearly, I do not know how to reference the first cell in a specific column to normalise the entire column for each listed data frame. Can somebody please help me?
Many thanks in advance for your suggestions.
dplyr solution
Using logical indexing within a grouped mutate():
library(dplyr)

dataset %>%
  group_by(Season, Product) %>%
  mutate(FC = Sales / Sales[Quality == "bad"]) %>%
  ungroup()
# A tibble: 12 × 5
   Season Product Quality Sales      FC
   <chr>  <chr>   <chr>   <int>   <dbl>
 1 Winter Apple   bad       345  1
 2 Winter Apple   good       13  0.0377
 3 Winter Potato  bad        23  1
 4 Winter Potato  good       66  2.87
 5 Winter Beer    bad       345  1
 6 Winter Beer    good       34  0.0986
 7 Summer Apple   bad        88  1
 8 Summer Apple   good       90  1.02
 9 Summer Potato  bad       123  1
10 Summer Potato  good      457  3.72
11 Summer Beer    bad        44  1
12 Summer Beer    good      546 12.4
Base R solution
Using by():
dataset <- by(
  dataset,
  list(dataset$Season, dataset$Product),
  \(x) transform(x, FC = Sales / Sales[Quality == "bad"])
)
dataset <- do.call(rbind, dataset)
dataset[order(as.numeric(rownames(dataset))), ]
   Season Product Quality Sales          FC
1  Winter   Apple     bad   345  1.00000000
2  Winter   Apple    good    13  0.03768116
3  Winter  Potato     bad    23  1.00000000
4  Winter  Potato    good    66  2.86956522
5  Winter    Beer     bad   345  1.00000000
6  Winter    Beer    good    34  0.09855072
7  Summer   Apple     bad    88  1.00000000
8  Summer   Apple    good    90  1.02272727
9  Summer  Potato     bad   123  1.00000000
10 Summer  Potato    good   457  3.71544715
11 Summer    Beer     bad    44  1.00000000
12 Summer    Beer    good   546 12.40909091
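For completeness, the split()/lapply() route from the question also works once the anonymous function uses its own piece DF instead of reaching back into dataset, indexing the "bad" row within each piece just as the answers above do (row order in the result will differ from the original, as with the by() version):

split_list <- split(dataset, list(dataset$Season, dataset$Product))
split_list <- lapply(
  split_list,
  function(DF) { DF$FC <- DF$Sales / DF$Sales[DF$Quality == "bad"]; DF }
)
do.call(rbind, split_list)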

Merging different datasets

I have a question: I need to merge two datasets into one, but they have different classes. How can I do it? rbind doesn't work; any ideas?
nycounties <- rgdal::readOGR("https://raw.githubusercontent.com/openpolis/geojson-italy/master/geojson/limits_IT_provinces.geojson")
city <- c("Novara", "Milano","Torino","Bari")
dimension <- c("150000", "5000000","30000","460000")
df <- cbind(city, dimension)
total <- rbind(nycounties,df)
Are you looking for something like this?
nycounties@data = data.frame(nycounties@data,
                             df[match(nycounties@data[, "prov_name"],
                                      df[, "city"]), ])
Output
nycounties@data[!is.na(nycounties@data$dimension), ]
prov_name prov_istat_code_num prov_acr reg_name reg_istat_code reg_istat_code_num prov_istat_code city dimension
0 Torino 1 TO Piemonte 01 1 001 Torino 30000
2 Novara 3 NO Piemonte 01 1 003 Novara 150000
12 Milano 15 MI Lombardia 03 3 015 Milano 5000000
81 Bari 72 BA Puglia 16 16 072 Bari 460000
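If you'd rather do the join in one step, sp also provides a merge() method for Spatial*DataFrame objects that keeps the geometry intact (a sketch, assuming the sp package is available; note that df must be a real data frame rather than the character matrix cbind() produces):

library(sp)
df <- data.frame(city = city, dimension = as.numeric(dimension))
total <- merge(nycounties, df, by.x = "prov_name", by.y = "city", all.x = TRUE)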

Manipulating R Data Frames

I've currently got two separate data frames, excerpts as per below:
mydata
Player TG% Pts Team Opp Yr Rd Grnd
John 56 42 A 1 2015 1 Grnd1
James 94 64 B 2 2015 1 Grnd2
Jerry 85 78 C 3 2015 1 Grnd3
Daniel 97 51 D 4 2015 1 Grnd4
John 89 61 A 1 2015 1 Grnd2
James 65 26 B 4 2015 1 Grnd3
Jerry 73 34 C 3 2015 1 Grnd2
Daniel 73 40 D 2 2015 1 Grnd2
John 89 26 A 1 2015 1 Grnd3
James 92 42 B 3 2015 1 Grnd1
Jerry 89 25 C 2 2015 1 Grnd2
Daniel 80 41 D 4 2015 1 Grnd2
John 73 82 A 3 2015 1 Grnd3
James 73 41 B 4 2015 1 Grnd3
Jerry 89 76 C 2 2015 1 Grnd1
Daniel 91 77 D 1 2015 1 Grnd2
round
Team Opp Grnd
A 1 Grnd1
B 3 Grnd4
C 4 Grnd2
D 2 Grnd3
What I want to be able to do is manipulate this to generate a second data frame, as per below:
Player Gms Avg.Pts Avg.Last3 Avg.v.Opp Avg.#.Grnd
John
James
Jerry
Daniel
I know how to do this in Excel; however, I'm struggling in R:
Gms - total number of games for each individual player (excel would be countif)
Avg.Pts - this is the average of Pts for each Player name (excel would be averageif)
Avg.Last3 - this is the average of Pts for each Player in their last 3 games, note that the data frame is in order with most recent games at the end of the data frame.
Avg.v.Opp - this is the average of Pts for each player against the next opponent as defined in data frame round. For example John plays for team A and his next opponent is Opp 1. (excel would be averageifs)
Avg.#.Grnd - this is the average of Pts for each player at the next ground as defined in data frame round. For example John plays for team A and his next game is held at Grnd1. (excel would be averageifs)
I've tried using dplyr and a number of other options but haven't managed to put together something that works at this stage. Note that the mydata data frame runs to 10,000+ rows.
I think this will work. If you share your sample data with dput(), I'll be happy to copy/paste it and check (and debug if necessary).
First I'll do the easy ones, the ones that don't depend on round:
library(dplyr)

group_by(mydata, Player) %>%
  summarize(Gms = n(),
            Avg.Pts = mean(Pts),
            Avg.Last3 = mean(tail(Pts, 3)))
I wanted to do that one separately to emphasize how clean dplyr can be for simple cases. All the "ifs" in your Excel commands are taken care of by the single group_by at the beginning. n() is the count, and mean() is the average. tail() is a handy base function that returns the end of a data frame or vector.
To add in the round data, we'll want to join the data frames together based on the Team column. We'll still want to be able to tell which columns come from mydata and which from round, so I'll rename the round columns:
round = rename(round, next_opp = Opp, next_grnd = Grnd)
Then we'll start with the join and proceed as before. This time we do need some ifs at the end, which I'll do with a simple subset inside the mean calls:
left_join(mydata, round, by = "Team") %>%
  # convert ground columns to character as discussed in comments
  mutate(next_grnd = as.character(next_grnd),
         Grnd = as.character(Grnd)) %>%
  group_by(Player) %>%
  summarize(Gms = n(),
            Avg.Pts = mean(Pts),
            Avg.Last3 = mean(tail(Pts, 3)),
            Avg.v.Opp = mean(Pts[Opp == next_opp]),
            Avg.at.Grnd = mean(Pts[Grnd == next_grnd]))
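One caveat worth noting: if a player has no games against the next opponent or at the next ground, the subset inside mean() is empty, and mean() of an empty numeric vector returns NaN. If that matters downstream, guard the subset, e.g.:

Avg.v.Opp = ifelse(any(Opp == next_opp), mean(Pts[Opp == next_opp]), NA_real_)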

"for" loop in R and checking previous value from a column

I'm working on a data frame. Here's what it looks like:
shape id day hour week id footfall category area name
22496 22/3/14 3 12 634 Work cluster CBD area 1
22670 22/3/14 3 12 220 Shopping cluster Orchard Road 1
23287 22/3/14 3 12 723 Airport Changi Airport 2
16430 22/3/14 4 12 947 Work cluster CBD area 2
4697 22/3/14 3 12 220 Residential area Ang Mo Kio 2
4911 22/3/14 3 12 1001 Shopping cluster Orchard Rd 3
11126 22/3/14 3 12 220 Residential area Ang Mo Kio 2
and so on, up to 635 rows in total.
The other dataset that I want to compare with can be found here. Here's what it looks like:
category Foreigners Locals
Work cluster 1600000 3623900
Shopping cluster 1800000 3646666.667
Airport 15095152 8902705
Residential area 527700 280000
They both share the same attribute, i.e. category.
I want to look up the previous hour from the column hour in the first dataset so I can compare it with the values from the second dataset.
Here's what I ideally want to do in R:
# for (n in 1:number_of_rows) {
#   get the previous hour from the IDA dataset
#   calculate hourSum - previousHour = newHourSum and store it as newHourSum
#   calculate hour/(newHourSum - previousHour) * Foreigners and store it as footfallHour
#   add to the empty data frame
# }
I'm not sure how to do that, and here's what I tried:
tbl1 <- secondDataset
tbl2 <- firstDataset

mergetbl <- function(tbl1, tbl2)
{
  newtbl = data.frame(hour = numeric(), forgHour = numeric(), locHour = numeric())
  ntbl1rows <- nrow(tbl1)  # get the number of rows
  for (n in 1:ntbl1rows)
  {
    # get the previousHour
    newHourSum <- tbl1$hour - previousHour
    footfallHour <- (tbl1$hour / (newHourSum - previousHour)) * tbl2$Foreigners
    # add to newtbl
  }
}
This is what I expected:
shape id day hour week id footfall category area name forgHour locHour
22496 22/3/14 3 12 634 Work cluster CBD area 1 1 12
22670 22/3/14 3 12 220 Shopping cluster Orchard Road 1 21 25
23287 22/3/14 3 12 723 Airport Changi Airport 2 31 34
16430 22/3/14 4 12 947 Work cluster CBD area 2 41 23
4697 22/3/14 3 12 220 Residential area Ang Mo Kio 2 51 23
4911 22/3/14 3 12 1001 Shopping cluster Orchard Rd 3 61 45
11126 22/3/14 3 12 220 Residential area Ang Mo Kio 2 72 54
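A minimal dplyr sketch of the pseudocode above (assumptions: previousHour is just the hour from the preceding row, taken with lag(), so it is NA on the first row; Foreigners and Locals come in by joining on the shared category column; the arithmetic mirrors the pseudocode literally and will likely need adjusting to the intended formula):

library(dplyr)

firstDataset %>%
  mutate(previousHour = lag(hour)) %>%
  left_join(secondDataset, by = "category") %>%
  mutate(newHourSum = hour - previousHour,
         forgHour = hour / (newHourSum - previousHour) * Foreigners,
         locHour = hour / (newHourSum - previousHour) * Locals)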

Calculating rows with the same title

Since my other question got closed, here is the required data.
What I'm trying to do is have R sum the last column, count, against the column city so I can map the data, showing for example how many participants (in count) are in the state of Hawaii (HI). I would need some code to match these up.
zip city state latitude longitude count
96860 Pearl Harbor HI 24.859832 -168.021815 36
96863 Kaneohe Bay HI 21.439867 -157.74772 39
99501 Anchorage AK 61.216799 -149.87828 12
99502 Anchorage AK 61.153693 -149.95932 17
99506 Elmendorf AFB AK 61.224384 -149.77461 2
What I've tried is:
match <- c(match(datazip$state, datazip$number))
but I'm really helpless trying to find a solution, since I don't even know how to describe this in short. My plan afterwards is to make a choropleth map with the data, and believe me, by now I've seen almost all the pages that try to give advice. So your help is pretty much appreciated. Thanks
# I read your sample data to a data frame
> df
zip city state latitude longitude count
1 96860 Pearl_Harbor HI 24.85983 -168.0218 36
2 96863 Kaneohe_Bay HI 21.43987 -157.7477 39
3 99501 Anchorage AK 61.21680 -149.8783 12
4 99502 Anchorage AK 61.15369 -149.9593 17
5 99506 Elmendorf_AFB AK 61.22438 -149.7746 2
# If you want to sum the number of counts by state
library(plyr)
> ddply(df, .(state), transform, count2 = sum(count))
zip city state latitude longitude count count2
1 99501 Anchorage AK 61.21680 -149.8783 12 31
2 99502 Anchorage AK 61.15369 -149.9593 17 31
3 99506 Elmendorf_AFB AK 61.22438 -149.7746 2 31
4 96860 Pearl_Harbor HI 24.85983 -168.0218 36 75
5 96863 Kaneohe_Bay HI 21.43987 -157.7477 39 75
Maybe aggregate would be a nice and simple solution for you:
df
zip city state latitude longitude count
1 96860 Pearl Harbor HI 24.85983 -168.0218 36
2 96863 Kaneohe Bay HI 21.43987 -157.7477 39
3 99501 Anchorage AK 61.21680 -149.8783 12
4 99502 Anchorage AK 61.15369 -149.9593 17
5 99506 Elmendorf AFB AK 61.22438 -149.7746 2
aggregate(df$count, by = list(df$state), sum)
Group.1 x
1 AK 31
2 HI 75
aggregate(df$count, by = list(df$city), sum)
Group.1 x
1 Anchorage 29
2 Elmendorf AFB 2
3 Kaneohe Bay 39
4 Pearl Harbor 36
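For comparison, the same state-level sum in dplyr (a sketch, using the df shown above):

library(dplyr)

df %>%
  group_by(state) %>%
  summarise(count2 = sum(count))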
