Make a new column and categorize data from 2 datasets - r

I have two datasets. The first one is called Buildings, the data consists of each Building ID with its respective characteristics.
Building_ID Address Year BCR
1 Machida, TY 1994 80
2 Ueno, TY 1972 50
3 Asakusa, TY 1990 70
4 Machida, TY 1982 60
.
.
.
54634 Chiyoda, TY 2002 70
The second dataset is called Residential ID. It consists of a single column of Building IDs (the same IDs as in the 'Buildings' dataset), listing the buildings that have residential usage.
Building_ID
2
3
14
23
39
44
45
133
393
423
.
.
or something like that. What I want to do is make a new column in my first dataset based on my second dataset. I want to categorize which buildings are residential and which are not (basically, I want to select all the Building IDs mentioned in my second dataset and categorize them as Residential in my first dataset). If a building is residential, we can label it 'Residential', and otherwise 'NR', so it could look something like this:
Building_ID Address Year BCR Category
1 Machida, TY 1994 80 NR
2 Ueno, TY 1972 50 Residential
3 Asakusa, TY 1990 70 Residential
4 Machida, TY 1982 60 NR
.
.
.
54634 Chiyoda, TY 2002 70 NR
I was thinking it has something to do with ifelse or grepl but so far my code doesn't work.
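A minimal sketch of the %in%-plus-ifelse approach this hints at (the data frame names Buildings and Residential_ID are assumptions, not from the question):
# flag Building_IDs that appear in the residential list; everything else becomes 'NR'
Buildings$Category <- ifelse(Buildings$Building_ID %in% Residential_ID$Building_ID,
                             "Residential", "NR")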

Related

How can I sum values of 1 column based on the categories of another column, multiple times, in R?

I guess my question is a little strange; let me try to explain it. I need to solve a simple equation for a longitudinal database (29 consecutive years) about food availability and international commerce: (importations - exportations) / (production + importations - exportations) * 100 [the equation for the food dependence coefficient, from the FAO]. The big problem is that my database has the food products and their values of interest (production, importation and exportation) disaggregated, so I need to find a way to apply that equation to the sum of the values of interest for every year, so I can get the coefficient I need for each year.
My data frame looks like this:
element product year value (metric tons)
Production Wheat 1990 16
Importation Wheat 1990 2
Exportation Wheat 1990 1
Production Apples 1990 80
Importation Apples 1990 0
Exportation Apples 1990 72
Production Wheat 1991 12
Importation Wheat 1991 20
Exportation Wheat 1991 0
I guess the solution is pretty simple, but I'm not good enough in R to solve this problem by myself. Any help is very welcome.
Thanks!
require(data.table)
# dummy table. Use setDT(df) if yours isn't a data.table already
df <- data.table(element = rep(c('p', 'i', 'e'), 3)
                 , product = rep(c('w', 'a', 'w'), each = 3)
                 , year    = rep(c(1990, 1991), c(6, 3))
                 , value   = c(16, 2, 1, 80, 0, 72, 12, 20, 0)
); df
element product year value
1: p w 1990 16
2: i w 1990 2
3: e w 1990 1
4: p a 1990 80
5: i a 1990 0
6: e a 1990 72
7: p w 1991 12
8: i w 1991 20
9: e w 1991 0
# long to wide
df_1 <- dcast(df
              , product + year ~ element
              , value.var = 'value'
); df_1
# apply calculation
df_1[, food_depend_coef := (i-e) / (p+i-e)*100][]
product year e i p food_depend_coef
1: a 1990 72 0 80 -900.000000
2: w 1990 1 2 16 5.882353
3: w 1991 0 20 12 62.500000
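The question actually asks for one coefficient per year summed across all products; a hedged tweak of the same approach (not part of the original answer) is to let dcast sum over products first:
# sum the values over products within each year, then apply the same formula
df_year <- dcast(df, year ~ element, value.var = 'value', fun.aggregate = sum)
df_year[, food_depend_coef := (i - e) / (p + i - e) * 100][]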

Find and tag a number within a range

I have two dfs as below
>codes1
Country State City Start No End No
IN Telangana Hyderabad 100 200
IN Maharashtra Pune (Bund Garden) 300 400
IN Haryana Gurgaon 500 600
IN Maharashtra Pune 700 800
IN Gujarat Ahmedabad (Vastrapur) 900 1000
Now I want to tag each number below against table 1:
>codes2
ID No
1 157
2 346
3 389
4 453
5 562
6 9874
7 98745
Now I want to tag the numbers in the codes2 df as per the ranges given in the codes1 df for the No column; the expected output is
ID No Country State City
1 157 IN Telangana Hyderabad
2 346 IN Maharashtra Pune (Bund Garden)
.
.
.
Basically, I want to tag the No column in codes2 with codes1 according to the range (Start No and End No) that each No observation falls in.
Also, the order could be anything in the codes2 df.
You could use the non-equi join capability of the data.table package for that:
library(data.table)
setDT(codes1)
setDT(codes2)
codes2[codes1, on = .(No > StartNo, No < EndNo),            ## (1)
       `:=`(cntry = Country, state = State, city = City)]   ## (2)
(1) obtains matching row indices in codes2 corresponding to each row in codes1, while matching on the condition provided to the on argument.
(2) updates codes2 values for those matching rows for the columns specified directly by reference (i.e., you don't have to assign the result back to another variable).
This gives:
codes2
# ID No cntry state city
# 1: 1 157 IN Telangana Hyderabad
# 2: 2 346 IN Maharashtra Pune (Bund Garden)
# 3: 3 389 IN Maharashtra Pune (Bund Garden)
# 4: 4 453 NA NA NA
# 5: 5 562 IN Haryana Gurgaon
# 6: 6 9874 NA NA NA
# 7: 7 98745 NA NA NA
If you're comfortable writing SQL, you might consider using the sqldf package to do something like
library('sqldf')
result <- sqldf('select * from codes2 left join codes1 on codes2.No between codes1.StartNo and codes1.EndNo')
You may have to remove special characters and spaces from the column names of your data frames beforehand.
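For instance, a minimal cleanup step with data.table (assuming the range columns are literally named "Start No" and "End No"; setnames() also works on plain data frames):
# rename the range columns so they are syntactic and can be referenced in the join / SQL
setnames(codes1, c("Start No", "End No"), c("StartNo", "EndNo"))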

Manipulating R Data Frames

I've currently got two separate data frames, excerpts as per below:
mydata
Player TG% Pts Team Opp Yr Rd Grnd
John 56 42 A 1 2015 1 Grnd1
James 94 64 B 2 2015 1 Grnd2
Jerry 85 78 C 3 2015 1 Grnd3
Daniel 97 51 D 4 2015 1 Grnd4
John 89 61 A 1 2015 1 Grnd2
James 65 26 B 4 2015 1 Grnd3
Jerry 73 34 C 3 2015 1 Grnd2
Daniel 73 40 D 2 2015 1 Grnd2
John 89 26 A 1 2015 1 Grnd3
James 92 42 B 3 2015 1 Grnd1
Jerry 89 25 C 2 2015 1 Grnd2
Daniel 80 41 D 4 2015 1 Grnd2
John 73 82 A 3 2015 1 Grnd3
James 73 41 B 4 2015 1 Grnd3
Jerry 89 76 C 2 2015 1 Grnd1
Daniel 91 77 D 1 2015 1 Grnd2
round
Team Opp Grnd
A 1 Grnd1
B 3 Grnd4
C 4 Grnd2
D 2 Grnd3
What I want to be able to do is manipulate this so that I generate a second data frame as per below
Player Gms Avg.Pts Avg.Last3 Avg.v.Opp Avg.#.Grnd
John
James
Jerry
Daniel
I know how to do this in Excel; however, I'm struggling in R.
Gms - total number of games for each individual player (Excel would be COUNTIF)
Avg.Pts - the average of Pts for each Player name (Excel would be AVERAGEIF)
Avg.Last3 - the average of Pts for each Player in their last 3 games; note that the data frame is in order, with the most recent games at the end.
Avg.v.Opp - the average of Pts for each player against the next opponent as defined in data frame round. For example, John plays for team A and his next opponent is Opp 1. (Excel would be AVERAGEIFS)
Avg.#.Grnd - the average of Pts for each player at the next ground as defined in data frame round. For example, John plays for team A and his next game is held at Grnd1. (Excel would be AVERAGEIFS)
I've tried using dplyr and a number of other options, but so far I haven't managed to put together something that works. Note that the mydata data frame runs to 10,000+ rows.
I think this will work. If you share your sample data with dput(), I'll be happy to copy/paste it and check (and debug if necessary).
First I'll do the easy ones, the ones that don't depend on round:
library(dplyr)
group_by(mydata, Player) %>%
  summarize(Gms = n(),
            Avg.Pts = mean(Pts),
            Avg.Last3 = mean(tail(Pts, 3)))
I wanted to do that one separately to emphasize how clean dplyr can be for simple cases. All the "ifs" in your Excel commands are taken care of by the single group_by at the beginning. n() is the count, and mean() is the average. tail() is a handy base function that returns the end of a data frame or vector.
To add in the round data, we'll want to join the data frames together based on the Team column. We'll still want to be able to tell the other columns apart, whether they're from mydata or round, so I'll rename the round columns:
round = rename(round, next_opp = Opp, next_grnd = Grnd)
Then we'll start with the join and proceed as before. This time we do need some ifs at the end, which I'll do with a simple subset inside the mean calls:
left_join(mydata, round) %>%
  # convert ground columns to character as discussed in comments
  mutate(next_grnd = as.character(next_grnd),
         Grnd = as.character(Grnd)) %>%
  group_by(Player) %>%
  summarize(Gms = n(),
            Avg.Pts = mean(Pts),
            Avg.Last3 = mean(tail(Pts, 3)),
            Avg.v.Opp = mean(Pts[Opp == next_opp]),
            Avg.at.Grnd = mean(Pts[Grnd == next_grnd]))
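One caveat to the last two averages (an addition, not part of the original answer): if a player has no previous games against the next opponent or at the next ground, the subset is empty and mean() returns NaN. A small hypothetical guard, if that matters:
# hypothetical helper: return NA instead of NaN when the subset is empty
safe_mean <- function(x) if (length(x) == 0) NA_real_ else mean(x)
# then use e.g. Avg.v.Opp = safe_mean(Pts[Opp == next_opp]) inside summarize()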

"for" loop in R and checking previous value from a column

I'm working on a data frame. Here's what it looks like:
shape id day hour week id footfall category area name
22496 22/3/14 3 12 634 Work cluster CBD area 1
22670 22/3/14 3 12 220 Shopping cluster Orchard Road 1
23287 22/3/14 3 12 723 Airport Changi Airport 2
16430 22/3/14 4 12 947 Work cluster CBD area 2
4697 22/3/14 3 12 220 Residential area Ang Mo Kio 2
4911 22/3/14 3 12 1001 Shopping cluster Orchard Rd 3
11126 22/3/14 3 12 220 Residential area Ang Mo Kio 2
and so on... until 635 rows return.
The other dataset that I want to compare it with looks like this:
category Foreigners Locals
Work cluster 1600000 3623900
Shopping cluster 1800000 3646666.667
Airport 15095152 8902705
Residential area 527700 280000
They both share the same attribute, i.e. category
I want to look up the previous hour's value from the hour column in the first dataset, so I can compare it with the value from the second dataset.
Here's what I ideally want to do in R:
#for n in 1: number of rows{
# check the previous hour from IDA dataset !!!!
# calculate hourSum - previousHour = newHourSum and store it as newHourSum
# calculate hour/(newHourSum-previousHour) * Foreigners and store it as footfallHour
# add to the empty dataframe }
I'm not sure how to do that, and here's what I tried:
tbl1 <- secondDataset
tbl2 <- firstDataset

mergetbl <- function(tbl1, tbl2)
{
  newtbl = data.frame(hour = numeric(), forgHour = numeric(), locHour = numeric())
  ntbl1rows <- nrow(tbl1)  # get the number of rows

  for (n in 1:ntbl1rows)
  {
    # get the previousHour
    newHourSum <- tbl1$hour - previousHour
    footfallHour <- (tbl1$hour / (newHourSum - previousHour)) * tbl2$Foreigners
    # add to newtbl
  }
}
This would be what I expected:
shape id day hour week id footfall category area name forgHour locHour
22496 22/3/14 3 12 634 Work cluster CBD area 1 1 12
22670 22/3/14 3 12 220 Shopping cluster Orchard Road 1 21 25
23287 22/3/14 3 12 723 Airport Changi Airport 2 31 34
16430 22/3/14 4 12 947 Work cluster CBD area 2 41 23
4697 22/3/14 3 12 220 Residential area Ang Mo Kio 2 51 23
4911 22/3/14 3 12 1001 Shopping cluster Orchard Rd 3 61 45
11126 22/3/14 3 12 220 Residential area Ang Mo Kio 2 72 54
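The "previous value from a column" part can at least be sketched without a for loop, e.g. with dplyr::lag() (the data frame names and the "category" join column are assumptions; the exact footfall formula isn't clear from the question):
library(dplyr)
# sketch only: look up the previous row's hour within each category
firstDataset %>%
  left_join(secondDataset, by = "category") %>%   # assumed join column
  arrange(category, hour) %>%
  group_by(category) %>%
  mutate(previousHour = lag(hour),
         newHourSum   = hour - previousHour)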

Calculate rows with same title

Since my other question got closed, here is the required data.
What I'm trying to do is have R sum up the last column, count, against the city column so I can map the data. Therefore I need some kind of code to match these, since I want to show how many participants (in count) are in a given state, e.g. Hawaii (HI).
zip city state latitude longitude count
96860 Pearl Harbor HI 24.859832 -168.021815 36
96863 Kaneohe Bay HI 21.439867 -157.74772 39
99501 Anchorage AK 61.216799 -149.87828 12
99502 Anchorage AK 61.153693 -149.95932 17
99506 Elmendorf AFB AK 61.224384 -149.77461 2
What I've tried is
match <- c(match(datazip$state, datazip$number))
but I'm really at a loss trying to find a solution, since I don't even know how to describe this concisely. My plan afterwards is to make a choropleth map with the data, and believe me, by now I've seen almost all the pages that try to give advice. So your help is much appreciated. Thanks
# I read your sample data to a data frame
> df
zip city state latitude longitude count
1 96860 Pearl_Harbor HI 24.85983 -168.0218 36
2 96863 Kaneohe_Bay HI 21.43987 -157.7477 39
3 99501 Anchorage AK 61.21680 -149.8783 12
4 99502 Anchorage AK 61.15369 -149.9593 17
5 99506 Elmendorf_AFB AK 61.22438 -149.7746 2
# If you want to sum the number of counts by state
library(plyr)
> ddply(df, .(state), transform, count2 = sum(count))
zip city state latitude longitude count count2
1 99501 Anchorage AK 61.21680 -149.8783 12 31
2 99502 Anchorage AK 61.15369 -149.9593 17 31
3 99506 Elmendorf_AFB AK 61.22438 -149.7746 2 31
4 96860 Pearl_Harbor HI 24.85983 -168.0218 36 75
5 96863 Kaneohe_Bay HI 21.43987 -157.7477 39 75
Maybe aggregate would be a nice and simple solution for you:
df
zip city state latitude longitude count
1 96860 Pearl Harbor HI 24.85983 -168.0218 36
2 96863 Kaneohe Bay HI 21.43987 -157.7477 39
3 99501 Anchorage AK 61.21680 -149.8783 12
4 99502 Anchorage AK 61.15369 -149.9593 17
5 99506 Elmendorf AFB AK 61.22438 -149.7746 2
aggregate(df$count,by=list(df$state),sum)
Group.1 x
1 AK 31
2 HI 75
aggregate(df$count,by=list(df$city),sum)
Group.1 x
1 Anchorage 29
2 Elmendorf AFB 2
3 Kaneohe Bay 39
4 Pearl Harbor 36
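For completeness, the same per-state sum is commonly written with dplyr these days; a minimal sketch (not from the original answers):
library(dplyr)
# sum count per state, equivalent to the aggregate() call above
df %>%
  group_by(state) %>%
  summarise(count2 = sum(count))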
