Merge different dataset - r

I need to merge two datasets into one, but they have different classes. How can I do this? rbind doesn't work. Any ideas?
nycounties <- rgdal::readOGR("https://raw.githubusercontent.com/openpolis/geojson-italy/master/geojson/limits_IT_provinces.geojson")
city <- c("Novara", "Milano","Torino","Bari")
dimension <- c("150000", "5000000","30000","460000")
df <- cbind(city, dimension)
total <- rbind(nycounties,df)

Are you looking for something like this?
nycounties@data = data.frame(nycounties@data,
                             df[match(nycounties@data[, "prov_name"],
                                      df[, "city"]),])
Output
nycounties@data[!is.na(nycounties@data$dimension),]
prov_name prov_istat_code_num prov_acr reg_name reg_istat_code reg_istat_code_num prov_istat_code city dimension
0 Torino 1 TO Piemonte 01 1 001 Torino 30000
2 Novara 3 NO Piemonte 01 1 003 Novara 150000
12 Milano 15 MI Lombardia 03 3 015 Milano 5000000
81 Bari 72 BA Puglia 16 16 072 Bari 460000
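As for why rbind fails: readOGR returns an sp Spatial*DataFrame object whose attribute table lives in the data slot, while cbind(city, dimension) builds a character matrix, so there is nothing row-compatible to bind. A minimal sketch of the same match-based lookup follows, using a plain data frame as a stand-in for nycounties@data so it runs without rgdal. The attrs table and its rows are hypothetical, and dimension is stored as numeric here, which is an assumption about what is wanted.

```r
# Hypothetical stand-in for nycounties@data (only a few provinces, two columns)
attrs <- data.frame(prov_name = c("Torino", "Vercelli", "Novara", "Milano"),
                    prov_acr  = c("TO", "VC", "NO", "MI"),
                    stringsAsFactors = FALSE)

# Build the lookup as a data frame so that dimension stays numeric
df <- data.frame(city      = c("Novara", "Milano", "Torino", "Bari"),
                 dimension = c(150000, 5000000, 30000, 460000),
                 stringsAsFactors = FALSE)

# For each province, find the row of df with the same name (NA if none)
idx <- match(attrs$prov_name, df$city)

# Add the matched values as a new attribute column; unmatched rows get NA
attrs$dimension <- df$dimension[idx]
attrs
```

With the real object, the last assignment would target nycounties@data$dimension instead, which leaves the geometries untouched.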

Related

Split numeric variables by decimals in R

I have a data frame with a column that contains numeric values, which represent the price.
ID Total
1124 12.34
1232 12.01
1235 13.10
I want to split the column Total by "." and create 2 new columns with the euro and cent amount. Like this:
ID Total Euro Cent
1124 12.34 12 34
1232 12.01 12 01
1235 13.10 13 10
1225 13.00 13 00
The euro and cent column should also be numeric.
I tried:
df[c('Euro', 'Cent')] <- str_split_fixed(df$Total, "(\\.)", 2)
But I get 2 new columns of type character that looks like this:
ID Total Euro Cent
1124 12.34 12 34
1232 12.01 12 01
1235 13.10 13 1
1225 13.00 13
If I convert the character columns (euro and cent) to numeric like this:
as.numeric(df$Euro)
the 00 cent value turns into NULL and the 10 cent value turns into 1 cent.
Any help is welcome.
Two methods:
If class(dat$Total) is numeric, you can do this:
dat <- transform(dat, Euro = Total %/% 1, Cent = 100 * (Total %% 1))
dat
# ID Total Euro Cent
# 1 1124 12.34 12 34
# 2 1232 12.01 12 1
# 3 1235 13.10 13 10
%/% is the integer-division operator, %% the modulus operator.
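One caveat with the numeric approach: values such as 12.34 are not exactly representable in binary floating point, so 100 * (Total %% 1) may come out a hair away from a whole number, and truncating it with as.integer could then be off by one cent. A hedged variant is sketched below that rounds to whole cents first and then splits with integer arithmetic. The cents_total name is only illustrative.

```r
dat <- data.frame(ID = c(1124, 1232, 1235),
                  Total = c(12.34, 12.01, 13.10))

# Convert to whole cents and round away any floating-point noise
cents_total <- round(dat$Total * 100)

# Split with integer division and modulus on the whole-cent amount
dat$Euro <- as.integer(cents_total %/% 100)
dat$Cent <- as.integer(cents_total %% 100)

# Numeric columns cannot keep a leading zero; format only for display
cent_label <- sprintf("%02d", dat$Cent)
```

Here Euro and Cent stay numeric, and cent_label gives the two-digit text form when "01" or "00" needs to be shown.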
If class(dat$Total) is character, then
dat <- transform(dat, Euro = sub("\\..*", "", Total), Cent = sub(".*\\.", "", Total))
dat
# ID Total Euro Cent
# 1 1124 12.34 12 34
# 2 1232 12.01 12 01
# 3 1235 13.10 13 10
The two new columns are also character. For this, you may want one of two more steps:
Remove leading 0s and keep them as character:
dat[,c("Euro", "Cent")] <- lapply(dat[,c("Euro", "Cent")], sub, pattern = "^0+", replacement = "")
dat
# ID Total Euro Cent
# 1 1124 12.34 12 34
# 2 1232 12.01 12 1
# 3 1235 13.10 13 10
Convert to numbers:
dat[,c("Euro", "Cent")] <- lapply(dat[,c("Euro", "Cent")], as.numeric)
dat
# ID Total Euro Cent
# 1 1124 12.34 12 34
# 2 1232 12.01 12 1
# 3 1235 13.10 13 10
(You can also use as.integer if you know both columns will always be such.)
Just use standard numeric functions:
df$Euro <- floor(df$Total)
df$Cent <- df$Total %% 1 * 100

Match and replace value using 2 Data Frames (R)

I have two data frames. I need to match details$Name with info$Name and replace the corresponding values in details$Salary. The details data frame should retain all its rows and there should be no NAs (if a match is found, replace the value; if not, leave it as it is).
details <- data.frame(Name = c("Aks","Bob","Caty","David","Enya","Fredrick","Gaby","Hema","Isac","Jaby","Katy"),
                      Age = c(12,22,33,43,24,67,41,19,25,24,32),
                      Gender = c("f","m","m","f","m","f","m","f","m","m","m"),
                      Salary = c(1500,2000,3.6,8500,1.2,1400,2300,2.5,5.2,2000,1265))
info <- data.frame(Name = c("caty","Enya","Dadi","Enta","Billu","Viku","situ","Hema","Ignu","Isac"),
                   income = c(2500,5600,3200,1522,2421,3121,4122,5211,1000,3500))
Expected Result :
Name Age Gender Salary
Aks 12 f 1500
Bob 22 m 2000
Caty 33 m 2500
David 43 f 8500
Enya 24 m 5600
Fredrick 67 f 1400
Gaby 41 m 2300
Hema 19 f 5211
Isac 25 m 3500
Jaby 24 m 2000
Katy 32 m 1265
None of the following gives the expected result:
dplyr::left_join(details,info,by = "Name")
dplyr::right_join(details,info,by = "Name")
dplyr::inner_join(details,info, by ="Name") # for other matching and replace this works fine but not here
dplyr::full_join(details,info,by ="Name")
All of these give NAs. I also tried the match function, but it did not give the desired result. Any help would be highly appreciated.
The Name values in the two data frames are in different cases. We first need to bring them to the same case, then do a left_join and use coalesce to select the first non-NA value between income and Salary.
library(dplyr)
details %>% mutate(Name = stringr::str_to_title(Name)) %>%
left_join(info %>% mutate(Name = stringr::str_to_title(Name)), by = "Name") %>%
mutate(Salary = coalesce(income, Salary)) %>%
select(names(details))
# Name Age Gender Salary
#1 Aks 12 f 1500
#2 Bob 22 m 2000
#3 Caty 33 m 2500
#4 David 43 f 8500
#5 Enya 24 m 5600
#6 Fredrick 67 f 1400
#7 Gaby 41 m 2300
#8 Hema 19 f 5211
#9 Isac 25 m 3500
#10 Jaby 24 m 2000
#11 Katy 32 m 1265
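To make the coalesce step concrete, here is a small base R sketch of the same idea on a three-row subset of the data: take income where a case-insensitive match exists, otherwise keep the existing Salary. The ifelse call is a stand-in for coalesce(income, Salary) and is not part of the dplyr answer itself.

```r
details <- data.frame(Name = c("Aks", "Caty", "Enya"),
                      Salary = c(1500, 3.6, 1.2),
                      stringsAsFactors = FALSE)
info <- data.frame(Name = c("caty", "Enya", "Dadi"),
                   income = c(2500, 5600, 3200),
                   stringsAsFactors = FALSE)

# Case-insensitive lookup: position of each details name in info (NA if absent)
idx <- match(tolower(details$Name), tolower(info$Name))

# Keep the old Salary where there is no match, otherwise take income
details$Salary <- ifelse(is.na(idx), details$Salary, info$income[idx])
details
```

Aks has no match and keeps 1500, while Caty and Enya pick up 2500 and 5600.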
A base R solution:
matches <- match(tolower(details$Name), tolower(info$Name))
match <- !is.na(matches)
details$Salary[match] <- info$income[matches[match]]
#Result
Name Age Gender Salary
1 Aks 12 f 1500
2 Bob 22 m 2000
3 Caty 33 m 2500
4 David 43 f 8500
5 Enya 24 m 5600
6 Fredrick 67 f 1400
7 Gaby 41 m 2300
8 Hema 19 f 5211
9 Isac 25 m 3500
10 Jaby 24 m 2000
11 Katy 32 m 1265

To aggregate a csv file having day wise data into month

I have a csv file named crime.csv as below:-
OFFENSE_CODE OFFENSE_TYPE OFFENSE_DESCRIPTION DISTRICT DAY YEAR MONTH STREET NO_OF_CRIME
1106 Confidence Games FRAUD - CREDIT CARD / ATM FRAUD B2 1 2018 2 WASHINGTON ST 1
3201 Property Lost PROPERTY - LOST B2 15 2018 3 ELM HILL AVE 1
1001 Counterfeiting FORGERY / COUNTERFEITING D4 9 2018 1 TREMONT ST 1
2629 Harassment HARASSMENT E5 1 2018 1 CROWN POINT DR 1
1001 Counterfeiting FORGERY / COUNTERFEITING E5 8 2018 4 REDGATE RD 1
1106 Confidence Games FRAUD - CREDIT CARD / ATM FRAUD D4 22 2018 2 BOYLSTON ST 1
2629 Harassment HARASSMENT B2 9 2017 10 AKRON ST 1
1102 Fraud FRAUD - FALSE PRETENSE / SCHEME A7 25 2018 4 LIVERPOOL ST 1
3201 Property Lost PROPERTY - LOST D14 1 2018 1 FIDELIS WAY 1
1106 Confidence Games FRAUD - CREDIT CARD / ATM FRAUD E5 12 2018 4 SPRING ST 1
3201 Property Lost PROPERTY - LOST A1 30 2018 4 NASHUA ST 1
I need to aggregate this data into monthly data based on OFFENSE_CODE, so that NO_OF_CRIME is summed for each particular month. Any help would be really great.
Assuming x is the name of the dataframe:
aggregate(x$NO_OF_CRIME, by = list(x$OFFENSE_CODE, x$MONTH), FUN = sum)
We can use tidyverse:
library(tidyverse)
df1 %>%
  group_by(OFFENSE_CODE, YEAR, MONTH) %>%
  summarise(SUM_NO_OF_CRIME = sum(NO_OF_CRIME), .groups = "drop") # one row per code, year and month
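Since the sample contains rows from both 2017 and 2018, grouping by month alone would pool the same calendar month across years. A minimal base R sketch with the formula interface of aggregate, grouping by code, year and month, is shown below. The data frame x is a hand-typed subset of the sample rows, with the column spelled OFFENSE_CODE as in the file header.

```r
# Hand-typed subset of the sample rows (only the columns needed here)
x <- data.frame(
  OFFENSE_CODE = c(1106, 3201, 1001, 2629, 1001, 1106, 2629),
  YEAR         = c(2018, 2018, 2018, 2018, 2018, 2018, 2017),
  MONTH        = c(2, 3, 1, 1, 4, 2, 10),
  NO_OF_CRIME  = c(1, 1, 1, 1, 1, 1, 1)
)

# Sum NO_OF_CRIME within each combination of code, year and month
monthly <- aggregate(NO_OF_CRIME ~ OFFENSE_CODE + YEAR + MONTH,
                     data = x, FUN = sum)
monthly
```

The two February 2018 rows for code 1106 collapse into a single row with a count of 2; every other combination keeps a count of 1.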

Best method to Merge two Datasets (Maybe if function?)

I have two data sets I am working with, TestA and TestB (below is how to make them in R):
Instructor <- c('Mr.A','Mr.A','Mr.B', 'Mr.C', 'Mr.D')
Class <- c('French','French','English', 'Math', 'Geometry')
Section <- c('1','2','3','5','5')
Time <- c('9:00-10:00','10:00-11:00','9:00-10:00','9:00-10:00','10:00-11:00')
Date <- c('MWF','MWF','TR','TR','MWF')
Enrollment <- c('30','40','24','29','40')
TestA <- data.frame(Instructor,Class,Section,Time,Date,Enrollment)
rm(Instructor,Class,Section,Time,Date,Enrollment)
Student <- c("Frances","Cass","Fern","Pat","Peter","Kory","Cole")
ID <- c('123','121','101','151','456','789','314')
Instructor <- c('','','','','','','')
Time <- c('','','','','','','')
Date <- c('','','','','','','')
Enrollment <- c('','','','','','','')
Class <- c('French','French','French','French','English', 'Math', 'Geometry')
Section <- c('1','1','2','2','3','5','5')
TestB <- data.frame(Student, ID, Instructor, Class, Section, Time, Date, Enrollment)
rm(Instructor,Class,Section,Time,Date,Enrollment,ID,Student)
I would like to merge both datasets (if possible, without using merge()) so that all the columns of TestA are filled with the information provided by TestB, matched on Class and Section.
I tried using merge(TestA, TestB, by=c('Class','Section'), all.x=TRUE) but it adds observations to the original TestA. This is just a test but in the datasets I am using there are hundreds of observations. It worked when I did it with these smaller frames but something is happening to the bigger set. That's why I'd like to know if there is a merge alternative.
Any ideas on how to do this?
The output should look like this
Class Section Instructor Time Date Enrollment Student ID
English 3 Mr.B 9:00-10:00 TR 24 Peter 456
French 1 Mr.A 9:00-10:00 MWF 30 Frances 123
French 1 Mr.A 9:00-10:00 MWF 30 Cass 121
French 2 Mr.A 10:00-11:00 MWF 40 Fern 101
French 2 Mr.A 10:00-11:00 MWF 40 Pat 151
Geometry 5 Mr.D 10:00-11:00 MWF 40 Cole 314
Math 5 Mr.C 9:00-10:00 TR 29 Kory 789
I was once a big fan of merge() until I learned about dplyr's join functions.
Try this instead:
library(dplyr)
TestA %>%
left_join(TestB, by = c("Class", "Section")) %>% #Here, you're joining by just the "Class" and "Section" columns of TestA and TestB
select(Class,
Section,
Instructor = Instructor.x,
Time = Time.x,
Date = Date.x,
Enrollment = Enrollment.x,
Student,
ID) %>%
arrange(Class, Section) #Added to match your output.
The select statement is keeping only those columns that are specifically named and, in some cases, renaming them.
Output:
Class Section Instructor Time Date Enrollment Student ID
1 English 3 Mr.B 9:00-10:00 TR 24 Peter 456
2 French 1 Mr.A 9:00-10:00 MWF 30 Frances 123
3 French 1 Mr.A 9:00-10:00 MWF 30 Cass 121
4 French 2 Mr.A 10:00-11:00 MWF 40 Fern 101
5 French 2 Mr.A 10:00-11:00 MWF 40 Pat 151
6 Geometry 5 Mr.D 10:00-11:00 MWF 40 Cole 314
7 Math 5 Mr.C 9:00-10:00 TR 29 Kory 789
The key is to drop the empty but duplicate columns from TestB before merging / joining as shown by SymbolixAU.
Here is an implementation in data.table syntax:
library(data.table)
setDT(TestB)[, .(Student, ID, Class, Section)][setDT(TestA), on = .(Class, Section)]
Student ID Class Section Instructor Time Date Enrollment
1: Frances 123 French 1 Mr.A 9:00-10:00 MWF 30
2: Cass 121 French 1 Mr.A 9:00-10:00 MWF 30
3: Fern 101 French 2 Mr.A 10:00-11:00 MWF 40
4: Pat 151 French 2 Mr.A 10:00-11:00 MWF 40
5: Peter 456 English 3 Mr.B 9:00-10:00 TR 24
6: Kory 789 Math 5 Mr.C 9:00-10:00 TR 29
7: Cole 314 Geometry 5 Mr.D 10:00-11:00 MWF 40
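If extra rows show up on the larger data, one thing worth checking is whether the Class and Section pair is really unique in the lookup table, because every duplicated key multiplies the matching rows. A base R sketch of that check, followed by the same drop-the-empty-columns idea, is given below. It uses merge() only as a familiar illustration, and the shortened TestA and TestB here are hypothetical subsets of the frames in the question.

```r
TestA <- data.frame(Class = c("French", "French", "English"),
                    Section = c("1", "2", "3"),
                    Instructor = c("Mr.A", "Mr.A", "Mr.B"),
                    stringsAsFactors = FALSE)
TestB <- data.frame(Student = c("Frances", "Cass", "Fern", "Peter"),
                    Class = c("French", "French", "French", "English"),
                    Section = c("1", "1", "2", "3"),
                    Instructor = "",
                    stringsAsFactors = FALSE)

# 0 means no Class/Section pair occurs twice in the lookup table
dup_keys <- anyDuplicated(TestA[c("Class", "Section")])

# Keep only the filled-in columns of TestB, then look up the rest from TestA
keep <- c("Student", "Class", "Section")
out <- merge(TestB[keep], TestA, by = c("Class", "Section"), all.x = TRUE)
out
```

When dup_keys is 0, the result has exactly one row per student, so any growth in row count points to duplicated keys in the lookup table.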

"for" loop in R and checking previous value from a column

I'm working on a data frame that looks like this:
shape id day hour week id footfall category area name
22496 22/3/14 3 12 634 Work cluster CBD area 1
22670 22/3/14 3 12 220 Shopping cluster Orchard Road 1
23287 22/3/14 3 12 723 Airport Changi Airport 2
16430 22/3/14 4 12 947 Work cluster CBD area 2
4697 22/3/14 3 12 220 Residential area Ang Mo Kio 2
4911 22/3/14 3 12 1001 Shopping cluster Orchard Rd 3
11126 22/3/14 3 12 220 Residential area Ang Mo Kio 2
and so on, for 635 rows in total.
The other dataset that I want to compare it with looks like this:
category Foreigners Locals
Work cluster 1600000 3623900
Shopping cluster 1800000 3646666.667
Airport 15095152 8902705
Residential area 527700 280000
They both share the same attribute, i.e. category.
I want to look up the previous hour from the hour column in the first dataset so that I can compare it with the value from the second dataset.
Here's what I ideally want to do in R:
#for n in 1: number of rows{
# check the previous hour from IDA dataset !!!!
# calculate hourSum - previousHour = newHourSum and store it as newHourSum
# calculate hour/(newHourSum-previousHour) * Foreigners and store it as footfallHour
# add to the empty dataframe }
I'm not sure how to do that. Here's what I tried:
tbl1 <- secondDataset
tbl2 <- firstDataset
mergetbl <- function(tbl1, tbl2)
{
  newtbl = data.frame(hour=numeric(),forgHour=numeric(),locHour=numeric())
  ntbl1rows<-nrow(tbl1) # get the number of rows
  for(n in 1:ntbl1rows)
  {
    #get the previousHour
    newHourSum <- tbl1$hour - previousHour
    footfallHour <- (tbl1$hour/(newHourSum-previousHour)) * tbl2$Foreigners
    #add to newtbl
  }
}
This is what I expected:
shape id day hour week id footfall category area name forgHour locHour
22496 22/3/14 3 12 634 Work cluster CBD area 1 1 12
22670 22/3/14 3 12 220 Shopping cluster Orchard Road 1 21 25
23287 22/3/14 3 12 723 Airport Changi Airport 2 31 34
16430 22/3/14 4 12 947 Work cluster CBD area 2 41 23
4697 22/3/14 3 12 220 Residential area Ang Mo Kio 2 51 23
4911 22/3/14 3 12 1001 Shopping cluster Orchard Rd 3 61 45
11126 22/3/14 3 12 220 Residential area Ang Mo Kio 2 72 54
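The two building blocks of the loop (the previous row's hour, and the Foreigners value for each row's category) can be obtained without a loop. The sketch below shows only those two steps on a hypothetical four-row subset. It does not reproduce the footfallHour formula, since hourSum is not defined in the question, and the names visits and totals are placeholders for the two datasets.

```r
# Hypothetical subset of the first dataset (only the columns needed here)
visits <- data.frame(hour = c(3, 3, 3, 4),
                     category = c("Work cluster", "Shopping cluster",
                                  "Airport", "Work cluster"),
                     stringsAsFactors = FALSE)

# Hypothetical subset of the second dataset
totals <- data.frame(category = c("Work cluster", "Shopping cluster", "Airport"),
                     Foreigners = c(1600000, 1800000, 15095152),
                     stringsAsFactors = FALSE)

# Shift the hour column down by one row; the first row has no predecessor
visits$previousHour <- c(NA, head(visits$hour, -1))

# Look up the per-category total for every row via the shared category column
visits$Foreigners <- totals$Foreigners[match(visits$category, totals$category)]
visits
```

With both columns in place, any per-row formula can be written as ordinary vectorised arithmetic on visits.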

Resources