In R, comparing 2 fields across consecutive rows in a data frame

I am trying to compare 2 different fields across consecutive rows of a data frame in R and indicate the rows that don't match. Below is the input data:
Start End
1 Atl Bos
2 Bos Har
3 Har NYC
4 Stf SFO
5 SFO Chi
I am trying to establish a chain of movement, and where the End doesn't match up to the Start of the next row I want to indicate that row. So for the above I would indicate row 4, as below:
Start End Ind
1 Atl Bos Y
2 Bos Har Y
3 Har NYC Y
4 Stf SFO N
5 SFO Chi Y
I am pretty new to R; I have tried looking up this problem but can't seem to find a solution. Any help is appreciated.

An alternative would be:
> # compare each row's Start with the previous row's End
> Ind <- as.character(dat$Start[-1]) == as.character(dat$End[-length(dat$End)])
> dat$Ind <- c(NA, ifelse(Ind, "Y", "N"))
> dat
Start End Ind
1 Atl Bos <NA>
2 Bos Har Y
3 Har NYC Y
4 Stf SFO N
5 SFO Chi Y
Note that the first item should be <NA>, since row 1 has no previous row to compare against.

You can do that with dplyr using mutate and lead. Note that the last item should be NA, because there is no row 6 to compare SFO-Chi to.
library(dplyr)
df1 <- read.table(text=" Start End
Atl Bos
Bos Har
Har NYC
Stf SFO
SFO Chi", header=TRUE, stringsAsFactors=FALSE)
df1 %>%
  mutate(Ind = ifelse(End == lead(Start), "Y", "N"))
Start End Ind
1 Atl Bos Y
2 Bos Har Y
3 Har NYC N
4 Stf SFO Y
5 SFO Chi <NA>
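If you want the indicator aligned with the asker's expected output instead, where each row is compared with the previous row and row 1 defaults to "Y", a minimal variant using lag() (assuming the df1 defined above):
df1 %>%
  mutate(Ind = ifelse(is.na(lag(End)) | Start == lag(End), "Y", "N"))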

Related

How to assign one dataframe column's value to be the same as another column's value in r?

I am trying to run the line of code below to copy the city.output column to pm.city where city.output is not NA (in my sample dataframe, nothing is NA though), because city.output contains the correct city spellings.
resultdf <- dplyr::mutate(df, pm.city = ifelse(is.na(city.output) == FALSE, city.output, pm.city))
df:
pm.uid pm.address pm.state pm.zip pm.city city.output
<int> <chr> <chr> <chr> <chr> <fct>
1 1 1809 MAIN ST OH 63312 NORWOOD NORWOOD
2 2 123 ELM DR CA NA BRYAN BRYAN
3 3 8970 WOOD ST UNIT 4 LA 33333 BATEN ROUGE BATON ROUGE
4 4 4444 OAK AVE OH 87481 CINCINATTI CINCINNATI
5 5 3333 HELPME DR MT 87482 HELENA HELENA
6 6 2342 SOMEWHERE RD LA 45103 BATON ROUGE BATON ROUGE
resultdf (pm.city should be the same as city.output but it's an integer)
pm.uid pm.address pm.state pm.zip pm.city city.output
<int> <chr> <chr> <chr> <int> <fct>
1 1 1809 MAIN ST OH 63312 7 NORWOOD
2 2 123 ELM DR CA NA 2 BRYAN
3 3 8970 WOOD ST UNIT 4 LA 33333 1 BATON ROUGE
4 4 4444 OAK AVE OH 87481 3 CINCINNATI
5 5 4444 HELPME DR MT 87482 4 HELENA
6 6 2342 SOMEWHERE RD LA 45103 1 BATON ROUGE
An integer is assigned to pm.city instead. It appears the integer is the position of the city when the cities are sorted alphabetically. Prior to this I used dplyr's left_join to attach the city.output column from another dataframe, but even there I never supplied a row number explicitly.
This works on my computer in RStudio but not when I run it on a server. Maybe it has something to do with my version of dplyr, or the factor data type of city.output? I am pretty new to R.
city.output is a factor, which gets coerced to its integer storage codes inside ifelse(). Instead, convert it to character with as.character:
dplyr::mutate(df, pm.city = ifelse(!is.na(city.output), as.character(city.output), pm.city))
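A self-contained illustration of the coercion, using a hypothetical two-level factor rather than the asker's data: ifelse() strips the factor's attributes and returns its underlying integer codes, and levels are assigned alphabetically, hence the "alphabetical order" numbers observed above.
f <- factor(c("NORWOOD", "BRYAN"))
ifelse(!is.na(f), f, NA)                # 2 1 (integer level codes)
ifelse(!is.na(f), as.character(f), NA)  # "NORWOOD" "BRYAN"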

How to remove rows that contain duplicate characters in R

I want to remove the entire row if the two columns hold duplicate values (i.e. p1 equals p2). Any quick help in doing so in R (for a very large dataset) would be highly appreciated. For example:
mydf <- data.frame(p1=c('a','c','a','b','d','b','c','c','e'),
                   p2=c('b','c','d','c','d','b','d','e','e'),
                   value=c(10,20,10,11,12,13,14,15,16))
This gives:
mydf
p1 p2 value
1 a b 10
2 c c 20
3 a d 10
4 b c 11
5 d d 12
6 b b 13
7 c d 14
8 c e 15
9 e e 16
I want to get:
p1 p2 value
1 a b 10
2 a d 10
3 b c 11
4 c d 14
5 c e 15
Your note in the comments suggests your actual problem is more complex. There's some preprocessing you could do to your strings before you compare p1 to p2. You will have the domain expertise to know what steps are appropriate, but here's a first start: I remove all spaces and punctuation from p1 and p2, then convert them to uppercase before testing for equality. You can modify the clean_str function to include more or different cleaning operations.
Additionally, you may consider approximate matching to address typos / colloquial naming conventions. Package stringdist is a good place to start.
mydf <- data.frame(p1=c('New York','New York','New York','TokYo','LosAngeles','MEMPHIS','memphis','ChIcAGo','Cleveland'),
p2=c('new York','New.York','MEMPHIS','Chicago','knoxville','tokyo','LosAngeles','Chicago','CLEVELAND'),
value=c(10,20,10,11,12,13,14,15,16),
stringsAsFactors = FALSE)
mydf[mydf$p1 != mydf$p2,]
#> p1 p2 value
#> 1 New York new York 10
#> 2 New York New.York 20
#> 3 New York MEMPHIS 10
#> 4 TokYo Chicago 11
#> 5 LosAngeles knoxville 12
#> 6 MEMPHIS tokyo 13
#> 7 memphis LosAngeles 14
#> 8 ChIcAGo Chicago 15
#> 9 Cleveland CLEVELAND 16
clean_str <- function(col){
  # remove all punctuation and whitespace
  d <- gsub("[[:punct:][:blank:]]+", "", col)
  d <- toupper(d)
  return(d)
}
mydf$p1 <- clean_str(mydf$p1)
mydf$p2 <- clean_str(mydf$p2)
mydf[mydf$p1 != mydf$p2,]
#> p1 p2 value
#> 3 NEWYORK MEMPHIS 10
#> 4 TOKYO CHICAGO 11
#> 5 LOSANGELES KNOXVILLE 12
#> 6 MEMPHIS TOKYO 13
#> 7 MEMPHIS LOSANGELES 14
Created on 2020-05-03 by the reprex package (v0.3.0)
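Following up on the approximate-matching suggestion, a short sketch with the stringdist package (the edit-distance threshold of 1 is a hypothetical choice; tune it to your data). Run after the cleaning step, it would also drop near-duplicates such as one-character typos:
library(stringdist)
# keep only rows whose cleaned names differ by more than one edit
mydf[stringdist(mydf$p1, mydf$p2, method = "osa") > 1, ]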
There are several ways to do that. Among them:
Base R
mydf[mydf$p1 != mydf$p2, ]
dplyr
library(dplyr)
mydf %>% filter(p1 != p2)
data.table
library(data.table)
setDT(mydf)
mydf[p1 != p2]
Here's a two-step solution based on #Chase's data:
First step (as suggested by #Chase): preprocess your data in p1 and p2 to make them comparable:
# set to lower-case:
mydf[,c("p1", "p2")] <- lapply(mydf[,c("p1", "p2")], tolower)
# remove anything that's not alphanumeric between words:
mydf[,c("p1", "p2")] <- lapply(mydf[,c("p1", "p2")], function(x) gsub("(\\w+)\\W(\\w+)", "\\1\\2", x))
Second step - (i) using apply, paste the rows together, (ii) use grepl and backreference \\1 to look out for immediately adjacent duplicates in these rows, and (iii) remove (-) those rows which contain these duplicates:
mydf[-which(grepl("\\b(\\w+)\\s+\\1\\b", apply(mydf, 1, paste0, collapse = " "))),]
p1 p2 value
3 newyork memphis 10
4 tokyo chicago 11
5 losangeles knoxville 12
6 memphis tokyo 13
7 memphis losangeles 14

R: Join data frame coordinates by shapefile regions (aka Join Attributes by Location)

I have a large data set, loaded in R as a data.frame. It contains observations associated with coordinate points (lat/lon).
I also have a shape file of North America.
In the empty column (NA filled) in my data frame, labelled BCR, I want to insert the region name which each coordinate falls into according to the shapefile.
I know how to do this in QGIS using Vector > Data Management Tools > Join Attributes by Location.
The shapefile can be downloaded by clicking HERE.
My data, right now, looks like this (a sample):
LATITUDE LONGITUDE Year EFF n St PJ day BCR
50.406752 -104.613 2009 1 0 SK 90 2 NA
50.40678 -104.61256 2009 2 0 SK 120 3 NA
50.40678 -104.61256 2009 2 1 SK 136 2 NA
50.40678 -104.61256 2009 3 2 SK 149 4 NA
43.0026385 -79.2900467 2009 2 0 ON 112 3 NA
43.0026385 -79.2900467 2009 2 1 ON 122 3 NA
But I want it to look like this:
LATITUDE LONGITUDE Year EFF n St PJ day BCR
50.406752 -104.613 2009 1 0 SK 90 2 Prairie Potholes
50.40678 -104.61256 2009 2 0 SK 120 3 Prairie Potholes
50.40678 -104.61256 2009 2 1 SK 136 2 Prairie Potholes
50.40678 -104.61256 2009 3 2 SK 149 4 Prairie Potholes
43.0026385 -79.2900467 2009 2 0 ON 112 3 Lower Great Lakes/St.Lawrence Plain
43.0026385 -79.2900467 2009 2 1 ON 122 3 Lower Great Lakes/St.Lawrence Plain
Notice the BCR column is now filled with the appropriate BCR region name.
My code so far is just importing and formatting the data and shapefile:
library(rgdal)
library(proj4)
library(sp)
library(raster)
# PFW data, full 2.5m observations
df = read.csv("MyData.csv")
# Cleaning out empty coordinate data
pfw = df[(df$LATITUDE != 0) & (df$LONGITUDE != 0) & (!is.na(df$LATITUDE)) & (!is.na(df$LONGITUDE)),]
# Creating a new column to be filled with associated Bird Conservation Regions
pfw["BCR"] = NA
# Making a duplicate data frame to conserve data
toSPDF = pfw
# Ensuring spatial formatting
#coordinates(toSPDF) = ~LATITUDE + LONGITUDE
SPDF <- SpatialPointsDataFrame(toSPDF[, c("LONGITUDE", "LATITUDE")],
                               toSPDF,
                               proj4string = CRS("+init=epsg:4326"))
# BCR shape file, no state borders
shp = shapefile("C:/Users/User1/Desktop/BCR/BCR_Terrestrial_master_International.shx")
spPoly = spTransform(shp, CRS("+init=epsg:4326 +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"))
# Check
isTRUE(proj4string(spPoly) == proj4string(SPDF))
# Trying to join attributes by location
#try1 = point.in.polygon(spPoly, SPDF) # Sounds good doesn't work
#a.data <- over(SPDF, spPoly[,"BCRNAME"]) # Error: cannot allocate vector of size 204.7 Mb
I think you want to do a spatial query with points and polygons, that is, to assign polygon attributes to the corresponding points. You can do that like this:
Example data
library(terra)
f <- system.file("ex/lux.shp", package="terra")
polygons <- vect(f)
points <- spatSample(polygons, 10)
Solution
e <- extract(polygons, points)
e
# id.y ID_1 NAME_1 ID_2 NAME_2 AREA POP
#1 1 3 Luxembourg 9 Esch-sur-Alzette 251 176820
#2 2 3 Luxembourg 9 Esch-sur-Alzette 251 176820
#3 3 2 Grevenmacher 6 Echternach 188 18899
#4 4 1 Diekirch 2 Diekirch 218 32543
#5 5 3 Luxembourg 9 Esch-sur-Alzette 251 176820
#6 6 1 Diekirch 4 Vianden 76 5163
#7 7 3 Luxembourg 11 Mersch 233 32112
#8 8 2 Grevenmacher 7 Remich 129 22366
#9 9 1 Diekirch 3 Redange 259 18664
#10 10 3 Luxembourg 9 Esch-sur-Alzette 251 176820
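To carry a matched attribute back onto the points, filling a column like the question's BCR, a minimal follow-up sketch (assuming each point falls inside exactly one polygon, so e has one row per point):
# copy the matched polygon attribute onto the points
points$NAME_2 <- e$NAME_2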
With the older spatial packages you can use raster::extract or sp::over.
Example data:
library(raster)
pols <- shapefile(system.file("external/lux.shp", package="raster"))
set.seed(20180121)
pts <- data.frame(coordinates(spsample(pols, 5, 'random')), name=letters[1:5])
plot(pols); points(pts)
Solution:
e <- extract(pols, pts[, c('x', 'y')])
pts$BCR <- e$NAME_2
pts
# x y name BCR
#1 6.009390 49.98333 a Wiltz
#2 5.766407 49.85188 b Redange
#3 6.268405 49.62585 c Luxembourg
#4 6.123015 49.56486 d Luxembourg
#5 5.911638 49.53957 e Esch-sur-Alzette
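An alternative with the newer sf package, sketched under the question's setup (the shapefile path and the BCRNAME field are taken from the question; adjust to your files):
library(sf)
# points from the cleaned data frame, WGS84 lon/lat
pts <- st_as_sf(pfw, coords = c("LONGITUDE", "LATITUDE"), crs = 4326)
# read and reproject the BCR polygons
bcr <- st_transform(st_read("BCR_Terrestrial_master_International.shp"), 4326)
# spatial join: each point inherits the attributes of the polygon containing it
pfw_joined <- st_join(pts, bcr["BCRNAME"])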

Arrange dataframe for pairwise correlations

I am working with data in the following form:
Country Player Goals
"USA" "Tim" 0
"USA" "Tim" 0
"USA" "Dempsey" 3
"USA" "Dempsey" 5
"Brasil" "Neymar" 6
"Brasil" "Neymar" 2
"Brasil" "Hulk" 5
"Brasil" "Luiz" 2
"England" "Rooney" 4
"England" "Stewart" 2
Each row represents the number of goals that a player scored per game, and also contains that player's country. I would like to have the data in the form such that I can run pairwise correlations to see whether being from the same country has some association with the number of goals that a player scores. The data would look like this:
Player_1 Player_2
0 8 # Tim Dempsey
8 5 # Neymar Hulk
8 2 # Neymar Luiz
5 2 # Hulk Luiz
4 2 # Rooney Stewart
(You can ignore the comments, they are there simply to clarify what each row contains).
How would I do this?
table(df$Player)
gets me counts per player, but then how do I generate these pairwise combinations?
This is a pretty classic self-join problem. I'm gonna start by summarizing your data to get the total goals for each player. I like dplyr for this, but aggregate or data.table work just fine too.
library(dplyr)
df <- df %>% group_by(Player, Country) %>% dplyr::summarize(Goals = sum(Goals))
> df
Source: local data frame [7 x 3]
Groups: Player
Player Country Goals
1 Dempsey USA 8
2 Hulk Brasil 5
3 Luiz Brasil 2
4 Neymar Brasil 8
5 Rooney England 4
6 Stewart England 2
7 Tim USA 0
Then, using good old merge, we join it to itself based on Country. Then, so we don't get each row twice (Dempsey/Tim and Tim/Dempsey, not to mention Dempsey/Dempsey), we'll subset it so that Player.x is alphabetically before Player.y. Since I already loaded dplyr I'll use filter, but subset would do the same thing.
df2 <- merge(df, df, by.x = "Country", by.y = "Country")
df2 <- filter(df2, as.character(Player.x) < as.character(Player.y))
> df2
Country Player.x Goals.x Player.y Goals.y
2 Brasil Hulk 5 Luiz 2
3 Brasil Hulk 5 Neymar 8
6 Brasil Luiz 2 Neymar 8
11 England Rooney 4 Stewart 2
15 USA Dempsey 8 Tim 0
The self-join could be done in dplyr if we made a little copy of the data and renamed the Player and Goals columns so they wouldn't be joined on. Since merge is pretty smart about the renaming, it's easier in this case.
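For reference, a sketch of that dplyr self-join (inner_join's suffix argument takes care of the renaming, and as.character guards against factor comparison):
library(dplyr)
df2 <- df %>%
  inner_join(df, by = "Country", suffix = c(".x", ".y")) %>%
  filter(as.character(Player.x) < as.character(Player.y))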
There is probably a smarter way to get from the aggregated data to the pairs, but assuming your data is not too big (national soccer data), you can always do something like:
A <- aggregate(df$Goals ~ df$Player + df$Country, data = df, sum)
players_in_c <- table(A[, 2])    # number of players per country
dat <- NULL
for (i in levels(df$Country)) {
  count <- players_in_c[i]
  pair <- combn(count, m = 2)    # all index pairs of players within country i
  B <- A[A[, 2] == i, ]
  dat <- rbind(dat, cbind(B[pair[1, ], ], B[pair[2, ], ]))
}
dat
> dat
df$Player df$Country df$Goals df$Player df$Country df$Goals
1 Hulk Brasil 5 Luiz Brasil 2
1.1 Hulk Brasil 5 Neymar Brasil 8
2 Luiz Brasil 2 Neymar Brasil 8
4 Rooney England 4 Stewart England 2
6 Dempsey USA 8 Tim USA 0

R: aggregate data rows with same date and attributes into a weekly counts for time series

I have the following data, which I put together in Excel by combining weekly reports completed on Fridays over the last year. Each row is an open account.
location code days.open report.date
LA C1 186 8/2/2013
SF C2 186 8/2/2013
SF M 18 8/2/2013
LA C1 130 7/26/2013
HB M 30 7/26/2013
LA F 2 7/19/2013
HB F 188 7/19/2013
LA C3 90 7/12/2013
LB F 30 7/12/2013
LB F 36 7/12/2013
SF M 94 7/12/2013
NB C1 6 7/5/2013
HB M 18 7/5/2013
LB M 35 6/28/2013
SD C3 201 6/28/2013
SD F 69 6/21/2013
and so on, for over a million entries.
This is my first time using R for timeseries and I need help preparing the data for time series analysis.
I have a few things I want to look at:
1) the count of open accounts at each report date
2) the count of open accounts at each report date by location
3) the count of open accounts at each report date by code
4) the count of open accounts separated into days.open <= 30, 30 < days.open <= 60, 60 < days.open <= 90, days.open > 90
5) the same counts as #4, further broken down by location.
I am not quite sure where to start.
I appreciate any help you can provide.
First convert report.date into Date format, then use the plyr or data.table package. The following is a solution using ddply from the plyr package. You should read this article if you want to use plyr. In the following code, mydata is your data.
mydata$report.date <- as.Date(mydata$report.date, "%m/%d/%Y")
library(plyr)
ddply(mydata,.(report.date),summarize, freq=length(days.open)) #1
ddply(mydata,.(report.date,location),summarize, freq=length(days.open)) #2
ddply(mydata,.(report.date,code),summarize, freq=length(days.open)) #3
Generate a variable that assigns days.open into four intervals.
mydata$new <- with(mydata, ifelse(days.open <= 30, "A",
                           ifelse(days.open > 30 & days.open <= 60, "B",
                           ifelse(days.open > 60 & days.open <= 90, "C", "D"))))
ddply(mydata,.(new),summarize, freq=length(days.open)) #4
ddply(mydata,.(new,location),summarize, freq=length(days.open)) #5
Output for the last one:
new location freq
1 A HB 2
2 A LA 1
3 A LB 1
4 A NB 1
5 A SF 1
6 B LB 2
7 C LA 1
8 C SD 1
9 D HB 1
10 D LA 2
11 D SD 1
12 D SF 2
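As an aside, a hedged equivalent of the nested ifelse above using cut(), which builds the same four right-closed intervals in one call:
mydata$new <- cut(mydata$days.open,
                  breaks = c(-Inf, 30, 60, 90, Inf),
                  labels = c("A", "B", "C", "D"))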
Here's the answer to your last question - once you understand this, the rest will be trivial:
library(data.table)
dt = data.table(your_df)
cuts = c(-Inf, 30, 60, 90, Inf)
dt[, .N, by = list(cut(days.open, cuts), location)]
# cut location N
# 1: (90, Inf] LA 2
# 2: (90, Inf] SF 2
# 3: (-Inf,30] SF 1
# 4: (-Inf,30] HB 2
# 5: (-Inf,30] LA 1
# 6: (90, Inf] HB 1
# 7: (60,90] LA 1
# 8: (-Inf,30] LB 1
# 9: (30,60] LB 2
#10: (-Inf,30] NB 1
#11: (90, Inf] SD 1
#12: (60,90] SD 1
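The remaining counts follow the same pattern; for instance, question 1 (open accounts per report date) would presumably be just:
dt[, .N, by = report.date]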
