Replacing values in one data frame with values in a second data frame conditional on a logic statement - r

I have two data frames: "unit_test" with unique descriptions of survey units (one row per survey unit) and "data_test" with field data (multiple rows per survey unit). If it is a ground survey (data_test$type=='ground'), I want to replace data_test$easting with the value in unit_test$east for the corresponding code (unit_test$code must match data_test$code1). If it is an air survey (data_test$type=='air'), I want to keep the original values in data_test$easting.
# Create units table
code <- c('pondA','pondB','pondC','pondD','transect1','transect2','transect3','transect4')
east <- c(12345,23456,34567,45678,NA,NA,NA,NA)
north <- c(99876,98765,87654,76543,NA,NA,NA,NA)
unit_test <- data.frame(cbind(code,east,north))
unit_test
# Create data table
code1 <- c('pondA','pondA','transect1','pondB','pondB','transect2','pondC','transect3','pondD','transect4')
type <- c('ground','ground','air','ground','ground','air','ground','air','ground','air')
easting <- c(NA,NA,18264,NA,NA,46378,NA,86025,NA,46295)
northing <-c(NA,NA,96022,NA,NA,85766,NA,21233,NA,23090)
species <- c('NOPI','NOPI','SCAU','GWTE','GWTE','RUDU','NOPI','GADW','NOPI','MALL')
count <- c(10,23,50,1,2,43,12,3,7,9)
data_test <- data.frame(cbind(code1,type,easting,northing,species,count))
data_test
I have tried using the match function:
if (data_test$type == "ground") {
  data_test$easting <- unit_test$east[match(data_test$code1, unit_test$code)]
}
However, it also replaces the easting values where data_test$type=='air' with NA. Any help would be much appreciated.
I want my final output to look like this:
code1 type easting northing species count
1 pondA ground 12345 99876 NOPI 10
2 pondA ground 12345 99876 NOPI 23
3 transect1 air 18264 96022 SCAU 50
4 pondB ground 23456 98765 GWTE 1
5 pondB ground 23456 98765 GWTE 2
6 transect2 air 46378 85766 RUDU 43
7 pondC ground 34567 87654 NOPI 12
8 transect3 air 86025 21233 GADW 3
9 pondD ground 45678 76543 NOPI 7
10 transect4 air 46295 23090 MALL 9

I think the data.table package is really useful for this task:
install.packages("data.table")
library(data.table)
unit_test = data.table(unit_test)
data_test = data.table(data_test)
Add a column to unit_test specifying it refers to "ground":
unit_test$type = "ground"
Set keys on both tables so they can be cross-referenced:
setkey(data_test, code1, type, species)
setkey(unit_test, code, type)
Every time type is "ground" in data_test, look up the matching row in unit_test and replace easting with east (and northing with north):
data_test[unit_test, easting := east]
data_test[unit_test, northing := north]
Results:
> data_test
code1 type easting northing species count
1: pondA ground 12345 99876 NOPI 10
2: pondA ground 12345 99876 NOPI 23
3: pondB ground 23456 98765 GWTE 1
4: pondB ground 23456 98765 GWTE 2
5: pondC ground 34567 87654 NOPI 12
6: pondD ground 45678 76543 NOPI 7
7: transect1 air 18264 96022 SCAU 50
8: transect2 air 46378 85766 RUDU 43
9: transect3 air 86025 21233 GADW 3
10: transect4 air 46295 23090 MALL 9
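As an aside, newer versions of data.table can do the same update without setkey() by joining with on=. This is only a sketch of that variant (it assumes the type column has been added to unit_test as above):
library(data.table)
setDT(unit_test)
setDT(data_test)
# Update easting/northing in place for rows that match a ground-survey unit
data_test[unit_test, on = c(code1 = "code", type = "type"),
          `:=`(easting = i.east, northing = i.north)]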

Base R:
data_test[data_test$type == 'ground', c('easting','northing')] <-
  unit_test[match(data_test[data_test$type == 'ground', 'code1'], unit_test$code),
            c('east','north')]
Find the spots you want to fill, and build an index with match like you mentioned. Note this is after a change in your sample data: I used stringsAsFactors = FALSE when creating both data frames so I didn't have to deal with factors.
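For reference, here is the same base R idea broken into steps, as a sketch that assumes the example data were rebuilt with stringsAsFactors = FALSE (and without the cbind() wrapper, so east and north stay numeric):
unit_test <- data.frame(code, east, north, stringsAsFactors = FALSE)
data_test <- data.frame(code1, type, easting, northing, species, count,
                        stringsAsFactors = FALSE)

ground <- data_test$type == "ground"                   # rows to fill
idx <- match(data_test$code1[ground], unit_test$code)  # lookup into unit_test
data_test[ground, c("easting", "northing")] <- unit_test[idx, c("east", "north")]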

Related

Is this the most concise way to iterate and capture output from API in R?

I want to iterate through a sequence of years and capture each output in one large dataframe.
The API only allows one year of data per request, so I thought I could run a loop like the one below and append each year's result to an (initially empty) data frame. This seems to work, but I was wondering if there is a more concise way of achieving this.
In case anyone is interested, API info and signup: https://aqs.epa.gov/aqsweb/documents/data_api.html#bdate
library("jsonlite")
library(lubridate)
base_url_site <- "https://aqs.epa.gov/data/api/sampleData/bySite"
years <- as.character(2011:2019)
dat <- NULL
for (year in years) {
  my_raw_result <- httr::GET(base_url_site,
                             query = list(email = "example@email.com", key = Sys.getenv("AQS_KEY"),
                                          param = "44201",
                                          bdate = paste(year, "0101", sep = ""),
                                          edate = paste(year, "1231", sep = ""),
                                          state = "48", county = "141", site = "0055"))
  my_content <- httr::content(my_raw_result, as = 'text')
  my_content_from_json <- fromJSON(my_content)
  df <- my_content_from_json$Data
  dat <- rbind(dat, df)
}
A slightly more efficient solution can be obtained by calling rbind() only once, rather than iteratively inside the loop. We can do this with a combination of base R and lapply(). The key change needed to make the code work was converting the list output from fromJSON() into a data frame, which did not happen correctly in the code posted with the original question.
# set private key
Sys.setenv(AQS_KEY = "yourKeyGoesHere")
base_url_site <- "https://aqs.epa.gov/data/api/sampleData/bySite"
library(RJSONIO)
library(tidyr)
years <- as.character(2011:2019)
system.time(dfList <- lapply(years, function(year) {
  my_raw_result <- httr::GET(base_url_site,
                             query = list(email = "example@gmail.com", key = Sys.getenv("AQS_KEY"),
                                          param = "44201",
                                          bdate = paste(year, "0101", sep = ""),
                                          edate = paste(year, "1231", sep = ""),
                                          state = "48", county = "141", site = "0055"))
  my_content <- httr::content(my_raw_result, as = 'text')
  my_content_from_json <- fromJSON(my_content)
  df <- data.frame(t(sapply(my_content_from_json$Data, c)))
  df$uncertainty <- " "
  tidyr::unnest(df, cols = colnames(df)) # unnest & return to parent
}))
system.time(combinedData <- do.call(rbind,dfList))
The code to extract years 2011 - 2019 from the EPA database runs in about 46.7 seconds of user time for the initial extracts and the unnesting of each resulting data structure:
   user  system elapsed
 46.670   0.756  71.432
The one-time combination of the data frames at the end is essentially instantaneous:
> system.time(combinedData <- data.frame(do.call(rbind,dfList)))
   user  system elapsed
  0.096   0.027   0.123
The large difference between user time and elapsed time is likely due to wait times to receive data from the API.
A key feature of this solution is the technique used to convert the list of lists into data frame rows (h/t Alex Brown's answer to Convert a List to a Data Frame), followed by unnesting of the resulting data structure with tidyr::unnest(). We also had to set the uncertainty column to blank, because unnest() fails on the NULL values extracted from the EPA API.
df <- data.frame(t(sapply(my_content_from_json$Data,c)))
df$uncertainty <- " "
tidyr::unnest(df,cols = colnames(df)) # unnest & return to parent
Output from the combined data frame looks like this.
> head(combinedData)
state_code county_code site_number parameter_code poc latitude longitude datum
1 48 141 0055 44201 1 31.74677 -106.4028 WGS84
2 48 141 0055 44201 1 31.74677 -106.4028 WGS84
3 48 141 0055 44201 1 31.74677 -106.4028 WGS84
4 48 141 0055 44201 1 31.74677 -106.4028 WGS84
5 48 141 0055 44201 1 31.74677 -106.4028 WGS84
6 48 141 0055 44201 1 31.74677 -106.4028 WGS84
parameter date_local time_local date_gmt time_gmt sample_measurement
1 Ozone 2011-12-31 23:00 2012-01-01 06:00 0.023
2 Ozone 2011-12-31 22:00 2012-01-01 05:00 NA
3 Ozone 2011-12-31 21:00 2012-01-01 04:00 NA
4 Ozone 2011-12-31 20:00 2012-01-01 03:00 0.018
5 Ozone 2011-12-31 19:00 2012-01-01 02:00 0.006
6 Ozone 2011-12-31 18:00 2012-01-01 01:00 0.002
units_of_measure units_of_measure_code sample_duration sample_duration_code
1 Parts per million 007 1 HOUR 1
2 Parts per million 007 1 HOUR 1
3 Parts per million 007 1 HOUR 1
4 Parts per million 007 1 HOUR 1
5 Parts per million 007 1 HOUR 1
6 Parts per million 007 1 HOUR 1
sample_frequency detection_limit uncertainty
1 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
2 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
3 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
4 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
5 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
6 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
qualifier method_type method
1 <NA> FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
2 BF - Precision/Zero/Span. FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
3 BF - Precision/Zero/Span. FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
4 <NA> FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
5 <NA> FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
6 <NA> FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
method_code state county date_of_last_change cbsa_code
1 087 Texas El Paso 2012-01-23 21340
2 087 Texas El Paso 2012-01-23 21340
3 087 Texas El Paso 2012-01-23 21340
4 087 Texas El Paso 2012-01-23 21340
5 087 Texas El Paso 2012-01-23 21340
6 087 Texas El Paso 2012-01-23 21340
The original code, updated to produce a data frame without nested lists, runs in about 43.6 seconds, about 3 seconds faster than the lapply() version, which is a bit surprising.
base_url_site <- "https://aqs.epa.gov/data/api/sampleData/bySite"
years <- as.character(2011:2019)
dat <- NULL
system.time(for (year in years) {
  my_raw_result <- httr::GET(base_url_site,
                             query = list(email = "example@gmail.com", key = Sys.getenv("AQS_KEY"),
                                          param = "44201",
                                          bdate = paste(year, "0101", sep = ""),
                                          edate = paste(year, "1231", sep = ""),
                                          state = "48", county = "141", site = "0055"))
  my_content <- httr::content(my_raw_result, as = 'text')
  my_content_from_json <- fromJSON(my_content)
  dataList <- my_content_from_json$Data
  df <- data.frame(t(sapply(dataList, c)))
  df <- df[, colnames(df) != "uncertainty"]   # drop the NULL-filled uncertainty column
  unnestedDf <- tidyr::unnest(df, cols = colnames(df))
  dat <- rbind(dat, unnestedDf)
})
...and the runtime stats, which show the same pattern of elapsed time relative to user time:
user system elapsed
43.586 0.686 66.604
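For what it's worth, if each year's response parses cleanly to a data frame (as it does with jsonlite::fromJSON on this endpoint), a more concise variant of the same loop could use purrr::map_dfr() to fetch and bind in one step. This is only a sketch, not benchmarked against the versions above, and fetch_year is just an illustrative helper name:
library(httr)
library(jsonlite)
library(purrr)

base_url_site <- "https://aqs.epa.gov/data/api/sampleData/bySite"

fetch_year <- function(year) {   # illustrative helper, not part of the original code
  resp <- GET(base_url_site,
              query = list(email = "example@email.com", key = Sys.getenv("AQS_KEY"),
                           param = "44201",
                           bdate = paste0(year, "0101"),
                           edate = paste0(year, "1231"),
                           state = "48", county = "141", site = "0055"))
  fromJSON(content(resp, as = "text"))$Data  # jsonlite returns the Data element as a data frame
}

combinedData <- map_dfr(as.character(2011:2019), fetch_year)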

Change order of conditions when plotting normalised counts for single gene

I have a metadata df of 17 samples with the condition location, and I would like to plot normalised counts for a single gene, "photosystem II protein D1 1".
View(metadata)
sample location
<chr> <chr>
1 X1344 West
2 X1345 West
3 X1365 West
4 X1366 West
5 X1367 West
6 X1419 West
7 X1420 West
8 X1421 West
9 X1473 Mid
10 X1475 Mid
11 X1528 Mid
12 X1584 East
13 X1585 East
14 X1586 East
15 X1678 East
16 X1679 East
17 X1680 East
View(countdata)
func X1344 X1345 X1365 X1366 X1367 X1419 X1420 X1421 X1473 X1475 X1528 X1584 X1585 X1586 X1678 X1679 X1680
photosystem II protein D1 1 11208 6807 3483 4091 12198 7229 7404 5606 6059 7456 4007 2514 5709 2424 2346 4447 5567
countdata contains thousands of genes but I am only showing the headers and gene of interest
ddsMat has been created like this:
ddsMat <- DESeqDataSetFromMatrix(countData = countdata,
                                 colData = metadata,
                                 design = ~ location)
When plotting:
library(DESeq2)
plotCounts(ddsMat, "photosystem II protein D1 1", intgroup=c("location"))
By default, the function plots the "conditions" alphabetically, e.g. East-Mid-West, but I would like to order them on the graph as West-Mid-East.
[image: plotCounts output]
Is there a way of doing this?
Thanks,
I have found that you can manually change the order like this:
ddsMat$location <- factor(ddsMat$location, levels=c("West", "Mid", "East"))
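After re-levelling, calling plotCounts() again should honour the new West-Mid-East order; a minimal sketch, assuming the ddsMat and gene name from the question:
plotCounts(ddsMat, "photosystem II protein D1 1", intgroup = c("location"))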

extracting values of a column into a string and replacing values in a data frame column

More than the programming, I am lost on the right approach for this problem. I have 2 data frames with a market name column. Unfortunately the names vary by a few characters / symbols in each column, e.g. Albany.Schenectady.Troy = ALBANY, Boston.Manchester = BOSTON.
I want to standardize the market names in both data frames so I can perform merge operations later.
I thought of tackling the problem in two steps:
1) Create a vector of the unique market names from both tables and use that to create a look up table. Something that looks like:
Table 1 Markets > "Albany.Schenectady.Troy" , "Albuquerque.Santa.Fe", "Atlanta" . . . .
Table2 Markets > "SPOKANE" , "BOSTON" . . .
I tried marketnamesvector <- paste(unique(Table1$Market, sep = "", collapse = ",")) but that doesn't produce the desired output.
2) Change Market names in Table 2 to equivalent market names in Table 1. For any market name not available in Table 1, Table 2 should retain the same value in market name.
I know I could use a looping function like below but I still need a lookup table I think.
replacefunc <- function(data, oldvalue, newvalue) {
  newdata <- data
  for (i in unique(oldvalue)) newdata[data == i] <- newvalue[oldvalue == i]
  newdata
}
Table 1: This table is 90 rows x 2 columns and has 90 unique market names.
Market Leads Investment Leads1 Leads2 Leads3
1 Albany.Schenectady.Troy NA NA NA NA NA
2 Albuquerque.Santa.Fe NA NA NA NA NA
3 Atlanta NA NA NA NA NA
4 Austin NA NA NA NA NA
5 Baltimore NA NA NA NA NA
Table 2 : This table is 150K rows x 20 columns and has 89 unique market names.
> df
Spot.ID Date Hour Time Local.Date Broadcast.Week Local.Hour Local.Time Market
2 13072765 6/30/14 0 12:40 AM 2014-06-29 1 21 9:40 PM SPOKANE
261 13072946 6/30/14 5 5:49 AM 2014-06-30 1 5 5:49 AM BOSTON
356 13081398 6/30/14 10 10:52 AM 2014-06-30 1 7 7:52 AM SPOKANE
389 13082306 6/30/14 11 11:25 AM 2014-06-30 1 8 8:25 AM SPOKANE
438 13082121 6/30/14 8 8:58 AM 2014-06-30 1 8 8:58 AM BOSTON
469 13081040 6/30/14 9 9:17 AM 2014-06-30 1 9 9:17 AM ALBANY
482 13080104 6/30/14 12 12:25 PM 2014-06-30 1 9 9:25 AM SPOKANE
501 13082120 6/30/14 9 9:36 AM 2014-06-30 1 9 9:36 AM BOSTON
617 13080490 6/30/14 13 1:23 PM 2014-06-30 1 10 10:23 AM SPOKANE
Assume that the data is in data frames df1 and df2. The goal is to adjust the market names so they match; they are currently slightly different.
First, list the markets. Use the following commands to get the sorted unique names in each data frame and the differences between them:
mk1 <- sort(unique(df1$market))
mk2 <- sort(unique(df2$market))
dmk12 <- setdiff(mk1,mk2)
dmk21 <- setdiff(mk2,mk1)
Use dmk12 and dmk21 to identify the markets whose names differ. Decide which names to use and how they match up; say we want to rename "Atlanta" in df2 to the df1 name "Atlanta, GA". Then use
df2[df2$market=="Atlanta","market"] = "Atlanta, GA"
The format is
df_to_change[df_to_change[,"column"]=="old data", "column"] = "new data"
If you only have 90 names to correct, I would write out 90 change lines like the one above.
After adjusting all the names, run sort(unique(...)) on each market column again and use setdiff twice to confirm all the names are the same.
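If you would rather keep the mapping in one place than write out 90 separate lines, a named-vector lookup table is another option. A sketch of that approach (the two example pairs are hypothetical; fill in the real mappings):
# names = df2 spellings, values = df1 spellings (hypothetical examples)
lookup <- c(ALBANY = "Albany.Schenectady.Troy",
            BOSTON = "Boston.Manchester")
idx <- match(df2$market, names(lookup))
df2$market <- ifelse(is.na(idx), df2$market, lookup[idx])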

Arrange dataframe for pairwise correlations

I am working with data in the following form:
Country Player Goals
"USA" "Tim" 0
"USA" "Tim" 0
"USA" "Dempsey" 3
"USA" "Dempsey" 5
"Brasil" "Neymar" 6
"Brasil" "Neymar" 2
"Brasil" "Hulk" 5
"Brasil" "Luiz" 2
"England" "Rooney" 4
"England" "Stewart" 2
Each row represents the number of goals that a player scored per game, and also contains that player's country. I would like to have the data in the form such that I can run pairwise correlations to see whether being from the same country has some association with the number of goals that a player scores. The data would look like this:
Player_1 Player_2
0 8 # Tim Dempsey
8 5 # Neymar Hulk
8 2 # Neymar Luiz
5 2 # Hulk Luiz
4 2 # Rooney Stewart
(You can ignore the comments, they are there simply to clarify what each row contains).
How would I do this?
table(df$player)
gets me a count per player, but then how do I generate these pairwise combinations?
This is a pretty classic self-join problem. I'm gonna start by summarizing your data to get the total goals for each player. I like dplyr for this, but aggregate or data.table work just fine too.
library(dplyr)
df <- df %>% group_by(Player, Country) %>% dplyr::summarize(Goals = sum(Goals))
> df
Source: local data frame [7 x 3]
Groups: Player
Player Country Goals
1 Dempsey USA 8
2 Hulk Brasil 5
3 Luiz Brasil 2
4 Neymar Brasil 8
5 Rooney England 4
6 Stewart England 2
7 Tim USA 0
Then, using good old merge, we join it to itself based on Country. So we don't get each row twice (Dempsey/Tim and Tim/Dempsey, not to mention Dempsey/Dempsey), we subset it so that Player.x is alphabetically before Player.y. Since I already loaded dplyr I'll use filter, but subset would do the same thing.
df2 <- merge(df, df, by.x = "Country", by.y = "Country")
df2 <- filter(df2, as.character(Player.x) < as.character(Player.y))
> df2
Country Player.x Goals.x Player.y Goals.y
2 Brasil Hulk 5 Luiz 2
3 Brasil Hulk 5 Neymar 8
6 Brasil Luiz 2 Neymar 8
11 England Rooney 4 Stewart 2
15 USA Dempsey 8 Tim 0
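To get exactly the two-column layout shown in the question, you can then pull off the goal columns and rename them, e.g.:
goal_pairs <- setNames(df2[, c("Goals.x", "Goals.y")], c("Player_1", "Player_2"))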
The self-join could be done in dplyr if we made a little copy of the data and renamed the Player and Goals columns so they wouldn't be joined on. Since merge is pretty smart about the renaming, it's easier in this case.
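For completeness, that dplyr self-join might look something like this (a sketch, assuming the summarized df above; df_copy is just an illustrative name):
library(dplyr)
# Renamed (ungrouped) copy so the player/goal columns don't collide in the join
df_copy <- df %>% ungroup() %>% rename(Player2 = Player, Goals2 = Goals)
df2 <- df %>%
  ungroup() %>%
  inner_join(df_copy, by = "Country") %>%
  filter(as.character(Player) < as.character(Player2))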
There is probably a smarter way to get from the aggregated data to the pairs, but assuming your data is not too big (national soccer data), you can always do something like:
# Total goals per player per country (column names keep the df$ prefix,
# which is why they appear that way in the output below)
A <- aggregate(df$Goals ~ df$Player + df$Country, data = df, sum)
players_in_c <- table(A[, 2])            # number of players per country
dat <- NULL
for (i in levels(df$Country)) {
  count <- players_in_c[i]
  pair <- combn(count, m = 2)            # all within-country pairs of players
  B <- A[A[, 2] == i, ]
  dat <- rbind(dat, cbind(B[pair[1, ], ], B[pair[2, ], ]))
}
dat
> dat
df$Player df$Country df$Goals df$Player df$Country df$Goals
1 Hulk Brasil 5 Luiz Brasil 2
1.1 Hulk Brasil 5 Neymar Brasil 8
2 Luiz Brasil 2 Neymar Brasil 8
4 Rooney England 4 Stewart England 2
6 Dempsey USA 8 Tim USA 0

Clustering / Matching Over Many Dimensions in R

I have a very large and complex data set with many observations of companies. Some of the observations of the companies are redundant and I need to make a key to map the redundant observations to a single one. However the only way to tell if they are actually representing the same company is through the similarity of a variety of variables. I think the appropriate approach is a kind of clustering based on a variety of conditions or perhaps even some kind of propensity score matching. Perhaps I just need flexible tools for making a complex kind of similarity matrix.
Unfortunately, I am not quite sure how to go about that in R. Most of the tools I've seen for clustering and categorizing seem to do so with either numerical distance or categorical data, but don't seem to allow multiple conditions or user specified conditions.
Below I've tried to create a smaller, public example of the kind of data I am working with and the result I am trying to produce. There are some conditions that must apply, for example, the location must be the same. There are some features that may associate one observation with another, for example var1 and var2, and others that can associate observations but must not conflict, such as var3.
An additional layer of complexity is that the kind of association used to map the redundant observations varies. For example, id1 and id2 are the same company redundantly entered into the data twice. In one place its name is "apples" and in another "red apples". They share the same location, var1 value and var3 (after adjusting for formatting). Similarly, ids 3, 5 and 6 are also really just one company, though much of the input for each is different. Some clusters would identify multiple observations, others would only have one. Ideally I would like to find a way to categorize or associate the observations based on several conditions, for example:
1. Test that the location is the same
2. Test whether var3 is different
3. Test whether one name is a substring of another
4. Test the edit distance of names
5. Test the similarity of var1 and var2 between observations
Anyways, hopefully there are better, more flexible tools for this than what I am finding or someone has experience with this kind of data work in R. Any and all suggestions and advice are much appreciated!
Data
id name location var1 var2 var3
1 apples US 1 abc 12345
2 red apples US 1 NA 12-345
3 green apples Mexico 2 def 235-92
4 bananas Brazil 2 abc NA
5 oranges Mexico 2 NA 23592
6 green apple Mexico NA def NA
7 tangerines Honduras NA abc 3498
8 mango Honduras 1 NA NA
9 strawberries Honduras NA abcd 3498
10 strawberry Honduras NA abc 3498
11 blueberry Brazil 1 abcd 2348
12 blueberry Brazil 3 abc NA
13 blueberry Mexico NA def 1859
14 bananas Brazil 1 def 2348
15 blackberries Honduras NA abc NA
16 grapes Mexico 6 qrs NA
17 grapefruits Brazil 1 NA 1379
18 grapefruit Brazil 2 bcd 1379
19 mango Brazil 3 efaq NA
20 fuji apples US 4 NA 189-35
Result
id name location var1 var2 var3 Result
1 apples US 1 abc 12345 1
2 red apples US 1 NA 12-345 1
3 green apples Mexico 2 def 235-92 3
4 bananas Brazil 2 abc NA 4
5 oranges Mexico 2 NA 23592 3
6 green apple Mexico NA def NA 3
7 tangerines Honduras NA abc 3498 7
8 mango Honduras 1 NA NA 8
9 strawberries Honduras NA abcd 3498 7
10 strawberry Honduras NA abc 3498 7
11 blueberry Brazil 1 abcd 2348 11
12 blueberry Brazil 3 abc NA 11
13 blueberry Mexico NA def 1859 13
14 bananas Brazil 1 def 2348 11
15 blackberries Honduras NA abc NA 15
16 grapes Mexico 6 qrs NA 16
17 grapefruits Brazil 1 NA 1379 17
18 grapefruit Brazil 2 bcd 1379 17
19 mango Brazil 3 efaq NA 19
20 fuji apples US 4 NA 189-35 20
Thanks in advance for your time and help!
library(stringdist)
getMatches <- function(df, tolerance = 6){
  out <- integer(nrow(df))
  for (row in 1:nrow(df)) {
    dists <- numeric(nrow(df))
    for (col in 1:ncol(df)) {
      # Levenshtein distance between this row's value and every row's value in the column
      tempDist <- stringdist(df[row, col], df[, col], method = "lv")
      # WARNING: treats NA as a distance of 0, i.e. a perfect match.
      tempDist[is.na(tempDist)] <- 0
      dists <- dists + tempDist
    }
    dists[row] <- Inf                  # ignore the row's distance to itself
    min_dist <- min(dists)
    if (min_dist < tolerance) {
      out[row] <- which.min(dists)     # index of the closest other row
    } else {
      out[row] <- row                  # no close enough match: map to itself
    }
  }
  return(out)
}
test$Result <- getMatches(test[, -1])
Where test is your data. This almost certainly needs some refining and some postprocessing. It creates a column with the index of the closest match; if it can't find a match within the given tolerance, it returns the row's own index.
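One possible post-processing step (purely a sketch, assuming the igraph package is acceptable) is to treat each row/closest-match pair as an edge and take connected components as the final cluster labels, so chains of matches collapse into one group:
library(igraph)
# Edges connect each row to its closest match; rows matched to themselves become singletons
g <- graph_from_edgelist(cbind(seq_len(nrow(test)), test$Result), directed = FALSE)
test$cluster <- components(g)$membership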
EDIT: I will attempt some more later.
