R: Find duplicates that are within a certain value

I have a data frame (df) that includes latitude and longitude coordinates (Lat, Long) as well as the depth (Depth) of a temperature measurement for each entry. Essentially, each entry carries (x, y, z) = (Lat, Long, Depth) location information for one temperature measurement.
I'm trying to clean the data by finding and removing duplicate measurement locations. The easy part is removing the exact duplicates, handled as such:
df = df[!(duplicated(df$Lat) & duplicated(df$Long) & duplicated(df$Depth)),]
However, the problem is that the lat/long values for some entries are just slightly off, so the code above won't catch them even though they are clearly duplicates (e.g. lat = 39.252880 vs. lat = 39.252887).
Is there a way to find duplicates that are within a certain absolute value or percentage of the first instance?
I appreciate any help, thanks!

Based on this post I was able to come up with a solution. I modified the function to use a tighter tolerance of 0.001 for "duplicates"; otherwise the function is unchanged. The application, however, changes slightly to:
output = data.frame(apply(dataDF, 2, fun))
because I wanted to compare values within a single column instead of within a single row.
To continue, I then add artificial indices to my output data frame for later use:
output$ind = 1:nrow(output)
The main part is finding the rows where the function returned TRUE for all three location fields (Lat, Long, Depth). The code below finds those row indices, subsets the original data to just those rows, flags which of them are duplicates with duplicated(), and then maps that back to the row indices to remove from the full, original dataset (pardon the bad variable names):
ind = which(with(output, (Lat == 'TRUE' & Long == 'TRUE' & Depth == 'TRUE')))
temp = dataDF[ind, ]
temp$ind = ind
ind2 = duplicated(temp[, 1:3])
rem = ind[ind2]
df.final = dataDF[-rem, ]
Hopefully that helps! It's a bit convoluted and the function is very slow for large datasets, but it gets the job done.
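For reference, a simpler alternative that is often good enough for this kind of cleaning is to round the coordinates to the tolerance and deduplicate on the rounded values. This is only a sketch (not the approach above), assuming an absolute tolerance of 0.001 is acceptable for all three fields:
# round each coordinate to the tolerance and drop rows whose rounded (Lat, Long, Depth) key repeats
tol <- 0.001
key <- data.frame(Lat   = round(df$Lat / tol) * tol,
                  Long  = round(df$Long / tol) * tol,
                  Depth = round(df$Depth / tol) * tol)
df.rounded.dedup <- df[!duplicated(key), ]
Note that rounding can still separate two points that straddle a bin boundary, so it is not identical to a true pairwise-distance check.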

How to make a histogram of a non-numeric column in R?

I have a data frame called metadata with multiple columns, one of which is Benefit, which records whether each sample is CB (complete responding), ICB (intermediate) or NCB (non-responding at all).
So basically the Benefit column is a vector with three possible values:
Benefit <- c("CB", "ICB", "NCB")
I want to make a histogram/barplot based on the count of each of those values. So basically it's not a numeric column. I tried solving this with the following code:
hist(as.numeric(metadata$Benefit))
I also tried
barplot(metadata$Benefit)
which obviously didn't work.
The second thing I want to do is to find a relationship between the Age column of the same data frame and the Benefit column; for example, do younger patients get more benefit? Is there any way to do that?
THANKS!
Hi and welcome to the site :)
One nice way to find issues with code is to run only one command at a time.
# let's create some data
metadata <- data.frame(Benefit = c("ICB", "CB", "CB", "NCB"))
Now, as.numeric does not work on character data:
as.numeric(metadata$Benefit) # returns NAs
Instead, what we want is to count the number of instances of each unique value of the Benefit column; we do this with table:
tabledata <- table(metadata$Benefit)
and then barplot is the function we want for creating the plot:
barplot(tabledata)
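For the second part of the question (relating Age to Benefit), here is a minimal sketch, assuming metadata also has a numeric Age column; the values below are made up for illustration:
# hypothetical Age values, just for illustration
metadata <- data.frame(Benefit = c("ICB", "CB", "CB", "NCB"),
                       Age     = c(54, 61, 47, 70))
boxplot(Age ~ Benefit, data = metadata)  # age distribution within each Benefit group
A boxplot of Age split by Benefit gives a first impression of whether younger patients tend to fall in one group; a formal comparison would need something like a Kruskal-Wallis test.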

Merging Two Datasets on Matched Column in R

I'm an R beginner trying to merge two datasets, and I'm losing data in the process. I might be totally off base with what I'm doing.
The first dataset is the Dewey Decimal System and the data looks like this:
[image of 10 rows of data from this set]
I've named this dataset DDC
The next dataset is a list of books ordered during a particular time period.
[image of 10 rows of the book-ordering dataset]
I've named this dataset DOA
I'm unsure how to include the data other than as an image.
(Can also provide the .csvs if needed)
I would like to merge the sets based on the first three digits of the call number.
To achieve this I've created a new variable in both sets called Call_Category2 that takes the first three digits of the call number value to be matched.
DDC$Call_Category2 = str_pad(DDC$Call_Category, width = 3, side = "left", pad = "0")
This dataset is just over 1,000 rows. It is padded because the 000 to 099 Dewey Decimal classifications were dropping their leading zeros.
DOA_data = transform(DOA_data, Call_Category2 = substr(Call_Category, 1,3))
This dataset is about 24000 rows.
I merge the sets to create a new set called DOA_Call:
DOA_Call = merge(DDC, DOA_data, all.x = TRUE)
When I head() the data the merge seems to be working properly, but about 10,000 rows do not get the matching data added; they just stay in their original state. That is about 40% of my total dataset, so it is pretty substantial. My first instinct was that it was only including each DDC row once, but that would mean I would be missing 23,000 rows, which I'm not.
Am I doing something wrong with the merge or could it be an issue with the data not being clean enough?
Let me know if more information is needed!
I don't necessarily need code, pointers on what direction to troubleshoot in would be very helpful!
This is my best attempt with the information you provided. You will need to use:
functions such as left_join from dplyr (see https://dplyr.tidyverse.org/reference/join.html), as sketched below,
the stringr library to handle some variables,
and some familiarity with the tidyverse.
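A sketch only, using column names taken from the question (they may need adjusting to your real data); it builds the three-digit key on both sides, joins, and then lists the order rows whose key never matches:
library(dplyr)
library(stringr)

# build the join key on both sides, as in the question
DDC$Call_Category2      <- str_pad(DDC$Call_Category, width = 3, side = "left", pad = "0")
DOA_data$Call_Category2 <- str_sub(DOA_data$Call_Category, 1, 3)

# keep every ordered book and attach the matching DDC row where one exists
DOA_Call <- left_join(DOA_data, DDC, by = "Call_Category2")

# rows of DOA_data whose key never appears in DDC -- the ones to inspect for dirty keys
no_match <- anti_join(DOA_data, DDC, by = "Call_Category2")
Comparing nrow(no_match) with the roughly 10,000 unmatched rows should tell you whether the issue is the merge itself or keys that simply don't line up.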
Please keep in mind that the best way to ask on Stack Overflow is by providing a minimal reproducible example.

Loop through dataframe rows and add value in a new column (R)

I have a dataframe (df) with a column of Latitude (Lat), and I need to match up the corresponding Longitude value (based off relationships in another dataset). New column name is 'Long_matched'.
Here, I am trying to write a new value into the 'Long_matched' column for rows whose latitude falls between -33.9238 and -33.9236. The data in 'Lat' have many more decimal places (e.g. -33.9238026666667, -33.9236026666667, etc.). As I will be applying this code to multiple datasets over the same geographical location (so the long decimals will vary slightly), I want to assign longitude values to latitudes that fall within a 0.0002-degree range.
Some attempts of code I have tried include:
df$Long_matched <- ifelse(df$Lat< -33.9236 & df$Lat> -33.9238, 151.2279 , "N/A")
or
df$Long_matched[df$Lat< -33.9236 & df$Lat> -33.9238] <- 151.2279
I think I need a for loop over the rows together with an if statement, but I'm struggling to figure this out. Any help would be appreciated!
Resulting output should look something like this:
Lat                Long_matched
-33.9238026666667  151.2279
-33.9236026666667  (new long value will go here)
Everything said in the comments applies, but here is a trick you can try:
In the following code, you will need to replace the placeholder names with numbers.
Latitude_breaks <- seq(min_latitude, max_latitude, 0.0002) # replace `min_latitude` and `max_latitude` with numbers
Longitude_values <- seq(first, last, increment) # replace `first`, `last` and `increment` with numbers
df <- within(df, {
  # make a categorical version of `Lat`
  Lat_cat <- cut(Lat, Latitude_breaks)
  Long_matched <- Longitude_values[Lat_cat]
})
A few notes:
values of Lat between min_latitude and min_latitude + 0.0002 will be assigned the Longitude value marked first.
The length of Latitude_breaks should be one more than the length of Longitude_values.
Values of Lat outside of Latitude_breaks will become NA.
This works by exploiting a nice feature of factors - they are stored as integers. So we can use them to index another vector - in this case, Longitude_values.
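A worked sketch with the numbers from the question; only the first longitude value (151.2279) comes from the question, the other longitudes and the break endpoints are placeholders chosen to give 0.0002-degree bands:
df <- data.frame(Lat = c(-33.9238026666667, -33.9236026666667))
Latitude_breaks  <- seq(-33.9240, -33.9234, 0.0002)  # 4 break points -> 3 bands
Longitude_values <- c(151.2279, 151.2280, 151.2281)  # one longitude per band; last two are placeholders
df <- within(df, {
  Lat_cat      <- cut(Lat, Latitude_breaks)
  Long_matched <- Longitude_values[Lat_cat]
})
df  # the first row gets 151.2279, the second row gets the value for the next band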

R Value Mapping with a Vector

I'm looking for a function similar to FindReplace that will allow me to map values based on a vector, rather than a single value.
I have a lookup table that looks like this that I want to use to map values in a dataframe.
COLUMN_NAME  CODE  DESCRIPTION
arrmin       97    Officially Cancelled
arrmin       98    Unknown if Arrived
atmcond      -1    Blank
atmcond      0     No Additional Atmospheric Conditions
This lookup table has thousands of rows, so I can't type them in manually, and my original solution is too inefficient and will take days to run.
The data frame I am using has hundreds of columns, such as arrmin and atmcond, that need the values changed from 97 to Officially Cancelled, etc.
The meaning of the values from 0 to 100 (or however many there are) changes depending on which column they are in. I've written the code below, but it is really inefficient and takes days to run for 300k rows.
columnsToReplace <- which(colnames(CRASH) %in% CapitalColumns)
dfColumns <- colnames(CRASH)
for (i in columnsToReplace){
  tempColumn <- dfColumns[i]
  tempLookup <- capitalLookupTable[which(capitalLookupTable$COLUMN_NAME == tempColumn),]
  CRASH <- FindReplace(data = CRASH, Var = tempColumn, replaceData = capitalLookupTable,
                       from = "Code", to = "Description", exact = TRUE)
}
CapitalColumns is a vector I created that contains the string names of the columns that appear in the lookup table.
# Some data
s <- data.frame(A = c(1,1,2,2), B = c(2,4,6,6), C = c(1,3,5,7))
mapping <- data.frame(ColumnName = c(rep("A",2), rep("B",3), rep("C",4)),
                      Code = c(1,2,2,4,6,1,3,5,7))
mapping$Description <- paste0(mapping$ColumnName, mapping$Code)
# From wide to long
library(reshape)
melted.s <- melt(s)
# Join
melted.s <- merge(melted.s, mapping, by.x = c("variable","value"), by.y = c("ColumnName","Code"))
# From long to wide
p <- data.frame(matrix(melted.s$Description, ncol = ncol(s)))
names(p) <- names(s)
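Alternatively, here is a sketch of a per-column lookup using match(), shown with the same s and mapping as above (this is not the FindReplace approach from the question, just a vectorised substitute):
p2 <- s
for (col in names(p2)) {
  lut <- mapping[mapping$ColumnName == col, ]               # lookup rows for this column
  p2[[col]] <- lut$Description[match(p2[[col]], lut$Code)]  # NA where a code has no entry
}
Because match() handles a whole column at once, this should stay fast even with hundreds of columns and 300k rows.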

Hexbin: how to trace bin contents

After applying hexbin binning I would like to know which IDs or row numbers of the original data ended up in which bin.
I am currently analysing spatial data and I am binning, e.g., depth of water and temperature. Ideally, I would like to map the colormap of the bins back to the spatial map to see where more or less common parameter combinations exist. I'm not bound to hexbin though.
I wasn't able to figure out from the documentation how to trace which data point ends up in which bin. It seems hexbin() only stores counts.
Is there a function that generates a list with one entry for every bin, each containing a vector of all row numbers that were assigned to that bin?
Please point me in the right direction.
Up to now, I use plain hexbin to do the binning:
library(hexbin)
set.seed(5)
df <- data.frame(depth=runif(1000,min=0,max=100),temp=runif(1000,min=4,max=14))
h <- hexbin(df)
but currently I see no way to extract row names of df from h that link the bins back to df. Possibly there is no such thing, maybe I overlooked it, or maybe a completely different approach is needed.
Assuming you are using the hexbin package, you will need to set IDs = TRUE to be able to go back to the original rows:
library(hexbin)
set.seed(5)
df <- data.frame(depth=runif(1000,min=0,max=100),temp=runif(1000,min=4,max=14))
h <- hexbin(df, IDs = TRUE)
Then to get the bin number for each observation, you can use
h@cID
To get the count of observations in the cell populated by a particular observation, you would do
h@count[match(h@cID, h@cell)]
The idea is that the second observation df[2,] is in cell h@cID[2] = 424. Cell 424 is at index which(h@cell == 424) = 241 in the list of cells (zero-count cells appear to be omitted). The number of observations in that cell is h@count[241] = 2.
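Building on that, a small sketch of the per-bin list the question asks for, using the same h as above (with IDs = TRUE):
# one list entry per occupied bin, each holding the row numbers of df that fell into it
bin_members <- split(seq_len(nrow(df)), h@cID)
bin_members[["424"]]  # e.g. the rows that landed in cell 424, which includes row 2
The list names are the cell ids, so they can be matched back to the counts or coordinates of the bins.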
