R Value Mapping with a Vector

I'm looking for a function similar to FindReplace that will allow me to map values based on a vector, rather than a single value.
I have a lookup table that looks like this that I want to use to map values in a dataframe.
COLUMN_NAME   CODE   DESCRIPTION
arrmin          97   Officially Cancelled
arrmin          98   Unknown if Arrived
atmcond         -1   Blank
atmcond          0   No Additional Atmospheric Conditions
This lookup table has thousands of rows, so I can't type them in manually, and my original solution is too inefficient and will take days to run.
The dataframe I am using has hundreds of columns, such as arrmin and atmcond, that need their values changed from 97 to Officially Cancelled, and so on.
The values from 0-100 (or however many values there are) mean different things depending on which column they are in. I've written the code below, but it is really inefficient and takes days to run for 300k rows.
columnsToReplace <- which(colnames(CRASH) %in% CapitalColumns)
dfColumns <- colnames(CRASH)
for (i in columnsToReplace) {
  tempColumn <- dfColumns[i]
  # Restrict the lookup table to the rows describing this column
  tempLookup <- capitalLookupTable[which(capitalLookupTable$COLUMN_NAME == tempColumn), ]
  CRASH <- FindReplace(data = CRASH, Var = tempColumn, replaceData = tempLookup,
                       from = "Code", to = "Description", exact = TRUE)
}
CapitalColumns is a vector I created that contains the string names of each of the columns that exist in the lookup table; columnsToReplace holds their positions in CRASH.

# Some data
s <- data.frame(A = c(1, 1, 2, 2), B = c(2, 4, 6, 6), C = c(1, 3, 5, 7))
mapping <- data.frame(ColumnName = c(rep("A", 2), rep("B", 3), rep("C", 4)),
                      Code = c(1, 2, 2, 4, 6, 1, 3, 5, 7))
mapping$Description <- paste0(mapping$ColumnName, mapping$Code)
# From wide to long
library(reshape)
melted.s <- melt(s)
# Join (note: merge() sorts by the join keys, so reconstructing the wide frame
# below relies on the values within each column of s already being sorted,
# as they are in this example)
melted.s <- merge(melted.s, mapping, by.x = c("variable", "value"),
                  by.y = c("ColumnName", "Code"))
# From long to wide
p <- data.frame(matrix(melted.s$Description, ncol = ncol(s)))
names(p) <- names(s)
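If the per-column recoding itself is the bottleneck, here is a base-R sketch (not from the answer above) that looks each column up against its own slice of the lookup table with match(); it assumes the lookup columns are named COLUMN_NAME, CODE and DESCRIPTION as in the table shown, and unlike the merge approach it never reorders rows:
for (col in intersect(unique(capitalLookupTable$COLUMN_NAME), names(CRASH))) {
  lk  <- capitalLookupTable[capitalLookupTable$COLUMN_NAME == col, ]
  idx <- match(CRASH[[col]], lk$CODE)            # one vectorized lookup per column
  CRASH[[col]] <- ifelse(is.na(idx),
                         as.character(CRASH[[col]]),           # keep codes with no mapping
                         as.character(lk$DESCRIPTION)[idx])
}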

For loop for selecting certain data to form a new data frame

First of all, I am using the ukpolice library in R and extracted data to a new data frame called crimes. Now I am running into a new problem: I am trying to extract certain data to a new, empty data frame called df.shoplifting. If the category of the crime is equal to "shoplifting", it needs to add the id, month and street name to the new dataframe. I need to use a loop and an if statement together.
EDIT:
Currently I have this working, but it lacks the if statement:
for (i in crimes$category) {
  shoplifting <- subset(crimes, category == "shoplifting", select = c(id, month, street_name))
  names(shoplifting) <- c("ID", "Month", "Street_Name")
}
What I am trying to do:
for (i in crimes$category) {
  if (crimes$category == "shoplifting") {
    data1 <- subset(crimes, category == i, select = c(id, month, street_name))
  }
}
It does run and creates the new data frame data1, but the data it extracts is wrong and does not only include items with the shoplifting category.
I'll guess, and update if needed based on your question edits.
rbind works only on data.frame and matrix objects, not on vectors. If you want to extend a vector (N.B., that is not part of a frame or column/row of a matrix), you can merely extend it with c(somevec, newvals) ... but I think that this is not what you want here.
You are iterating through each value of crimes$category, but if one category matches, then you are appending all data within crimes. I suspect you mean to subset crimes when adding. We'll address this in the next bullet.
One cannot extend a single column of a multi-column frame in the absence of the others. A data.frame has a restriction that all columns must always have the same length, and extending one column defeats that. (And doing all columns immediately-sequentially does not satisfy that restriction.)
One way to work around this is to rbind a just-created data.frame:
# i = "shoplifting"
newframe <- subset(crimes, category == i, select = c(id, month, street_name))
names(newframe) <- c("ID", "Month", "Street_Name") # match df.shoplifting names
df.shoplifting <- rbind(df.shoplifting, newframe)
I don't have the data, but if crimes$category ever has repeats, you will re-add all of the same-category rows to df.shoplifting. This might be a problem with my assumptions, but is likely not what you really need.
If you really just need to do it once for a category, then do this without the need for a for loop:
df.shoplifting <- subset(crimes, category == "shoplifting", select = c(id, month, street_name))
# optional
names(df.shoplifting) <- c("ID", "Month", "Street_Name")
Iteratively adding rows to a frame is a bad idea: while it works okay for smaller datasets, as your data scales, the performance worsens. Why? Because each time you add rows to a data.frame, the entire frame is copied into a new object. It's generally better to form a list of frames and then concatenate them all later (c.f., https://stackoverflow.com/a/24376207/3358227).
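A minimal sketch of that list-then-bind pattern, using the crimes frame from the question (the loop over categories is purely illustrative):
pieces <- list()
for (i in unique(crimes$category)) {
  # one small frame per category, collected in a list
  pieces[[i]] <- subset(crimes, category == i, select = c(id, month, street_name))
}
combined <- do.call(rbind, pieces)   # a single rbind instead of many incremental ones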
On this note, if you need one frame per category, you can get that simply with:
df_split <- split(crimes, crimes$category)
and then operate on each category as its own frame by working on a specific element within the df_split named list (e.g., df_split[["shoplifting"]]).
And lastly, depending on the analysis you're doing, it might still make sense to keep it all together. Both the dplyr and data.table dialects of R make doing calculations on data within groups very intuitive and efficient.
Try:
df.shoplifting <- crimes[which(crimes$category == 'shoplifting'),]
Using a for loop in this instance will work, but when working in R you want to stick to vectorized operations if you can.
This operation subsets the crimes dataframe and selects rows where the category column is equal to shoplifting. It is not necessary to convert the category column into a factor - you can match the string with the == operator.
Note the comma at the end of the which(...) function, inside of the square brackets. The which function returns indices (row numbers) that meet the criteria. The comma after the function tells R that you want all of the columns. If you wanted to select only a few columns you could do:
df.shoplifting <- crimes[which(crimes$category == 'shoplifting'), c("id", "month", "street_name")]
Or you could select the columns by their position (I don't have your data so I don't know the numbers... but if id, month and street_name are the first three columns, you could use 1, 2, 3).
df.shoplifting <- crimes[which(crimes$category == 'shoplifting'),c(1,2,3)]

Counting unique subsets of data efficiently

I have a relatively large dataset that I wouldn't qualify as 'big data'. It's around 3 to 5 million rows; because of the size I'm using the data.table library to do analysis.
The dataset (named df, which is a data.table structure) composition can essentially be broken into:
n identifier fields, hereafter ID_1, ID_2, ..., ID_n, some of which are numeric and some of which are character vectors.
m categorical variables, hereafter C_1, ..., C_m, all of which are character vectors and have very few values apiece (2 in one, 3 in another, etc.)
2 measurement variables, M_1 and M_2, both numeric.
A subset of data is identified by ID_1 through ID_n, has a full set of all values of C_1 through C_m, and has a range of values of M_1 and M_2. A subset of data consists of 126 records.
I need to accurately count the unique sets of data and, because of the size of the data, I would like to know if there already exists a much more efficient way to do this. (Why roll my own if other, much smarter, people have done it already?)
I've already done a fair amount of Google work to arrive at the method below.
What I've done is to use the ht package (https://github.com/nfultz/ht) so that I can use a data frame as a hash key (it uses digest in the background).
I paste together the ID fields to create a new, single column, hereafter referred to as ID, which resembles...
ID = "ID1:ID2:...:IDn"
Then I loop through each unique set of identifiers and then, using just the subset data frame of C_1 through C_m, M_1, and M_2 (126 rows of data), hash the value / increment the hash.
Afterwards I'm taking that information and putting it back into the data frame.
library(data.table)
library(ht)

# Create the hash structure
datasets <- ht()
# Declare the fields which will denote a subset of data
uniqueFields <- c("C_1", ..., "C_m", "M_1", "M_2")
# Create the REPETITIONS field in the original data.table structure
df[, REPETITIONS := 0]
# Create the KEY field in the original data.table structure
df[, KEY := ""]

# Use the updateHash function to fill datasets
updateHash <- function(val) {
  key <- df[ID == val, uniqueFields, with = FALSE]
  if (is.null(datasets[key])) {
    # If this unique set of data doesn't already exist in datasets...
    datasets[key] <- list(val)
  } else {
    # If this unique set of data does already exist in datasets...
    datasets[key] <- append(datasets[key], val)
  }
}

# Loop through the ID fields. I've explored using apply;
# this vector is around 10-15K long. This version works.
for (id in unique(df$ID)) {
  updateHash(id)
}

# Now update the original data.table structure so the analysis can
# be done. Again, I could use the R apply family; this version works.
for (dataset in ls(datasets)) {
  IDS <- unlist(datasets[[dataset]]$val)
  # For this set of information, how many times was it repeated?
  df[ID %in% IDS, REPETITIONS := length(datasets[[dataset]]$val)]
  # For this set, what is a unique identifier?
  df[ID %in% IDS, KEY := dataset]
}
This does what I want, though it is not blindingly fast. I now have the capability to present some neat analysis revolving around variability in datasets to people who care about it. I don't like that it's hacky and, one way or another, I'm going to clean this up and make it better. Before I do that I want to do my final due diligence and see if it's simply my Google Fu failing me.
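Not part of the original question, but for comparison, a data.table-only sketch of the same idea, under the assumptions that the columns are named as described above and that the rows within each ID arrive in a consistent order (otherwise they would need to be sorted first):
library(data.table)
library(digest)

# Hash each ID's block of categorical and measurement columns; identical blocks
# hash to the same KEY (as.list drops data.table-specific attributes so that
# equal content gives equal hashes)
df[, KEY := digest(as.list(.SD)), by = ID, .SDcols = uniqueFields]
# REPETITIONS = how many distinct IDs share exactly the same block
df[, REPETITIONS := uniqueN(ID), by = KEY]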

Non-numeric argument to binary operator in R for data read from CSV

I'm very new to R, and I know that loops aren't always considered a good option in R, but only one part of my code is giving me trouble. Essentially, I'm analysing longitudinal data: each subject has two tables of measures (each table has approx. 200 different study variables) from two different time points. I've read them all in, stored as different variables in R, and am trying to subtract the first table from the second for each participant.
It works fine if I run this as an individual line of code:
data_difference_n <- data_2_n - data_1_n
where n is the participant's ID number, but that would mean running this line for about 1,000 participants whose IDs aren't consecutive numbers. So I've tried to put it inside a loop:
participants <- c(100, 105, 106, 119, ...)
for (n in participants) {
  ...
  data_difference_n <- paste("data_difference", n, sep = "_")
  data_1_n <- paste("data_1", n, sep = "_")
  data_2_n <- paste("data_2", n, sep = "_")
  data_difference_n <- data_2_n - data_1_n
}
which gives me an error of "non-numeric argument to binary operator".
Each data table is a CSV with the same properties, mostly numbers and some cells with N/A. The first bit of code gives me the result I want: a new table where all the numerical values are the values in the first table subtracted from the values in the second, for that participant. I'm confused about why the second bit of code doesn't work, because it should be referring to the same variables as the first?
I've tried reading a lot of other posts about this error here and on other sites but can't seem to resolve it. This one says that using apply converts the data frame to a character matrix; is it the same principle with looping? I feel like I'm missing something really basic and simple here - apologies if so, and I would appreciate any help!
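Not from the original post, but as a sketch of where the error comes from: paste() only builds the names of the objects as character strings, and subtracting two strings is what triggers "non-numeric argument to binary operator". One common workaround, assuming the per-participant frames really are named data_1_100, data_2_100, and so on, is to fetch the objects with get() and collect the results in a list:
participants <- c(100, 105, 106, 119)        # illustrative IDs only
differences <- list()
for (n in participants) {
  d1 <- get(paste("data_1", n, sep = "_"))   # the data frame itself, not its name
  d2 <- get(paste("data_2", n, sep = "_"))
  differences[[as.character(n)]] <- d2 - d1
}
# e.g. differences[["100"]] is data_2_100 - data_1_100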

Loop through dataframe rows and add value in a new column (R)

I have a dataframe (df) with a column of Latitude (Lat), and I need to match up the corresponding Longitude value (based off relationships in another dataset). New column name is 'Long_matched'.
Here, I am trying to write a new value into the column 'Long_matched' in the rows whose latitude falls between -33.9238 and -33.9236. The data in 'Lat' has many more decimal places (e.g. -33.9238026666667, -33.9236026666667, etc.). As I will be applying this code to multiple datasets over the same geographical location (hence the long decimals will vary slightly), I want to write longitude values which fall within a 0.0002 degree range.
Some attempts of code I have tried include:
df$Long_matched <- ifelse(df$Lat< -33.9236 & df$Lat> -33.9238, 151.2279 , "N/A")
or
df$Long_matched[df$Lat< -33.9236 & df$Lat> -33.9238] <- 151.2279
I think I need to use a for loop to loop through the rows and an if statement, but struggling to figure this out - any help would be appreciated!
Resulting output should look something like this:
Lat Long_matched
-33.9238026666667 151.2279
-33.9236026666667 (new long value will go here)
Everything said in the comments applies, but here is a trick you can try:
In the following code, you will need to replace text with numbers.
Latitude_breaks <- seq(min_latitude, max_latitude, 0.0002) # you need to replace `min_latitude` and `max_latitude` with numbers
Longitude_values <- seq(first, last, increment) # you need to replace `first`, `last` and `increment` with numbers
df <- within(df, {
  # make a categorical version of `Lat`
  Lat_cat <- cut(Lat, Latitude_breaks)
  Long_matched <- Longitude_values[Lat_cat]
})
A few notes:
The values of Lat between min_latitude and min_latitude + 0.0002 will be assigned the first value of Longitude_values.
The length of Latitude_breaks should be one more than the length of Longitude_values.
Values of Lat outside of Latitude_breaks will become NA.
This works by exploiting a nice feature of factors - they are stored as integers. So we can use them to index another vector - in this case, Longitude_values.
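A toy illustration of that indexing trick (made-up longitudes, just to show the mechanics):
breaks <- c(-33.9240, -33.9238, -33.9236)    # two 0.0002-degree bins
longs  <- c(151.2279, 151.2281)              # one (hypothetical) longitude per bin
lat    <- c(-33.9238026666667, -33.9236026666667)
bin    <- cut(lat, breaks)                   # factor; stored internally as 1, 2
longs[bin]                                   # 151.2279 151.2281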

R: Find duplicates that are within a certain value

I have a data frame (df) that includes latitude and longitude coordinates (Lat, Long) as well as the depth (Depth) of a temperature measurement for each entry. Essentially, each entry has (x,y,z)=(Lat,Long,Depth) locational information for each temperature measurement.
I'm trying to clean the data by finding and removing duplicate measurement locations. The easy part is removing the exact duplicates, handled as such:
df = df[!(duplicated(df$Lat) & duplicated(df$Long) & duplicated(df$Depth)),]
However the problem is that the values of lat/long for some entries are just slightly off, meaning the above code won't catch them but they are still clearly duplicated (e.g. lat = 39.252880 & lat = 39.252887).
Is there a way to find duplicates that are within a certain absolute value or percentage of the first instance?
I appreciate any help, thanks!
Based on this post I was able to come up with a solution. I modified the function to use a tighter tolerance of 0.001 for "duplicates"; otherwise the function is unchanged. The application, however, changes slightly to:
output <- data.frame(apply(dataDF, 2, fun))
because I wanted to compare values within a single column instead of in a single row.
To continue, I then add artificial indices to my output data frame for later use:
output$ind <- 1:nrow(output)
The main part is finding the row indices where the function returned TRUE for the three locational information fields (lat, long, depth). The following code finds the indices where all three were true, creates a temporary data frame with only those entries (still logicals), finds the indices of the duplicates, then reverses it to return the indices that will be removed from the full, original dataset (pardon the bad variable names):
ind <- which(with(output, Lat == 'TRUE' & Long == 'TRUE' & Depth == 'TRUE'))
temp <- dataDF[ind, ]
temp$ind <- ind
ind2 <- duplicated(temp[, 1:3])
rem <- ind[ind2]
df.final <- dataDF[-rem, ]
Hopefully that helps! It's a bit complicated and the function is very slow for large datasets but it gets the job done.
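A simpler sketch, not from the post above, under the assumption that rounding coordinates to roughly 0.001 degrees is an acceptable notion of "the same location" (near-duplicates that straddle a rounding boundary will be missed), using df from the question:
tol_key <- data.frame(Lat   = round(df$Lat,  3),   # collapse lat/long into 0.001-degree bins
                      Long  = round(df$Long, 3),
                      Depth = df$Depth)
df.final <- df[!duplicated(tol_key), ]             # keep the first entry in each bin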
