I have a dataframe (df) with a column of Latitude (Lat), and I need to match up the corresponding Longitude value (based on relationships in another dataset). The new column is named 'Long_matched'.
Here, I am trying to write a new value in the column 'Long_matched', at the corresponding row, for latitudes between -33.9238 and -33.9236. The data in 'Lat' has many more decimal places (e.g. -33.9238026666667, -33.9236026666667, etc.). As I will be applying this code to multiple datasets over the same geographical location (hence the long decimals will vary slightly), I want to write Longitude values which fall within a 0.0002 degree range.
Some of the code I have tried includes:
df$Long_matched <- ifelse(df$Lat < -33.9236 & df$Lat > -33.9238, 151.2279, "N/A")
or
df$Long_matched[df$Lat < -33.9236 & df$Lat > -33.9238] <- 151.2279
I think I need to use a for loop to loop through the rows and an if statement, but I am struggling to figure this out - any help would be appreciated!
Resulting output should look something like this:
Lat Long_matched
-33.9238026666667 151.2279
-33.9236026666667 (new long value will go here)
Everything said in the comments applies, but here is a trick you can try:
In the following code, you will need to replace text with numbers.
Latitude_breaks <- seq(min_latitude, max_latitude, 0.0002) # you need to replace `min_latitude` and `max_latitude` with numbers
Longitude_values <- seq(first, last, increment) # you need to replace `first`, `last` and `increment` with numbers
df <- within(df, {
# make a categorical version of `Lat`
Lat_cat <- cut(Lat, Latitude_breaks)
Long_matched <- Longitude_values[Lat_cat]
})
A few notes:
the values of Lat between min_latitude and min_latitude + 0.0002 will be assigned the Longitude value marked first.
The length of Latitude_breaks should be one more than the length of Longitude_values.
Values of Lat outside of Latitude_breaks will become NA.
This works by exploiting a nice feature of factors - they are stored as integers. So we can use them to index another vector - in this case, Longitude_values.
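For instance, with made-up breaks and values (replace them with the numbers from your own data), the trick looks like this:

```r
# made-up breaks/values standing in for the real ones
Latitude_breaks  <- seq(-33.9240, -33.9234, 0.0002)  # 4 breaks -> 3 bins
Longitude_values <- c(151.2279, 151.2281, 151.2283)  # one Longitude per bin

df <- data.frame(Lat = c(-33.9238026666667, -33.9236026666667))
df <- within(df, {
  Lat_cat      <- cut(Lat, Latitude_breaks)   # factor: which bin each Lat falls in
  Long_matched <- Longitude_values[Lat_cat]   # the factor's integer codes index the vector
})
df$Long_matched
# [1] 151.2279 151.2281
```

Each Lat lands in one 0.0002-degree bin, so small variations in the trailing decimals map to the same Longitude.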
I would like to be able to see whether values in a column of a data frame fall within the range of values in another data frame in R. I can do that with sapply and fill in Yes/NA with ifelse. However, I want to find out exactly which row numbers (and, even better, what content) of the first file were used for those Yes rows. In other words, I want to know which rows of the first file the contents of the second df matched, and then get the contents of a specific column from the first file.
This is what I am using:
cigar2_count$Visit1_counts <- ifelse(sapply(cigar2_count$HPV_Position, function(p)
any(cigar1_count$minV <= p & cigar1_count$maxV >= p)),"YES", NA)
This is what I want to be able to do, but it gives me the content of the first file based on the row numbers of the second one, not the row in the first file that actually corresponded to the second file.
cigar2_count$Visit1_counts <- ifelse(sapply(cigar2_count$HPV_Position, function(p)
any(cigar1_count$minV <= p & cigar1_count$maxV >= p)),cigar1_count$Unique_Read_Count, NA)
Here is a sample data:
First file: I made columns for the 500 range of HPV_Position and named those min and max
Second file:
These are just samples though. The actual files are much larger.
Thanks!
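One way to keep the matching row itself is to return the row index from the sapply instead of a logical. This is a sketch on hypothetical stand-in data (the column names minV, maxV, Unique_Read_Count, and HPV_Position come from the question; the values are made up):

```r
# hypothetical stand-in data in the shape described in the question
cigar1_count <- data.frame(minV = c(1, 501), maxV = c(500, 1000),
                           Unique_Read_Count = c(12, 34))
cigar2_count <- data.frame(HPV_Position = c(250, 750, 2000))

# index of the first interval in cigar1_count containing each position (NA if none)
hit <- sapply(cigar2_count$HPV_Position, function(p) {
  i <- which(cigar1_count$minV <= p & cigar1_count$maxV >= p)
  if (length(i)) i[1] else NA_integer_
})
cigar2_count$Visit1_row    <- hit                                  # which row matched
cigar2_count$Visit1_counts <- cigar1_count$Unique_Read_Count[hit]  # its content
```

With the row index in hand you can pull any column of the first file, not just a Yes/NA flag.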
I have a sample working data set (called df) which I have added columns to in R, and I would like to fill these columns with data according to very specific conditions.
I ran samples in a lab with 8 different variables, and always ran each sample with each variable twice (sample column). From this, I calculated an average result, called Cq_mean.
The columns I have added in R below refer to each variable name.
I would like to fill these columns with positive or negative based on 2 conditions:
Variable
Cq_mean
As you can see from my code below, I am able to create positive or negative results based on Cq_mean; however, this runs over the entire dataset without taking the variable into account, and it fills in cells that I would like to remain empty. I am not sure how to ask R to take these two conditions into account at the same time.
positive: Cq_mean <= 37.1
negative: Cq_mean >= 37
Helpful information:
Under sample, the data is always separated by a dash (-), with the sample number in front and the variable name after. Somehow I need to isolate what comes after the dash.
Please refer to my desired results table to visualize what I am aiming for.
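Isolating what comes after the dash can be sketched like this (toy strings below; the real ones come from the sample column):

```r
# toy sample names; the real ones come from the sample column
s <- c("12-TypA", "3-RP49", "7-H20")
sub("^[^-]*-", "", s)   # drop everything up to and including the first dash
# [1] "TypA" "RP49" "H20"
# alternatively: sapply(strsplit(s, "-"), `[`, 2)
```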
df <- read.table("https://pastebin.com/raw/ZPJS9Vjg", header=T,sep="")
# add column names respective to variables
df$TypA <- ""
df$TypB <- ""
df$TypC <- ""
df$RP49 <- ""
df$RPS5 <- ""
df$H20 <- ""
df$F1409B <-""
df$F1430A <- ""
# fill columns with data
df$TypA <- ifelse(df$Cq_mean>=37.1,"negative", 'positive')
df$TypB <- ifelse(df$Cq_mean>=37.1,"negative", 'positive')
# ...and continue through with each variable
desired results (subset of entire dataset done by hand in excel):
desired_outcome <- read.table("https://pastebin.com/raw/P3PPbiwr", header = T, sep="\t")
Something like this will do the trick:
df$TypA[grepl('TypA', df$sample1)] <- ifelse(df$Cq_mean[grepl('TypA', df$sample1)] >= 37.1,
'neg', 'pos')
You'll need to do this once per new column you want.
The grepl will filter out only the rows where your string of choice (here TypA) is present in the sample variable.
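Extending that, a loop over the variable names avoids repeating the line for each of the eight columns. A sketch, with toy data standing in for the pastebin file (the sample1/Cq_mean values are made up):

```r
# toy data in the shape of the pastebin file (values are made up)
df <- data.frame(sample1 = c("1-TypA", "2-TypA", "1-TypB"),
                 Cq_mean = c(38.0, 36.5, 36.9),
                 stringsAsFactors = FALSE)

vars <- c("TypA", "TypB", "TypC", "RP49", "RPS5", "H20", "F1409B", "F1430A")
for (v in vars) {
  df[[v]] <- ""                               # stays empty for non-matching rows
  hit <- grepl(v, df$sample1, fixed = TRUE)   # rows whose sample name contains v
  df[[v]][hit] <- ifelse(df$Cq_mean[hit] >= 37.1, "negative", "positive")
}
```

Rows whose sample name does not contain the variable keep the empty string, so only the matching cells are filled.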
I'm looking for a function similar to FindReplace that will allow me to map values based on a vector, rather than a single value.
I have a lookup table that looks like this that I want to use to map values in a dataframe.
COLUMN_NAME   CODE   DESCRIPTION
arrmin          97   Officially Cancelled
arrmin          98   Unknown if Arrived
atmcond         -1   Blank
atmcond          0   No Additional Atmospheric Conditions
This lookup table has thousands of rows, so I can't type them in manually, and my original solution is too inefficient and will take days to run.
The dataframe I am using has hundreds of columns, such as arrmin and atmcond, that need their values changed from 97 to Officially Cancelled, etc.
The values from 0-100 (or however many values there are) change based on which column it is in. I've written this code below, but it is really inefficient and takes days to run for 300k rows.
columnsToReplace <- which(colnames(CRASH) %in% CapitalColumns)
dfColumns <- colnames(CRASH)
library(DataCombine)  # FindReplace comes from the DataCombine package
for (i in columnsToReplace){
tempColumn <- dfColumns[i]
tempLookup <- capitalLookupTable[which(capitalLookupTable$COLUMN_NAME == tempColumn),]
CRASH <- FindReplace(data = CRASH, Var = tempColumn, replaceData = tempLookup,
from = "Code", to = "Description", exact = TRUE)
}
columnsToReplace holds the positions of the columns of CRASH whose names appear in the lookup table (CapitalColumns is a vector I created that contains those column names as strings).
#Some data
s<-data.frame(A=c(1,1,2,2),B=c(2,4,6,6),C=c(1,3,5,7))
mapping<-data.frame(ColumnName=c(rep("A",2), rep("B",3), rep("C",4)), Code=c(1,2,2,4,6,1,3,5,7))
mapping$Description<-paste0(mapping$ColumnName, mapping$Code)
#From wide to long
library(reshape)
melted.s<-melt(s)
#Join
melted.s<-merge(melted.s, mapping, by.x=c("variable","value"), by.y=c("ColumnName","Code"))
#From long to wide
p<-data.frame(matrix(melted.s$Description, ncol=ncol(s)))
names(p)<-names(s)
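One caveat: merge() sorts the result by the join columns by default, so rebuilding the wide frame with matrix() only lines up when that sorted order happens to match the original row order. An order-preserving variant, using match() on a pasted key over the same toy data, is sketched below:

```r
# same toy data as above
s <- data.frame(A = c(1, 1, 2, 2), B = c(2, 4, 6, 6), C = c(1, 3, 5, 7))
mapping <- data.frame(ColumnName = c(rep("A", 2), rep("B", 3), rep("C", 4)),
                      Code = c(1, 2, 2, 4, 6, 1, 3, 5, 7))
mapping$Description <- paste0(mapping$ColumnName, mapping$Code)

key <- paste(mapping$ColumnName, mapping$Code)  # one lookup key per mapping row
p <- s
for (nm in names(s)) {
  # look up each value of this column in the mapping, keeping row order intact
  p[[nm]] <- mapping$Description[match(paste(nm, s[[nm]]), key)]
}
```

Codes absent from the mapping simply come back as NA instead of silently landing in the wrong row.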
I have a data frame (df) that includes latitude and longitude coordinates (Lat, Long) as well as the depth (Depth) of a temperature measurement for each entry. Essentially, each entry has (x,y,z)=(Lat,Long,Depth) locational information for each temperature measurement.
I'm trying to clean the data by finding and removing duplicate measurement locations. The easy part is removing the exact duplicates, handled as such:
df = df[!duplicated(df[, c("Lat", "Long", "Depth")]), ]
However the problem is that the values of lat/long for some entries are just slightly off, meaning the above code won't catch them but they are still clearly duplicated (e.g. lat = 39.252880 & lat = 39.252887).
Is there a way to find duplicates that are within a certain absolute value or percentage of the first instance?
I appreciate any help, thanks!
Based on this post I was able to come up with a solution. I modified the function to have a tighter tolerance on "duplicates" to be 0.001, otherwise the function is unchanged. The application, however, changes slightly to:
output=data.frame(apply(dataDF,2,fun))
because I wanted to compare values within a single column instead of in a single row.
To continue, I then add artificial indices to my output data frame for later use:
output$ind = 1:nrow(output)
The main part is finding the row indices where the function returned TRUE for all three locational fields (lat, long, depth). The following code finds the indices where all three were TRUE, creates a temporary data frame with only those entries (still logicals), finds the indices of the duplicates among them, then maps those back to the rows to remove from the full, original dataset (pardon the bad variable names):
ind = which(with(output,(Lat=='TRUE' & Long=='TRUE' & Depth=='TRUE')))
temp = dataDF[ind,]
temp$ind = ind
ind2 = duplicated(temp[,1:3])
rem = ind[ind2]
df.final = dataDF[-rem,]
Hopefully that helps! It's a bit complicated and the function is very slow for large datasets but it gets the job done.
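An alternative that avoids the linked helper function entirely: snap each coordinate onto a grid of the chosen tolerance with round() and let duplicated() do the work. A sketch on toy data (note the usual trade-off of rounding: two values sitting just either side of a grid boundary will not be flagged):

```r
# toy data: rows 1 and 2 differ only in the 6th decimal place of Lat
df <- data.frame(Lat   = c(39.252880, 39.252887, 40.100000),
                 Long  = c(-74.10, -74.10, -74.20),
                 Depth = c(5, 5, 5))

tol <- 0.001
key <- round(df[, c("Lat", "Long", "Depth")] / tol)  # snap each value to the tolerance grid
df.clean <- df[!duplicated(key), ]                   # keep the first row per grid cell
nrow(df.clean)
# [1] 2
```

This is a single vectorised pass, so it stays fast on large datasets.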
I am currently working on code which applies to various datasets from an experiment that looks at a wide range of variables, not all of which are present in every repetition. My first step is to create an empty dataset with all the possible variables, and then write a function which retains the columns that are present in the input dataset and deletes the rest. Here is an example of how I want to achieve this:
x<-c("a","b","c","d","e","f","g")
y<-c("c","f","g")
Is there a way of removing elements of x that aren't present in y and/or retaining values of x that are present in y?
For your first question: "My first step is to create an empty dataset with all the possible variables", I would use factor on the concatenation of all the vectors, for example:
all_vect = c(x, y)
possible = levels(factor(all_vect))
Then, for the second part " write a function which retains columns that are in the dataset being inputted and delete the rest", I would write:
df[,names(df)%in%possible]
As akrun wrote, use intersect(x,y) or
> x[x %in% y]
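A quick demonstration with the vectors from the question:

```r
x <- c("a", "b", "c", "d", "e", "f", "g")
y <- c("c", "f", "g")

intersect(x, y)   # [1] "c" "f" "g"
x[x %in% y]       # same elements; keeps x's order (and any duplicates in x)
setdiff(x, y)     # the elements of x NOT in y: "a" "b" "d" "e"
```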