Mutate when function output has data in two columns (geosphere) - r

I have animal survey data from transects. Transects are divided into sections. There are lat/lon data for the start/endpoints of some sections but not others, and I want to calculate the start/endpoints for sections where these values are missing. Missing start/endpoints should be calculated using the section bearing (degrees), section length (m).
Example data:
Section
StartLon
StartLat
EndLon
EndLat
Bearing
Length
1
-132.4053
53.00704
-132.4053
53.00714
360
5
2
-132.4053
53.00714
NA
NA
360
10
I'm trying to use destPoint (geosphere) to calculate the missing start/endpoints (NAs). The output of destPoint looks like:
lon lat
[1,] -132.4053 53.00701
My code:
data %>%
mutate(EndLon = if_else(is.na(EndLon), destPoint(c(StartLon, StartLat), Bearing, Length), EndLon))
data %>%
mutate(EndLat = if_else(is.na(EndLat), destPoint(c(StartLon, StartLat), Bearing, Length), EndLat))
My code gives this error:
Error: Problem with `mutate()` input `test`.
x Wrong length for a vector, should be 2
i Input `test` is `if_else(...)`
I think the error is because the output of destPoint is two values (lon and lat), and the mutated column can only hold one value. Maybe there's a way to use select() so that only lon or lat goes into the mutated column?
Hoping for a dplyr solution.

We may use rowwise
library(dplyr)
library(geosphere)
data %>%
rowwise %>%
mutate(EndLon = if(is.na(EndLon))
destPoint(c(StartLon, StartLat), Bearing, Length)[, 'lon'] else EndLon) %>%
ungroup
-output
# A tibble: 2 x 7
# Section StartLon StartLat EndLon EndLat Bearing Length
# <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
#1 1 -132. 53.0 -132. 53.0 360 5
#2 2 -132. 53.0 -132. NA 360 10
data
data <- structure(list(Section = 1:2, StartLon = c(-132.4053, -132.4053
), StartLat = c(53.00704, 53.00714), EndLon = c(-132.4053, NA
), EndLat = c(53.00714, NA), Bearing = c(360L, 360L), Length = c(5L,
10L)), class = "data.frame", row.names = c(NA, -2L))
The issue would be that c(StartLon, StartLat) would concatenate the whole column values from both of those column, and thereby the length of one of the arguments for if_else becomes different in length than the rest. If we do the rowwise, it is grouped by row and we can use if/else (which requires a input logical expression of length 1)

Related

How do I convert the geometry of multipolygons to lat and long using r

Hi I am trying to convert the geometry field into lat and long. I need to use the data to create a map in leaflet for a shiny app.
I have tried the following code:
rl_coord <- rle %>% mutate(lat = unlist(map(rle$geometry, 1)), long = unlist(map(rle$geometry, 2)))
output:
Error in stopifnot():
! Problem while computing lat = unlist(map(rle$geometry, 1)).
x lat must be size 45152 or 1, not 19470112.
Run rlang::last_error() to see where the error occurred.
sample of data (first 2 rows):
(column names)
oi, name, rle_status, percentage, geometry
1, Aggeneys Gravel Vygieveld, Least Concern, 99, list(c(18.1286933451867, 18.1288873157412, 18.1285209659984, 18.1253701585106, 18.1256087808649, 18.125834301081, 18.1261614439592, 18.1263051557795, 18.1264605244489, 18.1266239843156, 18.1269636452865, 18.12730379009, 18.1274679032905, 18.1276242720226, 18.1277694335806, 18.128129283741, 18.1285605618558, 18.1286933451867, -29.2730674003889, -29.2731573351778, -29.273107503584, -29.273040061747, -29.2729443517532, -29.2728706490378, -29.2727512792674, -29.2727164322246, -29.2726934896327, -29.2726798052816, -29.2726735519181, -29.2726912387301, -29.2727098853755, -29.2727368207384, -29.272774262715, -29.2728993661673, -29.2730186720589, -29.2730674003889))
2,Aggeneys Gravel Vygieveld, Least Concern, 99, list(c(18.1537914281886, 18.1575870517462, 18.1606979380474, 18.1636028297453, 18.1655769358587, 18.16918945379, 18.1735477460181, 18.1759052286574, 18.1772384655054, 18.1773223880217, 18.1776390084706, 18.1759643559649, 18.1735973363743, 18.169952393675, 18.1688938146476, 18.1682070250338, 18.1683513617348, 18.1685030450031, 18.1688311043061, 18.1693585935878, 18.1702652749817, 18.1706158522373, 18.1709379223772, 18.1710792665481, 18.1712000947379, 18.1712925878997, 18.1713437767173, 18.1713693252922, 18.1713707328073, 18.1713479804211, 18.1713015143846, 18.1712382418641, 18.1709475315411, 18.1704635387361, 18.1684209356022, 18.1680682165722, 18.1678999593179, 18.1677395265964, 18.1675893883433, 18.1674634611086, 18.16651344333, 18.1637821201569, 18.1598033908529, 18.1569099426752, 18.1529178620954, 18.14891243025, 18.1454734808983, 18.141641617306, 18.1389141094622, 18.1363582617806, 18.1337413799508, 18.1329259143444, 18.1327123178834, 18.1321225427447, 18.1316307840128, 18.1308981359519, 18.130405823682, 18.1296760299562, 18.1295423948817, 18.1293256727083, 18.1291879582129, 18.1287509532673, 18.1283905796694, 18.1282495214403, 18.1280975604639, 18.1277725515827, 18.1267805933984, 18.1265024871873, 18.1263039910056, 18.1262050407897, 18.1261056638745, 18.1257711483693, 18.1254680798452, 18.1251343803712, 18.1237035930329, 18.1233520704461, 18.1230207288408, 18.1228686758013, 18.122730590917, 18.122428801227, 18.1222578827598, 18.1221962726986, 18.1221588093525, 18.1221453668644, 18.1221553763524, 18.1221897086251, 18.1222487087564, 18.1223573579371, 18.121940612864, 18.119729996386, 18.1201267241008, 18.1202602393336, 18.1232814790033, 18.1262264249677, 18.1309795388961, 18.1359252938222, 18.1414241797188, 18.1456203463775, 18.1505165109378, 18.1537914281886, -29.2714061489315, -29.2729282131844, -29.27283475266, -29.2735366562541, -29.2748603575312, -29.2763805135057, -29.2774333702504, -29.2781257379722, -29.2796611540815, -29.279758428048, -29.2815265405493, -29.282613729886, -29.2822437040154, -29.2818450674144, -29.2814769494464, -29.2812386697835, -29.2812023081652, -29.2811798583234, -29.2811567953417, -29.2811519774458, -29.2811626612424, -29.2811582844914, -29.2811379726675, -29.2811178709109, -29.2810880943803, -29.2810455977044, -29.2809975064909, -29.2809446550115, -29.2808944231771, -29.2808543392358, -29.2808307426598, -29.2808240597577, -29.2808606799739, -29.2808810489644, -29.2808696770233, -29.2808822434794, -29.280895647919, -29.280916088344, -29.2809458790151, -29.2809806868257, -29.2806510672722, -29.280271505169, -29.2787456264884, -29.2775611622045, -29.2765159357731, -29.2759494531304, -29.2747535453343, -29.2745113125779, -29.2739715327436, -29.2739181265728, -29.2737464650722, -29.2736922023781, -29.2736427340937, -29.2735310196154, -29.2733754030725, -29.2732322407442, -29.2730770705516, -29.2729311019628, -29.2728915355121, -29.2728078851028, -29.2727640339101, -29.2726519877553, -29.2725284230301, -29.2724913733567, -29.2724634517196, -29.2724256261332, -29.2723486672961, -29.2723047868547, -29.2722601758286, -29.2722516909261, -29.2722563540645, -29.2723118485858, -29.2723331410326, -29.2723400722685, -29.272324458358, -29.2723303416353, -29.2723548747257, -29.2723790183498, -29.272414889976, -29.2725469760291, -29.2726412327055, -29.2726981781648, -29.2727653543012, -29.2728386599826, -29.2729125214399, -29.2729813861663, -29.2730411602367, -29.2731112309018, -29.2731265774833, -29.2737197621271, -29.2726058704699, -29.272533391812, -29.2709121456387, -29.2701740008779, -29.2701167811391, -29.2697410330702, -29.2692184199776, -29.2694682832248, -29.2708549256421, -29.2714061489315))
You can use sf::st_coordinates (with reproducible data set). You'll get a list of X and Y coordinates for each multipolygon. You can then use unnest to get one row for each X Y pair.
library(spData)
library(sf)
data(nz)
nz %>%
transmute(lon = list(st_coordinates(.)[, 1]),
lat = list(st_coordinates(.)[, 2])) %>%
unnest(lon, lat) %>%
st_drop_geometry()
output
# A tibble: 19,056 × 2
lon lat
* <dbl> <dbl>
1 1745493. 6001802.
2 1740539. 5995066.
3 1733165. 5989714.
4 1720197. 5980078.
5 1709110. 5986672.
6 1701512. 5996205.
7 1694072. 5996670.
8 1698473. 5988332.
9 1703769. 5984809.
10 1706031. 5975136.
# … with 19,046 more rows
Or use the wrapper sfheaders::sf_to_df (pointed out by #tospig) and select the x and y columns:
sfheaders::sf_to_df(nz)[c("x", "y")]
Try this
rl_coord <-
rle %>%
sf::st_coordinates() %>%
as_tibble() %>%
select("long" = X, "lat" = Y)
rl_coord

Calculate distance between multiple latitude and longitude points

I have a dataset that has latitude and longitude information for participants' home and work, and I'd like to create a new column in the dataset containing the euclidean distance between home and work for each participant. I think this should be relatively simple, but all the other Q&As I've seen seem to be dealing with slightly different issues.
To start, I tried running this code (using the geosphere package):
distm(c(homelong, homelat), c(worklong, worklat), fun=distHaversine)
But got an error saying "Error in .pointsToMatrix(x) : Wrong length for a vector, should be 2" because (if I understand correctly) I'm trying to calculate the distance between multiple sets of two points.
Can I adjust this code to get what I'm looking for, or is there something else I should be trying instead? Thanks!
distm() returns a distance matrix, which is not what you want; you want the pairwise distances. So use the distance function (distHaversine(), distGeo(), or whatever) directly:
library(tidyverse)
locations <- tibble(
homelong = c(0, 2),
homelat = c(2, 5),
worklong = c(70, 60),
worklat = c(45, 60)
)
locations <- locations %>%
mutate(
dist = geosphere::distHaversine(cbind(homelong, homelat), cbind(worklong, worklat))
)
locations
#> # A tibble: 2 × 5
#> homelong homelat worklong worklat dist
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 2 70 45 8299015.
#> 2 2 5 60 60 7809933.
Note that geosphere functions want matrices as inputs, so you can cbind() your columns together. Don't c() them; that's creating a single shapeless vector and losing the differentiation between lon and lat. This is the cause of the error, I suspect; the vector only has one dimension, not two like a matrix.
You can have the latitudes and longitudes in a dataframe and then do rowwise operations on the dataframe to get the distance corresponding to each row.
library(tidyverse)
library(geosphere)
locations <- tibble(
homelong = c(0, 2),
homelat = c(2, 5),
worklong = c(70, 60),
worklat = c(45, 60)
)
locations %>%
rowwise() %>%
mutate(d = as.numeric(distm(c(homelong, homelat), c(worklong, worklat), fun = distHaversine)))
results in
# A tibble: 2 x 5
# Rowwise:
homelong homelat worklong worklat d
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0 2 70 45 8299015.
2 2 5 60 60 7809933.

Extract allele specific information from dataframe using regex and split this into two new dataframes

I have a 100 by 100 dataframe, which contains 100 sample IDs for 100 SNPs in the genome, in the following format.
structure(list(`c("12545=1", "12545=0")` = c("12545=1|1", "12545=0|0"
), `c("12994=0|0", "12994=0|1")` = c("12994=0|0", "12994=0|1"
), `c("15240=0|0", "15240=1|1")` = c("15240=0|0", "15240=1|1"
)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))
The integers at either side of the | represent the genotype of each individual. I'm looking for a solution in bash or R, potentially using regex, that creates two new data frames each with the same dimensions as the first. One dataframe containing only the reference allele info for each sample at each locus, so containing only the integer value before the | and the second dataframe containing the alternate allele info at each locus, i.e. only the value of the integer after the | in each cell. Thus, two new 100 x 100 dataframes of 1s and 0s would be generated.
Everything before the = can be disregarded.
Any advice on how to solve this would be greatly appreciated.
Assuming your data is stored in a data frame df, this should do what you want:
library(tidyverse)
df1 <- df %>%
mutate_all(~ str_extract(., "\\d(?=\\|)"))
df2 <- df %>%
mutate_all(~ str_extract(., "(?<=\\|)\\d"))
df1
# `c("12545=1", "12545=0")` `c("12994=0|0", "12994=0|1")` `c("15240=0|0", "15240=1|1")`
# <chr> <chr> <chr>
# 1 1 0 0
# 2 0 0 1
df2
# `c("12545=1", "12545=0")` `c("12994=0|0", "12994=0|1")` `c("15240=0|0", # "15240=1|1")`
# <chr> <chr> <chr>
# 1 1 0 0
# 2 0 1 1
Note that the column names in your example are a bit weird, but it should work nonetheless.

Extracting value and finding the minimum without merging

I'm trying to extract a subtext and get the minimum value from a list of a list in R. My initial tsv looks like this (this is a smaller version):
cases counts
"S35718:10.63,S35585:6.75,S35708:7.28,S36617:12.23" "6.75,7.28,10.63,12.23,6.17,4.09,3.95,5.00"
"S35718:10.63" "10.63"
And I am trying to extract the numbers after the colon and find the minimum, then I wanted to see how many in the counts column are greater than the minimum.
For instance my ideal output would be:
min: 6.75
greater than 6.75 in counts column: 4
Within this .tsv, there are approximately 100,000 lines. I've tried using gsub, but it ends up merging all the numbers such as the example below:
test <- gsub(".*:", "",outlier$cases)
[1]"10.63" "6.75" "7.28" "12.23" "10.63" ... all the other subsequent values
I would appreciate any help on this. I'm a bit of a beginner with R but would love to improve further. Thank you so much!
An option is to extract the numbers after the :, convert it to numeric, get the min and find the counts by creating a logical expression and take the sum
library(stringr)
library(dplyr)
library(purrr)
library(tidyr)
outlier %>%
transmute(caselist = str_extract_all(cases, "(?<=:)\\d+\\.\\d+"),
countlist = str_extract_all(counts, "[0-9.]+")) %>%
transmute(out = map2(caselist, countlist,
~tibble(min = min(as.numeric(.x)),
greater_than_min = sum(as.numeric(.y) >= min)))) %>%
unnest_wider(c(out))
# A tibble: 2 x 2
# min greater_than_min
# <dbl> <int>
#1 6.75 4
#2 10.6 1
data
outlier <- structure(list(cases = c("S35718:10.63,S35585:6.75,S35708:7.28,S36617:12.23",
"S35718:10.63"), counts = c("6.75,7.28,10.63,12.23,6.17,4.09,3.95,5.00",
"10.63")), class = "data.frame", row.names = c(NA, -2L))

Carrying out a simple dataframe subset with dplyr

Consider the following dataframe slice:
df = data.frame(locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
row.names = c("a091", "b231", "a234", "d154"))
df
locations score
a091 argentina 1
b231 brazil 2
a234 argentina 3
d154 denmark 4
sorted = c("a234","d154","a091") #in my real task these strings are provided from an exogenous function
df2 = df[sorted,] #quick and simple subset using rownames
EDIT: Here I'm trying to subset AND order the data according to sorted - sorry that was not clear before. So the output, importantly, is:
locations score
a234 argentina 1
d154 denmark 4
a091 argentina 3
And not as you would get from a simple subset operation:
locations score
a091 argentina 1
a234 argentina 3
d154 denmark 4
I'd like to do the exactly same thing in dplyr. Here is an inelegant hack:
require(dplyr)
dt = as_tibble(df)
rownames(dt) = rownames(df)
Warning message:
Setting row names on a tibble is deprecated.
dt2 = dt[sorted,]
I'd like to do it properly, where the rownames are an index in the data table:
dt_proper = as_tibble(x = df,rownames = "index")
dt_proper2 = dt_proper %>% ?some_function(index, sorted)? #what would this be?
dt_proper2
# A tibble: 3 x 3
index locations score
<chr> <fct> <int>
1 a091 argentina 1
2 d154 denmark 4
3 a234 argentina 3
But I can't for the life of me figure out how to do this using filter or some other dplyr function, and without some convoluted conversion to factor, re-order factor levels, etc.
Hy,
you can simply use mutate and filter to get the row.names of your data frame into a index column and filter to the vector "sorted" and sort the data frame due to the vector "sorted":
df2 <- df %>% mutate(index=row.names(.)) %>% filter(index %in% sorted)
df2 <- df2[order(match(df2[,"index"], sorted))]
I think I've figured it out:
dt_proper2 = dt_proper[match(sorted,dt_proper$index),]
Seems to be shortest implementation of what df[sorted,] will do.
Functions in the tidyverse (dplyr, tibble, etc.) are built around the concept (as far as I know), that rows only contain attributes (columns) and no row names / labels / indexes. So in order to sort columns, you have to introduce a new column containing the ranks of each row.
The way I would do it is to create another tibble containing your "sorting information" (sorting attribute, rank) and inner join it to your original tibble. Then I could order the rows by rank.
library(tidyverse)
# note that I've changed the third column's name to avoid confusion
df = tibble(
locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
custom_id = c("a091", "b231", "a234", "d154")
)
sorted_ids = c("a234","d154","a091")
sorting_info = tibble(
custom_id = sorted_ids,
rank = 1:length(sorted_ids)
)
ordered_ids = df %>%
inner_join(sorting_info) %>%
arrange(rank) %>%
select(-rank)

Resources