Mean distance from one city to others by a mutual ID - r

I don't understand spatial data at all. I have been studying it, but I'm missing something.
What I have: data.frame enterprises with the columns: id, parent_subsidiary, city_cod.
What I need: the mean and the max distance from the parent's city to the subsidiary cities.
Ex:
id | mean_dist | max_dist
1111 | 25km | 50km
232 | 110km | 180km
333 | 0km | 0km
What I did:
library("tidyverse")
library("sf")
# library("brazilmaps") not working anymore
library("geobr")
parent <- enterprises %>% filter(parent_subsidiary==1)
subsidiary <- enterprises %>% filter(parent_subsidiary==2)
# Cities - polygons
m_city_br <- read_municipality(code_muni="all", year=2019)
# or shp_city<- st_read("/BR_Municipios_2019.shp")
# data.frame with the column geom
map_parent <- left_join(parent, m_city_br, by=c("city_cod"="code_muni"))
map_subsidiary <- left_join(subsidiary, m_city_br, by=c("city_cod"="code_muni"))
st_distance(map_parent$geom[1],map_subsidiary$geom[2]) %>% units::set_units(km)
# this took a long time and the result differs from Google Maps -- is that ok?
# To do it by ID -- I'm also stuck here
distance_p_s <- data.frame(id=as.numeric(), subsidiary=as.numeric(), mean_dist=as.numeric(), max_dist=as.numeric())
id_v <- as.vector(parent$id)
for (i in 1:length(id_v)){
  test_p <- map_parent %>% filter(id==id_v[i])
  test_s <- map_subsidiary %>% filter(id==id_v[i])
  total <- 0
  value <- 0
  max <- 0
  l <- nrow(test_s)
  for (j in 1:l){
    value <- as.numeric(round(st_distance(test_p$geom[1], test_s$geom[j]) %>% units::set_units(km), 2))
    total <- total + value
    if (value > max) max <- value
  }
  mean_dist <- total/l
  done <- data.frame(id=id_v[i], subsidiary=l, mean_dist=round(mean_dist,2), max_dist=max)
  distance_p_s <- rbind(distance_p_s, done)
  rm(done)
}
Is it right?
Can I calculate the centroid of the cities and then calculate the distance between the centroids?
I also noticed that the distance from code_muni==4111407 to code_muni==4110102 is 0, even though they are different cities (Imbituva, PR, Brazil and Ivaí, PR, Brazil). Why?
Data example: structure(list(id = c("1111", "1111", "1111", "1111", "232", "232", "232", "232", "3123", "3123", "4455", "4455", "686", "333", "333", "14112", "14112", "14112", "3633", "3633"), parent_subsidiary = c("1","2", "2", "2", "1", "2", "2", "2", "1", "2", "1", "2", "1", "2", "1", "1", "2", "2", "1", "2"), city_cod = c(4305801L,4202404L, 4314803L, 4314902L, 4318705L, 1303403L, 4304507L, 4314100L, 2408102L, 3144409L, 5208707L, 4205407L, 5210000L, 3203908L, 3518800L, 3118601L, 4217303L, 3118601L, 5003702L, 5205109L)), row.names = c(NA, 20L), class = "data.frame")
PS: these are Brazilian cities.
https://github.com/ipeaGIT/geobr/tree/master/r-package

Great problem. I looked at it for a while, came back after thinking about it some more, and got as far as the distances; the mean was not calculated, only the distances from each parent to its subsidiaries.
The cities data and the enterprises data frame were joined, and the new df was mutated to add the centroid of each city geometry.
The df was split by id, which gave a list of 8 df's, each containing one parent with its related subsidiaries (1:4, 1:3, 1:4, 1:2, ...).
A loop with a function cleaned up the 8 df's and calculated the distance from each parent to each subsidiary.
I checked the distances of the first df in the list against values from a distance website; they were nearly identical.
The output is shown at [link].
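A minimal sketch of that join/centroid approach, assuming the enterprises data frame from the question and straight-line distances between city centroids (object and column names beyond those in the question are illustrative):

library(dplyr)
library(sf)
library(geobr)

# one centroid point per city
cities <- read_municipality(code_muni = "all", year = 2019)
cities <- st_centroid(cities)

# attach each enterprise's city centroid and convert to sf
ent_sf <- enterprises %>%
  left_join(cities, by = c("city_cod" = "code_muni")) %>%
  st_as_sf()

parent     <- filter(ent_sf, parent_subsidiary == 1)
subsidiary <- filter(ent_sf, parent_subsidiary == 2)

# distance from each subsidiary to its own parent, matched by id
idx <- match(subsidiary$id, parent$id)
subsidiary$dist_km <- as.numeric(units::set_units(
  st_distance(subsidiary, parent[idx, ], by_element = TRUE), km))

# mean and max per id
result <- subsidiary %>%
  st_drop_geometry() %>%
  group_by(id) %>%
  summarise(mean_dist = round(mean(dist_km), 2),
            max_dist  = round(max(dist_km), 2))

Centroid-to-centroid distances are straight-line distances, so they will differ from Google Maps road distances. Note also that st_distance() between two polygons is 0 whenever the polygons touch or overlap, which would explain the Imbituva/Ivaí result if those municipalities are adjacent.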

I did something like this:
distance_p_s <- data.frame(id=as.character(),
                           qtd_subsidiary=as.numeric(),
                           dist_min=as.numeric(),
                           dist_media=as.numeric(),
                           dist_max=as.numeric())
# named id_v so it is not masked by the id column inside filter()
id_v <- as.vector(mparentid$id)
for (i in 1:length(id_v)){
  print(sprintf("Filtering id: %s (%s of %s)", id_v[i], i, length(id_v)))
  teste_m <- mparentid %>% filter(id==id_v[i]) %>% st_as_sf()
  teste_f <- msubsidiaryid %>% filter(id==id_v[i]) %>% st_as_sf()
  teste_f <- st_centroid(teste_f)
  teste_m <- st_centroid(teste_m)
  teste_f <- st_transform(teste_f, 4674)
  teste_m <- st_transform(teste_m, 4674)
  total <- 0
  value <- 0
  min <- 0
  max <- 0
  l <- nrow(teste_f)
  for (j in 1:l){
    print(sprintf("Processing id: %s (%s of %s), subsidiary: %s of %s", id_v[i], i, length(id_v), j, l))
    value <- as.numeric(round(st_distance(teste_m$geom[1], teste_f$geom[j]) %>% units::set_units(km), 2))
    total <- total + value
    if (value > max) max <- value
    if (j == 1) {
      min <- value
    } else if (value < min) {
      min <- value
    }
  }
  dist_med <- total/l
  done <- data.frame(id=id_v[i], qtd_subsidiary=l, dist_min=min, dist_media=round(dist_med,2), dist_max=max)
  distance_p_s <- rbind(distance_p_s, done)
  print(sprintf("Completed id: %s (%s of %s), subsidiary: %s of %s", id_v[i], i, length(id_v), j, l))
  rm(done)
}
Probably this is not the best way, but it solved my problem for now.


sdTrim (trimr package) does not recognize defined conditions

I'm having an issue with the sdTrim function, which previously ran perfectly.
I have a dataframe (= new_data) containing, among others, the variables participant, accuracy, condition, and rt.
There are 8 different conditions: FA_1, HIT_1, ..., FA_4, HIT_4
I wanted to trim the reaction times and calculate a mean per participant and per condition. I used the following code:
trimmedData <- sdTrim(new_data, minRT = 150, sd = 2, pptVar = "participant", condVar = "condition", rtVar = "rt", accVar = "accuracy", perParticipant = TRUE, returnType = "mean")
This used to work fine, but suddenly my condition variable is no longer recognized as such: instead of 8 condition columns, everything is collapsed into one.
What seems to be the issue here?
I tried different ways of including perCondition = TRUE, FALSE etc. which did not change anything.
the participant and condition variables are characters, the rt is numeric
As far as I can tell, the problem is with your data, not with your code. The example data you posted has at most one row per participant/condition combination; there isn't a FA_3 or FA_4 for participant 988. If your real data doesn't have enough rows for each combination of participant and condition, then it looks like sdTrim just averages by participant.
I'm unfamiliar with reaction time data, but you might be able to accomplish what you're looking for using group_by and summarize from dplyr (a sketch follows the example below).
Below is an example with a larger dataset based on your example data.
library(trimr)
set.seed(123)
participant <- c(rep("1", 100), rep("2", 100), rep("3", 100))
accuracy <- sample(x = c("1", "0"), size = 300, replace = TRUE, prob = c(.9, .1))
condition <- sample(x = c("hit_1", "FA_1", "hit_2", "FA_2", "hit_3", "FA_3", "FA_4", "hit_4", "hit_1", "FA_1", "hit_2", "FA_2", "hit_3", "hit_4"), size = 300, replace = TRUE)
rt <- sample(x = 250:625, size = 300)
new_data <- data.frame(participant, accuracy, condition, rt)
trimmedData <- sdTrim(data = new_data,
                      minRT = 150,
                      sd = 2,
                      pptVar = "participant",
                      condVar = "condition",
                      rtVar = "rt",
                      accVar = "accuracy",
                      perParticipant = TRUE,
                      returnType = "mean")
print(trimmedData)
participant FA_1 hit_1 hit_3 hit_2 FA_4 FA_2 FA_3 hit_4
1 1 439.800 477.250 433.85 440.375 426.286 439.500 508.8 457.429
2 2 477.067 489.933 466.50 360.000 405.000 387.533 427.2 428.364
3 3 398.333 446.500 438.00 362.077 445.000 432.333 419.2 497.125
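If a plain mean per participant and condition is all that's needed, here is a minimal dplyr sketch using the simulated new_data above (the 150 ms cutoff mirrors the minRT argument and is an assumption, not something dplyr applies for you):

library(dplyr)
new_data %>%
  filter(rt > 150) %>%                      # drop responses below the assumed minimum RT
  group_by(participant, condition) %>%
  summarize(mean_rt = mean(rt), .groups = "drop")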
Update (1/23/23)
In both your original and your updated datasets, you simply don't have enough values per condition to properly use sdTrim() with both perParticipant = TRUE and perCondition = TRUE (perCondition is automatically set to TRUE if you don't specify it).
Here is a link to the sdTrim() function on Github. Start looking at line 545, which describes what happens when you have both participant and condition set to TRUE.
Part of this function involves taking the standard deviation of the data for each combination of participant and condition. If you only have one value for each combination of participant and condition, your standard deviation value will be NA. See the below example of just using participant 988 and condition hit_4. Once your standard deviation is NA, NA's just follow after that.
You either need a larger dataset with more values for each combination of participant and condition or you need to set perParticipant and perCondition to both be FALSE. If you do the second option, you will have two NaN values because those values fall under the minRT threshold that you set. However, you can avoid that by also doing returnType = "raw".
new_data <- structure(list(participant = c("986", "986", "986", "986", "986", "986", "986", "986", "988", "988", "988", "988", "988", "988", "988", "988"), accuracy = c("1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"), condition = c("hit_1", "FA_1", "hit_2", "FA_2", "hit_3", "FA_3", "FA_4", "hit_4", "hit_1", "FA_1", "hit_2", "FA_2", "hit_3", "hit_4", "FA_3", "FA_4"), rt = c(638, 286, 348, 310, 404, 301, 216, 534, 348, 276, 256, 293, 495, 438, 73, 73)), row.names = c(NA, -16L), class = "data.frame")
stDev <- 2
minRT <- 150
# get the list of participant numbers
participant <- unique(new_data$participant)
# get the list of experimental conditions
conditionList <- unique(new_data$condition)
# trim the data
trimmedData <- new_data[new_data$rt > minRT, ]
# ready the final data set
finalData <- as.data.frame(matrix(0, nrow = length(participant), ncol = length(conditionList)))
# give the columns the condition names
colnames(finalData) <- conditionList
# add the participant column
finalData <- cbind(participant, finalData)
# convert to data frame
finalData <- data.frame(finalData)
# initialise looping variables: i indexes the participant row, j the condition column
i <- 1
j <- 2
# take apart the loop
# focus on participant 988, condition hit_4
currSub <- "988"
currCond <- "hit_4"
# get relevant data
tempData <- trimmedData[trimmedData$participant == currSub & trimmedData$condition == currCond, ]
# find the cutoff
curMean <- mean(tempData$rt)
print(curMean)
[1] 438
curSD <- sd(tempData$rt)
print(curSD) # <- here is where the NA values start
[1] NA
curCutoff <- curMean + (stDev * curSD)
# trim the data
curData <- tempData[tempData$rt < curCutoff, ]
# find the average, and add to the data frame
finalData[i, j] <- round(mean(curData$rt))
head(finalData)
> participant hit_1 FA_1 hit_2 FA_2 hit_3 FA_3 FA_4 hit_4
1 986 NA 0 0 0 0 0 0 0
2 988 0 0 0 0 0 0 0 0
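For reference, a sketch of the second option described above, trimming across the whole dataset at once and returning the trimmed data in raw form rather than as means; the arguments are the same as in the call earlier in this answer:

trimmedRaw <- sdTrim(data = new_data,
                     minRT = 150,
                     sd = 2,
                     pptVar = "participant",
                     condVar = "condition",
                     rtVar = "rt",
                     accVar = "accuracy",
                     perParticipant = FALSE,
                     perCondition = FALSE,
                     returnType = "raw")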

Hierarchy across rows for the same id

So, I have a data set with many observations on X individuals, with more than one row for some individuals. For each row, I have assigned a classification (the variable clinical_significance) that takes three values in prioritized order: definite disease, possible, colonization. Now, I would like only one row per individual, carrying the "highest" classification across that individual's rows, i.e. definite disease if present, otherwise possible, otherwise colonization. Any good suggestions on how to achieve this?
For instance, as seen in the example, I would like all of ID #23's clinical_significance values to be 'definite disease', as this outranks 'possible'.
id id_row number_of_samples species_ny clinical_significa…
18 1 2 MAC possible
18 2 2 MAC possible
20 1 2 scrofulaceum possible
20 2 2 scrofulaceum possible
23 1 2 MAC possible
23 2 2 MAC definite disease
Making a reproducible example:
df <- structure(
list(
id = c("18", "18", "20", "20", "23", "23"),
id_row = c("1","2", "1", "2", "1", "2"),
number_of_samples = c("2", "2", "2","2", "2", "2"),
species_ny = c("MAC", "MAC", "scrofulaceum", "scrofulaceum", "MAC", "MAC"),
clinical_significance = c("possible", "possible", "possible", "possible", "possible", "definite disease")
),
row.names = c(NA, -6L), class = c("data.frame")
)
The idea is to turn clinical significance into a factor, which is stored as an integer instead of character (i.e. 1 = definite, 2 = possible, 3 = colonization). Then, for each ID, take the row with lowest number.
df_prio <- df |>
  mutate(
    fct_clin_sig = factor(
      clinical_significance,
      levels = c("definite disease", "possible", "colonization")
    )
  ) |>
  group_by(id) |>
  slice_min(fct_clin_sig, with_ties = FALSE)  # with_ties = FALSE keeps a single row per id even when levels tie
I fixed it using
df <- df %>%
group_by(id) %>%
mutate(clinical_significance_new = ifelse(any(clinical_significance == "definite disease"), "definite disease", as.character(clinical_significance)))
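If one row per id is still wanted after that mutate, a hedged follow-up (keeping an arbitrary first row per group):

df_one <- df %>%
  group_by(id) %>%
  mutate(clinical_significance_new = ifelse(any(clinical_significance == "definite disease"),
                                            "definite disease",
                                            as.character(clinical_significance))) %>%
  slice(1) %>%   # keep one row per id
  ungroup()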

Replace stars (***) with numbers

I have a data frame that I exported from lightroom to R.
In this data frame there is a grading system for the photographs where each is graded with stars from 1 (*) to 5 (*****)
I want to replace these stars with numbers, but I have tried several functions (gsub, replace) with no success:
Lightroom$Rating <- gsub("*", "1", Lightroom$Rating)
Lightroom <- replace(Lightroom, "*", "1")
Thank you for your help
If I understand your question correctly, you want to replace the number of stars with the actual count. This allows some flexibility in case you want to do something else with each matched number of asterisks (*).
library(tidyverse)
Lightroom <- data.frame(Rating = c("*",
"**",
"***",
"****",
"*****"))
Lightroom_subbed <- Lightroom %>%
mutate(Rating2 = case_when(grepl(x = Rating, pattern = "^\\*{1}$") ~ "1",
grepl(x = Rating, pattern = "^\\*{2}$") ~ "2",
grepl(x = Rating, pattern = "^\\*{3}$") ~ "3",
grepl(x = Rating, pattern = "^\\*{4}$") ~ "4",
grepl(x = Rating, pattern = "^\\*{5}$") ~ "5"
)
)
Lightroom_subbed
Rating Rating2
1 * 1
2 ** 2
3 *** 3
4 **** 4
5 ***** 5
A much simpler approach is available. Use the factor data type's underlying integer structure:
as.numeric(factor(Lightroom$Rating))
[1] 1 2 3 4 5
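Note that this relies on the default alphabetical ordering of the factor levels, which happens to coincide with the star count here. A hedged alternative is to count the asterisks directly (Rating_num is an illustrative column name):

Lightroom$Rating_num <- nchar(Lightroom$Rating)  # number of characters = number of stars
Lightroom$Rating_num
# [1] 1 2 3 4 5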

replace and remove duplicate rows using ifelse

I have a data frame of postcodes with a regional/metro classification assigned. In some instances, due to the datasource, the same postcode will occur with both a regional and metro classification.
POSTCODE REGON
1 3000 METRO
2 3000 REGIONAL
3 3256 METRO
4 3145 METRO
I am wondering how to remove the duplicate row and replace the region with "SPLIT" in these instances.
I have tried using the code below; however, this reassigns the entire dataset with either "METRO" or "REGIONAL".
test <- within(PC_ACTM, REGION <- ifelse(duplicated("Postcode"), "SPLIT", REGION))
The desired output would be
POSTCODE REGON
1 3000 SPLIT
2 3256 METRO
3 3145 METRO
Example data:
dput(PC_ACTM)
structure(list(POSTCODE = c(3000L, 3000L, 3256L, 3145L), REGON = c("METRO",
"REGIONAL", "METRO", "METRO")), class = "data.frame", row.names = c("1",
"2", "3", "4"))
Based on your title, you're looking for an ifelse() solution; perhaps this will suit?
PC_ACTM <- structure(list(POSTCODE = c(3000L, 3000L, 3256L, 3145L),
REGION = c("METRO", "REGIONAL", "METRO", "METRO")),
class = "data.frame",
row.names = c("1", "2", "3", "4"))
PC_ACTM$REGION <- ifelse(duplicated(PC_ACTM$POSTCODE), "SPLIT", PC_ACTM$REGION)
PC_ACTM[!duplicated(PC_ACTM$POSTCODE, fromLast = TRUE),]
#> POSTCODE REGION
#> 2 3000 SPLIT
#> 3 3256 METRO
#> 4 3145 METRO
Created on 2022-04-07 by the reprex package (v2.0.1)
Consider ave to build a sequential count by group and then subset to the last row of each group, but before subsetting use ifelse to replace the value for any group whose count is over 1. Below uses the base R 4.1.0+ pipe |>:
test <- within(
  PC_ACTM, {
    PC_SEQ <- ave(seq_along(POSTCODE), POSTCODE, FUN=seq_along)  # row sequence within each postcode
    PC_COUNT <- ave(seq_along(POSTCODE), POSTCODE, FUN=length)   # row count per postcode
    REGION <- ifelse(
      (PC_SEQ == PC_COUNT) & (PC_COUNT > 1), "SPLIT", REGION
    )
  }
) |> subset(
  subset = PC_SEQ == PC_COUNT,    # SUBSET ROWS
  select = c(POSTCODE, REGION)    # SELECT COLUMNS
) |> `row.names<-`(NULL)          # RESET ROW NAMES
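With the example PC_ACTM above, this should yield the desired output:

#   POSTCODE REGION
# 1     3000  SPLIT
# 2     3256  METRO
# 3     3145  METRO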

Create data table containing all possible combinations of values from four lists

I have the following four lists.
varnames <- list("beefpork", "breakfast", "breakfast_yn", "diet_soda", "food_label", "fruit_and_veggie", "fruit_juice", "fruits", "milk", "min_foods","regular_soda", "ssb", "total_fruit", "vegetables", "asthma", "bmiclass3", "bmiclass4","bmiclass5", "dental_absence", "dental_appt", "diabetes", "food_allergies", "sore_teeth", "trying_weight", "count_pa60days", "count_vigpa20days", "gaming_bedroom", "other_organized_pa", "pa30outdoor","paguidelines", "pc_time", "school_transport", "sport_teams", "tv_bedroom", "tv_time_char", "video_games_char")
grades <- list("2", "4", "8", "11")
groups <- list("none", "ethnic", "bordercounty")
regions <- list("state", "hsr")
And the following function, which returns an integer:
all_empty = function(outcome, groupvar, gradevar, regionvar){
  # How many observations?
  if (groupvar == "none")
    fmla <- as.formula(paste0("~", outcome))
  else
    fmla <- as.formula(paste0("~", outcome, "+", groupvar))
  if (regionvar == "hsr")
    mydata = span_phrwts
  else if (regionvar == "state" & groupvar %in% c("none", "ethnic"))
    mydata = span_statewts
  else if (regionvar == "state" & groupvar == "bordercounty")
    mydata = span_borderwts
  else mydata = span_statewts
  myrow = svytable(fmla, subset(mydata, grade==gradevar)) %>% nrow()
  return(myrow)
}
I'm trying to write code that will run the function on all 864 possible combinations of the values from my lists and create one data table with 864 rows and 5 columns.
I would like the final table to look something like this, but I have not been successful:
Variable Grade Group Region Obs
beefpork 2 none state 5
beefpork 4 none state 5
beefpork 8 none state 3
beefpork 11 none state 0
This is my attempt, but I am unable to calculate the rownum correctly.
output_all <- matrix(ncol = 5, nrow = length(varnames)*length(grades)*length(groups)*length(regions))
for (l in 1:length(regions)) {
  for (k in 1:length(grades)) {
    for (j in 1:length(groups)) {
      for (i in 1:length(varnames)) {
        rownum = i + ((length(groups)*length(grades)*length(regions)) - 1)
        output_all[rownum, 1] = varnames[[i]]
        output_all[rownum, 2] = groups[[j]]
        output_all[rownum, 3] = grades[[k]]
        output_all[rownum, 4] = regions[[l]]
        output_all[rownum, 5] = all_empty(varnames[[i]], groups[[j]], grades[[k]], regions[[l]])
      }
    }
  }
}
output_all %>% as_data_frame() %>% View()
Any help/advice would be much appreciated!
Using data.table, you have the function CJ() to create the cross-join. Then we add a row number (Idx) so the function can be called row by row. Finally, we remove the Idx column.
library(data.table)
# CJ() wants vectors, so unlist the input lists
dt <- CJ(varnames = unlist(varnames), grades = unlist(grades),
         groups = unlist(groups), regions = unlist(regions))
dt[, Idx := .I]
# columns passed in all_empty()'s argument order: outcome, groupvar, gradevar, regionvar
dt[, Obs := all_empty(varnames, groups, grades, regions), by = Idx]
dt[, Idx := NULL]
If it's ok to use vectors and not lists, tidyr::crossing seems like a straightforward approach.
varnames <- c("beefpork", "breakfast", "breakfast_yn", "diet_soda", "food_label", "fruit_and_veggie", "fruit_juice", "fruits", "milk", "min_foods","regular_soda", "ssb", "total_fruit", "vegetables", "asthma", "bmiclass3", "bmiclass4","bmiclass5", "dental_absence", "dental_appt", "diabetes", "food_allergies", "sore_teeth", "trying_weight", "count_pa60days", "count_vigpa20days", "gaming_bedroom", "other_organized_pa", "pa30outdoor","paguidelines", "pc_time", "school_transport", "sport_teams", "tv_bedroom", "tv_time_char", "video_games_char")
grades <- c("2", "4", "8", "11")
groups <- c("none", "ethnic", "bordercounty")
regions <- c("state", "hsr")
tidyr::crossing(varnames, grades, groups, regions)
# A tibble: 864 x 4
varnames grades groups regions
<chr> <chr> <chr> <chr>
1 asthma 11 bordercounty hsr
2 asthma 11 bordercounty state
3 asthma 11 ethnic hsr
4 asthma 11 ethnic state
5 asthma 11 none hsr
6 asthma 11 none state
7 asthma 2 bordercounty hsr
8 asthma 2 bordercounty state
9 asthma 2 ethnic hsr
10 asthma 2 ethnic state
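To add the Obs column to that crossing, one option is a row-wise call of the function; this is a sketch that assumes all_empty() and its survey objects are available, and uses purrr::pmap_int() for the element-wise call:

library(tidyr)
library(purrr)
combos <- crossing(varnames, grades, groups, regions)
# pass the columns positionally in all_empty()'s argument order
combos$Obs <- pmap_int(
  list(combos$varnames, combos$groups, combos$grades, combos$regions),
  all_empty
)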
Consider expand.grid, then call your function with mapply to pass column values elementwise to the user-defined method.
varnames <- c("beefpork", "breakfast", "breakfast_yn", "diet_soda",
"food_label", "fruit_and_veggie", "fruit_juice",
"fruits", "milk", "min_foods", "regular_soda",
"ssb", "total_fruit", "vegetables", "asthma",
"bmiclass3", "bmiclass4","bmiclass5", "dental_absence",
"dental_appt", "diabetes", "food_allergies",
"sore_teeth", "trying_weight", "count_pa60days",
"count_vigpa20days", "gaming_bedroom", "other_organized_pa",
"pa30outdoor","paguidelines", "pc_time", "school_transport",
"sport_teams", "tv_bedroom", "tv_time_char", "video_games_char")
grades <- c("2", "4", "8", "11")
groups <- c("none", "ethnic", "bordercounty")
regions <- c("state", "hsr")
df <- expand.grid(varnames=varnames, grades=grades, groups=groups, regions=regions,
stringsAsFactors = FALSE)
str(df)
# 'data.frame': 864 obs. of 4 variables:
# $ varnames: chr "beefpork" "breakfast" "breakfast_yn" "diet_soda" ...
# $ grades : chr "2" "2" "2" "2" ...
# $ groups : chr "none" "none" "none" "none" ...
# $ regions : chr "state" "state" "state" "state" ...
# ...
df$fmla <- ifelse(df$groups == "none", paste0("~", df$varnames), paste0("~", df$varnames, "+", df$groups))
df$mydata <- ifelse(df$regions == "hsr", "span_phrwts",
ifelse(df$regions == "state" & df$groups %in% c("none", "ethnic"), "span_statewts",
ifelse(df$regions == "state" & df$groups == "bordercounty", "span_borderwts",
"span_statewts")))
Function call
all_empty <- function(outcome, groupvar, gradevar, regionvar, fmla, mydata){
  # How many observations? (svytable() comes from the survey package)
  myrow <- svytable(as.formula(fmla), subset(get(mydata), grade==gradevar))
  return(nrow(myrow))
}
df$Obs <- mapply(all_empty, df$varnames, df$groups, df$grades,
df$regions, df$fmla, df$mydata)
