Stratified sampling in R with unequal weights and replacement - r

I have a large data set with a field containing a combined FIPS code and zip code, and another data set with population weighted centroids for block groups combined with some zip code data. I want to stratify my data by "FIPS code" and then assign each row a set of coordinates for a block group centroid, where the centroid's probability of being selected is proportional to its population.
I was originally using a sample of the data (1000 rows) and the strata function from the sampling package, which worked fine. Now that I want to do this for every row in the data set, however, I'm getting this error:
Error in strata(popCenters2, stratanames = "FIPS", method = "systematic", :
not enough obervations in the stratum 1
I suspect that this is because strata does not use replacement and my data set is much larger than the centroid data set.
This is the code I used with the strata function applied to my sample:
library(dplyr)
library(sampling)

## Combine fields to match the format of the other data
popCenters2 <- within(popCenters2,
                      FIPS <- paste(stateFIPS, countyFIPS, zipcode, sep = ""))

## Count sample rows per stratum, then align both tables by FIPS
sample %>% group_by(FIPS) %>% count() -> sampleCounts
popCenters2[order(popCenters2$FIPS), ] -> popCenters2
sampleCounts[order(sampleCounts$FIPS), ] -> sampleCounts

st <- strata(popCenters2, stratanames = "FIPS", method = "systematic",
             size = sampleCounts$n, pik = popCenters2$contribPop)
stTable <- getdata(popCenters2, st)
My sample had 5 rows with the "FIPS" variable equal to 4200117325; this is the corresponding centroid data:
FIPS tract blkGroup latitude longitude contribPop
4200117325 030200 1 +40.000254 -077.137559 452
4200117325 030200 2 +39.959070 -077.160354 324
4200117325 030400 1 +39.915855 -077.406954 194
4200117325 030400 2 +39.923503 -077.298505 131
4200117325 030400 3 +39.878509 -077.307547 173
4200117325 030400 4 +39.873705 -077.360488 176
4200117325 030400 5 +39.880362 -077.412175 108
4200117325 030500 1 +39.926149 -077.227283 630
4200117325 030500 2 +39.921269 -077.260640 459
My question is, how can I reproduce this sort of procedure if, for example, my actual data set has 20 rows corresponding to 4200117325? I've read through the documentation for the strata function and a few others (Strata from DescTools, the survey package) but have been unable to find anything that allows replacement.
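One possible workaround (a minimal sketch, not from the original thread): since sampling::strata() draws without replacement, the centroid rows can be drawn per stratum directly with base R's sample(), which supports both replace = TRUE and unequal prob weights. This assumes the large data set is called myData and shares the combined FIPS field with popCenters2; sample() normalizes prob internally, so raw populations can be used as weights.
library(dplyr)

# Draw n centroid rows for one FIPS stratum, with replacement,
# with selection probability proportional to contribPop
assignCentroid <- function(fips, n) {
  cands <- popCenters2[popCenters2$FIPS == fips, ]
  cands[sample(nrow(cands), n, replace = TRUE,
               prob = cands$contribPop), c("latitude", "longitude")]
}

# Attach sampled coordinates to every row, stratified by FIPS
myData <- myData %>%
  group_by(FIPS) %>%
  group_modify(~ bind_cols(.x, assignCentroid(.y$FIPS, nrow(.x)))) %>%
  ungroup()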

Related

How to get US county name from Address, city and state in R?

I have a dataset of around 10,000 rows with Address, City, State and Zipcode values but no lat/long coordinates. I would like to retrieve the county name without it taking a large amount of time. I have tried library(tidygeocoder), but it takes around 14 seconds for 100 values and gives a 'time-out' error when I put in the entire dataset. Plus, it outputs a FIPS code, which I then have to join on to get the actual county name. Reproducible example:
library(tidygeocoder)
library(dplyr)

df <- tidygeocoder::louisville[, 1:4]
county_fips <- data.frame(fips = c("111", "112"),
                          county = c("Jefferson", "Montgomery"))

geocoded <- df %>% geocode(street = street, city = city, state = state,
                           method = 'census', full_results = TRUE,
                           api_options = list(census_return_type = 'geographies'))
df$fips <- geocoded$county_fips
df_new <- merge(x = df, y = county_fips, by = "fips", all.x = TRUE)
You can use a public dataset that links city and/or zipcode to county. I found these websites with such data:
https://www.unitedstateszipcodes.org/zip-code-database
https://simplemaps.com/data/us-cities
You can then do a left join on the linking column (presumably city or zipcode but will depend on the dataset):
df = merge(x=df, y=public_dataset, by="City", all.x=T)
If performance is an issue, you can select just the county and linking columns from the public data set before you do the merge.
public_dataset = public_dataset %>% select(County, City)
The slow performance is due to tidygeocoder's use of the Census Bureau's API to match data. Asking the API to match thousands of addresses is the slowdown, and I'm not aware of a different way to do this.
However, we can at least pare down the number of addresses that you are putting into the API. Maybe if we get that number low enough the code will run.
The ZIP Code Tabulation Area (ZCTA) relationship file shows the relationships between ZIP Codes and county names (as well as FIPS codes). A "|"-delimited file, along with a description of the data, can be found on the Bureau's website.
Counting the number of times a ZIP code shows up tells us if a ZIP code spans multiple counties. If the frequency == 1, then you can freely translate the ZIP code to the county.
ZCTA <- read.delim("tab20_zcta520_county20_natl.txt", sep="|")
n_occur <- data.frame(table(ZCTA$GEOID_ZCTA5_20))
head(n_occur, 10)
   Var1 Freq
1   601    2
2   602    2
3   603    2
4   606    3
5   610    4
6   611    1
7   612    3
8   616    1
9   617    2
10  622    1
In these results, addresses with ZIP codes 00611 and 00622 can be mapped to their counties without sending the addresses through the API. If your addresses are very urban, you may be lucky: urban ZIP codes cover small areas and typically do not span multiple counties.
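A sketch of that pare-down, assuming your address data frame df has a "zip" column and that the relationship file names its county columns GEOID_COUNTY_20 and NAMELSAD_COUNTY_20 (those names are assumptions; check names(ZCTA) for your download):
# Keep only ZIP codes that fall in exactly one county, build a lookup,
# and join it on; whatever remains unmatched still needs the API
single <- n_occur$Var1[n_occur$Freq == 1]
lookup <- ZCTA[ZCTA$GEOID_ZCTA5_20 %in% single,
               c("GEOID_ZCTA5_20", "GEOID_COUNTY_20", "NAMELSAD_COUNTY_20")]

df <- merge(df, lookup, by.x = "zip", by.y = "GEOID_ZCTA5_20", all.x = TRUE)
to_geocode <- df[is.na(df$GEOID_COUNTY_20), ]  # only these rows go to the API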

R mlr3: why does creating a TaskRegrST duplicate rows?

I have a data frame called tab_mlr with coordinates and 19 features in 788 rows.
str(tab_mlr)
This object has 788 observations of 21 variables (2 of them being Latitude and Longitude). I create an sf object like this:
data_mlr <- sf::st_as_sf(tab_mlr, coords = c("Longitude", "Latitude"), crs = 4326)
data_mlr has 788 features, which is correct. But when I create a task from data_mlr like this:
task <- TaskRegrST$new(
  "mlr",
  backend = data_mlr,
  target = "Hauteur"
)
the task object has 620,944 rows! Why not 788?
The reason might be that the backend is creating 788 rows for every row, giving 788^2 = 620,944 rows as a result.
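A hedged workaround sketch: build the task from a plain data frame rather than the sf object, so each of the 788 rows appears exactly once. The coordinate_names and crs arguments below are assumptions based on current mlr3spatiotempcv documentation; check the constructor arguments for your installed version.
library(mlr3spatiotempcv)
library(sf)

# Drop the geometry and keep the coordinates as ordinary columns
tab_plain <- sf::st_drop_geometry(data_mlr)
tab_plain$Longitude <- sf::st_coordinates(data_mlr)[, "X"]
tab_plain$Latitude  <- sf::st_coordinates(data_mlr)[, "Y"]

task <- TaskRegrST$new(
  "mlr",
  backend = tab_plain,
  target = "Hauteur",
  coordinate_names = c("Longitude", "Latitude"),
  crs = "EPSG:4326"
)
stopifnot(task$nrow == 788)  # the task should keep the original row count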

Mean Y for individual X values

I have a data set in .dta format with the height and weight of baseball players. I want to calculate the mean height for each individual weight value.
From what I've been able to find, I could use dplyr and group_by, but my R script does not recognize the command, despite my having installed and loaded the package.
Thanks!
Here is an example coded in base R using baseball player height and weight data obtained from the UCLA SOCR MLB HeightsWeights data set.
After cleaning the data (weight is missing for one player), I posted it to GitHub to make it accessible without having to clean it again.
theCSVFile <- "https://raw.githubusercontent.com/lgreski/datasciencedepot/gh-pages/data/baseballPlayers.csv"
if (!dir.exists("./data")) dir.create("./data")   # target directory must exist
download.file(theCSVFile, "./data/baseballPlayers.csv", method = "curl")
theData <- read.csv("./data/baseballPlayers.csv", header = TRUE, stringsAsFactors = FALSE)

# mean height for each distinct weight value
aggData <- aggregate(HeightInInches ~ WeightInPounds, data = theData, FUN = mean)
head(aggData)
...and the output is:
> head(aggData)
WeightInPounds HeightInInches
1 150 70.75000
2 155 69.33333
3 156 75.00000
4 160 71.46667
5 163 70.00000
6 164 73.00000
>
regards,
Len
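Since the question mentions dplyr, here is the group_by equivalent of the aggregate() call above (a minimal sketch assuming theData from the answer; if group_by is "not recognized" even after installing, the usual culprit is dplyr not actually being attached, or another package such as plyr masking its functions):
library(dplyr)

# Same result as the aggregate() call: mean height per weight value
theData %>%
  group_by(WeightInPounds) %>%
  summarise(HeightInInches = mean(HeightInInches))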

How to create a heat map in R?

I am doing a multi-part project. To begin with, I had a data set which provided deposits per district over the years. After scrubbing it, I created a data frame giving the growth of deposits by district. I have deposit growth for 3 kinds of institutions (foreign banks, public banks and private banks) in 3 different data frames, as the number of rows differs in each. I have been asked to create 3 heat maps of deposit growth, one per kind of bank.
My data frame looks like the attached picture.
I want to make a heat map for the growth column.
Thanks.
Maybe I'm adding some noise with this answer; if so, feel free to disregard it without hesitation.
I'll show you how I make some heatmaps in R:
Fake data:
Gene Patient_A Patient_B Patient_C Patient_D
BRCA1 52 46 124 148
TP53 512 487 112 121
FOX3D 841 658 321 364
MAPK1 895 541 198 254
RASA1 785 554 125 69
ADAM18 12 65 85 121
library(gplots)  # provides redgreen() and heatmap.2()

hmcols <- rev(redgreen(2750))
heatmap.2(hm_mx, scale = "row", key = TRUE, lhei = c(2, 5),
          symkey = FALSE, density.info = "none", trace = "none",
          cexRow = 1.1, cexCol = 1.1, col = hmcols, dendrogram = "none")
In the case of read.table, you will probably have to convert the data frame to a matrix and move the first column into the row names to avoid errors from R:
hm <- read.table("hm1.txt", sep = '\t', header = TRUE, stringsAsFactors = FALSE)
row.names(hm) <- hm$Gene    # gene names become row labels
hm_mx <- data.matrix(hm)
hm_mx <- hm_mx[, -c(1)]     # drop the now-redundant Gene column
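Putting the pieces together, a self-contained version using the fake data above (a sketch; assumes the gplots package is installed):
library(gplots)

# Rebuild the fake data from the table above
hm <- data.frame(
  Gene      = c("BRCA1", "TP53", "FOX3D", "MAPK1", "RASA1", "ADAM18"),
  Patient_A = c(52, 512, 841, 895, 785, 12),
  Patient_B = c(46, 487, 658, 541, 554, 65),
  Patient_C = c(124, 112, 321, 198, 125, 85),
  Patient_D = c(148, 121, 364, 254, 69, 121)
)
row.names(hm) <- hm$Gene
hm_mx <- data.matrix(hm[, -1])  # numeric matrix, genes as row names

hmcols <- rev(redgreen(2750))
heatmap.2(hm_mx, scale = "row", key = TRUE, lhei = c(2, 5),
          symkey = FALSE, density.info = "none", trace = "none",
          cexRow = 1.1, cexCol = 1.1, col = hmcols, dendrogram = "none")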

Sampling in stages in R

I am running some sampling simulations from census data and I would like to sample in 2 stages.
First I want to sample 25 households within each village.
Second I want to sample 1 person from each household.
My data is in long format, with a village identifier, a household identifier, and a binary disease status (0 = healthy, 1 = diseased). The following code runs a Monte Carlo simulation that samples 25 individuals per village 3000 times and records the number of malaria-positive individuals sampled.
But I would like to sample 1 individual from each of 25 sampled households in each village, and I can't figure it out.
Here is the link to my data:
d = read.table("data.txt", sep = ",", header = TRUE)
villages = split(d$malaria, d$villageid)
positives = vector("list", 3000)
for (i in 1:3000) {
  sampled = lapply(villages, sample, 25)
  positives[[i]] = lapply(sampled, sum)
}
How about this?
replicate(3000, sum(sapply(lapply(villages, sample, 25), sample, 1)))
lapply(villages, sample, 25) -> draws 25 sampled values for each of the 177 villages
sapply(., sample, 1) -> samples 1 of those 25 people in each of the 177 villages
sum(.) -> sums the sampled values across villages
replicate -> repeats the whole procedure 3000 times
I figured out a workaround. It is quite convoluted and involves creating a second dataset. (I did this part in Stata, as my R capabilities are limited.) First I sort the dataset by house number and load it into R (d.people). Then I create a new dataset by collapsing the old one by house number, and load that into R (d.houses). I then sample in 2 stages: first 1 person from each household in the people dataset, then, after combining the houses dataset with that output, 25 of those "household-sampled people" from each village.
d.people = read.table("people data", sep = ",", header = TRUE)
d.houses = read.table("houses data", sep = ",", header = TRUE)
positives = vector("list", 3000)

for (i in 1:3000) {
  houses = split(d.people$malaria, d.people$house)
  firststage = sapply(houses, sample, 1)        # stage 1: 1 person per household
  secondstage = cbind(d.houses, firststage)
  villages = split(secondstage$firststage, secondstage$village)
  sampled = lapply(villages, sample, 25)        # stage 2: 25 households per village
  positives[[i]] = lapply(sampled, sum)
}
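For reference, a hedged all-in-R alternative to the Stata detour, assuming d has columns villageid, houseid and malaria (0/1), and that every village has at least 25 households. Stage 1 draws 25 households per village; stage 2 draws one person from each sampled household:
# Safe single draw: avoids sample()'s gotcha where sample(x, 1) on a
# length-1 numeric x draws from 1:x instead of returning x
sample1 <- function(x) x[sample(length(x), 1)]

households    <- unique(d[, c("villageid", "houseid")])
hh_by_village <- split(households$houseid, households$villageid)
people_by_hh  <- split(d$malaria, d$houseid)

one_round <- function() {
  sum(sapply(hh_by_village, function(hh) {
    picked <- sample(hh, 25)                                  # stage 1: 25 households
    sum(sapply(people_by_hh[as.character(picked)], sample1))  # stage 2: 1 person each
  }))
}
positives <- replicate(3000, one_round())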
