Convert some rows to lbs in R

One of my vectors has different kinds of data and I've been trying to convert it, but I really can't find the way.
In the column I have the weights; the ones with no indicator are in lbs, the others are in kg, and I need to have them all in lbs. But I can't find how to work on a specific set of rows only, e.g. to strip the "KG" and multiply the value by 2.20 to convert it to lbs.
Weight
200
150
220
100KG
80KG
95KG

Try this example:
# example data
df1 <- read.table(text = "Weight
1 194
2 200
3 250
4 50Kg
5 40Kg
6 39Kg", header = TRUE, stringsAsFactors = FALSE)
# using ifelse (gives warning)
ifelse(grepl("Kg", df1$Weight),
       as.numeric(gsub("Kg", "", df1$Weight)) * 2.2,
       as.numeric(df1$Weight))
# [1] 194.0 200.0 250.0 110.0 88.0 85.8
# Warning message:
# In ifelse(grepl("Kg", df1$Weight), as.numeric(gsub("Kg", "", df1$Weight)) * :
# NAs introduced by coercion
# not using ifelse :) -- grepl() gives 0/1, so "Kg" rows are scaled by 1 + 1.2 = 2.2 and lbs rows by 1
as.numeric(gsub("Kg", "", df1$Weight)) * (1 + grepl("Kg", df1$Weight) * 1.2)
# [1] 194.0 200.0 250.0 110.0 88.0 85.8
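If you prefer a tidyverse style, here is a minimal sketch of the same conversion (it assumes the df1 above and the 2.2 factor used in this answer; it is not part of the original answer):
library(dplyr)
df1 %>%
  mutate(
    kg         = grepl("Kg", Weight, ignore.case = TRUE),   # flag the kg rows
    value      = as.numeric(gsub("[^0-9.]", "", Weight)),   # strip the unit text
    Weight_lbs = ifelse(kg, value * 2.2, value)             # convert only the kg rows
  )
The intermediate kg and value columns are just for readability; you can drop them with select(-kg, -value) if you only want Weight_lbs.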

Related

Applying survey weights, and a weighted average concurrently

I have a survey for which I need to do two things;
I need to apply survey weights to a set of variables using the survey package to retrieve the 'weighted' mean AND
I need to find the weighted average of those variables.
I only want the final weighted mean for each variable after doing both these things.
I know how to find the survey weighted mean and the weighted average separately, but I do not know how to apply them together, or in which order to apply these weights. Here is an example below of my data, and how I could find the 'survey weighted mean' and the 'weighted average' separately.
Please see below for sample data:
library(survey)
dat_in <- read_table2("code CCS trad_sec Q1 enrolled wgt
23 TRUE sec 20 400 1.4
66 FALSE trad 40 20 3.0
34 TRUE sec 30 400 4.4
78 FALSE sec 40 25 2.2
84 TRUE trad 20 25 3.7
97 FALSE sec 10 500 4.1
110 TRUE sec 80 1000 4.5
123 FALSE trad 33 679 4.8
137 TRUE sec 34 764 5.2
150 FALSE sec 43 850 5.6
163 TRUE trad 45 935 6.0
177 FALSE trad 46 1020 6.4
190 TRUE trad 48 1105 6.7
203 FALSE trad 50 1190 7.1
217 TRUE trad 52 1276 7.5
230 FALSE trad 53 1361 7.9
243 TRUE trad 55 1446 8.3
256 FALSE trad 57 1531 8.6
270 TRUE sec 59 1616 9.0
283 FALSE sec 60 1701 9.4
296 TRUE sec 62 1787 9.8
310 FALSE sec 64 1872 10.2
")
1. To apply survey weights:
Create survey design
SurveyDesign <- svydesign(id = ~code,
                          weights = ~wgt,
                          data = dat_in)
Find weighted mean and tabulations
# For CCS FALSE, sec
svymean(~Q1, subset(SurveyDesign,CCS=="FALSE" & trad_sec %in% c("sec")), na.rm = T)
# For CCS TRUE, sec
svymean(~Q1, subset(SurveyDesign,CCS=="TRUE" & trad_sec %in% c("sec")), na.rm = T)
2. To find weighted average:
Weighted average based on enrollment
dat_in %>% group_by(CCS, trad_sec) %>% mutate(wgtQ1 = weighted.mean(Q1, w = enrolled))
Possible solution that combines 1 and 2? (based on crowd-sourced suggestions)
generate weighted average by group
dat_in2 <- dat_in %>%
group_by(CCS, trad_sec) %>%
mutate(wgtQ1 = weighted.mean(Q1, w = enrolled)) %>%
ungroup
Create survey design
SurveyDesign2 <- svydesign(id = ~code,
                           weights = ~wgt,
                           data = dat_in2)
Run mean on aggregated weighted average
svymean(~wgtQ1, subset(SurveyDesign2,CCS=="FALSE" & trad_sec %in% c("sec")), na.rm = T)
My intuition is that I should apply the weighted average first and THEN apply the survey weights? The solution above seems funky because each row holds the weighted average for its group (CCS, trad_sec), whereas the design object should presumably be fed disaggregated data?
All suggestions much appreciated!
I assume you care about standard error estimates (since otherwise you can just multiply the two sets of weights and use weighted.mean; a sketch of that shortcut is at the end of this answer). If so, it matters whether there is sampling uncertainty in the enrolled variable as well as in Q1, and whether the sampling weights should be applied to that variable too. If not, use svyby to get the group means and svycontrast to weight them:
> means<-svyby(~Q1, ~CCS, svymean, design=subset(SurveyDesign, trad_sec %in% "sec"),
covmat=TRUE)
> means
CCS Q1 se
FALSE FALSE 50.36825 7.767602
TRUE TRUE 53.51020 6.453270
> with(subset(dat_in, trad_sec=="sec"), by(enrolled, list(CCS), sum))
: FALSE
[1] 4948
---------------------------------------------------------------------------------------
: TRUE
[1] 5967
> svycontrast(means, c(4948/(4948+5967), 5967/(4948+5967)))
contrast SE
contrast 55.036 5.2423
If you want sampling weights applied to enrolled as well, I think you want svyratio to estimate a sampling-weighted version of
sum(enrolled*Q1)/sum(enrolled). You can do that one at a time:
> svyratio(~I(Q1*enrolled),~enrolled,
design=subset(SurveyDesign, trad_sec=="sec" & CCS==TRUE))
Ratio estimator: svyratio.survey.design2(~I(Q1 * enrolled), ~enrolled, design = subset(SurveyDesign,
trad_sec == "sec" & CCS == TRUE))
Ratios=
enrolled
I(Q1 * enrolled) 58.41278
SEs=
enrolled
I(Q1 * enrolled) 3.838715
> svyratio(~I(Q1*enrolled),~enrolled,
design=subset(SurveyDesign, trad_sec=="sec" & CCS==FALSE))
Ratio estimator: svyratio.survey.design2(~I(Q1 * enrolled), ~enrolled, design = subset(SurveyDesign,
trad_sec == "sec" & CCS == FALSE))
Ratios=
enrolled
I(Q1 * enrolled) 57.42204
SEs=
enrolled
I(Q1 * enrolled) 4.340065
or with svyby
> svyby(~I(Q1*enrolled),~CCS, svyratio, denom=~enrolled,
design=subset(SurveyDesign, trad_sec=="sec"))
CCS I(Q1 * enrolled)/enrolled se.I(Q1 * enrolled)/enrolled
FALSE FALSE 57.42204 4.340065
TRUE TRUE 58.41278 3.838715
(a note: it helps if you specify all the packages needed for your example code to run; in your case readr for read_table2)
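For completeness, the point-estimate shortcut mentioned at the start of this answer could look like the following minimal sketch (it assumes the dat_in from the question and simply multiplies the sampling weight wgt by the enrollment weight; it produces no standard errors):
library(dplyr)
dat_in %>%
  filter(trad_sec == "sec") %>%
  group_by(CCS) %>%
  summarise(Q1_wtd = weighted.mean(Q1, w = wgt * enrolled))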

Why is the R ncdf4 package transposing my data?

I'm reading .nc data into R with ncdf4 and RNetCDF. The netCDF metadata says that there are 144 lons and 73 lats, which should give 144 columns and 73 rows, right?
However, the data I get in R seems to be transposed, with 144 rows and 73 columns.
Could you please tell me what is wrong?
Thanks
library(ncdf4)
a <- tempfile()
download.file(url = "ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis2.derived/pressure/uwnd.mon.mean.nc", destfile = a)
nc <- nc_open(a)
uwnd <- ncvar_get(nc = nc, varid = "uwnd")
dim(uwnd)
## [1] 144 73 17 494
umed <- (uwnd[ , , 10, 421] + uwnd[ , , 10, 422] + uwnd[ , , 10, 423])/3
nrow(umed)
## [1] 144
ncol(umed)
## [1] 73
It looks like you are having two problems.
The first one is expecting the data to keep the same structure in R that it has in the netCDF file. That is a problem in itself, because you are translating the multi-dimensional array structure of the netCDF into a two-dimensional data frame; netCDF data needs some reshaping in R before it can be manipulated the way it is in Python (see: http://geog.uoregon.edu/bartlein/courses/geog490/week04-netCDF.html).
The second one is that you are using values instead of indices when subsetting the data.
umed <- (uwnd[ , , 10, 421] + uwnd[ , , 10, 422] + uwnd[ , , 10, 423])/3
The solution that I see for this is to start by creating the indices of the dimensions that you want to subset. In this example I am subsetting the 10 millibar pressure level and everything between longitude 230 and 300 and latitude 25 and 40.
nc <- nc_open("uwnd.mon.mean.nc")
LonIdx <- which( nc$dim$lon$vals > 230 & nc$dim$lon$vals <300 )
## [1] 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
## 114 115 116 117 118 119 120
LatIdx <- which( nc$dim$lat$vals >25 & nc$dim$lat$vals < 40)
## [1] 22 23 24 25 26
LevIdx <- which( nc$dim$level$vals==10)
## [1] 17
Then you would need to apply the indices over each dimension except time, which I assume you don't want to subset. Subsetting lon and lat is important because R holds everything in memory, so keeping their whole range would consume a significant amount of RAM.
lat <- ncvar_get(nc,"lat")[LatIdx]
lon <- ncvar_get(nc,"lon")[LonIdx]
lev <- ncvar_get(nc,"level")[LevIdx]
time <- ncvar_get(nc,"time")
After that you can get the variable that you were looking for uwnd Monthly U-wind on Pressure Levels and finish reading the netCDF file with a nc_close(nc).
uwnd <- ncvar_get(nc,"uwnd")[LonIdx,LatIdx,LevIdx,]
nc_close(nc)
At the end you can expand the grid with all four dimensions: longitude, latitude, pressure level and time.
uwndf <- data.frame(as.matrix(cbind(expand.grid(lon,lat,lev,time))),c(uwnd))
names(uwndf) <- c("lon","lat","level","time","U-wind")
Bind it into a data frame together with the U-wind variable and convert the netCDF time variable into an R time object.
library(ncdf.tools)  # provides convertDateNcdf2R()
uwndf$time_final <- convertDateNcdf2R(uwndf$time, units = "hours",
                                      origin = as.POSIXct("1800-01-01", tz = "UTC"),
                                      time.format = "%Y-%m-%d %Z %H:%M:%S")
At the end you will have the data frame you are looking for, covering January 1979 to March 2020.
max(uwndf$time_final)
## [1] "2020-03-01 UTC"
min(uwndf$time_final)
## [1] "1979-01-01 UTC"
head(uwndf)
## lon lat level time U-wind time_final
## 1 232.5 37.5 10 1569072 3.289998 1979-01-01
## 2 235.0 37.5 10 1569072 5.209998 1979-01-01
## 3 237.5 37.5 10 1569072 7.409998 1979-01-01
## 4 240.0 37.5 10 1569072 9.749998 1979-01-01
## 5 242.5 37.5 10 1569072 12.009998 1979-01-01
## 6 245.0 37.5 10 1569072 14.089998 1979-01-01
I hope this is useful! Cheers!
Note: for converting the netCDF time variable into an R time object, make sure you have the ncdf.tools package installed.
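On the "transposed" part of the question: ncvar_get() returns the array with its dimensions in the order they are stored in the file (here lon, lat, level, time), so a lon x lat slice is 144 x 73 by design rather than an error. If you want a lat x lon layout, you can reorder the dimensions yourself; a minimal sketch, assuming the umed matrix and uwnd array from the question:
umed_latlon <- t(umed)                     # transpose the 2-D slice: 73 rows (lat) x 144 columns (lon)
uwnd_latlon <- aperm(uwnd, c(2, 1, 3, 4))  # reorder the full 4-D array to lat, lon, level, time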

Create a sequence of values by group between a min and max interval using dplyr

This is surely a basic question but I couldn't find a way to solve it.
I need to create a sequence of values from a minimum (dds_min) to a maximum (dds_max) per group (fs).
This is my data:
fs <- c("early", "late")
dds_min <-as.numeric(c("47.2", "40"))
dds_max <-as.numeric(c("122", "105"))
dds_min.max <-as.data.frame(cbind(fs,dds_min, dds_max))
And this is what I did....
library(dplyr)
dss_levels <- dds_min.max %>%
  group_by(fs) %>%
  mutate(dds = seq(dds_min, dds_max, length.out = 100))
I intended to create a new variable (dds) that has to be 100 values long and start and end at different values depending on fs. My expectation was to end up with another data frame (dss_levels) with two columns (fs and dds) and 200 values in it.
But I am getting this error.
Error: Column `dds` must be length 1 (the group size), not 100
In addition: Warning messages:
1: In Ops.factor(to, from) : ‘-’ not meaningful for factors
2: In Ops.factor(from, seq_len(length.out - 2L) * by) :
‘+’ not meaningful for factors
Any help would be really appreciated.
Thanks!
I make the sequence length 5 for illustrative purposes; you can change it to 100.
library(dplyr)
library(purrr)
library(tidyr)
dds_min.max %>%
  mutate(dds = map2(dds_min, dds_max, seq, length.out = 5)) %>%
  unnest(cols = dds)
# # A tibble: 10 x 4
# fs dds_min dds_max dds
# <fct> <dbl> <dbl> <dbl>
# 1 early 47.2 122 47.2
# 2 early 47.2 122 65.9
# 3 early 47.2 122 84.6
# 4 early 47.2 122 103.
# 5 early 47.2 122 122
# 6 late 40 105 40
# 7 late 40 105 56.2
# 8 late 40 105 72.5
# 9 late 40 105 88.8
# 10 late 40 105 105
Using this data (make sure your numeric columns are numeric! Don't use cbind!)
fs <- c("early", "late")
dds_min <-c(47.2, 40)
dds_max <-c(122, 105)
dds_min.max <-data.frame(fs,dds_min, dds_max)
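If you prefer to stay in plain dplyr, here is a minimal sketch of the same idea using reframe() (this assumes dplyr >= 1.1.0 and the corrected data above; it is not part of the original answer):
library(dplyr)
dss_levels <- dds_min.max %>%
  group_by(fs) %>%
  reframe(dds = seq(dds_min, dds_max, length.out = 100))
# 200 rows in total: 100 values of dds for each level of fs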

R - Sum range over look-back period, divided by sum of look-back - Excel to R

I am looking to work out a percentage total over a look-back range in R.
I know how to do this in excel with the following formula:
=SUM(B2:B4)/SUM(B2:B4,C2:C4)
This sums column B over a range from the current row looking back 3 lines. It then divides this sum by the total sum of columns B + C, again looking back 3 lines.
I am looking to achieve the same calculation in R to run across my matrix.
The output would look something like this:
adv dec perct
1 69 376
2 113 293
3 270 150 0.355625492
4 74 371 0.359559402
5 308 96 0.513790386
6 236 173 0.491255962
7 252 134 0.663886572
8 287 129 0.639966969
9 219 187 0.627483444
This is a line of code I could perhaps add the look-back range to:
perct <- apply(data.matrix[, c('adv','dec')], 1, function(x) { x[1] / (x[1] + x[2]) })
If I could get x[1] to sum over the previous 3-line range, and
if I could get x[2] to also sum over the previous 3-line range.
Still learning how to apply forward and look-back periods within R, so any additional learning on the answer would be appreciated!
Here are some approaches. The first 3 use rollsumr and/or rollapplyr from zoo and the last one uses only base R.
1) rollsumr Create a matrix with rollsumr whose columns contain the rolling sums, convert that to row proportions and take the "adv" column. Finally assign that to a new column frac in DF. This approach has the shortest code.
library(zoo)
DF$frac <- prop.table(rollsumr(DF, 3, fill = NA), 1)[, "adv"]
giving:
> DF
adv dec frac
1 69 376 NA
2 113 293 NA
3 270 150 0.3556255
4 74 371 0.3595594
5 308 96 0.5137904
6 236 173 0.4912560
7 252 134 0.6638866
8 287 129 0.6399670
9 219 187 0.6274834
1a) This variation is similar except instead of using prop.table we write out the ratio. The code is longer but you may find it clearer.
m <- rollsumr(DF, 3, fill = NA)
DF$frac <- with(as.data.frame(m), adv / (adv + dec))
1b) This is a variation of (1) that is the same except it uses a magrittr pipeline:
library(magrittr)
DF %>% rollsumr(3, fill = NA) %>% prop.table(1) %>% `[`(TRUE, "adv") -> DF$frac
2) rollapplyr We could use rollapplyr with by.column = FALSE like this. The result is the same.
ratio <- function(x) sum(x[, "adv"]) / sum(x)
DF$frac <- rollapplyr(DF, 3, ratio, by.column = FALSE, fill = NA)
3) Yet another variation is to compute the numerator and denominator separately:
DF$frac <- rollsumr(DF$adv, 3, fill = NA) /
rollapplyr(DF, 3, sum, by.column = FALSE, fill = NA)
4) base This uses embed followed by rowSums on each column to get the rolling sums and then uses prop.table as in (1).
DF$frac <- prop.table(sapply(lapply(rbind(NA, NA, DF), embed, 3), rowSums), 1)[, "adv"]
Note: The input used in reproducible form is:
Lines <- "adv dec
1 69 376
2 113 293
3 270 150
4 74 371
5 308 96
6 236 173
7 252 134
8 287 129
9 219 187"
DF <- read.table(text = Lines, header = TRUE)
Consider an sapply that loops through the number of rows in order to index two rows back:
DF$pred <- sapply(seq(nrow(DF)), function(i)
ifelse(i>=3, sum(DF$adv[(i-2):i])/(sum(DF$adv[(i-2):i]) + sum(DF$dec[(i-2):i])), NA))
DF
# adv dec pred
# 1 69 376 NA
# 2 113 293 NA
# 3 270 150 0.3556255
# 4 74 371 0.3595594
# 5 308 96 0.5137904
# 6 236 173 0.4912560
# 7 252 134 0.6638866
# 8 287 129 0.6399670
# 9 219 187 0.6274834
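Another base-R route, not shown in the answers above, is stats::filter(), which computes the same trailing sums in one vectorised call (a minimal sketch, assuming the DF built above):
# one-sided moving-sum filter over the current row and the 2 previous rows; first 2 entries are NA
k <- 3
roll_adv <- stats::filter(DF$adv, rep(1, k), sides = 1)
roll_tot <- stats::filter(DF$adv + DF$dec, rep(1, k), sides = 1)
DF$frac_filter <- as.numeric(roll_adv / roll_tot)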

R - setting equiprobability over a specific variable when sampling

I have a data set with more than 2 million entries which I load into a data frame.
I'm trying to grab a subset of the data. I need around 10000 entries, but I need the entries to be picked with equal probability with respect to one variable.
This is what my data looks like with str(data):
'data.frame': 2685628 obs. of 3 variables:
$ category : num 3289 3289 3289 3289 3289 ...
$ id: num 8064180 8990447 747922 9725245 9833082 ...
$ text : chr "text1" "text2" "text3" "text4" ...
You've noticed that I have 3 variables: category, id and text.
I have tried the following :
> sample_data <- data[sample(nrow(data),10000,replace=FALSE),]
Of course this works, but the sampling probability is not equal across categories. Here is the output of count(sample_data$category):
x freq
1 3289 707
2 3401 341
3 3482 160
4 3502 243
5 3601 1513
6 3783 716
7 4029 423
8 4166 21
9 4178 894
10 4785 31
11 5108 121
12 5245 2178
13 5637 387
14 5946 1484
15 5977 117
16 6139 664
Update: Here is the output of count(data$category) :
x freq
1 3289 198142
2 3401 97864
3 3482 38172
4 3502 59386
5 3601 391800
6 3783 201409
7 4029 111075
8 4166 6749
9 4178 239978
10 4785 6473
11 5108 32083
12 5245 590060
13 5637 98785
14 5946 401625
15 5977 28769
16 6139 183258
But when I try setting the probability I get the following error :
> catCount <- length(unique(data$category))
> probabilities <- rep(c(1/catCount),catCount)
> train_set <- data[sample(nrow(data),10000,prob=probabilities),]
Error in sample.int(x, size, replace, prob) :
incorrect number of probabilities
I understand that the sample function is randomly picking among the row numbers, but I can't figure out how to associate that with a probability over the categories.
Question : How can I sample my data over an equal probability for the category variable?
Thanks in advance.
I guess you could do this with some simple base R operations, though you should remember that you are using probabilities within sample, so getting the exact count per combination won't work with this method, though you can get close enough for a large enough sample.
Here's some example data
set.seed(123)
data <- data.frame(category = sample(rep(letters[1:10], seq(1000, 10000, by = 1000)), 55000))
Then
probs <- 1/prop.table(table(data$category)) # Calculating relative probabilities
data$probs <- probs[match(data$category, names(probs))] # Matching them to the correct rows
set.seed(123)
train_set <- data[sample(nrow(data), 1000, prob = data$probs), ] # Sampling
table(train_set$category) # Checking frequencies
# a b c d e f g h i j
# 94 103 96 107 105 99 100 96 107 93
Edit: So here's a possible data.table equivalent
library(data.table)
setDT(data)[, probs := .N, category][, probs := .N/probs]
train_set <- data[sample(.N, 1000, prob = probs)]
Edit #2: Here's a very nice solution using the dplyr package contributed by @Khashaa and @docendodiscimus.
The nice thing about this solution is that it returns the exact sample size within each group.
library(dplyr)
train_set <- data %>%
group_by(category) %>%
sample_n(1000)
Edit #3:
It seems that data.table equivalent to dplyr::sample_n would be
library(data.table)
train_set <- setDT(data)[data[, sample(.I, 1000), category]$V1]
This will also return the exact sample size within each group.
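In more recent dplyr versions, sample_n() has been superseded by slice_sample(); a minimal equivalent sketch (assuming dplyr >= 1.0.0 and the same data as above):
library(dplyr)
train_set <- data %>%
  group_by(category) %>%
  slice_sample(n = 1000) %>%
  ungroup()
table(train_set$category)  # exactly 1000 rows per category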
