How to sample from multiple lists of numbers in R?

I would like to sample, let's say the ages of 100 persons above 65 years old,
and the probabilities for the age groups are as follows:
65-74<- 0.56
75-84<- 0.30
85<- 0.24
I know of the existence of the sample function, and I tried it as follows, but that unfortunately didn't work:
list65_74<-range(65,74)
list75_84<-range(75,84)
list85<-range(85,100)
age<-sample(c(list65_74,list75_84,list85),size=10,replace=TRUE,prob=c(0.56,0.30,0.24))
I then got the following error:
Error in sample.int(length(x), size, replace, prob) :
incorrect number of probabilities
So I was wondering what is the proper way to sample from multiple lists.
Thank you very much in advance!

First, I'll call those three objects groups instead, since they don't use the list function.
The way you define them could be fine, but it's somewhat more direct to go with, e.g., 65:74 rather than range(65, 74), which merely returns its two endpoints c(65, 74). So, ultimately I put the three groups in the following list:
groups <- list(group65_74 = 65:74, group75_84 = 75:84, group85 = 85:100)
Now the first problem with the usage of sample was your x argument value, which is
either a vector of one or more elements from which to choose, or a
positive integer. See ‘Details.’
Meanwhile, your x was just
c(list65_74, list75_84, list85)
# [1] 65 74 75 84 85 100
Lastly, the value of prob is inappropriate: you supply 3 probabilities for a vector of 6 candidates to sample from, which doesn't add up. Instead, you need to assign an appropriate probability to each age from each group, as in
rep(c(0.56, 0.30, 0.24), times = sapply(groups, length))
So that the result is
sample(unlist(groups), size = 10, replace = TRUE,
prob = rep(c(0.56, 0.30, 0.24), times = sapply(groups, length)))
# [1] 82 72 69 74 72 72 69 70 74 70
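As a side note (a sketch of mine, not part of the original answer): if the intent is for each *group* to be drawn with the stated probability regardless of how many ages it contains, a two-step approach first samples a group index, then an age within that group. The probabilities below come from the question; note they sum to 1.10, which sample() silently rescales.

```r
# Two-step sketch: draw a group index first, then an age uniformly within it.
# Probabilities are taken from the question (they sum to 1.10; sample() rescales).
groups <- list(group65_74 = 65:74, group75_84 = 75:84, group85 = 85:100)
probs <- c(0.56, 0.30, 0.24)
set.seed(1)
picked <- sample(seq_along(groups), size = 10, replace = TRUE, prob = probs)
ages <- vapply(picked, function(i) sample(groups[[i]], 1), integer(1))
ages
```

With this variant the chance of landing in the 85+ group does not grow with the number of ages it spans, which may or may not be what the asker intended.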

Calculate Multiple Information Value in R

I am new to R programming and trying to learn part-time, so apologies in advance for the naive coding and questions. I have spent about a day trying to figure out the code for this and have been unable to do so, hence asking here.
https://www.kaggle.com/c/titanic/data?select=train.csv
I am working on the Titanic training data set from Kaggle, imported as train_data. I have cleaned up all the columns and also converted them to factors where needed.
My question is 2 fold:
1. Unable to understand why this formula gives IV values as 0 for everything. What have I done wrong?
factor_vars <- colnames(train_data)
all_iv <- data.frame(VARS=factor_vars, IV=numeric(length(factor_vars)),STRENGTH=character(length(factor_vars)),stringsAsFactors = F)
for (factor_var in factor_vars){
all_iv[all_iv$VARS == factor_var, "IV"] <-
InformationValue::IV(X=train_data[, factor_var], Y=train_data$Survived)
all_iv[all_iv$VARS == factor_var, "STRENGTH"] <-
attr(InformationValue::IV(X=train_data[, factor_var], Y=train_data$Survived), "howgood")
}
all_iv <- all_iv[order(-all_iv$IV), ]
2. I am trying to create my own function to calculate IV values for multiple columns in one go, so that I do not have to repeat the task. However, when I run the following code I get the counts of total 0s and total 1s instead of items grouped as I requested. Again, what am I doing wrong in this example?
train_data %>% group_by(train_data[[3]]) %>%
summarise(zero = sum(train_data[[2]]==0),
one = sum(train_data[[2]]==1))
I get output
zero one
1 549 342
2 549 342
3 549 342
whereas I would anticipate an answer like:
zero one
1 80 136
2 97 87
3 372 119
What is wrong with my code?
3. Is there any pre-built function which can give IV values for all columns? On searching I found the iv.mult function but I cannot get it to work. Any suggestion would be great.
Let's take a look at your questions:
1.
length(factor_vars)
#> [1] 12
length() returns the number of elements of your vector factor_vars, so numeric(length(factor_vars)) is evaluated as numeric(12), which returns a numeric vector of length 12, filled with zeros by default.
The same applies to character(length(factor_vars)), which returns a character vector of length 12 filled with empty strings "".
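A quick illustration of those default fills:

```r
# numeric() and character() pre-fill with zeros and empty strings, respectively
numeric(3)
# [1] 0 0 0
character(2)
# [1] "" ""
```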
2. Your code doesn't use correct dplyr syntax: inside summarise(), train_data[[2]] refers to the entire column rather than to each group, which is why every row shows the overall totals (549 and 342). Refer to the columns by name instead:
library(dplyr)
train_data %>%
group_by(Pclass) %>%
summarise(zero = sum(Survived == 0),
one = sum(Survived == 1))
returns
# A tibble: 3 x 3
Pclass zero one
<dbl> <int> <int>
1 1 80 136
2 2 97 87
3 3 372 119
which is most likely what you are looking for.
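The same grouped count can also be produced in base R with table(); here is a tiny sketch on made-up data whose column names mirror the Titanic set:

```r
# Made-up miniature of the Titanic columns, just to show the shape of the result
train_data <- data.frame(Survived = c(0, 1, 0, 0, 1, 1),
                         Pclass   = c(1, 1, 2, 3, 3, 3))
table(train_data$Pclass, train_data$Survived)
#     0 1
#   1 1 1
#   2 1 0
#   3 1 2
```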
3. I don't know the meaning of IV, so I can't point you to a pre-built function for it.

plotting a scatter plot with wide range data R

I uploaded a csv file to R studio and am trying to plot two columns. The first one shows the number of likes, and the second shows the number of shares. I want to show the relationship between the number of shares when people actually like a post.
The problem is my likes count ranges from 1 to 1 million, and the shares count ranges from 5 to 37,000.
sample of my dataset (both columns are of class factor)
topMedia$likes_count
[1] 61 120 271 140 59 498 241 117 124 124 225 117 186 101
[15] 118 134 152 136 153 124 100 77 98 77 88 48 58 66
topMedia$shares_count
[1] 12 171 NULL 23 34 108 430 NULL NULL NULL 283 NULL NULL 57
[15] NULL NULL NULL 68 105 NULL NULL 7 10 45 103 22 75 16
When I use this code to plot a scatter plot, it looks messy.
plot(as.numeric(topMedia$shares_count),as.numeric(topMedia$likes_count))
I tried using other libraries
library(hexbin)
cols = colorRampPalette(c("#fee6ce", "#fd8d3c", "#e6550d", "#a63603"))
plot(hexbin(as.numeric(topMedia$shares_count), as.numeric(topMedia$likes_count), xbins = 40), colorcut = seq(0,1,length=20),
colramp = function(n) cols(20), legend = FALSE,xlab = 'share count', ylab = 'like count')
but I get a similar result even with colours.
What would be a better way to show the relationship between those values?
Thanks.
In this case, the even-ish distribution (for what should be a clear positive correlation between "likes" and "shares") is a clue that the numeric data might have been inadvertently loaded as a factor. Another clue is that the x and y value only vary by the number of unique values, not by the range of the underlying numeric data. We need to convert the levels of the factor (and not the values of the factor) to see the intended numbers. We can do this with something like as.numeric(as.character(x)).
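A minimal illustration of that conversion, with made-up values:

```r
# as.numeric() on a factor returns the (alphabetically ordered) level codes,
# not the original numbers; convert via character first to recover them.
f <- factor(c("2", "100", "30"))   # levels sort as "100" < "2" < "30"
as.numeric(f)                      # [1] 2 1 3  (level codes)
as.numeric(as.character(f))        # [1] 2 100 30  (intended numbers)
```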
To give an example, suppose we had some linearly correlated data like this:
library(ggplot2); library(dplyr)
set.seed(42)
fake_data <- data.frame(x = runif(10000, 0, 1000000))
fake_data$y <- pmax(0, fake_data$x*rnorm(10000, 1, 2) + runif(10000, 0, 1000000))
ggplot(fake_data, aes(x,y)) + geom_point()
If that numeric data were loaded in as factors (easy to do with read.csv if the term stringsAsFactors = FALSE isn't included), it might look more like this, not too dissimilar from the data in this question. The data here is being read as if it were character data, and then made into a factor which is ordered alphabetically, with "10000" before "2" because "1" comes before "2".
fake_data_factor <- fake_data %>%
mutate(x = as.factor(as.character(x)),
y = as.factor(as.character(y)))
The x and y values now carry numeric codes determined by alphabetical order, unrelated to the magnitudes of their underlying levels. R uses those codes to sort and to plot, so the x values with the lowest codes in the new data have levels near 100,000 instead of near 0. In the table below, 100,124 in row 1 comes alphabetically before 10,058 in row 8!
fake_data_factor %>%
arrange(x) %>%
head(8)
# x y
#1 100124.688120559 0
#2 100229.354342446 289241.187250382
#3 100299.560697749 232233.101769741
#4 100354.233058169 814492.563551191
#5 100364.253856242 1183870.56252858
#6 100370.0227011 1224652.83777805
#7 100461.616180837 1507465.73704898
#8 10058.1261795014 604477.823016668
ggplot(fake_data_factor, aes(as.numeric(x),as.numeric(y))) +
geom_point()
We can get back to the intended numbers by converting the factors to character (which extracts each one's level) and then converting those to numeric.
fake_data_factor %>%
ggplot(aes(as.numeric(as.character(x)),as.numeric(as.character(y)))) +
geom_point()

setting variable value by subsetting

This is my first question, so please bear with me.
I am creating a new variable age.f.sex in my dataframe wm.13 using an already existing variable SB1. In the original dataframe, SB1 indicates the age of first sexual intercourse of women interviewed in UNICEF's Multiple Indicators Cluster Surveys. The values that SB1 can take are:
> sort(unique(wm.13$SB1))
[1] 0 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
[26] 30 31 32 33 34 35 36 37 38 39 40 41 42 44 48 95 97 99
Here is the meaning of the values SB1 can take
0 means she never had sex
97 and 99 mean "does not remember/does not know"
95 means that she had her first sexual intercourse when she started living with her husband/partner (for which there is a specific variable, i.e MA9)
Any number between 0 and 95 is the declared age at their first sexual intercourse
there are also NAs, which sort() does not show but which appear if I just use unique()
I created a new variable from SB1, which I called age.f.sex.
wm.13$age.f.sex <- wm.13$SB1
I had the 0, 97 and 99 values replaced with NAs, and I kept the original NAs in SB1. I did this using the following code:
wm.13$age.f.sex[wm.13$SB1 == 0] <- NA
wm.13$age.f.sex[wm.13$SB1 == 97] <- NA
wm.13$age.f.sex[wm.13$SB1 == 99] <- NA
wm.13$age.f.sex[is.na(wm.13$SB1)] <- NA
Everything worked fine until here. However, I am in trouble with the 95 value. I want to code so that the observations that have value 95 in SB1 (i.e. the age of first sexual intercourse) will have the value from MA9 (i.e. the age when the woman started living with her partner/husband) in my new variable age.f.sex.
I first started with this code
> wm.13$age.f.sex[wm.13$SB1 == 95] <- wm.13$MA9
but i got the following error message
Error in wm.13$age.f.sex[wm.13$SB1 == 95] <- wm.13$MA9 :
NAs are not allowed in subscripted assignments
After some research on this website, I realised that I might need to subset the right-hand side of the code too, but honestly I do not know how to do it. I have a feeling that which() or ifelse() might be of use here, but I cannot figure out their arguments. Examples I have found on this website show how to impute one specific value, but I could not find anything on subsetting according to the value the observations take in another variable.
I hope I have been clear enough. Any suggestion will be much appreciated.
Thanks, Manolo
Perhaps you could try:
wm.13$age.f.sex <- ifelse(wm.13$SB1 %in% c(0,97,99) | is.na(wm.13$SB1), NA, ifelse(wm.13$SB1 == 95, wm.13$MA9, wm.13$SB1))
In short, it works like this: the code checks whether wm.13$SB1 is 0, 97, 99, or missing, and in those cases returns NA. Otherwise, it checks whether wm.13$SB1 is 95; if so, it returns the value on that row in the MA9 column. In all other cases it returns the SB1 value. The wm.13$age.f.sex <- at the beginning of the line assigns the return values to your new age.f.sex variable.
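Here is a self-contained miniature of that same ifelse() call, on made-up data with the same column names (values are illustrative only, not from the survey):

```r
# Toy data: SB1 holds 0/97/99 codes, NA, a 95 code, and ordinary ages;
# MA9 holds the age of moving in with the partner.
wm <- data.frame(SB1 = c(0, 14, 95, 97, NA, 18),
                 MA9 = c(16, 15, 17, 18, 19, 20))
wm$age.f.sex <- ifelse(wm$SB1 %in% c(0, 97, 99) | is.na(wm$SB1), NA,
                       ifelse(wm$SB1 == 95, wm$MA9, wm$SB1))
wm$age.f.sex
# [1] NA 14 17 NA NA 18
```

Note how the 95 in row 3 has been replaced by the MA9 value (17), while the 0, 97, and NA rows all become NA.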
As the error message indicates, it is not possible to do subscripted assignments when the filter contains NAs. A way to circumvent this is to explicitly include NA as a factor level. The following example illustrates a possible way to replace 95s by their corresponding value in a second column.
# example dataframe
df <- data.frame(a = c(NA, 3, 95, NA),
b = 1:4)
# set a to factor with NA as one of the levels (besides those in a and b)
df$a <- factor(df$a, levels = union(df$a, df$b), exclude = NULL)
# subscripted assignment (don't forget to filter b too!)
df$a[df$a == 95] <- df$b[df$a == 95]
# restore to numeric
df$a <- as.numeric(as.character(df$a))

R: problems when trying to avoid a loop in implementations like matrix1[i, j]

I'm trying to work with a different kind of implementation than I would generally use: I'm trying to avoid one loop by replacing the "which line i of the object" index with a vector of the dimnames of the lines in the object. Let's say:
#imagine that I have a dataset with repeated measures (2 measures of each subject)
id <- matrix(c("Subj1", "Subj1", "Subj2", "Subj2", "Subj3", "Subj3"), ncol=1)
days = 3
Weight <- matrix(0, nrow=length(id), ncol=Weight+1, dimnames=list(id, NULL))
initial.weight <- c(72,80,45,60,62,75)
Weight[,1] <- initial.weight
for (j in 1:days) #I'm trying to avoid for (i in 1:length(id))
Weight[id,j+1] <- Weight[id,j] + 2
But my 2nd, 3rd, and 6th lines are returning the same zeros as in my initial Weight matrix! This is because my code only works with the first measure of a given subject. Of course, I want it to work with all lines.
Does anyone know what is going on? How can I make it work for both measures of each subject (while still keeping the [id, j] structure instead of [i, j])?
many thanks in advance for your attention!
You didn't actually say what you wanted as the result, so I'm making two guesses: first, that you meant to type ncol=days+1 where you put ncol=Weight+1, and second, that you wanted to serially add 2 across the columns. Just omit the id index and it will work as I think you expected:
for (j in 1:days)
Weight[ ,j+1] <- Weight[ ,j] + 2
Weight
[,1] [,2] [,3] [,4]
Subj1 72 74 76 78
Subj1 80 82 84 86
Subj2 45 47 49 51
Subj2 60 62 64 66
Subj3 62 64 66 68
Subj3 75 77 79 81
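As a footnote on why the original Weight[id, j] indexing misbehaved: character (rowname) indexing returns only the first row matching each name, so with duplicated subject ids the repeated rows are never reached. A minimal sketch:

```r
# Duplicated rownames: indexing by name always resolves to the FIRST match,
# so the second "a" row can never be addressed this way.
m <- matrix(1:6, nrow = 3, dimnames = list(c("a", "a", "b"), NULL))
m[c("a", "a", "b"), 1]
# a a b
# 1 1 3   <- both "a" lookups hit row 1; row 2 is never touched
```

The same happens on assignment, which is why only the first measure of each subject was being updated.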

Generate a set of random unique integers from an interval

I am trying to build some machine learning models, so I need training data and validation data.
Suppose I have N examples; I want to select x random examples from a data frame.
For example, suppose I have 100 examples and I need 10 random numbers: is there a way to (efficiently) generate 10 random integers for me to extract the training data out of my sample data?
I tried using a while loop, and slowly change the repeated numbers, but the running time is not very ideal, so I am looking for a more efficient way to do it.
Can anyone help, please?
sample (or sample.int) does this:
sample.int(100, 10)
# [1] 58 83 54 68 53 4 71 11 75 90
will generate ten random numbers from the range 1–100. By default it samples without replacement (replace = FALSE), so the ten numbers are unique; if duplicates are acceptable, set replace = TRUE, which samples with replacement:
sample.int(20, 10, replace = TRUE)
# [1] 10 2 11 13 9 9 3 13 3 17
More generally, sample samples n observations from a vector of arbitrary values.
If I understand correctly, you are trying to create a hold-out sampling. This is usually done using probabilities. So if you have n.rows samples and want a fraction of training.fraction to be used for training, you may do something like this:
select.training <- runif(n=n.rows) < training.fraction
data.training <- my.data[select.training, ]
data.testing <- my.data[!select.training, ]
If you want to specify EXACT number of training cases, you may do something like:
indices.training <- sample(x=seq(n.rows), size=training.size, replace=FALSE) #replace=FALSE makes sure the indices are unique
data.training <- my.data[indices.training, ]
data.testing <- my.data[-indices.training, ] #note that index negation means "take everything except for those"
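A self-contained version of the exact-count split above, with made-up data so it runs as-is:

```r
# Exact-count hold-out split on a toy 100-row data frame
set.seed(42)
my.data <- data.frame(x = rnorm(100))
n.rows <- nrow(my.data)
training.size <- 10
# replace = FALSE guarantees the indices are unique
indices.training <- sample(seq(n.rows), size = training.size, replace = FALSE)
data.training <- my.data[indices.training, , drop = FALSE]
data.testing  <- my.data[-indices.training, , drop = FALSE]
c(nrow(data.training), nrow(data.testing))
# [1] 10 90
```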
from the raster package:
raster::sampleInt(242, 10, replace = FALSE)
## 95 230 148 183 38 98 137 110 188 39
Base R's sample.int, by contrast, may fail if the limit is too large:
sample.int(1e+12, 10)
