This is my first question, so please bear with me.
I am creating a new variable age.f.sex in my dataframe wm.13 from an existing variable SB1. In the original dataframe, SB1 indicates the age at first sexual intercourse of women interviewed in UNICEF's Multiple Indicator Cluster Surveys. The values that SB1 can take are:
> sort(unique(wm.13$SB1))
[1] 0 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
[26] 30 31 32 33 34 35 36 37 38 39 40 41 42 44 48 95 97 99
Here is the meaning of the values SB1 can take:
0 means she never had sex
97 and 99 mean "does not remember/does not know"
95 means that she had her first sexual intercourse when she started living with her husband/partner (for which there is a specific variable, i.e. MA9)
Any other number between 0 and 95 is the declared age at their first sexual intercourse
There are also NAs, which sort() does not show but which appear if I just use unique().
I created a new variable from SB1, which I called age.f.sex.
wm.13$age.f.sex <- wm.13$SB1
I had the 0, 97 and 99 values replaced with NAs, and I kept the original NAs in SB1. I did this using the following code:
wm.13$age.f.sex[wm.13$SB1 == 0] <- NA
wm.13$age.f.sex[wm.13$SB1 == 97] <- NA
wm.13$age.f.sex[wm.13$SB1 == 99] <- NA
wm.13$age.f.sex[is.na(wm.13$SB1)] <- NA
Everything worked fine up to this point. However, I am having trouble with the 95 value. I want to write code so that the observations that have the value 95 in SB1 (i.e. the age of first sexual intercourse) take the value from MA9 (i.e. the age when the woman started living with her partner/husband) in my new variable age.f.sex.
I first started with this code
> wm.13$age.f.sex[wm.13$SB1 == 95] <- wm.13$MA9
but I got the following error message:
Error in wm.13$age.f.sex[wm.13$SB1 == 95] <- wm.13$MA9 :
NAs are not allowed in subscripted assignments
After some research on this website, I realised that I might need to subset the right-hand side of the assignment too, but honestly I do not know how to do it. I have a feeling that which() or ifelse() might come of use here, but I cannot figure out their arguments. Examples I have found on this website show how to impute one specific value, but I could not find anything on subsetting according to the value the observations take in another variable.
I hope I have been clear enough. Any suggestion will be much appreciated.
Thanks, Manolo
Perhaps you could try:
wm.13$age.f.sex <- ifelse(wm.13$SB1 %in% c(0, 97, 99) | is.na(wm.13$SB1), NA,
                          ifelse(wm.13$SB1 == 95, wm.13$MA9, wm.13$SB1))
In short, it works like this: the code first checks whether wm.13$SB1 is 0, 97, 99 or missing and, if so, returns NA. Otherwise it checks whether wm.13$SB1 is 95, and if so returns the value on that row in the MA9 column. In all other cases it returns the SB1 value. Because of the assignment "wm.13$age.f.sex <-" at the start of the line, the returned values become your new age.f.sex variable.
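Alternatively, the original subscripted assignment can be rescued by filtering both sides with which(), which drops the NAs from the logical index. A minimal sketch on toy data (the values below are made-up stand-ins for the real wm.13):

```r
# toy stand-in for wm.13 (hypothetical values)
wm.13 <- data.frame(SB1 = c(0, 16, 95, NA, 97),
                    MA9 = c(NA, NA, 18, NA, NA))

wm.13$age.f.sex <- wm.13$SB1
wm.13$age.f.sex[wm.13$SB1 %in% c(0, 97, 99)] <- NA

# which() returns only the positions that are TRUE, so NAs in SB1
# cannot end up in the subscript on either side
idx <- which(wm.13$SB1 == 95)
wm.13$age.f.sex[idx] <- wm.13$MA9[idx]
```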
As the error message indicates, it is not possible to do subscripted assignments when the logical filter contains NAs (and the right-hand side has more than one element). A way to circumvent this is to explicitly include NA as a factor level. The following example illustrates a possible way to replace 95s by their corresponding value in a second column.
# example dataframe
df <- data.frame(a = c(NA, 3, 95, NA),
                 b = 1:4)
# set a to factor with NA as one of the levels (besides those in a and b)
df$a <- factor(df$a, levels = union(df$a, df$b), exclude = NULL)
# subscripted assignment (use which() so NAs are dropped from the index,
# and don't forget to filter b too!)
idx <- which(df$a == 95)
df$a[idx] <- df$b[idx]
# restore to numeric
df$a <- as.numeric(as.character(df$a))
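If the factor conversion feels like a detour, note that %in% (unlike ==) returns FALSE rather than NA for missing values, so it can build an NA-free logical index directly on the numeric column. A sketch on the same example data:

```r
df <- data.frame(a = c(NA, 3, 95, NA),
                 b = 1:4)

# df$a %in% 95 is c(FALSE, FALSE, TRUE, FALSE): no NAs in the subscript
df$a[df$a %in% 95] <- df$b[df$a %in% 95]
```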
I am new to R programming and trying to learn part time, so I apologize for naive coding and questions in advance. I have spent about a day trying to figure out code for this and have been unable to do so, hence asking here.
https://www.kaggle.com/c/titanic/data?select=train.csv
I am working on the train Titanic dataset from Kaggle, imported as train_data. I have cleaned up all the columns and also converted them to factors where needed.
My question is two-fold:
1. I am unable to understand why this code gives IV values of 0 for everything. What have I done wrong?
factor_vars <- colnames(train_data)
all_iv <- data.frame(VARS = factor_vars,
                     IV = numeric(length(factor_vars)),
                     STRENGTH = character(length(factor_vars)),
                     stringsAsFactors = F)
for (factor_var in factor_vars) {
  all_iv[all_iv$VARS == factor_var, "IV"] <-
    InformationValue::IV(X = train_data[, factor_var], Y = train_data$Survived)
  all_iv[all_iv$VARS == factor_var, "STRENGTH"] <-
    attr(InformationValue::IV(X = train_data[, factor_var], Y = train_data$Survived), "howgood")
}
all_iv <- all_iv[order(-all_iv$IV), ]
2. I am trying to create my own function to calculate IV values for multiple columns in one go, so that I do not have to do the repetitive task by hand. However, when I run the following code I get the count of total 0s and total 1s instead of items grouped as I requested. Again, what am I doing wrong in this example?
train_data %>% group_by(train_data[[3]]) %>%
summarise(zero = sum(train_data[[2]]==0),
one = sum(train_data[[2]]==1))
I get output
zero one
1 549 342
2 549 342
3 549 342
whereas I would anticipate an answer like:
zero one
1 80 136
2 97 87
3 372 119
what is wrong with my code?
3. Is there any pre-built function which can give IV values for all columns? On searching I found the iv.mult function but I cannot get it to work. Any suggestion would be great.
Let's take a look at your questions:
1.
length(factor_vars)
#> [1] 12
length() returns the number of elements of your vector factor_vars. So your code numeric(length(factor_vars)) evaluates to numeric(12), which returns a numeric vector of length 12, filled by default with zeros.
The same applies to character(length(factor_vars)) which returns a character vector of length 12 filled with empty strings "".
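You can check both defaults directly:

```r
numeric(3)
#> [1] 0 0 0
character(3)
#> [1] "" "" ""
```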
2. Your code doesn't use correct dplyr syntax: inside group_by() and summarise() you index the original data frame (train_data[[3]], train_data[[2]]), which bypasses the grouping entirely. Refer to the columns by name instead:
library(dplyr)
train_data %>%
  group_by(Pclass) %>%
  summarise(zero = sum(Survived == 0),
            one = sum(Survived == 1))
returns
# A tibble: 3 x 3
Pclass zero one
<dbl> <int> <int>
1 1 80 136
2 2 97 87
3 3 372 119
which is most likely what you are looking for.
3. I don't know the meaning of IV, so I can't comment on that part.
I have used this function to remove rows where Age is blank:
data <- data[data$Age != "",]
in this dataset
Initial Age Type
1 S 21 Customer
2 D Enquirer
3 T 35 Customer
4 D 36 Customer
However if I run the above code, I get this:
Initial Age Type
1 S 21 Customer
N/A N/A N/A N/A
3 T 35 Customer
4 D 36 Customer
When all I want is:
Initial Age Type
1 S 21 Customer
3 T 35 Customer
4 D 36 Customer
I just want the dataset without any NAs: I want to remove any rows that are blank, so ideally all NAs and any that are just "".
I have tried the na.omit function but this deletes everything from my dataset.
This is an example dataset; in my actual dataset there are over 1000 columns, and I would like to remove all rows that are NA for a particular column name.
This is my first post, I apologise if this isn't the right way to write up my code, plus I am very new to R.
Also, my row number has been converted to NA when I don't want it there, and it's messing up my calculation.
Thank you for taking time to read and commenting this post.
As pointed out in the comments, it would be good to know what the exact values in the "empty" Age cells are. When I recreate the above data snippet using:
data <- data.frame(Initial = c("S", "D", "T", "D"),
                   Age = c(21, "", 35, 36),
                   Type = c("Customer", "Enquirer", "Customer", "Customer"))
We can see that "Age" is transformed into a column of type "character".
Using the following code we can effectively remove those "empty" Age rows:
data <- subset(data, is.finite(as.numeric(Age)))
This takes the subset of the dataframe "data" where a numeric version of the Age variable is a finite number, thus eliminating the rows with missing Age values. (Note that as.numeric("") produces an NA with a coercion warning, and is.finite(NA) is FALSE, which is why those rows are dropped.)
Hope this solves your problem!
Thank you @M.P.Maurits
This formula worked!
data <- subset(data, is.finite(as.numeric(Age)))
The column was actually an integer, but when changed to numeric it removed all rows that were imported as blank but shown as NAs. I didn't think that integer versus numeric would make a difference.
Thank you to everyone else who also commented, much appreciated :)
A simple solution based on dplyr's function filter:
library(dplyr)
data %>%
filter(!Age == "")
Initial Age Type
1 S 21 Customer
2 T 35 Customer
3 D 36 Customer
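The filter above removes only the empty strings; if the column also contains real NAs (as mentioned in the question), those rows need to be dropped too. A base R sketch on the same example data, extended with a hypothetical NA row:

```r
data <- data.frame(Initial = c("S", "D", "T", "D", "F"),
                   Age = c(21, "", 35, 36, NA),
                   Type = c("Customer", "Enquirer", "Customer", "Customer", "Customer"))

# keep only the rows where Age is neither NA nor the empty string
data[!(is.na(data$Age) | data$Age == ""), ]
```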
I want to apply a missingness mechanism such as MCAR, MAR or NMAR to my data, under a missing rate of 5%–15%. I have two variables and nine instances:
Aj <- c(48,75,83,58,83,32,45,50,86)
As <- c(24,30,31,35,60,76,81,82,88)
as follows:
For simulating MAR, we first randomly separated the variables into pairs
(Aj, As), 1 ≤ j, s ≤ r, where Aj was the variable into
which missing values were introduced, and As was the variable
that affected the missingness of Aj. Given a pair of
variables (Aj, As) and a missing rate α, we first split the instances
into two equal-sized subsets according to their values
at As. If the variable As is numerical, we find the
median of As and then assign the instances to the
two subsets according to whether their values are bigger
than the median of As. For example, we may let the instances
whose values at As are lower than the median 60 (instance
numbers 1–5) be missing with probability 4α, that
is to say, Pr(Aj = missing|As ≤ 60) = 4α.
I wrote this code for missing mechanism in R
ifelse(As<=median(As),Aj==NA,Aj)
[1] NA NA NA NA NA 32 45 50 86
My question is how to add a missing rate, for example 5%, to this code in R, or another code for the above example and illustration.
This one-liner would give you missingness at rate alpha:
ifelse( (As <= median(As)) & (runif(length(As)) < alpha), NA, Aj)
If you want the observed rate to be close to alpha (you need to figure out how you want to handle rounding), you can do something like this:
Aj <- c(48,75,83,58,83,32,45,50,86)
As <- c(24,30,31,35,60,76,81,82,88)
# missingness rate
alpha <- 0.05
# mark the subset at or below the median
b <- ifelse(As <= median(As), NA, Aj)
# get the size of that subset (not known beforehand due to tie handling)
n.b <- sum(is.na(b))
b.small <- Aj[is.na(b)]
# sample from the small subset at a fixed rate, setting sampled to NA
b.small[ sample(1:n.b, size=ceiling(n.b * alpha)) ] <- NA
b[is.na(b)] <- b.small
# b is now Aj with missingness
The output should be similar to
[1] 48 NA 83 58 83 32 45 50 86
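If you will be doing this for many variable pairs, the one-liner can be wrapped in a small helper. This is only a sketch: simulate_mar is a made-up name, and p is the per-instance missingness probability for the affected half (e.g. 4 * alpha if you follow the scheme quoted in the question):

```r
simulate_mar <- function(Aj, As, p) {
  # instances whose As value is at or below the median of As
  # lose their Aj value independently with probability p
  ifelse(As <= median(As) & runif(length(As)) < p, NA, Aj)
}

Aj <- c(48, 75, 83, 58, 83, 32, 45, 50, 86)
As <- c(24, 30, 31, 35, 60, 76, 81, 82, 88)

set.seed(1)  # for reproducibility
simulate_mar(Aj, As, 4 * 0.05)
```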
I'm currently trying to loop through my data, determining whether each value falls above the upper quantile, and then subset the data so that only the values above the upper quantile are selected. To demonstrate what I mean:
vector <- c()
df <- as.integer(read.table(text = " 88 72 92 38 20 16 8 14 8 4 4 8 6 4 6 2 54 272 2 6"))
for (i in 1:length(df)) {
  current.bin <- i
  window.size <- 5
  window <- df[current.bin-window.size : current.bin+window.size]
  upper.quant <- quantile(window, 0.95)
  if (df[current.bin] > upper.quant) {
    vector[i] <- current.bin
  }
}
str(vector)
int [1:18] NA NA 3 4 NA NA NA NA NA NA ...
So as I loop through, I want it to look at the values before and after (a window of 5) and use that to determine the upper quantile, before deciding whether the value it's currently looking at falls above it or not. After that I want to use the current.bin values to subselect data from another dataframe, by using them to specify the rows I want to extract. However, when I look at the vector that is produced, its length is 2 less than the number of values in my df. I can't figure out why this is happening, any ideas?
Also, how would I go about using the positions of the values that are above the upper quantile to subselect data? Using the df as an example, I want it to go something along the lines of:
df <- df["row positions using values in vector", ]
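One thing to check in the loop above is the window indexing: in R, : binds more tightly than - and +, so current.bin-window.size : current.bin+window.size is parsed as current.bin - (window.size:current.bin) + window.size rather than the intended range. (The "2 less" length is a separate effect: vector[i] <- only assigns at qualifying positions, so the vector stops growing at the last qualifying index.) Below is a sketch of the loop with the range parenthesized and clamped to the vector bounds; this is my reading of the intended behavior, not a verified fix from the thread:

```r
df <- as.integer(read.table(text = " 88 72 92 38 20 16 8 14 8 4 4 8 6 4 6 2 54 272 2 6"))
window.size <- 5
positions <- integer(0)  # positions above their window's upper quantile

for (i in seq_along(df)) {
  # parenthesize the range and clamp it so it stays inside the vector
  window <- df[max(1, i - window.size):min(length(df), i + window.size)]
  if (df[i] > quantile(window, 0.95)) {
    positions <- c(positions, i)  # append instead of leaving NA gaps
  }
}

df[positions]  # the values at those positions
```

The positions vector can then be used directly to pick rows from another dataframe, e.g. other.df[positions, ].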
I am looking to remove multiple observations from one column within a dataframe based on their value without affecting the rest of the row.
df1=data.frame(c("male","female","male"),seq(1,30),seq(11,40))
names(df1) = c("col_a","col_b","col_c")
For example removing the values from column b that are below 5 or above 20 without affecting columns a or c. I am then looking to use this data for descriptive analysis and summaries.
Currently I am using this code to do the job:
df1$col_b[df1$col_b<5|df1$col_b>20] <- ""
df1$col_b<-as.numeric(df1$col_b)
However, this creates NA values which get in the way of the analysis. Is there a way of doing this that does not create NA values, or a quick way of removing them without affecting the row?
A numeric column can have normal values, NA, Inf, -Inf and NaN. But "empty" is not a possible value.
The reason for having NA is to mark that the value isn't available - which seems to be exactly what you want! Using a negative number instead is just a more awkward way of doing the same thing - you'd have to remove all negative numbers before calculating mean, sum, etc. With NA that functionality is typically built into the functions: just specify na.rm=TRUE.
df1 <- data.frame(col_a=c("male","female","male"),col_b=seq(1,30),col_c=seq(11,40))
df1$col_b[df1$col_b<5|df1$col_b>20] <- NA
sum(df1$col_b, na.rm=TRUE) # 200
median(df1$col_b, na.rm=TRUE) # 12.5
Maybe what you really need is mean(..., na.rm = TRUE). See ?mean, and let the existence of NA help you.
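A quick illustration of the na.rm behaviour:

```r
x <- c(1, NA, 3)
mean(x)               # NA
mean(x, na.rm = TRUE) # 2
```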
Use subset:
> df2 <- subset(df1, !(col_b < 5 | col_b > 20))
> df2$col_b <- as.numeric(df2$col_b)
> df2
col_a col_b col_c
5 female 5 15
6 male 6 16
7 male 7 17
8 female 8 18
9 male 9 19
10 male 10 20
11 female 11 21
12 male 12 22
13 male 13 23
14 female 14 24
15 male 15 25
16 male 16 26
17 female 17 27
18 male 18 28
19 male 19 29
20 female 20 30
I'm taking it your ultimate intent is "How to ignore outliers in a column for subsequent analysis?"
You didn't say where the magic 5,20 range came from, nor what sort of analysis (mean/median/stdev, or something more complicated?).
You said: "aiming to use the column within the original dataframe for the analysis without subsetting as the purpose of this process is to remove outliers both visually and for calculations of averages."
If the magic 5,20 values came from a quantile (e.g. 5th-95th quantile, "middle 90th quantile"), you can compute arbitrary quantile values automatically with quantile(df1$col_b, c(0.05,0.95)). If you e.g. also want to see the median, pass the vector quantile(..., c(0.05,0.5,0.95))
Whereas if 5,20 is a known range, use the approach the others have shown you with logical indexing or subsetting to assign the outliers to NA. NA is your friend for analysis; it propagates into all calculations just like you'd want. NA is also your friend for plotting. Learn to love NA. Keep a copy of the original df (or just the original df$col_b) if you need to access the outlier values later.
If you want to experiment with distributions to see which one your data follows, see Ch 8 "Probability distributions" of http://cran.r-project.org/doc/manuals/R-intro.pdf
Here it all is in code:
#inrange <- function(x, a, b) { x >= a & x <= b }
inrange_else_NA <- function(x, minmax) { ifelse(x >= minmax[1] & x <= minmax[2], x, NA) }
# If you want to save the original col_b and modify it in-place...
#df1$col_b.orig <- df1$col_b
# To exclude outliers outside a known range...
df1$col_b_NAs <- inrange_else_NA(df1$col_b, c(5, 20))
# ... or else to exclude outliers outside (say) the middle 90th quantile
middle_90th_quantile <- as.vector(quantile(df1$col_b, c(0.05, 0.95)))
df1$col_b_NAs <- inrange_else_NA(df1$col_b, middle_90th_quantile)