How to preprocess data to handle missing values in R

I am trying to pre-process my data in R so that missing values are filled in using the "attribute mean for all samples belonging to the same class as the given tuple".
The missing values, or the values falling out of range, have already been given the value -1 by the data source provider. I want to replace those missing values according to the data mining principle quoted above. The column that decides the class is "Accident severity", and I want to fill each missing attribute value with the attribute mean over all samples that have the same level of accident severity as the tuple with the missing value.
As there are multiple columns with missing values, I guess I will have to repeat the task for every column, one at a time. Which R command should I use?
There are mostly two data types in my data frame: the Date and Time columns are factors, whereas most of the other columns are integers.
Is there a way that I can upload a subset of the data set here on stack overflow?
Here is the link to the reproducible data set: https://drive.google.com/file/d/0B3cafW7J7xSfSkRTYWRWMHhaU2c/edit?usp=sharing
Update 2: Now that the data set is there, please help me replace every "-1" in any of the columns with the mean of that column over all tuples that have the same value of the attribute "Accident_severity" as the tuple with the missing value.
Update 3: Please ignore the columns "X2_roadclass" and "X2_Road_type", as they are mostly blank and I am dropping them. Thanks.

Please see if this is close to what you need:
library(ggplot2)
library(reshape)
library(plyr)
Create some data
set.seed(1)
df <- data.frame(severity = rep(c('high', 'moderate', 'low'), each = 3),
                 factor1 = rep(c(1, 2, 3), each = 6),
                 factor2 = rep(c(4, 5, 6), times = 3),
                 date = rep(c('2011-01-01', '2011-01-03', '2011-01-10'), times = 3),
                 stringsAsFactors = FALSE)
With some -1
df$factor2[3] <- -1
df$factor1[1] <- -1
Replace them with NA
df[df == -1] <- NA
Reshape it
mdf <- melt(df, id.vars= c("severity", 'date'))
Summarize
ddply(mdf, .(severity, variable), summarise, mean=mean(value, na.rm = T))
  severity variable mean
1     high  factor1  1.6
2     high  factor2  4.8
3      low  factor1  2.5
4      low  factor2  5.0
5 moderate  factor1  2.0
6 moderate  factor2  5.0
With the data provided, I'd do something like this:
dt <- read.csv('./Stackoverflow/datatry1.csv')
#head(dt[ , -c(1:3) ]) # Exclude some unwanted columns
mdt <- melt(dt[ , -c(1:3) ], id.vars = c("Accident_Severity", 'Date',
                                         'Day_of_Week', 'Time'))
dts <- ddply(mdt, .(Accident_Severity, variable), summarise,
             mean = mean(value, na.rm = T))
dts
Accident_Severity variable mean
1 1 Number_of_Vehicles 1.00000000
2 1 X1st_Road_Class 3.00000000
3 1 X1st_Road_Number 503.00000000
4 1 Road_Type 6.00000000
5 1 Speed_limit 30.00000000
6 1 Junction_Detail 3.00000000
7 1 X2nd_Road_Class -1.00000000
...
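The summaries above give the per-class means but do not write them back into the data. A minimal base R sketch of that last step follows, assuming -1 is the only missing-value marker. The toy data frame and the num_cols vector are made up for illustration and are not taken from the linked file.

```r
# Toy data: column names mimic the question, values are invented
df <- data.frame(
  Accident_Severity  = c(1, 1, 1, 2, 2, 2),
  Speed_limit        = c(30, -1, 50, 60, 70, -1),
  Number_of_Vehicles = c(2, 4, -1, 1, -1, 3)
)

# Columns to impute (hypothetical selection; adjust to the real data)
num_cols <- c("Speed_limit", "Number_of_Vehicles")

# Step 1: treat -1 as missing
df[num_cols] <- lapply(df[num_cols], function(x) replace(x, x %in% -1, NA))

# Step 2: replace each NA with the mean of its severity class
df[num_cols] <- lapply(df[num_cols], function(x) {
  class_mean <- ave(x, df$Accident_Severity,
                    FUN = function(v) mean(v, na.rm = TRUE))
  ifelse(is.na(x), class_mean, x)
})

df
```

ave() returns a vector as long as its input, with every position holding the mean of its own group, so the replacement lines up row by row. If a class has no observed values in some column, the mean is NaN and those entries stay missing.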

Related

How can I apply the decile cuts from one dataframe to another using R

I have a dataframe (df1) and have calculated the deciles for each row using the following:
#create a function to calculate the deciles
decilefun <- function(x) as.integer(cut(x, unique(quantile(x, probs=0:10/10)), include.lowest=TRUE))
# convert df1 to matrix
mat1 <- as.matrix(df1)
#apply the function I created above to calculate deciles
df1_deciles <- apply(mat1, 1, decilefun)
#add the rownames back in
rownames(df1_deciles) <- row.names(df1)
#convert to dataframe
df1_deciles <- as.data.frame(df1_deciles)
str(df1_deciles) # to show what the data looks like
#'data.frame': 157 obs. of 3321 variables:
# $ Variable1 : int 10 10 4 4 5 8 8 8 6 3 ...
# $ Variable2 : int 8 3 9 7 2 8 9 5 8 2 ...
# $ Variable3 : int 8 4 7 7 2 9 10 3 8 3 ...
I have another dataframe (df2) with the same rownames (Variable1, Variable2, etc.) but a different number of columns.
I would like to use the same decile cuts that were used for df1 on this second dataframe, but I'm not sure how to do it. I am actually not even sure how to determine or export what the cuts were on the original data that resulted in the df1_deciles dataframe I created. What I mean by this is: how do I export an object which tells me what range of values for Variable1 in df1 was assigned to decile value 1, decile value 2, and so on?
I do not want to use the 'decilefun' function I created on df2, but instead want to use the variability and range information from df1.
This is my first question on the platform so I hope it is clear and I hope I have provided enough information. I have tried to find answers on the platform but have not found one. I appreciate any help on this.
Using data.table:
##
# create an artificial dataset with the structure you describe
#
set.seed(1)
df1 <- data.frame(Variable.1=rnorm(1000), variable.2=runif(1000), variable.3=rgamma(1000, scale=10, shape=5))
df1 <- t(df1)
##
#
df2 <- data.frame(Variable.1=rnorm(1000, -1), variable.2=runif(1000), variable.3=rgamma(1000, scale=20, shape=5))
df2 <- t(df2)
##
# you start here
# assumes df1 and df2 have structure described in problem
# data in rows, not columns
#
library(data.table)
df1 <- as.data.table(t(df1)) # transpose: put data in columns
brks <- lapply(df1, quantile, probs=(0:10)/10) # named decile break points for each variable (one per original row of df1)
df2 <- as.data.table(df2, keep.rownames = TRUE) # keep df2 data in rows: 1000 columns here
result <- df2[ # this does all the work
, .(value= unlist(.SD),
decile=cut(unlist(.SD), breaks=c(-Inf, brks[[rn]], +Inf), labels=c('below', names(brks[[rn]])[2:11], 'above'))
)
, by=.(rn)]
result[, .N, keyby=.(rn, decile)] # validate that result is reasonable
Applying deciles from one dataset to another has the nuance that some values in the new dataset might be outside the range of the original data. The test data here demonstrates this problem: Variable.1 in df2 has values lower than any in df1, and variable.3 in df2 has values larger than any in df1.
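For a single variable, the same idea can be sketched in base R: compute the break points once from the reference data, keep them as an object, and reuse them with cut(). This is only an illustration. The vectors x_ref and x_new are invented stand-ins for one row of df1 and df2, and unlike the 'below'/'above' labels used above, this variant folds out-of-range values into deciles 1 and 10.

```r
set.seed(1)
x_ref <- rnorm(1000)             # stands in for one row of df1
x_new <- rnorm(1000, mean = -1)  # stands in for the same row of df2

# Decile boundaries learned from the reference data only;
# this named vector is the object that can be inspected or saved
brks <- quantile(x_ref, probs = 0:10 / 10)

# Open the outer bins so values outside the reference range
# land in decile 1 or decile 10 instead of becoming NA
brks_open <- brks
brks_open[1]  <- -Inf
brks_open[11] <- Inf

# labels = FALSE makes cut() return integer bin codes 1..10
decile_new <- cut(x_new, breaks = brks_open, labels = FALSE,
                  include.lowest = TRUE)

table(decile_new)
```

Printing brks shows which value range maps to which decile in the reference data. Note that cut() needs unique break points, so heavily tied data would need the unique() step from the question.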

Creating new column with condition

I have this data set:
ID Type Frequency
1 A 0.136546185
2 A 0.228915663
3 B 0.006024096
4 C 0.008032129
I want to create a new column that changes the Frequency values less than 0.01 into "other" and keeps everything else as it is, like this:
ID Type Frequency New_Frequency
1 A 0.136546185 0.136546185
2 A 0.228915663 0.228915663
3 B 0.006024096 other
4 C 0.008032129 other
I used mutate, but I don't know how to keep the original frequency when it is 0.01 or larger.
Can you please help me?
You can't get exactly what you want in R, because a vector (and therefore a data frame column) cannot mix character and numeric values. If you are willing to convert everything to character, the other answers will work. If you want to keep the values numeric, you need to use NA rather than "other". You can also try the labelled package, which allows something like SPSS labels or SAS formats on numeric data.
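A small sketch of the numeric-with-NA option, using the data from the question. The 0.01 cutoff and the extra Is_Other flag column are assumptions made for illustration.

```r
d <- data.frame(
  ID        = 1:4,
  Type      = c("A", "A", "B", "C"),
  Frequency = c(0.136546185, 0.228915663, 0.006024096, 0.008032129)
)

# Keep the column numeric: small values become NA instead of "other"
d$New_Frequency <- ifelse(d$Frequency < 0.01, NA, d$Frequency)

# Hypothetical helper column recording which rows were collapsed
d$Is_Other <- d$Frequency < 0.01

str(d)
```

New_Frequency stays numeric, so sums and means still work with na.rm = TRUE, while Is_Other preserves the information that the text label would have carried.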
Using mutate():
library(dplyr)
d <- tibble(ID = 1:4,
Type = c("A", "A", "B", "C"),
Frequency = c(0.136546185, 0.228915663, 0.006024096, 0.008032129))
d %>%
mutate(New_Frequency = case_when(Frequency < .01 ~ "other",
TRUE ~ as.character(Frequency)))
You can use ifelse
transform(df, Frequency = ifelse(Frequency < 0.01, 'Other', Frequency))
# ID Type Frequency
#1 1 A 0.136546185
#2 2 A 0.228915663
#3 3 B Other
#4 4 C Other
Note that the Frequency column is now character, since a column can hold data of only one type.

Formatting data output to get numeric variables to always show 2 decimal places

I am running the following code to summarise a variable in R:
setDT(RWA_Cleansed_Data)[, .(value=sum(ACCOUNT_BALANCE), freq = .N) , by = PERIOD ]
The output is
PERIOD value freq
1: 201907 167050951793 48840
How can I get the value to print with 2 decimal places?
I am trying to get R to always show numeric variables with 2 decimal places in the data frame.
Try using formatC
library(data.table)
setDT(RWA_Cleansed_Data)[, .(value=formatC(sum(ACCOUNT_BALANCE),
digits = 2, format = "f"), freq = .N) , by = PERIOD]
Consider a reproducible example with mtcars:
df <- mtcars
setDT(df)[, .(value= formatC(sum(am),digits = 2,format = "f"), freq = .N),by = cyl]
# cyl value freq
#1: 6 3.00 7
#2: 4 8.00 11
#3: 8 2.00 14
This will print value with 2 decimal places. Note, however, that the value column is now of class "character" and needs to be converted back to numeric before any further processing.
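One way around the character-column caveat is to keep the summary numeric and format only a copy that is used for display. A minimal sketch, with the summary row typed in by hand from the output in the question (res and shown are illustrative names):

```r
# Numeric summary, as it would come out of the aggregation
res <- data.frame(PERIOD = 201907, value = 167050951793, freq = 48840)

# Display copy: only this one carries the formatted text
shown <- transform(res, value = sprintf("%.2f", value))

print(shown)            # value is shown with 2 decimal places
sum(res$value)          # the original stays numeric for further work
```

sprintf("%.2f", x) and formatC(x, digits = 2, format = "f") give the same fixed two-decimal text here; the point of the sketch is only the split between the stored and the displayed object.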

R: merging copies of the same variable

I have data like this in R:
subjID = c(1,2,3,4)
var1 = c(3,8,NA,6)
var1.copy = c(NA,NA,5,NA)
fake = data.frame(subjID = subjID, var1 = var1, var1 = var1.copy)
which looks like this:
> fake
subjID var1 var1.1
1 1 3 NA
2 2 8 NA
3 3 NA 5
4 4 6 NA
var1 and var1.1 represent the same variable, so each subject has NA in one column and a numerical value in the other (no one has two NAs or two numbers). I want to merge the columns to get a single var1: (3, 8, 5, 6).
Any tips on how to do this?
If you're only dealing with two columns, and there are never two numbers or two NAs, you can calculate the row mean and ignore missing values:
fake$fixed <- rowMeans(fake[, c("var1", "var1.1")], na.rm=TRUE)
You can use is.na, which can be vectorised as:
# get all the ones we can from var1
var.merged = var1;
# which ones are available in var1.copy but not in var1?
ind = is.na(var1) & !is.na(var1.copy);
# use those to fill in the blanks
var.merged[ind] = var1.copy[ind];
It depends on how you want to merge if there are conflicts.
You could simply put all non-NA values in var1.1 into the corresponding slot of var1. In case of conflicts, this will favour var1.1.
var1[!is.na(var1.copy)] <- var1.copy[!is.na(var1.copy)]
You could just fill in all NA values in var1 with corresponding values of var1.1. In case of conflict, this will favour var1.
var1[is.na(var1)] <- var1.copy[is.na(var1)]
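Since the right choice depends on whether conflicts exist, it can help to check for them before merging. A short sketch working directly on the data frame from the question; both_present, both_missing and the merged column are illustrative names.

```r
fake <- data.frame(subjID = 1:4,
                   var1   = c(3, 8, NA, 6),
                   var1.1 = c(NA, NA, 5, NA))

# Rows where both copies hold a value (conflicts) or neither does (gaps)
both_present <- !is.na(fake$var1) & !is.na(fake$var1.1)
both_missing <-  is.na(fake$var1) &  is.na(fake$var1.1)

if (any(both_present)) {
  warning("conflicting values in rows: ",
          paste(which(both_present), collapse = ", "))
}

# Take var1 where available, otherwise fall back to var1.1
fake$merged <- ifelse(is.na(fake$var1), fake$var1.1, fake$var1)
fake
```

With the example data there are no conflicts and no gaps, so merged comes out as 3, 8, 5, 6. On conflicting rows this version favours var1.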

Creating new variable from three existing variables in R

I have a dataset that looks like the one below, and I would like to create a new variable based on these variables, which can be used with the other variables in the dataset.
The first variable, ID, is a respondent identification number. The med variable takes the values 1 and 2, indicating different treatments. Var1_v1 and Var1_v2 have four real options (1, 2, 3, or 9), and these options are only given to those with med == 1. If med == 2, the Var1 variables are NA. Var2 is NA when med == 1 and has real values ranging from 1 to 3 when med == 2.
ID <- c(1,2,3,4,5,6,7,8,9,10,11)
med <- c(1,1,1,1,1,1,2,2,2,2,2)
Var1_v1 <- c(2,2,3,9,9,9,NA,NA,NA,NA,NA) #ranges from 1-3, and 9
Var1_v2 <- c(9,9,9,1,3,2,NA,NA,NA,NA,NA) #ranges from 1-3, and 9
Var2 <- c(NA,NA,NA,NA,NA,NA,3,3,1,3,2)
#tables to show you what data looks like relative to med var
table(Var1_v1, med)
table(Var1_v2, med)
table(Var2, med)
I've been looking around for a while to figure out a recoding/new variable creation code, but I have had no luck.
Ultimately, I would like to create a new variable, say Var3, based on three conditions:
Uses the values from Var1_v1 if the value = 1, 2, or 3
Uses the values from Var1_v2 if the value = 1, 2, or 3
uses the values from Var2 if the values = 1, 2, or 3
And this variable should be able to match up with the ID number, so that it can be used within the dataset.
So the final variable should look like:
Var3 <- c(2,2,3,1,3,2,3,3,1,3,2)
Thanks!
Something like
v <- Var1_v1
v[Var1_v2 %in% 1:3] <- Var1_v2[Var1_v2 %in% 1:3]
v[Var2 %in% 1:3] <- Var2[Var2 %in% 1:3]
v
[1] 2 2 3 1 3 2 3 3 1 3 2
which uses one of them as a base (you could also use a pure NA vector) and simply fills in only parts that match.
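The pure NA base mentioned above could look like the sketch below. Starting from NA means a stray 9 can never survive into the result, which it would if both Var1 columns held 9 and Var1_v1 were used as the base. The loop is just one way to write it.

```r
Var1_v1 <- c(2, 2, 3, 9, 9, 9, NA, NA, NA, NA, NA)
Var1_v2 <- c(9, 9, 9, 1, 3, 2, NA, NA, NA, NA, NA)
Var2    <- c(NA, NA, NA, NA, NA, NA, 3, 3, 1, 3, 2)

# Start with an all-NA vector of the right length
Var3 <- rep(NA_real_, length(Var1_v1))

# Copy over only the valid codes (1, 2, 3) from each source in turn;
# %in% treats NA as "no match", so missing values are skipped safely
for (src in list(Var1_v1, Var1_v2, Var2)) {
  ok <- src %in% 1:3
  Var3[ok] <- src[ok]
}

Var3
```

Because Var3 has the same length and order as ID, it can be added to the dataset as a new column directly.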
