Replace NAs with grouped means in R

I have a data frame with roughly 7,000 observations and 196 variables, with NAs sprinkled throughout. I wrote a function that computes grouped means for each numeric variable in the data frame (187 numeric variables, 11 groups). I am now trying to replace each NA with the appropriate grouped mean for its variable whenever the observation belongs to a group.
Basically, I'm looking to find the NAs in the frame and replace them with the appropriate group mean for that variable.
For example, if df[6501, 174] belongs to group 7 and is NA, replace it with group 7's mean for variable 174.
This is the smallest of the data frames I'm working with, and I'm concerned about efficiency.
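As a toy illustration of the desired behaviour (hypothetical two-group data, using base R's ave() with a replace() helper):

# Hypothetical toy data: one grouping column, one numeric column with NAs
toy <- data.frame(grp = c("A", "A", "B", "B", "B"),
                  x   = c(1, NA, 4, 6, NA))

# Replace each NA in x with the mean of x within its own group
toy$x <- ave(toy$x, toy$grp,
             FUN = function(v) replace(v, is.na(v), mean(v, na.rm = TRUE)))

toy
#   grp x
# 1   A 1
# 2   A 1
# 3   B 4
# 4   B 6
# 5   B 5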
The historical time series data is as follows:
str(HD_filtered)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 7032 obs. of 196 variables:
$ Date: Factor w/ 87 levels "12/31/1993","03/31/1994",..: 1 2 2 2 2 2 2 2 2 2 ...
$ V2: Factor w/ 1065 levels "","000361105",..: 246 183 312 31 80 87 132 124 121 211 ...
$ V3: Factor w/ 744 levels "A S V","A V",..: 326 231 22 41 106 113 170 160 157 272 ...
$ V4: Factor w/ 7 levels "BHS","BMU","CAN",..: 7 7 7 7 7 7 7 7 7 7 ...
$ V5: Factor w/ 68 levels "I2",..: 48 16 17 28 11 10 38 28 11 13 ...
$ V6: Factor w/ 1 level "C": 1 1 1 1 1 1 1 1 1 1 ...
$ V7: Factor w/ 11 levels "S1",..: 7 4 9 1 6 8 8 1 6 6 ...
$ V8: Factor w/ 146 levels "SI1",..: 8 77 57 51 16 91 93 49 31 22 ...
$ V9: Factor w/ 1259 levels "","3HCKT","3RVTL",..: 261 23 294 26 82 95 111 1
$ V10: num 0.429 7.4 5 7.75 12 ...
$ V11: num 0.839 2.117 0.97 1.237 1.934 ...
$ V12: num NA -0.176 0.262 0.012 0.146 ...
$ V13: num NA NA NA NA NA NA NA NA NA NA ...
$ V196: num NA 0.045 0.62 0.034 NA NA NA 0.012 0.03 NA ...
I created a function using dplyr to calculate summary statistics (including means) for V10:V196 by group (Date, V4, V5, V7, V8).
Summary_Stats_Function <- function(hd, cmn) {
  hd %>%
    group_by_(.dots = cmn) %>%
    summarise_each(funs(min, max, median,
                        mean(., trim = 0.01, na.rm = TRUE),
                        sd(., na.rm = TRUE)),
                   V10:V196)
}
Universal_Summary_Stats_byV4 <- Summary_Stats_Function(HD_filtered, "V4")
This gives summary stats:
str(Universal_Summary_Stats_byV4)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 11 obs. of 936 variables:
$ V4: Factor w/ 11 levels "S1",..: 1 2 3 4 5 6 7 8 9 10 ...
$ V10_min: num 0 0 0 0 0 0 0 0 0 0.5 ...
$ V11_min: num -1.0216 -1.8599 0.0501 -0.5723 NA ...
$ V196_min : num -0.984 -0.815 -0.848 -0.981 -0.549 ...
$ V393_mean: num 4.087 2.716 5.116 2.813 0.589 ...
$ V588_mean: num NA NA NA NA NA ...
$ V936_sd  : num 107 103 120 103 129 ...
replace_with <- select(Universal_Summary_Stats_byV4, contains("_mean"))
I'm trying to figure out how to take the mean results held in replace_with and put them back into HD_filtered so that the NAs are replaced with the appropriate group mean.
I have tried 'for' loops and 'apply' without success; I'm probably getting hung up on the logical indexing syntax.

Maybe not an elegant solution, but here is a base R approach: compute data frames of grouped means, merge() them with the original data frame, and fill the NAs inside nested for loops.
First, since you only want means, run summarise_each() with a single named mean function so the output columns come out as V10_mean through V196_mean.
Summary_Stats_Function <- function(hd, cmn) {
  hd %>%
    group_by_(.dots = cmn) %>%
    summarise_each(funs(mean = mean(., trim = 0.01, na.rm = TRUE)), V10:V196)
}
Then run nested for loops, calling the function above for each grouping variable and merging its output back onto the data in the outer loop:
# ITERATE THROUGH EACH GROUPING VARIABLE (ASSUMING MUTUALLY EXCLUSIVE GROUPS)
for (grp in c("V4", "V5", "V7", "V8")) {
  replace_with <- Summary_Stats_Function(HD_filtered, grp)
  mergedf <- merge(HD_filtered, replace_with, by = grp)

  # ITERATE THROUGH EACH NUMERIC COLUMN, FILLING NAs WITH THE GROUP MEAN
  for (i in 10:196) {
    colname <- paste0("V", i)
    na_rows <- is.na(mergedf[[colname]])
    mergedf[[colname]][na_rows] <- mergedf[[paste0(colname, "_mean")]][na_rows]
  }

  # KEEP ONLY THE ORIGINAL COLUMNS AND CARRY THE FILLED VALUES INTO THE NEXT PASS
  HD_filtered <- mergedf[, names(HD_filtered)]
}
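If a single grouping variable is enough, a tidyverse alternative that skips the merge step entirely is sketched below (this assumes dplyr >= 1.0 for across(); the trimmed-mean call mirrors the question):

library(dplyr)

# Replace NAs in a vector with its (trimmed) mean; inside group_by() this
# becomes the group mean for each column
impute_group_mean <- function(x) {
  replace(x, is.na(x), mean(x, trim = 0.01, na.rm = TRUE))
}

HD_imputed <- HD_filtered %>%
  group_by(V4) %>%                               # or V5, V7, V8 in turn
  mutate(across(V10:V196, impute_group_mean)) %>%
  ungroup()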

Related

R Dataframe issue preventing normality test

I've read in my .csv and then converted it to a data frame using several methods, including:
df <- read.csv('cdSH2015Fall.csv', dec = ".", na.strings = c("na"), header = TRUE,
               row.names = NULL, stringsAsFactors = F)
df <- as.data.frame(lapply(df, unlist))  # convert to a data frame
str(df)  # structure of df
'data.frame': 72 obs. of 16 variables:
$ trtGroup : Factor w/ 68 levels "AANN","AAPN",..: 5 7 14 18 20 23 27 33 37 48 ...
$ cd : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ PreviousExp : Factor w/ 2 levels "Empty","Enriched": 2 1 2 2 2 2 1 1 1 1 ...
$ treatment : Factor w/ 2 levels "NN","PN": 1 1 1 1 1 1 1 1 1 1 ...
$ total.Area.DarkBlue.: num 827 1037 663 389 983 ...
$ numberOfGroups : int 1 1 1 1 1 1 1 1 1 1 ...
$ totalGroupArea : num 15.72 2.26 9.45 11.57 9.73 ...
$ averageGrpArea : num 15.72 2.26 9.45 11.57 9.73 ...
$ proximityToPlants : num 5.65 16.05 2.58 9.65 4.74 ...
$ latFeed : num 2 0.5 0 1 0 0 1 0.5 2 1 ...
$ latBalloon : num 6 2 2 NA 0 0.1 3 0.5 1 0.7 ...
$ countChases : int 5 8 16 4 16 21 18 11 14 28 ...
$ chases : int 95 87 67 923 636 96 1210 571 775 816 ...
$ grpDiameter : num 16.8 23.3 19.5 11.2 29.9 ...
$ grpActiv : num 4908 5164 4197 5263 5377 ...
$ NND : num 0 11.88 8.98 3.6 9.8 ...
I then run my model two ways:
First option.
fit = t.test(df$proximityToPlants[which(df$cd == 1 & df$treatment == 'PN')],
             df$proximityToPlants[which(df$cd == 0 & df$treatment == 'PN')])
Second option, trying to ensure I have a proper data frame:
Subset the data and then create a matrix.
cdProximityToPlantsPN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==1 & cdSH2015Fall$treatment == 'PN')]
H2ProximityToPlantsPN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==0 & cdSH2015Fall$treatment == 'PN')]
cdProximityToPlantsNN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==1 & cdSH2015Fall$treatment == 'NN')]
H2ProximityToPlantsNN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==0 & cdSH2015Fall$treatment == 'NN')]
Creating a matrix
df <- cbind(cdProximityToPlantsPN, H2ProximityToPlantsPN,
            cdProximityToPlantsNN, H2ProximityToPlantsNN)
mat <- sapply(df, unlist)
fit <- t.test(mat[,1], mat[,2], paired = F, var.equal = T)
Yet, I still get errors when assessing outliers using the following:
outlierTest(fit) # Bonferonni p-value for most extreme obs
Error in UseMethod("outlierTest") :
no applicable method for 'outlierTest' applied to an object of class
"htest"
qqPlot(fit, main="QQ Plot") #qq plot for studentized resid 
Error in order(x[good]) : unimplemented type 'list' in 'orderVector1'
leveragePlots(fit) # leverage plots
Error in formula.default(model) : invalid formula
I know the issue must be with my data structure. Any ideas on how to fix it?
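For what it's worth, the errors themselves point at the cause: outlierTest(), qqPlot() and leveragePlots() from the car package expect a fitted model object (e.g. from lm()), not the htest object that t.test() returns. A hedged sketch of the model-based route (variable names taken from the question):

library(car)

# Fit a linear model so the diagnostics have residuals and leverage to work with
pn <- subset(df, treatment == 'PN')
fit_lm <- lm(proximityToPlants ~ cd, data = pn)

outlierTest(fit_lm)              # Bonferroni p-value for the most extreme observation
qqPlot(fit_lm, main = "QQ Plot") # QQ plot of studentized residuals
leveragePlots(fit_lm)            # leverage plots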

How do I resolve this error using lapply and my own function?

I have a list of 18 data frames that I read in using read.xlsx. Each data frame has the same number of columns but some columns contain NA for some rows.
Also, in the Abundance column there are rows that contain non-numeric data; I suspect I need to remove these rows from each data frame, but I have not found a way to do so.
My data frame structure is like this:
$ :'data.frame': 118 obs. of 10 variables:
..$ Locus : Factor w/ 24 levels "A","CS",..: 14 14 14 14 22 22 NA 22 10 10 ...
..$ Target : Factor w/ 96 levels "[AAAGA]14","[AAAGA]15",..: 88 91 90 87 11 12 NA 9 65 67 ...
..$ Length : num [1:118] 60 76 72 56 24 39 NA 20 139 141 ...
..$ Abundance : num [1:118] 1479 1108 180 144 1786 ...
..$ Size : num [1:118] 15 19 18 14 6 9.3 NA 5 32 32.2 ...
..$ Call : Factor w/ 4 levels "Al","HAs",..: 1 1 3 3 1 1 NA 3 1 1 ...
..$ RAR : num [1:118] NA 74.92 12.17 9.74 NA ...
..$ Position : num [1:118] NA NA NA NA NA NA NA NA NA NA ...
..$ Al.1.s.percent: num [1:118] NA NA 12.17 9.74 NA ...
..$ Al.2.s.percent: num [1:118] NA NA 16.2 13 NA ...
I want to apply this function to each data frame in my list of data frames.
add.sum = function(df){
transform(df, Tot.count = ave(df[[Abundunce]], df[[Locus]], FUN = sum))
}
I tried using this line with lapply
transformed.data = lapply(mydata, add.sum)
I also tried it this way
transformed.data = lapply(mydata, function (x) add.sum(x))
But both give me the following error:
Error in .subset2(x, i, exact = exact) : no such index at level 1
Any suggestions on how to get this working correctly?
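The error most likely comes from df[[Abundunce]] and df[[Locus]]: the column names are unquoted (so R tries to use the contents of the Locus column, and a non-existent Abundunce object, as indexes) and 'Abundunce' is also misspelled. A hedged sketch of a corrected function, which also guards against the non-numeric Abundance entries mentioned above:

# Column names must be quoted strings inside [[ ]]; note the spelling "Abundance"
add.sum <- function(df) {
  # If Abundance was read as character/factor because of stray text,
  # coerce it first; non-numeric entries become NA (with a warning)
  df$Abundance <- as.numeric(as.character(df$Abundance))
  transform(df, Tot.count = ave(df[["Abundance"]], df[["Locus"]],
                                FUN = function(v) sum(v, na.rm = TRUE)))
}

transformed.data <- lapply(mydata, add.sum)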

over with points and polygons in R: get the name of the polygons

In my data I have a list of signals with lat / long.
I have a shape file that I imported with readOGR() and I called it polygons.
With the code
data$inside.polygons <- !is.na(over(data, as(polygons, "SpatialPolygons")))
I have a new variable in my data called inside.polygons. It is a logical variable indicating whether the signal is inside a polygon (TRUE) or not (FALSE).
Is it possible to add a new column with the name of the polygon?
I create a new table with
polygons.table <- data.frame(polygons)
and $Polygon.name holds the name of each polygon:
> str(polygons.table)
'data.frame': 233 obs. of 6 variables:
$ Country : Factor w/ 9 levels "Denmark","Estonia",..: 9 9 9 9 9 9 9 4 9 9 ...
$ Polygon.name: Factor w/ 237 levels "Aalborg","Aalborg Portland",..: 114 115 69 192 193 8 237 231 230 224 ...
$ Shape_Leng: num 0.0339 0.0209 0.0399 0.1628 0.1343 ...
$ Shape_Area: num 5.64e-05 2.26e-05 4.22e-05 5.25e-04 5.30e-04 ...
$ LOCodes : Factor w/ 193 levels "DEBOF","DEFLF",..: NA NA 155 184 184 137 193 15 191 192 ...
$ Links : Factor w/ 26 levels "http://eng.port-bronka.ru/",..: NA NA NA NA NA NA NA NA NA NA ...
How could I add the Polygon.name to each signal that is inside a polygon?
Thanks!
Got it with:
inside.polygon2 <- over(data, polygons[,"Polygon.name"])
and then added it to my data:
data$polygon.name <- inside.polygon2$Polygon.name
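For reference, over() returns its result in the same row order as the points, so the polygon name and the inside flag can both come from a single call (a small sketch restating the same approach):

library(sp)  # over() comes from sp

# One over() call gives a data frame aligned row-for-row with 'data'
hits <- over(data, polygons[, "Polygon.name"])

data$polygon.name    <- hits$Polygon.name
data$inside.polygons <- !is.na(hits$Polygon.name)  # TRUE when the point falls inside a polygon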

Error in ncol(xj) : object 'xj' not found when using R matplot()

Using matplot, I'm trying to plot the 2nd, 3rd and 4th columns of the airquality data frame after dividing those three columns by the first column of airquality.
However I'm getting an error
Error in ncol(xj) : object 'xj' not found
Why am I getting this error? The code below reproduces the problem.
attach(airquality)
airquality[2:4] <- apply(airquality[2:4], 2, function(x) x /airquality[1])
matplot(x= airquality[,1], y= as.matrix(airquality[-1]))
You have managed to mangle your data in an interesting way. Start with airquality before you modify it. (And please don't attach() - it's unnecessary and sometimes dangerous/confusing.)
str(airquality)
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
After you do
airquality[2:4] <- apply(airquality[2:4], 2,
function(x) x /airquality[1])
you get
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R:'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 4.63 3.28 12.42 17.39 NA ...
$ Wind :'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 0.18 0.222 1.05 0.639 NA ...
$ Temp :'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 1.63 2 6.17 3.44 NA ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
or
sapply(airquality,class)
## Ozone Solar.R Wind Temp Month Day
## "integer" "data.frame" "data.frame" "data.frame" "integer" "integer"
that is, you have data frames embedded within your data frame!
rm(airquality) ## clean up
Now change one character and divide by the column airquality[,1] rather than airquality[1] (divide by a vector, not a list of length one ...)
airquality[, 2:4] <- apply(airquality[, 2:4], 2,
                           function(x) x / airquality[, 1])
matplot(x = airquality[, 1], y = as.matrix(airquality[, -1]))
In general it's safer to use [, ...] indexing rather than [] indexing to refer to columns of a data frame unless you really know what you're doing ...
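A quick way to see the difference between the two forms of indexing (illustration only):

str(airquality[1])    # 'data.frame': 153 obs. of 1 variable  -- a list with one column
str(airquality[, 1])  # int [1:153] 41 36 12 18 NA ...        -- a plain vector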

How to work with the %in% operator in R?

I found out that %in% is the binary matching operator (in model formulae it denotes nesting). There are two tables in my workspace. The first table contains
> str(GP.drugs)
'data.frame': 4158393 obs. of 9 variables:
$ SHA : Factor w/ 10 levels "Q30","Q31","Q32",..: 1 1 1 1 1 1 1 1 1 1 ...
$ PCT : Factor w/ 151 levels "5A3","5A4","5A5",..: 16 16 16 16 16 16 16 16 16 16 ...
$ PRACTICE: Factor w/ 10191 levels "A81001","A81002",..: 344 345 345 345 345 345 345 345 345 345 ...
$ BNF.CODE: Factor w/ 1731 levels "0101010C0","0101010E0",..: 878 4 9 11 17 22 25 26 27 28 ...
$ BNF.NAME: Factor w/ 1524 levels "Abacavir ",..: 317 289 294 1284 37 379 655 825 1115 824 ...
$ ITEMS : int 1 27 1 2 97 4 40 98 27 2 ...
$ NIC : num 1.89 74.94 3.2 7.35 439.83 ...
$ ACT.COST: num 1.77 69.92 2.98 6.84 408.43 ...
$ PERIOD : num 201109 201109 201109 201109 201109 ...
The second table contains
> str(problem.drugs)
'data.frame': 13 obs. of 2 variables:
$ Drug : Factor w/ 13 levels "Alogliptin","Glipizide",..: 1 2 3 9 10 11 12 13 4 7 ...
$ Category: Factor w/ 1 level "metformin": 1 1 1 1 1 1 1 1 1 1 ...
The code I am running and the result it gives are
> t<-subset(GP.drugs,n %in% p)
> t
[1] SHA PCT PRACTICE BNF.CODE BNF.NAME ITEMS NIC ACT.COST PERIOD
<0 rows> (or 0-length row.names)
Does it make a difference whether the tables' column names match, or whether they have the same number of columns?
Your BNF.NAME column in the GP.drugs data frame appears to have extra trailing spaces in it: notice the first level is something like "Abacavir ". If this is true of all the drugs in GP.drugs, but not the ones in problem.drugs, it will prevent any from matching.
To fix this, you can use the str_trim function from stringr, which trims leading and trailing whitespace:
library(stringr)
n <- str_trim(GP.drugs$BNF.NAME)
# same thing you did before
p <- problem.drugs$Drug
t <- subset(GP.drugs, n %in% p)
Other solutions can be found here.
Try:
GP.drugs[GP.drugs$BNF.NAME %in% problem.drugs$Drug, ]
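If you would rather stay in base R, trimws() strips the same leading/trailing whitespace before matching (a hedged equivalent of the str_trim() approach):

# Trim the trailing spaces in BNF.NAME, then match against problem.drugs$Drug
keep <- trimws(GP.drugs$BNF.NAME) %in% problem.drugs$Drug
t <- GP.drugs[keep, ]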
