Remove missing values with user defined function - r

I have a dataset as data with missing values.
a <- sample(1:100,15)
b <- sample(1:20,15)
data <- data.frame(a,b)
data[c(3,6,8,12),2] <- NA
data
Now I want to delete the rows with missing values by one variable at a time. (Don't want to use na.omit() ). I have written the following function, but it's not working.
rmv_missing <- function(y,z){
z <- z[is.na(z$y) == TRUE,]
return(z)
}
rmv_missing("b",data)
Also tried this one...
library(dplyr)
na_values <- function(x,y,z){
z <- (filter(z,!is.na(y)))
return(z)
}
rmv_missing("b",data)
None of these functions are working. Could someone help me to understand where did I make the mistake and rectify the code. Thanks in advance.

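First, you don't need the "== TRUE", since is.na() already returns a logical vector. The real problem is that z$y looks for a column literally named "y" instead of the column whose name is stored in y, so it returns NULL. Index with z[, y] instead, et voilà (note that, like your original, this keeps the rows where the column is NA; use !is.na(...) to drop them instead):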
rmv_missing <- function(y,z)
{
print(z$y) # Does not work
print(z[, y]) # Works
z[is.na(z[, y]),]
}
rmv_missing("b",data)
# NULL
# [1] 9 16 NA 5 13 NA 8 NA 11 17 20 NA 10 12 1
# a b
# 3 33 NA
# 6 59 NA
# 8 81 NA
# 12 26 NA
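Since the goal in the question is to drop the rows with missing values rather than to list them, the condition has to be negated. A minimal sketch, assuming the same argument order as above (column name first, data frame second); the small data frame df_demo is made up for illustration and is not the random data from the question:

```r
# Drop the rows of data frame z whose column y (given as a string) is NA.
rmv_missing <- function(y, z) {
  z[!is.na(z[[y]]), , drop = FALSE]
}

# Hypothetical demo data
df_demo <- data.frame(a = c(10, 20, 30, 40), b = c(1, NA, 3, NA))
cleaned <- rmv_missing("b", df_demo)
print(cleaned)
```

Using z[[y]] and drop = FALSE keeps the result a data frame even when it has a single column.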

Related

Why is the for loop returning NA vectors in some positions (in R)?

Following a youtube tutorial, I have created a vector x [-3,6,2,5,9].
Then I create an empty variable of length 5 with the function 'numeric(5)'
I want to store the squares of my vector x in 'Storage2' with a for loop.
When I do the for loop and update my variable, it returns a very strange thing:
[1] 9 4 0 9 25 36 NA NA 81
I can see all numbers in x have been squared, but the order is so random, and there's more than 5.
Also, why are there NAs?? If it's because the last number of x is 9 (and so this number defines the length??), and there's no 7 and 8 position, I would understand, but then I'm also missing positions 1, 3 and 4, so there should be more NAs...
I'm just starting with R, so please keep it simple, and correct me if I'm wrong during my thought process! Thank you!!
x <- c(-3,6,2,5,9)
Storage2 <- numeric(5)
for(i in x){
Storage2[i] <- i^2
}
Storage2
# [1] 9 4 0 9 25 36 NA NA 81
You're looping over the elements of x not over the positions as probably intended. You need to change your loop like so:
for(i in 1:length(x)) {
Storage2[i] <- x[i]^2
}
Storage2
# [1] 9 36 4 25 81
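(Note: 1:length(x) can also be written as seq_along(x), as pointed out by @NelsonGon in the comments. It might be faster, and it is safer: for an empty x, 1:length(x) yields c(1, 0), whereas seq_along(x) yields an empty sequence.)
However, R is a vectorized language, so you can simply do this: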
Storage2 <- x^2
Storage2
# [1] 9 36 4 25 81
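To see where the strange result of the original loop comes from, it may help to unroll it by hand. The sketch below replays the five assignments that for(i in x) performs, using the values of x as indices:

```r
x <- c(-3, 6, 2, 5, 9)
Storage2 <- numeric(5)

Storage2[-3] <- (-3)^2  # negative index: every position except the 3rd gets 9
Storage2[6]  <- 6^2     # position 6 does not exist yet, so the vector grows
Storage2[2]  <- 2^2
Storage2[5]  <- 5^2
Storage2[9]  <- 9^2     # positions 7 and 8 are created and filled with NA
print(Storage2)
# [1]  9  4  0  9 25 36 NA NA 81
```

So the NAs appear only at positions 7 and 8, which were never assigned; positions 1, 3 and 4 were already filled, either by numeric(5) or by the negative-index assignment.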

efficient rbind alternative with applied function

To take a step back, my ultimate goal is to read in around 130,000 images into R with a pixel size of HxW and then to make a dataframe/datatable containing the rgb of each pixel of each image on a new row. So the output will be something like this:
> head(train_data, 10)
image_no r g b pixel_no
1: 00003e153.jpg 0.11764706 0.1921569 0.3098039 1
2: 00003e153.jpg 0.11372549 0.1882353 0.3058824 2
3: 00003e153.jpg 0.10980392 0.1843137 0.3019608 3
4: 00003e153.jpg 0.11764706 0.1921569 0.3098039 4
5: 00003e153.jpg 0.12941176 0.2039216 0.3215686 5
6: 00003e153.jpg 0.13333333 0.2078431 0.3254902 6
7: 00003e153.jpg 0.12549020 0.2000000 0.3176471 7
8: 00003e153.jpg 0.11764706 0.1921569 0.3098039 8
9: 00003e153.jpg 0.09803922 0.1725490 0.2901961 9
10: 00003e153.jpg 0.11372549 0.1882353 0.3058824 10
I currently have a piece of code to do this in which I apply a function to get the rgb for each pixel of a specified image, returning the result in a dataframe:
#function to get rgb from image file paths
get_rgb_table <- function(link){
img <- readJPEG(toString(link))
# Creating the data frame
rgb_image <- data.frame(r = as.vector(img[1:H, 1:W, 1]),
g = as.vector(img[1:H, 1:W, 2]),
b = as.vector(img[1:H, 1:W, 3]))
#add pixel id
rgb_image$pixel_no <- row.names(rgb_image)
#add image id
train_rgb <- cbind(sub('.*/', '',link),rgb_image)
colnames(train_rgb)[1] <- "image_no"
return(train_rgb)
}
I call this function on another dataframe which contains the links to all the images:
train_files <- list.files(path="~/images/", pattern=".jpg",all.files=T, full.names=T, no.. = T)
train <- data.frame(matrix(unlist(train_files), nrow=length(train_files), byrow=T))
The train dataframe looks like this:
> head(train, 10)
link
1 C:/Documents/image/00003e153.jpg
2 C:/Documents/image/000155de5.jpg
3 C:/Documents/image/00021ddc3.jpg
4 C:/Documents/image/0002756f7.jpg
5 C:/Documents/image/0002d0f32.jpg
6 C:/Documents/image/000303d4d.jpg
7 C:/Documents/image/00031f145.jpg
8 C:/Documents/image/00053c6ba.jpg
9 C:/Documents/image/00057a50d.jpg
10 C:/Documents/image/0005d01c8.jpg
I finally get the result I want with the following loop:
for(i in 1:length(train[,1])){
train_data <- rbind(train_data,get_rgb_table(train[i,1]))
}
However, this last bit of code is very inefficient. An optimization of how the function is applied and/or of the rbind would help. I think the function get_rgb_table() itself is quick, but the problem is with the loop and the rbind. I have tried using apply() but can't manage to do this on each row and put the result in one dataframe without running out of memory. Any help on this would be great. Thanks!
This is very difficult to answer given the vagueness of the question, but I'll make a reproducible example of what I think you're asking and will give a solution.
Say I have a function that returns a data frame:
MyFun <- function(x)randu[1:x,]
And I have a data frame df that will act as an input to the function.
df <- data.frame(a = 1:10, b = 21:30)
df
# a b
# 1 1 21
# 2 2 22
# 3 3 23
# 4 4 24
# 5 5 25
# 6 6 26
# 7 7 27
# 8 8 28
# 9 9 29
# 10 10 30
From your question, it looks like only one column will be used as input. So, I apply the function to each row of this data frame using lapply then I bind the results together using do.call and rbind like this:
do.call(rbind, lapply(df$a, MyFun))
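The difference between the two patterns can be sketched side by side. Here make_piece() is a hypothetical stand-in for get_rgb_table() and simply returns a small data frame per input; the point is only that binding once at the end avoids copying the accumulated result on every iteration:

```r
# Hypothetical stand-in for get_rgb_table(): one small data frame per input
make_piece <- function(i) data.frame(id = i, value = i * c(1, 2, 3))

inputs <- 1:50

# Slow pattern: the accumulated data frame is copied on every iteration
grown <- NULL
for (i in inputs) grown <- rbind(grown, make_piece(i))

# Faster pattern: build all pieces first, then bind once at the end
pieces <- lapply(inputs, make_piece)
bound  <- do.call(rbind, pieces)

print(dim(bound))
```

For very large inputs, data.table::rbindlist(pieces) is a commonly used replacement for do.call(rbind, pieces), though whether it resolves the memory problem for 130,000 images is not something this sketch can show.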

Column looping through a user function and storing output in a newly created column (R)

I have some data that consists of an oscillatory-like pattern and would like to take some measurements of the peaks. I have several chunks of code and most of them work to do exactly what I want. The main issue I'm having is that I have no idea how to integrate them to work functionally together.
Essentially, I would like to use the freq function I've written on a dataframe so that it will loop through each column (a, b, and c) and give me the results of the function. Then I would like to store the output for each column in a new dataframe with the column names matching the source names.
I have read a lot of answers about looping through columns and creating new columns in a dataframe, which is how I've gotten to this point. Some of the individual pieces need a little tweaking, but what I can't find anywhere is a good explanation of how I can put it all together. I have tried to no avail; I just can't seem to get the order right.
(For reproducible data)
library(zoo)
count = 1:20
a = c(-0.802776, -0.748272, 0.187434, 1.23577, 1.00677, 0.874122, 0.232802, -0.279368, -1.57815, -1.76652, -0.958916, -0.316385, 0.831575, 1.19312, 1.45508, 0.848923, 0.257728, -0.318474, -1.14129, -1.42576)
b = c(-2.23512, -1.36572, -0.0357366, 0.925563, 1.53282, 0.171045, -0.438714, -1.38769, -0.696898, 1.37184, 2.01038, 2.6302, 2.53296, 1.8788, 0.100366, -1.34726, -1.4309, -1.37271, -0.750669, 0.100656)
c = c(0.749062, 0.0690315, -0.750494, -1.04069, -0.654432, 0.0186072, 0.710011, 0.920915, 1.13075, 0.227108, -0.195086, -0.68333, -0.607532, -0.485424, 0.495913, 0.655385, 0.468796, 0.274053, -0.906834 , 0.321526)
test = data.frame(count, a, b, c)
d = 20:40
This is the chunk of code I've written to go through any data I specify and identify local peaks, then calculate a series of things from the identified peaks. It works really well and there's no issue with the functionality of this (however, suggestions to make it better are welcome), just with putting it together with the rest.
I would like to loop through columns of a dataframe (using a for loop in the next section to accomplish that) and get the result of the freq function for each column
freq = function(x, y, data, w=1, span = 0.05, ...) {
require(zoo)
n = length(y)
y.smooth = loess(y ~ x, span = span)$fitted
y.max = rollapply(zoo(y.smooth), 2*w+1, max, align = "center")
delta = y.max - y.smooth[-c(1:w, n+1-1:w)]
i.max = which(delta <= 0) + w #identifies peaks
list(x = x[i.max], i = i.max, y.hat = y.smooth)
dist = diff(i.max) #calculates distance between peaks
instfreq = (25/dist) #calculates the rate of each peak occurrence
print(instfreq) #output I ultimately want
}
#example
freq(count, a, span = 0.5)
This is how I'm looping through columns in a specified dataframe. Also, I'm not sure what I've done but this ends up printing my output twice...(which I'd like to avoid).
for(i in test){
output <- freq(test$count, y = i, span = 0.5)
print(output)
}
This is probably the part giving me the biggest headache. This should add new columns to an existing dataframe. It works so far but I have yet to figure out how to integrate it into the stuff above. Also, I'd really like for it to store the output in a new dataframe, rather than the source dataframe.
For reference, here df = data, to.add = data to add to df, new.name = name of new col
Another thing I'd like is for the new.name to come from the source (to.add). For example if I tried to add d (from above) to the end of test, I'd like for the column name (new.name) to read d without having to specify it. This will be helpful when I'm looping through multiple columns and want to keep the source name from which the output was calculated.
add.col = function(df, to.add, new.name){
if (nrow(df) < length(to.add)){
df = # pads rows if needed
rbind(df, matrix(NA, length(to.add)-nrow(df), ncol(df),
dimnames = list(NULL, names(df))))
}
length(to.add) = nrow(df) # pads with NA's
df[, new.name] = to.add; # names new col whatever was placed in new.name arg
return(head(df)) #shortened output so I can verify it worked
#when I was testing it for myself, this would
#need to be changed so that it adds the column
#to a dataframe and stores the results, which
#I believe would require I use print() and a store
#like Results = print(df)
}
#example
add.col(test, d, "d") #would like the code to grab the name d just from the to.add
#argument, without having to specify "d" as the new.name
Any help, suggestions, or refinements (to make it less clunky, more efficient, etc) would be greatly appreciated.
I can get by with the for loop (if the duplications get fixed) as long as I can figure out how to store all the output together in one place. My actual data is in a similar format to the reproducible set above, it just has far more rows and columns (and will already be in a .csv dataframe rather than creating it from individual vectors).
I've been beating my head over this for a few days now and have gotten so far but just can't get it all the way.
Also, feel free to edit the title to help it get to the right people!
OK, first of all, your function prints its output twice because of the following sequence:
1. instfreq gets calculated
2. print(instfreq) inside freq prints it and returns it invisibly, so it becomes the return value of freq
3. that return value gets assigned to output
4. print(output) prints it a second time
Furthermore, I suppose you don't want your function to be evaluated on the count column itself (which returns numeric(0)), so it is best to run it only on the other columns.
Lastly, simple for loops like this can easily be replaced by the apply function in R, which turns the first part of your question into:
freq = function(x, y, data, w=1, span = 0.05, ...) {
require(zoo)
n = length(y)
y.smooth = loess(y ~ x, span = span)$fitted
y.max = rollapply(zoo(y.smooth), 2*w+1, max, align = "center")
delta = y.max - y.smooth[-c(1:w, n+1-1:w)]
i.max = which(delta <= 0) + w #identifies peaks
list(x = x[i.max], i = i.max, y.hat = y.smooth)
dist = diff(i.max) #calculates distance between peaks
instfreq = (25/dist) #calculates the rate of each peak occurrence
return(instfreq) #output I ultimately want
}
output <- apply(test[,2:length(test[1,])],2, function(v) freq(test$count, y=v, span=0.5))
output
# a b c
#2.500000 3.571429 2.777778
The second part of your question asks how to use the name of a variable as the name of the new column. For this we can use deparse(substitute(variable)), so your function becomes:
add.col = function(df, to.add){
new.name <- deparse(substitute(to.add))
if (nrow(df) < length(to.add)){
df = # pads rows if needed
rbind(df, matrix(NA, length(to.add)-nrow(df), ncol(df),
dimnames = list(NULL, names(df))))
}
length(to.add) = nrow(df) # pads with NA's
df[, new.name] = to.add # names the new col after the variable passed as to.add
return(df)
}
#example
dnametest = 20:40
add.col(test, dnametest)
# count a b c dnametest
#1 1 -0.802776 -2.2351200 0.7490620 20
#2 2 -0.748272 -1.3657200 0.0690315 21
#etc.
This function does not overwrite your original dataframe, so you simply need to assign its result to a new dataframe:
newframe <- add.col(test, dnametest)
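How deparse(substitute()) behaves is easiest to see in isolation. In this small sketch, arg_name() is a hypothetical helper that is not part of the code above; it returns whatever expression the caller wrote for the argument, as a string:

```r
# Return the expression that was passed as `v`, as a string
arg_name <- function(v) deparse(substitute(v))

dnametest <- 20:40
print(arg_name(dnametest))
# [1] "dnametest"

# Caveat: the text of the expression is returned, not necessarily a variable name
print(arg_name(1:3))
# [1] "1:3"
```

This is why add.col(test, dnametest) can name the column on its own, and also why passing an expression instead of a named variable would produce an awkward column name.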
EDIT: adding the possibility to loop over any number of arrays:
The first problem you'll run into when trying to loop is that you're working with arrays of different lengths. That makes it hard to work with dataframes, so you'll have to work with lists. In this case it is easier to write a new function that takes any number of arrays and loops over them for you. Because it's easier to capture and add the names in that function, I've changed your add.col function back so that it takes new.name again:
add.col = function(df, to.add, new.name){
if (nrow(df) < length(to.add)){
df = # pads rows if needed
rbind(df, matrix(NA, length(to.add)-nrow(df), ncol(df),
dimnames = list(NULL, names(df))))
}
length(to.add) = nrow(df) # pads with NA's
df[, new.name] = to.add;
return(df)
}
Then I can write a second function, add.multicol, like this:
#this function takes in an unspecified number of arguments
add.multicol <- function(df, ...){
#convert this number of arguments to a list
to.add.cols <- list(...)
#add the variable names to this list
names(to.add.cols) <- as.list(substitute(list(...)))[-1]
#find number of columns to add
number.cols.to.add <- length(to.add.cols)
#loop add.col
newframe <- df
for(i in 1:number.cols.to.add){
to.add.col <- array(unlist(to.add.cols[i]))
to.add.col.name <- names(to.add.cols[i])
newframe <- add.col(newframe,to.add.col,to.add.col.name)
}
return(newframe)
}
This will allow you to do whichever you want. Example:
dnametest <- 20:40
test1 <- 1:15
test2 <- 25:56
argumentsake <- seq(0,1,length=21)
#run function
newframe <- add.multicol(test,dnametest,test1,test2,argumentsake)
newframe
# count a b c dnametest test1 test2 argumentsake
#1 1 -0.802776 -2.2351200 0.7490620 20 1 25 0.00
#2 2 -0.748272 -1.3657200 0.0690315 21 2 26 0.05
#3 3 0.187434 -0.0357366 -0.7504940 22 3 27 0.10
#4 4 1.235770 0.9255630 -1.0406900 23 4 28 0.15
#5 5 1.006770 1.5328200 -0.6544320 24 5 29 0.20
#6 6 0.874122 0.1710450 0.0186072 25 6 30 0.25
#7 7 0.232802 -0.4387140 0.7100110 26 7 31 0.30
#8 8 -0.279368 -1.3876900 0.9209150 27 8 32 0.35
#9 9 -1.578150 -0.6968980 1.1307500 28 9 33 0.40
#10 10 -1.766520 1.3718400 0.2271080 29 10 34 0.45
#11 11 -0.958916 2.0103800 -0.1950860 30 11 35 0.50
#12 12 -0.316385 2.6302000 -0.6833300 31 12 36 0.55
#13 13 0.831575 2.5329600 -0.6075320 32 13 37 0.60
#14 14 1.193120 1.8788000 -0.4854240 33 14 38 0.65
#15 15 1.455080 0.1003660 0.4959130 34 15 39 0.70
#16 16 0.848923 -1.3472600 0.6553850 35 NA 40 0.75
#17 17 0.257728 -1.4309000 0.4687960 36 NA 41 0.80
#18 18 -0.318474 -1.3727100 0.2740530 37 NA 42 0.85
#19 19 -1.141290 -0.7506690 -0.9068340 38 NA 43 0.90
#20 20 -1.425760 0.1006560 0.3215260 39 NA 44 0.95
#21 NA NA NA NA 40 NA 45 1.00
#22 NA NA NA NA NA NA 46 NA
#23 NA NA NA NA NA NA 47 NA
#24 NA NA NA NA NA NA 48 NA
#25 NA NA NA NA NA NA 49 NA
#26 NA NA NA NA NA NA 50 NA
#27 NA NA NA NA NA NA 51 NA
#28 NA NA NA NA NA NA 52 NA
#29 NA NA NA NA NA NA 53 NA
#30 NA NA NA NA NA NA 54 NA
#31 NA NA NA NA NA NA 55 NA
#32 NA NA NA NA NA NA 56 NA
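The two ideas used above, capturing the names of the arguments passed through ... and padding shorter vectors with NA, can also be sketched in a compact form. The function pad_bind() below is a hypothetical illustration, not a replacement for add.multicol: it assumes every argument is a plain vector passed as a named variable, and it builds a new data frame rather than extending an existing one:

```r
# Combine vectors of unequal length into one data frame, padding with NA.
pad_bind <- function(...) {
  cols <- list(...)
  # Recover the variable names that the caller wrote for each argument
  names(cols) <- vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
  n <- max(lengths(cols))
  # Extending the length of a vector fills the new positions with NA
  padded <- lapply(cols, function(v) { length(v) <- n; v })
  as.data.frame(padded)
}

u <- 1:3
v <- 1:5
res <- pad_bind(u, v)
print(res)
```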
EDIT 2: extending the loop to take in dataframes of any form as well
Now it becomes quite messy. You will also need to rename your output elements so that they don't match any column names already present.
add.multicol <- function(df, ...){
#convert this number of arguments to a list
to.add.cols <- list(...)
#find number of columns to add
number.args <- length(to.add.cols)
#number of elements per list entry
hierarch.cols.to.add <- numeric(number.args)
for(i in 1:number.args){
#if this list element has only one name, treat it as an array, else treat it as a data frame
if(is.null(names(to.add.cols[[i]]))){
#get variable names from input of normal arrays
names(to.add.cols[[i]]) <- as.list(substitute(list(...)))[i+1]
hierarch.cols.to.add[i] <- 1
} else {
#find the number of columns in the data frame
number <- length(names(to.add.cols[[i]]))
hierarch.cols.to.add[i] <- number
}
}
#loop add.col
newframe <- df
for(i in 1:number.args){
#if array
if(hierarch.cols.to.add[i]==1){
to.add.col <- array(unlist(to.add.cols[[i]]))
to.add.col.name <- names(to.add.cols[[i]][1])
newframe <- add.col(newframe,to.add.col,to.add.col.name)
} else { #if data.frame
#foreach column in the data frame
for(j in 1:hierarch.cols.to.add[i]){
#if only one element per column
if(is.null(dim(to.add.cols[[i]]))){
to.add.col <- to.add.cols[[i]][j]
} else { #if multiple elements per column
to.add.col <- to.add.cols[[i]][,j]
}
to.add.col.name <- names(to.add.cols[[i]])[j]
newframe <- add.col(newframe,to.add.col,to.add.col.name)
}
}
}
return(newframe)
}
testdf <- data.frame(cbind(test1,test2))
dnametest <- 20:40
output <- apply(test[,2:length(test[1,])],2, function(v) freq(test$count, y=v, span=0.5))
#edit output names because we can't have a dataframe with the same name for multiple columns
names(output) <- c("output_a","output_b","output_c")
newframe <- test
#function now takes dataframes of single elements, normal data frames and single arrays
newframe <- add.multicol(newframe,output,dnametest,testdf)
# count a b c output_a output_b output_c dnametest test1 test2
#1 1 -0.802776 -2.2351200 0.7490620 2.5 3.571429 2.777778 20 0 25
#2 2 -0.748272 -1.3657200 0.0690315 NA NA NA 21 1 26
#3 3 0.187434 -0.0357366 -0.7504940 NA NA NA 22 2 27
#4 4 1.235770 0.9255630 -1.0406900 NA NA NA 23 3 28
#...
This is probably not the most efficient way, but it gets the job done.

Function in R to determine complete cases

I have the following code to print the complete cases:
complete <- function(directory, id=1:332) {
data<-NULL
dat <- NULL
s <- NULL
for (i in 1:length(id)) {
data[[i]]<- c(paste(directory, "/",formatC(id[i], width=3, flag=0),".csv",sep=""))
df[[i]]<-c(read.csv(data[[i]]))
s[i] <- sum(complete.cases(df[[i]]))
dat <- data.frame(cbind(id,nobs=s[i]))
}
dat
}
The output that I get is as follows:
complete("specdata", c(2, 4, 8, 10, 12))
id nobs
1 2 96
2 4 96
3 8 96
4 10 96
5 12 96
The required output looks like this:
complete("specdata", c(2, 4, 8, 10, 12))
## id nobs
## 1 2 1041
## 2 4 474
## 3 8 192
## 4 10 148
## 5 12 96
The .csv looks like this:
head(file)
Date sulfate nitrate ID
1 2003-01-01 NA NA 1
2 2003-01-02 NA NA 1
3 2003-01-03 NA NA 1
4 2003-01-04 NA NA 1
5 2003-01-05 NA NA 1
6 2003-01-06 NA NA 1
As is evident from the two outputs, the nobs value for every id is just the value corresponding to id == 12, replicated. I'm unable to figure out a way to output the nobs corresponding to each id. Let's ignore the ## in each line of the required output. Thanks in advance.
I tried to clean your code:
complete <- function(directory, id) {
s <- vector()
for (i in 1:length(id)) {
path <- paste(directory, "/", formatC(id[i], width=3, flag="0"), ".csv", sep="")
data <- read.csv(path)
s[i] <- sum(complete.cases(data))
}
dat <- data.frame(cbind(id,nobs=s))
return(dat)
}
If this does not work, you might want to check your calls to formatC and complete.cases.
EDIT:
There were several redundancies in your code as well as two logical errors.
First, you don't need to initialize objects in R in order to give them a value. I deleted these two
data<-NULL
dat <- NULL
and changed the third into an empty vector. Second, I removed the indices from your data and df objects and gave them more expressive names (path and data): since these objects are created anew in every iteration of the for loop, there is no point in indexing them. Finally, you misplaced the closing bracket (as mentioned above) and built dat$nobs from only one element of s (namely the last one):
dat <- data.frame(cbind(id,nobs=s[i]))
Fixing this into
dat <- data.frame(cbind(id,nobs=s))
did the trick.
Please consider reading a good beginners book on (R-)programming to gain a better understanding of control structures.
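For comparison, the same counting can be written without an explicit loop. This is only a sketch under stated assumptions: complete_sketch() is a hypothetical name, the files are assumed to be named 001.csv, 002.csv and so on as in the question, and the two demo files written to a temporary directory are made up, not the real "specdata":

```r
# Count complete rows per monitor file; assumes files named 001.csv, 002.csv, ...
complete_sketch <- function(directory, id) {
  paths <- file.path(directory, sprintf("%03d.csv", id))
  nobs <- vapply(paths, function(p) sum(complete.cases(read.csv(p))), numeric(1))
  data.frame(id = id, nobs = unname(nobs))
}

# Made-up demo files in a temporary directory
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, NA, 3), nitrate = c(2, 5, NA)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(1, 2, 3), nitrate = c(4, 5, 6)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

result <- complete_sketch(demo_dir, 1:2)
print(result)
```

Because the data frame is built once from the full vector of counts, the mistake of using only the last element of s cannot occur here.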

looping over the name of the columns in R for creating new columns

I am trying to loop over the column names of an existing dataframe and then create new columns based on one of the old columns. Here is my sample data:
sample<-list(c(10,12,17,7,9,10),c(NA,NA,NA,10,12,13),c(1,1,1,0,0,0))
sample<-as.data.frame(sample)
colnames(sample)<-c("x1","x2","D")
>sample
x1 x2 D
10 NA 1
12 NA 1
17 NA 1
7 10 0
9 20 0
10 13 0
Now I am trying to use a for loop to generate two variables, x1.imp and x2.imp, that take the values from the D=0 rows when D=1 and the values from the D=1 rows when D=0. (I don't actually need a for loop here, but for my original dataset with a large number of columns (variables) I really do.) This is what I tried:
for (i in names(sample[,1:2])){
sample$i.imp<-with (sample, ifelse (D==1, i[D==0],i[D==1]))
i=i+1
return(sample)
}
Error in i + 1 : non-numeric argument to binary operator
However, the following works, but it doesn't give the names of new cols as imp.x2 and imp.x3
for(i in sample[,1:2]){
impt.i<-with(sample,ifelse(D==1,i[D==0],i[D==1]))
i=i+1
print(as.data.frame(impt.i))
}
impt.i
1 7
2 9
3 10
4 10
5 12
6 17
impt.i
1 10
2 12
3 13
4 NA
5 NA
6 NA
Note that I already know the solution without loop [here]. I want with loop.
Expected output:
x1 x2 D x1.impt x2.imp
10 NA 1 7 10
12 NA 1 9 20
17 NA 1 10 13
7 10 0 10 NA
9 20 0 12 NA
10 13 0 17 NA
I would greatly appreciate your valuable input in this regard.
This is nuts, but since you are asking for it... Your code with minimum changes would be:
for (i in colnames(sample)[1:2]){
sample[[paste0(i, '.impt')]] <- with(sample, ifelse(D==1, get(i)[D==0],get(i)[D==1]))
}
A few comments:
- I replaced names(sample[,1:2]) with the more elegant colnames(sample)[1:2].
- The $ is for interactive usage. Instead, when programming, i.e. when the column name is to be interpreted, you need to use [ or [[, hence I replaced sample$i.imp with sample[[paste0(i, '.impt')]].
- Inside with, i[D==0] will not give you x1[D==0] when i is "x1", hence the need to dereference it using get.
- You should not name your data.frame sample, as it is also the name of a pretty common function.
This should work,
test <- sample[,"D"] == 1
for (.name in names(sample)[1:2]){
newvar <- paste(.name, "impt", sep=".")
sample[[newvar]] <- ifelse(test, sample[!test, .name],
sample[test, .name])
}
sample
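Both answers rely on ifelse() recycling the shorter vectors, which happens to line up for this particular row order. A more explicit sketch of the same loop is shown below; it assumes, as in the question's data, that the D == 1 and D == 0 groups have the same number of rows, and it stores the data under the name dat so that it does not mask the sample() function:

```r
# Data from the question, under a name that does not mask sample()
dat <- data.frame(x1 = c(10, 12, 17, 7, 9, 10),
                  x2 = c(NA, NA, NA, 10, 12, 13),
                  D  = c(1, 1, 1, 0, 0, 0))

is_treated <- dat$D == 1
for (col in c("x1", "x2")) {
  new_col <- paste0(col, ".impt")
  dat[[new_col]] <- NA_real_
  # Rows with D == 1 receive the D == 0 values, and vice versa
  dat[[new_col]][is_treated]  <- dat[[col]][!is_treated]
  dat[[new_col]][!is_treated] <- dat[[col]][is_treated]
}
print(dat)
```

With unequal group sizes the two assignments would recycle or fail to fit, so a real dataset would need an explicit rule for matching rows between the groups.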
