Extract a value from a dataframe iteratively (R)

I have a function to select a value from a dataframe. I want to select that value, save it, remove it from the dataset, and select a value using the same function from the remaining values in the dataframe. What is the best way to do this?
Here is a simple example:
V1 <- c(5,6,7,8,9,10)
df <- data.frame(V1)
V2 <- as.data.frame(matrix(nrow=3,ncol=1))
maximum <- function(x){
max(x)
}
V2[i,]<- maximum(df)
df <- anti_join(df,V2,by='V1')
How can I set this up so that I reapply the maximum function to the remaining values in df and save the results in V2?
I'm using a different and more complex set of functions and if/else statements than max - this is just an example. I do have to reapply the function to the remaining values, because I will be using the function on a new dataframe if df is empty.

Is this what you're looking for?
V1 <- data.frame(origin = c(5,6,7,8,9,10))
V2 <- as.data.frame(matrix(nrow=3,ncol=1))
df1 <- V1
df2 <- V2
recursive_function <- function(df1,df2,depth = 3,count = 1){
if (count == depth){
# Find index
indx <- which.max(df1[,1])
curVal <- df1[indx,1]
df2[count,1] <- curVal
df1 <- df1[-indx, ,drop = FALSE]
return(list(df1,
df2))
} else {
# Find index
indx <- which.max(df1[,1])
# Find Value
curVal <- df1[indx,1]
# Add value to new data frame
df2[count,1] <- curVal
# Subtract value from old dataframe
df1 <- df1[-indx, ,drop = FALSE]
recursive_function(df1,df2,depth,count + 1)
}
}
recursive_function(df1,df2)

Here is another solution that I stumbled across:
V1 <- c(5,6,7,8,9,10)
df <- data.frame(V1)
minFun <- function(df, maxRun){
V2 <- as.data.frame(matrix(nrow=maxRun,ncol=1))
for(i in 1:maxRun){
V2[i,]<- min(df)
df <- dplyr::anti_join(df,V2,by='V1')
}
return(V2)
}
test <- minFun(df = df, maxRun = 3)
test
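Both answers above hard-code the selection rule (which.max or min). Since the question says the real rule is more complex, here is a minimal sketch that takes the rule as an argument. The names extract_n and pick are made up for illustration, and which.max only stands in for whatever function returns the position of the value to extract.

```r
# Hypothetical helper: `pick` is any function that returns the position of
# the element to extract; which.max is a stand-in for a more complex rule.
extract_n <- function(values, n, pick = which.max) {
  n <- min(n, length(values))        # never ask for more than is available
  picked <- numeric(n)
  for (i in seq_len(n)) {
    idx <- pick(values)              # choose from what is left
    picked[i] <- values[idx]         # save the chosen value
    values <- values[-idx]           # remove it before the next pass
  }
  list(picked = picked, remaining = values)
}

df <- data.frame(V1 = c(5, 6, 7, 8, 9, 10))
res <- extract_n(df$V1, n = 3)
V2 <- data.frame(V1 = res$picked)    # extracted values
df <- data.frame(V1 = res$remaining) # values still available
```

Removing by position rather than with anti_join means duplicated values are removed one at a time, which may or may not be what you want.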

Related

undefined columns selected and cannot xtfrm data frame error

I am trying to write code that checks for outliers based on the IQR and changes those values to NA. So I wrote this:
dt <- rnorm(200)
dg <- rnorm(200)
dh <- rnorm(200)
l <- c(1,3) #List of relevant columns
df <- data.frame(dt,dg,dh)
To check if the column contains any outliers and change their value to NA:
vector.is.empty <- function(x) return(length(x) ==0)
#Checks for empty values in vector and returns booleans.
for (i in 1:length(l)){
IDX <- l[i]
BP <- boxplot.stats(df[IDX])
OutIDX <- which(df[IDX] %in% BP$out)
if (vector.is.empty(OutIDX)==FALSE){
for (u in 1:length(OutIDX)){
IDX2 <- OutIDX[u]
df[IDX2,IDX] <- NA
}
}
}
So, when I run this code, I get these error messages:
I've tried to search online for good answers, but I'm not sure why R claims that the column is unspecified. Any clues here?
I would do something like this to replace the outliers:
# Set a seed (to make the example reproducible)
set.seed(31415)
# Generate the data.frame
df <- data.frame(dt = rnorm(100), dg = rnorm(100), dh = rnorm(100))
# A list to save the result of boxplot.stats()
l <- list()
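for (i in seq_along(df)){
  l[[i]] <- boxplot.stats(df[,i])
  # Use %in% rather than ==: with several outliers, == recycles the shorter
  # vector and can miss matches
  df[df[,i] %in% l[[i]]$out, i] <- NA
}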
# Which values have been replaced?
lapply(l, function(x) x$out)
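A likely cause of the errors in the question is that df[IDX] with single brackets returns a one-column data frame, while boxplot.stats() expects a numeric vector. The sketch below keeps the question's idea of only touching selected columns and uses df[[j]] to pull each column out as a vector. The names cols and removed are just for illustration.

```r
set.seed(31415)
df <- data.frame(dt = rnorm(200), dg = rnorm(200), dh = rnorm(200))
cols <- c(1, 3)        # relevant columns, as in the question

removed <- list()      # outliers found per column, kept for inspection
for (j in cols) {
  out <- boxplot.stats(df[[j]])$out     # df[[j]] is a plain numeric vector
  removed[[names(df)[j]]] <- out
  df[[j]][df[[j]] %in% out] <- NA       # no-op when `out` is empty
}

# Number of values replaced in each processed column
colSums(is.na(df[cols]))
```

Column dg is left untouched because it is not listed in cols.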

Simplifying a function that performs operations on one data frame based on values in another data frame

I have two data frames dat1 and dat2:
dat1 <- data.frame(id = rep(c("a","b","c"), each =100),
dist = rep(1:100, times = 3),
var1 = rnorm(300),
var2 = rnorm(300))
dat2 <- data.frame(id = c("a","b","c"),
value = c(42,56,39))
The value column in dat2 contains the index up to which I would like to subset dat1 for each id. I wrote the following function getk to do this subset and perform an operation using that value:
getk <-
function(id, value){
x <- dplyr::filter(dat1, id == id)
x <- x[1:value, ]
k = 10*(value^(2/9))
k = ceiling(k)
k
}
getk(a,42)
I want to add a line to the function that assigns the correct value from dat2 to a new object v, so that I don't have to feed the function both the id and the value every time. I cannot figure out how to say, essentially: "if I tell you I want to do this for a, assign the number from dat2$value that goes with filter(dat2, id==a) to the object v".
In other words, my function will turn into something close to this:
getk <-
function(id){
x <- dplyr::filter(dat1, id == id)
v <- #the value in dat2
x <- x[1:v, ]
k = 10*(v^(2/9))
k = ceiling(k)
k
}
#after which I could just do this and get the same answer as above:
getk(a)
I believe you want
v <- dat2$value[dat2$id == id]
But note it will only work in your function if you use getk("a") since a is not an object.
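Putting that line into the function gives something like the sketch below. One more thing to watch: inside dplyr::filter(dat1, id == id) both sides of the comparison refer to the id column, so the filter keeps every row. The sketch uses base subsetting to avoid that; the length check is an added assumption that each id appears exactly once in dat2.

```r
dat1 <- data.frame(id = rep(c("a", "b", "c"), each = 100),
                   dist = rep(1:100, times = 3),
                   var1 = rnorm(300),
                   var2 = rnorm(300))
dat2 <- data.frame(id = c("a", "b", "c"),
                   value = c(42, 56, 39))

getk <- function(id) {
  v <- dat2$value[dat2$id == id]     # look up the value that goes with this id
  stopifnot(length(v) == 1)          # assumption: one row per id in dat2
  x <- dat1[dat1$id == id, ]         # here `id` is the function argument
  x <- x[seq_len(v), ]               # first v rows, as in the question
  k <- ceiling(10 * v^(2/9))
  k
}

getk("a")
```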

apply a function with two dataframes as input in r

I want to get the total number of NAs that mismatch between two dataframes.
I have found the way to get this for two vectors as follows:
compareNA <- function(v1,v2) {
same <- (v1 == v2) | (is.na(v1) & is.na(v2))
same[is.na(same)] <- FALSE
n <- 0
for (i in 1:length(same))
if (same[i] == "FALSE"){
n <- n+1
}
return(n)
}
Let's say I have vectors a and b. When comparing them, I get 2 as the result:
a <- c(1,2,NA, 4,5,6,NA,8)
b <- c(NA,2,NA, 4,NA,6,NA,8)
h <- compareNA(a,b)
h
[1] 2
My question is: how can I apply this function to dataframes instead of vectors?
Take these dataframes as an example:
a2 <- c(1,2,NA,NA,NA,6,NA,8)
b2 <- c(1,NA,NA,4,NA,6,NA,NA)
df1 <- data.frame(a,b)
df2 <- data.frame(a2,b2)
What I expect as a result is 5, since that is the total number of positions where the NAs in df1 and df2 do not match. Any suggestions on how to make this work?
Here's a second thought.
xy1 <- data.frame(a = c(NA, 2, 3), b = rnorm(3))
xy2 <- data.frame(a = c(NA, 2, 4), b = rnorm(3))
com <- intersect(colnames(xy1), colnames(xy2))
sum(xy1[, com] == xy2[, com], na.rm = TRUE)
If you don't want to worry about column names (but you should), you can make sure the columns align perfectly. In that case, the intersect step is redundant.
sum(xy1 == xy2, na.rm = TRUE)
A third way (assuming dimensions of df1 & df2 are same):
sum(sapply(1:ncol(df1), function(x) compareNA(df1[,x], df2[,x])))
# 5
It would be easier to force both dataframes to have the same column names and compare column by column when those have the same name. You can then simply use a loop over columns and increment a running total by applying your function.
compareNA.df <- function(df1, df2) {
total <- 0
common_columns <- intersect(colnames(df1), colnames(df2))
for (col in common_columns) {
total <- total + compareNA(df1[[col]], df2[[col]])
}
return(total)
}
colnames(df2) <- c("a", "b")
compareNA.df(df1, df2)
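If both dataframes have the same dimensions and the columns line up by position, the same count can be done without a loop by comparing the data as matrices. This is only a sketch under that assumption; count_mismatches is a made-up name, and it applies the same rule as compareNA (two NAs count as equal, an NA against a value counts as a mismatch).

```r
a  <- c(1, 2, NA, 4, 5, 6, NA, 8)
b  <- c(NA, 2, NA, 4, NA, 6, NA, 8)
a2 <- c(1, 2, NA, NA, NA, 6, NA, 8)
b2 <- c(1, NA, NA, 4, NA, 6, NA, NA)
df1 <- data.frame(a, b)
df2 <- data.frame(a2, b2)

count_mismatches <- function(df1, df2) {
  m1 <- as.matrix(df1)
  m2 <- as.matrix(df2)
  stopifnot(identical(dim(m1), dim(m2)))    # columns are matched by position
  same <- (m1 == m2) | (is.na(m1) & is.na(m2))
  same[is.na(same)] <- FALSE                # NA vs value is a mismatch
  sum(!same)
}

count_mismatches(df1, df2)
```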

Subsetting data from R data frame using incremental variable names

I have an R data frame, df, with column names V1, V2, V3...V1000. I need to subset df by selecting every 20th column, that is, V1, V21, V41, V61 through end of columns.
I think this can be done using dplyr's select(df, num_range("V", val)), but am stumped how to iterate val through 1000 columns, stepping by 20.
Any suggestions?
Use the seq function with dplyr's select and num_range as below:
library(dplyr)
df <- as.data.frame(matrix(rnorm(3000), nrow = 3))
df %>% select(num_range("V", seq(1, 1000, by = 20)))
You can try,
df[seq(1, ncol(df), 20)]
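Note that indexing by position only gives V1, V21, V41, ... if the columns are stored in that order. A base R sketch that selects by name instead, assuming the columns really are called V1 to V1000 as in the question:

```r
df <- as.data.frame(matrix(rnorm(3000), nrow = 3))   # columns V1 ... V1000

by_position <- df[seq(1, ncol(df), by = 20)]

wanted <- paste0("V", seq(1, ncol(df), by = 20))     # "V1", "V21", "V41", ...
by_name <- df[wanted]

identical(by_position, by_name)
```

The two results agree here because the example dataframe is built with its columns in order.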
You can use a function like this. Here nSkip = 20, since you want to take every 20th column.
FOO <- function(data, nSubsets, nSkip)
{
outList <- vector("list", length = nSubsets)
totcol <- ncol(data)
for (i in seq_len(nSubsets))
{
colsToGrab<- seq(i, totcol, nSkip)
outList[[i]] <- data[,colsToGrab ]
}
return(outList)
}

Passing variable names to function in R

I have a subset function that takes in an object of a user defined class, a condition passed to the function, and adds that condition as an attribute of the object.
subset.survey.data.frame <- function(x, condition, drop=FALSE, inside=FALSE) {
if(inside) {
condition_call <- deparse(substitute(condition, env=parent.frame(n=1)))
}
else {
condition_call <- substitute(condition)
}
x[["user_conditions"]] <- unique(c(x[["user_conditions"]],list(condition_call)))
cat("Subset Conditions have been added to SDF")
x
}
I can call this function as:
sdf <- subset.survey.data.frame(sdf,dsex =="Male")
This adds dsex == "Male" in user_conditions attribute.
However, if I call it from within another function and a loop, it passes v1 and v2 instead of the actual variable names.
for(i in 1:length(lvls)) {
v1 <- rhs_vars[1]
v2 <- lvls[i]
print(v1) #"dsex"
print(v2) #"Male"
dsdf <- subset.survey.data.frame(sdf, v1 == v2, inside=T)
}
How can I modify the subset function so that I can get the names of v1 and v2 and then add the condition to the object?
Here is what sdf, lvls, and rhs_vars look like:
sdf <- list(user_conditions = list(),default_conditions = list(default_conditions) ,data = data_Laf, weights=weights, pvvars=pvs, fileDescription = f)
Here, data_Laf is an LaF object (http://cran.r-project.org/web/packages/LaF/index.html), weights, pvs, and f are all lists.
rhs_vars <- rhs.vars(y ~ dsex + b017451) # from formula.tools package
> rhs_vars
[1] "dsex" "b017451"
lvls is the levels of a column in a dataframe
lvls <- levels(data[,rhs_vars[1]])
"Male" "Female"
Here is a working example:
default_conditions= quote(rptsamp=="Reporting sample")
sdf <- list(user_conditions = list(),default_conditions = list(default_conditions))
class(sdf) <- "Userdefined"
subset.survey.data.frame <- function(x, condition, drop=FALSE, inside=FALSE) {
if(inside) {
condition_call <- deparse(substitute(condition, env=parent.frame(n=1)))
}
else {
condition_call <- substitute(condition)
}
x[["user_conditions"]] <- unique(c(x[["user_conditions"]],list(condition_call)))
cat("Subset Conditions have been added to X")
x
}
sdf <- subset.survey.data.frame(sdf,dsex =="Male")
print(sdf)
#This gives the correct answer and adds dsex == "Male" to user conditions
#Creating some sample data
dsex =c('1','2','1','1','2','1','1','2','1')
b017451 <- sample(c(1:100), 9)
y <- rep(10, 9)
data <- data.frame(dsex, y, b017451)
data[,'dsex'] <- factor(data[,'dsex'], levels=c("1", "2"), labels=c('Male','Female'))
require(formula.tools)
rhs_vars <- rhs.vars(y ~ dsex + b017451)
lvls <- levels(data[,rhs_vars[1]])
for(i in 1:length(lvls)) {
v1 <- rhs_vars[1]
v2 <- lvls[i]
print(v1) #"dsex"
print(v2) #"Male"
dsdf <- subset.survey.data.frame(sdf, v1 == v2, inside=F)
print(dsdf)
#this doesnt give the correct answer and adds v1 == v2 to user conditions
break
}
As @nrussell was alluding to, substitute should help you build your expressions. Then you just need to evaluate them. Here's a simple example:
v1 <- quote(cyl)
v2 <- 6
eval(substitute(subset(mtcars, v1==v2), list(v1=v1, v2=v2)))
If your v1 is a character string, you can convert it to a symbol via as.name(), because the expression needs a symbol and not a character string to work.
v1 <- "cyl"
v2 <- 6
eval(substitute(subset(mtcars, v1==v2), list(v1=as.name(v1), v2=v2)))
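The same expression can also be built with bquote(), which some find easier to read because the substituted parts are marked inline with .(). This is just an equivalent sketch of the example above, not a different technique:

```r
v1 <- "cyl"
v2 <- 6

# .( ) evaluates its contents and splices the result into the quoted call
expr <- bquote(subset(mtcars, .(as.name(v1)) == .(v2)))
expr                 # the unevaluated call, with cyl and 6 filled in
res <- eval(expr)
```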
If you're controlling the "inside" parameter, then isn't it as simple as:
if(inside) condition_call = call(substitute(condition[[1]]), as.name(condition[[2]]), condition[[3]])
This of course assumes people are only using binary conditions, but you can extend the above logic.
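Since call() expects the function name as a character string, the one-liner above may need some adapting. Here is a minimal sketch of the same idea using as.call(), assuming a binary condition whose two sides are variables in the calling frame. subset_cond is a simplified stand-in for subset.survey.data.frame, not the original function.

```r
subset_cond <- function(x, condition, inside = FALSE) {
  cond <- substitute(condition)             # e.g. the call v1 == v2
  if (inside) {
    caller <- parent.frame()
    lhs <- eval(cond[[2]], caller)          # value of v1, e.g. "dsex"
    rhs <- eval(cond[[3]], caller)          # value of v2, e.g. "Male"
    # Rebuild the call with the column name as a symbol
    cond <- as.call(list(cond[[1]], as.name(lhs), rhs))
  }
  x[["user_conditions"]] <- unique(c(x[["user_conditions"]], list(cond)))
  x
}

sdf <- list(user_conditions = list())
v1 <- "dsex"
v2 <- "Male"
sdf <- subset_cond(sdf, v1 == v2, inside = TRUE)
sdf$user_conditions[[1]]
```

With inside = FALSE the condition is stored exactly as typed, so subset_cond(sdf, dsex == "Male") keeps working as before.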
