dataframe is collapsed to a vector when given to function - r

I am trying to use the contents of a dataframe inside a function. Here is a simplified example of my problem.
df <- data.frame(v1=1:10,v2=23:32)
df2 <- data.frame(v1=1:3,v2=3:5)
fxm <- function(x,y,q)
{
  return(cbind(q[q[,2]==x,],y))
}
mapply(fxm,df[,1],df[,2],q=df2)
Error in q[, 2] : incorrect number of dimensions
If I add a print statement:
df <- data.frame(v1=1:10,v2=23:32)
df2 <- data.frame(v1=1:3,v2=3:5)
fxm <- function(x,y,q)
{
  print(q)
  return(cbind(q[q[,2]==x,],y))
}
mapply(fxm,df[,1],df[,2],q=df2)
I get:
[1] 1 2 3
Error in q[, 2] : incorrect number of dimensions
The data frame is converted to a vector of its first column for some reason. How can I stop this from happening, and have the whole dataframe accessible to my function?
I am trying to select a subset of the dataframe and return it based on the other two parameters of the function, which is why I need the whole dataframe to be passed to the function.

If I understand you correctly, you want the whole of q = df2 passed to the fxm function you define, am I right?
The problem is that in your code mapply treats q = df2 as one more argument to iterate over, just like df[,1] and df[,2]. A data frame is a list of columns, so on the first call q is the first column of df2, which is why print(q) shows 1 2 3. To pass the whole data frame to every call, supply it through the MoreArgs parameter of mapply like this:
df <- data.frame(v1=1:10,v2=23:32)
df2 <- data.frame(v1=1:3,v2=3:5)
fxm <- function(x,y,q)
{
  print(q)
  return(cbind(q[q[,2]==x,],y))
}
mapply(fxm,df[,1],df[,2], MoreArgs = list(q=df2))
This still doesn't run cleanly for me: an error is raised elsewhere in the function. However, the printed output shows the whole data.frame reaching fxm, which solves your original problem.
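A minimal sketch of how the example might be made to run end to end, assuming the remaining error comes from cbind receiving a zero-row subset whenever no row of q matches x (which is the case for x = 1). The NULL return for non-matches, SIMPLIFY = FALSE and the final rbind are additions for illustration, not part of the original answer:

```r
df  <- data.frame(v1 = 1:10, v2 = 23:32)
df2 <- data.frame(v1 = 1:3,  v2 = 3:5)

fxm <- function(x, y, q) {
  # Rows of q whose second column equals x; drop = FALSE keeps a data frame
  hit <- q[q[, 2] == x, , drop = FALSE]
  # Assumption: skip values of x that match nothing instead of calling cbind
  if (nrow(hit) == 0) return(NULL)
  cbind(hit, y = y)
}

# MoreArgs passes df2 whole to every call; SIMPLIFY = FALSE keeps a list
res <- mapply(fxm, df[, 1], df[, 2],
              MoreArgs = list(q = df2), SIMPLIFY = FALSE)

# Drop the NULL entries and stack the matching rows into one data frame
res <- do.call(rbind, Filter(Negate(is.null), res))
print(res)
```

Returning NULL for non-matches is only one option; returning the zero-row subset with an empty y column would also work if the column structure has to be preserved.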

Related

using R - passing rows from 2 data.frames using purrr::map to calculate a result using values from both data.frames

I have 2 data.frames that store values that will be retrieved and used in an equation. I would like to calculate the result for each row in df1 using each row in df2... basically iterating through the first df rows with each row of the second df.
df1 <- tibble(Type=c("atype","btype","ctype"),
h=c("1","2","3"),
w=c("4","5","6"),
ED=c("101","102","103"))
df2 <- tibble(Item=c("123-htc","but-456","xtc","newID"),
limit=c("rnorm(1)","12","13","14"),
zone=c("30","40","30","11"))
# Note: values in the data frames are stored as character strings so that they can also hold calls to random number functions
resItm <- list()
resTyp <- list()
for (i in 1:length(df1$Type)) {
  h <- NULL
  w <- NULL
  ED <- NULL
  limit <- NULL
  zone <- NULL
  for (j in 1:length(df2$Item)) {
    Typ <- df1[i,]
    Itm <- df2[j,]
    h <- eval(str2lang(Typ$h))
    w <- eval(str2lang(Typ$w))
    ED <- eval(str2lang(Typ$ED))
    limit <- eval(str2lang(Itm$limit))
    zone <- eval(str2lang(Itm$zone))
    res1 <- (h * limit * ED) / (zone)
    resItm[[j]] <- list(Item=Itm$Item, Result=res1)
  }
  resTyp[[i]] <- list(Type=Typ$Type, Res_Item = resItm)
}
str(resTyp)
This worked fine with nested for loops and gave me the expected results. However, I was also trying to do this with purrr::map, and that's when it didn't work so well:
map2(df1,df2,function(.x,.y) {(eval(str2lang(.x$h))*eval(str2lang(.y$limit))*eval(str2lang(.x$ED)))/eval(str2lang(.y$zone))})
#Error in `map2()`:
#! Can't recycle `.x` (size 4) to match `.y` (size 3).
#Run `rlang::last_error()` to see where the error occurred.
How do I get this to work? I also tried nested map() calls, and tried putting the calculation in a named function, map(df1,~map(df2,~c(...))), but that doesn't let me get at the values in each column.
Surely passing two data frames row-wise to a function, where each row of one is combined with each row of the other, has been tried before? Any help appreciated!
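One possible way to pair every row of df1 with every row of df2, sketched in base R rather than purrr. map2 walks the columns of the two data frames in parallel, which is why it complains about sizes 4 and 3. The sketch instead builds all row-index pairs with expand.grid and looks the values up by index. The helper ev and the pairs table are hypothetical names, and plain data frames stand in for the tibbles:

```r
df1 <- data.frame(Type = c("atype", "btype", "ctype"),
                  h  = c("1", "2", "3"),
                  ED = c("101", "102", "103"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Item  = c("123-htc", "but-456", "xtc", "newID"),
                  limit = c("rnorm(1)", "12", "13", "14"),
                  zone  = c("30", "40", "30", "11"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: evaluate a string such as "12" or "rnorm(1)" as R code
ev <- function(s) eval(str2lang(s))

# All (i, j) row pairs: i indexes df1, j indexes df2 (i varies fastest)
pairs <- expand.grid(i = seq_len(nrow(df1)), j = seq_len(nrow(df2)))

pairs$Type <- df1$Type[pairs$i]
pairs$Item <- df2$Item[pairs$j]
pairs$Result <- mapply(function(i, j) {
  (ev(df1$h[i]) * ev(df2$limit[j]) * ev(df1$ED[i])) / ev(df2$zone[j])
}, pairs$i, pairs$j)

print(pairs)
```

The same index-pair idea should carry over to purrr by mapping over pairs$i and pairs$j instead of over the data frames themselves.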

Code Breaking When Turned Into Custom Function?

I am putting together a summary table from a larger data frame. I noticed that I was re-using the following code but with different %like% characters:
# This code creates a df of values where the row name matches the character
df <- (data[which(data$`col_name` %like% "Total"),])
df <- df[3:ncol(df)]
df[is.na(df)] <- 0
# This creates a row composed of the sum of each column
for (i in seq_along(df)) {
  df[10, i] <- sum(df[i])
}
# This inserts the resulting values into a separate summary table
summary[1, 2:ncol(summary)] <- df[nrow(df),]
To keep the code dry and avoid repetition, I thought it would be best to translate this into a custom function that I could then call with different strings:
create_row <- function(x) {
  df <- (data[which(data$`Crop year` %like% as.character(x)),])
  df <- df[3:ncol(df)]
  df[is.na(df)] <- 0
  for (i in seq_along(df)) {
    df[10, i] <- sum(df[i])
  }
}
# Then populate the summary table as before with the results
total <- create_row("Total")
summary[1, 2:ncol(summary)] <- total[nrow(total),]
However when attempting to run this, it simply returns an empty variable.
Through trial and error, I have found that the line of code causing this is:
df[is.na(df)] <- 0
The code works absolutely fine when run line by line outside of this custom function.
As mentioned in the comments, the function will work if you add return(df) at the end. A function returns the value of its last evaluated expression, and here that is the for loop, which evaluates to NULL rather than to df.
Also, as @alan mentioned in the comments, you can use colSums to get the sum of each column directly instead of looping over the columns.
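A small sketch of both suggestions combined, under some assumptions: the toy data frame is invented for illustration, data is passed in as an argument instead of being read from the global environment, and base grepl stands in for %like% so the example has no package dependency.

```r
# Toy data (hypothetical): two label columns followed by numeric columns
data <- data.frame(
  `Crop year` = c("Total wheat", "Total corn", "Other"),
  region = c("a", "b", "c"),
  y2020 = c(1, NA, 5),
  y2021 = c(2, 3, 7),
  check.names = FALSE
)

create_row <- function(x, data) {
  # Keep rows whose `Crop year` contains x (grepl used in place of %like%)
  df <- data[grepl(x, data$`Crop year`, fixed = TRUE), ]
  df <- df[3:ncol(df)]
  df[is.na(df)] <- 0
  # colSums is the last expression, so its value is what the function returns
  colSums(df)
}

total <- create_row("Total", data)
print(total)
```

Because colSums already returns one value per column, the result can be assigned straight into a row of the summary table without picking out the last row of df.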

How to do a complex edit of columns of all data frames in a list?

I have a list of 185 data frames called WaFramesNumeric. Each dataframe has several hundred columns and thousands of rows. I want to edit every data frame, so that it leaves all numeric columns as well as any non-numeric columns that I specify.
Using:
for(i in seq_along(WaFramesNumeric)) {
  WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][,sapply(WaFramesNumeric[[i]],is.numeric)]
}
successfully makes each dataframe contain only its numeric columns.
I've tried to amend this with lines to add specific columns. I have tried:
for (i in seq_along(WaFramesNumeric)) {
  a <- WaFramesNumeric[[i]]$Device_Name
  WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][,sapply(WaFramesNumeric[[i]],is.numeric)]
  cbind(WaFramesNumeric[[i]],a)
}
and in an attempt to call the column numbers of all integer columns as well as the specific ones and then combine based on that:
for (i in seq_along(WaFramesNumeric)) {
  f <- which(sapply(WaFramesNumeric[[i]],is.numeric))
  m <- match("Cost_Center",colnames(WaFramesNumeric[[i]]))
  n <- match("Device_Name",colnames(WaFramesNumeric[[i]]))
  combine <- c(f,m,n)
  WaFramesNumeric[[i]][,i,combine]
}
These all return errors and I am stumped as to how I could do this. WaFramesNumeric is a copy of another list of dataframes (WaFramesNumeric <- WaFramesAll) and so I also tried adding the specific columns from the WaFramesAll but this was not successful.
I appreciate any advice you can give and I apologize if any of this is unclear.
You are mistakenly assuming that the last command in a for loop is meaningful. It is not. Since you never assign its result anywhere (the cbind and the indexing of WaFramesNumeric...), it is silently discarded.
Additionally, you are over-indexing your data.frame in the third code block. First, it's using i within the data.frame, even though i is an index within the list of data.frames, not the frame itself. Second (perhaps caused by this), you are trying to index three dimensions of a 2D frame. Just change the last indexing from [,i,combine] to either [,combine] or [combine].
The third problem (though perhaps not seen yet) is that match will return NA if nothing is found. Indexing a frame with an NA returns an error (try mtcars[,NA] to see). I suggest replacing match with grep: it returns integer(0) when nothing is found, which is what you want in this case.
for (i in seq_along(WaFramesNumeric)) {
  f <- which(sapply(WaFramesNumeric[[i]], is.numeric))
  m <- grep("Cost_Center", colnames(WaFramesNumeric[[i]]))
  n <- grep("Device_Name", colnames(WaFramesNumeric[[i]]))
  combine <- c(f,m,n)
  WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][combine]
}
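An alternative sketch of the same selection using exact name matching, in case partial matches are a concern: grep treats its first argument as a pattern, so "Cost_Center" would also pick up a column named, say, "Cost_Center_2". The toy list frames and the vector keep_names are made up for illustration.

```r
# Toy stand-in for WaFramesNumeric (hypothetical data)
frames <- list(
  data.frame(Device_Name = c("a", "b"), Cost_Center = c("x", "y"),
             note = c("n1", "n2"), value = c(1.5, 2.5),
             stringsAsFactors = FALSE),
  data.frame(Device_Name = c("c", "d"), count = 1:2, other = c("p", "q"),
             stringsAsFactors = FALSE)
)

keep_names <- c("Cost_Center", "Device_Name")

frames <- lapply(frames, function(d) {
  # TRUE for numeric columns and for columns named exactly in keep_names;
  # a name that is absent from d simply contributes nothing
  keep <- sapply(d, is.numeric) | names(d) %in% keep_names
  d[keep]
})

print(lapply(frames, names))
```

Using lapply also avoids the discarded-result problem, because the value of the function is what ends up in the list.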
I'm not sure what you mean by "an attempt to call the column numbers of all integer columns...", but if you want to go through a list of data frames, select the columns that satisfy some function and also keep some columns given by name, you can do it like this:
df <- data.frame(a=rnorm(20), b=rnorm(20), c=letters[1:20], d=letters[1:20], stringsAsFactors = FALSE)
WaFramesNumeric <- rep(list(df), 2)
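Selector <- function(data, select_func, select_names) {
  select_func <- match.fun(select_func)
  # Positions of the requested names, ignoring any that are absent
  idx_names <- match(select_names, colnames(data))
  idx_names <- idx_names[!is.na(idx_names)]
  # Positions of the columns for which select_func is TRUE
  idx_func <- which(sapply(data, select_func))
  idx <- unique(c(idx_func, idx_names))
  # drop = FALSE keeps a data frame even when only one column is selected
  return(data[, idx, drop = FALSE])
}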
res <- lapply(X = WaFramesNumeric, FUN = Selector, select_names=c("c"), select_func = is.numeric)

R - cannot use variable generated by for loop as argument in table()

I'm trying to extract prop.test p-values over a set of columns in a dataframe existing in the global environment (df) and save them as a dataframe. I have a criteria column and 19 variable columns (among others)
proportiontest <- function() {
  prop_df <- data.frame()
  for(i in 1:19) {
    x <- paste("df$var_", i, sep="")
    y <- (prop.test(table(df$criteria, x), correct=FALSE))$p.value
    z <- cbind (x, y)
    prop_df <- rbind(prop_df, z)
  }
  assign("prop_df",prop_df,envir = .GlobalEnv)
}
proportiontest()
When I run this I get the error:
Error in table(df$criteria, x) : all arguments must have the same length
When I manually insert the column name into the function (instead of x) everything runs fine. e.g.
y <- (prop.test(table(df$criteria, df$var_1), correct=FALSE))$p.value
I seem to have the problem of using the variable (x) value generated via the for loop as the argument.
What am I missing or doing wrong in this case? I have tried passing x into the table() function as.String(x) as.character(x) among countless others to no avail. I cannot seem to understand in which form the argument must be. I'm probably misunderstanding something very basic in R but it's driving me insane and I cannot seem to formulate the question in a manner where google/SO can help me.
Currently, x in your function is just a string, not the column it names, so table() receives a length-one character vector instead of a column of df. If you want to use a column from your data frame df, you can do this in your for loop:
x <- df[,i]
You'll then need to change z, otherwise you'll be cbinding a whole column to a single p-value. Maybe just change it to this:
z <- cbind(i,y)
so that you know which df column belongs to each p-value.
You should also be careful because the function looks for df inside itself first and only then moves to the enclosing environment if it doesn't find it, so you could pass df as an argument to avoid any mistakes.
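A sketch of the function with df passed as an argument and the columns looked up by name. Note that df[,i] selects by position, which only lines up with var_i if the var columns come first, so df[[name]] is used here instead. The toy data, the vars argument and the output column names are assumptions for illustration.

```r
proportiontest <- function(df, vars) {
  # One p-value per variable name; df[[v]] fetches the column called v
  p <- sapply(vars, function(v) {
    prop.test(table(df$criteria, df[[v]]), correct = FALSE)$p.value
  })
  data.frame(variable = vars, p_value = unname(p),
             stringsAsFactors = FALSE)
}

# Toy data (hypothetical): a two-level criteria column and two 0/1 variables
df <- data.frame(
  criteria = rep(c("a", "b"), each = 50),
  var_1 = rep(c(0, 1), 50),
  var_2 = c(rep(0, 40), rep(1, 10), rep(0, 20), rep(1, 30))
)

prop_df <- proportiontest(df, paste0("var_", 1:2))
print(prop_df)
```

Returning the data frame and assigning it at the call site also removes the need for assign() into the global environment.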

R: Test condition on column of dataframe elements within list; return smaller list

My goal is to take a list of dataframes, see if a specific column of each data frame has a max value of 0, and if so, remove that data frame from my list.
Right now I am looping over the names of the list. Given that this is R, there must be a better way. I feel I need some function applied through lapply() to get this right. I've also considered ddply(), but I think that may be overkill. Here is what I have so far:
# Make df of First element
myColumn <- rep ("ElementA",times=10)
values <- seq(1,10)
a <- data.frame(myColumn,values)
# Make df of second element
myColumn <- rep ("ElementB",times=10)
values <- rep(0,10)
b <- data.frame(myColumn,values)
# Bind the dataframes together
df <- rbind(a,b)
#Now split the dataframes based on element name
myList <- split(df,df$myColumn)
# Now loop through element lists and check for max of 0 in values
for (name in names(myList)) { # Loop through List
  if (max(myList[[name]]$values) == 0) { # Check Max for 0
    myList <- myList[[-names]] # If 0, remove element from list
  } # Close If
} # Close Loop
Error in -names : invalid argument to unary operator
I've tested my code outside the loop, and it all seems to work.
Any help is greatly appreciated. Thanks!
You can use this:
myList <- myList[sapply(myList, function(d) max(d$values) != 0)]
instead of the for() loop. Note that dataframes with zero rows are kept: max() of an empty vector returns -Inf with a warning, and -Inf != 0 is TRUE.
To ensure empty dataframes are removed, use this:
myList <- myList[sapply(myList, function(d) if(nrow(d)==0) FALSE else max(d$values)!=0)]
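The same filter can be sketched with vapply, which guarantees one logical value per list element even if the list is empty. The toy list mirrors the question's data; keep is a hypothetical name.

```r
# Rebuild the example list: ElementA has non-zero values, ElementB is all zero
myList <- list(
  ElementA = data.frame(myColumn = "ElementA", values = 1:10),
  ElementB = data.frame(myColumn = "ElementB", values = rep(0, 10))
)

# TRUE only for non-empty data frames whose maximum value is not 0
keep <- vapply(myList,
               function(d) nrow(d) > 0 && max(d$values) != 0,
               logical(1))

myList <- myList[keep]
print(names(myList))
```

Single brackets are used for the subsetting so that the result is still a list of data frames.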
