Pass multiple arguments to function from dataframe with unknown number of columns - r

I have a function similar to this:
testfun <- function(jID, kID, d) {
  g <- paste0(jID, kID)
  date <- d
  bb <- data.frame(g, date)
  return(bb)
}
Data frame:
x=data.frame(jID = c("a","b"),kID=c("c","d"),date="20170206",stringsAsFactors = FALSE)
I want to pass each row as inputs into the function. The solutions provided here, Passing multiple arguments to a function taken from dataframe, are great, but in their case the number of columns was known. How would a solution like this:
vtestfun <- Vectorize(testfun, SIMPLIFY = FALSE)
vtestfun(x[,1], x[,2], x[,3])
be applied if the number of columns in the dataframe is not known or keeps changing?

If you can match the argument names to the column names like so:
testfun <- function(jID, kID, date) { # 'date', not 'd'
  g <- paste0(jID, kID)
  bb <- data.frame(g, date)
  return(bb)
}
You could do:
purrr::pmap(x, testfun)
Returning:
[[1]]
   g     date
1 ac 20170206

[[2]]
   g     date
1 bd 20170206
# Data used:
x <- structure(list(jID = c("a", "b"), kID = c("c", "d"), date = c("20170206", "20170206")), class = "data.frame", row.names = c(NA, -2L))
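A base-R side note (a sketch, not part of the answer above): if you want to keep the Vectorize-based function from the question and still avoid listing every column, you can pass the columns positionally with do.call(), which works for any number of columns as long as the column order matches the function's argument order:
vtestfun <- Vectorize(testfun, SIMPLIFY = FALSE)
# pass every column of x as a separate positional argument
do.call(vtestfun, unname(as.list(x)))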

Related

Compare two data.frames to find similar values from data.frame 1 in data.frame 2

I have two data.frames: name and searches
name <- data.frame(
  A = c("example", "firstly", "second.com"))
searches <- data.frame(
  A = c("example.com", "secondly", "first"),
  B = c("test", "test.com", "test1"))
I want to search in data.frame "searches" for the values in data.frame "name". If there is a similar value (not exactly the same) I want R to return the value from name and from searches in a new row in a new table.
So a new data.frame could be
result <- data.frame(
  A = c("example", "firstly", "second.com"),
  B = c("example.com", "first", "secondly"),
  C = c("test", "test1", "test.com"))
Is that possible?
You can use the stringr package in R to do this. For example, if you have
name <- data.frame(
  A = c("example", "firstly", "second.com"))
searches <- data.frame(
  A = c("example.com", "secondly", "first"),
  B = c("test", "test.com", "test1"))
then you can use
str_extract(searches$A, '.*example.*')
This gives an output of
> str_extract(searches$A, '.*example.*')
[1] "example.com" NA NA
If you set this up with an appropriate for loop to iterate over the elements of your name data frame and the cells of your searches data frame, you can pick up all matches and extract them as desired (a rough sketch follows).
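A minimal sketch of such a loop, following the str_extract() pattern above (the paste0() pattern building and the way results are collected are assumptions for illustration; it only picks up cases where a name value occurs literally inside a searches value):
library(stringr)
result <- list()
for (nm in name$A) {
  # look for the name value anywhere inside searches$A
  hits <- str_extract(searches$A, paste0(".*", nm, ".*"))
  found <- which(!is.na(hits))
  if (length(found) > 0) {
    result[[nm]] <- data.frame(name = nm,
                               A = searches$A[found],
                               B = searches$B[found])
  }
}
do.call(rbind, result)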
Use the stringdist_join() function from the fuzzyjoin package.
library(fuzzyjoin)
name <- data.frame(
  A = c("example", "firstly", "second.com")
)
searches <- data.frame(
  A = c("example.com", "secondly", "first"),
  B = c("test", "test.com", "test1")
)
result <- stringdist_join(name, searches, by = "A", max_dist = 5)
Which results in:
> print(result)
         A.x         A.y        B
1    example example.com     test
2    firstly       first    test1
3 second.com    secondly test.com

Returning specific values within a row

I have a single row of data with 50 columns from a csv, which I've put into a data frame. The data is arranged across the spreadsheet like this:
"FSEG-DFGS-THDG", "SGDG-SGRE-JJDF", "DIDC-DFGS-LEMS"...
How would I select only the middle part of each element (e.g., "DFGS" in the first one, "SGRE" in the second, etc.), count their occurrences and display the results?
I have tried using the strsplit function but I couldn't get it to work for the entire row of data. I'm thinking a loop of some kind might be what I need.
You can do unlist(strsplit(x, '-'))[seq(2, length(x)*3, 3)] (assuming your data is consistently of the form A-B-C).
# E.g.
fun <- function(x) unlist(strsplit(x, '-'))[seq(2, length(x)*3, 3)]
fun(c("FSEG-DFGS-THDG", "SGDG-SGRE-JJDF", "DIDC-DFGS-LEMS"))
# [1] "DFGS" "SGRE" "DFGS"
Edit
# Data frame
df <- structure(list(a = "FSEG-DFGS-THDG", b = "SGDG-SGRE-JJDF", c = "DIDC-DFGS-LEMS"),
class = "data.frame", row.names = c(NA, -1L))
fun(t(df[1,]))
# [1] "DFGS" "SGRE" "DFGS"
First we create a function strng() and then we apply() it on every column of df. strsplit() splits a string by "-" and strng() returns the second part.
df = data.frame(a = "ab-bc-ca", b = "gn-bc-ca", c = "kj-ll-mn")
strng <- function(x) {
  strsplit(x, "-")[[1]][2]
}
# table() outputs the frequency of elements in the input
table(apply(df, MARGIN = 2, FUN = strng))
# output:
# bc ll
#  2  1

Passing dataframe as argument to function

I am writing a function to process data (row by row) from a huge data frame that always has the same column names. So I want to pass the data frame itself as an argument to the function and read out the information I need from the individual rows. However, when I try to use it as an argument I can't read the information from it for some reason.
Dataframe:
DF <- data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"))
My code:
List <- do.call(list, Map(function(DT) {
  DT <- as.data.frame(DT)
  aa <- as.numeric(strsplit(DT$Age, ","))
  mean.aa <- mean(aa)
}, DF))
Trying this I get a list with the column names, but all values are NULL.
Expected output:
My expected output is a list with length equal to the number of rows in the data frame. Under each list index there should be another list with the age of the corresponding row (and also other values from the same row of the data frame, later).
DF <- apply(data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"), "mean.aa" = c(179.7143, 100.8571)), 1, as.list)
What am I doing wrong?
Here is one way:
DF <- data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"))
apply(DF, 1, function(row) {
  aa <- as.numeric(strsplit(row["Age"], ",")[[1]])
  row["mean.aa"] <- mean(aa)
  as.list(row)
})
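Note that apply() converts the data frame to a character matrix, so every field in the result ends up as a character string. If you need to keep the original column types, a sketch of the same idea using lapply() over the row indices (hypothetical, indexing the data frame directly instead):
lapply(seq_len(nrow(DF)), function(i) {
  row <- DF[i, ]
  aa <- as.numeric(strsplit(as.character(row$Age), ",")[[1]])
  c(as.list(row), mean.aa = mean(aa))
})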

Iterating over columns in a data frame in order to replace values from matching data in list of data frames

I'm interested in building a function, making use of apply/sapply or Map, that would iterate over the available columns in dta and replace the values in each column with matched values from the data frame available in a nameless list of data frames, where the list item index corresponds to the column number of the dta data frame.
Example
Given objects:
set.seed(1)
size <- 20
# Data set
dta <-
  data.frame(
    unitA = sample(LETTERS[1:4], size = size, replace = TRUE),
    unitB = sample(letters[16:20], size = size, replace = TRUE),
    unitC = sample(month.abb[1:4], size = size, replace = TRUE),
    someValue = sample(1:1e6, size = size, replace = TRUE)
  )
# Meta data
lstMeta <- list(
  # Unit A definitions
  data.frame(
    V1 = c("A", "B", "D"),
    V2 = c("Letter A", "Letter B", "Letter D")
  ),
  # Unit B definitions
  data.frame(
    V1 = c("t", "q"),
    V2 = c("small t", "small q")
  ),
  # Unit C definitions
  data.frame(
    V1 = c("Mar", "Jan"),
    V2 = c("March", "January")
  )
)
Desired results
When applied on dta, the function should return a data.frame corresponding to the extract below:
unitA unitB unitC someValue
Letter B small t Apr 912876
Letter B small q March 293604
C s Apr 459066
Letter D p March 332395
Letter A small q March 650871
Letter D small q Apr 258017
Letter D p January 478546
C small q Feb 766311
C small t March 84247
Letter A small q March 875322
Letter A r Feb 339073
Letter A r Apr 839441
C r Feb 346684
Letter B p January 333775
Letter D small t January 476352
(...)
Existing approach
replaceLbls <- function(dataSet, lstDict) {
  sapply(seq_along(dataSet), function(i) {
    # Take corresponding metadata data frame
    dtaDict <- lstDict[[i]]
    # Replace values in selected column
    # Where matches on V1, push corresponding values from V2
    dataSet[, i][match(dataSet[, i], dtaDict[, 1])] <- dtaDict[, 2][match(dtaDict[, 1], dataSet[, i])]
  })
}
# Testing -----------------------------------------------------------------
replaceLbls(dataSet = dta, lstDict = lstMeta)
Of course the approach proposed above does not work as it will try to use NA in assignments; but it summarises what I want to achieve:
Error in x[...] <- m : NAs are not allowed in subscripted assignments
In addition: Warning message:
In `[<-.factor`(`*tmp*`, match(dataSet[, i], dtaDict[, 1]), value = c(NA, :
  invalid factor level, NA generated
Additional remarks
Source data set
The key characteristics of the data are:
The list is nameless so subsetting has to be done by item numbers not by names
Item number correspond to column numbers
There is no full match between metadata data frames available in the list of data frames and unit columns available in the data
The someValue column also should be iterated over as it may contain labels that should be replaced
Solution
I'm not interested in dplyr/data.table/sqldf-based solutions.
I'm not interested in nested for-loops
I have a hacky solution that doesn't use for loops or other packages. I needed to convert the factors to characters for it to work but you might be able to improve my solution.
The solution works by matching only the values that are found in your lstMeta, using a vector of indices where matches occur. I also used the <<- operator. If you're better at R than me, you can probably improve this.
set.seed(1)
size <- 20
# Data set
dta <-
  data.frame(
    unitA = sample(LETTERS[1:4], size = size, replace = TRUE),
    unitB = sample(letters[16:20], size = size, replace = TRUE),
    unitC = sample(month.abb[1:4], size = size, replace = TRUE),
    someValue = sample(1:1e6, size = size, replace = TRUE),
    stringsAsFactors = F
  )
# Meta data
lstMeta <- list(
  # Unit A definitions
  data.frame(
    V1 = c("A", "B", "D"),
    V2 = c("Letter A", "Letter B", "Letter D"),
    stringsAsFactors = F
  ),
  # Unit B definitions
  data.frame(
    V1 = c("t", "q"),
    V2 = c("small t", "small q"),
    stringsAsFactors = F
  ),
  # Unit C definitions
  data.frame(
    V1 = c("Mar", "Jan"),
    V2 = c("March", "January"),
    stringsAsFactors = F
  )
)
replaceLbls <- function(dataSet, lstDict) {
  sapply(1:3, function(i) {
    # Take corresponding metadata data frame
    dtaDict <- lstDict[[i]]
    # Replace values in selected column
    # Where matches on V1, push corresponding values from V2
    myUniques <- which(dataSet[, i] %in% dtaDict[, 1])
    dataSet[myUniques, i] <<- dtaDict[, 2][match(dataSet[myUniques, i], dtaDict[, 1])]
  })
  return(dataSet)
}
# Testing -----------------------------------------------------------------
replaceLbls(dataSet = dta, lstDict = lstMeta)
The following approach works well for the example data:
replaceLbls <- function(dataSet, lstDict) {
  dataSet[seq_along(lstDict)] <- Map(function(x, lst) {
    x <- as.character(x)
    idx <- match(x, as.character(lst$V1))
    replace(x, !is.na(idx), as.character(lst$V2)[na.omit(idx)])
  }, dataSet[seq_along(lstDict)], lstDict)
  dataSet
}
head(replaceLbls(dta, lstMeta))
#      unitA   unitB unitC someValue
# 1 Letter B small t   Apr    912876
# 2 Letter B small q March    293604
# 3        C       s   Apr    459066
# 4 Letter D       p March    332395
# 5 Letter A small q March    650871
# 6 Letter D small q   Apr    258017
This assumes that you want to apply the changes to the first X columns of the data, where X is the length of the meta-list. You might want to include an extra step to convert back to factor, since this approach converts the adjusted columns to character class.
Another remark on factors: you could potentially speed up the performance by working only on the levels of any factor variables instead of the whole column (a rough sketch follows). The general process would be similar but requires a few more steps to check classes etc.
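A minimal sketch of that levels-based idea (the wiring below is an assumption for illustration, not part of the answer above; it only rewrites the levels attribute, so each factor column is scanned once per level rather than once per row, and non-factor columns are returned unchanged):
replaceLevels <- function(x, lst) {
  if (!is.factor(x)) return(x)
  idx <- match(levels(x), as.character(lst$V1))
  levels(x)[!is.na(idx)] <- as.character(lst$V2)[na.omit(idx)]
  x
}
dta[seq_along(lstMeta)] <- Map(replaceLevels, dta[seq_along(lstMeta)], lstMeta)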
You can also try this:
mapr <- function(t, meta) {
  ind <- match(t, meta$V1)
  if (!is.na(ind)) {
    return(meta$V2[ind])
  } else {
    return(t)
  }
}
Then using sapply:
dta <- as.data.frame(cbind(
  sapply(1:3, function(t, df, meta) {
    sapply(df[, t], mapr, lstMeta[[t]])
  }, dta, lstMeta, simplify = T),
  dta[, 4]
))
A couple of mapply() calls can do the job:
f1 <- function(df, lst){
  d1 <- setNames(data.frame(mapply(function(x, y) x$V2[match(y, x$V1)], lst, df[1:3]),
                            df$someValue, stringsAsFactors = FALSE),
                 names(df))
  as.data.frame(mapply(function(x, y) replace(x, is.na(x), y[is.na(x)]), d1, df))
}
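For example (a hypothetical call, using the dta and lstMeta objects defined above):
head(f1(dta, lstMeta))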

How to add columns to a blank data frame column by column in R?

I try to create a data.frame and then add some columns to it.
I tried the following code, but it does not work:
test.dim <- as.data.frame(matrix(nrow=0, ncol=4))
names <- c("A", "B", "C", "D")
colnames(test.dim) <- names
for (i in 1:4) {
  name <- names[i]
  # do some calculations, at last get another data.frame named x.data
  mean.data <- apply(x.data, 1, mean)
  test.dim[, name] <- mean.data
}
Usually one would already have a data.frame (call it df) and simply add columns by calling df$newColName = values or df[,newColNames] = frame_of_values, for example:
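A minimal sketch of that (df and the column names here are placeholders for illustration):
df <- data.frame(A = 1:3)
# add a single new column by name
df$B <- c(10, 20, 30)
# add several new columns at once from a frame of values
df[, c("C", "D")] <- data.frame(C = 4:6, D = 7:9)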
Your question indicates that you are separating the creation of your values from putting them in the data frame (which I do not recommend). But if you really want to start from a zero row zero col frame here are some options:
colnamesToAdd = LETTERS[1:4]
test.dim = data.frame(matrix(rep(NA, length(colnamesToAdd)), nrow = 1))
colnames(test.dim) = colnamesToAdd
test.dim = test.dim[-1,]
Another option:
colnamesToAdd = LETTERS[1:4]
test.dim = data.frame("USELESS" = NA)
test.dim[,colnamesToAdd] = NA
test.dim = test.dim[-1,-1]
If you are looking to add a mean to your table and repeat it for every factor:
library(data.table);
test.dim = data.table("FACTOR" = sample(letters[1:4],100,replace=TRUE), "VALUE" = runif(100), "MEAN" = NA)
means = test.dim[,list(AVG=mean(VALUE)),by="FACTOR"]
# without data.table: by(test.dim$VALUE, test.dim$FACTOR, mean)
for (x in 1:nrow(means)) {
  test.dim$MEAN[test.dim$FACTOR == means$FACTOR[x]] = means$AVG[x]
}
# normally I would use the foreach package instead of this last for loop
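For reference, a sketch of that foreach variant (an assumption about the wiring; it looks up the group mean for each row and binds the combined result back on, returning values rather than assigning inside the loop):
library(foreach)
# look up the group mean for each row's FACTOR and combine into one vector
test.dim$MEAN <- foreach(f = test.dim$FACTOR, .combine = c) %do% {
  means$AVG[means$FACTOR == f]
}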
