Note: this is a direct follow-up to this previous question.
I have a very long data frame consisting of two columns that I am using as arguments for a function that will find the value of a third column, using mapply like so:
df$3rd <- mapply(myfunction, A=df$1st, B=df$2nd)
where myfunction has arguments A and B. While this works great for small datasets, it stalls for large datasets, so I was thinking a good way to approach the problem would be to apply this function using ddply. I don't know if ddply is the best approach for this problem, and I am also having some trouble with the syntax, so suggestions on either front would be appreciated.
This is what I am trying:
> df$3rd <- ddply(df, .(1st), function(x) x$3rd <-
> mapply(myfunction, A=x$1st, B=df$second))
and this is the error I am getting:
Error in `$<-.data.frame`(`*tmp*`, "n", value = c(1L, 1L, 1L, 1L, 1L, :
replacement has 112 rows, data has 16
EDIT:
In light of the answer and comments, I am posting a small reproducible example below - it is one of the answers from the previous question. However, as the commenters below note, ddply is probably not the way to go. I am trying Ramnath's solution right now.
library(reshape2)
foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
                  y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))
bar <- data.frame(x = c('c', 'c', 'c', 'd', 'd', 'd'),
                  y = c('ab', 'xy', 'xz', 'xy', 'fx', 'xz'))
nShared <- function(A, B) {
  length(intersect(with(foo, y[x==A]), with(bar, y[x==B])))
}
# Enumerate all combinations of groups in foo and bar
(combos <- expand.grid(foo.x=unique(foo$x), bar.x=unique(bar$x)))
# Find number of elements in common among all pairs of groups
combos$n <- mapply(nShared, A=combos$foo.x, B=combos$bar.x)
# Reshape results into matrix form
dcast(combos, foo.x ~ bar.x)
#   foo.x c d
# 1     a 1 0
# 2     b 0 1
ddply isn't what you're after here; ddply(df, .(1st), FUNCTION) is more like:
for each val in unique(df$1st)
    outdf[nrow(outdf)+1,] = FUNCTION( df[df$1st==val] )
That is, it builds outdf by applying FUNCTION to the subsets of df determined by column 1st.
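A tiny illustration of that split-apply-combine behaviour, on made-up data (requires plyr):
library(plyr)
d <- data.frame(g = c('a', 'a', 'b'), v = c(1, 2, 5))
# one output row per unique value of g
ddply(d, .(g), function(sub) data.frame(n = nrow(sub), total = sum(sub$v)))
#   g n total
# 1 a 2     3
# 2 b 1     5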
In any case, I think your error might be because you have df instead of x in function(x) x$3rd<-mapply(myfunction,A=x$1st, B=df$second) (the B argument)? Although it is hard to tell without a working example.
What exactly does myfunction do? I think your best bet is to vectorise myfunction so that you can just do df$third <- myfunction( A=df$first, B=df$second ).
For example, if myfunction <- function(A,B) { A+B }, instead of doing mapply(myfunction,df$first,df$second) you could equivalently do myfunction(df$first,df$second) and not even need mapply at all.
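For instance, a minimal sketch along those lines (hypothetical column names, assuming myfunction is element-wise):
myfunction <- function(A, B) { A + B }                # already vectorised
df <- data.frame(first = 1:5, second = 6:10)
df$third <- myfunction(A = df$first, B = df$second)   # one vectorised call, no mapply needed
df$third
# [1]  7  9 11 13 15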
Related
I am building a shiny app. The user will need to be able to reduce the data by selecting variables and filtering on specific values for those variables. I am stuck trying to get a generalizable function that can work based on all possible selections.
Here is an example - I skip the shiny code because I think the problem is with the function:
#sample dataframe
df <- data.frame('date' = c(1, 2, 3, 2, 2, 3, 1),
                 'time' = c('a', 'b', 'c', 'e', 'b', 'a', 'e'),
                 'place' = c('A', 'A', 'A', 'H', 'A', 'H', 'H'),
                 'result' = c('W', 'W', 'L', 'W', 'W', 'L', 'L'))
If the user selected date and result, with the date values 1 and 2 and the result value W, I would do the following:
out <- df %>%
  select(date, result) %>%
  filter(date %in% c(1, 2)) %>%
  filter(result %in% c('W'))
The challenge I am having is that the user can select any unique combination of variables and values. Using the input$ values from my shiny app, I can get the selected variables into a vector, and the selected values into a list that positionally matches the selected variables. For example:
selected_variables <- c('date', 'result')
selected_values <- list(c(1,2), c('W'))
What I think I then need is a generalizable function that will match up the filter calls with the correct variables. Something like:
# function that takes a data frame, a vector of selected variables,
# and a list of vectors of chosen values for each variable
# Returns a reduced table of selected variables, filtered values
table_reducer <- function(df, select_var, filter_values) {
  # select the variables
  out <- df %>%
    select(select_var)
  # now filter each variable by the values contained in the list
  out <- [for loop that iterates over select_var and filter_values, filtering accordingly]
  out # return out
}
My thinking would be to use a zip equivalent from Python, but all my searching on that just points me to mapply, and I can't see how to use that within the for loop (which I also know is not always approved of in R - but I am talking about a relatively small number of iterations). If there is a better solution to this I would welcome it.
Here's a one-liner table_reducer function in base R -
table_reducer <- function(df, select_var, filter_values) {
  # Map() checks each selected column against its allowed values, giving one
  # logical vector per variable; Reduce(`&`) keeps rows that pass every filter
  subset(df, Reduce(`&`, Map(`%in%`, df[select_var], filter_values)))
}
selected_variables <- c('date', 'result')
selected_values <- list(c(1,2), c('W'))
table_reducer(df, selected_variables, selected_values)
#   date time place result
# 1    1    a     A      W
# 2    2    b     A      W
# 4    2    e     H      W
# 5    2    b     A      W
Map is a wrapper over mapply so you were right in thinking that you should use mapply for this task. This answer is also free of dreaded for loops.
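A quick check of that equivalence, using the df and selected objects above (Map is just mapply with SIMPLIFY = FALSE):
filt_map    <- Map(`%in%`, df[selected_variables], selected_values)
filt_mapply <- mapply(`%in%`, df[selected_variables], selected_values, SIMPLIFY = FALSE)
identical(filt_map, filt_mapply)
# [1] TRUE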
I am trying to run t.test() on multiple columns of data within 'j' in a data.table. I've found a way that works, but it isn't very elegant, and I feel like there's probably a more concise way using .SDcols; however, I haven't had any luck looking through here or the data.table vignette. If this has been asked previously, I apologize; please point me in the right direction.
My data.table has essentially the following format
DT <- data.table(name = c('a', 'b', 'c', 'a', 'b', 'c'),
                 y = c(rnorm(6, mean = 100, sd = 30)),
                 z = c(rnorm(6, mean = 10, sd = 3)),
                 group = rep(c('One', 'Two'), 3))
When I want to run a t.test comparing the values of y in group One and group Two, it's very straightforward:
DT[,t.test(y~group)]
If I want output for both y and z, the following works, but it is clunky and inelegant. And with my actual data I'm trying to do this over many columns, so it would be time-consuming to type out each call I would like to run.
DT[,.(t.test(y~group), t.test(z~group))]
In the data.table vignette, using a function over a specific subset of columns is achieved by
DT[,lapply(.SD, mean), .SDcols = c('y', 'z')]
However, replacing mean with t.test yields a one-sample t-test, while I'm trying to get a two-sample t-test. I've tried:
DT[,lapply(.SD, t.test, formula = .SDcols ~ group, data = DT), .SDcols = c('y', 'z')]
But this gives me a comparison between y and z, not both the comparisons of y~group and z~group.
I've tried several versions of lapply with a custom function to get the output I want, but I won't make anyone read through my walls of unsuccessful code. Needless to say I have been unable to get that to work.
Question:
Is there a way via lapply() or function() or a way currently unknown to me, to get t.test to run over multiple columns of data within 'j' in a data.table?
Thanks in advance for your help,
Chris
To pull together the parts of the answer and to rearrange to put the name in the first column (if desired for nicer printing):
library(data.table)
DT <- data.table(name = c('a', 'b', 'c', 'a', 'b', 'c'),
                 y = c(rnorm(6, mean = 100, sd = 30)),
                 z = c(rnorm(6, mean = 10, sd = 3)),
                 group = rep(c('One', 'Two'), 3))
result <-
DT[,lapply(.SD, function(x) t.test(x ~ group)), .SDcols = y:z][
,ttname:=names(t.test(1:5))][ # add names
,.(ttname,y,z)] # put names in first column
result
# ttname y z
# 1: statistic 0.1391646 0.1295093
# 2: parameter 3.468876 3.559917
# 3: p.value 0.8970165 0.9039359
# 4: conf.int -99.61786,109.47358 -8.209637, 8.972439
# 5: estimate 110.7286,105.8008 11.15414,10.77274
# 6: null.value 0 0
# 7: stderr 35.41031 2.94497
# 8: alternative two.sided two.sided
# 9: method Welch Two Sample t-test Welch Two Sample t-test
# 10: data.name x by group x by group
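If you only need one component of the tests, say the p-values, you can filter that result by its name column (a small usage sketch):
result[ttname == "p.value"]   # one row holding the p-value for each of y and z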
Here is my solution, wrapped as a function. In the accepted answer, I didn't like that the test output was in rows and the variables in columns; I like it the other way around, which I think makes it easier to read.
I also added an argument for rounding, and one that by default prints only the most important information, the p-value and test statistic. The function requires purrr. The inputs for the group variable and the variables to test are character, so the call looks like dt_ttest(dtx, 'varname', c('z', 'y')).
dt_ttest <- function(dtx, grp, thecols, decimals = 3, small = TRUE, ...) {
  # run a two-sample t.test of each requested column against the grouping variable
  x1 <- dtx[, map(.SD, ~ t.test(.x ~ get(grp))), .SDcols = thecols]
  # transpose so the tested variables are rows and the t.test components are columns
  x2 <- t(x1) %>% data.table()
  setnames(x2, names(t.test(1:2)))
  x2 <- x2[, var := thecols][, !'data.name']
  tcols <- c('p.value', 'statistic', 'stderr', 'null.value', 'parameter', 'method', 'alternative')
  x2[, (tcols) := map(.SD, unlist), .SDcols = tcols]
  # round the numeric columns
  thecols2 <- keep(x2, is.numeric) %>% names()
  x2[, (thecols2) := map(.SD, ~ round(.x, decimals)), .SDcols = thecols2]
  # go one level deeper to round the two list columns
  thecols3 <- c('conf.int', 'estimate')
  x2[, (thecols3) := modify_depth(.SD, 2, ~ round(.x, decimals)), .SDcols = thecols3]
  # set the column order
  setcolorder(x2, c('var', 'p.value', 'statistic', 'stderr', 'conf.int', 'estimate',
                    'parameter', 'method', 'alternative'))
  if (small) x2[, .(var, p.value, statistic)] else x2[]
}
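A usage sketch with the DT defined above (assuming data.table and purrr are loaded; the exact numbers depend on the random data):
dt_ttest(DT, 'group', c('y', 'z'))                 # compact output: var, p.value, statistic
dt_ttest(DT, 'group', c('y', 'z'), small = FALSE)  # the full set of t.test components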
I have a very simple dataset like this,
a <- c(29, 10, 29)
b <- c(32, 23, 43)
c <- c(33,22,1)
df1 <- data.frame(a, b, c)
I want to create a new data frame from vectors a and c of df1. I am running the following command:
df2 <- data.frame(df1$a, df1$c)
It creates a data frame with the variable names df1.a and df1.c. Is there any way I can have the variable names exactly like what I have in df1?
df2 <- data.frame(a=df1$a, c=df1$c)
   a  c
1 29 33
2 10 22
3 29  1
I assume your a, b, c variables are not directly available anymore
colnames(df2) <- c("a", "c")
should do the trick?
df1[,c("a","c")]
In case you select only one column, use df1[,"a",drop=FALSE].
Always include drop=FALSE to handle the general case:
selectedColumns <- c("a","c")
df1[, selectedColumns, drop=FALSE]
If your real application is more complex than just taking a subset (which seems an obviously good solution), you can use setNames (here it doesn't make much sense, but it could help if you are trying to automatically rename the data frame at construction...):
df2 <- setNames(df1[, c('a', 'c')], names(df1[, c('a', 'c')]))
I would like to calculate the length of many objects in R and store the results in objects with the name prefix 'length_'. However, when I type this code:
A <- c('A', 'B', '3')
B <- c('A', '2')
files <- ls()
for (i in 1:length(files)) assign(paste("length_",files[i], sep = ""), length(unlist(files[i])))
This creates the objects length_A and length_B, but each has the value 1 instead of 3 and 2.
Thank you for any help,
Paul
P.S. I actually would like to apply a different function instead of length (GC.content from package ape, to calculate the GC content of DNA sequences), but with that function I have the same problem as with the above example.
In R 3.2.0, the lengths function was introduced, which calculates the length of each item of a list. Using this function, as @docendo-discimus notes in the comments above, a super compact (and R-like) solution is
lengths(mget(ls()))
which returns a named vector
A B
3 2
mget returns a list of the objects in the environment and is sort of like "multiple get."
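If you eventually want to apply a different function than length (e.g. GC.content from ape, as mentioned in the P.S.), the same mget pattern generalises. A minimal sketch, with length standing in for the real function and the requested length_ prefix added to the names:
A <- c('A', 'B', '3')
B <- c('A', '2')
objs <- mget(c('A', 'B'))      # fetch the objects by name
res <- sapply(objs, length)    # swap length for another per-object function, e.g. ape::GC.content on DNAbin objects
names(res) <- paste0('length_', names(res))
res
# length_A length_B
#        3        2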
A <- c('A', 'B', '3')
B <- c('A', '2')
files <- ls()
for (i in 1:length(files)) assign(paste("length_",files[i], sep = ""), length(get(files[i])))
This creates length_A with value 3 and length_B with value 2.
A <- c('A', 'B', '3')
B <- c('A', '2')
files <- list(A,B)
sapply(files,length)
This will give you the answer, but I don't know if it's what you want.
If I do something like this:
> df <- data.frame()
> rbind(df, c("A","B","C"))
X.A. X.B. X.C.
1 A B C
You can see the row gets added to the empty data frame. However, the columns get named automatically based on the content of the data.
This causes problems if I later want to:
> df <- rbind(df, c("P", "D", "Q"))
Is there a way to control the names of the columns that get automatically created by rbind? Or some other way to do what I'm attempting to do here?
@baha-kev has a good answer regarding strings and factors.
I just want to point out the weird behavior of rbind for data.frame:
# This "should work", but it doesn't:
# Create an empty data.frame with the correct names and types
df <- data.frame(A=numeric(), B=character(), C=character(), stringsAsFactors=FALSE)
rbind(df, list(42, 'foo', 'bar')) # Messes up names!
rbind(df, list(A=42, B='foo', C='bar')) # OK...
# If you have at least one row, names are kept...
df <- data.frame(A=0, B="", C="", stringsAsFactors=FALSE)
rbind(df, list(42, 'foo', 'bar')) # Names work now...
But if you only have strings, then why not use a matrix instead? It works fine to start with an empty matrix:
# Create a 0x3 matrix:
m <- matrix('', 0, 3, dimnames=list(NULL, LETTERS[1:3]))
# Now add a row:
m <- rbind(m, c('foo','bar','baz')) # This works fine!
m
# Then optionally turn it into a data.frame at the end...
as.data.frame(m, stringsAsFactors=FALSE)
Set the option stringsAsFactors to FALSE, which stores the values as characters:
df = data.frame(first = 'A', second = 'B', third = 'C', stringsAsFactors = FALSE)
df2 = rbind(df, c('Horse', 'Dog', 'Cat'))
df2
  first second third
1     A      B     C
2 Horse    Dog   Cat
sapply(df2,class)
first second third
"character" "character" "character"
Later, if you want to use factors, you could convert it like this:
df2 = as.data.frame(unclass(df2), stringsAsFactors = TRUE)
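A quick check that the conversion took effect (using the df2 built above):
sapply(df2, class)
#    first   second    third
# "factor" "factor" "factor"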