Getting at grouping term from within ddply function? - r

I have the following example data frame x
id val
a 1
a 2
a 3
b 2
b 4
b 9
I have a simple function that I apply while doing my SAC. Something like,
f.x <- function(x){
cat(x$id,"\n")
}
Note, all that I'm trying to do is print out the grouping term. This function is called using ddply as follows
ddply(x,.(id),f.x)
However, when I run this I get integer outputs which I suspect are indices of a list. How do I get the actual grouping term within f.x, in this case 'a' & 'b'?
Thanks much in advance.

Never be fooled into thinking that just because a variable looks like a bunch of characters, that it really is just a bunch of characters.
Factors are not characters. They are integer codes and will often behave like integer codes.
Coerce the values to character using as.character().
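To make the difference concrete, here is a small sketch. It assumes id is stored as a factor, which was the default for strings in data.frame() before R 4.0; the names codes and labels are only for illustration.

```r
# Rebuild the example data with id explicitly stored as a factor
# (an assumption that matches the integer output described in the question).
x <- data.frame(id = factor(c("a", "a", "a", "b", "b", "b")),
                val = c(1, 2, 3, 2, 4, 9))

codes  <- as.integer(x$id)    # the underlying integer codes
labels <- as.character(x$id)  # the labels those codes stand for

cat(x$id, "\n")    # cat() works on the integer codes
cat(labels, "\n")  # after coercion, the grouping terms are printed
```

The first cat() call reproduces the integer output from the question; the second prints the grouping terms.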

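If all you want to do is print, you should use d_ply.
In ddply (and d_ply), the argument actually passed to the function (f.x here) is a data.frame (a subset of the x data.frame), so you should modify your f.x function. If id is a factor, as the integer output suggests, also coerce it to character before printing.
library(plyr)
f.x <- function(x) cat(as.character(x[1,"id"]),"\n")
d_ply(x,.(id),f.x)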
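If you want to see what "a subset of the data frame per group" looks like without plyr, a rough base R sketch with split() may help. This is a swapped-in technique, not what ddply does internally; the names pieces and seen are made up for the example.

```r
x <- data.frame(id = factor(c("a", "a", "a", "b", "b", "b")),
                val = c(1, 2, 3, 2, 4, 9))

# split() returns a named list of data frames, one per level of id
pieces <- split(x, x$id)

seen <- character(0)
for (piece in pieces) {
  # every row of a piece shares the same id, so the first row is enough
  group <- as.character(piece$id[1])
  cat(group, "has", nrow(piece), "rows\n")
  seen <- c(seen, group)
}
```

The list names (names(pieces)) carry the same grouping terms, which is often the easiest way to get at them.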

Related

Dynamically change part of variable name in R

I am trying to automatise some post-hoc analysis, but I will try to explain myself with a metaphor that I believe will illustrate what I am trying to do.
Suppose I have two vectors of strings: the first holds names and the second holds adjectives:
list1 <- c("apt", "farm", "basement", "lodge")
list2 <- c("tiny", "noisy")
Let's suppose also I have a data frame with a bunch of data that I have named something like this as they are the results of some previous linear analysis.
> head(df)
qt[apt_tiny,Intercept] qt[apt_noisy,Intercept] qt[farm_tiny,Intercept]
1 4.196321 -0.4477012 -1.0822793
2 3.231220 -0.4237787 -1.1433449
3 2.304687 -0.3149331 -0.9245896
4 2.768691 -0.1537728 -0.9925387
5 3.771648 -0.1109647 -0.9298861
6 3.370368 -0.2579591 -1.0849262
and so on...
Now, what I am trying to do is make some automatic operations where the strings in the previous lists dynamically change as they go in a for loop. I have made a list with all the distinct combinations and called it distinct. Now I am trying to do something like this:
for (i in 1:nrow(distinct)){
var1[[i]] <- list1[[i]]
var2[[i]] <- list2[[i]]
#this being the insertable name part for the rest of the variables and parts of variable,
#i'll put it inside %var[[i]]% for the sake of the explanation.
%var1[[i]]%_%var2[[i]]%_INT <- df$`qt[%var1[[i]]%_%var2[[i]]%,Intercept]`+ df$`qt[%var1[[i]]%,Intercept]`
}
The difficult thing for me here is %var1[[i]]% is at the same time inside a variable and as the name of a column inside a data frame.
Any help would be much appreciated.
You cannot use $ to extract column values with a character variable, so df$`qt[%var1[[i]]%_%var2[[i]]%,Intercept]` will not work.
Create the name of the column using sprintf and use [[ to extract it. For example, to construct "qt[apt_tiny,Intercept]" as a column name you can do:
i <- 1
sprintf('qt[%s_%s,Intercept]', list1[i], list2[i])
#[1] "qt[apt_tiny,Intercept]"
Now use [[ to subset that column from df
df[[sprintf('qt[%s_%s,Intercept]', list1[i], list2[i])]]
You can do the same for other columns.
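Putting the pieces together, the whole loop might look roughly like the sketch below. The data frame values are made up for illustration, the vectors are shortened to two names, and storing the results in a named list (results) instead of creating separate variables is my own choice, not something from the question.

```r
list1 <- c("apt", "farm")
list2 <- c("tiny", "noisy")

# Made-up numbers; check.names = FALSE keeps the bracketed column names intact
df <- data.frame(
  `qt[apt,Intercept]`        = c(1, 2),
  `qt[farm,Intercept]`       = c(3, 4),
  `qt[apt_tiny,Intercept]`   = c(10, 20),
  `qt[apt_noisy,Intercept]`  = c(30, 40),
  `qt[farm_tiny,Intercept]`  = c(50, 60),
  `qt[farm_noisy,Intercept]` = c(70, 80),
  check.names = FALSE
)

# All name/adjective combinations, one per row
distinct <- expand.grid(var1 = list1, var2 = list2, stringsAsFactors = FALSE)

results <- list()
for (i in seq_len(nrow(distinct))) {
  v1 <- distinct$var1[i]
  v2 <- distinct$var2[i]
  combo_col <- sprintf("qt[%s_%s,Intercept]", v1, v2)
  base_col  <- sprintf("qt[%s,Intercept]", v1)
  # Build the result name the same way and use it as a list key
  results[[sprintf("%s_%s_INT", v1, v2)]] <- df[[combo_col]] + df[[base_col]]
}
```

A named list is usually easier to work with than variables created on the fly, since results[["apt_tiny_INT"]] can itself be indexed with a constructed string.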

Different variable lengths when looping over a string vector which corresponds to data frame columns

I am new in writing loops and I have some difficulties there. I already looked through other questions, but didn't find the answer to my specific problem.
So let's just create a random dataset, give column names and set the variables as character:
d<-data.frame(replicate(4,sample(1:9,197,rep=TRUE)))
colnames(d)<-c("variable1","variable2","trait1","trait2")
d$variable1<-as.character(d$variable1)
d$variable2<-as.character(d$variable2)
Now I define the vector over which I want to loop. It corresponds to trait1 and trait2:
trt.nm <- names(d[c(3,4)])
Now I want to apply the following model for trait 1 and trait 2 (which should now be as column names in trt.nm) in a loop:
library(lme4)
for(trait in trt.nm)
{
lmer (trait ~ 1 + variable1 + (1|variable2) ,data=d)
}
Now I get the error that variable lengths differ. How could this be explained?
If I apply the model without loop for each trait, I get a result, so the problem has to be somewhere in the loop, I think.
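Inside the loop, trait is a string of length one rather than a column of d, which is why the variable lengths differ. You'll have to convert it to a formula for this to work; see http://www.cookbook-r.com/Formulas/Creating_a_formula_from_a_string/ for more info.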
Try this (you'll have to add a print statement or save the result to actually see what it does, but this will run without errors):
for(trait in trt.nm) {
lmer(as.formula(paste(trait, " ~ 1 + variable1 + (1|variable2)")), data = d)
}
Another suggestion would be to use a list and lapply or purrr::map instead. Good luck!
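A rough sketch of the lapply variant, keeping one formula (and one fit) per trait in a named list. The names formulas and fits are made up here, and the fitting step only runs if lme4 is installed; with random data like this, lmer may well report a singular fit.

```r
trt.nm <- c("trait1", "trait2")

# Build one formula per trait; setNames() makes the result a named list
formulas <- lapply(setNames(trt.nm, trt.nm), function(trait) {
  as.formula(paste(trait, "~ 1 + variable1 + (1 | variable2)"))
})

# Fit the models only when lme4 is available
if (requireNamespace("lme4", quietly = TRUE)) {
  set.seed(1)
  d <- data.frame(replicate(4, sample(1:9, 197, replace = TRUE)))
  colnames(d) <- c("variable1", "variable2", "trait1", "trait2")
  d$variable1 <- as.character(d$variable1)
  d$variable2 <- as.character(d$variable2)
  fits <- lapply(formulas, function(f) lme4::lmer(f, data = d))
}
```

Afterwards fits[["trait1"]] would hold the model for trait1, so nothing is lost the way it is when lmer is called inside a bare for loop.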

Convert A List Object into a Useable Matrix Name (R)

I want to be able to use a loop to perform the same function on a group of data sets without having to recall the name of all of the data sets individually. For example, say I have the following matrices:
a<-matrix(1:5,nrow=5,ncol=2)
b<-matrix(6:10,nrow=5,ncol=2)
c<-matrix(11:15,nrow=5,ncol=2)
I define a vector of set names:
SetNames<- c("a","b","c")
Then I want to sum the second column of all of the matrices without having to call each matrix name. Basically, I would like to be able to call SetNames[1], have the program return 'a' as usable text which can be used to call apply(a[2],2,sum).
If apply(SetNames[1][2],2,sum) worked, that would be the basic syntax I was looking for, however I would replace the 1 with a variable I can increase in a loop.
sapply can do that.
sapply(SetNames, function(z) {
dfz <- get(z)
sum(dfz[,2])
})
# a b c
# 15 40 65
Notice that get() is used here to dynamically access a variable.
A less compact way of writing this would be:
sumRowTwo <- function(z) {
dfz <- get(z)
sum(dfz[,2])
}
sapply(SetNames, sumRowTwo)
Now you can play around with sumRowTwo and see what a call such as
sumRowTwo("a")
returns.
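An alternative that avoids get() altogether is to keep the matrices in a named list from the start. This is just a sketch of that idea; sets and col2_sums are names I made up.

```r
a <- matrix(1:5, nrow = 5, ncol = 2)
b <- matrix(6:10, nrow = 5, ncol = 2)
c <- matrix(11:15, nrow = 5, ncol = 2)

# A named list plays the role of SetNames plus get()
sets <- list(a = a, b = b, c = c)

# Sum the second column of each matrix
col2_sums <- sapply(sets, function(m) sum(m[, 2]))
```

With this layout sets[["a"]] or sets[[1]] gives you a matrix directly, so a loop index or a name string works without any lookup by variable name.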

Iterate over Factors in a dataframe in R

I am rather new to R and struggling at the moment with a specific issue. I need to iterate over a dataframe with 1 variable returned from a SQL database so that I can ultimately issue additional SQL queries using the information in the 1 variable. I need help understanding how to do this.
Here is what I have
> dt
Col
1 5D2D3F03-286E-4643-8F5B-10565608E5F8
2 582771BE-811E-4E45-B770-42A98EB5D7FB
3 4EB4D553-C680-4576-A854-54ED817226B0
4 80D53D5D-80D1-4A60-BD86-C85F6D53390D
5 9EF6CABF-0A4F-4FA9-9FD9-132589CAAC31
When I try to access it using dt[1], it prints the entire list, just as above:
> dt[1]
Col
1 5D2D3F03-286E-4643-8F5B-10565608E5F8
2 582771BE-811E-4E45-B770-42A98EB5D7FB
3 4EB4D553-C680-4576-A854-54ED817226B0
4 80D53D5D-80D1-4A60-BD86-C85F6D53390D
5 9EF6CABF-0A4F-4FA9-9FD9-132589CAAC31
When I try to access it using dt[1,], it brings additional unwanted information:
> a<-dt[1,]
> a
[1] 5D2D3F03-286E-4643-8F5B-10565608E5F8
5 Levels: 4EB4D553-C680-4576-A854-54ED817226B0 ... 9EF6CABF-0A4F-4FA9-9FD9-132589CAAC31
I need to isolate just the '5D2D3F03-286E-4643-8F5B-10565608E5F8' information and not the '5 levels......'.
I am sure this is simple, I just can't find it. any help is appreciated!
thanks!
There are two issues you need to address. One is that you want character data, not a factor variable (a factor is essentially a category variable). The other is that you want a simple vector of the values, not a data.frame.
1) To get the first column as a vector, use double-brackets or the $ notation:
a <- dt[[1]]
a <- dt[['Col']]
a <- dt$Col
Your notation dt[1,] selects the first row, and because dt has only one column the result is returned as a vector (here a factor of length one) rather than a data.frame. This relies on the somewhat obscure fact that the [ method for data.frame objects will silently "drop" its value to a vector when using the two-index form dt[i,j], but not when using the one-index form dt[i]:
When [ and [[ are used with a single vector index (x[i] or x[[i]]), they index the data frame as if it were a list. In this usage a drop argument is ignored, with a warning.
Think of "dropping" like unboxing the data - instead of getting a data.frame with a single column, you're just getting the column data itself.
2) To convert to character data, use one of the suggestions in the comments from @akrun or @Vlo:
a <- as.character(dt[[1]])
a <- as.character(dt[['Col']])
a <- as.character(dt$Col)
Alternatively, use the API of whatever you're using to make the SQL query, or to read in the results of the query, so that the strings are not converted to factors in the first place.
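Here is a short sketch of both steps on a cut-down copy of the data. The table name some_table and the query text are purely hypothetical, only there to show how each value could feed a follow-up query.

```r
# Two of the IDs from the question, stored as a factor on purpose
dt <- data.frame(Col = c("5D2D3F03-286E-4643-8F5B-10565608E5F8",
                         "582771BE-811E-4E45-B770-42A98EB5D7FB"),
                 stringsAsFactors = TRUE)

first_row <- dt[1, ]           # still a factor, so the Levels line is printed
ids <- as.character(dt$Col)    # plain character vector, no levels attached

# Hypothetical follow-up queries, one per id (sprintf is vectorised)
queries <- sprintf("SELECT * FROM some_table WHERE id = '%s'", ids)

for (id in ids) {
  cat("would query for", id, "\n")
}
```

For real queries, parameter binding through your database package is safer than pasting values into the SQL string.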

R add to a list in a loop, using conditions

I have a data.frame with dim = (200,500).
I want to do a shapiro.test on each column of my dataframe and append to a list. This is what I'm trying:
colstoremove <- list();
for (i in range(dim(I.df.nocov)[2])) {
x <- shapiro.test(I.df.nocov[1:200,i])
colstoremove[[i]] <- x[2]
}
However this is failing. Some pointers? (background is mainly python, not much of an R user)
Consider lapply(): when a data frame is passed to it, the function is applied to each column, and the returned list has one element per column:
colstoremove <- lapply(I.df.nocov, function(col) shapiro.test(col)[2])
Here is what happens in
for (i in range(dim(I.df.nocov)[2]))
For the sake of example, I assume that I.df.nocov contains 100 rows and 5 columns.
dim(I.df.nocov) is the vector of I.df.nocov dimensions, i.e. c(100, 5)
dim(I.df.nocov)[2] is the 2nd dimension of I.df.nocov, i.e. 5
range(x) is a 2-element vector which contains the minimal and maximal values of x. For example, range(c(4,10,1)) is c(1,10). So range(dim(I.df.nocov)[2]) is c(5,5).
Therefore, the loop iterates twice: the first time with i=5, and the second time also with i=5. Not surprising that it fails!
The problem is that R's function range and Python's function with the same name do completely different things. The closest equivalent of Python's range is called seq. For example, seq(5)=c(1,2,3,4,5), while seq(3,5)=c(3,4,5), and seq(1,10,2)=c(1,3,5,7,9). You may also write 1:n, which is the same as seq(n), and m:n is the same as seq(m,n) (but the priority of ':' is very high, so 1:2*x is interpreted as (1:2)*x).
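A few lines you can paste into the console to check these claims. The part about seq_len and the empty case goes slightly beyond what is said above, but it is a common pitfall when a loop bound can be zero.

```r
n <- 5
up_to_n  <- seq_len(n)       # 1 2 3 4 5
from_3   <- seq(3, 5)        # 3 4 5
by_two   <- seq(1, 10, 2)    # 1 3 5 7 9

scaled   <- 1:2 * 3          # parsed as (1:2) * 3
extended <- 1:(2 * 3)        # parentheses give the sequence 1 to 6

# When the upper bound is 0, 1:n counts down while seq_len(n) is empty
count_down <- 1:0
empty      <- seq_len(0)
```

For that reason seq_len(n) or seq_along(x) is usually the safer thing to loop over.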
Generally, if something does not work in R, you should print the subexpressions from the innermost to the outermost. If some subexpression is too big to be printed, use str(x) (str means "structure"). And never assume that functions in Python and R are the same! If there is a function with the same name, it often does a different thing.
On a side note, instead of dim(I.df.nocov)[2] you could just write ncol(I.df.nocov) (there is also a function nrow).
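For completeness, a corrected version of the loop might look like this sketch. The data are random stand-ins (100 rows, 5 columns, as in the example above), and pulling out $p.value instead of x[2] is my own choice.

```r
set.seed(42)
# Stand-in data: 100 rows, 5 columns of random normal values
I.df.nocov <- as.data.frame(matrix(rnorm(100 * 5), nrow = 100))

# Inspect the subexpression that caused the trouble
bad_index <- range(ncol(I.df.nocov))   # both elements are 5

# Loop over the column indices 1..ncol instead
colstoremove <- vector("list", ncol(I.df.nocov))
for (i in seq_len(ncol(I.df.nocov))) {
  colstoremove[[i]] <- shapiro.test(I.df.nocov[, i])$p.value
}

# The same result as a plain numeric vector, one p-value per column
pvals <- vapply(I.df.nocov, function(col) shapiro.test(col)$p.value, numeric(1))
```

Preallocating the list with vector("list", n) avoids growing it inside the loop.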

Resources