Using S3 object to analyse other data.frames - noob level question - r

This was an attempt to understand OOP; I am taking a tutorial in it now. I'm very new at R and I would consider deleting the question as it is not well formulated. I would not use it for learning or guidance.
Part 1 - I want to build a class in S3 using the datadump1111 data. I want to call the S3 object a50survey and then output results from it. This seems to work, but I'm not sure I made a proper S3 class or that the function is working as it normally should.
a50DATA <- datadump1111
inputdata <- sapply(a50DATA, function(x) t(sapply(x, table)))
a50survey <- sapply(inputdata, function(x) colSums(prop.table(x)))
print(a50survey)
class(a50survey)
a50survey$drugs
> a50survey$drugs
Not Tried once Occasional Regular
0.72 0.12 0.14 0.02
Part 2 - What I'm really aiming for is to introduce new data, datadump2222, in place of the original datadump1111. I was trying to do that with:
a50DATA <- datadump2222
a50survey$drugs
What I get is
> a50survey$drugs
Not Tried once Occasional Regular
0.72 0.12 0.14 0.02
What I should get is
$drugs
Not Tried once Occasional Regular
0.52 0.14 0.34 0.00
...and when I run a50survey I was hoping that the new a50DATA would get picked up by the S3 object a50survey and give me another set of outputs, correct for the new dataset, i.e. datadump2222. But it returns the output computed from datadump1111.
Request for help: Can you guide me simply to the solution, as I want to understand and replicate this? Thank you
The output from datadump1111 is here
> dim(datadump1111)
[1] 50 4
> str(datadump1111)
'data.frame': 50 obs. of 4 variables:
$ alcohol: Factor w/ 5 levels "Not","Once or Twice a week",..: 3 2 3 2 4 4 3 4 2 4 ...
$ drugs : Factor w/ 4 levels "Not","Tried once",..: 1 3 1 1 3 1 2 3 1 1 ...
$ smoke : Factor w/ 3 levels "Not","Occasional",..: 1 3 1 1 1 3 3 3 1 2 ...
$ sport : Factor w/ 2 levels "Not regular",..: 1 1 1 1 2 2 2 2 1 2 ...
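A note on why Part 2 behaves this way: a50survey is computed once, at the moment the sapply() lines run, and R keeps no link back to a50DATA, so reassigning a50DATA afterwards cannot change it. A minimal sketch of one way to structure this as an S3 class follows. The data frames dump_a and dump_b are made-up stand-ins for datadump1111 and datadump2222, and the constructor and print method are illustrative assumptions, not the asker's code.

```r
# Hypothetical stand-ins for datadump1111 and datadump2222 (made-up values).
lv <- c("Not", "Tried once", "Occasional", "Regular")
dump_a <- data.frame(drugs = factor(c("Not", "Not", "Tried once", "Regular"),
                                    levels = lv))
dump_b <- data.frame(drugs = factor(c("Not", "Occasional", "Occasional", "Occasional"),
                                    levels = lv))

# Constructor: compute per-column proportions once and tag the result with a class.
a50survey <- function(data) {
  stopifnot(is.data.frame(data))
  props <- lapply(data, function(col) prop.table(table(col)))
  structure(props, class = "a50survey")
}

# print method that S3 dispatch selects for objects of class "a50survey".
print.a50survey <- function(x, ...) {
  for (nm in names(x)) {
    cat("$", nm, "\n", sep = "")
    print(round(unclass(x[[nm]]), 2))
  }
  invisible(x)
}

survey1 <- a50survey(dump_a)
survey2 <- a50survey(dump_b)  # new data needs a new call; survey1 is unchanged
print(survey2)
```

The design point is that the object is a snapshot of the data it was built from; to analyse another data frame, call the constructor again rather than reassigning the input variable.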

Related

Why adonis function DF changes with different factors combination?

> data(dune)
> data(dune.env)
> str(dune.env)
'data.frame': 20 obs. of 5 variables:
$ A1 : num 2.8 3.5 4.3 4.2 6.3 4.3 2.8 4.2 3.7 3.3 ...
$ Moisture : Ord.factor w/ 4 levels "1"<"2"<"4"<"5": 1 1 2 2 1 1 1 4 3 2 ...
$ Management: Factor w/ 4 levels "BF","HF","NM",..: 4 1 4 4 2 2 2 2 2 1 ...
$ Use : Ord.factor w/ 3 levels "Hayfield"<"Haypastu"<..: 2 2 2 2 1 2 3 3 1 1 ...
$ Manure : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 5 3 5 5 3 3 4 4 2 2 ...
As shown above, Moisture has four groups, Management has four groups, and Manure has five groups. When I run:
adonis(dune ~ Manure*Management*A1*Moisture, data=dune.env, permutations=99)
Call:
adonis(formula = dune ~ Manure * Management * A1 * Moisture, data = dune.env, permutations = 99)
Permutation: free
Number of permutations: 99
Terms added sequentially (first to last)
                  Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)
Manure             4    1.5239 0.38097 2.03088 0.35447   0.13
Management         2    0.6118 0.30592 1.63081 0.14232   0.16
A1                 1    0.3674 0.36743 1.95872 0.08547   0.21
Moisture           3    0.6929 0.23095 1.23116 0.16116   0.33
Manure:Management  1    0.1091 0.10906 0.58138 0.02537   0.75
Manure:A1          4    0.3964 0.09909 0.52826 0.09220   0.91
Management:A1      1    0.1828 0.18277 0.97431 0.04251   0.50
Manure:Moisture    1    0.0396 0.03963 0.21126 0.00922   0.93
Residuals          2    0.3752 0.18759         0.08727
Total             19    4.2990                 1.00000
Why is the Df of Management not 3 (4 - 1)?
This is a general, rather than a specific answer.
Your formula Moisture*Management*A1*Manure corresponds to a linear model with 160 (!) predictors (2*4*4*5):
dim(model.matrix(~Moisture*Management*A1*Manure, dune.env))
adonis builds this model matrix internally and uses it to construct the machinery for calculating the permutation statistics. When there are multicollinear combinations of predictors, it drops enough columns to make the problem well-defined again. The detailed rules for which columns get dropped depend on the order of the columns; if you reorder the factors in your question you'll see the reported Df change.
For what it's worth, I don't think the df calculations change the statistical outcomes at all — the statistics are based on the distributions derived from permutations, not from an analytical calculation that depends on the df.
Ben Bolker got it right. If you only look at Management and Manure and forget all other variables, you will see this:
> with(dune.env, table(Management, Manure))
          Manure
Management 0 1 2 3 4
        BF 0 2 1 0 0
        HF 0 1 2 2 0
        NM 6 0 0 0 0
        SF 0 0 1 2 3
Look at the row Management NM and the column Manure 0: each has only one non-zero cell. This means that Management NM and Manure 0 are synonyms, the same thing (or "aliased"). Once Manure is in your model, Management only has three new levels, and hence 2 d.f. If you do it in the reverse order and enter Management first, then there are only four levels of Manure that you do not yet know, and that gives 3 d.f. for Manure.
Although you really have overparametrized your model, you would also get the same result with only these two variables. Compare models:
adonis2(dune ~ Manure + Management, data=dune.env)
adonis2(dune ~ Management + Manure, data=dune.env)
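The aliasing argument can also be checked without vegan. The sketch below rebuilds the two factors from the cross-tabulation quoted above and counts how many independent columns each term adds to a plain model matrix. It uses only base R; the helper rank_of is a made-up name, and this is an illustration of the counting argument, not a description of how adonis computes its Df internally.

```r
# Cell counts copied from the Management x Manure table above.
counts <- matrix(c(0, 2, 1, 0, 0,
                   0, 1, 2, 2, 0,
                   6, 0, 0, 0, 0,
                   0, 0, 1, 2, 3),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(Management = c("BF", "HF", "NM", "SF"),
                                 Manure = as.character(0:4)))

# Expand the counts back into one row per site (20 rows).
cells <- as.data.frame(as.table(counts))
d <- cells[rep(seq_len(nrow(cells)), cells$Freq), c("Management", "Manure")]

# Hypothetical helper: number of linearly independent model-matrix columns.
rank_of <- function(formula) qr(model.matrix(formula, d))$rank

# Columns that Management adds once Manure is already in the model.
rank_of(~ Manure + Management) - rank_of(~ Manure)      # 2

# Columns that Manure adds once Management is already in the model.
rank_of(~ Management + Manure) - rank_of(~ Management)  # 3
```

Whichever factor enters second loses one degree of freedom, because the NM / Manure 0 indicator is already determined by the factor that entered first.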

Grouping error with lmer

I have a data frame with the following structure:
> t <- read.csv("combinedData.csv")[,1:7]
> str(t)
'data.frame': 699 obs. of 7 variables:
$ Awns : int 0 0 0 0 0 0 0 0 1 0 ...
$ Funnel : Factor w/ 213 levels "MEL001","MEL002",..: 1 1 2 2 2 3 4 4 4 4 ...
$ Plant : int 1 2 1 3 8 1 1 2 3 5 ...
$ Line : Factor w/ 8 levels "a","b","c","cA",..: 2 2 1 1 1 3 1 1 1 1 ...
$ X : int 1 2 3 4 7 8 9 10 11 12 ...
$ ID : Factor w/ 699 levels "MEL_001-1b","MEL_001-2b",..: 1 2 3 4 5 6 7 8 9 10 ...
$ BobWhite_c10082_241: int 2 2 2 2 2 2 0 2 2 0 ...
I want to construct a mixed effect model. I know in my data frame that the random effect I want to include (Funnel) is a factor, but it does not work:
> lmer(t$Awns ~ (1|t$Funnel) + t$BobWhite_c10082_241)
Error: couldn't evaluate grouping factor t$Funnel within model frame: try adding grouping factor to data frame explicitly if possible
In fact this happens whatever I want to include as a random effect e.g. Plant:
> lmer(t$Awns ~ (1|t$Plant) + t$BobWhite_c10082_241)
Error: couldn't evaluate grouping factor t$Plant within model frame: try adding grouping factor to data frame explicitly if possible
Why is R giving me this error? The only other answer I could find by googling is that the random effect fed in wasn't a factor in the data frame. But as str shows, t$Funnel certainly is.
It is actually not so easy to provide a convenient syntax for modeling functions and at the same time have a robust implementation. Most package authors assume that you use the data parameter and even then scoping issues can occur. Thus, strange things can happen if you specify variables with DF$col syntax since package authors rarely spend a lot of effort to make this work correctly and don't include a lot of unit tests for this.
It is therefore strongly recommended to use the data parameter if the model function offers a formula method. Strange things can happen if you don't follow that practice (also with other model functions like lm).
In your example:
lmer(Awns ~ (1|Funnel) + BobWhite_c10082_241, data = t)
This not only works, but is also more convenient to write.
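One concrete instance of the "strange things" mentioned above can be reproduced with lm and the built-in mtcars data (chosen here only for illustration; it is not the asker's data). When the formula uses DF$col terms, predict() cannot match the columns of newdata to the model terms and falls back to the original data.

```r
# Same model, specified two ways.
fit_dollar <- lm(mtcars$mpg ~ mtcars$wt)      # DF$col syntax
fit_data   <- lm(mpg ~ wt, data = mtcars)     # data argument

new_car <- data.frame(wt = 3)

# With DF$col terms, newdata is effectively ignored (R warns about it)
# and one value per original row comes back.
n_dollar <- length(suppressWarnings(predict(fit_dollar, newdata = new_car)))

# With the data argument, a single prediction for the new row comes back.
n_data <- length(predict(fit_data, newdata = new_car))

c(dollar = n_dollar, data = n_data)
```

The fitted coefficients are identical in both cases; the difference only shows up once the model object is used with other data.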

tapply function complains that args are unequal length yet they appear to match

Here is the failing call, error messages and some displays to show the lengths in question:
it <- tapply(molten, c(molten$Activity, molten$Subject, molten$variable), mean)
# Error in tapply(molten, c(molten$Activity, molten$Subject, molten$variable), :
# arguments must have same length
length(molten$Activity)
# [1] 679734
length(molten$Subject)
# [1] 679734
length(molten$variable)
# [1] 679734
dim(molten)
# [1] 679734 4
str(molten)
# 'data.frame': 679734 obs. of 4 variables:
# $ Activity: Factor w/ 6 levels "WALKING","WALKING_UPSTAIRS",..: 5 5 5 5 5 5 5 5 5 5 ...
# $ Subject : Factor w/ 30 levels "1","2","3","4",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ variable: Factor w/ 66 levels "tBodyAcc-mean()-X",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ value : num 0.257 0.286 0.275 0.27 0.275 ...
If you have a look at ?tapply you will see that X should be "an atomic object, typically a vector". You feed tapply with a data frame ("molten"), which is not an atomic object. See is.atomic, and try is.atomic(molten). Furthermore, your grouping variables should be provided as a list (see INDEX argument).
Something like this works:
tapply(X = warpbreaks$breaks, INDEX = list(warpbreaks$wool, warpbreaks$tension), mean)
# L M H
# A 44.55556 24.00000 24.55556
# B 28.22222 28.77778 18.77778
You need to have a single object for INDEX, but c() will string them all together into one long vector, which is the source of the error, so use a list:
it <- tapply(molten$value, list(Act=molten$Activity, sub=molten$Subject, var=molten$variable), mean)
Better would be:
it <- with(molten , tapply(value, list(Act=Activity, Sub=Subject, var=variable), mean) )
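The length mismatch is easy to see on the built-in warpbreaks data used in the earlier answer. This is a small sketch; the variable names idx_c, res, and ok are made up for the illustration.

```r
# c() concatenates the two grouping columns into ONE vector, twice as long as X.
idx_c <- c(warpbreaks$wool, warpbreaks$tension)
length(idx_c)               # 108
length(warpbreaks$breaks)   # 54

# Passing that vector as INDEX triggers the "arguments must have same length" error.
res <- tryCatch(tapply(warpbreaks$breaks, idx_c, mean),
                error = function(e) e)
inherits(res, "error")      # TRUE

# A list keeps the grouping variables separate, each the same length as X.
ok <- tapply(warpbreaks$breaks,
             list(wool = warpbreaks$wool, tension = warpbreaks$tension),
             mean)
dim(ok)                     # 2 3
```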
Did you ever get this solved? I had the same issue when reading in a CSV file, and could fix it by saving the original file as CSV (delimiter separated) instead of CSV (delimiter separated, UTF-8). My dataset had German umlauts in it, though, so that might play a role as well.

Assign list of attributes() to sublist in R

I have a data frame called 'situations' containing a list of attributes.
> str(situations)
'data.frame': 24 obs. of 8 variables:
$ ID.SITUATION : Factor w/ 24 levels "cnf_01_be","cnf_02_ch",..: 1 2 3 4 5 6 7 8 9 10 ...
$ ELICITATION.D : Factor w/ 2 levels "NATUREL","SEMI.DIRIGE": 1 1 1 1 1 1 1 1 2 2 ...
$ INTERLOCUTEUR.C : Factor w/ 3 levels "DIALOGUE","MONOLOGUE",..: 2 2 2 2 3 3 3 3 1 1 ...
$ PREPARATION.D : Factor w/ 3 levels "PREPARE","SEMI.PREPARE",..: 2 2 2 2 3 3 3 3 3 3 ...
$ INTERACTIVITE.D : Factor w/ 3 levels "INTERACTIF","NON. INTERACTIF",..: 2 2 2 2 1 1 1 1 3 3 ...
$ MEDIATISATION.D : Factor w/ 3 levels "MEDIATIQUE","NON.MEDIATIQUE",..: 2 2 2 2 2 2 2 2 2 2 ...
$ PROFESSIONNALISATION.C: Factor w/ 1 level "PRO": 1 1 1 1 1 1 1 1 1 1 ...
$ ID.TASK : Factor w/ 5 levels "conference scientifique",..: 1 1 1 1 2 2 2 2 3 3 ...
I have as many observations in this data frame (24) as I have sublists in a given corpus.
The ID situation names (e.g. cnf_01_be) correspond to the names of the sublists (cnf_01_be).
I know how to assign individual attributes:
attributes(corpus$cnf_01_be) = situations[1,]
attributes(corpus$cnf_02_ch) = situations[2,]
And retrieve them for a specific purpose :
attr(corpus$cnf_01_be, "ELICITATION.D")
attr(corpus$cnf_02_ch, "ELICITATION.D")
attr(corpus$cnf_02_ch, "PREPARATION.D")
But how can I use, for example, lapply to automatically assign attributes to all the sublists in my corpus?
I feel like all my attempts are going in the wrong direction:
setattr <- function(x,y) {
attributes(x) <- situations[y,]
return(attributes)
}
...or...
lapply(corpus,setattr)
lapply(corpus, attributes(corpus) <- situations[c(1:length(situations[,1])),])
Thanks in advance!
The main problem with using lapply (and similar approaches) is that they cannot normally change the original object of interest, but rather return a new structure. So if you already have a list "corpus" and just want to change its members' attributes, you can't usually do that inside a function.
One way to overcome this limitation is to use an eval.parent() call instead of the usual assignment. This function evaluates the assignment expression in the parent environment (the environment that called the function), rather than on the local copies of the objects you assign. If you use this, you don't have to return any value.
Another option would be to create a local copy of your corpus list within the function, add all the attributes to it, then return the whole structure from the function and use it to replace the old list. If your list is big or complex, this is probably not a wise choice.
Here is code that does it. Note: this is ugly code. I'm still looking to see if I can make it simpler, but because of the issues above, I'm not sure there is a much simpler option. Anyway, I hope the following will do the trick for you:
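f = function(lname, data) {
  # lname is the NAME of the list (a string); the list is modified in the caller's environment
  snames = eval.parent(parse(text = paste0("names(", lname, ")")))
  for (xn in snames) {
    # find the row whose ID.SITUATION matches this sublist's name
    i = match(xn, as.character(data$ID.SITUATION))
    if (!is.na(i)) {
      # global temporary so the evaluated assignment can see the row
      tmp___ <<- data[i, ]
      # the sublist name must be quoted inside [[ ]]
      cmm = paste0("attributes(", lname, "[[\"", xn, "\"]]) = tmp___")
      eval.parent(parse(text = cmm))
    }
  }
}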
Note that in order to use it you need to supply your list name (as a character string, and not as a variable), and your data frame. In your case the call would be:
f("corpus",situations)
I hope this helps.
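For comparison, a plain for loop at the point where corpus lives avoids eval.parent() altogether, because the replacement form of attr() modifies the list element in place. The sketch below uses tiny made-up stand-ins for corpus and situations (two matching sublists, one unmatched, and only two of the eight columns). Unlike attributes(x) <- ..., it adds the attributes one at a time and leaves any existing attributes untouched.

```r
# Hypothetical miniature versions of the asker's objects.
corpus <- list(cnf_01_be = c("a", "b"),
               cnf_02_ch = c("c"),
               unmatched = c("d"))
situations <- data.frame(ID.SITUATION  = c("cnf_01_be", "cnf_02_ch"),
                         ELICITATION.D = c("NATUREL", "SEMI.DIRIGE"),
                         stringsAsFactors = FALSE)

for (nm in names(corpus)) {
  i <- match(nm, situations$ID.SITUATION)   # row for this sublist, or NA
  if (!is.na(i)) {
    for (col in names(situations)) {
      # attr<- on corpus[[nm]] updates the element inside the list
      attr(corpus[[nm]], col) <- situations[i, col]
    }
  }
}

attr(corpus$cnf_02_ch, "ELICITATION.D")
```

Sublists without a matching ID.SITUATION row are skipped rather than given NA attributes.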

Replace strings in data frame columns with integer in R

I have a data frame called 'foo':
foo <- data.frame("row1" = c(1,2,3,4,5), "row2" = c(1,2.01,3,"-","-"))
'foo' was loaded from a different program as a CSV file and has two columns. One is a numerical data type and the other is a factor data type.
str(foo)
'data.frame': 5 obs. of 2 variables:
$ row1: num 1 2 3 4 5
$ row2: Factor w/ 4 levels "-","1","2.01",..: 2 3 4 1 1
Notice there are dashes, e.g. "-", in foo$row2, which cause this column to be a factor. I want to replace the dashes with zeros, such that data.class(foo$row2) will return 'numeric'. The idea is to replace all dashes in each column so I can run numerical analyses on it with R.
What is the simplest way to do this in R?
Thanks,
Q: The idea is to replace all dashes in each column so I can run numerical analyses on it with R.
Use apply or sapply with sub
kk<-data.frame(apply(foo,2,function(x) as.numeric(sub("-",0,x))))
> kk
row1 row2
1 1 1.00
2 2 2.01
3 3 3.00
4 4 0.00
5 5 0.00
> str(kk$row2)
num [1:5] 1 2.01 3 0 0
Or, you can use sapply
kk<-data.frame(sapply(names(foo),function(x)as.numeric(sub("-",0,foo[,x]))))
Update:
If you want just the second column, you don't need to use apply:
foo$row2 <- as.numeric(sub("-",0,foo[,2]))
Here is one simple way to do it. There might be a more elegant way, but this will work:
> foo <- data.frame("row1" = c(1,2,3,4,5), "row2" = c(1,2.01,3,"-","-"))
> levels(foo$row2)[levels(foo$row2)=="-"]<-0
> foo$row2<-as.numeric(as.character(foo$row2))
> class(foo$row2)
[1] "numeric"
> foo
row1 row2
1 1 1.00
2 2 2.01
3 3 3.00
4 4 0.00
5 5 0.00
I would use ifelse() for this:
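foo$row2 <- ifelse(foo$row2 == "-", 0, as.numeric(as.character(foo$row2)))
The as.character() step is needed when the column is a factor, because as.numeric() applied directly to a factor returns the integer level codes rather than the values. Expect an "NAs introduced by coercion" warning, since as.numeric() is still evaluated on the dashes.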
How about gsub...
as.numeric( gsub("-" , 0 , foo[,2] ) )
#[1] 1.00 2.01 3.00 0.00 0.00
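One caveat with the sub() and gsub() answers: the pattern "-" matches any dash, including the minus sign of a negative number, so "-3" would become "03". A minimal sketch of an exact-match variant follows, assuming the dash only ever appears as a whole cell. The helper name dash_to_zero is made up for this example.

```r
# Same data as in the question, with row2 forced to a factor as in the str() output.
foo <- data.frame(row1 = c(1, 2, 3, 4, 5),
                  row2 = c("1", "2.01", "3", "-", "-"),
                  stringsAsFactors = TRUE)

# Hypothetical helper: replace cells that are exactly "-" with 0, then convert.
dash_to_zero <- function(x) {
  x <- as.character(x)   # works for factor, character and numeric columns
  x[x == "-"] <- "0"
  as.numeric(x)
}

# foo[] keeps the data frame structure while replacing every column.
foo[] <- lapply(foo, dash_to_zero)
str(foo)

# Why exact matching matters:
sub("-", 0, "-3")    # "03": the minus sign is lost
dash_to_zero("-3")   # -3
```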
