R merge() not working (anymore) as intended [duplicate] - r

This question already has answers here:
Why are these numbers not equal?
(6 answers)
Closed 6 years ago.
This has worked for me before but now it isn't and I have spent two days tinkering with it before I ask for help here.
I have two datasets, one called Access, the other CO2. Each one has four variables, two of which are common and are what I want to use to merge the two datasets. Just to play it really save, I am pasting the head() and str() outputs here:
> head(Access) > head(CO2)
x y access x y CO2equ
1 -32.65 83.65 0.00 1 -32.65 83.65 183316.4
2 -36.85 83.55 4481.25 2 -36.85 83.55 173327.8
3 -36.75 83.55 4464.75 3 -36.75 83.55 301413.9
4 -36.65 83.55 4448.25 4 -36.65 83.55 360757.2
5 -36.55 83.55 4431.00 5 -36.55 83.55 409523.5
6 -36.45 83.55 4414.50 6 -36.45 83.55 448302.0
> str(Access)
'data.frame': 2183106 obs. of 3 variables:
$ x : num -32.7 -36.8 -36.8 -36.7 -36.5 ...
$ y : num 83.7 83.5 83.5 83.5 83.5 ...
$ access: num 0 4481 4465 4448 4431 ...
- attr(*, "data_types")= chr "N" "N" "N"
> str(CO2)
'data.frame': 2183106 obs. of 3 variables:
$ x : num -32.7 -36.9 -36.8 -36.7 -36.6 ...
$ y : num 83.6 83.5 83.5 83.5 83.5 ...
$ CO2equ: num 183316 173328 301414 360757 409523 ...
- attr(*, "data_types")= chr "N" "N" "N"
Now I am trying to versions of merge(). The first one results in an empty data.frame, the second in all rows existing twice, once for the variables from the first dataset, and the second with the variables from the second dataset:
> M1 = merge(Access, CO2, c("x","y"))
> head(M1)
[1] x y access CO2equ
<0 rows> (or 0-length row.names)
> M2 = merge(Access, CO2, by=c("x","y"), all=TRUE)
> length(M2$x)
[1] 4366212
> head(M2)
x y access CO2equ
1 -179.95 -89.95 NA 0
2 -179.95 -89.85 NA 0
3 -179.95 -89.75 NA 0
4 -179.95 -89.65 NA 0
5 -179.95 -89.55 NA 0
6 -179.95 -89.45 NA 0
Obviously, the respective x- and y-values are not recognized as being equivalent - but I do not know why. The data types are the same, the values look the same, and worst of all, I did this successfully a few months ago. Back then, I sasve the command history and now when I just copy and paste it into my R console, it does not work. I tried it in both R 2.13.0 and Revolution R Enterprise 4.3. I am reasonably sure that this is not a software bug but something trivial that I just overlooked even after spending some two days on this.
Cheers,
Jochen

Try round(..., 1) on both x and y before the merge.

Related

"contrasts can be applied only to factors with 2 or more levels" Despite having multiple levels in each factor

I am working on a two-way mixed ANOVA using the data below, using one dependent variable, one between-subjects variable and one within-subjects variable. When I tested the normality of the residuals, of the dependent variable, I find that they are not normally distributed. But at this point I am able to perform the two-way ANOVA. Howerver, when I perform a log10 transformation, and run the script again using the log transformed variable, I get the error "contrasts can be applied only to factors with 2 or more levels".
> str(m_runjumpFREQ)
'data.frame': 564 obs. of 8 variables:
$ ID1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ ID : chr "ID1" "ID2" "ID3" "ID4" ...
$ Group : Factor w/ 2 levels "II","Non-II": 1 1 1 1 1 1 1 1 1 1 ...
$ Pos : Factor w/ 3 levels "center","forward",..: 2 1 2 3 2 2 1 3 2 2 ...
$ Match_outcome : Factor w/ 2 levels "W","L": 2 2 2 2 2 2 2 2 2 1 ...
$ time : Factor w/ 8 levels "runjump_nADJmin_q1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ runjump : num 0.0561 0.0858 0.0663 0.0425 0.0513 ...
$ log_runjumpFREQ: num -1.25 -1.07 -1.18 -1.37 -1.29 ...
Some answers on StackOverflow to this error have mentioned that one or more factors in the data set, used for the ANOVA, are of less than two levels. But as seen above they are not.
Another explanation I have read is that it may be the issue of missing values, where there may be NA's. There is:
m1_nasum <- sum(is.na(m_runjumpFREQ$log_runjumpFREQ))
> m1_nasum
[1] 88
However, I get the same error even after removing the rows including NA's as follows.
> m_runjumpFREQ <- na.omit(m_runjumpFREQ)
> m1_nasum <- sum(is.na(m_runjumpFREQ$log_runjumpFREQ))
> m1_nasum
[1] 0
I could run the same script without log transformation and it would work, but with it, I get the same error. The factors are the same either way and the missing values do not make a difference. Either I am doing a crucial mistake or the issue is in the line of the log transformation below.
log_runjumpFREQ <- log10(m_runjumpFREQ$runjump)
m_runjumpFREQ <- cbind(m_runjumpFREQ, log_runjumpFREQ)
I appreciate the help.
It is not good enough that the factors have 2 levels. In addition those levels must actually be present. For example, below f has 2 levels but only 1 is actually present.
y <- (1:6)^2
x <- 1:6
f <- factor(rep(1, 6), levels = 1:2)
nlevels(f) # f has 2 levels
## [1] 2
lm(y ~ x + f)
## Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
## contrasts can be applied only to factors with 2 or more levels

Calculation of Geometric Mean of Data that includes NAs

EDIT: The problem was not within the geoMean function, but with a wrong use of aggregate(), as explained in the comments
I am trying to calculate the geometric mean of multiple measurements for several different species, which includes NAs. An example of my data looks like this:
species <- c("Ae", "Ae", "Ae", "Be", "Be")
phen <- c(2, NA, 3, 1, 2)
hveg <- c(NA, 15, 12, 60, 59)
df <- data.frame(species, phen, hveg)
When I try to calculate the geometric mean for the species Ae with the built-in function geoMean from the package EnvStats like this
library("EnvStats")
aggregate(df[, 3:3], list(df1$Sp), geoMean, na.rm=TRUE)
it works wonderful and skips the NAs to give me the geometric means per species.
Group.1 phen hveg
1 Ae 4.238536 50.555696
2 Be 1.414214 1.414214
When I do this with my large dataset, however, the function stumbles over NAs and returns NA as result even though there are e.g 10 numerical values and only one NA. This happens for example with the column SLA_mm2/mg.
My large data set looks like this:
> str(cut2trait1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 22 obs. of 19 variables:
$ Cut : chr "15_08" "15_08" "15_08" "15_08" ...
$ Block : num 1 1 1 1 1 1 1 1 1 1 ...
$ ID : num 451 512 431 531 591 432 551 393 511 452 ...
$ Plot : chr "1_1" "1_1" "1_1" "1_1" ...
$ Grazing : chr "n" "n" "n" "n" ...
$ Acro : chr "Leuc.vulg" "Dact.glom" "Cirs.arve" "Trif.prat" ...
$ Sp : chr "Lv" "Dg" "Ca" "Tp" ...
$ Label_neu : chr "Lv021" "Dg022" "Ca021" "Tp021" ...
$ PlantFunctionalType: chr "forb" "grass" "forb" "forb" ...
$ PlotClimate : chr "AC" "AC" "AC" "AC" ...
$ Season : chr "Aug" "Aug" "Aug" "Aug" ...
$ Year : num 2015 2015 2015 2015 2015 ...
$ Tiller : num 6 3 3 5 6 8 5 2 1 7 ...
$ Hveg : num 25 38 70 36 68 65 23 58 71 27 ...
$ Hrep : num 39 54 77 38 76 70 65 88 98 38 ...
$ Phen : num 8 8 7 8 8 7 6.5 8 8 8 ...
$ SPAD : num 40.7 42.4 48.7 43 31.3 ...
$ TDW_in_g : num 4.62 4.85 11.86 5.82 8.99 ...
$ SLA_mm2/mg : num 19.6 19.8 20.3 21.2 21.7 ...
and the result of my code
gm_cut2trait1 <- aggregate(cut2trait1[, 13:19], list(cut2trait1$Sp), geoMean, na.rm=TRUE)
is (only the first two rows):
Group.1 Tiller Hveg Hrep Phen SPAD TDW_in_g SLA_mm2/mg
1 Ae 13.521721 73.43485 106.67933 NA 28.17698 1.2602475 NA
2 Be 8.944272 43.95452 72.31182 5.477226 20.08880 0.7266361 9.309672
Here, the geometric mean of SLA for Ae is NA, even though there are 9 numeric measurements and only one NA in the column used to calculate the geometric mean.
I tried to use the geometric mean function suggested here:
Geometric Mean: is there a built-in?
But instead of NAs, this returned the value 1.000 when used with my big dataset, which doesn't solve my problem.
So my question is: What is the difference between my example df and the big dataset that throws the geoMean function off the rails?

Wilcoxon test in R - x must be numeric error

I have a problem with the wilcox.test in R.
My data object is a matrix in which the first column contains a name, and all other columns contain a (gene expression) measurement, which is numeric:
str(myMatrix)
'data.frame': 2000 obs. of 143 variables:
$ precursor : chr "name1" "name2" "name3" "name4" ...
$ sample1: num 1.46e-03 2.64e+02 1.46e-03 1.46e-03 1.46e-03 ...
$ sample2: num 1.46e-03 1.91e+02 1.46e-03 1.46e-03 1.46e-03 ...
$ sample3: num 1.46e-03 3.01e+02 1.46e-03 1.46e-03 4.96 ...
For all of the 2000 rows I want to test whether there is a difference between 2 given parts of the matrix. I tried this in 4 different ways:
wilcox.test(as.numeric(myMatrix[i,2:87],myMatrix[i,88:98]))$p.value
#[1] 1.549484e-16
wilcox.test(myMatrix[i,2:87],myMatrix[i,88:98])$p.value
#Error in wilcox.test.default(myMatrix[i, 2:87], myMatrix[i, 88:98]) :
#'x' must be numeric
t.test(as.numeric(myMatrix[i,2:87],myMatrix[i,88:98]))$p.value
#[1] 0.2973957
t.test(myMatrix[i,2:87],myMatrix[i,88:98])$p.value
#[1] 0.3098505
So as you can see, only if I use as.numeric() on the already numeric values I get a result without an error message for the Wilcoxon test, but the results completely differ from t.test results even if they should not.
Manually verifying by using an online tool shows that the t.test results using as.numeric() values are wrong.
Any suggestions about how I can solve this problem and do the correct Wilcoxon test? If you need more information let me know.
Actually, myMatrix [i, 2:87] is still a data.frame. See the following example.
> myMat
fir X1 X2 X3 X4
1 name1 1 5 9 13
2 name2 2 6 10 14
3 name3 3 7 11 15
4 name4 4 8 12 16
> class(myMat[1, 2:4])
[1] "data.frame"
> as.numeric(myMat[1, 2:4])
[1] 1 5 9
Change your data to a real Matrix will solve your problem.
> myMat_01 <- myMat[, 2:5]
> rownames(myMat_01) <- myMat$fir
> myMat_01 <- as.matrix(myMat_01)
> class(myMat_01[1, 2:4])
[1] "integer"

daply: Correct results, but confusing structure

I have a data.frame mydf, that contains data from 27 subjects. There are two predictors, congruent (2 levels) and offset (5 levels), so overall there are 10 conditions. Each of the 27 subjects was tested 20 times under each condition, resulting in a total of 10*27*20 = 5400 observations. RT is the response variable. The structure looks like this:
> str(mydf)
'data.frame': 5400 obs. of 4 variables:
$ subject : Factor w/ 27 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
$ congruent: logi TRUE FALSE FALSE TRUE FALSE TRUE ...
$ offset : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 5 5 1 2 5 5 2 2 3 5 ...
$ RT : int 330 343 457 436 302 311 595 330 338 374 ...
I've used daply() to calculate the mean RT of each subject in each of the 10 conditions:
myarray <- daply(mydf, .(subject, congruent, offset), summarize, mean = mean(RT))
The result looks just the way I wanted, i.e. a 3d-array; so to speak 5 tables (one for each offset condition) that show the mean of each subject in the congruent=FALSE vs. the congruent=TRUE condition.
However if I check the structure of myarray, I get a confusing output:
List of 270
$ : num 417
$ : num 393
$ : num 364
$ : num 399
$ : num 374
...
# and so on
...
[list output truncated]
- attr(*, "dim")= int [1:3] 27 2 5
- attr(*, "dimnames")=List of 3
..$ subject : chr [1:27] "1" "2" "3" "5" ...
..$ congruent: chr [1:2] "FALSE" "TRUE"
..$ offset : chr [1:5] "1" "2" "3" "4" ...
This looks totally different from the structure of the prototypical ozone array from the plyr package, even though it's a very similar format (3 dimensions, only numerical values).
I want to compute some further summarizing information on this array, by means of aaply. Precisely, I want to calculate the difference between the congruent and the incongruent means for each subject and offset.
However, already the most basic application of aaply() like aaply(myarray,2,mean) returns non-sense output:
FALSE TRUE
NA NA
Warning messages:
1: In mean.default(piece, ...) :
argument is not numeric or logical: returning NA
2: In mean.default(piece, ...) :
argument is not numeric or logical: returning NA
I have no idea, why the daply() function returns such weirdly structured output and thereby prevents any further use of aaply. Any kind of help is kindly appreciated, I frankly admit that I have hardly any experience with the plyr package.
Since you haven't included your data it's hard to know for sure, but I tried to make a dummy set off your str(). You can do what you want (I'm guessing) with two uses of ddply. First the means, then the difference of the means.
#Make dummy data
mydf <- data.frame(subject = rep(1:5, each = 150),
congruent = rep(c(TRUE, FALSE), each = 75),
offset = rep(1:5, each = 15), RT = sample(300:500, 750, replace = T))
#Make means
mydf.mean <- ddply(mydf, .(subject, congruent, offset), summarise, mean.RT = mean(RT))
#Calculate difference between congruent and incongruent
mydf.diff <- ddply(mydf.mean, .(subject, offset), summarise, diff.mean = diff(mean.RT))
head(mydf.diff)
# subject offset diff.mean
# 1 1 1 39.133333
# 2 1 2 9.200000
# 3 1 3 20.933333
# 4 1 4 -1.533333
# 5 1 5 -34.266667
# 6 2 1 -2.800000

Between/within standard deviations

When working on a hierarchical/multilevel/panel dataset, it may be very useful to adopt a package which returns the within- and between-group standard deviations of the available variables.
This is something that with the following data in Stata can be easily done through the command
xtsum, i(momid)
I made a research, but I cannot find any R package which can do that..
edit:
Just to fix ideas, an example of hierarchical dataset could be this:
son_id mom_id hispanic mom_smoke son_birthweigth
1 1 1 1 3950
2 1 1 0 3890
3 1 1 0 3990
1 2 0 1 4200
2 2 0 1 4120
1 3 0 0 2975
2 3 0 1 2980
The "multilevel" structure is given by the fact that each mother (higher level) has two or more sons (lower level). Hence, each mother defines a group of observations.
Accordingly, each dataset variable can vary either between and within mothers or only between mothers. birtweigth varies among mothers, but also within the same mother. Instead, hispanic is fixed for the same mother.
For example, the within-mother variance of son_birthweigth is:
# mom1 means
bwt_mean1 <- (3950+3890+3990)/3
bwt_mean2 <- (4200+4120)/2
bwt_mean3 <- (2975+2980)/2
# Within-mother variance for birthweigth
((3950-bwt_mean1)^2 + (3890-bwt_mean1)^2 + (3990-bwt_mean1)^2 +
(4200-bwt_mean2)^2 + (4120-bwt_mean2)^2 +
(2975-bwt_mean3)^2 + (2980-bwt_mean3)^2)/(7-1)
While the between-mother variance is:
# overall mean of birthweigth:
# mean <- sum(data$son_birthweigth)/length(data$son_birthweigth)
mean <- (3950+3890+3990+4200+4120+2975+2980)/7
# within variance:
((bwt_mean1-mean)^2 + (bwt_mean2-mean)^2 + (bwt_mean3-mean)^2)/(3-1)
I don't know what your stata command should reproduce, but to answer the second part of question about
hierarchical structure , it is easy to do this with list.
For example, you define a structure like this:
tree = list(
"var1" = list(
"panel" = list(type ='p',mean = 1,sd=0)
,"cluster" = list(type = 'c',value = c(5,8,10)))
,"var2" = list(
"panel" = list(type ='p',mean = 2,sd=0.5)
,"cluster" = list(type="c",value =c(1,2)))
)
To create this lapply is convinent to work with list
tree <- lapply(list('var1','var2'),function(x){
ll <- list(panel= list(type ='p',mean = rnorm(1),sd=0), ## I use symbol here not name
cluster= list(type = 'c',value = rnorm(3))) ## R prefer symbols
})
names(tree) <-c('var1','var2')
You can view he structure with str
str(tree)
List of 2
$ var1:List of 2
..$ panel :List of 3
.. ..$ type: chr "p"
.. ..$ mean: num 0.284
.. ..$ sd : num 0
..$ cluster:List of 2
.. ..$ type : chr "c"
.. ..$ value: num [1:3] 0.0722 -0.9413 0.6649
$ var2:List of 2
..$ panel :List of 3
.. ..$ type: chr "p"
.. ..$ mean: num -0.144
.. ..$ sd : num 0
..$ cluster:List of 2
.. ..$ type : chr "c"
.. ..$ value: num [1:3] -0.595 -1.795 -0.439
Edit after OP clarification
I think that package reshape2 is what you want. I will demonstrate this here.
The idea here is in order to do the multilevel analysis we need to reshape the data.
First to divide the variables into two groups :identifier and measured variables.
library(reshape2)
dat.m <- melt(dat,id.vars=c('son_id','mom_id')) ## other columns are measured
str(dat.m)
'data.frame': 21 obs. of 4 variables:
$ son_id : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 1 2 1 2 3 ...
$ mom_id : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 3 3 1 1 1 ...
$ variable: Factor w/ 3 levels "hispanic","mom_smoke",..: 1 1 1 1 1 1 1 2 2 2 ...
$ value : num 1 1 1 0 0 0 0 1 0 0 ..
Once your have data in "moten" form , you can "cast" to rearrange it in the shape that you want:
# mom1 means for all variable
acast(dat.m,variable~mom_id,mean)
1 2 3
hispanic 1.0000000 0 0.0
mom_smoke 0.3333333 1 0.5
son_birthweigth 3943.3333333 4160 2977.5
# Within-mother variance for birthweigth
acast(dat.m,variable~mom_id,function(x) sum((x-mean(x))^2))
1 2 3
hispanic 0.0000000 0 0.0
mom_smoke 0.6666667 0 0.5
son_birthweigth 5066.6666667 3200 12.5
## overall mean of each variable
acast(dat.m,variable~.,mean)
[,1]
hispanic 0.4285714
mom_smoke 0.5714286
son_birthweigth 3729.2857143
I know this question is four years old, but recently I wanted to do the same in R and came up with the following function. It depends on dplyr and tibble. Where: df is the dataframe, columns is a numerical vector to subset the dataframe and individuals is the column with the individuals.
xtsumR<-function(df,columns,individuals){
df<-dplyr::arrange_(df,individuals)
panel<-tibble::tibble()
for (i in columns){
v<-df %>% dplyr::group_by_() %>%
dplyr::summarize_(
mean=mean(df[[i]]),
sd=sd(df[[i]]),
min=min(df[[i]]),
max=max(df[[i]])
)
v<-tibble::add_column(v,variacao="overal",.before=-1)
v2<-aggregate(df[[i]],list(df[[individuals]]),"mean")[[2]]
sdB<-sd(v2)
varW<-df[[i]]-rep(v2,each=12) #
varW<-varW+mean(df[[i]])
sdW<-sd(varW)
minB<-min(v2)
maxB<-max(v2)
minW<-min(varW)
maxW<-max(varW)
v<-rbind(v,c("between",NA,sdB,minB,maxB),c("within",NA,sdW,minW,maxW))
panel<-rbind(panel,v)
}
var<-rep(names(df)[columns])
n1<-rep(NA,length(columns))
n2<-rep(NA,length(columns))
var<-c(rbind(var,n1,n1))
panel$var<-var
panel<-panel[c(6,1:5)]
names(panel)<-c("variable","variation","mean","standard.deviation","min","max")
panel[3:6]<-as.numeric(unlist(panel[3:6]))
panel[3:6]<-round(unlist(panel[3:6]),2)
return(panel)
}

Resources