Change global environment to read in numerical order - r

I have a dataset with 22 animals. The animals are named as follows: c("Shark1", "Shark2", "Shark3", ...).
I am trying to plot two categorical variables against each other to determine the proportion of time each shark spent at separate depths:
Sharks<-table(merge$DepthCat, merge$ID2) #Depth category vs. ID
merge$DepthCat[merge$Depth2>200]<-"4"
Sharks<-table(merge$DepthCat, merge$ID2)
plot(t(Sharks), main="",
col=c("whitesmoke", "slategray3", "slategray", "slategray4"),
ylab="Depth category", xlab="Month")
axis(side=4)
While the plot works, the sharks are plotted in alphabetical rather than numerical order, so the bars in the resulting graph appear in the wrong sequence.
Does anyone know how to resolve this for the plot? I have researched the array method but am unsure how it would be implemented here.

You didn't provide your complete data set, so I generated my own random data. Given that the bar headers derived from ID2 are sorting lexicographically, I assumed they are stored as characters in your data.frame merge, so I generated them thusly.
set.seed(2L);
NR <- 300L;
merge <- data.frame(ID2=sample(as.character(1:22),NR,T),Depth2=pmax(0,rnorm(NR,100,50)),stringsAsFactors=F);
merge$DepthCat <- as.character(findInterval(merge$Depth2,c(0,66,133,200)));
str(merge);
## 'data.frame': 300 obs. of 3 variables:
## $ ID2 : chr "5" "16" "13" "4" ...
## $ Depth2 : num 148.8 91.5 136.1 57.8 163.9 ...
## $ DepthCat: chr "3" "2" "3" "1" ...
And sure enough, we can reproduce the problem with this test data:
Sharks <- table(merge$DepthCat,merge$ID2);
plot(t(Sharks),main='',col=c('whitesmoke','slategray3','slategray','slategray4'),ylab='Depth category',xlab='Month');
axis(side=4L);
The solution is to coerce the ID2 vector to numeric so it sorts numerically.
merge$ID2 <- as.integer(merge$ID2);
str(merge);
## 'data.frame': 300 obs. of 3 variables:
## $ ID2 : int 5 16 13 4 21 21 3 19 11 13 ...
## $ Depth2 : num 148.8 91.5 136.1 57.8 163.9 ...
## $ DepthCat: chr "3" "2" "3" "1" ...
Sharks <- table(merge$DepthCat,merge$ID2);
plot(t(Sharks),main='',col=c('whitesmoke','slategray3','slategray','slategray4'),ylab='Depth category',xlab='Month');
axis(side=4L);
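If the IDs really are labels such as "Shark1" rather than bare numbers, as.integer() would return NA for all of them. A minimal sketch of one alternative, assuming labels of that shape (the ids vector below is made up for illustration), is to build a factor whose levels are ordered by the numeric part of each label:

```r
# Hypothetical labels shaped like the ones described in the question
ids <- paste0("Shark", c(1, 10, 2, 22, 3))

# Strip everything that is not a digit and order the unique labels by that number
lvls <- unique(ids)
lvls <- lvls[order(as.integer(gsub("[^0-9]", "", lvls)))]

# A factor with explicit levels keeps this order in table() and plot()
ids_f <- factor(ids, levels = lvls)
levels(ids_f)
```

Assigning such a factor back to merge$ID2 before calling table() should give columns in shark order while keeping the original labels on the axis.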

Related

looping through a named vector

I am trying to figure out how to loop through a named vector of regression coefficients. I want to loop through the vector and detect whether or not a coefficient name contains the string 'country'. If it does, I want to append the corresponding value to an initially empty vector. I already solved this using dplyr tools, but I also want to do it using a for loop.
This is what my data looks like:
str(co2_per_cap_model$coefficients)
Named num [1:164] -0.0511 0.3289 1.2352 3.0743 0.8654 ...
- attr(*, "names")= chr [1:164] "(Intercept)" "time" "countryAlbania" "countryAlgeria" ...
This is the loop I've been tinkering with. Any advice? Thank you in advance.
storage <- c()
for(coeff in co2_per_cap_model$coefficients){
if(str_detect(names(co2_per_cap_model$coefficients), 'country')){
storage <- c(coeff, storage)
}
}
We need to create some reproducible data. Then just use grep:
set.seed(42)
coef <- 1:25
names(coef) <- sample(LETTERS[1:5], 25, replace=TRUE)
str(coef)
# Named int [1:25] 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, "names")= chr [1:25] "A" "E" "A" "A" ...
idx <- grep("A", names(coef))
coef[idx]
# A A A A A A A
# 1 3 4 9 11 17 18
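Since the question asks specifically for a for loop, here is a rough sketch of that as well. The loop in the question iterates over the values, which drops the names, and its if() condition tests all names at once instead of one. Looping over the names avoids both problems. The coefs vector is a made-up stand-in for co2_per_cap_model$coefficients:

```r
# Hypothetical stand-in for co2_per_cap_model$coefficients
coefs <- c("(Intercept)" = -0.05, time = 0.33,
           countryAlbania = 1.24, countryAlgeria = 3.07)

storage <- c()
for (nm in names(coefs)) {
  # nm is a single name, so grepl() returns a single TRUE/FALSE
  if (grepl("country", nm, fixed = TRUE)) {
    # indexing by name keeps the name attached to the value
    storage <- c(storage, coefs[nm])
  }
}
storage
```

Growing a vector inside a loop is fine for 164 coefficients, but the grep version above is shorter and is what I would normally use.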

Subsetting a dataframe by names in another dataframe [duplicate]

I have a very big reference file with thousands of pairwise comparisons between thousands of objects ("OTUs"). The dataframe is in long format:
'data.frame': 14845516 obs. of 3 variables:
$ OTU1 : chr "0" "0" "0" "0" ...
$ OTU2 : chr "8192" "1" "8194" "3" ...
$ gendist: num 78.7 77.8 77.6 74.4 75.3 ...
I also have a much smaller subset with observed data (slightly different structure):
'data.frame': 286903 obs. of 3 variables:
$ OTU1 : chr "1239" "1603" "2584" "1120" ...
$ OTU2 : chr "12136" "12136" "12136" "12136" ...
$ ecodist: num 2.08 1.85 2 1.73 1.53 ...
- attr(*, "na.action")=Class 'omit' Named int [1:287661] 1 759 760 1517 1518 1519 2275 2276 2277 2278 ...
.. ..- attr(*, "names")= chr [1:287661] "1" "759" "760" "1517" ...
Again, it's a pairwise comparison of objects ('OTUs'). All objects in the smaller dataset are also in the reference dataset.
I want to reduce the reference so that it only contains objects that are also found in the smaller dataset. It is very important that this is done on both columns (OTU1, OTU2).
Here is toy data:
library(reshape)
###reference
Ref <- cor(as.data.frame(matrix(rnorm(100),10,10)))
row.names(Ref) <- colnames(Ref) <- LETTERS[1:10]
Ref[upper.tri(Ref)] <- NA
diag(Ref) <- NA
Ref.m <- na.omit(melt(Ref, varnames = c('row', 'col')))
###query
tmp <- cor(as.data.frame(matrix(rnorm(25),5,5)))
row.names(tmp) <- colnames(tmp) <- LETTERS[1:5]
tmp[upper.tri(tmp)] <- NA
diag(tmp) <- NA
tmp.m <- na.omit(melt(tmp, varnames = c('row', 'col')))
The following works for me using your toy data:
Ref[rownames(tmp), colnames(tmp)]
This selects (by name) only those rows in Ref whose names are also the names of rows in tmp, and likewise for columns.
If you want to stick with the long format in the str outputs in the first part of your question, you can instead use something like:
data1[(data1$OTU1 %in% data2$OTU1) & (data1$OTU2 %in% data2$OTU2), ]
Here I'm creating a logical vector that indicates which rows of your reference data frame (data1) have their OTU1 entry somewhere in data2$OTU1, and the same for OTU2. Said logical vector is then used to select rows of data1.
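One caveat: the line above matches OTU1 only against data2$OTU1 and OTU2 only against data2$OTU2. If an object may appear in either column of the smaller dataset, a variant that pools both columns first might be closer to what you want. This is a sketch on made-up miniature tables that reuse the data1/data2 names:

```r
# Hypothetical miniature versions of the reference (data1) and observed (data2) tables
data1 <- data.frame(OTU1 = c("A", "A", "B", "C"),
                    OTU2 = c("B", "D", "C", "D"),
                    gendist = c(1, 2, 3, 4),
                    stringsAsFactors = FALSE)
data2 <- data.frame(OTU1 = c("A", "B"),
                    OTU2 = c("C", "C"),
                    ecodist = c(0.1, 0.2),
                    stringsAsFactors = FALSE)

# Every object that occurs anywhere in the observed data
keep <- union(data2$OTU1, data2$OTU2)

# Keep only reference rows where both members of the pair are in that set
reduced <- data1[data1$OTU1 %in% keep & data1$OTU2 %in% keep, ]
reduced
```

Here the pair A/B is kept because both objects occur somewhere in data2, even though "B" never appears in data2$OTU2.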

Transform to numeric a column with "NULL" values

I've imported a dataset into R in which a column that is supposed to contain numeric values also contains NULL. This makes R set the column class to character or factor, depending on whether or not you use the stringsAsFactors argument.
To give you an idea, this is the structure of the dataset.
> str(data)
'data.frame': 1016 obs. of 10 variables:
$ Date : Date, format: "2014-01-01" "2014-01-01" "2014-01-01" "2014-01-01" ...
$ Name : chr "Chi" "Chi" "Chi" "Chi" ...
$ Impressions: chr "229097" "3323" "70171" "1359" ...
$ Revenue : num 533.78 11.62 346.16 3.36 1282.28 ...
$ Clicks : num 472 13 369 1 963 161 1 7 317 21 ...
$ CTR : chr "0.21" "0.39" "0.53" "0.07" ...
$ PCC : chr "32" "2" "18" "0" ...
$ PCOV : chr "3470.52" "94.97" "2176.95" "0" ...
$ PCROI : chr "6.5" "8.17" "6.29" "NULL" ...
$ Dimension : Factor w/ 11 levels "100x72","1200x627",..: 1 3 4 5 7 8 9 10 11 1 ...
I would like to convert the PCROI column to numeric, but the NULL values make this harder.
I've tried to get around the issue by setting all observations whose current value is NULL to 0, but I got the following error message:
> data$PCROI[which(data$PCROI == "NULL"), ] <- 0
Error in data$PCROI[which(data$PCROI == "NULL"), ] <- 0 :
incorrect number of subscripts on matrix
My idea was to change all the NULL observations to 0 and afterwards convert the whole column to numeric using the as.numeric function.
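The problem is the trailing comma inside the brackets: data$PCROI is a vector, so it takes a single subscript, and the comma makes R treat it as a matrix:
data$PCROI[which(data$PCROI == "NULL"), ] <- 0 # will not work
data$PCROI[which(data$PCROI == "NULL")] <- 0 # will work
By the way, if the column is character you can simply say:
data$PCROI = as.numeric(data$PCROI)
This converts your "NULL" entries to NA automatically (with a coercion warning).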
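A small sketch of that conversion on a made-up vector standing in for data$PCROI. It also covers the factor case, where calling as.numeric() directly would return the internal level codes instead of the values:

```r
# Hypothetical values standing in for data$PCROI
pcroi_chr <- c("6.5", "8.17", "NULL", "6.29")

# Character to numeric: "NULL" cannot be parsed, so it becomes NA
pcroi_num <- suppressWarnings(as.numeric(pcroi_chr))
pcroi_num

# If the column was read as a factor, convert to character first
pcroi_fac <- factor(pcroi_chr)
pcroi_from_factor <- suppressWarnings(as.numeric(as.character(pcroi_fac)))
pcroi_from_factor
```

If you read the file with read.csv() or read.table(), passing na.strings = "NULL" at import time should avoid the problem in the first place.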

sort column in data frame according to levels of another column

I'm new to R and stuck with the following data. Either I searched with the wrong terms or this question has not been raised before (maybe due to its simplicity?).
I have a data frame containing factors and numerical columns:
> head(PAMdata1)
salt cultivar Ratiod13 StdErr
1 50 1 1.0760163 0.02915785
2 100 1 0.9814083 0.04914316
3 50 2 0.9617199 0.06571578
4 100 2 0.7878740 0.10270647
5 50 4 0.9551830 0.04134652
6 100 4 0.8429793 0.10993336
> str(PAMdata1)
'data.frame': 36 obs. of 4 variables:
$ salt : Factor w/ 2 levels "50","100": 1 2 1 2 1 2 1 2 1 2 ...
$ cultivar: Factor w/ 18 levels "27","26","21",..: 7 7 15 15 13 13 11 11 9 9 ...
$ Ratiod13: num 1.076 0.981 0.962 0.788 0.955 ...
$ StdErr : num 0.0292 0.0491 0.0657 0.1027 0.0413 ...
The column 'cultivar' contains factors, whose levels are ordered using another data frame:
PAMdata1$cultivar <- factor(PAMdata1$cultivar, levels = unique(as.character(my_other_df$cultivar)))
levels(PAMdata1$cultivar)
[1] "27" "26" "21" "52" "14" "25" "1" "23" "7" "8" "5" "28" "4" "22" "2"
[16] "53" "51" "50"
What i would like to have is PAMdata1$Ratiod13 ordered by the levels of cultivars. How do i transform the vector of levels into a vector of line numbers each level is located in?
I would appreciate any help on this.
Thanks a lot
After talking to an office mate and seeing your comments, I noticed my confusion between sort() and order().
PAMdata1[order(PAMdata1$cultivar),]
or even
PAMdata1[with(PAMdata1,order(cultivar)),]
would do the job.
Thanks a lot for your help.
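A data.table option would be to convert the data frame with setDT and then setkey on cultivar, which sorts the rows by the factor's level order:
library(data.table)
setkey(setDT(PAMdata1),cultivar)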
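To illustrate why order() does the job here: for a factor it sorts by the integer codes of the levels, not by the labels. A minimal base R sketch on a made-up miniature of PAMdata1:

```r
# Hypothetical miniature of PAMdata1 with a custom level order
df <- data.frame(cultivar = c("1", "2", "27", "1"),
                 Ratiod13 = c(1.08, 0.96, 0.90, 0.98))
df$cultivar <- factor(df$cultivar, levels = c("27", "1", "2"))

# order() on a factor follows the level order: "27" first, then "1", then "2"
sorted <- df[order(df$cultivar), ]
as.character(sorted$cultivar)
sorted$Ratiod13
```

The Ratiod13 column of the sorted data frame is then the vector ordered by the levels of cultivar.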

daply: Correct results, but confusing structure

I have a data.frame mydf, that contains data from 27 subjects. There are two predictors, congruent (2 levels) and offset (5 levels), so overall there are 10 conditions. Each of the 27 subjects was tested 20 times under each condition, resulting in a total of 10*27*20 = 5400 observations. RT is the response variable. The structure looks like this:
> str(mydf)
'data.frame': 5400 obs. of 4 variables:
$ subject : Factor w/ 27 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
$ congruent: logi TRUE FALSE FALSE TRUE FALSE TRUE ...
$ offset : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 5 5 1 2 5 5 2 2 3 5 ...
$ RT : int 330 343 457 436 302 311 595 330 338 374 ...
I've used daply() to calculate the mean RT of each subject in each of the 10 conditions:
myarray <- daply(mydf, .(subject, congruent, offset), summarize, mean = mean(RT))
The result looks just the way I wanted, i.e. a 3d-array; so to speak 5 tables (one for each offset condition) that show the mean of each subject in the congruent=FALSE vs. the congruent=TRUE condition.
However if I check the structure of myarray, I get a confusing output:
List of 270
$ : num 417
$ : num 393
$ : num 364
$ : num 399
$ : num 374
...
# and so on
...
[list output truncated]
- attr(*, "dim")= int [1:3] 27 2 5
- attr(*, "dimnames")=List of 3
..$ subject : chr [1:27] "1" "2" "3" "5" ...
..$ congruent: chr [1:2] "FALSE" "TRUE"
..$ offset : chr [1:5] "1" "2" "3" "4" ...
This looks totally different from the structure of the prototypical ozone array from the plyr package, even though it's a very similar format (3 dimensions, only numerical values).
I want to compute some further summarizing information on this array, by means of aaply. Precisely, I want to calculate the difference between the congruent and the incongruent means for each subject and offset.
However, even the most basic application of aaply(), such as aaply(myarray,2,mean), returns nonsense output:
FALSE TRUE
NA NA
Warning messages:
1: In mean.default(piece, ...) :
argument is not numeric or logical: returning NA
2: In mean.default(piece, ...) :
argument is not numeric or logical: returning NA
I have no idea why the daply() function returns such weirdly structured output and thereby prevents any further use of aaply. Any kind of help is appreciated; I frankly admit that I have hardly any experience with the plyr package.
Since you haven't included your data it's hard to know for sure, but I tried to make a dummy set based on your str() output. You can do what you want (I'm guessing) with two uses of ddply. First the means, then the difference of the means.
library(plyr)
#Make dummy data
mydf <- data.frame(subject = rep(1:5, each = 150),
congruent = rep(c(TRUE, FALSE), each = 75),
offset = rep(1:5, each = 15), RT = sample(300:500, 750, replace = T))
#Make means
mydf.mean <- ddply(mydf, .(subject, congruent, offset), summarise, mean.RT = mean(RT))
#Calculate difference between congruent and incongruent
mydf.diff <- ddply(mydf.mean, .(subject, offset), summarise, diff.mean = diff(mean.RT))
head(mydf.diff)
# subject offset diff.mean
# 1 1 1 39.133333
# 2 1 2 9.200000
# 3 1 3 20.933333
# 4 1 4 -1.533333
# 5 1 5 -34.266667
# 6 2 1 -2.800000
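If you would rather keep the 3d-array layout from the question, a base R alternative (not plyr) is tapply(), which returns a plain numeric array instead of a list with dimensions. My guess is that the list structure comes from summarize returning a data frame for each piece, but I haven't verified that against your data. The dummy data below is made up and only mimics the columns of mydf:

```r
set.seed(1)
# Hypothetical dummy data: 3 subjects, 2 congruency levels, 5 offsets, 4 trials each
mydf <- expand.grid(subject = factor(1:3),
                    congruent = c(FALSE, TRUE),
                    offset = factor(1:5),
                    trial = 1:4)
mydf$RT <- sample(300:500, nrow(mydf), replace = TRUE)

# Mean RT per cell as a numeric subject x congruent x offset array
myarray <- tapply(mydf$RT, mydf[c("subject", "congruent", "offset")], mean)
str(myarray)

# Congruent minus incongruent mean for each subject and offset
rt_diff <- myarray[, "TRUE", ] - myarray[, "FALSE", ]
dim(rt_diff)
```

Because myarray is an ordinary numeric array here, apply() and aaply() can be used on it without the "argument is not numeric or logical" warning.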
