Labelling Variables - r

I have a series of variables that fall under one related question: lets say there are 20 such variables in my dataframe, each one corresponds to an option on a MC question. They are titled popn1, popn2......popn20.
I want to label each variable by its option, as an example: (popn1 = Everyone; popn2=Children)
I'm using the labelVector package.
Is there a way I can do it without writing out each variable name? Ex. is there a paste function I can use, such as
df2 <- Set_label(df1,
(paste0(popn, 1:20) = "Everyone", "Children", .... "Youth"?)

This can be done in base R quite easily. Here's some sample data (using columns instead of 20, to make it easier to view)
popn1 popn2 popn3 popn4 popn5
1 -0.4085141 3.240716 2.730837 6.428722 8.015210
2 3.1378943 2.512700 2.021546 3.333371 5.654401
3 2.4073278 1.475619 2.449742 2.817447 6.295569
It looks like you already have your new column names in a character vector:
your_column_names <- c("Everyone", "Youth", "Someone", "Something", "Somewhere")
Then you just use the setNames argument on the column names for your data:
colnames(data) <- setNames(your_column_names, colnames(data))
Everyone Youth Someone Something Somewhere
1 -0.4085141 3.240716 2.730837 6.428722 8.015210
2 3.1378943 2.512700 2.021546 3.333371 5.654401
3 2.4073278 1.475619 2.449742 2.817447 6.295569
Sample Data:
data <- structure(list(popn1 = c(-0.408514139489243, 3.13789432899688,
2.40732780606037), popn2 = c(3.24071608151551, 2.51269963339946,
1.47561933493116), popn3 = c(2.73083728435832, 2.02154567048998,
2.44974180329751), popn4 = c(6.42872215439841, 3.3333709733048,
2.81744655980154), popn5 = c(8.0152099281755, 5.65440141443164,
6.29556905855252)), class = "data.frame", row.names = c(NA, -3L
))

Related

Rolling Sample standard deviation in R

I wanted to get the standard deviation of the 3 previous row of the data, the present row and the 3 rows after.
This is my attempt:
mutate(ming_STDDEV_SAMP = zoo::rollapply(ming_f, list(c(-3:3)), sd, fill = 0)) %>%
Result
ming_f
ming_STDDEV_SAMP
4.235279667
0.222740262
4.265353
0.463348209
4.350810667
0.442607461
3.864739333
0.375839159
3.935632333
0.213821765
3.802632333
0.243294783
3.718387667
0.051625808
4.288542333
0.242010836
4.134689
0.198929941
3.799883667
0.112733475
This is what I expected:
ming_f
ming_STDDEV_SAMP
4.235279667
0.225532646
4.265353
0.212776157
4.350810667
0.23658801
3.864739333
0.253399417
3.935632333
0.26144862
3.802632333
0.246259684
3.718387667
0.20514358
4.288542333
0.208578409
4.134689
0.208615874
3.799883667
0.233948429
It doesn't match your output exactly, but perhaps this is what you need:
zoo::rollapply(quux$ming_f, 7, FUN=sd, partial=TRUE)
(It also works replacing 7 with list(-3:3).)
This expression isn't really different from your sample code, but the output is correct. Perhaps your original frame has a group_by still applied?
Data
quux <- structure(list(ming_f = c(4.235279667, 4.265353, 4.350810667, 3.864739333, 3.935632333, 3.802632333, 3.718387667, 4.288542333, 4.134689, 3.799883667), ming_STDDEV_SAMP = c(0.225532646, 0.212776157, 0.23658801, 0.253399417, 0.26144862, 0.246259684, 0.20514358, 0.208578409, 0.208615874, 0.233948429)), class = "data.frame", row.names = c(NA, -10L))

How to prevent coercion to list in R

I am trying to remove all NA values from two columns in a matrix and make sure that neither column has a value that the other doesn't.
code:
data <- dget(file)
dependent <- data[,"chroma"]
independent <- data[,"mass..Pantheria."]
names(independent) <- names(dependent) <- rownames(data)
for (name in rownames(data)) {
if(is.na(dependent[name])) {
independent$name <- NULL
dependent$name <- NULL
}
if(is.na(independent[name])) {
independent$name <- NULL
dependent$name <- NULL
}
}
print(dput(independent))
print(dput(dependent))
I am brand new to R and am trying to perform this task with a for loop. However, when I delete a section by assigning NULL I receive the following warning:
1: In independent$Aeretes_melanopterus <- NULL : Coercing LHS to a list
2: In dependent$name <- NULL : Coercing LHS to a list
No elements are deleted and independent and dependent retain all their original rows.
file (input):
structure(list(chroma = c(7.443501276, 10.96156313, 13.2987235,
17.58110922, 13.4991105), mass..Pantheria. = c(NA, 126.57, NA,
160.42, 250.57)), .Names = c("chroma", "mass..Pantheria."), class = "data.frame", row.names = c("Aeretes_melanopterus",
"Ammospermophilus_harrisii", "Ammospermophilus_insularis", "Ammospermophilus_nelsoni",
"Atlantoxerus_getulus"))
chroma mass..Pantheria.
Aeretes_melanopterus 7.443501 NA
Ammospermophilus_harrisii 10.961563 126.57
Ammospermophilus_insularis 13.298723 NA
Ammospermophilus_nelsoni 17.581109 160.42
Atlantoxerus_getulus 13.499111 250.57
desired output:
structure(list(chroma = c(10.96156313, 17.58110922, 13.4991105
), mass..Pantheria. = c(126.57, 160.42, 250.57)), .Names = c("chroma",
"mass..Pantheria."), class = "data.frame", row.names = c("Ammospermophilus_harrisii",
"Ammospermophilus_nelsoni", "Atlantoxerus_getulus"))
chroma mass..Pantheria.
Ammospermophilus_harrisii 10.96156 126.57
Ammospermophilus_nelsoni 17.58111 160.42
Atlantoxerus_getulus 13.49911 250.57
structure(c(126.57, 160.42, 250.57), .Names = c("Ammospermophilus_harrisii",
"Ammospermophilus_nelsoni", "Atlantoxerus_getulus"))
Ammospermophilus_harrisii Ammospermophilus_nelsoni Atlantoxerus_getulus
126.57 160.42 250.57
structure(c(10.96156313, 17.58110922, 13.4991105), .Names = c("Ammospermophilus_harrisii",
"Ammospermophilus_nelsoni", "Atlantoxerus_getulus"))
Ammospermophilus_harrisii Ammospermophilus_nelsoni Atlantoxerus_getulus
10.96156 17.58111 13.49911
Looks like you want to omit rows from your data where chroma or mass..Pantheria are NA. Here's a quick way to do it:
data = data[!is.na(data$chroma) & !is.na(data$mass..Pantheria.), ]
I'm not sure why you are breaking independent and dependent out separately, but after filtering out bad observations is a good time to do it.
Since those are your only two columns, this is equivalent to omitting rows from your data frame that have any NA values, so you can use a shortcut like this:
data = na.omit(data)
If you want to keep a "pristine" copy of your raw data, simply change the name of the result:
data_no_na = na.omit(data)
# or
data = data[!is.na(data$chroma) & !is.na(data$mass..Pantheria.), ]
As to what's wrong with your code, $ is used for extracting columns from a data frame, but you're trying to use it for a named vector (since you've already extracted the columns), which doesn't work. Even then, $ only works with a literal string, you can't use it with a variable. For data frames, you need to use brackets to extract columns stored in variables. For example, the built-in mtcars data has a column called "mpg":
# these work:
mtcars$mpg
mtcars[, "mpg"]
my_col = "mpg"
mtcars[, my_col]
mtcars$my_col ## does not work, need to use brackets!
You can never use $ with row names in a data frame, only column names.

Microarray Limma package, in topTable function don't assign ID for probsets column

I tried a tutorial by Daniel Swan ,it works perfectly well. But I'm facing a problem in topTable function of limma package.
The "topTable" function create a "probeset list" but this probset list have not "ID" header (other columns name is their sample name, but Probe list column have not name (ID)).
At the result, when I am runing:
gene.symbols <- getSYMBOL(probeset.list$ID, "hgu133plus2")
I'm getting the following error
Error in .select(x, keys, columns, keytype = extraArgs[["kt"]], jointype = jointype):
'keys' must be a character vector
topTable is:
logFC AveExpr t P.Value adj.P.Val B
204779_s_at 7.367790 4.171707 72.77347 3.284937e-15 8.969850e-11 20.25762
207016_s_at 6.936667 4.027733 57.39252 3.694641e-14 5.044293e-10 19.44987
209631_s_at 5.192949 4.003992 51.24892 1.170273e-13 1.065182e-09 18.96660
my expression Set achieved by simpleaffy (gcrma) package.
I'm runing R 3.0.2 under windows 7 with latest bioconductor packages, simpleaffy_2.38.0 , limma_3.18.13 and anotation files: hgu133plus2.db_2.10.1 ,hgu133plus2probe_2.13.0, hgu133plus2cdf_2.13.0
I would be very thankful, if somebody could help me.
The IDs are not stored as an ID column, but as the rownames of the table. Change the line to:
gene.symbols <- getSYMBOL(rownames(probeset.list), "hgu133plus2")
If you want there to be an ID column instead of using row names, you can assign one with:
probeset.list$ID = rownames(probeset.list)
According to the documentation of the toptable function, the ID column will exist if and only if there are duplicated gene names:
If ‘fit’ had unique rownames, then the row.names of the above
data.frame are the same in sorted order. Otherwise, the row.names
of the data.frame indicate the row number in ‘fit’. If ‘fit’ had
duplicated row names, then these are preserved in the ‘ID’ column
of the data.frame, or in ‘ID0’ if ‘genelist’ already contained an
‘ID’ column.
In the other examples you've seen ID used, there must have been duplicate gene names in the input. This makes sense because R typically doesn't like having duplicated rownames (but has no problem having duplicate IDs in a column).
Hope my piece of working codes can make your question clear:
library(limma) # загружаем нужную библиотека
library(siggenes)
library(cluster)
library(stats)
data <- read.table("AneurismDataAllProbesGenesisLog2NormalizedExperAndGenes.tab", sep = "\t", header = TRUE) # read from file
q = as.matrix(data) # данные в матрицу
b = as.matrix(cbind(data[, 2:10], data[, 11:14])) # cмежные колонки данных
m = normalizeQuantiles(b, ties=TRUE)
f = data.frame(condition = c(0,0,0,0,0,0,0,0,0,1,1,1,1)) # дизайн
fit = lmFit(m, f) # линейная модель
e = eBayes(fit) # тест Байеса
volcanoplot(e, coef=1, highlight=5, names=data$GeneName, xlab="Log Fold Change", ylab="Log Odds", pch=19, cex=0.67, col = "dark blue") # график-вулкан
z = rownames(m) = data[, 1]
hc <- hclust(dist(m), "ave") # кластерграмма
plot(hc)
plot(hc, hang = -1)
print(e$coefficients) # output eBayes coefficients
print(e$p.value) # get out the P values
toptable(e) # select 10 most differentialy expressed genes, the disadvantage that it outputs only the gene row number and not the name
printresult <-toptable(e) # assign the result to a variable
write.csv(printresult, file = "eBayesTableAneurism", row.names = TRUE) # write to the file in the current folder
volcanoplot(e, coef=1, highlight=10, names=data[,1], xlab="Log Fold Change", ylab="Log Odds", pch=19, cex=0.67, col = "red") # график-вулкан c именами
volcanoplot(e, coef=1, highlight=5, names=data[,1], xlab="Log Fold Change", ylab="Log Odds", pch=19, cex=0.67, col = "blue") # график-вулкан с именами (Volcano with gene names)

what is the "class" parameter in structure()?

I am trying to use the structure() function to create a data frame in R.
I saw something like this
structure(mydataframe, class="data.frame")
Where did class come from? I saw someone using it, but it is not listed in the R document.
Is this something programmers learned in another language and carries it over? And it works. I am very confused.
Edit: I realized dput(), is what actually created a data frame looking like this. I got it figured out, cheers!
You probably saw someone using dput. dput is used to post (usually short) data. But normally you would not create a data frame like that. You would normally create it with the data.frame function. See below
> example_df <- data.frame(x=rnorm(3),y=rnorm(3))
> example_df
x y
1 0.2411880 0.6660809
2 -0.5222567 -0.2512656
3 0.3824853 -1.8420050
> dput(example_df)
structure(list(x = c(0.241188014013708, -0.522256746461544, 0.382485333260912
), y = c(0.666080872170054, -0.251265630627216, -1.84200501106852
)), .Names = c("x", "y"), row.names = c(NA, -3L), class = "data.frame")
Then, if someone wants to "copy" your data.frame, he just has to run the following:
> copied_df <- structure(list(x = c(0.241188014013708, -0.522256746461544, 0.382485333260912
+ ), y = c(0.666080872170054, -0.251265630627216, -1.84200501106852
+ )), .Names = c("x", "y"), row.names = c(NA, -3L), class = "data.frame")
I put "copy" in quotes because note the following:
> identical(example_df,copied_df)
[1] FALSE
> all.equal(example_df,copied_df)
[1] TRUE
identical yields false because when you post your dput output, often the numbers get rounded to a certain decimal point.
'class' is not a specific argument to the structure function - that's why you didn't find it in the help file.
structure takes an object and then any number of name/value pairs and sets them as attributes on the object. In this case, class was such an attribute. You can try this to add fictional 'foo' and 'bar' attributes to a vector:
x <- structure(1:3, foo=42, bar='hello')
attributes(x)
#$foo
#[1] 42
#
#$bar
#[1] "hello"
And as Joshua Ulrich and Xu Wang mentioned, you should not create a data.frame like that.
I'm scratching my head, wondering what "R Document" would not have said something about "class". It's a very basic component of the the language and how functions get applied. You should type this and read:
?class
?methods

How do I use elements of a dataframe like hash keys / dictionary keys / primary keys?

I have a dataframe in which I want to use certain values as hash keys / dictionary keys (or whatever you call it in your language of choice) for other values in that dataframe. Say I have a dataframe like this which I've read in from a large csv file (only first row shown):
Plate.name QN.number Well Allele.X.Rn Allele.Y.Rn Call
1 Plate 1_A1 QN2200 A 1.766 2.791 Both
which in R code would be:
structure(list(Plate.name = structure(1L, .Label = "Plate 1_A1", class = "factor"),
QN.number = structure(1L, .Label = "QN2200", class = "factor"),
Well = structure(1L, .Label = "A1", class = "factor"), Allele.X.Rn = 1.766,
Allele.Y.Rn = 2.791, Call = structure(1L, .Label = "Both", class = "factor")), .Names = c("Plate.name",
"QN.number", "Well", "Allele.X.Rn", "Allele.Y.Rn", "Call"), class = "data.frame", row.names = c(NA,
-1L))
THe QN.numbers are unique IDs in my dataset. How do I then retrieve data using the QN.number as a reference for the other values, that is to say I want to know the Call or the Allele.X.Rn for a given QN.number? It seems row.names might do the trick but then how would I use them in this instance?
Using row.names is like this:
> row.names(d)=d$QN.number
> d["QN2200",]
Plate.name QN.number Well Allele.X.Rn Allele.Y.Rn Call
QN2200 Plate 1_A1 QN2200 A1 1.766 2.791 Both
> d["QN2201",]
Plate.name QN.number Well Allele.X.Rn Allele.Y.Rn Call
NA <NA> <NA> <NA> NA NA <NA>
You just use the row name as the first parameter in the subsetting. You can also use multiple row names:
> d=data.frame(a=letters[1:10],b=runif(10))
> row.names(d)=d$a
> d[c("a","g","d"),]
a b
a a 0.6434431
g g 0.6724661
d d 0.9826392
Now I'm not sure how clever this is, and whether it does sequential search for each row name or faster indexing...
Use subset.
subset(your_data, QN.number == "QN2200", Allele.X.Rn)
with provides an alternative; here the output is a vector rather than another data frame.
with(your_data, Allele.X.Rn[QN.number == "QN2200"])
Assuming that we're storing our data frame in a variable name--I'll call it dataframe for now--the following should do it:
dataframe$Allele.X.Rn[which(dataframe$Qn.number == <whatever>)]
Where, of course <whatever> is the number that you'd like to use for Qn.number.

Resources