SPSS value labels as column names for tables in R? - r

I'm reading a .sav file using haven:
library(haven)
data <- read_spss("file.sav", user_na = FALSE)
Then trying to display one of the variables in a table:
table(data$region)
Which returns:
1 2 3 4 5 6 7 8 9 10 11 12
85 208 43 171 30 40 95 310 133 29 77 36
Which is technically correct, however - in SPSS, the numerical values in the top row have labels associated with them (region names in this case). If I just run data$region, it shows me the numbers and their associated labels at the end of the output, but is there a way to make those string labels appear in the first table row instead of their numerical counterparts?
Thank you in advance for your help!

The way to do this is to cast the variable as a factor, using the "labels" attribute of the vector as the factor levels. The sjlabelled package includes a function that does this in one step:
data$region <- sjlabelled::as_label(data$region)
While the table command will still work on the resulting data, the layout may be a little messy. The forcats package has a function that pretty-prints frequency tables for factors:
forcats::fct_count(data$region)
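For reference, the same conversion can be sketched in base R. This is only a sketch, assuming the variable carries a named numeric "labels" attribute, which is how haven stores SPSS value labels. The region vector and its three labels below are made up for illustration.

```r
# Hypothetical stand-in for data$region: numeric codes plus a named
# "labels" attribute (names are the label text, values are the codes).
region <- c(1, 2, 2, 3, 1, 2)
attr(region, "labels") <- c(North = 1, South = 2, West = 3)

# Use the codes as factor levels and the label text as the displayed labels.
labs <- attr(region, "labels")
region_f <- factor(as.vector(region), levels = unname(labs), labels = names(labs))

# The table header now shows the label text instead of the codes.
region_tab <- table(region_f)
print(region_tab)
```

haven also ships its own as_factor() for labelled vectors, which may be preferable if you would rather not add another package.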

Problem generating row names for a read counts matrix in R

I am following this tutorial online for analyzing RNA-seq data between cell types.
https://combine-australia.github.io/RNAseq-R/06-rnaseq-day1.html
I have been able to perform most of this using my own data, but I am now trying to perform pathway enrichment analysis. However, I am having issues because I am unable to label the rows of my initial read counts matrix according to the Gene IDs.
I have tried to simply create a new column with the Gene IDs, however this changes the matrix to a dataframe and prevents me from using DGEList.
seqdata is my data.frame with all the information on the genes from the analysis, with column 1 as the gene ID names and columns 15 to 24 as the vectors with the read count information of each gene across 10 samples.
I generated a matrix from this data.frame called readcounts_g that just has the read counts for each of these genes, and I am trying to use the gene names in column 1 of seqdata as the row names of readcounts_g.
rownames(readcounts_g) <- seqdata[,1]
Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length
In addition: Warning message:
Setting row names on a tibble is deprecated.
I have also thought of simply adding the gene names as an additional column in readcounts_g, but if I do that I cannot use DGEList, because it requires a matrix.
Ultimately, I am trying to use goana to do an enrichment pathway analysis with differentially expressed genes. But, I am unable to do this without having gene names assigned to the final matrix of DEGs.
If anyone has insight on how I can remedy this, it would be greatly appreciated. I can try to explain further if need be.
If seqdata is a tibble, seqdata[,1] is itself a tibble rather than a character or numeric vector, so it cannot be assigned as the row names of a matrix. Extract the column as a vector instead, as in the alternative below:
library(dplyr)
seqdata = tibble(geneID=sample(1:1000),
s1=rpois(1000,10),s2=rpois(1000,15),
s3=rpois(1000,20),s4=rpois(1000,25))
readcounts_g = as.matrix(seqdata[,2:5])
rownames(readcounts_g) = seqdata[,1]
#throws error
rownames(readcounts_g) = seqdata$geneID
#ok
> head(readcounts_g)
s1 s2 s3 s4
763 16 13 13 24
776 13 19 24 26
308 12 19 19 34
88 10 8 13 22
23 10 13 16 25
509 9 12 14 28
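A compact variant of the same idea is sketched below. It uses double brackets, which extract a single column as a plain vector from both data frames and tibbles. The tiny seqdata here is made up, and a base data.frame is used only so the sketch runs without extra packages.

```r
# Made-up data for illustration: one ID column plus two count columns.
seqdata <- data.frame(geneID = c("g1", "g2", "g3"),
                      s1 = c(10L, 0L, 5L),
                      s2 = c(7L, 3L, 8L),
                      stringsAsFactors = FALSE)

# Keep only the count columns and convert them to a matrix.
readcounts_g <- as.matrix(seqdata[, c("s1", "s2")])

# [[ ]] extracts the column as a plain vector, for tibbles too.
rownames(readcounts_g) <- seqdata[[1]]

print(readcounts_g)
```

The result is still a matrix, so it should be usable wherever a named count matrix is expected.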

Converting contingency tables with counts to two-column data tables with frequency columns

I would like to enter a frequency table into an R data.table.
The data are in a format like this:
Height
Gender 3 35
m 173 125
f 323 198
... where the entries in the table (173, 125, etc.) are counts.
I have a 2 by 2 table, and I want to turn it into two-column data.table.
The data is from a study of birds who nest at a height. The question is whether different genders of the bird prefer certain heights.
I thought the frequency table should be turned into something like this:
Gender height N
m 3 173
m 35 125
f 3 323
f 35 198
but now I'm not so sure. Some of the models I want to run need every case itemized.
Can I do this conversion in R? Ideally, I'd like a way to switch back and forth between the two formats.
Based on a review of ?table.
Make a data frame (x) with columns for Gender, Height, and Freq which would be your N value.
Convert that to a table by using
tabledata <- xtabs(Freq ~ ., x)
There are a number of base functions that can work with this kind of data, which is obviously much more compact than individual rows.
Also from ?loglin this example using table.
loglin(HairEyeColor, list(c(1, 2), c(1, 3), c(2, 3)))
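Spelled out with the counts from the question, the steps above might look like the sketch below. The object names x, tabledata and back are arbitrary.

```r
# Counts from the question, one row per Gender/Height combination.
x <- data.frame(Gender = c("m", "m", "f", "f"),
                Height = c(3, 35, 3, 35),
                Freq   = c(173, 125, 323, 198))

# Counts -> contingency table; "." means "all remaining columns".
tabledata <- xtabs(Freq ~ ., data = x)
print(tabledata)

# Contingency table -> counts again (factor columns plus Freq).
back <- as.data.frame(tabledata)
print(back)
```

Note that xtabs() orders the levels alphabetically, so the rows come out as f, m rather than in the order they were typed.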
Thanks, everybody (@simon and @Elin) for the help. I thought I was conducting a poll that would get answers like "start with the 4-row version" or "start with the 719-row version" and you all have given me an entire toolbox of ways to move from one to the other. It's really great, informative, and way more than the inquiry deserves.
I unquestionably need to work harder and get more explicit in forming a question. I see by the -3 rating that this boondoggle has earned, crystallizing the fact that I'm not adding anything to the knowledge base, so will delete the question in order to keep future searchers from finding this. I've had a bad run recently with my questions, and as a former teacher of the year, writer of five books, and PhD statistician, it's extremely embarrassing to have been on Stack Exchange for as long as I have, and stand here with one reputation point. One. That means that my upvotes of your answers don't count for a thing.
That reputation point should be scarlet colored.
Here's what I was getting at:
In a book, a common way to express data is in a 2×2 table:
Height
Gender 3 35
M 173 125
F 323 198
My tic-tac-sized mind sees two ways of entering that into a data table:
require(data.table)
GENDER <- c("m","m","f","f")
HEIGHT <- c(3, 35, 3, 35)
N <- c(173, 125, 323, 198)
SANDFLIERS <-data.table(GENDER, HEIGHT, N)
That gives the four-line flat-file/tidy representation of the data:
GENDER HEIGHT N
1: m 3 173
2: m 35 125
3: f 3 323
4: f 35 198
The other option is to make a 719-row data table with 173 males at 3 ft, 125 males at 35 ft, etc. It's not too bad if you use the rep() command and build your table columns carefully. I hate doing arithmetic, so I leave some of these numbers bare and untotaled.
# I need 173+125 males, and 323+198 females.
# One c(rep()) for "m", one c(rep()) for "f", and one c() to merge them
gender <- c(c(rep("m", 173+25)), c(rep("f",(323+198))))
# Same here, except the c() functions are one level 'deeper'. I need two
# sets for males (at heights 3 and 35, 173 and 125 of each, respectively)
# and two sets for females (at heights 3 and 35, 323 and 198 respectively)
heights <-c(c(c(rep(3, 173)), c(rep(35,25))), c(c(rep(3, 323)), c(rep(35,198))))
which, when merged into a data.table gives 719 rows, one for each observed bird.
1: m 3
2: m 3
3: m 3
4: m 3
5: m 3
---
715: f 35
716: f 35
717: f 35
718: f 35
719: f 35
Now that I have the data in two formats, I start looking for ways to do plots and analyses.
I can get a mosaic plot using the 719-row version, but you can't see it because of my 1-point reputation
sandfliers <- data.table(gender, heights)
mosaicplot(table(sandfliers), color = TRUE)
Mosaic Plot
and you can get a balloon plot using the 4-row version
Balloon Plot
So my question was, for those of you with lots and lots of experience with this sort of thing, do you find the 4-row or the 719-row tables more common? I can change from one to the other, but that's more code to add to the book (again I hear my editor, "You're teaching statistics, not R").
So, as I said at the top, this was just an informal poll on whether one is used more often than the other, or whether beginners are better off with one.
This is in the form of a contingency table. It isn't easy to enter directly into R but it can be done as follows (based on http://cyclismo.org/tutorial/R/tables.html):
> f <- matrix(c(173,125,323,198),nrow=2,byrow=TRUE)
> colnames(f) <- c(3,35)
> rownames(f) <- c("m","f")
> f <- as.table(f)
> f
3 35
m 173 125
f 323 198
You can then create a count or frequency table with:
> as.data.frame(f)
Var1 Var2 Freq
1 m 3 173
2 f 3 323
3 m 35 125
4 f 35 198
The R Cookbook gives a short function to convert to a table of cases (i.e. a long list of the individual items), as follows:
> countsToCases(as.data.frame(f))
... where:
# Convert from data frame of counts to data frame of cases.
# `countcol` is the name of the column containing the counts
countsToCases <- function(x, countcol = "Freq") {
# Get the row indices to pull from x
idx <- rep.int(seq_len(nrow(x)), x[[countcol]])
# Drop count column
x[[countcol]] <- NULL
# Get the rows from x
x[idx, ]
}
... thus you can convert the data to the format needed by any analysis method from any starting format.
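The reverse direction, from individual cases back to counts, can be sketched with table(). The counts data frame below retypes the as.data.frame(f) output from above so the sketch is self-contained. The names cases, tab and counts_again are arbitrary.

```r
# Counts in long form, as produced by as.data.frame(f) above.
counts <- data.frame(Var1 = c("m", "f", "m", "f"),
                     Var2 = c(3, 3, 35, 35),
                     Freq = c(173, 323, 125, 198))

# Counts -> cases: repeat each row index Freq times, drop the count column.
cases <- counts[rep(seq_len(nrow(counts)), counts$Freq), c("Var1", "Var2")]

# Cases -> contingency table -> counts.
tab <- table(cases)
counts_again <- as.data.frame(tab)
print(counts_again)
```

Row order and column types may differ from the starting data frame (table() turns both columns into factors), but the counts themselves survive the round trip.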
(EDIT)
Another way to read in the contingency table is to start with text like this:
> ss <- " 3 35
+ m 173 125
+ f 323 198"
> read.table(text=ss,row.names=1)
X3 X35
m 173 125
f 323 198
Instead of using text =, you can also use a file name to read the table from (for example) a CSV file.

R: Merging Two Dataframes by Rowname Values & Column Values whilst Preserving Rownames [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 5 years ago.
I'm attempting to merge two dataframes. One dataframe contains rownames which appear as values within a column of another dataframe. I would like to append a single column (Top.Viral.TaxID.Name) from the second dataframe based upon these mutual values, to the first dataframe.
The first dataframe looks like this:
ERR1780367 ERR1780369 ERR2013703 xxx...
374840 73 0 0
417290 56 57 20
1923444 57 20 102
349409 40 0 0
265522 353 401 22
322019 175 231 35
The second dataframe looks like this:
Top.Viral.TaxID Top.Viral.TaxID.Name
1 374840 Enterobacteria phage phiX174 sensu lato
2 417290 Saccharopolyspora erythraea prophage pSE211
3 1923444 Shahe picorna-like virus 14
4 417290 Saccharopolyspora erythraea prophage pSE211
5 981323 Gordonia phage GTE2
6 349409 Pandoravirus dulcis
However, I would also like to preserve the rownames of the first dataframe, so the result would look something like this:
ERR1780367 ERR1780369 ERR2013703 xxx... Top.Viral.TaxID.Name
374840 73 0 0 Enterobacteria phage phiX174 sensu lato
417290 56 57 20 Saccharopolyspora erythraea prophage pSE211
1923444 57 20 102 Shahe picorna-like virus 14
349409 40 0 0 Pandoravirus dulcis
265522 353 401 22 Hyposoter fugitivus ichnovirus
322019 175 231 35 Acanthocystis turfacea Chlorella virus 1
Thanks in advance.
I would strongly recommend against relying on rownames. They are embarrassingly often removed, and the functions in dplyr/tidyr always strip them.
Always make the rownames a part of the data, i.e. use "tidy" data sets as in the example below
data(iris)
# We mix the data a bit, to check if rownames are conserved
iris = iris[sample.int(nrow(iris), 20),]
head(iris)
description =
data.frame(Species = unique(iris$Species))
description$fullname = paste("The wonderful", description$Species)
description
# .... the above are your data
iris = cbind(row = rownames(iris), iris)
# Now it is easy
merge(iris, description, by="Species")
And please, use reproducible data when asking questions on SO to get fast answers. It is a lot of work to reformat the data you presented into a form that can be tested.
Use sapply to loop through rownames of dataframe 1 (df1) and search the id in the dataframe 2 (df2), returning the description in the same row.
Something like this
df1$Top.Viral.TaxID.Name <- sapply(rownames(df1), (function(id){
df2$Top.Viral.TaxID.Name[df2$Top.Viral.TaxID == id]
}))
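One caveat with the sapply() version: when an ID has no match, or more than one (417290 appears twice in the sample), the result becomes a list rather than a character vector. A possible alternative is sketched below with match(), which always returns exactly one position per ID. The miniature df1 and df2 are made up from the values in the question.

```r
# Made-up miniature versions of the two data frames from the question.
df1 <- data.frame(ERR1780367 = c(73, 56, 40),
                  row.names = c("374840", "417290", "265522"))
df2 <- data.frame(
  Top.Viral.TaxID = c(374840, 417290, 417290, 349409),
  Top.Viral.TaxID.Name = c("Enterobacteria phage phiX174 sensu lato",
                           "Saccharopolyspora erythraea prophage pSE211",
                           "Saccharopolyspora erythraea prophage pSE211",
                           "Pandoravirus dulcis"),
  stringsAsFactors = FALSE
)

# match() returns the position of the first hit, or NA when there is none,
# so the result always has exactly one entry per row of df1.
idx <- match(rownames(df1), as.character(df2$Top.Viral.TaxID))
df1$Top.Viral.TaxID.Name <- df2$Top.Viral.TaxID.Name[idx]

print(df1)
```

The row names of df1 are untouched, and IDs missing from df2 simply get NA.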

How to get the levels number in R? [duplicate]

This question already has answers here:
How to drop factor levels while scraping data off US Census HTML site
(2 answers)
Closed 5 years ago.
I used as.data.frame(table(something_to_count)) and got a result like:
Var1 Freq
1 20 2970
2 30 1349
3 40 322
4 50 1009
I just want the $Var1 value, but if I write d[1,]$Var1 or d[1,1], I always get these things:
[1] 20
305 Levels: 20 30 40 50 60 70 80 90 100 110 120 130 150 160 170 190 200 ... 4120
And when I try to output the value, it is always 1, not 20. as.numeric() also only returns 1. How can I get the Var1 value literally, as it is, instead of the id of the row? Also, why are the outputs level numbers? What is wrong?
The as.data.frame method for objects of class "table" returns the first column (along with any other "marginal labels" columns) as a factor, and only the last column as numeric counts. See the help page ?table and look at the Value section. Tyler's recommendation to use the R-FAQ conversion strategy, as.numeric(as.character(.)), is "standard R".
This is because the function table turns its argument into a factor (type table into your console and you'll see the line a <- factor(a, exclude=exclude)).
The best solution is to do what Tyler suggested once the result of table has been turned into a data.frame.
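A small sketch of the difference, using made-up values. Because the table is built from a variable named x, the first column of d is called x here rather than Var1.

```r
# Made-up values; table() stores the distinct values as factor levels.
x <- c(20, 20, 30, 50, 50, 50)
d <- as.data.frame(table(x))

# as.numeric() on a factor returns the internal level codes.
codes <- as.numeric(d$x)

# Going through character first recovers the original values.
values <- as.numeric(as.character(d$x))

# Equivalent alternative: convert the levels once, then index by the factor.
values_alt <- as.numeric(levels(d$x))[d$x]

print(codes)
print(values)
```

Here codes holds the positions 1, 2, 3, while values and values_alt hold 20, 30, 50.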

Create a barplot of two tables of differing length

I can not seem to figure out how to get a nice barplot that contains the data from two tables that contain a different number of columns.
The tables in question are something like (snipped some data from the end):
> tab1
1 2 3 6 8 31
5872 1525 831 521 299 4
> tab2
1 2 3 4 22
7874 422 2 5 1
Note the column names and sizes are different. When I just do barplot() on one of these tables it comes out with the plot I'd like (showing the column names as the X-axis, frequencies on Y-axis). But, I would like these two side by side.
I've gotten as far as creating a data frame containing both variables as columns and the different row names in the first column (with data.frame() and merge()), but when I plot this the X-axis seems to be all wrong. Attempting to reorder the columns gives me an exception about lengths differing.
Code:
combined <- merge(data.frame(tab1), data.frame(tab2), by = c('Var1'), all=TRUE)
barplot(t(combined[,2:3]), names.arg = combined[,1], beside=TRUE)
This shows a plot, but not all labels are present and the value for position 26 is plotted after 33.
Is there any simple way to get this plot working? A ggplot2 solution would be nice.
You can put all your data in one data frame (as in example).
df<-data.frame(group=rep(c("A","B"),times=c(2,3)),
values=c(23,56,345,6,7),xval=c(1,2,1,2,8))
group values xval
1 A 23 1
2 A 56 2
3 B 345 1
4 B 6 2
5 B 7 8
Then ggplot() with geom_bar() can be used to plot the data.
library(ggplot2)
ggplot(df,aes(xval,values,fill=group))+
geom_bar(stat="identity",position="dodge")
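Applied to the question's own data, the same layout could be built roughly as follows. This is a sketch: tab1 and tab2 are retyped by hand as named vectors from the (snipped) output shown in the question, and the x values are turned into a numerically ordered factor so that every label appears and 22 and 31 sort after 8.

```r
# The two tables from the question, rebuilt by hand as named vectors.
tab1 <- c("1" = 5872, "2" = 1525, "3" = 831, "6" = 521, "8" = 299, "31" = 4)
tab2 <- c("1" = 7874, "2" = 422, "3" = 2, "4" = 5, "22" = 1)

# Stack both into one long data frame with a column marking the source.
df <- data.frame(
  group  = rep(c("tab1", "tab2"), times = c(length(tab1), length(tab2))),
  xval   = as.numeric(c(names(tab1), names(tab2))),
  values = unname(c(tab1, tab2))
)

# Treat x as a factor ordered numerically rather than alphabetically.
df$xval <- factor(df$xval, levels = sort(unique(df$xval)))

# Plot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(df, ggplot2::aes(xval, values, fill = group)) +
    ggplot2::geom_bar(stat = "identity", position = "dodge")
  print(p)
}
```

With real table objects, as.data.frame(tab1) and as.data.frame(tab2) would give the same long layout without retyping the numbers.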
