R: create a new dataframe whose rows are the column names of another dataframe

Is there a simple one-liner to create a new dataframe from an original dataframe, where the rownames (or at least the first row) come from the column names of the original dataframe?
For example:
Original <- data.frame("A"=c("apples", "aligator", "algebra"), "B"=c("Banana", "Beans", "Baby"))
Gives:
         A      B
1   apples Banana
2 aligator  Beans
3  algebra   Baby
What I want is:
A
B

Actually figured it out - was very simple.
NewDataFrame <- data.frame(colnames(Original))
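A minimal sketch of the same one-liner, assuming you also want a readable column name. The name `column` is an illustrative choice, not part of the original post; without it, R derives the name from the expression and the column ends up as `colnames.Original.`.

```r
Original <- data.frame(A = c("apples", "aligator", "algebra"),
                       B = c("Banana", "Beans", "Baby"))

# One row per column of Original; `column` is a hypothetical name
NewDataFrame <- data.frame(column = colnames(Original))
print(NewDataFrame)
```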

Related

How to subset the first column (rownames) in R [duplicate]

I have xy data for gene expression in multiple samples. I wish to subset the first column so I can order the genes alphabetically and perform some other filtering.
> setwd("C:/Users/Will/Desktop/BIOL3063/R code assignment");
> df = read.csv('R-assignments-dataset.csv', stringsAsFactors = FALSE);
Here is a simplified example of the dataset I'm working with. It has 270 columns (tissue samples) and 7065 rows (gene names).
The first column is a list of gene names (A2M, AAAS, AACS etc.) and each remaining column is a different tissue sample, showing the gene expression in that sample.
The question being asked is "Sort the gene names alphabetically (A-Z) and print out the first 20 gene names".
My thought process would be to subset the first column (gene names) and then use order() to sort alphabetically, after which I can use head() to print the first 20.
However when I try
> genes <- df[1]
It simply subsets the first column that has data in it (TCGA-A6-2672_TissueA) rather than the one to its left.
Also
> genes <- df[,df$col1];
> genes;
data frame with 0 columns and 7065 rows
> order(genes);
integer(0)
This appears to create a list of gene names in RStudio's viewer, but I cannot perform any manipulation on it.
I am unable to locate the first column in the data.frame, since it does not have a column header, and I have the same problem with row 1 (sample names).
I'm a complete novice at R and this is part of an assignment I'm working on. It seems I'm missing something fundamental, but I cannot figure out what.
Cheers guys
Please include a sample of your text file as text instead of an image.
I have created a dataset similar to yours:
  X Y
1 a b
2 c d
3 d g
Note that your tissue columns have a header but your gene names do not. Therefore these will be interpreted as rownames, see ?read.table:
If row.names is not specified and the header line has one less entry
than the number of columns, the first column is taken to be the row
names.
Reading it in R:
df <- read.table(text = ' X Y
1 a b
2 c d
3 d g')
So your gene names are not in df[1] but in rownames(df). To get them, use genes <- rownames(df); to add them to the existing df as a column, use df$gene <- rownames(df).
There are numerous ways to convert your row names to a column see for example this question.
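The row-name approach above can be sketched end to end as follows. The gene and sample names are made-up toy data standing in for the real file, and `sorted_genes` and `df_sorted` are illustrative variable names.

```r
# The header has one fewer entry than the data rows, so read.table
# treats the first column as row names (toy data, not the real dataset)
df <- read.table(text = "SampleA SampleB
TP53 1.2 3.4
A2M 0.5 0.7
AACS 2.1 1.9")

genes <- rownames(df)          # gene names live in the row names
sorted_genes <- sort(genes)    # alphabetical order
head(sorted_genes, 20)         # first 20 names (only 3 exist here)

# Reorder the whole data frame by gene name
df_sorted <- df[order(rownames(df)), ]
```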
If you are asking what I think you are asking, you just need to wrap the subset in data.frame() and name the new column, which gives it a "header", as you call it. Here it is called V1, the first variable of your new data frame. (With as.data.frame(df[,1]) alone, the column is named after the expression, df[, 1], rather than V1.)
genes <- data.frame(V1 = df[,1])
genes$V1
1 A
2 C
3 A
4 B
5 C
6 D
7 A
8 B
As per the comment below, the issue could be avoided if you remove the comma from your subsetting syntax. When you select columns from a data.frame, you only need to index the column, not the rows.
genes <- df[1]

Use two columns as index to calculate a third column

I have one vector
>a<-c(4,5,6,7,8)
I have one data.frame
>df<-data.frame(start=c(1,4),end=c(3,5))
I want to create a third column in this df based on the start-end
>df
  start end
1     1   3   mean(a[1:3])
2     4   5   mean(a[4:5])
Of course, mean(a[df$start:df$end]) does not work, because the colon operator only uses the first element of each side.
I have solved this the long way by creating a new data.frame, but I am wondering whether there is a shorter way to do it.
We can use mapply to get the seq of corresponding elements of 'start' and 'end' column, subset the 'a' based on that index, get the mean and assign the output to create the new column ('Mean') in 'df'
df$Mean <- mapply(function(x,y) mean(a[seq(x,y)]), df$start, df$end)
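As a quick check, here is the mapply call run on the data from the question. The argument names `s` and `e` are just illustrative.

```r
a  <- c(4, 5, 6, 7, 8)
df <- data.frame(start = c(1, 4), end = c(3, 5))

# For each row, average the elements of `a` from start to end
df$Mean <- mapply(function(s, e) mean(a[seq(s, e)]), df$start, df$end)
print(df)
```

The first row averages 4, 5, 6 and the second averages 7, 8, so the new column holds 5 and 7.5.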

splitting data frame by repeating strings [duplicate]

I have a data frame where one column will repeat the same string for a number of lines (it varies). I'd like to split the data frame based on each of the repeating names into separate data frames (the output can be a list). For example for this data frame:
dat = data.frame(names=c('dog','dog','dog','dog','cat','cat'), value=c(1,2,3,4,5,5))
The output should be
names value
dog 1
dog 2
dog 3
dog 4
and
names value
cat 5
cat 5
I should mention there are thousands of different repeating names.
You can use the split function, which returns the output as a list. I think it is easier to keep the datasets in the list, as most operations can be performed within the list itself.
split(dat, dat$names)
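A short sketch of what the split result looks like and how to keep working inside the list. The variable name `pieces` and the per-group sum are illustrative choices.

```r
dat <- data.frame(names = c("dog", "dog", "dog", "dog", "cat", "cat"),
                  value = c(1, 2, 3, 4, 5, 5))

pieces <- split(dat, dat$names)   # named list, one data frame per name
names(pieces)                     # element names are the distinct values
pieces$dog                        # the four dog rows

# Apply an operation to every piece without leaving the list
sapply(pieces, function(d) sum(d$value))
```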
In case you want to split 'dog', 'cat', 'dog' into a 'list' with 3 elements, one per consecutive run (based on the example shown by @BondedDust), one option is
indx <- inverse.rle(within.list(rle(as.character(dat$names)),
values <- seq_along(values)))
split(dat, indx)
Or using the devel version of data.table, we can use rleid to create a grouping variable
library(data.table)#v1.9.5+
setDT(dat)[, grp:= rleid(names)]
and then use the standard data.table operations for the different groups by specifying the 'grp' as the grouping variable.
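For readers without data.table, the run-based grouping can also be sketched in base R with rle and rep. The data frame `dat2` is a hypothetical example in which 'dog' reappears after 'cat'.

```r
# Hypothetical data where 'dog' comes back after 'cat'
dat2 <- data.frame(names = c("dog", "dog", "cat", "dog"),
                   value = c(1, 2, 3, 4))

runs <- rle(as.character(dat2$names))
# Repeat each run's index as many times as the run is long
grp <- rep(seq_along(runs$lengths), runs$lengths)

by_run <- split(dat2, grp)   # one data frame per consecutive run
length(by_run)
```

Splitting on `dat2$names` directly would merge the two separate 'dog' runs into one element, which is the difference between the two approaches.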

Combine, count df columns w/o repeating other columns

A simple question that I am completely stumped on after consulting packages and functions that I thought would help (plyr, reshape, unique).
Let's say I have the df below:
df <- data.frame(location=c("ny","nj","pa","ct"),
animal=c("dog","hamster","dog","pig"),
animal2=c("cat","dog","pig","dog"))
I would like to count the unique entities in specific columns and then rank occurrences. So here, I'd like to count the combined unique entities in the columns animal and animal2. If I use reshape and melt, the associated location values will repeat in the additional rows, but I don't want that because I only want to count the frequencies of the "location" variables as given in the original df.
Is there a way to rbind without repeating other columns? So in this case I would have another column called AnimalMaster and that would have all of the frequencies I need.
When I try count(df,c("animal","animal2")), it counts the joint occurrences, which is not what I want. Alternatively, I could also do this by just counting the unique strings across multiple columns without combining them. Is there a straightforward way to do this without running into the count problem?
Thank you for helping a beginner.
EDIT:
My desired output is the following:
countsdf with columns (Type, Name, Frequency, Frequency (%)), so that top row would be:
AnimalMaster | dog | 4 | 100%
Here's a suggestion with reshape2 and data.table
require(reshape2)
require(data.table)
dt <- data.table(melt(df, id.vars = 'location', value.name = 'animal'))
dt[, list(n=length(unique(location)),
percent=100*.N/dt[, length(unique(location))]),
by=animal]
# animal n percent
# 1: dog 4 100
# 2: hamster 1 25
# 3: pig 2 50
# 4: cat 1 25
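A base R alternative, offered only as a sketch: pool the two animal columns into one vector and tabulate it, so location is never repeated. The names `pooled`, `counts`, `Name`, `Frequency` and `Percent` are illustrative, and the percentage is taken relative to the number of rows in df, which is an assumption about the desired denominator.

```r
df <- data.frame(location = c("ny", "nj", "pa", "ct"),
                 animal   = c("dog", "hamster", "dog", "pig"),
                 animal2  = c("cat", "dog", "pig", "dog"))

# Pool the two animal columns into one character vector
pooled <- c(as.character(df$animal), as.character(df$animal2))

# Tabulate, then turn the table into a data frame
counts <- as.data.frame(table(Name = pooled), responseName = "Frequency",
                        stringsAsFactors = FALSE)
counts$Percent <- 100 * counts$Frequency / nrow(df)
counts <- counts[order(-counts$Frequency), ]   # rank by frequency
print(counts)
```

Note that this counts occurrences rather than distinct locations, so an animal listed in both columns for the same location would be counted twice.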

How to combine two columns of factors into one column without changing the factor levels into number [duplicate]

I am trying to find a way to combine two columns of factors into one column without changing the factor levels into numbers. For instance, consider the following two data.frame datasets
dataset 1            dataset 2
Number  Student      Number  Student
1       Chris        1       Matt
2       Sarah        2       Keith
I am trying to take the "Student" column from dataset1 and the "Student" column from dataset2, and make one big student column containing the names "Chris", "Sarah", "Matt", and "Keith".
I tried:
student.list<-c(dataset1[,2],dataset2[,2])
student.list
However, this doesn't work, since the names turn into numbers with the c() function. I want my list to preserve the names of the students (i.e. without converting them into numbers). I also tried cbind(), but it gives the same problem as c().
Thank you
Factors are integer codes that happen to have labels. When you combine factors, you are generally combining their integer codes, not their labels. This can often trip a person up.
If you want their labels, you must coerce them to strings using as.character:
student.list <- c( as.character(dataset1[,2]) ,
as.character(dataset2[,2]) )
If you want to get that back to factors, wrap it all in as.factor (can be all in one line, or split into two lines for easier reading)
student.list <- c(as.character(dataset1[,2]),as.character(dataset2[,2]))
student.list <- as.factor(student.list)
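A small sketch of why the conversion matters, using toy data frames built to match the question. As far as I know, from R 4.1.0 onward c() applied to factors returns a factor rather than integer codes, so the original symptom mainly shows up on older versions; the as.character route below behaves the same either way.

```r
dataset1 <- data.frame(Number = 1:2, Student = factor(c("Chris", "Sarah")))
dataset2 <- data.frame(Number = 1:2, Student = factor(c("Matt", "Keith")))

# Under the hood a factor is integer codes plus a levels attribute
as.integer(dataset2$Student)   # the codes, not the names
levels(dataset2$Student)       # the labels the codes point to

# Convert to character first, then combine, then re-factor if needed
student.list <- factor(c(as.character(dataset1$Student),
                         as.character(dataset2$Student)))
student.list
```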
There is the interaction() function in base R.
There is also the strata() function in the survival package.
The data.table package, which extends the functionality of data frames in some very useful ways, will combine factors automatically when you use the rbindlist function. Plus, if your two data sets are large, it will usually combine them more quickly.
library(data.table)
# Example data:
# (If you already have data frames, you can convert them using `as.data.table(dataframename)`)
dataset1<-data.table(Number=1:2,Student=as.factor(c("Chris","Sarah")))
dataset2<-data.table(Number=1:2,Student=as.factor(c("Matt","Keith")))
# Combine the two data sets:
# (It's not necessary to convert factors to characters)
rbindlist(list(dataset1,dataset2))
# Number Student
#1: 1 Chris
#2: 2 Sarah
#3: 1 Matt
#4: 2 Keith
You can now do this easily with fct_c() from the forcats package.
dataset1 <- data.frame(Number = c(1,2), Student = factor(c('Chris','Sarah')))
dataset2 <- data.frame(Number = c(1,2), Student = factor(c('Matt','Keith')))
library(forcats)
fct_c(dataset1[ ,2], dataset2[ ,2])  # forcats 0.3.0+ takes the factors directly; older versions took a list
# [1] Chris Sarah Matt Keith
# Levels: Chris Sarah Keith Matt
If your factors are inside data frames, then you can combine them this way using rbind:
> df1 <- data.frame(x=factor(c('a','b')))
> df2 <- data.frame(x=factor(c('c','d')))
> rbind(df1,df2)
x
1 a
2 b
3 c
4 d
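To confirm that rbind keeps the column as a factor and merges the levels, a quick check along these lines can help. The variable name `combined` is illustrative.

```r
df1 <- data.frame(x = factor(c("a", "b")))
df2 <- data.frame(x = factor(c("c", "d")))

combined <- rbind(df1, df2)

is.factor(combined$x)   # the column is still a factor
levels(combined$x)      # levels from both inputs are present
```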
