splitting data frame by repeating strings [duplicate] - r

This question already has answers here:
Split data.frame based on levels of a factor into new data.frames
(3 answers)
Closed 2 years ago.
I have a data frame where one column will repeat the same string for a number of lines (it varies). I'd like to split the data frame based on each of the repeating names into separate data frames (the output can be a list). For example for this data frame:
dat = data.frame(names=c('dog','dog','dog','dog','cat','cat'), value=c(1,2,3,4,5,5))
The output should be
names value
dog 1
dog 2
dog 3
dog 4
and
names value
cat 5
cat 5
I should mention there are thousands of different repeating names.

You can use the split function, which will give the output in a list. I think it would be easier to have the datasets in the list as most of the operations can be performed within the list itself
split(dat, dat$names)
If you want to split a sequence such as 'dog', 'cat', 'dog' into a list with 3 elements, one per consecutive run (based on the example shown by @BondedDust), one option is
indx <- inverse.rle(within.list(rle(as.character(dat$names)),
values <- seq_along(values)))
split(dat, indx)
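To make the difference between the two groupings concrete, here is a minimal base R sketch. The data frame `dat2` is a hypothetical example (not from the question) in which 'dog' appears in two separate runs, and `run_id` is built with `rep()` rather than `inverse.rle()`, which should be equivalent here.

```r
# Hypothetical data in which 'dog' appears in two separate runs
dat2 <- data.frame(names = c('dog', 'dog', 'cat', 'dog'),
                   value = c(1, 2, 3, 4),
                   stringsAsFactors = FALSE)

# Grouping by value merges both 'dog' runs into one list element
by_name <- split(dat2, dat2$names)

# Grouping by run: repeat each run's index by that run's length
r <- rle(dat2$names)
run_id <- rep(seq_along(r$lengths), r$lengths)
by_run <- split(dat2, run_id)

length(by_name)  # 2 elements: cat, dog
length(by_run)   # 3 elements: dog, cat, dog
```

If every name occurs in a single contiguous block, as in the question, both approaches give the same groups.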
Or, using data.table, we can use rleid to create a grouping variable (rleid first appeared in the development version 1.9.5 and is available in released versions from 1.9.6)
library(data.table)  # v1.9.6+
setDT(dat)[, grp:= rleid(names)]
and then use the standard data.table operations for the different groups by specifying the 'grp' as the grouping variable.
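As a rough illustration of what such a grouped operation might look like, the sketch below sums `value` within each run. The data and the summary column `total` are made up for the example; it assumes data.table is installed.

```r
library(data.table)

# Hypothetical data: 'dog' appears in two separate runs
dat <- data.table(names = c('dog', 'dog', 'cat', 'dog'),
                  value = c(1, 2, 3, 4))

# One group id per consecutive run of identical names
dat[, grp := rleid(names)]

# Summarise each run: keep its name and sum its values
res <- dat[, .(names = names[1], total = sum(value)), by = grp]
print(res)
```

Grouping by `grp` instead of `names` keeps the two 'dog' runs apart, which is the point of creating the run id.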

Related

Is there an R function for performing basic operations on every column of a data frame? [duplicate]

This question already has answers here:
Standardize data columns in R
(16 answers)
Closed 2 years ago.
I have a data frame with n columns like the one below with all the columns being numeric (ex. below only has 3, but the actual one has an unknown number).
col_1 col_2 col_3
1 3 7
3 8 9
5 5 2
8 10 1
11 9 2
I'm trying to transform the data in every column using the equation (x - min(col)) / (max(col) - min(col)), so that every element is scaled based on the values in its column.
Is there a way to do this without using a for loop to iterate through every column? Would sapply or tapply work here?
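We can use scale on the dataset. Note that by default it centers each column on its mean and divides by its standard deviation, so it standardizes the columns rather than applying the min-max formula from the question.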
scale(df1)
Or, if we want to use a custom function, create the function, loop over the columns with lapply, apply the function, and assign the result back to the data frame
f1 <- function(x) (x - min(x)) / (max(x) - min(x))
df1[] <- lapply(df1, f1)
Or this can be done with mutate_all (superseded in dplyr 1.0.0 by mutate(across(everything(), f1)))
library(dplyr)
df1 %>%
  mutate_all(f1)
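A small worked example may help check the result. The sketch below rebuilds the question's data, applies a min-max function (named `min_max` here purely for illustration), and also shows one way to get the same numbers out of scale by passing the column minima and ranges explicitly.

```r
# The example data from the question
df1 <- data.frame(col_1 = c(1, 3, 5, 8, 11),
                  col_2 = c(3, 8, 5, 10, 9),
                  col_3 = c(7, 9, 2, 1, 2))

# Min-max scaling of a single numeric vector
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Apply to every column, keeping the data frame structure
scaled <- df1
scaled[] <- lapply(df1, min_max)

# Same result via scale(), with explicit center and scale vectors
mins <- sapply(df1, min)
rng  <- sapply(df1, max) - mins
via_scale <- scale(df1, center = mins, scale = rng)

scaled$col_1  # 0.0 0.2 0.4 0.7 1.0
```

After scaling, every column runs from 0 to 1. Note that scale returns a matrix, while the lapply version keeps a data frame.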
To complement @akrun's answer, you can also do this with data.table
library(data.table)
setDT(df)
df[, lapply(.SD, function(x) (x - min(x)) / (max(x) - min(x)))]
If you want to use a subset of columns, you can use the .SDcols argument, e.g.
library(data.table)
df[, lapply(.SD, function(x) (x - min(x)) / (max(x) - min(x))),
   .SDcols = c('a', 'b')]

How to subset the first column (rownames) in R [duplicate]

This question already has answers here:
What is about the first column in R's dataset mtcars?
(4 answers)
Closed 3 years ago.
I have xy data for gene expression in multiple samples. I wish to subset the first column so I can order the genes alphabetically and perform some other filtering.
> setwd("C:/Users/Will/Desktop/BIOL3063/R code assignment");
> df = read.csv('R-assignments-dataset.csv', stringsAsFactors = FALSE);
Here is a simplified example of the dataset I'm working with, it has 270 columns (tissue samples) and 7065 rows (gene names).
The first column is a list of gene names (A2M, AAAS, AACS etc.) and each column is a different tissue sample, thus showing the gene expression in each tissue sample.
The question being asked is "Sort the gene names alphabetically (A-Z) and print out the first 20 gene names"
My thought process would be to subset the first column (gene names) and then perform order() to sort alphabetically, after which I can use head() to print the first 20.
However when I try
> genes <- df[1]
It simply subsets the first column that has data in it (TCGA-A6-2672_TissueA) rather than the one to its left.
Also
> genes <- df[,df$col1];
> genes;
data frame with 0 columns and 7065 rows
> order(genes);
integer(0)
Appears to create a list of gene names in R studio's viewer but I cannot perform any manipulation on it.
I am unable to correctly locate the first column in the data.frame, since it does not have a column header, and I also have the same problem when doing the same thing with row 1 (sample names) as well.
I'm a complete novice at R and this is part of an assignment I'm working on, it seems I'm missing something fundamental but I can not figure out what.
Cheers guys
Please include a sample of your text file as text instead of an image.
I have created a dataset similar to yours:
X Y
1 a b
2 c d
3 d g
Note that your tissue columns have a header but your gene names do not. Therefore these will be interpreted as rownames, see ?read.table:
If row.names is not specified and the header line has one less entry
than the number of columns, the first column is taken to be the row
names.
Reading it in R:
df <- read.table(text = ' X Y
1 a b
2 c d
3 d g')
So your gene names are not in df[1] but in rownames(df). To get them, use genes <- rownames(df); to add them to the existing df as a column, use df$gene <- rownames(df).
There are numerous ways to convert your row names to a column; see for example this question.
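Putting this together for the original task, a minimal sketch could look like the following. The sample names `S1` and `S2` and the expression values are invented for illustration; only the three gene names come from the question.

```r
# Header has one fewer entry than the data rows, so the first
# column becomes the row names (as described in ?read.table)
df <- read.table(text = 'S1 S2
AACS 3 4
A2M 1 2
AAAS 5 6')

# Gene names live in the row names, not in a column
genes <- sort(rownames(df))
head(genes, 20)

# Reorder the whole data frame by gene name
df_sorted <- df[order(rownames(df)), ]
```

With the real dataset, `head(genes, 20)` would print the first 20 gene names in alphabetical order.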
If you are asking what I think you are asking, you just need to subset inside the as.data.frame function, which will auto-generate a "header", as you call it. It will be called V1, the first variable of your new data frame.
genes <- as.data.frame(df[,1])
genes$V1
1 A
2 C
3 A
4 B
5 C
6 D
7 A
8 B
As per the comment below, the issue could be avoided if you remove the comma from your subsetting syntax. When you select columns from a data.frame, you only need to index the column, not the rows.
genes <- df[1]

joining strings in a table according to values in another column [duplicate]

This question already has answers here:
Collapse text by group in data frame [duplicate]
(2 answers)
Closed 7 years ago.
I got a table like this:
id words
1 I like school.
2 I hate school.
3 I like cakes.
1 I like cats.
Here's what I want to do, joining the strings in each row according to id.
id words
1 I like school. I like cats.
2 I hate school.
3 I like cakes.
Is there a package to do that in R?
We can paste the 'words' together, grouped by 'id'. This can be done with any group-by operation. One way is data.table: we convert the 'data.frame' to a 'data.table' (setDT(df1)) and then perform the operation described above.
# install.packages(c("data.table"), dependencies = TRUE)
library(data.table)
setDT(df1)[, list(words = paste(words, collapse=' ')), by = id]
A base R operation would be to use aggregate
aggregate(words ~ id, df1, FUN = paste, collapse = ' ')
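For a self-contained check of the base R route, the sketch below rebuilds the question's table as `df1` (column types are an assumption) and collapses the strings per id.

```r
# The example table from the question
df1 <- data.frame(id = c(1, 2, 3, 1),
                  words = c('I like school.', 'I hate school.',
                            'I like cakes.', 'I like cats.'),
                  stringsAsFactors = FALSE)

# Extra arguments to aggregate() are passed on to FUN, so
# collapse = ' ' joins all strings of a group into one
res <- aggregate(words ~ id, data = df1, FUN = paste, collapse = ' ')
print(res)
```

The row for id 1 should contain "I like school. I like cats.", with the strings in their original order.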

How to combine two columns of factors into one column without changing the factor levels into number [duplicate]

This question already has answers here:
Joining factor levels of two columns
(3 answers)
Closed 4 years ago.
I am trying to find a way to combine two columns of factors into one column without changing the factor levels into numbers. For instance, consider the following two data.frame datasets
dataset 1 dataset 2
Number Student Number Student
1 Chris 1 Matt
2 Sarah 2 Keith
I am trying to take "student" column from the dataset1 and the "student" column from the dataset2, and make one big student column containing the names "Chris", "Sarah", "Matt", and "Keith"
I tried:
student.list<-c(dataset1[,2],dataset2[,2])
student.list
However, this doesn't work since the names turn into numbers with the c() function. I want my list to preserve the names of the students (i.e. without converting them into numbers). I also tried cbind(), but it gives the same problem as c()...
Thank you
Factors are integer codes that happen to have labels. When you combine factors, you are generally combining their integer codes, not their labels. This often trips people up.
If you want their labels, you must coerce them to strings, using as.character
student.list <- c( as.character(dataset1[,2]) ,
as.character(dataset2[,2]) )
If you want to get that back to factors, wrap it all in as.factor (can be all in one line, or split into two lines for easier reading)
student.list <- c(as.character(dataset1[,2]),as.character(dataset2[,2]))
student.list <- as.factor(student.list)
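The sketch below shows the integer codes behind two small factors and the character round trip described above. The vectors `s1` and `s2` stand in for the two Student columns. As far as I know, from R 4.1.0 onward c() applied to factors returns a factor directly, so the original problem mainly affects older versions; the as.character route works in both.

```r
# Stand-ins for the two Student columns
s1 <- factor(c('Chris', 'Sarah'))
s2 <- factor(c('Matt', 'Keith'))

# The integer codes index into each factor's own sorted levels
as.integer(s1)  # 1 2 (levels: Chris, Sarah)
as.integer(s2)  # 2 1 (levels: Keith, Matt)

# Convert to labels first, combine, then rebuild one factor
combined <- factor(c(as.character(s1), as.character(s2)))
as.character(combined)
levels(combined)
```

Because each factor has its own level set, the same code (for example 1) means 'Chris' in one factor and 'Keith' in the other, which is why combining raw codes loses the names.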
There is the interaction() function in base R.
There is also the strata() function in the survival package.
The data.table package, which extends the functionality of data frames in some very useful ways, will combine factors automatically when you use the rbindlist function. Plus, if your two data sets are large, it will usually combine them more quickly.
library(data.table)
# Example data:
# (If you already have data frames, you can convert them using `as.data.table(dataframename)`)
dataset1<-data.table(Number=1:2,Student=as.factor(c("Chris","Sarah")))
dataset2<-data.table(Number=1:2,Student=as.factor(c("Matt","Keith")))
# Combine the two data sets:
# (It's not necessary to convert factors to characters)
rbindlist(list(dataset1,dataset2))
# Number Student
#1: 1 Chris
#2: 2 Sarah
#3: 1 Matt
#4: 2 Keith
You can now do this easily with fct_c() from the forcats package.
dataset1 <- data.frame(Number = c(1,2), Student = factor(c('Chris','Sarah')))
dataset2 <- data.frame(Number = c(1,2), Student = factor(c('Matt','Keith')))
library(forcats)
fct_c(dataset1[, 2], dataset2[, 2])  # forcats >= 0.3.0 takes the factors as separate arguments; splice a list with fct_c(!!!list_of_factors)
# [1] Chris Sarah Matt Keith
# Levels: Chris Sarah Keith Matt
If your factors are inside data frames, you can combine them using rbind:
> df1 <- data.frame(x=factor(c('a','b')))
> df2 <- data.frame(x=factor(c('c','d')))
> rbind(df1,df2)
x
1 a
2 b
3 c
4 d

In R, how do I select rows of entries in one dataframe by identifiers from a second dataframe [duplicate]

This question already has an answer here:
Closed 10 years ago.
Possible Duplicate:
In R, how do I subset a data.frame by values from another data.frame?
I have two data.frames. The first (df1) is a single column of 100 entries with header - "names". The second (df2) is a dataframe containing hundreds of columns of metadata for tens of thousands of entries. The first column of df2 also has the header "names".
I simply want to select all the metadata in df2 by the subset of names found in df1.
Please help this novice R user. Thank you!
You can use data.frame with %in% but it can be slow if you have many thousands of names to look up.
I would recommend using data.table because it sorts the index columns and can do an almost instantaneous database join even with millions of records. Read the data.table documentation for more information.
Suppose you have a big data.frame and little data.frame:
library(data.table)
big <- data.frame(names=1:5, data=1:5)
small <- data.frame(names=c(1, 3, 6))
Make them into data.table objects and set the key column to be names.
big <- data.table(big, key='names')
small <- data.table(small, key='names')
Now perform the join. Indexing one data.table with another, as in big[small], joins them on the key column. The result has one row for each name in small: names that are also in big carry their data, and names in small that are not in big (6 here) get NA.
big[small]
# names data
# 1: 1 1
# 2: 3 3
# 3: 6 NA
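For comparison, the %in% approach mentioned at the start of this answer can be sketched in base R with the same toy data. The merge call is an addition for illustration, included to reproduce the NA row that big[small] produces.

```r
big <- data.frame(names = 1:5, data = 1:5)
small <- data.frame(names = c(1, 3, 6))

# Keep only the rows of big whose name occurs in small
matched <- big[big$names %in% small$names, ]

# Keep every name in small, filling NA where big has no match
joined <- merge(small, big, by = 'names', all.x = TRUE)

matched
joined
```

Here `matched` has two rows (names 1 and 3), while `joined` has three, with NA in `data` for name 6.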
