Reading in a table with non-standard row names in R?

I have a .txt file that looks like this:
xyz ghj asd qwe
a / b: 1 2 3 4
c / d: 5 6 7 8
e / f: 9 10 11 12
...
...
I'm trying to use read.table(header = T), but it seems to be misinterpreting the row names. Is there a way to deal with this in read.table(), or should I just use readLines()?

There is no option to just skip a few characters in each row using a read.table option.
Instead, you can call read.table twice, once for all the data after the first row, and the second time for the header.
Where your data are in a file called "test.txt", you would do:
library(magrittr)
tmp <- read.table(file="test.txt", sep="", stringsAsFactors = FALSE, skip=1)[, -c(1:3)] %>%
setNames(read.table(file="test.txt", sep="", stringsAsFactors = FALSE, nrows=1))
> tmp
xyz ghj asd qwe
1 1 2 3 4
2 5 6 7 8
3 9 10 11 12
>
The magrittr package provides the pipe operator %>%, which lets you read the data and the header separately but combine them in a single expression. With R 4.1.0 or later you can use the native |> operator instead, without the magrittr package.
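If you would rather keep the row labels, one alternative is the readLines() route mentioned in the question. The following is a minimal sketch, assuming every data line has the form "label: values" with a single colon; the temporary file and the names `body`, `labels`, `values` and `tmp2` are made up for illustration.

```r
# Recreate the example file in a temporary location (illustrative only)
path <- tempfile(fileext = ".txt")
writeLines(c("xyz ghj asd qwe",
             "a / b: 1 2 3 4",
             "c / d: 5 6 7 8",
             "e / f: 9 10 11 12"), path)

lines  <- readLines(path)
header <- strsplit(trimws(lines[1]), "\\s+")[[1]]  # column names from line 1
body   <- lines[-1]

# Split each data line at the colon: label on the left, numbers on the right
labels <- trimws(sub(":.*$", "", body))
values <- sub("^[^:]*:", "", body)

# Parse the numeric part and attach the labels as row names
tmp2 <- read.table(text = values, col.names = header)
rownames(tmp2) <- labels
tmp2
```

A single cell can then be looked up by label, for example tmp2["c / d", "asd"].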

Related

Add column names of a dataframe or from an R object to another dataframe

I'm currently working with a huge count matrix from single-cell sequencing ...
So, in order to analyze it with R and my 8 GB of RAM, I had to split it into several sub-matrices.
I simply used split to do that, so I lost the headers of the matrix.
So, I would like to add them back with R, or find a better way to split the matrix.
My questions are:
1. If I have an object called headers with all the column names stored inside, is there a way to efficiently add this object to a dataframe? I tried rbind but it doesn't really solve the problem.
2. Is there a better way to cut those huge count matrices into multiple parts? (I can't do it through R because I don't have enough RAM; R crashes if I try to import the whole matrix.)
If I have an object called headers with all the column names stored inside, is there a way to efficiently add this object to a dataframe? I tried rbind but it doesn't really solve the problem.
You can add headers to a dataframe like this:
dataframe <- data.frame(c("a", "b","c"),
c("d", "e", "f"))
headers <- c("header_1" , "header_2")
names(dataframe) <- headers
dataframe
header_1 header_2
1 a d
2 b e
3 c f
For the second question, you could use bash for such tasks.
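If the split has to keep the header on every piece, one option is to stream the file through a connection so that only one chunk is in memory at a time. This is a rough sketch in R rather than bash, assuming the first line of the file is the header; `split_with_header`, `chunk_size` and the output naming scheme are hypothetical.

```r
split_with_header <- function(path, chunk_size, out_prefix) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)            # first line holds the column names
  out_files <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size)  # next block of data lines
    if (length(chunk) == 0) break            # end of file reached
    out <- sprintf("%s_%03d.txt", out_prefix, length(out_files) + 1)
    writeLines(c(header, chunk), out)        # header is repeated in every piece
    out_files <- c(out_files, out)
  }
  out_files
}

# Small demonstration on a temporary file
src <- tempfile(fileext = ".txt")
writeLines(c("g1 g2 g3", "1 2 3", "4 5 6", "7 8 9"), src)
parts <- split_with_header(src, chunk_size = 2, out_prefix = tempfile())
```

Each piece can then be read on its own with read.table(header = TRUE).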
You can access and change a data.frame's column names with the names function:
df <- data.frame(foo = 1:5, bar = 6:10, opt = 11:15)
original_names <- names(df)
original_names
Returns:
[1] "foo" "bar" "opt"
And to assign new names:
names(df) <- c("new_col1", "new_col2", "new_col3")
Now:
df
Returns:
new_col1 new_col2 new_col3
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
And to 'undo' the renaming:
names(df) <- original_names
And df has again its original names:
foo bar opt
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15

Reading bulk CSVs and filtering to get rid of headers not working

I am using read_bulk from the readbulk package to read in a large number of CSV files.
dfc <- data.frame(read_bulk(directory = "C:/place/with/data",
subdirectories = FALSE,
extension = ".csv",
data = NULL,
verbose = TRUE,
fun = utils::read.csv, stringsAsFactors = FALSE, is.na(" ")))
names(dfc) <- c("Headers", "I", "Want", "Instead")
write_csv(dfc, path = paste("Data"," ",Sys.Date(),".csv"))
which works fine, but I'd like the headers to be removed. headers = FALSE does not work in read_bulk. I thought this would be a simple fix by doing
dfc %>%
filter(Headers != "undesirable headers from read_bulk")
after I assign the names, but this has not worked. I also tried str_extract_all for the "undesirable headers from read_bulk", but this hasn't worked either.
According to str, all the columns are character, though after read_bulk the first column header of each dataset has stray characters before the column name. Is this an encoding problem? Is this causing my data not to be filtered?
Dummy data
CSV Dataset 1          CSV Dataset 2          ...etc. more datasets
Facility  ID  Status   Facility  ID  Status
abc       1   A        def       5   A
efg       2   B        lmo       8   B
hij       3   A        pqr       9   C
abc       4   B        xyz       7   B
R output after read_bulk of dummy data
Facility ID Status
abc 1 A
efg 2 B
hij 3 A
abc 4 B
Facility ID Status
def 5 A
lmo 8 B
pqr 9 C
xyz 7 B
I would like to remove these headers from my data set
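One possible approach, sketched under the assumption that the repeated header rows can be recognised because their first column ends in the header text "Facility". Matching the end of the string sidesteps any stray leading characters such as a byte-order mark, which an exact comparison would trip over. The data frame below is a made-up stand-in for the read_bulk result, not its actual output.

```r
# Stray prefix built from the three UTF-8 byte-order-mark bytes (assumed cause)
bom <- rawToChar(as.raw(c(0xEF, 0xBB, 0xBF)))

# Hypothetical combined data with header rows mixed in
dfc <- data.frame(
  Headers = c(paste0(bom, "Facility"), "abc", "efg",
              paste0(bom, "Facility"), "def"),
  I       = c("ID", "1", "2", "ID", "5"),
  Want    = c("Status", "A", "B", "Status", "A"),
  stringsAsFactors = FALSE
)

# Flag rows whose first column ends in the header text, whatever precedes it
is_header <- grepl("Facility$", dfc$Headers, useBytes = TRUE)
cleaned <- dfc[!is_header, ]
rownames(cleaned) <- NULL
cleaned
```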

Subset dataframe if a $ symbol exists in a string column

I have a dataframe with a time column and a string column. I want to subset this dataframe, keeping only the rows in which the string column contains a $ symbol somewhere in it.
After subsetting, I want to clean the string column so that it only contains the characters after the $ symbol, up to the next space or symbol.
df <- data.frame("time"=c(1:10),
"string"=c("$ABCD test","test","test $EFG test",
"$500 test","$HI/ hello","test $JK/",
"testing/123","$MOO","$abc","123"))
I want the final output to be:
Time string
1 ABCD
3 EFG
4 500
5 HI
6 JK
8 MOO
9 abc
It only keeps rows that have a $ in the string column, and then only keeps the characters after the $ symbol and until a space or symbol
I have had some success with sub simply to pull out the string, but haven't been able to apply that to the df and subset it. Thanks for your help.
Until someone comes up with pretty regex solutions, here is my take:
# subset for $ signs and convert to character class
res <- df[ grepl("$", df$string, fixed = TRUE),]
res$string <- as.character(res$string)
# split on non alpha and non $, and grab the one with $, then remove $
res$clean <- sapply(strsplit(res$string, split = "[^a-zA-Z0-9$']", perl = TRUE),
function(i){
x <- i[grepl("$", i, fixed = TRUE)]
# in case when there is more than one $
# x <- i[grepl("$", i, fixed = TRUE)][1]
gsub("$", "", x, fixed = TRUE)
})
res
# time string clean
# 1 1 $ABCD test ABCD
# 3 3 test $EFG test EFG
# 4 4 $500 test 500
# 5 5 $HI/ hello HI
# 6 6 test $JK/ JK
# 8 8 $MOO MOO
# 9 9 $abc abc
We can use regexpr/regmatches to extract only the substring that follows a $:
i1 <- grep("$", df$string, fixed = TRUE)
transform(df[i1,], string = regmatches(string, regexpr("(?<=[$])\\w+", string, perl = TRUE)))
# time string
#1 1 ABCD
#3 3 EFG
#4 4 500
#5 5 HI
#6 6 JK
#8 8 MOO
#9 9 abc
Or with the tidyverse syntax
library(tidyverse)
df %>%
filter(str_detect(string, fixed("$"))) %>%
mutate(string = str_extract(string, "(?<=[$])\\w+"))

Reshaping count-summarised data into long form in R [duplicate]

Embarrassingly basic question, but if you don't know.. I need to reshape a data.frame of count summarised data into what it would've looked like before being summarised. This is essentially the reverse of {plyr} count() e.g.
> (d = data.frame(value=c(1,1,1,2,3,3), cat=c('A','A','A','A','B','B')))
value cat
1 1 A
2 1 A
3 1 A
4 2 A
5 3 B
6 3 B
> (summry = plyr::count(d))
value cat freq
1 1 A 3
2 2 A 1
3 3 B 2
If you start with summry, what is the quickest way back to d? Unless I'm mistaken (very possible), {reshape2} doesn't do this.
Just use rep:
summry[rep(rownames(summry), summry$freq), c("value", "cat")]
# value cat
# 1 1 A
# 1.1 1 A
# 1.2 1 A
# 2 2 A
# 3 3 B
# 3.1 3 B
A variation of this approach can be found in expandRows from my "SOfun" package. If you had that loaded, you would be able to simply do:
expandRows(summry, "freq")
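If the dotted row names are unwanted, a small base R variation indexes by row position and then resets the row names. This is a sketch that rebuilds the `summry` object from the question by hand; `d2` is an illustrative name.

```r
# The summarised data from the question, recreated by hand
summry <- data.frame(value = c(1, 2, 3),
                     cat   = c("A", "A", "B"),
                     freq  = c(3, 1, 2))

# Repeat each row index freq times, keeping only the original columns
d2 <- summry[rep(seq_len(nrow(summry)), summry$freq), c("value", "cat")]
rownames(d2) <- NULL  # back to 1, 2, 3, ...
d2
```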
There is a good table to dataframe function on the R cookbook website that you can modify slightly. The only modifications were changing 'Freq' -> 'freq' (to be consistent with plyr::count) and making sure the rownames were reset as increasing integers.
expand.dft <- function(x, na.strings = "NA", as.is = FALSE, dec = ".") {
  # Take each row in the source data frame and replicate it
  # using its freq value
  DF <- sapply(seq_len(nrow(x)),
               function(i) x[rep(i, each = x$freq[i]), ],
               simplify = FALSE)
  # Take the above list and rbind it to create a single DF
  # Also subset the result to eliminate the freq column
  DF <- subset(do.call("rbind", DF), select = -freq)
  # Now apply type.convert to the character coerced factor columns
  # to facilitate data type selection for each column
  for (i in seq_len(ncol(DF))) {
    DF[[i]] <- type.convert(as.character(DF[[i]]),
                            na.strings = na.strings,
                            as.is = as.is, dec = dec)
  }
  # Reset the row names to increasing integers
  row.names(DF) <- seq(nrow(DF))
  DF
}
expand.dft(summry)
value cat
1 1 A
2 1 A
3 1 A
4 2 A
5 3 B
6 3 B

Read multiple files into separate data frames and process every dataframe

For all the files in one directory, I want to read each file into a data frame and then process it, e.g. calculate cor across columns:
files <- list.files(path = ".")
names <- substr(files,18,20)
for(i in c(1:length(names))){
name <- names[i]
assign (name, read.table(files[i]))
sapply(3:ncol(name), function(y) cor(name[, 2], name[, y], ))
}
But 'name' is a string in the last statement of the code. How can I process the data frame that 'name' refers to?
This is exactly what R's lists are for. Also, calling sapply to get all of the correlations is unnecessary, since cor returns the correlation matrix, so you can just subset it:
R> files <- list.files(pattern = "tsv")
R> dat <- lapply(files, read.table)
R> dat
[[1]]
a b c
1 2.802164 4.835557 6
2 1.680186 4.974198 3
3 3.002777 4.670041 6
4 2.182691 5.137982 11
5 4.206979 5.170269 5
6 1.307195 4.753041 9
7 2.919497 4.657171 7
8 2.938614 5.305558 9
9 2.575200 4.893604 2
10 1.548161 4.871108 4
[[2]]
a b c
1 -1.8483890 2 6
2 -2.9035164 0 7
3 -0.6490283 1 6
4 -2.8842633 3 2
5 -1.8803775 0 12
6 -3.0267870 1 9
7 0.5287124 0 7
8 -3.7220733 0 2
9 -2.0663912 2 9
10 -1.6232248 1 6
You can then lapply over this list again to process or do it as a one liner.
R> dat <- lapply(files, function(x) cor(read.table(x))[1,-1] )
R> dat
[[1]]
b c
0.27236143 -0.04973541
[[2]]
b c
-0.1440812 0.2771511
The way to do this is to put all the files you wish to read in in one folder, and then work with lists:
your.dir <- "" # adjust
files <- list.files(your.dir)
your.dfs <- lapply(file.path(your.dir, files), read.table)
your.dfs is now a list holding all your data frames. You can apply a function to every data frame using lapply, or you can access individual data frames with the usual subsetting syntax, for example your.dfs[[1]] to access the first data frame.
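Naming the list elements after their files keeps the link between each result and its source. A minimal sketch, assuming whitespace-separated numeric files with a header line; the temporary directory, the file names and `cors` are invented for the example.

```r
# Create two small example files in a temporary directory
your.dir <- tempfile("dfs_")
dir.create(your.dir)
set.seed(1)
for (f in c("first.txt", "second.txt")) {
  write.table(data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10)),
              file.path(your.dir, f), row.names = FALSE)
}

files <- list.files(your.dir)
your.dfs <- lapply(file.path(your.dir, files), read.table, header = TRUE)
names(your.dfs) <- files  # each element is now addressable by file name

# Correlation of the first column with every other column, per data frame
cors <- lapply(your.dfs, function(df) cor(df)[1, -1])
cors[["first.txt"]]
```

Because lapply keeps the names, cors can be indexed by file name in the same way as your.dfs.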

Resources