Remove index column in read.csv - r

Inspired by Prevent row names to be written to file when using write.csv, I am curious whether there is a way to ignore the index column in R using the read.csv() function. I want to import a text file into an RMarkdown document and don't want the row numbers to show in my HTML file produced by RMarkdown.
Running the following code
write.csv(head(cars), "cars.csv", row.names=FALSE)
produces a CSV that looks like this:
"speed","dist"
4,2
4,10
7,4
7,22
8,16
9,10
But, if you read this index-less file back into R (i.e., read.csv("cars.csv")), the index column comes back:
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
I was hoping the solution would be as easy as adding row.names=FALSE to the read.csv() call, as is done with write.csv(); however, after I run read.csv("cars.csv", row.names=FALSE), R gets sassy and returns an "invalid 'row.names' specification" error message.
I tried read.csv("cars.csv")[-1], but that just dropped the speed column, not the index column.
How do I prevent the row index from being imported?

If you assign the result to an object, you won't see row names, because nothing is printed:
x <- read.csv("cars.csv")
But if you print it (to HTML), the print.data.frame method is used, which shows row numbers by default. When I used the following as the last line in my markdown chunk, no row numbers were displayed:
print(read.csv("cars.csv"), row.names = FALSE)
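Another option in RMarkdown (a sketch, assuming the document is rendered with knitr, as RMarkdown documents are by default): knitr::kable() renders the data frame as a proper table and omits automatic row names (the default 1, 2, 3, ... numbering).
# Render the data frame as a markdown/HTML table; automatic row names are not shown.
knitr::kable(read.csv("cars.csv"))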

Why?: This problem seems to be associated with a previous subset procedure that created the data. I have a file that keeps coming back with a pesky index column as I round-trip the data via read/write.csv.
Bottom line: read.csv reads the file in its entirety and outputs a data frame; the file has to be read before any other operation, like dropping a column, is possible.
Easy Workaround: Fortunately it's very simple to drop the column from the new dataframe:
df <- read.csv("data.csv")
df <- df[,-1]
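If, by contrast, a file really does contain a leading index column (for example, one written by write.csv with its default row.names = TRUE), read.csv can absorb that column as row names instead of reading it as data:
# Treat the first column of the file as row names; it then no longer
# appears as a data column in the resulting data frame.
df <- read.csv("data.csv", row.names = 1)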

Related

Read Excel file and select specific rows and columns

I want to read an xls file into R and select specific columns.
For example I only want columns 1 to 10 and rows 5 - 700. I think you can do this with xlsx but I can't use that library on the network that I am using.
Is there another package that I can use? And how would I go about selecting the columns and rows that I want?
You can try this:
library(xlsx)
read.xlsx("my_path\\my_file.xlsx", "sheet_name", rowIndex = 5:700, colIndex = 1:10)
Since you are unable to load the xlsx package, you might want to consider base R and use read.csv. For this, save your Excel file as a csv file; explanations of how to do this are easy to find on the web. Note that csv files can still be opened in Excel.
These are the steps you need to take to read only the 2nd and 3rd columns and rows.
hd <- read.csv('a.csv', header=FALSE, nrows=1, as.is=TRUE) # read only the header line
removeCols <- c('NULL', NA, NA) # 'NULL' drops a column; NA keeps it with its default class
df <- read.csv('a.csv', skip=2, header=FALSE, colClasses=removeCols) # skip=2 skips the header line and the first data row
colnames(df) <- hd[is.na(removeCols)] # reattach the names of the kept columns
df
  two three
1   5     8
2   6     9
This is the example data I used.
a <- data.frame(one=1:3, two=4:6, three=7:9)
write.csv(a, 'a.csv', row.names=F)
read.csv('a.csv')
  one two three
1   1   4     7
2   2   5     8
3   3   6     9
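As an aside (not part of the original answer): if installing a different package is an option on your network, the readxl package reads both xls and xlsx files without the Java dependency of xlsx and lets you restrict the range directly. A sketch:
library(readxl)
# Columns 1 to 10 correspond to spreadsheet columns A-J, so rows 5-700
# give the cell range "A5:J700"; col_names = FALSE treats row 5 as data.
df <- read_excel("my_file.xls", sheet = "sheet_name",
                 range = "A5:J700", col_names = FALSE)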

Load data from a text file to R

I have data in text format whose structure is as follows:
ATCTTTGAT*TTAGGGGGAAAAATTCTACGC*TTACTGGACTATGCT
.........T.....,,,,,,,,,.......T,,,,,,.........
......A..*............,,,,,,,,.A........T......
....*..................,,,T...............
...*.....................*...........
...................*.....
I have been trying to import it into R using the read.table() command, but when I do, the output has an altered structure like this:
                                               V1
1 ATCTTTGAT*TTAGGGGGAAAAATTCTACGC*TTACTGGACTATGCT
2 .........T.....,,,,,,,,,.......T,,,,,,.........
3 ......A..*............,,,,,,,,.A........T......
4      ....*..................,,,T...............
5           ...*.....................*...........
6                       ...................*.....
For some reason, R is shifting the rows with fewer characters to the right. How can I load my data into R without altering the structure present in the original text file?
Try this :)
read.table(file, sep = "\n")
result:
V1
1 ATCTTTGAT*TTAGGGGGAAAAATTCTACGC*TTACTGGACTATGCT
2 .........T.....,,,,,,,,,.......T,,,,,,.........
3 ......A..*............,,,,,,,,.A........T......
4 ....*..................,,,T...............
5 ...*.....................*...........
6 ...................*.....
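If no table structure is needed at all, a simpler sketch is to skip read.table entirely: readLines() applies no separator, quoting, or comment-character rules, so each line arrives exactly as it appears in the file.
# Returns a character vector with one unmodified element per line.
lines <- readLines("file.txt")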

dealing with blank/missing data with write.table in R

I have a data frame where some of the rows have blank entries, e.g. to use a toy example:
Sample  Gene  RS   Chromosome
1       A     rs1  10
2       B          X
3       C     rs4  Y
i.e. sample 2 has no rs#. If I attempt to save this data frame in a file using:
write.table(mydata,file="myfile",quote=FALSE,sep='\t')
and then read it back with read.table('myfile',header=TRUE,sep='\t'), I get an error stating that line 2 doesn't have 4 elements. If I set quote=TRUE, then a "" entry appears in the table. I'm trying to figure out a way to create a table using write.table with quote=FALSE while retaining a blank placeholder for rows with missing entries, such as row 2.
Is there a simple way to do this? I attempted to use the argument NA="" in write.table() but this didn't change anything.
If my script's resulting data frame has NA values, I always replace them. One way would be to replace NA in the data frame with some other text which tells you that this entry was NA, especially if you are saving the result in a csv file, a database, or some other non-R environment. Here is a simple script to do that:
replace_NA <- function(x, replacement = "N/A") {
  x[is.na(x)] <- replacement
  x  # return the modified vector
}
df[] <- lapply(df, replace_NA, replacement = "N/A")  # assign back, keeping a data frame
You are attempting to reinvent the fixed-width file format. Your requested format would have a blank column between every real column. I don't find a write.fwf, although the 'utils' package has read.fwf. The simplest method of getting your requested output would be:
capture.output(dat, file='test.dat')
# Contents of the resulting text file:
  Sample Gene  RS Chromosome
1      1    A rs1         10
2      2    B              X
3      3    C rs4          Y
This essentially uses the print method for data frames (the same one used at the end of the R REPL) to do the spacing for you.
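To get such a print-formatted file back into R later, read.fwf() from 'utils' could be used. A sketch; the widths below are purely illustrative assumptions and would have to match the actual character positions in your file:
# skip = 1 skips the printed header line; strip.white removes the
# padding spaces around each fixed-width field (widths are hypothetical).
dat2 <- read.fwf('test.dat', widths = c(2, 7, 5, 4, 11),
                 skip = 1, strip.white = TRUE)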

Adding a new column in R based on maximum occurrence of words from a CSV

I am working with two CSV files. They are formatted like this:
File 1
able,2
gobble,3
highway,3
test,6
zoo,10
File 2
able,6
gobble,10
highway,3
speed,7
test,8
upper,3
zoo,10
In my program I want to do the following:
Create a keyword list by combining the values from two CSV files and keeping only unique keywords
Compare that keyword list to each individual CSV file to determine the maximum number of occurrences of a given keyword, then append that information to the keyword list.
The first step I have done already.
I am getting confused by R reading things as vectors/factors/data frames, etc., and by "coercion to lists". For example, in my files given above, the maximum occurrence for the word "gobble" should be 10 (its value is 3 in file 1 and 10 in file 2).
So basically two things need to happen. First, I need to create a column in "keywords" that holds information about the maximum number of occurrences of a word from the CSV files. Second, I need to populate that column with the maximum value.
Here is my code:
# Read in individual data sets
keywordset1=as.character(read.csv("set1.csv",header=FALSE,sep=",")$V1)
keywordset2=as.character(read.csv("set2.csv",header=FALSE,sep=",")$V1)
exclude_list=as.character(read.csv("exclude.csv",header=FALSE,sep=",")$V1)
# Sort, capitalize, and keep unique values from the two keyword sets
keywords <- sapply(unique(sort(c(keywordset1, keywordset2))), toupper)
# Keep keywords greater than 2 characters in length (basically exclude in at etc...)
keywords <- keywords[nchar(keywords) > 2]
# Keep keywords that are not in the exclude list
keywords <- setdiff(keywords, sapply(exclude_list, toupper))
# HERE IS WHERE I NEED HELP
# Compare the read keyword list to the master keyword list
# and keep the frequency column
key1=read.csv("set1.csv",header=FALSE,sep=",")
key1$V1=sapply(key1[[1]], toupper)
keywords$V2=key1[which(keywords[[1]] %in% key1$V1),2]
return(keywords)
The reason that your last command fails is that you try to use the $ operator on a vector. It only works on lists or data frames (which are a special case of lists).
A remark regarding toupper (and many other functions in R): it is vectorized, so you don't need sapply. toupper(c(keywordset1, keywordset2)) is perfectly fine.
But I would like to propose an entirely different solution to your problem. First, I create the data as follows:
keywords1 <- read.table(text="able,2
gobble,3
highway,3
test,6
zoo,10",sep=",",stringsAsFactors=FALSE)
keywords2 <- read.table(text="gobble,10
highway,3
speed,7
test,8
upper,3
zoo,10",sep=",",stringsAsFactors=FALSE)
Note that I use stringsAsFactors=FALSE. This prevents read.table from converting characters to factors, such that there is no need to call as.character later.
The next steps are to capitalize the keyword columns in both tables. At the same time, I put both tables in a list. This is often a good way to simplify calculations in R, because you can use lapply to apply a function on all the list elements. Then I put both tables into a single table.
keyword_list <- lapply(list(keywords1, keywords2),
                       function(kw) transform(kw, V1 = toupper(V1)))
keywords_all <- do.call(rbind, keyword_list)
The next step is to sort the data frame in decreasing order by the number in the second column:
keywords_sorted <- keywords_all[order(keywords_all$V2,decreasing=TRUE),]
keywords_sorted looks as follows:
        V1 V2
5      ZOO 10
6   GOBBLE 10
11     ZOO 10
9     TEST  8
8    SPEED  7
4     TEST  6
2   GOBBLE  3
3  HIGHWAY  3
7  HIGHWAY  3
10   UPPER  3
1     ABLE  2
As you notice, some keywords appear only once, and for those that appear twice, the first appearance is the one you want to keep. There is a function in R that can be used to extract exactly these elements: duplicated() (run ?duplicated to learn more). Basically, the function returns TRUE if an element has already appeared earlier in the vector. These are the elements you don't want. To convert TRUE to FALSE (and vice versa), you use the operator !. So the following gives your desired result:
keep <- !duplicated(keywords_sorted$V1)
keywords_max <- keywords_sorted[keep,]
        V1 V2
5      ZOO 10
6   GOBBLE 10
9     TEST  8
8    SPEED  7
3  HIGHWAY  3
10   UPPER  3
1     ABLE  2
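For comparison (not part of the original answer), the same per-keyword maximum can be computed in one step with base R's aggregate() on the combined table:
# Group keywords_all by keyword (V1) and keep the maximum count (V2).
keywords_max2 <- aggregate(V2 ~ V1, data = keywords_all, FUN = max)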

Splitting a dataframe every other column to create two separate files

I want to (as ever) use code that performs better but functions equivalently to the following:
write.table(results.df[seq(1, ncol(results.df),2)],file="/path/file.txt", row.names=TRUE, sep="\t")
write.table(results.df[seq(2, ncol(results.df),2)],file="/path/file2.txt",row.names=TRUE, sep="\t")
results.df is a dataframe that looks something like this:
  row.names 171401    171401 111201    111201
1         1      1 0.8320923     10 0.8320923
2         2      2 0.8510621     11 0.8510621
3         3      3 0.1009001     12 0.1009001
4         4      4 0.9796110     13 0.9796110
5         5      5 0.4178686     14 0.4178686
6         6      6 0.6570377     15 0.6570377
7         7      7 0.3689075     16 0.3689075
There is no consistent patterning in the column headers except that each one is repeated twice consecutively.
I want to create (1) one file with only odd-numbered columns of results.df and (2) another file with only even-numbered columns of results.df. I have one solution above, but was wondering whether there is a better-performing means of achieving the same thing.
IDEA UPDATE: I was thinking there may be some way of excising each processed column (deleting it from memory) rather than just copying it. This way the size of the dataframe progressively decreases and may result in a performance increase?
The code is only slightly shorter but...
# Instead of
results.df[seq(1, ncol(results.df), 2)]
results.df[seq(2, ncol(results.df), 2)]
# you could use
results.df[c(T,F)]
results.df[c(F,T)]
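The logical index c(T,F) works because R recycles it across all columns of the data frame, selecting every odd-numbered column (and c(F,T) every even-numbered one). A sketch of the full split with this approach, using the file paths from the question:
# Odd-numbered columns go to one file, even-numbered columns to the other.
write.table(results.df[c(TRUE, FALSE)], file="/path/file.txt",
            row.names=TRUE, sep="\t")
write.table(results.df[c(FALSE, TRUE)], file="/path/file2.txt",
            row.names=TRUE, sep="\t")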
