dealing with blank/missing data with write.table in R


I have a data frame where some of the rows have blank entries, e.g. to use a toy example
Sample Gene RS  Chromosome
1      A    rs1 10
2      B        X
3      C    rs4 Y
i.e. sample 2 has no rs#. If I attempt to save this data frame in a file using:
write.table(mydata,file="myfile",quote=FALSE,sep='\t')
and then read.table('myfile',header=TRUE,sep='\t'), I get an error stating that the number of entries in line 2 doesn't have 4 elements. If I set quote=TRUE, then a "" entry appears in the table. I'm trying to figure out a way to create a table using write.table with quote=FALSE while retaining a blank placeholder for rows with missing entries such as 2.
Is there a simple way to do this? I attempted to use the argument NA="" in write.table() but this didn't change anything.

If the data frame produced by my script contains NA, I always replace it. One way is to replace each NA in the data frame with some other text that tells you the entry was NA, especially if you are saving the result to a CSV file, a database, or some other non-R environment.
A simple script to do that:
replace_NA <- function(x, replacement = "N/A") {
  x[is.na(x)] <- replacement
  x  # return the modified vector; without this line only `replacement` is returned
}
# df[] keeps the data frame structure (sapply would return a matrix)
df[] <- lapply(df, replace_NA, replacement = "N/A")
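Another option is to let write.table and read.table handle the blank themselves. The sketch below assumes that is what the question is after. Note that the argument is spelled na (lower case), and R argument names are case-sensitive. The data frame mydata is a stand-in for the toy example, and the temporary file path is hypothetical.

```r
# Stand-in for the question's toy data; sample 2 has no rs#
mydata <- data.frame(
  Sample     = 1:3,
  Gene       = c("A", "B", "C"),
  RS         = c("rs1", NA, "rs4"),
  Chromosome = c("10", "X", "Y"),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".tsv")  # hypothetical output location

# na = "" writes missing values as an empty field between two tabs
write.table(mydata, file = path, quote = FALSE, sep = "\t",
            na = "", row.names = FALSE)

# na.strings = "" turns the empty field back into NA on the way in
back <- read.table(path, header = TRUE, sep = "\t",
                   na.strings = "", stringsAsFactors = FALSE)
```

With sep = "\t" the empty field still counts as a column, so every line of the file has four fields.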

You are attempting to reinvent the fixed-width file format. Your requested format would have a blank column between every real column. I don't find a write.fwf, although the 'utils' package has read.fwf. The simplest method of getting your requested output would be:
capture.output(dat, file='test.dat')
# Result in a text file
  Sample Gene  RS Chromosome
1      1    A rs1         10
2      2    B              X
3      3    C rs4          Y
This essentially uses the print method (at the end of the R REPL) for dataframes to do the spacing for you.

Related

Counting NA values by ID?

I'm learning R from scratch right now and am trying to count the number of NAs within a given table, aggregated by the ID of the file it came from. I then want to output that information in a new data frame, showing just the ID and the sum of the NA lines contained within. I've looked at some similar questions, but they all seem to deal with very short datasets, whereas mine is comparatively long (10k+ lines), so I can't call out each individual line to aggregate.
Ideally, if I start with a data table called "Data" with a total of four columns, and one column called "ID", I would like to output a data frame that is simply:
[ID] [NA_Count]
1 500
2 352
3 100
Thanks in advance...

Something like the following should work, although I am assuming that Date is always there and Field 1 and Field 2 are numeric:
# get file names and initialize a vector for the counts
fileNames <- list.files(<filePath>)
missRowsVec <- integer(length(fileNames))
# loop through files, get number of rows with missing values in each
for(filePos in seq_along(fileNames)) {
  # read in files **fill in <filePath>**
  temp <- read.csv(paste0(<filePath>, fileNames[filePos]), as.is=TRUE)
  # count the number of rows with missing values,
  # ** fill in <fieldName#> with strings of variable names **
  # the second argument (MARGIN = 1) applies anyNA to each row
  missRowsVec[filePos] <- sum(apply(temp[, c(<field1Name>, <field2Name>)], 1,
                                    function(i) anyNA(i)))
} # end loop
# build data frame
myDataFrame <- data.frame("fileNames"=fileNames, "missCount"=missRowsVec)
This may be a bit dense, but it should work more or less. Try small portions of it, like just some inner function, to see how stuff works.
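If the data are already in a single data frame with an ID column, as the question describes, a loop over files may not be needed. The following is a minimal sketch under that assumption. The data frame Data and its columns a, b and c are made up for illustration, and "NA count" is taken to mean the number of NA values (not rows) per ID.

```r
# Hypothetical stand-in for the question's "Data": an ID column plus three others
Data <- data.frame(
  ID = c(1, 1, 2, 2, 3),
  a  = c(NA, 1, 2, NA, 3),
  b  = c(NA, NA, 1, 1, 1),
  c  = c(1, 2, 3, 4, NA)
)

# Number of NA values in each row, ignoring the ID column itself
na_per_row <- rowSums(is.na(Data[, names(Data) != "ID", drop = FALSE]))

# Sum the per-row counts within each ID
NA_Count <- aggregate(NA_Count ~ ID,
                      data = data.frame(ID = Data$ID, NA_Count = na_per_row),
                      FUN = sum)
```

To count rows that contain at least one NA instead, use na_per_row > 0 in place of na_per_row.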

Adding a new column in R based on maximum occurrence of words from a CSV

I am working with two CSV files. They are formatted like this:
File 1
able,2
gobble,3
highway,3
test,6
zoo,10
File 2
able,6
gobble,10
highway,3
speed,7
test,8
upper,3
zoo,10
In my program I want to do the following:
Create a keyword list by combining the values from two CSV files and keeping only unique keywords
Compare that keyword list to each individual CSV file to determine the maximum number of occurrences of a given keyword, then append that information to the keyword list.
The first step I have done already.
I am getting confused by R reading things as vectors/factors/data frames etc...and "coercion to lists". For example in my files given above, the maximum occurrence for the word "gobble" should be 10 (its value is 3 in file 1 and 10 in file 2)
So basically two things need to happen. First, I need to create a column in "keywords" that holds information about the maximum number of occurrences of a word from the CSV files. Second, I need to populate that column with the maximum value.
Here is my code:
# Read in individual data sets
keywordset1=as.character(read.csv("set1.csv",header=FALSE,sep=",")$V1)
keywordset2=as.character(read.csv("set2.csv",header=FALSE,sep=",")$V1)
exclude_list=as.character(read.csv("exclude.csv",header=FALSE,sep=",")$V1)
# Sort, capitalize, and keep unique values from the two keyword sets
keywords <- sapply(unique(sort(c(keywordset1, keywordset2))), toupper)
# Keep keywords greater than 2 characters in length (basically exclude in at etc...)
keywords <- keywords[nchar(keywords) > 2]
# Keep keywords that are not in the exclude list
keywords <- setdiff(keywords, sapply(exclude_list, toupper))
# HERE IS WHERE I NEED HELP
# Compare the read keyword list to the master keyword list
# and keep the frequency column
key1=read.csv("set1.csv",header=FALSE,sep=",")
key1$V1=sapply(key1[[1]], toupper)
keywords$V2=key1[which(keywords[[1]] %in% key1$V1),2]
return(keywords)

The reason that your last command fails is that you try to use the $ operator on a vector. It only works on lists or data frames (which are a special case of lists).
A remark regarding toupper (and many other functions in R): it works on vectors, such that you don't need to use sapply. toupper(c(keywordset1, keywordset2)) is perfectly fine.
But I would like to propose an entirely different solution to your problem. First, I create the data as follows:
keywords1 <- read.table(text="able,2
gobble,3
highway,3
test,6
zoo,10",sep=",",stringsAsFactors=FALSE)
keywords2 <- read.table(text="gobble,10
highway,3
speed,7
test,8
upper,3
zoo,10",sep=",",stringsAsFactors=FALSE)
Note that I use stringsAsFactors=FALSE. This prevents read.table from converting characters to factors, such that there is no need to call as.character later.
The next steps are to capitalize the keyword columns in both tables. At the same time, I put both tables in a list. This is often a good way to simplify calculations in R, because you can use lapply to apply a function on all the list elements. Then I put both tables into a single table.
keyword_list <- lapply(list(keywords1,keywords2),function(kw)
transform(kw,V1=toupper(V1)))
keywords_all <- do.call(rbind,keyword_list)
The next step is to sort the data frame in decreasing order by the number in the second column:
keywords_sorted <- keywords_all[order(keywords_all$V2,decreasing=TRUE),]
keywords_sorted looks as follows:
V1 V2
5 ZOO 10
6 GOBBLE 10
11 ZOO 10
9 TEST 8
8 SPEED 7
4 TEST 6
2 GOBBLE 3
3 HIGHWAY 3
7 HIGHWAY 3
10 UPPER 3
1 ABLE 2
As you notice, some keywords appear only once, and for those that appear twice, the first appearance is the one you want to keep. There is a function in R that can be used to extract exactly these elements: duplicated() (run ?duplicated to learn more). Basically, the function returns TRUE if an element has already appeared earlier in the vector, i.e. for its second and any later occurrence. These are the elements you don't want. To convert TRUE to FALSE (and vice versa), you use the operator !. So the following gives your desired result:
keep <- !duplicated(keywords_sorted$V1)
keywords_max <- keywords_sorted[keep,]
V1 V2
5 ZOO 10
6 GOBBLE 10
9 TEST 8
8 SPEED 7
3 HIGHWAY 3
10 UPPER 3
1 ABLE 2
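The same per-keyword maximum can also be computed in one step with aggregate() instead of sorting and dropping duplicates. This is only a sketch of that alternative. It uses the two files exactly as listed in the question (so File 2 includes able,6), and the names file1, file2, kw_all and kw_max are made up for illustration.

```r
# The two files as listed in the question, read from inline text
file1 <- read.csv(text = "able,2\ngobble,3\nhighway,3\ntest,6\nzoo,10",
                  header = FALSE, stringsAsFactors = FALSE)
file2 <- read.csv(text = "able,6\ngobble,10\nhighway,3\nspeed,7\ntest,8\nupper,3\nzoo,10",
                  header = FALSE, stringsAsFactors = FALSE)

# Stack both tables and capitalize the keyword column
kw_all <- rbind(file1, file2)
kw_all$V1 <- toupper(kw_all$V1)

# One row per keyword, holding the largest count seen in either file
kw_max <- aggregate(V2 ~ V1, data = kw_all, FUN = max)
```

Here kw_max has one row per unique keyword, so it covers both steps the question asks for: the unique keyword list and the maximum column.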

Remove index column in read.csv

Inspired by "Prevent row names to be written to file when using write.csv", I am curious if there is a way to ignore the index column in R using the read.csv() function. I want to import a text file into an RMarkdown document and don't want the row numbers to show in my HTML file produced by RMarkdown.
Running the following code
write.csv(head(cars), "cars.csv", row.names=FALSE)
produces a CSV that looks like this:
speed dist
4 2
4 10
7 4
7 22
8 16
9 10
But, if you read this index-less file back into R (ie, read.csv("cars.csv")), the index column returns:
. speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
I was hoping the solution would be as easy as including row.names=FALSE to the read.csv() statement, as is done with write.csv(), however after I run read.csv("cars.csv", row.names=FALSE), R gets sassy and returns an "invalid 'row.names' specification" error message.
I tried read.csv("cars.csv")[-1], but that just dropped the speed column, not the index column.
How do I prevent the row index from being imported?

If you save your object, you won't have row names.
x <- read.csv("cars.csv")
But if you print it (to HTML), you will use the print.data.frame function, which shows row numbers by default. When I used the following as the last line of my markdown chunk, no row numbers were displayed:
print(read.csv("cars.csv"), row.names = FALSE)

Why?: This problem seems associated with a previous subset procedure that created the data. I have a file that keeps coming back with a pesky index column as I round-trip the data via read/write.csv.
Bottom Line: read.csv takes a file completely and outputs a dataframe, but the file has to be read before any other operation, like dropping a column, is possible.
Easy Workaround: Fortunately it's very simple to drop the column from the new dataframe:
df <- read.csv("data.csv")
df <- df[,-1]
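Where such an index column comes from can be seen in a small round trip. The sketch below uses the built-in cars data and a temporary file (the path is hypothetical). It relies on the fact that write.csv writes row names by default and that read.csv names the resulting header-less first column X.

```r
path <- tempfile(fileext = ".csv")  # hypothetical file location

# Default row.names = TRUE writes the row names as an unnamed first column
write.csv(head(cars), path)
with_index <- read.csv(path)                 # columns: X, speed, dist

# Option 1: tell read.csv that the first column holds the row names
no_index <- read.csv(path, row.names = 1)    # columns: speed, dist

# Option 2: do not write the row names in the first place
write.csv(head(cars), path, row.names = FALSE)
clean <- read.csv(path)                      # columns: speed, dist
```

Either option avoids having to drop the first column after every read.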

Calling on a column from a data frame within a data frame

I have a list of data frames (let's call it "data") that I have generated, which goes something like this:
$"something.csv"
x y z
1 1 1 1
2 2 2 2
3 3 3 3
$"something else.csv"
x y z
1 1 1 1
2 2 2 2
3 3 3 3
I would like to output from the table "something.csv" one number within column x.
So far I have used:
data$"something.csv"$x[2]
This code works and I am happy that it does, but my problem is that I want to automate this process, and so I have put all the table titles into a vector (filename) which goes:
[1] "something.csv", "something else.csv"
So I made a for loop which should allow me to do so, but when I put in:
data$filename[1]$x[2]
it gives me back NULL.
When I print filename[1], I get [1] "something.csv", and if I type
data$"something.csv"$x[2]
I get the result I want. So if filename[1] = "something.csv", why does it not give me the same results?
I just want my code to output the second row of column x and automate this by using filename[i] in a for loop.

The $ operator does not evaluate what follows it: data$filename[1] looks for an element literally named 'filename' in the list, and no such element exists. Hence, NULL gets returned.
You need to use double square brackets to subset the object data. Here's an example:
# Generate data
data<-vector("list", 2)
names(data)<-c("something.csv", "something else.csv")
data[[1]]<-data.frame(x=1:3, y=1:3, z=1:3)
data[[2]]<-data.frame(x=1:3, y=1:3, z=1:3)
filename<-names(data)
# Subset the data
# The first data frame, notice the square brackets for subsetting lists!
data[[filename[1]]]
# column x
data[[filename[1]]]$x
# Second observation of x
data[[filename[1]]]$x[2]
The above uses the names of the objects in the list for subsetting. You can also use the number-based subsetting suggested by @Jeremy.
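To automate this over all file names, as the question asks, the same [[ ]] subsetting can go inside a loop. The following is a minimal sketch. The second data frame is given different values (4:6) only so that the two results can be told apart, and second_x is a made-up name.

```r
# Example list of data frames, named after the files they came from
data <- list(
  "something.csv"      = data.frame(x = 1:3, y = 1:3, z = 1:3),
  "something else.csv" = data.frame(x = 4:6, y = 4:6, z = 4:6)
)
filename <- names(data)

# $ takes the name literally, so this looks for an element called "filename"
is.null(data$filename)

# [[ ]] evaluates its argument, so a variable holding the name works
second_x <- integer(length(filename))
for (i in seq_along(filename)) {
  second_x[i] <- data[[filename[i]]]$x[2]
}
names(second_x) <- filename
```

The loop could also be written as a single vapply() call over filename, but the explicit for loop matches the approach described in the question.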

You can also use [ and [[ to get the same value as data$"something.csv"$x[2]. Try
data[[1]][2,1]
where [[1]] is the first list element and [2,1] is the data frame reference element

melt giving several value columns

I am reading in parameter estimates from some results files that I would like to compare side by side in a table. But I can't get the data frame into the structure that I want (Parameter name, Values(file1), Values(file2)).
When I read in the files I get a wide dataframe with each parameter in a separate column that I would like to transform to "long" format using melt. But that gives only one column with values. Any idea on how to get several value columns without using a for loop?
paraA <- c(1,2)
paraB <- c(6,8)
paraC <- c(11,9)
Source <- c("File1","File2")
parameters <- data.frame(paraA,paraB,paraC,Source)
wrong_table <- melt(parameters, by="Source")

You can use melt in combination with cast to get what you want. This is in fact the intended pattern of use, which is why the functions have the names they do:
library(reshape2)  # provides melt() and dcast()
m <- melt(parameters, id.vars = "Source")
dcast(m, variable ~ Source)
# variable File1 File2
# 1 paraA 1 2
# 2 paraB 6 8
# 3 paraC 11 9

Converting @alexis's comment to an answer, transpose (t()) pretty much does what you want:
setNames(data.frame(t(parameters[1:3])), parameters[, "Source"])
# File1 File2
# paraA 1 2
# paraB 6 8
# paraC 11 9
I've used setNames above to conveniently rename the resulting data.frame in one step.
