How to format R data.frame output? - r

Let me start by saying I am brand new to R, so any solution with a detailed explanation would be appreciated so I can learn from it.
I have a set of CSV files with the following columns:
"ID" "Date" "A" "B" (where A and B are some data points)
I am attempting to get the output in a meaningful manner and am stuck on what I am missing.
observations <- function(directory, id = 1:10) {
# get all file names in a vector
all_files <- list.files(directory, full.names = TRUE)
# get the subset of files we want to read
file_contents <- lapply(all_files[id], read.csv)
# rbind the file contents into a single data frame
output <- do.call(rbind, file_contents)
# remove all rows that contain NA values
output <- output[complete.cases(output), ]
# at this point output is a data.frame, so count the rows per ID
table(output[["ID"]])
}
My current output is :
2 4 8 10 12
1000 500 200 150 100
which is correct but I need it in column form so it can be understood by looking at it. The output I am trying to get to is below:
id obs_total
1 2 1000
2 4 500
3 8 200
4 10 150
5 12 100
What am I missing here?

table outputs a contingency table. You want a data frame. You can wrap as.data.frame(...) around your output to convert it.
as.data.frame(table(ID = output[["ID"]]))
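To get the column names asked for in the question (id and obs_total), something along these lines should work. This is only a minimal sketch on invented values: the vector ids is a hypothetical stand-in for output[["ID"]], and responseName is the as.data.frame argument that names the count column.

```r
# Toy stand-in for output[["ID"]] (values invented for illustration)
ids <- c(2, 2, 2, 4, 4, 8)

# table() counts each ID; as.data.frame() turns the counts into rows.
# responseName sets the name of the count column (default is "Freq").
counts <- as.data.frame(table(id = ids),
                        responseName = "obs_total",
                        stringsAsFactors = FALSE)

# The id column comes back as text, so convert it if numbers are needed
counts$id <- as.numeric(counts$id)

print(counts)
```

Returning counts as the last expression of the function would then give one row per ID instead of the wide contingency table.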

Assuming that the numbers are correct, looks like you have everything you need, just transpose the data frame. Try this:
mat<-matrix(round(runif(10),3),nrow=2)
df<-as.data.frame(mat)
colnames(df)=c("1","2","3","4","5")
t(df)

Related

List containing data tables - Unable to use a function to rename columns?

I have 20 Excel files containing city-level data for each year. I imported them into a list because I thought it would be easier to loop over them.
The first task that I wanted to do is to change the name of the second column of each file.
If, for a single file I do:
#data is a list of data tables/frames. Example:
data<-list(a = data.frame(1:2,3:4),b = data.frame(5:8,15:18) )
#renaming second column of a (works)
names(data[[1]])[2]<-"ABC"
I am able to rename the column.
To do batch editing I wanted to write a function to be used in lapply. The function should be a simple version of the above thing:
rename <-function(df){
names(df)[2]<-"XYZ"}
rename(data[[1]]), however, does nothing to the second column. Any ideas why?
You need to return the full modified object at each iteration:
data <- lapply( data, function(x) {names(x)[2]<-"ABC"; x})
data
#---------
[[1]]
X1.2 ABC
1 1 3
2 2 4
[[2]]
X5.8 ABC
1 5 15
2 6 16
3 7 17
4 8 18
I'm sure this is a duplicate but I don't know what the right search terms might be, so I'm just answering it .... again.
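The reason the original function appears to do nothing can be sketched as follows. R modifies a local copy of df inside the function, and a function whose last expression is an assignment returns the assigned value rather than the data frame. The names rename_broken and rename_second are hypothetical and only used for illustration.

```r
data <- list(a = data.frame(1:2, 3:4), b = data.frame(5:8, 15:18))

# Same shape as the function in the question: the last expression is an
# assignment, so the (invisible) return value is the string "XYZ".
rename_broken <- function(df) {
  names(df)[2] <- "XYZ"
}
res <- rename_broken(data[[1]])
print(res)               # the assigned value, not a data frame
print(names(data[[1]]))  # the original is untouched

# Returning the modified copy makes the function usable with lapply().
rename_second <- function(df, new_name = "XYZ") {
  names(df)[2] <- new_name
  df
}
data <- lapply(data, rename_second)
print(lapply(data, names))
```

Assigning the result of lapply back to data is what makes the change stick, since nothing is modified in place.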

dealing with blank/missing data with write.table in R

I have a data frame where some of the rows have blank entries, e.g. to use a toy example
Sample Gene RS Chromosome
1 A rs1 10
2 B X
3 C rs4 Y
i.e. sample 2 has no rs#. If I attempt to save this data frame in a file using:
write.table(mydata,file="myfile",quote=FALSE,sep='\t')
and then read.table('myfile',header=TRUE,sep='\t'), I get an error stating that the number of entries in line 2 doesn't have 4 elements. If I set quote=TRUE, then a "" entry appears in the table. I'm trying to figure out a way to create a table using write.table with quote=FALSE while retaining a blank placeholder for rows with missing entries such as 2.
Is there a simple way to do this? I attempted to use the argument NA="" in write.table() but this didn't change anything.
If the data frame my script produces contains NA values, I always replace them. One way is to replace NA in the data frame with some other text that tells you the entry was NA, especially if you are saving the result to a CSV file, a database, or some other non-R environment.
A simple script to do that:
replace_NA <- function(x, replacement = "N/A") {
x[is.na(x)] <- replacement
x  # return the modified column, not the value of the assignment
}
# df[] keeps the data frame structure (sapply would return a matrix)
df[] <- lapply(df, replace_NA, replacement = "N/A")
You are attempting to reinvent the fixed-width file format. Your requested format would have a blank column between every real column. I don't find a write.fwf, although the 'utils' package has read.fwf. The simplest method of getting your requested output would be:
capture.output(dat, file='test.dat')
# Result in a text file
Sample Gene RS Chromosome
1 1 A rs1 10
2 2 B X
3 3 C rs4 Y
This essentially uses the print method (at the end of the R REPL) for dataframes to do the spacing for you.
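If a tab-delimited file is acceptable, another option is to keep the blank as an empty field and tell read.table to treat empty fields as missing. The following is a rough sketch rather than what either answer proposes. It assumes the blank entry is stored as NA in the data frame, and it uses the lowercase na argument of write.table together with na.strings in read.table. The toy data and the temporary file path are made up for the example.

```r
# Toy version of the data frame from the question, with the blank as NA
mydata <- data.frame(Sample = 1:3,
                     Gene = c("A", "B", "C"),
                     RS = c("rs1", NA, "rs4"),
                     Chromosome = c("10", "X", "Y"),
                     stringsAsFactors = FALSE)

path <- tempfile(fileext = ".tsv")

# na = "" writes missing values as empty fields between the tabs
write.table(mydata, file = path, quote = FALSE, sep = "\t",
            na = "", row.names = FALSE)

# sep = "\t" keeps empty fields; na.strings = "" reads them back as NA
back <- read.table(path, header = TRUE, sep = "\t",
                   na.strings = "", stringsAsFactors = FALSE)
print(back)
```

Note that the argument name is case-sensitive: write.table has an na argument, not NA.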

Counting NA values by ID?

I'm learning R from scratch right now and am trying to count the number of NA's within a given table, aggregated by the ID of the file it came from. I then want to output that information in a new data frame, showing just the ID and the sum of the NA lines contained within. I've looked at some similar questions, but they all seem to deal with very short datasets, whereas mine is comparably long (10k + lines) so I can't call out each individual line to aggregate.
Ideally, if I start with a data table called "Data" with a total of four columns, and one column called "ID", I would like to output a data frame that is simply:
[ID] [NA_Count]
1 500
2 352
3 100
Thanks in advance...
Something like the following should work, although I am assuming that Date is always there and Field 1 and Field 2 are numeric:
# get file names and initialize a vector for the counts
fileNames <- list.files(<filePath>)
missRowsVec <- integer(length(fileNames))
# loop through files, get number of rows with missing values in each
for(filePos in seq_along(fileNames)) {
# read in files **fill in <filePath>**
temp <- read.csv(paste0(<filePath>, fileNames[filePos]), as.is=TRUE)
# count the number of rows with missing values,
# ** fill in <fieldName#> with strings of variable names **
# (the second argument, 1, tells apply to work row by row)
missRowsVec[filePos] <- sum(apply(temp[, c(<field1Name>, <field2Name>)], 1,
function(i) anyNA(i)))
} # end loop
# build data frame
myDataFrame <- data.frame("fileNames"=fileNames, "missCount"=missRowsVec)
This may be a bit dense, but it should work more or less. Try small portions of it, like just some inner function, to see how stuff works.
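If the data is already in a single table with an ID column, as described in the question, the count can also be done without a loop. Here is a minimal sketch under that assumption. The toy Data frame and its columns A, B and C are invented, and "NA lines" is taken to mean rows with at least one missing value.

```r
# Toy table: an ID column plus three data columns (values invented)
Data <- data.frame(ID = c(1, 1, 1, 2, 2, 3),
                   A  = c(NA, 2, NA, 4, 5, 6),
                   B  = c(1, NA, 3, 4, NA, 6),
                   C  = c(1, 2, 3, 4, 5, 6))

# TRUE for every row that has at least one NA
has_na <- !complete.cases(Data)

# Sum the TRUE values within each ID
na_counts <- aggregate(has_na, by = list(ID = Data$ID), FUN = sum)
names(na_counts)[2] <- "NA_Count"

print(na_counts)
```

Because complete.cases and aggregate are vectorised, the length of the table (10k+ rows) should not be a problem.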

r summary of row over multiple files

I have around 100 text files which I have loaded into R:
myFiles <- (Sys.glob("C:/../.../*.txt"))
dataFiles <- lapply(myFiles, read.table)
Files have different number of rows, but all have 4 columns. 1st column is the name and the last 3 are coordinates.
example of files:
[[1]]
n x y z
1 Bal 0.459405 -238.3565 -653.5304
2 tri 0.028990 -224.5127 -600.0000
.....
14 mon 24.514049 -264.7673 -627.0550
[[2]]
n x y z
1 bal 2.220795 -284.1022 -651.8112
2 reg 2.077444 -290.4326 -631.3667
...
8 tri 32.837284 -347.2596 -633.0000
There is one row which is present in all files: e.g. row.name="tri". I want to find summary (median,mean,max,min) of that row's coordinates (x,y,z) over all 100 files.
I found quite a few examples of summary of a row in one file but not over multiple files.
I think I need to use lapply but not sure how to start with it.
I also need the summary to create classes later, based on the values I have. I found the "summary" function useful for that. If you could suggest any other function that might be more useful for that purpose, it would be helpful.
Any help would be great!
Thanks!
For pulling all those "tri" rows together you can do:
df <- do.call("rbind", lapply(dataFiles, function(z) z[z$n=="tri",]))
summary(df)
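If the statistics are needed as numbers for later use (for example to build classes), rather than as printed summary output, one possibility is to compute them per coordinate with sapply. This is a small sketch on two invented files; the list dataFiles below is a toy stand-in for the real one, and the names tri and stats are hypothetical.

```r
# Toy stand-in for the list of files read with lapply(myFiles, read.table)
dataFiles <- list(
  data.frame(n = c("Bal", "tri"), x = c(0.46, 0.03),
             y = c(-238.4, -224.5), z = c(-653.5, -600.0)),
  data.frame(n = c("bal", "tri"), x = c(2.22, 32.84),
             y = c(-284.1, -347.3), z = c(-651.8, -633.0))
)

# Pull the "tri" row out of every file and stack the rows
tri <- do.call(rbind, lapply(dataFiles, function(d) d[d$n == "tri", ]))

# One column per coordinate, one row per statistic
stats <- sapply(tri[c("x", "y", "z")], function(v) {
  c(mean = mean(v), median = median(v), min = min(v), max = max(v))
})

print(stats)
```

The result is a numeric matrix, so a single value can be picked out with, for example, stats["max", "x"].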

melt giving several value columns

I am reading in parameter estimates from some results files that I would like to compare side by side in a table. But I can't get the data frame into the structure that I want (Parameter name, Values(file1), Values(file2)).
When I read in the files I get a wide dataframe with each parameter in a separate column that I would like to transform to "long" format using melt. But that gives only one column with values. Any idea on how to get several value columns without using a for loop?
paraA <- c(1,2)
paraB <- c(6,8)
paraC <- c(11,9)
Source <- c("File1","File2")
parameters <- data.frame(paraA,paraB,paraC,Source)
wrong_table <- melt(parameters, by="Source")
You can use melt in combination with cast to get what you want. This is in fact the intended pattern of use, which is why the functions have the names they do:
library(reshape2) # melt() and dcast() come from the reshape2 package
m<-melt(parameters, id.vars="Source")
dcast(m,variable~Source)
# variable File1 File2
# 1 paraA 1 2
# 2 paraB 6 8
# 3 paraC 11 9
Converting @alexis's comment to an answer, transpose (t()) pretty much does what you want:
setNames(data.frame(t(parameters[1:3])), parameters[, "Source"])
# File1 File2
# paraA 1 2
# paraB 6 8
# paraC 11 9
I've used setNames above to conveniently rename the resulting data.frame in one step.

Resources