Manipulating multiple files in R

I am new to R and am looking for code to manipulate hundreds of files that I have at hand. They are .txt files with a few rows of unwanted text, followed by columns of data, looking something like this:
XXXXX
XXXXX
XXXXX
Col1 Col2 Col3 Col4 Col5
1 36 37 35 36
2 34 34 36 37
.
.
1500 34 35 36 35
I wrote the code below to extract selected rows of columns 1 and 5 from an individual .txt file, and would like to loop it over all the files I have.
data <- read.table("/Users/tan/Desktop/test/01.txt", skip = 264, nrows = 932)
selcol<-c("V1", "V5")
write.table(data[selcol], file="/Users/tan/Desktop/test/01ed.txt", sep="\t")
With the above code, the .txt file now looks like this:
Col1 Col5
300 34
.
.
700 34
If possible, I would like to combine the Col5 of all the .txt files with a single copy of Col1 (which is the same for all txt files), so that it looks something like this:
Col1 Col5a Col5b Col5c Col5d ...
300 34 34 36 37
.
.
700 34 34 36 37
Thank you!
Tan

Alright - I think I hit on all your questions here, but let me know if I missed something. The general process that we will go through here is:
1. Identify all of the files that we want to read in and process in our working directory
2. Use lapply to iterate over each of those file names to create a single list object that contains all of the data
3. Select your columns of interest
4. Merge them together by the common column
For the purposes of the example, consider I have four files named file1.txt through file4.txt that all look like this:
x y y2
1 1 2.44281173 -2.32777987
2 2 -0.32999022 -0.60991623
3 3 0.74954561 0.03761497
4 4 -0.44374491 -1.65062852
5 5 0.79140012 0.40717932
6 6 -0.38517329 -0.64859906
7 7 0.92959219 -1.27056731
8 8 0.47004041 2.52418636
9 9 -0.73437337 0.47071120
10 10 0.48385902 1.37193941
##1. identify files to read in
filesToProcess <- dir(pattern = "file.*\\.txt$")
> filesToProcess
[1] "file1.txt" "file2.txt" "file3.txt" "file4.txt"
##2. Iterate over each of those file names with lapply
listOfFiles <- lapply(filesToProcess, function(x) read.table(x, header = TRUE))
##3. Select columns x and y2 from each of the objects in our list
listOfFiles <- lapply(listOfFiles, function(z) z[c("x", "y2")])
##NOTE: you can combine steps 2 and 3 by passing in the colClasses parameter to read.table.
#That code would be:
listOfFiles <- lapply(filesToProcess, function(x)
    read.table(x, header = TRUE, colClasses = c("integer", "NULL", "numeric")))
##4. Merge all of the objects in the list together with Reduce.
# x is the common column to join on
out <- Reduce(function(x,y) {merge(x,y, by = "x")}, listOfFiles)
#clean up the column names
colnames(out) <- c("x", sub("\\.txt", "", filesToProcess))
Results in the following:
> out
x file1 file2 file3 file4
1 1 -2.32777987 -0.671934857 -2.32777987 -0.671934857
2 2 -0.60991623 -0.822505224 -0.60991623 -0.822505224
3 3 0.03761497 0.049694686 0.03761497 0.049694686
4 4 -1.65062852 -1.173863215 -1.65062852 -1.173863215
5 5 0.40717932 1.189763270 0.40717932 1.189763270
6 6 -0.64859906 0.610462808 -0.64859906 0.610462808
7 7 -1.27056731 0.928107752 -1.27056731 0.928107752
8 8 2.52418636 -0.856625895 2.52418636 -0.856625895
9 9 0.47071120 -1.290480033 0.47071120 -1.290480033
10 10 1.37193941 -0.235659079 1.37193941 -0.235659079
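To connect this back to the original files: a minimal sketch, assuming the files live in /Users/tan/Desktop/test, are numbered like 01.txt, and are read headerless with the question's skip = 264 and nrows = 932 (all of these carried over from the question rather than tested):
setwd("/Users/tan/Desktop/test")
filesToProcess <- dir(pattern = "^[0-9]+\\.txt$")  # assumed naming; also keeps the output file out of the input set
## read each file without a header, keeping the shared column 1 and column 5
listOfFiles <- lapply(filesToProcess, function(f)
    read.table(f, skip = 264, nrows = 932)[c("V1", "V5")])
## merge on the shared first column, then label each Col5 by its source file
out <- Reduce(function(x, y) merge(x, y, by = "V1"), listOfFiles)
colnames(out) <- c("Col1", paste0("Col5_", sub("\\.txt$", "", filesToProcess)))
write.table(out, "combined.txt", sep = "\t", row.names = FALSE)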

Related

In R, how to write from a list to a file with a set number of elements on each line?

Let's say I have a list of 23 elements.
ls <- list(1:23)
Which I want to write to a file with five elements on each line, separated by tabs, until the elements run out:
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23
How would I go about doing this? I don't see any options in writeLines or write.table.
The code by @akrun works best:
cat(gsub("\\s*((\\d+\\s+){1,4}\\d+)", "\\1\n",
paste(unlist(ls), collapse="\t")), '\n', file = 'file1.txt')
With a minor error for decimal values, as the resulting file1.txt looks like this:
0.0005862 0.0005983 0.0006225 0.0006637 0
.0006622 0.0006197 0.000599 0.0005983 0
.0006247 0.0006707 0.0006641 0.0006253 0
.0006087 0.0006234 0.0006807 0.0007485 0
.0007546 0.0007 0.000643 0.0006183 0
.0006264 0.0006819 0.000697 0.0006453 0
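A possible fix for the decimal values (my tweak, not from the original answer): match whole non-space tokens rather than bare digit runs, so numbers like 0.0005862 stay on one line:
cat(gsub("\\s*((\\S+\\s+){1,4}\\S+)", "\\1\n",
    paste(unlist(ls), collapse="\t")), '\n', file = 'file1.txt')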
It can be done with cat and gsub: unlist the list, paste the elements into a single string, insert a newline (\n) after every block of five numbers, and use cat to write to the console.
cat(gsub("\\s*((\\d+\\s+){1,4}\\d+)", "\\1\n",
paste(unlist(ls), collapse="\t")), '\n')
#1 2 3 4 5
#6 7 8 9 10
#11 12 13 14 15
#16 17 18 19 20
#21 22 23
or write to a file
cat(gsub("\\s*((\\d+\\s+){1,4}\\d+)", "\\1\n",
paste(unlist(ls), collapse="\t")), '\n', file = 'file1.txt')
If the data are more complex (scientific notation, etc.), we can split the vector into chunks of five and pad the last chunk with NA so that all chunks have the same length:
v1 <- unlist(ls)
lst1 <- split(v1, (seq_along(v1) - 1) %/% 5 + 1)
mat1 <- do.call(rbind, lapply(lst1, `length<-`, max(lengths(lst1))))
write(t(mat1), 'file2.txt', ncolumns = 5)
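With ls <- list(1:23) this writes five lines of five values, the last padded to 21 22 23 NA NA; the padding NAs appear literally in the file, so strip them afterwards if they are unwanted.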
You first need to define the chunks. I used BBmisc, which has a chunk function for obtaining chunks of n elements (five in your case). Then you can use write.table, which has an append option.
library(BBmisc)
x <- list(1:20)
n <- 5
splited <- chunk(x[[1]], n)
for(i in seq_along(splited)){
  line <- paste(splited[[i]], collapse = "\t")
  write.table(line, file = "output.txt", sep = "\t",
              row.names = FALSE, col.names = FALSE, quote = FALSE, append = TRUE)
}
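One caveat: because append = TRUE, re-running the loop keeps adding lines to output.txt, so delete the file first (e.g. with file.remove("output.txt")) for a fresh run.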
Regards

Query a column from one .csv file against a column from a second .csv file, print the queried column to a new file, and annotate it with the File 2 row where there is a match

I have two .csv files and I would like to query column 3 from File 1 against column 3 from File 2. The output file should consist of column 3 from File 1 and, if the entry exists in File 2, the entire corresponding row from File 2. If there is no match in File 2, print File 1's column 3 and leave the rest blank (see below).
File 1:
... ... a ...
... ... e ...
... ... b ...
... ... c ...
File 2:
... ... a a-info-1 a-info-2 a-info-n
... ... c c-info-1 c-info-2 c-info-n
... ... d d-info-1 d-info-2 d-info-n
... ... e e-info-1 e-info-2 e-info-n
... ... f f-info-1 f-info-2 f-info-n
Desired output:
a ... ... a-info-1 a-info-2 a-info-n
e ... ... e-info-1 e-info-2 e-info-n
b
c ... ... c-info-1 c-info-2 c-info-n
I have tried to accomplish this in both R and bash. I thought I should be able to figure this out by referencing this thread: awk compare 2 files, 2 fields different order in the file, print or merge match and non match lines, but I am very much new to all things programming and can't seem to figure out how to transcribe the solution to my case.
My best try:
awk -F"," 'NR==FNR{a[$3]=$3;next}{if ($3 in a)print a[$3]","$0;}' file1.csv file2.csv > output.csv
The problem with this code is that it does not print the entries from File 1 that do not have entries in File 2.
If your solutions could be overly explanatory, I would very much appreciate it!
From your example, I understand that you want to look up the rows in the second file by the values in the third column of the first file.
Assuming your files have been read into data frames named df1 and df2, let them be:
> df1
col1 col2 col3
1 48 39 a
2 26 42 e
3 45 30 b
4 40 46 c
> df2
col1 col2 col3 col4 col5
1 49 47 a info a info2 a
2 27 45 b info b info2 b
3 28 37 c info c info2 c
4 37 26 e info e info2 e
5 25 40 f info f info2 f
The first step is to identify the matching rows in the second file:
rn <- match(df1$col3, df2$col3)
This creates a vector of row indices into df2 (with NA where there is no match), which we can use to pull the corresponding rows:
out <- cbind(df1[3], df2[rn,][-3])
Here, the key column from the first data frame is combined with all the columns except the third from the looked-up rows of the second data frame.
The subsetting leaves messy row names on the new data frame, so we reset them; unmatched rows come through as NA, which we replace with an empty character:
rownames(out) <- rownames(df1)
out[is.na(out)] <- ""
The output will be like this:
> out
col3 col1 col2 col4 col5
1 a 49 47 info a info2 a
2 e 37 26 info e info2 e
3 b
4 c 28 37 info c info2 c
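For completeness, a minimal end-to-end sketch reading the two CSVs and writing the result (the file names file1.csv / file2.csv and header = FALSE are assumptions; adjust them to your files):
df1 <- read.csv("file1.csv", header = FALSE)
df2 <- read.csv("file2.csv", header = FALSE)
rn  <- match(df1[[3]], df2[[3]])   # index of each df1 key in df2, NA if absent
out <- cbind(df1[3], df2[rn, -3])  # key column plus the matched df2 row
rownames(out) <- rownames(df1)
out[is.na(out)] <- ""              # blank out the non-matches
write.csv(out, "output.csv", row.names = FALSE)
A base-R merge looks like a one-liner alternative (merge(df1[3], df2, by.x = 1, by.y = 3, all.x = TRUE)), but note that it sorts the result by the key instead of keeping File 1's row order.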

Append data to different sheets of an Excel file in R

I have a dataframe like
All_DATA
ID Name Age
1 xyz 10
2 pqr 20
5 abc 15
6 pqr 19
8 xyz 10
9 pqr 12
10 abc 20
54 abc 41
Right now I have code that subsets the data based on Name and writes each subset to a different Excel file, but now I want them in the same Excel file on different sheets.
Here is the code that writes them to different Excel files:
library("xlsx")
library("openxlsx")
All_DATA = read.xlsx("D:/test.xlsx")
data.list=list()
for(i in unique(All_DATA$Name)){
data.list[[i]] = subset(All_DATA,NAME==i)
write.xlsx( data.list[[i]],file=paste0("D:/Admin/",i,".xlsx"),row.names=F)
}
Is there any way to generate a single Excel file with the data on multiple sheets?
Thanks
Domnick
You can use write.xlsx from the xlsx package with a sheetName plus append = TRUE, so that each subset is added as a new sheet instead of overwriting the file:
write.xlsx(data.list[[i]], file = "file.xlsx", sheetName = paste0("Sheet_", i), row.names = FALSE, append = TRUE)
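Alternatively, since openxlsx is already loaded in your script, its write.xlsx accepts a named list of data frames and writes one sheet per element. A minimal sketch (the output path is my placeholder):
data.list <- split(All_DATA, All_DATA$Name)                      # one data frame per Name, named by Name
openxlsx::write.xlsx(data.list, file = "D:/Admin/All_DATA.xlsx") # namespaced, since xlsx also defines write.xlsx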

Is there a way to stop table from sorting in R

Problem setup: I am writing a function that takes multiple CSV files selected by an ID column, combines them into one CSV, and then outputs the number of observations per ID.
Expected:
complete("specdata", 30:25) ##notice descending order of IDs requested
## id nobs
## 1 30 932
## 2 29 711
## 3 28 475
## 4 27 338
## 5 26 586
## 6 25 463
I get:
> complete("specdata", 30:25)
id nobs
1 25 463
2 26 586
3 27 338
4 28 475
5 29 711
6 30 932
Which is "wrong" because it has been sorted by id.
The CSV file I read from does have the data in descending order. My snippet:
dfTable<-read.csv("~/progAssign1/specdata/tmpdata.csv")
ccTab<-complete.cases(dfTable)
xTab3<-as.data.frame(table(dfTable$ID[ccTab]),)
colnames(xTab3)<-c("id","nobs")
And as near as I can tell, the third line is where sorting occurs. I broke out the expression and it happens in the table() call. I've not found any option or parameter I can pass to make something like sort=FALSE. You'd think...
Anyway. Any help appreciated!
So, the problem is in the output of table, which is sorted by default. For example:
> r = sample(5,15,replace = T)
> r
[1] 1 4 1 1 3 5 3 2 1 4 2 4 2 4 4
> table(r)
r
1 2 3 4 5
4 3 2 5 1
If you want to take the order of first appearance, you are going to get your hands a little bit dirty by recoding the table function:
unique_r = unique(r)
table_r = rbind(label=unique_r, count=sapply(unique_r,function(x)sum(r==x)))
table_r
[,1] [,2] [,3] [,4] [,5]
label 1 4 3 5 2
count 4 5 2 1 3
One way to get around this is...don't use table. Here's an example where I create three one-line data sets from your data, then read them back in a descending sequence with read.table, and it seems to be okay.
The really big thing here is that multiple data sets should be placed in a list upon being read into R. That way you'll get the exact order of data sets you want, among other benefits.
Once you've read them into R the way you want them, it's much easier to order them at the very end. Ordering of rows (for me) is usually the very last step.
> dat <- read.table(h=T, text = "id nobs
1 25 463
2 26 586
3 27 338
4 28 475
5 29 711
6 30 932")
Write three one-line files:
> write.table(dat[3,], "dat3.csv", row.names = FALSE)
> write.table(dat[2,], "dat2.csv", row.names = FALSE)
> write.table(dat[1,], "dat1.csv", row.names = FALSE)
Read them in using a 3:1 order:
> do.call(rbind, lapply(3:1, function(x){
read.table(paste0("dat", x, ".csv"), header = TRUE)
}))
# id nobs
# 1 27 338
# 2 26 586
# 3 25 463
Then, if we change 3:1 to 1:3 the rows "comply" with our request
> do.call(rbind, lapply(1:3, function(x){
read.table(paste0("dat", x, ".csv"), header = TRUE)
}))
# id nobs
# 1 25 463
# 2 26 586
# 3 27 338
And just for fun
> fun <- function(z){
do.call(rbind, lapply(z, function(x){
read.table(paste0("dat", x, ".csv"), header = TRUE) }))
}
> fun(c(2, 3, 1))
# id nobs
# 1 26 586
# 2 27 338
# 3 25 463
You may try something like this:
t1 <- c(5,3,1,3,5,5,5)
as.data.frame(table(t1)) ##result in ascending order
# t1 Freq
#1 1 1
#2 3 2
#3 5 4
t1 <- factor(t1)
as.data.frame(table(reorder(t1, rep(-1, length(t1)),sum)))
# Var1 Freq
#1 5 4
#2 3 2
#3 1 1
In your case, the complaint is that the table function with a single argument returns the items with their names in ascending order, and you want them in descending order. You could simply have used the rev() function around the table call.
xTab3<-as.data.frame( rev( table( dfTable$ID[ccTab] ) ),)
(I'm not sure what that last comma is doing in there.) The sort order in the original would not be expected to determine the order of a table operation. Generally R will return results with discrete labels sorted in alpha (ascending) order unless the levels of a factor item have been specified differently. That's one of those R-specific rules that may be difficult to intuit. The other R-specific rule that may be difficult to grasp (although not really a problem here) is that arguments are often expected to be in the form of R-lists.
It's probably wise to think about R table objects at this point (and what happens with the as.data.frame call). Table objects are actually R arrays, so the feature you wanted to sort by was actually the rownames of that table object, and those are of class character:
r = sample(5,15,replace = T)
table(r)
#r
#2 3 4 5
#5 3 2 5
rownames(table(r))
#[1] "2" "3" "4" "5"
str(as.data.frame(table(r)))
#-------
'data.frame': 4 obs. of 2 variables:
$ r : Factor w/ 4 levels "2","3","4","5": 1 2 3 4
$ Freq: int 5 3 2 5
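Since those labels are character, one clean pattern (my sketch, not part of the original answer) is to build the table as usual and then reorder the rows into the requested id order with match():
xTab3 <- as.data.frame(table(dfTable$ID[ccTab]))
colnames(xTab3) <- c("id", "nobs")
ids <- 30:25                            # the requested id order, assumed from the question
xTab3 <- xTab3[match(ids, xTab3$id), ]  # match() compares against the factor's character labels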
I just want to share this homework I've done:
complete <- function(directory, id = 1:332){
  setwd("E:/Coursera")
  files <- dir(directory, full.names = TRUE)
  data <- lapply(files, read.csv)
  specdata <- do.call(rbind, data)
  cleandata <- specdata[!is.na(specdata$sulfate) & !is.na(specdata$nitrate), ]
  result <- data.frame(id = numeric(0), nobs = numeric(0))
  for(i in id){
    targetdata <- cleandata[cleandata$ID == i, ]
    result <- rbind(result, data.frame(table(targetdata$ID)))
  }
  names(result) <- c("id", "nobs")
  result
}
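A side note on that function (my observation, not part of the homework): setwd() inside a function changes the caller's working directory as a lasting side effect; passing an absolute path as directory and dropping the setwd() call is usually safer.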
A simple solution that no one has proposed yet is combining table() with the unique() function. unique() gives exactly the behaviour you are looking for (listing unique IDs in order of appearance).
In your case it would be something like this:
dfTable<-read.csv("~/progAssign1/specdata/tmpdata.csv")
ccTab<-complete.cases(dfTable)
x<-dfTable$ID[ccTab] #unique IDs
xTab3<-as.data.frame(table(x)[unique(x)],) #here you sort the "table()" result in order of appearance
colnames(xTab3)<-c("id","nobs")

Print data column associated with the same column variable values in R

I am working on data collection with R on Windows 7.
This is related to my previous question:
Data grouping and sub-grouping by column variable in R
I have this data frame.
var1 var2 value
1 56 649578
1 56 427352
1 88 354623
1 88 572397
2 17 357835
2 17 498455
2 90 357289
2 90 678658
I need to print them to a CSV file as:
649578 354623 357835 357289
427352 572397 498455 678658
Do I need to use a dictionary or hashset in R?
Here's your data, again, just for reproducibility:
mydf <- read.table(text='var1 var2 value
1 56 649578
1 56 427352
1 88 354623
1 88 572397
2 17 357835
2 17 498455
2 90 357289
2 90 678658', header=TRUE)
Take a look at the documentation for write.table.
You say you want a CSV, which would look like the following:
write.csv(matrix(mydf$value, nrow=2), 'test.csv')
Produces "test.csv":
"","V1","V2","V3","V4"
"1",649578,354623,357835,357289
"2",427352,572397,498455,678658
Or, I think you probably want:
write.table(matrix(mydf$value, nrow=2), 'test.tsv', sep='\t')
Produces "test.tsv":
"V1" "V2" "V3" "V4"
"1" 649578 354623 357835 357289
"2" 427352 572397 498455 678658
