I'm trying to read in 360 data files in text format. I can do so using this code:
temp = list.files(pattern="*.txt")
myfiles = lapply(temp, read.table)
The problem I have is that the files are named as "DO_1, DO_2,...DO_360" and when I try to import the files into a list, they do not maintain this order. Instead I get DO_1, DO_10, etc. Is there a way to specify the order in which the files are imported and stored? I didn't see anything in the help pages for list.files or read.table. Any suggestions are greatly appreciated.
lapply will process the files in the order you have them stored in temp. So your goal is to sort them the way you actually think about them. Luckily there is the mixedsort function from the gtools package that does just the kind of sorting you're looking for. Here is a quick demo.
> library(gtools)
> vals <- paste("DO", 1:20, sep = "_")
> vals
[1] "DO_1" "DO_2" "DO_3" "DO_4" "DO_5" "DO_6" "DO_7" "DO_8" "DO_9"
[10] "DO_10" "DO_11" "DO_12" "DO_13" "DO_14" "DO_15" "DO_16" "DO_17" "DO_18"
[19] "DO_19" "DO_20"
> vals <- sample(vals)
> sort(vals) # doesn't give us what we want
[1] "DO_1" "DO_10" "DO_11" "DO_12" "DO_13" "DO_14" "DO_15" "DO_16" "DO_17"
[10] "DO_18" "DO_19" "DO_2" "DO_20" "DO_3" "DO_4" "DO_5" "DO_6" "DO_7"
[19] "DO_8" "DO_9"
> mixedsort(vals) # this is the sorting we're looking for.
[1] "DO_1" "DO_2" "DO_3" "DO_4" "DO_5" "DO_6" "DO_7" "DO_8" "DO_9"
[10] "DO_10" "DO_11" "DO_12" "DO_13" "DO_14" "DO_15" "DO_16" "DO_17" "DO_18"
[19] "DO_19" "DO_20"
So in your case you just want to do
library(gtools)
temp <- mixedsort(temp)
before your call to lapply that calls read.table.
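If installing gtools is not an option, the same ordering can be sketched in base R by sorting on the numeric part of each name. This assumes every file name contains exactly one run of digits (as in DO_1 ... DO_360); the file names below are made up for illustration.

```r
# Hypothetical file names, in the order list.files() would typically return them
temp <- c("DO_1.txt", "DO_10.txt", "DO_100.txt", "DO_2.txt", "DO_20.txt")

# Strip everything that is not a digit and sort on the resulting number
file_number <- as.numeric(gsub("[^0-9]", "", temp))
temp_sorted <- temp[order(file_number)]
print(temp_sorted)

# Reading the files in that order and naming the list after them would then be:
# myfiles <- lapply(temp_sorted, read.table)
# names(myfiles) <- temp_sorted
```

Naming the list elements after the files makes it easier to check later that each table came from the file you expect.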
As far as I know, parsing libraries such as XML and xml2 read standard HTML tables on a web page without trouble. But some tables are not laid out as a grid of table cells; their content is organized with tags such as “<span>” and “<div>”.
I am now dealing with a table of this kind.
The table is marked up with “<span>” tags, and every four “<span>” tags make up one record. I solved the problem with a loop, but I would like to process it without one. I have heard that the purrr library may help here, but I don't know how to use it in this situation.
I tried the analysis with both “XML” and “xml2”:
Analysis with “XML” package
pg<-"http://www.irgrid.ac.cn/simple-search?fq=eperson.unique.id%3A311007%5C-000920"
library(XML)
tableNodes <- getNodeSet(htmlParse(pg), "//table[@class='miscTable2']")
itemlines <- xpathApply(tableNodes[[1]], "//tr[@class='itemLine']/td[@width='750']")
ispan <- xmlElementsByTagName(itemlines[[2]], "span")
title <- xmlValue(ispan$span)
isuedate <- xmlValue(ispan$span[1,2])
author <- xmlValue(ispan$span[3])
In this case “XML” returned a list of spans, but the list behaves very strangely:
> attributes(ispan)
$names
[1] "span" "span" "span" "span"
It seems to have only one row with four columns, but it does not. Spans 2 to 4 cannot be selected by column: the first “span” appears to occupy two positions, and the other spans cannot be retrieved.
> val <- xmlValue(ispan$span[[1]])
> val
[1] "超高周疲劳裂纹萌生与初始扩展的特征尺度"
> isuedate <- xmlValue(ispan$span[[2]])
> isuedate
[1] " \r\n [科普文章]"
> isuedate <- xmlValue(ispan$span[[3]])
> isuedate
[1] NA
> author <- xmlValue(ispan$span[[4]])
> author
[1] NA
None of the usual list selection methods works:
> title <- xmlValue(ispan$span[1,1])
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "c('XMLInternalNodeList', 'XMLNodeList')"
title <- xmlValue(ispan$span[1,])
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "c('XMLInternalNodeList', 'XMLNodeList')"
author <- xmlValue(ispan[1,3])
Error in ispan[1, 3] : incorrect number of dimensions
Analysis with “xml2”
With “xml2”, the “span” tags cause the same problem:
pg<-"http://www.irgrid.ac.cn/simple-search?fq=eperson.unique.id%3A311007%5C-000920"
library(xml2)
tableSource <- xml_find_all(read_html(pg, encoding = "UTF-8"), "//table[@class='miscTable2']")
itemspan <- xml_child(itemspantab, "span")
It could not gather any of these “span” tags:
> itemspan
{xml_nodeset (1)}
[1] <NA>
If we go a step further and locate the “span” tags directly, we still get nothing:
> itemspanl <- xml_find_all(itemspantab, '//tr[@class="itemLine"]/td/span')
> itemspan <- xml_child(itemspanl, "span")
> itemspan
{xml_nodeset (40)}
[1] <NA>
[2] <NA>
[3] <NA>
...
Someone suggested using library(purrr) for this, but as far as I can tell “purrr” only processes data frames, so the list produced by “xml2” could not be analyzed.
I would like to avoid a loop and get one record per four “<span>” tags. Can this be done? I hope someone with experience in “XML” and “xml2” can advise me on how to handle this non-standard table. Thanks a lot.
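One loop-free way to approach this can be sketched on a small inline HTML fragment rather than the live page. The fragment, its contents, and the column names (title, type, date, author) are assumptions made for illustration, and the sketch relies on every record having exactly four “<span>” tags.

```r
library(xml2)

# Made-up fragment imitating the structure described above
html <- '<table class="miscTable2">
  <tr class="itemLine"><td width="750">
    <span>Title A</span><span>[Article]</span><span>2015</span><span>Author A</span>
  </td></tr>
  <tr class="itemLine"><td width="750">
    <span>Title B</span><span>[Report]</span><span>2016</span><span>Author B</span>
  </td></tr>
</table>'

doc <- read_html(html)
cells <- xml_find_all(doc, "//table[@class='miscTable2']//tr[@class='itemLine']/td")

# "./span" is relative to each cell; a leading "//" would search the whole document
spans <- xml_find_all(cells, "./span")

# Fold the flat vector of span texts into one row per record (four spans each)
records <- matrix(xml_text(spans, trim = TRUE), ncol = 4, byrow = TRUE,
                  dimnames = list(NULL, c("title", "type", "date", "author")))
records <- as.data.frame(records, stringsAsFactors = FALSE)
print(records)
```

If some records have fewer than four spans, the matrix step no longer lines up, and the spans would have to be collected per cell instead.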
The objective is to change the current working directory within a for loop and do some other work in it, e.g. searching for files. The paths are stored in generically named variables.
The R code I am running for this is the following:
require("foreach")
# The following lines are generated by an external tool and stored in filePath.info
# Loaded via source("filePaths.info")
result1 <- '/home/user/folder1'
result2 <- '/home/user/folder2'
result3 <- '/home/user/folder3'
number_results <- 3
# So I know that I have all in all 3 folders with results by number_results
# and that the variable name that contains the path to the results is generic:
# string "result" plus 1:number_results.
# Now I want to switch to each result path and do some computation within each folder
start_dir <- getwd()
print(paste0("start_dir: ",start_dir))
# For every result folder switch into the directory of the folder
foreach(i=1:number_results) %do% {
# for (i in 1:number_results){ leads to the same output
# Assign path in variable, not the variable name as string: current_variable <- result1 (not string "result1")
current_variable <- eval(parse(text = paste0("result", i)))
print(paste0(current_variable, " in interation_", i))
# Set working directory to string in variable current_variable
current_dir <- setwd(current_variable)
print(paste0("current_dir: ",current_dir))
# DO SOME OTHER STUFF WITH FILES IN THE CURRENT FOLDER
}
# Switch back into original directory
current_dir <- setwd(start_dir)
print(paste0("end_dir: ",current_dir))
The output is the following ...
[1] "start_dir: /home/user"
[1] "/home/user/folder1 in interation_1"
[1] "current_dir: /home/user"
[1] "/home/user/folder2 in interation_2"
[1] "current_dir: /home/user/folder1"
[1] "/home/user/folder3 in interation_3"
[1] "current_dir: /home/user/folder2"
[1] "end_dir: /home/user/folder3"
... while I would have expected this:
[1] "start_dir: /home/user"
[1] "/home/user/folder1 in interation_1"
[1] "current_dir: /home/user/folder1"
[1] "/home/user/folder2 in interation_2"
[1] "current_dir: /home/user/folder2"
[1] "/home/user/folder3 in interation_3"
[1] "current_dir: /home/user/folder3"
[1] "end_dir: /home/user/"
So it turns out that the path assigned to current_dir is somewhat "behind" ...
Why is this the case?
As I am far from being an R expert, I have no idea what is causing this behaviour and, most importantly, how to get the desired behaviour.
So any help, hint, code correction/optimization would be highly appreciated!
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Platform: x86_64-pc-linux-gnu (64-bit)
From the ?setwd help page...
setwd returns the current directory before the change, invisibly and with the same conventions as getwd. It will give an error if it does not succeed (including if it is not implemented).
So when you do
current_dir <- setwd(current_variable)
print(paste0("current_dir: ",current_dir))
you are not getting the "current" directory; you are getting the previous one. You should use getwd() to get the current one:
setwd(current_variable)
current_dir <- getwd()
print(paste0("current_dir: ",current_dir))
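The behaviour described on the help page can be checked in isolation. The following is a minimal sketch that uses tempdir() as a stand-in target directory (an assumption for illustration; any existing directory would do) and restores the original directory at the end.

```r
start_dir <- getwd()
target <- tempdir()            # stand-in for one of the result folders

previous_dir <- setwd(target)  # setwd() returns the directory we just left
current_dir <- getwd()         # getwd() reports where we are now

# previous_dir equals start_dir, not the new location
print(identical(previous_dir, start_dir))

# normalizePath() guards against symlinked temp directories
print(normalizePath(current_dir) == normalizePath(target))

setwd(previous_dir)            # the returned value is handy for switching back
```

The value returned by setwd() is mainly useful for that last line: storing it lets you return to where you started without a separate getwd() call beforehand.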
I wrote the following R code that identifies duplicate files in a directory. How can one vectorize the for-loop using the plyr package (or similar)? I would like to achieve a more idiomatic R solution than the one I came up with.
library("digest") # to compute the MD5 digest
test_dir = "/Users/user/Dropbox/kaggle/r_projects/test_photo"
filelist <- dir(test_dir, pattern = "JPG|AVI", recursive=TRUE,
all.files =TRUE, full.names=TRUE)
fl = list() #create and empty list to hold md5's and filenames
for (itm in filelist) {
file_digest = digest(itm, file=TRUE, algo="md5")
fl[[file_digest]]= c(fl[[file_digest]],itm)
}
fl
The output is (using a small test directory):
> fl
$`5715b719723c5111b3a38a6ff8b7ca56`
[1] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3480 copy.JPG"
[2] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3480.JPG"
$`24fd4d7d252ca66c8d7a88b539c55112`
[1] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3481 copy.JPG"
[2] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3481.JPG"
[3] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3481.JPG"
$`2a1d668c874dc856b9df0fbf3f2e81ec`
[1] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3482 copy.JPG"
[2] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3482.JPG"
[3] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3482 copy.JPG"
[4] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3482.JPG"
I tried:
h=ldply(filelist, digest, file=TRUE, algo="md5")
h$filenames=filelist
but ended up with a unique row for every key value pair of (MD5, filename). I was not able to get the compact output desired.
(Background: As an exercise, I converted the python code presented by Raymond Hettinger in his PyCon AU 2011 keynote "What Makes Python Awesome". The slides are here: http://slidesha.re/WKkh9M . I was able to cut the LOC in half, but I think I can do better - and learn more - by vectorizing).
Here is a solution in base that is a little more concise:
md5s<-sapply(filelist,digest,file=TRUE,algo="md5")
split(filelist,md5s)
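To see the split() idea end to end, here is a small self-contained sketch. It swaps in tools::md5sum() from base R instead of digest() so that it runs without extra packages, and it works on made-up temporary files rather than the photo directory; the final Filter() step, which keeps only hashes shared by more than one file, is an addition for illustration.

```r
# Create a throwaway directory with two identical files and one different file
dir_path <- tempfile("dups_")
dir.create(dir_path)
writeLines("same content", file.path(dir_path, "a.txt"))
writeLines("same content", file.path(dir_path, "a_copy.txt"))
writeLines("different content", file.path(dir_path, "b.txt"))

files <- list.files(dir_path, full.names = TRUE)

# One MD5 hash per file, then group the file names by hash
md5s <- unname(tools::md5sum(files))
groups <- split(files, md5s)

# Keep only the groups that actually contain duplicates
dups <- Filter(function(x) length(x) > 1, groups)
print(lapply(dups, basename))
```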
Here's one answer. First get the md5 and file names on to a data.frame with ldply. Then, create the list you desire with dlply.
fl <- ldply(seq_along(filelist), function(idx)
c(digest(filelist[idx], file=TRUE, algo="md5"),
filelist[idx]))
fl <- dlply(fl, .(V1), function(x) x$V2)
I have a list of files and I'm trying to extract all layer1_*.grd files. Is there a way of doing this in one grep expression?
lof <- c("layer1_1.grd", "layer1_1.gri", "layer1_2.grd", "layer1_2.gri",
"layer1_3.grd", "layer1_3.gri", "layer1_4.grd", "layer1_4.gri",
"layer1_5.grd", "layer1_5.gri", "layer2_1.grd", "layer2_1.gri",
"layer2_2.grd", "layer2_2.gri", "layer2_3.grd", "layer2_3.gri",
"layer2_4.grd", "layer2_4.gri", "layer2_5.grd", "layer2_5.gri",
"layer3_1.grd", "layer3_1.gri", "layer3_2.grd", "layer3_2.gri",
"layer3_3.grd", "layer3_3.gri", "layer3_4.grd", "layer3_4.gri",
"layer3_5.grd", "layer3_5.gri", "layer4_1.grd", "layer4_1.gri",
"layer4_2.grd", "layer4_2.gri", "layer4_3.grd", "layer4_3.gri",
"layer4_4.grd", "layer4_4.gri", "layer4_5.grd", "layer4_5.gri")
I tried doing this in two steps:
list.of.files <- list.files(pattern = c("1_"))
list.of.files <- list.of.files[grep(".grd", list.of.files)]
Can someone enlighten me how to do this with grep in one step? I naively tried passing list() and c() to the grep but, as you can imagine, it doesn't work.
list.of.files <- list.files()
list.of.files <- list.of.files[grep(list("1_", ".grd"), list.of.files)]
This should work for you:
> lof[grep("layer1_.*\\.grd$", lof)]
[1] "layer1_1.grd" "layer1_2.grd" "layer1_3.grd" "layer1_4.grd" "layer1_5.grd"
Also, just to clarify your terminology: your list of files is not really a list; it's a character vector.
The stringr alternative is lof[str_detect(lof, "layer1_.*\\.grd$")].
In fact, in this case you can be even more specific about the characters between the prefix and the extension, so "layer1_[[:digit:]]\\.grd" would work as the pattern here, and might be faster if lof is very long.
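One caveat worth illustrating: in a regular expression an unescaped . matches any character, so a pattern such as layer1_.*.grd can match more than intended. A minimal sketch with a few made-up file names, anchoring the pattern and escaping the dot:

```r
lof <- c("layer1_1.grd", "layer1_1.gri", "layer1_10.grd",
         "layer2_1.grd", "layer1_1_grd.txt")

# Unescaped dot: also matches "layer1_1_grd.txt"
loose <- grep("layer1_.*.grd", lof, value = TRUE)
print(loose)

# Anchored, with an escaped dot and one or more digits
strict <- grep("^layer1_[0-9]+\\.grd$", lof, value = TRUE)
print(strict)

# list.files() takes the same kind of regular expression directly:
# list.files(pattern = "^layer1_[0-9]+\\.grd$")
```

Using value = TRUE returns the matching strings themselves, which saves the extra indexing step.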
The function below works perfectly for my purpose. The display is wonderful. Now my problem is I need to be able to do it again, many times, on other variables that fit other patterns.
In this example, I've output results for "q4a", I would like to be able to do it for sequences of questions that follow patterns like: q4 < a - z > or q < 4 - 10 >< a - z >, automagically.
Is there some way to iterate this such that the specified variable (in this case q4a) changes each time?
Here's my function:
require(reshape) # Using it for melt
require(foreign) # Using it for read.spss
d1 <- read.spss(...) ## Read in SPSS file
attach(d1,warn.conflicts=F) ## Attach SPSS data
q4a_08 <- d1[,grep("q4a_",colnames(d1))] ## Pull in everything matching q4a_X
q4a_08 <- melt(q4a_08) ## restructure data for post-hoc
detach(d1)
q4aaov <- aov(formula=value~variable,data=q4a) ## anova
Thanks in advance!
Not sure if this is what you are looking for, but to generate the list of questions:
> gsub('^', 'q', gsub(' ', '',
apply(expand.grid(1:10,letters),1,
function(r) paste(r, sep='', collapse='')
)))
[1] "q1a" "q2a" "q3a" "q4a" "q5a" "q6a" "q7a" "q8a" "q9a" "q10a"
[11] "q1b" "q2b" "q3b" "q4b" "q5b" "q6b" "q7b" "q8b" "q9b" "q10b"
[21] "q1c" "q2c" "q3c" "q4c" "q5c" "q6c" "q7c" "q8c" "q9c" "q10c"
[31] "q1d" "q2d" "q3d" "q4d" "q5d" "q6d" "q7d" "q8d" "q9d" "q10d"
[41] "q1e" "q2e" "q3e" "q4e" "q5e" "q6e" "q7e" "q8e" "q9e" "q10e"
[51] "q1f" "q2f" "q3f" "q4f" "q5f" "q6f" "q7f" "q8f" "q9f" "q10f"
[61] "q1g" "q2g" "q3g" "q4g" "q5g" "q6g" "q7g" "q8g" "q9g" "q10g"
[71] "q1h" "q2h" "q3h" "q4h" "q5h" "q6h" "q7h" "q8h" "q9h" "q10h"
[81] "q1i" "q2i" "q3i" "q4i" "q5i" "q6i" "q7i" "q8i" "q9i" "q10i"
[91] "q1j" "q2j" "q3j" "q4j" "q5j" "q6j" "q7j" "q8j" "q9j" "q10j"
...
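The same list of prefixes can probably be built a little more directly. This sketch keeps expand.grid() but pastes the columns together with paste0(), which avoids the padding spaces that apply() introduces and therefore the two gsub() calls; the column names num and letter are made up for illustration.

```r
# All combinations of question number and letter; the first column varies fastest
grid <- expand.grid(num = 1:10, letter = letters, stringsAsFactors = FALSE)

# paste0() is vectorised, so no apply() or gsub() is needed
prefixes <- paste0("q", grid$num, grid$letter)

print(head(prefixes, 12))
print(length(prefixes))
```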
And then you turn your inner part of the analysis into a function that takes the question prefix as a parameter:
analyzeQuestion <- function (prefix)
{
q <- d1[,grep(prefix,colnames(d1))] ## Pull in everything matching q4a_X
q <- melt(q) ## restructure data for post-hoc
qaaov <- aov(formula=value~variable,data=q4a) ## anova
return (LTukey(qaaov,which="",conf.level=0.95)) ## Tukey's post-hoc
}
Now, I'm not sure where your 'q4a' variable is coming from (as used in aov(..., data=q4a)), so I'm not sure what to do about that bit. But hopefully this helps.
To put the two together you can use sapply() to apply the analyzeQuestion function to each of the prefixes that we automagically generated.
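A minimal sketch of that last step, under several assumptions: the data frame d1 below is synthetic, base R's stack() and TukeyHSD() stand in for melt() and LTukey() so the example runs without extra packages, and the function takes the data as an explicit argument instead of relying on a global variable.

```r
set.seed(1)

# Synthetic stand-in for the SPSS data: three q4a items and two q4b items
d1 <- data.frame(
  q4a_1 = rnorm(20), q4a_2 = rnorm(20, mean = 1), q4a_3 = rnorm(20, mean = 2),
  q4b_1 = rnorm(20), q4b_2 = rnorm(20, mean = 1)
)

analyzeQuestion <- function(prefix, data) {
  # Columns such as q4a_1, q4a_2, ... for the given prefix
  cols <- grep(paste0("^", prefix, "_"), colnames(data), value = TRUE)
  if (length(cols) < 2) return(NULL)          # nothing to compare

  long <- stack(data[, cols, drop = FALSE])   # long format: columns values, ind
  fit <- aov(values ~ ind, data = long)       # one-way ANOVA across items
  TukeyHSD(fit, conf.level = 0.95)            # post-hoc comparisons
}

prefixes <- c("q4a", "q4b", "q4c")            # q4c has no columns in d1
results <- lapply(prefixes, analyzeQuestion, data = d1)
names(results) <- prefixes
results <- Filter(Negate(is.null), results)   # drop prefixes without data

print(names(results))
```

lapply() is used here rather than sapply() because each result is a model object, and a plain named list is the most predictable container for those.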
I would recommend melting the entire dataset and then splitting variable into its component pieces. Then you can more easily use subset to look at (e.g.) just question four: subset(molten, q == 4).
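As a rough sketch of that idea, assuming column names of the form q<number><letter>_<item> (for example q4a_1): the toy data frame is made up, base R's stack() stands in for melt(), and the new column names q, part and item are illustrative only.

```r
# Toy data with two questions
d1 <- data.frame(q4a_1 = 1:3, q4a_2 = 4:6, q5b_1 = 7:9)

# Long format: one row per (column, observation) pair
molten <- stack(d1)
names(molten) <- c("value", "variable")
molten$variable <- as.character(molten$variable)

# Split names such as "q4a_1" into question number, letter and item number
pattern <- "^q([0-9]+)([a-z])_([0-9]+)$"
molten$q    <- as.integer(sub(pattern, "\\1", molten$variable))
molten$part <- sub(pattern, "\\2", molten$variable)
molten$item <- as.integer(sub(pattern, "\\3", molten$variable))

# Note the double equals sign: q == 4 filters, whereas q = 4 would not
q4 <- subset(molten, q == 4)
print(q4)
```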