Use of wildcards with readtext() - r

A basic question. I have a bunch of transcripts (.docx files) I want to read into a corpus. I use readtext() to read in single files no problem.
dat <- readtext("~/ownCloud/NLP/interview_1.docx")
As soon as I put "*.docx" in my readtext statement it spits an error.
dat <- readtext("~/ownCloud/NLP/*.docx")
Error: '/var/folders/bl/61g7ngh55vs79cfhfhnstd4c0000gn/T//RtmpWD6KSx/readtext-aa71916b691c0cf3cabc73a2e04a45f7/word/document.xml' does not exist.
In addition: Warning message:
In utils::unzip(file, exdir = path) : error 1 in extracting from zip file
Why the reference to a zip file? I have only .docx files in the directory.

I was able to reproduce the problem. The cause is hidden/temporary .docx files in that folder; once you delete them, the code works. (A .docx file is a zip archive internally, which is why the error mentions unzip: a temporary file that matches *.docx is most likely not a valid archive.)
To see the hidden files, open the folder you are reading the .docx files from and use your OS's way of showing hidden files. On my Mac I used
CMD + SHIFT + .
Once you have deleted them, try the code again and it should work:
library(readtext)
dat <- readtext("~/ownCloud/NLP/*.docx")
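If deleting the files is not practical, another option is to build the file list yourself and leave the stray files out. This is only a sketch: list_real_docx is a hypothetical helper, and it assumes the hidden files are Word lock files whose names start with "~$" (or dot-files). It also assumes readtext() accepts a character vector of file paths.

```r
# Hypothetical helper: list the .docx files in a folder, skipping
# Word lock files ("~$name.docx") and dot-files.
list_real_docx <- function(dir) {
  files <- list.files(dir, pattern = "\\.docx$", full.names = TRUE,
                      ignore.case = TRUE)
  # Keep only files whose base name does not start with "~$" or "."
  keep <- !grepl("^(~\\$|\\.)", basename(files))
  files[keep]
}

# Usage (needs the readtext package; assumes a vector of paths is accepted):
# dat <- readtext::readtext(list_real_docx("~/ownCloud/NLP"))
```

Printing the result of list_real_docx() before reading is a quick way to confirm which files would be picked up.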

Related

How to read in file with dynamic name while avoiding hard-coding in R?

I run into issues reading in csv files with dynamic names while avoiding a hard-coded file path. I'd like short, tidy code (not hard-coded). If I hard-code the full path (everything the "~" stands for), the files read in fine. But when I soft-code the path (if that is the opposite of hard-coding), I get the error below, even though the error message shows the correct path. I have two variable parts of the file name that I paste into the file name before reading it in. If I avoid paste and just type a path per individual file, it also works.
#dynamic part I usually have in a loop with all the options.
part_a <- "outside" #other options here in my loop include "inside"
part_b <- "late" # other option "early" or "preterm"
#reading in the df
df <-read.csv(paste0("~/Data/FromR/clean_",part_a,part_b,"_2016.csv"),
check.names=FALSE, na.strings="null")
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'C:/Users/myname/Documents/Data/FromR/clean_outsidelate_2016.csv': No such file or directory
If I use getwd() as the first part of the paste in place of ~, as suggested here, it works, because it puts the string "C:/Users/myname/Documents/MyR_Projects/Specific_R_project/" at the beginning of the path. But how can I get it to work with "~"? With ~, the path stops at the "Documents" folder...
The desired outcome is to read in the file without error perform functions and repeat with other files. My loop works fine hardcoded, and I only wanted to make it more general or softcoded.
I just tried to read a file (testFile.txt) in my home directory from a different working directory, and it works fine with ~:
myFile <- "testFile"
mymy <- ".txt"
ciao <- read.delim(paste0("~/",myFile,mymy))
In PowerShell you can use %~% (have a look at this thread), but I am not sure how to expand $HOME in R.
#-------- edit
Have a look here and here. Basically any variable defined in your .Renviron should be accessible.
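To see why "~" and getwd() give different results, it can help to print what "~" expands to and to build the path from an explicit base directory. The sketch below assumes the data folder sits under the project directory returned by getwd(); clean_csv_path is a hypothetical helper, not part of any package. On Windows, "~" typically expands to the user's Documents folder, which would match the path shown in the error.

```r
# Show what "~" expands to in this R session.
home_dir <- path.expand("~")
message("'~' expands to: ", home_dir)

# Hypothetical helper: build the file path from the two variable parts.
clean_csv_path <- function(base_dir, part_a, part_b) {
  file.path(base_dir, paste0("clean_", part_a, part_b, "_2016.csv"))
}

# Assumption: Data/FromR lives under the current project directory.
path <- clean_csv_path(file.path(getwd(), "Data", "FromR"), "outside", "late")

# Only read the file if it is really there; otherwise report the full path.
if (file.exists(path)) {
  df <- read.csv(path, check.names = FALSE, na.strings = "null")
} else {
  message("Not found: ", path)
}
```

Checking file.exists() first turns the "cannot open the connection" error into a message that shows exactly which path was tried.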

reading gctx file in R

I am trying to read a gctx file extracted from the LINCS source for gene expression analysis. The code for reading the file is provided at the link below.
https://github.com/cmap/l1ktools.
I have sourced the provided script. However, when I call the function parse.gctx, it gives me the following error:
ds <- parse.gctx("../L1000 Data/zspc_n40172x22268.gctx")
reading ../L1000 Data/zspc_n40172x22268.gctx
Error in h5checktypeOrOpenLoc(file, readonly = TRUE) :
Error in h5checktypeOrOpenLoc(). Cannot open file. File 'C:\L1000 Data\zspc_n40172x22268.gctx' does not exist.
How can I resolve this issue and read my gctx file?
Since you're getting a 'file does not exist' error, I think the problem is because you have a space in the path to the file you're trying to read (specifically, in "L1000 Data"); if you remove the space in the path it should parse properly.
In other words, try renaming your "L1000 Data" folder so that instead of:
ds <- parse.gctx("../L1000 Data/zspc_n40172x22268.gctx")
you have something along the lines of:
ds <- parse.gctx("../L1000_Data/zspc_n40172x22268.gctx")
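Before or after renaming, it may also be worth confirming where the relative path actually points, since ".." is resolved against the current working directory (the error above shows it resolving to 'C:\L1000 Data\...'). A minimal sketch, where check_path is a hypothetical helper built only on base R:

```r
# Hypothetical helper: resolve a (possibly relative) path against the
# working directory and report whether the file exists.
check_path <- function(path) {
  resolved <- normalizePath(path, winslash = "/", mustWork = FALSE)
  list(resolved = resolved, exists = file.exists(path))
}

info <- check_path("../L1000 Data/zspc_n40172x22268.gctx")
message("Working directory: ", getwd())
message("Resolved to: ", info$resolved, " (exists: ", info$exists, ")")
```

If exists is FALSE, the working directory (or the folder name) is the thing to fix before calling parse.gctx.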

Open a dta file in R

I am trying to open, in R, a Stata .dta file that was compressed with WinRAR. Here is my code:
library(foreign)
setwd("C:/Users/ASUS/Desktop/Data on oil/Oil discovery")
data <- read.dta("oil_discovery")
and I get:
Error in read.dta("oil_discovery") : unable to open file: 'No such file or directory'
I think that my problem is coming from the assignment of my working directory but I don't know how to manage it.
You need to specify the full file name to read.dta. This includes the file ending. That is, instead of
data <- read.dta("oil_discovery")
you need to write
data <- read.dta("oil_discovery.dta")
If there is an additional problem with the compression, I would imagine that the error message will be different. However, Error in read.dta("oil_discovery") : unable to open file: 'No such file or directory' very explicitly points out that the current error is that the file oil_discovery is not found.
A good way to check if the name or path is causing the error is to use choose.files(). That is, run the following line:
data <- read.dta(choose.files())
This will open a pop-up window where you can manually select the file. If this works, then the name of the file was misspecified.
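A non-interactive way to do the same check is to list the .dta files R can see in the working directory and test the exact name. This is a sketch using base R only; list_dta is a hypothetical helper. Note that choose.files() is only available on Windows, while file.choose() works on all platforms.

```r
# Hypothetical helper: list the Stata files in a directory.
list_dta <- function(dir = ".") {
  list.files(dir, pattern = "\\.dta$", ignore.case = TRUE)
}

# Which .dta files are in the current working directory?
print(list_dta())

# Does the exact file name (including the extension) exist there?
print(file.exists("oil_discovery.dta"))
```

If the listing is empty, the file is probably still inside the archive or the working directory is not the folder you expect.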
library(haven)
data <- read_dta("**.dta")
View(data)

Check existence of file in archive (zip)

I'm using unz to extract data from a file within an archive. This actually works pretty well but unfortunately I've a lot of zip files and need to check the existence of a specific file within the archive. I could not manage to get a working solution with if exists or else.
Has anyone an idea how to perform a check if a file exists in an archive without extracting the whole archive before?
Example:
read.table(unz("D:/Data/Test.zip", "data.csv"), sep = ";")[-1,]
This works pretty well if data.csv exists but gives an error if the file is not available in the archive Test.zip.
Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
cannot locate file 'data.csv' in zip file 'D:/Data/Test.zip'
Any comments are welcome!
You could use unzip(file, list = TRUE)$Name to get the names of the files in the zip without having to unzip it. Then you can check to see if the files you need are in the list.
## character vector of all file names in the zip
fileNames <- unzip("D:/Data/Test.zip", list = TRUE)$Name
## check if any of those are 'data.csv' (or others)
check <- basename(fileNames) %in% "data.csv"
## extract only the matching files
if(any(check)) {
unzip("D:/Data/Test.zip", files = fileNames[check], junkpaths = TRUE)
}
You could probably put another if() statement to run unz() in cases where there is only one matched file name, since it's faster than running unzip() on a single file.
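The check and the read can be wrapped together so that a missing file yields NULL instead of an error. This is a sketch, not a tested recipe for every archive layout: has_entry and read_if_present are hypothetical helpers, and matching is done on the base name only, as in the code above.

```r
# Hypothetical helper: is `target` among the entry names of an archive?
has_entry <- function(entry_names, target) {
  any(basename(entry_names) == target)
}

# Hypothetical helper: read `target` from `zipfile` if it is present,
# otherwise return NULL. Extra arguments are passed on to read.table().
read_if_present <- function(zipfile, target, ...) {
  entries <- unzip(zipfile, list = TRUE)$Name
  if (!has_entry(entries, target)) {
    return(NULL)
  }
  # Use the first matching entry, including any folder inside the archive.
  hit <- entries[basename(entries) == target][1]
  read.table(unz(zipfile, hit), ...)
}

# Usage:
# dat <- read_if_present("D:/Data/Test.zip", "data.csv", sep = ";")
# if (is.null(dat)) message("data.csv is not in this archive")
```

Returning NULL makes it easy to skip archives in a loop without wrapping every call in tryCatch().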

read csv file from zipped temp file with multiple folders in R

I am trying to read a csv file contained in a zip file that I downloaded from the web. The problem is that the zipped file has multiple nested folders. I have to do this for several different units, so I am using a loop. There is no problem with the loop: the file name is correct and the file downloads. However, I get an error message (I think because R cannot find the exact file I am asking for). The error is:
Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
cannot locate file 'XXXX.csv' in zip file 'c:\yyy\temp\bla\'
download.file(paste("http://web.com_", units[i], "_", places[j], ".zip",
                    sep = ""),
              temp,
              cacheOK = FALSE)
data <- read.csv2(unz(temp,
                      paste("name_", units[i], "_", places[j], ".csv",
                            sep = "")),
                  header = FALSE,
                  skip = 1)
unlink(temp)
fili <- rbind(X,
              data)
}
How do I make R find the file I want?
You have the right approach but (as the warning tells you) the wrong filename.
It's worth double checking that the zip file does exist before you start trying to read its contents.
if(file.exists(temp))
{
read.csv2(unz(...))
} else
{
stop("ZIP file has not been downloaded to the place you expected.")
}
It's also a good idea to browse around inside the downloaded file (you may wish to unzip it first) to make sure that you are looking in the right place for the CSV contents.
It looks like the file you want to read is located in a directory inside the archive. In that case, include the directory in the name you pass to unz() (replace dirname with the actual folder name):
data <- read.csv2(unz(temp,
paste("dirname/name_",units[i],"_",places[j],".csv",
sep="")),
header=F,
skip=1)
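If the folder name inside the archive is not known in advance, the internal path can be looked up from the archive listing rather than typed by hand. A minimal sketch, assuming the CSV's base name is unique within the archive; find_entry is a hypothetical helper.

```r
# Hypothetical helper: return the full internal path(s) of the entries
# whose base name equals `target`.
find_entry <- function(entry_names, target) {
  entry_names[basename(entry_names) == target]
}

# Usage inside the loop (temp, units, places as in the question):
# entries <- unzip(temp, list = TRUE)$Name
# inner   <- find_entry(entries, paste0("name_", units[i], "_", places[j], ".csv"))
# if (length(inner) > 0) {
#   data <- read.csv2(unz(temp, inner[1]), header = FALSE, skip = 1)
# }
```

The value returned by find_entry() already contains the nested folders, so it can be passed to unz() unchanged.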
