How to use grep with fread in RStudio? - r

I have a large CSV file (~29 GB) that I want to import into RStudio. Because it is so big, I would like to import only specific rows at a time rather than import the whole thing and then filter. I have been trying to use grep in the cmd = argument of fread(); however, I keep getting the message:
'grep' is not a recognized internal or external command, operable program, or batch file.
I have installed Cygwin (http://cygwin.com) and have even tried manually adding its bin path (C:\cygwin64\bin) to the Windows environment variables, but I still keep getting the error message.
As such, I suppose I have two questions:
How do I get grep to work when using fread to import a CSV into RStudio?
Once grep is working, what syntax would I use to import only the rows that contain a certain string?
library(rgbif)
library(data.table)
# Download the csv (note: downloads a zip file ~7 GB, unzipped ~29 GB)
occ_download_get("0299151-200613084148143") # add the argument path = if you want to direct the download
# I now only want to import the rows that contain "Acanthiza pusilla", for example.
fread() # it's here that I am unsure of the correct syntax
Any help would be appreciated. Additionally, let me know if you require more information.
Thanks

fread(cmd = 'findstr /r Acanthiza.pusilla "[your file location]"')
ex) fread(cmd = 'findstr /r Acanthiza.pusilla "C:\\Users\\abc\\Documents\\test.csv"')
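If you want grep itself rather than findstr: a shell started from R only sees the PATH that RStudio was launched with, so restart RStudio after adding C:\cygwin64\bin to the Windows environment. Below is a minimal sketch, using the example file path from the answer above; it also recovers the header row, which the filter would otherwise drop:
Sys.which("grep") # "" means grep is still not on the PATH
# If needed, prepend Cygwin's bin directory for this R session only:
Sys.setenv(PATH = paste("C:\\cygwin64\\bin", Sys.getenv("PATH"), sep = ";"))
library(data.table)
f <- "C:/Users/abc/Documents/test.csv"
hdr <- names(fread(f, nrows = 0)) # grep drops the header line, so read it separately
dt <- fread(cmd = paste0('grep "Acanthiza pusilla" "', f, '"'),
            header = FALSE, col.names = hdr)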

Related

Error when trying to copy xlsx file path into R

I am trying to import an xlsx file into R, but I keep getting this error when copying the file path. The file path is "C:\Users\aj\Downloads", so I have been writing:
library(openxlsx)
Data <- read.xlsx(paste0(C:\Users\aj\Downloads, '\Data.xlsx'))
I tried switching to "/" from "\" and got a different error that stated unexpected input. Does anybody know what I am missing, or if I am even on the right track?
As a work-around: you could use the Import Dataset button in the Environment pane. It will import the dataset for you and also show you the correct command.
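For the direct fix: the path must be a quoted string, and single backslashes must be doubled or replaced with forward slashes. A sketch of the corrected call, using the file name from the question:
library(openxlsx)
# Quote the path and use forward slashes (or "\\") instead of "\":
Data <- read.xlsx(paste0("C:/Users/aj/Downloads", "/Data.xlsx"))
# Equivalent, letting file.path() join the pieces:
Data <- read.xlsx(file.path("C:/Users/aj/Downloads", "Data.xlsx"))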

How do I import data via code typed into an R notebook (instead of using the Import option in the menu bar)?

Every time I type the file name (in this case "labelled edited.xlsx", copied exactly from the import box when using the import function from the menu) into an R notebook and try to run it, it says 'Error: path does not exist'. However, using the import menu works. If I copy and paste the exact same thing from the import box:
labellededited <- read_excel("labelled edited.xlsx", col_names = TRUE, .name_repair="minimal")
into the notebook and run it immediately, it works perfectly. However, when I close R, open it again, and set the working directory (without changing a single thing in the directory folder, so the file names are the same), it returns the error even though absolutely nothing has changed - I just restarted R.
In addition, copying the code from the notebook into the import box at the bottom right imports the dataset perfectly, as does copying the line of code into the console. The error only happens when I press Cmd+Enter directly from the notebook.
Any tips on fixing this? I know it's not a big deal, but ideally I'd like to write the code, set the directory, and then just let it run.
The problem has to do with RStudio and file types. To use the keyboard shortcut (Ctrl+Enter), the commands have to be saved in an R script file. So start a new one (Ctrl+Shift+N), copy the commands over from the .Rmd file, and try again.
Hi, you can use this, I guess.
Set the working directory using setwd("your Path/"), then:
library(readxl)
If you want to import an xlsx file use read_xlsx; if you want to import an xls file use read_xls.
labellededited <- read_xlsx("labelled edited.xlsx", sheet = 1) # sheet defaults to the first one; pass a number or name to pick another
An even better way is to keep the path inside the code and import the file directly (if you don't move the file, it will import without any error):
labellededited <- read_xlsx("yourpath/labelled edited.xlsx", sheet = 1)
Hope it helps

PDF File Import R

I have multiple .pdf files (stored in a local folder) that contain text. I would like to import the .pdf files (i.e., the texts) into R. I applied the function read_dir (R package textreadr):
library(textreadr)
Data <- read_dir("<MY PATH>")
The function works well. But for several files that include special characters (i.e., letters such as 'ć') in their names (e.g., 'filenameć.pdf'), the function did not work (error message: 'The following files failed to read in and were removed:' …).
What can I do?
I tried to rename the files via R, which did not work (probably for the same reason). That might be a workaround.
I did not want to rename the files manually :)
Follow-Up (only for experts):
For several files, I got one of the following error messages (and I have no idea why):
PDF error: Mismatch between font type and embedded font file
or
PDF error: Couldn't find trailer dictionary
Any suggestions or hints how to solve this issue?
Likely the issue concerns the encoding of the file names. If you absolutely want to use R to rename the files for you, the function you want is iconv: determine the encoding of the file names and then convert them to UTF-8.
However, a much better approach would be to rename them using bash from the command line. Can you provide a more complete set of examples?
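A minimal sketch of the R route; the source encoding here is an assumption (latin2 covers 'ć'), so inspect a few names first and adjust from = accordingly. The directory is hypothetical:
old <- list.files("my/pdf/folder", pattern = "\\.pdf$", full.names = TRUE)
new <- iconv(old, from = "latin2", to = "UTF-8") # returns NA if conversion fails
ok <- !is.na(new) & new != old
file.rename(old[ok], new[ok])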

Importing .csv files with Sys.Date()

I have a .csv dataset that gets dumped every day, which I use to generate a daily list for tracking participants with an R script. I would like to automate this R script; however, to do so, I need to read in the .csv using Sys.Date().
The .csv dataset is named DumpedList_2013-11-27 (the date will always be today's date).
I would like to import this into the script, like I would for a .Rdata file:
load(paste('/srv/Data/Baseline2/baseline2_', Sys.Date(), '.Rdata',sep=''))
What is the equivalent of the command above for reading in .csv files?
I have tried the load and read.csv commands, but get error messages:
data=read.csv('P:/DirectoryPath/DumpedList_',Sys.Date(),'.csv')
I also attempted to create todaydate=Sys.Date() and then use it to load the data, but got error messages again: a=load(paste("P:/DirectoryPath/DumpedList_",todaydate,".csv"))
Any insight?
By default paste separates its arguments with spaces; use paste0 to join strings together seamlessly:
read.csv(paste0('P:/DirectoryPath/DumpedList_',Sys.Date(),'.csv'))
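For the daily automation, it may help to build the path once and check that today's dump exists before reading. A sketch, keeping the asker's placeholder directory:
path <- paste0('P:/DirectoryPath/DumpedList_', Sys.Date(), '.csv')
if (file.exists(path)) {
  data <- read.csv(path)
} else {
  stop("No dump found for today: ", path)
}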

Speed up read.dbf in R (problems with importing large dbf file)

I have a dataset given in .dbf format and need to import it into R.
I haven't worked with this extension previously, so I have no idea how to export a dbf file with multiple tables into a different format.
A simple read.dbf has been running for hours with no result.
I tried looking into speeding up R, but I'm not sure that's the issue; I think the problem is reading the large dbf file itself (it weighs ~1.5 GB), i.e., the command must simply be inefficient. However, I don't know any other way to deal with this dataset format.
Is there any other option to import the dbf file?
P.S. (NOT AN R ISSUE) The source of the dbf file uses Visual FoxPro but can't export it to another format. I've installed FoxPro, but since I've never used it before, I don't know how to export the data the right way. I tried a simple "Export to type=XLS" command, but then comes a problem with encoding, as most variables are in Russian Cyrillic and can't be decoded by Excel. In addition, the dbf file contains multiple tables that should be merged into one big table, but I don't know how to export those tables separately to xls, how to export them as a whole into xls or csv, or how to merge them together, as I'm absolutely new to dbf files (though I have already looked through basic descriptions).
Any help will be highly appreciated. I'm not sure whether I can provide a sample dataset, as there are many columns when I look at the dbf in FoxPro, and those columns must be merged with other tables from the same dbf file, which I have no idea how to do. (Sorry for the mess.)
You can export from Visual FoxPro in many formats using the COPY TO command via the Command Window, as per the VFP help file.
For example:
use mydbf in 0
select mydbf
copy to myfile.xls type xl5
copy to myfile.csv type delimited
If you're having language-related issues, you can add an 'as codepage' clause to the end of those. For example:
copy to myfile.csv type delimited as codepage 1251
If you are not familiar with VFP I would try to get the raw data out like that, and into a platform that you are familiar with, before attempting merges etc.
To export them in a loop you could use the following in a .PRG file (amending the two path variables at the top to reflect your own setup).
Close All
Clear All
Clear

lcDBFDir = "c:\temp\" && -- Where the DBF files are.
lcOutDir = "c:\temp\export\" && -- Where you want your exported files to go.
lcDBFDir = Addbs(lcDBFDir) && -- In case you forgot the backslash.
lcOutDir = Addbs(lcOutDir)

* -- Get the filenames into an array.
lnFiles = ADir(laFiles, lcDBFDir + "*.DBF")

* -- Process them.
For x = 1 to lnFiles
    lcThisDBF = lcDBFDir + laFiles[x, 1]
    Use (lcThisDBF) In 0 Alias currentfile
    Select currentfile
    Copy To (lcOutDir + Juststem(lcThisDBF) + ".csv") Type csv
    Use In Select("currentfile") && -- Close it.
EndFor
Close All
... and run it from the Command Window - Do myprg.prg or whatever.
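Back in R, once the CSVs are exported, data.table::fread is typically far faster than read.dbf on files this size. A sketch that stacks the exports into one table (the paths match the .PRG above; the CP1251 re-encoding is an assumption based on the Cyrillic text mentioned in the question, and a keyed join may be needed instead of stacking, depending on how the tables relate):
library(data.table)
# Read every exported CSV and stack them into one big table:
files <- list.files("c:/temp/export", pattern = "\\.csv$", full.names = TRUE)
big <- rbindlist(lapply(files, fread), fill = TRUE)
# Re-encode Cyrillic text exported under codepage 1251, if needed:
char_cols <- names(big)[vapply(big, is.character, logical(1))]
for (col in char_cols) {
  set(big, j = col, value = iconv(big[[col]], from = "CP1251", to = "UTF-8"))
}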
