How to import messy data in R?


How can I import this data into R? It looks so messy. I don't know whether I have to clean it first and then import it, and I don't know what to do. The first line contains the column names.
https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/

It is not messy; it is actually very clean. The file is a comma-separated values (CSV) file, although the delimiter is a semicolon rather than a comma. You can use read.delim for this:
df <- read.delim("winequality-red.csv", sep = ";")
Make sure that the file is stored in the working directory. You can check the working directory with getwd() and change it with setwd().
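To see what the semicolon separator does without downloading anything, here is a minimal sketch. The three-column sample text is only an illustration in the same layout as the file (quoted header names with spaces, ";" as delimiter, "." as decimal mark). It is not the full data set, and the object names are my own.

```r
# Illustrative sample only: same layout as the wine file, far fewer columns.
sample_text <- '"fixed acidity";"pH";"quality"
7.4;3.51;5
7.8;3.2;5'

# text = is passed through to read.table(); header = TRUE is the default
# for read.delim(), so the first line becomes the column names.
wine <- read.delim(text = sample_text, sep = ";")

# check.names = TRUE (the default) turns "fixed acidity" into fixed.acidity.
names(wine)
str(wine)
```

For the real file, the same call works with the file name in place of text =.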

Related

How do I get my variables to be recognized?

I am creating a data frame from a CSV file. However, when I run the code, it does not recognize all of the columns in the file. It recognizes some of them, but not all.
smallsample <- data.frame(read.csv("SmallSample.csv",header = TRUE),smallsample$age,smallsample$income,smallsample$gender,smallsample$marital,smallsample$numkids,smallsample$risk)
smallsample
It won't recognize marital or numkids, even though those are column names in the .csv file.

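When you use read.csv, the output is already a data frame, so there is no need to wrap it in data.frame(). The extra arguments such as smallsample$marital are also evaluated before smallsample has been assigned, so they can only refer to an older object with that name, if one exists.
You can simply use smallsample <- read.csv("SmallSample.csv")
Result using a dummy CSV file: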
  age income gender marital numkids      risk
1  32  34932 Female  Single       1 0.9611315
2  22  50535   Male  Single       0 0.7257541
3  40  42358   Male  Single       1 0.6879534
4  40  54648   Male  Single       3  0.568068
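As a small sketch of the point above, the snippet below reads two rows of the dummy data from an inline string (standing in for SmallSample.csv, which I don't have) and shows that every column is available on the result without any extra data.frame() call.

```r
# Inline stand-in for SmallSample.csv, using two rows of the dummy data.
csv_text <- "age,income,gender,marital,numkids,risk
32,34932,Female,Single,1,0.9611315
22,50535,Male,Single,0,0.7257541"

# read.csv() already returns a data frame; header = TRUE is its default.
smallsample <- read.csv(text = csv_text)

names(smallsample)                      # all six column names
smallsample$marital                     # a single column
smallsample[, c("age", "numkids")]      # a subset of columns
```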

R import .data file extension

Hi, I'm trying to import data from the URL https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data but it always imports it as a single line. I set the separator to "\t", but it is still not working. My R code:
bostonHousing <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data",
col.names= c("CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV"),
dec=",",sep = "\t")

The file isn't tab-separated, it's whitespace-separated. By default, read.table assumes columns are separated by one or more whitespace characters (tab or space). Specifying tab-delimiters (or using read.delim()) is only really necessary when columns are tab-delimited and the data columns may contain embedded spaces ...
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
bostonHousing <- read.table(url)
This seems to work fine. (dec="," is also a bad idea, because the file uses . as the decimal mark.)
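To show the difference between the two separators offline, here is a minimal sketch. The two rows are illustrative values laid out like the file (columns padded with spaces), and the column names are the ones from the question. With the default separator the rows split into 14 columns; with sep = "\t" each row stays in one field, because the text contains no tabs.

```r
# Illustrative rows in the same space-padded layout as housing.data.
sample_text <- " 0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30 396.90   4.98  24.00
 0.02731   0.00   7.070  0  0.4690  6.4210  78.90  4.9671   2  242.0  17.80 396.90   9.14  21.60"

housing_cols <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS",
                  "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")

# Default sep = "" means "one or more spaces or tabs".
bostonHousing <- read.table(text = sample_text, col.names = housing_cols)

# Same text with sep = "\t": nothing to split on, so one column per row.
one_col <- read.table(text = sample_text, sep = "\t")

dim(bostonHousing)
dim(one_col)
```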

Loading csv into R with `sep=,` as the first line

The program I am exporting my data from (PowerBI) saves the data as a .csv file, but the first line of the file is sep=, and then the second line of the file has the header (column names).
Sample fake .csv file:
sep=,
Initiative,Actual to Estimate (revised),Hours Logged,Revised Estimate,InitiativeType,Client
FakeInitiative1 ,35 %,320.08,911,Platform,FakeClient1
FakeInitiative2,40 %,161.50,400,Platform,FakeClient2
I'm using this command to read the file:
initData <- read.csv("initData.csv",
row.names=NULL,
header=T,
stringsAsFactors = F)
but I keep getting an error that there are the wrong number of columns (because it thinks the first line tells it the number of columns).
If I do header=F instead then it loads, but then when I do names(initData) <- initData[2,] then the names have spaces and illegal characters and it breaks the rest of my program. Obnoxious.
Does anyone know how to tell R to ignore that first line? I can go into the .csv file in a text editor and just delete the first line manually before I load it each time (if I do that, everything works fine) but I have to export a bunch of files and this is a bit stupid and tedious.
Any help would be much appreciated.

There are many ways to do that. Here's one:
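# Read the file as plain lines and drop the first one ("sep=,")
all_content <- readLines("initData.csv")
skip_first_line <- all_content[-1]
# Parse the remaining lines as CSV; the header is now the first line
initData <- read.csv(textConnection(skip_first_line),
                     row.names = NULL,
                     header = TRUE,
                     stringsAsFactors = FALSE)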
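Another option is the skip argument of read.csv, which tells it to ignore a given number of lines before it starts reading. A minimal sketch, writing the sample from the question to a temporary file (the temporary file only stands in for the real export):

```r
# Recreate the sample file from the question in a temporary location.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "sep=,",
  "Initiative,Actual to Estimate (revised),Hours Logged,Revised Estimate,InitiativeType,Client",
  "FakeInitiative1 ,35 %,320.08,911,Platform,FakeClient1",
  "FakeInitiative2,40 %,161.50,400,Platform,FakeClient2"
), tmp)

# skip = 1 drops the "sep=," line; the next line is then used as the header.
initData <- read.csv(tmp, skip = 1, header = TRUE, stringsAsFactors = FALSE)

# Because the header is parsed as a header, check.names = TRUE (the default)
# converts the names into syntactically valid R names.
names(initData)
```

This avoids reading the file twice and keeps the column-name cleanup that header = TRUE gives you.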

Your file could be in a UTF-16 encoding. See hrbrmstr's answer in how to read a UTF-16 file:

How to read a csv file or load an excel workbook by ignoring some characters in the file path?

I'm writing a loop script which involves reading a file from a workbook (using the package XLConnect). The challenge is that the file names contain characters (representing time) that I want to ignore.
For example, here are 3 paths to those files:
G://User//Documents//daily_data//Op_Schedule_20160520_132025.xlsx
G://User//Documents//daily_data//Op_Schedule_20160521_142805.xlsx
G://User//Documents//daily_data//Op_Schedule_20160522_103052.xlsx
I need to import hundreds of those files. I can easily account for the character string representing the date (e.g. 20160522), but not the time.
Is there a way to tell R to ignore some characters located in the file path? Here is how I was thinking of writing my script (the "???" is where I need help). I know a loop is probably not the most efficient way, but I'm open to suggestions, should you have any:
require(XLConnect)
path= "G://User//Documents//daily_data//Op_Schedule_"
wd.seq = format(seq(as.Date("2014-01-01"),as.Date("2016-12-31"),"days"),format="%Y%m%d")
scheduleList = rep(list(matrix(1,1,1)),length(wd.seq))
for(i in 1:length(wd.seq)) {
wb = loadWorkbook(file= paste0(path,wd.seq[i],"???",".xlsx"))
scheduleList[[i]] = readWorksheet(wb,sheet='=SCHEDULE', header = TRUE)
}
Thanks for reading and suggestions, if any.
Mathieu

I don't know if this is helpful, but if you want to read all the files in a certain directory (which it seems to me is what you're after), you can read all the filenames into a list using the list.files() function, for example
fileList <- list.files("G://User//Documents//daily_data//")
And then load the xlsx files looping through the list with a for loop
for(i in fileList) {
loadWorkbook(file = i)
}
I haven't used the XLConnect package before, so that exact code probably doesn't work, but the loop will iterate through all the files in that directory, so you can construct your loading call using the i variable for the file name. It won't be an absolute path though, so you might need to use paste to add the first part of the file path.
I realize there might be other files in the directory that are not Excel files. You could use grepl to select only files containing "Op_Schedule_":
fileListClean <- fileList[grepl("Op_Schedule_",fileList)]
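or perhaps select only the .xlsx files in the directory (the dot is escaped and the pattern anchored to the end, so that it matches the extension only):
fileListClean <- fileList[grepl("\\.xlsx$",fileList)]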
Edit to fit your reply:
Since you need to fit it to a sequence, you can do it as you did earlier:
wd.seq = format(seq(as.Date("2014-01-01"),as.Date("2016-12-31"),"days"),format="%Y%m%d")
wd.seq2 <- paste("Op_Schedule_", wd.seq, sep = "")
And then use grepl to pick only the files whose names contain one of those prefixes:
fileListClean <- fileList[grepl(paste(wd.seq2, collapse = "|"), fileList)]
Full disclosure: I got the last part from this SO answer: grep using a character vector with multiple patterns
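Putting the pieces together, here is a minimal sketch that does not need to know the time part at all. It lists the files with a pattern, pulls the date out of each name and keeps those inside the date range. The temporary directory and the empty files are stand-ins for the real folder, and the XLConnect call is only shown as a comment because I have not run it.

```r
# Stand-in for the real folder: a temp directory with empty files.
demo_dir <- tempfile("daily_data_")
dir.create(demo_dir)
file.create(file.path(demo_dir, c(
  "Op_Schedule_20160520_132025.xlsx",
  "Op_Schedule_20160521_142805.xlsx",
  "Op_Schedule_20170101_090000.xlsx",  # outside the wanted date range
  "notes.txt"                          # not a schedule file
)))

# full.names = TRUE returns paths that can be passed on directly.
files <- list.files(demo_dir,
                    pattern = "^Op_Schedule_[0-9]{8}_[0-9]{6}\\.xlsx$",
                    full.names = TRUE)

# Extract the 8-digit date from each file name; the time part is ignored.
file_dates <- as.Date(sub("^Op_Schedule_([0-9]{8})_.*$", "\\1", basename(files)),
                      format = "%Y%m%d")

in_range <- file_dates >= as.Date("2014-01-01") & file_dates <= as.Date("2016-12-31")
wanted <- files[in_range]
basename(wanted)

# Each path in `wanted` could then be loaded, for example:
# for (f in wanted) wb <- loadWorkbook(f)
```

Comparing dates this way avoids building one very long regular expression out of every day in the range.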

How to import multiple matlab files into R (Using package R.Matlab)

Thank you in advance for your help. I am using R to analyse some data that is initially created in Matlab. I am using the package "R.matlab" and it is fantastic for 1 file, but I am struggling to import multiple files.
The working script for a single file is as follows...
install.packages("R.matlab")
library(R.matlab)
x<-("folder_of_files")
path <- system.file("/home/ashley/Desktop/Save/2D Stream", package="R.matlab")
pathname <- file.path(x, "Test0000.mat")
data1 <- readMat(pathname)
And this works fantastically. The format of my files is 'Name_0000.mat', where the name is constant between files and the 4 digits increase, but not necessarily by 1.
My attempt to load multiple files at once was along these lines...
for (i in 1:length(temp))
data1<-list()
{data1[[i]] <- readMat((get(paste(temp[i]))))}
And also in multiple other ways that included and excluded path and pathname from the loop, all of which give me the same error:
Error in get(paste(temp[i])) :
object 'Test0825.mat' not found
Where 0825 is my final file name. If you change the length of the loop it is always just the name of the final one.
I think the issue is that when it pastes the name it looks for that object, which does not exist yet, so I need to have the pasted text in speech marks, yet I don't know how to do that.
Sorry this was such a long post....Many thanks
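A possible approach, offered as a sketch rather than a tested solution: get() looks up an R object by name, whereas readMat() wants a file path as a string, so get() can simply be dropped. Also, in the loop above the for statement only repeats data1<-list(), and the block in braces runs once afterwards with the last value of i, which would explain why the error always names the final file. The folder name and the "Test" prefix below are taken from the question; the pattern is my assumption about the file names.

```r
# Folder from the question; replace with the directory holding the .mat files.
folder <- "folder_of_files"

# Assumed naming scheme: "Test" followed by four digits and ".mat".
files <- list.files(folder, pattern = "^Test[0-9]{4}\\.mat$", full.names = TRUE)

# Create the result list once, before the loop, and name it by file.
data1 <- vector("list", length(files))
names(data1) <- basename(files)

# Only attempt the import if R.matlab is installed.
if (requireNamespace("R.matlab", quietly = TRUE)) {
  for (i in seq_along(files)) {
    # Pass the path string directly; no get() and no extra quoting needed.
    data1[[i]] <- R.matlab::readMat(files[i])
  }
}
```

Because the files are found by pattern, the digits do not have to increase by 1.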
