Organizing Messy Notepad data - r

I have some data in Notepad that is a mess. There is basically no space between the different columns that hold different data, but I know the character positions for each column.
For example, Columns 1-2 are X, Columns 7-10 are Y....
How can I organize this? Can it be done in R? What is the best way to do this?

?read.fwf may be a good bet for this circumstance.
Set the path to the file:
temp <- "\\pathto\\file.txt"
Then set the widths of the variables within the file, as demonstrated below.
# columns 1-2 = X, columns 3-10 = Y
widths <- c(2, 8)
Then set the names of the columns.
cols <- c("X","Y")
Finally, import the data into a new variable in your session:
dataset <- read.fwf(temp, widths, header = FALSE, col.names = cols)
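Putting it together as a runnable sketch (the two sample records are invented; the file is written out first so the example is self-contained):
lines <- c("01someval1", "02someval2")  # cols 1-2 = X, cols 3-10 = Y
temp <- tempfile(fileext = ".txt")
writeLines(lines, temp)
widths <- c(2, 8)
cols <- c("X", "Y")
dataset <- read.fwf(temp, widths, header = FALSE, col.names = cols)
dataset
#   X        Y
# 1  1 someval1
# 2  2 someval2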

Something I've done in the past to handle that kind of mess is to import it into Excel as fixed-width text, then save it as a CSV.
Just a suggestion for you: if it's a one-off project, that should be fine - no coding at all. But if it's a repeat offender, then you might look at regular expressions.
e.g. ^(.{6})(.{7})(.{2})(.{5})$ for four fields of widths 6, 7, 2, and 5, in order.
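In R, that pattern can turn each fixed-width line into delimited text with sub() and backreferences (the input line is invented to match the widths above):
line <- "aaaaaabbbbbbbccddddd"  # 6 + 7 + 2 + 5 characters
sub("^(.{6})(.{7})(.{2})(.{5})$", "\\1,\\2,\\3,\\4", line)
# [1] "aaaaaa,bbbbbbb,cc,ddddd"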

Related

How to get R to read my first column as a "header"?

I want to calculate diversity indices for different sampling sites in R. I have sites in the first row and the different species in the first column. However, R is reading the first column as normal data (not as a header, so to speak).
Pics:
https://imgur.com/a/iBsFtbe
Code:
Macro <- read.csv("C:\\Users\\Carly\\OneDrive\\Desktop\\Ecology Projects\\Macroinvertebrates & Water Quality\\Macro_RData\\Macroinvert\\MacroR\\MacroCSV.csv", header = TRUE)
You need to add row.names = 1 to your command. This will indicate that row names are stored in column number 1.
Macro <- read.csv("<...>/MacroCSV.csv", header = TRUE, row.names = 1)
I sense that you are frustrated. As r2evans said, it is easier for people to help you if you provide them with the data in text form and not with screenshots - because we can't recreate the problem or try to solve it by loading a screenshot into R.
CSV files are just text, so you can open them with a text editor such as Notepad and copy and paste the contents here. You don't need the whole file - just enough columns and lines to reproduce the problem. This is what we were looking for:
Site,Aeshnidae,Amnicolidae,Ancylidae,Asellidae
AN0119A,0,0,0,6,0
AN0143,0,0,0,0,0
Programming for many people is very frustrating when they start out, don't let this discourage you!
It looks like your data is in the wrong orientation for analysis in vegan - your species are the rows, and sites are columns. From your pics, it looks like you've spotted this issue and tried transposing, but are having issues with the placement of the headers.
Try reading your csv in, and specifying that the first column should be row names:
MacroDataDataFinal <- read.csv("Path/to/file.csv",
                               row.names = 1)
Then transpose the data:
MacroDataDataFinal_transposed <- t(MacroDataDataFinal)
Then try running the specaccum function:
library(vegan)
speccurve <- specaccum(comm = MacroDataDataFinal_transposed,
                       method = "random",
                       permutations = 1000)
Hopefully this will work. If you get any errors, please let us know the code you typed and the precise error message.
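For reference, here is a minimal self-contained sketch of the same flow, with sites as rows and species as columns (the counts and the third site name are invented; the other names follow the sample above):
library(vegan)
comm <- matrix(c(0, 2, 5, 1,
                 3, 0, 1, 0,
                 4, 1, 0, 2),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("AN0119A", "AN0143", "AN0150"),
                               c("Aeshnidae", "Amnicolidae", "Ancylidae", "Asellidae")))
speccurve <- specaccum(comm = comm, method = "random", permutations = 1000)
plot(speccurve)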

Referencing last used row in a data frame

I couldn't find the answer in any previously asked questions, but I believe this is an easy one.
I have the two lines of code below, which read data from Excel over a specific range (using readxl for this). The range itself only goes through row 2589 in the Excel document, but it updates dynamically (it's a time series), so to ensure I capture new observations (rows) as they're added, I've extended the read_excel range argument down to row 10000.
In the end, I'd like to run charts on this data, but a key part of this is identifying the last used row, without manually updating the code row for the latest date. I've tried using nrow but to no avail.
Raw_Index_History <- read_excel("RData.xlsx", range = "Returns!A6:P10000", col_names = TRUE)
Raw_Index_History <- Raw_Index_History[nrow(Raw_Index_History),]
Does anybody have any thoughts or advice? Thanks very much.
It would be easier to answer your question if you included an example.
Without knowing what your data looks like, answers are likely to be a bit vague.
Does your data contain NAs? If not, it should be straightforward to remove the empty rows with
na.omit(Raw_Index_History)
It appears you also have control over the Excel spreadsheet. So, in case your data does contain NAs, you could put some default value in the empty rows that gets overwritten as soon as a new data point is recorded. This lets you filter your data frame accordingly:
Raw_Index_History[!grepl("place_holder", Raw_Index_History$column_with_placeholder),]
If you expect data in the spreadsheet to grow, you can specify only the columns to include, instead of a defined boundary.
Something like this ...
Raw_Index_History <- read_excel("RData.xlsx",
                                sheet = 1,
                                range = cell_cols("A:P"),  # only columns, no rows
                                col_names = TRUE)
Every time you run the code, R will pull in the data from columns A:P up to the last populated row.
This is a more elegant approach for your use case. (Consider what you'd do when your data crosses 10000 rows in the future.)
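With the open-ended range, read_excel stops at the last populated row, so the latest observation can then be referenced without hard-coding a row number - a minimal sketch:
latest_obs <- Raw_Index_History[nrow(Raw_Index_History), ]  # last used row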

Exporting multiple R data frames to a single Excel sheet

I would like to export multiple data frames from R to a single Excel sheet. By using the following code:
write.xlsx(DF1, file = "C:\\Users\\Desktop\\filename.xlsx", sheetName = "sheet1",
           col.names = TRUE, row.names = TRUE, append = FALSE)
write.xlsx(DF2, file = "C:\\Users\\Desktop\\filename.xlsx", sheetName = "sheet2",
           col.names = TRUE, row.names = TRUE, append = TRUE)
I can export two data frames to a single Excel workbook, but only in two separate sheets. I would like to export them to a single sheet and, if possible, to control the specific cells these data frames are placed in.
Any suggestions more than welcome.
This is not a ready-to-use answer, but it should get you to your target. It would be a mess to write it into a comment.
1. Create the combined df with the tools of R
2. Write the df to Excel
A few notes on point 1:
vertically offset the second df from the first by using Reduce(rbind, c(list(mtcars), rep(list(NA), 3))), e.g. for a 3-row offset
rbind the column names onto your df: rbind(names(mtcars), mtcars)
use numbers as column names so you will not have a problem rbinding different dfs with different variables: names(mtcars) <- seq_along(mtcars)
On point 2:
Since your column names are numbers now, make sure you set col.names = FALSE when writing.
Hope this helps you get your desired output; a rough sketch follows.
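A rough, untested sketch of those steps, using mtcars twice as stand-in data frames (the helper name prep and the output file name are made up; write.xlsx is from the xlsx package, as in the question):
library(xlsx)
prep <- function(df) {
  df <- rbind(names(df), df)  # keep the original headers as a data row
  names(df) <- seq_along(df)  # numeric names so rbind() across dfs works
  df
}
# a 3-row vertical offset between the two frames
offset <- as.data.frame(matrix(NA, nrow = 3, ncol = ncol(mtcars)))
names(offset) <- seq_along(offset)
combined <- rbind(prep(mtcars), offset, prep(mtcars))
write.xlsx(combined, file = "stacked.xlsx", sheetName = "sheet1",
           col.names = FALSE, row.names = FALSE)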
Following most of your suggestions, I realized that using cbind.data.frame gives an output that is not optimal, but the amount of time I need to restructure the data in Excel is really insignificant. So I will proceed with this for the time being.
Thanks
I can't comment yet, so I'll provide my input here:
Using write.xlsx in R, how to write in a Specific Row or column in excel file
In that link it is suggested to organize your data in a single data frame and then write that into the Excel sheet. You should have a look at that.
As slackline suggested, this is quite easy if your columns or rows are the same, using the methods suggested there.
Edit: To add spaces in between, just insert empty columns before writing, e.g.:
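(A sketch, assuming DF1 and DF2 have the same number of rows; the blank column name is arbitrary.)
spacer <- data.frame(" " = rep(NA, nrow(DF1)), check.names = FALSE)
side_by_side <- cbind(DF1, spacer, DF2)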

Custom column design

I have a raw dataset and the columns are not clearly defined at all. When I go to import the data using read.table in R, it automatically tries to approximate where the columns begin and end, but it is not correct. I know the number of characters per variable, but I am not sure how to customize them as one would in Excel (=LEFT(x,3) or =MID(X,4,1), etc.). Some variables are separated by spaces, some aren't. It is not consistent.
FYI: The document was originally ".dat", then I saved the file as a ".R" file.
Here is an example of my data
Any help is much appreciated! Let me know
You can use read_fwf from the great readr package to specify the fixed widths per variable.
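A short sketch (the file name and column names are placeholders since the real layout isn't shown; widths 3 and 1 mirror the =LEFT/=MID example above):
library(readr)
dataset <- read_fwf("path/to/file.dat",
                    col_positions = fwf_widths(c(3, 1, 5),
                                               col_names = c("a", "b", "c")))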

Replacing Rows in a large data frame in R

I have to manually collect some rows, and based on the R Cookbook's recommendation, I pre-allocate memory for a large data frame. Say my code is
dataSize <- 500000
shoesRead <- read.csv(file = "someShoeCsv.csv", header = TRUE, sep = ",")
shoes <- data.frame(size = integer(dataSize), price = double(dataSize),
                    cost = double(dataSize), retail = double(dataSize))
So now I have some data about shoes imported via csv, and I perform some calculation and want to insert the results into the data frame shoes. Let's say someShoeCsv.csv has a column called ukSize, and so
usSize <- ukSize * 1.05 #for example
My question is how to do this. Running the code below (noting that I now have a usSize variable, transformed from the ukSize column read from the csv file):
shoes <- rbind(shoes,
               data.frame("size" = usSize, "price" = price,
                          "cost" = cost, "retail" = retail))
simply appends new rows to the already large data frame rather than filling the pre-allocated ones.
I have experimented with building a list and then rbinding, but I understand that is tedious, so I am trying this method instead, still to no avail.
I'm not quite sure what you're trying to do, but if you're trying to replace some of the pre-allocated rows with new data, you could do so like this:
Nreplace <- length(usSize)
shoes$size[1:Nreplace] <- usSize
shoes$price[1:Nreplace] <- shoesRead$price
And so on, for the rest of the columns.
Here's some unsolicited advice. Looking at the code you've included, you reference ukSize, price, etc. without referencing a data frame, which makes it look like you've done attach(shoesRead). Definitely never use attach(). If you want the price vector, for example, just do shoesRead$price. It's a little more typing for the sake of much more readable code.
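Putting that together without attach() - a sketch, where the column names come from the question's pre-allocated frame:
usSize <- shoesRead$ukSize * 1.05
n <- nrow(shoesRead)
shoes$size[1:n]   <- usSize
shoes$price[1:n]  <- shoesRead$price
shoes$cost[1:n]   <- shoesRead$cost
shoes$retail[1:n] <- shoesRead$retail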
