Convert a short wide data frame to a long narrative free text report - r

I have imported data relating to about 70 human subjects from three data tables and have merged them into one data frame in R. Some of the 100 fields are straightforward, such as date.birth, number.surgeries.lifetime and number.surgeries.12months. Other fields, such as "comments", may contain no value, one sentence, or even several sentences.
Some human subjects have an anomaly, meaning that something is missing or not quite right, and for those people I have to manually investigate what's up. When I open the data frame directly, or even as a table in fix(), it is difficult to read: I have to scroll from left to right, and then I have to expand some columns by a ridiculous amount just to read one comment.
It would be much better if I could subset the 5 patients I need to explore and report their data as free-flowing text. I thought I could do that by exporting to a CSV, but then it's difficult to see which fields are which. For example: 2001-01-05, 12, 4, had testing done while still living in Los Angeles. That was easy; imagine what happens when there are 100 fields, many of them numbers, many of them dates, and several different comment fields.
A better way would be to output a report such as this:
date.birth:2001-01-05, number.surgeries.lifetime:12, number.surgeries.12months:4, comments:will come talk to us on Monday
Each one of the 5 records would follow that format.
field name 1:field 1 value record 1, field name 2:field 2 value record 1...
skip a line (or something easy to see)
field name 1:field 1 value record 2, field name 2:field 2 value record 2
How can I do it?

How about this?
set.seed(1)

# make up some example data for ten patients
age <- abs(rnorm(10, 40, 20))
patient.key <- 101:110
date.birth <- as.Date("2011-02-02") - age * 365
number.surgeries.12months <- rnbinom(10, 1, .5)
number.surgeries.lifetime <- trunc(number.surgeries.12months * (1 + age / 10))
comments <- "comments text here"

data <- data.frame(patient.key,
                   date.birth,
                   number.surgeries.12months,
                   number.surgeries.lifetime,
                   comments)
Subset the data by the patients and fields you are interested in:
selected.patients <- c(105, 109)
selected.fields <- c("patient.key", "number.surgeries.lifetime", "comments")
subdata <- subset(data[ , selected.fields], patient.key %in% selected.patients)
Format the result for printing.
# paste the column name next to each data field
# (use the column names of subdata, not data, so names line up with the selected fields)
taggeddata <- apply(subdata, 1,
                    function(row) paste(colnames(subdata), row, sep = ":"))

# paste all the data fields into one line of text per record
textdata <- apply(taggeddata, 2,
                  function(rec) do.call("paste", as.list(rec)))
# write to a file or to screen
writeLines(textdata)
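To get the comma-separated "field:value" layout with a blank line between records, as described in the question, roughly the same idea can be written in a single apply() call over subdata:
# one line of text per record: "field:value" pairs separated by commas
textdata <- apply(subdata, 1, function(row)
  paste(names(subdata), row, sep = ":", collapse = ", "))

# sep = "\n\n" leaves a blank line after each record
writeLines(textdata, sep = "\n\n")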

Though I risk repeating myself, I'll make yet another case for the RMySQL package. You will be able to edit your database with your favorite SQL client (I recommend SequelPro). Using SELECT statements you can filter down to just the rows and fields you care about and then edit them. For example,
SELECT patientid, patientname, inability FROM patients LIMIT 5
would display only the fields you need. With a nice SQL client you can edit the result directly and write it back to the database. Afterwards, you just reload the database into R. I know a lot of folks would argue that your dataset is too small for such overhead, but I'd still prefer the editing capabilities of most SQL editors over R's. The same applies to joining tables if things get trickier. Plus, you might be interested in writing views ("tables" that are updated on access), which are treated like tables in R.
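As a rough sketch of the round trip (the connection details, database name and field names below are only placeholders), reading the edited table or view back into R would look something like this:
library(DBI)
library(RMySQL)

# placeholder credentials; substitute your own host, user and database
con <- dbConnect(MySQL(), host = "localhost", user = "me",
                 password = "secret", dbname = "clinic")

# pull back only the fields you need; a view is queried exactly like a table
patients <- dbGetQuery(con, "SELECT patientid, patientname, inability FROM patients LIMIT 5")

dbDisconnect(con)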

Check out library(reshape). I think if you start by melt()-ing your data, your feet will be on the path to your desired outcome. Let us know if that looks like it will help and how it goes from there.
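A minimal sketch of what that could look like, reusing the data and selected.patients objects from the answer above (everything is coerced to character first so melt() doesn't have to mix dates, numbers and text):
library(reshape)

# coerce all columns to character so the value column has a single type
chr <- data.frame(lapply(data, as.character), stringsAsFactors = FALSE)

# one row per patient per field: patient.key, variable, value
long <- melt(chr[chr$patient.key %in% selected.patients, ], id.vars = "patient.key")
long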

Related

Writing Apache Arrow dataset in batches in R

I'm wondering what the correct approach is for creating an Apache Arrow multi-file dataset (as described here) in batches. The tutorial explains quite well how to write a new partitioned dataset from data in memory, but is it possible to do this in batches?
My current approach is to simply write the datasets individually, but to the same directory. This appears to be working, but I have to imagine this causes issues with the metadata that powers the feature. Essentially, my logic is as follows (pseudocode):
library(dplyr)

data_ids <- c(123, 234, 345, 456, 567)

# write data in batches
for (id in data_ids) {
  ## assume this is some complicated computation that returns 1,000,000 records
  df <- data_load_helper(id)

  df <- group_by(df, col_1, col_2, col_3)

  arrow::write_dataset(df, "arrow_dataset/", format = "arrow")
}
# read in data
dat <- arrow::open_dataset("arrow_dataset/", format = "arrow",
                           partitioning = c("col_1", "col_2", "col_3"))
# check some data
dat %>%
  filter(col_1 == 123) %>%
  collect()
What is the correct way of doing this? Or is my approach correct? Loading all of the data into one object and then writing it at once is not viable, and certain chunks of the data will update at different periods over time.
TL;DR: Your solution looks pretty reasonable.
There may be one or two issues you run into. First, if your batches do not all have identical schemas then you will need to make sure to pass in unify_schemas=TRUE when you are opening the dataset for reading. This could also become costly and you may want to just save the unified schema off on its own.
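For example (a sketch, reusing the directory from the question):
# only needed if the batches were written with differing schemas
dat <- arrow::open_dataset("arrow_dataset/", format = "arrow", unify_schemas = TRUE)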
certain chunks of the data will update at different periods over time.
If by "update" you mean "add more data" then you may need to supply a basename_template. Otherwise every call to write_dataset will try and create part-0.arrow and they will overwrite each other. A common practice to work around this is to include some kind of UUID in the basename_template.
If by "update" you mean "replace existing data" then things will be a little trickier. If you want to replace entire partitions worth of data you can use existing_data_behavior="delete_matching". If you want to replace matching rows I'm not sure there is a great solution at the moment.
This approach could also lead to small batches, depending on how much data is in each group in each data_id. For example, if you have 100,000 data ids and each data id has 1 million records spread across 1,000 combinations of col_1/col_2/col_3 then you will end up with 1 million files, each with 1,000 rows. This won't perform well. Ideally you'd want to end up with 1,000 files, each with 1,000,000 rows. You could perhaps address this with some kind of occasional compaction step.

Set up automatic process for R to read directory and process?

I am so very, very new to R (so new that I had to look up how to open a file). Diving in the deep end. Anyway:
I have a bunch of .csv files with results that I need to analyse. Really, I would like to set up some kind of automation so I can just say "go" (a function?).
Basically, I have results in one file whose name ends in -particle.csv and another whose name ends in -ROI.csv. They have the same base names, so I know which ones match up (e.g. brain1 section1 -particle.csv and brain1 section1 -ROI.csv). I need to do some maths using these two datasets: divide column 2, rows 2 to x, of -particle.csv (the row number might change, but is there a way of saying "row 2 to the last row with content"?) by columns 1, 5, 10, etc. of row 2 in -ROI.csv. The column numbers always stay the same, but if it helps they are all called Area1, Area2, Area3, ... The number of Area columns can vary, but surely there's a way I can say "every column that begins with Area"? Also, the Area count and the row count will always match up.
Okay, I'm fine doing that manually for each set of results, but I have over 300 brains to analyse! Is there any way I can set this up as a process that I can apply to these and future results in the same format?
Sorry if this is a huge ask!
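For what it's worth, the kind of loop being asked about might look roughly like the sketch below; the directory name, the exact file-naming pattern and the column positions are assumptions taken from the description above:
# rough sketch; adjust the directory, delimiter and column positions to your files
particle.files <- list.files("results", pattern = "-particle\\.csv$", full.names = TRUE)

results <- lapply(particle.files, function(pfile) {
  # the matching ROI file has the same name with -ROI.csv instead of -particle.csv
  rfile <- sub("-particle\\.csv$", "-ROI.csv", pfile)

  particle <- read.csv(pfile)
  roi <- read.csv(rfile)

  # column 2, from row 2 down to the last row with content
  values <- particle[2:nrow(particle), 2]

  # every column whose name begins with "Area", row 2 (assumes the counts match up)
  areas <- unlist(roi[2, grep("^Area", names(roi))])

  values / areas
})
names(results) <- basename(particle.files)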

Referencing last used row in a data frame

I couldn't find the answer in any previously asked questions, but I believe this is an easy one.
I have the below two lines of code, which read in data from Excel over a specific range (using readxl for this). The range itself only goes through row 2589 in the Excel document, but the document updates dynamically (it's a time series), and to ensure I capture new observations (rows) as they're added, I've included rows up to 10000 in the read_excel range argument.
In the end, I'd like to run charts on this data, but a key part of this is identifying the last used row, without manually updating the code row for the latest date. I've tried using nrow but to no avail.
Raw_Index_History <- read_excel("RData.xlsx", range = "ReturnsA6:P10000", col_names = TRUE)
Raw_Index_History <- Raw_Index_History[nrow(Raw_Index_History),]
Does anybody have any thoughts or advice? Thanks very much.
It would be easier to answer your question if you included an example.
Not knowing what your data looks like, answers are likely to be a bit vague.
Does your data contain NAs? If not, it should be straightforward to remove the empty rows with
na.omit(Raw_Index_History)
It appears you also have control over the Excel spreadsheet. So in case your data does contain NAs, you could put some default value in your empty rows that gets overwritten as soon as a new data point is recorded. This will allow you to filter your data frame accordingly.
Raw_Index_History[!grepl("place_holder", Raw_Index_History$column_with_placeholder),]
If you expect data in the spreadsheet to grow, you can specify only the columns to include, instead of a defined boundary.
Something like this ...
Raw_Index_History <- read_excel("RData.xlsx",
                                sheet = 1,
                                range = cell_cols("A:P"), # only columns, no rows
                                col_names = TRUE)
Every time you run the code, R will pull in the data from columns A through P up to the last populated row.
This will be a more elegant approach for your use case. (Consider what you'd do when your data crosses 10000 rows in the future.)
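If what you ultimately need is the latest observation, the last populated row is then simply the last row of what read_excel returned (assuming no completely blank rows in between):
latest <- Raw_Index_History[nrow(Raw_Index_History), ]   # or tail(Raw_Index_History, 1)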

R read data from a text file [closed]

I have a challenging file-reading task.
I have a .txt file from a typical old accounting department (with headers, titles, pages and the useful tabulated quantitative and qualitative information). It looks like this:
From this file I am trying to do two tasks (with read.table and scan):
1) extract the information that is tabulated between | characters, which is the accounting information (every attempt so far has ended in unwieldy data frames or character vectors)
2) include as a variable each subtitle that begins with "Customers" in the text file: as you can see, the customer info is a title, then comes the accounting info (payables), then another customer and its accounting info, and so on. So it is not a column, but a row (?)
I've been trying read.table (with several sep and quote parameters) and scan, and then tried to work with the resulting character vectors.
Thanks!!
I've been there before so I kind of know what you're going through.
I've got two pieces of news for you, one bad, one good. The bad news is that I have read in these types of files in SAS tons of times, but never in R; the good news is that I can give you some tips so you can work it out in R.
So the strategy is as follows:
1) You're going to read the file into a data frame that contains only a single column. This column is character and will hold a whole line of your input file, i.e. its width is 80 if the longest line in your file is 80 characters.
2) Now you have a data frame where every record equals a line in your input file. At this point you may want to check that your data frame has the same number of records as there are lines in your file.
3) Now you can use grep to discard, or keep only, those lines that meet your criteria (i.e. the subtitles that begin with "Customers"). You may find regular expressions really useful here.
4) Your data frame now only has records that match the 'Customer' pattern and the table patterns (i.e. lines beginning with 'Country', or matching /\d{3} \d{8}/, or ' Total').
5) What you need now is to create a group variable that increments by 1 every time it finds 'Customer'. So group=1 repeats the same value until it finds 'Customer 010343', where group becomes group=2. Or, even better, your group can be the customer id, retained until a new id is found; either way you need to somehow carry the id forward until a new one appears.
From the last step you're pretty much done, as you will be able to identify customers and tables pretty easily. You may want to create a function that outputs your table strings in a tabular format. Whether you process them in a single table, or split the data frame into n data frames to process them individually, is up to you.
In SAS there is the concept of a pointer (#) and retention (the retain statement), where each line matching a criterion can be processed differently from the others, so your output data set already contains the columns and customer info in tabular format.
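Putting steps 1) to 5) together, a very rough R sketch could look like the following; the file name, the "Customers" pattern and the pipe-delimited layout are assumptions, and it also assumes the first kept line is a customer subtitle:
lines <- readLines("accounting_report.txt")

# step 3: keep only the "Customers" subtitles and the pipe-delimited table rows
lines <- lines[grepl("^Customers", lines) | grepl("\\|", lines)]

# step 5: a group id that increments at every "Customers" subtitle
is.customer <- grepl("^Customers", lines)
group <- cumsum(is.customer)

# carry each customer subtitle down to its own table rows
customer <- lines[is.customer][group]

# parse the pipe-delimited rows into a data frame, tagged with their customer
parsed <- read.table(text = lines[!is.customer], sep = "|", strip.white = TRUE,
                     fill = TRUE, stringsAsFactors = FALSE)
parsed$customer <- customer[!is.customer]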
Well hope this helps you.

R XLConnect getting index/formula to a chunk of data using content found in first cell

Sorry if this is difficult to understand - I don't have enough karma to add a picture, so I will do the best I can to describe this! I'm using the XLConnect package within R to read and write from/to Excel spreadsheets.
I am working on a project in which I am trying to take columns of data out of many workbooks and concatenate them into rows of a new workbook based on which workbook they came from (each workbook holds data from a consecutive business day). The snag is that the data I seek is only a small part (10 rows x 3 columns) of each workbook/worksheet and is not always located in the same place within the worksheet, due to sloppiness on the part of the person who originally created the spreadsheets (e.g. I can't just start at cell A2, because the dataset that starts at A2 in one workbook might start at B12 or C3 in another).
I am wondering if it is possible to search for a cell based on its contents (e.g. a cell containing the title "Table of Arb Prices") and return either the index or reference formula to be able to access that cell.
Also wondering if, once I reference that cell based on its contents, if there is a way to adjust that formula to get to where I know another cell is compared to that one. For example if a cell with known contents is always located 2 rows above and 3 columns to the left of the cell where I wish to start collecting data, is it possible for me to take that first reference formula and increment it by 2 rows and 3 columns to get the reference formula for the cell I want?
Thanks for any help and please advise me if you need further information to be able to understand my questions!
You can just read the entire worksheet in as a matrix with something like
library(XLConnect)

demoExcelFile <- system.file("demoFiles/mtcars.xlsx", package = "XLConnect")
mm <- as.matrix(readWorksheetFromFile(demoExcelFile, sheet = 1))
class(mm) <- "character"  # convert everything to character
Then you can search for values and get the row/column:
which(mm=="3.435", arr.ind=T)
# row col
# [1,] 23 6
Then you can offset those and extract values from the matrix however you like. In the end, when you know where you want to read from, you can convert to a cleaner data frame with
read.table(text=apply(mm[25:27, 6:8],1,paste, collapse="\t"), sep="\t")
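To make the offsetting concrete, a small sketch; the title string comes from the question, and the 2-rows-down / 3-columns-right offset and the 10 x 3 block size are only assumptions:
loc <- which(mm == "Table of Arb Prices", arr.ind = TRUE)
start.row <- loc[1, "row"] + 2   # data starts 2 rows below the title cell
start.col <- loc[1, "col"] + 3   # and 3 columns to its right
block <- mm[start.row:(start.row + 9), start.col:(start.col + 2)]   # the 10 x 3 block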
Hopefully that gives you a general idea of something you can try. It's hard to be more specific without knowing exactly what your input data looks like.
