Cleansing an excel spreadsheet with whitespace cells - r

I'm looking for advice about how to cleanse an excel spreadsheet using R.
http://www.abs.gov.au/AUSSTATS/abs#.nsf/DetailsPage/5506.02012-13?OpenDocument
Gathering the years with tidyr::gather is simple enough. The difficulty is the subgroups, which are defined by indentation: each amount of leading whitespace marks a subgroup level.
My question is how to assign each row to its group, so that the table is in tidy form.
My initial instinct was to look for rows of NAs in the spreadsheet and fill them with na.locf, but that method cannot tell a group's subgroups apart from a following group that has no subgroups. Is there a way to count the amount of whitespace visible before the cell contents in the linked Excel spreadsheet?

On the particular sheet you are talking about, there aren't any leading characters - the indentation is just the formatting applied to the cell, in much the same way as you might apply a font to a cell.
Within Excel itself, the way to count the indents in the formatting is to create a macro. Here's a user-defined function that will work:
Public Function inds(r As Excel.Range) As Integer
    ' Return the indent level of the first cell in the range
    inds = r.Cells(1, 1).IndentLevel
End Function
You would then just count the indents with =inds(a3)
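Once the indent level sits in its own column, each row can be assigned to its parent groups. The following is a minimal Python sketch of that step, not part of the original answer; the row labels and indent levels are hypothetical, and it assumes a row belongs to the nearest preceding row with a smaller indent level.

```python
def assign_groups(rows):
    """Attach to each (label, indent) row the labels of its ancestor rows.

    Assumption: a row is a child of the nearest preceding row
    whose indent level is smaller than its own.
    """
    stack = []   # (indent, label) pairs for the currently open ancestors
    result = []
    for label, indent in rows:
        # Close every group at the same depth or deeper than this row.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        result.append((label, [name for _, name in stack]))
        stack.append((indent, label))
    return result


# Hypothetical rows: (cell text, indent level as reported by inds)
rows = [
    ("Taxes on income", 0),
    ("Individuals", 1),
    ("Enterprises", 1),
    ("Taxes on property", 0),
    ("Land taxes", 1),
    ("Municipal rates", 2),
]

tidy = assign_groups(rows)
for label, parents in tidy:
    print(" > ".join(parents + [label]))
```

Unlike filling down from blank rows, this distinguishes a new top-level group from a subgroup, because the indent level itself decides where each row attaches.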

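Looks like you might be trying to prepare the data for a pivot table (there might be better options). However, if the cells do contain real leading spaces, this simple formula gives the position of the first non-space character (the number of leading spaces plus one), provided there are no trailing or repeated inner spaces, which TRIM also removes: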
=len(a3)-len(trim(a3))+1
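Because TRIM also strips trailing spaces and collapses repeated inner spaces, the formula counts more than leading spaces on some inputs. A small Python sketch of the difference, with hypothetical function names and sample text:

```python
def leading_spaces(text):
    """Count only the spaces before the first non-space character."""
    return len(text) - len(text.lstrip(" "))


def trimmed_spaces(text):
    """Mimic LEN(x) - LEN(TRIM(x)): leading, trailing and repeated inner spaces."""
    trimmed = " ".join(part for part in text.split(" ") if part)
    return len(text) - len(trimmed)


sample = "  Land taxes "   # two leading spaces, one trailing space
print(leading_spaces(sample), trimmed_spaces(sample))
```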

Related

Adding new column with data value in R

I want to add a column (say ForestAreaPerPopn) holding the ratio of forest area to the resident population (represented by the variable Total below). The data contains the following variables and their values.
How can I add a column named ForestAreaPerPopn to the table ForestAreaPerPop (shown below) so that it contains the ratio of forest area to Total?
Too long for a comment.
You have a couple of problems. First, your column names have spaces and other special characters. This is allowed but creates all kinds of problems later. I suggest you do something like:
colnames(ForestAreaPerPop) <- gsub(' |\\(|\\)', '_', colnames(ForestAreaPerPop))
This replaces every space and every left or right parenthesis in the column names with '_'.
Then, something like:
ForestAreaPerPop$n <- with(ForestAreaPerPop, Forest_Area_in_ha/Total)
should give you what you want.
Some advice: long table names and column names may seem like a good idea, but you will live to regret it. Make them short but meaningful (easier said than done).
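For readers doing the same thing in Python rather than R, here is a rough equivalent of the two steps above (sanitize the names, then add the ratio). The column names and values are hypothetical, since the original data is not shown.

```python
import re


def clean_name(name):
    """Replace each space or parenthesis with an underscore, like the gsub call."""
    return re.sub(r"[ ()]", "_", name)


# Hypothetical rows; the real column names and values are not shown in the question.
rows = [
    {"District": "A", "Forest Area in ha": 1200.0, "Total": 300},
    {"District": "B", "Forest Area in ha": 450.0, "Total": 900},
]

# Rename the keys, then add the ratio as a new column on every row.
cleaned = [{clean_name(key): value for key, value in row.items()} for row in rows]
for row in cleaned:
    row["ForestAreaPerPopn"] = row["Forest_Area_in_ha"] / row["Total"]

print(cleaned)
```

Note that each matched character becomes its own underscore, so a name such as "Area (ha)" turns into "Area__ha_".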

Is there a way to extract a substring from a cell in OpenOffice Calc?

I have tens of thousands of rows of unstructured data in csv format. I need to extract certain product attributes from a long string of text. Given a set of acceptable attributes, if there is a match, I need it to fill in the cell with the match.
Example data:
"[ROOT];Earrings;Brands;Brands>JeweleryExchange;Earrings>Gender;Earrings>Gemstone;Earrings>Metal;Earrings>Occasion;Earrings>Style;Earrings>Gender>Women's;Earrings>Gemstone>Zircon;Earrings>Metal>White Gold;Earrings>Occasion>Just to say: I Love You;Earrings>Style>Drop/Dangle;Earrings>Style>Fashion;Not Visible;Gifts;Gifts>Price>$500 - $1000;Gifts>Shop>Earrings;Gifts>Occasion;Gifts>Occasion>Christmas;Gifts>Occasion>Just to say: I Love You;Gifts>For>Her"
Look up table of values:
Zircon, Diamond, Pearl, Ruby
Output:
Zircon
I tried using the VLOOKUP() function, but it needs to match an entire cell and works better for translating acronyms. I haven't found a built-in function that accomplishes what I need. The data is totally unstructured and changes from row to row, with no consistency even within variations of the same product. Does anyone have an idea how to do this, or how to write an OpenOffice Calc function to accomplish it? I'm also open to better methods if anyone has experience or ideas on how to approach this.
OK, so I figured out how to do this on my own. I created many columns, each headed by a keyword I wanted to extract.
Then I used this formula to pull each keyword into the matching row beneath its column header:
=IF(ISERROR(SEARCH(CF$1,$D769)),"",CF$1)
SEARCH returns the position of the search string as a number, or an error if the string is not found. ISERROR detects that error, and the IF leaves the cell blank in that case and otherwise returns the header value. I had over 100 columns of specific information to extract, plus one final column that joins all the previous cells in the row into the final list. Worked like a charm. I recommend this approach to anyone who has to do a similar task.
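The same idea can be sketched outside the spreadsheet. This is a minimal Python illustration, not the poster's method; the function name is hypothetical and the row is a shortened excerpt of the example data. Matching ignores case, as SEARCH does.

```python
def find_attributes(text, keywords):
    """Return the keywords that occur anywhere in text, ignoring case."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


# Shortened excerpt of the example row from the question.
row = "[ROOT];Earrings;Earrings>Gemstone>Zircon;Earrings>Metal>White Gold;Gifts>For>Her"
gemstones = ["Zircon", "Diamond", "Pearl", "Ruby"]

matches = find_attributes(row, gemstones)
print(", ".join(matches))
```

Like the formula, this is plain substring matching, so a keyword also matches inside a longer word; splitting the row on ";" and ">" first would be stricter.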

COUNTIF of non-empty and non-blank cells

In Google Sheets I want to count the number of cells in a range (C4:U4) that are non-empty and non-blank. Counting non-empty cells is easy with COUNTIF. The tricky part is that I want to treat cells containing only one or more blanks (spaces) as empty. (My users keep leaving invisible blanks in cells and I waste a lot of time cleaning them up.)
=COUNTIF(C4:U4,"<>") treats a cell with one or more blanks as non-empty and counts it. I've also tried =COUNTA(C4:U4) but that suffers from the same problem of counting cells with one or more blanks.
I found an answer on Stack Overflow flagged as a solution by 95 people, but it doesn't work for cells with blanks.
After much reading I have come up with a fancy formula:
=COUNTIF(FILTER(C4:U4,TRIM(C4:U4)>="-"),"<>")
The idea is that TRIM removes leading and trailing blanks before FILTER tests whether the cell is greater than or equal to a hyphen (the lowest-ordered printable character I could find). FILTER then returns an array containing only the non-empty, non-blank cells, and COUNTIF tests that array against "<>".
This works (or at least seems to work), but I was wondering if I've missed something really obvious. Surely the problem of hidden blanks is very common and has been around since the dawn of Excel and Google Sheets. There must be a simpler way.
(My first question so apologies for any breaches of forum rules.)
I don't know about Google. But for Excel you could use this array formula for multiple contiguous columns:
=ROWS(A1:B10) * COLUMNS(A1:B10)-(COUNT(IF(ISERROR(CODE(A1:B10)),1,""))+COUNT(IF(CODE(A1:B10)=32,1,"")))
Could try this but I'm not at all sure about it
=SUMPRODUCT(--(trim((substitute(A2:A5,char(160),"")))<>""))
It seems that in Google Sheets you have to use char(160) to match a space typed into a cell.
This appears to be due to a non-breaking space and could possibly apply to Excel as well, as explained here. The suggestion is that you could also pass the text through the CLEAN function to eliminate invisible characters with codes in the range 0-31.
I found another way to do it using:
=ARRAYFORMULA(SUM(IF(TRIM($C4:$U4)<>"",1,0)))
I'm still looking for a simpler way to do it if one is available.
This should work:
=countif(C4:U4,">""")
I found this solution here:
Is COUNTA counting blank (empty) cells in new Google spreadsheets?
Please let me know if it does.
=COLUMNS(C4:U4)-COUNTBLANK(C4:U4)
This counts how many cells are in your range (C4 to U4 = 19 cells) and subtracts those that are truly empty.
Cells containing only spaces are not counted by COUNTBLANK, despite its name (it should really be COUNTEMPTY), so they remain in the total.
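If the range is exported and checked outside the spreadsheet, the rule "whitespace-only counts as empty" can be written directly. A minimal Python sketch with a hypothetical sample row; it relies on str.strip() with no arguments removing Unicode whitespace, which includes the non-breaking space (character 160).

```python
def count_non_blank(cells):
    """Count cells that contain something other than whitespace.

    None stands for a truly empty cell. str.strip() with no arguments
    also removes the non-breaking space (U+00A0).
    """
    return sum(1 for cell in cells if cell is not None and str(cell).strip() != "")


# Hypothetical row: text, empty string, space, non-breaking space, empty cell, zero, padded text
row = ["x", "", " ", "\u00a0", None, 0, "  y  "]
print(count_non_blank(row))
```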

Extracting Data from a column based on symbols

I have a tricky question that I'm hoping someone can help me with. I have an output file that looks pretty standard in that there is one value per row, per column - except for one column (excerpt below) that contains multiple entries per row:
4:103806204-103940896,4:103806204-103940896,4:103822084-103940896,4:103806204-103940896
7:27135712-27139877,7:27135712-27139877
2:209030070-209054773
1:16091458-16113084,1:16090993-16101715,1:16085254-16113084
16:70333061-70367735,16:70323669-70367735,16:70333061-70367735,16:70333061-70367735,16:70328735-70367735,16:70328699-70367735,16:70333061-70367735
It would be easy enough to split this column by ',' but then I won't be able to read it into, say, R very easily.
Instead, I'm hoping I can use a simple bit of code to select only the first two values, and then make one column into two, removing the rest. So the above would become the below:
4 103806204
7 27135712
2 209030070
1 16091458
16 70333061
I lose a little bit of info this way, but it makes the data more manageable. Does anyone have any suggestions?
We can use str_extract_all from library(stringr). We extract the numeric elements (\\d+) into a list, convert them from 'character' to numeric, take the first two elements of each with head, and rbind the list elements.
library(stringr)
do.call(rbind, lapply(str_extract_all(df$col, '\\d+'),
                      function(x) head(as.numeric(x), 2)))
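For comparison, a minimal Python sketch of the same extraction using the standard re module. The function name is hypothetical and the input lines are shortened from the excerpt in the question. Like the R answer, it assumes the chromosome is purely numeric; labels such as X or Y would need a different pattern.

```python
import re


def first_locus(entry):
    """Return (chromosome, start) from the first 'chr:start-end' item in entry."""
    match = re.match(r"\s*(\d+):(\d+)", entry)
    if match is None:
        return None   # the line does not start with a numeric chr:start pair
    return int(match.group(1)), int(match.group(2))


# Lines shortened from the excerpt in the question.
lines = [
    "4:103806204-103940896,4:103806204-103940896",
    "7:27135712-27139877,7:27135712-27139877",
    "2:209030070-209054773",
]

pairs = [first_locus(line) for line in lines]
for chrom, start in pairs:
    print(chrom, start)
```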

Concatenate excel cells with numbers without summing

I know you can use the CONCATENATE function in Excel to combine two strings from two different cells into one cell, but how do I do the same for cells with numbers in them? I have two columns, as seen in the image below (I have started the process by hand to demonstrate what I want). I want to concatenate the values and read the result into R to perform a choice experiment evaluation on the data, but using CONCATENATE sums the values.
You can also do the concatenation within R itself, by setting the column type to string, and then using the paste0(string1, string2) function.
You can use the & sign
=A1&C1
CONCATENATE also works in Excel:
=CONCATENATE(A1,C1)
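The underlying point in all of these answers is that the values must be treated as text before joining. A tiny Python sketch of that idea, with a hypothetical helper name and values:

```python
def join_digits(*values):
    """Convert each value to text first, so joining never adds the numbers."""
    return "".join(str(value) for value in values)


combined = join_digits(1, 3)
print(combined)   # prints 13, not 4
```

Values stored as floats would keep their decimal point (1.0 becomes "1.0"), so integer-like columns may need converting first.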

Resources