Adding (mathematically) columns of a CSV based on information in another column with PowerShell - math

I had a really hard time describing what I need in the title, so I apologize ahead of time if it makes no sense.
I have a CSV with 2 columns: one with a person's name and a second with a numeric value. I need to find the duplicates in the name column, then add the numeric values for each person together to get a total in a new CSV.
This is a very simplified version of the real CSV
Name,Number
Dog,1
Cat,2
Fish,1
Dog,3
Dog,2
Cat,2
Fish,1
Given the information above, what I would like to be able to produce is this:
Name,Number
Dog,6
Cat,4
Fish,2
I really don't have any idea how to get there or if it's possible with PowerShell. I can only get as far as using group-object to group by name, but I have no clue how to add the columns after that.
The biggest problem I'm coming across with my research on this is that most if not all the results I get when googling involve adding new columns to a csv and not performing the mathematical calculation.

I finally got it
$csvfile = import-csv c:\csvfile.csv
$csvfile | group name | select name,@{Name="Totals";Expression={($_.group | Measure-Object -sum number).sum}}
Credit goes to:
http://www.hanselman.com/blog/ParsingCSVsAndPoorMansWebLogAnalysisWithPowerShell.aspx
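For readers outside PowerShell, the same group-and-sum idea can be sketched in Python. This is only an illustration under stated assumptions: the sample rows are the ones from the question, and `sum_by_name` is a hypothetical helper name, not part of any library.

```python
import csv
import io
from collections import defaultdict

# Sample data copied from the question; a real script would open the CSV file instead.
SAMPLE = """Name,Number
Dog,1
Cat,2
Fish,1
Dog,3
Dog,2
Cat,2
Fish,1
"""

def sum_by_name(rows):
    """Return {name: total}, keeping names in first-seen order."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["Name"]] += int(row["Number"])
    return dict(totals)

totals = sum_by_name(csv.DictReader(io.StringIO(SAMPLE)))
for name, total in totals.items():
    print(f"{name},{total}")
```

The dictionary plays the role of group-object, and the running addition replaces Measure-Object -sum.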

Related

Name matching and correcting spelling error in r

I have a huge data table with millions of rows that consists of Merchandise code with its description. I want to assign a category to each group (based on the combination of code and description). The problem is that the description is spelled in different ways and I want to convert all the similar names into a single one. Here is an illustrative example:
library(data.table)
dt <- data.table(code = c(rep(1,2),rep(2,2),rep(3,2)), name = c('McDonalds','Mc Dnald','Macys','macy','Comcast','Com-cats'))
dt[,cat:='NA']
setkeyv(dt,c('code','name'))
dt[.(1,'McDonalds'),cat:='Restaurant']
dt[.(1,'Mc Dnald'),cat:='Restaurant']
dt[.(2,'Macys'),cat:='Department Store']
Of course in the real case, it is impossible to go through all the spelling that refer to the same word and fix them manually.
Is there a way to detect all the similar words and convert them to a single (correct) spelling?
Thanks in advance
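One possible direction, sketched in Python rather than R, is approximate string matching against a list of known-good spellings. The sketch below uses the standard library's difflib; the `CANONICAL` list, the `canonicalize` helper and the 0.6 cutoff are all assumptions made for illustration, and on millions of rows a real solution would need something faster and a carefully tuned threshold.

```python
import difflib

# Hypothetical list of correct spellings; the question does not supply one.
CANONICAL = ["McDonalds", "Macys", "Comcast"]

def canonicalize(name, canonical=CANONICAL, cutoff=0.6):
    """Return the closest canonical spelling, or the name unchanged if none is close."""
    # Compare in lower case so 'macy' and 'Macys' are not penalised for case.
    lowered = {c.lower(): c for c in canonical}
    hits = difflib.get_close_matches(name.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else name

for raw in ["McDonalds", "Mc Dnald", "Macys", "macy", "Comcast", "Com-cats"]:
    print(raw, "->", canonicalize(raw))
```

Names that are not similar enough to any canonical spelling are returned unchanged, so they can be reviewed by hand.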

Is there a way to extract a substring from a cell in OpenOffice Calc?

I have tens of thousands of rows of unstructured data in csv format. I need to extract certain product attributes from a long string of text. Given a set of acceptable attributes, if there is a match, I need it to fill in the cell with the match.
Example data:
"[ROOT];Earrings;Brands;Brands>JeweleryExchange;Earrings>Gender;Earrings>Gemstone;Earrings>Metal;Earrings>Occasion;Earrings>Style;Earrings>Gender>Women's;Earrings>Gemstone>Zircon;Earrings>Metal>White Gold;Earrings>Occasion>Just to say: I Love You;Earrings>Style>Drop/Dangle;Earrings>Style>Fashion;Not Visible;Gifts;Gifts>Price>$500 - $1000;Gifts>Shop>Earrings;Gifts>Occasion;Gifts>Occasion>Christmas;Gifts>Occasion>Just to say: I Love You;Gifts>For>Her"
Look up table of values:
Zircon, Diamond, Pearl, Ruby
Output:
Zircon
I tried using the VLOOKUP() function, but it needs to match an entire cell and works better for translating acronyms. Haven't really found a built in function that accomplishes what I need. The data is totally unstructured, and changes from row to row with no consistency even within variations of the same product. Does anyone have an idea how to do this?? Or how to write an OpenOffice Calc function to accomplish this? Also open to other better methods of doing this if anyone has any experience or ideas in how to approach this...
ok so I figured out how to do this on my own... I created many different columns, each with a keyword I was looking to extract as a header.
Then I used this formula to extract the keywords into the correct row beneath the column header:
=IF(ISERROR(SEARCH(CF$1,$D769)),"",CF$1)
The SEARCH function returns the numeric position of the search string if it is found; otherwise it produces an error. I use the ISERROR function to detect the error condition, and the IF statement so that if there is an error the cell is left blank, else it takes the value of the header. I had over 100 columns of specific information to extract, plus one final column where I join all the previous cells in the row together for the final list. Worked like a charm. I recommend this approach to anyone who has to do a similar task.
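The same per-keyword check can be sketched outside the spreadsheet. The Python below is an illustration only: `extract_keywords` is a hypothetical helper, the keyword list is the lookup table from the question, and the row is a shortened excerpt of the example data. It ignores case, on the assumption that this matches how SEARCH behaves.

```python
# Lookup values from the question.
KEYWORDS = ["Zircon", "Diamond", "Pearl", "Ruby"]

def extract_keywords(text, keywords=KEYWORDS):
    """Return the keywords that occur anywhere in text, ignoring case."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]

# Shortened excerpt of the example row from the question.
row = "Earrings>Gemstone>Zircon;Earrings>Metal>White Gold;Gifts>For>Her"
print(", ".join(extract_keywords(row)))
```

Joining the matches at the end corresponds to the final spreadsheet column that concatenates the per-keyword cells.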

R read data from a text file [closed]

I have a challenging file-reading task.
I have a .txt file from a typical old accounting department (with headers, titles, pages and the useful tabulated quantitative and qualitative information). It looks like this:
From this file I am trying to do two tasks (with read.table and scan):
1) extract the information tabulated between | characters, which is the accounting information (every attempt so far has ended in unwieldy data frames or character vectors)
2) include as a variable each subtitle that begins with "Customers" in the text file: as you can see, the customer info is a title, then comes the accounting info (payables), then another customer and its accounting info, and so on. So it is not a column, but a row (?)
I've been trying read.table (with several sep and quote parameters) and scan, and then tried to work with the resulting character vectors.
Thanks!!
I've been there before so I kind of know what you're going through.
I've got two pieces of news for you, one bad, one good. The bad one is that I have read in these types of files in SAS tons of times but never in R. The good news is that I can give you some tips so you can work it out in R.
So the strategy is as follows:
1) Read the file into a data frame that contains only a single column. This column is character and holds a whole line of your input file, i.e. its length is 80 if the longest line in your file is 80 characters.
2) Now you have a data frame where every record equals a line in your input file. At this point you may want to check that your data frame has the same number of records as there are lines in your file.
3) Now you can use grep to get rid of, or keep, only those lines that meet your criteria (i.e. a subtitle which begins with "Customers"). You may find regular expressions really useful here.
4) Your data frame now only has records that match 'Customer' patterns and table patterns (i.e. the line begins with 'Country' or /\d{3} \d{8}/ or ' Total').
5) What you need now is to create a group variable that increments by 1 every time it finds 'Customer'. So group=1 repeats until it finds 'Customer 010343', where group becomes 2. Or, even better, your group can be the customer id, retained until a new id is found.
From the last step you're pretty much done, as you will be able to identify customers and tables pretty easily. You may want to create a function that outputs your table strings in a tabular format.
Whether you process them in a single table or split the data frame into n data frames to process them individually is up to you.
In SAS there is the concept of a pointer (#) and retention (the retain statement), where each line matching a criterion can be processed differently from lines matching other criteria, so your output data set already contains columns and customer info in a tabular format.
Well hope this helps you.
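The strategy above (read whole lines, keep only customer and table lines, carry the customer id forward) can be sketched as follows. This is Python rather than R, and the report lines are invented for illustration because the original file was not shown; only 'Customer 010343' appears in the text above. `parse_report` is a hypothetical helper name.

```python
import re

# Hypothetical report lines; the real file layout is not known.
LINES = [
    "ACME ACCOUNTING DEPT            PAGE 1",
    "Customer 010342 Foo Ltd",
    "| 001 | 100.00 |",
    "| 002 | 250.50 |",
    "Customer 010343 Bar Inc",
    "| 001 | 75.25 |",
]

CUSTOMER = re.compile(r"^Customer\s+(\d+)")

def parse_report(lines):
    """Return one list per table row, prefixed with the retained customer id."""
    rows = []
    current_id = None  # retained until the next 'Customer' line
    for line in lines:
        match = CUSTOMER.match(line)
        if match:
            current_id = match.group(1)
        elif line.startswith("|") and current_id is not None:
            cells = [c.strip() for c in line.strip().strip("|").split("|")]
            rows.append([current_id] + cells)
        # Any other line (headers, page titles) is skipped.
    return rows

for parsed in parse_report(LINES):
    print(parsed)
```

The `current_id` variable does the job of the SAS retain statement: it keeps its value across lines until a new customer heading replaces it.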

How do I match single ID's in one data frame to multiples of the IDs in another data frame in R?

For a project at work, I need to generate a table from a list of proposal ids, and a table with more data about some of those proposals (called "awards"). I'm having trouble with the match() function; the data in the "awards" table often has several rows that use the same ID, while the proposals frame has only one copy of each ID. From what I've tried, R ignores multiple rows and only returns the first match, when I need all of them. I haven't been able to find anything in documentation or through searches that helps me, though I have been having difficulty phrasing the right question.
Here's what I have so far:
#R CODE to add awards data on proposals to new data spreadsheet
#read tab delimited files
Awards=read.delim("O:/testing.txt",as.is=T)
Proposals=read.delim("O:/test.txt",as.is=T)
#match IDs from both spreadsheets
Proposals$TotalAwarded=Awards$TotalAwarded[match(Proposals$IDs,Awards$IDs)]
write.table(Proposals,"O:/tested.txt",quote=F,row.names=F,sep="\t")
This does exactly what I want, except that only the first match is returned.
What's the best way to go forward? How do I make R utilize all of the matches available?
Thanks
See help on merge: ?merge
merge( Proposals, Awards, by="IDs", all.y=TRUE )
But I cannot believe this hasn't been asked on SO before.
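To see why a join returns every match where a first-match lookup does not, here is a small Python sketch of a one-to-many join. The rows are hypothetical stand-ins for the two tab-delimited files, and `join_all_matches` is an illustrative helper. Note that it behaves like an inner join: proposals without awards are dropped, and so would be awards without a proposal, which all.y=TRUE in merge would keep.

```python
from collections import defaultdict

# Hypothetical rows standing in for the two tab-delimited files.
proposals = [{"IDs": "P1"}, {"IDs": "P2"}, {"IDs": "P3"}]
awards = [
    {"IDs": "P1", "TotalAwarded": 100},
    {"IDs": "P1", "TotalAwarded": 250},
    {"IDs": "P2", "TotalAwarded": 75},
]

def join_all_matches(left, right, key):
    """Inner join: emit one combined row per matching pair of left and right rows."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)  # keep every row per key, not just the first
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined

for joined_row in join_all_matches(proposals, awards, "IDs"):
    print(joined_row)
```

Here P1 appears twice in the result because it has two award rows, which is the behaviour the question asks for.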

Select Rows and Columns At the Same Time in SPSS

I have a dataset in SPSS that has 100K+ rows and over 100 columns. I want to filter both the rows and columns at the same time into a new SPSS dataset.
I can accomplish this very easily using the subset command in R. For example:
new_data = subset(old_data, subset = ColumnA > 10, select = c(ColumnA, ColumnC, ColumnZZ))
Even easier would be:
new_data = old_data[old_data$ColumnA > 10, c(1, 4, 89)]
where I am passing the column indices instead.
What is the equivalent in SPSS?
I love R, but the read/write and data management speed of SPSS is significantly better.
I am not sure what exactly you are referring to when you write that "the read/write and data management speed of SPSS being significantly better" than R. Your question itself demonstrates how flexible R is at data management! And, a dataset of 100k rows and 100 columns is by no means a large one.
But, to answer your question, perhaps you are looking for something like this. I'm providing a "programmatic" solution, rather than the GUI one, because you're asking the question on Stack Overflow, where the focus is more on the programming side of things. I'm using a sample data file that can be found here: http://www.ats.ucla.edu/stat/spss/examples/chp/p004.sav
Save that file to your SPSS working directory, open up your SPSS syntax editor, and type the following:
GET FILE='p004.sav'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'mynewdatafile.sav'
/KEEP currentm previous lactatio.
GET FILE='mynewdatafile.sav'.
More likely, though, you'll have to go through something like this:
FILE HANDLE directoryPath /NAME='C:\path\to\working\directory\' .
FILE HANDLE myFile /NAME='directoryPath/p004.sav' .
GET FILE='myFile'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'directoryPath/mynewdatafile.sav'
/KEEP currentm previous lactatio.
FILE HANDLE myFile /NAME='directoryPath/mynewdatafile.sav'.
GET FILE='myFile'.
You should now have a new file created that has just three columns, and where no value in the "lactatio" column is greater than 3.
So, the basic steps are:
1) Load the data you want to work with.
2) Subset for all columns from all the cases you're interested in.
3) Save a new file with only the variables you're interested in.
4) Load that new file before you proceed.
With R, the basic steps are:
1) Load the data you want to work with.
2) Create an object with your subset of rows and columns (which you know how to do).
Hmm.... I don't know about you, but I know which method I prefer ;)
If you're using the right tools with R, you can also directly read in the specific subset you are interested in without first loading the whole dataset if speed really is an issue.
In SPSS you can't combine the two actions in one command, but it's easy enough to do it in two:
dataset copy old_data. /* delete this if you don't need to keep both old and new data.
select if ColumnA>10.
add files /file=* /keep=ColumnA ColumnC ColumnZZ.
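The two steps (filter the rows, then keep only some columns) can also be written as one expression in a general-purpose language. The Python sketch below uses invented records; only the column names ColumnA, ColumnC and ColumnZZ come from the question, and ColumnB is a hypothetical extra column that gets dropped.

```python
# Hypothetical records; the values are made up for illustration.
old_data = [
    {"ColumnA": 5, "ColumnB": "x", "ColumnC": 1.5, "ColumnZZ": "u"},
    {"ColumnA": 12, "ColumnB": "y", "ColumnC": 2.5, "ColumnZZ": "v"},
    {"ColumnA": 40, "ColumnB": "z", "ColumnC": 3.5, "ColumnZZ": "w"},
]

KEEP = ["ColumnA", "ColumnC", "ColumnZZ"]

# Row filter (like select if) and column filter (like /keep) in one pass.
new_data = [
    {col: row[col] for col in KEEP}
    for row in old_data
    if row["ColumnA"] > 10
]

for row in new_data:
    print(row)
```

Because `old_data` is left untouched, both the old and the new data remain available, much like keeping the dataset copy line in the syntax above.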
