I have a challenging file-reading task.
I have a .txt file from a typical old accounting department (with headers, titles, pages and the useful tabulated quantitative and qualitative information). It looks like this:
From this file I am trying to do two tasks (with read.table and scan):
1) extract the information tabulated between the | characters, which is the accounting information (every attempt so far has ended in unwieldy data frames or character vectors)
2) include as a variable each subtitle that begins with "Customers" in the text file: as you can see, the customer info is a title, then comes the accounting info (payables), then another customer and its accounting info, and so on. So it is not a column, but a row (?)
I've been trying read.table (with several sep and quote parameters) and scan, and then working with the resulting character vectors.
Thanks!!
I've been there before so I kind of know what you're going through.
I've got two pieces of news for you, one bad, one good. The bad news is that I have read in these types of files in SAS tons of times but never in R; the good news is that I can give you some tips so you can work it out in R.
So the strategy is as follows:
1) You're going to read the file into a data frame that contains only a single column. This column is character and holds a whole line of your input file, i.e. its width is 80 if the longest line in your file is 80 characters.
2) Now you have a data frame where every record equals a line in your input file. At this point you may want to check that your data frame has the same number of records as there are lines in your file.
3) Now you can use grep to drop, or keep only, those lines that meet your criteria (i.e. subtitles that begin with "Customers"). You may find regular expressions really useful here.
4) Your data frame now only has records that match the 'Customer' pattern and the table patterns (i.e. lines that begin with 'Country', match /\d{3} \d{8}/, or contain ' Total').
5) What you need now is a group variable that increments by 1 every time it finds 'Customer'. So group=1 repeats the same value until 'Customer 010343' is found, where the group becomes group=2. Even better, your group can be the customer id itself, retained until a new id is found.
From the last step you're pretty much done, as you will be able to identify customers and tables quite easily. You may want to write a function that outputs your table strings in a tabular format (see the sketch below).
Whether you process them in a single table or split the data frame into n data frames to process them individually is up to you.
In SAS there is the concept of a pointer (#) and retention (the retain statement), where each line matching a criterion can be processed differently from the others, so the output data set already contains the columns and customer info in a tabular format.
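Here is a minimal R sketch of steps 1) to 5); the file name, the "^Customer" pattern and the "|" separator are assumptions taken from your description, so adjust them to your real layout:
# Read every line of the raw file into a one-column data frame (steps 1 and 2)
lines <- readLines("accounting.txt")
df <- data.frame(line = lines, stringsAsFactors = FALSE)
stopifnot(nrow(df) == length(lines))   # sanity check: one record per input line

# Keep only customer headers and "|"-delimited table rows (steps 3 and 4)
keep <- grepl("^Customer", df$line) | grepl("|", df$line, fixed = TRUE)
df <- df[keep, , drop = FALSE]

# Group variable that increments at every customer header (step 5)
is_cust <- grepl("^Customer", df$line)
df$group <- cumsum(is_cust)
df$customer <- c(NA, df$line[is_cust])[df$group + 1]   # carry the customer line downwards

# Split the "|"-delimited rows into fields for further processing
tab <- df[!is_cust, ]
fields <- strsplit(tab$line, "|", fixed = TRUE)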
Well hope this helps you.
I've never used Qualtrics myself and do not need to, but my company receives Qualtrics-generated CSV data from another company, and we have to advise them about the names to use for fields/variables, such as "mobilephone".
The main thing I need to know is the maximum number of characters, but other limits (such as special characters to avoid) would be helpful. For example, would profile_field_twenty6chars be good? (The data is going into Moodle, which uses profile_field).
Qualtrics has two different limitations, to my knowledge.
Question names are limited to 10 characters (absurd in my opinion).
Question export tags (these default to the question text and are shown on the second row of CSV datasets) have a limit of 200 characters.
I have several data frames which start with a bit of text. Sometimes the information I need starts at row 11 and sometimes at row 16, for instance; it changes. What all the data frames have in common is that the useful information starts after a row with the title "location".
I'd like to make a loop to delete all the rows in each data frame above the useful information (including the row with "location").
I'm guessing that you want something like this:
readfun <- function(fn, n = -1, target = "location", ...) {
  r <- readLines(fn, n = n)              # read the file as raw lines
  locline <- grep(target, r)[1]          # first line matching the target
  read.table(fn, skip = locline, ...)    # re-read, skipping everything up to and including that line
}
This is fairly inefficient because it reads the data file twice (once as raw character strings and once as a data frame), but it should work reasonably well if your files are not too big. (#MrFlick points out in the comments that if you have a reasonable upper bound on how far into the file your target will occur, you can set n so that you don't have to read the whole file just to search for the target.)
I don't know any other details of your files, but it might be safer to use "^location" to identify a line that begins with that string, or some other more specific target ...
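A hypothetical usage, assuming tab-separated files with a header row sitting in a data/ directory (the directory name and the read.table arguments are placeholders):
# Apply readfun to every file and collect the results in a list of data frames
files <- list.files("data", pattern = "\\.txt$", full.names = TRUE)
dfs <- lapply(files, readfun, header = TRUE, sep = "\t")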
For a project at work, I need to generate a table from a list of proposal ids, and a table with more data about some of those proposals (called "awards"). I'm having trouble with the match() function; the data in the "awards" table often has several rows that use the same ID, while the proposals frame has only one copy of each ID. From what I've tried, R ignores multiple rows and only returns the first match, when I need all of them. I haven't been able to find anything in documentation or through searches that helps me, though I have been having difficulty phrasing the right question.
Here's what I have so far:
#R CODE to add awards data on proposals to new data spreadsheet
#read tab delimited files
Awards=read.delim("O:/testing.txt",as.is=T)
Proposals=read.delim("O:/test.txt",as.is=T)
#match IDs from both spreadsheets
Proposals$TotalAwarded=Awards$TotalAwarded[match(Proposals$IDs,Awards$IDs)]
write.table(Proposals,"O:/tested.txt",quote=F,row.names=F,sep="\t")
This does exactly what I want, except that only the first match is picked up.
What's the best way to go forward? How do I make R utilize all of the matches available?
Thanks
See help on merge: ?merge
merge(Proposals, Awards, by="IDs", all.y=TRUE)
But I cannot believe this hasn't been asked on SO before.
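Applied to the code in the question, a sketch could look like this (keeping the all.y=TRUE above, which retains every award row; use all.x=TRUE instead if you also want to keep proposals that have no award yet):
# Every award row with a matching ID becomes its own row in the result,
# so proposals with several awards appear several times
merged = merge(Proposals, Awards, by="IDs", all.y=TRUE)
write.table(merged,"O:/tested.txt",quote=F,row.names=F,sep="\t")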
I have a dataset in SPSS that has 100K+ rows and over 100 columns. I want to filter both the rows and columns at the same time into a new SPSS dataset.
I can accomplish this very easily using the subset command in R. For example:
new_data <- subset(old_data, subset = ColumnA > 10, select = c(ColumnA, ColumnC, ColumnZZ))
Even easier would be:
new_data <- old_data[old_data$ColumnA > 10, c(1, 4, 89)]
where I am passing the column indices instead.
What is the equivalent in SPSS?
I love R, but the read/write and data management speed of SPSS is significantly better.
I am not sure exactly what you are referring to when you write that the read/write and data management speed of SPSS is significantly better than R's. Your question itself demonstrates how flexible R is at data management! And a dataset of 100k rows and 100 columns is by no means a large one.
But, to answer your question, perhaps you are looking for something like this. I'm providing a "programmatic" solution, rather than the GUI one, because you're asking the question on Stack Overflow, where the focus is more on the programming side of things. I'm using a sample data file that can be found here: http://www.ats.ucla.edu/stat/spss/examples/chp/p004.sav
Save that file to your SPSS working directory, open up your SPSS syntax editor, and type the following:
GET FILE='p004.sav'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'mynewdatafile.sav'
/KEEP currentm previous lactatio.
GET FILE='mynewdatafile.sav'.
More likely, though, you'll have to go through something like this:
FILE HANDLE directoryPath /NAME='C:\path\to\working\directory\' .
FILE HANDLE myFile /NAME='directoryPath/p004.sav' .
GET FILE='myFile'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'directoryPath/mynewdatafile.sav'
/KEEP currentm previous lactatio.
FILE HANDLE myFile /NAME='directoryPath/mynewdatafile.sav'.
GET FILE='myFile'.
You should now have a new file created that has just three columns, and where no value in the "lactatio" column is greater than 3.
So, the basic steps are:
Load the data you want to work with.
Subset for all columns from all the cases you're interested in.
Save a new file with only the variables you're interested in.
Load that new file before you proceed.
With R, the basic steps are:
Load the data you want to work with.
Create an object with your subset of rows and columns (which you know how to do).
Hmm.... I don't know about you, but I know which method I prefer ;)
If you're using the right tools with R, you can also directly read in the specific subset you are interested in without first loading the whole dataset if speed really is an issue.
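For example, a minimal sketch with the haven package (assuming a version of haven whose read_sav() supports col_select; the file and column names are placeholders):
library(haven)

# Read only the three columns of interest, then filter the rows
dat <- read_sav("old_data.sav", col_select = c(ColumnA, ColumnC, ColumnZZ))
new_data <- dat[dat$ColumnA > 10, ]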
In SPSS you can't combine the two actions in one command, but it's easy enough to do it in two:
dataset copy old_data. /* delete this if you don't need to keep both old and new data.
select if ColumnA>10.
add files /file=* /keep=ColumnA ColumnC ColumnZZ.
I was having a really hard time describing what I need in the Title, so I apologize ahead of time if that makes absolutely no sense.
I have a CSV with two columns: one with a person's name and a second with a numeric value. I need to find the duplicates in the name column, then add the numeric values for that person together to get a total in a new CSV.
This is a very simplified version of the real CSV
Name,Number
Dog,1
Cat,2
Fish,1
Dog,3
Dog,2
Cat,2
Fish,1
Given the information above, what I would like to be able to produce is this:
Name,Number
Dog,6
Cat,4
Fish,2
I really don't have any idea how to get there or if it's possible with PowerShell. I can only get as far as using group-object to group by name, but I have no clue how to add the columns after that.
The biggest problem I'm coming across in my research is that most, if not all, of the results I get when googling involve adding new columns to a CSV rather than performing the mathematical calculation.
I finally got it
$csvfile = Import-Csv C:\csvfile.csv
# Group the rows by Name, then sum the Number column within each group
$csvfile | Group-Object Name | Select-Object Name, @{Name="Totals"; Expression={($_.Group | Measure-Object -Sum Number).Sum}}
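If you also need the totals written out as a new CSV, that same pipeline can be piped on to Export-Csv (with -NoTypeInformation) to produce the summarised file.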
Credit goes to:
http://www.hanselman.com/blog/ParsingCSVsAndPoorMansWebLogAnalysisWithPowerShell.aspx