I've never used Qualtrics myself and do not need to, but my company receives Qualtrics-generated CSV data from another company, and we have to advise them about the names to use for fields/variables, such as "mobilephone".
The main thing I need to know is the maximum number of characters, but other limits (such as special characters to avoid) would be helpful. For example, would profile_field_twenty6chars be good? (The data is going into Moodle, which uses profile_field).
Qualtrics has two different limits that I know of.
Question names are limited to 10 characters (absurd, in my opinion).
Question export tags (these default to the question text, and are shown on the second row of CSV datasets) have a limit of 200 characters.
I am trying to read in a csv file (exported from survey monkey).
I have tried survey <- read.csv("Survey Item Evaluation2.csv", header=TRUE, stringsAsFactors = FALSE)
I ran skim(survey), which shows it is reading in as characters.
str(survey) output: 'data.frame': 623 obs. of 68 variables. G1 (which is a survey item) reads in as chr "1" "3" "4" "1"....
How do I change those survey item variables to numeric?
The correct answer to your question is given in the first two comments by two very well-respected people with a combined reputation of over 600k. I'll post their very similar answer here:
as.numeric(survey$G1)
However, that is not very good advice in my opinion. Your question should really have been:
"Why am I getting character data when I'm sure this variable should be numeric?"
To which the answer would be: "Either you're not reading the data correctly (does the data start at row 3?), or there is non-numeric (garbage) data among the numeric data (for example, NA is entered as . or some other character), or certain people entered a , instead of a . to represent the decimal point (such as nationals of Indonesia and some European countries), or they entered a thin space as a thousands separator instead of a comma, or there is some other unknown reason which needs further investigation. Maybe a certain group of people enter text instead of numbers for their age (fifty instead of 50), or they put a . at the end of the data, for example 62.5. instead of 62.5 for their age (older folks were taught to always end a sentence with a period!). In these last two cases, certain groups (the elderly) will have missing data and your data is then missing not at random (MNAR), a big bias in your analysis."
I see this all too often and I worry that new users of R are making terrible mistakes due to being given poor advice, or because they didn't learn the basics. Importing data is the first step of analysis. It can be difficult because data files come in all shapes and sizes - there is no global standard. Data is also often entered without any quality control mechanisms. I'm glad that you added the stringsAsFactors=FALSE argument in your command to import the data. Someone gave you good advice there. But that person forgot to advise you not to trust your data, especially if it was given to you by someone else to analyse. Always check every variable carefully before the analysis. This can take time, but it can be worth the investment.
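As a concrete illustration of that checking step (a minimal sketch, assuming the column is survey$G1 as in the question), you can list exactly which raw values refuse to convert before deciding how to fix them:
raw <- survey$G1
# Values that become NA when coerced to numeric but were not NA to begin with:
# these are the entries that need investigating (e.g. "fifty", "62.5.", "1,5", ".").
bad <- raw[!is.na(raw) & is.na(suppressWarnings(as.numeric(raw)))]
unique(bad)   # the distinct problem values
table(bad)    # how often each one occurs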
Hope that helps at least someone out there.
I've been using the REDCapR package to read in data from my survey form. I was reading in the data with no issue using redcap_read until I realized I needed to add a field restriction to one question on my survey. Initially it was a short answer field asking users how many of something they had, and people were doing expectedly annoying things like spelling out numbers or entering "a few" instead of a number. But all of that data read in fine. I changed the field to be a short answer field (same type as before) that requires the response to be an integer and now the data won't read into R using redcap_read.
When I run:
redcap_read(redcap_uri=uri, token=api_token)$data
I get the error message that:
Column [name of my column] can't be converted from numeric to character
I also noticed when I looked at the data that it read in the 1st and 6th records of that column (both zeros) just fine (out of 800+ records), but everything else is NA. Is there an inherent problem with trying to read in data from a text field restricted to an integer, or is there another way to do this?
Edit: it also reads the dates fine, which are text fields with a date field restriction. This seems to be very specific to reading in the validated numbers from the text field.
I also tried redcapAPI::exportRecords and it will continue to read in the rest of the dataset, but reads in NA for all values in the column with the integer restriction.
Upgrade REDCapR to the version on GitHub, which stacks the batches on top of each other before determining the data type (see #257).
# install.packages("remotes") # Run this line if the 'remotes' package isn't installed already.
remotes::install_github(repo="OuhscBbmc/REDCapR")
In your case, I believe that the batches (of 200 records, by default) contain different data types (character & numeric, according to the error message), which won't stack on top of each other silently.
The REDCapR::redcap_read() function should work then. (If not, please create a new issue).
Two alternatives, sketched below, are
calling redcap_read_oneshot with a large value of guess_max, or
calling redcap_read_oneshot with guess_type = TRUE.
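A hedged sketch of what those two calls might look like (using the same uri and api_token as above; check your version's REDCapR documentation for the exact argument defaults):
library(REDCapR)
# Alternative 1: widen the type-guessing window so more than the first batch
# of rows is inspected before a column type is chosen.
ds_guess_max <- redcap_read_oneshot(redcap_uri = uri, token = api_token, guess_max = 10000)$data
# Alternative 2: rely on guess_type (as suggested above) so column types are
# guessed from the returned values rather than fixed up front.
ds_guess_type <- redcap_read_oneshot(redcap_uri = uri, token = api_token, guess_type = TRUE)$data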
I asked over at the English Stack Exchange, "What is the English word with the longest single definition?" The best answer they could give is that I would need a program that could figure out the longest entry in a (text) file listing dictionary definitions, by counting the number of characters or words in a given entry, and then provide a list of the longest entries. I also asked at Superuser but they couldn't come up with an answer either, so I decided to give it a shot here.
I managed to find a dictionary file which, converted to text, has the following format:
a /a/ indefinite article (an before a vowel) 1 any, some, one (have a cookie). 2 one single thing (there’s not a store for miles). 3 per, for each (take this twice a day).
aardvark /ard-vark/ n an African mammal with a long snout that feeds on ants.
abacus /a-ba-kus, a-ba-kus/ n a counting frame with beads.
As you can see, each definition comes after the pronunciation (enclosed by slashes), and then either:
1) ends with a period, or
2) ends before an example (enclosed in parentheses), or
3) follows a number and ends with a period or before an example, when a word has multiple definitions.
What I would need, then, is a function or program that can distinguish each definition (including considering multiple definitions of a single word as separate ones), then count the number of characters and/or words within (ignoring the examples in parentheses, since those are not part of the definition proper), and finally provide a list of the longest definitions (I don't think I would need more than, say, a top 20 or so to compare). If the file format was an issue, I can convert the file to PDF, EPUB, etc. with no problem. And, I guess ideally I would want to be able to choose between counting length by characters and by words, if that is possible.
How should I go about doing this? I have little experience from programming classes I took a long time ago, but I think it's better to assume I know close to nothing about programming at all.
Thanks in advance.
I'm not going to write code for you, but I'll help think the problem through. Pick the programming language you're most familiar with from long ago, and give it a whack. When you run into problems, come back and ask for help.
I'd chop this task up into a bunch of subproblems (a rough sketch of how they might fit together follows the list):
Read the dictionary file from the filesystem.
Chunk the file up into discrete entries. If it's a text file like you show, most programming languages have a facility to easily iterate linewise through a file (i.e. take a line ending character or character sequence as the separator).
Filter bad entries: in your example, your lines appear separated by an empty line. As you iterate, you'll just drop those.
Use your human observation and judgement to look for strong patterns in the data that you can communicate as firm rules -- this is one of the central activities of programming. You've already started identifying some patterns in your question, e.g.
All entries have a preamble with the pronunciation and part of speech.
A multiple definition entry will be interspersed with lone numerals.
Otherwise, a single definition just follows the preamble.
Write the rules you've invented into code. It'll go something like this: first find a way to lop off the word itself and the preamble. With the remainder, identify multiple-def entries by the presence of lone numerals or whatever; if not, treat it as single-def.
For each entry, iterate over each of the one-or-more definitions you've identified.
Write a function that will count a definition either word-wise or character-wise. If word-wise, you'll probably tokenize based on whitespace. Counting the length of a string character-wise is trivial in most programming languages. Why not implement both!
Keep a data structure in memory as you iterate the file to track "longest". For each definition in each entry, after you apply the length calculation, you'll compare against the previous longest entry. If the new one is longer, you'll record this new leading word and its word count in your data structure. Comparing 'greater than' and storing a variable are fundamental in most programming languages, so while this is the real meat of your program, this shouldn't be hard.
Implement some way to display your results once iteration is done. This may be as simple as a print statement.
Finally, write the glue code that lets you execute the program easily. A program like this could easily be a command-line tool that takes one or two arguments (the path to the file to be analyzed, perhaps you pass your desired counting method 'character|word' as an argument too, since you implemented both). Different languages vary in how easy it is to create an executable to run from the command line, but most support it, so it's a good option for tasks like this.
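For orientation only, here is one rough way those subproblems might translate into R (a sketch under the format assumptions shown in the question, reading a hypothetical "dictionary.txt"; the regular expressions will need adjusting for the real file, e.g. multi-word parts of speech such as "indefinite article" are not handled):
lines <- readLines("dictionary.txt", warn = FALSE)
lines <- lines[nzchar(trimws(lines))]                      # drop empty separator lines
results <- do.call(rbind, lapply(lines, function(entry) {
  word <- sub("^(\\S+).*", "\\1", entry)                   # the headword
  body <- sub("^\\S+\\s+/[^/]*/\\s*", "", entry)           # drop the word and the /pronunciation/
  body <- sub("^(n|v|adj|adv|prep|conj|pron|interj)\\s+", "", body)  # drop a simple part-of-speech tag
  body <- gsub("\\([^)]*\\)", "", body)                    # drop the examples in parentheses
  defs <- strsplit(body, "\\s*\\b\\d+\\s+")[[1]]           # split numbered senses into separate definitions
  defs <- trimws(defs[nzchar(trimws(defs))])
  data.frame(word = word, definition = defs,
             chars = nchar(defs),                          # character-wise length
             words = lengths(strsplit(defs, "\\s+")),      # word-wise length
             stringsAsFactors = FALSE)
}))
head(results[order(-results$chars), ], 20)                 # top 20 by characters; order by words instead if preferred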
I have a challenging file-reading task.
I have a .txt file from a typical old accounting department, with headers, titles, pages, and the useful tabulated quantitative and qualitative information.
From this file I am trying to do two tasks (with read.table and scan):
1) extract the information that is tabulated between | characters, which is the accounting information (every attempt so far has ended in unwieldy data frames or character vectors)
2) include as a variable each subtitle which begins with "Customers" in the text file: as you can see, the customer info is a title, then comes the accounting info (payables), then another customer and its accounting info, and so on. So it is not a column, but a row (?)
I've been trying with read.table (several sep and quote parameters) and with scan, and then tried to work with the resulting character vectors.
Thanks!!
I've been there before so I kind of know what you're going through.
I've got two pieces of news for you, one bad, one good. The bad news is that I have read in these types of files in SAS tons of times but never in R.
The good news is that I can give you some tips so you can work it out in R.
So the strategy is as follows (a rough R sketch is at the end of this answer):
1) You're going to read the file into a data frame that contains only a single column. This column is character and will hold a whole line of your input file, i.e. its length is 80 if the longest line in your file is 80 characters long.
2) Now you have a data frame where every record equals a line in your input file. At this point you may want to check that your data frame has the same number of records as there are lines in your file.
3) Now you can use grep to get rid of, or keep only, those lines that meet your criteria (i.e. subtitles which begin with "Customers"). You may find regular expressions really useful here.
4) Your data frame now only has records that match the 'Customer' pattern and the table patterns (i.e. lines that begin with 'Country', or match /\d{3} \d{8}/, or ' Total').
5) What you need now is to create a group variable that increments by 1 every time it finds 'Customer'. So group=1 will repeat the same value until it finds 'Customer 010343', where group becomes group=2. Or, even better, your group can be the customer id until a new id is found. You need to somehow retain the id until a new id is found.
From the last step you're pretty much done, as you will be able to identify customers and tables pretty easily. You may want to create a function that outputs your table strings in a tabular format.
Whether you process them in a single table or split the data frame into n data frames to process them individually is up to you.
In SAS there is the concept of a pointer (#) and retention (the retain statement), where each line matching a criterion can be processed differently from lines matching other criteria, so the output data set already contains columns and customer info in a tabular format.
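Here is a rough, hedged sketch of that strategy in R, assuming a hypothetical file name "accounting.txt" and customer subtitles that start with "Customer"; the patterns will need adjusting to your real report layout:
lines <- readLines("accounting.txt", warn = FALSE)          # steps 1-2: one record per line
is_customer <- grepl("^Customer", lines)                    # step 3: lines that start a customer block
is_table    <- grepl("\\|", lines)                          # step 4: tabulated accounting lines
lines <- lines[is_customer | is_table]                      # keep only the lines we care about
is_customer <- grepl("^Customer", lines)
grp <- cumsum(is_customer)                                  # step 5: group id bumps at each customer line
customer <- rep(NA_character_, length(lines))
customer[grp > 0] <- lines[is_customer][grp[grp > 0]]       # carry each customer title down to its rows
# Keep only the table rows, split them on "|", and attach the customer they belong to.
tbl <- data.frame(customer = customer[!is_customer],
                  fields   = I(strsplit(trimws(lines[!is_customer]), "\\s*\\|\\s*")),
                  stringsAsFactors = FALSE)
head(tbl)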
Well hope this helps you.
I'm attempting to clean up a database that, over the years, had acquired many duplicate records, with slightly different names. For example, in the companies table, there are names like "Some Company Limited" and "SOME COMPANY LTD!".
My plan was to export the offending tables into R, convert names to lower case, replace common synonyms (like "limited" -> "ltd"), strip out non-alphabetic characters and then use agrep to see what looks similar.
My first problem is that agrep only accepts a single pattern to match, and looping over every company name to match against the others is slow. (Some tables to be cleaned will have tens, possibly hundreds of thousands of names to check.)
I've very briefly looked at the tm package (JSS article), and it seems very powerful but geared towards analysing big chunks of text, rather than just names.
I have a few related questions:
Is the tm package appropriate for this sort of task?
Is there a faster alternative to agrep? (Said function uses the Levenshtein edit distance, which is anecdotally slow.)
Are there other suitable tools in R, apart from agrep and tm?
Should I even be doing this in R, or should this sort of thing be done directly in the database? (It's an Access database, so I'd rather avoid touching it if possible.)
If you're just doing small batches that are relatively well-formed, then the compare.linkage() or compare.dedup() functions in the RecordLinkage package should be a great starting point. But if you have big batches, then you might have to do some more tinkering.
I use the functions jarowinkler(), levenshteinSim(), and soundex() in RecordLinkage to write my own function that uses my own weighting scheme (also, as it stands, you can't use soundex() for big data sets with RecordLinkage).
If I have two lists of names that I want to match ("record link"), then I typically convert both to lower case and remove all punctuation. To take care of "Limited" versus "LTD" I typically create another vector of the first word from each list, which allows extra weighting on the first word. If I think that one list may contain acronyms (maybe ATT or IBM) then I'll acronym-ize the other list. For each list I end up with a data frame of strings that I would like to compare that I write as separate tables in a MySQL database.
So that I don't end up with too many candidates, I LEFT OUTER JOIN these two tables on something that has to match between the two lists (maybe that's the first three letters of each name, or the first three letters of the name plus the first three letters of the acronym). Then I calculate match scores using the above functions.
You still have to do a lot of manual inspection, but you can sort on the score to quickly rule out non-matches.
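A minimal sketch (not the answerer's exact code) of cleaning two name vectors and scoring blocked candidate pairs with RecordLinkage's string comparators; the names here are made up:
library(RecordLinkage)
clean <- function(x) gsub("[^a-z0-9 ]", "", tolower(x))     # lower-case and strip punctuation
a <- clean(c("Some Company Limited", "Acme Widgets Inc."))
b <- clean(c("SOME COMPANY LTD!", "Acme Widgets Incorporated"))
# Block on the first three characters so not every possible pair gets scored.
pairs <- expand.grid(i = seq_along(a), j = seq_along(b))
pairs <- pairs[substr(a[pairs$i], 1, 3) == substr(b[pairs$j], 1, 3), ]
pairs$score <- jarowinkler(a[pairs$i], b[pairs$j])          # or levenshteinSim()
pairs[order(-pairs$score), ]                                # sort so likely matches float to the top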
Maybe Google Refine could help. It looks better suited to cases where you have lots of exceptions and you don't know them all yet.
What you're doing is called record linkage, and it's been a huge field of research over many decades already. Luckily for you, there's a whole bunch of tools out there that are ready-made for this sort of thing. Basically, you can point them at your database, set up some cleaning and comparators (like Levenshtein or Jaro-Winkler or ...), and they'll go off and do the job for you.
These tools generally have features in place to solve the performance issues, so that even though Levenshtein is slow they can run fast because most record pairs never get compared at all.
The Wikipedia link above has links to a number of record linkage tools you can use. I've personally written one called Duke in Java, which I've used successfully for exactly this. If you want something big and expensive you can buy a Master Data Management tool.
In your case probably something like an edit-distance calculation would work, but if you need to find near-duplicates in larger text-based documents, you can try
http://www.softcorporation.com/products/neardup/
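For the edit-distance route in base R, a tiny illustration with adist() (which computes generalized Levenshtein distances); the names below are made up:
names_a <- c("some company ltd", "acme widgets")
names_b <- c("some company limited", "acme widgetts inc")
d <- adist(names_a, names_b)          # matrix of pairwise edit distances
which(d <= 5, arr.ind = TRUE)         # candidate near-duplicates under an arbitrary threshold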