R tm package DataframeSource import

I am reading a CSV into R and want to build a corpus from it with the tm package, but I am not getting the desired results. Currently, when I read in a CSV of text and then inspect the corpus, the data is all numeric. (I included only the first three columns of data to protect privacy; there are nine, as shown in the inspect results.)
library(tm)
data <- read.csv("filename.csv")
head(data)
  Directory.Code First.Name Last.Name
1        SCA0025      Nbcde   Cdbaace
2        SCA0025    AJCocei   aiceice
3        SCA0025      aceca   Ac;eice
4        SCA0025     Acoicm   aie;cee
5        SCA0025       acei   aciomac
6        SCA0025      caeij    CIMCEv
data.corp <- Corpus(DataframeSource(data))
inspect(data.corp[1])
A corpus with 1 text document
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
$`1`
16
2195
6655
6613
1
5
9757
1
1
If it helps to know the purpose: I am trying to read in a csv of names and un-normalized job titles/descriptions, then compare to a corpus of known titles/descriptions as categories. Now that I type this in, I realize that this csv will be my test/prediction data, but I still want to build a corpus from a csv with colnames = KnownJobTitle,Description.
The goal of this question is to successfully read a CSV into a corpus, but I would also like to know if it is advisable to use the tm package for more than 2 categorizations, and/or if there are other packages more suited to this task.

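I get a similar result. It happens because the text fields read from the CSV are stored as factors (categorical) instead of character vectors, so the corpus ends up holding the underlying integer codes. You need to first convert those columns to character using something like: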
data <- data.frame(lapply(data, as.character), stringsAsFactors=FALSE)
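A minimal sketch of the conversion described above, using only base R. The inline CSV text is hypothetical sample data standing in for filename.csv, and the variable names are made up for illustration. It assumes the factor behaviour is forced with stringsAsFactors = TRUE, which was the read.csv default before R 4.0.0.

```r
# Hypothetical sample standing in for filename.csv
csv_text <- "Directory.Code,First.Name,Last.Name
SCA0025,Nbcde,Cdbaace
SCA0025,acei,aciomac"

# Read with factors, mimicking the behaviour described in the question
as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
sapply(as_factors, class)            # every column is a factor
as.integer(as_factors$First.Name)    # integer codes, not the names themselves

# Option 1: convert every column to character after reading (as in the answer)
converted <- data.frame(lapply(as_factors, as.character),
                        stringsAsFactors = FALSE)

# Option 2: never create factors in the first place
as_chars <- read.csv(text = csv_text, stringsAsFactors = FALSE)

sapply(converted, class)             # every column is character
```

Either option leaves plain character columns, which is what a text corpus should be built from.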

Related

Excel is changing (12.3%) to -12.3% when I save a .csv file

I've been formatting data in R to automate the process of formatting a table for a manuscript. This means that I've been less concerned with tidy data and analysis and more concerned with getting the bits and pieces of a data frame where I want them. I formatted my data frame in R to look like this (more or less):
         screening Disease_2015
1            Total (n = 12,345)
2         Screened       12,345
3 Screened_Percent      (12.3%)
4         Positive        1,234
5     Positive_Pct         1.2%
write.csv (table, "U:\\My Documents\\Table.csv", row.names = FALSE)
I'm mostly concerned about the parentheses in row 3. They show up in the data frame in R, but not in Excel. This is the code I was using to add them in to specific columns.
addparentheses <- function(x){paste("(", x, ")", sep = "")}
Table$column <- addparentheses(Table$column)
When I open the file in Excel to make some finishing touches, the value in row 3 shows up as -12.3% instead of (12.3%).
I've done this process with a few other data frames and this is the first time that it is replacing the () with a -. How can I prevent this from happening?

Selecting following characters after a pattern match in R

I have a data frame with one column of text containing info that I need to extract. Each question has three attributes associated with it: objectives, KeyResults and responsible. Here is one observation from that column:
[{"text":"Newideas.","translationKey":"new.question-4","id":4,"objectives":"Great","KeyResults":"Awesome","responsible":"myself"},{"text":"customer focus.","translationKey":"new.question-5","id":5,"objectives":"Goalset","KeyResults":"Amazing","responsible":"myself"}
-------------------------DESIRED OUTPUT -----------------------
Question# Objectives KeyResults responsible Question# Objectives KeyResults responsible
4 Great Awesome myself 5 Goalset Amazing myself
The data is valid JSON, except that it is missing the closing square bracket ]. You can read JSON into an R object using a JSON parser package (e.g. jsonlite).
Say your text is in the column text of the data frame df; then this will transform that text into an R data frame.
library(jsonlite)
dat <- fromJSON(df$text)
dat
#              text translationKey id objectives KeyResults responsible
# 1       Newideas. new.question-4  4      Great    Awesome      myself
# 2 customer focus. new.question-5  5    Goalset    Amazing      myself
You need to install jsonlite for this to work:
install.packages("jsonlite")
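As a rough sketch of the whole path from the raw string to the requested columns, assuming jsonlite is installed: the names raw_text and wanted, and the column label Question, are made up for illustration. The string is the observation from the question, with the missing closing bracket appended only when it is absent.

```r
library(jsonlite)

# The observation from the question, still missing its closing bracket
raw_text <- '[{"text":"Newideas.","translationKey":"new.question-4","id":4,"objectives":"Great","KeyResults":"Awesome","responsible":"myself"},{"text":"customer focus.","translationKey":"new.question-5","id":5,"objectives":"Goalset","KeyResults":"Amazing","responsible":"myself"}'

# Append the closing square bracket only if it is missing
if (!endsWith(raw_text, "]")) {
  raw_text <- paste0(raw_text, "]")
}

# A JSON array of objects is parsed into a data frame, one row per question
dat <- fromJSON(raw_text)

# Keep only the attributes asked for; id serves as the question number
wanted <- dat[, c("id", "objectives", "KeyResults", "responsible")]
names(wanted)[1] <- "Question"
wanted
```

This gives one row per question rather than the single wide row in the desired output; reshaping to wide format would be a separate step.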

Checking for number of items in a string in R

I have a very large csv file (1.4 million rows). It is supposed to have 22 fields and 21 commas in each row. It was created by taking quarterly text files and compiling them into one large text file so that I could import into SQL. In the past, one field was not in the file. I don't have the time to go row by row and check for this.
In R, is there a way to verify that each row has 22 fields or 21 commas? Below is a small sample data set. The possibly missing field is the 0 in the 10th slot.
32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1
32,01,01,01,01,01,000001,123,456,0,132,345,456,456,789,235,256,88,5,1,2,1
You can use the base R function count.fields to do this:
count.fields(tmp, sep=",")
[1] 22 22
The input for this function is the name of a file or a connection. Below, I supplied a textConnection. For large files, you would probably want to feed this into table:
table(count.fields(tmp, sep=","))
Note that this can also be used to count the number of rows in a file using length, similar to the output of wc -l in the *nix OSs.
data
tmp <- textConnection(
"32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1
32,01,01,01,01,01,000001,123,456,0,132,345,456,456,789,235,256,88,5,1,2,1"
)
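To locate the offending rows rather than just tabulate the counts, something like the following sketch may help. It writes a hypothetical two-line file to a temporary path (the second line has the 10th field removed on purpose) and assumes the expected width is 22 fields, as stated in the question. The names tmp_file, n_fields and bad_rows are illustrative.

```r
# Hypothetical test file: line 1 has 22 fields, line 2 is missing the 10th field
tmp_file <- tempfile(fileext = ".csv")
writeLines(c(
  "32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1",
  "32,01,01,01,01,01,000001,123,456,132,345,456,456,789,235,256,88,5,1,2,1"
), tmp_file)

# Number of fields on each line of the file
n_fields <- count.fields(tmp_file, sep = ",")

# Summary of how many lines have each field count
table(n_fields)

# Line numbers that do not have the expected 22 fields
bad_rows <- which(n_fields != 22)
bad_rows

# Total number of lines in the file
length(n_fields)
```

Note that count.fields treats quotes and the comment character # specially by default, so for files containing those characters the quote and comment.char arguments may need adjusting.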
Assuming df is your dataframe
apply(df, 1, length)
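This will give you the length of each row. Note that once the file has been read into a data frame, every row has the same number of columns, so this check is only informative on the raw file.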

R readr package - written and read in file doesn't match source

I apologize in advance for the partial lack of reproducibility here. I am doing an analysis on a very large (for me) dataset. It is from the CMS Open Payments database.
There are four files I downloaded from that website, read into R using readr, manipulated a bit to make them smaller (column removal), and then stuck together using rbind. I would like to write my pared-down file out to an external hard drive so I don't have to read in all the data and pare it down each time I want to work on it. (Obviously, it's all scripted, but it takes about 45 minutes to do this, so I'd like to avoid it if possible.)
So I wrote out the data and read it back in, but now I am getting different results. Below is about as close as I can get to a good example. The data is named sa_all. There is a column in the table for the source. It can only take on two values: gen or res. It is a column that is added as part of the analysis, not one that comes with the data.
table(sa_all$src)
gen res
14837291 822559
So I save the sa_all dataframe into a CSV file.
write.csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv',
row.names = FALSE)
Then I open it:
sa_all2 <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
table(sa_all2$src)
g gen res
1 14837289 822559
I did receive the following parsing warnings.
Warning: 4 parsing failures.
row col expected actual
5454739 pmt_nature embedded null
7849361 src delimiter or quote 2
7849361 src embedded null
7849361 NA 28 columns 54 columns
Since I manually add the src column and it can only take on two values, I don't see how this could cause any parsing errors.
Has anyone had any similar problems using readr? Thank you.
Just to follow up on the comment:
write_csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv')
sa_all2a <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
Warning: 83 parsing failures.
row col expected actual
1535657 drug2 embedded null
1535657 NA 28 columns 25 columns
1535748 drug1 embedded null
1535748 year an integer No
1535748 NA 28 columns 27 columns
Even more parsing errors and it looks like some columns are getting shuffled entirely:
table(sa_all2a$src)
         100000000278         Allergan Inc.                   gen GlaxoSmithKline, LLC.
                    1                     1              14837267                     1
                   No                   res
                    1                822559
There are columns for manufacturer names and it looks like those are leaking into the src column when I use the write_csv function.

Creating a vector from a file in R

I am new to R and my question should be trivial. I need to create a word cloud from a txt file containing the words and their occurrence counts. For that purpose I am using the snippets package.
As can be seen at the bottom of the link, first I have to create a vector (is it right that words is a vector?) like below.
> words <- c(apple=10, pie=14, orange=5, fruit=4)
My problem is to do the same thing but create the vector from a file which would contain words and their occurrence number. I would be very happy if you could give me some hints.
Moreover, to understand the format of the file to be inserted I write the vector words to a file.
> write(words, file="words.txt")
However, the file words.txt contains only the values but not the names(apple, pie etc.).
$ cat words.txt
10 14 5 4
Thanks.
words is a named vector, the distinction is important in the context of the cloud() function if I read the help correctly.
Write the data out correctly to a file:
write.table(words, file = "words.txt")
Create your word-occurrence file in the same format as the txt file this produces. When you read it back into R, you need to do a little manipulation:
> newWords <- read.table("words.txt", header = TRUE)
> newWords
x
apple 10
pie 14
orange 5
fruit 4
> words <- newWords[,1]
> names(words) <- rownames(newWords)
> words
apple pie orange fruit
10 14 5 4
What we are doing here is reading the file into newWords, then subsetting it to take the one and only column (variable), which we store in words. The last step is to take the row names from the file read in and apply them as the "names" on the words vector. We do the last step using the names() function.
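If you are creating the word-occurrence file yourself, a two-column layout may be easier to produce by hand. The sketch below assumes hypothetical column names word and count and a temporary file path; it is a variation on the steps above, not the snippets package's required format. setNames() builds the named vector in one step.

```r
words <- c(apple = 10, pie = 14, orange = 5, fruit = 4)

# Write a two-column file: one word and its count per line
words_file <- tempfile(fileext = ".txt")
write.table(data.frame(word = names(words), count = words),
            file = words_file, row.names = FALSE, quote = FALSE)

# Read it back, keeping the words as character strings
counts <- read.table(words_file, header = TRUE, stringsAsFactors = FALSE)

# Rebuild the named vector: the values are the counts, the names are the words
restored <- setNames(counts$count, counts$word)
restored
```

The counts come back as integers rather than doubles, which should make no difference for plotting.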
Yes, 'vector' is the proper term.
EDIT:
A better method than write.table would be to use save() and load():
save(words, file="svwrd.rda")
load(file="svwrd.rda")
The save/load combo preserves all the structure rather than doing coercion. The write.table followed by names()<- is kind of a hassle, as you can see in both Gavin's answer here and my answer on rhelp.
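A small round-trip sketch of that claim, using temporary file paths chosen for illustration. It also shows saveRDS()/readRDS(), a base R alternative not mentioned above, which stores a single object and lets you pick the name it is read back into.

```r
words <- c(apple = 10, pie = 14, orange = 5, fruit = 4)
original <- words

# save()/load(): the object comes back under its original name
rda_file <- tempfile(fileext = ".rda")
save(words, file = rda_file)
rm(words)
load(rda_file)                  # recreates `words` in the current environment
identical(words, original)      # names and values both survive the round trip

# saveRDS()/readRDS(): one object per file, assigned to any name you like
rds_file <- tempfile(fileext = ".rds")
saveRDS(original, rds_file)
words_again <- readRDS(rds_file)
identical(words_again, original)
```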
Initial answer:
I suggest you use as.data.frame to coerce to a data frame and then write.table() to write to a file.
write.table(as.data.frame(words), file="savew.txt")
saved <- read.table(file="savew.txt")
saved
words
apple 10
pie 14
orange 5
fruit 4
