How to properly read in text data

How to properly read in text data - r

I am just getting started with text analysis in r. By reading in some example text data I get the following result.
sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
> str(sms_raw)
'data.frame': 5559 obs. of 2 variables:
$ type : chr "ham" "ham" "ham" "spam,\"complimentary 4 STAR Ibiza
Holiday or Â£10,000 cash needs your URGENT collection. 09066364349 NOW from
Landline not to l"| __truncated__ ...
$ text.........: chr "Hope you are having a good week. Just checking
in;;;;;;;;;" "K..give back my thanks.;;;;;;;;;" "Am also doing in cbe only.
But have to pay.;;;;;;;;;" "" ...
It seems to me as if the variables are not getting seperated properly. Further analyzing the data with the head function I get the following result:
head(sms_raw)
type
1
ham
2
ham
3
ham
4 spam,"complimentary 4 STAR Ibiza Holiday or Â£10,000 cash needs your
URGENT collection. 09066364349 NOW from Landline not to lose out!
Box434SK38WP150PPM18+";;;;;;;;;
5
spam
6
ham
text.........
1
Hope you are having a good week. Just checking in;;;;;;;;;
2
K..give back my thanks.;;;;;;;;;
3
Am also doing in cbe only. But have to pay.;;;;;;;;;
Does anybody have suggestions how to resolve this?

Try data.table::fread("sms_spam.csv", stringsAsFactors = FALSE,sep=";")
EDIT
you can try:
input_file<-readLines("/path/of/sms_spam.csv")

Related

Row binding of arrays loaded from JSON sources in R

I feel this must be an easy issue, but my search fu is failing me, so your assistance is very welcome, and apologies if it is indeed answered elsewhere.
I'm working with JSON data from a REST API (specifically GitHub data for pull requests) which contains nested arrays (in this case, the comments on the PRs, which then nest other things, like the data for the comment author). I use JSONlite::fromJSON to parse this, and I get a dataframe with nested sets of lists and data.frames. Here's a cut-down example of a single row (PR):
jsn = '[
{
"pr":123,
"comment_total":2,
"comments":[
{
"user":{"name":"Me Myself","username":"me"},
"body":"comment 1"
},
{
"user":{"name":"Me Myself","username":"me"},
"body":"comment 2"
}
]
}
]'
This represents a single pull request, which has two comments on it. If I load this with JSONLite, I get 1 row as expected:
> df = jsonlite::fromJSON(jsn)
> str(df)
'data.frame': 1 obs. of 3 variables:
$ pr : int 123
$ comment_total: int 2
$ comments :List of 1
..$ :'data.frame': 2 obs. of 2 variables:
.. ..$ user:'data.frame': 2 obs. of 2 variables:
.. .. ..$ name : chr "Me Myself" "Me Myself"
.. .. ..$ username: chr "me" "me"
.. ..$ body: chr "comment 1" "comment 2"
I'd like to unwrap the first level of this comments column, so that I get one row per PR comment, but I'm struggling to do so. What I'm aiming for is something like:
pr comment_total comments.user comments.body
1 123 2 <data.frame> comment 1
2 123 2 <data.frame> comment 2
I thought tidyr::unnest() would deal with this, but it doesn't seem to like the nested data.frames:
> unnest(df)
Error in bind_rows_(x, .id) :
Argument 1 can't be a list containing data frames
I also looked at purrr::map_dfr which outputs rows, but I can't seem to get that right either - I'm using it to access the data.frame directly, but it's still unhappy:
> map_dfr(df,.id="comments", `[[`,1)
Error in bind_rows_(x, .id) :
Argument 3 can't be a list containing data frames
I'm sure I'm missing something obvious, but I can't see it - someone enlighten me? Thanks!
EDIT: The code I'm using to get the data from GitHub looks like below - if there are better ways to query this, I'm interested.
library(httr)
base_url = 'https://api.github.com/repos/ansible/ansible'
# `pr` comes from a loop, e.g. pr = 38508
issue_url = paste0(base_url,'/issues/',pr,'/comments')
# api_user and api_key are my GitHub credentials
i_resp <- GET(issue_url, authenticate(api_user,api_key))
issue_comments = as.tibble(
jsonlite::fromJSON(
content(i_resp,as="text"),
flatten = TRUE
)
)

One way is to tell jsonlite to not be as helpful as it usually is and then work the unwinding on your own (much like XML processing, heavily nested output from custom API endpoints tend to need some domain/API knowledge to get the data into rectangles):
jsonlite::fromJSON(
txt = jsn,
simplifyVector = FALSE,
simplifyDataFrame = FALSE,
flatten = FALSE
) %>%
map_df(~{ # in case there is more than one
dplyr::as_data_frame(.x) %>%
mutate(
body = map_chr(comments, ~.x$body),
username = map_chr(comments, ~.x$user$name),
name = map_chr(comments, ~.x$user$name)
) %>%
select(-comments)
})
## # A tibble: 2 x 5
## pr comment_total body username name
## <int> <int> <chr> <chr> <chr>
## 1 123 2 comment 1 Me Myself Me Myself
## 2 123 2 comment 2 Me Myself Me Myself

Getting {xml_nodeset (0)} when using html_nodes from rvest package in R

I am trying to scrape headlines off a few news websites using the html_node function and the SelectorGadget but find that some do not work giving the result "{xml_nodeset (0)}". For example the below code gives such result:
url_cnn = 'https://edition.cnn.com/'
webpage_cnn = read_html(url_cnn)
headlines_html_cnn = html_nodes(webpage_cnn,'.cd__headline-text')
headlines_html_cnn
The ".cd__headline-text" I got using the SelectorGadget.
Other websites work such as:
url_cnbc = 'https://www.cnbc.com/world/?region=world'
webpage_cnbc = read_html(url_cnbc)
headlines_html_cnbc = html_nodes(webpage_cnbc,'.headline')
headlines_html_cnbc
Gives a full set of headlines. Any ideas why some websites return the "{xml_nodeset (0)}" result?

Please, please, please stop using Selector Gadget. I know Hadley swears by it but he's 100% wrong. What you see with Selector Gadget is what's been created in the DOM after javascript has been executed and other resources have been loaded asynchronously. Please use "View Source". That's what you get when you use read_html().
Having said that, I'm impressed CNN is as generous as they are (you def can scrape this page) and the content is most certainly on that page, just not rendered (which is likely even better):
Now, that's javascript, not JSON so we'll need some help from the V8 package:
library(rvest)
library(V8)
ctx <- v8()
# get the page source
pg <- read_html("https://edition.cnn.com/")
# find the node with the data in a <script> tag
html_node(pg, xpath=".//script[contains(., 'var CNN = CNN || {};CNN.isWebview')]") %>%
html_text() %>% # get the plaintext
ctx$eval() # sent it to V8 to execute it
cnn <- ctx$get("CNN") # get the data ^^ just created
After exploring the cnn object:
str(cnn[["contentModel"]][["siblings"]][["articleList"]], 1)
## 'data.frame': 55 obs. of 7 variables:
## $ uri : chr "/2018/11/16/politics/cia-assessment-khashoggi-assassination-saudi-arabia/index.html" "/2018/11/16/politics/hunt-crown-prince-saudi-un-resolution/index.html" "/2018/11/15/politics/us-khashoggi-sanctions/index.html" "/2018/11/15/middleeast/jamal-khashoggi-saudi-prosecutor-death-penalty-intl/index.html" ...
## $ headline : chr "<strong>CIA determines Saudi Crown Prince personally ordered journalist's death, senior US official says</strong>" "Saudi crown prince's 'fit' over UN resolution" "US issues sanctions on 17 Saudis over Khashoggi murder" "Saudi prosecutor seeks death penalty for Khashoggi killers" ...
## $ thumbnail : chr "//cdn.cnn.com/cnnnext/dam/assets/181025083025-prince-mohammed-bin-salman-small-11.jpg" "//cdn.cnn.com/cnnnext/dam/assets/181025083025-prince-mohammed-bin-salman-small-11.jpg" "//cdn.cnn.com/cnnnext/dam/assets/181025171830-jamal-khashoggi-small-11.jpg" "//cdn.cnn.com/cnnnext/dam/assets/181025171830-jamal-khashoggi-small-11.jpg" ...
## $ duration : chr "" "" "" "" ...
## $ description: chr "The CIA has determined that Saudi Crown Prince Mohammed bin Salman personally ordered the killing of journalist"| __truncated__ "Multiple sources tell CNN that a much-anticipated United Nations Security Council resolution calling for a cess"| __truncated__ "The Trump administration on Thursday imposed penalties on 17 individuals over their alleged roles in the <a hre"| __truncated__ "Saudi prosecutors said Thursday they would seek the death penalty for five people allegedly involved in the mur"| __truncated__ ...
## $ layout : chr "" "" "" "" ...
## $ iconType : chr NA NA NA NA ...

Data importing Delimiter issue in R

I am trying to import a text file into R, and put it into a data frame, along with other data.
My delimiter is "|" and a sample of my data is here :
|Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD,
very light load and had 3 seats to myself. A very enthusiastic and friendly crew as usual on this transpacific
route that I take several times a year. Arrived 20 min ahead of schedule. The expected high level of service from
our flag carrier, Air Canada. Altitude Elite member.
|We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited
staffing in Toronto our flight was excellent. Due to the rush in Toronto one of our carry ones was placed to go in
the cargo hold. When we arrived in Winnipeg it stayed in Toronto, they were most helpful and kind at the Winnipeg
airport, and we received 3 phone calls the following day in regards to the misplaced bag and it was delivered to
our home. We are very thankful and more than appreciative of the service we received what a great end to a
wonderful holiday.
|Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which
had no storage whatsoever, and not even any room under the seats. Ridiculous. Crew were poor, not friendly. One
older male member of staff was quite attitudinal, acting as though he was doing everyone a huge favour by serving
them. A reasonable dinner but breakfast was a measly piece of banana loaf. That's it! The worst airline breakfast
I have had.
As you can see, there are many "|" , but as this screenshot below shows, when I imported the data in R, it only separated it once, instead of about 152 times.
How do I get each individual piece of text in a different column inside the data frame? I would like a data frame of length 152, not 2.
EDIT: The code lines are:
myData <- read.table("C:/Users/Norbert/Desktop/research/Important files/Airline Reviews/Reviews/air_can_Review.txt", sep="|",quote=NULL, comment='',fill = TRUE, header=FALSE)
length(myData)
[1] 2
class(myData)
[1] "data.frame"
str(myData)
'data.frame': 1244 obs. of 2 variables:
$ V1: Factor w/ 1093 levels "","'delayed' on departure (I reference flights between March 2014 and January 2015 in this regard: Denver, SFO,",..: 210 367 698 853 1 344 483 87 757 52 ...
$ V2: Factor w/ 154 levels ""," hotel","5/9/2014, LHR to Vancouver, AC855. 23/9/2014, Vancouver to LHR, AC854. For Economy the leg room was OK compared to",..: 1 1 1 1 78 1 1 1 1 1 ...
myDataFrame <- data.frame(text = myData, otherVar2 = 1, otherVar2 = "blue", stringsAsFactors = FALSE)
str(myDataFrame)
'data.frame': 531 obs. of 3 variables:
$ text : chr "BRU-YUL, May 26th, A330-300. Departed on-time, landed 30 minutes late due to strong winds, nice flight, food" "excellent, cabin-crew smiling and attentive except for one old lady throwing meal trays like boomerangs. Seat-" "pitch was very generous, comfortable seat, IFE a bit outdated but selection was Okay. Air Canadas problem is\nthat the new pro"| __truncated__ "" ...
$ otherVar2 : num 1 1 1 1 1 1 1 1 1 1 ...
$ otherVar2.1: chr "blue" "blue" "blue" "blue" ...
length(myDataFrame)
[1] 3

A better way to read in the text is using scan(), and then put it into a data frame with your other variables (here I just made some up). Note that I took your text above, and pasted it into a file called sample.txt, after removing the starting "|".
myData <- scan("sample.txt", what = "character", sep = "|")
myDataFrame <- data.frame(text = myData, otherVar2 = 1, otherVar2 = "blue",
stringsAsFactors = FALSE)
str(myDataFrame)
## 'data.frame': 3 obs. of 3 variables:
## $ text : chr "Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD, very light loa"| __truncated__ "We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited staffing in Toront"| __truncated__ "Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had no storage "| __truncated__
## $ otherVar2 : num 1 1 1
## $ otherVar2.1: Factor w/ 1 level "blue": 1 1 1
The otherVar1, otherVar2 are just placeholders for your own variables, as you said you wanted a data.frame with other variables. I chose an integer variable and a text variable, and by specifying a single value, it gets recycled for all observations in the dataset (in the example, 3).
I realize that your question asks how to get each text in a different column, but that is not a good way to use a data.frame, since data.frames are designed to hold variables in columns. (With one text per column, you cannot add other variables.)
If you really want to do that, you have to coerce the data after transposing it, as follows:
myDataFrame <- as.data.frame(t(data.frame(text = myData, stringsAsFactors = FALSE)), stringsAsFactors = FALSE)
str(myDataFrame)
## 'data.frame': 1 obs. of 3 variables:
## $ V1: chr "Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD, very light loa"| __truncated__
## $ V2: chr "We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited staffing in Toront"| __truncated__
## $ V3: chr "Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had no storage "| __truncated__
length(myDataFrame)
## [1] 3
"Measly banana loaf"? Definitely economy class.

Exporting R dataframes in excel spreadsheet [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 8 years ago.
Improve this question
I want to work using the fbRanks package in R, that's a package built to rank football/soccer teams and predict the number of goals during a match.
After I read the [pdf manual][1] describing that package provided by the author, I still don't understand how I can build the dataset in order to start analysing data.
In order to understand how the author built the dataset, I want to export the dataframe he used to do the analysis in an excel spreadsheet.
I already nstalled the xlsx package that should allow to do this, but, when I run
write.xlsx(dataframe_name, "path_in_windows")
R gives an error:
Error in is.data.frame(x) : object 'B00scores' not found
even if the dataframe exists.
Can somebody help me?
I'm not an expert user of R; if the question is not clear, please comment below.

The error suggest that not only have your installed the package but that you jave also loaded it into the current session. You should have offered code that showed how you created the object with the name 'dataframe_name', but we can see from the error that it had the value "B00scores". Again you should have shown how B00scores was created since is does not exist after:
> data('B00data', package='fbRanks')
> str(B00scores)
Error in str(B00scores) : object 'B00scores' not found
As it turns out, you are misspelling the name:
> str(B00.scores)
'data.frame': 2031 obs. of 9 variables:
$ date : Date, format: "2012-05-25" "2012-05-25" ...
$ home.team : chr "PacNW Blue B00" "WA Rush Swoosh B00" "Crossfire Premier B B00" "Normandy Park FC White B00" ...
$ home.score: num 0 1 12 4 1 10 0 4 7 5 ...
$ away.team : chr "Normandy Park FC White B00" "PacNW Maroon B00" "WA Rush Swoosh B00" "WA Rush Nike B00" ...
$ away.score: num 8 7 0 0 8 1 1 0 1 1 ...
$ venue : chr "PacNW Memorial Cup 2012" "PacNW Memorial Cup 2012" "PacNW Memorial Cup 2012" "PacNW Memorial Cup 2012" ...
$ home.adv : chr "neutral" "neutral" "neutral" "neutral" ...
$ away.adv : chr "neutral" "neutral" "neutral" "neutral" ...
$ surface : chr "Grass" "Grass" "Turf" "Grass" ...
So unlike the two voters to close for lack of clarity, I think the close vote should be for misspelling.

R mistreat a number as a character

I am trying to read a series of text files into R. These files are of the same form, at least appear to be of the same form. Everything is fine except one file. When I read that file, R treated all numbers as characters. I used as.numeric to convert back, but the data value changed. I also tried to convert text file to csv and then read into R, but it did not work, either. Did any one have such problem before, please? How to fix it, please? Thank you!
The data is from Human Mortality Database. I cannot attach the data here due to copyright issue. But everyone can register through HMD and download data (www.mortality.org). As an example, I used Australian and Belgium 1 by 1 exposure data.
My codes are as follows:
AUSe<-read.table("AUS.Exposures_1x1.txt",skip=1,header=TRUE)[,-5]
BELe<-read.table("BEL.Exposures_1x1.txt",skip=1,header=TRUE)[,-5]
Then I want to add some rows in the above data frame or matrix. It is fine for Australian data (e.g, AUSe[1,3]+AUSe[2,3]). But error occurred when same command is applied to Belgium data: Error in BELe[1, 3] + BELe[2, 3] : non-numeric argument to binary operator. But if you look at the text file, you know those are two numbers. It is clear that R treated a number as a character when reading the text file, which is rather odd.

Try this instead:
BELe<-read.table("BEL.Exposures_1x1.txt",skip=1, colClasses="numeric", header=TRUE)[,-5]
Or you could surely post just a tiny bit of that file and not violate any copyright laws at least in my jurisdiction (which I think is the same one as The Human Mortality Database).
Belgium, Exposure to risk (period 1x1) Last modified: 04-Feb-2011, MPv5 (May07)
Year Age Female Male Total
1841 0 61006.15 62948.23 123954.38
1841 1 55072.53 56064.21 111136.73
1841 2 51480.76 52521.70 104002.46
1841 3 48750.57 49506.71 98257.28
.... . ....
So I might have suggested the even more accurate colClasses:
BELe<-read.table("BEL.Exposures_1x1.txt",skip=2, # really two lines to skip I think
colClasses=c(rep("integer", 2), rep("numeric",3)),
header=TRUE)[,-5]
I suspect the promlem occurs because of lines like these:
1842 110+ 0.00 0.00 0.00
So you will need to determine how much interest you have in preserving the 110+ values. With my method they will be coerced to NA's. (Well I thought they would be but like you I got an error. So this multi-step process is needed:
BELe<-read.table("Exposures_1x1.txt",skip=2,
header=TRUE)
BELe[ , 2:5] <- lapply(BELe[ , 2:5], as.character)
str(BELe)
#-------------
'data.frame': 18759 obs. of 5 variables:
$ Year : int 1841 1841 1841 1841 1841 1841 1841 1841 1841 1841 ...
$ Age : chr "0" "1" "2" "3" ...
$ Female: chr "61006.15" "55072.53" "51480.76" "48750.57" ...
$ Male : chr "62948.23" "56064.21" "52521.70" "49506.71" ...
$ Total : chr "123954.38" "111136.73" "104002.46" "98257.28" ...
#-------------
BELe[ , 2:5] <- lapply(BELe[ , 2:5], as.numeric)
#----------
Warning messages:
1: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
2: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
3: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
4: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
str(BELe)
#-----------
'data.frame': 18759 obs. of 5 variables:
$ Year : int 1841 1841 1841 1841 1841 1841 1841 1841 1841 1841 ...
$ Age : num 0 1 2 3 4 5 6 7 8 9 ...
$ Female: num 61006 55073 51481 48751 47014 ...
$ Male : num 62948 56064 52522 49507 47862 ...
$ Total : num 123954 111137 104002 98257 94876 ...
# and just to show that tey are not really integers:
BELe$Total[1:5]
#[1] 123954.38 111136.73 104002.46 98257.28 94875.89

The way I typically read those files is:
BELexp <- read.table("BEL.Exposures_1x1.txt", skip = 2, header = TRUE, na.strings = ".", as.is = TRUE)
Note that Belgium lost 3 years of data in WWI that may never be recovered, and hence these three years are all NAs, which in those files are marked with ".", a character string. Hence the argument na.strings = ".". Specifying that argument will take care of all columns except Age, which is character (intentionally), due to the "110+". The reason the HMD does this is so that users have to be intentional about treatment of the open age group. You can convert the age column to integer using:
BELexp$Age <- as.integer(gsub("[+]", "", BELexp$Age))
Since such issues are long the bane of R-HMD users, the HMD has recently posted some R functions in a small but growing package on github called (for now) DemogBerkeley. The function readHMD() removes all of the above headaches:
library(devtools)
install_github("DemogBerkeley", subdir = "DemogBerkeley", username = "UCBdemography")
BELexp <- readHMD("BEL.Exposures_1x1.txt")
Note that a new indicator column, called OpenInterval is added, while Age is converted to integer as above.

Can you try read.csv(... stringsAsFactors=FALSE) ?

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex