Data importing: delimiter issue in R

I am trying to import a text file into R and put it into a data frame, along with other data.
My delimiter is "|", and a sample of my data is below:
|Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD,
very light load and had 3 seats to myself. A very enthusiastic and friendly crew as usual on this transpacific
route that I take several times a year. Arrived 20 min ahead of schedule. The expected high level of service from
our flag carrier, Air Canada. Altitude Elite member.
|We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited
staffing in Toronto our flight was excellent. Due to the rush in Toronto one of our carry ones was placed to go in
the cargo hold. When we arrived in Winnipeg it stayed in Toronto, they were most helpful and kind at the Winnipeg
airport, and we received 3 phone calls the following day in regards to the misplaced bag and it was delivered to
our home. We are very thankful and more than appreciative of the service we received what a great end to a
wonderful holiday.
|Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which
had no storage whatsoever, and not even any room under the seats. Ridiculous. Crew were poor, not friendly. One
older male member of staff was quite attitudinal, acting as though he was doing everyone a huge favour by serving
them. A reasonable dinner but breakfast was a measly piece of banana loaf. That's it! The worst airline breakfast
I have had.
As you can see, there are many "|" delimiters, but when I imported the data into R it was only separated once, into 2 columns instead of about 152.
How do I get each individual piece of text into a different column of the data frame? I would like a data frame of length 152, not 2.
EDIT: The code lines are:
myData <- read.table("C:/Users/Norbert/Desktop/research/Important files/Airline Reviews/Reviews/air_can_Review.txt",
                     sep = "|", quote = NULL, comment = "", fill = TRUE, header = FALSE)
length(myData)
[1] 2
class(myData)
[1] "data.frame"
str(myData)
'data.frame': 1244 obs. of 2 variables:
$ V1: Factor w/ 1093 levels "","'delayed' on departure (I reference flights between March 2014 and January 2015 in this regard: Denver, SFO,",..: 210 367 698 853 1 344 483 87 757 52 ...
$ V2: Factor w/ 154 levels ""," hotel","5/9/2014, LHR to Vancouver, AC855. 23/9/2014, Vancouver to LHR, AC854. For Economy the leg room was OK compared to",..: 1 1 1 1 78 1 1 1 1 1 ...
myDataFrame <- data.frame(text = myData, otherVar2 = 1, otherVar2 = "blue", stringsAsFactors = FALSE)
str(myDataFrame)
'data.frame': 531 obs. of 3 variables:
$ text : chr "BRU-YUL, May 26th, A330-300. Departed on-time, landed 30 minutes late due to strong winds, nice flight, food" "excellent, cabin-crew smiling and attentive except for one old lady throwing meal trays like boomerangs. Seat-" "pitch was very generous, comfortable seat, IFE a bit outdated but selection was Okay. Air Canadas problem is\nthat the new pro"| __truncated__ "" ...
$ otherVar2 : num 1 1 1 1 1 1 1 1 1 1 ...
$ otherVar2.1: chr "blue" "blue" "blue" "blue" ...
length(myDataFrame)
[1] 3

A better way to read in the text is with scan(), and then put it into a data frame together with your other variables (here I just made some up). Note that I took your text above and pasted it into a file called sample.txt, after removing the leading "|".
myData <- scan("sample.txt", what = "character", sep = "|")
myDataFrame <- data.frame(text = myData, otherVar1 = 1, otherVar2 = "blue",
                          stringsAsFactors = FALSE)
str(myDataFrame)
## 'data.frame': 3 obs. of 3 variables:
## $ text : chr "Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD, very light loa"| __truncated__ "We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited staffing in Toront"| __truncated__ "Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had no storage "| __truncated__
## $ otherVar1: num 1 1 1
## $ otherVar2: chr "blue" "blue" "blue"
The otherVar1 and otherVar2 columns are just placeholders for your own variables, since you said you wanted a data.frame with other variables. I chose a numeric variable and a text variable; by specifying a single value, it gets recycled for all observations in the dataset (here, 3).
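Recycling is a general data.frame behaviour; a minimal standalone sketch (toy values, nothing from the review file):

```r
# A length-1 value is recycled to match the longest column
df <- data.frame(text = c("a", "b", "c"),
                 otherVar1 = 1,         # becomes c(1, 1, 1)
                 otherVar2 = "blue",    # becomes c("blue", "blue", "blue")
                 stringsAsFactors = FALSE)
nrow(df)               # 3
unique(df$otherVar2)   # "blue"
```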
I realize that your question asks how to get each text in a different column, but that is not a good way to use a data.frame, since data.frames are designed to hold variables in columns. (With one text per column, you cannot add other variables.)
If you really want to do that, you have to coerce the data after transposing it, as follows:
myDataFrame <- as.data.frame(t(data.frame(text = myData, stringsAsFactors = FALSE)), stringsAsFactors = FALSE)
str(myDataFrame)
## 'data.frame': 1 obs. of 3 variables:
## $ V1: chr "Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD, very light loa"| __truncated__
## $ V2: chr "We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited staffing in Toront"| __truncated__
## $ V3: chr "Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had no storage "| __truncated__
length(myDataFrame)
## [1] 3
"Measly banana loaf"? Definitely economy class.

Related

fread skip argument not consistent

OK, I know this is not a reproducible example: I only managed to get this error with this specific data.table, which is almost 1 GB, so I don't know how to send it to you. Anyway, I am completely lost. If someone knows what is happening here, please tell me.
I have the original data.table and some others obtained by just changing the skip argument.
> original <- fread('json.csv')
> skip100 <- fread('json.csv', skip = 100, sep = ',')
> skip1000 <- fread('json.csv', skip = 1000, sep = ',')
> skip10000 <- fread('json.csv', skip = 10000, sep = ',')
> str(original)
Classes ‘data.table’ and 'data.frame': 29315 obs. of 7 variables:
$ id : chr "0015023cc06b5362d332b3baf348d11567ca2fbb" "004f0f8bb66cf446678dc13cf2701feec4f36d76" "00d16927588fb04d4be0e6b269fc02f0d3c2aa7b" "0139ea4ca580af99b602c6435368e7fdbefacb03" ...
$ title : chr "The RNA pseudoknots in foot-and-mouth disease virus are dispensable for genome replication but essential for th"| __truncated__ "Healthcare-resource-adjusted vulnerabilities towards the 2019-nCoV epidemic across China" "Real-time, MinION-based, amplicon sequencing for lineage typing of infectious bronchitis virus from upper respiratory samples" "A Combined Evidence Approach to Prioritize Nipah Virus Inhibitors" ...
$ authors : chr "Joseph C Ward,Lidia Lasecka-Dykes,Chris Neil,Oluwapelumi Adeyemi,Sarah , Gold,Niall Mclean,Caroline Wrig"| __truncated__ "Hanchu Zhou,Jiannan Yang,Kaicheng Tang,â\200 ,Qingpeng Zhang,Zhidong Cao,Dirk Pfeiffer,Daniel Dajun Zeng" "Salman L Butt,Eric C Erwood,Jian Zhang,Holly S Sellers,Kelsey Young,Kevin K Lahmers,James B Stanton" "Nishi Kumari,Ayush Upadhyay,Kishan Kalia,Rakesh Kumar,Kanika Tuteja,Rani Paul,Eugenia Covernton,Tina Sh"| __truncated__ ...
$ institution: chr "" "City University of Hong Kong,City University of Hong Kong,City University of Hong Kong,NA,City University of Ho"| __truncated__ "University of Georgia,University of Georgia,University of Georgia,University of Georgia,University of Georgia,V"| __truncated__ "Panjab University,Delhi University,D.A.V. College,CSIR-Institute of Microbial Technology,Panjab University,Univ"| __truncated__ ...
$ country : chr "" "China,China,China,NA,China,China,China,China" "USA,USA,USA,USA,USA,USA,USA" "India,India,India,India,India,India,France,India,NA,India" ...
$ abstract : chr "word count: 194 22 Text word count: 5168 23 24 25 author/funder. All rights reserved. No reuse allowed without "| __truncated__ "" "Infectious bronchitis (IB) causes significant economic losses in the global poultry industry. Control of infect"| __truncated__ "Nipah Virus (NiV) came into limelight recently due to an outbreak in Kerala, India. NiV causes severe disease a"| __truncated__ ...
$ body_text : chr "VP3, and VP0 (which is further processed to VP2 and VP4 during virus assembly) (6). The P2 64 and P3 regions en"| __truncated__ "The 2019-nCoV epidemic has spread across China and 24 other countries 1-3 as of February 8, 2020 . The mass qua"| __truncated__ "Infectious bronchitis (IB), which is caused by infectious bronchitis virus (IBV), is one of the most important "| __truncated__ "Nipah is an infectious negative-sense single-stranded RNA virus which belongs to the genus henipavirus and fami"| __truncated__ ...
- attr(*, ".internal.selfref")=<externalptr>
The number of observations is consistent for skip = 100 and 10000, but not for skip = 1000, as shown below.
> nrow(original)
[1] 29315
> nrow(skip100)
[1] 29215
> nrow(skip1000)
[1] 28316
> nrow(skip10000)
[1] 19315
What is happening?
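Without the 1 GB file this is only a guess, but one common cause fits the symptoms: skip counts physical lines in the file, while a single CSV row can span several physical lines when a quoted field contains embedded newlines — which the long abstract/body_text columns here very likely do. If skipping N lines lands inside such a multi-line record, the row count after skipping won't equal nrow(original) - N. A minimal base-R sketch of the line-vs-row mismatch:

```r
# One quoted field contains a newline: the file has 2 data rows but 4 physical lines
tmp <- tempfile(fileext = ".csv")
writeLines(c('id,text',
             '1,"first line',
             'rest of the same field"',
             '2,plain'), tmp)

dat <- read.csv(tmp, stringsAsFactors = FALSE)
nrow(dat)               # 2 data rows
length(readLines(tmp))  # 4 physical lines
```

Checking whether any field in json.csv contains "\n" (e.g. with grepl() on the body_text column) would confirm or rule this out.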

Splitting a single variable dataframe

I have a CSV file that R reads in as a single variable. I want to split it into 6 columns. I need help.
str(nyt_data)
'data.frame': 3104 obs. of 1 variable:
$ Article_ID.Date.Title.Subject.Topic.Code: Factor w/ 3104 levels "16833;7-Dec-03;Ruse in Toyland: Chinese Workers' Hidden Woe;Chinese Workers Hide Woes for American Inspectors;5",..: 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 ...
nyt_data$Article_ID.Date.Title.Subject.Topic.Code
The result displayed by the above line of code is:
> head(nyt_data$Article_ID.Date.Title.Subject.Topic.Code)
[1] 41246;1-Jan-96;Nation's Smaller Jails Struggle To Cope With Surge in Inmates;Jails overwhelmed with hardened criminals;12
[2] 41257;2-Jan-96;FEDERAL IMPASSE SADDLING STATES WITH INDECISION;Federal budget impasse affect on states;20
[3] 41268;3-Jan-96;Long, Costly Prelude Does Little To Alter Plot of Presidential Race;Contenders for 1996 Presedential elections;20
Please help me with code to split these into 6 separate columns: Article_ID, Date, Title, Subject, Topic, Code.
The data is separated by ";", but read.csv defaults to ",". Simply pass the separator explicitly:
df <- read.csv("your_file.csv", sep = ";")
Just read the CSV file with a custom sep:
data <- read.csv(input_file, sep = ";")
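If the column is already loaded and you would rather not re-read the file, strsplit() can split it in place. A sketch on toy rows shaped like the output above (the sample rows show five ";"-separated fields, so adjust the names to match how Topic and Code are actually encoded):

```r
# Toy stand-in for nyt_data$Article_ID.Date.Title.Subject.Topic.Code
col <- c("41246;1-Jan-96;Title A;Subject A;12",
         "41257;2-Jan-96;Title B;Subject B;20")

parts <- strsplit(as.character(col), ";", fixed = TRUE)  # as.character() handles factors
nyt_split <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(nyt_split) <- c("Article_ID", "Date", "Title", "Subject", "Topic_Code")
nyt_split$Article_ID <- as.integer(nyt_split$Article_ID)
str(nyt_split)
```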

How to properly read in text data

I am just getting started with text analysis in R. Reading in some example text data, I get the following result.
sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
> str(sms_raw)
'data.frame': 5559 obs. of 2 variables:
 $ type         : chr "ham" "ham" "ham" "spam,\"complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to l"| __truncated__ ...
 $ text.........: chr "Hope you are having a good week. Just checking in;;;;;;;;;" "K..give back my thanks.;;;;;;;;;" "Am also doing in cbe only. But have to pay.;;;;;;;;;" "" ...
It seems to me that the variables are not getting separated properly. Analyzing the data further with the head function, I get the following result:
head(sms_raw)
  type
1 ham
2 ham
3 ham
4 spam,"complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out! Box434SK38WP150PPM18+";;;;;;;;;
5 spam
6 ham
  text.........
1 Hope you are having a good week. Just checking in;;;;;;;;;
2 K..give back my thanks.;;;;;;;;;
3 Am also doing in cbe only. But have to pay.;;;;;;;;;
Does anybody have suggestions how to resolve this?
Try data.table::fread("sms_spam.csv", stringsAsFactors = FALSE, sep = ";")
EDIT: you can also read the raw lines and clean them up yourself:
input_file <- readLines("/path/of/sms_spam.csv")
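Continuing the readLines() idea, a sketch that strips the trailing ";" padding and splits each line on the first comma only, so commas inside the message text survive (this assumes exactly two logical fields, type and text, as the str() output above suggests):

```r
# Toy lines shaped like the rows shown in the question
lines <- c("ham,Hope you are having a good week. Just checking in;;;;;;;;;",
           "spam,complimentary 4 STAR Ibiza Holiday, urgent collection;;;;;;;;;")

lines <- sub(";+$", "", lines)                       # drop trailing semicolon padding
sms <- data.frame(type = sub(",.*$", "", lines),     # everything before the first comma
                  text = sub("^[^,]*,", "", lines),  # everything after it
                  stringsAsFactors = FALSE)
sms$type   # "ham"  "spam"
```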

Select by Date or Sort by date on GoogleNewsSource R

I am using the R package tm.plugin.webmining. Using the function GoogleNewsSource(), I would like to query the news sorted by date, and also from a specific date. Is there any parameter to query the news of a specific date?
library(tm)
library(tm.plugin.webmining)
searchTerm <- "Data Mining"
corpusGoog <- WebCorpus(GoogleNewsSource(params=list(hl="en", q=searchTerm,
ie="utf-8", num=10, output="rss" )))
headers <- meta(corpusGoog,tag="datetimestamp")
If you're looking for a data frame-like structure, this is how you'd go about creating it (note: not all fields are extracted from the corpus):
library(dplyr)
make_row <- function(elem) {
  data.frame(timestamp = elem[[2]]$datetimestamp,
             heading = elem[[2]]$heading,
             description = elem[[2]]$description,
             content = elem$content,
             stringsAsFactors = FALSE)
}
dat <- bind_rows(lapply(corpusGoog, make_row))
str(dat)
## Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10 obs. of 4 variables:
## $ timestamp : POSIXct, format: "2015-02-03 13:08:16" "2015-01-11 23:37:45" ...
## $ heading : chr "A guide to data mining with Hadoop - Information Age" "Barack Obama to seek limits on student data mining - Politico" "Is data mining riddled with risk or a natural hazard of the internet? - INTHEBLACK" "Why an obscure British data-mining company is worth $3 billion - Quartz" ...
## $ description: chr "Information AgeA guide to data mining with HadoopInformation AgeWith the advent of the Internet of Things and the transition fr"| __truncated__ "PoliticoBarack Obama to seek limits on student data miningPoliticoPresident Barack Obama on Monday is expected to call for toug"| __truncated__ "INTHEBLACKIs data mining riddled with risk or a natural hazard of the internet?INTHEBLACKData mining is now viewed as a serious"| __truncated__ "QuartzWhy an obscure British data-mining company is worth $3 billionQuartzTesco, the troubled British retail group, is starting"| __truncated__ ...
## $ content : chr "A guide to data mining with Hadoop\nHow businesses can realise and capitalise on the opportunities that Hadoop offers\nPosted b"| __truncated__ "By Stephanie Simon\n1/11/15 6:32 PM EST\nPresident Barack Obama on Monday is expected to call for tough legislation to protect "| __truncated__ "By Adam Courtenay\nData mining is now viewed as a serious security threat, but with all the hype, s"| __truncated__ "How We Buy\nJanuary 12, 2015\nTesco, the troubled British retail group, is starting over. After an accounting scandal , a serie"| __truncated__ ...
Then, you can do anything you want. For example:
dat %>%
arrange(timestamp) %>%
select(heading) %>%
head
## Source: local data frame [6 x 1]
##
## heading
## 1 The potential of fighting corruption through data mining - Transparency International (pre
## 2 Barack Obama to seek limits on student data mining - Politico
## 3 Why an obscure British data-mining company is worth $3 billion - Quartz
## 4 Parks and Rec Recap: Treat Yo Self to Some Data Mining - Indianapolis Monthly
## 5 Fraud and data mining in Vancouverâ\u0080¦just Outside the Lines - Vancouver Sun (blog)
## 6 'Parks and Rec' Data-Mining Episode Was Eerily True To Life - MediaPost Communications
If you want/need something else, you need to be clearer in your question.
I was looking at the Google query string and noticed that it passes startdate and enddate tags in the query if you click the dates on the right-hand side of the page.
You can use the same tag names, and your results will be confined to that start and end date.
GoogleNewsSource(query, params = list(hl = "en", q = query, ie = "utf-8",
                                      start = 0, num = 25, output = "rss",
                                      startdate = '2015-10-26', enddate = '2015-10-28'))

R mistreats a number as a character

I am trying to read a series of text files into R. These files have the same form, or at least appear to. Everything is fine except for one file: when I read it, R treated all the numbers as characters. I used as.numeric to convert them back, but the data values changed. I also tried converting the text file to CSV and then reading it into R, but that did not work either. Has anyone had this problem before? How can I fix it? Thank you!
The data is from Human Mortality Database. I cannot attach the data here due to copyright issue. But everyone can register through HMD and download data (www.mortality.org). As an example, I used Australian and Belgium 1 by 1 exposure data.
My codes are as follows:
AUSe<-read.table("AUS.Exposures_1x1.txt",skip=1,header=TRUE)[,-5]
BELe<-read.table("BEL.Exposures_1x1.txt",skip=1,header=TRUE)[,-5]
Then I want to add some rows of the above data frames together. That works for the Australian data (e.g. AUSe[1,3] + AUSe[2,3]), but an error occurs when the same command is applied to the Belgian data: Error in BELe[1, 3] + BELe[2, 3] : non-numeric argument to binary operator. But if you look at the text file, those are clearly two numbers. Apparently R treated the numbers as characters when reading the text file, which is rather odd.
Try this instead:
BELe<-read.table("BEL.Exposures_1x1.txt",skip=1, colClasses="numeric", header=TRUE)[,-5]
Or you could surely post just a tiny bit of that file without violating any copyright laws, at least in my jurisdiction (which I think is the same one as the Human Mortality Database):
Belgium, Exposure to risk (period 1x1) Last modified: 04-Feb-2011, MPv5 (May07)
Year Age Female Male Total
1841 0 61006.15 62948.23 123954.38
1841 1 55072.53 56064.21 111136.73
1841 2 51480.76 52521.70 104002.46
1841 3 48750.57 49506.71 98257.28
.... . ....
So I might have suggested the even more accurate colClasses:
BELe<-read.table("BEL.Exposures_1x1.txt",skip=2, # really two lines to skip I think
colClasses=c(rep("integer", 2), rep("numeric",3)),
header=TRUE)[,-5]
I suspect the problem occurs because of lines like these:
1842 110+ 0.00 0.00 0.00
So you will need to decide how much interest you have in preserving the "110+" values; with my method they will be coerced to NAs. (Well, I thought they would be, but like you I got an error, so this multi-step process is needed:)
BELe<-read.table("Exposures_1x1.txt",skip=2,
header=TRUE)
BELe[ , 2:5] <- lapply(BELe[ , 2:5], as.character)
str(BELe)
#-------------
'data.frame': 18759 obs. of 5 variables:
$ Year : int 1841 1841 1841 1841 1841 1841 1841 1841 1841 1841 ...
$ Age : chr "0" "1" "2" "3" ...
$ Female: chr "61006.15" "55072.53" "51480.76" "48750.57" ...
$ Male : chr "62948.23" "56064.21" "52521.70" "49506.71" ...
$ Total : chr "123954.38" "111136.73" "104002.46" "98257.28" ...
#-------------
BELe[ , 2:5] <- lapply(BELe[ , 2:5], as.numeric)
#----------
Warning messages:
1: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
2: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
3: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
4: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
str(BELe)
#-----------
'data.frame': 18759 obs. of 5 variables:
$ Year : int 1841 1841 1841 1841 1841 1841 1841 1841 1841 1841 ...
$ Age : num 0 1 2 3 4 5 6 7 8 9 ...
$ Female: num 61006 55073 51481 48751 47014 ...
$ Male : num 62948 56064 52522 49507 47862 ...
$ Total : num 123954 111137 104002 98257 94876 ...
# and just to show that they are not really integers:
BELe$Total[1:5]
#[1] 123954.38 111136.73 104002.46 98257.28 94875.89
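As an aside, this factor round-trip is probably why as.numeric appeared to change the data values in the question: called directly on a factor, as.numeric returns the internal level codes, not the printed labels. A minimal illustration:

```r
f <- factor(c("61006.15", "55072.53"))
as.numeric(f)                # 2 1  -- the level codes, not the numbers
as.numeric(as.character(f))  # 61006.15 55072.53 -- convert via character first
```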
The way I typically read those files is:
BELexp <- read.table("BEL.Exposures_1x1.txt", skip = 2, header = TRUE, na.strings = ".", as.is = TRUE)
Note that Belgium lost 3 years of data in WWI that may never be recovered, and hence these three years are all NAs, which in those files are marked with ".", a character string. Hence the argument na.strings = ".". Specifying that argument will take care of all columns except Age, which is character (intentionally), due to the "110+". The reason the HMD does this is so that users have to be intentional about treatment of the open age group. You can convert the age column to integer using:
BELexp$Age <- as.integer(gsub("[+]", "", BELexp$Age))
Since such issues have long been the bane of R users of the HMD, the HMD has recently posted some R functions in a small but growing package on GitHub called (for now) DemogBerkeley. The function readHMD() removes all of the above headaches:
library(devtools)
install_github("DemogBerkeley", subdir = "DemogBerkeley", username = "UCBdemography")
BELexp <- readHMD("BEL.Exposures_1x1.txt")
Note that a new indicator column, called OpenInterval is added, while Age is converted to integer as above.
Can you try read.csv(..., stringsAsFactors = FALSE)?
