When read.table encounters an emoji in a text field, it prematurely inserts several end-of-line breaks, then continues on a new line that begins with the data from the row it was interrupted on.
I have tried various permutations of parameters to read.table and read.delim:
myData <- read.table("myData.tsv", sep = '\t', encoding = "UTF-16", skipNul = TRUE, fill = TRUE, header = TRUE, skip = 3, quote = "", stringsAsFactors = FALSE)
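For comparison, here is an untested sketch of reading the same file with readr instead, assuming the export is actually UTF-8 rather than UTF-16 (readr reads UTF-8 by default and keeps embedded emoji as ordinary characters):
library(readr)
# untested sketch: read the export as UTF-8, mirroring the skip = 3 above
myData <- read_tsv("myData.tsv", skip = 3,
                   locale = locale(encoding = "UTF-8"))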
Replicated using this dataset:
StartDate Q15.5 Q16.5 gc response_order
Start Date Which of these statements best reflect how you feel about [Brand]? [Brand] is _____. "In your own words, why do you feel that [Brand] is [QID32-ChoiceGroup-SelectedChoices]?" gc response_order
"{""ImportId"":""startDate"",""timeZone"":""America/Denver""}" "{""ImportId"":""QID32""}" "{""ImportId"":""QID33_TEXT""}" "{""ImportId"":""gc""}" "{""ImportId"":""response_order""}"
4/4/2019 9:39 Holding its ground i dont really hear much about it but i would assume its holding its ground 1 reversed
4/4/2019 9:37 Probably on its way up 👨🏾🌾😛🤯👨🏾🌾🤯😄🤯😄🤯 1 reversed
4/4/2019 9:29 Probably on its way up Growing company 1 normal
4/4/2019 9:37 Holding its ground "It is mostly geared towards the younger generation, which is good because it calls to new customers. On the other hand, the older generations are moving on to business that more geared towards us." 1 normal
4/4/2019 9:17 Probably on its way up Its well used and good 1 reversed
4/4/2019 9:41 Probably on its way up Its going good 1 normal
4/4/2019 9:38 Definitely on its way up reasons 1 normal
4/4/2019 9:38 Holding its ground It's beginning to look less like a fly by night outfit and more like a responsible company 1 normal
4/4/2019 9:38 Holding its ground "I feel that the company, while providing a useful service, is not constantly working to innovate and continue building upon the product to match the needs of the customer." 1 reversed
4/4/2019 9:37 Definitely on its way up They are a trustworthy company that constantly stays in tune with the technology of today 1 normal
4/4/2019 9:48 Holding its ground I still hear about it 1 normal
Resulting in:
"X....ImportId.....startDate.....timeZone.....America.Denver....","X....ImportId.....QID32....","X....ImportId.....QID33_TEXT....","X....ImportId.....gc....","X....ImportId.....response_order...."
"4/4/2019 9:39","Holding its ground","i dont really hear much about it but i would assume its holding its ground ",1,"reversed"
"4/4/2019 9:37","Probably on its way up","=ØhÜ<Øþß",NA,""
" <Ø>ß=ØÞ>Ø/Ý=ØhÜ<Øþß","","",NA,""
" <Ø>ß>Ø/Ý=ØÞ>Ø/Ý=ØÞ>Ø/Ý","1","reversed",NA,""
"4/4/2019 9:29","Probably on its way up","Growing company",1,"normal"
"4/4/2019 9:37","Holding its ground","""It is mostly geared towards the younger generation, which is good because it calls to new customers. On the other hand, the older generations are moving on to business that more geared towards us.""",1,"normal"
"4/4/2019 9:17","Probably on its way up","Its well used and good",1,"reversed"
"4/4/2019 9:41","Probably on its way up","Its going good",1,"normal"
"4/4/2019 9:38","Definitely on its way up","reasons",1,"normal"
"4/4/2019 9:38","Holding its ground","It's beginning to look less like a fly by night outfit and more like a responsible company",1,"normal"
"4/4/2019 9:38","Holding its ground","""I feel that the company, while providing a useful service, is not constantly working to innovate and continue building upon the product to match the needs of the customer.""",1,"reversed"
"4/4/2019 9:37","Definitely on its way up","They are a trustworthy company that constantly stays in tune with the technology of today",1,"normal"
"4/4/2019 9:48","Holding its ground","I still hear about it ",1,"normal"
I'm trying to load a dataset into RStudio. The dataset itself is space-delimited, but it also contains spaces inside quoted text, as in CSV files. Here is the head of the data:
DOC_ID LABEL RATING VERIFIED_PURCHASE PRODUCT_CATEGORY PRODUCT_ID PRODUCT_TITLE REVIEW_TITLE REVIEW_TEXT
1 __label1__ 4 N PC B00008NG7N "Targus PAUK10U Ultra Mini USB Keypad, Black" useful "When least you think so, this product will save the day. Just keep it around just in case you need it for something."
2 __label1__ 4 Y Wireless B00LH0Y3NM Note 3 Battery : Stalion Strength Replacement 3200mAh Li-Ion Battery for Samsung Galaxy Note 3 [24-Month Warranty] with NFC Chip + Google Wallet Capable New era for batteries Lithium batteries are something new introduced in the market there average developing cost is relatively high but Stallion doesn't compromise on quality and provides us with the best at a low cost.<br />There are so many in built technical assistants that act like a sensor in their particular forté. The battery keeps my phone charged up and it works at every voltage and a high voltage is never risked.
3 __label1__ 3 N Baby B000I5UZ1Q "Fisher-Price Papasan Cradle Swing, Starlight" doesn't swing very well. "I purchased this swing for my baby. She is 6 months now and has pretty much out grown it. It is very loud and doesn't swing very well. It is beautiful though. I love the colors and it has a lot of settings, but I don't think it was worth the money."
4 __label1__ 4 N Office Products B003822IRA Casio MS-80B Standard Function Desktop Calculator Great computing! I was looking for an inexpensive desk calcolatur and here it is. It works and does everything I need. Only issue is that it tilts slightly to one side so when I hit any keys it rocks a little bit. Not a big deal.
5 __label1__ 4 N Beauty B00PWSAXAM Shine Whitening - Zero Peroxide Teeth Whitening System - No Sensitivity Only use twice a week "I only use it twice a week and the results are great. I have used other teeth whitening solutions and most of them, for the same results I would have to use it at least three times a week. Will keep using this because of the potency of the solution and also the technique of the trays, it keeps everything in my teeth, in my mouth."
6 __label1__ 3 N Health & Personal Care B00686HNUK Tobacco Pipe Stand - Fold-away Portable - Light Weight - For Single Pipe not sure I'm not sure what this is supposed to be but I would recommend that you do a little more research into the culture of using pipes if you plan on giving this as a gift or using it yourself.
7 __label1__ 4 N Toys B00NUG865W ESPN 2-Piece Table Tennis PING PONG TABLE GREAT FOR YOUTHS AND FAMILY "Pleased with ping pong table. 11 year old and 13 year old having a blast, plus lots of family entertainment too. Plus better than kids sitting on video games all day. A friend put it together. I do believe that was a challenge, but nothing they could not handle"
8 __label1__ 4 Y Beauty B00QUL8VX6 "Abundant Health 25% Vitamin C Serum with Vitamin E and Hyaluronic Acid for Youthful Looking Skin, 1 fl. oz." Great vitamin C serum "Great vitamin C serum... I really like the oil feeling, not too sticky. I used it last week on some of my recent bug bites and it helps heal the skin faster than normal."
9 __label1__ 4 N Health & Personal Care B004YHKVCM PODS Spring Meadow HE Turbo Laundry Detergent Pacs 77-load Tub wonderful detergent. "I've used tide pods laundry detergent for many years,its such a great detergent to use having a nice scent and leaver the cloths smelling fresh."
The problem is that the file looks tab-delimited but is not. For example, in the row with DOC_ID = 1 there are only two spaces between useful and "When least...". Passing sep = "\t" to read.table therefore throws an error saying that line 1 did not have 10 elements, which seems incorrect to me, because the number of elements should be 9. Here are the parameters that I'm passing (without the original path):
read.table(file = "path", sep ="\t", header = TRUE, strip.white = TRUE)
Relying on quotes is also not a good strategy, because some lines do not have their text quoted. The delimiter should therefore be something like a double space, which combined with strip.white should work properly, but read.table only accepts single-character delimiters.
So the question is: how would you parse such a corpus in R, or with any other third-party software that could convert it to CSV or at least a tab-delimited file?
Parsing the data using Python's pandas.read_csv(filename, sep='\t', header = 0, ...) seems to have parsed the data successfully, and from this point anything can be done with it. Closing this out.
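If you'd rather stay in R, here is a rough, untested sketch of the double-space-as-delimiter workaround described in the question (my own suggestion, not part of the original resolution; it will also split any quoted field that happens to contain two consecutive spaces):
# collapse runs of two or more spaces into tabs, then read the result as a TSV
lines <- readLines("path")
lines <- gsub(" {2,}", "\t", lines)
dat <- read.table(text = lines, sep = "\t", header = TRUE,
                  quote = "\"", strip.white = TRUE, fill = TRUE)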
I have a one-page story (i.e. text data). I need to use a Bayesian network on that story and analyse it. Could someone tell me whether this is possible in R? If yes, how should I proceed?
The objective of the analysis is to extract action descriptions from narrative text.
The data considered for analysis:
Krishna’s Dharam-shasthra to Arjuna:
The Gita is the conversation between Krishna and Arjuna leading up to the battle.
Krishna emphasised on two terms: Karma and Dharma. He told Arjun that this was a righteous war; a war of Dharma. Dharma is the way of righteousness or a set of rules and laws laid down. The Kauravas were on the side of Adharma and had broken rules and laws and hence Arjun would have to do his Karma to uphold Dharma.
Arjuna doesn't want to fight. He doesn't understand why he has to shed his family's blood for a kingdom that he doesn't even necessarily want. In his eyes, killing his evil and killing his family is the greatest sin of all. He casts down his weapons and tells Krishna he will not fight. Krishna, then, begins the systematic process of explaining why it is Arjuna's dharmic duty to fight and how he must fight in order to restore his karma.
Krishna first explains the samsaric cycle of birth and death. He says there is no true death of the soul simply a sloughing of the body at the end of each round of birth and death. The purpose of this cycle is to allow a person to work off their karma, accumulated through lifetimes of action. If a person completes action selflessly, in service to God, then they can work off their karma, eventually leading to a dissolution of the soul, the achievement of enlightenment and vijnana, and an end to the samsaric cycle. If they act selfishly, then they keep accumulating debt, putting them further and further into karmic debt.
What I want is a POS tagger to separate verbs, nouns, etc., and then to create a meaningful network from them.
The steps that should be followed in pre-processing are:
syntactic processing (POS tagging)
an SRL algorithm (semantic role labelling of the story's characters)
coreference resolution
Using all of the above I need to create a knowledge database and create a Bayesian network.
This is what I have tried so far to get a POS tagger:
txt <- c("As the years went by, they remained isolated in their city. Their numbers increased by freeing women from slavery.
Doom would come to the world in the form of Ares the god of war and the Son of Zeus. Ares was unhappy with the gods as he wanted to prove just how foul his father’s creation was. Hence, he decided to corrupt the mortal men created by Zeus. Fearing his wrath upon the world Zeus decided to create the God killer in order to stop Ares. He then commanded Hippolyta to mould a baby from the sand and clay of the island. Then the five goddesses went back into the Underworld, drawing out the last soul that remained in the Well and giving it incredible powers. The soul was merged with the clay and became flesh. Hippolyta had her daughter and named her Diana, Princess of the Amazons, the first child born on Paradise Island.
Each of the six members of the Greek Pantheon granted Diana a gift: Demeter, great strength; Athena, wisdom and courage; Artemis, a hunter's heart and a communion with animals; Aphrodite, beauty and a loving heart; Hestia, sisterhood with fire; Hermes, speed and the power of flight. Diana was also gifted with a sword, the Lasso of truth and the bracelets of penance as weapons to defeat Ares.
The time arrived when Diana, protector of the Amazons and mankind was sent to the Man's World to defeat Ares and rid the mortal men off his corruption. Diana believed that only love could truly rid the world of his influence. Diana was successfully able to complete the task she was sent out by defeating Ares and saving the world.
")
writeLines(txt, tf <- tempfile())
library(stringi)
library(cleanNLP)
cnlp_init_tokenizers()
anno <- cnlp_annotate(tf)
names(anno)
get_token(anno)
cnlp_init_spacy()
anno <- cnlp_annotate(tf)
get_token(anno)
cnlp_init_corenlp()
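For the POS-tagging step on its own, here is an untested sketch using the udpipe package instead (udpipe is my substitution, not something the question uses); it returns one row per token with a upos column that can be filtered for verbs and nouns:
library(udpipe)
# download and load an English model (done once), then annotate the story text
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)
anno <- as.data.frame(udpipe_annotate(ud_model, x = txt))
# keep only verbs and nouns as candidate nodes for the network
subset(anno, upos %in% c("VERB", "NOUN"),
       select = c(doc_id, sentence_id, token, lemma, upos))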
Can anyone please help me import angle-bracket data into R from a Unix executable file? It seems to be XML, so I tried an XML parser, but it failed.
I have attached a sample file.
Thanks in advance.
https://drive.google.com/file/d/0B97ow4h4jwHcRTVtWHdudDJ0c1k/view?usp=sharing
There are unescaped '&' characters inside elements of your XML document.
One example is below:
<DOC>
<DATE>01/07/2009</DATE>
<AUTHOR>Debce</AUTHOR>
<TEXT>I have owned my MDX for about 1 1/2 yrs & have loved every minute of driving the 24k problem free miles on it! It is so much fun to drive; looks & feels luxurious so no problem pulling up to upscale places! I didn't want to give up space to pop things in the back and go so I keep the third seat down & purchased the rubber mat for the back. I have plenty of room while at the same time I am "zippy"; easily pulling into parking spaces and getting around town. I love the navigation system, although it does need updating and the bluetooth is wonderful, although for some reason it keeps unhooking my Treo phone which the Acura people say is the phone's fault. LOVE IT & would buy it again.</TEXT>
<FAVORITE>Large storage area, hands free phone with the bluetooth & voice recognition is safe. The heaviness of it feels safe and large interior is very comfortable. </FAVORITE>
</DOC>
These '&' characters should be escaped. The characters
'<'
'>'
'&'
are special in XML and must be escaped (as &lt;, &gt;, and &amp;); quote characters must likewise be escaped (as &apos; and &quot;) when they appear inside attribute values.
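If you want to keep using the strict XML parser rather than the lenient HTML parser used below, one option (my own sketch, not part of the original answer) is to escape the bare ampersands before parsing:
raw <- readLines("/temp/2007_acura_mdx", warn = FALSE)
# replace any '&' that does not already start an entity or character reference
fixed <- gsub("&(?![A-Za-z]+;|#[0-9]+;)", "&amp;", raw, perl = TRUE)
# write the repaired text to a new file (the name here is arbitrary)
writeLines(fixed, "/temp/2007_acura_mdx_fixed.xml")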
Here is a way of extracting the data into a character matrix.
> require(XML)
> x <- htmlParse("/temp/2007_acura_mdx")
>
> # get the 'DOC'
> docs <- getNodeSet(x, "//doc")
>
> # display one
> docs[[1]]
<doc>
<date>07/31/2009</date>
<author>FlewByU</author>
<text>I just moved to Germany two months ago and bought an 07 MDX from another military member. It has everything I could want. We just returned from a week driving through the Alps and this SUV is simply amazing. Granted, I get to drive it much faster than I could in the states, but even at 120 MPH, it was rock solid. We need the AWD for the snow and the kids stay entertained with the AV system. Plenty of passing power and very comfortable on long trips. Acuras are rare in Germany and I get stares all the time by curious Bavarians wondering what kind of vehicle I have. If you are in the market for a luxury SUV for family touring, with cool tech toys to play with, MDX can't be beat. </text>
<favorite>The separate controls for the rear passengers are awesome. I can control temp and AV from the front or switch to rear. Sound system is amazing. I will sometimes sit in the driveway and just listen. Also has a 120v outlet in console. Great for us since we live with 220v and need 120 on occasion. </favorite>
</doc>
>
> # process docs getting all fields -- need to transpose
> results <- t(sapply(docs, function(x) xmlSApply(x, xmlValue)))
>
> # show head
> head(results)
date author
[1,] "07/31/2009" "FlewByU"
[2,] "07/30/2009" "cvillemdx"
[3,] "06/22/2009" "Pleased"
[4,] "04/13/2009" "wasatch7"
[5,] "04/06/2009" "mnozek"
[6,] "01/07/2009" "Debce"
text
[1,] "I just moved to Germany two months ago and bought an 07 MDX from another military member. It has everything I could want. We just returned from a week driving through the Alps and this SUV is simply amazing. Granted, I get to drive it much faster than I could in the states, but even at 120 MPH, it was rock solid. We need the AWD for the snow and the kids stay entertained with the AV system. Plenty of passing power and very comfortable on long trips. Acuras are rare in Germany and I get stares all the time by curious Bavarians wondering what kind of vehicle I have. If you are in the market for a luxury SUV for family touring, with cool tech toys to play with, MDX can't be beat. "
[2,] "After months of careful research and test drives at BMW, Lexus, Volvo, etc. I settled on the MDX without a doubt in mind. I love the way the car handles, no stiffness or resistance in the steering or acceleration. The interior design is a little Star Trek for me, but once I figured everything out, it is a pleasure to have all the extras (XM radio, navigation, Bluetooth, backup camera, etc.)"
[3,] "I'm two years into a three year lease and I love this car. The only thing I would change would be the shape of the grill...THAT'S IT. Everything else is perfect. Great performance, plenty of power and AWD when skiing, plenty of room for baggage, great MPG for an SUV, navi system is far superior to GM's Suburban (don't have to put in park to change your destination, etc). Zero problems...just gas and oil changes. One beautiful car...except for the sho-gun shield looking grill."
[4,] "First luxury crossover SUV I have owned. MDX won out over the Lexus, and cost less for a very well equipped base package. Handling, power and ride are outstanding. Back seats are a little less comfortable for my tall teenagers. Back cargo area is very roomy, and easily expandable with 3rd seat folded and back seats down. I drive up snowy, often treacherous mountain canyons to ski in the winter. The SH-AWD system, coupled with the manual shift mode (for descents), is outstanding. The MDX is much better in the snow than 3 truck base SUVs, I have owned previously. "
[5,] "This is the first Japanese SUV we have had in a while. Last SUV's were Yukon XL and Envoy XL. This beats them out by far. Performs almost as well as our Mercedes e class but has the utility of our Envoy. We always take this on trips and it is very comfortable. The third row is great for smaller children but not so much for adults. Best SUV so far. No problems within our almost 2 years ownership."
[6,] "I have owned my MDX for about 1 1/2 yrs & have loved every minute of driving the 24k problem free miles on it! It is so much fun to drive; looks & feels luxurious so no problem pulling up to upscale places! I didn't want to give up space to pop things in the back and go so I keep the third seat down & purchased the rubber mat for the back. I have plenty of room while at the same time I am \"zippy\"; easily pulling into parking spaces and getting around town. I love the navigation system, although it does need updating and the bluetooth is wonderful, although for some reason it keeps unhooking my Treo phone which the Acura people say is the phone's fault. LOVE IT & would buy it again."
favorite
[1,] "The separate controls for the rear passengers are awesome. I can control temp and AV from the front or switch to rear. Sound system is amazing. I will sometimes sit in the driveway and just listen. Also has a 120v outlet in console. Great for us since we live with 220v and need 120 on occasion. "
[2,] "The self-adjusting side mirrors which rotate to give you a view of the curb/lines as you back up. Makes backing into parking spaces and parallel parking a breeze, along with the back-up camera. Also a fan of the push-to-talk for my cell phone."
[3,] "Navi is easy, hands-free is great, AWD is perfect."
[4,] "AWD system, exterior styling, cargo room"
[5,] "Navigation, sound system, bluetooth, comfort, acceleration, performance, all wheel drive ability."
[6,] "Large storage area, hands free phone with the bluetooth & voice recognition is safe. The heaviness of it feels safe and large interior is very comfortable. "
I have a data file with angle brackets, from http://kavita-ganesan.com/opinosis-opinion-dataset.
<DOCNO>2007_acura_mdx</DOCNO>
<DOC>
<DATE>07/31/2009</DATE>
<AUTHOR>FlewByU</AUTHOR>
<TEXT>I just moved to Germany two months ago and bought an 07 MDX from another military member. It has everything I could want. We just returned from a week driving through the Alps and this SUV is simply amazing. Granted, I get to drive it much faster than I could in the states, but even at 120 MPH, it was rock solid. We need the AWD for the snow and the kids stay entertained with the AV system. Plenty of passing power and very comfortable on long trips. Acuras are rare in Germany and I get stares all the time by curious Bavarians wondering what kind of vehicle I have. If you are in the market for a luxury SUV for family touring, with cool tech toys to play with, MDX can't be beat. </TEXT>
<FAVORITE>The separate controls for the rear passengers are awesome. I can control temp and AV from the front or switch to rear. Sound system is amazing. I will sometimes sit in the driveway and just listen. Also has a 120v outlet in console. Great for us since we live with 220v and need 120 on occasion. </FAVORITE>
</DOC>
<DOC>
<DATE>07/30/2009</DATE>
<AUTHOR>cvillemdx</AUTHOR>
<TEXT>After months of careful research and test drives at BMW, Lexus, Volvo, etc. I settled on the MDX without a doubt in mind. I love the way the car handles, no stiffness or resistance in the steering or acceleration. The interior design is a little Star Trek for me, but once I figured everything out, it is a pleasure to have all the extras (XM radio, navigation, Bluetooth, backup camera, etc.)</TEXT>
<FAVORITE>The self-adjusting side mirrors which rotate to give you a view of the curb/lines as you back up. Makes backing into parking spaces and parallel parking a breeze, along with the back-up camera. Also a fan of the push-to-talk for my cell phone.</FAVORITE>
</DOC>
<DOC>
<DATE>06/22/2009</DATE>
<AUTHOR>Pleased</AUTHOR>
<TEXT>I'm two years into a three year lease and I love this car. The only thing I would change would be the shape of the grill...THAT'S IT. Everything else is perfect. Great performance, plenty of power and AWD when skiing, plenty of room for baggage, great MPG for an SUV, navi system is far superior to GM's Suburban (don't have to put in park to change your destination, etc). Zero problems...just gas and oil changes. One beautiful car...except for the sho-gun shield looking grill.</TEXT>
<FAVORITE>Navi is easy, hands-free is great, AWD is perfect.</FAVORITE>
</DOC>
It seems like an XML file, but when I tried the following:
library(XML)
xml.url <- "2007_acura_mdx"
xmlfile <- xmlTreeParse(xml.url)
class(xmlfile)
xmltop <- xmlRoot(xmlfile)
topxml <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
xml_df <- data.frame(t(topxml), row.names = NULL)
I had a problem when I executed data.frame(). Can anyone help me? At this point I would consider using grep() and gsub(), but that is not easy either.
Try this:
txt <- "<DOCNO>2007_acura_mdx</DOCNO>
<DOC>
<DATE>07/31/2009</DATE>
<AUTHOR>FlewByU</AUTHOR>
<TEXT>I just moved to Germany two months ago and bought an 07 MDX from another military member. It has everything I could want. We just returned from a week driving through the Alps and this SUV is simply amazing. Granted, I get to drive it much faster than I could in the states, but even at 120 MPH, it was rock solid. We need the AWD for the snow and the kids stay entertained with the AV system. Plenty of passing power and very comfortable on long trips. Acuras are rare in Germany and I get stares all the time by curious Bavarians wondering what kind of vehicle I have. If you are in the market for a luxury SUV for family touring, with cool tech toys to play with, MDX can't be beat. </TEXT>
<FAVORITE>The separate controls for the rear passengers are awesome. I can control temp and AV from the front or switch to rear. Sound system is amazing. I will sometimes sit in the driveway and just listen. Also has a 120v outlet in console. Great for us since we live with 220v and need 120 on occasion. </FAVORITE>
</DOC>
<DOC>
<DATE>07/30/2009</DATE>
<AUTHOR>cvillemdx</AUTHOR>
<TEXT>After months of careful research and test drives at BMW, Lexus, Volvo, etc. I settled on the MDX without a doubt in mind. I love the way the car handles, no stiffness or resistance in the steering or acceleration. The interior design is a little Star Trek for me, but once I figured everything out, it is a pleasure to have all the extras (XM radio, navigation, Bluetooth, backup camera, etc.)</TEXT>
<FAVORITE>The self-adjusting side mirrors which rotate to give you a view of the curb/lines as you back up. Makes backing into parking spaces and parallel parking a breeze, along with the back-up camera. Also a fan of the push-to-talk for my cell phone.</FAVORITE>
</DOC>
<DOC>
<DATE>06/22/2009</DATE>
<AUTHOR>Pleased</AUTHOR>
<TEXT>I'm two years into a three year lease and I love this car. The only thing I would change would be the shape of the grill...THAT'S IT. Everything else is perfect. Great performance, plenty of power and AWD when skiing, plenty of room for baggage, great MPG for an SUV, navi system is far superior to GM's Suburban (don't have to put in park to change your destination, etc). Zero problems...just gas and oil changes. One beautiful car...except for the sho-gun shield looking grill.</TEXT>
<FAVORITE>Navi is easy, hands-free is great, AWD is perfect.</FAVORITE>
</DOC>"
library(XML)
# wrap the document fragments in a single root element so the result is well-formed XML
txt2 <- paste("<root>", txt, "</root>")
doc <- xmlTreeParse(txt2, asText = TRUE, useInternalNodes = TRUE)
# extract each DOC node as a named list of its child element values
L <- xpathApply(doc, "//DOC", xmlApply, FUN = xmlValue)
# bind the per-document lists into one data frame
dd <- do.call(rbind, lapply(L, as.data.frame, stringsAsFactors = FALSE))
giving:
> str(dd)
'data.frame': 3 obs. of 4 variables:
$ DATE : chr "07/31/2009" "07/30/2009" "06/22/2009"
$ AUTHOR : chr "FlewByU" "cvillemdx" "Pleased"
$ TEXT : chr "I just moved to Germany two months ago and bought an 07 MDX from another military member. It has everything I could want. We ju"| __truncated__ "After months of careful research and test drives at BMW, Lexus, Volvo, etc. I settled on the MDX without a doubt in mind. I lov"| __truncated__ "I'm two years into a three year lease and I love this car. The only thing I would change would be the shape of the grill...THAT"| __truncated__
$ FAVORITE: chr "The separate controls for the rear passengers are awesome. I can control temp and AV from the front or switch to rear. Sound sy"| __truncated__ "The self-adjusting side mirrors which rotate to give you a view of the curb/lines as you back up. Makes backing into parking sp"| __truncated__ "Navi is easy, hands-free is great, AWD is perfect."
I'm new to Unix; however, I have recently realized that very simple Unix commands can do very simple things to large data sets very, very quickly. My question is: why are these Unix commands so fast relative to R?
Let's begin by assuming that the data is big, but not larger than the amount of RAM on your computer.
Computationally, I understand that Unix commands are likely faster than their R counterparts. However, I can't imagine that this would explain the entire time difference. After all, basic R functions, like Unix commands, are written in low-level languages such as C/C++.
I therefore suspect that the speed gains have to do with I/O. While I only have a basic understanding of how computers work, I do understand that to manipulate data it must first be read from disk (assuming the data is local). This is slow. However, regardless of whether you use R functions or Unix commands to manipulate the data, both must obtain it from disk.
Therefore I suspect that how data is read from disk, if that even makes sense, is what is driving the time difference. Is that intuition correct?
Thanks!
UPDATE: Sorry for being vague. This was done on purpose; I was hoping to discuss the idea in general rather than focus on a specific example.
Regardless, I'll generate an example: counting the number of rows.
First I'll generate a big data set.
row <- 1e7
col <- 50
df <- matrix(rpois(row * col, 1), row, col)
write.csv(df, "df.csv")
Doing it with Unix
time wc -l df.csv
real 0m12.261s
user 0m1.668s
sys 0m2.589s
Doing it with R
library(data.table)
system.time({ nrow(fread("df.csv")) })
...
user system elapsed
26.77 1.67 47.07
Notice that elapsed (real) time is greater than user + system time. This suggests that the CPU is waiting on the disk.
I suspected the slow speed of R has to do with reading the data in. It appears that I'm right:
system.time(fread("df.csv"))
user system elapsed
34.69 2.81 47.41
My question is: how is the I/O different between Unix and R, and why?
I'm not sure what operations you're talking about, but in general, more complex processing systems like R use more complex internal data structures to represent the data being manipulated, and constructing these data structures can be a big bottleneck, significantly slower than the simple lines, words, and characters that Unix commands like grep tend to operate on.
Another factor (depending on how your scripts are set up) is whether you're processing the data one thing at a time, in "streaming mode", or reading everything into memory. Unix commands tend to be written to operate in pipelines, and to read a small piece of data (usually one line), process it, maybe write out a result, and move on to the next line. If, on the other hand, you read the entire data set into memory before processing it, then even if you do have enough RAM, allocating and organizing all the necessary memory can be very expensive.
[updated in response to your additional information]
Aha. So you were asking R to read the whole file into memory at once. That accounts for much of the difference. Let's talk about a few more things.
I/O. We can think about three ways of reading characters from a file, especially if the style of processing we're doing affects the way that's most convenient to do the reading.
Unbuffered small, random reads. We ask the operating system for 1 or a few characters at a time, and process them as we read them.
Unbuffered large, block-sized reads. We ask the operating system for big chunks of data -- usually of a size like 1k or 8k -- and chew on each chunk in memory before asking for the next chunk.
Buffered reads. Our programming language gives us a way of asking for as many characters as we want out of an intermediate buffer, and code that's built into the language ("library" code) automatically takes care of keeping that buffer full by reading large, block-sized chunks from the operating system.
Now, the important thing to know is that the operating system would much rather read big, block-sized chunks. So #1 can be drastically slower than 2 and 3. (I've seen factors of 10 or 100.) But no well-written programs use #1, so we can pretty much forget about it. As long as you're using 2 or 3, the I/O speed will be roughly the same. (In extreme cases, if you know what you're doing, you can get a little efficiency increase by using 2 instead of 3, if you can.)
Now let's talk about the way each program processes the data. wc has basically 5 steps:
Read characters one at a time. (I can assure you it uses method 3.)
For each character read, add one to the character count.
If the character read was a newline, add one to the line count.
If the character read marks a transition into or out of a word (i.e., was or wasn't a word-separator character), update the word count.
At the very end, print out the counts of lines, words, and/or characters, as requested.
So as you can see it's all I/O and very simple, character-based processing. (The only step that's at all complicated is 4. As an exercise, I once wrote a version of wc that contrived not to do all of steps 2, 3, and 4 inside the read loop if the user didn't ask for all the counts. My version did indeed run significantly faster if you invoked wc -c or wc -l. But obviously the code was significantly more complicated.)
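To make the contrast concrete, here is a rough R sketch (my illustration, not from the question) of the same "read block-sized chunks and count newline bytes, with no parsing" approach:
# count lines the way wc -l does: read raw chunks and count newline bytes
count_lines <- function(path, chunk_size = 65536L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  n <- 0
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    n <- n + sum(chunk == as.raw(10L))  # 10 is the byte value of '\n'
  }
  n
}
count_lines("df.csv")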
In the case of R, on the other hand, things are quite a bit more complicated. First, you told it to read a CSV file. So as it reads, it has to find the newlines separating lines and the commas separating columns. That's roughly equivalent to the processing that wc has to do. But then, for each number that it finds, it has to convert it into an internal number that it can work with efficiently. For example, if somewhere in the CSV file occurs the sequence
...,12345,...
R is going to have to read those digits (as individual characters) and then do the equivalent of the math problem
1 * 10000 + 2 * 1000 + 3 * 100 + 4 * 10 + 5 * 1
to get the value 12345.
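In R terms, that digit-by-digit conversion amounts to something like this (purely illustrative):
digits <- c(1, 2, 3, 4, 5)
sum(digits * 10^(rev(seq_along(digits)) - 1))  # 12345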
But there's more. You asked R to build a table. A table is a specific, highly regular data structure which orders all the data into rigid rows and columns for efficient lookup. To see how much work that can be, let's use a slightly far-fetched hypothetical real-world example.
Suppose you're a survey company and it's your job to ask people walking by on the street certain questions. But suppose that the questions are complicated enough that you need all the people seated in a classroom at once. (Suppose further that the people don't mind this inconvenience.)
But first you have to build that classroom. You're not sure how many people are going to walk by, so you build an ordinary classroom, with room for 5 rows of 6 desks for 30 people, and you haul in the desks, and the people start filing in, and after 30 people file in you notice there's a 31st, so what do you do? You could ask him to stand in the back, but you're kind of fixated on the rigid-rows-and-columns idea, so you ask the 31st person to wait, and you quickly call the builders and ask them to build a second 30-person classroom right next to the first, and now you can accept the 31st person and in fact 29 more for a total of 60, but then you notice a 61st person.
So you ask him to wait, and you call the builders back again, and you have them build two more classrooms, so now you've got a nice 2x2 grid of 30-person classrooms, but the people keep coming and soon enough the 121st person shows up and there's not enough room and you still haven't even started asking your survey questions yet.
So you call some fancier builders that know how to do steelwork and you have them build a big 5-story building next door with 50-person classrooms, 5 on each floor, for a total of 50 x 5 x 5 = 1,250 desks, and you have the first 120 people (who've been waiting patiently) file out of the old rooms into the new building, and now there's room for the 121st person and quite a few more behind him, and you hire some wreckers to demolish the old classrooms and recycle some of the materials, and the people keep coming and pretty soon there's 1,250 people in your new building waiting to be surveyed and the 1,251st has just showed up.
So you build a giant new skyscraper with 1,000 desks on each floor and 100 floors, and you demolish the old 5-story building, but the people keep coming, and how big did you say your big data set was? 1e7 x 50? So I don't think the 100-story building is going to be big enough, either. (And when you're all done with all this, the only "survey question" you're going to ask is "How many rows are there?")
Contrived as it may seem, this is actually not too bad an analogy for what R is having to do internally to build the table to store that data set in.
Meanwhile, Bob's discount survey company, who can only tell you how many people he surveyed and how many were men and women and in which age brackets, is down there on the streetcorner, and the people are filing by, and Bob is jotting down tally marks on his clipboards, and the people, once surveyed, are walking away and going about their business, and Bob isn't wasting time and money building any classrooms at all.
I don't know anything about R, but see if there's a way to construct an empty 1e7 x 50 matrix up front, and read the CSV file into it. You might find that significantly quicker. R will still have to do some building, but at least it won't have any false starts.
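In that spirit, here is a rough sketch of the "allocate up front, then fill" idea (my own illustration; it assumes the CSV was written without row names, e.g. write.csv(df, "df.csv", row.names = FALSE)):
n_row <- 1e7
n_col <- 50
# scan() reads the numbers into one flat vector with no table-building,
# and the matrix is then allocated exactly once
vals <- scan("df.csv", what = integer(), sep = ",", skip = 1)  # skip the header row
m <- matrix(vals, nrow = n_row, ncol = n_col, byrow = TRUE)
nrow(m)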