How can I tokenize a string in R? - r

I am trying to calculate readability, but it seems everything is written to expect either a file path or a Corpus. How do I handle a string?
Error (on the tokenization step):
Error: Unable to locate
I tried:
str <- c("Readability zero one. Ten, Eleven.", "The cat in a dilapidated tophat.")
library(koRpus)
ll.tagged <- tokenize(str, lang="en")
readability(ll.tagged,measure="Flesch.Kincaid")

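You need to install the language package first, and pass format = "obj" so that tokenize() treats its input as a character vector instead of a file path: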
install.koRpus.lang(c("en"))
library(koRpus.lang.en)
ll.tagged <- tokenize(str, format = "obj", lang = "en")
ll.tagged
   doc_id       token      tag lemma lttr   wclass desc stop stem idx sntc
 1   <NA> Readability word.kRp         11     word <NA> <NA> <NA>   1    1
 2   <NA>        zero word.kRp          4     word <NA> <NA> <NA>   2    1
 3   <NA>         one word.kRp          3     word <NA> <NA> <NA>   3    1
 4   <NA>           .     .kRp          1 fullstop <NA> <NA> <NA>   4    1
 5   <NA>         Ten word.kRp          3     word <NA> <NA> <NA>   5    2
 6   <NA>           ,     ,kRp          1    comma <NA> <NA> <NA>   6    2
[...]
10   <NA>         cat word.kRp          3     word <NA> <NA> <NA>  10    3
11   <NA>          in word.kRp          2     word <NA> <NA> <NA>  11    3
12   <NA>           a word.kRp          1     word <NA> <NA> <NA>  12    3
13   <NA> dilapidated word.kRp         11     word <NA> <NA> <NA>  13    3
14   <NA>      tophat word.kRp          6     word <NA> <NA> <NA>  14    3
15   <NA>           .     .kRp          1 fullstop <NA> <NA> <NA>  15    3
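For a quick cross-check that does not depend on koRpus, the same character vector can be split into word and punctuation tokens with base R. This is only a rough sketch, not what tokenize() does internally: the regular expression and the names `txt`, `tokens` and `rough` are assumptions made for illustration, and no tags or sentence numbers are assigned.

```r
# Input strings taken from the question
txt <- c("Readability zero one. Ten, Eleven.", "The cat in a dilapidated tophat.")

# ASSUMPTION: a token is either a run of letters or a single punctuation mark
matches <- gregexpr("[[:alpha:]]+|[[:punct:]]", txt)
tokens <- unlist(regmatches(txt, matches))

# Token plus letter count, loosely mirroring the token and lttr columns above
rough <- data.frame(token = tokens, lttr = nchar(tokens), stringsAsFactors = FALSE)
print(rough)
```

For this input the sketch finds the same 15 tokens as the koRpus output above, which makes it a cheap way to check that the string was read as text rather than as a file path.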

Related

counting genes: error: the combined objects have no sequence levels in common

I am new to processing RNA-seq data. I have human RNA-seq data and am now trying to count the genes using summarizeOverlaps, but I get this warning for all my files:
In .Seqinfo.mergexy(x, y) :
The 2 combined objects have no sequence levels in common. (Use
suppressWarnings() to suppress this warning.)
This is what I did:
I aligned my RNA-seq files to an Ensembl reference file (Homo_sapiens.GRCh38.cdna.all.fa.gz) and generated BAM files.
seqinfo(bamfiles)
Seqinfo object with 190508 sequences from an unspecified genome:
seqnames seqlengths isCircular genome
ENST00000633009.1 20 <NA> <NA>
ENST00000634070.1 18 <NA> <NA>
ENST00000632963.1 20 <NA> <NA>
ENST00000633030.1 19 <NA> <NA>
ENST00000633765.1 31 <NA> <NA>
... ... ... ...
ENST00000638565.1 1331 <NA> <NA>
ENST00000673346.1 895 <NA> <NA>
ENST00000673247.1 369 <NA> <NA>
ENST00000672305.1 758 <NA> <NA>
ENST00000671911.1 943 <NA> <NA>
I also downloaded the GTF file from Ensembl:
Homo_sapiens.GRCh38.100.gtf.gz
seqinfo(txdb)
Seqinfo object with 47 sequences (1 circular) from an unspecified genome; no seqlengths:
seqnames seqlengths isCircular genome
1 <NA> <NA> <NA>
2 <NA> <NA> <NA>
3 <NA> <NA> <NA>
4 <NA> <NA> <NA>
5 <NA> <NA> <NA>
... ... ... ...
KI270731.1 <NA> <NA> <NA>
KI270733.1 <NA> <NA> <NA>
KI270734.1 <NA> <NA> <NA>
KI270744.1 <NA> <NA> <NA>
KI270750.1 <NA> <NA> <NA>
I am guessing it has something to do with the seqnames, but I am not sure what I have to do. I tried converting it to Ensembl-style:
mapSeqlevels(seqlevels(bamfiles), "Ensembl")
mapSeqlevels(seqlevels(txdb), "Ensembl")
but that did not do anything...
NB: featureCounts does not work either...
Thanks in advance!
Sandra
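The two seqinfo() listings point at the likely cause: the BAM files carry transcript IDs (ENST...) as sequence names, because the reads were aligned to the cDNA FASTA, whereas the GTF-based txdb uses chromosome and scaffold names. A minimal base-R check of that mismatch, using a few names copied from the output above. In a real session the two vectors would presumably come from seqlevels(bamfiles) and seqlevels(txdb); the variable names here are illustrative.

```r
# A few sequence names copied from the seqinfo(bamfiles) output
bam_seqnames <- c("ENST00000633009.1", "ENST00000634070.1", "ENST00000632963.1")

# A few sequence names copied from the seqinfo(txdb) output
gtf_seqnames <- c("1", "2", "3", "KI270731.1", "KI270750.1")

# Names present in both sets; summarizeOverlaps can only count where these agree
common <- intersect(bam_seqnames, gtf_seqnames)
print(length(common))
```

An empty intersection means that renaming with mapSeqlevels() cannot reconcile the two sets. Counting against gene models from the GTF would presumably need reads aligned to the genome rather than to the transcript sequences.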

Dropdown all list and collect data rcurl

From a page like this
https://stackoverflow.com/users/11786778/nathalie?tab=reputation
how can I expand every dropdown row in the reputation table and collect the information that is loaded when each row is opened?
Is this what you want?
library(rvest)
library(magrittr)
library(plyr)
# Process one URL at a time
url <- "https://stackoverflow.com/users/11786778/nathalie?tab=reputation"
# Read the table(s) matched by the XPath expression
pricesdata <- read_html(url) %>% html_nodes(xpath = "//table[1]") %>% html_table(fill = TRUE)
# Combine the list of tables into a single data frame
df <- ldply(pricesdata, data.frame)
Produces:
1 <NA>
2 Take the result of call for a list of ids
3 <NA>
4 <NA>
5 <NA>
6 Add Detailed history
7 <NA>
8 <NA>
9 <NA>
10 <NA>
11 <NA>
12 <NA>
13 <NA>
14 <NA>
15 <NA>
16 <NA>
17 <NA>
18 <NA>
19 <NA>
20 <NA>
21 <NA>
22 <NA>
23 <NA>
24 <NA>
25 <NA>
26 <NA>
27 <NA>
28 <NA>
29 <NA>
30 <NA>
31 <NA>
32 <NA>
33 <NA>
34 <NA>
35 <NA>
36 <NA>
37 <NA>
38 <NA>
39 <NA>
40 <NA>
41 <NA>
42 <NA>
43 <NA>
>

R - split list and merge into the same table

Sorry, the title may not describe this well.
I have a data frame from my Google history.
Original:
> head(testAC)
latitudeE7 longitudeE7 activity
1 247915291 1209946249 NULL
2 248033293 1209803613 NULL
3 248033293 1209803613 1505536182769, IN_VEHICLE, STILL, UNKNOWN, 54, 31, 15
Desired result:
> head(testAC)
latitudeE7|longitudeE7| activityTime|mainactivity| speed
1 247915291| 1209946249| | NULL |
2 248033293| 1209803613| | NULL |
3 248033293| 1209803613|1505536182769| IN_VEHICLE | 54
4 248033293| 1209803613|1505536182769| STILL | 31
5 248033293| 1209803613|1505536182769| UNKNOWN | 15
Row 3 of the original becomes rows 3 to 5 of the result.
The only approach I know is do.call("rbind", testAC$activity),
but that only splits activity; latitudeE7 and longitudeE7 disappear:
> do.call ("rbind", testAC$activity)
timestampMs activity
1 1505536182769 IN_VEHICLE, STILL, UNKNOWN, 54, 31, 15
2 1505536077547 IN_VEHICLE, UNKNOWN, ON_BICYCLE, STILL, 64, 23, 8, 5
I have searched for two days but, perhaps for lack of the right keywords, could not find a solution.
Can anyone explain how to do what I want?
Thank you.
I have uploaded an RData file to Google Drive, in case it helps to see the data:
google drive
How about this:
library(plyr)
# Flatten each nested activity entry into one named vector (NA where activity is NULL)
flat <- lapply(dataAC$activity, function(x) if (!is.null(x)) unlist(lapply(x, unlist)) else NA)
cbind(dataAC[, 1:2], ldply(flat, rbind))
It will give you a data frame instead of nested lists, and then you can reshape it however you want.
latitudeE7 longitudeE7 1 timestampMs activity.type1 activity.type2 activity.type3 activity.confidence1 activity.confidence2
1 247915291 1209946249 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 248033293 1209803613 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 248033293 1209803613 <NA> 1505536182769 IN_VEHICLE STILL UNKNOWN 54 31
4 248002555 1209895254 <NA> 1505536077547 IN_VEHICLE UNKNOWN ON_BICYCLE 64 23
5 247966714 1209957315 <NA> 1505535932508 IN_VEHICLE ON_BICYCLE <NA> 54 46
6 247966714 1209957315 <NA> 1505535825664 <NA> <NA> <NA> <NA> <NA>
activity.confidence3 activity.type4 activity.confidence4 activity.type activity.confidence
1 <NA> <NA> <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA> <NA>
3 15 <NA> <NA> <NA> <NA>
4 8 STILL 5 <NA> <NA>
5 <NA> <NA> <NA> <NA> <NA>
6 <NA> <NA> <NA> TILTING 100
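If the long layout from the question is the goal (one row per activity type), one possible base-R route is to expand each input row separately and stack the pieces. This is a sketch under stated assumptions: the nested layout of `activity` below is a simplified stand-in (NULL, or a list with `timestampMs`, `type` and `confidence`), not the exact structure in the uploaded RData, and `expand_row` is a hypothetical helper name.

```r
# Toy stand-in for the first three rows shown in the question
testAC <- data.frame(
  latitudeE7  = c(247915291, 248033293, 248033293),
  longitudeE7 = c(1209946249, 1209803613, 1209803613)
)
# ASSUMPTION: each activity entry is NULL or a flat list of three fields
testAC$activity <- list(
  NULL,
  NULL,
  list(timestampMs = "1505536182769",
       type = c("IN_VEHICLE", "STILL", "UNKNOWN"),
       confidence = c(54, 31, 15))
)

# Build one small data frame per input row; coordinates are recycled
# so that every activity type gets its own output row
expand_row <- function(i) {
  act <- testAC$activity[[i]]
  has_act <- !is.null(act)
  data.frame(
    latitudeE7   = testAC$latitudeE7[i],
    longitudeE7  = testAC$longitudeE7[i],
    activityTime = if (has_act) act$timestampMs else NA_character_,
    mainactivity = if (has_act) act$type else NA_character_,
    # Column name follows the question; the values are the confidence scores
    speed        = if (has_act) act$confidence else NA_real_,
    stringsAsFactors = FALSE
  )
}

result <- do.call(rbind, lapply(seq_len(nrow(testAC)), expand_row))
print(result)
```

Rows without an activity are kept with NA in the three new columns, so latitudeE7 and longitudeE7 are never dropped. With the real data, the body of `expand_row` would need to be adapted to however `activity` is actually nested.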

extraction function gives a warning message and no data in r

I want to extract values from raster data and add them to the polygons, but I get this warning and the extraction contains only NULL values:
Warning message:
In .local(x, y, ...) :
cannot return a sp object because the data length varies between polygons
What could be the problem? I ran the same extraction last week with the same data and it worked fine. The call is:
expop <- extract(rasterdata, floods1985, small=TRUE, fun=sum, na.rm=TRUE, df=FALSE, nl=1, sp=TRUE)
Data of floods1985 is
head(floods1985)
ID AREA CENTRIODX CENTRIODY DFONUMBER GLIDE__ LINKS OTHER NATIONS
0 1 92620 5.230 35.814 1 <NA> Algeria <NA> <NA>
1 2 678500 -45.349 -18.711 2 <NA> Brazil <NA> <NA>
2 3 12850 122.974 10.021 3 <NA> Philippines <NA> <NA>
3 4 16540 124.606 1.015 4 <NA> Indonesia <NA> <NA>
4 5 20080 32.349 -25.869 5 <NA> Mozambique <NA> <NA>
5 6 1040 43.360 -11.652 6 <NA> Comoros islas <NA> <NA>
X_AFFECTED
0 <NA>
1 <NA>
2 <NA>
3 <NA>
4 <NA>
5 <NA>
AND_RIVERS
0 Northeastern
1 States: Rio de Janeiro, Minas Gerais a Espirito Santo
2 Towns: Tanjay a Pamplona
3 Region: Northern Sulawesi; Towns: Gorontalo Regency
4 Provinces: Natal, Maputo; Rivers: Nkomati, Omati, Maputo, Umbeluzi, Incomati, Limpopo, Pungue, Buzi a Zambezi; Town: Ressano Garcia
5 Isla of Anjouan; Villages: Hassimpao, Marahare, Vouani
RIVERS BEGAN ENDED DAYS DEAD DISPLACED X_USD_ MAIN_CAUSE
0 <NA> 1985/01/01 1985/01/05 5 26 3000 <NA> Heavy rain
1 <NA> 1985/01/15 1985/02/02 19 229 80000 2000000000 Heavy rain
2 <NA> 1985/01/20 1985/01/21 2 43 444 <NA> Brief torrential rain
3 <NA> 1985/02/04 1985/02/18 15 21 300 <NA> Brief torrential rain
4 <NA> 1985/02/09 1985/02/11 3 19 <NA> 3000000 Heavy rain
5 <NA> 1985/02/16 1985/02/28 13 2 35000 5600000 Tropical cyclone
SEVERITY__ SQ_KM X_M___
0 1.0 92620 5.665675
1 1.5 678500 7.286395
2 1.0 12850 4.409933
3 1.0 16540 5.394627
4 1.5 20080 4.955976
5 1.0 1040 4.130977

Reshape aggregated rows to new columns, categorical data

I am trying to use R to aggregate rows to columns. Here is a sample of my dataset.
age sex hash emotion color
22 1 b17f9762462b37e7510f0e6d2534530d Lonely #006666
22 1 b17f9762462b37e7510f0e6d2534530d Energetic #66CC00
22 1 b17f9762462b37e7510f0e6d2534530d Calm #FFFFFF
22 1 b17f9762462b37e7510f0e6d2534530d Angry #FF0000
24 1 7bb50ca97a9b517239b39440a966d2f6 Calm #006666
24 1 7bb50ca97a9b517239b39440a966d2f6 Excited #0033cc
24 1 7bb50ca97a9b517239b39440a966d2f6 Empty/void #999999
24 1 7bb50ca97a9b517239b39440a966d2f6 No emotion #FF6600
26 1 209f1ba8ef86e855deccc0aae120825c Comfortable #330066
21 1 b9e9309c0b1255a7efb2edf9ba66ae46 Energetic #330099
21 1 b9e9309c0b1255a7efb2edf9ba66ae46 Happy #330066
26 1 209f1ba8ef86e855deccc0aae120825c No emotion #FFCC00
26 1 209f1ba8ef86e855deccc0aae120825c Calm #006666
21 1 61debd3dea6d1aacce5c9fc7daec4fe5 Empty/void #FFFFFF
21 1 b9e9309c0b1255a7efb2edf9ba66ae46 Calm #006666
26 1 209f1ba8ef86e855deccc0aae120825c No emotion #339900
21 1 61debd3dea6d1aacce5c9fc7daec4fe5 Loved #FF6600
26 1 209f1ba8ef86e855deccc0aae120825c No emotion #66CC00
What I want to do is get this:
age sex hash #000000 #FF0000 ... #FFFFFF
22 1 8798tkojstwz9ei sad happy ... loved
...
One response is defined by the hash; the associated data are age and sex.
I want each response on one row instead of several. Each color should have its own column, with the associated emotion as the value of that column.
The whole dataset has 13 colors, 20+ emotions and 1000+ responses. It looks exactly like the sample and is stored in a MySQL database.
I have tried reshape, but either it doesn't play well with categorical data or I did not use the appropriate functions. Any ideas? The solution can include some MySQL preparation if needed. Java was very slow for this, and since I have 12k+ rows, R sounds like the right tool.
Thank you.
Using reshape2:
library(reshape2)
dcast(dat, ... ~ color, value.var = "emotion")
age sex hash #0033cc #006666 #330066 #330099 #339900 #66CC00 #999999 #FF0000 #FF6600
1 21 1 61debd3dea6d1aacce5c9fc7daec4fe5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Loved
2 21 1 b9e9309c0b1255a7efb2edf9ba66ae46 <NA> Calm Happy Energetic <NA> <NA> <NA> <NA> <NA>
3 22 1 b17f9762462b37e7510f0e6d2534530d <NA> Lonely <NA> <NA> <NA> Energetic <NA> Angry <NA>
4 24 1 7bb50ca97a9b517239b39440a966d2f6 Excited Calm <NA> <NA> <NA> <NA> Empty <NA> Noemotion
5 26 1 209f1ba8ef86e855deccc0aae120825c <NA> Calm Comfortable <NA> Noemotion Noemotion <NA> <NA> <NA>
#FFCC00 #FFFFFF
1 <NA> Empty
2 <NA> <NA>
3 <NA> Calm
4 <NA> <NA>
5 Noemotion <NA>
If I understand your objective correctly, reshape() is indeed the function you're looking for. Assuming your dataset is called mydf, try this:
reshape(mydf, direction = "wide",
idvar = c("hash", "age", "sex"),
timevar = "color")
# age sex hash emotion.#006666 emotion.#66CC00
# 1 22 1 b17f9762462b37e7510f0e6d2534530d Lonely Energetic
# 5 24 1 7bb50ca97a9b517239b39440a966d2f6 Calm <NA>
# 9 26 1 209f1ba8ef86e855deccc0aae120825c Calm No emotion
# 10 21 1 b9e9309c0b1255a7efb2edf9ba66ae46 Calm <NA>
# 14 21 1 61debd3dea6d1aacce5c9fc7daec4fe5 <NA> <NA>
# emotion.#FFFFFF emotion.#FF0000 emotion.#0033cc emotion.#999999 emotion.#FF6600
# 1 Calm Angry <NA> <NA> <NA>
# 5 <NA> <NA> Excited Empty/void No emotion
# 9 <NA> <NA> <NA> <NA> <NA>
# 10 <NA> <NA> <NA> <NA> <NA>
# 14 Empty/void <NA> <NA> <NA> Loved
# emotion.#330066 emotion.#330099 emotion.#FFCC00 emotion.#339900
# 1 <NA> <NA> <NA> <NA>
# 5 <NA> <NA> <NA> <NA>
# 9 Comfortable <NA> No emotion No emotion
# 10 Happy Energetic <NA> <NA>
# 14 <NA> <NA> <NA> <NA>
You can rename the columns later if you need to.
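As a small runnable illustration of the reshape() call and the renaming step, here is a sketch on four rows copied from the sample data. The name `wide` and the sub() pattern used to strip the "emotion." prefix are illustrative choices, not part of the original answer.

```r
# Four rows copied from the sample in the question
mydf <- data.frame(
  age = c(22, 22, 24, 24),
  sex = c(1, 1, 1, 1),
  hash = c("b17f9762462b37e7510f0e6d2534530d", "b17f9762462b37e7510f0e6d2534530d",
           "7bb50ca97a9b517239b39440a966d2f6", "7bb50ca97a9b517239b39440a966d2f6"),
  emotion = c("Lonely", "Energetic", "Calm", "Excited"),
  color = c("#006666", "#66CC00", "#006666", "#0033cc"),
  stringsAsFactors = FALSE
)

# One row per response (hash), one column per color
wide <- reshape(mydf, direction = "wide",
                idvar = c("hash", "age", "sex"),
                timevar = "color")

# Drop the "emotion." prefix so the columns are named by color only
names(wide) <- sub("^emotion\\.", "", names(wide))
print(wide)
```

Colors that a response never used come out as NA, as in the full output above.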
