I am trying to extract score information for each ID and for each item ID. Here is how my sample dataset looks:
df <- data.frame(Text_1 = c("Scoring", "1 = Incorrect","Text1","Text2","Text3","Text4", "Demo 1: Color Naming","Amarillo","Azul","Verde","Azul",
"Demo 1: Errors","Item 1: Color naming","Amarillo","Azul","Verde","Azul",
"Item 1: Time in seconds","Item 1: Errors",
"Item 2: Shape Naming","Cuadrado/Cuadro","Cuadrado/Cuadro","Círculo","Estrella","Círculo","Triángulo",
"Item 2: Time in seconds","Item 2: Errors"),
School.2 = c("Teacher:","DC Name:","Date (mm/dd/yyyy):","Child Grade:","Student Study ID:",NA, NA,NA,NA,NA,NA,
0,"1 = Incorrect responses",0,1,NA,NA,NA,0,"1 = Incorrect responses",0,NA,NA,1,1,0,NA,0),
X_Elementary_School..3 = c("Bill:","X District","10/7/21","K","123-2222-2:",NA, NA,NA,NA,NA,NA,
NA,"Child response",NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA,NA,NA),
School.4 = c("Teacher:","DC Name:","Date (mm/dd/yyyy):","Child Grade:","Student Study ID:",NA, 0,NA,1,NA,NA,0,"1 = Incorrect responses",0,1,NA,NA,120,0,"1 = Incorrect responses",NA,1,0,1,NA,1,110,0),
Y_Elementary_School..2 = c("John:","X District","11/7/21","K","112-1111-3:",NA, NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA, NA,NA))
> df
Text_1 School.2 X_Elementary_School..3 School.4 Y_Elementary_School..2
1 Scoring Teacher: Bill: Teacher: John:
2 1 = Incorrect DC Name: X District DC Name: X District
3 Text1 Date (mm/dd/yyyy): 10/7/21 Date (mm/dd/yyyy): 11/7/21
4 Text2 Child Grade: K Child Grade: K
5 Text3 Student Study ID: 123-2222-2: Student Study ID: 112-1111-3:
6 Text4 <NA> <NA> <NA> <NA>
7 Demo 1: Color Naming <NA> <NA> 0 <NA>
8 Amarillo <NA> <NA> <NA> <NA>
9 Azul <NA> <NA> 1 <NA>
10 Verde <NA> <NA> <NA> <NA>
11 Azul <NA> <NA> <NA> <NA>
12 Demo 1: Errors 0 <NA> 0 <NA>
13 Item 1: Color naming 1 = Incorrect responses Child response 1 = Incorrect responses Child response
14 Amarillo 0 <NA> 0 <NA>
15 Azul 1 <NA> 1 <NA>
16 Verde <NA> <NA> <NA> <NA>
17 Azul <NA> <NA> <NA> <NA>
18 Item 1: Time in seconds <NA> <NA> 120 <NA>
19 Item 1: Errors 0 <NA> 0 <NA>
20 Item 2: Shape Naming 1 = Incorrect responses Child response 1 = Incorrect responses Child response
21 Cuadrado/Cuadro 0 <NA> <NA> <NA>
22 Cuadrado/Cuadro <NA> <NA> 1 <NA>
23 Círculo <NA> <NA> 0 <NA>
24 Estrella 1 <NA> 1 <NA>
25 Círculo 1 <NA> <NA> <NA>
26 Triángulo 0 <NA> 1 <NA>
27 Item 2: Time in seconds <NA> <NA> 110 <NA>
28 Item 2: Errors 0 <NA> 0 <NA>
This sample dataset is limited to two schools, two teachers, and two students.
In this step, I need to extract student responses for each item.
Wherever the first column contains "Item", I need to start grabbing from there. I especially need to index the rows and columns programmatically rather than hard-coding the exact row and column numbers, since this will run over multiple data files and each file contains different information. There is no need to grab the "...: Errors" rows.
################################################################################
# ## 2-extract the score information here
# ## 1-grab item information from where "Item 1:.." starts
Here, rather than using hard-coded row numbers, how can I automate this part?
score <- df[c(7:11, 13:17, 20:26), seq(2, ncol(df), 2)]  # need to automate these row and column indices
score <- as.data.frame(t(score))
rownames(score) <- seq_len(nrow(score))
colnames(score) <- paste0("i", seq_len(ncol(score)))      # assign column names for items
score <- apply(score, 2, as.numeric)                      # coerce to numeric; non-numeric entries become NA
score <- as.data.frame(score)
score$total <- rowSums(score, na.rm = TRUE); score        # create a total score
> score
i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 i15 i16 i17 total
1 NA NA NA NA NA NA 0 1 NA NA NA 0 NA NA 1 1 0 3
2 0 NA 1 NA NA NA 0 1 NA NA NA NA 1 0 1 NA 1 5
Additionally, I need to add an ID column, which I could not achieve here.
My desired output would be:
> score
ID i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 i15 i16 i17 total
1 123-2222-2 NA NA NA NA NA NA 0 1 NA NA NA 0 NA NA 1 1 0 3
2 112-1111-3 0 NA 1 NA NA NA 0 1 NA NA NA NA 1 0 1 NA 1 5
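A minimal sketch of how the indexing could be automated, assuming every file keeps the layout of this sample: the score columns are the ones containing the label "Student Study ID:" (with the actual ID one column to their right), the item block starts at the first "Demo ...:" / "Item ...:" header in the first column, and the "Errors" / "Time in seconds" rows are dropped (matching the hard-coded 7:11, 13:17, 20:26 selection above). The grepl patterns are assumptions based on this sample.

# find the item rows without hard-coding row numbers
item_rows <- grep("^(Demo|Item) ", df[[1]])
start_row <- min(item_rows)
drop_rows <- grep("Errors|Time in seconds", df[[1]])
keep_rows <- setdiff(start_row:nrow(df), drop_rows)

# find the score columns and the student IDs without hard-coding column numbers
score_cols <- which(apply(df, 2, function(x) any(grepl("Student Study ID", x))))
id_cols    <- score_cols + 1                                  # IDs sit one column to the right
id_row     <- which(grepl("Student Study ID", df[[score_cols[1]]]))[1]

score <- as.data.frame(t(df[keep_rows, score_cols]))
colnames(score) <- paste0("i", seq_len(ncol(score)))
score[] <- lapply(score, function(x) suppressWarnings(as.numeric(x)))  # non-numeric -> NA
score$total <- rowSums(score, na.rm = TRUE)

ids   <- sub(":$", "", unlist(df[id_row, id_cols]))           # strip the trailing ":"
score <- cbind(ID = ids, score)
rownames(score) <- NULL
score

On this sample it reproduces the desired output above without any hard-coded row or column numbers.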
From a page like this
https://stackoverflow.com/users/11786778/nathalie?tab=reputation
how is it possible to expand the full list in the reputation table and retrieve the information that is only loaded dynamically as the page loads?
Is this what you want?
library(rvest)
library(magrittr)
library(plyr)
# Doing URLs one by one
url <- "https://stackoverflow.com/users/11786778/nathalie?tab=reputation"
## Get the reputation table
pricesdata <- read_html(url) %>% html_nodes(xpath = "//table[1]") %>% html_table(fill = TRUE)
df <- ldply(pricesdata, data.frame)
Produces:
1 <NA>
2 Take the result of call for a list of ids
3 <NA>
4 <NA>
5 <NA>
6 Add Detailed history
[... rows 7 through 43 are all <NA> ...]
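Note that read_html() only sees the initial page source; the rest of the reputation table is loaded on demand by JavaScript, which is why most rows come back as <NA>. One alternative (just a sketch; the endpoint and parameters are assumptions to verify against the Stack Exchange API documentation) is to request the reputation history directly from the API as JSON:

library(jsonlite)

# Assumed endpoint -- see https://api.stackexchange.com/docs for the exact
# parameters; responses are paged, so increase `page` until `has_more` is FALSE.
api_url <- paste0("https://api.stackexchange.com/2.3/users/11786778/",
                  "reputation-history?site=stackoverflow&pagesize=100&page=1")
rep_page <- fromJSON(api_url)
head(rep_page$items)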
I am trying to calculate readability, but it seems everything is written to expect either a file path or a Corpus. How do I handle a string?
Error (on the tokenization step):
Error: Unable to locate
I tried:
str<-"Readability zero one. Ten, Eleven.", "The cat in a dilapidated tophat."
library(koRpus)
ll.tagged <- tokenize(str, lang="en")
readability(ll.tagged,measure="Flesch.Kincaid")
You need to install the language support package first:
install.koRpus.lang(c("en"))
library(koRpus.lang.en)
ll.tagged <- tokenize(str, format = "obj", lang = "en")
ll.tagged
doc_id token tag lemma lttr wclass desc stop stem idx sntc
1 <NA> Readability word.kRp 11 word <NA> <NA> <NA> 1 1
2 <NA> zero word.kRp 4 word <NA> <NA> <NA> 2 1
3 <NA> one word.kRp 3 word <NA> <NA> <NA> 3 1
4 <NA> . .kRp 1 fullstop <NA> <NA> <NA> 4 1
5 <NA> Ten word.kRp 3 word <NA> <NA> <NA> 5 2
6 <NA> , ,kRp 1 comma <NA> <NA> <NA> 6 2
[...]
10 <NA> cat word.kRp 3 word <NA> <NA> <NA> 10 3
11 <NA> in word.kRp 2 word <NA> <NA> <NA> 11 3
12 <NA> a word.kRp 1 word <NA> <NA> <NA> 12 3
13 <NA> dilapidated word.kRp 11 word <NA> <NA> <NA> 13 3
14 <NA> tophat word.kRp 6 word <NA> <NA> <NA> 14 3
15 <NA> . .kRp 1 fullstop <NA> <NA> <NA> 15 3
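With the tokenized object, the readability call from the question should then run. (In recent koRpus versions the argument is index rather than measure, if I recall the API correctly, and results based on tokenize() alone, without POS tagging via treetag(), are only rough estimates.)

readability(ll.tagged, index = "Flesch.Kincaid")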
Sorry, the title may not describe this well.
I have a data frame from my Google location history.
Original:
> head(testAC)
latitudeE7 longitudeE7 activity
1 247915291 1209946249 NULL
2 248033293 1209803613 NULL
3 248033293 1209803613 1505536182769, IN_VEHICLE, STILL, UNKNOWN, 54, 31, 15
Desired result:
> head(testAC)
latitudeE7|longitudeE7| activityTime|mainactivity| speed
1 247915291| 1209946249| | NULL |
2 248033293| 1209803613| | NULL |
3 248033293| 1209803613|1505536182769| IN_VEHICLE | 54
4 248033293| 1209803613|1505536182769| STILL | 31
5 248033293| 1209803613|1505536182769| UNKNOWN | 15
Row 3 of the original becomes rows 3 to 5 in the result.
I only know do.call("rbind", testAC$activity), but that only splits the activity column; latitudeE7 and longitudeE7 disappear:
> do.call ("rbind", testAC$activity)
timestampMs activity
1 1505536182769 IN_VEHICLE, STILL, UNKNOWN, 54, 31, 15
2 1505536077547 IN_VEHICLE, UNKNOWN, ON_BICYCLE, STILL, 64, 23, 8, 5
I have been searching for two days, but I may not know the right keywords and cannot find an answer.
Can anyone explain how to do what I want?
Thank you.
I have an .Rdata file uploaded to Google Drive, in case it helps to know more about the data:
google drive
How about this:
library(plyr)
cbind(dataAC[, 1:2],
      ldply(lapply(dataAC$activity,
                   function(x) if (!is.null(x)) unlist(lapply(x, unlist)) else NA),
            rbind))
It will give you a data frame instead of nested lists, and then you can reshape it however you want (see the sketch after the output below).
latitudeE7 longitudeE7 1 timestampMs activity.type1 activity.type2 activity.type3 activity.confidence1 activity.confidence2
1 247915291 1209946249 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 248033293 1209803613 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 248033293 1209803613 <NA> 1505536182769 IN_VEHICLE STILL UNKNOWN 54 31
4 248002555 1209895254 <NA> 1505536077547 IN_VEHICLE UNKNOWN ON_BICYCLE 64 23
5 247966714 1209957315 <NA> 1505535932508 IN_VEHICLE ON_BICYCLE <NA> 54 46
6 247966714 1209957315 <NA> 1505535825664 <NA> <NA> <NA> <NA> <NA>
activity.confidence3 activity.type4 activity.confidence4 activity.type activity.confidence
1 <NA> <NA> <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA> <NA>
3 15 <NA> <NA> <NA> <NA>
4 8 STILL 5 <NA> <NA>
5 <NA> <NA> <NA> <NA> <NA>
6 <NA> <NA> <NA> TILTING 100
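As a follow-up, here is a rough sketch of one way to reshape that flattened data frame into the long format from the question, using tidyr. The column names such as activity.type1 / activity.confidence1 are taken from the output above; treat this as an untested sketch.

library(dplyr)
library(tidyr)

# 'flat' is the data frame returned by the cbind()/ldply() call above
flat <- cbind(dataAC[, 1:2],
              plyr::ldply(lapply(dataAC$activity,
                                 function(x) if (!is.null(x)) unlist(lapply(x, unlist)) else NA),
                          rbind))

long <- flat %>%
  pivot_longer(matches("^activity\\.(type|confidence)"),
               names_to = c(".value", "slot"),
               names_pattern = "activity\\.(type|confidence)(\\d*)") %>%
  filter(!is.na(type)) %>%                     # note: this drops rows with no recorded activity
  transmute(latitudeE7, longitudeE7,
            activityTime = timestampMs,
            mainactivity = type,
            speed = confidence)
head(long)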
I want to do an extraction from raster data and add the information to the polygons, but I get this warning and the extraction contains only NULL values:
Warning message:
In .local(x, y, ...) :
cannot return a sp object because the data length varies between polygons
What could be the problem? I did the same extraction last week with the same information and it worked fine. The call is:
expop <- extract(rasterdata, floods1985, small=TRUE, fun=sum, na.rm=TRUE, df=FALSE, nl=1, sp=TRUE)
The data in floods1985 looks like this:
head(floods1985)
ID AREA CENTRIODX CENTRIODY DFONUMBER GLIDE__ LINKS OTHER NATIONS
0 1 92620 5.230 35.814 1 <NA> Algeria <NA> <NA>
1 2 678500 -45.349 -18.711 2 <NA> Brazil <NA> <NA>
2 3 12850 122.974 10.021 3 <NA> Philippines <NA> <NA>
3 4 16540 124.606 1.015 4 <NA> Indonesia <NA> <NA>
4 5 20080 32.349 -25.869 5 <NA> Mozambique <NA> <NA>
5 6 1040 43.360 -11.652 6 <NA> Comoros islas <NA> <NA>
X_AFFECTED
0 <NA>
1 <NA>
2 <NA>
3 <NA>
4 <NA>
5 <NA>
AND_RIVERS
0 Northeastern
1 States: Rio de Janeiro, Minas Gerais a Espirito Santo
2 Towns: Tanjay a Pamplona
3 Region: Northern Sulawesi; Towns: Gorontalo Regency
4 Provinces: Natal, Maputo; Rivers: Nkomati, Omati, Maputo, Umbeluzi, Incomati, Limpopo, Pungue, Buzi a Zambezi; Town: Ressano Garcia
5 Isla of Anjouan; Villages: Hassimpao, Marahare, Vouani
RIVERS BEGAN ENDED DAYS DEAD DISPLACED X_USD_ MAIN_CAUSE
0 <NA> 1985/01/01 1985/01/05 5 26 3000 <NA> Heavy rain
1 <NA> 1985/01/15 1985/02/02 19 229 80000 2000000000 Heavy rain
2 <NA> 1985/01/20 1985/01/21 2 43 444 <NA> Brief torrential rain
3 <NA> 1985/02/04 1985/02/18 15 21 300 <NA> Brief torrential rain
4 <NA> 1985/02/09 1985/02/11 3 19 <NA> 3000000 Heavy rain
5 <NA> 1985/02/16 1985/02/28 13 2 35000 5600000 Tropical cyclone
SEVERITY__ SQ_KM X_M___
0 1.0 92620 5.665675
1 1.5 678500 7.286395
2 1.0 12850 4.409933
3 1.0 16540 5.394627
4 1.5 20080 4.955976
5 1.0 1040 4.130977
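For reference, one way to narrow this down (a diagnostic sketch, not a fix) is to run the extraction without a summary function; extract() then returns a list with the raw cell values per polygon, so you can see which polygons yield zero or an unexpected number of values:

library(raster)

vals <- extract(rasterdata, floods1985, small = TRUE)
# number of cell values returned for each polygon
summary(sapply(vals, length))
# polygons that return no overlapping cells
which(sapply(vals, length) == 0)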