Parsing Character String in R with "\r\n" from PDF Conversion

I am having issues parsing the following character string in R:
> dput(txt[1])
"NATIONAL BASKETBALL ASSOCIATION OFFICIAL SCORER'S REPORT\r\n FINAL BOX\r\nWednesday, January 17, 2018 Spectrum Center, Charlotte, NC\r\nOfficials: #15 Zach Zarba, #11 Derrick Collins, #12 CJ Washington\r\n Game Duration: 2:14\r\n Attendance: 11528\r\nVISITOR: Washington Wizards (25-20)\r\n POS MIN FG FGA 3P 3PA FT FTA OR DR TOT A PF ST TO BS +/- PTS\r\n22 Otto Porter Jr. F 22:45 2 6 1 3 1 2 0 2 2 2 0 0 0 0 -24 6\r\n 5 Markieff Morris F 15:26 1 5 0 2 0 0 0 5 5 1 4 0 1 0 -10 2\r\n13 Marcin Gortat C 19:42 0 3 0 0 0 2 3 5 8 1 2 2 0 0 -23 0\r\n 3 Bradley Beal G 27:43 10 19 4 6 2 2 1 2 3 2 1 0 5 1 -14 26\r\n 2 John Wall G 24:20 5 11 2 2 0 0 0 2 2 9 2 0 3 2 -20 12\r\n30 Mike Scott 25:13 7 10 2 2 2 2 0 2 2 3 3 1 0 0 -8 18\r\n12 Kelly Oubre Jr. 26:32 5 9 3 5 3 4 0 5 5 0 4 0 3 0 3 16\r\n28 Ian Mahinmi 08:47 1 2 0 0 2 2 0 1 1 2 1 0 0 1 -1 4\r\n31 Tomas Satoransky 25:49 2 3 0 0 2 2 1 2 3 7 1 0 1 0 -13 6\r\n20 Jodie Meeks 20:17 2 3 1 2 2 3 1 4 5 0 0 0 2 0 -10 7\r\n14 Jason Smith 13:50 4 8 0 0 2 2 1 2 3 3 5 0 1 2 0 10\r\n 1 Chris McCullough 06:48 1 3 0 1 0 0 0 1 1 0 0 0 0 1 -1 2\r\n 8 Tim Frazier 02:48 0 0 0 0 0 2 0 0 0 1 0 0 0 0 1 0\r\n 240:00 40 82 13 23 16 23 7 33 40 31 23 3 16 7 -24 109\r\n 48.8% 56.5% 69.6% TM REB: 7 TOT TO: 16 (20 PTS)\r\nHOME: CHARLOTTE HORNETS (18-25)\r\n POS MIN FG FGA 3P 3PA FT FTA OR DR TOT A PF ST TO BS +/- PTS\r\n14 Michael Kidd-Gilchrist F 22:59 8 11 0 0 5 6 0 4 4 2 1 3 0 0 26 21\r\n 2 Marvin Williams F 21:38 4 7 3 4 1 1 1 2 3 1 0 0 1 0 25 12\r\n12 Dwight Howard C 28:54 7 13 0 0 4 5 3 12 15 2 2 2 2 2 17 18\r\n 5 Nicolas Batum G 26:00 4 8 2 4 1 2 0 3 3 4 1 1 2 0 17 11\r\n15 Kemba Walker G 28:48 6 15 4 8 3 3 1 2 3 7 1 0 1 1 14 19\r\n 3 Jeremy Lamb 21:02 7 9 2 2 0 0 3 0 3 0 4 0 0 1 -4 16\r\n44 Frank Kaminsky 23:36 6 14 1 4 1 1 0 2 2 2 1 1 0 0 -5 14\r\n10 Michael Carter-Williams 15:12 0 2 0 1 3 4 0 3 3 5 2 0 0 1 8 3\r\n 8 Johnny O'Bryant III 19:07 2 6 1 2 2 2 3 3 6 1 1 1 2 0 7 7\r\n21 Treveon Graham 22:00 3 6 1 2 2 2 2 2 4 1 4 1 1 0 7 9\r\n 1 Malik 
Monk 03:59 1 5 1 3 0 0 0 1 1 1 0 1 0 0 2 3\r\n 7 Dwayne Bacon 03:59 0 2 0 1 0 0 0 0 0 0 0 0 0 0 2 0\r\n32 Julyan Stone 02:46 0 0 0 0 0 0 0 2 2 1 1 0 0 0 4 0\r\n 240:00 48 98 15 31 22 26 13 36 49 27 18 10 9 5 24 133\r\n 49% 48.4% 84.6% TM REB: 7 TOT TO: 10 (15 PTS)\r\nSCORE BY PERIOD 1 2 3 4 FINAL\r\n Wizards 36 25 18 30 109\r\n HORNETS 38 39 25 31 133\r\nInactive: Wizards - Mac (Injury/Illness - left achilles surgery), Robinson (G League Team - two-way player)\r\nInactive: Hornets - Mathiang, Paige (G League Team - two-way player), Zeller (Injury/Illness - left knee surgery)\r\nPoints in the Paint: Wizards 30 (15/27), HORNETS 50 (25/48) Biggest Lead: Wizards 2, HORNETS 28\r\n2nd Chance Points: Wizards 9 (4/7), HORNETS 21 (6/12) Lead Changes: 2\r\nFast Break Points: Wizards 16 (6/8), HORNETS 10 (5/8) Times Tied: 5\r\nTechnical fouls - Individual\r\nWizards (3): Wall 4:16 1st , Brooks 6:49 2nd , Frazier 4:00 4th\r\nHORNETS (2): Kidd-Gilchrist 3:08 2nd , Carter-Williams 4:00 4th\r\nTechnical fouls - Defensive Three Seconds\r\nWizards (0) : NONE\r\nHORNETS (1) : Howard 2:27 1st\r\nEjections\r\nWizards (1): Frazier 4:00 4th\r\nHORNETS (1): Carter-Williams 4:00 4th\r\nMEMO: Ejected for excessive communication and contact during stoppage in play.\r\nMEMO: Ejected for excessive communication and contact during stoppage in play.\r\n Copyright (c ) 2017-2018 NBA Properties, INC. All Rights Reserved\r\n"
I would like to extract the following sub-section of the above character string:
Technical fouls - Individual\r\nWizards (3): Wall 4:16 1st , Brooks 6:49 2nd , Frazier 4:00 4th\r\nHORNETS (2): Kidd-Gilchrist 3:08 2nd , Carter-Williams 4:00 4th\r\nTechnical fouls - Defensive Three Seconds\r\nWizards (0) : NONE\r\nHORNETS (1) : Howard 2:27 1st\r\n
To do this, I have tried different variations of the following approach:
library(stringr)
> techs <- str_extract(txt[1], "(?<=\\bTechnical).+?.(\\\r\nEjections)")
> techs
[1] NA
But, as you can see, it doesn't work. This approach has worked successfully in other sections of the character string. Like here, for example:
> str_extract(txt[1], "(?<=\\bInactive).+?.(\\\r\nPoints)")
[1] ": Hornets - Mathiang, Paige (G League Team - two-way player), Zeller (Injury/Illness - left knee surgery)\r\nPoints"
So, why doesn't it work when I target \r\nEjections\r\n as my endpoint? The only difference I can see is that \r\n both precedes and succeeds Ejections, whereas \r\n only precedes Points. I've tried to account for the additional \r\n like so:
> techs <- str_extract(txt[1], "(?<=\\bTechnical).+?.(\\\r\nEjections\r\n)")
> techs
[1] NA
But, that still doesn't work. What is this \r\n that's all over the place, and how do I account for it? I acquired this character string from a PDF like so:
library(pdftools)
> download.file("http://www.nba.com/data/html/nbacom/2017/gameinfo/20180117/0021700653_Book.pdf", "mydf", mode = "wb")
trying URL 'http://www.nba.com/data/html/nbacom/2017/gameinfo/20180117/0021700653_Book.pdf'
Content type 'application/pdf' length unknown
downloaded 308 KB
> txt <- pdf_text("mydf")
Obviously, the PDF does not show \r\ns in its text. Were they inserted upon conversion? Is there a way to convert without them? Or, is there a simple-enough way to work with them? Thanks for the help.
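For reference, \r\n is a carriage return followed by a line feed (the Windows-style line ending), and in this string it marks where each line of the page ends. One possibly simpler way to work with it, sketched here with base R only, is to split the page on \r\n and then pick lines by position. The shortened `page` string below is a stand-in for txt[1], and `extract_between` is a hypothetical helper name, not something from pdftools or stringr.

```r
# Shortened stand-in for txt[1]; the real page has many more lines.
page <- paste0(
  "Fast Break Points: Wizards 16 (6/8), HORNETS 10 (5/8) Times Tied: 5\r\n",
  "Technical fouls - Individual\r\n",
  "Wizards (3): Wall 4:16 1st , Brooks 6:49 2nd , Frazier 4:00 4th\r\n",
  "HORNETS (2): Kidd-Gilchrist 3:08 2nd , Carter-Williams 4:00 4th\r\n",
  "Technical fouls - Defensive Three Seconds\r\n",
  "Wizards (0) : NONE\r\n",
  "HORNETS (1) : Howard 2:27 1st\r\n",
  "Ejections\r\n",
  "Wizards (1): Frazier 4:00 4th\r\n"
)

# Hypothetical helper: return the lines from the first line starting with
# `start` up to (but not including) the next line starting with `end`.
extract_between <- function(text, start, end) {
  lines <- strsplit(text, "\r\n", fixed = TRUE)[[1]]
  from <- which(startsWith(lines, start))[1]
  to <- which(startsWith(lines, end) & seq_along(lines) > from)[1]
  lines[from:(to - 1)]
}

techs <- extract_between(page, "Technical fouls - Individual", "Ejections")
print(techs)
```

Working line by line sidesteps the question of how a regular expression treats line breaks, at the cost of assuming that both marker lines are present on the page.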
EDIT
It does not appear that adding \\ to \\\r\n makes a difference.
> techs <- str_extract(txt[1], "(?<=\\bTechnical).+?.(\\\r\n)")
> techs
[1] " fouls - Individual\r\n"
> techs <- str_extract(txt[1], "(?<=\\bTechnical).+?.(\\\r\nEjections)")
> techs
[1] NA
> techs <- str_extract(txt[1], "(?<=\\bTechnical).+?.(\\\r\nEjections\r\n)")
> techs
[1] NA
> techs <- str_extract(txt[1], "(?<=\\bTechnical).+?.(\\\r\nEjections\\\r\n)")
> techs
[1] NA
> techs <- str_extract(txt[1], "(?<=\\bTechnical).+?.(\\\r\nEjections\\\r\\\n)")
> techs
[1] NA
> techs <- str_extract(txt[1], "(?<=\\bTechnical).+?.(\\\r\\\nEjections\\\r\\\n)")
> techs
[1] NA
> techs <- str_extract(txt[1], "(?<=\\bTechnical).+?.(\\\r\\\nEjections\\\r\n)")
> techs
[1] NA
> techs <- str_extract(txt[1], "(?<=\\bTechnical).+?.(\\\r\\\nEjections\r\n)")
> techs
[1] NA
> techs <- str_extract(txt[1], "(?<=\\bTechnical).+?.(\\\r\\\nEjections)")
> techs
[1] NA
Am I doing it right?
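A likely reason the Ejections patterns return NA while the Inactive one works: by default `.` does not match line terminators, so `.+?` cannot cross from one line to the next. The Inactive match happens to sit entirely on one line before \r\nPoints, whereas several lines separate Technical from Ejections. The sketch below uses base R with perl = TRUE and the `(?s)` flag, which lets `.` match line breaks as well; with stringr, wrapping the pattern in `regex(pattern, dotall = TRUE)` should play the same role. The shortened `page` string is a stand-in for txt[1].

```r
# Shortened stand-in for txt[1].
page <- paste0(
  "Technical fouls - Individual\r\n",
  "Wizards (3): Wall 4:16 1st , Brooks 6:49 2nd , Frazier 4:00 4th\r\n",
  "Technical fouls - Defensive Three Seconds\r\n",
  "HORNETS (1) : Howard 2:27 1st\r\n",
  "Ejections\r\n",
  "Wizards (1): Frazier 4:00 4th\r\n"
)

# (?s) makes "." match line breaks too; the lookahead stops the lazy match
# just before the "\r\nEjections" line without including it in the result.
pattern <- "(?s)Technical fouls - Individual.*?(?=\\r\\nEjections)"

m <- regexpr(pattern, page, perl = TRUE)
techs <- regmatches(page, m)
cat(techs, "\n")
```

Note that "\\r\\n" in the R string reaches the regex engine as the escapes \r\n, so no extra backslashes are needed to match the line ending itself.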

Related

How Would I go About Subsetting this Data Frame?

I have the following data frame:
> resident
X LOS Age Meds MHealth DietRest ReligAff NmChores Employed EdLevel Courses
1 R1 27 35 2 1 3 2 2 0 2 1
2 R2 56 43 0 0 0 1 3 1 3 2
3 R3 101 41 1 1 0 0 2 2 2 3
4 R4 19 54 3 2 4 3 1 0 1 0
5 R5 34 29 0 0 0 2 3 0 2 1
6 R6 78 46 2 0 2 1 2 1 3 2
7 R7 134 51 3 2 4 0 1 1 3 2
8 R8 112 38 0 1 1 4 2 1 2 3
9 R9 83 61 3 1 3 2 2 0 4 3
10 R10 9 50 2 0 2 1 1 2 2 0
11 R11 67 23 0 1 0 0 2 0 3 1
12 R12 30 47 2 2 0 3 2 0 4 0
13 R13 95 65 4 1 4 2 2 0 3 2
14 R14 165 63 5 2 4 1 1 0 2 2
15 R15 29 40 0 1 0 0 3 2 5 0
16 R16 44 33 2 2 1 0 2 0 3 1
17 R17 36 48 2 1 0 3 2 0 1 1
18 R18 58 57 3 0 2 1 1 1 2 1
19 R19 116 39 0 1 0 2 2 1 3 1
20 R20 73 44 1 0 0 2 1 0 4 2
21 R21 79 30 3 2 3 3 1 0 2 1
22 R22 39 41 0 0 0 0 3 2 2 2
23 R23 18 50 2 1 2 1 1 1 3 0
24 R24 60 35 1 0 0 0 2 1 4 2
25 R25 106 48 3 2 3 2 2 0 2 2
26 R26 46 31 2 1 0 0 1 1 3 1
27 R27 52 59 2 0 1 1 3 2 2 1
28 R28 28 62 6 0 4 2 1 0 5 1
29 R29 79 45 4 2 3 3 2 1 3 2
30 R30 24 42 1 1 1 0 1 0 2 1
31 R31 123 36 3 1 0 2 2 1 3 4
32 R32 11 49 2 0 2 1 2 0 1 0
33 R33 95 26 1 1 0 1 3 0 3 4
34 R34 61 24 0 0 0 2 2 1 2 1
35 R35 88 63 2 1 0 1 1 1 4 2
36 R36 64 38 1 2 1 4 1 1 2 3
37 R37 99 40 2 0 0 1 3 2 4 1
>
LOS = length of stay
I am trying to go through the data frame and create a new column that consists of either a zero or a one, based on whether the resident is completing an average of one course every thirty days. How would I go about doing this? I understand I would need to do something like this: within this subset of people, if someone has been there between thirty and fifty-nine days and has completed at least one course, they receive a value of one. If someone has been there between sixty and eighty-nine days and has finished at least two courses, they receive a one, and so forth; otherwise they receive a zero. How would I create a function that does this and adds a value of either 1 or 0 to a new vector based on the data for each resident?
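The rule described above can be sketched with integer division, since LOS %/% 30 gives the number of completed thirty-day blocks. This is only a sketch on a few rows copied from the data frame. The column name `OnTrack` is made up for illustration, and it assumes that residents with fewer than thirty days need zero courses (so they get a 1), which the question does not specify.

```r
# A few rows copied from the `resident` data frame above.
resident <- data.frame(
  X = c("R1", "R3", "R7", "R12", "R4"),
  LOS = c(27, 101, 134, 30, 19),
  Courses = c(1, 3, 2, 0, 0)
)

# Courses required so far: one per completed 30-day block of stay.
required <- resident$LOS %/% 30

# 1 if the resident has completed at least that many courses, else 0.
# Assumption: a stay under 30 days requires 0 courses, so it counts as 1.
resident$OnTrack <- as.integer(resident$Courses >= required)
print(resident)
```

Because the comparison is vectorised, no explicit loop or per-range if/else is needed.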

data cleaning for plotting data frames

I am currently working with survey data in RStudio. I originally had two CSV files, which I merged into one. Both CSV files contained sample IDs. The first file also contains binary (0/1) information, while the second contains ratings as a continuous variable.
Here is a sample of the data
ID O1 O2 O3 O4 O5 O6 O7 O8 S1 S2 S3 S4 S5 S6 S7 S8
22 0 1 0 1 0 1 0 1 4 6 2 6 4 3 6 2
23 0 1 0 0 1 1 0 1 5 6 10 4 5 7 7 6
24 0 1 1 0 1 0 0 1 7 4 7 8 7 6 3 9
25 0 0 1 1 0 0 1 1 3 5 5 7 4 6.9 6 5
26 0 1 0 0 1 1 0 1 2 2.5 7 5 4 5 4 3
27 0 1 1 1 0 1 0 0 6 3 4 6 5 6 5 6
28 0 1 1 1 0 0 0 1 7 4 2 8 2 1 4 5
29 0 0 1 0 1 1 1 0 2 5 1 2 4 3 2 2
30 0 1 0 1 1 1 0 0 8 2 6 7 1 7 5 4
31 0 0 0 1 0 1 1 1 7 4 3 2 4 5 7 2
32 0 0 1 0 0 1 1 1 4 7 5 3 1 6 2 3
33 0 1 1 0 1 1 0 0 7 4 5 8 8 5 6 7
For example, the 0 in O1 corresponds to the 4 in S1.
I want to make a loop that sums all of the S values whose corresponding O value is 0, and separately all of the S values whose corresponding O value is 1.
if value in O1 is 0, add value in S1 to "sum of 0"
if value in O1 is 1, add value in S1 to "sum of 1"
repeat for all columns to get a total value for 0 and 1.
Any strategies or tips would be helpful going forward!
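One possible strategy that avoids an explicit loop: treat the O columns and the S columns as two matrices of the same shape and use one as a logical mask over the other. The sketch below uses only the first two rows of the sample (IDs 22 and 23) and assumes the columns are named O1..O8 and S1..S8 as shown. The names `sum_of_0` and `sum_of_1` are illustrative.

```r
# First two rows of the sample data (IDs 22 and 23).
survey <- data.frame(
  ID = c(22, 23),
  O1 = c(0, 0), O2 = c(1, 1), O3 = c(0, 0), O4 = c(1, 0),
  O5 = c(0, 1), O6 = c(1, 1), O7 = c(0, 0), O8 = c(1, 1),
  S1 = c(4, 5), S2 = c(6, 6), S3 = c(2, 10), S4 = c(6, 4),
  S5 = c(4, 5), S6 = c(3, 7), S7 = c(6, 7), S8 = c(2, 6)
)

# Same-shaped matrices: o[i, j] is the 0/1 flag that belongs to s[i, j].
o <- as.matrix(survey[paste0("O", 1:8)])
s <- as.matrix(survey[paste0("S", 1:8)])

# Sum the ratings whose flag is 0, and separately those whose flag is 1.
sum_of_0 <- sum(s[o == 0])
sum_of_1 <- sum(s[o == 1])
print(c(sum_of_0 = sum_of_0, sum_of_1 = sum_of_1))
```

The pairing relies on the O and S columns being selected in the same order, so O1 lines up with S1, O2 with S2, and so on.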

Correct syntax for xpathSApply in R

I'm struggling to get the statistics table on a website into a data frame so I can analyse it. The table can be found here:
http://nl.soccerway.com/teams/netherlands/afc-ajax/1515/squad/
My code so far:
library(XML)
url <- "http://nl.soccerway.com/teams/netherlands/afc-ajax/1515/squad/"
doc <- htmlParse(url)
xpathSApply(doc, "//tr[@*]/td/child::node()", xmlValue)
But this returns the data in an unworkable form. What is the correct xpathSApply code?
The table with the data has id='page_team_1_block_team_squad_3-table', which you can use in an XPath expression. The XPath
"//table[@id='page_team_1_block_team_squad_3-table']/tbody" finds the table with that id and returns the table body. You can then use readHTMLTable with the argument header = FALSE to return the data:
library(XML)
url <- "http://nl.soccerway.com/teams/netherlands/afc-ajax/1515/squad/"
doc <- htmlParse(url)
res <- readHTMLTable(doc["//table[@id='page_team_1_block_team_squad_3-table']/tbody"][[1]], header = FALSE)
head(res)
> head(res)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1 1 K. Vermeer 28 K 856 10 10 0 1 24 0 0 0 0
2 22 J. Cillessen 25 K 2204 25 24 1 0 8 0 0 0 0
3 30 M. van der Hart 20 K 0 0 0 0 0 2 0 0 0 0
4 2 R. van Rhijn 23 V 2786 32 31 1 1 1 2 3 6 0
5 3 T. Alderweireld 25 V 360 4 4 0 0 0 0 0 0 0
6 4 N. Moisander 28 V 1985 23 22 1 0 3 1 2 0 0
V17
1 0
2 0
3 0
4 1
5 0
6 0
You don't need xpathSApply. This one-liner can do it given the url:
readHTMLTable(url, header = "")[[1]]
giving:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1 1 K. Vermeer 28 K 856 10 10 0 1 24 0 0 0 0 0
2 22 J. Cillessen 25 K 2204 25 24 1 0 8 0 0 0 0 0
3 30 M. van der Hart 20 K 0 0 0 0 0 2 0 0 0 0 0
4 2 R. van Rhijn 23 V 2786 32 31 1 1 1 2 3 6 0 1
5 3 T. Alderweireld 25 V 360 4 4 0 0 0 0 0 0 0 0
6 4 N. Moisander 28 V 1985 23 22 1 0 3 1 2 0 0 0
7 6 M. van der Hoorn 21 V 166 3 2 1 1 21 0 0 0 0 0
8 12 J. Veltman 22 V 2158 25 24 1 1 2 2 2 2 0 0
9 15 N. Boilesen 22 V 1445 20 17 3 6 6 1 2 3 0 0
10 17 D. Blind 24 V 2531 29 29 0 5 3 1 1 4 0 0
11 24 S. Denswil 21 V 1350 17 15 2 1 14 1 0 1 0 0
12 27 R. Ligeon 22 V 350 5 4 1 3 8 0 1 0 0 0
13 42 J. Riedewald 17 V 222 5 3 2 3 10 2 0 1 0 0
14 44 K. Tete 18 V 0 0 0 0 0 1 0 0 0 0 0
15 5 C. Poulsen 34 M 1523 29 14 15 3 20 1 3 2 0 0
16 8 L. Duarte 23 M 655 14 6 8 2 14 3 0 1 0 0
17 8 C. Eriksen 22 M 360 4 4 0 0 0 2 3 1 0 0
18 10 S. de Jong 25 M 1257 19 16 3 8 3 7 1 1 0 0
19 18 D. Klaassen 21 M 2102 26 23 3 2 5 10 3 1 0 0
20 20 L. Schöne 28 M 2149 29 25 4 6 6 9 8 1 0 0
21 25 T. Serero 24 M 2276 29 25 4 6 6 3 3 3 0 0
22 34 L. de Sa 21 M 512 12 5 7 5 12 1 1 1 0 0
23 7 V. Fischer 20 A 1636 24 19 5 6 6 3 2 1 0 0
24 9 K. Sigþórsson 24 A 1928 30 20 10 16 11 10 2 0 0 0
25 11 Bojan 23 A 1357 24 17 7 12 11 4 3 2 0 0
26 16 L. Andersen 19 A 405 9 4 5 3 14 0 0 0 0 0
27 19 T. Sana 24 A 223 4 2 2 1 7 0 0 0 0 0
28 23 D. Hoesen 23 A 450 14 4 10 2 15 2 1 0 0 0
29 43 R. Kishna 19 A 389 8 5 3 5 5 1 2 0 0 0

Multiple panel plot non-NA error

I've got a data frame that looks like this.
a[,2:25]
UT1 UT2 UT3 UT4 UT5 UT6 UT7 UT8 UT9 UT10 UT11 UT12 TR1 TR2 TR3 TR4
3094 9 0 1 37 6 2 8 1 1 6 3 1 3 0 0 1
3095 4 0 0 10 17 6 7 1 5 3 1 12 2 0 0 1
3096 18 0 0 4 6 15 14 0 7 9 3 8 5 2 1 2
3097 11 0 0 7 5 15 10 2 4 7 16 17 7 3 0 0
3098 18 0 11 2 5 11 7 3 2 1 1 0 3 3 1 1
3099 25 0 6 11 17 3 10 1 1 3 9 2 2 1 1 2
3100 1 0 1 27 12 28 27 0 2 11 6 0 1 7 4 6
3101 0 0 1 40 0 17 13 1 0 3 3 0 1 3 3 1
3102 2 0 0 30 1 9 2 1 1 5 0 0 1 3 3 0
3103 3 0 0 11 4 7 5 2 4 0 1 0 5 4 0 0
3104 5 0 0 3 1 10 4 2 3 0 3 0 7 2 1 0
TR5 TR6 TR7 TR8 TR9 TR10 TR11 TR12
3094 1 0 15 3 0 0 42 1
3095 1 0 4 29 0 0 42 0
3096 0 0 3 22 0 0 3 0
3097 1 0 4 14 0 0 2 0
3098 0 0 1 10 0 0 1 0
3099 0 0 4 41 1 0 3 0
3100 0 0 10 21 0 0 17 0
3101 0 0 2 1 1 0 13 3
3102 0 0 2 4 0 0 10 3
3103 1 0 3 4 0 0 12 1
3104 0 0 1 2 0 0 8 0
The first column of my data is time, so I separated it using
tiempo<-a$Tiempo
tiempo
[1] 618.6 618.8 619.0 619.2 619.4 619.6 619.8 620.0 620.2 620.4 620.6
In order to plot each column as a function of time and fit an lm, I used the reshape and lattice packages. I'm not sure that's the best option, but it almost gets me what I want.
The code looks like this:
m<-melt(a[,2:25])
f<-m$variable
xyplot(m$value~tiempo | f, panel=function(x,y,...){
panel.xyplot(x,y,...)
panel.lmline(x,y, col=2, lty=2)
})
And the output is this graph
I don't get why it gives this error, I expect them to be non-NA, I don't understand why it is a problem. In fact, the first panel worked just fine.
When I change the panel.lmline(...) part this happens:
xyplot(m$value~tiempo | f, panel=function(x,y,...){
panel.xyplot(tiempo,m$value,...)
panel.lmline(tiempo,m$value, col=2, lty=2)
})
I get this length error, but I think it's because each panel is using all the data points from m when it should be using only 11.
The lm regression function I use is separated from the plotting and this doesn't mess with my statistical analysis but I'm trying to put everything together and won't be able to do it if I can't plot the data. I want visual information about the regression in order to be able to remove outliers if the Rsquared is too low or maybe not even consider that observation.
I hope I've made myself clear.
Thank you very much
Edited with suggestions
You got most of the code right.
It would be better to use the time variable (Tiempo) as an id variable in your melt call. This ensures the lengths of the data match up.
library(reshape2)  # a faster version of reshape
df.m <- melt(df.matias, id.vars = "Tiempo")  # I stored your data in df.matias
Now we can use the melted data to make your plot
library(lattice)
xyplot(value ~ Tiempo | variable, data = df.m,
panel = function(x,y,...) {
panel.xyplot(x,y,...)
panel.lmline(x,y, col = 2, lty =2)
})
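Since the question also mentions screening series by their R-squared, here is a small base R sketch of computing it per column alongside the plot. It uses only the first five rows of two columns (UT1 and TR1) from the data shown above, and the names `series` and `r2` are illustrative.

```r
# First five rows of Tiempo, UT1 and TR1 from the data shown above.
a <- data.frame(
  Tiempo = c(618.6, 618.8, 619.0, 619.2, 619.4),
  UT1 = c(9, 4, 18, 11, 18),
  TR1 = c(3, 2, 5, 7, 3)
)

# Every column except the time column is treated as a separate series.
series <- setdiff(names(a), "Tiempo")

# Fit value ~ time for each series and keep the R-squared of each fit.
r2 <- sapply(series, function(col) {
  fit <- lm(a[[col]] ~ a$Tiempo)
  summary(fit)$r.squared
})
print(round(r2, 3))
```

A constant column such as UT2 (all zeros in the sample) has no variance to explain, so its R-squared is not meaningful and is best handled separately.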

how to create r matrix or table from data

My data in CSV format looks like the following. I would like to create a matrix for a heatmap from this file, which I am going to plot with ggplot in R.
A B C
1 apple 3
2 book 5
4 bag 1
9 desk 4
10 apple 8
11 book 66
14 desk 2
The matrix I would like to create from the file above is:
1 2 4 9 10 11 14
apple 3 0 0 0 8 0 0
book 0 5 0 0 0 66 0
bag 0 0 1 0 0 0 0
desk 0 0 0 4 0 0 2
I have another column in the initial file for ordering.
A B C D
1 apple 3 4
2 book 5 1
4 bag 1 2
9 desk 4 3
10 apple 8 4
11 book 66 1
14 desk 2 3
How can I order my matrix by this D ordering column? Alternatively, I would like to order by the sum of the 1-14 columns.
You can use xtabs.
d <- read.delim(textConnection("
A B C
1 apple 3
2 book 5
4 bag 1
9 desk 4
10 apple 8
11 book 66
14 desk 2
"), sep=" ")
xtabs(C ~ B + A, d)
A
B 1 2 4 9 10 11 14
0 0 0 0 0 0 0
apple 3 0 0 0 8 0 0
bag 0 0 1 0 0 0 0
book 0 5 0 0 0 66 0
desk 0 0 0 4 0 0 2
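For the ordering part of the question, a sketch that builds on the xtabs result: the rows can be reordered either by their totals or by the D column. It assumes that D holds a single rank per item, as in the sample, and the names `by_sum`, `d_rank` and `by_d` are illustrative.

```r
# Sample data including the D ordering column from the question.
d <- read.table(text = "
A B C D
1 apple 3 4
2 book 5 1
4 bag 1 2
9 desk 4 3
10 apple 8 4
11 book 66 1
14 desk 2 3
", header = TRUE, stringsAsFactors = FALSE)

m <- xtabs(C ~ B + A, d)

# Option 1: order rows by their total across all A columns, largest first.
by_sum <- m[order(rowSums(m), decreasing = TRUE), ]

# Option 2: order rows by D. Look up the D value of each row name
# (assumes every occurrence of an item carries the same D).
d_rank <- d$D[match(rownames(m), d$B)]
by_d <- m[order(d_rank), ]

print(by_sum)
print(by_d)
```

Reading the data with read.table(text = ...) skips the leading blank line, so the empty first row seen in the output above does not appear here.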
You can do this with read.table. You can get the help for choosing the correct parameters by typing ?read.table into your R-GUI.
Using the read.delim part from Vincent above and a reshape approach. Not as elegant...
d <- read.delim(textConnection("
A B C
1 apple 3
2 book 5
4 bag 1
9 desk 4
10 apple 8
11 book 66
14 desk 2
"), sep=" ")
library(reshape)  # provides melt() and cast()
Var1 <- rep(d[,1], d[,3])
Var2 <- rep(d[,2], d[,3])
d <- data.frame(Var1=Var1, Var2=Var2)
d <- cast(melt(d), Var2~value)
> d
Var2 1 2 4 9 10 11 14
1 apple 3 0 0 0 8 0 0
2 bag 0 0 1 0 0 0 0
3 book 0 5 0 0 0 66 0
4 desk 0 0 0 4 0 0 2
