Reading a CSV with observations containing a "–" sign


I have an excel file with the first column looking like this:
Age
17–20
17–20
17–20
17–20
21–24
21–24
21–24
21–24
25–29
25–29
25–29
25–29
30–34
30–34
30–34
30–34
35–39
35–39
35–39
35–39
40–49
40–49
40–49
40–49
50–59
50–59
50–59
50–59
60+
60+
60+
60+
I would like to read this into R without changing each individual observation in Excel. I used the following code:
df39 <- read.csv("AutoCollision.csv",header=TRUE,sep=",",
colClasses=c("character","character","numeric","numeric"))
However, this makes the data set look like this:
Age Vehicle_Use Severity Claim_Count
1 17\xd020 Pleasure 250.48 21
2 17\xd020 DriveShort 274.78 40
3 17\xd020 DriveLong 244.52 23
4 17\xd020 Business 797.80 5
5 21\xd024 Pleasure 213.71 63
6 21\xd024 DriveShort 298.60 171
7 21\xd024 DriveLong 298.13 92
8 21\xd024 Business 362.23 44
9 25\xd029 Pleasure 250.57 140
10 25\xd029 DriveShort 248.56 343
11 25\xd029 DriveLong 297.90 318
12 25\xd029 Business 342.31 129
13 30\xd034 Pleasure 229.09 123
14 30\xd034 DriveShort 228.48 448
15 30\xd034 DriveLong 293.87 361
16 30\xd034 Business 367.46 169
17 35\xd039 Pleasure 153.62 151
18 35\xd039 DriveShort 201.67 479
19 35\xd039 DriveLong 238.21 381
20 35\xd039 Business 256.21 166
21 40\xd049 Pleasure 208.59 245
22 40\xd049 DriveShort 202.80 970
23 40\xd049 DriveLong 236.06 719
24 40\xd049 Business 352.49 304
25 50\xd059 Pleasure 207.57 266
26 50\xd059 DriveShort 202.67 859
27 50\xd059 DriveLong 253.63 504
28 50\xd059 Business 340.56 162
29 60+ Pleasure 192.00 260
30 60+ DriveShort 196.33 578
31 60+ DriveLong 259.79 312
32 60+ Business 342.58 96
Why did it change the minus signs to "\xd0" and how could I go about fixing this? Thanks in advance!
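One possible explanation, offered as an assumption rather than a confirmed diagnosis: the byte 0xD0 is the en dash in the Mac OS Roman encoding, so the file may have been exported in that encoding and read by R as if it were in a different one. The sketch below rebuilds a few of the Age values with that raw byte and converts them with iconv(). The encoding name "macintosh" is an assumption about the source file; iconvlist() shows which names your system accepts.

```r
# Age values as R shows them: "17", the raw byte 0xD0, then "20"
age <- c("17\xd020", "21\xd024", "60+")

# Reinterpret the bytes as Mac OS Roman (assumed source encoding) and convert to UTF-8
age_fixed <- iconv(age, from = "macintosh", to = "UTF-8")

# Optionally replace the en dash (U+2013) with a plain ASCII hyphen
age_ascii <- gsub("\u2013", "-", age_fixed, fixed = TRUE)
print(age_ascii)
```

If the assumption holds, passing fileEncoding = "macintosh" to read.csv should apply the same conversion while the file is being read.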

Related

Adding a column to a data frame with two different variables

I am sure this has a very easy answer, but I am struggling to add a column that takes two different values to my data frame. Currently, this is what it looks like:
vcv.index model.index par.index grid index estimate se lcl ucl fixed
1 6 6 16 A 16 0.8856724 0.07033280 0.6650468 0.9679751
2 7 7 17 A 17 0.6298118 0.06925471 0.4873052 0.7528014
3 8 8 18 A 18 0.6299359 0.06658557 0.4930263 0.7487169
4 9 9 19 A 19 0.6297988 0.05511771 0.5169948 0.7300157
5 10 10 20 A 20 0.7575811 0.05033490 0.6461758 0.8424612
6 21 21 61 B 61 0.8713467 0.07638687 0.6404598 0.9626184
7 22 22 62 B 62 0.6074379 0.06881230 0.4677827 0.7314827
8 23 23 63 B 63 0.6041054 0.06107520 0.4805279 0.7156792
9 24 24 64 B 64 0.5806565 0.06927308 0.4422237 0.7074601
10 25 25 65 B 65 0.7370944 0.05892108 0.6070620 0.8357394
11 41 41 121 C 121 0.8048479 0.09684385 0.5519097 0.9324759
12 42 42 122 C 122 0.5259547 0.07165218 0.3871380 0.6608721
13 43 43 123 C 123 0.5427100 0.07127273 0.4033255 0.6757137
14 44 44 124 C 124 0.5168820 0.06156392 0.3975561 0.6343132
15 45 45 125 C 125 0.6550049 0.07378403 0.5002851 0.7826343
16 196 196 586 A 586 0.8536314 0.08709394 0.5979992 0.9580976
17 197 197 587 A 587 0.5672194 0.07079508 0.4268452 0.6975725
18 198 198 588 A 588 0.5675415 0.06380445 0.4408540 0.6859714
19 199 199 589 A 589 0.5666874 0.06499899 0.4377071 0.6872233
20 200 200 590 A 590 0.7058542 0.05985868 0.5769484 0.8085177
21 211 211 631 B 631 0.8360614 0.09413427 0.5703031 0.9514472
22 212 212 632 B 632 0.5432872 0.07906200 0.3891364 0.6895701
23 213 213 633 B 633 0.5400994 0.06497607 0.4129055 0.6622759
24 214 214 634 B 634 0.5161692 0.06292706 0.3943257 0.6361202
25 215 215 635 B 635 0.6821667 0.07280044 0.5263841 0.8056298
26 226 226 676 C 676 0.7621875 0.10484478 0.5077465 0.9087471
27 227 227 677 C 677 0.4607440 0.07326970 0.3240229 0.6036386
28 228 228 678 C 678 0.4775168 0.08336433 0.3219349 0.6375872
29 229 229 679 C 679 0.4517655 0.06393339 0.3319262 0.5774725
30 230 230 680 C 680 0.5944330 0.07210672 0.4491995 0.7248303
Then I add a column with the periods 1-5, repeated until the end of the data frame, with this code:
SurJagPred$estimates %<>% mutate(Primary = rep(1:5, 6))
I also need to add sex (F, M). Rows 1-15 are female and rows 16-30 are male. So overall it should look like this:
vcv.index model.index par.index grid index estimate se lcl ucl fixed Primary Sex
1 6 6 16 A 16 0.8856724 0.07033280 0.6650468 0.9679751 1 F
2 7 7 17 A 17 0.6298118 0.06925471 0.4873052 0.7528014 2 F
3 8 8 18 A 18 0.6299359 0.06658557 0.4930263 0.7487169 3 F
4 9 9 19 A 19 0.6297988 0.05511771 0.5169948 0.7300157 4 F

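We can use rep with the each argument, which repeats every element of the vector that many times in a row (here, 15 "F" followed by 15 "M"):
library(dplyr)
library(magrittr)  # provides the %<>% assignment pipe
SurJagPred$estimates %<>%
mutate(Sex = rep(c("F", "M"), each = 15))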
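To make the difference between the two forms of rep concrete, here is a minimal base R sketch. The data frame estimates is a hypothetical 30-row stand-in for SurJagPred$estimates, not the real object.

```r
# Hypothetical stand-in for SurJagPred$estimates: 30 rows, as in the question
estimates <- data.frame(index = 1:30)

# times = 6 repeats the whole vector: 1 2 3 4 5 1 2 3 4 5 ...
estimates$Primary <- rep(1:5, times = 6)

# each = 15 repeats every element in turn: 15 "F" followed by 15 "M"
estimates$Sex <- rep(c("F", "M"), each = 15)

# Cross-tabulate to check that every period appears 3 times per sex
print(table(estimates$Sex, estimates$Primary))
```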

debugger freezes on X (arm)

I am developing an application for the Olimex A20 with Qt 5.7. The app needs to run on X. If I just run the application, it works perfectly fine. The issue is with debugging: the debugger freezes. Below is the stack trace I see when I interrupt the debugger.
This is the line of code where the debugger is waiting for something to happen (qwaitcondition_unix.cpp - line 143).
code = pthread_cond_wait(&cond, &mutex);
This is the main stack trace from thread #1
1 __libc_do_syscall 0xb5dd6514
2 pthread_cond_wait * *GLIBC_2.4 0xb5dd1da6
3 QWaitConditionPrivate::wait qwaitcondition_unix.cpp 143 0xb63a7a44
4 QWaitCondition::wait qwaitcondition_unix.cpp 215 0xb63a7a44
5 QSemaphore::acquire qsemaphore.cpp 143 0xb63a2dba
6 QMetaObject::activate qobject.cpp 3708 0xb6506c6e
7 QMetaObject::activate qobject.cpp 3602 0xb6506fee
8 QDBusConnectionManager::connectionRequested moc_qdbusconnectionmanager_p.cpp 141 0xb3baf2a0
9 QDBusConnectionManager::connectToBus qdbusconnection.cpp 225 0xb3b6b37e
10 QDBusConnectionManager::busConnection qdbusconnection.cpp 134 0xb3b6b488
11 QDBusConnection::sessionBus qdbusconnection.cpp 1195 0xb3b6c058
12 DBusConnection::DBusConnection dbusconnection.cpp 73 0xb3e08a4c
13 QSpiAccessibleBridge::QSpiAccessibleBridge bridge.cpp 66 0xb3dff070
14 QXcbIntegration::accessibility qxcbintegration.cpp 337 0xb3dc563c
15 platformAccessibility qaccessible.cpp 485 0xb6a4ae34
16 QAccessible::isActive qaccessible.cpp 791 0xb6a4ae34
17 QQuickTextInputPrivate::emitCursorPositionChanged qquicktextinput.cpp 4206 0xb6eb5fde
18 QQuickTextInputPrivate::moveCursor qquicktextinput.cpp 3264 0xb6eb9d76
19 QQuickTextInputPrivate::setCursorPosition qquicktextinput_p_p.h 407 0xb6eb9e2a
20 QQuickTextInput::setReadOnly qquicktextinput.cpp 682 0xb6eb9e2a
21 QQuickTextInput::qt_static_metacall moc_qquicktextinput_p.cpp 1180 0xb6f593e8
22 QQuickTextInput::qt_metacall moc_qquicktextinput_p.cpp 1257 0xb6f59cf8
23 QQmlPropertyPrivate::write qqmlproperty.cpp 1254 0xb68a648a
24 QQmlPropertyPrivate::writeValueProperty qqmlproperty.cpp 1183 0xb68a7594
25 QQmlBinding::write qqmlbinding.cpp 333 0xb68f38da
26 QQmlBinding::update qqmlbinding.cpp 197 0xb68f46fc
27 QQmlObjectCreator::finalize qqmlobjectcreator.cpp 1202 0xb68faf92
28 QQmlComponentPrivate::complete qqmlcomponent.cpp 926 0xb68a861e
29 QQmlComponentPrivate::completeCreate qqmlcomponent.cpp 962 0xb68a8698
30 QQmlComponent::create qqmlcomponent.cpp 788 0xb68a85ac
31 QQmlApplicationEnginePrivate::_q_finishLoad qqmlapplicationengine.cpp 136 0xb68f55fc
32 QQmlApplicationEnginePrivate::startLoad qqmlapplicationengine.cpp 115 0xb68f57a4
33 QQmlApplicationEngine::load qqmlapplicationengine.cpp 260 0xb68f57de
34 main main.cpp 50 0x1fd60
This is thread #6
1 __libc_do_syscall 0xb5dd6514
2 pthread_cond_wait * *GLIBC_2.4 0xb5dd1da6
3 _mali_osu_lock_wait 0xb61ec7fe
4 __egl_worker_thread 0xb61e7096
5 start_thread 0xb5dcd5dc
6 ?? 0xb5ffd71c
Has anyone come across this issue? Any pointers would be appreciated.

Plot histogram by first sorting data and then dividing x values into bins in R

I have a dataset in a given format:
USER.ID avgfrequency
1 3 3.7821782
2 7 14.7500000
3 9 13.4761905
4 13 5.1967213
5 16 6.7812500
6 26 41.7500000
7 49 13.6666667
8 50 7.0000000
9 51 1.0000000
10 52 17.7500000
11 69 4.5000000
12 75 9.9500000
13 91 84.2000000
14 98 8.0185185
15 138 14.2000000
16 139 34.7500000
17 149 7.6666667
18 155 35.3333333
19 167 24.0000000
20 170 7.3529412
21 171 4.4210526
22 175 6.5781250
23 176 19.2857143
24 177 10.4864865
25 178 28.0000000
26 180 4.8461538
27 183 25.5000000
28 184 13.0000000
29 210 32.0000000
30 215 13.4615385
31 220 11.3611111
32 223 26.2500000
I want to first sort the dataset by avgfrequency and then plot the count of USER.IDs that fall into each bin.
The bins should divide avgfrequency into categories of width 10.
I am trying to sort the data using:
user_avgfrequency <- user_avgfrequency[order(user_avgfrequency[,1]), ]
but I am getting an error.

df <- data.frame(USER.ID=c(3,7,9,13,16,26,49,50,51,52,69,75,91,98,138,139,149,155,167,170,171,175,176,177,178,180,183,184,210,215,220,223), avgfrequency=c(3.7821782,14.7500000,13.4761905,5.1967213,6.7812500,41.7500000,13.6666667,7.0000000,1.0000000,17.7500000,4.5000000,9.9500000,84.2000000,8.0185185,14.2000000,34.7500000,7.6666667,35.3333333,24.0000000,7.3529412,4.4210526,6.5781250,19.2857143,10.4864865,28.0000000,4.8461538,25.5000000,13.0000000,32.0000000,13.4615385,11.3611111,26.2500000) );
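# Bin edges of width 10, from 0 up to the first multiple of 10 at or above the maximum
breaks <- seq(0, ceiling(max(df$avgfrequency) / 10) * 10, by = 10)
# One colour per bin
cols <- colorRampPalette(c('blue', 'green', 'red'))(length(breaks) - 1)
# hist() bins the values itself, so sorting beforehand is not needed
hist(df$avgfrequency, breaks, col = cols, axes = FALSE, xlab = 'Average Frequency', ylab = 'Count')
axis(1, breaks)
axis(2, 0:max(tabulate(cut(df$avgfrequency, breaks))))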
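If the sorted data and the per-bin counts are wanted as objects rather than only as a plot, a minimal sketch along the following lines may help. It uses a five-row subset of the question's data with rounded values, and the names small, sorted and counts are illustrative only.

```r
# Five rows taken from the question's data (values rounded for brevity)
small <- data.frame(
  USER.ID      = c(3, 7, 9, 13, 91),
  avgfrequency = c(3.78, 14.75, 13.48, 5.20, 84.20)
)

# Sort by the avgfrequency column; indexing by name avoids picking the wrong column
sorted <- small[order(small$avgfrequency), ]

# Bin edges of width 10, then count how many users fall into each bin
breaks <- seq(0, ceiling(max(small$avgfrequency) / 10) * 10, by = 10)
counts <- table(cut(small$avgfrequency, breaks))

print(sorted)
print(counts)
```

Note that order(user_avgfrequency[,1]) in the question sorts by the first column (USER.ID), not by avgfrequency.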

How to process multi columns data in data.frame with plyr

I am trying to process DSC (differential scanning calorimetry) data with R, but I have run into some trouble. In my lab this has always been done tediously in Origin or QtiPlot, and I wonder whether it can be done in batch instead. So far the results have not been good. For example, perhaps I have used the wrong column names in my data.frame, because the code
dat$0.5min
Error: unexpected numeric constant in "dat$0.5"
cannot reach my data.
Below is the full description of what I am trying to do. Thank you in advance!
The DSC data looks like this (the CSV file is stored on my Google Drive):
T1 0.5min T2 1min
40.59 -0.2904 40.59 -0.2545
40.81 -0.281 40.81 -0.2455
41.04 -0.2747 41.04 -0.2389
41.29 -0.2728 41.29 -0.2361
41.54 -0.2553 41.54 -0.2239
41.8 -0.07 41.8 -0.0732
42.06 0.1687 42.06 0.1414
42.32 0.3194 42.32 0.2817
42.58 0.3814 42.58 0.3421
42.84 0.3863 42.84 0.3493
43.1 0.3665 43.11 0.3322
43.37 0.3438 43.37 0.3109
43.64 0.3265 43.64 0.2937
43.9 0.3151 43.9 0.2819
44.17 0.3072 44.17 0.2735
44.43 0.2995 44.43 0.2656
44.7 0.2899 44.7 0.2563
44.96 0.2779 44.96 0.245
In fact I have merged the data into a single data.frame and hope I can adjust it and process it further.
The command is:
dat<-read.csv("Book1.csv",header=F)
colnames(dat)<-c('T1','0.5min','T2','1min','T3','2min','T4','4min','T5','8min','T6','10min',
'T7','20min','T8','ascast1','T9','ascast2','T10','ascast3','T11','ascast4',
'T12','ascast5'
)
So dat is a data.frame with 1163 obs. of 24 variables.
T1, T2, T3, ..., T12 are the temperatures at which the samples were measured by DSC. Although they cover the same interval, they differ slightly because of the instability of the machine.
The column next to each of T1~T12 is the heat flow recorded by the machine for a given heat-treatment duration; ascast1~ascast5 are samples with no treatment, used to check the accuracy of the machine.
Now I need to do the following:
T1~T12 are in degrees Celsius; I need to convert them to kelvins, which means adding 273.15 to every value.
Two temperatures are chosen to compare the results: Ts=180.25 and Te=240.45 (both in degrees Celsius; I have checked them in QtiPlot). To be clear, I list the rows for these two temperatures and the first few columns.
T1 0.5min T2 1min T3 2min T4 4min
180.25 -0.01710000 180.25 -0.01780000 180.25 -0.02120000 180.25 -0.02020000
. . . .
. . . .
240.45 0.05700000 240.45 0.04500000 240.45 0.05780000 240.45 0.05580000
The heat flow at Ts should be the same for all columns, and can be set to 0 for convenience. So for each heat flow column (0.5min, 1min, 2min, 4min, 8min, 10min, 20min and ascast1~ascast5), the heat flow value at Ts should be subtracted from every value in that column.
The heat flow at Te should also be adjusted so that all the heat flow columns have the same value at Te. The procedure is: (1) calculate the mean of the 12 heat flow values at Te; call it Hmean. This is the value that every heat flow column should have at Te. (2) For the data in column 0.5min, which I denote col("0.5min"), the linear transformation is:
col("0.5min")-[([0.05700000-(-0.01710000)]-Hmean)/(Te-Ts)]*(col(T1)-Ts)
Actually, [0.05700000-(-0.01710000)] is already the result of the previous step (subtracting the heat flow at Ts), but I write it out for your reference. The same formula is applied to each pair of temperature and heat flow columns, (T1, 0.5min), (T2, 1min), (T3, 2min), ..., 12 pairs in all.
Now we can plot the 12 pairs on the same plot over the interval 180~240 (also in degrees Celsius) to magnify the differences between the DSC scans.
I have been stuck on this problem for 2 days, so I am turning to Stack Overflow for help.
Thanks!

I am assuming that your actual question is the one at the beginning, where you got the following error:
dat$0.5min
Error: unexpected numeric constant in "dat$0.5"
I could not find a question in the remaining steps; they read like a step-by-step description of an experimental procedure.
The error occurs because the column name starts with a digit, so it is not a syntactically valid R name. To reference such a column with $, wrap the name in backticks (`):
> dataF <- data.frame("0.5min"=1:10,"T2"=11:20,check.names = FALSE)
> dataF$`0.5min`
[1] 1 2 3 4 5 6 7 8 9 10
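As a small illustration of the same point, the sketch below shows three equivalent ways to reach a column whose name starts with a digit, plus the name R would generate if check.names were left on. The two-row data frame reuses the first values from the question; the variable names are illustrative.

```r
# Two rows from the question's data; check.names = FALSE keeps "0.5min" as-is
dat <- data.frame(
  T1       = c(40.59, 40.81),
  `0.5min` = c(-0.2904, -0.2810),
  check.names = FALSE
)

flow_a <- dat$`0.5min`     # backticks with $
flow_b <- dat[["0.5min"]]  # a quoted string with [[ ]]
flow_c <- dat[, "0.5min"]  # a quoted string with [ , ]

# With check.names = TRUE (the default), R would rename the column like this
fixed_name <- make.names("0.5min")
print(fixed_name)
```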
Following up on the comments, which added more information: you can add a constant to alternate columns in the following manner:
dataF <- data.frame(matrix(1:100,10,10))
const <- 237
> print(dataF)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1 11 21 31 41 51 61 71 81 91
2 2 12 22 32 42 52 62 72 82 92
3 3 13 23 33 43 53 63 73 83 93
4 4 14 24 34 44 54 64 74 84 94
5 5 15 25 35 45 55 65 75 85 95
6 6 16 26 36 46 56 66 76 86 96
7 7 17 27 37 47 57 67 77 87 97
8 8 18 28 38 48 58 68 78 88 98
9 9 19 29 39 49 59 69 79 89 99
10 10 20 30 40 50 60 70 80 90 100
dataF[,seq(1,ncol(dataF),by = 2)] <- dataF[,seq(1,ncol(dataF),by = 2)] + const
> print(dataF)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 238 11 258 31 278 51 298 71 318 91
2 239 12 259 32 279 52 299 72 319 92
3 240 13 260 33 280 53 300 73 320 93
4 241 14 261 34 281 54 301 74 321 94
5 242 15 262 35 282 55 302 75 322 95
6 243 16 263 36 283 56 303 76 323 96
7 244 17 264 37 284 57 304 77 324 97
8 245 18 265 38 285 58 305 78 325 98
9 246 19 266 39 286 59 306 79 326 99
10 247 20 267 40 287 60 307 80 327 100
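Applied to the question's layout, the same idea can select the temperature columns by name instead of by position. This is a minimal sketch on a two-row, four-column subset of the question's data; it assumes the temperature columns are exactly those named T followed by digits, and it uses 273.15, the standard Celsius-to-kelvin offset.

```r
# Two rows and two (temperature, heat flow) pairs from the question's data
dat <- data.frame(
  T1 = c(40.59, 40.81), `0.5min` = c(-0.2904, -0.2810),
  T2 = c(40.59, 40.81), `1min`   = c(-0.2545, -0.2455),
  check.names = FALSE
)

# Pick the columns named T1, T2, ... (assumed naming pattern)
temp_cols <- grep("^T[0-9]+$", names(dat))

# Convert only those columns from degrees Celsius to kelvins
dat[temp_cols] <- dat[temp_cols] + 273.15
print(dat)
```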
To generalize: the columns of a data frame can be referenced with a vector of column numbers or column names, and most operations in R are vectorized. You can therefore select columns by name or by number, according to whatever pattern you are looking for.
For example, if I rename my first two columns and want to access just those, I do this:
colnames(dataF)[c(1,2)] <- c("Y1","Y2")
# Select all columns whose name contains "Y". You can then apply any operation to them.
dataF[,grep("Y",colnames(dataF))]
Y1 Y2
1 238 11
2 239 12
3 240 13
4 241 14
5 242 15
6 243 16
7 244 17
8 245 18
9 246 19
10 247 20
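The baseline adjustment described in the question can be built from the same column-wise operations. The following is only a sketch of one reading of those steps, not a verified DSC procedure: it uses two column pairs, the Ts and Te rows quoted in the question, and one made-up middle row at 210.00 for illustration. The helper nearest_row is hypothetical, and the temperatures stay in degrees Celsius.

```r
Ts <- 180.25
Te <- 240.45

# Rows at Ts and Te come from the question; the middle row is invented
dat <- data.frame(
  T1 = c(180.25, 210.00, 240.45), `0.5min` = c(-0.0171, 0.0200, 0.0570),
  T2 = c(180.25, 210.00, 240.45), `1min`   = c(-0.0178, 0.0150, 0.0450),
  check.names = FALSE
)

temp_cols <- seq(1, ncol(dat), by = 2)  # T1, T2, ...
flow_cols <- temp_cols + 1              # the heat flow column next to each

# Hypothetical helper: row whose temperature is closest to the target
nearest_row <- function(temp, target) which.min(abs(temp - target))

# Step 1: shift each heat flow column so that it is 0 at Ts
for (k in seq_along(temp_cols)) {
  i_ts <- nearest_row(dat[[temp_cols[k]]], Ts)
  dat[[flow_cols[k]]] <- dat[[flow_cols[k]]] - dat[[flow_cols[k]]][i_ts]
}

# Step 2: mean of the shifted heat flows at Te
flow_at_te <- sapply(seq_along(temp_cols), function(k) {
  dat[[flow_cols[k]]][nearest_row(dat[[temp_cols[k]]], Te)]
})
Hmean <- mean(flow_at_te)

# Step 3: remove a linear term so that every column equals Hmean at Te
for (k in seq_along(temp_cols)) {
  slope <- (flow_at_te[k] - Hmean) / (Te - Ts)
  dat[[flow_cols[k]]] <- dat[[flow_cols[k]]] - slope * (dat[[temp_cols[k]]] - Ts)
}
print(dat)
```

Using the nearest row rather than an exact match is a design choice to cope with the small temperature differences between scans that the question mentions.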

Reading html tables in R using readHTMLTable where most work but some do not

I am trying to read a few hundred HTML tables using readHTMLTable in R. This mostly works fine, except for a couple of tables. The tables look fine in Firefox.
Specifically, tables are by year and state. The following code reads the first table for Maryland in 2005 and works fine:
readHTMLTable("http://www.ssa.gov/policy/docs/statcomps/oasdi_sc/2005/md.html", header=FALSE)[[1]]
However, when trying to do this for Maryland and 2006, the table consists only of the first row of numbers.
readHTMLTable("http://www.ssa.gov/policy/docs/statcomps/oasdi_sc/2006/md.html", header=FALSE)[[1]]
I'm not sure where the problem is and would appreciate it if anyone could point me in the right direction.
Stephan

The problem is in the second URL, "http://www.ssa.gov/policy/docs/statcomps/oasdi_sc/2006/md.html". If you inspect the page source, you will see that "table 4" contains two tbody elements. readHTMLTable appears to read only the first tbody it finds in the table, which is why you get only the first row (the content of the first tbody).
You need to specify the tbody you want. In your case it is the 2nd tbody of the table inside the div with the id "table4", which you can select with the XPath "//div[@id='table4']/table/tbody[2]":
library(XML)
doc <- "http://www.ssa.gov/policy/docs/statcomps/oasdi_sc/2006/md.html"
body <- getNodeSet(htmlParse(doc), "//div[@id='table4']/table/tbody[2]")[[1]]
> readHTMLTable(body)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 Allegany 16,515 10,060 1,060 120 1,955 560 2,225 75 460 4,835 7,050
2 Anne Arundel 69,550 47,150 3,245 475 6,380 2,750 7,760 100 1,690 21,900 29,260
3 Baltimore 136,035 91,755 6,040 1,200 13,470 5,545 14,695 175 3,155 41,480 61,215
4 Calvert 10,655 7,035 450 95 1,035 520 1,195 20 305 3,225 4,330
5 Caroline 5,955 3,835 180 70 575 245 835 10 205 1,760 2,355
6 Carroll 24,835 17,205 1,030 160 2,270 825 2,675 30 640 7,635 10,880
7 Cecil 15,030 8,870 630 140 1,415 725 2,435 55 760 4,210 5,330
8 Charles 14,925 9,305 625 135 1,320 950 2,040 20 530 4,275 5,625
9 Dorchester 6,980 4,860 165 65 550 265 895 10 170 2,105 2,875
10 Frederick 28,270 18,950 1,300 225 2,585 1,025 3,205 40 940 8,550 12,080
11 Garrett 6,300 3,760 435 45 780 225 855 40 160 1,910 2,485
12 Harford 34,695 23,020 1,540 235 3,330 1,365 4,140 60 1,005 10,540 14,330
13 Howard 26,855 18,825 1,150 260 2,085 1,275 2,555 25 680 8,595 11,330
14 Kent 5,385 3,865 280 40 485 125 500 5 85 1,815 2,385
15 Montgomery 105,195 76,640 6,085 1,105 8,810 3,105 7,615 80 1,755 35,725 50,710
16 Prince George's 84,190 53,900 2,770 1,025 6,370 5,815 11,420 75 2,815 24,310 32,780
17 Queen Anne's 7,050 5,030 310 50 545 225 695 15 180 2,395 2,825
18 St. Mary's 11,220 7,195 570 95 1,135 520 1,380 10 315 3,540 4,380
19 Somerset 4,625 3,055 155 55 385 180 665 10 120 1,365 1,830
20 Talbot 9,260 6,910 485 70 780 170 695 5 145 3,255 4,105
21 Washington 25,385 16,440 1,245 225 2,500 900 3,290 65 720 7,585 10,595
22 Wicomico 16,040 10,700 490 140 1,300 690 2,205 35 480 4,680 6,480
23 Worcester 13,365 10,235 440 70 965 275 1,130 20 230 4,605 5,765
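The selection can be tried offline on a small inline document. The sketch below assumes the XML package is installed; the HTML snippet and its numbers are made up to imitate a table with two tbody elements and are not taken from the SSA page.

```r
library(XML)

# Made-up HTML imitating a table whose rows are split across two tbody elements
html <- '<div id="table4"><table>
  <tbody><tr><td>Maryland</td><td>100</td></tr></tbody>
  <tbody>
    <tr><td>Allegany</td><td>10</td></tr>
    <tr><td>Calvert</td><td>20</td></tr>
  </tbody>
</table></div>'

doc <- htmlParse(html, asText = TRUE)

# Count the tbody elements inside the table of interest
n_tbody <- length(getNodeSet(doc, "//div[@id='table4']/table/tbody"))

# Pull the first cell of every row in the second tbody
first_col <- xpathSApply(doc, "//div[@id='table4']/table/tbody[2]/tr/td[1]", xmlValue)

print(n_tbody)
print(first_col)
```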