Delete several lines in a txt file conditionally in R

I have a problem: how do I delete several lines from a txt file and then convert it to CSV with R? I only want to keep the data from the txt file.
My code does not delete properly, because it also deletes the lines that contain the date of the data.
Here is the code I used:
setwd("D:/tugasmaritim/")
FILES <- list.files( pattern = ".txt")
for (i in 1:length(FILES)) {
l <- readLines(FILES[i],skip=4)
l2 <- l[-sapply(grep("</PRE><H3>", l), function(x) seq(x, x + 30))]
l3 <- l2[-sapply(grep("<P>Description", l2), function(x) seq(x, x + 29))]
l4 <- l3[-sapply(grep("<HTML>", l3), function(x) seq(x, x + 3))]
write.csv(l4,row.names=FALSE,file=paste0("D:/tugasmaritim/",sub(".txt","",FILES[i]),".csv"))
}
My data looks like this:
<HTML>
<TITLE>University of Wyoming - Radiosonde Data</TITLE>
<LINK REL="StyleSheet" HREF="/resources/select.css" TYPE="text/css">
<BODY BGCOLOR="white">
<H2>96749 WIII Jakarta Observations at 00Z 02 Oct 1995</H2>
<PRE>
-----------------------------------------------------------------------------
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
hPa m C C % g/kg deg knot K K K
-----------------------------------------------------------------------------
1011.0 8 23.2 22.5 96 17.30 0 0 295.4 345.3 298.5
1000.0 98 23.6 22.4 93 17.39 105 8 296.8 347.1 299.8
977.3 300 24.6 22.1 86 17.49 105 8 299.7 351.0 302.8
976.0 311 24.6 22.1 86 17.50 104 8 299.8 351.2 303.0
950.0 548 23.0 22.0 94 17.87 88 12 300.5 353.2 303.7
944.4 600 22.6 21.8 95 17.73 85 13 300.6 352.9 303.8
925.0 781 21.2 21.0 99 17.25 90 20 301.0 351.9 304.1
918.0 847 20.6 20.6 100 16.95 90 23 301.0 351.0 304.1
912.4 900 20.4 18.6 89 15.00 90 26 301.4 345.7 304.1
897.0 1047 20.0 13.0 64 10.60 90 26 302.4 334.1 304.3
881.2 1200 19.4 11.4 60 9.70 90 26 303.3 332.5 305.1
850.0 1510 18.2 8.2 52 8.09 95 18 305.2 329.9 306.7
845.0 1560 18.0 7.0 49 7.49 91 17 305.5 328.4 306.9
810.0 1920 15.0 9.0 67 8.97 60 11 306.0 333.4 307.7
792.9 2100 14.3 3.1 47 6.06 45 8 307.1 325.9 308.2
765.1 2400 13.1 -6.8 24 3.01 40 8 309.0 318.7 309.5
746.0 2612 12.2 -13.8 15 1.77 38 10 310.3 316.2 310.6
712.0 3000 10.3 -15.0 15 1.69 35 13 312.3 318.1 312.6
700.0 3141 9.6 -15.4 16 1.66 35 13 313.1 318.7 313.4
653.0 3714 6.6 -16.4 18 1.63 32 12 316.0 321.6 316.3
631.0 3995 4.8 -2.2 60 5.19 31 11 317.0 333.9 318.0
615.3 4200 3.1 -3.9 60 4.70 30 11 317.4 332.8 318.3
601.0 4391 1.6 -5.4 60 4.28 20 8 317.8 331.9 318.6
592.9 4500 0.6 -12.0 38 2.59 15 6 317.9 326.6 318.4
588.0 4567 0.0 -16.0 29 1.88 11 6 317.9 324.4 318.3
571.0 4800 -1.2 -18.9 25 1.51 355 5 319.1 324.4 319.4
549.8 5100 -2.8 -22.8 20 1.12 45 6 320.7 324.8 321.0
513.0 5649 -5.7 -29.7 13 0.64 125 10 323.6 326.0 323.8
500.0 5850 -5.1 -30.1 12 0.63 155 11 326.8 329.1 326.9
494.0 5945 -4.9 -29.9 12 0.65 146 11 328.1 330.6 328.3
471.7 6300 -7.4 -32.0 12 0.56 110 13 329.3 331.5 329.4
453.7 6600 -9.6 -33.8 12 0.49 100 14 330.3 332.2 330.4
400.0 7570 -16.5 -39.5 12 0.31 105 14 333.5 334.7 333.5
398.0 7607 -16.9 -39.9 12 0.30 104 14 333.4 334.6 333.5
371.9 8100 -20.4 -42.6 12 0.24 95 16 335.4 336.3 335.4
300.0 9660 -31.3 -51.3 12 0.11 115 18 341.1 341.6 341.2
269.0 10420 -36.3 -55.3 12 0.08 79 20 344.7 345.0 344.7
265.9 10500 -36.9 75 20 344.9 344.9
250.0 10920 -40.3 80 28 346.0 346.0
243.4 11100 -41.8 85 37 346.4 346.4
222.5 11700 -46.9 75 14 347.6 347.6
214.0 11960 -49.1 68 16 348.1 348.1
200.0 12400 -52.7 55 20 349.1 349.1
156.0 13953 -66.1 55 25 352.1 352.1
152.3 14100 -67.2 55 26 352.6 352.6
150.0 14190 -67.9 55 26 352.9 352.9
144.7 14400 -69.6 60 26 353.6 353.6
137.5 14700 -72.0 60 39 354.6 354.6
130.7 15000 -74.3 50 28 355.6 355.6
124.2 15300 -76.7 40 36 356.5 356.5
118.0 15600 -79.1 50 48 357.4 357.4
116.0 15698 -79.9 45 44 357.6 357.6
112.0 15900 -79.1 45 26 362.6 362.6
106.3 16200 -78.0 35 24 370.2 370.2
100.0 16550 -76.7 35 24 379.3 379.3
</PRE><H3>Station information and sounding indices</H3><PRE>
Station identifier: WIII
Station number: 96749
Observation time: 951002/0000
Station latitude: -6.11
Station longitude: 106.65
Station elevation: 8.0
Showalter index: 6.30
Lifted index: -1.91
LIFT computed using virtual temperature: -2.80
SWEAT index: 145.41
K index: 6.50
Cross totals index: 13.30
Vertical totals index: 23.30
Totals totals index: 36.60
Convective Available Potential Energy: 799.02
CAPE using virtual temperature: 1070.13
Convective Inhibition: -26.70
CINS using virtual temperature: -12.88
Equilibrum Level: 202.64
Equilibrum Level using virtual temperature: 202.60
Level of Free Convection: 828.70
LFCT using virtual temperature: 909.19
Bulk Richardson Number: 210.78
Bulk Richardson Number using CAPV: 282.30
Temp [K] of the Lifted Condensation Level: 294.96
Pres [hPa] of the Lifted Condensation Level: 958.67
Mean mixed layer potential temperature: 298.56
Mean mixed layer mixing ratio: 17.50
1000 hPa to 500 hPa thickness: 5752.00
Precipitable water [mm] for entire sounding: 36.31
</PRE>
<H2>96749 WIII Jakarta Observations at 00Z 03 Oct 1995</H2>
<PRE>
-----------------------------------------------------------------------------
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
hPa m C C % g/kg deg knot K K K
-----------------------------------------------------------------------------
1012.0 8 23.6 22.9 96 17.72 140 2 295.7 346.9 298.9
1000.0 107 24.0 21.6 86 16.54 135 3 297.1 345.2 300.1
990.0 195 24.4 20.3 78 15.39 128 4 298.4 343.4 301.2
945.4 600 22.9 20.2 85 16.00 95 7 300.9 348.0 303.7
925.0 791 22.2 20.1 88 16.29 100 6 302.0 350.3 304.9
913.5 900 21.9 18.2 80 14.63 105 6 302.8 346.3 305.4
911.0 924 21.8 17.8 78 14.28 108 6 302.9 345.4 305.5
850.0 1522 17.4 16.7 96 14.28 175 6 304.4 347.1 307.0
836.0 1665 16.4 16.4 100 14.24 157 7 304.8 347.5 307.4
811.0 1925 15.0 14.7 98 13.14 123 8 305.9 345.6 308.3
795.0 2095 14.2 7.2 63 8.08 101 9 306.8 331.6 308.3
794.5 2100 14.2 7.2 63 8.05 100 9 306.8 331.5 308.3
745.0 2642 10.4 2.4 58 6.14 64 11 308.4 327.6 309.6
736.0 2744 11.0 0.0 47 5.23 57 11 310.2 326.7 311.1
713.8 3000 9.2 5.0 75 7.70 40 12 310.9 335.0 312.4
711.0 3033 9.0 5.6 79 8.08 40 12 311.0 336.2 312.6
700.0 3163 8.6 1.6 61 6.18 40 12 312.0 331.5 313.1
688.5 3300 8.3 -6.0 36 3.57 60 12 313.1 324.8 313.8
678.0 3427 8.0 -13.0 21 2.08 70 12 314.2 321.2 314.6
642.0 3874 5.0 -2.0 61 5.17 108 11 315.7 332.4 316.7
633.0 3989 4.4 -11.6 30 2.50 117 10 316.3 324.7 316.8
616.6 4200 3.1 -14.1 27 2.09 135 10 317.1 324.3 317.6
580.0 4694 0.0 -20.0 21 1.36 164 13 319.1 323.9 319.4
572.3 4800 -0.4 -20.7 20 1.29 170 14 319.9 324.5 320.1
510.8 5700 -4.0 -26.6 15 0.86 80 10 326.1 329.2 326.2
500.0 5870 -4.7 -27.7 15 0.79 80 10 327.2 330.2 327.4
497.0 5917 -4.9 -27.9 15 0.78 71 13 327.6 330.5 327.7
491.7 6000 -5.5 -28.3 15 0.76 55 19 327.9 330.7 328.0
473.0 6300 -7.6 -29.9 15 0.68 55 16 328.9 331.4 329.0
436.0 6930 -12.1 -33.1 16 0.54 77 17 330.9 333.0 331.0
400.0 7580 -17.9 -37.9 16 0.37 100 19 331.6 333.1 331.7
388.3 7800 -19.9 -39.9 15 0.31 105 20 331.8 333.1 331.9
386.0 7844 -20.3 -40.3 15 0.30 103 20 331.9 333.1 331.9
372.0 8117 -18.3 -38.3 16 0.38 91 23 338.1 339.6 338.1
343.6 8700 -22.1 -41.4 16 0.30 65 29 340.7 342.0 340.8
329.0 9018 -24.1 -43.1 16 0.26 73 27 342.2 343.2 342.2
300.0 9680 -29.9 -44.9 22 0.23 90 22 343.1 344.1 343.2
278.6 10200 -34.3 85 37 344.1 344.1
266.9 10500 -36.8 60 32 344.7 344.7
255.8 10800 -39.4 65 27 345.2 345.2
250.0 10960 -40.7 65 27 345.4 345.4
204.0 12300 -51.8 55 23 348.6 348.6
200.0 12430 -52.9 55 23 348.8 348.8
194.6 12600 -55.0 60 23 348.1 348.1
160.7 13800 -70.1 35 39 342.4 342.4
153.2 14100 -73.9 35 41 340.6 340.6
150.0 14230 -75.5 35 41 339.9 339.9
131.5 15000 -76.3 50 53 351.6 351.6
124.9 15300 -76.6 50 57 356.2 356.2
122.0 15436 -76.7 57 45 358.3 358.3
118.6 15600 -77.3 65 31 360.2 360.2
115.0 15779 -77.9 65 31 362.2 362.2
112.6 15900 -77.7 85 17 364.8 364.8
107.0 16200 -77.2 130 10 371.2 371.2
100.0 16590 -76.5 120 18 379.7 379.7
</PRE><H3>Station information and sounding indices</H3><PRE>
Station identifier: WIII
Station number: 96749
Observation time: 951003/0000
Station latitude: -6.11
Station longitude: 106.65
Station elevation: 8.0
Showalter index: -0.58
Lifted index: 0.17
LIFT computed using virtual temperature: -0.57
SWEAT index: 222.41
K index: 31.80
Cross totals index: 21.40
Vertical totals index: 22.10
Totals totals index: 43.50
Convective Available Potential Energy: 268.43
CAPE using virtual temperature: 431.38
Convective Inhibition: -84.04
CINS using virtual temperature: -81.56
Equilibrum Level: 141.42
Equilibrum Level using virtual temperature: 141.35
Level of Free Convection: 784.91
LFCT using virtual temperature: 804.89
Bulk Richardson Number: 221.19
Bulk Richardson Number using CAPV: 355.46
Temp [K] of the Lifted Condensation Level: 293.21
Pres [hPa] of the Lifted Condensation Level: 940.03
Mean mixed layer potential temperature: 298.46
Mean mixed layer mixing ratio: 16.01
1000 hPa to 500 hPa thickness: 5763.00
Precipitable water [mm] for entire sounding: 44.54
Here is my data:
data
And this is what I want to get:
contoh (example)
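One possible approach, sketched here under assumptions rather than taken from the post: instead of deleting line ranges by offset, keep only the rows between each <H2> header and the following </PRE><H3> line, and carry the date from the header into its own column. Note that readLines() has no skip argument. The helper names parse_soundings and fw are hypothetical. The code assumes that the raw files are fixed-width with 7 characters per column (so blank fields stay aligned) and that the data rows start 6 lines after each <H2> line, as in the sample above.

```r
# Hypothetical helper: turn the lines of one file into a data frame with a date column.
# Assumes fixed-width columns of 7 characters and the header layout shown in the sample.
parse_soundings <- function(lines) {
  col_names <- c("PRES", "HGHT", "TEMP", "DWPT", "RELH", "MIXR",
                 "DRCT", "SKNT", "THTA", "THTE", "THTV")
  starts <- grep("^<H2>.*Observations at", lines)  # one header line per sounding
  ends   <- grep("^</PRE><H3>", lines)             # line that closes each data table
  stopifnot(length(starts) == length(ends))

  blocks <- lapply(seq_along(starts), function(i) {
    # the text after " at " in the header is the observation date
    obs_date <- sub("^<H2>.* at (.*)</H2>$", "\\1", lines[starts[i]])
    # skip <PRE>, rule, names, units, rule: data rows start 6 lines after the header
    rows <- lines[(starts[i] + 6):(ends[i] - 1)]
    # cut each row into 7-character fields; blank or missing fields become NA
    vals <- lapply(seq_along(col_names), function(j) {
      as.numeric(substr(rows, 7 * j - 6, 7 * j))
    })
    names(vals) <- col_names
    data.frame(date = rep(obs_date, length(rows)), vals, stringsAsFactors = FALSE)
  })
  do.call(rbind, blocks)
}

# Small test input in the same layout; fw() pads every field to 7 characters
fw <- function(...) paste(formatC(c(...), width = 7), collapse = "")
sample_lines <- c(
  "<H2>96749 WIII Jakarta Observations at 00Z 02 Oct 1995</H2>",
  "<PRE>",
  strrep("-", 77),
  fw("PRES", "HGHT", "TEMP", "DWPT", "RELH", "MIXR", "DRCT", "SKNT", "THTA", "THTE", "THTV"),
  fw("hPa", "m", "C", "C", "%", "g/kg", "deg", "knot", "K", "K", "K"),
  strrep("-", 77),
  fw("1011.0", "8", "23.2", "22.5", "96", "17.30", "0", "0", "295.4", "345.3", "298.5"),
  fw("265.9", "10500", "-36.9", "", "", "", "75", "20", "344.9", "", "344.9"),
  "</PRE><H3>Station information and sounding indices</H3><PRE>",
  "Station identifier: WIII",
  "</PRE>"
)
soundings <- parse_soundings(sample_lines)
print(soundings)
```

For real files, something like write.csv(parse_soundings(readLines(f)), sub("\\.txt$", ".csv", f), row.names = FALSE) inside the loop over FILES would write one CSV per input file. A sounding with no data rows is not handled by this sketch.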

Related

How to sample data non-random

I have a weather dataset, and the data is date-dependent.
I want to predict the temperature from 07 May 2008 until 18 May 2008 (maybe 10-15 observations in total). My data has around 200 rows.
I will be using decision tree/RF, SVM and NN models to make my prediction.
I've never handled data like this, so I'm not sure how to sample non-random data.
I want to split the data into 80% training data and 20% test data, but in the original order rather than randomly. Is that possible?
install.packages("rattle")
install.packages("RGtk2")
library("rattle")
seed <- 42
set.seed(seed)
fname <- system.file("csv", "weather.csv", package = "rattle")
dataset <- read.csv(fname, encoding = "UTF-8")
dataset <- dataset[1:200,]
dataset <- dataset[order(dataset$Date),]
set.seed(321)
sample_data = sample(nrow(dataset), nrow(dataset)*.8)
test<-dataset[sample_data,] # note: this selects a random 80% of the rows
train<-dataset[-sample_data,] # note: this keeps the remaining 20%
Output
> head(dataset)
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
1 2007-11-01 Canberra 8.0 24.3 0.0 3.4 6.3 NW 30
2 2007-11-02 Canberra 14.0 26.9 3.6 4.4 9.7 ENE 39
3 2007-11-03 Canberra 13.7 23.4 3.6 5.8 3.3 NW 85
4 2007-11-04 Canberra 13.3 15.5 39.8 7.2 9.1 NW 54
5 2007-11-05 Canberra 7.6 16.1 2.8 5.6 10.6 SSE 50
6 2007-11-06 Canberra 6.2 16.9 0.0 5.8 8.2 SE 44
WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
1 SW NW 6 20 68 29 1019.7
2 E W 4 17 80 36 1012.4
3 N NNE 6 6 82 69 1009.5
4 WNW W 30 24 62 56 1005.5
5 SSE ESE 20 28 68 49 1018.3
6 SE E 20 24 70 57 1023.8
Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
1 1015.0 7 7 14.4 23.6 No 3.6 Yes
2 1008.4 5 3 17.5 25.7 Yes 3.6 Yes
3 1007.2 8 7 15.4 20.2 Yes 39.8 Yes
4 1007.0 2 7 13.5 14.1 Yes 2.8 Yes
5 1018.5 7 7 11.1 15.4 Yes 0.0 No
6 1021.7 7 5 10.9 14.8 No 0.2 No
> head(test)
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
182 2008-04-30 Canberra -1.8 14.8 0.0 1.4 7.0 N 28
77 2008-01-16 Canberra 17.9 33.2 0.0 10.4 8.4 N 59
88 2008-01-27 Canberra 13.2 31.3 0.0 6.6 11.6 WSW 46
58 2007-12-28 Canberra 15.1 28.3 14.4 8.8 13.2 NNW 28
96 2008-02-04 Canberra 18.2 22.6 1.8 8.0 0.0 ENE 33
126 2008-03-05 Canberra 12.0 27.6 0.0 6.0 11.0 E 46
WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
182 E N 2 19 80 40 1024.2
77 N NNE 15 20 58 62 1008.5
88 N WNW 4 26 71 28 1013.1
58 NNW NW 6 13 73 44 1016.8
96 SSE ENE 7 13 92 76 1014.4
126 SSE WSW 7 6 69 35 1025.5
Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
182 1020.5 1 7 5.3 13.9 No 0.0 No
77 1006.1 6 7 24.5 23.5 No 4.8 Yes
88 1009.5 1 4 19.7 30.7 No 0.0 No
58 1013.4 1 5 18.3 27.4 Yes 0.0 No
96 1011.5 8 8 18.5 22.1 Yes 9.0 Yes
126 1022.2 1 1 15.7 26.2 No 0.0 No
> head(train)
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
7 2007-11-07 Canberra 6.1 18.2 0.2 4.2 8.4 SE 43
9 2007-11-09 Canberra 8.8 19.5 0.0 4.0 4.1 S 48
11 2007-11-11 Canberra 9.1 25.2 0.0 4.2 11.9 N 30
16 2007-11-16 Canberra 12.4 32.1 0.0 8.4 11.1 E 46
22 2007-11-22 Canberra 16.4 19.4 0.4 9.2 0.0 E 26
25 2007-11-25 Canberra 15.4 28.4 0.0 4.4 8.1 ENE 33
WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
7 SE ESE 19 26 63 47 1024.6
9 E ENE 19 17 70 48 1026.1
11 SE NW 6 9 74 34 1024.4
16 SE WSW 7 9 70 22 1017.9
22 ENE E 6 11 88 72 1010.7
25 SSE NE 9 15 85 31 1022.4
Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
7 1022.2 4 6 12.4 17.3 No 0.0 No
9 1022.7 7 7 14.1 18.9 No 16.2 Yes
11 1021.1 1 2 14.6 24.0 No 0.2 No
16 1012.8 0 3 19.1 30.7 No 0.0 No
22 1008.9 8 8 16.5 18.3 No 25.8 Yes
25 1018.6 8 2 16.8 27.3 No 0.0 No
I use mtcars as an example. One way to split your data into train and test sets non-randomly is to first compute the training-set size from the number of rows in your data. After that, you can use split to cut the data exactly at the 80% mark. You can use the following code:
smp_size <- floor(0.80 * nrow(mtcars))
# label the first smp_size rows 1 and the remaining rows 2
split <- split(mtcars, rep(1:2, c(smp_size, nrow(mtcars) - smp_size)))
With the following code you can turn the split into train and test sets:
train <- split$`1`
test <- split$`2`
Let's check the number of rows:
> nrow(train)
[1] 25
> nrow(test)
[1] 7
Now the data is split into train and test sets without losing the original row order.
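The same ordered split can also be written with plain row indices. This is a minimal sketch, not part of the original answer; the function name ordered_split and its train_frac parameter are made up for illustration.

```r
# Hypothetical helper: the first train_frac of the rows go to train, the rest to test
ordered_split <- function(df, train_frac = 0.8) {
  n_train <- floor(train_frac * nrow(df))
  train_idx <- seq_len(n_train)
  list(
    train = df[train_idx, , drop = FALSE],
    test  = df[setdiff(seq_len(nrow(df)), train_idx), , drop = FALSE]
  )
}

parts <- ordered_split(mtcars)
nrow(parts$train)  # 25
nrow(parts$test)   # 7
```

For date-dependent data, sorting the data frame by its date column before calling the helper keeps the test set strictly later in time than the training set.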

Split one column into multiple based on spaces in r

How can I split one column into multiple columns in R, using spaces as separators?
I have tried to find an answer for a few hours (even days), but now I am counting on you guys to help me!
This is what my data set looks like. It is all in one column. I don't really care about the column names, as in the end I will only need a few of them for my analysis:
[1] 1000.0 246
[2] 970.0 491 -3.3 -5.0 88 2.73 200 4 272.2 279.8 272.7
[3] 909.0 1002 -4.7 -6.6 87 2.58 200 12 275.9 283.2 276.3
[4] 900.0 1080 -5.5 -7.5 86 2.43 200 13 275.8 282.8 276.2
[5] 879.0 1264 -6.5 -8.8 84 2.25 200 16 276.7 283.1 277.0
[6] 850.0 1525 -6.5 -12.5 62 1.73 200 20 279.3 284.4 279.6
Also, I tried the separate function, and it gave me an error telling me that this is not possible for an object of class function.
Thanks a lot for your help!
It's always easier to help if there is a minimal reproducible example in the question. The data you show is not easily usable...
MRE:
data_vector <- c("1000.0 246",
"970.0 491 -3.3 -5.0 88 2.73 200 4 272.2 279.8 272.7",
"909.0 1002 -4.7 -6.6 87 2.58 200 12 275.9 283.2 276.3",
"900.0 1080 -5.5 -7.5 86 2.43 200 13 275.8 282.8 276.2",
"879.0 1264 -6.5 -8.8 84 2.25 200 16 276.7 283.1 277.0",
"850.0 1525 -6.5 -12.5 62 1.73 200 20 279.3 284.4 279.6")
And here is a solution using gsub and read.csv:
oo <- read.csv(text=gsub(" +", " ", paste0(data_vector, collapse="\n")), sep=" ", header=FALSE)
Which produces this output:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 1000 246 NA NA NA NA NA NA NA NA NA
2 970 491 -3.3 -5.0 88 2.73 200 4 272.2 279.8 272.7
3 909 1002 -4.7 -6.6 87 2.58 200 12 275.9 283.2 276.3
4 900 1080 -5.5 -7.5 86 2.43 200 13 275.8 282.8 276.2
5 879 1264 -6.5 -8.8 84 2.25 200 16 276.7 283.1 277.0
6 850 1525 -6.5 -12.5 62 1.73 200 20 279.3 284.4 279.6
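Another possible route, sketched under assumptions: split each string on runs of whitespace with strsplit and pad the short rows with NA before building the data frame. The variable names below are illustrative, and the sketch assumes that missing values only occur at the end of a row, as in the first row of the example.

```r
# Two rows from the example above, kept short for illustration
data_vector <- c("1000.0 246",
                 "970.0 491 -3.3 -5.0 88 2.73 200 4 272.2 279.8 272.7")

# split on one or more whitespace characters
parts <- strsplit(trimws(data_vector), "\\s+")
n_col <- max(lengths(parts))

# pad every row to n_col fields, convert to numeric, and stack the rows
mat <- t(vapply(parts,
                function(p) as.numeric(c(p, rep(NA, n_col - length(p)))),
                numeric(n_col)))
df <- as.data.frame(mat)
df
```

This avoids pasting everything into one string first, at the cost of handling the padding by hand.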
read.table/read.csv also work if we pass the character vector directly through the text argument:
read.table(text = data_vector, header = FALSE, fill = TRUE)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
#1 1000 246 NA NA NA NA NA NA NA NA NA
#2 970 491 -3.3 -5.0 88 2.73 200 4 272.2 279.8 272.7
#3 909 1002 -4.7 -6.6 87 2.58 200 12 275.9 283.2 276.3
#4 900 1080 -5.5 -7.5 86 2.43 200 13 275.8 282.8 276.2
#5 879 1264 -6.5 -8.8 84 2.25 200 16 276.7 283.1 277.0
#6 850 1525 -6.5 -12.5 62 1.73 200 20 279.3 284.4 279.6
data
data_vector <- c("1000.0 246",
"970.0 491 -3.3 -5.0 88 2.73 200 4 272.2 279.8 272.7",
"909.0 1002 -4.7 -6.6 87 2.58 200 12 275.9 283.2 276.3",
"900.0 1080 -5.5 -7.5 86 2.43 200 13 275.8 282.8 276.2",
"879.0 1264 -6.5 -8.8 84 2.25 200 16 276.7 283.1 277.0",
"850.0 1525 -6.5 -12.5 62 1.73 200 20 279.3 284.4 279.6")

Calculate differences between a large number of variables

I am trying to calculate the differences between different columns. I did it with a loop, but I know that is not an elegant or efficient solution in R. My results also contain duplicates and meaningless operations (disp - disp, or both hp_disp and disp_hp).
My real data has NAs, which I tried to simulate. My goal is to improve my code while still getting a table like the one below.
An example of my code:
names(mtcars)
mtcars$mpg[mtcars$am==1]=NA
vars1= c("mpg","cyl","disp","hp")
vars2= c("mpg","cyl","disp","hp")
df=data.frame()
df_all=data.frame()
df_all=length(mtcars)
for(i in vars1){
for(k in vars2) {
df= mtcars[[i]]-mtcars[[k]]
df_all=cbind(df_all, df)
length =ncol(df_all)
colnames(df_all)[length]= paste0(i,"_",k)
}
}
head(df_all)
disp_mpg disp_cyl disp_disp disp_hp hp_mpg hp_cyl hp_disp hp_hp
[1,] NA 154 0 50 NA 104 -50 0
[2,] NA 154 0 50 NA 104 -50 0
[3,] NA 104 0 15 NA 89 -15 0
[4,] 236.6 252 0 148 88.6 104 -148 0
[5,] 341.3 352 0 185 156.3 167 -185 0
[6,] 206.9 219 0 120 86.9 99 -120 0
Here's one way to do that, using the data.table library
library(data.table)
vars = c("mpg","cyl","disp","hp")
# create table of pairs to diff
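# explicit column names, so the filter does not depend on how CJ() auto-names its columns
to_diff <- CJ(V1 = vars, V2 = vars)[V1 < V2]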
# calculate diffs
diffs <-
to_diff[, .(diff_val = mtcars[, V1] - mtcars[, V2]),
by = .(cols = paste0(V1, '_minus_', V2))]
# number each row in each "cols" group
diffs[, rid := rowid(cols)]
# transform so that rid determines the row, cols determines the col, and
# the values are the value of diff_val
dcast(diffs, rid ~ cols, value.var = 'diff_val')
Output
#
# rid cyl_minus_disp cyl_minus_hp cyl_minus_mpg disp_minus_hp disp_minus_mpg hp_minus_mpg
# 1: 1 -154.0 -104 -15.0 50.0 139.0 89.0
# 2: 2 -154.0 -104 -15.0 50.0 139.0 89.0
# 3: 3 -104.0 -89 -18.8 15.0 85.2 70.2
# 4: 4 -252.0 -104 -15.4 148.0 236.6 88.6
# 5: 5 -352.0 -167 -10.7 185.0 341.3 156.3
# 6: 6 -219.0 -99 -12.1 120.0 206.9 86.9
# 7: 7 -352.0 -237 -6.3 115.0 345.7 230.7
# 8: 8 -142.7 -58 -20.4 84.7 122.3 37.6
# 9: 9 -136.8 -91 -18.8 45.8 118.0 72.2
# 10: 10 -161.6 -117 -13.2 44.6 148.4 103.8
# 11: 11 -161.6 -117 -11.8 44.6 149.8 105.2
# 12: 12 -267.8 -172 -8.4 95.8 259.4 163.6
# 13: 13 -267.8 -172 -9.3 95.8 258.5 162.7
# 14: 14 -267.8 -172 -7.2 95.8 260.6 164.8
# 15: 15 -464.0 -197 -2.4 267.0 461.6 194.6
# 16: 16 -452.0 -207 -2.4 245.0 449.6 204.6
# 17: 17 -432.0 -222 -6.7 210.0 425.3 215.3
# 18: 18 -74.7 -62 -28.4 12.7 46.3 33.6
# 19: 19 -71.7 -48 -26.4 23.7 45.3 21.6
# 20: 20 -67.1 -61 -29.9 6.1 37.2 31.1
# 21: 21 -116.1 -93 -17.5 23.1 98.6 75.5
# 22: 22 -310.0 -142 -7.5 168.0 302.5 134.5
# 23: 23 -296.0 -142 -7.2 154.0 288.8 134.8
# 24: 24 -342.0 -237 -5.3 105.0 336.7 231.7
# 25: 25 -392.0 -167 -11.2 225.0 380.8 155.8
# 26: 26 -75.0 -62 -23.3 13.0 51.7 38.7
# 27: 27 -116.3 -87 -22.0 29.3 94.3 65.0
# 28: 28 -91.1 -109 -26.4 -17.9 64.7 82.6
# 29: 29 -343.0 -256 -7.8 87.0 335.2 248.2
# 30: 30 -139.0 -169 -13.7 -30.0 125.3 155.3
# 31: 31 -293.0 -327 -7.0 -34.0 286.0 320.0
# 32: 32 -117.0 -105 -17.4 12.0 99.6 87.6
# rid cyl_minus_disp cyl_minus_hp cyl_minus_mpg disp_minus_hp disp_minus_mpg hp_minus_mpg
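For comparison, a base R sketch of the same idea (each unordered pair once, no self-differences) using combn. This is an illustration rather than part of the original answer; the column naming follows the "_minus_" convention used above, and the direction of each difference follows the order of vars rather than alphabetical order.

```r
vars <- c("mpg", "cyl", "disp", "hp")

# every unordered pair of column names, each as a length-2 character vector
pairs <- combn(vars, 2, simplify = FALSE)

# one difference column per pair; NA values propagate as usual
diff_list <- lapply(pairs, function(p) mtcars[[p[1]]] - mtcars[[p[2]]])
names(diff_list) <- vapply(pairs,
                           function(p) paste0(p[1], "_minus_", p[2]),
                           character(1))
diffs <- as.data.frame(diff_list)
head(diffs)
```

With four variables this gives six columns, so the table grows with choose(n, 2) rather than n squared.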

Trouble with character column from a file read in with read.csv in r

On the website
http://naturalstattrick.com/teamtable.php?season=20172018&stype=2&sit=pp&score=all&rate=n&vs=all&loc=B&gpf=82&fd=2017-10-04&td=2018-04-07
there is an option at the bottom of the page to download a CSV. I downloaded the CSV file, renamed it Team Season Totals - Natural Stat Trick 2007-2008 5 vs 5 (Counts).csv, and put it in my working directory.
I successfully read in the file using read.csv.
teams <- read.csv(file = "Team Season Totals - Natural Stat Trick 2007-2008 5 vs 5 (Counts).csv", stringsAsFactors = FALSE)
head(teams)
ï.. Team GP TOI W L OTL ROW CF CA CF. FF FA FF. SF SA SF. GF GA GF. SCF SCA SCF. SCGF SCGA SCGF. SCSH.
1 1 Atlanta Thrashers 82 3539.050 34 40 8 25 2638 3512 42.89 2002 2717 42.42 1505 2052 42.31 125 172 42.09 1195 1500 44.34 83 126 39.71 6.95
2 2 Pittsburgh Penguins 82 3435.417 47 27 8 40 2820 3380 45.48 2192 2542 46.30 1580 1812 46.58 142 122 53.79 1343 1374 49.43 112 90 55.45 8.34
3 3 Los Angeles Kings 82 3502.333 32 43 7 27 3008 3576 45.69 2306 2787 45.28 1649 1961 45.68 137 174 44.05 1049 1286 44.93 63 80 44.06 6.01
4 4 Montreal Canadiens 82 3475.183 47 25 10 42 3089 3601 46.17 2266 2603 46.54 1617 1863 46.47 144 138 51.06 1156 1221 48.63 62 61 50.41 5.36
5 5 Edmonton Oilers 82 3442.633 41 35 6 26 2958 3424 46.35 2255 2585 46.59 1601 1830 46.66 143 166 46.28 1334 1398 48.83 104 116 47.27 7.80
6 6 Philadelphia Flyers 82 3374.800 42 29 11 39 2902 3343 46.47 2188 2505 46.62 1609 1857 46.42 125 137 47.71 919 1028 47.20 61 68 47.29 6.64
SCSV. HDCF HDCA HDCF. HDGF HDGA HDGF. HDSH. HDSV. SH. SV. PDO
1 91.60 388 468 45.33 51 82 38.35 13.14 82.48 8.31 91.62 0.999
2 93.45 503 444 53.12 79 49 61.72 15.71 88.96 8.99 93.27 1.023
3 93.78 270 356 43.13 29 36 44.62 10.74 89.89 8.31 91.13 0.994
4 95.00 271 322 45.70 25 31 44.64 9.23 90.37 8.91 92.59 1.015
5 91.70 443 452 49.50 57 61 48.31 12.87 86.50 8.93 90.93 0.999
6 93.39 257 266 49.14 24 24 50.00 9.34 90.98 7.77 92.62 1.004
The one thing I noticed was that the Team column had an accented character in it:
teams$Team
[1] "Atlanta Thrashers" "Pittsburgh Penguins" "Los Angeles Kings" "Montreal Canadiens" "Edmonton Oilers" "Philadelphia Flyers"
[7] "St Louis Blues" "Colorado Avalanche" "Vancouver Canucks" "Minnesota Wild" "Florida Panthers" "Phoenix Coyotes"
[13] "Tampa Bay Lightning" "Buffalo Sabres" "Chicago Blackhawks" "New York Islanders" "Nashville Predators" "Anaheim Ducks"
[19] "Boston Bruins" "Ottawa Senators" "Dallas Stars" "Toronto Maple Leafs" "Carolina Hurricanes" "Columbus Blue Jackets"
[25] "New Jersey Devils" "Calgary Flames" "San Jose Sharks" "New York Rangers" "Washington Capitals" "Detroit Red Wings"
Removing the accent:
teams$Team <- sub(pattern = "Â", replacement = "", teams$Team)
teams$Team[1]
[1] "Atlanta Thrashers"
Now when I want to subset the data based on Team, all the values come back FALSE:
teams$Team[1]
[1] "Atlanta Thrashers"
teams$Team[1] == "Atlanta Thrashers"
[1] FALSE
dplyr::filter(teams, Team == "Atlanta Thrashers")
[1] ï.. Team GP TOI W L OTL ROW CF CA CF. FF FA FF. SF SA SF. GF GA GF. SCF SCA SCF. SCGF SCGA
[26] SCGF. SCSH. SCSV. HDCF HDCA HDCF. HDGF HDGA HDGF. HDSH. HDSV. SH. SV. PDO
<0 rows> (or 0-length row.names)
It comes back FALSE for every team and I don't understand why. Is it something to do with the accent that I removed? Does it have something to do with encoding, i.e., UTF-8? If someone could please assist me, I would appreciate it. Thanks.
I figured it out. It had to do with the accent. I used:
iconv(teams$Team, "UTF-8", "UTF-8", sub = ' ')
iconv(teams$Team, "UTF-8", "UTF-8",sub=' ')[1] == "Atlanta Thrashers"
[1] TRUE
I never had that happen to me and have no experience with encoding and utf-8.
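A plausible explanation, offered as an assumption since the raw file is not shown: the team names contained a non-breaking space (U+00A0). In UTF-8 that character is two bytes, which display as "Â" followed by an invisible character when the bytes are interpreted as Latin-1. Removing only the "Â" would leave the non-breaking space behind, so the string prints like "Atlanta Thrashers" without being equal to it. The sketch below reproduces the effect with a made-up string.

```r
# Hypothetical value: "Atlanta" + non-breaking space (code point 160) + "Thrashers"
nbsp <- intToUtf8(160)
team <- paste0("Atlanta", nbsp, "Thrashers")

team == "Atlanta Thrashers"   # FALSE: the separator is not an ordinary space
utf8ToInt(team)[8]            # 160, where an ordinary space would be 32

# replace the non-breaking space with an ordinary space
clean <- gsub(nbsp, " ", team, fixed = TRUE)
clean == "Atlanta Thrashers"  # TRUE
```

Reading the file with read.csv(..., fileEncoding = "UTF-8-BOM") may also avoid both the ï.. column name and the stray characters, though that has not been checked against this particular file.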

Write a dataframe formatted to a csv sheet

I have a data frame that looks like this:
> (eventStudyList120_After)
Dates Company Returns Market Returns Abnormal Returns
1 25.08.2009 4.81 0.62595516 4.184045
2 26.08.2009 4.85 0.89132960 3.958670
3 27.08.2009 4.81 -0.93323011 5.743230
4 28.08.2009 4.89 1.00388875 3.886111
5 31.08.2009 4.73 2.50655343 2.223447
6 01.09.2009 4.61 0.28025201 4.329748
7 02.09.2009 4.77 0.04999239 4.720008
8 03.09.2009 4.69 -1.52822071 6.218221
9 04.09.2009 4.89 -1.48860354 6.378604
10 07.09.2009 4.85 -0.38646531 5.236465
11 08.09.2009 4.89 -1.54065680 6.430657
12 09.09.2009 5.01 -0.35443455 5.364435
13 10.09.2009 5.01 -0.54107231 5.551072
14 11.09.2009 4.89 0.15189458 4.738105
15 14.09.2009 4.93 -0.36811321 5.298113
16 15.09.2009 4.93 -1.31185921 6.241859
17 16.09.2009 4.93 -0.53398643 5.463986
18 17.09.2009 4.97 0.44765285 4.522347
19 18.09.2009 5.01 0.81109101 4.198909
20 21.09.2009 5.01 -0.76254262 5.772543
21 22.09.2009 4.93 0.11309704 4.816903
22 23.09.2009 4.93 1.64429117 3.285709
23 24.09.2009 4.93 0.37294212 4.557058
24 25.09.2009 4.93 -2.59894035 7.528940
25 28.09.2009 5.21 0.29588776 4.914112
26 29.09.2009 4.93 0.49762314 4.432377
27 30.09.2009 5.41 2.17220569 3.237794
28 01.10.2009 5.21 1.67482716 3.535173
29 02.10.2009 5.25 -0.79014302 6.040143
30 05.10.2009 4.97 -2.69996146 7.669961
31 06.10.2009 4.97 0.18086490 4.789135
32 07.10.2009 5.21 -1.39072582 6.600726
33 08.10.2009 5.05 0.04210020 5.007900
34 09.10.2009 5.37 -1.14940251 6.519403
35 12.10.2009 5.13 1.16479551 3.965204
36 13.10.2009 5.37 -2.24208216 7.612082
37 14.10.2009 5.13 0.41327193 4.716728
38 15.10.2009 5.21 1.54473332 3.665267
39 16.10.2009 5.13 -1.73781565 6.867816
40 19.10.2009 5.01 0.66416288 4.345837
41 20.10.2009 5.09 -0.27007314 5.360073
42 21.10.2009 5.13 1.26968917 3.860311
43 22.10.2009 5.01 0.29432965 4.715670
44 23.10.2009 5.01 1.73758937 3.272411
45 26.10.2009 5.21 0.38854011 4.821460
46 27.10.2009 5.21 2.72671890 2.483281
47 28.10.2009 5.21 -1.76846884 6.978469
48 29.10.2009 5.41 2.95523593 2.454764
49 30.10.2009 5.37 -0.22681024 5.596810
50 02.11.2009 5.33 1.38835160 3.941648
51 03.11.2009 5.33 -1.83751398 7.167514
52 04.11.2009 5.21 -0.68721323 5.897213
53 05.11.2009 5.21 -0.26954741 5.479547
54 06.11.2009 5.21 -2.24083342 7.450833
55 09.11.2009 5.17 0.39168239 4.778318
56 10.11.2009 5.09 -0.99082271 6.080823
57 11.11.2009 5.17 0.07924735 5.090753
58 12.11.2009 5.81 -0.34424802 6.154248
59 13.11.2009 6.21 -2.00230195 8.212302
60 16.11.2009 7.81 0.48655978 7.323440
61 17.11.2009 7.69 -0.21092848 7.900928
62 18.11.2009 7.61 1.55605852 6.053941
63 19.11.2009 7.21 0.71028798 6.499712
64 20.11.2009 7.01 -2.38596631 9.395966
65 23.11.2009 7.25 0.55334705 6.696653
66 24.11.2009 7.21 -0.54239847 7.752398
67 25.11.2009 7.25 3.36386413 3.886136
68 26.11.2009 7.01 -1.28927630 8.299276
69 27.11.2009 7.09 0.98053264 6.109467
70 30.11.2009 7.09 -2.61935612 9.709356
71 01.12.2009 7.01 -0.11946242 7.129462
72 02.12.2009 7.21 0.17152317 7.038477
73 03.12.2009 7.21 -0.79343095 8.003431
74 04.12.2009 7.05 0.43919792 6.610802
75 07.12.2009 7.01 1.62169804 5.388302
76 08.12.2009 7.01 0.74055990 6.269440
77 09.12.2009 7.05 -0.99504492 8.045045
78 10.12.2009 7.21 -0.79728245 8.007282
79 11.12.2009 7.21 -0.73784636 7.947846
80 14.12.2009 6.97 -0.14656077 7.116561
81 15.12.2009 6.89 -1.42712116 8.317121
82 16.12.2009 6.97 0.95988962 6.010110
83 17.12.2009 6.69 0.22718293 6.462817
84 18.12.2009 6.53 -1.46958638 7.999586
85 21.12.2009 6.33 -0.21365446 6.543654
86 22.12.2009 6.65 -0.17256757 6.822568
87 23.12.2009 7.05 -0.59940253 7.649403
88 24.12.2009 7.05 NA NA
89 25.12.2009 7.05 NA NA
90 28.12.2009 7.05 -0.22307263 7.273073
91 29.12.2009 6.81 0.76736750 6.042632
92 30.12.2009 6.81 0.00000000 6.810000
93 31.12.2009 6.81 -1.50965723 8.319657
94 01.01.2010 6.81 NA NA
95 04.01.2010 6.65 0.06111069 6.588889
96 05.01.2010 6.65 -0.13159651 6.781597
97 06.01.2010 6.65 0.09545081 6.554549
98 07.01.2010 6.49 -0.32727619 6.817276
99 08.01.2010 6.81 -0.07225296 6.882253
100 11.01.2010 6.81 1.61131397 5.198686
101 12.01.2010 6.57 -0.40791980 6.977920
102 13.01.2010 6.85 -0.53016383 7.380164
103 14.01.2010 6.93 1.82016604 5.109834
104 15.01.2010 6.97 -0.62552046 7.595520
105 18.01.2010 6.93 -0.80490241 7.734902
106 19.01.2010 6.77 2.02857647 4.741424
107 20.01.2010 6.93 1.68204556 5.247954
108 21.01.2010 6.89 1.02683875 5.863161
109 22.01.2010 6.90 0.96765669 5.932343
110 25.01.2010 6.73 -0.57603687 7.306037
111 26.01.2010 6.81 0.50990350 6.300096
112 27.01.2010 6.81 1.64994011 5.160060
113 28.01.2010 6.61 -1.13511086 7.745111
114 29.01.2010 6.53 -0.82206204 7.352062
115 01.02.2010 7.03 -1.03993428 8.069934
116 02.02.2010 6.93 0.61692305 6.313077
117 03.02.2010 7.73 2.53012795 5.199872
118 04.02.2010 7.97 1.96223075 6.007769
119 05.02.2010 9.33 -0.76549820 10.095498
120 08.02.2010 8.01 -0.34391479 8.353915
When I write it to a CSV file, it looks like this:
write.table(eventStudyList120_After$`Abnormal Returns`, file = "C://Users//AbnormalReturns.csv", sep = ";")
In fact, I want it to look like this:
So my question is:
How do I write the data frame as it is to a CSV file, and how do I transpose the Abnormal Returns column and put the header in as in the example sheet?
Two approaches: transpose the data in R or in Excel
In R
Add an index column, select the columns you want, and transpose the data using the function t:
d <- anscombe
d$index <- 1:nrow(anscombe)
td <- t(d[c("index", "x1")])
write.table(td, "filename.csv", col.names = FALSE, sep = ";")
Result:
"index";1;2;3;4;5;6;7;8;9;10;11
"x1";10;8;13;9;11;14;6;4;12;7;5
In Excel
Excel allows you to transpose data as well: http://office.microsoft.com/en-us/excel-help/switch-transpose-columns-and-rows-HP010224502.aspx
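Applied to a data frame shaped like the one in the question, the R approach might look like the sketch below. The three rows are copied from the question, the name events and the temporary output file are placeholders, and using the dates as the header row is an assumption, since the example sheet is not shown.

```r
# Small stand-in for eventStudyList120_After (first three rows from the question)
events <- data.frame(
  Dates = c("25.08.2009", "26.08.2009", "27.08.2009"),
  `Abnormal Returns` = c(4.184045, 3.958670, 5.743230),
  check.names = FALSE
)

# t() turns the column into a 1 x n matrix; the dates become the header
wide <- t(events$`Abnormal Returns`)
colnames(wide) <- events$Dates
rownames(wide) <- "Abnormal Returns"

# col.names = NA writes an empty cell above the row name so the header lines up
out_file <- tempfile(fileext = ".csv")
write.table(wide, out_file, sep = ";", col.names = NA)
csv_lines <- readLines(out_file)
csv_lines
```

The file then has two lines: the dates as the header, and one row labelled "Abnormal Returns" with the values.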
