I have two data frames. For example:
library(xlsx)
csvData <- read.csv("myData.csv")
# read.xlsx() from the xlsx package needs a sheet; the first sheet is assumed here
xlsData <- read.xlsx("myData.xlsx", sheetIndex = 1)
csvData looks like this:
Period CPI VIX
1 0.029 31.740
2 0.039 32.840
3 0.028 34.720
4 0.011 43.740
5 -0.003 35.310
6 0.013 26.090
7 0.032 28.420
8 0.022 45.080
xlsData looks like this:
Period CPI DJIA
1 0.029 12176
2 0.039 10646
3 0.028 11407
4 0.011 9563
5 -0.003 10708
6 0.013 10776
7 0.032 9384
8 0.022 7774
When I merge these data frames, the CPI column is duplicated and a suffix is added to each copy's header, which is a problem because my real data frames share many more columns.
mergedData <- merge(csvData, xlsData, by = "Period")
mergedData:
Period CPI.x VIX CPI.y DJIA
1 0.029 31.740 0.029 12176
2 0.039 32.840 0.039 10646
3 0.028 34.720 0.028 11407
4 0.011 43.740 0.011 9563
5 -0.003 35.310 -0.003 10708
6 0.013 26.090 0.013 10776
7 0.032 28.420 0.032 9384
8 0.022 45.080 0.022 7774
I want to merge the data frames without duplicating columns with the same name. For example, I want this kind of output:
Period CPI VIX DJIA
1 0.029 31.740 12176
2 0.039 32.840 10646
3 0.028 34.720 11407
4 0.011 43.740 9563
5 -0.003 35.310 10708
6 0.013 26.090 10776
7 0.032 28.420 9384
8 0.022 45.080 7774
I don't want to list additional 'by' arguments or drop columns from one of the data frames by hand, because too many columns are duplicated in both. I'm just looking for a dynamic way to drop those duplicated columns during the merge.
Thanks!
You can omit the by argument when the shared columns have the same names in both data frames.
From ?merge:
By default the data frames are merged on the columns with names they both have, but separate specifications of the columns can be given by by.x and by.y.
Keeping that in mind, the following should work (as it did on your sample data):
merge(csvData, xlsData)
# Period CPI VIX DJIA
# 1 1 0.029 31.74 12176
# 2 2 0.039 32.84 10646
# 3 3 0.028 34.72 11407
# 4 4 0.011 43.74 9563
# 5 5 -0.003 35.31 10708
# 6 6 0.013 26.09 10776
# 7 7 0.032 28.42 9384
# 8 8 0.022 45.08 7774
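As a small sketch of the same idea, the shared columns can also be made explicit with intersect(), which is the set merge() falls back on when by is omitted. The three-row data frames below are cut-down stand-ins for the sample data, and the names common and merged are only illustrative. One caveat worth keeping in mind: rows are matched on every shared column, so a row is dropped if any shared value (for example CPI) differs between the two data frames.

```r
# Cut-down stand-ins for the sample data; both share Period and CPI
csvData <- data.frame(Period = 1:3,
                      CPI    = c(0.029, 0.039, 0.028),
                      VIX    = c(31.74, 32.84, 34.72))
xlsData <- data.frame(Period = 1:3,
                      CPI    = c(0.029, 0.039, 0.028),
                      DJIA   = c(12176, 10646, 11407))

# Columns present in both data frames
common <- intersect(names(csvData), names(xlsData))

# Merging on all shared columns leaves a single copy of each, with no suffixes
merged <- merge(csvData, xlsData, by = common)
print(merged)
```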
You can also index your specific column of interest by name. This is useful if you just need a single column/vector from a large data frame.
Period <- seq(1,8)
CPI <- seq(11,18)
VIX <- seq(21,28)
DJIA <- seq(31,38)
Other1 <- letters[1:8]
Other2 <- letters[2:9]
Other3 <- letters[3:10]
df1<- data.frame(Period,CPI,VIX)
df2<- data.frame(Period,CPI,Other1,DJIA,Other2,Other3)
merge(df1, df2[c("Period", "DJIA")], by = "Period")
Period CPI VIX DJIA
1 1 11 21 31
2 2 12 22 32
3 3 13 23 33
4 4 14 24 34
5 5 15 25 35
6 6 16 26 36
7 7 17 27 37
8 8 18 28 38
I have a series of acf results and want to extract just the lag-1 autocorrelation coefficient from each. Can anyone give a quick pointer? Thank you.
#A snippet of a series of acf() results
$`25`
Autocorrelations of series ‘x’, by lag
0 1 2 3 4 5 6 7 8 9 10 11 12
1.000 0.366 -0.347 -0.399 -0.074 0.230 0.050 -0.250 -0.213 -0.106 0.059 0.154 0.031
$`26`
Autocorrelations of series ‘x’, by lag
0 1 2 3 4 5 6 7 8 9 10 11
1.000 0.060 0.026 -0.163 -0.233 -0.191 -0.377 0.214 0.037 0.178 -0.016 0.049
$`27`
Autocorrelations of series ‘x’, by lag
0 1 2 3 4 5 6 7 8 9 10 11 12
1.000 -0.025 -0.136 0.569 -0.227 -0.264 0.218 -0.262 -0.411 0.123 -0.039 -0.192 0.130
#For this example, the extracted values would be 0.366, 0.060, -0.025; they can be in either a list or a matrix
EDIT
#`acf` in base R was used
p <- acf.each()
#sapply was tried, but it resulted in this
sapply(acf.each(), `[`, "1")
1 2 3
acf 0.7398 0.1746 0.4278
type "correlation" "correlation" "correlation"
n.used 24 17 14
lag 1 1 1
series "x" "x" "x"
snames NULL NULL NULL
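The result appears to be a list of acf objects, so sapply can do the extraction. Each object's acf component starts at lag 0, so the lag-1 coefficient is its second element: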
sapply(lst1, function(x) x$acf[2])
data
lst1 <- list(acf(ldeaths), acf(ldeaths))
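A self-contained sketch of that extraction follows. It assumes the results come from base R's acf(); the named list series (built from the ldeaths, mdeaths and fdeaths datasets that ship with R) is only a stand-in for the asker's grouped data, and plot = FALSE is added so that no plots are drawn.

```r
# Stand-in for the asker's grouped series, using datasets that ship with R
series <- list(a = ldeaths, b = mdeaths, c = fdeaths)

# One acf object per series; plot = FALSE suppresses the plots
acf_list <- lapply(series, acf, plot = FALSE)

# The acf component starts at lag 0, so its second element is the first lag
lag1 <- sapply(acf_list, function(x) x$acf[2])
print(lag1)
```

The result is a named numeric vector with one coefficient per list element; wrapping it in as.list() would give the list form instead.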
Not sure if the title makes sense, but I am new to R and, to say the least, I am confused. As you can see in the data below, I have multiple rows with the same combination of time and sample. For example, time ON and sample 1 appears 3 times. I want to calculate the average OD at time ON for sample 1, and I want to do the same for every repeated combination in the data frame. How do I go about doing this?
Thanks in advance! I hope my question makes sense.
> freednaod
time sample OD
1 ON 1 0.248
2 ON 1 0.245
3 ON 1 0.224
4 ON 2 0.262
5 ON 2 0.260
6 ON 2 0.255
7 ON 3 0.245
8 ON 3 0.249
9 ON 3 0.244
10 0 1 0.010
11 0 1 0.013
12 0 1 0.012
13 0 2 0.014
14 0 2 0.013
15 0 2 0.015
16 0 3 0.013
17 0 3 0.013
18 0 3 0.014
19 30 1 0.018
20 30 1 0.020
21 30 1 0.019
22 30 2 0.017
23 30 2 0.019
24 30 2 0.021
25 30 3 0.021
26 30 3 0.020
27 30 3 0.024
28 60 1 0.023
29 60 1 0.024
30 60 1 0.023
31 60 2 0.031
32 60 2 0.031
33 60 2 0.033
34 60 3 0.025
35 60 3 0.028
36 60 3 0.024
37 90 1 0.052
38 90 1 0.048
39 90 1 0.049
40 90 2 0.076
41 90 2 0.078
42 90 2 0.081
43 90 3 0.073
44 90 3 0.068
45 90 3 0.067
46 120 1 0.124
47 120 1 0.128
48 120 1 0.134
49 120 2 0.202
50 120 2 0.202
51 120 2 0.186
52 120 3 0.192
53 120 3 0.182
54 120 3 0.183
55 150 1 0.229
56 150 1 0.215
57 150 1 0.220
58 150 2 0.197
59 150 2 0.216
60 150 2 0.200
61 150 3 0.207
62 150 3 0.211
63 150 3 0.209
If you convert the 'time' column to a factor whose levels are its unique values in order of appearance, the output keeps the same order as the initial dataset:
aggregate(OD~ sample + time, transform(freednaod,
time = factor(time, levels = unique(time))), mean)[c(2, 1, 3)]
Or using dplyr
library(dplyr)
freednaod %>%
group_by(time = factor(time, levels = unique(time)), sample) %>%
summarise(OD = mean(OD))
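To check the approach on something small, here is a sketch that rebuilds only the first two time points and the first two samples from the data above (rows 1 to 6 and 10 to 15) and runs the base R version on it. The name means is just illustrative.

```r
# Rows 1-6 and 10-15 of the question's data (times ON and 0, samples 1 and 2)
freednaod <- data.frame(
  time   = rep(c("ON", "0"), each = 6),
  sample = rep(rep(1:2, each = 3), times = 2),
  OD     = c(0.248, 0.245, 0.224, 0.262, 0.260, 0.255,
             0.010, 0.013, 0.012, 0.014, 0.013, 0.015)
)

# Keep the time points in order of appearance rather than alphabetical order
freednaod$time <- factor(freednaod$time, levels = unique(freednaod$time))

# One mean OD per time/sample combination
means <- aggregate(OD ~ time + sample, data = freednaod, FUN = mean)
print(means)
```

For time ON and sample 1 this averages 0.248, 0.245 and 0.224, which comes to 0.239.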
I currently have a data frame in R where one column contains values from 1-20. I need to replace values 0-4 -> A, 5-9 -> B, 10-14 -> C, and so on. What would be the best way to accomplish this?
This is what the first part of the data frame currently looks like:
sex length diameter height wholeWeight shuckedWeight visceraWeight shellWeight rings
1 2 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.1500 15
2 2 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.0700 7
3 1 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.2100 9
4 2 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.1550 10
5 0 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.0550 7
6 0 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.1200 8
7 1 0.530 0.415 0.150 0.7775 0.2370 0.1415 0.3300 20
8 1 0.545 0.425 0.125 0.7680 0.2940 0.1495 0.2600 16
9 2 0.475 0.370 0.125 0.5095 0.2165 0.1125 0.1650 9
10 1 0.550 0.440 0.150 0.8945 0.3145 0.1510 0.3200 19
11 1 0.525 0.380 0.140 0.6065 0.1940 0.1475 0.2100 14
12 2 0.430 0.350 0.110 0.4060 0.1675 0.0810 0.1350 10
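I'm trying to replace the values in rings.
You could build a named vector to act as a lookup table. Note that the table below maps 1-5 to A, 6-10 to B, and so on; shift the boundaries if you need 0-4, 5-9, ... instead.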
tran_table <- rep(c("A","B","C","D"), rep(5,4))
names(tran_table) <- 1:20
test_df <- data.frame(
ring = sample(1:20, 20)
)
test_df$ring <- tran_table[test_df$ring]
And the result is:
> test_df
ring
1 A
2 D
3 B
4 A
5 A
6 C
7 B
8 D
9 B
10 A
11 B
12 D
13 C
14 A
15 C
16 C
17 B
18 D
19 C
20 D
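An alternative sketch, swapping the lookup table for base R's cut(), bins the values directly by interval. It assumes the intervals from the question continue past C as 15-19 -> D and 20-24 -> E; the rings values are the twelve shown in the question, and breaks, labels and ring_class are illustrative names.

```r
# The rings column from the twelve rows shown in the question
rings <- c(15, 7, 9, 10, 7, 8, 20, 16, 9, 19, 14, 10)

# Interval edges 0, 5, 10, ...; right = FALSE makes each bin [lower, upper),
# so 0-4 falls in the first bin, 5-9 in the second, and so on
breaks <- seq(0, 25, by = 5)
labels <- LETTERS[seq_len(length(breaks) - 1)]

ring_class <- cut(rings, breaks = breaks, labels = labels, right = FALSE)
print(data.frame(rings, ring_class))
```

cut() returns a factor; wrap it in as.character() if plain strings are needed.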
I've been working on an R function to filter a large data frame of baseball team batting stats by game id (e.g. "2016/10/11/chnmlb-sfnmlb-1") to create a list of past team matchups by season.
With some combinations of teams the output is correct, but with others it is not (the output contains a variety of ids).
I'm not very familiar with grep and assume that is where the problem lies. I patched my grep line and list output together by searching Stack Overflow and thought I had it until testing proved otherwise.
matchup.func <- function (home, away, df) {
matchups <- grep(paste('[0-9]{4}/[0-9]{2}/[0-9]{2}/[', home, '|', away, 'mlb]{6}-[', away, '|', home, 'mlb]{6}-[0-9]{1}', sep = ''), df$game.id, value = TRUE)
df <- df[df$game.id %in% matchups, c(1, 3:ncol(df))]
out <- list()
for (n in 1:length(unique(df$season))) {
for (s in unique(df$season)[n]) {
out[[s]] <- subset(df, season == s)
}
}
return(out)
}
sample of data frame:
bat.stats[sample(nrow(bat.stats), 3), ]
date game.id team wins losses flag ab r h d t hr rbi bb po da so lob avg obp slg ops roi season
1192 2016-04-11 2016/04/11/texmlb-seamlb-1 sea 2 5 away 38 7 14 3 0 0 7 2 27 8 11 15 0.226 0.303 0.336 0.639 0.286 R
764 2016-03-26 2016/03/26/wasmlb-slnmlb-1 sln 8 12 away 38 7 9 2 1 1 5 2 27 8 11 19 0.289 0.354 0.474 0.828 0.400 S
5705 2016-09-26 2016/09/26/oakmlb-anamlb-1 oak 67 89 home 29 2 6 1 0 1 2 2 27 13 4 12 0.260 0.322 0.404 0.726 0.429 R
sample of errant output:
matchup.func('tex', 'sea', bat.stats)
$S
date team wins losses flag ab r h d t hr rbi bb po da so lob avg obp slg ops roi season
21 2016-03-02 atl 1 0 home 32 4 7 0 0 2 3 2 27 19 2 11 0.203 0.222 0.406 0.628 1.000 S
22 2016-03-02 bal 0 1 away 40 11 14 3 2 2 11 10 27 13 4 28 0.316 0.415 0.532 0.947 0.000 S
47 2016-03-03 bal 0 2 home 41 10 17 7 0 2 10 0 27 9 3 13 0.329 0.354 0.519 0.873 0.000 S
48 2016-03-03 tba 1 1 away 33 3 5 0 1 0 3 2 24 10 8 13 0.186 0.213 0.343 0.556 0.500 S
141 2016-03-05 tba 2 2 home 35 6 6 2 0 0 5 3 27 11 5 15 0.199 0.266 0.318 0.584 0.500 S
142 2016-03-05 bal 0 5 away 41 10 17 5 1 0 10 4 27 9 10 13 0.331 0.371 0.497 0.868 0.000 S
sample of good:
matchup.func('bos', 'bal', bat.stats)
$S
date team wins losses flag ab r h d t hr rbi bb po da so lob avg obp slg ops roi season
143 2016-03-06 bal 0 6 home 34 8 14 4 0 0 8 5 27 5 8 22 0.284 0.330 0.420 0.750 0.000 S
144 2016-03-06 bos 3 2 away 38 7 10 3 0 0 7 7 24 7 13 25 0.209 0.285 0.322 0.607 0.600 S
209 2016-03-08 bos 4 3 home 37 1 12 1 1 0 1 4 27 15 8 26 0.222 0.292 0.320 0.612 0.571 S
210 2016-03-08 bal 0 8 away 36 5 12 5 0 1 4 4 27 9 4 27 0.283 0.345 0.429 0.774 0.000 S
In the good case it gives a list of matchups by season as it should (i.e. S, R, F, D). In the bad case it also outputs by season, but it seems to select games by date only and not by team. Not sure what to think.
I think the issue is that a pattern inside [] is a character class, which behaves differently than you might expect: it matches any single one of the listed characters, in any order, and | is treated as a literal character rather than as alternation. Instead, you might try
matchups <- grep(paste0("(", home, "|", away, ")mlb-(", home, "|", away, ")mlb"),
                 df$game.id, value = TRUE)
That should give you either the home or the away team, followed by either the home or away team. Without more sample data though, I am not sure if this will catch edge cases.
You should also note that you don't have to match the entire string, so the date-finding regex at the beginning is likely superfluous.
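The difference can be seen on a few ids. In this sketch the first id is taken from the sample data, while the other two are hypothetical ids made up for illustration; class_pat reproduces the team part of the original bracket pattern and group_pat is the grouped alternative.

```r
ids <- c("2016/04/11/texmlb-seamlb-1",   # from the sample data
         "2016/03/02/atlmlb-balmlb-1",   # hypothetical id for another matchup
         "2016/05/01/seamlb-texmlb-1")   # hypothetical id with the teams swapped
home <- "tex"
away <- "sea"

# Character classes: any six characters drawn from t, e, x, |, s, a, m, l, b
class_pat <- paste0("[", home, "|", away, "mlb]{6}-[", away, "|", home, "mlb]{6}")

# Groups with alternation: exactly "tex" or "sea", followed by "mlb"
group_pat <- paste0("(", home, "|", away, ")mlb-(", home, "|", away, ")mlb")

print(grepl(class_pat, ids))
print(grepl(group_pat, ids))
```

The bracket version also accepts "atlmlb-balmlb", because every letter of both team codes happens to appear in the class; the grouped version rejects it.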
I have time series data, and I want a function that divides each value by the value before it and returns the resulting series of ratios.
Data:
ID Temperature value
1 -1.1923333
2 -0.2123333
3 -0.593
4 -0.7393333
5 -0.731
6 -0.4976667
7 -0.773
8 -0.6843333
9 -0.371
10 0.754
11 1.798
12 3.023
13 3.8233333
14 4.2456667
15 4.599
16 5.078
17 4.9133333
18 3.5393333
19 2.0886667
20 1.8236667
21 1.2633333
22 0.6843333
23 0.7953333
24 0.6883333
The function should work like this:
New values: ID 23 = value(24) / value(23), ID 22 = value(23) / value(22), ID 21 = value(22) / value(21), and so forth.
Expected Results:
ID New Temperature value
1 0.17
2 2.79
3 1.24
4 0.98
5 0.68
6 1.55
7 0.885
8 0.54
9 -2.03
10 2.38
11 1.68
12 1.264
13 1.11
14 1.083
15 1.104
16 0.967
17 0.72
18 0.59
19 0.873
20 0.69
21 0.541
22 1.16
23 0.86
24 NaN
To divide each element of a vector x by its predecessor (that is, x[i + 1] / x[i]), use:
x[-1] / x[-length(x)]
This will return a vector with a length of length(x) - 1. If you really need the NaN value at the end, add it by hand via c(x[-1] / x[-length(x)], NaN).
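As a quick check, the sketch below applies this to the first four temperature values from the question. The names ratio and ratio_alt are illustrative, and the head()/tail() form is just an equivalent way of writing the same shift.

```r
# First four temperature values from the question
x <- c(-1.1923333, -0.2123333, -0.593, -0.7393333)

# x[i + 1] / x[i], padded with NaN so the result has the same length as x
ratio <- c(x[-1] / x[-length(x)], NaN)

# Equivalent spelling using tail() and head()
ratio_alt <- c(tail(x, -1) / head(x, -1), NaN)

print(round(ratio, 3))
```

The first ratio, -0.2123333 / -1.1923333, is roughly 0.178, in line with the first entry of the expected results.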