This question already has answers here:
Create dummy variable in R excluding certain cases as NA
(2 answers)
Convert continuous numeric values to discrete categories defined by intervals
(2 answers)
Closed 5 years ago.
I currently have a data frame in R where one column contains values from 1-20. I need to replace values 0-4 with A, 5-9 with B, 10-14 with C, and so on. What would be the best way to accomplish this?
This is what the first part of the data frame currently looks like:
sex length diameter height wholeWeight shuckedWeight visceraWeight shellWeight rings
1 2 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.1500 15
2 2 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.0700 7
3 1 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.2100 9
4 2 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.1550 10
5 0 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.0550 7
6 0 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.1200 8
7 1 0.530 0.415 0.150 0.7775 0.2370 0.1415 0.3300 20
8 1 0.545 0.425 0.125 0.7680 0.2940 0.1495 0.2600 16
9 2 0.475 0.370 0.125 0.5095 0.2165 0.1125 0.1650 9
10 1 0.550 0.440 0.150 0.8945 0.3145 0.1510 0.3200 19
11 1 0.525 0.380 0.140 0.6065 0.1940 0.1475 0.2100 14
12 2 0.430 0.350 0.110 0.4060 0.1675 0.0810 0.1350 10
I'm trying to replace the values in rings.
You can create a named vector to serve as a lookup (transform) table:
tran_table <- rep(c("A", "B", "C", "D"), rep(5, 4))  # "A" x5, "B" x5, "C" x5, "D" x5
names(tran_table) <- 1:20                            # name each letter by ring value
test_df <- data.frame(
  ring = sample(1:20, 20)
)
test_df$ring <- tran_table[test_df$ring]             # look up each ring value in the table
And the result is:
> test_df
ring
1 A
2 D
3 B
4 A
5 A
6 C
7 B
8 D
9 B
10 A
11 B
12 D
13 C
14 A
15 C
16 C
17 B
18 D
19 C
20 D
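If you'd rather not build a lookup table, base R's cut() can bin the values directly. A minimal sketch, using right-closed breaks so that 1-5 map to A, matching the table above:

```r
test_df <- data.frame(ring = sample(1:20, 20))
# cut() assigns each value to an interval and returns the matching label;
# breaks are right-closed by default, so (0,5] -> A, (5,10] -> B, ...
test_df$ring <- as.character(cut(test_df$ring,
                                 breaks = c(0, 5, 10, 15, 20),
                                 labels = c("A", "B", "C", "D")))
```

cut() returns a factor, so as.character() is applied here to keep the column as plain letters like the lookup-table version does.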
I output a series of acf results and want to extract just the lag-1 autocorrelation coefficient from each. Can anyone give a quick pointer? Thank you.
#A snippet of a series of acf() results
$`25`
Autocorrelations of series ‘x’, by lag
0 1 2 3 4 5 6 7 8 9 10 11 12
1.000 0.366 -0.347 -0.399 -0.074 0.230 0.050 -0.250 -0.213 -0.106 0.059 0.154 0.031
$`26`
Autocorrelations of series ‘x’, by lag
0 1 2 3 4 5 6 7 8 9 10 11
1.000 0.060 0.026 -0.163 -0.233 -0.191 -0.377 0.214 0.037 0.178 -0.016 0.049
$`27`
Autocorrelations of series ‘x’, by lag
0 1 2 3 4 5 6 7 8 9 10 11 12
1.000 -0.025 -0.136 0.569 -0.227 -0.264 0.218 -0.262 -0.411 0.123 -0.039 -0.192 0.130
#For this example, the extracted values will be 0.366, 0.060, -0.025; the values can either be in a list or a matrix
EDIT
#`acf` in base R was used
p <- acf.each()
#sapply was tried, but it resulted in this:
sapply(acf.each(), `[`, "1")
1 2 3
acf 0.7398 0.1746 0.4278
type "correlation" "correlation" "correlation"
n.used 24 17 14
lag 1 1 1
series "x" "x" "x"
snames NULL NULL NULL
The structure seems to be a list of acf objects, so we can use sapply to do the extraction. Note that x$acf[2] is the lag-1 value, because x$acf[1] holds lag 0:
sapply(lst1, function(x) x$acf[2])
data
lst1 <- list(acf(ldeaths), acf(ldeaths))
To start off - thank you for taking the time to view/answer my question. I will try my best to explain it (hopefully it isn't too difficult; I am not an expert in R by any means).
Let's suppose I have the data below (the first column is the Date, the second column is "level", and level is a repeating sequence from 2:8 for each day; var3 is just some statistic):
Date level var3
1 2/10/2017 2 0.2340
2 2/10/2017 3 0.1240
3 2/10/2017 4 0.5120
4 2/10/2017 5 0.4440
5 2/10/2017 6 0.1200
6 2/10/2017 7 0.5213
7 2/10/2017 8 0.1200
8 2/11/2017 2 0.4100
9 2/11/2017 3 0.6500
10 2/11/2017 4 0.2400
11 2/11/2017 5 0.5500
13 2/11/2017 6 0.3100
14 2/11/2017 7 0.1500
15 2/11/2017 8 0.2300
16 2/12/2017 2 0.1500
17 2/12/2017 3 0.5800
18 2/12/2017 4 0.3300
19 2/12/2017 5 0.2100
20 2/12/2017 6 0.9800
21 2/12/2017 7 0.3200
22 2/12/2017 8 0.1800
My goal is to standardize the data by doing the following:
- Create a new column called 'Change'
- For each unique date, Change is log(var3) - log(var3[level == 5])
Essentially, for each unique date, I want to take each row's var3 value and subtract the log of the level-5 var3 value FOR THAT DAY from its log. So, for example, change[1] = log(0.2340) - log(0.4440), change[2] = log(0.1240) - log(0.4440), but change[10] would be log(0.2400) - log(0.5500), and so on.
I am having trouble coding this in R. Below is the code I came up with, but the result is 21 rows x 24 vars, when I really just want 21 rows and 4 columns, with the 4th one being "Change":
log_mean <- function(data_set) {
  for (i in unique(data_set$Date)) {
    midpoint <- data_set$var3[data_set$level == 5]
    c <- log(data_set$var3) - log(midpoint)
    change <- rbind(change, c)
  }
}
y <- cbind(x, change)
Please help if you can. Intuitively it seems really easy to do, but I am not sure how to do it in R (and yes, I am relatively new-ish).
Thank you so much!
Try this:
library(dplyr)
df %>% group_by(Date) %>% mutate(change = log(var3) - log(var3[level==5]))
# A tibble: 21 x 4
# Groups: Date [3]
Date level var3 change
<fct> <int> <dbl> <dbl>
1 2/10/2017 2 0.234 -0.641
2 2/10/2017 3 0.124 -1.28
3 2/10/2017 4 0.512 0.143
4 2/10/2017 5 0.444 0
5 2/10/2017 6 0.12 -1.31
6 2/10/2017 7 0.521 0.161
7 2/10/2017 8 0.12 -1.31
8 2/11/2017 2 0.41 -0.294
9 2/11/2017 3 0.65 0.167
10 2/11/2017 4 0.24 -0.829
# ... with 11 more rows
In general, this falls into the category split-apply-combine. Google the term and familiarize yourself with the options that R offers you (e.g. base, dplyr, data.table). It will come in handy in the future.
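For a base-R flavour of the same split-apply-combine, one possible sketch (assuming, as the question states, exactly one level-5 row per date):

```r
# Named lookup: each date's level-5 var3 value, keyed by the date itself
ref <- with(df, setNames(var3[level == 5], as.character(Date[level == 5])))
# Subtract the matching date's log reference from each row's log(var3)
df$change <- log(df$var3) - log(ref[as.character(df$Date)])
```

This avoids any explicit loop: the named-vector lookup lines each row up with its own date's reference value.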
I've been working on an R function to filter a large data frame of baseball team batting stats by game id (i.e. "2016/10/11/chnmlb-sfnmlb-1") to create a list of past team matchups by season.
With some combinations of teams the output is correct, but with others it is not (the output contains a variety of ids).
I'm not very familiar with grep, and assume that is the problem. I patched my grep line and list output together by searching Stack Overflow and thought I had it until testing proved otherwise.
matchup.func <- function (home, away, df) {
  matchups <- grep(paste('[0-9]{4}/[0-9]{2}/[0-9]{2}/[', home, '|', away, 'mlb]{6}-[', away, '|', home, 'mlb]{6}-[0-9]{1}', sep = ''), df$game.id, value = TRUE)
  df <- df[df$game.id %in% matchups, c(1, 3:ncol(df))]
  out <- list()
  for (n in 1:length(unique(df$season))) {
    for (s in unique(df$season)[n]) {
      out[[s]] <- subset(df, season == s)
    }
  }
  return(out)
}
sample of data frame:
bat.stats[sample(nrow(bat.stats), 3), ]
date game.id team wins losses flag ab r h d t hr rbi bb po da so lob avg obp slg ops roi season
1192 2016-04-11 2016/04/11/texmlb-seamlb-1 sea 2 5 away 38 7 14 3 0 0 7 2 27 8 11 15 0.226 0.303 0.336 0.639 0.286 R
764 2016-03-26 2016/03/26/wasmlb-slnmlb-1 sln 8 12 away 38 7 9 2 1 1 5 2 27 8 11 19 0.289 0.354 0.474 0.828 0.400 S
5705 2016-09-26 2016/09/26/oakmlb-anamlb-1 oak 67 89 home 29 2 6 1 0 1 2 2 27 13 4 12 0.260 0.322 0.404 0.726 0.429 R
sample of errant output:
matchup.func('tex', 'sea', bat.stats)
$S
date team wins losses flag ab r h d t hr rbi bb po da so lob avg obp slg ops roi season
21 2016-03-02 atl 1 0 home 32 4 7 0 0 2 3 2 27 19 2 11 0.203 0.222 0.406 0.628 1.000 S
22 2016-03-02 bal 0 1 away 40 11 14 3 2 2 11 10 27 13 4 28 0.316 0.415 0.532 0.947 0.000 S
47 2016-03-03 bal 0 2 home 41 10 17 7 0 2 10 0 27 9 3 13 0.329 0.354 0.519 0.873 0.000 S
48 2016-03-03 tba 1 1 away 33 3 5 0 1 0 3 2 24 10 8 13 0.186 0.213 0.343 0.556 0.500 S
141 2016-03-05 tba 2 2 home 35 6 6 2 0 0 5 3 27 11 5 15 0.199 0.266 0.318 0.584 0.500 S
142 2016-03-05 bal 0 5 away 41 10 17 5 1 0 10 4 27 9 10 13 0.331 0.371 0.497 0.868 0.000 S
sample of good:
matchup.func('bos', 'bal', bat.stats)
$S
date team wins losses flag ab r h d t hr rbi bb po da so lob avg obp slg ops roi season
143 2016-03-06 bal 0 6 home 34 8 14 4 0 0 8 5 27 5 8 22 0.284 0.330 0.420 0.750 0.000 S
144 2016-03-06 bos 3 2 away 38 7 10 3 0 0 7 7 24 7 13 25 0.209 0.285 0.322 0.607 0.600 S
209 2016-03-08 bos 4 3 home 37 1 12 1 1 0 1 4 27 15 8 26 0.222 0.292 0.320 0.612 0.571 S
210 2016-03-08 bal 0 8 away 36 5 12 5 0 1 4 4 27 9 4 27 0.283 0.345 0.429 0.774 0.000 S
On the good run it gives a list of matchups split by season, as it should (i.e. S, R, F, D); on the bad run it also splits by season, but seems to match games only by date and not by team. Not sure what to think.
I think that the issue is that regex inside [] behaves differently than you might expect. Specifically, it is looking for any matches to those characters, and in any order. Instead, you might try
matchups <- grep(paste0("(", home, "|", away, ")mlb-(", home, "|", away, ")mlb")
, df$game.id, value = TRUE)
That should give you either the home or the away team, followed by either the home or away team. Without more sample data though, I am not sure if this will catch edge cases.
You should also note that you don't have to match the entire string, so the date-finding regex at the beginning is likely superfluous.
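Putting that together, a sketch of the revised function (assuming game.id keeps the `teammlb-teammlb-n` format shown in the sample data; note the alternation would also match a team paired with itself, which shouldn't occur in real ids):

```r
matchup.func <- function(home, away, df) {
  # Either team, followed by either team, each suffixed with "mlb"
  pat <- paste0("(", home, "|", away, ")mlb-(", home, "|", away, ")mlb")
  df <- df[grepl(pat, df$game.id), c(1, 3:ncol(df))]
  # split() returns one data frame per season, which is what the
  # original nested loop was building by hand
  split(df, df$season, drop = TRUE)
}
```

split() with drop = TRUE also skips seasons that have no matching games, mirroring the loop over unique(df$season).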
I have two dataframes. For example:
require('xlsx')
csvData <- read.csv("myData.csv")
xlsData <- read.xlsx("myData.xlsx", sheetIndex = 1)
csvData looks like this:
Period CPI VIX
1 0.029 31.740
2 0.039 32.840
3 0.028 34.720
4 0.011 43.740
5 -0.003 35.310
6 0.013 26.090
7 0.032 28.420
8 0.022 45.080
xlsData looks like this:
Period CPI DJIA
1 0.029 12176
2 0.039 10646
3 0.028 11407
4 0.011 9563
5 -0.003 10708
6 0.013 10776
7 0.032 9384
8 0.022 7774
When I merge this data, the CPI data is duplicated, and a suffix is put on the header, which is problematic (I have many more columns in my real df's).
mergedData <- merge(xlsData, csvData, by = "Period")
mergedData:
Period CPI.x VIX CPI.y DJIA
1 0.029 31.740 0.029 12176
2 0.039 32.840 0.039 10646
3 0.028 34.720 0.028 11407
4 0.011 43.740 0.011 9563
5 -0.003 35.310 -0.003 10708
6 0.013 26.090 0.013 10776
7 0.032 28.420 0.032 9384
8 0.022 45.080 0.022 7774
I want to merge the data frames without duplicating columns with the same name. For example, I want this kind of output:
Period CPI VIX DJIA
1 0.029 31.740 12176
2 0.039 32.840 10646
3 0.028 34.720 11407
4 0.011 43.740 9563
5 -0.003 35.310 10708
6 0.013 26.090 10776
7 0.032 28.420 9384
8 0.022 45.080 7774
I don't want to have to use additional 'by' arguments or drop columns from one of the df's, because too many columns are duplicated in both df's. I'm just looking for a dynamic way to drop those duplicated columns during the merge process.
Thanks!
You can skip the by argument if the common columns are named the same.
From ?merge:
By default the data frames are merged on the columns with names they both have, but separate specifications of the columns can be given by by.x and by.y.
Keeping that in mind, the following should work (as it did on your sample data):
merge(csvData, xlsData)
# Period CPI VIX DJIA
# 1 1 0.029 31.74 12176
# 2 2 0.039 32.84 10646
# 3 3 0.028 34.72 11407
# 4 4 0.011 43.74 9563
# 5 5 -0.003 35.31 10708
# 6 6 0.013 26.09 10776
# 7 7 0.032 28.42 9384
# 8 8 0.022 45.08 7774
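If you end up with more than two frames sharing key columns, the same idea chains naturally with Reduce. A small sketch:

```r
dfs <- list(csvData, xlsData)  # could hold any number of frames
merged <- Reduce(merge, dfs)   # merge() is applied pairwise, left to right
```

Each pairwise merge again joins on all columns the two frames have in common, so no by argument is needed at any step.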
You can also index your specific column of interest by name. This is useful if you just need a single column/vector from a large data frame.
Period <- seq(1,8)
CPI <- seq(11,18)
VIX <- seq(21,28)
DJIA <- seq(31,38)
Other1 <- paste(letters)[1:8]
Other2 <- paste(letters)[2:9]
Other3 <- paste(letters)[3:10]
df1<- data.frame(Period,CPI,VIX)
df2<- data.frame(Period,CPI,Other1,DJIA,Other2,Other3)
merge(df1, df2[c("Period", "DJIA")], by = "Period")
Period CPI VIX DJIA
1 1 11 21 31
2 2 12 22 32
3 3 13 23 33
4 4 14 24 34
5 5 15 25 35
6 6 16 26 36
7 7 17 27 37
8 8 18 28 38
I have time series data, and I wanted to write a function that returns a suitably lagged ratio: each value divided into the one that follows it.
Data:
ID Temperature value
1 -1.1923333
2 -0.2123333
3 -0.593
4 -0.7393333
5 -0.731
6 -0.4976667
7 -0.773
8 -0.6843333
9 -0.371
10 0.754
11 1.798
12 3.023
13 3.8233333
14 4.2456667
15 4.599
16 5.078
17 4.9133333
18 3.5393333
19 2.0886667
20 1.8236667
21 1.2633333
22 0.6843333
23 0.7953333
24 0.6883333
The function should work like this:
new values: ID 23 = value(24)/value(23), ID 22 = value(23)/value(22), ID 21 = value(22)/value(21), and so forth.
Expected Results:
ID New Temperature value
1 0.17
2 2.79
3 1.24
4 0.98
5 0.68
6 1.55
7 0.885
8 0.54
9 -2.03
10 2.38
11 1.68
12 1.264
13 1.11
14 1.083
15 1.104
16 0.967
17 0.72
18 0.59
19 0.873
20 0.69
21 0.541
22 1.16
23 0.86
24 NaN
To divide each element of a vector x by its successor, use:
x[-1] / x[-length(x)]
This will return a vector with a length of length(x) - 1. If you really need the NaN value at the end, add it by hand via c(x[-1] / x[-length(x)], NaN).
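Applied to the question's data (assuming the temperature column is named value), this might look like:

```r
x <- df$value
# Each entry divided into its successor; pad with NaN so the
# result lines up with the original 24 rows
df$new <- c(x[-1] / x[-length(x)], NaN)
```

The padding keeps df rectangular, with NaN in the last row exactly as in the expected results.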