How to add a new column in a data frame using a calculation in R?

I want to add a new column with a calculation to the data frame below:
library(dplyr)

Env <- c("High_inoc", "High_NO_inoc", "Low_inoc", "Low_NO_inoc")
CV1 <- c(30, 150, 20, 100)
CV2 <- c(74, 99, 49, 73)
CV3 <- c(78, 106, 56, 69)
CV4 <- c(86, 92, 66, 70)
CV5 <- c(74, 98, 57, 79)
Data <- data.frame(Env, CV1, CV2, CV3, CV4, CV5)
Data$Mean <- rowMeans(Data %>% select(-Env))
Data <- rbind(Data, c("Mean", colMeans(Data %>% select(-Env))))
I'd like to add a new column named 'Env_index', calculated as each value of the 'Mean' column minus the overall mean (76.3): 68.4 - 76.3, 109 - 76.3, ..., 78.2 - 76.3.
So I did it like this and obtained what I want:
Data$Env_index <- c(68.4-76.3,109-76.3,49.6-76.3,78.2-76.3, 76.3-76.3)
But I want to calculate it directly in code, so if I write this:
Data$Env_index <- with (data, data$Mean - 76.3)
it generates an error. Could you let me know how to calculate it?
Thanks,

To make the calculation dynamic, so that it works on any data, you can do:
# rbind() with c("Mean", ...) coerced every column to character,
# so convert Mean back to numeric first
Data$Mean <- as.numeric(Data$Mean)
Data$Env_index <- Data$Mean - Data$Mean[nrow(Data)]
Data
# Env CV1 CV2 CV3 CV4 CV5 Mean Env_index
#1 High_inoc 30 74 78 86 74 68.4 -7.9
#2 High_NO_inoc 150 99 106 92 98 109.0 32.7
#3 Low_inoc 20 49 56 66 57 49.6 -26.7
#4 Low_NO_inoc 100 73 69 70 79 78.2 1.9
#5 Mean 75 73.75 77.25 78.5 77 76.3 0.0
Data$Mean[nrow(Data)] selects the last value of Data$Mean, i.e. the overall mean in the appended 'Mean' row. (The original attempt failed for two reasons: the object is named Data, not data, and the rbind had coerced the Mean column to character.)

Related

Filter all columns in timeseries to keep only top 1/3

I have a timeseries with about 100 dates, 50 entities per date (so 5,000 rows), and 50 columns (all different variables). How can I filter each column in the data frame, per unique date, to keep the top 1/3 of values for each column on each date, and then get the average Return for that group for that date? Thank you.
My data is organized as follows, but the numbers in each column are random and vary like they do in column "a" (this is a sample; the real data has many more columns and many more rows):
Date Identity Return a b c d e f... ...z
2/1/19 X 5 75 43 67 85 72 56 92
2/1/19 Y 4 27 43 67 85 72 56 92
2/1/19 Z 7 88 43 67 85 72 56 92
2/1/19 W 2 55 43 67 85 72 56 92
2/2/19 X 7 69 43 67 85 72 56 92
2/2/19 Y 8 23 43 67 85 72 56 92
2/3/19 X 2 34 43 67 85 72 56 92
2/3/19 Y 3 56 43 67 85 72 56 92
2/3/19 Z 4 62 43 67 85 72 56 92
2/3/19 W 4 43 43 67 85 72 56 92
2/3/19 U 4 26 43 67 85 72 56 92
2/4/19 X 6 67 43 67 85 72 56 92
2/4/19 Y 1 78 43 67 85 72 56 92
2/5/19 X 4 75 43 67 85 72 56 92
2/7/19 X 5 99 43 67 85 72 56 92
2/7/19 Y 4 72 43 67 85 72 56 92
2/7/19 Z 4 45 43 67 85 72 56 92
I am trying to filter data into quantiles. I have code that works for filtering into quantiles for one measure. However, I want filtered results for many measures individually (i.e. I want a "high" group for a ton of columns).
The code that I have that works for one measure is as follows (the columns are date, identity, and a, where a is the indicator I want to sort on):
High <- df[!is.na(df$a), ] %>%
  group_by(date) %>%
  filter(a > quantile(a, .666)) %>%
  summarise(high_return = sum(return) / length(identity))
Now I want to loop this for when I have many indicators to sort on individually (i.e. I do not want to sort them within one another; I want each sorted separately and the results broken out by indicator).
I want the output of the loop to be a new data frame with the following format (where a_Return is the average return of the top 1/3 of the original a's on a given date):
Date a_Return b_Return c_Return
2/1/19 6 7 3
2/3/19 4 2 5
2/4/19 2 4 6
I have tried the code below without it working:
Indicators <- c("a", "b", "c")
for (i in 1:length(Indicators)) {
  High = df %>%
    group_by(date) %>%
    filter(High[[i]] > quantile(High[[i]], .666)) %>%
    summarise(g = sum(return) / length(identity))
}
With this attempt I get the error: "Error in filter_impl(.data, quo) : Result must have length 20, not 4719".
I also tried:
High %>%
  group_by(date) %>%
  filter_at(vars(Indicators[i]), any_vars(. > quantile(., .666))) %>%
  summarise(!!Indicators[i] := sum(Return) / n())
but with that code I get the error "Strings must match column names. Unknown Columns: NA"
I want High to turn up with a date column and then a column for each of a, b, and c.
If you combine the filtering and calculations into a single function, then you can put that into summarize_at to apply it easily to each column. Since your example data isn't fully reproducible, I'll use the iris dataset. In your case, you'd replace Species with Date, and Petal.Width with Return:
library(dplyr)
top_iris <- iris %>%
  group_by(Species) %>%
  summarize_at(vars(one_of('Sepal.Length', 'Sepal.Width', 'Petal.Length')),
               funs(return = sum(Petal.Width[. > quantile(., .666)]) /
                      length(Petal.Width[. > quantile(., .666)])))
top_iris
# A tibble: 3 x 4
Species Sepal.Length_return Sepal.Width_return Petal.Length_return
<fct> <dbl> <dbl> <dbl>
1 setosa 0.257 0.262 0.308
2 versicolor 1.44 1.49 1.49
3 virginica 2.1 2.22 2.09
The problem with using filter is that each function in the pipe runs in order, so any criteria you give to filter_* will have to be applied to the whole data.frame before the result is piped into summarize_at. Instead, we just use a single summarize_at statement, and filter each column as the summarization function is applied to it.
To explain this in more detail, summarize_at takes 2 arguments:
The first argument is one or more of the variable selector functions described in ?select_helpers, enclosed in the vars function. Here we use one_of, which just takes a vector of column names, but we could also use matches to select using a regular expression, or starts_with to choose based on a prefix, for example.
The second argument is a list of one or more function calls to be run on each selected column, enclosed in the funs function. Here we have 1 function call, to which we've given the name return.
Like with any tidyverse function, this is evaluated in a local environment constructed from the data piped in. So bare variable names like Petal.Width function as data$Petal.Width. In *_at functions, the . represents the variable passed in, so when the Sepal.Length column is being summarized:
Petal.Width[. > quantile(., .666)]
means:
data$Petal.Width[data$Sepal.Length > quantile(data$Sepal.Length, .666)]
Finally, since the function in funs is named (that's the return =), then the resulting summary columns have the function's name (return) appended to the original column names.
If you want to remove missing data before running these calculations, you can use na.omit to strip out NA values.
To remove all rows containing NA, just pipe your data through na.omit before grouping:
iris2 <- iris
iris2[c(143:149), c(1:2)] <- NA
iris2 %>%
  na.omit() %>%
  group_by(Species) %>%
  summarize_at(vars(one_of('Sepal.Length', 'Sepal.Width', 'Petal.Length')),
               funs(return = sum(Petal.Width[. > quantile(., .666)]) /
                      length(Petal.Width[. > quantile(., .666)])))
Species Sepal.Length_return Sepal.Width_return Petal.Length_return
<fct> <dbl> <dbl> <dbl>
1 setosa 0.257 0.262 0.308
2 versicolor 1.44 1.49 1.49
3 virginica 2.09 2.19 2.07
To strip NA values from each column as it's being summarized, you need to move na.omit inside the summarize function:
iris2 %>%
  group_by(Species) %>%
  summarize_at(vars(one_of('Sepal.Length', 'Sepal.Width', 'Petal.Length')),
               funs(return = {
                 var <- na.omit(.)
                 sum(Petal.Width[var > quantile(var, .666)]) /
                   length(Petal.Width[var > quantile(var, .666)])
               }))
# A tibble: 3 x 4
Species Sepal.Length_return Sepal.Width_return Petal.Length_return
<fct> <dbl> <dbl> <dbl>
1 setosa 0.257 0.262 0.308
2 versicolor 1.44 1.49 1.49
3 virginica 2.11 2.2 2.09
Here we use curly braces to extend the function we run in summarize_at to multiple expressions. First, we strip out NA values, then we calculate the return values. Since this function is in summarize_at it gets applied to each variable based on the grouping established by group_by.

Issue with calculating row mean in data table for selected columns in R

I have a data table as shown below.
Table:
LP GMweek1 GMweek2 GMweek3 PMweek1 PMweek2 PMweek3
215 45 50 60 11 0.4 10.2
0.1 50 61 24 12 0.8 80.0
0 45 24 35 22 20.0 15.4
51 22.1 54 13 35 16 2.2
I want to obtain the Output table below. My code below does not work. Can somebody help me figure out what I am doing wrong here?
Any help is appreciated.
Output:
LP GMweek1 GMweek2 GMweek3 PMweek1 PMweek2 PMweek3 AvgGM AvgPM
215 45 50 60 11 0.4 10.2 51.67 7.20
0.1 50 61 24 12 0.8 80.0 45.00 30.93
0 45 24 35 22 20.0 15.4 34.67 19.13
51 22.1 54 13 35 16 2.2 29.70 17.73
sel_cols_GM <- c("GMweek1","GMweek2","GMweek3")
sel_cols_PM <- c("PMweek1","PMweek2","PMweek3")
Table <- Table[, .(AvgGM = rowMeans(sel_cols_GM)), by = LP]
Table <- Table[, .(AvgPM = rowMeans(sel_cols_PM)), by = LP]
Ok, so you're doing a couple of things wrong. First, rowMeans can't evaluate a character vector; if you want to select columns with it, you must use .SD and pass the character vector to .SDcols. Second, you're trying to combine a row-wise aggregation with grouping, which doesn't make much sense. Third, even if your expression didn't throw an error, you are assigning the result back to Table, which would destroy your original data (if you want to add a new column, use := to add it by reference).
What you want to do is calculate the row means of your selected columns, which you can do like this:
Table[, AvgGM := rowMeans(.SD), .SDcols = sel_cols_GM]
Table[, AvgPM := rowMeans(.SD), .SDcols = sel_cols_PM]
This means: create these new columns as the row means of my subset of the data (.SD), where .SDcols controls which columns .SD refers to.

How to add nearest coordinates points from one file to another using RANN package

I tried the RANN package to extract the nearest coordinate points by comparing two files, and then add the extracted nearest points to the other file.
My files -> fire
lat lon frp
30.037 80.572 38.5
23.671 85.008 7.2
22.791 86.206 11.4
23.755 86.421 5.6
23.673 86.088 4.2
23.768 86.392 8.4
23.789 86.243 7.8
23.805 86.327 6.4
23.682 86.085 7.8
23.68 86.095 5.7
21.194 81.41 19
16.95 81.912 8
16.952 81.898 11.5
16.899 81.682 10.6
12.994 79.651 16.1
9.2 77.603 14.5
12.291 77.346 20.5
17.996 79.708 13.9
17.998 79.718 29.6
16.61 81.266 6.6
16.499 81.2 6.8
19.505 81.784 22.4
18.322 80.555 7.7
19.506 81.794 28.2
21.081 81.957 8.7
21.223 82.127 9.4
20.918 81.025 6.3
19.861 82.123 9.3
20.62 75.049 11.6
and 2nd file -> wind
lat lon si10 u10 v10
40 60 3.5927058834376 -0.874587879393667 -0.375465368327018
40 60.125 3.59519876134577 -0.836646189656238 -0.388624092937835
40 60.25 3.59769163925393 -0.798704499918809 -0.401782817548651
40 60.375 3.6001845171621 -0.76076281018138 -0.414941542159468
40 60.5 3.60246965524458 -0.722821120443951 -0.428380239634345
40 60.625 3.60496253315275 -0.684585309080651 -0.441538964245162
40 60.75 3.60766315088659 -0.646937740969094 -0.454977661720038
40 60.875 3.60911732966636 -0.609878416109279 -0.468976304923035
40 61 3.608701850015 -0.575172064256437 -0.484934758174451
40 61.125 3.60807863053795 -0.540759834029467 -0.500893211425867
40 61.25 3.60787089071227 -0.506053482176625 -0.516851664677283
40 61.375 3.60745541106091 -0.471641251949655 -0.533090090792759
40 61.5 3.60703993140955 -0.437229021722684 -0.548768571180115
40 61.625 3.60662445175819 -0.402522669869843 -0.565006997295591
40 61.75 3.60454705350139 -0.398993210359384 -0.579285613362648
40 61.875 3.60163869594186 -0.411346318645989 -0.592724310837524
40 62 3.59873033838234 -0.423405305306722 -0.606163008312401
40 62.125 3.59540650117145 -0.435758413593327 -0.619601705787278
40 62.25 3.59249814361192 -0.44781740025406 -0.633320376126214
40 62.375 3.5895897860524 -0.460170508540664 -0.646759073601091
40 62.5 3.58668142849287 -0.471935373575526 -0.660197771075968
40 62.625 3.57546347790613 -0.509288820061212 -0.666357174085286
40 62.75 3.56445326714507 -0.546642266546898 -0.672236604230545
40 62.875 3.55323531655832 -0.584289834658455 -0.678675980103923
40 63 3.54201736597158 -0.621643281144141 -0.684835383113241
40 63.125 3.53100715521052 -0.658996727629827 -0.69099478612256
40 63.25 3.51978920462378 -0.696350174115513 -0.697154189131878
40 63.375 3.50005392118414 -0.726644701580281 -0.692954596170979
40 63.5 3.46266075256166 -0.743115512629088 -0.668037011269646
I want to add wind$si10, wind$u10 and wind$v10 into the fire file at the nearest coordinates corresponding to the frp values. First, I tried only with the variable si10, because in the RANN package both the fire and wind files should have the same number of columns. So I used the code with si10 only:
library(RANN)
read.table(file.choose(), sep="\t", header = T) -> wind_jan
read.table(file.choose(), sep="\t", header = T) -> fire_jan
names(fire_jan)
names(wind_jan)
closest <- RANN::nn2(data = wind_jan, query = fire_jan, k = 1)
closest
fire_jan$wind_lat <- wind_jan[closest$nn.idx, "lat"]
fire_jan$wind_lon <- wind_jan[closest$nn.idx, "lon"]
fire_jan$WS <- wind_jan[closest$nn.idx, "si10"]
From the above code I am able to extract si10 values at the coordinates nearest the fire$frp points, but when I apply the same code to the u10 and v10 variables in the wind file, I am not able to get the extracted values at the same coordinates as I got with si10.
How can I solve this query with this code?
You call closest_u$nn.id, which doesn't exist. Maybe there is also an error with your labels when reading the wind data frame? Could that be the error?

R identifying first value in data-frame and creating new variable by adding/subtracting this from all values in data-frame in new column

I know this question may have been already answered elsewhere and apologies for repeating it if so but I haven't found a workable answer as yet.
I have 17 subjects each with two variables as below:
Time (s) OD
130 41.48
130.5 41.41
131 39.6
131.5 39.18
132 39.41
132.5 37.91
133 37.95
133.5 37.15
134 35.5
134.5 36.01
135 35.01
I would like R to identify the first value in column 2 (OD) of my data frame and create a new column (OD_adjusted) by adding or subtracting (depending on whether the first value is positive or negative) that value from all values in column 2, so it would look like this:
Time (s) OD OD_adjusted
130 41.48 0
130.5 41.41 -0.07
131 39.6 -1.88
131.5 39.18 -2.3
132 39.41 -2.07
132.5 37.91 -3.57
133 37.95 -3.53
133.5 37.15 -4.33
134 35.5 -5.98
134.5 36.01 -5.47
135 35.01 -6.47
The first value in column 2 is 41.48, therefore I want to subtract this value from all data points in column 2 to create a new third column (OD_adjusted).
I can use OD_adjusted <- ((df$OD) - 41.48); however, I would like to automate the process using a function, and this is where I am stuck:
AUC_OD <- function(df){
return_value_1 = df %>%
arrange(OD) %>%
filter(OD [1,2) %>%
slice_(1)
colnames(return_value_1)[3] <- "OD_adjusted"
if (nrow(return_value_1) > 0 ) { subtract
(return_value_1 [1,2] #into new row
else add
(return_value_1 [1,2] #into new row
}
We get the first element of 'OD' and subtract it from the column:
library(dplyr)
df1 %>%
  mutate(OD_adjusted = OD - OD[1])
Or using base R
df1$OD_adjusted <- with(df1, OD - OD[1])

R equivalent of Stata's for-loop over local macro list of stubnames

I'm a Stata user that's transitioning to R and there's one Stata crutch that I find hard to give up. This is because I don't know how to do the equivalent with R's "apply" functions.
In Stata, I often generate a local macro list of stubnames and then loop over that list, calling on variables whose names are built off of those stubnames.
For a simple example, imagine that I have the following dataset:
study_id year varX06 varX07 varX08 varY06 varY07 varY08
1 6 50 40 30 20.5 19.8 17.4
1 7 50 40 30 20.5 19.8 17.4
1 8 50 40 30 20.5 19.8 17.4
2 6 60 55 44 25.1 25.2 25.3
2 7 60 55 44 25.1 25.2 25.3
2 8 60 55 44 25.1 25.2 25.3
and so on...
I want to generate two new variables, varX and varY that take on the values of varX06 and varY06 respectively when year is 6, varX07 and varY07 respectively when year is 7, and varX08 and varY08 respectively when year is 8.
The final dataset should look like this:
study_id year varX06 varX07 varX08 varY06 varY07 varY08 varX varY
1 6 50 40 30 20.5 19.8 17.4 50 20.5
1 7 50 40 30 20.5 19.8 17.4 40 19.8
1 8 50 40 30 20.5 19.8 17.4 30 17.4
2 6 60 55 44 25.1 25.2 25.3 60 25.1
2 7 60 55 44 25.1 25.2 25.3 55 25.2
2 8 60 55 44 25.1 25.2 25.3 44 25.3
and so on...
To clarify, I know that I can do this with melt and reshape commands - essentially converting this data from wide to long format, but I don't want to resort to that. That's not the intent of my question.
My question is about how to loop over a local macro list of stubnames in R and I'm just using this simple example to illustrate a more generic dilemma.
In Stata, I could generate a local macro list of stubnames:
local stub varX varY
And then loop over the macro list. I can generate a new variable varX or varY and replace the new variable value with the value of varX06 or varY06 (respectively) if year is 6 and so on.
foreach i of local stub {
display "`i'"
gen `i'=.
replace `i'=`i'06 if year==6
replace `i'=`i'07 if year==7
replace `i'=`i'08 if year==8
}
The last section is the one that I find hardest to replicate in R. When I write `i'06, Stata takes the string "varX", concatenates it with the string "06", and then returns the value of the variable varX06. Additionally, when I write `i', Stata returns the string "varX" and not the string "`i'".
How do I do these things with R?
I've searched through Muenchen's "R for Stata Users", googled the web, and searched through previous posts here at StackOverflow but haven't been able to find an R solution.
I apologize if this question is elementary. If it's been answered before, please direct me to the response.
Thanks in advance,
Tara
Well, here's one way. Columns in R data frames can be accessed using their character names, so this will work:
# create sample dataset
set.seed(1) # for reproducible example
df <- data.frame(year=as.factor(rep(6:8,each=100)), #categorical variable
varX06 = rnorm(300), varX07=rnorm(300), varX08=rnorm(100),
varY06 = rnorm(300), varY07=rnorm(300), varY08=rnorm(100))
# you start here...
years <- unique(df$year)
df$varX <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varX0",yr)]))
df$varY <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varY0",yr)]))
print(head(df),digits=4)
# year varX06 varX07 varX08 varY06 varY07 varY08 varX varY
# 1 6 -0.6265 0.8937 -0.3411 -0.70757 1.1350 0.3412 -0.6265 -0.70757
# 2 6 0.1836 -1.0473 1.5024 1.97157 1.1119 1.3162 0.1836 1.97157
# 3 6 -0.8356 1.9713 0.5283 -0.09000 -0.8708 -0.9598 -0.8356 -0.09000
# 4 6 1.5953 -0.3836 0.5422 -0.01402 0.2107 -1.2056 1.5953 -0.01402
# 5 6 0.3295 1.6541 -0.1367 -1.12346 0.0694 1.5676 0.3295 -1.12346
# 6 6 -0.8205 1.5122 -1.1367 -1.34413 -1.6626 0.2253 -0.8205 -1.34413
For a given yr, the anonymous function extracts the rows with that yr from the column named "varX0" + yr (the result of paste0(...)). Then lapply(...) "applies" this function to each year, and unlist(...) converts the returned list into a vector.
Maybe a more transparent way:
sub <- c("varX", "varY")
for (i in sub) {
  df[[i]] <- NA
  df[[i]] <- ifelse(df[["year"]] == 6, df[[paste0(i, "06")]], df[[i]])
  df[[i]] <- ifelse(df[["year"]] == 7, df[[paste0(i, "07")]], df[[i]])
  df[[i]] <- ifelse(df[["year"]] == 8, df[[paste0(i, "08")]], df[[i]])
}
This method reorders your data, but involves a one-liner, which may or may not be better for you (assume d is your dataframe):
> do.call(rbind, by(d, d$year, function(x) { within(x, { varX <- x[, paste0('varX0',x$year[1])]; varY <- x[, paste0('varY0',x$year[1])] }) } ))
study_id year varX06 varX07 varX08 varY06 varY07 varY08 varY varX
6.1 1 6 50 40 30 20.5 19.8 17.4 20.5 50
6.4 2 6 60 55 44 25.1 25.2 25.3 25.1 60
7.2 1 7 50 40 30 20.5 19.8 17.4 19.8 40
7.5 2 7 60 55 44 25.1 25.2 25.3 25.2 55
8.3 1 8 50 40 30 20.5 19.8 17.4 17.4 30
8.6 2 8 60 55 44 25.1 25.2 25.3 25.3 44
Essentially, it splits the data based on year, then uses within to create the varX and varY variables within each subset, and then rbind's the subsets back together.
A direct translation of your Stata code, however, would loop over the stubnames and build each column name with paste0:
stub <- c("varX", "varY")
for (i in stub) {
  d[[i]] <- ifelse(d$year == 6, d[[paste0(i, "06")]],
            ifelse(d$year == 7, d[[paste0(i, "07")]],
            ifelse(d$year == 8, d[[paste0(i, "08")]], NA)))
}
Here's another option.
Create a 'column selection matrix' based on year, then use that to grab the values you want from any block of columns.
# indexing matrix based on the 'year' column
col_select_mat <-
t(sapply(your_df$year, function(x) unique(your_df$year) == x))
# make selections from col groups by stub name
sapply(c('varX', 'varY'),
function(x) your_df[, grep(x, names(your_df))][col_select_mat])
This gives the desired result (which you can cbind to your_df if you like)
varX varY
[1,] 50 20.5
[2,] 60 25.1
[3,] 40 19.8
[4,] 55 25.2
[5,] 30 17.4
[6,] 44 25.3
OP's dataset:
your_df <- read.table(header=T, text=
'study_id year varX06 varX07 varX08 varY06 varY07 varY08
1 6 50 40 30 20.5 19.8 17.4
1 7 50 40 30 20.5 19.8 17.4
1 8 50 40 30 20.5 19.8 17.4
2 6 60 55 44 25.1 25.2 25.3
2 7 60 55 44 25.1 25.2 25.3
2 8 60 55 44 25.1 25.2 25.3')
Benchmarking: the three posted solutions can be compared with the microbenchmark package. Note that the functions must be called with parentheses; passing the bare function names only times the symbol lookup (a few nanoseconds) and says nothing about the solutions themselves.
df <- your_df
d <- your_df
arvi1000 <- function() {
col_select_mat <- t(sapply(your_df$year, function(x) unique(your_df$year) == x))
# make selections from col groups by stub name
cbind(your_df,
sapply(c('varX', 'varY'),
function(x) your_df[, grep(x, names(your_df))][col_select_mat]))
}
jlhoward <- function() {
years <- unique(df$year)
df$varX <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varX0",yr)]))
df$varY <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varY0",yr)]))
}
Thomas <- function() {
do.call(rbind, by(d, d$year, function(x) { within(x, { varX <- x[, paste0('varX0',x$year[1])]; varY <- x[, paste0('varY0',x$year[1])] }) } ))
}
library(microbenchmark)
microbenchmark(arvi1000(), jlhoward(), Thomas())
