Program R - How to create a function that creates another function with user-inputted argument names? - r

I want to create a function in R that takes the column names from a data frame and creates a new function whose argument names are the same as the column names, with as many arguments as there are columns.
my_function <- function(data_frame) {
  new_function <- function(column_name_1, column_name_2, column_name_3, column_name_x) {
    # function does something cool
  }
  return(new_function)
}
For example, if I have a data frame:
> dimorphandra
genus_species temp germinability waterpotential
1 Dimorphandra_mollis 30.8 86 0.0
2 Dimorphandra_mollis 32.5 94 0.0
3 Dimorphandra_mollis 35.0 74 0.0
4 Dimorphandra_mollis 37.0 44 0.0
5 Dimorphandra_mollis 39.0 2 0.0
6 Dimorphandra_mollis 41.0 0 0.0
Then I would apply my_function to it:
my_function(dimorphandra)
Which would output:
new_function <- function(genus_species, temp, germinability, waterpotential) {
  # function does something cool
}
How can I create my_function?
Thank you very much!
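One possible way to do this, offered as a minimal sketch rather than a definitive answer: build a list of empty arguments named after the columns and assign it with formals<-. The body here (returning the supplied arguments as a named list) is only a placeholder for the "something cool", and the two-row dimorphandra data frame is a shortened stand-in for the one in the question.

```r
my_function <- function(data_frame) {
  arg_names <- names(data_frame)

  # One empty (no default) argument per column name
  args <- setNames(rep(alist(x = ), length(arg_names)), arg_names)

  new_function <- function() {
    # Placeholder body (an assumption): return the supplied arguments as a named list
    mget(arg_names, envir = environment())
  }
  formals(new_function) <- args
  new_function
}

# Shortened stand-in for the data frame in the question
dimorphandra <- data.frame(
  genus_species  = "Dimorphandra_mollis",
  temp           = c(30.8, 32.5),
  germinability  = c(86, 94),
  waterpotential = 0
)

new_function <- my_function(dimorphandra)
names(formals(new_function))
# "genus_species" "temp" "germinability" "waterpotential"
```

Because the argument list is built from names(data_frame), this only behaves well when the column names are valid R argument names.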

Related

How to add a new column in data frame using calculation in R?

I want to add a new column with calculation. In the below data frame,
Env<- c("High_inoc","High_NO_inoc","Low_inoc", "Low_NO_inoc")
CV1<- c(30,150,20,100)
CV2<- c(74,99,49,73)
CV3<- c(78,106,56,69)
CV4<- c(86,92,66,70)
CV5<- c(74,98,57,79)
Data<-data.frame(Env,CV1,CV2,CV3,CV4,CV5)
library(dplyr) # provides %>% and select()
Data$Mean <- rowMeans(Data %>% select(-Env))
Data <- rbind(Data, c("Mean", colMeans(Data %>% select(-Env))))
I'd like to add a new column named 'Env_index', calculated as each value of the 'Mean' column minus the overall mean (76.3), such as 68.4 - 76.3, 109 - 76.3, ..., 78.2 - 76.3.
So I did it like this and obtained what I want.
Data$Env_index <- c(68.4-76.3,109-76.3,49.6-76.3,78.2-76.3, 76.3-76.3)
But I want to calculate it directly in code, so if I code it like this,
Data$Env_index <- with (data, data$Mean - 76.3)
it generates an error. Could you let me know how to calculate this?
Thanks,
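To make the calculation dynamic, so that it works on any data, you can do the following. The as.numeric() call is needed because rbind() with a character vector turns the Mean column into character: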
Data$Mean <- as.numeric(Data$Mean)
Data$Env_index <- Data$Mean - Data$Mean[nrow(Data)]
Data
# Env CV1 CV2 CV3 CV4 CV5 Mean Env_index
#1 High_inoc 30 74 78 86 74 68.4 -7.9
#2 High_NO_inoc 150 99 106 92 98 109.0 32.7
#3 Low_inoc 20 49 56 66 57 49.6 -26.7
#4 Low_NO_inoc 100 73 69 70 79 78.2 1.9
#5 Mean 75 73.75 77.25 78.5 77 76.3 0.0
Data$Mean[nrow(Data)] selects the last value of Data$Mean, which here is the overall mean (76.3) in the appended "Mean" row.
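A variation on the same idea, as a rough sketch in base R only: if the summary "Mean" row is not appended, the columns stay numeric and the overall mean can be taken directly from the row means. Skipping the rbind() step is an assumption about what is acceptable for the final table, not something the question asks for.

```r
Env <- c("High_inoc", "High_NO_inoc", "Low_inoc", "Low_NO_inoc")
CV1 <- c(30, 150, 20, 100)
CV2 <- c(74, 99, 49, 73)
CV3 <- c(78, 106, 56, 69)
CV4 <- c(86, 92, 66, 70)
CV5 <- c(74, 98, 57, 79)
Data <- data.frame(Env, CV1, CV2, CV3, CV4, CV5)

# Row means over every column except Env
Data$Mean <- rowMeans(Data[, names(Data) != "Env"])

# Overall mean of the row means (76.3 for this data)
overall_mean <- mean(Data$Mean)

# Deviation of each environment from the overall mean
Data$Env_index <- Data$Mean - overall_mean
Data$Env_index
# -7.9 32.7 -26.7 1.9
```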

Issue with calculating row mean in data table for selected columns in R

I have a data table as shown below.
Table:
LP GMweek1 GMweek2 GMweek3 PMweek1 PMweek2 PMweek3
215 45 50 60 11 0.4 10.2
0.1 50 61 24 12 0.8 80.0
0 45 24 35 22 20.0 15.4
51 22.1 54 13 35 16 2.2
I want to obtain the Output table below, but my code below does not work. Can somebody help me figure out what I am doing wrong here?
Any help is appreciated.
Output:
LP GMweek1 GMweek2 GMweek3 PMweek1 PMweek2 PMweek3 AvgGM AvgPM
215 45 50 60 11 0.4 10.2 51.67 7.20
0.1 50 61 24 12 0.8 80.0 45.00 30.93
0 45 24 35 22 20.0 15.4 34.67 19.13
51 22.1 54 13 35 16 2.2 29.70 17.73
sel_cols_GM <- c("GMweek1","GMweek2","GMweek3")
sel_cols_PM <- c("PMweek1","PMweek2","PMweek3")
Table <- Table[, .(AvgGM = rowMeans(sel_cols_GM)), by = LP]
Table <- Table[, .(AvgPM = rowMeans(sel_cols_PM)), by = LP]
Ok, so you're doing a couple of things wrong. First, rowMeans can't evaluate a character vector; if you want to select columns with one, you must use .SD and pass the character vector to .SDcols. Second, you're trying to combine a row-wise aggregation with grouping, which I don't think makes much sense. Third, even if your expression didn't throw an error, you are assigning the result back to Table, which would destroy your original data (if you want to add a new column, use := to add it by reference).
What you want to do is calculate the row means of your selected columns, which you can do like this:
Table[, AvgGM := rowMeans(.SD), .SDcols = sel_cols_GM]
Table[, AvgPM := rowMeans(.SD), .SDcols = sel_cols_PM]
This means: create these new columns as the row means of the subset of data (.SD) made up of the columns named in .SDcols.
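For anyone following along without data.table, here is a rough base R sketch of the same row-mean idea on a plain data.frame. It swaps .SD/.SDcols for ordinary column selection by a character vector, so it is not the data.table approach above, only a way to check the expected numbers from the question's table.

```r
Table <- read.table(header = TRUE, text = "
LP GMweek1 GMweek2 GMweek3 PMweek1 PMweek2 PMweek3
215 45 50 60 11 0.4 10.2
0.1 50 61 24 12 0.8 80.0
0 45 24 35 22 20.0 15.4
51 22.1 54 13 35 16 2.2")

sel_cols_GM <- c("GMweek1", "GMweek2", "GMweek3")
sel_cols_PM <- c("PMweek1", "PMweek2", "PMweek3")

# Select the columns by name, then average across each row
Table$AvgGM <- rowMeans(Table[, sel_cols_GM])
Table$AvgPM <- rowMeans(Table[, sel_cols_PM])

round(Table$AvgGM, 2)
# 51.67 45.00 34.67 29.70
round(Table$AvgPM, 2)
# 7.20 30.93 19.13 17.73
```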

How to add nearest coordinates points from one file to another using RANN package

I tried the RANN package to find the nearest coordinate points by comparing two files, and then to add the values at those nearest points to the other file.
My files -> fire
lat lon frp
30.037 80.572 38.5
23.671 85.008 7.2
22.791 86.206 11.4
23.755 86.421 5.6
23.673 86.088 4.2
23.768 86.392 8.4
23.789 86.243 7.8
23.805 86.327 6.4
23.682 86.085 7.8
23.68 86.095 5.7
21.194 81.41 19
16.95 81.912 8
16.952 81.898 11.5
16.899 81.682 10.6
12.994 79.651 16.1
9.2 77.603 14.5
12.291 77.346 20.5
17.996 79.708 13.9
17.998 79.718 29.6
16.61 81.266 6.6
16.499 81.2 6.8
19.505 81.784 22.4
18.322 80.555 7.7
19.506 81.794 28.2
21.081 81.957 8.7
21.223 82.127 9.4
20.918 81.025 6.3
19.861 82.123 9.3
20.62 75.049 11.6
and 2nd file -> wind
lat lon si10 u10 v10
40 60 3.5927058834376 -0.874587879393667 -0.375465368327018
40 60.125 3.59519876134577 -0.836646189656238 -0.388624092937835
40 60.25 3.59769163925393 -0.798704499918809 -0.401782817548651
40 60.375 3.6001845171621 -0.76076281018138 -0.414941542159468
40 60.5 3.60246965524458 -0.722821120443951 -0.428380239634345
40 60.625 3.60496253315275 -0.684585309080651 -0.441538964245162
40 60.75 3.60766315088659 -0.646937740969094 -0.454977661720038
40 60.875 3.60911732966636 -0.609878416109279 -0.468976304923035
40 61 3.608701850015 -0.575172064256437 -0.484934758174451
40 61.125 3.60807863053795 -0.540759834029467 -0.500893211425867
40 61.25 3.60787089071227 -0.506053482176625 -0.516851664677283
40 61.375 3.60745541106091 -0.471641251949655 -0.533090090792759
40 61.5 3.60703993140955 -0.437229021722684 -0.548768571180115
40 61.625 3.60662445175819 -0.402522669869843 -0.565006997295591
40 61.75 3.60454705350139 -0.398993210359384 -0.579285613362648
40 61.875 3.60163869594186 -0.411346318645989 -0.592724310837524
40 62 3.59873033838234 -0.423405305306722 -0.606163008312401
40 62.125 3.59540650117145 -0.435758413593327 -0.619601705787278
40 62.25 3.59249814361192 -0.44781740025406 -0.633320376126214
40 62.375 3.5895897860524 -0.460170508540664 -0.646759073601091
40 62.5 3.58668142849287 -0.471935373575526 -0.660197771075968
40 62.625 3.57546347790613 -0.509288820061212 -0.666357174085286
40 62.75 3.56445326714507 -0.546642266546898 -0.672236604230545
40 62.875 3.55323531655832 -0.584289834658455 -0.678675980103923
40 63 3.54201736597158 -0.621643281144141 -0.684835383113241
40 63.125 3.53100715521052 -0.658996727629827 -0.69099478612256
40 63.25 3.51978920462378 -0.696350174115513 -0.697154189131878
40 63.375 3.50005392118414 -0.726644701580281 -0.692954596170979
40 63.5 3.46266075256166 -0.743115512629088 -0.668037011269646
I want to add wind$si10, wind$u10 and wind$v10 to the fire file, taken at the wind coordinates nearest to each frp value. First, I tried with only the si10 variable, because in the RANN package both the fire and wind files should have the same number of columns. So I used this code with si10 only:
library(RANN)
read.table(file.choose(), sep="\t", header = T) -> wind_jan
read.table(file.choose(), sep="\t", header = T) -> fire_jan
names(fire_jan)
names(wind_jan)
closest <- RANN::nn2(data = wind_jan, query = fire_jan, k = 1)
closest
fire_jan$wind_lat <- wind_jan[closest$nn.idx, "lat"]
fire_jan$wind_lon <- wind_jan[closest$nn.idx, "lon"]
fire_jan$WS <- wind_jan[closest$nn.idx, "si10"]
With the above code I am able to extract the si10 values at the coordinates nearest to each fire$frp, but when I apply the same code to the u10 and v10 variables in the wind file, I do not get values extracted at the same coordinates as I got with si10.
How can I solve this with this code?
You call closest_u$nn.id, which doesn't exist.
Maybe there is also an error with your labels when reading the wind df?
Could that be the error?
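One thing that may be worth checking, sketched here under assumptions: nn2() treats every column it is given as a coordinate, so passing the whole data frames lets frp and si10 influence which point counts as "nearest". Matching on lat and lon only, and then reusing the same row index for si10, u10 and v10, keeps all three variables on the same wind coordinates. The sketch below uses a brute-force base R search instead of RANN so that it runs without extra packages, and the small fire and wind tables are made-up values in the question's layout. Treating degrees as plain Euclidean coordinates is an approximation.

```r
# Made-up miniature versions of the two files (values are illustrative only)
fire <- data.frame(lat = c(30.037, 23.671),
                   lon = c(80.572, 85.008),
                   frp = c(38.5, 7.2))
wind <- data.frame(lat  = c(30, 30, 23.75),
                   lon  = c(80.5, 85, 85),
                   si10 = c(3.1, 3.2, 3.3),
                   u10  = c(-0.8, -0.7, -0.6),
                   v10  = c(-0.3, -0.4, -0.5))

# For each fire point, index of the wind row with the smallest squared
# distance in (lat, lon) only; frp and the wind variables play no part
nearest <- vapply(seq_len(nrow(fire)), function(i) {
  d2 <- (wind$lat - fire$lat[i])^2 + (wind$lon - fire$lon[i])^2
  which.min(d2)
}, integer(1))

# With RANN, the corresponding call would restrict both inputs the same way:
# nearest <- RANN::nn2(data = wind[, c("lat", "lon")],
#                      query = fire[, c("lat", "lon")], k = 1)$nn.idx[, 1]

# Reuse the one index for the coordinates and for every wind variable
fire$wind_lat <- wind$lat[nearest]
fire$wind_lon <- wind$lon[nearest]
for (v in c("si10", "u10", "v10")) {
  fire[[v]] <- wind[[v]][nearest]
}
fire
```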

R equivalent of Stata's for-loop over local macro list of stubnames

I'm a Stata user that's transitioning to R and there's one Stata crutch that I find hard to give up. This is because I don't know how to do the equivalent with R's "apply" functions.
In Stata, I often generate a local macro list of stubnames and then loop over that list, calling on variables whose names are built off of those stubnames.
For a simple example, imagine that I have the following dataset:
study_id year varX06 varX07 varX08 varY06 varY07 varY08
1 6 50 40 30 20.5 19.8 17.4
1 7 50 40 30 20.5 19.8 17.4
1 8 50 40 30 20.5 19.8 17.4
2 6 60 55 44 25.1 25.2 25.3
2 7 60 55 44 25.1 25.2 25.3
2 8 60 55 44 25.1 25.2 25.3
and so on...
I want to generate two new variables, varX and varY that take on the values of varX06 and varY06 respectively when year is 6, varX07 and varY07 respectively when year is 7, and varX08 and varY08 respectively when year is 8.
The final dataset should look like this:
study_id year varX06 varX07 varX08 varY06 varY07 varY08 varX varY
1 6 50 40 30 20.5 19.8 17.4 50 20.5
1 7 50 40 30 20.5 19.8 17.4 40 19.8
1 8 50 40 30 20.5 19.8 17.4 30 17.4
2 6 60 55 44 25.1 25.2 25.3 60 25.1
2 7 60 55 44 25.1 25.2 25.3 55 25.2
2 8 60 55 44 25.1 25.2 25.3 44 25.3
and so on...
To clarify, I know that I can do this with melt and reshape commands - essentially converting this data from wide to long format, but I don't want to resort to that. That's not the intent of my question.
My question is about how to loop over a local macro list of stubnames in R and I'm just using this simple example to illustrate a more generic dilemma.
In Stata, I could generate a local macro list of stubnames:
local stub varX varY
And then loop over the macro list. I can generate a new variable varX or varY and replace the new variable value with the value of varX06 or varY06 (respectively) if year is 6 and so on.
foreach i of local stub {
display "`i'"
gen `i'=.
replace `i'=`i'06 if year==6
replace `i'=`i'07 if year==7
replace `i'=`i'08 if year==8
}
The last section is the one that I find hardest to replicate in R. When I write `i'06, Stata takes the string "varX", concatenates it with the string "06" and then returns the value of the variable varX06. Additionally, when I write `i', Stata returns the string "varX" and not the string "`i'".
How do I do these things with R?
I've searched through Muenchen's "R for Stata Users", googled the web, and searched through previous posts here at StackOverflow but haven't been able to find an R solution.
I apologize if this question is elementary. If it's been answered before, please direct me to the response.
Thanks in advance,
Tara
Well, here's one way. Columns in R data frames can be accessed using their character names, so this will work:
# create sample dataset
set.seed(1) # for reproducible example
df <- data.frame(year = as.factor(rep(6:8, each = 100)), # categorical variable
                 varX06 = rnorm(300), varX07 = rnorm(300), varX08 = rnorm(100),
                 varY06 = rnorm(300), varY07 = rnorm(300), varY08 = rnorm(100))
# you start here...
years <- unique(df$year)
df$varX <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varX0",yr)]))
df$varY <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varY0",yr)]))
print(head(df),digits=4)
# year varX06 varX07 varX08 varY06 varY07 varY08 varX varY
# 1 6 -0.6265 0.8937 -0.3411 -0.70757 1.1350 0.3412 -0.6265 -0.70757
# 2 6 0.1836 -1.0473 1.5024 1.97157 1.1119 1.3162 0.1836 1.97157
# 3 6 -0.8356 1.9713 0.5283 -0.09000 -0.8708 -0.9598 -0.8356 -0.09000
# 4 6 1.5953 -0.3836 0.5422 -0.01402 0.2107 -1.2056 1.5953 -0.01402
# 5 6 0.3295 1.6541 -0.1367 -1.12346 0.0694 1.5676 0.3295 -1.12346
# 6 6 -0.8205 1.5122 -1.1367 -1.34413 -1.6626 0.2253 -0.8205 -1.34413
For a given yr, the anonymous function extracts the rows with that yr and the column named "varX0" + yr (the result of paste0(...)). Then lapply(...) "applies" this function to each year, and unlist(...) converts the returned list into a vector.
Maybe a more transparent way:
sub <- c("varX", "varY")
for (i in sub) {
  df[[i]] <- NA
  df[[i]] <- ifelse(df[["year"]] == 6, df[[paste0(i, "06")]], df[[i]])
  df[[i]] <- ifelse(df[["year"]] == 7, df[[paste0(i, "07")]], df[[i]])
  df[[i]] <- ifelse(df[["year"]] == 8, df[[paste0(i, "08")]], df[[i]])
}
This method reorders your data, but involves a one-liner, which may or may not be better for you (assume d is your dataframe):
> do.call(rbind, by(d, d$year, function(x) { within(x, { varX <- x[, paste0('varX0',x$year[1])]; varY <- x[, paste0('varY0',x$year[1])] }) } ))
study_id year varX06 varX07 varX08 varY06 varY07 varY08 varY varX
6.1 1 6 50 40 30 20.5 19.8 17.4 20.5 50
6.4 2 6 60 55 44 25.1 25.2 25.3 25.1 60
7.2 1 7 50 40 30 20.5 19.8 17.4 19.8 40
7.5 2 7 60 55 44 25.1 25.2 25.3 25.2 55
8.3 1 8 50 40 30 20.5 19.8 17.4 17.4 30
8.6 2 8 60 55 44 25.1 25.2 25.3 25.3 44
Essentially, it splits the data based on year, then uses within to create the varX and varY variables within each subset, and then rbind's the subsets back together.
A direct translation of your Stata code, however, would be something like the following:
u <- unique(d$year)
for(i in seq_along(u)){
d$varX <- ifelse(d$year == 6, d$varX06, ifelse(d$year == 7, d$varX07, ifelse(d$year == 8, d$varX08, NA)))
d$varY <- ifelse(d$year == 6, d$varY06, ifelse(d$year == 7, d$varY07, ifelse(d$year == 8, d$varY08, NA)))
}
Here's another option.
Create a 'column selection matrix' based on year, then use that to grab the values you want from any block of columns.
# indexing matrix based on the 'year' column
col_select_mat <-
t(sapply(your_df$year, function(x) unique(your_df$year) == x))
# make selections from col groups by stub name
sapply(c('varX', 'varY'),
function(x) your_df[, grep(x, names(your_df))][col_select_mat])
This gives the desired result (which you can cbind to your_df if you like)
varX varY
[1,] 50 20.5
[2,] 60 25.1
[3,] 40 19.8
[4,] 55 25.2
[5,] 30 17.4
[6,] 44 25.3
OP's dataset:
your_df <- read.table(header=T, text=
'study_id year varX06 varX07 varX08 varY06 varY07 varY08
1 6 50 40 30 20.5 19.8 17.4
1 7 50 40 30 20.5 19.8 17.4
1 8 50 40 30 20.5 19.8 17.4
2 6 60 55 44 25.1 25.2 25.3
2 7 60 55 44 25.1 25.2 25.3
2 8 60 55 44 25.1 25.2 25.3')
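Benchmarking: Looking at the three posted solutions, this appears to be the fastest on average, but the differences are very small. (Caveat: the microbenchmark call below passes the function names without parentheses, so it only times looking up each function object, hence the nanosecond figures; microbenchmark(arvi1000(), jlhoward(), Thomas()) would time the actual work.)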
df <- your_df
d <- your_df
arvi1000 <- function() {
col_select_mat <- t(sapply(your_df$year, function(x) unique(your_df$year) == x))
# make selections from col groups by stub name
cbind(your_df,
sapply(c('varX', 'varY'),
function(x) your_df[, grep(x, names(your_df))][col_select_mat]))
}
jlhoward <- function() {
years <- unique(df$year)
df$varX <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varX0",yr)]))
df$varY <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varY0",yr)]))
}
Thomas <- function() {
do.call(rbind, by(d, d$year, function(x) { within(x, { varX <- x[, paste0('varX0',x$year[1])]; varY <- x[, paste0('varY0',x$year[1])] }) } ))
}
> microbenchmark(arvi1000, jlhoward, Thomas)
Unit: nanoseconds
expr min lq mean median uq max neval
arvi1000 37 39 43.73 40 42 380 100
jlhoward 38 40 46.35 41 42 377 100
Thomas 37 40 56.99 41 42 1590 100
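As one more take on the stub-name loop itself, here is a small sketch that builds the full column name for every row with paste0() and then picks the matching cell with matrix indexing. It assumes the year suffixes are always two digits (06, 07, 08) and that each stub's columns are all numeric; the variable names stubs, cols and block are made up for the example.

```r
d <- read.table(header = TRUE, text = "
study_id year varX06 varX07 varX08 varY06 varY07 varY08
1 6 50 40 30 20.5 19.8 17.4
1 7 50 40 30 20.5 19.8 17.4
1 8 50 40 30 20.5 19.8 17.4
2 6 60 55 44 25.1 25.2 25.3
2 7 60 55 44 25.1 25.2 25.3
2 8 60 55 44 25.1 25.2 25.3")

stubs <- c("varX", "varY")            # plays the role of the Stata local macro

for (stub in stubs) {
  # Column wanted for each row, e.g. "varX06" when year is 6
  cols <- paste0(stub, sprintf("%02d", d$year))

  # Numeric matrix holding just this stub's columns
  block <- as.matrix(d[unique(cols)])

  # (row, column) pairs select one cell per row
  d[[stub]] <- block[cbind(seq_len(nrow(d)), match(cols, colnames(block)))]
}

d[, c("study_id", "year", "varX", "varY")]
```

Unlike the by()/rbind approach, this keeps the rows in their original order and does not depend on the data being sorted by year.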

R - "find and replace" integers in a column with character labels

I have two data frames the first (DF1) is similar to this:
Ba Ram You Sheep
30 1 33.2 120.9
27 3 22.1 121.2
22 4 39.1 99.1
11 1 20.0 101.6
9 3 9.8 784.3
The second (DF2) contains titles for column "Ram":
V1 V2
1 RED
2 GRN
3 YLW
4 BLU
I need to replace the DF1$Ram with corresponding character strings of DF2$V2:
Ba Ram You Sheep
30 RED 33.2 120.9
27 YLW 22.1 121.2
22 BLU 39.1 99.1
11 RED 20.0 101.6
9 YLW 9.8 784.3
I can do this with a nested for loop, but it feels REALLY inefficient:
x <- seq_len(nrow(DF1))
y <- seq_len(nrow(DF2))
for (i in x) {
  for (j in y) {
    if (DF1$Ram[i] == DF2$V1[j]) {
      DF1$Ram[i] <- DF2$V2[j]
      break # label found; move on to the next row
    }
  }
}
Is there a way to do this more efficiently??!?! I know there is. I'm a noob.
Use merge
> result <- merge(DF1, DF2, by.x="Ram", by.y="V1")[,-1] # merging data.frames
> colnames(result)[4] <- "Ram" # setting name
The following is just for getting the output in the order you showed us
> result[order(result$Ba, decreasing = TRUE), c("Ba", "Ram", "You", "Sheep")]
Ba Ram You Sheep
1 30 RED 33.2 120.9
3 27 YLW 22.1 121.2
5 22 BLU 39.1 99.1
2 11 RED 20.0 101.6
4 9 YLW 9.8 784.3
Usually, when you encode character strings as integers, you probably want a factor. Factors offer some benefits that you can read about in the fine manual.
df1 <- data.frame(V2 = c(3,3,2,3,1))
df2 <- data.frame(V1=1:4, V2=c('a','b','c','d'))
df1 <- within(df1, {
f <- factor(df1$V2, levels=df2$V1, labels=df2$V2)
aschar <- as.character(f)
asnum <- as.numeric(f)
})
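Another common idiom for this kind of lookup, shown as a short sketch on the question's data: match() finds, for each value of DF1$Ram, the row of DF2 with the same V1, and that position is used to index DF2$V2. stringsAsFactors = FALSE is set explicitly so that the labels stay plain character strings on older R versions too.

```r
DF1 <- data.frame(Ba    = c(30, 27, 22, 11, 9),
                  Ram   = c(1, 3, 4, 1, 3),
                  You   = c(33.2, 22.1, 39.1, 20.0, 9.8),
                  Sheep = c(120.9, 121.2, 99.1, 101.6, 784.3))
DF2 <- data.frame(V1 = 1:4,
                  V2 = c("RED", "GRN", "YLW", "BLU"),
                  stringsAsFactors = FALSE)

# Position of each Ram code in DF2$V1, used to pull the matching label
DF1$Ram <- DF2$V2[match(DF1$Ram, DF2$V1)]
DF1$Ram
# "RED" "YLW" "BLU" "RED" "YLW"
```

Codes that have no entry in DF2$V1 come back as NA, and the row order of DF1 is left untouched.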
