I have a three-dimensional data structure holding data at particular longitudes, latitudes, and depths. I would like to apply a function to this data. Normally, if I want the depth-averaged value, I'd do the following:
apply(MyData, MARGIN = c(1, 2), mean)
which makes sense to me. What I'm struggling with is that I want to apply a function that depends on longitude and latitude. Is there a way for apply to pass the indices of elements to the function?
I think you want to use outer() and take advantage of lexical scoping, so that you don't have to pass myData to the function that is called with the longitude and latitude. Note that outer() calls the function once with whole vectors of index pairs, so wrap it in Vectorize() unless it is already vectorized:
myData <- read.table(...) # or whatever
outer(seq_len(dim(myData)[1]),
      seq_len(dim(myData)[2]),
      Vectorize(function(longitude, latitude) {
        # do things that depend on
        # myData[longitude, latitude, ]
      }))
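A minimal sketch of this idea, assuming myData is a 3-D array indexed as [longitude, latitude, depth]. The array contents and the function weighted_mean are made up for illustration. Because outer() hands its function whole vectors of index combinations, a function written for a single cell is wrapped in Vectorize() here:

```r
# Hypothetical 3-D array: 4 longitudes x 3 latitudes x 5 depths
set.seed(42)
myData <- array(rnorm(4 * 3 * 5), dim = c(4, 3, 5))

# Hypothetical function that uses both the indices and the depth profile.
# It finds myData through lexical scoping, so it is not passed as an argument.
weighted_mean <- function(lon, lat) {
  mean(myData[lon, lat, ]) * (lon + lat)
}

# One result per (longitude, latitude) cell
result <- outer(seq_len(dim(myData)[1]),
                seq_len(dim(myData)[2]),
                Vectorize(weighted_mean))
dim(result)
```

The result has one row per longitude index and one column per latitude index, the same shape apply(myData, c(1, 2), mean) would return.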
I have a tibble called 'Volume' in which I store some data (10 columns - the first 2 columns are characters, 30 rows).
Now I want to calculate the volume of every column relative to column 3 of my tibble.
My current solution looks like this:
rel.Volume_unmod = tibble(
"Volume_OD" = Volume[[3]] / Volume[[3]],
"Volume_Imp" = Volume[[4]] / Volume[[3]],
"Volume_OD_1" = Volume[[5]] / Volume[[3]],
"Volume_WS_1" = Volume[[6]] / Volume[[3]],
"Volume_OD_2" = Volume[[7]] / Volume[[3]],
"Volume_WS_2" = Volume[[8]] / Volume[[3]],
"Volume_OD_3" = Volume[[9]] / Volume[[3]],
"Volume_WS_3" = Volume[[10]] / Volume[[3]])
rel.Volume_unmod
I would like to keep the tibble structure and the labels. I am sure there is a better solution for this, but I am relatively new to R, so it's not obvious to me. What I tried is something like this, but I can't actually run it:
rel.Volume = NULL
for(i in Volume[,3:10]){
rel.Volume[i] = tibble(Volume = Volume[[i]] / Volume[[3]])
}
Mockup Data
Since you did not provide any data, I've followed your description to create some mockup data. Here:
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
GR = sample(LETTERS, 30, TRUE))
Volume[3:10] <- rnorm(30*8)
Solution with Dplyr
library(dplyr)
# rename columns [brute force]
cols <- c("Volume_OD","Volume_Imp","Volume_OD_1","Volume_WS_1","Volume_OD_2","Volume_WS_2","Volume_OD_3","Volume_WS_3")
colnames(Volume)[3:10] <- cols
# divide by Volume_OD
rel.Volume_unmod <- Volume %>%
mutate(across(all_of(cols), ~ . / Volume_OD))
# result
rel.Volume_unmod
Explanation
I don't know the names of your columns. Probably, the names correspond to the names of the columns you intended to create in rel.Volume_unmod. Anyhow, to avoid any problem I renamed the columns (kinda brutally). You can do it with dplyr::rename if you want to.
There are many ways to select the columns you want to mutate. mutate is a verb from dplyr that allows you to create new columns or perform operations or functions on columns.
across is an adverb from dplyr. Let's simplify by saying that it's a function that allows you to perform a function over multiple columns. In this case I want to perform a division by Volume_OD.
~ is a tidyverse way to create anonymous functions. ~ . / Volume_OD is equivalent to function(x) x / Volume_OD
all_of is necessary because in this specific case I'm providing across with a vector of characters. Without it, it will work anyway, but you will receive a warning because it's ambiguous and it may work incorrectly in some cases.
More info
Check out this book to learn more about data manipulation with tidyverse (which dplyr is part of).
Solution with Base-R
rel.Volume_unmod <- Volume
# rename columns
cols <- c("Volume_OD","Volume_Imp","Volume_OD_1","Volume_WS_1","Volume_OD_2","Volume_WS_2","Volume_OD_3","Volume_WS_3")
colnames(rel.Volume_unmod)[3:10] <- cols
# divide by column 3 (extracted with [[3]] so that it is a plain vector)
rel.Volume_unmod[3:10] <- lapply(rel.Volume_unmod[3:10], `/`, rel.Volume_unmod[[3]])
rel.Volume_unmod
Explanation
lapply is a base R function that allows you to apply a function to every item of a list or a "listable" object.
In this case rel.Volume_unmod is a listable object: a dataframe is just a list of vectors with the same length. Therefore, lapply takes one column [= one item] at a time and applies a function.
The function is /. You usually see / used like this: A / B, but actually / is a Primitive function. You could write the same thing in this way:
`/`(A, B) # same as A / B
lapply can be provided with additional parameters that are passed directly to the function that is being applied over the list (in this case /). Therefore, we pass rel.Volume_unmod[[3]] (column 3 as a plain vector) as the additional parameter. With single brackets, rel.Volume_unmod[3] would be a one-column dataframe, and each result would itself be a dataframe rather than a vector.
lapply always returns a list. But, since we are assigning the result of lapply to a "fraction of a dataframe", we just edit the columns of the dataframe and, as a result, we have a dataframe instead of a list. Let me rephrase in a more technical way. When you write rel.Volume_unmod[3:10] <- lapply(...), you are not simply assigning a list to rel.Volume_unmod[3:10]. You are technically using this assignment function: [<-. This is a function that allows you to edit the items in a list/vector/dataframe. Specifically, [<- allows you to assign new items without modifying the attributes of the list/vector/dataframe. As I said before, a dataframe is just a list with specific attributes. So when you use [<- you modify the columns, but you leave the attributes (the class data.frame in this case) untouched. That's why the magic works.
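A small sketch of the point about [<-, using a made-up three-row dataframe (the names df and scaled are only for illustration). The result of lapply is a plain list, yet after assigning it into a slice of the dataframe the object is still a dataframe:

```r
# Toy data: one label column and two numeric columns
df <- data.frame(id = c("a", "b", "c"),
                 v1 = c(2, 4, 8),
                 v2 = c(1, 2, 4))

# lapply returns a plain list of numeric vectors
scaled <- lapply(df[2:3], `/`, df[[2]])
class(scaled)

# Assigning into a slice of the dataframe goes through `[<-`,
# which replaces the columns and leaves the class untouched
df[2:3] <- scaled
class(df)
```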
Without a minimal working example it's hard to guess what the variable Volume actually refers to. Apart from that, there seems to be a problem with your for-loop:
for(i in Volume[,3:10]){
Assuming Volume refers to a data.frame or tibble, this causes the actual column-vectors with indices between 3 and 10 to be assigned to i successively. You can verify this by putting print(i) inside the loop. But inside the loop it seems like you actually want to use i as a variable containing just the index of the current column as a number (not the column itself):
rel.Volume[i] = tibble(Volume = Volume[[i]] / Volume[[3]])
Also, two brackets are usually used with lists, not data.frames or tibbles. (You can, however, do so, because data.frames are special cases of lists.)
Last but not least, initialising rel.Volume with NULL does not tell R what rel.Volume should be, so assigning into it inside the loop will not build the tibble you expect.
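To see the difference between looping over the columns and looping over their positions, here is a small sketch with a made-up two-column dataframe (df, by_value and by_index are illustrative names only):

```r
df <- data.frame(a = 1:2, b = 3:4)

# Looping over the dataframe itself yields the column vectors
by_value <- list()
for (col in df) {
  by_value[[length(by_value) + 1]] <- col
}

# Looping over seq_along() yields the column positions
by_index <- integer(0)
for (i in seq_along(df)) {
  by_index <- c(by_index, i)
}

by_value  # the columns 1:2 and 3:4
by_index  # the positions 1 and 2
```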
Try this, if you like (thanks @Edo for the example data):
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
GR = sample(LETTERS, 30, TRUE),
Vol1 = rnorm(30),
Vol2 = rnorm(30),
Vol3 = rnorm(30))
rel.Volume <- Volume[1:2] # Assuming you want to keep the IDs.
# Your data.frame will need to have the correct number of rows here already.
for (i in 3:ncol(Volume)){ # ncol gives the total number of columns in data.frame
rel.Volume[i] = Volume[i]/Volume[3]
}
A more R-like approach would be to avoid using a for-loop altogether, since R's strength is implicit vectorization. These expressions will produce the same result without a loop:
# OK, this one messes up variable names...
rel.V.2 <- data.frame(sapply(X = Volume[3:5], FUN = function(x) x/Volume[3]))
rel.V.3 <- data.frame(Map(`/`, Volume[3:5], Volume[3]))
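Another loop-free variant, offered only as a sketch: dividing a dataframe by a vector whose length equals the number of rows divides every column by that vector, and the column names are kept. The name rel.V.4 is made up here, and the example data is the same as above:

```r
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
                     GR = sample(LETTERS, 30, TRUE),
                     Vol1 = rnorm(30),
                     Vol2 = rnorm(30),
                     Vol3 = rnorm(30))

# Volume[[3]] is a plain vector with one value per row, so each of the
# three numeric columns is divided by it element by element
rel.V.4 <- cbind(Volume[1:2], Volume[3:5] / Volume[[3]])
head(rel.V.4)
```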
Since you said you were new to R, frankly I would recommend avoiding the Tidyverse packages while you are still learning the basics. From my experience, in the long run you're better off learning base R first and adding the "sugar" when you're more familiar with the core language. You can still learn to use Tidyverse functions later (but then, why would anybody? ;-) ).
I have a data frame containing geographic positions. The positions are stored as strings.
This is my function to parse the strings and get the positions in decimal degrees.
Example position 23º 30.0'N
latitud.decimal <- function(y) {
latregex <- str_match(y,"(\\d+)º\\s(\\d*.\\d*).(.)")
latitud <- (as.numeric(latregex[1,2])) +((as.numeric(latregex[1,3])) / 60)
if (latregex[1,4]=="S") {latitud <- -1*latitud}
return(latitud)
}
Results> 23.5
Then I would like to create a new column in my original data frame by applying the function to every item in the Latitude column.
The same applies to the longitude, which needs another new column.
I know how to do this using Python and pandas, but I am a newbie in R and cannot find the solution.
I am trying with
lapply(datos$Latitude, 2 , FUN= latitud.decimal(y))
but it does not read the y "argument", which should be each value in the column.
Note that the str_match is vectorized as stated in the help page of the function help("str_match").
For the sake of answering the question, I lack a reproducible example and data. This page describes how to write questions that are easier to reproduce and thus get better answers.
As I lack data and code, I cannot test whether I am actually hitting the spot, but I will give it a shot anyway.
Since str_match is vectorized, we can apply the entire function without using lapply, and thus create a new column directly. I'll slightly rewrite your function to make use of the vectorization. Note the missing 1's in latregex[., .]
library(stringr)
latitud.decimal <- function(y) {
  # one row per input string: full match, degrees, minutes, hemisphere
  latregex <- str_match(y, "(\\d+)º\\s(\\d*.\\d*).(.)")
  latitud <- as.numeric(latregex[, 2]) + as.numeric(latregex[, 3]) / 60
  # flip the sign for southern latitudes
  which_south <- which(latregex[, 4] == "S")
  latitud[which_south] <- -latitud[which_south]
  latitud
}
Now that the function is ready, creating a column can be done using the $ operator. If the data is very large, this can be done more efficiently with the data.table package. See this Stack Overflow page for an example of how to assign via data.table.
In base R we would simply perform the action as
datos$new_column <- latitud.decimal(datos$Latitude)
datos$lat_decimal = sapply(datos$Latitude, latitud.decimal)
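For completeness, a base-R sketch of the same idea that avoids the stringr dependency and also handles longitudes. This swaps str_match for regmatches()/regexec(); the function name to_decimal, the pattern, and the sample rows are assumptions made for illustration, not part of the original code:

```r
# Hypothetical helper: converts strings like 23º 30.0'N to decimal degrees
to_decimal <- function(pos) {
  # degrees, any non-digit separator, minutes, optional quote, hemisphere letter
  pattern <- "^(\\d+)\\D+(\\d+\\.?\\d*)\\D*([NSEW])$"
  parts <- regmatches(pos, regexec(pattern, pos))
  vapply(parts, function(p) {
    if (length(p) == 0) return(NA_real_)   # string did not match
    value <- as.numeric(p[2]) + as.numeric(p[3]) / 60
    # south and west are negative by convention
    if (p[4] %in% c("S", "W")) -value else value
  }, numeric(1))
}

# Made-up sample data; \u00ba is the º character
datos <- data.frame(
  Latitude  = c("23\u00ba 30.0'N", "10\u00ba 15.0'S"),
  Longitude = c("45\u00ba 45.0'W", "120\u00ba 6.0'E"),
  stringsAsFactors = FALSE
)

# The function takes a whole column, so each new column is one assignment
datos$lat_decimal <- to_decimal(datos$Latitude)
datos$lon_decimal <- to_decimal(datos$Longitude)
datos
```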
I am currently trying to make my code dryer by rewriting some parts with the help of functions. One of the functions I am using is:
datasetperuniversity<-function(university,year){assign(paste("data",university,sep=""),subset(get(paste("originaldata",year,sep="")),get(paste("allcollaboration",university,sep=""))==1))}
Executing the function datasetperuniversity("Harvard","2000") would result within the function in something like this:
dataHarvard=subset(originaldata2000,allcollaborationHarvard==1)
The function runs nearly perfectly, except that it does not store the results in dataHarvard. I read that this is normal in functions, and that using <<- instead of = could solve this issue. However, since I am using the assign function this is not really possible, because the = is just the outcome of the assign function.
Here some data:
sales = c(2, 3, 5,6)
numberofemployees = c(1, 9, 20,12)
allcollaborationHarvard = c(0, 1, 0,1)
originaldata = data.frame(sales, numberofemployees, allcollaborationHarvard)
Generally, it's best not to embed data/a variable into the name of an object. So instead of using assign to dataHarvard, make a list data with an element called "Harvard":
# enumerate unis, attaching names for lapply to use
unis = setNames(, "Harvard")
# make a table for each subset with lapply
data = lapply(unis, function(x)
originaldata[originaldata[[ paste0("allcollaboration", x) ]] == 1, ]
)
which gives
> data
$Harvard
  sales numberofemployees allcollaborationHarvard
2     3                 9                       1
4     6                12                       1
As seen here, you can use DF[["column name"]] to access a column instead of get as in the OP. Also, see the note in ?subset:
Warning
This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
Generally, it's also better not to embed data in column names if possible. If the allcollaboration* columns are mutually exclusive, they can be collapsed to a single categorical variable with values like "Harvard", "Yale", etc. Alternately, it might make sense to put the data in long form.
For more guidance on arranging data, I recommend Hadley Wickham's tidy data paper.
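If you still want a function, a minimal sketch is to have it return the subset and do the assignment at the call site, which sidesteps assign and <<- entirely. The helper name subset_by_university is hypothetical, and the data frame is the sample data from the question under the name originaldata2000:

```r
originaldata2000 <- data.frame(
  sales = c(2, 3, 5, 6),
  numberofemployees = c(1, 9, 20, 12),
  allcollaborationHarvard = c(0, 1, 0, 1)
)

# Hypothetical helper: returns the rows flagged for one university
subset_by_university <- function(df, university) {
  flag <- df[[paste0("allcollaboration", university)]]
  df[flag == 1, ]
}

# The caller decides where the result is stored
dataHarvard <- subset_by_university(originaldata2000, "Harvard")
dataHarvard
```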
Or how to split a vector into pairs of contiguous members and combine them in a list?
Suppose you are given the vector
map <-seq(from = 1, to = 20, by = 4)
which is
1 5 9 13 17
My goal is to create the following list
path <- list(c(1,5), c(5,9), c(9,13), c(13,17))
This is supposed to represent the path segments that the map is suggesting we follow. In order to go from 1 to 17, we must first take the first path (path[1]), then the second path (path[2]), and so on all the way to the end.
My first attempt led me to:
path <- split(aux <- data.frame(S = map[-length(map)], E = map[-1]), row(aux))
But I think it should be possible without creating this auxiliary data frame, avoiding the performance decrease when the initial vector (the map) is too big. Also, it returns a warning message, which is quite alright, but I would like to avoid it.
Then I found this here on Stack Overflow (not exactly like this; this is the version adapted for my problem):
mod_map <- c(map, map[c(-1,-length(map))])
mod_map <- sort(mod_map)
split(mod_map, ceiling(seq_along(mod_map)/2))
which is a simpler solution, but I have to use this modified version of my map.
Perhaps I'm asking too much, as I already have two solutions. But could there be a third one, so that I don't have to use data frames as in my first solution and can use the original map, unlike my second solution?
We can use Map on two copies of the vector, one with the last element removed and one with the first element removed, and concatenate them elementwise. (It is better not to name the vector 'map', since that is also the name of a function from purrr.)
Map(c, map[-length(map)], map[-1])
Or, as @Sotos mentioned, split can be used, which would be faster
split(cbind(map[-length(map)], map[-1]), seq(length(map)-1))
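A quick check, as a sketch, that both approaches build the same segments for the vector from the question (path_map and path_split are illustrative names). The only difference is that split() names the list elements "1" to "4":

```r
map <- seq(from = 1, to = 20, by = 4)

# Pair each element with its successor
path_map <- Map(c, map[-length(map)], map[-1])

# Same pairs via split(): the matrix is read column by column, and the
# grouping vector 1..4 is recycled, so start i and end i share a group
path_split <- split(cbind(map[-length(map)], map[-1]), seq(length(map) - 1))

path_map[[1]]                             # first segment: 1 5
all.equal(unname(path_split), path_map)   # same segments, ignoring names
```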
I am trying to put together a function that will loop through a given data frame in blocks and return a new data frame containing values calculated from the original. The length of x will be different each time, and the actual problem will have more loops in the function. I am new-ish to R and have not been able to find anything helpful (I don't think using a list will help).
func<-function(x){
tmp # need to declare this here?
for (i in 1:dim(x)[1]){
tmp[i]<-ave(x[i,]) # add things to it
}
return(tmp)
}
df<-cbind(rnorm(10),rnorm(10))
means<-func(df)
This code does not work but I hope it gets across what I want to do. thanks!
Do you mean you want to loop through each row of df and return a data frame with the calculated values?
You may want to look into the apply function:
df <- cbind(rnorm(10),rnorm(10))
# apply(df,1,FUN) does FUN(df[i,])
# e.g. mean of each row:
apply(df,1,mean)
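If you do want to keep an explicit loop, a minimal sketch is to create the result vector before the loop and fill it in. The function name row_means is made up here, and mean is used in place of ave on the assumption that a per-row mean is what is wanted:

```r
row_means <- function(x) {
  tmp <- numeric(nrow(x))          # declare the result with the right length
  for (i in seq_len(nrow(x))) {
    tmp[i] <- mean(x[i, ])         # fill in one value per row
  }
  tmp
}

set.seed(1)
m <- cbind(rnorm(10), rnorm(10))
means <- row_means(m)
all.equal(means, apply(m, 1, mean))   # same values as the apply() call
```

Using seq_len(nrow(x)) instead of 1:dim(x)[1] also keeps the loop from running when x has zero rows.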
For more complicated looping like performing some operation on a per-factor basis, I strongly recommend package plyr, and function ddply within. Quick example:
library(plyr)
df <- data.frame(gender = c('M','M','F','F'), height = c(183,176,157,168))
# find mean height *per gender*
ddply(df, .(gender), function(x) c(height = mean(x$height)))
# returns:
#   gender height
# 1      F  162.5
# 2      M  179.5
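If you would rather not add a package, the same per-group mean can be sketched in base R with aggregate(). This is an alternative to ddply, not what the answer above uses, and the name mean_height is made up:

```r
df <- data.frame(gender = c("M", "M", "F", "F"),
                 height = c(183, 176, 157, 168))

# Formula interface: summarise height within each level of gender
mean_height <- aggregate(height ~ gender, data = df, FUN = mean)
mean_height
```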