I'm trying to replicate a solution for applying multiple functions in sapply, posted on R-Bloggers, but I can't get it to work in the desired manner. I'm working with a simple data set, similar to the one generated below:
require(datasets)
crs_mat <- cor(mtcars)
# Triangle function
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)] <- NA
return(cormat)
}
require(reshape2)
crs_mat <- melt(get_upper_tri(crs_mat))
I would like to replace some text values across columns Var1 and Var2. The erroneous syntax below illustrates what I am trying to achieve:
crs_mat[,1:2] <- sapply(crs_mat[,1:2], function(x) {
# Replace first phrase
gsub("mpg","MPG",x),
# Replace second phrase
gsub("gear", "GeArr",x)
# Ideally, perform other changes
})
Naturally, the code is not syntactically correct and fails. To summarise, I would like to do the following:
Go through all the values in first two columns (Var1 and Var2) and perform simple replacements via gsub.
Ideally, I would like to avoid defining a separate function, as discussed in the linked post and keep everything within the sapply syntax
I don't want a nested loop
I had a look at the broadly similar subject discussed here and here but, if possible, I would like to avoid making use of plyr. I'm also interested in replacing the column values not in creating new columns and I would like to avoid specifying any column names. While working with my existing data frame it is more convenient for me to use column numbers.
Edit
Following very useful comments, what I'm trying to achieve can be summarised in the solution below:
fun.clean.columns <- function(x, str_width = 15) {
# Make character
x <- as.character(x)
# Replace various phrases
x <- gsub("perc85","something else", x)
x <- gsub("again", x)
x <- gsub("more","even more", x)
x <- gsub("abc","ohmg", x)
# Clean spaces
x <- trimws(x)
# Wrap strings
x <- stringr::str_wrap(x, width = str_width) # str_wrap comes from the stringr package
# Return object
return(x)
}
mean_data[,1:2] <- sapply(mean_data[,1:2], fun.clean.columns)
I don't need this function in my global environment, so I can run rm after this, but an even nicer solution would involve squeezing this into the sapply call itself.
We can use mgsub from library(qdap) to replace multiple patterns. Here, I am looping over the first and second columns using lapply and assigning the results back to crs_mat[,1:2]. Note that I am using lapply instead of sapply, as lapply keeps the structure intact.
library(qdap)
crs_mat[,1:2] <- lapply(crs_mat[,1:2], mgsub,
pattern=c('mpg', 'gear'), replacement=c('MPG', 'GeArr'))
Here is a start of a solution for you, I think you're capable of extending it yourself. There's probably more elegant approaches available, but I don't see them atm.
crs_mat[,1:2] <- sapply(crs_mat[,1:2], function(x) {
# Replace first phrase
step1 <- gsub("mpg","MPG",x)
# Replace second phrase. Note that this operates on the already modified vector (step1).
step2 <- gsub("gear", "GeArr",step1)
# Ideally, perform other changes
return(step2)
#or one nested line, not practical if more needs to be done
#return(gsub("gear", "GeArr",gsub("mpg","MPG",x)))
})
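If the list of replacements grows, the step-by-step chaining above can be folded into a single anonymous function with Reduce. This is only a sketch: the small crs_mat below is a stand-in for the melted correlation table, the replacements vector is an illustrative name, and fixed = TRUE is an assumption that the patterns are literal strings rather than regular expressions.

```r
# Named vector: names are the patterns, values are the replacements.
replacements <- c(mpg = "MPG", gear = "GeArr")

# Small stand-in for the melted correlation table from the question.
crs_mat <- data.frame(
  Var1 = c("mpg", "gear", "cyl"),
  Var2 = c("gear", "mpg", "wt"),
  value = c(1, 0.48, -0.85),
  stringsAsFactors = FALSE
)

crs_mat[, 1:2] <- lapply(crs_mat[, 1:2], function(x) {
  # Fold over the patterns, feeding each result into the next gsub call.
  Reduce(function(acc, pat) gsub(pat, replacements[[pat]], acc, fixed = TRUE),
         names(replacements), as.character(x))
})
```

Because the replacements are applied in order, a later pattern can match text produced by an earlier one, just as in the step1/step2 version.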
I have a tibble called 'Volume' in which I store some data (10 columns - the first 2 columns are characters, 30 rows).
Now I want to calculate the relative Volume of every column that corresponds to Column 3 of my tibble.
My current solution looks like this:
rel.Volume_unmod = tibble(
"Volume_OD" = Volume[[3]] / Volume[[3]],
"Volume_Imp" = Volume[[4]] / Volume[[3]],
"Volume_OD_1" = Volume[[5]] / Volume[[3]],
"Volume_WS_1" = Volume[[6]] / Volume[[3]],
"Volume_OD_2" = Volume[[7]] / Volume[[3]],
"Volume_WS_2" = Volume[[8]] / Volume[[3]],
"Volume_OD_3" = Volume[[9]] / Volume[[3]],
"Volume_WS_3" = Volume[[10]] / Volume[[3]])
rel.Volume_unmod
I would like to keep the tibble structure and the labels. I am sure there is a better solution for this, but I am relatively new to R, so it's not obvious to me. What I tried is something like this, but I can't actually run it:
rel.Volume = NULL
for(i in Volume[,3:10]){
rel.Volume[i] = tibble(Volume = Volume[[i]] / Volume[[3]])
}
Mockup Data
Since you did not provide some data, I've followed the description you provided to create some mockup data. Here:
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
GR = sample(LETTERS, 30, TRUE))
Volume[3:10] <- rnorm(30*8)
Solution with Dplyr
library(dplyr)
# rename columns [brute force]
cols <- c("Volume_OD","Volume_Imp","Volume_OD_1","Volume_WS_1","Volume_OD_2","Volume_WS_2","Volume_OD_3","Volume_WS_3")
colnames(Volume)[3:10] <- cols
# divide by Volume_OD
rel.Volume_unmod <- Volume %>%
mutate(across(all_of(cols), ~ . / Volume_OD))
# result
rel.Volume_unmod
Explanation
I don't know the names of your columns. Probably, the names correspond to the names of the columns you intended to create in rel.Volume_unmod. Anyhow, to avoid any problem I renamed the columns (kinda brutally). You can do it with dplyr::rename if you want to.
There are many ways to select the columns you want to mutate. mutate is a verb from dplyr that allows you to create new columns or perform operations or functions on columns.
across is an adverb from dplyr. Let's simplify by saying that it's a function that allows you to perform a function over multiple columns. In this case I want to perform a division by Volume_OD.
~ is a tidyverse way to create anonymous functions. ~ . / Volume_OD is equivalent to function(x) x / Volume_OD
all_of is necessary because in this specific case I'm providing across with a vector of characters. Without it, it will work anyway, but you will receive a warning because it's ambiguous and it may work incorrectly in some cases.
More info
Check out this book to learn more about data manipulation with tidyverse (which dplyr is part of).
Solution with Base-R
rel.Volume_unmod <- Volume
# rename columns
cols <- c("Volume_OD","Volume_Imp","Volume_OD_1","Volume_WS_1","Volume_OD_2","Volume_WS_2","Volume_OD_3","Volume_WS_3")
colnames(rel.Volume_unmod)[3:10] <- cols
# divide by column 3
rel.Volume_unmod[3:10] <- lapply(rel.Volume_unmod[3:10], `/`, rel.Volume_unmod[[3]])
rel.Volume_unmod
Explanation
lapply is a base R function that allows you to apply a function to every item of a list or a "listable" object.
In this case rel.Volume_unmod is a listable object: a dataframe is just a list of vectors with the same length. Therefore, lapply takes one column [= one item] at a time and applies a function.
The function is /. You usually see / used like this: A / B, but actually / is a primitive function. You could write the same thing in this way:
`/`(A, B) # same as A / B
lapply can be provided with additional parameters that are passed directly to the function that is being applied over the list (in this case /). Therefore, we are passing rel.Volume_unmod[[3]] as an additional parameter. The double brackets matter: rel.Volume_unmod[[3]] extracts the column as a plain vector, whereas rel.Volume_unmod[3] would be a one-column dataframe, and dividing by that would return dataframes instead of vectors.
lapply always returns a list. But, since we are assigning the result of lapply to a "fraction of a dataframe", we will just edit the columns of the dataframe and, as a result, we will have a dataframe instead of a list. Let me rephrase in a more technical way. When you are assigning rel.Volume_unmod[3:10] <- lapply(...), you are not simply assigning a list to rel.Volume_unmod[3:10]. You are technically using this assignment function: [<-. This is a function that allows you to edit the items in a list/vector/dataframe. Specifically, [<- allows you to assign new items without modifying the attributes of the list/vector/dataframe. As I said before, a dataframe is just a list with specific attributes. So when you use [<- you modify the columns, but you leave the attributes (the class data.frame in this case) untouched. That's why the magic works.
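A minimal sketch of that last point, using a made-up two-column dataframe (df and ratios are illustrative names, not taken from the question): lapply on its own gives a plain list, while assigning through [<- leaves the data.frame class in place.

```r
df <- data.frame(id = c("a", "b"), v1 = c(2, 4), v2 = c(6, 8))

# lapply returns a plain list of vectors, one per column.
ratios <- lapply(df[2:3], `/`, df[[2]])
class(ratios)

# Assigning into a slice of the dataframe goes through `[<-`,
# so only the columns change and df keeps its class.
df[2:3] <- ratios
class(df)
```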
Without a minimal working example it's hard to guess what the variable Volume actually refers to. Apart from that, there seems to be a problem with your for-loop:
for(i in Volume[,3:10]){
Assuming Volume refers to a data.frame or tibble, this causes the actual column-vectors with indices between 3 and 10 to be assigned to i successively. You can verify this by putting print(i) inside the loop. But inside the loop it seems like you actually want to use i as a variable containing just the index of the current column as a number (not the column itself):
rel.Volume[i] = tibble(Volume = Volume[[i]] / Volume[[3]])
Also, two brackets are usually used with lists, not data.frames or tibbles. (You can, however, do so, because data.frames are special cases of lists.)
Last but not least, initialising the variable rel.Volume with NULL does not tell R what rel.Volume should be, so assigning into it by index will not give you a tibble (at best you end up with a plain list).
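To see the difference between looping over the columns themselves and looping over column indices, here is a small sketch on a made-up dataframe (df, seen_values and seen_index are illustrative names):

```r
df <- data.frame(a = 1:2, b = 3:4)

# Looping over the dataframe hands you each column vector in turn.
seen_values <- list()
for (col in df) {
  seen_values[[length(seen_values) + 1]] <- col
}

# Looping over seq_along(df) hands you the column positions instead.
seen_index <- integer(0)
for (i in seq_along(df)) {
  seen_index <- c(seen_index, i)
}
```

Only the second form gives a value that can be used as Volume[[i]].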
Try this, if you like (thanks @Edo for example data):
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
GR = sample(LETTERS, 30, TRUE),
Vol1 = rnorm(30),
Vol2 = rnorm(30),
Vol3 = rnorm(30))
rel.Volume <- Volume[1:2] # Assuming you want to keep the IDs.
# Your data.frame will need to have the correct number of rows here already.
for (i in 3:ncol(Volume)){ # ncol gives the total number of columns in data.frame
rel.Volume[i] = Volume[i]/Volume[3]
}
A more R-like approach would be to avoid using a for-loop altogether, since R's strength is implicit vectorization. These expressions will produce the same result without a loop:
# OK, this one messes up variable names...
rel.V.2 <- data.frame(sapply(X = Volume[3:5], FUN = function(x) x/Volume[3]))
rel.V.3 <- data.frame(Map(`/`, Volume[3:5], Volume[3]))
Since you said you were new to R, frankly I would recommend avoiding the Tidyverse packages while you are still learning the basics. From my experience, in the long run you're better off learning base R first and adding the "sugar" when you're more familiar with the core language. You can still learn to use Tidyverse functions later (but then, why would anybody? ;-) ).
I have a list of data frames. I want to use lapply on a specific column for each of those data frames, but I keep throwing errors when I tried methods from similar answers:
The setup is something like this:
a <- list(*a series of data frames that each have a column named DIM*)
dim_loc <- lapply(1:length(a), function(x){paste0("a[[", x, "]]$DIM")})
Eventually, I'll want to write something like results <- lapply(dim_loc, *some function on the DIMs*)
However, when I try get(dim_loc[[1]]), say, I get an error: Error in get(dim_loc[[1]]) : object 'a[[1]]$DIM' not found
But I can return values from function(a[[1]]$DIM) all day long. It's there.
I've tried working around this by using as.name() in the dim_loc assignment, but that doesn't seem to do the trick either.
I'm curious 1. what's up with get(), and 2. if there's a better solution. I'm constraining myself to the apply family of functions because I want to try to get out of the for-loop habit, and this name-as-list method seems to be preferred based on something like R- how to dynamically name data frames?, but I'd be interested in other, more elegant solutions, too.
I'd say that if you want to modify an object in place you are better off using a for loop, since lapply would require the <<- assignment operator (<- inside lapply only changes a local copy). Like so:
set.seed(1)
aList <- list(cars = mtcars, iris = iris)
for(i in seq_along(aList)){
aList[[i]][["newcol"]] <- runif(nrow(aList[[i]]))
}
As opposed to...
invisible(
lapply(seq_along(aList), function(x){
aList[[x]][["newcol"]] <<- runif(nrow(aList[[x]]))
})
)
You have to use invisible(), otherwise lapply would print the output on the console. The <<- assigns the vector runif(...) to the newly created column.
If you want to produce another set of data.frames using lapply then you do:
lapply(seq_along(aList), function(x){
aList[[x]][["newcol"]] <- runif(nrow(aList[[x]]))
return(aList[[x]])
})
Also, may I suggest the use of seq_along(list) in lapply and for loops as opposed to 1:length(list) since it avoids unexpected behavior such as:
# no length list
seq_along(list()) # prints integer(0)
1:length(list()) # prints 1 0.
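As for the get() error in the question: get() looks up a single object by its name, so it searches for an object literally called 'a[[1]]$DIM' rather than evaluating that text as an expression. The column can be pulled out of each data frame directly instead. A minimal sketch, with a made-up list a and mean standing in for "some function on the DIMs":

```r
# Stand-in for the list of data frames that each have a DIM column.
a <- list(
  data.frame(DIM = 1:3, other = 4:6),
  data.frame(DIM = c(10, 20), other = c(0, 1))
)

# Extract the DIM column from every data frame; `[[` is an ordinary function.
dims <- lapply(a, `[[`, "DIM")

# Apply some function to each extracted column (mean is just an example).
results <- lapply(dims, mean)
```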
I want to take an average for each row across different data frames. Does anyone know of a more clever way to do this using apply statements? Sorry for the wall of code.
You would need a vector of 1000:1006 for each hiXXXX file and then a vector 2:13 for the columns. I have used mapply for something weird like this before so maybe that could do it somehow?
for (i in 1:nrow(subavg)) {
subavg[i,c(2)] <- mean(c(hi1000[i,c(2)],hi1001[i,c(2)],hi1002[i,c(2)],hi1003[i,c(2)],hi1004[i,c(2)],hi1005[i,c(2)],hi1006[i,c(2)]))
subavg[i,c(3)] <- mean(c(hi1000[i,c(3)],hi1001[i,c(3)],hi1002[i,c(3)],hi1003[i,c(3)],hi1004[i,c(3)],hi1005[i,c(3)],hi1006[i,c(3)]))
subavg[i,c(4)] <- mean(c(hi1000[i,c(4)],hi1001[i,c(4)],hi1002[i,c(4)],hi1003[i,c(4)],hi1004[i,c(4)],hi1005[i,c(4)],hi1006[i,c(4)]))
subavg[i,c(5)] <- mean(c(hi1000[i,c(5)],hi1001[i,c(5)],hi1002[i,c(5)],hi1003[i,c(5)],hi1004[i,c(5)],hi1005[i,c(5)],hi1006[i,c(5)]))
subavg[i,c(6)] <- mean(c(hi1000[i,c(6)],hi1001[i,c(6)],hi1002[i,c(6)],hi1003[i,c(6)],hi1004[i,c(6)],hi1005[i,c(6)],hi1006[i,c(6)]))
subavg[i,c(7)] <- mean(c(hi1000[i,c(7)],hi1001[i,c(7)],hi1002[i,c(7)],hi1003[i,c(7)],hi1004[i,c(7)],hi1005[i,c(7)],hi1006[i,c(7)]))
subavg[i,c(8)] <- mean(c(hi1000[i,c(8)],hi1001[i,c(8)],hi1002[i,c(8)],hi1003[i,c(8)],hi1004[i,c(8)],hi1005[i,c(8)],hi1006[i,c(8)]))
subavg[i,c(9)] <- mean(c(hi1000[i,c(9)],hi1001[i,c(9)],hi1002[i,c(9)],hi1003[i,c(9)],hi1004[i,c(9)],hi1005[i,c(9)],hi1006[i,c(9)]))
subavg[i,c(10)] <- mean(c(hi1000[i,c(10)],hi1001[i,c(10)],hi1002[i,c(10)],hi1003[i,c(10)],hi1004[i,c(10)],hi1005[i,c(10)],hi1006[i,c(10)]))
subavg[i,c(11)] <- mean(c(hi1000[i,c(11)],hi1001[i,c(11)],hi1002[i,c(11)],hi1003[i,c(11)],hi1004[i,c(11)],hi1005[i,c(11)],hi1006[i,c(11)]))
subavg[i,c(12)] <- mean(c(hi1000[i,c(12)],hi1001[i,c(12)],hi1002[i,c(12)],hi1003[i,c(12)],hi1004[i,c(12)],hi1005[i,c(12)],hi1006[i,c(12)]))
subavg[i,c(13)] <- mean(c(hi1000[i,c(13)],hi1001[i,c(13)],hi1002[i,c(13)],hi1003[i,c(13)],hi1004[i,c(13)],hi1005[i,c(13)],hi1006[i,c(13)]))
}
As there are only 7 datasets, we can pass them as arguments to Map, cbind the corresponding columns, and get the rowMeans
Map(function(...) rowMeans(cbind(...)), hi1000, hi1001, hi1002, hi1003,
hi1004, hi1005, hi1006)
Or use + with Reduce after getting the datasets in a list and then divide by the total number of datasets, i.e. 7
Reduce(`+`, mget(paste0("hi", 1000:1006)))/7
The second solution is more compact, but if we have NAs in the dataset, it is better to use the first one, as rowMeans has an na.rm argument. By default it is FALSE, but we can set it to TRUE.
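The difference between the two approaches shows up on a small example with a missing value. This is a sketch under assumptions: hi_list is a made-up list of three tiny data frames standing in for hi1000 to hi1006, and do.call is used only so that the list can be passed to Map in one go.

```r
hi_list <- list(
  data.frame(x = c(1, 2), y = c(10, 20)),
  data.frame(x = c(3, 4), y = c(30, NA)),
  data.frame(x = c(5, 6), y = c(50, 60))
)

# Reduce approach: element-wise sum divided by the number of datasets.
# Any NA in a cell propagates into the result.
avg_reduce <- Reduce(`+`, hi_list) / length(hi_list)

# Map approach: bind matching columns side by side and average each row,
# skipping NAs via na.rm = TRUE.
row_avg <- function(...) rowMeans(cbind(...), na.rm = TRUE)
avg_map <- as.data.frame(do.call(Map, c(list(f = row_avg), hi_list)))
```

Here avg_reduce has an NA in the second row of y, while avg_map averages the two available values for that cell.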
I have a data.frame mapping which contains path and map.
I also have another data.frame DATA which contains the raw path and value.
EDIT: Path might have two components or more: e.g. "A>C" or "A>C>B"
set.seed(24);
DATA <- data.frame(
path=paste0(sample(LETTERS[1:3], 25, replace=TRUE), ">", sample(LETTERS[1:3], 25, replace=TRUE)),
value=rnorm(25)
)
mapping <- data.frame(path=c("A","B","C"), map=c("X","Y","Z"))
lapply(mapping, function (x) {
for (i in 1:nrow(DATA)) {
DATA$path[i] <- gsub(as.character(x["path"]),as.character(x["map"]),as.character(DATA$path[i]))
}
})
I'm trying to replace the path in DATA with the map value in mapping but this doesn't seem to be working for me.
"A>C" will be converted to "X>Z".
I understand that for loops are not good in R, but I can't think of another way to code it. Data size I'm working with is 6m row in DATA and 16k rows in mapping.
Clarification on Data: While the path consists of alphabets (ABC) now, the real path are actually domain names. Number of steps in a path is also not fixed at 2 and can be any number.
You can use chartr
DATA$path <- chartr('ABC', 'XYZ', DATA$path)
Or if we are using the data from 'mapping'
DATA$path <- chartr(paste(mapping$path, collapse=''),
paste(mapping$map, collapse=''), DATA$path)
Or using gsubfn
library(gsubfn)
pat <- paste0('[', paste(mapping$path, collapse=''),']')
indx <- setNames(as.character(mapping$map), mapping$path)
gsubfn(pat, as.list(indx), as.character(DATA$path))
Or a base R option based on @smci's comment
vapply(strsplit(as.character(DATA$path), '>'), function(x)
paste(indx[x], collapse=">"), character(1L))
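Since chartr translates single characters only, the split-and-lookup variant is the one that carries over to multi-character steps such as domain names and to paths of any length. A self-contained sketch of it, with paths and lookup as illustrative names and fixed = TRUE assuming ">" is always the literal separator:

```r
mapping <- data.frame(path = c("A", "B", "C"), map = c("X", "Y", "Z"),
                      stringsAsFactors = FALSE)
paths <- c("A>C", "A>C>B", "B")

# Named vector used as a lookup table: names are old steps, values are new ones.
lookup <- setNames(mapping$map, mapping$path)

mapped <- vapply(
  strsplit(paths, ">", fixed = TRUE),                        # split into steps
  function(parts) paste(lookup[parts], collapse = ">"),      # translate and rejoin
  character(1L)
)
```

A step that is missing from the lookup table comes back as NA and is pasted as the text "NA", so unmatched steps would need separate handling.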
Using data.table (1.9.5+), especially advisable b/c of the size of your data.
library(data.table)
setDT(DATA); setDT(mapping)
DATA[,paste0("path",1:2):=tstrsplit(path,split=">")]
setkey(DATA,path1)[mapping,new.path1:=i.map]
setkey(DATA,path2)[mapping,new.path2:=i.map]
DATA[,new.path:=paste0(new.path1,">",new.path2)]
If you want to get rid of the extra columns:
DATA[,paste0(c("","","new.","new."),"path",rep(1:2,2)):=NULL]
If you just want to overwrite path, use path on the LHS of the last line instead of new.path.
This could also be written more concisely:
library(data.table)
setDT(mapping)
setkey(setkey(setDT(DATA)[,paste0("path",1:2):=tstrsplit(path,split=">")
],path1)[mapping,new.path1:=i.map],path2
)[mapping,new.path:=paste0(new.path1,">",i.map)]
I think you're using the wrong apply.
mapply allows you to use two arguments to the function, here the path and the map. Note that in mapply, the argument FUN comes first. You also do not need to do this row by row, you can just do the entire column at once. Finally, in an apply the variables do not get updated as they do in a for loop, so you need to assign them in the .GlobalEnv. You can do this with an explicit call to assign() or using <<-, which walks up the enclosing environments and assigns to the first existing variable of that name it finds. In this case, that will be back in .GlobalEnv.
After defining mapping and DATA as you do above, try this.
head(DATA)
invisible(mapply( function (x,y) {
DATA$path <<- gsub(x,y,DATA$path)
},mapping$path, mapping$map))
head(DATA)
Note that the call to invisible suppresses output from mapply.
If you really want to use lapply, you can. But you need to transpose mapping. You can do that but it will be converted to a matrix, so you have to convert it back. Then, you can just use the same tricks with <<- and not using a for loop as above to get this code:
invisible(lapply(as.data.frame(t(mapping)), function (x) {
DATA$path <<- gsub(x[1],x[2],DATA$path)
}))
head(DATA)
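For comparison, the same sequential replacement can be written as a plain loop over the rows of mapping, which avoids <<- altogether. This is a sketch on made-up two-row data, with fixed = TRUE as an assumption that the paths are literal strings:

```r
DATA <- data.frame(path = c("A>C", "B>A"), stringsAsFactors = FALSE)
mapping <- data.frame(path = c("A", "B", "C"), map = c("X", "Y", "Z"),
                      stringsAsFactors = FALSE)

# One vectorised gsub over the whole column per mapping row.
for (k in seq_len(nrow(mapping))) {
  DATA$path <- gsub(mapping$path[k], mapping$map[k], DATA$path, fixed = TRUE)
}
```

As with the mapply version, a later replacement can match text inserted by an earlier one, so this only behaves well when the map values do not themselves contain any of the paths.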
Thanks for sharing, I learned a lot answering this question.
I've been learning R for my project and have been unable to google a solution to my current problem.
I have ~ 100 csv files and need to perform the exact same set of operations across them. I've read them in as separate objects (which I assume is probably improper R style) but I've been unable to write a function that can loop through them. Each csv is a dataframe that contains information, including a column with dates in decimal year form. I need to create 2 new columns containing year and day of year. I've figured out how to do it manually, but I would like to find a way to automate the process. Here's what I've been doing:
#setup
library(lubridate) #Used to check for leap years
df.00 <- data.frame( site = seq(1:10), date = runif(10,1980,2000 ))
#what I need done
df.00$doy <- NA # make an empty column which I'm going to place the day of the year
df.00$year <- floor(df.00$date) # grabs the year from the date column
df.00$dday <- df.00$date - df.00$year # get the year fraction. intermediate step.
# multiply the fraction year by 365 or 366 if it's a leap year to give me the day of the year
df.00$doy[which(leap_year(df.00$year))] <- round(df.00$dday[which(leap_year(df.00$year))] * 366)
df.00$doy[which(!leap_year(df.00$year))] <- round(df.00$dday[which(!leap_year(df.00$year))] * 365)
The above, while inelegant, does what I would like it to. However, I need to do this to the other data frames, df.01 - df.99. So far I've been unable to place it in a function or for loop. If I place it into a function:
funtest <- function(x) {
x$doy <- NA
}
funtest(df.00) does nothing. Which is what I would expect from my understanding of how functions work in r but if I wrap it up in a for loop:
for(i in c(df.00)) {
i$doy <- NA }
I get "In i$doy <- NA : Coercing LHS to a list" several times, which tells me that the loop isn't treating the dataframe as a single unit but is perhaps looking at each column in the frame.
I would really appreciate some insight on what I should be doing. I feel that I could have solved this easily using bash and awk but I would like to be less incompetent using r
the most efficient and direct way is to use a list.
Put all of your CSV's into one folder
grab a list of the files in that folder
eg: files <- dir('path/to/folder', full.names=TRUE)
iteratively read in all those files into a list of data.frames
eg: df.list <- lapply(files, read.csv, <additional args>)
apply your function iteratively over each data.frame
eg: lapply(df.list, myFunc, <additional args>)
Since your df's are already loaded, and they have nice convenient names, you can grab them easily using the following:
nms <- c(paste0("df.0", 0:9), paste0("df.", 10:99))
df.list <- lapply(nms, get)
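A variation on the same idea, offered as a sketch: sprintf can build the zero-padded names in one call, and mget returns a list that keeps those names. The environment env below is an assumption made only to keep the example self-contained; with the data frames in the global environment, mget(nms) on its own should do.

```r
# Two stand-in data frames stored under the question's naming scheme.
env <- new.env()
assign("df.00", data.frame(site = 1), envir = env)
assign("df.01", data.frame(site = 2), envir = env)

# "%02d" pads the number to two digits: df.00, df.01, ...
nms <- sprintf("df.%02d", 0:1)

# mget looks up several objects by name and returns them as a named list.
df.list <- mget(nms, envir = env)
```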
Then take everything you have in the #what I need done portion and put inside a function, eg:
myFunc <- function(DF) {
# what you want done to a single DF
return(DF)
}
And then lapply accordingly
df.list <- lapply(df.list, myFunc)
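Filled in with the steps from the question, the function might look like the sketch below. The names add_year_doy and is_leap are illustrative; is_leap is a base-R stand-in for lubridate::leap_year under the usual Gregorian rule, and the intermediate dday column is kept local rather than stored.

```r
# Base-R stand-in for lubridate::leap_year (assumes the Gregorian rule).
is_leap <- function(y) (y %% 4 == 0 & y %% 100 != 0) | y %% 400 == 0

add_year_doy <- function(DF) {
  DF$year <- floor(DF$date)                    # year part of the decimal date
  frac <- DF$date - DF$year                    # fraction of the year elapsed
  days <- ifelse(is_leap(DF$year), 366, 365)   # length of each year in days
  DF$doy <- round(frac * days)                 # day of year
  DF                                           # return the modified copy
}

# Two small stand-in data frames in a named list.
df.list <- list(
  df.00 = data.frame(site = 1:2, date = c(1980.5, 1981.2)),
  df.01 = data.frame(site = 1:2, date = c(2000.0, 1999.25))
)
df.list <- lapply(df.list, add_year_doy)
```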
On a separate note, regarding functions:
The reason your funtest "does nothing" is that it modifies a local copy of x and does not return that copy. That is to say, it is doing something, but when it finishes, the result is thrown away.
You need to include a return(.) statement in the function. Alternatively, the output of last line of the function, if not assigned to an object, will be used as the return value -- but this last sentence is only loosely true and hence one needs to be cautious. The cleanest option (in my opinion) is to use return(.)
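A short sketch of that behaviour, with set_flag and df as made-up names: the function changes its own copy of the data frame, so the caller only sees the change if the returned value is captured.

```r
set_flag <- function(x) {
  x$doy <- NA    # modifies the function's local copy only
  return(x)      # hand the modified copy back to the caller
}

df <- data.frame(site = 1:3)
out <- set_flag(df)   # df itself is unchanged; out has the new column
```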
regarding the for loop over the data.frame
As you observed, using for (i in someDataFrame) {...} iterates over the columns of the data.frame.
You can iterate over the rows using apply (each row is passed to the function as a vector, so use [ rather than $ inside it):
apply(myDF, MARGIN=1, function(x) { x["doy"] <- ...; return(x) } ) # don't forget to return