data frame in user defined function in R - r

I'm trying to make a function that takes two arguments. One argument is the name of a data frame, and the second is the name of a column in that data frame. The goal is for the function to manipulate data in the whole frame based on information contained in the specified column.
My problem is that I can't figure out how to use the character expression entered into the second argument to access that particular column in the data frame within the function. Here's a super brief example,
datFunc <- function(dataFrame = NULL, charExpres = NULL) {
return(dataFrame$charExpres)
}
If, for instance, you enter
datFunc(myData, "variable1")
this does not return myData$variable1. There has to be a simple way to do this. Sorry if the question is stupid, but I'd appreciate a little help here.
A related question: how do I use the character string "myData$variable1" to actually return variable1 from myData?

I think the OP wants to pass the name of the data frame as a string too. If that is the case, your function should look something like this (sample data borrowed from the other answer):
fooFunc <- function( dfNameStr, colNamestr, drop=TRUE) {
df <- get(dfNameStr)
return(df[,colNamestr, drop=drop])
}
> myData <- data.frame(ID=1:10, variable1=rnorm(10, 10, 1))
> myData
ID variable1
1 1 10.838590
2 2 9.596791
3 3 10.158037
4 4 9.816136
5 5 10.388900
6 6 10.873294
7 7 9.178112
8 8 10.828505
9 9 9.113271
10 10 10.345151
> fooFunc('myData', 'ID', drop=F)
ID
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
> fooFunc('myData', 'ID', drop=T)
[1] 1 2 3 4 5 6 7 8 9 10
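For the related question about a single string such as "myData$variable1", none of the answers give a solution. One possible sketch is to split the string at the `$` and reuse get() together with `[[`. The helper name lookupByString and the small data frame are made up for illustration, and the sketch assumes the string has exactly the form "frame$column".

```r
# Hypothetical data, for illustration only
myData <- data.frame(ID = 1:3, variable1 = c(10.5, 9.8, 11.2))

# Hypothetical helper: resolves a string of the form "frame$column"
lookupByString <- function(expr, envir = parent.frame()) {
  # split on a literal "$" (fixed = TRUE avoids regex interpretation)
  parts <- strsplit(expr, "$", fixed = TRUE)[[1]]
  # parts[1] names the data frame, parts[2] names the column
  get(parts[1], envir = envir)[[parts[2]]]
}

res <- lookupByString("myData$variable1")
```

This avoids eval(parse(text = ...)), which would also work but runs whatever code the string happens to contain.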

You're almost there. Try using [ instead of $ for this kind of indexing:
datFunc <- function(dataFrame = NULL, charExpres = NULL, drop=TRUE) {
return(dataFrame[, charExpres, drop=drop])
}
# An example
set.seed(1)
myData <- data.frame(ID=1:10, variable1=rnorm(10, 10, 1)) # DataFrame
datFunc(myData, "variable1") # dropping dimensions
[1] 9.373546 10.183643 9.164371 11.595281 10.329508 9.179532 10.487429 10.738325 10.575781 9.694612
datFunc(myData, "variable1", drop=FALSE) # keeping dimensions
variable1
1 9.373546
2 10.183643
3 9.164371
4 11.595281
5 10.329508
6 9.179532
7 10.487429
8 10.738325
9 10.575781
10 9.694612
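The reason $ fails is that it does not evaluate what comes after it: dataFrame$charExpres looks for a column literally named charExpres. A minimal sketch of the difference, with a hypothetical helper getCol and made-up data; `[[` is another standard way to pull a single column by a string:

```r
# Hypothetical data and helper name, for illustration only
myData <- data.frame(ID = 1:3, variable1 = c(10.5, 9.8, 11.2))

getCol <- function(dataFrame, colName) {
  # `[[` evaluates colName, so the string it holds selects the column
  dataFrame[[colName]]
}

byString <- getCol(myData, "variable1")  # same values as myData$variable1

# `$` uses the literal symbol after it, so this looks for a column
# that is actually called "colName" and returns NULL
colName <- "variable1"
byDollar <- myData$colName
```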

Alternatively, you can find the column index of the dataframe:
df <- as.data.frame(matrix(rnorm(100), ncol = 10))
colnames(df) <- sample(LETTERS, 10)
column.index.of.A <- grep("^A$", colnames(df))
df[, column.index.of.A]
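If the column name is known exactly, match() gives the index without building a regular expression. This is only a sketch with made-up data; it also shows that a missing name yields NA, which is worth checking before indexing.

```r
# Hypothetical data frame with fixed column names
df <- data.frame(B = 1:3, A = c(2.5, 3.5, 4.5), C = letters[1:3])

idx <- match("A", colnames(df))         # position of the column named "A"
colA <- df[, idx]                       # the column itself, as a vector

missingIdx <- match("Z", colnames(df))  # NA, because there is no column "Z"
```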

Related

Find objects in list that contain values from vector in R

My data frame
set.seed(1)
df <- data_frame(col1 = c(1:49), col2 = sample(c(0:20), 49, replace = T))
My list
fields <- list(A = c(2:4, 12:16, 24:28, 36:40, 48:49),
B = c(6:10, 18:22, 30:34, 42:46))
I would like to create a new column that, for each number in df$col1, contains the name of the vector in fields that contains that number.
I have created a conditional for loop over fields:
col1 <- df$col1
for (i in col1) {
if (col1[i] %in% fields[[1]] == T) {
col1[i] <- names(fields)[1]
} else if (col1[i] %in% fields[[2]] == T) {
col1[i] <- names(fields)[2]
}
}
Although this works, and I can then assign the resulting new vector col1 to my data frame, it doesn't seem very efficient to me, especially because I also have lists with more objects.
The reason why I want to do this: I would like to use ggplot and dplyr to group and summarise the observations according to their position in my lists (fields, but also other lists). I hope it is clear from my question what I intend to do. Thanks!
EDIT
I have created a more generalised function that contains a nested for-loop
find_object <- function(x, list) {
for (j in 1:length(list)) {
for (i in 1:length(x)) {
if (x[i] %in% list[[j]] == TRUE) {
x[i] <- names(list)[j]
}
}
}
x
}
find_object(col1, fields)
That is more or less what I want - but this is a nested for loop, and I have heard that this is bad... Does anyone have a better solution??
Thanks
A better way is to transform the list to data.frame and then do a join/merge:
library(dplyr)
fields.df <- stack(fields) %>% mutate(ind = as.character(ind))
df %>% left_join(fields.df, by = c('col1' = 'values'))
# col1 col2 ind
# <int> <int> <chr>
# 1 1 5 <NA>
# 2 2 7 A
# 3 3 12 A
# 4 4 19 A
# 5 5 4 <NA>
# 6 6 18 B
# 7 7 19 B
# 8 8 13 B
# 9 9 13 B
# 10 10 1 B
Note: I use left_join from dplyr because you are using data_frame. The base R merge should also work.
Another way would be to use match() after creating a data frame with stack().
library(dplyr)
foo <- stack(fields)
mutate(df, whatever = foo$ind[match(df$col1, foo$values)])
col1 col2 whatever
<int> <int> <fctr>
1 1 5 <NA>
2 2 7 A
3 3 12 A
4 4 19 A
5 5 4 <NA>
6 6 18 B
7 7 19 B
8 8 13 B
9 9 13 B
10 10 1 B
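The same match() idea also works in base R without stack() or dplyr. This is a sketch using the fields list from the question; the names lookup and field are made up for illustration.

```r
fields <- list(A = c(2:4, 12:16, 24:28, 36:40, 48:49),
               B = c(6:10, 18:22, 30:34, 42:46))
col1 <- 1:49

# One name per value: "A" repeated for every value in fields$A, and so on
lookup <- rep(names(fields), lengths(fields))

# Find each col1 value among all listed values, then pick the matching name;
# values that appear in no vector come back as NA
field <- lookup[match(col1, unlist(fields))]
```

Because it is vectorised, the cost does not grow with a nested loop, and it handles a list with any number of named vectors.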

Paste string values from df column into a function

I have a dataset in R organized like so:
x freq
1 PRODUCT10000 6
2 PRODUCT10001 20
3 PRODUCT10002 11
4 PRODUCT10003 4
5 PRODUCT10004 1
6 PRODUCT10005 2
Then, I have a function like
fun <- function(number, df1, string, df2) {
  NormC <- as.numeric(df1[string, "normc"])
  df2$NormC <- rep(NormC)
}
How can I iterate through my df and insert each value of "x" into the function?
I think the problem is that this part of the function (which has 4 input variables) is structured like so: NormC <- as.numeric(df1[string, "normc"])
As explained by @duckmayr, you don't need to iterate through column x, because the function can be applied to the whole column at once. Here is an example that creates a new variable.
df <- read.table(text = " x freq
1 PRODUCT10000 6
2 PRODUCT10001 20
3 PRODUCT10002 11
4 PRODUCT10003 4
5 PRODUCT10004 1
6 PRODUCT10005 2", header = TRUE)
fun <- function(string){paste0(string, "X")} # example
# option 1
df$new.col1 <- fun(df$x) # see duckmayr's comment
# option 2
library(data.table)
setDT(df)[, new.col2 := fun(x)]
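If the function really can only handle one value at a time, vapply() can call it once per element of the column. This is a sketch: describeOne is a hypothetical stand-in for such a function, and the data is a shortened copy of the table above.

```r
df <- data.frame(x = c("PRODUCT10000", "PRODUCT10001", "PRODUCT10002"),
                 freq = c(6, 20, 11),
                 stringsAsFactors = FALSE)

# Hypothetical function that only accepts a single string
describeOne <- function(string) {
  if (nchar(string) != 12) stop("unexpected product code")
  paste0(string, "X")
}

# Call describeOne once per value of x; character(1) states that each
# call must return exactly one string
df$new.col <- vapply(df$x, describeOne, character(1), USE.NAMES = FALSE)
```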

combine list of data frames in list in specific manner

I have a list whose elements are themselves lists of data frames.
The elements of the outer list represent years, and the elements of each inner list hold the monthly data.
Now I want to create a final list with one element per month. Within each month, the columns from the different years should be combined with cbind.
Alldata <- list()
Alldata[[1]] <- list(data.frame(Jan_2015_A=c(1,2), Jan_2015_B=c(3,4)), data.frame(Feb_2015_C=c(5,6), Feb_2015_D=c(7,8)))
Alldata[[2]] <- list(data.frame(Jan_2016_A=c(1,2), Jan_2016_B=c(3,4)), data.frame(Feb_2016_C=c(5,6), Feb_2016_D=c(7,8)))
The expected output is the finalList printed at the end of this question.
I have done this with the for loops below, but the code is complex and hard to follow even for me. I am hoping for a simpler, tidier way to do it, ideally with an existing R function.
First I create a list in which each item holds one month's data frames for all years:
x2 <- list()
for(l1 in 1: length(Alldata[[1]])){
temp <- list()
for(l2 in 1: length(Alldata)){
temp <- append(temp, list(Alldata[[l2]][[l1]]))
}
x2 <- append(x2, list(temp))
}
# Then create the final list, with successive years' data for each month as list items. This is mainly used to track data across years, for example: what was the count for "A" in Jan_2015 and in Jan_2016?
finalList <- list()
for(l3 in 1: length(x2)){
temp <- x2[[l3]]
td2 <- as.data.frame(matrix("", nrow = nrow(temp[[1]])))
rownames(td2)[rownames(temp[[1]])!=""] <- rownames(temp[[1]])[rownames(temp[[1]])!=""]
for(l4 in 1:ncol(temp[[1]])){
for(l5 in 1: length(temp)){
# lapply(l4, function(x) do.call(cbind,
td2 <- cbind(td2, temp[[l5]][, l4, drop=F])
}
}
finalList <- append(finalList, list(td2))
}
> finalList
[[1]]
V1 Jan_2015_A Jan_2016_A Jan_2015_B Jan_2016_B
1 1 1 3 3
2 2 2 4 4
[[2]]
V1 Feb_2015_C Feb_2016_C Feb_2015_D Feb_2016_D
1 5 5 7 7
2 6 6 8 8
You could do the following. The lapply will iterate over the outer list and the do.call will cbind the inner list of data frames.
lapply(Alldata, do.call, what = 'cbind')
[[1]]
Jan_2015_A Jan_2015_B Feb_2015_C Feb_2015_D
1 1 3 5 7
2 2 4 6 8
[[2]]
Jan_2016_A Jan_2016_B Feb_2016_C Feb_2016_D
1 1 3 5 7
2 2 4 6 8
You can also use dplyr to get the same results.
library(dplyr)
lapply(Alldata, bind_cols)
Here is a third option proposed by J.R.
lapply(Alldata, Reduce, f = cbind)
EDIT
After clarification from OP, the above solution has been modified (see below) to produce the newly specified output. The solution above has been left there since it is a building block for the solution below.
pattern.vec <- c("Jan", "Feb")
### For a given vector of months/patterns, returns a
### list of elements with only that month.
mon_data <- function(mo) {
return(bind_cols(sapply(Alldata, function(x) { x[grep(pattern = mo, x)]})))
}
### Loop through months/patterns.
finalList <- lapply(pattern.vec, mon_data)
finalList
## [[1]]
## Jan_2015_A Jan_2015_B Jan_2016_A Jan_2016_B
## 1 1 3 1 3
## 2 2 4 2 4
##
## [[2]]
## Feb_2015_C Feb_2015_D Feb_2016_C Feb_2016_D
## 1 5 7 5 7
## 2 6 8 6 8
## Ordering the columns as specified in the original question.
## sorting is by the last character in the column name (A or B)
## and then the year.
lapply(finalList, function(x) x[ order(gsub('[^_]+_([^_]+)_(.*)', '\\2_\\1', colnames(x))) ])
## [[1]]
## Jan_2015_A Jan_2016_A Jan_2015_B Jan_2016_B
## 1 1 1 3 3
## 2 2 2 4 4
##
## [[2]]
## Feb_2015_C Feb_2016_C Feb_2015_D Feb_2016_D
## 1 5 5 7 7
## 2 6 6 8 8
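A base R sketch of the same per-month grouping, without grep or dplyr. It assumes that every year lists its months in the same order, so that position m in each inner list is the same month; the name byMonth is made up for illustration.

```r
Alldata <- list(
  list(data.frame(Jan_2015_A = c(1, 2), Jan_2015_B = c(3, 4)),
       data.frame(Feb_2015_C = c(5, 6), Feb_2015_D = c(7, 8))),
  list(data.frame(Jan_2016_A = c(1, 2), Jan_2016_B = c(3, 4)),
       data.frame(Feb_2016_C = c(5, 6), Feb_2016_D = c(7, 8)))
)

byMonth <- lapply(seq_along(Alldata[[1]]), function(m) {
  # take month m from every year and bind the columns side by side
  do.call(cbind, lapply(Alldata, `[[`, m))
})
```

The columns come out grouped by year, so the reordering step shown above would still be needed to get the order from the question.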

Converting row names to data frame column

I want to be able to access b0.e7, c0.14,...,f8.d4. But right now these are not in a column; they are the "row names". How can I make the row names 1,2,3,4,5,6 and have b0.e7, c0.14,...,f8.d4 as their own column? Thanks in advance for the help.
df=as.data.frame(c)
df = subset(df, c>7)
df
c
b0.e7 11
c0.14 8
f8.d1 10
f8.d2 9
f8.d3 11
f8.d4 12
Try this. The first line assigns a new column that is just the current row names of the data frame. The second line resets the row names to NULL, resulting in a sequence.
> df$new <- rownames(df)
> rownames(df) <- NULL
Which should result in
> df
# c new
# 1 11 b0.e7
# 2 8 c0.14
# 3 10 f8.d1
# 4 9 f8.d2
# 5 11 f8.d3
# 6 12 f8.d4
And you can reverse the column order if needed with df[, c(2, 1)]
You can make use of the fact that cbind.data.frame can make use of arguments from data.frame, one of which is row.names. That argument can be set to NULL, meaning that a slightly more direct approach than proposed by Richard is:
cbind(rn = rownames(mydf), mydf, row.names = NULL)
# rn c
# 1 b0.e7 11
# 2 c0.14 8
# 3 f8.d1 10
# 4 f8.d2 9
# 5 f8.d3 11
# 6 f8.d4 12
You can try this as well.
rows = row.names(df)
df1 = cbind(rows,df)
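Since the question's vector c is not shown, here is a self-contained sketch with made-up counts. It wraps the first answer's two steps in a hypothetical helper, rownamesToColumn, and moves the new column to the front.

```r
# Made-up data standing in for the question's named vector
counts <- c(b0.e7 = 11, a1.c2 = 3, c0.14 = 8, f8.d1 = 10)
df <- data.frame(count = counts)   # the names become row names
df <- subset(df, count > 7)

# Hypothetical helper combining the steps from the answers above
rownamesToColumn <- function(d, name = "id") {
  d[[name]] <- rownames(d)   # copy row names into a regular column
  rownames(d) <- NULL        # reset row names to 1, 2, 3, ...
  # put the new column first
  d[, c(name, setdiff(names(d), name)), drop = FALSE]
}

out <- rownamesToColumn(df)
```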

how can I get the means after splitting the data by columns? statistics with R

Thanks for the useful answer in:
Loop over vector (introspection in R?) or some other approach
I want to get the mean of each vector without having to type it out for each vector individually. How do I do this?
My code:
probability_ratings = split(offline$Probability,
paste(offline$Item, offline$Cond, sep=""))
head(probability_ratings)
$i01c1
[1] 7 7 7 3 7 3 7 6
$i01c2
[1] 4 4 5 3 4 5 5 3
$i01c3
[1] 7 4 6 4 7 5 5 5
$i01c4
[1] 1 2 2 1 2 2 2 4
$i01c5
[1] 5 5 6 5 7 3 4
$i01c6
[1] 6 6 7 6 7 5 6
I need the mean of each row, but I am not sure what data type this is and if/how I can apply the mean() function.
Thanks,
Katerina
split returns a list, so you just need to use sapply or lapply to apply mean to each list element. lapply will return a list and sapply will return a named vector (in this case).
probability_ratings <- list(
i01c1=c(7,7,7,3,7,3,7,6),
i01c2=c(4,4,5,3,4,5,5,3),
i01c3=c(7,4,6,4,7,5,5,5),
i01c4=c(1,2,2,1,2,2,2,4),
i01c5=c(5,5,6,5,7,3,4),
i01c6=c(6,6,7,6,7,5,6) )
sapply(probability_ratings, mean)
# i01c1 i01c2 i01c3 i01c4 i01c5 i01c6
# 5.875000 4.125000 5.375000 2.000000 5.000000 6.142857
I would go for aggregate(), without using split.
Using the example from your link:
tf <- data.frame(
formant = sample(c("F1","F2"), 100, T),
vowels = sample(c('a', 'e', 'i', 'o', 'u'), 100, T),
IL = runif(100)
)
aggregate(IL ~ formant + vowels, data = tf, mean)
but there are also a lot of other possibilities to do that...
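One of those other possibilities is tapply(), which groups and summarises in one step, so the intermediate split() is not needed. This is a sketch: the offline data frame below is made up, since the real one is not shown in the question.

```r
# Made-up stand-in for the question's offline data
offline <- data.frame(
  Item = c("i01", "i01", "i01", "i01", "i02", "i02"),
  Cond = c("c1", "c1", "c2", "c2", "c1", "c1"),
  Probability = c(7, 3, 4, 5, 6, 2)
)

# Group Probability by the pasted Item/Cond label and take each group's mean
means <- tapply(offline$Probability,
                paste0(offline$Item, offline$Cond),
                mean)
```

The result is named by the group labels, much like the sapply() output above.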
