I have a data frame called "Region_Data", which I created by applying some functions to my data.
I want to use this data frame as an input and subset it with the function below, which should return the subset data frame:
Region_Analysis_Function <- function(Input_Region){
  Subset_Region_Data = subset(Region_Data, Region == "Input_Region" )
  Subset_Region_Data
}
However, when I create this function and then execute it using:
Region_Analysis_Function("North West")
I get 0 observations, even though I know that there are xx matching observations in the data frame.
I read that there is something called global / local environment, but I'm not really clear on that.
How do I solve this issue? Thank you so much in advance!!
When you try to subset your data using subset(Region_Data, Region == "Input_Region" ), "Input_Region" is interpreted as a string literal rather than being evaluated to the value of the function argument. This means that unless the column Region in your object Region_Data contains some rows with the literal value "Input_Region", your function will return a zero-row subset. Removing the quotes will solve this, and changing == to %in% will make your function more general, since it then also accepts a vector of values. Consider the following data set,
mydf <- data.frame(
x = 1:5,
y = rnorm(5),
z = letters[1:5])
##
R> mydf
x y z
1 1 -0.4015449 a
2 2 0.4875468 b
3 3 0.9375762 c
4 4 -0.7464501 d
5 5 0.8802209 e
and the following 3 functions,
qfoo <- function(Z) {
  subset(mydf, z == "Z")
}
foo <- function(Z) {
  subset(mydf, z == Z)
}
##
bar <- function(Z) {
  subset(mydf, z %in% Z)
}
where qfoo represents the approach used in your question, foo implements the first change I noted, and bar implements both changes.
The second two functions will work when the input value is a scalar,
R> qfoo("c")
[1] x y z
<0 rows> (or 0-length row.names)
##
R> foo("c")
x y z
3 3 0.9375762 c
##
R> bar("c")
x y z
3 3 0.9375762 c
but only the third will work if it is a vector:
R> foo(c("a","c"))
x y z
1 1 -0.4015449 a
Warning messages:
1: In is.na(e1) | is.na(e2) :
longer object length is not a multiple of shorter object length
2: In `==.default`(z, Z) :
longer object length is not a multiple of shorter object length
##
R> bar(c("a","c"))
x y z
1 1 -0.4015449 a
3 3 0.9375762 c
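Applied to the function from the question, the fix might look like the sketch below. The contents of Region_Data are not shown in the question, so the small data frame here (including the Sales column) is a made-up stand-in, and passing the data frame as a second argument is my own addition rather than something the question requires.

```r
# Hypothetical stand-in for Region_Data; the real columns are not shown in the question
Region_Data <- data.frame(
  Region = c("North West", "South East", "North West"),
  Sales  = c(10, 20, 30)
)

Region_Analysis_Function <- function(Input_Region, data = Region_Data) {
  # No quotes around Input_Region, so the argument's value is used.
  # %in% also allows a vector of regions to be passed.
  data[data$Region %in% Input_Region, , drop = FALSE]
}

north_west <- Region_Analysis_Function("North West")
both <- Region_Analysis_Function(c("North West", "South East"))
```

Indexing with [ instead of subset() is a design choice: inside functions it avoids any ambiguity between column names and function arguments.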
I have the following data. I want to summarise each column by its mean, or by its mode if the column is nominal.
df1<-data.frame(c("a","a"),c("b","d"),c(1,5),c(4,8))
names(df1)<-c("x","y","z","w")
df1
x y z w
1 a b 1 4
2 a d 5 8
df2<-as.data.frame(matrix(0,ncol=4,nrow=1))
names(df2)<-c("x","y","z","w")
df2$x<-names(table(df1$x))[table(df1$x)==max(table(df1$x))]
df2$y<-names(table(df1$y))[table(df1$y)==max(table(df1$y))]
df2$z<-mean(df1$z)
df2$w<-mean(df1$w)
If the data frame has only a few rows and the values in a nominal column differ from one row to the next (as in y above), the following error appears:
Error in `$<-.data.frame`(`*tmp*`, y, value = c("b", "d")) :
replacement has 2 rows, data has 1
What can I do to fix this error? Thank you so much for your help.
You can write a function to calculate either the mean or mode of each column:
get.mean.mod <- function (df) {
  data.frame(lapply(df, function (x) {
    if (is.numeric(x)) return (mean(x))
    freq <- table(x)
    names(freq)[which.max(freq)]
  }))
}
get.mean.mod(df1)
# x y z w
# 1 a b 3 6
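The difference from the code in the question comes down to how ties are handled, which the sketch below tries to isolate. The helper names mode_first and mode_all are mine, for illustration only. which.max returns the position of the first maximum, so a tie yields a single value, while comparing against max(freq) returns every tied value, and that is what produced the "replacement has 2 rows" error.

```r
# Returns exactly one value, even when several values are tied
mode_first <- function(x) {
  freq <- table(x)
  names(freq)[which.max(freq)]
}

# Returns all tied values, as the code in the question does
mode_all <- function(x) {
  freq <- table(x)
  names(freq)[freq == max(freq)]
}

y <- c("b", "d")
one_mode  <- mode_first(y)  # "b"
all_modes <- mode_all(y)    # "b" "d"
```

Note that table() sorts the values, so on a tie mode_first picks the first value in sorted order, not the first one that appears in the data.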
x <- 1
y <- 1
for (y in 1:2){
for (x in 1:2){
z <- x+y
zresults <- data.frame(x, y, z)
}
}
Hello everyone,
sorry for my dumb question, but I am new to R and this is actually my first attempt to code a little bit.
I created a for-loop with the indices x and y and I want to save the output values (z) together with the corresponding x and y values in a data.frame.
The code I posted is obviously wrong, but I can't figure out why.
The data.frame should look like this:
x y z
1 1 1 2
2 2 1 3
3 1 2 3
4 2 2 4
Thank you guys a lot in advance!
Greetings from Germany
Here's one way to do what you want to do:
zresults <- expand.grid(x=1:2,y=1:2);
zresults$z <- zresults$x + zresults$y;
zresults;
## x y z
## 1 1 1 2
## 2 2 1 3
## 3 1 2 3
## 4 2 2 4
Notes on your attempt:
The initial assignments to x and y are not necessary. The values are overwritten on the first iteration of each respective loop with the first value of the RHS vector (1 in each case). Also worth noting is that, unlike languages like C/C++ and Java, in R you don't have to declare variables; any variable name can be assigned a value at any time.
In your inner loop you're assigning zresults. After the first iteration, you are overwriting the previous value that existed for zresults. If you want to "build up" a data.frame one row at a time, you can use the following solutions, although note that performance will not be ideal with these approaches:
zresults[nrow(zresults)+1L,] <- c(x,y,z);
or
zresults <- rbind(zresults,c(x,y,z));
Also note that zresults would have to be initialized first, prior to the build-up loop; for example:
zresults <- data.frame(x=integer(),y=integer(),z=integer());
In general, try to avoid for-loops in R. Instead, vectorization is preferred. There are many good sources on this; for example, see http://www.noamross.net/blog/2014/4/16/vectorization-in-r--why.html and http://alyssafrazee.com/vectorization.html.
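If you do want to keep the loop, a common middle ground is to collect one small data frame per iteration in a list and bind them all once at the end, rather than growing the data frame row by row. A minimal sketch, where rows and idx are names I made up for illustration:

```r
rows <- vector("list", 4)  # one slot per (x, y) combination
idx <- 1L

for (y in 1:2) {
  for (x in 1:2) {
    # Store this iteration's result instead of overwriting the previous one
    rows[[idx]] <- data.frame(x = x, y = y, z = x + y)
    idx <- idx + 1L
  }
}

# Bind all the one-row data frames together in a single step
zresults <- do.call(rbind, rows)
zresults
```

This gives the same four rows as the expand.grid version, in the same order.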
Here is another solution
x = 1
y = 1
result = NULL
for (y in 1:2) {
for (x in 1:2) {
z = x + y
if (is.null(result)) {
result = data.frame(x,y,z)
} else {
result = rbind(result, data.frame(x,y,z))
}
}
}
result
I have a huge data frame and I am stuck on an if condition. Let me first present a simple example and then lay out my problem:
z <- c(0,1,2,3,4,5)
y <- c(2,2,2,3,3,3)
a <- c(1,1,1,2,2,2)
x <- data.frame(z,y,a)
Problem: I want to sum the z values over the rows that share the same y and a, but only if the second row of each group has z equal to 1.
I am sorry, but I am quite new to R, so I can't show any reasonable code of my own.
Any help would be highly appreciated.
As mentioned, your problem isn't clearly stated.
Perhaps you are looking to do something like this:
x$new <- with(x, ave(z, y, a, FUN = function(k)
ifelse(k[2] == 1, sum(k), NA)))
x
# z y a new
# 1 0 2 1 3
# 2 1 2 1 3
# 3 2 2 1 3
# 4 3 3 2 NA
# 5 4 3 2 NA
# 6 5 3 2 NA
Here, I've created a new column "new" which sums the values of "z" grouped by "y" and "a", but only if the second value in the group is equal to 1.
Since you say your data frame is quite large, you might want to convert your data frame to a data.table object using the data.table package. You will likely find that the required operations are much faster if you have a great many rows. However, the construction of the code for your case is not straightforward with data.table.
If I understand what you want to do (which is not entirely clear to me), you could try the following:
library(data.table)
z <- c(0,1,2,3,4,5)
y <- c(2,2,2,3,3,3)
a <- c(1,1,1,2,2,2)
x <- data.frame(z,y,a)
xx <- as.data.table(x) # Make a data.table object
setkey(xx, z) # Make the z column a key
xx[J(1), sum(a)] # Join on the key: sum column a over rows where z == 1
[1] 1
# Now try the other sum you mention
xx[, sum(z), by = list(z = y)] # A column sum over groups defined by z = y
z V1
1: 2 2
2: 3 3
sum(xx[, sum(z), by = list(z = y)][, V1]) # Summing over the sums for each group should do it
[1] 5
To create the sum over the column a where z = 1, I made the z column a key. The syntax xx[J(1), sum(a)] joins on the key and sums a where the key value (z value) is 1. Note that a bare number in that position, as in xx[1, sum(a)], selects the first row by position rather than matching the key.
I can create groups with the data.table object with by, which is analogous to a SQL GROUP BY clause if you are familiar with SQL. However, the result is a sum for each of the groups created. This may be inefficient if you have a great many possible matching values where z = y. The outer sum adds the values for each group in the sub-selected V1 column of the inner result.
If you are going to use data.table in a serious way study the informative vignettes available for that package.
M Dowle, T Short, S Lianoglou, A Srinivasan with contributions from R Saporta and E Antonyan (2014). data.table: Extensions of data.frame. R package version 1.9.2. http://CRAN.R-project.org/package=data.table
Given this data.frame
x y z
1 1 3 5
2 2 4 6
I'd like to add the values of columns x and z plus a coefficient of 10, for every row in dat.
The intended result is this
x y z result
1 1 3 5 16 #(1+5+10)
2 2 4 6 18 #(2+6+10)
Why doesn't this code produce the desired result?
dat <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
Coeff <- 10
# Function
process.xz <- function(v1,v2,cf) {
return(v1+v2+cf)
}
# It breaks here
sm <- apply(dat[,c('x','z')], 1, process.xz(dat$x,dat$y,Coeff ))
# Later I'd do this:
# cbind(dat,sm);
I wouldn't use an apply here. Since the addition + operator is vectorized, you can get the sum using
> process.xz(dat$x, dat$z, Coeff)
[1] 16 18
To write this in your data.frame, don't use cbind, just assign it directly:
dat$result <- process.xz(dat$x, dat$z, Coeff)
The reason it fails is that apply doesn't work like that: you must pass the function itself (not the result of calling it), followed by any additional parameters. Each row of the data frame is then passed (as a single vector) as the first argument to that function.
dat <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
Coeff <- 10
# Function
process.xz <- function(x,cf) {
return(x[1]+x[2]+cf)
}
sm <- apply(dat[,c('x','z')], 1, process.xz,cf=Coeff)
I completely agree that there's no point in using apply here though - but it's good to understand anyway.
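To see what apply actually hands to the function, here is a small sketch using an anonymous function. The names sm_apply and sm_vec are mine. With MARGIN = 1 each row arrives as a named vector, so the elements can be picked out by column name, and the result matches the vectorized sum.

```r
dat <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))
Coeff <- 10

# Each row is passed as a named numeric vector with elements "x" and "z"
sm_apply <- apply(dat[, c("x", "z")], 1, function(row) {
  row[["x"]] + row[["z"]] + Coeff
})

# The vectorized equivalent, with no apply needed
sm_vec <- dat$x + dat$z + Coeff

all(sm_apply == sm_vec)
```

Keep in mind that apply converts the data frame to a matrix first, so this pattern only behaves well when the selected columns share a type.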
So I have a bunch of data frames in a list object. Frames are organised such as
ID Category Value
2323 Friend 23.40
3434 Foe -4.00
And I got them into a list by following this topic. I can also run simple functions on them as shown in this topic.
Now I am trying to run a conditional function with lapply, and I'm running into trouble. In some tables the 'ID' column has a different name (say, 'recnum'), and I need to tell lapply to go through each data frame, check if there is a column named 'recnum', and change its name to 'ID', as in
colnr <- which(names(x) == "recnum")
if (length(colnr) > 0) {names(x)[colnr] <- "ID"}
But I'm running into trouble with local scope and who knows what. Any ideas?
Use the rename function from plyr; it renames by name, not position:
x <- data.frame(ID = 1:2,z=1:2)
y <- data.frame('recnum' = 1:2,z=3:4)
.list <- list(x,y)
library(plyr)
lapply(.list, rename, replace = c('recnum' = 'ID'))
[[1]]
ID z
1 1 1
2 2 2
[[2]]
ID z
1 1 3
2 2 4
Your original code works fine:
foo <- function(x){
  colnr <- which(names(x) == "recnum")
  # Rename only when a column called "recnum" was found
  if (length(colnr) > 0) {names(x)[colnr] <- "ID"}
  x
}
.list <- list(x,y)
lapply(.list, foo)
Not sure what your problem was.
If you look at the second part of mnel's answer, you can see that the function foo evaluates x as its last expression. Without that, if you try to change the names of the data.frames in your list directly from within the anonymous function passed to lapply, it will likely not work.
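A small sketch of that point, with made-up names frames, no_return and with_return: a function returns the value of its last expression, and the value of an assignment is the assigned value, so leaving out the final x makes lapply return the string "ID" instead of the renamed data frame.

```r
frames <- list(
  data.frame(ID = 1:2, z = 1:2),
  data.frame(recnum = 1:2, z = 3:4)
)

# Last expression is the assignment, so each element of the result is "ID"
no_return <- lapply(frames, function(x) {
  names(x)[names(x) == "recnum"] <- "ID"
})

# Last expression is x, so each element is the modified data frame
with_return <- lapply(frames, function(x) {
  names(x)[names(x) == "recnum"] <- "ID"
  x
})

names(with_return[[2]])
```

In either case the original frames list is left unchanged, so the result of lapply has to be assigned to something to be kept.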
Just as an alternative, you could use gsub and avoid loading an additional package (although plyr is a nice package):
xx <- list(data.frame("recnum" = 1:3, "recnum2" = 1:3),
data.frame("ID" = 4:6, "hat" = 4:6))
lapply(xx, function(x){
names(x) <- gsub("^recnum$", "ID", names(x))
return(x)
})
# [[1]]
# ID recnum2
# 1 1 1
# 2 2 2
# 3 3 3
# [[2]]
# ID hat
# 1 4 4
# 2 5 5
# 3 6 6