I have a data frame (df) with only three columns: ideal weight (x) in kg (column 1), age (y) in years (column 2), and gender (z) (column 3; boy coded 1, girl coded 2) for school students. I want to write a function that returns the ideal weight of a school student at a given age and gender. My novice attempt is shown below:
idealwt<-function(age,gender){
age=df$y
gender=df$z
idealwt = df$x[age==df$y & gender==df$z]
return(idealwt)
}
However, the above function returns the whole vector instead of a specific value.
The problem in the OP's function comes from creating the objects 'age' and 'gender' inside the body, which overwrite the function's arguments of the same names. In effect, we are comparing df$y == df$y and df$z == df$z, which is TRUE for every element, so the output is the whole vector. Instead, we can define the function without age = df$y and gender = df$z:
idealwt<-function(age,gender){
df$x[age==df$y & gender==df$z]
}
idealwt(12, 2)
#[1] 38 42
data
df <- data.frame(x = c(45, 38, 55, 33, 42), y = c(15, 12, 18, 14, 12),
z = c(1, 2, 1, 1, 2))
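A slightly more defensive variant, sketched here as an assumption rather than as part of the answer above, passes the data frame in as an argument so the function does not depend on a global df. The name idealwt2 and the data argument are hypothetical.

```r
# Example data from the answer above
df <- data.frame(x = c(45, 38, 55, 33, 42), y = c(15, 12, 18, 14, 12),
                 z = c(1, 2, 1, 1, 2))

# Hypothetical variant: the data frame is an explicit argument,
# and the arguments age and gender are never overwritten
idealwt2 <- function(data, age, gender) {
  data$x[data$y == age & data$z == gender]
}

idealwt2(df, 12, 2)
#[1] 38 42
idealwt2(df, 15, 1)
#[1] 45
```

If several rows share the same age and gender, all matching weights are returned, as in the first call.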
I wrote a (not pretty, but working) function to make one long vector from a certain column of my dataframe and add in a certain number of NA's every time the ID changes. Now what I am looking for is a possibility to automatically rename the variable array within the function so the output of the function carries an individual name (to make it easy to identify which values are in there and to prevent it from getting overwritten when running the function for a different column). One possibility would be to rename it with x or array_x. Now what I tried is several variations of this:
c("array_", as.character(x)) <- array
rm(array)
print(c("array_", as.character(x)))
But it only throws errors. I assume this is because the string is not recognized as a variable name. Can anyone help me solve this?
Here is some example data and the part of the function that is already running:
ID <- c(rep ("A", 3), rep("B", 3))
Day <- c(1,2,3,1,2,3)
Score1 <- c(12,4, 16, 9, 12, 13)
Score2 <- c(1, 4, 4, 1, 3, 5)
Score3 <- c(23, 19, 12, 12, 24, 11)
df <- data.frame(ID, Day, Score1, Score2, Score3)
print(df)
foo <- function(x) {
array <- c(df[1,x])
for (i in 2:nrow(df))
{
if (df[i, 1] == df[i-1, 1 ]) {
array <- append (array, df[i, x])
}
else
{
array <- append (array, rep (NA, 5))
array <- append (array, df[i, x])
}
}
#rename array
print (array)
}
foo("Score1")
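One possible direction, sketched under assumptions and not taken from the question: instead of renaming a variable inside the function, return the vector and store the results in a named list, one element per column. The helper name pad_by_id and its n_pad argument are hypothetical. A call such as assign(paste0("array_", x), array, envir = .GlobalEnv) would also create a variable from a string, but a named list is usually easier to loop over afterwards.

```r
# Example data from the question
df <- data.frame(ID = c(rep("A", 3), rep("B", 3)),
                 Day = c(1, 2, 3, 1, 2, 3),
                 Score1 = c(12, 4, 16, 9, 12, 13),
                 Score2 = c(1, 4, 4, 1, 3, 5),
                 Score3 = c(23, 19, 12, 12, 24, 11))

# Hypothetical variant of foo(): returns the padded vector instead of printing it
pad_by_id <- function(data, col, n_pad = 5) {
  out <- data[[col]][1]
  for (i in seq_len(nrow(data))[-1]) {
    # insert n_pad NAs whenever the ID changes
    if (data$ID[i] != data$ID[i - 1]) out <- c(out, rep(NA, n_pad))
    out <- c(out, data[[col]][i])
  }
  out
}

cols <- c("Score1", "Score2", "Score3")
# one list element per column, named array_<column>
arrays <- setNames(lapply(cols, pad_by_id, data = df), paste0("array_", cols))
names(arrays)
arrays$array_Score1
```

Each result keeps its own name (array_Score1, array_Score2, ...), so running the function for another column does not overwrite earlier output.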
I have some data like so:
a <- c(1, 2, 9, 18, 6, 45)
b <- c(12, 3, 34, 89, 108, 44)
c <- c(0.5, 3.3, 2.4, 5, 13,2)
df <- data.frame(a, b,c)
I need to create a function to lag many variables at once for a very large time series analysis with dozens of variables, without typing each one out. In short, I would like to create variables a.lag1, b.lag1 and c.lag1 and be able to add them to the original df specified above. I figure the best way to do so is by creating a custom function, something along the lines of:
lag.fn <- function(x) {
assign(paste(x, "lag1", sep = "."), lag(x, n = 1L)
return (assign(paste(x, "lag1", sep = ".")
}
The desired output is:
a.lag1 <- c(NA, 1, 2, 9, 18, 6, 45)
b.lag1 <- c(NA, 12, 3, 34, 89, 108, 44)
c.lag1 <- c(NA, 0.5, 3.3, 2.4, 5, 13, 2)
However, I don't get what I am looking for. Should I change the environment to the global environment? I would like to be able to use cbind to add the result to the original df. Thanks.
This is easy using dplyr. Avoid naming data frames df, since it may cause confusion with the function of the same name (the F distribution density in stats). I'm using df1.
library(dplyr)
df1 <- df1 %>%
mutate(a.lag1 = lag(a),
b.lag1 = lag(b),
c.lag1 = lag(c))
The data frame statement in the question is invalid since a, b and c are not the same length. What you can do is create a zoo series. Note that the lag specified in lag.zoo can be a vector of lags as in the second example below.
library(zoo)
z <- merge(a = zoo(a), b = zoo(b), c = zoo(c))
lag(z, -1) # lag all columns
lag(z, 0:-1) # each column and its lag
We can use mutate_all
library(dplyr)
df %>%
mutate_all(list(lag = ~ lag(.)))  # funs() is deprecated since dplyr 0.8.0
If everything else fails, you can use a simple base R function:
my_lag <- function(x, steps = 1) {
c(rep(NA, steps), x[1:(length(x) - steps)])
}
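As a rough sketch of how a base R helper like this could be applied to every column at once and bound back with cbind (which is what the question asks for), assuming 0 <= steps <= length(x). The names lag_vec, lagged and result are hypothetical, and the data frame is called df1 as suggested above.

```r
# Example data from the question
df1 <- data.frame(a = c(1, 2, 9, 18, 6, 45),
                  b = c(12, 3, 34, 89, 108, 44),
                  c = c(0.5, 3.3, 2.4, 5, 13, 2))

# Hypothetical helper: shift x down by `steps`, padding the top with NA.
# The result has the same length as x.
lag_vec <- function(x, steps = 1) {
  c(rep(NA, steps), head(x, length(x) - steps))
}

# Lag every column, then rename to a.lag1, b.lag1, c.lag1
lagged <- as.data.frame(lapply(df1, lag_vec))
names(lagged) <- paste(names(df1), "lag1", sep = ".")

result <- cbind(df1, lagged)
result
```

Because each lagged column keeps the length of the original, cbind works without any padding of the original data frame.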
I want to have multiple copies of a data frame, each with a new randomization of one variable. My objective is to run multiple iterations of an analysis with a randomized value for one variable.
I've started by creating a list of data frames, each a copy of my original data frame:
a <- c(1, 2, 3, 4, 5)
b <- c(45, 34, 50, 100, 64)
test <- data.frame(a, b)
test2 <- lapply(1:2,function(x) test) #List of 2 dataframe, identical to test
I know about transform and sample, to randomize the values of a column:
test1 <- transform(test, a = sample(a))
I just cannot find how to apply it to the entire list of dataframes. I've tried this:
test3<- lapply(test2,function(i) sample(i[["a"]]))
But I lost the other variables. And this:
test3 <- lapply(test2,function(i) {transform(i, i[["a"]]==sample(i[["a"]]))})
But my variable is not randomized.
Several questions are similar to mine, but they did not help me solve my problem:
Adding columns to each in a list of dataframes
Add a column in a list of data frames
You can try the following:
lapply(test2, function(df) {df$a <- sample(df$a); df})
Or, using transform:
lapply(test2, function(df) transform(df, a = sample(a)))
Or just
lapply(test2, transform, a = sample(a))
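If the list of identical copies is not needed for anything else, the copy and shuffle steps can be combined. This is only a sketch: the name shuffled and the choice of 3 replicates are assumptions, and set.seed is there just to make the illustration reproducible.

```r
set.seed(1)  # only so the illustration is reproducible

test <- data.frame(a = c(1, 2, 3, 4, 5),
                   b = c(45, 34, 50, 100, 64))

# replicate() re-evaluates the expression each time, so every element
# of the list gets its own independent shuffle of column a
shuffled <- replicate(3, transform(test, a = sample(a)), simplify = FALSE)

length(shuffled)   # number of data frames in the list
shuffled[[1]]      # column a is permuted, column b is untouched
```

Each element still contains both columns, so the other variables are not lost.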
Is there a reason you need them in separate lists?
This will give you 10 new columns, each holding a randomized sample of a, and you can then loop through those columns for your further analysis.
a <- c(1, 2, 3, 4, 5)
b <- c(45, 34, 50, 100, 64)
test <- data.frame(a, b)
for(i in 3:12){
test[, i] <- sample(a)  # a plain vector is enough; transform() is not needed here
}
I'm having some difficulties trying to calculate the gini coefficient using binned census data, and would really appreciate any help.
My data looks a little something like this (but with 14,000 observations of 13 variables).
location <- c('A','B','C', 'D', 'E', 'F')
no_income <- c(20, 1, 40, 79, 12, 2)
income1 <- c(13, 4, 56, 17, 9, 4)
income2 <- c(27, 39, 49, 12, 19, 0)
income3 <- c(0, 1, 4, 3, 27, 0)
df <- data.frame(location, no_income, income1, income2, income3)
So for each observation there is a location given, and then a series of columns indicating how many households in the area earn within the given income bracket (so for location A, 20 households earn $0, 13 earn income1, 27 income2, and 0 income3).
I've created an empty column to return the results to:
df$gini = 0
I've then created a numerical vector (x) containing the income amount I want to use for each income bin
x <- c(0, 300, 1000, 2000)
I've been trying to use the gini function within the reldist package, and have written the following for loop to cycle through each row of the data, apply the gini function and return the output to a new column.
for (i in 1:nrow(samp)){
w <- samp[i,2:5]
df$gini <- gini(x, w=rep(1, length=length(x)))
}
The problem is that the output returned is currently identical for each row, which is obviously not correct. I'm relatively new to this though, and not sure what I'm doing wrong...
R vectorises operations, so there's often no need to write a loop; in this case you do because of how the function works. You also don't often need to initialise a container (sometimes you might, but rarely).
Here's a working example using apply to loop over the rows:
# setup
install.packages("reldist")
library(reldist)
# dummy data
df = data.frame(ID=letters,
Bin1=rpois(26, 3),
Bin2=rpois(26, 8),
Bin3=rpois(26, 1))
inc = c(0, 300, 1000)
# new column with gini
df$gini = apply(df[, 2:4], 1, function(i){
gini(inc, i)
})
Worth noting that gini() defaults the weights argument to rep(1, length=length(x)), so if that's what you want you don't need to define it.
Edit:
I've added inclusion of weights, based on what I read in the manual: https://cran.r-project.org/web/packages/reldist/reldist.pdf.
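To show the same row-wise pattern on the data from the question without depending on an installed package, here is a sketch that uses a hypothetical base R helper, weighted_gini, which computes the Gini coefficient from the Lorenz curve. It is not the reldist implementation, so treat it as an illustration under that assumption. The bin incomes in x are the ones chosen in the question.

```r
# Data from the question
df <- data.frame(location  = c("A", "B", "C", "D", "E", "F"),
                 no_income = c(20, 1, 40, 79, 12, 2),
                 income1   = c(13, 4, 56, 17, 9, 4),
                 income2   = c(27, 39, 49, 12, 19, 0),
                 income3   = c(0, 1, 4, 3, 27, 0))
x <- c(0, 300, 1000, 2000)   # income assigned to each bin

# Hypothetical helper (not from reldist): weighted Gini via the Lorenz curve.
# Returns NaN if the total income in a row is zero.
weighted_gini <- function(x, w) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord] / sum(w)
  p <- cumsum(w)               # cumulative population share
  L <- cumsum(w * x)
  L <- L / L[length(L)]        # cumulative income share
  n <- length(x)
  sum(L[-1] * p[-n]) - sum(L[-n] * p[-1])
}

# One Gini value per row: the household counts are the weights
df$gini <- apply(df[, 2:5], 1, function(w) weighted_gini(x, w))
df
```

The key difference from the loop in the question is that the household counts of each row are passed as the weights, so every row gets its own value.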
Sorry guys if this is a noob question.
I need help on how to loop over my data frame. Here is some sample data.
a <- c(10:29);
b <- c(40:59);
e <- rep(1,20);
test <- data.frame(a,b,e)
I need to manipulate column "e" using the following criteria for values in column "a"
for all values of
"a" <= 15, "e" = 1,
"a" > 15 & < 20, "e" = 2
"a" > 20 & < 25, "e" = 3
"a" > 25 & < 30, "e" = 4 and so on to look like this
result <- cbind(a,b,rep(1:4, each=5))
My actual data frame is over 100k long. Would be great if you could sort me out here.
data.frame(a, b, e=(1:4)[cut(a, c(-Inf, 15, 20, 25, 30))])
Update:
Greg's comment provides a more direct solution without the need to go via subsetting an integer vector with a factor returned from cut.
data.frame(a, b, e=findInterval(a, c(-Inf, 15, 20, 25, 30)))
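One detail worth checking before picking between the two: by default cut() uses intervals that are closed on the right, while findInterval() uses intervals that are closed on the left, so values that fall exactly on a break land in different groups. The sketch below compares them on the sample data; the variable names are hypothetical.

```r
a <- 10:29
breaks <- c(-Inf, 15, 20, 25, 30)

e_cut   <- as.integer(cut(a, breaks))                 # intervals (lo, hi]
e_fi    <- findInterval(a, breaks)                    # intervals [lo, hi)
e_fi_lo <- findInterval(a, breaks, left.open = TRUE)  # intervals (lo, hi]

a[e_cut != e_fi]                        # values where the two disagree
#[1] 15 20 25
identical(e_cut, e_fi_lo)               # left.open = TRUE matches cut()
#[1] TRUE
identical(e_fi, rep(1:4, each = 5))     # default matches the desired result
#[1] TRUE
```

So the cut() version follows the stated rule "a" <= 15 gives "e" = 1, while the default findInterval() reproduces the rep(1:4, each=5) column shown in the question.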
I would use cut() for this:
test$e = cut(test$a,
breaks = c(0, 15, 20, 25, 30),
labels = c(1, 2, 3, 4))
If you want to "generalize" the cut--in other words, where you don't know exactly how many sets of 5 (levels) you need to make--you can take a two-step approach using c() and seq():
test$e = cut(test$a,
breaks = c(0, seq(from = 15, to = max(test$a)+5, by = 5)))
levels(test$e) = 1:length(levels(test$e))
Since Backlin beat me to the cut() solution, here's another option (which I don't prefer in this case, but am posting just to demonstrate the many options available in R).
Use recode() from the car package.
require(car)
test$e = recode(test$a, "0:15 = 1; 15:20 = 2; 20:25 = 3; 25:30 = 4")
You don't need a loop.
You have nearly all you need:
test[test$a > 15 & test$a < 20, "e"] <- 2