refactor data.frame column values

Sorry guys if this is a noob question.
I need help on how to loop over my data frame. Here is some sample data.
a <- 10:29
b <- 40:59
e <- rep(1, 20)
test <- data.frame(a,b,e)
I need to manipulate column "e" using the following criteria for values in column "a"
for all values of
"a" <= 15, "e" = 1,
"a" > 15 & < 20, "e" = 2
"a" > 20 & < 25, "e" = 3
"a" > 25 & < 30, "e" = 4 and so on to look like this
result <- cbind(a,b,rep(1:4, each=5))
My actual data frame is over 100k long. Would be great if you could sort me out here.

data.frame(a, b, e=(1:4)[cut(a, c(-Inf, 15, 20, 25, 30))])
Update:
Greg's comment provides a more direct solution without the need to go via subsetting an integer vector with a factor returned from cut.
data.frame(a, b, e=findInterval(a, c(-Inf, 15, 20, 25, 30)))
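One detail worth checking before picking between the two: cut() and findInterval() treat the break points differently by default. The sketch below, which reuses the sample vector a from the question, shows that cut() closes intervals on the right while findInterval() closes them on the left, so a value sitting exactly on a break (such as 15) lands in different groups. The variable names e_cut, e_fi and e_fi_left_open are only illustrative.

```r
a <- 10:29
breaks <- c(-Inf, 15, 20, 25, 30)

# cut() uses right-closed intervals by default: (-Inf,15], (15,20], ...
e_cut <- as.integer(cut(a, breaks))

# findInterval() uses left-closed intervals by default: [-Inf,15), [15,20), ...
e_fi <- findInterval(a, breaks)

# left.open = TRUE makes findInterval() agree with cut() on the boundaries
e_fi_left_open <- findInterval(a, breaks, left.open = TRUE)

e_cut[a == 15]  # group assigned to the boundary value by cut()
e_fi[a == 15]   # group assigned to the boundary value by findInterval()
```

With this sample data, the default findInterval() call is the one that reproduces rep(1:4, each = 5) from the question, while cut() matches the stated rule "a" <= 15.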

I would use cut() for this:
test$e = cut(test$a,
             breaks = c(0, 15, 20, 25, 30),
             labels = c(1, 2, 3, 4))
If you want to "generalize" the cut--in other words, where you don't know exactly how many sets of 5 (levels) you need to make--you can take a two-step approach using c() and seq():
test$e = cut(test$a,
             breaks = c(0, seq(from = 15, to = max(test$a) + 5, by = 5)))
levels(test$e) = 1:length(levels(test$e))
Since Backlin beat me to the cut() solution, here's another option (which I don't prefer in this case, but am posting just to demonstrate the many options available in R).
Use recode() from the car package.
require(car)
test$e = recode(test$a, "0:15 = 1; 15:20 = 2; 20:25 = 3; 25:30 = 4")

You don't need a loop.
You have nearly all you need:
test[test$a > 15 & test$a < 20, "e"] <- 2
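Spelled out for every group, the logical-indexing approach might look like the sketch below. The boundaries in the question leave 20 and 25 unassigned, so this assumes half-open ranges (>= lower, < upper), which is what reproduces the desired rep(1:4, each = 5) result; adjust the comparisons if your real rule differs.

```r
a <- 10:29
b <- 40:59
test <- data.frame(a, b, e = 1)

# Assumed boundaries: lower bound inclusive, upper bound exclusive
test[test$a >= 15 & test$a < 20, "e"] <- 2
test[test$a >= 20 & test$a < 25, "e"] <- 3
test[test$a >= 25 & test$a < 30, "e"] <- 4

table(test$e)  # number of rows per group
```

This is fine for a handful of groups; for many groups, the cut() or findInterval() answers scale better.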

Related

Problem with row-wise operation in base R

I have a problem with performing row-wise operations using 'apply' function in R. I want to calculate the distance between two points:
d <- function(x, y) {
  length <- norm(x - y, type = "2")
  as.numeric(length)
}
The coordinates are given by two dataframes:
start <- data.frame(
  a = c(7, 5, 17, 1),
  b = c(5, 17, 1, 2))
stop <- data.frame(
  b = c(5, 17, 1, 2),
  c = c(17, 1, 2, 1))
My goal is to calculate successive distances given by the start and stop coordinates. I wish it worked like:
d(start[1,], stop[1,])
d(start[2,], stop[2,])
d(start[3,], stop[3,])
etc...
I have tried:
apply(X = start, MARGIN = 1, FUN = d, y = stop)
which gave some strange results. Can you please help me find the proper solution? I know how to perform the operation using the dplyr rowwise() function, but I want to use base R only.
Can you also explain why I got such strange results with apply()?

Loop over the sequence of rows and apply d to each pair of rows:
sapply(seq_len(nrow(start)), function(i) d(start[i,], stop[i,]))
[1] 12.165525 20.000000 16.031220 1.414214
Or, if we want to use apply, create a single data frame by cbind-ing the two data frames and then subset each row by index:
apply(cbind(start, stop), 1, FUN = function(x) d(x[1:2], x[3:4]))
[1] 12.165525 20.000000 16.031220 1.414214
Or we may use dapply from the collapse package for efficiency:
library(collapse)
dapply(cbind(start, stop), MARGIN = 1, parallel = TRUE,
       FUN = function(x) d(x[1:2], x[3:4]))
[1] 12.165525 20.000000 16.031220 1.414214
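On the "why" part of the question: as far as I can tell, apply(X = start, MARGIN = 1, FUN = d, y = stop) hands the whole stop data frame to d for every row of start, so each call compares one start point against all stop points at once rather than against the matching row. Since the Euclidean distance is just a square root of summed squared differences, a fully vectorized base R sketch is also possible. The names stop_pts, diffs and distances are mine, chosen to avoid masking base functions.

```r
start <- data.frame(a = c(7, 5, 17, 1),
                    b = c(5, 17, 1, 2))
stop_pts <- data.frame(b = c(5, 17, 1, 2),
                       c = c(17, 1, 2, 1))

# Element-wise differences between matching rows (matched by position, not by name)
diffs <- as.matrix(start) - as.matrix(stop_pts)

# Euclidean distance per row
distances <- unname(sqrt(rowSums(diffs^2)))
distances
```

This avoids calling a function once per row, which tends to matter once the data frames get large.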

R function for creating, naming and lagging variables

I have some data like so:
a <- c(1, 2, 9, 18, 6, 45)
b <- c(12, 3, 34, 89, 108, 44)
c <- c(0.5, 3.3, 2.4, 5, 13,2)
df <- data.frame(a, b,c)
I need to create a function to lag a lot of variables at once for a very large time series analysis with dozens of variables, without typing it all out. In short, I would like to create variables a.lag1, b.lag1 and c.lag1 and be able to add them to the original df specified above. I figure the best way to do so is by creating a custom function, something along the lines of:
lag.fn <- function(x) {
assign(paste(x, "lag1", sep = "."), lag(x, n = 1L)
return (assign(paste(x, "lag1", sep = ".")
}
The desired output is:
a.lag1 <- c(NA, 1, 2, 9, 18, 6, 45)
b.lag1 <- c(NA, 12, 3, 34, 89, 108, 44)
c.lag1 <- c(NA, 0.5, 3.3, 2.4, 5, 13, 2)
However, I don't get what I am looking for. Should I change the environment to the global environment? I would like to be able to use cbind to add the new variables to the original df. Thanks.

Easy using dplyr. Don't call data frames df, as it may cause confusion with the function of the same name. I'm using df1.
library(dplyr)
df1 <- data.frame(a, b, c)
df1 <- df1 %>%
  mutate(a.lag1 = lag(a),
         b.lag1 = lag(b),
         c.lag1 = lag(c))

The data frame statement in the question is invalid since a, b and c are not the same length. What you can do is create a zoo series. Note that the lag specified in lag.zoo can be a vector of lags as in the second example below.
library(zoo)
z <- merge(a = zoo(a), b = zoo(b), c = zoo(c))
lag(z, -1) # lag all columns
lag(z, 0:-1) # each column and its lag

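We can use mutate_all. Note that funs() has been deprecated since dplyr 0.8.0, so the function is passed as a named list here; the new columns are named a_lag, b_lag and c_lag.
library(dplyr)
df %>%
  mutate_all(list(lag = ~ lag(.)))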

If everything else fails, you can use a simple base R function:
my_lag <- function(x, steps = 1) {
  c(rep(NA, steps), x[1:(length(x) - steps)])
}
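To cover the "dozens of variables" part of the question with base R only, a helper like my_lag can be applied to every column with lapply and the result bound back onto the data frame. This is a minimal sketch using the sample data from the question; the names df1 and lagged are illustrative, and it assumes steps is smaller than the number of rows.

```r
df1 <- data.frame(a = c(1, 2, 9, 18, 6, 45),
                  b = c(12, 3, 34, 89, 108, 44),
                  c = c(0.5, 3.3, 2.4, 5, 13, 2))

# Same helper as above: pad with NA at the front, drop values at the end
my_lag <- function(x, steps = 1) {
  c(rep(NA, steps), x[1:(length(x) - steps)])
}

# Lag every column, then rename to a.lag1, b.lag1, c.lag1
lagged <- as.data.frame(lapply(df1, my_lag))
names(lagged) <- paste(names(df1), "lag1", sep = ".")

df1 <- cbind(df1, lagged)
names(df1)
```

No assign() or global environment tricks are needed, because the new columns are built as an ordinary data frame and attached with cbind.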

function writing in R

I have a dataframe(df) with only three columns showing ideal weight(x) in kgs (column1), age(y) in years (column 2) and gender(z) (column 3, boy coded 1 & girl coded 2) for school students. I want to write a function for getting what is ideal weight of a school student at given age and gender. My novice attempt is shown below:
idealwt<-function(age,gender){
age=df$y
gender=df$z
idealwt = df$x[age==df$y & gender==df$z]
return(idealwt)
}
However, the above function returns the whole vector instead of a specific value.

The problem in the OP's function comes from creating the objects 'age' and 'gender' inside the function, which overwrite the arguments of the same names. In essence, we are comparing df$y == df$y and df$z == df$z, which is TRUE for all elements, so the output is the whole vector. Instead, we can define the function without age = df$y and gender = df$z:
idealwt <- function(age, gender) {
  df$x[age == df$y & gender == df$z]
}
idealwt(12, 2)
#[1] 38 42
data
df <- data.frame(x = c(45, 38, 55, 33, 42), y = c(15, 12, 18, 14, 12),
                 z = c(1, 2, 1, 1, 2))
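The function above still depends on a global object called df. A small variation, sketched here with a hypothetical extra argument named data, passes the data frame in explicitly, which makes the function easier to reuse and test. The column names x, y and z are the ones from the example data.

```r
df <- data.frame(x = c(45, 38, 55, 33, 42),
                 y = c(15, 12, 18, 14, 12),
                 z = c(1, 2, 1, 1, 2))

# data: data frame with columns x (ideal weight), y (age), z (gender code)
idealwt <- function(data, age, gender) {
  data$x[data$y == age & data$z == gender]
}

idealwt(df, 12, 2)  # all ideal weights recorded for age 12, gender 2
idealwt(df, 99, 1)  # no matching row gives numeric(0)
```

If several rows match, all of their weights are returned, so the caller may want to summarize them, for example with mean().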

Convert percentage to letter grade in R

I have created an R script to calculate final scores for a class I am teaching, and would like to use R to create the final letter grade. How can I take a vector of percentages such as:
grades <- c(.47, .93, .87, .86, .79, .90)
and convert them to letter grades given the appropriate percentage ranges, say, 70-79 = C, 80-89 = B, 90-100 = A?

grades <- c(.47, .93, .87, .86, .79, .90)
letter.grade <- cut(grades, c(0, 0.7, 0.8, 0.9, Inf), right = FALSE, labels = LETTERS[4:1])
letter.grade
As @MrFlick suggested, you can use cut. The right = FALSE option means that each interval is open on the right-hand end, so a score of exactly 0.9 falls into the top interval. For cut to work, just pass your grades and the cut points you want as arguments.
Thanks to @jlhoward for the improvement suggestion.

Another option would be to use findInterval
LETTERS[4:1][findInterval(grades, c(0,0.7,0.8,0.9))]
#[1] "D" "A" "B" "B" "C" "A"

In case anyone like me wants + and - too:
numeric_grades = c(98, 65, 78)
cutoff = c(0, seq(60, 100, by = 10/3))
cutoff[length(cutoff)] = Inf
grades = c("F", paste0(toupper(rep(letters[4:1], each = 3)), rep(c("-","","+"),4)))
cut(numeric_grades, cutoff, right=FALSE, labels=grades)
# [1] A+ D C+

How to apply a distribution function for each row in data frame

I know similar questions have been asked in this site here, here, and here, but none of them tackles my problem.
I have a data frame and want to apply the rdirichlet function (from gtools) to each row, so each row should be treated as alpha.
data = NULL
data <- data.frame(rbind(
oct = c(60, 32, 8),
sep = c(53, 35, 12),
ago = c(54, 40, 6)
))
data <- data/100*1000
library(gtools) # contains the function
sim <- 10000 # simulation
My first attempt was to use apply. It does work, but the output is not that clear for conducting further analysis; each row's computation becomes a vector:
p = apply(data, 1, function(x) rdirichlet(sim, alpha = x + 1))
I also tried a loop, without success:
p = NULL
for(i in 1:length(data)) {
p[i] <- rdirichlet(sim, alpha = data[i] + 1)
}
Any tip how can I solve this?

Well firstly you might want to change the data in your anonymous function in the apply to x to match the x in function(x)
apply(data, 1, function(x) rdirichlet(sim, alpha = x + 1))
This works for me, as in it provides an output with three columns and 30000 rows.

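Two important things here. First, vectorizing is the best way to go:
ans <- apply(data, 1, function(x) rdirichlet(sim, alpha = x + 1))
By doing this, you'll receive each row's draws flattened into one column vector, so ans has sim * k rows (k being the number of components, 3 here) and one column per row of data.
Second, you'll need to subset the blocks of sim rows yourself. For example, the difference between the first two components for the first row of data is:
margin <- ans[1:sim, 1] - ans[(sim + 1):(2 * sim), 1]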
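If indexing into the flattened apply() output feels error-prone, one alternative is to keep the draws for each row as their own sim-by-3 matrix inside a list. The sketch below is self-contained and therefore uses a base R stand-in, rdirichlet_base, which is my own helper (normalized gamma draws) rather than the gtools function; with gtools installed, gtools::rdirichlet should be usable in its place.

```r
set.seed(1)

data <- data.frame(rbind(
  oct = c(60, 32, 8),
  sep = c(53, 35, 12),
  ago = c(54, 40, 6)
))
data <- data / 100 * 1000
sim <- 10000

# Stand-in for gtools::rdirichlet(): normalize independent gamma draws
rdirichlet_base <- function(n, alpha) {
  k <- length(alpha)
  g <- matrix(rgamma(n * k, shape = alpha), ncol = k, byrow = TRUE)
  g / rowSums(g)
}

# One sim x 3 matrix of draws per row of data, kept in a named list
draws <- lapply(seq_len(nrow(data)), function(i) {
  rdirichlet_base(sim, unlist(data[i, ]) + 1)
})
names(draws) <- rownames(data)

# Difference between the first two components for the "oct" row
margin <- draws$oct[, 1] - draws$oct[, 2]
dim(draws$oct)
```

Each list element can then be summarized on its own, for instance with colMeans(draws$sep).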
