Currently, this code does what I want: dx$res is a vector that takes its values from dx$val1 or dx$val2 depending on the value of dx$x0.
x0<-c(1,2,1,2,2,1)
val1<-c(8,6,4,5,3,2)
val2<-c(4,8,6,7,9,5)
dx<-data.frame(x0,val1,val2)
dx$res<-(dx$x0==1)*dx$val1+(dx$x0==2)*dx$val2
I would like to know whether there is a more elegant way to do this, for example with an apply function.
One option is model.matrix with rowSums. It also generalizes to any number of distinct values in the 'x0' column. (The code below assumes dx holds only x0 and the value columns, i.e. dx$res has not been added yet.)
dx$res <- rowSums(dx[-1]*model.matrix(~ factor(x0) - 1 , dx))
dx$res
#[1] 8 8 4 7 9 2
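Another possibility, offered only as a sketch, is matrix indexing with a two-column (row, column) index. It assumes that x0 holds the position (1 or 2) of the value column to read; the helper name vals is introduced here purely for illustration.

```r
x0   <- c(1, 2, 1, 2, 2, 1)
val1 <- c(8, 6, 4, 5, 3, 2)
val2 <- c(4, 8, 6, 7, 9, 5)
dx   <- data.frame(x0, val1, val2)

# Value columns as a matrix (hypothetical helper name)
vals <- as.matrix(dx[c("val1", "val2")])

# Each row of the index matrix is a (row, column) pair
dx$res <- vals[cbind(seq_len(nrow(dx)), dx$x0)]
dx$res
#[1] 8 8 4 7 9 2
```

Selecting the value columns by name keeps this working even if dx already contains a res column.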
I have two dataframes let's say A
col1 col2
4 7
5 8
and B
col1 col2
2 5
1 4
Now, I want to compute the percentage change between each pair of corresponding elements in the two dataframes: between element 1,1 in A and B, between element 2,1 in A and B, and so on. I want to store these percentage changes in a 2 x 2 dataframe as well. Does anyone know how to do this without looping over the dataframes?
As these are equal-sized data.frames, simply subtracting one from the other and dividing by one of the datasets gives the output:
(A - B)/A
R arithmetic on data frames (and matrices) is element-wise, so you can simply divide one by the other.
A/B performs the division element by element, so the complete formula for the relative change is (A-B)/A; multiply by 100 to express it as a percentage.
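As a minimal sketch of the element-wise arithmetic, using the two data frames from the question: the multiplication by 100 and the rounding are additions for display, and the sign convention (A minus B, relative to A) follows the answer above; use (B - A)/A if the change should be measured from A to B.

```r
A <- data.frame(col1 = c(4, 5), col2 = c(7, 8))
B <- data.frame(col1 = c(2, 1), col2 = c(5, 4))

# Arithmetic on equal-sized data frames is element by element
pct <- (A - B) / A * 100

# The result is again a 2 x 2 data frame
round(pct, 1)
```

For example, element 1,1 is (4 - 2)/4 * 100 = 50.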
I am using R to analyze a survey. Several of the columns include numbers 1-10, depending on how survey respondents answered the respective questions. I'd like to change the 1-10 scale to a 1-3 scale. Is there a simple way to do this? I was writing a complicated set of for loops and if statements, but I feel like there must be a better way in R.
I'd like to change numbers 1-3 to 1; numbers 4 and 8 to 2; numbers 5-7 to 3, and numbers 9 and 10 to NA.
So in the snippet below, OriginalColumn would become NewColumn.
OriginalColumn=c(4,9,1,10,8,3,2,7,5,6)
NewColumn=c(2,NA,1,NA,2,1,1,3,3,3)
Is there an easy way to do this without a bunch of crazy for loops? Thanks!
You can do this using positional indexing:
> c(1,1,1,2,3,3,3,2,NA,NA)[OriginalColumn]
[1] 2 NA 1 NA 2 1 1 3 3 3
This is simpler than repeated/nested ifelse calls because a single vectorized lookup replaces the whole chain of conditions (thus easier to read, write, and understand; and probably faster). In essence, you're creating a new vector that contains the new value for every value you want to replace. So, for values 1:3 you want 1, thus the first three elements of the vector are 1, and so forth. You then use your original vector to extract the new values based on the positions of the original values.
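Since the question mentions several such columns, the same lookup could be applied to all of them at once. The following is only a sketch: the data frame survey and its columns q1 and q2 are made up for illustration.

```r
# Hypothetical survey data: two questions answered on a 1-10 scale
survey <- data.frame(q1 = c(4, 9, 1, 10, 8),
                     q2 = c(3, 2, 7, 5, 6))

# Position i of the lookup vector holds the new code for original value i
lookup <- c(1, 1, 1, 2, 3, 3, 3, 2, NA, NA)

# Recode every column; assigning into survey[] keeps the data frame structure
survey[] <- lapply(survey, function(col) lookup[col])
survey$q1
#[1]  2 NA  1 NA  2
```

This relies on every answer being a whole number between 1 and 10, because the answers are used directly as positions.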
You could also try
library(car)
recode(OriginalColumn, '1:3=1; c(4,8)=2; 5:7=3; else=NA')
#[1] 2 NA 1 NA 2 1 1 3 3 3
I have a data frame that contains multiple rows and multiple columns.
I have a character vector that contains the names of some of the columns in the data frame. The number of columns can vary.
For each row, I have to identify whether at least one of these columns is not NA (basically any(!is.na(df[namescolumns])) for each row), and then subset the rows for which this is TRUE.
Actually, any(!is.na(df[1,][namescolumns])) works well, but it's only for the first line.
I could easily do a for loop, which is my first reflex as a programmer and because it works for the first line, but I'm sure it's not the R way and that there is a way to do this with an "apply" (lapply, mapply, sapply, tapply or other), but I can't figure out which one and how.
Thank you.
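Try using apply over the first dimension (rows):
apply(df, 1, function(x) any(!is.na(x[namescolumns])))
Because the function returns a single value per row, the result is a logical vector with one element per row, which can be used directly to subset df. (Only when the function returns several values per row does apply return them transposed, in which case you might want to wrap the whole statement inside t(.).)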
You can use a combination of lapply and Reduce
no.na.in.cols <- Reduce(`&`, lapply(colnames, function (name) !is.na(df[[name]])))
to get a logical vector that is TRUE for the rows with no NA values in any of the columns in colnames, which can in turn be used to subset the data. (Replace `&` with `|` to keep the rows where at least one of the columns is not NA, as the question asks.)
df[no.na.in.cols,]
For example, given:
df <- data.frame(a = c(1,2,3,4,NA,6,7),
b = c(2,4,6,8,10,12,14),
c = c("one","two","three","four","five","six","seven"),
d = c("a",NA,"c","d","e","f","g")
)
colnames <- c("a","d")
You can get:
> df[Reduce(`&`, lapply(colnames, function (name) !is.na(df[name]))),]
  a  b     c d
1 1  2   one a
3 3  6 three c
4 4  8  four d
6 6 12   six f
7 7 14 seven g
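A loop-free alternative, sketched here with rowSums, makes the difference between "at least one column is not NA" and "no column is NA" explicit. The names cols, n.present, any.present and all.present are introduced only for this illustration.

```r
df <- data.frame(a = c(1, 2, 3, 4, NA, 6, 7),
                 b = c(2, 4, 6, 8, 10, 12, 14),
                 c = c("one", "two", "three", "four", "five", "six", "seven"),
                 d = c("a", NA, "c", "d", "e", "f", "g"))
cols <- c("a", "d")   # hypothetical name, avoids masking base::colnames

# Number of non-NA values per row among the selected columns
n.present <- rowSums(!is.na(df[cols]))

any.present <- n.present > 0              # at least one column is not NA
all.present <- n.present == length(cols)  # no NA in any selected column

df[any.present, ]   # keeps all 7 rows for this data
df[all.present, ]   # keeps rows 1, 3, 4, 6 and 7
```

With this data every row has at least one non-NA value in a or d, so only the stricter all.present filter removes anything.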
I want to pick elements from a vector and exclude 3 values by their position.
I know about the x[-n] syntax to exclude a single value, but I don't know how to exclude more than one.
You could use the c function to combine the values into a vector:
a <- 1:10
a[-c(1,2,3)]
[1]  4  5  6  7  8  9 10
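When the positions to exclude are held in a variable, one caveat is worth knowing: a negative index built from an empty vector selects nothing rather than everything. The sketch below uses made-up variable names (skip, none) and shows a setdiff-based form that avoids the problem.

```r
a <- 1:10
skip <- c(1, 2, 3)        # positions to exclude (hypothetical name)

a[-skip]                  # same as a[-c(1, 2, 3)]
#[1]  4  5  6  7  8  9 10

# Caveat: an empty set of positions selects nothing instead of everything
none <- integer(0)
length(a[-none])
#[1] 0

# Working on the positions with setdiff() behaves as expected in that case too
a[setdiff(seq_along(a), none)]
#[1]  1  2  3  4  5  6  7  8  9 10
```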
Super short version: I'm trying to use a user-defined function to populate a new column in a dataframe with the command:
TestDF$ELN<-EmployeeLocationNumber(TestDF$Location)
However, when I run the command, it seems to just apply EmployeeLocationNumber to the first row's value of Location rather than using each row's value to determine the new column's value for that row individually.
Please note: I'm trying to understand R, not just perform this particular task. I was actually able to get the output I was looking for using the Apply() function, but that's irrelevant. My understanding is that the above line should work on a row-by-row basis, but it isn't.
Here are the specifics for testing:
TestDF<-data.frame(Employee=c(1,1,1,1,2,2,3,3,3),
Month=c(1,5,6,11,4,10,1,5,10),
Location=c(1,5,6,7,10,3,4,2,8))
This testDF keeps track of where each of 3 employees was over the course of the year among several locations.
(You can think of "Location" as unique to each Employee...it is essentially a unique ID for that row.)
The function EmployeeLocationNumber takes a location and outputs a number indicating the order in which the employee visited that location. For example EmployeeLocationNumber(8) = 2 because it was the second location visited by the employee who visited it.
EmployeeLocationNumber <- function(Site){
  CurrentEmployee <- subset(TestDF, Location==Site, select=Employee, drop = TRUE)[[1]]
  LocationDate <- subset(TestDF, Location==Site, select=Month, drop = TRUE)[[1]]
  LocationNumber <- length(subset(TestDF, Employee==CurrentEmployee & Month<=LocationDate, select=Month)[[1]])
  return(LocationNumber)
}
I realize I probably could have packed all of that into a single subset command, but I didn't know how referencing worked when you used subset commands inside other subset commands.
So, keeping in mind that I'm really trying to understand how to work in R, I have a few questions:
Why won't TestDF$ELN<-EmployeeLocationNumber(TestDF$Location) work row-by-row like other assignment statements do?
Is there an easier way to reference a particular value in a dataframe based on the value of another one? Perhaps one that does not return a dataframe/list that then must be flattened and extracted from?
I'm sure the function I'm using is laughably un-R-like...what should I have done to essentially emulate an INNER Join type query?
Using logical indexing, the condensed one-liner replacement for your function is:
EmployeeLocationNumber <- function(Site){
with(TestDF[do.call(order, TestDF), ], which(Location[Employee==Employee[which(Location==Site)]] == Site))
}
Of course this isn't the most readable way, but it demonstrates the principles of logical indexing and which() in R. Then, like others have said, just wrap it up with a vectorized *ply function to apply this across your dataset.
A) TestDF$Location is a vector. Your function is not set up to return a vector, so giving it a vector will probably fail.
B) In what sense is Location:8 the "second location visited"?
C) If you want within-group ordering, then you need to pass your dataframe, split up by employee, to a function that calculates a result.
D) Conditional access of a data.frame typically involves logical indexing and/or the use of which()
If you just want the sequence of visits by employee, try this:
(The first argument is Month, since that is what determines the sequence of locations. seq_along is used rather than seq so that a group with a single row is also handled correctly.)
with(TestDF, ave(Month, Employee, FUN=seq_along))
[1] 1 2 3 4 1 2 1 2 3
TestDF$LocOrder <- with(TestDF, ave(Month, Employee, FUN=seq_along))
If you wanted the second location for Employee 3, it would be:
subset(TestDF, LocOrder==2 & Employee==3, select= Location)
#   Location
# 8        2
The vectorized nature of R (what you are calling row-by-row) works not by repeatedly calling the function with each next value of the arguments, but by passing the entire vector at once and operating on all of it at one time. EmployeeLocationNumber, however, returns only a single value, so that value gets repeated for the entire data set.
Also, your example for EmployeeLocationNumber does not match your description.
> EmployeeLocationNumber(8)
[1] 3
Now, one way to vectorize a function in the manner you are thinking (repeated calls for each value) is to pass it through Vectorize()
TestDF$ELN<-Vectorize(EmployeeLocationNumber)(TestDF$Location)
which gives
> TestDF
  Employee Month Location ELN
1        1     1        1   1
2        1     5        5   2
3        1     6        6   3
4        1    11        7   4
5        2     4       10   1
6        2    10        3   2
7        3     1        4   1
8        3     5        2   2
9        3    10        8   3
As to your other questions, I would just write it as
TestDF$ELN<-ave(TestDF$Month, TestDF$Employee, FUN=rank)
The logic is: take the months, look at them in groups by employee, and return the rank of each month within its group (where it falls in order).
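Because rank looks at the month values rather than at row positions, this approach should not depend on the rows being sorted. The sketch below checks that with a fixed, arbitrary row permutation; the names shuffled and restored are made up for the illustration.

```r
TestDF <- data.frame(Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
                     Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8))

# Reorder the rows with a fixed permutation (for illustration only)
shuffled <- TestDF[c(9, 2, 5, 1, 8, 4, 7, 3, 6), ]

# Rank the months within each employee
shuffled$ELN <- ave(shuffled$Month, shuffled$Employee, FUN = rank)

# Put the rows back in their original order and inspect the result
restored <- shuffled[order(as.integer(rownames(shuffled))), ]
restored$ELN
#[1] 1 2 3 4 1 2 1 2 3
```

Note that rank averages tied values by default, so two visits by the same employee in the same month would both receive a fractional rank.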
Your EmployeeLocationNumber function takes a vector in and returns a single value.
The assignment to create a new data.frame column therefore just gets a single value:
EmployeeLocationNumber(TestDF$Location) # returns 1
TestDF$ELN<-1 # Creates a new column with the single value 1 everywhere
Assignment doesn't do any magic like that. It takes a value and puts it somewhere. In this case the value 1. If the value was a vector of the same length as the number of rows, it would work as you wanted.
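The difference can be seen in a small sketch with two made-up functions, one returning a single value and one returning a value per element:

```r
df <- data.frame(x = c(3, 1, 2))

first_only <- function(v) v[1] * 10   # returns a single value
whole_vec  <- function(v) v * 10      # returns one value per element

df$a <- first_only(df$x)   # the length-1 result is recycled to every row
df$b <- whole_vec(df$x)    # the length-3 result fills the column row by row
df
#   x  a  b
# 1 3 30 30
# 2 1 30 10
# 3 2 30 20
```

In both cases the function is called exactly once with the whole column; only the length of what it returns differs.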
I'll get back to you on that :)
Ditto.
Update: I finally worked out some code to do it, but by then @DWin had a much better solution :(
TestDF$ELN <- unlist(lapply(split(TestDF, TestDF$Employee), function(x) rank(x$Month)))
...I guess the ave function does pretty much what the code above does. But for the record:
First I split the data.frame into sub-frames, one per employee. Then I rank the months (just in case your months are not in order). You could use order too, but rank can handle ties better. Finally I combine all the results into a vector and put it into the new column ELN.
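One caveat with combining the pieces through unlist: the values come back in group order, which matches the row order only when the rows are already grouped by employee, as they are here. A variant using unsplit, sketched below with the hypothetical helper name ranks, puts each value back at its original row position.

```r
TestDF <- data.frame(Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
                     Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8))

# Rank the months separately for each employee
ranks <- lapply(split(TestDF$Month, TestDF$Employee), rank)

# unsplit() reverses split(), restoring the original row order
TestDF$ELN <- unsplit(ranks, TestDF$Employee)
TestDF$ELN
#[1] 1 2 3 4 1 2 1 2 3
```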
Update again: regarding question 2, "What is the best way to reference a value in a dataframe?":
This depends a bit on the specific problem, but if you have a value, say Employee=3, and want to find all rows in the data.frame that match it, then simply:
TestDF$Employee == 3 # Returns logical vector with TRUE for all rows with Employee == 3
which(TestDF$Employee == 3) # Returns a vector of indices instead
TestDF[which(TestDF$Employee == 3), ] # Subsets the data.frame on Employee == 3
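The logical vector and the which() form give the same rows as long as the column has no missing values. With NAs they differ, as this sketch on a small made-up data frame d suggests:

```r
# Hypothetical data with a missing Employee value
d <- data.frame(Employee = c(3, NA, 1, 3),
                Month    = c(1, 2, 3, 4))

# The comparison yields NA for the missing value ...
d$Employee == 3
#[1]  TRUE    NA FALSE  TRUE

# ... and an NA in a logical index produces a row filled with NA
nrow(d[d$Employee == 3, ])
#[1] 3

# which() keeps only the positions that are TRUE, dropping the NA
nrow(d[which(d$Employee == 3), ])
#[1] 2
```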