I have just started learning data science and I have a question that is probably easy for you.
I have a dataset that looks something like this:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
                 time = c(1,2,3,1,2,3,1,2,3),
                 y = rnorm(9),
                 x1 = LETTERS[seq(from = 1, to = 9)],
                 x2 = c(0,0,0,0,1,0,1,1,1),
                 c2 = rnorm(9))
df
# id time y x1 x2 c2
# 1 1 1 0.6364831 A 0 -0.066480473
# 2 1 2 0.4476390 B 0 0.161372575
# 3 1 3 1.5113458 C 0 0.343956178
# 4 2 1 0.3532957 D 0 0.279987147
# 5 2 2 0.3401402 E 1 -0.462635393
# 6 2 3 -0.3160222 F 0 0.338454940
# 7 3 1 -1.3797158 G 1 -0.621169576
# 8 3 2 1.4026640 H 1 -0.005690801
# 9 3 3 0.2958363 I 1 -0.176488132
I am writing a function with multiple steps. I would like to feed the function two elements: the dataset and the variable of interest.
However, the function breaks down in an intermediate step, when I try to filter my data using data.table. The crucial step of the function looks something like this:
library(data.table)

testfun <- function(dataset, var) {
  intermediatedf <- unique(setDT(dataset)[var == 1 & c2 > 0, .(y)])
  return(intermediatedf)
}
However, running df2 <- testfun(df, y) fails.
Can anyone help me and explain how I can create a function where I pass in both a dataset and a variable?
Thank you in advance for your help
You can use substitute and eval:
testfun <- function(dataset, var) {
  var <- substitute(var)
  intermediatedf <- unique(setDT(dataset)[eval(var) == 1 & c2 > 0, .(y)])
  return(intermediatedf)
}
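For example, a quick usage sketch (assuming the variable of interest is the binary column x2 rather than the continuous y):
df2 <- testfun(df, x2)
df2   # the unique y values from rows where x2 == 1 and c2 > 0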
I have a large list of data frames with different dimensions. My real data are too big to post, so I created a small example list (myList) as follows:
myList <- list(data.frame(ID = c("T-02", "T-04", "T-06"),
                          Test = rnorm(3, mean = 50, sd = 10), Shape = c("C", "S", "r"),
                          Time = rnorm(3, mean = 90, sd = 10), Event = c(1, 0, 1),
                          KPS = c(90, 100, 70), Sex = c("F", "M", "F"), Race = c("W", "B", "W")),
               data.frame(ID = c("T-02", "T-04", "T-06"), Shape = c("C", "S", "r"),
                          Value = 1:3, Time = rnorm(3, mean = 90, sd = 10),
                          Event = c(1, 0, 1),
                          KPS = c(90, 100, 70), Sex = c("F", "M", "F"), Race = c("W", "B", "W")),
               data.frame(ID = c("T-02", "T-04", "T-06"),
                          Test = rnorm(3, mean = 50, sd = 10),
                          Value = 1:3, Time = rnorm(3, mean = 90, sd = 10), Event = c(1, 0, 1),
                          KPS = c(90, 100, 70), Sex = c("F", "M", "F"), Race = c("W", "B", "W")),
               data.frame(ID = c("T-02", "T-04", "T-06"),
                          Test = rnorm(3, mean = 50, sd = 10),
                          Value = 1:3, Time = rnorm(3, mean = 90, sd = 10), Event = c(1, 0, 1),
                          KPS = c(90, 100, 70), Sex = c("F", "M", "F"), Race = c("W", "B", "W")))
I am looking for a function with which I can drop all columns after the "Event" column, or select the columns from the first up to the "Event" column.
I can do this easily for such short data with the code below:
new_drop <- lapply(myList, function(x) { x[, !names(x) %in% c("KPS", "Sex", "Race")] })
But in my real data I want to remove more than 20 columns like this, so I wonder if there is a simpler way.
I also tried the following code, but it did not work properly:
new_drop1 <- lapply(myList, function(x) { x[, endsWith(colnames(x), "Event")] })
I appreciate any help.
You could use grep: grep("Event", names(x)) returns the position of the "Event" column, and seq() of that single position expands to 1:position, i.e. the indices of the columns to keep:
lapply(myList, function(x) x[seq(grep("Event", names(x)))])
[[1]]
ID Test Shape Time Event
1 T-02 65.11001 C 94.53361 1
2 T-04 70.25636 S 84.86061 0
3 T-06 44.56480 r 85.30492 1
[[2]]
ID Shape Value Time Event
1 T-02 C 1 93.40279 1
2 T-04 S 2 85.78726 0
3 T-06 r 3 97.02140 1
[[3]]
ID Test Value Time Event
1 T-02 39.89387 1 94.80438 1
2 T-04 48.28122 2 85.62445 0
3 T-06 49.47685 3 90.10609 1
[[4]]
ID Test Value Time Event
1 T-02 38.55385 1 78.33900 1
2 T-04 47.60908 2 77.63453 0
3 T-06 43.59754 3 92.25645 1
The code below keeps only the columns up to and including "Event", using which. which is a base function that returns the indices satisfying a condition; here the only condition is colnames(x) == "Event".
new_drop <- lapply(myList, function(x) {
  res <- NULL
  if (is.data.frame(x)) {
    eventcol <- which(colnames(x) == "Event")
    res <- x[, 1:eventcol]
  } else {
    res <- x
  }
  return(res)
})
This would work even if not all your list elements are of class data.frame (see the quick check after the output below):
> lapply(new_drop, head)
[[1]]
ID Test Shape Time Event
1 T-02 57.38475 C 76.05545 1
2 T-04 40.84934 S 85.98049 0
3 T-06 45.44281 r 85.18336 1
[[2]]
ID Shape Value Time Event
1 T-02 C 1 101.68492 1
2 T-04 S 2 100.13524 0
3 T-06 r 3 89.14877 1
[[3]]
ID Test Value Time Event
1 T-02 42.92581 1 82.37073 1
2 T-04 42.10800 2 90.51706 0
3 T-06 50.51329 3 96.52649 1
[[4]]
ID Test Value Time Event
1 T-02 49.13385 1 85.91036 1
2 T-04 52.72536 2 98.83747 0
3 T-06 68.96858 3 96.51575 1
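As a quick check of that claim (a hypothetical sketch, not part of the original answer), the same logic wrapped in a named helper leaves a non-data.frame element untouched:
keep_to_event <- function(x) {   # hypothetical name, same logic as above
  if (is.data.frame(x)) x[, 1:which(colnames(x) == "Event")] else x
}
mixedList <- c(myList, list(1:3))       # append a plain integer vector
lapply(mixedList, keep_to_event)[[5]]   # returns 1:3 unchanged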
For each data frame you can find the number of columns with ncol and the position of the Event column with which. For example, with mtcars, assume the wt column is the "Event" column.
cars <- mtcars
ncols <- ncol(cars)
# 11
wt <- which(colnames(cars) == "wt")
# 6
# To remove columns 1 to wt-1, or columns wt+1 to the end:
before <- wt - 1
cars1 <- cars[ , -(1:before)]
after <- wt + 1
cars2 <- cars[ , -(after:ncols)]
You can use sapply to find the dimensions of all your data frames and the positions of the Event column, followed by an lapply to process the column deletions, as sketched below.
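A minimal sketch of that idea, applied to the example list above (assuming every element has an "Event" column):
event_pos <- sapply(myList, function(x) which(colnames(x) == "Event"))
trimmed   <- lapply(seq_along(myList), function(i) myList[[i]][, seq_len(event_pos[i])])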
I am trying to recreate a Stata code snippet in R and I have hit a snag.
In Stata, using the lagged value gives this result:
A B
1 2
1 2
1 2
1 2
replace A=B if A==A[_n-1]
A B
1 2
2 2
1 2
2 2
If I try to replicate it in R, I get the following:
temp <- data.frame("A" = rep(1,4), "B" = rep(2,4))
temp
A B
1 2
1 2
1 2
1 2
library(dplyr)
temp <- temp %>% mutate(A = ifelse(A == lag(A, 1), B, A))
temp
A B
2 2
2 2
2 2
2 2
I need it to be the same as in Stata.
lag would not be used here because it uses the original values of A, whereas at each iteration we need the most recently updated values.
Define an Update function and apply it using accumulate2 from the purrr package. accumulate2 returns a list, so unlist it.
library(purrr)
Update <- function(prev, A, B) if (A == prev) B else A
transform(temp, A = unlist(accumulate2(A, B[-1], Update)))
giving:
A B
1 1 2
2 2 2
3 1 2
4 2 2
Another way to write this uses fn$ from the gsubfn package, which causes formula arguments to be interpreted as functions. The function it builds uses the free variables in the formula as the arguments, in the order encountered.
library(gsubfn)
library(purrr)
transform(temp, A = unlist(fn$accumulate2(A, B[-1], ~ if (prev == A) B else A)))
It looks like we need to update the values after each iteration:
for (i in 2:nrow(temp)) {
  temp$A[i] <- if (temp$A[i] == temp$A[i-1]) temp$B[i] else temp$A[i]
}
temp
# A B
#1 1 2
#2 2 2
#3 1 2
#4 2 2
Or, as #G.Grothendieck mentioned in the comments, it can be made more compact with:
for(i in 2:nrow(temp)) if (temp$A[i] == temp$A[i-1]) temp$A[i] <- temp$B[i]
Here's a function that will do it:
lagger <- function(x, y) {
  # replace x[i] with y[i] whenever x[i] equals the previously updated value
  current <- x[1]
  out <- x
  for (i in 2:length(x)) {
    if (x[i] == current) {
      out[i] <- y[i]
    }
    current <- out[i]
  }
  out
}
lagger(temp$A, temp$B)
[1] 1 2 1 2
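To update the data frame itself, assign the result back:
temp$A <- lagger(temp$A, temp$B)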
I'd like to make a frequency table like this in R:
df = data.frame(aa = c(9,8,7,8), bb = c(9,7,9,8), cc = c(7,9,8,7))
apply(df, 2, table)
# outputs:
# aa bb cc
# 7 1 1 2
# 8 2 1 1
# 9 1 2 1
But if one of the values has a count of 0 in some column (e.g. if we change the above so that df$cc has no 9), we get a list instead of a nice data frame.
# example that gives a list
df = data.frame(aa = c(9,8,7,8), bb = c(9,7,9,8), cc = c(7,8,8,7))
apply(df, 2, table)
What's a simple way do something similar that will guarantee dataframe output regardless of the counts?
I can imagine a number of solutions that seem messy or hacky; for example, this produces the desired result:
# example of a messy but correct solution
df = data.frame(aa = c(9,8,7,8), bb = c(9,7,9,8), cc = c(7,8,8,7))
apply(df, 2, function(x) summary(factor(x, levels = unique(unlist(df)))))
Is there a cleaner way to do this?
I'll go ahead and answer, though I still object to the lack of criteria. If we think of "tidy" as the opposite of "messy", then we should first tidy the input data into a long format. Then we can do a two-way table:
library(tidyr)
df %>% gather %>%
with(table(value, key))
# key
# value aa bb cc
# 7 1 1 2
# 8 2 1 2
# 9 1 2 0
Thanks to Markus for a base R version:
table(stack(df))
# ind
# values aa bb cc
# 7 1 1 2
# 8 2 1 2
# 9 1 2 0
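If you specifically need a data frame rather than a table object, one option (not part of the answers above) is to coerce the two-way table with as.data.frame.matrix:
as.data.frame.matrix(table(stack(df)))
# a data frame with row names 7, 8, 9 and columns aa, bb, cc holding the counts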
Say I have a data.table and I want to calculate a new variable based on several conditions on the old variables, like this:
library(data.table)
test <- data.table(a = c(1,1,0), b = c(0,1,0), c = c(1,1,1))
test[a==1 & b==1 & c==1,test2:=1]
But I actually have many more conditions (all combinations of the different variables), which also have different lengths. I draw them from a list such as:
conditions<-list(c("a","b","c"), c("b","c"))
and then I want to loop through that list and build a character string like this (which I want to use for something before deleting it and moving on to the next element of the list):
mystring <- paste0(paste0(conditions[[1]], collapse = "==1 & "), "==1")
But how can I use mystring inside the data.table? as.function(), get() and eval() don't seem to work. Something like:
test[mystring,test3:=1]
is what I'm looking for.
For the given use case, you can use a join with on = to achieve the desired result without having to create and evaluate complex condition strings.
Instead of
test[a==1 & b==1 & c==1, test2 := 1][]
we can write
test[.(1, 1, 1), on = c("a", "b", "c"), test2 := 1][]
# a b c test2
#1: 1 0 1 NA
#2: 1 1 1 1
#3: 0 0 1 NA
Now, the OP requested to loop over a list of conditions "to do something". This can be achieved with lapply() as follows:
# create list of conditions for subsetting
col = list(c("a","b","c"), c("b","c"))
val = list(c(1, 1, 1), c(0, 1))
# loop over conditions
lapply(seq_along(col), function(i) test[as.list(val[[i]]), on = col[[i]], test2 := i])
#[[1]]
#
#[[2]]
# a b c test2
#1: 1 0 1 2
#2: 1 1 1 1
#3: 0 0 1 2
Note that the output of lapply() is not used because test has been modified in place:
test
# a b c test2
#1: 1 0 1 2
#2: 1 1 1 1
#3: 0 0 1 2
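If you do want to evaluate the constructed string directly, as in the original question, one commonly used (if less idiomatic) approach is eval(parse()) in i; this is a sketch rather than part of the answer above:
mystring <- paste0(paste0(conditions[[1]], collapse = "==1 & "), "==1")
test[eval(parse(text = mystring)), test3 := 1][]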
I have a data frame with URL strings and am using the stringr package in R to produce new columns with a boolean indicating whether each string contains a given element or not.
library(stringr)
url = data.frame(u=c("http://www.subaru.com/vehicles/impreza/index.html",
"http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=214495e6-dbe0-6668-9222-00003d7cd876&prid=87&k_affcode=76602",
"http://www.subaru.com/customer-support.html",
"http://www.subaru.com/",
"http://www.subaru.com/vehicles/forester/index.html"))
url
cs = c("customer-support")
f = c("forester")
one_match <- str_c(cs, collapse = "|")
two_match <- str_c(f, collapse = "|")
main <- function(df) {
df$customer_support <- as.numeric(str_detect(url$u, one_match))
df
}
d1 = main(url)
main <- function(df) {
df$forester <- as.numeric(str_detect(url$u, two_match))
df
}
d2 = main(url)
library(plyr)
mydt = join(d1, d2)
mydt
The above code produces the following results.
mydt
u
1 http://www.subaru.com/vehicles/impreza/index.html
2 http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=214495e6-dbe0-6668-9222-00003d7cd876&prid=87&k_affcode=76602
3 http://www.subaru.com/customer-support.html
4 http://www.subaru.com/
5 http://www.subaru.com/vehicles/forester/index.html
customer_support forester
1 0 0
2 0 0
3 1 0
4 0 0
5 0 1
What I want to do is reshape the data frame so that columns 2 and 3 are combined into a single column that is no longer boolean.
It should look like:
  page
1 0
2 0
3 customer_support
4 0
5 forester
I've tried many different things, including variations of reshape, transform, dcast, etc., and nothing seems to get the job done. Can anyone help me get the desired output?
You don't need to write such complicated functions. You can simply use the grepl and ifelse functions, as below:
urldata = data.frame(u = c("http://www.subaru.com/vehicles/impreza/index.html", "http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=214495e6-dbe0-6668-9222-00003d7cd876&prid=87&k_affcode=76602",
"http://www.subaru.com/customer-support.html", "http://www.subaru.com/", "http://www.subaru.com/vehicles/forester/index.html"))
cs = c("customer-support")
f = c("forester")
urldata
## u
## 1 http://www.subaru.com/vehicles/impreza/index.html
## 2 http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=214495e6-dbe0-6668-9222-00003d7cd876&prid=87&k_affcode=76602
## 3 http://www.subaru.com/customer-support.html
## 4 http://www.subaru.com/
## 5 http://www.subaru.com/vehicles/forester/index.html
urldata$page <- ifelse(grepl(cs, urldata$u), cs, ifelse(grepl(f, urldata$u), f, 0))
urldata
## u
## 1 http://www.subaru.com/vehicles/impreza/index.html
## 2 http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=214495e6-dbe0-6668-9222-00003d7cd876&prid=87&k_affcode=76602
## 3 http://www.subaru.com/customer-support.html
## 4 http://www.subaru.com/
## 5 http://www.subaru.com/vehicles/forester/index.html
## page
## 1 0
## 2 0
## 3 customer-support
## 4 0
## 5 forester
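If you have many labels, nested ifelse() calls get unwieldy; one alternative sketch (with patterns as a hypothetical vector of labels to search for) is a simple loop over grepl:
patterns <- c("customer-support", "forester")   # hypothetical vector of labels/patterns
urldata$page <- "0"
for (p in patterns) {
  urldata$page[grepl(p, urldata$u, fixed = TRUE)] <- p
}
# gives the same page column as above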