Hi,
I have taken earlier feedback on board and put together a reproducible example along with my coding attempts.
Here is a sample data frame:
df <- data.frame("STUDENT" = 1:10,
"test_FALL" = c(0, 0, 0, 0, 1, 1, 1, 0, 0, NA),
"test_SPRING" = c(1, 1, 1, 1, 0, 0, 0, 1, 1, 1),
"score_FALL" = c(53, 52, 98, 54, 57 ,87, 62, 95, 75, NA),
"score_SPRING" = c(75, 54, 57, 51, 51, 81, 77, 87, 73, 69),
"final_SCORE" = c(75, 54, 57, 51, 57, 87, 62, 87, 73, 69))
And my first attempt at the code:
df$final_score[df$test_SPRING == 1] <- df$score_SPRING
df$final_score[df$test_FALL == 1] <- df$score_FALL
And a second attempt at the code:
df$final_score1[(df$test_SPRING == 1 & !is.na(df$test_SPRING))] <- df$score_SPRING
df$final_score1[(df$test_FALL == 1 & !is.na(df$test_FALL))] <- df$score_FALL
My dataframe records whether a student took each test (test_SPRING, test_FALL) and the scores on those tests (score_SPRING, score_FALL). Basically I want to create the final_SCORE column (which I have included in the dataframe to show the expected result) such that if test_SPRING == 1 the final score is score_SPRING, else if test_FALL == 1 the final score is score_FALL. I am unable to do so and cannot figure it out after many hours. Please offer any advice you may have.
There are a couple of ways to create the 'final_score' column.
1) Using ifelse - Based on the example, the 'test' columns are mutually exclusive, so we can use a single ifelse that checks the condition on 'test_SPRING' (test_SPRING == 1). If it is TRUE, take 'score_SPRING'; otherwise take 'score_FALL'.
with(df, ifelse(test_SPRING == 1, score_SPRING, score_FALL))
#[1] 75 54 57 51 57 87 62 87 73 69
2) Arithmetic - A number multiplied by 1 is itself and multiplied by 0 gives 0, so multiply the 'score' columns by the corresponding 'test' columns, cbind them, and use rowSums.
with(df, NA^(!test_FALL & !test_SPRING) * rowSums(cbind(
score_SPRING * test_SPRING,
score_FALL * test_FALL),
na.rm = TRUE))
#[1] 75 54 57 51 57 87 62 87 73 69
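If dplyr is an option, the same logic can also be written with case_when(), which makes the spring/fall precedence explicit. This is just a sketch; the column name final_SCORE2 is arbitrary, chosen so the expected column already in df is not overwritten.
library(dplyr)

df %>%
  mutate(final_SCORE2 = case_when(
    test_SPRING == 1 ~ score_SPRING,  # spring test taken: use the spring score
    test_FALL == 1   ~ score_FALL,    # otherwise, fall test taken: use the fall score
    TRUE             ~ NA_real_       # neither recorded (or NA): leave missing
  ))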
I'm looking to find multiple max values using multiple ranges from a single table without using a loop.
It's difficult to explain, but here's an example:
list_of_value <- c(100, 110, 54, 64, 73, 23, 102)
beginning_of_max_range <- c(1, 2, 4)
end_of_max_range <- c(3, 5, 6)
The desired output is:
110, 110, 73
i.e.
max(100, 110, 54)     # elements 1:3
max(110, 54, 64, 73)  # elements 2:5
max(64, 73, 23)       # elements 4:6
You may do this with mapply -
list_of_value <- c(100, 110, 54, 64, 73, 23, 102)
beginning_of_max_range <- c(1, 2, 4)
end_of_max_range <- c(3, 5, 6)
mapply(function(x, y) max(list_of_value[x:y]), beginning_of_max_range, end_of_max_range)
#[1] 110 110 73
For each pair of values from beginning_of_max_range and end_of_max_range, we create the sequence of indices between them, use it to subset list_of_value, and take the max.
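Equivalently, if you prefer to see the pairing step spelled out, Map() does the same thing and unlist() collapses the result (just a sketch of the same idea):
unlist(Map(function(x, y) max(list_of_value[x:y]),
           beginning_of_max_range, end_of_max_range))
#[1] 110 110  73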
The previous question was closed, but the easy solution doesn't seem to work, so I am explaining my question in more detail here.
I have two dataframes, df1 and df2. df1 contains a lot of raw data; df2 contains pointers (based on "value_a") indicating where to look in the raw data.
df1 <- data.frame("id" = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), "value_a" = c(0, 10, 21, 30, 43, 53, 69, 81, 93, 5, 16, 27, 33, 45, 61, 75, 90, 2, 11, 16, 24, 31, 40, 47, 60, 75, 88), "value_b" = c(100, 101, 100, 95, 90, 85, 88, 84, 75, 110, 105, 106, 104, 95, 98, 96, 89, 104, 104, 104, 103, 106, 103, 101, 99, 98, 97), "value_c" = c(0, -1, -2, -2, -2, -2, -1, -1, 0, 0, 0, 0, 1, 1, 2, 2, 1, -1, 0, 0, 1, 1, 2, 2, 1, 1, 0), "value_d" = c(1:27))
df2 <- data.frame("id" = c(1, 2, 3), "value_a" = c(53, 45, 47))
I would like to use the values in df2 to search in df1. So, for every "id" and its unique "value_a" given in df2, find the corresponding "value_b" and "value_c" in df1, so I can generate a df3 which looks like this:
df3 <- data.frame("id" = c(1, 2, 3), "value_a" = c(53, 45, 47), "value_b" = c(85, 95, 101), "value_c" = c(-2, 1, 2))
Obviously, I have hundreds of "id"s to cover. Since I want to find multiple variables ("value_b" and "value_c", but not "value_d"), pull() won't work, as it only pulls one variable.
Based on this page I started thinking of joining.
An inner_join() won't work either, because I have to select on multiple variables (id & value_a). Merging like this
df3 <- merge(x = df1, y = df2, by.x = c(id, value_a), by.y = c(id, value_a)) %>%
select(id, value_a, value_b, value_c)
probably describes what I'm thinking of, but it throws an error: Error in fix.by(by.x, x) : object 'value_a' not found
I was also thinking of using tapply() but I get stuck on using two different data.frames. Does someone have a good idea on how to tackle this?
Best regards,
Johan
I believe this will be useful:
df2 %>%
inner_join(df1, by = c("id"="id", "value_a"="value_a"))
Output:
id value_a value_b value_c value_d
1 1 53 85 -2 6
2 2 45 95 1 14
3 3 47 101 2 24
I believe this can solve your issue. I hope this helps:
merge(df2, df1, by.x = c('id', 'value_a'), by.y = c('id', 'value_a'), all.x = TRUE)
id value_a value_b value_c value_d
1 1 53 85 -2 6
2 2 45 95 1 14
3 3 47 101 2 24
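For completeness, a sketch of the full dplyr pipeline that also drops value_d (assuming dplyr is attached; it reproduces the df3 shown in the question):
library(dplyr)

df3 <- df2 %>%
  inner_join(df1, by = c("id", "value_a")) %>%  # match on both keys
  select(id, value_a, value_b, value_c)         # keep everything except value_d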
Both answers, from Duck and codez0mb1e, work; when adding %>% select(id, value_a, value_b, value_c) they give the output
id value_a value_b value_c
1 1 53 85 -2
2 2 45 95 1
3 3 47 101 2
Thanks for the feedback on my probable misinterpretation of both inner_join() and merge()!
Hello, I have a large data set, part of which might look something like this.
Seconds <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24)
B <- c(1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1)
C <- c(50, 60, 62, 65, 80, 60, 68, 66, 60, 69, 70, 89)
mydata <- data.frame(Seconds, B, C)
I am stuck in the analysis of this type of data. Getting straight to the problem, I need the number of times C < 80 continuously for more than 6 seconds and for more than 10 seconds.
In this case:
N6 (C < 80 for more than 6 seconds) = 4
N10 (C < 80 for more than 10 seconds) = 1
I hope this makes sense! Any help is appreciated :)
We can do
with(mydata, sum(C<80 & Seconds>=6 & B!=0))
#[1] 4
It could also be
library(data.table)
setDT(mydata)[Seconds>=6 & B!=0, sum(C<80), rleid(B)]
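For reference, rleid() simply assigns one id per run of consecutive equal values in B, so the sum of C < 80 is computed separately for each such run (a small illustration):
library(data.table)
rleid(c(1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1))
#[1] 1 1 1 1 2 2 2 3 3 4 4 5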
I would like to suggest this modest dplyr-based solution
# Libs
Vectorize(require)(package = c("dplyr", "magrittr"),
char = TRUE)
# Summary
mydata %<>%
mutate(criteria = ifelse(Seconds >= 6 & C < 80, TRUE, FALSE)) %>%
group_by(criteria) %>%
tally()
Preview
> head(mydata)
Source: local data frame [2 x 2]
criteria n
(lgl) (int)
1 FALSE 4
2 TRUE 8
I have the following variables:
loc.dir <- c(1, -1, 1, -1, 1, -1, 1)
max.index <- c(40, 46, 56, 71, 96, 113, 156)
min.index <- c(38, 48, 54, 69, 98, 112, 155)
My goal is to produce the following:
data.loc <- c(40, 48, 56, 69, 96, 112, 156)
In words, I look at each element of loc.dir. If the ith element is 1, then I take the ith element of max.index. On the other hand, if the ith element is -1, then I take the ith element of min.index.
I am able to get the elements that should be in data.loc by using:
plus.1 <- max.index[which(loc.dir == 1)]
minus.1 <- min.index[which(loc.dir == -1)]
But now I don't know how to combine plus.1 and minus.1 so that the result is identical to data.loc.
ifelse was designed for this:
ifelse(loc.dir == 1, max.index, min.index)
#[1] 40 48 56 69 96 112 156
It does something similar to this:
res <- min.index
res[loc.dir == 1] <- max.index[loc.dir == 1]
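If you want to keep your plus.1 / minus.1 approach, you can also fill a result vector using the same logical indices (a small sketch equivalent to the ifelse() above; the name data.loc2 is just for illustration):
data.loc2 <- numeric(length(loc.dir))
data.loc2[loc.dir == 1]  <- plus.1   # maxima go where loc.dir is 1
data.loc2[loc.dir == -1] <- minus.1  # minima go where loc.dir is -1
identical(data.loc2, data.loc)
#[1] TRUE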
How can I extract the exact probabilities for each level of the factor y at any value of x from cdplot(y ~ x)?
Thanks
Following the example from the help file of ?cdplot you can do...
## NASA space shuttle o-ring failures
fail <- factor(c(2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1,
1, 2, 1, 1, 1, 1, 1),
levels = 1:2, labels = c("no", "yes"))
temperature <- c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81)
## CD plot
result <- cdplot(fail ~ temperature)
And this is a simple way to obtain the probabilities from the cdplot output.
# Getting the probabilities for each group.
lapply(split(temperature, fail), result[[1]])
$no
[1] 0.8166854 0.8209055 0.8209055 0.8209055 0.8090438 0.7901473 0.7718317 0.7718317 0.7579343
[10] 0.7664731 0.8062898 0.8326761 0.8326761 0.8905854 0.9185472 0.9626185
$yes
[1] 3.656304e-05 6.273653e-03 1.910046e-02 6.007471e-01 7.718317e-01 7.718317e-01 8.062898e-01
Note that result is a list of conditional density functions (cumulative over the levels of fail) returned invisibly by cdplot; therefore we can split temperature by fail and apply the returned function result[[1]] over those values using lapply.
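Because result[[1]] is an ordinary function of temperature, you can also evaluate it at any single value rather than only at the observed temperatures, for example:
# cumulative conditional probability of "no" failure at an arbitrary temperature
result[[1]](65)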
Here is a simplified version of getS3method('cdplot', 'default'):
get.props <- function(x, y, n) {
  ny <- nlevels(y)
  yprop <- cumsum(prop.table(table(y)))
  dx <- density(x, n = n)                      # overall density of x
  y1 <- matrix(rep(0, n * (ny - 1L)), nrow = ny - 1L)
  rval <- list()
  for (i in seq_len(ny - 1L)) {
    # density of x restricted to the first i levels of y, on the same grid
    dxi <- density(x[y %in% levels(y)[seq_len(i)]],
                   bw = dx$bw, n = n, from = min(dx$x), to = max(dx$x))
    y1[i, ] <- dxi$y / dx$y * yprop[i]
    # return step added here: wrap the grid in an interpolating function,
    # similar to the functions cdplot() itself returns invisibly
    rval[[i]] <- approxfun(dx$x, y1[i, ], rule = 2)
  }
  names(rval) <- levels(y)[seq_len(ny - 1L)]
  rval
}
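A quick illustrative call with the o-ring data above (the choice of n is arbitrary):
probs <- get.props(temperature, fail, n = 512)
# probs$no is a function: evaluate it at any temperature to get the
# cumulative conditional probability of "no" failure
probs$no(c(55, 65, 75))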