Hello, I have a large data set, part of which might look something like this:
Seconds <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24)
B <- c(1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1)
C <- c(50, 60, 62, 65, 80, 60, 68, 66, 60, 69, 70, 89)
mydata <- data.frame(Seconds, B, C)
I am stuck on the analysis of this type of data. Getting straight to the problem, I need
the number of times C < 80 continuously for more than 6 seconds, and for more than 10 seconds.
In this case:
N6 (C < 80 for more than 6 seconds) = 4
N10 (C < 80 for more than 10 seconds) = 1
I hope this makes sense! Any help is appreciated :)
We can do
with(mydata, sum(C<80 & Seconds>=6 & B!=0))
#[1] 4
It could also be
library(data.table)
setDT(mydata)[Seconds>=6 & B!=0, sum(C<80), rleid(B)]
I would like to suggest this modest dplyr-based solution
# Libs
Vectorize(require)(package = c("dplyr", "magrittr"),
                   char = TRUE)

# Summary
mydata %<>%
  mutate(criteria = ifelse(Seconds >= 6 & C < 80, TRUE, FALSE)) %>%
  group_by(criteria) %>%
  tally()
Preview
> head(mydata)
Source: local data frame [2 x 2]

  criteria     n
     (lgl) (int)
1    FALSE     4
2     TRUE     8
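If "C < 80 continuously for more than N seconds" is instead meant as runs of consecutive rows below 80, a run-length sketch could look like the following. This is only one possible reading of the question; it assumes Seconds is an evenly spaced 2-second grid, ignores B, and counts whole runs rather than rows.
# Minimal sketch (assumptions: 2-second sampling step, B ignored):
# find runs of C < 80 with rle() and count those lasting longer than the threshold.
count_long_runs <- function(data, threshold_secs, step = 2) {
  runs <- rle(data$C < 80)
  sum(runs$values & runs$lengths * step > threshold_secs)
}
count_long_runs(mydata, 6)   # runs of C < 80 lasting more than 6 seconds
count_long_runs(mydata, 10)  # runs of C < 80 lasting more than 10 seconds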
Related: this question was closed as a duplicate of "How to join (merge) data frames (inner, outer, left, right)".
The previous question was closed, but the easy solution doesn't seem to work, so I have explained my question further here.
I have two dataframes, df1 and df2. df1 contains a lot of raw data; df2 contains pointers, based on "value_a", indicating where to look in the raw data.
df1 <- data.frame(
  "id" = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3),
  "value_a" = c(0, 10, 21, 30, 43, 53, 69, 81, 93, 5, 16, 27, 33, 45, 61, 75, 90, 2, 11, 16, 24, 31, 40, 47, 60, 75, 88),
  "value_b" = c(100, 101, 100, 95, 90, 85, 88, 84, 75, 110, 105, 106, 104, 95, 98, 96, 89, 104, 104, 104, 103, 106, 103, 101, 99, 98, 97),
  "value_c" = c(0, -1, -2, -2, -2, -2, -1, -1, 0, 0, 0, 0, 1, 1, 2, 2, 1, -1, 0, 0, 1, 1, 2, 2, 1, 1, 0),
  "value_d" = c(1:27)
)
df2 <- data.frame("id" = c(1, 2, 3), "value_a" = c(53, 45, 47))
I would like to use the values in df2 to search in df1. So, for every "id" and its unique "value_a" given in df2, find the corresponding "value_b" and "value_c" in df1, so that I can generate a df3 which looks like this:
df3 <- data.frame("id" = c(1, 2, 3), "value_a" = c(53, 45, 47), "value_b" = c(85, 95, 101), "value_c" = c(-2, 1, 2))
Obviously, I have hundreds of "id"s to cover. Since I want to extract multiple variables ("value_b" and "value_c", but not "value_d"), pull() won't work, since it only pulls one variable.
Based on this page I started thinking of joining.
An inner_join() won't work either, because I have to match on multiple variables (id and value_a). Merging like this
df3 <- merge(x = df1, y = df2, by.x = c(id, value_a), by.y = c(id, value_a)) %>%
select(id, value_a, value_b, value_c)
is probably closest to what I'm thinking of, but it throws an error: Error in fix.by(by.x, x) : object 'value_a' not found
I was also thinking of using tapply(), but I got stuck on it needing two different data.frames. Does anyone have a good idea of how to tackle this?
Best regards,
Johan
I believe this will be useful:
df2 %>%
  inner_join(df1, by = c("id" = "id", "value_a" = "value_a"))
Output:
  id value_a value_b value_c value_d
1  1      53      85      -2       6
2  2      45      95       1      14
3  3      47     101       2      24
I believe this can solve your issue. I hope this helps:
merge(df2, df1, by.x = c('id', 'value_a'), by.y = c('id', 'value_a'), all.x = TRUE)
  id value_a value_b value_c value_d
1  1      53      85      -2       6
2  2      45      95       1      14
3  3      47     101       2      24
Both Duck's and codez0mb1e's answers work; adding %>% select(id, value_a, value_b, value_c) gives the output below (the combined pipeline is sketched after it).
  id value_a value_b value_c
1  1      53      85      -2
2  2      45      95       1
3  3      47     101       2
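For reference, the combined pipeline (a sketch assuming dplyr is loaded) is:
library(dplyr)

df3 <- df2 %>%
  inner_join(df1, by = c("id", "value_a")) %>%  # match on both id and value_a
  select(id, value_a, value_b, value_c)         # drop value_d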
Thanks for the feedback on my probable misinterpretation of both inner_join() and merge()!
Ciao,
Taking on much of the feedback, I have created a reproducible example along with my coding attempts.
Here is a sample data frame:
df <- data.frame("STUDENT" = 1:10,
                 "test_FALL" = c(0, 0, 0, 0, 1, 1, 1, 0, 0, NA),
                 "test_SPRING" = c(1, 1, 1, 1, 0, 0, 0, 1, 1, 1),
                 "score_FALL" = c(53, 52, 98, 54, 57, 87, 62, 95, 75, NA),
                 "score_SPRING" = c(75, 54, 57, 51, 51, 81, 77, 87, 73, 69),
                 "final_SCORE" = c(75, 54, 57, 51, 57, 87, 62, 87, 73, 69))
And sample code:
df$final_score[df$test_SPRING == 1] <- df$score_SPRING
df$final_score[df$test_FALL == 1] <- df$score_FALL
And a second attempt at the code:
df$final_score1[(df$test_SPRING == 1 & !is.na(df$test_SPRING))] <- df$score_SPRING
df$final_score1[(df$test_FALL == 1 & !is.na(df$test_FALL))] <- df$score_FALL
In my dataframe I have whether a student took a test (test_FALL, test_SPRING) and the scores on those tests (score_FALL, score_SPRING). Basically, I want to create the final_SCORE column (which I have included in the dataframe) such that if test_SPRING == 1, the final score is score_SPRING; else if test_FALL == 1, the final score is score_FALL. I have been unable to do so and cannot figure it out after many hours. Please offer any advice you may have.
There are a couple of ways to create the 'final_SCORE' column.
1) Using ifelse - Based on the example, the 'test' columns are mutually exclusive, so we can use a single ifelse checking the condition on 'test_SPRING' (test_SPRING == 1). If it is TRUE, take 'score_SPRING', otherwise take 'score_FALL'.
with(df, ifelse(test_SPRING == 1, score_SPRING, score_FALL))
#[1] 75 54 57 51 57 87 62 87 73 69
2) Arithmetic - A number multiplied by 1 is the number itself and multiplied by 0 gives 0, so multiply the 'score' columns by the corresponding 'test' columns, cbind them, and use rowSums.
with(df, NA^(!test_FALL & !test_SPRING) * rowSums(cbind(score_SPRING * test_SPRING,
                                                         score_FALL * test_FALL),
                                                  na.rm = TRUE))
#[1] 75 54 57 51 57 87 62 87 73 69
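A third option (not from the original answers, just a sketch that stays close to the indexed-assignment attempts and assumes the two 'test' columns are never both 1): subset both sides of the assignment so the lengths match, and handle the NA explicitly.
spring <- !is.na(df$test_SPRING) & df$test_SPRING == 1
fall <- !is.na(df$test_FALL) & df$test_FALL == 1
df$final_score2 <- NA_real_                         # hypothetical new column
df$final_score2[spring] <- df$score_SPRING[spring]  # spring takers get the spring score
df$final_score2[fall] <- df$score_FALL[fall]        # fall takers get the fall score
df$final_score2
#[1] 75 54 57 51 57 87 62 87 73 69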
I have a cdplot where I'm trying to find the x value where the distribution (the y value) = 0.5, and I couldn't find a method that works. Additionally, I want to find the y value when my x value is 0, and would like help finding that equation too, if it's different.
I can't really provide my code as it relies on a saved workspace with a large dataframe. I'll give this as an example:
fail <- factor(c(2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1),
               levels = 1:2, labels = c("no", "yes"))
temperature <- c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81)
cdplot(fail ~ temperature)
So I don't just need a quick and dirty way to solve this specific example; I need code I can apply to my own workspace.
If you capture the return value of cdplot, you get a list of conditional density functions that you can use to find these values.
CDP = cdplot(fail ~ temperature)
uniroot(function(x) { CDP$no(x) - 0.5 }, c(55, 80))

$root
[1] 62.34963

$f.root
[1] 3.330669e-16
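For the second part of the question (the y value at a given x), the captured functions can simply be evaluated directly. A small sketch; note that, as far as I can tell, the functions returned by cdplot extrapolate flatly outside the observed temperature range, so a value at x = 0 just repeats the boundary value here.
CDP$no(0)    # estimated P(fail == "no") at temperature 0 (flat extrapolation outside the data)
CDP$no(70)   # estimated P(fail == "no") at temperature 70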
HD = 4
D = 3
C = 2
P = 1
N = 0
GPA = x / n
where n is the number of variables (subjects), and
x = a + b + c + d
where
a is variable 1, which can equal a number between 4 (HD) and 0 (N)
b is variable 2, which can equal a number between 4 (HD) and 0 (N)
c is variable 3, which can equal a number between 4 (HD) and 0 (N)
d is variable 4, which can equal a number between 4 (HD) and 0 (N)
With that in mind, how would I work out the possible combinations of variables that could sum to x?
Example:
1. Input = 3.75, output = (HD, HD, HD, D)
2. Input = 2.00, output = (HD, N, HD, N), (C, C, C, C), etc
By the way, this isn't for an assignment or anything - I just saw a lot of people asking what their grades could be after my university released students' GPAs before grades, and I've been brainstorming how to work out the permutations. I'm an arts student with an interest in maths and programming, so go easy on me.
The easiest way I can think of is multiplying the GPA by the number of subjects (in your example, 4) and then finding the possible combinations that get to that sum. You can use the code in this link to find all the possible combinations.
In your first example, 3.75 * 4 = 15; in your second example, 2 * 4 = 8.
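As a rough illustration in R (a sketch; the grade names and the 4-subject assumption are taken from the question), the target sum can be matched by brute force:
grades <- c(HD = 4, D = 3, C = 2, P = 1, N = 0)
combos <- expand.grid(a = grades, b = grades, c = grades, d = grades)
target <- 3.75 * 4                              # GPA times number of subjects
matches <- combos[rowSums(combos) == target, ]  # every ordered combination summing to 15
nrow(matches)                                   # 4 ordered ways, all permutations of (HD, HD, HD, D)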
A given total x over n terms is reached from all the sums of x - k over n - 1 terms, by adding k.
So you get the recurrence
C(x; n) = C(x - 0; n - 1) + C(x - 1; n - 1) + C(x - 2; n - 1) + C(x - 3; n - 1) + C(x - 4; n - 1),
with C(x; n) = 0 for x < 0 or x > 4n.
The first values are (rows for increasing n, x horizontally):
1, 1, 1, 1, 1
1, 2, 3, 4, 5, 4, 3, 2, 1
1, 3, 6, 10, 15, 18, 19, 18, 15, 10, 6, 3, 1
1, 4, 10, 20, 35, 52, 68, 80, 85, 80, 68, 52, 35, 20, 10, 4, 1
1, 5, 15, 35, 70, 121, 185, 255, 320, 365, 381, 365, 320, 255, 185, 121, 70, 35, 15, 5, 1
In other words, one row is computed from the previous by a convolution with the kernel [1, 1, 1, 1, 1], and the sum of a row is 5^n.
Interestingly, the histograms converge to a Gaussian curve.
The recursive enumeration of the combinations themselves should be obvious from the recurrence.
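For instance, a small R sketch of the counting recurrence (one convolution with the kernel c(1, 1, 1, 1, 1) per subject, since each subject adds between 0 and 4 to the total) could be:
count_sums <- function(n) {
  row <- 1                              # one way to reach total 0 with 0 subjects
  for (subject in seq_len(n)) {
    new_row <- numeric(length(row) + 4)
    for (k in 0:4) {                    # adding a grade worth k to every previous total
      idx <- seq_along(row) + k
      new_row[idx] <- new_row[idx] + row
    }
    row <- new_row
  }
  row                                   # counts for totals 0, 1, ..., 4 * n
}
count_sums(4)
# 1 4 10 20 35 52 68 80 85 80 68 52 35 20 10 4 1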
How can I extract the exact probabilities for each level of the factor y at any value of x with cdplot(y ~ x)?
Thanks
Following the example from the help file of ?cdplot you can do...
## NASA space shuttle o-ring failures
fail <- factor(c(2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1,
                 1, 2, 1, 1, 1, 1, 1),
               levels = 1:2, labels = c("no", "yes"))
temperature <- c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
                 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81)
## CD plot
result <- cdplot(fail ~ temperature)
And this is a simple way to obtain the probabilities from the cdplot output.
# Getting the probabilities for each group.
lapply(split(temperature, fail), result[[1]])
$no
[1] 0.8166854 0.8209055 0.8209055 0.8209055 0.8090438 0.7901473 0.7718317 0.7718317 0.7579343
[10] 0.7664731 0.8062898 0.8326761 0.8326761 0.8905854 0.9185472 0.9626185
$yes
[1] 3.656304e-05 6.273653e-03 1.910046e-02 6.007471e-01 7.718317e-01 7.718317e-01 8.062898e-01
Note that result is a list of conditional density functions (cumulative over the levels of fail) returned invisibly by cdplot; therefore we can split temperature by fail and apply the first returned function (the one for "no") over those values using lapply.
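Since result[[1]] is an ordinary (vectorised) function of temperature, it can also be evaluated at arbitrary values, for example:
result[[1]](c(60, 70, 80))   # estimated P(fail == "no") at temperatures 60, 70 and 80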
Here is a simplified version of getS3method('cdplot', 'default'):
get.props <- function(x, y, n = 512) {
  ny <- nlevels(y)
  yprop <- cumsum(prop.table(table(y)))
  dx <- density(x, n = n)
  y1 <- matrix(rep(0, n * (ny - 1L)), nrow = ny - 1L)
  rval <- list()
  for (i in seq_len(ny - 1L)) {
    # density of x restricted to the first i levels, on the same grid and bandwidth
    dxi <- density(x[y %in% levels(y)[seq_len(i)]],
                   bw = dx$bw, n = n, from = min(dx$x), to = max(dx$x))
    y1[i, ] <- dxi$y / dx$y * yprop[i]
    rval[[i]] <- approxfun(dx$x, y1[i, ], rule = 2)  # same interpolation as cdplot.default
  }
  names(rval) <- levels(y)[seq_len(ny - 1L)]
  rval
}
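This version returns one interpolating function per non-top level of y (as cdplot itself does), so it could be used, for example, as:
props <- get.props(temperature, fail, n = 512)
props$no(70)   # estimated P(fail == "no") at temperature 70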