I have a data frame with timestamps that also have decimal values. I want to calculate the difference between the first event and every other event in the same group. To do that, I use the following code:
library(dplyr)

values <- c("1671535501.862424", "1671535502.060679", "1671535502.257422",
            "1671535502.472993", "1671535502.652619", "1671535502.856569",
            "1671535503.048685", "1671535503.245988")
column_b <- c("a", "a", "a", "a", "a", "a", "a", "a")
values <- as.numeric(values)
#-- Calculate differences
data <- data.frame(values, column_b) # create data frame
res <- data %>%
  group_by(column_b) %>%
  arrange(values) %>%
  mutate(time = values - lag(values, default = first(values)))
In general, the code does exactly what I expect it to do. It groups them, arranges them, and calculates the difference for each group. The output looks like this:
> res
# A tibble: 8 × 3
# Groups: column_b [2]
values column_b time
<dbl> <fct> <dbl>
1 1671535502. a 0
2 1671535502. a 0.198
3 1671535502. a 0.197
4 1671535502. a 0.216
5 1671535503. a 0.180
6 1671535503. a 0.204
7 1671535503. a 0.192
8 1671535503. a 0.197
Nevertheless, I have my doubts about the results. If I am not mistaken, the values in this example are already sorted. But even if that were not the case, arrange() should have done the job. Hence, if it is arranging the values, how can the 4th row have a larger value than the 5th? Several other rows show the same pattern, which does not make sense to me. What am I missing?
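One possible reading of the output, offered as a sketch rather than a definitive diagnosis: lag(values, default = first(values)) subtracts the previous row, so time holds the gap between consecutive events, and such gaps need not increase from row to row. Subtracting first(values) within each group gives the elapsed time since the first event instead. The column names gap and since_first are illustrative and do not appear in the original code.

```r
library(dplyr)

values <- as.numeric(c("1671535501.862424", "1671535502.060679", "1671535502.257422",
                       "1671535502.472993", "1671535502.652619", "1671535502.856569",
                       "1671535503.048685", "1671535503.245988"))
data <- data.frame(values, column_b = rep("a", 8))

res <- data %>%
  group_by(column_b) %>%
  arrange(values, .by_group = TRUE) %>%
  mutate(
    gap         = values - lag(values, default = first(values)), # gap to the previous event
    since_first = values - first(values)                         # elapsed time since the first event
  ) %>%
  ungroup()
```

The default tibble display rounds the timestamps to a few significant digits; setting options(pillar.sigfig = 10) before printing should show more of the decimals.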
I want to extract the random effects from my lmer model, including the person this random effect belongs to. My goal is to create a tibble that has one column for the person and another column for the random effect.
Using coef(modelA)$bib I am able to extract the random effect to a list. Here I also see which person the random effect belongs to.
> coef(modelA)$bib
(Intercept)
31 0.37031060
32 0.49877575
33 0.50586345
34 0.52036187
35 0.49813250
However, adding this to a tibble, this information is lost.
> tibble(randEffectModA)
# A tibble: 65 x 1
`(Intercept)`
<dbl>
1 0.370
2 0.499
3 0.506
4 0.520
5 0.498
Is there a simple way to solve this problem?
Those are row names, and tibbles do not support row names.
You have a few options:
Keep the information in a data frame instead of a tibble so that the row names are maintained.
result <- data.frame(coef(modelA)$bib)
Store the row names in a separate column if you want to use tibbles.
randEffectModA <- data.frame(coef(modelA)$bib)
result <- tibble::tibble(person_no = rownames(randEffectModA),
intercept = unlist(randEffectModA))
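As an alternative sketch, tibble has helpers that move row names into a column. The data frame below is a hypothetical stand-in for coef(modelA)$bib, typed in by hand from the first rows shown in the question so that no fitted model is needed; person_no is an illustrative column name.

```r
library(tibble)

# Hypothetical stand-in for coef(modelA)$bib: the row names carry the person IDs
rand_eff <- data.frame(
  `(Intercept)` = c(0.37031060, 0.49877575, 0.50586345),
  row.names = c("31", "32", "33"),
  check.names = FALSE
)

# Move the row names into a regular column while converting to a tibble
result <- as_tibble(rand_eff, rownames = "person_no")

# Alternative that keeps a data frame: rownames_to_column(rand_eff, var = "person_no")
```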
I want to calculate a conditional mean for values that exceed a certain threshold.
For example by looking at the cars dataset, I want to calculate the mean hp for cars with more than 100 hp. Furthermore, I want to store this value within a datastream.
I tried the following code (PROPTV is the dataset and `DDM, g=0` is the variable for which I need the mean of the values > 0):
PROPTV %>%
group_by(`DDM, g=0`>0) %>%
summarise(mean=mean(`DDM, g=0`))
I get the following tibble:
# A tibble: 2 × 2
`\`DDM, g=0\` > 0` mean
<lgl> <dbl>
1 FALSE 0
2 TRUE 0.709
The 0.709 should be correct, but I have no idea how to store this value without using a helper data frame.
Any ideas?
Thanks in advance!!
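A minimal sketch of one way to keep such a conditional mean as a plain value. Since PROPTV is not available here, mtcars stands in for it and hp stands in for `DDM, g=0`; the threshold of 100 follows the example in the question, and the names mean_hp and mean_hp_base are illustrative.

```r
library(dplyr)

# mtcars stands in for PROPTV, and hp for `DDM, g=0` (assumptions for illustration)
mean_hp <- mtcars %>%
  filter(hp > 100) %>%         # keep only rows above the threshold
  summarise(m = mean(hp)) %>%  # one-row, one-column summary
  pull(m)                      # extract the value as a plain numeric

# Base R equivalent without any intermediate data frame
mean_hp_base <- mean(mtcars$hp[mtcars$hp > 100])
```

Both variables hold a single number that can be reused in later calculations.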
I'm working with some data in which participants perform a cognitive task that measures their outcome (Correct or Incorrect) and reaction time (RT) (the entire dataset is called practice). For each participant, I want to create a new data frame with their average RT when they got the answer correct, and one for when they were incorrect. I've tried
practice %>%
mutate(correctRT = mean(practice$RT[practice$Outcome=="Correct"]))
Using dplyr and tidyverse, as well as
correctRT <- c(mean(practice$RT[practice$Outcome=="Correct"]))
(which I'm sure isn't the correct way to do it) and nothing seems to be working. I'm a complete novice and am working with this dataset in order to learn how to do stats with R and just can't find any answers with R.
In R you can "keep" multiple objects (e.g. data frames) in a single list. This saves you from storing every (sub-)data frame in a separate variable (e.g. by subsetting your data by Participant and Outcome and storing each piece). This comes in handy when you have "many" individuals and manually filtering and storing each (sub-)data frame becomes prohibitive.
Conceptually, your problem is to "subset" your data to the Participant and Outcome you aim for and calculate the mean on this group.
The following is based on {tidyverse}, i.e. {dplyr}.
data
As you have not provided a reproducible example, this is a quick mock-up of your data:
practice <- data.frame(
Participant = c("A","A","A","B","B","B","B","C","C","D"),
RT = c(10, 12, 14, 9, 12, 13, 17, 11, 13, 17),
Outcome = c("Incorrect","Correct", "Correct","Incorrect","Incorrect","Correct", "Correct","Incorrect","Correct", "Correct")
)
which looks like the following:
practice
Participant RT Outcome
1 A 10 Incorrect
2 A 12 Correct
3 A 14 Correct
4 B 9 Incorrect
5 B 12 Incorrect
6 B 13 Correct
7 B 17 Correct
8 C 11 Incorrect
9 C 13 Correct
10 D 17 Correct
splitting groups of a dataframe
The {tidyverse} provides some neat functions for the general data processing.
{dplyr} has a group_split() function that returns such a list.
library(dplyr)
practice %>% group_split(Participant, Outcome)
<list_of<
tbl_df<
Participant: character
RT : double
Outcome : character
>
>[7]>
[[1]]
# A tibble: 2 x 3
Participant RT Outcome
<chr> <dbl> <chr>
1 A 12 Correct
2 A 14 Correct
[[2]]
...
You can address the respective list-elements with the [[]] notation.
Store the list in a variable and try my_list_name[[3]] to extract the 3rd element.
potential summary for your problem
If you do not need a list, you can wrap this into a data summary.
If you want to split on Outcome, you may want to filter your data into two sub-data frames, each holding only the respective outcome (e.g. correct <- practice %>% filter(Outcome == "Correct")).
Group your data according to the summary you want to construct.
Use summarise() to summarise each group into a one-row summary.
Note that you can combine multiple operations. For example, in addition to the mean reaction time, the following counts the number of rows (i.e. attempts).
practice %>%
  group_by(Participant, Outcome) %>%
  # summarise each group into one row
  summarise(
    Mean_RT  = mean(RT),  # mean reaction time
    Attempts = n()        # number of attempts (rows) in the group
  )
This yields:
# A tibble: 7 x 4
# Groups: Participant [4]
Participant Outcome Mean_RT Attempts
<chr> <chr> <dbl> <int>
1 A Correct 13 2
2 A Incorrect 10 1
3 B Correct 15 2
4 B Incorrect 10.5 2
5 C Correct 13 1
6 C Incorrect 11 1
7 D Correct 17 1
Please note that this is a grouped data frame. If you further process the data, you need to "remove" the grouping. Otherwise any follow up operation in a pipe will be on the group-level.
For this you can either use summarise(...., .groups = "drop") or you add ... %>% ungroup() to your pipe.
If you need to split the result, see group_split() above.
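As a small sketch of the note on ungrouping, reusing the made-up practice data from this answer: with .groups = "drop" the summary comes back as an ungrouped tibble. The name rt_summary is illustrative, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

practice <- data.frame(
  Participant = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "D"),
  RT = c(10, 12, 14, 9, 12, 13, 17, 11, 13, 17),
  Outcome = c("Incorrect", "Correct", "Correct", "Incorrect", "Incorrect",
              "Correct", "Correct", "Incorrect", "Correct", "Correct")
)

rt_summary <- practice %>%
  group_by(Participant, Outcome) %>%
  summarise(Mean_RT = mean(RT), Attempts = n(), .groups = "drop")

is_grouped_df(rt_summary)  # FALSE: later steps operate on the whole table
```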
I have just started learning R for data analysis. Here is my problem.
I want to analyse the difference in body weight (BW) between males and females in different species. (For example, in Sorex gracilliums, male and female body weight might be significantly different. This is just an example; I don't know the answer. :)) At first I thought I could divide the data by species into several groups. (This can indeed be done in Excel, but I have too many files, so I think R is the better choice.) Then I could use some simple code to test for a sex difference. But I don't know how to divide the data or how to make the new data frames.
I tried to use group_split. It did split the data, but the result is just many tibbles.
What should I do?
Or maybe there is a better way to test the difference?
I am not a native speaker, so there may be many grammar mistakes, but I would greatly appreciate any help!
Assuming your data is in a data.frame called df, with columns NO, SPECIES, SEX, BW:
set.seed(100)
df = data.frame(NO=1:100,
SPECIES=sample(LETTERS[1:4],100,replace=TRUE),
SEX=sample(c("M","F"),100,replace=TRUE),
BW = rnorm(100,80,2)
)
We then give species D an effect:
df$BW[df$SPECIES=="D" & df$SEX=="M"] = df$BW[df$SPECIES=="D" & df$SEX=="M"] + 5
If we want to do it on one data frame, say Species A, we do
dat = subset(df,SPECIES=="A")
t.test(BW ~ SEX,data=dat)
And you get the relevant statistics and so forth. To do this systematically for all SPECIES, we can use broom, dplyr:
library(dplyr)
library(broom)
df %>% group_by(SPECIES) %>% do(tidy(t.test(BW ~ SEX,data=.)))
# A tibble: 4 x 11
# Groups: SPECIES [4]
SPECIES estimate estimate1 estimate2 statistic p.value parameter conf.low
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0.883 80.4 79.6 0.936 3.65e-1 14.2 -1.14
2 B 0.259 80.2 79.9 0.377 7.12e-1 14.1 -1.21
3 C 0.170 80.1 79.9 0.359 7.23e-1 25.3 -0.807
4 D -5.55 79.7 85.2 -7.71 1.29e-7 21.4 -7.05
If you don't want to install any packages, this will give you all the test results:
by(df, df$SPECIES, function(x)t.test(BW ~ SEX,data=x))
And combining them into one data.frame:
func = function(x){
  Nu = t.test(BW ~ SEX, data = x)
  data.frame(estimate_1 = Nu$estimate[1], estimate_2 = Nu$estimate[2], p = Nu$p.value)
}
do.call(rbind,by(df, df$SPECIES,func))
Here is an example of how to create multiple data frames from one. The example data set iris is a table of traits for 3 species.
First, you can build a vector, nspe, with all the species in your data frame. I then create a list of the same length.
The for loop visits each element of this list and fills it with a data frame containing only that species.
At the end of this script, I compute the mean petal width of the setosa species. If I had a discrete variable with two levels for this species, I could do a two-group t.test as well. I did a one-sample test here, but it's not really useful...
data("iris")
summary(iris)
nspe <- as.vector(unique(iris$Species))
spe <- vector("list", length(nspe)); names(spe) <- nspe
for(i in nspe){
  spe[[i]] <- iris[which(iris$Species == i),]
}
mean(spe$setosa$Petal.Width)
# [1] 0.246
t.test(spe$setosa$Petal.Width)
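The same per-species list can also be built without a loop. As a sketch, base R's split() returns a named list with one data frame per level of the grouping variable; spe_split is an illustrative name.

```r
data("iris")

# One data frame per species, named by species
spe_split <- split(iris, iris$Species)

names(spe_split)
mean(spe_split$setosa$Petal.Width)
```

Each element can then be accessed by name, just like the elements of the list filled by the for loop.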
Below is an example showing how you can run a t.test on one species. Note that you will probably run into trouble with species names that contain spaces, so I think it's easier to use IDs for species than to keep their full names.
In future questions, consider providing a small example dataset rather than pictures; that makes it easier to help you.
# NOT RUN
t.test(
spe$Sorex_gracilliums$BW[which(spe$Sorex_gracilliums$SEX == 'm')],
spe$Sorex_gracilliums$BW[which(spe$Sorex_gracilliums$SEX == 'f')]
)
I have a data.frame with a head that looks like this:
> head(movies_by_yr)
Source: local data frame [6 x 4]
Groups: YR_Released [6]
Movie_Title YR_Released Rating Num_Reviews
<fctr> <fctr> <dbl> <int>
1 The Shawshank Redemption 1994 9.2 1773755
2 The Godfather 1972 9.2 1211083
3 The Godfather: Part II 1974 9.0 832342
4 The Dark Knight 2008 8.9 1755341
5 12 Angry Men 1957 8.9 477276
6 Schindler's List 1993 8.9 909358
Note that when created, I specified stringsAsFactors=FALSE, so I believe the columns that got converted to factors were converted when I grouped the data frame in preparation for the next step:
movies_by_yr <- group_by(problem1_data, YR_Released)
Now we come to the problem. The goal is to group by YR_Released so we can get counts of records by year. I thought the next step would be something like this, but it throws an error and I am not sure what I am doing wrong:
summarise(movies_by_yr, total = nrow(YR_Released))
I chose nrow because once you have a grouping, the number of rows within that grouping should be the count. Can someone point me to what I am doing wrong?
The error thrown is:
Error in summarise_impl(.data, dots) : Not a vector
But I know this data.frame was created from a series of vectors and whatever is different from the sample code from class and my attempt, I am just not seeing it. Hoping someone can answer this ...
Let's use data that everyone has, like the built-in mtcars data.frame, to make this more useful for future readers.
If you look at the documentation ?nrow you'll see that function is meant to be called on a data.frame or matrix. You are calling it on a column, YR_Released. There is a vector-specific variant of the function nrow, called (confusingly) NROW - if you try that instead, it may work.
But even if it does, the intended dplyr way to count rows is with n(), like this:
library(dplyr)

mycars <- mtcars
mycars <- group_by(mycars, cyl)
summarise(mycars, total = n())
#> # A tibble: 3 x 2
#> cyl total
#> <dbl> <int>
#> 1 4 11
#> 2 6 7
#> 3 8 14
And because it's such a common use case, the wrapper function count() will save you some code:
mtcars %>%
count(cyl)
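To illustrate the nrow versus NROW distinction described above, here is a short base R sketch on mtcars. nrow() returns NULL for a plain vector because a vector has no dim attribute, while NROW() treats the vector as a one-column matrix.

```r
cyl <- mtcars$cyl

nrow(cyl)     # NULL: a vector has no dim attribute
NROW(cyl)     # 32: the vector is treated as a one-column matrix
nrow(mtcars)  # 32: works as expected on a data frame
```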
Try this (I think it's what you want):
table(movies_by_yr$YR_Released)