I asked this question before and solved the problem with Saga's help.
I am working on a simulation study. I have to reorganize my results and continue with the analysis.
I have a data matrix that contains many results, like this:
> data
It S X Y F
1 1 0.5 0.8 2.39
1 2 0.3 0.2 1.56
2 1 1.56 2.13 1.48
3 1 2.08 1.05 2.14
3 2 1.56 2.04 2.45
.......
It shows the iteration.
S shows the second iteration running inside It.
X shows the X coordinate obtained from a method.
Y shows the Y coordinate obtained from a method.
F shows the F statistic.
My problem is that I have to find the minimum F value for every iteration. So I have to store every iteration in a different matrix or data frame and find the minimum F value.
I have tried many things, but none of them worked. Any help or ideas will be appreciated.
EDIT: Updated table information
This was the solution:
library(dplyr)
data %>%
group_by(It) %>%
slice(which.min(F))
# A tibble: 3 x 5
# Groups: It [3]
It S X Y F
1 1 2 0.30 0.20 1.56
2 2 1 1.56 2.13 1.48
3 3 1 2.08 1.05 2.14
However, I will continue with another for loop, and I want to select every X value that satisfies the conditions above.
For example, when I use data$X[i], the code doesn't select the values of X from the result (0.30, 1.56, 2.08). It selects the original values from "data" before grouping. How can I solve this problem?
I hope this is what you are expecting:
> library(dplyr)
> data %>%
group_by(It) %>%
slice(which.min(F))
# A tibble: 3 x 5
# Groups: It [3]
It S X Y F
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 0.30 0.20 1.56
2 2 1 1.56 2.13 1.48
3 3 1 2.08 1.05 2.14
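Note that the pipeline above only prints the result; it does not modify data, which is probably why data$X[i] still returns the original values. A minimal sketch, assuming the table from the question is rebuilt by hand and the result is stored under the made-up name best:

```r
library(dplyr)

# Rebuild the example table from the question
data <- data.frame(
  It = c(1, 1, 2, 3, 3),
  S  = c(1, 2, 1, 1, 2),
  X  = c(0.5, 0.3, 1.56, 2.08, 1.56),
  Y  = c(0.8, 0.2, 2.13, 1.05, 2.04),
  F  = c(2.39, 1.56, 1.48, 2.14, 2.45)
)

# Assign the grouped result to a new object ("best" is a hypothetical name)
best <- data %>%
  group_by(It) %>%
  slice(which.min(F)) %>%
  ungroup()

# X values of the rows with the minimum F per iteration
best$X
```

In a later loop, best$X[i] then refers to the i-th selected row rather than the i-th row of the original data.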
I am trying to get the square root of negative numbers. I take the absolute value of the data; for a positive number I use the square root of the absolute value directly, otherwise I add a negative sign to the result. However, all the numbers I get are negative...
My code
Results shown
I expect both negative and positive results, but I only get negative numbers.
Library and Data
Not sure exactly what you are doing because your original data frame isn't included in the question. However, I have simulated a dataset that should emulate what you want depending on what you are doing. First, I loaded the tidyverse package for data wrangling like creating/manipulating variables, then set a random seed so you can reproduce the simulated data.
#### Load Library ####
library(tidyverse)
#### Set Random Seed ####
set.seed(123)
Now I create a randomly distributed x value that is both positive and negative.
#### Create Randomly Distributed X w/Neg Values ####
tib <- tibble(
x = rnorm(n=100)
)
Creating Variables
Now we can make absolute values, followed by square roots, which are made negative if the original raw value was negative.
#### Create Absolute and Sqrt Values ####
new.tib <- tib %>%
mutate(
abs.x = abs(x),
sq.x = sqrt(abs.x),
final.x = ifelse(x < 0,
sq.x * -1,
sq.x)
)
new.tib
If you print new.tib, the end result will look like this:
# A tibble: 100 × 4
x abs.x sq.x final.x
<dbl> <dbl> <dbl> <dbl>
1 2.20 2.20 1.48 1.48
2 1.31 1.31 1.15 1.15
3 -0.265 0.265 0.515 -0.515
4 0.543 0.543 0.737 0.737
5 -0.414 0.414 0.644 -0.644
6 -0.476 0.476 0.690 -0.690
7 -0.789 0.789 0.888 -0.888
8 -0.595 0.595 0.771 -0.771
9 1.65 1.65 1.28 1.28
10 -0.0540 0.0540 0.232 -0.232
If you just want to select the final x values, you can simply select them, like so:
new.tib %>%
select(final.x)
Giving you just this one-column tibble (use pull(final.x) instead of select(final.x) if you need a plain vector):
# A tibble: 100 × 1
final.x
<dbl>
1 1.48
2 1.15
3 -0.515
4 0.737
5 -0.644
6 -0.690
7 -0.888
8 -0.771
9 1.28
10 -0.232
# … with 90 more rows
Using the first example in ?ifelse:
x <- c(6:-4)
[1] 6 5 4 3 2 1 0 -1 -2 -3 -4
sqrt(ifelse(x >= 0, x, -x))
[1] 2.449490 2.236068 2.000000 1.732051 1.414214 1.000000
[7] 0.000000 1.000000 1.414214 1.732051 2.000000
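Note that this returns the square root of the absolute value without the sign. If the sign of the original number should be kept, as described in the question, one possible sketch uses sign(); the helper name signed_sqrt is made up for illustration:

```r
# Hypothetical helper: square root of the magnitude, with the original sign restored
signed_sqrt <- function(x) {
  sign(x) * sqrt(abs(x))
}

x <- c(6:-4)
signed_sqrt(x)
```

Because sign(), abs() and sqrt() are all vectorized, no ifelse() or loop is needed here.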
I have a large number of data frames in R (>50), each corresponding to a different filter I've applied. Here's an example of 7 of them:
Steps_Day1 <- filter(PD2, Gait_Day == 1)
Steps_Day2 <- filter(PD2, Gait_Day == 2)
Steps_Day3 <- filter(PD2, Gait_Day == 3)
Steps_Day4 <- filter(PD2, Gait_Day == 4)
Steps_Day5 <- filter(PD2, Gait_Day == 5)
Steps_Day6 <- filter(PD2, Gait_Day == 6)
Steps_Day7 <- filter(PD2, Gait_Day == 7)
Each of the dataframes contains 19 variables, however I'm only interested in their speed (to calculate mean) and their subjectID, as each subject has multiple observations of speed in the same DF.
An example of the data we're interested in, in dataframe - Steps_Day1:
Speed SubjectID
0.6 1
0.7 1
0.7 2
0.8 2
0.1 2
1.1 3
1.2 3
1.5 4
1.7 4
0.8 4
The data goes up to 61 pts., and each participant's number of observations is much larger than this.
What I want to do is write code that automatically cycles through each of the 50 data frames (taking the 7 above as an example), calculates the mean speed for each participant, and stores it in a new data frame, alongside the columns containing the means for each participant from the other DFs.
An example of Steps day 1 (Values not accurate)
Speed SubjectID
0.6 1
0.7 2
1.2 3
1.7 4
and so on, until I end up with a final DF whose columns hold the means for each participant from each of the other data frames, which may look something like:
Steps_Day1 Steps_Day2 Steps_Day3 Steps_Day4 SubjectID
0.6 0.8 0.5 0.4 1
0.7 0.9 0.6 0.6 2
1.2 1.1 0.4 0.7 3
1.7 1.3 0.3 0.8 4
I could do this through some horrible, messy long code - but looking to see if anyone has more intuitive ideas please!
:)
To add to the previous answer, I agree that it is much easier to do this without creating a new data frame for each day. Using some generated data, you can achieve your desired results as follows:
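library(dplyr)
library(tidyr)
# Generate some data (the shorter columns are recycled to 500 rows)
df <- data.frame(
  day = rep(1:5, 1, 100),
  subject = rep(5:10, 1, 100),
  speed = runif(500)
)
# Mean speed per day and subject, then one column per day
df %>%
  group_by(day, subject) %>%
  summarise(avg_speed = mean(speed), .groups = "drop") %>%
  pivot_wider(names_from = day,
              names_prefix = "Steps_Day",
              values_from = avg_speed)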
# A tibble: 6 × 6
subject Steps_Day1 Steps_Day2 Steps_Day3 Steps_Day4 Steps_Day5
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5 0.605 0.416 0.502 0.516 0.517
2 6 0.592 0.458 0.625 0.531 0.460
3 7 0.475 0.396 0.586 0.517 0.449
4 8 0.430 0.435 0.489 0.512 0.548
5 9 0.512 0.645 0.509 0.484 0.566
6 10 0.530 0.453 0.545 0.497 0.460
You don't include a MCVE of your dataset so I can't test out a solution, but it seems like a pretty simple problem using tidyverse solutions.
First, why do you split PD2 into separate dataframes? If you skip that, you can just use group and summarize to get the average for groups:
PD2 %>%
group_by(Gait_Day, SubjectID) %>%
summarize(Steps = mean(Speed))
This will give you a "long-form" data.frame with 3 variables: Gait_Day, SubjectID, and Steps, where Steps holds the mean speed for that subject and day. If you want it in the format you show at the end, just pivot into "wide-form" using pivot_wider. See this question for further explanation: How to reshape data from long to wide format
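If the split data frames already exist and you would rather not go back to PD2, a rough sketch of the same idea is to stack them first. The two small data frames below are made-up stand-ins, and the names steps_list, source and subject_means are only for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-ins for the already-filtered data frames
Steps_Day1 <- data.frame(Speed = c(0.6, 0.8, 1.0, 1.2), SubjectID = c(1, 1, 2, 2))
Steps_Day2 <- data.frame(Speed = c(0.5, 0.7, 0.9, 1.1), SubjectID = c(1, 1, 2, 2))

# Collect them in a named list; the names end up as column names.
# With many data frames, mget(ls(pattern = "^Steps_Day")) could build this list.
steps_list <- list(Steps_Day1 = Steps_Day1, Steps_Day2 = Steps_Day2)

subject_means <- bind_rows(steps_list, .id = "source") %>%  # stack, remembering the origin
  group_by(source, SubjectID) %>%
  summarise(mean_speed = mean(Speed), .groups = "drop") %>%
  pivot_wider(names_from = source, values_from = mean_speed)

subject_means
```

This gives one row per SubjectID and one column of means per original data frame, without writing the same code 50 times.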
I need to prepare a table that includes the means and standard deviations for each level of several demographic variables, for many variables.
Consider the following data:
df <- tibble(
  place = c("London", "Paris", "London", "Rome", "Rome", "Madrid", "Madrid"),
  gender = c("m", "f", "f", "f", "m", "m", "f"),
  education = c(1, 1, 2, 3, 5, 5, 3),
  var1 = c(2.2, 3.1, 4.5, 1, 5, 1.4, 2.3),
  var2 = c(4.2, 2.1, 2.5, 4, 5, 4.4, 1.3),
  var3 = c(0.2, 0.1, 3.5, 3, 5, 2.4, 4.3)
)
I would like to get a dataframe that contains the grouping variables (place, gender, education) and their levels (e.g., London, Paris, etc.) in the first column and their means and standard deviations for each variable starting with var (var1, var2, var3) in additional columns.
I know how to do this for one group and several variables at a time. However, since I need to repeat this dozens of times I am looking for a way to automate this process. It would be great to have a function to which I simply need to pass (a) the names of the grouping variables (e.g., gender, education) and (b) the variables from which to get the M / SD (e.g. var1, var2).
The solution I look for should look like this (the stats are not correct in the example below):
my_results <- tibble(grouping_vars = c("place_London","place_Paris","place_Rome","place_Madrid","gender_m","gender_f","last_element"),mean_var1=c(1.3,2.5,4.5,1.7,2.5,3.6,4.0),sd_var1=c(0.01,0.41,0.21,0.12,0.02,0.38,0.28),mean_var2=c(4.3,4.5,4.0,1.2,2.5,1.6,2.3),sd_var2=c(0.21,0.1,0.1,0.32,0.22,0.18,0.08),mean_var3=c(2.3,2.5,2.0,3.2,3.5,0.6,5),sd_var3=c(0.51,0.15,0.51,0.52,0.52,0.15,0.48))
grouping_vars mean_var1 sd_var1 mean_var2 sd_var2 mean_var3 sd_var3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 place_London 1.3 0.01 4.3 0.21 2.3 0.51
2 place_Paris 2.5 0.41 4.5 0.1 2.5 0.15
3 place_Rome 4.5 0.21 4 0.1 2 0.51
4 place_Madrid 1.7 0.12 1.2 0.32 3.2 0.52
5 gender_m 2.5 0.02 2.5 0.22 3.5 0.52
6 gender_f 3.6 0.38 1.6 0.18 0.6 0.15
7 last_element 4 0.28 2.3 0.08 5 0.48
Since I typically work with tidyverse, I would particularly appreciate solutions that use these packages (probably dplyr or purrr?).
EDIT:
I thought there would be an elegant way to do this using map(). Maybe there is, but I haven't found it yet. In the meantime, I figured out a way that simply restructures the data into an appropriate long format and then computes the statistics.
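# names of the grouping variables and of the variables to summarise
grouping_vars <- c("place", "gender", "education")
test_vars <- c("var1", "var2", "var3")
df %>%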
# all grouping vars need to be of the same type, here "factor" is most appropriate
mutate_at(grouping_vars, list(factor)) %>%
# pivot longer, so that each row is a unique combination of grouping variable and grouping level
pivot_longer(
cols = one_of(grouping_vars),
names_to = "group_var",
values_to = "group_level"
) %>%
# merge grouping variable and group level into a single column
unite(var_level,group_var,group_level, sep="_") %>%
# group by group level
group_by(var_level) %>%
# compute means and sd for each test variable
summarise_at(test_vars, list(~mean(., na.rm = TRUE), ~sd(., na.rm = TRUE)))
The result seems fine, e.g., the mean of var1 of the two people who live in London is (2.2 + 4.5) / 2 = 3.35.
# A tibble: 10 x 7
var_level var1_mean var2_mean var3_mean var1_sd var2_sd var3_sd
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 education_1 2.65 3.15 0.15 0.636 1.48 0.0707
2 education_2 4.5 2.5 3.5 NA NA NA
3 education_3 1.65 2.65 3.65 0.919 1.91 0.919
4 education_5 3.2 4.7 3.7 2.55 0.424 1.84
5 gender_f 2.72 2.48 2.72 1.47 1.13 1.83
6 gender_m 2.87 4.53 2.53 1.89 0.416 2.40
7 place_London 3.35 3.35 1.85 1.63 1.20 2.33
8 place_Madrid 1.85 2.85 3.35 0.636 2.19 1.34
9 place_Paris 3.1 2.1 0.1 NA NA NA
10 place_Rome 3 4.5 4 2.83 0.707 1.41
Any thoughts on possible risks of this approach or how this could be improved?
One option is the describeBy function from psych:
library(psych)
describeBy(df,group = c("gender","education"), mat= TRUE)
Then subset what you want from there.
Another, surprisingly simple option with dplyr:
library(dplyr)
group.vars <- c("gender","education")
measure.vars <- c("var1","var2")
df %>%
group_by_at(group.vars) %>%
summarize_at(measure.vars,
list(mean =~ mean(.),sd =~ sd(.)))
# A tibble: 5 x 6
# Groups: gender [2]
gender education var1_mean var2_mean var1_sd var2_sd
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 f 1 3.1 2.1 NA NA
2 f 2 4.5 2.5 NA NA
3 f 3 1.65 2.65 0.919 1.91
4 m 1 2.2 4.2 NA NA
5 m 5 3.2 4.7 2.55 0.424
You can continue adding additional functions to that list. For every element, the name will be appended to the variable name and the result will become the column values. Recall that ~ mean(.) is shorthand for function(x) mean(x).
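Since the question asks for a reusable function, here is a rough sketch that wraps the long-format approach from the EDIT using across(), which supersedes the _at variants in dplyr 1.0 and later. The function name summarise_by_level and the object res are made up for illustration:

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
  place = c("London", "Paris", "London", "Rome", "Rome", "Madrid", "Madrid"),
  gender = c("m", "f", "f", "f", "m", "m", "f"),
  education = c(1, 1, 2, 3, 5, 5, 3),
  var1 = c(2.2, 3.1, 4.5, 1, 5, 1.4, 2.3),
  var2 = c(4.2, 2.1, 2.5, 4, 5, 4.4, 1.3),
  var3 = c(0.2, 0.1, 3.5, 3, 5, 2.4, 4.3)
)

# Hypothetical helper: mean and sd of test_vars for every level of every grouping variable
summarise_by_level <- function(data, grouping_vars, test_vars) {
  data %>%
    # grouping columns must share one type before they are stacked
    mutate(across(all_of(grouping_vars), as.character)) %>%
    pivot_longer(all_of(grouping_vars),
                 names_to = "group_var", values_to = "group_level") %>%
    unite("var_level", group_var, group_level, sep = "_") %>%
    group_by(var_level) %>%
    summarise(across(all_of(test_vars),
                     list(mean = ~ mean(.x, na.rm = TRUE),
                          sd = ~ sd(.x, na.rm = TRUE))),
              .groups = "drop")
}

res <- summarise_by_level(df, c("place", "gender"), c("var1", "var2"))
res
```

The columns come out as var1_mean, var1_sd, var2_mean, var2_sd, so the order differs slightly from the summarise_at() output. Levels with a single observation still get NA for the standard deviation.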
To give a small working example, suppose I have the following data frame:
library(dplyr)
country <- rep(c("A", "B", "C"), each = 6)
year <- rep(c(1,2,3), each = 2, times = 3)
categ <- rep(c(0,1), times = 9)
pop <- rep(c(NA, runif(n=8)), each=2)
money <- runif(18)+100
df <- data.frame(Country = country,
Year = year,
Category = categ,
Population = pop,
Money = money)
Now the data I'm actually working with has many more repetitions, namely for every country, year, and category, there are many repeated rows corresponding to various sources of money, and I want to sum these all together. However, for now it's enough just to have one row for each country, year, and category, and just trivially apply the sum() function on each row. This will still exhibit the behavior I'm trying to get rid of.
Notice that for country A in year 1, the population listed is NA. Therefore when I run
aggregate(Money ~ Country+Year+Category+Population, df, sum)
the resulting data frame has dropped the rows corresponding to country A and year 1. I'm only using the ...+Population... bit of code because I want the output data frame to retain this column.
I'm wondering how to make the aggregate() function not drop things that have NAs in the columns by which the grouping occurs--it'd be nice if, for instance, the NAs themselves could be treated as values to group by.
My attempts: I tried turning the Population column into factors but that didn't change the behavior. I read something on the na.action argument but neither na.action=NULL nor na.action=na.skip changed the behavior. I thought about trying to turn all the NAs to 0s, and I can't think of what that would hurt but it feels like a hack that might bite me later on--not sure. But if I try to do it, I'm not sure how I would. When I wrote a function with the is.na() function in it, it didn't apply the if (is.na(x)) test in a vectorized way and gave the error that it would just use the first element of the vector. I thought about perhaps using lapply() on the column and coercing it back to a vector and sticking that in the column, but that also sounds kind of hacky and needlessly round-about.
The solution here seemed to be about keeping the NA values out of the data frame in the first place, which I can't do: Aggregate raster in R with NA values
Since you already load dplyr in your example, you can use dplyr::summarise. Together with group_by, it supports grouping on NA values.
library(dplyr)
df %>% group_by(Country,Year,Category,Population) %>%
summarise(Money = sum(Money))
# # A tibble: 18 x 5
# # Groups: Country, Year, Category [?]
# Country Year Category Population Money
# <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 A 1.00 0 NA 101
# 2 A 1.00 1.00 NA 100
# 3 A 2.00 0 0.482 101
# 4 A 2.00 1.00 0.482 101
# 5 A 3.00 0 0.600 101
# 6 A 3.00 1.00 0.600 101
# 7 B 1.00 0 0.494 101
# 8 B 1.00 1.00 0.494 101
# 9 B 2.00 0 0.186 100
# 10 B 2.00 1.00 0.186 100
# 11 B 3.00 0 0.827 101
# 12 B 3.00 1.00 0.827 101
# 13 C 1.00 0 0.668 100
# 14 C 1.00 1.00 0.668 101
# 15 C 2.00 0 0.794 100
# 16 C 2.00 1.00 0.794 100
# 17 C 3.00 0 0.108 100
# 18 C 3.00 1.00 0.108 100
Note: The OP's sample data doesn't have multiple rows for same groups. Hence, number of summarized rows will be same as actual rows.
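To see the NA group actually being summed, here is a small made-up example with repeated rows per group; the data frame toy and its values are purely illustrative:

```r
library(dplyr)

# Made-up data: two rows share the group (A, NA), two share (B, 0.2)
toy <- data.frame(
  Country = c("A", "A", "A", "B", "B"),
  Population = c(NA, NA, 0.5, 0.2, 0.2),
  Money = c(1, 2, 3, 4, 5)
)

# NA in a grouping column is kept as a group of its own
totals <- toy %>%
  group_by(Country, Population) %>%
  summarise(Money = sum(Money), .groups = "drop")

totals
```

The two rows for country A with a missing Population are collapsed into a single row with Money equal to 3, rather than being dropped.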
I am looking for an explicit function to subscript elements in R, say subscript(x,i) to mean x[i].
The reason that I need this traces back to a piece of code using dplyr and magrittr pipe operator, which is not a pipe, and where I need to divide by the first element of each column.
pipedDF <- rawdata %>% filter, merge, summarize, dcast %>%
mutate_each( funs(./subscript(., 1) ), -index)
I think this would do the trick and keep that pipe syntax which people like.
Without dplyr it would look like this...
Example,
> df
index a b c
1 1 6.00 5.0 4
2 2 7.50 6.0 5
3 3 5.00 4.5 6
4 4 9.00 7.0 7
> data.frame(sapply(df, function(x)x/x[1]))
index a b c
1 1 1.00 1.0 1.00
2 2 1.25 1.2 1.25
3 3 0.83 0.9 1.50
4 4 1.50 1.4 1.75
You should be able to use '[', as in
x<-5:1
'['(x,2)
# [1] 4
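In a pipeline this could look roughly as follows. The sketch assumes a current dplyr, where across() replaces mutate_each() and funs(), and rebuilds the example table from the question; the name scaled is made up:

```r
library(dplyr)

# Example table from the question
df <- data.frame(
  index = 1:4,
  a = c(6, 7.5, 5, 9),
  b = c(5, 6, 4.5, 7),
  c = c(4, 5, 6, 7)
)

# Divide every column except index by its first element,
# calling `[` as an ordinary function
scaled <- df %>%
  mutate(across(-index, ~ .x / `[`(.x, 1)))

scaled
```

Writing .x[1] inside the lambda is equivalent, and dplyr::first(.x) is another option if you prefer a named function.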