I have a final dataset of roughly 150,000 rows by 40 columns that covers all my potential samples from 1932 to 2016, and I need to make a random selection of 53 samples per year, for a total of roughly 4,500.
The selection itself is straightforward using the sample() function to get a subset; however, I need to flag the selection in the original dataframe to be able to check various things. My issue is the following:
If I edit one of the fields in my random subset and merge it back with the main dataframe, it creates duplicates that I can't remove, because one field changed and R therefore no longer treats the two rows as duplicates. If I don't edit anything, I can't tell which rows were selected.
My workaround so far has been to merge everything in Excel instead of R, apply color codes to highlight the selected rows and delete the duplicates manually. However, it's time-consuming, error-prone and not practical, as the dataset seems to be too big and my PC quickly runs out of memory when I try...
UPDATE:
Here's a reproducible example:
dat <- data.frame(
X = sample(2000:2016, 50, replace=TRUE),
Y = sample(c("yes", "no"), 50, replace = TRUE),
Z = sample(c("french","german","english"), 50, replace=TRUE)
)
dat2 <- subset(dat, X == 2000)        # samples of year 2000
sc <- dat2[sample(nrow(dat2), 1), ]   # random selection of 1
What I would like to do is make the selection directly in the dataset (dat), for example by randomly assigning the value "1" in a column called "selection". Or, if that's not possible, how can I merge the sampled rows (here called "sc") back into the main dataset with something indicating they have been sampled?
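To make the desired end state concrete, here is a minimal base-R sketch of what I mean, just for one year (the column name "selection" is only a placeholder):
dat$selection <- 0                      # 0 = not sampled
cand <- which(dat$X == 2000)            # candidate rows for year 2000
idx <- cand[sample(length(cand), 1)]    # pick one at random (safe even with a single candidate)
dat$selection[idx] <- 1                 # 1 = sampled
dat[dat$selection == 1, ]               # the sampled row(s), still inside dat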
Note:
I've been using R sporadically for the last 2 years and I'm a fairly inexperienced user, so I apologize if this is a silly question. I've been roaming Google and SO for the last 3 days and couldn't find any relevant answer yet.
I recently got into a PhD program in biology that requires me to handle a lot of data from an archive.
EDIT: updated based on comments.
You could add a column that indicates if a row is part of your sample. So maybe try the following:
df = data.frame(year= c(1,1,1,1,1,1,2,2,2,2,2,2), id=c(1,2,3,4,5,6,7,8,9,10,11,12),age=c(7,7,7,12,12,12,7,7,7,12,12,12))
library(dplyr)
n_per_year_low_age = 2
n_per_year_high_age = 1
df <- df %>%
  group_by(year) %>%
  mutate(in_sample1 = as.numeric(id %in% sample(id[age < 8], n_per_year_low_age))) %>%
  mutate(in_sample2 = as.numeric(id %in% sample(id[age > 8], n_per_year_high_age))) %>%
  mutate(in_sample = in_sample1 + in_sample2) %>%
  select(-in_sample1, -in_sample2)
Output:
# A tibble: 12 x 4
# Groups: year [2]
year id age in_sample
<dbl> <dbl> <dbl> <dbl>
1 1.00 1.00 7.00 1.00
2 1.00 2.00 7.00 1.00
3 1.00 3.00 7.00 0
4 1.00 4.00 12.0 1.00
5 1.00 5.00 12.0 0
6 1.00 6.00 12.0 0
7 2.00 7.00 7.00 1.00
8 2.00 8.00 7.00 0
9 2.00 9.00 7.00 1.00
10 2.00 10.0 12.0 0
11 2.00 11.0 12.0 0
12 2.00 12.0 12.0 1.00
Further operations are then trivial:
# extracting your sample
df %>% filter(in_sample==1)
# comparing statistics of your sample against the rest of the population
df %>% group_by(year,in_sample) %>% summarize(mean(id))
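Carried over to the original 53-per-year selection, the same pattern might look roughly like this (untested sketch; the year column is assumed to be X as in the reproducible example, and min() guards years with fewer than 53 candidate rows):
library(dplyr)
dat <- dat %>%
  group_by(X) %>%                                  # X is the year column
  mutate(selection = as.integer(
    row_number() %in% sample(n(), min(53, n()))    # flag 53 random rows per year
  )) %>%
  ungroup()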
I have a big dataset ('links_cl'; each participant of a study has several hundred rows), which I need to subset into dfs, one for each participant.
For those 42 dfs, I then need to do the same operation again and again. After spending half a day trying to write my own function and trying to find a solution online, I now have to ask here.
So, I am looking for a way to subset the huge dataset several times and have one df in my environment for every participant, without repeating the same code 42 times. What I did so far 'by hand' is:
Subj01 <- subset(links_cl, Subj == 01, select = c("Condition", "ACC_LINK", "RT_LINK", "CONF_LINK", "ACC_SOURCE", "RT_SOURCE", "CONF_SOURCE"))
I then need to filter on the column 'Condition' (either == 1, 2, 3 or 4) and describe/get the mean and SD of 'RT_LINK', which I so far also did 'manually':
Subj01 %>% filter(Condition == 01) %>% describe(Subj01$RT_LINK)
But here I just get the description of the whole Subj01 df, so I would have to find 4 x 41 means by hand. It would be great to just have an output with the means and SDs for every participant, but I have no idea where to start or how to tell R to do this.
I tried this, but it won't work:
subsetsubj <- function(x,y) {
Subj_x <- links_cl %>%
subset(links_cl,
Subj == x,
select = c("Condition", "ACC_LINK", "RT_LINK", "CONF_LINK", "ACC_SOURCE", "RT_SOURCE", "CONF_SOURCE")) %>%
filter(Condition == y) %>%
describe(Subj_x$RT_LINK)
}
I also tried putting all dfs into a list and working with that, but it led nowhere.
If there is a solution without the subsetting, that would also work; it just seemed a logical step to me. Any idea or help on how to solve this?
You don't really need to split the dataset up into one dataframe for each participant. I would recommend a standard group_by()/summarize() approach, like this:
links_cl %>%
group_by(Subj, Condition) %>%
summarize(mean_val = mean(RT_LINK),
sd_val = sd(RT_LINK))
Output:
Subj Condition mean_val sd_val
<int> <int> <dbl> <dbl>
1 1 1 0.0375 0.873
2 1 2 0.103 1.05
3 1 3 0.184 0.764
4 1 4 0.0375 0.988
5 2 1 -0.0229 0.962
6 2 2 -0.156 0.820
7 2 3 -0.175 0.999
8 2 4 -0.0763 1.12
9 3 1 0.272 1.02
10 3 2 0.0172 0.835
# … with 158 more rows
Input:
set.seed(123)
links_cl <- data.frame(
Subj = rep(1:42, each =100),
Condition = rep(1:4, times=4200/4),
RT_LINK = rnorm(4200)
)
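If you do still want one df per participant (e.g. for other per-subject processing), a named list is usually easier to handle than 42 separate objects in the environment; a small sketch:
per_subj <- split(links_cl, links_cl$Subj)     # named list: one data frame per participant
# e.g. mean RT_LINK for participant 1, condition 2
with(subset(per_subj[["1"]], Condition == 2), mean(RT_LINK))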
I am measuring electric current (µA) over a certain time interval (s) for 4 different channels (chan_n), and this is what my data look like:
dat
s µA chan_n
<dbl> <dbl> <chr>
0.00 -0.03167860 1
0.02 -0.03136610 1
0.04 -0.03118490 1
0.06 -0.03094740 1
0.08 -0.03065360 1
0.10 -0.03047860 1
0.12 -0.03012230 1
0.14 -0.02995980 1
0.16 -0.02961610 1
... ... ...
My end goal is to get the current at a certain time after the peak value. Therefore, I first get the time points at which the maximum appears for each channel:
BaselineTime <- dat %>%
  group_by(chan_n) %>%
  slice(which.max(µA)) %>%  # get the row with the max current value per channel
  transmute(s = s + 30)     # add 30 to the time points at which the max value appears
chan_n s
<chr> <dbl>
1 539.84
2 540.00
3 539.82
4 539.80
But if I use BaselineTime to filter for my current values I get two NAs:
BaselineVal <- right_join(dat, BaselineTime, by = c("chan_n", "s"))
s µA chan_n
<dbl> <dbl> <chr>
540.00 0.00364974 2
539.80 0.00610948 4
539.84 NA 1
539.82 NA 3
I checked whether the time values exist for channels 1 and 3, and they do. Also, if I create a data frame manually by hardcoding the time values and use it for filtering, it works just fine.
So why isn't it working? I would be very happy for any suggestions or explanations.
I think it might have something to do with the decimal places, as for channels 2 and 4 there is a 0 in the last decimal place.
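For illustration, this is the kind of exact-equality problem with doubles I suspect, together with the round() workaround I'm considering (untested on my real data):
0.1 + 0.2 == 0.3                # FALSE: the two doubles are close but not identical
dplyr::near(0.1 + 0.2, 0.3)     # TRUE: comparison with a small tolerance
# possible workaround: round the join key in both data frames before joining, e.g.
# dat <- dplyr::mutate(dat, s = round(s, 2))
# BaselineTime <- dplyr::mutate(BaselineTime, s = round(s, 2))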
Untested as the sample data isn't suitable for testing. I would try something like this:
data %>%
group_by(chan_n) %>%
mutate(
is_peak = row_number() == which.max(µA),
post_peak = lag(is_peak, n = 30, default = FALSE)
)
This will give a TRUE in the new post_peak column 30 rows after the peak, so you can trivially ... %>% filter(post_peak) or do whatever you need to with the result.
If you need more help than this, please share some data that illustrates the problem better, e.g., 10 rows each of 2 chan_n groups with the goal of finding the row 3 after the peak (and that row existing in the data).
I'm looking to include a group statement within my for loop and I'm having difficulty finding any details into how to properly do this.
The example below calculates the Extra, Outstanding and current columns within my loop. I'm trying to group by id so that the loop restarts with every id. My current code:
dat <- tibble(
id = c("A","A","A","A","A","A","B","B"),
rn= c(1,2,3,4,5,6,1,2),
current = c(100,0,0,0,0,0,500,0),
paid = c(10,12,12,13,13,13,20,20),
pct_extra = c(.02,.05,.05,.07, .03, .01, .09,.01),
Extra = NA,
Outstanding = NA)
for (i in 1:nrow(dat)) {
  dat$Extra[i] <- dat$current[i] * dat$pct_extra[i]
  dat$Outstanding[i] <- dat$current[i] - dat$paid[i] - dat$Extra[i]
  if (i < nrow(dat)) {
    dat$current[(i + 1)] <- dat$Outstanding[i]
  }
}
I saw other posts with this same question and they seem to resort to dplyr, so my first attempt was:
for(i in 1:nrow(dat)){
dat%>%
group_by(id)%>%
mutate(Extra=pct_extra*(current-paid),
Outstanding=current-paid-Extra,
current=if_else(rn==1,current,lag(Outstanding)))}
But this attempt didn't actually calculate the Extra, Outstanding and current columns, which I guess is because I'm not using the loop properly.
Does anyone have any suggestions/references on how I can include a group statement into my for loop?
Thanks!
A few things.
First, for loops (surrounding dplyr pipes) are generally not necessary with dplyr grouping, and this is no exception (though we will use your for loop in a "single group at a time" way).
Second, even if a loop were needed, you loop over i but never use i, so you're doing the same calculation on all rows, nrow(dat) times.
Third, you aren't storing the results.
My first attempt (after realizing the rolling nature of this) was to try to adapt slider::slide to it, but unfortunately I couldn't get it to work.
In older dplyr, I would have used dat %>% group_by(id) %>% do({...}), but do has been superseded in favor of across and multi-row summarize (which I could not figure out how to make do this).
So then I realized that your for loop works fine, it just needs to be applied one group at a time.
func <- function(z) {
for (i in seq_len(nrow(z))) {
z$Extra[i] <- z$current[i]*z$pct_extra[i]
z$Outstanding[i] <- z$current[i] - z$paid[i] - z$Extra[i]
if (i < nrow(z)) {
z$current[(i+1)] <- z$Outstanding[i]
}
}
z
}
library(dplyr)
library(tidyr) # nest, unnest
library(purrr) # map, can be done with base::Map as well
dat %>%
group_by(id) %>%
nest(quux = -id) %>%
mutate(quux = map(quux, func)) %>%
unnest(quux) %>%
ungroup()
# # A tibble: 8 x 7
# id rn current paid pct_extra Extra Outstanding
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 1 100 10 0.02 2 88
# 2 A 2 88 12 0.05 4.4 71.6
# 3 A 3 71.6 12 0.05 3.58 56.0
# 4 A 4 56.0 13 0.07 3.92 39.1
# 5 A 5 39.1 13 0.03 1.17 24.9
# 6 A 6 24.9 13 0.01 0.249 11.7
# 7 B 1 500 20 0.09 45 435
# 8 B 2 435 20 0.01 4.35 411.
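As a side note, dplyr's group_modify() could be an alternative to the nest()/unnest() round trip (assuming a recent dplyr version that provides it); untested sketch:
dat %>%
  group_by(id) %>%
  group_modify(~ func(.x)) %>%   # .x holds one group's rows (without the id column)
  ungroup()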
In R, I perform Dunn's test. The function I use has no option to group the input variables by their statistically significant differences. However, this is what I am genuinely interested in, so I tried to write my own function. Unfortunately, I am not able to wrap my head around it. Perhaps someone can help.
I use the airquality dataset that comes with R as an example. The result that I need could look somewhat like this:
> library (tidyverse)
> ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))
# A tibble: 5 x 2
Month Mean
<int> <dbl>
1 5 23.6
2 6 29.4
3 7 59.1
4 8 60.0
5 9 31.4
When I run the dunn.test, I get the following:
> dunn.test::dunn.test (airquality$Ozone, airquality$Month, method = "bh", altp = T)
Kruskal-Wallis rank sum test
data: x and group
Kruskal-Wallis chi-squared = 29.2666, df = 4, p-value = 0
Comparison of x by group
(Benjamini-Hochberg)
Col Mean-|
Row Mean | 5 6 7 8
---------+--------------------------------------------
6 | -0.925158
| 0.4436
|
7 | -4.419470 -2.244208
| 0.0001* 0.0496*
|
8 | -4.132813 -2.038635 0.286657
| 0.0002* 0.0691 0.8604
|
9 | -1.321202 0.002538 3.217199 2.922827
| 0.2663 0.9980 0.0043* 0.0087*
alpha = 0.05
Reject Ho if p <= alpha
From this result, I deduce that May differs from July and August, June differs from July (but not from August) and so on. So I'd like to append significantly differing groups to my results table:
# A tibble: 5 x 3
Month Mean Group
<int> <dbl> <chr>
1 5 23.6 a
2 6 29.4 ac
3 7 59.1 b
4 8 60.0 bc
5 9 31.4 a
While I did this by hand, I suppose it must be possible to automate the process. However, I can't find a good starting point. I created a dataframe containing all comparisons:
> ozone_differences <- dunn.test::dunn.test (airquality$Ozone, airquality$Month, method = "bh", altp = T)
> ozone_differences <- data.frame ("P" = ozone_differences$altP.adjusted, "Compare" = ozone_differences$comparisons)
P Compare
1 4.436043e-01 5 - 6
2 9.894296e-05 5 - 7
3 4.963804e-02 6 - 7
4 1.791748e-04 5 - 8
5 6.914403e-02 6 - 8
6 8.604164e-01 7 - 8
7 2.663342e-01 5 - 9
8 9.979745e-01 6 - 9
9 4.314957e-03 7 - 9
10 8.671708e-03 8 - 9
I thought that a function iterating through this data frame and using a selection variable to choose the right letter from the built-in letters vector might work. However, I cannot even think of a starting point, because changing numbers of rows have to be considered at the same time...
Perhaps someone has a good idea?
Perhaps you could look into the cldList() function from the rcompanion library; you can pipe the res results from the output of dunnTest() into it and create a table that gives the compact letter display per group.
Following the advice of @TylerRuddenfort, the following code will work. The first cld is created with rcompanion::cldList, and the second directly uses multcompView::multcompLetters. Note that to use multcompLetters, the spaces have to be removed from the names of the comparisons.
Here, I have used FSA::dunnTest for the Dunn test (1964).
In general, I recommend ordering groups by e.g. median or mean before running e.g. dunnTest if you plan on using a cld, so that the cld comes out in a sensible order.
library (tidyverse)
ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))
library(FSA)
Result = dunnTest(airquality$Ozone, airquality$Month, method = "bh")$res
### Use cldList()
library(rcompanion)
cldList(P.adj ~ Comparison, data=Result)
### Use multcompView
library(multcompView)
X = Result$P.adj <= 0.05
names(X) = gsub(" ", "", Result$Comparison)
multcompLetters(X)
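To append the letters to the summary table from the question, something like the following could work (assuming cldList() returns a data frame with Group and Letter columns, as in current rcompanion versions):
library(dplyr)
cld <- cldList(P.adj ~ Comparison, data = Result)
ozone_summary %>%
  mutate(Month = as.character(Month)) %>%      # cld$Group is character
  left_join(cld, by = c("Month" = "Group")) %>%
  select(Month, Mean, Letter)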
I am fairly new to R, and I am trying to create a new column, which is one column minus another column. For example:
price <- c("$10.00", "$7.15", "$8.75", "12.00", "9.20")
quantity <- c(5, 6, 7, 8, 9)
price <- as.factor(price)
quantity <- as.factor(quantity)
df <- data.frame(price, quantity)
In my actual data set, all the columns imported as factors. When I try to create the new column I get this:
diff <- price - quantity
Warning message:
In Ops.factor(price, quantity) : '-' not meaningful for factors
I have tried coercing the data to numeric using as.numeric(df), as.numeric(levels(df)), as.numeric(levels(df))[df], and setting stringsAsFactors to FALSE, but the data get converted to NAs. data.matrix() changes the values. Is there another way to get the above calculation to work? Thanks!
You should avoid quotes and $ signs in the price column, and avoid converting the columns to factors, if you want to do math operations on them:
price <- c(10.00, 7.15, 8.75, 12.00, 9.20)
quantity <- c(5, 6, 7, 8, 9)
df <- data.frame(price, quantity)
df$diff <- price - quantity
df
price quantity diff
1 10.00 5 5.00
2 7.15 6 1.15
3 8.75 7 1.75
4 12.00 8 4.00
5 9.20 9 0.20
Try:
as.numeric(gsub("^\\$","", price))-as.numeric(as.character(quantity))
#[1] 5.00 1.15 1.75 4.00 0.20
Or from df
df$diff <- Reduce(`-`,lapply(df, function(x) as.numeric(gsub("^\\$","",x))))
df$diff
#[1] 5.00 1.15 1.75 4.00 0.20
If you're stuck with factor columns, you could add a new diff column with within() and some type coercion
> within(df, {
diff <- as.numeric(gsub("[$]", "", price)) -
as.numeric(as.character(quantity))
})
# price quantity diff
# 1 $10.00 5 5.00
# 2 $7.15 6 1.15
# 3 $8.75 7 1.75
# 4 12.00 8 4.00
# 5 9.20 9 0.20
You may also consider going back and re-reading the data into R. It's simple, and will make things a little easier. Here's how you could do it and get the desired result that way.
Create a data file: This won't be necessary for you, since you can just read the original file again.
> write.table(df, "df.txt")
Read the data into R, remove the $ sign, and calculate the difference:
> df2 <- read.table("df.txt", stringsAsFactors = FALSE)
> df2$price <- as.numeric(gsub("[$]", "", df2$price))
> with(df2, { price - quantity })
# [1] 5.00 1.15 1.75 4.00 0.20
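If you end up re-reading the file anyway, readr::parse_number() can also strip the currency symbol at parse time; a small sketch building on the df.txt written above:
library(readr)                                # for parse_number()
df2 <- read.table("df.txt", stringsAsFactors = FALSE)
df2$price <- parse_number(df2$price)          # "$10.00" -> 10, "12.00" -> 12
df2$price - df2$quantity
# [1] 5.00 1.15 1.75 4.00 0.20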