My code is below; I am wondering if there is a better way to make it faster:
pos <- NULL
row <- data.frame(matrix(nrow = 216, ncol = 4))
colnames(row) <- c("sub", "subi", "group", "trial")
for (i in 1:100000) {
  row$sub <- "Positive"
  row$subi <- NA
  row$group <- NA
  row$subi[1:144] <- 1:144
  row$group[1:144] <- 1
  row$subi[145:216] <- 1:72
  row$group[145:216] <- 2
  row$trial <- i
  pos <- rbind(pos, row)
}
No loop needed. You can build a data.frame or tibble (my example) directly. Given that you may want to adjust the number of rows later:
library(dplyr)

n_rows <- 10000
tibble(
  trial = 1:n_rows,
  sub = "positive",
  subi = c(1:144, 1:72, rep(NA, n_rows - 216)),
  group = c(rep(1, 144), rep(2, 72), rep(NA, n_rows - 216))
)
Output is:
# A tibble: 10,000 × 4
   trial sub       subi group
   <int> <chr>    <int> <dbl>
 1     1 positive     1     1
 2     2 positive     2     1
 3     3 positive     3     1
 4     4 positive     4     1
 5     5 positive     5     1
 6     6 positive     6     1
 7     7 positive     7     1
 8     8 positive     8     1
 9     9 positive     9     1
10    10 positive    10     1
# … with 9,990 more rows
The only thing different in each pass of the loop is trial. rep is your friend. For the other columns, R will automatically recycle to match the longest column (here, it is trial with 21.6M rows).
pos <- data.frame(
  sub = "Positive",
  subi = c(1:144, 1:72),
  group = rep.int(1:2, c(144, 72)),
  trial = rep(1:1e5, each = 216)
)
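To see the recycling rule in action, here is a toy example (not from the original code):

# the length-1 and length-3 vectors are recycled to match the length-6 column
data.frame(x = "a", y = 1:3, z = 1:6)
#   x y z
# 1 a 1 1
# 2 a 2 2
# 3 a 3 3
# 4 a 1 4
# 5 a 2 5
# 6 a 3 6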
It looks like you are trying to replicate this data frame 100,000 times, with each iteration of the frame having a different trial number.
data.frame(sub = rep("Positive", 216),
           subi = c(1:144, 1:72),
           group = rep(c(1, 2), c(144, 72)))
The replicate function is great for running static code multiple times. So one option is to create your 100,000 copies and then update the trial number.
FrameList <-
  replicate(n = 100000,
            {
              data.frame(sub = rep("Positive", 216),
                         subi = c(1:144, 1:72),
                         group = rep(c(1, 2), c(144, 72)),
                         trial = rep(NA_real_, 216))
            },
            simplify = FALSE)
To update the trial number, you can go with a for loop
for (i in seq_along(FrameList)){
  FrameList[[i]]$trial <- i
}
or you can try something fancy-pants that takes a bit more code
FrameList <- mapply(function(FL, i){
                      FL$trial <- i
                      FL
                    },
                    FrameList,
                    seq_along(FrameList),
                    SIMPLIFY = FALSE)
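As an aside, Map() is just mapply() with SIMPLIFY = FALSE, so the same update can be written a little more compactly:

FrameList <- Map(function(FL, i){
  FL$trial <- i
  FL
}, FrameList, seq_along(FrameList))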
Whichever way you go, you can stack them all together with
Frame <- do.call("rbind", FrameList)
This certainly isn't the most elegant way to do this, so watch for others to give you other clever tricks. But this, I would guess, would be the basic process to follow.
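If the stacking step turns out to be slow at 100,000 frames, one common alternative (a sketch, assuming the data.table package is available) is rbindlist:

library(data.table)
Frame <- rbindlist(FrameList)  # returns a data.table; wrap in as.data.frame() if needed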
Hello all, an R noob here,
I hope you guys can help me with the following.
I need to transform multiple columns in my dataset into new columns based on the values in the original columns, multiple times. That means that for the first transformation I use columns 1, 2, 3, and if certain conditions are met the output is a new column with a 1 or a 0; for the second transformation I use columns 4, 5, 6, and the output should again be a 1 or a 0. I have to do this 18 times. I already wrote a function which successfully does the transformation if I input the variables manually, but I would like to apply this function to all the desired columns at once. My desired output would be 18 new columns of 0's and 1's. Finally, I will make a last column which displays a 1 if any of the 18 columns is a 1 and a 0 otherwise.
df <- data.frame(admiss1 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by = "day"), 12),
                 admiss2 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by = "day"), 12),
                 admiss3 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by = "day"), 12),
                 visit1 = sample(seq(as.Date('1995/01/01'), as.Date('1996/01/01'), by = "day"), 12),
                 visit2 = sample(seq(as.Date('1997/01/01'), as.Date('1998/01/01'), by = "day"), 12),
                 reason1 = sample(3, 12, replace = TRUE),
                 reason2 = sample(3, 12, replace = TRUE),
                 reason3 = sample(3, 12, replace = TRUE))
df$discharge1 <- df$admiss1 + 10
df$discharge2 <- df$admiss2 + 10
df$discharge3 <- df$admiss3 + 10
#every discharge date is 10 days after the admission date for the sake of this example
#now I have the following dataframe
#for the sake of it I included only 3 dates and reasons(instead of 18)
admiss1 admiss2 admiss3 visit1 visit2 reason1 reason2 reason3 discharge1 discharge2 discharge3
1 1990-03-12 1992-04-04 1998-07-31 1995-01-24 1997-10-07 2 1 3 1990-03-22 1992-04-14 1998-08-10
2 1999-05-18 1990-11-25 1995-10-04 1995-03-06 1997-03-13 1 2 1 1999-05-28 1990-12-05 1995-10-14
3 1993-07-16 1998-06-10 1991-07-05 1995-11-06 1997-11-15 1 1 2 1993-07-26 1998-06-20 1991-07-15
4 1991-07-05 1992-06-17 1995-10-12 1995-05-14 1997-05-02 2 1 3 1991-07-15 1992-06-27 1995-10-22
5 1995-08-16 1999-03-08 1992-04-03 1995-02-20 1997-01-03 1 3 3 1995-08-26 1999-03-18 1992-04-13
6 1999-10-07 1991-12-26 1995-05-05 1995-10-24 1997-10-15 3 1 1 1999-10-17 1992-01-05 1995-05-15
7 1998-03-18 1992-04-18 1993-12-31 1995-11-14 1997-06-14 3 2 2 1998-03-28 1992-04-28 1994-01-10
8 1992-08-04 1991-09-16 1992-04-23 1995-05-29 1997-10-11 1 2 3 1992-08-14 1991-09-26 1992-05-03
9 1997-02-20 1990-02-12 1998-03-08 1995-10-09 1997-12-29 1 1 3 1997-03-02 1990-02-22 1998-03-18
10 1992-09-16 1997-06-16 1997-07-18 1995-12-11 1997-01-12 1 2 2 1992-09-26 1997-06-26 1997-07-28
11 1991-01-25 1998-04-07 1999-07-02 1995-12-27 1997-05-28 3 2 1 1991-02-04 1998-04-17 1999-07-12
12 1996-02-25 1993-03-30 1997-06-25 1995-09-07 1997-10-18 1 3 2 1996-03-06 1993-04-09 1997-07-05
admissdate <- function(admis, dis, rsn, vis1, vis2){
  xnew <- ifelse(df[eval(substitute(admis))] >= df[eval(substitute(vis1))] &
                   df[eval(substitute(dis))] <= df[eval(substitute(vis2))] &
                   df[eval(substitute(rsn))] == 2, 1, 0)
  xnew <- ifelse(df[eval(substitute(admis))] >= df[eval(substitute(vis1))] &
                   df[eval(substitute(admis))] <= df[eval(substitute(vis2))] &
                   df[eval(substitute(dis))] >= df[eval(substitute(vis2))] &
                   df[eval(substitute(rsn))] == 2, 1, xnew)
  return(xnew)
}
I wrote this function to generate a 1 if the conditions are true and a 0 if the conditions are false.
-Condition 1: admission date and discharge date are between visit 1 and visit 2 + admission reason is 2.
-Condition 2: admission date is after visit 1 but before visit 2 and the discharge date is after visit 2 with also admission reason 2.
It should return 1 if these conditions are true and 0 if these conditions are false. Eventually, I will end up with 18 new variables with 1's or 0's and will combine them to make one variable with Admission between visit 1 and visit 2 (with reason 2).
If I manually input the variable names it works, but I can't make it work for all the variables at once. I tried to make string vectors with all the admission dates, discharge dates and reasons and tried to transform them with mapply, but this does not work.
admiss <- paste0(rep("admiss", 3), 1:3)
discharge <- paste0(rep("discharge", 3), 1:3)
reason <- paste0(rep("reason", 3), 1:3)
visit1 <- rep("visit1",3)
visit2 <- rep("visit2",3)
mapply(admissdate, admis = admiss, dis = discharge, rsn = reason, vis1 = visit1, vis2 = visit2)
I have also considered lapply, but there you have to define an X = ..., which I think I cannot use because I have multiple columns that I want to pass in; please correct me if I am wrong!
Also I considered using a for loop, but I don't know how to use that with multiple conditions.
Any help would be greatly appreciated!
You can change the function to accept values instead of column names.
admissdate <- function(admis, dis, rsn, vis1, vis2){
  xnew <- as.integer(admis >= vis1 & dis <= vis2 & rsn == 2)
  xnew <- ifelse(admis >= vis1 & admis <= vis2 & dis >= vis2 & rsn == 2, 1, xnew)
  return(xnew)
}
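A quick sanity check on plain vectors (the dates here are made up for illustration):

admissdate(as.Date("1995-06-01"), as.Date("1995-06-11"), 2,
           as.Date("1995-01-24"), as.Date("1997-10-07"))
#> [1] 1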
Now create new columns -
admiss <- paste0("admiss", 1:3)
discharge <- paste0("discharge", 1:3)
reason <- paste0("reason", 1:3)
new_col <- paste0('newcol', 1:3)
df[new_col] <- Map(function(x, y, z) admissdate(x, y, z, df$visit1, df$visit2),
                   df[admiss], df[discharge], df[reason])
# The additional column will be 1 if any of the values in the new columns is 1.
df$result <- as.integer(rowSums(df[new_col]) > 0)
df
I can figure out a solution to my problem, but in a very suboptimal way, and thus the solution I have does not scale to a large df. Let me explain.
I have a big data frame and I need to create new columns by subtracting two other ones. Let me show you using a simple df.
A <- rnorm(10)
B <- rnorm(10)
C <- rnorm(10)
D <- rnorm(10)
E <- rnorm(10)
F <- rnorm(10)
df1 <- data_frame(A, B, C, D, E, F)
# A tibble: 10 x 6
A B C D E F
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -2.8750025 0.4685855 2.4435767 1.6999761 -1.3848386 -0.58992249
2 0.2551404 1.8555876 0.8365116 -1.6151186 -1.7754623 0.04423463
3 0.7740396 -1.0756147 0.6830024 -2.3879337 -1.3165875 -1.36646493
4 0.2059932 0.9322016 1.2483196 -0.1787840 0.3546773 -0.12874831
5 -0.4561725 -0.1464692 -0.7112905 0.2791592 0.5835127 0.16493237
6 1.2401795 -1.1422917 -0.6189480 -1.4975416 0.5653565 -1.32575021
7 -1.6173618 0.2283430 0.6154920 0.6082847 0.0273447 0.16771783
8 0.3340799 -0.5096500 -0.5270123 -0.2814217 -2.3732234 0.27972188
9 -0.4841361 0.1651265 0.0296500 0.4324903 -0.3895971 -2.90426195
10 -2.7106357 0.5496335 0.3081533 -0.3083264 -0.1341055 -0.17927807
I need (i) to subtract pairs of columns a fixed distance apart (D-A, E-B, F-C) while (ii) giving each new column a name based on the initial variables' names.
I did it this way and it works:
df2 <- df1 %>%
  transmute(!!paste0("diff", "D", "A") := D - A,
            !!paste0("diff", "E", "B") := E - B,
            !!paste0("diff", "F", "C") := F - C)
# A tibble: 10 x 3
diffDA diffEB diffFC
<dbl> <dbl> <dbl>
1 4.5749785 -1.8534241 -3.0334991
2 -1.8702591 -3.6310500 -0.7922769
3 -3.1619734 -0.2409728 -2.0494674
4 -0.3847772 -0.5775242 -1.3770679
5 0.7353317 0.7299819 0.8762229
6 -2.7377211 1.7076482 -0.7068022
7 2.2256465 -0.2009983 -0.4477741
8 -0.6155016 -1.8635734 0.8067342
9 0.9166264 -0.5547236 -2.9339120
10 2.4023093 -0.6837390 -0.4874314
However, I have many columns and I would like to find a way to make the code simpler. I tried many things (like mutate_all, mutate_at or add_columns) but nothing works...
OK, here's a method that will work for the full width of your data set.
df1 <- tibble(A = rnorm(10),
              B = rnorm(10),
              C = rnorm(10),
              D = rnorm(10),
              E = rnorm(10),
              F = rnorm(10),
              G = rnorm(10),
              H = rnorm(10),
              I = rnorm(10))
ct <- 1:(ncol(df1) - 3)
diff_tbl <- tibble(testcol = rnorm(10))
for (i in ct) {
  new_tbl <- tibble(col = df1[[i + 3]] - df1[[i]])
  names(new_tbl)[1] <- paste('diff', colnames(df1)[i + 3], colnames(df1)[i], sep = '')
  diff_tbl <- bind_cols(diff_tbl, new_tbl)
}
diff_tbl <- diff_tbl %>%
  select(-testcol)
df1 <- bind_cols(df1, diff_tbl)
Basically, what you are doing is creating a second dummy tibble to compute the differences, iterating over the possible differences (i.e. gaps of three columns) then assembling them into a single tibble, then binding those columns to the original tibble. As you can see, I extended df1 by three extra columns and the whole thing worked like a charm.
It's probable that there's a more elegant way to do this, but this method definitely works. There's one slightly awkward thing in that I had to create the diff_tbl with a dummy column and then remove it before the final bind_cols() call, but it's not a major thing, I think.
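For what it's worth, here is a sketch of the same idea without the dummy column, using lapply() and setNames() (assuming the same three-column gap and starting from the original df1):

gap <- 3
idx <- seq_len(ncol(df1) - gap)
diff_tbl <- as_tibble(setNames(
  lapply(idx, function(i) df1[[i + gap]] - df1[[i]]),
  paste0("diff", names(df1)[idx + gap], names(df1)[idx])
))
df1 <- bind_cols(df1, diff_tbl)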
You could divide the data frame into two halves and do
inds <- ncol(df1)/2
df1[paste0("diff", names(df1[(inds + 1):ncol(df1)]), names(df1[1:inds]))] <-
  df1[(inds + 1):ncol(df1)] - df1[1:inds]
Note that column names with dashes in them, as created below, are non-syntactic in R (they need backticks to access) and are therefore not generally recommended.
result = df1[4:6] - df1[1:3]
names(result) = paste(names(df1)[4:6], names(df1)[1:3], sep = "-")
result
# D-A E-B F-C
# 1 0.12459065 0.05855622 0.6134559
# 2 -2.65583389 0.26425762 0.8344115
# 3 -1.48761765 -3.13999402 1.3008065
# 4 -4.37469763 1.37551178 1.3405191
# 5 1.01657135 -0.90690359 1.5848562
# 6 -0.34050959 -0.57687686 -0.3794937
# 7 0.85233808 0.57911293 -0.8896393
# 8 0.01931559 0.91385740 3.2685647
# 9 -0.62012982 -2.34166712 -0.4001903
# 10 -2.21764146 0.05927664 0.3965072
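If you want the differences next to the original columns, cbind(df1, result) will do it; just note that names like D-A then need backticks (df1$`D-A`) to access.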
I would like to calculate the response latency of the eye. I want to do this by measuring the time difference between the onscreen appearance of a target and the onset of the fast eye movement made in response.
Below is a picture of a single-trial example. The purple line is the period where the target appears on screen. The top trace shows the position data of the Y coordinate of the eye; the bottom trace shows the velocity.
As you can see here, the fast eye movement downwards, with a high velocity, is a saccade.
To give you an idea of how my data look, I made a dummy data.frame. The block column represents the blocks you can also see in the figure. Ignore trial.block for now. saccade is a column telling you whether a data point is a S (saccade) or F (fixation).
Any idea how to calculate the time between Iview at the onset of the target and the onset of the first saccade for each individual trial?
Thanks a lot
library(dplyr)
N = 500
G.df <- data.frame(Iview = seq(N * 2),
                   cue.condition = rep(c("spatial", "non-spatial"), each = N),
                   block = rep(c("fixation.1", "fixation.2", "target.1", "target.2"), each = N/2),
                   trial.block = rep(1:4, each = N/2),
                   trial.number = rep(1:50, each = 10),
                   saccade = sample(c("S", "F"), size = 100, replace = TRUE))
I'm not sure if I understood your request correctly. The time between the first occurrence of block == 'target.1' and the first occurrence of block == 'target.1' & saccade == 'S' for each trial could be calculated like this:
G.df %>%
  group_by(trial.number) %>%
  summarise(time_between = Iview[block == "target.1" & saccade == "S"][1] -
              Iview[block == "target.1"][1])
# A tibble: 50 x 2
trial.number time_between
<int> <int>
1 1 2
2 2 1
3 3 0
4 4 0
5 5 1
6 6 1
7 7 1
8 8 1
9 9 0
10 10 0
# ... with 40 more rows
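One caveat: subsetting like Iview[...][1] returns NA when nothing matches, so any trial with no saccade during the target.1 block will get an NA time_between, which you may want to filter out afterwards.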
I have data giving me the percentage of people in some groups who have various levels of educational attainment:
df <- data_frame(group = c("A", "B"),
                 no.highschool = c(20, 10),
                 high.school = c(70, 40),
                 college = c(10, 40),
                 graduate = c(0, 10))
df
# A tibble: 2 x 5
group no.highschool high.school college graduate
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 20. 70. 10. 0.
2 B 10. 40. 40. 10.
E.g., in group A 70% of people have a high school education.
I want to generate 4 variables that give me the proportion of people in each group with less than each of the 4 levels of education (e.g., lessthan_no.highschool, lessthan_high.school, etc.).
desired df would be:
desired.df <- data.frame(group = c("A", "B"),
                         no.highschool = c(20, 10),
                         high.school = c(70, 40),
                         college = c(10, 40),
                         graduate = c(0, 10),
                         lessthan_no.highschool = c(0, 0),
                         lessthan_high.school = c(20, 10),
                         lessthan_college = c(90, 50),
                         lessthan_graduate = c(100, 90))
In my actual data I have many groups and a lot more levels of education. Of course I could do this one variable at a time, but how could I do this programmatically (and elegantly) using tidyverse tools?
I would start by doing something like a mutate_at() inside of a map(), but where I get tripped up is that the list of variables being summed is different for each of the new variables. You could pass in the list of new variables and their corresponding variables to be summed as two lists to a pmap(), but it's not obvious how to generate that second list concisely. Wondering if there's some kind of nesting solution...
Here is a base R solution. Though the question asks for a tidyverse one, considering the dialog in the comments to the question I have decided to post it.
It uses apply and cumsum to do the hard work. Then there are some cosmetic concerns before cbinding into the final result.
tmp <- apply(df[-1], 1, function(x){
  s <- cumsum(x)
  100 * c(0, s[-length(s)]) / sum(x)
})
rownames(tmp) <- paste("lessthan", names(df)[-1], sep = "_")
desired.df <- cbind(df, t(tmp))
desired.df
# group no.highschool high.school college graduate lessthan_no.highschool
#1 A 20 70 10 0 0
#2 B 10 40 40 10 0
# lessthan_high.school lessthan_college lessthan_graduate
#1 20 90 100
#2 10 50 90
how could I do this programmatically (and elegantly) using tidyverse tools?
Definitely the first step is to tidy your data. Encoding information (like edu level) in column names is not tidy. When you convert education to a factor, make sure the levels are in the correct order - I used the order in which they appeared in the original data column names.
library(tidyr)
tidy_result = df %>%
  gather(key = "education", value = "n", -group) %>%
  mutate(education = factor(education, levels = names(df)[-1])) %>%
  group_by(group) %>%
  mutate(lessthan_x = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
  arrange(group, education)
tidy_result
# # A tibble: 8 x 4
# # Groups: group [2]
# group education n lessthan_x
# <chr> <fct> <dbl> <dbl>
# 1 A no.highschool 20 0
# 2 A high.school 70 20
# 3 A college 10 90
# 4 A graduate 0 100
# 5 B no.highschool 10 0
# 6 B high.school 40 10
# 7 B college 40 50
# 8 B graduate 10 90
This gives us a nice, tidy result. If you want to spread/cast this data into your un-tidy desired.df format, I would recommend using data.table::dcast, as (to my knowledge) the tidyverse does not offer a nice way to spread multiple columns. See Spreading multiple columns with tidyr or How can I spread repeated measures of multiple variables into wide format? for the data.table solution or an inelegant tidyr/dplyr version. Before spreading, you could create a key less_than_x_key = paste("lessthan", education, sep = "_").
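For instance, a minimal sketch with data.table's dcast (assuming the data.table package is installed; the resulting columns come out prefixed n_ and lessthan_x_ rather than matching desired.df exactly):

library(data.table)
dcast(as.data.table(tidy_result), group ~ education,
      value.var = c("n", "lessthan_x"))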
I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude data that falls within the lower quartile. I would then like to rewrite the data without those values and use the new column of data in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way I can write this so that it is easy to change the threshold by passing arguments from Java (as I have with the input file name), that would be even better!
Thank you so much.
I have now implemented the answer below and it is working; however, I need to keep the pairs of data together for the correlation. Here is an example of my data (from csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in a row if either one does not meet my quartile threshold (the 0.25 quantile). So if the quartile for O was 45000, then the row "42046.61549, 152.1321255" would be removed. Is this possible? If I read in both columns as a data frame, can I search each column separately? Or find the quartiles and then input those values into code that removes the appropriate rows?
Thanks again, and sorry for the evolution of the question!
Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
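To keep the pairs together, as asked in the edit, here is a sketch using the question's column names: compute each column's lower quartile and keep only the rows where both values clear their own threshold.

# assuming Values holds the Abundance_O / Abundance_S pairs from the csv
keep <- Values$Abundance_O > quantile(Values$Abundance_O, 0.25) &
        Values$Abundance_S > quantile(Values$Abundance_S, 0.25)
Values_thresholded <- Values[keep, ]
pear_cor <- round(cor(Values_thresholded$Abundance_O, Values_thresholded$Abundance_S), 4)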
Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function which will convert a numerical vector into its quantile groups. Parameter n determines the number of quantile groups (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
  qtile = quantile(numvec, probs = seq(0, 1, 1/n))
  out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
  return(out)
}
Function example:
v = 1:20
qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
library(data.table)

dt = data.table(
  A0 = runif(100),
  A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
           A0        A1 Q0 Q1
1: 0.72121846 0.1908863  3  1
2: 0.70373594 0.4389152  3  2
3: 0.04604934 0.5301261  1  3
4: 0.10476643 0.1108709  1  1
5: 0.76907762 0.4913463  4  2
6: 0.38265848 0.9291649  2  4
Lastly, we only include rows for which both quartile groups are above the first quartile:
dt = dt[Q0 > 1 & Q1 > 1]
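Applied back to the correlation question's data, the same idea keeps a row only when both abundances sit above their bottom quartile (a sketch, assuming Values is the data frame read from the csv):

library(data.table)
setDT(Values)
Values <- Values[qgroup(Abundance_O) > 1 & qgroup(Abundance_S) > 1]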