R dynamic data.frame subsetting - r

I have a dataframe which is similar to this:
A B C D E F G H I J K origin
1 -0.7236848 -0.4245541 0.7083451 3.1623596 3.8169532 -0.04582876 2.0287920 4.409196 -0.3194430 5.9069321 2.7071142 1
2 -0.8317734 4.8795289 0.4585997 -0.2634786 -0.7881651 -0.37251184 1.0951245 4.157672 4.2433676 1.4588268 -0.6623890 1
3 -0.7633280 -0.2985844 -0.9139702 3.7480425 3.5672651 0.06220035 -0.3770195 1.101240 2.0921264 6.6496937 -0.7218320 1
4 0.8311566 -0.7939485 0.5295287 -0.5508124 -0.3175838 -0.63254736 0.6145020 4.186136 -0.1858161 -0.1864584 0.7278854 2
5 1.4768837 -0.7612165 0.8571546 2.3227173 -0.8568081 -0.87088020 0.2269735 4.386104 3.9009236 -0.6429641 3.6163318 2
6 -0.9335004 4.4542639 1.0238832 -0.2304124 0.8630241 -0.50556039 2.8072757 5.168369 5.9942144 0.6165200 -0.5349257 2
Note that the last variable, origin, is a factor with levels 1 and 2; my real data set has more levels.
A function I am using requires this format:
result <- specialFunc(matrix1, matrix2, ...)
What I want to do is write a function that splits the input data frame (or matrix) by "origin", so that I dynamically get multiple matrices to pass to specialFunc.
My solution is:
for (i in 1:length(levels(df[, "origin"]))) {
  assign(paste("Var", "_", i, sep = ""), subset(df, origin == i))
}
Using this, I can create a list of names and then use get() to pass the objects to my special function.
As you can imagine, this is not dynamic...
Any suggestions?

I think something like
do.call(specialFunc,
split.data.frame(df[,-ncol(df)],df$origin))
should do it?
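To make the suggestion above concrete, here is a minimal sketch. The toy df and the body of specialFunc are made up for illustration (the real function was not shown), and the conversion with as.matrix() and the unname() call are assumptions about what the real function expects.

```r
# Toy data: two numeric columns plus the grouping factor "origin"
df <- data.frame(
  A = c(1, 2, 3, 4, 5, 6),
  B = c(10, 20, 30, 40, 50, 60),
  origin = factor(c(1, 1, 1, 2, 2, 2))
)

# Hypothetical stand-in for specialFunc: takes any number of matrices
# and returns the number of rows of each one
specialFunc <- function(...) {
  mats <- list(...)
  vapply(mats, nrow, integer(1))
}

# Split every column except the last (origin) by the levels of origin
pieces <- split(df[, -ncol(df)], df$origin)

# Convert each piece to a matrix, since the function is said to want matrices
pieces <- lapply(pieces, as.matrix)

# split() names the pieces "1", "2", ...; unname() makes do.call() pass them
# positionally instead of as arguments named `1`, `2`, ...
result <- do.call(specialFunc, unname(pieces))
result
```

This scales to any number of levels in origin without creating Var_1, Var_2, ... in the workspace.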

Related

Slicing long form R data into multiple columns

I've been given some data that I've combined into long form, but I need to get it into a certain format for a deliverable. I've tinkered with dataframe and list options and cannot seem to find a way to get the data I have into the output form I need. Any thoughts and solutions are appreciated.
If the desired output form seems odd for R, it is because other people will open the resulting data in Excel for additional study. So I will save the final data as a csv or Excel file. The full data in the desired form will have 40 rows (+header) and 110 columns (55 student and score pairs).
Here is example code for my long form data:
class student score
1     a       0.4977
1     b       0.7176
1     c       0.9919
1     d       0.3800
1     e       0.7774
2     f       0.9347
2     g       0.2121
2     h       0.6517
2     i       0.1256
2     j       0.2672
3     k       0.3861
3     l       0.0134
3     m       0.3824
3     n       0.8697
3     o       0.3403
Here is an example of how I need the final data to appear:
class_1_student class_1_score class_2_student class_2_score class_3_student class_3_score
a               0.4977        f               0.9347        k               0.3861
b               0.7176        g               0.2121        l               0.0134
c               0.9919        h               0.6517        m               0.3824
d               0.3800        i               0.1256        n               0.8697
e               0.7774        j               0.2672        o               0.3403
Here is R code to generate the sample long form and desired form data:
set.seed(1)
d <- data.frame(
class=c(rep(1,5), rep(2,5), rep(3,5)),
student=c(letters[1:5], letters[6:10], letters[11:15]),
score=round(runif(15, 0, 1),4)
)
d2 <- data.frame(
class_1_student = d[1:5,2],
class_1_score = d[1:5,3],
class_2_student = d[6:10,2],
class_2_score = d[6:10,3],
class_3_student = d[11:15,2],
class_3_score = d[11:15,3]
)
If it's helpful, I also have the student and score data in separate matrices (1 row per student and 1 column per class) that I could use to help generate the final data.
You can just split data:
library(tidyverse)
split(select(d, -class), d$class) %>%
imap(~setNames(.x, str_c("class", .y, names(.x), sep = "_"))) %>%
bind_cols()
Column binding will work only if the groups are of equal sizes.
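If the groups are not of equal size, one possible workaround is to pad the shorter groups with NA rows before binding. This is only a sketch in base R: the small data frame d below is invented with deliberately unequal classes, and padding with NA is an assumption about what the Excel users would accept.

```r
# Invented example with unequal class sizes (3, 2 and 1 students)
d <- data.frame(
  class   = c(1, 1, 1, 2, 2, 3),
  student = letters[1:6],
  score   = c(0.50, 0.72, 0.99, 0.38, 0.78, 0.93)
)

# One data frame of student/score per class
parts <- split(d[c("student", "score")], d$class)

# Size of the largest class; every piece is padded to this many rows
n_max <- max(vapply(parts, nrow, integer(1)))

padded <- Map(function(p, k) {
  # Indexing past the last row of a data frame yields rows filled with NA
  p <- p[seq_len(n_max), , drop = FALSE]
  rownames(p) <- NULL
  setNames(p, paste("class", k, names(p), sep = "_"))
}, parts, names(parts))

# unname() stops cbind() from prefixing column names with the list names
wide <- do.call(cbind, unname(padded))
wide
```

The result has one row per position within a class, with NA where a class has fewer students than the largest one.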

How to apply same function to different sets of column in R?

With the following data set and time variable as time=c(1:10)
mydata
beta_C1 1 beta_C1 2 beta_C1 3 beta_C2 1 beta_C2 2 beta_C2 3
1 5.388135 0.2036038 -0.006050338 5.488691 0.1778483 -0.0036647072
2 5.536004 0.2374793 -0.009960762 5.768781 0.1463565 -0.0012642700
3 5.798095 0.1798015 -0.004768584 6.059320 0.1127296 0.0006366231
4 5.648306 0.2720582 -0.011654632 6.129815 0.1282014 -0.0015109727
5 5.712576 0.2320445 -0.007225099 6.166659 0.1490687 -0.0042889325
6 5.674026 0.2325392 -0.006198976 6.242121 0.1559551 -0.0064668515
I would like to create two matrices such as
new_mat1=outer(1:nrow(mydata), 1:length(time), function(x,y){
mydata[x,1]+
mydata[x,2]*time[y]+
mydata[x,3]*time[y]^2
})
and
new_mat2=outer(1:nrow(mydata), 1:length(time), function(x,y){
mydata[x,4]+
mydata[x,5]*time[y]+
mydata[x,6]*time[y]^2
})
The first matrix is created by taking the first three columns of mydata and the last three columns are used to create the second matrix.
Can I apply a function or for loop to create both matrices together? Any help is appreciated
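One possible approach, sketched under the assumption that the columns always come in consecutive groups of three (intercept, linear and quadratic coefficient), is to build the column groups once and lapply over them. The column names and the helper poly_mat below are made up for illustration; only the first two rows of the data shown above are used.

```r
time <- 1:10

# First two rows of the data above, with simplified (hypothetical) column names
mydata <- data.frame(
  b1_1 = c(5.388135, 5.536004),
  b1_2 = c(0.2036038, 0.2374793),
  b1_3 = c(-0.006050338, -0.009960762),
  b2_1 = c(5.488691, 5.768781),
  b2_2 = c(0.1778483, 0.1463565),
  b2_3 = c(-0.0036647072, -0.0012642700)
)

# coefs: n x 3 block of coefficients; returns an n x length(time) matrix
# whose [x, y] entry is coefs[x, 1] + coefs[x, 2] * time[y] + coefs[x, 3] * time[y]^2
poly_mat <- function(coefs, time) {
  as.matrix(coefs) %*% rbind(1, time, time^2)
}

# Column indices grouped in threes: 1:3, 4:6, ...
groups <- split(seq_len(ncol(mydata)), ceiling(seq_len(ncol(mydata)) / 3))

# One matrix per group of three columns
mats <- lapply(groups, function(idx) poly_mat(mydata[, idx], time))

new_mat1 <- mats[[1]]
new_mat2 <- mats[[2]]
```

The matrix product replaces the outer() call, so adding a third or fourth group of coefficients needs no extra code.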

Performing a 2 sample t test in R with replicates

I have a data frame named R_alltemp in R with 6 columns: 2 groups of data with 3 replicates each. I'm trying to perform a t-test for each row between the first three values and the last three, using apply() so it can go through all the rows in one line. Here is the code I'm using so far.
R_alltemp$p.value<-apply(R_all3,1, function (x) t.test(x(R_alltemp[,1:3]), x(R_alltemp[,4:6]))$p.value)
and here is a snapshot of the table
R1.HCC827 R2.HCC827 R3.HCC827 R1.nci.h1975 R2.nci.h1975 R3.nci.h1975 p.value
1 13.587632 22.225083 15.074230 58.187465 79 82.287573 0.4391160
2 2.717526 1.778007 1.773439 1.763257 2 1.679338 0.4186339
3 203.814478 191.135711 232.320487 253.908939 263 263.656100 0.4904493
4 44.386264 45.339169 54.089884 3.526513 3 5.877684 0.3095634
It runs, but the p-values I'm getting seem wrong just from eyeballing them. For instance, in the first row the average of the first group is far lower than that of the second group, but my p-value is only 0.4.
I feel like I'm missing something very obvious here, but I've been struggling with it for much longer than I'd like. Any help would be appreciated.
Your code is incorrect. I actually don't understand why it does not return an error. This part in particular: x(R_alltemp[,1:3]) should be x[1:3].
This should be your code:
R_alltemp$p.value2 <- apply(R_alltemp, 1, function(x) t.test(x[1:3], x[4:6])$p.value)
R1.HCC827 R2.HCC827 R3.HCC827 R1.nci.h1975 R2.nci.h1975 R3.nci.h1975 p.value p.value2
1 13.587632 22.225083 15.074230 58.187465 79 82.287573 0.4391160 0.010595829
2 2.717526 1.778007 1.773439 1.763257 2 1.679338 0.4186339 0.477533387
3 203.814478 191.135711 232.320487 253.908939 263 263.656100 0.4904493 0.044883436
4 44.386264 45.339169 54.089884 3.526513 3 5.877684 0.3095634 0.002853154
Remember that by specifying 1 you are telling apply to work over the rows, so each call of function(x) receives one row as a vector, equivalent to this: x <- c(13.587632, 22.225083, 15.074230, 58.187465, 79, 82.287573). That means you want to take the first three values with x[1:3] and the last three with x[4:6], and apply t.test to them.
A good idea before using apply is to test the function manually so if you do get odd results like these you know something went wrong with your code.
So the two-tailed p-value for the first row should be:
> g1 <- c(13.587632, 22.225083, 15.074230)
> g2 <- c(58.187465, 79, 82.287573)
> t.test(g1,g2)$p.value
[1] 0.01059583
Applying the function across all rows (I tacked the new p-value on at the end as pval):
> tt$pval <- apply(tt,1,function(x) t.test(x[1:3],x[4:6])$p.value)
> tt
R1.HCC827 R2.HCC827 R3.HCC827 R1.nci.h1975 R2.nci.h1975 R3.nci.h1975 p.value pval
1 13.587632 22.225083 15.074230 58.187465 79 82.287573 0.4391160 0.010595829
2 2.717526 1.778007 1.773439 1.763257 2 1.679338 0.4186339 0.477533387
3 203.814478 191.135711 232.320487 253.908939 263 263.656100 0.4904493 0.044883436
4 44.386264 45.339169 54.089884 3.526513 3 5.877684 0.3095634 0.002853154
Maybe it's the double-use of the data frame name in the function (that you don't need)?
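Putting the two answers together, here is a self-contained sketch using the four rows shown in the question. The helper name row_pvalue is made up, and restricting apply() to the six measurement columns is an extra precaution (an assumption on my part) so that other columns such as an old p.value never end up in the row vector.

```r
# The four rows from the question, typed in by hand
tt <- data.frame(
  R1.HCC827    = c(13.587632, 2.717526, 203.814478, 44.386264),
  R2.HCC827    = c(22.225083, 1.778007, 191.135711, 45.339169),
  R3.HCC827    = c(15.074230, 1.773439, 232.320487, 54.089884),
  R1.nci.h1975 = c(58.187465, 1.763257, 253.908939, 3.526513),
  R2.nci.h1975 = c(79, 2, 263, 3),
  R3.nci.h1975 = c(82.287573, 1.679338, 263.656100, 5.877684)
)

# x is one row as a plain numeric vector: replicates 1-3 vs replicates 4-6
row_pvalue <- function(x) t.test(x[1:3], x[4:6])$p.value

# MARGIN = 1 means "once per row"; only the six data columns are passed in
tt$pval <- apply(as.matrix(tt[, 1:6]), 1, row_pvalue)
tt
```

Note that t.test() defaults to the Welch test (unequal variances); pass var.equal = TRUE if a pooled-variance test is wanted.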

Setting values to a new variable

I'm trying to create a new variable in a data frame (making a new column). The value is calculated differently for each observation, so I used a for loop for that. Let's say the new variable I'm trying to add to the data frame REPLIC is called PL.
REPLIC$PL <- for (i in 1:ncol(REPLIC)) if (REPLIC$FTR[i]=="D") { REPLIC$PL[i] <- REPLIC$f_of_bet[i]*starting_budget*REPLIC$max[i])} else { REPLIC$PL[i] <- REPLIC$f_of_bet[i]*starting_budget*-1}
I have also tried using mutate
REPLIC <- mutate(REPLIC, PL = for loop goes here)
also tried apply function
REPLIC$PL <- apply(REPLIC,1, for loop here)
I'm new to R and I don't really get what I'm missing here. The only thing I've managed so far is to create PL values in global environment. I'd be really happy if anyone could instruct me.
There is no need to use a loop here; everything can be done with vectorised operations.
Since you didn't share anything about your data, I had to make some assumptions, please correct me if these are wrong.
#create fake data
starting_budget <- 1000
REPLIC <- data.frame(FTR = c(rep('D',5),rep('A',5)),f_of_bet = runif(10),max=runif(10))
> REPLIC
FTR f_of_bet max
1 D 0.78590664 0.3620227
2 D 0.15498935 0.4921082
3 D 0.20469729 0.5597419
4 D 0.01167919 0.3677215
5 D 0.32862533 0.5531767
6 A 0.52029750 0.5391566
7 A 0.63206626 0.9727405
8 A 0.54632605 0.7221810
9 A 0.58939969 0.6103260
10 A 0.15375445 0.1996567
The following code will add your new column. I'm using ifelse since you have a condition on FTR:
REPLIC$PL <- ifelse(REPLIC$FTR == 'D',
REPLIC$f_of_bet * starting_budget * REPLIC$max,
REPLIC$f_of_bet * starting_budget * -1)
This gives you:
> REPLIC
FTR f_of_bet max PL
1 D 0.78590664 0.3620227 284.51602
2 D 0.15498935 0.4921082 76.27153
3 D 0.20469729 0.5597419 114.57764
4 D 0.01167919 0.3677215 4.29469
5 D 0.32862533 0.5531767 181.78787
6 A 0.52029750 0.5391566 -520.29750
7 A 0.63206626 0.9727405 -632.06626
8 A 0.54632605 0.7221810 -546.32605
9 A 0.58939969 0.6103260 -589.39969
10 A 0.15375445 0.1996567 -153.75445
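For comparison, here is a sketch of a loop that does work, in case the loop form is easier to follow. As far as I can tell, the original attempt fails for two reasons: a for loop returns NULL, so REPLIC$PL <- for (...) ends up assigning NULL, and it iterates over ncol(REPLIC) instead of the rows. The set.seed() call and the fake data are assumptions, as in the answer above.

```r
set.seed(42)  # only so the fake data is reproducible
starting_budget <- 1000
REPLIC <- data.frame(
  FTR      = c(rep("D", 5), rep("A", 5)),
  f_of_bet = runif(10),
  max      = runif(10)
)

# Create the column first, then fill it row by row
REPLIC$PL <- NA_real_
for (i in seq_len(nrow(REPLIC))) {   # rows, not columns
  if (REPLIC$FTR[i] == "D") {
    REPLIC$PL[i] <- REPLIC$f_of_bet[i] * starting_budget * REPLIC$max[i]
  } else {
    REPLIC$PL[i] <- REPLIC$f_of_bet[i] * starting_budget * -1
  }
}

# The vectorised version gives the same numbers
vectorised <- ifelse(REPLIC$FTR == "D",
                     REPLIC$f_of_bet * starting_budget * REPLIC$max,
                     REPLIC$f_of_bet * starting_budget * -1)
all.equal(REPLIC$PL, vectorised)
```

The ifelse() version is still preferable: it is shorter and avoids the row-by-row assignment.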

Adding variables to a data.frame using a string as syntax

Suppose I have these variables:
data <- data.frame(x=rnorm(10), y=rnorm(10))
form <- 'z = x*y'
How can I compute z (using data's variables) and add as a new variable to data?
I tried with parse() and eval() (based on an old question), but without success :/
Given that what @Nico said is correct, you might do:
d1 <- within(data, eval(parse(text=form)) )
d1
x y z
1 0.5939462 1.58683345 0.94249368
2 0.3329504 0.55848643 0.18594826
3 1.0630998 -1.27659221 -1.35714497
4 -0.3041839 -0.57326541 0.17437812
5 0.3700188 -1.22461261 -0.45312970
6 0.2670988 -0.47340064 -0.12644474
7 -0.5425200 -0.62036668 0.33656135
8 1.2078678 0.04211587 0.05087041
9 1.1604026 -0.91092165 -1.05703586
10 0.7002136 0.15802877 0.11065390
transform() is the easy way if using this interactively:
data <- data.frame(x=rnorm(10), y=rnorm(10))
data <- transform(data, z = x * y)
R> head(data)
x y z
1 -1.0206 0.29982 -0.30599
2 -1.6985 1.51784 -2.57805
3 0.8940 1.19893 1.07187
4 -0.3672 -0.04008 0.01472
5 0.5266 -0.29205 -0.15381
6 0.2545 -0.26889 -0.06842
You can't do this using form though, but within(), which is similar to transform(), does allow this, e.g.
R> within(data, eval(parse(text = form)))
x y z
1 -0.8833 -0.05256 0.046428
2 1.6673 1.61101 2.686115
3 1.1261 0.16025 0.180453
4 0.9726 -1.32975 -1.293266
5 -1.6220 -0.51079 0.828473
6 -1.1981 2.62663 -3.147073
7 -0.3596 -0.01506 0.005416
8 -0.9700 0.21865 -0.212079
9 1.0626 1.30377 1.385399
10 -0.8020 -1.04639 0.839212
though it involves some amount of jiggery-pokery with the language which to my mind is not elegant. Effectively, you are doing something like this:
R> eval(eval(parse(text = form), data), data, parent.frame())
[1] 0.046428 2.686115 0.180453 -1.293266 0.828473 -3.147073 0.005416
[8] -0.212079 1.385399 0.839212
(and assigning the result to the named component in data.)
Does form have to come like this, as a character string representing some expression to be evaluated?
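If form really does have to arrive as a string, another option is to split it into the new column's name and the right-hand side, and evaluate only the right-hand side inside data. This is just a sketch: it assumes the string has the shape "name = expression" with a single assignment, and the variable names new_name and rhs_text are made up.

```r
set.seed(1)  # only so the example is reproducible
data <- data.frame(x = rnorm(10), y = rnorm(10))
form <- "z = x*y"

# Everything before the first "=" is taken as the new column name
new_name <- trimws(sub("=.*$", "", form))

# Everything after the first "=" is taken as the expression to evaluate
rhs_text <- sub("^[^=]*=", "", form)

# Evaluate the expression with the columns of data visible as variables
data[[new_name]] <- eval(parse(text = rhs_text), envir = data)
head(data)
```

This keeps the assignment itself in ordinary R code, so only the arithmetic part of the string is ever parsed and evaluated.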
