Setting values to a new variable in R

I'm trying to create a new variable in a data frame (i.e. add a new column). The value is calculated differently for each observation, so I used a for loop. Let's say the new variable I'm trying to add to the data frame REPLIC is called PL:
REPLIC$PL <- for (i in 1:nrow(REPLIC)) {
  if (REPLIC$FTR[i] == "D") {
    REPLIC$PL[i] <- REPLIC$f_of_bet[i] * starting_budget * REPLIC$max[i]
  } else {
    REPLIC$PL[i] <- REPLIC$f_of_bet[i] * starting_budget * -1
  }
}
I have also tried using mutate:
REPLIC <- mutate(REPLIC, PL = for loop goes here)
I also tried the apply function:
REPLIC$PL <- apply(REPLIC,1, for loop here)
I'm new to R and I don't really get what I'm missing here. The only thing I've managed so far is to create the PL values in the global environment. I'd be really happy if anyone could instruct me.

No need to use a loop here; everything can be done with vectors.
Since you didn't share anything about your data, I had to make some assumptions; please correct me if they are wrong.
# create fake data
starting_budget <- 1000
REPLIC <- data.frame(FTR = c(rep('D', 5), rep('A', 5)),
                     f_of_bet = runif(10),
                     max = runif(10))
> REPLIC
FTR f_of_bet max
1 D 0.78590664 0.3620227
2 D 0.15498935 0.4921082
3 D 0.20469729 0.5597419
4 D 0.01167919 0.3677215
5 D 0.32862533 0.5531767
6 A 0.52029750 0.5391566
7 A 0.63206626 0.9727405
8 A 0.54632605 0.7221810
9 A 0.58939969 0.6103260
10 A 0.15375445 0.1996567
The following code will add your new column. I'm using ifelse since you have a condition on FTR:
REPLIC$PL <- ifelse(REPLIC$FTR == 'D',
                    REPLIC$f_of_bet * starting_budget * REPLIC$max,
                    REPLIC$f_of_bet * starting_budget * -1)
This gives you:
> REPLIC
FTR f_of_bet max PL
1 D 0.78590664 0.3620227 284.51602
2 D 0.15498935 0.4921082 76.27153
3 D 0.20469729 0.5597419 114.57764
4 D 0.01167919 0.3677215 4.29469
5 D 0.32862533 0.5531767 181.78787
6 A 0.52029750 0.5391566 -520.29750
7 A 0.63206626 0.9727405 -632.06626
8 A 0.54632605 0.7221810 -546.32605
9 A 0.58939969 0.6103260 -589.39969
10 A 0.15375445 0.1996567 -153.75445
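Since you also tried mutate, here is a minimal sketch of the same logic written with dplyr (assuming the same REPLIC columns and starting_budget as above):
library(dplyr)
REPLIC <- REPLIC %>%
  mutate(PL = ifelse(FTR == 'D',
                     f_of_bet * starting_budget * max,
                     f_of_bet * starting_budget * -1))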

Creating new columns with mutate

I can figure out a solution to my problem, but only in a very non-optimal way, so the solution I have is not suited to a large df. Let me explain.
I have a big data frame and I need to create new columns by subtracting two other ones. Let me show you using a simple df:
A<-rnorm(10)
B<-rnorm(10)
C<-rnorm(10)
D<-rnorm(10)
E<-rnorm(10)
F<-rnorm(10)
df1 <- data_frame(A, B, C, D, E, F)
# A tibble: 10 x 6
A B C D E F
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -2.8750025 0.4685855 2.4435767 1.6999761 -1.3848386 -0.58992249
2 0.2551404 1.8555876 0.8365116 -1.6151186 -1.7754623 0.04423463
3 0.7740396 -1.0756147 0.6830024 -2.3879337 -1.3165875 -1.36646493
4 0.2059932 0.9322016 1.2483196 -0.1787840 0.3546773 -0.12874831
5 -0.4561725 -0.1464692 -0.7112905 0.2791592 0.5835127 0.16493237
6 1.2401795 -1.1422917 -0.6189480 -1.4975416 0.5653565 -1.32575021
7 -1.6173618 0.2283430 0.6154920 0.6082847 0.0273447 0.16771783
8 0.3340799 -0.5096500 -0.5270123 -0.2814217 -2.3732234 0.27972188
9 -0.4841361 0.1651265 0.0296500 0.4324903 -0.3895971 -2.90426195
10 -2.7106357 0.5496335 0.3081533 -0.3083264 -0.1341055 -0.17927807
I need (i) to subtract pairs of columns that sit the same distance apart, D-A, E-B, F-C, while (ii) giving each new column a name based on the names of the two initial variables.
I did it this way and it works:
df2 <- df1 %>%
  transmute(!!paste0("diff", "D", "A") := D - A,
            !!paste0("diff", "E", "B") := E - B,
            !!paste0("diff", "F", "C") := F - C)
# A tibble: 10 x 3
diffDA diffEB diffFC
<dbl> <dbl> <dbl>
1 4.5749785 -1.8534241 -3.0334991
2 -1.8702591 -3.6310500 -0.7922769
3 -3.1619734 -0.2409728 -2.0494674
4 -0.3847772 -0.5775242 -1.3770679
5 0.7353317 0.7299819 0.8762229
6 -2.7377211 1.7076482 -0.7068022
7 2.2256465 -0.2009983 -0.4477741
8 -0.6155016 -1.8635734 0.8067342
9 0.9166264 -0.5547236 -2.9339120
10 2.4023093 -0.6837390 -0.4874314
However, I have many columns and I would like to find a way to make the code simpler. I tried many things (like mutate_all, mutate_at or add_columns) but nothing worked...
OK, here's a method that will work for the full width of your data set.
df1 <- tibble(A = rnorm(10),
              B = rnorm(10),
              C = rnorm(10),
              D = rnorm(10),
              E = rnorm(10),
              F = rnorm(10),
              G = rnorm(10),
              H = rnorm(10),
              I = rnorm(10))
ct <- 1:(ncol(df1) - 3)
diff_tbl <- tibble(testcol = rnorm(10))
for (i in ct) {
  new_tbl <- tibble(col = df1[[i + 3]] - df1[[i]])
  names(new_tbl)[1] <- paste('diff', colnames(df1[i + 3]), colnames(df1[i]), sep = '')
  diff_tbl <- bind_cols(diff_tbl, new_tbl)
}
diff_tbl <- diff_tbl %>%
  select(-testcol)
df1 <- bind_cols(df1, diff_tbl)
Basically, you create a second dummy tibble to hold the differences, iterate over the possible pairs (i.e. columns three positions apart), assemble the results into a single tibble, and then bind those columns onto the original tibble. As you can see, I extended df1 by three extra columns and the whole thing still worked like a charm.
It's probable that there's a more elegant way to do this, but this method definitely works. There's one slightly awkward thing in that I had to create the diff_tbl with a dummy column and then remove it before the final bind_cols() call, but it's not a major thing, I think.
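To avoid the dummy column entirely, here is a minimal sketch of the same idea built with lapply; like the loop above, it assumes each column is paired with the one three positions later (the gap and diff_list names are just illustrative):
library(dplyr)
library(tibble)
gap <- 3                          # assumed distance between paired columns
idx <- seq_len(ncol(df1) - gap)   # positions of the left-hand columns
# compute each difference, name it diff<right><left>, and bind everything in one go
diff_list <- lapply(idx, function(i) df1[[i + gap]] - df1[[i]])
names(diff_list) <- paste0("diff", names(df1)[idx + gap], names(df1)[idx])
df1 <- bind_cols(df1, as_tibble(diff_list))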
You could divide the data frame into two parts and do:
inds <- ncol(df1)/2
df1[paste0("diff", names(df1[(inds + 1):ncol(df1)]), names(df1[1:inds]))] <-
df1[(inds + 1):ncol(df1)] - df1[1:inds]
Another option is to compute the differences directly and name the results with a separator; note that column names containing dashes are non-syntactic in R and not generally recommended:
result = df1[4:6] - df1[1:3]
names(result) = paste(names(df1)[4:6], names(df1)[1:3], sep = "-")
result
# D-A E-B F-C
# 1 0.12459065 0.05855622 0.6134559
# 2 -2.65583389 0.26425762 0.8344115
# 3 -1.48761765 -3.13999402 1.3008065
# 4 -4.37469763 1.37551178 1.3405191
# 5 1.01657135 -0.90690359 1.5848562
# 6 -0.34050959 -0.57687686 -0.3794937
# 7 0.85233808 0.57911293 -0.8896393
# 8 0.01931559 0.91385740 3.2685647
# 9 -0.62012982 -2.34166712 -0.4001903
# 10 -2.21764146 0.05927664 0.3965072

R dynamic data.frame subsetting

I have a dataframe which is similar to this:
A B C D E F G H I J K origin
1 -0.7236848 -0.4245541 0.7083451 3.1623596 3.8169532 -0.04582876 2.0287920 4.409196 -0.3194430 5.9069321 2.7071142 1
2 -0.8317734 4.8795289 0.4585997 -0.2634786 -0.7881651 -0.37251184 1.0951245 4.157672 4.2433676 1.4588268 -0.6623890 1
3 -0.7633280 -0.2985844 -0.9139702 3.7480425 3.5672651 0.06220035 -0.3770195 1.101240 2.0921264 6.6496937 -0.7218320 1
4 0.8311566 -0.7939485 0.5295287 -0.5508124 -0.3175838 -0.63254736 0.6145020 4.186136 -0.1858161 -0.1864584 0.7278854 2
5 1.4768837 -0.7612165 0.8571546 2.3227173 -0.8568081 -0.87088020 0.2269735 4.386104 3.9009236 -0.6429641 3.6163318 2
6 -0.9335004 4.4542639 1.0238832 -0.2304124 0.8630241 -0.50556039 2.8072757 5.168369 5.9942144 0.6165200 -0.5349257 2
Note that the last variable, origin, is a factor with levels 1 and 2; my real data set has more levels.
A function I am using requires this format:
result <- specialFunc(matrix1, matrix2, ...)
What I want to do is write a function such that the input data frame (or matrix) is split by "origin", so that I dynamically get multiple matrices to pass to my specialFunc.
My solution is:
for (i in 1:length(levels(df[, "origin"]))) {
  assign(paste("Var", "_", i, sep = ''), subset(df, origin == i))
}
Using this, I can create a list of names which I then use with get() to feed into my special function.
As you can imagine, this is not dynamic...
Any suggestions?
I think something like
do.call(specialFunc,
split.data.frame(df[,-ncol(df)],df$origin))
should do it?
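For illustration, a minimal sketch with a toy specialFunc (a stand-in I made up, not the asker's real function) that just reports the dimensions of each piece it receives:
# toy stand-in for the real specialFunc: accepts any number of data pieces
specialFunc <- function(...) {
  lapply(list(...), dim)
}
df <- data.frame(A = rnorm(6), B = rnorm(6),
                 origin = factor(rep(1:2, each = 3)))
# drop the origin column, split by its levels, and pass one piece per level as a separate argument
do.call(specialFunc, split.data.frame(df[, -ncol(df)], df$origin))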

Quicken this for loop?

I have a dataset called cpue with 3.3 million rows. I have made a subset of this dataframe called dat.frame. (See below for the heads of cpue and dat.frame.) I have added two new fields to dat.frame: "ssh_vec" and "ssh_mag". Although the heads of cpue and dat.frame look the same, the rest of the rows are not actually in the same order.
head(cpue)
code event Lat Long stat_area Day Month Year id
1 BCO 447602 -43.45 182.73 49 17 3 1995 1
head(dat.frame)
code event Lat Long stat_area Day Month Year id cal.jdate ssh_vec ssh_mag
1 BCO 447602 -43.45 182.73 49 17 3 1995 1 2449857 56.83898 4.499350
Currently, I am running a loop to add the ssh_vec and ssh_mag variables to the full data set (cpue_full in the code below) using the unique identifier "id":
cpue_full$ssh <- NA
cpue_full$sshmag <- NA
for (i in 1:nrow(dat.frame)) {
  ndx <- dat.frame$id[i]
  cpue_full$ssh[ndx] <- dat.frame$ssh_vec[i]
  cpue_full$sshmag[ndx] <- dat.frame$ssh_mag[i]
}
This has been running over the weekend and is only up to:
i
[1] 132778
... out of:
nrow(dat.frame)
[1] 2797789
Within the loop, there is nothing that looks too computationally demanding. Is there a better alternative?
Are you sure you need a for loop at all? I think this might be equivalent:
cpue_full$ssh[dat.frame$id]<- dat.frame$ssh_vec
cpue_full$sshmag[dat.frame$id]<- dat.frame$ssh_mag
I would recommend taking a look at data.table. Since I don't have your data, here is a simple example using dummy data.
library(data.table)
N <- 10^6
dat <- data.table(
  x = rnorm(N),
  g = sample(LETTERS, N, replace = TRUE)
)
dat2 <- dat[, list(mx = mean(x)), g]
h <- merge(dat, dat2, 'g')
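For the asker's actual problem, the natural data.table approach would be an update join on id; here is a rough sketch, assuming cpue_full and dat.frame as described in the question:
library(data.table)
setDT(cpue_full)
setDT(dat.frame)
# join dat.frame onto cpue_full by id and add the two columns by reference
cpue_full[dat.frame, on = "id", `:=`(ssh = i.ssh_vec, sshmag = i.ssh_mag)]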
Do you even need to loop? From the code fragment posted it would appear not.
cpue_full$ssh[dat.frame$id] <- dat.frame$ssh_vec
cpue_full$sshmag[dat.frame$id]<- dat.frame$ssh_mag
should work. A quick (and small) dummy example:
set.seed(666)
ssh <- rnorm(10^4)
datf <- data.frame(id = sample.int(10000L), ssh = NA)
system.time(datf$ssh[datf$id] <- ssh) # user 0, system 0, elapsed 0
# Reset dummy data
datf$ssh <- NA
system.time({
  for (i in 1:nrow(datf)) {
    ndx <- datf$id[i]
    datf$ssh[ndx] <- ssh[i]
  }
})  # user 2.26, system 0.02, elapsed 2.28
PS - I've not used the data.table package, so I don't follow Ramnath's answer. In general you should avoid loops if possible (see fortune(142) and Circle 3 of The R Inferno).
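One caveat on the indexed-assignment answers above: they assume id is literally the row number of cpue_full. If id is just a key rather than a row index, a match()-based sketch (my assumption, not part of the original answers) is safer:
idx <- match(dat.frame$id, cpue_full$id)
cpue_full$ssh[idx] <- dat.frame$ssh_vec
cpue_full$sshmag[idx] <- dat.frame$ssh_mag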

Adding variables to a data.frame using a string as syntax

Suppose I have these variables:
data <- data.frame(x=rnorm(10), y=rnorm(10))
form <- 'z = x*y'
How can I compute z (using data's variables) and add it as a new variable to data?
I tried with parse() and eval() (based on an old question), but without success :/
Given that what @Nico said is correct, you might do:
d1 <- within(data, eval(parse(text = form)))
d1
x y z
1 0.5939462 1.58683345 0.94249368
2 0.3329504 0.55848643 0.18594826
3 1.0630998 -1.27659221 -1.35714497
4 -0.3041839 -0.57326541 0.17437812
5 0.3700188 -1.22461261 -0.45312970
6 0.2670988 -0.47340064 -0.12644474
7 -0.5425200 -0.62036668 0.33656135
8 1.2078678 0.04211587 0.05087041
9 1.1604026 -0.91092165 -1.05703586
10 0.7002136 0.15802877 0.11065390
transform() is the easy way if you are using this interactively:
data <- data.frame(x=rnorm(10), y=rnorm(10))
data <- transform(data, z = x * y)
R> head(data)
x y z
1 -1.0206 0.29982 -0.30599
2 -1.6985 1.51784 -2.57805
3 0.8940 1.19893 1.07187
4 -0.3672 -0.04008 0.01472
5 0.5266 -0.29205 -0.15381
6 0.2545 -0.26889 -0.06842
You can't do this using form, though; but within(), which is similar to transform(), does allow it, e.g.
R> within(data, eval(parse(text = form)))
x y z
1 -0.8833 -0.05256 0.046428
2 1.6673 1.61101 2.686115
3 1.1261 0.16025 0.180453
4 0.9726 -1.32975 -1.293266
5 -1.6220 -0.51079 0.828473
6 -1.1981 2.62663 -3.147073
7 -0.3596 -0.01506 0.005416
8 -0.9700 0.21865 -0.212079
9 1.0626 1.30377 1.385399
10 -0.8020 -1.04639 0.839212
though it involves some amount of jiggery-pokery with the language which to my mind is not elegant. Effectively, you are doing something like this:
R> eval(eval(parse(text = form), data), data, parent.frame())
[1] 0.046428 2.686115 0.180453 -1.293266 0.828473 -3.147073 0.005416
[8] -0.212079 1.385399 0.839212
(and assigning the result to the named component in data.)
Does form have to come like this, as a character string representing some expression to be evaluated?
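If you do control the string's format, a simpler route is to split it into a name and an expression yourself, avoiding the parsing of an assignment altogether; a minimal sketch, assuming the string always has the form "name = expression":
data <- data.frame(x = rnorm(10), y = rnorm(10))
form <- 'z = x*y'
# split "z = x*y" into the target column name and the expression to evaluate
parts <- strsplit(form, "=", fixed = TRUE)[[1]]
name <- trimws(parts[1])
expr <- trimws(parts[2])
# evaluate the expression in the context of the data frame, then assign by name
data[[name]] <- eval(parse(text = expr), data)
head(data)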

summarize data from csv using R

I'm new to R, and I wrote some code to summarize data from a .csv file according to my needs.
Here is the code:
raw <- read.csv("trees.csv")
The data look like this:
SNAME CNAME FAMILY PLOT INDIVIDUAL CAP H
1 Alchornea triplinervia (Spreng.) M. Arg. Tainheiro Euphorbiaceae 5 176 15 9.5
2 Andira fraxinifolia Benth. Angelim Fabaceae 3 321 12 6.0
3 Andira fraxinifolia Benth. Angelim Fabaceae 3 326 14 7.0
4 Andira fraxinifolia Benth. Angelim Fabaceae 3 327 18 5.0
5 Andira fraxinifolia Benth. Angelim Fabaceae 3 328 12 6.0
6 Andira fraxinifolia Benth. Angelim Fabaceae 3 329 21 7.0
# add 2 other columns
for (i in 1:nrow(raw)) {
  raw$VOLUME[i] <- treeVolume(raw$CAP[i], raw$H[i])
  raw$BASALAREA[i] <- treeBasalArea(raw$CAP[i])
}
# here comes the part I need help with
I need a new data frame with the means of columns H and CAP and the sums of columns VOLUME and BASALAREA, grouped by column SNAME and subgrouped by column PLOT.
plotSummary = merge(
  aggregate(raw$CAP ~ raw$SNAME * raw$PLOT, raw, mean),
  aggregate(raw$H ~ raw$SNAME * raw$PLOT, raw, mean))
plotSummary = merge(
  plotSummary,
  aggregate(raw$VOLUME ~ raw$SNAME * raw$PLOT, raw, sum))
plotSummary = merge(
  plotSummary,
  aggregate(raw$BASALAREA ~ raw$SNAME * raw$PLOT, raw, sum))
The functions treeVolume and treeBasalArea just return numbers:
treeVolume <- function(radius, height) {
  return(0.000074230 * radius^1.707348 * height^1.16873)
}
treeBasalArea <- function(radius) {
  return(((radius^2) * pi) / 40000)
}
I'm sure that there is a better way of doing this, but how?
I can't manage to read your example data in, but I think I've made something that generally represents it, so give this a whirl. This answer builds on Greg's suggestion to look at plyr, using ddply to group by segments of your data frame and numcolwise to calculate your statistics of interest.
# Sample data
set.seed(1)
dat <- data.frame(sname = rep(letters[1:3], 2),
                  plot = rep(letters[1:3], 2),
                  CAP = rnorm(6),
                  H = rlnorm(6),
                  VOLUME = runif(6),
                  BASALAREA = rlnorm(6))
#Calculate mean for all numeric columns, grouping by sname and plot
library(plyr)
ddply(dat, c("sname", "plot"), numcolwise(mean))
#-----
sname plot CAP H VOLUME BASALAREA
1 a a 0.4844135 1.182481 0.3248043 1.614668
2 b b 0.2565755 3.313614 0.6279025 1.397490
3 c c -0.8280485 1.627634 0.1768697 2.538273
EDIT - response to updated question
OK - now that your question is more or less reproducible, here's how I'd approach it. First of all, you can take advantage of the fact that R is vectorized, meaning that you can calculate ALL of the values of VOLUME and BASALAREA in one pass, without looping through each row. For that bit, I recommend the transform function:
dat <- transform(dat, VOLUME = treeVolume(CAP, H), BASALAREA = treeBasalArea(CAP))
Secondly, realizing that you intend to calculate different statistics for CAP & H and then VOLUME & BASALAREA, I recommend using the summarize function, like this:
ddply(dat, c("sname", "plot"), summarize,
      meanCAP = mean(CAP),
      meanH = mean(H),
      sumVOLUME = sum(VOLUME),
      sumBASAL = sum(BASALAREA))
Which will give you an output that looks like:
sname plot meanCAP meanH sumVOLUME sumBASAL
1 a a 0.5868582 0.5032308 9.650184e-06 7.031954e-05
2 b b 0.2869029 0.4333862 9.219770e-06 1.407055e-05
3 c c 0.7356215 0.4028354 2.482775e-05 8.916350e-05
The help pages for ?ddply, ?transform, ?summarize should be insightful.
Look at the plyr package. It will split the data by the SNAME variable for you, then you give it code to do the set of summaries that you want (mixing mean and sum and whatever), and it will put the pieces back together for you. You probably want either the ddply or the daply function in that package.
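As a side note, the same grouped summary can also be written with dplyr instead of plyr; a rough sketch, assuming raw already has the VOLUME and BASALAREA columns added as above:
library(dplyr)
plotSummary <- raw %>%
  group_by(SNAME, PLOT) %>%
  summarise(meanCAP   = mean(CAP),
            meanH     = mean(H),
            sumVOLUME = sum(VOLUME),
            sumBASAL  = sum(BASALAREA),
            .groups   = "drop")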
