Sum or mean of certain elements in a data frame in R

I have a data frame that looks like this:
k v 2002 2006 2010
1 a x 79.1 80.2 83.2
2 a y 75.1 76.2 79.3
3 a z 74.7 75.8 79.0
4 b x 82.8 85.9 87.6
5 b y 81.1 83.5 85.1
6 b z 80.5 83.1 84.6
etc. What I need is the mean of the numeric values for every row, i.e. I want it to look like this:
k v tot
1 a x 80.833
2 a y 76.867
3 a z 76.500
4 b x 85.433
5 b y 83.233
6 b z 82.733
I don't want to keep the original values, just the means. I know about rowMeans but as far as I know I can't (and don't want to) use it since it is averaging the whole row, not just the three last columns. I tried to use
rowMeans(subset(df,select=3:5))
but then I only get the numerical values and lose the variables k and v. Does anyone know a convenient way to get the mean over just some of the elements in a row?

dplyr::mutate(df, tot= (`2002`+`2006`+`2010`)/3)
should work too.
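This will preserve the first two columns of variables as you intended, and append a column named tot that is the mean of the three year columns. Note that mutate() also keeps the year columns themselves; drop them afterwards (for example with dplyr::select()) if you only want the means.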
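If you would rather stay in base R, the rowMeans() attempt from the question can be kept: compute the means over the year columns only, then bind the result back to k and v. A minimal sketch, assuming only the first three rows of the sample data and that the year columns sit in positions 3 to 5:

```r
# Sample data from the question (first three rows only).
# check.names = FALSE keeps the numeric column names unchanged.
df <- data.frame(
  k = c("a", "a", "a"),
  v = c("x", "y", "z"),
  `2002` = c(79.1, 75.1, 74.7),
  `2006` = c(80.2, 76.2, 75.8),
  `2010` = c(83.2, 79.3, 79.0),
  check.names = FALSE
)

# Keep the identifier columns and add the row-wise mean of columns 3 to 5.
result <- data.frame(df[1:2], tot = rowMeans(df[3:5]))
print(result)
```

Because the result is built from df[1:2] plus the new column, the original year columns are not carried over.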

Related

Is there way to calculate multiple new rows of a data frame based on previous rows' values?

I am creating a data frame (hoops) with three columns (t, x, y) and 700 rows. See code at bottom. In the first row, I have set column t to equal 0. In the second row, I want column t to be calculated by taking the previous row's t value and adding a constant (hoops_var). I want this formula to continue to row 700.
hoops<-data.frame(t=double(),x=double(),y=double())
hoops_var<- 1.5
hoops[1,1]<- 0
hoops[1,2]<- (hoops$t+23)
hoops[1,3]<- (hoops$t/2)
# What I want for row 2
hoops[2,1]<- hoops[[1,1]]+hoops_var #this formula for rows 2 to 700
hoops[2,2]<- (hoops$t+23) #same as row 1
hoops[2,3]<- (hoops$t/2) #same as row 1
# What I want for row 3 to 700 (same as row 2)
hoops[3:700,1]<- hoops[[2,2]]+hoops_var #same as row 2
hoops[3:700,2]<- (hoops$t+23) #same as rows 1 & 2
hoops[3:700,3]<- (hoops$t/2) #same as row 1 & 2
The first four rows of the table should look like this
The only applicable solution I found (linked at bottom) did not work for me.
I am fairly new to R, so apologies if this is a dumb question. Thanks in advance for any help.
R: Creating a new row based on previous rows
You should use vectorized operations:
# first create all columns as vectors
hoops_t <- hoops_var*(0:699) #0:699 gives a vector of 700 consecutive integers
hoops_x <- hoops_t+23
hoops_y <- hoops_t/2
# now we are ready to put all vectors in a dataframe
hoops <- data.frame(t=hoops_t,x=hoops_x,y=hoops_y)
Now if you want to change the t column, you can use lag from dplyr to shift all values, for example:
library(dplyr)
hoops$t[2:nrow(hoops)] <- lag(hoops$x*hoops$y)[2:nrow(hoops)]
I select only [2:nrow(hoops)] (all rows except the first one) because you don't want the first row to be modified.
You could use the following:
n <- 10 #Number of rows in the data.frame
t <- seq(0, by = 1.5, length.out = n)
x <- 23 + t
y <- t/2
hoops <- data.frame(t, x, y)
hoops #Sample for 10 rows.
# t x y
#1 0.0 23.0 0.00
#2 1.5 24.5 0.75
#3 3.0 26.0 1.50
#4 4.5 27.5 2.25
#5 6.0 29.0 3.00
#6 7.5 30.5 3.75
#7 9.0 32.0 4.50
#8 10.5 33.5 5.25
#9 12.0 35.0 6.00
#10 13.5 36.5 6.75
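When the rule really is "previous row plus something" and no closed form is handy, the running value can still be built without an explicit loop. The following is only a sketch of that idea using base R's Reduce() with accumulate = TRUE; the name t_steps is made up for illustration, and for this particular question the seq() version above is simpler.

```r
n <- 10           # number of rows, as in the sample above
hoops_var <- 1.5  # constant step from the question

# Build t one step at a time: each value is the previous value plus the step.
# The third argument (0) is the starting value; accumulate = TRUE keeps
# every intermediate result instead of only the final one.
t_steps <- Reduce(function(prev, step) prev + step,
                  rep(hoops_var, n - 1),
                  0,
                  accumulate = TRUE)

hoops <- data.frame(t = t_steps, x = t_steps + 23, y = t_steps / 2)
print(hoops)
```

The function passed to Reduce() can be replaced by any rule that maps the previous value to the next one.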

Applying a label depending on which condition is met using R

I would like to use a simple R function where the contents of a specified data frame column are read row by row, then depending on the value, a string is applied to that row in a new column.
So far, I've tried to use a combination of loops and generating individual columns which were combined later. However, I cannot seem to get the syntax right.
The input looks like this:
head(data,10)
# A tibble: 10 x 5
Patient T1Score T2Score T3Score T4Score
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 96.4 75 80.4 82.1
2 5 100 85.7 53.6 55.4
3 6 82.1 85.7 NA NA
4 7 82.1 85.7 60.7 28.6
5 8 100 76.8 64.3 57.7
6 10 46.4 57.1 NA 75
7 11 71.4 NA NA NA
8 12 98.2 92.9 85.7 82.1
9 13 78.6 89.3 37.5 42.9
10 14 89.3 100 64.3 87.5
and the function I have written looks like this:
minMax<-function(x){
#make an empty data frame for the output to go
output<-data.frame()
#making sure the rest of the commands only look at what I want them to look at in the input object
a<-x[2:5]
#here I'm gathering the columns necessary to perform the calculation
minValue<-apply(a,1,min,na.rm=T)
maxValue<-apply(a,1,max,na.rm=T)
tempdf<-as.data.frame((cbind(minValue,maxValue)))
Difference<-tempdf$maxValue-tempdf$minValue
referenceValue<-ave(Difference)
referenceValue<-referenceValue[1]
#quick aside to make the first two thirds of the output file
output<-as.data.frame((cbind(x[1],Difference)))
#Now I need to define the class based on the referenceValue, and here is where I run into trouble.
apply(output, 1, FUN =
for (i in Difference) {
ifelse(i>referenceValue,"HIGH","LOW")
}
)
output
}
I also tried...
if (i>referenceValue) {
apply(output,1,print("HIGH"))
}else(print("LOW")) {}
}
)
output
}
Regardless, both end up giving me the error message,
c("'for (i in Difference) {' is not a function, character or symbol", "' ifelse(i > referenceValue, \"HIGH\", \"LOW\")' is not a function, character or symbol", "'}' is not a function, character or symbol")
The expected output should look like:
Patient Difference Toxicity
3 21.430000 LOW
5 46.430000 HIGH
6 3.570000 LOW
7 57.140000 HIGH
8 42.310000 HIGH
10 28.570000 HIGH
11 0.000000 LOW
12 16.070000 LOW
13 51.790000 HIGH
14 35.710000 HIGH
Is there a better way for me to organize the last loop?
Since you seem to be using tibbles anyway, here's a much shorter version using dplyr and tidyr:
library(dplyr)
library(tidyr)
d %>%
  gather(key = tscore, value = score, T1Score:T4Score) %>%
  group_by(Patient) %>%
  summarise(Difference = max(score, na.rm = TRUE) - min(score, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(AvgDifference = mean(Difference),
         Toxicity = if_else(Difference > mean(Difference), "HIGH", "LOW"))
# A tibble: 10 x 4
Patient Difference AvgDifference Toxicity
<int> <dbl> <dbl> <chr>
1 3 21.4 30.3 LOW
2 5 46.4 30.3 HIGH
3 6 3.6 30.3 LOW
4 7 57.1 30.3 HIGH
5 8 42.3 30.3 HIGH
6 10 28.6 30.3 LOW
7 11 0 30.3 LOW
8 12 16.1 30.3 LOW
9 13 51.8 30.3 HIGH
10 14 35.7 30.3 HIGH
I think maybe your expected output might have been based on a slightly different average difference, so this output is very slightly different.
And a much simpler base R version if you prefer:
d$min <- apply(d[,2:5],1,min,na.rm = TRUE)
d$max <- apply(d[,2:5],1,max,na.rm = TRUE)
d$diff <- d$max - d$min
d$avg_diff <- mean(d$diff)
d$toxicity <- with(d,ifelse(diff > avg_diff,"HIGH","LOW"))
A few notes on your existing code:
as.data.frame((cbind(minValue,maxValue))) is not an advisable way to create data frames. This is more awkward than simply doing data.frame(minValue = minValue,maxValue = maxValue) and risks unintended coercion from cbind.
ave is for computing summaries over groups; just use mean if you have a single vector
The FUN argument in apply expects a function, not an arbitrary expression, which is what you're trying to pass at the end. The general syntax for an "anonymous" function in that context would be apply(...,FUN = function(arg) { do some stuff and return exactly the thing you want}).
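To make the last note concrete, here is a minimal sketch of passing an anonymous function to apply(), assuming only the first three patients from the question's data. The label step needs no loop at all, because ifelse() already works on the whole vector.

```r
# First three patients from the question's data.
d <- data.frame(
  Patient = c(3, 5, 6),
  T1Score = c(96.4, 100, 82.1),
  T2Score = c(75, 85.7, 85.7),
  T3Score = c(80.4, 53.6, NA),
  T4Score = c(82.1, 55.4, NA)
)

# FUN receives one row at a time as a numeric vector and returns one number.
difference <- apply(d[, 2:5], 1, function(scores) {
  max(scores, na.rm = TRUE) - min(scores, na.rm = TRUE)
})

# ifelse() is vectorized, so the label is computed for all rows in one call.
output <- data.frame(
  Patient = d$Patient,
  Difference = difference,
  Toxicity = ifelse(difference > mean(difference), "HIGH", "LOW")
)
print(output)
```

With only three rows the reference value (the mean difference) is not the same as for the full data set, so the labels are only comparable within this small sample.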

Subsetting closest values in a column based on a binary column in a data frame in R

I have a data frame with 85 rows and 35 columns which is sorted based on age column, like below:
No Gender Age
1 F 5.8
2 F 5.9
3 F 6
4 M 6.2
5 F 7
6 F 7.2
7 M 7.4
8 M 7.8
9 M 7.9
10 M 8.1
11 F 8.3
12 F 8.6
13 M 8.9
14 M 9
15 F 9.2
16 F 9.3
I need to subset the closest ages between the two genders, like below:
No Gender Age
1 F 6
2 M 6.2
3 F 7.2
4 M 7.4
5 M 8.1
6 F 8.3
7 F 8.6
8 M 8.9
9 M 9
10 F 9.2
Ok, I think I got this. It was surprisingly difficult, and maybe someone else will be able to come up with a more elegant solution, but here's what I got:
df <- data.frame(No=c(1L,2L,3L,4L,5L,6L,7L,8L,9L,10L,11L,12L,13L,14L,15L,16L),Gender=c('F','F','F','M','F','F','M','M','M','M','F','F','M','M','F','F'),Age=c(5.8,5.9,6,6.2,7,7.2,7.4,7.8,7.9,8.1,8.3,8.6,8.9,9,9.2,9.3),stringsAsFactors=F);
mls <- df$Gender=='M';
mages <- df$Age[mls];
fages <- df$Age[!mls];
fisLower <- findInterval(mages,fages);
TOL <- 1e-5;
fisClosest <- fisLower+ifelse(fisLower==0L | fisLower<length(fages) & mages-fages[replace(fisLower,fisLower==0L,NA)]>fages[fisLower+1L]-mages+TOL,1L,0L);
mis <- unname(tapply(seq_along(mages),fisClosest,function(is) is[which.min(abs(mages[is]-fages[fisClosest[is[1L]]]))]));
fis <- unique(fisClosest);
df[sort(c(which(mls)[mis],which(!mls)[fis])),];
## No Gender Age
## 3 3 F 6.0
## 4 4 M 6.2
## 6 6 F 7.2
## 7 7 M 7.4
## 10 10 M 8.1
## 11 11 F 8.3
## 12 12 F 8.6
## 13 13 M 8.9
## 14 14 M 9.0
## 15 15 F 9.2
Explanation of variables:
df The input data.frame.
mls "male logicals": A logical vector representing which elements of df$Gender are male.
mages "male ages": The subset of df$Age for male rows.
fages "female ages": The subset of df$Age for female rows.
fisLower "female indexes lower": For each element of mages, this has the index into fages of the female age that lies just below (or possibly equal to) the male age. This could be zero if fages has no ages below the element of mages. Hence this vector is "parallel" to mages, meaning it's the same length and the elements correspond to each other.
TOL "tolerance" This was a necessary annoyance to prevent spurious floating-point comparison errors in the following statement.
fisClosest "female indexes closest" This is a simple transformation of fisLower. Basically, we must add 1L to each element of fisLower if the corresponding element of mages is actually closer to the subsequent element of fages (the "upper" one) rather than the one pointed to by the corresponding element of fisLower (the "lower" one). This must be done for two cases: (1) zero elements of fisLower, and (2) where the element of fisLower points to a non-last element of fages and the element of mages is actually closer to the subsequent element of fages.
mis "male indexes" First of all, understand that fisClosest may contain duplicates if multiple male ages have the same female age as their closest, IOW there is no other female age closer to that male age, for all of them. For each of these conflicts, we must find the one male age that is closest to the female age from the set of male ages. This requires a vector aggregation for which tapply() is appropriate. We group by fisClosest, passing mages indexes into the lambda, where we call which.min() on the absolute differences between the ages to get the winning male age, and return its index.
fis "female indexes" This is simply the unique set of indexes into fages which we need to select from df; we get this from fisClosest by removing duplicates.
At this point we can finally convert from mages and fages indexes (mis and fis) to df row indexes by indexing the appropriate respective polarities of mls. After combining and sorting the two index sets, we can finally index df to get the required output.
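The whole approach rests on findInterval(), so it may help to see that step in isolation. A small sketch using the ages from the sample data, with the same variable names as above:

```r
# Female and male ages from the sample data; both are already sorted.
fages <- c(5.8, 5.9, 6, 7, 7.2, 8.3, 8.6, 9.2, 9.3)
mages <- c(6.2, 7.4, 7.8, 7.9, 8.1, 8.9, 9)

# For each male age, the index of the largest female age that is <= it.
fisLower <- findInterval(mages, fages)
print(fisLower)
# 3 5 5 5 5 7 7

# A value below every female age gives index 0, which is why the
# solution has to treat zero elements of fisLower as a special case.
print(findInterval(5, fages))
# 0
```

Four male ages (7.4, 7.8, 7.9, 8.1) share index 5 here, which is the kind of duplication the tapply() step later resolves.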
Original (Incorrect) Solution
It looks like you want the first and last row of each run length, excepting the first and last row of the entire data.frame. Here's one way to achieve that:
df <- data.frame(No=c(1L,2L,3L,4L,5L,6L,7L,8L,9L,10L,11L,12L,13L,14L,15L,16L),Gender=c('F','F','F','M','F','F','M','M','M','M','F','F','M','M','F','F'),Age=c(5.8,5.9,6,6.2,7,7.2,7.4,7.8,7.9,8.1,8.3,8.6,8.9,9,9.2,9.3),stringsAsFactors=F);
x <- cumsum(rle(df$Gender)$lengths); df2 <- df[unique(c(rbind(c(1L,x[-length(x)]+1L),x))),];
df2 <- df2[-c(1L,nrow(df2)),]; ## remove first and last row from original data.frame
df2;
## No Gender Age
## 3 3 F 6.0
## 4 4 M 6.2
## 5 5 F 7.0
## 6 6 F 7.2
## 7 7 M 7.4
## 10 10 M 8.1
## 11 11 F 8.3
## 12 12 F 8.6
## 13 13 M 8.9
## 14 14 M 9.0
## 15 15 F 9.2
I think you missed the F 7.0 row in your expected output; other than that, this gets the same set of rows. If you want to fix up No to be sequential from 1, you can run df2$No <- seq_len(nrow(df2)). Ditto for the row names (with rownames(df2) on the LHS).

Filling in a ton of NA data in R by indices?

I have Price data indexed according to three things:
State, Date, and UPC (that is the Product Code).
I have a bunch of prices that are NA.
I am trying to fill in the NAs in the following way: For a given missing Price with index (S,D,UPC), fill in with the average Price of all the data points with the same S and UPC. I.e., take the average over Date.
There must be an incredibly easy way to do this because this is very simple. I have been using for loops, but I now realize that that is incredibly inefficient and I would like to use a function, such as one in plyr or dplyr, that will do it all in as few steps as possible.
upc=c(1153801013,1153801013,1153801013,1153801013,1153801013,1153801013,2105900750,2105900750,2105900750,2105900750,2105900750,2173300001,2173300001,2173300001,2173300001)
date=c(200601,200602,200603,200604,200601,200602,200601,200602,200603,200601,200602,200603,200604,200605,200606)
price=c(26,28,NA,NA,23,24,85,84,NA,81,78,24,19,98,NA)
state=c(1,1,1,1,2,2,1,1,2,2,2,1,1,1,1)
# This is what I have:
data <- data.frame(upc,date,state,price)
# This is what I want:
price=c(26,28,27,27,23,24,85,84,79.5,81,78,24,19,98,47)
data2 <- data.frame(upc,date,state,price)
Any advice? Thanks.
Use ave with multiple grouping variables, and then replace NA values with the means:
with(data,
ave(price, list(upc,state), FUN=function(x) replace(x,is.na(x),mean(x,na.rm=TRUE) ) )
)
# [1] 26.0 28.0 27.0 27.0 23.0 24.0 85.0 84.0 79.5 81.0 78.0 24.0 19.0 98.0 47.0
You can construct a matrix of means by upc and state (dat here is the data frame called data in the question):
meanmtx <- tapply(dat$price, dat[c('upc','state')], mean, na.rm=TRUE)
That matrix has character indices that can be matched to values in upc and state. So then use 2 column character indexing to put these in the empty "slots":
dat$price[is.na(dat$price)] <-
meanmtx[ cbind( as.character(dat[ is.na(dat$price), 'upc']),
as.character(dat[ is.na(dat$price),'state']) ) ]
> dat
upc date state price
1 1153801013 200601 1 26.0
2 1153801013 200602 1 28.0
3 1153801013 200603 1 27.0
4 1153801013 200604 1 27.0
5 1153801013 200601 2 23.0
6 1153801013 200602 2 24.0
7 2105900750 200601 1 85.0
8 2105900750 200602 1 84.0
9 2105900750 200603 2 79.5
10 2105900750 200601 2 81.0
11 2105900750 200602 2 78.0
12 2173300001 200603 1 24.0
13 2173300001 200604 1 19.0
14 2173300001 200605 1 98.0
15 2173300001 200606 1 47.0
Here is another compact option using na.aggregate (from zoo) and data.table. By default, na.aggregate replaces the NA values with the mean of the column of interest. It also has a FUN argument in case we want to replace the NAs with the median, min, max, or whatever we wish. The group-by operation can be done with dplyr, data.table, or base R methods. With data.table, we convert the 'data.frame' to a 'data.table' (setDT(data)) and, grouped by 'upc' and 'state', assign (:=) 'price' as the na.aggregate of 'price'.
library(data.table)
library(zoo)
setDT(data)[, price:= na.aggregate(price) , .(upc, state)]
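Since the question mentions dplyr, the same group-wise fill can also be written as a grouped mutate(). This is a sketch under the assumption that dplyr is installed; the name filled is made up for illustration, and the data is the sample from the question.

```r
library(dplyr)

# Sample data from the question.
upc <- c(1153801013, 1153801013, 1153801013, 1153801013, 1153801013,
         1153801013, 2105900750, 2105900750, 2105900750, 2105900750,
         2105900750, 2173300001, 2173300001, 2173300001, 2173300001)
date <- c(200601, 200602, 200603, 200604, 200601, 200602, 200601, 200602,
          200603, 200601, 200602, 200603, 200604, 200605, 200606)
price <- c(26, 28, NA, NA, 23, 24, 85, 84, NA, 81, 78, 24, 19, 98, NA)
state <- c(1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1)
data <- data.frame(upc, date, state, price)

# Within each (upc, state) group, replace missing prices by the group mean.
filled <- data %>%
  group_by(upc, state) %>%
  mutate(price = replace(price, is.na(price), mean(price, na.rm = TRUE))) %>%
  ungroup()

print(filled$price)
```

A group whose prices are all NA would end up with NaN, since there is nothing to average; the sample data has no such group.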

R- replace values in a matrix with the average value of its group?

I am new-ish to R and have what should be a simple enough question to answer; any help would be greatly appreciated.
The situation is I have a tab delimited data matrix (data matrix.txt) like below with group information included on the last column.
sampleA sampleB sampleC Group
obs11 23.2 52.5 -86.3 1
obs12 -86.3 32.5 -84.7 1
obs41 -76.2 35.8 -16.3 2
obs74 23.2 32.5 -86.8 2
obs82 -86.2 52.8 -83.2 3
obs38 -36.2 59.5 -74.3 3
I would like to replace the values of each of the groups with the average value for that group
How can a group average rather than a row or column average be calculated in R?
And how can I use this value to replace original values? Is the replace() function useable in this situation or is that only for replacing two known values?
Thanks in advance
The ddply function from the plyr package should do the trick.
dat <- as.data.frame(matrix(runif(80),ncol=4))
dat$group <- sample(letters[1:4],size=20,replace=T)
head(dat)
library(plyr)
ddply(.data = dat, .variables =.(group), colwise(mean))
Result
group V1 V2 V3 V4
1 a 0.4741673 0.7669612 0.5043857 0.5039938
2 b 0.3648794 0.5776748 0.4033758 0.5748613
3 c 0.1450466 0.5399372 0.2440170 0.5124578
4 d 0.4249183 0.3252093 0.5467726 0.4416924
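The ddply() call returns one row of means per group. If the goal is instead to overwrite every original value with its group's mean, base R's ave() can do that column by column. A minimal sketch, assuming the question's sample has been read into a data frame called dat and that "group average" means the per-column mean within each group (value_cols is a name made up for illustration):

```r
# Sample data from the question, with the observation names as row names.
dat <- data.frame(
  sampleA = c(23.2, -86.3, -76.2, 23.2, -86.2, -36.2),
  sampleB = c(52.5, 32.5, 35.8, 32.5, 52.8, 59.5),
  sampleC = c(-86.3, -84.7, -16.3, -86.8, -83.2, -74.3),
  Group = c(1, 1, 2, 2, 3, 3),
  row.names = c("obs11", "obs12", "obs41", "obs74", "obs82", "obs38")
)

value_cols <- c("sampleA", "sampleB", "sampleC")

# ave() returns a vector of the same length as its input, where each element
# is the mean of its group (mean is the default FUN).
dat[value_cols] <- lapply(dat[value_cols], function(col) ave(col, dat$Group))
print(dat)
```

Because ave() keeps the original length and order, the result can be assigned straight back into the data frame, so replace() is not needed here.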
