Issue with calculating row mean in data table for selected columns in R - r

I have a data table as shown below.
Table:
LP GMweek1 GMweek2 GMweek3 PMweek1 PMweek2 PMweek3
215 45 50 60 11 0.4 10.2
0.1 50 61 24 12 0.8 80.0
0 45 24 35 22 20.0 15.4
51 22.1 54 13 35 16 2.2
I want to obtain the Output table below. My code below does not work. Can somebody help me figure out what I am doing wrong here?
Any help is appreciated.
Output:
LP GMweek1 GMweek2 GMweek3 PMweek1 PMweek2 PMweek3 AvgGM AvgPM
215 45 50 60 11 0.4 10.2 51.67 7.20
0.1 50 61 24 12 0.8 80.0 45.00 30.93
0 45 24 35 22 20.0 15.4 34.67 19.13
51 22.1 54 13 35 16 2.2 29.70 17.73
sel_cols_GM <- c("GMweek1","GMweek2","GMweek3")
sel_cols_PM <- c("PMweek1","PMweek2","PMweek3")
Table <- Table[, .(AvgGM = rowMeans(sel_cols_GM)), by = LP]
Table <- Table[, .(AvgPM = rowMeans(sel_cols_PM)), by = LP]

Ok, so you're doing a couple of things wrong. First, rowMeans can't evaluate a character vector; if you want to select columns with one, you must use .SD and pass the character vector to .SDcols. Second, you're combining a row aggregation with grouping, which doesn't make much sense here. Third, even if your expression didn't throw an error, you are assigning the result back to Table, which would destroy your original data (if you want to add a new column, use := to add it by reference).
What you want to do is calculate the row means of your selected columns, which you can do like this:
Table[, AvgGM := rowMeans(.SD), .SDcols = sel_cols_GM]
Table[, AvgPM := rowMeans(.SD), .SDcols = sel_cols_PM]
This means: create these new columns as the row means of my subset of data (.SD), which refers to these columns (.SDcols).
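Put together, a self-contained sketch of the whole workflow (rebuilding the question's table; data.table assumed installed):

```r
library(data.table)

Table <- data.table(
  LP      = c(215, 0.1, 0, 51),
  GMweek1 = c(45, 50, 45, 22.1), GMweek2 = c(50, 61, 24, 54),
  GMweek3 = c(60, 24, 35, 13),
  PMweek1 = c(11, 12, 22, 35),   PMweek2 = c(0.4, 0.8, 20.0, 16),
  PMweek3 = c(10.2, 80.0, 15.4, 2.2)
)

sel_cols_GM <- c("GMweek1", "GMweek2", "GMweek3")
sel_cols_PM <- c("PMweek1", "PMweek2", "PMweek3")

# := adds the averages by reference, so the original columns survive
Table[, AvgGM := rowMeans(.SD), .SDcols = sel_cols_GM]
Table[, AvgPM := rowMeans(.SD), .SDcols = sel_cols_PM]
```

After these two lines, Table has all seven original columns plus AvgGM and AvgPM, matching the Output table in the question.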

Related

R identifying first value in data-frame and creating new variable by adding/subtracting this from all values in data-frame in new column

I know this question may have been answered elsewhere already, and I apologise for repeating it if so, but I haven't found a workable answer yet.
I have 17 subjects each with two variables as below:
Time (s) OD
130 41.48
130.5 41.41
131 39.6
131.5 39.18
132 39.41
132.5 37.91
133 37.95
133.5 37.15
134 35.5
134.5 36.01
135 35.01
I would like R to identify the first value in column 2 (OD) of my dataframe and create a new column (OD_adjusted) by adding or subtracting it (depending on whether the first value is positive or negative) from all values in column 2, so it would look like this:
Time (s) OD OD_adjusted
130 41.48 0
130.5 41.41 -0.07
131 39.6 -1.88
131.5 39.18 -2.3
132 39.41 -2.07
132.5 37.91 -3.57
133 37.95 -3.53
133.5 37.15 -4.33
134 35.5 -5.98
134.5 36.01 -5.47
135 35.01 -6.47
First value in column 2 is 41.48 therefore I want to subtract this value from all datapoints in column 2 to create a new third column (OD_adjusted).
I can use OD_adjusted <- ((df$OD) - 41.48); however, I would like to automate the process with a function, and this is where I am stuck:
AUC_OD <- function(df){
return_value_1 = df %>%
arrange(OD) %>%
filter(OD [1,2) %>%
slice_(1)
colnames(return_value_1)[3] <- "OD_adjusted"
if (nrow(return_value_1) > 0 ) { subtract
(return_value_1 [1,2] #into new row
else add
(return_value_1 [1,2] #into new row
}
We get the first element of 'OD' and subtract it from the whole column:
library(dplyr)
df1 %>%
mutate(OD_adjusted = OD- OD[1])
Or using base R
df1$OD_adjusted <- with(df1, OD - OD[1])
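Since the asker specifically wanted the step wrapped in a function, here is a minimal base-R sketch of that (the name adjust_OD is illustrative, not from the post):

```r
# Subtract the first OD value from every OD value, so row 1 becomes 0
adjust_OD <- function(df) {
  df$OD_adjusted <- df$OD - df$OD[1]
  df
}

df1 <- data.frame(Time = c(130, 130.5, 131), OD = c(41.48, 41.41, 39.6))
df1 <- adjust_OD(df1)
# first row is 0; the rest are offsets from the first OD value
```

The same function can then be applied to each of the 17 subjects' dataframes, e.g. with lapply over a list of them.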

Subsetting - R prints data in reverse order- [R 3.2.2, Win10 Pro, 64-bit]

Aim: to retrieve the last two entries of the data. (I am aware of the tail function and of direct indexing.)
Code:
> tdata <- read.csv("hw1_data.csv")
> temp <- tdata[(nrow(tdata)-1):nrow(tdata), ]
> temp
Ozone Solar.R Wind Temp Month Day
152 18 131 8.0 76 9 29
153 20 223 11.5 68 9 30
> temp <- tdata[nrow(tdata)-1:nrow(tdata), ]
> temp
Ozone Solar.R Wind Temp Month Day
152 18 131 8.0 76 9 29
151 14 191 14.3 75 9 28
150 NA 145 13.2 77 9 27
149 30 193 6.9 70 9 26
148 14 20 16.6 63 9 25
147 7 49 10.3 69 9 24
.
.
.
While taking a subset with the extract operator, I used the nrow() function to retrieve the total number of rows in the data, subtracted one from it (one less than the total rows), and used the sequence operator (:) to build a sequence up to nrow(data), i.e. the total number of rows.
When I use parentheses, the logic works fine, but when I skip the parentheses the output is the whole dataframe in reverse order.
I can see that precedence rules are at play, but I'm unable to figure out the exact logic. I'm new to R, so any formal explanation would be valuable.
As suspected correctly in the post, the observed behavior is in fact a matter of operator precedence.
A complete list of the operator syntax and precedence rules in R can be obtained by typing
help(Syntax)
in the console.
In this context, R programmers sometimes refer to a well-known and rather witty quote which encourages the use of parentheses:
library(fortunes)
fortune(138)
nrow(tdata) = 153
So the first line you run is:
temp <- tdata[(nrow(tdata)-1):nrow(tdata),]
This executes as tdata[152:153,]
Second line:
temp <- tdata[nrow(tdata)-1:nrow(tdata),]
This executes as tdata[153-1:153,]
So it returns the following:
tdata[152,]
tdata[151,]
...
tdata[0,]
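The precedence difference is easy to reproduce on a bare vector, without any data file; a minimal sketch:

```r
n <- 5  # stand-in for nrow(tdata)

# `:` binds tighter than binary `-`, so n - 1:n parses as n - (1:n)
n - 1:n    # 4 3 2 1 0 -- indices counting down to 0, hence the reversed frame
(n - 1):n  # 4 5       -- the intended "last two rows" selection
```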

Looping through rows, creating and reusing multiple variables

I am building a streambed hydrology calculator in R using multiple tables from an Access database. I am having trouble automating and calculating the same set of indices for multiple sites. The following sample dataset describes my data structure:
> Thalweg
StationID AB0 AB1 AB2 AB3 AB4 AB5 BC1 BC2 BC3 BC4 Xdep_Vdep
1 1AAUA017.60 47 45 44 55 54 6 15 39 15 11 18.29
2 1AXKR000.77 30 27 24 19 20 18 9 12 21 13 6.46
3 2-BGU005.95 52 67 62 42 28 25 23 26 11 19 20.18
4 2-BLG011.41 66 85 77 83 63 35 10 70 95 90 67.64
5 2-CSR003.94 29 35 46 14 19 14 13 13 21 48 6.74
where each column represents certain field-measured parameters (i.e. depth of a reach section) and each row represents a different site.
I have successfully used the apply functions to simultaneously calculate simple functions on multiple rows:
> Xdepth <- apply(Thalweg[, 2:11], 1, mean) # Mean Depth
> Xdepth
1 2 3 4 5
33.1 19.3 35.5 67.4 25.2
and appending the results back to the proper station in a dataframe.
However, I am struggling when I want to calculate and save variables that are subsequently used for further calculations. I cannot seem to loop or apply the same function to multiple columns on a single row and complete the same calculations over the next row without mixing variables and data.
I want to do:
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + other_variables), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + other_variables), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + other_variables), Thalweg$AB3)
# etc.
Depth_AB0 <- (Thalweg$AB0 - Residual_AB0)
Depth_AB1 <- (Thalweg$AB1 - Residual_AB1)
Depth_AB2 <- (Thalweg$AB2 - Residual_AB2)
# etc.
I have tried and subsequently failed at for loops such as:
for (i in nrow(Thalweg)){
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + Stacks_Equation), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + Stacks_Equation), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + Stacks_Equation), Thalweg$AB3)
Residuals <- data.frame(Thalweg$StationID, Residual_AB0, Residual_AB1, Residual_AB2, Residual_AB3)
}
Is there a better way to approach looping through multiple lines of data when I need unique variables saved for each specific row that I am currently calculating? Thank you for any suggestions.
Your exact problem is still a mystery to me, but it looks like you want a double for loop:
residuals <- numeric(nrow(thalweg))
for (i in 1:nrow(thalweg)) {
  residual <- thalweg[i, "Xdep_Vdep"]
  for (j in 2:11) {
    residual <- min(residual, thalweg[i, j])
  }
  residuals[i] <- residual  # save the running minimum for this row
}
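If the goal is the asker's cumulative definition, where each column's residual feeds into the next, the inner loop can save one residual per column instead of a single minimum. This is only a sketch: Stacks_Equation is never defined in the post, so a placeholder adjustment of 0 stands in for it.

```r
# Hypothetical stand-in for the asker's undefined Stacks_Equation term
adjustment <- 0

Thalweg <- data.frame(
  StationID = c("1AAUA017.60", "1AXKR000.77"),
  AB0 = c(47, 30), AB1 = c(45, 27), AB2 = c(44, 24), AB3 = c(55, 19),
  Xdep_Vdep = c(18.29, 6.46)
)
depth_cols <- c("AB0", "AB1", "AB2", "AB3")

# one row of residuals per station, one column per depth reading
Residuals <- matrix(NA_real_, nrow(Thalweg), length(depth_cols),
                    dimnames = list(Thalweg$StationID, depth_cols))
for (i in seq_len(nrow(Thalweg))) {
  res <- min(Thalweg$Xdep_Vdep[i], Thalweg[[depth_cols[1]]][i])
  Residuals[i, 1] <- res
  for (j in 2:length(depth_cols)) {
    res <- min(res + adjustment, Thalweg[[depth_cols[j]]][i])
    Residuals[i, j] <- res
  }
}
```

The depths then follow in one step as Thalweg[depth_cols] - Residuals, matching the asker's Depth_AB* lines.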

Loop over the data-set columns and calculate statistics in R

I am just starting with R and need help with looping over the data-set and calculating statistics.
I have two data-sets:
>head(windows)
W1
W1
W2
W2
W3
W4
W4
W5
...
>head(values) # this is very large file (>20Gb)
Case1 Case2 Case3 Case4 ...
21 19 14 64
14 24 48 13
21 34 65 83
45 53 25 63
62 32 72 11
24 75 12 66
12 23 73 37
45 23 56 74
...
What I want to do:
For every Case column in values join it with windows row by row;
Should look something like this (Case1):
W1 21
W1 14
W2 21
W2 45
W3 62
W4 24
W4 12
W5 45
For every joined window group, e.g.:
W1(Case1): 21,14
W2(Case1): 21,45
W3(Case1): 62
W4(Case1): 24,12
W5(Case1): 45
W1(Case2): 19,24
Calculate mean (or median);
Perfect output would look like this:
Case1 Case2 Case3 Case4
W1 17.50 21.50 mean mean
W2 33.00 mean mean mean
W3 62.00 mean mean mean
W4 18.00 mean mean mean
W5 45.00 mean mean mean
Pseudo code might be:
For cases in values
join row by row with windows
For every window
Calculate mean
end
end
NB: I have tried joining windows with values using rbind, merge, and data.frame, but the data-sets are too large and the process gets killed.
Since you have a considerably large data file, I think there are two good options to do it, either using data.table or dplyr. So here's how you could do it using dplyr.
But first of all, I don't think you really want to merge values and windows. Based on your description, what you want is to add windows as an additional column to values (since there is nothing to merge on, it seems).
So I would first create that additional column in values. (I assume here that windows is a vector; it is not clear from your question, and it might also be a data.frame, but you could do it very similarly in that case.)
values$windows <- windows #assuming windows is a vector
Then you can use dplyr for the calculation:
Method 1:
Referencing each column you want to operate on:
library(dplyr)
values %>%
group_by(windows) %>%
summarize(Case1 = mean(Case1, na.rm=TRUE),
Case2 = mean(Case2, na.rm=TRUE),
Case3 = mean(Case3, na.rm=TRUE),
Case4 = mean(Case4, na.rm=TRUE))
Method 2:
Using summarise_each to do the same operation for all columns except the grouping variables (windows in this case). If you have a large number of columns you want to do the same operation on, this saves you some typing. Plus, you can specify more functions to be calculated, for example mean and median, if you want.
library(dplyr) # if it's not yet loaded
values %>%
group_by(windows) %>%
summarise_each(funs(mean(., na.rm=TRUE)))
The result is the same in both cases:
# windows Case1 Case2 Case3 Case4
#1 W1 17.5 21.5 31.0 38.5
#2 W2 33.0 43.5 45.0 73.0
#3 W3 62.0 32.0 72.0 11.0
#4 W4 18.0 49.0 42.5 51.5
#5 W5 45.0 23.0 56.0 74.0
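The opening of this answer names data.table as the other good option for a file this size; for completeness, a sketch of the equivalent grouped mean (data.table assumed installed, with only the first two Case columns filled in):

```r
library(data.table)

vals <- data.table(
  windows = c("W1", "W1", "W2", "W2", "W3", "W4", "W4", "W5"),
  Case1   = c(21, 14, 21, 45, 62, 24, 12, 45),
  Case2   = c(19, 24, 34, 53, 32, 75, 23, 23)
)

# lapply over .SD mirrors summarise_each: one mean per Case column per window
vals[, lapply(.SD, mean, na.rm = TRUE), by = windows]
```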
Edit
Here's an example with much larger sample data, including the conversion from matrix to data.frame/vector. If your conversion from "big.matrix" to matrix works, then I think this should work the same way with your original data.
# create a matrix with 100 columns and 5 million rows
m <- matrix(runif(100*5e6), ncol=100)
dim(m)
#[1] 5000000 100
object.size(m)
# 4000000200 bytes
# convert to data.frame
df <- as.data.frame(m)
# create a second matrix "windows" with a single column
windows <- matrix(sample(1:1000, nrow(df), replace=TRUE), ncol = 1)
# convert matrix "windows" to vector
windows.vec <- as.vector(windows[,1])
# add windows.vec as a grouping variable to "df"
df$windows <- windows.vec # you could also do this directly from the "windows" matrix
# check dimensions of "df"
dim(df)
#[1] 5000000 101
# now you can do the calculation
df %>%
group_by(windows) %>%
summarise_each(funs(mean(., na.rm=TRUE), median(., na.rm=TRUE)))
This is by no means the most elegant solution, but it seems to do what you want simply by stacking your values data into a single column and then using a tapply() function. It also prevents the need to bind together your windows factors and values data.
First, a small sample dataset, similar to the above format:
> set.seed(42)
> values <- data.frame(replicate(4, sample(1:100, 1e3, replace=T)))
> head(values)
[,1] [,2] [,3] [,4]
[1,] 85 34 42 77
[2,] 21 3 72 66
[3,] 36 45 77 14
[4,] 78 50 7 31
[5,] 51 89 42 92
[6,] 61 23 55 2
> windows <- rep(1:(1e3/2), each=2)
> head(windows)
[1] 1 1 2 2 3 3
Now stack the values data into a single column, creating a new variable ind:
> values <- stack(values)
And repeat your windows values to match the length of the stacked dataframe:
> windows <- rep(windows, 4)
Now you can use a simple tapply to calculate the mean by windows variable for each column:
> tapply(values$values, list(values$ind, windows), mean)
Sample output:
1 2 3 ...
X1 50.0 81.5 39.5
X2 36.0 26.5 52.5
X3 68.5 77.5 85.5
X4 52.0 90.0 91.5

How can I get column data to be added based on a group designation using R?

The data set that I'm working with is similar to the one below (although the example is of a much smaller scale; the data I'm working with is tens of thousands of rows), and I haven't been able to figure out how to get R to add up column data based on the group number. Essentially I want to get the number of greens, blues, and reds added up for groups 81 and 66 separately, and then use that information to calculate percentages.
txt <- "Group Green Blue Red Total
81 15 10 21 46
81 10 10 10 30
81 4 8 0 12
81 42 2 2 46
66 11 9 1 21
66 5 14 5 24
66 7 5 2 14
66 1 16 3 20
66 22 4 2 28"
dat <- read.table(textConnection(txt), sep = " ", header = TRUE)
I've spent a good deal of time trying to figure out how to use some of the functions on my own hoping I would stumble across a proper way to do it, but since I'm such a new basic user I feel like I have hit a wall that I cannot progress past without help.
One way is via aggregate. Assuming your data is in an object x:
aggregate(. ~ Group, data=x, FUN=sum)
# Group Green Blue Red Total
# 1 66 46 48 13 107
# 2 81 71 30 33 134
Both of the answers above are perfect examples of how to address this type of problem. Two other options exist in reshape and plyr:
library(reshape)
cast(melt(dat, "Group"), Group ~ ..., sum)
library(plyr)
ddply(dat, "Group", function(x) colSums(x[, -1]))
I would suggest that #Joshua's answer is neater, but two functions you should learn are apply and tapply. If a is your data set, then:
## apply calculates the sum of each row
> total = apply(a[,2:4], 1, sum)
## tapply calculates the sum based on each group
> tapply(total, a$Group, sum)
66 81
107 134
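Since the asker's end goal is percentages, the grouped sums from the aggregate() answer can be carried one step further in base R, using the question's own data:

```r
txt <- "Group Green Blue Red Total
81 15 10 21 46
81 10 10 10 30
81 4 8 0 12
81 42 2 2 46
66 11 9 1 21
66 5 14 5 24
66 7 5 2 14
66 1 16 3 20
66 22 4 2 28"
dat <- read.table(text = txt, header = TRUE)

# per-group sums, as in the accepted answer
sums <- aggregate(. ~ Group, data = dat, FUN = sum)

# each colour's share of its group total, as a percentage
pct <- cbind(sums["Group"],
             round(100 * sums[c("Green", "Blue", "Red")] / sums$Total, 1))
```

For example, group 66 has 46 greens out of a total of 107, which comes out to 43.0 percent.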
