Loop over the data-set columns and calculate statistics in R - r

I am just starting with R and need help with looping over the data-set and calculating statistics.
I have two data-sets:
>head(windows)
W1
W1
W2
W2
W3
W4
W4
W5
...
>head(values) # this is very large file (>20Gb)
Case1 Case2 Case3 Case4 ...
21 19 14 64
14 24 48 13
21 34 65 83
45 53 25 63
62 32 72 11
24 75 12 66
12 23 73 37
45 23 56 74
...
What I want to do:
For every Case column in values join it with windows row by row;
Should look something like this (Case1):
W1 21
W1 14
W2 21
W2 45
W3 62
W4 24
W4 12
W5 45
For every joined window group, e.g.:
W1(Case1): 21,14
W2(Case1): 21,45
W3(Case1): 62
W4(Case1): 24,12
W5(Case1): 45
W1(Case2): 19,24
Calculate mean (or median);
Perfect output would look like this:
Case1 Case2 Case3 Case4
W1 17.50 21.50 mean mean
W2 33.00 mean mean mean
W3 62.00 mean mean mean
W4 18.00 mean mean mean
W5 45.00 mean mean mean
Pseudo code might be:
For cases in values
join row by row with windows
For every window
Calculate mean
end
end
NB: I have tried joining windows with values using rbind, merge and data.frame, but the data-sets are too large and the process gets killed.

Since you have a considerably large data file, I think there are two good options to do it, either using data.table or dplyr. So here's how you could do it using dplyr.
But first of all, I think you don't really want to merge values and windows. Based on your description, I think what you want to do is add windows as an additional column to values (since there is nothing that could be merged, it seems).
So I would first create that additional column in values. (I assume here that windows is a vector. That is not clear from your question; it might also be a data.frame, but the approach would be very similar in that case):
values$windows <- windows #assuming windows is a vector
Then you can use dplyr for the calculation:
Method 1:
Referencing each column you want to operate on:
library(dplyr)
values %>%
group_by(windows) %>%
summarize(Case1 = mean(Case1, na.rm=TRUE),
Case2 = mean(Case2, na.rm=TRUE),
Case3 = mean(Case3, na.rm=TRUE),
Case4 = mean(Case4, na.rm=TRUE))
Method 2:
Using summarise_each to do the same operation for all columns except the grouping variables (windows in this case). If you have a large number of columns you want to do the same operation on, this saves you some typing. Plus, you can specify more functions to be calculated, for example mean and median, if you want.
library(dplyr) # if it's not yet loaded
values %>%
group_by(windows) %>%
summarise_each(funs(mean(., na.rm=TRUE)))
The result is the same in both cases:
# windows Case1 Case2 Case3 Case4
#1 W1 17.5 21.5 31.0 38.5
#2 W2 33.0 43.5 45.0 73.0
#3 W3 62.0 32.0 72.0 11.0
#4 W4 18.0 49.0 42.5 51.5
#5 W5 45.0 23.0 56.0 74.0
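For readers on a current dplyr release, where summarise_each() and funs() are deprecated, here is a minimal sketch of the same calculation using across(). It assumes dplyr 1.0.0 or later and rebuilds the small sample from the question by hand; the object name res is only for illustration.

```r
library(dplyr)

# Sample data copied from the question (first 8 rows only)
values <- data.frame(
  Case1 = c(21, 14, 21, 45, 62, 24, 12, 45),
  Case2 = c(19, 24, 34, 53, 32, 75, 23, 23),
  Case3 = c(14, 48, 65, 25, 72, 12, 73, 56),
  Case4 = c(64, 13, 83, 63, 11, 66, 37, 74)
)
values$windows <- c("W1", "W1", "W2", "W2", "W3", "W4", "W4", "W5")

# across(everything(), ...) applies the function to every non-grouping column
res <- values %>%
  group_by(windows) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
res
```

Swapping mean for median in the lambda should give the per-window medians instead.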
Edit
Here's an example with much larger sample data, including the conversion from matrix to data.frame/vector. If your conversion from "big.matrix" to matrix works, then I think this should work the same way with your original data.
# create a matrix with 100 columns and 5 million rows per column
m <- matrix(runif(100*5e6), ncol=100)
dim(m)
#[1] 5000000 100
object.size(m)
# 4000000200 bytes
# convert to data.frame
df <- as.data.frame(m)
# create a second matrix "windows" with a single column
windows <- matrix(sample(1:1000, nrow(df), replace=TRUE), ncol = 1)
# convert matrix "windows" to vector
windows.vec <- as.vector(windows[,1])
# add windows.vec as a grouping variable to "df"
df$windows <- windows.vec # you could also do this directly from the "windows" matrix
# check dimensions of "df"
dim(df)
#[1] 5000000 101
# now you can do the calculation
df %>%
group_by(windows) %>%
summarise_each(funs(mean(., na.rm=TRUE), median(., na.rm=TRUE)))
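The data.table option mentioned at the top of this answer is not shown above, so here is a rough sketch of what it could look like. It uses a small random table in place of the real data; the names dt and means and the default column names V1..V4 are only for illustration.

```r
library(data.table)

set.seed(1)
# Small stand-in for the large "values" matrix: 20 rows, 4 columns (V1..V4)
dt <- as.data.table(matrix(runif(20 * 4), ncol = 4))

# Add the grouping variable by reference (no copy of the table is made)
dt[, windows := rep(1:5, each = 4)]

# Mean of every non-grouping column within each window
means <- dt[, lapply(.SD, mean, na.rm = TRUE), by = windows]
means
```

Here .SD holds all columns except the grouping column, so the call scales to any number of Case columns without naming them.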

This is by no means the most elegant solution, but it seems to do what you want simply by stacking your values data into a single column and then using a tapply() function. It also prevents the need to bind together your windows factors and values data.
First, a small sample dataset, similar to the above format:
> set.seed(42)
> values <- data.frame(replicate(4, sample(1:100, 1e3, replace=T)))
> head(values)
X1 X2 X3 X4
1 85 34 42 77
2 21 3 72 66
3 36 45 77 14
4 78 50 7 31
5 51 89 42 92
6 61 23 55 2
> windows <- rep(1:(1e3/2), each=2)
> head(windows)
[1] 1 1 2 2 3 3
Now stack the values data into a single column, creating a new variable ind:
> values <- stack(values)
And repeat your windows values to match the length of the stacked dataframe:
> windows <- rep(windows, 4)
Now you can use a simple tapply to calculate the mean by windows variable for each column:
> tapply(values$values, list(values$ind, windows), mean)
Sample output:
1 2 3 ...
X1 50.0 81.5 39.5
X2 36.0 26.5 52.5
X3 68.5 77.5 85.5
X4 52.0 90.0 91.5
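If stacking a very large data frame is a concern, one possible variation (a sketch only, not benchmarked on data of the size in the question) is to run tapply() column by column with sapply(). It is shown here on the small sample from the question so the result can be checked by eye; res is just an illustrative name.

```r
# Sample data copied from the question
values <- data.frame(
  Case1 = c(21, 14, 21, 45, 62, 24, 12, 45),
  Case2 = c(19, 24, 34, 53, 32, 75, 23, 23),
  Case3 = c(14, 48, 65, 25, 72, 12, 73, 56),
  Case4 = c(64, 13, 83, 63, 11, 66, 37, 74)
)
windows <- c("W1", "W1", "W2", "W2", "W3", "W4", "W4", "W5")

# One tapply() per column; sapply() binds the results into a matrix
# with one row per window and one column per case
res <- sapply(values, function(col) tapply(col, windows, mean))
res
```

This gives the windows in rows and the cases in columns, which is the layout asked for in the question.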

Related

Issue with calculating row mean in data table for selected columns in R

I have a data table as shown below.
Table:
LP GMweek1 GMweek2 GMweek3 PMweek1 PMweek2 PMweek3
215 45 50 60 11 0.4 10.2
0.1 50 61 24 12 0.8 80.0
0 45 24 35 22 20.0 15.4
51 22.1 54 13 35 16 2.2
I want to obtain the Output table below. My code below does not work. Can somebody help me figure out what I am doing wrong here?
Any help is appreciated.
Output:
LP GMweek1 GMweek2 GMweek3 PMweek1 PMweek2 PMweek3 AvgGM AvgPM
215 45 50 60 11 0.4 10.2 51.67 7.20
0.1 50 61 24 12 0.8 80.0 45.00 30.93
0 45 24 35 22 20.0 15.4 34.67 19.13
51 22.1 54 13 35 16 2.2 29.70 17.73
sel_cols_GM <- c("GMweek1","GMweek2","GMweek3")
sel_cols_PM <- c("PMweek1","PMweek2","PMweek3")
Table <- Table[, .(AvgGM = rowMeans(sel_cols_GM)), by = LP]
Table <- Table[, .(AvgPM = rowMeans(sel_cols_PM)), by = LP]
OK, so you're doing a few things wrong. First, rowMeans() can't operate on a character vector of column names; to select columns that way, use .SD and pass the character vector to .SDcols. Second, you're combining a row-wise aggregation with grouping by LP, which I don't think makes much sense here. Third, even if your expression didn't throw an error, you are assigning the result back to Table, which would overwrite your original data (to add a new column, use := to add it by reference).
What you want to do is calculate the row means of your selected columns, which you can do like this:
Table[, AvgGM := rowMeans(.SD), .SDcols = sel_cols_GM]
Table[, AvgPM := rowMeans(.SD), .SDcols = sel_cols_PM]
This reads as: create each new column as the row means of the subset of data (.SD), restricted to the columns named in .SDcols.
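For completeness, here is a self-contained sketch that rebuilds the table from the question and applies the two lines above. The values are typed in from the question; nothing else is assumed.

```r
library(data.table)

# Table from the question
Table <- data.table(
  LP      = c(215, 0.1, 0, 51),
  GMweek1 = c(45, 50, 45, 22.1),
  GMweek2 = c(50, 61, 24, 54),
  GMweek3 = c(60, 24, 35, 13),
  PMweek1 = c(11, 12, 22, 35),
  PMweek2 = c(0.4, 0.8, 20, 16),
  PMweek3 = c(10.2, 80, 15.4, 2.2)
)

sel_cols_GM <- c("GMweek1", "GMweek2", "GMweek3")
sel_cols_PM <- c("PMweek1", "PMweek2", "PMweek3")

# := adds the columns by reference; Table keeps all original columns
Table[, AvgGM := rowMeans(.SD), .SDcols = sel_cols_GM]
Table[, AvgPM := rowMeans(.SD), .SDcols = sel_cols_PM]

print(Table)
```

Rounded to two decimals, AvgGM and AvgPM match the Output table in the question.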

Normalise only some columns in R

I'm new to R and still getting to grips with how it handles data (my background is spreadsheets and databases). The problem I have is as follows. My data looks like this (it is held in CSV):
RecNo Var1 Var2 Var3
41 800 201.8 Y
43 140 39 N
47 60 20.24 N
49 687 77 Y
54 570 135 Y
58 1250 467 N
61 211 52 N
64 96 117.3 N
68 687 77 Y
Column 1 (RecNo) is my observation number; while it is a number, it is not required for my analysis. Column 4 (Var3) is a Yes/No column which, again, I do not currently need for the analysis but will need later in the process to add information in the output.
I need to normalise the numeric data in my dataframe to values between 0 and 1 without losing the other information. I have the following function:
normalize <- function(x) {
x <- sweep(x, 2, apply(x, 2, min))
sweep(x, 2, apply(x, 2, max), "/")
}
However, when I apply it to my above data by calling
myResult <- normalize(myData)
it returns an error because of the text in Column 4. If I set the text in this column to binary values it runs fine, but then also normalises my case numbers, which I don't want.
So, my question is: How can I change my normalize function above to accept the names of the columns to transform, while outputting the full dataset (i.e. without losing columns)?
I could not get TUSHAr's suggestion to work, but I have found two solutions that work fine:
1. akrun's suggestion above:
myData2 <- myData %>% mutate_at(2:3, funs((.-min(.))/max(.-min(.))))
This produces the following:
RecNo Var1 Var2 Var3
1 41 0.62184874 0.40601834 Y
2 43 0.06722689 0.04195255 N
3 47 0.00000000 0.00000000 N
4 49 0.52689076 0.12693105 Y
5 54 0.42857143 0.25663508 Y
6 58 1.00000000 1.00000000 N
7 61 0.12689076 0.07102414 N
8 64 0.03025210 0.21718329 N
9 68 0.52689076 0.12693105 Y
2. Alternatively, the BBmisc package allowed me to do the following after converting my record numbers to factors:
> myData <- myData %>% mutate(RecNo = factor(RecNo))
> myNorm <- normalize(myData, method="range", range = c(0,1), margin = 1)
> myNorm
RecNo Var1 Var2 Var3
1 41 0.62184874 0.40601834 Y
2 43 0.06722689 0.04195255 N
3 47 0.00000000 0.00000000 N
4 49 0.52689076 0.12693105 Y
5 54 0.42857143 0.25663508 Y
6 58 1.00000000 1.00000000 N
7 61 0.12689076 0.07102414 N
8 64 0.03025210 0.21718329 N
9 68 0.52689076 0.12693105 Y
EDIT: For completeness I include TUSHAr's solution as well, showing as always that there are many ways around a single problem:
normalize<-function(x){
minval=apply(x[,c(2,3)],2,min)
maxval=apply(x[,c(2,3)],2,max)
#print(minval)
#print(maxval)
y=sweep(x[,c(2,3)],2,minval)
#print(y)
sweep(y,2,(maxval-minval),"/")
}
df[,c(2,3)]=normalize(df)
Thank you for your help!
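Since the original question asked for a function that takes the names of the columns to transform, here is a base R sketch of that idea. The function name normalize_cols and its cols argument are made up for this example, and the data frame is typed in from the question.

```r
# Hypothetical helper: min-max scale only the named columns, keep the rest
normalize_cols <- function(df, cols) {
  df[cols] <- lapply(df[cols], function(x) {
    rng <- range(x, na.rm = TRUE)
    # Note: a constant column gives 0/0 = NaN here
    (x - rng[1]) / (rng[2] - rng[1])
  })
  df
}

myData <- data.frame(
  RecNo = c(41, 43, 47, 49, 54, 58, 61, 64, 68),
  Var1  = c(800, 140, 60, 687, 570, 1250, 211, 96, 687),
  Var2  = c(201.8, 39, 20.24, 77, 135, 467, 52, 117.3, 77),
  Var3  = c("Y", "N", "N", "Y", "Y", "N", "N", "N", "Y")
)

myResult <- normalize_cols(myData, c("Var1", "Var2"))
myResult
```

RecNo and Var3 pass through untouched, so no conversion to factor is needed before the call.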

Using ddply across numerous variables when calculating descriptive statistics

Here's my data. It shows the amount of fish I found at three different sites.
Selidor.Bay Enlades.Bay Cumphrey.Bay
1 39 29 187
2 70 370 50
3 13 44 52
4 0 65 20
5 43 110 220
6 0 30 266
What I would like to do is create a script to calculate basic statistics for each site.
If I re-arrange the data by stacking it, i.e.:
values site
1 29 Selidor.Bay
2 370 Selidor.Bay
3 44 Selidor.Bay
4 65 Enlades.Bay
I'm able to use the following:
data <- ddply(df, c("site"), summarise,
N = length(values),
mean = mean(values),
sd = sd(values),
se = sd / sqrt(N),
sum = sum(values)
)
data.
My question is how can I use the script without having to stack my dataframe?
Thanks.
A slight variation on @docendodiscimus's comment:
library(reshape2)
library(dplyr)
DF %>%
melt(variable.name="site") %>%
group_by(site) %>%
summarise_each(funs( n(), mean, sd, se=sd(.)/sqrt(n()), sum ), value)
# site n mean sd se sum
# 1 Selidor.Bay 6 27.5 27.93385 11.40395 165
# 2 Enlades.Bay 6 108.0 131.84688 53.82626 648
# 3 Cumphrey.Bay 6 132.5 104.29909 42.57992 795
melt does what the OP referred to as "stacking" the data.frame. There is likely some analogous function in the tidyr package.
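If the goal is to avoid stacking altogether, one possible base R sketch is to apply the same statistics to each column with sapply(). The data frame is typed in from the question; the name stats is only for illustration.

```r
# Fish counts from the question, one column per site
DF <- data.frame(
  Selidor.Bay  = c(39, 70, 13, 0, 43, 0),
  Enlades.Bay  = c(29, 370, 44, 65, 110, 30),
  Cumphrey.Bay = c(187, 50, 52, 20, 220, 266)
)

# One named vector of statistics per column; t() puts sites in rows
stats <- t(sapply(DF, function(x) {
  c(N    = length(x),
    mean = mean(x),
    sd   = sd(x),
    se   = sd(x) / sqrt(length(x)),
    sum  = sum(x))
}))
stats
```

The result is a numeric matrix rather than a data frame; as.data.frame(stats) converts it if needed.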

How to find the highest value of a column in a data frame in R?

I have the following data frame which I called ozone:
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
I would like to extract the highest value from ozone, Solar.R, Wind...
Also, if possible, how would I sort Solar.R or any column of this data frame in descending order?
I tried
max(ozone, na.rm=T)
which gives me the highest value in the dataset.
I have also tried
max(subset(ozone,Ozone))
but got the error "'subset' must be logical".
I can set an object to hold the subset of each column, by the following commands
ozone <- subset(ozone, Ozone >0)
max(ozone,na.rm=T)
but it gives the same value of 334, which is the max value of the data frame, not the column.
Any help would be great, thanks.
Similar to colMeans, colSums, etc, you could write a column maximum function, colMax, and a column sort function, colSort.
colMax <- function(data) sapply(data, max, na.rm = TRUE)
colSort <- function(data, ...) sapply(data, sort, ...)
I use ... in the second function in hopes of sparking your intrigue.
Get your data:
dat <- read.table(h=T, text = "Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9")
Use colMax function on sample data:
colMax(dat)
# Ozone Solar.R Wind Temp Month Day
# 41.0 313.0 20.1 74.0 5.0 9.0
To do the sorting on a single column,
sort(dat$Solar.R, decreasing = TRUE)
# [1] 313 299 190 149 118 99 19
and over all columns use our colSort function,
colSort(dat, decreasing = TRUE) ## compare with '...' above
To get the max of any column you want something like:
max(ozone$Ozone, na.rm = TRUE)
To get the max of all columns, you want:
apply(ozone, 2, function(x) max(x, na.rm = TRUE))
And to sort:
ozone[order(ozone$Solar.R),]
Or to sort the other direction:
ozone[rev(order(ozone$Solar.R)),]
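One detail worth knowing when the column has missing values: order() puts NA last by default, so reversing the result moves the NA rows to the top. A small sketch of the difference, using two columns from the question's data (the names by_rev and by_dec are only for illustration):

```r
ozone <- data.frame(
  Ozone   = c(41, 36, 12, 18, NA, 28, 23, 19, 8),
  Solar.R = c(190, 118, 149, 313, NA, NA, 299, 99, 19)
)

# rev(order(...)): descending, but the NA rows come first
by_rev <- ozone[rev(order(ozone$Solar.R)), ]

# order(..., decreasing = TRUE): descending, NA rows stay at the end
by_dec <- ozone[order(ozone$Solar.R, decreasing = TRUE), ]

head(by_rev, 3)
head(by_dec, 3)
```

Which of the two is preferable depends on whether the rows with missing Solar.R should be seen first or last.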
Here's a dplyr solution:
library(dplyr)
# find max for each column
summarise_each(ozone, funs(max(., na.rm=TRUE)))
# sort by Solar.R, descending
arrange(ozone, desc(Solar.R))
UPDATE: summarise_each() has been deprecated in favour of a more featureful family of functions: mutate_all(), mutate_at(), mutate_if(), summarise_all(), summarise_at(), summarise_if()
Here is how you could do:
# find max for each column
ozone %>%
summarise_if(is.numeric, funs(max(., na.rm=TRUE)))%>%
arrange(Ozone)
or
ozone %>%
summarise_at(vars(1:6), funs(max(., na.rm=TRUE)))%>%
arrange(Ozone)
In response to finding the max value for each column, you could try using the apply() function:
> apply(ozone, MARGIN = 2, function(x) max(x, na.rm=TRUE))
Ozone Solar.R Wind Temp Month Day
41.0 313.0 20.1 74.0 5.0 9.0
Another way would be to use ?pmax
do.call('pmax', c(as.data.frame(t(ozone)),na.rm=TRUE))
#[1] 41.0 313.0 20.1 74.0 5.0 9.0
There is a package matrixStats that provides some functions to do column and row summaries, see in the package vignette, but you have to convert your data.frame into a matrix.
Then you run: colMaxs(as.matrix(ozone), na.rm = TRUE)
max(may$Ozone, na.rm = TRUE)
Without $Ozone, max() is taken over the whole data frame rather than the single column; the swirl package covers this.
I'm studying this course on Coursera too ~
Assuming that your data is in a data.frame called maxinozone, you can do this to get the maximum of the first column:
max(maxinozone[, 1], na.rm = TRUE)
max(ozone$Ozone, na.rm = TRUE) should do the trick. Remember to include the na.rm = TRUE or else R will return NA.
Try this solution:
Oz<-subset(data, data$Month==5,select=Ozone) # select the Ozone values for the month of
#May (i.e. Month = 5)
summary(Oz) #gives characteristics of the table (contains 1 column, Ozone), including max, min ...

Select a value for based on a highest value in another column

I don't understand why I can't find a solution for this, since I feel that this is a pretty basic question. Need to ask for help, then. I want to rearrange airquality dataset by month with maximum temp value for each month. In addition I want to find the corresponding day for each monthly maximum temperature. What is the laziest (code-wise) way to do this?
I have tried the following without success:
require(reshape2)
names(airquality) <- tolower(names(airquality))
mm <- melt(airquality, id.vars = c("month", "day"), meas = c("temp"))
dcast(mm, month + day ~ variable, max)
aggregate(formula = temp ~ month + day, data = airquality, FUN = max)
I am after something like this:
month day temp
5 7 89
...
There was quite a discussion a while back about whether being lazy is good or not. Anyway, this is short and natural to write and read (and is fast for large data, so you don't need to change or optimize it later):
require(data.table)
DT=as.data.table(airquality)
DT[,.SD[which.max(Temp)],by=Month]
Month Ozone Solar.R Wind Temp Day
[1,] 5 45 252 14.9 81 29
[2,] 6 NA 259 10.9 93 11
[3,] 7 97 267 6.3 92 8
[4,] 8 76 203 9.7 97 28
[5,] 9 73 183 2.8 93 3
.SD is the subset of the data for each group, and you just want the row from it with the largest Temp, if I understand correctly. If you need the row number then that can be added.
Or to get all the rows where the max is tied :
DT[,.SD[Temp==max(Temp)],by=Month]
Month Ozone Solar.R Wind Temp Day
[1,] 5 45 252 14.9 81 29
[2,] 6 NA 259 10.9 93 11
[3,] 7 97 267 6.3 92 8
[4,] 7 97 272 5.7 92 9
[5,] 8 76 203 9.7 97 28
[6,] 9 73 183 2.8 93 3
[7,] 9 91 189 4.6 93 4
Another approach with plyr
require(reshape2)
names(airquality) <- tolower(names(airquality))
mm <- melt(airquality, id.vars = c("month", "day"), meas = c("temp"), value.name = 'temp')
library(plyr)
ddply(mm, .(month), subset, subset = temp == max(temp), select = -variable)
Gives
month day temp
1 5 29 81
2 6 11 93
3 7 8 92
4 7 9 92
5 8 28 97
6 9 3 93
7 9 4 93
Or, even simpler
require(reshape2)
require(plyr)
names(airquality) <- tolower(names(airquality))
ddply(airquality, .(month), subset,
subset = temp == max(temp), select = c(month, day, temp) )
How about with plyr?
library(plyr)
max.func <- function(df) {
  # column names are case-sensitive: the original airquality uses Temp and Day
  max.temp <- max(df$Temp)
  return(data.frame(day = df$Day[df$Temp==max.temp],
                    temp = max.temp))
}
ddply(airquality, .(Month), max.func)
As you can see, the max temperature for the month happens on more than one day. If you want different behavior, the function is easy enough to adjust.
Or if you want to use the data.table package (for instance, if speed is an issue and the data set is large or if you prefer the syntax):
library(data.table)
DT <- data.table(airquality)
DT[, list(maxTemp=max(Temp), dayMaxTemp=.SD[max(Temp)==Temp, Day]), by="Month"]
If you want to know what the .SD stands for, have a look here: SO
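For anyone who prefers to stay in base R, here is one possible sketch using ave(), which returns the per-month maximum aligned with the original rows. It reads the built-in dataset via datasets::airquality so the original column names (Month, Day, Temp) are used; the name top is only for illustration.

```r
aq <- datasets::airquality

# ave() repeats each month's maximum Temp on every row of that month,
# so the comparison keeps exactly the rows that hit the monthly maximum
top <- aq[aq$Temp == ave(aq$Temp, aq$Month, FUN = max),
          c("Month", "Day", "Temp")]
top
```

Like the tie-keeping variants above, this returns every day that reaches the monthly maximum, so months with ties appear more than once.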
