Cut Function in R

Time Velocity
0 0
1.5 1.21
3 1.26
4.5 1.31
6 1.36
7.5 1.41
9 1.46
10.5 1.51
12 1.56
13 1.61
14 1.66
15 1.71
16 1.76
17 1.81
18 1.86
19 1.91
20 1.96
21 2.01
22.5 2.06
24 2.11
25.5 2.16
27 2.21
28.5 2.26
30 2.31
31.5 2.36
33 2.41
34.5 2.4223
36 2.4323
I have data on Time and Velocity, and I want to use the cut or which function to separate my data into 6-minute intervals; my maximum Time usually goes up to 3000 mins.
I would want the output to be similar to this:
Time Velocity
0 0
1.5 1.21
3 1.26
4.5 1.31
6 1.36
Time Velocity
6 1.36
7.5 1.41
9 1.46
10.5 1.51
12 1.56
Time Velocity
12 1.56
13 1.61
14 1.66
15 1.71
16 1.76
17 1.81
18 1.86
What I did so far is read the data using data = read.delim("clipboard").
I decided to use the which function, but I would need to keep going up to 3000 mins:
dat <- data[which(data$Time >= 0 & data$Time < 6), ]
dat1 <- data[which(data$Time >= 6 & data$Time < 12), ]
etc.
This wouldn't be so convenient if my time went up to 3000 mins.
Also, I would want all my results to be contained in one output/variable.

I will assume here that you really don't want to duplicate the values across the bins.
cuts <- cut(data$Time, seq(0, max(data$Time) + 6, by = 6), right = FALSE)
x <- by(data, cuts, FUN = I)
x
## cuts: [0,6)
## Time Velocity
## 1 0.0 0.00
## 2 1.5 1.21
## 3 3.0 1.26
## 4 4.5 1.31
## ------------------------------------------------------------------------------------------------------------
## cuts: [6,12)
## Time Velocity
## 5 6.0 1.36
## 6 7.5 1.41
## 7 9.0 1.46
## 8 10.5 1.51
## ------------------------------------------------------------------------------------------------------------
## <snip>
## ------------------------------------------------------------------------------------------------------------
## cuts: [36,42)
## Time Velocity
## 28 36 2.4323
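If you would rather have everything in one list variable (as the question asks), split() gives the same grouping as a plain named list; a small sketch using the cuts computed above:
groups <- split(data, cuts)   # named list, one data frame per 6-min bin
groups[["[0,6)"]]             # pull out a single interval by its label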

I don't think you want duplicated bounds. Here is a simple solution without using cut() (similar to @Mathew's solution):
dat <- transform(data, index = Time %/% 6)
by(dat, dat$index, FUN = I)
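For example, with the sample data the integer division assigns these bin indices (output added as comments):
head(dat)
##   Time Velocity index
## 1  0.0     0.00     0
## 2  1.5     1.21     0
## 3  3.0     1.26     0
## 4  4.5     1.31     0
## 5  6.0     1.36     1
## 6  7.5     1.41     1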

If you really need duplicate timestamps at integral multiples of 6, then you will have to do some data duplication before splitting.
txt <- "Time Velocity\n0 0\n1.5 1.21\n3 1.26\n4.5 1.31\n6 1.36\n7.5 1.41\n9 1.46\n10.5 1.51\n12 1.56\n13 1.61\n14 1.66\n15 1.71\n16 1.76\n17 1.81\n18 1.86\n19 1.91\n20 1.96\n21 2.01\n22.5 2.06\n24 2.11\n25.5 2.16\n27 2.21\n28.5 2.26\n30 2.31\n31.5 2.36\n33 2.41\n34.5 2.4223\n36 2.4323"
DF <- read.table(text = txt, header = TRUE)
# Create duplicate timestamps where the timestamp is a multiple of 6 minutes
posinc <- DF[DF$Time%%6 == 0, ]
neginc <- DF[DF$Time%%6 == 0, ]
posinc <- posinc[-1, ]
neginc <- neginc[-1, ]
# Add tiny +ve and -ve increments to these duplicated timestamps
posinc$Time <- posinc$Time + 0.01
neginc$Time <- neginc$Time - 0.01
# Bind the original data frame (minus the multiple-of-6 timestamps) with the duplicated timestamps above
DF2 <- do.call(rbind, list(DF[!DF$Time%%6 == 0, ], posinc, neginc))
# Order by timestamp
DF2 <- DF2[order(DF2$Time), ]
# Split the dataframe by quotient of timestamp divided by 6
SL <- split(DF2, DF2$Time%/%6)
# Round the timestamps of the split data back to 1 decimal place
RESULT <- lapply(SL, function(x) {
  x$Time <- round(x$Time, 1)
  return(x)
})
RESULT
## $`0`
## Time Velocity
## 2 1.5 1.21
## 3 3.0 1.26
## 4 4.5 1.31
## 51 6.0 1.36
##
## $`1`
## Time Velocity
## 5 6.0 1.36
## 6 7.5 1.41
## 7 9.0 1.46
## 8 10.5 1.51
## 91 12.0 1.56
##
## $`2`
## Time Velocity
## 9 12 1.56
## 10 13 1.61
## 11 14 1.66
## 12 15 1.71
## 13 16 1.76
## 14 17 1.81
## 151 18 1.86
##
## $`3`
## Time Velocity
## 15 18.0 1.86
## 16 19.0 1.91
## 17 20.0 1.96
## 18 21.0 2.01
## 19 22.5 2.06
## 201 24.0 2.11
##
## $`4`
## Time Velocity
## 20 24.0 2.11
## 21 25.5 2.16
## 22 27.0 2.21
## 23 28.5 2.26
## 241 30.0 2.31
##
## $`5`
## Time Velocity
## 24 30.0 2.3100
## 25 31.5 2.3600
## 26 33.0 2.4100
## 27 34.5 2.4223
## 281 36.0 2.4323
##
## $`6`
## Time Velocity
## 28 36 2.4323
##

Related

creating a data.frame from a larger data.frame based on index values

I have two small data.frames (d1&d2) below. In d1, the column post varies across the rows (length(unique(d1$post)) == 1L gives FALSE).
From d1, I wonder how to form the following data.frame (items with suffix 1 (e.g., mpre1) are ALWAYS from the control==FALSE subset of the data frame, and items with suffix 2 (e.g., mpre2) are from the control==TRUE subset):
# Desired output from `d1` (4 rows x 6 columns):
# mpre1 sdpre1 n1 mpre2 sdpre2 n2
#1 0.31 0.39 20 0.23 0.39 18 ##group=1,control=F&T,outcome=1
#2 3.54 1.21 20 3.08 1.57 18 ##group=1,control=F&T,outcome=2
#3 0.16 0.27 19 0.23 0.39 18 ##group=2,control=F&T,outcome=1
#4 2.85 1.99 19 3.08 1.57 18 ##group=2,control=F&T,outcome=2
In d2, the column post does NOT vary across the rows (length(unique(d2$post)) == 1L gives TRUE). From d2, I wonder how to form the following data.frame:
# Desired output from `d2`(4 rows x 6 columns):
# mpre1 sdpre1 n1 mpre2 sdpre2 n2
#1 81.6 10.8 73 80.50 11.20 80 ##group=1,control=F&T,outcome=1
#2 85.7 13.7 66 90.30 6.60 74 ##group=1,control=F&T,outcome=2
#3 81.4 10.9 72 80.50 11.20 80 ##group=2,control=F&T,outcome=1
#4 90.4 8.2 61 90.30 6.60 74 ##group=2,control=F&T,outcome=2
The index values (for group & outcome) to extract the above vectors from either d1 or d2 are given by (substituting d1 or d2 as appropriate):
with(subset(d1,!control),rev(expand.grid(outcome=unique(outcome),group=unique(group))))
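For reference (evaluated output I've added as comments), for d1 this gives:
#   group outcome
# 1     1       1
# 2     1       2
# 3     2       1
# 4     2       2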
I have a list of these data.frames, thus a functional BASE R answer is highly appreciated (d1 & d2 are below).
(d1 = read.csv("https://raw.githubusercontent.com/rnorouzian/m2/main/g.csv"))
# study.name group n mpre sdpre mpos sdpos post control outcome
#1 Diab_a 1 20 0.31 0.39 0.02 0.06 1 FALSE 1
#2 Diab_a 1 20 0.31 0.39 0.05 0.08 2 FALSE 1
#3 Diab_a 1 20 3.54 1.21 1.38 0.89 1 FALSE 2
#4 Diab_a 1 20 3.54 1.21 1.38 0.55 2 FALSE 2
#5 Diab_a 2 19 0.16 0.27 0.12 0.19 1 FALSE 1
#6 Diab_a 2 19 0.16 0.27 0.03 0.06 2 FALSE 1
#7 Diab_a 2 19 2.85 1.99 1.22 0.43 1 FALSE 2
#8 Diab_a 2 19 2.85 1.99 1.94 1.12 2 FALSE 2
#9 Diab_a 3 18 0.23 0.39 0.07 0.12 1 TRUE 1
#10 Diab_a 3 18 0.23 0.39 0.06 0.09 2 TRUE 1
#11 Diab_a 3 18 3.08 1.57 1.53 0.64 1 TRUE 2
#12 Diab_a 3 18 3.08 1.57 1.93 0.61 2 TRUE 2
(d2 = read.csv("https://raw.githubusercontent.com/rnorouzian/m2/main/g2.csv"))
# study.name group n mpre sdpre mpos sdpos post control outcome
#1 Dlsk_Krlr 1 73 81.6 10.8 83.1 11.1 1 FALSE 1
#2 Dlsk_Krlr 1 66 85.7 13.7 88.8 10.5 1 FALSE 2
#3 Dlsk_Krlr 2 72 81.4 10.9 85.0 8.1 1 FALSE 1
#4 Dlsk_Krlr 2 61 90.4 8.2 91.2 7.6 1 FALSE 2
#5 Dlsk_Krlr 3 80 80.5 11.2 80.8 10.7 1 TRUE 1
#6 Dlsk_Krlr 3 74 90.3 6.6 89.6 6.3 1 TRUE 2
Do you want this?
library(tidyverse)
d1 %>%
  select(n, mpre, sdpre, control, outcome, post) %>%
  unique() %>%
  mutate(control = control + 1) %>%
  pivot_wider(values_from = c(mpre, sdpre, n), names_from = control,
              names_glue = '{.value}{control}', values_fn = list) %>%
  mutate(across(ends_with('1') | ends_with('2'),
                ~ ifelse(post == 1, map_dbl(., first), map_dbl(., last)))) %>%
  arrange(post) %>%
  select(ends_with('1'), ends_with('2'))
# A tibble: 4 x 6
mpre1 sdpre1 n1 mpre2 sdpre2 n2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.31 0.39 20 0.23 0.39 18
2 3.54 1.21 20 3.08 1.57 18
3 0.16 0.27 19 0.23 0.39 18
4 2.85 1.99 19 3.08 1.57 18
For d2:
d2 %>%
  select(n, mpre, sdpre, control, outcome, post) %>%
  mutate(control = control + 1) %>%
  pivot_wider(values_from = c(mpre, sdpre, n), names_from = control,
              names_glue = '{.value}{control}', values_fn = list) %>%
  unnest(everything()) %>%
  select(ends_with('1'), ends_with('2'))
# A tibble: 4 x 6
mpre1 sdpre1 n1 mpre2 sdpre2 n2
<dbl> <dbl> <int> <dbl> <dbl> <int>
1 81.6 10.8 73 80.5 11.2 80
2 81.4 10.9 72 80.5 11.2 80
3 85.7 13.7 66 90.3 6.6 74
4 90.4 8.2 61 90.3 6.6 74
If the strategy adopted for d2 is close to what you expect, you can do the same for d1:
d1 %>%
  select(n, mpre, sdpre, control, outcome, post) %>%
  mutate(control = control + 1) %>%
  pivot_wider(values_from = c(mpre, sdpre, n), names_from = control,
              names_glue = '{.value}{control}', values_fn = list) %>%
  unnest(everything()) %>%
  select(ends_with('1'), ends_with('2')) %>%
  unique()
# A tibble: 4 x 6
mpre1 sdpre1 n1 mpre2 sdpre2 n2
<dbl> <dbl> <int> <dbl> <dbl> <int>
1 0.31 0.39 20 0.23 0.39 18
2 0.16 0.27 19 0.23 0.39 18
3 3.54 1.21 20 3.08 1.57 18
4 2.85 1.99 19 3.08 1.57 18
I wrote the code in full BASE R and checked it on the d1 and d2 data frames.
Then, to check it on another data frame, I created an imaginary dataset (d3) with 6 groups and 4 outcomes.
I also ran the tidyverse answer on it; the results are the same as my code's (albeit with a different row order, because I wanted to match the desired output order). The difference is that this code is written completely in BASE R, as the question required.
d1 <- read.csv("https://raw.githubusercontent.com/rnorouzian/m2/main/g.csv")
d2 <- read.csv("https://raw.githubusercontent.com/rnorouzian/m2/main/g2.csv")
d1_index <- rev(expand.grid(outcome = unique(d1$outcome), group =unique(d1$group)))
d2_index <- rev(expand.grid(outcome = unique(d2$outcome), group =unique(d2$group)))
df_fin <- function(index, data){
  sd <- data.frame()
  outcome_no <- range(data$outcome)
  # collect the relevant rows for every group/outcome pair
  for (i in unique(data$group)) {
    for (j in seq_len(outcome_no[2])){
      couple <- subset(index, group == i & outcome == j)
      subsetted_df <- subset(data, group == couple[1, 1] &
                               (outcome == couple[1, 2] | outcome == couple[2, 2]) &
                               post == 1)
      if (TRUE %in% subsetted_df$control){
        subsetted_df <- subset(data, group == couple[1, 1] &
                                 (outcome == couple[1, 2] | outcome == couple[2, 2]))
        subsetted_df <- subsetted_df[order(subsetted_df$post), ]
      }
      sd <- rbind(sd, subsetted_df)
    }
  }
  # split into the control == FALSE and control == TRUE halves
  G1 <- subset(sd, control == FALSE)
  G2 <- subset(sd, control == TRUE)
  G2 <- G2[order(G2$post), ]
  G1_r <- G1[, c("mpre", "sdpre", "n")]
  G2_r <- G2[, c("mpre", "sdpre", "n")]
  colnames(G1_r) <- c("mpre1", "sdpre1", "n1")
  colnames(G2_r) <- c("mpre2", "sdpre2", "n2")
  cbind(G1_r, G2_r)
}
# Check the function (base R) on d1 and d2
# d1
df_fin(d1_index, d1)
## mpre1 sdpre1 n1 mpre2 sdpre2 n2
## 1 0.31 0.39 20 0.23 0.39 18
## 3 3.54 1.21 20 3.08 1.57 18
## 5 0.16 0.27 19 0.23 0.39 18
## 7 2.85 1.99 19 3.08 1.57 18
#d2
df_fin(d2_index, d2)
## mpre1 sdpre1 n1 mpre2 sdpre2 n2
## 1 81.6 10.8 73 80.5 11.2 80
## 2 85.7 13.7 66 90.3 6.6 74
## 3 81.4 10.9 72 80.5 11.2 80
## 4 90.4 8.2 61 90.3 6.6 74
# Creating an example data frame d3 to check the code's functionality on another dataset
d3 <- d1
d3 <- rbind(d3, d1)
d3[13:24, 'group'] <- rep(c(4,5,6), each = 4)
d3[13:24, 'outcome'] <- rep(c(3,4), each = 2)
d3[13:24, 'mpre'] <- rep(c(111,999), each = 2)
d3_index <- rev(expand.grid(outcome = unique(d3$outcome), group =unique(d3$group)))
#Check on d3
df_fin(d3_index, d3)
## mpre1 sdpre1 n1 mpre2 sdpre2 n2
## 1 0.31 0.39 20 0.23 0.39 18
## 3 3.54 1.21 20 3.08 1.57 18
## 5 0.16 0.27 19 111.00 0.39 18
## 7 2.85 1.99 19 999.00 1.57 18
## 13 111.00 0.39 20 0.23 0.39 18
## 15 999.00 1.21 20 3.08 1.57 18
## 17 111.00 0.27 19 111.00 0.39 18
## 19 999.00 1.99 19 999.00 1.57 18

R: Error making time series stationary using diff()

I have a dataset of monthly observations recorded on the 1st of every month over a period of time.
Here's the class and head() of my Date column:
> class(gas_data$Date)
[1] "Date"
> head(gas_data$Date)
[1] "2010-10-01" "2010-11-01" "2010-12-01" "2011-01-01" "2011-02-01" "2011-03-01"
Trying to make the time series stationary, I'm using diff() from the base package:
> gas_data_diff <- diff(gas_data, differences = 1, lag = 12)
> head(gas_data_diff)
data frame with 0 columns and 6 rows
> names(gas_data_diff)
character(0)
> gas_data_diff %>%
+ ggplot(aes(x=Date, y=Price.Gas)) +
+ geom_line(color="darkorchid4")
Error in FUN(X[[i]], ...) : object 'Price.Gas' not found
As you can see, I get an error, and when I inspect the data with head() or look up the column names I get unexpected output.
Why am I getting this error and how could I fix it?
Here's head() of my original data
> head(gas_data)
Date Per.Change Domestic.Production.from.UKCS Import Per.GDP.Growth Average.Temperature Price.Electricity Price.Gas
1 2010-10-01 2.08 3.54 5.40 0.2 10.44 43.50 46.00
2 2010-11-01 -3.04 3.46 6.74 -0.1 5.52 46.40 49.66
3 2010-12-01 0.31 3.54 9.00 -0.9 0.63 58.03 62.26
4 2011-01-01 2.65 3.59 7.58 0.6 4.05 48.43 55.98
5 2011-02-01 1.52 3.20 5.68 0.4 6.29 46.47 53.74
6 2011-03-01 -1.38 3.40 5.93 0.5 6.59 51.41 60.39
For instance, this is what the non-stationary plot of the original gas prices looks like (plot image omitted).
Explanation
As per the help page for diff, the argument x must be
x: a numeric vector or matrix containing the values to be differenced.
A data frame is neither: diff treats it as a generic vector whose length() is its number of columns, and because lag * differences = 12 is at least the 8 columns here, it returns x[0L], i.e. an empty zero-column data frame.
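A quick way to see this (a small check I've added, assuming gas_data as shown above):
length(gas_data)   # 8: a data frame's length() is its column count
gas_data[0L]       # what diff() returns here: 0 columns, all rows kept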
Best approach IMO
I don't see much point in using the date column in diff, so I would take the following approach:
# keep the dates as rownames, convert to a matrix and apply diff
rownames(df) <- df$Date
diff_mat <- diff(as.matrix(df[, -1]), lag = 1)
# convert back to a data frame and restore the Date column
# (diff keeps the rownames of the surviving rows)
diff_df <- as.data.frame(diff_mat)
diff_df$Date <- as.Date(rownames(diff_df))
# now the plot function should work
Using the given data to address the comments: convert the Date vector to numeric, then pass the whole data frame to diff as a matrix.
df$Date <- as.numeric(df$Date)
diff(as.matrix(df), lag = 1, differences = 2)
# Date Per.Change Domestic.Production.from.UKCS Import Per.GDP.Growth
# [1,] -1 8.47 0.16 0.92 -0.5
# [2,] 1 -1.01 -0.03 -3.68 2.3
# [3,] 0 -3.47 -0.44 -0.48 -1.7
# [4,] -3 -1.77 0.59 2.15 0.3
# Average.Temperature Price.Electricity Price.Gas
# [1,] 0.03 8.73 8.94
# [2,] 8.31 -21.23 -18.88
# [3,] -1.18 7.64 4.04
# [4,] -1.94 6.90 8.89
Creating the Data
df <- read.table(text = "Date Per.Change Domestic.Production.from.UKCS Import Per.GDP.Growth Average.Temperature Price.Electricity Price.Gas
2010-10-01 2.08 3.54 5.40 0.2 10.44 43.50 46.00
2010-11-01 -3.04 3.46 6.74 -0.1 5.52 46.40 49.66
2010-12-01 0.31 3.54 9.00 -0.9 0.63 58.03 62.26
2011-01-01 2.65 3.59 7.58 0.6 4.05 48.43 55.98
2011-02-01 1.52 3.20 5.68 0.4 6.29 46.47 53.74
2011-03-01 -1.38 3.40 5.93 0.5 6.59 51.41 60.39",
header = T, sep ="")
df$Date <- as.Date(df$Date)
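Applying the same idea to the seasonal lag-12 difference from the question (a sketch, assuming gas_data has the columns shown above and that ggplot2 is loaded as in the question):
rownames(gas_data) <- gas_data$Date
gas_data_diff <- as.data.frame(diff(as.matrix(gas_data[, -1]), lag = 12))
gas_data_diff$Date <- as.Date(rownames(gas_data_diff))  # the first 12 rows are dropped
ggplot(gas_data_diff, aes(x = Date, y = Price.Gas)) +
  geom_line(color = "darkorchid4")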

How do I remove NA from a data frame with the intention of using sapply on the data frame [duplicate]

This question already has answers here:
calculate the mean for each column of a matrix in R
(10 answers)
Closed 3 years ago.
I have a data frame:
colA colB
1 15.3 1.76
2 10.8 1.34
3 8.1 1.27
4 19.5 1.47
5 7.2 1.27
6 5.3 1.49
7 9.3 1.31
8 11.1 1.09
9 7.5 1.18
10 12.2 1.22
11 6.7 1.25
12 5.2 1.19
13 19.0 1.95
14 15.1 1.28
15 6.7 1.52
16 8.6 NA
17 4.2 1.12
18 10.3 1.37
19 12.5 1.19
20 16.1 1.05
21 13.3 1.32
22 4.9 1.03
23 8.8 1.12
24 9.5 1.70
How would I be able to remove/change the value of all NAs such that when I use sapply (i.e. sapply(x, mean)), I am taking the mean of 24 rows in the case of colA and 23 rows for colB?
I understand that all columns of a data frame must have the same number of rows, so something like na.omit() would not work: it would remove row 16 in this case, and I'd lose a row of data when calculating the mean for colA.
Thanks!
You should be able to pass na.rm = TRUE and get the mean.
Example:
df <- data.frame(A = 1:3, B = c(NA, 1, 2))
apply(df, 2, mean, na.rm = TRUE)
# A B
# 2.0 1.5
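Since the question mentions sapply: it forwards extra arguments the same way, so this also works (and colMeans(df, na.rm = TRUE) is another base option):
sapply(df, mean, na.rm = TRUE)
#   A   B
# 2.0 1.5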

Calculate means from different data frames

My goal is to calculate a final data frame containing the means from several different data frames. Given data like this:
A <- c(1,2,3,4,5,6,7,8,9)
B <- c(2,2,2,3,4,5,6,7,8)
C <- c(1,1,1,1,1,1,2,2,1)
D <- c(5,5,5,5,6,6,6,7,7)
E <- c(4,4,3,5,6,7,8,9,7)
DF1 <- data.frame(A,B,C)
DF2 <- data.frame(E,D,C)
DF3 <- data.frame(A,C,E)
DF4 <- data.frame(A,D,E)
I'd like to calculate means for all three columns (per row) in each data frame. To do this I put together a for loop:
All <- data.frame(matrix(ncol = 3, nrow = 9))
for (i in seq_len(ncol(DF1))) {
  All[, i] <- mean(c(DF1[, i], DF2[, i], DF3[, i], DF4[, i]))
}
X1 X2 X3
1 5.222222 4.277778 3.555556
2 5.222222 4.277778 3.555556
3 5.222222 4.277778 3.555556
4 5.222222 4.277778 3.555556
5 5.222222 4.277778 3.555556
6 5.222222 4.277778 3.555556
7 5.222222 4.277778 3.555556
8 5.222222 4.277778 3.555556
9 5.222222 4.277778 3.555556
But the end result was that I calculated entire column means (as opposed to a mean for each individual row).
For example, the first row and first column of each of the 4 data frames is 1, 4, 1, 1. So I would expect the first row and column of the final data frame to be 1.75 (mean(c(1, 4, 1, 1))).
We place the datasets in a list, get the sum (+) of the corresponding elements using Reduce, and divide by the number of datasets:
Reduce(`+`, mget(paste0("DF", 1:4)))/4
# A B C
#1 1.75 3.25 2.5
#2 2.50 3.25 2.5
#3 3.00 3.25 2.0
#4 4.25 3.50 3.0
#5 5.25 4.25 3.5
#6 6.25 4.50 4.0
#7 7.25 5.00 5.0
#8 8.25 5.75 5.5
#9 8.50 5.75 4.0
NOTE: This should be faster than any apply-based solution, and the output is a data.frame like the original datasets.
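If the data frames are not named in a pattern that mget can match, the same idea works on an explicit list (a small sketch):
dfs <- list(DF1, DF2, DF3, DF4)
Reduce(`+`, dfs) / length(dfs)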
If we want the tidyverse, then another option is
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
mget(paste0("DF", 1:4)) %>%
map(rownames_to_column, "rn") %>%
map(setNames, c("rn", LETTERS[1:3])) %>%
bind_rows() %>%
group_by(rn) %>%
summarise_each(funs(mean))
# A tibble: 9 × 4
# rn A B C
# <chr> <dbl> <dbl> <dbl>
#1 1 1.75 3.25 2.5
#2 2 2.50 3.25 2.5
#3 3 3.00 3.25 2.0
#4 4 4.25 3.50 3.0
#5 5 5.25 4.25 3.5
#6 6 6.25 4.50 4.0
#7 7 7.25 5.00 5.0
#8 8 8.25 5.75 5.5
#9 9 8.50 5.75 4.0
Since what you're describing is effectively an array, you can actually make it one with abind::abind, which makes the operation pretty simple:
apply(abind::abind(DF1, DF2, DF3, DF4, along = 3), 1:2, mean)
## A D E
## [1,] 1.75 3.25 2.5
## [2,] 2.50 3.25 2.5
## [3,] 3.00 3.25 2.0
## [4,] 4.25 3.50 3.0
## [5,] 5.25 4.25 3.5
## [6,] 6.25 4.50 4.0
## [7,] 7.25 5.00 5.0
## [8,] 8.25 5.75 5.5
## [9,] 8.50 5.75 4.0
The column names are meaningless, and the result is a matrix, not a data.frame, but even if you wrap it in data.frame, it's still very fast.
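For instance, to get a data frame back with one consistent set of names (a small sketch; I reuse DF1's names, an arbitrary choice since the names differ across the inputs):
res <- apply(abind::abind(DF1, DF2, DF3, DF4, along = 3), 1:2, mean)
setNames(as.data.frame(res), names(DF1))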
A combination of tidyverse and base:
#install.packages('tidyverse')
library(tidyverse)
transpose(list(DF1, DF2, DF3, DF4)) %>%
  map(function(x)
    rowMeans(do.call(rbind.data.frame, transpose(x)))) %>%
  bind_cols()
Should yield:
# A B C
# <dbl> <dbl> <dbl>
# 1 1.75 3.25 2.5
# 2 2.50 3.25 2.5
# 3 3.00 3.25 2.0
# 4 4.25 3.50 3.0
# 5 5.25 4.25 3.5
# 6 6.25 4.50 4.0
# 7 7.25 5.00 5.0
# 8 8.25 5.75 5.5
# 9 8.50 5.75 4.0

R: returning row value when certain number of columns reach certain value

Given the following table, I need to return, for each row, the share of columns that reach a certain value:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 3.93 3.92 3.74 4.84 4.55 4.67 3.99 4.10 4.86 4.06
2 4.00 3.99 3.81 4.90 4.61 4.74 4.04 4.15 4.92 4.11
3 4.67 4.06 3.88 5.01 4.66 4.80 4.09 4.20 4.98 4.16
4 4.73 4.12 3.96 5.03 4.72 4.85 4.14 4.25 5.04 4.21
5 4.79 4.21 4.04 5.09 4.77 4.91 4.18 4.30 5.10 4.26
6 4.86 4.29 4.12 5.15 4.82 4.96 4.23 4.35 5.15 4.30
7 4.92 4.37 4.19 5.21 4.87 5.01 4.27 4.39 5.20 4.35
8 4.98 4.43 4.25 5.26 4.91 5.12 4.31 4.43 5.25 4.38
9 5.04 4.49 4.31 5.30 4.95 5.15 4.34 4.46 5.29 4.41
10 5.04 4.50 4.49 5.31 5.01 5.17 4.50 4.60 5.30 4.45
11 ...
12 ...
As an output, I need a data frame containing the percentage of columns V1-V10 that reach the value of interest (5 in this example):
Rownum Percent
1 0
2 0
3 10
4 20
5 20
6 20
7 33
8 33
9 40
10 50
Many thanks!
If your matrix is mat:
cbind(1:nrow(mat), rowSums(mat > 5) / ncol(mat) * 100)
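Or, wrapped into a data frame with the column names from the question (a small sketch, still assuming the data is in a matrix mat):
data.frame(Rownum = seq_len(nrow(mat)),
           Percent = rowSums(mat > 5) / ncol(mat) * 100)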
As long as it's always 0s and 1s with ten columns, I would multiply the whole dataset by 10 (which equals percentage values in this case). Just use the following code:
# Sample data
set.seed(10)
data <- as.data.frame(do.call("rbind", lapply(seq(9), function(...) {
  sample(c(0, 1), 10, replace = TRUE)
})))
rownames(data) <- c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yza")
# Percentages
rowSums(data * 10)
# abc def ghi jkl mno pqr stu vwx yza
# 80 40 80 60 60 10 30 50 50
OK, so now I believe you want the percentage of values in each row that meet some threshold criterion; you give the example > 5. One of many solutions is apply:
apply(df, 1, function(x) sum(x > 5) / length(x) * 100)
# 1 2 3 4 5 6 7 8 9 10
# 0 0 10 20 20 20 30 30 40 50
@Thomas' solution will be faster for large data.frames because it converts to a matrix first, and matrices are faster to operate on.
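A vectorised middle ground (a sketch I'm adding, with the same assumption that df holds the numeric data):
rowMeans(as.matrix(df) > 5) * 100   # percentage of columns above 5, per row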
