Dynamic mean of consecutive columns in dplyr - r

I have a data frame with a large number of numeric columns.
I'd like to dynamically calculate the mean of each pair of consecutive columns (the mean of columns 1 and 2, of columns 3 and 4, of columns 5 and 6, etc.) and either store the results in new columns or overwrite one of the two columns used in each calculation.
I tried writing a function that calculates the mean of two columns and stores it in the first of them, then looping that function over my whole data table.
However, I'm struggling with mutate: since I generate the column names dynamically (they all start with "PUISSANCE" followed by a number) via glue, the name is displayed as a string inside the mutate and is never evaluated as a column.
mean_col <- function(data, k) {
  n <- 2 * k + 1
  m <- 2 * k + 2
  varname_even <- paste("PUISSANCE", m, sep = "")
  varname_odd  <- paste("PUISSANCE", n, sep = "")
  # Here is the issue: the argument on the right is considered non-numeric,
  # since it is the "sum" of two strings...
  mutate(data, "{{varname_odd}}" := ({{varname_odd}} + {{varname_even}}) / 2)
  data
}
for (k in 0:24) {
  my_data_set <- mean_col(my_data_set, k)
}

OK guys, just to let you know that I managed to solve it myself.
I did a pivot_longer to put all the "PUISSANCExx" names in one column and their values in another.
Then I used str_extract to pull just the number xx out of the string "PUISSANCExx" and converted it to a numeric.
By dividing by 2 and subtracting 0.5, each successive pair of values becomes x and x.5, so a floor() maps both members of a pair to the same x. Then a group_by/summarise gives the mean per pair, and that's it!
library(dplyr)
library(tidyr)
library(stringr)

my_data_set %>%
  pivot_longer(starts_with("PUISSANCE"), names_to = "heure", values_to = "puissance") %>%
  mutate(time = floor(as.numeric(str_extract(heure, "\\d+")) / 2 - 0.5)) %>%
  select(-heure) %>%
  group_by(time) %>%
  summarise(power = mean(puissance))

Related

I need to find the values in one column given the three minimum values of another column

I have a data set data1 and I need to find the minimum values in one column, pph, and find out which values they correspond to in another column, state, which is column 1.
This is what I have:
data1[which.min(data1$pph), 1]
This gives me the minimum value's counterpart in the first column, but I cannot figure out how to find the three smallest values and what they correspond to.
This can be done easily using slice_min() from the dplyr library.
library(dplyr)
# making fake data
data1 <- data.frame(pph = 1:10, state = rep(c(1, 2), 5))
# find the 3 smallest pph values but only show the state column
data1 %>% slice_min(pph, n = 3) %>% select(state)
You can also assign the output of the dplyr pipeline to a variable if you need to.
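For example (min_states is a hypothetical name; pull() is swapped in for select() to get a plain vector rather than a data frame):

min_states <- data1 %>% slice_min(pph, n = 3) %>% pull(state)
min_states
# [1] 1 2 1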
We can use filter with min_rank to keep the rows holding the three smallest values
library(dplyr)
data1 %>%
  filter(min_rank(pph) <= 3)
Using base R you can try
data1[order(data1$pph)[1:3], "state"]
Using data.table you can try
library(data.table)
setDT(data1)
data1[order(pph)[1:3], "state"]  # directly pick the rows with the three smallest values
Or, similarly,
data1[order(pph), "state"][1:3]  # sort by pph first, then keep the first three states

r - Using sum and match to find the first occurrence of a high frequency

I have several data frames in wide format imported from dbf, where every column is a date and every row is an observation. For every day I have between 500 and 2000 observations, depending on the size of the geographic shape I am looking at. For reproducibility, I created two dummy data frames with the range of values I may see in my actual data frames.
Data1 <- data.frame(replicate(10, sample(0:1000, 20, rep = TRUE)))
Data <- data.frame(replicate(10, sample(0:1000, 20, rep = TRUE)))
Since I have many of these data frames, I have put them in a list so I can run functions on many at once.
filenames <- mget(ls(pattern = 'Data'))
Now my issue is that I am trying to write a function to count the number of values in each column that fall within the range 0-100. I can accomplish this with
library(plyr)
Datacount <- ldply(Data, function(x) length(which(x >= 0 & x <= 100)))
Then I need to match the first column (date) in which this count is greater than 10% of the number of observations per column. So for a data frame with 20 observations, I would want the first date where the number of cells between 0 and 100 is greater than 2. I previously accomplished this using apply (where "V1" is the column containing the counts):
Datamatch <- apply(Datacount["V1"] > 2, 2, function(x) match(TRUE, x))
My question is whether there is a way to combine these steps into one process that I can run either in a for loop over filenames or with one of the lapply family of functions.
For detail, here is an example of a single function I built to run across each row of the data frame. It gives me the column index of the last date where each row's value is <= 100. I then used lapply to loop over all data frames in my list and append the function's result to the original data frame.
icein <- function(dataframe) {
  dataframe$icein <- apply(dataframe, 1, function(x) tail(which(x <= 100), 1))
  dataframe
}
list2env(lapply(filenames, icein), envir = .GlobalEnv)
After loading all the 'Data' frames into a list, loop over the list with map, take the mean of the logical vector (between(., 0, 100)) for each column, check whether it is greater than or equal to the threshold n, unlist the resulting one-row data.frame, wrap it with which to get the position indices, and extract the first one:
library(dplyr)
library(purrr)
n <- 0.2
mget(ls(pattern = 'Data')) %>%
  map_int(~ .x %>%
            summarise_all(~ mean(between(., 0, 100)) >= n) %>%
            unlist %>%
            which %>%
            first)
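For comparison, the same idea in base R (a sketch, assuming every data frame in the list is all-numeric like the dummy data):

n <- 0.2
sapply(mget(ls(pattern = 'Data')), function(d) {
  # proportion of values in [0, 100] per column, then the first column
  # whose proportion reaches the threshold (NA if none does)
  which(colMeans(d >= 0 & d <= 100) >= n)[1]
})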

Compare multiple columns in 2 different dataframes in R

I am trying to compare multiple columns in two different dataframes in R. This has been addressed on the forum before (Compare group of two columns and return index matches R), but this is a different scenario: I am trying to check whether a column in dataframe 1 falls within the range defined by two columns in dataframe 2. Functions like match, merge, join, and intersect won't work here. I have been trying purrr::pluck but didn't get far. The dataframes are of different sizes.
Below is an example:
temp1.df <- mtcars
temp2.df <- data.frame(
  Cyl = sample(4:8, 100, replace = TRUE),
  Start = sample(1:22, 100, replace = TRUE),
  End = sample(1:22, 100, replace = TRUE)
)
temp1.df$cyl <- as.character(temp1.df$cyl)
temp2.df$Cyl <- as.character(temp2.df$Cyl)
My attempt:
temp1.df <- temp1.df %>% mutate(new_mpg = case_when(
  temp1.df$cyl %in% temp2.df$Cyl & temp2.df$Start <= temp1.df$mpg & temp2.df$End >= temp1.df$mpg ~ 1
))
Error:
Error in mutate_impl(.data, dots) :
  Column `new_mpg` must be length 32 (the number of rows) or one, not 100
Expected result:
Compare temp1.df$cyl and temp2.df$Cyl. If they match, then
check whether temp1.df$mpg is between temp2.df$Start and temp2.df$End;
if it is, create a new variable new_mpg with a value of 1.
It's hard to show the exact expected output here.
I realize I could loop over each row of temp1.df, but the original temp2.df has over 250,000 rows, so an efficient solution would be much appreciated.
Thanks
temp1.df$new_mpg <- apply(temp1.df, 1, function(x) {
  # keep only the temp2.df rows with the same cylinder count (column 2 of temp1.df)
  temp <- temp2.df[temp2.df$Cyl == x[2], ]
  # 1 if mpg (column 1) falls inside any [Start, End] interval of that subset, else 0
  ifelse(any(apply(temp, 1, function(y) {
    dplyr::between(as.numeric(x[1]), as.numeric(y[2]), as.numeric(y[3]))
  })), 1, 0)
})
Note that this makes some assumptions about the organization of your actual data. In particular, I can't call on the column names within apply, so I'm using indexes, which may very well change; you might want to rearrange your data between receiving it and calling apply, or reorganize it within the call, e.g. apply(temp1.df[, c("mpg", "cyl")], ...).
At any rate, this breaks your data set into lines, and each line is compared to the subset of the second dataset with the same Cyl count. Within this subset, it checks whether the line's mpg falls between (from dplyr) Start and End, and returns 1 if yes (or 0 if no). All these ones and zeros are then returned as a (named) vector, which can be placed into temp1.df$new_mpg.
I'm guessing there's a way to do this with rowwise, but I could never get it to work properly...
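Since temp2.df has over 250,000 rows, a non-equi join may scale better than the row-by-row apply. A sketch with data.table (not from the original answer; column names as in the example data):

library(data.table)
setDT(temp1.df)
setDT(temp2.df)
# row indices of temp1.df whose cyl matches a temp2.df row and whose
# mpg falls inside that row's [Start, End] interval
idx <- temp1.df[temp2.df, on = .(cyl = Cyl, mpg >= Start, mpg <= End),
                which = TRUE, nomatch = 0L]
temp1.df[, new_mpg := 0L]
temp1.df[unique(idx), new_mpg := 1L]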

R - Summation of data frame columns changes data type

I have a data frame of 15 columns where the first column is integer and the others are numeric. I have to generate a one-liner summary containing the sum of every column except the last one, for which I need the mean. So I am doing something like this:
summary <- c(sum(df$col1), ... mean(df$col15))
The summary then shows values with two decimal places even for the integer column (the first one). I have been trying the round function to fix this. I can understand it when different types are added, e.g. 1 + 1.0, but in this case shouldn't the summation maintain the data type?
Please let me know what I am missing.
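For what it's worth, sum() does preserve the integer type; it is c() that promotes the combined vector to the most general type present. A quick illustration:

class(sum(1:4))               # "integer" -- sum of integers stays integer
class(mean(1:4))              # "numeric" -- mean() always returns double
class(c(sum(1:4), mean(1:4))) # "numeric" -- c() coerces the integer to double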
If you are looking for a one-line summary:
lst <- c(lapply(df[-ncol(df)], sum), mean = mean(df[, ncol(df)]))
as.data.frame(lst)
#  int num1 mean
#1  10    6  2.5
The output is a data frame that preserves the classes of each vector. If you would like the output to be added to the original data frame you can replace as.data.frame(lst) with:
names(lst) <- names(df)
rbind(df, lst)
If you are trying to get the sum of all integer columns and the mean of numeric columns, go with #Frank's answer.
Data
df <- data.frame(int = 1:4, num1 = seq(1, 2, length.out = 4), num2 = seq(2, 3, length.out = 4))
Perhaps an adaptation of this?
# sum every column, then divide the last sum by the row count to turn it into a mean
apply(iris[, 1:4], 2, sum) / c(rep(1, 3), nrow(iris))
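Adapted to the 15-column setup in the question, that idea would look something like this (a sketch; note that apply() converts the data frame to a double matrix, so the integer column's sum will still print as numeric):

# sums for all but the last column; dividing the last sum by the
# row count turns it into the mean
apply(df, 2, sum) / c(rep(1, ncol(df) - 1), nrow(df))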

dplyr mutate in R - adding a new column depending on sequence of another column

I am having an issue with the mutate function in dplyr.
I am trying to add a new column called state whose value depends on changes in another column (the V column). V repeats itself in a sequence, so each run of rep(seq(100, 2100, 100), each = 96) corresponds to one dataset in my df.
Error: impossible to replicate vector of size 8064
Here is a reproducible example of my df:
df <- data.frame(
  No = (No = rep(seq(0, 95, 1), times = 84)),  # (No = ...) also defines No for the lines below
  AC = rep(rep(c(78, 110), each = 1), times = length(No) / 2),
  AR = rep(rep(c(256, 320, 384), each = 2), times = length(No) / 6),
  AM = rep(1, times = length(No)),
  DQ = rep(rep(seq(0, 15, 1), each = 6), times = 84),
  V  = rep(rep(seq(100, 2100, 100), each = 96), times = 4),
  R  = sort(replicate(6, sample(5000:6000, 96))))
labels <- rep(c("CAP-CAP", "CP-CAP", "CAP-CP", "CP-CP"), each = 2016)
I used each = 2016 here intentionally, since I know the number of rows of each dataset.
But I want to assign these labels with an automated function that detects when the dataset changes, because the total number of rows may differ for each df in my real files. For this question, think of it as a single txt file, but keep in mind there are plenty of them with different numbers of rows, all in the same format.
I use dplyr to arrange my df:
library(dplyr)
newdf <- df %>%
  mutate_each(funs(as.numeric)) %>%
  mutate(state = labels)
Is there an elegant way to do this?
If (and only if) you know the number of data sets contained in df AND the column you're keying off (here, V) is ordered in df as it is in your toy data, then this works. It's pretty clunky, and there should be a way to make it even more efficient, but it produces what I take to be the desired result:
# You'll need dplyr for the lead() part
library(dplyr)
# Make a vector with the labels for your subsets of df
labels <- c("CAP-CAP", "CP-CAP", "CAP-CP", "CP-CP")
# This line a) builds an index that marks the final row of each subset of df
# with a 1 and then b) extracts the row numbers of those 1s
endrows <- which(grepl(1, with(df, ifelse(lead(V) - V < 0, 1, 0))))
# These lines use those row numbers (and the differences between them) to tell
# rep() how many times to repeat each label
newdf$state <- c(rep(labels[1], endrows[1]),
                 rep(labels[2], endrows[2] - endrows[1]),
                 rep(labels[3], endrows[3] - endrows[2]),
                 rep(labels[4], nrow(newdf) - endrows[3]))
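To drop the hard-coded subset size entirely, one option (a sketch, not from the original answer) is to build a group index from the points where V resets and use it to pick the label:

library(dplyr)
labels <- c("CAP-CAP", "CP-CAP", "CAP-CP", "CP-CP")
newdf <- df %>%
  # each time V drops back down, a new dataset starts; cumsum() numbers the runs 1..4
  mutate(state = labels[cumsum(c(TRUE, diff(V) < 0))])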
