Problem
Consider two data frames: one containing only 1s and 0s, and a second one containing data:
set.seed(20)
df<-data.frame(sample(0:1,5,T),sample(0:1,5,T),sample(0:1,5,T))
#zero_one data frame
sample.0.1..5..T. sample.0.1..5..T..1 sample.0.1..5..T..2
1 0 1 0
2 1 0 0
3 1 1 1
4 0 0 0
5 1 0 1
df1<-data.frame(append(rnorm(4),10),append(runif(4),-5),append(rexp(4),20))
#with data
append.rnorm.4...10. append.runif.4....5. append.rexp.4...20.
1 0.08609139 0.2374272 0.3341095
2 -0.63778176 0.2297862 0.7537732
3 0.22642990 0.9447793 1.3011998
4 -0.05418293 0.8448115 1.2097271
5 10.00000000 -5.0000000 20.0000000
Now I want to replace the values in the second data frame at positions where the first data frame is 0 with the mean of the values in the same column at positions where the first data frame is 1.
Example
In the first column I want to replace 0.08609139 and -0.05418293 (values for which the first column of the first data frame is 0) with mean(c(-0.63778176, 0.22642990, 10.00000000)) (values for which the first column of the first data frame is 1).
I want to do it using the mutate_all() function from the dplyr package.
My work so far
df1<-df1 %>% mutate_all(
function(x) ifelse(df[x]==0, mean(x[df==1],na.rm=T,x)))
I know that the condition df[x] is meaningless, but I have no idea what I should put there. Could you please help me with that?
You could follow @deschen's suggestion and multiply the two data frames together.
Here is another approach to consider, using mapply. For each column, identify the positions (indices) in df where the value is zero.
Then replace the values at those positions in the corresponding df1 column with the mean of the other values in that column. y[-idx] is all values in the df1 column excluding those positions.
Note that my set.seed is different: when I used yours of 20 I got different values, and a column with all zeroes. Please let me know if you are able to reproduce.
set.seed(12)
df<-data.frame(sample(0:1,5,T),sample(0:1,5,T),sample(0:1,5,T))
df1<-data.frame(append(rnorm(4),10),append(runif(4),-5),append(rexp(4),20))
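my_fun <- function(x, y) {
# x: a 0/1 column of df, y: the matching column of df1
idx <- which(x == 0)  # positions where the indicator is zero
y[idx] <- mean(y[-idx])  # overwrite them with the mean of the remaining values
return(y)
}
mapply(my_fun, df, df1)  # pairs the columns of df and df1; returns a matrix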
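If you want the result to stay a data frame, the same idea can be sketched with Map and a logical mask instead of which(). The mask also sidesteps the edge case where a column has no zeros, in which case y[-idx] would select nothing. The helper name replace_with_mean and the toy data frames mask_df and data_df are made up for illustration; they are not from the question.

```r
# Hypothetical helper: where `mask` is 0, replace the entry of `values`
# with the mean of the entries where `mask` is 1.
replace_with_mean <- function(mask, values) {
  keep <- mask == 1
  values[!keep] <- mean(values[keep], na.rm = TRUE)
  values
}

# Toy inputs (assumed, chosen so the means are easy to check by hand)
mask_df <- data.frame(a = c(0, 1, 1, 0, 1), b = c(1, 0, 1, 0, 0))
data_df <- data.frame(a = c(1, 2, 4, 8, 6), b = c(10, 0, 20, 0, 0))

# Map pairs the columns by position; assigning into result[] keeps the
# data frame structure and column names of data_df.
result <- data_df
result[] <- Map(replace_with_mean, mask_df, data_df)
result
```

Both data frames are assumed to have the same number of columns in the same order, since the columns are matched by position and not by name.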
Related
I need to loop through a large number of data frames whose column names will vary slightly. For each data frame I need to keep the rows in which a set of dynamically named columns all equal 0. How can I use the filter function with a list of dynamic column names?
Abbreviated example:
data <- data frame with column names that include: "pfall_met", "cfall_met", "fall_met", "spring_met", "csprin_met", "pspring_met" or any combination of these names ending with "met"
attempt:
mets<-c(names(data)[grep("met",names(data))]) #to list out the column names that end with "met" for that data frame
data_filt<-data%>%
filter( paste0(mets) == 0 ) #to filter the rows where all the column names in data from the "mets" list equal 0
If there is syntax that can work like ends_with() in the select function that would be great too:
data_filt<-data%>%
filter( ends_with("mets") == 0 )
I also tried filter((!!sym(mets))==0), which yields the following error:
Error in sym():
! Can't convert a character vector to a symbol.
Run rlang::last_error() to see where the error occurred.
Thank you in advance.
Assuming your data looks like the example below and you want to keep the columns whose values sum to 0, here I used select instead of filter as an alternative approach.
For your purpose you can use the code below with ends_with('met') instead of starts_with.
x1 x2 x3
1 1 3 0
2 2 4 0
3 3 5 0
4 4 6 0
Try the code below
library(dplyr)
data <- data.frame(x1=c(1,2,3,4), x2=c(3,4,5,6), x3=c(0,0,0,0)) %>%
select(!starts_with('x'), where(~sum(.x)==0))
output
x3
1 0
2 0
3 0
4 0
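If the goal is to filter rows (as in the question) and not to select columns, one possible base R sketch is below. The toy data frame and its id column are assumptions for illustration. In dplyr the same idea is usually written as filter(if_all(ends_with("met"), ~ .x == 0)).

```r
# Toy data; the column names are assumed for illustration only
data <- data.frame(
  id         = 1:4,
  fall_met   = c(0, 1, 0, 0),
  spring_met = c(0, 0, 2, 0)
)

# Names of all columns ending in "met"
mets <- grep("met$", names(data), value = TRUE)

# A row is kept when none of its "met" columns differs from 0
all_zero  <- rowSums(data[mets] != 0) == 0
data_filt <- data[all_zero, , drop = FALSE]
data_filt
```

Note that rows with NA in any "met" column produce NA in all_zero, so they would need separate handling.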
I have the following matrix:
x=matrix(c(1,2,2,1,10,10,20,21,30,31,40,
1,3,2,3,10,11,20,20,32,31,40,
0,1,0,1,0,1,0,1,1,0,0),11,3)
I would like to find, for each unique value in the first column of x, the maximum of the third column across all rows having that value in the first column.
I have created the following code:
v1 <- sequence(rle(x[,1])$lengths)
A=split(seq_along(v1), cumsum(v1==1))
A_diff=rep(0,length(split(seq_along(v1), cumsum(v1==1))))
for( i in 1:length(split(seq_along(v1), cumsum(v1==1))) )
{
A_diff[i]=max(x[split(seq_along(v1), cumsum(v1==1))[[i]],3])
}
However, this code works only when equal elements are consecutive in the first column (because I use rle), and it relies on a for loop.
So, how can I make it work in general and without the for loop, that is, using a function?
If I understand correctly:
> tapply(x[,3],x[,1],max)
1 2 10 20 21 30 31 40
1 1 1 0 1 1 0 0
For grouping by more than one variable I would use aggregate. Note that matrices are cumbersome for this purpose, so I would suggest converting it to a data frame; nonetheless:
> aggregate(x[,3],list(x[,1],x[,2]),max)
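As a sketch of the data frame route mentioned above, the matrix can be converted first and then aggregated with the formula interface. The column names g1, g2 and flag are made up for illustration; the matrix is the one from the question.

```r
x <- matrix(c(1,2,2,1,10,10,20,21,30,31,40,
              1,3,2,3,10,11,20,20,32,31,40,
              0,1,0,1,0,1,0,1,1,0,0), 11, 3)

# Convert to a data frame and give the columns (assumed) names
xdf <- as.data.frame(x)
names(xdf) <- c("g1", "g2", "flag")

# Maximum of flag per value of g1
by_one <- aggregate(flag ~ g1, data = xdf, FUN = max)

# Maximum of flag per combination of g1 and g2
by_two <- aggregate(flag ~ g1 + g2, data = xdf, FUN = max)

by_one
```

Neither call requires equal values to be consecutive, and neither uses an explicit loop.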
I am fairly new to R and can't find a concise solution to a problem.
I have a data frame in R called df that looks like the one below. It contains a column called value that holds numbers from 0 to 1 in ascending order and a binary column called flag that contains either 0 or 1.
df
value flag
0.033 0
0.139 0
0.452 1
0.532 0
0.687 1
0.993 1
I wish to split this data frame into X groups by binning the value column over the range 0 to 1. For example, for a split into 4 groups, the data would be split into 0-0.25, 0.25-0.5, 0.5-0.75 and 0.75-1. Each group would also keep the flag that belongs to each value.
THE ANSWER SHOULD ONLY USE DATAFRAME FORMAT AS INPUT, ONLY THE COLUMNS STATED IN THIS QUESTION AND ONLY PACKAGES FROM TIDYVERSE, CARET OR DATA.TABLE.
I want the solution to be scalable, so that I can split the data into more groups if I wish.
Does anyone have a solution for this? Thanks
I can't see how the answers you got earlier are not scalable; they use native R with no packages. If n is the number of partitions you want:
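n = 4
L = seq(1,n)/n  # upper bound of each bin
GroupedList = lapply(L,function(x){
# keep rows in (x - 1/n, x]; <= keeps values that fall exactly on a bin edge
df[(df$value <= x) & (df$value > (x-(1/n))),]
})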
Perhaps the cut() function can help you? It divides the range of a numeric column into a chosen number of intervals:
n <- 4
breaks <- seq(0, 1, by = 1/n)
df$group <- cut(df$value, breaks = breaks)
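To get one data frame per bin, the factor produced by cut() can be passed to split(). This is a minimal sketch using the data from the question; include.lowest = TRUE is my addition so that a value of exactly 0 falls into the first bin and does not become NA.

```r
df <- data.frame(
  value = c(0.033, 0.139, 0.452, 0.532, 0.687, 0.993),
  flag  = c(0, 0, 1, 0, 1, 1)
)

n <- 4                                  # number of bins; change to rescale
breaks <- seq(0, 1, length.out = n + 1) # n + 1 edges give n intervals

# Right-closed intervals; include.lowest closes the first one on the left too
df$group <- cut(df$value, breaks = breaks, include.lowest = TRUE)

# One data frame per interval; empty intervals give zero-row data frames
groups <- split(df, df$group)
sapply(groups, nrow)
```

Changing n is the only edit needed to get a different number of groups.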
Assume I have an original dataset containing a complete set of "texts" (a string variable), and a second dataset that contains only those "texts" for which the new variable "value" takes a certain value (0, 1, or NA).
Now I would like to merge them back together so that the resulting dataset contains the full range of "texts" from the first dataset but also includes "value", which should be 0 if it is coded 0 or if the text is only present in the original dataset.
dat1<-data.frame(text=c("a","b","c","d","e","f","g","h")) # original dataset
dat2<-data.frame(text=c("e","f","g","h"), value=c(0,NA,1,1)) # second version
The final dataset should look like this:
> dat3
text value
1 a 0
2 b 0
3 c 0
4 d 0
5 e 0
6 f NA
7 g 1
8 h 1
However, base R's merge() introduces NAs where I want 0s instead:
dat3<-merge(dat1, dat2, by=c("text"), all=T)
Is there a way to define a default input for when the variable by which datasets are merged is only present in one but not the other dataset? In other words, how can I define 0 as standard input value instead of NA?
I am aware of the fact that I could temporarily change the coded NAs in the second dataset to something else to distinguish later on between "real" NAs and NAs that just get introduced, but I would really like to refrain from doing so, if there's another, cleaner way. Ideally, I would like to use merge() or plyr::join() for that purpose but couldn't find anything in the manual(s).
I know this is not ideal either, but it is something to consider:
library(dplyr)
dat3 <- dplyr::left_join(dat1,dat2,by = "text")
# set value to 0 only for texts that are absent from dat2; NAs coded in dat2 stay NA
dat3[!(dat3$text %in% dat2$text),"value"] = 0
Or wrap it in a function so that the call is a one-liner:
merge_NA <- function(dat1,dat2){
dat3 <- dplyr::left_join(dat1,dat2,by = "text")
# texts absent from dat2 get 0; NAs coded in dat2 are kept
dat3[!(dat3$text %in% dat2$text),"value"] = 0
return(dat3)
}
Now you only need to call:
merge_NA(dat1,dat2)
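Since the question prefers merge(), here is a sketch of the same idea in base R only. As far as I know merge() has no argument for a default fill value, so the zeros are set in a second step; the membership test with %in% is what separates "real" NAs from the ones the merge introduces.

```r
dat1 <- data.frame(text = c("a","b","c","d","e","f","g","h"))  # original dataset
dat2 <- data.frame(text = c("e","f","g","h"), value = c(0, NA, 1, 1))

# Keep every row of dat1; unmatched texts get NA in value for now
dat3 <- merge(dat1, dat2, by = "text", all.x = TRUE)

# Only rows whose text never appeared in dat2 are set to 0,
# so the NA that was coded for "f" is left untouched
dat3$value[!(dat3$text %in% dat2$text)] <- 0
dat3
```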
I am desperate enough that I am even ready to lose some more rep points, but I have to ask.
(Yes, I read some threads about it.)
I created a data frame with only the 2 columns I want to put into the matrix (I didn't know how to pick just 2 columns from the whole data):
tbl_corel <- tbl_end[,c("diff", "abund_mean")]
In the next step I created an empty matrix:
## Creating an empty matrix to check the correlation between diff and abund_mean
mat_corel <- matrix(0, ncol = 2)
colnames(mat_corel) <- c("diff", "abund_mean")
I tried to use this call to fill the matrix with the data:
mat_corel <- matrix(tbl_corel), nrow = 676,ncol = 2)
Of course I had to check manually how many rows I have in my data frame...
It doesn't work.
I tried this as well:
mat_corel[ as.matrix(tbl_corel) ] <- 1
It doesn't work. I'd be so grateful for the help.
diff abund_mean
1 0 3444804.80
2 0 847887.02
3 0 93654.19
4 0 721692.76
5 0 382711.04
6 1 428656.66
If you want to create a matrix from your two-column data frame, there is a simpler, more direct way: just convert your data frame to a matrix:
mat_corel <- as.matrix(tbl_corel)
But if you just want to compute a correlation coefficient, you can do it directly from your data frame:
cor(tbl_end$diff, tbl_end$abund_mean)
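Both routes can be checked on the six rows shown in the question. In this sketch tbl_corel is rebuilt by hand from that printed sample, so it stands in for the real 676-row table.

```r
# The six sample rows printed in the question
tbl_corel <- data.frame(
  diff       = c(0, 0, 0, 0, 0, 1),
  abund_mean = c(3444804.80, 847887.02, 93654.19,
                 721692.76, 382711.04, 428656.66)
)

# as.matrix keeps the column names and works out the dimensions itself,
# so there is no need to count rows by hand
mat_corel <- as.matrix(tbl_corel)
dim(mat_corel)

# Correlation from the two columns directly ...
r_vec <- cor(tbl_corel$diff, tbl_corel$abund_mean)

# ... or as a 2 x 2 correlation matrix from the matrix
r_mat <- cor(mat_corel)
r_mat["diff", "abund_mean"]
```

The off-diagonal entry of r_mat is the same coefficient as r_vec.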