I have a vector a<-c(0, 0). I want to convert this to a dataframe and then remove duplicated rows (as part of a loop).
This is my code:
a<-c(0, 0)
df<-t(as.data.frame(a))
distinct(df)
This isn't working because df isn't a dataframe even though I have converted it to a dataframe in the second step. I'm not sure how to make this work when the dataframe only has one row.
Swap t and as.data.frame, like this:
library(dplyr)
a<-c(0, 0)
df<-as.data.frame(t(a))
distinct(df)
Output:
V1 V2
1 0 0
The function t returns a matrix, so applying it to a data.frame turns the result into a matrix again. The simplest solution is to change the order of the functions:
a <- c(0,0)
df <- as.data.frame(t(a))
distinct(df)
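To see why the order matters, here is a minimal base R check. It is only a sketch: the names wrong and right are illustrative, and unique() stands in for distinct() so that dplyr is not needed.

```r
a <- c(0, 0)

# Original order: t() is applied last, so the result is a matrix
wrong <- t(as.data.frame(a))

# Swapped order: as.data.frame() is applied last, so the result is a data.frame
right <- as.data.frame(t(a))

is.matrix(wrong)      # TRUE
is.data.frame(wrong)  # FALSE
is.data.frame(right)  # TRUE
dim(right)            # 1 2

# Base R counterpart of distinct() for a data.frame
unique(right)
```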
You can use the rbind.data.frame function as follows:
a <- c(0,0)
df = rbind.data.frame(a)
distinct(df)
Using dplyr and matrix:
library(dplyr)
data.frame(matrix(0, 1, 2)) %>%
distinct
#> X1 X2
#> 1 0 0
I know the answer is probably easy, but I have not been able to work it out, and I could not find it in similar questions. I need to return the IDs of matrix m whose LO values contain all elements of a vector (NoN). In the example below, I need to return IDs 1 and 3.
Example:
m<-matrix(c(1,1,1,1,2,2,34,45,4,4,4,4,4,5,6,3,3,3,3,21,22,3425,345,65,22,42,65,86,456,454,5678,5,234,22,65,21,22,786),nrow=19)
colnames(m)<-c("ID","LO")
NoN<-c(21,22)
My attempts so far are as follows:
1: m[all(m[,2] %in% NoN),1]
2: m[match(NoN, m[,2]),1]
3: subset(m, m[,2] %in% NoN)
4: m[which(m[,2] %in% NoN),1]
Any help is appreciated!
Here's a function using base R:
FOO <- function(m, NoN){
  # split matrix based on ID column
  m2 <- lapply(split(m, m[, 1]), function(x) matrix(x, ncol = 2))
  # match every element of NoN; gives a matrix of match positions (NA = no match)
  matchresult <- do.call(cbind, lapply(lapply(m2, function(x) lapply(NoN, function(y) match(y, x[,2]))), unlist))
  # return colnames (= ID) of columns with no NA
  # (is.na() keeps the matrix shape, so this also works when NoN has length 1)
  as.numeric(colnames(matchresult)[colSums(is.na(matchresult)) == 0])
}
Result of function call:
> FOO(m, NoN)
[1] 1 3
Untested except for your example, but this should be able to handle any length of NoN as well as duplicated combinations of ID and LO.
Edit: A more concise and efficient variant provided by @docendodiscimus:
FOO <- function(m, NoN){
df <- as.data.frame(m)
unique(df[as.logical(ave(df$LO, df$ID, FUN = function(x) all(NoN %in% x))),"ID"])
}
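The same per-ID check can also be sketched with tapply, working directly on the matrix. This is an alternative illustration, not the variant above; the names has_all and ids are made up for the example.

```r
m <- matrix(c(1,1,1,1,2,2,34,45,4,4,4,4,4,5,6,3,3,3,3,
              21,22,3425,345,65,22,42,65,86,456,454,5678,5,234,22,65,21,22,786),
            nrow = 19)
colnames(m) <- c("ID", "LO")
NoN <- c(21, 22)

# For each ID, test whether every element of NoN occurs among its LO values
has_all <- tapply(m[, "LO"], m[, "ID"], function(lo) all(NoN %in% lo))

# Keep the IDs for which the test is TRUE
ids <- as.numeric(names(has_all)[has_all])
ids
# [1] 1 3
```

Because all(NoN %in% lo) tests membership rather than counting matches, repeated LO values within an ID do not affect the result.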
A not so safe way using base R:
m<-matrix(c(1,1,1,1,2,2,34,45,4,4,4,4,4,5,6,3,3,3,3,21,22,3425,345,65,22,42,65,86,456,454,5678,5,234,22,65,21,22,786),nrow=19)
colnames(m)<-c("ID","LO")
NoN<-c(21,22)
IDs <- m[m[, 2] %in% NoN, 1]
IDs <- table(IDs)
IDs <- names(IDs)[IDs >= length(NoN)]
> IDs
[1] "1" "3"
But beware, this does not take duplicated values into account. If ID 1 had two LO values of 21 but no 22, it would still be returned.
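A small made-up example shows the problem and one possible base R guard: drop duplicated (ID, LO) pairs before counting. The matrix m_dup and the names unsafe and safe are hypothetical, chosen only for this sketch.

```r
# Made-up data: ID 1 has LO = 21 twice but no 22; ID 3 has both 21 and 22
m_dup <- matrix(c(1, 1, 1, 3, 3,
                  21, 21, 99, 21, 22),
                ncol = 2, dimnames = list(NULL, c("ID", "LO")))
NoN <- c(21, 22)

# Counting matches per ID, as in the approach above
hits <- m_dup[m_dup[, "LO"] %in% NoN, "ID"]
counts <- table(hits)
unsafe <- names(counts)[counts >= length(NoN)]
unsafe
# [1] "1" "3"   (ID 1 is a false positive)

# Guard: remove duplicated (ID, LO) rows first, then count
pairs <- unique(m_dup[m_dup[, "LO"] %in% NoN, , drop = FALSE])
counts_safe <- table(pairs[, "ID"])
safe <- as.numeric(names(counts_safe)[counts_safe == length(unique(NoN))])
safe
# [1] 3
```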
EDIT: A safe way using dplyr:
library(dplyr)
m <- data.frame(m)
IDs <- m %>%
  slice(which(LO %in% NoN)) %>%            # get all rows which contain values from NoN
  group_by(ID) %>%                         # group by ID
  summarise(uniques = n_distinct(LO)) %>%  # count unique values per ID
  filter(uniques == length(NoN)) %>%       # number of unique values has to be the same as the number of values in NoN
  select(ID) %>%                           # select ID columns
  unlist() %>%                             # unlist it
  as.numeric()                             # convert from named num to numeric
> IDs
[1] 1 3
Here's an alternative solution that saves matrix m as a dataframe and performs a process for each ID:
# example data
m<-matrix(c(1,1,1,1,2,2,34,45,4,4,4,4,4,5,6,3,3,3,3,21,22,3425,345,65,22,42,65,86,456,454,5678,5,234,22,65,21,22,786),nrow=19)
colnames(m)<-c("ID","LO")
NoN<-c(21,22)
library(dplyr)
data.frame(m) %>%                             # save m as dataframe
  group_by(ID) %>%                            # for each ID
  summarise(sum_flag = sum(LO %in% NoN)) %>%  # count number of LO elements in NoN
  filter(sum_flag == length(NoN)) %>%         # keep rows where this number matches the length of NoN
  pull(ID)                                    # get the corresponding IDs
# [1] 1 3
Keep in mind that this process assumes (based on your example) that elements of NoN and rows of m are unique.
I took this answer from @docendo discimus; I found it efficient and concise.
df <- as.data.frame(m)
unique(df[as.logical(ave(df$LO, df$ID, FUN = function(x) all(NoN %in% x))),"ID"])
I wish to do exactly this: Take dates from one dataframe and filter data in another dataframe - R
except without joining, as I am afraid that after I join my data the result will be too big to fit in memory, prior to the filter.
Here is sample data:
tmp_df <- data.frame(a = 1:10)
I wish to do an operation that looks like this:
lower_bound <- c(2, 4)
upper_bound <- c(2, 5)
tmp_df %>%
filter(a >= lower_bound & a <= upper_bound) # does not work as <= is vectorised inappropriately
and my desired result is:
> tmp_df[(tmp_df$a <= 2 & tmp_df$a >= 2) | (tmp_df$a <= 5 & tmp_df$a >= 4), , drop = F]
# one way to get indices to subset data frame, impractical for a long range vector
a
2 2
4 4
5 5
My problem with memory requirements (with respect to the join solution linked) arises when tmp_df has many more rows and the lower_bound and upper_bound vectors have many more entries. A dplyr solution, or a solution that can be part of a pipe, is preferred.
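The filter in the question fails because of recycling: the bounds are matched to the rows of a position by position, not tried as a set of intervals against every row. A minimal base R sketch of what the comparison actually computes (mask is an illustrative name):

```r
a <- 1:10
lower_bound <- c(2, 4)
upper_bound <- c(2, 5)

# Recycling pairs a[1] with bounds [2, 2], a[2] with [4, 5],
# a[3] with [2, 2] again, a[4] with [4, 5], and so on.
mask <- a >= lower_bound & a <= upper_bound

a[mask]
# [1] 4
```

Only 4 survives, because it is the only value that falls inside the interval it happens to be paired with.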
Maybe you could borrow the inrange function from data.table, which
checks whether each value in x is in between any of the
intervals provided in lower,upper.
Usage:
inrange(x, lower, upper, incbounds=TRUE)
library(dplyr); library(data.table)
tmp_df %>% filter(inrange(a, c(2,4), c(2,5)))
# a
#1 2
#2 4
#3 5
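If adding data.table is not an option, the same idea can be sketched in base R. The helper in_any_range below is hypothetical (it is not part of any package); it loops over the intervals and keeps a single logical vector the length of x, so no joined table is built.

```r
# Hypothetical helper: TRUE where x lies in at least one [lower[i], upper[i]]
in_any_range <- function(x, lower, upper) {
  stopifnot(length(lower) == length(upper))
  keep <- logical(length(x))   # starts as all FALSE
  for (i in seq_along(lower)) {
    keep <- keep | (x >= lower[i] & x <= upper[i])
  }
  keep
}

tmp_df <- data.frame(a = 1:10)
res <- tmp_df[in_any_range(tmp_df$a, c(2, 4), c(2, 5)), , drop = FALSE]
res$a
# [1] 2 4 5
```

Since the helper returns a plain logical vector, it should also work inside a pipe, e.g. filter(in_any_range(a, lower_bound, upper_bound)).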
If you'd like to stick with dplyr it has similar functionality provided through the between function.
# ranges I want to check between
my_ranges <- list(c(2,2), c(4,5), c(6,7))
tmp_df <- data.frame(a=1:10)
tmp_df %>%
filter(apply(bind_rows(lapply(my_ranges,
FUN=function(x, a){
data.frame(t(between(a, x[1], x[2])))
}, a)
), 2, any))
a
1 2
2 4
3 5
4 6
5 7
Just be aware that between always includes the boundaries; unlike inrange, which has the incbounds argument, this cannot be changed.
I have a dataset like this
id <- 1:12
b <- c(0,0,1,2,0,1,1,2,2,0,2,2)
c <- rep(NA,3)
d <- rep(NA,3)
df <-data.frame(id,b)
newdf <- data.frame(c,d)
I want to do some simple math: count how many 1s and 2s there are in b. But I don't want to count over the whole dataset; I want my function to count them in groups of four rows.
I want a result like this:
> newdf
one two
1 1 1
2 2 1
3 0 3
I tried this with lots of variations, but I couldn't get it to work.
afonk <- function(x) {
ifelse(x==1 | x==2, x, newdf <- (x[1]+x[2]))
}
afonk(newdf$one)
lapply(newdf, afonk)
Thanks in advance!
ismail
Fun with base R:
# counting function
countnum <- function(x,num){
sum(x == num)
}
# make list of groups of 4
df$group <- rep(1:ceiling(nrow(df)/4),each = 4)[1:nrow(df)]
dfl <- split(df$b,f = df$group)
# make data frame of counts
newdf <- data.frame(one = sapply(dfl,countnum,1),
two = sapply(dfl,countnum,2))
Edit based on comment:
# make list of groups of 4
df$group <- rep(1:ceiling(nrow(df)/4),each = 4)[1:nrow(df)]
table(subset(df, b != 0L)[c("group", "b")])
Which you prefer depends on what type of result you need. A table will work for a small visual count, and you can likely pull the data out of the table, but if it is as simple as your example, you might opt for the data.frame.
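For comparison, here is a compact base R sketch that combines both ideas: it builds the groups with integer division and restricts the table to the values 1 and 2 through factor levels, so groups with a zero count still get a row. The name counts is illustrative.

```r
b <- c(0, 0, 1, 2, 0, 1, 1, 2, 2, 0, 2, 2)

# Rows 1-4 -> group 1, rows 5-8 -> group 2, rows 9-12 -> group 3
group <- (seq_along(b) - 1) %/% 4 + 1

# Values outside the levels (here: 0) become NA and are dropped by table()
counts <- table(group, factor(b, levels = c(1, 2)))

newdf <- data.frame(one = as.integer(counts[, "1"]),
                    two = as.integer(counts[, "2"]))
newdf
#   one two
# 1   1   1
# 2   2   1
# 3   0   3
```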
We could use dcast from data.table. Create a grouping variable using %/% and then dcast from 'long' to 'wide' format.
library(data.table)
dcast(setDT(df)[,.N ,.(grp=(id-1)%/%4+1L, b)],
grp~b, value.var='N', fill =0)[,c("1", "2"), with=FALSE] # keep the counts of 1 and 2 (drop grp and the 0 column)
Or a slightly more compact version uses length as the fun.aggregate.
res <- dcast(setDT(df)[,list((id-1)%/%4+1L, b)][b!=0],
V1~b, length)[,V1:=NULL][]
res
# 1 2
#1: 1 1
#2: 2 1
#3: 0 3
If we need the column names to be 'one', 'two'
library(english)
names(res) <- as.character(english(as.numeric(names(res))))
I have a data frame that looks something like this:
x y
1 a
1 b
1 c
1 NA
1 NA
2 d
2 e
2 NA
2 NA
My desired output is a data frame that shows the number of complete cases of Y (that is, the non-NA values) for each X. So, supposing Y has 2500 complete observations for X = 1 and 557 for X = 2, I should get this simple data frame:
x y(c.cases)
1 2500
2 557
Currently my function works, but only for a single X. When I give X as a range (e.g. 30:25), I get the total over all the specified Xs instead of the number of complete observations for each X. This is an outline of my function:
complete <- function(){
files <- file.list()
dat<- c() #Creates an empty vector
Y <- c() #Empty vector that will list down the Ys
result <- c()
for(i in c(X)){
dat <- rbind(dat, read.csv(files[i]))
}
dat_subset_Y <- dat[which(dat[, 'X'] %in% x), ]
Y <- c(Y, sum(complete.cases(dat)))
result <- cbind(X, Y)
print(result)
}
There are no errors or warning messages, just wrong results when X is a range.
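Judging from the outline, a likely cause is that sum(complete.cases(dat)) runs once, after all files have been stacked into dat, so it can only produce a single total. A sketch of counting per file instead, assuming each X corresponds to one file; the list data_list is a made-up stand-in for the CSV files, and complete_by_id is a hypothetical name:

```r
# Made-up stand-in for the files: one data frame per X
data_list <- list(
  data.frame(X = 1, Y = c("a", "b", "c", NA, NA)),
  data.frame(X = 2, Y = c("d", "e", NA, NA))
)

# Count complete rows separately for each requested element
complete_by_id <- function(data_list, ids) {
  n <- vapply(ids,
              function(i) sum(complete.cases(data_list[[i]])),
              integer(1))
  data.frame(X = ids, Y = n)
}

result <- complete_by_id(data_list, 1:2)
result
#   X Y
# 1 1 3
# 2 2 2
```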
We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'x', count the non-NA elements (sum(!is.na(y))).
library(data.table)
setDT(df1)[, list(y=sum(!is.na(y))), by = x]
Or another option is table
with(df1, table(x, !is.na(y)))
There is no need for that loop.
library(dplyr)
df %>%
filter(complete.cases(.))%>%
group_by(x) %>%
summarise(sumy=length(y))
Or
df %>%
group_by(x) %>%
summarise(sumy=sum(!is.na(y)))
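For reference, the same count can be sketched in base R with tapply, using the example data from the question. The column name complete is illustrative.

```r
df1 <- data.frame(x = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
                  y = c("a", "b", "c", NA, NA, "d", "e", NA, NA))

# Sum the non-NA indicator within each value of x
counts <- tapply(!is.na(df1$y), df1$x, sum)

res <- data.frame(x = as.numeric(names(counts)),
                  complete = as.integer(counts))
res
#   x complete
# 1 1        3
# 2 2        2
```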