I need the differences between two data frames. setdiff() gives me the modified and new rows, but it returns each modified row in full, and I want only the cells that differ. How can I do this? Assume both data frames have the same columns.
Input data:
df1 <- data.frame(ID = c(1, 2, 3),
A = c(1, 2, 3),
B = c(1, 2, NA))
df2 <- data.frame(ID = c(1, 2, 3, 4),
A = c(1, 2, 3, 4),
B = c(1, 2, 3, NA))
newdata = setdiff(df2,df1) # does not give the result I expect
The result should be this data frame:
result <- data.frame(ID = c(3, 4),
A = c(NA, 4),
B = c(3, NA))
The ID column should be preserved and should always contain a value.
Summary:
The output should contain only new or modified rows from df2.
In modified rows, only the modified or new cells should be displayed.
Values in the ID column should be displayed even if they are not modified.
Would compare or compare_df help? How can I do this?
You can do this in separate steps, since you are applying different logic to different columns (ID vs. A and B); it can't be achieved in a single operation over all columns.
df1 <- data.frame(ID = c(1, 2, 3),
A = c(1, 2, 3),
B = c(1, 2, NA))
df2 <- data.frame(ID = c(1, 2, 3, 4),
A = c(1, 2, 3, 4),
B = c(1, 2, 3, NA))
newdata = setdiff(df2,df1)
newdata
ID A B
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 NA
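You can apply your logic to columns A and B, and not to ID. Matching on ID aligns each df2 row with its df1 counterpart (NA for new IDs) and avoids relying on vector recycling:
newdata$A[which(df2$A == df1$A[match(df2$ID, df1$ID)])] <- NA
newdata$B[which(df2$B == df1$B[match(df2$ID, df1$ID)])] <- NA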
newdata
ID A B
1 1 NA NA
2 2 NA NA
3 3 NA 3
4 4 4 NA
newdata[3:4,] # keep only the new or modified rows (rows 3 and 4 in this example)
There are wizards far better than me that might opine, but I see no way to do this in one pass with the ID restriction.
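For what it's worth, here is a minimal base R sketch of the same idea that does not hard-code row numbers. It assumes ID uniquely identifies a row in both data frames, and it treats two NA cells as equal; the names value_cols, same and keep are purely illustrative.

```r
df1 <- data.frame(ID = c(1, 2, 3),
                  A = c(1, 2, 3),
                  B = c(1, 2, NA))
df2 <- data.frame(ID = c(1, 2, 3, 4),
                  A = c(1, 2, 3, 4),
                  B = c(1, 2, 3, NA))

# Row of df1 that corresponds to each row of df2 (NA when the ID is new)
m <- match(df2$ID, df1$ID)
value_cols <- setdiff(names(df2), "ID")

# TRUE where a df2 cell has an unchanged counterpart in df1;
# cells of new rows are never marked as unchanged
same <- vapply(value_cols, function(col) {
  old <- df1[[col]][m]
  new <- df2[[col]]
  !is.na(m) & ((is.na(old) & is.na(new)) |
               (!is.na(old) & !is.na(new) & old == new))
}, logical(nrow(df2)))

# Blank out unchanged cells, leaving ID untouched
result <- df2
for (col in value_cols) {
  result[[col]][same[, col]] <- NA
}

# Keep new rows and rows with at least one changed cell
keep <- is.na(m) | rowSums(!same) > 0
result <- result[keep, ]
rownames(result) <- NULL
result
```

One caveat: a cell that changed from a value to NA is also shown as NA, so it cannot be told apart from an unchanged cell in the output, although its row is still kept.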
I'm trying to compute ICC values for each subject for the table below, but group_by() is not working as I think it should.
SubID Rate1 Rate2
1 1 2 5
2 1 2 4
3 1 2 5
4 2 3 4
5 2 4 1
6 2 5 1
7 2 2 2
8 3 2 5
9 3 3 5
The code I am running is as follows:
df %>%
group_by(SubID) %>%
summarise(icc = DescTools::ICC(.)$results[3, 2])
and the output:
# A tibble: 3 x 2
SubID icc
<dbl> <dbl>
1 1 -0.247
2 2 -0.247
3 3 -0.247
It seems that summarise is not being applied according to groups, but to the entire dataset. I'm not sure what is going on.
dput() output:
structure(list(SubID = c(1, 1, 1, 2, 2, 2, 2, 3, 3), Rate1 = c(2,
2, 2, 3, 4, 5, 2, 2, 3), Rate2 = c(5, 4, 5, 4, 1, 1, 2, 5, 5)), class = "data.frame", row.names = c(NA,
-9L))
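Inside summarise(), the dot (.) refers to the whole data frame passed through the pipe, not to the current group, which is why every group gets the same value. I'm not terribly familiar with library(DescTools), but here is a potential solution that uses a nest() / map() combination: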
library(DescTools)
library(tidyverse)
df <- structure(
list(SubID = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
Rate1 = c(2, 2, 2, 3, 4, 5, 2, 2, 3),
Rate2 = c(5, 4, 5, 4, 1, 1, 2, 5, 5)),
class = "data.frame", row.names = c(NA, -9L)
)
df %>%
nest(ICC3 = -SubID) %>%
mutate(ICC3 = map_dbl(ICC3, ~ ICC(.x)[["results"]] %>%
filter(type == "ICC3") %>%
pull(est)))
#> # A tibble: 3 x 2
#> SubID ICC3
#> <dbl> <dbl>
#> 1 1 2.83e-15
#> 2 2 -5.45e- 1
#> 3 3 -6.66e-16
Created on 2021-03-08 by the reprex package (v0.3.0)
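The same per-group idea can be sketched without the tidyverse: split the rating columns by SubID and apply a function to each piece. The helper by_subject below is a hypothetical name, not part of any package, and the ICC call is only run if DescTools is installed; picking the estimate with results[3, "est"] assumes ICC3 is the third row, as in the question.

```r
df <- data.frame(SubID = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
                 Rate1 = c(2, 2, 2, 3, 4, 5, 2, 2, 3),
                 Rate2 = c(5, 4, 5, 4, 1, 1, 2, 5, 5))

# Apply f to the rating columns of each subject separately
by_subject <- function(data, f) {
  ratings <- split(data[c("Rate1", "Rate2")], data$SubID)
  sapply(ratings, f)
}

# Sanity check that the function sees one group at a time
rows_per_subject <- by_subject(df, nrow)
rows_per_subject

# Per-subject ICC3 estimate, only if DescTools is available
if (requireNamespace("DescTools", quietly = TRUE)) {
  icc3 <- by_subject(df, function(d) DescTools::ICC(d)$results[3, "est"])
  print(icc3)
}
```

Checking the group sizes first is a cheap way to confirm that the function really receives one subject's rows rather than the whole table.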
I have the following data:
df <- data.frame(
x = c(1, 4, 3, 4, 4, 3),
y = c(2, 3, 4, 4, 2, 3)
)
I tried this code:
library(tidyverse)
df %>%
filter_if(~ is.numeric(.), all_vars(. %in% c('3', '4')))
x y
1 4 3
2 3 4
3 4 4
4 3 3
But, the expected result is:
x y
1 3 3
2 4 4
How can I do this?
A different approach:
library(tidyverse)
df <- data.frame(
x = c(1, 4, 3, 4, 4, 3),
y = c(2, 3, 4, 4, 2, 3),
z = letters[1:6]
)
df %>%
filter(apply(.,1,function(x) length(unique(x[grepl('[0-9]',x)]))==1))
gives:
x y z
1 4 4 d
2 3 3 f
I have added a non-numeric column to the example data, to illustrate this solution.
Not a filter_if() possibility, but essentially following that logic:
df %>%
filter(rowMeans(select_if(., is.numeric) == pmax(!!!select_if(., is.numeric))) == 1)
x y z
1 4 4 d
2 3 3 f
Sample data:
df <- data.frame(x = c(1, 4, 3, 4, 4, 3),
y = c(2, 3, 4, 4, 2, 3),
z = letters[1:6],
stringsAsFactors = FALSE)
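A base R sketch of the same filter, in case avoiding dplyr is an option. It keeps rows whose numeric columns all hold the same value and, following the question, also requires that value to be 3 or 4. The names num, all_equal and in_set are illustrative only.

```r
df <- data.frame(x = c(1, 4, 3, 4, 4, 3),
                 y = c(2, 3, 4, 4, 2, 3),
                 z = letters[1:6],
                 stringsAsFactors = FALSE)

# Numeric columns only
num <- df[sapply(df, is.numeric)]

# TRUE for rows where every numeric column equals the first one
all_equal <- rowSums(num != num[[1]]) == 0

# TRUE for rows where every numeric value is 3 or 4
in_set <- rowSums(sapply(num, function(col) !(col %in% c(3, 4)))) == 0

result <- df[all_equal & in_set, ]
result
```

The original row names (4 and 6) are kept, which makes it easy to see which input rows survived.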
I have a data frame like the following:
df <- data.frame(bee.num=c(1,1,1,2,2,3,3), plant=c("d","d","w","d","d","w","d"))
df$visits = list(1:3, 4:9, 10:11, 1:10, 11:12, 1:4,5:11)
df
bee.num plant visits
1 1 d 1, 2, 3
2 1 d 4, 5, 6, 7, 8, 9
3 1 w 10, 11
4 2 d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
5 2 d 11, 12
6 3 w 1, 2, 3, 4
7 3 d 5, 6, 7, 8, 9, 10, 11
I would like to aggregate visits by bee.num and plant with a function that concatenates the values for visit based on matching bee.num and plant values, like the one below
bee.num plant visits
1 1 d 1, 2, 3, 4, 5, 6, 7, 8, 9
2 1 w 10, 11
3 2 d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
4 3 w 1, 2, 3, 4
5 3 d 5, 6, 7, 8, 9, 10, 11
I've tried
aggregate.data.frame(df$visits, by=list(bee.num = df$bee.num, plant = df$plant), FUN=c)
and
aggregate.data.frame(df$visits, by=list(bee.num = df$bee.num, plant = df$plant), FUN=unlist)
but I always get an "arguments imply differing number of rows" error. Any help would be greatly appreciated. Thanks in advance.
The function works as expected if you pass a data frame containing the list as a column, rather than passing the list itself.
x <- aggregate.data.frame(df['visits'], list(df$bee.num, df$plant) , FUN=c)
names(x) <- c('bee.num', 'plant', 'visits')
x
## bee.num plant visits
## 1 1 d 1, 2, 3, 4, 5, 6, 7, 8, 9
## 2 2 d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
## 3 3 d 5, 6, 7, 8, 9, 10, 11
## 4 1 w 10, 11
## 5 3 w 1, 2, 3, 4
Note:
> class(df$visits)
[1] "list"
> class(df['visits'])
[1] "data.frame"
Since df['visits'] is a data frame, it would thus suffice to call the generic aggregate() above, which dispatches to aggregate.data.frame.
Note also, the error is from trying to coerce the list to a data frame. The first two lines of aggregate.data.frame are as follows:
if (!is.data.frame(x))
x <- as.data.frame(x)
Applying this to df$visits results in:
as.data.frame(df$visits)
## Error in data.frame(1:3, 4:9, 10:11, 1:10, 11:12, 1:4, 5:11, check.names = TRUE, :
## arguments imply differing number of rows: 3, 6, 2, 10, 4, 7
Only "rectangular" lists can be coerced to data.frame. All entries must be the same length.
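A small sketch of that rule on toy lists (ragged and rect are made-up names, not from the question): the ragged list fails to convert, while the rectangular one becomes a data frame with one column per list element.

```r
ragged <- list(1:3, 4:9)   # lengths 3 and 6
rect   <- list(1:3, 4:6)   # both length 3

# Capture the error message instead of stopping
ragged_result <- tryCatch(as.data.frame(ragged),
                          error = function(e) conditionMessage(e))
ragged_result

# Converts cleanly to a 3 x 2 data frame
rect_result <- as.data.frame(rect)
dim(rect_result)
```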
You can also get the output you're looking for if you unlist the list column first and make it so you have a long data.frame to start with:
visits <- unlist(df$visits, use.names=FALSE)
df <- df[rep(rownames(df), sapply(df$visits, length)), c("bee.num", "plant")]
df$visits <- visits
aggregate.data.frame(df$visits, by=list(bee.num = df$bee.num, plant = df$plant), FUN=c)
# bee.num plant x
# 1 1 d 1, 2, 3, 4, 5, 6, 7, 8, 9
# 2 2 d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
# 3 3 d 5, 6, 7, 8, 9, 10, 11
# 4 1 w 10, 11
# 5 3 w 1, 2, 3, 4
## Or, better yet:
aggregate(visits ~ bee.num + plant, df, c)
By the way, "data.table" can handle this listing and unlisting pretty directly:
library(data.table)
DT <- data.table(df)
setkey(DT, bee.num, plant)
DT[, list(visits = list(unlist(visits))), by = key(DT)]
# bee.num plant visits
# 1: 1 d 1,2,3,4,5,6,
# 2: 1 w 10,11
# 3: 2 d 1,2,3,4,5,6,
# 4: 3 d 5,6,7,8,9,10,
# 5: 3 w 1,2,3,4
The output there only looks truncated. All the information is there:
str(.Last.value)
# Classes ‘data.table’ and 'data.frame': 5 obs. of 3 variables:
# $ bee.num: num 1 1 2 3 3
# $ plant : Factor w/ 2 levels "d","w": 1 2 1 1 2
# $ visits :List of 5
# ..$ : int 1 2 3 4 5 6 7 8 9
# ..$ : int 10 11
# ..$ : int 1 2 3 4 5 6 7 8 9 10 ...
# ..$ : int 5 6 7 8 9 10 11
# ..$ : int 1 2 3 4
# - attr(*, "sorted")= chr "bee.num" "plant"
# - attr(*, ".internal.selfref")=<externalptr>
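For comparison, the same grouping can be sketched in plain base R without aggregate or data.table: take the unique (bee.num, plant) pairs, then unlist the visits belonging to each pair. This starts from the original df in the question (with the list column); groups is an illustrative name.

```r
df <- data.frame(bee.num = c(1, 1, 1, 2, 2, 3, 3),
                 plant = c("d", "d", "w", "d", "d", "w", "d"))
df$visits <- list(1:3, 4:9, 10:11, 1:10, 11:12, 1:4, 5:11)

# One row per distinct (bee.num, plant) pair, in order of first appearance
groups <- unique(df[c("bee.num", "plant")])
rownames(groups) <- NULL

# Concatenate the visits of all rows that belong to each pair
groups$visits <- lapply(seq_len(nrow(groups)), function(i) {
  in_group <- df$bee.num == groups$bee.num[i] & df$plant == groups$plant[i]
  unlist(df$visits[in_group])
})
groups
```

This scans df once per group, so it is only a reasonable choice for small data; the data.table version above should scale better.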
In answer to your specific question, I don't think aggregate.data.frame will do this easily.
As I've stated in previous posts, most R users would probably come up with a way to do this in plyr.
However, as my first exposure to data analysis was through database scripting, I remain partial to the sqldf package for these sorts of tasks.
I also find SQL to be more transparent to non-R users (something I frequently encounter in the social science community where I do most of my work).
Here is a solution to your problem using sqldf:
#your data assigned to dat
bee_num <- c(1,1,1,2,2,3,3) #named bee_num (not bee.num) to match cbind() and the query below
plant <- c("d", "d", "w", "d", "d", "w", "d")
visits <- c("1, 2, 3"
,"4, 5, 6, 7, 8, 9"
,"10, 11"
,"1, 2, 3, 4, 5, 6, 7, 8, 9, 10"
,"11, 12"
,"1, 2, 3, 4"
,"5, 6, 7, 8, 9, 10, 11")
dat <- as.data.frame(cbind(bee_num, plant, visits))
#load sqldf
require(sqldf)
#write a simple SQL aggregate query using group_concat()
#i.e. "select" your fields specifying the aggregate function for the
#relevant field, "from" a table called dat, and "group by" bee_num
#(older versions of sqldf converted "." into "_" in field names) and plant.
sqldf('select
bee_num
,plant
,group_concat(visits) visits
from dat
group by
bee_num
,plant')
bee_num plant visits
1 1 d 1, 2, 3,4, 5, 6, 7, 8, 9
2 1 w 10, 11
3 2 d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11, 12
4 3 d 5, 6, 7, 8, 9, 10, 11
5 3 w 1, 2, 3, 4
I have a dataframe "forum" that basically looks like this:
post-id: 1, 2, 3, 4, 5, ...
user-id: 1, 1, 2, 3, 4, ...
subforum-id: 1, 1, 1, 2, 3, ...
Now I'm trying to create a new dataframe that looks like this:
subforum-id: 1, 2, 3, ...
number-of-users-that-posted-only-once-to-this-subforum: ...
number-of-users-that-posted-more-than-n-times-to-this-subforum: ...
Is there any way to do that without pre-fabricating all the counts?
Using plyr and summarise:
library(plyr)
N <- 1 # threshold: count users who posted more than N times
ddply(DF, .(subforum.id), summarise, once = sum(table(user.id) == 1),
n.times = sum(table(user.id) > N))
# subforum.id once n.times
# 1 1 1 1
# 2 2 1 0
# 3 3 1 0
This is the data.frame DF:
DF <- structure(list(post.id = 1:5, user.id = c(1, 1, 2, 3, 4),
subforum.id = c(1, 1, 1, 2, 3)),
.Names = c("post.id", "user.id", "subforum.id"),
row.names = c(NA, -5L), class = "data.frame")
Here's a basic idea to get you started: Use table to get a count of user ids by subforum ids and work from there:
> mydf <- structure(list(post.id = c(1, 2, 3, 4, 5), user.id = c(1, 1,
2, 3, 4), subforum.id = c(1, 1, 1, 2, 3)), .Names = c("post.id",
"user.id", "subforum.id"), row.names = c(NA, -5L), class = "data.frame")
> mytable <- with(mydf, table(subforum.id, user.id))
> mytable
user.id
subforum.id 1 2 3 4
1 2 1 0 0
2 0 0 1 0
3 0 0 0 1
Hint: from there, look at the rowSums function, and think about what happens if you sum over a logical vector.
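One way the hint could be completed, as a sketch: comparing the table to a number gives a logical matrix, and rowSums counts the TRUE entries per subforum. N is an assumed threshold (1 here), and counts is an illustrative name.

```r
mydf <- data.frame(post.id = c(1, 2, 3, 4, 5),
                   user.id = c(1, 1, 2, 3, 4),
                   subforum.id = c(1, 1, 1, 2, 3))

# Posts per user within each subforum
mytable <- with(mydf, table(subforum.id, user.id))

N <- 1  # assumed threshold for "more than N times"

# Summing a logical matrix by row counts the users meeting each condition
counts <- data.frame(subforum.id = rownames(mytable),
                     once = unname(rowSums(mytable == 1)),
                     n.times = unname(rowSums(mytable > N)))
counts
```

Because the table holds a zero for users who never posted in a subforum, those users are excluded from both counts automatically.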