plyr summarize count error row length - r

suppose I have the following data:
A <- c(4,4,4,4,4)
B <- c(1,2,3,4,4)
C <- c(1,2,4,4,4)
D <- c(3,2,4,1,4)
filt <- c(1,1,10,8,10)
data <- as.data.frame(rbind(A,B,C,D,filt))
data <- t(data)
data <- as.data.frame(data)
> data
A B C D filt
V1 4 1 1 3 1
V2 4 2 2 2 1
V3 4 3 4 4 10
V4 4 4 4 1 8
V5 4 4 4 4 10
I want to get counts of the occurrences of 1, 2, 3, and 4 for each variable, after filtering. In my attempt to achieve this below, I get Error: length(rows) == 1 is not TRUE.
data %>%
  dplyr::filter(filt == 1) %>%
  plyr::summarize(A_count = count(A),
                  B_count = count(B))
I get the error because some of my columns do not contain all the values 1-4. Is there a way to specify which values it should look for and return 0 when a value is not found? I'm not sure how to do this, or if there is a different workaround.
Any help is VERY appreciated!!!

This was a bit of a weird one. I didn't use classical plyr, but I think this is roughly what you're looking for. I removed the filtering column filt so as not to get counts of it:
library(dplyr)
data %>%
  filter(filt == 1) %>%
  select(-filt) %>%
  purrr::map_df(function(a_column){
    # count how many times each of the values 1:4 appears in this column
    purrr::map_int(1:4, function(num) sum(a_column == num))
  })
# A tibble: 4 x 4
A B C D
<int> <int> <int> <int>
1 0 1 1 0
2 0 1 1 1
3 0 0 0 1
4 2 0 0 0
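If you want to pin the counts to an explicit set of levels, a base R sketch with factor() and table() also returns 0 for levels that never occur; it assumes, as in the question, that the values of interest are 1:4:
library(dplyr)
data %>%
  filter(filt == 1) %>%
  select(-filt) %>%
  sapply(function(a_column) table(factor(a_column, levels = 1:4)))
The factor() call fixes the levels up front, so table() reports a 0 count for any level missing from a column.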

Related

How to calculate values for the first row that meets a certain condition?

I have the following dummy dataframe:
t <- data.frame(
  a = c(0, 0, 2, 4, 5),
  b = c(0, 0, 4, 6, 5))
a b
0 0
0 0
2 4
4 6
5 5
I want to replace just the first value that is not zero in column b. Imagine that the row meeting this criterion is i. I want to replace t$b[i] with t$b[i+2] + t$b[i+1], and the rest of t$b should remain the same. So the output would be
a b
0 0
0 0
2 11
4 6
5 5
In fact the dataset is dynamic, so I cannot point directly to a specific row; it has to meet the criterion of being the first row not equal to zero in column b.
How can I create this new t$b?
Here is a straightforward solution in base R:
t <- data.frame(
  a = c(0, 0, 2, 4, 5),
  b = c(0, 0, 4, 6, 5))
# find the first nonzero value in b, then replace it with the
# sum of the two values that follow it
ind <- which(t$b > 0)[1L]
t$b[ind] <- t$b[ind + 2L] + t$b[ind + 1L]
t
a b
1 0 0
2 0 0
3 2 11
4 4 6
5 5 5
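Note that this indexes past the end of b if the first nonzero value sits in one of the last two rows, or if b is all zeros. A guarded sketch (the bounds check is my addition, not part of the original answer):
ind <- which(t$b > 0)[1L]
# only replace when a nonzero value exists and two more rows follow it
if (!is.na(ind) && ind + 2L <= nrow(t)) {
  t$b[ind] <- t$b[ind + 2L] + t$b[ind + 1L]
}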
Here is a roundabout way of getting there with a combination of group_by() and mutate():
library(tidyverse)
t %>%
  mutate(
    b_cond = b != 0,
    row_number = row_number()
  ) %>%
  group_by(b_cond) %>%
  mutate(
    min_row_number = row_number == min(row_number),
    b = if_else(b_cond & min_row_number, lead(b, 1) + lead(b, 2), b)
  ) %>%
  ungroup() %>%
  select(a, b) # optional, to get back to the original columns
# A tibble: 5 × 2
a b
<dbl> <dbl>
1 0 0
2 0 0
3 2 11
4 4 6
5 5 5
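A flatter dplyr spelling is possible with replace(); this sketch makes the same assumption as the base R answer, namely that a nonzero value exists in b with at least two rows after it:
library(dplyr)
t %>%
  mutate(b = replace(b, which(b != 0)[1L],
                     # sum of the two values following the first nonzero b
                     b[which(b != 0)[1L] + 2L] + b[which(b != 0)[1L] + 1L]))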

How do I drop all observations except the last of a pattern?

I asked a question a few months back about how to identify and keep only observations that follow a certain pattern: How can I identify patterns over several rows in a column and fill a new column with information about that pattern using R?
I want to take this a step further. In that question I just wanted to identify the pattern. Now, if the pattern appears several times within a group, how do I keep only the last occurrence of that pattern? For example, given df1, how can I achieve df2?
df1
TIME ID D
12:30:10 2 0
12:30:42 2 0
12:30:59 2 1
12:31:20 2 0
12:31:50 2 0
12:32:11 2 0
12:32:45 2 1
12:33:10 2 1
12:33:33 2 1
12:33:55 2 1
12:34:15 2 0
12:34:30 2 0
12:35:30 2 0
12:36:30 2 0
12:36:45 2 0
12:37:00 2 0
12:38:00 2 1
I want to end up with the following df2
df2
TIME ID D
12:33:55 2 1
12:34:15 2 0
12:34:30 2 0
12:35:30 2 0
12:36:30 2 0
12:36:45 2 0
12:37:00 2 0
12:38:00 2 1
Thoughts? There were some helpful answers in the question I linked above, but I now want to narrow it.
Here is a base R function that I find too complicated, but it does what is asked for.
If I understood the pattern correctly, it doesn't matter whether the last sequence ends in a 1 or a 0; the test with df1b has a last sequence ending in a 0.
keep_last_pattern <- function(data, col){
  x <- data[[col]]
  # treat a trailing 0 as closing the final sequence
  if(x[length(x)] == 0) x[length(x)] <- 1
  # group rows into runs starting at each 1; flag runs longer than one row
  i <- ave(x, cumsum(x), FUN = \(y) y[1] == 1 & length(y) > 1)
  # run-length encode the flags and switch off every flagged run but the last
  r <- rle(i)
  l <- length(r$lengths)
  n <- which(as.logical(r$values))
  r$values[ n[-length(n)] ] <- 0
  # the terminal 1 forms its own unflagged length-1 run; include it
  r$values[l] <- r$lengths[l] == 1 && r$values[l] == 0
  j <- as.logical(inverse.rle(r))
  data[j, ]
}
keep_last_pattern(df1, "D")
df1b <- df1
df1b[17, "D"] <- 0
keep_last_pattern(df1b, "D")
Do you want the rows in each ID between the second-to-last 1 and the last 1?
Here is a function to do that, which can be applied for each ID:
library(dplyr)
extract_sequence <- function(x) {
  # positions of the 1s: keep rows from the second-to-last 1 to the last 1
  inds <- which(x == 1)
  inds[length(inds) - 1]:inds[length(inds)]
}
df1 %>%
  group_by(ID) %>%
  slice(extract_sequence(D)) %>%
  ungroup
# TIME ID D
# <chr> <int> <int>
#1 12:33:55 2 1
#2 12:34:15 2 0
#3 12:34:30 2 0
#4 12:35:30 2 0
#5 12:36:30 2 0
#6 12:36:45 2 0
#7 12:37:00 2 0
#8 12:38:00 2 1
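Note that extract_sequence() assumes every ID contains at least two 1s; with fewer, the indexing fails. A guarded sketch (falling back to all rows is my assumption about the desired behaviour):
extract_sequence <- function(x) {
  inds <- which(x == 1)
  # fewer than two 1s in this group: keep every row rather than mis-index
  if (length(inds) < 2) return(seq_along(x))
  inds[length(inds) - 1]:inds[length(inds)]
}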
Not sure this will help as it's unclear what your pattern is.
Let's assume you have data like this, with one column indicating in some way whether the row matches a pattern or not:
set.seed(123)
df <- data.frame(
  grp = sample(LETTERS[1:3], 10, replace = TRUE),
  x = 1:10,
  y = c(0, 1, 0, 0, 1, 1, 1, 1, 1, 1),
  pattern = rep(c("TRUE", "FALSE"), 5)
)
If the aim is to keep only the last occurrence of pattern == "TRUE" per group, this might work:
df %>%
  filter(pattern == "TRUE") %>%
  group_by(grp) %>%
  slice_tail(n = 1)
# A tibble: 3 x 4
# Groups: grp [3]
grp x y pattern
<chr> <int> <dbl> <chr>
1 A 1 0 TRUE
2 B 9 1 TRUE
3 C 5 1 TRUE
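An equivalent spelling, if you prefer indexing: slice(n()) also keeps the last row of each group:
df %>%
  filter(pattern == "TRUE") %>%
  group_by(grp) %>%
  slice(n())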

Transforming dataframe to contain counts of values

I have created a dataframe that contains ids and stringvalues:
mycols <- c('id','2')
ids <- c(1,1,2,3)
stringvalues <- c('a','a','b','c')
mydf <- data.frame(ids, stringvalues)
mydf contains:
ids stringvalues
1 1 a
2 1 a
3 2 b
4 3 c
I'm attempting to produce a new dataframe that contains the id and corresponding counts for each string:
id, a , b , c
1 , 2 , 0 , 0
2 , 0 , 1 , 0
3 , 0 , 0 , 1
I'm trying to create multiple summarise implementations:
g1 <- group_by(mydf , ids)
s1 <- summarise(g1 , a = count('a'))
s2 <- summarise(g1 , b = count('b'))
s3 <- summarise(g1 , c = count('c'))
But this returns the error: Evaluation error: no applicable method for 'groups' applied to an object of class "character".
How to create new columns that count number of string entries in the column ?
Does doing a dplyr::count followed by tidyr::spread work for you? (I'm only posting this because you mentioned you wanted to create a dataframe of this sort; otherwise it's much simpler to use table(mydf), as the other comments/answers suggest.)
library(dplyr)
library(tidyr)
mydf %>% count(ids, stringvalues) %>% spread(stringvalues, n, fill = 0)
#> # A tibble: 3 x 4
#> ids a b c
#> * <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 0 0
#> 2 2 0 1 0
#> 3 3 0 0 1
You can use count directly. First,
count(mydf, ids, stringvalues)
gives
# A tibble: 3 x 3
ids stringvalues n
<dbl> <fctr> <int>
1 1 a 2
2 2 b 1
3 3 c 1
then reshape,
count(mydf, ids, stringvalues) %>% tidyr::spread(stringvalues, n)
gives
# A tibble: 3 x 4
ids a b c
* <dbl> <int> <int> <int>
1 1 2 NA NA
2 2 NA 1 NA
3 3 NA NA 1
then replace the NAs with something like res[is.na(res)] <- 0, where res is the object constructed above.
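Or skip the NA clean-up entirely by letting spread() fill the gaps, as the first answer does:
res <- count(mydf, ids, stringvalues) %>% tidyr::spread(stringvalues, n, fill = 0)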
Here's a base-R solution:
data.frame(cbind(table(mydf)))
Output option 1 (row # = ID):
a b c
1 2 0 0
2 0 1 0
3 0 0 1
Output option 2 (with ID as column):
data.frame(cbind(id=unique(mydf$ids),table(mydf)))
id a b c
1 1 2 0 0
2 2 0 1 0
3 3 0 0 1
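A related option (my suggestion, not from the original answer): as.data.frame.matrix() converts the two-way table directly, keeping the ids as row names:
as.data.frame.matrix(table(mydf))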

Grouping and Counting instances?

Is it possible to group and count instances of all other columns using R (dplyr)? For example, the following dataframe
x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1
Turns into this (note: y is the value being counted):
EDIT: to explain the transformation: x is what I'm grouping by. For each group of x, I want to count how many times 0, 1, and 2 appear in each of the other columns. In the first row of the transformed dataframe, we counted how many times the rows with x = 1 contained the value 0 (y): 0 appears once in column a, twice in column b, and once in column c.
x y a b c
1 0 1 2 1
1 1 1 0 2
1 2 1 1 0
2 1 1 0 1
2 2 0 1 0
An approach with a combination of the melt and dcast functions of data.table or reshape2:
library(data.table) # v1.9.5+
dt.new <- dcast(melt(setDT(df), id.vars="x"), x + value ~ variable)
this gives:
dt.new
# x value a b c
# 1: 1 0 1 2 1
# 2: 1 1 1 0 2
# 3: 1 2 1 1 0
# 4: 2 1 1 0 1
# 5: 2 2 0 1 0
In dcast you can specify which aggregation function to use, but in this case that is not necessary, as the default aggregation function is length. Without an explicit aggregation function, you will get a warning about that:
Aggregation function missing: defaulting to length
Furthermore, if you do not explicitly convert the dataframe to a data.table, data.table will redirect to reshape2 (see the explanation from @Arun in the comments). Consequently, this method can be used with reshape2 as well:
library(reshape2)
df.new <- dcast(melt(df, id.vars="x"), x + value ~ variable)
Used data:
df <- read.table(text="x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1", header=TRUE)
I'd use a combination of gather and spread from the tidyr package, and count from dplyr:
library(dplyr)
library(tidyr)
df = data.frame(x = c(1,1,1,2), a = c(0,1,2,1), b = c(0,0,2,2), c = c(0,1,1,1))
res = df %>%
  gather(variable, value, -x) %>%
  count(x, variable, value) %>%
  spread(variable, n, fill = 0)
# Source: local data frame [5 x 5]
#
# x value a b c
# 1 1 0 1 2 1
# 2 1 1 1 0 2
# 3 1 2 1 1 0
# 4 2 1 1 0 1
# 5 2 2 0 1 0
Essentially, you first change the format of the dataset to:
head(df %>%
  gather(variable, value, -x))
# x variable value
#1 1 a 0
#2 1 a 1
#3 1 a 2
#4 2 a 1
#5 1 b 0
#6 1 b 0
This allows you to use count to get the information about how often certain values occur in columns a to c. After that, you reshape the dataset into your required format using spread.
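gather() and spread() are superseded in tidyr 1.0+; a roughly equivalent sketch with the newer verbs (the scalar values_fill assumes tidyr >= 1.1):
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-x, names_to = "variable", values_to = "value") %>%
  count(x, variable, value) %>%
  pivot_wider(names_from = variable, values_from = n, values_fill = 0)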

Computing contingency tables in R language

I have a data set in R as follows
ID Variable1 Variable2 Choice
1 1 2 1
1 2 1 0
2 2 1 1
2 2 1 1
I need to get the following output table for it:
Id Variable1-1 Variable1-2 Variable2-1 Variable2-2
1 1 0 0 1
2 0 2 2 0
Note that only the rows where Choice is 1 are counted (Choice is a binary variable; the other variables can take any integer values). The aim is to have as many columns for each variable as it has levels.
Is there a way I can do this in R?
You could use melt and dcast from the reshape2 package:
mydf<-read.table(text="ID Variable1 Variable2 Choice
1 1 2 1
1 2 1 0
2 2 1 1
2 2 1 1",header=TRUE)
library(reshape2)
First melt the data.frame, selecting only those rows where Choice == 1 and removing the Choice column:
mydfM <- melt(mydf[mydf$Choice %in% 1, -match("Choice", names(mydf))], id = "ID")
# EDIT above: as @TylerRinker points out, using which() could be avoided;
# I've replaced it with %in%
# ID variable value
# 1 1 Variable1 1
# 2 2 Variable1 2
# 3 2 Variable1 2
# 4 1 Variable2 2
# 5 2 Variable2 1
# 6 2 Variable2 1
Then cast the melted data.frame, using length as the aggregation function:
(mydfC <- dcast(mydfM, ID ~ variable + value, fun.aggregate = length))
# ID Variable1_1 Variable1_2 Variable2_1 Variable2_2
# 1 1 1 0 0 1
# 2 2 0 2 2 0
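For comparison, a base R sketch with table() on the Choice == 1 subset gives the same counts; the hand-written column names assume both variables take exactly the levels 1 and 2 in that subset:
sub <- mydf[mydf$Choice == 1, ]
# one ID-by-level table per variable, bound side by side
out <- cbind(table(sub$ID, sub$Variable1), table(sub$ID, sub$Variable2))
colnames(out) <- c("Variable1_1", "Variable1_2", "Variable2_1", "Variable2_2")
out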
It took me a while to figure out what you were after, but I got it (I think). I have done what you asked, but it's convoluted at best. I think this will help others see what you're after, and you'll get better answers now.
dat <- read.table(text="ID Variable1 Variable2 Choice
1 1 2 1
1 2 1 0
2 2 1 1
2 2 1 1", header=T)
# split Choice by each variable crossed with ID
A <- split(dat$Choice, list(dat$Variable1, dat$ID))
B <- split(dat$Choice, list(dat$Variable2, dat$ID))
C <- list(A, B)
# sum the Choice values (i.e. count the 1s) within each split
FUN <- function(x) sapply(x, sum)
# fold each count vector into one row per ID
FUN2 <- function(x){
len <- length(x)/2
rbind(x[1:len], x[(len+1):length(x)])
}
dat2 <- do.call('data.frame', lapply(lapply(C, FUN), FUN2))
colnames(dat2) <- c('Variable1-1', 'Variable1-2', 'Variable2-1',
'Variable2-2')
dat2
This ain't your grandmother's contingency table, that's for sure. There's probably a much better way to accomplish all of this, maybe with reshape.
