I have the following structured table (as an example):
Class 1 Class 2
1 1 1
2 1 1
3 1 1
4 1 2
5 3 3
6 3 3
7 3 4
8 4 4
I want to count how many times in a given Class 1 the same value appear in Class 2 and display this as a percentage value. Also group class 1. So I would want the result to be something like this:
Class 1 n_class1 Percentage of occurrence in class 2
1 1 4 0.75
2 3 3 0.666
3 4 1 1.0
I have read a lot about the dplyr package and think the solution can be in there, and also looked at many examples but have not yet found a solution. I'm new to programming so don't have the natural programmer thinking yet, hope someone can give me tips on how to to this.
I have manage to get the n_class1 by using group by but struggling to get the the percentage of occurrence in class 2.
you can do this by creating a new column in.class1 with mutate:
library(dplyr)
df <- data.frame(
class1 = rep(c(1, 3, 4), c(4, 3, 1)),
class2 = rep(c(1, 2, 3, 4), c(3, 1, 2, 2))
)
df %>%
mutate(in.class1 = class2 == class1) %>%
group_by(class1) %>%
summarise(n_class1 = n(),
class2_percentile = sum(in.class1) / n()
)
# # A tibble: 3 × 3
# class1 n_class1 class2_percentile
# <dbl> <int> <dbl>
# 1 1 4 0.7500000
# 2 3 3 0.6666667
# 3 4 1 1.0000000
As suggested by Jaap in comment, this could be simplified to:
df %>%
group_by(class1) %>%
summarise(
n_class1 = n(),
class2_percentile = sum(class1 == class2) / n())
The question has already been asked as part of a larger question the OP has asked before where it has been answered using data.table.
Read data
library(data.table)
cl <- fread(
"id Class1 Class2
1 1 1
2 1 1
3 1 1
4 1 2
5 3 3
6 3 3
7 3 4
8 4 4"
)
Aggregate
cl[, .(.N, share_of_occurence_in_Class2 = sum(Class1 == Class2)/.N), by = Class1]
# Class1 N share_of_occurence_in_Class2
#1: 1 4 0.7500000
#2: 3 3 0.6666667
#3: 4 1 1.0000000
Related
I have a the following dataframe:
Participant_ID Order
1 A
1 A
2 B
2 B
3 A
3 A
4 B
4 B
5 B
5 B
6 A
6 A
Every two rows refer to the same participant. I want to create a new column based on the value in the column 'Order'. If the 'Order' == A, then I want it to create a new column with two rows of [1, 2], and then if the 'Order' == B, then I want it to create two rows of [2,1] in the same column
The preferred output would be the following:
Participant_ID Order Period
1 A 1
1 A 2
2 B 2
2 B 1
3 A 1
3 A 2
4 B 2
4 B 1
5 B 2
5 B 1
6 A 1
6 A 2
Any help would be appreciated
Here are a couple of possibilities. This assumes that Order value is same for a given Participant_ID. If this isn't the case, you will need to include additional logic.
You can use if_else:
library(tidyverse)
df %>%
group_by(Participant_ID) %>%
mutate(Period = if_else(Order == "A", 1:2, 2:1))
Or to explicitly check for multiple different values (e.g., "A", "B", etc.), have more flexibility, and include NA for other cases, you can use case_when:
df %>%
group_by(Participant_ID) %>%
mutate(Period = case_when(
Order == "A" ~ 1:2,
Order == "B" ~ 2:1,
TRUE ~ NA_integer_
))
Output
Participant_ID Order Period
<int> <chr> <int>
1 1 A 1
2 1 A 2
3 2 B 2
4 2 B 1
5 3 A 1
6 3 A 2
7 4 B 2
8 4 B 1
9 5 B 2
10 5 B 1
11 6 A 1
12 6 A 2
How can I run a loop over multiple columns changing consecutive values to true values?
For example, if I have a dataframe like this...
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1
I want to show the binned values...
Time Value Bin Subject_ID
1 6 1 1
2 4 2 1
4 8 3 1
1 2 4 1
Is there a way to do it in a loop?
I tried this code...
for (row in 2:nrow(df)) {
if(df[row - 1, "Subject_ID"] == df[row, "Subject_ID"]) {
df[row,1:2] = df[row,1:2] - df[row - 1,1:2]
}
}
But the code changed it line by line and did not give the correct values for each bin.
If you still insist on using a for loop, you can use the following solution. It's very simple but you have to first create a copy of your data set as your desired output values are the difference of values between rows of the original data set. In order for this to happen we move DF outside of the for loop so the values remain intact, otherwise in every iteration values of DF data set will be replaced with the new values and the final output gives incorrect results:
df <- read.table(header = TRUE, text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1")
DF <- df[, c("Time", "Value")]
for(i in 2:nrow(df)) {
df[i, c("Time", "Value")] <- DF[i, ] - DF[i-1, ]
}
df
Time Value Bin Subject_ID
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
The problem with the code in the question is that after row i is changed the changed row is used in calculating row i+1 rather than the original row i. To fix that run the loop in reverse order. That is use nrow(df):2 in the for statement. Alternately try one of these which do not use any loops and also have the advantage of not overwriting the input -- something which makes the code easier to debug.
1) Base R Use ave to perform Diff by group where Diff uses diff to actually perform the differencing.
Diff <- function(x) c(x[1], diff(x))
transform(df,
Time = ave(Time, Subject_ID, FUN = Diff),
Value = ave(Value, Subject_ID, FUN = Diff))
giving:
Time Value Bin Subject_ID
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
2) dplyr Using dplyr we write the above except we use lag:
library(dplyr)
df %>%
group_by(Subject_ID) %>%
mutate(Time = Time - lag(Time, default = 0),
Value = Value - lag(Value, default = 0)) %>%
ungroup
giving:
# A tibble: 4 x 4
Time Value Bin Subject_ID
<dbl> <dbl> <int> <int>
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
or using across:
library(dplyr)
df %>%
group_by(Subject_ID) %>%
mutate(across(Time:Value, ~ .x - lag(.x, default = 0))) %>%
ungroup
Note
Lines <- "Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1"
df <- read.table(text = Lines, header = TRUE)
Here is a base R one-liner with diff in a lapply loop.
df[1:2] <- lapply(df[1.2], function(x) c(x[1], diff(x)))
df
# Time Value Bin Subject_ID
#1 1 1 1 1
#2 2 2 2 1
#3 4 4 3 1
#4 1 1 4 1
Data
df <- read.table(text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1
", header = TRUE)
dplyr one liner
library(dplyr)
df %>% mutate(across(c(Time, Value), ~c(first(.), diff(.))))
#> Time Value Bin Subject_ID
#> 1 1 6 1 1
#> 2 2 4 2 1
#> 3 4 8 3 1
#> 4 1 2 4 1
I would like to merge multiple columns. Here is what my sample dataset looks like.
df <- data.frame(
id = c(1,2,3,4,5),
cat.1 = c(3,4,NA,4,2),
cat.2 = c(3,NA,1,4,NA),
cat.3 = c(3,4,1,4,2))
> df
id cat.1 cat.2 cat.3
1 1 3 3 3
2 2 4 NA 4
3 3 NA 1 1
4 4 4 4 4
5 5 2 NA 2
I am trying to merge columns cat.1 cat.2 and cat.3. It is a little complicated for me since there are NAs.
I need to have only one cat variable and even some columns have NA, I need to ignore them. The desired output is below:
> df
id cat
1 1 3
2 2 4
3 3 1
4 4 4
5 5 2
Any thoughts?
Another variation of Gregor's answer using dplyr::transmute:
library(dplyr)
df %>%
transmute(id = id, cat = coalesce(cat.1, cat.2, cat.3))
#> id cat
#> 1 1 3
#> 2 2 4
#> 3 3 1
#> 4 4 4
#> 5 5 2
With dplyr:
library(dplyr)
df %>%
mutate(cat = coalesce(cat.1, cat.2, cat.3)) %>%
select(-cat.1, -cat.2, -cat.3)
An option with fcoalesce from data.table
library(data.table)
setDT(df)[, .(id, cat = do.call(fcoalesce, .SD)), .SDcols = patterns('^cat')]
-output
# id cat
#1: 1 3
#2: 2 4
#3: 3 1
#4: 4 4
#5: 5 2
Does this work:
> library(dplyr)
> df %>% rowwise() %>% mutate(cat = mean(c(cat.1, cat.2, cat.3), na.rm = T)) %>% select(-(2:4))
# A tibble: 5 x 2
# Rowwise:
id cat
<dbl> <dbl>
1 1 3
2 2 4
3 3 1
4 4 4
5 5 2
Since values across rows are unique, mean of the rows will return the same unique value, can also go with max or min.
Here is a base R solution which uses apply:
df$cat <- apply(df, 1, function(x) unique(x[!is.na(x)][-1]))
I have two vectors which contain indices which look like
index A index B
1 1
1 1
1 1
1 2
1 2
2 1
2 1
Now, I want to find the length of each combination between index A and index B. So, in my example there are three unique combinations for index A and index B and I want to get back 3, 2, 2 in a vector. Does anyone know how to this without a for loop?
EDIT:
So, in this example there are three unique combinations (1 1, 1 2 and 2 1) for which the there are 3 of combination 1 1, 2 of 1 2 and 2 of 2 1. Therefore, I want to return 3, 2, 2
I think this is what you want:
library(plyr)
df <- data.frame(index_A = c(1, 1, 1, 1, 1, 2, 2),
index_B = c(1, 1, 1, 2, 2, 1, 1))
count(df, vars = c("index_A", "index_B"))
#> index_A index_B freq
#> 1 1 1 3
#> 2 1 2 2
#> 3 2 1 2
Created on 2019-03-17 by the reprex package (v0.2.1)
I got this from here.
In base R, we can use table
as.data.frame(table(dat))
You could paste the vectors together and call rle
rle(do.call(paste0, dat))$lengths
# [1] 3 2 2
If you need the result as a data.frame, do
as.data.frame(unclass(rle(do.call(paste0, dat))))
# lengths values
#1 3 11
#2 2 12
#3 2 21
data
text <- "indexA indexB
1 1
1 1
1 1
1 2
1 2
2 1
2 1"
dat <- read.table(text = text, header = TRUE)
This is somehow hacky:
library(dplyr)
df %>%
mutate(Combined=paste0(`index A`,"_",`index B`)) %>%
group_by(Combined) %>%
summarise(n=n())
# A tibble: 3 x 2
Combined n
<chr> <int>
1 1_1 3
2 1_2 2
3 2_1 2
Can actually just do:
df %>%
group_by(`index A`,`index B`) %>%
summarise(n=n())
Adding tidyr unite as suggested by #kath
library(tidyr)
df %>%
unite(new_col,`index A`,`index B`,sep="_") %>%
add_count(new_col) %>%
unique()
Data:
df<-read.table(text="index A index B
1 1
1 1
1 1
1 2
1 2
2 1
2 1",header=T,as.is=T,fill=T)
df<-df[,1:2]
names(df)<-c("index A","index B")
Using dplyr :
library(dplyr)
count(dat,!!!dat)$n
# [1] 3 2 2
I have a data with about 1000 groups each group is ordered from 1-100(can be any number within 100).
As I was looking through the data. I found that some groups had bad orders, i.e., it would order to 100 then suddenly a 24 would show up.
How can I delete all of these error data
As you can see from the picture above(before -> after), I would like to find all rows that don't follow the order within the group and just delete it.
Any help would be great!
lag will compute the difference between the current value and the previous value, diff will be used to select only positive difference i.e. the current value is greater than the previous value. min is used as lag give the first value NA. I keep the helper column diff to check, but you can deselect using %>% select(-diff)
library(dplyr)
df1 %>% group_by(gruop) %>% mutate(diff = order-lag(order)) %>%
filter(diff >= 0 | order==min(order))
# A tibble: 8 x 3
# Groups: gruop [2]
gruop order diff
<int> <int> <int>
1 1 1 NA
2 1 3 2
3 1 5 2
4 1 10 5
5 2 1 NA
6 2 4 3
7 2 4 0
8 2 8 4
Data
df1 <- read.table(text="
gruop order
1 1
1 3
1 5
1 10
1 2
2 1
2 4
2 4
2 8
2 3
",header=T, stringsAsFactors = F)
Assuming the order column increments by 1 every time we can use ave where we remove those rows which do not have difference of 1 with the previous row by group.
df[!ave(df$order, df$group, FUN = function(x) c(1, diff(x))) != 1, ]
# group order
#1 1 1
#2 1 2
#3 1 3
#4 1 4
#6 2 1
#7 2 2
#8 2 3
#9 2 4
EDIT
For the updated example, we can just change the comparison
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) >= 0, ]
Playing with data.table:
library(data.table)
setDT(df1)[, diffo := c(1, diff(order)), group][diffo == 1, .(group, order)]
group order
1: 1 1
2: 1 2
3: 1 3
4: 1 4
5: 2 1
6: 2 2
7: 2 3
8: 2 4
Where df1 is:
df1 <- data.frame(
group = rep(1:2, each = 5),
order = c(1:4, 2, 1:4, 3)
)
EDIT
If you only need increasing order, and not steps of one then you can do:
df3 <- transform(df1, order = c(1,3,5,10,2,1,4,7,9,3))
setDT(df3)[, diffo := c(1, diff(order)), group][diffo >= 1, .(group, order)]
group order
1: 1 1
2: 1 3
3: 1 5
4: 1 10
5: 2 1
6: 2 4
7: 2 7
8: 2 9