Consider this data:
m = data.frame(pop=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4),
id=c(0,1,1,1,1,1,0,2,1,1,1,2,1,2,2,2))
> m
pop id
1 1 0
2 1 1
3 1 1
4 1 1
5 2 1
6 2 1
7 2 0
8 2 2
9 2 1
10 3 1
11 3 1
12 3 2
13 3 1
14 3 2
15 4 2
16 4 2
I would like to get the frequency of each unique id in each unique pop. For example, id 1 is present 3 times out of 4 when pop == 1, therefore the frequency of id 1 in pop 1 is 0.75.
I came up with this ugly solution:
out = matrix(0,ncol=3)
for (p in unique(m$pop))
{
for (i in unique(m$id))
{
m1 = m[m$pop == p,]
f = nrow(m1[m1$id == i,])/nrow(m1)
out = rbind(out, c(p, f, i))
}
}
out = out[-1,]
colnames(out) = c("pop", "freq", "id")
# SOLUTION
> out
pop freq id
[1,] 1 0.25 0
[2,] 1 0.75 1
[3,] 1 0.00 2
[4,] 2 0.20 0
[5,] 2 0.60 1
[6,] 2 0.20 2
[7,] 3 0.00 0
[8,] 3 0.60 1
[9,] 3 0.40 2
[10,] 4 0.00 0
[11,] 4 0.00 1
[12,] 4 1.00 2
I am sure there exists a more efficient solution using data.table or table but I couldn't find it.
Here's what I might do:
as.data.frame(prop.table(table(m),1))
# pop id Freq
# 1 1 0 0.25
# 2 2 0 0.20
# 3 3 0 0.00
# 4 4 0 0.00
# 5 1 1 0.75
# 6 2 1 0.60
# 7 3 1 0.60
# 8 4 1 0.00
# 9 1 2 0.00
# 10 2 2 0.20
# 11 3 2 0.40
# 12 4 2 1.00
If you want it sorted by pop, you can do that afterwards. Alternatively, you could transpose the table with t before converting to data.frame, or use rev(m) and prop.table on dimension 2.
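For instance, a minimal sketch of the sort-afterwards and rev(m) variants, using the same m as above:
res <- as.data.frame(prop.table(table(m), 1))
res[order(res$pop), ]                        # sort the long result by pop
as.data.frame(prop.table(table(rev(m)), 2))  # id x pop table, normalized within each pop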
Try:
library(dplyr)
m %>%
group_by(pop, id) %>%
summarise(s = n()) %>%
mutate(freq = s / sum(s)) %>%
select(-s)
Which gives:
#Source: local data frame [8 x 3]
#Groups: pop
#
# pop id freq
#1 1 0 0.25
#2 1 1 0.75
#3 2 0 0.20
#4 2 1 0.60
#5 2 2 0.20
#6 3 1 0.60
#7 3 2 0.40
#8 4 2 1.00
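Note that summarise() peels off only the innermost grouping level (id here), so the following mutate() computes sum(s) within each pop, which is what makes the frequencies come out right. In more recent dplyr the same idea can be written a bit more compactly with count() (a sketch on the same data):
m %>%
count(pop, id) %>%
group_by(pop) %>%
mutate(freq = n / sum(n)) %>%
select(-n)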
A data.table solution:
setDT(m)[, {div = .N; .SD[, .N/div, keyby = id]}, by = pop]
# pop id V1
#1: 1 0 0.25
#2: 1 1 0.75
#3: 2 0 0.20
#4: 2 1 0.60
#5: 2 2 0.20
#6: 3 1 0.60
#7: 3 2 0.40
#8: 4 2 1.00
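If you prefer to avoid the nested .SD call, a roughly equivalent sketch (assuming the same m) counts first and then divides within each pop:
library(data.table)
setDT(m)[, .N, by = .(pop, id)][, freq := N / sum(N), by = pop][]
Like the two solutions above, it omits the zero-frequency pop/id combinations.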
Please see the sample data below:
data <- data.frame(q1=c(3,4,5,2,1,2,4),
q2=c(3,4,4,5,4,3,2),
q3=c(2,3,2,3,1,2,3),
q4=c(3,4,4,4,4,5,5))
I would like to create another column which shows the percent of 4/5 responses. The output I am hoping to get looks something like this. Any help is appreciated, thank you!
q1 q2 q3 q4 percent
1 3 3 2 3 0.00
2 4 4 3 4 0.75
3 5 4 2 4 0.75
4 2 5 3 4 0.50
5 1 4 1 4 0.50
6 2 3 2 5 0.25
7 4 2 3 5 0.50
Using rowMeans
library(dplyr)
data %>%
mutate(percent = rowMeans(across(everything(), ~ .x %in% 4:5)))
Output:
q1 q2 q3 q4 percent
1 3 3 2 3 0.00
2 4 4 3 4 0.75
3 5 4 2 4 0.75
4 2 5 3 4 0.50
5 1 4 1 4 0.50
6 2 3 2 5 0.25
7 4 2 3 5 0.50
One possible solution:
data$percent = rowMeans(data>3)
Or
data$percent = apply(data, 1, \(x) mean(x %in% 4:5))
q1 q2 q3 q4 percent
1 3 3 2 3 0.00
2 4 4 3 4 0.75
3 5 4 2 4 0.75
4 2 5 3 4 0.50
5 1 4 1 4 0.50
6 2 3 2 5 0.25
7 4 2 3 5 0.50
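One caveat: data > 3 and x %in% 4:5 agree only because the responses are whole numbers on a 1-5 scale; a fractional value such as 3.5 would be counted by the first but not the second:
mean(c(3.5, 4) > 3)      # 1.0
mean(c(3.5, 4) %in% 4:5) # 0.5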
library(dplyr)
data <- data.frame(q1=c(3,4,5,2,1,2,4),
q2=c(3,4,4,5,4,3,2),
q3=c(2,3,2,3,1,2,3),
q4=c(3,4,4,4,4,5,5))
percent_4_5 <- function(x) {
(sum(x == 4) + sum(x == 5)) / length(x)
}
data %>% rowwise() %>% mutate(percent = percent_4_5(c_across(starts_with("q")))) %>% ungroup()
Another possible solution, without using dplyr:
library(magrittr)
data$percent <- (data > 3) %>% as.data.frame() %>% apply(., 1, mean)
TRUE is treated as 1 and FALSE as 0 when the mean is computed.
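A quick illustration of that coercion:
mean(c(TRUE, FALSE, TRUE, TRUE)) # 0.75 -- TRUE counts as 1, FALSE as 0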
Output:
q1 q2 q3 q4 percent
1 3 3 2 3 0.00
2 4 4 3 4 0.75
3 5 4 2 4 0.75
4 2 5 3 4 0.50
5 1 4 1 4 0.50
6 2 3 2 5 0.25
7 4 2 3 5 0.50
My data frame looks like this:
value <- c(0,0.1,0.2,0.4,0,"0.05,",0.05,0.5,0.20,0.40,0.50,0.60)
time <- c(1,1,"1,",1,2,2,2,2,3,3,3,3)
ID <- c("1,","2,","3,",4,1,2,3,4,1,2,3,4)
test <- data.frame(value, time, ID)
test
value time ID
1 0 1 1,
2 0.1 1 2,
3 0.2 1, 3,
4 0.4 1 4
5 0 2 1
6 0.05, 2 2
7 0.05 2 3
8 0.5 2 4
9 0.2 3 1
10 0.4 3 2
11 0.5 3 3
12 0.6 3 4
I want to replace the "," in all columns with "" but I keep getting an error:
Error in UseMethod("tbl_vars") :
no applicable method for 'tbl_vars' applied to an object of class "character"
I would like my data to look like this:
value time ID
1 0.00 1 1
2 0.10 1 2
3 0.20 1 3
4 0.40 1 4
5 0.00 2 1
6 0.05 2 2
7 0.05 2 3
8 0.50 2 4
9 0.20 3 1
10 0.40 3 2
11 0.50 3 3
12 0.60 3 4
EDIT
test %>%
mutate_all(~gsub(",","",.))
The easiest in this case might be to use parse_number from the readr package, e.g.:
apply(test, 2, readr::parse_number)
or in dplyr lingo:
test %>% mutate_all(readr::parse_number)
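Note that mutate_all is superseded in current dplyr; a sketch of the across() equivalent on the same data:
test %>% mutate(across(everything(), readr::parse_number))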
A simple base R solution:
test <- sapply(test, function(x) as.numeric(sub(",", "", x)))
test
value time ID
[1,] 0.00 1 1
[2,] 0.10 1 2
[3,] 0.20 1 3
[4,] 0.40 1 4
[5,] 0.00 2 1
[6,] 0.05 2 2
[7,] 0.05 2 3
[8,] 0.50 2 4
[9,] 0.20 3 1
[10,] 0.40 3 2
[11,] 0.50 3 3
[12,] 0.60 3 4
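Note that sapply() simplifies the result to a matrix (hence the [1,] row labels above). If you want to keep test as a data frame, a small variation with lapply() does the trick:
test[] <- lapply(test, function(x) as.numeric(sub(",", "", x)))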
test %>%
mutate_at(vars(value, time, ID), ~ gsub(".*?(-?[0-9]+\\.?[0-9]*).*", "\\1", .))
# value time ID
# 1 0 1 1
# 2 0.1 1 2
# 3 0.2 1 3
# 4 0.4 1 4
# 5 0 2 1
# 6 0.05 2 2
# 7 0.05 2 3
# 8 0.5 2 4
# 9 0.2 3 1
# 10 0.4 3 2
# 11 0.5 3 3
# 12 0.6 3 4
The further we get into "let's try to parse what could be a number" territory, the crazier it can get, scientific notation included. For that, the readr::parse_number approach already suggested is likely a better candidate, if you can accept one more package dependency.
However, seeing data like this suggests that either the import method or the way the data is produced has mistakes in it. While this patch works around those kinds of mistakes, it is far better to fix whatever error is causing them.
I am trying to take the means over the columns sam and dup of the following dataset:
fat co lab sam dup
1 0.62 1 1 1 1
2 0.55 1 1 1 2
3 0.34 1 1 2 1
4 0.24 1 1 2 2
5 0.80 1 1 3 1
6 0.68 1 1 3 2
7 0.76 1 1 4 1
8 0.65 1 1 4 2
9 0.30 1 2 1 1
10 0.40 1 2 1 2
11 0.33 1 2 2 1
12 0.43 1 2 2 2
13 0.39 1 2 3 1
14 0.40 1 2 3 2
15 0.29 1 2 4 1
16 0.18 1 2 4 2
17 0.46 1 3 1 1
18 0.38 1 3 1 2
19 0.27 1 3 2 1
20 0.37 1 3 2 2
21 0.37 1 3 3 1
22 0.42 1 3 3 2
23 0.45 1 3 4 1
24 0.54 1 3 4 2
25 0.18 2 1 1 1
26 0.47 2 1 1 2
27 0.53 2 1 2 1
28 0.32 2 1 2 2
29 0.40 2 1 3 1
30 0.37 2 1 3 2
31 0.31 2 1 4 1
32 0.43 2 1 4 2
33 0.35 2 2 1 1
34 0.39 2 2 1 2
35 0.37 2 2 2 1
36 0.33 2 2 2 2
37 0.42 2 2 3 1
38 0.36 2 2 3 2
39 0.20 2 2 4 1
40 0.41 2 2 4 2
41 0.37 2 3 1 1
42 0.43 2 3 1 2
43 0.28 2 3 2 1
44 0.36 2 3 2 2
45 0.18 2 3 3 1
46 0.20 2 3 3 2
47 0.26 2 3 4 1
48 0.06 2 3 4 2
The output should be this:
lab co fat
1 1 1 0.58000
2 2 1 0.34000
3 3 1 0.40750
4 1 2 0.37625
5 2 2 0.35375
6 3 2 0.26750
These are both in the form of .RData files.
How can this be done?
An example with part of the data you posted:
dt = read.table(text = "
fat co lab sam dup
0.62 1 1 1 1
0.55 1 1 1 2
0.34 1 1 2 1
0.24 1 1 2 2
0.80 1 1 3 1
0.68 1 1 3 2
0.76 1 1 4 1
0.65 1 1 4 2
0.30 1 2 1 1
0.40 1 2 1 2
0.33 1 2 2 1
0.43 1 2 2 2
0.39 1 2 3 1
0.40 1 2 3 2
0.29 1 2 4 1
0.18 1 2 4 2
", header= T)
library(dplyr)
dt %>%
group_by(lab, co) %>% # for each lab and co combination
summarise(fat = mean(fat)) %>% # get the mean of fat
ungroup() # forget the grouping
# # A tibble: 2 x 3
# lab co fat
# <int> <int> <dbl>
# 1 1 1 0.58
# 2 2 1 0.34
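For comparison, a base R sketch with aggregate() produces the same grouped means (column order aside):
aggregate(fat ~ lab + co, data = dt, FUN = mean)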
This is my data set. You can get the data from this link (if you can't, please let me know):
https://www.dropbox.com/s/1n9hpyhcniaghh5/table.csv?dl=0
LABEL DATE TAU TYPE x y z
1 A 1 2 1 0.75 7 16
2 A 1 2 0 0.41 5 18
3 A 1 2 1 0.39 6 14
4 A 2 3 0 0.65 5 14
5 A 2 3 1 0.55 7 19
6 A 2 3 1 0.69 5 19
7 A 2 3 0 0.66 7 19
8 A 3 1 0 0.38 8 15
9 A 3 1 0 0.02 5 16
10 A 3 1 0 0.71 8 13
11 B 1 2 1 0.25 9 18
12 B 1 2 0 0.06 8 20
13 B 1 2 1 0.60 8 20
14 B 1 2 0 0.56 6 13
15 B 1 3 1 0.50 8 19
16 B 1 3 0 0.04 8 16
17 B 2 1 1 0.04 5 15
18 B 2 1 1 0.75 5 13
19 B 2 1 0 0.44 8 18
20 B 2 1 1 0.52 9 13
I want to filter the data by group with multiple conditions. The conditions are:
1. the number of rows for each type (0, 1) of the TYPE variable must be greater than 1 in each group
2. the number of rows for each type must be equal (for example, the number of rows for type 1 equals the number of rows for type 0 in each group)
I have tried many times, and finally got this code and this output:
table %>% group_by(label,date,tau,type) %>% filter(n()>1) %>% filter(length(type==1)==length(type==0))
# A tibble: 16 x 7
# Groups: label, date, tau, type [7]
LABEL DATE TAU TYPE x y z
<fctr> <int> <int> <int> <dbl> <int> <int>
1 A 1 2 1 0.75 7 16
2 A 1 2 1 0.39 6 14
3 A 2 3 0 0.65 5 14
4 A 2 3 1 0.55 7 19
5 A 2 3 1 0.69 5 19
6 A 2 3 0 0.66 7 19
7 A 3 1 0 0.38 8 15
8 A 3 1 0 0.02 5 16
9 A 3 1 0 0.71 8 13
10 B 1 2 1 0.25 9 18
11 B 1 2 0 0.06 8 20
12 B 1 2 1 0.60 8 20
13 B 1 2 0 0.56 6 13
14 B 2 1 1 0.04 5 15
15 B 2 1 1 0.75 5 13
16 B 2 1 1 0.52 9 13
I am confused by the output I get with this code. I have already gotten rid of the data that didn't meet condition 1, BUT the data that didn't meet condition 2 is still inside.
The result I want is just like the below
LABEL DATE TAU TYPE x y z
<fctr> <int> <int> <int> <dbl> <int> <int>
3 A 2 3 0 0.65 5 14
4 A 2 3 1 0.55 7 19
5 A 2 3 1 0.69 5 19
6 A 2 3 0 0.66 7 19
10 B 1 2 1 0.25 9 18
11 B 1 2 0 0.06 8 20
12 B 1 2 1 0.60 8 20
13 B 1 2 0 0.56 6 13
And if I want to compute a value with the function below for each row, how should I code it? Should I just use mutate()?
f(x,y,z) = 2 * x + y - z / 3 if TYPE == 1
f(x,y,z) = 4 * x - y / 2 + z / 3 if TYPE == 0
I hope someone can help me, and I appreciate your help! If I need to provide any other information, just let me know.
# example dataset
df = read.table(text = "
LABEL DATE TAU TYPE x y z
1 A 1 2 1 0.75 7 16
2 A 1 2 0 0.41 5 18
3 A 1 2 1 0.39 6 14
4 A 2 3 0 0.65 5 14
5 A 2 3 1 0.55 7 19
6 A 2 3 1 0.69 5 19
7 A 2 3 0 0.66 7 19
8 A 3 1 0 0.38 8 15
9 A 3 1 0 0.02 5 16
10 A 3 1 0 0.71 8 13
11 B 1 2 1 0.25 9 18
12 B 1 2 0 0.06 8 20
13 B 1 2 1 0.60 8 20
14 B 1 2 0 0.56 6 13
15 B 1 3 1 0.50 8 19
16 B 1 3 0 0.04 8 16
17 B 2 1 1 0.04 5 15
18 B 2 1 1 0.75 5 13
19 B 2 1 0 0.44 8 18
20 B 2 1 1 0.52 9 13
", header=T, stringsAsFactors=F)
library(dplyr)
library(tidyr)
# function to use for each row
# (assumes that type can be only 1 or 0)
f = function(t,x,y,z) { ifelse(t == 1,
2 * x + y - z / 3,
4 * x - y / 2 + z / 3) }
df %>%
count(LABEL, DATE, TAU, TYPE) %>% # count rows for each group (based on those combinations)
filter(n > 1) %>% # keep groups with multiple rows
mutate(TYPE = paste0("TYPE_",TYPE)) %>% # update variable
spread(TYPE, n, fill = 0) %>% # reshape data
filter(TYPE_0 == TYPE_1) %>% # keep groups with equal number of rows for type 0 and 1
select(LABEL, DATE, TAU) %>% # keep variables/groups of interest
inner_join(df, by=c("LABEL", "DATE", "TAU")) %>% # join back info
mutate(f_value = f(TYPE,x,y,z)) # apply function
# # A tibble: 8 x 8
# LABEL DATE TAU TYPE x y z f_value
# <chr> <int> <int> <int> <dbl> <int> <int> <dbl>
# 1 A 2 3 0 0.65 5 14 4.76666667
# 2 A 2 3 1 0.55 7 19 1.76666667
# 3 A 2 3 1 0.69 5 19 0.04666667
# 4 A 2 3 0 0.66 7 19 5.47333333
# 5 B 1 2 1 0.25 9 18 3.50000000
# 6 B 1 2 0 0.06 8 20 2.90666667
# 7 B 1 2 1 0.60 8 20 2.53333333
# 8 B 1 2 0 0.56 6 13 3.57333333
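If you only need the filtering, a sketch that skips the reshape and filters directly within each group (again assuming TYPE is only 0 or 1) returns the same eight rows:
df %>%
group_by(LABEL, DATE, TAU) %>%
filter(sum(TYPE == 0) > 1,                 # condition 1: more than one row of each type
sum(TYPE == 1) > 1,
sum(TYPE == 0) == sum(TYPE == 1)) %>%      # condition 2: equal counts of type 0 and 1
ungroup() %>%
mutate(f_value = f(TYPE, x, y, z))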
Problem: how can I fill backwards all rows in a group that come before an occurrence of a certain value? I am not trying to fill in NA or missing values using zoo's na.locf. In the following, within each ID group, I would like to fill all rows of A that come before the 1.00 occurs with 1.00, ideally using dplyr.
Input:
data<- data.frame(ID=c(1,1,1,1,2,2,2,3,3,3,4,4,4,4,4),
time=c(1,2,3,4,1,2,3,1,2,3,1,2,3,4,5),
A=c(0.10,0.25,1,0,0.25,1,0.25,0,1,0.10,1,0.10,0.10,0.10,0.05))
ID time A
1 1 0.10
1 2 0.25
1 3 1.00
1 4 0.00
2 1 0.25
2 2 1.00
2 3 0.25
3 1 0.00
3 2 1.00
3 3 0.10
4 1 1.00
4 2 0.10
4 3 0.10
4 4 0.10
4 5 0.05
Desired output:
ID time A
1 1 1.00
1 2 1.00
1 3 1.00
1 4 0.00
2 1 1.00
2 2 1.00
2 3 0.25
3 1 1.00
3 2 1.00
3 3 0.10
4 1 1.00
4 2 0.10
4 3 0.10
4 4 0.10
4 5 0.05
After grouping by ID you can check the cumulative sum of 1's and, where it is still below 1 (i.e. the 1 has not yet appeared), replace the A-value with 1:
data %>%
group_by(ID) %>%
mutate(A = replace(A, cumsum(A == 1) < 1, 1))
# Source: local data frame [15 x 3]
# Groups: ID [4]
#
# ID time A
# <dbl> <dbl> <dbl>
# 1 1 1 1.00
# 2 1 2 1.00
# 3 1 3 1.00
# 4 1 4 0.00
# 5 2 1 1.00
# 6 2 2 1.00
# 7 2 3 0.25
# 8 3 1 1.00
# 9 3 2 1.00
# 10 3 3 0.10
# 11 4 1 1.00
# 12 4 2 0.10
# 13 4 3 0.10
# 14 4 4 0.10
# 15 4 5 0.05
Quite similarly, you could also use cummax:
data %>% group_by(ID) %>% mutate(A = replace(A, !cummax(A == 1), 1))
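To see why that works: cummax() turns the logical vector into 0s up to the first TRUE and 1s from there on, so its negation flags exactly the rows before the first 1:
cummax(c(0.25, 1, 0.25) == 1)  # 0 1 1
!cummax(c(0.25, 1, 0.25) == 1) # TRUE FALSE FALSE -> the rows to replace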
And here's a base R approach:
transform(data, A = ave(A, ID, FUN = function(x) replace(x, !cummax(x == 1), 1)))
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(data)), find the row where 'A' is 1, build the sequence of rows up to it, and use that as i to assign (:=) the value 1 to 'A'.
library(data.table)
setDT(data)[data[, .I[seq_len(which(A==1))], ID]$V1, A := 1][]
# ID time A
# 1: 1 1 1.00
# 2: 1 2 1.00
# 3: 1 3 1.00
# 4: 1 4 0.00
# 5: 2 1 1.00
# 6: 2 2 1.00
# 7: 2 3 0.25
# 8: 3 1 1.00
# 9: 3 2 1.00
#10: 3 3 0.10
#11: 4 1 1.00
#12: 4 2 0.10
#13: 4 3 0.10
#14: 4 4 0.10
#15: 4 5 0.05
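An alternative data.table sketch (same idea as the cummax answer above) updates by group instead of building row indices:
setDT(data)[, A := replace(A, !cummax(A == 1), 1), by = ID][]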
Or we can use ave from base R:
data$A[with(data, ave(A==1, ID, FUN = cumsum)<1)] <- 1