I've been trying to use prop.table() to get the proportions of data I have but keep getting errors. My data is..
Letter Total
a 10
b 34
c 8
d 21
. .
. .
. .
z 2
I want a third column that gives the proportion of each letter.
My original data is in a data frame so I've tried converting to a data table and then using prop.table ..
testtable = table(lettersdf)
prop.table(testtable)
When I try this I keep getting the error,
Error in margin.table(x, margin) : 'x' is not an array
Any help or advise is appreciated.
:)
If the Letter column in your data does not have duplicate values, like this
Df <- data.frame(
Letter=letters,
Total=sample(1:50,26),
stringsAsFactors=F)
you can just do this instead of using prop.table:
Df$Prop <- Df$Total/sum(Df$Total)
> head(Df)
Letter Total Prop
1 a 45 0.074875208
2 b 1 0.001663894
3 c 13 0.021630616
4 d 15 0.024958403
5 e 24 0.039933444
6 f 39 0.064891847
> sum(Df[,3])
[1] 1
If there are duplicated values, like in this object
Df2 <- data.frame(
Letter=sample(letters,50,replace=T),
Total=sample(1:50,50),
stringsAsFactors=F)
you can make a table to sum the frequency of unique Letters,
Table <- table(rep(Df2$Letter,Df2$Total))
> Table
a b c d e f h j k l m n o p q t v w x y z
48 16 99 2 40 75 45 42 66 6 62 27 88 99 32 96 85 64 53 161 69
and then use prop.table on this table object:
> prop.table(Table)
a b c d e f h j k l m
0.037647059 0.012549020 0.077647059 0.001568627 0.031372549 0.058823529 0.035294118 0.032941176 0.051764706 0.004705882 0.048627451
n o p q t v w x y z
0.021176471 0.069019608 0.077647059 0.025098039 0.075294118 0.066666667 0.050196078 0.041568627 0.126274510 0.054117647
You could also make this into a data.frame:
Df2.table <- cbind(
data.frame(Table,stringsAsFactors=F),
Prop=as.numeric(prop.table(Table)))
> head(Df2.table)
Var1 Freq Prop
1 a 48 0.037647059
2 b 16 0.012549020
3 c 99 0.077647059
4 d 2 0.001568627
5 e 40 0.031372549
6 f 75 0.058823529
Related
I am working with the R programming language.
I simulated this dataset which contains 1000 coin flips - then I calculated the number of "2 Flip Sequences":
Coin <- c('H', 'T')
Results = sample(Coin,1000, replace = TRUE)
My_Data = data.frame(id = 1:1000, Results)
Pairs = data.frame(first = head(My_Data$Results, -1), second = tail(My_Data$Results, -1))
Final = as.data.frame(table(Pairs))
first second Freq
1 H H 255
2 T H 245
3 H T 246
4 T T 253
I am curious - is it possible to extend the above code for "3 Flip Sequences"?
For example - I tried modifying parts of the code to see how the results change (and hoped to stumble across the correct way to write this code):
# First Attempt
Pairs = data.frame(first = head(My_Data$Results, -1), second = head(My_Data$Results, -1) , third = tail(My_Data$Results, -1))
Final = as.data.frame(table(Pairs))
first second third Freq
1 H H H 255
2 T H H 245
3 H T H 0
4 T T H 0
5 H H T 0
6 T H T 0
7 H T T 246
8 T T T 253
# Second Attempt
Pairs = data.frame(first = head(My_Data$Results, -1), second = tail(My_Data$Results, -1) , third = tail(My_Data$Results, -1))
Final = as.data.frame(table(Pairs))
first second third Freq
1 H H H 255
2 T H H 0
3 H T H 0
4 T T H 245
5 H H T 246
6 T H T 0
7 H T T 0
8 T T T 253
I am not sure which of these options are correct?
In general, I am looking to understand the logic as to how I can adapt the above code for an "arbitrary number of coin flips" (e.g. "4 flip sequences", "5 flip sequences", etc.)
Also, this might not be the most efficient way to calculate these frequencies - I would also be interested in learning about other ways that might be more efficient ( e.g. as the overall size of the data increases).
Thanks!
It might be helpful to work with strings.
coin <- c("H", "T")
results <- sample(coin, 1000, replace = TRUE)
Then to get sequence counts (assuming overlapping sequences also count) for triples, we could do something like:
triples <- table(
sapply(
1:(length(results) - 3),
function(i) sprintf(
"%s%s%s",
results[i],
results[i + 1],
results[i + 2]
)
)
)
which gives me something like:
HHH HHT HTH HTT THH THT TTH TTT
132 129 138 115 129 124 116 114
This idea could be generalized fairly easily, for example:
n_sequences <- function(n, results) {
helper <- function(i, n) if (n < 1) "" else sprintf(
"%s%s",
helper(i, n - 1),
results[i + n - 1]
)
result <- data.frame(
table(
sapply(
1:(length(results) - n + 1),
function(i) helper(i, n)
)
)
)
colnames(result) <- c("Sequence", "Frequency")
result
}
For example:
n_sequences(5, results)
Gives me something like:
Sequence Frequency
1 HHHHH 34
2 HHHHT 31
3 HHHTH 36
4 HHHTT 31
5 HHTHH 35
6 HHTHT 36
7 HHTTH 20
8 HHTTT 37
9 HTHHH 35
10 HTHHT 34
11 HTHTH 41
12 HTHTT 27
13 HTTHH 27
14 HTTHT 24
15 HTTTH 34
16 HTTTT 30
17 THHHH 31
18 THHHT 36
19 THHTH 36
20 THHTT 26
21 THTHH 34
22 THTHT 32
23 THTTH 31
24 THTTT 27
25 TTHHH 32
26 TTHHT 28
27 TTHTH 25
28 TTHTT 31
29 TTTHH 33
30 TTTHT 31
31 TTTTH 30
32 TTTTT 20
You could first cut along 3 + 1 breaks, split it along the levels. The interaction can now be tabled to get the result.
My_Data$cut3 <- cut(seq_len(nrow(My_Data)), seq.int(1, nrow(My_Data), length.out=3 + 1), include.lowest=TRUE)
(res <- interaction(split(My_Data$Results, My_Data$cut3)) |> table() |> as.data.frame())
# Var1 Freq
# 1 H.H.H 51
# 2 T.H.H 58
# 3 H.T.H 43
# 4 T.T.H 49
# 5 H.H.T 38
# 6 T.H.T 51
# 7 H.T.T 64
# 8 T.T.T 46
To get the desired output, we can strsplit Var1.
strsplit(as.character(res$Var1), '\\.') |> do.call(what=rbind) |>
cbind.data.frame(res$Freq) |> setNames(c('first', 'second', 'third', 'Freq'))
# first second third Freq
# 1 H H H 51
# 2 T H H 58
# 3 H T H 43
# 4 T T H 49
# 5 H H T 38
# 6 T H T 51
# 7 H T T 64
# 8 T T T 46
Note, that nrow of your data should be divisible by 3.
Edit
To generalize, we may write a small function.
f <- \(x, n) {
ct <- cut(seq_len(nrow(x)), seq.int(1L, nrow(x), length.out=n + 1L), include.lowest=TRUE)
res <- interaction(split(x$Results, ct)) |> table() |> as.data.frame()
strsplit(as.character(res$Var1), '\\.') |> do.call(what=rbind) |>
cbind.data.frame(res$Freq) |> setNames(c(LETTERS[seq_len(n)], 'Freq'))
}
f(My_Data, 4)
# A B C D Freq
# 1 H H H H 13
# 2 T H H H 25
# 3 H T H H 18
# 4 T T H H 17
# 5 H H T H 18
# 6 T H T H 15
# 7 H T T H 21
# 8 T T T H 24
# 9 H H H T 26
# 10 T H H T 15
# 11 H T H T 16
# 12 T T H T 18
# 13 H H T T 22
# 14 T H T T 18
# 15 H T T T 10
# 16 T T T T 24
Data:
set.seed(42)
My_Data <- data.frame(id=1:1200, Results=sample(c('H', 'T'), 1200, replace=TRUE))
A slightly generalized solution with tidyverse tools. Change the sets variable for longer or shorter sequences.
coin <- c("H", "T")
sets <- 4
rolls <- 10000
results <- sample(coin, sets * rolls, rep = TRUE)
named_results <- purrr::map_chr(
0:(rolls - 1),
~ paste0(results[(sets * .x + 1):(sets * .x + sets)],
collapse = ""
)
)
dplyr::count(tibble::tibble(x = named_results), x)
with output
# A tibble: 16 x 2
x n
<chr> <int>
1 HHHH 629
2 HHHT 627
3 HHTH 638
4 HHTT 599
5 HTHH 602
6 HTHT 633
7 HTTH 596
8 HTTT 661
9 THHH 631
10 THHT 589
11 THTH 633
12 THTT 647
13 TTHH 660
14 TTHT 637
15 TTTH 623
16 TTTT 595
sets = 8 would give something like
# A tibble: 256 x 2
x n
<chr> <int>
1 HHHHHHHH 37
2 HHHHHHHT 36
3 HHHHHHTH 43
4 HHHHHHTT 35
5 HHHHHTHH 38
6 HHHHHTHT 27
7 HHHHHTTH 32
8 HHHHHTTT 28
9 HHHHTHHH 33
10 HHHHTHHT 38
# ... with 246 more rows
I have a big database and I'm trying to create a new column starting from an existing one doing the difference between elements in consecutive cells ( same column, different row):
existing_column
new_column
A
A-B
B
B-C
C
C-D
D
D-E
...
...
Z
Z-NULL
The way I'm doing it is to duplicate existing column into a dummy one, remove first element, adding NULL as last element and subtracting the existing column and the dummy one ... is there a better way? Thank you
exist <-c("A","B","C","D","E")
db<-data.frame(exist)
dummy<-exist[-1]
dummy[length(dummy)+1]<-"NULL"
new_col<-paste(exist,"-",dummy)
new_col
db<-data.frame(exist,new_col)
db
Does this work:
library(dplyr)
df <- data.frame(existing_column = LETTERS)
df %>% mutate(new_column = paste(existing_column, lead(existing_column, default = 'NULL'), sep = '-'))
existing_column new_column
1 A A-B
2 B B-C
3 C C-D
4 D D-E
5 E E-F
6 F F-G
7 G G-H
8 H H-I
9 I I-J
10 J J-K
11 K K-L
12 L L-M
13 M M-N
14 N N-O
15 O O-P
16 P P-Q
17 Q Q-R
18 R R-S
19 S S-T
20 T T-U
21 U U-V
22 V V-W
23 W W-X
24 X X-Y
25 Y Y-Z
26 Z Z-NULL
Try the code below
transform(
df,
new_column = paste(existing_column, c(existing_column[-1], NA), sep = "-")
)
which gives
existing_column new_column
1 A A-B
2 B B-C
3 C C-D
4 D D-E
5 E E-F
6 F F-G
7 G G-H
8 H H-I
9 I I-J
10 J J-K
11 K K-L
12 L L-M
13 M M-N
14 N N-O
15 O O-P
16 P P-Q
17 Q Q-R
18 R R-S
19 S S-T
20 T T-U
21 U U-V
22 V V-W
23 W W-X
24 X X-Y
25 Y Y-Z
26 Z Z-NA
If you are working with numeric data just represented as characters in your example, you can use mutate() and lead()
df<-data.frame(old_col=sample(1:10))
df%>%mutate(new_col=old_col-lead(old_col, default = 0))
old_col new_col
1 10 4
2 6 -3
3 9 8
4 1 -1
5 2 -5
6 7 3
7 4 1
8 3 -5
9 8 3
10 5 5
In case there is a need of a fast data.table version
dt[, new_column:=paste(exist, shift(exist, type="lead"), sep="-")]
Edit. Turns it isn't much faster:
df = data.table(exist = rep(letters, 80000))
> m = microbenchmark::microbenchmark(
... a = df %>% mutate(new_column = paste(exist, lead(exist, default = 'NULL'), sep = '-')),
...
... b = transform(
... df,
... new_column = paste(exist, c(exist[-1], NA), sep = "-")
... ),
...
... d = df[, new_column := paste(exist, shift(exist, type="lead"), sep="-")]
... )
> m
Unit: milliseconds
expr min lq mean median uq max neval
a 292.2430 309.6150 342.0191 323.9778 361.0937 603.8449 100
b 349.4509 383.3391 475.0177 423.8864 472.0276 2136.2970 100
d 294.6786 302.8530 332.3989 315.6228 340.9642 641.8345 100
I have a list:
l1<-list(A=1:10, B=100:120, C=300:310, D=400:430)
How do I convert it to dataframe with 2 columns:
C1 C2
R1 1 A
R2 2 A
...
R10 10 A
R11 100 B
R12 101 B
....
R73 429 D
R73 430 D
I tried:
df1 <- data.frame(matrix(unlist(l1), nrow=length(l1), byrow=T))
But I'm getting an error because the vectors in my list have multiple lengths. Also my actual list consist of Dates and not just integers.
Just use stack:
stack(l1)
> head(stack(l1))
values ind
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
> tail(stack(l1))
values ind
68 425 D
69 426 D
70 427 D
71 428 D
72 429 D
73 430 D
Update
stack won't work with dates. If you have actual date objects, you can do:
data.frame(ind = rep(names(l1), lengths(l1)),
val = as.Date(unlist(l1), origin = "1970-01-01"))
or
data.frame(ind = rep(names(l1), lengths(l1)), val = do.call(c, l1))
Sample data:
l1<-list(A=Sys.Date()+(1:10),
B=Sys.Date()+(100:120),
C=Sys.Date()+(300:310),
D=Sys.Date()+(400:430))
Here's one method: Similar to #Duck answer using Map and do.call
tmp <- Map(data.frame,N = l1,L = names(l1))
out <- do.call(rbind,tmp)
rownames(out) <- NULL
> tail(out)
N L
68 425 D
69 426 D
70 427 D
71 428 D
72 429 D
73 430 D
Maybe a long solution, but using mapply() and do.call() you can reach the expected result. First, you can extract the names of the list as well as the number of elements. Then, using mapply() you can create a list for the first column in your desired result. After that you combine mapply(), do.call(), rbind() and cbind() to end up with df. Here the code:
#Code
#names
v1 <- names(l1)
#length
v2 <- unlist(lapply(l1, length))
#Create values
l2 <- mapply(function(x,y) rep(x,y),v1,v2)
#Bind
df <- as.data.frame(do.call(rbind,mapply(cbind,l2,l1)))
df$V2 <- as.numeric(df$V2)
Output (some rows):
head(df,15)
V1 V2
1 A 1
2 A 24
3 A 25
4 A 37
5 A 69
6 A 70
7 A 71
8 A 72
9 A 73
10 A 2
11 B 3
12 B 4
13 B 5
14 B 6
15 B 7
suppose I have the following dataframe
x <- c(12,30,45,100,150,305,2,46,10,221)
x2 <- letters[1:10]
df <- data.frame(x,x2)
df <- df[with(df, order(x)), ]
x x2
7 2 g
9 10 i
1 12 a
2 30 b
3 45 c
8 46 h
4 100 d
5 150 e
10 221 j
6 305 f
And I would like to split these into groups based on another vector,
v <- seq(0, 500, 50)
Basically, I would like to partition out each row based on column x and how it matches with to v ( so for example x <= an element in v) - the location/index of that element in v is then used to assign a group for that row. The resulting table should look something like the following:
x x2 group
7 2 g g1
9 10 i g1
1 12 a g1
2 30 b g1
3 45 c g1
8 46 h g2
4 100 d g3
5 150 e g4
10 221 j g4
6 305 f g6
I could try to loop through each row and try and match it to v but I'm still confuse as to how I could easily detect where the match x<=element v occurs so that I can assign a group id to it. thanks.
You can use cut to break up df$x by the values of v:
df$group <- as.numeric(cut(df$x, breaks = v))
df$group <- paste0('g', df$group)
cut returns a factor so you can use as.numeric to just pull out which numeric bucket the value of df$x falls into based on v.
After I'm done with some manipulation in Dataframe, I got a result dataframe. But the index are not listed properly as below.
MsgType/Cxr NoOfMsgs AvgElpsdTime(ms)
161 AM 86 30.13
171 CM 1 104
18 CO 27 1244.81
19 US 23 1369.61
20 VK 2 245
21 VS 11 1273.82
112 fqa 78 1752.22
24 SN 78 1752.22
I would like to get the result as like below.
MsgType/Cxr NoOfMsgs AvgElpsdTime(ms)
1 AM 86 30.13
2 CM 1 104
3 CO 27 1244.81
4 US 23 1369.61
5 VK 2 245
6 VS 11 1273.82
7 fqa 78 1752.22
8 SN 78 1752.22
Please guide how I can get this ?
These are the rownames of your dataframe, which by default are 1:nrow(dfr). When you reordered the dataframe, the original rownames are also reordered. To have the rows of the new order listed sequentially, just use:
rownames(dfr) <- 1:nrow(dfr)
Or, simply
rownames(df) <- NULL
gives what you want.
> d <- data.frame(x = LETTERS[1:5], y = letters[1:5])[sample(5, 5), ]
> d
x y
5 E e
4 D d
3 C c
2 B b
1 A a
> rownames(d) <- NULL
> d
x y
1 E e
2 D d
3 C c
4 B b
5 A a
The index is actually the data frame row names. To change them, you can do something like:
rownames(dd) = 1:dim(dd)[1]
or
rownames(dd) = 1:nrow(dd)
Personally, I never use rownames.
In your example, I suspect that you don't need to worry about them either, since you are just renaming them 1 to n. In particular, when you subset your data frame the rownames will again be incorrect. For example,
##Simple data frame
R> dd = data.frame(a = rnorm(6))
R> dd$type = c("A", "B")
R> rownames(dd) = 1:nrow(dd)
R> dd
a type
1 2.1434 A
2 -1.1067 B
3 0.7451 A
4 -0.1711 B
5 1.4348 A
6 -1.3777 B
##Basic subsetting
R> dd_sub = dd[dd$type=="A",]
##Rownames are "wrong"
R> dd_sub
a type
1 2.1434 A
3 0.7451 A
5 1.4348 A