Wide table to long table in R

I have a data frame called "result" that looks like this:
result <- structure(list(D = c(2986.286, 2842.54, 2921), E = c(3020.458,
2943.926, 2860.763), F = c(3008.644, 3142.134, 3002.515), G = c(2782.983,
3135.148, 2873.025), H = c(2874.082, 3066.655, 2778.107), I = c(2592.377,
3017.99, 2859.603), J = c(3051.184, 3011.467, 3007.769)), class = "data.frame", row.names = c("Above average",
"Below average", "very Good"))
I have tried the following:
result_long <- pivot_longer(result, 1:7, names_to = "combination", values_to = "price.m")
But this way I lose my "index" ("Above average", ..., "very Good"), as my result is:
# A tibble: 21 x 2
   combination price.m
   <chr>         <dbl>
1 D 2986.
2 E 3020.
3 F 3009.
4 G 2783.
5 H 2874.
6 I 2592.
7 J 3051.
8 D 2843.
9 E 2944.
10 F 3142.
# ... with 11 more rows
Does anyone know how I can achieve the same result but keep the "index" ("Above average", "Below average", "very Good") as a column?

While a tibble can have row names (e.g., when converting from a regular data frame), they are removed when subsetting with the [ operator. A warning will be raised when attempting to assign non-NULL row names to a tibble.
Generally, it is best to avoid row names, because they are basically a character column with different semantics than every other column.
https://tibble.tidyverse.org/reference/rownames.html
In your case pivot_longer() is dropping the row names, but you can save them as a column with rownames_to_column() from the tibble package before reshaping with pivot_longer(), like this:
library(tibble)
library(tidyr)
library(dplyr)

result_long <- result %>%
  rownames_to_column("id") %>%
  pivot_longer(
    -id,
    names_to = "combination",
    values_to = "price.m"
  )
# A tibble: 21 x 3
id combination price.m
<chr> <chr> <dbl>
1 Above average D 2986.
2 Above average E 3020.
3 Above average F 3009.
4 Above average G 2783.
5 Above average H 2874.
6 Above average I 2592.
7 Above average J 3051.
8 Below average D 2843.
9 Below average E 2944.
10 Below average F 3142.
# ... with 11 more rows
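If you prefer to skip the explicit rownames_to_column() step, as_tibble() also has a rownames argument that captures the row names as a column on the way in; a minimal sketch of the same idea:
library(dplyr)
library(tibble)
library(tidyr)

# as_tibble(rownames = "id") stores the row names in a new "id" column
result_long <- result %>%
  as_tibble(rownames = "id") %>%
  pivot_longer(-id, names_to = "combination", values_to = "price.m")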

In base R, we may convert to a matrix and call as.data.frame.table(), which treats the matrix as a table and returns the long format:
as.data.frame.table(as.matrix(result))
Output:
Var1 Var2 Freq
1 Above average D 2986.286
2 Below average D 2842.540
3 very Good D 2921.000
4 Above average E 3020.458
5 Below average E 2943.926
6 very Good E 2860.763
7 Above average F 3008.644
8 Below average F 3142.134
...
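If you want the base R result to use the same column names as the tidyverse version, the default Var1/Var2/Freq columns can be renamed with setNames(); a small follow-up sketch (not part of the original answer):
result_long <- setNames(as.data.frame.table(as.matrix(result)),
                        c("id", "combination", "price.m"))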

Related

Rename columns using values from separate dataframe

I have a tibble whose column names contain spaces and special characters, which makes it a hassle to work with. I want to change these column names to easier-to-use names while I'm working with the data, and then change them back to the original names at the end for display. Ideally, I want to be able to do this as part of a pipe; however, I haven't figured out how to do it with rename_with().
Sample data:
library(tibble)

df <- tibble(oldname1 = seq(1:10),
             oldname2 = letters[seq(1:10)],
             oldname3 = LETTERS[seq(1:10)])

cols_lookup <- tibble(old_names = c("oldname4", "oldname2", "oldname1"),
                      new_names = c("newname4", "newname2", "newname1"))
Desired output:
> head(df_renamed)
# A tibble: 6 x 3
newname1 newname2 oldname3
<int> <chr> <chr>
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
Some columns are removed and reordered during this work, so when converting them back there will be entries in the cols_lookup table that are no longer in df. There are also new columns created in df whose names I want to leave unchanged.
I am aware there are similar questions that have already been asked; however, the answers either don't work well with tibbles or in a pipe (e.g. those using match()), or don't work if the columns aren't all present in the same order in both tables.
We can use rename_at(). From the master lookup table, filter the rows whose 'old_names' are present in the dataset (filtered_lookup), then use that in rename_at(), specifying the 'old_names' in vars() and replacing them with the 'new_names':
library(dplyr)

filtered_lookup <- cols_lookup %>%
  filter(old_names %in% names(df))

df %>%
  rename_at(vars(filtered_lookup$old_names), ~ filtered_lookup$new_names)
Or, using rename_with(), apply the same logic:
df %>%
  rename_with(.fn = ~ filtered_lookup$new_names, .cols = filtered_lookup$old_names)
Another option is rename() with splicing (!!!) of a named vector:
library(tibble)
df %>%
  rename(!!!deframe(filtered_lookup[2:1]))
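For reference, deframe() builds the named vector that rename() expects, i.e. c(new = "old"), from the lookup table reordered so that new_names comes first; with the sample data it should look roughly like this:
deframe(filtered_lookup[2:1])
#   newname2   newname1
# "oldname2" "oldname1"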
You can also use rename_() (now deprecated in current dplyr) together with setNames():
cols_lookup <- tibble(old_names = c("oldname3", "oldname2", "oldname1"),
                      new_names = c("newname3", "newname2", "newname1"))

rename_(df, .dots = setNames(cols_lookup$old_names, cols_lookup$new_names))
Output:
# A tibble: 10 x 3
newname1 newname2 newname3
<int> <chr> <chr>
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
7 7 g G
8 8 h H
9 9 i I
10 10 j J
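Since rename_() is deprecated in current dplyr, an equivalent call with rename() and splicing might look like this (a sketch, assuming, as in the lookup above, that all old_names exist in df):
library(dplyr)

df %>%
  rename(!!!setNames(cols_lookup$old_names, cols_lookup$new_names))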

Finding the mode using dplyr

I have some code written using the dplyr package. I want to calculate the mode. Currently I get results back with a column which says "character" all the way down. The mode will be the most frequently occurring value, which in my case could be a letter, a number, or a symbol.
eth.data <- data.comb %>%
  group_by(Ethnicity, `Qualification Title`, `Qualification Number`, `OutGrade`) %>%
  summarise(`Number of Learners` = n(), `Mode` = mode(`OutGrade`)) %>%
  group_by(`Qualification Number`) %>%
  mutate(`Total Number of Learners` = sum(`Number of Learners`)) %>%
  arrange(`Total Number of Learners`)
Take a look at ?mode. mode() tells you the storage mode of an object (e.g. "character" for character vectors). If you want the statistical mode, write your own function (see this question).
Also, if you group_by OutGrade, then you will have exactly one unique OutGrade within each group in the summarise() call, so don't do that.
Let us set up an example (which you should do when you are asking a question!).
df <- data.frame(group = rep(LETTERS[1:5], each = 20),
                 grade = sample(letters[1:15], 100, replace = TRUE))

mymode <- function(x) {
  t <- table(x)
  names(t)[which.max(t)]
}
df %>% group_by(group) %>% summarise(mode=mymode(grade))
The result is what you want:
# A tibble: 5 x 2
group mode
<chr> <chr>
1 A l
2 B f
3 C g
4 D g
5 E c
Note that if you did group_by(group, grade), the summarise function would be called for each combination of group and grade, so the results would have been very different:
# A tibble: 55 x 3
# Groups: group [5]
group grade mode
<chr> <chr> <chr>
1 A a a
2 A b b
3 A f f
4 A h h
5 A i i
6 A k k
7 A l l
8 A m m
9 A n n
10 B a a
# … with 45 more rows
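Note that which.max() keeps only the first value when several grades tie for the highest count. If you want every tied mode, one possible variant (an assumption about how you'd want ties handled, with a made-up name mymode_all) is:
library(dplyr)

mymode_all <- function(x) {
  t <- table(x)
  names(t)[t == max(t)]   # every value that ties for the highest count
}

df %>%
  group_by(group) %>%
  summarise(mode = paste(mymode_all(grade), collapse = ", "))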

Add a label in a new column according to a string match in R

I did a string match with str_detect() from stringr and filtered the data in my df accordingly.
df
variable x y z
AN B C D
EF F G H
The code is:
df_filtered <- df %>% filter(str_detect(variable, paste(dict, collapse="|")))
"dict" is my list of words (a character vector) that I want to detect in my data frame.
dict
A
C
D
G
and I obtained:
variable x y z
AN B C D
I want to add a new column to each extracted row, containing the element of dict that matched:
variable x y z dict
AN B C D A
How can I do this?
If you can be sure that there is only one dict entry per line, the code is quite simple.
library(tidyverse)
dict <- c("a", "c", "d", "g")
# I create a random dataframe
(df <- tibble(variable = stringi::stri_rand_strings(1000, 3, pattern = "[a-z]")))
# A tibble: 1,000 x 1
variable
<chr>
1 tmx
2 rgq
3 pkm
4 tue
5 wet
6 slx
7 lkq
8 std
9 ivu
10 vyt
# ... with 990 more rows
# I map your dict list to the dataframe
(df_out <- map_df(dict, ~ filter(df, str_detect(variable, .x)) %>%
                    mutate(out = str_extract(variable, .x))))
# A tibble: 437 x 2
variable out
<chr> <chr>
1 rar a
2 cam a
3 kba a
4 wax a
5 zta a
6 aep a
7 wao a
8 bga a
9 auv a
10 bea a
# ... with 427 more rows
# Merge all dict-hits per entry
(df_out <- df_out %>%
   nest(out, .key = "out") %>%
   mutate(out = map_chr(out, ~ str_c(.x$out, collapse = "_"))))
# A tibble: 379 x 2
variable out
<chr> <chr>
1 rar a
2 cam a_c
3 kba a
4 wax a
5 zta a
6 aep a
7 wao a
8 bga a_g
9 auv a
10 bea a
# ... with 369 more rows
If you run this code with more than one dict entry per line, it will generate one row per dict hit; the nest()/str_c() step above then merges those hits back into a single row per entry.
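If each row can match at most one dict entry, a shorter route (just a sketch, reusing the same stringr pattern; dict_match is an arbitrary column name) is to extract the match directly while filtering:
library(dplyr)
library(stringr)

pattern <- paste(dict, collapse = "|")

df_filtered <- df %>%
  mutate(dict_match = str_extract(variable, pattern)) %>%  # first matching dict entry, NA if none
  filter(!is.na(dict_match))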

Minimum value matching across values in multiple columns

I would like to return a dataframe with the minimum value of column one based on the values of columns 2-4:
df <- data.frame(one = rnorm(1000),
                 two = sample(letters, 1000, replace = TRUE),
                 three = sample(letters, 1000, replace = TRUE),
                 four = sample(letters, 1000, replace = TRUE))
I can do:
df_group <- df %>%
  group_by(two) %>%
  filter(one == min(one))
This gets me the lowest value of all the "m's" in column two, but what if column three or four had a lower "m" value in column one?
The output should look like this:
one two
1 -0.311609752 r
2 0.053166742 n
3 1.546485810 a
4 -0.430308725 d
5 -0.145428664 c
6 0.419181639 u
7 0.008881661 i
8 1.223517580 t
9 0.797273157 b
10 0.790565358 v
11 -0.560031797 e
12 -1.546234090 q
13 -1.847945540 l
14 -1.489130228 z
15 -1.203255034 g
16 0.146969892 m
17 -0.552363433 f
18 -0.006234646 w
19 0.982932856 s
20 0.751936728 o
21 0.220751258 h
22 -1.557436228 y
23 -2.034885868 k
24 -0.463354387 j
25 -0.351448850 p
26 1.331365941 x
I don't care which column has the lowest value for a given letter, I just need the lowest value and the letter column.
I'm trying to wrap my head around writing this simplistically. This might be a duplicate, but I didn't know how to word the title and couldn't find any material or previous questions on how to do it.
Another solution based on data.table:
library(data.table)
setDT(df)

# melt the letter columns into a single value column, then take min(one) per letter
melt(df,
     measure = grep("one", names(df), invert = TRUE, value = TRUE))[
  , min(one), by = value]
You can do something like this:
library(dplyr)
library(tidyr)

df %>%
  gather(cols, letts, -one) %>%   # gather all letters into one column
  group_by(letts) %>%
  summarise(one = min(one))       # group-wise minimum for each letter
# A tibble: 26 × 2
# letts one
# <chr> <dbl>
#1 a -2.092327
#2 b -2.461102
#3 c -3.055858
#4 d -2.092327
#5 e -2.461102
#6 f -2.249439
#7 g -1.941632
#8 h -2.543310
#9 i -3.055858
#10 j -1.896974
# ... with 16 more rows
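With current tidyr, gather() is superseded by pivot_longer(); a sketch of the equivalent call:
library(dplyr)
library(tidyr)

df %>%
  pivot_longer(-one, values_to = "letts") %>%  # stack columns two/three/four into "letts"
  group_by(letts) %>%
  summarise(one = min(one))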

Getting a summary data frame for all the combinations of categories represented in two columns

I am working with a data frame corresponding to the example below:
set.seed(1)
dta <- data.frame("CatA" = rep(c("A", "B", "C"), 4), "CatNum" = rep(1:2, 6),
                  "SomeVal" = runif(12))
I would like to quickly build a data frame with sum values for all the combinations of the categories derived from CatA and CatNum, as well as for the categories derived from each column separately. On the primitive example above, for the first couple of combinations, this can be achieved with simple code:
df_sums <- data.frame(
  "Category" = c("Total for A",
                 "Total for A and 1",
                 "Total for A and 2"),
  "Sum" = c(sum(dta$SomeVal[dta$CatA == 'A']),
            sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 1]),
            sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 2]))
)
This produces an informative data frame of sums:
Category Sum
1 Total for A 2.1801780
2 Total for A and 1 1.2101839
3 Total for A and 2 0.9699941
This solution would be grossly inefficient when applied to a data frame with multiple categories. I would like to achieve the following:
Cycle through all the categories, including categories derived from each column separately as well as from both columns at the same time
Achieve some flexibility with respect to how the function is applied; for instance, I may want to apply mean instead of sum
Save the "Total for" string as a separate object that I can easily edit when applying a function other than sum
I was initially thinking of using dplyr, along the lines of:
require(dplyr)

df_sums_experiment <- dta %>%
  group_by(CatA, CatNum) %>%
  summarise(TotVal = sum(SomeVal))
But it's not clear to me how I could apply multiple groupings simultaneously. As stated, I'm interested in grouping by each column separately and by the combination of both columns. I would also like to create a string column that would indicate what is combined and in what order.
You could use tidyr to unite the columns and gather the data. Then use dplyr to summarise:
library(dplyr)
library(tidyr)

dta %>%
  unite(measurevar, CatA, CatNum, remove = FALSE) %>%
  gather(key, val, -SomeVal) %>%
  group_by(val) %>%
  summarise(sum(SomeVal))
val sum(SomeVal)
(chr) (dbl)
1 1 2.8198078
2 2 3.0778622
3 A 2.1801780
4 A_1 1.2101839
5 A_2 0.9699941
6 B 1.4405782
7 B_1 0.4076565
8 B_2 1.0329217
9 C 2.2769138
10 C_1 1.2019674
11 C_2 1.0749464
Just loop over the column combinations, compute the quantities you want and then rbind them together:
library(data.table)
dt = as.data.table(dta) # or setDT to convert in place
cols = c('CatA', 'CatNum')

rbindlist(apply(combn(c(cols, ""), length(cols)), 2,
                function(i) dt[, sum(SomeVal), by = c(i[i != ""])]),
          fill = TRUE)
# CatA CatNum V1
# 1: A 1 1.2101839
# 2: B 2 1.0329217
# 3: C 1 1.2019674
# 4: A 2 0.9699941
# 5: B 1 0.4076565
# 6: C 2 1.0749464
# 7: A NA 2.1801780
# 8: B NA 1.4405782
# 9: C NA 2.2769138
#10: NA 1 2.8198078
#11: NA 2 3.0778622
Split the data, apply the summary function to each piece, then rbind the results:
# result
res <- do.call(rbind,
               lapply(c(split(dta, dta$CatA),
                        split(dta, dta$CatNum),
                        split(dta, dta[, 1:2])),
                      function(i) sum(i[, "SomeVal"])))

# prettify the result
res1 <- data.frame(Category = paste0("Total for ", rownames(res)),
                   Sum = res[, 1])
res1$Category <- sub(".", " and ", res1$Category, fixed = TRUE)
row.names(res1) <- seq_along(row.names(res1))
res1
# Category Sum
# 1 Total for A 2.1801780
# 2 Total for B 1.4405782
# 3 Total for C 2.2769138
# 4 Total for 1 2.8198078
# 5 Total for 2 3.0778622
# 6 Total for A and 1 1.2101839
# 7 Total for B and 1 0.4076565
# 8 Total for C and 1 1.2019674
# 9 Total for A and 2 0.9699941
# 10 Total for B and 2 1.0329217
# 11 Total for C and 2 1.0749464
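To address the asker's wish for flexibility (applying mean instead of sum, and keeping the "Total for" label as an easily edited object), the split() idea above can be wrapped into a small helper. This is only a sketch; summarise_groups(), fun and label are made-up names:
# groups by each category column on its own, plus the cross of all of them,
# then applies `fun` to `value_col` within each piece
summarise_groups <- function(data, cat_cols, value_col, fun = sum,
                             label = "Total for ") {
  pieces <- c(unlist(lapply(cat_cols, function(cl) split(data, data[[cl]])),
                     recursive = FALSE),
              split(data, data[cat_cols]))
  vals <- vapply(pieces, function(d) fun(d[[value_col]]), numeric(1))
  out <- data.frame(Category = paste0(label, names(vals)),
                    Value = unname(vals))
  out$Category <- sub(".", " and ", out$Category, fixed = TRUE)  # "A.1" -> "A and 1"
  out
}

summarise_groups(dta, c("CatA", "CatNum"), "SomeVal")             # sums for all single categories and combinations
summarise_groups(dta, c("CatA", "CatNum"), "SomeVal", fun = mean,
                 label = "Mean for ")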
