reshape dataframe from columns to rows and collapse cell values - r

Here's the challenge I am facing. I am trying to transform this dataset
a b c
100 0 111
0 137 17
78 117 91
into this (columns to rows, with the zeros dropped)
col1 col2
a 100,78
b 137,117
c 111,17,91
I know I can do this using the reshape or melt functions, but I am not sure how to collapse and paste the cell values. Any suggestions or pointers are appreciated, folks.

Here is a lightweight option that uses toString() to collapse each column (minus its zeros) into a single string, and stack() to reshape the resulting list into your desired output:
stack(lapply(df, function(col) toString(col[col!=0])))
# values ind
#1 100, 78 a
#2 137, 117 b
#3 111, 17, 91 c
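If the exact layout from the question is wanted (columns named col1 and col2, values joined without a space), a base R sketch along the same lines might look like the following. It assumes the data frame is called df, as in the answer above, and swaps toString() for paste() with a bare comma; the names collapsed and result are only illustrative.

```r
# Example data from the question (the name df is an assumption)
df <- data.frame(a = c(100, 0, 78), b = c(0, 137, 117), c = c(111, 17, 91))

# For each column, drop the zeros and join what is left with a bare comma
collapsed <- vapply(df, function(col) paste(col[col != 0], collapse = ","),
                    character(1))

# One row per original column: its name and its collapsed values
result <- data.frame(col1 = names(collapsed), col2 = unname(collapsed),
                     stringsAsFactors = FALSE)
print(result)
```

vapply() is used here instead of lapply() so the result is a named character vector, which makes building the two-column data frame straightforward.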

I would use dplyr rather than reshape.
library(dplyr)
library(tidyr)
Data <- data.frame(a=c(100,0,78),b=c(0,137,117),c=c(111,17,91))
Data %>%
gather(Column, Value) %>%
filter(Value != 0) %>%
group_by(Column) %>%
summarize(Value=paste0(Value,collapse=', '))
The gather function is similar to melt in reshape. The group_by function tells later functions to work separately on each value of Column. Finally, summarize calculates whatever summary we want for each of the groups; in this case, it pastes all the values together.
Which should give you:
# A tibble: 3 × 2
Column Value
<chr> <chr>
1 a 100, 78
2 b 137, 117
3 c 111, 17, 91
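In recent tidyr releases gather() is superseded by pivot_longer(). A sketch of the same pipeline with pivot_longer(), assuming the same Data object and column names as above, could be:

```r
library(dplyr)
library(tidyr)

Data <- data.frame(a = c(100, 0, 78), b = c(0, 137, 117), c = c(111, 17, 91))

result <- Data %>%
  # Stack all columns into name/value pairs (Column, Value)
  pivot_longer(everything(), names_to = "Column", values_to = "Value") %>%
  # Drop the zeros before collapsing
  filter(Value != 0) %>%
  group_by(Column) %>%
  # Paste the remaining values of each original column together
  summarise(Value = paste(Value, collapse = ", "))

print(result)
```

pivot_longer() walks the data row by row rather than column by column, but the values within each group still come out in their original row order, so the collapsed strings should match the gather() version.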

With library(data.table)
melt(dt)[, .(value = paste(value[value !=0], collapse=', ')), by=variable]
# variable value
# 1: a 100, 78
# 2: b 137, 117
# 3: c 111, 17, 91
The data:
dt = fread("a b c
100 0 111
0 137 17
78 117 91")

Related

Rearranging data according to rater and subject, simultaneously creating new row names

I have a dataset where multiple raters rate multiple subjects.
I'd like to rearrange the data that looks like this:
data <- data.frame(rater=c("A", "B", "C", "A", "B", "C"),
subject=c(1, 1, 1, 2, 2, 2),
measurment1=c(1, 2, 3, 4, 5,6),
measurment2=c(11, 22, 33, 44, 55,66),
measurment3=c(111, 222, 333, 444, 555, 666))
data
# rater subject measurment1 measurment2 measurment3
# 1 A 1 1 11 111
# 2 B 1 2 22 222
# 3 C 1 3 33 333
# 4 A 2 4 44 444
# 5 B 2 5 55 555
# 6 C 2 6 66 666
into data that looks like this:
data_transformed <- data.frame( A = c(1,11,111,4,44,444),
B = c(2,22,222,5,55,555),
C = c(3,33,333,6,66,666)
)
row.names(data_transformed) <- c("measurment1_1", "measurment2_1", "measurment3_1", "measurment1_2", "measurment2_2", "measurment3_2")
data_transformed
# A B C
# measurment1_1 1 2 3
# measurment2_1 11 22 33
# measurment3_1 111 222 333
# measurment1_2 4 5 6
# measurment2_2 44 55 66
# measurment3_2 444 555 666
In the new data frame, the raters (A, B and C) should become the columns. The measurement should become the rows and I'd also like to add the subject number as a suffix to the row-names.
For the rearranging one could probably use the pivot functions, yet I have no idea how to combine the measurement variables with the subject number.
Thanks for your help!
We could use pivot_longer, pivot_wider and unite from the tidyr package.
pivot_longer puts the data into a long (vertical) format: it gathers the measurment columns into a single pair of name and value columns.
pivot_wider does the opposite of pivot_longer: it spreads a variable into multiple columns, one for each unique value of that variable.
library(tidyr)
data |>
pivot_longer(measurment1:measurment3) |>
pivot_wider(names_from = rater, values_from = value, values_fill = 0) |>
unite("measure_subject", name, subject, remove = TRUE)
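To make the two pivot steps more concrete, here is a small sketch that stores the intermediate results so they can be inspected. It uses the asker's data frame; the object names long and wide are only illustrative.

```r
library(tidyr)

data <- data.frame(rater = c("A", "B", "C", "A", "B", "C"),
                   subject = c(1, 1, 1, 2, 2, 2),
                   measurment1 = c(1, 2, 3, 4, 5, 6),
                   measurment2 = c(11, 22, 33, 44, 55, 66),
                   measurment3 = c(111, 222, 333, 444, 555, 666))

# Step 1: one row per rater/subject/measurement (6 rows x 3 measurements)
long <- pivot_longer(data, measurment1:measurment3)
print(dim(long))    # 18 rows; columns rater, subject, name, value

# Step 2: one column per rater, one row per subject/measurement pair
wide <- pivot_wider(long, names_from = rater, values_from = value)
print(wide)         # 6 rows; columns subject, name, A, B, C
```

After the second step only the name and subject columns remain to be combined into a single label.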
Please try the code below, which produces the expected result using pivot_longer, pivot_wider and column_to_rownames.
library(tidyverse)
data_transformed <- data %>%
pivot_longer(c('measurment1', 'measurment2', 'measurment3')) %>%
mutate(rows = paste0(name, '_', subject)) %>%
pivot_wider(id_cols = rows, names_from = rater, values_from = value) %>%
column_to_rownames(var = "rows")

In R, how to summarize a data frame in multiple dimensions

There is a data frame raw_data as below. How can I change it to wished_data in an easy way?
I currently know how to group_by/summarise the data several times (adding variables each time) and then rbind the results. But this is a little tedious, especially when there are more variables than in this example.
I want to know if there is a general method for similar situations. Thanks!
library(tidyverse)
country <- c('UK','US','UK','US')
category <- c("A", "B", "A", "B")
y2021 <- c(17, 42, 21, 12)
y2022 <- c(49, 23, 52, 90)
raw_data <- data.frame(country,category,y2021,y2022)
We may use rollup/cube/groupingsets from data.table
library(data.table)
out <- rbind(setDT(raw_data), groupingsets(raw_data, j = lapply(.SD, sum),
by = c("country", "category"),
sets = list("country", "category", character())))
out[is.na(out)] <- 'TOTAL'
-output
> out
country category y2021 y2022
<char> <char> <num> <num>
1: UK A 17 49
2: US B 42 23
3: UK A 21 52
4: US B 12 90
5: UK TOTAL 38 101
6: US TOTAL 54 113
7: TOTAL A 38 101
8: TOTAL B 54 113
9: TOTAL TOTAL 92 214
Or with cube
out <- rbind(raw_data, cube(raw_data,
j = .(y2021= sum(y2021), y2022=sum(y2022)), by = c("country", "category")))
out[is.na(out)] <- 'TOTAL'
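For readers without data.table, the idea behind grouping sets can be sketched in base R: aggregate once per set of grouping columns, label the columns that were summed over, and stack the pieces. The helper totals_for below is hypothetical (it is not part of any package), and the "TOTAL" label simply mirrors the output above.

```r
raw_data <- data.frame(country = c("UK", "US", "UK", "US"),
                       category = c("A", "B", "A", "B"),
                       y2021 = c(17, 42, 21, 12),
                       y2022 = c(49, 23, 52, 90),
                       stringsAsFactors = FALSE)

# Hypothetical helper: sum `values` within the grouping columns in `keep`,
# and fill every other key column with a label
totals_for <- function(keep, data, keys, values, label = "TOTAL") {
  if (length(keep) == 0) {
    # Empty grouping set: grand total over all rows
    agg <- as.data.frame(lapply(data[values], sum))
  } else {
    agg <- aggregate(data[values], by = data[keep], FUN = sum)
  }
  for (k in setdiff(keys, keep)) agg[[k]] <- label
  agg[c(keys, values)]  # restore the original column order
}

keys <- c("country", "category")
values <- c("y2021", "y2022")
sets <- list("country", "category", character())

pieces <- lapply(sets, totals_for, data = raw_data, keys = keys, values = values)
out <- do.call(rbind, c(list(raw_data), pieces))
print(out)
```

Adding another grouping column only means extending keys and sets, which is the main convenience that groupingsets() provides directly.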
We can use the adorn_totals function from janitor. The helper get_totals below accepts a data frame and a column, and returns the total rows for the numeric columns, one such row for each level of the specified column. To do that it runs adorn_totals within each group and then keeps only the total rows; since adorn_totals can rearrange the column order, it uses select to restore the original order so that we can later bind multiple results together. We then bind together the original data frame and each of the total-row data frames that we want.
library(dplyr)
library(janitor)
get_totals <- function(data, col) {
data %>%
group_by({{col}}) %>%
group_modify(~ adorn_totals(.)) %>%
ungroup %>%
filter(rowSums(. == "Total") > 0) %>%
select(any_of(names(data)))
}
bind_rows(
raw_data,
get_totals(raw_data, category),
get_totals(raw_data, country),
get_totals(raw_data)
)
giving:
country category y2021 y2022
1 UK A 17 49
2 US B 42 23
3 UK A 21 52
4 US B 12 90
5 Total A 38 101
6 Total B 54 113
7 UK Total 38 101
8 US Total 54 113
9 Total - 92 214

How to count unique values in a column in R

I have a database and would like to know how many people (identified by ID) match a characteristic. The list is like this:
111 A
109 A
112 A
111 A
108 A
I only need to count how many IDs have that feature; the problem is that there are duplicated IDs. I've tried
count(df, vars = ID)
but it does not show the total number of IDs, just how many times each one is repeated. It is the same with
count(df, c('ID'))
which shows the total number of IDs, but many of them are duplicated, and I need to count each one only once.
Do you have any suggestions? Using the table function is not an option because of the size of this database.
We can use n_distinct() from dplyr to count the number of unique values for a column in a data frame.
textFile <- "id var1
111 A
109 A
112 A
111 A
108 A"
df <- read.table(text = textFile,header = TRUE)
library(dplyr)
df %>% summarise(count = n_distinct(id))
...and the output:
> df %>% summarise(count = n_distinct(id))
count
1 4
We can also summarise the counts within one or more group_by() columns.
textFile <- "id var1
111 A
109 A
112 A
111 A
108 A
201 B
202 B
202 B
111 B
112 B
109 B"
df <- read.table(text = textFile,header = TRUE)
df %>% group_by(var1) %>% summarise(count = n_distinct(id))
...and the output:
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 2
var1 count
<chr> <int>
1 A 4
2 B 5
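If loading dplyr is not desirable, roughly the same counts can be obtained in base R with unique(). This is only a sketch on the example data above; the names n_unique and per_group are illustrative.

```r
textFile <- "id var1
111 A
109 A
112 A
111 A
108 A
201 B
202 B
202 B
111 B
112 B
109 B"
df <- read.table(text = textFile, header = TRUE)

# Number of distinct ids overall
n_unique <- length(unique(df$id))
print(n_unique)

# Number of distinct ids within each level of var1
per_group <- tapply(df$id, df$var1, function(x) length(unique(x)))
print(per_group)
```

Note that an id appearing under both A and B is counted once in each group but only once in the overall figure.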
You can first remove duplicates using unique and then count the remaining rows:
d <- tribble(
~ID,~feature,
111, "A",
109, "A",
112, "A",
111, "A",
108, "A")
count(unique(d,vars = c(ID, feature)),vars=ID)
vars n
<dbl> <int>
1 108 1
2 109 1
3 111 1
4 112 1

How to merge two columns from different dataframes into a new dataframe by a third key variable (R)

I have these two dataframes:
df1 <- data.frame(a= c(1,2,3,1,2,3,1,2,3), b=c(11,21,31,12,22,32,13,23,33))
df2 <- data.frame(a= c(1,2,3,1,2,3,1,2,3), c=c(101,201,301,102,202,302,103,203,303))
I wanna merge the columns "b" and "c" into a new dataframe but using "a" as a key variable.
The expected result is this:
df.output <- data.frame(b=c(21,22,23), c=c(201,202,203))
I have already tried the join function from dplyr without success.
Thanks,
quelemem
Based on the logic mentioned by the OP in the comments, we could filter only the rows where 'a' is 2, then use mutate to add column 'c' by taking the corresponding 'c' values from df2 where 'a' is 2
library(dplyr)
df1 %>%
filter(a == 2) %>%
mutate(c = df2$c[a ==df2$a]) %>%
select(-a)
# b c
#1 21 201
#2 22 202
#3 23 203
Or using base R
cbind(subset(df1, a==2, select = b), subset(df2, a==2, select = c))
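One likely reason a plain join on "a" does not give the expected result is that each value of "a" occurs three times in both data frames, so every key matches 3 x 3 rows. A base R sketch that works around this is shown below. It assumes rows should be paired by their order of occurrence within each value of "a" (which is what the expected output suggests), and the helper column n is hypothetical.

```r
df1 <- data.frame(a = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                  b = c(11, 21, 31, 12, 22, 32, 13, 23, 33))
df2 <- data.frame(a = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                  c = c(101, 201, 301, 102, 202, 302, 103, 203, 303))

# A plain merge on "a" pairs every df1 row with every matching df2 row
print(nrow(merge(df1, df2, by = "a")))

# Hypothetical helper column: occurrence number of each row within its "a" value
df1$n <- ave(df1$a, df1$a, FUN = seq_along)
df2$n <- ave(df2$a, df2$a, FUN = seq_along)

# Merging on both keys gives a one-to-one match
merged <- merge(df1, df2, by = c("a", "n"))
result <- merged[merged$a == 2, c("b", "c")]
print(result)
```

The same occurrence-number idea carries over to dplyr joins by adding the counter with group_by(a) and row_number() before joining on both columns.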
Edit
Based on additional comments by OP, tweaking the original solution can achieve the desired output (although I think @akrun's answer is the better one to choose in this context as no nest / unnest is required).
library(tidyverse)
left_join(nest(df1, -a), nest(df2, -a), by = "a") %>%
filter(a == 2) %>% unnest() %>% select(-a)
#> b c
#> 1 21 201
#> 2 22 202
#> 3 23 203
Original answer
As @akrun mentions in the comments, the desired output is not entirely clear.
Do you mean something like this as output?
library(tidyverse)
df3 <- left_join(nest(df1, -a), nest(df2, -a), by = "a")
df3
#> a data.x data.y
#> 1 1 11, 12, 13 101, 102, 103
#> 2 2 21, 22, 23 201, 202, 203
#> 3 3 31, 32, 33 301, 302, 303

Formatting output in summarise_each with dplyr

Greetings: I am new to dplyr and having some challenges formatting my output. Here is a code snippet that produces some reproducible data, using melt to get it into the shape I need.
set.seed(1234)
library(reshape2)
library(dplyr)
val <- c(0:1)
a <- sample(val, 99, replace=T)
b <- sample(val, 99, replace=T)
c <- sample(val, 99, replace=T)
d <- sample(val, 99, replace=T)
dat <- data.frame(a,b,c,d)
melt.dat <- melt(dat)
Now, I can perform the desired summary:
SummaryTable <- melt.dat %>%
group_by(variable) %>%
summarise_each(funs(sum, sum/n()))
Here is my output:
variable sum *
1 a 50 50.50505
2 b 58 58.58586
3 c 46 46.46465
4 d 46 46.46465
My ideal output would be something as follows. I am unable to figure out how to specify my column names in the summarise_each or melt functions, set the number of decimal places, and suppress the row numbers. I've spent a long time getting this far, and just can't seem to get the rest figured out!
Letter Count Percent
a 50 50.5
b 58 58.6
c 46 46.5
d 46 46.5
Not sure whether it's possible within dplyr to suppress rownames (numbering), but here's how you could get the names and formatting right:
options(digits = 3)
melt.dat %>%
group_by(Letter = variable) %>%
summarise_each(funs(Count = sum(.), Percent = sum(.)/n()*100), -variable)
#Source: local data frame [4 x 3]
#
# Letter Count Percent
#1 a 45 45.5
#2 b 51 51.5
#3 c 52 52.5
#4 d 48 48.5
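summarise_each() and funs() are deprecated in current dplyr releases, so the same summary is sketched below with plain summarise() and named expressions. It assumes the data are generated as in the question and uses tidyr's pivot_longer() in place of melt(); rounding with round() and printing with row.names = FALSE are one possible way to control the decimals and hide the row numbers. The exact counts depend on the random number generator, so none are shown.

```r
library(dplyr)
library(tidyr)

set.seed(1234)
val <- 0:1
dat <- data.frame(a = sample(val, 99, replace = TRUE),
                  b = sample(val, 99, replace = TRUE),
                  c = sample(val, 99, replace = TRUE),
                  d = sample(val, 99, replace = TRUE))

result <- dat %>%
  # Long format: one row per (Letter, value) pair
  pivot_longer(everything(), names_to = "Letter", values_to = "value") %>%
  group_by(Letter) %>%
  # Named expressions set the output column names directly
  summarise(Count = sum(value),
            Percent = round(100 * mean(value), 1))

# Converting to a plain data frame allows printing without row numbers
print(as.data.frame(result), row.names = FALSE)
```

Rounding inside summarise() changes the stored values; if full precision is needed later, the rounding can be moved to the printing step instead.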

Resources