Gathering columns from wide to long by id [duplicate] - r

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 4 years ago.
I've got a data frame like this:
set.seed(100)
drugs <- data.frame(id = 1:5,
drug_1 = letters[1:5], drug_dos_1 = sample(100,5),
drug_2 = letters[3:7], drug_dos_2 = sample(100,5)
)
id drug_1 drug_dos_1 drug_2 drug_dos_2
1 a 31 c 49
2 b 26 d 81
3 c 55 e 37
4 d 6 f 54
5 e 45 g 17
I'd like to transform this messy table into a tidy table with all drugs of an id in one column and the corresponding drug dosages in one column. The table should look like this in the end:
id drug dosage
1 a 31
1 c 49
2 b 26
2 d 81
etc
I guess this could be achieved with a reshaping function that transforms my data from wide to long format, but I didn't manage to get it working.

One option is melt from data.table, which can take multiple patterns in the measure argument:
library(data.table)
melt(setDT(drugs), measure = patterns('^drug_\\d+$', 'dos'),
value.name = c('drug', 'dosage'))[, variable := NULL][order(id)]
# id drug dosage
#1: 1 a 31
#2: 1 c 49
#3: 2 b 26
#4: 2 d 81
#5: 3 c 55
#6: 3 e 37
#7: 4 d 6
#8: 4 f 54
#9: 5 e 45
#10: 5 g 17
Here, 'drug' is common to all the column names, so we need a pattern that matches only the drug columns. One way is to anchor the start of the string (^), followed by the 'drug' substring, an underscore (_), and one or more digits (\\d+) up to the end ($) of the string. For the dosage columns, the substring 'dos' is enough to match the column names that contain it.
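To see which columns each pattern picks up, here is a small base R sketch. The cols vector is just the column names from the question typed out by hand, and grep() stands in for the matching that patterns() does internally, so treat it as an illustration rather than the exact data.table mechanics.

```r
# Column names from the question's data frame (typed by hand for illustration)
cols <- c("id", "drug_1", "drug_dos_1", "drug_2", "drug_dos_2")

# Anchored pattern: "drug_" followed only by digits, so drug_dos_* is excluded
drug_cols <- grep("^drug_\\d+$", cols, value = TRUE)

# Plain substring: any column name containing "dos"
dose_cols <- grep("dos", cols, value = TRUE)

print(drug_cols)
print(dose_cols)
```

Without the ^ and $ anchors, a bare 'drug' pattern would match all four measurement columns, which is why the first pattern has to be more specific.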

library(dplyr)
library(tidyr) # gather() and spread() come from tidyr
drugs %>% gather(key,val,-id) %>% mutate(key=gsub('_\\d','',key)) %>% # replace _1 and _2 at the end with nothing
mutate(key=gsub('drug_','',key)) %>% group_by(key) %>% # replace drug_ at the start of dos with nothing and group by key
mutate(row=row_number()) %>% spread(key,val) %>%
select(id,drug,dos,-row)
# A tibble: 10 x 3
id drug dos
<int> <chr> <chr>
1 1 a 31
2 1 c 49
3 2 b 26
4 2 d 81
5 3 c 55
6 3 e 37
7 4 d 6
8 4 f 54
9 5 e 45
10 5 g 17
Warning message:
attributes are not identical across measure variables;
they will be dropped
# This warning is generated because we merged drug (chr) and dose (num) into one column (val)
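If you want to avoid mixing character and numeric values in one column altogether, base R's reshape() can stack both sets of columns at once. This is only a sketch of an alternative, not what either answer above uses. The dosages are hardcoded from the table in the question (sample() can give different numbers on newer R versions), and the slot column name is my own invention.

```r
# Data from the question, with the dosages typed in by hand
drugs <- data.frame(id = 1:5,
                    drug_1 = letters[1:5], drug_dos_1 = c(31, 26, 55, 6, 45),
                    drug_2 = letters[3:7], drug_dos_2 = c(49, 81, 37, 54, 17),
                    stringsAsFactors = FALSE)

# Each element of `varying` is one set of wide columns to stack;
# `v.names` gives the names of the resulting long columns
long <- reshape(drugs, direction = "long",
                varying = list(c("drug_1", "drug_2"),
                               c("drug_dos_1", "drug_dos_2")),
                v.names = c("drug", "dosage"),
                idvar = "id", timevar = "slot")

# Sort by id and drop the helper column ("slot" is a hypothetical name)
long <- long[order(long$id, long$slot), c("id", "drug", "dosage")]
rownames(long) <- NULL
print(long)
```

Because the drug and dosage columns are stacked separately, dosage stays numeric and no warning about dropped attributes appears.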

Related

Labelling rows according to how many times the group appeared in previous rows

Suppose I have the following data.frame object:
df = data.frame(id=(1:25),
col1=c('a','a','a',
'b','b','b',
'c','c','c',
'd','d',
'b','b','b',
'e',
'c','c','c',
'e','e',
'a','a','a',
'e','e'))
From the snapshot above, you can see that there are two groups of rows that have col1=="a": rows 1 through 3 and rows 21 through 23. Similarly, there are three groups of rows that have col1=="e": row 15, rows 19 through 20 and rows 24 through 25 (and so on and so on with "b", "c" and "d").
Here's my main question
Is it possible to label the rows according to what "chunk" we're currently on? More explicitly: since rows 1 through 3 are the first time where we have col1=="a", they should be labelled as 1. Then, rows 21 through 23 should be labelled as 2, because that is the second time that we have a set of rows that have col1=="a". Using the same logic, but for col1=="e", we'd label row 15 as 1, rows 19 and 20 as 2 and rows 24 and 25 as 3 (again, so on and so on with "b", "c" and "d").
Desired output
Here is what the resulting data.frame would look like:
df = data.frame(id=(1:25),
col1=c('a','a','a',
'b','b','b',
'c','c','c',
'd','d',
'b','b','b',
'e',
'c','c','c',
'e','e',
'a','a','a',
'e','e'),
grup=c(1,1,1,
1,1,1,
1,1,1,
1,1,
2,2,2,
1,
2,2,2,
2,2,
2,2,2,
3,3))
My attempt
I tried implementing a solution using a for loop, but that was quite slow (the original data I'm working on has about 500,000 rows), and it just looked a bit sloppy:
my_classifier = function(input_df, ref_column){
# Keeps a tally of how many times each unique group was "found" before.
group_counter = list()
# Dealing with the corner case of the first row
group_counter[[input_df[[ref_column]][1]]] = 1
output_groups = rep(-1, nrow(input_df))
output_groups[1] = 1
# The for loop starts at the second row because I've already "dealt" with the
# first row in the corner cases above
for(i in 2:nrow(input_df)){
prev_group = input_df[[ref_column]][i-1]
this_group = input_df[[ref_column]][i]
if(is.null(group_counter[[this_group]])){
this_counter = 0
}
else{
this_counter = group_counter[[this_group]]
}
if(prev_group != this_group){
this_counter = this_counter + 1
}
output_groups[i] = this_counter
group_counter[[this_group]] = this_counter
}
return(output_groups)
}
df$grup = my_classifier(df,'col1')
Is there a quicker/more efficient way to solve this problem? Maybe something that relies on vectorized functions or something?
Important notes
Consider that we cannot rely on the number of repetitions of each "block". Sometimes, col1 will have just one single row of a particular group, while other times the "block" will have several rows where col1 share the same value. Also consider that we cannot assume any logic in the "order" or the number of times each group shows up.
So, for example, there might be a stretch of 10 rows where col1=="z", then a stretch of 15 rows where col1=="x", then another single row where col1=="x" and then finally a stretch of 100 rows where col1=="w".
You can use data.table::rleid() twice, like this:
library(data.table)
setDT(df)[,grp:=rleid(col1)][, grp:=rleid(grp), by=col1][order(id)]
Output:
id col1 grp
<int> <char> <int>
1: 1 a 1
2: 2 a 1
3: 3 a 1
4: 4 b 1
5: 5 b 1
6: 6 b 1
7: 7 c 1
8: 8 c 1
9: 9 c 1
10: 10 d 1
11: 11 d 1
12: 12 b 2
13: 13 b 2
14: 14 b 2
15: 15 e 1
16: 16 c 2
17: 17 c 2
18: 18 c 2
19: 19 e 2
20: 20 e 2
21: 21 a 2
22: 22 a 2
23: 23 a 2
24: 24 e 3
25: 25 e 3
id col1 grp
Here is a possible base R solution:
change <- with(rle(df$col1), rep(seq_along(values), lengths))
cbind(df, grp = with(df, ave(
change,
col1,
FUN = function(x)
inverse.rle(within.list(rle(x), values <- seq_along(values)))
)))
Or another option using a combination of rle and dplyr using the function from here:
rle_new <- function(x) {
x <- rle(x)$lengths
rep(seq_along(x), times=x)
}
library(dplyr)
df %>%
mutate(grp = rle_new(col1)) %>%
group_by(col1) %>%
mutate(grp = rle_new(grp))
Output
id col1 grp
1 1 a 1
2 2 a 1
3 3 a 1
4 4 b 1
5 5 b 1
6 6 b 1
7 7 c 1
8 8 c 1
9 9 c 1
10 10 d 1
11 11 d 1
12 12 b 2
13 13 b 2
14 14 b 2
15 15 e 1
16 16 c 2
17 17 c 2
18 18 c 2
19 19 e 2
20 20 e 2
21 21 a 2
22 22 a 2
23 23 a 2
24 24 e 3
25 25 e 3
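The answers above all boil down to the same two steps: find where each run starts, then count the runs separately for each value. As a rough base R sketch of that idea (not taken from any of the answers, and the variable names starts and grp are mine), you can flag the first row of every run and take a cumulative sum of those flags within each value:

```r
# Same sequence as col1 in the question, built with rep() for brevity
col1 <- rep(c("a", "b", "c", "d", "b", "e", "c", "e", "a", "e"),
            times = c(3, 3, 3, 2, 3, 1, 3, 2, 3, 2))

# TRUE on the first row of every run (a row that differs from the one before it)
starts <- c(TRUE, col1[-1] != col1[-length(col1)])

# Within each value of col1, count how many runs have started so far
grp <- ave(as.integer(starts), col1, FUN = cumsum)

print(data.frame(id = seq_along(col1), col1 = col1, grp = grp))
```

This is fully vectorised, so it should scale to a few hundred thousand rows much better than the for loop, though I have not benchmarked it.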

R - splitting column by legend from the other one

I've got data.frame like below
ID age legend location
1 83 country;province;city X;A;J
2 15 country;city X;K
3 2 country;province;city Y;B;I
4 12 country;city X;L
5 2 country;city Y;J
6 2 country;province;city Y;A;M
7 18 country;province;city X;B;J
8 85 country;province;city X;A;I
To describe it: the third column (legend) describes the values in the fourth column (location). The order of the entries in legend gives the order of the values in location.
As a result, I need to obtain the data.frame as below
ID age country province city
1 83 X A J
2 15 X <NA> K
3 2 Y B I
4 12 X <NA> L
5 2 Y <NA> J
6 2 Y A M
7 18 X B J
8 85 X A I
In other words, I need to take the entries of the legend column, use them as the names of new columns, and fill those columns with the corresponding values from the location column. I cannot just split the columns by ; because the number of entries differs from row to row. Any suggestions?
Using DF, shown reproducibly in the Note at the end, use separate_rows and then spread the data out from long to wide. If the order of the columns does not matter, the select line can be omitted.
library(dplyr)
library(tidyr)
DF %>%
separate_rows(legend, location) %>%
spread(legend, location) %>%
select(ID, age, country, province, city) # optional
giving:
ID age country province city
1 1 83 X A J
2 2 15 X <NA> K
3 3 2 Y B I
4 4 12 X <NA> L
5 5 2 Y <NA> J
6 6 2 Y A M
7 7 18 X B J
8 8 85 X A I
Note
Lines <- "
ID age legend location
1 83 country;province;city X;A;J
2 15 country;city X;K
3 2 country;province;city Y;B;I
4 12 country;city X;L
5 2 country;city Y;J
6 2 country;province;city Y;A;M
7 18 country;province;city X;B;J
8 85 country;province;city X;A;I"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
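If you would rather not depend on tidyr, the same pairing of legend entries with location values can be sketched in base R. This is only an illustration under a few assumptions: it uses the first three rows of the data, the fields vector is typed by hand, and the variable names (keys, vals, wide, res) are mine.

```r
# First three rows of the question's data, typed by hand
DF <- data.frame(ID = 1:3, age = c(83, 15, 2),
                 legend = c("country;province;city", "country;city",
                            "country;province;city"),
                 location = c("X;A;J", "X;K", "Y;B;I"),
                 stringsAsFactors = FALSE)

keys <- strsplit(DF$legend, ";", fixed = TRUE)
vals <- strsplit(DF$location, ";", fixed = TRUE)
fields <- c("country", "province", "city")  # desired output columns

# For each row, name the values by their legend entry and pick the wanted
# fields; a field missing from the legend comes back as NA
wide <- t(vapply(seq_along(keys), function(i) {
  unname(setNames(vals[[i]], keys[[i]])[fields])
}, character(length(fields))))
colnames(wide) <- fields

res <- data.frame(DF[c("ID", "age")], wide, stringsAsFactors = FALSE)
print(res)
```

Indexing a named vector by a name that is not present returns NA, which is what fills the province column for rows that only list country and city.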

perform operations on a data frame based on a factors

I'm having a hard time to describe this so it's best explained with an example (as can probably be seen from the poor question title).
Using dplyr, I have a data frame that is the result of a group_by and summarize, and I want to do some further manipulation on it by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
group=as.factor(rep(c("a","b","c"),2)),
sum=c(1,8,34,2,7,33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
Here I wrote the fractions in the percent column by hand to show the operation I want to do.
I know there is probably some dplyr way to do this with mutate but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
You can use total as a look-up table where you get a total for each run of df :
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
If your "run" values are 1,2...n then this will work
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
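One caveat with both snippets above: indexing with a factor uses its integer codes, so they rely on the rows of total (or the elements of divisor) being in the same order as the levels of run. A small sketch that looks the divisor up by value with match() instead, so the order of total does not matter. The reversed row order of total here is deliberate, just to show that.

```r
df <- data.frame(run = as.factor(c(rep(1, 3), rep(2, 3))),
                 group = as.factor(rep(c("a", "b", "c"), 2)),
                 sum = c(1, 8, 34, 2, 7, 33))

# Rows deliberately listed in reverse order to show the lookup is by value
total <- data.frame(run = as.factor(c(2, 1)),
                    total = c(47, 45))

# match() compares the factor labels, not their integer codes
df$percent <- df$sum / total$total[match(df$run, total$run)]
print(df)
```

If a run in df has no entry in total, match() returns NA and the corresponding percent is NA rather than a silently wrong number.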
First, merge the total values into your df:
df2 <- merge(df, total, by = "run")
Then you can call mutate:
library(dplyr)
df2 <- df2 %>% mutate(percent = sum / total)
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766

Data frame manoeuvre [duplicate]

This question already has an answer here:
R programming - data frame manoevur
(1 answer)
Closed 7 years ago.
Suppose I have the following dataframe:
Categories Variable
1 a 11
2 b 21
3 c 34
4 d 45
5 e 52
6 f 65
7 g 76
8 a 13
9 b 24
I'd like to turn it into a new dataframe like the following:
Categories Variable
1 a 11
2 b 21
3 c 34
4 d+e 97
5 f 65
6 g 76
7 a 13
8 b 24
How can I do it? (Surely, the dataframe is much larger, but I want the sum of all categories of d and e and group it into a new category, say 'H').
Many thanks!
This is a good question, but unfortunately it is off-topic here, so I'll answer it until it gets migrated.
I'm assuming Categories is of class factor, so you'll need to re-level it properly (assuming your data is called df):
levels(df$Categories)[levels(df$Categories) %in% c("d", "e")] <- "h"
Next, I'll use the data.table package, as you have a large data set and its devel version (v >= 1.9.5) has a convenient function called rleid (download from GitHub):
library(data.table) ## v >= 1.9.5
setDT(df)[, .(Variable = sum(Variable)), by = .(indx = rleid(Categories), Categories)]
# indx Categories Variable
# 1: 1 a 11
# 2: 2 b 21
# 3: 3 c 34
# 4: 4 h 97
# 5: 5 f 65
# 6: 6 g 76
# 7: 7 a 13
# 8: 8 b 24
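For comparison, the same "sum over consecutive runs" idea can be sketched in base R with rle(). This assumes Categories is a character column rather than a factor, and the names cat2, run_id and res are just mine for illustration.

```r
df <- data.frame(Categories = c("a", "b", "c", "d", "e", "f", "g", "a", "b"),
                 Variable = c(11, 21, 34, 45, 52, 65, 76, 13, 24),
                 stringsAsFactors = FALSE)

# Relabel d and e as the new category h
cat2 <- ifelse(df$Categories %in% c("d", "e"), "h", df$Categories)

# Give every run of identical consecutive labels its own id
run_id <- with(rle(cat2), rep(seq_along(lengths), lengths))

# One output row per run: its label and the sum of Variable within it
res <- data.frame(Categories = cat2[!duplicated(run_id)],
                  Variable = as.vector(tapply(df$Variable, run_id, sum)))
print(res)
```

Grouping by the run id rather than by the label alone is what keeps the second a and b as separate rows instead of merging them with the first ones.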
You can try this:
# plyr package provides rbind.fill() function for row binding
library(plyr)
# Assuming you have a rows.csv containing the data, read it into a data frame
data<-read.csv("rows.csv",stringsAsFactors=FALSE)
# Find the lowest index of d or e (whichever comes first)
index<-min(match("d",data$Var1.nominal.), match("e",data$Var1.nominal.))
# Returns all rows containing d and e in Var1(nominal) column
tempData<-data[data$Var1.nominal. %in% c("d","e"),]
# Remove all the rows containing d and e from original data frame
data<-data[!data$Var1.nominal. %in% c("d","e"),]
# Reorder row index numbers in data
rownames(data)<-NULL
# Combine rows containing d and e in Var1(nominal)column, and sum up the column Var2(numeric)
tempData<-data.frame(Var1.nominal.="d+e",Var2.numeric.=sum(tempData[,2]))
# Combine original data and tempData frame with use of index
data<-rbind.fill(data[1:(index-1),],tempData,data[index:length(data[,1]),])
# Renaming "d+e" to"h"
data[index,1]="h"
# Getting rid of the tempData data frame
rm(tempData)
Output:
> data
Var1.nominal. Var2.numeric.
1 a 11
2 b 21
3 c 34
4 h 97
5 f 65
6 g 76
7 a 13
8 b 24

R sort summarise ddply by group sum

I have a data.frame like this
x <- data.frame(Category=factor(c("One", "One", "Four", "Two","Two",
"Three", "Two", "Four","Three")),
City=factor(c("D","A","B","B","A","D","A","C","C")),
Frequency=c(10,1,5,2,14,8,20,3,5))
Category City Frequency
1 One D 10
2 One A 1
3 Four B 5
4 Two B 2
5 Two A 14
6 Three D 8
7 Two A 20
8 Four C 3
9 Three C 5
I want to make a pivot table with sum(Frequency) and used the ddply function like this:
ddply(x,.(Category,City),summarize,Total=sum(Frequency))
Category City Total
1 Four B 5
2 Four C 3
3 One A 1
4 One D 10
5 Three C 5
6 Three D 8
7 Two A 34
8 Two B 2
But I need these results sorted by the total in each Category group. Something like this:
Category City Frequency
1 Two A 34
2 Two B 2
3 Three D 8
4 Three C 5
5 One D 10
6 One A 1
7 Four B 5
8 Four C 3
I have looked and tried sort, order, arrange, but nothing seems to do what I need. How can I do this in R?
Here is a base R version, where DF is the result of your ddply call:
with(DF, DF[order(-ave(Total, Category, FUN=sum), Category, -Total), ])
produces:
Category City Total
7 Two A 34
8 Two B 2
6 Three D 8
5 Three C 5
4 One D 10
3 One A 1
1 Four B 5
2 Four C 3
The logic is basically the same as David's: calculate the sum of Total for each Category, use that number for all rows in each Category (we do this with ave(..., FUN=sum)), and then sort by that, plus some tie-breakers to make sure everything comes out in the expected order.
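To make the intermediate step visible, here is the same idea split in two, with the per-Category sum stored in a helper column first. DF is the ddply result from the question typed in by hand, and the size column name is borrowed from the data.table answer.

```r
# The ddply result from the question, typed by hand
DF <- data.frame(Category = c("Four", "Four", "One", "One",
                              "Three", "Three", "Two", "Two"),
                 City = c("B", "C", "A", "D", "C", "D", "A", "B"),
                 Total = c(5, 3, 1, 10, 5, 8, 34, 2),
                 stringsAsFactors = FALSE)

# Repeat each Category's grand total on every row of that Category
DF$size <- ave(DF$Total, DF$Category, FUN = sum)

# Biggest Category first; Category breaks ties between equal sizes;
# within a Category, biggest Total first
sorted <- DF[order(-DF$size, DF$Category, -DF$Total), ]
print(sorted)
```

The Category tie-breaker matters when two categories happen to have the same sum: without it their rows could end up interleaved.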
This is a nice question, and I can't think of a more direct way of doing this than creating a total size index and then sorting by it. Here's a possible data.table approach using the setorder function, which orders your data by reference:
library(data.table)
Res <- setDT(x)[, .(Total = sum(Frequency)), by = .(Category, City)]
setorder(Res[, size := sum(Total), by = Category], -size, -Total, Category)[]
# Category City Total size
# 1: Two A 34 36
# 2: Two B 2 36
# 3: Three D 8 13
# 4: Three C 5 13
# 5: One D 10 11
# 6: One A 1 11
# 7: Four B 5 8
# 8: Four C 3 8
Or, if you are deep in the Hadleyverse, we can reach a similar result using the newer dplyr package (as suggested by @akrun):
library(dplyr)
x %>%
group_by(Category, City) %>%
summarise(Total = sum(Frequency)) %>%
mutate(size= sum(Total)) %>%
ungroup %>%
arrange(-size, -Total, Category)
