Summarise/reshape data in R - r

For an example dataframe:
df <- structure(list(id = 1:18, region = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("a",
"b"), class = "factor"), age.cat = structure(c(1L, 1L, 2L, 2L,
2L, 3L, 3L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 4L), .Label = c("0-18",
"19-35", "36-50", "50+"), class = "factor")), .Names = c("id",
"region", "age.cat"), class = "data.frame", row.names = c(NA,
-18L))
I want to reshape the data, as detailed below:
region 0-18 19-35 36-50 50+
a 2 3 2 1
b 4 2 1 3
Do I simply aggregate or reshape the data? Any help would be much appreciated.

You can do it just using table:
table(df$region, df$age.cat)
0-18 19-35 36-50 50+
a 2 3 2 1
b 4 2 1 3
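If a plain data frame with a region column is wanted rather than a table object, the table can be converted afterwards. The following is a minimal base R sketch, assuming the same df as in the question (rebuilt here in compact form so the block runs on its own); the name wide is only illustrative.

```r
# Compact rebuild of the question's data (same values as the dput output)
age_levels <- c("0-18", "19-35", "36-50", "50+")
df <- data.frame(
  id = 1:18,
  region = factor(rep(c("a", "b"), times = c(8, 10))),
  age.cat = factor(
    age_levels[c(1, 1, 2, 2, 2, 3, 3, 4, 1, 1, 1, 1, 2, 2, 3, 4, 4, 4)],
    levels = age_levels
  )
)

# Cross-tabulate, then turn the 2 x 4 table into a data frame
tab <- table(df$region, df$age.cat)
wide <- as.data.frame.matrix(tab)

# Move the row names into a proper column
wide <- cbind(region = rownames(wide), wide)
rownames(wide) <- NULL
wide
```

as.data.frame.matrix() keeps one column per age category, whereas as.data.frame() on a table would return the long format with a Freq column.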

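Using reshape2:
install.packages('reshape2')  # only needed once
library(reshape2)
# long format: one row per id, with age.cat stored in the `value` column
df1 <- melt(df, measure.vars = 'age.cat')
# no aggregation function is given, so dcast falls back to length, i.e. counts
df1 <- dcast(df1, region ~ value)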

Related

Count the number of unique character elements in one column based on several different (sub-)groupings (columns)

Here is a sample dataset.
test_data <- structure(list(ID = structure(c(4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L,
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("P39190",
"U93491", "X28348", "Z93930"), class = "factor"), Sex = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L), .Label = c("F", "M"), class = "factor"), Group = structure(c(2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 3L,
3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("C83Z", "CAP_1", "P000"), class = "factor"),
Category = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L), .Label = c("A",
"B", "C"), class = "factor")), .Names = c("ID", "Sex", "Group",
"Category"), class = "data.frame", row.names = c(NA, -36L))
head(test_data, n = 10)
ID Sex Group Category
1 Z93930 M CAP_1 A
2 Z93930 M CAP_1 A
3 Z93930 M C83Z A
4 Z93930 M C83Z A
5 Z93930 M C83Z A
6 Z93930 M C83Z A
7 X28348 F C83Z B
8 X28348 F C83Z B
9 X28348 F CAP_1 B
10 X28348 F CAP_1 B
I want to count the number of unique elements in three levels:
1. Count of unique elements per "Category"
2. Count of unique elements in each "Category" grouped by "Group"
3. Count of unique elements in each "Group" grouped by "Sex"
I can of course use base R and a bit of dplyr to achieve this:
library(dplyr)
for(i in 1:length(unique(test_data$Category))){
temp <- test_data %>% dplyr::filter(Category == unique(test_data$Category)[i])
message(paste0(unique(test_data$Category)[i]), ": ", length(unique(temp$ID)))
for(k in 1:length(unique(temp$Group))){
temp_grp <- temp %>% dplyr::filter(Group == unique(temp$Group)[k])
message(paste0("\n ├──", unique(temp$Group)[k],
": ", length(unique(temp_grp$ID))))
message(paste0("\n\t"), "F: ", length(unique(temp_grp[which(temp_grp$Sex == "F"),])$ID))
message(paste0("\n\t"), "M: ", length(unique(temp_grp[which(temp_grp$Sex == "M"),])$ID))
}
}
But this is messy and inelegant.
Is there a function in R that can achieve this in a cleaner and more efficient manner and preferably produce the output in the form of a dataframe?
I was under the impression that dplyr::group_by was made for such tasks. But I cannot quite figure out how it works for sub-groupings.
The code below:
test_data %>% dplyr::group_by(Category) %>% summarise(n = n_distinct(ID))
achieves the first task (point 1. above). But I cannot achieve points 2 and 3 in the same way.
SOLUTION:
test_data %>% dplyr::group_by(Category, Group, Sex) %>% summarise(n = n_distinct(ID))
If I understand your question correctly, you were very close. The idea is simply to group by two columns at a time: group_by(col1, col2).
For point 2:
test_data %>% dplyr::group_by(Category, Group) %>% summarise(n = n_distinct(ID))
Source: local data frame [9 x 3]
Groups: Category [?]
Category Group n
<fctr> <fctr> <int>
1 A C83Z 1
2 A CAP_1 1
3 A P000 2
4 B C83Z 1
5 B CAP_1 1
6 B P000 1
7 C C83Z 1
8 C CAP_1 1
9 C P000 2
And for point 3:
test_data %>% dplyr::group_by(Group, Sex) %>% summarise(n = n_distinct(ID))
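For comparison, the same three counts can be sketched in base R with aggregate(). This assumes the question's test_data (rebuilt compactly below so the block is self-contained); n_unique and the by_* names are illustrative helpers, not part of any package.

```r
# Compact rebuild of the question's test_data (same 36 rows)
test_data <- data.frame(
  ID = rep(c("Z93930", "X28348", "Z93930", "X28348", "P39190", "U93491"),
           times = c(6, 6, 7, 4, 11, 2)),
  Sex = rep(c("M", "F", "M", "F", "M"), times = c(6, 6, 7, 15, 2)),
  Group = rep(c("CAP_1", "C83Z", "CAP_1", "C83Z", "P000",
                "CAP_1", "C83Z", "CAP_1", "P000"),
              times = c(2, 6, 4, 4, 4, 7, 2, 4, 3)),
  Category = rep(c("A", "B", "A", "B", "C", "A", "C"),
                 times = c(6, 6, 7, 4, 11, 1, 1)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of distinct values in a vector
n_unique <- function(x) length(unique(x))

# Point 1: unique IDs per Category
by_category <- aggregate(ID ~ Category, data = test_data, FUN = n_unique)
# Point 2: unique IDs per Category and Group
by_cat_group <- aggregate(ID ~ Category + Group, data = test_data, FUN = n_unique)
# Point 3: unique IDs per Group and Sex
by_group_sex <- aggregate(ID ~ Group + Sex, data = test_data, FUN = n_unique)
```

In each result the ID column holds the count of distinct IDs for that combination of grouping columns.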
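If I understand correctly, you can use dplyr::count for all three cases. Note that count() tallies rows, so reduce the data to distinct ID combinations first to get the number of unique IDs:
test_data %>% dplyr::distinct(ID, Category) %>% dplyr::count(Category)
test_data %>% dplyr::distinct(ID, Group, Category) %>% dplyr::count(Group, Category)
test_data %>% dplyr::distinct(ID, Sex, Group) %>% dplyr::count(Sex, Group)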

How can I use R to fill rows based on column?

I have the following table
Code Name Class
1
2 Monday day
5 green color
9
6
1 red color
1
2
9 Tuesday day
6
5
The goal is to fill the Name and Class columns based on the Code column of a filled row. For example, the second row is filled and its code is 2, so I would like to fill all the rows where Code = 2 with Name=Monday and Class=day.
I tried using fill() from tidyr, but that seems to require ordered data.
df <- structure(list(Code = c(1L, 2L, 5L, 9L, 6L, 1L, 1L, 2L, 9L, 6L,
5L), Name = structure(c(1L, 3L, 2L, 1L, 1L, 4L, 1L, 1L, 5L, 1L,
1L), .Label = c("", "green", "Monday", "red", "Tuesday"), class = "factor"),
Class = structure(c(1L, 3L, 2L, 1L, 1L, 2L, 1L, 1L, 3L, 1L,
1L), .Label = c("", "color", "day"), class = "factor")), .Names = c("Code",
"Name", "Class"), class = "data.frame", row.names = c(NA, -11L
))
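One option is to join the data frame to its own filled-in rows by Code and keep the joined Name and Class:
library(dplyr)
# rows with a non-empty Name act as the lookup table; codes without one get NA
final_df <- left_join(df, df[df$Name!='',], by='Code')[,c(1,4:5)]
colnames(final_df) <- colnames(df)
final_df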
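The same lookup can be sketched in base R with match(), which avoids the join. This assumes the question's table stored as df with character columns (rebuilt below so the block runs on its own); lookup, idx and filled are illustrative names.

```r
# The question's table, with empty strings for the unfilled cells
df <- data.frame(
  Code  = c(1L, 2L, 5L, 9L, 6L, 1L, 1L, 2L, 9L, 6L, 5L),
  Name  = c("", "Monday", "green", "", "", "red", "", "", "Tuesday", "", ""),
  Class = c("", "day", "color", "", "", "color", "", "", "day", "", ""),
  stringsAsFactors = FALSE
)

# Rows that already carry a Name serve as the lookup table
lookup <- df[df$Name != "", ]

# Position of each row's Code in the lookup table (NA when no filled row exists)
idx <- match(df$Code, lookup$Code)

filled <- data.frame(
  Code  = df$Code,
  Name  = lookup$Name[idx],
  Class = lookup$Class[idx],
  stringsAsFactors = FALSE
)
filled
```

match() returns the first matching position, so if a code had several filled rows only the first would be used. Code 6 has no filled row in this data and therefore stays NA.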

Taking the frequency of three different columns [duplicate]

I have a dataframe like this:
df <- structure(list(col1 = structure(c(1L, 1L, 2L, 3L, 1L, 3L, 1L,
3L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 4L), .Label = c("stock1",
"stock2", "stock3", "stock4"), class = "factor"), col2 = structure(c(4L,
5L, 7L, 6L, 5L, 5L, 5L, 6L, 6L, 8L, 8L, 4L, 3L, 3L, 1L, 2L, 3L
), .Label = c("comapny1", "comapny1+comapny4", "comapny4", "company1",
"company2", "company2+company1", "company3", "company4"), class = "factor"),
col3 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("predictor1", "predictor2"
), class = "factor")), .Names = c("col1", "col2", "col3"), class = "data.frame", row.names = c(NA,
-17L))
I would like to count how often each combination of the three columns occurs.
Expected output
df2 <- structure(list(col1 = structure(c(1L, 1L, 1L, 2L, 4L, 1L, 1L,
3L, 3L, 1L, 2L, 1L), .Label = c("stock1", "stock2", "stock3",
"stock4"), class = "factor"), col2 = structure(c(1L, 2L, 3L,
3L, 3L, 4L, 5L, 5L, 6L, 6L, 7L, 8L), .Label = c("comapany1",
"comapany1+comapany4", "comapany4", "company1", "company2", "company2+company1",
"company3", "company4"), class = "factor"), col3 = structure(c(2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("predictor1",
"predictor2"), class = "factor"), frequency = c(1L, 1L, 1L, 1L,
1L, 2L, 3L, 1L, 2L, 1L, 1L, 2L)), .Names = c("col1", "col2",
"col3", "frequency"), class = "data.frame", row.names = c(NA,
-12L))
How can I do this?
We can use count
library(dplyr)
count(df, col1, col2, col3)
# A tibble: 12 x 4
# col1 col2 col3 n
# <fctr> <fctr> <fctr> <int>
# 1 stock1 comapny1 predictor2 1
# 2 stock1 comapny1+comapny4 predictor2 1
# 3 stock1 comapny4 predictor2 1
# 4 stock1 company1 predictor1 2
# 5 stock1 company2 predictor1 3
# 6 stock1 company2+company1 predictor1 1
# 7 stock1 company4 predictor1 2
# 8 stock2 comapny4 predictor2 1
# 9 stock2 company3 predictor1 1
#10 stock3 company2 predictor1 1
#11 stock3 company2+company1 predictor1 2
#12 stock4 comapny4 predictor2 1
Or with data.table
library(data.table)
setDT(df)[, .N, .(col1, col2, col3)]
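If neither package is available, a base R sketch with aggregate() gives the same counts. It assumes the question's df (rebuilt below from the dput values, as character columns) and sums a helper column of ones; the column name frequency follows the expected output, and freq is an illustrative name.

```r
# The question's data, written out as character vectors
df <- data.frame(
  col1 = paste0("stock", c(1, 1, 2, 3, 1, 3, 1, 3, 1, 1, 1, 1, 1, 2, 1, 1, 4)),
  col2 = c("company1", "company2", "company3", "company2+company1",
           "company2", "company2", "company2", "company2+company1",
           "company2+company1", "company4", "company4", "company1",
           "comapny4", "comapny4", "comapny1", "comapny1+comapny4",
           "comapny4"),
  col3 = rep(c("predictor1", "predictor2"), times = c(12, 5)),
  stringsAsFactors = FALSE
)

# Add a column of ones and sum it within each observed combination
freq <- aggregate(frequency ~ col1 + col2 + col3,
                  data = cbind(df, frequency = 1L),
                  FUN = sum)
freq
```

Only combinations that actually occur are returned, which matches the behaviour of count() and .N above.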

Indicator feature creation in R based on multiple columns

I have a dataset with 10 columns, 3 of which are of interest for creating a new indicator feature. The features are "pT", "pN", and "M", and each takes several different values. Across all the values these 3 features take, there are a total of 9 unique combinations that need to be captured in the new variable.
PATHOT PATHON PATHOM
1 pT2 pN1 M0
4 pT1 pN1 M0
13 pT3 pN1 M0
161 pT1 *pN2 M0
391 pT1 pN1 *M1
810 *pTIS pN1 M0
948 pT3 *pN2 M0
1043 pT2 pN1 *M1
1067 *pT4 pN1 M0
For example, the new variable will have the value "1" when PATHOT=pT2, PATHON=pN1 and PATHOM=M0, and so on up to value 9. I have completed the task, but it took almost 20 lines of code, with one vectorised assignment per unique combination.
diag3_bs$sfd[diag3_bs$pathot=="pT2" & diag3_bs$pathon=="pN1" &
diag3_bs$pathom=="M0"] <- 1
diag3_bs$sfd[diag3_bs$pathot=="pT1" & diag3_bs$pathon=="pN1" &
diag3_bs$pathom=="M0"] <- 2
diag3_bs$sfd[diag3_bs$pathot=="pT3" & diag3_bs$pathon=="pN1" &
diag3_bs$pathom=="M0"] <- 3  # ... and so on up to 9
Is there a better, more automated way of getting the same result?
The dput() output of the data frame is given below:
structure(list(F_STATUS = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "Y", class = "factor"), EVENT_ID = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "BASELINE", class =
"factor"),
PAG_NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "BR2", class = "factor"), PTSIZE = c(3, 4,
2.7, 2, 0.9, 3, 3, 0.9, 3, 4.5), PTSIZE_U = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "CM", class = "factor"),
PT_SYM = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("", "-", "<", ">"), class = "factor"), PATHOT = structure(c(4L,
4L, 4L, 3L, 3L, 4L, 4L, 3L, 4L, 4L), .Label = c("*pT4", "*pTIS",
"pT1", "pT2", "pT3"), class = "factor"), PATHON = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("*pN2", "pN1"
), class = "factor"), PATHOM = structure(c(2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("*M1", "M0"), class = "factor"),
RSUBJID = 901000:901009, RUSUBJID = structure(1:10, .Label = c(
"000301-000-901-251", "000301-000-901-252", "000301-000-901-253",
"000301-000-901-254", "000301-000-901-255", "000301-000-901-256",
"000301-000-901-257", "000301-000-901-258", "000301-000-901-259",
"000301-000-901-260", "000301-000-901-261", "000301-000-901-262")
, class = "factor")), .Names = c("F_STATUS", "EVENT_ID", "PAG_NAME", "PTSIZE", "PTSIZE_U", "PT_SYM", "PATHOT",
"PATHON", "PATHOM", "RSUBJID", "RUSUBJID"), row.names = c(NA, 10L),
class = "data.frame")
Thanks.
I edited the data so that it no longer throws an error on input (it is called dat below), and also created a data frame of the nine possible combinations:
stg_tbl <- structure(list(PATHOT = structure(c(4L, 3L, 5L, 3L, 3L, 2L, 5L,
4L, 1L), .Label = c("*pT4", "*pTIS", "pT1", "pT2", "pT3"), class = "factor"),
PATHON = structure(c(2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L), .Label = c("*pN2",
"pN1"), class = "factor"), PATHOM = structure(c(2L, 2L, 2L,
2L, 1L, 2L, 2L, 1L, 2L), .Label = c("*M1", "M0"), class = "factor")), .Names = c("PATHOT",
"PATHON", "PATHOM"), class = "data.frame", row.names = c("1",
"4", "13", "161", "391", "810", "948", "1043", "1067"))
Make a vector of text-equivalents of the categories:
stg_lbls <- with(stg_tbl, paste(PATHOT, PATHON, PATHOM, sep="_") )
A factor built from the same pasted strings, with those labels as its levels, then has integer codes that are the desired result:
dat$stg <- with(dat, factor( paste(PATHOT, PATHON, PATHOM, sep="_"), levels=stg_lbls))
as.numeric(dat$stg)
#[1] 1 1 1 2 2 1 1 2 1 1
You can just assign those values in the usual way:
dat$sfd <- as.numeric(dat$stg)
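The factor step can also be skipped by matching the pasted keys directly against the lookup labels. A minimal sketch, assuming the nine combinations from the question in their listed order and only the three relevant columns of the sample data (both rebuilt here as character columns); make_key is a hypothetical helper.

```r
# The nine combinations from the question, in the order that defines codes 1..9
stg_tbl <- data.frame(
  PATHOT = c("pT2", "pT1", "pT3", "pT1", "pT1", "*pTIS", "pT3", "pT2", "*pT4"),
  PATHON = c("pN1", "pN1", "pN1", "*pN2", "pN1", "pN1", "*pN2", "pN1", "pN1"),
  PATHOM = c("M0", "M0", "M0", "M0", "*M1", "M0", "M0", "*M1", "M0"),
  stringsAsFactors = FALSE
)

# The three staging columns of the question's ten sample rows
dat <- data.frame(
  PATHOT = c("pT2", "pT2", "pT2", "pT1", "pT1", "pT2", "pT2", "pT1", "pT2", "pT2"),
  PATHON = "pN1",
  PATHOM = "M0",
  stringsAsFactors = FALSE
)

# Hypothetical helper: one text key per row
make_key <- function(d) paste(d$PATHOT, d$PATHON, d$PATHOM, sep = "_")

# Position of each row's key in the lookup table is the indicator value
dat$sfd <- match(make_key(dat), make_key(stg_tbl))
dat$sfd
```

Rows whose combination is not in stg_tbl would get NA, which makes unexpected combinations easy to spot.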
I made some new data that should be useful for your problem.
k<-expand.grid(data.frame(a=letters[1:3],b=letters[4:6],c=letters[7:9]))
library(dplyr)
k %>% mutate(groups=paste0(a,b,c))->k2
k2$groups<-as.numeric(factor(k2$groups))
k2
It's crude, and you don't get to pick which combination gets which number, so it would take some digging afterwards, but it's quick.

How to match two columns by group in R

I have a dataset like d1.
d1<-structure(list(id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 4L), cname = structure(c(1L,
1L, 1L, 2L, 3L, 2L, 2L, 3L, 1L), .Label = c("AA", "BB", "CC"), class = "factor"),
value = c(1L, 2L, 1L, 2L, 2L, 1L, 2L, 3L, 1L), recentcname = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("id", "cname",
"value", "recentcname"), class = "data.frame", row.names = c(NA,
-9L))
Here my key variables are "id" and "value". For every individual id, I have to find the record with the maximum value in the "value" column and copy that record's "cname" string into the "recentcname" column for every row of that id. If an id has two records tied for the maximum, the "cname" from the second of those records should be used.
My final output should look like d2.
d2<-structure(list(id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 4L), cname = structure(c(1L,
1L, 1L, 2L, 3L, 2L, 2L, 3L, 1L), .Label = c("AA", "BB", "CC"), class = "factor"),
value = c(1L, 2L, 1L, 2L, 2L, 1L, 2L, 3L, 1L), recentcname = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c("AA", "CC"), class = "factor")), .Names = c("id",
"cname", "value", "recentcname"), class = "data.frame", row.names = c(NA,
-9L))
I can do it by splitting the dataset by id, but that is very time-consuming. Are there any alternatives for this task? Please help.
How about this:
d1$recentcname <- unsplit(lapply(split(d1[,c("value","cname")], d1$id), function(x) {
rep(tail(x$cname[x$value==max(x$value)],1), nrow(x))
}), d1$id)
We basically split the data by id, then look for the last max value in each subset and repeat the corresponding cname for each row in the subset. Then we use unsplit() to put the values back in the correct order corresponding to d1.
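A variant of the same idea can be sketched with ave(), which applies a function per group and returns the results in the original row order, so no unsplit() step is needed. This assumes the question's d1 with cname as a character column (rebuilt below); last_max_row is an illustrative name.

```r
# The question's data, without the empty recentcname column
d1 <- data.frame(
  id    = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 4L),
  cname = c("AA", "AA", "AA", "BB", "CC", "BB", "BB", "CC", "AA"),
  value = c(1L, 2L, 1L, 2L, 2L, 1L, 2L, 3L, 1L),
  stringsAsFactors = FALSE
)

# For every row, find the row number of the last maximum value within its id
last_max_row <- ave(seq_len(nrow(d1)), d1$id, FUN = function(i) {
  v <- d1$value[i]
  rep(i[max(which(v == max(v)))], length(i))
})

# Look up the cname stored in that row
d1$recentcname <- d1$cname[last_max_row]
d1
```

Working with row numbers rather than the cname values keeps the lookup independent of whether cname is a factor or a character vector.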
Try (taking, for each id, the last row that holds the maximum value):
dd = do.call(rbind, lapply(split(d1, d1$id), function(x) tail(x[x$value == max(x$value), ], 1)))
names(dd)[2]= 'recentcname'
merge(d1[1:3], dd[1:2])
id cname value recentcname
1 1 AA 1 AA
2 1 AA 2 AA
3 1 AA 1 AA
4 2 BB 2 CC
5 2 CC 2 CC
6 3 BB 1 CC
7 3 BB 2 CC
8 3 CC 3 CC
9 4 AA 1 AA
