Split column and then aggregate count of unique values

I have the following dataset:
color type
1 black chair
2 black chair
3 black sofa
4 green table
5 green sofa
I want to split this to form the following dataset:
arg value
1 color black
2 color black
3 color black
4 color green
5 color green
6 type chair
7 type chair
8 type sofa
9 type table
10 type sofa
I would then like to count each unique arg-value combination:
arg value count
1 color black 3
2 color green 2
3 type chair 2
4 type sofa 2
5 type table 1
It does not need to be sorted by count. This would then be printed in the following output form:
arg unique_count_values
1 color black(3) green(2)
2 type chair(2) sofa(2) table(1)
I tried the following:
AttrList<-colnames(DataSet)
aggregate(.~ AttrList, DataSet, FUN=function(x) length(unique(x)) )
I also tried summary(DataSet), but then I am not sure how to manipulate the result to get it into the desired output form.
I am relatively new to R. If you find something that would reduce the effort, please let me know. Thanks!
Update
So, I tried the following:
x <- matrix(c(101:104,101:104,105:106,1,2,3,3,4,5,4,5,7,5), nrow=10, ncol=2)
V1 V2
1 101 1
2 102 2
3 103 3
4 104 3
5 101 4
6 102 5
7 103 4
8 104 5
9 105 7
10 106 5
Converting to table:
as.data.frame(table(x))
Which gives me:
x Freq
1 1 1
2 2 1
3 3 2
4 4 2
5 5 3
6 7 1
7 101 2
8 102 2
9 103 2
10 104 2
11 105 1
12 106 1
What should I do to get this:
V Val Freq
1 V2 1 1
2 V2 2 1
3 V2 3 2
4 V2 4 2
5 V2 5 3
6 V2 7 1
7 V1 101 2
8 V1 102 2
9 V1 103 2
10 V1 104 2
11 V1 105 1
12 V1 106 1

Try
library(tidyr)
library(dplyr)
df %>%
  gather(arg, value) %>%
  count(arg, value) %>%
  group_by(arg) %>%
  summarise(unique_count_values = toString(paste0(value, "(", n, ")")))
Which gives:
#Source: local data frame [2 x 2]
#
# arg unique_count_values
# (fctr) (chr)
#1 color black(3), green(2)
#2 type chair(2), sofa(2), table(1)

Here's a base R approach. I've expanded it out a bit mostly so that I can add comments as to what is happening.
The basic idea is to just use sapply to loop through the columns, tabulate the data in each column, and then use sprintf to extract the relevant parts of the tabulation to achieve your desired output (the names, followed by the values in brackets).
The stack function takes the final named vector and converts it to a data.frame.
stack(             ## convert the final output to a data.frame
  sapply(          ## cycle through each column
    mydf, function(x) {
      temp <- table(x)  ## calculate counts and paste together values
      paste(sprintf("%s (%d)", names(temp), temp), collapse = " ")
    }))
# values ind
# 1 black (3) green (2) color
# 2 chair (2) sofa (2) table (1) type
If the data are factors, you could also try something like the following, which matches the data you expect, but not the desired output.
stack(apply(summary(mydf), 2, function(x) paste(na.omit(x), collapse = " ")))
# values ind
# 1 black:3 green:2 color
# 2 chair:2 sofa :2 table:1 type

Related

Is there an R function to sequentially assign a code to each value in a dataframe, in the order it appears within the dataset?

I have a table with a long list of aliased values like this:
> head(transmission9, 50)
# A tibble: 50 x 2
In_Node End_Node
<chr> <chr>
1 c4ca4238 2838023a
2 c4ca4238 d82c8d16
3 c4ca4238 a684ecee
4 c4ca4238 fc490ca4
5 28dd2c79 c4ca4238
6 f899139d 3def184a
I would like to have R go through both columns and assign a number sequentially to each value, in the order that an aliased value appears in the dataset. I would like R to read across rows first, then down columns. For example, for the dataset above:
In_Node End_Node
<chr> <chr>
1 1 2
2 1 3
3 1 4
4 1 5
5 6 1
6 7 8
Is this possible? Ideally, I'd also love to be able to generate a "key" which would match each sequential code to each aliased value, like so:
Code Value
1 c4ca4238
2 2838023a
3 d82c8d16
4 a684ecee
5 fc490ca4
Thank you in advance for the help!
You could do:
df1 <- df
df1[] <- as.numeric(factor(unlist(df), unique(c(t(df)))))
df1
In_Node End_Node
1 1 2
2 1 3
3 1 4
4 1 5
5 6 1
6 7 8
You can match against the unique values. For a single vector, the code is straightforward:
match(vec, unique(vec))
The requirement to go across columns before rows makes this slightly tricky: you need to transpose the values first. After that, match them.
Finally, use [<- to assign the result back to a data.frame of the same shape as your original data (here x):
y = x
y[] = match(unlist(x), unique(c(t(x))))
y
In_Node End_Node
1 1 2
2 1 3
3 1 4
4 1 5
5 6 1
6 7 8
c(t(x)) is a bit of a hack:
t first converts the tibble to a matrix and then transposes it. If your tibble contains multiple data types, these will be coerced to a common type.
c(…) discards attributes. In particular, it drops the dimensions of the transposed matrix, i.e. it converts the matrix into a vector, with the values now in the correct order.
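A tiny illustration of the difference between the column-major unlist() and the row-major c(t(...)) on a two-column data.frame (assuming R >= 4.0, so the character columns stay character):

```r
x <- data.frame(a = c("p", "q"), b = c("r", "s"))

unlist(x, use.names = FALSE)  # column by column: "p" "q" "r" "s"
c(t(x))                       # row by row:       "p" "r" "q" "s"

# match() against the row-major unique values numbers the entries
# in order of first appearance across rows
match(unlist(x), unique(c(t(x))))  # 1 3 2 4
```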
A dplyr version
Let's first re-create the sample data:
library(tidyverse)
transmission9 <- read.table(header = TRUE, text = "  In_Node End_Node
1 c4ca4238 2838023a
2 c4ca4238 d82c8d16
3 c4ca4238 a684ecee
4 c4ca4238 fc490ca4
5 28dd2c79 c4ca4238
6 f899139d 3def184a")
Then simply do:
transmission9 %>%
  mutate(across(everything(), ~ match(., unique(c(t(cur_data()))))))
#> In_Node End_Node
#> 1 1 2
#> 2 1 3
#> 3 1 4
#> 4 1 5
#> 5 6 1
#> 6 7 8
Use the .names argument if you want to create new columns instead:
transmission9 %>%
  mutate(across(everything(), ~ match(., unique(c(t(cur_data())))),
                .names = '{.col}_code'))
In_Node End_Node In_Node_code End_Node_code
1 c4ca4238 2838023a 1 2
2 c4ca4238 d82c8d16 1 3
3 c4ca4238 a684ecee 1 4
4 c4ca4238 fc490ca4 1 5
5 28dd2c79 c4ca4238 6 1
6 f899139d 3def184a 7 8

Removing character from dataframe

I have this simple code, which generates a data frame. I want to remove the V character from the middle column. Is there any simple way to do that?
Here is a small test example, very similar to the actual code (which is very long).
mat1 <- matrix(c(1, 2, 3, 4, 5, "V1", "V2", "V3", "V4", "V5", 1, 2, 3, 4, 5), ncol = 3)
mat <- as.data.frame(mat1)
colnames(mat) <- c("x", "row", "y")
mat
This is the data frame:
x row y
1 1 V1 1
2 2 V2 2
3 3 V3 3
4 4 V4 4
5 5 V5 5
I just want to remove the V's like this:
x row y
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
We can use str_replace from stringr
library(stringr)
mat$row <- str_replace(mat$row, "V", "")
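If you'd rather not add a dependency, base R's sub() does the same job (a sketch using the question's data; wrap the result in as.numeric() if you want numbers rather than character):

```r
mat1 <- matrix(c(1, 2, 3, 4, 5, "V1", "V2", "V3", "V4", "V5", 1, 2, 3, 4, 5), ncol = 3)
mat <- as.data.frame(mat1)
colnames(mat) <- c("x", "row", "y")

# sub() replaces the first match in each element; each value here
# contains at most one "V", so sub() and gsub() are equivalent
mat$row <- sub("V", "", mat$row)
mat$row  # still character; use as.numeric(mat$row) for numbers
```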

Create a new level column based on unique row sets

I want to create a new column with new variables (preferably letters) so I can count the frequency of each set later on.
Let's say I have a data frame called datatemp:
datatemp <- data.frame(colors = rep(c("red", "blue"), 6), val = 1:6)
colors val
1 red 1
2 blue 2
3 red 3
4 blue 4
5 red 5
6 blue 6
7 red 1
8 blue 2
9 red 3
10 blue 4
11 red 5
12 blue 6
I can see my unique row sets, where the colors and val columns are taken together, with:
unique(datatemp[c("colors","val")])
colors val
1 red 1
2 blue 2
3 red 3
4 blue 4
5 red 5
6 blue 6
What I really want to do is create a new column in the same data frame where each unique row set above gets a level, such as:
colors val freq
1 red 1 A
2 blue 2 B
3 red 3 C
4 blue 4 D
5 red 5 E
6 blue 6 F
7 red 1 A
8 blue 2 B
9 red 3 C
10 blue 4 D
11 red 5 E
12 blue 6 F
I know that's very basic; however, I couldn't come up with a workable approach for a huge dataset.
To make the question clearer, here is another representation of the desired output:
colA colB newcol
10 11 A
12 15 B
10 11 A
13 15 C
Values in the new column should be based on the uniqueness of the two columns before it.
www's solution maps the unique values in your value column to letters in the freq column. If you want to create a factor variable for each unique combination of colors and val, you could do something along these lines:
library(plyr)
datatemp = data.frame(colors=rep( c("red","blue"), 6), val = 1:6)
datatemp$freq <- factor(paste(datatemp$colors, datatemp$val), levels=unique(paste(datatemp$colors, datatemp$val)))
datatemp$freq <- mapvalues(datatemp$freq, from = levels(datatemp$freq), to = LETTERS[1:length(levels(datatemp$freq))])
I first create a new factor variable for each unique combination of val and colors, and then use plyr::mapvalues to rename the factor levels to letters.
We can concatenate the val and colors columns and convert the result to a factor, then replace the factor levels with letters.
datatemp$Freq <- as.factor(paste(datatemp$val, datatemp$colors, sep = "_"))
levels(datatemp$Freq) <- LETTERS[1:length(levels(datatemp$Freq))]
datatemp
# colors val Freq
# 1 red 1 A
# 2 blue 2 B
# 3 red 3 C
# 4 blue 4 D
# 5 red 5 E
# 6 blue 6 F
# 7 red 1 A
# 8 blue 2 B
# 9 red 3 C
# 10 blue 4 D
# 11 red 5 E
# 12 blue 6 F
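Another base R route along the same lines, using interaction() to build the combined key and match() to number each combination in order of first appearance (a sketch):

```r
datatemp <- data.frame(colors = rep(c("red", "blue"), 6), val = 1:6)

# interaction() combines both columns into a single factor key;
# match() against the unique keys assigns 1, 2, 3, ... in order of
# first appearance, which LETTERS then turns into A, B, C, ...
key <- interaction(datatemp$colors, datatemp$val, drop = TRUE)
datatemp$freq <- LETTERS[match(key, unique(key))]
datatemp$freq  # "A" "B" "C" "D" "E" "F" "A" "B" "C" "D" "E" "F"
```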

How can I arrange data from wide format to long format, and specify relationships

Currently I have a file which I need to convert from wide format to long format. An example of the data:
Subject,Cat1_Weight,Cat2_Weight,Cat3_Weight,Cat1_Sick,Cat2_Sick,Cat3_Sick
1,10,11,12,1,0,0
2,7,8,9,1,0,0
However, I need it in the long format as follows
Subject,CatNumber,Weight,Sickness
1,1,10,1
1,2,11,0
1,3,12,0
2,1,7,1
2,2,8,0
2,3,9,0
So far I have tried the melt function in R:
datalong <- melt(exp2_simon_shortform, id ="Subject")
But it treats every single column name as a unique variable each with its own value. Does anybody know how I could get from wide to long as specified, making reference to the column header names?
Cheers.
EDIT: I've realised I made an error. My final output needs to be as follows. So from the Cat1_ portion, I actually need to get out "Cat" and "1"
Subject Animal CatNumber Weight Sickness
1 Cat 1 10 1
1 Cat 2 11 0
1 Cat 3 12 0
2 Cat 1 7 1
2 Cat 2 8 0
2 Cat 3 9 0
Any updated solutions much appreciated.
The "dplyr" + "tidyr" approach might be something like:
library(dplyr)
library(tidyr)
mydf %>%
  gather(var, val, -Subject) %>%
  separate(var, into = c("CatNumber", "variable")) %>%
  spread(variable, val)
# Subject CatNumber Sick Weight
# 1 1 Cat1 1 10
# 2 1 Cat2 0 11
# 3 1 Cat3 0 12
# 4 2 Cat1 1 7
# 5 2 Cat2 0 8
# 6 2 Cat3 0 9
Add a mutate in there along with gsub to remove the "Cat" part of the "CatNumber" column.
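Putting that together (a sketch using the question's data; gather()/spread() have since been superseded by pivot_longer()/pivot_wider(), but they match the answer above):

```r
library(dplyr)
library(tidyr)

mydf <- data.frame(Subject = 1:2,
                   Cat1_Weight = c(10, 7), Cat2_Weight = c(11, 8), Cat3_Weight = c(12, 9),
                   Cat1_Sick = c(1, 1), Cat2_Sick = c(0, 0), Cat3_Sick = c(0, 0))

out <- mydf %>%
  gather(var, val, -Subject) %>%
  separate(var, into = c("CatNumber", "variable")) %>%
  spread(variable, val) %>%
  mutate(Animal = gsub("[0-9]+", "", CatNumber),                 # "Cat1" -> "Cat"
         CatNumber = as.integer(gsub("[^0-9]", "", CatNumber)))  # "Cat1" -> 1
out
```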
Update
Based on the discussions in chat, your data actually look something more like:
A = c("ATCint", "Blank", "None"); B = 1:5; C = c("ResumptionTime", "ResumptionMisses")
colNames <- expand.grid(A, B, C)
colNames <- sprintf("%s%d_%s", colNames[[1]], colNames[[2]], colNames[[3]])
subject = 1:60
set.seed(1)
M <- matrix(sample(10, length(subject) * length(colNames), TRUE),
            nrow = length(subject), dimnames = list(NULL, colNames))
mydf <- data.frame(Subject = subject, M)
Thus, you will need to do a few additional steps to get the output you desire. Try:
library(dplyr)
library(tidyr)
mydf %>%
  group_by(Subject) %>%                        ## Your ID variable
  gather(var, val, -Subject) %>%               ## Make long data. Everything except your IDs
  separate(var, into = c("partA", "partB")) %>%               ## Split new column into two parts
  mutate(partA = gsub("(.*)([0-9]+)", "\\1_\\2", partA)) %>%  ## Make new col easy to split
  separate(partA, into = c("A1", "A2")) %>%    ## Split this new column
  spread(partB, val)                           ## Transform to wide form
Which yields:
Source: local data frame [900 x 5]
Subject A1 A2 ResumptionMisses ResumptionTime
(int) (chr) (chr) (int) (int)
1 1 ATCint 1 9 3
2 1 ATCint 2 4 3
3 1 ATCint 3 2 2
4 1 ATCint 4 7 4
5 1 ATCint 5 7 1
6 1 Blank 1 4 10
7 1 Blank 2 2 4
8 1 Blank 3 7 5
9 1 Blank 4 1 9
10 1 Blank 5 10 10
.. ... ... ... ... ...
You can do it with base reshape, like:
reshape(dat, idvar = "Subject", direction = "long", varying = list(2:4, 5:7),
        v.names = c("Weight", "Sick"), timevar = "CatNumber")
# Subject CatNumber Weight Sick
#1.1 1 1 10 1
#2.1 2 1 7 1
#1.2 1 2 11 0
#2.2 2 2 8 0
#1.3 1 3 12 0
#2.3 2 3 9 0
Alternatively, since reshape expects names like variablename_groupname, you could change the names and then let reshape do the hard work:
names(dat) <- gsub("Cat(.+)_(.+)", "\\2_\\1", names(dat))
reshape(dat, idvar = "Subject", direction = "long", varying = -1,
        sep = "_", timevar = "CatNumber")
# Subject CatNumber Weight Sick
#1.1 1 1 10 1
#2.1 2 1 7 1
#1.2 1 2 11 0
#2.2 2 2 8 0
#1.3 1 3 12 0
#2.3 2 3 9 0
We can use melt from library(data.table), which can take multiple patterns for the measure variables.
library(data.table)#v1.9.6+
DT <- melt(setDT(df1), measure = patterns('Weight$', 'Sick$'),
           variable.name = 'CatNumber', value.name = c('Weight', 'Sick'))[order(Subject)]
DT
# Subject CatNumber Weight Sick
#1: 1 1 10 1
#2: 1 2 11 0
#3: 1 3 12 0
#4: 2 1 7 1
#5: 2 2 8 0
#6: 2 3 9 0
If we need the 'Animal' column, we can grep for the 'Cat' columns, remove the suffix substring with sub, and assign the result with := to create the 'Animal' column.
DT[, Animal := sub('\\d+\\_.*', '', grep('Cat', colnames(df1), value=TRUE))]
DT
# Subject CatNumber Weight Sick Animal
#1: 1 1 10 1 Cat
#2: 1 2 11 0 Cat
#3: 1 3 12 0 Cat
#4: 2 1 7 1 Cat
#5: 2 2 8 0 Cat
#6: 2 3 9 0 Cat

aggregate dataframe subsets in R

I have the dataframe ds
CountyID ZipCode Value1 Value2 Value3 ... Value25
1 1 0 etc etc etc
2 1 3
3 1 0
4 1 1
5 2 2
6 3 3
7 4 7
8 4 2
9 5 1
10 6 0
and would like to aggregate based on ds$ZipCode and set ds$CountyID equal to the primary county based on the highest ds$Value1. For the above example, it would look like this:
CountyID ZipCode Value1 Value2 Value3 ... Value25
2 1 4 etc etc etc
5 2 2
6 3 3
7 4 9
9 5 1
10 6 0
All the ValueX columns are the sum of that column grouped by ZipCode.
I've tried a bunch of different strategies over the last couple days, but none of them work. The best I've come up with is
#initialize the dataframe
ds_temp = data.frame()
#loop through each subset based on unique zipcodes
for (zip in unique(ds$ZipCode)) {
  sub <- subset(ds, ds$ZipCode == zip)
  len <- length(sub)
  maxIndex <- which.max(sub$Value1)
  #do the aggregation
  row <- aggregate(sub[3:27], FUN = sum, by = list(
    CountyID = rep(sub$CountyID[maxIndex], len),
    ZipCode = sub$ZipCode))
  ds_temp <- rbind(ds_temp, row)
}
ds <- ds_temp
I haven't been able to test this on the real data, but with dummy datasets (such as the one above), I keep getting the error "arguments must have the same length". I've messed around with rep() and fixed vectors (e.g. c(1,2,3,4)), but no matter what I do, the error persists. I also occasionally get an error to the effect of "cannot subset data of type 'closure'".
Any ideas? I've also tried messing around with data.frame(), ddply(), data.table(), dcast(), etc.
You can try this:
data.frame(aggregate(df[, 3:27], by = list(df$ZipCode), sum),
           CountyID = unlist(lapply(split(df, df$ZipCode),
                                    function(x) x$CountyID[which.max(x$Value1)])))
Fully reproducible sample data:
df<-read.table(text="
CountyID ZipCode Value1
1 1 0
2 1 3
3 1 0
4 1 1
5 2 2
6 3 3
7 4 7
8 4 2
9 5 1
10 6 0", header=TRUE)
data.frame(aggregate(df[, 3], by = list(df$ZipCode), sum),
           CountyID = unlist(lapply(split(df, df$ZipCode),
                                    function(x) x$CountyID[which.max(x$Value1)])))
# Group.1 x CountyID
#1 1 4 2
#2 2 2 5
#3 3 3 6
#4 4 9 7
#5 5 1 9
#6 6 0 10
In response to your comment on Frank's answer, you can preserve the column names by using the formula method in aggregate. Using Frank's data df, this would be:
cbind(aggregate(Value1 ~ ZipCode, df, sum),
      CountyID = sapply(split(df, df$ZipCode), function(x) {
        with(x, CountyID[Value1 == max(Value1)]) }))
# ZipCode Value1 CountyID
# 1 1 4 2
# 2 2 2 5
# 3 3 3 6
# 4 4 9 7
# 5 5 1 9
# 6 6 0 10
