Create a new level column based on unique row sets - r

I want to create a new column with new values (preferably letters) so that I can count the frequency of each set later on.
Let's say I have a data frame called datatemp like this:
datatemp = data.frame(colors=rep( c("red","blue"), 6), val = 1:6)
colors val
1 red 1
2 blue 2
3 red 3
4 blue 4
5 red 5
6 blue 6
7 red 1
8 blue 2
9 red 3
10 blue 4
11 red 5
12 blue 6
I can get the unique row sets, where the colors and val columns are taken together, like this:
unique(datatemp[c("colors","val")])
colors val
1 red 1
2 blue 2
3 red 3
4 blue 4
5 red 5
6 blue 6
What I really want is to create a new column in the same data frame where each unique row set above gets its own level, such as:
colors val freq
1 red 1 A
2 blue 2 B
3 red 3 C
4 blue 4 D
5 red 5 E
6 blue 6 F
7 red 1 A
8 blue 2 B
9 red 3 C
10 blue 4 D
11 red 5 E
12 blue 6 F
I know this is very basic; however, I couldn't come up with a workable approach for a huge dataset.
To make the question clearer, here is another representation of the desired output:
colA colB newcol
10 11 A
12 15 B
10 11 A
13 15 C
Values in the new column should be based on the uniqueness of the two columns before it.

www's solution maps the unique values in your val column to letters in the freq column. If you want to create a factor variable for each unique combination of colors and val, you could do something along these lines:
library(plyr)
datatemp = data.frame(colors=rep( c("red","blue"), 6), val = 1:6)
datatemp$freq <- factor(paste(datatemp$colors, datatemp$val), levels=unique(paste(datatemp$colors, datatemp$val)))
datatemp$freq <- mapvalues(datatemp$freq, from = levels(datatemp$freq), to = LETTERS[1:length(levels(datatemp$freq))])
I first create a new factor variable for each unique combination of val and colors, and then use plyr::mapvalues to rename the factor levels to letters.
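If plyr is not available, a hedged sketch of the same idea with dplyr (assuming dplyr >= 1.0 for cur_group_id(), and no more than 26 unique combinations so that LETTERS suffices):
library(dplyr)
datatemp <- datatemp %>%
  group_by(colors, val) %>%                  # one group per unique combination
  mutate(freq = LETTERS[cur_group_id()]) %>% # one letter per group
  ungroup()
The letter assigned to each combination may differ from the example (groups are ordered by the sorted keys rather than first appearance), but every unique colors/val pair still gets exactly one letter, which is all that is needed for counting frequencies later.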

We can concatenate the val and colors columns and convert the result to a factor, then replace the factor levels with letters.
datatemp$Freq <- as.factor(paste(datatemp$val, datatemp$colors, sep = "_"))
levels(datatemp$Freq) <- LETTERS[1:length(levels(datatemp$Freq))]
datatemp
# colors val Freq
# 1 red 1 A
# 2 blue 2 B
# 3 red 3 C
# 4 blue 4 D
# 5 red 5 E
# 6 blue 6 F
# 7 red 1 A
# 8 blue 2 B
# 9 red 3 C
# 10 blue 4 D
# 11 red 5 E
# 12 blue 6 F
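One caveat: factor() orders its levels alphabetically, so the letters follow the alphabetical order of the pasted strings, which happens to coincide with first appearance here. If you need letters assigned strictly in order of first appearance, a hedged base-R variant (again assuming at most 26 unique combinations) would be:
key <- paste(datatemp$val, datatemp$colors, sep = "_")
datatemp$Freq <- factor(LETTERS[match(key, unique(key))])  # letter by first-appearance rank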

Related

Change row index of a dataframe in R

I'm new to programming in R and I need something simple. I have a dataframe like this:
A number
3 1 3
4 1 4
11 2 11
12 2 12
18 3 18
19 3 19
The first column is the default row index created by R. I'd like to replace it with the "number" column, keeping the column name. Something like this:
number A
3 1
4 1
11 2
12 2
18 3
19 3
I need to do this because it is a large dataset and, as I go on, the correspondence between the two columns gets lost.
It seems like you want to remove your row names?
df <- data.frame("Colours" = c("Red", "Red", "Green", "Yellow"),
"Number" = c(1,2,3,6))
rownames(df) <- c(1,2,3,6)
df
Colours Number
1 Red 1
2 Red 2
3 Green 3
6 Yellow 6
Setting the row names to NULL removes them; the rows are then labelled with sequential numbers:
rownames(df) <- NULL
df
Colours Number
1 Red 1
2 Red 2
3 Green 3
4 Yellow 6
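If the intent is instead to use an existing column's values as the row names, which is one reading of the original question, a hedged sketch on the same df would be:
df2 <- df
rownames(df2) <- df2$Number  # only valid if the values are unique
df2
Rows are then labelled 1, 2, 3 and 6, matching the Number column.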

Create a Group Key across two columns

I'm trying to solve the below problem but I find it difficult to explain. I want to assign an incremental value based on the linkage between two columns (Colours & Letters).
Colours <- c("Green","Red","Green","Green","Blue","Red","Brown")
Letters <- c("X","C","Y","A","C","T","P")
df <- data.frame(Colours,Letters)
df
Colours Letters
1 Green X
2 Red C
3 Green Y
4 Green A
5 Blue C
6 Red T
7 Brown P
I'll assign a value to Group so that all identical Colours are in the same Group, along with any other Colour that shares the same Letter(s). For example, Group 2 includes Red & Blue given the shared linkage to Letter C.
Group <- c(1,2,1,1,2,2,3)
df <- data.frame(df,Group)
df
Colours Letters Group
1 Green X 1
2 Red C 2
3 Green Y 1
4 Green A 1
5 Blue C 2
6 Red T 2
7 Brown P 3
If an additional row was added with Colour = Green and Letter = C then the Group column would change to the below. All Greens would be grouped together with any other Colour (e.g. Red) that shared the same Letter (C in the case of Red). Furthermore, any Colour that shared a Letter with Red would likewise be added to the same Group as Green (such is the case for Blue, which shares the Letter C with Red).
Colours Letters Group
1 Green X 1
2 Red C 1
3 Green Y 1
4 Green A 1
5 Blue C 1
6 Red T 1
7 Brown P 2
8 Green C 1
Can anyone help?
As @Frank noted above, you are describing a graph problem: you want your group label to reflect connected components, i.e. colours that share a letter. By converting your columns into a graph object you can work out what the separate components are and return them as groups:
Colours <- c("Green","Red","Green","Green","Blue","Red","Brown")
Letters <- c("X","C","Y","A","C","T","P")
df <- data.frame(Colours,Letters)
Group <- c(1,2,1,1,2,2,3)
df <- data.frame(df,Group)
# load the igraph package for working with graphs
library(igraph)
adj.mat <- table(df$Colours, df$Letters) %*% t(table(df$Colours, df$Letters))
# visual inspection makes it clear what the components are
g <- graph_from_adjacency_matrix(adj.mat, mode = 'undirected', diag = F)
plot(g)
# we create a dataframe that matches each color to a component
mdf <- data.frame(Group_test = components(g)$membership,
Colours = names(components(g)$membership))
mdf
#> Group_test Colours
#> Blue 1 Blue
#> Brown 2 Brown
#> Green 3 Green
#> Red 1 Red
# Then we just match them together
dplyr::left_join(df, mdf)
#> Joining, by = "Colours"
#> Colours Letters Group Group_test
#> 1 Green X 1 3
#> 2 Red C 2 1
#> 3 Green Y 1 3
#> 4 Green A 1 3
#> 5 Blue C 2 1
#> 6 Red T 2 1
#> 7 Brown P 3 2
Clearly the groups have a different numbering but split the colours similarly.
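If you want the numbering itself to line up with the question's Group column, one hedged option is to renumber the merged components in order of first appearance:
res <- dplyr::left_join(df, mdf, by = "Colours")
res$Group_test <- match(res$Group_test, unique(res$Group_test))  # relabel by first appearance
res$Group_test
#> [1] 1 2 1 1 2 2 3
which reproduces the original Group labels.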
We can look at the extended case as a sanity check, where we add a linking color that reduces the set of components to 2:
# examining the extended case as a check
df2 <- data.frame(Colours = c(Colours, "Green"), Letters = c(Letters, "C"))
df2
#> Colours Letters
#> 1 Green X
#> 2 Red C
#> 3 Green Y
#> 4 Green A
#> 5 Blue C
#> 6 Red T
#> 7 Brown P
#> 8 Green C
# lets wrap the procedure in a function for convenience
getGroup <- function(col, let, plot = FALSE){
adj.mat <- table(col, let) %*% table(let, col)
g <- graph_from_adjacency_matrix(adj.mat, mode = 'undirected',
diag = F)
if (plot) {plot(g)}
comps <- components(g)$membership
mdf <- data.frame(Group = comps, Colours = names(comps))
mdf
}
# we get our desired group key (which we can merge back to the dataframe)
getGroup(df2$Colours, df2$Letters)
#> Group Colours
#> Blue 1 Blue
#> Brown 2 Brown
#> Green 1 Green
#> Red 1 Red
Created on 2018-11-07 by the reprex package (v0.2.1)
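Merging that key back onto the extended data is then a single join (a hedged sketch using the df2 and getGroup() defined above):
dplyr::left_join(df2, getGroup(df2$Colours, df2$Letters), by = "Colours")
#> Colours Letters Group
#> 1 Green X 1
#> 2 Red C 1
#> 3 Green Y 1
#> 4 Green A 1
#> 5 Blue C 1
#> 6 Red T 1
#> 7 Brown P 2
#> 8 Green C 1
which matches the grouping the question expects for the extended case.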

Select from column in dataframe based on value in another column

I have a dataframe as follows:
dataDF <- data.frame(
id = 1:5,
to_choose = c('red', 'blue', 'red', 'green', 'yellow'),
red_value = c(1,2,3,4,5),
blue_value = c(6,7,8,9,10),
yellow_value = c(11,12,13,14,15)
)
id to_choose red_value blue_value yellow_value
1 red 1 6 11
2 blue 2 7 12
3 red 3 8 13
4 green 4 9 14
5 yellow 5 10 15
I want to create a new column value, which is the value from the appropriate column based on the to_choose column.
I could do this with an ifelse as follows
mutate(dataDF,
value = ifelse(to_choose == 'red', red_value,
ifelse(to_choose == 'blue', blue_value,
ifelse(to_choose == 'yellow', yellow_value, NA))))
To give
id to_choose red_value blue_value yellow_value value
1 red 1 6 11 1
2 blue 2 7 12 7
3 red 3 8 13 3
4 green 4 9 14 NA
5 yellow 5 10 15 15
But is there a simpler one-line way of doing this, along the lines of
mutate(dataDF, value = paste(to_choose, 'value', sep = '_'))
One tidyr/dplyr option is to reshape to long format and keep only the rows where to_choose matches the column:
library(tidyr)
library(dplyr)
dataDF %>%
  gather(var, value, 3:5) %>%
  mutate(var = gsub('_value', '', var)) %>%
  filter(to_choose == var)
Note that rows with no matching column (green, id 4) are dropped by the filter rather than kept as NA.
A base R approach using mapply
dataDF$value <- mapply(function(x, y) if(length(y) > 0) dataDF[x, y] else NA,
1:nrow(dataDF), sapply(dataDF$to_choose, function(x) grep(x, names(dataDF))))
dataDF
# id to_choose red_value blue_value yellow_value value
#1 1 red 1 6 11 1
#2 2 blue 2 7 12 7
#3 3 red 3 8 13 3
#4 4 green 4 9 14 NA
#5 5 yellow 5 10 15 15
The idea is to get the appropriate row and column indices to subset on. The row indices are simply 1:nrow(dataDF), since we need a value for every row of the dataframe. To get the appropriate column, we grep each to_choose entry against the column names to find the column index from which the value should be extracted.
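Another base R option, a hedged sketch not taken from the original answer, is matrix indexing: build a numeric matrix from the *_value columns and pick one entry per row. A to_choose entry with no matching column (green) yields NA automatically:
value_cols <- as.matrix(dataDF[grep("_value$", names(dataDF))])             # numeric matrix of the value columns
col_idx <- match(paste0(dataDF$to_choose, "_value"), colnames(value_cols))  # column index per row (NA if no match)
dataDF$value <- value_cols[cbind(seq_len(nrow(dataDF)), col_idx)]           # one element per row via matrix indexing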

Split Column and then aggregate count of unique values

I have the following dataset:
color type
1 black chair
2 black chair
3 black sofa
4 green table
5 green sofa
I want to split this to form the following dataset:
arg value
1 color black
2 color black
3 color black
4 color green
5 color green
6 type chair
7 type chair
8 type sofa
9 type table
10 type sofa
I would then like to count how often each arg-value combination occurs:
arg value count
1 color black 3
2 color green 2
3 type chair 2
4 type sofa 2
5 type table 1
It does not need to be sorted by count. This would then be printed in the following output form:
arg unique_count_values
1 color black(3) green(2)
2 type chair(2) sofa(2) table(1)
I tried the following:
AttrList<-colnames(DataSet)
aggregate(.~ AttrList, DataSet, FUN=function(x) length(unique(x)) )
I also tried summary(DataSet) but then I am not sure how to manipulate the result to get it in the desired Output form.
I am relatively new to R. If you find something that would reduce the effort then please let me know. Thanks!
Update
So, I tried the following:
x <- matrix(c(101:104,101:104,105:106,1,2,3,3,4,5,4,5,7,5), nrow=10, ncol=2)
V1 V2
1 101 1
2 102 2
3 103 3
4 104 3
5 101 4
6 102 5
7 103 4
8 104 5
9 105 7
10 106 5
Converting to table:
as.data.frame(table(x))
Which gives me:
x Freq
1 1 1
2 2 1
3 3 2
4 4 2
5 5 3
6 7 1
7 101 2
8 102 2
9 103 2
10 104 2
11 105 1
12 106 1
What should I do so I get this:
V Val Freq
1 V2 1 1
2 V2 2 1
3 V2 3 2
4 V2 4 2
5 V2 5 3
6 V2 7 1
7 V1 101 2
8 V1 102 2
9 V1 103 2
10 V1 104 2
11 V1 105 1
12 V1 106 1
Try
library(tidyr)
library(dplyr)
df %>%
  gather(arg, value) %>%
  count(arg, value) %>%
  group_by(arg) %>%  # make sure the result is grouped by arg before collapsing (count() may return ungrouped output depending on the dplyr version)
  summarise(unique_count_values = toString(paste0(value, "(", n, ")")))
Which gives:
#Source: local data frame [2 x 2]
#
# arg unique_count_values
# (fctr) (chr)
#1 color black(3), green(2)
#2 type chair(2), sofa(2), table(1)
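The same idea carries over to the matrix example from the Update (a hedged sketch, assuming x is the 10 x 2 matrix defined there): convert it to a data frame, reshape to long format, then count each column/value pair.
library(tidyr)
library(dplyr)
as.data.frame(x) %>%
  gather(V, Val) %>%  # long format: V is the original column name, Val its value
  count(V, Val)       # the n column corresponds to the Freq column asked for in the Update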
Here's a base R approach. I've expanded it out a bit mostly so that I can add comments as to what is happening.
The basic idea is to just use sapply to loop through the columns, tabulate the data in each column, and then use sprintf to extract the relevant parts of the tabulation to achieve your desired output (the names, followed by the values in brackets).
The stack function takes the final named vector and converts it to a data.frame.
stack( ## convert the final output to a data.frame
sapply( ## cycle through each column
mydf, function(x) {
temp <- table(x) ## calculate counts and paste together values
paste(sprintf("%s (%d)", names(temp), temp), collapse = " ")
}))
# values ind
# 1 black (3) green (2) color
# 2 chair (2) sofa (2) table (1) type
If the data are factors, you could also try something like the following, which gives you the same information, though not in the exact desired output format.
stack(apply(summary(mydf), 2, function(x) paste(na.omit(x), collapse = " ")))
# values ind
# 1 black:3 green:2 color
# 2 chair:2 sofa :2 table:1 type
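For the Update's intermediate table in base R, a hedged sketch along the same lines (again assuming the x matrix from the question):
longx <- stack(as.data.frame(x))                                # columns: values and ind (V1/V2)
tab <- as.data.frame(table(V = longx$ind, Val = longx$values))  # count every column/value pair
tab[tab$Freq > 0, ]                                             # keep only combinations that actually occur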

Read csv with two headers into a data.frame

Apologies for the seemingly simple question, but I can't seem to find a solution to the following re-arrangement problem.
I'm used to using read.csv to read in files with a single header row, but I have an Excel spreadsheet with two 'header' rows: the cell identifier (a, b, c ... g) and three sets of measurements (x, y and z; thousands of each) for each cell:
a b
x y z x y z
10 1 5 22 1 6
12 2 6 21 3 5
12 2 7 11 3 7
13 1 4 33 2 8
12 2 5 44 1 9
csv file below:
a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9
How can I get to a data.frame in R as shown below?
cell x y z
a 10 1 5
a 12 2 6
a 12 2 7
a 13 1 4
a 12 2 5
b 22 1 6
b 21 3 5
b 11 3 7
b 33 2 8
b 44 1 9
Use base R reshape():
temp = read.delim(text="a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9", header=TRUE, skip=1, sep=",")
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT
# time x y z id
# 1.0 0 10 1 5 1
# 2.0 0 12 2 6 2
# 3.0 0 12 2 7 3
# 4.0 0 13 1 4 4
# 5.0 0 12 2 5 5
# 1.1 1 22 1 6 1
# 2.1 1 21 3 5 2
# 3.1 1 11 3 7 3
# 4.1 1 33 2 8 4
# 5.1 1 44 1 9 5
Basically, you just skip the first row, where the letters a-g appear in every third column. Since the sub-column names repeat, R automatically appends a numeric suffix (".1", ".2", and so on) to the repeated columns after the first set; so we add a ".0" suffix to the first three columns ourselves.
You can either then create an "id" variable, or, as I've done here, just use the row names for the IDs.
You can change the "time" variable to your "cell" variable as follows:
# Change the following to the number of levels you actually have
OUT$cell = factor(OUT$time, labels=letters[1:2])
Then, drop the "time" column:
OUT$time = NULL
Update
To answer a question in the comments below, if the first label was something other than a letter, this should still pose no problem. The sequence I would take would be as follows:
temp = read.csv("path/to/file.csv", skip=1, stringsAsFactors = FALSE)
GROUPS = read.csv("path/to/file.csv", header=FALSE,
nrows=1, stringsAsFactors = FALSE)
GROUPS = GROUPS[!is.na(GROUPS)]
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT$cell = factor(OUT$time, labels=GROUPS)
OUT$time = NULL
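For reference, here is a hedged sketch of the same reshape with tidyr::pivot_longer, assuming the temp data frame with the ".0"/".1" suffixed names built above; the special ".value" sentinel keeps x, y and z as separate columns:
library(tidyr)
long <- pivot_longer(temp, cols = everything(),
                     names_to = c(".value", "cell"), names_sep = "\\.")
long$cell <- factor(long$cell, labels = GROUPS)  # relabel "0", "1", ... with the group labels read earlier
The row order differs from reshape()'s output, but the content is the same.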
