R: Merging factor levels and creating sum in merged columns [duplicate] - r

This question already has answers here:
Cleaning up factor levels (collapsing multiple levels/labels)
(10 answers)
Closed 5 years ago.
I am a beginner in R and this is driving me nuts.
I have a dataframe:
someData = data.frame(Term=c('a', 'b', 'c', 'd', 'a', 'a', 'c', 'c'), Freq=c(1:8), Category=c(1,2,1,2,1,2,1,2));
someData$Term = as.factor(someData$Term)
someData$Category = as.factor(someData$Category)
and would like to combine terms a and c (both levels of the factor Term) into x, summing their respective frequencies while maintaining the categories, so that I end up with the following dataframe:
Term Freq Category
x 16 1
b 2 2
d 4 2
x 14 2
The following code only changes all names to x, but does not sum its values.
combine <- c("a", "c");
levels(someData$Term)[levels(someData$Term) %in% combine] <- "x";

Your approach to merging the levels looks valid; once a and c are collapsed into x, aggregate() takes care of the summing:
#levels(someData$Term) = list(b = "b", d = "d", x = c("a", "c")) #just another way to collapse the levels
aggregate(Freq ~ Term + Category, someData, sum)
# Term Category Freq
#1 x 1 16
#2 b 2 2
#3 d 2 4
#4 x 2 14
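For a fully reproducible run, here is a small end-to-end sketch that rebuilds the example data, collapses the levels (via the levels() list shown above; your %in% replacement works just as well), and then aggregates:
# Rebuild the example data from the question
someData <- data.frame(Term = factor(c('a', 'b', 'c', 'd', 'a', 'a', 'c', 'c')),
                       Freq = 1:8,
                       Category = factor(c(1, 2, 1, 2, 1, 2, 1, 2)))
# Collapse levels "a" and "c" into "x" ...
levels(someData$Term) <- list(b = "b", d = "d", x = c("a", "c"))
# ... then sum Freq for every Term/Category combination
aggregate(Freq ~ Term + Category, someData, sum)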

Create id column of repeated rows [duplicate]

This question already has answers here:
Assign unique ID based on two columns [duplicate]
(2 answers)
Closed 3 years ago.
EDITED:
I have a very simple question. I have a dataframe (already given) with repeated rows. I want to identify each unique row and add a column with an ID number.
The original table has thousands of rows, but I have simplified it here. A toy df can be created this way:
df <- data.frame(var1 = c('a', 'a', 'a', 'b', 'c', 'c', 'a'),
var2 = c('d', 'd', 'd', 'e', 'f', 'f', 'c'))
For each unique row, I want a numeric ID:
var1 var2 ID
1 a d 1
2 a d 1
3 a d 1
4 b e 2
5 c f 3
6 c f 3
7 a c 4
/EDITED
Here is a base R solution using cumsum + duplicated, i.e.,
df$ID <- cumsum(!duplicated(df))
such that
> df
var1 var2 ID
1 a d 1
2 a d 1
3 a d 1
4 b e 2
5 c f 3
6 c f 3
7 a c 4
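To see why this works: duplicated(df) flags rows identical to an earlier row, so !duplicated(df) is TRUE only at the first occurrence of each distinct row, and cumsum() turns those flags into a running counter. (A note from me: this matches the desired IDs here because equal rows sit next to each other; for arbitrarily scattered duplicates, the match() approach below is the general solution.)
!duplicated(df[, c("var1", "var2")])
#[1]  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE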
EDIT
Well, the question was completely changed by OP. For the updated question we can do
df$ID <- match(paste0(df$var1, df$var2), unique(paste0(df$var1, df$var2)))
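One caveat with building the key via paste0(): two different column combinations can in principle collapse to the same string (e.g. "ab" + "c" and "a" + "bc"). A minimal variant of the same idea with an explicit separator (my tweak, not part of the original answer; pick a separator that cannot occur in the data):
key <- paste(df$var1, df$var2, sep = "_")
df$ID <- match(key, unique(key))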
Original answer
One way would be to use uncount from tidyr
library(dplyr)
df %>% mutate(ID = row_number()) %>% tidyr::uncount(ID, .remove = FALSE)
# var1 var2 ID
#1 a d 1
#2 b e 2
#2.1 b e 2
#3 c f 3
#3.1 c f 3
#3.2 c f 3
In base R we can create a row number column in the dataframe and repeat rows based on that.
df$ID <- seq(nrow(df))
df[rep(df$ID, df$ID), ]
data
df <- structure(list(var1 = structure(1:3, .Label = c("a", "b", "c"), class = "factor"),
                     var2 = structure(1:3, .Label = c("d", "e", "f"), class = "factor")),
                row.names = c(NA, -3L), class = "data.frame")

stacking rows as columns in R [duplicate]

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 3 years ago.
I'm trying to stack rows of data into columns so that the variables in another column will repeat. I would like to turn something like this
library(tibble)
tib <- tribble(~x, ~y, ~z, "a", 1, 2, "b", 3, 4)
> tib
# A tibble: 2 x 3
x y z
<chr> <dbl> <dbl>
1 a 1 2
2 b 3 4
into
t <- tribble(~X, ~Y, "a", 1, "a", 2, "b", 3, "b", 4)
> t
# A tibble: 4 x 2
X Y
<chr> <dbl>
1 a 1
2 a 2
3 b 3
4 b 4
Thanks for your help and sorry if I've missed this solution somewhere. I did a search, and tried applying gather(), spread(), but couldn't get it to work out.
Here is an example using data.table::melt():
# Assuming your data is a data.frame
xyz <- data.frame(
x = c("a", "b"),
y = c(1, 3),
z = c(2, 4)
)
library(data.table)
melt(xyz, id.vars = "x")[c(1, 3)]
x value
1 a 1
2 b 3
3 a 2
4 b 4
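If the exact X/Y column names and row order from the question matter, here is a small follow-up sketch on top of the same melt() result (my addition):
out <- melt(xyz, id.vars = "x")[c(1, 3)]   # same call as above
names(out) <- c("X", "Y")
out[order(out$X, out$Y), ]
#  X Y
#1 a 1
#3 a 2
#2 b 3
#4 b 4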
This can be done with many packages. One possibility is tidyr and its gather() function.
EDIT
Using #sindri_baldur data:
library(tidyr)
xyz %>%
gather(class, measurement, -x)
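gather() still works but is superseded in current tidyr; assuming tidyr >= 1.0.0, a sketch of the equivalent pivot_longer() call would be:
library(tidyr)
xyz %>%
  pivot_longer(cols = -x, names_to = "class", values_to = "measurement")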

cumulatively concatenate columns in data.table by group [duplicate]

This question already has answers here:
Cumulatively paste (concatenate) values grouped by another variable
(6 answers)
Closed 4 years ago.
I have a data.table like the following:
library(data.table)
x <- data.table(group = c('A', 'A', 'A', 'B', 'B'),
                row_id = c(1, 2, 3, 1, 2),
                value = c('a', 'b', 'c', 'd', 'e'))
I want to add a new column that cumulatively concatenates column 'value', ordered by 'row_id', within each group indicated by 'group'. So the output would look like:
group row_id value
1: A 1 a
2: A 2 a_b
3: A 3 a_b_c
4: B 1 d
5: B 2 d_e
Thank you for your help!
One option would be to group by 'group', loop over the row indices, take the sequence up to each index, use that to subset 'value', paste the pieces together with the delimiter "_", and assign (:=) the result to update 'value':
x[, value := sapply(seq_len(.N), function(i)
paste(value[seq(i)], collapse = "_")), by = group]
x
# group row_id value
#1: A 1 a
#2: A 2 a_b
#3: A 3 a_b_c
#4: B 1 d
#5: B 2 d_e
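A variation on the same idea (mine, not from the answer above): within each group, Reduce() with accumulate = TRUE builds the cumulative paste directly. This assumes you start from the original x (before the update above) and that rows are already sorted by row_id within each group, as in the example:
x[, value := Reduce(function(a, b) paste(a, b, sep = "_"),
                    value, accumulate = TRUE), by = group]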

Subset by two factors variables [duplicate]

This question already has answers here:
Extract row corresponding to minimum value of a variable by group
(8 answers)
Closed 4 years ago.
I'd like to aggregate my dataset over the interactions between two factors (fac1, fac2) and apply a function to each combination. For example, consider the dataset given by
set.seed(1)
test <- data.frame(fac1 = sample(c("A", "B", "C"), 30, rep = T),
fac2 = sample(c("a", "b"), 30, rep = T),
value = runif(30))
For fac1 == "A" and fac2 == "a" we have five values and I'd like to aggregate by min. Using brute force I tried this:
min(test[test$fac1 == "A" & test$fac2 == "a", ]$value)
You mention aggregate and that will work here.
aggregate(test$value, test[,1:2], min)
fac1 fac2 x
1 A a 0.32535215
2 B a 0.14330438
3 C a 0.33239467
4 A b 0.33907294
5 B b 0.08424691
6 C b 0.24548851
Here is a tidyverse alternative
library(dplyr)
test %>% group_by(fac1, fac2) %>% summarise(x = min(value))
## A tibble: 6 x 3
## Groups: fac1 [?]
# fac1 fac2 x
# <fct> <fct> <dbl>
#1 A a 0.325
#2 A b 0.339
#3 B a 0.143
#4 B b 0.0842
#5 C a 0.332
#6 C b 0.245
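If you also need the entire row at each group-wise minimum (the framing of the linked duplicate) rather than just the value, a rough base R sketch using split() and which.min():
do.call(rbind, lapply(split(test, list(test$fac1, test$fac2)),
                      function(d) d[which.min(d$value), ]))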

Get random sample from subset of other dataframe

I have a large data frame with hundreds of thousands of rows, and I want to add a column whose value is a random sample from a subset of another data frame, matched on common columns in the two data frames. It might be easier to explain with examples...
largeDF <- data.frame(colA = c('a', 'b', 'b', 'a', 'a', 'b'),
colB = c('x', 'y', 'y', 'x', 'y', 'y'),
colC = 1:6)
sampleDF <- data.frame(colA = c('a','a','a','a','b','b','b','b','b','b'),
colB = c('x','x','y','y','x','y','y','y','y','y'),
sample = 1:10)
I then want to add a new column sample to largeDF, which is a random sample of the sample column in sampleDF for the appropriate subset of colA and colB.
For example, for the first row the values are a and x, so the value will be a random sample of 1 or 2, for the next row (b and y) it will be a random sample of 6, 7, 8, 9 or 10.
So we could end up with something like:
colA colB colC sample
1 a x 1 2
2 b y 2 9
3 b y 3 7
4 a x 4 2
5 a y 5 4
6 b y 6 8
Any help would be appreciated!
Using dplyr... (This throws a few warnings, but appears to work anyway.)
library(dplyr)
largeDF <- largeDF %>% group_by(colA,colB) %>%
mutate(sample=sample(sampleDF$sample[sampleDF$colA==colA & sampleDF$colB==colB],
size=n(),replace=TRUE))
largeDF
colA colB colC sample
<fctr> <fctr> <int> <int>
1 a x 1 2
2 b y 2 6
3 b y 3 9
4 a x 4 1
5 a y 5 4
6 b y 6 9
You could do something like this:
largeDF$sample <- apply(largeDF,1,function(a)
with(sampleDF, sample(sampleDF[colA==a[1] & colB==a[2],]$sample,1)))
I do not quite understand the question, but it seems you just want to add a new column to the large data frame that is sampled from the "sample" column of the other data frame. See if the following code gives you an idea of the functionality you need:
cbind.data.frame(largeDF, sample = sample(sampleDF$sample, nrow(largeDF)))
# colA colB colC sample
#1 a x 1 9
#2 b y 2 10
#3 b y 3 1
#4 a x 4 3
#5 a y 5 6
#6 b y 6 7
I think this is one possible solution for you...
library(dplyr)
largeDF_sample <- sapply(1:nrow(largeDF), function(x) {
sampleDF_part = filter(sampleDF, colA==largeDF$colA[x] & colB==largeDF$colB[x])
return(sample(sampleDF_part$sample)[1])
})
largeDF$sample <- largeDF_sample
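Another rough alternative (my sketch, not from the answers above), which would replace the sample column computed here: mapply() over the two key columns avoids apply()'s coercion of the whole data frame to a character matrix, and indexing the pool by position sidesteps sample()'s special behaviour when the pool is a single number.
largeDF$sample <- mapply(function(a, b) {
  # Pool of candidate values for this colA/colB combination
  pool <- sampleDF$sample[sampleDF$colA == a & sampleDF$colB == b]
  # Draw one value by position, safe even if the pool has length one
  pool[sample(length(pool), 1)]
}, as.character(largeDF$colA), as.character(largeDF$colB))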
