Transforming dataset to aggregate values [duplicate] - r

I have a dataset like this below
Col Value
A 1
A 0
A 1
A 1
A 1
B 0
B 1
B 0
B 1
B 1
How do I transform this so that it looks like this below
Col1 Col2 Col3
A 4 1
B 3 2
Col2 counts all the 1s and Col3 counts all the 0s for each factor value in Col1.

Alternatively, we can use dcast from reshape2 (df1 here is the data frame from the question):
library(reshape2)
dcast(df1, Col~Value, value.var='Value', length)

For this, you can just use table:
table(mydf)
##    Value
## Col 0 1
##   A 1 4
##   B 2 3
Or:
library(data.table)
as.data.table(mydf)[, as.list(table(Value)), by = Col]
##    Col 0 1
## 1:   A 1 4
## 2:   B 2 3

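Another approach to aggregating the values uses tapply. Note that the count columns follow the sorted order of Value (0, then 1), so Col2 here holds the 0 counts and Col3 the 1 counts: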
df <- data.frame(Col=c("A","A","A","A","A","B","B","B","B","B"), Value=c(1,0,1,1,1,0,1,0,1,1))
new_df <- as.data.frame(with(df, tapply(Value, list(Col, Value), FUN = function(x) length(x))))
new_df <- setNames(cbind(rownames(new_df), new_df), c("Col1","Col2","Col3"))
new_df
  Col1 Col2 Col3
A    A    1    4
B    B    2    3
We can set rownames to NULL if we do not wish to see them:
rownames(new_df) <- NULL
Result:
  Col1 Col2 Col3
1    A    1    4
2    B    2    3
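If the exact layout from the question is needed (Col2 counting the 1s and Col3 counting the 0s), one possible base R sketch is to build the contingency table and pick its columns by name. The names tab and result are illustrative and do not come from the answers above.

```r
# Sample data from the question
df <- data.frame(
  Col   = rep(c("A", "B"), each = 5),
  Value = c(1, 0, 1, 1, 1, 0, 1, 0, 1, 1)
)

# Two-way contingency table: rows are Col, columns are the Value levels "0" and "1"
tab <- table(df$Col, df$Value)

# Select the count columns by name so that Col2 holds the 1s and Col3 the 0s
result <- data.frame(
  Col1 = rownames(tab),
  Col2 = as.vector(tab[, "1"]),
  Col3 = as.vector(tab[, "0"])
)
result
```

Selecting by name rather than by position avoids depending on the order in which table sorts the Value levels.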

Related

How to give each instance its own row in a data frame? [duplicate]

How is it possible to transform this data frame so that the count is divided into separate observations?
df = data.frame(object = c("A","B", "A", "C"), count=c(1,2,3,2))
object count
1 A 1
2 B 2
3 A 3
4 C 2
So that the resulting data frame looks like this?
object observation
1 A 1
2 B 1
3 B 1
4 A 1
5 A 1
6 A 1
7 C 1
8 C 1
You can repeat each object according to its count with rep:
rep(df$object, df$count)
If you want the 2 columns:
df2 = data.frame(object = rep(df$object, df$count))
df2$count = 1
If you're already working with the tidyverse (otherwise it's overkill), you could also do:
library(tidyverse)
uncount(df, count) %>% mutate(observation = 1)
Using data.table:
library(data.table)
# setDT() converts df to a data.table by reference (setDF() would do the opposite)
setDT(df)[rep(seq_along(count), count), .(object, count = 1L)]
   object count
1:      A     1
2:      B     1
3:      B     1
4:      A     1
5:      A     1
6:      A     1
7:      C     1
8:      C     1
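The same idea can be sketched in base R by repeating row indices instead of values, which also works when other columns need to be carried along. The names idx and expanded are illustrative, and stringsAsFactors = FALSE is set only to keep object a character column.

```r
df <- data.frame(
  object = c("A", "B", "A", "C"),
  count  = c(1, 2, 3, 2),
  stringsAsFactors = FALSE
)

# Repeat each row index as many times as its count says
idx <- rep(seq_len(nrow(df)), times = df$count)

# One row per observation, each with observation = 1
expanded <- data.frame(object = df$object[idx], observation = 1)
expanded
```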

Merging dataframes without changing values [duplicate]

I have two dataframes
df1 <- data.frame(c(1:10))
df2 <- data.frame(c(1,0,1,1,0,1,0,0,1,0))
I tried to merge them using this code:
merge(df1,df2,all = TRUE, sort = FALSE)
But my data frame comes out wrong: it has 100 rows.
I want the dataframe to look like this:
col1 col2
1 1
2 0
3 1
4 1
5 0
6 1
7 0
8 0
9 1
10 0
How can I do this?
You can just define a new data frame and use [,1] to extract the column from each of your existing data frames. This also lets you name the columns.
data.frame(col1=df1[,1], col2 = df2[,1])
# col1 col2
#1 1 1
#2 2 0
#3 3 1
#4 4 1
#5 5 0
#6 6 1
#7 7 0
#8 8 0
#9 9 1
#10 10 0
This will get you the formatting you want, with named columns:
library(dplyr)
df1 <- data.frame(col1 = c(1:10))
df2 <- data.frame(col2 = c(1,0,1,1,0,1,0,0,1,0))
df <- bind_cols(df1, df2)
You can do that with cbind(), which stands for column bind:
cbind(df1, df2)
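As for why merge returned 100 rows: the two data frames share no column names, and in that case merge returns the Cartesian product of their rows (10 x 10). A small sketch to check this, where crossed and combined are illustrative names:

```r
df1 <- data.frame(c(1:10))
df2 <- data.frame(c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0))

# The auto-generated column names differ, so there is nothing to join on
common <- intersect(names(df1), names(df2))
length(common)

# With no common columns, merge() pairs every row of df1 with every row of df2
crossed <- merge(df1, df2)
nrow(crossed)

# Binding side by side keeps the 10 rows; setNames() supplies readable names
combined <- setNames(cbind(df1, df2), c("col1", "col2"))
combined
```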

Add Columns with Repeating Values in R [duplicate]

I have a column col1 from df1
col1
A
B
C
D
E
I have col2 from df2
col2
1
2
3
I want a new df df3 combining both col2 and col1
col2 col1
1 A
1 B
1 C
1 D
1 E
2 A
2 B
2 C
2 D
2 E
3 A
...
3 E
I used
n = 5; test = do.call("rbind", replicate(n, col2, simplify = FALSE))
n = 3; test = do.call("rbind", replicate(n, col1, simplify = FALSE))
And then merged the data together. It's really not efficient with big data. What's the best way to solve this problem?
Use merge with a cross join, for a base R option:
df1 <- data.frame(col1=c("A", "B", "C", "D", "E"))
df2 <- data.frame(col2=c(1:3))
merge(df1, df2, by=NULL)
col1 col2
1 A 1
2 B 1
3 C 1
4 D 1
5 E 1
6 A 2
7 B 2
...
15 E 3
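Another base R option, sketched here as an alternative to the merge call, is expand.grid, which builds all combinations directly from the two vectors. Its first argument varies fastest, so passing col1 first gives the row order asked for in the question; the column reordering at the end is only there to match the requested col2, col1 layout.

```r
df1 <- data.frame(col1 = c("A", "B", "C", "D", "E"), stringsAsFactors = FALSE)
df2 <- data.frame(col2 = 1:3)

# All combinations; col1 cycles fastest, col2 slowest
df3 <- expand.grid(col1 = df1$col1, col2 = df2$col2, stringsAsFactors = FALSE)

# Put col2 first, as in the desired output
df3 <- df3[, c("col2", "col1")]
head(df3)
```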

How to calculate the frequency of each value in a column corresponding to each value in another column in R?

I have a dataset as follows:
col1 col2
A 1
A 2
A 2
B 1
B 1
C 1
C 1
C 2
I want the output as:
col1 col2 Frequency
A 1 1
A 2 2
B 1 2
C 1 2
C 2 1
I tried using the aggregate function and also the table function but I am unable to get desired result.
You can add a dummy column or use the rownames to aggregate on:
aggregate(rownames(mydf) ~ ., mydf, length)
# col1 col2 rownames(mydf)
# 1 A 1 1
# 2 B 1 2
# 3 C 1 2
# 4 A 2 2
# 5 C 2 1
table also works fine but will report combinations that may not be in your data as "0":
data.frame(table(mydf))
# col1 col2 Freq
# 1 A 1 1
# 2 B 1 2
# 3 C 1 2
# 4 A 2 2
# 5 B 2 0
# 6 C 2 1
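If the zero-count combinations are unwanted, one way to get the requested output from table is to filter them out and sort afterwards. This is a sketch on the question's sample data; freq is an illustrative name, and note that col2 comes back as character here because it is taken from the table's dimension names.

```r
mydf <- data.frame(
  col1 = c("A", "A", "A", "B", "B", "C", "C", "C"),
  col2 = c(1, 2, 2, 1, 1, 1, 1, 2)
)

# Long-format frequency table, including combinations that never occur
freq <- as.data.frame(table(mydf), stringsAsFactors = FALSE)

# Drop the combinations with a count of 0, then order by col1 and col2
freq <- freq[freq$Freq > 0, ]
freq <- freq[order(freq$col1, freq$col2), ]
rownames(freq) <- NULL
freq
```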
Another nice approach is to use "data.table":
library(data.table)
as.data.table(mydf)[, .N, by = names(mydf)]
If your data is
col1 <- c("A","A","A","B","B","C","C","C")
col2 <- c(1,2,2,1,1,1,1,2)
df <- data.frame(col1,col2)
you can use dplyr:
1) group_by both variables, since your output is supposed to include every observed combination of them
2) count the number of observations for each group using n()
library(dplyr)
df %>% group_by(col1,col2) %>% summarize(frequency=n())
# output
col1 col2 frequency
1 A 1 1
2 A 2 2
3 B 1 2
4 C 1 2
5 C 2 1

Keeping rows if any column matches one of a set of values

I have a simple question about subsetting using R; I think I am close but can't quite get it. Basically, I have 25 columns of interest and about 100 values. Any row that has ANY of those values in at least one of the columns, I want to keep. Simple example:
Values <- c(1,2,5)
col1 <- c(2,6,8,1,3,5)
col2 <- c(1,4,5,9,0,0)
col3 <- c('dog', 'cat', 'cat', 'pig', 'chicken', 'cat')
df <- cbind.data.frame(col1, col2, col3)
df1 <- subset(df, col1%in%Values)
(Note that the third column is to indicate that there are additional columns but I don't need to match the values to those; the rows retained only depend upon columns 1 and 2). I know that in this trivial case I could just add
| col2%in%Values
to get the additional rows from column 2, but with 25 columns I don't want to add an OR statement for every single one. I tried
file2011_test <- file2011[file2011[,9:33]%in%CO_codes] #real names of values
but it didn't work. (And yes I know this is mixing subsetting types; I find subset() easier to understand but I don't think it can help me with what I need?)
Maybe you can try:
df[Reduce(`|`, lapply(as.data.frame(df), function(x) x %in% Values)),]
# col1 col2
#[1,] 2 1
#[2,] 8 5
#[3,] 1 9
#[4,] 5 0
Or
indx <- df %in% Values
dim(indx) <- dim(df)
df[!!rowSums(indx),]
# col1 col2
# [1,] 2 1
# [2,] 8 5
# [3,] 1 9
# [4,] 5 0
Update
Using the new dataset
df[Reduce(`|`, lapply(df[sapply(df, is.numeric)], function(x) x %in% Values)),]
# col1 col2 col3
#1 2 1 dog
#3 8 5 cat
#4 1 9 pig
#6 5 0 cat
Take a look at the data.table package. Its syntax is concise, and it can be much faster than base R on large data.
library(data.table)
df <- data.table(col1, col2, col3)
df[col1%in%Values | col2%in%Values]
# col1 col2 col3
#1: 2 1 dog
#2: 8 5 cat
#3: 1 9 pig
#4: 5 0 cat
If you want to do this for all columns, you can use:
df[rowSums(sapply(df, '%in%', Values) )>0]
# col1 col2 col3
#1: 2 1 dog
#2: 8 5 cat
#3: 1 9 pig
#4: 5 0 cat
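For the real case with 25 columns, the same test can be restricted to a chosen set of columns in base R. The attempt in the question most likely failed because %in% applied to a whole data frame treats it as a list of columns rather than testing each cell. The sketch below uses the toy data; cols, hits and keep are illustrative names, and in the real data cols would be the 9:33 range from the question.

```r
Values <- c(1, 2, 5)
df <- data.frame(
  col1 = c(2, 6, 8, 1, 3, 5),
  col2 = c(1, 4, 5, 9, 0, 0),
  col3 = c("dog", "cat", "cat", "pig", "chicken", "cat"),
  stringsAsFactors = FALSE
)

# Columns to test (assumption: stands in for columns 9:33 of the real data)
cols <- c("col1", "col2")

# One logical column per tested column: TRUE where the cell is in Values
hits <- sapply(df[cols], function(x) x %in% Values)

# Keep a row if at least one tested column matched
keep <- rowSums(hits) > 0
df[keep, ]
```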
