Creating a table from data in R

I want to create a summary table from existing data. I have 5 varieties and 3 clusters. In the expected table I want to show, for each cluster, the number of varieties and their names, but I cannot work out how to do it. This is my data:
variety<-c("a","b","c","d","e")
cluster<-c(1,2,2,3,1)
x <- cbind(variety, cluster)
data <- data.frame(x)
data
variety cluster
1 a 1
2 b 2
3 c 2
4 d 3
5 e 1
My desired table looks like this:
cluster  number  variety name
1        2       a, e
2        2       b, c
3        1       d
I would be grateful for any help.

The following can give the results you're looking for:
library(plyr)
variety<-c("a","b","c","d","e")
cluster<-c(1,2,2,3,1)
x <- cbind(variety, cluster)
data <- data.frame(x)
data
ddply(data,.(cluster),summarise,n=length(variety),group=paste(variety,collapse=','))

Here is one option with the tidyverse: group by cluster, count the rows with n(), and collapse variety into a single string with toString().
library(tidyverse)
data %>%
group_by(cluster) %>%
summarise(number = n(), variety_name = toString(variety))
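For comparison, the same summary can be sketched in base R without extra packages. This is only one possible approach; it assumes data is built directly with data.frame() (so cluster stays numeric), and the result column names simply mirror the tidyverse answer.

```r
# Sample data from the question, built without cbind() so cluster stays numeric
variety <- c("a", "b", "c", "d", "e")
cluster <- c(1, 2, 2, 3, 1)
data <- data.frame(variety, cluster, stringsAsFactors = FALSE)

# Count the varieties in each cluster and collapse their names into one string
number <- tapply(data$variety, data$cluster, length)
variety_name <- tapply(data$variety, data$cluster, toString)

result <- data.frame(
  cluster = names(number),
  number = as.vector(number),
  variety_name = as.vector(variety_name),
  stringsAsFactors = FALSE
)
result
```

tapply() returns its results ordered by cluster, so the two vectors line up without an explicit join.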

Related

How to count variables in duplicates in R

I want to count how many times each value occurs, including values inside comma-separated entries.
I used this code:
library(dplyr)
mydata2<-mydata %>%group_by(sub) %>%summarise(n = n())
mydata2
But this code counts A and A,B as different values.
I want the count of each individual value instead: A = 3, B = 3, C = 2 for the data below.
How can I write code for that?
Here is my data:
num sub
1 A
2 A,B
3 C
4 A
5 B
6 B,C
library(dplyr)
library(tidyr)
mydata %>%
separate_rows(sub) %>%
count(sub)
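We can also use base R with strsplit() and table(). strsplit() returns a list, so flatten it with unlist() before tabulating:
table(unlist(strsplit(as.character(mydata$sub), ",")))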
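For a quick check of the base R approach against the sample data, here is a self-contained sketch. The construction of mydata is an assumption based on the table in the question (sub stored as character), and the list returned by strsplit() is flattened before counting.

```r
# Sample data reconstructed from the question (assumed layout)
mydata <- data.frame(
  num = 1:6,
  sub = c("A", "A,B", "C", "A", "B", "B,C"),
  stringsAsFactors = FALSE
)

# Split each entry on commas, flatten the resulting list, then tabulate
pieces <- unlist(strsplit(mydata$sub, ","))
counts <- table(pieces)
counts
```

If the real data has spaces after the commas, wrapping pieces in trimws() before table() would keep "B" and " B" from being counted separately.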

Creating a new variable by combining multiple rows from 1 column

I have a data frame with many columns.
This is what it currently looks like:
ID Type
1 A
1 B
2 B
2 C
3 A
3 C
And this is what I want it to look like:
ID Type
1 A&B
2 B&C
3 A&C
I would like to do this without disrupting the rest of the columns. So it's basically going from long to wide form, but just for that one column. Is that possible?
x <- data.frame(ID = c(1,1,2,2,3,3), type = c('A','B','B','C','A','C'))
library(dplyr)
x %>%
group_by(ID) %>%
summarise(y = paste(type,collapse="&"))
This is just one way, but it is certainly possible.
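A base R alternative, sketched under the same assumptions about x, is aggregate() with paste() as the summary function. Extra arguments such as collapse are passed through to paste().

```r
# Same sample data as in the dplyr answer
x <- data.frame(
  ID = c(1, 1, 2, 2, 3, 3),
  type = c("A", "B", "B", "C", "A", "C"),
  stringsAsFactors = FALSE
)

# One row per ID, with the types joined by "&"
combined <- aggregate(type ~ ID, data = x, FUN = paste, collapse = "&")
combined
```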

Calculations across more than two different dataframes in R

I'm trying to transfer some work previously done in Excel into R. All I need to do is transform two basic count_if formulae into readable R script. In Excel, I would use three tables and calculate across those using 'point-and-click' methods, but now I'm lost in how I should address it in R.
My original dataframes are large, so for this question I've posted sample dataframes:
OperatorData <- data.frame(
Operator = c("A","B","C"),
Locations = c(850, 575, 2175)
)
AreaData <- data.frame(
Area = c("Torbay","Torquay","Tooting","Torrington","Taunton","Torpley"),
SumLocations = c(1000,500,500,250,600,750)
)
OperatorAreaData <- data.frame(
Operator = c("A","A","A","B","B","B","C","C","C","C","C"),
Area = c("Torbay","Tooting","Taunton",
"Torbay","Taunton","Torrington",
"Tooting","Torpley","Torquay","Torbay","Torrington"),
Locations = c(250,400,200,
100,400,75,
100,750,500,650,175)
)
What I'm trying to do is add two new columns to the OperatorData dataframe: one with the count of areas that operator operates in, and another with the count of areas in which that operator owns more than 50% of the locations.
So the new resulting dataframe would look like this
Operator Locations AreaCount Own_GE_50percent
A 850 3 1
B 575 3 1
C 2175 5 4
So far, I've managed to calculate the first column using the table function and then appending:
OpAreaCount <- data.frame(table(OperatorAreaData$Operator))
names(OpAreaCount)[2] <- "AreaCount"
OperatorData$"AreaCount" <- cbind(OpAreaCount$AreaCount)
This is fairly straightforward, but I'm stuck on how to calculate the second column with the 50% condition.
library(dplyr)
OperatorAreaData %>%
inner_join(AreaData, by="Area") %>%
group_by(Operator) %>%
summarise(AreaCount = n_distinct(Area),
Own_GE_50percent = sum(Locations > (SumLocations/2)))
# # A tibble: 3 x 3
# Operator AreaCount Own_GE_50percent
# <fct> <int> <int>
# 1 A 3 1
# 2 B 3 1
# 3 C 5 4
You can use AreaCount = n() if you're sure you have unique Area values for each Operator.
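Since the question asks for the two counts as new columns on OperatorData, here is a base R sketch of one way to get there with merge() and tapply(). The helper names joined, area_count and majority_count are made up for illustration; matching by operator name avoids relying on row order.

```r
OperatorData <- data.frame(
  Operator = c("A", "B", "C"),
  Locations = c(850, 575, 2175)
)
AreaData <- data.frame(
  Area = c("Torbay", "Torquay", "Tooting", "Torrington", "Taunton", "Torpley"),
  SumLocations = c(1000, 500, 500, 250, 600, 750)
)
OperatorAreaData <- data.frame(
  Operator = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C"),
  Area = c("Torbay", "Tooting", "Taunton",
           "Torbay", "Taunton", "Torrington",
           "Tooting", "Torpley", "Torquay", "Torbay", "Torrington"),
  Locations = c(250, 400, 200, 100, 400, 75, 100, 750, 500, 650, 175)
)

# Attach each area's total to the operator/area rows
joined <- merge(OperatorAreaData, AreaData, by = "Area")
# TRUE where the operator holds more than half of the area's locations
joined$majority <- joined$Locations > joined$SumLocations / 2

# Per-operator counts, named by operator
area_count <- tapply(joined$Area, joined$Operator, function(a) length(unique(a)))
majority_count <- tapply(joined$majority, joined$Operator, sum)

# Look the counts up by operator name
ops <- as.character(OperatorData$Operator)
OperatorData$AreaCount <- as.vector(area_count[ops])
OperatorData$Own_GE_50percent <- as.vector(majority_count[ops])
OperatorData
```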

R sum of rows for different group of columns that start with similar string

I'm quite new to R and this is the first time I dare to ask a question here.
I'm working with a dataset of Likert scales and I want to compute row sums over different groups of columns that share the same prefix in their names.
Below I constructed a data frame of only 2 rows to illustrate the approach I followed, though I would like to receive feedback on how I can write a more efficient way of doing it.
df <- as.data.frame(rbind(rep(sample(1:5),4),rep(sample(1:5),4)))
var.names <- c("emp_1","emp_2","emp_3","emp_4","sat_1","sat_2"
,"sat_3","res_1","res_2","res_3","res_4","com_1",
"com_2","com_3","com_4","com_5","cap_1","cap_2",
"cap_3","cap_4")
names(df) <- var.names
So what I did was use the grep function to sum the rows of the variables that start with a given string and store the result in a new variable. But I have to write a new line of code for each group.
df$emp_t <- rowSums(df[, grep("\\bemp.", names(df))])
df$sat_t <- rowSums(df[, grep("\\bsat.", names(df))])
df$res_t <- rowSums(df[, grep("\\bres.", names(df))])
df$com_t <- rowSums(df[, grep("\\bcom.", names(df))])
df$cap_t <- rowSums(df[, grep("\\bcap.", names(df))])
But there are many more variables in the dataset, and I would like to know if there is a way to do this with only one line of code: for example, some way to group the variables that start with the same string and then apply the row-sum function.
Thanks in advance!
One possible solution is to transpose df and calculate the group sums with the base R rowsum function (output shown for data generated after set.seed(123)):
cbind(df, t(rowsum(t(df), sub("_.*", "_t", names(df)))))
# emp_1 emp_2 emp_3 emp_4 sat_1 sat_2 sat_3 res_1 res_2 res_3 res_4 com_1 com_2 com_3 com_4 com_5 cap_1 cap_2 cap_3 cap_4 cap_t
# 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 13
# 2 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 14
# com_t emp_t res_t sat_t
# 1 15 14 11 7
# 2 15 10 12 9
Agree with MrFlick that you may want to put your data in long format (see reshape2, tidyr), but to answer your question:
cbind(
df,
sapply(split.default(df, sub("_.*$", "_t", names(df))), rowSums)
)
Will do the trick
You'll be better off in the long run if you put your data into tidy format. The problem is that the data is in a wide rather than a long format. And the variable names, e.g., emp_1, are actually two separate pieces of data: the class of the person, and the person's ID number (or something like that). Here is a solution to your problem with dplyr and tidyr.
library(dplyr)
library(tidyr)
df %>%
gather(key, value) %>%
extract(key, c("class", "id"), "([[:alnum:]]+)_([[:alnum:]]+)") %>%
group_by(class) %>%
summarize(class_sum = sum(value))
First we convert the data frame from wide to long format with gather(). Then we split the values emp_1 into separate columns class and id with extract(). Finally we group by the class and sum the values in each class. Result:
Source: local data frame [5 x 2]
class class_sum
1 cap 26
2 com 30
3 emp 23
4 res 22
5 sat 19
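Note that this summary adds up each class over all rows. If per-row totals are wanted, as in the question, a row identifier has to be part of the grouping. Below is a base R sketch of the same long-format idea, using stack() in place of gather(); the fixed values stand in for the random sample so the result is reproducible, and the names long and per_row are illustrative.

```r
# Fixed stand-in for the two sampled rows (values are an assumption)
df <- as.data.frame(rbind(rep(c(2, 4, 5, 3, 1), 4), rep(c(1, 3, 4, 2, 5), 4)))
sizes <- c(4, 3, 4, 5, 4)
names(df) <- paste0(rep(c("emp", "sat", "res", "com", "cap"), times = sizes),
                    "_", sequence(sizes))

# Long format: stack() goes column by column, so row ids repeat per column
long <- cbind(row = rep(seq_len(nrow(df)), times = ncol(df)), stack(df))
# Strip the item number to get the class prefix (emp, sat, ...)
long$class <- sub("_.*", "", long$ind)

# Sum within each row/class combination: one row per original row
per_row <- tapply(long$values, list(row = long$row, class = long$class), sum)
per_row
```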
Another potential solution is to use dplyr's rowwise() function together with c_across(). https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-rowwise/
df %>%
rowwise() %>%
mutate(emp_sum = sum(c_across(starts_with("emp"))),
sat_sum = sum(c_across(starts_with("sat"))),
res_sum = sum(c_across(starts_with("res"))),
com_sum = sum(c_across(starts_with("com"))),
cap_sum = sum(c_across(starts_with("cap"))))

count frequency of rows based on a column value in R

I understand that this is quite a simple question, but I haven't been able to find an answer to this.
I have a data frame which gives the id of a person and their hobby. Since a person may have many hobbies, the id may be repeated in multiple rows, each with a different hobby. I have been trying to print only those rows whose id has more than one hobby. I was able to get the frequencies using table.
But how do I apply the condition to print only when the frequency is greater than one?
Secondly, is there a better way to find frequencies without using table?
This is my attempt with table, without the filter for frequency greater than one:
> id=c(1,2,2,3,2,4,3,1)
> hobby = c('play','swim','play','movies','golf','basketball','playstation','gameboy')
> df = data.frame(id, hobby)
> table(df$id)
1 2 3 4
2 3 2 1
Try data.table; I find it more readable than working with table():
library(data.table)
id=c(1,2,2,3,2,4,3,1)
hobby = c('play','swim','play','movies',
'golf','basketball','playstation','gameboy')
df = data.frame(id=id, hobby=hobby)
dt = as.data.table(df)
dt[,hobbies:=.N, by=id]
You will get, for your condition:
> dt[hobbies >1,]
id hobby hobbies
1: 1 play 2
2: 2 swim 3
3: 2 play 3
4: 3 movies 2
5: 2 golf 3
6: 3 playstation 2
7: 1 gameboy 2
This example assumes you are trying to filter df
id=c(1,2,2,3,2,4,3,1)
hobby = c('play','swim','play','movies','golf','basketball',
'playstation','gameboy')
df = data.frame(id, hobby)
table(df$id)
Get all those ids that have more than one hobby
tmp <- as.data.frame(table(df$id))
tmp <- tmp[tmp$Freq > 1,]
Using that information, select the rows of df with those ids
df1 <- df[df$id %in% tmp$Var1,]
df1
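On the second part of the question (frequencies without table), one base R option is ave(), which returns the group size aligned with every row so it can be used directly as a filter. This is a sketch rather than the only way to do it; the column name n_hobbies is made up for illustration.

```r
id <- c(1, 2, 2, 3, 2, 4, 3, 1)
hobby <- c("play", "swim", "play", "movies", "golf", "basketball",
           "playstation", "gameboy")
df <- data.frame(id, hobby)

# For every row, the number of rows sharing its id
df$n_hobbies <- ave(df$id, df$id, FUN = length)

# Keep only people with more than one hobby
multi <- df[df$n_hobbies > 1, ]
multi
```

This keeps the original row order and avoids the intermediate frequency table.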
