Creating a new variable by combining multiple rows from 1 column

I have a data frame with many columns.
This is what it currently looks like:
ID Type
1 A
1 B
2 B
2 C
3 A
3 C
And this is what I want it to look like:
ID Type
1 A&B
2 B&C
3 A&C
I would like to do this without disrupting the rest of the columns. So it's basically going from long to wide form, but just for that one column. Is that possible?

library(dplyr)
x <- data.frame(ID = c(1, 1, 2, 2, 3, 3), type = c('A', 'B', 'B', 'C', 'A', 'C'))
x %>%
  group_by(ID) %>%
  summarise(y = paste(type, collapse = "&"))
This is just one way, but it is certainly possible. Note that summarise() returns only ID and the new column y.
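Since the question asks to leave the other columns alone, here is a minimal sketch of one way to do that, using mutate() followed by distinct() instead of summarise(). The Site column is hypothetical and only stands in for "the rest of the columns"; the sketch assumes those other columns are constant within each ID.

```r
library(dplyr)

# 'Site' is a made-up column standing in for the rest of the data frame
x <- data.frame(ID   = c(1, 1, 2, 2, 3, 3),
                Type = c("A", "B", "B", "C", "A", "C"),
                Site = c("n", "n", "s", "s", "e", "e"))

result <- x %>%
  group_by(ID) %>%
  # overwrite Type with the pasted value on every row of the group
  mutate(Type = paste(Type, collapse = "&")) %>%
  ungroup() %>%
  # rows within an ID are now identical, so keep one of each
  distinct()

result
```

If the other columns vary within an ID, distinct() will keep more than one row per ID, so this only works when they do not.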

Related

How can I force R to use an object's value, not the literal object name?

Long-time lurker, first-time poster.
I'm trying to edit an existing script so that I can use a variable to easily change some of the options in my analysis in order to conduct a sensitivity analysis. In this specific example, I'm trying to use an object so that I can easily change which column I'm calling. The columns contain categorical plant traits, and if a plant has two traits, I split its value between both of those traits.
Here is some example data (note that NumTraits gets derived earlier based on which Trait column I want to use, in this example it is Trait1):
PlantNumber  Value  Trait1  Trait2  NumTraits
1            10     A       A       1
2            20     B       A+B     1
3            15     A+B     A+B     2
4            10     B       B       1
My existing code reads:
split.data <- data %>%
  mutate(NewValue = Value/NumTraits) %>%
  tidyr::separate_rows(Trait1, sep = "[+]") %>%
  group_by(Trait1) %>%
  summarise(NewValue = sum(Value), .groups = 'drop')
This produces the desired output, which is:
PlantNumber  Value  Trait1  Trait2  NumTraits
1            10     A       A       1
2            20     B       A+B     1
3            7.5    A       A+B     2
4            10     B       B       1
3            7.5    B       A+B     2
(note that normally the two PlantNumber = 3 rows would be adjacent, but for some reason StackOverflow didn't accept that formatting)
I would like to have an object, say trait.to.use <- "Trait1", that I can put in place of Trait1 above to easily switch between Trait1 and Trait2 throughout my code. However, if I just replace Trait1 with trait.to.use in the above code, I get an error because "trait.to.use" is not a column in my data. I tried trait.to.use[1] and all_of(trait.to.use), but although they return "Trait1", the code doesn't split the values, and the resulting Value column is just "Trait1" on every line.
How can I pass the column name in an object and get it to produce the desired output? Thanks in advance.

You can use !!sym() to turn the string into a symbol:
library(dplyr)
trait.to.use <- c("Trait1", "Trait2")
data %>%
  mutate(NewValue = Value/NumTraits) %>%
  tidyr::separate_rows(!!sym(trait.to.use[1]), sep = "[+]") %>%
  group_by(!!sym(trait.to.use[1])) %>%
  summarise(NewValue = sum(Value), .groups = 'drop')
Output:
# A tibble: 2 × 2
  Trait1 NewValue
  <chr>     <int>
1 A            25
2 B            45
Then you can just change trait.to.use[1] to trait.to.use[2].
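As an alternative to !!sym(), here is a sketch that passes the column name through tidyselect with all_of(), wrapped in across() for group_by(). The data frame is rebuilt from the example table in the question. One assumption is labeled in the code: this version sums NewValue (the split value) rather than Value, on the guess that the split totals are what is wanted. The code above sums Value, which is why it returns 25 and 45.

```r
library(dplyr)
library(tidyr)

# Example data rebuilt from the question's table
data <- data.frame(
  PlantNumber = 1:4,
  Value       = c(10, 20, 15, 10),
  Trait1      = c("A", "B", "A+B", "B"),
  Trait2      = c("A", "A+B", "A+B", "B"),
  NumTraits   = c(1, 1, 2, 1)   # derived for Trait1, as in the question
)

trait.to.use <- "Trait1"

split.data <- data %>%
  mutate(NewValue = Value / NumTraits) %>%
  # separate_rows() takes tidyselect, so all_of() works with a string
  separate_rows(all_of(trait.to.use), sep = "[+]") %>%
  # group_by() is not tidyselect, hence the across() wrapper
  group_by(across(all_of(trait.to.use))) %>%
  # ASSUMPTION: sum the split value, not the original Value
  summarise(NewValue = sum(NewValue), .groups = "drop")

split.data
```

Switching to Trait2 only means changing trait.to.use, but NumTraits would have to be re-derived for that column first, as the question notes.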

bind tables of different length

Thank you, good people! This must be simple, but I've been banging my head against it for a while. Please help. I have a large data set from which I get all kinds of information via table(). I then want to store that information, which is essentially a set of different counts, so I also want to store the row names that were counted. For a reproducible example, consider:
a <- c("a","b","c","d","a","b") # one count, occurring twice for a and b and once for c and d
b <- c("a","c") # a completely different property from the dataset, occurring once for a and c
x <- table(a)
y <- table(b) # so now x and y hold the information I seek
How can I merge/bind/whatever to get from x and y to this form:
   x  y
a  2  1
b  2  0
c  1  1
d  1  0
HOWEVER, I need to use the solution iteratively, in a loop that takes x and y and gets the requested form above, and then gets more tables added, each hopefully adding a column. One of my many failed attempts, just to show the logic, is:
member <- function(data = dfm, groupvar = 'group', analysis = kc15) {
  res <- matrix(NA, ncol = length(analysis$size) + 1)
  res[, 1] <- table(docvars(data, groupvar))
  for (i in 1:length(analysis$size)) {
    r <- table(docvars(data, groupvar)[analysis$cluster == i])
    res <- cbind(res, r)
  }
  res
}
So, to sum up, the reproducible example above is meant to replicate the first column in res and one r, and I'm seeking (I think) a correct replacement for the cbind, one that would allow adding columns of different lengths but with similar names, as in the example above.
Please help, it's embarrassing how much time I'm wasting on this.

In base R, you can stack() each table() and then full-join the two sets of counts with merge():
out <- merge(stack(table(a)), stack(table(b)), by = 'ind', all = TRUE)
out
#  ind values.x values.y
#1   a        2        1
#2   b        2       NA
#3   c        1        1
#4   d        1       NA
If you want to replace NA with 0, you can do:
out[is.na(out)] <- 0
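For the iterative case in the question, where more tables keep getting added, a different base R approach may be simpler: tabulate every vector against one shared set of factor levels, so each column has the same length and missing categories come out as 0. This is only a sketch. The list name lst and the element names x and y are chosen here for illustration, and it assumes the raw vectors (not just the finished tables) are available.

```r
a <- c("a", "b", "c", "d", "a", "b")
b <- c("a", "c")

# Collect the vectors to count; adding a column means adding a list element
lst <- list(x = a, y = b)

# Shared set of categories across all vectors
lvls <- sort(unique(unlist(lst)))

# Tabulate each vector against the shared levels; absent levels count as 0
counts <- vapply(
  lst,
  function(v) as.vector(table(factor(v, levels = lvls))),
  integer(length(lvls))
)
rownames(counts) <- lvls

counts
```

Inside a loop, the same idea applies to one vector at a time: as long as each table is built with factor(v, levels = lvls), cbind() works because every column has one entry per level.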

One purrr and tidyr solution, with the vectors collected in a named list lst, could be:
library(purrr)
library(tidyr)
lst <- list(a = a,
            b = b)
map_dfr(lst, ~ stack(table(.)), .id = "ID") %>%
  pivot_wider(names_from = "ID", values_from = "values", values_fill = list(values = 0))
  ind       a     b
  <chr> <int> <int>
1 a         2     1
2 b         2     0
3 c         1     1
4 d         1     0

creating table from data in R

I want to create a table from existing data. I have 5 varieties and 3 clusters in the data. In the expected table I want to show, for each cluster, the number of varieties and their names, but I cannot get it to work. This is my data:
variety<-c("a","b","c","d","e")
cluster<-c(1,2,2,3,1)
x <- cbind(variety, cluster)
data <- data.frame(x)
data
variety cluster
1 a 1
2 b 2
3 c 2
4 d 3
5 e 1
My desired table looks like this:
cluster  number  variety name
1        2       a, e
2        2       b, c
3        1       d
I would be grateful for any help.

The following can give the results you're looking for:
library(plyr)
variety <- c("a","b","c","d","e")
cluster <- c(1,2,2,3,1)
x <- cbind(variety, cluster)
data <- data.frame(x)
data
ddply(data, .(cluster), summarise, n = length(variety), group = paste(variety, collapse = ','))

Here is one option with tidyverse. Grouped by 'cluster', get the number of rows (n()) and paste the 'variety' into a single string (toString):
library(tidyverse)
data %>%
  group_by(cluster) %>%
  summarise(number = n(), variety_name = toString(variety))

R, dplyr: Collect unique values for a column, mutate a label based on set intersection

I am working with a large data set, but let us take a toy example to demonstrate what I am trying to achieve. I am using R and dplyr.
I have a table:
id attribute correct
1 a a
1 b a
1 c a
2 d e
2 e e
3 d f
From the above, I want to create two columns, attribute_set and label. To clarify, I want:
id attribute_set correct label
1 a, b, c a 1
2 d, e e 1
3 d f 0
attribute_set should be a collection (any data structure) that has all of the attributes for an id. label should be 1 if the correct value is in attribute_set and 0 otherwise.
Presently, I create attribute_set like so:
design_mat1 <- design_mat %>%
  group_by(id) %>%
  mutate(attribute_set = paste(unique(attribute), collapse = "|")) %>%
  select(-attribute)
I generate label like so:
design_mat2b <- design_mat2 %>%
  group_by(id) %>%
  mutate(label = ifelse(correct %in% attribute_set, 1, 0))
However, my label works only when there is one element in attribute_set. I think I have to strsplit on | or make attribute_set use some other data structure. I have been unable to figure out what alternative data structure to use nor have I been able to get a strsplit on | solution to work. Any hints/solutions are appreciated.

After grouping by 'id', we can use summarise to paste the unique elements of 'attribute', take the first value of 'correct', and set 'label' to 1 if any 'correct' value appears in 'attribute':
library(dplyr)
design_mat %>%
  group_by(id) %>%
  summarise(attribute_set = toString(unique(attribute)),
            correct = first(correct),
            label = +(any(correct %in% attribute)))
# A tibble: 3 x 4
# id attribute_set correct label
# <int> <chr> <chr> <int>
#1 1 a, b, c a 1
#2 2 d, e e 1
#3 3 d f 0
Or include 'correct' in group_by as well, and then summarise only 'attribute_set' and 'label'.
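The second variant is described but not shown, so here is a sketch of what it might look like. The data frame is rebuilt from the toy table in the question, and the approach assumes 'correct' is constant within each 'id' (otherwise grouping by both columns would split an id into several rows).

```r
library(dplyr)

# Toy data rebuilt from the question
design_mat <- data.frame(
  id        = c(1, 1, 1, 2, 2, 3),
  attribute = c("a", "b", "c", "d", "e", "d"),
  correct   = c("a", "a", "a", "e", "e", "f")
)

result <- design_mat %>%
  # 'correct' becomes a grouping column, so it is carried through as-is
  group_by(id, correct) %>%
  summarise(attribute_set = toString(unique(attribute)),
            # unary + turns TRUE/FALSE into 1/0
            label = +(any(correct %in% attribute)),
            .groups = "drop")

result
```

The membership test runs against the original 'attribute' vector, not the pasted string, which is why it also works when a set has more than one element.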

count frequency of rows based on a column value in R

I understand that this is quite a simple question, but I haven't been able to find an answer to it.
I have a data frame which gives the id of a person and their hobby. Since a person may have many hobbies, the id field may be repeated in multiple rows, each with a different hobby. I have been trying to print out only the rows for ids that have more than one hobby. I was able to get the frequencies using table.
But how do I apply the condition to print only when the frequency is greater than one?
Secondly, is there a better way to find frequencies without using table?
This is my attempt with table, without the filter for frequency greater than one:
> id=c(1,2,2,3,2,4,3,1)
> hobby = c('play','swim','play','movies','golf','basketball','playstation','gameboy')
> df = data.frame(id, hobby)
> table(df$id)
1 2 3 4
2 3 2 1

Try using data.table; I find it more readable than using table():
library(data.table)
id = c(1,2,2,3,2,4,3,1)
hobby = c('play','swim','play','movies',
          'golf','basketball','playstation','gameboy')
df = data.frame(id=id, hobby=hobby)
dt = as.data.table(df)
dt[, hobbies := .N, by = id]  # add the number of rows per id as a new column
For your condition, you will get:
> dt[hobbies >1,]
id hobby hobbies
1: 1 play 2
2: 2 swim 3
3: 2 play 3
4: 3 movies 2
5: 2 golf 3
6: 3 playstation 2
7: 1 gameboy 2

This example assumes you are trying to filter df:
id = c(1,2,2,3,2,4,3,1)
hobby = c('play','swim','play','movies','golf','basketball',
          'playstation','gameboy')
df = data.frame(id, hobby)
table(df$id)
Get all the ids that have more than one hobby:
tmp <- as.data.frame(table(df$id))
tmp <- tmp[tmp$Freq > 1,]
Using that information, select their IDs in df:
df1 <- df[df$id %in% tmp$Var1,]
df1
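On the second part of the question (finding frequencies without table), one base R option is ave(), which computes a per-group value and returns it aligned with the original rows. A minimal sketch, using the same example data; the names n_hobbies and df_multi are chosen here for illustration:

```r
id <- c(1, 2, 2, 3, 2, 4, 3, 1)
hobby <- c("play", "swim", "play", "movies", "golf", "basketball",
           "playstation", "gameboy")
df <- data.frame(id, hobby)

# For every row, the number of rows sharing that row's id
n_hobbies <- ave(seq_along(df$id), df$id, FUN = length)

# Keep only rows whose id appears more than once
df_multi <- df[n_hobbies > 1, ]
df_multi
```

Because the counts line up row by row with df, no intermediate frequency table or %in% lookup is needed.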
