R combine duplicate rows by appending columns [duplicate]

This question already has answers here:
Duplicated rows: select rows based on criteria and store duplicated values
(2 answers)
Closed 3 years ago.
I have a large data set with text comments and their ratings on different variables, like so:
df <- data.frame(
comment = c("commentA","commentB","commentB","commentA","commentA","commentC"),
sentiment=c(1,2,1,4,1,2),
tone=c(1,5,3,2,6,1)
)
Every comment is present between one and 3 times, since multiple people are asked to rate the same comment sometimes.
I'm looking to create a data frame where the "comment" column only has unique values, and the other columns are appended, so any one text comment has as many "sentiment" and "tone" columns as there are ratings (which will result in NA's for comments that have not been rated as often, but that's okay):
df <- data.frame(
comment = c("commentA","commentB","commentC"),
sentiment.1=c(1,2,2),
sentiment.2=c(4,1,NA),
sentiment.3=c(1,NA,NA),
tone.1=c(1,5,1),
tone.2=c(2,3,NA),
tone.3=c(6,NA,NA)
)
I've been trying to figure this out using reshape to go from long to wide using
reshape(df,
idvar = "comment",
timevar = c("sentiment","tone"),
direction = "wide"
)
But that results in all possible combinations between sentiment and tone, rather than simply duplicating sentiment and tone independently.
I also tried using gather like so df %>% gather(key, value, -comment), but that only gets me halfway there...
Could anyone please point me in the right direction?

You need to create a variable to use as the numbers in the columns. rowid(comment) does the trick.
In dcast you put the row identifiers to the left of ~ and the column identifiers to the right. Then value.var is a character vector of all columns you want to include in this long-to-wide transformation.
library(data.table)
setDT(df)
dcast(df, comment ~ rowid(comment), value.var = c('sentiment', 'tone'))
# comment sentiment_1 sentiment_2 sentiment_3 tone_1 tone_2 tone_3
# 1: commentA 1 4 1 1 2 6
# 2: commentB 2 1 NA 5 3 NA
# 3: commentC 2 NA NA 1 NA NA
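For readers without data.table, the same idea (a per-comment counter, then a long-to-wide reshape) can be sketched in base R. This is only an illustration under stated assumptions: the counter column name rating is made up for this example, and the resulting columns are named sentiment.1, tone.1, and so on rather than sentiment_1.

```r
# Example data from the question
df <- data.frame(
  comment   = c("commentA", "commentB", "commentB", "commentA", "commentA", "commentC"),
  sentiment = c(1, 2, 1, 4, 1, 2),
  tone      = c(1, 5, 3, 2, 6, 1)
)

# Running counter within each comment: 1 for its first rating, 2 for its second, ...
# ("rating" is a hypothetical column name chosen for this sketch)
df$rating <- ave(seq_along(df$comment), df$comment, FUN = seq_along)

# Long to wide: one row per comment, one sentiment/tone column per rating number
wide <- reshape(df, idvar = "comment", timevar = "rating", direction = "wide")
wide
```

Comments with fewer ratings get NA in the columns for the rating numbers they lack.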


R pivot_longer(): tidyr wide to long manipulation reverse pivot summary to individual values [duplicate]

This question already has answers here:
Melt the dataframe by keeping the first column [duplicate]
(1 answer)
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 2 years ago.
I am trying to manipulate a wide table which represents the per cent of each household composition type within two towns to a long-form table (basically, a reverse of a pivot table).
In the long table, I would like each row to represent the household composition value for one household. So, the number of rows for each combination depends on the values provided, e.g. 18 rows of (town.a, singles), 8 rows of (town.b, singles), etc. However, I just can't seem to figure out how to do this expansion based on the values in each Town column.
I have a data.frame() that looks like this:
household.data <- data.frame(household.composition= c("Singles","Couples", "Families", "Single Parents", "Sharers"),
town.a =c(18,29,41,3,3),
town.b =c(8,37,48,9,3))
The values under the Town A and Town B columns represent the per cent makeup of each household composition within each town.
The goal is to get from this wide summary format to a long format which multiplies the value in the Household Composition column by the numeric value within the Town A and Town B columns. So each row would represent the household composition value for one household. For example:
Again, I know that there must be a way to do this using spread/gather or the pivot functions in tidyr. However, I just can't seem to figure out how to do this expansion given that I would like the number of rows to correspond with the per cent value.
You can get the data in long format and use uncount to replicate rows.
library(tidyr)
pivot_longer(household.data, cols = -household.composition) %>% uncount(value)
# A tibble: 199 x 2
# household.composition name
# <chr> <chr>
# 1 Singles town.a
# 2 Singles town.a
# 3 Singles town.a
# 4 Singles town.a
# 5 Singles town.a
# 6 Singles town.a
# 7 Singles town.a
# 8 Singles town.a
# 9 Singles town.a
#10 Singles town.a
# … with 189 more rows
You can work as follows:
1. Convert the data from wide to long format using tidyr::pivot_longer
2. Use lapply to apply the rep (replicate) function to each column, repeating each element the number of times given in value
3. Since lapply returns a list, use dplyr::bind_rows to bind the result into a data frame
4. Remove the value column to get the desired output
library(dplyr)
library(tidyr)
household.data %>%
pivot_longer(-household.composition, names_to = "town") %>%
lapply(rep, .$value) %>%
bind_rows() %>%
select(-value)
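The key step in that pipeline is that rep() accepts a vector for times, so each element gets its own repeat count. A minimal sketch with made-up labels and counts (not the question's data):

```r
# Hypothetical labels and per-label repeat counts, for illustration only
labels <- c("Singles", "Couples", "Families")
counts <- c(2, 1, 3)

# Each element of `labels` is repeated by the matching element of `counts`
expanded <- rep(labels, times = counts)
expanded
```

Applied to every column of the long table with the same counts, this keeps the columns aligned row by row.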
Base R solution:
setNames(within(
reshape(
household.data,
direction = "long",
varying = grepl("town", names(household.data)),
timevar = "town_type",
times = NULL,
idvar = !(grepl("town", names(household.data))),
new.row.names = 1:(nrow(household.data) * length(grepl(
"town", names(household.data)
)))
),
{
rm(town)
}
), c("household.composition", "town"))
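As an alternative base R sketch that makes the row replication explicit, the long table can be built by hand and then indexed with repeated row numbers. The intermediate names town.cols, long and expanded are assumptions made for this example.

```r
household.data <- data.frame(
  household.composition = c("Singles", "Couples", "Families", "Single Parents", "Sharers"),
  town.a = c(18, 29, 41, 3, 3),
  town.b = c(8, 37, 48, 9, 3)
)

town.cols <- c("town.a", "town.b")

# Wide to long: stack the town columns on top of each other
long <- data.frame(
  household.composition = rep(household.data$household.composition,
                              times = length(town.cols)),
  town  = rep(town.cols, each = nrow(household.data)),
  value = unlist(household.data[town.cols], use.names = FALSE)
)

# Repeat each row index `value` times, then drop the count column
expanded <- long[rep(seq_len(nrow(long)), times = long$value),
                 c("household.composition", "town")]
rownames(expanded) <- NULL
nrow(expanded)
```

The row count equals the sum of all percentages in the two town columns.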
data.table solution
library(data.table)
melt(setDT(household.data), id.vars = "household.composition")[rep(1:.N, value), .(household.composition, variable)]

How to write for loop that extracts value of variable based on similarity of another variable of two datasets (in R)? [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 3 years ago.
I have a dataset which contains only one variable ("text"), and a second dataset which is made up of a subset of this variable from dataset one plus a new variable called "code".
dat1<-tibble(text=c("book","chair","banana","cherry"))
dat2<-tibble(text=c("banana","cherry"),code=c(1,NA))
What I would like to get at is a for loop that yields the value of "code" for every row (i) where dat1$text is the same as dat2$text and 0 otherwise. The ultimate goal is a vector c(0,0,1,NA) that I could then add back to the first dataset.
However, I don't know how to select the row corresponding to i in the for loop that would get me the value of "code" that I need to arrive at this vector. Also, even if I knew, how to extract these values, I'm not sure this whole thing would work, let alone maintain the order that I need (c(0,0,1,NA)).
for (i in dat2$text) {
ifelse(i==dat1$text, print(dat[...,2]), print(0))
}
Does anyone know how to fix that?
We can match the text columns of the two data frames, then use 0 where there is no match and the corresponding code value otherwise.
inds <- match(dat1$text, dat2$text)
dat1$out <- ifelse(is.na(inds), 0, dat2$code[inds])
dat1
# A tibble: 4 x 2
# text out
# <chr> <dbl>
#1 book 0
#2 chair 0
#3 banana 1
#4 cherry NA
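To see why the test is on the match index rather than on the code value, it can help to look at the intermediate vectors. This sketch uses plain data frames instead of tibbles so that it runs without extra packages.

```r
dat1 <- data.frame(text = c("book", "chair", "banana", "cherry"))
dat2 <- data.frame(text = c("banana", "cherry"), code = c(1, NA))

# Position of each dat1$text value in dat2$text; NA means "not found"
inds <- match(dat1$text, dat2$text)
inds

# Indexing with NA also yields NA, so unmatched rows and a genuinely
# missing code look the same here
dat2$code[inds]

# Testing is.na(inds) separates "no match" (0) from "matched, code is NA"
out <- ifelse(is.na(inds), 0, dat2$code[inds])
out
```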
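We can also do a join. Starting from dat1 keeps its row order, and a helper flag (here called matched) separates rows with no match from rows whose code is genuinely NA:
library(dplyr)
dat1 %>%
  left_join(mutate(dat2, matched = TRUE), by = "text") %>%
  mutate(code = if_else(is.na(matched), 0, code)) %>%
  select(-matched)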

How to subset the first column (rownames) in R [duplicate]

This question already has answers here:
What is about the first column in R's dataset mtcars?
(4 answers)
Closed 3 years ago.
I have xy data for gene expression in multiple samples. I wish to subset the first column so I can order the genes alphabetically and perform some other filtering.
> setwd("C:/Users/Will/Desktop/BIOL3063/R code assignment");
> df = read.csv('R-assignments-dataset.csv', stringsAsFactors = FALSE);
Here is a simplified example of the dataset I'm working with, it has 270 columns (tissue samples) and 7065 rows (gene names).
The first column is a list of gene names (A2M, AAAS, AACS etc.) and each column is a different tissue sample, thus showing the gene expression in each tissue sample.
The question being asked is "Sort the gene names alphabetically (A-Z) and print out the first 20 gene names"
My thought process would be to subset the first column (gene names) and then perform order() to sort alphabetically, after which I can use head() to print the first 20.
However when I try
> genes <- df[1]
It simply subsets the first column that has data in it (TCGA-A6-2672_TissueA) rather than the one to its left.
Also
> genes <- df[,df$col1];
> genes;
data frame with 0 columns and 7065 rows
> order(genes);
integer(0)
Appears to create a list of gene names in R studio's viewer but I cannot perform any manipulation on it.
I am unable to correctly locate the first column in the data.frame, since it does not have a column header, and I also have the same problem when doing the same thing with row 1 (sample names) as well.
I'm a complete novice at R and this is part of an assignment I'm working on, it seems I'm missing something fundamental but I can not figure out what.
Cheers guys
Please include a sample of your text file as text instead of an image.
I have created a dataset similar to yours:
X Y
1 a b
2 c d
3 d g
Note that your tissue columns have a header but your gene names do not. Therefore these will be interpreted as rownames, see ?read.table:
If row.names is not specified and the header line has one less entry
than the number of columns, the first column is taken to be the row
names.
Reading it in R:
df <- read.table(text = ' X Y
1 a b
2 c d
3 d g')
So your gene names are not in df[1] but in rownames(df). To get them, use genes <- rownames(df); to add them to the existing df, use df$gene <- rownames(df).
There are numerous ways to convert your row names to a column see for example this question.
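Putting this together for the assignment question, a minimal sketch could look like the following. The gene names and values are invented for illustration, and the real file would be read with read.csv as in the question.

```r
# Toy data: the header line has one entry fewer than the data lines,
# so the first column becomes the row names
df <- read.table(text = "  X Y
GENE_C 1 2
GENE_A 3 4
GENE_B 5 6", header = TRUE)

# Gene names live in the row names, not in a regular column
genes <- sort(rownames(df))

# First 20 names in alphabetical order (only 3 exist in this toy data)
head(genes, 20)

# Reorder the whole data frame by gene name
sorted_df <- df[order(rownames(df)), ]
```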
If you are asking what I think you are asking, you can subset the column and wrap it in a data frame with an explicit column name (called V1 here). Note that as.data.frame() on a bare vector would otherwise name the column after the expression df[, 1] rather than V1.
genes <- data.frame(V1 = df[, 1])
genes$V1
1 A
2 C
3 A
4 B
5 C
6 D
7 A
8 B
As per the comment below, the issue could be avoided if you remove the comma from your subsetting syntax. When you select columns from a data.frame, you only need to index the column, not the rows.
genes <- df[1]

Include third variable in table [duplicate]

This question already has answers here:
Contingency table based on third variable (numeric)
(2 answers)
Closed 4 years ago.
I have made an edit after realising my code was insufficient to explain the problem - apologies.
I have a data frame including four columns
purchaseId <- c("abc","xyz","def","ghi")
product <- c("a","b","c","a")
quantity <- c(1,2,2,1)
revenue <- c(500,1000,300,500)
t <- data.frame(purchaseId,product, quantity, revenue)
Running this query
table(t$product,t$quantity)
returns a table indicating how many times each combination occurs
  1 2
a 2 0
b 0 1
c 0 1
What I would like to do is plot both product and quantity as rows and columns (as shown above) but with the revenue as an actual value.
The result should look like this:
     1    2
a 1000    0
b    0 1000
c    0  300
This would allow me to create a table that I could export as a csv.
Could anyone help me any further?
edit - the code suggested below throws the following error on the actual data set of 140K rows:
Error: dims [product 21525] do not match the length of object [147805]
Other ideas?
Of course the example code above is a simplified version of the actual data I'm using, but the idea is the same.
Thank you in advance,
Kind regards.
table(t$product,t$quantity)*t$revenue
Using library(reshape2) or library(data.table):
dcast(t, product ~ quantity, value.var = "revenue", fun.aggregate = sum)
The syntax is fairly simple:
1. Pass the data frame you are recasting.
2. Give the "formula" of the resulting data frame. The LHS of ~ holds the row-wise pivot variables, the RHS the column-wise ones.
3. value.var names the column whose values fill the cells, and fun.aggregate is the function used to aggregate them (here sum).
Since you mentioned in your comments that you are familiar with Excel pivot tables, it's worth noting that dcast is a fairly comprehensive replacement, with additional flexibility.
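If installing a package is not an option, base R's xtabs() can build the same kind of summed cross-table from a formula. This is a sketch on the question's example data; revenue.table and revenue.df are names chosen for this example.

```r
t <- data.frame(
  purchaseId = c("abc", "xyz", "def", "ghi"),
  product    = c("a", "b", "c", "a"),
  quantity   = c(1, 2, 2, 1),
  revenue    = c(500, 1000, 300, 500)
)

# Sum revenue for every product (rows) by quantity (columns) combination
revenue.table <- xtabs(revenue ~ product + quantity, data = t)
revenue.table

# Convert to a plain data frame, e.g. before passing it to write.csv()
revenue.df <- as.data.frame.matrix(revenue.table)
revenue.df
```

Because the aggregation works on rows of the data frame rather than on cells of a count table, the length of revenue never has to match the table's dimensions.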

R count occurrences of an element by groups [duplicate]

This question already has answers here:
Add column with order counts
(2 answers)
Count number of rows within each group
(17 answers)
Closed 7 years ago.
What is the easiest way to count the occurrences of an element in a vector or data.frame within every group?
I don't mean just counting the total (as other Stack Overflow questions ask) but giving a different number to every successive occurrence.
for example for this simple dataframe: (but I will work with dataframes with more columns)
mydata <- data.frame(A=c("A","A","A","B","B","A", "A"))
I've found this solution:
cbind(mydata,myorder=ave(rep(1,nrow(mydata)),mydata$A, FUN=cumsum))
and here the result:
A myorder
A 1
A 2
A 3
B 1
B 2
A 4
A 5
Isn't there a single command to do it, or a specialized package?
I want it to later use tidyr's spread() function.
My question is not the same as
Is there an aggregate FUN option to count occurrences?
because I don't want to know the total number of occurrences at the end but the cumulative occurrences up to every element.
OK, my problem is a little bit more complex
mydata <- data.frame(group=c("x","x","x","x","y","y", "y"), letter=c("A","A","A","B","B","A", "A"))
I only know to solve the first example I wrote above.
But what happens when I want it also by a second grouping variable?
something like occurrences(letter) by group.
group letter "occurrences within group"
x A 1
x A 2
x A 3
x B 1
y B 1
y A 1
y A 2
I've found the way with
ave(rep(1,nrow(mydata)),list(mydata$group, mydata$letter), FUN=cumsum)
though there should be something easier.
Using data.table
library(data.table)
setDT(mydata)
mydata[, myorder := 1:.N, by = .(group, letter)]
The by argument makes the operation run within each group defined by the group and letter columns. .N is the number of rows within that group (if by were omitted, it would be the number of rows in the whole table), so within each sub-table the rows are numbered from 1 to the number of rows in that sub-table.
mydata
group letter myorder
1: x A 1
2: x A 2
3: x A 3
4: x B 1
5: y B 1
6: y A 1
7: y A 2
or a dplyr solution, which is pretty much the same:
library(dplyr)
mydata %>%
  group_by(group, letter) %>%
  mutate(myorder = 1:n())
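For comparison, a package-free sketch of the same per-group counter: ave() accepts several grouping vectors, and using seq_along as the function avoids building a vector of ones and taking its cumulative sum.

```r
mydata <- data.frame(
  group  = c("x", "x", "x", "x", "y", "y", "y"),
  letter = c("A", "A", "A", "B", "B", "A", "A")
)

# Number the rows 1, 2, 3, ... within each (group, letter) combination
mydata$myorder <- ave(seq_len(nrow(mydata)),
                      mydata$group, mydata$letter,
                      FUN = seq_along)
mydata
```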
