How to separate the data based on names present in a column? - r

I have a data frame that looks like this.
df <- data.frame(names = c("Ram","Siddhharth","Nithin","Aashrit","Bragadish","Sridhar"),
house = c("A", "B", "A", "B", "A", "B"))
I want to create a new data frame which gets re-arranged based on the house they are in.
Expected Output
house_A house_B
1 Ram Siddhharth
2 Nithin Aashrit
3 Bragadish Sridhar
How can I achieve this? Many thanks in advance.

You could use tidyr:
library(tidyr)
df %>%
pivot_wider(names_from = "house", names_prefix = "house_", values_from = "names", values_fn = list) %>%
unnest(cols = everything())
This returns
# A tibble: 3 x 2
house_A house_B
<chr> <chr>
1 Ram Siddhharth
2 Nithin Aashrit
3 Bragadish Sridhar
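For comparison, here is a minimal base R sketch of the same reshaping. It assumes every house has the same number of members (as in the question); the names by_house and wide are only illustrative.

```r
# Data from the question; stringsAsFactors = FALSE keeps the names as character
df <- data.frame(
  names = c("Ram", "Siddhharth", "Nithin", "Aashrit", "Bragadish", "Sridhar"),
  house = c("A", "B", "A", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# split() returns one character vector of names per house
by_house <- split(df$names, df$house)

# Assumption: all groups have equal length, so they can sit side by side as columns
wide <- as.data.frame(by_house, stringsAsFactors = FALSE)
names(wide) <- paste0("house_", names(by_house))

wide
```

If the groups had different sizes, as.data.frame() would fail here, so the shorter vectors would first need padding with NA.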

Related

In R, subset a dataframe on rows whose ID appears more than once [duplicate]

Background
I have a dataframe d with ~10,000 rows and n columns, one of which is an ID variable. Most ID's appear once, but some appear more than once. Say that it looks like this:
Problem
I'd like a new dataframe d_sub which only contains ID's that appear more than once in d. I'd like to have something that looks like this:
What I've tried
I've tried something like this:
d_sub <- subset(d, duplicated(d$ID))
But that only gets me one entry for ID's b and d, and I want each of their respective rows:
Any thoughts?
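duplicated() on its own returns FALSE for the first occurrence of each 'ID', so only the later copies are kept. Combine it with a second call that scans from the end (fromLast = TRUE), joined by |, so that every row of a repeated 'ID' is flagged: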
d_sub <- subset(d, duplicated(ID)|duplicated(ID, fromLast = TRUE))
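To see why both directions are needed, here is a small sketch on a toy data frame (the values are made up for illustration, and first_pass / last_pass are just helper names):

```r
# Toy data: IDs "b" and "d" each appear twice
d <- data.frame(
  ID     = c("a", "b", "b", "c", "d", "d"),
  gender = c("f", "f", "m", "f", "f", "m"),
  zip    = c(1, NA, 2, 3, NA, 4),
  stringsAsFactors = FALSE
)

# Flags only the second and later occurrences of an ID
first_pass <- duplicated(d$ID)

# Scanning from the end flags every occurrence except the last one
last_pass <- duplicated(d$ID, fromLast = TRUE)

# A row belongs to a repeated ID if either scan flags it
d_sub <- d[first_pass | last_pass, ]

d_sub
```

Using first_pass alone keeps one row each for "b" and "d", which is the behaviour described in the question.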
We could use add_count, then filter on n:
library(dplyr)
df %>%
add_count(ID) %>%
filter(n!=1) %>%
select(-n)
Example:
library(dplyr)
df <- tribble(
~ID, ~gender, ~zip,
"a", "f", 1,
"b", "f", NA,
"b", "m", 2,
"c", "f", 3,
"d", "f", NA,
"d", "m", 4)
df %>%
add_count(ID) %>%
filter(n!=1) %>%
select(-n)
Output:
ID gender zip
<chr> <chr> <dbl>
1 b f NA
2 b m 2
3 d f NA
4 d m 4

remove rows if values exists with the same combination in different columns

I have 410 DNA sequences that I have compared with each other to get their similarity.
Now, to trim the database, I need to drop the rows that hold the same pair of values in the two sequence columns in swapped order, because every comparison appears twice.
To make myself clear, I have something like
tribble(
~seq01, ~seq02, ~ similarity,
"a", "b", 100.000,
"b", "a", 100.000,
"c", "d", 99.000,
"d", "c", 99.000,
)
Comparing a-b and b-a is the same thing, so I want to get rid of the duplicated comparison.
What I want to end up with is
tribble(
~seq01, ~seq02, ~ similarity,
"a", "b", 100.000,
"c", "d", 99.000
)
I am not sure how to proceed; all the ways I have thought of are kind of hacky. I checked other answers, but they don't really satisfy me.
Any input would be greatly appreciated (but tidy inputs are even more appreciated!)
We can use pmin and pmax to sort the values and then use distinct to select unique rows.
library(dplyr)
df %>%
mutate(col1 = pmin(seq01, seq02),
col2 = pmax(seq01, seq02), .before = 1) %>%
distinct(col1, col2, similarity)
# col1 col2 similarity
# <chr> <chr> <dbl>
#1 a b 100
#2 c d 99
Another, base R, approach:
df$add1 <- apply(df[,1:2], 1, min) # find rowwise minimum values
df$add2 <- apply(df[,1:2], 1, max) # find rowwise maximum values
df <- df[!duplicated(df[,4:5]),] # remove rows with identical values in new col's
df[,4:5] <- NULL # remove auxiliary col's
Result:
df
# A tibble: 2 x 3
seq01 seq02 similarity
<chr> <chr> <dbl>
1 a b 100
2 c d 99
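A variant of the base R idea that avoids helper columns is sketched below. It relies on pmin() and pmax() working element-wise on character vectors; lo, hi and deduped are illustrative names.

```r
# Data from the question as a plain data frame
df <- data.frame(
  seq01      = c("a", "b", "c", "d"),
  seq02      = c("b", "a", "d", "c"),
  similarity = c(100, 100, 99, 99),
  stringsAsFactors = FALSE
)

# Order each pair so that a-b and b-a produce the same key
lo <- pmin(df$seq01, df$seq02)
hi <- pmax(df$seq01, df$seq02)

# Keep the first row seen for every unordered pair
deduped <- df[!duplicated(data.frame(lo, hi)), ]

deduped
```

Because the key is built outside df, the original seq01 and seq02 columns are left exactly as they were.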

Add a row to a dataframe that repeats a row and replaces 2 entries

I want to add rows to a dataframe (or tibble) as part of a data entry project. I need to:
Find one row that holds a specific value in one column (obsid)
Duplicate that row. However, replace the value in column "word".
Append the new row to the dataframe
I want to write a function that makes it easy. When I write the function, it won't add the new rows. I can print out the answer. But it won't alter the basic dataframe
If I do it without a function it works as well.
Why won't the function add the row?
df <- tibble(obsid = c("a","b" , "c", "d"), b=c("a", "a", "b", "b"), word= c("what", "is", "the", "answer"))
df$main <- 1
addrow <- function(id, newword) {
rowtoadd <- df %>%
filter(obsid== id & main==1) %>%
mutate(word=replace(word, main==1, newword)) %>%
mutate(main=replace(main, word==newword, 0))
df <- bind_rows(df, rowtoadd)
print(rowtoadd)
print(filter(df, df$obsid== id))}
addrow("a", "xxx")
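R objects are usually not modified in place: assigning to df inside the function only changes a local copy, and the df in your workspace stays untouched. Wrap the result in return() to hand back the modified copy of the data frame, and assign it back (df <- addrow("a", "xxx")) if you want to keep the change.
Change your function to: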
df <- tibble(obsid = c("a","b" , "c", "d"), b=c("a", "a", "b", "b"), word= c("what", "is", "the", "answer"))
df$main <- 1
addrow <- function(id, newword) {
rowtoadd <- df %>%
filter(obsid== id & main==1) %>%
mutate(word=replace(word, main==1, newword)) %>%
mutate(main=replace(main, word==newword, 0))
df <- bind_rows(df, rowtoadd)
return(df)
}
> addrow("a", "xxx")
# A tibble: 5 x 4
obsid b word main
<chr> <chr> <chr> <dbl>
1 a a what 1
2 b a is 1
3 c b the 1
4 d b answer 1
5 a a xxx 0
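The copy semantics can be checked with a short base R sketch. Passing the data frame in as an argument (here called data) is a design choice of this sketch, not part of the original function, and it assumes the requested id exists with main == 1.

```r
df <- data.frame(
  obsid = c("a", "b", "c", "d"),
  b     = c("a", "a", "b", "b"),
  word  = c("what", "is", "the", "answer"),
  main  = 1,
  stringsAsFactors = FALSE
)

# Takes the data frame explicitly and returns an extended copy
addrow <- function(data, id, newword) {
  rowtoadd <- data[data$obsid == id & data$main == 1, ]
  rowtoadd$word <- newword  # replace the word in the duplicated row
  rowtoadd$main <- 0        # mark the new row as not the main entry
  rbind(data, rowtoadd)
}

# Calling the function without assigning leaves df unchanged
unassigned <- addrow(df, "a", "xxx")
n_after_call <- nrow(df)

# Assigning the return value is what updates df
df <- addrow(df, "a", "xxx")
n_after_assign <- nrow(df)
```

Keeping the data frame as an explicit argument also makes the function reusable on other tables.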

How to associate a list of character vectors with your data frame in R

The shape of my data is fairly simple:
set.seed(1337)
id <- c(1:4)
values <- runif(n = 4, min = 0, max = 1)
df <- data.frame(id, values)
df
id values
1 1 0.57632155
2 2 0.56474213
3 3 0.07399023
4 4 0.45386562
What isn't simple: I have a list of character-value arrays that match up to each row, where each list item can be empty, or it can contain up to 5 separate tags, like...
tags <- list(
c("A"),
NA,
c("A", "B", "C"),
c("B", "C")
)
I will be asked various questions using the tags as classifiers, for instance, "what is the average value of all rows with a B tag?" Or "how many rows contain both tag A and tag C?"
What way would you choose to store the tags so that I can do this? My real-life data file is quite large, which makes experimenting with unlist or other commands difficult.
Here are a couple of options. Add 'tags' to the data frame as a list column and unnest it (as already suggested in the comments). Then nAC counts the unnested rows whose tag is 'A' or 'C' by summing a logical vector, and meanB is the mean of 'values' over the rows whose tag is 'B'.
library(tidyverse)
df %>%
mutate(tag = tags) %>%
unnest(cols = tag) %>%
summarise(nAC = sum(tag %in% c("A", "C")),
meanB = mean(values[tag == "B"], na.rm = TRUE))
This is not very hard. Assign your list to the data frame as a new column named tags, then unnest it. Solutions for the two questions you listed follow.
library(tidyr)
library(dplyr)
df$tags=list(
c("A"),
NA,
c("A", "B", "C"),
c("B", "C")
)
Newdf=df%>%tidyr::unnest(tags)
Q1.
Newdf%>%group_by(tags)%>%summarise(Mean=mean(values))%>%filter(tags=='B')
tags Mean
<chr> <dbl>
1 B 0.263927925960161
Q2.
Newdf%>%group_by(id)%>%dplyr::summarise(Count=any(tags=='A')&any(tags=='C'))
# A tibble: 4 x 2
id Count
<int> <lgl>
1 1 FALSE
2 2 NA
3 3 TRUE
4 4 FALSE
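If unnesting the full data is too expensive, the list column can also be queried directly. The following base R sketch uses the values printed in the question; has_tag is a hypothetical helper, not a function from any package.

```r
# Values copied from the printed data frame in the question
df <- data.frame(
  id     = 1:4,
  values = c(0.57632155, 0.56474213, 0.07399023, 0.45386562)
)
df$tags <- list("A", NA, c("A", "B", "C"), c("B", "C"))

# Hypothetical helper: TRUE for every row whose tag vector contains `tag`
has_tag <- function(tag) {
  vapply(df$tags, function(row_tags) tag %in% row_tags, logical(1))
}

# Average value of all rows that carry a B tag
mean_b <- mean(df$values[has_tag("B")])

# Number of rows that contain both tag A and tag C
n_a_and_c <- sum(has_tag("A") & has_tag("C"))
```

Rows whose tag entry is NA simply never match, so they drop out of both answers without special handling.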

Create a CSV file with R with content in specified columns

I am trying to create a new table in R from an existing table.
To illustrate please see the table below:
The query looks at the 2nd and 3rd column and maps instances that combination occurs to produce a new table.
As you can see, there are no instances of repeat and that is critical.
I tried doing this with the unique() function, but I have not been able to get it to produce CSV output like this.
If you are quite new to R, the package sqldf may help you. It lets you write SQL queries in R, which can be handy when you work with tables.
Your code, for what you want to do would then look like this:
install.packages("sqldf")
library(sqldf)
new_table <- sqldf("SELECT Column2, Column3, COUNT(*) AS Frequency FROM old_table GROUP BY Column2, Column3")
write.csv(new_table, "new_table.csv")
To manipulate the data, you can put it in a tibble and afterwards use the dplyr grammar.
library(dplyr)
tibble(col_1=c(14, 5, 7, 688, 56, 565, 674),
col_2=c("A", "A", "B", "B", "B", "A", "C"),
col_3=c("C", "C", "D", "D", "D", "A", "D"),
col_4=c("67rhr", "4gg2", "344g5", "4yy4", "6hthht7", "7ttjty7", "yyuuy")) %>%
count(col_2, col_3) %>%
rename("frequency"=n)
# col_2 col_3 frequency
# <chr> <chr> <int>
# 1 A A 1
# 2 A C 2
# 3 B D 3
# 4 C D 1
Another option is to group with group_by(), count with summarise(), and write the result with write.csv():
Col1 <- c(12,5,7,688,56,565,674)
ColA <- c("A","A","B","B","B","A","C")
ColB <- c("C", "C","D", "D", "D", "A", "C")
df = data.frame(Col1, ColA, ColB)
library(dplyr)
result <- select(df, ColA, ColB) %>%
group_by(ColA, ColB) %>%
summarise(Frequency=n())
write.csv(result, file="somename.csv")
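The same count-and-export step can be sketched in base R without extra packages. The table name old_table and the temporary output file are assumptions for illustration; the data is the example tibble from the earlier answer.

```r
old_table <- data.frame(
  col_1 = c(14, 5, 7, 688, 56, 565, 674),
  col_2 = c("A", "A", "B", "B", "B", "A", "C"),
  col_3 = c("C", "C", "D", "D", "D", "A", "D"),
  stringsAsFactors = FALSE
)

# Count rows for each observed (col_2, col_3) combination
counts <- aggregate(col_1 ~ col_2 + col_3, data = old_table, FUN = length)
names(counts)[names(counts) == "col_1"] <- "frequency"

# Write without row names so the CSV holds only the three columns
out_file <- tempfile(fileext = ".csv")  # assumption: replace with a real path
write.csv(counts, out_file, row.names = FALSE)

# Read it back to confirm what ended up in the file
read_back <- read.csv(out_file, stringsAsFactors = FALSE)
```

Setting row.names = FALSE avoids the extra unnamed index column that write.csv() adds by default.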
