Is there an R function to move data after a certain string? - r

I am trying to move data into new columns after certain points in the data. My data is spread across multiple data frames that have only some elements in common, so I would like to write a loop to clean the data sets. I am looking for a function that, after the first row containing a certain text (for example "Total"), moves all the data below that row into new columns.
first second third
One 1
One 1
Total 2
Two 2
Two 2
Total 2
I want my data to look similar to this below, but due to the variability of the data I am having trouble finding a solution that can be reproduced easily.
left center right fourth
One 1 Two 2
One 1 Two 2
Total 1 Total 2

In my opinion, binding the data side by side into a wider format becomes cumbersome when there are many rows. Still, you can divide the data into separate groups like this:
df <- read.table(text = "first second
One 1
One 1
Total 2
Two 2
Two 2
Total 2", header = T)
df$dummy = rev(cumsum(rev(df$first == "Total")))
df
> df
first second dummy
1 One 1 2
2 One 1 2
3 Total 2 2
4 Two 2 1
5 Two 2 1
6 Total 2 1
Notice that the dummy column divides your data into two groups. If you still want to, you can easily cbind() or bind_cols() the groups:
library(dplyr)
df %>% group_split(d = rev(cumsum(rev(first == "Total")))) %>% bind_cols()
# A tibble: 3 x 6
first...1 second...2 d...3 first...4 second...5 d...6
<chr> <int> <int> <chr> <int> <int>
1 Two 2 1 One 1 2
2 Two 2 1 One 1 2
3 Total 2 1 Total 2 2

Here is another try
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(purrr)
data <- structure(list(first = c("One", "One", "Total", "Two", "Two",
"Total"), second = c(1L, 1L, 2L, 2L, 2L, 2L)), row.names = c(NA,
-6L), class = "data.frame")
new_data <- data %>%
# create group using first == "Total"
mutate(total_group = cumsum(first == "Total")) %>%
mutate(total_group = if_else(first == "Total", total_group - 1L, total_group)) %>%
# split df into multiple df and bind cols
group_split(total_group, .keep = FALSE) %>%
bind_cols()
#> New names:
#> * first -> first...1
#> * second -> second...2
#> * first -> first...3
#> * second -> second...4
new_data
#> # A tibble: 3 x 4
#> first...1 second...2 first...3 second...4
#> <chr> <int> <chr> <int>
#> 1 One 1 Two 2
#> 2 One 1 Two 2
#> 3 Total 2 Total 2
# renaming like this works if you only have two groups; otherwise the approach
# needs some more work. Hopefully this gives you enough of a hint to develop it further
names(new_data) <- c("left", "center", "right", "fourth")
new_data
#> # A tibble: 3 x 4
#> left center right fourth
#> <chr> <int> <chr> <int>
#> 1 One 1 Two 2
#> 2 One 1 Two 2
#> 3 Total 2 Total 2
Created on 2021-04-04 by the reprex package (v1.0.0)
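Since the question asks for something reusable in a loop, here is a minimal base R sketch that generalises the idea to any number of "Total" blocks. The helper name split_at_marker and its arguments are hypothetical (they do not come from either answer), and it assumes shorter blocks should be padded with NA rows:

```r
# Hypothetical helper: start a new block after every row where `col` equals
# `marker`, then bind the blocks side by side.
split_at_marker <- function(df, col = "first", marker = "Total") {
  is_marker <- !is.na(df[[col]]) & df[[col]] == marker
  # A new block starts on the row AFTER each marker row
  block <- cumsum(c(FALSE, head(is_marker, -1)))
  pieces <- split(df, block)
  # Pad shorter blocks with NA rows so that all blocks have the same height
  n <- max(vapply(pieces, nrow, integer(1)))
  padded <- lapply(pieces, function(p) {
    p <- p[seq_len(n), , drop = FALSE]
    rownames(p) <- NULL
    p
  })
  out <- do.call(cbind, unname(padded))
  # Repeated column names become first, second, first.1, second.1, ...
  names(out) <- make.unique(names(out))
  out
}

df <- data.frame(first = c("One", "One", "Total", "Two", "Two", "Total"),
                 second = c(1L, 1L, 2L, 2L, 2L, 2L),
                 stringsAsFactors = FALSE)
wide <- split_at_marker(df)
wide
```

If the data frames are kept in a list, something like lapply(list_of_dfs, split_at_marker) would apply the same cleaning to each of them; list_of_dfs is an assumed name.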

Related

Stacking/rearranging every 2 values in a row into 2 columns, row by row, in a dataframe on R

I hope I can explain myself well enough that my problem makes sense! I've been stuck on this one for hours.
What I have: a dataframe where the columns are genes (gene a, gene b, gene c, etc.) and the rows are cell clusters in two conditions (cluster a_ctrl, cluster a_exp, clusterb_ctrl, etc.)
What I want: I want it the other way around (but not transposed!) so that the columns are clusters (cluster a, cluster b, cluster c, etc.) and the rows are genes in two conditions (gene a_ctrl, gene a_exp, gene b_ctrl, etc.)
Thanks for the help!
First, some fake data:
df1 <- data.frame(cluster = c("clustera_ctrl", "clustera_exp", "clusterb_ctrl", "clusterb_exp"),
gene_a = 1:4,
gene_b = 5:8,
gene_c = 9:12)
cluster gene_a gene_b gene_c
1 clustera_ctrl 1 5 9
2 clustera_exp 2 6 10
3 clusterb_ctrl 3 7 11
4 clusterb_exp 4 8 12
Approach using separate, pivot_longer, and pivot_wider from tidyr. I think there's a shorter way with one pivot_longer but this should be clear enough.
library(tidyverse)
df1 %>%
separate(cluster, c("cluster", "type"), sep = "_") %>%
pivot_longer(starts_with("gene")) %>%
pivot_wider(names_from = c(name, type), values_from = value)
# A tibble: 2 × 7
cluster gene_a_ctrl gene_b_ctrl gene_c_ctrl gene_a_exp gene_b_exp gene_c_exp
<chr> <int> <int> <int> <int> <int> <int>
1 clustera 1 5 9 2 6 10
2 clusterb 3 7 11 4 8 12
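The output above keeps clusters as rows. If the goal is the orientation described in the question (clusters as columns, gene_condition as rows), one possible base R sketch follows. It assumes every cluster label has the form name_condition with exactly one underscore, and the intermediate names (key, long, wide) are illustrative only:

```r
df1 <- data.frame(cluster = c("clustera_ctrl", "clustera_exp",
                              "clusterb_ctrl", "clusterb_exp"),
                  gene_a = 1:4, gene_b = 5:8, gene_c = 9:12,
                  stringsAsFactors = FALSE)

genes <- setdiff(names(df1), "cluster")

# Split "clustera_ctrl" into its cluster part and its condition part
key <- strsplit(df1$cluster, "_", fixed = TRUE)
cluster <- vapply(key, `[`, character(1), 1)
type <- vapply(key, `[`, character(1), 2)

# Long format: one row per (cluster, gene_condition) pair
long <- data.frame(
  cluster = rep(cluster, times = length(genes)),
  gene = paste(rep(genes, each = nrow(df1)),
               rep(type, times = length(genes)), sep = "_"),
  value = unlist(df1[genes], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are gene_condition, columns are clusters
wide <- tapply(long$value, list(long$gene, long$cluster), FUN = identity)
wide
```

The result is a matrix with one row per gene and condition; as.data.frame(wide) converts it back to a data frame if needed.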

Mutate subset of rows by using corresponding values from a different data frame

I have two data frames. One contains some towns' names and IDs, the other contains some towns' names and IDs plus an extra value:
library(tidyverse)
towns.df <- structure(list(town_id = c(1, 2, 3), town_name = c("Rome", "Madrid", "Sarajevo")),
row.names = c(NA, -3L),
class = "data.frame") %>% as_tibble()
values.df <- structure(list(town_id = c(1, 5, 4), town_name = c("Rome", "Sarajevo", "Madrid"), town_value = c(18, 11, 15)),
row.names = c(NA, -3L),
class = "data.frame") %>% as_tibble()
The data looks like this:
> towns.df
# A tibble: 3 × 2
town_id town_name
<dbl> <chr>
1 1 Rome
2 2 Madrid
3 3 Sarajevo
> values.df
# A tibble: 3 × 3
town_id town_name town_value
<dbl> <chr> <dbl>
1 1 Rome 18
2 5 Sarajevo 11
3 4 Madrid 15
Using a tidyverse solution, I want to join the data frames based on IDs (for separate reasons I cannot directly do it based on towns' names) but the problem is that the IDs do not always correspond. The IDs that get priority are those found in towns.df, e.g. if the same town has two different IDs in the two data frames, I want it to eventually be associated to the one from towns.df.
So I first check what IDs from towns.df are not present in values.df by using anti_join():
> anti_join(towns.df, values.df, "town_id")
# A tibble: 2 × 2
town_id town_name
<dbl> <chr>
1 2 Madrid
2 3 Sarajevo
Then I want to take the corresponding towns and be able to modify values.df$town_id accordingly, so that the ID would be the one displayed in towns.df.
Or better, I want to directly create a joint data frame where I have town_id, town_name and town_value as columns and with the correct town_id.
The desired output is:
# A tibble: 3 × 3
town_id town_name town_value
<dbl> <chr> <dbl>
1 1 Rome 18
2 2 Madrid 15
3 3 Sarajevo 11
I am aware that in this example the desired output could be obtained by joining the data frames based on town_name - i.e. this would do the trick:
left_join(towns.df, values.df[,2:3], "town_name")
but, as I said, this is something I am not able to do due to separate reasons that are exogenous to this question.
I checked this answer, but I am not looking to use mutate with a literal value; instead I need to update a different column (i.e. town_id) than the one I am using to check the identity (i.e. town_name).
We may use
library(powerjoin)
power_left_join(towns.df, values.df, by = "town_name", conflict = \(x, y) x)
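For readers without powerjoin, the same idea can be sketched in base R. This is only an illustration: like the answer above, it looks the town up by name, and it assumes town names are unique in towns.df. The short names towns and values stand in for towns.df and values.df:

```r
towns <- data.frame(town_id = c(1, 2, 3),
                    town_name = c("Rome", "Madrid", "Sarajevo"),
                    stringsAsFactors = FALSE)
values <- data.frame(town_id = c(1, 5, 4),
                     town_name = c("Rome", "Sarajevo", "Madrid"),
                     town_value = c(18, 11, 15),
                     stringsAsFactors = FALSE)

# Look up each town's ID in `towns`; keep the old ID when the name has no match
idx <- match(values$town_name, towns$town_name)
values$town_id <- ifelse(is.na(idx), values$town_id, towns$town_id[idx])

# Now the IDs agree, so the join can be done on town_id as requested
result <- merge(towns, values[c("town_id", "town_value")],
                by = "town_id", all.x = TRUE)
result
```

After the lookup, values carries the IDs from towns, so any later join on town_id gives priority to towns, as described in the question.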

How to drop rows from a tibble based on multiple column values (duplicate and string value)

I can't figure out how to do this. In the tibble below I would like to drop row 4 based on a few things:
models == "EADS142"
the duplicate attribute BC02S is present in row 3, where models == "EADS14"
I don't want to drop row 2, although models == "EADS142" and its attribute is duplicated in row 1, because in row 1 models != "EADS14"
filtered
# A tibble: 7 x 2
attributes models
<chr> <chr>
1 AGG413. EADS05
2 AGG413 EADS142
3 BC02S EADS14
4 BC02S EADS142
Expected result
# A tibble: 4 x 2
attributes models
<chr> <chr>
1 AGG413 EADS05
2 AGG413 EADS142
3 BC02S EADS14
Use duplicated with lag:
library(dplyr)
df %>%
filter(!(duplicated(attributes) &
models == 'EADS142' & lag(models) == 'EADS14'))
# attributes models
#1 AGG413. EADS05
#2 AGG413 EADS142
#3 BC02S EADS14
data
df <- structure(list(attributes = c("AGG413.", "AGG413", "BC02S", "BC02S"
), models = c("EADS05", "EADS142", "EADS14", "EADS142")),
class = "data.frame", row.names = c(NA, -4L))
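The lag() approach relies on the "EADS14" row sitting directly above the row to drop. If that ordering is not guaranteed, a possible base R sketch that ignores row order is shown below; the variable names has_eads14 and kept are illustrative, not from the answer:

```r
df <- data.frame(attributes = c("AGG413.", "AGG413", "BC02S", "BC02S"),
                 models = c("EADS05", "EADS142", "EADS14", "EADS142"),
                 stringsAsFactors = FALSE)

# Attributes that occur at least once with models == "EADS14"
has_eads14 <- df$attributes %in% df$attributes[df$models == "EADS14"]

# Drop "EADS142" rows whose attribute also exists under "EADS14"
kept <- df[!(df$models == "EADS142" & has_eads14), ]
kept
```

On the sample data this keeps rows 1 to 3 and drops row 4, which matches the expected result in the question.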

Create a vector that contains the topic value for the document based on highest gamma value

I have one data frame which contains 3 variables (document, topic and gamma):
document topic gamma
1 1 0.932581726
1 2 0.015250915
1 3 0.009929329
2 1 0.032864538
2 2 0.012939786
2 3 0.13281681
I want to create a vector containing the topic value for each document, based on the highest gamma value: a document belongs to the topic for which its gamma value is highest.
I have tried some code, but I am not sure whether this is the correct way to do it.
a2<-function(x){
i=1
while(i< 110)
for(j in 1:7)
x= max(ap_documents$gamma)
return(j)
}
a3<-sapply(ap_documents,a2)
Here is a way with dplyr (df is the data frame from the question):
library(dplyr)
df %>%
group_by(document) %>%
filter(gamma == max(gamma))
#output
# A tibble: 2 x 3
# Groups: document [2]
document topic gamma
<int> <int> <dbl>
1 1 1 0.933
2 2 3 0.133
In base R you can use aggregate:
aggregate(gamma ~ document, max, data = df)
#output
document gamma
1 1 0.9325817
2 2 0.1328168
If you would like to keep the topic column, you can merge it back:
merge(aggregate(gamma ~ document, max, data = df), df)
#output
document gamma topic
1 1 0.9325817 1
2 2 0.1328168 3
Although the other solutions work fine, I'd like to mention the top_n function in dplyr, which was built to solve similar tasks. Rank by gamma, since the question asks for the topic with the highest gamma (in dplyr 1.0.0 and later, top_n() is superseded by slice_max()):
library(dplyr)
my_df %>%
group_by(document) %>%
top_n(1, gamma)
# A tibble: 2 x 3
# Groups: document [2]
# document topic gamma
# <int> <int> <dbl>
# 1 1 1 0.933
# 2 2 3 0.133
Another simple base R solution is to sort by gamma in decreasing order and keep the first row of each document:
my_df <- my_df[order(my_df$gamma, decreasing = TRUE), ]
my_df[!duplicated(my_df$document), ]
# document topic gamma
# 1 1 1 0.9325817
# 6 2 3 0.1328168
Data
my_df <- structure(list(document = c(1L, 1L, 1L, 2L, 2L, 2L),
topic = c(1L, 2L, 3L, 1L, 2L, 3L),
gamma = c(0.932581726, 0.015250915, 0.009929329,
0.032864538, 0.012939786, 0.13281681)),
class = "data.frame", row.names = c(NA, -6L))
If I understood what you want, you can use dplyr to accomplish it.
library(dplyr)
result <- df %>%
group_by(document) %>%
# keep, within each document, the row with the largest gamma
slice(which.max(gamma))
result
## A tibble: 2 x 3
## Groups: document [2]
# document topic gamma
# <dbl> <dbl> <dbl>
#1 1. 1. 0.933
#2 2. 3. 0.133
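The question asks for a vector rather than a data frame. A minimal base R sketch that returns one topic per document as a named vector could look like this; the names by_doc and top_topic are illustrative, and ties are resolved by which.max(), which takes the first maximum:

```r
my_df <- data.frame(document = c(1L, 1L, 1L, 2L, 2L, 2L),
                    topic = c(1L, 2L, 3L, 1L, 2L, 3L),
                    gamma = c(0.932581726, 0.015250915, 0.009929329,
                              0.032864538, 0.012939786, 0.13281681))

# One data frame per document
by_doc <- split(my_df, my_df$document)

# For each document, the topic whose gamma is largest
top_topic <- vapply(by_doc, function(d) d$topic[which.max(d$gamma)], integer(1))
top_topic
```

The vector is named by document, so top_topic[["2"]] gives the topic assigned to document 2.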

Assigning Label based on quantile for every sub group

My data.frame looks like this:
Region Store Sales
A 1 ***
A 2 ***
B 1 ***
B 2 ****
I want to label each store based on its sales performance: if a store's Sales is higher than the 75% quantile of its region, assign "High", otherwise "Low".
Applying ddply using the code
R3 <- ddply(dat, .(REGION), function(x) quantile(x$Sales, na.rm = TRUE))
returns a dataframe with all quantile numbers for the regions.
I can join that frame with the original and do an if-else for each cluster, but I am sure that is not an efficient way. Is there a better approach?
Is this what you want?
library(dplyr)
df %>% group_by(Region) %>%
mutate(Performance = ifelse(Sales > quantile(Sales, 0.75), 'High', 'Low'))
#> # A tibble: 4 x 4
#> # Groups: Region [2]
#> Region Store Sales Performance
#> <chr> <int> <int> <chr>
#> 1 A 1 100 High
#> 2 A 2 10 Low
#> 3 B 1 90 High
#> 4 B 2 10 Low
Data Input
df = read.table(text = 'Region Store Sales
A 1 100
A 2 10
B 1 90
B 2 10', header = T, stringsAsFactors = F)
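The same per-region labelling can be sketched in base R with ave(), which returns one value per row computed within each group. The intermediate name cutoff is illustrative, and na.rm = TRUE is an assumption carried over from the ddply call in the question:

```r
df <- data.frame(Region = c("A", "A", "B", "B"),
                 Store = c(1L, 2L, 1L, 2L),
                 Sales = c(100, 10, 90, 10),
                 stringsAsFactors = FALSE)

# 75% quantile of Sales within each Region, repeated for every row of that Region
cutoff <- ave(df$Sales, df$Region,
              FUN = function(s) quantile(s, 0.75, na.rm = TRUE))

# Stores above their region's cutoff are "High", the rest "Low"
df$Performance <- ifelse(df$Sales > cutoff, "High", "Low")
df
```

Rows with a missing Sales value would get NA as their label, because the comparison against the cutoff is NA.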
