I am trying to convert a single-column data frame into separate columns — the main descriptor in the data is the "item number" and then includes information on the price, date, color, etc. I would just split the column depending on row number, but since each item has a different amount of information, that doesn't really work.
I've experimented with this but haven't found anything that comes close. I can't use a regex to pull each kind of value into its own column (with str_which, for example), because the information differs so much from item to item. How can I use regex to mark the intervals to split on, so that the rows between each pair of rows containing "item" end up in their own column? Sample data is below.
data
item 1
$600
red
item 2
$70
item 3
$430
orange
10/11/2017
Thank you!
Here is a function that reshapes your data into either of two layouts, depending on how you want the final dataset to look. You supply the data frame DF, the variable var, a vector of column names colnames in the order the values appear, and byitem to choose the output format (the default, TRUE, returns a data frame with one row per item):
library(tidyverse)
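df_transform = function(DF, var, colnames, byitem = TRUE){
  # choose which variable becomes the column headers of the result
  if(byitem){
    ID = sym("rowid")
  }else{
    ID = sym("id")
  }
  DF %>%
    # start a new group at every row that contains "item"
    group_by(id = paste0("item", cumsum(grepl("item", var)))) %>%
    # label the rows after each item row with colnames, by position
    mutate(rowid = replace(2:n(), 2:n(), setNames(colnames[1:(n()-1)], 2:n()))) %>%
    # drop the item rows themselves, then reshape to wide format
    filter(!grepl("item", var)) %>%
    spread(!!ID, var)
}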
Output:
> df_transform(df, var, c("price", "color", "date"))
# A tibble: 3 x 4
# Groups: id [3]
id color date price
<chr> <fct> <fct> <fct>
1 item1 red <NA> $600
2 item2 <NA> <NA> $70
3 item3 orange 10/11/2017 $430
> df_transform(df, var, c("price", "color", "date"), byitem = FALSE)
# A tibble: 3 x 4
rowid item1 item2 item3
<chr> <fct> <fct> <fct>
1 color red <NA> orange
2 date <NA> <NA> 10/11/2017
3 price $600 $70 $430
Note that this would not work if you have missing values in the middle, since the column names are assigned by position.
Data:
df <- structure(list(var = structure(c(5L, 2L, 9L, 6L, 3L, 7L, 1L,
8L, 4L), .Label = c("$430", "$600", "$70", "10/11/2017", "item_1",
"item_2", "item_3", "orange", "red"), class = "factor")), .Names = "var", class = "data.frame", row.names = c(NA,
-9L))
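The note about missing values in the middle can be worked around by labelling each value from its content instead of its position. The following base R sketch does that under loudly stated assumptions: prices start with "$", dates look like dd/mm/yyyy, and anything else is a color. The vector raw (including the extra "item 4", which has no color) and the names field, long, and wide are hypothetical and not part of the original answer.

```r
# Hypothetical sample data; "item 4" has a price and a date but no color.
raw <- c("item 1", "$600", "red",
         "item 2", "$70",
         "item 3", "$430", "orange", "10/11/2017",
         "item 4", "$15", "01/02/2018")

# Rows that start a new item, and the item label each row belongs to.
is_item <- grepl("^item", raw)
id <- raw[is_item][cumsum(is_item)]

# Classify every value by its content (assumed patterns, see lead-in).
field <- ifelse(grepl("^\\$", raw), "price",
         ifelse(grepl("^\\d{1,2}/\\d{1,2}/\\d{4}$", raw), "date", "color"))

# Long format without the item header rows.
long <- data.frame(id = id, field = field, value = raw,
                   stringsAsFactors = FALSE)[!is_item, ]

# Fill an item-by-field matrix; cells with no value stay NA.
fields <- c("price", "color", "date")
wide <- matrix(NA_character_,
               nrow = sum(is_item), ncol = length(fields),
               dimnames = list(raw[is_item], fields))
wide[cbind(long$id, long$field)] <- long$value

print(wide)
```

Because the label comes from the value itself, the date of "item 4" lands in the date column even though its color is missing. The patterns would need adjusting for real data where, for example, colors and other free-text fields cannot be told apart.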
Related
I've been trying to figure out this problem for some time now. I have the following data frame with repeated observation by ID:
ID color
1 blue
1 red
1 blue
2 red
2 blue
2 red
.
.
.
I want to create a new data frame by choosing the color with the highest frequency for each ID so that I have only 1 row for each ID. That is, I'd like to get the following data frame:
ID color
1 blue
2 red
3
.
.
.
I attempted using transform but that didn't work as it only summed the number of times each ID appeared in the data.
transform(df, freq.ID = ave(seq(nrow(df)), ID, FUN=length))
Is there a way I can do this?
We count the frequency of each 'ID', 'color' combination, which creates a summarised column 'n'. We then order the rows by 'ID' and by descending 'n', and use distinct to return the first row for each 'ID'.
library(dplyr)
df1 %>%
  count(ID, color) %>%
  arrange(ID, desc(n)) %>%
  select(-n) %>%
  distinct(ID, .keep_all = TRUE)
-output
# ID color
#1 1 blue
#2 2 red
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), color = c("blue",
"red", "blue", "red", "blue", "red")), class = "data.frame", row.names = c(NA,
-6L))
Base R method using aggregate and ave -
subset(aggregate(length~color + ID, transform(df1, length = ID), length),
ave(length, ID, FUN = function(x) x == max(x)) == 1)
# color ID length
#1 blue 1 2
#4 red 2 2
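The two answers treat ties differently: distinct keeps only the first row per ID, while the aggregate and ave version keeps every color that reaches the maximum. As a minimal base R sketch that always returns exactly one row per ID, the helper most_frequent below (a hypothetical name, not from the answers) takes the first maximum of table(), so ties resolve to the alphabetically first color.

```r
# Same sample data as in the answers above.
df1 <- data.frame(ID = c(1L, 1L, 1L, 2L, 2L, 2L),
                  color = c("blue", "red", "blue", "red", "blue", "red"),
                  stringsAsFactors = FALSE)

# Most frequent value of x; which.max() returns the first maximum,
# and table() orders its names alphabetically.
most_frequent <- function(x) names(which.max(table(x)))

# One row per ID with its most frequent color.
result <- aggregate(color ~ ID, data = df1, FUN = most_frequent)
print(result)
```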
I am currently working with genetic data where the headers are cell sample names. There are 2 samples from each type of cell collected, and they need to be plotted in a box plot. Due to inconsistent sample naming, I am using a separate .csv file where the user writes the sample name and the group it belongs to. I am trying to use the group_by() function to access the sample data but then use the grouping information from the other .csv file. Is there a way to accomplish what I am trying to do?
Cell Sample Data CSV:
Sample A1 Sample A2 Sample B1 Sample B2
1 3 3 5
Grouping CSV
Samples Group
Sample A 1
Sample B 1
Sample C 2
Sample D 2
My current idea is doing something like this
library(dplyr)
groupFile <- data %>% group_by(groupFile$Group)
however that didn't work, and I am stuck at how to make the data correspond to the grouping file.
Note: I previously uploaded this question without sample data and code and it was closed. I'm hoping this describes the problem well enough.
First let's improve your example cell sample data by including samples that are in different groups:
celldata <- structure(list(`Sample A1` = 1L, `Sample A2` = 3L, `Sample B1` = 3L,
`Sample B2` = 5L, `Sample C1` = 6L, `Sample C2` = 7L),
class = "data.frame", row.names = c(NA, -1L))
And your groups data:
groupdata <- structure(list(Samples = c("Sample A", "Sample B", "Sample C", "Sample D"),
Group = c(1L, 1L, 2L, 2L)), class = "data.frame",
row.names = c(NA, -4L))
Life will be much easier with data in "long" format rather than wide, and with everything in the one dataframe.
We can use tidyr::gather to reshape the cell data, then dplyr::mutate to get Samples without the numeric suffixes and finally, dplyr::left_join to bring samples and groups together:
library(dplyr)
library(tidyr)
celldata %>%
  gather(Sample, Value) %>%
  mutate(Samples = gsub("\\d+", "", Sample)) %>%
  left_join(groupdata)
Result:
Sample Value Samples Group
1 Sample A1 1 Sample A 1
2 Sample A2 3 Sample A 1
3 Sample B1 3 Sample B 1
4 Sample B2 5 Sample B 1
5 Sample C1 6 Sample C 2
6 Sample C2 7 Sample C 2
Now you can group on Group. Depending on what you want to do next, you may want to convert Group to a factor. And if you're using ggplot2, you may not even need to group_by.
For example:
library(ggplot2)
celldata %>%
  gather(Sample, Value) %>%
  mutate(Samples = gsub("\\d+", "", Sample)) %>%
  left_join(groupdata) %>%
  mutate(Group = factor(Group)) %>%
  ggplot(aes(Group, Value)) +
  geom_boxplot() +
  geom_jitter(aes(color = Samples)) +
  theme_bw()
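If a numeric summary per group is wanted rather than a plot, the same long data can be passed to group_by() and summarise(). This is only a sketch reusing the example data from this answer; the names group_summary, mean_value, and n_samples are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Example data as defined earlier in this answer.
celldata <- data.frame(`Sample A1` = 1L, `Sample A2` = 3L, `Sample B1` = 3L,
                       `Sample B2` = 5L, `Sample C1` = 6L, `Sample C2` = 7L,
                       check.names = FALSE)
groupdata <- data.frame(Samples = c("Sample A", "Sample B", "Sample C", "Sample D"),
                        Group = c(1L, 1L, 2L, 2L),
                        stringsAsFactors = FALSE)

group_summary <- celldata %>%
  gather(Sample, Value) %>%                       # wide to long
  mutate(Samples = gsub("\\d+", "", Sample)) %>%  # strip the replicate number
  left_join(groupdata, by = "Samples") %>%        # attach the group of each sample
  group_by(Group) %>%
  summarise(mean_value = mean(Value), n_samples = n())

print(group_summary)
```

Naming the join column explicitly with by = "Samples" avoids the message that left_join() prints when it has to guess the key.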
I am a beginner in R and was looking for help online, but the examples I found under similar titles don't quite fit my needs, because they only deal with a few columns.
I have a data.frame T1 with over 100 columns and what I am looking for is something like a summary, but I want to retain every other column after the summary. I thought about using aggregate but since it's not a function, I am uncertain. The most promising way I think of you can see below.
T2 <- T1 %>% group_by(ID) %>% summarise(AGI = paste(AGI, collapse = "; "))
The summary works the way I want, but I lose any other column.
I definitely appreciate any kind of advice! Thank you very much
Expanding on TTS's comment, if you want to keep any other column you have to use mutate instead of summarise because, as the documentation says, summarise() creates a new data frame.
You should therefore use
T1 %>% group_by(ID) %>% mutate(AGI = paste(AGI, collapse = "; ")) %>% ungroup()
Data
T1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 3L, 4L), UniProt_Accession = c("P25702",
"F4HWZ6", "Q9C5M0", "Q9SR37", "Q9LKR7", "Q9FXI7"), AGI = c("AT1G54630",
"AT1G54630", "AT5G19760", "AT3G09260", "AT5G28510", "AT1G19890"
)), class = "data.frame", row.names = c(NA, -6L))
Output
# A tibble: 6 x 3
# ID UniProt_Accession AGI
# <int> <chr> <chr>
# 1 1 P25702 AT1G54630; AT1G54630
# 2 1 F4HWZ6 AT1G54630; AT1G54630
# 3 2 Q9C5M0 AT5G19760
# 4 3 Q9SR37 AT3G09260; AT5G28510
# 5 3 Q9LKR7 AT3G09260; AT5G28510
# 6 4 Q9FXI7 AT1G19890
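If one row per ID is wanted after all, with every other column retained, one possible variant is to collapse all non-grouping columns at once with across(). This is a sketch, not what the question's code does: the helper collapse_unique is hypothetical, and it assumes that duplicated values within an ID should appear only once in the collapsed string.

```r
library(dplyr)

# Same data as above.
T1 <- data.frame(
  ID = c(1L, 1L, 2L, 3L, 3L, 4L),
  UniProt_Accession = c("P25702", "F4HWZ6", "Q9C5M0", "Q9SR37", "Q9LKR7", "Q9FXI7"),
  AGI = c("AT1G54630", "AT1G54630", "AT5G19760", "AT3G09260", "AT5G28510", "AT1G19890"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: paste the distinct values of a column together.
collapse_unique <- function(x) paste(unique(x), collapse = "; ")

# everything() covers all columns except the grouping column ID,
# so this scales to data frames with many columns.
T2 <- T1 %>%
  group_by(ID) %>%
  summarise(across(everything(), collapse_unique))

print(T2)
```

All collapsed columns become character, so numeric columns would need separate handling (for example a sum or mean) if they should stay numeric.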
I'm quite new to R and I'm facing a problem which I guess is quite easy to fix but I couldn't find the answer.
I have a dataframe called clg where basically I have 3 columns date, X1, X2.
X1 and X2 are name of country teams. X1 and X2 have the same list of countries.
I'm simply trying to count the frequency of each country in the two columns as a total.
So far, I've only been able to count the frequency of the X1 column but I didn't find a way to sum both columns.
clt <- as_tibble(na.omit(count(clg, clg$X1)))
I would like to get a data frame where in the first columns I have unique countries, and in the second column the sum of occurrences in X1 + X2.
You can use unlist() and table() to get the overall counts. Wrapping the result in data.frame() gives you the desired two-column output.
clg <- data.frame(date=1:3,
X1=c("nor", "swe", "alg"),
X2=c("swe", "alg", "jpn"))
data.frame(table(unlist(clg[c("X1", "X2")])))
# Var1 Freq
# 1 alg 2
# 2 nor 1
# 3 swe 2
# 4 jpn 1
With tidyverse, we can gather into 'long' format and then do the count
library(tidyverse)
gather(clg, key, Var1, -date) %>%
count(Var1)
# A tibble: 4 x 2
# Var1 n
# <chr> <int>
#1 alg 2
#2 jpn 1
#3 nor 1
#4 swe 2
data
clg <- structure(list(date = 1:3, X1 = structure(c(2L, 3L, 1L), .Label = c("alg",
"nor", "swe"), class = "factor"), X2 = structure(c(3L, 1L, 2L
), .Label = c("alg", "jpn", "swe"), class = "factor")),
class = "data.frame", row.names = c(NA,
-3L))
You can reach your goal in two steps. In the first step, you count the occurrences of each country in each column separately. In the second step, you join the two summaries and calculate the total.
X1_sum <- df %>%
  dplyr::group_by(X1) %>%
  dplyr::summarize(n_x1 = n())
X2_sum <- df %>%
  dplyr::group_by(X2) %>%
  dplyr::summarize(n_x2 = n())
final_summary <- X1_sum %>%
  # match country names: X1 in the first summary against X2 in the second;
  # a full join keeps countries that appear in only one of the two columns
  dplyr::full_join(X2_sum, by = c("X1" = "X2")) %>%
  # a missing count means zero occurrences in that column
  dplyr::mutate(n_sum = dplyr::coalesce(n_x1, 0L) + dplyr::coalesce(n_x2, 0L))
With R's aggregate() function, how can I specify a stopping condition for the function that is applied to each group?
For example, I have data-frame like this: "df"
Input Data frame
Note: Assuming each row in input data frame is denoting single ball played by a player in that match. So, by counting a number of rows can tell us the number of balls required.
And, I want my data frame like this one: Output data frame
My need is: How many balls are required to score 10 runs?
Currently, I am using this R code:
group_data <- aggregate(df$score, by=list(Category=df$player,df$match), FUN=sum,na.rm = TRUE)
With this code I cannot stop the grouping where I want; it only stops once all rows of a group have been summed. I don't want every row to be considered.
But how can I add a constraint like "stop grouping as soon as score >= 10"?
By putting this constraint, my sole purpose is to count the number of rows satisfying this condition.
Thanks in advance.
Here is one option using dplyr
library(dplyr)
df1 %>%
  group_by(match, player) %>%
  # keep rows up to and including the ball on which the running total first reaches 10
  filter(!lag(cumsum(score) >= 10, default = FALSE)) %>%
  summarise(score = sum(score), Count = n())
# A tibble: 2 x 4
# Groups: match [?]
# match player score Count
# <int> <int> <dbl> <int>
#1 1 30 12 2
#2 2 31 15 3
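Since the question started from aggregate(), the same idea can be sketched in base R: within each group, find the first position at which the running total reaches the target. The helper balls_to_target and its target argument are hypothetical names introduced here, and the data frame simply repeats the sample data of this answer.

```r
# Same sample data as in the dplyr answer.
df1 <- data.frame(match  = c(1L, 1L, 1L, 2L, 2L, 2L),
                  player = c(30L, 30L, 30L, 31L, 31L, 31L),
                  score  = c(6, 6, 6, 3, 6, 6))

# Number of balls until the cumulative score first reaches `target`;
# NA if the target is never reached within the group.
balls_to_target <- function(score, target = 10) {
  hit <- which(cumsum(score) >= target)
  if (length(hit) == 0) NA_integer_ else hit[1]
}

# Apply the helper per match and player.
result <- aggregate(score ~ match + player, data = df1, FUN = balls_to_target)
names(result)[3] <- "balls"
print(result)
```

This relies on the rows of each group already being in playing order, as the note in the question assumes.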
data
df1 <- structure(list(match = c(1L, 1L, 1L, 2L, 2L, 2L), player = c(30L,
30L, 30L, 31L, 31L, 31L), score = c(6, 6, 6, 3, 6, 6)), .Names = c("match",
"player", "score"), row.names = c(NA, -6L), class = "data.frame")