How many categories are there in a column in a list of data frames? - r

I have a list of data frames where the list index indicates where one family ends and another begins. I would like to know how many times each category occurs in the Statepath column of each family.
In the example below I have two families, and I am trying to get a table with the frequency of each Statepath category (233, 434, 323, etc.) in each family.
My input:
List <-
'$`1`
Chr Start End Family Statepath
1 187546286 187552094 father 233
3 108028534 108032021 father 434
1 4864403 4878685 mother 323
1 18898657 18904908 mother 322
2 460238 461771 offspring 322
3 108028534 108032021 offspring 434
$`2`
Chr Start End Family Statepath
1 71481449 71532983 father 535
2 74507242 74511395 father 233
2 181864092 181864690 mother 322
1 71481449 71532983 offspring 535
2 181864092 181864690 offspring 322
3 160057791 160113642 offspring 335'
Thus, my expected output Freq_statepath would look like:
Freq_statepath <- 'Statepath Family_1 Family_2
233 1 1
434 2 0
323 1 0
322 2 2
535 0 2
335 0 1'

I think you want something like this:
test <- list(data.frame(Statepath = c(233,434,323,322,322)),data.frame(Statepath = c(434,323,322,322)))
list_tables <- lapply(test, function(x) data.frame(table(x$Statepath)))
final_result <- Reduce(function(...) merge(..., by = "Var1", all = TRUE), list_tables)
final_result[is.na(final_result)] <- 0
> test
[[1]]
Statepath
1 233
2 434
3 323
4 322
5 322
[[2]]
Statepath
1 434
2 323
3 322
4 322
> final_result
Var1 Freq.x Freq.y
1 233 1 0
2 322 2 2
3 323 1 1
4 434 1 1
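Applied to data shaped like the question's, the same counting can also be sketched in base R by stacking the list into one data frame and cross-tabulating it. The Statepath values below are retyped from the question's example; the object names (families, stacked) and the Family_1/Family_2 labels are assumptions made for illustration.

```r
# Statepath values retyped from the question's two families
families <- list(
  `1` = data.frame(Statepath = c(233, 434, 323, 322, 322, 434)),
  `2` = data.frame(Statepath = c(535, 233, 322, 535, 322, 335))
)

# Stack the list into one data frame, tagging each row with its family
stacked <- do.call(rbind, lapply(names(families), function(id) {
  data.frame(Statepath = families[[id]]$Statepath,
             Family = paste0("Family_", id))
}))

# Cross-tabulate: one row per Statepath value, one column per family
Freq_statepath <- as.data.frame.matrix(table(stacked$Statepath, stacked$Family))
Freq_statepath
```

Because table() counts every Statepath/Family combination, categories that are absent from a family come out as 0 without a separate NA-replacement step. The rows are ordered by Statepath value rather than by first appearance.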


How to get 3 lists with no duplicates in a random sampling? (R)

I have done the first step:
how many persons have more than 1 point
how many persons have more than 3 points
how many persons have more than 6 points
My goal:
I need to have random samples (with no duplicates of persons)
of 3 persons that have more than 1 point
of 3 persons that have more than 3 points
of 3 persons that have more than 6 points
My dataset looks like this:
id person points
201 rt99 NA
201 rt99 3
201 rt99 2
202 kt 4
202 kt NA
202 kt NA
203 rr 4
203 rr NA
203 rr NA
204 jk 2
204 jk 2
204 jk NA
322 knm3 5
322 knm3 NA
322 knm3 3
343 kll2 2
343 kll2 1
343 kll2 5
344 kll NA
344 kll 7
344 kll 1
345 nn 7
345 nn NA
490 kk 1
490 kk NA
490 kk 2
491 ww 1
491 ww 1
489 tt 1
489 tt 1
325 ll 1
325 ll 1
325 ll NA
This is what I have tried so far. Here is example code for finding persons that have more than 1 point:
persons_filtered <- dataset %>%
group_by(person) %>%
dplyr::filter(sum(points, na.rm = T)>1) %>%
distinct(person) %>%
pull()
persons_filtered
more_than_1 <- sample(persons_filtered, size = 3)
Question:
How can I write this code better so that I end up with 3 lists of unique persons? (I need to prevent the same person from appearing in more than one list.)
Here's a tidyverse solution, where the sampling in the three categories of interest is made at the same time.
library(tidyverse)
dataset %>%
# Group by person
group_by(person) %>%
# Get points sum
summarize(sum_points = sum(points, na.rm = T)) %>%
# Classify the sum points into categories defined by breaks, (0-1], (1-3] ...
# I used Inf as the last break so that all point sums above 6 get classified as (6,Inf]
mutate(point_class = cut(sum_points, breaks = c(0,1,3,6,Inf))) %>%
# ungroup
ungroup() %>%
# group by point class
group_by(point_class) %>%
# Sample 3 rows per point_class
sample_n(size = 3) %>%
# Eliminate the sum_points column
select(-sum_points) %>%
# If you need this data in lists you can nest the results in the sampled_data column
nest(sampled_data= -point_class)
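The pipeline above puts each person into exactly one disjoint class. If the three groups should instead stay cumulative ("more than 1", "more than 3", "more than 6") while still never repeating a person, one possible base R sketch is to sample the strictest threshold first and remove the chosen persons from the later pools. The totals vector is retyped from the per-person point sums of the question's data; all object names here are assumptions for illustration.

```r
set.seed(42)  # only to make the sketch reproducible

# Per-person point sums, retyped from the question's dataset (NA ignored)
totals <- c(rt99 = 5, kt = 4, rr = 4, jk = 4, knm3 = 8, kll2 = 8,
            kll = 8, nn = 7, kk = 3, ww = 2, tt = 2, ll = 2)

thresholds <- c(6, 3, 1)   # strictest first, so the small pools are filled first
picked <- character(0)     # persons already used in an earlier sample
samples <- list()

for (th in thresholds) {
  # Candidates above the threshold that have not been sampled yet
  pool <- setdiff(names(totals)[totals > th], picked)
  chosen <- sample(pool, size = 3)
  samples[[paste0("more_than_", th)]] <- chosen
  picked <- c(picked, chosen)
}

samples
```

Sampling in this order matters: everyone with more than 6 points also has more than 1, so drawing the loosest threshold first could use up the few persons who qualify for the strictest one. sample() still fails if a pool ends up with fewer than 3 candidates.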

Sorting elements by column in R

I have some simple code for a matrix:
ind1 = which(macierz == 1, arr.ind = TRUE)
A fragment of the result is:
> ind1
row col
TCGA.CH.5737.01 53 1
TCGA.CH.5791.01 66 1
P03.1334.Tumor 322 1
P04.1790.Tumor 327 1
CPCG0340.F1 425 1
TCGA.CH.5737.01 53 2
TCGA.CH.5791.01 66 2
P03.1334.Tumor 322 2
P04.1790.Tumor 327 2
CPCG0340.F1 425 2
I would like to sort it alphabetically by the first column. How can I do this in R?
It looks as if ind1 is a matrix and the first column is the rownames, so you probably need something like ind1 <- ind1[order(rownames(ind1)),]
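A minimal sketch of that row-name case, using a small stand-in matrix rather than the real which(..., arr.ind = TRUE) result (the values are a subset retyped from the question; the name sorted is an assumption):

```r
# Stand-in for the which(..., arr.ind = TRUE) result: a matrix with row names
ind1 <- matrix(c(53, 66, 425, 53, 66, 425,
                 1, 1, 1, 2, 2, 2),
               ncol = 2,
               dimnames = list(c("TCGA.CH.5737.01", "TCGA.CH.5791.01", "CPCG0340.F1",
                                 "TCGA.CH.5737.01", "TCGA.CH.5791.01", "CPCG0340.F1"),
                               c("row", "col")))

# order() gives the permutation that sorts the row names; index the rows with it
sorted <- ind1[order(rownames(ind1)), ]
sorted
```

The trailing comma in ind1[order(...), ] keeps all columns, so the result is still a two-column matrix with its row names attached.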
You need (assuming your first column is called "label" and those are not rownames)
ind1[order(ind1$label),]
order() returns a vector of row indices that sorts the data frame alphabetically by that column. To make the example reproducible, I recreated your data frame like so:
ind1 <- data.frame ( label = c("TCGA.CH.5737.01", "TCGA.CH.5791.01",
"P03.1334.Tumor","P04.1790.Tumor", "CPCG0340.F1" , "TCGA.CH.5737.01",
"TCGA.CH.5791.01","P03.1334.Tumor", "P04.1790.Tumor", "CPCG0340.F1"),
row = c(53,66,322,327,425,53,66,322,327,425), col =
c(1,1,1,1,1,2,2,2,2,2),
stringsAsFactors = FALSE)
and the output is
> ind1[order(ind1$label),]
label row col
5 CPCG0340.F1 425 1
10 CPCG0340.F1 425 2
3 P03.1334.Tumor 322 1
8 P03.1334.Tumor 322 2
4 P04.1790.Tumor 327 1
9 P04.1790.Tumor 327 2
1 TCGA.CH.5737.01 53 1
6 TCGA.CH.5737.01 53 2
2 TCGA.CH.5791.01 66 1
7 TCGA.CH.5791.01 66 2
Hope that helps.
Regards, Umberto

how to select data based on a list from a split data frame and then recombine in R

I am trying to do the following. I have a dataset Test:
Item_ID Test_No Category Sharpness Weight Viscocity
132 1 3 14.93199362 94.37250417 579.4236727
676 1 4 44.58750591 70.03232054 1829.170727
699 2 5 89.02760079 54.30587287 1169.226863
850 3 6 30.74535903 83.84377678 707.2280513
951 4 237 67.79568019 51.10388484 917.6609965
1031 5 56 74.06697003 63.31274502 1981.17804
1175 4 354 98.9656142 97.7523884 100.7357981
1483 5 726 9.958040999 51.29537311 1222.910211
1529 7 800 64.11430235 65.69780939 573.8266137
1698 9 125 67.83105185 96.53847341 486.9620194
1748 9 1005 49.43602318 52.9139591 1881.740184
2005 9 28 26.89821508 82.12663209 1709.556135
2111 2 76 83.03593144 85.23622731 276.5088502
I want to split this data by Test_No and then compute the number of unique Category values per Test_No, as well as the median Category value. I chose to use split and sapply in the following way, but I am getting an error regarding a missing parenthesis. Is there anything wrong with my approach? Please find my code below:
function(CatRange){
c(Cat_Count = length(unique(CatRange$Category)), Median_Cat = median(unique(CatRange$Category), na.rm = TRUE) )
}
CatStat <- do.call(rbind,sapply(split(Test, Test$Test_No), function(ModRange)))
Appending to my question:
I want to display data containing the following information:
Test_No, Category, Median_Cat and Cat_Count
We can try with dplyr
library(dplyr)
Test %>%
group_by(Test_No) %>%
summarise(Cat_Count = n_distinct(Category),
Median_Cat = median(Category,na.rm = TRUE),
Category = toString(Category))
# Test_No Cat_Count Median_Cat Category
# <int> <int> <dbl> <chr>
#1 1 2 3.5 3, 4
#2 2 2 40.5 5, 76
#3 3 1 6.0 6
#4 4 2 295.5 237, 354
#5 5 2 391.0 56, 726
#6 7 1 800.0 800
#7 9 3 125.0 125, 1005, 28
Or if you prefer base R we can also try with aggregate
aggregate(Category~Test_No, Test, function(x) c(Cat_Count = length(unique(x)),
Median_Cat = median(x,na.rm = TRUE), Category = toString(x)))
As for the function you wrote, I think there are some syntax issues in it.
new_func <- function(CatRange){
c(Cat_Count = length(unique(CatRange$Category)),
Median_Cat = median(unique(CatRange$Category), na.rm = TRUE),
Category = toString(CatRange$Category))
}
data.frame(t(sapply(split(Test, Test$Test_No), new_func)))
# Cat_Count Median_Cat Category
#1 2 3.5 3, 4
#2 2 40.5 5, 76
#3 1 6 6
#4 2 295.5 237, 354
#5 2 391 56, 726
#7 1 800 800
#9 3 125 125, 1005, 28
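To connect this back to the original attempt: the function in the question was never assigned a name, and function(ModRange) inside sapply() has no body, which is likely what triggers the parse error. A minimal sketch of the split/apply/combine pattern the question was aiming for, using only the two relevant columns retyped from the question (the name cat_stats is an assumption):

```r
# Test_No and Category columns retyped from the question's dataset
Test <- data.frame(
  Test_No  = c(1, 1, 2, 3, 4, 5, 4, 5, 7, 9, 9, 9, 2),
  Category = c(3, 4, 5, 6, 237, 56, 354, 726, 800, 125, 1005, 28, 76)
)

# Give the function a name so it can be passed to lapply()
cat_stats <- function(CatRange) {
  c(Cat_Count  = length(unique(CatRange$Category)),
    Median_Cat = median(unique(CatRange$Category), na.rm = TRUE))
}

# Split by Test_No, compute the stats per piece, then bind the pieces row-wise
CatStat <- do.call(rbind, lapply(split(Test, Test$Test_No), cat_stats))
CatStat
```

lapply() is used here instead of sapply() because do.call(rbind, ...) expects a list with one element per group; the result is a numeric matrix whose row names are the Test_No values.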

R: sum rows from column A until conditioned value in column B

I'm pretty new to R and can't seem to figure out how to deal with what seems to be a relatively simple problem. I want to sum the rows of the column 'DURATION' per 'TRIAL_INDEX', but only those first rows where the values of 'X_POSITION' are increasing. In other words, I only want to sum the first run within a trial where X increases.
The first rows of a simplified dataframe:
TRIAL_INDEX DURATION X_POSITION
1 1 204 314.5
2 1 172 471.6
3 1 186 570.4
4 1 670 539.5
5 1 186 503.6
6 2 134 306.8
7 2 182 503.3
8 2 806 555.7
9 2 323 490.0
So, for TRIAL_INDEX 1, only the first three values of DURATION should be added (204+172+186), as this is where X has the highest value so far (going through the dataframe row by row).
The desired output should look something like:
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
I tried to use dplyr, to generate a new dataframe that can be merged with my original dataframe.
However, the code doesn't work, and also I'm not sure on how to make sure it's only adding the first rows per trial that have increasing values for X_POSITION.
FirstPassRT = dat %>%
group_by(TRIAL_INDEX) %>%
filter(dplyr::lag(dat$X_POSITION,1) > dat$X_POSITION) %>%
summarise(FIRST_PASS_TIME=sum(DURATION))
Any help and suggestions are greatly appreciated!
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
# find the rows that will be used for summing DURATION
idx = dt[, .I[1]:.I[min(.N, which(diff(X_POSITION) < 0), na.rm = T)], by = TRIAL_INDEX]$V1
# sum the DURATION for those rows
dt[idx, time := sum(DURATION), by = TRIAL_INDEX][, time := time[1], by = TRIAL_INDEX]
dt
# TRIAL_INDEX DURATION X_POSITION time
#1: 1 204 314.5 562
#2: 1 172 471.6 562
#3: 1 186 570.4 562
#4: 1 670 539.5 562
#5: 1 186 503.6 562
#6: 2 134 306.8 1122
#7: 2 182 503.3 1122
#8: 2 806 555.7 1122
#9: 2 323 490.0 1122
Here is something you can try with dplyr package:
library(dplyr);
dat %>% group_by(TRIAL_INDEX) %>%
mutate(IncLogic = X_POSITION > lag(X_POSITION, default = 0)) %>%
mutate(FIRST_PASS_TIME = sum(DURATION[IncLogic])) %>%
select(-IncLogic)
Source: local data frame [9 x 4]
Groups: TRIAL_INDEX [2]
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
(int) (int) (dbl) (int)
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
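Note that the dplyr snippet above adds up every row whose X_POSITION exceeds the previous row, so a trial in which X rises again after its first drop would also count those later rows. A base R sketch that stops at the first drop is shown here; the data frame is retyped from the question, and first_pass and per_trial are hypothetical names.

```r
dat <- data.frame(
  TRIAL_INDEX = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  DURATION    = c(204, 172, 186, 670, 186, 134, 182, 806, 323),
  X_POSITION  = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0)
)

# Sum durations up to (not including) the first row where x drops
first_pass <- function(duration, x) {
  dropped <- c(FALSE, diff(x) < 0)      # TRUE where x is lower than the row before
  sum(duration[cumsum(dropped) == 0])   # keep rows seen before any drop
}

# One value per trial, then copied back onto every row of that trial
per_trial <- sapply(split(dat, dat$TRIAL_INDEX),
                    function(g) first_pass(g$DURATION, g$X_POSITION))
dat$FIRST_PASS_TIME <- unname(per_trial[as.character(dat$TRIAL_INDEX)])
dat
```

For the question's data this gives 562 for trial 1 and 1122 for trial 2, the same as the answers above. The difference only shows up when X increases again later in a trial.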
If you want to summarize it down to one row per trial you can use summarize like this:
library(dplyr)
df <- data_frame(TRIAL_INDEX = c(1,1,1,1,1,2,2,2,2),
DURATION = c(204,172,186,670, 186,134,182,806, 323),
X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0))
res <- df %>%
group_by(TRIAL_INDEX) %>%
mutate(x.increasing = ifelse(X_POSITION > lag(X_POSITION), TRUE, FALSE),
x.increasing = ifelse(is.na(x.increasing), TRUE, x.increasing)) %>%
filter(x.increasing == TRUE) %>%
# sum DURATION (not X_POSITION) to get the first-pass time
summarize(FIRST_PASS_TIME = sum(DURATION))
res
#Source: local data frame [2 x 2]
#
# TRIAL_INDEX FIRST_PASS_TIME
# (dbl) (dbl)
#1 1 562
#2 2 1122

Count number of occurrences of a string in R under different conditions

I have a data frame called "data" with multiple columns, which looks like this:
Preferences Status Gender
8a 8b 9a Employed Female
10b 11c 9b Unemployed Male
11a 11c 8e Student Female
That is, each customer selected 3 preferences and specified other information such as Status and Gender. Each preference is given by a [number][letter] combination, and there are about 30 possible preferences. The possible preferences are:
8[a - c]
9[a - k]
10[a - d]
11[a - c]
12[a - i]
I want to count the number of occurrences of each preference, under certain conditions for the other columns - eg. for all women.
The output will ideally be a dataframe that looks like this:
Preference Female Male Employed Unemployed Student
8a 1034 934 234 495 203
8b 539 239 609 394 235
8c 124 395 684 94 283
9a 120 999 895 945 345
9b 978 385 596 923 986
etc.
What's the most efficient way to achieve this?
Thanks.
I am assuming you are starting with something that looks like this:
mydf <- structure(list(
Preferences = c("8a 8b 9a", "10b 11c 9b", "11a 11c 8e"),
Status = c("Employed", "Unemployed", "Student"),
Gender = c("Female", "Male", "Female")),
.Names = c("Preferences", "Status", "Gender"),
class = c("data.frame"), row.names = c(NA, -3L))
mydf
# Preferences Status Gender
# 1 8a 8b 9a Employed Female
# 2 10b 11c 9b Unemployed Male
# 3 11a 11c 8e Student Female
If that's the case, you need to "split" the "Preferences" column (by spaces), transform the data into a "long" form, and then reshape it to a wide form, tabulating while you do so.
With the right tools, this is pretty straightforward.
library(devtools)
library(data.table)
library(reshape2)
source_gist(11380733) # for `cSplit`
dcast.data.table( # Step 3--aggregate to wide form
melt( # Step 2--convert to long form
cSplit(mydf, "Preferences", " ", "long"), # Step 1--split "Preferences"
id.vars = "Preferences"),
Preferences ~ value, fun.aggregate = length)
# Preferences Employed Female Male Student Unemployed
# 1: 10b 0 0 1 0 1
# 2: 11a 0 1 0 1 0
# 3: 11c 0 1 1 1 1
# 4: 8a 1 1 0 0 0
# 5: 8b 1 1 0 0 0
# 6: 8e 0 1 0 1 0
# 7: 9a 1 1 0 0 0
# 8: 9b 0 0 1 0 1
I also tried a dplyr + tidyr approach, which looks like the following:
library(dplyr)
library(tidyr)
mydf %>%
separate(Preferences, c("P_1", "P_2", "P_3")) %>% ## splitting things
gather(Pref, Pvals, P_1:P_3) %>% # stack the preference columns
gather(Var, Val, Status:Gender) %>% # stack the status/gender columns
group_by(Pvals, Val) %>% # group by these new columns
summarise(count = n()) %>% # aggregate the numbers of each
spread(Val, count) # spread the values out
# Source: local data table [8 x 6]
# Groups:
#
# Pvals Employed Female Male Student Unemployed
# 1 10b NA NA 1 NA 1
# 2 11a NA 1 NA 1 NA
# 3 11c NA 1 1 1 1
# 4 8a 1 1 NA NA NA
# 5 8b 1 1 NA NA NA
# 6 8e NA 1 NA 1 NA
# 7 9a 1 1 NA NA NA
# 8 9b NA NA 1 NA 1
Both approaches are actually pretty quick. Test it with some better sample data than what you shared, like this:
preferences <- c(paste0(8, letters[1:3]),
paste0(9, letters[1:11]),
paste0(10, letters[1:4]),
paste0(11, letters[1:3]),
paste0(12, letters[1:9]))
set.seed(1)
nrow <- 10000
mydf <- data.frame(
Preferences = vapply(replicate(nrow,
sample(preferences, 3, FALSE),
FALSE),
function(x) paste(x, collapse = " "),
character(1L)),
Status = sample(c("Employed", "Unemployed", "Student"), nrow, TRUE),
Gender = sample(c("Male", "Female"), nrow, TRUE)
)
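For comparison, the same split, long form, tabulate idea can also be sketched in base R without extra packages. This uses the three-row example from the question under the hypothetical name small, so it does not overwrite the larger mydf generated above; long and counts are likewise assumed names.

```r
# The three-row example from the question
small <- data.frame(
  Preferences = c("8a 8b 9a", "10b 11c 9b", "11a 11c 8e"),
  Status = c("Employed", "Unemployed", "Student"),
  Gender = c("Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Step 1: split each Preferences string on spaces
prefs <- strsplit(small$Preferences, " ", fixed = TRUE)

# Step 2: long form, one row per (customer, preference) pair
n_each <- lengths(prefs)
long <- data.frame(
  Preference = unlist(prefs),
  Status = rep(small$Status, n_each),
  Gender = rep(small$Gender, n_each)
)

# Step 3: tabulate preference against each attribute and bind the tables side by side
counts <- cbind(table(long$Preference, long$Gender),
                table(long$Preference, long$Status))
counts
```

Both tables are built from the same Preference column, so their rows line up and cbind() can join them directly. Combinations that never occur are reported as 0 rather than NA.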
