Count number of occurrences of a string in R under different conditions - r

I have a dataframe called "data" with multiple columns, which looks like this:
Preferences Status Gender
8a 8b 9a Employed Female
10b 11c 9b Unemployed Male
11a 11c 8e Student Female
That is, each customer selected 3 preferences and specified other information such as Status and Gender. Each preference is given by a [number][letter] combination, and there are c. 30 possible preferences. The possible preferences are:
8[a - c]
9[a - k]
10[a - d]
11[a - c]
12[a - i]
I want to count the number of occurrences of each preference under certain conditions on the other columns, e.g. for all women.
The output will ideally be a dataframe that looks like this:
Preference Female Male Employed Unemployed Student
8a 1034 934 234 495 203
8b 539 239 609 394 235
8c 124 395 684 94 283
9a 120 999 895 945 345
9b 978 385 596 923 986
etc.
What's the most efficient way to achieve this?
Thanks.

I am assuming you are starting with something that looks like this:
mydf <- structure(list(
Preferences = c("8a 8b 9a", "10b 11c 9b", "11a 11c 8e"),
Status = c("Employed", "Unemployed", "Student"),
Gender = c("Female", "Male", "Female")),
.Names = c("Preferences", "Status", "Gender"),
class = c("data.frame"), row.names = c(NA, -3L))
mydf
# Preferences Status Gender
# 1 8a 8b 9a Employed Female
# 2 10b 11c 9b Unemployed Male
# 3 11a 11c 8e Student Female
If that's the case, you need to "split" the "Preferences" column (by spaces), transform the data into a "long" form, and then reshape it to a wide form, tabulating while you do so.
With the right tools, this is pretty straightforward.
library(devtools)
library(data.table)
library(reshape2)
source_gist(11380733) # for `cSplit`
dcast.data.table( # Step 3--aggregate to wide form
melt( # Step 2--convert to long form
cSplit(mydf, "Preferences", " ", "long"), # Step 1--split "Preferences"
id.vars = "Preferences"),
Preferences ~ value, fun.aggregate = length)
# Preferences Employed Female Male Student Unemployed
# 1: 10b 0 0 1 0 1
# 2: 11a 0 1 0 1 0
# 3: 11c 0 1 1 1 1
# 4: 8a 1 1 0 0 0
# 5: 8b 1 1 0 0 0
# 6: 8e 0 1 0 1 0
# 7: 9a 1 1 0 0 0
# 8: 9b 0 0 1 0 1
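The same three steps (split, reshape to long, tabulate) can also be sketched in base R without extra packages. This is only an illustration on the three-row sample; the names prefs, long, and counts are introduced here and are not part of the code above.

```r
# Sample data from the question (three rows)
mydf <- data.frame(
  Preferences = c("8a 8b 9a", "10b 11c 9b", "11a 11c 8e"),
  Status = c("Employed", "Unemployed", "Student"),
  Gender = c("Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Step 1: split each "Preferences" string on spaces
prefs <- strsplit(mydf$Preferences, " ", fixed = TRUE)

# Step 2: long form, one row per (customer, preference) pair
long <- data.frame(
  Preference = unlist(prefs),
  Status = rep(mydf$Status, lengths(prefs)),
  Gender = rep(mydf$Gender, lengths(prefs)),
  stringsAsFactors = FALSE
)

# Step 3: cross-tabulate against each condition and bind side by side
counts <- cbind(table(long$Preference, long$Gender),
                table(long$Preference, long$Status))
counts
```

Because table() is used for the counting, combinations that never occur show up as 0 rather than NA.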
I also tried a dplyr + tidyr approach, which looks like the following:
library(dplyr)
library(tidyr)
mydf %>%
separate(Preferences, c("P_1", "P_2", "P_3")) %>% ## splitting things
gather(Pref, Pvals, P_1:P_3) %>% # stack the preference columns
gather(Var, Val, Status:Gender) %>% # stack the status/gender columns
group_by(Pvals, Val) %>% # group by these new columns
summarise(count = n()) %>% # aggregate the numbers of each
spread(Val, count) # spread the values out
# Source: local data table [8 x 6]
# Groups:
#
# Pvals Employed Female Male Student Unemployed
# 1 10b NA NA 1 NA 1
# 2 11a NA 1 NA 1 NA
# 3 11c NA 1 1 1 1
# 4 8a 1 1 NA NA NA
# 5 8b 1 1 NA NA NA
# 6 8e NA 1 NA 1 NA
# 7 9a 1 1 NA NA NA
# 8 9b NA NA 1 NA 1
Both approaches are actually pretty quick. Test it with some better sample data than what you shared, like this:
preferences <- c(paste0(8, letters[1:3]),
paste0(9, letters[1:11]),
paste0(10, letters[1:4]),
paste0(11, letters[1:3]),
paste0(12, letters[1:9]))
set.seed(1)
nrow <- 10000
mydf <- data.frame(
Preferences = vapply(replicate(nrow,
sample(preferences, 3, FALSE),
FALSE),
function(x) paste(x, collapse = " "),
character(1L)),
Status = sample(c("Employed", "Unemployed", "Student"), nrow, TRUE),
Gender = sample(c("Male", "Female"), nrow, TRUE)
)

Related

Conditional filling NA rows with comparing non-NA labeled rows

I want to fill NA rows based on checking the differences between the closest non-NA labeled rows.
For instance
data <- data.frame(sd_value=c(34,33,34,37,36,45),
value=c(383,428,437,455,508,509),
label=c(c("bad",rep(NA,4),"unable")))
> data
sd_value value label
1 34 383 bad
2 33 428 <NA>
3 34 437 <NA>
4 37 455 <NA>
5 36 508 <NA>
6 45 509 unable
I want to decide how to fill the NA rows by checking the differences in sd_value and value relative to the nearest bad and unable rows.
To get the differences between consecutive rows we can do:
library(dplyr)
data%>%
mutate(diff_val=c(0,diff(value)), diff_sd_val=c(0,diff(sd_value)))
sd_value value label diff_val diff_sd_val
1 34 383 bad 0 0
2 33 428 <NA> 45 -1
3 34 437 <NA> 9 1
4 37 455 <NA> 18 3
5 36 508 <NA> 53 -1
6 45 509 unable 1 9
The condition I want to use to label the NA rows is:
if diff_val < 50 and diff_sd_val < 9, label the row with the last non-NA label; otherwise use the first non-NA label after the last NA row.
So that the expected output would be
sd_value value label diff_val diff_sd_val
1 34 383 bad 0 0
2 33 428 bad 45 -1
3 34 437 bad 9 1
4 37 455 bad 18 3
5 36 508 unable 53 -1
6 45 509 unable 1 9
The possible solution I cooked up so far:
custom_labelling <- function(x,y,label){
diff_sd_val<-c(NA,diff(x))
diff_val<-c(NA,diff(y))
label <- NA
for (i in 1:length(label)){
if(is.na(label[i])&diff_sd_val<9&diff_val<50){
label[i] <- label
}
else {
label <- label[i]
}
}
return(label)
}
which gives
data%>%
mutate(diff_val=c(0,diff(value)), diff_sd_val=c(0,diff(sd_value)))%>%
mutate(custom_label=custom_labelling(sd_value,value,label))
Error in mutate_impl(.data, dots) :
Evaluation error: missing value where TRUE/FALSE needed.
In addition: Warning message:
In if (is.na(label[i]) & diff_sd_val < 9 & diff_val < 50) { :
the condition has length > 1 and only the first element will be used
One option is to find the NA and non-NA indices and, based on the condition, select the closest label.
library(dplyr)
#Create a new dataframe with diff_val and diff_sd_val
data1 <- data%>% mutate(diff_val=c(0,diff(value)), diff_sd_val=c(0,diff(sd_value)))
#Get the NA indices
NA_inds <- which(is.na(data1$label))
#Get the non-NA indices
non_NA_inds <- setdiff(1:nrow(data1), NA_inds)
#For every NA index
for (i in NA_inds) {
  #Check the condition
  if (data1$diff_sd_val[i] < 9 && data1$diff_val[i] < 50)
    #Get the last non-NA label before row i
    data1$label[i] <- data1$label[max(non_NA_inds[non_NA_inds < i])]
  else
    #Get the first non-NA label after row i
    data1$label[i] <- data1$label[min(non_NA_inds[non_NA_inds > i])]
}
data1
# sd_value value label diff_val diff_sd_val
#1 34 383 bad 0 0
#2 33 428 bad 45 -1
#3 34 437 bad 9 1
#4 37 455 bad 18 3
#5 36 508 unable 53 -1
#6 45 509 unable 1 9
You can remove diff_val and diff_sd_val columns later if not needed.
We can also create a function
custom_label <- function(label, diff_val, diff_sd_val) {
  NA_inds <- which(is.na(label))
  non_NA_inds <- setdiff(seq_along(label), NA_inds)
  new_label <- label
  for (i in NA_inds) {
    if (diff_sd_val[i] < 9 && diff_val[i] < 50)
      # last non-NA label before position i
      new_label[i] <- label[max(non_NA_inds[non_NA_inds < i])]
    else
      # first non-NA label after position i
      new_label[i] <- label[min(non_NA_inds[non_NA_inds > i])]
  }
  return(new_label)
}
and then apply it
data%>%
mutate(diff_val = c(0, diff(value)),
diff_sd_val = c(0, diff(sd_value)),
new_label = custom_label(label, diff_val, diff_sd_val))
# sd_value value label diff_val diff_sd_val new_label
#1 34 383 bad 0 0 bad
#2 33 428 <NA> 45 -1 bad
#3 34 437 <NA> 9 1 bad
#4 37 455 <NA> 18 3 bad
#5 36 508 <NA> 53 -1 unable
#6 45 509 unable 1 9 unable
If we want to apply it by group, we can add a group_by statement (assuming here that the data has a grouping column named group):
data%>%
group_by(group) %>%
mutate(diff_val = c(0, diff(value)),
diff_sd_val = c(0, diff(sd_value)),
new_label = custom_label(label, diff_val, diff_sd_val))
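A loop-free variant of the same rule can be sketched in base R by filling the labels forward and backward and then picking between the two per row. The helper fill_forward is a hypothetical name introduced for this sketch, and the code assumes the first and last rows carry a label, as in the example.

```r
data <- data.frame(
  sd_value = c(34, 33, 34, 37, 36, 45),
  value = c(383, 428, 437, 455, 508, 509),
  label = c("bad", rep(NA, 4), "unable"),
  stringsAsFactors = FALSE
)

# Carry the last non-NA value forward (hypothetical helper)
fill_forward <- function(x) {
  idx <- cumsum(!is.na(x))          # 0 until the first non-NA value
  c(NA, x[!is.na(x)])[idx + 1]
}

diff_val <- c(0, diff(data$value))
diff_sd_val <- c(0, diff(data$sd_value))

prev_label <- fill_forward(data$label)            # last non-NA label above
next_label <- rev(fill_forward(rev(data$label)))  # first non-NA label below

# Small differences keep the previous label, otherwise take the next one
use_prev <- diff_val < 50 & diff_sd_val < 9
data$new_label <- ifelse(is.na(data$label),
                         ifelse(use_prev, prev_label, next_label),
                         data$label)
data$new_label
```

On the example data this should reproduce the expected labels (four times bad, then twice unable).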

Sorting elements by column in R

I have a simple piece of code for a matrix:
ind1=which(macierz==1,arr.ind = TRUE)
A fragment of the result is:
> ind1
row col
TCGA.CH.5737.01 53 1
TCGA.CH.5791.01 66 1
P03.1334.Tumor 322 1
P04.1790.Tumor 327 1
CPCG0340.F1 425 1
TCGA.CH.5737.01 53 2
TCGA.CH.5791.01 66 2
P03.1334.Tumor 322 2
P04.1790.Tumor 327 2
CPCG0340.F1 425 2
I would like to sort it alphabetically by the first column. How can I do this in R?
It looks as if ind1 is a matrix and the first column is the rownames, so you probably need something like ind1 <- ind1[order(rownames(ind1)),]
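As a small sketch of the rownames approach, the matrix below is a cut-down stand-in for the which(..., arr.ind = TRUE) result in the question (only three of the sample names, repeated for two columns). Adding the "col" column as a second key to order() is an optional extra that keeps repeated names in column order.

```r
# Small stand-in for the which(..., arr.ind = TRUE) result in the question
ind1 <- matrix(
  c(53, 66, 322, 53, 66, 322,   # "row" column
    1, 1, 1, 2, 2, 2),          # "col" column
  ncol = 2,
  dimnames = list(
    c("TCGA.CH.5737.01", "TCGA.CH.5791.01", "P03.1334.Tumor",
      "TCGA.CH.5737.01", "TCGA.CH.5791.01", "P03.1334.Tumor"),
    c("row", "col")
  )
)

# Order by row name, breaking ties by the "col" column;
# drop = FALSE keeps the result a matrix even if one row remains
sorted <- ind1[order(rownames(ind1), ind1[, "col"]), , drop = FALSE]
sorted
```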
If the first column is called "label" and is not the rownames, you need
ind1[order(ind1$label),]
order() returns a vector of row indices that puts the data frame in alphabetical order by that column. To make the example reproducible, I recreated your data frame like so:
ind1 <- data.frame ( label = c("TCGA.CH.5737.01", "TCGA.CH.5791.01",
"P03.1334.Tumor","P04.1790.Tumor", "CPCG0340.F1" , "TCGA.CH.5737.01",
"TCGA.CH.5791.01","P03.1334.Tumor", "P04.1790.Tumor", "CPCG0340.F1"),
row = c(53,66,322,327,425,53,66,322,327,425), col =
c(1,1,1,1,1,2,2,2,2,2),
stringsAsFactors = FALSE)
and the output is
> ind1[order(ind1$label),]
label row col
5 CPCG0340.F1 425 1
10 CPCG0340.F1 425 2
3 P03.1334.Tumor 322 1
8 P03.1334.Tumor 322 2
4 P04.1790.Tumor 327 1
9 P04.1790.Tumor 327 2
1 TCGA.CH.5737.01 53 1
6 TCGA.CH.5737.01 53 2
2 TCGA.CH.5791.01 66 1
7 TCGA.CH.5791.01 66 2
Hope that helps.
Regards, Umberto

How many categories are there in a column in a list of data frames?

I have a list of data frames where the index indicates where one family ends and another begins. I would like to know how many categories there are in statepath column in each family.
In my example below I have two families, and I am trying to get a table with the frequency of each statepath category (233, 434, 323, etc.) in each family.
My input:
List <-
'$`1`
Chr Start End Family Statepath
1 187546286 187552094 father 233
3 108028534 108032021 father 434
1 4864403 4878685 mother 323
1 18898657 18904908 mother 322
2 460238 461771 offspring 322
3 108028534 108032021 offspring 434
$`2`
Chr Start End Family Statepath
1 71481449 71532983 father 535
2 74507242 74511395 father 233
2 181864092 181864690 mother 322
1 71481449 71532983 offspring 535
2 181864092 181864690 offspring 322
3 160057791 160113642 offspring 335'
Thus, my expected output Freq_statepath would look like:
Freq_statepath <- 'Statepath Family_1 Family_2
233 1 1
434 2 0
323 1 0
322 2 2
535 0 2
335 0 1'
I think you want something like this:
test <- list(data.frame(Statepath = c(233,434,323,322,322)),data.frame(Statepath = c(434,323,322,322)))
list_tables <- lapply(test, function(x) data.frame(table(x$Statepath)))
final_result <- Reduce(function(...) merge(..., by.x = "Var1", by.y = "Var1", all.x = T, all.y = T), list_tables)
final_result[is.na(final_result)] <- 0
> test
[[1]]
Statepath
1 233
2 434
3 323
4 322
5 322
[[2]]
Statepath
1 434
2 323
3 322
4 322
> final_result
Var1 Freq.x Freq.y
1 233 1 0
2 322 2 2
3 323 1 1
4 434 1 1
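An alternative sketch, assuming the list elements are named by family as in the question: stack the data frames with a family tag and let table() do the cross-tabulation. The list families below holds only the Statepath values from the question's two families, and the column names Family_1 and Family_2 are built to match the expected output.

```r
# Two families as in the question, keeping only the Statepath column
families <- list(
  `1` = data.frame(Statepath = c(233, 434, 323, 322, 322, 434)),
  `2` = data.frame(Statepath = c(535, 233, 322, 535, 322, 335))
)

# Stack the data frames, tagging each row with its family
long <- do.call(rbind, Map(function(d, id) {
  data.frame(Family = paste0("Family_", id), Statepath = d$Statepath)
}, families, names(families)))

# Cross-tabulate: one row per Statepath, one column per family
freq <- table(long$Statepath, long$Family)
Freq_statepath <- as.data.frame.matrix(freq)
Freq_statepath
```

Categories missing from a family come out as 0 directly, so no NA replacement step is needed.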

R: sum rows from column A until conditioned value in column B

I'm pretty new to R and can't figure out how to deal with what seems to be a relatively simple problem. I want to sum the rows of the column 'DURATION' per 'TRIAL_INDEX', but only those first rows where the values of 'X_POSITION' are increasing. That is, I only want to sum the first round within a trial where X increases.
The first rows of a simplified dataframe:
TRIAL_INDEX DURATION X_POSITION
1 1 204 314.5
2 1 172 471.6
3 1 186 570.4
4 1 670 539.5
5 1 186 503.6
6 2 134 306.8
7 2 182 503.3
8 2 806 555.7
9 2 323 490.0
So, for TRIAL_INDEX 1, only the first three values of DURATION should be added (204+172+186), as this is where X has the highest value so far (going through the dataframe row by row).
The desired output should look something like:
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
I tried to use dplyr, to generate a new dataframe that can be merged with my original dataframe.
However, the code doesn't work, and also I'm not sure on how to make sure it's only adding the first rows per trial that have increasing values for X_POSITION.
FirstPassRT = dat %>%
group_by(TRIAL_INDEX) %>%
filter(dplyr::lag(dat$X_POSITION,1) > dat$X_POSITION) %>%
summarise(FIRST_PASS_TIME=sum(DURATION))
Any help and suggestions are greatly appreciated!
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
# find the rows that will be used for summing DURATION
idx = dt[, .I[1]:.I[min(.N, which(diff(X_POSITION) < 0), na.rm = T)], by = TRIAL_INDEX]$V1
# sum the DURATION for those rows
dt[idx, time := sum(DURATION), by = TRIAL_INDEX][, time := time[1], by = TRIAL_INDEX]
dt
# TRIAL_INDEX DURATION X_POSITION time
#1: 1 204 314.5 562
#2: 1 172 471.6 562
#3: 1 186 570.4 562
#4: 1 670 539.5 562
#5: 1 186 503.6 562
#6: 2 134 306.8 1122
#7: 2 182 503.3 1122
#8: 2 806 555.7 1122
#9: 2 323 490.0 1122
Here is something you can try with the dplyr package:
library(dplyr)
dat %>% group_by(TRIAL_INDEX) %>%
mutate(IncLogic = X_POSITION > lag(X_POSITION, default = 0)) %>%
mutate(FIRST_PASS_TIME = sum(DURATION[IncLogic])) %>%
select(-IncLogic)
Source: local data frame [9 x 4]
Groups: TRIAL_INDEX [2]
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
(int) (int) (dbl) (int)
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
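Note that summing every row where X_POSITION is larger than in the previous row would also count a later increase after X has already dropped. A base R sketch that stops at the first decrease is shown below; the function name first_pass_time and the third trial (made up so that X rises again after a drop) are assumptions added for illustration.

```r
dat <- data.frame(
  TRIAL_INDEX = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  DURATION = c(204, 172, 186, 670, 186, 134, 182, 806, 323, 10, 20, 30, 40),
  X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6,
                 306.8, 503.3, 555.7, 490.0,
                 100, 200, 150, 300)   # trial 3 is made up: X rises again
)

# Sum DURATION up to (not including) the first row where X decreases
first_pass_time <- function(duration, x) {
  decreased <- c(FALSE, diff(x) < 0)
  sum(duration[cumsum(decreased) == 0])
}

# One value per trial, then spread back onto every row of that trial
per_trial <- sapply(split(dat, dat$TRIAL_INDEX),
                    function(d) first_pass_time(d$DURATION, d$X_POSITION))
dat$FIRST_PASS_TIME <- unname(per_trial[as.character(dat$TRIAL_INDEX)])
dat
```

For the made-up trial 3 only the first two durations are summed, even though X increases again in its last row.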
If you want to summarize it down to one row per trial you can use summarize like this:
library(dplyr)
df <- data_frame(TRIAL_INDEX = c(1,1,1,1,1,2,2,2,2),
DURATION = c(204,172,186,670, 186,134,182,806, 323),
X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0))
res <- df %>%
group_by(TRIAL_INDEX) %>%
mutate(x.increasing = ifelse(X_POSITION > lag(X_POSITION), TRUE, FALSE),
x.increasing = ifelse(is.na(x.increasing), TRUE, x.increasing)) %>%
filter(x.increasing == TRUE) %>%
summarize(FIRST_PASS_TIME = sum(DURATION))
res
#Source: local data frame [2 x 2]
#
# TRIAL_INDEX FIRST_PASS_TIME
# (dbl) (dbl)
#1 1 562
#2 2 1122

Apply a rule to calculate sum of specific

Hi I have a data set like this.
Num C Pr Value Volume
111 aa Alen 111 222
111 aa Paul 100 200
222 vv Iva 444 555
222 vv John 333 444
I would like to filter the data by Num and add a new row containing the sums of the Value and Volume columns, keeping the Num and C information and putting Total in the Pr column. It should look like this:
Num C Pr Value Volume
222 vv Total 777 999
Could you suggest how to do this? I only need it for Num 222.
When I try the res command I end up with this result:
# Num C Pr Value Volume
1: 111 aa Alen 111 222
2: 111 aa Paul 100 200
3: 111 aa Total NA NA
4: 222 vv Iva 444 555
5: 222 vv John 333 444
6: 222 vv Total NA NA
What causes this?
The structure of my data is as follows:
'data.frame': 4 obs. of 5 variables:
$ Num : Factor w/ 2 levels "111","222": 1 1 2 2
$ C : Factor w/ 2 levels "aa","vv": 1 1 2 2
$ Pr : Factor w/ 4 levels "Alen","Iva","John",..: 1 4 2 3
$ Value : Factor w/ 4 levels "100","111","333",..: 2 1 4 3
$ Volume: Factor w/ 4 levels "200","222","444",..: 2 1 4 3
We could use data.table. We convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'Num', 'C' columns and specifying the columns to do the sum in .SDcols, we loop those columns using lapply, get the sum, and create the 'Pr' column. We can rbind the original dataset with the new summarised output ('DT1') and order the result based on 'Num'.
library(data.table)#v1.9.5+
DT1 <- setDT(df1)[,lapply(.SD, sum) , by = .(Num,C),
.SDcols=Value:Volume][,Pr:='Total'][]
rbind(df1, DT1)[order(Num)]
# Num C Pr Value Volume
#1: 111 aa Alen 111 222
#2: 111 aa Paul 100 200
#3: 111 aa Total 211 422
#4: 222 vv Iva 444 555
#5: 222 vv John 333 444
#6: 222 vv Total 777 999
This can be done using base R methods as well. We get the sum of 'Value', 'Volume' columns grouped by 'Num', 'C', using the formula method of aggregate, transform the output by creating the 'Pr' column, rbind with original dataset and order the output ('res') based on 'Num'.
res <- rbind(df1,transform(aggregate(.~Num+C, df1[-3], FUN=sum), Pr='Total'))
res[order(res$Num),]
# Num C Pr Value Volume
#1 111 aa Alen 111 222
#2 111 aa Paul 100 200
#5 111 aa Total 211 422
#3 222 vv Iva 444 555
#4 222 vv John 333 444
#6 222 vv Total 777 999
EDIT: Noticed that the OP mentioned filter. If this is for a single 'Num', we subset the data, and then do the aggregate, transform steps.
transform(aggregate(.~Num+C, subset(df1, Num==222)[-3], FUN=sum), Pr='Total')
# Num C Value Volume Pr
#1 222 vv 777 999 Total
Or we may not need aggregate. After subsetting the data, we convert 'Num' to 'factor', loop through the columns of the resulting dataset ('df2'), take the sum if the column is numeric or else the first element, and wrap the result with data.frame.
df2 <- transform(subset(df1, Num==222), Num=factor(Num))
data.frame(c(lapply(df2[-3], function(x) if(is.numeric(x))
sum(x) else x[1]), Pr='Total'))
# Num C Value Volume Pr
#1 222 vv 777 999 Total
data
df1 <- structure(list(Num = c(111L, 111L, 222L, 222L), C = c("aa", "aa",
"vv", "vv"), Pr = c("Alen", "Paul", "Iva", "John"), Value = c(111L,
100L, 444L, 333L), Volume = c(222L, 200L, 555L, 444L)), .Names = c("Num",
"C", "Pr", "Value", "Volume"), class = "data.frame",
row.names = c(NA, -4L))
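The str() output in the question shows Value and Volume stored as factors, and sum() is not defined for factors, which may be why the Total rows there came out as NA. A minimal sketch of converting such columns before summing follows; df_fac is a hypothetical reconstruction of the question's data with every column a factor.

```r
# Hypothetical reconstruction of the question's data, every column a factor
df_fac <- data.frame(
  Num = factor(c(111, 111, 222, 222)),
  C = factor(c("aa", "aa", "vv", "vv")),
  Pr = factor(c("Alen", "Paul", "Iva", "John")),
  Value = factor(c(111, 100, 444, 333)),
  Volume = factor(c(222, 200, 555, 444))
)

# Convert via character so we get the printed numbers, not the level codes
num_cols <- c("Value", "Volume")
df_fac[num_cols] <- lapply(df_fac[num_cols],
                           function(x) as.numeric(as.character(x)))

# Total row for Num 222 only
sub <- df_fac[df_fac$Num == "222", ]
total <- data.frame(Num = sub$Num[1], C = sub$C[1], Pr = "Total",
                    Value = sum(sub$Value), Volume = sum(sub$Volume))
total
```

Calling as.numeric() directly on a factor would return the internal level codes, which is why the conversion goes through as.character() first.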
Or using dplyr:
library(dplyr)
df1 %>%
filter(Num == 222) %>%
summarise(Value = sum(Value),
Volume = sum(Volume),
Pr = 'Total',
Num = Num[1],
C = C[1])
# Value Volume Pr Num C
# 1 777 999 Total 222 vv
where we first filter to keep only Num == 222, and then use summarise to obtain the sums and the values for Num and C. This assumes that:
You do not want to get the result for each unique Num (I select one here, you could select multiple). If you need this, use group_by.
There is only ever one C for every unique Num.
You can also use the dplyr package:
df %>%
filter(Num == 222) %>%
group_by(Num, C) %>%
summarise(
Pr = "Total"
, Value = sum(Value)
, Volume = sum(Volume)
) %>%
rbind(df, .)
# Num C Pr Value Volume
# 1 111 aa Alen 111 222
# 2 111 aa Paul 100 200
# 3 222 vv Iva 444 555
# 4 222 vv John 333 444
# 5 222 vv Total 777 999
If you want the total for each Num value, just comment out the filter line.
