Calculating percent of categorical responses (with grouping) in R - r

I have the following dataframe:
IV Device1 Device2 Device3
Color Same Same Missing
Color Different Same Missing
Color Same Unique Missing
Shape Same Missing Same
Shape Different Same Different
Explanation: each IV (Independent Variable) is composed of several measurements (the ‘Color’ section is composed of 3 different measurements, while 'Shape' is composed of 2).
Each data point has one of 4 possible categorical values: Same/Different/Unique/Missing. 'Missing' means that there is no value for that measurement in the case of that device, while the other 3 values represent the existing result for that measurement.
Question: I want to calculate for each device the percent of times that it has a Same/Different/Unique value (thus generating 3 different percentages), out of the total number of values for that IV (not including cases where there is a ‘Missing’ value).
For example, device 2 would have the following percentages:
Color- 67% same, 0% different, 33% unique.
Shape- 100% same, 0% different, 0% unique.
Thank you!

This is a not a TIDY solution, but you can use this until someone else posts a better one:
# Replace all "Missing" with NAs
df[df == "Missing"] <- NA
# Create factor levels
df[,-1] <- lapply(df[,-1], function(x) {
factor(x, levels = c('Same', 'Different', 'Unique'))
})
# Custom function to calculate percent of categorical responses
custom <- function(x) {
y <- length(na.omit(x))
if(y > 0)
return(round((table(x)/y)*100))
else
return(rep(0, 3))
}
library(purrr)
# Split the dataframe on IV, remove the IV column and apply the custom function
Final <- df %>% split(df$IV) %>%
map(., function(x) {
x <- x[, -1]
t(sapply(x, custom))
})
Output
Final is a list of two data frames:
$Color
Same Different Unique
Device1 67 33 0
Device2 67 0 33
Device3 0 0 0
$Shape
Same Different Unique
Device1 50 50 0
Device2 100 0 0
Device3 50 50 0
Data
structure(list(IV = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("Color",
"Shape"), class = "factor"), Device1 = structure(c(1L, 2L, 1L,
1L, 2L), .Label = c("Same", "Different", "Unique"), class = "factor"),
Device2 = structure(c(1L, 1L, 3L, NA, 1L), .Label = c("Same",
"Different", "Unique"), class = "factor"), Device3 = structure(c(NA,
NA, NA, 1L, 2L), .Label = c("Same", "Different", "Unique"
), class = "factor")), .Names = c("IV", "Device1", "Device2",
"Device3"), row.names = c(NA, -5L), class = "data.frame")

Quick and dirty: First, replace your 'Missing' by 'NA' using your preferred method (sed, excel, etc), then you can use table on each of the columns to get the summary statistics:
myStats <- function(x){
table(factor(x, levels = c('Same', 'Different', 'Unique')))/sum(table(x))
}
apply(yourData, 2, myStats)
This will return the summary of what you want.

Related

Extract numbers greater than and keep only numbers in a vector [duplicate]

This question already has answers here:
Keep only rows if number is greater than... in specific column
(2 answers)
Closed 1 year ago.
This is an example data:
exp_data <- structure(list(Seq = c("AAAARVDS", "AAAARVDSSSAL",
"AAAARVDSRASDQ"), Change = structure(c(19L, 20L, 13L), .Label = c("",
"C[+58]", "C[+58], F[+1152]", "C[+58], F[+1152], L[+12], M[+12]",
"C[+58], L[+2909]", "L[+12]", "L[+370]", "L[+504]", "M[+12]",
"M[+1283]", "M[+1457]", "M[+1491]", "M[+16]", "M[+16], Y[+1013]",
"M[+16], Y[+1152]", "M[+16], Y[+762]", "M[+371]", "M[+386], Y[+12]",
"M[+486], W[+12]", "Y[+12]", "Y[+1240]", "Y[+1502]", "Y[+1988]",
"Y[+2918]"), class = "factor"), `Mass` = c(1869.943,
1048.459, 707.346), Size = structure(c(2L, 2L, 2L), .Label = c("Matt",
"Greg",
"Kieran"
), class = "factor"), `Number` = c(2L, 2L, 2L)), row.names = c(244L,
392L, 396L), class = "data.frame")
I would like to bring your attention to column name Change as this is the one from which I would like to extract number greater than 100 and keep all of the numbers from this column in a separate vector. This vector of numbers will be used for filtering another data frame.
Assuming the difference with the last one is creating a numeric vector of all values for matching columns:
with(exp_data, {
f <- \(.) regmatches(., gregexpr("[[:digit:]]+", .))
Change[vapply(f(Change), \(.) any(as.numeric(.) > 100), T)] |>
as.character() |>
f() |>
unlist() |>
as.numeric()
})
[1] 486 12
Consider changing the structure of the Change column to not contain multiple variables within one column. That is, consider making a letter and values column instead.

How can I apply case_when(mapply (adist, x, y) <= 3 ~ x, TRUE ~ y)) to columns of different length and order

Hi I have been trying for a while to match two large columns of names, several have different spellings etc... so far I have written some code to practice on a smaller dataset
examples%>% mutate(new_ID = case_when(mapply (adist, example_1 , example_2) <= 3 ~ example_1, TRUE ~ example_2))
This manages to create a new column with names the name from example 1 if it is less than an edit distance of 3 away. However, it does not give the name from example 2 if it does not meet this criteria which I need it to do.
This code also only works on the adjacent row of each column, whereas, I need it to work on a dataset which has two columns (one is larger- so cant be put in the same order).
Also needs to not try to match the NAs from the smaller column of names (there to fill it out to equal length to the other one).
Anyone know how to do something like this?
dput(head(examples))
structure(list(. = structure(c(4L, 3L, 2L, 1L, 5L), .Label = c("grarryfieldsred","harroldfrankknight", "sandramaymeres", "sheilaovensnew", "terrifrank"), class = "factor"), example_2 = structure(c(4L, 2L, 3L, 1L,
5L), .Label = c(" grarryfieldsred", "candramymars", "haroldfranrinight",
"sheilowansknew", "terryfrenk"), class = "factor")), row.names = c(NA,
5L), class = "data.frame")
The problem is that your columns have become factors rather than character vectors. When you try to combine two columns together with different factor levels, unexpected results can happen.
First convert your columns to character:
library(dplyr)
examples %>%
mutate(across(contains("example"),as.character)) %>%
mutate(new_ID = case_when(mapply (adist, example_1 , example_2) <= 3 ~ example_1,
TRUE ~ example_2))
# example_1 example_2 new_ID
#1 sheilaovensnew sheilowansknew sheilowansknew
#2 sandramaymeres candramymars candramymars
#3 harroldfrankknight haroldfranrinight harroldfrankknight
#4 grarryfieldsred grarryfieldsred grarryfieldsred
#5 terrifrank terryfrenk terrifrank
In your dput output, somehow the name of example_1 was changed. I ran this first:
names(examples)[1] <- "example_1"

Merging three factors so their dependent variable sums in R

Not sure if someone has answered this - I have searched, but so far nothing has worked for me. I have a very large dataset that I am trying to narrow. I need to combine three factors in my "PROG" variable ("Grad.2","Grad.3","Grad.H") so that they become a single variable ("Grad") where the dependent variable ("NUMBER") of each comparable set of values is summed.
ie.
YEAR = "92/93" AGE = "20-24" PROG = "Grad.2" NUMBER = "50"
YEAR = "92/93" AGE = "20-24" PROG = "Grad.3" NUMBER = "25"
YEAR = "92/93" AGE = "20-24" PROG = "Grad.H" NUMBER = "2"
turns into
YEAR = "92/93" AGE = "20-24" PROG = "Grad" NUMBER = "77"
I want to then drop all other factors for PROG so that I can compare the enrollment rates for Grad without worrying about the other factors (which I deal with separately). So my active independent variables are YEAR and AGE, while the dependent variable is NUMBER.
I hope this shows my data adequately:
structure(list
(YEAR = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("92/93", "93/94", "94/95", "95/96", "96/97",
"97/98", "98/99", "99/00", "00/01", "01/02", "02/03", "03/04",
"04/05", "05/06", "06/07", "07/08", "08/09", "09/10", "10/11",
"11/12", "12/13", "13/14", "14/15", "15/16"), class = "factor"),
AGE = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L), .Label = c("1-19",
"20-24", "25-30", "31-34", "35-39", "40+", "NR", "T.Age"), class = c("ordered",
"factor")),
PROG = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
19L, 19L, 19L), .Label = c("T.Prog", "Basic", "Career", "Grad.H",
"Grad2", "Grad3", "Grad2.Qual", "Grad3.Qual", "Health.Res",
"NoProg.Grad", "NoProg.Other", "NoProg.Und.Grad", "NoProg.NoCred",
"Other", "Post.Und.Grad", "Post.Career", "Pre-U", "Career.Qual",
"Und.Grad", "Und.Grad.Qual"), class = "factor"),
NUMBER = c(104997L,
347235L, 112644L, 38838L, 35949L, 50598L, 5484L, 104991L,
333807L, 76692L)), row.names = c(7936L, 7948L, 7960L, 7972L,
7984L, 7996L, 8008L, 10459L, 10471L, 10483L), class = "data.frame")
In terms of why I am using factors, I don't know how else I should enter the data. Factors made sense, and they were how R interpreted the raw data when I uploaded it.
I am working on the suggestions below. Not had success yet, but I am still learning how to get R to do what I want, and frequently mess up. Will respond to each of you as soon as I have a reasonable answer to give. (And once I stop banging my poor head on my desk... sigh)
If I understand your question correctly, this should do it.
I am assuming your data frame is named df:
library(tidyverse)
df %>%
mutate(PROG = ifelse(PROG %in% c("Grad2", "Grad3","Grad.H"),
"Grad",
NA)) %>% ##combines the 3 Grad variables into one
filter(!is.na(PROG)) %>% ##drops the other variables
group_by(YEAR, AGE) %>%
summarise(NUMBER = sum(NUMBER))
Slightly different approach: only take factors you want, drop the factor variable (because you want to treat them as a group) and sum up all NUMBER values while grouping by all other variables. df is your data.
aggregate(formula = NUMBER ~ .,
data = subset(df, PROG %in% c("Grad2", "Grad3", "Grad.H"), select = -PROG),
FUN = sum)
There are multiple ways to do this, but I agree with FScott that you are likely looking for the levels() function to rename the factor levels. Here is how I would do the second step of summing.
library(magrittr)
library(dplyr)
#do the renaming of the PROG variables here
#sum by PROG
df <- df %>%
group_by(PROG) %>% # you could add more variable names here to group by i.e. group_by(PROG, AGE, YEAR)
mutate(group.sum= sum(NUMBER))
This chunk will make a new column in df named group.sum with the sum between subsetted groups defined by the group_by() function
if you wanted to condense the data.frame further as where the individual values in NUMBER are replaced with group.sum, again there are many ways to do this but here is a simple way.
#condense df down
df$number <- df$group.sum
df <- df[,-ncol(df)]
df <- unique(df)
A side note: I wouldn't recommend doing the above chunk because you loose information in your data, and your data is more tidy just having the extra column group.sum
I think the levels() function is what you are looking for. From the manual:
## combine some levels
z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
z
levels(z) <- c("fruit", "veg", "fruit")
z
I named your data temp and ran this code. It works for me.
z<-gl(n=length(temp$PROG),k=2,labels=c("T.Prog", "Basic", "Career", "Grad.H",
"Grad2", "Grad3", "Grad2.Qual", "Grad3.Qual", "Health.Res",
"NoProg.Grad", "NoProg.Other", "NoProg.Und.Grad", "NoProg.NoCred",
"Other", "Post.Und.Grad", "Post.Career", "Pre-U", "Career.Qual",
"Und.Grad", "Und.Grad.Qual"))
z
levels(z)<-c(rep("Other",3),rep("Grad",5),rep("Other",12))
z
temp$PROG2<-factor(x=temp$PROG,levels=levels(temp$PROG),labels=z)
temp

Create function to count values across list of columns

R folks:
I have a dataframe with many sets of columns. Each set is a bank of survey items. I would like to count the number of columns in each set having a certain value. I wrote a function to do this but it results in a list of repeated values that is appended to my dataframe.
df<- structure(list(RespondentID = c(6764279930, 6779986023, 6760279439,
6759243066),
q1 = c(3L, 3L, 4L, 1L),
q2 = c(2L, 2L, 4L, 4L),
q3 = c(4L, 2L, 4L, 5L),
q0010_0004 = c(1L, 2L, 3L, 1L)),
.Names = c("RespondentID", "q1", "q2", "q3", "q4"),
row.names = c(NA, 4L), class = "data.frame")
group1<-c("q1","q2","q3","q4")
# Objective: Count number of ratings==4 for each row
# Make function that receives list of columns &
# then returns ONE column in dataframe with total # columns
# having certain value (in this case, 4)
countcol<-function(colgroup) {
s<-subset(df, select=c(colgroup)) #select only the columns designated by list
s$sum<-Reduce("+", apply(X=s,1,FUN=function(x) (sum(x==4, na.rm = TRUE)))) # count instances of value==4
s2<-subset(s,select=c(sum)) # return ONE column with result for each row
return(s2$sum) }
countcol(group1)
My function, countcol runs without errors but as stated above results in what appears to be a transposed list of results for each row. I would like to have ONE number for each row that indicates the count of values.
I attempted various apply functions here but could not prevail. Anyone have a tip?
Thanks!
rowSums can give you results OP is looking for. This return count of ratings==4 for each group.
rowSums(df[2:5]==4)
#1 2 3 4
#1 0 3 1
OR just part of function from OP can give answer.
apply(df[2:5], 1, function(x)(sum(x==4)))
#1 2 3 4
#1 0 3 1

Can I use %in% to search and match two columns?

I have a large dataframe and I have a vector to pull out terms of interest. for a previous project I was using:
a=data[data$rn %in% y, "Gene"]
To pull out information into a new vector. Now I have a another job Id like to do.
I have a large dataframe of 15 columns and >100000 rows. I want to search column 3 and 9 for the content in the vector and print this as a new dataframe.
To make this extra annoying the hit could be in v3 and not in v9 and visa versa.
Working example
I have striped the dataframe to 3 cols and few rows.
data <- structure(list(Gene = structure(c(1L, 5L, 3L, 2L, 4L), .Label = c("ibp","leuA", "pLeuDn_02", "repA", "repA1"), class = "factor"), LocusTag = structure(c(1L,2L, 5L, 3L, 4L), .Label = c("pBPS1_01", "pBPS1_02", "pleuBTgp4","pleuBTgp5", "pLeuDn_02"), class = "factor"), hit = structure(c(2L,4L, 3L, 1L, 5L), .Label = c("2-isopropylmalate synthase", "Ibp protein","ORF1", "repA1 protein", "replication-associated protein"), class = "factor")), .Names = c("Gene","LocusTag", "hit"), row.names = c(NA, 5L), class = "data.frame")
y <- c("ibp", "orf1")
First of all R is case sensitive so your example will not collect the third line but I guess you want that extracted. so you would have to change your y to
y <- c("ibp", "ORF1")
Ok from your example I try to see what you want to achieve I am not sure if this is really what you want but R knows the operator | as "or" so you could try something like:
new.data<-data[data$Gene %in% y|data$hit %in% y,]
if you only want to extract certain columns of your data set you can specify them behind the "," e.g.:
new.data<-data[data$Gene %in% y|data$hit %in% y, c("LocusTag","Gene")]

Resources