Elegant way to write function - r

I have an input column (symbols) with more than 10000 rows that contains operator symbols and text values such as ("",">","<","-","****","inv","MOD","seen"), as listed in the code below. The column doesn't contain any numbers, only the values stated in the code.
What I would like to do is map those operator symbols ('<', '>' etc.) to two different sets of codes, 1) Operator_codes and 2) Value_codes, and have these two sets of codes as separate columns.
I already have working code, but it is not very efficient: as you can see, I repeat the same operation twice, once for Operator_codes and once for Value_codes. I am sure there must be a more efficient way to write this. I am new to R and not very familiar with other approaches.
oper_val_concepts = function(DF){
  operators_source = str_extract(.$symbols)
  operators_source = as.data.frame(operators_source)
  colnames(operators_source) <- c("Symbol")
  operator_list = c("",">","<","-","****","inv","MOD","seen")
  operator_codes = c(123L,14L,16L,13L,0L,0L,0L,0L)
  value_codes = c(14L,12L,32L,123L,16L,41L,116L,186L)
  operator_code_map = map2(operator_list,operator_codes,function(x,y) c(x,y)) %>%
    data.frame()
  value_code_map = map2(operator_list,value_codes,function(x,y) c(x,y)) %>%
    data.frame()
  operator_code_map = t(operator_code_map)
  value_code_map = t(value_code_map)
  colnames(operator_code_map) <- c("Symbol","Code")
  colnames(value_code_map) <- c("Symbol","Code")
  rownames(operator_code_map) = NULL
  rownames(value_code_map) = NULL
  dfm <- merge(x=operators_source,y=operator_code_map,by="Symbol",all.x = TRUE)
  dfm1 <- merge(x=operators_source,y=value_code_map,by="Symbol",all.x = TRUE)
}
t1 = oper_val_concepts(test)
dput command output is
structure(list(Symbols = structure(c(2L, 3L, 1L, 4L, 2L, 3L,
5L, 4L, 6L), .Label = c("****", "<", ">", "inv", "mod", "seen"
), class = "factor")), .Names = "Symbols", row.names = c(NA,-9L), class =
"data.frame")
I am expecting an output to be two columns in a dataframe as shown below.

As I understand it, you want a dataframe that acts as a lookup key (see key below). Once you have this, you can join the dataframe that contains only the symbols to the key dataframe.
library(dplyr)
df <- structure(list(Symbols = structure(c(2L, 3L, 1L, 4L, 2L, 3L,
5L, 4L, 6L), .Label = c("****", "<", ">", "inv", "mod", "seen"
), class = "factor")), .Names = "Symbols", row.names = c(NA, -9L), class = "data.frame")
# One row per symbol, with both codes side by side
key <- data.frame(Symbols = c("",">","<","-","****","inv","mod","seen"),
                  Operator_code_map = c(123L,14L,16L,13L,0L,0L,0L,0L),
                  value_code_map = c(14L,12L,32L,123L,16L,41L,116L,186L))
df %>% left_join(key, by = "Symbols")
output
  Symbols Operator_code_map value_code_map
1       <                16             32
2       >                14             12
3    ****                 0             16
4     inv                 0             41
5       <                16             32
6       >                14             12
7     mod                 0            116
8     inv                 0             41
9    seen                 0            186
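If you would rather not depend on dplyr, the same lookup can be sketched in base R with match(). This is only an illustration under assumptions: the column names operator_code and value_code are made up for the example, and Symbols is stored as plain character rather than as a factor.

```r
# Input symbols, same values as the dput in the question (character, not factor)
df <- data.frame(
  Symbols = c("<", ">", "****", "inv", "<", ">", "mod", "inv", "seen"),
  stringsAsFactors = FALSE
)

# Lookup table: one row per symbol, both codes side by side
# (operator_code and value_code are hypothetical column names)
key <- data.frame(
  Symbols       = c("", ">", "<", "-", "****", "inv", "mod", "seen"),
  operator_code = c(123L, 14L, 16L, 13L, 0L, 0L, 0L, 0L),
  value_code    = c(14L, 12L, 32L, 123L, 16L, 41L, 116L, 186L),
  stringsAsFactors = FALSE
)

# match() gives, for every input symbol, the row of key that holds its codes
idx <- match(df$Symbols, key$Symbols)
result <- cbind(df, key[idx, c("operator_code", "value_code")])
rownames(result) <- NULL
result
```

Symbols that are missing from key get NA in both code columns, which mirrors what a left join would do.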

Related

trying to summarize survey data for questions with 'select all that apply' using R

We have a survey with 'select all that apply' questions, so each answer is a single quoted string with the values separated by commas, e.g. "red, black,green".
Another question asks about income, so I also have a factor with the levels 'low', 'medium', 'high'.
I want to be able to answer questions such as: what percentage selected 'red', and then group that by income.
I can split the string with
'''df4 <- c("black,silver,green")'''
I can create a data frame with a timestamp and the split string with
'''t2 <- as.data.frame(c(df2[2],l2))'''
I am not able to understand how to do this for all rows at one time.
Here is a DPUT of the input:
structure(list(RespData = structure(1:2, .Label = c("1/20/2020",
"1/21/2020"), class = "factor"), CarColor = c("red,blue,green,yellow",
"black,silver,green")), row.names = c(NA, -2L), class = "data.frame")
and here is a DPUT of the desired output:
structure(list(RespData = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L), .Label = c("1/20/2020", "1/21/2020"), class = "factor"),
Cars = structure(c(3L, 1L, 2L, 4L, 5L, 6L, 2L), .Label = c("blue",
"green", "red", "yellow", "black", "silver"), class = "factor")), row.names = c(NA,
-7L), class = "data.frame")
Example of Function:
MySplitFunc <- function(ListIn) {
  # build an empty data frame and set the column names
  x1.all <- ListIn[0,]
  names(x1.all) <- c("ResponseTime", "Descriptive")
  # for each row build the data and combine to growing list
  for(x in 1:nrow(ListIn)) {
    #print(x)
    r1 <- ListIn[x,1]
    c1 <- strsplit(ListIn[x,2],",")
    x1 <- as.data.frame(c(r1,c1))
    # set the names and combine to all
    names(x1) <- c("ResponseTime", "Descriptive")
    x1.all <- rbind(x1.all,x1)
  }
  # strip the whitespace
  x1.all <- data.frame(lapply(x1.all, trimws), stringsAsFactors = TRUE)
  return(x1.all)
}
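One possible way to handle all rows at once, without growing a data frame inside a loop, is to let strsplit() work on the whole column and repeat each response date once per selected colour. The sketch below is an illustration only: it uses the two example rows from the question, stores RespData as character, and the names parts, long and pct_red are made up for the example.

```r
# Example input from the question (RespData kept as character for simplicity)
df <- data.frame(
  RespData = c("1/20/2020", "1/21/2020"),
  CarColor = c("red,blue,green,yellow", "black,silver,green"),
  stringsAsFactors = FALSE
)

# strsplit() is vectorised: it returns one character vector per row
parts <- strsplit(df$CarColor, ",", fixed = TRUE)

# Repeat each date once per colour it was selected with, then flatten the colours
long <- data.frame(
  RespData = rep(df$RespData, times = lengths(parts)),
  Cars     = trimws(unlist(parts)),
  stringsAsFactors = FALSE
)

# Percentage of respondents whose selection includes "red"
selected_red <- vapply(parts, function(p) "red" %in% trimws(p), logical(1))
pct_red <- mean(selected_red) * 100
```

Once an income column is present, the same selected_red vector could be averaged per income level, for instance with tapply().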

Casting dataframe gives error R

Here is the dataframe df on which I'm trying to do a pivot using cast function
dput(df)
structure(list(Val = c(1L, 2L, 2L, 5L, 2L, 5L), `Perm 1` = structure(c(1L,
2L, 3L, 3L, 3L, 3L), .Label = c("Blue", "green", "yellow"
), class = "factor"), `Perm 2` = structure(c(1L, 2L, 2L, 3L,
3L, 3L), .Label = c("Blue", "green", "yellow"), class = "factor"),
`Perm 3` = structure(c(1L, 2L, 2L, 2L, 3L, 3L), .Label = c("Blue",
"green", "yellow"), class = "factor")), .Names = c("Val",
"Perm 1", "Perm 2", "Perm 3"), row.names = c(NA, 6L), class = "data.frame")
And expecting the data after pivot
Blue 1 1 1
green 2 4 9
yellow 14 12 7
I tried doing
cast(df, df$Val ~ df$`Perm 1`+df$`Perm 2`+df$`Perm 3`, sum, value = 'Val')
But this gives error
Error: Casting formula contains variables not found in molten data: df$Val, df$`Perm1`, df$`Perm2`
How can I pivot the data so that I get the desired output?
P.S- The dataframe DF has around 36 column but for simplicity I took only 3 columns.
Any suggestion will be appreciated.
Thank you
Domnick
It appears you want to sum Val, grouped by the levels of each permutation column in your dataset. Although hacky, I think this works for your problem. First we create a function that performs the summation for a single column, using group_by_at so that the column can be passed in as a string. Link for more information: Group by multiple columns in dplyr, using string vector input
library(tidyverse)
sum_f <- function(col, df) {
  # Sum Val within each level of the column named by col
  df <- df %>%
    group_by_at(col) %>%
    summarise(Val = sum(Val)) %>%
    ungroup()
  # Keep only the column of sums
  df[,2]
}
We then apply it to each column of your dataset using lapply and bind the summations together. Note that the column names in the dput contain a space.
bind_cols(lapply(c('Perm 1', 'Perm 2', 'Perm 3'), sum_f, df))
This gives the sums shown in the question.
Caveats: You need to know the names of the columns you have to sum over for this to work. Also, each column needs to have the same levels of your permutations, i.e. Blue, green, yellow, because the results are bound side by side in level order.
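For comparison, the same per-column sums can be sketched in base R with tapply(), picking up the Perm columns by pattern so the names do not have to be typed out. This is a sketch under assumptions: it rebuilds the example data from the dput, and it relies on every Perm column sharing the same levels, as noted in the caveats.

```r
# Example data from the question; check.names = FALSE keeps the spaces in the names
df <- data.frame(
  Val      = c(1L, 2L, 2L, 5L, 2L, 5L),
  "Perm 1" = c("Blue", "green", "yellow", "yellow", "yellow", "yellow"),
  "Perm 2" = c("Blue", "green", "green", "yellow", "yellow", "yellow"),
  "Perm 3" = c("Blue", "green", "green", "green", "yellow", "yellow"),
  check.names = FALSE, stringsAsFactors = TRUE
)

# Select every permutation column by its name pattern
perm_cols <- grep("^Perm", names(df), value = TRUE)

# For each column, sum Val within each level; sapply() binds the results
# into a matrix with one row per level and one column per Perm column
pivot <- sapply(df[perm_cols], function(col) tapply(df$Val, col, sum))
pivot
```

With 36 columns the only thing that changes is the length of perm_cols.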

how to find similar strings within a data

My data looks like this
df<- structure(list(A = structure(c(7L, 6L, 5L, 4L, 3L, 2L, 1L, 1L,
1L), .Label = c("", "P42356;Q8N8J0;A4QPH2", "P67809;Q9Y2T7",
"Q08554", "Q13835", "Q5T749", "Q9NZT1"), class = "factor"), B = structure(c(9L,
8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("P62861", "P62906",
"P62979;P0CG47;P0CG48", "P63241;Q6IS14", "Q02413", "Q07955",
"Q08554", "Q5T749", "Q9UQ80"), class = "factor"), C = structure(c(9L,
8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("", "P62807;O60814;P57053;Q99879;Q99877;Q93079;Q5QNW6;P58876",
"P63241;Q6IS14", "Q02413", "Q16658", "Q5T750", "Q6P1N9", "Q99497",
"Q9UQ80"), class = "factor")), .Names = c("A", "B", "C"), class = "data.frame", row.names = c(NA,
-9L))
I want to count how many elements are in each column, including those that are separated by a ";". In this case the first column has 9 elements, the second column has 12 and the third column has 16. Then I want to check how many times an element is repeated in other columns, for example:
string   number of times   columns
Q5T749   2                 1,2
Finally, I want to remove the strings which are seen more than once from the df.
One way to approach this is to start by re-organizing the data into a form that is more convenient to work with. The tidyr and dplyr packages are useful for that sort of thing.
library(tidyr)
df$index <- 1:nrow(df)
df <- gather(df, key = 'variable', value = 'value', -index, na.rm = TRUE)
df <- separate(df, "value", into = paste("x", 1:(1 + max(nchar(gsub("[^;]", "", df$value)))), sep = ""), sep = ";", fill = "right")
df <- gather(df, "which", "value", -index, -variable)
Once you do that counting each element is easy:
addmargins(t(table(df[, c("variable", "value")])), margin = 2)
Dropping duplicates is also easy (note that this keeps the first occurrence of each value).
df <- df[!duplicated(df$value), ]
If you really want to put the data back into the original form, you can (though I don't recommend it).
df <- spread(df, key = "variable", value = "value")
library(dplyr)
summarize(group_by(df, index),
A = paste(na.omit(A), collapse = ";"),
B = paste(na.omit(B), collapse = ";"),
C = paste(na.omit(C), collapse = ";"))
For the count of elements in each column, use this:
sapply(df,function(x) length(unlist(sapply(strsplit(as.character(x),"\\s+"),strsplit,split=";"))))
To count the repetitions, use this:
words <- lapply(df,function(x) unlist(sapply(strsplit(as.character(x),"\\s+"),strsplit,split=";")))
dup_table <- table(unlist(words))
dup_table
Here is a crude way to remove the repeated strings:
pat <- names(dup_table)[unname(dup_table)>1]
for(i in pat)
  df <- as.data.frame.list(lapply(df,function(x) gsub(pattern = i,replacement = "",x)))
There is one problem, though: it replaces all occurrences of each pattern, so no copy of a repeated string is kept.
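The three requests (count per column, list which columns a string appears in, drop strings seen in more than one column) can also be sketched together in base R. This is an illustration only: the data frame is rebuilt from the dput with character columns, and the names tokens, long, n_cols, which_cols, repeated and cleaned are made up for the example.

```r
# Example data from the question, stored as character columns
df <- data.frame(
  A = c("Q9NZT1", "Q5T749", "Q13835", "Q08554", "P67809;Q9Y2T7",
        "P42356;Q8N8J0;A4QPH2", "", "", ""),
  B = c("Q9UQ80", "Q5T749", "Q08554", "Q07955", "Q02413", "P63241;Q6IS14",
        "P62979;P0CG47;P0CG48", "P62906", "P62861"),
  C = c("Q9UQ80", "Q99497", "Q6P1N9", "Q5T750", "Q16658", "Q02413", "P63241;Q6IS14",
        "P62807;O60814;P57053;Q99879;Q99877;Q93079;Q5QNW6;P58876", ""),
  stringsAsFactors = FALSE
)

# Split every cell on ";" (empty cells contribute nothing)
tokens <- lapply(df, function(x) unlist(strsplit(x, ";", fixed = TRUE)))
lengths(tokens)   # number of elements per column

# Long form: one row per element, with the index of the column it came from
long <- data.frame(
  string = unlist(tokens, use.names = FALSE),
  column = rep(seq_along(tokens), times = lengths(tokens)),
  stringsAsFactors = FALSE
)

# In how many columns, and in which ones, does each string occur?
n_cols     <- tapply(long$column, long$string, function(i) length(unique(i)))
which_cols <- tapply(long$column, long$string,
                     function(i) paste(sort(unique(i)), collapse = ","))

# Strings seen in more than one column, and the per-column elements without them
repeated <- names(n_cols)[n_cols > 1]
cleaned  <- lapply(tokens, function(tk) tk[!tk %in% repeated])
```

Because whole strings are compared with %in% rather than matched as regular expressions, a string is never removed just because it is part of a longer one.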

Add characters to existing cells with condition R

I have the following table:
structure(list(Id = structure(c(1L, 1L, 2L, 2L, 1L, 3L, 3L, 3L
), .Label = c("a", "b", "c"), class = "factor"), stops = c(1,
1, 1, 1, 1, 2, 2, 2)), .Names = c("Id", "stops"), row.names = c(NA,
-8L), class = "data.frame")
I would like to append a suffix to $stops when the stop did not change but the $Id did.
For example, I would like to get:
structure(list(Id = structure(c(1L, 1L, 2L, 2L, 1L, 3L, 3L, 3L
), .Label = c("a", "b", "c"), class = "factor"), stops = structure(c(1L,
1L, 2L, 2L, 3L, 4L, 4L, 4L), .Label = c("1", "1-1", "1-2", "2"
), class = "factor")), .Names = c("Id", "stops"), row.names = c(NA,
-8L), class = "data.frame")
I only want to do this if the Id is different from the previous one and the stop is the same as the previous one.
I tried with mutate(), but I am still quite far from getting something that works.
Here's a loopless attempt using data.table
library(data.table)
setDT(df)[, `:=`(stops = as.character(stops), Idindx = rleid(Id))]
indx <- unique(df, by = "Idindx")[, counter := (1:.N) - 1L, by = rleid(stops)]
df[indx[counter > 0], stops := paste(stops, i.counter, sep = "-"), on = "Idindx"]
# Id stops Idindx
# 1: a 1 1
# 2: a 1 1
# 3: b 1-1 2
# 4: b 1-1 2
# 5: a 1-2 3
# 6: c 2 4
# 7: c 2 4
# 8: c 2 4
The first step is to create a unique index for each run of Id (as the Ids themselves aren't unique) and convert stops to character (per your desired output).
Then, operating on the unique indexes, number the Id runs within each run of identical stops and join that counter back to the original data.
You could write a loop to solve your problem:
# Original data
data <- structure(list(Id = structure(c(1L, 1L, 2L, 2L, 1L, 3L, 3L, 3L
), .Label = c("a", "b", "c"), class = "factor"), stops = c(1,
1, 1, 1, 1, 2, 2, 2)), .Names = c("Id", "stops"), row.names = c(NA,
-8L), class = "data.frame")
# Add new column, which will be converted in the following loop
data$stops_new <- as.character(data$stops)
new <- 1
for(i in 2:nrow(data)) {
  # Convert values of stops_new, if your specified conditions appear
  if(data$Id[i] != data$Id[i - 1] && data$stops[i] == data$stops[i - 1]) {
    data$stops_new[i] <- paste(data$stops_new[i], "-", new, sep = "")
    # Repeat the conversion for all values with the same ID and stop-value
    j <- i + 1
    # Check j against nrow(data) first, so the loop ends cleanly at the last row
    while(j <= nrow(data) && data$Id[i] == data$Id[j] && data$stops[i] == data$stops[j]) {
      data$stops_new[j] <- paste(data$stops[i], "-", new, sep = "")
      j <- j + 1
    }
    new <- new + 1
  }
}
data
This is a base R solution (the data frame is assumed to be called dat).
Create indicators showing whether Id has changed (id.ind) and whether stops has changed (stops.ind) from the previous row. By convention these indicators are set to 0, i.e. no change, for the first row:
stops.ind <- c(0, diff(dat$stops))
id.ind <- c(0, diff(as.numeric(dat$Id)))
Create the new stops vector:
stops <- new.stops <- dat$stops
Then check row by row whether a) Id has changed but stops has not, or b) neither has changed from the previous row. In case a), increase k by one and append "-k" to the stops value; in case b), reuse the previous value of new.stops:
k <- 0
for(i in 2 : nrow(dat)){
  if(id.ind[i] != 0 & stops.ind[i] == 0){
    k <- k + 1
    new.stops[i] <- paste0(stops[i], "-", k)
  }
  if(id.ind[i] == 0 & stops.ind[i] == 0)
    new.stops[i] <- new.stops[i - 1]
}
new.stops
# [1] "1" "1" "1-1" "1-1" "1-2" "2" "2" "2"
new.dat <- data.frame(Id = dat$Id, stops = new.stops)
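The same rule can also be sketched without an explicit loop, using cumsum() and ave(). This is an illustration under assumptions: Id is stored as character, the suffix counter restarts whenever the stop changes (as in the data.table answer), and the names id_changed, stop_changed, needs_suffix, stop_run, suffix and new_stops are made up for the example.

```r
# Example data from the question (Id as character for simple comparisons)
dat <- data.frame(
  Id    = c("a", "a", "b", "b", "a", "c", "c", "c"),
  stops = c(1, 1, 1, 1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)
n <- nrow(dat)

# Did Id / stops change compared with the previous row? (first row: no change)
id_changed   <- c(FALSE, dat$Id[-1] != dat$Id[-n])
stop_changed <- c(FALSE, dat$stops[-1] != dat$stops[-n])

# A new suffix starts where the Id changes while the stop stays the same
needs_suffix <- id_changed & !stop_changed

# Label each run of identical stops, then count suffix starts within each run;
# rows that follow a start inherit its counter through the cumulative sum
stop_run <- cumsum(stop_changed)
suffix   <- ave(as.integer(needs_suffix), stop_run, FUN = cumsum)

# Rows with counter 0 keep the plain stop value
new_stops <- ifelse(suffix > 0,
                    paste0(dat$stops, "-", suffix),
                    as.character(dat$stops))
new_stops
```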

Errorbars in r of two groups ggplot2

I'd like to plot standard deviations of the mean(z)/mean(b) which are grouped by two factors $angle and $treatment:
z =
Tracer  angle  treatment
60      0      S
51      0      S
56.415  15     X
56.410  15     X
b =
Tracer  angle  treatment
21      0      S
15      0      S
16.415  15     X
26.410  15     X
So far I've calculated the mean for each variable based on angle and treatment:
aggmeanz <-aggregate(z$Tracer, list(angle=z$angle,treatment=z$treatment), FUN=mean)
aggmeanb <-aggregate(b$Tracer, list(angle=b$angle,treatment=b$treatment), FUN=mean)
It now looks like this:
aggmeanz
angle treatment x
1 0 S 0.09088021
2 30 S 0.18463353
3 60 S 0.08784315
4 80 S 0.09127198
5 90 S 0.12679296
6 0 X 2.68670392
7 15 X 0.50440692
8 30 X 0.83564470
9 60 X 0.52856956
10 80 X 0.63220093
11 90 X 1.70123025
But when I come to plot it, I can't quite get what I'm after
ggplot(aggmeanz, aes(x=aggmeanz$angle,y=aggmeanz$x/aggmeanb$x, colour=treatment)) +
geom_bar(position=position_dodge(), stat="identity") +
geom_errorbar(aes(ymin=0.1, ymax=1.15),
width=.2,
position=position_dodge(.9)) +
theme(panel.grid.minor = element_blank()) +
theme_bw()
EDIT:
dput(aggmeanz)
structure(list(time = structure(c(1L, 3L, 4L, 5L, 6L, 1L, 2L,
3L, 4L, 5L, 6L), .Label = c("0", "15", "30", "60", "80", "90"
), class = "factor"), treatment = structure(c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("S", "X"), class = "factor"),
x = c(56.0841582902523, 61.2014237854156, 42.9900742785269,
42.4688447229277, 41.3354173870287, 45.7164231791512, 55.3943182966382,
55.0574951462903, 48.1575625699563, 60.5527200655174, 45.8412287451211
)), .Names = c("time", "treatment", "x"), row.names = c(NA,
-11L), class = "data.frame")
> dput(aggmeanb)
structure(list(time = structure(c(1L, 3L, 4L, 5L, 6L, 1L, 2L,
3L, 4L, 5L, 6L), .Label = c("0", "15", "30", "60", "80", "90"
), class = "factor"), treatment = structure(c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("S", "X"), class = "factor"),
x = c(56.26325504249, 61.751655279608, 43.1687113436753,
43.4147408285209, 41.9113698082799, 46.2800894420131, 55.1550995335947,
54.7531592595068, 47.3280215294235, 62.4629068516043, 44.2590192583692
)), .Names = c("time", "treatment", "x"), row.names = c(NA,
-11L), class = "data.frame")
EDIT 2: I calculated the standard dev as follows:
aggstdevz <-aggregate(z$Tracer, list(angle=z$angle,treatment=z$treatment), FUN=std)
aggstdevb <-aggregate(b$Tracer, list(angle=b$angle,treatment=b$treatment), FUN=std)
Any thoughts would be much appreciated,
Cheers
As others have noted, you'll need to join the two dataframes together. There are also some little quirks in the dput data you showed, so I've renamed some columns to make sure that they join appropriately and match what you've attempted. NOTE: You'll need to name the two means differently so that they don't get merged together or cause conflicts.
names(aggmeanb)[names(aggmeanb) == "x"] = "mean_b"
names(aggmeanb)[names(aggmeanb) == "time"] = "angle"
names(aggmeanz)[names(aggmeanz) == "x"] = "mean_z"
names(aggmeanz)[names(aggmeanz) == "time"] = "angle"
library(plyr)  # provides join(), which joins on the shared columns angle and treatment
joined_data = join(aggmeanb, aggmeanz)
joined_data$divmean = joined_data$mean_b/joined_data$mean_z
> head(joined_data)
angle treatment mean_b mean_z divmean
1 0 S 56.26326 56.08416 1.003193
2 30 S 61.75166 61.20142 1.008991
3 60 S 43.16871 42.99007 1.004155
4 80 S 43.41474 42.46884 1.022273
5 90 S 41.91137 41.33542 1.013934
6 0 X 46.28009 45.71642 1.012330
ggplot(joined_data, aes(factor(angle), divmean)) +
geom_boxplot() +
theme(panel.grid.minor = element_blank()) +
theme_bw()
It might be that the data you've included is just part of your real data set, but as it stands there's only one data point per angle-treatment group. With a fuller dataset, you can try something like:
ggplot(joined_data, aes(factor(angle), divmean, group = treatment)) +
  geom_boxplot() +
  facet_grid(.~angle, scales = "free_x")
That will split the boxes by angle and group them by treatment within each panel.
Think about the problem in two steps:
1) Create a data frame (say data) which contains all the information you would like to visualize. In this case, this seems to be the two factors (angle, treatment), the mean group differences (say dif) and standard errors (say ste).
2) Visualize this information.
Step 2) will be easy. This should probably produce something very similar to your sketch.
ggplot(data, aes(x=angle, y=dif, colour=treatment)) +
geom_point(position=position_dodge(0.1)) +
geom_errorbar(aes(ymin=dif-ste, ymax=dif+ste), width=.1, position=position_dodge(0.1)) +
theme_bw()
However, at this point, you do not provide enough information to get help with Step 1. Try to include code which produces your original data (or the type of data you have) instead of copy-pasting chunks of your data output or pasting the aggregated data which lacks standard errors.
Combining your two aggregated data frames and generating random numbers as stand-ins for the standard errors produces the graph below:
#I imported your two aggregated data frames from your dput output.
data <- cbind(aggmeanb, aggmeanz$x, rnorm(11))
# Column 3 comes from aggmeanb and column 4 from aggmeanz, so name them in that order
names(data) <- c("angle", "treatment", "meanb", "meanz", "ste")
data$dif <- data$meanz - data$meanb
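Step 1 could look roughly like the sketch below, which starts from raw data in the shape shown in the question and computes a mean and a standard error per angle-treatment group before forming the ratio mean(z)/mean(b). Several things here are assumptions rather than part of the question: the helper group_stats is hypothetical, the standard error is taken as sd/sqrt(n), and the error of the ratio uses a first-order propagation formula that treats z and b as independent.

```r
# The four example rows of z and b shown in the question
z <- data.frame(Tracer = c(60, 51, 56.415, 56.410),
                angle = c(0, 0, 15, 15),
                treatment = c("S", "S", "X", "X"))
b <- data.frame(Tracer = c(21, 15, 16.415, 26.410),
                angle = c(0, 0, 15, 15),
                treatment = c("S", "S", "X", "X"))

# Hypothetical helper: mean and standard error of Tracer per angle/treatment group
group_stats <- function(d) {
  m <- aggregate(Tracer ~ angle + treatment, data = d, FUN = mean)
  s <- aggregate(Tracer ~ angle + treatment, data = d,
                 FUN = function(v) sd(v) / sqrt(length(v)))
  names(m)[3] <- "mean"
  names(s)[3] <- "se"
  merge(m, s, by = c("angle", "treatment"))
}

# One row per group, with columns mean_z, se_z, mean_b, se_b
stats <- merge(group_stats(z), group_stats(b),
               by = c("angle", "treatment"), suffixes = c("_z", "_b"))

# Ratio of the means, with an approximate standard error
# (assumes z and b are independent; this is a simplification)
stats$ratio <- stats$mean_z / stats$mean_b
stats$ste   <- stats$ratio *
  sqrt((stats$se_z / stats$mean_z)^2 + (stats$se_b / stats$mean_b)^2)
stats
```

The resulting data frame can be passed to the plotting code in Step 2 with ratio in place of dif, i.e. y = ratio, ymin = ratio - ste and ymax = ratio + ste.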
