Julia groupBy name and sum up count - plot

I'm new to Julia and have a simple question. I have a CSV file with the structure [Category, Name, Count]. There are two things I want to create.
1, I want to create a function in Julia which groups by the Category and adds up the Counts (Name is ignored), so that the output is [Category, Count]. I will then generate a bar plot by setting x = Category and y = Count.
2, I want to generate multiple plots, one per Category, where the Count of each Name is shown on a separate bar plot. So an iterative plotting process?
I think I've got the hang of plotting, but I am not sure how to do the group-by step. Any help or redirection to tutorials would be greatly appreciated.
A sample of my data:
(net_worth,khan,14)
(net_worth,kevin,15)
(net_worth,bill,16)
the function I am currently working on:
function wordcount(text, opinion, number)
    words = text
    counts = Dict()
    for w in words
        counts[w] = number
    end
    return counts
end
function wcreduce(wcs)
    counts = Dict()
    for c in wcs, (k, v) in c
        counts[k] = get(counts, k, 0) + v
    end
    return counts
end
I am looking for a function like reduceByKey or GroupByKey I guess.
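For reference, the reduceByKey pattern described above is just a keyed sum. Here is a minimal Python sketch of the same idea (the sample rows are taken from the question; Python is used purely for illustration):

```python
from collections import Counter

# Each record is (category, name, count); we sum count per category,
# ignoring name - a reduceByKey over the first field
rows = [("net_worth", "khan", 14),
        ("net_worth", "kevin", 15),
        ("net_worth", "bill", 16)]

totals = Counter()
for category, _name, count in rows:
    totals[category] += count

print(dict(totals))  # {'net_worth': 45}
```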

So I solved this by using the Julia `by` function on DataFrames.
First load in the data csv using:
data = readtable("iris.csv")
Now the `by` function (note that the grouping column has to be passed through, not hard-coded as `:trends`, and the reducer must not shadow its own argument):
function trendingkeys(data::DataFrame, trends::Symbol)
    by(data, trends, d -> sum(d[:counts]))
end
I must say, DataFrames is so smart.
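For comparison, the equivalent group-and-sum can be sketched in Python with pandas (column names assumed to match the [Category, Name, Count] layout above; this is an illustration, not the Julia solution itself):

```python
import pandas as pd

# Sample rows following the [Category, Name, Count] layout from the question
df = pd.DataFrame({
    "Category": ["net_worth", "net_worth", "savings"],
    "Name": ["khan", "kevin", "bill"],
    "Count": [14, 15, 16],
})

# Group by Category and sum the Counts; Name is dropped automatically
totals = df.groupby("Category", as_index=False)["Count"].sum()
print(totals)
```

The resulting `totals` frame can be fed straight into a bar plot with x = Category and y = Count.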

Related

Method to find if the string has words in the given set of keywords

I have a dataset with ticket details, including Short_Description and Notes columns, and another dataset with keywords and their corresponding categories. I have to run Short_Description and Notes against the keywords, check whether any keyword is present, and select the corresponding category.
The problem is that the data is large (more than 1500 rows) and the keyword table is around 600 rows of different categories, with 5 columns of keywords.
It takes a huge amount of time (more than 5 hours) to run the code because I am using a FOR loop.
Is there any way, or any other function, I can use to optimize the code?
I am using str_detect(), where I get 600 rows of output for a single ticket.
Data = read.csv("Open_tickets.csv")
k = read_excel("Keywords_All.xlsx")
setDT(k)[, Seq := rowid(Assignment.Group)]
k[, 4:13] = tolower(unlist(k[, 4:13]))
k[, 4:13] = str_replace_all(unlist(k[, 4:13]), "[^a-zA-Z\\s]", "")
sd = data.frame()
notes = data.frame()
for (i in 1:NROW(Data)) {
  for (j in 1:NROW(k)) {
    # Match keywords against Short_Description
    Data$Short_Description[i] = tolower(Data$Short_Description[i])
    str1 = str_replace_all(Data$Short_Description[i], "[^a-zA-Z\\s]", "")
    newd1 = str_detect(str1, unlist(k[j, 4:8]))
    newd1 = as.data.frame(t(newd1))
    newd1$Assignment_Group = Data$Assignment_Group[i]
    newd1$inc = Data$Number[i]
    newd1$Short_Description = Data$Short_Description[i]
    newd1$Notes = Data$Notes[i]
    newd1$Subcategory = k$`Sub Category`[j]
    newd1$Category = k$Category[j]
    newd1$seq = k$Seq[j]
    sd = rbind(sd, newd1)
    # Match keywords against Notes
    Data$Notes[i] = tolower(Data$Notes[i])
    str2 = str_replace_all(Data$Notes[i], "[^a-zA-Z\\s]", "")
    newd2 = str_detect(str2, unlist(k[j, 4:8]))
    newd2 = as.data.frame(t(newd2))
    newd2$Assignment_Group = Data$Assignment_Group[i]
    newd2$inc = Data$Number[i]
    newd2$Short_Description = Data$Short_Description[i]
    newd2$Notes = Data$Notes[i]
    newd2$Subcategory = k$`Sub Category`[j]
    newd2$Category = k$Category[j]
    newd2$seq = k$Seq[j]
    notes = rbind(notes, newd2)
  }
}
I get the output dataframes with True and False values with corresponding keywords.
For this, you can use the Aho-Corasick algorithm, a text-searching algorithm that matches an entire set of keywords in a single O(n) pass over the text. You can find R packages implementing it here,
for Bioconductor:
https://rdrr.io/bioc/Starr/man/match_ac.html
for CRAN:
https://rdrr.io/cran/AhoCorasickTrie/.
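To illustrate the idea behind this recommendation (a plain-Python sketch, not the R packages linked above): the automaton is built once from all keywords, and each ticket text is then scanned in a single pass, no matter how many keywords there are.

```python
from collections import deque

def build_automaton(keywords):
    """Build an Aho-Corasick automaton from a list of keywords."""
    # goto[i]: char -> child node; out[i]: keywords ending at node i;
    # fail[i]: failure link followed on a mismatch
    goto, out, fail = [{}], [[]], [0]
    for kw in keywords:
        node = 0
        for ch in kw:
            if ch not in goto[node]:
                goto.append({})
                out.append([])
                fail.append(0)
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(kw)
    # Breadth-first pass to fill in the failure links
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child].extend(out[fail[child]])
    return goto, fail, out

def find_keywords(text, automaton):
    """Return the set of keywords occurring anywhere in text, in one pass."""
    goto, fail, out = automaton
    node, hits = 0, set()
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        hits.update(out[node])
    return hits

print(find_keywords("ushers", build_automaton(["he", "she", "his", "hers"])))
```

Applied to the problem above, you would build one automaton from all 600 keyword rows and run each Short_Description and Notes string through it once, instead of 600 str_detect() calls per ticket.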

Iterate through and conditionally append string values in a Pandas dataframe

I've got a dataframe of research participants whose IDs are stored in the format "0000.000", where the first four digits are the family ID number and the final three digits are the individual's index within the family. The majority of individuals have the suffix ".000", but some have ".001", ".002", etc.
As a result of some inefficiencies, these numbers are stored as floats. I'm trying to import them as strings so that I can use them in a join to another data frame that is formatted correctly.
Those IDs that end in .000 are imported as "0000", rather than "0000.000". All others are imported correctly.
I'm trying to iterate through the IDs and append ".000" to those that are missing the suffix.
If I were using R, I could do it like this:
df %>% mutate(StudyID = ifelse(length(StudyID) < 5,
                               paste(StudyID, ".000", sep = ""),
                               StudyID))
I've found a Python solution (below), but it's pretty janky.
row = 0
for i in df["StudyID"]:
    if len(i) < 5:
        df.iloc[row, 3] = i + ".000"
    else:
        df.iloc[row, 3] = i
    row += 1
I think it'd be ideal to do it as a list comprehension, but I haven't been able to find a solution that lets me iterate through the column, changing a single value at a time.
For example, this solution iterates and checks the logic properly, but it replaces every single value that evaluates True during each iteration. I only want the value currently being evaluated to change.
[i + ".000" if len(i)<5 else i for i in df["StudyID"]]
Is this possible?
As you said, your code does the trick. One other way of doing what you want that I could think of is the following:
# Start by creating a mask marking the rows you want to change
mask = [len(i) < 5 for i in df.StudyID]
# Change the values on the mask (using .loc so the assignment
# modifies df itself rather than a temporary copy)
df.loc[mask, "StudyID"] += ".000"
I think by length(StudyID), you meant nchar(StudyID), as #akrun pointed out.
You can do it in the dplyr way in python using datar:
>>> from datar.all import f, tibble, mutate, nchar, if_else, paste
>>>
>>> df = tibble(
... StudyID = ["0000", "0001", "0000.000", "0001.001"]
... )
>>> df
StudyID
<object>
0 0000
1 0001
2 0000.000
3 0001.001
>>>
>>> df >> mutate(StudyID=if_else(
... nchar(f.StudyID) < 5,
... paste(f.StudyID, ".000", sep=""),
... f.StudyID
... ))
StudyID
<object>
0 0000.000
1 0001.000
2 0000.000
3 0001.001
Disclaimer: I am the author of the datar package.
Ultimately, I needed to do this for a few different dataframes so I ended up defining a function to solve the problem so that I could apply it to each one.
I think the list comprehension idea was going to become too complex and potentially too difficult to understand when reviewing so I stuck with a plain old for-loop.
def create_multi_index(data, col_to_split, sep="."):
    """
    This function loops through the original ID column and splits it into
    multiple parts (multi-IDs) on the defined separator.

    By default, the function assumes the unique ID is formatted like a
    decimal number. The new multi-IDs are appended to new lists.
    If the original ID was formatted like an integer rather than a decimal,
    the function assumes the latter half of the ID to be "000".
    """
    # Take a copy of the dataframe to modify, leaving the original intact
    new_df = data.copy()
    # Generate two new lists to store the new multi-index
    Family_ID = []
    Family_Index = []
    # Iterate through the IDs, split, and allocate the pieces to the appropriate list
    for i in new_df[col_to_split]:
        parts = i.split(sep)
        Family_ID.append(parts[0])
        if len(parts) == 1:
            Family_Index.append("000")
        else:
            Family_Index.append(parts[1])
    # Return the dataframe including the new multi-index columns
    return new_df.assign(Family_ID=Family_ID,
                         Family_Index=Family_Index)
This returns a duplicate dataframe with a new column for each part of the multi-id.
When joining dataframes with this form of ID, as long as both dataframes have the multi index in the same format, these columns can be used with pd.merge as follows:
pd.merge(df1, df2, how= "inner", on = ["Family_ID","Family_Index"])
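A vectorized alternative (a sketch, assuming the IDs have already been read in as strings): pandas' .str accessor can pad the missing suffix and split the ID without an explicit loop.

```python
import pandas as pd

# Illustrative IDs: two missing the ".000" suffix, one complete
df = pd.DataFrame({"StudyID": ["0000", "0001.001", "0002"]})

# Append ".000" only where the suffix was dropped on import
needs_suffix = df["StudyID"].str.len() < 5
df.loc[needs_suffix, "StudyID"] += ".000"

# Split into the two parts of the multi-index in one vectorized call
df[["Family_ID", "Family_Index"]] = df["StudyID"].str.split(".", expand=True)
print(df)
```

The resulting Family_ID and Family_Index columns can be used in pd.merge exactly as in the loop-based version above.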

How do I assign the argument of one of my functions as a variable name in R?

I wrote the following function:
rename.fun <- function(rai, pred) {
  assign("pred", rai)
  return(pred)
}
I called it with the arguments rename.fun(k2e, k2e_cat2) and it returns the object I want, but it is named pred.
The point of this function is to assign the object I pass as rai to the name I pass as pred, i.e. rename k2e to k2e_cat2.
I am new to R, but I am a SAS programmer. This is a very simple task with the SAS macro processor, but I can't seem to figure it out in R.
EDIT:
In SAS I would do the following:
%macro rename_fun(rai=);
  data output (rename=(&rai.=&rai._cat2));
    set input;
  run;
%mend;
Essentially, I want to add the suffix _cat2 to a bunch of variables, but they need to be in a function call. I know this seems odd, but it's for a specific project at work. I am new to R, so I apologize if this seems silly.
Since you say that you want to rename several columns in a data.frame, you could simply do this by using a function that takes a data.frame and a vector of column names to rename:
add_suffix_cat2 <- function(df, vars){
  names(df)[match(vars, names(df))] <- paste0(vars, "_cat2")
  return(df)
}
Then you can call the function like:
mydf <- mtcars
res <- add_suffix_cat2(mydf, c("hp", "mpg"))
If you wanted to make the suffix customizable, that's simple enough to do by adding another parameter to the function.
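The same pattern in Python/pandas, for comparison (a sketch; the dataframe and column names are just illustrative): DataFrame.rename takes a dict mapping old names to new ones, which can be built from the list of columns.

```python
import pandas as pd

def add_suffix_cat2(df, cols, suffix="_cat2"):
    # Rename only the requested columns by mapping old name -> old name + suffix
    return df.rename(columns={c: c + suffix for c in cols})

# Illustrative frame standing in for mtcars
mydf = pd.DataFrame({"hp": [110, 93], "mpg": [21.0, 22.8], "cyl": [6, 4]})
res = add_suffix_cat2(mydf, ["hp", "mpg"])
print(list(res.columns))  # ['hp_cat2', 'mpg_cat2', 'cyl']
```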

Custom function does not work in R 'ddply' function

I am trying to use a custom function inside 'ddply' in order to create a new variable (NormViability) in my data frame, based on values of a pre-existing variable (CelltiterGLO).
The function is meant to create a rescaled (%) value of 'CelltiterGLO' based on the mean 'CelltiterGLO' values at a specific sub-level of the variable 'Concentration_nM' (0.01).
So if the mean of 'CelltiterGLO' at 'Concentration_nM'==0.01 is set as 100, I want to rescale all other values of 'CelltiterGLO' over the levels of other variables ('CTSC', 'Time_h' and 'ExpType').
The normalization function is the following:
normalize.fun = function(CelltiterGLO) {
  idx = Concentration_nM == 0.01
  jnk = mean(CelltiterGLO[idx], na.rm = TRUE)
  out = 100 * (CelltiterGLO / jnk)
  return(out)
}
and this is the code I try to apply to my dataframe:
library("plyr")
df.bis = ddply(df,
               .(CTSC, Time_h, ExpType),
               transform,
               NormViability = normalize.fun(CelltiterGLO))
The code runs, but when I double-check (with aggregate or tapply) whether the mean of 'NormViability' equals 100 at 'Concentration_nM' == 0.01, I do not get 100 but different numbers. The thing is, if I subset my df by the two levels of the variable 'ExpType', the code returns the correct numbers on each separate subset. I tried making 'ExpType' either character or factor, but I got similar results. 'ExpType' has two levels/values, "Combinations" and "DoseResponse". I can't figure out why the code is not working on the entire df. I wonder if it is because the two levels of 'ExpType' do not contain the same number of levels for all the other variables, e.g. one of the levels of 'Time_h' is missing for the "Combinations" level of 'ExpType'.
Thanks very much for your help and I apologize in advance if the answer is already present in Stackoverflow and I was not able to find it.
Michele
I (the OP) found out that the function used a variable in its body that was missing from its argument list. Simply adding the variable Concentration_nM to the custom function's arguments solved the problem.
THANKS
m.
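The same within-group normalization can be sketched in Python/pandas (column names taken from the question; an illustration, not the original plyr code). Computing the reference mean explicitly per group also makes the bug above impossible, since the reference column has to be named:

```python
import pandas as pd

df = pd.DataFrame({
    "CTSC": ["a", "a", "a", "b", "b", "b"],
    "Time_h": [24, 24, 24, 24, 24, 24],
    "ExpType": ["DoseResponse"] * 6,
    "Concentration_nM": [0.01, 0.1, 1.0, 0.01, 0.1, 1.0],
    "CelltiterGLO": [200.0, 150.0, 100.0, 50.0, 40.0, 30.0],
})

# Mean signal at the reference concentration, per (CTSC, Time_h, ExpType) group
ref = (df[df["Concentration_nM"] == 0.01]
       .groupby(["CTSC", "Time_h", "ExpType"], as_index=False)["CelltiterGLO"]
       .mean()
       .rename(columns={"CelltiterGLO": "ref"}))

# Attach each row's group reference and rescale so the reference level is 100
df_bis = df.merge(ref, on=["CTSC", "Time_h", "ExpType"])
df_bis["NormViability"] = 100 * df_bis["CelltiterGLO"] / df_bis["ref"]
print(df_bis)
```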

Creating Dataset and assigning variable names in Matlab

I have R code that I am trying to translate into MATLAB. I have a bunch of vectors that I column-bind into a matrix and then assign specific names to the columns. How can I do the same in MATLAB?
So my R code is the following -
param = cbind(a1,b1,r1,ratio,s1,d1,ratio1)
colnames(param) = c("alpha","beta^2","rho","rho/beta^2","sigma^2","delta","delta/sigma^2")
For the first part, in MATLAB I have
param = [a1;b1;r1;ratio;s1;d1;ratio1]';
I do not know how to accomplish the second part.
You mean this:
colnames = {'alpha', 'beta^2', 'rho', 'rho/beta^2', 'sigma^2', 'delta', 'delta/sigma^2'}
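For what it's worth, the cbind-plus-colnames pair maps to a single step in Python/pandas (a sketch, with illustrative vectors standing in for a1, b1, r1, etc.):

```python
import numpy as np
import pandas as pd

# Hypothetical vectors standing in for a1, b1, r1 from the question
a1 = np.array([1.0, 2.0])
b1 = np.array([3.0, 4.0])
r1 = np.array([5.0, 6.0])

# Column-bind and name the columns in one step
param = pd.DataFrame({"alpha": a1, "beta^2": b1, "rho": r1})
print(list(param.columns))  # ['alpha', 'beta^2', 'rho']
```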
