Combine DataFrame rows into a new column - julia

I am wondering if there is simple way to achieve this in Julia besides iterating over the rows in a for-loop.
I have a table with two columns that looks like this:
| Name | Interest |
|------|----------|
| AJ | Football |
| CJ | Running |
| AJ | Running |
| CC | Baseball |
| CC | Football |
| KD | Cricket |
...
I'd like to create a table where each Name in first column is matched with a combined Interest column as follows:
| Name | Interest |
|------|----------------------|
| AJ | Football, Running |
| CJ | Running |
| CC | Baseball, Football |
| KD | Cricket |
...
How do I achieve this?
UPDATE: OK, so after trying a few things including print_joint and grpby, I realized that the easiest way to do this would be by() function. I'm 99% there.
by(myTable, :Name, df->DataFrame(Interest = string(df[:Interest])))
This gives me my :Interest column as "UTF8String[\"Running\"]", and I can't figure out which method I should use instead of string() (or where to typecast) to get the desired ASCIIString output.

Related

How to match two columns in one dataframe using values in another dataframe in R

I have two dataframes. One is a set of ≈4000 entries that looks similar to this:
| grade_col1 | grade_col2 |
| --- | --- |
| A-| A-|
| B | 86|
| C+| C+|
| B-| D |
| A | A |
| C-| 72|
| F | 96|
| B+| B+|
| B | B |
| A-| A-|
The other is a set of ≈700 entries that look similar to this:
| grade | scale |
| --- | --- |
| A+|100|
| A+| 99|
| A+| 98|
| A+| 97|
| A | 96|
| A | 95|
| A | 94|
| A | 93|
| A-| 92|
| A-| 91|
| A-| 90|
| B+| 89|
| B+| 88|
...and so on.
What I'm trying to do is create a new column that shows whether grade_col2 matches grade_col1 with a binary, 0-1 output (0 = no match, 1 = match). Most of grade_col2 is shown by letter grade. But every once in awhile an entry in grade_col2 was accidentally entered as a numeric grade instead. I want this match column to give me a "1" even when grade_col2 is a numeric grade instead of a letter grade. In other words, if grade_col1 is B and grade_col2 is 86, I want this to still be read as a match. Only when grade_col1 is F and grade_col2 is 96 would this not be a match (similar to when grade_col1 is B- and grade_col2 is D = not a match).
The second data frame gives me the information I need to translate between one and the other (entries between 97-100 are A+, between 93-96 are A, and so on). I just don't know how to run a script that uses this information to find matches through all ≈4000 entries. Theoretically, I could do this manually, but the real dataset is so lengthy that this isn't realistic.
I had been thinking of using nested if_else statements with dplyr. But once I got past the first "if" statement, I got stuck. I'd appreciate any help with this people can offer.
You can do this using a join.
Let your first dataframe be grades_df and your second dataframe be lookup_df, then you want something like the following:
output = grades_df %>%
# join on look up, keeping everything grades table
left_join(lookup_df, by = c(grade_col2 = "scale")) %>%
# combine grade_col2 from grades_df and grade from lookup_df
mutate(grade_col2b = ifelse(is.na(grade), grade_col2, grade)) %>%
# indicator column
mutate(indicator = ifelse(grade_col1 == grade_col2b, 1, 0))

How to replace empty spaces with values from adjacent colum that needs to be separated?

Hi everyone. I'm so sorry for my english. I need to separate the
domain data of some emails in a table. Then, if these mail data have
the domain of a country, this information must be moved to another
column that is incomplete in which the participants of a congress are
included. This for a relatively large database. I put an example
below.
| email | country |
| -------- | -------------- |
| naco#gmail.com | CO |
| monic45814#gmail.com | AR |
| jsalazar#chapingo.mx | |
| andresramirez#urosario.edu.co | |
| jeimy861491#hotmail.com | CL |
|jytvc#hotmail.com | |
Outcome should be
| email | country |
| -------- | -------------- |
| naco#gmail.com | CO |
| monic45814#gmail.com | AR |
| jsalazar#chapingo.mx | MX |
| andresramirez#urosario.edu.co | CO |
|jeimy861491#hotmail.com | CL |
|jytvc#hotmail.com | *NA* |
Thank you so much.
You can use str_extract to get the string after the last occurrence of "." and if_else to ignore rows that already have a country and rows which e-mail doesn't end with a country code:
df %>%
mutate(country = if_else(is.na(country) & str_extract(email, "[^.]+$") != "com", toupper(str_extract(email, "[^.]+$")), country))
small but not so small PS: I would always recommend to provide fake data when you are mentioning personal data like e-mail addresses
Here is a solution in base R.
Suppose:
df<-data.frame(email,country)
Then:
df$country<-ifelse(is.na(df$country)&sub(".*(.*?)[\\.|:]", "",df$email)!="com",sub(".*(.*?)[\\.|:]", "",df$email),paste(df$country))

Split column string with delimiters into separate columns in azure kusto

I have a column 'Apples' in azure table that has this string: "Colour:red,Size:small".
Current situation:
|-----------------------|
| Apples |
|-----------------------|
| Colour:red,Size:small |
|-----------------------|
Desired Situation:
|----------------|
| Colour | Size |
|----------------|
| Red | small |
|----------------|
Please help
I'll answer the title as I noticed many people searched for a solution.
The key here is mv-expand operator (expands multi-value dynamic arrays or property bags into multiple records):
datatable (str:string)["aaa,bbb,ccc", "ddd,eee,fff"]
| project splitted=split(str, ',')
| mv-expand col1=splitted[0], col2=splitted[1], col3=splitted[2]
| project-away splitted
project-away operator allows us to select what columns from the input exclude from the output.
Result:
+--------------------+
| col1 | col2 | col3 |
+--------------------+
| aaa | bbb | ccc |
| ddd | eee | fff |
+--------------------+
This query gave me the desired results:
| parse Apples with "Colour:" AppColour ", Size:" AppSize
Remember to include all the different delimiters preceding each word you want to extract, e.g ", Size". Mind the space between.
This helped me then i used my intuition to customize the query according to my needs:
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/parseoperator

Addition of calculated field in rpivotTable

I want to create a calculated field to use with the rpivotTable package, similar to the functionality seen in excel.
For instance, consider the following table:
+--------------+--------+---------+-------------+-----------------+
| Manufacturer | Vendor | Shipper | Total Units | Defective Units |
+--------------+--------+---------+-------------+-----------------+
| A | P | X | 173247 | 34649 |
| A | P | Y | 451598 | 225799 |
| A | P | Z | 759695 | 463414 |
| A | Q | X | 358040 | 225565 |
| A | Q | Y | 102068 | 36744 |
| A | Q | Z | 994961 | 228841 |
| A | R | X | 454672 | 231883 |
| A | R | Y | 275994 | 124197 |
| A | R | Z | 691100 | 165864 |
| B | P | X | 755594 | 302238 |
| . | . | . | . | . |
| . | . | . | . | . |
+--------------+--------+---------+-------------+-----------------+
(my actual table has many more columns, both dimensions and measures, time, etc. and I need to define multiple such "calculated columns")
If I want to calculate defect rate (which would be Defective Units/Total Units) and I want to aggregate by either of the first three columns, I'm not able to.
I tried assignment by reference (:=), but that still didn't seem to work and summed up defect rates (i.e., sum(Defective_Units/Total_Units)), instead of sum(Defective_Units)/sum(Total_Units):
myData[, Defect.Rate := Defective_Units / Total_Units]
This ended up giving my defect rates greater than 1. Is there anywhere I can declare a calculated field, which is just a formula evaluated post aggregation?
You're lucky - the creator of pivottable.js foresaw cases like yours (and mine, earlier today) by implementing an aggregator called "Sum over Sum" and a few more, likewise, cf. https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L111 and https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L169.
So we'll use "Sum over Sum" as parameter "aggregatorName", and the columns whose quotient we want in the "vals" parameter.
Here's a meaningless usage example from the mtcars data for reproducibility:
require(rpivotTable)
data(mtcars)
rpivotTable(mtcars,rows="gear", cols=c("cyl","carb"),
aggregatorName = "Sum over Sum",
vals =c("mpg","disp"),
width="100%", height="400px")

Report the differences between two data frame in R

I have two data frames which are loaded from csv files. Basically from different environment but similar format/columns, They can have differences in rows/values. I would like to find the differences and create them in a new data frame. Also both data frame will have same order. I have 100's of files to be compared. Thanks in advance.
Dataframe1: df1test
product | country | partner | value
------------------------------------
prdct1 | china | part1 | ["563,45"]
prdct2 | UK | part4 | ["52,455"]
prdct3 | USA | part2 | ["563,45"]
prdct4 | ITALY | part6 | ["674,45"]
prdct5 | UK | part7 | ["563,578"]
Dataframe2: df1prod
product | country | partner | value
------------------------------------
prdct1 | china | part1 | ["563,45"]
prdct2 | UK | part4 | ["247,455"]
prdct3 | USA | part41 | ["563,45"]
prdct4 | UK | part6 | ["0,45"]
I would like to show the differences in third data frame
Dataframe3: dfDifference
Env:test Env:prod
product| country|partner| value product| country | partner | value
------------------------------------ -----------------------------------
prdct2 | UK |part4 | ["52,455"] prdct2 |UK |part4 | ["247,455"]
prdct3 | USA |part2 | ["563,45"] prdct3 |USA|part41 | ["563,45"]
prdct4 | ITALY |part6 | ["674,45"] prdct4 |UK |part6 | ["0,45"]
prdct5 | UK |part7 | ["563,578"] Not Available
I tried the following functions and methods but did'nt workout
Compare function
comptest<-compare(df1test,df1prod,allowAll = TRUE)
Variable combine
df1test$Varcomp <- apply(df1test,1,paste,collapse=';')
df1prod$Varcomp <- apply(df1prod,1,paste,collapse=';')
aabb<-sapply(df1prod$Varcomp,FUN = function(x){x==df1test$Varcomp})
A good way to do this ist the setdiff() function, whicht takes the two dataframes as arguments.
newdata <- setdiff(df1, df2)

Resources