Python: Average a list and append to dictionary

I have a dictionary of names with a number (a score) assigned to each. The file is laid out like so:
Person A,7
Person B,6
If a name is repeated in the file, e.g. Person B occurs on 3 lines with 3 different scores, I want to calculate the mean average of these scores and then append this result to a dictionary in the form of a list. However, I keep encountering an error when I try to sort the dictionary. Code below.
else:
    for key in results:
        keyValue = results[key]
        if len(keyValue) > 1:
            # Line below this needs modification
            keyValue = list(sum(keyValue)/len(keyValue))
            newResults[key] = keyValue
            # Error in above code...
        else:
            newResults[key] = keyValue
    print(newResults)
    print(sorted(zip(newResults.values(), newResults.keys()), reverse=True))
results is a dictionary of the people (the keys) and their scores (the values), where the values are lists, so that:
results = {'Bob':[7],'Jane':[8,9]}

If you're using Python 3.4+ you can use the standard library's statistics module, which contains a mean function. Assuming that your dict looks like results = {'Bob': [7], 'Jane': [8, 9]}, you can create a newResults dict like this:
from statistics import mean
newResults = {key: mean(results[key]) for key in results}
This is called a dict comprehension and, as you can see, it's fairly intuitive. The opening { says that a dict is going to be created, key: value defines its structure, and the for clause iterates over the collection used to build the dict. You can achieve the same with:
newResults = {}
for key in results:
    newResults[key] = mean(results[key])
You want to sort the dict at the end. A dict itself cannot be sorted in place, so you can either create an OrderedDict, which remembers the insertion order of its items, or build a list containing the keys of your dict sorted by value. The latter will look like:
sortedKeys = sorted(newResults, key=lambda x: newResults[x])
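Alternatively, on Python 3.7+ (where regular dicts preserve insertion order) you can materialise the sorted result as a new dict; a minimal sketch, using a hypothetical sortedResults name:
# dicts preserve insertion order on Python 3.7+
sortedResults = dict(sorted(newResults.items(), key=lambda kv: kv[1], reverse=True))
# e.g. {'Jane': 8.5, 'Bob': 7}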

Related

applying unique id's (uuid) to environment objects in a list; preserving name and random-ness

I have a list of environment objects (R6) with a large number of parameters associated with each (>50 parameters for each "individual"; i.e. sex, weight, height, etc.). The objects are stored in a list in order to loop over individuals in a model.
As part of the model, I want to add new individuals to the list periodically. My issue is that at some point I need to index the "individuals'" numeric position within the list to populate output of the model.
My solution so far has been to assign random ID numbers to each individual at its creation, then save the "individual" object and its associated name to the list, allowing me to index the position in the list using which().
For example:
individual1 <- newobject$new(1)  # create unique objects at the start using the R6 constructor
individual2 <- newobject$new(1)  # a 2nd object
population <- tibble::lst(individual1, individual2)  # add both objects to an initial list
At this point, names(population) returns "individual1", "individual2".
Next, within a for loop, I want to generate new individuals and assign them random names, because the process that triggers this loop is variable and random; I do not want to pull from a list of pre-named items.
newindividuals <- 5
for (i in 1:newindividuals) {
    new <- newobject$new(1)
    assign(UUIDgenerate(), new)
    population <- c(population, tibble::lst(randomID))
}
Create a new object, rename it with a random ID (package uuid) and then append it to the population list.
The desired end result is that the new population will return:
names(population)
"individual1" "individual2" "kneslkfk-69inf-flknsek53234-lkdfj"
i.e. the 2 objects already named in the list, AND the new random alphanumeric identifier created in the loop above.
Finally, I'd like to index the position of an individual within the list. Something like:
which(population == individual2)
should return 2.
I have tried to pipe the operation, and I have tried to assign the random ID within a list command:
assign(UUIDgenerate(), new) %>% c(population, tibble::lst(.)) -> population
This pipe does not work, perhaps because the dot cannot refer to this type of environment object? Without the dot notation, population appends the object, but not the name.
I have also tried to assign it without dplyr piping:
population <- c(population, (tibble::lst(assign(UUIDgenerate(), new))))
This appends the object to the list but makes the name
"assign(UUIDgenerate(), new)"
which is not desirable...
As for indexing the position within the list, which(), match(), and Position() have proved fruitless, presumably because the data type is a list of environments rather than integers. I am hopeful that, using the names stored in the list, I will be able to reference the position using a placeholder in a loop...
i.e.
for (i in population) {
    # do things
    Position(i)
}
will return position "3" when i = "kneslkfk-69inf-flknsek53234-lkdfj".
I hope this is clear enough; it is hard to produce data for testing due to the complex nature of the larger model.
But simply, my aims are twofold:
save the randomly generated name of an environment object to a list
index the numeric position of that new object within the list.

Delete all duplicated elements in a vector in Julia 1.1

I am trying to write code that deletes all repeated elements in a vector. How do I do this?
I already tried using unique and union, but they both keep one copy of each repeated item. I want all of them to be deleted.
For example: let x = [1,2,3,4,1,6,2]. Using union or unique returns [1,2,3,4,6]. What I want as my result is [3,4,6].
There are lots of ways to go about this. One approach that is fairly straightforward and probably reasonably fast is to use countmap from StatsBase:
using StatsBase
function f1(x)
    d = countmap(x)
    return [key for (key, val) in d if val == 1]
end
or as a one-liner:
[ key for (key, val) in countmap(x) if val == 1 ]
countmap creates a dictionary mapping each unique value from x to the number of times it occurs in x. The solution can then easily be found by extracting every key from the dictionary that maps to a val of 1, i.e. all elements of x that occur precisely once.
It might be faster in some situations to use sort!(x) and then construct an index for the elements of the sorted x that only occur once, but this will be messier to code, and the output will be in sorted order, which you may not want. Note that a Dict does not guarantee any iteration order either; if you need the result in the original order of x, filter the vector itself against the counts, e.g. filter(v -> d[v] == 1, x).
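The same count-then-filter idea carries over directly to other languages. As a rough illustration, a Python sketch using collections.Counter (with a hypothetical keep_unique helper) might look like:
from collections import Counter

def keep_unique(x):
    counts = Counter(x)                      # value -> number of occurrences
    return [v for v in x if counts[v] == 1]  # keep values that occur exactly once

keep_unique([1, 2, 3, 4, 1, 6, 2])  # [3, 4, 6]
Filtering x itself (rather than the counter) preserves the original ordering.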

Iterate through and conditionally append string values in a Pandas dataframe

I've got a dataframe of research participants whose IDs are stored in the format "0000.000", where the first four digits are their family ID number and the final three digits are their individual index within the family. The majority of individuals have a suffix of ".000", but some have ".001", ".002", etc.
As a result of some inefficiencies, these numbers are stored as floats. I'm trying to import them as strings so that I can use them in a join to another data frame that is formatted correctly.
Those IDs that end in .000 are imported as "0000", rather than "0000.000". All others are imported correctly.
I'm trying to iterate through the IDs and append ".000" to those that are missing the suffix.
If I were using R, I could do it like this:
df %>% mutate(StudyID = ifelse(length(StudyID) < 5,
                               paste(StudyID, ".000", sep=""),
                               StudyID))
I've found a Python solution (below), but it's pretty janky.
row = 0
for i in df["StudyID"]:
    if len(i) < 5:
        df.iloc[row, 3] = i + ".000"
    else:
        df.iloc[row, 3] = i
    row += 1
I think it'd be ideal to do it as a list comprehension, but I haven't been able to find a solution that lets me iterate through the column, changing a single value at a time.
For example, this solution iterates and checks the logic properly, but it replaces every single value that evaluates True during each iteration. I only want the value currently being evaluated to change.
[i + ".000" if len(i)<5 else i for i in df["StudyID"]]
Is this possible?
As you said, your code does the trick. One other way of doing what you want that I could think of is the following:
# Start by creating a mask that gives you the indices you want to change
mask = [len(i) < 5 for i in df.StudyID]
# Change the value of the dataframe on the mask
df.StudyID.iloc[mask] += ".000"
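A variant of the same idea (a sketch assuming StudyID is already held as strings) builds the mask with pandas' vectorized string methods and writes through .loc, which avoids chained indexing:
mask = df["StudyID"].str.len() < 5   # vectorized length check
df.loc[mask, "StudyID"] += ".000"    # update only the masked rows in place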
I think by length(StudyID), you meant nchar(StudyID), as #akrun pointed out.
You can do it in the dplyr way in python using datar:
>>> from datar.all import f, tibble, mutate, nchar, if_else, paste
>>>
>>> df = tibble(
... StudyID = ["0000", "0001", "0000.000", "0001.001"]
... )
>>> df
StudyID
<object>
0 0000
1 0001
2 0000.000
3 0001.001
>>>
>>> df >> mutate(StudyID=if_else(
... nchar(f.StudyID) < 5,
... paste(f.StudyID, ".000", sep=""),
... f.StudyID
... ))
StudyID
<object>
0 0000.000
1 0001.000
2 0000.000
3 0001.001
Disclaimer: I am the author of the datar package.
Ultimately, I needed to do this for a few different dataframes, so I ended up defining a function to solve the problem so that I could apply it to each one.
I think the list comprehension idea was going to become too complex and potentially too difficult to understand when reviewed, so I stuck with a plain old for-loop.
def create_multi_index(data, col_to_split, sep="."):
    """
    This function loops through the original ID column and splits it into
    multiple parts (multi-IDs) on the defined separator.
    By default, the function assumes the unique ID is formatted like a decimal number.
    The new multi-IDs are appended into new lists.
    If the original ID was formatted like an integer rather than a decimal,
    the function assumes the latter half of the ID to be "000".
    """
    # Take a copy of the dataframe to modify
    new_df = data.copy()
    # generate two new lists to store the new multi-index
    Family_ID = []
    Family_Index = []
    # iterate through the IDs, split and allocate the pieces to the appropriate list
    for i in new_df[col_to_split]:
        i = i.split(sep)
        Family_ID.append(i[0])
        if len(i) == 1:
            Family_Index.append("000")
        else:
            Family_Index.append(i[1])
    # Modify and return the dataframe including the new multi-index
    return new_df.assign(Family_ID=Family_ID, Family_Index=Family_Index)
This returns a duplicate dataframe with a new column for each part of the multi-id.
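For example (assuming both dataframes keep their raw IDs in a column named "StudyID", as above), each one can be run through the function before the join:
df1 = create_multi_index(df1, "StudyID")
df2 = create_multi_index(df2, "StudyID")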
When joining dataframes with this form of ID, as long as both dataframes have the multi index in the same format, these columns can be used with pd.merge as follows:
pd.merge(df1, df2, how="inner", on=["Family_ID", "Family_Index"])

R: build a list from separate key value columns

In R, I'd like to build a key-value paired list from separate key and value columns. In Python I would just do something like this:
d = {k:v for k,v in zip(keys, values)}
I want something similar in R that is equivalent to:
list('key1' = 'value1', 'key2' = 'value2', ...)
I've built this with a for-loop but was hoping there is a more elegant R way of doing this.
You can use split to get a list of key/value pairs (note that split groups the values by key, so duplicated keys are collected together and the result is ordered by key):
split(values, keys)

R: How to represent a table augmented by arbitrary key/value pairs for each row?

This is a newbie R question. I am beginning to explore the use of R for website analytics. I have a set of page view events which have common properties along with an arbitrary set of properties that depend on the page. For instance, all events will have a userId, createdAt, and pageId, but the "signup" page might have a special property origin whose value could be "adwords" or "organic", etc.
In JSON, the data might look like this:
[
  {
    "userId": null,
    "pageId": "home",
    "sessionId": "abcd",
    "createdAt": 1381013741,
    "parameters": {}
  },
  {
    "userId": 123,
    "pageId": "signup",
    "sessionId": "abcd",
    "createdAt": 1381013787,
    "parameters": {
      "origin": "adwords",
      "campaignId": 4
    }
  }
]
I have been struggling to represent this data in R data structures effectively. In particular I need to be able to subset the event list by conditions based on the arbitrary key/value pairs, for instance, select all events whose pageId=="signup" and origin=="adwords".
There is enough diversity in the keys used for the arbitrary parameters that it seems unreasonable to create sparsely-populated columns for every possible key.
What I'm currently doing is pre-processing the data into two CSV files, core_properties.csv and parameters.csv, in the form:
# core_properties.csv (one record per pageview)
userId,pageId,sessionId,createdAt
,home,abcd,1381013741
123,signup,abcd,1381013787
...
# parameters.csv (one record per k/v pair)
row,key,value  # <- "row" here denotes the record index in core_properties.csv
1,origin,adwords
1,campaignId,4
...
I then read.table each file into a data frame, and I am now attempting to store the k/v pairs as a list (with names=keys) inside cells of the core events data frame. This has been a lot of awkward trial and error; the best approach I've found so far is the following:
events <- read.csv('core_properties.csv', header=TRUE)
parameters <- read.csv('parameters.csv', header=TRUE,
                       colClasses=c("character", "character", "character"))
paramLists <- sapply(1:nrow(events), function(x) { list() })
apply(parameters, 1, function(x) {
  paramLists[[ as.numeric(x[["row"]]) ]][[ x[["key"]] ]] <<- x[["value"]]
})
events$parameters <- paramLists
I can now access the origin property of the first event with the syntax events[1,][["parameters"]][[1]][["origin"]] (note that it requires, for some reason, an extra [[1]] subscript in there). Data frames do not seem to appreciate being given lists as individual values for cells:
> events[1,][["parameters"]] <- list()
Error in `[[<-.data.frame`(`*tmp*`, "parameters", value = list()) :
replacement has 0 rows, data has 1
Is there a best practice for handling this sort of data? I have not found it discussed in the manuals and tutorials.
Thank you!
You can use nested lists in R that map nicely to JSON. I have shown a simple example where you filter based on parameter origin.
dat <- list(
  list(userId = NULL, pageId = "home", createdAt = 1381013741, parameters = list()),
  list(userId = NULL, pageId = "new", createdAt = 1381013741,
       parameters = list(origin = 'adwords', campaignId = 4))
)
Filter(function(l) {
  length(l$parameters) > 0 && l$parameters$origin == 'adwords'
}, dat)
