R: build a list from separate key and value columns

In R, I'd like to build a key-value paired list from separate key and value columns. In Python I would just do something like this:
d = {k:v for k,v in zip(keys, values)}
I want something similar in R that is equivalent to:
list('key1' = 'value1', 'key2' = 'value2', ...)
I've built this with a for-loop but was hoping there is a more elegant R way of doing this.

You can use split to get a list of key/value pairs:
split(values, keys)
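A minimal sketch with made-up vectors (keys and values here are assumptions, not data from the question). When each key is unique, every list element holds a single value, matching the list('key1' = 'value1', ...) form; repeated keys have their values collected together. Note that split coerces keys to a factor, so the elements come back in factor-level (alphabetical) order:
keys <- c("key1", "key2")
values <- c("value1", "value2")
split(values, keys)
# $key1
# [1] "value1"
#
# $key2
# [1] "value2"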

Related

Convert R list to Pythonic list and output as a txt file

I'm trying to convert these lists to something like Python's lists and write them out as a text file. I've used this code:
library(GenomicRanges)
library(data.table)
library(Repitools)
pcs_by_tile <- lapply(as.list(1:length(tiled_chr)), function(x) {
    obj <- tileSplit[[as.character(x)]]
    if (is.null(obj)) {
        return(0)
    } else {
        runs <- filtered_identical_seqs.gr[obj]
        df <- annoGR2DF(runs)
        score = split(df[, c("start", "end")], 1:nrow(df[, c("start", "end")]))
        #print(score)
        return(score)
    }
})
dt_text <- unlist(lapply(tiled_chr$score, paste, collapse=","))
writeLines(dt_text, "x.txt")
The following line of code splits each row of the data frame (only 2 columns) into a list. However, its output is different from what I want.
score = split(df[,c("start","end")], 1:nrow(df[,c("start","end")]))
But I wanted the following kind of output:
[20350, 20355], [20357, 20359], [20361, 20362], ........
If I understand your question correctly, using as.tuple from the 'sets' package might help. Here's what the code might look like:
library(sets)
score = split(df[,c("start","end")], 1:nrow(df[,c("start","end")]))
....
df_text = unlist(lapply(score, as.tuple), recursive = F)
This will return a list of tuples (and zeroes) that looks more like what you are looking for. You can filter out the zeroes by checking the type of each element in the resulting list and removing the ones that match. For example, you could do something like this:
df_text_trimmed <- df_text[!sapply(df_text, is.double)]
to get rid of all your zeroes
Edit: Now that I think about it, you probably don't even need to convert your dataframes to tuples if you don't want to. You just need to make sure to include the 'recursive = F' option when you unlist things to get a list of 0s and dataframes containing the numbers you want.

Iterate through and conditionally append string values in a Pandas dataframe

I've got a dataframe of research participants whose IDs are stored in the following format "0000.000", where the first four digits are their family ID number and the final three digits are their individual index within the family. The majority of individuals have a suffix of ".000", but some have ".001", ".002", etc.
As a result of some inefficiencies, these numbers are stored as floats. I'm trying to import them as strings so that I can use them in a join to another data frame that is formatted correctly.
Those IDs that end in .000 are imported as "0000", rather than "0000.000". All others are imported correctly.
I'm trying to iterate through the IDs and append ".000" to those that are missing the suffix.
If I were using R, I could do it like this.
df %>% mutate(StudyID = ifelse(length(StudyID) < 5,
                               paste(StudyID, ".000", sep = ""),
                               StudyID))
I've found a Python solution (below), but it's pretty janky.
row = 0
for i in df["StudyID"]:
    if len(i) < 5:
        df.iloc[row, 3] = i + ".000"
    else:
        df.iloc[row, 3] = i
    row += 1
I think it'd be ideal to do it as a list comprehension, but I haven't been able to find a solution that lets me iterate through the column, changing a single value at a time.
For example, this solution iterates and checks the logic properly, but it replaces every single value that evaluates True during each iteration. I only want the value currently being evaluated to change.
[i + ".000" if len(i)<5 else i for i in df["StudyID"]]
Is this possible?
As you said, your code is doing the trick. One other way of doing what you want that I could think of is the following:
# Start by creating a mask that flags the rows you want to change
mask = [len(i) < 5 for i in df["StudyID"]]
# Change the values of the dataframe on the mask; .loc avoids
# chained-assignment warnings
df.loc[mask, "StudyID"] += ".000"
I think by length(StudyID), you meant nchar(StudyID), as #akrun pointed out.
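For reference, here is the question's dplyr call with nchar swapped in (a sketch that assumes StudyID is already a character column):
df %>% mutate(StudyID = ifelse(nchar(StudyID) < 5,
                               paste(StudyID, ".000", sep = ""),
                               StudyID))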
You can do it the dplyr way in Python using datar:
>>> from datar.all import f, tibble, mutate, nchar, if_else, paste
>>>
>>> df = tibble(
... StudyID = ["0000", "0001", "0000.000", "0001.001"]
... )
>>> df
StudyID
<object>
0 0000
1 0001
2 0000.000
3 0001.001
>>>
>>> df >> mutate(StudyID=if_else(
... nchar(f.StudyID) < 5,
... paste(f.StudyID, ".000", sep=""),
... f.StudyID
... ))
StudyID
<object>
0 0000.000
1 0001.000
2 0000.000
3 0001.001
Disclaimer: I am the author of the datar package.
Ultimately, I needed to do this for a few different dataframes so I ended up defining a function to solve the problem so that I could apply it to each one.
I think the list comprehension idea was going to become too complex and potentially too difficult to understand when reviewing so I stuck with a plain old for-loop.
def create_multi_index(data, col_to_split, sep = "."):
    """
    This function loops through the original ID column and splits it into
    multiple parts (multi-IDs) on the defined separator.
    By default, the function assumes the unique ID is formatted like a decimal number.
    The new multi-IDs are appended into new lists.
    If the original ID was formatted like an integer, rather than a decimal,
    the function assumes the latter half of the ID to be ".000".
    """
    # Take a copy of the dataframe to modify
    new_df = data.copy()
    # generate two new lists to store the new multi-index
    Family_ID = []
    Family_Index = []
    # iterate through the IDs, split and allocate the pieces to the appropriate list
    for i in new_df[col_to_split]:
        i = i.split(sep)
        Family_ID.append(i[0])
        if len(i) == 1:
            Family_Index.append("000")
        else:
            Family_Index.append(i[1])
    # Modify and return the dataframe including the new multi-index
    return new_df.assign(Family_ID = Family_ID,
                         Family_Index = Family_Index)
This returns a duplicate dataframe with a new column for each part of the multi-id.
When joining dataframes with this form of ID, as long as both dataframes have the multi index in the same format, these columns can be used with pd.merge as follows:
pd.merge(df1, df2, how= "inner", on = ["Family_ID","Family_Index"])

SparkR gapply - function returns a multi-row R dataframe

Let's say I want to execute something as follows:
library(SparkR)
...
df <- read.parquet(<some_address>)
gapply(
    df,
    df$column1,
    function(key, x) {
        return(data.frame(x, newcol1 = f1(x), newcol2 = f2(x)))
    },
    schema
)
where the return of the function has multiple rows. To be clear, the examples in the documentation (which sadly echoes much of the Spark documentation where the examples are trivially simple) don't help me identify whether this will be handled as I expect.
I would expect that, for k groups created in the DataFrame with n_k output rows per group, the result of the gapply() call would have the sum over the k groups of n_k rows, with the key value replicated across each of the n_k rows of its group ... However, the schema field suggests to me that this is not how it will be handled - in fact, it suggests that it wants the result pushed into a single row.
Hopefully this is clear, albeit theoretical (I'm sorry I can't share my actual code example). Can someone verify or explain how such a function will actually be treated?
Exact expectations regarding input and output are clearly stated in the official documentation:
Apply a function to each group of a SparkDataFrame. The function is to be applied to each group of the SparkDataFrame and should have only two parameters: grouping key and R data.frame corresponding to that key. The groups are chosen from SparkDataFrames column(s). The output of function should be a data.frame.
Schema specifies the row format of the resulting SparkDataFrame. It must represent R function’s output schema on the basis of Spark data types. The column names of the returned data.frame are set by user. Below is the data type mapping between R and Spark.
In other words your function should take a key and a data.frame of rows corresponding to that key, and return a data.frame that can be represented using Spark SQL types with the schema provided as the schema argument. There are no restrictions regarding the number of rows. You could for example apply an identity transformation as follows:
df <- as.DataFrame(iris)
gapply(df, "Species", function(k, x) x, schema(df))
the same way as aggregations:
gapply(df, "Species",
    function(k, x) {
        dplyr::summarize(dplyr::group_by(x, Species), max(Sepal_Width))
    },
    structType(
        structField("species", "string"),
        structField("max_s_width", "double")
    )
)
although in practice you should prefer aggregations directly on DataFrame (groupBy %>% agg).
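For reference, a minimal sketch of that direct aggregation, assuming the same df <- as.DataFrame(iris) as above (Spark replaces the dots in the iris column names with underscores):
library(SparkR)

df <- as.DataFrame(iris)

# Aggregate on the SparkDataFrame itself instead of shipping each
# group to an R worker with gapply
head(agg(
    groupBy(df, df$Species),
    max_s_width = max(df$Sepal_Width)
))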

Python: Average a list and append to dictionary

I have a dictionary of names with a number (a score) assigned to them. The file is laid out as so:
Person A,7
Person B,6
If a name is repeated in the file, e.g. Person B occurred on 3 lines with 3 different scores, I want to calculate the mean average of these scores and then append the result to a dictionary in the form of a list. However, I keep encountering an error when I try to sort the dictionary. Code below.
else:
    for key in results:
        keyValue = results[key]
        if len(keyValue) > 1:
            # Line below this needs modification
            keyValue = list(sum(keyValue)/len(keyValue))
            newResults[key] = keyValue
            # Error in above code...
        else:
            newResults[key] = keyValue
    print(newResults)
    print(sorted(zip(newResults.values(), newResults.keys()), reverse=True))
Results is a dictionary of the people (the keys) and their scores (the values) where the values are lists so that:
results = {'Bob':[7],'Jane':[8,9]}
If you're using Python 3.x you can use its statistics library which contains a function mean. Now assuming that your dict looks like: results = {'Bob': [7], 'Jane': [8, 9]} you can create a newResults dict like this:
from statistics import mean
newResults = {key: mean(results[key]) for key in results}
This is called a dict comprehension and, as you can see, it's kinda intuitive. Starting with { you're saying that a dict is going to be created. Then with key: value you define its structure. Lastly, the for loop iterates over the collection used for the dict creation. You can achieve the same with:
newResults = {}
for key in results:
    newResults[key] = mean(results[key])
You want to sort the dict at the end. Unfortunately that's not possible with a plain dict. You can either create an OrderedDict, which remembers the items' insertion order, or a list containing the sorted keys of your dict. The latter will look like:
sortedKeys = sorted(newResults, key=lambda x: newResults[x])
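And a minimal sketch of the OrderedDict alternative mentioned above, assuming the newResults dict from earlier:
from collections import OrderedDict

# Sort the (name, mean score) pairs by score and freeze that order
orderedResults = OrderedDict(sorted(newResults.items(), key=lambda kv: kv[1]))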

Julia groupBy name and sum up count

I'm new to Julia and have a simple question. I have a csv file with the following structure: [Category, Name, Count]. There are 2 things I want to create.
1. I want to create a function in Julia which groups by Category and adds up the Counts (Name is ignored), so that the output is [Category, Count]. I will then generate a bar plot by setting x = Category and y = Count.
2. I want to generate multiple plots, one for each Category, where the Count of each Name is plotted on a separate bar plot. So an iterative plotting process?
I think I've got the hang of plotting, but I am not sure about how to do the groupBy process. Any help/re-direction to tutorials would be greatly appreciated.
A sample of my data:
(net_worth,khan,14)
(net_worth,kevin,15)
(net_worth,bill,16)
The functions I am currently working on:
function wordcount(text, opinion, number)
    words = text
    counts = Dict()
    for w = words
        counts[w] = number
    end
    return counts
end
function wcreduce(wcs)
    counts = Dict()
    for c in wcs, (k, v) in c
        counts[k] = get(counts, k, 0) + v
    end
    return counts
end
I am looking for a function like reduceByKey or GroupByKey I guess.
So I solved this by using the Julia by function on DataFrames.
First, load in the csv data using:
data = readtable("iris.csv")
Now the function using by:
function trendingkeys(data::DataFrame, trends::Symbol, funcadd::Function)
    # group on the `trends` column and apply the aggregation to each group's :Count
    by(data, trends, d -> funcadd(d[:Count]))
end
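A hypothetical call, assuming the csv from the question has Category, Name and Count columns:
data = readtable("my_scores.csv")   # hypothetical file name
trendingkeys(data, :Category, sum)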
I must say, DataFrames is so smart.
