Iterate through and conditionally append string values in a Pandas dataframe - r

I've got a dataframe of research participants whose IDs are stored in the following format "0000.000".
Where the first four digits are their family ID number, and the final three digits are their individual index within the family. The majority of individuals have a suffix of ".000", but some have ".001", ".002", etc.
As a result of some inefficiencies, these numbers are stored as floats. I'm trying to import them as strings so that I can use them in a join to another data frame that is formatted correctly.
Those IDs that end in .000 are imported as "0000", rather than "0000.000". All others are imported correctly.
I'm trying to iterate through the IDs and append ".000" to those that are missing the suffix.
If I were using R, I could do it like this.
df %>% mutate(StudyID = ifelse(length(StudyID)<5,
paste(StudyID,".000",sep=""),
StudyID)
I've found a Python solution (below), but it's pretty janky.
row = 0
for i in df["StudyID"]:
if len(i)<5:
df.iloc[row,3] = i + ".000"
else: df.iloc[row,3] = i
index += 1
I think it'd be ideal to do it as a list comprehension, but I haven't been able to find a solution that lets me iterate through the column, changing a single value at a time.
For example, this solution iterates and checks the logic properly, but it replaces every single value that evaluates True during each iteration. I only want the value currently being evaluated to change.
[i + ".000" if len(i)<5 else i for i in df["StudyID"]]
Is this possible?

As you said, your code is doing the trick. One other way of doing what you want that i could think of is the following :
# Start by creating a mask that gives you the index you want to change
mask = [len(i)<5 for i in df.StudyID]
# Change the value of the dataframe on the mask
df.StudyID.iloc[mask] += ".000"

I think by length(StudyID), you meant nchar(StudyID), as #akrun pointed out.
You can do it in the dplyr way in python using datar:
>>> from datar.all import f, tibble, mutate, nchar, if_else, paste
>>>
>>> df = tibble(
... StudyID = ["0000", "0001", "0000.000", "0001.001"]
... )
>>> df
StudyID
<object>
0 0000
1 0001
2 0000.000
3 0001.001
>>>
>>> df >> mutate(StudyID=if_else(
... nchar(f.StudyID) < 5,
... paste(f.StudyID, ".000", sep=""),
... f.StudyID
... ))
StudyID
<object>
0 0000.000
1 0001.000
2 0000.000
3 0001.001
Disclaimer: I am the author of the datar package.

Ultimately, I needed to do this for a few different dataframes so I ended up defining a function to solve the problem so that I could apply it to each one.
I think the list comprehension idea was going to become too complex and potentially too difficult to understand when reviewing so I stuck with a plain old for-loop.
def create_multi_index(data, col_to_split, sep = "."):
"""
This function loops through the original ID column and splits it into
multiple parts (multi-IDs) on the defined separator.
By default, the function assumes the unique ID is formatted like a decimal number
The new multi-IDs are appended into a new list.
If the original ID was formatted like an integer, rather than a decimal
the function assumes the latter half of the ID to be ".000"
"""
# Take a copy of the dataframe to modify
new_df = data
# generate two new lists to store the new multi-index
Family_ID = []
Family_Index = []
# iterate through the IDs, split and allocate the pieces to the appropriate list
for i in new_df[col_to_split]:
i = i.split(sep)
Family_ID.append(i[0])
if len(i)==1:
Family_Index.append("000")
else:
Family_Index.append(i[1])
# Modify and return the dataframe including the new multi-index
return new_df.assign(Family_ID = Family_ID,
Family_Index = Family_Index)
This returns a duplicate dataframe with a new column for each part of the multi-id.
When joining dataframes with this form of ID, as long as both dataframes have the multi index in the same format, these columns can be used with pd.merge as follows:
pd.merge(df1, df2, how= "inner", on = ["Family_ID","Family_Index"])

Related

Creating a simple for loop in R

I have a tibble called 'Volume' in which I store some data (10 columns - the first 2 columns are characters, 30 rows).
Now I want to calculate the relative Volume of every column that corresponds to Column 3 of my tibble.
My current solution looks like this:
rel.Volume_unmod = tibble(
"Volume_OD" = Volume[[3]] / Volume[[3]],
"Volume_Imp" = Volume[[4]] / Volume[[3]],
"Volume_OD_1" = Volume[[5]] / Volume[[3]],
"Volume_WS_1" = Volume[[6]] / Volume[[3]],
"Volume_OD_2" = Volume[[7]] / Volume[[3]],
"Volume_WS_2" = Volume[[8]] / Volume[[3]],
"Volume_OD_3" = Volume[[9]] / Volume[[3]],
"Volume_WS_3" = Volume[[10]] / Volume[[3]])
rel.Volume_unmod
I would like to keep the tibble structure and the labels. I am sure there is a better solution for this, but I am relative new to R so I it's not obvious to me. What I tried is something like this, but I can't actually run this:
rel.Volume = NULL
for(i in Volume[,3:10]){
rel.Volume[i] = tibble(Volume = Volume[[i]] / Volume[[3]])
}
Mockup Data
Since you did not provide some data, I've followed the description you provided to create some mockup data. Here:
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
GR = sample(LETTERS, 30, TRUE))
Volume[3:10] <- rnorm(30*8)
Solution with Dplyr
library(dplyr)
# rename columns [brute force]
cols <- c("Volume_OD","Volume_Imp","Volume_OD_1","Volume_WS_1","Volume_OD_2","Volume_WS_2","Volume_OD_3","Volume_WS_3")
colnames(Volume)[3:10] <- cols
# divide by Volumn_OD
rel.Volume_unmod <- Volume %>%
mutate(across(all_of(cols), ~ . / Volume_OD))
# result
rel.Volume_unmod
Explanation
I don't know the names of your columns. Probably, the names correspond to the names of the columns you intended to create in rel.Volume_unmod. Anyhow, to avoid any problem I renamed the columns (kinda brutally). You can do it with dplyr::rename if you wan to.
There are many ways to select the columns you want to mutate. mutate is a verb from dplyr that allows you to create new columns or perform operations or functions on columns.
across is an adverb from dplyr. Let's simplify by saying that it's a function that allows you to perform a function over multiple columns. In this case I want to perform a division by Volum_OD.
~ is a tidyverse way to create anonymous functions. ~ . / Volum_OD is equivalent to function(x) x / Volumn_OD
all_of is necessary because in this specific case I'm providing across with a vector of characters. Without it, it will work anyway, but you will receive a warning because it's ambiguous and it may work incorrectly in same cases.
More info
Check out this book to learn more about data manipulation with tidyverse (which dplyr is part of).
Solution with Base-R
rel.Volume_unmod <- Volume
# rename columns
cols <- c("Volume_OD","Volume_Imp","Volume_OD_1","Volume_WS_1","Volume_OD_2","Volume_WS_2","Volume_OD_3","Volume_WS_3")
colnames(rel.Volume_unmod)[3:10] <- cols
# divide by columns 3
rel.Volume_unmod[3:10] <- lapply(rel.Volume_unmod[3:10], `/`, rel.Volume_unmod[3])
rel.Volume_unmod
Explanation
lapply is a base R function that allows you to apply a function to every item of a list or a "listable" object.
in this case rel.Volume_unmod is a listable object: a dataframe is just a list of vectors with the same length. Therefore, lapply takes one column [= one item] a time and applies a function.
the function is /. You usually see / used like this: A / B, but actually / is a Primitive function. You could write the same thing in this way:
`/`(A, B) # same as A / B
lapply can be provided with additional parameters that are passed directly to the function that is being applied over the list (in this case /). Therefore, we are writing rel.Volume_unmod[3] as additional parameter.
lapply always returns a list. But, since we are assigning the result of lapply to a "fraction of a dataframe", we will just edit the columns of the dataframe and, as a result, we will have a dataframe instead of a list. Let me rephrase in a more technical way. When you are assigning rel.Volume_unmod[3:10] <- lapply(...), you are not simply assigning a list to rel.Volume_unmod[3:10]. You are technically using this assigning function: [<-. This is a function that allows to edit the items in a list/vector/dataframe. Specifically, [<- allows you to assign new items without modifying the attributes of the list/vector/dataframe. As I said before, a dataframe is just a list with specific attributes. Then when you use [<- you modify the columns, but you leave the attributes (the class data.frame in this case) untouched. That's why the magic works.
Whithout a minimal working example it's hard to guess what the Variable Volume actually refers to. Apart from that there seems to be a problem with your for-loop:
for(i in Volume[,3:10]){
Assuming Volume refers to a data.frame or tibble, this causes the actual column-vectors with indices between 3 and 10 to be assigned to i successively. You can verify this by putting print(i) inside the loop. But inside the loop it seems like you actually want to use i as a variable containing just the index of the current column as a number (not the column itself):
rel.Volume[i] = tibble(Volume = Volume[[i]] / Volume[[3]])
Also, two brackets are usually used with lists, not data.frames or tibbles. (You can, however, do so, because data.frames are special cases of lists.)
Last but not least, initialising the variable rel.Volume with NULL will result in an error, when trying to reassign to that variable, since you haven't told R, what rel.Volume should be.
Try this, if you like (thanks #Edo for example data):
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
GR = sample(LETTERS, 30, TRUE),
Vol1 = rnorm(30),
Vol2 = rnorm(30),
Vol3 = rnorm(30))
rel.Volume <- Volume[1:2] # Assuming you want to keep the IDs.
# Your data.frame will need to have the correct number of rows here already.
for (i in 3:ncol(Volume)){ # ncol gives the total number of columns in data.frame
rel.Volume[i] = Volume[i]/Volume[3]
}
A more R-like approach would be to avoid using a for-loop altogether, since R's strength is implicit vectorization. These expressions will produce the same result without a loop:
# OK, this one messes up variable names...
rel.V.2 <- data.frame(sapply(X = Volume[3:5], FUN = function(x) x/Volume[3]))
rel.V.3 <- data.frame(Map(`/`, Volume[3:5], Volume[3]))
Since you said you were new to R, frankly I would recommend avoiding the Tidyverse-packages while you are still learing the basics. From my experience, in the long run you're better off learning base-R first and adding the "sugar" when you're more familiar with the core language. You can still learn to use Tidyverse-functions later (but then, why would anybody? ;-) ).

Skipping last column in r with read.csv

I was on that post read.csv and skip last column in R but did not find my answer, and try to check directly in Answer ... but that's not the right way (thanks mjuarez for taking the time to get me back on track.
The original question was:
I have read several other posts about how to import csv files with
read.csv but skipping specific columns. However, all the examples I
have found had very few columns, and so it was easy to do something
like:
columnHeaders <- c("column1", "column2", "column_to_skip")
columnClasses <- c("numeric", "numeric", "NULL")
data <- read.csv(fileCSV, header = FALSE, sep = ",", col.names =
columnHeaders, colClasses = columnClasses)
All answer were good, but does not work for what I entended to do. So I asked my self and other:
And in one function, does data <- read_csv(fileCSV)[,(ncol(data)-1)]
could work?
I've tried in one line of R to get on data, all 5 of first 6 columns, so not the last one. To do so, I would like to use "-" in the number of column, do you think it's possible? How can I do that?
Thanks!
In base r it has to be 2 steps operation. Example:
> data <- read.csv("test12.csv")
> data
# 3 columns are returned
a b c
1 1/02/2015 1 3
2 2/03/2015 2 4
# last column is excluded
> data[,-ncol(data)]
a b
1 1/02/2015 1
2 2/03/2015 2
one cannot write data <- read.csv("test12.csv")[,-ncol(data)] in base r.
But if you know max number of columns in your csv (say 3 in my case) then one can write:
df <- read.csv("test12.csv")[,-3]
df
a b
1 1/02/2015 1
2 2/03/2015 2
The right hand side of an assignment is processed first so this line from the question:
data <- read.csv(fileCSV)[,(ncol(data)-1)]
is trying to use data before it is defined. Also note what the above is saying is to take only the 2nd last field. To get all but the last field:
data <- read.csv(fileCSV)
data <- data[-ncol(data)]
If you know the name of the last field, say it is lastField, then this works and unlike the code above does not read the whole file and then remove the last field but rather only reads in fields other than the last. Also it is only one line of code.
read.csv(fileCSV, colClasses = c(lastField = "NULL"))
If you don't know the name of the last field but you do know how many fields there are, say n, then either of these would work:
read.csv(fileCSV)[-n]
read.csv(fileCSV, colClasses = replace(rep(NA, n), n, "NULL"))
Another way to do it without first reading in the last field is to first read in the header and first line to calculate the number of fields (assuming that all records have the same number) and then re-read the file using that.
n <- ncol(read.csv(fileCSV, nrows = 1))
making use of one of the prior two statements involving n.
It's not possible in one line as the data variable is not yet initialized when you call it. So the command ncol(data) will trigger an error.
You would need to use two lines of code to first load your data into the data variable and then remove the last column by either using data[,-ncol(data)] or data[,1:(ncol(data)-1)].
Not a single function, but at least a single line, using dplyr (disclaimer: I never use dplyr or magrittr, so a more optimized solution must exist using these libraries)
library(dplyr)
dat = read.table(fileCSV) %>% select(., which(names(.) != names(.)[ncol(.)]))

How do I refer to defined objects using the reshape function in R?

I'm unable to get the reshape function ( stats::reshape ) to accept a reference to a defined character vector in one of its arguments. I don't know if this reflects wrong syntax on my part, a limitation of the function, or a more general issue related to how R itself operates.
I am using reshape to change data from wide to long format. I have a dataset with many repeated measures that are sorted appropriately for reshape (x.1, x.2, x.3, y.1, y.2, y.3, etc). I've defined a variable firstlastmeasure that contains the index to the first and last column of repeated measures data that needs to be processed by reshape (this is to prevent having to change the index every time columns are added or removed from the original data).
This is how it's defined (in a convoluted way):
temp0 <- subset(p, select=nameoffirstcolumn:nameoflastcolumn)
lastmeasname = names(temp0[ncol(temp0)])
firstmeasname = names(temp0[1])
firstmeasindex = grep(firstmesname,colnames(p))
lastmeasindex = grep(lastmesname,colnames(p))
firstlastmeasure <- paste(firstmesindex,lastmesindex,sep=":")
I'm using this variable as an argument to reshape's varying parameter, like so:
reshape(dataset, direction = "long", varying = firstlastmeasure)
Reshape always returns:
"Error in guess(varying) : failed to guess time-varying variables from their names".
Using the numerical index explicitly (i.e. varying = 6:34) works fine.
paste creates a string, if you look at firstlastmeasure it will be something like "6:34". If you look at 6:34 it will be a vector 6 7 8 9 ... 34. You need to define the vector, not paste together a string. (Note that subset does a bit of special processing to make : work with column names.)
If I'm interepreting your code correctly, temp0 has all the column you want, so you could just do
firstlastmeasure = names(temp0)
and be done with it. A little more complicated, you could keep you grep code and just not use paste:
firstlastmeasure = firstmeasindex:lastmeasindex
Since you are inputting names, the subset is unnecessary. Simplest of all would be to skip temp0 and do
firstlastmeasure = grep(nameoffirstcolumn, names(p)):grep(nameoflastcolumn, names(p))

Optimization of R Data.table combination with for loop function

I have a 'Agency_Reference' table containing column 'agency_lookup', with 200 entries of strings as below :
alpha
beta
gamma etc..
I have a dataframe 'TEST' with a million rows containing a 'Campaign' column with entries such as :
Alpha_xt2010
alpha_xt2014
Beta_xt2016 etc..
i want to loop through for each entry in reference table and find which string is present within each campaign column entries and create a new agency_identifier column variable in table.
my current code is as below and is slow to execute. Requesting guidance on how to optimize the same. I would like to learn how to do it in the data.table way
Agency_Reference <- data.frame(agency_lookup = c('alpha','beta','gamma','delta','zeta'))
TEST <- data.frame(Campaign = c('alpha_xt123','ALPHA345','Beta_xyz_34','BETa_testing','code_delta_'))
TEST$agency_identifier <- 0
for (agency_lookup in as.vector(Agency_Reference$agency_lookup)) {
TEST$Agency_identifier <- ifelse(grepl(tolower(agency_lookup), tolower(TEST$Campaign)),agency_lookup,TEST$Agency_identifier)}
Expected Output :
Campaign----Agency_identifier
alpha_xt123---alpha
ALPHA34----alpha
Beta_xyz_34----beta
BETa_testing----beta
code_delta_-----delta
Try
TEST <- data.frame(Campaign = c('alpha_xt123','ALPHA345','Beta_xyz_34','BETa_testing','code_delta_'))
pattern = tolower(c('alpha','Beta','gamma','delta','zeta'))
TEST$agency_identifier <- sub(pattern = paste0('.*(', paste(pattern, collapse = '|'), ').*'),
replacement = '\\1',
x = tolower(TEST$Campaign))
This will not answer your question per se, but from what I understand you want to dissect the Campaign column and do something with the values it provides.
Take a look at Tidy data, more specifically the part "Multiple variables stored in one column". I think you'll make some great progress using tidyr::separate. That way you don't have to use a for-loop.

Using grep to subset columns of a dataframe by names(dataframe)

I am trying to use grep to subset columns of a data frame with one row. When grep returns multiple columns the new data frame has the corresponding column names from the grep. When only one column is returned the column name is NULL... I am using this method because I am looping over many sites that may contain different combinations of HVAC sensor data.
I am trying to create subsets for each unit 'HVAC1', 'HVAC2', 'HVAC3' and a subset for columns that are common to all units. In this case, there is only one column that is common to all units : 'IAT' or indoor ambient temperature. Also, there is no third HVAC unit so the grep on HVAC 3 rightly returns names(sensordata.h3) as character(0).
Here is my code.
sensordata <- data.frame(sitetime = c("2015-10-22 14:15:17"), HVAC1RT = c(70.7), HVAC1ST = c(74.75), HVAC2RT = c(66.875), HVAC2ST = c(46.4), IAT = c(72.5))
sensordata
names(sensordata)
sensordata.h1 <- sensordata[,c(grep("HVAC1",names(sensordata)))]
sensordata.h1
names(sensordata.h1)
sensordata.h2 <- sensordata[,c(grep("HVAC2",names(sensordata)))]
sensordata.h2
names(sensordata.h2)
sensordata.h3 <- sensordata[,c(grep("HVAC3",names(sensordata)))]
sensordata.h3
names(sensordata.h3)
sensordata.common <- sensordata[,c(grep("IAT|OAT|IAH",names(sensordata)))]
sensordata.common
names(sensordata.common)
Try this:
sensordata.common <- sensordata[,c(grep("IAT|OAT|IAH",names(sensordata))), drop=F]
sensordata.common
IAT
1 72.5
names(sensordata.common)
[1] "IAT"
The option drop=F prevents [ to reduce the output to a vector. See ?[ (you need to use backticks around [, can't get it formatted here correctly...
Alternatively, you could use dplyr::select, as in select(sensordata.common, contains("your_names_here")). dplyr's default is to never change the output class.

Resources