R: passing column names as variables in custom function

I am quite new to R and programming in general and have been struggling with the following for a few hours now.
I am trying to create a function that will take a df and a column name as variables, filter the table based on the column name provided and print the output.
example_function <- function(df=df, col=col){
a <- df[col == 100,]
b <- filter(df, col == 100)
print(a)
print(b)
}
Using example_function(df=example_df, col='percentage') doesn't work, both variables return just the column names but no data rows (despite there being values == 100).
Using example_function(df=df, col=percentage), so percentage isn't surrounded by quotes here, I get:
Error in [.data.frame(df, col == 100, ) : object 'percentage' not found
However, when I run example_function(df=example_df, col=example_df$percentage) I get the correct result, with my dataframe returning as expected with only those rows where the example_df$percentage is equal to 100.
I really want to be able to pass the df as one variable and the column as another without having to type example_df$percentage each time as I want to be able to re-use the function for many different dataframes and typing that seems redundant.
Based on this I then modified the function thinking that I can just use df$col in the function and it will evaluate to example_df$percentage and work like it did above:
example_function <- function(df=df, col=col){
a <- df[df$col == 100,]
b <- filter(df, df$col == 100)
print(a)
print(b)
}
But now I get another error when using example_function(df=example_df, col=percentage) or when passing col='percentage':
Error in filter_impl(.data, quo) : Result must have length 19, not 0
Would anybody be able to help me fix this, or point me in the right direction to understand why what I'm doing isn't working?
Thanks so much
Here is an example of the dataframe I am using (although my real one will have more columns but I hope it won't make a difference for this example.)
name | percentage
-----------------------
tom | 80
john | 100
harry | 99
elizabeth| 100
james | 50
example_df <- structure(list(name = structure(c(5L, 4L, 2L, 1L, 3L), .Label = c("elizabeth",
"harry", "james", "john", "tom"), class = "factor"), percentage = c(80L,
100L, 99L, 100L, 50L)), .Names = c("name", "percentage"), class = "data.frame", row.names = c(NA,
-5L))
as a note, I have updated my col=names to col=percentage in this example to more accurately represent what I am doing. In my attempt to generalise the example I used col=names and now realise that it wasn't a very good example (as you quite rightly asserted that a 'name' is never likely to be numeric). The above problems still persist for me however.
** Update: I managed to get it working with the following:
example_function <- function(df=df, col=col){
a <- df[df[col] == 100,]
print(a)
}
passing example_function(df=example_df, col='percentage')

The first row of example_function should be
a <- df[df[[col]] == 100,]
When you break it down, df[['names']] == 100 gives you a logical vector marking which rows of df have a names value of 100. But 'names' == 100 compares the literal string 'names' with 100, which is nonsensical: it's always FALSE, so no rows are selected.
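As a minimal sketch of the fix above in base R: the helper name filter_equal and its value argument are hypothetical, introduced only for illustration, and the data frame is rebuilt from the table in the question.

```r
# Hypothetical helper (name and `value` argument are illustrative, not from the question).
# `col` is the column name as a string; df[[col]] extracts that column as a vector.
filter_equal <- function(df, col, value = 100) {
  stopifnot(is.character(col), length(col) == 1, col %in% names(df))
  df[df[[col]] == value, , drop = FALSE]
}

# Data frame rebuilt from the table shown in the question.
example_df <- data.frame(
  name = c("tom", "john", "harry", "elizabeth", "james"),
  percentage = c(80L, 100L, 99L, 100L, 50L)
)

result <- filter_equal(example_df, "percentage")
print(result)

# With dplyr loaded, the same idea can be written with the .data pronoun:
# dplyr::filter(df, .data[[col]] == value)
```

Passing the column name as a string keeps the function reusable across data frames, since nothing inside it refers to example_df directly.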


Recode monetary string values into new variable as numeric

First off - newbie with R so bear with me. I'm trying to recode string values as numeric. My problem is I have two different string patterns present in my values: "M" and "B" for 'million' and 'billion', respectively.
df <- data.frame(funds = c("$1.76M", "$2B", "$57M", "$9.87B"))
I've successfully knocked off the dollar sign and now have:
df <- data.frame(funds = c("$1.76M", "$2B", "$57M", "$9.87B"),
fundsR = c("1.76M", "2B", "57M", "9.87B")
)
How can I recode these as numeric while retaining their respective monetary values? I've tried using various if statements, for loops, with or without str_detect, pipe operators, case_when, mutate, etc. to isolate values with "M" and values with "B", convert to numeric and multiply to come up with the corresponding numeric value, all in a new column. This seemingly simple task turned out not as simple as I imagined it would be and I'd attribute it to being a novice. At this point I'd like to start from scratch and see if anyone has any fresh ideas. My RStudio is a MESS.
Something like this would be nice:
df <- data.frame(funds = c("$1.76M", "$2B", "$57M", "$9.87B"),
fundsR = c("1.76M", "2B", "57M", "9.87B"),
fundsFinal = c(1760000, 2000000000, 57000000, 9870000000)
)
I'd really appreciate your input.
You could create a helper function f, and then apply it to the funds column:
library(dplyr)
library(stringr)
f <- function(x) {
curr = c("M"=1e6, "B" = 1e9)
val = str_remove(x,"\\$")
as.numeric(str_remove_all(val,"B|M"))*curr[str_extract(val, "B|M")]
}
df %>% mutate(fundsFinal = f(funds))
Output:
funds fundsFinal
1 $1.76M 1.76e+06
2 $2B 2.00e+09
3 $57M 5.70e+07
4 $9.87B 9.87e+09
Input:
df = structure(list(funds = c("$1.76M", "$2B", "$57M", "$9.87B")), class = "data.frame", row.names = c(NA,
-4L))
This works but I'm sure better solutions exist. Assuming funds is a character vector:
library(tidyverse)
options(scipen = 999)
df <- data.frame(funds = c('$1.76M', '$2B', '$57M', '$9.87B'))
df = df %>%
mutate( fundsFinal = ifelse(str_sub(funds,nchar(funds),-1) =='M',
as.numeric(substr(funds, 2, nchar(funds) - 1))*10^6,
as.numeric(substr(funds, 2, nchar(funds) - 1))*10^9))
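For comparison, here is a base R sketch of the same lookup-table idea, with no packages needed. The function name parse_funds is hypothetical, and it assumes every value ends in either "M" or "B".

```r
# Hypothetical helper; assumes each string looks like "$<number>M" or "$<number>B".
parse_funds <- function(x) {
  multiplier <- c(M = 1e6, B = 1e9)            # suffix -> scale factor
  suffix <- substr(x, nchar(x), nchar(x))      # last character of each string
  number <- as.numeric(gsub("[$MB]", "", x))   # strip "$" and the suffix
  unname(number * multiplier[suffix])          # look up the factor by suffix name
}

funds <- c("$1.76M", "$2B", "$57M", "$9.87B")
funds_final <- parse_funds(funds)
print(format(funds_final, scientific = FALSE))
```

A value with any other suffix would come back as NA, because the lookup by name finds no multiplier for it.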

Accessing List Value by looking up value with same index

I was following this blog post:
https://www.robert-hickman.eu/post/dixon_coles_1/
And in a number of places, he gets a value from a list by putting in a value with the equivalent index, rather like using it as a key in a python dictionary. This would be an example in the link:
What I understand he's done is basically this:
teams <- c("a","b","c")
values <- c(1,2,3)
example_df <- data.frame(teams,values)
example_list <- as.list(example_df)
example_list$values[a]
Error: object 'a' not found
But I get an NA value for this - am I missing something ?
Thanks in advance !
Because of the way a list works in R, it is not really practical to address a list like that: the values in the separate list elements aren't associated with each other by position.
Which leads to this:
teams <- c("a","b","c")
values <- c(1,2,3)
example_df <- data.frame(teams,values)
example_list <- as.list(example_df)
#Gives NULL
example_list[example_list$teams == "a"]$values
#Gives 1, 2, 3
example_list[example_list$teams == "b"]$values
#Gives NULL
example_list[example_list$teams == "c"]$values
You can see that this wouldn't work, because the syntax you would expect to work in this case throws an error, "incorrect number of dimensions":
example_list[example_list$teams == "b", ]$values
However, it is really easy to address a data frame, or any matrix like structure in the way you want to:
teams <- c("a","b","c")
values <- c(1,2,3)
example_df <- data.frame(teams,values)
#Gives 1
example_df[example_df$teams == "a", ]$values
#Gives 2
example_df[example_df$teams == "b", ]$values
#Gives 3
example_df[example_df$teams == "c", ]$values
What I think is happening in the tutorial you shared is something else. As far as I can see, there are no names passed through to the list, but variables. It is not giving the value of a higher dimensional thing, but rather the value of the list itself.
That also makes a lot more sense, as that is what the syntax is doing. "teams[1]" simply returns the first element of the list called "teams" (even if that element is a vector or anything else), wrapped in a list of length one; teams[[1]] returns the element itself. Of course, teams[i], where i is a variable, also works. What I mean is this:
teams = list(A = 1, B = 2, C = 3, D = 4)
#Gives a one-element list: $A is 1
teams[1]
If you want to understand why one of them works and the other one doesn't, here are both together. Throw it in RStudio, and look through the Environment.
## One dimensional
teams = list(A = "a", B = "very", C = "good", D = "example")
#Gives "very"
teams[2]
## Two dimensional
teams <- c("a","b","c")
values <- c(1,2,3)
teams2 <- list(teams, values)
#Gives a one-element list holding c("a", "b", "c")
teams2[1]
#Gives a list holding NULL, because teams2 has only two elements
teams2[3]
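If the goal is a Python-dictionary-style lookup, one common R idiom is a named vector. The sketch below assumes that is what the blog post is doing, which the thread does not confirm; the variable names lookup, b_value and missing_value are illustrative.

```r
teams <- c("a", "b", "c")
values <- c(1, 2, 3)

# Without names, indexing by a string finds nothing and yields NA.
no_names <- unname(values["a"])

# Attach the team names to the values so each value can be looked up by key.
lookup <- setNames(values, teams)

lookup["b"]                 # single bracket keeps the name: b = 2
b_value <- lookup[["b"]]    # double bracket returns the bare value

# An unknown key gives NA with single brackets (double brackets would error).
missing_value <- unname(lookup["z"])
```

This also explains the NA in the question: a plain, unnamed vector has no keys to match a string against.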

How to match list of characters with partial strings in R?

I am analysing IDs from the RePEc database. Each ID matches a unique publication and sometimes publications are linked because they are different versions of each other (e.g. a working paper that becomes a journal article). I have a database of about 250,000 entries that shows the main IDs in one column and then the previous or alternative IDs in another. It looks like this:
df$repec_id <- c("RePEc:cid:wgha:353", "RePEc:hgd:wpfacu:350","RePEc:cpi:dynxce:050")
df$alt_repec_id <- c("RePEc:sii:giihdizi:heidwg06-2019|RePEc:azi:cusiihdizi:gdhs06-2019", "RePEc:tqu:vishdizi:d8z7-200x", "RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050", "RePEc:cid:wgha:353|RePEc:hgd:wpfacu:350")
I want to find out which IDs from the repec_id column are also present in the alt_repec_id column and create a dataframe that only has rows matching this condition. I tried to strsplit at "|" and use the %in% function like this:
df <- separate_rows(df, alt_repec_id, sep = "\\|")
df1 <- df[trimws(df$alt_repec_id) %in% trimws(df$repec_id), ]
df1<- data.frame(df1)
df1 <- na.omit(df1)
df1 <- df1[!duplicated(df1$repec_id),]
It works but I'm worried that by eliminating duplicate rows based on the values in the repec_id column, I'm randomly eliminating matches. Is that right?
Ultimately, I want a dataframe that only contains values in which strings in the repec_id column match the partial strings in the alt_repec_id column. Using the example above, I want the following result:
df$repec_id <- c("RePEc:cpi:dynxce:050")
df$alt_repec_id <- c("RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050")
Does anyone have a solution to my problem? Thanks in advance for your help!
Try using str_detect() from stringr to identify if the repec_id exists in the larger alt_repec_id string.
Then filter() down to where it was found. If this is not returning as expected, try looking at and posting a few examples where found_match == FALSE but the match was expected.
library(stringr)
library(dplyr)
df %>%
mutate(found_match = str_detect(alt_repec_id, repec_id)) %>%
filter(found_match == TRUE)
Here is a base R solution using grepl() + apply() + subset()
dfout <- subset(df,apply(df, 1, function(v) grepl(v[1],v[2])))
such that
> dfout
repec_id alt_repec_id
3 RePEc:cpi:dynxce:050 RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050
DATA
df <- structure(list(repec_id = structure(c(1L, 3L, 2L), .Label = c("RePEc:cid:wgha:353",
"RePEc:cpi:dynxce:050", "RePEc:hgd:wpfacu:350"), class = "factor"),
alt_repec_id = structure(c(2L, 3L, 1L), .Label = c("RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050",
"RePEc:sii:giihdizi:heidwg06-2019|RePEc:azi:cusiihdizi:gdhs06-2019",
"RePEc:tqu:vishdizi:d8z7-200x"), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
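Both answers above treat repec_id as a regular expression and accept a match anywhere inside the longer string. As a hedged alternative, the base R sketch below splits alt_repec_id on the literal "|" and checks for an exact ID match row by row. It assumes row-wise matching is what is wanted, as in the expected output; has_match is an illustrative name.

```r
# Same three rows as the DATA block, stored as character rather than factor.
df <- data.frame(
  repec_id = c("RePEc:cid:wgha:353", "RePEc:hgd:wpfacu:350", "RePEc:cpi:dynxce:050"),
  alt_repec_id = c(
    "RePEc:sii:giihdizi:heidwg06-2019|RePEc:azi:cusiihdizi:gdhs06-2019",
    "RePEc:tqu:vishdizi:d8z7-200x",
    "RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050"
  ),
  stringsAsFactors = FALSE
)

# For each row: split the alternatives on a literal "|" and test exact membership.
has_match <- mapply(
  function(id, alts) id %in% strsplit(alts, "|", fixed = TRUE)[[1]],
  df$repec_id, df$alt_repec_id,
  USE.NAMES = FALSE
)

dfout <- df[has_match, ]
print(dfout)
```

Exact matching avoids two pitfalls of pattern matching: characters in an ID being read as regex syntax, and one ID matching because it is a prefix of a longer one.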

converting dataframe columns from matrix to vector

I'm trying to combine a few dataframes. In the process of doing so, I noticed that one dataframe contains matrices rather than vectors. Here's a basic example:
df3 <- structure(list(v5 = structure(c(NA, 0), .Dim = c(2L, 1L), .Dimnames = list(
c("206", "207"), "ecbi1")), v6 = structure(c(NA, 0), .Dim = c(2L,
1L), .Dimnames = list(c("206", "207"), "ecbi2"))), .Names = c("v5",
"v6"), row.names = 206:207, class = "data.frame")
# get class
class(df3[,1])
# [1] "matrix"
I want the columns in df3 to be vectors, not matrices.
Just apply as.vector
df3[] = lapply(df3, as.vector)
I think most important is to figure out how you managed to get matrix-type columns in the first place and understand whether this was a desired behavior or a side effect of a mistake somewhere earlier.
Given where you are, you can just use c to undo a given column:
df3$v5 <- c(df3$v5)
Or if this is a problem with all columns:
df3[ ] <- lapply(df3, c)
(lapply returns a list of vectors, and when we assign a list to a data.frame, each list element is interpreted as a column; df3[ ] refers to all columns of df3. We could also do df3 <- lapply(df3, c), but that replaces df3 with a plain list, and using [ ] is more robust: if we make a mistake and somehow return the wrong number of columns, an error will be thrown, whereas simply assigning to df3 would have overwritten our data.frame silently in case of such an error.)
Lastly, if only some columns are matrix-type, we can replace only those columns like so:
mat.cols <- sapply(df3, is.matrix)
df3[mat.cols] <- lapply(df3[mat.cols], c)
As pertains to the relation between this approach and that using as.vector, from ?c:
c is sometimes used for its side effect of removing attributes except names, for example to turn an array into a vector. as.vector is a more intuitive way to do this, but also drops names.
So given that the names don't mean much in this context, c is simply a more concise approach, but the end result is practically identical.
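A small check of that difference, using a one-column matrix shaped like the v5 column from the question. The variable names here are illustrative only.

```r
# One-column matrix with dimnames, like df3$v5 in the question.
m <- matrix(c(NA, 0), ncol = 1, dimnames = list(c("206", "207"), "ecbi1"))

from_c <- c(m)                  # drops dim and dimnames
from_as_vector <- as.vector(m)  # same result for a matrix

# The two only differ when the input carries a names attribute.
named <- c(first = 1, second = 2)
names_after_c <- names(c(named))                  # names kept
names_after_as_vector <- names(as.vector(named))  # names dropped (NULL)

print(identical(from_c, from_as_vector))
```

For matrix columns there is no names attribute to preserve, so either function gives the same plain vector.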
Use this.
df3[,1] = as.vector(df3[,1])
The same procedure can be applied generally to the rest of the columns.
We can use do.call
do.call(data.frame, df3)

R data.table user defined function

I am transitioning from using data.frame in R to data.table for better performance. One of the main segments in converting code was applying custom functions from apply on data.frame to using it in data.table.
Say I have a simple data table, dt1.
x y z
1 9 j
4 1 n
7 1 n
I am trying to calculate a new column in dt1, based on the values of x, y, z.
I tried 2 ways, both of them give the correct result, but the faster one spits out a warning. So I want to make sure the warning is nothing serious before I use the faster version in converting my existing code.
(1) dt1[,a:={if((x<1) & (y>3) & (z == "n")){6} else {7}}]
(2) dt1[,a:={if((x<1) & (y>3) & (z == "n")){6} else {7}}, by = 1:nrow(dt1)]
Version 1 runs faster than version 2, but spits out a warning: "the condition has length > 1 and only the first element will be used".
But the result is good.
The second version is slightly slower but doesn't give that warning.
I wanted to make sure version one doesn't give erratic results once I start writing complicated functions.
Please treat the question as a generic one with the view to run a user defined function which wants to access different column values in a given row and calculate the new column value for that row.
Thanks for your help.
If 'x', 'y', and 'z' are the columns of 'dt1', try either the vectorized ifelse
dt1[, a:=ifelse(x<1 & y >3 & z=='n', 6, 7)]
Or create 'a' with 7, then assign 6 to 'a' based on the logical index.
dt1[, a := 7][x<1 & y >3 & z=='n', a:=6][]
Using a function
getnewvariable <- function(v1, v2, v3){
ifelse(v1 <1 & v2 >3 & v3=='n', 6, 7)
}
dt1[, a:=getnewvariable(x,y,z)][]
data
library(data.table)
df1 <- structure(list(x = c(0L, 1L, 4L, 7L, -2L), y = c(4L, 9L, 1L,
1L, 5L), z = c("n", "j", "n", "n", "n")), .Names = c("x", "y",
"z"), class = "data.frame", row.names = c(NA, -5L))
dt1 <- as.data.table(df1)
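To see why the vectorized form is the safe one, here is a sketch on plain vectors with the same values as df1, so it runs without data.table. if() looks at a single condition, so version 1 in the question decides once for the whole column (and from R 4.2.0 a condition of length greater than one is an error rather than a warning), while ifelse() evaluates the condition for every row.

```r
# Same values as the columns of df1 above, as plain vectors.
x <- c(0L, 1L, 4L, 7L, -2L)
y <- c(4L, 9L, 1L, 1L, 5L)
z <- c("n", "j", "n", "n", "n")

# One TRUE/FALSE per row.
cond <- x < 1 & y > 3 & z == "n"

# ifelse() picks 6 or 7 row by row.
a <- ifelse(cond, 6, 7)
print(a)
```

Rows 1 and 5 satisfy all three conditions, so only they get 6. Version 1 in the question can only ever give the whole column a single value.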
