I was following this blog post:
https://www.robert-hickman.eu/post/dixon_coles_1/
In a number of places, he gets a value from a list by passing in a value as the index, rather like using it as a key in a Python dictionary (there is an example of this in the linked post).
What I understand he's done is basically this:
teams <- c("a","b","c")
values <- c(1,2,3)
example_df <- data.frame(teams,values)
example_list <- as.list(example_df)
example_list$values[a]
Error: object 'a' not found
But this throws an error (and with example_list$values["a"] I just get an NA value) - am I missing something?
Thanks in advance!
The way a list works in R makes it impractical to address it like that, because the elements of a list aren't associated with each other the way the columns of a data frame are.
Which leads to this:
teams <- c("a","b","c")
values <- c(1,2,3)
example_df <- data.frame(teams,values)
example_list <- as.list(example_df)
#Gives NULL
example_list[example_list$teams == "a"]$values
#Gives 1, 2, 3
example_list[example_list$teams == "b"]$values
#Gives NULL
example_list[example_list$teams == "c"]$values
You can see that this wouldn't work, because the syntax you would expect to work in this case throws an "incorrect number of dimensions" error:
example_list[example_list$teams == "b", ]$values
However, it is really easy to address a data frame, or any matrix-like structure, in the way you want to:
teams <- c("a","b","c")
values <- c(1,2,3)
example_df <- data.frame(teams,values)
#Gives 1
example_df[example_df$teams == "a", ]$values
#Gives 2
example_df[example_df$teams == "b", ]$values
#Gives 3
example_df[example_df$teams == "c", ]$values
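As an aside, if you literally want Python-dictionary-style lookup by key, a named vector is probably the closest R idiom. A minimal sketch using the same toy data:
teams <- c("a","b","c")
values <- c(1,2,3)
#Names act as the keys, the vector entries as the values
lookup <- setNames(values, teams)
#Gives 1
lookup["a"]
#Gives 2
lookup["b"]
#Gives 1 and 3 (named)
lookup[c("a", "c")]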
What I think is happening in the tutorial you shared is something else. As far as I can see, no names are being passed to the list, only index variables. It is not looking up a value in some higher-dimensional structure, but simply returning an element of the list itself.
That also makes a lot more sense, because that is exactly what the syntax does: teams[1] simply returns the first element of the list called teams (even if that element is a vector or something else). Of course teams[i], where i is a variable, also works. What I mean is this:
teams = list(A = 1, B = 2, C = 3, D = 4)
#Gives the first element, $A (which is 1)
teams[1]
If you want to understand why one of them works and the other one doesn't, here is both together. Throw it in RStudio, and look through the Environment.
## One dimensional
teams = list(A = "a", B = "very", C = "good", D = "example")
#Gives "very"
teams[2]
## Two dimensional
teams <- c("a","b","c")
values <- c(1,2,3)
teams2 <- list(teams, values)
#Gives "a, b, c"
teams2[1]
#Gives NULL
teams2[3]
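One more detail worth spelling out here: single brackets [ on a list return a sub-list, while double brackets [[ return the element itself. A quick sketch with the same objects:
teams <- c("a","b","c")
values <- c(1,2,3)
teams2 <- list(teams, values)
#Gives a list of length 1 containing c("a","b","c")
teams2[1]
#Gives the character vector itself: "a" "b" "c"
teams2[[1]]
#Gives "b" - indexing into the vector stored inside the list
teams2[[1]][2]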
I tried to use a function in R for efficiency, but I get different results, or no result at all.
When run directly, the result is:
data1$CI_allergy <- str_extract(data1$CUR_ILL, "allergy")
data1$CI_allergy <- ifelse(data1$CI_allergy == "allergy", 1, 0)
data1$CI_allergy[is.na(data1$CI_allergy)] <- 0
data1$CI_allergy <- ifelse(data1$CI_allergy == 0, "N", "Y")

table(data1$CI_allergy)
      N       Y
2714383   21642
However, when the function is used:
CI_variable <- function(arg1, arg2) {
  data1$arg1 <- str_extract(data1$CUR_ILL, 'arg2')
  data1$arg1 <- ifelse(data1$arg1 == 'arg2', 1, 0)
  data1$arg1[is.na(data1$arg1)] <- 0
  data1$arg1 <- ifelse(data1$arg1 == 0, "N", "Y")
  return(table(data1$arg1))
}
CI_variable(CI_allergy, allergy)
N
2736025
I am guessing the problem is in the str_extract call inside CI_variable, but I'm not sure.
Has anyone had a similar problem and solved it?
Since the original code includes str_extract, which is part of the tidyverse, here is an alternative approach.
First, some toy data (see how to make a reproducible example).
library(tidyverse)
df <- tribble(
  ~Cur_ILL,
  "something bad",
  "something allergy",
  "darkside",
  NA_character_
)
Then we can use several features of the tidyverse to get (dynamic) summary statistics like so
get_CI <- function(data, col, type){
  data %>%
    count("has_{type}" := ifelse(str_detect({{ col }}, type) %in% T, "Y", "N"))
}
get_CI(df, Cur_ILL, "allergy")
  has_allergy     n
  <chr>       <int>
1 N               3
2 Y               1
Explanation:
count is a shortcut for computing the number of occurrences per group (here "Y" and "N"). Its output is a data.frame, which is a bit easier to work with than a table for most use cases.
the walrus operator := to work with glue-package-style variable names. Here that is "has_{type}", which inserts the type argument into the column name. This makes it easier to distinguish between the resulting tables.
the embrace operator {{ }} as a shortcut for inserting the column name passed as an argument
x %in% T to convert NA to FALSE
Finally, an explicit return statement is not required.
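To see the dynamic naming in action, the same function can be reused with a different keyword (my own example on the toy df above; the counts are what I would expect from that data, not output copied from the answer):
#The summary column is named after the keyword
get_CI(df, Cur_ILL, "bad")
#Expected: a has_bad column with N = 3 and Y = 1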
I've got some poorly structured data I am trying to clean. I have a list of keywords I can use to extract data frames from a CSV file. My raw data is structured roughly as follows:
There are 7 columns with values. The first column is a string identifier, like a credit rating or a country symbol (for FX data), while the other 6 columns hold either a header such as a percentage-change string (e.g. +10%) or just a numerical value. Since all this data is lumped together, I want to be able to extract the data for each category. For instance, I'd like to extract all the rows between my "credit" keyword and my "FX" keyword in the first column. Is there a way to do this easily in either base R or dplyr?
e.g.
df %>%
  filter(column1 = in_between("credit", "FX"))
Sample dataframe:
row 1: c('random', '-1%', '0%', '1%', '2%')
row 2: c('credit', NA, NA, NA, NA)
row 3: c('AAA', 1, 2, 3, 4)
...
row n: c('FX', '-1%', '0%', '1%', '2%')
And I would want the following output:
row 1: c('credit', '-1%', '0%', '1%', '2%')
row 2: c('AAA', 1, 2, 3, 4)
...
row n-1: ...
If I understand correctly you could do something like
start <- which(df$column1 == "credit")
end <- which(df$column1 == "FX")
df[start:(end-1), ]
Of course this won't work if "credit" or "FX" is in the column more than once.
Using what Brian suggested:
in_between <- function(df, start, end){
  return(df[start:(end - 1), ])
}
Then loop over the indices in
dividers <- which(df$column1 %in% keywords)
and save the function outputs however one would like:
lapply(1:(length(dividers)-1), function(x) in_between(df, start = dividers[x], end = dividers[x+1]))
This works. The data is messy, so I still have the annoying case where I need to keep the offset rows.
I'm still not 100% sure what you are trying to accomplish, but does this do what you need it to?
library(dplyr)

set.seed(1)
df <- data.frame(
  x = sample(LETTERS[1:10]),
  y = rnorm(10),
  z = runif(10)
)
start <- c("C", "E", "F")

df2 <- df %>%
  mutate(start = x %in% start,
         group = cumsum(start))

split(df2, df2$group)
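To connect this back to the question, here is a sketch of my own (with made-up toy rows, not the OP's data) applying the same cumsum trick to data shaped like the question, where the keywords sit in column1:
library(dplyr)

#Toy rows shaped like the question (hypothetical values)
raw <- data.frame(
  column1 = c("random", "credit", "AAA", "BBB", "FX", "USD"),
  v1 = c("-1%", NA, "1", "2", "-1%", "3"),
  stringsAsFactors = FALSE
)
keywords <- c("credit", "FX")

#Every keyword row starts a new group; split() then returns one block per keyword
blocks <- raw %>%
  mutate(group = cumsum(column1 %in% keywords))
split(blocks, blocks$group)
#group 0 is everything before "credit", group 1 the credit block, group 2 the FX block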
I want to replace certain values in a data frame column with values from a lookup table. The lookup table is a list, stuff.kv; most of the column's values have an entry in it, but some may not.
stuff.kv <- list()
stuff.kv[["one"]] <- "thing"
stuff.kv[["two"]] <- "another"
#etc
I have a dataframe, df, which has multiple columns (say 20) with assorted names. I want to replace the contents of the column named stuff with the corresponding values from the lookup list.
I have tried building various apply methods, but nothing has worked.
I built a function, which processes a list of items and returns the mutated list:
stuff.lookup <- function(x) {
  for (n in 1:length(x)) {
    if (!is.null(stuff.kv[[x[n]]])) x[n] <- stuff.kv[[x[n]]]
  }
  return(x)
}
unlist(lapply(df$stuff, stuff.lookup))
The apply syntax is bedeviling me.
Since you made such a nice lookup table, you can just use it to change the values. No loops or apply needed.
## Sample Data
set.seed(1234)
DF = data.frame(stuff = sample(c("one", "two"), 8, replace=TRUE))
## Make the change
DF$stuff = unlist(stuff.kv[DF$stuff])
DF
    stuff
1   thing
2 another
3 another
4 another
5 another
6 another
7   thing
8   thing
Below is a more general solution, building on @G5W's answer, since that one doesn't cover the case where your original data frame has values that don't exist in the lookup table (which would result in a length-mismatch error):
library(dplyr)
stuff.kv <- list(one = "another", two = "thing")
df <- data_frame(
  stuff = rep(c("one", "two", "three"), each = 3)
)

df <- df %>%
  mutate(stuff = paste(stuff.kv[stuff]))
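If you would rather keep unmatched values unchanged (paste() turns missing keys into the literal string "NULL"), here is one possible sketch of my own, not from the answers above, starting from df as defined above before the mutate step:
lookup <- unlist(stuff.kv)        #Named character vector: one, two
matched <- lookup[df$stuff]       #NA wherever the key has no entry
df$stuff <- ifelse(is.na(matched), df$stuff, unname(matched))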
Is there an easier (i.e. one line of code instead of two!) way to do the following:
results <- as.data.frame(str_split_fixed(c("SampleID_someusefulinfo.countsA" , "SampleID_someusefulinfo.countsB" , "SampleID_someusefulinfo.counts"), "\\.", n=2))
names(results) <- c("a", "b")
Something like:
results <- data.frame(str_split_fixed(c("SampleID_someusefulinfo.countsA" , "SampleID_someusefulinfo.countsB" , "SampleID_someusefulinfo.counts"), "\\.", n=2), colnames = c("a", "b"))
I do this a lot, and would really love to have a way to have this in one line of code.
(data.table works too, if it's easier to do there than in a base data.frame)
Clarifying:
My expected output (which is achieved by running the two lines of code at the top; I just want it to be one line, that's it!) is a data frame with this structure:
results
                        a       b
1 SampleID_someusefulinfo countsA
2 SampleID_someusefulinfo countsB
3 SampleID_someusefulinfo  counts
What I would like to do is:
CREATE the data frame from a matrix or with some content (for example the toy code matrix(c(1,2,3,4), nrow=2, ncol=2) I provided in my first example)
SPECIFY in that same line what I would like the column names of my data frame to be
Use setNames() around a data.frame
setNames(data.frame(matrix(c(1,2,3,4),nrow=2,ncol=2)), c("a","b"))
# a b
#1 1 3
#2 2 4
?setNames:
a convenience function that sets the names on an object and returns the object
> setNames
function (object = nm, nm)
{
names(object) <- nm
object
}
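Applied to the str_split_fixed call from the question, the same pattern gives a single expression (assuming stringr is loaded for str_split_fixed):
library(stringr)
results <- setNames(
  as.data.frame(str_split_fixed(
    c("SampleID_someusefulinfo.countsA",
      "SampleID_someusefulinfo.countsB",
      "SampleID_someusefulinfo.counts"),
    "\\.", n = 2)),
  c("a", "b"))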
We can use the dimnames argument of matrix(), as the OP was using matrix() to create the data.
data.frame(matrix(1:4, 2, 2, dimnames=list(NULL, c("a", "b"))))
Or
`colnames<-`(data.frame(matrix(1:4, 2, 2)), c('a', 'b'))
Take the simple data as:
A <- (1:100)
B <- (4:103)
C <- (100:199)
D <- (1000:1099)
df <- data.frame(A,B,C,D)
unique_set <- c('B','C')
Now it is simple to create a subset of df accounting for the unique_set variables, using
df[unique_set]
But let's say I want only a specific row of numbers, or more specifically, the row for a specific A value. If we try this, an error occurs:
df[unique_set][df$A == 78]
Or
df[unique_set & df$A == 78]
So what I want it to output is what this returns:
df[unique_set][78, ]
While A is in sequential order this code works, but I want to know how the user can set specific conditions (i.e. the A value) while accounting for our unique_set requirement at the same time.
Must one include A with the unique_set command?
Basically, data.frame subsetting looks like:
df[condition on rows, condition on columns]
So in your case, you want to select all the rows where column A is 78 and at the same time select only the columns specified in unique_set:
df[df$A == 78, unique_set]
Try to play with these examples:
df[df$A == 78, c("B", "C")]
df[df$A == 78, c(2, 3)]
df[78, c(1:3)]
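For completeness (my addition, not part of the answer above), base R's subset() expresses the same row-and-column selection in a single call:
#Rows where A is 78, keeping only the columns named in unique_set
subset(df, A == 78, select = unique_set)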