Paste string values from df column into a function - r

I have a dataset in R organized like so:
x freq
1 PRODUCT10000 6
2 PRODUCT10001 20
3 PRODUCT10002 11
4 PRODUCT10003 4
5 PRODUCT10004 1
6 PRODUCT10005 2
Then, I have a function like
fun <- function(number, df1, string, df2) {
  NormC <- as.numeric(df1[string, "normc"])
  df2$NormC <- rep(NormC)
}
How can I iterate through my df and insert each value of "x" into the function?
I think the problem is that this part of the function (which has 4 input variables) is structured like so: NormC <- as.numeric(df1[string, "normc"])

As explained by @duckmayr, you don't need to iterate through column x, because the function can take the whole column at once. Here is an example that creates a new variable.
df <- read.table(text = " x freq
1 PRODUCT10000 6
2 PRODUCT10001 20
3 PRODUCT10002 11
4 PRODUCT10003 4
5 PRODUCT10004 1
6 PRODUCT10005 2", header = TRUE)
fun <- function(string){paste0(string, "X")} # example
# option 1
df$new.col1 <- fun(df$x) # see duckmayr's comment
# option 2
library(data.table)
setDT(df)[, new.col2 := fun(x)]
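The row-name lookup from the question can be vectorised in the same way. This is a minimal sketch, assuming a hypothetical lookup table df1 whose row names are the product codes and which has a normc column; the normc values below are invented for illustration and do not come from the question.

```r
# Product table from the question (first three rows only)
df <- data.frame(x = c("PRODUCT10000", "PRODUCT10001", "PRODUCT10002"),
                 freq = c(6, 20, 11),
                 stringsAsFactors = FALSE)

# Hypothetical lookup table: row names are product codes, values are made up
df1 <- data.frame(normc = c(0.5, 1.2, 0.8),
                  row.names = c("PRODUCT10001", "PRODUCT10000", "PRODUCT10002"))

# Indexing by a character vector looks up every row name in a single call.
# as.character() guards against x being a factor, which would index by code.
df$NormC <- as.numeric(df1[as.character(df$x), "normc"])
df
```

Because the lookup goes by row name, the order of rows in df1 does not need to match the order in df.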


How to Efficiently Populate R Dataframe with Lookup Values from Another Dataframe [duplicate]

I have a question regarding efficiently populating an R dataframe based on data retrieved from another dataframe.
So my input typically looks like:
dfInput <- data.frame(start = c(1,6,17,29), end = c(5,16,28,42), value = c(1,2,3,4))
start end value
1 5 1
6 16 2
17 28 3
29 42 4
I want to find the min and max values in cols 1 and 2 and create a new dataframe with a row for each value in that range:
rangeMin <- min(dfInput$start)
rangeMax <- max(dfInput$end)
dfOutput <- data.frame(index = c(rangeMin:rangeMax), value = 0)
And then populate it with the appropriate "values" from the input dataframe:
for (i in seq(nrow(dfOutput))) {
  lookup <- dfOutput[i, "index"]
  dfOutput[i, "value"] <- dfInput[which(dfInput$start <= lookup &
                                        dfInput$end >= lookup), "value"]
}
This for-loop achieves what I want to do, but it feels like this is a very convoluted way to do it.
Is there a way that I can do something like:
dfOutput$value <- dfInput[which(dfInput$start <= dfOutput$index &
dfInput$end >= dfOutput$index),"value"]
Or something else to populate the values when instantiating dfOutput.
I feel like this is pretty basic but I'm new to R, so many thanks for any help!
You can create a sequence between start and end :
library(dplyr)
dfInput %>%
mutate(index = purrr::map2(start, end, seq)) %>%
tidyr::unnest(index) %>%
select(-start, -end)
# A tibble: 42 x 2
# value index
# <dbl> <int>
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 5
# 6 2 6
# 7 2 7
# 8 2 8
# 9 2 9
#10 2 10
# … with 32 more rows
In base R :
do.call(rbind, Map(function(x, y, z)
data.frame(index = x:y, value = z), dfInput$start, dfInput$end, dfInput$value))
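When the ranges are sorted, contiguous and non-overlapping, as they are in the question's input, rep() alone may be enough. This is a sketch under that assumption, not a general replacement for the answers above; n is a helper variable introduced here.

```r
dfInput <- data.frame(start = c(1, 6, 17, 29),
                      end   = c(5, 16, 28, 42),
                      value = c(1, 2, 3, 4))

# Number of index positions covered by each input row
n <- dfInput$end - dfInput$start + 1

# Assumes the ranges are sorted, contiguous and non-overlapping,
# so repeating each value n times lines up with the full index
dfOutput <- data.frame(index = min(dfInput$start):max(dfInput$end),
                       value = rep(dfInput$value, times = n))
head(dfOutput)
```

If the ranges had gaps or overlaps, the two columns would have different lengths and data.frame() would stop with an error, so the Map() or map2() versions are the safer choice in that case.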

Find objects in list that contain values from vector in R

My data frame
set.seed(1)
df <- data_frame(col1 = c(1:49), col2 = sample(c(0:20), 49, replace = T))
My list
fields <- list(A = c(2:4, 12:16, 24:28, 36:40, 48:49),
B = c(6:10, 18:22, 30:34, 42:46))
I would like to create a new column that contains the name of the (vector) object in fields, which contains the number in df$col1
I have created a conditional for loop over fields:
col1 <- df$col1
for (i in col1) {
if (col1[i] %in% fields[[1]] == T) {
col1[i] <- names(fields)[1]
} else if (col1[i] %in% fields[[2]] == T) {
col1[i] <- names(fields)[2]
}
}
Although this works, and I can then assign the resulting new vector col1 to my data frame, this doesn't seem very efficient to me- especially because I also have lists with more objects.
The reason why I want to do this: I would like to use ggplot and dplyr to group and summarise the observations according to their position in my lists (fields, but also other lists). I hope it is clear from my question what I intend to do. Thanks!
EDIT
I have created a more generalised function that contains a nested for-loop
find_object <- function(x, list) {
for (j in 1:length(list)) {
for (i in 1:length(x)) {
if (x[i] %in% list[[j]] == TRUE) {
x[i] <- names(list)[j]
}
}
}
x
}
find_object(col1, fields)
That is more or less what I want - but this is a nested for loop, and I have heard that this is bad... Does anyone have a better solution??
Thanks
A better way is to transform the list to data.frame and then do a join/merge:
library(dplyr)
fields.df <- stack(fields) %>% mutate(ind = as.character(ind))
df %>% left_join(fields.df, by = c('col1' = 'values'))
# col1 col2 ind
# <int> <int> <chr>
# 1 1 5 <NA>
# 2 2 7 A
# 3 3 12 A
# 4 4 19 A
# 5 5 4 <NA>
# 6 6 18 B
# 7 7 19 B
# 8 8 13 B
# 9 9 13 B
# 10 10 1 B
Note: I use left_join from dplyr because you are using data_frame. The base R merge should also work.
Another way would be to use match() after creating a data frame with stack().
library(dplyr)
foo <- stack(fields)
mutate(df, whatever = foo$ind[match(df$col1, foo$values)])
col1 col2 whatever
<int> <int> <fctr>
1 1 5 <NA>
2 2 7 A
3 3 12 A
4 4 19 A
5 5 4 <NA>
6 6 18 B
7 7 19 B
8 8 13 B
9 9 13 B
10 10 1 B
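A base R variant of the same lookup idea uses a named vector instead of a data frame. This is a sketch that assumes no number appears in more than one element of fields (if one did, indexing by name would return the first match); lookup and field are names introduced here.

```r
fields <- list(A = c(2:4, 12:16, 24:28, 36:40, 48:49),
               B = c(6:10, 18:22, 30:34, 42:46))
col1 <- 1:49

# Named character vector: names are the numbers, values are the list names
lookup <- setNames(rep(names(fields), lengths(fields)), unlist(fields))

# Index by name; numbers that are in no list come back as NA
field <- unname(lookup[as.character(col1)])
head(field, 10)
```

The result can be assigned to a new column, for example df$field <- field, and the same two lines work for a list with any number of elements.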

R Dataframe comparison which, scaling bad

The idea is to extract the position of each character value in one data frame within a reference vector, for example:
L<-LETTERS[1:25]
A<-c(1:25)
df<-data.frame(L,A)
Compare<-c(LETTERS[sample(1:25, 25)])
df[] <- lapply(df, as.character)
for (i in 1:nrow(df)){
df[i,1]<-which(df[i,1]==Compare)
}
head(df)
L A
1 14 1
2 12 2
3 2 3
This works well but scales very badly, like all for loops. Any ideas with apply or dplyr?
Thanks
Just use match
Your data (use set.seed when providing data using sample)
df <- data.frame(L = LETTERS[1:25], A = 1:25)
set.seed(1)
Compare <- LETTERS[sample(1:25, 25)]
Solution
df$L <- match(df$L, Compare)
head(df)
# L A
# 1 10 1
# 2 23 2
# 3 12 3
# 4 11 4
# 5 5 5
# 6 21 6
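To see what match() returns, here is a small hand-checkable sketch; the vectors Compare and L are toy values chosen for illustration, not the sampled data above.

```r
Compare <- c("C", "A", "D", "B")
L <- c("A", "B", "C", "Z")

# Position of each element of L within Compare, computed in one call.
# Values with no match give NA rather than the zero-length result
# that which() would return.
pos <- match(L, Compare)
pos
```

Here "A" is the second element of Compare, "B" the fourth and "C" the first, while "Z" does not occur and yields NA.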

Reshaping count-summarised data into long form in R [duplicate]

Embarrassingly basic question, but if you don't know.. I need to reshape a data.frame of count summarised data into what it would've looked like before being summarised. This is essentially the reverse of {plyr} count() e.g.
> (d = data.frame(value=c(1,1,1,2,3,3), cat=c('A','A','A','A','B','B')))
value cat
1 1 A
2 1 A
3 1 A
4 2 A
5 3 B
6 3 B
> (summry = plyr::count(d))
value cat freq
1 1 A 3
2 2 A 1
3 3 B 2
If you start with summry what is the quickest way back to d? Unless I'm mistaken (very possible), {Reshape2} doesn't do this..
Just use rep:
summry[rep(rownames(summry), summry$freq), c("value", "cat")]
# value cat
# 1 1 A
# 1.1 1 A
# 1.2 1 A
# 2 2 A
# 3 3 B
# 3.1 3 B
A variation of this approach can be found in expandRows from my "SOfun" package. If you had that loaded, you would be able to simply do:
expandRows(summry, "freq")
There is a good table to dataframe function on the R cookbook website that you can modify slightly. The only modifications were changing 'Freq' -> 'freq' (to be consistent with plyr::count) and making sure the rownames were reset as increasing integers.
expand.dft <- function(x, na.strings = "NA", as.is = FALSE, dec = ".") {
  # Take each row in the source data frame table and replicate it
  # using the freq value
  DF <- sapply(1:nrow(x),
               function(i) x[rep(i, each = x$freq[i]), ],
               simplify = FALSE)
  # Take the above list and rbind it to create a single DF
  # Also subset the result to eliminate the freq column
  DF <- subset(do.call("rbind", DF), select = -freq)
  # Now apply type.convert to the character coerced factor columns
  # to facilitate data type selection for each column
  for (i in 1:ncol(DF)) {
    DF[[i]] <- type.convert(as.character(DF[[i]]),
                            na.strings = na.strings,
                            as.is = as.is, dec = dec)
  }
  row.names(DF) <- seq(nrow(DF))
  DF
}
expand.dft(summry)
value cat
1 1 A
2 1 A
3 1 A
4 2 A
5 3 B
6 3 B
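The rep() approach can also produce clean row names without a helper function. This is a minimal base R sketch that rebuilds summry by hand so it runs on its own; it indexes by row position rather than by row name.

```r
summry <- data.frame(value = c(1, 2, 3),
                     cat   = c("A", "A", "B"),
                     freq  = c(3, 1, 2))

# Repeat each row position freq times and keep only the data columns
d <- summry[rep(seq_len(nrow(summry)), times = summry$freq), c("value", "cat")]

# Reset row names to 1..n instead of 1, 1.1, 1.2, ...
rownames(d) <- NULL
d
```

Rows with a freq of 0 are simply dropped, since rep() repeats them zero times.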

Replace values in selected columns by passing column name of data.frame into apply() or plyr function

Suppose I have a data.frame like:
df <- data.frame(a=1:5, b=sample(1:5, 5, replace=TRUE), c=5:1)
df
a b c
1 1 4 5
2 2 3 4
3 3 5 3
4 4 2 2
5 5 1 1
and I need to replace every 5 with NA in columns b and c, then assign the result back to df:
df
a b c
1 1 4 NA
2 2 3 4
3 3 NA 3
4 4 2 2
5 5 1 1
But I want to use a generic apply() function instead of calling replace() on each column one by one, because there are actually many variables that need replacing in the real data. Suppose I've defined a vector of column names:
var <- c("b", "c")
and come up with something like:
df <- within(df, sapply(var, function(x) x <- replace(x, x==5, NA)))
but nothing happens. I was thinking if there is a way to work this out with something similar to the above by passing a variable list of column names from a data.frame into a generic apply / plyr function (or maybe some other completely different ways). Thanks~
You could just do
df[,var][df[,var] == 5] <- NA
df <- data.frame(a=1:5, b=sample(1:5, 5, replace=TRUE), c=5:1)
df
var <- c("b","c")
df[,var] <- sapply(df[,var],function(x) ifelse(x==5,NA,x))
df
I find the ifelse notation easier to understand here, but most R users would probably use indexing instead.
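The attempt in the question does nothing because sapply(var, ...) iterates over the strings "b" and "c" rather than the columns, and its result is never assigned. A sketch that keeps replace() and passes the column names explicitly could look like this; the values in b are fixed here instead of sampled so the result is reproducible.

```r
df <- data.frame(a = 1:5, b = c(4, 3, 5, 2, 1), c = 5:1)
var <- c("b", "c")

# lapply() walks over the selected columns and returns a list of modified
# columns, which is assigned back into the same columns by name
df[var] <- lapply(df[var], function(x) replace(x, x == 5, NA))
df
```

Using df[var] with lapply() keeps each column's own type, whereas sapply() first combines the columns into a matrix.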
