Suppose I have the following data frame:
m <- data.frame(a = c(".","1",2:10),
b = c(".","2",4:12),
c = c(rep(".",11)))
I use apply to get the max value of each row:
maxrowval <- apply(m,1,max)
fin <- cbind(m,maxrowval)
The problem is that rows 9 and 10 of fin do not give the max values.
I must be missing something here but can't pinpoint the source of the problem. Maybe it has something to do with factors and their levels. Any help is appreciated.
Because of the "." entries, every column is character (or factor), so max() compares the values as strings and, for example, "9" sorts after "10". The following combines that conversion issue, mentioned in the comments, with the max function and removes the -Inf that max() returns for rows with no numeric values.
foo <- function(x){
  # coerce factor/character values to numeric; "." becomes NA
  tmp <- max(as.numeric(as.character(x)), na.rm = TRUE)
  # max() returns -Inf when every value in the row is NA
  if (is.infinite(tmp)) NA else tmp
}
apply(m, 1, foo)
[1] NA 2 4 5 6 7 8 9 10 11 12
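Another option is to fix the column types once, up front, instead of inside the row-wise function. A minimal sketch, assuming "." is meant to mark a missing value:
# Convert every column to numeric ("." becomes NA, with a coercion warning),
# then take row-wise maxima, returning NA for rows with no numeric values.
m_num <- as.data.frame(lapply(m, function(col) as.numeric(as.character(col))))
maxrowval <- apply(m_num, 1, function(r) if (all(is.na(r))) NA else max(r, na.rm = TRUE))
fin <- cbind(m, maxrowval)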
I have a data frame with two string variables, and would like to convert them to numeric values using a separate "key" data frame. The example below is simplified, but I need to be able to apply it to replace the contents of the V1 and V2 variables based on an arbitrary key that will not always be a = 1, b = 2, etc.
Example:
set.seed(1)
x <- data.frame(
V1 = sample((letters), 10, replace=TRUE),
V2 = sample((letters), 10, replace=TRUE)
)
key <- data.frame(letters, 1:26)
I need to reference the first element of V1 against the key, replace with the according value (e.g. a = 1, b = 2, etc.), do the same for the second element, and then when done with V1 move on and do the same for V2.
I've been struggling to work out a solution using lapply() and sub() but keep getting stuck, because I can't see a way to give sub() more than a 1:1 comparison. Is there a different function I should be using?
Forgive me- I'm sure the solution must be simple but I'm quite new to R still.
Here are two base R approaches:
using sapply()
x[] <- with(key, sapply(x, function(v) values[match(v,letters)]))
or
x <- data.frame(with(key, sapply(x, function(v) values[match(v,letters)])))
using as.matrix() (similar to the unlist() approach by @Ronak Shah)
x[] <- with(key, values[match(as.matrix(x),letters)])
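A variant of the same base R idea is to build a named lookup vector once and index it (a sketch, assuming the letters/values columns of key shown in the data block below):
lookup <- setNames(key$values, key$letters)   # names are letters, values are 1:26
x[] <- lapply(x, function(v) lookup[as.character(v)])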
You can create a lookup table with data.table and then apply the mapping along the columns of your data frame with apply:
library(data.table)
key <- data.table(letters = letters, value = 1:26, key = "letters")
x[] <- apply(x, 2, function(v) key[v]$value)
x
#    V1 V2
# 1  25  1
# 2   4 21
# 3   7 21
# 4   1 10
# 5   2 22
# 6  23 14
# 7  11 10
# 8  14  7
# 9  18  9
# 10 19 15
You could unlist and match in base R
x[] <- key$values[match(unlist(x), key$letters)]
x
# V1 V2
#1 25 1
#2 4 21
#3 7 21
#4 1 10
#5 2 22
#6 23 14
#7 11 10
#8 14 7
#9 18 9
#10 19 15
Or using dplyr
library(dplyr)
x %>% mutate_all(~key$values[match(., key$letters)])
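mutate_all() has since been superseded; with dplyr >= 1.0.0 the same mapping can be written with across() (a sketch):
x %>% mutate(across(everything(), ~ key$values[match(.x, key$letters)]))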
data
set.seed(1)
x <- data.frame(
V1 = sample((letters), 10, replace=TRUE),
V2 = sample((letters), 10, replace=TRUE)
)
key <- data.frame(letters = letters, values = 1:26)
You could also use apply with both row and column margins, e.g., as.data.frame(apply(x, c(1, 2), function(l) key[key$letters == l, 2])).
From ?dplyr::bind_cols:
This is an efficient implementation of the common pattern of do.call(rbind, dfs) or do.call(cbind, dfs) for binding many data frames into one
However, with example data:
tmp_df1 <- data.frame(a = 1)
tmp_df2 <- data.frame(b = c(-2, 2))
tmp_df3 <- data.frame(c = runif(10))
The command do.call(cbind, list(tmp_df1, tmp_df2, tmp_df3)) produces:
a b c
1 1 -2 0.8473307
2 1 2 0.8031552
3 1 -2 0.3057430
4 1 2 0.6344999
5 1 -2 0.7870753
6 1 2 0.9453199
7 1 -2 0.6642231
8 1 2 0.9708049
9 1 -2 0.7189576
10 1 2 0.9217087
That is, rows of tmp_df1 and tmp_df2 are recycled to match the number of rows in tmp_df3.
In dplyr:
> bind_cols(tmp_df1, tmp_df2, tmp_df3)
Error in eval(substitute(expr), envir, enclos) :
incompatible number of rows (2, expecting 1)
The reason I want to do something like this is that I am in a situation similar to the one below:
df_normal_param <- data.frame(mu = rnorm(10), sigma = runif(10))
df_normal_sample_list <- lapply(1:10, function(i)
  with(df_normal_param,
       data.frame(sam = rnorm(100, mu[i], sigma[i]))))
and I wish to attach the arguments used to create each entry of df_normal_sample_list to the outputs, e.g.
df_normal_sample_list <- lapply(1:10, function(i)
cbind(df_normal_param[i,], df_normal_sample_list[[i]]))
You argue in a comment that this behavior is safe; I strongly disagree. It seems safe for this very particular case, but it is likely to cause you problems somewhere down the road. That is why I believe the answer to your stated question ("Is there a way to get dplyr's bind_cols to expand the number of rows like in cbind?") is a simple no, and it was probably built that way intentionally.
Instead, I would suggest that you be more explicit in your approach, and just add the columns you want right as you build the data you are creating. For example, you could include that step right in your call (here using apply to clarify what is going where)
df <- data.frame(mu = rnorm(3), sigma = runif(3))
df_normal_sample_list <- apply(df, 1, function(x){
data.frame(
mu = x["mu"]
, sigma = x["sigma"]
, sam = rnorm(3, x["mu"], x["sigma"])
)
})
Returns
[[1]]
mu sigma sam
1 -0.6982395 0.1690402 -0.592286
2 -0.6982395 0.1690402 -0.516948
3 -0.6982395 0.1690402 -0.804366
[[2]]
mu sigma sam
1 -1.698747 0.2597186 -1.830950
2 -1.698747 0.2597186 -2.087393
3 -1.698747 0.2597186 -1.961376
[[3]]
mu sigma sam
1 0.9913492 0.3069877 0.9629801
2 0.9913492 0.3069877 1.2279697
3 0.9913492 0.3069877 1.1222780
Then, instead of binding the columns, then the rows, you can just bind the rows at the end (also from dplyr)
bind_rows(df_normal_sample_list)
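One caveat with the apply() step above is that it first coerces the data frame to a matrix. A sketch of the same idea with Map(), which keeps the columns as they are (same column names as above):
df_normal_sample_list <- Map(
  function(mu, sigma) data.frame(mu = mu, sigma = sigma, sam = rnorm(3, mu, sigma)),
  df$mu, df$sigma
)
bind_rows(df_normal_sample_list)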
I'm trying to loop this sequence of steps in R for a data frame.
Here is my data:
ID Height Weight
a     100     80
b      80     90
c      NA     70
d     120     NA
....
Here is my code so far
winsorize2 <- function(x) {
  Min <- which(x == min(x))
  Max <- which(x == max(x))
  ord <- order(x)
  x[Min] <- x[ord][length(Min) + 1]
  x[Max] <- x[ord][length(x) - length(Max)]
  x
}
df<-read.csv("data.csv")
df2 <- scale(df[,-1], center = TRUE, scale = TRUE)
id<-df$Type
full<-data.frame(id,df2)
full[is.na(full)] <- 0
full[, -1] <- sapply(full[,-1], winsorize2)
What I'm trying to do is this: standardize a data frame, then winsorize the standardized data frame using the function winsorize2, i.e. replace the most extreme values with the next most extreme value. This is then repeated 10 times. How do I write a loop for this? I'm also confused because I've already replaced the NAs with 0s in the sequence, so should I remove that step from the loop?
Edit: after discussion with @ekstroem, we decided to change the code to introduce boundaries:
df<-read.csv("data.csv")
id<-df$Type
df2<- scale(df[,-1], center = TRUE, scale = TRUE)
df2[is.na(df2)] <- 0
df2[df2<=-3] = -3
df2[df2>=3] = 3
df3<-df2 #trying to loop again
df3<- scale(df3, center = TRUE, scale = TRUE)
df3[is.na(df3)] <- 0
df3[df3<=-3] = -3
df3[df3>=3] = 3
There are some boundary issues that are not fully specified in your code, but maybe the following can be used (base R, and not super efficient):
wins2 <- function(x, n = 1) {
  xx <- sort(unique(x))
  x[x <= xx[n]] <- xx[n + 1]
  x[x >= xx[length(xx) - n]] <- xx[length(xx) - n]
  x
}
This yields:
x <- 1:11
wins2(x, 1)
[1] 2 2 3 4 5 6 7 8 9 10 10
wins2(x, 3)
[1] 4 4 4 4 5 6 7 8 8 8 8
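For the looping part of the question, a minimal sketch that repeats the standardize-then-winsorize cycle 10 times (assuming the df read in by the question's code and the wins2() function above):
df <- read.csv("data.csv")
id <- df$Type
mat <- scale(df[, -1], center = TRUE, scale = TRUE)
mat[is.na(mat)] <- 0                              # replace NAs once, before the loop
for (i in 1:10) {
  mat <- apply(mat, 2, wins2, n = 1)              # winsorize each column
  mat <- scale(mat, center = TRUE, scale = TRUE)  # re-standardize
}
full <- data.frame(id, mat)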
My problem is that I want to use a function to change a random value to NA in a global data frame.
df is a dataframe with 230 rows and 2 columns.
abstract code:
emptychange<- function(x){
placenumber <- round(runif(1,min= min(1),max=max(nrow(x))))
x[placenumber,2] <<- NA
}
emptychange(df)
The Error is:"Error in x[placenumber, 2] <<- NA : object 'x' not found".
I think the mistake is that R searches for a global variable 'x' and doesn't use the function argument x (in this case df). How can I fix this? Thanks!
This works. The problem was the <<- in x[placenumber, 2] <<- NA: double arrows are used when you want to assign a value to an object outside the function. In your case, your x only exists inside the function.
df1 <- data.frame(x = 1, y = 1:10)
emptychange <- function(x){
  placenumber <- round(runif(1, min = 1, max = nrow(x)))
  x[placenumber, 2] <- NA
  return(x)
}
emptychange(df1)
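To keep the change, assign the returned value back to the original name:
df1 <- emptychange(df1)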
If you want to do this at the console, you can just sample() from the row count inside the [<- call:
> df1 <-data.frame(x = 1, y = 1:10)
> df1[sample(nrow(df1), 1) , 2] <- NA
> df1
x y
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 NA
7 1 7
8 1 8
9 1 9
10 1 10
If you want to destructively change the data frame argument given to a function, you should instead assign the returned value back to the original name:
> randNA.secCol <- function(df) {df[sample(nrow(df), 1) , 2] <- NA; df}
> df1 <-data.frame(x = 1, y = 1:10)
> df1 <- randNA.secCol(df1)
Best practice in R is to avoid using the <<- operator.
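For completeness, if a function really must modify the caller's data frame in place, a (generally discouraged) sketch would capture the object's name and assign() back into the calling environment; the assign-the-return-value pattern above is still the better choice:
emptychangeInPlace <- function(x) {
  nm <- deparse(substitute(x))            # name of the object passed in
  x[sample(nrow(x), 1), 2] <- NA
  assign(nm, x, envir = parent.frame())   # overwrite it in the caller
  invisible(x)
}
emptychangeInPlace(df1)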
I have a data frame with 2 columns: one with numeric values and one with a name. The names repeat, but with different values each time.
Data <- data.frame(
Value = c(1:10),
Name = rep(LETTERS, each=4)[1:10])
I would like to write a function that takes the 3 highest numbers for each name and calculates their mean and median (returning NA in case there aren't 3 values present), and then takes all the values for each name and calculates the mean and median of those as well.
My initial attempt looks something like this:
my.mean <- function (x,y){
top3.x <- ifelse(x > 3 , NA, x)
return(mean(top3.x), median(top3.x))
}
Any hints on how to improve this will be appreciated.
I would probably recommend by() for this.
Something put together really quickly might look like this (if I understood your question correctly):
myFun <- function(indf) {
do.call(rbind, with(indf, by(Value, Name, FUN=function(x) {
Vals <- head(sort(x, decreasing=TRUE), 3)
if (length(Vals) < 3) {
c(Mean = NA, Median = NA)
} else {
c(Mean = mean(Vals), Median = median(Vals))
}
})))
}
myFun(Data)
# Mean Median
# A 3 3
# B 7 7
# C NA NA
Note that it is not a very general function in this form because of how many parameters are hard-coded into it; it's really only useful if your data is in the form you shared.
Here's a data.table solution, assuming that you don't have any other NAs in your data:
require(data.table) ## 1.9.2+
setDT(Data) ## convert to data.table
Data[order(Name, -Value)][, list(m1=mean(Value[1:3]), m2=median(Value[1:3])), by=Name]
# Name m1 m2
# 1: A 3 3
# 2: B 7 7
# 3: C NA NA
Using dplyr
library(dplyr)
myFun1 <- function(dat){
dat %>%
group_by(Name)%>%
arrange(desc(Value))%>%
mutate(n = n(), Value = ifelse(n < 3, NA_integer_, Value)) %>%
summarize(Mean=mean(head(Value,3)), Median=median(head(Value,3)))
}
myFun1(Data)
#Source: local data frame [3 x 3]
# Name Mean Median
#1 A 3 3
#2 B 7 7
#3 C NA NA
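With more recent dplyr (>= 1.0.0), slice_max() makes the intent a bit more explicit; a sketch:
Data %>%
  group_by(Name) %>%
  slice_max(Value, n = 3, with_ties = FALSE) %>%
  summarize(Mean   = ifelse(n() < 3, NA_real_, mean(Value)),
            Median = ifelse(n() < 3, NA_real_, median(Value)))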