Error with creating function in R - r

I'm trying to create a function as I need to apply the same code multiple times to different columns in my data.
My data (df) looks like this:
WEEK1.x WEEK1.y WEEK2.x WEEK2.y WEEK3.x WEEK3.y
1 660.14 1 690.74 2 821.34 1
2 -482.89 99 -368.12 99 -368.12 99
3 284.48 3 399.90 1 375.32 1
4 -554.18 99 -300.28 99 -300.28 99
Then my function looks like:
extra<-function(first_var, second_var){
df$first_var=ifelse((df$first_var == 99),"99",
ifelse((df$first_var %in% c(1,2,3,4,5)),"1-5",NA))
output=as.data.frame(aggregate(second_var~first_var, data = df, mean))
return(output)
}
WEEK1<-extra("WEEK1.y", "WEEK1.x")
WEEK2<-extra("WEEK2.y", "WEEK2.y")
This then gives me the error:
Error in $<-.data.frame(*tmp*, first_var, value = logical(0)) :
replacement has 0 rows, data has 1416
When I press view traceback this is what it says:
stop(sprintf(ngettext(N, "replacement has %d row, data has %d",
"replacement has %d rows, data has %d"), N, nrows), domain = NA)
$<-.data.frame(*tmp*, first_var, value = logical(0))
$<-(*tmp*, first_var, value = logical(0))
extra("WEEK1.y", "WEEK1.x")
I'm not sure what the problem is?

Here is a working version of your function.
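I have used a variant of the suggestion by @A.Suliman, but with [[ instead of $. The $ operator does not evaluate its argument, so df$first_var looks for a column literally named first_var rather than the column whose name is stored in first_var.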
extra <- function(first_var, second_var){
df[[first_var]] <- ifelse((df[[first_var]] == 99), "99",
ifelse((df[[first_var]] %in% c(1,2,3,4,5)), "1-5", NA))
fmla <- as.formula(paste(second_var, first_var, sep = "~"))
aggregate(fmla, data = df, mean, na.rm = TRUE)
}
WEEK1 <- extra("WEEK1.y", "WEEK1.x")
WEEK1
# WEEK1.y WEEK1.x
#1 1-5 472.310
#2 99 -518.535
WEEK2 <- extra("WEEK2.y", "WEEK2.x")
WEEK2
# WEEK2.y WEEK2.x
#1 1-5 545.32
#2 99 -334.20
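To see why the original version failed, here is a minimal sketch of the difference between $ and [[ when the column name is stored in a variable. The data frame d and the variable col are made up for illustration.

```r
# Hypothetical toy data; `d` and `col` are not from the question.
d <- data.frame(a = c(1, 99, 3))
col <- "a"

# `$` takes the name literally, so this looks for a column called "col"
by_dollar <- d$col        # NULL, because no such column exists

# `[[` evaluates `col` first and then extracts column "a"
by_bracket <- d[[col]]    # 1 99 3

# Comparing NULL with 99 gives a zero-length logical vector, which is
# the `value = logical(0)` shown in the error message
cmp <- by_dollar == 99
length(cmp)               # 0
```

Assigning that zero-length result back into the data frame is what triggers "replacement has 0 rows".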
Note that I would also suggest that you pass df as an argument to the function. It is generally considered bad practice to rely on objects that exist outside the function's environment. In this case, df exists in .GlobalEnv, and you are forcing R to leave the environment where it is needed in order to find it.
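A sketch of what passing the data frame explicitly could look like. The argument name data is my own choice, and the small data frame below only reproduces the first two columns of the sample data.

```r
# Same logic as the function above, but the data frame is an argument
# (the name `data` is an assumption, not part of the original code).
extra <- function(data, first_var, second_var) {
  data[[first_var]] <- ifelse(data[[first_var]] == 99, "99",
                       ifelse(data[[first_var]] %in% 1:5, "1-5", NA))
  fmla <- as.formula(paste(second_var, first_var, sep = "~"))
  aggregate(fmla, data = data, FUN = mean)
}

# First two columns of the sample data
df <- data.frame(WEEK1.x = c(660.14, -482.89, 284.48, -554.18),
                 WEEK1.y = c(1, 99, 3, 99))

WEEK1 <- extra(df, "WEEK1.y", "WEEK1.x")
WEEK1
```

Because the function only modifies its local copy, the df in the calling environment is left untouched.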
DATA.
df <- read.table(text = "
WEEK1.x WEEK1.y WEEK2.x WEEK2.y WEEK3.x WEEK3.y
1 660.14 1 690.74 2 821.34 1
2 -482.89 99 -368.12 99 -368.12 99
3 284.48 3 399.90 1 375.32 1
4 -554.18 99 -300.28 99 -300.28 99
", header = TRUE)

Related

Replace row names in a column

I have a large data.frame in R with thousands of rows and 4 columns.
For example:
Chromosome Start End Count
1 NC_031985.1 16255093 16255094 1
2 NC_031972.1 11505205 11505206 1
3 NC_031971.1 24441227 24441228 1
4 NC_031977.1 29030540 29030541 1
5 NC_031969.1 595867 595868 1
6 NC_031986.1 40147812 40147813 1
I have this data.frame with the chromosome names accordingly
LG1 NC_031965.1
LG2 NC_031966.1
LG3a NC_031967.1
LG3b NC_031968.1
LG4 NC_031969.1
LG5 NC_031970.1
LG6 NC_031971.1
LG7 NC_031972.1
LG8 NC_031973.1
LG9 NC_031974.1
LG10 NC_031975.1
LG11 NC_031976.1
LG12 NC_031977.1
LG13 NC_031978.1
LG14 NC_031979.1
LG15 NC_031980.1
LG16 NC_031987.1
LG17 NC_031981.1
LG18 NC_031982.1
LG19 NC_031983.1
LG20 NC_031984.1
LG22 NC_031985.1
LG23 NC_031986.1
I want to replace all row names of the large matrix with the chromosome names as listed above and get:
Chromosome Start End Count
1 LG22 16255093 16255094 1
2 LG7 11505205 11505206 1
3 LG6 24441227 24441228 1
4 LG12 29030540 29030541 1
5 LG4 595867 595868 1
6 LG23 40147812 40147813 1
Does anybody know the least painful way to do this?
It might be easy (or not) but my experience in R is limited.
Many thanks!
As discussed in the comments here is the dplyr solution if people are looking:
library(dplyr)
df %>%
inner_join(chromo_names, by = c("Chromosome" = "V2")) %>%
select(Chromosome = V1, Start, End, Count)
This gives a warning message that the two merging columns have different factor levels. You can either ignore that and work with characters, or convert the merged column to a factor like this:
df %>%
inner_join(chromo_names, by = c("Chromosome" = "V2")) %>%
select(Chromosome = V1, Start, End, Count) %>%
mutate(Chromosome = as.factor(Chromosome))
Here is a Base R solution:
merged = merge(df, chromo_names,
by.x = "Chromosome",
by.y = "V2",
sort = FALSE)
merged = merged[c(5,2:4)]
names(merged)[1] = "Chromosome"
Result:
Chromosome Start End Count
1 LG22 16255093 16255094 1
2 LG7 11505205 11505206 1
3 LG6 24441227 24441228 1
4 LG12 29030540 29030541 1
5 LG4 595867 595868 1
6 LG23 40147812 40147813 1
Data:
df = read.table(text = " Chromosome Start End Count
1 NC_031985.1 16255093 16255094 1
2 NC_031972.1 11505205 11505206 1
3 NC_031971.1 24441227 24441228 1
4 NC_031977.1 29030540 29030541 1
5 NC_031969.1 595867 595868 1
6 NC_031986.1 40147812 40147813 1", header = TRUE)
chromo_names = read.table(text = "LG1 NC_031965.1
LG2 NC_031966.1
LG3a NC_031967.1
LG3b NC_031968.1
LG4 NC_031969.1
LG5 NC_031970.1
LG6 NC_031971.1
LG7 NC_031972.1
LG8 NC_031973.1
LG9 NC_031974.1
LG10 NC_031975.1
LG11 NC_031976.1
LG12 NC_031977.1
LG13 NC_031978.1
LG14 NC_031979.1
LG15 NC_031980.1
LG16 NC_031987.1
LG17 NC_031981.1
LG18 NC_031982.1
LG19 NC_031983.1
LG20 NC_031984.1
LG22 NC_031985.1
LG23 NC_031986.1", header = FALSE)
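If you only need to swap the values in one column, a lookup with match() is another option. This is a sketch of an alternative, not what either answer above does; it uses a cut-down version of the data and assumes the columns are character rather than factor.

```r
# Cut-down versions of df and chromo_names (character columns assumed)
df <- data.frame(Chromosome = c("NC_031985.1", "NC_031972.1", "NC_031969.1"),
                 Start = c(16255093, 11505205, 595867),
                 stringsAsFactors = FALSE)
chromo_names <- data.frame(V1 = c("LG4", "LG7", "LG22"),
                           V2 = c("NC_031969.1", "NC_031972.1", "NC_031985.1"),
                           stringsAsFactors = FALSE)

# Position of each Chromosome value in the lookup table
idx <- match(df$Chromosome, chromo_names$V2)

# Replace accession numbers with linkage group names, keeping row order
df$Chromosome <- chromo_names$V1[idx]
df
```

Unlike merge(), this keeps the original row order by construction. Accessions that are missing from the lookup table become NA instead of being dropped.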

ifelse didn't work in dataframe in R

I have a question about ifelse in data.frame in R. I checked several SO posts about it, and unfortunately none of these solutions fitted my case.
I am trying to do a conditional calculation in a data frame, but I get the warning "the condition has length > 1 and only the first element will be used", even though I used the ifelse function, which should work perfectly according to the SO posts I checked.
Here is my sample code:
library(scales)
head(temp[, 2:3])
previous current
1 0 10
2 50 57
3 92 177
4 84 153
5 30 68
6 162 341
temp$change = ifelse(temp$previous > 0, rate(temp$previous, temp$current), temp$current)
rate = function(yest, tod){
value = tod/yest
if(value>1){
return(paste("+", percent(value-1), sep = ""))
}
else{
return(paste("-", percent(1-value), sep = ""))
}
}
So if I run the ifelse one, I will get following result:
head(temp[, 2:4])
previous current change
1 0 10 10
2 50 57 +NaN%
3 92 177 +NaN%
4 84 153 +NaN%
5 30 68 +NaN%
6 162 341 +NaN%
So my question is, how should I deal with it? I tried to assign 0 to the last column before I run ifelse, but it still failed.
Many thanks in advance!
Try the following two segments; both should do what you want. The second one may be the one you are looking for.
library(scales)
set.seed(1)
temp <- data.frame(previous = rnorm(5), current = rnorm(5))
rate <- function(i) {
yest <- temp$previous[i]
tod <- temp$current[i]
if (yest <= 0)
return(tod)
value = tod/yest
if (value>1) {
return(paste("+", percent(value-1), sep = ""))
} else {
return(paste("-", percent(1-value), sep = ""))
}
}
temp$change <- unlist(lapply(1:dim(temp)[1], rate))
Second:
ind <- which(temp$previous > 0)
temp$change <- temp$current
temp$change[ind] <- unlist(lapply(ind,
function(i) rate(temp$previous[i], temp$current[i])))
In the second segment, the function rate is the same as you coded it.
Here's another way to do the same
# 1: load dplyr
#if needed install.packages("dplyr")
library(dplyr)
# 2: I recreate your data
your_dataframe = as_tibble(cbind(c(0,50,92,84,30,162),
c(10,57,177,153,68,341))) %>%
rename(previous = V1, current = V2)
# 3: obtain the change using your conditions
your_dataframe %>%
mutate(change = ifelse(previous > 0,
ifelse(current/previous > 1,
paste0("+%", (current/previous-1)*100),
paste0("-%", (1-current/previous)*100)),
current))
Result:
# A tibble: 6 x 3
previous current change
<dbl> <dbl> <chr>
1 0 10 10
2 50 57 +%14
3 92 177 +%92.3913043478261
4 84 153 +%82.1428571428571
5 30 68 +%126.666666666667
6 162 341 +%110.493827160494
Inside rate, if() only evaluates the first element of value, so the output of rate depends solely on the first row of temp.
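One way around this is to make rate itself vectorized, so it can be called on whole columns. The sketch below is an assumption-laden rewrite: the name rate_vec is made up, and it formats with sprintf() from base R instead of percent() from scales, rounding to whole percentages.

```r
# Vectorized variant of rate(); `rate_vec` is a hypothetical name.
rate_vec <- function(yest, tod) {
  ratio <- tod / yest
  pct   <- abs(ratio - 1) * 100            # size of the change in percent
  sgn   <- ifelse(ratio > 1, "+", "-")     # same rule as the original if/else
  # Rows with yest <= 0 fall back to the current value, as in the question
  ifelse(yest > 0, sprintf("%s%.0f%%", sgn, pct), as.character(tod))
}

temp <- data.frame(previous = c(0, 50, 92), current = c(10, 57, 46))
temp$change <- rate_vec(temp$previous, temp$current)
temp$change   # "10" "+14%" "-50%"
```

Every step works on the full vectors, so no if() ever sees a condition longer than one element.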
Adopting the advice I received from warm-hearted SO users, I vectorized some of my functions and it worked! Raise a glass to SO community!
Here is the solution:
temp$rate = ifelse(temp$previous > 0, ifelse(temp$current/temp$previous > 1,
temp$current/temp$previous - 1,
1 - temp$current/temp$previous),
temp$current)
This will return rate with scientific notation. If "regular" notation is needed, here is an update:
temp$rate = format(temp$rate, scientific = FALSE)

R data.table lapply split by segment: expecting wrong data type

I am working on a medical data set. Each patient has time-updated records across different rows. For each patient, we want to compute the time of their first hospitalization, first heart attack, or death. For those without any records during the time of the study, we'll return NA as a missing value.
This seems like a great opportunity to use R's data.table package, but I am getting a strange error. Here is an example:
library(data.table)
dat <- data.table(id = c(1,1,2,2), begin = c(0, 20, 0, 50),
hospital = c(1,0,0,1), heart.attack = c(0,0,0,1),
death = c(0,1,0,1))
print(dat)
id begin hospital heart.attack death
1: 1 0 1 0 0
2: 1 20 0 0 1
3: 2 0 0 0 0
4: 2 50 1 1 1
This function calculates the minimum time that an event occurred. If no event occurred, return a missing value.
first.calc <- function(outcome, time){
w <- which(as.logical(outcome) == TRUE)
if(length(w) > 0){
return(min(time[w]))
}
else{
return(NA)
}
}
Note: these variables are used because I'm actually writing a function to apply these techniques on multiple data sets with unknown names.
time.name = "begin"
id.name = "id"
outcome.names = c("hospital", "heart.attack", "death")
res <- dat[, lapply(X = .SD, FUN = "first.calc",
time = get(time.name)),
.SDcols = outcome.names,
by = id.name]
Error in `[.data.table`(dat, , lapply(X = .SD,
FUN = "first.calc", time = get(time.name)), :
Column 2 of result for group 2 is type
'double' but expecting type 'logical'. Column types must
be consistent for each group.
Meanwhile, the function seems to work on a single patient or without subsetting:
dat[id == 1, lapply(X = .SD, FUN = "first.calc",
time = get(time.name)),
.SDcols = outcome.names,
by = id.name]
id hospital heart.attack death
1: 1 0 NA 20
dat[id == 2, lapply(X = .SD, FUN = "first.calc",
time = get(time.name)),
.SDcols = outcome.names,
by = id.name]
id hospital heart.attack death
1: 2 50 50 50
dat[, lapply(X = .SD, FUN = "first.calc",
time = get(time.name)),
.SDcols = outcome.names]
hospital heart.attack death
1: 0 50 20
So why is this error occurring when I use by = id.name on the full data set?
I have read through most of the Similar Questions. There was not much information about this specific issue. Sorry if this question is a duplicate. Thanks for looking at it!
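Going by the error message, a likely cause is that a bare NA is of type logical in R. For id 1 the heart.attack column comes back as a logical NA, and for id 2 it comes back as a double (50), so the two groups disagree on the column type. A minimal sketch of a fix, assuming the time column is numeric, is to return NA_real_ and coerce the minimum to double:

```r
# Type-stable variant of first.calc: always returns a double
first.calc <- function(outcome, time) {
  w <- which(as.logical(outcome))
  if (length(w) > 0) {
    as.numeric(min(time[w]))   # double even if `time` is stored as integer
  } else {
    NA_real_                   # a double NA instead of the logical NA
  }
}

typeof(NA)        # "logical"
typeof(NA_real_)  # "double"

no_event  <- first.calc(outcome = c(0, 0), time = c(0, 20))
has_event <- first.calc(outcome = c(0, 1), time = c(0, 20))
```

With a function that returns the same type for every group, the grouped call with by = id.name should no longer hit the type check.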

R How to iterate loops over every file in a folder?

I am struggling to iterate 2 loops over all the files in a folder. I have over 600 .csv files, which contain information about the latency and duration of saccades made in a sentence. They look like this:
order subject sentence latency duration
1 1 1 641 76
2 1 1 98 57
3 1 1 252 49
4 1 1 229 43
For each of the files, I want to create 2 new columns called Start and End, to calculate the start and end point of each saccade. The values in each of those are calculated from the values in the latency and duration columns. I can do this using a loop for each file, like so:
SentFile = read.csv(file.choose(), header = TRUE, sep = ",")
# Calculate Start
for (i in 1:(nrow(SentFile)-1)){
SentFile$Start[1] = SentFile$Latency[1]
SentFile$Start[i+1] = SentFile$Start[i] +
SentFile$Duration[i] + SentFile$Latency[i+1]}
#Calculate End
for (i in 1:(nrow(SentFile)-1)){
SentFile$End[i] = SentFile$Start[i] + SentFile$Duration[i]}
And then the result looks like this:
order subject sentence latency duration Start End
1 1 1 641 76 641 717
2 1 1 98 57 815 872
3 1 1 252 49 1124 1173
4 1 1 229 43 1402 1445
I am sure there is probably a more efficient way of doing it, but it is very important to use the precise cells specified in the loop to calculate the Start and End values and that was the only way I could think of to get it to work for each individual file.
As I said, I have over 600 files, and I want to be able to calculate the Start and End values for the entire set and add the new columns to each file. I tried using lapply, like this:
sent_files = list.files()
lapply(sent_files, function(x){
SentFile = read.csv(x, header = TRUE, sep = ",")
for (i in 1:(nrow(SentFile)-1)){
SentFile$Start[1] = SentFile$Latency[1]
SentFile$Start[i+1] = SentFile$Start[i] + SentFile$Duration[i]
+ SentFile$Latency[i+1]}
#Calculate End of Saccade Absolute Time Stamp #######
for (i in 1:(nrow(SentFile)-1)){
SentFile$End[i] = SentFile$Start[i] + SentFile$Duration[i]}})
However, I keep getting this error message:
Error in `$<-.data.frame`(`*tmp*`, "SacStart", value = c(2934L, NA)):replacement has 2 rows, data has 1
I would really appreciate any help in getting this to work!
First, replace for loops:
data <- data.frame(
order = c(1, 2, 3, 4), subject = c(1, 1, 1, 1), sentence = c(1, 1, 1, 1),
latency = c(641, 98, 252, 229), duration = c(76, 57, 49, 43)
)
data$end <- cumsum(data$latency + data$duration)
data$start <- data$end - data$duration
Second, you are not assigning the results of the CSV load to a variable in your environment, so whatever happens inside lapply is lost.
If you want to process all files in one go, change the code for data load to this:
data.list <- lapply(sent_files, function(x){
data <- read.csv(x, header = TRUE, sep = ",")
return(data)
})
data <- do.call("rbind", data.list)

Vectorization in R

I am new to R, and I have researched vectorization. But I am still trying to train my mind to think in vectorization terms. Very often examples of vectorization instead of loops are either too simple, so it's difficult for me to generalize them, or not present at all.
Can anyone suggest how I can vectorize the following?
Model2 <- subset(Cor.RMA, MODEL == Models.Sort[2,1])
RCM2 <- count(Model2$REPAIR_CODE)
colnames(RCM2) <- c("REPAIR_CODE", "FREQ")
M2M <- merge(RCM.Sort, RCM2, by = "REPAIR_CODE", all.x = TRUE)
M2M.Sort <- M2M[order(M2M$FREQ.x, decreasing = TRUE), ]
M2M.Sort[is.na(M2M.Sort)] <- 0
In the above code, each "2" needs to run from 2 to 85.
writeWorksheetToFile(file="CL2 - TC - RC.xlsx",
data = M2M.Sort[ ,c("FREQ.y")],
sheet = "RC by Model",
clearSheets = FALSE,
startRow = 6,
startCol = 6)
In the above code, "data" should run from "M2M..." to "M85M..." and "startCol" should run from 6 to 89 for an Excel printout.
The data frame this comes from (Cor.RMA) has columns "MODEL", "REPAIR_CODE", and others that are unused.
RCM.Sort is a frequency table of each "REPAIR_CODE" across all models that I use as a Master list to adjoin Device-specific Repair Code counts. (left-join: all.x = TRUE)
Models.Sort is a frequency table I generated using the "count" function from the plyr package, so I can create subsets for each MODEL.
Then I merge a list of each "REPAIR_CODE" that I generated using the "unique" function.
Sample Data:
CASE_NO DEVICE_TYPE MODEL TRIAGE_CODE REPAIR_CODE
12341 Smartphone X TC01 RC01
12342 Smartphone Y TC02 RC02
12343 Smartphone Z TC01, TC05 RC05
12344 Tablet AA TC02 RC37
12345 Wearable BB TC05 RC37
12346 Smartphone X TC07 RC01
12347 Smartphone Y TC04 RC02
I very much appreciate your time and effort if you are willing to help.
Alright, this is not what your original script did, but here goes:
models <- c("X","Y","Z","AA","BB") # in your case it would be Models.Sort[2:85,1]
new <- Cor.RMA[Cor.RMA$MODEL %in% models,]
new2 <- aggregate(new$REPAIR_CODE, list(new$MODEL), table)
temp <- unlist(new2[[2]])
temp <- temp[, order(colSums(temp), decreasing = T)]
out <- data.frame(group=new2[,1], temp)
out <- out[order(rowSums(out[,-1]), decreasing = T),]
out
# group RC01 RC02 RC37 RC05
# 3 X 2 0 0 0
# 4 Y 0 2 0 0
# 1 AA 0 0 1 0
# 2 BB 0 0 1 0
# 5 Z 0 0 0 1
You can then write it easily to an xlsx file, e.g. with:
library(xlsx)
write.xlsx(out, "test.xlsx", row.names = FALSE)
Edit: Added sorting.
Edit2: Fixed sorting.
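For what it's worth, the whole model-by-repair-code count can also be built in a single call to table(), which avoids creating one subset per model. This is a sketch on a cut-down version of the sample data, not a drop-in replacement for the original script.

```r
# Cut-down sample data with only the two columns that are used
Cor.RMA <- data.frame(
  MODEL       = c("X", "Y", "Z", "AA", "BB", "X", "Y"),
  REPAIR_CODE = c("RC01", "RC02", "RC05", "RC37", "RC37", "RC01", "RC02"),
  stringsAsFactors = FALSE
)

# One row per repair code, one column per model; missing pairs are 0
counts <- table(Cor.RMA$REPAIR_CODE, Cor.RMA$MODEL)

# Most frequent repair codes first, like the master list in the question
counts <- counts[order(rowSums(counts), decreasing = TRUE), , drop = FALSE]

# Plain data frame, ready to be written to a worksheet in one go
counts_df <- as.data.frame.matrix(counts)
counts_df
```

Each column of counts_df corresponds to one of the per-model FREQ.y vectors, so the 84 separate write calls collapse into one.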

Resources