data.table ifelse with multiple columns - r

I have a dataset with tens of columns that looks something like this:
df <- data.frame(id= c(1,1,1,2,2,2,3,3,3), time=c(1,2,3,1,2,3,1,2,3),y1 = rnorm(9), y2= rnorm(9), x = rnorm(9), xb = rnorm(9))
df
# id time y1 y2 x xb
# 1 1 1 -1.1184009 -1.07430118 0.61398523 -0.68343624
# 2 1 2 0.4347047 -0.53454071 -0.30716538 -1.02328242
# 3 1 3 0.2318315 -0.05854228 0.05169733 -0.22130149
# 4 2 1 1.2640080 2.07899296 -0.95918953 -0.35961156
# 5 2 2 -0.4374764 -0.25284854 -0.46251901 0.08630344
# 6 2 3 0.5042690 0.13322671 1.00881113 0.43807458
# 7 3 1 0.3672216 1.92995242 0.48708183 0.58206127
# 8 3 2 -1.5431709 0.53362731 1.17361087 -1.00932195
# 9 3 3 -1.4577268 0.23413541 -0.32399489 -0.91040641
I would like to modify my data frame using the following logic
library(data.table)
setDT(df)[, y1 := ifelse(y1 > x, x, y1)]
setDT(df)[, y2 := ifelse(y2 > xb, xb, y2)]
However, since I have many variables I would like to do this in a single line expression. In other words, I would like to pass this function for multiple columns at once i.e. y1 with x, y2 with xb and so on...
I have tried the following but it does not seem to work
mod<-c("y1","y2")
max<-c("x","xb")
df2<-setDT(ppta)[,(mod):=ifelse(.(mod)>.(max),.(max),.(mod))]
Does anyone know what I am doing wrong, and how I can modify multiple columns with their respective partner columns at once?

Consider using pmin instead of your ifelse. You can try:
mod<-c("y1","y2")
max<-c("x","xb")
setDT(df)
df[,c(mod):=Map(pmin,mget(mod),mget(max))]
Explanation:
pmin takes two (or more) vectors and gives the minimum value for each element (equivalent of your ifelse(y1>x,x,y1));
mget returns a list of objects from their names. For instance, mget(c("a","b")) returns a list containing the objects a and b (if they exist). Here it retrieves the columns by name from within the data.table;
Map applies a function of several arguments element by element across its inputs: Map(f,a,b) is equivalent to list(f(a[[1]],b[[1]]),f(a[[2]],b[[2]]),...).
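To see the pairing in isolation, here is a minimal base R sketch with made-up values (the toy data frame and the name caps are assumptions for illustration, not part of the question). It checks that Map with pmin reproduces the column-by-column ifelse logic:

```r
# Made-up values for illustration (not the question's random data)
df <- data.frame(y1 = c(1, 5, 3), x  = c(2, 4, 3),
                 y2 = c(0, 9, 7), xb = c(1, 8, 10))
mod  <- c("y1", "y2")   # columns to cap
caps <- c("x", "xb")    # their partner columns, in the same order

# Reference result from the original ifelse logic
expected_y1 <- ifelse(df$y1 > df$x,  df$x,  df$y1)
expected_y2 <- ifelse(df$y2 > df$xb, df$xb, df$y2)

# Map pairs mod[i] with caps[i] and applies pmin to each pair
df[mod] <- Map(pmin, df[mod], df[caps])

df$y1  # 1 4 3
df$y2  # 0 8 7
```

Inside a data.table, c(mod) := Map(pmin, mget(mod), mget(max)) performs the same pairing.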

Related

Function to recode multiple variables conditional on other variables

I have a dataset with multiple variables. Each question has the actual survey answer and three other characteristics. So there are four variables for each question. I want to specify if Q135_L ==1 , leave Q135_RT as it is, otherwise code it as NA. I can do that with an ifelse statement.
df$Q135_RT <- ifelse(df$Q135_L == 1, df$Q135_RT, NA)
However, I have hundreds of variables and the names are not related. For example, in the picture we can see Q135, SG1_1 and so on. How can I specify, for the whole dataset, that where a variable ending in _L equals 1 the corresponding variable ending in _RT should remain as it is, and otherwise the variable ending in _RT should be coded as NA?
I tried this but it only returns NAs
ifelse(grepl("//b_L" ==1, df), "//b_RT" , NA)
If I understand your problem correctly, you have a data frame of which the columns represent survey question variables. Each column contains two identifiers, namely: a survey question number (134, 135, etc) and a variable letter (L, R, etc). Because you provide no reproducible example, I tried to make a simplified example of your data frame:
set.seed(5)
DF <- data.frame(array(sample(1:4, 24, replace = TRUE), c(4,6)))
colnames(DF) <- c("Q134_L","Q135_L", "Q134_RT", "Q135_RT", "Q_L1", "Q134_S")
DF
# Q134_L Q135_L Q134_RT Q135_RT Q_L1 Q134_S
# 1 2 3 2 3 1 1
# 2 3 1 3 2 4 4
# 3 1 1 3 2 4 3
# 4 3 1 3 3 2 1
What you want is that if Q135_L == 1, leave Q135_RT as it is, otherwise code it as NA. Here is a function that implements this recoding logic:
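recode <- function(yourdf, questnums) {
  charnum <- as.character(questnums)
  for (k in seq_along(charnum)) {
    # Columns whose names contain question number k and end in _L / _RT
    col_end_L_k <- yourdf[grepl("_L\\b", colnames(yourdf)) &
                            grepl(charnum[k], colnames(yourdf))]
    col_end_R_k <- yourdf[grepl("_RT\\b", colnames(yourdf)) &
                            grepl(charnum[k], colnames(yourdf))]
    # Rows in which the _L column equals 1
    row_is_1 <- which(col_end_L_k == 1)
    # setdiff() stays correct when no row equals 1, whereas negative
    # indexing with an empty vector would select no rows at all
    not_1 <- setdiff(seq_len(nrow(yourdf)), row_is_1)
    if (length(not_1) > 0) col_end_R_k[not_1, ] <- NA
    yourdf[, colnames(col_end_R_k)] <- col_end_R_k
  }
  return(yourdf)
}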
This function takes a data frame and a vector of question numbers, and then returns the data frame that has been recoded.
What this function does:
Loops over each question number with for.
Uses grepl to find the column whose name contains the selected number and ends in _L.
Does the same for the column whose name ends in _RT.
Uses which to find the rows of the _L column that contain 1.
Keeps the values of the matching _RT column (same question number) in those rows and sets its values in all other rows to NA.
The result:
recode(DF, 134:135)
# Q134_L Q135_L Q134_RT Q135_RT Q_L1 Q134_S
# 1 2 3 NA NA 1 1
# 2 3 1 NA 2 4 4
# 3 1 1 3 2 4 3
# 4 3 1 NA 3 2 1
Note that the Q_L1 column is not affected, because _L is not at the end of that column name.
As for how to define questnums, the question numbers, you just need to create a numeric vector. Examples:
Your questnums are 1 to 200. Then use 1:200 or seq(200), so recode(DF, 1:200).
Your questnums are 1, 3, 134, 135. Then, use recode(DF, c(1, 3, 134, 135)).
You can also assign the question numbers to an object first, such as n = c(25, 135, 145), and then use it: recode(DF, n)
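If the column names do not share a numeric question number (e.g. SG1_1), one possible alternative is to pair columns by suffix instead. The following is only a sketch, assuming every column to recode is named like NAME_RT and its control column NAME_L; the helper recode_by_suffix and the toy data are hypothetical:

```r
# Hypothetical helper: pairs each *_L column with the *_RT column
# that shares its prefix, so no question numbers are needed
recode_by_suffix <- function(d) {
  l_cols <- grep("_L$", names(d), value = TRUE)
  for (l in l_cols) {
    rt <- sub("_L$", "_RT", l)
    if (rt %in% names(d)) {
      # %in% treats NA in the _L column as "not 1"
      d[[rt]][!(d[[l]] %in% 1)] <- NA
    }
  }
  d
}

# Made-up data; Q_L1 has no _L suffix and is left untouched
toy <- data.frame(Q134_L  = c(2, 3, 1, 3),  Q134_RT  = c(2, 3, 3, 3),
                  SG1_1_L = c(1, 1, 2, NA), SG1_1_RT = c(5, 6, 7, 8),
                  Q_L1    = c(1, 4, 4, 2))
out <- recode_by_suffix(toy)
out$Q134_RT   # NA NA  3 NA
out$SG1_1_RT  #  5  6 NA NA
```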

between list calculations per row in R

Let's say I have the following list of data frames (in reality I have many more).
seq <- c("12345","67890")
li <- list()
for (i in 1:length(seq)){
  li[[i]] <- list()
  names(li)[i] <- seq[i]
  li[[i]] <- data.frame(A = c(1,2,3),
                        B = c(2,4,6))
}
What I would like to do is calculate the mean of each cell position across the data frames in the list, keeping the same number of rows and columns as the originals. How could I do this? I believe I can use the apply() function, but I am unsure how.
The expected output (not surprising):
A B
1 1 2
2 2 4
3 3 6
In reality, the values within each list are not necessarily the same.
If there are no NAs, then we can Reduce to get the sum of observations for each element and divide by the length of the list
Reduce(`+`, li)/length(li)
# A B
#1 1 2
#2 2 4
#3 3 6
If there are NA values, then it may be better to use mean (which has an na.rm argument). For this, we can convert the list to an array and then use apply
apply(array(unlist(li), dim = c(dim(li[[1]]), length(li))), c(1, 2), mean, na.rm = TRUE)
An equivalent option in tidyverse would be
library(tidyverse)
reduce(li, `+`)/length(li)
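The difference between the two approaches only shows up once a value is missing. A small sketch with made-up data (the list below is an assumption for illustration, not the asker's data):

```r
# Two made-up data frames of the same shape; one cell is NA
li <- list(a = data.frame(A = c(1, 2, 3),  B = c(2, 4, 6)),
           b = data.frame(A = c(3, NA, 5), B = c(4, 8, 6)))

# Sum-based mean: the NA propagates into the result
sum_based <- Reduce(`+`, li) / length(li)

# Array-based mean: stack the data frames along a third dimension,
# then average over that dimension while dropping NAs
arr <- array(unlist(li), dim = c(dim(li[[1]]), length(li)))
mean_based <- apply(arr, c(1, 2), mean, na.rm = TRUE)
dimnames(mean_based) <- list(NULL, names(li[[1]]))

sum_based$A      # 2 NA 4
mean_based[, "A"] # 2  2 4
```

Note that apply returns a matrix, so wrap the result in as.data.frame() if a data frame is needed.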

R: change one value every row in big dataframe

I just started working with R for my master's thesis, and up to now all my calculations have worked out, as I read a lot of questions and answers here (and it's a lot of trial and error, but that's OK).
Now I need to write more sophisticated code and I can't find a way to do this.
Here's the situation: I have multiple sub-datasets with a lot of entries, all structured in the same way. In one of them (50000 entries) I want to change only one value in every row. The new value should be the existing entry plus a few values from another sub-dataset (140000 entries) where the 'ID' variable is the same.
As this is the third day I've been trying to solve this, I have already found and tested for and apply, but both run for hours (canceled after three hours).
Here is an example of one of my attempts (with for):
for (i in 1:50000) {
  Entry_ID <- Sub02[i,4]
  SUM_Entries <- sum(Sub03$Source==Entry_ID)
  Entries_w_ID <- subset(Sub03, grepl(Entry_ID, Sub03$Source)) # The Entry_ID/Source is a character
  Value1 <- as.numeric(Entries_w_ID$VAL1)
  SUM_Value1 <- sum(Value1)
  Value2 <- as.numeric(Entries_w_ID$VAL2)
  SUM_Value2 <- sum(Value2)
  OLD_Val1 <- Sub02[i,13]
  OLD_Val <- as.numeric(OLD_Val1)
  NEW_Val <- SUM_Entries + SUM_Value1 + SUM_Value2 + OLD_Val
  Sub02[i,13] <- NEW_Val
}
I know this might be silly code, but that's the way I tried it as a beginner. I would be very grateful if someone could help me out with this so I can get on with my thesis.
Thank you!
Here's an example of my data-structure:
Text VAL0 Source ID VAL1 VAL2 VAL3 VAL4 VAL5 VAL6 VAL7 VAL8 VAL9
XXX 12 456335667806925_1075080942599058 10153901516433434_10153902087098434 4 1 0 0 4 9 4 6 8
ABC 8 456335667806925_1057045047735981 10153677787178434_10153677793613434 6 7 1 1 5 3 6 8 11
DEF 8 456747267806925_2357045047735981 45653677787178434_94153677793613434 5 8 2 1 5 4 1 1 9
The output I expect is an updated value 'VAL9' in every row.
From what I understood so far, you need 2 things:
sum up some values in one dataset
add them to another dataset, using an ID variable
Besides what @yoland already contributed, I would suggest breaking it down into two separate tasks. Consider these two datasets:
a = data.frame(x = 1:2, id = letters[1:2], stringsAsFactors = FALSE)
a
# x id
# 1 1 a
# 2 2 b
b = data.frame(values = as.character(1:4), otherid = letters[1:2],
stringsAsFactors = FALSE)
sapply(b, class)
# values otherid
# "character" "character"
values is a character column now, so we need to convert it to numeric:
b$values = as.numeric(b$values)
sapply(b, class)
# values otherid
# "numeric" "character"
Then sum up the values in b (grouped by otherid):
library(dplyr)
b = group_by(b, otherid)
b = summarise(b, sum_values = sum(values))
b
# otherid sum_values
# <chr> <dbl>
# 1 a 4
# 2 b 6
Then join it with a - note that identifiers are specified in c():
ab = left_join(a, b, by = c("id" = "otherid"))
ab
# x id sum_values
# 1 1 a 4
# 2 2 b 6
We can then add the result of the sum from b to the variable x in a:
ab$total = ab$x + ab$sum_values
ab
# x id sum_values total
# 1 1 a 4 5
# 2 2 b 6 8
(Updated.)
From what I understand, you want to create a new variable that uses information from two different data sets indexed by the same ID. The easiest way to do this is probably to join the data sets together (if you need to save memory, just join the columns you need). I found dplyr's join functions very handy for these cases (explained neatly here). Once you have joined the data sets into one, it should be easy to create the new columns you need, e.g.: df$new <- df$old1 + df$old2
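Applied to the structure in the question, the same aggregate-then-look-up idea can be sketched in base R without a row-by-row loop. The column names (ID, Source, VAL1, VAL2, VAL9) come from the question, but the toy values are made up, and the sketch assumes Sub02$ID must match Sub03$Source exactly (the question's loop mixes == and grepl, so this is a guess at the intent):

```r
# Made-up stand-ins for the two sub-datasets
Sub02 <- data.frame(ID = c("a_1", "b_2", "c_3"), VAL9 = c(8, 11, 9),
                    stringsAsFactors = FALSE)
Sub03 <- data.frame(Source = c("a_1", "a_1", "b_2"),
                    VAL1 = c("4", "6", "5"), VAL2 = c("1", "7", "8"),
                    stringsAsFactors = FALSE)

# Contribution of each Sub03 row: 1 for the entry itself plus VAL1 and VAL2
Sub03$contrib <- 1 + as.numeric(Sub03$VAL1) + as.numeric(Sub03$VAL2)

# One total per Source, computed in a single pass
totals <- tapply(Sub03$contrib, Sub03$Source, sum)

# Look up each Sub02 ID; IDs without any Sub03 entry add 0
idx <- match(Sub02$ID, names(totals))
add <- as.numeric(totals)[idx]
add[is.na(add)] <- 0

Sub02$VAL9 <- Sub02$VAL9 + add
Sub02$VAL9  # 28 25 9
```

Because the grouping happens once rather than once per row, this avoids rescanning Sub03 for each of the 50000 entries.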

Use value in new variable name

I am trying to build a for loop which will step through each site, for that site calculate frequencies of a response, and put those results in a new data frame. Then after the loop I want to be able to combine all of the site data frames so it will look something like:
Site Genus Freq
1 A 50
1 B 30
1 C 20
2 A 70
2 B 10
2 C 20
But to do this I need my names (of vectors, dataframes) to change each time through the loop. I think I can do this using the SiteNum variable, but how do I insert it into new variable names? The way I tried (below) treats it like part of the string, doesn't insert the value for the name.
I feel like what I want to use is a placeholder %, but I don't know how to do that with variable names.
> SiteNum <- 1
> for (Site in CoralSites){
> Csub_SiteNum <- subset(dfrmC, Site==CoralSites[SiteNum])
> CGrfreq_SiteNum <- numeric(length(CoralGenera))
> for (Genus in CoralGenera){
> CGrfreq_SiteNum[GenusNum] <- mean(dfrmC$Genus == CoralGenera[GenusNum])*100
> GenusNum <- GenusNum + 1
> }
> names(CGrfreq_SiteNum) <- c(CoralGenera)
> Site_SiteNum <- c(Site)
> CG_SiteNum <- data.frame(CoralGenera,CGrfreq_SiteNum,Site_SiteNum)
> SiteNum <- SiteNum + 1
> }
Your question as stated asks how you can create a bunch of variables, e.g. CGrfreq_1, CGrfreq_2, ..., where the name of the variable indicates the site number that it corresponds to (1, 2, ...). While you can do such a thing with functions like assign, it is not good practice for a few reasons:
It makes the code that generates the variables more complicated, because it will be littered with calls to assign, get and paste0.
It makes your data more difficult to manipulate afterwards -- you'll need to (either manually or programmatically) identify all the variables of a certain type, grab their values with get or mget, and then do something with them.
Instead, you'll find it easier to work with other R functions that will perform the aggregation for you. In this case you're looking to generate for each Site/Genus pairing the percentage of data points at the site with the particular genus value. This can be done in a few lines of code with the aggregate function:
# Sample data:
(dat <- data.frame(Site=c(rep(1, 5), rep(2, 5)), Genus=c(rep("A", 3), rep("B", 6), "A")))
# Site Genus
# 1 1 A
# 2 1 A
# 3 1 A
# 4 1 B
# 5 1 B
# 6 2 B
# 7 2 B
# 8 2 B
# 9 2 B
# 10 2 A
# Calculate frequencies
dat$Freq <- 1
res <- aggregate(Freq~Genus+Site, data=dat, sum)
res$Freq <- 100 * res$Freq / table(dat$Site)[as.character(res$Site)]
res
# Genus Site Freq
# 1 A 1 60
# 2 B 1 40
# 3 A 2 20
# 4 B 2 80
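As a cross-check, the same percentages can also be obtained with table and prop.table. This is an alternative sketch, not the approach used above; the names counts, pct and res2 are chosen here for illustration:

```r
# Same sample data as above
dat <- data.frame(Site = c(rep(1, 5), rep(2, 5)),
                  Genus = c(rep("A", 3), rep("B", 6), "A"))

# Cross-tabulate, then convert each row (site) to percentages
counts <- table(Site = dat$Site, Genus = dat$Genus)
pct <- 100 * prop.table(counts, margin = 1)

# Back to long format: one row per Site/Genus pair
res2 <- as.data.frame(pct, responseName = "Freq", stringsAsFactors = FALSE)
res2 <- res2[order(res2$Site, res2$Genus), ]
res2$Freq  # 60 40 20 80
```

Unlike the aggregate version, this also lists Site/Genus pairs that never occur, with a frequency of 0.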

Subset based on granularity and average values

I have a large data frame consisting of two columns. I want to calculate the average of the second column's values for each subset of the first column. The subsets of the first column are defined by a specified granularity. For example, for the following data frame, df, I want to calculate the average of the df$B values for each subset of df$A, with an increment (granularity) of 1 per subset. The results should be in two new columns.
A       B
0.22096 1
0.33489 1
0.33655 1
0.43953 1
0.64933 2
0.86668 1
0.96932 1
1.09342 2
1.58314 2
1.88481 2
2.07654 4
2.34652 3
2.79777 5
Expected results:
newA newB
0    1.142857
1    2
2    4
This is a simple example; I'm not sure how to loop over the whole data frame and calculate the average of df$B. I tried the following to subset, but couldn't figure out how to append the results and build the final output:
increment <- 1
mx <- max(df$A)
i <- 0
newDF <- data.frame()
while (i < mx) {
  tmp <- subset(df, (A > i & A < (i + increment)))
  i <- i + increment
}
Not sure about the logic. But I'm sure there is a short way to do the required calculation. Any thoughts?
I would use findInterval for the subset selection and tapply to calculate the mean. (In your example a simple ceiling of each A value would suffice, too, but if your increment differs from 1 you need findInterval.)
df <- read.table(textConnection("
A B
0.22096 1
0.33489 1
0.33655 1
0.43953 1
0.64933 2
0.86668 1
0.96932 1
1.09342 2
1.58314 2
1.88481 2
2.07654 4
2.34652 3
2.79777 5"), header=TRUE)
## sort data.frame by column A (needed for findInterval)
df <- df[order(df$A), ]
## define granularity
subsets <- seq(1, max(ceiling(df$A)), by=1) # change the "by" argument for different increments
df$subset <- findInterval(df$A, subsets)
tapply(df$B, df$subset, mean)
# 0 1 2
#1.142857 2.000000 4.000000
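To get the two-column newA/newB layout from the question and make the increment a parameter, the same findInterval idea can be combined with aggregate. This is a sketch under the assumption that bins start at 0 and are labelled by their lower bound; the names increment, breaks, bin and res are illustrative:

```r
df <- data.frame(
  A = c(0.22096, 0.33489, 0.33655, 0.43953, 0.64933, 0.86668, 0.96932,
        1.09342, 1.58314, 1.88481, 2.07654, 2.34652, 2.79777),
  B = c(1, 1, 1, 1, 2, 1, 1, 2, 2, 2, 4, 3, 5))

increment <- 1  # change this for a different granularity

# Bin edges from 0 up to the first multiple of increment above max(A)
breaks <- seq(0, ceiling(max(df$A) / increment) * increment, by = increment)

# 1-based index of the bin each A value falls into
bin <- findInterval(df$A, breaks)

# Label each bin by its lower edge and average B within it
res <- aggregate(B ~ newA, data = data.frame(newA = breaks[bin], B = df$B),
                 FUN = mean)
names(res)[2] <- "newB"
res
```

With increment <- 1 this reproduces the three bins and means shown above.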
