R for loop with a function - r

EDITED: I have the code below. Essentially, it takes an image URL from a data frame and passes it to a function. I couldn't figure out the best way to write a for loop for this, or whether there is a better option. The end goal is to arrive at DF_ALL. The data frame has hundreds of images, so the repeated code below is not the most elegant solution.
# Part 1, Get some profile images from Twitter.
library(rtweet) #I'm not including the key here.
# Get a list of IDs
followers <- get_followers("TWITTER_HANDLE_HERE", n = 10)
# Get the complete Twitter profile for the 10 users
follower_profiles <- lookup_users(followers)
# Create new variable profile_full_url for image API
follower_profiles$profile_full_url <- gsub("normal", "400x400", follower_profiles$profile_image_url)
# Part 2, Process the image with the API
library(Roxford)
visionkey = 'KEY_FROM_GOOGLE'
# Run image tag function on the first image
DF1 <- getTaggingResponseURL(follower_profiles$profile_full_url[1], visionkey)
DF1$twitter_url <- follower_profiles$profile_full_url[1]
# Here is the result (notice that it shows 3 rows; I don't know why. I would prefer to have 1 row per image)
# name confidence width height format twitter_url
# tags wall 0.999090671539307 <NA> <NA> <NA> http://pbs.twimg.com/profile_images/9999999999_400x400.jpg
# requestId <NA> <NA> <NA> <NA> <NA> http://pbs.twimg.com/profile_images/9999999999_400x400.jpg
# metadata <NA> <NA> 400 400 Jpeg http://pbs.twimg.com/profile_images/9999999999_400x400.jpg
# The problem is... there could be 100+ of images.
# I feel that a for loop could potentially be the solution.
DF1 <- getTaggingResponseURL(follower_profiles$profile_full_url[1], visionkey)
DF1$twitter_url <- follower_profiles$profile_full_url[1]
DF2 <- getTaggingResponseURL(follower_profiles$profile_full_url[2], visionkey)
DF2$twitter_url <- follower_profiles$profile_full_url[2]
DF3 <- getTaggingResponseURL(follower_profiles$profile_full_url[3], visionkey)
DF3$twitter_url <- follower_profiles$profile_full_url[3]
DF_ALL<-rbind(DF1,DF2,DF3)

The function for one URL.
foo <- function(x) {
DF <- getTaggingResponseURL(x, visionkey)
DF$twitter_url <- x
DF
}
Apply the foo on vector follower_profiles$profile_full_url and rbind the result.
DF_ALL <- do.call(rbind, lapply(follower_profiles$profile_full_url, foo))
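This one might work as well, but I am not sure as I do not know the structure of the data. Note that sapply() does not rbind the results, so it may return a list or matrix rather than a single data frame.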
DF_ALL <- sapply(follower_profiles$profile_full_url, foo)
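A minimal sketch of the difference between the two calls. Since the real API needs a key, fake_tagger() here is a hypothetical stand-in for getTaggingResponseURL() that returns a made-up three-row data frame, and the URLs are placeholders:

```r
# Hypothetical stand-in for getTaggingResponseURL(); the columns are made up
fake_tagger <- function(url, key) {
  data.frame(name = c("wall", NA, NA),
             width = c(NA, NA, 400),
             stringsAsFactors = FALSE)
}

# Placeholder URLs, not real profile images
urls <- c("http://example.com/a_400x400.jpg",
          "http://example.com/b_400x400.jpg")

foo <- function(x) {
  DF <- fake_tagger(x, "PLACEHOLDER_KEY")
  DF$twitter_url <- x
  DF
}

# lapply() returns a list of data frames; rbind stacks them into one
DF_ALL <- do.call(rbind, lapply(urls, foo))
is.data.frame(DF_ALL)   # TRUE
nrow(DF_ALL)            # 6, i.e. 3 rows per URL

# sapply() tries to simplify the list and does not row-bind the pieces
res <- sapply(urls, foo)
is.data.frame(res)      # FALSE
```

With this stand-in, only the lapply() version produces the stacked data frame that DF_ALL is meant to be.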

Try this:
for (i in seq_len(nrow(follower_profiles))) {
DF <- getTaggingResponseURL(follower_profiles$profile_full_url[i], visionkey)
DF$twitter_url <- follower_profiles$profile_full_url[i]
if (i == 1) {
DF_ALL <- DF
} else {
DF_ALL <- rbind(DF_ALL,DF)
}
}
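Growing DF_ALL with rbind inside the loop copies the accumulated data on every iteration. A sketch of the same loop that fills a pre-allocated list and binds once at the end. Here fake_tagger() is a hypothetical stand-in for getTaggingResponseURL(), and the filter that keeps only rows with a non-missing name (to get closer to one row per image) is an assumption based on the sample output in the question:

```r
# Hypothetical stand-in for getTaggingResponseURL(); shape copied loosely
# from the three-row sample output (tags, requestId, metadata)
fake_tagger <- function(url, key) {
  data.frame(name = c("wall", NA, NA),
             confidence = c(0.99, NA, NA),
             stringsAsFactors = FALSE)
}

# Placeholder URLs, not real profile images
urls <- c("http://example.com/a_400x400.jpg",
          "http://example.com/b_400x400.jpg",
          "http://example.com/c_400x400.jpg")

pieces <- vector("list", length(urls))   # one slot per image
for (i in seq_along(urls)) {
  DF <- fake_tagger(urls[i], "PLACEHOLDER_KEY")
  DF$twitter_url <- urls[i]
  # Assumption: only the "tags" rows carry a name, so drop the others
  pieces[[i]] <- DF[!is.na(DF$name), , drop = FALSE]
}

DF_ALL <- do.call(rbind, pieces)         # bind once, after the loop
```

If the real API returns several tags for one image, the filter would keep several rows for that image, so it may need adjusting.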

Related

Loop function in r to compare values of different data frames

Introduction
Hi to everyone,
for a little project, I try to get a function to compare values of a Data Frame 1 with values from a Data Frame 2. Thereafter, data frames 3 and 4 are supposed to get printed with the information of the comparison.
Data Frame 1:
ID   x1i  x2i  x3i
a    1    2    4
b    1    4    1
Data Frame 2:
Data_Frame_2 <- c(1:4)
Read x1a and compare with Data Frame 2. The value 1 is in Data Frame 2. Print value 1 and the name of the variable (x1a) in Data Frame 3 and cross out the value 1 from Data Frame 2.
Read x1b and compare with Data Frame 2. The value 1 is (not anymore) in Data Frame 2. Read x2b. The value 4 is in Data Frame 2. Print value 4 and the name of the variable (x2b) in Data Frame 3 and cross out the value 4 from Data Frame 2.
The Data Frame 3 is supposed to be something like this:
Data Frame 3:
ID   Value  Variable
a    1      x1i
b    4      x2i
Data Frame 4 (the remaining numbers of Data Frame 2):
Remaining numbers
2
3
Example in R to solve this theoretical problem
Until now, I worked out this code which does the job:
b <- as.data.frame(c(1:4)) # data frame 2
colnames(b, do.NULL = FALSE)
colnames(b) <- c("b")
View(b)
a <- as.data.frame(cbind(c("a","b"), c(3,3), c(2,1), c(1,2))) # data frame 1
colnames(a, do.NULL = FALSE)
colnames(a) <- c("ID","x1i","x2i","x3i")
View(a)
`%notin%` <- Negate(`%in%`) #got this one from <https://www.marsja.se/how-to-use-in-in-r/>
Read_Info <- function(a,b)
{
if (a[1,2] %in% b[1:4,1]) {c_1<-c(a[1,1:2],names(a)[2]); b1<-subset(b,b %notin% a[1,2])}
if (a[2,2] %in% b1[1:3,1]) {c_2<-c(a[2,1:2],names(a)[2]); b2<-subset(b,b %notin% c(a[1,2],a[2,2]))}
else if (a[2,3] %in% b1[1:3,1]) {c_2<-c(a[2,1],a[2,3],names(a)[3]); b2<-subset(b,b %notin% c(a[1,2],a[2,3]))}
if (a[3,2] %in% b2[1:2,1]) {c_3<-c(a[3,1],a[3,2],names(a)[2]); b3<-subset(b,b %notin% c(a[1,2],a[2,3],a[3,2]))}
else if (a[3,2] %notin% b2[1:2,1]) {c_3<-c(NA,NA,NA); b3<-b2}
c<-rbind(c_1,c_2,c_3)
colnames(c, do.NULL = FALSE)
colnames(c) <- c("ID","Value","Variable")
bx<-b3
colnames(bx, do.NULL = FALSE)
colnames(bx) <- c("Remaining numbers")
print(c)
print(bx)
}
Read_Info(a,b)
# In this example, c is data frame 3 and bx is data frame 4
Actual Task at hand - If, else if Loop Function in R
I do face the following obstacle: the actual data which I have is a little bit larger than the above example. Nevertheless, it follows the same structure:
l <- as.data.frame(c(1:20)) # this would be Data Frame 2 in the theoretical considerations
colnames(l, do.NULL = FALSE)
colnames(l) <- c("b")
View(l)
# This would be data frame 1 in the theoretical considerations
# Note: between "ID" and "x1i", there are now two additional variables which were not in the example above
# Although these two variables are part of the data, they are not of interest right now
a2 <- cbind(c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t"),c(0),c(1))
a1 <- data.frame(replicate(14, sample(1:20, replace = TRUE))) # 14 columns, matching x1i to x14i below
a <- cbind(a2, a1)
colnames(a, do.NULL = FALSE)
colnames(a) <- c("ID","variable1","variable2","x1i","x2i","x3i","x4i","x5i","x6i","x7i","x8i","x9i","x10i","x11i","x12i","x13i","x14i")
View(a)
I try to create an “if”, “else if” loop function utilizing "for" which is supposed to do this reading task by itself. Until now, I wrote down the following code which does not work yet.
`%notin%` <- Negate(`%in%`) # got this one from <https://www.marsja.se/how-to-use-in-in-r/>
Read_Info_Loop <- function(a,b)
{for (i in 1:20)
{ if (a[i,4] %in% b[1:(21-i),1]) {x[i]<-c(a[i,1],a[i,4],names(a)[4]); b[i]<-subset(b,b %notin% a[i,4])}
if (a[i,5] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,5],names(a)[5]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,5]))
} else if (a[i,6] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,6],names(a)[6]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,6]))
} else if (a[i,7] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,7],names(a)[7]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,7]))
} else if (a[i,8] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,8],names(a)[8]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,8]))
} else if (a[i,9] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,9],names(a)[9]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,9]))
} else if (a[i,10] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,10],names(a)[10]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,10]))
} else if (a[i,11] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,11],names(a)[11]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,11]))
} else if (a[i,12] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,12],names(a)[12]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,12]))
} else if (a[i,13] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,13],names(a)[13]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,13]))
} else if (a[i,14] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,14],names(a)[14]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,14]))
} else if (a[i,15] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,15],names(a)[15]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,15]))
} else if (a[i,16] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,16],names(a)[16]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,16]))
} else if (a[i,17] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,17],names(a)[17]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,17]))
} else if (a[i,17] %notin% b[1:(21-i),1]) {x[i]<-c(NA,NA,NA); b[i]<-c(b[i-1])}
y<-rbind(x[i[1:20]])
colnames(y, do.NULL = FALSE)
colnames(y) <- c("ID","Value","Variable")
u<-rbind(b[i=20])
colnames(u, do.NULL = FALSE)
colnames(u) <- c("Remaining numbers")
print(y)
print(u)
}
}
# y is supposed to be data frame 3 and u is supposed to be data frame 4
# in the above theoretical considerations
Errors
I now get the following errors:
Error in `[<-.data.frame`(`*tmp*`, i, value = c("a", "1", "x3i")) :
replacement has 3 rows, data has 4
Error in Read_Info_Loop(test, l) : object 'x' not found
I got the first error yesterday. Today, after restarting R, the second error occurred, which seems to point to structural problems in the function code. I am also fairly sure that further errors are currently hidden behind these two and will surface as soon as they are dealt with.
However, I do not want you to just solve any problems. I rather would like to ask, if you have ideas how I can solve these two specific errors, and maybe a hint to just get the function a little bit closer to work properly. So, for me the focus is clearly on learning a thing or two in general.
A few disclaimers: I have little experience in programming, so the code or my descriptions are probably rather messy. Therefore, if you have any questions for clarification, please feel free to ask. I try to respond as quickly as possible. English is not my first language, so please excuse me for any language mistakes.
I am looking forward to learning and to hearing your ideas about the code itself, the theoretical considerations, or the approach to the loop function.
Kind Regards
Paul
Edits / Progression
Edit: I just realized, that the code can already be simplified with another "for". Nevertheless, I read that one should rather avoid nested "for" loops (for...for...)
`%notin%` <- Negate(`%in%`) #got this one from <https://www.marsja.se/how-to-use-in-in-r/>
Read_Info_Loop2 <- function(a,b)
{for (i in 1:20) for (k in 5:17) {
{ if (a[i,4] %in% b[1:(21-i),1]) {x[i]<-c(a[i,1],a[i,4],names(a)[4]); b[i]<-subset(b,b %notin% a[i,4])
} else if (a[i,k] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,k],names(a)[k]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,k]))
} else if (a[i,k] %notin% b[1:(21-i),1]) {x[i]<-c(NA,NA,NA); b[i]<-c(b[i-1])}
}
y<-rbind(x[i[1:20]])
colnames(y, do.NULL = FALSE)
colnames(y) <- c("ID","Value","Variable")
u<-rbind(b[i=20])
colnames(u, do.NULL = FALSE)
colnames(u) <- c("Remaining numbers")
print(y)
print(u)
}
}
The same error was shown:
Error in Read_Info_Loop2(test, l) : object 'x' not found
Going forward, I will try to use this resource: https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Repetitive-execution
I am going to give further updates.
This is a tricky one. I was able to find a solution for the underlying problem but unfortunately I wasn't able to fix OP's code as it was requested.
However, here is my solution:
library(data.table)
long <- melt(setDT(a), "ID", patterns("^x"))
df3 <- long[, {
if (any(.SD$value %in% b)) {
result <- first(.SD[value %in% b])
b <- setdiff(b, result$value)
} else {
result <- data.table(variable = NA_integer_, value = NA_integer_)
}
result
}, by = ID]
df3
ID variable value
1: a x1i 1
2: b x2i 4
# remaining values
df4 <- data.table(Remaining.numbers = setdiff(b, df3$value))
df4
Remaining.numbers
1: 2
2: 3
Explanation
In a first step, the dataset a is reshaped into long format
long
ID variable value
1: a x1i 1
2: b x1i 1
3: a x2i 2
4: b x2i 4
5: a x3i 4
6: b x3i 1
Now, variable contains the column names as data items which simplifies subsequent steps. Note that melt() has maintained the original order of rows and columns which is important for picking the correct values later on.
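For readers without data.table, the same long layout can be sketched in base R. This is only an illustration of the reshaping step, not part of the solution above; the helper name value_cols is made up:

```r
# Small example data from the question
a <- data.frame(ID = c("a", "b"),
                x1i = c(1, 1), x2i = c(2, 4), x3i = c(4, 1),
                stringsAsFactors = FALSE)

# Columns whose names start with "x" hold the values to reshape
value_cols <- grep("^x", names(a), value = TRUE)

# One row per (ID, column) pair, going column by column like the melt() output
long <- data.frame(
  ID       = rep(a$ID, times = length(value_cols)),
  variable = rep(value_cols, each = nrow(a)),
  value    = unlist(a[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
long
```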
Now, we kind of loop through long by unique values of ID. This is achieved by grouping. As a speciality of data.table, we can use an arbitrary expression (enclosed in curly brackets) for aggregation.
For each ID, we check if there is at least one value still available in the vector of remaining values. If so, the first appearance is taken as resulting row. The corresponding value is removed from b which is then used in the next "iteration", i.e., the next group level.
Please note that b inside the expression (in curly brackets) is a local variable. The modified value of b is not available outside of the environment of the expression.
While testing with arbitrary datasets I have noticed that there might be situations where all numbers which belong to an ID already have been removed from remaining. To indicate this, a dummy result with NA values is returned.
So, for each ID group one row is returned which are then combined into one data.table object and assigned to df3.
df4 contains the Remaining.numbers and is created from building the set difference between b and the vector of picked values df3$value.
Note that I have tried to rewrite the code as a loop for demonstration purposes but I have given up because I found that the bookkeeping overhead wasn't worth it.
Data
For the first use case in OP's question:
a <- fread("ID x1i x2i x3i
a 1 2 4
b 1 4 1")
b <- 1:4
Other use cases with varying numbers of rows, columns, and lengths of b can be created using the code below. Please note that set.seed() is important because the created dataset a and the results df3 and df4 depend on it. For example, with set.seed(123) we can reproduce the situation where the list of remaining numbers for the last ID is exhausted.
# number of rows and columns to create
n_rows <- 18
n_cols <- 16
# create vector b
b <- 1:20
# create data.frame a
a2 <- data.frame(ID = letters[seq(n_rows)], variable1 = 0, variable2 = 1)
set.seed(123) # to ensure reproducible results
a1 <- as.data.frame(replicate(n_cols, sample(b, n_rows, replace = TRUE)))
colnames(a1) <- sprintf("x%ii", seq(n_cols))
a <- cbind(a2, a1)
Uwe’s Solution
Thank you very much, Uwe, for your solution and comprehensive explanation! It did not even occur to me, to combine the values into one list and to let the function run over that list. So, your solution opened a new perspective on the data. I am going to try out your solution in detail to learn as much as possible and report back here as soon as possible!
Solution regarding the original code
I was able to get to a solution for the original code which took quite some time.
test2 <- cbind(c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t"),c(0),c(1),c(1,1,1,sample(1:15),1,1),c(2,3,3,sample(1:15),2,3))
test1 <- data.frame(replicate(12,sample(1:20,rep=T)))
data.frame1 <- cbind(test2, test1)
colnames(data.frame1) <- c("ID","variable1","variable2","x1i","x2i","x3i","x4i","x5i","x6i","x7i","x8i","x9i","x10i","x11i","x12i","x13i","x14i")
data.frame2 <- as.data.frame(c(1:20))
x <- as.data.frame(matrix(NA,nrow = 3,ncol = 20))
rownames(x) <- c("ID","value","variable")
colnames(x) <- c()
View(x)
`%notin%` <- Negate(`%in%`) #got this one from <https://www.marsja.se/how-to-use-in-in-r/>
Read_Info_Loop2 <- function(a,b) {for (k in 1:20) {for (i in 4:17)
{if (a[k,i] %in% b[,1]) {x[k]<-c(a[k,1],(a[k,i]),names(a[i])); b<-subset(b,b %notin% a[k,i]);break}}
}
c<-rbind(x)
bx<-b
colnames(bx) <- c("numbers remaining")
print(c)
print(bx)
}
Read_Info_Loop2(data.frame1, data.frame2)
The only downside of this solution is the output, which comes out in a rather odd form, but I don't really mind. So now we already have two solutions which use different approaches. Very exciting. Regarding the output of data frames 3 and 4 (see picture below of the output for some of the actual data): the last 7 columns are NAs because data.frame1_original has just 13 rows (k=13), so the last 7 iterations (k=14 to k=20) produce no output.
Here is the output of the random data.frame1 as described above. Here, the solution looks rather weird, since for "r" and "t" all entries are already deleted from data.frame2 which returns NAs for these rows. The two numbers, which remain are 18 and 20.
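For a tidier output, here is a sketch of the same pick-and-remove double loop that creates the result table before the loop starts (assigning into an object that does not exist yet is what produced the earlier "object 'x' not found" error) and returns both results in a list. The function name pick_values and its arguments are made up for illustration; the data is the small example from the beginning of the question:

```r
# Small example: Data Frame 1 and Data Frame 2 from the question
a <- data.frame(ID = c("a", "b"),
                x1i = c(1, 1), x2i = c(2, 4), x3i = c(4, 1),
                stringsAsFactors = FALSE)
remaining <- 1:4

pick_values <- function(a, remaining, value_cols) {
  # Create the result up front: one row per ID, NA until a value is picked
  picked <- data.frame(ID = a$ID,
                       Value = NA_real_,
                       Variable = NA_character_,
                       stringsAsFactors = FALSE)
  for (k in seq_len(nrow(a))) {
    for (col in value_cols) {
      v <- a[k, col]
      if (v %in% remaining) {
        picked$Value[k] <- v
        picked$Variable[k] <- col
        remaining <- setdiff(remaining, v)  # cross the value out
        break                               # move on to the next ID
      }
    }
  }
  list(picked = picked, remaining = remaining)
}

res <- pick_values(a, remaining, c("x1i", "x2i", "x3i"))
res$picked      # data frame 3
res$remaining   # data frame 4
```

For the small example this picks 1 (x1i) for "a" and 4 (x2i) for "b", leaving 2 and 3. IDs for which every value is already crossed out keep their NA entries.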

Using regular expressions to change element names for dataframes in a list

I have a list of many dataframes, in which I'd like to change certain elements within the dataframes using regular expressions. Here is a shortened mock-up of my data:
df1 <- data.frame(ID = c("KBS_2015_08_25_A1_P1", "KBS_2015_08_25_A2_P10", "KBS_2015_09_04_A2_P2"),
Site = c("KBS","KBS","KBS"))
df2 <- data.frame(ID = c("UMBS_2015_08_12_A1_P1", "UMBS_2015_08_29_D3_P3", "UMBS_2015_08_29_D5_P5"),
Site = c("UMBS","UMBS","UMBS"))
df_list <- list(df1=df1,df2=df2)
I attempted to make a function that takes the information in the ID column and changes it to a character string of a date.
change_id <- function(df){
df$ID[df$ID == "^KBS_2015_08_25*P\\d"] <- "8/25/2015"
df$ID[df$ID == "^KBS_2015_09_04*P\\d"] <- "9/4/2015"
df$ID[df$ID == "^UMBS_2015_08_12*P\\d"] <- "8/12/2015"
df$ID[df$ID == "^UMBS_2015_08_29*P\\d"] <- "8/29/2015"
return(df)
}
df_list <- lapply(df_list, change_id)
I don't get any errors, but this function doesn't change anything in the dataframes. I must be missing something for my attempt at character matching.
Using R version 4.0.2, Mac OS X 10.13.6
We can use sub
lapply(df_list, transform, ID = sub(".*_(\\d{4}_\\d{2}_\\d{2})_.*", "\\1", ID))
If needed to be in a specific format, convert to Date class and then use format
df_list1 <- lapply(df_list, transform,
ID = format(as.Date(sub(".*_(\\d{4}_\\d{2}_\\d{2})_.*",
"\\1", ID), "%Y_%m_%d"), "%m/%d/%Y"))
-output
df_list1
#$df1
# ID Site
#1 08/25/2015 KBS
#2 08/25/2015 KBS
#3 09/04/2015 KBS
#$df2
# ID Site
#1 08/12/2015 UMBS
#2 08/29/2015 UMBS
#3 08/29/2015 UMBS
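As for why the original function changed nothing: == compares each ID with the pattern text literally, so no element ever matches. A sketch of the question's function using grepl() for the matching instead; the patterns are assumed to need .* where the question had *:

```r
df1 <- data.frame(ID = c("KBS_2015_08_25_A1_P1", "KBS_2015_08_25_A2_P10",
                         "KBS_2015_09_04_A2_P2"),
                  Site = "KBS", stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("UMBS_2015_08_12_A1_P1", "UMBS_2015_08_29_D3_P3",
                         "UMBS_2015_08_29_D5_P5"),
                  Site = "UMBS", stringsAsFactors = FALSE)
df_list <- list(df1 = df1, df2 = df2)

change_id <- function(df) {
  # grepl() returns TRUE for every ID that matches the regular expression
  df$ID[grepl("^KBS_2015_08_25.*P\\d", df$ID)] <- "8/25/2015"
  df$ID[grepl("^KBS_2015_09_04.*P\\d", df$ID)] <- "9/4/2015"
  df$ID[grepl("^UMBS_2015_08_12.*P\\d", df$ID)] <- "8/12/2015"
  df$ID[grepl("^UMBS_2015_08_29.*P\\d", df$ID)] <- "8/29/2015"
  df
}

df_list <- lapply(df_list, change_id)
df_list$df1$ID
```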
An alternative to @akrun's excellent answer is a "join" methodology. The reason this can be good is so that the pattern/replacement list can be kept as a single frame/table, making maintenance a bit easier.
It operates by using fuzzyjoin::regex_left_join, which is similar to merge and dplyr::left_join but with pattern-matches.
ptns <- data.frame(
ID_ptn = c("^KBS_2015_08_25.*P\\d", "^KBS_2015_09_04.*P\\d",
"^UMBS_2015_08_12.*P\\d", "^UMBS_2015_08_29.*P\\d"),
ID_new = c("8/25/2015", "9/4/2015", "8/12/2015", "8/29/2015")
)
fuzzyjoin::regex_left_join(df1, ptns, by = c("ID" = "ID_ptn"))
# ID Site ID_ptn ID_new
# 1 KBS_2015_08_25_A1_P1 KBS ^KBS_2015_08_25.*P\\d 8/25/2015
# 2 KBS_2015_08_25_A2_P10 KBS ^KBS_2015_08_25.*P\\d 8/25/2015
# 3 KBS_2015_09_04_A2_P2 KBS ^KBS_2015_09_04.*P\\d 9/4/2015
Expanding this to the larger list can be done with:
lapply(df_list, function(df) {
tmp <- fuzzyjoin::regex_left_join(df, ptns, by = c("ID" = "ID_ptn"))
tmp$ID <- replace(tmp$ID, !is.na(tmp$ID_new), tmp$ID_new)
tmp[ names(ptns) ] <- NULL
tmp
})
# $df1
# ID Site
# 1 8/25/2015 KBS
# 2 8/25/2015 KBS
# 3 9/4/2015 KBS
# $df2
# ID Site
# 1 8/12/2015 UMBS
# 2 8/29/2015 UMBS
# 3 8/29/2015 UMBS
This is an alternative to the more straightforward (and perhaps easier-to-see-and-understand) answer by @akrun. I offer it as a different way of looking at the problem.
(I will offer one caution: if it is possible that patterns may overlap, where two or more patterns could match a single ID, then some more steps need to be taken to determine which one to use. This will show up as some rows repeating and the number of rows increasing through the join. This is not likely given the current patterns, but ... caveat emptor.)

Merge and name data frames in for loop

I have a bunch of DF named like: df1, df2, ..., dfN
and lt1, lt2, ..., ltN
I would like to merge them in a loop, something like:
for (X in 1:N){
outputX <- merge(dfX, ltX, ...)
}
But I have trouble getting the names of output, dfX, and ltX to change in each iteration. I realize that plyr/data.table/reshape might have an easier way, but I would like the for loop to work.
Perhaps I should clarify. The DFs are quite large, which is why plyr etc. will not work (they crash). I would like to avoid copying.
The next step in the code is to save the merged DF.
This is why I prefer the for-loop approach, since I know what each merged DF is named in the environment.
You can combine data frames into lists and use mapply, as in the example below:
i <- 1:3
d1.a <- data.frame(i=i,a=letters[i])
d1.b <- data.frame(i=i,A=LETTERS[i])
i <- 11:13
d2.a <- data.frame(i=i,a=letters[i])
d2.b <- data.frame(i=i,A=LETTERS[i])
L1 <- list(d1.a, d2.a)
L2 <- list(d1.b, d2.b)
mapply(merge,L1,L2,SIMPLIFY=F)
# [[1]]
# i a A
# 1 1 a A
# 2 2 b B
# 3 3 c C
#
# [[2]]
# i a A
# 1 11 k K
# 2 12 l L
# 3 13 m M
If you'd like to save each of the resulting data frames in the global environment (I'd advise against it though), you could do:
result <- mapply(merge,L1,L2,SIMPLIFY=F)
names(result) <- paste0('output',seq_along(result))
which will give a name to every data frame in the list, and then:
sapply(names(result),function(s) assign(s,result[[s]],envir = globalenv()))
Please note that this is a base R solution that does essentially the same thing as your sample code.
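If the data frames already exist under numbered names, the two lists can be collected by name instead of being typed out. A sketch with small made-up data frames named as in the question (df1, df2, lt1, lt2); N and the column names are assumptions for illustration:

```r
# Small made-up inputs, named as in the question
df1 <- data.frame(id = 1:3, a = letters[1:3], stringsAsFactors = FALSE)
df2 <- data.frame(id = 4:6, a = letters[4:6], stringsAsFactors = FALSE)
lt1 <- data.frame(id = 1:3, b = LETTERS[1:3], stringsAsFactors = FALSE)
lt2 <- data.frame(id = 4:6, b = LETTERS[4:6], stringsAsFactors = FALSE)
N <- 2

# mget() looks the objects up by name and returns them as named lists
dfs <- mget(paste0("df", seq_len(N)))
lts <- mget(paste0("lt", seq_len(N)))

# Merge df1 with lt1, df2 with lt2, ... (on the shared column "id")
outputs <- Map(merge, dfs, lts)
names(outputs) <- paste0("output", seq_len(N))

outputs$output2
```

Each merged data frame is then reachable by a predictable name (outputs$output1, outputs$output2, ...) without creating separate variables.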
If your data frames are in a list, writing a for loop is trivial:
# lt = list(lt1, lt2, lt3, ...)
# if your data is very big, this may run you out of memory
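# the anchored pattern avoids picking up unrelated objects whose names merely contain "lt"
lt = lapply(ls(pattern = "^lt[0-9]+$"), get)
merged_data = merge(lt[[1]], lt[[2]])
# seq_along(lt)[-(1:2)] is empty when there are only two data frames
for (i in seq_along(lt)[-(1:2)]) {
merged_data = merge(merged_data, lt[[i]])
save(merged_data, file = paste0("merging", i, ".rda"))
}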

R: Using a vector to feed dataframe names for sapply

I'm quite new to R, and I'm trying to use it to organize and extract info from some tables into different but similar tables. Instead of repeating the commands and only changing the names of the tables:
#DvE, DvS, and EvS are dataframes
Sum.DvE <- data.frame(DvE$genes, DvE$FDR, DvE$logFC)
names(Sum.DvE) <- c("gene","FDR","log2FC")
Sum.DvS <- data.frame(DvS$genes, DvS$FDR, DvS$logFC)
names(Sum.DvS) <- c("gene","FDR","log2FC")
Sum.EvS <- data.frame(EvS$genes, EvS$FDR, EvS$logFC)
names(Sum.EvS) <- c("gene","FDR","log2FC")
I thought it would be easier to create a vector of the table names, and feed it into a for loop:
Sum.Comp <- c("DvE","DvS","EvS")
for(i in 1:3){
Sum.Comp[i] <- data.frame(i$genes, i$FDR, i$logFC)
names(Sum.Comp[i]) <- c("gene","FDR","log2FC")
}
But I get
>Error in i$genes : $ operator is invalid for atomic vectors
which I kind of expected because I was just trying it out, but can someone tell me if what I want to do can be done some other way, or if you have some suggestions for me, that would be much appreciated!
Clarification: Basically I'm trying to ask if there's a way to feed a dataframe name into a for loop through a vector, because I think I get the error because R doesn't realize "i" in the for loop stands for a dataframe name. This is a more simplified example:
DF1 <- data.frame(A=1:5, B=1:5, C=1:5, D=1:5)
DF2 <- data.frame(A=10:15, B=10:15, C=10:15, D=10:15)
DF3 <- data.frame(A=20:25, B=20:25, C=20:25, D=20:25)
DFs <- c("DF1", "DF2", "DF3")
for (i in 1:3){
New.i <- dataframe(i$A, i$D)
}
And I'd like it to make 3 new dataframes called "New.DF1", "New.DF2", "New.DF3" with example outputs like:
New.DF1
A D
1 1
2 2
3 3
4 4
5 5
New.DF2
A D
10 10
11 11
12 12
13 13
14 14
15 15
Thank you!
Not entirely sure I understand your problem, but the code below may do what you're asking. I've created simple values for the input data frames for testing.
DvE <- data.frame(genes=1:2, FDR=2:3, logFC=3:4)
DvS <- data.frame(genes=4, FDR=5, logFC=6)
EvS <- data.frame(genes=7, FDR=8, logFC=9)
df_names <- c("DvE","DvS", "EvS")
sum_df <- function(x) data.frame(gene=x$genes, FDR=x$FDR, log2FC=x$logFC)
for(df in df_names) {
assign(paste("Sum.",df,sep=""), do.call("sum_df", list(as.name(df)) ) )
}
Instead of operating on the names of variables, it would be easier to store the data frames you want to process in a list and then process them with lapply:
to.process <- list(DvE, DvS, EvS)
processed <- lapply(to.process, function(x) {
data.frame(gene=x$genes, FDR=x$FDR, log2FC=x$logFC)
})
Now you can access the new data frames with processed[[1]], processed[[2]], and processed[[3]].
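If the names matter, a named list keeps them attached to the results. A small sketch using the same test values as in the first answer; the "Sum." prefix follows the naming in the question:

```r
DvE <- data.frame(genes = 1:2, FDR = 2:3, logFC = 3:4)
DvS <- data.frame(genes = 4, FDR = 5, logFC = 6)
EvS <- data.frame(genes = 7, FDR = 8, logFC = 9)

# Naming the list elements lets lapply() carry the names through
to.process <- list(DvE = DvE, DvS = DvS, EvS = EvS)
processed <- lapply(to.process, function(x) {
  data.frame(gene = x$genes, FDR = x$FDR, log2FC = x$logFC)
})
names(processed) <- paste0("Sum.", names(processed))

processed$Sum.DvE   # access by name instead of by position
```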

Build Efficient R Filter

I have a data frame called data with a few columns. For simplicity I will explain them with a weather analogy: "weather_st_louis", "weather_boston", "weather_ny", and so on. I want to build a column "weather" as follows: if the weather in St. Louis exists, use that column; else if the weather in Boston exists, use that column; else if the weather in NY exists, use that column; else NONE. I'm going to use this logic many times, with many columns, so I need a way to make it more efficient. What is the R way to do this?
Also, side question, is what I'm trying to build here called a "filter"?
if(exists("data['w_stlouis']")) {
data['w'] <- data['w_stlouis']
} else if(exists("data['w_boston']")){
data['w'] <- data['w_boston']
} else if(exists("data['w_ny']")){
data['w'] <- data['w_ny']
} else {data['w'] <- NA}
Try something like this:
example <- matrix(NA,ncol=5,nrow=5)
colnames(example) <- c("weather_1","weather_2","weather_3","weather_4","weather_5")
example[5,3] <- 1
example[3,2] <- 1
example[1,2] <- 1
example[4,4] <- 1
example[5,2] <- 1
w <- apply(example,1,function(x){
o <- which(!is.na(x))[1]
if (is.na(o)) r <- "NONE"
else r <- colnames(example)[o]
r
})
w
When you have repetitive tasks to do, try to use the apply/tapply/sapply functions.
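The code above returns the name of the first non-missing column. If the value itself is wanted, as in the question, one possible sketch is a small helper that walks the columns in priority order and fills in whatever is still missing. The helper name coalesce_cols and the sample values are made up for illustration; columns that do not exist in the data frame are skipped:

```r
coalesce_cols <- function(df, cols) {
  out <- rep(NA, nrow(df))            # nothing filled in yet
  for (col in cols) {
    if (!col %in% names(df)) next     # skip columns that do not exist
    fill <- is.na(out)                # rows still without a value
    out[fill] <- df[[col]][fill]
  }
  out
}

# Made-up sample data
data <- data.frame(w_stlouis = c(70, NA, NA, NA),
                   w_boston  = c(65, 60, NA, NA),
                   w_ny      = c(NA, 55, 50, NA))

# "w_chicago" is not in the data and is ignored
data$w <- coalesce_cols(data, c("w_stlouis", "w_chicago", "w_boston", "w_ny"))
data$w   # 70 60 50 NA
```

Because the column order is passed as an argument, the same helper can be reused for other groups of columns.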
Here is another possibility. I'm not sure if this is what you need, but perhaps it gives you another way to handle it.
df <- data.frame(matrix(rnorm(25, 100, 20),ncol=5,nrow=5)) # 25 values fill the 5 x 5 matrix
colnames(df) <- c("weather_1","weather_2","weather_3","weather_4","weather_5")
library(reshape2)
df <- melt(df)
df[1:10,2] <- NA
str(df)
weather_levels <- levels(df$variable)
df$case <- ifelse(is.na(df$value), 0, 1)
These two output the same result
subset(df, df$case == 1)
na.omit(df)

Resources