I have three data frames, P1, P2 and P3, each with three columns. I want to rename the second column of each data frame to D1, D2 and D3 respectively using a loop, but nothing is working. What am I missing?
C1 <- c(12,34,22)
C2 <- c(43,86,82)
C3 <- c(98,76,25)
C4 <- c(12,34,22)
C5 <- c(43,86,82)
C6 <- c(98,76,25)
C7 <- c(12,34,22)
C8 <- c(43,86,82)
C9 <- c(98,76,25)
P1 <- data.frame(C1,C2,C3)
P2 <- data.frame(C4,C5,C6)
P3 <- data.frame(C7,C8,C9)
x <- c("P1", "P2", "P3")
b <- c("D1","D2","D3")
for (V in b){
names(x)[2] <- "V"
}
The output I would expect is:
P1 <- data.frame(C1,D1,C3)
P2 <- data.frame(C4,D2,C6)
P3 <- data.frame(C7,D3,C9)
We can use mget to collect the objects named in the string vector into a list, use Map to rename the second column of each list element with the corresponding 'b' value, and then use list2env to update those objects in the global environment.
list2env(Map(function(x, y) {names(x)[2] <- y; x}, mget(x), b), .GlobalEnv)
-output
P1
# C1 D1 C3
#1 12 43 98
#2 34 86 76
#3 22 82 25
P2
# C4 D2 C6
#1 12 43 98
#2 34 86 76
#3 22 82 25
P3
# C7 D3 C9
#1 12 43 98
#2 34 86 76
#3 22 82 25
To understand the code, the first step is to call mget on the vector of strings
mget(x)
which returns a named list of data frames.
Then we pass this list as an argument to Map along with the corresponding 'b' vector, i.e. each element of the list is one unit and, similarly, each element of the vector is one unit
Map(function(x, y) x, mget(x), b)
The function(x, y) is an anonymous (lambda) function. With it we set the name of the second column to the matching element of 'b'; inside the function, the argument that receives the 'b' element is 'y'. Finally, we wrap everything in list2env: because the result is a named list, list2env assigns each element to the object of the same name in the global environment, updating it.
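The same three steps can be tried without touching the global environment. The following is a minimal sketch, assuming a scratch environment (called env here, a name made up for this example) and two small made-up data frames; it is only meant to show what mget, Map and list2env each contribute.

```r
# Scratch environment so the example does not overwrite anything in .GlobalEnv
env <- new.env()
env$P1 <- data.frame(C1 = 1:3, C2 = 4:6, C3 = 7:9)
env$P2 <- data.frame(C4 = 1:3, C5 = 4:6, C6 = 7:9)

x <- c("P1", "P2")   # object names
b <- c("D1", "D2")   # new names for the second column

# Step 1: fetch the objects by name; the result is a list named P1, P2
frames <- mget(x, envir = env)

# Step 2: rename column 2 of each data frame; Map keeps the list names
renamed <- Map(function(df, new_name) {
  names(df)[2] <- new_name
  df
}, frames, b)

# Step 3: write each list element back to the object of the same name
list2env(renamed, envir = env)

names(env$P1)
names(env$P2)
```

Replacing env with .GlobalEnv gives the behaviour of the one-liner above.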
It is usually best to work with lists for this type of thing. That can make life easier and is generally a good workflow to learn.
list1 <- list(P1 = P1, P2 = P2, P3 = P3)
# here is your loop
for (i in seq_along(list1)) {
names(list1[[i]])[2] <- b[i]
}
# you can use Map as well for iteration, similar to @akrun's solution
# this is really identical at this point, except you've created the list differently
Map(function(df, b) {names(df)[2] <- b; df}, list1, b)
I have a dataframe with 2 numeric columns. For each row, I want to create an array of integers that fall between the values in the columns, and that includes the values in the column. Then, I want to compile all of the values into a single column to generate a histogram.
Input:
df
C1 C2
A 3 -92
B 8 -162
C 20 -97
D 50 -76
Output:
sdf5$Values
-92
-91
-90
...
2
3
-162
-161
...
7
8
...
My actual dataframe has 62 rows. My current code gives me frequencies > 100 (should have a maximum of 62 for any integer). The code worked on a dummy dataframe, so I'm not sure where things are going wrong.
list <- mapply(":", df$C2, df$C1)
df3 <- do.call(rbind.data.frame, list)
sdf3 <- stack(df3)
sdf4 <- as.data.frame(sdf3$values)
sdf5 <- rename(sdf4, Values = 1)
a <- ggplot(sdf5, aes(x=Values)) +
geom_histogram(binwidth = 1, center=0)
I'm not sure exactly what goes wrong, but I think rbind.data.frame is having trouble with the list input. As an alternative:
library(ggplot2)
df <- read.table(text = " C1 C2
A 3 -92
B 8 -162
C 20 -97
D 50 -76")
list <- mapply(":", df$C2, df$C1)
df2 <- data.frame(Values = do.call(c, list))
ggplot(df2, aes(x=Values)) +
geom_histogram(binwidth = 1, center=0)
Created on 2021-02-08 by the reprex package (v1.0.0)
Something seems to go wrong around the stack step; you can check it using table. To put all the numbers in the list into a single vector I'd use unlist.
df=data.frame(C1=floor(runif(80,0,200)),C2=floor(runif(80,-200,0)))
list <- mapply(":", df$C2, df$C1)
df3 <- do.call(rbind.data.frame, list)
sdf3 <- stack(df3)
sdf4 <- data.frame("Values"=sdf3$values)
table(sdf4)
# This returns the count of each unique value and some go up to 200,
# notably the limits of my unif distribution
If you use unlist, it gives the desired result.
df=data.frame(C1=floor(runif(80,0,200)),C2=floor(runif(80,-200,0)))
list <- mapply(":", df$C2, df$C1)
vec <- data.frame("Values"=unlist(list))
a <- ggplot(vec, aes(x=Values)) +
geom_histogram(binwidth = 1, center=0)
I don't know the stack function, but the problem must be there somehow.
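As a quick sanity check of the unlist route, here is a small sketch in base R with made-up data (the column ranges and seed are assumptions for illustration). It passes SIMPLIFY = FALSE to mapply so that the result is always a list, even if every row happens to produce a range of the same length, and then confirms that no integer is counted more often than there are rows.

```r
set.seed(1)
# Made-up data: C2 is always negative and C1 is non-negative, as in the question
df <- data.frame(C1 = sample(0:50, 10), C2 = sample(-50:-1, 10))

# One integer range per row; SIMPLIFY = FALSE guarantees a list result
ranges <- mapply(":", df$C2, df$C1, SIMPLIFY = FALSE)

# Flatten all ranges into a single vector
values <- unlist(ranges)

# Each integer appears at most once per row, so no count can exceed nrow(df)
counts <- table(values)
max(counts) <= nrow(df)
```

The values vector can then be wrapped in data.frame(Values = values) and passed to ggplot as in the answers above.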
I have a data frame with this information here:
df <- data.frame("string1" = c("ABECDE","ABECDE","ABECDE"),
"string2" = c("ABCD","ABCD","ABCD"),
"site1" = NA, "site2" = NA, "combine" = NA, "filtered" = NA)
I would like to write a code that picks sites E and D in the string and adds them to the data frame.
If the combination is already created I'd like for it to go back and chose a new combination and check again until it gets one that has not been picked.
I have provided below the code I have done so far which gives the output of:
string1 string2 site1 site2 combine filtered
1 ABECDE ABCD E3 D4 E3D4 E3D4
2 ABECDE ABCD E3 D4 E3D4 <NA>
3 ABECDE ABCD E3 D4 E3D4 <NA>
Here, E3D4 is the value you get when it first goes through the function.
I would now like for it to go back and pick the next possible combinations:
E6D4 and D5D4 for the next two lines but I have no idea how to properly structure the iteration.
Here is the code I have so far (there is probably a less redundant way to write it but I am a beginner so apologies if it is overly long)
#make the columns of string1 and string2 into vectors
string1 <- df$string1
string2 <- df$string2
#for each string in the vector check to see first if it has an E, if not, then a D
#get the output as a letter and its position (eg E3)
for (i in 1:nrow(df)){
if (grepl("E", string1[i])){
sites1 = gregexpr('E', string1[i])
df$site1 <- paste0(substring(string1[i], sites1[[1]][1], sites1[[1]][1]), sites1[[1]][1])
} else if (grepl("D", string1[i])){
sites1 <- gregexpr('D', string1[i])
df$site1 <- paste0(substring(string1[i], sites1[[1]][1], sites1[[1]][1]), sites1[[1]][1])
}
}
#do the same for the second vector
for (i in 1:nrow(df)){
if (grepl("E", string2[i])){
sites2 <- gregexpr('E', string2[i])
df$site2 <- paste0(substring(string2[i], sites2[[1]][1], sites2[[1]][1]), sites2[[1]][1])
} else if (grepl("D", string2[i])){
sites2 <- gregexpr('D', string2[i])
df$site2 <- paste0(substring(string2[i], sites2[[1]][1], sites2[[1]][1]), sites2[[1]][1])
}
}
#combine the sites
df$combine <- paste0(df$site1, df$site2)
#for each row of combined sites, check to see if the value is already created
for (i in 1:nrow(df)){
if(!df$combine[i] %in% df$filtered){
df$filtered[i] <- df$combine[i]
} else if(df$combine[i] %in% df$filtered){
#go back to for loop and look for either another E in the list
#if there is none, go to the next condition (looking for a D).
#pick the next possible values, put them together and check again
#do this continuously until you get a unique combine.
#do this for string1 and then string2 (or alternating both, which ever is easier)
}
}
Perhaps you could simplify and try the following.
Create a custom function, that will detect all positions of "D" and "E" in your strings. Then use expand.grid to get all combinations of these positions. In your example data, this will include combinations of positions 3, 5, 6 with position 4 (in the end 3 combinations: (3, 4), (5, 4), and (6, 4)).
Then, you can go through each of these combinations and create the desired strings, by combining with paste the letter from the position with the position number. A list will hold these results and be assembled in the end with rbind.
There are a few questions that remain, including if there are situations when there are no "D" or "E" letters found.
my_fun <- function(x) {
p1 <- as.numeric(unlist(gregexpr(pattern = 'D|E', x[["string1"]])))
p2 <- as.numeric(unlist(gregexpr(pattern = 'D|E', x[["string2"]])))
cbn <- expand.grid(p1, p2)
lst <- list()
for (i in seq_len(nrow(cbn))) {
site1 <- paste0(substr(x[["string1"]], cbn[i, "Var1"], cbn[i, "Var1"]), cbn[i, "Var1"])
site2 <- paste0(substr(x[["string2"]], cbn[i, "Var2"], cbn[i, "Var2"]), cbn[i, "Var2"])
lst[[i]] <- c(string1 = x[["string1"]], string2 = x[["string2"]], site1 = site1, site2 = site2, combine = paste0(site1, site2))
}
return(as.data.frame(do.call("rbind", lst)))
}
do.call(rbind, apply(df, 1, my_fun))
I created example data to test this out:
string1 string2 site1 site2 combine filtered
1 ABECDE ABCD NA NA NA NA
2 AABCDE ABCE NA NA NA NA
3 ABCDDE ACDD NA NA NA NA
Which would give the following output:
string1 string2 site1 site2 combine
1 ABECDE ABCD E3 D4 E3D4
2 ABECDE ABCD D5 D4 D5D4
3 ABECDE ABCD E6 D4 E6D4
4 AABCDE ABCE D5 E4 D5E4
5 AABCDE ABCE E6 E4 E6E4
6 ABCDDE ACDD D4 D3 D4D3
7 ABCDDE ACDD D5 D3 D5D3
8 ABCDDE ACDD E6 D3 E6D3
9 ABCDDE ACDD D4 D4 D4D4
10 ABCDDE ACDD D5 D4 D5D4
11 ABCDDE ACDD E6 D4 E6D4
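On the open question of strings that contain neither "D" nor "E": gregexpr reports "no match" as -1, which the function above would pass straight to substr. One possible way to guard against that is sketched below; find_sites is a hypothetical helper name, not part of the answer's code, and returning an empty vector is only one of several reasonable choices.

```r
# Hypothetical helper: return labelled sites such as "E3",
# or character(0) when the string has no "D" or "E".
find_sites <- function(s) {
  pos <- as.integer(gregexpr("D|E", s)[[1]])
  if (pos[1] == -1L) {
    return(character(0))   # gregexpr signals "no match" with -1
  }
  # substring is vectorised over the positions, giving one letter per match
  paste0(substring(s, pos, pos), pos)
}

find_sites("ABECDE")
find_sites("ABC")
```

With an empty result for either string, expand.grid produces zero combinations, so such a row would simply contribute nothing to the output.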
I am trying to reorganize my data, basically a list of data.frames.
Its elements represent subjects of interest (A and B), with observations on x and y, collected on two occasions (1 and 2).
I am trying to make this a list that contains data.frames referring to the subjects, with the information on which occasion x and y were collected being stored in the respective data.frames as new variable, as opposed to the element name:
library('rlist')
A1 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
A2 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
B1 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
B2 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
list <- list(A1=A1,A2=A2,B1=B1,B2=B2)
A <- do.call(rbind,list.match(list,"A"))
B <- do.call(rbind,list.match(list,"B"))
list <- list(A=A,B=B)
list <- lapply(list,function(x) {
y <- data.frame(x)
y$class <- c(rep.int(1,2),rep.int(2,2))
return(y)
})
> list
$A
x y class
A1.1 66 96 1
A1.2 76 58 1
A2.1 50 93 2
A2.2 57 12 2
$B
x y class
B1.1 58 56 1
B1.2 69 15 1
B2.1 77 77 2
B2.2 9 9 2
In my real world problem there are about 500 subjects, not always two occasions, differing numbers of observations.
So my example above is just to illustrate where I want to get, and I am stuck at how to pass to the do.call-rbind that it should, based on elements names, bind subject-specific elements as new list elements together, while assigning a new variable.
To me, this is a somewhat fuzzy task, and the closest I got was the rlist package. This question is related but uses unique to identify elements, whereas in my case it seems to be more a regex problem.
I'd be happy even for instructions on how to use google, any keywords for further research etc.
From the data you provided:
subj <- sub("[A-Z]*", "", names(lst))
newlst <- Map(function(x, y) {x[,"class"] <- y;x}, lst, subj)
First we use a regular expression to isolate the number that will go in the class column. In this case, I matched the capital letters and erased them, leaving the number, so "A1" becomes "1". Please note that your real names may require a different regex pattern.
Then we use Map to create a new column for each data frame and save to a new list called newlst. Map takes the first element of each argument and carries out the function then continues on with each object element. So the first data frame in lst and the first number in subj are used first. The anonymous function I used is function(x,y) {x[, "class"] <- y; x}. It takes two arguments. The first is the data frame, the second is the column value.
Now it's much easier to move forward. We can create a vector called uniq.nmes holding the names of the subjects whose data frames we will combine, so that "A1" becomes "A". Then we can rbind on that match:
uniq.nmes <- unique(sub("\\d+", "", names(lst)))
lapply(uniq.nmes, function(x) {
do.call(rbind, newlst[grep(x, names(newlst))])
})
# [[1]]
# x y class
# A1.1 1 79 1
# A1.2 30 13 1
# A2.1 90 39 2
# A2.2 43 22 2
#
# [[2]]
# x y class
# B1.1 54 59 1
# B1.2 83 90 1
# B2.1 85 36 2
# B2.2 91 28 2
Data
A1 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
A2 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
B1 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
B2 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
lst <- list(A1=A1,A2=A2,B1=B1,B2=B2)
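With around 500 subjects, matching subject names with grep can pick up unintended elements (for example, a pattern "A" also matches a name such as "BA1"). A variant that groups by the exact subject prefix is sketched below. It assumes every name consists of a non-numeric subject part followed by a numeric occasion part, and the toy data and variable names are made up for illustration.

```r
# Toy data; B10 shows that multi-digit occasions are handled
lst <- list(
  A1  = data.frame(x = 1:2, y = 3:4),
  A2  = data.frame(x = 5:6, y = 7:8),
  B1  = data.frame(x = 1:2, y = 3:4),
  B10 = data.frame(x = 5:6, y = 7:8)
)

# Split each name into subject (leading non-digits) and occasion (trailing digits)
subject  <- sub("[0-9]+$", "", names(lst))
occasion <- as.integer(sub("^[^0-9]+", "", names(lst)))

# Add the occasion as a column to each data frame
tagged <- Map(function(d, k) { d$class <- k; d }, lst, occasion)

# Group the data frames by exact subject name, then rbind within each group
by_subject <- lapply(split(tagged, subject), function(g) do.call(rbind, g))

names(by_subject)
by_subject$B
```

Because split groups on the full subject string, the result is also a named list, so by_subject$A and by_subject$B can be accessed directly.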
It sounds like you're doing a lot of gymnastics because you have a specific form in mind. What I would suggest is first trying to make the data tidy. Without reading the link, the quick summary is to put your data into a single data frame, where it can be easily processed.
The quick version of the answer (here I've used lst instead of list for the name to avoid confusion with the built-in list) is to do this:
do.call(rbind,
  lapply(seq_along(lst), function(i) {
    lst[[i]]$type <- names(lst)[i]; lst[[i]]
  })
)
What this will do is create a single data frame, with a column, "type", that contains the name of the list item in which that row appeared.
Using a slightly simplified version of your initial data:
lst <- list(A1=data.frame(x=rnorm(5)), A2=data.frame(x=rnorm(3)), B=data.frame(x=rnorm(5)))
lst
$A1
x
1 1.3386071
2 1.9875317
3 0.4942179
4 -0.1803087
5 0.3094100
$A2
x
1 -0.3388195
2 1.1993115
3 1.9524970
$B
x
1 -0.1317882
2 -0.3383545
3 0.8864144
4 0.9241305
5 -0.8481927
And then applying the magic function
df <- do.call(rbind,
  lapply(seq_along(lst), function(i) {
    lst[[i]]$type <- names(lst)[i]; lst[[i]]
  })
)
df
x type
1 1.3386071 A1
2 1.9875317 A1
3 0.4942179 A1
4 -0.1803087 A1
5 0.3094100 A1
6 -0.3388195 A2
7 1.1993115 A2
8 1.9524970 A2
9 -0.1317882 B
10 -0.3383545 B
11 0.8864144 B
12 0.9241305 B
13 -0.8481927 B
From here we can process to our heart's content, with operations like df$subject <- gsub("[0-9]*", "", df$type) to extract the non-numeric portion of type; tools like split can then be used to generate the sub-lists that you mention in your question.
In addition, once it is in this form, you can use functions like by and aggregate or libraries like dplyr or data.table to do more advanced split-apply-combine operations for data analysis.
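As a rough illustration of those follow-up steps, the sketch below starts from a small made-up data frame in the same long format (the values are invented for the example), derives the subject column, and then uses split and aggregate from base R.

```r
# Made-up long-format data with the same columns as df above
df <- data.frame(
  x    = c(0.1, 0.2, 0.3, 0.4),
  type = c("A1", "A2", "B", "B"),
  stringsAsFactors = FALSE
)

# Strip the digits from type to get the subject
df$subject <- gsub("[0-9]*", "", df$type)

# One data frame per subject, as a named list
by_subject <- split(df, df$subject)

# Per-subject summary, here the mean of x
means <- aggregate(x ~ subject, data = df, FUN = mean)

names(by_subject)
means
```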
I have a simple data set in the form:
From,To,Date,Subject
I would like to reshape this data such that lines as:
e1,e2;e3;e4,d1,s1
Get expanded to:
e1,e2,d1,s1
e1,e3,d1,s1
e1,e4,d1,s1
Now, I get this done with a for loop over my data frame, constructing a new one on the fly, but I wondered if there is a more "R" way of doing this?
Edit:
This is what I currently have, it works but is kind of ugly (and showing my still somewhat limited R-skills):
filteredEmailsExpanded <- NULL
toCol <- 2
for (row in 1:nrow(filteredEmails)) {
receivers <- sapply(strsplit(filteredEmails[row, toCol], ","), function(x) gsub(" ", "", x))
for (receiver in receivers) {
newRow <- rep(filteredEmails[row,], times = 1)
newRow$To <- receiver
filteredEmailsExpanded <- rbind(filteredEmailsExpanded, newRow)
}
}
How about you first expand your data frame (call it d), repeating the ith row n(i) times, where n(i) is the number of ';'-separated entries in d$To[i], and then replace d$To with the individual entries? I've added an extra row to your example data to illustrate this better.
d <- data.frame(
From = c("e1", "e5"),
To = c("e2;e3;e4", "e6;e7"),
Date = c("d1", "d2"),
Subject = c("s1", "s2"),
stringsAsFactors = FALSE)
v <- strsplit(d$To, ";")
lengths <- sapply(v, length)
d <- d[rep(1:nrow(d), lengths), ]
d$To <- unlist(v)
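The same idea can be wrapped in a small reusable function. This is only a sketch: expand_rows is a hypothetical name, and trimming whitespace and resetting the row names are additions of mine rather than part of the code above.

```r
# Hypothetical helper: one output row per separator-delimited entry in `col`
expand_rows <- function(d, col, sep = ";") {
  parts <- strsplit(d[[col]], sep, fixed = TRUE)
  # Repeat row i once for each entry found in that row
  out <- d[rep(seq_len(nrow(d)), lengths(parts)), , drop = FALSE]
  out[[col]] <- trimws(unlist(parts))
  rownames(out) <- NULL   # replace row names like "1.1" with 1, 2, 3, ...
  out
}

d <- data.frame(
  From = c("e1", "e5"),
  To = c("e2;e3;e4", "e6;e7"),
  Date = c("d1", "d2"),
  Subject = c("s1", "s2"),
  stringsAsFactors = FALSE)

long <- expand_rows(d, "To")
long
```

Note that a row whose To field is an empty string yields zero entries and is therefore dropped from the result.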
You may want to look at my "splitstackshape" package, in particular, the function concat.split.multiple which has a "long" argument.
Using @konvas's sample data, try:
library(splitstackshape)
concat.split.multiple(d, "To", ";", "long")
# From Date Subject time To
# 1 e1 d1 s1 1 e2
# 2 e5 d2 s2 1 e6
# 3 e1 d1 s1 2 e3
# 4 e5 d2 s2 2 e7
# 5 e1 d1 s1 3 e4
# 6 e5 d2 s2 3 <NA>
Alternatively, check out its successor function (which hasn't yet made it into the package). The successor is presently called cSplit and is available as a Gist. It is much faster but just as easy to use:
## cSplit(indt = d, splitCols = "To", sep = ";", direction = "long")
cSplit(d, "To", ";", "long")
# From To Date Subject
# 1: e1 e2 d1 s1
# 2: e1 e3 d1 s1
# 3: e1 e4 d1 s1
# 4: e5 e6 d2 s2
# 5: e5 e7 d2 s2
The input.txt file contains 8000000 rows and 4 columns. The first 2 columns are text and the last 2 columns are numbers. The number of unique symbols (e.g., "c33") in columns 1 and 2 is not fixed. The values in columns 3 and 4 are the numbers of unique symbols in columns 1 and 2, respectively, after splitting by "]".
Each row of input.txt file is like this:
c33]c21]c5]c7]c8]c9 TPS2]MIC17]ERG3]NNF1]CIS3]CWP2 6 6
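**The desired result:
row[i, k] denotes the set of symbols in column k of row i, such as "c33 c21 c5 c7 c8 c9" or "TPS2 MIC17 ERG3 NNF1 CIS3 CWP2", and |.| denotes the number of symbols in a set, e.g. |c33 c21 c5 c7 c8 c9| = 6.
The overlap of rows i and j, as computed in the code below, is |row[i,1] ∩ row[j,1]| * |row[i,2] ∩ row[j,2]| / min(|row[i,1]| * |row[i,2]|, |row[j,1]| * |row[j,2]|). If two rows overlap (overlap >= 0.6), the numbers of these two rows are written to a file.**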
My code is as follows, but it runs too slowly.
The code:
library(compiler)
enableJIT(3)
data<-read.table("input.txt",header=FALSE)
row<-8000000
for (i in 1:(row-1)){
row11<-unlist(strsplit(as.character(data[i,1]),"]"))
row12<-unlist(strsplit(as.character(data[i,2]),"]"))
s1<-data[i,3]*data[i,4]
zz<-file(paste("output",i,".txt",sep=""),"w")
for (j in (i+1):row)
{ row21<-unlist(strsplit(as.character(data[j,1]),"]"))
row22<-unlist(strsplit(as.character(data[j,2]),"]"))
up<-length(intersect(row11,row21))*length(intersect(row12,row22))
s2<-data[j,3]*data[j,4]
down<-min(s1,s2)
if ((up/down)>=0.6) cat(i,"\t",j,"\n",file=zz,append=TRUE)
}
close(zz)
}
The running result:
Each row produces its own file, which looks like this:
1 23
1 67
1 562
1 78
...
To make it run faster, I rewrote the code. The new version is as follows.
The input.txt contains 16000000 rows. The number of columns is not fixed. The number of unique symbols (e.g., "c33") in columns 1 and 2 is not fixed. Each pair of rows in the input.txt file looks like this:
The 1st row (odd row1): c33 c21 c5 c7 c8
The 2nd row (even row1): TPS2 MIC17 ERG3 NNF1 CIS3 CWP2 MCM6
The 3rd row (odd row2): c33 c21 c5 c21 c18 c4 c58
The 4th row (even row2): TPS12 MIC3 ERG2 NNF1 CIS4
**The desired result:
If a pair of rows overlaps (>=0.6) with another pair of rows, the numbers of these two pairs are written to a file.**
The code:
library(compiler)
enableJIT(3)
con <- file("input.txt", "r")
zz<-file("output.txt","w")
oddrow1<-readLines(con,n=1)
j<-0
i<-0
while( length(oddrow1) != 0 ){
oddrow1<-strsplit(oddrow1," ")
evenrow1<-readLines(con,n=1)
evenrow1<-strsplit(evenrow1," ")
j<-j+1
con2 <- file("input.txt", "r")
readLines(con2,n=(j*2))
oddrow2<-readLines(con2,n=1)
i<-j
while( length(oddrow2) != 0 ){
i<-i+1
oddrow2<-strsplit(oddrow2," ")
evenrow2<-readLines(con2,n=1)
evenrow2<-strsplit(evenrow2," ")
oddrow1<-unlist(oddrow1)
oddrow2<-unlist(oddrow2)
evenrow1<-unlist(evenrow1)
evenrow2<-unlist(evenrow2)
up<-length(intersect(oddrow1,oddrow2))*length(intersect(evenrow1,evenrow2))
down<-min(length(oddrow1)*length(evenrow1),length(oddrow2)*length(evenrow2))
if ((up/down)>=0.6) {cat(j,"\t",i,"\n",file=zz,append=TRUE) }
oddrow2<-readLines(con2,n=1)
}
close(con2)
oddrow1<-readLines(con,n=1)
}
close(con)
close(zz)
The running result:
It produces a single file, which looks like this:
1 23
1 67
1 562
1 78
2 25
2 89
3 56
3 79
...
Both of the above methods are too slow. How can I rewrite this code to make it run fast? Thank you!
Well, I suspect this uses too much memory for data of your size, but perhaps it will provoke some ideas.
Make up some data, with 20 total unique values and 5 to 10 in each cell.
set.seed(5)
n <- 1000L
ng <- 20
g1 <- paste(sample(10000:99999, ng))
g2 <- paste(sample(10000:99999, ng))
n1 <- sample(5:10, n, replace=TRUE)
n2 <- sample(5:10, n, replace=TRUE)
x1 <- sapply(n1, function(i) paste(g1[sample(ng, i)], collapse="|"))
x2 <- sapply(n2, function(i) paste(g2[sample(ng, i)], collapse="|"))
Load the Matrix library and define a helper function that takes a list of string vectors and converts it to a sparse matrix with one column per unique string and 1's where a string is present in a row.
library(Matrix)
str2mat <- function(s) {
n <- length(s)
ni <- sapply(s, length)
s <- unlist(s)
u <- unique(s)
spMatrix(nrow=n, ncol=length(u), i=rep(1L:n, ni), j=match(s, u), x=rep(1, length(s)))
}
OK, now we can actually do something. First create the matrices and get the total number present in each row.
m1 <- str2mat(strsplit(x1, "|", fixed=TRUE))
m2 <- str2mat(strsplit(x2, "|", fixed=TRUE))
n1 <- rowSums(m1)
n2 <- rowSums(m2)
Now we can use crossproducts of these matrices to get the numerator, and outer with pmin to get the denominator. We can then compute the overlap and test whether it exceeds 0.6 (use >= instead of > to match the threshold in the question exactly). Since we have the whole matrix, we're not interested in the diagonal or the lower half. (There are ways of storing this kind of matrix more efficiently with the Matrix library, but I'm not sure how.) We then use which to get the rows that have enough overlap.
num <- tcrossprod(m1)*tcrossprod(m2)
n12 <- n1*n2
den <- outer(n12, n12, pmin)
use <- num/den > 0.6
diag(use) <- FALSE
use[lower.tri(use)] <- FALSE
out <- which(use, arr.ind=TRUE)
> head(out)
[,1] [,2]
[1,] 64 65
[2,] 27 69
[3,] 34 81
[4,] 26 82
[5,] 5 85
[6,] 21 115
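To see why the crossproducts give the numerator, here is a tiny sketch in base R with dense matrices and made-up symbol sets. The indicator helper is a hypothetical dense stand-in for str2mat, and the result for one pair is compared against the intersect-based formula from the question.

```r
# Toy data: each row has two symbol sets (made-up values)
col1 <- list(c("c33", "c21", "c5"), c("c33", "c21", "c7"), c("c9", "c8"))
col2 <- list(c("TPS2", "MIC17"), c("TPS2", "MIC17", "ERG3"), c("NNF1"))

# Hypothetical dense stand-in for str2mat: rows are sets, columns are symbols.
# (Assumes more than one unique symbol, otherwise sapply returns a vector.)
indicator <- function(sets) {
  u <- unique(unlist(sets))
  t(sapply(sets, function(s) as.numeric(u %in% s)))
}
m1 <- indicator(col1)
m2 <- indicator(col2)

# Entry [i, j] of tcrossprod(m) counts the symbols shared by sets i and j
num  <- tcrossprod(m1) * tcrossprod(m2)
size <- rowSums(m1) * rowSums(m2)
den  <- outer(size, size, pmin)
overlap <- num / den

# The same quantity for rows 1 and 2, computed as in the question's loop
brute <- length(intersect(col1[[1]], col1[[2]])) *
  length(intersect(col2[[1]], col2[[2]])) /
  min(length(col1[[1]]) * length(col2[[1]]),
      length(col1[[2]]) * length(col2[[2]]))

# Pairs above the diagonal that reach the question's threshold
hits <- which(overlap >= 0.6 & upper.tri(overlap), arr.ind = TRUE)
hits
```

For 8000000 rows the full pairwise matrix would not fit in memory, so in practice the rows would presumably have to be processed in blocks; the sketch only illustrates the arithmetic.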