I have 1000+ rows of strings which I extracted from a column of an Excel worksheet. Here's what the data looks like (3 rows):
Chicken(31%);Duck(16%);Wild duck(14%);Turkey(10%);Pigeon(4%);Goose(4%);Wild bird(4%);Tree sparrow(2%)
Tree sparrow(2%)
Chicken(1%)
I need to put the data into a table (for this example: 8 columns x 3 rows). Can anyone help?
x <- c("Chicken(31%);Duck(16%);Wild duck(14%);Turkey(10%);Pigeon(4%);Goose(4%);Wild bird(4%);Tree sparrow(2%)",
"Tree sparrow(2%)", "Chicken(1%)")
There is most likely a more concise way, but you can try something like this:
library(stringi)
library(data.table)
# Drop empty lines if any
txt <- Filter(function(x) !stri_isempty(stri_trim(x)), x)
# Extract matches
matches <- stri_match_all_regex(txt, "([\\w\\s]+)\\((\\d+)%\\);?")
matches[[1]]
## [,1] [,2] [,3]
## [1,] "Chicken(31%);" "Chicken" "31"
## [2,] "Duck(16%);" "Duck" "16"
## [3,] "Wild duck(14%);" "Wild duck" "14"
## [4,] "Turkey(10%);" "Turkey" "10"
## [5,] "Pigeon(4%);" "Pigeon" "4"
## [6,] "Goose(4%);" "Goose" "4"
## [7,] "Wild bird(4%);" "Wild bird" "4"
## [8,] "Tree sparrow(2%)" "Tree sparrow" "2"
# Rearrange
rows <- lapply(
matches,
function(x) setNames(as.list(as.numeric(x[, 3])), x[, 2]))
rbindlist(rows, fill=TRUE)
## Chicken Duck Wild duck Turkey Pigeon Goose Wild bird Tree sparrow
## 1: 31 16 14 10 4 4 4 2
## 2: NA NA NA NA NA NA NA 2
## 3: 1 NA NA NA NA NA NA NA
Regex explanation
([\\w\\s]+) # At least one word character or whitespace *, 1st group
\\( # Left parenthesis
(\\d+) # At least one digit (0-9, so values such as 10 match). You can replace + with {1,3}, 2nd group
% # Percent sign
\\) # Right parenthesis
;? # Optional semicolon
* Could be \\w[\\w\\s]* if the name must start with a word character
Here's one possible solution (note that it tabulates which items are present in each row, not their percentages):
library(qdapTools)
mtabulate(strsplit(gsub("\\(\\d+%\\)", "", x), ";"))
## Chicken Duck Goose Pigeon Tree sparrow Turkey Wild bird Wild duck
## 1 1 1 1 1 1 1 1 1
## 2 0 0 0 0 1 0 0 0
## 3 1 0 0 0 0 0 0 0
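For comparison, here is a minimal base R sketch that keeps the percentages without extra packages. It assumes every entry has the form Name(NN%) and that entries are separated by semicolons; the names parts, species and tab are only illustrative.

```r
x <- c("Chicken(31%);Duck(16%);Wild duck(14%);Turkey(10%);Pigeon(4%);Goose(4%);Wild bird(4%);Tree sparrow(2%)",
       "Tree sparrow(2%)", "Chicken(1%)")

# Split each row into its "Name(NN%)" entries
parts <- strsplit(x, ";", fixed = TRUE)

# All distinct names, in order of first appearance
species <- unique(sub("\\(.*$", "", unlist(parts)))

# One row per input string, one column per name; missing names become NA
tab <- t(sapply(parts, function(p) {
  nm  <- sub("\\(.*$", "", p)                              # text before "("
  pct <- as.numeric(sub("^.*\\((\\d+)%\\)$", "\\1", p))    # digits inside "(..%)"
  setNames(pct, nm)[species]
}))
colnames(tab) <- species
tab
```

The result is a numeric matrix; wrap it in as.data.frame() if a data frame is preferred.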
The pattern list looks like:
pattern <- c('aaa','bbb','ccc','ddd')
X is a column of df and looks like:
df$X <- c('aaa-053','aaa-001','aab','bbb')
What I tried to do: use agrep to find, for each value of df$X, the matching name in pattern, then assign a value to an existing column 'column2' based on the result. For example, if 'aaa-053' matches 'aaa', then 'aaa' should be the value in 'column2'; if nothing matches, that row should be NA.
for (i in 1:length(pattern)) {
match <- agrep(pattern, df$X, ignore.case=TRUE, max=0)
if agrep = TRUE {
df$column2 <- pattern
} else {df$column2 <- na
}
}
The ideal column2 in df looks like:
'aaa','aaa',NA,'bbb'
agrep by itself isn't going to give you much to determine which to use when multiples match. For instance,
agrep(pattern[1], df$X)
# [1] 1 2 3
which makes sense for the first two, but the third is not among your expected values. Similarly, it's feasible that it might select multiple patterns for a given string.
Here's an alternative:
D <- adist(pattern, df$X, fixed = FALSE)
D
# [,1] [,2] [,3] [,4]
# [1,] 0 0 1 3
# [2,] 3 3 2 0
# [3,] 3 3 3 3
# [4,] 3 3 3 3
D[D > 0] <- NA
D
# [,1] [,2] [,3] [,4]
# [1,] 0 0 NA NA
# [2,] NA NA NA 0
# [3,] NA NA NA NA
# [4,] NA NA NA NA
apply(D, 2, function(z) which.min(z)[1])
# [1] 1 1 NA 2
pattern[apply(D, 2, function(z) which.min(z)[1])]
# [1] "aaa" "aaa" NA "bbb"
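Putting the steps above together, a minimal sketch that writes the result into df$column2 could look like this. The data frame is rebuilt from the question's sample values so the snippet runs on its own; only zero-distance (exact partial) matches are kept, as in the answer.

```r
pattern <- c("aaa", "bbb", "ccc", "ddd")
df <- data.frame(X = c("aaa-053", "aaa-001", "aab", "bbb"),
                 stringsAsFactors = FALSE)

# Distance of every pattern (rows) to every value of df$X (columns)
D <- adist(pattern, df$X, fixed = FALSE)

# Keep only perfect matches
D[D > 0] <- NA

# Index of the matching pattern per value, NA when nothing matches
hit <- apply(D, 2, function(z) which.min(z)[1])
df$column2 <- pattern[hit]
df
```

If several patterns match a value with distance 0, which.min picks the first one in pattern, so the order of pattern acts as a priority.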
Are there any direct functions that can be used to get the combinations of all the items in the vector?
myVector <- c(1,2,3)
for (i in myVector)
for (j in myVector)
for (k in myVector)
print(paste(i,j,k,sep=","))
As there are three values (1, 2, 3), there will be
3 * 3 * 3 = 27 lines
I tried to get the permutations using the function permn():
permn(myVector)
But it gives only 9 different values.
Is there any direct function that can produce a result like the first one (the output of the nested loops)?
Using RcppAlgos::permuteGeneral.
r <- RcppAlgos::permuteGeneral(myVector, length(myVector), repetition=TRUE)
head(r, 3)
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 1 1 2
# [3,] 1 1 3
If you want the comma separated strings, do
apply(r, 1, paste, collapse=",")
# [1] "1,1,1" "1,1,2" "1,1,3" "1,2,1" "1,2,2" "1,2,3" "1,3,1"
# [8] "1,3,2" "1,3,3" "2,1,1" "2,1,2" "2,1,3" "2,2,1" "2,2,2"
# [15] "2,2,3" "2,3,1" "2,3,2" "2,3,3" "3,1,1" "3,1,2" "3,1,3"
# [22] "3,2,1" "3,2,2" "3,2,3" "3,3,1" "3,3,2" "3,3,3"
Or, for the list output you've also shown:
RcppAlgos::permuteGeneral(myVector, length(myVector), FUN=function(x)
paste(x, collapse=","), repetition=TRUE)
# [[1]]
# [1] "1,1,1"
#
# [[2]]
# [1] "1,1,2"
#
# [[3]]
# [1] "1,1,3"
#
# [[4]]
# [1] "1,2,1"
# ...
You may decide on your own :)
Use expand.grid :
tmp <- expand.grid(myVector, myVector, myVector)
tmp
# Var1 Var2 Var3
#1 1 1 1
#2 2 1 1
#3 3 1 1
#4 1 2 1
#5 2 2 1
#6 3 2 1
#...
#...
If you want to do this automatically for the length of myVector without manually specifying it 3 times you can use replicate.
tmp <- do.call(expand.grid, replicate(length(myVector),
myVector, simplify = FALSE))
To paste the values together you can do :
do.call(paste, c(tmp, sep = ','))
# [1] "1,1,1" "2,1,1" "3,1,1" "1,2,1" "2,2,1" "3,2,1" "1,3,1" "2,3,1"
# [9] "3,3,1" "1,1,2" "2,1,2" "3,1,2" "1,2,2" "2,2,2" "3,2,2" "1,3,2"
#[17] "2,3,2" "3,3,2" "1,1,3" "2,1,3" "3,1,3" "1,2,3" "2,2,3" "3,2,3"
#[25] "1,3,3" "2,3,3" "3,3,3"
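Note that expand.grid varies its first column fastest, so the strings above come out in a different order than the nested loops in the question. A small sketch, assuming the loop order matters, is to reverse the columns before pasting (grid and res are illustrative names):

```r
myVector <- c(1, 2, 3)

# One copy of myVector per position
grid <- expand.grid(replicate(length(myVector), myVector, simplify = FALSE))

# Reverse the columns so the last position changes fastest, as in the nested loops
res <- do.call(paste, c(rev(grid), sep = ","))
head(res, 4)
```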
Note that there is a permutations function in the gtools package that allows you to generalize permutation outputs:
library(gtools)
permutations(3, 3, 1:3, repeats.allowed = TRUE)
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 2
[3,] 1 1 3
[4,] 1 2 1
[5,] 1 2 2
[6,] 1 2 3
[7,] 1 3 1
[8,] 1 3 2
[9,] 1 3 3
[10,] 2 1 1
The function help describes the parameter settings.
It appears that pracma::combs does exactly this. That, and pracma::perms generate output sets which treat every element of the input as distinct, regardless of whether a value is repeated.
Suppose we have a matrix M
M <- matrix(c(1:9),3,3)
diag(M) <- NA
M
[,1] [,2] [,3]
[1,] NA 4 7
[2,] 2 NA 8
[3,] 3 6 NA
where each entry describes the outcomes of pairwise interactions. Each interaction of row i with column j is interpreted as "object i outperformed object j X times". Examples: Object 2 performs better than object 1 in 2 cases. Object 1 performs better than object 3 in 7 cases.
Is there a quick way to transform this matrix into an object holding this information in a format where each row fully describes the interactions between two objects? The goal is something like this:
[,1] [,2] [,3] [,4]
[1,] "OBJ1" "OBJ2" "N1" "N2"
[2,] "1" "2" "4" "2"
[3,] "1" "3" "7" "3"
[4,] "2" "3" "8" "6"
where the first two columns give the objects that are compared while columns 3 and 4 describe how often OBJ1 outperformed OBJ2 and vice versa. The interpretation of the first row is: Object 1 has outperformed Object 2 4 times, whereas Object 2 has outperformed Object 1 2 times. I have been playing around with reshape2 and aggregating without useful results so far.
Maybe you can try the code below
inds <- t(combn(dim(M)[1], 2))
Mout <- `colnames<-`(
cbind(inds, M[inds], M[inds[, 2:1]]),
do.call(paste0, rev(expand.grid(1:2, c("Obj", "N"))))
)
which gives
> Mout
Obj1 Obj2 N1 N2
[1,] 1 2 4 2
[2,] 1 3 7 3
[3,] 2 3 8 6
Another solution could be:
M <- matrix(c(1:9),3,3)
diag(M) <- NA
M1 <- M
M[upper.tri(M, diag=TRUE)] <- NA
M1[lower.tri(M1, diag=TRUE)] <- NA
R1 = reshape2::melt(M1, na.rm=TRUE, value.name="N1")
R2 = reshape2::melt(M, na.rm=TRUE, value.name="N2")
R1$N2 <- R2$N2
rownames(R1) <- NULL
Output:
> R1
Var1 Var2 N1 N2
1 1 2 4 2
2 1 3 7 3
3 2 3 8 6
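A further base R variant, sketched here under the assumption that M is square with NA on the diagonal, uses which(..., arr.ind = TRUE) to list the upper-triangle positions and indexes the transposed matrix for the reverse direction (idx and out are illustrative names):

```r
M <- matrix(1:9, 3, 3)
diag(M) <- NA

# Row/column positions of the upper triangle, i.e. each pair (i, j) with i < j
idx <- which(upper.tri(M), arr.ind = TRUE)

out <- data.frame(
  OBJ1 = idx[, "row"],
  OBJ2 = idx[, "col"],
  N1   = M[idx],      # times OBJ1 outperformed OBJ2
  N2   = t(M)[idx]    # times OBJ2 outperformed OBJ1
)
out
```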
I need to load social network data where each user has an unknown and potentially large number of friends, stored as a text file of the following format:
UserId: FriendId1, FriendId2, ...
1: 12, 33
2:
3: 4, 6, 10, 15, 16
into a two-column data.frame:
UserId FriendId
1 1 12
2 1 33
3 3 4
4 3 6
5 3 10
6 3 15
7 3 16
How would you do that in R?
Reading, filling and then reshaping is inefficient, as it requires keeping many columns full of NA in memory.
If you really have a colon as a delimiter, then just use read.table with header = FALSE to get your data into R, then consider using cSplit from my "splitstackshape" package.
mydf <- read.table("test.txt", sep = ":", header = FALSE)
mydf
## V1 V2
## 1 1 12, 33
## 2 2
## 3 3 4, 6, 10, 15, 16
library(splitstackshape)
cSplit(mydf, "V2", ",", "long")
## V1 V2
## 1: 1 12
## 2: 1 33
## 3: 3 4
## 4: 3 6
## 5: 3 10
## 6: 3 15
## 7: 3 16
This reads the lines, then one-by-one parses them into two column matrices. This does produce character values (since lines of text are just characters) but it's trivial to coerce to numeric:
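rLines <- readLines("test.txt")
do.call(rbind, lapply(rLines, function(L) {
n <- sub(":.*", "", L)  # user id: everything before the colon
# friend ids: everything after the colon, split on commas
items <- scan(text = sub(".*:", "", L), sep = ",", quiet = TRUE)
matrix(c(rep(n, length(items)), items), ncol = 2)  # 0 rows if no friends
}))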
#---------
[,1] [,2]
[1,] "1" "12"
[2,] "1" "33"
[3,] "3" "4"
[4,] "3" "6"
[5,] "3" "10"
[6,] "3" "15"
[7,] "3" "16"
If the conversion isn't obvious, see ?as.numeric and ?as.data.frame.
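To get numeric columns directly, a minimal base R sketch along the same lines is shown below. It assumes each line has the form "UserId: id, id, ..." and uses an in-memory character vector in place of the file; with a real file, replace it with readLines() on your path. The names lines, parts, user, friends and edges are illustrative.

```r
# Stand-in for readLines("test.txt")
lines <- c("1: 12, 33", "2:", "3: 4, 6, 10, 15, 16")

# Split each line into the user id and the (possibly missing) friend list
parts <- strsplit(lines, ":", fixed = TRUE)
user  <- as.integer(vapply(parts, `[`, character(1), 1))

# Parse the friend ids; users without friends get an empty vector
friends <- lapply(parts, function(p) {
  if (length(p) < 2) return(integer(0))
  as.integer(strsplit(trimws(p[2]), ",\\s*")[[1]])
})

# Repeat each user id once per friend; users without friends drop out
edges <- data.frame(UserId   = rep(user, lengths(friends)),
                    FriendId = unlist(friends))
edges
```

Because the long table is built directly, no wide intermediate full of NA values is ever created.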
I am new to R and I was wondering if there is a way to create a data frame from lists. Here is an example.
n = c(1,4,5)
b = c(7,19,20)
v = c(3,8,9,4,5)
x = list(n,b,v)
If I use the command x, I get columns. Is there a way I can combine them as rows, if they have similar headers (like employee, count, id, row number, pages, page visits), and create a data frame like this?
employee | count | id |row number| pages| page visits
1 4 5
7 19 20
3 8 9 4 5
You can try stri_list2matrix from the "stringi" package:
library(stringi)
stri_list2matrix(x, byrow = TRUE)
# [,1] [,2] [,3] [,4] [,5]
# [1,] "1" "4" "5" NA NA
# [2,] "7" "19" "20" NA NA
# [3,] "3" "8" "9" "4" "5"
However, your sample data only has 5 columns and you are expecting to create a data.frame with 6 columns.
You can also try listCol_w from my "splitstackshape" package:
library(splitstackshape)
listCol_w(data.table(id = seq_along(x), x), "x", fill = NA_real_)
id x_fl_1 x_fl_2 x_fl_3 x_fl_4 x_fl_5
1: 1 1 4 5 NA NA
2: 2 7 19 20 NA NA
3: 3 3 8 9 4 5
The NA_real_ is so that the results can be retained as numeric. (NA_integer_ is also appropriate here.)
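Another option uses rbind.fill from the "plyr" package:
library(plyr)
# Column names, taken from the headers in the question
cnams <- c("employee", "count", "id", "row_number", "pages", "page_visits")
Reduce(function(z, y) rbind.fill(z,
setNames(data.frame(as.list(y)), cnams[seq_along(y)])),
x,
init = setNames(data.frame(as.list(cnams))[0, ], cnams))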
employee count id row_number pages page_visits
1 1 4 5 <NA> <NA> <NA>
2 7 19 20 <NA> <NA> <NA>
3 3 8 9 4 5 <NA>
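If you would rather avoid extra packages, a minimal base R sketch is to pad every vector with NA to the length of the longest one and bind the results as rows. The names width, m and res are illustrative, and the default column names V1 to V5 are kept because the sample data has only five columns.

```r
x <- list(c(1, 4, 5), c(7, 19, 20), c(3, 8, 9, 4, 5))

# Length of the longest vector
width <- max(lengths(x))

# `length<-` pads shorter vectors with NA; sapply returns one column per
# list element, so transpose to get one row per element
m <- t(sapply(x, `length<-`, width))

res <- as.data.frame(m)
res
```

The values stay numeric here, unlike the character matrix returned by stri_list2matrix.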