I have a vector of strings that I'm trying to convert into a data frame with a frequency column. So far so good, but when I dim my data frame, I get only one column instead of two. I guess R is using the words as the index values.
Anyway here is how it starts. My list:
a<-c("welcoming", "whatsyourexcuse", "whiteway", "zero", "yay", "whatsyourexcuse", "yay")
Then, I tried to sort the frequency values in decreasing order and store as data frame using:
df <- as.data.frame(sort(table(a), decreasing=TRUE))
Problem is when I dim(df) I get [1] 5 1 instead of [1] 5 2. Here is what df looks like:
sort(table(a), decreasing = TRUE)
whatsyourexcuse 2
yay 2
welcoming 1
whiteway 1
zero 1
instead of:
a Freq
[1] whatsyourexcuse 2
[2] yay 2
[3] welcoming 1
[4] whiteway 1
[5] zero 1
Any pointers please? Thanks.
Try:
library(plyr)
a1 <- count(a)
a1[order(-a1$freq),]
# x freq
# 2 whatsyourexcuse 2
# 4 yay 2
# 1 welcoming 1
# 3 whiteway 1
# 5 zero 1
dim(a1)
#[1] 5 2
Or
a2 <- stack(sort(table(a),decreasing=TRUE))[,2:1]
dim(a2)
#[1] 5 2
When you convert to a data.frame using as.data.frame(sort(table(a), decreasing=TRUE)), the names of the elements become the rownames of the data frame, so you create only one column instead of two. After sort, the result is no longer a table object. For comparison, check str(table(a)) and str(sort(table(a), decreasing=TRUE)).
You can also create the data.frame by
tbl <- sort(table(a), decreasing=TRUE)
data.frame(col1= names(tbl), Values= as.vector(tbl))
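As a minimal, self-contained sketch of the approach above, the following builds the two-column data frame directly from the names and counts of the sorted table. The column names word and Freq and the object name freq_df are illustrative choices, not part of the original question.

```r
# Input vector from the question
a <- c("welcoming", "whatsyourexcuse", "whiteway", "zero", "yay",
       "whatsyourexcuse", "yay")

# Count occurrences, then sort the counts in decreasing order
tbl <- sort(table(a), decreasing = TRUE)

# Build the data frame explicitly from the names and the counts,
# so the result does not depend on how as.data.frame() treats tbl
freq_df <- data.frame(word = names(tbl),
                      Freq = as.vector(tbl),
                      stringsAsFactors = FALSE)

dim(freq_df)
# [1] 5 2
```

Building the columns by hand keeps the words in an ordinary column rather than in the row names, whatever class the sorted object has.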
I have two files. The first file is a data frame that is simply times in one column and individuals in a second
# [Time] [Individual]
# [1] 1528142 C5A1790
# [2] 1528142 C5A1059
# [3] 1528142 C5A1084
# [4] 1528142 C5A1564
# [5] 1528142 C5A1239
# [6] 1528142 C5A1180
The second is an N x N matrix in which both rows and columns are individuals, including those in the data frame.
# [C5A1084] [C5A1059] [C5A1790] [C5A1180]
# 1 [C5A1084] 0 0.5 1 0
# 2 [C5A1059] 0.5 0 0 1
# 3 [C5A1790] 1 1 0 0.5
# 4 [C5A1180] 0 1 0.5 0
I need to create a vector containing the row numbers in the matrix at which I can find the individuals from the data frame, and in the order that they are listed in the data frame. For these example data it would be (3,2,1,4).
I tried to use the which() function as
RingIndex <- which(Matrix$IDcolumn == FrameIDs)
and received the "longer object length is not a multiple of shorter object length" message, presumably because the matrix includes more individuals than the data frame. %in% and match() are also returning errors stating that replacement has fewer rows than data.
Following the advice in the comments, I tried
RingIndex <- which(Matrix$IDcolumn %in% FrameIDs)
which successfully returned the correct row numbers, but in ascending order rather than the order of the original data. The match() function continues to complain of different replacement and original lengths.
What approach could I use to get my vector?
Many thanks!
df <- data.frame(Time = runif(6,1528142,1528150),
Individuals = c("C5A1790","C5A1791","C5A1792","C5A1793","C5A1794","C5A1795"))
> df
Time Individuals
1 1528144 C5A1790
2 1528143 C5A1791
3 1528144 C5A1792
4 1528148 C5A1793
5 1528145 C5A1794
6 1528143 C5A1795
nnMatrix <- matrix(runif(36,0,1),6,6)
colnames(nnMatrix) <- df$Individuals
rownames(nnMatrix) <- df$Individuals
> nnMatrix
C5A1790 C5A1791 C5A1792 C5A1793 C5A1794 C5A1795
C5A1790 0.08096946 0.8716328 0.6895134 0.05692825 0.4555460 0.53224424
C5A1791 0.42568532 0.5920239 0.4523232 0.11516185 0.8053652 0.72299411
C5A1792 0.42439187 0.6101881 0.8534429 0.86010851 0.1269521 0.41066857
C5A1793 0.26043345 0.8011337 0.8032234 0.30930988 0.2298927 0.93320166
C5A1794 0.43065533 0.2161525 0.6702832 0.89304071 0.6765714 0.09769635
C5A1795 0.70594252 0.1048099 0.7478553 0.87839534 0.5173364 0.69957502
> sapply(df$Individuals, function(t) which(colnames(nnMatrix) == t))
[1] 1 2 3 4 5 6
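If you reverse the order of the column names, the same call returns the reversed indices:
colnames(nnMatrix) <- rev(colnames(nnMatrix))
> sapply(df$Individuals, function(t) which(colnames(nnMatrix) == t))
[1] 6 5 4 3 2 1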
You may want to check for repetition and missing values, but the main approach is the same.
As suggested in the comments (@GKi), match will also work. With the reversed column names it gives the same result:
> match(df$Individuals,colnames(nnMatrix))
[1] 6 5 4 3 2 1
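To relate this back to the question's example, here is a small sketch using the four individuals shown in the matrix. The names frame_ids, mat_ids, m, and ring_index are made up for illustration, and the matrix values are placeholders. It contrasts match(), which keeps the order of the data frame, with which() plus %in%, which returns positions in ascending order.

```r
# IDs in the order they appear in the data frame (only those shown in the matrix)
frame_ids <- c("C5A1790", "C5A1059", "C5A1084", "C5A1180")

# IDs in the order they appear as matrix rows and columns
mat_ids <- c("C5A1084", "C5A1059", "C5A1790", "C5A1180")
m <- matrix(0, nrow = 4, ncol = 4, dimnames = list(mat_ids, mat_ids))

# match() returns, for each data frame ID, its row position in the matrix
ring_index <- match(frame_ids, rownames(m))
ring_index
# [1] 3 2 1 4

# which() with %in% only reports which matrix rows are present, in ascending order
ascending_index <- which(rownames(m) %in% frame_ids)
ascending_index
# [1] 1 2 3 4
```

Any data frame ID that is absent from the matrix would come back as NA from match(), so it may be worth checking anyNA(ring_index) before using the result as an index.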
I have the following example:
library(stringi)
dat <- read.table(text="index string
1 'I have first and second'
2 'I have first, first'
3 'I have second and first and thirdeen'", header=TRUE)
toMatch <- c('first', 'second', 'third')
dat$count <- stri_count_regex(dat$string, paste0('\\b',toMatch,'\\b', collapse="|"))
dat
index string count
1 1 I have first and second 2
2 2 I have first, first 2
3 3 I have second and first and thirdeen 2
I want to add a column count to the data frame that tells me how many UNIQUE words each row contains. The desired output in this case would be
index string count
1 1 I have first and second 2
2 2 I have first, first 1
3 3 I have second and first and thirdeen 2
Could you please give me a hint how to modify the original formula? Thank you very much
With base R you could do the following:
sapply(dat$string, function(x)
{sum(sapply(toMatch, function(y) {grepl(paste0('\\b', y, '\\b'), x)}))})
which returns
[1] 2 1 2
Hope this helps!
We can use stri_match_all instead, which returns the actual matches, and then count the distinct values using n_distinct from dplyr or length(unique(x)) in base R.
library(stringi)
library(dplyr)
sapply(stri_match_all(dat$string, regex = paste0('\\b',toMatch,'\\b',
collapse="|")), n_distinct)
#[1] 2 1 2
Or similarly, without dplyr:
sapply(stri_match_all(dat$string, regex = paste0('\\b',toMatch,'\\b',
collapse="|")), function(x) length(unique(x)))
#[1] 2 1 2
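For a version that needs no extra packages, a sketch along the same lines can use gregexpr() and regmatches() from base R. The data frame is rebuilt here with data.frame() rather than read.table() so the block runs on its own; the names pattern and hits are illustrative.

```r
# Example data from the question
dat <- data.frame(
  index  = 1:3,
  string = c("I have first and second",
             "I have first, first",
             "I have second and first and thirdeen"),
  stringsAsFactors = FALSE
)
toMatch <- c("first", "second", "third")

# One alternation pattern with word boundaries around every search term
pattern <- paste0("\\b", toMatch, "\\b", collapse = "|")

# Extract all matches per row, then count the distinct ones
hits <- regmatches(dat$string, gregexpr(pattern, dat$string, perl = TRUE))
dat$count <- vapply(hits, function(x) length(unique(x)), integer(1))

dat$count
# [1] 2 1 2
```

A row with no match yields an empty character vector from regmatches(), so its count is 0.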
Hi Stack Overflow Community,
I've spent a few hours on this but couldn't find the answer. I have a list of 200 sublists in R. Each contains a character column and an integer column named FREQUENCY. My goal is to show only the integer columns. I've tested this manually with the list function and the first two sublists, and it works:
mydata <- list(Name1[[1]]$FREQUENCY, Name1[[2]]$FREQUENCY)
Now to my question: how can I take all 200 sublists with one command? I need the list function in this process, because I have to sum each FREQUENCY sublist in a next step:
lapply(mydata, sum)
Thank you guys!
Here's a base solution (if I understand properly):
your_list <- list(data.frame(a="hello",b=1),
data.frame(c="world",d=1))
# [[1]]
# a b
# 1 hello 1
#
# [[2]]
# c d
# 1 world 1
lapply(your_list,function(x) x[,sapply(x,is.numeric),drop=FALSE])
# [[1]]
# b
# 1 1
#
# [[2]]
# d
# 1 1
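If the goal is specifically a list of FREQUENCY vectors that can then be summed, a sketch closer to the question's own code might look like the following. Name1 is rebuilt here with two toy sublists, and the column name WORD is a made-up stand-in for the character column.

```r
# Toy stand-in for the real list of 200 sublists
Name1 <- list(
  data.frame(WORD = c("a", "b"),      FREQUENCY = c(1L, 2L)),
  data.frame(WORD = c("c", "d", "e"), FREQUENCY = c(3L, 4L, 5L))
)

# Pull the FREQUENCY column out of every sublist in one call
mydata <- lapply(Name1, function(x) x$FREQUENCY)

# Sum each FREQUENCY vector, as in the question's next step
totals <- lapply(mydata, sum)
totals
# [[1]]
# [1] 3
#
# [[2]]
# [1] 12
```

Using sapply(mydata, sum) instead would return the totals as a plain vector rather than a list.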
How do I search for a string in a data.frame? As a minimal example, how do I find the locations (columns and rows) of 'horse' in this data.frame?
> df = data.frame(animal=c('goat','horse','horse','two', 'five'), level=c('five','one','three',30,'horse'), length=c(10, 20, 30, 'horse', 'eight'))
> df
animal level length
1 goat five 10
2 horse one 20
3 horse three 30
4 two 30 horse
5 five horse eight
... so row 4 and 5 have the wrong order. Any output that would allow me to identify that 'horse' has shifted to the level column in row 5 and to the length column in row 4 is good. Maybe:
> magic_function(df, 'horse')
col row
'animal', 2
'animal', 3
'length', 4
'level', 5
Here's what I want to use this for: I have a very large data frame (around 60 columns, 20.000 rows) in which some columns are messed up for some rows. It's too large to eyeball in order to identify the different ways that order can be wrong, so searching would be nice. I will use this info to move data to the correct columns for these rows.
What about:
which(df == "horse", arr.ind = TRUE)
# row col
# [1,] 2 1
# [2,] 3 1
# [3,] 5 2
# [4,] 4 3
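To get output shaped like the magic_function in the question, the index matrix can be mapped back to column names. This is only a sketch: find_value and horse_hits are hypothetical names, and the data frame is rebuilt with every column as character.

```r
# Example data from the question (all columns stored as character)
df <- data.frame(
  animal = c("goat", "horse", "horse", "two", "five"),
  level  = c("five", "one", "three", "30", "horse"),
  length = c("10", "20", "30", "horse", "eight"),
  stringsAsFactors = FALSE
)

# Return the column name and row number of every cell equal to `value`
find_value <- function(data, value) {
  hits <- which(data == value, arr.ind = TRUE)  # matrix with "row" and "col" columns
  data.frame(col = colnames(data)[hits[, "col"]],
             row = unname(hits[, "row"]),
             stringsAsFactors = FALSE)
}

horse_hits <- find_value(df, "horse")
horse_hits
#      col row
# 1 animal   2
# 2 animal   3
# 3  level   5
# 4 length   4
```

The comparison is an exact match on the whole cell; for partial matches, a grepl-based test per column would be needed instead.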
Another way around:
l <- sapply(colnames(df), function(x) grep("horse", df[,x]))
$animal
[1] 2 3
$level
[1] 5
$length
[1] 4
If you want the output to be matrix:
sapply(l,'[',1:max(lengths(l)))
animal level length
[1,] 2 5 4
[2,] 3 NA NA
We can get the linear (column-major) indices where the value equals horse. Integer-dividing the zero-based index by the number of rows (nrow) gives the column index, and the remainder of that division gives the row index.
We use colnames to get column names instead of indices.
idx <- which(df == "horse")
data.frame(col = colnames(df)[(idx - 1) %/% nrow(df) + 1],
row = (idx - 1) %% nrow(df) + 1)
# col row
#1 animal 2
#2 animal 3
#3 level 5
#4 length 4
Another way to do it is the following:
library(data.table)
library(zoo)
library(dplyr)
library(timeDate)
library(reshape2)
Assume the data frame is named tbl_Account.
First, transpose it:
temp = t(tbl_Account)
Then, put it into a list:
temp = list(temp)
Transposing turns the data frame into a single matrix (a character matrix if any column holds text), so every cell can be searched in one go.
Then do the searching:
temp[[1]][grep("Horse",temp[[1]])] #brings back the actual value occurrences
grep("Horse", temp[[1]]) # brings back the position of the element in a list it occurs in
hope this helps :)
Here is what my data look like.
id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{
As you can see, there can be multiple codes concatenated into a single column, separated by {. It is also possible for a row to have no interest_string value at all.
How can I manipulate this data frame to extract the values into a format like this:
id interest
1 YI
1 Z0
1 ZI
2 ZO
3 <NA>
4 ZT
I need to complete this task with R.
Thanks in advance.
This is one solution
out <- with(dat, strsplit(as.character(interest_string), "\\{"))
## or
# out <- with(dat, strsplit(as.character(interest_string), "{", fixed = TRUE))
out <- cbind.data.frame(id = rep(dat$id, times = sapply(out, length)),
interest = unlist(out, use.names = FALSE))
Giving:
R> out
id interest
1 1 YI
2 1 Z0
3 1 ZI
4 2 ZO
5 3 <NA>
6 4 ZT
Explanation
The first line of solution simply splits each element of the interest_string factor in data object dat, using \\{ as the split indicator. This indicator has to be escaped and in R that requires two \. (Actually it doesn't if you use fixed = TRUE in the call to strsplit.) The resulting object is a list, which looks like this for the example data
R> out
[[1]]
[1] "YI" "Z0" "ZI"
[[2]]
[1] "ZO"
[[3]]
[1] "<NA>"
[[4]]
[1] "ZT"
We have almost everything we need in this list to form the output you require. The only thing we need external to this list is the id values that refer to each element of out, which we grab from the original data.
Hence, in the second line, we bind, column-wise (specifying the data frame method so we get a data frame returned) the original id values, each one repeated the required number of times, to the strsplit list (out). By unlisting this list, we unwrap it to a vector which is of the required length as given by your expected output. We get the number of times we need to replicate each id value from the lengths of the components of the list returned by strsplit.
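The two steps explained above can be sketched as a self-contained example. It assumes the missing entry is a real NA rather than the literal string "<NA>", and it uses lengths() in place of sapply(out, length); the name parts is illustrative.

```r
# Example data; the missing value is assumed to be a true NA
dat <- data.frame(
  id = 1:4,
  interest_string = c("YI{Z0{ZI{", "ZO{", NA, "ZT{"),
  stringsAsFactors = FALSE
)

# Step 1: split every string on the literal "{" (fixed = TRUE avoids escaping)
parts <- strsplit(dat$interest_string, "{", fixed = TRUE)

# Step 2: repeat each id once per extracted code and flatten the list
out <- data.frame(
  id       = rep(dat$id, times = lengths(parts)),
  interest = unlist(parts, use.names = FALSE),
  stringsAsFactors = FALSE
)

out
#   id interest
# 1  1       YI
# 2  1       Z0
# 3  1       ZI
# 4  2       ZO
# 5  3     <NA>
# 6  4       ZT
```

strsplit() returns a single NA for an NA input, so the row for id 3 is kept with a missing interest value.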
A nice and tidy data.table solution:
library(data.table)
DT <- data.table( read.table( textConnection("id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{"), header=TRUE))
DT$interest_string <- as.character(DT$interest_string)
DT[, {
list(interest=unlist(strsplit( interest_string, "{", fixed=TRUE )))
}, by=id]
gives me
id interest
1: 1 YI
2: 1 Z0
3: 1 ZI
4: 2 ZO
5: 3 <NA>
6: 4 ZT