Counting the number of individual letters in a string in R

I have the following string of characters:
pig<-c("A","B","C","D","AB","ABC","AB","AA","CD","CA",NA)
I am trying to get R to tell me how many times each letter occurs in total and how many NAs there are. In this case I would like the result to look like this:
print(cow)
A B C D NA
6 3 4 2 1
I have tried table in combination with strsplit but cannot figure out exactly how to do it. Any thoughts? Thanks!

You would need to use NULL (or the empty character "") for the split value in strsplit(), then unlist it. Then, in table() you'll want to use the useNA argument to include any NA values. Here we'll use "ifany", so that if there are any NA values they will be shown in the table and if there are not, NA will not be shown in the result at all.
table(unlist(strsplit(pig, NULL)), useNA = "ifany")
#
# A B C D <NA>
# 7 4 4 2 1
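The same steps can be run one at a time to see where the counts come from. This is a minimal sketch using the pig vector from the question; the names chars and counts are made up for illustration.

```r
pig <- c("A", "B", "C", "D", "AB", "ABC", "AB", "AA", "CD", "CA", NA)

# Split every string into single characters; an NA element stays NA
chars <- unlist(strsplit(pig, NULL))

# Count each letter, adding an <NA> entry only if NAs are present
counts <- table(chars, useNA = "ifany")
print(counts)

# Individual counts can be pulled out by name
n_A  <- counts[["A"]]      # occurrences of "A"
n_na <- sum(is.na(chars))  # number of NAs
```

Note that "AA" contributes two A's, which is why the total for A is higher than a count of the strings containing A.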


Adding a vector to a column, without specifying the other columns

I would like to add a vector to a column, without specifying the other columns. I have example data as follows.
library(data.table)
dat <- fread("A B C D
one 2 three four
two 3 NA one")
vector_to_add <- c("five", "six")
Desired output:
out <- fread("A B C D
one 2 three four
two 3 NA one
NA NA five NA
NA NA six NA")
I saw some answers using an approach where vectors are used to rowbind:
row3 <- c(NA, NA, "five", NA)
I would however like to find a solution in which I do not have to specify the whole row.
EDIT: Shortly after posting I realised that it would probably be easiest to take an existing row, make the row NA, and replace the value in the column where the vector would be added, for each entry in the vector. This is however still quite a cumbersome solution I guess.
If you put your vector in a data frame whose single column is named after the target column, you can rbind it with fill=TRUE and the rest of the cells are filled with NAs.
df_to_add <- data.frame(C=c("five", "six"))
rbind(dat, df_to_add, fill=TRUE)
A B C D
1: one 2 three four
2: two 3 <NA> one
3: <NA> NA five <NA>
4: <NA> NA six <NA>
You can also use the rbindlist() function from the data.table package to add a vector to a column without specifying the other columns. rbindlist() takes a list of data.tables, data.frames or lists and stacks them into a single data.table. It does not accept a bare atomic vector as an item, so the vector has to be wrapped first.
In your case, wrap the vector in a one-column data.table named after the target column and call rbindlist() with fill = TRUE, which fills the columns missing from an item with NA:
library(data.table)
dat <- fread("A B C D
one 2 three four
two 3 NA one")
vector_to_add <- c("five", "six")
# Wrap the vector in a one-column data.table named after the target column
to_add <- data.table(C = vector_to_add)
# fill = TRUE pads the columns missing from to_add (A, B, D) with NA
out <- rbindlist(list(dat, to_add), fill = TRUE)
After running this code, the data table out should contain the desired output:
A B C D
1: one 2 three four
2: two 3 <NA> one
3: <NA> NA five <NA>
4: <NA> NA six <NA>
Because rbindlist() takes a list, you can append several such one-column tables in a single call by adding them to the list.

How do you return the list of unique values in dataframe and not the index value of the list when aggregating in R?

Given the below dataframe
df <- data.frame(cbind(seq(1:4),rep(letters[seq(1:3)],4)))
X1 X2
1 a
2 b
3 c
4 a
1 b
2 c
3 a
4 b
1 c
2 a
3 b
4 c
I would like to summarize unique X2s by X1. For example,
1 a,b,c
2 b,c,a
3 c,a,b
4 a,b,c
I am very close. I use the following code:
summary <- aggregate(df$X2, list(df$X1), FUN=unique)
which produces
Group.1 X
1 1,2,3
2 2,3,1
3 3,1,2
4 1,2,3
(the index of the list). What is the most efficient way to get my desired result?
I am certain there is an easy solution and I've tried searching, but I must not be using the correct search terms. Thank you in advance.
We can use toString to paste the elements
aggregate(X2~X1, unique(df), toString )
Or if we need to keep it as list
aggregate(X2~X1, transform(unique(df), X2 = as.character(X2)), list)
As the OP also asked for an efficient approach, here is a data.table option
library(data.table)
unique(setDT(df))[, .(X2 = toString(X2)), by = X1]
Regarding the creation of the data.frame, it is easier, more compact and less error-prone to build it without wrapping cbind in data.frame. The main reason is that cbind converts its arguments to a matrix, and a matrix can hold only a single class. So, if there is a single character column or element, all the elements are converted to character. In R versions before 4.0.0, data.frame and as.data.frame then used stringsAsFactors=TRUE by default, so the character columns were converted to factor class, which is why the aggregate call in the question returns the integer codes instead of the letters.
df <- data.frame(X1= 1:4, X2= rep(letters[1:3],4), stringsAsFactors= FALSE)
The above code gets the intended output. Note that seq is not needed when we use :
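Putting the two pieces together, a minimal sketch of the toString approach on the cleanly built data frame might look like this. The name res is made up for illustration.

```r
# Same data as in the question, built without cbind()
df <- data.frame(X1 = 1:4, X2 = rep(letters[1:3], 4), stringsAsFactors = FALSE)

# Drop duplicate rows, then collapse the X2 values within each X1 group
res <- aggregate(X2 ~ X1, data = unique(df), FUN = toString)
print(res)
```

Here X2 stays character and X1 stays integer, so the result lists the letters themselves rather than factor codes.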

R - remove rows from a data frame with empty lines (not only numbers)

The issue seems to be something already treated but after a check I couldn't find any solution. I load a table from a file and it could be (don't know how) that some entire lines are empty. So when I get the data frame I got
# id c1 c2
# 1 a 1 2
# 2 b 2 4
# 3 NA NA
# 4 d 6 1
# 5 e 7 5
# 6 NA NA
if I do
apply(df, 1, function(x) all(is.na(x)))
I got all FALSE as the first column is not a number (the table is much bigger with mixed character and numeric columns) and I can't filter these lines. Also with na.omit or complete.cases I cannot sort it out.
Is there any function or expression to check empty rows?
You may be able to cut this problem off at the source with the parameters you pass to read.csv.
For instance, if the blank cells are empty strings or a single space, you could use
df <- read.csv(<your other logic here>, na.strings=c("NA","", " "))
This question seems to raise similar issues: read.csv blank fields to NA
If this works, then you can use the apply logic to work with the offending rows.
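If the file cannot be re-read, the blank rows can also be detected after loading. The following is only a sketch: the data frame is a hypothetical stand-in for the table in the question, and it assumes that a blank cell is either NA, an empty string, or whitespace only.

```r
# Hypothetical stand-in for the table in the question: rows 3 and 6 are empty
df <- data.frame(id = c("a", "b", "", "d", "e", " "),
                 c1 = c(1, 2, NA, 6, 7, NA),
                 c2 = c(2, 4, NA, 1, 5, NA),
                 stringsAsFactors = FALSE)

# For every column, flag cells that are NA, "" or whitespace only
is_blank <- sapply(df, function(col) is.na(col) | trimws(as.character(col)) == "")

# A row is empty when all of its cells are blank
empty_rows <- rowSums(is_blank) == ncol(df)

cleaned <- df[!empty_rows, ]
print(cleaned)
```

Checking column by column avoids the coercion that apply(df, 1, ...) performs, so mixed character and numeric columns are handled the same way.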

Find out if column in R table includes duplicate values?

I've got a lovely dataframe, my very first, and I'm starting to get the hang of R. One thing I haven't been able to find is a test for duplicate values. I have one column that I'm pretty sure is all unique values, but I don't know that.
Is there a way I can ask? For simplicity, let's pretend this is my data:
var1 var2 var3
1 1 A 1
2 2 B 3
3 3 C NA
4 4 D NA
5 5 E 4
and I want to know whether var1 ever repeats.
Check out the duplicated function:
duplicated(dat$var1) # logical vector: TRUE for each value of var1 that already appeared in an earlier row
See ?duplicated for the documentation.
You should also look at the unique function.
Remove duplicates based on columns:
my_data[!duplicated(my_data$Col_id), ] # ! is logical negation, so this keeps the first occurrence of each Col_id
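To get a single yes/no answer to "does var1 ever repeat?", the logical vector can be collapsed with any(), or anyDuplicated() can be used directly. A minimal sketch on the example data from the question:

```r
dat <- data.frame(var1 = 1:5,
                  var2 = c("A", "B", "C", "D", "E"),
                  var3 = c(1, 3, NA, NA, 4))

# TRUE if at least one value of var1 occurs more than once
has_dup_var1 <- any(duplicated(dat$var1))

# 0 if there are no duplicates, otherwise the index of the first duplicate
first_dup_var1 <- anyDuplicated(dat$var1)

# NA is treated like any other value, so the second NA in var3 is a duplicate
has_dup_var3 <- any(duplicated(dat$var3))

print(c(has_dup_var1, has_dup_var3))
```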

Read multidimensional group data in R

I have done a lot of googling but I didn't find a satisfactory solution to my problem.
Say we have data file as:
Tag v1 v2 v3
A 1 2 3
B 1 2 2
C 5 6 1
A 9 2 7
C 1 0 1
The first line is the header. The first column is the group id (the data have 3 groups: A, B, C) while the other columns are values.
I want to read this file in R so that I can apply different functions on the data.
For example I tried to read the file and tried to get column mean
dt<-read.table(file_name,head=T) #gives warnings
apply(dt,2,mean) #gives NA NA NA
I want to read this file and want to get column mean. Then I want to separate the data in 3 groups (according to Tag A,B,C) and want to calculate mean(column wise) for each group. Any help
apply(dt,2,mean) doesn't work because apply coerces the first argument to an array via as.matrix (as is stated in the first paragraph of the Details section of ?apply). Since the first column is character, all elements in the coerced matrix object will be character.
Try this instead:
sapply(dt,mean) # works column by column because data.frames are lists; the non-numeric Tag column gives NA with a warning
To calculate column means by groups:
# using base functions
grpMeans1 <- t(sapply(split(dt[,c("v1","v2","v3")], dt[,"Tag"]), colMeans))
# using plyr
library(plyr)
grpMeans2 <- ddply(dt, "Tag", function(x) colMeans(x[,c("v1","v2","v3")]))
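As a further base-R option, aggregate() with a formula returns the group means as a data frame. This is a sketch that reads the example data inline with read.table(text = ...) instead of from a file; the names col_means and grp_means are made up for illustration.

```r
dt <- read.table(text = "Tag v1 v2 v3
A 1 2 3
B 1 2 2
C 5 6 1
A 9 2 7
C 1 0 1", header = TRUE)

# Overall column means, leaving out the non-numeric Tag column
col_means <- colMeans(dt[, c("v1", "v2", "v3")])

# Column means within each Tag group; "." stands for all columns other than Tag
grp_means <- aggregate(. ~ Tag, data = dt, FUN = mean)

print(col_means)
print(grp_means)
```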
