R: counting the number of occurrences produces strange output

I have a dataset consisting of 10,000 rows. I draw a random sample of 1,000 rows from it.
Name Age ...
Alice
Jasmine
Alice
Joel
Jimmy
Alice
Alex
Agar
Agar
When I count the number of occurrences of each name in the column:
name <- table(example['Name'], useNA = "ifany")
The output is strange. It shows a name, Bruce, with a count of 0. Bruce does not appear in the random sample of 1,000 rows, only in the original dataset. I only want to use the random sample. Is the 0 count normal? How do I get rid of it, or is that impossible?
Alice 3
Jasmine 1
Joel 1
Agar 2
Jimmy 1
Alex 1
Bruce 0

You may use droplevels to drop unused factor levels.
name <- table(droplevels(example['Name']))
Consider this example:
set.seed(123)
#Sample dataframe
df <- data.frame(a = factor(sample(c('A', 'B', 'C'), 10, replace = TRUE)))
#Select only first 5 rows so we don't have any row with "A" value.
df1 <- df[1:5, , drop = FALSE]
table(df1['a'])
#A B C
#0 1 4
table(droplevels(df1['a']))
#B C
#1 4

It sounds like your Name column is a factor, so table() reports a count for every factor level, including levels that no longer occur in the sample. Note this passage in the help text for table(): "Only when exclude is specified (i.e., not by default) and non-empty, will table potentially drop levels of factor arguments." You may want to specify exclude to drop factor levels. Alternatively, rebuild the factor on the random sample (call factor() on the column again) so that its levels are only the unique values found in that sample.
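To illustrate the last suggestion, here is a minimal sketch of rebuilding the factor on the sampled rows. The data and the names `original` and `sampled` are made up for illustration; they are not the asker's dataset.

```r
# Hypothetical stand-in for the full dataset: "Bruce" occurs only in row 5.
original <- data.frame(Name = factor(c("Alice", "Jasmine", "Alice", "Joel", "Bruce")))

# Stand-in for the random sample: the first four rows, so no "Bruce".
sampled <- original[1:4, , drop = FALSE]

# The factor still carries the level "Bruce", so table() reports it with 0.
counts_before <- table(sampled$Name)

# Calling factor() again rebuilds the levels from the values actually present.
sampled$Name <- factor(sampled$Name)
counts_after <- table(sampled$Name)

print(counts_before)
print(counts_after)
```

This should give the same counts as droplevels(); which one to use is mostly a matter of taste.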

Related

Remove NA from a dataset in R

I have used this code to remove rows where Age is blank:
data <- data[data$Age != "",]
in this dataset
  Initial Age Type
1 S       21  Customer
2 D           Enquirer
3 T       35  Customer
4 D       36  Customer
However if I run the above code, I get this:
    Initial Age Type
1   S       21  Customer
N/A N/A     N/A N/A
3   T       35  Customer
4   D       36  Customer
When all I want is:
  Initial Age Type
1 S       21  Customer
3 T       35  Customer
4 D       36  Customer
I just want the dataset without any NAs. I want to remove the blank rows, so ideally all rows where Age is NA or just "".
I have tried the na.omit function but this deletes everything from my dataset.
This is an example dataset, but my real dataset has over 1000 columns and I would like to remove all rows that are NA for a particular column name.
This is my first post. I apologise if this isn't the right way to write up my code; I am also very new to R.
Also, the row number has been converted to NA, which I don't want, and it's messing up my calculation.
Thank you for taking the time to read and comment on this post.
As pointed out in the comments, it would be good to know what the exact values in the "empty" Age cells are. When I recreate the above data snippet using:
data <- data.frame(Initial = c("S", "D", "T", "D"),
Age = c(21, "", 35, 36),
Type = c("Customer", "Enquirer", "Customer", "Customer"))
We can see that "Age" becomes a column of type "character", because c(21, "", 35, 36) mixes numbers with a string.
Using the following code we can remove those "empty" Age rows:
data <- subset(data, is.finite(as.numeric(Age)))
This takes the subset of the dataframe "data" where a numeric version of the Age variable is a finite number. Since as.numeric("") is NA and is.finite(NA) is FALSE, this eliminates the rows with missing Age values.
Hope this solves your problem!
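As a side note on why the original attempt produced a row of NAs: if Age is numeric or integer and the blank was imported as NA, then `data$Age != ""` evaluates to NA for that row, and indexing a data frame with an NA logical returns a row filled with NA. A minimal sketch, assuming the blank is stored as NA in an integer column (that storage is an assumption, not something shown in the question):

```r
# Assumed version of the data: Age is integer and the blank was read as NA.
data <- data.frame(Initial = c("S", "D", "T", "D"),
                   Age = c(21L, NA, 35L, 36L),
                   Type = c("Customer", "Enquirer", "Customer", "Customer"))

# The comparison is NA for row 2, so the subset contains an all-NA row.
with_na_row <- data[data$Age != "", ]

# Testing for NA explicitly keeps only rows with a usable Age.
keep <- !is.na(data$Age) & data$Age != ""
clean <- data[keep, ]

print(with_na_row)
print(clean)
```

The `keep` vector can be built from any single column, which may help when only one of many columns should decide which rows are dropped.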
Thank you @M.P.Maurits
This formula worked!
data <- subset(data, is.finite(as.numeric(Age)))
The column was actually an integer, but converting it to numeric removed all rows that were imported as blank but shown as NA. I didn't think that integer versus numeric would make a difference.
Thank you to everyone else who also commented, much appreciated :)
A simple solution based on dplyr's filter function:
library(dplyr)
data %>%
  filter(Age != "")
  Initial Age     Type
1       S  21 Customer
2       T  35 Customer
3       D  36 Customer

Identifying, grouping unique entries in data frame (R)

I have a dataframe with two columns. One is an ID column (string), the second consists of strings several hundred characters long (DNA sequences). I want to identify the unique DNA sequences and group the unique groups together.
Using:
data$duplicates<-duplicated(data$seq, fromLast = TRUE)
I have successfully identified whether a specific row is a duplicate or not. This is not sufficient: I want to know whether I have 2, 3, etc. duplicates, and which IDs they correspond to (it is important that each ID always stays with its corresponding sequence).
Maybe something like:
for each row where data$duplicates is TRUE: add a number to data$grouping
corresponding to its set of duplicates
I don't know how to write the code for the last part.
I appreciate any and all help, thank you.
Edit: As an example:
df <- data.frame(ID = c("seq1","seq2","seq3","seq4","seq5"), seq = c("AAGTCA","AGTCA","AGCCTCA","AGTCA","AGTCAGG"))
I would like the output to be a new column (e.g.: df$grouping) where a numeric value is given to each unique group, so in this case:
("1","2","3","2","4")
If df$seq is already a factor (data.frame() converted strings to factors by default before R 4.0.0), we can just use the level number. This is what you get when a factor is coerced to an integer.
df$grouping = as.integer(df$seq)
df
#     ID     seq grouping
# 1 seq1  AAGTCA        1
# 2 seq2   AGTCA        3
# 3 seq3 AGCCTCA        2
# 4 seq4   AGTCA        3
# 5 seq5 AGTCAGG        4
If, in your real data, the seq column is not of class factor, you can still use df$grouping = as.integer(factor(df$seq)). By default the order of the groups will be alphabetical. You can modify this by giving the levels argument to factor in the order you want. For example, df$grouping = as.integer(factor(df$seq, levels = unique(df$seq))) will put the levels (and thus the grouping integers) in the order in which they first occur.
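Putting this together for the example data, here is a sketch that numbers the groups in order of first appearance and lists which IDs share a sequence. The `n_in_group` column and the use of ave() and split() are additions for illustration, not part of the answer above.

```r
df <- data.frame(ID = c("seq1", "seq2", "seq3", "seq4", "seq5"),
                 seq = c("AAGTCA", "AGTCA", "AGCCTCA", "AGTCA", "AGTCAGG"))

# Group number in order of first occurrence of each sequence.
df$grouping <- as.integer(factor(df$seq, levels = unique(df$seq)))

# Size of the group each row belongs to (hypothetical helper column).
df$n_in_group <- ave(seq_along(df$seq), df$seq, FUN = length)

# IDs collected per group, so each ID stays with its sequence.
ids_by_group <- split(df$ID, df$grouping)

print(df)
print(ids_by_group)
```

Here the grouping column comes out as 1, 2, 3, 2, 4, which is the numbering asked for in the question.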
If you want to see the number of rows in each group, use table, e.g.
table(df$seq)
#  AAGTCA AGCCTCA   AGTCA AGTCAGG
#       1       1       2       1
table(df$grouping)
# 1 2 3 4
# 1 1 2 1
sort(table(df$seq), decreasing = TRUE)
#   AGTCA  AAGTCA AGCCTCA AGTCAGG
#       2       1       1       1

R table function: how to force the order of column names in the output of table()

I would like to change the column order of the output from the table function in R. I can only find information about manipulating column order for data.table (not what I want). The order of the columns ("No" and "Yes") has always been consistent when I use R (alphabetical order?), but for some reason some of my tables have come back in a different order ("Yes" and "No"). I need these to be consistent (as I am combining some tables) and ordered so that "Yes" is last. I'm making several hundred of these tables with associated statistics and have some custom-made formulas to help me out, but I can't afford to double-check the order of every table, so I want to tell R specifically what to do. As I am doing chi-square tests, I don't want to have to change each table into a data.frame, reorder the columns, and then somehow change it back to a table. The order of the table columns is important because I am combining some tables (and R coerces these incorrectly) and also computing odds ratios, so I need "Yes" to come last consistently. Out of curiosity (not necessary), could someone explain why some of my data produces table columns in alphabetical order but other data doesn't? I've attached a simplified version of my data.
df <- data.frame(treatment = c("A","A","B","A","B","A","B","B"),
symptom = c("Yes","Yes","No","No","Yes","Yes","Yes","No"))
table(df)
As this example produces my desired table column order, please write code to change the column order from "No", "Yes" to "Yes", "No".
We can use factor with the levels specified. By default the ordering is alphabetical, so "N" comes before "Y" (comparing the first letter, then the next, and so on). This can be changed by converting to a factor with the levels in a custom order.
table(df$treatment, factor(df$symptom, levels = c("Yes", "No")))
#     Yes No
#   A   3  1
#   B   2  2
Or use transform and then call table:
table(transform(df, symptom = factor(symptom, levels = c("Yes", "No"))))
#          symptom
# treatment Yes No
#         A   3  1
#         B   2  2
Alternatively, we can reorder after building the table by specifying the order (either column indices or column names), but this becomes more tedious if we don't know what the levels are.
table(df)[, 2:1]
#          symptom
# treatment Yes No
#         A   3  1
#         B   2  2
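On the side question of why some tables come back in a different order: table() orders its columns by factor levels. A character column is converted with sorted levels, while a column that is already a factor keeps whatever level order it has. A small sketch of both cases, reusing the example data; `symptom_f` is a hypothetical variable name.

```r
df <- data.frame(treatment = c("A", "A", "B", "A", "B", "A", "B", "B"),
                 symptom = c("Yes", "Yes", "No", "No", "Yes", "Yes", "Yes", "No"),
                 stringsAsFactors = FALSE)

# Character input: levels are sorted, so "No" comes before "Yes".
tab_chr <- table(df$treatment, df$symptom)

# Existing factor with levels in another order: table() follows that order.
symptom_f <- factor(df$symptom, levels = c("Yes", "No"))
tab_fct <- table(df$treatment, symptom_f)

print(colnames(tab_chr))
print(colnames(tab_fct))
```

So if some of the real columns were created as factors with "Yes" first, that would likely explain the inconsistent output; setting the levels explicitly before every call to table() removes the dependence on how each column was built.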
You can order it the way you want:
table(df)[, 2:1]
         symptom
treatment Yes No
        A   3  1
        B   2  2
table(df)[, c("Yes", "No")]
         symptom
treatment Yes No
        A   3  1
        B   2  2
levels = c("Yes", "No")
table(df)[, levels]
         symptom
treatment Yes No
        A   3  1
        B   2  2

Function to work out an average number of unique occurrences

I have the following code, which does what I want, but I would like to know if there is a simpler/nicer way of getting there.
My overall aim is to build a separate summary table for the overall data, so the average that comes out of this will go into that summary.
library(reshape2)  # provides dcast()

Test <- data.frame(
  ID = c(1,1,1,2,2,2,3,3,3),
  Thing = c("Apple","Apple","Pear","Pear","Apple","Apple","Kiwi","Apple","Pear"),
  Day = c("Mon","Tue","Wed")
)
countfruit <- function(data) {
  # long table of counts per ID and fruit, then one column per fruit
  df <- as.data.frame(table(data$ID, data$Thing))
  df <- dcast(df, Var1 ~ Var2, value.var = "Freq")
  colnames(df) <- c("ID", "Apple", "Kiwi", "Pear")
  # set any positive count to 1, marking that the fruit occurs at all
  df$Apple[df$Apple > 0] <- 1
  df$Kiwi[df$Kiwi > 0] <- 1
  df$Pear[df$Pear > 0] <- 1
  # number of distinct fruits for each person
  df$number <- rowSums(df[2:4])
  return(mean(df$number))
}
result <- countfruit(Test)
I think you are overcomplicating the problem. Here is a shorter version that keeps the same rationale.
df <- table(Test$ID, Test$Thing)
mean(rowSums(df > 0)) ## mean number of non-zero entries per row (i.e. per ID)
EDIT: a one-liner solution:
with(Test, mean(rowSums(table(ID, Thing) > 0)))
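As a cross-check of the one-liner, the same average can be computed without building the table at all, by counting distinct fruits per ID with tapply(). This is only a sketch; `distinct_per_id`, `via_table` and `via_tapply` are names made up here.

```r
Test <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Thing = c("Apple", "Apple", "Pear", "Pear", "Apple", "Apple", "Kiwi", "Apple", "Pear"),
  Day = c("Mon", "Tue", "Wed")
)

# Table-based version from the answer above.
via_table <- with(Test, mean(rowSums(table(ID, Thing) > 0)))

# Alternative: number of distinct fruits per ID, then the mean.
distinct_per_id <- tapply(Test$Thing, Test$ID, function(x) length(unique(x)))
via_tapply <- mean(distinct_per_id)

print(distinct_per_id)
print(c(via_table, via_tapply))
```

Both versions should agree; the tapply() form does not hard-code the fruit names, so it should keep working if new fruits appear.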
It looks like you are trying to count how many nonzero entries there are in each row. If so, either use as.logical, which converts any nonzero number to TRUE (aka 1), or just count the number of zeros in a row and subtract that from the number of pertinent columns.
For example, if I followed your code correctly, your dataframe is
  Var1 Apple Kiwi Pear
1    1     2    0    1
2    2     2    0    1
3    3     1    1    1
So, (ncol(df)-1) - sum(df[1,-1]==0) gives you the count for the first row.
Alternatively, use as.logical to convert all nonzero values to TRUE aka 1 and calculate the rowSums over the columns of interest.
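Both ideas can be sketched on the data frame shown above. One caveat: as.logical() drops the dimensions of a matrix, so the sketch uses the comparison `counts != 0`, which yields the same TRUE/FALSE values while keeping the row and column layout. The name `counts` is hypothetical.

```r
df <- data.frame(Var1 = 1:3,
                 Apple = c(2, 2, 1),
                 Kiwi = c(0, 0, 1),
                 Pear = c(1, 1, 1))

# Count columns only, as a numeric matrix (drops the ID column).
counts <- as.matrix(df[, -1])

# Idea 1: pertinent columns minus the number of zeros, for the first row.
first_row <- ncol(counts) - sum(counts[1, ] == 0)

# Idea 2: nonzero -> TRUE, then sum the TRUEs across each row.
per_row <- rowSums(counts != 0)

print(first_row)
print(per_row)
print(mean(per_row))
```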

How to build a new column (/data.frame) from a table, and assign corresponding values to the rows

I printed out the summary of a column as follows:
summary(document$subject)
A, B, C, D, E, F, ... are the subjects in a column of a data.frame. Each of them appears many times in the column, and the summary above shows the number of times (frequency) each subject appears in the file. The term "OTHER" refers to those subjects which appear only once in the file; I also need to assign "1" to these subjects.
There are so many different subjects that it's difficult to list all of them with c().
I want to build up a new column (or data.frame) and then assign these corresponding numbers (scores) to the subjects. Ideally, it will become this in the file:
A 198
B 113
C 96
D 69
A 198
E 65
F 62
A 198
C 96
BZ 21
BC 1
CJ 1
...
I wonder what command I should use to take the scores/values from the summary table and then build a new column to assign these values to the corresponding subjects in the file.
Plus, since it's a summary table printed by R, I don't know how to build it into a table in a file, or take out the values and subject names from the table. I also wonder how I could find out the subject names which appeared only once in the file, so that the summary table added them up into "OTHER".
Your question is hard to interpret without a reproducible example. Please take a look at this thread for tips on how to do that:
How to make a great R reproducible example?
Having said that, here is how I interpret your question. You have two data frames, one with a score per subject and another with the subjects multiple times in a column:
Sum <- data.frame(subject=c("A","B"),score=c(1,2))
foo <- data.frame(subject=c("A","B","A"))
> Sum
subject score
1 A 1
2 B 2
> foo
subject
1 A
2 B
3 A
You can then use match() to match the subjects in one data frame to the other and create the new variable in the second data frame:
foo$score <- Sum$score[match(foo$subject, Sum$subject)]
> foo
subject score
1 A 1
2 B 2
3 A 1
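If the scores are simply the frequencies themselves, another reading of the question is that no second data frame is needed: table() gives the full counts (summary() on a factor lumps the less frequent levels into "(Other)" once there are more levels than its maxsum argument allows), and those counts can be looked up by subject name. A sketch with invented values; `document` mirrors the asker's object name, and `singletons` is a made-up name.

```r
# Invented example data standing in for the asker's file.
document <- data.frame(subject = c("A", "B", "C", "A", "B", "A", "BC"))

# Full frequency table, one entry per distinct subject.
freq <- table(document$subject)

# Look up each row's subject in the table to get its frequency as a score.
document$score <- as.integer(freq[as.character(document$subject)])

# Subjects that occur exactly once.
singletons <- names(freq)[as.vector(freq) == 1]

print(document)
print(singletons)
```

as.data.frame(freq) turns the counts into an ordinary data frame with the subject names in one column, which could then be written to a file or used with match() as shown above.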
