I have a dataframe with many columns. I want to count the number of times something is entered into each column.
#Example data
Gender<-c("","Male","Male","","Female","Female")
location<-c("UK","France","USA","","","")
dataset<-data.frame(Gender,location, stringsAsFactors = FALSE)
There are 4 entries in the gender column and 3 entries in the location column.
I want the results to be in a dataframe such as:
result<-data.frame(Results=c("Gender","location"), Totals=c(4,3))
Can anyone suggest an approach to do this?
You can use the names of dataset as one column for result and calculate the Totals by counting how often grep matches anything that is a character (as opposed to nothing in an empty cell):
result <- data.frame(
  Results = names(dataset),
  Totals = sapply(dataset, function(x) length(grep(".", x)))
)
rownames(result) <- NULL
Result:
result
Results Totals
1 Gender 4
2 location 3
A base R option using stack + colSums
setNames(
  rev(stack(colSums(dataset != ""))),
  c("Results", "Total")
)
gives
Results Total
1 Gender 4
2 location 3
This should work for you:
ngen <- sum(dataset$Gender != "") #count entries in the column that are not empty
nloc <- sum(dataset$location != "") #same thing for location
Totals <- c(ngen,nloc)
result<-data.frame(Results=c("Gender","location"), Totals)
You can simplify some of the steps if you want, but that would be the detailed way.
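If you do want to collapse those steps and cover every column at once, a sketch of the same idea using colSums on the non-empty cells:
colSums_result <- data.frame(Results = names(dataset),
                             Totals  = unname(colSums(dataset != "")))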
I have a dataset that looks something like this
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1971,1972,1973,1974,1977,1978,1990),
"Group" = c(1,NA,1,NA,NA,2,2,NA),
"Val" = c(2,3,3,5,2,5,3,5))
And I would like to create a cumulative sum of "Val". I know how to do the simple cumulative sum
df <- df %>% group_by(id) %>% mutate(cumval=cumsum(Val))
However, I would like my final data to look like this
final <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1971,1972,1973,1974,1977,1978,1990),
"Group" = c(1,NA,1,NA,NA,2,2,NA),
"Val" = c(2,3,3,5,2,5,3,5),
"cumval" = c(2,5,6,11,2,7,5,10))
The basic idea is that when two "Val"s belong to the same "Group", the one happening later (by Year) substitutes the previous one.
For instance, in the sample dataset, observation 3 has a "cumval" of 6 rather than 8 because the "Val" at time 1972 replaces the "Val" at time 1970. Similarly for Beta.
I thank you in advance for your help
In my head, this requires a for loop. First we split the dataframe by the id column into a list of two. Then we create two empty lists. In the og list, we will put the row where the first unique non-NA Group identifier occurs. For Alpha this is the first row and for Beta this is the second row. We will use this to subtract from the cumulative sum when the value gets substituted.
mylist <- split(df, f = df$id)
og <- list()
vals <- list()
df_num <- 1
We shall use a nested loop: the outer loop iterates over each object (a dataframe, in this case) in the list, and the inner loop over each value in the Group column.
We need to keep track of the row numbers, which we do with the r variable. We initially set it to 0 outside the inner loop and add 1 at the start of each iteration. First we check whether we are in the first row of the data frame, in which case the cumulative sum is simply the value in the first row of the Val column. Within that if test, we use another if test to check whether the Group id is NA. If it isn't, then this is the first occurrence of the number that will indicate a substitution of the current value if it appears again. So we save the number to the temporary variable temp. We also extract the row that contains the value and save it to the og list.
After this, it goes to the next iteration. We check whether the current Group value is NA. If it is, then we just add the value to the cumulative sum. If it isn't NA, we check whether it is equal to the value stored in temp. If so, this means we need to substitute. We extract the original value stored in the og list and save it as old. We then subtract the old value from the cumulative sum and add the current value. We also replace the original value in og with the current replacement value, because if the value needs to be replaced again, we will need to subtract the current value and not the original one.
If j is not NA and not equal to temp, then this is a new instance of Group. So we save the row with the original value to the og list, and save the Group to temp. The sum continues as normal, as this is not an instance of replacing a value. Note that the variable x that counts the elements in the og list is only incremented when a new occurrence is added to the list. Thus, og[[x-1]] always holds the most recently stored row for the current Group.
library(dplyr) # needed for %>% below and bind_rows() afterwards

for (my_df in mylist) {
  x <- 1
  r <- 0
  for (j in my_df$Group) {
    r <- r + 1
    if (r == 1) {
      vals[[1]] <- my_df$Val[1]
      if (is.na(j) == FALSE) {
        og[[x]] <- my_df[r, c('Group', 'Val'), drop = FALSE]
        temp <- j
        x <- x + 1
      }
      next
    }
    if (is.na(j) == TRUE) {
      vals[[r]] <- vals[[r-1]] + my_df$Val[r]
    } else if (is.na(j) == FALSE & j == temp) {
      old <- og[[x-1]]
      old <- old[, 2]
      vals[[r]] <- vals[[r-1]] - old + my_df$Val[r]
      og[[x-1]] <- my_df[r, c('Group', 'Val'), drop = FALSE]
    } else {
      vals[[r]] <- vals[[r-1]] + my_df$Val[r]
      og[[x]] <- my_df[r, c('Group', 'Val')]
      temp <- j
      x <- x + 1
    }
  }
  cumval <- unlist(vals) %>% as.data.frame()
  colnames(cumval) <- 'cumval'
  my_df <- cbind(my_df, cumval)
  mylist[[df_num]] <- my_df
  df_num <- df_num + 1
}
Lastly, we combine the two dataframes in the list by row-binding them with bind_rows from the dplyr package. Then I check whether the final dataframe is identical to your desired output with identical(), and it evaluates to TRUE.
final_df <- bind_rows(mylist)
identical(final_df, final)
[1] TRUE
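For comparison, here is a more compact sketch of the same logic using dplyr (assuming rows are already ordered by Year within each id, as in the example data): each row contributes its Val minus the Val it replaces within the same non-NA Group, and the running total of those contributions is the desired cumval.
library(dplyr)

final_df2 <- df %>%
  group_by(id, Group) %>%
  # a row's contribution: its Val minus the previous Val of the same id/Group
  # (rows with Group == NA always contribute their full Val)
  mutate(adj = ifelse(is.na(Group), Val, Val - lag(Val, default = 0))) %>%
  group_by(id) %>%
  mutate(cumval = cumsum(adj)) %>%
  ungroup() %>%
  select(-adj)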
I have a vector:
vector_1 <- c('aa1/10', 'aa1/20', 'aa2/10')
And I have a data frame, with the column: product (some rows are empty)
product
hello123
hello123;aa1/20
World
I want to have another column, called: check.
If one of the values in my vector_1 is in the column product, then I want to have a 1, else a 0.
I tried different things, but they didn't work out:
df$check <- ifelse(df$product %in% vector_1, 1,0)
Unfortunately, no results... So I tried:
df$check <- grepl(vector_1, df$product)
But there I received a warning message: In grepl: argument 'pattern' has length > 1 and only the first element will be used.
How can I solve this?
Result:
product         check
hello123            0
                    0
hello123;aa1/20     1
World               0
df$check <- as.numeric(grepl(pattern = paste0(vector_1, collapse = "|"), x = df$product))
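A quick check of that one-liner on data like yours (a sketch; the empty second row is assumed from "some rows are empty"). Note that if the patterns ever contain regex metacharacters, they would need escaping first.
df <- data.frame(product = c("hello123", "", "hello123;aa1/20", "World"),
                 stringsAsFactors = FALSE)
vector_1 <- c('aa1/10', 'aa1/20', 'aa2/10')
df$check <- as.numeric(grepl(paste0(vector_1, collapse = "|"), df$product))
df$check
# [1] 0 0 1 0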
I have a dataframe that is 6249 rows long, filled with character-type data and will likely get a lot bigger.
I want to count the number of occurrences of each string. Normally I'd use table(df)
or
count(df)
but they both seem to stop after 250 rows.
Is there a different function or a way to force count() or table() to continue for 6000+ results?
A simple way to do this with a data frame of any size is to add a count field to the data frame and then summarize the string field by count with the doBy package, like so:
require(doBy)
df$count <- 1
result <- summaryBy(count ~ string, data = df, FUN = sum, keep.names = TRUE)
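If you'd rather not pull in doBy, roughly the same summary can be done in base R (a sketch, reusing the count column added above; the column name string is assumed from the example):
result <- aggregate(count ~ string, data = df, FUN = sum)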
As @Gregor noticed, it seems you misread the table output; it is actually counting correctly. Anyway, here is a solution using Reduce. Replace df where indicated with your dataframe, and string with the name of the column of your actual dataframe over which you are counting.
# let's create a dataframe of 1000 rows containing three strings distributed at random
df <- data.frame(string = unlist(lapply(round(runif(1000, 1, 3)), function(i) c('hi', 'ok', 'my cat')[i])))
my.count <- function(word, df) {
  # count how many rows match `word`
  Reduce(function(acc, r) {
    # replace 'string' by the name of the column of your dataframe over which you want to count
    if (r$string == word)
      acc + 1
    else
      acc
  }, apply(df, 1, as.list), init = 0)
}
# count how many 'my cat' strings are in the df dataframe at column 'string', replace with yours
my.count('my cat', df)
# now let's try to find the frequency of all of them
uniq <- unique(df$string)
freq <- unlist(lapply(uniq, my.count, df))
names(freq) <- uniq
freq
# output
# ok my cat hi
# 490 261 249
# we can check indeed that the sum is 1000
sum(freq)
# [1] 1000
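As a cross-check, table() on the column gives the same frequencies in one call (the counts below are from the same simulated run as above, so they will vary between runs):
table(df$string)
#     hi my cat     ok
#    249    261    490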
Well, this won't be popular, but in the end I achieved the desired result with a for loop and taking the number of rows in a subset.
y <- as.numeric(vector())
x <- as.numeric(vector())
for (i in test$token){
  x <- as.numeric(nrow(df[df$token == i, ]))
  y <- c(y, x)
}
y then becomes a vector with the number of occurrences of each string.
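The same counts can also be obtained without the loop (a sketch, assuming test$token holds the strings to count, as in the loop above):
y <- sapply(test$token, function(i) sum(df$token == i, na.rm = TRUE))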
I need some help regarding how to start the implementation a problem in R. I have a data frame with rows which are grouped by the variable 'id'. For each 'id' I want to keep only one row. However, I have a number of criteria which specify which rows to drop.
These are some of my criteria:
I want to keep one random row within each group 'id' which has 'text' != NA (there might be several such rows); and I also want to keep all columns of this row, this is also the case for all following criteria.
If all rows in a group have 'text' == NA, then I want to keep one random row which has the variable 'check' == T (there might be several such rows)
If all rows in a group have 'text' == NA and 'check' == F, then I want to keep the row which has the variable 'newtext' which meets the condition !(grepl("None",df$newtext))
I can also provide a dataset if this makes it more clear. However, my most important issue is that I do not know how to implement this logic of dropping rows according to an ordered number of criteria.
It would be nice, if anyone can tell me how to implement such a code.
Thank you!
This would be an example dataset:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
text=c("asd",NA,"asd",NA,NA,NA,NA,NA,NA),
check = c(T,F,T,T,T,F,F,F,F),
newtext =
c("as","as","as","das","das","None","qwe","qwe2","None"),
othervars = c(1,2,3,45,5,6,6,7,1))
As an output, I want to keep the following rows:
row 1 or 3
row 4 or 5
row 7 or 8
The column othervars should be kept as well as I need this information later on.
Hope this makes it a bit clearer.
Alright, I've got something. I'm using filter() from dplyr to subset in the presence of NA values, because I ran into problems using either subset() or plain df[ , ] subsetting from base R.
Data:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
text=c("asd",NA,"asd",NA,NA,NA,NA,NA,NA),
check = c(T,F,T,T,T,F,F,F,F),
newtext =
c("as","as","as","das","das","None","qwe","qwe2","None"),
othervars = c(1,2,3,45,5,6,6,7,1))
Initiating new empty dataframe:
df2 <- df[0,]
Loop to sample one row per id:
library(dplyr)
for(i in unique(df$id)){
  temp <- filter(df, id == i)
  if(nrow(filter(temp, !is.na(text))) > 0){
    temp <- filter(temp, !is.na(text))
    df2[i, ] <- temp[sample(nrow(temp), size = 1), ]
  } else if(nrow(filter(temp, check)) > 0){
    temp <- filter(temp, check)
    df2[i, ] <- temp[sample(nrow(temp), size = 1), ]
  } else {
    temp <- filter(temp, !(grepl("None", temp$newtext)))
    df2[i, ] <- temp[sample(nrow(temp), size = 1), ]
  }
}
Output example:
> df2
id text check newtext othervars
2 1 asd TRUE as 1
1 2 <NA> TRUE das 45
3 3 <NA> FALSE qwe 6
Greetings.
Edit: Ignore the row numbers on the left, they are residuals from the different subsets within the loop.
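For reference, a more compact variant of the same priority logic (a sketch using dplyr; slice_sample() needs dplyr >= 1.0): rank each row by the first criterion it meets, keep the best-ranked rows per id, then sample one of them.
library(dplyr)
set.seed(1)  # picking one of several qualifying rows is random

df2 <- df %>%
  group_by(id) %>%
  mutate(priority = case_when(
    !is.na(text)            ~ 1,  # criterion 1: non-missing text
    check                   ~ 2,  # criterion 2: check is TRUE
    !grepl("None", newtext) ~ 3,  # criterion 3: newtext without "None"
    TRUE                    ~ 4
  )) %>%
  filter(priority == min(priority)) %>%
  slice_sample(n = 1) %>%
  ungroup() %>%
  select(-priority)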
Suppose I have a list or data frame in R, and I would like to get the row index, how do I do that? That is, I would like to know how many rows a certain matrix consists of.
I'm interpreting your question to be about getting row numbers.
You can try as.numeric(rownames(df)) if you haven't set the rownames. Otherwise use a sequence of 1:nrow(df).
The which() function converts a TRUE/FALSE row index into row numbers.
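For example (a small sketch with a made-up data frame):
dat <- data.frame(x = c(10, 20, 10))
seq_len(nrow(dat))  # 1 2 3 -- all row numbers
which(dat$x == 10)  # 1 3   -- row numbers where the condition is TRUE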
It's not quite clear what exactly you are trying to do.
To reference a row in a data frame use df[row,]
To get the first position in a vector of something use match(item,vector), where the vector could be one of the columns of your data frame, eg df$cname if the column name is cname.
Edit:
To combine these you would write:
df[match(item,df$cname),]
Note that the match gives you the first item in the list, so if you are not looking for a unique reference number, you may want to consider something else.
See row in ?base::row. This gives the row indices for any matrix-like object.
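For example:
m <- matrix(1:6, nrow = 2)
row(m)
#      [,1] [,2] [,3]
# [1,]    1    1    1
# [2,]    2    2    2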
rownames(dataframe)
This will give you the row names of the dataframe.
If I understand your question, you just want to be able to access items in a data frame (or list) by row:
x = matrix( ceiling(9*runif(20)), nrow=5 )
colnames(x) = c("col1", "col2", "col3", "col4")
df = data.frame(x) # create a small data frame
df[1,] # get the first row
df[3,] # get the third row
df[nrow(df),] # get the last row
lf = split(df, seq_len(nrow(df))) # one list element per row
lf[[1]] # get first row
lf[[3]] # get third row
etc.
Perhaps this complementary example of "match" would be helpful.
Having two datasets:
first_dataset <- data.frame(name = c("John", "Luke", "Simon", "Gregory", "Mary"),
role = c("Audit", "HR", "Accountant", "Mechanic", "Engineer"))
second_dataset <- data.frame(name = c("Mary", "Gregory", "Luke", "Simon"))
If the name column contains only unique values (across the whole collection), then you can access rows in the other dataset via the indexes returned by match:
name_mapping <- match(second_dataset$name, first_dataset$name)
match returns the proper row indexes of the names in first_dataset for the given names from the second: 5 4 2 3
Example here - accessing roles from the first dataset by row index (by given name value):
for(i in seq_along(name_mapping)) {
  role <- as.character(first_dataset$role[name_mapping[i]])
  second_dataset$role[i] <- role
}
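The loop works, but the same column can also be filled in one vectorized step:
second_dataset$role <- as.character(first_dataset$role[name_mapping])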
The second dataset with the new column:
     name       role
1    Mary   Engineer
2 Gregory   Mechanic
3    Luke         HR
4   Simon Accountant
x <- matrix(ceiling(9*runif(20)), nrow=5)
colnames(x) <- c("these", "are", "the", "columnes")
df <- data.frame(x)
Result: a 5 x 4 data frame of random integers (values will vary between runs).
which(df == "2") #returns the positions of all matching cells across the entire data frame (column-major); in this case it returned 3 index numbers
Result:
5 13 17
length(which(df == "2")) #count the number of matches for the condition
Result:
3
You can also do this column-wise, for example:
which(df$columnName %in% c("2", "7")) #the same works with strings; %in% matches either value (== would recycle the comparison vector)
length(which(df$columnName %in% c("2", "7")))