R: count the number of variables with value "mq" per row [duplicate]

This question already has answers here:
How to count the frequency of a string for each row in R
(4 answers)
Closed 4 years ago.
I have a data frame with 70 variables. I want to create a new variable that counts, for each row, how many of the 70 variables take the value "mq".
I am looking for something like this:
ID  Var1  Var2  Count_mq
1   mq    mq    2
2   1     mq    1
3   1     7     0
I have found this solution:
count_row_if("mq",DT)
But it gives me a vector with those values for the whole data frame and it is quite slow to compute.
I would like to find a solution using the function apply() but I don't know how to achieve this.
Best.

You can use the apply() function to count a particular value in each row of your existing data frame df:
df$count.MQ <- apply(df, 1, function(x) length(which(x=="mq")))
Here the second argument is 1 because you want to apply the function to each row (2 would mean each column). You can read more about it at https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/apply
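A vectorised alternative, offered as a sketch rather than as the answer's own method: comparing the data frame to "mq" yields a logical matrix, and rowSums() counts the TRUE values in each row, which avoids calling a function once per row. The data frame below is made-up sample data mirroring the table in the question, and the name vars is only illustrative.

```r
# Hypothetical sample data mirroring the table in the question
df <- data.frame(
  ID   = 1:3,
  Var1 = c("mq", "1", "1"),
  Var2 = c("mq", "mq", "7"),
  stringsAsFactors = FALSE
)

# Columns to inspect (everything except the ID column)
vars <- setdiff(names(df), "ID")

# df[vars] == "mq" is a logical matrix; rowSums() counts TRUE per row.
# na.rm = TRUE makes missing values count as "not mq".
df$Count_mq <- rowSums(df[vars] == "mq", na.rm = TRUE)

print(df)
```

Restricting the comparison to vars keeps the ID column, and the newly added count column, out of the count.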

I assume the dataset is named DT. I'm not entirely sure what you want, but this is how I understand it: the data frame has 70 columns, and some of its rows contain the value 'mq' in one or more of them.
If I have that right, please see the code below.
apply(DT, MARGIN = 1, function(x) sum(x == "mq", na.rm = TRUE))

Related

Count occurrences of value in a set of variables in R (per column) [duplicate]

This question already has answers here:
Counting the number of elements with the values of x in a vector
(20 answers)
Closed 1 year ago.
I have this data and I want to figure out how many ones and how many zeros are in each column (e.g. Arts and Crafts). I have been trying different things but nothing has worked. Does anyone have any suggestions?
You can use the table() function in R, which tabulates how often each distinct value occurs. I have also used unlist() here to convert a list to a vector.
df1 <- read.csv("Your_CSV_file_name_here.csv")
table(unlist(df1$ArtsAndCrafts))
If you want to count the zeros and ones row-wise, you can refer to this question on Stack Overflow.
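To get the counts for every column at once rather than one column at a time, a minimal sketch using colSums() could look like this. The data frame and its column names are made up for illustration, and the sketch assumes every column holds only 0/1 values (or NA).

```r
# Hypothetical 0/1 data; the column names are made up
df1 <- data.frame(
  ArtsAndCrafts = c(1, 0, 1, 1),
  Sports        = c(0, 0, 1, 0)
)

# A logical comparison gives TRUE/FALSE per cell; colSums() counts TRUE per column
ones  <- colSums(df1 == 1, na.rm = TRUE)
zeros <- colSums(df1 == 0, na.rm = TRUE)

# One row per value, one column per variable
counts <- rbind(zeros, ones)
print(counts)
```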

Select duplicate rows by comparing multiple columns in R [duplicate]

This question already has answers here:
Find duplicate values in R [duplicate]
(5 answers)
Closed 4 years ago.
I have an issue selecting duplicate rows in R. My data frame has 14 columns and 1 million rows. I have to compare rows, i.e. find identical rows, which count as duplicates. I want to get the duplicate rows this way. My data frame looks like this:
Data frame sample
The last two rows are identical, so they need to be marked with a flag value of 1.
I don't know how to start with this.
I have tried this code:
df <- unique(data[,1:97])  # this gives me the unique rows, not the duplicates
dim(data[duplicated(data),])[1]  # this gives me the number of duplicates but not their ids
I need to know the duplicate ids.
My intention is to check each row and return the total number of duplicate rows or their line numbers.
Look into the duplicated() function. It can be used to remove the duplicated rows or, conversely, to keep only them.
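A minimal sketch of that suggestion, on made-up sample data: duplicated() on its own marks only the second and later copies of a row, so it is combined here with a second call using fromLast = TRUE to mark every member of a duplicate group. The column names and the names later_copies, all_copies and dup_rows are illustrative.

```r
# Hypothetical sample data; the last two rows are identical
data <- data.frame(
  id    = c("a", "b", "c", "c"),
  value = c(1, 2, 3, 3)
)

# TRUE for the second and later copies of a row only
later_copies <- duplicated(data)

# Scanning from the end as well marks every row that has an identical twin
all_copies <- later_copies | duplicated(data, fromLast = TRUE)

dup_rows  <- which(all_copies)        # row numbers of the duplicated rows
data$flag <- as.integer(all_copies)   # 1 = duplicated row, 0 = unique row

print(data)
print(dup_rows)
```

If only the later copies should be flagged, use later_copies in place of all_copies.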

Viewing single column of data frame in R [duplicate]

This question already has answers here:
How to subset matrix to one column, maintain matrix data type, maintain row/column names?
(1 answer)
How do I extract a single column from a data.frame as a data.frame?
(3 answers)
Closed 5 years ago.
I am running a simulation model that creates a large data frame as its output, with each column corresponding to the time-series of a particular variable:
data5<-as.data.frame(simulation3$baseline)
Occasionally I want to look at subsets, especially particular columns, of this data frame in order to get an idea of the output. For this I am using the View-function like so
View(data5[1:100,1])
for instance, if I wish to see the first 100 rows of column 1. Alternatively, I also sometimes do something like this, using the names of the time series:
timeframe=1:100
toAnalyse=c("u","u_n","u_e","u_nw")
View(data5[timeframe,toAnalyse])
In either case, there is an annoying display problem when I am trying to view a single column on its own (as for instance with View(data5[1:100,1])), whereby what I get looks like this:
Example 1
As you can see, the top of the table which would usually contain the name of the variable in the dataset instead contains a string of all values that the variable takes. This problem does not appear if I select 2 or more columns:
Example 2
Does anyone know how to get rid of this issue? Is there some argument that I can feed to View to make sure that it behaves nicely when I ask it to just show a single column?
View(data5[1:100,1, drop=FALSE])
When you access a single column of a data frame this way, it is converted to a vector; drop=FALSE prevents that and retains the column name.
For instance:
> df
n s b
1 2 aa TRUE
2 3 bb TRUE
3 5 cc TRUE
> df[, 1]
[1] 2 3 5
> df[, 1, drop=FALSE]
n
1 2
2 3
3 5

sum across columns within rows for all columns that start with a specific character string in R [duplicate]

This question already has answers here:
Subset data to contain only columns whose names match a condition
(10 answers)
Closed 6 years ago.
I have a data frame with a set of species IDs in the ID column, and sample IDs as separate columns with the motif CA_**. The data look like this:
ID <- c('A','B','C')
CA_01 <- c(3,9,54)
CA_56 <- c(2,7,12)
CA_92 <- c(45,4,47)
d<- data.frame(ID,CA_01,CA_56,CA_92)
ID CA_01 CA_56 CA_92
A 3 2 45
B 9 7 4
C 54 12 47
I want to sum across the columns within each row and generate a new column that is the total abundance of each species ID across sample columns (final values 50, 20, 113). Furthermore, there are many other columns in my real data frame. I only want to sum across columns that start with CA_**.
NOTE: this is different from the question asked here, as that asker knows the positions of the columns to sum. In my example I only know that the columns start with the motif CA_. I don't know the positions. It's also different from the question here, as I specifically ask how to sum across columns based on the grep command.
We can use grep to subset the columns having column names that start with CA_ and get the sum of the rows with rowSums.
d$newCol <- rowSums(d[grep('^CA\\_', names(d))])
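As a check against the question's sample data, here is a small sketch of the same idea. It swaps grep for startsWith(), a base R function that matches a literal prefix without regular expressions; that substitution and the name ca_cols are not part of the original answer.

```r
# Sample data from the question
d <- data.frame(
  ID    = c("A", "B", "C"),
  CA_01 = c(3, 9, 54),
  CA_56 = c(2, 7, 12),
  CA_92 = c(45, 4, 47)
)

# Logical index of the columns whose names begin with "CA_"
ca_cols <- startsWith(names(d), "CA_")

# drop = FALSE keeps a data frame even if only one column matches
d$newCol <- rowSums(d[, ca_cols, drop = FALSE])

print(d$newCol)
```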

Include third variable in table [duplicate]

This question already has answers here:
Contingency table based on third variable (numeric)
(2 answers)
Closed 4 years ago.
I have made an edit after realising my code was insufficient to explain the problem. Apologies.
I have a data frame including four columns
purchaseId <- c("abc","xyz","def","ghi")
product <- c("a","b","c","a")
quantity <- c(1,2,2,1)
revenue <- c(500,1000,300,500)
t <- data.frame(purchaseId,product, quantity, revenue)
table(t$product,t$quantity)
Running this query
table(t$product,t$quantity)
returns a table indicating how many times each combination occurs
    1 2
  a 2 0
  b 0 1
  c 0 1
What I would like to do is plot both product and quantity as rows and columns (as shown above) but with the revenue as an actual value.
The result should look like this:
       1    2
a   1000    0
b      0 1000
c      0  300
This would allow me to create a table that I could export as a csv.
Could anyone help me any further?
edit - the code suggested below throws the following error on the actual data set of 140K rows:
Error: dims [product 21525] do not match the length of object [147805]
Other ideas?
Of course the example code above is a simplified version of the actual data I'm using, but the idea is the same.
Thank you in advance,
Kind regards.
table(t$product,t$quantity)*t$revenue
Using library(reshape2) or library(data.table):
dcast(t, product ~ quantity, value.var = "revenue", fun.aggregate = sum)
The syntax is fairly simple:
The first argument is the data frame you are recasting.
The second is the "formula" of the resulting data frame. The LHS of ~ is the row-wise pivot, the RHS is the column-wise pivot.
value.var names the column whose values are placed in the cells, and fun.aggregate = sum aggregates them with the sum function.
As you mentioned familiarity with Excel pivot tables in your comments, it's worth noting that dcast is a fairly comprehensive replacement, with additional flexibility.
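For comparison, a base R sketch that needs no extra package: xtabs() with a variable on the left-hand side of the formula sums that variable within each combination of the right-hand-side factors. This is an alternative to the dcast answer, not part of it; the data frame is named sales here (rather than t) to avoid masking base R's t() function, and the output file name is made up.

```r
# Sample data from the question, under a hypothetical name
sales <- data.frame(
  purchaseId = c("abc", "xyz", "def", "ghi"),
  product    = c("a", "b", "c", "a"),
  quantity   = c(1, 2, 2, 1),
  revenue    = c(500, 1000, 300, 500)
)

# Sum revenue for every product/quantity combination;
# combinations that never occur get 0
rev_tab <- xtabs(revenue ~ product + quantity, data = sales)
print(rev_tab)

# Convert the two-way table to a plain wide data frame and export it
rev_wide <- as.data.frame.matrix(rev_tab)
out_file <- file.path(tempdir(), "revenue_by_product.csv")
write.csv(rev_wide, out_file)
```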
