R: Quickest way to summarize number of observations for multiple variables - r

I am sure this is a super simple thing, but I cannot find a really quick and easy solution.
I have patient data with a lot of columns in a format like this:
patID disease category ...
1 1 A
2 0 B
3 1 C
4 1 B
How can I quickly produce a summary table, which includes the number of observations for each column/variable in the dataframe? The result should be something like this:
VARIABLE Number of rows
disease:1 3
disease:0 1
category:A 1
category:B 2
category:C 1
...
I know I can do this for a single variable by just using table(data$column). But how can I produce something similar for all columns in a dataframe?

Using tidyr and dplyr:
gather(data, variable, value, -patID) %>%
count(variable, value)
(Thanks @Frank for reminding me about tally and count.)
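For readers without tidyr, roughly the same table can be sketched in base R by applying table() to every column except the ID and stacking the results. This is only an illustration built on the four example rows from the question; the output column names variable, value and n are assumptions chosen to mirror what count() returns.

```r
# Example data copied from the question (four rows only).
data <- data.frame(
  patID    = 1:4,
  disease  = c(1, 0, 1, 1),
  category = c("A", "B", "C", "B"),
  stringsAsFactors = FALSE
)

# Tabulate each non-ID column, then stack the per-column tables.
vars <- setdiff(names(data), "patID")
counts <- do.call(rbind, lapply(vars, function(v) {
  tab <- table(data[[v]])
  data.frame(variable = v,
             value    = names(tab),      # levels are stored as character
             n        = as.vector(tab),
             stringsAsFactors = FALSE)
}))
print(counts)
```

Because every value is stored as text in the value column, numeric and character variables can share one table, which is the same trade-off gather() makes.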

Related

R removes more observations than it should with dplyr or base subset

I've got a question regarding the filter() function of dplyr, and/or base subset() function within R. Basically, when I use filter() or subset() I can extract observations based on two conditions, which is what I need.
As an example, this is what I've been using so far:
df %>% filter(Axis_1_1 == "Diagnostic of function on axis1 postponed") %>% filter(is.na(diagnostic_code9))
This gives me the right amount of observations that satisfy these two conditions at the same time, i.e. 92 out of the 23992 in total.
However, when I use the negation sign to not include these observations in my current dataframe, R is deleting roughly 8000 extra observations. Thus, the end result is 15992 observations left after filtering with the negation "!" sign used. Example:
df %>% filter(Axis_1_1 != "Diagnostic of function on axis1 postponed") %>% filter(!is.na(diagnostic_code9))
Using simple subsetting from base R gives me the same wrong end result, while it manages to find the correct 92 observations that satisfy the condition, as stated in the first example.
subset(df, Axis_1_1 == "Diagnostic of function on axis1 postponed" & is.na(diagnostic_code9))
My dataframe consists of 112 variables and 23900+ observations in the current setting.
Thus, my questions are:
Could there be something odd about the dataframe I'm using? (Unfortunately I cannot share a subset of it.)
Second, is there something wrong with my code?
Lastly, what exactly is R doing in the background? It can pick out these observations when I match the string and use is.na(), yet it does something completely different when I use the negation sign.
Your logic doesn't quite work in this case. Doing two subsequent filter statements is like doing an AND operation. Consider the following example
df <- data.frame(a=c(1,1,1,1,2,2,2, 2),
b=c(NA,NA,5,5,5,5,5,NA))
df %>% filter(a==1) %>% filter(is.na(b))
# a b
# 1 1 NA
# 2 1 NA
df %>% filter(a!=1) %>% filter(!is.na(b))
# a b
# 1 2 5
# 2 2 5
# 3 2 5
Note that the rows with a=1, b=5 are not returned, even though they were not in the first output, because your first filter (filter(a != 1)) eliminates them.
So if you consider your two filters as A and B, in the first case you are doing A and B. It would be the same as
df %>% filter(a==1 & is.na(b))
# a b
# 1 1 NA
# 2 1 NA
But in the second case you are doing NOT A AND NOT B. These are not equivalent: by De Morgan's law, the complement of (A AND B) is (NOT A) OR (NOT B). So try
df %>% filter(a!=1 | !is.na(b))
# a b
# 1 1 5
# 2 1 5
# 3 2 5
# 4 2 5
# 5 2 5
# 6 2 NA
or equivalently (note the parentheses applying the NOT (!) to the whole expression)
df %>% filter(!(a==1 & is.na(b)))
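As a quick sanity check, the three variants can be compared with base R subsetting on the same toy data. This is a minimal sketch rather than a statement about the asker's real data; the names drop, kept_not_and, kept_or and kept_wrong are made up for the illustration.

```r
# Toy data from the answer above.
df <- data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 2),
                 b = c(NA, NA, 5, 5, 5, 5, 5, NA))

# Rows we want to exclude: A AND B.
drop <- df$a == 1 & is.na(df$b)

kept_not_and <- df[!drop, ]                        # NOT (A AND B)
kept_or      <- df[df$a != 1 | !is.na(df$b), ]     # (NOT A) OR (NOT B)
kept_wrong   <- df[df$a != 1 & !is.na(df$b), ]     # (NOT A) AND (NOT B)

# The first two agree and keep every row that was not dropped;
# the third removes additional rows.
identical(kept_not_and, kept_or)
nrow(df) - sum(drop)
nrow(kept_wrong)
```

Comparing nrow() of the kept rows against nrow(df) minus the number of dropped rows is a cheap way to catch this kind of mistake on a real data frame.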

How can i aggregate rows of a data.frame by name, summing the numeric value of the correspondent columns on R? [duplicate]

I am new to RStudio and have a fairly simple problem that I have not been able to solve.
I just want to aggregate the rows of my data.frame by the words contained in the first column of the df.
The data.frame has five columns:
The first one contains words;
the second, third, fourth, and fifth contain numeric values.
For example, if the data were:
SecondWord X Y Z Q
NO 1 2 2 1
NO 0 0 1 0
YES 1 1 1 1
I expect to see a result like:
SecondWord X Y Z Q
NO 1 2 3 1
YES 1 1 1 1
How can I do this?
I have tried the following:
test <- read.csv2("test.csv")
df <- aggregate(. ~ SecondWord, data = test, FUN = sum, na.rm = TRUE)
But the values were not the ones I expected to see.
Thank you in advance for your help, and sorry for the "simple" question.
You can also use tidyverse
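library(tidyverse)
# summarize_each() and funs() are deprecated; across() is the current idiom (dplyr >= 1.0)
df <- test %>%
  group_by(SecondWord) %>%
  summarise(across(everything(), sum))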
df
# SecondWord X Y Z Q
# NO 1 2 3 1
# YES 1 1 1 1
ddply should work as well.
For example, something like:
library(plyr)
grouped <- ddply(test, "SecondWord", numcolwise(sum))
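For comparison, here is a minimal base R sketch of the same aggregation on the three example rows, built inline instead of read from test.csv (the file itself is not shown). If aggregate() gives unexpected values on the real file, one thing worth checking with str() is how read.csv2 parsed the columns, since it assumes semicolon separators and comma decimals; that is only a guess about the cause.

```r
# Example rows copied from the question.
test <- data.frame(
  SecondWord = c("NO", "NO", "YES"),
  X = c(1, 0, 1),
  Y = c(2, 0, 1),
  Z = c(2, 1, 1),
  Q = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# Confirm the numeric columns really are numeric before summing.
str(test)

# Sum every remaining column within each SecondWord group.
# Column names are case-sensitive: SecondWord, not Secondword.
agg <- aggregate(. ~ SecondWord, data = test, FUN = sum)
print(agg)
```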

Reshaping data - is this an operation for tidyr::spread?

I'm trying to reshape a data frame so that each unique value in a column becomes a binary column.
I've been provided data that looks like this:
df <- data.frame(id = c(1,1,2),
value = c(200,200,1000),
feature = c("A","B","C"))
print(df)
##id,value,feature
##1,200,A
##1,200,B
##2,1000,C
I'm trying to reshape it into this:
##trying to get here
##id,value,A,B,C
##1,200,1,1,0
##2,1000,0,0,1
spread(df,id,feature) fails because ids repeat.
I want to reshape the data to facilitate modeling - I'm trying to predict value from the presence or absence of features.
There is a way to do it with tidyr::spread though, using a transition variable always equal to one.
library(dplyr)
library(tidyr)
mutate(df,v=1) %>%
spread(feature,v,fill=0)
id value A B C
1 1 200 1 1 0
2 2 1000 0 0 1
As in my previous comment:
You can use dcast from the reshape2 package, because spread works best on data that have already been processed and/or are consistent with tidy data principles. Your "spreading" is a little different (and more complicated), unless of course you combine spread with other functions.
library(reshape2)
dcast(df, id + value ~ ..., length)
id value A B C
1 1 200 1 1 0
2 2 1000 0 0 1
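Since the stated goal is modeling, a package-free alternative is to build the 0/1 columns with model.matrix() and collapse them per id/value pair. This is a sketch on the example data only; the names ind and wide are placeholders, and using max() as the collapse rule is an assumption that "present on any row" should count as 1.

```r
# Example data from the question.
df <- data.frame(id = c(1, 1, 2),
                 value = c(200, 200, 1000),
                 feature = c("A", "B", "C"),
                 stringsAsFactors = FALSE)

# One 0/1 indicator column per feature level (no intercept column).
ind <- model.matrix(~ feature - 1, data = df)
colnames(ind) <- sub("^feature", "", colnames(ind))

# Collapse to one row per id/value pair; max() marks a feature as present
# if it appears on any of that pair's rows.
wide <- aggregate(as.data.frame(ind), by = df[c("id", "value")], FUN = max)
print(wide)
```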

create new dataframe based on 2 columns

I have a large dataset "totaldata" containing multiple rows relating to each animal. Some of them are LactationNo 1 readings, and others are LactationNo 2 readings. I want to extract all animals that have readings from both LactationNo 1 and LactationNo 2 and store them in another dataframe "lactboth"
There are 16 other columns of variables of varying types in each row that I need to preserve in the new dataframe.
I have tried merge, aggregate and %in%, but perhaps I'm using them incorrectly eg.
(lactboth <- totaldata[totaldata$LactationNo %in% c(1,2), ])
Animal Id is column 1, and lactationno is column 2. I can't figure out how to select only those AnimalId with LactationNo=1&2
Have also tried
lactboth <- totaldata[ which(totaldata$LactationNo==1 & totaldata$LactationNo ==2), ]
I feel like this should be simple, but couldn't find an example to follow quite the same. Help appreciated!!
If I understand your question correctly, then your dataset looks something like this:
AnimalId LactationNo
1 A 1
2 B 2
3 E 2
4 A 2
5 E 2
and you'd like to select animals that happen to have both lactation numbers 1 & 2 (like A in this particular example). If that's the case, then you can simply use merge:
lactboth <- merge(totaldata[totaldata$LactationNo == 1,],
totaldata[totaldata$LactationNo == 2,],
by.x="AnimalId",
by.y="AnimalId")[,"AnimalId"]
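Note that the merge above returns only the matching AnimalId values. If the goal is to keep every original row (and all other columns) for those animals, one possible sketch is to intersect the two sets of IDs and subset with %in%. The toy data mirrors the example in this answer; ids_lact1, ids_lact2 and both are illustrative names.

```r
# Toy data matching the example above.
totaldata <- data.frame(
  AnimalId    = c("A", "B", "E", "A", "E"),
  LactationNo = c(1, 2, 2, 2, 2),
  stringsAsFactors = FALSE
)

# IDs seen with each lactation number.
ids_lact1 <- totaldata$AnimalId[totaldata$LactationNo == 1]
ids_lact2 <- totaldata$AnimalId[totaldata$LactationNo == 2]

# Animals present in both sets.
both <- intersect(ids_lact1, ids_lact2)

# Keep all rows, with all columns, for those animals.
lactboth <- totaldata[totaldata$AnimalId %in% both, ]
print(lactboth)
```

The attempt in the question, LactationNo == 1 & LactationNo == 2, tests a single row against both values at once, which can never be true; the condition has to be evaluated per animal rather than per row.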

Doubts about ddply function in R

I'm trying to do the equivalent of a group-by summary in R with the plyr function ddply. I have a data frame with three columns (say id, period and event). I'd like to count the number of times each id appears in the data frame (count(*) ... group by id in SQL) and get the last element of the event column for each id.
Here an example of what I have and what I'm trying to obtain:
id period event #original data frame
1 1 1
2 1 0
2 2 1
3 1 1
4 1 1
4 1 0
id t x #what I want to obtain
1 1 1
2 2 1
3 1 1
4 2 0
This is the simple code I've been using for that:
teachers.pp<-read.table("http://www.ats.ucla.edu/stat/examples/alda/teachers_pp.csv", sep=",", header=T) # whole data frame
datos=ddply(teachers.pp,.(id),function(x) c(t=length(x$id), x=x[length(x$id),3])) #This is working fine.
Now, I've been reading The Split-Apply-Combine Strategy for Data Analysis, which gives an example using a syntax equivalent to the one below:
datos2=ddply(teachers.pp,.(id), summarise, t=length(id), x=teachers.pp[length(id),3]) #using summarise but the result is not what I want.
This is the data frame I get using datos2
id t x
1 1 1
2 2 0
3 1 1
4 1 1
So, my question is: why is this result different from the one I get using the first piece of code (datos)? What am I doing wrong?
It is not clear to me when I should use summarise or transform. Could you tell me the correct syntax for the ddply function?
When you use summarise, stop referencing the original data frame. Instead, just write expressions in terms of the column names.
You tried this:
ddply(teachers.pp,.(id), summarise, t=length(id), x=teachers.pp[length(id),3])
when what you probably wanted was something more like this:
ddply(teachers.pp,.(id), summarise, t=length(id), x=tail(event,1))
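To see why the expressions should refer only to the group's own columns, it can help to spell out the split-apply-combine steps by hand in base R. The sketch below uses the six example rows from the question rather than the teachers_pp.csv file; d, per_id and res are illustrative names, and this is a rough analogue of what ddply with summarise does, not its actual implementation.

```r
# Example rows from the question.
d <- data.frame(id     = c(1, 2, 2, 3, 4, 4),
                period = c(1, 1, 2, 1, 1, 1),
                event  = c(1, 0, 1, 1, 1, 0))

# Split by id, summarise each piece using only that piece's columns.
per_id <- lapply(split(d, d$id), function(g) {
  data.frame(id = g$id[1],
             t  = nrow(g),             # number of rows for this id
             x  = tail(g$event, 1))    # last event within this id
})

# Combine the per-group results into one data frame.
res <- do.call(rbind, per_id)
rownames(res) <- NULL
print(res)
```

Inside the function, g is just the rows for one id, so tail(g$event, 1) is that id's last event. Indexing the full data frame with the group size, as in teachers.pp[length(id), 3], instead picks a row of the whole table that has nothing to do with the current group.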
