Convert data frame of redundant frequencies

I have a data.frame like so:
category count
A 11
B 1
C 45
A 1003
D 20
B 207
E 634
E 40
A 42
A 7
B 44
B 12
Each row represents a specific element with a category type and a count of that element. I would like to produce a frequency distribution of counts per category, but at the moment the categories are repeated across rows.
How do I get a table with one row per category and the summed count? That is, I want a table that looks like:
category count
A 11234
B 4005
C 100023
D 65567
E 54654
... ...
I almost got there using lapply:
df.nrcounts <- lapply(unique(df.counts$category),
function(x) c(category=x, count=sum(subset(df.counts, category==x)$count)))
but I can't seem to coerce the output to a proper dataframe. I can't quite get my head around using the function.

aggregate(df.counts$count,by=list(df.counts$category),FUN=sum)
Or
library(data.table)
setDT(df.counts)[, list(count=sum(count)), by = category]
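As a hedged variation on the answers above: the formula interface of aggregate keeps the original column names, whereas the by= form returns columns named Group.1 and x. The sketch below rebuilds the sample data from the question. The name df.nrcounts is borrowed from the question's own attempt.

```r
# Sample data copied from the question
df.counts <- data.frame(
  category = c("A", "B", "C", "A", "D", "B", "E", "E", "A", "A", "B", "B"),
  count    = c(11, 1, 45, 1003, 20, 207, 634, 40, 42, 7, 44, 12)
)

# Sum count within each category; the result has columns category and count
df.nrcounts <- aggregate(count ~ category, data = df.counts, FUN = sum)
print(df.nrcounts)
```

The result is already a data frame with one row per category, so no coercion step is needed afterwards.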

Related

Select a dataset based on different column value but in the same row

I have a dataset with around 80 columns and 1000 Rows, a sample of this dataset follow below:
ID gend.y gend.x Sire Dam Weight
1 M F Jim jud 220
2 F F josh linda 198
3 M NA Claude Bere 200
4 F M John Mary 350
5 F F Peter Lucy 298
I need to select all rows where gend.y and gend.x differ, like this:
ID gend.y gend.x Sire Dam Weight
1 M F Jim jud 220
3 M NA Claude Bere 200
4 F M John Mary 350
Remember, I need to keep the other 76 columns too.
I tried this command:
library(dplyr)
new.file=my.file %>%
filter(gend.y != gend.x)
But it didn't work, and this message appears:
Error in Ops.factor(gend.y, gend.x) : level sets of factors are different

As @divibisan said: "Still not a reproducible example, but the error gets you closer. These 2 variables are factors. The interpretation of a factor depends on both the codes and the "levels" attribute. Be careful only to compare factors with the same set of levels (in the same order). You probably want to convert them to character before comparing, or fix the levels to match."
So I did this (convert them to character):
my.file$new.gend.y=as.character(my.file$gend.y)
my.file$new.gend.x=as.character(my.file$gend.x)
Then I ran my previous command with the new variables (now converted to character):
library(dplyr)
new.file=my.file %>%
filter(new.gend.y != new.gend.x | is.na(new.gend.y != new.gend.x))
Now it works as I expected. Credit to @divibisan.
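The same idea can be sketched in base R without dplyr. This is a minimal example on a cut-down copy of the sample data (only four of the columns), not the asker's real file. The helper names y, x, differs and keep are introduced here for illustration.

```r
# Cut-down version of the sample data; both gender columns are factors
my.file <- data.frame(
  ID     = 1:5,
  gend.y = factor(c("M", "F", "M", "F", "F")),
  gend.x = factor(c("F", "F", NA, "M", "F")),
  Weight = c(220, 198, 200, 350, 298)
)

# Compare as character so that differing level sets cannot raise an error
y <- as.character(my.file$gend.y)
x <- as.character(my.file$gend.x)
differs <- y != x                    # NA wherever either side is NA

# Keep rows that differ, and also rows where the comparison is NA
keep <- is.na(differs) | differs
new.file <- my.file[keep, ]          # all columns are carried along
print(new.file)
```

Row indexing with a logical vector keeps every column, so the remaining columns of the real data set would come along unchanged.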

Sort list on numeric values stored as factor

I have 4 data frames with data from different experiments, where each row represents a trial. The participant's id (SID) is stored as a factor. Each of the data frames looks like this:
Experiment 1:
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059
I want to make a new data frame with the id's of the participants in each of the experiments, for example:
Exp1 Exp2 Exp3 Exp4
5402 22081 22160 25434
25403 22069 22179 25439
25485 22115 22141 25408
25457 22120 22185 25445
28041 22448 22239 25473
29514 22492 22291 25489
I want each column to be ordered as numbers, that is, 2 comes before 10.
I used unique() to extract the participant id's (SID) in each data frame, but I am having problems ordering the columns.
I tried using:
data.frame(order(unique(df1$SID)),
order(unique(df2$SID)),
order(unique(df3$SID)),
order(unique(df4$SID)))
and I get (without the column names):
38 60 16 32 15
2 9 41 14 41
3 33 5 30 62
4 51 11 18 33
I'm sorry if I am missing something very basic, I am still very new to R.
Thank you for any help!
Edit:
I tried the solutions in the comments, and now I have:
x<-cbind(sort(as.numeric(unique(df1$SID)),decreasing = F),
sort(as.numeric(unique(df2$SID)),decreasing = F),
sort(as.numeric(unique(df3$SID)),decreasing = F),
sort(as.numeric(unique(df4$SID)),decreasing = F) )
Still does not work... I get:
V1 V2 V3 V4
8 6 5 2
2 9 35 11 3
3 10 37 17 184
4 13 38 91 185
5 15 39 103 186
The subject id's are 3 to 5 digit numbers...

If your data looks like this:
df <- read.table(text="
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059",
header=TRUE, colClasses = c("factor","integer","numeric"))
I would do something like this:
df <- df[order(as.numeric(as.character(df$SID)), df$trial),] # sort df on SID (numeric) & trial
split(df$SID, df$trial) # breaks the vector SID into a list of vectors of SID for each trial
If you were worried about unique values you could do:
lapply(split(df$SID, df$trial), unique) # breaks SID into list of unique SIDs for each trial
That will give you a list of participant IDs for each trial, sorted by numeric value but maintaining their factor property.
If you really wanted a data frame, and the number of participants in each experiment were equal, you could use data.frame() on the list, as in: data.frame(split(df$SID, df$trial))
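The as.numeric(as.character(...)) step used above matters because calling as.numeric directly on a factor returns the internal level codes rather than the printed values, which is why the attempt in the question's edit produced small integers. A minimal sketch, using a made-up factor sid with three of the IDs from the question:

```r
# A factor of participant IDs; the levels are sorted as text: "25403" "25485" "5402"
sid <- factor(c("5402", "25403", "25485", "5402"))

# Direct conversion returns the level codes, not the IDs
wrong <- as.numeric(unique(sid))

# Converting through character recovers the real numbers, which then sort numerically
right <- sort(as.numeric(as.character(unique(sid))))

print(wrong)
print(right)
```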

Suppose x and y represent the Exp1 SID and Exp2 SID. You can create an ordered list of unique values as shown below:
x<-factor(x = c(2,5,4,3,6,1,4,5,6,3,2,3))
y<-factor(x = c(2,3,4,2,4,1,4,5,5,3,2,3))
list(exp1=sort(x = unique(x),decreasing = F),exp2=sort(x = unique(y),decreasing = F))

Arrange univariateTable output by values not by levels

I am trying to solve the following inconvenience when trying to export a table consisting of factor levels. Here is the code to generate the sample data, and a table from it.
data <- c(sample('A',30,replace=TRUE), sample('B',120,replace=TRUE),
sample('C',180,replace=TRUE), sample('D',70,replace=TRUE))
library(Publish)
univariateTable(~data)
The default output of the univariateTable is by levels (From A through D):
Variable Levels Value
1 data A 30 (7.5)
2 B 120 (30.0)
3 C 180 (45.0)
4 D 70 (17.5)
How can I change this so that the output is ordered by value instead? I mean, the first row being the largest number (and percentage) and the last row being the lowest, like this:
Variable Levels Value
1 data C 180 (45.0)
2 B 120 (30.0)
3 D 70 (17.5)
4 A 30 (7.5)

Assuming that the "Publish" package is the one installed from GitHub, we extract the numbers before the ( using sub, order them, and use that ordering to reorder "xlevels" and "summary.totals".
#library(devtools)
#install_github("TagTeam/Publish")
library(Publish)
Out <- univariateTable(~data)
i1 <- order(as.numeric(sub('\\s+.*', '',
Out$summary.totals$data)), decreasing=TRUE)
Out$xlevels$data <- Out$xlevels$data[i1]
Out$summary.totals$data <- Out$summary.totals$data[i1]
Out
# Variable Level Total
#1 data C 180 (45.0)
#2 B 120 (30.0)
#3 D 70 (17.5)
#4 A 30 (7.5)
data
set.seed(24)
data <- c(sample('A',30,replace=TRUE), sample('B',120,replace=TRUE),
sample('C',180,replace=TRUE), sample('D',70,replace=TRUE))
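The extraction and ordering step in the answer above can be checked in isolation, without the Publish package. In this sketch the vectors vals and lv are hand-written stand-ins for the formatted totals and the levels shown in the printed table. They are not pulled from a real univariateTable object.

```r
# Stand-ins for the formatted "count (percent)" strings and their levels
vals <- c("30 (7.5)", "120 (30.0)", "180 (45.0)", "70 (17.5)")
lv   <- c("A", "B", "C", "D")

# Drop everything from the first whitespace onwards, leaving the count
counts <- as.numeric(sub("\\s+.*", "", vals))

# Index that sorts the counts from largest to smallest
i1 <- order(counts, decreasing = TRUE)

print(lv[i1])
print(vals[i1])
```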

Assigning logical value to values higher than given threshold for each case across each year

I have a data frame resembling the extract below:
set.seed(1)
smpl_df <- data.frame(year = c(1500:2011), case = LETTERS[1:4])
smpl_df$var_one <- sample(100, size = nrow(smpl_df), replace = TRUE)
I'd like to add one more column to this data frame. The column should take the value 1 if the values in var_one were higher than a given threshold in all of the years recorded for that case. For example, in its present format the table looks like this:
head(smpl_df)
year case var_one
1 1500 A 27
2 1501 B 38
3 1502 C 58
4 1503 D 91
5 1504 A 21
6 1505 B 90
I would like to add a column to the data frame (the values in the new column are not right, they are only there by way of example):
year case var_one var_one_higher_than_80_for_all_yrs_for_this_case
1 1500 A 27 0
2 1501 B 38 0
3 1502 C 58 0
4 1503 D 91 1
5 1504 A 21 0
6 1505 B 90 1
Edit
To add to the post, following the useful points made in the comments below: the long table that I'm currently working with could be obtained from the wide table below. In the example below, I added a column NewColumn that takes the value Yes if the value for a given case was higher than 2 and No if the value was lower than or equal to 2 for all the years. I want to achieve the same effect on my long table (smpl_df).
Edit 2
Following the useful comments concerning the desired final output, my intention is to generate a column that would correspond to the last column in the table below.

An ifelse structure may be helpful:
smpl_df$var_one_higher <- ifelse("your func",1,0) # "your func" is a placeholder for your own logical condition
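One possible way to fill in the condition for the ifelse structure suggested above is to compute the minimum of var_one within each case with ave and compare it to the threshold: if the smallest value for a case exceeds the threshold, every year does. This is a sketch under that reading of the question. The small data frame df, the threshold of 80 and the column name above_all_years are made up for illustration.

```r
# Small long-format example: two cases observed over three years
df <- data.frame(
  year    = rep(2001:2003, times = 2),
  case    = rep(c("A", "B"), each = 3),
  var_one = c(85, 90, 95, 85, 40, 99)
)

threshold <- 80

# ave() returns, for every row, the minimum of var_one within that row's case
case_min <- ave(df$var_one, df$case, FUN = min)

# 1 if the case stayed above the threshold in every year, otherwise 0
df$above_all_years <- ifelse(case_min > threshold, 1, 0)
print(df)
```

Case A never drops to 80 or below, so all of its rows get 1. Case B has a value of 40 in one year, so all of its rows get 0.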

Combine or Sum rows based on partial match and other rules

I have a dataframe df1:
df1 <- data.frame(
Lot = c("13VC011","13VC018","13VC011A","13VC011B","13VC018A","13VC018C","13VC018B"),
Date = c("2013-07-12","2013-07-11","2013-07-13","2013-07-14","2013-07-16","2013-07-18","2013-07-19"),
Step = c("A","A","B","B","C","C","C"),
kg = c(31,32,14,16,10,11,10))
Sometimes at a particular 'Step' a 'Lot' gets split into A,B or C as indicated. I'd like to sum those and get a dataframe that tells me the total kg at each step, for each lot.
For example the output should look like this:
df2 <- data.frame(
Lot = c("13VC011","13VC011","13VC018","13VC018"),
Step = c("A","B","A","C"),
kg = c(31,30,32,31))
So there are two requirements. If the 'Lot' matches, regardless of the trailing letter, and the step matches, then the sum occurs. If both conditions are not satisfied, then just carry over the line item as is into df2.
Part2:
So I would like to introduce a third requirement. In some cases, the Lot was split into two or three parts, but not all of the data is present. In this case, using these solutions masks this and makes it appear that one lot has much lower kg than it actually has.
What I would like to do is find a way to indicate if the dataset contains 13VC011A for example, but no 13VC011B. Or if we see a 'B' but no 'A' or a 'C' but no 'B' or 'A'.
So now the original dataframe is:
df1 <- data.frame(
Lot = c("13VC011","13VC018","13VC011A","13VC011B","13VC018A","13VC018C","13VC018B","13VC020B"),
Date = c("2013-07-12","2013-07-11","2013-07-13","2013-07-14","2013-07-16","2013-07-18","2013-07-19","2013-07-22"),
Step = c("A","A","B","B","C","C","C","B"),
kg = c(31,32,14,16,10,11,10,18))
And the resultant df2 should look something like:
df2 <- data.frame(
Lot = c("13VC011","13VC011","13VC018","13VC018","13VC020B"),
Step = c("A","B","A","C","B"),
kg = c(31,30,32,31,18),
Partial = c(F,F,F,F,T))

df1$Lot <- gsub("[[:alpha:]]$","",df1$Lot) #replace the character element at the end of string with `""`
aggregate(kg~Lot+Step,df1, FUN=sum)
# Lot Step kg
#1 13VC011 A 31
#2 13VC011 B 30
#3 13VC018 A 32
#4 13VC018 C 31
Or using dplyr
library(stringr)
library(dplyr)
df1%>%
group_by(Lot=str_extract(Lot,'.*\\d(?=[A-Z]?$)'), Step) %>% # perl() is deprecated in stringr; a plain pattern supports the lookahead
summarize(kg=sum(kg))
#Source: local data frame [4 x 3]
#Groups: Lot
# Lot Step kg
#1 13VC011 A 31
#2 13VC011 B 30
#3 13VC018 A 32
#4 13VC018 C 31
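Explanation of the regex:
.* : matches any sequence of characters (zero or more)
\\d : followed by a digit
(?=[A-Z]?$) : a lookahead requiring that only an optional uppercase letter ([A-Z]?) remains before the end of the string ($); the letter itself is not included in the match.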

> aggregate(kg ~Lot + Step, data=df1, FUN=sum)
Lot Step kg
1 13VC011 A 31
2 13VC018 A 32
3 13VC011A B 14
4 13VC011B B 16
5 13VC018A C 10
6 13VC018B C 10
7 13VC018C C 11
At that point I finally understood what you meant by "regardless of the trailing letter" and wondered if the formula method of aggregate could handle an R-function in one of the terms:
> aggregate(kg ~substr(Lot,1,7) + Step, data=df1, FUN=sum)
substr(Lot, 1, 7) Step kg
1 13VC011 A 31
2 13VC018 A 32
3 13VC011 B 30
4 13VC018 C 31
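Part 2 of the question (flagging incomplete splits) is not covered by the answers above. The following base R sketch is one possible approach under an assumed rule: a lot/step group counts as partial when it has split suffixes that do not form a complete run starting at A (A, B or A, B, C), or when only a single split part is present. The helper is_partial and the columns Base, Suffix and Partial are hypothetical names. Unlike the desired output in the question, the flagged row reports the base lot 13VC020 rather than 13VC020B.

```r
# Extended sample data from Part 2 of the question (Date column omitted)
df1 <- data.frame(
  Lot  = c("13VC011", "13VC018", "13VC011A", "13VC011B",
           "13VC018A", "13VC018C", "13VC018B", "13VC020B"),
  Step = c("A", "A", "B", "B", "C", "C", "C", "B"),
  kg   = c(31, 32, 14, 16, 10, 11, 10, 18),
  stringsAsFactors = FALSE
)

# Split each Lot into its base id and an optional trailing letter
df1$Base   <- sub("[A-Z]$", "", df1$Lot)
df1$Suffix <- substring(df1$Lot, nchar(df1$Base) + 1)   # "" when there is no letter

# Assumed rule: partial unless the suffixes form a complete run A, B, ... of length >= 2
is_partial <- function(suffix) {
  s <- sort(unique(suffix[suffix != ""]))
  if (length(s) == 0) return(FALSE)                      # lot was never split
  length(s) < 2 || !identical(s, LETTERS[seq_along(s)])
}

# Total kg per base lot and step
totals <- aggregate(kg ~ Base + Step, data = df1, FUN = sum)

# Flag per group, looked up by a combined "Base Step" key
key   <- paste(df1$Base, df1$Step)
flags <- vapply(split(df1$Suffix, key), is_partial, logical(1))
totals$Partial <- unname(flags[paste(totals$Base, totals$Step)])
print(totals)
```

Whether a lone A part should count as partial is a judgment call; the rule inside is_partial is the place to adjust it.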
