Arrange univariateTable output by values, not by levels - r

I am trying to solve the following inconvenience when exporting a table of factor-level counts. Here is the code to generate the sample data, and a table from it:
data <- c(sample('A',30,replace=TRUE), sample('B',120,replace=TRUE),
sample('C',180,replace=TRUE), sample('D',70,replace=TRUE))
library(Publish)
univariateTable(~data)
The default output of univariateTable is ordered by levels (from A through D):
Variable Levels Value
1 data A 30 (7.5)
2 B 120 (30.0)
3 C 180 (45.0)
4 D 70 (17.5)
How can I change this so that the output is ordered by value instead? I mean, the first row being the largest number (and percentage) and the last row being the lowest, like this:
Variable Levels Value
1 data C 180 (45.0)
2 B 120 (30.0)
3 D 70 (17.5)
4 A 30 (7.5)

Assuming that the "Publish" package is the one installed from GitHub, we extract the numbers before the "(" using sub, order them, and use that ordering to rearrange the "xlevels" and "summary.totals" components.
#library(devtools)
#install_github("TagTeam/Publish")
library(Publish)
Out <- univariateTable(~data)
i1 <- order(as.numeric(sub('\\s+.*', '',
Out$summary.totals$data)), decreasing=TRUE)
Out$xlevels$data <- Out$xlevels$data[i1]
Out$summary.totals$data <- Out$summary.totals$data[i1]
Out
# Variable Level Total
#1 data C 180 (45.0)
#2 B 120 (30.0)
#3 D 70 (17.5)
#4 A 30 (7.5)
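An alternative sketch (not from the original answer, and assuming univariateTable keeps the order of the factor levels, as the default output above suggests): relevel the factor by frequency before tabulating.
# relevel by decreasing frequency, then tabulate as before
data <- factor(data, levels = names(sort(table(data), decreasing = TRUE)))
univariateTable(~data)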
data
set.seed(24)
data <- c(sample('A',30,replace=TRUE), sample('B',120,replace=TRUE),
sample('C',180,replace=TRUE), sample('D',70,replace=TRUE))

Related

r - lapply divides a column by an integer value from different dataset, unexpected result

I have two data.frames: one with genotype counts, and one with the numbers I need to normalize the counts from the first dataset.
countsdata=data.frame(genotype1=rep(c(10,20,30,40),each=1),
genotype2=rep(c(100,200,300,400),each=1),
genotype3=rep(c(40,50,60,70),each=1),
genotype4=rep(c(40,50,60,70),each=1)
)
coldata = data.frame(Group =c('genotype1', 'genotype2', 'genotype3', 'genotype4'),
Treatment = rep(c("control","treated"),each = 2),
Norm=rep(c(1,2,5,5)))
I made sure my variables aren't factors:
factorsCharacter <- function(d) modifyList(d, lapply(d[, sapply(d, is.factor)],
as.character))
coldata=factorsCharacter(coldata)
Then I checked that lapply loops through my counts one column at a time, and through my coldata that contains the normalization value (Norm). All looks good, until I combine the two actions in the same step:
> lapply(coldata['Group'],function(group_i){group_i})
$Group
[1] "genotype1" "genotype2" "genotype3" "genotype4"
> lapply(coldata['Group'],function(group_i){countsdata[,group_i]})
$Group
genotype1 genotype2 genotype3 genotype4
1 10 100 40 40
2 20 200 50 50
3 30 300 60 60
4 40 400 70 70
> lapply(coldata['Group'],function(group_i){as.integer(coldata[coldata$Group==group_i,'Norm'])})
$Group
[1] 1 2 5 5
> lapply(coldata['Group'],function(group_i){
+ countsdata[,group_i]/as.integer(coldata[coldata$Group==group_i,'Norm'])
+ })
$Group
genotype1 genotype2 genotype3 genotype4
1 10 100 40 40
2 10 100 25 25
3 6 60 12 12
4 8 80 14 14
Here the result is not what I was expecting (dividing each column by its normalization number). After further inspection I noticed it's normalizing by rows, in other words normalizing across different columns, which shouldn't be the case as I am looping through one column at a time. I am probably missing a basic concept, but looking through other SO posts I didn't find anything I could use. My goal is to fix the code to make the right calculation, but I would also like to understand why the code above is not working. Thanks so much.
The problem is in using [ and not [[. With [, instead of looping through each of the elements in the 'Group' column, we loop over a list of length 1 that holds all the elements, so the anonymous function is called only once, with the full vector. Use coldata[, 'Group'], coldata[['Group']], or coldata$Group for looping.
countsdataNew <- countsdata
countsdataNew[] <- lapply(coldata[['Group']],function(group_i)
countsdata[,group_i]/coldata$Norm[coldata$Group==group_i])
countsdataNew
# genotype1 genotype2 genotype3 genotype4
#1 10 50 8 8
#2 20 100 10 10
#3 30 150 12 12
#4 40 200 14 14
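To see the difference directly, compare what each form hands to the anonymous function (a quick demonstration with the objects above; the comments show the expected results):
length(coldata['Group'])    # 1 -- a one-element list holding the whole column
length(coldata[['Group']])  # 4 -- the four genotype names
lapply(coldata['Group'], length)   # $Group: 4 -- one call on the full vector
lapply(coldata[['Group']], nchar)  # four calls, one per genotype name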
If the column names in 'countsdata' and the 'Group' column of 'coldata' are in the same order, we can do this easily with Map
Map(`/`, countsdata, coldata$Norm)
Or just replicate the 'Norm' and do a simple division
countsdata/coldata$Norm[col(countsdata)]
Or with sweep
sweep(countsdata, 2, coldata$Norm, "/")
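A quick sanity check (a sketch, not from the original answer) that the alternatives agree with the lapply fix:
all.equal(countsdataNew, sweep(countsdata, 2, coldata$Norm, "/"))     # TRUE
all.equal(countsdataNew, countsdata / coldata$Norm[col(countsdata)])  # TRUE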

Sort list on numeric values stored as factor

I have 4 data frames with data from different experiments, where each row represents a trial. The participant's id (SID) is stored as a factor. Each one of the data frames looks like this:
Experiment 1:
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059
I want to make a new data frame with the id's of the participants in each of the experiments, for example:
Exp1 Exp2 Exp3 Exp4
5402 22081 22160 25434
25403 22069 22179 25439
25485 22115 22141 25408
25457 22120 22185 25445
28041 22448 22239 25473
29514 22492 22291 25489
I want each column to be ordered as numbers, that is, 2 comes before 10.
I used unique() to extract the participant id's (SID) in each data frame, but I am having problems ordering the columns.
I tried using:
data.frame(order(unique(df1$SID)),
order(unique(df2$SID)),
order(unique(df3$SID)),
order(unique(df4$SID)))
and I get (without the column names):
38 60 16 32 15
2 9 41 14 41
3 33 5 30 62
4 51 11 18 33
I'm sorry if I am missing something very basic, I am still very new to R.
Thank you for any help!
Edit:
I tried the solutions in the comments, and now I have:
x<-cbind(sort(as.numeric(unique(df1$SID)),decreasing = F),
sort(as.numeric(unique(df2$SID)),decreasing = F),
sort(as.numeric(unique(df3$SID)),decreasing = F),
sort(as.numeric(unique(df4$SID)),decreasing = F) )
Still does not work... I get:
V1 V2 V3 V4
1 8 6 5 2
2 9 35 11 3
3 10 37 17 184
4 13 38 91 185
5 15 39 103 186
The subject id's are 3 to 5 digit numbers...
If your data looks like this:
df <- read.table(text="
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059",
header=TRUE, colClasses = c("factor","integer","numeric"))
I would do something like this:
df <- df[order(as.numeric(as.character(df$SID)), df$trial), ] # sort df on SID (numeric) & trial
split(df$SID, df$trial) # breaks the vector SID into a list of vectors of SID for each trial
If you were worried about unique values you could do:
lapply(split(df$SID, df$trial), unique) # breaks SID into list of unique SIDs for each trial
That will give you a list of participant IDs for each trial, sorted by numeric value but maintaining their factor property.
If you really wanted a data frame, and the number of participants in each experiment were equal, you could use data.frame() on the list, as in: data.frame(split(df$SID, df$trial))
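Why the earlier attempts misbehaved (a quick demonstration, not part of the original answer): order() returns index positions rather than sorted values, and as.numeric() applied to a factor returns the internal level codes rather than the labels.
f <- factor(c("5402", "25403"))
order(f)                           # 2 1 -- a permutation of indices, not values
as.numeric(f)                      # 2 1 -- level codes ("25403" sorts before "5402")
as.numeric(as.character(f))        # 5402 25403 -- the actual numbers
sort(as.numeric(as.character(f)))  # 5402 25403 -- sorted numerically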
Suppose x and y represent the Exp1 SIDs and Exp2 SIDs. You can create an ordered list of unique values as shown below:
x<-factor(x = c(2,5,4,3,6,1,4,5,6,3,2,3))
y<-factor(x = c(2,3,4,2,4,1,4,5,5,3,2,3))
list(exp1=sort(x = unique(x),decreasing = F),exp2=sort(x = unique(y),decreasing = F))

How to find the group of rows of a data frame where an error occurs

I have a two-column dataframe containing thousands of IDs where each ID has hundreds of data rows, in other words a data frame of about 6 million rows. I am grouping (using either dplyr or data.table) this data frame by ID and performing a "tso" (outlier detection) function on each group. The problem is that after hours of computation it returns an error related to the ARIMA specification of one of the IDs. The question is: how can I identify the ID (or the row number) where my function is returning an error? (If I can detect it, I can remove that ID from the data frame.)
I tried manually performing my function on subgroups of this data frame; however, I cannot reach the erroneous ID because there are thousands of IDs, so it would take weeks to find it this way.
outlier.detection <- function(x, iter) {
  y <- as.ts(x)
  out2 <- tso(y, maxit.iloop = iter, tsmethod = "auto.arima",
              remove.method = "bottom-up", cval = 3)
  y[out2$outliers$ind] <- NA
  return(y)
}
df <- data.table(outlying1);setkey(df,id)
test <- df[,list(new.weight = outlier.detection(weight,iter=1)),by=id]
The above function finds the anomalies and replaces them with NAs. Here is an example of the input,
ID weight
1 a 50
2 a 50
3 a 51
4 a 51.5
5 a 52
6 b 80
7 b 81
8 b 81.5
9 b 90
10 b 82
it will look like the following,
ID weight
1 a 50
2 a 50
3 a 51
4 a 51.5
5 a 52
6 b 80
7 b 81
8 b 81.5
9 b NA
10 b 82
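One way to pin down the offending group (a sketch, not from the original thread; it assumes the failure surfaces as an ordinary R error): wrap the per-group call in tryCatch() so each failure is reported with its ID instead of aborting the whole computation.
safe.detection <- function(x, id, iter) {
  tryCatch(outlier.detection(x, iter),
           error = function(e) {
             # report which group failed, then keep the row count by returning NAs
             message("tso failed for id ", id, ": ", conditionMessage(e))
             rep(NA_real_, length(x))
           })
}
test <- df[, list(new.weight = safe.detection(weight, .BY$id, iter = 1)), by = id]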

Assigning logical value to values higher than given threshold for each case across each year

I have a data frame resembling the extract below:
set.seed(1)
smpl_df <- data.frame(year = c(1500:2011), case = LETTERS[1:4])
smpl_df$var_one <- sample(100, size = nrow(smpl_df), replace = TRUE)
I'm interested in adding one more column to this data frame. I want the new column to take the value 1 if the values in the column var_one were higher than a given threshold for all of the consecutive years represented in the data set for that case. For example, in its present format the table looks like this:
head(smpl_df)
year case var_one
1 1500 A 27
2 1501 B 38
3 1502 C 58
4 1503 D 91
5 1504 A 21
6 1505 B 90
I would like to add a column to the data table (the values shown for the new column are not right; they are just there by way of example):
year case var_one var_one_higher_than_80_for_all_yrs_for_this_case
1 1500 A 27 0
2 1501 B 38 0
3 1502 C 58 0
4 1503 D 91 1
5 1504 A 21 0
6 1505 B 90 1
Edit
To add to the post, following useful points expressed in the comments below: the long table that I'm currently working with could be obtained from the wide table below. In that example, I added a column NewColumn that takes the value Yes if, for a given case, the value was higher than 2 for all the years, and No if it was lower than or equal to 2. I want to achieve the same effect, but on my long table (smpl_df).
Edit 2
Following the useful comments concerning the desired final output, my intention is to generate a column that would correspond to the last column in the table below.
Maybe an ifelse structure would be helpful:
smpl_df$var_one_higher <- ifelse("your func",1,0)
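A concrete version of that idea (a sketch, assuming the threshold is 80 and "for all years" means every row of a given case): ave() evaluates the condition within each case and recycles the per-case result back to every row, which ifelse() then turns into 1/0.
# flag cases whose var_one exceeds 80 in every year of that case
smpl_df$var_one_higher_than_80_for_all_yrs_for_this_case <-
  ifelse(ave(smpl_df$var_one > 80, smpl_df$case, FUN = all), 1, 0)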

Convert data frame of redundant frequencies

I have a data.frame like so:
category count
A 11
B 1
C 45
A 1003
D 20
B 207
E 634
E 40
A 42
A 7
B 44
B 12
Each row represents a specific element with a category type and a count of that element. I would like to produce a frequency distribution of counts per category, but the categories are at the moment redundant.
How do I retrieve a table of collapsed, non-redundant category counts? I.e., I want a table that looks like this:
category count
A 11234
B 4005
C 100023
D 65567
E 54654
... ...
I almost got there using lapply:
df.nrcounts <- lapply(unique(df.counts$category),
function(x) c(category=x, count=sum(subset(df.counts, category==x)$count)))
but I can't seem to coerce the output into a proper data frame. I can't quite get my head around using the function.
aggregate(df.counts$count, by = list(df.counts$category), FUN = sum)
Or
library(data.table)
setDT(df.counts)[, list(count=sum(count)), by = category]
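As an aside (a sketch, not from the original answers), the reason the lapply attempt resisted coercion is that c(category = x, count = sum(...)) mixes character and numeric, so each list element collapses to a character vector. Building one-row data frames and binding them avoids that:
# one data.frame per category, then stack them into a single data.frame
df.nrcounts <- do.call(rbind, lapply(unique(df.counts$category), function(x)
  data.frame(category = x, count = sum(df.counts$count[df.counts$category == x]))))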
