Aggregate variables separately in R [lapply + aggregate]

I have a data.frame with a set of records and, as variables, different measurements. I would like to create a new data.frame containing, for each measurement, the number of records that have each measurement value. Basically, this is what I am trying to do:
record <- c("r1", "r2", "r3")
firstMeasurement <- c(15, 10, 10)
secondMeasurement <- c(2, 4, 2)
df <- data.frame(record, firstMeasurement, secondMeasurement)
measurements <- c(colnames(df[2:3]))
measuramentsAggregate <- lapply(measurements, function(i)
  aggregate(record ~ i, df, FUN = length))
I am getting really funny errors and I don't understand why. Can anyone help me?
Many thanks!

I think this is what you want
library(dplyr)
agg.measurements <- df %>% group_by(firstMeasurement) %>% summarise(records=n())
That should do it for one measurement; the same pattern works for the others.
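For completeness, the errors in the original code come from the formula record ~ i: inside aggregate(), i is treated as a literal variable named i rather than being replaced by the column name. A minimal sketch (not part of this answer) that keeps the lapply + aggregate idea and builds each formula from the string with base R's reformulate():
# build record ~ <measurement> from each column name, then aggregate as before
measurementsAggregate <- lapply(measurements, function(i)
  aggregate(reformulate(i, response = "record"), df, FUN = length))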

If you want the number of records with specific firstMeasurement:
table(df$firstMeasurement)
Likewise for secondMeasurement. I am not sure what the data.frame you are trying to create should look like.
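If you want those counts for every measurement column at once, a small sketch reusing the measurements vector from the question:
# one frequency table per measurement column, returned as a named list
lapply(df[measurements], table)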

Related

R- How do I use a lookup table containing threshold values that vary for different variables (columns) to replace values below those thresholds?

I am trying to streamline the process of auditing chemistry laboratory data. When we encounter data where an analyte is not detected, I need to change the recorded result to a value equal to 1/2 of the level of detection (LOD) for the analytical method. I have the LODs contained within another dataframe to be used as a lookup table.
I have multiple columns representing data from different analytical tests, each with its own unique LOD. Here's an example of the type of data I am working with:
library(tidyverse)
dat <- tibble("Lab_ID" = as.character(seq(1, 10, 1)),
              "Tributary" = c('sawmill', 'paint', 'herring', 'water',
                              'paint', 'sawmill', 'bolt', 'water',
                              'herring', 'sawmill'),
              "date" = rep(as.POSIXct("2021-10-01 12:00:00"), 10),
              "TP" = c(1.5, 15.7, -2.3, 7.6, 0.1, 45.6, 12.2, -0.1, 22.2, 0.6),
              "TN" = c(100.3, 56.2, -10.5, 0.4, -0.3, 11.0, 45.8, 256.0, 12.2, 144.0),
              "DOC" = c(56.0, 120.3, -10.5, 0.2, 14.6, 489.3, 0.3, 14.4, 54.6, 88.8))
dat
detect_level <- tibble("Parameter" = c('TP', 'TN', 'DOC'),
                       "LOD" = c(0.6, 11, 0.3)) %>%
  mutate(halfLOD = LOD / 2)
detect_level
I have pored over multiple other questions with a similar theme:
Change values in multiple columns of a dataframe using a lookup table
R - Match values from multiple columns in a data.frame to a lookup table.
Replace values in multiple columns using different thresholds
and gotten to the point where I have pivoted the data and split it into a list of data frames, one per analyte:
dat %>%
  pivot_longer(cols = c('TP', 'TN', 'DOC')) %>%
  arrange(name) %>%
  split(.$name)
I have tried to apply a function using map(), but I cannot figure out how to integrate the values from the lookup table (detect_level) into my code. If someone could help me continue this pipe, or otherwise finish the process so that the final product dat2 looks like the one below, I would appreciate it:
dat2 <- tibble("Lab_ID" = as.character(seq(1, 10, 1)),
               "Tributary" = c('sawmill', 'paint', 'herring', 'water',
                               'paint', 'sawmill', 'bolt', 'water',
                               'herring', 'sawmill'),
               "date" = rep(as.POSIXct("2021-10-01 12:00:00"), 10),
               "TP" = c(1.5, 15.7, 0.3, 7.6, 0.3, 45.6, 12.2, 0.3, 22.2, 0.6),
               "TN" = c(100.3, 56.2, 5.5, 5.5, 5.5, 11.0, 45.8, 256.0, 12.2, 144.0),
               "DOC" = c(56.0, 120.3, 0.15, 0.15, 14.6, 489.3, 0.3, 14.4, 54.6, 88.8))
dat2
Another possibility comes from the closest similar question I have found:
Lookup multiple column from a single table
Here's a snippet of code that I have adapted from that question; however, if you run it you will see that wherever a value is not found in detect_level, an NA is returned. Additionally, it does not appear to have worked for $TN or $DOC, even in cases where the $LOD value from detect_level was present.
dat %>%
  mutate(across(all_of(unique(detect_level$Parameter)),
                ~ {i1 <- detect_level$Parameter == cur_column()
                   detect_level$LOD[i1][match(., detect_level$LOD)]}))
I am not at all comfortable with the purrr-style code here and have only adapted it from the linked question, so if this is the direction an answerer chooses, I would appreciate brief comments explaining what is happening "under the hood".
Thank you in advance!
Perhaps this helps
library(dplyr)
dat %>%
  mutate(across(all_of(detect_level$Parameter),
                # cur_column() gives the current column name; match() looks up that
                # column's LOD, and pmax() raises any value below the LOD up to the LOD
                ~ pmax(., detect_level$LOD[match(cur_column(), detect_level$Parameter)])))
For the updated case
dat %>%
  mutate(across(all_of(detect_level$Parameter),
                # replace() swaps every value below the column's LOD
                # with half the LOD (halfLOD) for that same column
                ~ replace(., . < detect_level$LOD[match(cur_column(), detect_level$Parameter)],
                          detect_level$halfLOD[match(cur_column(), detect_level$Parameter)])))

Making quick calculations on subsets with R

Thanks to all in advance.
I have the following data:
set.seed(123)
data <- data.frame(name = LETTERS[sample(1:26, 500, replace = TRUE)],
                   present = sample(0:1, 500, replace = TRUE))
And I want to quickly calculate the percentage of present observations (1's) for each letter. I can do it manually, but I believe there is an easier way to do this:
library(dplyr)
A <- filter(data, name=="A" & present==1)
A2 <- filter(data, name=="A")
data$Percentage[data$name=="A"] <- nrow(A)/nrow(A2)
And so on until I arrive to "Z".
Can I do this automatically, without having to change the value of the "name" column manually?
Best regards,
We can use prop.table with table to get the proportion
prop.table(table(data), 1)[,2]
To add it as a column, we can expand it by matching with the 'names'
data$Percentage <- prop.table(table(data), 1)[,2][as.character(data$name)]
Or as #Lars Lau Raket suggested, we don't need to convert to character
prop.table(table(data), 1)[,2][data$name]
If we need to create a column
library(dplyr)
data %>%
  group_by(name) %>%
  mutate(Percentage = mean(present == 1))
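If a separate summary table (one row per letter) is preferred over adding a column, a small variation on the same idea:
data %>%
  group_by(name) %>%
  summarise(Percentage = mean(present == 1))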

Creating a nested table with ddply

I'm having trouble generating a table with plyr and I hope you can help. If you run the code below you should get a table with proportions at the highest aggregate level of my data (i.e., the whole data set). However, I'd like to get the same table with proportions per item for each school. Thanks for any help. Also, if there's a better way to synthesize this with just dplyr I'm open to it. I'm trying to integrate some of these new packages into my workflow.
# load packages
library(plyr)
library(dplyr)
library(reshape2)
library(tidyr)
library(xtable)
# generate fake Data
set.seed(500)
School <- rep(seq(1:20), 2)
District <- rep(c(rep("East", 10), rep("West", 10)), 2)
Score <- rnorm(40, 100, 15)
Student.ID <- sample(1:1000,8,replace=T)
items <- data.frame(replicate(10, sample(1:4, 40, replace = TRUE)))
items <- data.frame(lapply(items, factor, ordered = TRUE,
                           levels = 1:4,
                           labels = c("Strongly disagree", "Disagree",
                                      "Agree", "Strongly Agree")))
school.data <- data.frame(Student.ID, School, District, Score, items)
rm(items)
# code for table
items <- select(school.data, School, X1:X10)
g <- items %>%
gather(Item, response, -School)
# This gives me the aggregate results for the entire data set
foo <- ddply(g, .(Item), function(x) prop.table(table(x$response))) #I stupidly tried .(Item, School) to no avail
xtable(foo)
Try
prop.table(with(g, table(response, Item, School)), margin = c(2, 3))
This gives a 4x10x20 array (responses, items, schools), with the proportions summing to 1 over the responses within each item-school combination. You can use as.data.frame on the result for conversion if needed.
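Since the question also asked whether this could be synthesized with dplyr alone, here is a hedged sketch (not part of the original answer) that computes the same per-item, per-school response proportions in long format:
library(dplyr)
g %>%
  count(School, Item, response) %>%   # counts per school, item and response
  group_by(School, Item) %>%
  mutate(prop = n / sum(n)) %>%       # proportions within each school-item pair
  ungroup()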

Replacing values from a column using a condition in R

I have a very basic R question but I am having a hard time trying to get the right answer. I have a data frame that looks like this:
species <- "ABC"
ind <- rep(1:4, each = 24)
hour <- rep(seq(0, 23, by = 1), 4)
depth <- runif(length(ind), 1, 50)
df <- data.frame(cbind(species, ind, hour, depth))
df$depth <- as.numeric(df$depth)
What I would like is to select AND replace all the rows where depth < 10 (for example) with zero, while keeping all the information associated with those rows and the original dimensions of the data frame.
I have tried the following, but it does not work.
df[df$depth<10] <- 0
Any suggestions?
# reassign depth values under 10 to zero
df$depth[df$depth<10] <- 0
For the columns that are factors, you can only assign values that are existing factor levels. If you want to assign a value that isn't currently a factor level, you need to create the additional level first:
levels(df$species) <- c(levels(df$species), "unknown")
df$species[df$depth<10] <- "unknown"
I arrived here from a Google search. Since the rest of my code is "tidy", here is the "tidy" way, for anyone else who may find it useful:
library(dplyr)
iris %>%
  mutate(Species = ifelse(as.character(Species) == "virginica",
                          "newValue", as.character(Species)))
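Applied to the question's df, that approach would look something like this (a sketch; depth was already converted back to numeric above):
library(dplyr)
df <- df %>%
  mutate(depth = ifelse(depth < 10, 0, depth))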

Recovering tapply results into the original data-frame in R

I have a data frame with annual exports of firms to different countries in different years. My problem is that I need to create a variable that says, for each year, how many firms there are in each country. I can do this perfectly with a tapply command, like
incumbents <- tapply(id, destination-year, function(x) length(unique(x)))
and it works just fine. My problem is that incumbents has length length(destination-year), and I need it to have length length(id) (there are many firms serving each destination each year) so that I can use it in a subsequent regression, of course in a way that matches the year and the destination. A for loop can do this, but it is very time-consuming since the database is rather huge.
Any suggestions?
You don't provide a reproducible example, so I can't test this, but you should be able to use ave:
incumbents <- ave(id, destination-year, FUN=function(x) length(unique(x)))
Just "merge" the tapply summary back in with the original data frame with merge.
Since you didn't provide example data, I made some. Modify accordingly.
n = 1000
id = sample(1:10, n, replace=T)
year = sample(2000:2011, n, replace=T)
destination = sample(LETTERS[1:6], n, replace=T)
`destination-year` = paste(destination, year, sep='-')
dat = data.frame(id, year, destination, `destination-year`)
Now tabulate your summaries. Note how I reformatted to a data frame and made the names match the original data.
incumbents = tapply(id, `destination-year`, function(x) length(unique(x)))
incumbents = data.frame(`destination-year`=names(incumbents), incumbents)
Finally, merge back in with the original data:
merge(dat, incumbents)
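As an aside, the ave() suggestion from the first answer can be checked against this same example data; it returns a vector as long as id, so it can be assigned directly as a new column (a sketch, untested by the original answerer):
# number of unique firms for each row's destination-year combination
dat$incumbents <- ave(id, `destination-year`, FUN = function(x) length(unique(x)))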
By the way, instead of combining destination and year into a third variable, as it seems you've done, tapply can handle both variables directly as a list:
library(reshape2)  # melt() comes from reshape2
incumbents = melt(tapply(id, list(destination = destination, year = year), function(x) length(unique(x))))
Using #JohnColby's excellent example data, I was thinking of something more along the lines of this:
#I prefer not to deal with the pesky '-' in a variable name
destinationYear = paste(destination, year, sep='-')
dat = data.frame(id, year, destination, destinationYear)
require(plyr)  # needed for ddply()
dat <- ddply(dat, .(destinationYear), transform, newCol = length(unique(id)))
#Or if more speed is required, use data.table
require(data.table)
datTable <- data.table(dat)
datTable <- datTable[, transform(.SD, newCol = length(unique(id))), by = destinationYear]
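A more current data.table idiom for the same computation, for later readers (a sketch; uniqueN() counts distinct values and := adds the column by reference):
library(data.table)
datTable <- data.table(dat)
datTable[, newCol := uniqueN(id), by = destinationYear]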
