pivot_longer with identical column names - R

My data looks like this:
Nr. Type 1 Type 2 Type 1 Type 2
1 400 600 100 800
2 500 400 900 300
3 200 200 400 700
4 300 600 800 300
and I want to create boxplots of Type 1 and Type 2. pivot_longer() gives me Type 1 and Type 1.1, which is not what I need. Maybe someone can help me.

It turns out your issue was not with pivot_longer() but with subsetting your original data.frame using [. There is no way to prevent [ (or base::subset()) from making the column names of its output unique, so you need to subset your data some other way to avoid losing the duplicated column names. This is discussed in this question, so borrowing from one of the answers, you can use:
library(tidyverse)

# data with an extra column to be removed
d <- structure(list(Nr. = 1:4, `Type 1` = c(400L, 500L, 200L, 300L), x = 1:4,
                    `Type 2` = c(600L, 400L, 200L, 600L),
                    `Type 1` = c(100L, 900L, 400L, 800L),
                    `Type 2` = c(800L, 300L, 700L, 300L)),
               row.names = c(NA, -4L), class = "data.frame")

# remove the extra column without changing names, then pivot and plot
data.frame(as.list(d)[-3], check.names = FALSE) %>%
  pivot_longer(-Nr.) %>%
  ggplot(aes(name, value)) +
  geom_boxplot()
Created on 2022-02-22 by the reprex package (v2.0.1)
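To see why the as.list() detour is needed: when columns are selected with [, base R's [.data.frame runs make.unique() on duplicated column names, while subsetting the underlying list leaves them alone. A quick check on d from above:
names(d[-3])
#> [1] "Nr."      "Type 1"   "Type 2"   "Type 1.1" "Type 2.1"
names(as.list(d)[-3])
#> [1] "Nr."    "Type 1" "Type 2" "Type 1" "Type 2"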


filter rows based on all previous row data in another column

I have a data table that I would like to filter based on multiple conditions, looking at all previous rows. If a value appears in the New_ID.1 column before it appears in the New_ID column, remove the row where New_ID equals that earlier New_ID.1 value. For example, I would remove New_ID 581 in row 3 because 581 appears as New_ID.1 in row 1. However, I don't want to remove row 6 (New_ID 551), since row 3 (where 551 appears as New_ID.1) has already been removed. Essentially, I think I need to loop through the rows, create a new filtered table at each step, and repeat the process?
orig_df<- structure(list(New_ID = c(557L, 588L, 581L, 580L, 591L, 551L,
300L, 112L), New_ID.1 = c(581L, 591L, 551L, 300L, 112L, 584L,
416L, 115L), distance = c(3339.15537217173, 3432.33715484179,
5268.69104753613, 5296.72042763528, 5271.94917463488, 5258.66546295312,
5286.99982045171, 5277.81914818968), X.x = c(903604.940384474,
819515.728302034, 903663.550206032, 866828.860223065, 819525.350044447,
903720.790105847, 866881.654186025, 819585.173276271), Y.x = c(1027706.41509243,
1026880.34660449, 1024367.77412815, 1023962.99139374, 1023448.02293581,
1019099.39402149, 1018666.53407908, 1018176.41319296), X.y = c(903663.550206032,
819525.350044447, 903720.790105847, 866881.654186025, 819585.173276271,
903801.327345876, 866919.184271939, 819630.672367509), Y.y = c(1024367.77412815,
1023448.02293581, 1019099.39402149, 1018666.53407908, 1018176.41319296,
1013841.34531459, 1013379.66746509, 1012898.79016799), Y_filter = c(3338.64096427278,
3432.32366867992, 5268.38010666054, 5296.45731465891, 5271.60974284587,
5258.04870690871, 5286.86661398865, 5277.62302497006), X_filter = c(58.609821557533,
9.62174241337925, 57.2398998149438, 52.7939629601315, 59.8232318238588,
80.5372400298947, 37.5300859131385, 45.4990912381327), row.number = 1:8), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
The end result would retain rows 1, 2, 4, 6 and 8 from the original data:
output_table<-structure(list(New_ID = c(557L, 588L, 580L, 551L, 112L), New_ID.1 = c(581L,
591L, 300L, 584L, 115L), distance = c(3339.15537217173, 3432.33715484179,
5296.72042763528, 5258.66546295312, 5277.81914818968), X.x = c(903604.940384474,
819515.728302034, 866828.860223065, 903720.790105847, 819585.173276271
), Y.x = c(1027706.41509243, 1026880.34660449, 1023962.99139374,
1019099.39402149, 1018176.41319296), X.y = c(903663.550206032,
819525.350044447, 866881.654186025, 903801.327345876, 819630.672367509
), Y.y = c(1024367.77412815, 1023448.02293581, 1018666.53407908,
1013841.34531459, 1012898.79016799), Y_filter = c(3338.64096427278,
3432.32366867992, 5296.45731465891, 5258.04870690871, 5277.62302497006
), X_filter = c(58.609821557533, 9.62174241337925, 52.7939629601315,
80.5372400298947, 45.4990912381327), row.number = c(1L, 2L, 4L,
6L, 8L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl",
"data.frame"))
Below is a simpler version of the problem that might help.
Original data:
A|B
C|D
B|E
E|F
Updated data table
A|B
C|D
E|F
I think looping through the rows and saving the IDs that you have already encountered should be enough:
orig_df <- as.data.frame(orig_df)
included_rows <- rep(FALSE, nrow(orig_df))
seen_ids <- c()
for (i in 1:nrow(orig_df)) {
  # Skip the row if we have already seen either ID
  if (orig_df[i, 'New_ID'] %in% seen_ids) next
  if (orig_df[i, 'New_ID.1'] %in% seen_ids) next
  # If both IDs are new, record them as seen and include the row
  seen_ids <- c(seen_ids, orig_df[i, 'New_ID'], orig_df[i, 'New_ID.1'])
  included_rows[i] <- TRUE
}
filtered_df <- orig_df[included_rows, ]
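Running this on orig_df keeps rows 1, 2, 4, 6 and 8, matching the expected output_table:
filtered_df$row.number
#> [1] 1 2 4 6 8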

Normalizing values depending on group in R [duplicate]

This question already has answers here:
Normalize by Group
(2 answers)
Closed 2 years ago.
I have this dataset:
> head(meltCalcium)
Time Cell Intensity
1 1 IntDen1 306852.5
2 2 IntDen1 302892.2
3 3 IntDen1 298258.6
4 4 IntDen1 300769.9
5 5 IntDen1 301971.8
6 6 IntDen1 302585.6
> tail(meltCalcium)
Time Cell Intensity
32531 659 IntDen49 47788.16
32532 660 IntDen49 47560.32
32533 661 IntDen49 47738.24
32534 662 IntDen49 48968.96
32535 663 IntDen49 48796.16
32536 664 IntDen49 48156.80
I have 49 Cells, and Time reaches 664 for each of them. In this case time is not important: I'd like to get the normalized Intensity for each cell (so (Intensity - min)/(max - min)), and possibly add it as a new column to the dataframe.
I tried
> meltCalcium$normalized <- with(meltCalcium, (Intensity - min(Intensity))/diff(range(Intensity)))
but in this way the max and the min are calculated using the Intensity over all Cells. How can I do it for each cell separately?
Thanks!
Apply the formula by group:
library(dplyr)

result <- meltCalcium %>%
  group_by(Cell) %>%
  mutate(normalized = (Intensity - min(Intensity)) / diff(range(Intensity)))
Base R solution (assigned as a new column, per the question):
normalise_vec_min_max <- function(num_vec) {
  minnv <- min(num_vec, na.rm = TRUE)
  maxnv <- max(num_vec, na.rm = TRUE)
  (num_vec - minnv) / (maxnv - minnv)
}
meltCalcium$normalized <- with(meltCalcium, ave(Intensity, Cell, FUN = normalise_vec_min_max))
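A quick sanity check for either approach, using the dplyr result from above: within each Cell the normalized values should span exactly 0 to 1.
result %>%
  group_by(Cell) %>%
  summarise(min_norm = min(normalized), max_norm = max(normalized))
#> every row should show min_norm = 0 and max_norm = 1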
Data:
meltCalcium <- structure(list(Time = c(1L, 2L, 3L, 4L, 5L, 6L, 659L, 660L, 661L,
662L, 663L, 664L), Cell = c("IntDen1", "IntDen1", "IntDen1",
"IntDen1", "IntDen1", "IntDen1", "IntDen49", "IntDen49", "IntDen49",
"IntDen49", "IntDen49", "IntDen49"), Intensity = c(306852.5,
302892.2, 298258.6, 300769.9, 301971.8, 302585.6, 47788.16, 47560.32,
47738.24, 48968.96, 48796.16, 48156.8)), row.names = c(NA, -12L
), class = "data.frame")

Finding occupancy rate

I am looking at a dataset with companies and their prices for several weeks. If a value is blank/empty, it is because the house is booked and therefore no price is available.
I have this code, which works, but I want to handle all companies and weeks at once if possible, and then make the result part of the data.
sum(D1$Company=='dc' & D1$`Price week 24`== " ") / sum(D1$Company=='dc' & D1$`Price week 24`!="-10")
Here I take the number of houses of one company that are booked (no price, therefore blank/empty) and divide by the total, excluding values of -10.
My data could look like this (sorry for the bad rendering, but I cannot paste a screenshot). I have several more weeks and several more companies.
I would like a new column named "Occupancy week 24" that contains the value for the company in each row.
EDIT: the data
# dput(DF1)
structure(list(Company = 1:6, Price_week_24 = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "ns", class = "factor"), Price_week_25 = c(1639L,
860L, NA, NA, 399L, 645L), Price_week_26 = c(NA, 860L, NA, NA,
399L, NA), Price_week_27 = c(NA, 1010L, 1010L, 699L, 399L, 1010L
), Price_week_28 = c(NA, 1399L, NA, 1129L, 640L, 1399L)), class = "data.frame", row.names = c(NA,
-6L))
df$occupancy_rate <- apply(df[,2:6], 1,function(x) sum(x>0, na.rm = TRUE)/length(x))
This solves many problems but not all of them. I want a value for every single company, not one total for them all.
I am looking forward to getting some help.
Thank you.
Best Regards
Here is how the data was created to share the example. I have included one example solution using base R:
# Create a reprex
df <- read.table(text =
  "1 ns 1639 ' '  ' '  ' '
   2 ns 860  860  1010 1399
   3 ns ' '  ' '  1010 ' '
   4 ns ' '  ' '  699  1129
   5 ns 399  399  399  640
   6 ns 645  ' '  1010 1399")
names(df) <- c("rows", "Company", paste0("Price_week_", 24:27))

# to share the data
dput(df)

# Using base R: the price columns are 3:6 here (column 1 is `rows`, column 2 is `Company`);
# turn the ' ' blanks into numeric NAs first, then compute the share of weeks
# with a price per house (1 minus this is the share of booked weeks)
price_cols <- 3:6
df[price_cols] <- lapply(df[price_cols], function(x) suppressWarnings(as.numeric(x)))
df$occupancy_rate <- apply(df[price_cols], 1,
                           function(x) sum(x > 0, na.rm = TRUE) / length(x))
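If the real data has one row per house, with a Company column and one price column per week (as in the code in the question), a grouped summary gives an occupancy value for every company and every week at once. A sketch, assuming the blanks were read in as NA; the D1 name and the `Price week ...` columns come from the question:
library(dplyr)

occupancy <- D1 %>%
  group_by(Company) %>%
  summarise(across(starts_with("Price week"),
                   ~ sum(is.na(.x)) / n(),     # booked houses / all houses
                   .names = "Occupancy {.col}"))
This yields one row per company with columns such as `Occupancy Price week 24`, which can then be joined back onto the original data if needed.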

Data Wrangling Using Dplyr

Using dplyr, I am trying to find which country has the largest increase in wealth between 2002 and 2006 in the following data.
Country wealth_2002 wealth_2004 wealth_2006
Country_A 1000 1600 2200
Country_B 1200 1300 1800
Country_C 1400 1100 1200
Country_D 1500 1000 1100
Country_E 1100 1800 1900
To get the country's name, I have used
largest_increase <- df %>%
  group_by(Country) %>%
  filter(max(wealth_2006 - wealth_2002)) %>%
And this gives me
Error in filter_impl(.data, quo) :
Argument 2 filter condition does not evaluate to a logical vector
I would be really grateful if someone can help me what I am doing wrong and how I can fix this. I am very new to R so any help would be appreciated.
Using base R, you can use which.max() to index your Country column:
# This is my dummy data, you can ignore it
country <- c("Sweden", "Finland")
X1 <- c(1050, 1067)
X2 <- c(1045, 1069)
DF <- data.frame(country, X1, X2)
# Modify this to suit
DF$country[which.max(DF$X2- DF$X1)]
So for yours it would be:
df$Country[which.max(df$wealth_2006 - df$wealth_2002)]
Look at how filter() works: you need to provide a logical "test" for each row, and if a row passes, it is kept. Also, there is no real need to group_by(Country), since each country is already its own row. Try something like this, where you calculate and store the wealth change for each country, then keep the country (or countries) with the maximum value:
library(dplyr)
df <- read.table(
text = "
Country wealth_2002 wealth_2004 wealth_2006
Country_A 1000 1600 2200
Country_B 1200 1300 1800
Country_C 1400 1100 1200
Country_D 1500 1000 1100
Country_E 1100 1800 1900
", header = TRUE, stringsAsFactors = FALSE
)
df %>%
  mutate(wealth_change = wealth_2006 - wealth_2002) %>%
  filter(wealth_change == max(wealth_change)) %>%
  pull(Country) # gives us the Country column
Output:
[1] "Country_A"
Use dput(data) in your question to make it easier for answerers to help.
structure(list(Country = structure(1:5, .Label = c("Country_A",
"Country_B", "Country_C", "Country_D", "Country_E"), class = "factor"),
wealth_2002 = c(1000L, 1200L, 1400L, 1500L, 1100L), wealth_2004 = c(1600L,
1300L, 1100L, 1000L, 1800L), wealth_2006 = c(2200L, 1800L,
1200L, 1100L, 1900L)), .Names = c("Country", "wealth_2002",
"wealth_2004", "wealth_2006"), class = "data.frame", row.names = c(NA,
-5L))
library(dplyr)

data %>%
  mutate(delta = wealth_2006 - wealth_2002) %>% # create a new variable called delta with mutate
  arrange(desc(delta)) %>%                      # sort descending by delta
  head(1)                                       # return the top row; pull out the specific value if needed
This will return the top row, i.e. the one with the greatest change:
Country_A has a change of 1200.
You can also use top_n():
library(dplyr)
df %>% top_n(1, wealth_2006 - wealth_2002)
# Country wealth_2002 wealth_2004 wealth_2006
# 1 Country_A 1000 1600 2200
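In current dplyr (version 1.0 and later), slice_max() supersedes top_n(); a sketch of the equivalent call:
df %>% slice_max(wealth_2006 - wealth_2002, n = 1)
#     Country wealth_2002 wealth_2004 wealth_2006
# 1 Country_A        1000        1600        2200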

How do I summarize this data table with dplyr, then run a chisq.test (or similar) on the results, and loop it all into one neat function?

This question was embedded in another question I asked here, but as it goes beyond the scope of the initial inquiry, I thought it deserved a separate thread.
I've been trying to come up with a solution to this problem based on the answers I received here and here, using dplyr and the functions written by Khashaa and Jaap.
Using the solutions provided (especially Jaap's), I have been able to summarize the raw data I received into a matrix-like data table:
dput(SO_Example_v1)
structure(list(Type = structure(c(3L, 1L, 2L), .Label = c("Community",
"Contaminant", "Healthcare"), class = "factor"), hosp1_WoundAssocType = c(464L,
285L, 24L), hosp1_BloodAssocType = c(73L, 40L, 26L), hosp1_UrineAssocType = c(75L,
37L, 18L), hosp1_RespAssocType = c(137L, 77L, 2L), hosp1_CathAssocType = c(80L,
34L, 24L), hosp2_WoundAssocType = c(171L, 115L, 17L), hosp2_BloodAssocType = c(127L,
62L, 12L), hosp2_UrineAssocType = c(50L, 29L, 6L), hosp2_RespAssocType = c(135L,
142L, 6L), hosp2_CathAssocType = c(95L, 24L, 12L)), .Names = c("Type",
"hosp1_WoundAssocType", "hosp1_BloodAssocType", "hosp1_UrineAssocType",
"hosp1_RespAssocType", "hosp1_CathAssocType", "hosp2_WoundAssocType",
"hosp2_BloodAssocType", "hosp2_UrineAssocType", "hosp2_RespAssocType",
"hosp2_CathAssocType"), class = "data.frame", row.names = c(NA,
-3L))
This looks as follows:
require(dplyr)
df <- tbl_df(SO_Example_v1)
head(df)
Type hosp1_WoundAssocType hosp1_BloodAssocType hosp1_UrineAssocType
1 Healthcare 464 73 75
2 Community 285 40 37
3 Contaminant 24 26 18
Variables not shown: hosp1_RespAssocType (int), hosp1_CathAssocType (int), hosp2_WoundAssocType
(int), hosp2_BloodAssocType (int), hosp2_UrineAssocType (int), hosp2_RespAssocType (int),
hosp2_CathAssocType (int)
The column Type is the type of bacteria; the following columns represent where the bacteria were cultured, and the digits are the number of times each type was detected.
I know what my final table should look like, but until now I have been building it step by step for each comparison and variable. There must be a way to do this by piping multiple functions in dplyr, but alas, I have not found the answer on SO.
Example of what the final table should look like:
Wound
Type                            n Hospital 1 (%)   n Hospital 2 (%)   p-val
Healthcare associated bacteria  464 (60.0)         171 (56.4)         0.28
Community associated bacteria   285 (36.9)         115 (38.0)         0.74
Contaminants                    24 (3.1)           17 (5.6)           0.05
Here the first grouping variable "Wound" is subsequently replaced by "Urine", "Respiratory", ..., and there is a final "All/Total" column giving the total number of times each Type was found, summed across Hospitals 1 and 2 and compared the same way.
What I have done until now is very tedious: everything is calculated "by hand", and I then populate the table with all of the results manually.
### Wound cultures & healthcare associated (extracted manually)
# hosp1: 464 (yes), 309 (no), 773 wound isolates in total; (% = 464 / 773 * 100)
# hosp2: 171 (yes), 132 (no), 303 wound isolates in total; (% = 171 / 303 * 100)
### Then the chisq.test of my contingency table
chisq.test(cbind(c(464, 309), c(171, 132)), correct = FALSE)
I appreciate that a piped dplyr chain on the raw data.frame won't give me the exact formatting of my desired table, but there must be a way to at least automate all the steps, put the results together in a final table that I can export as a .csv file, and then just do some final column editing.
Any help is greatly appreciated.
It's ugly, but it works (Sam in the comments is right that this whole issue should probably be addressed by adjusting your data to a clean format before analysing, but anyway):
Map(
  function(x, y) {
    out <- cbind(x, y)
    # row 1 (healthcare) vs. the column sums of rows 2-3 (community + contaminant)
    final <- rbind(out[1, ], colSums(out[2:3, ]))
    chisq.test(final, correct = FALSE)
  },
  SO_Example_v1[grepl("^hosp1", names(SO_Example_v1))],
  SO_Example_v1[grepl("^hosp2", names(SO_Example_v1))]
)
#$hosp1_WoundAssocType
#
# Pearson's Chi-squared test
#
#data: final
#X-squared = 1.16, df = 1, p-value = 0.2815
# etc etc...
Matches your intended result:
chisq.test(cbind(c(464,309),c(171,132)),correct=FALSE)
#
# Pearson's Chi-squared test
#
#data: cbind(c(464, 309), c(171, 132))
#X-squared = 1.16, df = 1, p-value = 0.2815
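To collect the results into a table that can be exported as a .csv, one sketch is to keep just the p-values from the same Map() call. This mirrors the comparison above (healthcare vs. the other two types pooled); the p_table name and output file are only for illustration:
res <- Map(
  function(x, y) {
    out <- cbind(x, y)
    final <- rbind(out[1, ], colSums(out[2:3, ]))
    chisq.test(final, correct = FALSE)$p.value   # keep only the p-value
  },
  SO_Example_v1[grepl("^hosp1", names(SO_Example_v1))],
  SO_Example_v1[grepl("^hosp2", names(SO_Example_v1))]
)
# one row per culture site, named after the hosp1 columns
p_table <- data.frame(site = sub("^hosp1_", "", names(res)),
                      p_value = unlist(res), row.names = NULL)
write.csv(p_table, "chisq_pvalues.csv", row.names = FALSE)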
