I have a column in my dataset that contains a range of numeric codes. Three of the codes have a specific label, while all the others share one general label. Going through the dataset one by one is not an option: it is very large, with about 167K observations.
Below are all the unique values in the column:
> unique(NYC_2019_Arrests$JURISDICTION_CODE)
Levels: 0 1 2 3 4 6 7 9 11 12 13 14 15 16 69 71 72 73 74 76 79 85 87 88 97
The levels of JURISDICTION_CODE are defined as follows:
JURISDICTION_CODE - Jurisdiction responsible for arrest. Jurisdiction codes 0(Patrol), 1(Transit) and 2(Housing) represent NYPD whilst codes 3 and more represent non NYPD jurisdictions.
This is the code I tried, but it just returns an error:
> NYC_2019_Arrests$JURISDICTION_CODE <- factor(NYC_2019_Arrests$JURISDICTION_CODE, levels = c(0,1,2, 3:100), labels = c("Patrol", "Transit", "Housing", "Non-NYPD Jurisdiction"))
Error in factor(NYC_2019_Arrests$JURISDICTION_CODE, levels = c(0, 1, 2, :
invalid 'labels'; length 4 should be 1 or 101
I also tried the above code with the 3:100 taken out and the label left in, but that did not work either.
I would greatly appreciate it if anybody here knows how to give all values of 3 and above the generic label without typing out all of the numbers individually.
Thanks!
The error message gives some direction. The problem is that the labels vector has length 4 but your levels vector has length 101. I think you are almost there with the original code. Just extend the labels to the correct length with:
reps<-rep("Non-NYPD Jurisdiction",98)
NYC_2019_Arrests$JURISDICTION_CODE <- factor(NYC_2019_Arrests$JURISDICTION_CODE, levels = c(0,1,2, 3:100), labels = c("Patrol", "Transit", "Housing", reps))
Edit with explanation:
Run this code for additional explanation.
# The key is that labels needs the same length as levels
# length of levels
levels <- c(0,1,2, 3:100)
print(length(levels))
# length of the original labels
labels <- c("Patrol", "Transit", "Housing", "Non-NYPD Jurisdiction")
print(length(labels))
# This is problematic: only levels 0 to 3 have a label. For level 4, labels[5] would be NA.
# Therefore we need to repeat "Non-NYPD Jurisdiction" for each remaining level.
# Since length(3:100) is 98, we need 98 repetitions.
reps<-rep("Non-NYPD Jurisdiction",98)
labels <- c("Patrol", "Transit", "Housing", reps)
print(length(labels))
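To see the effect on actual data, here is a minimal sketch on a made-up factor (the vector `codes` and the names `lv`, `lb` and `relabelled` are illustrative and not from the original post). It computes the number of repetitions from the levels instead of hard-coding 98. As far as I know, repeated labels are merged into a single level by factor() in R 3.4.0 and later.

```r
# Toy stand-in for NYC_2019_Arrests$JURISDICTION_CODE (assumed data)
codes <- factor(c(0, 1, 2, 3, 14, 97, 2, 0))

lv <- 0:100
# Three specific labels, then the generic label for every remaining level
lb <- c("Patrol", "Transit", "Housing",
        rep("Non-NYPD Jurisdiction", length(lv) - 3))

# Duplicate labels collapse into one level
relabelled <- factor(codes, levels = lv, labels = lb)
print(table(relabelled))
```

The result has only four levels, so tables and plots show one combined "Non-NYPD Jurisdiction" category.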
There are several ways to solve this. The simplest and best way I can think of is to use case_when from dplyr.
Here is an example:
library(dplyr)
case_when(mtcars$carb == 1 ~ "One",
mtcars$carb == 2 ~ "Two",
mtcars$carb >= 3 ~ "Three or More")
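Applied to the jurisdiction codes, the same pattern might look like the sketch below. It uses a made-up factor `codes` in place of the real column, and `code_num` and `jurisdiction` are illustrative names. Because the column is shown with "Levels:", it appears to be a factor, so the sketch converts it to numbers through as.character() before comparing.

```r
library(dplyr)

# Toy stand-in for the JURISDICTION_CODE column (assumed data)
codes <- factor(c(0, 1, 2, 3, 14, 97))

# A factor stores integer positions, so go through as.character() to get the codes
code_num <- as.numeric(as.character(codes))

jurisdiction <- case_when(
  code_num == 0 ~ "Patrol",
  code_num == 1 ~ "Transit",
  code_num == 2 ~ "Housing",
  code_num >= 3 ~ "Non-NYPD Jurisdiction"
)

# Optional: turn the result back into a factor with a fixed level order
jurisdiction <- factor(jurisdiction,
                       levels = c("Patrol", "Transit", "Housing",
                                  "Non-NYPD Jurisdiction"))
print(table(jurisdiction))
```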
I have scraped some data from a URL to analyse cycling results. Unfortunately the name column contains both the rider's name and the team name in one field. I would like to separate the two. Here's the code (the last part doesn't work):
#get url
stradebianchi_2020 <- read_html("https://www.procyclingstats.com/race/strade-bianche/2020/result")
#scrape table
results_2020 <- stradebianchi_2020%>%
html_nodes("td")%>%
html_text()
#transpose scraped data into dataframe
results_stradebianchi_2020 <- as.data.frame(t(matrix(results_2020, 8, byrow = F)))
#rename
names(results_stradebianchi_2020) <- c("rank", "#", "name", "age", "team", "UCI point", "PCS points", "time")
#split rider from team
separate(data = results_stradebianchi_2020, col = name, into = c("left", "right"), sep = " ")
I think the best option is to get the team variable name and use that name to remove it from the 'name' column.
All suggestions are welcome!
I think your request is wrongly formulated. You want to remove team from name.
That's how you should do it in my opinion:
results_stradebianchi_2020 %>%
mutate(name = stringr::str_remove(name, team))
Write this instead of your line with separate.
In this case separate is not an optimal solution for you because the separation character is not clearly defined.
Also, I would advise you to remove the initial blanks from name with stringr::str_trim(name)
You could do this in base R with gsub and replace in the name column the pattern of team column with "", i.e. nothing. We use apply() with MARGIN=1 to go through the data frame row by row. Finally we use trimws to clean from whitespace (where we change to whitespace="[\\h\\v]" for better matching the spaces).
res <- transform(results_stradebianchi_2020,
name=trimws(apply(results_stradebianchi_2020, 1, function(x)
gsub(x["team"], "", x["name"])), whitespace="[\\h\\v]"))
head(res)
# rank X. name age team UCI.point PCS.points time
# 1 1 201 van Aert Wout 25 Team Jumbo-Visma 300 200 4:58:564:58:56
# 2 2 234 Formolo Davide 27 UAE-Team Emirates 250 150 0:300:30
# 3 3 87 Schachmann Maximilian 26 BORA - hansgrohe 215 120 0:320:32
# 4 4 111 Bettiol Alberto 26 EF Pro Cycling 175 100 1:311:31
# 5 5 44 Fuglsang Jakob 35 Astana Pro Team 120 90 2:552:55
# 6 6 7 Štybar Zdenek 34 Deceuninck - Quick Step 115 80 3:593:59
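One caveat: gsub and str_remove treat the team name as a regular expression by default, so a team name containing characters such as parentheses or dots might not match literally. Below is a minimal base R sketch of literal matching with fixed = TRUE. The data frame `results` is made up for illustration (the third rider and team are hypothetical), and it assumes the team name is simply appended to the rider's name.

```r
# Toy data (assumed layout: rider name immediately followed by team name)
results <- data.frame(
  name = c(" van Aert WoutTeam Jumbo-Visma",
           " Formolo DavideUAE-Team Emirates",
           " Rider ExampleTeam (Dev)"),   # hypothetical row with regex characters
  team = c("Team Jumbo-Visma", "UAE-Team Emirates", "Team (Dev)"),
  stringsAsFactors = FALSE
)

# Remove each row's team from that row's name, matching the team literally
results$name <- trimws(mapply(
  function(pattern, text) sub(pattern, "", text, fixed = TRUE),
  results$team, results$name, USE.NAMES = FALSE
))
print(results)
```

With stringr, wrapping the pattern in stringr::fixed() should give the same literal matching.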
I am trying to create a loop to use compare_means (ggpubr library in R) across all columns in a dataframe and then select only significant p.adjusted values, but it does not work well.
Here is some code
head(df3)
sampleID Actio Beta Gammes Traw Cluster2
gut10 10 2.2 55 13 HIGH
gut12 20 44 67 12 HIGH
gut34 5.5 3 89 33 LOW
gut26 4 45 23 4 LOW
library(ggpubr)
data<-list()
for (i in 2:length(df3)){
data<-compare_means(df3[[i]] ~ Cluster2, data=df3, paired = FALSE,p.adjust.method="bonferroni",method = "wilcox.test")
}
Error: `df3[i]` must evaluate to column positions or names, not a list
I would like to collect all the information from the compare_means output into a single data frame.
Thanks a lot
Try this:
library(ggpubr)
data<-list()
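for (i in 2:(length(df3)-1)){
# select by name: c(i, "Cluster2") would coerce i to the string "2"
new<-df3[,c(names(df3)[i],"Cluster2")]
colnames(new)<-c("interest","Cluster2")
# store each result under its column name instead of overwriting data
data[[names(df3)[i]]]<-compare_means(interest ~ Cluster2, data=new, paired = FALSE,p.adjust.method="bonferroni",method = "wilcox.test")
}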
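To get one data frame and keep only the significant adjusted p-values, the per-column results have to be collected and then filtered. The sketch below shows that pattern in base R with wilcox.test and p.adjust rather than compare_means, so it is not the ggpubr output format. The data frame is made up (10 samples per cluster), and `value_cols`, `p_raw`, `results` and `significant` are illustrative names.

```r
# Toy data (assumed): Actio differs clearly between clusters, Beta does not
df3 <- data.frame(
  sampleID = paste0("gut", 1:20),
  Actio    = c(11:20, 1:10),
  Beta     = c(seq(1, 19, by = 2), seq(2, 20, by = 2)),
  Cluster2 = rep(c("HIGH", "LOW"), each = 10)
)

# Every column except the ID and the grouping column
value_cols <- setdiff(names(df3), c("sampleID", "Cluster2"))

# One unpaired Wilcoxon test per column, keeping the raw p-value
p_raw <- vapply(value_cols, function(v) {
  wilcox.test(df3[[v]][df3$Cluster2 == "HIGH"],
              df3[[v]][df3$Cluster2 == "LOW"])$p.value
}, numeric(1))

# Bonferroni adjustment across all tested columns, then filter
results <- data.frame(variable = value_cols,
                      p = p_raw,
                      p.adj = p.adjust(p_raw, method = "bonferroni"),
                      row.names = NULL)
significant <- results[results$p.adj < 0.05, ]
print(significant)
```

Here the adjustment is made across columns. This is a design choice of the sketch and may differ from the p.adj that compare_means reports when it is called on one column at a time.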
I have a table with multiple columns and rows. I want to access each value by its column name and row name, and make a plot with these values.
The table looks like this with 101 columns:
IDs Exam1 Exam2 Exam3 Exam4 .... Exam100
Ellie 12 48 33 64
Kate 98 34 21 76
Joe 22 53 49 72
Van 77 40 12
Xavier 88 92
What I want is to be able to look up the mark for a given row (IDs) and a given column (exam), like this:
table[Ellie,Exam3] --> 48
table[Ellie,Exam100] --> 64
table[Ellie,Exam2] --> (empty)
Then, with these numbers, I want to see the distribution of Ellie's marks in Exam2, Exam3 and Exam100 compared with her marks in the remaining exams.
I have almost figured out this part with R:
library(data.table)
library(ggplot2)
pdf("distirbution_given_row.pdf")
selectedvalues <- c(table[Ellie,Exam3] ,table[Ellie,Exam100])
library(plyr)
cdat <- ddply(selected values, "IDs", summarise, exams.mean=mean(exams))
selectedvaluesggplot <- ggplot(selectedvalues, aes(x=IDs, colour=exams)) + geom_density() + geom_vline(data=cdat, aes(xintercept=exams.mean, colour=IDs), linetype="dashed", size=1)
dev.off()
This should plot Ellie's marks for the exams of interest against the rest of her marks (a blank should not be treated as zero; it is still a blank).
Red: Marks for Exam3, 100 and 2 , Blue: The marks for the remaining 97 exams
(The code and the plot are taken as an example of ggplot2 from this link.)
All ideas are appreciated!
For accessing your data, at least, you can do the following:
library(dplyr)
df=data.frame(IDs=c("Ellie","Kate","Joe","Van","Xavier"),Exam1=c(12,98,22,77,NA),Exam2=c(NA,34,53,NA,NA),
Exam3=c(48,21,49,40,NA),Exam4=c(33,76,NA,12,88))
row.names(df)=df$IDs
df=df%>%select(-IDs)
> df['Joe','Exam2']
[1] 53
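The same lookup also works for blank cells if they are stored as NA. Here is a minimal base R sketch on a made-up three-exam version of the table (`marks` and the `ellie_*` names are illustrative), showing that a blank stays NA and can be skipped with na.rm = TRUE instead of being counted as zero.

```r
# Toy table (assumed data); NA marks a blank, not a zero
marks <- data.frame(
  Exam1 = c(12, 98, 22, 77, NA),
  Exam2 = c(NA, 34, 53, NA, NA),
  Exam3 = c(48, 21, 49, 40, NA),
  row.names = c("Ellie", "Kate", "Joe", "Van", "Xavier")
)

# Look up single cells by row name and column name
ellie_exam3 <- marks["Ellie", "Exam3"]
ellie_exam2 <- marks["Ellie", "Exam2"]   # blank cell, returned as NA

# Mean of Ellie's marks, ignoring blanks instead of counting them as 0
ellie_mean <- mean(unlist(marks["Ellie", ]), na.rm = TRUE)
print(c(exam3 = ellie_exam3, exam2 = ellie_exam2, mean = ellie_mean))
```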
Now I prepared an example with random created numbers to illustrate a bit what you could do. First let us create an example data frame
df=as.data.frame(matrix(rnorm(505,50,10),ncol=101))
colnames(df)=c("IDs",paste0("Exam",as.character(1:100)))
df$IDs=c("Ellie","Kate","Joe","Van","Xavier")
To work with ggplot2 it is recommended to convert the data to long format:
library(tidyr)
library(ggplot2)
df0=df%>%gather(key="exams",value="score",-IDs)
From here on you can play with your variables as desired. For instance plotting the density of the score per ID:
ggplot(df0, aes(x=score,col=IDs)) + geom_density()
or selecting only Exams 2,3,100 and plotting density for different exams
df0=df0%>%filter(exams=="Exam2"|exams=="Exam3"|exams=="Exam100")
ggplot(df0, aes(x=score,col=exams)) + geom_density()
IIUC, you want to plot each ID's selected exams against all of that ID's other exams. Consider the following steps:
Reshape your data to long format, replacing NAs with zero if needed.
Run by() to subset the data by IDs, then build the aggregated mean data and the ggplots.
Within by(), create a SelectValues indicator column for the selected exams, then graph with a vertical line at each group mean.
Data
txt = 'IDs Exam1 Exam2 Exam3 Exam4 Exam100
Ellie 12 NA 48 33 64
Kate 98 34 21 76 NA
Joe 22 53 49 NA 72
Van 77 NA 40 12 NA
Xavier NA NA NA 88 92'
exams_df <- read.table(text=txt, header = TRUE)
# ADD OTHER EXAM COLUMNS (SEEDED FOR REPRODUCIBILITY)
set.seed(444)
# 5:99, not seq(5:99): seq(5:99) returns 1:95 and would overwrite Exam1 to Exam4
exams_df[paste0("Exam", 5:99)] <- replicate(99-4, sample(100, 5))
Reshape and Graph
library(ggplot2) # ONLY PACKAGE NEEDED
# FILL NA
exams_df[is.na(exams_df)] <- 0
# RESHAPE (BASE R VERSION)
exams_long_df <- reshape(exams_df,
timevar = "Exam",
times = names(exams_df)[grep("Exam", names(exams_df))],
v.names = "Score",
varying = names(exams_df)[grep("Exam", names(exams_df))],
new.row.names = 1:1000,
direction = "long")
# GRAPH BY EACH ID
by(exams_long_df, exams_long_df$IDs, FUN=function(df) {
df$SelectValues <- ifelse(df$Exam %in% c("Exam1", "Exam3", "Exam100"), "Select Exams", "All Else")
cdat <- aggregate(Score ~ SelectValues, df, FUN=mean)
ggplot(df, aes(Score, colour=SelectValues)) +
geom_density() + xlim(-50, 120) +
labs(title=paste(df$IDs[[1]], "Density Plot of Scores"), x ="Exam Score", y = "Density") +
geom_vline(data=cdat, aes(xintercept=Score, colour=SelectValues), linetype="dashed", size=1)
})
Output
I am trying to create Rank Abundance Curves for tree communities that are grouped according to age. I have 15 sampling sites in 8 seral stages (15(×8)) (N=120). If I had to factor my data it would look something like this:
Age <- factor(c(rep(1,15), rep(2,15), rep(3,15), rep(4,15), rep(5,15), rep(6,15), rep(7,15), rep(8,15)), labels =c("12","15","19","23","27","31","35","38"))
Now, when I use the function 'rankabuncomp' (BiodiversityR package) to create RAC curves for each seral stage, I get an error message:
Error in diversitycomp(x, y, factor, index = "richness") :
specified factor1 'Age' is not a factor
In addition: Warning message:
In if ((method %in% METHOD) == F) { :
the condition has length > 1 and only the first element will be used
This is the code I used:
Trees_2015 <- read.csv(file = "Trees_2015.csv")
Trees<- Trees_2015[,-1]
Reg_age <- read.csv(file = "Age.csv")
RAC_trees <- rankabundance(Trees,y=Reg_age, factor = "Age", level = c("12","15","19","23","27","31","35","38"))
RAC_trees
rankabunplot(RAC_trees,scale='abundance', addit=FALSE, specnames=c(1,2,3))
rankabuncomp(Trees, y=Reg_age, factor='Age',
scale='proportion', legend=TRUE)
Why is R producing this error? How can I rectify it?
The 'Reg_age' data frame (120 obvs. of 1 variable) looks something like this:
Age
1 12
2 12
16 15
20 15
120 38
The 'Trees' data frame has 120 obs. of 75 variables (i.e. 75 different species)
Thanks
Elena
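One possible cause, offered as an assumption because Age.csv itself is not shown: read.csv() reads a column of numbers such as 12, 15, ... as integer rather than as a factor, which would match the "is not a factor" message. A minimal sketch with a made-up stand-in for Reg_age (`ages` and `was_factor` are illustrative names):

```r
# Stand-in for Reg_age as read.csv() would likely return it (assumed data)
ages <- c(12, 15, 19, 23, 27, 31, 35, 38)
Reg_age <- data.frame(Age = rep(ages, each = 15))

# A numeric column is not a factor
was_factor <- is.factor(Reg_age$Age)
print(was_factor)

# Convert the column to a factor before passing the data frame on
Reg_age$Age <- factor(Reg_age$Age)
str(Reg_age)
```

After the conversion, the same rankabuncomp call could be retried. Whether this also removes the additional warning is not something this sketch checks.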
I have a dataframe that consists of four variables: asn (an id), ip_used, domain_used and correct (binary, 0 or 1). Data example:
asn, ip_used,domain_used,correct
1,234,34,1
30,45,765,1
498,4,765,0
3874,876,8765,1
I have plotted ip_used and domain_used against each other for each asn in a bubble plot and now I want to specify bubbles of the entries that are equal to 1 for "correct" with a different bubble color.
Here is my current plot and my current code:
symbols(log_domused,log_ipused, circles = radius,inches=0.40, fg="black", bg="white",xlab = "# used domain",ylab="# used ips",main="dnsdb distribution of domains per ips for each ASN")
Does anybody have any idea how to do that?
Your data:
myData <- rbind(c(1,234,34,1), c(30,45,765,1), c(498,4,765,0), c(3874,876,8765,1))
colnames(myData) <- c("asn", "ip_used", "domain_used", "correct")
myData
asn ip_used domain_used correct
[1,] 1 234 34 1
[2,] 30 45 765 1
[3,] 498 4 765 0
[4,] 3874 876 8765 1
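You can specify the color of each circle with "fg" (outline) or "bg" (fill). Here domain_used goes on the x-axis and ip_used on the y-axis, to match the axis labels:
symbols(myData[,"domain_used"], myData[,"ip_used"], circles=c(1,1,1,1), inches=0.40, fg=myData[,"correct"]+1, bg="white",
xlab = "# used domain",ylab="# used ips",
main="dnsdb distribution of domains per ips for each ASN"
)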
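If the fill of the bubble should change rather than its outline, a colour vector can be passed to bg instead. A minimal sketch follows, with several assumptions: the data are rebuilt as a data frame, the log10 scaling stands in for the log_domused and log_ipused variables from the question, all circles have the same radius because the original radius vector is not shown, and the colour "tomato" and the name `bubble_col` are arbitrary choices.

```r
# The example rows from the question, as a data frame
myData <- data.frame(asn         = c(1, 30, 498, 3874),
                     ip_used     = c(234, 45, 4, 876),
                     domain_used = c(34, 765, 765, 8765),
                     correct     = c(1, 1, 0, 1))

# One fill colour per row: highlighted when correct == 1, white otherwise
bubble_col <- ifelse(myData$correct == 1, "tomato", "white")

symbols(log10(myData$domain_used), log10(myData$ip_used),
        circles = rep(1, nrow(myData)),   # assumed constant radius
        inches = 0.40, fg = "black", bg = bubble_col,
        xlab = "# used domain", ylab = "# used ips",
        main = "dnsdb distribution of domains per ips for each ASN")
```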