Attribute labels to scored items [function] - r

I don't know whether this has already been asked, but here is my problem:
I have a dataset of behavioural personality items scored from 1 to 8, and I would like to convert each score to a label according to its range (e.g. 1-2 = Rare; 3-5 = Occasionally; 6-8 = Frequent).
I managed to create the new columns and put labels in them, but I don't understand why the same values are repeated across the other columns:
Beh_data[, c(2, 3, 4, 32, 33, 34)]
You can see that the columns ending in "_class" all have the same output, and the labels do not always match the scores (e.g. in row 4, a score of 8 is labelled Occasionally).
Here is the code:
l = unlist(names(Beh_data[,2:28]))
for (j in 1:length(l)) {
cl[j] = list(paste(l[j],"class",sep="_"))
for (k in 1:length(cl)) {
Beh_data[,cl[[k]] ] <- cl[[k]]
for(i in 1:nrow(Beh_data)){
Beh_data[,cl[[k]] ][i] <-ifelse(Beh_data[,l[j] ][i]<3, "Rare", Beh_data[,cl[[k]] ][i])
Beh_data[,cl[[k]] ][i] <-ifelse(Beh_data[,l[j] ][i]>2 & Beh_data[,l[j] ][i]<6, "Occasionally", Beh_data[,cl[[k] ] ][i])
Beh_data[,cl[[k]] ][i] <-ifelse(Beh_data[,l[j] ][i]>5, "Frequent", Beh_data[,cl[[k]] ][i])
}
}
}
I tried to see whether it could come from a wrong notation such as cl[[k]] ] or something like that, but it still doesn't work.
Do you have any ideas, please?

If you're open to a dplyr solution, I think its across and case_when functions are helpful here. It should also run faster since it's vectorized. This will create new columns like aff_sum_class which use the categorization you've specified.
library(dplyr)
Beh_data |>
  mutate(across(aff_sum:qui_sum,
                ~ case_when(. >= 6 ~ "Frequent",
                            . >= 3 ~ "Occasionally",
                            TRUE ~ "Rare"),
                .names = "{.col}_class"))
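For comparison, here is a base R sketch of the same categorisation using cut(). It also sidesteps what goes wrong in the question's loop: there the inner for (k ...) loop runs over every "_class" column again for each j, so all of them end up filled from the last score column. The data frame below is a made-up stand-in for Beh_data; only the column names aff_sum and qui_sum come from the answer above, and label_scores is a hypothetical helper name.

```r
# Made-up stand-in for Beh_data (assumption: scores are whole numbers from 1 to 8)
Beh_data <- data.frame(id = 1:4,
                       aff_sum = c(1, 4, 6, 8),
                       qui_sum = c(2, 3, 5, 7))
score_cols <- c("aff_sum", "qui_sum")

# cut() bins a whole column at once: (0,2] = Rare, (2,5] = Occasionally, (5,8] = Frequent
label_scores <- function(x) {
  as.character(cut(x,
                   breaks = c(0, 2, 5, 8),
                   labels = c("Rare", "Occasionally", "Frequent")))
}

# Each new "_class" column is computed from its own score column only
Beh_data[paste0(score_cols, "_class")] <- lapply(Beh_data[score_cols], label_scores)
print(Beh_data)
```

Because every class column is derived from exactly one score column, a label cannot be paired with the wrong score.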

Related

Subsetting a data set and plotting means

I have a data set including Year, Site, and Species Count. In some years the counts were done twice, and I am trying to write code that accounts for this. For those years I have to find the mean count at each site for each species (there are two different species), and plot those means. This is the code I have written:
DataSet1 <- subset(channel_islands,
channel_islands$SpeciesName=="Hypsypops ubicundus, adult" |
channel_islands$SpeciesName=="Paralabrax clathratus,adult")
years<-unique(DataSet1$Year)
Hypsypops_mean <- NULL
Paralabrax_mean <- NULL
Mean <- NULL
years <- unique(DataSet1$Year)
for(i in 1:length(years)){
data_year <- DataSet1[which(DataSet1$Year == years[i]), ]
Hypsypops<-data_year[which(data_year$SpeciesName=="Hypsypops rubicundus,adult"), ]
Paralabrax<-data_year[which(data_year$SpeciesName=="Paralabrax clathratus,adult"), ]
UNIQUESITE<-unique(unique(data_year$Site))
for(m in 1:(length(UNIQUESITE))){
zz<-Hypsypops[Hypsypops$Site==m,]
if(length(zz$Site)>=2){
Meanp <- mean(Hypsypops$Count[Hypsypops$Site==UNIQUESITE[m]])
Hypsypops_mean <- rbind(Hypsypops_mean,
c(UNIQUESITE[m], years[i], round(Meanp,2),
'Hypsypops rubicundus,adult'))
}
kk <- Paralabrax[Paralabrax$Site==m, ]
if(length(kk$Site)>=2){
Meane <- mean(Paralabrax$Count[Paralabrax$Site==UNIQUESITE[m]])
Paralabrax_mean <- rbind(Paralabrax_mean,
c(UNIQUESITE[m], years[i], round(Meane, 2),
'Paralabrax clathratus,adult'))
}
}
if(i==1){
Mean<-rbind(Hypsypops_mean, Paralabrax_mean)
}
if(i>1){
Mean<-rbind(DataMean, Hypsypops_mean, Paralabrax_mean)
}
Hypsypops_mean<-NULL
Paralabrax_mean<-NULL
}
Mean <- as.data.frame(Mean,stringsAsFactors=F)
names(Mean) <- c('Site','Year','mean_count','SpeciesName')
Mean$Site <- as.integer(Mean$Site)
Mean$Year <- as.integer(Mean$Year)
Mean$mean_count <- as.numeric(Mean$mean_count)
par(mfrow=c(5,5), oma=c(4,2,4,2), mar=c(5.5,4,3,0))
for(i in 1:length(years)){
if(any(Mean$Year==years[i])) {
year1<-Mean[which(Mean$Year==years[i]),]
Species<-unique(as.character(year1$SpeciesName))
Colors<-c("pink","purple")[Species]
Data_Hr<-year1[year1$SpeciesName=="Hypsypops rubicundus,adult",]
Data_Pc<-year1[year1$SpeciesName=="Paralabrax clathratus,adult",]
plot(Data_Hr$mean_count~Data_Pc$mean_count,
xlab=c("Hypsypops rubicundus"),
ylab=c("Paralabrax clathratus"),main=years[i],pch=16)
}
}
It's a lot, I'm sorry; I'm not sure how to streamline the process. But I keep getting an error:
Error in names(Mean) <- c("Site", "Year", "mean_count", "SpeciesName")
: 'names' attribute [4] must be the same length as the vector [0]
Not sure how I can debug this.
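A minimal way to see what the message means, assuming Mean is still NULL when the loop finishes (which would happen if no site/year combination ever passes the length(...) >= 2 checks): converting NULL gives a data frame with zero columns, and four names cannot be assigned to zero columns.

```r
# Assumption: nothing was ever rbind-ed into Mean, so it is still NULL
Mean <- NULL
Mean <- as.data.frame(Mean, stringsAsFactors = FALSE)
print(ncol(Mean))  # number of columns available to name

# Capture the error instead of stopping the script
msg <- tryCatch({
  names(Mean) <- c("Site", "Year", "mean_count", "SpeciesName")
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)
```

If that matches, printing nrow(Hypsypops) and nrow(zz) inside the loop would show where the rows are being lost.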
Not sure why you want to do this with an elaborate loop code. It sounds like you are trying to summarise your data.
This can be done in different ways. Here is a solution using dplyr:
library(dplyr)
DataSet1 %>%
  group_by(Year, SpeciesName, Site) %>%
  summarise(nrecords = n(),
            Count = mean(Count))
To get a better answer, it might be helpful to post a subset of the data and the intended result you are after.
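In the absence of the real data, here is a base R sketch of the same summary on a made-up data set. The column names (Year, Site, SpeciesName, Count) are taken from the question; the values, and the object names counts, means, n and repeated, are invented for illustration.

```r
# Made-up counts: site 1 was counted twice in 2000 for one species
counts <- data.frame(
  Year = c(2000, 2000, 2000, 2001),
  Site = c(1, 1, 2, 1),
  SpeciesName = c("Hypsypops rubicundus,adult", "Hypsypops rubicundus,adult",
                  "Paralabrax clathratus,adult", "Hypsypops rubicundus,adult"),
  Count = c(4, 6, 3, 10)
)

# Mean count and number of records per year, site and species
means <- aggregate(Count ~ Year + Site + SpeciesName, data = counts, FUN = mean)
n <- aggregate(Count ~ Year + Site + SpeciesName, data = counts, FUN = length)
names(n)[4] <- "nrecords"
means <- merge(means, n)  # joins on Year, Site and SpeciesName

# Keep only the combinations that were counted at least twice
repeated <- means[means$nrecords >= 2, ]
print(repeated)
```

The resulting table can be passed to plot() directly, without building the means row by row.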

R - How to fix function with if statement and multiple conditions (factor variables)

I have a df with two relevant columns: (1) words of different lengths and (2) labels for syllables. Both are factor variables. In a third column (3. structure) I want to add information about the components of the syllable (Vowel or Consonant). To do this I use the word and the syllable label as conditions.
# Illustrative rows; labels take the values "1", "2", "3", "Final" and "Mono"
word <- c("dog", "parent", "parent", "extraordinary")
labels <- c("Mono", "1", "Final", "2")
df <- data.frame(word, labels)
I have used very similar code before and it worked, but now it doesn't anymore. I recently updated R (3.6.0) and can't figure out where the issue lies. I also updated all the packages I thought would be relevant, but still no luck. I only get a column with nothing but CV in it. My df is also much bigger, although I don't think that's the issue.
Classes ‘rowwise_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 4458 obs. of 27 variables:
This is an excerpt of my code:
getStructure <- function(word, labels) {
if (str_detect(word, 'extraordinary') & labels == '2') {
return ('CCV')
} else {
return("CV")
}
}
df <- df %>%
rowwise() %>%
mutate(structure =getStructure(word, labels))
Example of what I need:
|word   | labels | structure
----------------------------
|dog    | Mono   | CVC
|parent | 1      | CV
|parent | Final  | CVC
I would like to use a function and specify each word and syllable when necessary. Everything else should be CV.
However, I only get one new column with CV in all rows, and this warning:
Warning message:
In if (str_detect(word, "dog") & labels == "1") { :
the condition has length > 1 and only the first element will be used
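The warning says that if () received a whole column rather than a single value, so only the first row was tested; if () is not vectorised. One possible way around it, sketched here in base R, is to make the function itself work on whole vectors so that it no longer depends on rowwise(). The rules are the ones listed in the question; the data frame is a small made-up example.

```r
# Small made-up example using the words and labels from the question
df <- data.frame(word = c("dog", "parent", "parent", "extraordinary"),
                 labels = c("Mono", "1", "Final", "2"))

# Vectorised version: start with the default and overwrite the special cases
getStructure <- function(word, labels) {
  word <- as.character(word)      # the question's columns are factors
  labels <- as.character(labels)
  structure <- rep("CV", length(word))
  structure[word == "dog" & labels == "Mono"] <- "CVC"
  structure[word == "parent" & labels == "Final"] <- "CVC"
  structure[grepl("extraordinary", word, fixed = TRUE) & labels == "2"] <- "CCV"
  structure
}

df$structure <- getStructure(df$word, df$labels)
print(df)
```

Each additional word/syllable rule is one more line of logical indexing, and anything not matched stays CV.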

Matching unequal data frames on a unique identifier

So I have all this data from the EPL and I'm currently trying to create a column for a team's "form" based on their last five games. A win counts as 1 point, a draw as .5, and a loss as zero. I have a loop that does this for a single team at a time, but when I try to create one that merges them all together I cannot get it to work. My data comes from: http://www.football-data.co.uk/englandm.php
To keep it simple I will use only the data from the 2013/2014 season for the Premier League.
I imported the data from the Excel sheet and labelled it PL_20
library(RCurl)
library(lubridate) # provides guess_formats(), used below
URL <- "www.football-data.co.uk/mmz4281/1415/E0.csv"
x <- getURL(URL, ssl.verifypeer = FALSE)
PL_20 <- read.csv(textConnection(x))
# cleaning data, getting rid of the betting odds
data <- PL_20[,-c(1,10,11,24:65)]
#Get dates all in same date format
data$Date<-as.Date(data$Date, guess_formats(data$Date, "dmy"))
#sorting (used for all the seasons)
sorted <- data[order(data$Date,decreasing=TRUE),]
sorted$index <- seq(1:nrow(sorted))
# making a column for form
teams<-as.matrix(unique(data$HomeTeam))
test<-sorted
z<-1 # controls which team the form is being found for. Ideally I would have this cycle
# through all of the teams
current<-subset(sorted, HomeTeam==as.character(teams[z]) | AwayTeam==as.character(teams[z]))
current$h.form<-0
current$a.form<-0
current$recent<-0
for (i in 1:nrow(current)){
if((as.character(current[i,2])==as.character(teams[z]) && as.character(current[i,6])=="H") || (as.character(current[i,3])==as.character(teams[z]) && as.character(current[i,6])=="A")){
# current[i,7]<- "W"
current[i,24]<- 1
}else{
if((as.character(current[i,2])==as.character(teams[z]) && as.character(current[i,6])=="D") || (as.character(current[i,3])==as.character(teams[z]) && as.character(current[i,6])=="D"))
{
#current[i,7]<- "D"
current[i,24]<- .5
}else{
if((as.character(current[i,2])==as.character(teams[z]) && as.character(current[i,6])=="A") || (as.character(current[i,3])==as.character(teams[z]) && as.character(current[i,6])=="H"))
{
# current[i,7]<- "L"
current[i,24]<- 0
}
}
}
}
d<-0
for (d in 0:(nrow(current)-6))
{
if (as.character(current[nrow(current)-(5+d),2])==as.character(teams[z])){
current[(nrow(current)-(5+d)),22]<-as.numeric(sum(current[(nrow(current)-(4+d)):(nrow(current)-d),24]))
}else{
if(as.character(current[nrow(current)-(5+d),3])==as.character(teams[z]))
{
current[(nrow(current)-(5+d)),23]<-sum(current[(nrow(current)-(4+d)):(nrow(current)-d),24])
}
}
}
Now these ugly loops create a data frame called current which has three columns at the end: h.form, a.form, and recent. h.form is the form for the designated home team for that game and a.form is the form for the designated away team for that game. Recent is just the outcome of that game.
I would like to be able to combine all of the teams together so there is one observation per game and both h.form and a.form are populated with the correct values for their corresponding teams.
Your help is appreciated and if you have suggestions on how to clean up these loops that would be helpful as well.
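One possible way to handle all teams at once, sketched in base R on a made-up mini-season: reshape to one row per team per game, compute each team's running form, then map the values back onto the game rows. The column names HomeTeam, AwayTeam and FTR are assumed to match the football-data.co.uk file; the teams, results and the helper names points_for and prev_sum are hypothetical. As in the loops above, form is the sum of the five games before the current one.

```r
# Made-up fixtures between two teams, oldest first
games <- data.frame(
  Date = as.Date("2014-08-01") + 7 * (0:6),
  HomeTeam = c("A", "B", "A", "B", "A", "B", "A"),
  AwayTeam = c("B", "A", "B", "A", "B", "A", "B"),
  FTR = c("H", "D", "A", "H", "H", "D", "H"),
  stringsAsFactors = FALSE
)
games <- games[order(games$Date), ]
games$game_id <- seq_len(nrow(games))

# Points from the point of view of the home ("H") or away ("A") side
points_for <- function(result, side) {
  ifelse(result == "D", 0.5, ifelse(result == side, 1, 0))
}

# Long format: one row per team per game
long <- rbind(
  data.frame(game_id = games$game_id, team = games$HomeTeam, side = "H",
             pts = points_for(games$FTR, "H"), stringsAsFactors = FALSE),
  data.frame(game_id = games$game_id, team = games$AwayTeam, side = "A",
             pts = points_for(games$FTR, "A"), stringsAsFactors = FALSE)
)
long <- long[order(long$team, long$game_id), ]

# Sum of the previous n results; NA until n earlier games exist
prev_sum <- function(x, n = 5) {
  out <- rep(NA_real_, length(x))
  if (length(x) > n) {
    cs <- cumsum(x)
    idx <- (n + 1):length(x)
    out[idx] <- cs[idx - 1] - c(0, cs)[idx - n]
  }
  out
}
long$form <- ave(long$pts, long$team, FUN = prev_sum)

# Back to one row per game, with a form value for each side
home <- long[long$side == "H", ]
away <- long[long$side == "A", ]
games$h.form <- home$form[match(games$game_id, home$game_id)]
games$a.form <- away$form[match(games$game_id, away$game_id)]
print(games)
```

Looking results up by game_id rather than by row position keeps the two sides aligned even if the rows are re-sorted later.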

Adding a row to a dataframe

I am reading a file line by line and then adding specific lines to a dataframe. Here is an example of a line I would add to a dataframe:
ATOM 230 CA GLU A 31 66.218 118.140 2.411 1.00 31.82 C
I have verified that my checks are ok, I think it has specifically to do with my rbind command. Thanks for your help!
Edit: The error is as follows, the output of the dataframe is:
Residue AtomCount SideChain XCoord YCoord ZCoord
2 MET 1 A 62.935 97.579 30.223
21 <NA> 2 A 63.155 95.525 27.079
3 <NA> 3 A 65.289 96.895 24.308
It seems like it stops picking up the name of the residue..
The code I am using is:
get.positions <- function(sourcefile, chain_required = "A"){
positions = data.frame()
visited = list()
filedata <- readLines(sourcefile, n= -1)
for(i in 1: length(filedata)){
input = filedata[i]
id = substr(input,1,4)
if(id == "ATOM"){
type = substr(input,14,15)
if(type == "CA"){
#if there are duplicates it takes the first one
residue = substr(input,18,20)
type_of_chain = substr(input,22,22)
atom_count = strtoi(substr(input, 23,26))
if(atom_count >=1){
if(type_of_chain == chain_required && !(atom_count %in% visited) ){
position_string = trim(substr(input,30,54))
position_string = lapply(unlist(strsplit(position_string," +")),as.numeric)
positions<- rbind(positions, list(residue, atom_count, type_of_chain, position_string[[1]], position_string[[2]], position_string[[3]]))
}
}
}
}
}
return (positions)
}
When I ran your code with that data I got type=="LU" (so it failed the type=="CA" test) and the rest of processing never got accomplished. I think you may need to change the indices to
type = substr(input,10,11)
Fixing that problem brings up others, and it's going to be very difficult to fix them all since the goal is not clearly stated. I suggest you edit your code and data so the example is reproducible. This could be a reproducible input/execution method:
get.positions(textConnection("ATOM 230 CA GLU A 31 66.218 118.140 2.411 1.00 31.82 C") )
In the end, the following worked. First I made a much larger data frame, and then just replaced specific rows (thank you Joran, who linked me to the R Inferno).
For the user who asked why I am splitting on a plus: your assumption is incorrect. The pattern is actually " +", that is, a space followed by a plus, so it splits on runs of one or more spaces. Finally, as for the incorrect indices, I've figured out how to show the extra spaces on the form. Here is the correct original line; you will see the indices match.
ATOM      2  CA  MET A   1      62.935  97.579  30.223  1.00 37.58           C
The R code that works, is as follows.
get.positions <- function(sourcefile, chain_required = "A"){
N <- 10^5
AACount <- 0
positions = data.frame(Residue=rep(NA, N),AtomCount=rep(NA, N),SideChain=rep(NA, N),XCoord=rep(NA, N),YCoord=rep(NA, N),ZCoord=rep(NA, N),stringsAsFactors=FALSE)
visited = list()
filedata <- readLines(sourcefile, n= -1)
for(i in 1: length(filedata)){
input = filedata[i]
id = substr(input,1,4)
if(id == "ATOM"){
type = substr(input,14,15)
if(type == "CA"){
#if there are duplicates it takes the first one
residue = substr(input,18,20)
type_of_chain = substr(input,22,22)
atom_count = strtoi(substr(input, 23,26))
if(atom_count >=1){
if(type_of_chain == chain_required && !(atom_count %in% visited) ){
visited <- c(visited, atom_count)
AACount <- AACount + 1
# trimws() is base R; the trim() used originally comes from an add-on package
position_string = trimws(substr(input,30,54))
position_string = lapply(unlist(strsplit(position_string," +")),as.numeric)
#print(input)
# list() keeps the coordinates numeric; c() would coerce every value to character
positions[AACount,]<- list(residue, atom_count, type_of_chain, position_string[[1]], position_string[[2]], position_string[[3]])
}
}
}
}
}
positions<-positions[1:AACount,]
return (positions)
}
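As an alternative to preallocating and filling row by row, the same extraction can be sketched without a loop, since substr() works on whole character vectors. This assumes the fixed columns already used above (14-15 atom name, 18-20 residue, 22 chain, 23-26 residue number) and, as a further assumption, the usual PDB coordinate columns 31-38, 39-46 and 47-54. The names get_positions_vec and make_atom_line are hypothetical, and the test lines are generated rather than read from a file.

```r
# Hypothetical helper that writes a fixed-width ATOM line for testing
make_atom_line <- function(serial, atom, residue, chain, res_no, x, y, z) {
  sprintf("ATOM  %5d %-4s %3s %s%4d    %8.3f%8.3f%8.3f",
          serial, atom, residue, chain, res_no, x, y, z)
}

get_positions_vec <- function(lines, chain_required = "A") {
  # Keep alpha-carbon ATOM records of the requested chain
  keep <- substr(lines, 1, 4) == "ATOM" &
    substr(lines, 14, 15) == "CA" &
    substr(lines, 22, 22) == chain_required
  lines <- lines[keep]
  atom_count <- as.integer(substr(lines, 23, 26))
  # If a residue number occurs more than once, take the first one
  ok <- !is.na(atom_count) & atom_count >= 1 & !duplicated(atom_count)
  lines <- lines[ok]
  data.frame(Residue = substr(lines, 18, 20),
             AtomCount = atom_count[ok],
             SideChain = substr(lines, 22, 22),
             XCoord = as.numeric(substr(lines, 31, 38)),
             YCoord = as.numeric(substr(lines, 39, 46)),
             ZCoord = as.numeric(substr(lines, 47, 54)),
             stringsAsFactors = FALSE)
}

pdb_lines <- c(
  make_atom_line(1L, " N  ", "MET", "A", 1L, 61.000, 96.000, 31.000),
  make_atom_line(2L, " CA ", "MET", "A", 1L, 62.935, 97.579, 30.223),
  make_atom_line(9L, " CA ", "GLU", "A", 2L, 63.155, 95.525, 27.079),
  make_atom_line(10L, " CA ", "GLU", "A", 2L, 63.200, 95.600, 27.100),
  make_atom_line(20L, " CA ", "ALA", "B", 1L, 10.000, 20.000, 30.000)
)
positions <- get_positions_vec(pdb_lines)
print(positions)
```

For a real file, readLines(sourcefile) would supply the lines argument.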

Ordering Merged data frames

As a fairly new R programmer I seem to have run into a strange problem - probably my inexperience with R
After reading and merging successive files into a single data frame, I find that order does not sort the data as expected.
I have multiple references in each file but each file refers to measurement data obtained at a different time.
Here's the code
library(reshape)
# Enter file name to Read & Save data
FileName=readline("Enter File name:\n")
# Find first occurrence of file
for ( round1 in 1 : 6) {
ReadFile=paste(round1,"C_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile))
break
}
x = data.frame(read.csv(ReadFile, header=TRUE),rnd=round1)
for ( round2 in (round1+1) : 6) {
#
ReadFile=paste(round2,"C_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile)) {
y = data.frame(read.csv(ReadFile, header=TRUE),rnd = round2)
if (round2 == (round1 +1))
z=data.frame(merge(x,y,all=TRUE))
z=data.frame(merge(y,z,all=TRUE))
}
}
ordered = order(z$lab_id)
results = z[ordered,]
res = data.frame( lab=results[,"lab_id"],bw=results[,"ZBW"],wi=results[,"ZWI"],pf_zbw=0,pf_zwi=0,r = results[,"rnd"])
#
# Establish no of samples recorded
nsmpls = length(res[,c("lab")])
# Evaluate Z_scores for Between Lab Results
for ( i in 1 : nsmpls) {
if (res[i,"bw"] > 3 | res[i,"bw"] < -3)
res[i,"pf_zbw"]=1
}
# Evaluate Z_scores for Within Lab Results
for ( i in 1 : nsmpls) {
if (res[i,"wi"] > 3 | res[i,"wi"] < -3)
res[i,"pf_zwi"]=1
}
dd = melt(res, id=c("lab","r"), "pf_zbw")
b = cast(dd, lab ~ r)
If anyone could see why the ordering only works for about 55 of 70 records and could steer me in the right direction I would be obliged
Thanks very much
Check whether z$lab_id is a factor (with is.factor(z$lab_id)).
If it is, try
z$lab_id <- as.character(z$lab_id)
if it is supposed to be a character vector; or
z$lab_id <- as.numeric(as.character(z$lab_id))
if it is supposed to be a numeric vector.
Then order it again.
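A small illustration of why the conversion matters, using made-up ids: order() on a factor follows the factor's levels, which for number-like strings are sorted as text, and as.numeric() applied directly to a factor returns the level codes rather than the values.

```r
# Made-up ids stored as a factor, as read.csv could produce
lab_id <- factor(c("10", "9", "100"))

# Ordering the factor follows the levels ("10", "100", "9"), i.e. text order
by_factor <- as.character(lab_id[order(lab_id)])

# Converting through character first gives the actual numbers
ids <- as.numeric(as.character(lab_id))
by_number <- ids[order(ids)]

# as.numeric() on the factor itself returns level codes, not the ids
codes <- as.numeric(lab_id)

print(by_factor)
print(by_number)
print(codes)
```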
Ps. I had previously put these in the comments.
