Function to replace values in column

I would like to create a new column in a data.frame as follows.
Data description:
`'data.frame': 20 obs. of 3 variables:
$ gvkey : int 1004 1004 1004 1004 1004 1004 1004 1004 1004 1004 ...
$ DEF : int 0 0 0 0 0 0 0 0 0 0 ...
$ FittedRobustRatio: num 0.549 0.532 0.519 0.539 0.531 ...`
The function I wrote, which doesn't work:
fun.mark <- function(x,y){
if (x==0) { y[y>0.60] <- "Del"
} else (x==1) {
y[y<0.45] <- "Del2"}}
NewDataFrame <- ddply(ShorterData,~gvkey,transform,Fitcorr=fun.mark(DEF, FittedRobustRatio))
Basically, I want to look at the DEF column: where DEF is 0 and FittedRobustRatio > 0.60, replace the value with "Del"; where DEF is 1 (the column only contains 0 or 1) and FittedRobustRatio < 0.45, replace the value with, for example, "Del2". Thanks.

To do this I normally nest ifelse() calls. ifelse() is laid out like this:
ifelse(condition (e.g. a > b), value if the condition is met, value if the condition is not met)
So this should work:
data.frame$new.column <- ifelse (
data.frame$DEF=="0"&data.frame$FittedRobustRatio>0.6, "Del", ifelse(
data.frame$DEF=="1"&data.frame$FittedRobustRatio<0.45, "Del2", "none"))
I could be wrong: you have not provided a reproducible data frame, so I can't test this.
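A minimal sketch of the nested ifelse() approach on made-up data. The data frame df, its four rows, and the "none" fallback label are assumptions for illustration only, not the asker's data:

```r
# Hypothetical toy data with the same column names as in the question
df <- data.frame(
  gvkey = c(1004L, 1004L, 1004L, 1004L),
  DEF = c(0L, 0L, 1L, 1L),
  FittedRobustRatio = c(0.65, 0.50, 0.40, 0.55)
)

# ifelse() is vectorised, so it evaluates the conditions row by row
df$Fitcorr <- ifelse(
  df$DEF == 0 & df$FittedRobustRatio > 0.60, "Del",
  ifelse(df$DEF == 1 & df$FittedRobustRatio < 0.45, "Del2", "none")
)

print(df)
```

Because ifelse() works on whole columns, no ddply() call is needed here. The function in the question fails for two reasons: else cannot take a condition (it would have to be else if), and if() expects a single TRUE/FALSE value rather than a whole column.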

Related

How to divide my dataset for use with PERMANOVA

Hello everyone :) I have a data set in which each row is an individual belonging to one of 5 species (one column), together with its presence/absence in different landscapes (7 other columns).
'data.frame': 1212 obs. of 10 variables:
$ latitude : num 34.5 34.7 34.7 34.8 34.8 ...
$ longitude : num 127 127 127 127 127 ...
$ species : Factor w/ 5 levels "Bufo gargarizans",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Built : int 0 0 0 0 0 0 0 0 0 0 ...
$ Agriculture: int 1 1 0 1 0 0 1 0 0 0 ...
$ Forested : int 0 0 1 0 0 0 0 1 1 1 ...
$ Grassland : int 0 0 0 0 0 0 0 0 0 0 ...
$ Wetland : int 0 0 0 0 0 0 0 0 0 0 ...
$ Bare : int 0 0 0 0 1 0 0 0 0 0 ...
$ Water : int 1 0 0 0 0 1 0 0 0 0 ...
I am trying to use PERMANOVA followed by a Tukey test to see whether the species use the landscape differently. My supervisor did it in SPSS and it worked very well, and I have to do it in R.
I saw that I need 2 CSV files to run PERMANOVA in R, but I have only one. Here is the script that I found on the internet and want to use for my analysis.
library(vegan)
data(dune)
data(dune.env)
# default test by terms
adonis2(dune ~ Management*A1, data = dune.env)
In my case, if I understand correctly, I should have 1 data frame with species and 1 data frame with environmental variables.
However, my presence/absence values are inside the environmental categories (see the str of my table above). So if I create 1 data frame with species only, that data frame will not contain any numerical values.
So I am totally lost. I don't know how to proceed. Can someone help me, please? Thank you!

I will split my answer into two parts: one where I know what I am talking about and one where I am brainstorming ;)
Here is the first part, on how to split your data into two data.frames:
# Load dplyr (provides %>%) and set the seed for reproducibility
library(dplyr)
set.seed(1312)
# Some sample data
sample_data <- dplyr::tibble(species=sample(LETTERS[1:5],size=500,replace=T),
Built=sample(c(0,1),size=500,replace=T),
Agriculture=sample(c(0,1),size=500,replace=T),
Forested=sample(c(0,1),size=500,replace=T),
Grassland=sample(c(0,1),size=500,replace=T),
Wetland=sample(c(0,1),size=500,replace=T),
Bare=sample(c(0,1),size=500,replace=T),
Water=sample(c(0,1),size=500,replace=T))
# Split data
df1 <- sample_data %>%
dplyr::select(species)
df2 <- sample_data %>%
dplyr::select(Built:Water)
Now part two: as it says in ?adonis2, the first argument of adonis2 is a formula whose left-hand side must be a community data matrix or a dissimilarity matrix.
Even though I am not sure it makes sense, I went wild and followed the instructions :D
df2_dist <- dist(df2)
vegan::adonis2(df2_dist~species, data=df1)
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
vegan::adonis2(formula = d2 ~ species, data = d1)
Df SumOfSqs R2 F Pr(>F)
species 4 5.333 0.15802 0.7038 0.88
Residual 15 28.417 0.84198
Total 19 33.750 1.00000
Of course, this might be nonsense in terms of content, as I took a purely technical approach here, but maybe it helps you shape your data as required.

So I wrote this code:
setwd("C:/Users/Johan/Downloads/memoire Johanna (1)/memoire Johanna")
xdata=read.csv(file="all10_reduce_focal species1.csv", header=T, sep=",")
str(xdata)
xdata$species <- as.factor(xdata$species)
sample_data <- dplyr::tibble(species=sample(LETTERS[1:5],size=1212,replace=T),
Built=sample(c(0,1),size=1212,replace=T),
Agriculture=sample(c(0,1),size=1212,replace=T),
Forested=sample(c(0,1),size=1212,replace=T),
Grassland=sample(c(0,1),size=1212,replace=T),
Wetland=sample(c(0,1),size=1212,replace=T),
Bare=sample(c(0,1),size=1212,replace=T),
Water=sample(c(0,1),size=1212,replace=T))
library(dplyr)
df1 <- sample_data %>%
dplyr::select(species)
df2 <- sample_data %>%
dplyr::select(Built:Water)
df1_dist <- dist(df1)
vegan::adonis2(df1_dist~Built+Agriculture+Grassland+Forested+Wetland+Bare+Water, data=df2)
Species should be the response, as I am trying to see the effect of the landscape on the species. When I do this I get:
Error in vegdist(as.matrix(lhs), method = method, ...) : input data must be numeric
That's because the "species" variable contains only characters. So I changed it to make it numeric:
sample_data <- dplyr::tibble(species=sample(c(1:5),size=1212,replace=T),
Built=sample(c(0,1),size=1212,replace=T),......
But the result I get differs from the SPSS result: I don't have any significant variable (in SPSS, Built, Agriculture, Forested and Water are significant).
I think my code is wrong.

Convert factor to date in R to create dummy variable

I need to create a dummy variable for "before and after 04/11/2020" from the variable "date" in the dataset "counties". There are over a hundred dates in the dataset. I am trying to convert the dates from factor to date with the as.Date function, but I get NA. Could you please help me find where I am making an error? I kept the other dummy variable I created, just in case it affects the overall outcome.
counties <- read.csv('C:/Users/matpo/Desktop/us-counties.csv')
str(counties)
as.Date(counties$date, format = '%m/%d/%y')
# create dummy variables for New York, New Jersey, California, and Illinois
counties$state = ifelse(counties$state == 'New Jersey' &
counties$state == 'New York'& counties$state == 'California' &
counties$state == 'Illinois', 1, 0)
counties$date = ifelse(counties$date >= "4/11/2020", 1, 0)
Output of str(counties):
$ date : logi NA NA NA NA NA NA ...
$ county: Factor w/ 1774 levels "Abbeville","Acadia",..: 1468 1468 1468 379 1468 1178 379 1468 979 942 ...
$ state : num 0 0 0 0 0 0 0 0 0 0 ...
$ fips : int 53061 53061 53061 17031 53061 6059 17031 53061 4013 6037 ...
$ cases : int 1 1 1 1 1 1 1 1 1 1 ...
$ deaths: int 0 0 0 0 0 0 0 0 0 0 ...
Thank you!

The format in as.Date is incorrect: use "%Y" for a 4-digit year.
You need to assign the result back (<-) for the values to change.
"4/11/2020" is just a string; to compare dates you need to convert it to a Date object. You can also avoid ifelse here.
Try:
counties$date <- as.Date(counties$date, format = '%m/%d/%Y')
counties$dummy <- as.integer(counties$date >= as.Date('2020-04-11'))
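A small self-contained sketch of the same fix on made-up rows. The three dates and states, and the column names after_cutoff and state_dummy, are assumptions for illustration. It also shows one way to build the state dummy: the question joins the four state tests with &, which can never all be true for a single row, so %in% is used here instead.

```r
# Hypothetical stand-in for the CSV; stringsAsFactors = TRUE mimics the factor columns
counties <- data.frame(
  date = c("4/10/2020", "4/11/2020", "4/12/2020"),
  state = c("New York", "Texas", "Illinois"),
  stringsAsFactors = TRUE
)

# Convert via as.character() so the factor labels are parsed, and assign the result back
counties$date <- as.Date(as.character(counties$date), format = "%m/%d/%Y")

# Compare Date with Date; the logical result converts directly to 0/1
counties$after_cutoff <- as.integer(counties$date >= as.Date("2020-04-11"))

# 1 if the state is any of the four, 0 otherwise
target_states <- c("New Jersey", "New York", "California", "Illinois")
counties$state_dummy <- as.integer(counties$state %in% target_states)

print(counties)
```

Writing the dummies to new columns rather than overwriting date and state keeps the original values available for checking.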

How to create independent, different data.frames in a loop in R

Good evening everybody,
I'm stuck on how to construct a for loop. I don't have an actual problem, but I'd like to understand how I can create "independent" data frames (duplicates with some differences).
I wrote the code step by step (it works), but I think there may be a way to compact it with a for loop.
x is my original data.frame
str(x)
Classes ‘data.table’ and 'data.frame': 13500 obs. of 6 variables:
$ a: int 1 56 1058 567 987 574 1001...
$ b: int 10 5 10 5 5 10 10 5 10 10 ...
$ c: int NA NA NA NA NA NA NA NA NA NA ...
$ d: int 0 0 0 0 0 0 0 0 0 0 ...
$ e: int 0 0 0 0 0 0 0 0 0 0 ...
$ f: int 22 22 22 22 22 22 22 22 22 22 ...
My first goal is to delete, for every column, any NA and "" elements. I do this with these lines of code.
x_b<- x[!(!is.na(x$b) & x$b==""), ]
x_c<- x[!(!is.na(x$c) & x$c==""), ]
x_d<- x[!(!is.na(x$d) & x$d==""), ]
x_e<- x[!(!is.na(x$e) & x$e==""), ]
x_f<- x[!(!is.na(x$f) & x$f==""), ]
After this, the second goal is to create an ID code for each new data.frame, using paste0(), e.g. paste0(x_b$a, x_b$b).
x_b$ID_1<-paste0(x_b$a, x_b$b)
x_c$ID_2<-paste0(x_c$a, x_c$c)
x_d$ID_3<-paste0(x_d$a, x_d$d)
x_e$ID_4<-paste0(x_e$a, x_e$e)
x_f$ID_5<-paste0(x_f$a, x_f$f)
I created this for loop to try to reduce the number of lines I use and to make the code easier to read.
z<-data.frame("a", "b","c","d","e","f")
zy<-data.frame("x_b", "x_c", "x_d", "x_e", "x_f")
for(i in z) {
for (j in zy ) {
target <- paste("_",i)
x[[i]]<-(!is.na(x[[i]]) & x[[i]]=="") # with this I am able to create a column in the x data.frame,
# but if I assign to a new data frame the loop doesn't work.
# I don't want to overwrite x. I'd like to create a
# separate data frame for each transformation.
# At this point of the script I should have new,
# different data frames (x_b, x_c, x_d, x_e, x_f), but I
# don't know how to create them.
# If I had these data frames, I would apply another function
# in the for loop:
zy[[ID]]<-paste0(x_b$a, "_23X")
}
}
I'd like to have as output this:
str(x_b)
Classes ‘data.table’ and 'data.frame': 13500 obs. of 6 variables:
$ a: int 1 56 1058 567 987 574 1001...
$ b: int 10 5 10 5 5 10 10 5 10 10 ...
$ c: int NA NA NA NA NA NA NA NA NA NA ...
$ d: int 0 0 0 0 0 0 0 0 0 0 ...
$ e: int 0 0 0 0 0 0 0 0 0 0 ...
$ f: int 22 22 22 22 22 22 22 22 22 22 ...
$ ID: int 1_23X 56_23X 1058_23X 567_23X 987_23X 574_23X 1001_23X...
and so on.
I think there is some important concept about data frames that I am missing.
Where am I going wrong?
Thank you so much in advance for the support.

There is a simple way to do this with the tidyverse package(s):
First goal:
drop_na(df)
You can also use na_if if you want to convert "" to NA.
Second goal: use mutate to create a new variable:
df <- df %>%
mutate(id = paste0(a, "_23X"))
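As for the loop itself, one common base R pattern is to build the copies in a named list with lapply() instead of creating separately named objects. A minimal sketch, assuming a plain data.frame (not a data.table) and made-up values; the names cols, frames and keep are only illustrative:

```r
# Hypothetical toy stand-in for x
x <- data.frame(
  a = c(1L, 56L, 1058L),
  b = c(10L, NA, 5L),
  c = c(NA, 3L, 4L)
)

cols <- c("b", "c")

# One filtered copy of x per column, each with its own ID column
frames <- lapply(cols, function(col) {
  keep <- !is.na(x[[col]]) & x[[col]] != ""   # drop rows where this column is NA or ""
  out <- x[keep, , drop = FALSE]
  out$ID <- paste0(out$a, "_", out[[col]])     # ID built from a and the current column
  out
})
names(frames) <- paste0("x_", cols)

str(frames$x_b)
```

Each element (frames$x_b, frames$x_c, ...) is an independent data frame, and x itself is left unchanged. If separate objects are really wanted, list2env(frames, envir = environment()) would create them, although keeping the list is usually easier to work with.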

Using lag function gives an atomic vector with all zeroes

I have been trying to use the lag function in base R to calculate rainfall accumulations over a 6-hr period. I have hourly rainfall; I calculate cumulative rainfall using the cumsum function and then use the lag function to calculate 6-hr accumulations, as below.
Event_Data<-dbGetQuery(con, "select feature_id, TO_CHAR(datetime, 'MM/DD/YYYY HH24:MI') as DATE_TIME, value_ms as RAINFALL_IN from Rain_HOURLY")
Event_Data$cume<-cumsum(Event_Data$RAINFALL_IN)
Event_Data$six_hr<-Event_Data$cume-lag(Event_Data$cume, 6)
But the lag function gives me all zeroes, and the structure of the data frame looks like this:
'data.frame': 169 obs. of 5 variables:
$ feature_id : num 80 80 80 80 80 ...
$ DATE_TIME : chr "09/10/2017 00:00" "09/10/2017 01:00" "09/10/2017 02:00" "09/10/2017 03:00" ...
$ RAINFALL_IN: num 0.251 0.09 0.017 0.071 0.016 0.01 0.136 0.651 0.185 0.072 ...
$ cume : num 0.251 0.341 0.358 0.429 0.445 ...
$ six_hr : atomic 0 0 0 0 0 0 0 0 0 0 ...
..- attr(*, "tsp")= num -23 145 1
This code has worked fine with several of my other projects but I have no clue why I am getting zeroes. Any help is greatly appreciated.
Thanks.

There might be a conflict with a lag function from another package; that would explain why this code worked in other scripts but not in this one.
Try stats::lag instead of just lag to make explicit which package you want to use (or dplyr::lag, which seems to work better, for me at least).

I think you have a misconception about what lag() from the stats package does. It's returning zeros because you're taking the full cumulative rainfall data and then subtracting it from itself. Check this small example for an illustration:
x <- 1:20
y <- lag(x,3) ;y
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#attr(,"tsp")
#[1] -2 17 1
x-y #x is a vector
# [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#attr(,"tsp")
#[1] -2 17 1
As you can see, lag() simply keeps the vector values and just adds a time series attribute with the values "starting time, ending time, frequency". Because you put in a plain vector, it used the default values "1, length(Event_Data$cume), 1" and subtracted the lag from the starting and ending time, which is 3 in the example and seemingly 24 in your code output (which doesn't fit the code input above it, btw).
The problem is that your vector doesn't have any time attribute assigned to it, so R doesn't know which values of your data and of the lagged data correspond to each other. Thus, it simply subtracts the vector values and adds the time attribute of the lagged variable. To fix this, you just need to assign times to Event_Data$cume by converting it to a time-series object, i.e. try Event_Data$six_hr<-as.numeric(ts(Event_Data$cume) - lag(ts(Event_Data$cume), 6))
It works fine for the small example above:
x <- ts(1:20)
y <- lag(x,3)
x-y #x is a ts
#Time Series:
#Start = 1
#End = 17
#Frequency = 1
# [1] -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3
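Worth noting for the rainfall case: with a positive k, stats::lag() shifts the series earlier in time, which is why the differences in the example above come out negative. A minimal base R sketch that instead computes cume[t] - cume[t-6] by shifting the vector by hand; the eight rainfall values are the ones visible in the question's str() output, and the names k, lagged and six_hr are only illustrative:

```r
# First eight hourly values shown in the question's str() output
rain <- c(0.251, 0.09, 0.017, 0.071, 0.016, 0.01, 0.136, 0.651)
cume <- cumsum(rain)

k <- 6
# Shift cume forward by k positions, padding the start with NA
lagged <- c(rep(NA, k), head(cume, -k))

# Rain that fell during the last k hours; NA until k earlier hours exist
six_hr <- cume - lagged
print(six_hr)
```

The result has the same length as the input, so it can be assigned straight back as a data frame column, with NA in the first six rows.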

Conditional input using read.table or readLines

I'm struggling with using readLines() and read.table() to get a well-formatted data frame in R.
I want to read files like this, which contain hockey stats. I'd like to get a nicely formatted data frame; however, specifying the exact number of lines to read is difficult because in other files like this the number of players is different. Also, non-players, marked as #.AC, #.HC and so on, should not be read in.
I tried something like this
LINES <- 19
stats <- read.table(file=Datei, skip=11, header=FALSE, stringsAsFactors=FALSE,
encoding="UTF-8", nrows=LINES)
but as mentioned above, the value for LINES is different each time.
I also tried readLines as in this post, but had no luck with it.
Is there a way to integrate a condition in read.table, like (pseudo code)
if (first character == "AC") {
break read.table
}
Sorry if this looks strange, I don't have that much experience in scripting or coding.
Any help is appreciated, thanks a lot!
Greetz!

Your data show a couple of difficulties which should be handled in sequence, which means you should not try to read the entire file with one command:
Read plain lines and find start and stop row
Depending on the specification of the files you read, my suggestion is to first find the first row you actually want to read by some indicator. This can be a line number which is always the same or, as in my example, two lines after the line "TEAM STATS". Finding the last line is then simple: just look for the first line containing only whitespace after the start line:
lines <- readLines( Datei )
start <- which(lines == "TEAM STATS") + 2
end <- start + min( grep( "^\\s+$", lines[ start:length(lines) ] ) ) -2
lines <- lines[start:end]
Read the data to data.frame
In your case you meet a couple of complications:
Your header line starts with a #, which by default is recognized as a comment character, so the line is ignored. But even if you switch this behavior off (comment.char = ""), # is not a valid column name.
If we tell read.table to split the columns on whitespace, you end up with one more column in the data than in the header, since the PLAYER column contains whitespace in its cells. So the best option for now is to ignore the header line and let read.table skip it with its default behavior (comment.char = "#"). We also let the PLAYER column be split into two and fix this later.
You won't be able to use the first column as row.names since the values are not unique.
The rows have unequal length, since the POS column is not filled everywhere.
tab <- read.table( text = lines, fill = TRUE, stringsAsFactors=FALSE ) # lines was already cut to start:end above
# fix the PLAYER column
tab$V2 <- paste( tab$V2, tab$V3 )
tab <- tab[-3]
Fix the header
Just split the header line at runs of whitespace and replace the first entry (#) with a valid column name:
colns <- strsplit( lines[1], "\\s+" )[[1]] # the header is the first element of the subsetted lines
colns[1] <- "code"
colnames(tab) <- colns
Fix cases where "POS" was empty
This is done by finding the rows whose last cell contains NA and shifting their cells one position to the right:
colsToFix <- which( is.na(tab[, "SHO%"]) )
tab[ colsToFix, 4:ncol(tab) ] <- tab[ colsToFix, 3:(ncol(tab)-1) ]
tab[ colsToFix, 3 ] <- NA
> str(tab)
'data.frame': 25 obs. of 20 variables:
$ code : chr "93" "91" "61" "88" ...
$ PLAYER: chr "Eichelkraut, Flori" "Müller, Lars" "Alt, Sebastian" "Gross, Arthur" ...
$ POS : chr "F" "F" "D" "F" ...
$ GP : chr "8" "6" "7" "8" ...
$ G : int 10 1 4 3 4 2 0 2 1 0 ...
$ A : int 5 11 5 5 3 4 6 3 3 4 ...
$ PTS : int 15 12 9 8 7 6 6 5 4 4 ...
$ PIM : int 12 10 12 6 2 36 37 29 6 0 ...
$ PPG : int 3 0 1 1 1 1 0 0 1 0 ...
$ PPA : int 1 5 2 2 1 2 4 2 1 1 ...
$ SHG : int 0 1 0 1 1 0 0 0 0 0 ...
$ SHA : int 0 0 1 0 1 0 0 1 0 0 ...
$ GWG : int 2 0 1 0 0 0 0 0 0 0 ...
$ FG : int 1 0 1 1 1 0 0 0 0 0 ...
$ OTG : int 0 0 0 0 0 0 0 0 0 0 ...
$ UAG : int 1 0 1 0 0 0 0 0 0 0 ...
$ ENG : int 0 0 0 0 0 0 0 0 0 0 ...
$ SHOG : int 0 0 0 0 0 0 0 0 0 0 ...
$ SHOA : num 0 0 0 0 0 0 0 0 0 0 ...
$ SHO% : num 0 0 0 0 0 0 0 0 0 0 ...
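The find-then-parse steps above can be sketched end to end on a few made-up lines. The content of raw is hypothetical and much simpler than a real stats file (single-word player names, no empty POS cells), so it only illustrates how the start and end rows are located and how the # header line is skipped:

```r
# Hypothetical file content; a real file would come from readLines(Datei)
raw <- c(
  "Some report header",
  "TEAM STATS",
  "",
  "#  PLAYER  G",
  "93 Smith 10",
  "91 Jones 1",
  "   ",
  "1.AC Coach"
)

# The header sits two lines below the "TEAM STATS" marker
start <- which(raw == "TEAM STATS") + 2
# The block ends just before the first whitespace-only line after the start
end <- start + min(grep("^\\s+$", raw[start:length(raw)])) - 2
block <- raw[start:end]

# The header line starts with "#", so read.table drops it as a comment
tab <- read.table(text = block, fill = TRUE, stringsAsFactors = FALSE)
print(tab)
```

Because the end row is computed from the file content, no fixed nrows value is needed, and the non-player rows after the blank line are never read.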