Convert a number to negative in R

In data like this:
data.frame(com = c("col1","em"), stock1 = c(2.6, 0), aim = c(0, 3.10))
how is it possible to add a minus sign to all values in the row where com is "em"?
Example of expected result:
data.frame (com = c("col1","em"), stock1 = c(2.6, 0), aim = c(0,-3.10))
com stock1 aim
1 col1 2.6 0.0
2 em 0.0 -3.1

Using ifelse:
df1 <- data.frame (com = c("col1","em"), stock1 = c(2.6, 0), aim = c(0,3.10))
df1$aim <- ifelse(df1$com == "em", -df1$aim, df1$aim)
df1
com stock1 aim
1 col1 2.6 0.0
2 em 0.0 -3.1

Another base R option negates every column except com in the matching rows, using a negative column index (find.com drops the com column):
xy <- data.frame (com = c("col1","em"), stock1 = c(2.6, 0), aim = c(0,3.10))
find.com <- -which(names(xy) == "com")
xy[xy$com == "em", find.com] <- -xy[xy$com == "em", find.com]
xy
com stock1 aim
1 col1 2.6 0.0
2 em 0.0 -3.1
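For completeness, a tidyverse sketch of the same idea (assuming dplyr 1.0+ for across()); it flips the sign of every numeric column in the matching rows:
library(dplyr)

df1 <- data.frame(com = c("col1", "em"), stock1 = c(2.6, 0), aim = c(0, 3.10))

# Negate every numeric column wherever com == "em"
df1 <- df1 %>%
  mutate(across(where(is.numeric), ~ ifelse(com == "em", -.x, .x)))
df1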

Ordering a subset of columns by date in R

I have a data frame in which some of the columns (they are dates) are not in the correct order. See:
data1989 <- data.frame("date_fire" = c("1987-02-01", "1987-07-03", "1988-01-01"),
                       "Foresttype" = c("oak", "pine", "oak"),
                       "meanSolarRad" = c(500, 550, 450),
                       "meanRainfall" = c(600, 300, 450),
                       "meanTemp" = c(14, 15, 12),
                       "1988.01.01" = c(0.5, 0.589, 0.66),
                       "1986.06.03" = c(0.56, 0.447, 0.75),
                       "1986.10.19" = c(0.8, NA, 0.83),
                       "1988.01.19" = c(0.75, 0.65, 0.75),
                       "1986.06.19" = c(0.1, 0.55, 0.811),
                       "1987.10.19" = c(0.15, 0.12, 0.780),
                       "1988.01.19" = c(0.2, 0.22, 0.32),
                       "1986.06.19" = c(0.18, 0.21, 0.23),
                       "1987.10.19" = c(0.21, 0.24, 0.250),
                       check.names = FALSE,
                       stringsAsFactors = FALSE)
> data1989
date_fire Foresttype meanSolarRad meanRainfall meanTemp 1988.01.01 1986.06.03 1986.10.19 1988.01.19 1986.06.19 1987.10.19 1988.01.19 1986.06.19 1987.10.19
1 1987-02-01 oak 500 600 14 0.500 0.560 0.80 0.75 0.100 0.15 0.20 0.18 0.21
2 1987-07-03 pine 550 300 15 0.589 0.447 NA 0.65 0.550 0.12 0.22 0.21 0.24
3 1988-01-01 oak 450 450 12 0.660 0.750 0.83 0.75 0.811 0.78 0.32 0.23 0.25
I would like to order the columns by increasing date, and keep the first 5 columns the same. Keep in mind that in my original dataset I have 30 initial columns to be kept the same.
As commented, try to avoid wide-formatted data whose columns encode data elements such as dates, category values, or other indicators. Instead use long-formatted, tidy data, where ordering is much easier, as are aggregation, merging, plotting, and modeling.
Specifically, consider reshape() to melt the date columns into a single time field (here called quarter) paired with a value column. The quarter column is then easy to order:
# RESHAPE WIDE TO LONG
long_data1989 <- reshape(data1989,
                         varying = names(data1989)[6:ncol(data1989)],
                         times = names(data1989)[6:ncol(data1989)],
                         v.names = "value", timevar = "quarter", ids = NULL,
                         new.row.names = 1:1E4, direction = "long")
# ORDER DATES AND RESET row.names
long_data1989 <- `row.names<-`(with(long_data1989,
                                    long_data1989[order(date_fire, quarter), ]),
                               NULL)
long_data1989
If you wanted to use dplyr, here is an alternative. Note that each column name would have to be unique; in your df there were some duplicates (commented out below).
library(dplyr)
data1989 <- data.frame("date_fire" = c("1987-02-01", "1987-07-03", "1988-01-01"),
                       "Foresttype" = c("oak", "pine", "oak"),
                       "meanSolarRad" = c(500, 550, 450),
                       "meanRainfall" = c(600, 300, 450),
                       "meanTemp" = c(14, 15, 12),
                       "1988.01.01" = c(0.5, 0.589, 0.66),
                       "1986.06.03" = c(0.56, 0.447, 0.75),
                       "1986.10.19" = c(0.8, NA, 0.83),
                       "1988.01.19" = c(0.75, 0.65, 0.75),
                       "1986.06.19" = c(0.1, 0.55, 0.811),
                       "1987.10.19" = c(0.15, 0.12, 0.780),
                       # "1988.01.19" = c(0.2, 0.22, 0.32),
                       # "1986.06.19" = c(0.18, 0.21, 0.23),
                       # "1987.10.19" = c(0.21, 0.24, 0.250),
                       check.names = FALSE,
                       stringsAsFactors = FALSE)
# Sort the date column names (replace 6 with the index of the first date column)
sorted_colnames <- sort(names(data1989)[6:ncol(data1989)])
# Reorder the columns (replace 5 with the index of the last non-date column);
# all_of() avoids dplyr's ambiguity warning when selecting with a character vector
data1989 %>%
  select(1:5, all_of(sorted_colnames))
We can convert the column names that are dates to Date class, order them, and then use that order as a column index:
i1 <- grep('^\\d{4}\\.\\d{2}\\.\\d{2}$', names(data1989))
data1989[c(seq_len(i1[1]-1), order(as.Date(names(data1989)[i1], "%Y.%m.%d")) + i1[1]-1)]
# date_fire Foresttype meanSolarRad meanRainfall meanTemp 1986.06.03 1986.06.19 1986.06.19.1 1986.10.19 1987.10.19
#1 1987-02-01 oak 500 600 14 0.560 0.100 0.18 0.80 0.15
#2 1987-07-03 pine 550 300 15 0.447 0.550 0.21 NA 0.12
#3 1988-01-01 oak 450 450 12 0.750 0.811 0.23 0.83 0.78
# 1987.10.19.1 1988.01.01 1988.01.19 1988.01.19.1
#1 0.21 0.500 0.75 0.20
#2 0.24 0.589 0.65 0.22
#3 0.25 0.660 0.75 0.32
Base R solution (similar to @Parfait's):
# Reshape dataframe wide --> long:
df_long <- reshape(data1989,
                   direction = "long",
                   varying = which(!is.na(as.Date(names(data1989), "%Y.%m.%d"))),
                   idvar = which(is.na(as.Date(names(data1989), "%Y.%m.%d"))),
                   v.names = "value",
                   times = na.omit(as.Date(names(data1989), "%Y.%m.%d")),
                   timevar = "date_surveyed",
                   new.row.names = 1:(nrow(data1989) *
                                        length(na.omit(as.Date(names(data1989), "%Y.%m.%d")))))
# Order the data frame and reset the index:
ordered_df_long <- data.frame(df_long[with(df_long, order(date_fire, date_surveyed)), ],
                              row.names = NULL)
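If tidyr is available, pivot_longer() is a more recent way to express the same wide-to-long reshape. A sketch, assuming tidyr 1.0+ and unique column names (e.g. the de-duplicated data1989 from the dplyr answer above):
library(dplyr)
library(tidyr)

long_1989 <- data1989 %>%
  # melt every column whose name looks like YYYY.MM.DD
  pivot_longer(cols = matches("^\\d{4}\\.\\d{2}\\.\\d{2}$"),
               names_to = "date_surveyed",
               values_to = "value") %>%
  mutate(date_surveyed = as.Date(date_surveyed, format = "%Y.%m.%d")) %>%
  arrange(date_fire, date_surveyed)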

Select all values of a variable for which there is data for every year

Say I have some data with two numeric variables ranging from 0 to 1 (it1, it2), a name variable holding the name of the subject each measure belongs to, and a date for every measure, ranging from 2014 to 2017. What I want to do is create a data set that only contains measures for people who have values in every year, and then in the future maybe specify that I only want measures for people with data ranging from 2015 to 2017. Does anybody have a hint on what package or code could help me with my problem? Thanks in advance.
date <- c("2015-11-26", "2015-12-30","2016-11-13", "2014-09-22", "2014-01-13", "2014-07-26", "2016-11-26", "2016-04-04", "2017-04-09", "2017-02-23", "2015-03-22")
names <- c("Max", "Allen", "Allen", "Bob", "Max", "Sarah", "Max", "Sarah", "Max", "Sarah", "Sarah")
it1 <- c(0.6, 0.3, 0.1, 0.2, 0.3, 0.8, 0.8, 0.5, 0.5, 0.3, 0.7)
it2 <- c(0.5, 0.8, 0.1, 0.4, 0.4, 0.4, 0.5, 0.8, 0.6, 0.5, 0.4)
date <- as.Date(date, format = "%Y-%m-%d")
myframe <- data.frame(date, names, it1, it2)
Desired output:
date <- c("2015-11-26", "2014-01-13", "2014-07-26", "2016-11-26", "2016-04-04", "2017-04-09", "2017-02-23", "2015-03-22")
names <- c("Max", "Max", "Sarah", "Max", "Sarah", "Max", "Sarah", "Sarah")
it1 <- c(0.6, 0.3, 0.8, 0.8, 0.5, 0.5, 0.3, 0.7)
it2 <- c(0.5, 0.4, 0.4, 0.5, 0.8, 0.6, 0.5, 0.4)
date <- as.Date(date, format = "%Y-%m-%d")
myframe <- data.frame(date, names, it1, it2)
Create a table of year vs. name and select the rows for those names that appear in all years; no packages are used. sign() turns the counts into 0/1 indicators, so colSums() gives the number of distinct years in which each name appears, and a name qualifies when that equals the total number of years, nrow(tab).
tab <- table(as.POSIXlt(myframe$date)$year + 1900, myframe$names)
subset(myframe, names %in% colnames(tab)[colSums(sign(tab)) == nrow(tab)])
giving:
date names it1 it2
1 2015-11-26 Max 0.6 0.5
5 2014-01-13 Max 0.3 0.4
6 2014-07-26 Sarah 0.8 0.4
7 2016-11-26 Max 0.8 0.5
8 2016-04-04 Sarah 0.5 0.8
9 2017-04-09 Max 0.5 0.6
10 2017-02-23 Sarah 0.3 0.5
11 2015-03-22 Sarah 0.7 0.4
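For reference (computed from the sample data above), the intermediate tab looks like this; only Max and Sarah have a nonzero count in all four years:
tab
#        Allen Bob Max Sarah
#   2014     0   1   1     1
#   2015     1   0   1     1
#   2016     1   0   1     1
#   2017     0   0   1     1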
With lubridate, keep the rows of each name whose years cover every year observed in the chosen range:
library(lubridate)
myframe[with(data = myframe[year(myframe$date) >= 2014 & year(myframe$date) <= 2017, ],
             expr = ave(year(date), names, FUN = function(x)
               all(year(date) %in% x))) == 1, ]
# date names it1 it2
#1 2015-11-26 Max 0.6 0.5
#5 2014-01-13 Max 0.3 0.4
#6 2014-07-26 Sarah 0.8 0.4
#7 2016-11-26 Max 0.8 0.5
#8 2016-04-04 Sarah 0.5 0.8
#9 2017-04-09 Max 0.5 0.6
#10 2017-02-23 Sarah 0.3 0.5
#11 2015-03-22 Sarah 0.7 0.4
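A dplyr sketch of the same idea: keep the names whose distinct years match the distinct years present in the whole data (to restrict the range to, say, 2015-2017, filter on year(date) first and compare against 3):
library(dplyr)
library(lubridate)

myframe %>%
  group_by(names) %>%
  # a name qualifies if it covers every year present in the data
  filter(n_distinct(year(date)) == n_distinct(year(myframe$date))) %>%
  ungroup()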

Getting p values for groupwise correlation using the dplyr package

I am trying to run correlations between some variables in a data frame. I have one character vector (Group) and the rest are numeric.
My dataframe:
Group V1 V2 V3 V4 V5
NG -4.5 3.5 2.4 -0.5 5.5
NG -5.4 5.5 5.5 1.0 2.0
GL 2.0 1.5 -3.5 2.0 -5.5
GL 3.5 6.5 -2.5 1.5 -2.5
GL 4.5 1.5 -6.5 1.0 -2.0
Following is my code:
library(dplyr)
dataframe %>%
group_by(Group) %>%
summarize(COR=cor(V3,V4))
Here is my output:
Group COR
<chr> <dbl>
1 GL 0.1848529
2 NG 0.1559912
How do I edit this code to get the p-values? Any help would be appreciated! I have looked elsewhere but nothing is working. Thanks!!
You could try corrplot if you want to visualize pairwise correlations:
library(corrplot)
df_cor <- cor(df[,sapply(df, is.numeric)])
corrplot(df_cor, method="color", type="upper", order="hclust")
In the resulting plot, positive correlations are displayed in blue and negative correlations in red, with the color intensity proportional to the correlation coefficients.
#sample data
> dput(df)
structure(list(Group = structure(c(2L, 2L, 1L, 1L, 1L), .Label = c("GL",
"NG"), class = "factor"), V1 = c(-4.5, -5.4, 2, 3.5, 4.5), V2 = c(3.5,
5.5, 1.5, 6.5, 1.5), V3 = c(2.4, 5.5, -3.5, -2.5, -6.5), V4 = c(-0.5,
1, 2, 1.5, 1), V5 = c(5.5, 2, -5.5, -2.5, -2)), .Names = c("Group",
"V1", "V2", "V3", "V4", "V5"), class = "data.frame", row.names = c(NA,
-5L))
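Note that corrplot visualizes the correlations but does not return p-values. A minimal dplyr sketch for the actual question, using base R's cor.test() (caveat: cor.test() needs at least 3 complete pairs per group, so the two-row NG group in this toy data would error; it works on groups of adequate size):
library(dplyr)

dataframe %>%
  group_by(Group) %>%
  summarize(COR = cor(V3, V4),
            p.value = cor.test(V3, V4)$p.value)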

How to extract rows (using Loop) from a data frame and save it in another data frame

I have 2 files (as data frames); the first one consists of three columns, let's say:
POS Freq1 Freq2
23 0.5 0.45
48 0.7 0.55
05 0.8 0.65
and so on.
the second file consists of 2 columns:
Start End
1 10
25 50
60 75
and so on.
What I need to do is extract the rows in the first file that have POS values >= Start and <= End; that is, I want to check the following:
is 23 >=1 && <=10 (False)
is 23 >=25 && <=50 (False)
and keep checking all the values in Start and End until a TRUE is found
is 48 >= 1 && <=10 (False)
is 48 >=25 && <=50 (TRUE)
Stop checking the rest
is 5 >=1 && <=10 (TRUE)
Stop checking the rest
Then all values of POS that give a TRUE should be stored in a data frame together with the rest of the info in file 1. So what I want to get at the end is this (assuming that all comparisons for the first row are false):
POS Freq1 Freq2
48 0.7 0.55
5 0.8 0.65
I tried the following in R:
Trial = NULL
for (i in 1:nrow(file1)) {
  for (j in 1:nrow(file2)) {
    if (file1$POS[[i]] >= file2$Start[[j]] & file1$POS[[i]] <= file2$End[[j]])
      Trial = c(Trial, file1[i, ])
    else stop("Check the file Trial")
  }
}
Trial
## $POS
## [1] 48
## $Freq1
## [1] 0.7
## $Freq2
## [1] 0.55
And this is not what I am after.
Appreciate your help.
Try:
ddf = structure(list(POS = c(23L, 48L, 5L), Freq1 = c(0.5, 0.7, 0.8
), Freq2 = c(0.45, 0.55, 0.65)), .Names = c("POS", "Freq1", "Freq2"
), class = "data.frame", row.names = c(NA, -3L))
refdf = structure(list(Start = c(1L, 25L, 60L), End = c(10L, 50L, 75L
)), .Names = c("Start", "End"), class = "data.frame", row.names = c(NA,
-3L))
ddf
# POS Freq1 Freq2
#1 23 0.5 0.45
#2 48 0.7 0.55
#3 5 0.8 0.65
refdf
# Start End
#1 1 10
#2 25 50
#3 60 75
outdf = data.frame(POS = numeric(), Freq1 = numeric(), Freq2 = numeric())
for (i in 1:nrow(ddf)) {
  for (j in 1:nrow(refdf)) {
    # inclusive bounds, matching the question's >= / <= requirement
    if (ddf[i, 1] >= refdf[j, 1] && ddf[i, 1] <= refdf[j, 2]) {
      outdf[nrow(outdf) + 1, ] = ddf[i, ]
      break  # match found; stop checking the remaining intervals
    }
  }
}
outdf
# POS Freq1 Freq2
#2 48 0.7 0.55
#3 5 0.8 0.65

Select rows by criteria in another data.frame without for loop in R

This is related to question: How to extract rows (using Loop) from a data frame and save it in another data frame
If the 'POS' of ddf lies between any 'Start' and 'End' of refdf, it needs to be included in outdf, which has the same structure as ddf. I could manage it with a 'for' loop, but can it be done without one?
ddf = structure(list(POS = c(23L, 48L, 5L), Freq1 = c(0.5, 0.7, 0.8
), Freq2 = c(0.45, 0.55, 0.65)), .Names = c("POS", "Freq1", "Freq2"
), class = "data.frame", row.names = c(NA, -3L))
refdf = structure(list(Start = c(1L, 25L, 60L), End = c(10L, 50L, 75L
)), .Names = c("Start", "End"), class = "data.frame", row.names = c(NA,
-3L))
ddf
# POS Freq1 Freq2
#1 23 0.5 0.45
#2 48 0.7 0.55
#3 5 0.8 0.65
refdf
# Start End
#1 1 10
#2 25 50
#3 60 75
outdf = data.frame(POS = numeric(), Freq1 = numeric(), Freq2 = numeric())
for (i in 1:nrow(ddf)) for (j in 1:nrow(refdf)) {
  if (ddf[i, 1] > refdf[j, 1] && ddf[i, 1] < refdf[j, 2]) {
    outdf[nrow(outdf) + 1, ] = ddf[i, ]
    next
  }
}
outdf
# POS Freq1 Freq2
#2 48 0.7 0.55
#3 5 0.8 0.65
I tried the following, but it does not work:
apply(ddf,1,function(x){print(x);ifelse(x[1]>refdf$Start & x[1]<refdf$End, x,"")})
This isn't particularly efficient in space for large problems, but it doesn't use for:
ddf[ddf$POS %in% unlist(apply(refdf, 1, function(x) seq(x[1],x[2]))),]
## POS Freq1 Freq2
## 2 48 0.7 0.55
## 3 5 0.8 0.65
All the allowed values of POS are computed by the unlist(apply) expression. This of course assumes that POS contains only integral values.
Here is a way. It doesn't require integer values, but it's also not going to be especially efficient:
pow <- cbind(expand.grid(ddf$POS, refdf$Start), Var3=expand.grid(ddf$POS, refdf$End)$Var2)
boom <- pow[which(pow$Var1 > pow$Var2 & pow$Var1 < pow$Var3), 'Var1']
ddf[ddf$POS %in% boom, ]
# POS Freq1 Freq2
#2 48 0.7 0.55
#3 5 0.8 0.65
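Another loop-free sketch: test each POS against all intervals at once with sapply() and any() (base R only; written with the inclusive >=/<= bounds from the original question):
# TRUE for each POS that falls inside at least one [Start, End] interval
keep <- sapply(ddf$POS, function(p) any(p >= refdf$Start & p <= refdf$End))
ddf[keep, ]
# POS Freq1 Freq2
#2 48 0.7 0.55
#3 5 0.8 0.65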
