Remove single characters without changing the numbers from an r dataframe - r

My dataframe has many arrows, ">" and "<"s in it alongside some of the element values. I want to remove these characters but keep the numbers. I only know how to replace the entire element with NA with the following code.
df <- apply(df, 1:2, gsub, pattern = "<|>", replacement = "")
Will someone please help me edit this so that it keeps the element numbers too, instead of throwing the entire thing out?
structure(list(`Analyte Sample` = c(1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14), A = c("4190", "6665", "7435", "2052",
"783", "322", "199", "90", "46", "17", "8", "3", "3", "<1↓"
), B = c("11569", "6677", "3852", "983.88", "589", "359", "203",
"68", "33", "12", "6", "<2↓", "4", "<1↓"), C = c("20453",
"7699", "2499", "707.98", "412", "328", "156", "88", "39", "27",
"17", "<1↓", "<3↓", "<1↓"), D = c("7893", ">20000↑",
"1623", "685.64", "321", "644", "112", "65", "35", "29", "9",
"5", "<3↓", "<1↓"), E = c("320", "15444", "2049", "1065",
"389", "365", "145", "77", "38", "16", "9", "6", "<2↓", "<2↓"
), F = c("7438", ">21999↑", "3472", "1057", "563", "401", "167",
"89", "46", "19", "6", "<1↓", "<1↓", "<1↓"), G = c(7345,
9001, 2473, 1138, 516, 403, 134, 81, 37, 17, 8, 6, 4, 3), H = c("9004",
"3998", "2299", "964.88", "499", "341", "112", "88", "39", "32",
"<29↓", "<30↓", "<31↓", "<29↓"), I = c("8434", "8700",
"2217", "1263", "567", "352", "153", "80", "43", "18", "9", "2",
"3", "<1↓"), J = c("7734", "6733", "2092", "1115", "637", "332",
"155", "82", "37", "17", "10", "4", "1", "<1↓"), K = c(">3718↑",
">3000↑", "2118", "862.13", "426", "355", "143", "78", "44",
"22", "11", "<4↓", "<4↓", "<3↓"), L = c(6345, 7688, 2311,
1195, 647, 366, 177, 83, 41, 20, 8, 6, 3, 2), M = c("4222", ">25587↑",
"1846", "814.61", "422", "314", "154", "86", "41", "27", "21",
"<2↓", "<2↓", "<3↓"), N = c("6773", "8934", "2381", "1221",
"677", "356", "146", "89", "40", "17", "10", "5", "2", "<2↓"
), O = c(">2200↑", ">2133↑", ">2000↑", "564.5", "226",
"476", "111", "60", "32", "36", "18", "<10↓", "<1↓", "<2↓"
)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-14L), spec = structure(list(cols = list(`Analyte Sample` = structure(list(), class = c("collector_double",
"collector")), A = structure(list(), class = c("collector_character",
"collector")), B = structure(list(), class = c("collector_character",
"collector")), C = structure(list(), class = c("collector_character",
"collector")), D = structure(list(), class = c("collector_character",
"collector")), E = structure(list(), class = c("collector_character",
"collector")), F = structure(list(), class = c("collector_character",
"collector")), G = structure(list(), class = c("collector_double",
"collector")), H = structure(list(), class = c("collector_character",
"collector")), I = structure(list(), class = c("collector_character",
"collector")), J = structure(list(), class = c("collector_character",
"collector")), K = structure(list(), class = c("collector_character",
"collector")), L = structure(list(), class = c("collector_double",
"collector")), M = structure(list(), class = c("collector_character",
"collector")), N = structure(list(), class = c("collector_character",
"collector")), O = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))

You can use lapply(), which returns a list, and assign the result back to df[]. The [] keeps the original attributes, i.e. the data.frame class, so df becomes what you want.
df[] <- lapply(df, gsub, pattern = "<|>", replacement = "")
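To see what the [] is doing, here is a minimal sketch on a made-up two-column data frame (toy and as_list are hypothetical names, not from the question). It only illustrates that lapply() alone hands back a plain list, while assigning into toy[] keeps the data frame.

```r
# Hypothetical toy data standing in for the question's df
toy <- data.frame(A = c("4190", "<1"), B = c(">20000", "33"),
                  stringsAsFactors = FALSE)

# lapply() on its own returns a plain list, not a data frame
as_list <- lapply(toy, gsub, pattern = "<|>", replacement = "")

# Assigning into toy[] replaces the columns but keeps the data.frame class
toy[] <- as_list
class(toy)
```

The columns are still character after this step; wrap them in as.numeric() if you need numbers.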

I think in your case the best would be to use a regular expression. Using tidyverse:
df %>% mutate_at(vars(A:O), ~ as.numeric(gsub("[^0-9]*([0-9]*).*", "\\1", .)))
If you specifically want only to change values which start with a < or >, you do the following:
df %>% mutate_at(vars(A:O), ~ as.numeric(gsub("[<>]*([0-9]*).*", "\\1", .)))
Of course, you can also use apply... but mind the way apply changes the data frame into a matrix before applying the function (the columns which are numbers will have spaces prefixed, so we need to include space in the pattern):
apply(df, 2, function(x) gsub("[ <>]*([0-9]*).*", "\\1", x))
The pattern [0-9]* matches a digit any number of times. The pattern [^0-9]* matches anything but a digit any number of times.
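One thing to watch for: as far as I can tell, the digit-only capture group above stops at the decimal point, so a value such as "983.88" comes back as 983. If the decimals matter, a possible variation is to delete everything that is not a digit or a dot. This is only a sketch on a made-up vector (vals and cleaned are hypothetical names), and it assumes the values contain at most one dot and no minus signs.

```r
# Hypothetical sample values; "\u2193" and "\u2191" are the down and up arrows
vals <- c("983.88", "<1\u2193", ">20000\u2191", "589")

# Drop every character that is not a digit or a decimal point, then convert
cleaned <- as.numeric(gsub("[^0-9.]", "", vals))
cleaned
```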

You can try one of these options:
#Code 1
df <- apply(df, 1:2, function(x) gsub(pattern = "<|>", replacement = "",x))
#Code 2
df <- sapply(df,function(x) gsub(pattern = "<|>", replacement = "",x))
Just be careful: the output is a matrix, so you will have to convert it back to a data frame using as.data.frame().


Creating unique id if the difference between dates is more than 7 in R

I would like to go over each row, and if the difference between the current row and the previous row is more than 7, then I would like to assign the current row a unique id, otherwise the current and the previous row would have the same id. Please note that the zipcode is not relevant for the unique id. Here's my data:
z<- structure(list(zipcode = c(96717L, 96730L, 96825L, 96826L, 96720L,
96756L, 96740L, 96819L, 96734L, 96740L, 96714L, 96714L, 96703L,
90017L, 96796L, 96714L, 96714L, 96761L, 96712L, 96712L), date = structure(c(8809,
8809, 8847, 8848, 8989, 9041, 9161, 9188, 9201, 9293, 9403, 9437,
9437, 9437, 9437, 9443, 9444, 9457, 9457, 9483), class = "Date")), row.names = c(NA,
-20L), class = c("data.table", "data.frame"))
Here's my desired output:
y<- structure(list(zipcode = c(96717, 96730, 96825, 96826, 96720,
96756, 96740, 96819, 96734, 96740, 96714, 96714, 96703, 90017,
96796, 96714, 96714, 96761, 96712, 96712), date = structure(c(761097600,
761097600, 764380800, 764467200, 776649600, 781142400, 791510400,
793843200, 794966400, 802915200, 812419200, 815356800, 815356800,
815356800, 815356800, 815875200, 815961600, 817084800, 817084800,
819331200), class = c("POSIXct", "POSIXt"), tzone = "UTC"), difference_date = c("NA",
"0", "38", "1", "141", "52", "120", "27", "13", "92", "110",
"34", "0", "0", "0", "6", "1", "13", "0", "26"), id = c(1, 1,
2, 2, 3, 4, 5, 6, 8, 9, 10, 11, 11, 11, 11, 11, 11, 12, 12, 13
)), class = c("data.table", "data.frame"))
Here's what I did:
z[,date:=as.Date(date)][,diff_days:=c(NA,diff.Date(date,lag=1L,differences=1L))][, event_id :=1:.N,.(diff_days<=7)]
Here's what I got:
First I converted your example data to data.table format:
z <-
Then calculate the difference date:
z[, date_diff := as.numeric(difftime(date, shift(date), units = "days"))] # shift() is data.table's lag; base lag() does not shift a plain vector
And finally, add the id :
z[, event_id := cumsum(c(TRUE, diff(date) > 7))]
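The cumsum() step may be easier to follow on a handful of dates. The sketch below uses made-up dates (dates, gap_days and id are hypothetical names) and base R only: every gap of more than 7 days contributes a TRUE, and the running sum of those TRUEs is the id.

```r
# Hypothetical, already sorted dates
dates <- as.Date(c("2020-01-01", "2020-01-03", "2020-01-20",
                   "2020-01-21", "2020-03-01"))

# Gap in days between each row and the previous one
gap_days <- as.numeric(diff(dates))

# The first row always starts a new id; later rows start one when the gap exceeds 7
id <- cumsum(c(TRUE, gap_days > 7))
id
```

Here the gaps are 2, 17, 1 and 40 days, so the ids come out as 1, 1, 2, 2, 3.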

subtracting multiple columns from each other

I have a large dataset and I want to subtract specific columns from each other based on their position. I want to subtract column 2 from column 8, column 3 from column 9 and column 4 from column 10.
Thanks a lot
structure(list(Stamp_summertime = structure(c(1546684744, 1546685858,
1546687004, 1547030061, 1547030835, 1547031816), tzone = "UTC", class = c("POSIXct",
"POSIXt")), X26.013 = c(0.138461, 0.138461, 0.138461, 0.144421,
0.144421, 0.144421), X27.024 = c(0.0752111, 0.0752111, 0.0752111,
0.0426819, 0.0426819, 0.0426819), X33.031 = c(3.75788, 3.75788,
3.75788, 3.12581, 3.12581, 3.12581), jar_camp = c("1_pf1.1",
"2_pf1.1", "3_pf1.1", "1_pf2.1", "2_pf2.1", "3_pf2.1"), jar = structure(c(1L,
12L, 23L, 1L, 12L, 23L), .Label = c("1", "10_blank", "11", "12",
"13", "14", "15", "16_blank", "17", "18", "19", "2", "20_blank",
"21", "22", "23", "24", "25", "26", "27", "28", "29", "3", "30_blank",
"31", "32", "33", "34", "35", "36", "37", "38_blank", "39", "4",
"40", "41", "42", "43", "44_blank", "45", "46", "47", "48", "49",
"5_blank", "blank_50", "51", "52", "53", "54", "55", "56", "57",
"6", "7", "8", "9", "X_blank"), class = "factor"), campaign = c("pf1.1",
"pf1.1", "pf1.1", "pf2.1", "pf2.1", "pf2.1"), i.X26.013 = c(0.144658,
0.21502, 0.458296, 0.191571, 0.0789067, 0.711814), i.X27.024 = c(0.0595547,
0.0651149, 0.146772, 0.0997815, 0.0539976, 0.185398), i.X33.031 = c(5.4066,
3.30406, 18.0479, 6.13854, 1.3028, 22.2226)), sorted = "Stamp_summertime", class = c("data.table",
"data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x00000237a3d91ef0>)
We can create 2 vectors of position and subtract the columns directly. Since you have data.table we use ..column_number to select columns by position.
col1group <- 2:4
col2group <- 8:10
df[, ..col1group] - df[, ..col2group]
If you want to add them as new columns to original data you can rename them and cbind
cbind(df, setNames(df[, ..col1group] - df[, ..col2group],
paste0(names(df)[col1group], '_diff')))
Something like the following computes the subtractions in the question.
nms <- names(df1)
iCols <- grep("^i\\.", nms, value = TRUE)
Cols <- sub("^i\\.", "", iCols)
df1[, lapply(seq_along(Cols), function(i) get(Cols[i]) - get(iCols[i]))]
# V1 V2 V3
#1: -0.0061970 0.0156564 -1.64872
#2: -0.0765590 0.0100962 0.45382
#3: -0.3198350 -0.0715609 -14.29002
#4: -0.0471500 -0.0570996 -3.01273
#5: 0.0655143 -0.0113157 1.82301
#6: -0.5673930 -0.1427161 -19.09679
Following Ronak Shah's answer I realized that the code below also works.
df1[, ..Cols] - df1[, ..iCols]
The numeric results are the same but the column names are the vector Cols.
To create new columns, try
newCols <- paste(Cols, "diff", sep = "_")
df1[, (newCols) := lapply(seq_along(Cols), function(i) get(Cols[i]) - get(iCols[i]))]
Base R solution:
idx <- c(2, 3, 4)
jdx <- c(8, 9, 10)
Using lapply() and column binding the list:
setNames(do.call("cbind", lapply(seq_along(idx), function(i){
df[, jdx[i], drop = FALSE] - df[, idx[i], drop = FALSE]
})), c(paste("x", jdx, idx, sep = "_")))
Using sapply() and coercing vectors to a data.frame:
setNames(data.frame(sapply(seq_along(idx), function(i){
df[, jdx[i], drop = FALSE] - df[, idx[i], drop = FALSE]
})), c(paste("x", jdx, idx, sep = "_")))
Using Map() and Reduce() and column binding to original data.frame:
cbind(df, setNames(Reduce(cbind, Map(function(i){
df[, jdx[i], drop = FALSE] - df[, idx[i], drop = FALSE]
}, seq_along(idx))), c(paste("x", jdx, idx, sep = "_"))))
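For a plain data.frame the same idea can also be written without a loop, since subtracting two equally sized data frames works column by column. A minimal sketch on made-up data follows; toy, diffs and result are hypothetical names, and it assumes a base data.frame rather than a data.table.

```r
# Hypothetical toy data: columns 2:3 are subtracted from columns 4:5
toy <- data.frame(id = 1:3,
                  a = c(1, 2, 3), b = c(10, 20, 30),
                  a2 = c(5, 5, 5), b2 = c(1, 1, 1))
idx <- c(2, 3)
jdx <- c(4, 5)

# Column-wise difference: column 4 minus column 2, column 5 minus column 3
diffs <- toy[jdx] - toy[idx]
names(diffs) <- paste(names(toy)[jdx], "minus", names(toy)[idx], sep = "_")

# Bind the new columns to the original data
result <- cbind(toy, diffs)
result
```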

Plotting lines of multiple groups in ggplot2 gives a weird result

I have done species accumulation curves and would like to plot the SAC results of different substrate size classes in the same ggplot, with expected species richness on the y-axis and number of sites sampled on the x-axis. The data features a cumulative number of samples in each size class (column "sites"), the expected species richness (column "richness"), and substrate size classes 10, 20 and 30 (column "sc").
sites richness sc
1 1 0.6696915 10
2 2 1.2008513 10
3 3 1.6387310 10
4 4 2.0128472 10
5 5 2.3424933 10
6 6 2.6403239 10
sites richness sc
2836 1 1.000000 20
2837 2 1.703442 20
2838 3 2.249188 20
2839 4 2.706618 20
2840 5 3.110651 20
2841 6 3.479173 20
I want each sizeclass to have unique linetype. I used the following code for ggplot:
sac_kaikki<-ggplot(sac_data, aes(x=sites, y=richness,group=sc)) +
theme(axis.title.y = element_blank())+
theme(axis.title.x = element_blank())
However, instead of getting three neat lines in different linetypes, I got this jumbly muddly messy thing with more stripes than a herd of zebras. I am sure the solution is rather simple, but for the life of me I am not able to figure it out.
// as Brookes kindly pointed out I should add some reproducible data, here is a subset of my data with dput, featuring 10 first observations of size classes 10 and 20:
structure(list(sites = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), richness = c(0.669691470054462,
1.20085134466255, 1.63873100707468, 2.01284716414471, 2.34249332096243,
2.64032389106845, 2.91468283244696, 3.17111526890278, 3.41334794519086,
3.64392468817362), sc = c("10", "10", "10", "10", "10", "10",
"10", "10", "10", "10")), .Names = c("sites", "richness", "sc"
), row.names = c(NA, 10L), class = "data.frame")
structure(list(sites = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), richness = c(0.999999999999987,
1.70344155844158, 2.24918831168832, 2.70661814764865, 3.11065087175364,
3.47917264517669, 3.82165739030286, 4.14341144680334, 4.44765475554031,
4.73653870494466), sc = c("20", "20", "20", "20", "20", "20",
"20", "20", "20", "20")), .Names = c("sites", "richness", "sc"
), row.names = 2836:2845, class = "data.frame")
// okay so for whatever reason, the plot works just fine if I plot only two sizeclasses, but including the third one produces the absurd plot I posted a picture of.
structure(list(sites = 1:10, richness = c(0.42857142857143, 0.838095238095238,
1.22932330827066, 1.60300751879699, 1.95989974937343, 2.30075187969924,
2.62631578947368, 2.93734335839598, 3.23458646616541, 3.5187969924812
), sc = c("30", "30", "30", "30", "30", "30", "30", "30", "30",
"30")), .Names = c("sites", "richness", "sc"), row.names = c(NA,
10L), class = "data.frame")
Works fine for me with your sample data:
a <- structure(list(sites = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), richness = c(0.669691470054462,
1.20085134466255, 1.63873100707468, 2.01284716414471, 2.34249332096243,
2.64032389106845, 2.91468283244696, 3.17111526890278, 3.41334794519086,
3.64392468817362), sc = c("10", "10", "10", "10", "10", "10",
"10", "10", "10", "10")), .Names = c("sites", "richness", "sc"
), row.names = c(NA, 10L), class = "data.frame")
b <- structure(list(sites = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), richness = c(0.999999999999987,
1.70344155844158, 2.24918831168832, 2.70661814764865, 3.11065087175364,
3.47917264517669, 3.82165739030286, 4.14341144680334, 4.44765475554031,
4.73653870494466), sc = c("20", "20", "20", "20", "20", "20",
"20", "20", "20", "20")), .Names = c("sites", "richness", "sc"
), row.names = 2836:2845, class = "data.frame")
c <- structure(list(sites = 1:10, richness = c(0.42857142857143, 0.838095238095238,
1.22932330827066, 1.60300751879699, 1.95989974937343, 2.30075187969924,
2.62631578947368, 2.93734335839598, 3.23458646616541, 3.5187969924812
), sc = c("30", "30", "30", "30", "30", "30", "30", "30", "30",
"30")), .Names = c("sites", "richness", "sc"), row.names = c(NA,
10L), class = "data.frame")
library(dplyr) # for bind_rows()
library(ggplot2)
sac_data <- bind_rows(a, b, c)
ggplot(sac_data, aes(sites, richness, group = sc)) +
geom_line(aes(linetype = sc))

Convert column types to their read_csv() column type in R

One of my favorite things about library(readr) and the read_csv() function in R is that it almost always sets the column types of my data to the correct class. However, I am currently working with an API in R that returns data to me as a dataframe of all character classes, even if the data is clearly numbers. Take this dataframe for example, which has some sports data:
structure(list(isUnplayed = c("false", "false", "false"), isInProgress =
c("false", "false", "false"), isCompleted = c("true", "true", "true"), awayScore = c("106",
"95", "95"), homeScore = c("94", "97", "111"), game.ID = c("31176",
"31177", "31178"), = c("2015-10-27", "2015-10-27",
"2015-10-27"), game.time = c("8:00PM", "8:00PM", "10:30PM"),
game.location = c("Philips Arena", "United Center", "Oracle Arena"
), game.awayTeam.ID = c("88", "86", "110"), game.awayTeam.City = c("Detroit",
"Cleveland", "New Orleans"), game.awayTeam.Name = c("Pistons",
"Cavaliers", "Pelicans"), game.awayTeam.Abbreviation = c("DET",
"CLE", "NOP"), game.homeTeam.ID = c("91", "89", "101"), game.homeTeam.City = c("Atlanta",
"Chicago", "Golden State"), game.homeTeam.Name = c("Hawks",
"Bulls", "Warriors"), game.homeTeam.Abbreviation = c("ATL",
"CHI", "GSW"), quarterSummary.quarter = list(structure(list(
`#number` = c("1", "2", "3", "4"), awayScore = c("25",
"23", "34", "24"), homeScore = c("25", "18", "23", "28"
)), .Names = c("#number", "awayScore", "homeScore"), class = "data.frame", row.names = c(NA,
4L)), structure(list(`#number` = c("1", "2", "3", "4"), awayScore = c("17",
"23", "28", "27"), homeScore = c("26", "20", "25", "26")), .Names = c("#number",
"awayScore", "homeScore"), class = "data.frame", row.names = c(NA,
4L)), structure(list(`#number` = c("1", "2", "3", "4"), awayScore = c("35",
"14", "26", "20"), homeScore = c("39", "20", "35", "17")), .Names = c("#number",
"awayScore", "homeScore"), class = "data.frame", row.names = c(NA,
4L)))), .Names = c("isUnplayed", "isInProgress", "isCompleted",
"awayScore", "homeScore", "game.ID", "", "game.time",
"game.location", "game.awayTeam.ID", "game.awayTeam.City", "game.awayTeam.Name",
"game.awayTeam.Abbreviation", "game.homeTeam.ID", "game.homeTeam.City",
"game.homeTeam.Name", "game.homeTeam.Abbreviation", "quarterSummary.quarter"
), class = "data.frame", row.names = c(NA, 3L))
It is quite a hassle to deal with this dataframe once it is returned by the API, given the class types. I've come up with a sort of a hack to update the column classes, which is as follows:
write_csv(mydf, 'mydf.csv')
mydf <- read_csv('mydf.csv')
By writing to CSV and then re-reading the CSV using read_csv(), the dataframe columns update. Unfortunately I am left with a CSV file in my directory that I don't want. Is there a way to update the columns of an R dataframe to their 'read_csv()' column classes, without actually having to write the CSV?
Any help is appreciated!
You don't need to write and read the data if you just want readr to guess your column types. You could use readr::type_convert() for that:
library(magrittr) # for %>%
iris %>%
dplyr::mutate(Sepal.Width = as.character(Sepal.Width)) %>%
readr::type_convert() # Sepal.Width is guessed as a number again
For comparison:
iris %>%
dplyr::mutate(Sepal.Width = as.character(Sepal.Width)) # Sepal.Width stays character
Try this code. type.convert() converts a character vector to logical, integer, numeric, complex or factor as appropriate.
indx <- which(sapply(df, is.character))
df[, indx] <- lapply(df[, indx], type.convert)
indx <- which(sapply(df, is.factor))
df[, indx] <- lapply(df[, indx], as.character)
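If you would rather skip the round trip through factors, type.convert() also takes as.is = TRUE, which leaves non-numeric strings as character. Here is a minimal sketch on a made-up data frame (raw is a hypothetical name, loosely modelled on the question's columns). It does not handle list columns such as quarterSummary.quarter.

```r
# Hypothetical all-character data, as an API might return it
raw <- data.frame(score = c("106", "95"),
                  done = c("true", "false"),
                  city = c("Detroit", "Chicago"),
                  stringsAsFactors = FALSE)

# Convert each column; as.is = TRUE keeps plain strings as character
raw[] <- lapply(raw, type.convert, as.is = TRUE)
sapply(raw, class)
```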

Computing angle between two vectors (with one vector having a specific X,Y position)

I am trying to compute the angle between two vectors, wherein one vector is fixed and the other vector is constantly moving. I already know the math in this and I found a code before:
theta <- acos( sum(a*b) / ( sqrt(sum(a * a)) * sqrt(sum(b * b)) ) )
I tried defining my a as:
and my b as:
b <- NM[, c("X","Y")]
When I apply the theta function I get:
Warning message:
In acos(sum(a * b)/(sqrt(sum(a * a)) * sqrt(sum(b * b)))) : NaNs produced
I would appreciate help to solve this.
And here is my sample data:
structure(list(A = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label =
c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12",
"13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23",
"24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34",
"35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45",
"46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56",
"57", "58", "59", "60", "61", "62", "63", "64", "65", "66", "67",
"68", "69", "70", "71", "72", "73", "74", "75", "76", "77", "78",
"79", "80", "81", "82", "83", "84", "85", "86", "87", "88", "89",
"90", "91", "92", "93", "94", "95", "96", "97", "98", "99", "100",
"101", "102", "103", "104", "105", "106", "107", "108", "109",
"110"), class = "factor"), T = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6 ), X =
c(528.04, 528.04, 528.04, 528.04, 528.04, 528.04), Y = c(10.32,
10.32, 10.32, 10.32, 10.32, 10.32), V = c(0, 0, 0, 0, 0, 0),
GD = c(0, 0, 0, 0, 0, 0), ND = c(NA, 0, 0, 0, 0, 0), ND2 = c(NA,
0, 0, 0, 0, 0), TID = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("t1",
"t10", "t100", "t101", "t102", "t103", "t104", "t105", "t106",
"t107", "t108", "t109", "t11", "t110", "t12", "t13", "t14",
"t15", "t16", "t17", "t18", "t19", "t2", "t20", "t21", "t22",
"t23", "t24", "t25", "t26", "t27", "t28", "t29", "t3", "t30",
"t31", "t32", "t33", "t34", "t35", "t36", "t37", "t38", "t39",
"t4", "t40", "t41", "t42", "t43", "t44", "t45", "t46", "t47",
"t48", "t49", "t5", "t50", "t51", "t52", "t53", "t54", "t55",
"t56", "t57", "t58", "t59", "t6", "t60", "t61", "t62", "t63",
"t64", "t65", "t66", "t67", "t68", "t69", "t7", "t70", "t71",
"t72", "t73", "t74", "t75", "t76", "t77", "t78", "t79", "t8",
"t80", "t81", "t82", "t83", "t84", "t85", "t86", "t87", "t88",
"t89", "t9", "t90", "t91", "t92", "t93", "t94", "t95", "t96",
"t97", "t98", "t99"), class = "factor")), .Names = c("A", "T", "X", "Y", "V", "GD", "ND", "ND2", "TID"), row.names = c(NA, 6L),
class = "data.frame")
Your function is not vectorized. Try this:
theta <- function(x,Y) apply(Y,1,function(y,x) acos( sum(x*y) / ( sqrt(sum(x^2)) * sqrt(sum(y^2)) ) ),x=x)
b <- DF[, c("X","Y")]
# 1 2 3 4 5 6
#0.6412264 0.6412264 0.6412264 0.6412264 0.6412264 0.6412264
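On the NaN warning itself: acos() returns NaN whenever its argument falls outside [-1, 1], which can happen, for example, through floating-point rounding when two vectors are almost parallel. A minimal sketch of a row-wise version that clamps the cosine first is below. angle_to_rows, B and angles are hypothetical names, and the sample vectors are made up rather than taken from the question's data.

```r
# Angle between a fixed vector a and every row of B (one vector per row)
angle_to_rows <- function(a, B) {
  B <- as.matrix(B)
  # Dot product of each row with a, divided by the product of the lengths
  cosines <- as.vector(B %*% a) / (sqrt(sum(a^2)) * sqrt(rowSums(B^2)))
  # Clamp to [-1, 1] so rounding error cannot push acos() into NaN
  acos(pmin(pmax(cosines, -1), 1))
}

B <- rbind(c(1, 0), c(0, 2), c(-3, 0), c(1, 1))
angles <- angle_to_rows(c(1, 0), B)
angles
```

A zero-length vector still gives NaN, because the angle is undefined in that case.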
There is a problem with the acos and atan functions in this application: they cannot return angles for the full circle. acos only returns values in [0, pi] and atan only in (-pi/2, pi/2). In 2D, you need two values to specify a vector, and you need two values (sin and cos) to define its angle in degrees/radians up to 2pi. Here is an example of the acos problem:
plot(seq(1,10,pi/20)) ## A sequence of numbers
plot(cos(seq(1,10,pi/20))) ## Their cosines
plot(acos(cos(seq(1,10,pi/20)))) ## NOT Back to the original sequence
Here's an idea:
angle <- circular::coord2rad(x, y)
where the point (x, y) lies in the direction "angle". This gives the angle in radians over the full circle (0 to 2*pi, i.e. 0 to 360 degrees). To report geographical directions, convert to degrees, and so on, you can use the control.circular argument of the function, e.g.:
x <- coord2rad(ea,eo, control.circular = list(type = "directions",units = "degrees"))
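If you would rather stay in base R, atan2() takes both coordinates and so covers the full circle as well. This is only a sketch of that alternative, not what the circular package does internally; full_angle and quarter_turns are hypothetical names.

```r
# Direction of the point (x, y) in radians, mapped onto [0, 2*pi)
full_angle <- function(x, y) atan2(y, x) %% (2 * pi)

# Points on the positive x, positive y, negative x and negative y axes
quarter_turns <- full_angle(c(1, 0, -1, 0), c(0, 1, 0, -1))
quarter_turns
```

Multiply by 180 / pi to get degrees.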
