I have a variable whose values currently take the form x.1, x.2 or x.3, where x is any whole number followed by the decimal point.
I would like to convert x.1 to x.333, x.2 to x.666 and x.3 to x.999 (or, in that last case, I assume it would be rounded up to the next whole number).
Context: I am running a regression analysis that includes an innings-pitched variable (baseball pitchers), whose values currently have the .1, .2, .3 form above.
Help would be much appreciated!
You can use x %% 1 to get the fractional part of a number in R. Then just multiply that by 3.333 and add the result back on to the integer part of your number to get total innings pitched.
x <- 2.3
as.integer(x) + (x %% 1 * 3.333)
[1] 2.9999
(Use 3.333 instead of 0.333 to move the decimal.)
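A variation on the same idea, as a minimal sketch: since .1 and .2 stand for one and two outs, dividing the number of outs by 3 avoids the 3.333 approximation. The function name innings_to_decimal is made up for illustration, and treating .3 as a full extra inning is an assumption based on the question.

```r
# Convert innings-pitched notation (x.1, x.2, x.3) to true thirds of an inning.
# innings_to_decimal is a hypothetical helper name, not from any package.
innings_to_decimal <- function(ip) {
  whole <- floor(ip)                 # completed innings
  outs <- round((ip - whole) * 10)   # digit after the point: 0, 1, 2 or 3 outs
  whole + outs / 3                   # each out is one third of an inning
}

innings_to_decimal(c(5, 5.1, 5.2, 6.3))
```

The round() call guards against floating-point noise in the fractional part (for example 0.0999999 instead of 0.1).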
Depending on the exact context, it could be nice to keep the component parts -- if that's the case, I would be a little verbose and utilize tidyr and dplyr:
library(tidyr)
library(dplyr)
vec <- c("123.1", "456.2", "789.3")
df <- data.frame(vec)
df %>%
separate(vec, into = c("before_dot", "after_dot"), remove = FALSE, convert = TRUE) %>%
mutate(after_dot_times_333 = after_dot * 333,
new_var = paste(before_dot, after_dot_times_333, sep = "."))
# vec before_dot after_dot after_dot_times_333 new_var
# 1 123.1 123 1 333 123.333
# 2 456.2 456 2 666 456.666
# 3 789.3 789 3 999 789.999
Alternatively, you could accomplish this in one line:
sapply(strsplit(vec, "\\."), function(x) paste(x[1], as.numeric(x[2]) * 333, sep = "."))
My dataframe (df) contains a list of values labelled in the format 'Month', 'Name of Site' and 'Camera No.'. For example, in 'DECBUTCAM27', DEC is December, BUT is the name of the site and CAM27 is the camera number.
I have 100 such values with 19 different site names.
I want to write if-else code such that only the site names are recognised and a corresponding number is added.
My initial idea was to add the corresponding number for all 100 values, but since ifelse does not work beyond 50 values I couldn't use that option.
This is what I had written for the option that I tried:
df <- df2 %>% mutate(Site_ID =
ifelse (CT_Name == 'DECBUTCAM27', "1",
ifelse (CT_Name == 'DECBUTCAM28', "1",
ifelse (CT_Name == 'DECI2NCAM01', "2",
ifelse (CT_Name == 'DECI2NCAM07', "2",
ifelse (CT_Name == 'DECI5CAM39', "3",
ifelse (CT_Name == 'DECI5CAM40', "3","NoVal")))))))
I am looking for code that recognises only the sites, i.e. 'BUT', 'I2N' and 'I5', and adds a corresponding number.
Any help would be greatly appreciated.
Extract the site name with a regex, then use match + unique to assign each site a unique number.
df2$site_name <- sub('...(.*)CAM.*', '\\1', df2$CT_Name)
df2$Site_ID <- match(df2$site_name, unique(df2$site_name))
For example:
CT_Name <- c('DECBUTCAM27', 'DECBUTCAM28', 'DECI2NCAM07', 'DECI2NCAM01',
'DECI5CAM39', 'DECI5CAM40')
site_name <- sub('...(.*)CAM.*', '\\1', CT_Name)
site_name
#[1] "BUT" "BUT" "I2N" "I2N" "I5" "I5"
Site_ID <- match(site_name, unique(site_name))
Site_ID
#[1] 1 1 2 2 3 3
Here is a tidyverse solution:
You haven't provided a reproducible example, but let's use the CT_Names that you have supplied to create a test dataframe:
library(tidyverse)  # provides tribble(), extract() and the dplyr verbs used below
data <- tribble(
~ CT_Name,
"DECBUTCAM27",
"DECBUTCAM28",
"DECI2NCAM01",
"DECI2NCAM07",
"DECI5CAM39",
"DECI5CAM40"
)
Let's assume that the string format is 3 letters for months, 2 or more letters or numbers for site and CAM + 1 or more digits for camera number (adjust these as needed). We can use a regular expression in tidyr's extract() function to split up the string into its components:
data_new <- data %>%
extract(CT_Name, regex = "(\\w{3})(\\w{2,})(CAM\\d+)", into = c("Month", "Site", "Camera"))
(add remove = FALSE if you want to keep the original CT_Name variable)
This yields:
# A tibble: 6 x 3
Month Site Camera
<chr> <chr> <chr>
1 DEC BUT CAM27
2 DEC BUT CAM28
3 DEC I2N CAM01
4 DEC I2N CAM07
5 DEC I5 CAM39
6 DEC I5 CAM40
We can then group by site and assign a group ID as your Site_ID:
data_new <- data %>%
extract(CT_Name, regex = "(\\w{3})(\\w{2,})(CAM\\d+)", into = c("Month", "Site", "Camera")) %>%
group_by(Site) %>%
mutate(Site_ID = cur_group_id())
This produces:
# A tibble: 6 x 4
# Groups: Site [3]
Month Site Camera Site_ID
<chr> <chr> <chr> <int>
1 DEC BUT CAM27 1
2 DEC BUT CAM28 1
3 DEC I2N CAM01 2
4 DEC I2N CAM07 2
5 DEC I5 CAM39 3
6 DEC I5 CAM40 3
Here is a quick example that uses a regex to find the site code and an apply function to return a vector of codes.
df <- data.frame(code = c('DECBUTCAM27','JANBUTCAM27','DECDUCCAM45'))
df$loc <- apply(df, 1, function(x) gsub("CAM.*$","",gsub("^.{3}",'',x[1])))
unique(df$loc) # all the location of the file
df$n <- as.numeric(as.factor(df$loc)) # get a number for each location
Note that I use x[1] here because the codes are in the first column of my data.frame, which may differ for you.
---EDIT--- This was my previous answer. It also works, but it means more work for you. However, it lets you choose the numeric code (or text) assigned to each location, which helps if the locations are ordered, for example.
It requires you to list the code for every site, which I find heavy in terms of code, but it works. The switch part is roughly the same as an ifelse.
The regex removes the first 3 characters and everything from the 'CAM' sequence to the end.
df <- data.frame(code = c('DECBUTCAM27','JANBUTCAM27','DECDUCCAM45'))
df$n <- apply(df, 1, function(x) switch(gsub("CAM.*$","",gsub("^.{3}",'',x[1])),
BUT = 1,
DUC = 2)
)
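If you want to pick the numbers yourself without listing every value in switch or ifelse, a named vector can serve as a lookup table. This is only a sketch: the site_ids mapping below is a hypothetical example, and the regex assumes the same 3-character month prefix and 'CAM' suffix as above.

```r
# Hypothetical lookup table: site code -> numeric ID chosen by you
site_ids <- c(BUT = 1, DUC = 2)

df <- data.frame(code = c('DECBUTCAM27', 'JANBUTCAM27', 'DECDUCCAM45'))

# Drop the 3-character month prefix and everything from 'CAM' onwards
df$loc <- sub("^.{3}(.*)CAM.*$", "\\1", df$code)

# Vectorised lookup by name; site codes missing from site_ids become NA
df$n <- unname(site_ids[df$loc])
df
```

Because indexing by name is vectorised, there is no need for apply here, and adding a site is a one-entry change to site_ids.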
I have a dataframe where one column is the amount spent. This column contains the amounts spent and also negative values for any returns. For example:
ID Store Spent
123 A 18.50
123 A -18.50
123 A 18.50
I want to remove each negative value along with one of its positive counterparts. The idea is to keep only fully completed spend amounts so I can look at total spend.
Right now I am thinking of something like this, where the data frame is sorted by spend:
if spend < 0 {
take absolute value of spend
if diff between abs(spend) and spend+1 = 0 then both are NA}
I would like to have something like
df$Spent[df$Spent < 0] <- NA
where I can also set one positive counterpart to NA as well. Any suggestions?
There should be a simpler solution to this, but here is one way. I created my own example, since the one shared did not have enough data points to test.
#Original vector
x <- c(1, 2, -2, 1, -1, -1, 2, 3, -4, 1, 4)
#Count the frequency of negative numbers, keeping all the unique numbers
#(levels are sorted so they line up with table(abs(x)) below)
vals <- table(factor(abs(x[x < 0]), levels = sort(unique(abs(x)))))
#Count the frequency of absolute value of original vector
vals1 <- table(abs(x))
#Subtract the frequencies between two vectors
new_val <- vals1 - (vals * 2 )
#Recreate the new vector
as.integer(rep(names(new_val), new_val))
#[1] 1 2 3
If you add a rowid column you can do this with data.table anti-joins.
Here's an example which takes ID into account, not deleting "positive counterparts" unless they have the same ID.
First, create more interesting sample data:
library(data.table)
df <- fread('
ID Store Spent
123 A 18.50
123 A -18.50
123 A 18.50
123 A -19.50
123 A 19.50
123 A -99.50
124 A -94.50
124 A 99.50
124 A 94.50
124 A 94.50
')
Now remove all the negative values with positive counterparts, and remove those counterparts
negs <- df[Spent < 0][, Spent := -Spent][, rid := rowid(ID, Spent)]
pos <- df[Spent > 0][, rid := rowid(ID, Spent)]
pos[!negs, on = .(ID, Spent, rid), -'rid']
# ID Store Spent
# 1: 123 A 18.5
# 2: 124 A 99.5
# 3: 124 A 94.5
And as applied to Ronak's x vector example
x <- c(1, 2, -2, 1, -1, -1, 2, 3, -4, 1, 4)
negs <- data.table(x = -x[x<0])[, rid := rowid(x)]
pos <- data.table(x = x[x>0])[, rid := rowid(x)]
pos[!negs, on = names(pos), -'rid']
# x
# 1: 2
# 2: 3
# 3: 1
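The same pairing idea can be sketched in base R without data.table. Here ave() with seq_along plays the role of rowid(), numbering repeated values 1, 2, 3, ..., and %in% plays the role of the anti-join. The key vectors pos_key and neg_key are names introduced only for this illustration.

```r
x <- c(1, 2, -2, 1, -1, -1, 2, 3, -4, 1, 4)

pos <- x[x > 0]    # purchases
neg <- -x[x < 0]   # returns, as positive amounts

# Number each repeated amount (1st, 2nd, 3rd occurrence, ...), like rowid()
pos_key <- paste(pos, ave(pos, pos, FUN = seq_along))
neg_key <- paste(neg, ave(neg, neg, FUN = seq_along))

# Keep purchases whose (amount, occurrence) pair has no matching return
kept <- pos[!pos_key %in% neg_key]
kept
#[1] 2 3 1
```

Each return cancels exactly one purchase of the same amount, and the surviving values keep their original order.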
I used the following code.
library(dplyr)
store <- rep(LETTERS[1:3], 3)
id <- c(1:4, 1:3, 1:2)
expense <- runif(9, -10, 10)
tibble(store, id, expense) %>%
group_by(store) %>%
summarise(net_expenditure = sum(expense))
to get output like this (the exact values will vary, because runif is called without set.seed):
# A tibble: 3 x 2
store net_expenditure
<chr> <dbl>
1 A 13.3
2 B 8.17
3 C 16.6
Alternatively, if you wanted the net expenditure per store-id pairing, then you could use this code:
tibble(store, id, expense) %>%
group_by(store, id) %>%
summarise(net_expenditure = sum(expense))
I've approached your question from a slightly different perspective. I'm not sure that my code answers your question, but it might help.
I have a dataframe in R in which one column holds multiple values per row. The values start with either ABC, DEF or GHI, followed by a series of 6 digits (e.g. ABC052689, ABC062895, DEF045158).
For each row, I would like to pull one instance of ABC (the one with the largest number).
If the row has ABC052689, ABC062895, DEF045158, I would like it to pull out ABC062895 because it is greater than ABC052689.
I would then want to do the same for the variable that starts with DEF######.
I have managed to filter the data to have rows where ABC is there and either DEF or GHI is there:
library(tidyverse)
data_with_ABC <- test %>%
filter(str_detect(car,"ABC"))
data_with_ABC_and_DEF_or_GHI <- data_with_ABC %>%
filter(str_detect(car, "DEF") | str_detect(car, "GHI"))
I don't know how to pull out, say, the ABC value with the greatest number:
ABC052689, ABC062895, DEF045158 -> ABC062895
For a base R solution, we can use sapply along with strsplit to identify the greatest ABC plate in each CSV string, in each row.
df <- data.frame(car=c("ABC052689,ABC062895,DEF045158"), id=c(1),
stringsAsFactors=FALSE)
df$largest <- sapply(df$car, function(x) {
  cars <- strsplit(x, ",", fixed=TRUE)[[1]]   # split the CSV string into plates
  cars <- cars[substr(cars, 1, 3) == "ABC"]   # keep only the ABC plates
  cars[which.max(substr(cars, 4, 9))]         # plate with the largest number
}, USE.NAMES = FALSE)
df
car id largest
1 ABC052689,ABC062895,DEF045158 1 ABC062895
Note that we don't need to cast the substring of the plate number explicitly: which.max() coerces the digit strings to numbers internally, and because the plate numbers are fixed-width text they would also sort properly as text.
Besides Tim's answer, if you want to handle all the ABC/DEF prefixes at once, the following code using library(tidyverse) may help:
library(tidyverse)
df <- data.frame(car=c("ABC052689", "ABC062895", "DEF045158", "DEF192345"), stringsAsFactors=FALSE)
df2 <- df %>%
  mutate(state = str_sub(car, 1, 3), plate = str_sub(car, 4, 9))
df2
car state plate
1 ABC052689 ABC 052689
2 ABC062895 ABC 062895
3 DEF045158 DEF 045158
4 DEF192345 DEF 192345
df2 %>%
  group_by(state) %>%
  summarise(maxplate = max(plate)) %>%
  mutate(full = str_c(state, maxplate))
# A tibble: 2 x 3
state maxplate full
<chr> <chr> <chr>
1 ABC 062895 ABC062895
2 DEF 192345 DEF192345
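To get the largest plate for every prefix within a single row's string, the two ideas above can be combined. This is a sketch only: largest_by_prefix is a hypothetical helper name, and it assumes every plate is a 3-letter prefix followed by a fixed-width number, so that comparing the strings as text gives the right answer.

```r
# Return the largest plate per 3-letter prefix found in one comma-separated string.
# largest_by_prefix is a made-up name for illustration.
largest_by_prefix <- function(car_string) {
  plates <- trimws(strsplit(car_string, ",", fixed = TRUE)[[1]])
  prefix <- substr(plates, 1, 3)
  # max() on fixed-width strings picks the plate with the largest number
  tapply(plates, prefix, max)
}

res <- largest_by_prefix("ABC052689, ABC062895, DEF045158")
res
```

The result is a named vector with one entry per prefix, so res["ABC"] gives the ABC plate and res["DEF"] the DEF plate.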
I would like to process some GPS-Data rows, pairwise.
For now, I am doing it in a normal for-loop but I'm sure there is a better and faster way.
library(sp)  # provides spDists()
n <- 100
testdata <- as.data.frame(cbind(runif(n,1,10), runif(n,0,360), runif(n,14,16), runif(n, 46,49)))
colnames(testdata) <- c("speed", "heading", "long", "lat")
head(testdata)
diffmatrix <- as.data.frame(matrix(ncol = 3, nrow = dim(testdata)[1] - 1))
colnames(diffmatrix) <- c("distance","heading_diff","speed_diff")
for (i in 1:(dim(testdata)[1] - 1)) {
diffmatrix[i,1] <- spDists(as.matrix(testdata[i:(i+1),c('long','lat')]),
longlat = TRUE, segments = TRUE)*1000
diffmatrix[i,2] <- testdata[i+1,]$heading - testdata[i,]$heading
diffmatrix[i,3] <- testdata[i+1,]$speed - testdata[i,]$speed
}
head(diffmatrix)
How would I do that with an apply function?
Or is it even possible to do that calculation in parallel?
Thank you very much!
I'm not sure what you want to do with the end condition but with dplyr you can do all of this without using a for loop.
library(dplyr)
library(sp)  # provides spDists()
testdata %>%
  mutate(heading_diff = c(diff(heading), 0),
         speed_diff = c(diff(speed), 0),
         longdiff = c(diff(long), 0),
         latdiff = c(diff(lat), 0)) %>%
  rowwise() %>%
  mutate(spdist = spDists(cbind(c(long, long + longdiff), c(lat, lat + latdiff)), longlat = TRUE, segments = TRUE) * 1000) %>%
  select(heading_diff, speed_diff, distance = spdist)
# heading_diff speed_diff distance
# <dbl> <dbl> <dbl>
# 1 15.9 0.107 326496
# 2 -345 -4.64 55184
# 3 124 -1.16 25256
# 4 85.6 5.24 221885
# 5 53.1 -2.23 17599
# 6 -184 2.33 225746
I will explain each part below:
The pipe operator %>% is essentially a chain that sends the results from one operation into the next. So we start with your test data and send it to the mutate function.
Use mutate to create 4 new columns holding the differences from one row to the next. A 0 is added for the last row because no measurement follows the last data point (you could use NA instead).
Next, once you have the differences, use rowwise so that the spDists function is applied to each row.
Last, we create another column with mutate, computed from long, lat and the longdiff and latdiff columns created earlier.
To get only the 3 columns that you were concerned with I used a select statement at the end. You can leave this out if you want the entire dataframe.
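If the row-by-row spDists call turns out to be the bottleneck, one option is to compute the distances in a single vectorised step. The sketch below swaps spDists for a plain haversine formula on a sphere of radius 6371 km (an assumption), so its results will differ slightly from spDists. The names haversine_m and diffs are made up for illustration.

```r
# Great-circle distance in metres between paired points (haversine formula).
# Assumes a spherical Earth; radius_m is an assumed mean radius.
haversine_m <- function(lon1, lat1, lon2, lat2, radius_m = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_m * asin(pmin(1, sqrt(a)))
}

set.seed(42)
n <- 100
testdata <- data.frame(speed = runif(n, 1, 10), heading = runif(n, 0, 360),
                       long = runif(n, 14, 16), lat = runif(n, 46, 49))

i <- seq_len(n - 1)  # index of the first row of each consecutive pair
diffs <- data.frame(
  distance = haversine_m(testdata$long[i], testdata$lat[i],
                         testdata$long[i + 1], testdata$lat[i + 1]),
  heading_diff = diff(testdata$heading),
  speed_diff = diff(testdata$speed)
)
head(diffs)
```

Everything here operates on whole vectors, so there is no loop and no rowwise step, and the result has n - 1 rows like the original diffmatrix.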
Good day!
I’ve got a table with two columns. The first column (x) holds values that I want to divide into categories of a specified width (in my case, 300). Then, using these categories, I want to sum the values in another column (v). For instance, using my test data: The first category is from 65100 to 65400 (65100
The result should be a table with two columns: the first holds the categories of x, the second the sum of the corresponding values of v.
Thank you!!!
# data
set.seed(1)
x <- sample(seq(65100, 67900, by=5), 100, replace = TRUE)
v <- sample(seq(1000, 8000), 100, replace = TRUE)
tabl <- data.frame(x=c(x), v=c(v))
attach(tabl)
#categories
seq(((min(x) - min(x)%%300) + 300), ((max(x) - max(x)%%300) + 300), by =300)
I understood you want to:
Cut vector x,
Using pre-calculated cut-off thresholds
Compute sums over vector v using those groupings
This is one line of code with data.table and chaining. Your data are in data.table named DT.
DT[, CUT := cut(x, breaks)][, sum(v), by=CUT]
Explanation:
First, assign cut-offs to variable breaks like so.
breaks <- seq(((min(x) - min(x) %% 300) + 300), ((max(x) - max(x) %% 300) + 300), by =300)
Second, compute a new column CUT to group rows by the data in breaks.
DT[, CUT := cut(x, breaks)]
Third, sum on column v in groups, using by=. I have chained this operation with the previous.
DT[, CUT := cut(x, breaks)][, sum(v), by=CUT]
Convert your data.frame to data.table like so.
library(data.table)
DT <- as.data.table(tabl)
This is the final result:
CUT V1
1: (6.57e+04,6.6e+04] 45493
2: (6.6e+04,6.63e+04] 77865
3: (6.66e+04,6.69e+04] 22893
4: (6.75e+04,6.78e+04] 61738
5: (6.54e+04,6.57e+04] 44805
6: (6.69e+04,6.72e+04] 64079
7: NA 33234
8: (6.72e+04,6.75e+04] 66517
9: (6.63e+04,6.66e+04] 43887
10: (6.78e+04,6.81e+04] 1722
You can dress this up to improve aesthetics. For example, you can reset the factor levels for ease of reading.
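The NA row above appears because the breaks start at the first multiple of 300 above min(x), so the smallest x values fall outside every interval. A minimal base R sketch of one way to avoid that is shown below. It assumes the bins should start at the multiple of 300 at or below min(x) and that the lowest edge should be included; the variable names lower, upper, bins and sums are illustrative.

```r
set.seed(1)
x <- sample(seq(65100, 67900, by = 5), 100, replace = TRUE)
v <- sample(seq(1000, 8000), 100, replace = TRUE)

# Break points at multiples of 300 that cover the whole range of x
lower <- floor(min(x) / 300) * 300
upper <- ceiling(max(x) / 300) * 300
breaks <- seq(lower, upper, by = 300)

# include.lowest = TRUE keeps a value equal to the first break;
# dig.lab = 5 prints labels like (65400,65700] instead of (6.54e+04,6.57e+04]
bins <- cut(x, breaks, include.lowest = TRUE, dig.lab = 5)

# Sum v within each bin (bins with no rows come back as NA)
sums <- tapply(v, bins, sum)
sums
```

The same breaks and cut() arguments can be passed to the data.table call in place of the original breaks.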
With dplyr I usually do it like this, although I like the cut solution too.
library(dplyr)
# data
set.seed(1)
x <- sample(seq(65100, 67900, by=5), 100, replace = TRUE)
v <- sample(seq(1000, 8000), 100, replace = TRUE)
tabl <- data.frame(group=c(x), value=c(v))
attach(tabl)
#categories
s <- seq(((min(x) - min(x)%%300) + 300), ((max(x) - max(x)%%300) + 300), by =300)
tabl %>% rowwise() %>% mutate(g = s[min(which(group < s), na.rm=TRUE)]) %>% ungroup() %>%
group_by(g) %>% summarise(sumvalue = sum(value))
result:
g sumvalue
<dbl> <int>
65400 28552
65700 49487
66000 45493
66300 77865
66600 43887
66900 21187
67200 65785
67500 66517
67800 61738
68100 1722
Try this (no package needed):
s <- seq(65100, max(tabl$x)+300, 300)
tabl$col = as.vector(cut(tabl$x, breaks = s, labels = 1:10))
df <- aggregate(v~col, tabl, sum)
# col v
# 1 1 33234
# 2 2 44805
# 3 3 45493
# 4 4 77865
# 5 5 43887
# 6 6 22893
# 7 7 64079
# 8 8 66517
# 9 9 61738
# 10 10 1722