How to find a repeated sequence of numbers in a data frame - r

Suppose I have the following data frame, and I want to identify and remove certain observations.
The idea is to delete those observations in which the same digit is repeated 4 or more times in a row.
df<-data.frame(col1=c(12,34,233,3333,3333333,333333,555555,543,456,87,4,111111,1111111111,22,222,2222,22222,9111111,912,8688888888))
col1
1 12
2 34
3 233
4 3333
5 3333333
6 333333
7 555555
8 543
9 456
10 87
11 4
12 111111
13 1111111111
14 22
15 222
16 2222
17 22222
18 9111111
19 912
20 8688888888
So the final output should be:
col1
1 12
2 34
3 233
4 543
5 456
6 87
7 4
8 22
9 222
10 912

Another way of removing the desired values would be to directly filter 1111, 2222 etc., using grepl() after converting the numbers to characters.
# grepl() returns a logical mask; negative indexing with grep() would return nothing when there is no match
df$col1[!grepl(paste(1111*(1:9), collapse="|"), as.character(df$col1))]
# [1] 12 34 233 543 456 87 4 22 222 912

Not the most efficient method, but it seems to return the desired result. Convert the vector into a string, split each individual character, use rle to look for repeating sequences, take the maximum and return TRUE if that max is less than 4.
df[sapply(strsplit(as.character(df$col1), ""),
function(x) max(rle(x)$lengths) < 4), , drop=FALSE]
col1
1 12
2 34
3 233
8 543
9 456
10 87
11 4
14 22
15 222
19 912
This method will include values like 155155 but exclude values like 555511 or 155551.
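To see why a value such as 9111111 is dropped while 155155 is kept, it may help to run the rle() step on a few single values. This is only a sketch: it assumes the numbers print without scientific notation under as.character(), and the helper name max_run is made up for illustration.

```r
# Longest run of one repeated digit in each number.
# max_run is a hypothetical helper name, not part of the original answer.
max_run <- function(x) {
  digits <- strsplit(as.character(x), "")
  vapply(digits, function(d) max(rle(d)$lengths), integer(1))
}

vals <- c(9111111, 155155, 555511, 155551)
runs <- max_run(vals)   # 6 2 4 4
keep <- runs < 4        # FALSE TRUE FALSE FALSE
vals[keep]              # only 155155 survives
```

The same mask can be used to subset the data frame, as in df[max_run(df$col1) < 4, , drop=FALSE].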

Related

How to get rid of varying zeros in a column in R?

I have a df1:
Story Score
1 00678
2 0980
3 1120
4 00067
5 0091
6 123
7 234
8 0234
9 00412
and I would like to remove all leading 0s to get a df2:
Story Score
1 678
2 980
3 1120
4 67
5 91
6 123
7 234
8 234
9 412
Assuming the Score column is text, you could use sub here:
df$Score <- sub("^0+", "", df$Score)
If you intend for Score to be treated and used as numbers, you also might be able to just cast it to numeric:
df$Score <- as.numeric(df$Score)
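A small sketch comparing the two options on the data from the question. The data frame below is a reconstruction that assumes Score is stored as character. One thing to watch with the plain pattern is that a score made up only of zeros would become an empty string, so a lookahead variant is shown as one possible workaround.

```r
# Reconstructed example data; Score is assumed to be character
df1 <- data.frame(
  Story = 1:9,
  Score = c("00678", "0980", "1120", "00067", "0091",
            "123", "234", "0234", "00412"),
  stringsAsFactors = FALSE
)

# Option 1: keep text, strip leading zeros
as_text <- sub("^0+", "", df1$Score)

# Option 2: convert to numbers, which drops leading zeros implicitly
as_num <- as.numeric(df1$Score)

# Edge case: "000" becomes "" with the plain pattern;
# the lookahead keeps the last character
all_zero_plain <- sub("^0+", "", "000")
all_zero_safe  <- sub("^0+(?=.)", "", "000", perl = TRUE)
```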

R Merge part of table into one column with sum

I have the following table in R:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
162 148 108 93 67 83 44 53 37 47 25 34 17 22 11 11 5
I want to divide it into 7 parts titled 1, 2, 3, 4, 5, 6 and "7 & greater", where all the counts from 7 onward are combined and merged into the last one.
I have looked at aggregate and tapply, but neither seems to be the right function for this.
x <- c(x[1:6], "7 and above"=sum(x[-(1:6)]))
1 2 3 4 5 6 7 and above
162 148 108 93 67 83 306
data
x <- table(rep(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17), c(162,148,108,93,67,83,44,53,37,47,25,34,17,22,11,11,5)))
If you are using table to generate the output above, you can use pmin to take the minimum of each value in your data and 7, and then use table to count the frequencies.
Assuming your dataframe is called df and the column name is col_name, you can do:
tab <- table(pmin(df$col_name, 7))
The count under 7 now includes all the values of 7 and above together. You can rename it to make this clearer.
names(tab)[7] <- '7&above'
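The pmin approach can be checked against the counts in the question. In this sketch the raw vector values is rebuilt from the table in the question, standing in for df$col_name.

```r
# Rebuild raw values from the counts shown in the question
counts <- c(162, 148, 108, 93, 67, 83, 44, 53, 37, 47, 25, 34, 17, 22, 11, 11, 5)
values <- rep(1:17, counts)

# Cap every value at 7, then tabulate
tab <- table(pmin(values, 7))
names(tab)[7] <- "7&above"

# The last cell should equal the sum of the original counts for 7..17
last_cell <- as.integer(tab["7&above"])
```

Here last_cell should agree with sum(counts[7:17]), which is the 306 shown in the other answer.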

Multiplication in R

I have a huge data set. Data covers around 4000 regions.
I need to do a multiplication like this: first, each number in each row should be multiplied by the corresponding column name/value (0, 1, 2, ...).
Then, these resulting numbers should be summed up and divided by the total number (totan) in that row.
For example, the data is like this:
region totan 0 1 2 3 4 5 6 7 .....
1 1346 5 7 3 9 23 24 34 54 .....
2 1256 7 8 4 10 34 2 14 30 .....
3 1125 83 43 23 11 16 4 67 21 .....
4 3211 43 21 67 12 13 12 98 12 .....
5 1111 21 8 9 3 23 13 11 0 .....
.... .... .. .. .. .. .. .. .. .. .....
4000 2345 21 9 11 45 67 89 28 7 .....
The calculation should be like this:
For example in region 1:
(5*0)+(7*1)+(3*2)+(9*3)+(23*4)+(24*5)+(34*6)+(54*7)+... = the sum, and then the sum/1346 = the result
I need to do such an analysis for all the regions.
I tried a couple of ways like use of "for" and "apply" but did not get the required result.
This can be done fully vectorized:
Data:
> df
region totan 0 1 2 3 4 5 6 7
1 1 1346 5 7 3 9 23 24 34 54
2 2 1256 7 8 4 10 34 2 14 30
3 3 1125 83 43 23 11 16 4 67 21
4 4 3211 43 21 67 12 13 12 98 12
5 5 1111 21 8 9 3 23 13 11 0
6 4000 2345 21 9 11 45 67 89 28 7
as.matrix(df[3:10]) %*% as.numeric(names(df)[3:10]) / df$totan
[,1]
[1,] 0.6196137
[2,] 0.3869427
[3,] 0.6711111
[4,] 0.3036437
[5,] 0.2322232
[6,] 0.4673774
This should be significantly faster on a huge dataset than any for or *apply loop.
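A self-contained sketch of the same matrix product on the first three rows. It assumes the data are read with check.names = FALSE, so that the column names stay "0", "1", ... instead of being turned into "X0", "X1", ..., which as.numeric() could not convert.

```r
# First three rows of the example; check.names = FALSE keeps numeric-looking names
df <- read.table(text = "
region totan 0 1 2 3 4 5 6 7
1 1346 5 7 3 9 23 24 34 54
2 1256 7 8 4 10 34 2 14 30
3 1125 83 43 23 11 16 4 67 21
", header = TRUE, check.names = FALSE)

# Column names of the count columns become the weights 0..7
weights <- as.numeric(names(df)[-(1:2)])

# One matrix-vector product gives the weighted sum per row
weighted_sum <- drop(as.matrix(df[-(1:2)]) %*% weights)
result <- unname(weighted_sum / df$totan)
```

For region 1 the weighted sum works out to 834, so the result is 834/1346, in line with the first value printed above.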
You could use the tidyverse:
library(tidyverse)
df %>% gather(k,v,-region,-totan) %>%
group_by(region,totan) %>% summarize(x=sum(as.numeric(k)*v)/first(totan))
## A tibble: 5 x 3
## Groups: region [?]
# region totan x
# <int> <int> <dbl>
#1 1 1346 0.620
#2 2 1256 0.387
#3 3 1125 0.671
#4 4 3211 0.304
#5 5 1111 0.232
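Using a for loop:
# the column names (0, 1, 2, ...) must be converted to numbers before multiplying
w <- as.numeric(names(data)[3:ncol(data)])
res <- numeric(nrow(data))
for (i in seq_len(nrow(data))) {
  res[i] <- sum(unlist(data[i, 3:ncol(data)]) * w) / data[i, 2]
}
or, alternatively, with apply:
apply(data, 1, function(x) {
  sum(x[3:length(x)] * as.numeric(names(x)[3:length(x)])) / x[2]
})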

Create new table based on two existing tables with specific criteria

This is probably stupid, but I have the following problem:
I have two tables:
1)Table with therapies on a specific patient with beginning and ending date:
therapyID patientID startoftherapy endoftherapy
1 1 233 5.5.10 6.6.11
2 2 233 7.7.11 8.8.11
3 3 344 1.1.09 3.2.10
4 4 344 3.3.10 10.10.11
5 5 544 2.1.09 3.2.10
6 6 544 4.3.12 4.3.14
7 7 113 1.1.12 1.1.15
8 8 123 2.1.13 1.1.15
9 9 543 2.1.09 3.2.10
10 10 533 7.7.11 8.8.14
2)Table with many diagnoses, the specific patient and date and description:
diagnosisID dateofdiagnosis patientID diagnosis
1 11 8.8.10 233 xxx
2 22 5.10.11 233 yyy
3 33 8.9.11 233 xxx
4 44 2.2.09 344 zzz
5 55 3.3.09 344 yyy
6 666 2.2.12 123 zzz
7 777 3.3.12 123 yyy
8 555 3.2.10 543 xxx
9 203 8.8.12 533 zzz
I want to create a new table with the diagnoses of the patients during the time of their therapy, i.e. with the matching criteria: same patientID, and dateofdiagnosis between startoftherapy and endoftherapy. Something like this:
therapyID diagnosisID patientID dateofdiagnosis diagnosis
1 1 11 233 08.08.10 xxx
2 2 22 233 05.10.11 yyy
3 2 33 233 08.09.11 xxx
I'm way too inexperienced to do this. Can anyone help me with this or point me in the right direction?
We can do it with dplyr:
# We recreate your data.frames
df1 <- read.table(text="
therapyID patientID startoftherapy endoftherapy
1 1 233 5.5.10 6.6.11
2 2 233 7.7.11 8.8.11
3 3 344 1.1.09 3.2.10
4 4 344 3.3.10 10.10.11
5 5 544 2.1.09 3.2.10
6 6 544 4.3.12 4.3.14
7 7 113 1.1.12 1.1.15
8 8 123 2.1.13 1.1.15
9 9 543 2.1.09 3.2.10
10 10 533 7.7.11 8.8.14", h=T)
df2 <- read.table(text="
diagnosisID dateofdiagnosis patientID diagnosis
1 11 8.8.10 233 xxx
2 22 5.10.11 233 yyy
3 33 8.9.11 233 xxx
4 44 2.2.09 344 zzz
5 55 3.3.09 344 yyy
6 666 2.2.12 123 zzz
7 777 3.3.12 123 yyy
8 555 3.2.10 543 xxx
9 203 8.8.12 533 zzz", h=T)
We load dplyr ; install.packages("dplyr") if you don't have it.
library(dplyr)
Then we left_join by patientID. A graphical definition (and more) can be found here. Then we just rearrange column order.
# we first left_join
left_join(df1, df2, "patientID") %>%
select(therapyID,diagnosisID,patientID, dateofdiagnosis, diagnosis) %>%
arrange(therapyID)
We obtain:
therapyID diagnosisID patientID dateofdiagnosis diagnosis
1 1 11 233 8.8.10 xxx
2 1 22 233 5.10.11 yyy
3 1 33 233 8.9.11 xxx
4 2 11 233 8.8.10 xxx
The output may be different from the one you provided because of row order. It can be changed with arrange. Is this what you want?
EDIT
I want to sort out cases where the date of diagnosis did not happen during the therapy
Then you first need to properly convert the date columns to Date format. This function does the job for your format:
ch2date <- function(x) as.Date(x, format="%d.%m.%y")
We can include it in the pipe and then use these columns for filtering:
left_join(df1, df2, "patientID") %>%
mutate(startoftherapy = ch2date(startoftherapy),
endoftherapy = ch2date(endoftherapy),
dateofdiagnosis = ch2date(dateofdiagnosis)) %>%
filter(startoftherapy < dateofdiagnosis, dateofdiagnosis < endoftherapy) %>%
select(therapyID, diagnosisID, patientID, dateofdiagnosis, diagnosis) %>%
arrange(therapyID)
Does it solve your problem?
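For readers without dplyr, the same join-then-filter idea can be sketched in base R with merge(). This is a different technique from the answer above, shown on a small subset of the data; the names therapies, diagnoses and in_window are made up for illustration, and two-digit years are assumed to mean 20xx.

```r
# Small subset of the two tables from the question
therapies <- data.frame(
  therapyID = 1:3,
  patientID = c(233, 233, 344),
  startoftherapy = c("5.5.10", "7.7.11", "1.1.09"),
  endoftherapy = c("6.6.11", "8.8.11", "3.2.10"),
  stringsAsFactors = FALSE
)
diagnoses <- data.frame(
  diagnosisID = c(11, 22, 44),
  dateofdiagnosis = c("8.8.10", "5.10.11", "2.2.09"),
  patientID = c(233, 233, 344),
  diagnosis = c("xxx", "yyy", "zzz"),
  stringsAsFactors = FALSE
)

# Same converter as in the answer: day.month.two-digit-year
ch2date <- function(x) as.Date(x, format = "%d.%m.%y")

# Pair every therapy with every diagnosis of the same patient
joined <- merge(therapies, diagnoses, by = "patientID")

# Keep only diagnoses dated strictly inside the therapy window
in_window <- ch2date(joined$dateofdiagnosis) > ch2date(joined$startoftherapy) &
  ch2date(joined$dateofdiagnosis) < ch2date(joined$endoftherapy)
result <- joined[in_window,
                 c("therapyID", "diagnosisID", "patientID", "dateofdiagnosis", "diagnosis")]
```

On this subset, diagnosis 11 falls inside therapy 1 and diagnosis 44 inside therapy 3, while diagnosis 22 (5.10.11) falls outside both therapies of patient 233. Use >= and <= instead if the start and end dates should count as part of the therapy.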

ddply type functionality on multiple dataframes

I have two dataframes that are structured as follows:
Dataframe A:
id sqft traf month
1 1030 16 35 1
1 1030 15 32 2
2 1027 1 31 1
2 1027 2 31 2
Dataframe B:
id price frequency month day
1 1030 8 196 1 1
2 1030 9 101 1 15
3 1030 10 156 1 30
4 1030 3 137 2 1
5 1030 7 190 2 15
6 1027 6 188 1 1
7 1027 1 198 1 15
8 1027 2 123 1 30
9 1027 4 185 2 1
10 1027 5 122 2 15
I want to output certain types of summary statistics (centered around each unique ID) from both of these dataframes. This would be easy with ddply if, say, I wanted the mean price for each ID for each month (split by id and month) from Dataframe B, or if I wanted the average ratio of sqft to traf for each id (split by id).
But what would be a potential solution if I wanted to make combined variables from both dataframes? For instance, how would I get the average price for each id/month (Dataframe B) divided by sqft for each id/month?
The varying frequencies at which the dataframes are measured make combining them difficult. The only solution I've found so far is to ddply the first dataframe to extract average sqft/id/month and then pass that value into a second ddply call on the second dataframe.
Is there a more efficient/less convoluted way to do this? I would be splitting both dataframes on the same variables (id and month).
Thanks in advance for any suggestions!
In the case of the sample data, you could merge the two data sets like this (by specifying all.y = TRUE you can make sure that all rows of dfb are kept and, in this case, corresponding entries of dfa are repeated accordingly)
dfall <- merge(dfa, dfb, by = c("id", "month"), all.y=TRUE)
# id month sqft traf price frequency day
#1 1027 1 1 31 6 188 1
#2 1027 1 1 31 1 198 15
#3 1027 1 1 31 2 123 30
#4 1027 2 2 31 4 185 1
#5 1027 2 2 31 5 122 15
#6 1030 1 16 35 8 196 1
#7 1030 1 16 35 9 101 15
#8 1030 1 16 35 10 156 30
#9 1030 2 15 32 3 137 1
#10 1030 2 15 32 7 190 15
Then, you can use ddply as usual:
ddply(dfall, .(id, month), mutate, newcol = mean(price)/sqft)
# id month sqft traf price frequency day newcol
#1 1027 1 1 31 6 188 1 3.0000000
#2 1027 1 1 31 1 198 15 3.0000000
#3 1027 1 1 31 2 123 30 3.0000000
#4 1027 2 2 31 4 185 1 2.2500000
#5 1027 2 2 31 5 122 15 2.2500000
#6 1030 1 16 35 8 196 1 0.5625000
#7 1030 1 16 35 9 101 15 0.5625000
#8 1030 1 16 35 10 156 30 0.5625000
#9 1030 2 15 32 3 137 1 0.3333333
#10 1030 2 15 32 7 190 15 0.3333333
Edit: if you're looking for better performance, consider using dplyr instead of plyr. The equivalent dplyr code (including the merge) is:
library(dplyr)
dfall <- dfb %>%
left_join(., dfa, by = c("id", "month")) %>%
group_by(id, month) %>%
dplyr::mutate(newcol = mean(price)/sqft) # I added dplyr:: to avoid confusion with plyr::mutate
Of course, you could also check out data.table which is also very efficient.
AFAIK ddply is not designed to be used with different data frames at the same time.
dplyr does well here. This code merges the data frames, gets price and sqft means by unique id/month combination, then creates a new variable pricePerSqft.
require(dplyr)
dfa %>%
left_join(dfb, by = c("id", "month")) %>%
group_by(id, month) %>%
summarize(
avgPrice = mean(price),
avgSqft = mean(sqft)) %>%
mutate(pricePerSqft = round(avgPrice / avgSqft, 2))
Here's the result:
id month avgPrice avgSqft pricePerSqft
1 1027 1 3.0 1 3.00
2 1027 2 4.5 2 2.25
3 1030 1 9.0 16 0.56
4 1030 2 5.0 15 0.33
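If neither plyr nor dplyr is available, a base R sketch of the same idea is to aggregate the finer-grained dataframe first and merge afterwards. The data below are retyped from the question, and the column name pricePerSqft follows the dplyr answer.

```r
# Data retyped from the question
dfa <- data.frame(
  id = c(1030, 1030, 1027, 1027),
  sqft = c(16, 15, 1, 2),
  traf = c(35, 32, 31, 31),
  month = c(1, 2, 1, 2)
)
dfb <- data.frame(
  id = rep(c(1030, 1027), each = 5),
  price = c(8, 9, 10, 3, 7, 6, 1, 2, 4, 5),
  frequency = c(196, 101, 156, 137, 190, 188, 198, 123, 185, 122),
  month = rep(c(1, 1, 1, 2, 2), times = 2),
  day = rep(c(1, 15, 30, 1, 15), times = 2)
)

# Average price per id/month, so dfb has one row per key like dfa
avg_price <- aggregate(price ~ id + month, data = dfb, FUN = mean)

# Both tables now share the same granularity and can be merged directly
out <- merge(dfa, avg_price, by = c("id", "month"))
out <- out[order(out$id, out$month), ]
out$pricePerSqft <- out$price / out$sqft
```

Aggregating before the merge keeps the merged table at one row per id/month, which avoids repeating the dfa values for every day in dfb.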
