How can I find the changes in a single column in R?

I have a "csv" file which includes 3 columns, 100+rows. The variables in all columns change according to the data placed in Column 1, the "Time".
Time Temp Cloud
1100 22 1
1102 14 1
1104 14 2
1106 23 1
1108 12 1
1110 21 2
1112 17 2
1114 12 3
1116 24 3
I want to know when "Cloud" changes (e.g. at the 3rd and 6th rows), and I want to obtain the other variables in that row and in the row before it.
How can I do that?
Thanks

diff will almost do this directly: a change shows up as a nonzero difference, and padding the resulting logical vector on each side flags both the changed row and the row before it. Calling your example data d:
> d[c(diff(d$Cloud) != 0,FALSE) | c(FALSE, diff(d$Cloud) != 0),]
Time Temp Cloud
2 1102 14 1
3 1104 14 2
4 1106 23 1
5 1108 12 1
6 1110 21 2
7 1112 17 2
8 1114 12 3
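Unpacked with intermediate names, the same logic reads (a sketch; d is the example data as above):
changes <- diff(d$Cloud) != 0  # TRUE at position i when Cloud differs between rows i and i+1
after  <- c(FALSE, changes)    # flags the row where the new value appears
before <- c(changes, FALSE)    # flags the row just before the change
d[after | before, ]            # keep both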

I would do something like this:
df$Change <- c(0, sign(diff(df$Cloud)))
subset(df, Change != 0)[,-4]
This eliminates the rows where there is no change; the [,-4] drops the helper Change column again. Unlike the answer above, it returns only the rows where Cloud changes, not the rows before them.

Related

Replace one row with many changing a column value based on an object

I've tried to find the answer to this simple question without any luck.
Let's say I have:
A <- c(1101,1102,1103)
and a dataframe that looks like this:
t m com x
1 1 1101 10
1 1 1102 15
1 1 1103 20
1 2 NA NA
1 3 1101 20
1 3 1102 25
1 3 1103 30
1 4 NA NA
1 5 1101 25
1 5 1102 30
1 5 1103 35
and since what I want to do is a linear interpolation of x with na.approx (the actual data frame and object are much bigger), I need the data frame to look like this:
t m com x
1 1 1101 10
1 1 1102 15
1 1 1103 20
1 2 1101 NA
1 2 1102 NA
1 2 1103 NA
1 3 1101 20
1 3 1102 25
1 3 1103 30
1 4 1101 NA
1 4 1102 NA
1 4 1103 NA
1 5 1101 25
1 5 1102 30
1 5 1103 35
I haven't tried any code for this because I don't know where to start.
Any help and/or R material that you consider useful would be great.
Thanks,
The function you need is tidyr::complete().
Since you would like to "expand" column com within each group of m, you need to group_by(m) first, then use the vector A to complete the com column.
In this case the t column will be filled with NA by default; we can fill() up column t using the value from the row below (since you said you have a large dataset, this should work better than setting the fill parameter in complete()).
Then drop any NA in the com column (these are the original NA rows in your data set).
Finally, reorder the columns back to their original positions.
library(tidyverse)
A <- c(1101,1102,1103)
df %>%
group_by(m) %>%
complete(com = A) %>%
fill(t, .direction = "up") %>%
drop_na(com) %>%
select(t, m, com, x)
# A tibble: 15 x 4
# Groups: m [5]
t m com x
<int> <int> <dbl> <int>
1 1 1 1101 10
2 1 1 1102 15
3 1 1 1103 20
4 1 2 1101 NA
5 1 2 1102 NA
6 1 2 1103 NA
7 1 3 1101 20
8 1 3 1102 25
9 1 3 1103 30
10 1 4 1101 NA
11 1 4 1102 NA
12 1 4 1103 NA
13 1 5 1101 25
14 1 5 1102 30
15 1 5 1103 35
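From here, the interpolation the question mentions can be run per com series; a minimal sketch, assuming the zoo package and calling the completed data df_completed (a name made up for illustration):
library(zoo)
df_completed %>%
  group_by(com) %>%
  mutate(x = na.approx(x, na.rm = FALSE)) %>%  # linear interpolation; leaves leading/trailing NA as-is
  ungroup()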

Creating a summary dataset with multiple objects and multiple observations per object

I have a dataset with the reports from a local shop, where each line has a client's ID, date of purchase and total value per purchase.
I want to create a new plot where for each client ID I have all the purchases in the last month or even just sample purchases in a range of dates I choose.
The main problem is that certain customers might buy once a month, while others can come daily - so the number of observations per period of time can vary.
I have tried subsetting my dataset to a specific range of time, but either I choose a single date, in which case I only capture a small percentage of all customers, or I choose a range and get multiple observations for certain customers.
(In the latter case I wouldn't mind keeping just the earliest observation.)
An important note: I know how to create a for loop to solve this problem, but since the dataset is over 4 million observations it isn't practical since it would take an extremely long time to run.
A basic example of what the dataset looks like:
ID Date Sum
1 1 1 234
2 1 2 45
3 1 3 1
4 2 4 223
5 3 5 546
6 4 6 12
7 2 1 20
8 4 3 30
9 6 2 3
10 3 5 45
11 7 6 456
12 3 7 65
13 8 8 234
14 1 9 45
15 3 2 1
16 4 3 223
17 6 6 546
18 3 4 12
19 8 7 20
20 9 5 30
21 11 6 3
22 12 6 45
23 14 9 456
24 15 10 65
....
And the new data set would look something like this:
ID 1Date 1Sum 2Date 2Sum 3Date 3Sum
1 1 234 2 45 3 1
2 1 20 4 223 NA NA
3 2 1 5 546 5 45
...
Thanks for your help!
I think you can do this with a bit of help from dplyr and tidyr:
library(dplyr)
library(tidyr)
dd %>%
  group_by(ID) %>%
  mutate(seq = row_number()) %>%
  ungroup() %>%
  pivot_wider(id_cols = "ID", names_from = "seq", values_from = c("Date", "Sum"))
Where dd is your sample data frame above.
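If you instead want just one purchase per customer within a date range of your choice (the earliest one, as you mention), you can filter before reshaping; a sketch, where the range bounds 1 and 9 are placeholders:
dd %>%
  filter(Date >= 1, Date <= 9) %>%                # restrict to the chosen range
  group_by(ID) %>%
  slice_min(Date, n = 1, with_ties = FALSE) %>%   # earliest observation per customer
  ungroup()
slice_min() needs dplyr >= 1.0.0; on older versions, arrange(Date) %>% slice(1) within the group does the same.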

How to find a repeated sequence of numbers in a data frame

Suppose I have the following data frame, and what I want to do is identify and remove certain observations.
The idea is to delete the observations containing a run of 4 or more identical digits.
df<-data.frame(col1=c(12,34,233,3333,3333333,333333,555555,543,456,87,4,111111,1111111111,22,222,2222,22222,9111111,912,8688888888))
col1
1 12
2 34
3 233
4 3333
5 3333333
6 333333
7 555555
8 543
9 456
10 87
11 4
12 111111
13 1111111111
14 22
15 222
16 2222
17 22222
18 9111111
19 912
20 8688888888
So the final output should be:
col1
1 12
2 34
3 233
4 543
5 456
6 87
7 4
8 22
9 222
10 912
Another way of removing the undesired values would be to directly filter out 1111, 2222, etc. using grep() after converting the numbers to characters (note this covers the digits 1-9 only, so a run of four zeros would slip through).
df$col1[-grep(paste(1111*(1:9), collapse="|"), as.character(df$col1))]
# [1] 12 34 233 543 456 87 4 22 222 912
Not the most efficient method, but it seems to return the desired result. Convert each value to a string, split it into individual characters, use rle() to measure runs of repeated digits, take the maximum run length, and keep the row if that maximum is less than 4.
df[sapply(strsplit(as.character(df$col1), ""),
function(x) max(rle(x)$lengths) < 4), , drop=FALSE]
col1
1 12
2 34
3 233
8 543
9 456
10 87
11 4
14 22
15 222
19 912
This method will include values like 155155 but exclude values like 555511 or 155551.
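A compact alternative that covers any digit (including runs of zeros) is a backreference regex; a sketch:
df[!grepl("(\\d)\\1{3}", as.character(df$col1), perl = TRUE), , drop = FALSE]
The pattern matches a digit followed by three more copies of itself, i.e. a run of four or more, and the ! keeps the rows without such a run.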

Select specific rows based on previous row value (in the same column)

I've been trying to figure out a way to script this in R, but just can't get it. I have a dataset like this:
Trial Type Correct Latency
1 55 0 0
3 30 1 766
4 10 1 344
6 40 1 716
7 10 1 326
9 30 1 550
10 10 1 350
11 64 0 0
13 30 1 683
14 10 1 270
16 30 1 666
17 10 1 297
19 40 1 616
20 10 1 315
21 64 0 0
23 40 1 850
24 10 1 322
26 30 1 566
27 20 0 766
28 40 1 500
29 20 1 230
which goes on for much longer (around 1000 rows).
From this dataset, I would like to create 4 separate data.frames/tables that I can export and also use for my own calculations.
I would like to have a data.frame (4 in total), one for each of these bullet points:
type 10 rows which are preceded by a type 30 row
type 10 rows which are preceded by a type 40 row
type 20 rows which are preceded by a type 30 row
type 20 rows which are preceded by a type 40 row
I would like all the columns of the relevant rows to be placed into these new tables, but including only the rows of type 10 or 20 themselves.
For example, the first table (type 10 preceded by type 30) would look like this based on the sample data:
Trial Type Correct Latency
4 10 1 344
10 10 1 350
14 10 1 270
17 10 1 297
Second table (type 10 preceded by type 40):
Trial Type Correct Latency
7 10 1 326
20 10 1 315
24 10 1 322
Third table (type 20 preceded by type 30):
Trial Type Correct Latency
27 20 0 766
Fourth table (type 20 preceded by type 40):
Trial Type Correct Latency
29 20 1 230
I can subset just fine to get one table of only type 10 rows and another of type 20 rows, but I can't figure out how to create different tables for type 10 and 20 rows based on the previous row's type value. Another issue is that "Trial" is not consecutive (it skips numbers).
Any help would be greatly appreciated. Thank you.
Also, is there a way to include the previous row as well, so the output for the fourth table would look something like this:
Fourth table (type 20 preceded by type 40):
Trial Type Correct Latency
28 40 1 500
29 20 1 230
For the fourth example, you could use which() in combination with lag() from dplyr to obtain the indices that meet your criteria, then use those to subset the data.frame.
# Get indices of rows that meet the condition
ind2 <- which(df$Type == 20 & dplyr::lag(df$Type) == 40)
# Get indices of the rows just before those
ind1 <- ind2 - 1
# Subset data (sorting keeps each pair in its original order)
> df[sort(c(ind1, ind2)), ]
Trial Type Correct Latency
1: 28 40 1 500
2: 29 20 1 230
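To produce all four tables in one go, the same lag() logic can be looped over the (current type, previous type) pairs; a sketch, assuming df holds the full dataset (the list names are just illustrative):
pairs <- list(c(10, 30), c(10, 40), c(20, 30), c(20, 40))
tables <- lapply(pairs, function(p) {
  df[which(df$Type == p[1] & dplyr::lag(df$Type) == p[2]), ]
})
names(tables) <- c("t10_after_30", "t10_after_40", "t20_after_30", "t20_after_40")
tables$t10_after_30 is then the first table, and so on.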
Here is some example code for the case where you always want to delete the first trial(s) of your data.
var1 <- c(1,2,1,2,1,2,1,2,1,2)
var2 <- c(1,1,1,2,2,2,2,3,3,3)
dat <- data.frame(var1, var2)
var1 var2
1 1 1
2 2 1
3 1 1
4 2 2
5 1 2
6 2 2
7 1 2
8 2 3
9 1 3
10 2 3
library(dplyr)
# delete only the row where var2 changes (the first trial of each block;
# lag() is NA for the very first row, so that row is dropped as well)
filter(dat, lag(var2) == var2)
  var1 var2
1    2    1
2    1    1
3    1    2
4    2    2
5    1    2
6    1    3
7    2    3
# delete the first 2 trials of the data and the first 2 of every new var2 block
# (which() gives the rows where var2[n-1] != var2[n], using lag from dplyr)
change <- which(lag(dat$var2) != dat$var2)
drops <- c(1, 2, change, change + 1)
if (length(drops) > 0) { dat <- dat[-drops, ] }
var1 var2
3 1 1
6 2 2
7 1 2
10 2 3

Merge two datasets in R

I have two different datasets arranged in column format as follows:
Dataset 1:
A B C D E
13 1 1.7 2 1
13 2 5.3 2 1
13 2 2 2 1
13 2 1.8 2 1
1 6 27 9 1
1 6 6.6 9 1
1 7 17 9 1
1 7 7.1 9 1
1 7 8.5 9 1
Dataset 2:
A B F G
13 1 42 1002
13 2 42 1002
13 2 42 1002
13 2 42 1002
13 3 42 1002
13 4 42 1002
13 5 42 1002
1 2 27 650
1 3 27 650
1 4 27 650
1 6 27 650
1 7 27 650
1 7 27 650
1 7 27 650
1 8 27 650
Row numbers of both datasets vary, but each contains data for two samples (for example, column A: 13 and 1 in both datasets). I want the C, D and E values of dataset 1 to be placed into dataset 2 for rows having the same values of A and B in both datasets, so the join should be based on A and B. I need to do this for about 47560 rows.
I am new to R, so I would be thankful for code that also saves the new merged dataset.
Use the merge function in R.
Reference: http://www.statmethods.net/management/merging.html
Edit:
So first you'd need to read in the datasets, CSV is a good format.
> dataset1 <- read.csv(file="dataset1.csv", header=TRUE, sep=",")
> dataset2 <- read.csv(file="dataset2.csv", header=TRUE, sep=",")
If you just type the variable names now and hit enter, you should see a read-out of your datasets. So...
> dataset1
should read out your data above. Then I believe the following should do the merge (I may be wrong)...
> dataset1_2 <- merge(dataset1, dataset2, by=c("A","B"))
Edit 2:
> write.table(dataset1_2, "c:/dataset1_2.txt", sep=" ")
Reference: http://www.statmethods.net/input/exportingdata.html
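Note that merge() as written keeps only rows whose (A, B) pair appears in both datasets. If you want to keep every row of dataset2 and just attach C, D and E where a match exists (NA elsewhere), you can ask merge() for a left join; a sketch, with dataset2_full as an illustrative name:
> dataset2_full <- merge(dataset2, dataset1, by=c("A","B"), all.x=TRUE)
Also be aware that repeated (A, B) pairs (like A=13, B=2 here) match each other many times, so the merged result can have more rows than either input.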
