This question already has answers here:
Create new column in dataframe based on partial string matching other column
(2 answers)
Closed 1 year ago.
I would like to make a new variable unsure that contains the word "unsure" whenever the freetext column contains any of the following phrases: "too soon" or "to tell". The freetext column should stay unchanged, and the new column should be NA when freetext doesn't contain those phrases. Currently the data looks like:
id freetext date
1 1 its too soon 1
2 2 I'm not sure 2
3 3 pink 12
4 4 yellow 15
5 5 too soon to tell 20
6 6 I think it is too soon 2
7 7 5 days 6
8 8 red 7
9 9 its been 2 days 3
10 10 too soon to tell 11
The data:
structure(list(id = c("1","2","3","4","5","6","7","8","9","10"),
freetext = c("its too soon", "I'm not sure",
"pink","yellow","too soon to tell","I think it is too soon","5 days","red",
"its been 2 days","too soon to tell"),
date = c("1","2","12","15","20","2","6","7","3","11")), class = "data.frame", row.names = c(NA,-10L))
And I would like it to look like:
id freetext unsure date
1 1 its too soon unsure 1
2 2 I'm not sure <NA> 2
3 3 pink <NA> 12
4 4 yellow <NA> 15
5 5 too soon to tell unsure 20
6 6 I think it is too soon unsure 2
7 7 5 days <NA> 6
8 8 red <NA> 7
9 9 its been 2 days <NA> 3
10 10 too soon to tell unsure 11
You can use if_else with str_detect for pattern matching -
library(tidyverse)
df %>% mutate(unsure = if_else(str_detect(freetext, 'too soon|to tell'), 'unsure', NA_character_))
# id freetext date unsure
#1 1 its too soon 1 unsure
#2 2 I'm not sure 2 <NA>
#3 3 pink 12 <NA>
#4 4 yellow 15 <NA>
#5 5 too soon to tell 20 unsure
#6 6 I think it is too soon 2 unsure
#7 7 5 days 6 <NA>
#8 8 red 7 <NA>
#9 9 its been 2 days 3 <NA>
#10 10 too soon to tell 11 unsure
In base R -
transform(df, unsure = ifelse(grepl('too soon|to tell', freetext), 'unsure', NA))
Related
I'm using FiveThirtyEight's Star Wars survey.
On $Anakin I've assigned 0 (very unfavourably) to 5 (very favourably) as categorical variables to the respondent's view of Anakin. "N/A" on the survey was assigned "". (Did that step on MS Excel)
$Startrek contains whether the respondent's seen Star Trek or not.
starwars <- read.csv2("starsurvey.csv", header = TRUE, stringsAsFactors = FALSE)
starwars$Anakin <- as.factor(starwars$Anakin)
starwars$Startrek <- as.factor(starwars$Startrek)
tbl <- table(starwars$Anakin, starwars$Startrek)
The table() function returns this:
No Yes
1 0 20 19
2 2 31 50
3 0 68 67
4 1 140 128
5 5 101 139
I'm wondering why the function returns 0, 2, 0, 1, 5 for the factors in $Anakin, since it contains:
starwars$Anakin
[1] 5 <NA> 4 5 2 5 4 3 4 5 <NA> <NA> 4 4
[15] 4 2 3 5 5 5 4 3 3 2 5 <NA> 4 4
[29] 1 1 3 5 2 <NA> <NA> 5 5 4 4 4 3 4
[43] 4 4 4 4 <NA> 2 3 <NA> 4 4 5 4 4 <NA>
The output of table() here is confusing because your factor levels (1 to 5) look like row numbers, and there are some blank ("") responses to the Startrek variable, which makes it appear as if the data only falls under the No and Yes columns.
So, the data here is a 5 by 3 table, with the rows representing the score from Anakin (1 to 5) and the columns representing 3 types of response to Startrek ("", No, Yes).
Note that where there are NA's in Anakin, those rows are ignored in the table. To count these too, use addNA:
table(addNA(starwars$Anakin), starwars$Startrek)
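To see the blank-column behaviour and addNA in isolation, here is a minimal sketch with made-up data (these vectors are placeholders, not the real survey):

```r
# Hypothetical mini-version of the two survey columns
anakin   <- factor(c(5, 4, NA, 2, 5, 3))
startrek <- c("Yes", "No", "Yes", "", "Yes", "No")

# The "" responses get their own (blank-headed) column, so the table is levels x 3
tb <- table(anakin, startrek)

# The NA score is silently dropped above; addNA gives it its own row
tb2 <- table(addNA(anakin), startrek)
```

Printing tb shows three columns even though only No and Yes carry visible headers, which is exactly what makes the original output hard to read.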
I have a simple problem, and a bit more complicated twist at the end.
I have 2 datasets A & B (Separate when imported into R):
Dataset A is pulled from a DAQ that is sampling at 2000 times a second, while dataset B is pulled from a scope at 500 times a second. I have a test that records data from the DAQ and Scope for 5 seconds.
In R Studio I want to time synchronize this data and, for the sake of learning, how can I do it in both of the following ways?
1) Without duplicating values so filtering doesn't stair step:
A B
1 1 1
2 2 NA
3 3 NA
4 4 NA
5 5 2
6 6 NA
7 7 NA
8 8 NA
9 9 3
10 10 NA
11 11 NA
12 12 NA
2) With duplicating numbers if I don't want NA's in the functions I apply to the frame:
A B
1 1 1
2 2 1
3 3 1
4 4 1
5 5 2
6 6 2
7 7 2
8 8 2
9 9 3
10 10 3
11 11 3
12 12 3
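Both alignment styles above can be sketched in base R. This is a hedged toy example, assuming a 4:1 sampling-rate ratio and 12 samples; the names idx, A, and B are placeholders:

```r
# Hypothetical toy series: A sampled 4x faster than B
A <- data.frame(idx = 1:12, A = 1:12)
B <- data.frame(idx = seq(1, 12, by = 4), B = 1:3)  # one B per 4 A samples

# 1) Keep gaps as NA: left-join on the shared sample index
nogap <- merge(A, B, by = "idx", all.x = TRUE)

# 2) Fill each NA with the last observed B value
#    (last-observation-carried-forward, done without extra packages)
filled <- nogap
notna  <- !is.na(filled$B)
filled$B <- filled$B[notna][cumsum(notna)]
```

The cumsum(notna) trick maps every row to the index of the most recent non-NA B, which reproduces the duplicated layout in the second table.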
Now here is the twist that makes this a fairly unique problem. Let's say Dataset A records a bit before and after the 5-second test. Dataset A also has an extra column, "Trigger", which is either 0 or 1. A 1 represents recording and is basically where Dataset B starts; when it switches back to 0, Dataset B has finished recording.
Is there a way I can strategically do the above time sync in Dataset A? The reason I want to keep the data before & after the "true" recording section, is to make sure a filter or a filtfilt sweep will level out before the data truly starts.
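One hedged sketch of that trigger-windowed alignment, again with made-up data (the frame names daq and scope, the 4:1 ratio, and the window position are all assumptions):

```r
# Hypothetical DAQ frame: 4 pre-trigger rows, 12 triggered rows, 4 post-trigger rows
daq <- data.frame(A = 1:20,
                  Trigger = c(rep(0, 4), rep(1, 12), rep(0, 4)))
scope <- data.frame(B = 1:3)   # one scope sample per 4 DAQ samples while triggered

# Rows of the window where the scope was actually recording
win <- which(daq$Trigger == 1)

# Place B on the DAQ time base; pre/post-trigger rows stay NA,
# so a filter can settle on the untouched lead-in data
daq$B <- NA
daq$B[win[seq(1, length(win), by = 4)]] <- scope$B
```

From here the same carried-forward fill as above can be applied inside the triggered window only, leaving the lead-in and lead-out rows untouched.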
Thanks for any help!
This question already has answers here:
Find consecutive values in vector in R [duplicate]
(2 answers)
Closed 6 years ago.
I am fairly new to the art of programming (loops etc.) and I would be grateful for an opinion on whether my approach is fine or whether it would definitely need to be optimized before being used on a much bigger sample.
Currently I have approximately 20,000 observations, and one of the columns is the receipt ID. What I would like to achieve is to assign each row to a group consisting of IDs that ascend in steps of 1 (n+1). Whenever this rule is broken, a new group should be created, and so on.
To illustrate, let's say I have this table (an important note is that the IDs are not necessarily unique and can repeat, like ID 10 in my example):
MyTable <- data.frame(ID = c(1,2,3,4,6,7,8,10,10,11,17,18,19,200,201,202,2010,2011,2013))
MyTable
ID
1
2
3
4
6
7
8
10
10
11
17
18
19
200
201
202
2010
2011
2013
The result of my grouping should be following:
ID GROUP
1 1
2 1
3 1
4 1
6 2
7 2
8 2
10 3
10 3
11 3
17 4
18 4
19 4
200 5
201 5
202 5
2010 6
2011 6
2013 7
I used dplyr to order the IDs in ascending order. Then I created the variable MyTable$GROUP and filled it with 1's:
MyTable$GROUP <- rep(1, length(MyTable$ID))
for (i in 2:length(MyTable$ID) ) {
if(MyTable$ID[i] == MyTable$ID[i-1]+1 | MyTable$ID[i] == MyTable$ID[i-1]) {
MyTable$GROUP[i] <- MyTable$GROUP[i-1]
} else {
MyTable$GROUP[i] <- MyTable$GROUP[i-1]+1
}
}
This code worked for me and I got the results fairly easily. However, I wonder whether, in the eyes of more experienced programmers, this piece of code would be considered "bad", "average", "good", or whatever rating you come up with.
EDIT: I am sure this topic has been touched already, not arguing against that. Though, as the main difference is that I would like to touch a topic of optimization here and see whether my approach meets standards.
Thanks!
To make a long story short:
MyTable$Group <- cumsum(c(1, diff(MyTable$ID) != 1))
# ID Group
#1 1 1
#2 2 1
#3 3 1
#4 4 1
#5 6 2
#6 7 2
#7 8 2
#8 10 3
#9 10 4
#10 11 4
#11 17 5
#12 18 5
#13 19 5
#14 200 6
#15 201 6
#16 202 6
#17 2010 7
#18 2011 7
#19 2013 8
You are looking for all the differences in your vector MyTable$ID that are not 1; these are your "breaks". Then you cumsum over those values. If you don't know cumsum, type ?cumsum. Note that a repeated ID gives a difference of 0, which also starts a new group here; the update below handles that case.
That's all!
UPDATE: with repeating IDs, you can use this:
MyTable <- data.frame(ID = c(1,2,3,4,6,7,8,10,10,11,17,18,19,200,201,202,2010,2011,2013))
MyTable$Group <- cumsum(c(1, !diff(MyTable$ID) %in% c(0,1) ))
# ID Group
#1 1 1
#2 2 1
#3 3 1
#4 4 1
#5 6 2
#6 7 2
#7 8 2
#8 10 3
#9 10 3
#10 11 3
#11 17 4
#12 18 4
#13 19 4
#14 200 5
#15 201 5
#16 202 5
#17 2010 6
#18 2011 6
#19 2013 7
This question already has answers here:
Filtering a data frame by values in a column [duplicate]
(3 answers)
Closed 3 years ago.
I have the following data with the ID of subjects.
V1
1 2
2 2
3 2
4 2
5 2
6 2
7 2
8 2
9 2
10 2
11 2
12 2
13 2
14 2
15 2
16 4
17 4
18 4
19 4
20 4
21 4
22 4
23 4
24 4
I want to subset all the rows of the data where V1 == 4. This way I can see which observations relate to subject 4.
For example, the correct output would be
16 4
17 4
18 4
19 4
20 4
21 4
22 4
23 4
24 4
However, the output I'm given after subsetting does not show the correct rows. It simply gives me:
V1
1 4
2 4
3 4
4 4
5 4
6 4
7 4
8 4
I'm unable to tell which observations relate to subject 4, as observations 1:8 are for subject 2.
I've tried the usual methods, such as
condition <- df == 4
df[condition]
How can I subset the data so that I'm given back a dataset showing the correct row numbers for subject 4?
You can also use the subset function:
subset(df, V1 == 4)
I've managed to find a solution since posting.
newdf <- subset(df, V1 == 4)
However I'm still very interested in other solutions to this problem, so please post if you're aware of another method.
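The reason df[condition] loses the row information is that indexing a data frame with a logical matrix returns a bare vector of matching values, not rows. A minimal sketch reproducing the situation with made-up data (24 rows, subjects 2 and 4, as in the question):

```r
# Toy version of the data: 15 rows for subject 2, then 9 for subject 4
df <- data.frame(V1 = rep(c(2, 4), times = c(15, 9)))

# df[df == 4] would return just the values 4, 4, ... with no row context.
# Row-wise subsetting keeps the original row names:
sub1 <- df[df$V1 == 4, , drop = FALSE]

# subset() does the same without repeating the data frame name
sub2 <- subset(df, V1 == 4)
```

rownames(sub1) here is "16" through "24", so the observations can be traced back to subject 4's original rows.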
I'm trying to merge 7 complete data frames into one great wide data frame. I figured I have to do this stepwise and merge 2 frames into 1 and then that frame into another so forth until all 7 original frames becomes one.
fil2005: "ID" "abr_2005" "lop_2005" "ins_2005"
fil2006: "ID" "abr_2006" "lop_2006" "ins_2006"
But the variables "abr_2006", "lop_2006", "ins_2006" (and their 2005 counterparts) are all coded 0/1.
Now the thing is, I want to either merge or do a dcast of some sort (I think) to turn these two long data frames into one wide data frame where both "abr_2005" "lop_2005" "ins_2005" and "abr_2006" "lop_2006" "ins_2006" are in that final file.
When I try
fil_2006.1 <- merge(x=fil_2005, y=fil_2006, by="ID__", all.y=T)
all the variables ending in _2005 are saved to fil_2006.1, but the variables ending in _2006 are not.
I'm apparently doing something wrong. Any idea?
Is there a reason you put those underscores after ID__? Otherwise, the code you provided will work.
An example:
dat1 <- data.frame("ID"=seq(1,20,by=2),"varx2005"=1:10, "vary2005"=2:11)
dat2 <- data.frame("ID"=5:14,"varx2006"=1:20, "vary2006"=21:40)
# create data frames of differing lengths
head(dat1)
ID varx2005 vary2005
1 1 1 2
2 3 2 3
3 5 3 4
4 7 4 5
5 9 5 6
6 11 6 7
head(dat2)
ID varx2006 vary2006
1 5 1 21
2 6 2 22
3 7 3 23
4 8 4 24
5 9 5 25
6 10 6 26
merged <- merge(dat1,dat2,by="ID",all=T)
head(merged)
  ID varx2005 vary2005 varx2006 vary2006
1  1        1        2       NA       NA
2  3        2        3       NA       NA
3  5        3        4        1       21
4  5        3        4       11       31
5  7        4        5       13       33
6  7        4        5        3       23
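Since there are seven yearly frames, the stepwise chain of merges can also be folded into a single call with Reduce(). A hedged sketch with made-up toy frames (fil2007 and the abr_ columns are placeholders standing in for the real files):

```r
# Hypothetical yearly files with the same layout as fil2005/fil2006
fil2005 <- data.frame(ID = 1:5, abr_2005 = c(0, 1, 0, 1, 1))
fil2006 <- data.frame(ID = 3:7, abr_2006 = c(1, 1, 0, 0, 1))
fil2007 <- data.frame(ID = 2:6, abr_2007 = c(0, 0, 1, 1, 0))

# Fold merge() over the whole list instead of chaining it by hand;
# all = TRUE keeps IDs that are missing in some years (filled with NA)
wide <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE),
               list(fil2005, fil2006, fil2007))
```

With seven real frames the list() call simply grows; the fold handles any number of inputs the same way.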