subsetting a dataframe by a condition in R [duplicate] - r

This question already has answers here:
Filtering a data frame by values in a column [duplicate]
(3 answers)
Closed 3 years ago.
I have the following data with the ID of subjects.
V1
1 2
2 2
3 2
4 2
5 2
6 2
7 2
8 2
9 2
10 2
11 2
12 2
13 2
14 2
15 2
16 4
17 4
18 4
19 4
20 4
21 4
22 4
23 4
24 4
I want to subset all the rows of the data where V1 == 4. This way I can see which observations relate to subject 4.
For example, the correct output would be
16 4
17 4
18 4
19 4
20 4
21 4
22 4
23 4
24 4
However, the output I'm given after subsetting does not give me the correct rows . It simply gives me.
V1
1 4
2 4
3 4
4 4
5 4
6 4
7 4
8 4
I'm unable to tell which observations relate to subject 4, as observations 1:8 are for subject 2.
I've tried the usual methods, such as
condition<- df == 4
df[condition]
How can I subset the data so I'm given back a dataset that shows the correct row numbers for subject 4.

You can also use the subset function:
subset(df,df$V1==4)

I've managed to find a solution since posting.
newdf <- subset(df, V1 == 4).
However i'm still very interested in other solutions to this problems, so please post if you're aware of another method.

Related

How to create a vector in R in the pattern (0,0,1,1,2,2,3,3...len(list/2))

I am trying to create an index for a data frame. Each team playing has its own row, but I would like to add a column to use as an index so that the first two teams have the index 'Game 0', the next two teams have the index 'Game 1' until the length of half the list. In python the code would look as follows:
for i in range(0,int(len(teams)/2)):
gamenumber.append('Game '+str(i))
gamenumber.append('Game '+str(i))
I am unfamiliar with R so any help would be appreciated!
This will give you a list of paired index numbers:
> teams=1:100
> data.frame("Games"=sort(c(1:(length(teams)/2), 1:(length(teams)/2))))
Games
1 1
2 1
3 2
4 2
5 3
6 3
7 4
8 4
9 5
10 5
11 6
12 6
13 7
14 7
15 8
16 8
17 9
18 9
19 10
20 10 #etc.
Assuming teams is a data.frame with an even number of rows:
rep(1:(nrow(teams)/2), each=2)

How to run a loop on different sections of the same data.frame [duplicate]

This question already has answers here:
Grouping functions (tapply, by, aggregate) and the *apply family
(10 answers)
Closed 7 years ago.
Suppose I have a data frame with 2 variables which I'm trying to run some basic summary stats on. I would like to run a loop to give me the difference between minimum and maximum seconds values for each unique value of number. My actual data frame is huge and contains many values for 'number' so subsetting and running individually is not a realistic option. Data looks like this:
df <- data.frame(number=c(1,1,1,2,2,2,2,3,3,4,4,4,4,4,4,5,5,5,5),
seconds=c(1,4,8,1,5,11,23,1,8,1,9,11,24,44,112,1,34,55,109))
number seconds
1 1 1
2 1 4
3 1 8
4 2 1
5 2 5
6 2 11
7 2 23
8 3 1
9 3 8
10 4 1
11 4 9
12 4 11
13 4 24
14 4 44
15 4 112
16 5 1
17 5 34
18 5 55
19 5 109
my current code only returns the value of the difference between minimum and maximum seconds for the entire data fram:
ZZ <- unique(df$number)
for (i in ZZ){
Y <- max(df$seconds) - min(df$seconds)
}
Since you have a lot of data performance should matter and you should use a data.table instead of a data.frame:
library(data.table)
dt <- as.data.table(df)
dt[, .(spread = (max(seconds) - min(seconds))), by=.(number)]
number spread
1: 1 7
2: 2 22
3: 3 7
4: 4 111
5: 5 108

How to merge dating correctly

I'm trying to merge 7 complete data frames into one great wide data frame. I figured I have to do this stepwise and merge 2 frames into 1 and then that frame into another so forth until all 7 original frames becomes one.
fil2005: "ID" "abr_2005" "lop_2005" "ins_2005"
fil2006: "ID" "abr_2006" "lop_2006" "ins_2006"
But the variables "abr_2006" "lop_2006" "ins_2006" and 2005 are all either 0,1.
Now the things is, I want to either merge or do a dcast of some sort (I think) to make these two long data frames into one wide data frame were both "abr_2005" "lop_2005" "ins_2005" and abr_2006" "lop_2006" "ins_2006" are in that final file.
When I try
$fil_2006.1 <- merge(x=fil_2005, y=fil_2006, by="ID__", all.y=T)
all the variables with _2005 at the end if it is saved to the fil_2006.1, but the variables ending in _2006 doesn't.
I'm apparently doing something wrong. Any idea?
Is there a reason you put those underscores after ID__? Otherwise, the code you provided will work
An example:
dat1 <- data.frame("ID"=seq(1,20,by=2),"varx2005"=1:10, "vary2005"=2:11)
dat2 <- data.frame("ID"=5:14,"varx2006"=1:20, "vary2006"=21:40)
# create data frames of differing lengths
head(dat1)
ID varx2005 vary2005
1 1 1 2
2 3 2 3
3 5 3 4
4 7 4 5
5 9 5 6
6 11 6 7
head(dat2)
ID varx2006 vary2006
1 5 1 21
2 6 2 22
3 7 3 23
4 8 4 24
5 9 5 25
6 10 6 26
merged <- merge(dat1,dat2,by="ID",all=T)
head(merged)
ID varx2006 vary2006 varx2005 vary2005
1 1 NA NA 1 2
2 3 NA NA 2 3
3 5 1 21 3 4
4 5 11 31 3 4
5 7 13 33 4 5
6 7 3 23 4 5

Compute difference between rows in R and setting in zero first difference

Hi everybody I am trying to solve a little problem in R. I want to compute the difference between rows in a dataframe in R. My dataframe looks like this:
df <- data.frame(ID=1:8, x2=8:1, x3=11:18, x4=c(2,4,10,0,1,1,9,12))
I want to create a new column named diff.var. This column saves the results of differences from rows in variable. One posibble solution is using diff() function. When I used this function I got this:
diff(df$x4)
[1] 2 6 -10 1 0 8 3
That works fine but when I try to apply in my dataframe using df$diff.var=diff(df$x4) I got this:
Error in `$<-.data.frame`(`*tmp*`, "diff.var", value = c(2, 6, -10, 1, :
replacement has 7 rows, data has 8
Due to the fact that the firs row doesn't have a previous row to compute the difference I want to set this in zero. I would like to get something this:
ID x2 x3 x4 diff.var
1 8 11 2 0
2 7 12 4 2
3 6 13 10 6
4 5 14 0 -10
5 4 15 1 1
6 3 16 1 0
7 2 17 9 8
8 1 18 12 3
Where the first element of diff.var is zero due to this element doesn't have a previous element. I would like to build a function to set firts element of diff.var is zero and that makes the differences for the next rows. I wish to create a new dataframe with all variables and diff.var because ID is used por posterior analysis with diff.var. diff() doesn't allow to create this new variable. Thanks for your help.
This question was already asked before in this forum and can be found elsewhere. Anyway, do what Frank suggests
df <- data.frame(ID=1:8, x2=8:1, x3=11:18, x4=c(2,4,10,0,1,1,9,12))
df$vardiff <- c(0, diff(df$x4))
df
ID x2 x3 x4 vardiff
1 1 8 11 2 0
2 2 7 12 4 2
3 3 6 13 10 6
4 4 5 14 0 -10
5 5 4 15 1 1
6 6 3 16 1 0
7 7 2 17 9 8
8 8 1 18 12 3

How can I produce a table into a data.frame?

I printed out the summary of a column variables as such:
Please see below the summary table printed out from R:
I would like to generate it into a data.frame. However, there are too many subject names that it's very difficult to list out all, also, the term "OTHER" with number 31 means that there are 319 subjects which appear only 1 time in the original data.frame.
So, the new data.frame I hope to produce would look like below:
Here is one possible solution.
Table<-table(rpois(100,5))
as.data.frame(Table)
Var1 Freq
1 1 2
2 2 11
3 3 9
4 4 18
5 5 13
6 6 20
7 7 14
8 8 8
9 9 3
10 10 1
11 11 1

Resources