Subsetting values not adding up in R [closed] - r

I have a dataframe (df) in R. All columns are character class.
> dim(df)
[1] 1000 6
I'm trying to remove rows where df$entry == c("7795").
entries_to_remove <- subset(df, entry == c("7795"))
> dim(entries_to_remove)
[1] 35 6
So as you can see above, I have 35 entries to remove from the data frame. However, when I go to remove these using subset, it doesn't remove the correct amount:
entries_to_remove <- subset(df, entry != c("7795"))
> dim(entries_to_remove)
[1] 648 6
The above command was supposed to remove 35 entries, but instead it removed 352. Does anyone know why this might be happening?

Here's another solution, which takes up just one line:
# !grepl() is used instead of -which(grepl()) because -which() drops every row when nothing matches
df[!grepl("7995", apply(df, 1, paste0, collapse = " ")), ]
RESULT:
v1 entry1 entry2 entry3
2 2 5 5 2
3 3 2 4 2
4 4 2 3 1
6 6 1 2 1
7 7 2 4 4
8 8 4 5 5
9 9 5 1 5
DATA:
set.seed(121)
df <- data.frame(
v1 = 1:10,
entry1 = c(sample(1:5, 9, replace = T), 7995),
entry2 = c(sample(1:5, 4), 7995, sample(1:5, 5)),
entry3 = c(7995, sample(1:5, 9, replace = T))
)
df[2:4] <- lapply(df[2:4], as.character) # convert to character, as in your data
df
v1 entry1 entry2 entry3
1 1 1 2 7995
2 2 5 5 2
3 3 2 4 2
4 4 2 3 1
5 5 3 7995 2
6 6 1 2 1
7 7 2 4 4
8 8 4 5 5
9 9 5 1 5
10 10 7995 3 5
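One caveat with the grepl() approach is that it matches substrings, so a cell such as "17995" also counts as a hit. Below is a minimal sketch of an exact-match alternative, assuming a small made-up data frame (the values are for illustration only, not taken from the question):

```r
# Hypothetical data: row 4 holds "17995", which only contains "7995" as a substring
df <- data.frame(v1 = 1:4,
                 entry1 = c("1", "7995", "2", "17995"),
                 entry2 = c("7995", "3", "4", "5"),
                 stringsAsFactors = FALSE)

# Substring match, as in the one-liner: row 4 is dropped as well
keep_grepl <- !grepl("7995", apply(df, 1, paste0, collapse = " "))
df[keep_grepl, ]

# Exact match on each cell: only rows 1 and 2 are dropped
keep_exact <- rowSums(df == "7995", na.rm = TRUE) == 0
df[keep_exact, ]
```

Which of the two is right depends on whether partial matches should count; for ID-like values the exact comparison is probably the safer choice.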

The above solutions didn't work, and I do not think the issue is with NA. However, I solved the problem myself. It is a workaround, but it worked:
# get the row names of the entries to remove
row_remove <- rownames(entries_to_remove)
# make a vector of all the row numbers
all_rows <- 1:dim(df)[1]
# keep only the row numbers that are not marked for removal
subset_row <- all_rows[!(all_rows %in% row_remove)]
# subset the data frame with these rows
df <- df[subset_row, ]
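The same workaround can probably be written more compactly by comparing row names directly. This is only a sketch on a made-up five-row data frame; it assumes entries_to_remove was produced by the == version of subset():

```r
# Hypothetical data with one NA in the entry column
df <- data.frame(entry = c("7795", "1", NA, "7795", "2"), stringsAsFactors = FALSE)
entries_to_remove <- subset(df, entry == "7795")

# Keep every row whose name is not among the rows to remove
kept <- df[!(rownames(df) %in% rownames(entries_to_remove)), , drop = FALSE]
nrow(kept)   # 3, including the NA row
```

Because this never compares entry itself in the removal step, the row with NA stays in the result.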

The issue has to do with NAs. Some of the other solutions will work, but the easiest and, I think, most intuitive fix is to use %in% rather than ==:
entries_to_keep <- subset(df, !(entry %in% c("7795")))
entries_to_remove <- subset(df, entry %in% c("7795"))
This should explain what's happening. Notice how == returns NA rather than FALSE when it meets an NA:
> c( 5, 6, 7) == 5
[1] TRUE FALSE FALSE
> c( 5, 6, 7 , NA) == 5
[1] TRUE FALSE FALSE NA
> c( 5, 6, 7 , NA) %in% 5
[1] TRUE FALSE FALSE FALSE
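subset() treats an NA condition as FALSE, so rows where entry is NA are dropped by both the == and the != version. That is why the two counts do not add up to 1000; the 317 rows unaccounted for (1000 - 35 - 648) are presumably the ones with NA in entry.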
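To see the effect on row counts, here is a small sketch with made-up data (3 matches, 4 non-matches and 3 missing values; none of this is the asker's data):

```r
# Hypothetical data: 3 matches, 4 non-matches, 3 missing values
df <- data.frame(entry = c("7795", "7795", "7795", "1", "2", "3", "4", NA, NA, NA),
                 stringsAsFactors = FALSE)

n_eq  <- nrow(subset(df, entry == "7795"))       # NA rows dropped
n_neq <- nrow(subset(df, entry != "7795"))       # NA rows dropped again
n_in  <- nrow(subset(df, entry %in% "7795"))
n_out <- nrow(subset(df, !(entry %in% "7795")))  # NA rows kept

c(n_eq, n_neq, n_eq + n_neq)  # 3 4 7: three rows are missing from the total
c(n_in, n_out, n_in + n_out)  # 3 7 10: adds up to nrow(df)
sum(is.na(df$entry))          # 3: the rows that == and != both skipped
```

Checking sum(is.na(df$entry)) on the real data should show whether the gap between the two counts is explained by missing values.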

Related

Replace n number of rows with condition in R? [closed]

I have a df :
number=c(3,3,3,3,3,1,1,1,1,4,4,4,4,4,4)
data.frame(number)
but with thousands of rows.
How can I replace the value in n of the rows, out of many more, for example turning 3 into 1?
It would be great if you could explain the logic too.
There are no special requirements: just replace a certain number of the 3s with 1, not all of them.
The rows can be chosen either randomly or as the first n.
Here are two versions for you. The first assumes you randomly want to convert n rows from 3 to 1. The second assumes that you want to choose the first n rows from 3 to 1.
To randomly select n of the rows where the value is currently 3, and then convert to 1:
> number=c(3,3,3,3,3,1,1,1,1,4,4,4,4,4,4)
>
>
> # to randomly change n rows (assume here that n = 4)
> set.seed(1)
> df <- data.frame(v1 = number)
> df$v1[sample(which(df$v1 == 3), 4)] <- 1
> df
v1
1 1
2 1
3 1
4 1
5 3
6 1
7 1
8 1
9 1
10 4
11 4
12 4
13 4
14 4
15 4
To change to the first n rows (assume again that n = 4):
> df <- data.frame(v1 = number)
> df$v1[which(df$v1 == 3)[1:4]] <- 1
> df
v1
1 1
2 1
3 1
4 1
5 3
6 1
7 1
8 1
9 1
10 4
11 4
12 4
13 4
14 4
15 4
Since you wanted the logic for how this works:
Both answers rely on which(), which returns the positions where a logical vector is TRUE. So which(df$v1 == 3) gives us the positions of all the rows where df$v1 is 3:
> df$v1 == 3
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> which(df$v1 == 3)
[1] 1 2 3 4 5
We then simply specify that we want to reassign df$v1 at those positions to 1. However, since you wanted to specify how many rows to do this for, we subset the result of our which() vector by using [1:n] to select the first n results, or sample(x, n) to randomly select n results.
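One thing to watch with [1:n]: if fewer than n rows match, the index vector is padded with NA. A small sketch, assuming a shortened made-up vector with only two 3s, shows head() as a possible safeguard:

```r
number <- c(3, 3, 1, 4)          # hypothetical data with only two 3s
df <- data.frame(v1 = number)
n <- 4                           # ask for more matches than exist

idx_bracket <- which(df$v1 == 3)[1:n]      # 1 2 NA NA: padded with NA
idx_head    <- head(which(df$v1 == 3), n)  # 1 2: at most n positions

df$v1[idx_head] <- 1
df$v1                            # 1 1 1 4
```

head(x, n) simply returns all of x when it has fewer than n elements, so no NA positions appear.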
I am assuming you want to select n appearances of some value in a data.frame column.
For that you can sample, with or without replacement, all the values that match your requirements.
Below I show how to do that for three of the 3s:
number <- c(3,3,3,3,3,1,1,1,1,4,4,4,4,4,4)
foo <- data.frame(number)
indexes <- sample(which(foo$number == 3), size = 3, replace = FALSE)
foo$number[indexes] <- 1 # or any other replacement; a character value would turn the whole column into character
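A known quirk of sample() is that a single number x >= 1 is treated as 1:x, so if which() finds exactly one match, sample() may pick a different row. A hedged sketch of one way around this, using made-up data with a single 3, is to sample positions within the match vector instead:

```r
number <- c(1, 1, 1, 1, 3)       # hypothetical data: a single 3, at position 5
foo <- data.frame(number)
matches <- which(foo$number == 3)            # 5

# Index into `matches` rather than passing it to sample() directly
set.seed(42)
pick <- matches[sample.int(length(matches), size = 1)]
foo$number[pick] <- 1
foo$number                        # 1 1 1 1 1
```

With several matches this behaves like the sample(which(...)) call; the difference only shows up in the single-match case.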

R code - split column each specific row values [duplicate]

how can I split the following data.frame
df <- data.frame(var1 = c("a", 1, 2, 3, "a", 1, 2, 3, 4, 5, 6, "a", 1, 2), var2 = 1:14)
into lists of / groups of
a 1
1 2
2 3
3 4
a 5
1 6
2 7
3 8
4 9
5 10
6 11
a 12
1 13
2 14
So basically, value "a" in column 1 is the tag / identifier I want to split the data frame on. I know about the split function but that means I have to add another column and since, as can be seen from my example, the size of the groups can vary I do not know how to automatically create such a dummy column to fit my needs.
Any ideas on that?
Cheers,
Sven
You could find which values of the indexing vector equal "a", then create a grouping variable based on that and then use split.
df[,1] == "a"
# [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#[13] FALSE FALSE
cumsum(df[,1] == "a")
# [1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3
split(df, cumsum(df[,1] == "a"))
#$`1`
# var1 var2
#1 a 1
#2 1 2
#3 2 3
#4 3 4
#
#$`2`
# var1 var2
#5 a 5
#6 1 6
#7 2 7
#8 3 8
#9 4 9
#10 5 10
#11 6 11
#
#$`3`
# var1 var2
#12 a 12
#13 1 13
#14 2 14
You could create a loop that goes through the entire first column of the data frame and saves the positions of the non-numeric values in a vector. Thus, you'd have something like:
data <- as.character(df$var1) # the values to scan; the column is not numeric because "a" forces coercion
positions <- c()
for (i in seq_along(data)) {
  # is.numeric() is always FALSE here, so test whether the value can be parsed as a number
  if (is.na(suppressWarnings(as.numeric(data[i])))) {
    positions <- append(positions, i) # saves the positions of the non-numeric values
  }
}
With those positions, splitting the data frame is just a matter of taking the row ranges between consecutive values in the positions vector.
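As a sketch of that last step, assuming positions holds the row numbers of the "a" rows (hard-coded here as 1, 5 and 12 for the example data), findInterval() can turn them into a grouping vector for split():

```r
df <- data.frame(var1 = c("a", 1, 2, 3, "a", 1, 2, 3, 4, 5, 6, "a", 1, 2),
                 var2 = 1:14, stringsAsFactors = FALSE)
positions <- c(1, 5, 12)   # assumed: rows holding the "a" tag

# findInterval() maps each row number to the index of the last tag at or before it
group <- findInterval(seq_len(nrow(df)), positions)
pieces <- split(df, group)
sapply(pieces, nrow)       # 4 7 3
```

This gives the same three groups as the cumsum() approach, just starting from the position vector instead of the logical comparison.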

For each row in data frame, return variable with non-zero column names

I am trying to create a variable that contains a list of all of the column names that are not zero for each row.
Example of data:
set.seed(334)
DF <- matrix(sample(0:9,9),ncol=4,nrow=10)
DF <- as.data.frame.matrix(DF)
DF$id <- c("ty18","se78","first", "gh89", "sil12","seve","aga2", "second","anotherX", "CH560")
DF$count <- rowSums(DF[,2:5]>0)
DF
> V1 V2 V3 V4 id count
> 1 9 4 0 5 ty18 3
> 2 4 0 5 8 se78 3
> 3 0 5 8 2 first 4
> 4 5 8 2 6 gh89 4
> 5 8 2 6 7 sil12 4
> 6 2 6 7 3 seve 4
> 7 6 7 3 9 aga2 4
> 8 7 3 9 4 second 4
> 9 3 9 4 0 anotherX 3
> 10 9 4 0 5 CH560 3
The desired output would be a new variable that was, for row 1, "V1 V2 V4" and for row 2 "V1 V3 V4". I only want to use the V1-V4 for this, and not consider id or count.
This question on SO helped: For each row return the column name of the largest value
I tried to test this out, but it ignores my selective columns, even for max, so the first test here just gives the max for the whole row, which is not always in V1-V4 in my data.
DF$max <- colnames(DF)[apply(DF[,1:4],1,which.max)]
Despite the error, I think I need to do something like this, but my DF$list attempt is clearly all wrong:
DF$list <- colnames(DF[,1:4]>0)
I'm getting
Error in `$<-.data.frame`(`*tmp*`, "list", value = c("V1", "V2", "V3", :
replacement has 4 rows, data has 10
Maybe I'm trying to put a vector into a cell, and that is why it doesn't work, but I don't know how to get this information out and then make it into a string. I also don't understand why the max on selective columns did not work.
How about this
DF$nonzeros <- simplify2array(
  apply(
    DF[1:4], 1,
    function(x) paste(names(DF[1:4])[x != 0], collapse = " ")
  )
)
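For what it's worth, here is a small self-contained sketch of the same idea on the first three rows of the example, with the values typed in by hand so it does not depend on the random seed. The cols vector is a name introduced here for illustration:

```r
# First three rows of the example, entered manually
DF <- data.frame(V1 = c(9, 4, 0), V2 = c(4, 0, 5), V3 = c(0, 5, 8), V4 = c(5, 8, 2),
                 id = c("ty18", "se78", "first"), stringsAsFactors = FALSE)
cols <- c("V1", "V2", "V3", "V4")   # only these columns are considered

nonzero <- DF[cols] != 0            # logical matrix, one row per data row
DF$nonzeros <- apply(nonzero, 1, function(z) paste(cols[z], collapse = " "))
DF$count <- rowSums(nonzero)        # count over the same four columns
DF$nonzeros   # "V1 V2 V4" "V1 V3 V4" "V2 V3 V4"
```

Selecting the columns by name keeps id and any later-added columns out of both the string and the count.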
