I am very new to R, so this question may seem stupid, but please bear with me. Here's what my data looks like:
col1 col2
1 2 9
2 2 2
3 1 8
4 1 1
5 2 4
6 2 5
7 2 3
8 1 10
9 1 6
10 2 7
It can be reproduced with:
data <- data.frame(col1 = sample(c(1,2), 10, replace = TRUE),
col2 = as.factor(sample(10)))
I want to have all rows in col2 multiplied by 2, if the corresponding value in col1 is "1". So the end result should be like:
col1 col2
1 2 9
2 2 2
3 1 16
4 1 2
5 2 4
6 2 5
7 2 3
8 1 20
9 1 12
10 2 7
Any ideas? Thanks in advance for your help.
If the data were numeric, you could assign to a slice with a simple computation:
> d[d$col1==1,2] <- 2*d[d$col1==1,2]
> d
col1 col2
1 2 9
2 2 2
3 1 16
4 1 2
5 2 4
6 2 5
7 2 3
8 1 20
9 1 12
10 2 7
With a factor, this becomes problematic as you cannot do the substitution in-place (the existing factor doesn't have the appropriate levels). Instead, you must create a new factor with the desired levels:
d$col2 <- as.factor(ifelse(d$col1==1, 2, 1) * as.numeric(as.character(d$col2)))
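A side note on converting factors to numbers, since it matters for the factor case above: as.numeric() applied directly to a factor returns the internal level codes, not the printed labels. The small sketch below uses a made-up factor `f` (not from the question) to show the difference; going through as.character() first recovers the printed values.

```r
# Hypothetical factor whose labels are not simply 1, 2, 3, ...
f <- factor(c("10", "20", "5"))

codes  <- as.numeric(f)                # internal level codes
values <- as.numeric(as.character(f))  # the numbers that are actually printed

codes
values
```

In the question's data the levels happen to be 1 to 10 in order, so codes and values coincide there, but that is not something to rely on in general.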
Assuming that the columns are numeric:
transform(df1, col2 = (1 + (col1 == 1)) * col2)
Here's another possibility:
data$col2 <- as.numeric(as.character(data$col2)) * (1 + (data$col1==1))
I am trying to use anti-join exactly as I have done many times to establish which rows across two datasets do not have matches for two specific columns. For some reason I keep getting 0 rows in the result and I can't understand why.
Below are two dummy data frames containing the two columns I am trying to compare. One is missing an entry (df1 lacks SITE 2, PLOT 8), so when I use anti_join to compare the two data frames this entry should be returned, but I am just getting a result of 0 rows.
a <- 1:3
SITE <- rep(a, times = c(16,15,1))
PLOT <- c(1:16,1:7,9:16,1)
df1 <- data.frame(SITE,PLOT)
SITE <- rep(a, times = c(16,16,1))
PLOT <- c(rep(1:16,2),1)
df2 <- data.frame(SITE,PLOT)
df1            df2
SITE PLOT      SITE PLOT
1    1         1    1
1    2         1    2
1    3         1    3
1    4         1    4
1    5         1    5
1    6         1    6
1    7         1    7
1    9         1    8
1    10        1    9
1    11        1    10
1    12        1    11
1    13        1    12
1    14        1    13
1    15        1    14
1    16        1    15
1    1         1    16
2    2         2    1
2    3         2    2
2    4         2    3
2    5         2    4
2    6         2    5
2    7         2    6
2    8         2    7
2    9         2    8
2    10        2    9
2    11        2    10
2    12        2    11
2    13        2    12
2    14        2    13
2    15        2    14
2    16        2    15
3    1         2    16
               3    1
library(dplyr)
a <- anti_join(df1, df2, by=c('SITE', 'PLOT'))
a
<0 rows> (or 0-length row.names)
I'm sure the answer is obvious but I can't see it.
The answer can be found in the help file.
anti_join() return all rows from x without a match in y.
So reversing the input for df1 and df2 will give you what you expect.
anti_join(df2, df1, by=c('SITE', 'PLOT'))
# SITE PLOT
# 1 2 8
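To see why the argument order matters, here is a rough base R sketch of the same idea, rebuilding df1 and df2 from the question. The key-pasting approach and the helper name rows_without_match are purely illustrative; this is not how dplyr implements anti_join().

```r
# Rebuild the two data frames from the question
df1 <- data.frame(SITE = rep(1:3, times = c(16, 15, 1)),
                  PLOT = c(1:16, 1:7, 9:16, 1))
df2 <- data.frame(SITE = rep(1:3, times = c(16, 16, 1)),
                  PLOT = c(rep(1:16, 2), 1))

# Hypothetical helper: rows of x whose (SITE, PLOT) pair does not occur in y
rows_without_match <- function(x, y) {
  key_x <- paste(x$SITE, x$PLOT)
  key_y <- paste(y$SITE, y$PLOT)
  x[!key_x %in% key_y, , drop = FALSE]
}

only_in_df1 <- rows_without_match(df1, df2)  # empty: every df1 row is also in df2
only_in_df2 <- rows_without_match(df2, df1)  # the SITE 2, PLOT 8 row
```

The operation is not symmetric: the first argument is the one whose unmatched rows are returned, so the data frame that contains the extra row has to come first.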
I have a data frame in which columns "a" and "b" hold values and column "c" holds either 1 or 2. I would like to create a column "d" that contains, for each row, the value from the column whose index is given in column "c".
x = data.frame(a = c(1:10), b = c(3:12), c = seq(1:2))
x
a b c
1 1 3 1
2 2 4 2
3 3 5 1
4 4 6 2
5 5 7 1
6 6 8 2
7 7 9 1
8 8 10 2
9 9 11 1
10 10 12 2
Thus column "d" will contain 1 for the first row, since the index in column "c" is 1; for the second row d = 4, since the index in column "c" is 2, and so on. Standard indexing in R did not help me; it just returns the values of column c. How can I solve this?
You can create a matrix of row and column numbers to subset values from the data frame.
x$d <- x[cbind(1:nrow(x), x$c)]
x
# a b c d
#1 1 3 1 1
#2 2 4 2 4
#3 3 5 1 3
#4 4 6 2 6
#5 5 7 1 5
#6 6 8 2 8
#7 7 9 1 7
#8 8 10 2 10
#9 9 11 1 9
#10 10 12 2 12
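For readers unfamiliar with this form of indexing, here is a minimal sketch on a plain matrix. The names m and idx are made up for illustration. Indexing with a two-column matrix treats each row of the index as one (row, column) pair and returns one value per pair.

```r
# A small 3 x 2 matrix: column 1 is 1, 2, 3 and column 2 is 4, 5, 6
m <- matrix(1:6, nrow = 3)

# Each row of idx is a (row, column) pair to extract
idx <- cbind(1:3, c(1, 2, 1))

picked <- m[idx]  # m[1, 1], m[2, 2], m[3, 1]
picked
```

This is the same mechanism as cbind(1:nrow(x), x$c) above: the first column walks through the rows and the second column says which column to read in each row.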
If the input is a tibble, you need to convert it to a data frame (for example with as.data.frame()) to use the above answer.
If you don't want to convert it, here is another option using rowwise.
library(dplyr)
x <- tibble(x)
x %>% rowwise() %>% mutate(d = c_across()[c])
Using dplyr::mutate and ifelse:
x %>% mutate(d = ifelse(c == 1, a, b))
a b c d
1 1 3 1 1
2 2 4 2 4
3 3 5 1 3
4 4 6 2 6
5 5 7 1 5
6 6 8 2 8
7 7 9 1 7
8 8 10 2 10
9 9 11 1 9
10 10 12 2 12
I have an R data frame with data from multiple subjects, each tested several times. To perform statistics on the set, there is a factor for subject ("id") and a row for each observation (around 40,000) with around 200 variables each.
allData <- data.frame(id = rep(1:4, 3),
session = rep(1:3, each = 4),
measure1 = sample(c(NA, 1:11)),
measure2 = sample(c(NA, 1:11)),
measure3 = sample(c(NA, 1:11)),
measure4 = sample(c(NA, 1:11)))
allData
# id session measure1 measure2 measure3 measure4
# 1 1 1 3 7 10 6
# 2 2 1 4 4 9 9
# 3 3 1 6 6 7 10
# 4 4 1 1 5 2 3
# 5 1 2 NA NA 5 11
# 6 2 2 7 10 6 5
# 7 3 2 9 8 4 2
# 8 4 2 2 9 1 7
# 9 1 3 5 1 3 8
# 10 2 3 8 3 8 1
# 11 3 3 11 11 11 4
# 12 4 3 10 2 NA NA
I need to remove all rows for any id that has an NA in a "measureX" (X=1,..,4) column in at least one of its rows; in this example, that means all rows with id 1 and 4.
A solution for this problem was suggested by flodel in https://stackoverflow.com/a/9917524/5042101 using the "plyr" package and its ddply function.
probeColumns = c('measure1','measure4')
library(plyr)
ddply(allData, "id",
function(df)if(any(is.na(df[, probeColumns]))) NULL else df)
Problem: my data set has around 40,000 rows and 200 columns. Even when I try this for a single column, an error appears: C stack usage 10027284.
I am using R 3.1.3 in RStudio on Windows. When I try with more columns, RStudio closes automatically or R freezes. Moreover, I do not have administrator access on the computer.
I can't say exactly what the problem is with plyr (though it might be a bug in the package). It is possible to do this using apply:
> allData[apply(allData, 1, function(x) !any(is.na(x[probeColumns]))), ]
id session measure1 measure2 measure3 measure4
1 1 1 1 1 2 4
2 2 1 5 4 6 1
3 3 1 9 8 NA 3
4 4 1 11 7 7 5
5 1 2 8 5 11 2
6 2 2 6 NA 5 8
7 3 2 10 10 3 10
9 1 3 4 9 4 9
10 2 3 2 6 8 7
11 3 3 3 3 9 6
A bit of explanation: apply(allData, 1, function(x) !any(is.na(x[probeColumns]))) goes row by row and returns TRUE for each row that has no NA in the columns named in probeColumns. That logical vector is then used to select the rows to keep.
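Note that the apply() call above drops only the individual rows that contain an NA. If the goal is to drop every row belonging to an affected id, as the question asks, one possible base R sketch is shown below. It uses the values printed in the question for measure1 and measure4; the names has_na, bad_ids and kept are made up for illustration.

```r
# Subset of the question's example data (only the two probe columns)
allData <- data.frame(id = rep(1:4, 3),
                      session = rep(1:3, each = 4),
                      measure1 = c(3, 4, 6, 1, NA, 7, 9, 2, 5, 8, 11, 10),
                      measure4 = c(6, 9, 10, 3, 11, 5, 2, 7, 8, 1, 4, NA))
probeColumns <- c("measure1", "measure4")

# Rows with at least one NA in the probe columns
has_na <- !complete.cases(allData[, probeColumns])

# The ids those rows belong to
bad_ids <- unique(allData$id[has_na])

# Keep only rows whose id never shows an NA
kept <- allData[!allData$id %in% bad_ids, ]
kept
```

Because complete.cases() and %in% are vectorised, this avoids calling a function once per group, which may help on a data set with 40,000 rows, although that has not been measured here.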
Here is my solution. It is maybe a little clumsy, but the idea is:
Find where the NAs are located,
then identify which ids they correspond to.
As a last step, remove all rows for every id that has an NA in at least one column.
library(dplyr)
ind <- allData[apply(allData, 1, function(x) sum(is.na(x))) > 0, 1]
allData %>% filter(!id %in% ind)
id session measure1 measure2 measure3 measure4
1 1 1 1 6 1 8
2 2 1 10 2 7 2
3 1 2 11 7 5 11
4 2 2 5 5 4 7
5 1 3 4 8 9 5
6 2 3 8 11 3 9
I was wondering if anyone knows a simple way to create a new column in a data frame, taking data from an existing column, within a certain range.
For example, I have this data frame
range col1
1 5
2 4
3 9
4 5
5 2
6 8
7 9
I would like to create col2 using the data in col1, so that col2 takes the value of col1 where range is greater than 3 and 0 otherwise:
range col1 col2
1 5 0
2 4 0
3 9 0
4 5 5
5 2 2
6 8 8
7 9 9
I have tried
data$col2 <- data$col1[which(data$range > 3)]
data$col2 <- subset ( data$col1 , data$range >3 )
However, both of these produce the error:
replacement has 4 rows, data has 7
Any help greatly appreciated
You can do it even without ifelse here:
data$new <- with(data, (range > 3) * col1)
data
# range col1 new
#1 1 5 0
#2 2 4 0
#3 3 9 0
#4 4 5 5
#5 5 2 2
#6 6 8 8
#7 7 9 9
Try ifelse
transform(data, col2=ifelse(range >3, col1, 0))
# range col1 col2
#1 1 5 0
#2 2 4 0
#3 3 9 0
#4 4 5 5
#5 5 2 2
#6 6 8 8
#7 7 9 9
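As for why the original attempts failed: subsetting col1 returns only the four matching values, and a four-element vector cannot be assigned to a seven-row column. A minimal sketch of a fix that stays close to the original attempt is to create the column first and then assign only into the matching rows. The data below are retyped from the question, and the name keep is made up for illustration.

```r
# Data retyped from the question
data <- data.frame(range = 1:7, col1 = c(5, 4, 9, 5, 2, 8, 9))

keep <- data$range > 3              # logical vector, one entry per row

data$col2 <- 0                      # start with 0 everywhere (recycled to 7 rows)
data$col2[keep] <- data$col1[keep]  # both sides now have the same length
data
```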
I am not new to R, but I cannot solve this problem: I have a data.frame and want to rbind the same data.frame with its columns switched. But R does not switch the columns.
Example:
set.seed(13)
df <- data.frame(var1 = sample(5), var2 = sample(5))
> df
var1 var2
1 4 1
2 1 3
3 2 4
4 5 2
5 3 5
> rbind(df, df[,c(2,1)])
var1 var2
1 4 1
2 1 3
3 2 4
4 5 2
5 3 5
6 4 1
7 1 3
8 2 4
9 5 2
10 3 5
As you can see, the columns are not switched (rows 6-10), whereas switching the columns alone works like a charm:
> df[,c(2,1)]
var2 var1
1 1 4
2 3 1
3 4 2
4 2 5
5 5 3
I guess this has something to do with the column names, but I cannot figure out what exactly.
Can anyone help?
Kind regards!
As pointed out by @Henrik, from ?rbind.data.frame: "The rbind data frame method [...] matches columns by name." So try this:
> rbind(df, setNames(df[,c(2,1)], c("var1", "var2")))
var1 var2
1 4 1
2 1 3
3 2 4
4 5 2
5 3 5
6 1 4
7 3 1
8 4 2
9 2 5
10 5 3
This also works, although the result is a matrix rather than a data frame:
> rbind(as.matrix(df), as.matrix(df[,c(2,1)]))
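The three behaviours can be compared side by side in a short sketch. The values are retyped from the question's printed df rather than regenerated with set.seed(), and the names swapped, by_name, by_pos and as_mat are made up for illustration.

```r
# Values retyped from the question's printed data frame
df <- data.frame(var1 = c(4, 1, 2, 5, 3), var2 = c(1, 3, 4, 2, 5))
swapped <- df[, c(2, 1)]

# Data frame method: columns are matched by name, so nothing appears to switch
by_name <- rbind(df, swapped)

# Renaming the swapped columns first makes the match effectively positional
by_pos <- rbind(df, setNames(swapped, names(df)))

# Matrix method: binds by position and returns a matrix
as_mat <- rbind(as.matrix(df), as.matrix(swapped))

by_name$var1  # first column repeated
by_pos$var1   # first column followed by the second
```

If a data frame is needed after the matrix route, wrapping the result in as.data.frame() converts it back.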