How to replace the rows in data - r

Hello, I have a table with 5 columns. One of the columns, X, is:
x <- c(1,1,1,1,1,1,2,2,2,3)
How can I change the order of the numbers in vector X, for example putting the 3s first, the 1s second and the 2s third? The output should look like this:
x <- c(3,1,1,1,1,1,1,2,2,2)
And I want to move not only the values in column X but also the whole rows that belong to each value of X.
To clarify the question:
X(old version) -> X(new version)
1 2
2 3
3 1
So, If X=1 make it X=2
If X=2 make it X=3
If X=3 make it X=1
And if, for example, we change X=1 to X=2, all the rows that had X=1 should now have X=2.
I have two vectors:
x <- c(1,1,1,1,1,1,2,2,2,3)
z <- c(10,10,10,10,10,10,20,20,20,30)
The desired output:
x z
1 30
2 10
2 10
2 10
2 10
2 10
2 10
3 20
3 20
3 20

You could
x1 <-c(2,3,1)[x]
x[order(x1)]
# [1] 3 1 1 1 1 1 1 2 2 2
or
x[order(chartr(old="123",new="231",x))]
#[1] 3 1 1 1 1 1 1 2 2 2
Update
If you have many columns.
x <- c(1,1,1,1,1,1,2,2,2,3)
z <- c(10,10,10,10,10,10,20,20,20,30)
set.seed(14)
y <- matrix(sample(25,10*3,replace=TRUE),ncol=3)
m1 <- as.data.frame(cbind(x,z,y))
x1 <- c(2,3,1)[m1$x]
x1
# [1] 2 2 2 2 2 2 3 3 3 1
res <- cbind(x=c(2,3,1)[m1$x[order(x1)]],subset(m1[order(x1),], select=-x))
res
# x z V3 V4 V5
#10 1 30 10 15 2
#1 2 10 7 23 9
#2 2 10 16 5 11
#3 2 10 24 12 16
#4 2 10 14 22 18
#5 2 10 25 22 19
#6 2 10 13 19 16
#7 3 20 24 9 10
#8 3 20 11 17 14
#9 3 20 13 22 18
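For the two-column example from the question, the same idea can be sketched a bit more compactly. This is only an illustration and assumes that x contains nothing but the values 1, 2 and 3; the names map and df are made up here.

```r
x <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3)
z <- c(10, 10, 10, 10, 10, 10, 20, 20, 20, 30)
df <- data.frame(x, z)

# Lookup vector (hypothetical name): position = old value, entry = new value,
# so old 1 -> 2, old 2 -> 3, old 3 -> 1
map <- c(2, 3, 1)

df$x <- map[df$x]          # recode x, every row keeps its own z
df <- df[order(df$x), ]    # sort the whole rows by the recoded x
rownames(df) <- NULL       # optional: renumber the rows
df
```

Because whole rows are reordered, every other column travels with its x value.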

If I'm understanding correctly, it sounds as though you want to define your own order for sorting something. Is that right? Two ways you could do that:
Option #1: Make another column in your data.frame and assign values in the order you'd like. If you wanted the threes to come first, the ones to come second and the twos to come third, you'd do this:
Data$y <- rep(NA, nrow(Data))
Data$y[Data$x == 3] <- 1
Data$y[Data$x == 1] <- 2
Data$y[Data$x == 2] <- 3
Then you can sort on y and your data.frame will have the order you want.
Option #2: If the numbers you list in x are levels in a factor, you could do this using plyr:
library(plyr)
Data$x <- revalue(Data$x, c("3" = "1", "1" = "2", "2" = "3"))
Personally, I think that the 2nd option would be rather confusing, but if you are using "1", "2", and "3" to refer to levels in a factor, that is one quick way to change things.
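A shorter variant of Option #1, as a sketch only: instead of filling a helper column by hand, match() can give each row its position in the wanted order. The data frame Data and the vector wanted below are made-up stand-ins for the real data.

```r
# Hypothetical data, shaped like the example in the question
Data <- data.frame(x = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3),
                   z = c(10, 10, 10, 10, 10, 10, 20, 20, 20, 30))

wanted <- c(3, 1, 2)   # threes first, then ones, then twos

# match() returns the position of each x in 'wanted';
# ordering by that position sorts the rows in the custom order
Data_sorted <- Data[order(match(Data$x, wanted)), ]
Data_sorted
```

Values of x that do not appear in wanted get NA from match() and end up last, since order() puts NA at the end by default.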

Related

Conditional Subset, Manipulate and Replace

Following on from a previous question here I extracted the following data.frame
DF <- data.frame(A =c("One","Two","Three","Four","Five"),
B=c(1,1,2,2,3),
D=c(10,2,3,-5,5))
subset(DF, B %in% c(1,3))
A B D
1 One 1 10
2 Two 1 2
5 Five 3 5
but now I want to (for example) multiply the numbers by (say) five and replace them in the original data.frame
The following code
subset(DF, B %in% c(1,3))[,2:3] * 5
B D
1 5 50
2 5 10
5 15 25
gives me the numbers I want, but how do I get them back into the original data.frame, like this:
A B D
1 One 5 50
2 Two 5 10
3 Three 2 3
4 Four 2 -5
5 Five 15 25
The answer is staring me in the face (ie the index numbers ... but how do I get to them)?
You can do
DF[DF$B %in% c(1, 3), 2:3] <- DF[DF$B %in% c(1, 3), 2:3] * 5
DF
# A B D
#1 One 5 50
#2 Two 5 10
#3 Three 2 3
#4 Four 2 -5
#5 Five 15 25
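If the condition is needed more than once, it may be easier to read when the row selection is stored first. A small sketch along the same lines; the names rows and cols are just illustrative:

```r
DF <- data.frame(A = c("One", "Two", "Three", "Four", "Five"),
                 B = c(1, 1, 2, 2, 3),
                 D = c(10, 2, 3, -5, 5))

rows <- DF$B %in% c(1, 3)   # logical vector, one entry per row
cols <- c("B", "D")         # columns to change, by name instead of 2:3

which(rows)                 # the index numbers the question asks about

# Assign into the same subset that is read from
DF[rows, cols] <- DF[rows, cols] * 5
DF
```

The logical vector rows is computed before the assignment, so it does not matter that column B itself is changed by it.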

create a group variable in a data frame by a string variable starting from a certain value in R

I have following data frame.
sub1=c("2021","2121","M123","M143")
x1=c(10,5,6,7)
x2=c(11,12,34,56)
data=data.frame(sub1,x1,x2)
I need to create a group variable for this data frame such that if sub1 starts with the number 2, the row belongs to one group, and if sub1 starts with the letter M, it belongs to a second group.
My desired output should be like this,
sub1 x1 x2 group
1 2021 10 11 1
2 2121 5 12 1
3 M123 6 34 2
4 M143 7 56 2
Can anyone suggest a function that I can use for this? I tried the grep function as follows, but I didn't get the desired result.
data$sub1[grep("^[2].*", data$sub1)]
Thank you
What about this:
data$group <- ifelse(substr(data$sub1,1,1)==2,1,2)
data
sub1 x1 x2 group
1 2021 10 11 1
2 2121 5 12 1
3 M123 6 34 2
4 M143 7 56 2
In case there could be values that start with something other than 2 or M:
ifelse(substr(data$sub1,1,1)==2,1,ifelse(substr(data$sub1,1,1)=='M',2,'Missing'))
Another way using substring and indexing to assign groups.
data$group <- (substr(data$sub1, 1, 1) == "M") + 1
data
# sub1 x1 x2 group
#1 2021 10 11 1
#2 2121 5 12 1
#3 M123 6 34 2
#4 M143 7 56 2
Or extract first character using regex
sub("(.).*", "\\1", data$sub1)
#[1] "2" "2" "M" "M"
and then use the same method to assign groups
(sub("(.).*", "\\1", data$sub1) == "M") + 1
#[1] 1 1 2 2
You can also do:
as.integer(!grepl("^2", data$sub1)) + 1
[1] 1 1 2 2
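If more than two prefixes may turn up, a named lookup vector is one possible way to keep the mapping in a single place. This is a sketch, not part of the answers above; lookup and first are made-up names, and unknown prefixes simply become NA.

```r
data <- data.frame(sub1 = c("2021", "2121", "M123", "M143"),
                   x1 = c(10, 5, 6, 7),
                   x2 = c(11, 12, 34, 56),
                   stringsAsFactors = FALSE)

# Hypothetical mapping from first character to group number
lookup <- c("2" = 1L, "M" = 2L)

first <- substr(data$sub1, 1, 1)      # "2" "2" "M" "M"
data$group <- unname(lookup[first])   # index the lookup by name
data

# A prefix that is not in the lookup gives NA instead of a wrong group
unknown <- unname(lookup[substr("X999", 1, 1)])
```

Compared with the nested ifelse, the group column stays an integer instead of becoming a character vector with a "Missing" label mixed in.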

understanding apply and outer function in R

Suppose i have a data which looks like this
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
I want to compare these values with each other: if an ID has changed its value of variable A over the period given by variable B (which runs from 1 to 4), it goes into data frame K, and if it hasn't, it goes into data frame L.
so in this data set K will look like
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
and L will look like
ID A B C
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
In terms of nested loops and if then else statement it can be solved like following
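K <- L <- df[0, ]                        # empty data frames with the same columns
for (id in unique(df$ID)) {
  rows <- df[df$ID == id, ]              # rows of this ID, assumed sorted by B
  m <- 0
  for (j in seq_len(nrow(rows) - 1)) {
    if (rows$A[j] != rows$A[j + 1]) m <- m + 1   # count changes in A
  }
  if (m == 0) L <- rbind(L, rows) else K <- rbind(K, rows)
}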
I have read in some posts that in R nested loops can be replaced by the apply and outer functions. Can someone help me understand how they can be used in such circumstances?
So basically you don't need a loop with conditions here. All you need to do is check, within each ID (that is, over each cycle of B), whether A has any variance, and negate the result with ! to get a logical. That requires converting A to a numeric value (I'm assuming it's a factor in your real data set; if it's not a factor, you can use FUN = function(x) length(unique(x)) within ave instead). Then split accordingly. With base R we can use ave for such a task, for example
indx <- !with(df, ave(as.numeric(A), ID , FUN = var))
Or (if A is a character rather a factor)
indx <- with(df, ave(A, ID , FUN = function(x) length(unique(x)))) == 1L
Then simply run split
split(df, indx)
# $`FALSE`
# ID A B C
# 1 1 X 1 10
# 2 1 X 2 10
# 3 1 Z 3 15
# 4 1 Y 4 12
# 5 2 Y 1 15
# 6 2 X 2 13
# 7 2 X 3 13
# 8 2 Y 4 13
#
# $`TRUE`
# ID A B C
# 9 3 Y 1 16
# 10 3 Y 2 18
# 11 3 Y 3 19
# 12 3 Y 4 10
This will return a list with two data frames.
Similarly with data.table
library(data.table)
setDT(df)[, indx := !var(as.numeric(A)), by = ID]
split(df, df$indx)
Or dplyr
library(dplyr)
df %>%
group_by(ID) %>%
mutate(indx = !var(as.numeric(A))) %>%
split(., .$indx)
Since you want to understand apply rather than simply getting it done, you can consider tapply. As a demonstration:
> tapply(df$A, df$ID, function(x) ifelse(length(unique(x))>1, "K", "L"))
1 2 3
"K" "K" "L"
In a bit plainer English: go through all df$A grouped by df$ID, and apply the function on df$A within each groupings (i.e. the x in the embedded function): if the number of unique values is more than 1, it's "K", otherwise it's "L".
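To get from those per-ID labels to the two data frames, the labels can be looked up for every row and handed to split. A minimal sketch, assuming the data from the question is in a data frame called df with A stored as character; label and parts are made-up names.

```r
df <- data.frame(ID = rep(1:3, each = 4),
                 A  = c("X", "X", "Z", "Y",  "Y", "X", "X", "Y",  "Y", "Y", "Y", "Y"),
                 B  = rep(1:4, times = 3),
                 C  = c(10, 10, 15, 12,  15, 13, 13, 13,  16, 18, 19, 10),
                 stringsAsFactors = FALSE)

# One label per ID: "K" if A takes more than one value, otherwise "L"
label <- tapply(df$A, df$ID,
                function(x) if (length(unique(x)) > 1) "K" else "L")

# Look up the label of each row's ID by name, then split on it
row_label <- as.vector(label[as.character(df$ID)])
parts <- split(df, row_label)

parts$K   # IDs 1 and 2
parts$L   # ID 3
```

Inside the function, a plain if/else is enough because each call returns a single value; ifelse is only needed for vectorised conditions.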
We can do this using data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)). Grouped by 'ID', we check whether the number of unique elements in 'A' (uniqueN(A)) is greater than 1 and create a column 'ind' based on that. We can then split the dataset on that 'ind' column.
library(data.table)
setDT(df1)[, ind:= uniqueN(A)>1, by = ID]
setDF(df1)
split(df1[-5], df1$ind)
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
Or similarly using dplyr, we can use n_distinct to create a logical column and then split by that column.
library(dplyr)
df2 <- df1 %>%
group_by(ID) %>%
mutate(ind= n_distinct(A)>1)
split(df2, df2$ind)
Or a base R option with table. We get the table of the first two columns of 'df1', i.e. 'ID' and 'A'. By double negating (!!) the output, the '0' counts are converted to 'FALSE' and all other frequencies to 'TRUE'. We take the rowSums of that and check whether they are greater than 1 ('indx'). We then match the 'ID' column in 'df1' with the names of 'indx' to get a TRUE/FALSE value for each row, and split the dataset with that.
indx <- rowSums(!!table(df1[1:2]))>1
lst <- split(df1, indx[match(df1$ID, names(indx))])
lst
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
If we need the individual datasets in the global environment, we can change the names of the list elements to the object names we want and use list2env (not recommended, though):
list2env(setNames(lst, c('L', 'K')), envir=.GlobalEnv)
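The table approach can be walked through step by step, and the result can stay in a named list instead of being copied into the global environment. This is a sketch with the question's data rebuilt by hand; tab, n_values and changed are made-up names, and tab > 0 is used in place of !! for readability.

```r
df1 <- data.frame(ID = rep(1:3, each = 4),
                  A  = c("X", "X", "Z", "Y",  "Y", "X", "X", "Y",  "Y", "Y", "Y", "Y"),
                  B  = rep(1:4, times = 3),
                  C  = c(10, 10, 15, 12,  15, 13, 13, 13,  16, 18, 19, 10))

tab <- table(df1[c("ID", "A")])   # counts of each A value per ID
n_values <- rowSums(tab > 0)      # number of distinct A values per ID
changed <- n_values > 1           # named logical vector, names are the IDs

# Look up TRUE/FALSE for every row by ID, split, and name the pieces
lst <- split(df1, changed[as.character(df1$ID)])
lst <- setNames(lst, c("L", "K"))  # FALSE comes first, so it is L

lst$K
lst$L
```

Accessing lst$K and lst$L keeps the two pieces together and avoids overwriting objects that may already exist under those names.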

How to use logical values to access elements of data frame

Say I have this data frame:
x <- data.frame(matrix(rep(1:5, each=5), nrow=5))
Say I want to square all values that are greater than 3 and put these values back into the x.
I identify the values that are greater than 3 by:
x > 3
Then how can I reference these values in x? Doing x[x>3] returns a vector of integers, not a data frame.
Note that I am more interested in this particular problem of x[x>3] than in the actual application, which I included simply as motivation.
Just use matrix indexing:
ind <- which(x > 3, arr.ind = TRUE)
x[ind] <- x[ind] * x[ind] ## or x[ind] <- x[ind]^2
x
# X1 X2 X3 X4 X5
# 1 1 2 3 16 25
# 2 1 2 3 16 25
# 3 1 2 3 16 25
# 4 1 2 3 16 25
# 5 1 2 3 16 25
Alternatively, you can do replace(x, x > 3, x[x > 3]^2), but remember that this doesn't actually modify your "x" object so it needs to be reassigned.
Or,
> x[x>3] <- (x[x>3])^2
> x
X1 X2 X3 X4 X5
1 1 2 3 16 25
2 1 2 3 16 25
3 1 2 3 16 25
4 1 2 3 16 25
5 1 2 3 16 25
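To see why this works, it may help to look at what x > 3 actually is. A short sketch; mask is just an illustrative name:

```r
x <- data.frame(matrix(rep(1:5, each = 5), nrow = 5))

mask <- x > 3        # logical matrix with the same shape as x
is.matrix(mask)      # TRUE
dim(mask)            # 5 5

# Reading with the mask flattens the selected cells into a vector ...
picked <- x[mask]

# ... but assigning with the mask writes back into the same cells,
# so x stays a data frame
x[mask] <- x[mask]^2
x
```

So the vector returned by x[x > 3] is only a problem when reading; on the left-hand side of an assignment the logical matrix addresses the cells in place.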

Read csv with two headers into a data.frame

Apologies for the seemingly simple question, but I can't seem to find a solution to the following re-arrangement problem.
I'm used to using read.csv to read in files with a header row, but I have an excel spreadsheet with two 'header' rows - cell identifier (a, b, c ... g) and three sets of measurements (x, y and z; 1000s each) for each cell:
a b
x y z x y z
10 1 5 22 1 6
12 2 6 21 3 5
12 2 7 11 3 7
13 1 4 33 2 8
12 2 5 44 1 9
csv file below:
a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9
How can I get to a data.frame in R as shown below?
cell x y z
a 10 1 5
a 12 2 6
a 12 2 7
a 13 1 4
a 12 2 5
b 22 1 6
b 21 3 5
b 11 3 7
b 33 2 8
b 44 1 9
Use base R reshape():
temp = read.delim(text="a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9", header=TRUE, skip=1, sep=",")
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT
# time x y z id
# 1.0 0 10 1 5 1
# 2.0 0 12 2 6 2
# 3.0 0 12 2 7 3
# 4.0 0 13 1 4 4
# 5.0 0 12 2 5 5
# 1.1 1 22 1 6 1
# 2.1 1 21 3 5 2
# 3.1 1 11 3 7 3
# 4.1 1 33 2 8 4
# 5.1 1 44 1 9 5
Basically, you should just skip the first row, which holds the letters a-g in every third column. Since the sub-column names repeat, R automatically appends a grouping number to every column after the third (x.1, y.1, z.1, ...), so we need to add a grouping number (.0) to the first three columns ourselves.
You can either then create an "id" variable, or, as I've done here, just use the row names for the IDs.
You can change the "time" variable to your "cell" variable as follows:
# Change the following to the number of levels you actually have
OUT$cell = factor(OUT$time, labels=letters[1:2])
Then, drop the "time" column:
OUT$time = NULL
Update
To answer a question in the comments below, if the first label was something other than a letter, this should still pose no problem. The sequence I would take would be as follows:
temp = read.csv("path/to/file.csv", skip=1, stringsAsFactors = FALSE)
GROUPS = read.csv("path/to/file.csv", header=FALSE,
nrows=1, stringsAsFactors = FALSE)
GROUPS = GROUPS[!is.na(GROUPS)]
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT$cell = factor(OUT$time, labels=GROUPS)
OUT$time = NULL
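As an alternative to reshape(), the blocks of three columns can also be stacked by hand. This is only a sketch under the assumption that every cell has exactly the three columns x, y and z; it reads the sample from a string, and csv_text, groups, body and pieces are made-up names.

```r
csv_text <- paste(c("a,,,b,,",
                    "x,y,z,x,y,z",
                    "10,1,5,22,1,6",
                    "12,2,6,21,3,5",
                    "12,2,7,11,3,7",
                    "13,1,4,33,2,8",
                    "12,2,5,44,1,9"), collapse = "\n")

# First header row: keep only the non-empty fields ("a", "b")
groups <- scan(text = csv_text, what = "", sep = ",", nlines = 1, quiet = TRUE)
groups <- groups[nzchar(groups)]

# Everything below the first row, with the second row as header
body <- read.csv(text = csv_text, skip = 1)

# Take columns 1-3, 4-6, ... and label each block with its cell
pieces <- lapply(seq_along(groups), function(i) {
  block <- body[, (3 * i - 2):(3 * i)]
  names(block) <- c("x", "y", "z")
  cbind(cell = groups[i], block)
})
out <- do.call(rbind, pieces)
out
```

For a real file, the text = csv_text arguments would be replaced by the file path in both calls.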

Resources