Replacing column values in R

Example df:
index name V1 V2 etc
1 x 2 1
2 y 1 2
3 z 3 4
4 w 4 3
I would like to replace each value in columns V1 and V2 with the name whose index matches that value. Output should look like this:
index name V1 V2 etc
1 x y x
2 y x y
3 z z w
4 w w z
I have tried multiple merge statements in a loop, but I am not sure how to replace the values instead of creating new columns, and I also got a duplicate name error.
V<-2 # number of V columns
names<-c()
for (i in 1:V){names[[i]]<-paste0('V',i)}
lookup_table<-df[,c('index','name'),drop=FALSE] # it's at unique index level
for(col in names){
df<- merge(df,lookup_table,by.x=col,by.y="index",all.x = TRUE)
}

We can do
df[3:4] <- lapply(df[3:4], function(x) df$name[x])
Or without looping
df[3:4] <- df$name[as.matrix(df[3:4])]
df
# index name V1 V2
#1 1 x y x
#2 2 y x y
#3 3 z z w
#4 4 w w z
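The indexing above relies on index running 1, 2, 3, 4 in row order, so that a value in V1 or V2 is also a row position. As a hedged sketch for the case where index holds arbitrary keys, match() can translate each value to a row position first. The key values 10 to 40 and the helper name v_cols are made up for illustration.

```r
# Hypothetical data: index values are keys, not row positions
df <- data.frame(index = c(10L, 20L, 30L, 40L),
                 name  = c("x", "y", "z", "w"),
                 V1    = c(20L, 10L, 30L, 40L),
                 V2    = c(10L, 20L, 40L, 30L),
                 stringsAsFactors = FALSE)

# Columns to translate (assumed to be known by name)
v_cols <- c("V1", "V2")

# match() finds the row of df$index that holds each value,
# and that row position then selects the name
df[v_cols] <- lapply(df[v_cols], function(x) df$name[match(x, df$index)])
df
```

Values with no matching index become NA rather than raising an error, which may or may not be what you want.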
data
df <- structure(list(index = 1:4, name = c("x", "y", "z", "w"), V1 = c(2L,
1L, 3L, 4L), V2 = c(1L, 2L, 4L, 3L)), .Names = c("index", "name",
"V1", "V2"), class = "data.frame", row.names = c(NA, -4L))

Related

group by a dataframe and get a row of specific index within each group in r

I have a df like
ProjectID Dist
1 x
1 y
2 z
2 x
2 h
3 k
.... ....
and a vector of indices of length length(unique(df$ProjectID)) like
2
3
1
....
For each ProjectID, I would like to get the Dist value at the within-group row position given by the corresponding element of the index vector. So the result I want looks like
ProjectID Dist
1 y
2 h
3 k
.... ....
I tried
aggregate(XRKL ~ ID, FUN=..?, data=df)
but I'm not sure where I can put the vector of indices. Is there a way to get the right result with dplyr functions, tapply, or aggregate? Or do I need to write a function of my own? Thank you.
You can add the indices in the dataframe itself and then select that row from each group.
library(dplyr)
inds <- c(2, 3, 1)
df %>%
mutate(inds = inds[match(ProjectID, unique(ProjectID))]) %>%
#If ProjectID is sequential like 1, 2, 3
#mutate(inds = inds[ProjectID]) %>%
group_by(ProjectID) %>%
slice(first(inds)) %>%
ungroup() %>%
select(-inds)
# ProjectID Dist
# <int> <chr>
#1 1 y
#2 2 h
#3 3 k
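For comparison, here is a minimal base R sketch of the same idea, assuming inds is ordered by sorted ProjectID (split() orders its groups by the sorted unique values). The names groups, picked, and result are illustrative only.

```r
df <- data.frame(ProjectID = c(1L, 1L, 2L, 2L, 2L, 3L),
                 Dist = c("x", "y", "z", "x", "h", "k"),
                 stringsAsFactors = FALSE)
inds <- c(2, 3, 1)

# One data frame per ProjectID, in sorted ProjectID order
groups <- split(df, df$ProjectID)

# Pair each group with its index and keep only that row
picked <- Map(function(g, i) g[i, , drop = FALSE], groups, inds)

# Stack the selected rows back into one data frame
result <- do.call(rbind, picked)
rownames(result) <- NULL
result
```

If an index exceeds the size of its group, that row comes back as NA, so a range check beforehand may be worthwhile.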
data
df <- structure(list(ProjectID = c(1L, 1L, 2L, 2L, 2L, 3L), Dist = c("x",
"y", "z", "x", "h", "k")), class = "data.frame", row.names = c(NA, -6L))

How to extract a column based on column name?

I have a data frame df
m n o p
a 1 1 2 5
b 1 2 0 4
c 3 3 3 3
I can extract column m by:
df[,"m"]
Now the problem is, the column name was generated somewhere else (multiple times, in a for loop). For example, column name m was generated by choosing a specific element in the dataframe, gen, in one loop:
> gen[i,1]
[1] m
How do I extract the column based on gen[i,1]?
Just nest the subsetting.
dat[,"m"]
# [1] 1 1 3
i <- 13
gen[i, 1]
# [1] "m"
dat[, gen[i, 1]]
# [1] 1 1 3
Or, if you don't want the column to be dropped:
dat[, gen[i, 1], drop=FALSE]
# m
# a 1
# b 1
# c 3
Data
dat <- structure(list(m = c(1L, 1L, 3L), n = 1:3, o = c(2L, 0L, 3L),
p = 5:3), class = "data.frame", row.names = c("a", "b", "c"
))
gen <- data.frame(letters)
We can use select from dplyr
library(dplyr)
i <- 13
dat %>%
select(gen[i, 1])
# m
#a 1
#b 1
#c 3
data
dat <- structure(list(m = c(1L, 1L, 3L), n = 1:3, o = c(2L, 0L, 3L),
p = 5:3), class = "data.frame", row.names = c("a", "b", "c"
))
gen <- data.frame(letters)
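One caveat worth sketching: if gen stores the names as a factor (the default for data.frame() before R 4.0), indexing with it uses the integer codes rather than the labels. A hedged variant that converts to character first and uses [[ ]] to get a plain vector is shown below. The variable names col_name and col_m are illustrative.

```r
dat <- data.frame(m = c(1L, 1L, 3L), n = 1:3, o = c(2L, 0L, 3L), p = 5:3,
                  row.names = c("a", "b", "c"))
gen <- data.frame(letters, stringsAsFactors = FALSE)

i <- 13
# as.character() guards against gen[i, 1] being a factor
col_name <- as.character(gen[i, 1])

# [[ ]] with a single name always returns the column as a vector
col_m <- dat[[col_name]]
col_m
```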

How to sanitize a df according to specific variable values?

I have two data frames. dfOne is made like this:
X Y Z T J
3 4 5 6 1
1 2 3 4 1
5 1 2 5 1
and dfTwo is made like this
C.1 C.2
X Z
Y T
I want to obtain a new data frame containing only the rows where X, Y, Z and T are all simultaneously greater than specific thresholds.
Example. I need simultaneously (in the same row):
X, Y > 2
Z, T > 4
I need to use the second data frame to reach my objective, I expect something like:
dfTwo$C.1>2
so the result would be a new dataframe with this structure:
X Y Z T J
3 4 5 6 1
How could I do it?
Here is a base R method with Map and Reduce.
# build lookup table of thresholds relative to variable name
vals <- setNames(c(2, 2, 4, 4), unlist(dat2))
# subset data.frame
dat[Reduce("&", Map(">", dat[names(vals)], vals)), ]
X Y Z T J
1 3 4 5 6 1
Here, Map returns a list of length 4 with logical variables corresponding to each comparison. This list is passed to Reduce which returns a single logical vector with length corresponding to the number of rows in the data.frame, dat. This logical vector is used to subset dat.
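To make the two stages easier to follow, the same computation can be split into named intermediate steps. This is only an unpacked sketch of the one-liner above. The names checks and keep are introduced here for illustration.

```r
dat <- data.frame(X = c(3L, 1L, 5L), Y = c(4L, 2L, 1L), Z = c(5L, 3L, 2L),
                  T = c(6L, 4L, 5L), J = c(1L, 1L, 1L))
vals <- c(X = 2, Y = 2, Z = 4, T = 4)

# Stage 1: one logical vector per thresholded column
checks <- Map(">", dat[names(vals)], vals)

# Stage 2: combine them element-wise with & into one row filter
keep <- Reduce("&", checks)

dat[keep, ]
```

Inspecting checks column by column shows which individual condition excluded a given row.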
data
dat <-
structure(list(X = c(3L, 1L, 5L), Y = c(4L, 2L, 1L), Z = c(5L,
3L, 2L), T = c(6L, 4L, 5L), J = c(1L, 1L, 1L)), .Names = c("X",
"Y", "Z", "T", "J"), class = "data.frame", row.names = c(NA,
-3L))
dat2 <-
structure(list(C.1 = structure(1:2, .Label = c("X", "Y"), class = "factor"),
C.2 = structure(c(2L, 1L), .Label = c("T", "Z"), class = "factor")), .Names = c("C.1",
"C.2"), class = "data.frame", row.names = c(NA, -2L))
We can use the purrr package
Here is the input data.
# Data frame from lmo's solution
dat <-
structure(list(X = c(3L, 1L, 5L), Y = c(4L, 2L, 1L), Z = c(5L,
3L, 2L), T = c(6L, 4L, 5L), J = c(1L, 1L, 1L)), .Names = c("X",
"Y", "Z", "T", "J"), class = "data.frame", row.names = c(NA,
-3L))
# A numeric vector to show the threshold values
# Notice that columns without any requirements need NA
vals <- c(X = 2, Y = 2, Z = 4, T = 4, J = NA)
Here is the implementation
library(purrr)
map2_dfc(dat, vals, ~ifelse(.x > .y | is.na(.y), .x, NA)) %>% na.omit()
# A tibble: 1 x 5
X Y Z T J
<int> <int> <int> <int> <int>
1 3 4 5 6 1
map2_dfc loops through each column in dat and each value in vals, pair by pair, applying the given function. ~ifelse(.x > .y | is.na(.y), .x, NA) means that if the number in a column is larger than the corresponding value in vals, or that value in vals is NA, the output is the original value from the column. Otherwise, the value is replaced with NA. The output of map2_dfc(dat, vals, ~ifelse(.x > .y | is.na(.y), .x, NA)) is a data frame with NA values in the rows where a condition is not met. Finally, na.omit removes those rows.
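The same masking idea can be sketched in base R with Map, for readers who do not have purrr installed. This is an illustrative equivalent, not the answer's own code. The names masked and result are assumptions.

```r
dat <- data.frame(X = c(3L, 1L, 5L), Y = c(4L, 2L, 1L), Z = c(5L, 3L, 2L),
                  T = c(6L, 4L, 5L), J = c(1L, 1L, 1L))
vals <- c(X = 2, Y = 2, Z = 4, T = 4, J = NA)

# Keep a value if it passes its threshold or has no threshold (NA);
# otherwise mask it with NA
masked <- as.data.frame(
  Map(function(x, y) ifelse(x > y | is.na(y), x, NA), dat, vals)
)

# Rows with any masked value are dropped
result <- na.omit(masked)
result
```

Note that this approach would also drop rows that already contained NA in the original data.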
Update
Here I demonstrate how to convert the dfTwo data frame to the vals vector in my example.
First, let's create the dfTwo data frame.
dfTwo <- read.table(text = "C.1 C.2
X Z
Y T",
header = TRUE, stringsAsFactors = FALSE)
dfTwo
C.1 C.2
1 X Z
2 Y T
To complete the task, I load the dplyr and tidyr packages.
library(dplyr)
library(tidyr)
Now I begin the transformation of dfTwo. The first step is to use stack function to convert the format.
dfTwo2 <- dfTwo %>%
stack() %>%
setNames(c("Col", "Group")) %>%
mutate(Group = as.character(Group))
dfTwo2
Col Group
1 X C.1
2 Y C.1
3 Z C.2
4 T C.2
The second step is to add the threshold information. One way to do this is to create a look-up table showing the association between Group and Value
threshold_df <- data.frame(Group = c("C.1", "C.2"),
Value = c(2, 4),
stringsAsFactors = FALSE)
threshold_df
Group Value
1 C.1 2
2 C.2 4
And then we can use the left_join function to combine the data frame.
dfTwo3 <- dfTwo2 %>% left_join(threshold_df, by = "Group")
dfTwo3
Col Group Value
1 X C.1 2
2 Y C.1 2
3 Z C.2 4
4 T C.2 4
Now for the third step. Notice that there is a column called J which does not need any threshold, so we need to add this information to dfTwo3. We can use the complete function from tidyr. The following code completes the data frame by adding the columns of dat that are missing from dfTwo3, with NA as their Value.
dfTwo4 <- dfTwo3 %>% complete(Col = colnames(dat))
dfTwo4
# A tibble: 5 x 3
Col Group Value
<chr> <chr> <dbl>
1 J <NA> NA
2 T C.2 4
3 X C.1 2
4 Y C.1 2
5 Z C.2 4
The fourth step is to arrange dfTwo4 in the right order. We can achieve this by turning Col into a factor and assigning its levels based on the order of the column names in dat.
dfTwo5 <- dfTwo4 %>%
mutate(Col = factor(Col, levels = colnames(dat))) %>%
arrange(Col) %>%
mutate(Col = as.character(Col))
dfTwo5
# A tibble: 5 x 3
Col Group Value
<chr> <chr> <dbl>
1 X C.1 2
2 Y C.1 2
3 Z C.2 4
4 T C.2 4
5 J <NA> NA
We are almost there. Now we can create vals from dfTwo5.
vals <- dfTwo5$Value
names(vals) <- dfTwo5$Col
vals
X Y Z T J
2 2 4 4 NA
Now we are ready to use the purrr package to filter the data.
The above is a breakdown of the steps. We can combine all of them into the following code for simplicity.
library(dplyr)
library(tidyr)
threshold_df <- data.frame(Group = c("C.1", "C.2"),
Value = c(2, 4),
stringsAsFactors = FALSE)
dfTwo2 <- dfTwo %>%
stack() %>%
setNames(c("Col", "Group")) %>%
mutate(Group = as.character(Group)) %>%
left_join(threshold_df, by = "Group") %>%
complete(Col = colnames(dat)) %>%
mutate(Col = factor(Col, levels = colnames(dat))) %>%
arrange(Col) %>%
mutate(Col = as.character(Col))
vals <- dfTwo2$Value
names(vals) <- dfTwo2$Col
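If the J column does not need an NA placeholder (for example when the Map/Reduce approach is used), a shorter base R sketch can build the named threshold vector directly from dfTwo. The thresholds vector keyed by the column names of dfTwo is an assumption made for this example.

```r
dfTwo <- data.frame(C.1 = c("X", "Y"), C.2 = c("Z", "T"),
                    stringsAsFactors = FALSE)

# Hypothetical lookup: one threshold per column of dfTwo
thresholds <- c(C.1 = 2, C.2 = 4)

# unlist() walks dfTwo column by column (X, Y, Z, T), so each
# threshold is repeated once per row of dfTwo to line up with it
vals <- setNames(
  unname(rep(thresholds[names(dfTwo)], each = nrow(dfTwo))),
  unlist(dfTwo, use.names = FALSE)
)
vals
```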
dfOne[Reduce(intersect, list(which(dfOne["X"] > 2),
which(dfOne["Y"] > 2),
which(dfOne["Z"] > 4),
which(dfOne["T"] > 4))),]
# X Y Z T J
#1 3 4 5 6 1
Or iteratively (so fewer inequalities are tested):
vals = c(X = 2, Y = 2, Z = 4, T = 4) # from @lmo's answer
dfOne[Reduce(intersect, lapply(names(vals), function(x) which(dfOne[x] > vals[x]))),]
# X Y Z T J
#1 3 4 5 6 1
I'm writing this assuming that the second DF is meant to categorize the fields in the first DF. It's way simpler if you don't need to use the second one to define the conditions:
dfNew = dfOne[dfOne$X > 2 & dfOne$Y > 2 & dfOne$Z > 4 & dfOne$T > 4, ]
Or, using dplyr:
library(dplyr)
dfNew = dfOne %>% filter(X > 2 & Y > 2 & Z > 4 & T > 4)
In case that's all you need, I'll save this comment while I poke at the more complicated version of the question.

Group by operation in R

I have a data set with millions of rows and I need to apply a 'group by' operation to it using R.
The data is of the form
V1 V2 V3
a u 1
a v 2
b w 3
b x 4
c y 5
c z 6
Performing a 'group by' on column 1 in R, I want to add up the values in column 3 and concatenate the values in column 2, like
V1 V2 V3
a uv 3
b wx 7
c yz 11
I have tried doing the operation in Excel, but there are too many rows for Excel to handle. I am new to R, so any help would be appreciated.
There are many possible ways to solve this; here are two
library(data.table)
setDT(df)[, .(V2 = paste(V2, collapse = ""), V3 = sum(V3)), by = V1]
# V1 V2 V3
# 1: a uv 3
# 2: b wx 7
# 3: c yz 11
Or
library(dplyr)
df %>%
group_by(V1) %>%
summarise(V2 = paste(V2, collapse = ""), V3 = sum(V3))
# Source: local data table [3 x 3]
#
# V1 V2 V3
# 1 a uv 3
# 2 b wx 7
# 3 c yz 11
Data
df <- structure(list(V1 = structure(c(1L, 1L, 2L, 2L, 3L, 3L), .Label = c("a",
"b", "c"), class = "factor"), V2 = structure(1:6, .Label = c("u",
"v", "w", "x", "y", "z"), class = "factor"), V3 = 1:6), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -6L))
Another option, using aggregate
# Group column 2
ag.2 <- aggregate(df$V2, by=list(df$V1), FUN = paste0, collapse = "")
# Group column 3
ag.3 <- aggregate(df$V3, by=list(df$V1), FUN = sum)
# Merge the two
res <- cbind(ag.2, ag.3[,-1])
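A possible variant of the aggregate approach uses the formula interface and merge(), so that the result keeps the original column names instead of Group.1 and x. This is a sketch under the assumption that V1 is the only grouping column.

```r
df <- data.frame(V1 = c("a", "a", "b", "b", "c", "c"),
                 V2 = c("u", "v", "w", "x", "y", "z"),
                 V3 = 1:6,
                 stringsAsFactors = FALSE)

# Concatenate V2 within each V1 group
ag.2 <- aggregate(V2 ~ V1, data = df, FUN = paste, collapse = "")
# Sum V3 within each V1 group
ag.3 <- aggregate(V3 ~ V1, data = df, FUN = sum)

# Join on the grouping column rather than relying on row order
res <- merge(ag.2, ag.3, by = "V1")
res
```

Joining with merge() avoids depending on both aggregates returning their groups in the same order.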
Another option with sqldf
library(sqldf)
sqldf('select V1,
group_concat(V2,"") as V2,
sum(V3) as V3
from df
group by V1')
# V1 V2 V3
#1 a uv 3
#2 b wx 7
#3 c yz 11
Or using base R
do.call(rbind,lapply(split(df, df$V1), function(x)
with(x, data.frame(V1=V1[1L], V2= paste(V2, collapse=''), V3= sum(V3)))))
Using ddply
library(plyr)
ddply(df, .(V1), summarize, V2 = paste(V2, collapse=''), V3 = sum(V3))
# V1 V2 V3
#1 a uv 3
#2 b wx 7
#3 c yz 11
You could also just use the groupBy function in the 'caroline' package:
x <-cbind.data.frame(V1=rep(letters[1:3],each=2), V2=letters[21:26], V3=1:6, stringsAsFactors=F)
groupBy(df=x, clmns=c('V2','V3'),by='V1',aggregation=c('paste','sum'),collapse='')

Manipulating a Data frame in R

I am new to R. I have a data frame
A 5 8 9 6
B 8 2 3 6
C 1 8 9 5
I want to make
A 5
A 8
A 9
A 6
B 8
B 2
B 3
B 6
C 1
C 8
C 9
C 5
I have a big data file.
Assuming you're starting with something like this:
mydf <- structure(list(V1 = c("A", "B", "C"), V2 = c(5L, 8L, 1L),
V3 = c(8L, 2L, 8L), V4 = c(9L, 3L, 9L),
V5 = c(6L, 6L, 5L)),
.Names = c("V1", "V2", "V3", "V4", "V5"),
class = "data.frame", row.names = c(NA, -3L))
mydf
# V1 V2 V3 V4 V5
# 1 A 5 8 9 6
# 2 B 8 2 3 6
# 3 C 1 8 9 5
Try one of the following:
library(reshape2)
melt(mydf, 1)
Or
cbind(mydf[1], stack(mydf[-1]))
Or
library(splitstackshape)
merged.stack(mydf, var.stubs = "V[2-5]", sep = "var.stubs")
The name pattern in the last example is unlikely to be applicable to your actual data though.
Someone could probably do this in a better way but here I go...
I put your data into a data frame called data
#repeat the value in the first column (c - 1) times where c is the number of columns (data[1,])
rep(data[,1], each=length(data[1,])-1)
#turning your data frame into a matrix allows you then turn it into a vector...
#transpose the matrix because the vector concatenates columns rather than rows
as.vector(t(as.matrix(data[,2:5])))
#combining these ideas you get...
data.frame(col1=rep(data[,1], each=length(data[1,])-1),
col2=as.vector(t(as.matrix(data[,2:5]))))
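The hard-coded 2:5 above ties the code to a five-column data frame. A hedged generalisation that picks the value columns by name is sketched below, assuming the first column (called V1 here, as in the earlier answer's mydf) holds the labels. The names value_cols and long are illustrative.

```r
mydf <- data.frame(V1 = c("A", "B", "C"), V2 = c(5L, 8L, 1L),
                   V3 = c(8L, 2L, 8L), V4 = c(9L, 3L, 9L),
                   V5 = c(6L, 6L, 5L), stringsAsFactors = FALSE)

# Every column except the label column holds values
value_cols <- setdiff(names(mydf), "V1")

long <- data.frame(
  # each label repeated once per value column
  id = rep(mydf$V1, each = length(value_cols)),
  # transpose so that as.vector() reads the values row by row
  value = as.vector(t(as.matrix(mydf[value_cols]))),
  stringsAsFactors = FALSE
)
long
```

This keeps the row-by-row order requested in the question (A 5, A 8, A 9, A 6, B 8, ...), whereas stack() returns the values column by column.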
If you can use a matrix, you can just 'cast' it to a vector and add the row names. I have assumed that you really want 'a', 'b', 'c' as row names.
n <- 3;
data <- matrix(1:9, ncol = n);
data <- t(t(as.vector(data)));
rownames(data) <- rep(letters[1:3], each = n);
If you want to keep the row names from your first data frame, this is of course also possible without libraries.
n <- 3;
data <- matrix(1:9, ncol=n);
names <- rownames(data);
data <- t(t(as.vector(data)))
rownames(data) <- rep(names, each = n)
