access data frame column using variable - r

Consider the following code
a = "col1"
b = "col2"
d = data.frame(a=c(1,2,3),b=c(4,5,6))
This code produces the following data frame
a b
1 1 4
2 2 5
3 3 6
However the desired data frame is
col1 col2
1 1 4
2 2 5
3 3 6
Further, I'd like to be able to do something like d$a which would then grab d$col1 since a = "col1"
How can I tell R that "a" is a variable and not a name of a column?

After creating your data frame, you can rename its columns with colnames (see ?colnames). For example:
d = data.frame(a=c(1,2,3), b=c(4,5,6))
colnames(d) <- c("col1", "col2")
You can also name your variables when you create the data frame. For example:
d = data.frame(col1=c(1,2,3), col2=c(4,5,6))
Further, if you have the names of columns stored in variables, as in
a <- "col1"
you can't use $ to select a column via d$a. $ takes the name literally, so R will look for a column whose name is a. Instead, you can do either d[[a]] or d[,a].
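As a minimal sketch of both points (naming columns from variables at creation time, and selecting by variable), assuming the variables a and b from the question; using setNames here is my own choice, not something the question requires:

```r
a <- "col1"
b <- "col2"

# Build the data frame without names, then take the names from the variables
d <- setNames(data.frame(c(1, 2, 3), c(4, 5, 6)), c(a, b))

# [[ ]] and [ , ] evaluate the variable, so both select column "col1"
first_col <- d[[a]]
same_col <- d[, a]

# $ does not evaluate the variable: it looks for a column literally named "a"
literal_lookup <- d$a
```

Here literal_lookup is NULL, because d has no column called a.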

You can do it this way
a = "col1"
b = "col2"
d = data.frame(a=c(1,2,3),b=c(4,5,6))
>d
a b
1 1 4
2 2 5
3 3 6
#Renaming the columns
names(d) <- c(a,b)
> d
col1 col2
1 1 4
2 2 5
3 3 6
#Calling by names
d[,a]

Related

Replace values from dataframe where vector values match indexes in another dataframe

Perhaps this has already been asked and answered, but I can't find it. Also, is there an optimal approach to this problem?
This is just an example with a few rows, but I'll apply it to a data frame of about 1 million rows.
I'm kind of new to R.
I have two data frames
DF1:
a b
1 1 0
2 2 0
3 2 0
4 3 0
5 5 0
and
DF2
l
1 A
2 B
3 C
4 D
5 E
What I'm trying to do is match the values in DF1$a with the row indexes of DF2 and assign the corresponding DF2$l values to DF1$b, so my result would look like this:
DF1:
a b
1 1 A
2 2 B
3 2 B
4 3 C
5 5 E
I've coded a for loop to do this, but it seems that I'm missing something
for(i in 1:length(df1$a)){
df1$b[i] <- df2$l[df1$a[i]]
}
Which produces the following result:
DF1:
a b
1 1 1
2 2 2
3 2 2
4 3 3
5 5 5
Thanks in advance :)
We can use merge to join the two data frames, matching DF1$a against the row ids of DF2.
# Create example data frame
DF1 <- data.frame(a = c(1, 2, 2, 3, 5))
DF2 <- data.frame(l = c("A", "B", "C", "D", "E"),
stringsAsFactors = FALSE)
# Add a column called a to DF2 that holds the row id
DF2$a <- row.names(DF2)
# Merge DF1 and DF2 by a
DF3 <- merge(DF1, DF2, by = "a", all.x = TRUE)
# Change the name of column l to be b
names(DF3) <- c("a", "b")
DF3
# a b
# 1 1 A
# 2 2 B
# 3 2 B
# 4 3 C
# 5 5 E
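For a large data frame, plain vector indexing may be simpler than a merge. The sketch below assumes the loop in the question returned numbers because DF2$l was a factor (the default for strings in data.frame before R 4.0.0), so assigning its elements one by one stored the integer codes. Converting to character first and indexing in a single vectorised step avoids both the loop and the codes:

```r
DF1 <- data.frame(a = c(1, 2, 2, 3, 5), b = 0)

# stringsAsFactors = TRUE is set here only to reproduce the assumed situation
DF2 <- data.frame(l = c("A", "B", "C", "D", "E"), stringsAsFactors = TRUE)

# Use the values of DF1$a as row positions into DF2$l, all at once
DF1$b <- as.character(DF2$l)[DF1$a]
```

Unlike merge, this keeps the rows of DF1 in their original order.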

Change dataframe values R using different column name provided?

I have the following data frame:
Column1 Default_Val
1 A 2
2 B 2
3 C 2
4 D 2
5 E 2
...
colnames: "Column1" "Default_Val"
rownames: "1" "2" "3" "4" "5"
This data frame is used inside my function, which changes the default values according to some conditions.
I want to generalize the assignment because I want to support different column names for this data frame.
How can I change the default value without depending on the column names?
Here is what I did so far:
df[Column1 == "A","Default_Val"]
[1] 2
df[Column1 == "A","Default_Val"] = 1
df[Column1 == "A","Default_Val"]
[1] 1
I want something generalized like:
t <- colnames(df)
df[t[1] == "A", t[2]] = 7
For some reason it doesn't work (each time this happens I love Python more :)).
Please advise.
I think it must be straightforward. Please check if this solves your problem.
> df
Column1 Default_val
1 A 1
2 B 3
3 A 4
4 C 1
5 D 4
> df[2][df[1] == 'A'] = 3
> df
Column1 Default_val
1 A 3
2 B 3
3 A 3
4 C 1
5 D 4
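As for why the attempt in the question fails: t[1] == "A" compares the column name itself (the string "Column1") with "A", which is a single FALSE, so no rows are selected. A minimal sketch of the generalized form, assuming the first column holds the keys and the second holds the values (cols is a name I chose to avoid masking the t function):

```r
df <- data.frame(Column1 = c("A", "B", "A", "C", "D"),
                 Default_Val = c(1, 3, 4, 1, 4),
                 stringsAsFactors = FALSE)

cols <- colnames(df)

# df[[cols[1]]] extracts the column's values, so the comparison is row-wise
df[df[[cols[1]]] == "A", cols[2]] <- 7
```

The same line works unchanged if the columns are renamed, since it only relies on their positions in colnames(df).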

in R find duplicates by column 1 and filter by not NA column 3

I have a dataframe:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
c <- c(1,NA,2,4,NA,1,2,2)
df <-data.frame(a,b,c)
The data frame has duplicate values in column a, but when I de-duplicate with the duplicated function, the row that is kept seems to be chosen arbitrarily:
dedup_df = df[!duplicated(df$a), ]
How can I ensure that the output returns me the row that does not contain an NA on column c ?
I tried to use the dplyr package but the output prints only a result
library(dplyr)
options(dplyr.print_max = Inf )
df %>% ## source dataframe
group_by(a) %>% ## grouped by variable
filter(!is.na(c) ) %>% ## filter by Gross value
as.data.frame(dedup_df)
Your use of the duplicated function to remove duplicate rows from a data frame, using one column as the key, is correct.
But it seems you are worried that it may keep a row that contains NA in another column and drop a row that contains a non-NA value.
I'll use your example, but with a slight modification
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
c <- c(NA,1,2,4,NA,1,2,2)
df <-data.frame(a,b,c)
> df
a b c
1 A 1 NA
2 A 1 1
3 A 2 2
4 B 4 4
5 B 1 NA
6 B 1 1
7 C 2 2
8 C 2 2
In this case, your dedup_df contains an NA for the first value.
> dedup_df = df[!duplicated(df$a), ]
> dedup_df
a b c
1 A 1 NA
4 B 4 4
7 C 2 2
Solution:
Reorder df by column c first and then use the same command. By default order() places NAs last, so sorting by c sends all rows with NA to the end of the data frame. When duplicated runs, it meets the non-NA rows of each group first and flags the NA rows as duplicates whenever an earlier row with the same a exists.
df = df[order(df$c),]
dedup_df = df[!duplicated(df$a), ]
> dedup_df
a b c
2 A 1 1
6 B 1 1
7 C 2 2
You can also reorder in descending order, which keeps the row with the largest non-NA value of c in each group
df = df[order(df$c, decreasing = TRUE),]
dedup_df = df[!duplicated(df$a), ]
> dedup_df
a b c
4 B 4 4
3 A 2 2
7 C 2 2
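If the original row order should be preserved, an alternative is to drop the NA rows before de-duplicating. This is only a sketch, and it assumes that a group whose c values are all NA may disappear from the result entirely (df_non_na is a helper name I introduced):

```r
a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
b <- c(1, 1, 2, 4, 1, 1, 2, 2)
c <- c(NA, 1, 2, 4, NA, 1, 2, 2)
df <- data.frame(a, b, c)

# Keep only rows with a non-NA c, without changing their order
df_non_na <- df[!is.na(df$c), ]

# The first remaining row of each group is the first non-NA row
dedup_df <- df_non_na[!duplicated(df_non_na$a), ]
```

With the sorting approach a group made up only of NA rows is still kept; with this one it is removed, so pick whichever matches the intent.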

Order a data frame only from a certain row index to a certain row index

Let's say we have a DF like this:
col1 col2
A 1
A 5
A 3
A 16
B 5
B 4
B 3
C 7
C 2
I'm trying to order col2 but only for same values in col1. Better said, I want it to look like this:
col1 col2
A 1
A 3
A 5
A 16
B 3
B 4
B 5
C 2
C 7
So col2 should be ordered only within the A, B and C groups, not across the entire column.
x <- function() {
values<- unique(DF[, 1])
for (i in values) {
currentData <- which(DF$col1== i)
## what to do here ?
data[order(data[, 2]), ]
}
}
So in currentData I have the row indexes for only the As, Bs, etc. But how do I order only those rows within my entire DF data frame? Is it possible to tell the order function to operate only on certain row indexes of a data frame?
ave splits its first argument (here DF$col2) into groups defined by the grouping variable (here DF$col1), applies the named function to each group, and puts the results back in the positions that group occupied. Here is an application of ave sorting within groups:
DF$col2 <- ave(DF$col2, DF$col1, FUN=sort)
DF
## col1 col2
## 1 A 1
## 2 A 3
## 3 A 5
## 4 A 16
## 5 B 3
## 6 B 4
## 7 B 5
## 8 C 2
## 9 C 7
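This will work even if the values in col1 are not consecutive: each group's sorted values go back into the positions that group originally occupied. Note that only col2 is rearranged, so any other columns would not move with it.
If that is not an important consideration, there are better ways to do this, such as the answer by @user314046.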
It seems that
my_df[with(my_df, order(col1, col2)), ]
will do what you want: it simply sorts the data frame by col1 and then by col2. If you don't want to reorder by col1, a method is provided in the other answer.
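The difference between the two answers only shows up when the groups are interleaved. A small sketch with made-up data (the four rows below are my own example, not from the question):

```r
DF <- data.frame(col1 = c("A", "B", "A", "B"),
                 col2 = c(5, 4, 1, 3),
                 stringsAsFactors = FALSE)

# ave: col1 stays where it is, col2 is sorted within each group's positions
within_groups <- DF
within_groups$col2 <- ave(DF$col2, DF$col1, FUN = sort)

# order: whole rows are moved so that groups become contiguous
fully_sorted <- DF[order(DF$col1, DF$col2), ]
```

After this, within_groups still alternates A, B, A, B in col1, while fully_sorted lists both A rows before both B rows.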

How to exclude a set of elements in R?

I have two data frames, A and B, with the same columns (names and content). Data frame B is a subset of the rows of A. I want to get A without the rows of B. I have tried different functions like setdiff, duplicated, which and others. None of them worked for me; perhaps I didn't use them correctly. Any help is appreciated.
You could use merge e.g.:
df1 <- data.frame(col1=c('A','B','C','D','E'),col2=1:5,col3=11:15)
subset <- df1[c(2,4),]
subset$EXTRACOL <- 1 # use a column name that is not present among
# the original data.frame columns
merged <- merge(df1,subset,all=TRUE)
dfdifference <- merged[is.na(merged$EXTRACOL),]
dfdifference$EXTRACOL <- NULL
-----------------------------------------
> df1:
col1 col2 col3
1 A 1 11
2 B 2 12
3 C 3 13
4 D 4 14
5 E 5 15
> subset:
col1 col2 col3
2 B 2 12
4 D 4 14
> dfdifference:
col1 col2 col3
1 A 1 11
3 C 3 13
5 E 5 15
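The same result can be reached without a marker column by comparing whole rows. This is a rough sketch rather than a general solution: it assumes every column pastes to an unambiguous string (true for the character and integer columns here), and row_key and sub_df are names I made up, the latter to avoid masking the base subset function:

```r
df1 <- data.frame(col1 = c("A", "B", "C", "D", "E"),
                  col2 = 1:5,
                  col3 = 11:15,
                  stringsAsFactors = FALSE)
sub_df <- df1[c(2, 4), ]

# Collapse each row into one string so rows can be compared with %in%
row_key <- function(df) do.call(paste, c(unname(as.list(df)), sep = "\r"))

# Keep the rows of df1 whose key does not occur in sub_df
dfdifference <- df1[!(row_key(df1) %in% row_key(sub_df)), ]
```

If one column alone identifies a row (col1 in this example), df1[!(df1$col1 %in% sub_df$col1), ] is enough.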
