Loop through columns and add string lengths as new columns

I have a data frame with a number of columns, and would like to add a separate column for each one containing the string length of the value in each row.
I am trying to iterate through the column names and, for each column, output a corresponding column with '_length' appended to its name.
For example, col1 | col2 would become col1 | col2 | col1_length | col2_length.
The code I am using is:
df <- data.frame(col1 = c("abc","abcd","a","abcdefg"),col2 = c("adf qqwe","d","e","f"))
for(i in names(df)){
df$paste(i,'length',sep="_") <- str_length(df$i)
}
However, this throws an error:
invalid function in complex assignment.
Am I able to use loops in this way in R?

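You need to use [[, the programmatic equivalent of $. The $ operator takes its argument literally: when i is col1, df$i looks for a column named "i" instead of df$col1. Likewise, df$paste(...) on the left-hand side is not a valid assignment target, which is what triggers the error.
library(stringr)
for (i in names(df)) {
  df[[paste(i, "length", sep = "_")]] <- str_length(df[[i]])
}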
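The difference between $ and [[ can be checked directly. The sketch below is only an illustration: it uses base R's nchar() in place of stringr::str_length() so that it runs without extra packages (an assumption made here, not part of the answer above; the two functions differ in how they treat NA and factors).

```r
df <- data.frame(col1 = c("abc", "abcd", "a", "abcdefg"),
                 col2 = c("adf qqwe", "d", "e", "f"),
                 stringsAsFactors = FALSE)

i <- "col1"
# `$` uses the literal name "i", which is not a column of df
dollar_is_null <- is.null(df$i)
# `[[` evaluates i first, so it returns the col1 column
bracket_matches <- identical(df[[i]], df$col1)

# names(df) is evaluated once, before the loop adds new columns
for (col in names(df)) {
  df[[paste(col, "length", sep = "_")]] <- nchar(df[[col]])
}
names(df)
```

Because the loop assigns through [[ with a computed name, the same pattern works for any number of columns.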

You can use lapply to pass each column to str_length, then cbind it to your original data.frame...
library(stringr)
out <- lapply( df , str_length )
df <- cbind( df , out )
#     col1     col2 col1 col2
#1     abc adf qqwe    3    8
#2    abcd        d    4    1
#3       a        e    1    1
#4 abcdefg        f    7    1
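Note that the output above ends up with two pairs of identically named columns. One way to get the requested _length suffix is to rename the list before binding it. This is a sketch under the assumption that base R's nchar() is an acceptable stand-in for str_length(); swap it back if you need stringr's NA handling.

```r
df <- data.frame(col1 = c("abc", "abcd", "a", "abcdefg"),
                 col2 = c("adf qqwe", "d", "e", "f"),
                 stringsAsFactors = FALSE)

# one integer vector of string lengths per column
out <- lapply(df, nchar)
# rename before binding so the new columns do not clash with the originals
names(out) <- paste0(names(out), "_length")
df <- cbind(df, out)
names(df)
```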

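With dplyr and stringr you can use mutate_all. Note that funs() has been deprecated since dplyr 0.8.0 and mutate_all() is superseded by across(), so on current versions the equivalent call is mutate(across(everything(), str_length, .names = "{.col}_length")).
> df %>% mutate_all(funs(length = str_length(.)))
     col1     col2 col1_length col2_length
1     abc adf qqwe           3           8
2    abcd        d           4           1
3       a        e           1           1
4 abcdefg        f           7           1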

For the sake of completeness, there is also a data.table solution:
library(data.table)
result <- setDT(df)[, paste0(names(df), "_length") := lapply(.SD, stringr::str_length)]
result
#      col1     col2 col1_length col2_length
#1:     abc adf qqwe           3           8
#2:    abcd        d           4           1
#3:       a        e           1           1
#4: abcdefg        f           7           1

Related

R Question: How do I add data to a column without replacing existing data?

I have a basic table with an empty column, col1:
df
ident col1
1 NA
2 NA
3 NA
4 NA
I have two other tables:
df1
ident col1
2 Yes
3 Yes
df2
ident col1
1 No
4 No
I am trying to add data from both df1 and df2 to the first table, but when I add the df2 data, the entries already there that don't match are replaced with NA. So I want to add the df1 data into col1 and then add the df2 data into col1 by matching ident numbers, without replacing the values already there.
Here is my code:
df$col1<- df1$col1[match(df$ident, df1$ident)]
Then I do the same for df2, but it replaces the df1 data.
Any suggestions? Thanks.
PS. My data is much more complicated than this; I just boiled it down to make it easier to follow.

You just need to do the match the other way around, so that only the matched rows of df are assigned:
df$col1[match(df1$ident, df$ident)] <- df1$col1
df$col1[match(df2$ident, df$ident)] <- df2$col1
df
# ident col1
# 1: 1 No
# 2: 2 Yes
# 3: 3 Yes
# 4: 4 No
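To make the difference between the two match directions concrete, here is a small base R sketch that rebuilds the question's tables (the construction of df, df1 and df2 is an assumption based on the printed data). It assumes every ident in df1 and df2 exists in df; an unmatched ident would put an NA into the index and the assignment would fail.

```r
df  <- data.frame(ident = 1:4, col1 = NA_character_, stringsAsFactors = FALSE)
df1 <- data.frame(ident = c(2, 3), col1 = c("Yes", "Yes"), stringsAsFactors = FALSE)
df2 <- data.frame(ident = c(1, 4), col1 = c("No", "No"), stringsAsFactors = FALSE)

# Original direction: one value per row of df, NA wherever df2 has no match,
# so assigning this to df$col1 would wipe out anything already filled in
overwritten <- df2$col1[match(df$ident, df2$ident)]

# Reversed direction: only the rows of df that appear in the source are touched
df$col1[match(df1$ident, df$ident)] <- df1$col1
df$col1[match(df2$ident, df$ident)] <- df2$col1
df
```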
You can also do this with data.table update joins
library(data.table)
setDT(df)
df[, col1 := as.character(col1)]# may not be necessary (if data is already char)
df[df1, on = .(ident), col1 := i.col1]
df[df2, on = .(ident), col1 := i.col1]
df
# ident col1
# 1: 1 No
# 2: 2 Yes
# 3: 3 Yes
# 4: 4 No

r - dplyr full_join using column position

I have the following data frames:
a <- c(1,1,1)
b<- c(10,8,2)
c<- c(2,2)
d<- c(3,5)
AB<- data.frame(a,b)
CD<- data.frame(c,d)
I would like to join AB and CD where the first column of CD is equal to the last column of AB (here, the second). Please note that my actual data will have a varying number of columns, with varying names, so I am really looking for a way to join based on position only. I have been trying this:
#Get the name of the last column in AB
> colnames(AB)[ncol(AB)]
[1] "b"
#Get the name of the first column in CD
> colnames(CD)[1]
[1] "c"
Then I attempt to join like this:
> abcd <- full_join(AB, CD, by = c(colnames(AB)[ncol(AB)]=colnames(CD)[1]))
Error: unexpected '=' in "abcd <- full_join(AB, CD, by = c(colnames(AB)[ncol(AB)]="
The behavior I am looking for is essentially this:
> abcd<- full_join(AB, CD, by = c("b" = "c"))
> abcd
a b d
1 1 10 NA
2 1 8 NA
3 1 2 3
4 1 2 5

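Argument names inside c() must be literal, so an expression such as colnames(AB)[ncol(AB)] cannot appear to the left of =. We can build the named vector with setNames instead:
library(dplyr)
full_join(AB, CD, by = setNames(colnames(CD)[1], colnames(AB)[ncol(AB)]))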
# a b d
#1 1 10 NA
#2 1 8 NA
#3 1 2 3
#4 1 2 5

We can replace the target column names with a common name, such as "Target", and then do full_join. Finally, replace the "Target" name with the original column name.
library(dplyr)
AB_name <- names(AB)
target_name <- AB_name[ncol(AB)] # Store the original column name
AB_name[ncol(AB)] <- "Target" # Set a common name
names(AB) <- AB_name
CD_name <- names(CD)
CD_name[1] <- "Target" # Set a common name
names(CD) <- CD_name
abcd <- full_join(AB, CD, by = "Target") %>% # Merge based on the common name
rename(!!target_name := Target) # Replace the common name with the original name
abcd
# a b d
# 1 1 10 NA
# 2 1 8 NA
# 3 1 2 3
# 4 1 2 5

Filtering a R DataFrame with repeated values in columns

I have an R data frame and I want to make another one from it, keeping only the values that appear more than X times in a particular column.
>DataFrame
Value Column
1 a
4 a
2 b
6 c
3 c
4 c
9 a
1 d
For example, I want a new data frame containing only the rows whose value in Column appears more than 2 times, to get something like this:
>NewDataFrame
Value Column
1 a
4 a
6 c
3 c
4 c
9 a
Thank you very much for your time.

We can use table to get the count of values in 'Column' and subset the dataset ('DataFrame') based on the names in 'tbl' that have a count greater than 'n'.
n <- 2
tbl <- table(DataFrame$Column) > n
NewDataFrame <- subset(DataFrame, Column %in% names(tbl)[tbl])
# Value Column
#1 1 a
#2 4 a
#4 6 c
#5 3 c
#6 4 c
#7 9 a
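Or using ave from base R (counting over seq_along(Column) so the group sizes stay numeric even when Column is a character or factor)
NewDataFrame <- DataFrame[with(DataFrame, ave(seq_along(Column), Column, FUN = length) > n), ]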
Or using data.table
library(data.table)
NewDataFrame <- setDT(DataFrame)[, .SD[.N>n] , by = Column]
Or
NewDataFrame <- setDT(DataFrame)[, if(.N > n) .SD, by = Column]
Or dplyr
library(dplyr)
NewDataFrame <- DataFrame %>%
  group_by(Column) %>%
  filter(n() > 2)
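As a quick check, the two base R variants can be run against the data from the question. The data frame construction below is an assumption reconstructed from the printed table, and counts, keep, group_size, by_table and by_ave are illustrative names.

```r
DataFrame <- data.frame(Value  = c(1, 4, 2, 6, 3, 4, 9, 1),
                        Column = c("a", "a", "b", "c", "c", "c", "a", "d"),
                        stringsAsFactors = FALSE)
n <- 2

# table(): count each value, then keep rows whose value is frequent enough
counts <- table(DataFrame$Column)
keep <- names(counts)[as.vector(counts > n)]
by_table <- DataFrame[DataFrame$Column %in% keep, ]

# ave(): compute the size of each row's group, aligned with the rows
group_size <- ave(seq_along(DataFrame$Column), DataFrame$Column, FUN = length)
by_ave <- DataFrame[group_size > n, ]

by_table
```

Both variants keep the original row order and row names, which is why rows 3 and 8 are simply missing from the result.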

Changing the values of a column for the values from another column

I have two datasets that look like this:
What I want is to replace the values in the second column of the first dataset with the values from the second column of the second dataset. All the names in the first dataset are also in the second one, and my real dataset is much bigger than this.
I was trying to use R to do that, but I am very new to it. I was looking at the intersect command, but I am not sure it is going to work. I haven't included any code because I'm really lost here.
I also need the order of the first column (which holds the names) in the first dataset to stay the same, but with the new values from the second column of the second dataset.

Agree with @agstudy, a simple use of merge would do the trick. Try something like this:
df1 <- data.frame(name=c("ab23242", "ab35366", "ab47490", "ab59614"),
X=c(72722, 88283, 99999, 114278.333))
df2 <- data.frame(name=c("ab35366", "ab47490", "ab59614", "ab23242" ),
X=c(12345, 23456, 34567, 456789))
df.merge <- merge(df1, df2, by = "name", all.x = TRUE)
df.merge <- df.merge[, -2] # drop X.x, the old values from df1
Output:
name X.y
1 ab23242 456789
2 ab35366 12345
3 ab47490 23456
4 ab59614 34567
Note that merge sorts the result by the key column by default (sort = TRUE), so it does not necessarily keep the row order of the first data frame. To keep that order strictly, add an index column with df1$order <- 1:nrow(df1) before merging and sort on that column afterwards.

df1<- data.frame( name1 = letters[6:10], valuecol1=seq(2,10,by=2))
df2 <- data.frame( name2 = letters[1:10], valuecol2=10:1)
df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ df1$name1 %in% df2$name2 , "valuecol1"]
df2
name2 valuecol2
1 a 10
2 b 9
3 c 8
4 d 7
5 e 6
6 f 2
7 g 4
8 h 6
9 i 8
10 j 10
This is what I thought might work, but doing replacements using indexing with match sometimes bites me in ways I need to adjust:
df2 [match(df1$name1, df2$name2) , "valuecol2"] <-
df1[ match(df1$name1, df2$name2) , "valuecol1"]
Here's how I tested it (edited).
> df2 <- data.frame( name2 = letters[1:10], valuecol2=10:1)
> df1<- data.frame( name1 = letters[1:5], valuecol1=seq(2,10,by=2))
> df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ match(df1$name1, df2$name2) , "valuecol1"]
> df2
name2 valuecol2
1 a 2
2 b 4
3 c 6
4 d 8
5 e 10
6 f 5
7 g 4
8 h 3
9 i 2
10 j 1
Yep.... bitten again.
> df1<- data.frame( name1 = letters[6:10], valuecol1=seq(2,10,by=2))
> df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ match(df1$name1, df2$name2) , "valuecol1"]
> df2
name2 valuecol2
1 a 2
2 b 4
3 c 6
4 d 8
5 e 10
6 f NA
7 g NA
8 h NA
9 i NA
10 j NA
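For the question as asked (replace the values in the first dataset, keep its row order), a single match() lookup may be enough. The sketch below reuses the df1 and df2 from the merge answer above and assumes, as the question states, that every name in df1 occurs in df2; idx is an illustrative name.

```r
df1 <- data.frame(name = c("ab23242", "ab35366", "ab47490", "ab59614"),
                  X = c(72722, 88283, 99999, 114278.333),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("ab35366", "ab47490", "ab59614", "ab23242"),
                  X = c(12345, 23456, 34567, 456789),
                  stringsAsFactors = FALSE)

# for each row of df1, the row of df2 that holds the same name
idx <- match(df1$name, df2$name)

# look the new values up in df2; df1 keeps its own row order
df1$X <- df2$X[idx]
df1
```

Here the match positions index df2 on the right-hand side only. In the "bitten" examples above, positions computed against df2 were used to index df1, which is why the wrong rows (or NA) were picked up.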

How about this:
library(data.table)
# generate some random data
dt.1 <- data.table(id = 1:1000, value=rnorm(1000), key="id")
dt.2 <- data.table(id = 2*(500:1), value=as.numeric(1:500), key="id")
# objective is to replace value in dt.1 with value from dt.2 where id's match.
# data table joins - very efficient
# after the join, value comes from dt.2 (NA where there is no match) and
# i.value holds the original dt.1 values (older data.table releases named it value.1)
dt.1 <- dt.2[dt.1, nomatch = NA]
dt.1[is.na(value), value := i.value] # keep the old value where dt.2 had none
dt.1[, i.value := NULL] # get rid of extra column
NB: This sorts dt.1 by id which should be OK since it's sorted that way already.
Also: In future, please include data that can be imported into R. Images are not useful!

Match one column of a data.frame with all the columns in another data.frame

I have two data.frames:
DF1
Col1 Col2 ...... ...... Col2000
A H
c d
d e
n b
e A
b n
H c
DF2
A
b
c
d
e
n
H
I simply need to match the single column in DF2 against each column in DF1, because I need to know the exact rank of each match. I tried to write a function, but since I'm not an R expert, something goes wrong in my code:
lapply(DF1, function(x) match(DF1[,i], DF2[,1]))

The anonymous function must use its argument x, which holds the current column, instead of DF1[,i] (i is not defined inside lapply):
lapply(DF1, function(x) match(x, DF2[,1]))
This does what you're trying to do. Take:
DF1 <- data.frame(
Col1 = c('A','c','d','n','e','b','H'),
Col2 = c('H','d','e','b','A','n','c')
)
DF2 <- data.frame(c('A','b','c','d','e','n','H'))
Then:
> lapply(DF1, function(x) match(x, DF2[,1]))
$Col1
[1] 1 3 4 6 5 2 7
$Col2
[1] 7 4 5 2 1 6 3
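With as many as 2000 columns, a matrix may be easier to work with than a list. A possible variation, offered only as a sketch, is sapply(), which simplifies the per-column results into one matrix. The column name key given to DF2 and the variable name ranks are assumptions added for readability.

```r
DF1 <- data.frame(
  Col1 = c("A", "c", "d", "n", "e", "b", "H"),
  Col2 = c("H", "d", "e", "b", "A", "n", "c"),
  stringsAsFactors = FALSE
)
DF2 <- data.frame(key = c("A", "b", "c", "d", "e", "n", "H"),
                  stringsAsFactors = FALSE)

# one column of match positions per column of DF1
ranks <- sapply(DF1, function(x) match(x, DF2[, 1]))
dim(ranks)
ranks[, "Col2"]
```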
