Access specific instances in lists in a dataframe column, and count list lengths (R)

I have an R dataframe in which one column contains comma-separated lists, i.e.:
Column
1,2,4,7,9,0
5,3,8,9,0
3,4
5.8,9,3.5
6
NA
7,4,3
I would like to create a column that counts the length of these lists:
Column       Count
1,2,4,7,9,0  6
5,3,8,9,0    5
3,4          2
5.8,9,3.5    3
6            1
NA           NA
7,4,3        3
Also, is there a way to access specific instances in these lists, i.e. make a new column with only the first instance of each list, or the last instance of each?

One solution is to use strsplit to split each element of the character vector and sapply to get the desired count:
df$count <- sapply(strsplit(df$Column, ","), function(x) {
  if (all(is.na(x))) {
    NA
  } else {
    length(x)
  }
})
df
# Column count
# 1 1,2,4,7,9,0 6
# 2 5,3,8,9,0 5
# 3 3,4 2
# 4 5.8,9,3.5 3
# 5 6 1
# 6 <NA> NA
# 7 7,4,3 3
If it is desired to count NA as 1, then the solution could be even simpler:
df$count <- sapply(strsplit(df$Column, ","),length)
Data:
df <- read.table(text = "Column
'1,2,4,7,9,0'
'5,3,8,9,0'
'3,4'
'5.8,9,3.5'
'6'
NA
'7,4,3'",
header = TRUE, stringsAsFactors = FALSE)

count.fields serves this purpose for a text file, and can be made to work with a column too:
df$Count <- count.fields(textConnection(df$Column), sep=",")
df$Count[is.na(df$Column)] <- NA
df
# Column Count
#1 1,2,4,7,9,0 6
#2 5,3,8,9,0 5
#3 3,4 2
#4 5.8,9,3.5 3
#5 6 1
#6 <NA> NA
#7 7,4,3 3
On a more general note, you're probably better off converting your column to a list, or stacking the data to a long form, to make it easier to work with:
df$Column <- strsplit(df$Column, ",")
lengths(df$Column)
#[1] 6 5 2 3 1 1 3
sapply(df$Column, `[`, 1)
#[1] "1" "5" "3" "5.8" "6" NA "7"
stack(setNames(df$Column, seq_along(df$Column)))
# values ind
#1 1 1
#2 2 1
#3 4 1
#4 7 1
#5 9 1
#6 0 1
#7 5 2
#8 3 2
#9 8 2
# etc
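The second part of the question (last instances) can be handled the same way once the column is split; a minimal sketch, assuming df$Column has already been converted with strsplit as above:

```r
# Last instance of each list: index each vector by its own length
sapply(df$Column, function(x) x[length(x)])
# [1] "0"   "0"   "4"   "3.5" "6"   NA    "7"... 
```

Since strsplit turns the NA row into a length-1 NA_character_ vector, the NA row stays NA without a special case.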

Here's a slightly faster way to achieve the same result:
df$Count <- nchar(gsub('[^,]', '', df$Column)) + 1
This one works by counting how many commas there are and adding 1.
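As a quick sanity check of the comma-counting trick, here is a self-contained sketch (the df below just recreates the sample data):

```r
df <- data.frame(Column = c("1,2,4,7,9,0", "5,3,8,9,0", "3,4",
                            "5.8,9,3.5", "6", NA, "7,4,3"),
                 stringsAsFactors = FALSE)
# Remove everything except commas, count what is left, add 1
nchar(gsub('[^,]', '', df$Column)) + 1
# [1]  6  5  2  3  1 NA  3
```

gsub() propagates NA, and nchar() of an NA character is NA, so the missing row conveniently comes out NA with no special handling.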

Related

R: dynamically detect Excel column names formatted as dates (without df slicing)

I am trying to detect date columns that come from an Excel file:
library(openxlsx)
df <- read.xlsx('path/df.xlsx', sheet=1, detectDates = T)
Which reads the data as follows:
# a b c 44197 44228 d
#1 1 1 NA 1 1 1
#2 2 2 NA 2 2 2
#3 3 3 NA 3 3 3
#4 4 4 NA 4 4 4
#5 5 5 NA 5 5 5
I tried to specify a fixed index slice and then transform those specific columns as follows:
names(df)[4:5] <- format(as.Date(as.numeric(names(df)[4:5]),
                                 origin = "1899-12-30"), "%m/%d/%Y")
This works well when the df is sliced at those specific columns. Unfortunately, the column indices could change, say from names(df)[4:5] to names(df)[2:3], in which case the code would try to convert the wrong names and return coerced NA values instead of dates.
Data:
Note: with data.frame() the column names below are read as X44197 and X44228 (because of check.names), while read.xlsx() reads them as 44197 and 44228:
df <- data.frame(a=rep(1:5), b=rep(1:5), c=NA, "44197"=rep(1:5), '44228'=rep(1:5), d=rep(1:5))
Expected Output:
Note: this is the original excel format for these above columns:
# a b c 01/01/2021 01/02/2021 d
#1 1 1 NA 1 1 1
#2 2 2 NA 2 2 2
#3 3 3 NA 3 3 3
#4 4 4 NA 4 4 4
#5 5 5 NA 5 5 5
How could I detect directly these excel format and change it to date without having to slice the dataframe?
We only need to get those column names that are numbers:
i1 <- !is.na(as.integer(names(df)))
and then use
names(df)[i1] <- format(as.Date(as.numeric(names(df)[i1]),
                                origin = "1899-12-30"), "%m/%d/%Y")
Or with dplyr
library(dplyr)
df %>%
  rename_with(~ format(as.Date(as.numeric(.),
                               origin = "1899-12-30"), "%m/%d/%Y"),
              matches('^\\d+$'))
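As a check, here is the base R version applied to the sample data (built with check.names = FALSE so the numeric names survive; suppressWarnings silences the expected coercion warning for the non-numeric names):

```r
df <- data.frame(a = 1:5, b = 1:5, c = NA,
                 "44197" = 1:5, "44228" = 1:5, d = 1:5,
                 check.names = FALSE)
i1 <- !is.na(suppressWarnings(as.integer(names(df))))
names(df)[i1] <- format(as.Date(as.numeric(names(df)[i1]),
                                origin = "1899-12-30"), "%m/%d/%Y")
names(df)
# [1] "a" "b" "c" "01/01/2021" "02/01/2021" "d"
```

(Excel serial 44197 is 2021-01-01 and 44228 is 2021-02-01 with the 1899-12-30 origin; the "%m/%d/%Y" format prints them month-first.)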

Convert list df (with multiple columns) to numeric

I have a df below:
view(fds)
#1 #2 #3 #4
1# 1 3 4 2
2# 4 5 3 2
3# 2 5 3 1
4# 3 5 1 3
I want fds.sum <- rowSums(fds), but I get "Error in rowSums(fds) : 'x' must be numeric". Then, when I try fds.num <- as.numeric(fds), I get "Error: 'list' object cannot be coerced to type 'double'".
I have tried fds.num <- lapply(fds, as.numeric), but that gives me:
fds.num list[4] List of Length 4
1# double[101] 1 4 2 3
2# double[101] 3 5 5 5
3# double[101] 4 3 3 1
4# double[101] 2 2 1 3
I just want a sum of my rows in a new column such that:
#1 #2 #3 #4 sum
1# 1 3 4 2 10
2# 4 5 3 2 14
3# 2 5 3 1 11
4# 3 5 1 3 12
Anyone know how to do that?
If we want to use the OP's code, just Reduce with +:
fds$sum <- Reduce(`+`, lapply(fds, as.numeric) )
Or after converting to numeric, bind them as a matrix or update the original data
fds[] <- lapply(fds, as.numeric)
fds$sum <- rowSums(fds, na.rm = TRUE)
Or it can be done on the fly with sapply
fds$sum <- rowSums(sapply(fds, as.numeric))
Or, even without as.numeric, it can be automated with type.convert:
fds$sum <- rowSums(type.convert(fds, as.is = TRUE))
The error shown in the OP's code is a result of applying rowSums directly on a list, as lapply always returns a list.
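A minimal reproduction of the situation, assuming the columns arrived as character (which is what triggers the original error; column names here are placeholders):

```r
fds <- data.frame(c1 = c("1", "4", "2", "3"), c2 = c("3", "5", "5", "5"),
                  c3 = c("4", "3", "3", "1"), c4 = c("2", "2", "1", "3"),
                  stringsAsFactors = FALSE)
# rowSums(fds) here errors: 'x' must be numeric
fds$sum <- rowSums(type.convert(fds, as.is = TRUE))
fds$sum
# [1] 10 14 11 12
```

type.convert() re-infers each column's type, so the character columns become integer and rowSums works directly.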

Comparing items in a list to a dataset in R

I have a large dataset (8,000 obs) and about 16 lists with anywhere from 120 to 2,000 items. Essentially, I want to check to see if any of the observations in the dataset match an item in a list. If there is a match, I want to include a variable indicating the match.
As an example, if I have data that look like this:
dat <- as.data.frame(1:10)
list1 <- c(2:4)
list2 <- c(7,8)
I want to end with a dataset that looks something like this
Obs Var List
 1   1
 2   2   1
 3   3   1
 4   4   1
 5   5
 6   6
 7   7   2
 8   8   2
 9   9
10  10
How do I go about doing this? Thank you!
Here is one way to do it using a boolean sum and %in%. If several lists match, the last one is taken:
dat <- data.frame(Obs = 1:10)
list_all <- list(c(2:4), c(7,8))
present <- sapply(seq_along(list_all), function(n) (dat$Obs %in% list_all[[n]]) * n)
dat$List <- apply(present, 1, FUN = max)
dat$List[dat$List == 0] <- NA
dat
> dat
Obs List
1 1 NA
2 2 1
3 3 1
4 4 1
5 5 NA
6 6 NA
7 7 2
8 8 2
9 9 NA
10 10 NA
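An alternative sketch that avoids the intermediate matrix, reusing dat and list_all from above: flatten the lists into one lookup vector and use match. Note that where an item appears in several lists, match takes the first, whereas the max-based approach above takes the last:

```r
dat <- data.frame(Obs = 1:10)
list_all <- list(c(2:4), c(7, 8))
# One entry per list item, recording which list it came from
lookup <- rep(seq_along(list_all), lengths(list_all))
dat$List <- lookup[match(dat$Obs, unlist(list_all))]
dat$List
# [1] NA  1  1  1 NA NA  2  2 NA NA
```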

Grouping by character matching & string length

Suppose I have a column in a dataframe with strings. I want to group the strings so that strings with the same length and the same characters (in any order) fall into the same group.
The output should be grouped like the below provided sample:
Rule Group
x 1
x 1
xx 2
xx 2
xy 3
yx 3
xx 2
xyx 4
yxx 4
yyy 5
xyxy 6
yxyx 6
xyxy 6
You can split Rule into characters, sort them, and paste them back together. Matching the result against its unique values then gives the group numbers. In R,
v1 <- sapply(strsplit(df$Rule, ''), function(i) paste(sort(i), collapse = ''))
match(v1, unique(v1))
#[1] 1 1 2 2 3 3 2 4 4 5 6 6 6
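End to end, with the sample column typed in by hand (a hypothetical construction, since the question provides no data code):

```r
df <- data.frame(Rule = c("x", "x", "xx", "xx", "xy", "yx", "xx",
                          "xyx", "yxx", "yyy", "xyxy", "yxyx", "xyxy"),
                 stringsAsFactors = FALSE)
# Canonical form: characters of each string sorted alphabetically
v1 <- sapply(strsplit(df$Rule, ''), function(i) paste(sort(i), collapse = ''))
df$Group <- match(v1, unique(v1))
df$Group
# [1] 1 1 2 2 3 3 2 4 4 5 6 6 6
```

Strings that are anagrams of each other (hence same length, same characters) collapse to the same canonical form and get the same group number.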

Replacing NAs between two rows with identical values in a specific column

I have a dataframe with multiple columns and I want to replace NAs in one column if they are between two rows with an identical number. Here is my data:
v1 v2
1 2
NA 3
NA 2
1 1
NA 7
NA 2
3 1
I basically want to start from the beginning of the data frame and replace NAs in column v1 with the previous non-NA value if the next non-NA value matches it. That being said, I want the result to be like this:
v1 v2
1 2
1 3
1 2
1 1
NA 7
NA 2
3 1
As you may see, rows 2 and 3 are replaced with the number "1" because rows 1 and 4 had an identical number, but rows 5 and 6 stay the same because the non-NA values in rows 4 and 7 are not identical. I have been tweaking a lot but so far no luck. Thanks
Here is an idea using the zoo package. We basically fill NAs in both directions and set to NA the values that differ between the two directions.
library(zoo)
ind1 <- na.locf(df$v1, fromLast = TRUE)
df$v1 <- na.locf(df$v1)
df$v1[df$v1 != ind1] <- NA
which gives,
v1 v2
1 1 2
2 1 3
3 1 2
4 1 1
5 NA 7
6 NA 2
7 3 1
Here is a similar approach in the tidyverse using fill:
library(tidyverse)
df %>%
  mutate(vNew = v1) %>%
  fill(vNew, .direction = 'up') %>%
  fill(v1) %>%
  mutate(v1 = replace(v1, v1 != vNew, NA)) %>%
  select(-vNew)
# v1 v2
#1 1 2
#2 1 3
#3 1 2
#4 1 1
#5 NA 7
#6 NA 2
#7 3 1
Here is a base R solution, the logic is almost the same as Sotos's one:
replace_na <- function(x) {
  f <- function(x) ave(x, cumsum(!is.na(x)), FUN = function(x) x[1])
  y <- f(x)
  yp <- rev(f(rev(x)))
  ifelse(!is.na(y) & y == yp, y, x)
}
df$v1 <- replace_na(df$v1)
test:
> replace_na(c(1, NA, NA, 1, NA, NA, 3))
[1] 1 1 1 1 NA NA 3
I could use the na.locf function to do this. Basically, I use na.locf from the zoo package to replace each NA with the latest previous non-NA value and store the result in one column. Using the same function with fromLast = TRUE, NAs are replaced with the next non-NA value, stored in another column. I then compare these two columns, and wherever the results in a row do not match I set the value to NA.