Use count() on more than one column and order the results in R

I have a data frame called table like this:
a m g c1 c2 c3 c4
1 2015 5 13 bread wine <NA> <NA>
2 2015 8 30 wine eggs rice cake
3 2015 1 21 wine rice eggs <NA>
...
I want to count the elements in columns c1 to c4 and order them.
I tried to use:
library(plyr)
c<-count(table,"c1")
But I don't know how to count more than one column.
Then I want to use arrange(c, desc(freq)) to order them, but when I try it with one column the value NA is always on top, and I only want the top 3 elements. Like this:
c freq
1 wine 3
2 eggs 2
3 rice 2
Can someone please suggest a solution for this? Thanks!

Use melt (from reshape2) and table:
df1 <- read.table(text="a m g c1 c2 c3 c4
2015 5 13 bread wine NA NA
2015 8 30 wine eggs rice cake
2015 1 21 wine rice eggs NA", header=TRUE, stringsAsFactors=FALSE)
library(reshape2)
c_col <- melt(as.matrix(df1[, 4:7]))
sort(table(c_col$value), decreasing = TRUE)
wine eggs rice bread cake
3 2 2 1 1
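To get just the top 3 in the shape the question asks for, a small follow-up sketch building on the table above (table() already drops the NA values by default):
tab <- sort(table(c_col$value), decreasing = TRUE)
head(data.frame(c = names(tab), freq = as.vector(tab)), 3)
#      c freq
# 1 wine    3
# 2 eggs    2
# 3 rice    2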

With qdapTools, using the example data frame (named table) provided:
library(qdapTools)
counts <- data.frame(count=sort(colSums(mtabulate(table[,4:7])), decreasing=TRUE))
subset(counts,rownames(counts)!='<NA>')[1:3,1,drop=FALSE] #remove <NA>, select top 3 elements
# count
# wine 3
# eggs 2
# rice 2

Related

Remove rows with >50% missing across certain columns in R

Here is my data:
data <- data.frame(id=c(1,2,3,4,5),
ethnicity=c("asian",NA,NA,NA,"asian"),
age=c(34,NA,NA,NA,65),
a1=c(3,4,5,2,7),
a2=c("y","y","y",NA,NA),
a3=c("low", NA, "high", "med", NA),
a4=c("green", NA, "blue", "orange", NA))
id ethnicity age a1 a2 a3 a4
1 asian 34 3 y low green
2 <NA> NA 4 y <NA> <NA>
3 <NA> NA 5 y high blue
4 <NA> NA 2 <NA> med orange
5 asian 65 7 <NA> <NA> <NA>
I would like to remove rows that have >50% missing in columns a1 to a4.
I have tried the code below, but I'm having trouble specifying the columns that I want this to apply to:
data[which(rowMeans(!is.na(data)) > 0.5), ] #This doesn't specify the column
miss2 <- c()
for(i in 1:nrow(data)) {
if(length(which(is.na(data[4:7,]))) >= 0.5*ncol(data)) miss2 <- append(miss2,4:7)
}
data1 <- data[-miss2,]
#I thought I specified the columns here, but I'm not getting the output I was hoping for (i.e. id 4 doesn't show up)
The code above looks at missingness in all columns. I would like it to only consider the % missing in columns a1, a2, a3, a4. What I'm hoping to get is below:
id ethnicity age a1 a2 a3 a4
1 asian 34 3 y low green
2 <NA> NA 4 y <NA> <NA>
3 <NA> NA 5 y high blue
4 <NA> NA 2 <NA> med orange
Any help is appreciated, thank you!
You're really close; the main issue is using which() (a vector of indices) instead of simply a vector of logicals:
keep <- rowMeans(is.na(data[,4:7])) <= 0.5
keep
[1] TRUE TRUE TRUE TRUE FALSE
data[keep,]
id ethnicity age a1 a2 a3 a4
1 1 asian 34 3 y low green
2 2 <NA> NA 4 y <NA> <NA>
3 3 <NA> NA 5 y high blue
4 4 <NA> NA 2 <NA> med orange
Just for fun a dplyr approach:
Here we combine rowwise() with a comparison directly in filter(): we take the number of NAs across a1:a4, divide by the number of columns, and keep rows where the result is <= 0.5.
For c_across() to work, we first have to bring all of a1:a4 to the same class:
library(dplyr)
data %>%
rowwise() %>%
mutate(a1 = as.character(a1)) %>%
filter(sum(is.na(c_across(a1:a4))) / length(c_across(a1:a4)) <= 0.5)
id ethnicity age a1 a2 a3 a4
<dbl> <chr> <dbl> <chr> <chr> <chr> <chr>
1 1 asian 34 3 y low green
2 2 NA NA 4 y NA NA
3 3 NA NA 5 y high blue
4 4 NA NA 2 NA med orange
A base R one-liner:
data[rowSums(is.na(data[, -c(1:3)])) / 4 <= .5, ]
#> id ethnicity age a1 a2 a3 a4
#> 1 1 asian 34 3 y low green
#> 2 2 <NA> NA 4 y <NA> <NA>
#> 3 3 <NA> NA 5 y high blue
#> 4 4 <NA> NA 2 <NA> med orange
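For a vectorised dplyr variant without rowwise(), a sketch assuming dplyr 1.1+ (where pick() is available):
library(dplyr)
data %>%
  filter(rowMeans(is.na(pick(a1:a4))) <= 0.5)
# keeps the same four rows (ids 1-4) as above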

Join dataframes in dplyr by characters

So I have two dataframes:
DF1
X Y ID
banana 14 1
orange 20 2
pineapple 1 3
guava 300 4
grapes 1 5
DF2
Store State ID
Walmart NY 1
Sears AL 1;2
Target DC 3
Old Navy PA 3
Popeye's HA 5
Footlocker NJ 4;5
I join with the following and get:
df1 %>%
inner_join(df2, by = "ID")
X Y ID Store State
banana 14 1 Walmart NY
pineapple 1 3 Target DC
pineapple 1 3 Old Navy PA
grapes 1 5 Popeye's HA
But due to the semicolons I'm not capturing those data points in the join; the end result should look like this:
X Y ID Store State
banana 14 1 Walmart NY
banana 14 1 Sears AL
orange 20 2 Sears AL
pineapple 1 3 Target DC
pineapple 1 3 Old Navy PA
guava 300 4 Footlocker NJ
grapes 1 5 Popeye's HA
grapes 1 5 Footlocker NJ
Using separate_rows from tidyr in combination with dplyr will get you there.
I called the first table fruit and the other stores.
library(dplyr)
library(tidyr)
fruit %>%
inner_join(separate_rows(stores, ID) %>% mutate(ID = as.integer(ID)))
Joining, by = "ID"
X Y ID Store State
1 banana 14 1 Walmart NY
2 banana 14 1 Sears AL
3 orange 20 2 Sears AL
4 pineapple 1 3 Target DC
5 pineapple 1 3 Old Navy PA
6 guava 300 4 Footlocker NJ
7 grapes 1 5 Popeye's HA
8 grapes 1 5 Footlocker NJ
With base R, we can use strsplit with merge
lst1 <- strsplit(DF2$ID, ";")
merge(DF1, transform(DF2[rep(seq_len(nrow(DF2)),
lengths(lst1)), 1:2], ID = unlist(lst1)))
# ID X Y Store State
#1 1 banana 14 Walmart NY
#2 1 banana 14 Sears AL
#3 2 orange 20 Sears AL
#4 3 pineapple 1 Target DC
#5 3 pineapple 1 Old Navy PA
#6 4 guava 300 Footlocker NJ
#7 5 grapes 1 Popeye's HA
#8 5 grapes 1 Footlocker NJ
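In current tidyr (1.3+), separate_longer_delim() supersedes separate_rows(); a sketch along the same lines, reusing the fruit/stores names from the first answer:
library(dplyr)
library(tidyr)
fruit %>%
  inner_join(stores %>%
               separate_longer_delim(ID, delim = ";") %>%
               mutate(ID = as.integer(ID)),
             by = "ID")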

Collapsing group of strings into one string using an if statement within a for loop in R

I have a data frame with a column "Food".
dataframe <- data.frame(
  Color = c("red","red","red","red","red","blue","blue","blue","blue","blue",
            "green","green","green","green","green","orange","orange","orange","orange","orange"),
  Food = c("banana","apple","potato","orange","egg","strawberry","cheese","yogurt","kiwi","butter",
           "kale","sugar","carrot","celery","radish","cereal","milk","blueberry","squash","lemon"),
  Count = c(2,5,4,8,10,7,5,6,9,11,1,8,5,3,7,9,2,3,6,4))
Every time a fruit appears I want to replace the name of the fruit with "fruit".
I've tried making a vector of the fruit names. Then I go through each row in the data frame and, where the string matches a fruit, I want to replace the fruit name with "fruit".
fruit_list <- c("banana","apple","orange","strawberry","kiwi","blueberry","lemon")
for (r in 1:nrow(dataframe)) {
for (i in 1:length(fruit_list)){
if (length(grep(fruit_list[i], dataframe$Food[r])) != 0) {
dataframe$Food[r] <- paste("fruit")
}
}
}
How do I use this general format so that dataframe$Food doesn't just end up filled with NA?
With dplyr:
library(dplyr)
dataframe %>%
  mutate(Food = as.character(Food),
         Food = ifelse(Food %in% fruit_list, "Fruit", Food)) # can change to "fruit"
Result:
Color Food Count
1 red Fruit 2
2 red Fruit 5
3 red potato 4
4 red Fruit 8
5 red egg 10
6 blue Fruit 7
7 blue cheese 5
8 blue yogurt 6
9 blue Fruit 9
10 blue butter 11
11 green kale 1
12 green sugar 8
13 green carrot 5
14 green celery 3
15 green radish 7
16 orange cereal 9
17 orange milk 2
18 orange Fruit 3
19 orange squash 6
20 orange Fruit 4
Base R only:
dataframe$Food <- sapply(dataframe$Food,
                         function(x, fruit_list) ifelse(x %in% fruit_list, "fruit", as.character(x)),
                         fruit_list = fruit_list)
You don't necessarily need dplyr for this.
Just use:
dataframe$Food <- ifelse(dataframe$Food %in% fruit_list, "Fruit", as.character(dataframe$Food))
You can do this in one line using the data.table package:
library(data.table)
setDT(dataframe)[, Food := ifelse(Food %in% fruit_list, "fruit", as.character(Food))]
Color Food Count
1: red fruit 2
2: red fruit 5
3: red potato 4
4: red fruit 8
5: red egg 10
6: blue fruit 7
7: blue cheese 5
8: blue yogurt 6
9: blue fruit 9
10: blue butter 11
11: green kale 1
12: green sugar 8
13: green carrot 5
14: green celery 3
15: green radish 7
16: orange cereal 9
17: orange milk 2
18: orange fruit 3
19: orange squash 6
20: orange fruit 4
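As for why the original loop leaves NAs: if dataframe was built with the old stringsAsFactors = TRUE default, Food is a factor, and assigning "fruit" (which is not one of its levels) produces NA. A minimal sketch of the loop approach that avoids this by converting Food to character first (using %in% in place of the inner grep loop):
dataframe$Food <- as.character(dataframe$Food)  # no factor levels to fight with
for (r in seq_len(nrow(dataframe))) {
  if (dataframe$Food[r] %in% fruit_list) {
    dataframe$Food[r] <- "fruit"
  }
}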

Normalize data using a look-up table in R

I have a data frame that looks like this:
COMM YEAR PRODUCE
Apple 2001 3
Mango 2001 5
Apple 2002 7
Mango 2002 2
Apple 2003 1
Mango 2003 9
I also have a yearly production dataframe:
Year Total.Produce
2001 10
2002 13
2003 14
I want to add a new column to the first data frame that contains the normalized production of each item per year:
COMM YEAR PRODUCE Normalized.Produce
Apple 2001 3 3/10
Mango 2001 5 5/10
Apple 2002 7 7/13
Mango 2002 2 2/13
Apple 2003 1 1/14
Mango 2003 9 9/14
What is the most efficient way of doing this in R?
My tables contain about 100,000 entries.
Just use match to get the matching rows:
R> match(dd1$YEAR, dd2$Year)
[1] 1 1 2 2 3 3
Then just use standard vectorised commands:
dd1$normalise = dd1$PRODUCE/dd2$Total.Produce[match(dd1$YEAR, dd2$Year)]
You can also use the merge function:
d1 <- read.table(text=
"COMM YEAR PRODUCE
Apple 2001 3
Mango 2001 5
Apple 2002 7
Mango 2002 2
Apple 2003 1
Mango 2003 9", head=T, as.is=T)
d2 <- read.table(text="Year Total.Produce
2001 10
2002 13
2003 14", head=T, as.is=T)
d3 <- merge(d1, d2, by.x="YEAR", by.y="Year")
d3$Normalized.Produce <- d3$PRODUCE/d3$Total.Produce
# YEAR COMM PRODUCE Total.Produce Normalized.Produce
# 1 2001 Apple 3 10 0.30000000
# 2 2001 Mango 5 10 0.50000000
# 3 2002 Apple 7 13 0.53846154
# 4 2002 Mango 2 13 0.15384615
# 5 2003 Apple 1 14 0.07142857
# 6 2003 Mango 9 14 0.64285714
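With roughly 100,000 rows, a dplyr join is another reasonable option; a sketch using the d1/d2 objects above, assuming the dplyr package is available:
library(dplyr)
d1 %>%
  left_join(d2, by = c("YEAR" = "Year")) %>%
  mutate(Normalized.Produce = PRODUCE / Total.Produce)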

How to populate values of one row conditional on another row in R?

I inherited a data set coded in an unusual way. I would like to learn a less verbose way of reshaping it. The data frame looks like this:
# Input.
participant = c(rep("John",6), rep("Mary",6))
day = c(rep(1,3), rep(2,3), rep(1,3), rep(2,3))
likes = c("apples", "apples", "18", "apples", "apples", "7", "bananas", "bananas", "24", "bananas", "bananas", "3")
question = rep(c(1,1,0),4)
number = c(rep(18,3), rep(7,3), rep(24,3), rep(3,3))
df = data.frame(participant, day, question, likes)
participant day question likes
1 John 1 1 apples
2 John 1 1 apples
3 John 1 0 18
4 John 2 1 apples
5 John 2 1 apples
6 John 2 0 7
7 Mary 1 1 bananas
8 Mary 1 1 bananas
9 Mary 1 0 24
10 Mary 2 1 bananas
11 Mary 2 1 bananas
12 Mary 2 0 3
As you can see, the column likes is heterogeneous. When question equals 0, likes conveys a number chosen by the participants, not their preferred fruit. So I would like to re-code it in a new column as follows:
participant day question likes number
1 John 1 1 apples 18
2 John 1 1 apples 18
3 John 1 0 18 18
4 John 2 1 apples 7
5 John 2 1 apples 7
6 John 2 0 7 7
7 Mary 1 1 bananas 24
8 Mary 1 1 bananas 24
9 Mary 1 0 24 24
10 Mary 2 1 bananas 3
11 Mary 2 1 bananas 3
12 Mary 2 0 3 3
My current solution with base R involves subsetting the initial data frame, creating a lookup table, changing the column names, and then merging the lookup table with the original data frame. But this takes several steps and I suspect there is a simpler solution. I think that tidyr might be the answer, but I don't know how to use it to spread values in one column (likes) conditional on other columns (day and question).
Do you have any suggestions? Thanks a lot!
Using the data set above, you can try the following. You group your data by participant and day and look for a row with question == 0 for each group.
library(dplyr)
group_by(df, participant, day) %>%
mutate(age = as.numeric(as.character(likes[which(question == 0)])))
Or as alistaire suggested, you can use grep() too.
group_by(df, participant, day) %>%
mutate(age = as.numeric(grep('\\d+', likes, value = TRUE)))
# participant day question likes age
# (fctr) (dbl) (dbl) (fctr) (dbl)
#1 John 1 1 apples 18
#2 John 1 1 apples 18
#3 John 1 0 18 18
#4 John 2 1 apples 7
#5 John 2 1 apples 7
#6 John 2 0 7 7
#7 Mary 1 1 bananas 24
#8 Mary 1 1 bananas 24
#9 Mary 1 0 24 24
#10 Mary 2 1 bananas 3
#11 Mary 2 1 bananas 3
#12 Mary 2 0 3 3
If you want to use data.table, you can do:
library(data.table)
setDT(df)[, age := as.numeric(as.character(likes[which(question == 0)])),
by = list(participant, day)]
NOTE
The present data set is a new one. Jota's answer works for the deleted data set.
Addressing the new example data:
# create a key column, overwrite it later
df$number <- paste0(df$participant, df$day) # use as a key
# create lookup table
lookup <- df[!is.na(as.numeric(as.character(df$likes))), c("number", "likes")]
# use lookup to overwrite df$number with the appropriate number
df$number <- lookup$likes[match(df$number, lookup$number)]
# participant day question likes number
#1 John 1 1 apples 18
#2 John 1 1 apples 18
#3 John 1 0 18 18
#4 John 2 1 apples 7
#5 John 2 1 apples 7
#6 John 2 0 7 7
#7 Mary 1 1 bananas 24
#8 Mary 1 1 bananas 24
#9 Mary 1 0 24 24
#10 Mary 2 1 bananas 3
#11 Mary 2 1 bananas 3
#12 Mary 2 0 3 3
The warning about NAs being introduced by coercion is expected; it comes from converting the character values to numeric (as.numeric(as.character(df$likes))).
If your data are ordered as in the example, you can use na.locf from the zoo package:
library(zoo)
df$age <- na.locf(as.numeric(as.character(df$likes)), fromLast = TRUE)
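A tidyr alternative that does not depend on row order: mark the number rows, then fill within each participant/day group. A sketch assuming tidyr >= 1.0 for fill() with .direction = "updown":
library(dplyr)
library(tidyr)
df %>%
  mutate(number = ifelse(question == 0, as.character(likes), NA)) %>%
  group_by(participant, day) %>%
  fill(number, .direction = "updown") %>%
  ungroup() %>%
  mutate(number = as.numeric(number))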
