I have a data.frame with 4 categorical variables on a 1-5 scale.
df <- data.frame(
first=c(2,3,3,2,2),
second=c(5,5,4,5,5),
third=c(5,5,5,4,4),
fourth=c(2,1,1,1,2))
first second third fourth
2 5 5 2
3 5 5 1
3 4 5 1
2 5 4 1
2 5 4 2
I want to transform the names of the variables into one column, count the occurrences of each value, and set up new columns for the levels of the categorical scale:
newvar 1 2 3 4 5
first 0 3 2 0 0
second 0 0 0 1 4
third 0 0 0 2 3
fourth 3 2 0 0 0
Using data.table:
library(data.table)
dcast(melt(df), variable~value)
# variable 1 2 3 4 5
#1 first 0 3 2 0 0
#2 second 0 0 0 1 4
#3 third 0 0 0 2 3
#4 fourth 3 2 0 0 0
This prints some warnings, since we are relying on the default options of melt and dcast; they are safe to ignore in this case. To avoid the warnings you can use this more explicit version:
library(data.table)
dcast(melt(setDT(df), measure.vars = names(df)), variable~value, fun.aggregate = length)
Not the cleanest method, but it works. Use pivot_longer to transform the data into long format, then group the data and count the occurrences for each of the original columns. pivot_wider transforms the data back into wide format, and the last two lines rearrange the result to match the desired output.
library(dplyr)
library(tidyr)

df %>%
  pivot_longer(c(first:fourth)) %>%
  count(name, value) %>%
  pivot_wider(names_from = "value",
              values_from = "n",
              values_fill = 0) %>% # combinations that never occur become 0 instead of NA
  select(name, `1`, `2`, `3`, `4`, `5`) %>%
  arrange(match(name, c("first", "second", "third", "fourth")))
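With values_fill = 0 supplied (without it, pivot_wider leaves NAs for combinations that count never produced), running this on the example data should reproduce the desired table:
# A tibble: 4 x 6
#   name    `1`  `2`  `3`  `4`  `5`
# 1 first     0    3    2    0    0
# 2 second    0    0    0    1    4
# 3 third     0    0    0    2    3
# 4 fourth    3    2    0    0    0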
I have the following dummy dataframe:
t <- data.frame(
a= c(0,0,2,4,5),
b= c(0,0,4,6,5))
a b
0 0
0 0
2 4
4 6
5 5
I want to replace just the first non-zero value in column b. Say the row that meets this criterion is i. I want to replace t$b[i] with t$b[i+2] + t$b[i+1], and the rest of t$b should remain the same. So the output would be
a b
0 0
0 0
2 11
4 6
5 5
In fact the dataset is dynamic, so I cannot point to a specific row directly; the row has to be found by the criterion of being the first one with a non-zero value in column b.
How can I create this new t$b?
Here is a straightforward solution in base R:
t <- data.frame(
a= c(0,0,2,4,5),
b= c(0,0,4,6,5))
ind <- which(t$b > 0)[1L]
t$b[ind] <- t$b[ind+2L] + t$b[ind+1L]
t
a b
1 0 0
2 0 0
3 2 11
4 4 6
5 5 5
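Note that if the first non-zero value sits in one of the last two rows, ind + 2L indexes past the end of the column and the sum becomes NA. A minimal guarded sketch:
ind <- which(t$b > 0)[1L]
# only replace when two further rows actually exist
if (!is.na(ind) && ind + 2L <= nrow(t)) {
  t$b[ind] <- t$b[ind + 1L] + t$b[ind + 2L]
}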
Here is a roundabout way of getting there with a combination of group_by() and mutate():
library(tidyverse)
t %>%
mutate(
b_cond = b != 0,
row_number = row_number()
) %>%
group_by(b_cond) %>%
mutate(
min_row_number = row_number == min(row_number),
b = if_else(b_cond & min_row_number, lead(b, 1) + lead(b, 2), b)
) %>%
ungroup() %>%
select(a, b) # optional, to get back to original columns
# A tibble: 5 × 2
a b
<dbl> <dbl>
1 0 0
2 0 0
3 2 11
4 4 6
5 5 5
I have data like dataframe df_a, and want to have it converted to the format as in dataframe df_b.
xtabs() gives a similar result, but I did not find a way to access its elements as in the example code below. Accessing it via xa[1,1] is no help either, since the numeric indices ("1") correspond only loosely to the names ("A"). As you can see, the xtabs() result is sorted differently, so xa[2,2] is 2 and not 0 as in the df_b listing.
> df_a
ItemName Feature Amount
1 First A 2
2 First B 3
3 First A 4
4 Second C 3
5 Second C 2
6 Third D 1
7 Fourth B 2
8 Fourth D 3
9 Fourth D 2
> df_b
ItemName A B C D
1 First 6 3 0 0
2 Second 0 0 5 0
3 Third 0 0 0 1
4 Fourth 0 2 0 5
> df_b$A
[1] 6 0 0 0
> xa<-xtabs(df_a$Amount~df_a$ItemName+df_a$Feature)
> xa
df_a$Feature
df_a$ItemName A B C D
First 6 3 0 0
Fourth 0 2 0 5
Second 0 0 5 0
Third 0 0 0 1
> xa$A
Error in xa$A : $ operator is invalid for atomic vectors
There is a way to convert iteratively with for() loops, but that is totally inefficient in my case because my data has millions of records.
For the purpose of further processing, my required output format is a dataframe.
If anyone has solved a similar problem, please share.
You can just use as.data.frame.matrix(xa)
# output
A B C D
First 6 3 0 0
Fourth 0 2 0 5
Second 0 0 5 0
Third 0 0 0 1
## or
df_b <- as.data.frame.matrix(xa)[unique(df_a$ItemName), ]
data.frame(ItemName = row.names(df_b), df_b, row.names = NULL)
# output
ItemName A B C D
1 First 6 3 0 0
2 Second 0 0 5 0
3 Third 0 0 0 1
4 Fourth 0 2 0 5
Without using xtabs you can do something like this:
df_a %>%
  dplyr::group_by(ItemName, Feature) %>%
  dplyr::summarise(Sum = sum(Amount, na.rm = TRUE)) %>%
  tidyr::spread(Feature, Sum, fill = 0) %>%
as.data.frame()
This does the transformation you require, and the result stays a data.frame.
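On df_a this should give something like the following (group_by sorts the rows alphabetically, just as xtabs does):
#   ItemName A B C D
# 1    First 6 3 0 0
# 2   Fourth 0 2 0 5
# 3   Second 0 0 5 0
# 4    Third 0 0 0 1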
Or, you can call as.data.frame(your_xtabs_result); note, though, that this dispatches to as.data.frame.table() and returns the long format (ItemName, Feature, Freq) rather than the wide layout, whereas as.data.frame.matrix() keeps the wide table.
I've got a data.frame with a key/value string column containing information about features and their values for a set of users. Something like this:
data<-data.frame(id=1:3,statid=c("s003e","s093u","s085t"),str=c("a:1,7:2","a:1,c:4","a:3,b:5,c:33"))
data
# id statid str
# 1 1 s003e a:1,7:2
# 2 2 s093u a:1,c:4
# 3 3 s085t a:3,b:5,c:33
What I'm trying to do is to create a data.frame containing column for every feature. Like this:
data_after<-data.frame(id=1:3,statid=c("s003e","s093u","s085t"),
a=c(1,1,3),b=c(0,0,5),c=c(0,4,33),"7"=c(2,0,0))
data_after
# id statid a b c X7
# 1 1 s003e 1 0 0 2
# 2 2 s093u 1 0 4 0
# 3 3 s085t 3 5 33 0
I was trying to use str_split from the stringr package and then transform the elements of the resulting list into data.frames (and later bind them using, for example, rbind.fill from plyr), but couldn't get it to work. Any help will be appreciated!
You can use dplyr and tidyr:
library(dplyr); library(tidyr)
data %>% mutate(str = strsplit(str, ",")) %>% unnest(str) %>%
  separate(str, into = c('var', 'val'), sep = ":", convert = TRUE) %>% # convert = TRUE parses val as numeric
  spread(var, val, fill = 0)
# id statid 7 a b c
# 1 1 s003e 2 1 0 0
# 2 2 s093u 0 1 0 4
# 3 3 s085t 0 3 5 33
We can use cSplit from splitstackshape to do this in a cleaner way. Convert the data to 'long' format by splitting at ",", then split at ":" and dcast from 'long' to 'wide':
library(splitstackshape)
library(data.table)
dcast(cSplit(cSplit(data, "str", ",", "long"), "str", ":"),
id+statid~str_1, value.var="str_2", fill = 0)
# id statid 7 a b c
#1: 1 s003e 2 1 0 0
#2: 2 s093u 0 1 0 4
#3: 3 s085t 0 3 5 33
I would like to add an extra column "na_count" that counts adjacent NAs in the column value, like
value na_count
8 0
2 0
NA 4
NA 4
NA 4
NA 4
5 0
9 0
1 0
NA 2
NA 2
5 0
NA 3
NA 3
NA 3
8 0
5 0
NA 1
Is there perhaps a way with dplyr window functions?
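The answers below reference df1 (and dd) without defining them; reconstructed from the table above, the data would be:
df1 <- data.frame(
  value = c(8, 2, NA, NA, NA, NA, 5, 9, 1, NA, NA, 5, NA, NA, NA, 8, 5, NA),
  na_count = c(0, 0, 4, 4, 4, 4, 0, 0, 0, 2, 2, 0, 3, 3, 3, 0, 0, 1)
)
dd <- df1 # the base R answer at the end calls the same data 'dd'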
Here is an option using dplyr (as the author asked for). We create a grouping column by taking the difference of the logical vector !is.na(value), comparing it with 1, and taking the cumsum; then we create 'NA_count' by multiplying the logical vector by the number of elements in the group (n()).
library(dplyr)
df1 %>%
select(-na_count) %>% #removing the column that was not needed
group_by(grp=cumsum(c(TRUE,abs(diff(!is.na(value)))==1))) %>%
mutate(NA_count = is.na(value)*n()) %>%
ungroup() %>%
select(-grp)
Or we can convert the data.frame to a data.table (setDT(df1)) and, grouping by the rleid of the logical vector is.na(value), take the number of rows (.N), multiply it by the logical vector, and extract the 'V1' column.
library(data.table)#v1.9.6+
setDT(df1)[, .N*is.na(value) ,rleid(is.na(value))]$V1
#[1] 0 0 4 4 4 4 0 0 0 2 2 0 3 3 3 0 0 1
If this is to create a new column,
setDT(df1)[, Na_count:= .N*is.na(value) ,rleid(is.na(value))]
Or we can use rle (run-length encoding) from base R. We take the rle of which elements of 'value' are NA (is.na(df1$value)), which returns a list; use within.list to change the 'values' element, replacing its TRUE entries by the corresponding 'lengths' (using 'values' itself as the index); and then return the atomic vector with inverse.rle.
inverse.rle(within.list(rle(is.na(df1$value)),
{values[values] <- lengths[values] }))
#[1] 0 0 4 4 4 4 0 0 0 2 2 0 3 3 3 0 0 1
Or a slightly more compact version is
inverse.rle(within.list(rle(is.na(df1$value)), values <-lengths*values))
#[1] 0 0 4 4 4 4 0 0 0 2 2 0 3 3 3 0 0 1
Not with dplyr, but using rle from base R:
# get run-length of missings
dd_rle <- rle(is.na(dd$value))
# use rep: value is length if missing, 0 otherwise, number of repetitions
# is length of runs
# na_count2 so comparison to expected output possible
dd$na_count2 <- rep(ifelse(dd_rle$values, dd_rle$lengths, 0),
dd_rle$lengths)
Is it possible to group and count instances of all other columns using R (dplyr)? For example, the following dataframe
x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1
Turns into this (note: y is the value being counted):
EDIT: to explain the transformation: x is what I group by; for each group I want to count how many times the values 0, 1, and 2 occur in the other columns. For example, the first row of the transformed dataframe says that, among rows with x = 1, the value 0 occurs once in column a, twice in column b, and once in column c.
x y a b c
1 0 1 2 1
1 1 1 0 2
1 2 1 1 0
2 1 1 0 1
2 2 0 1 0
An approach with a combination of the melt and dcast functions of data.table or reshape2:
library(data.table) # v1.9.5+
dt.new <- dcast(melt(setDT(df), id.vars="x"), x + value ~ variable)
this gives:
dt.new
# x value a b c
# 1: 1 0 1 2 1
# 2: 1 1 1 0 2
# 3: 1 2 1 1 0
# 4: 2 1 1 0 1
# 5: 2 2 0 1 0
In dcast you can specify which aggregation function to use, but that is not necessary here, since the default aggregation function is length. Without an explicit aggregation function you will get a warning about that:
Aggregation function missing: defaulting to length
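To silence the warning, spell the aggregation out explicitly (same result):
dcast(melt(setDT(df), id.vars = "x"), x + value ~ variable, fun.aggregate = length)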
Furthermore, if you do not explicitly convert the dataframe to a data.table, data.table will redirect to reshape2 (see the explanation from @Arun in the comments). Consequently, this method can be used with reshape2 as well:
library(reshape2)
df.new <- dcast(melt(df, id.vars="x"), x + value ~ variable)
Used data:
df <- read.table(text="x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1", header=TRUE)
I'd use a combination of gather and spread from the tidyr package, and count from dplyr:
library(dplyr)
library(tidyr)
df = data.frame(x = c(1,1,1,2), a = c(0,1,2,1), b = c(0,0,2,2), c = c(0,1,1,1))
res = df %>%
gather(variable, value, -x) %>%
count(x, variable, value) %>%
spread(variable, n, fill = 0)
# Source: local data frame [5 x 5]
#
# x value a b c
# 1 1 0 1 2 1
# 2 1 1 1 0 2
# 3 1 2 1 1 0
# 4 2 1 1 0 1
# 5 2 2 0 1 0
Essentially, you first change the format of the dataset to:
head(df %>%
gather(variable, value, -x))
# x variable value
#1 1 a 0
#2 1 a 1
#3 1 a 2
#4 2 a 1
#5 1 b 0
#6 1 b 0
This format allows you to use count to see how often each value occurs in columns a to c. After that, you reshape the dataset into the required format using spread.
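For reference, the intermediate result of the count() step (before spread) works out to:
df %>%
  gather(variable, value, -x) %>%
  count(x, variable, value)
# Source: local data frame [10 x 4]
#
#     x variable value n
# 1   1        a     0 1
# 2   1        a     1 1
# 3   1        a     2 1
# 4   1        b     0 2
# 5   1        b     2 1
# 6   1        c     0 1
# 7   1        c     1 2
# 8   2        a     1 1
# 9   2        b     2 1
# 10  2        c     1 1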