R: combining frequency lists with different lengths by labels?


I'm a newbie to R, but really like it and want to improve constantly. Now, after searching for a while, I need to ask you for help.
This is the given case:
I have two sentences (sentence.1 and sentence.2; all words are already lower-case) and create sorted frequency lists of their words:
sentence.1 <- "bob buys this car, although his old car is still fine." # saves the sentence into sentence.1
sentence.2 <- "a car can cost you very much per month."
sentence.1.list <- strsplit(sentence.1, "\\W+", perl=TRUE) # split the sentence at non-word characters (these commands are thanks to Stefan Gries)
sentence.2.list <- strsplit(sentence.2, "\\W+", perl=TRUE)
sentence.1.vector <- unlist(sentence.1.list) # flatten the list into a character vector
sentence.2.vector <- unlist(sentence.2.list)
sentence.1.freq <- table(sentence.1.vector) # finally, create the frequency list for each sentence
sentence.2.freq <- table(sentence.2.vector)
These are the results:
sentence.1.freq:
although  bob  buys  car  fine  his  is  old  still  this
       1    1     1    2     1    1   1    1      1     1
sentence.2.freq:
a  can  car  cost  month  much  per  very  you
1    1    1     1      1     1    1     1    1
Now, how can I combine these two frequency lists so that I get the following?
 a  although  bob  buys  can  car  cost  fine  his  is  month  much  old  per  still  this  very  you
NA         1    1     1   NA    2    NA     1    1   1     NA    NA    1   NA      1     1    NA   NA
 1        NA   NA    NA    1    1     1    NA   NA  NA      1     1   NA    1     NA    NA     1    1
Thus, this "table" should be "flexible": if a new sentence contains a new word, e.g. "and", the table should gain a column labeled "and" in its alphabetical position (between "although" and "bob").
I thought of adding each new sentence as a new row, appending every word that is not yet in the table as a new column (here, "and" would go to the right of "you"), and then sorting the columns again. However, I haven't managed this, because I can't even get the new sentence's word frequencies matched to the existing labels. When a word such as "car" occurs again, its frequency should go into the new sentence's row under the existing "car" column; when a word such as "you" occurs for the first time, its frequency should go into the new sentence's row under a new column labeled "you".

This isn't exactly what you describe, but what you're aiming for makes more sense to me organized by row, rather than by column (and R handles data organized this way a bit more easily anyway).
#Convert tables to data frames
a1 <- as.data.frame(sentence.1.freq)
a2 <- as.data.frame(sentence.2.freq)
#There are other options here, see note below
colnames(a1) <- colnames(a2) <- c('word','freq')
#Then merge
merge(a1,a2,by = "word",all = TRUE)
       word freq.x freq.y
1  although      1     NA
2       bob      1     NA
3      buys      1     NA
4       car      2      1
5      fine      1     NA
6       his      1     NA
7        is      1     NA
8       old      1     NA
9     still      1     NA
10     this      1     NA
11        a     NA      1
12      can     NA      1
13     cost     NA      1
14    month     NA      1
15     much     NA      1
16      per     NA      1
17     very     NA      1
18      you     NA      1
You can then keep using merge to add more sentences. I renamed the columns for simplicity, but there are other options: if the key columns have different names in each data frame, use the by.x and by.y arguments instead of by to tell merge which columns to match on. Also, the suffixes argument of merge controls how the count columns are given unique names. The default is to append .x and .y, but you can change that.
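If more sentences arrive later, the repeated merge can be folded over a list of frequency tables. The following is a minimal sketch, not part of the original answer: the helper freq_df, the labels s1 to s3, and the third sentence are all made up for illustration.

```r
# Hypothetical input: the two original sentences plus a made-up third one.
sentences <- c(
  s1 = "bob buys this car, although his old car is still fine.",
  s2 = "a car can cost you very much per month.",
  s3 = "bob and a car"
)

# Hypothetical helper: frequency table of one sentence as a two-column data frame.
# stringsAsFactors = FALSE keeps the words as character, so merge sorts them alphabetically.
freq_df <- function(sentence, label) {
  words <- unlist(strsplit(sentence, "\\W+", perl = TRUE))
  d <- as.data.frame(table(words), stringsAsFactors = FALSE)
  names(d) <- c("word", label)
  d
}

tables <- Map(freq_df, sentences, names(sentences))

# Fold the outer merge over all tables; words missing from a sentence become NA.
combined <- Reduce(function(x, y) merge(x, y, by = "word", all = TRUE), tables)

# Optional: transpose to the layout from the question (one row per sentence).
wide <- t(as.matrix(combined[, -1]))
colnames(wide) <- combined$word
```

Because each count column carries its own label, no suffixes are needed, and a new word such as "and" ends up in its alphabetical position automatically.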

Related

copy values from different columns based on conditions (r code)

I have data like the one in the picture, with two columns (Cday, Dday) that contain some missing values.
There can't be a row where there are values for both columns; there's a value on either one column or the other or in neither.
I want to create the column "new" that has copied values from whichever column there was a number.
Really appreciate any help!

Since no row has a value in both columns, you can just sum the two existing columns. Assume your data frame is called df.
df$new <- rowSums(df[, 2:3], na.rm = TRUE)
This sums across each row, ignoring NAs, and should give you what you want. Note that rows where both columns are NA come out as 0 rather than NA, and that you may need to adjust the column indices if you have more columns than what you've shown.

The dplyr package has the coalesce function.
library(dplyr)
df <- data.frame(id=1:8, Cday=c(1,2,NA,NA,3,NA,2,NA), Dday=c(NA,NA,NA,3,NA,2,NA,1))
new <- df %>% mutate(new = coalesce(Dday, Cday)) # coalesce has no na.rm argument; it takes the first non-NA value
new
# id Cday Dday new
#1 1 1 NA 1
#2 2 2 NA 2
#3 3 NA NA NA
#4 4 NA 3 3
#5 5 3 NA 3
#6 6 NA 2 2
#7 7 2 NA 2
#8 8 NA 1 1
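For comparison, the same result can be sketched in base R without extra packages. This is only an illustration using the example data frame from the answer above; it relies on the stated assumption that no row has values in both columns.

```r
df <- data.frame(id = 1:8,
                 Cday = c(1, 2, NA, NA, 3, NA, 2, NA),
                 Dday = c(NA, NA, NA, 3, NA, 2, NA, 1))

# Take Cday where it is present, otherwise fall back to Dday.
# Rows where both are NA stay NA (unlike the rowSums approach, which yields 0).
df$new <- ifelse(is.na(df$Cday), df$Dday, df$Cday)
```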

Shifting rows up in a particular column of data

I have a question about shifting of rows in the particular column of a data.
data <- data.frame(B=c(NA,NA,0,NA,NA,0),C=c(1,NA,NA,1,NA,NA))
B C
1 NA 1
2 NA NA
3 0 NA
4 NA 1
5 NA NA
6 0 NA
I tried the approach from the post "Shifting a column down by one":
na.omit(transform(data, B = c(NA, B[-nrow(data)])))
but I only get
  B C
4 0 1
Expected output:
  B C
1 0 1
2 0 1
How can we achieve that?
Thanks.

If you want to remove all NAs from each column and do not care that the rows will no longer match between columns, you can do:
data <- data.frame(B=c(NA,NA,0,NA,NA,0),C=c(1,NA,NA,1,NA,NA))
res <- lapply(data, function(x) x[complete.cases(x)])
res <- data.frame(res)
The second line says: for every column in data, keep only the values which are not NA. The third line works here because both columns end up with the same number of values; data.frame fails if the lengths differ.
Thanks to @thelatemail for the correction to my earlier solution below, which worked, but would have kept the columns as factors:
data <- data.frame(B=c(NA,NA,0,NA,NA,0),C=c(1,NA,NA,1,NA,NA))
res<-apply(data,2,function(x){x[complete.cases(x)]})
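When the columns do not all contain the same number of non-NA values, the shorter ones can be padded with NA before rebuilding the data frame. The following is a hedged sketch of that idea, not part of the original answer; the names cols and n are illustrative.

```r
data <- data.frame(B = c(NA, NA, 0, NA, NA, 0), C = c(1, NA, NA, 1, NA, NA))

# Drop the NAs column by column.
cols <- lapply(data, function(x) x[!is.na(x)])

# Pad every column to the length of the longest one so data.frame() accepts them.
n <- max(lengths(cols))
res <- as.data.frame(lapply(cols, function(x) c(x, rep(NA, n - length(x)))))
```

With the example data both columns already have two values, so no padding is added and res has two rows.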

Fill in-between entries in an ID vector

Looking for a quick-and-easy solution to a problem which I have only been able to solve inelegantly, by looping. I have an ID vector which looks something like this:
id<-c(NA,NA,1,1,1,NA,1,NA,2,2,2,NA,3,NA,3,3,3)
The NA's that fall in-between a sequence of a single number (id[6], id[14]) need to be replaced by that number. However, the NA's that don't meet this condition (those between sequences of two different numbers) need to be left alone (i.e., id[1],id[2],id[8],id[12]). The target vector is therefore:
id.target<-c(NA,NA,1,1,1,1,1,NA,2,2,2,NA,3,3,3,3,3)
This is not difficult to do by looping through each value, but I am looking to do this to many very long vectors, and was hoping for a neater solution. Thanks for any suggestions.

This seems to work. The idea is to use zoo::na.locf to fill the NAs forward and then put NAs back where they fall between different numbers:
id.target <- zoo::na.locf(id, na.rm = FALSE)
id.target[(c(diff(id.target), 1L) > 0L) & is.na(id)] <- NA
id.target
## [1] NA NA 1 1 1 1 1 NA 2 2 2 NA 3 3 3 3 3
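A related idea that avoids the zoo dependency is to fill forward and backward and keep a filled value only where both directions agree. This is a sketch under that assumption, with hypothetical helper names fill_forward and fill_between; it is not the answer's method. Unlike the diff-based check above, it also leaves runs of several consecutive NAs between two different numbers untouched.

```r
# Hypothetical helper: carry the last non-NA value forward (leading NAs stay NA).
fill_forward <- function(x) {
  idx <- cumsum(!is.na(x))        # how many non-NA values have been seen so far
  c(NA, x[!is.na(x)])[idx + 1]    # position 1 holds NA for "nothing seen yet"
}

# Hypothetical helper: fill an NA only if the nearest values on both sides are equal.
fill_between <- function(x) {
  fwd <- fill_forward(x)
  bwd <- rev(fill_forward(rev(x)))
  ifelse(!is.na(fwd) & !is.na(bwd) & fwd == bwd, fwd, NA)
}

id <- c(NA, NA, 1, 1, 1, NA, 1, NA, 2, 2, 2, NA, 3, NA, 3, 3, 3)
id.filled <- fill_between(id)
```

Both helpers are vectorized, so they should scale to long vectors without an explicit loop.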

Here is a base R option
d1 <- do.call(rbind, lapply(split(seq_along(id), id), function(x) {
  i1 <- min(x):max(x)
  data.frame(val = unique(id[x]), i1)
}))
id[seq_along(id) %in% d1$i1] <- d1$val
id
#[1] NA NA 1 1 1 1 1 NA 2 2 2 NA 3 3 3 3 3

increase in one variable nested within another column in R + setting 0 as starting value

I'm trying to use the diff function to calculate the increase in a variable ("damage") in this dataset (df). I want to fill the column "damage_new" with this new variable. The values that you see now are the values I would like to have.
df = data.frame(id=c(1,1,1,2,2), trial=c(1,3,4,1,2), damage=c(1,NA,3,1,5))
df
ID TRIAL DAMAGE DAMAGE_NEW
 1     1      1          0
 1     3     NA         NA
 1     4      3         NA
 2     1      1          0
 2     2      5          4
If I run diff(df$damage), it calculates the differences across the whole dataset.
Two things that I haven't managed are:
- How do I nest the difference within the values of another column? Specifically, I want to calculate the damage increase for the whole dataset, but within each individual (ID), for which I have repeated measurements.
- I would also like the damage_new column to be the same length as the rest of the dataset (so I can attach it) and, for each individual, to have the first value of damage_new set to 0, since the first measurement has no reference.
To further describe the dataset: I have NAs in the "damage" column, which I suspect will lead to more NAs in the damage_new column, but I would like to keep them (and I wonder how the function deals with them). I also don't have the same number of measurements per individual (they have different numbers of trials, with some missing in between).
thanks a lot for the always fast and efficient answers!

The dplyr package is great for this kind of thing:
library(dplyr)
df %>% group_by(id) %>% mutate(damage_new=c(0,diff(damage)))
Source: local data frame [5 x 4]
Groups: id
id trial damage damage_new
1 1 1 1 0
2 1 3 NA NA
3 1 4 3 NA
4 2 1 1 0
5 2 2 5 4
You can read more about dplyr usage here
Update
If you'd rather stick with base R, you could do:
df$damage_new <- ave(df$damage,df$id,FUN=function(v) c(0,diff(v)))
which will produce the same df.

The data.table package is your friend here:
> library(data.table)
> setDT(df)
> setkey(df, id, trial)
> df[,new_damage:=c(0,diff(damage)),by=id]
> df
id trial damage new_damage
1: 1 1 1 0
2: 1 3 NA NA
3: 1 4 3 NA
4: 2 1 1 0
5: 2 2 5 4
As for how diff handles NA: anything you subtract from NA (or NA from anything) gives NA, so each NA in the input produces NA in the two differences it takes part in:
> diff(c(1,3,4,NA,5,7))
[1] 2 1 NA NA 2
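To see why one NA in the input becomes two NAs in the output, it may help to write the lag-1 difference out by hand. This is just an illustration of what diff computes with its default arguments; the name manual is made up.

```r
x <- c(1, 3, 4, NA, 5, 7)

# Each element minus its predecessor: x[2] - x[1], x[3] - x[2], ...
manual <- x[-1] - x[-length(x)]

# The NA at position 4 takes part in two subtractions (NA - 4 and 5 - NA),
# so positions 3 and 4 of the result are NA.
same_as_diff <- isTRUE(all.equal(manual, diff(x)))
```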

Data frame "expand" procedure in R?

This is not a real statistical question, but rather a data preparation question before performing the actual statistical analysis. I have a data frame which consists of sparse data. I would like to "expand" this data to include zeroes for missing values, group by group.
Here is an example of the data (a and b are two factors defining the group, t is the sparse timestamp and x is the value):
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
Assuming I would like to expand the values between t=0 and t=9, this is the result I'm hoping for:
test.expanded <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
t=c(0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9),
x=c(1,0,2,1,2,0,0,2,0,0,0,0,0,1,1,0,2,1,1,3))
Zeroes have been inserted for all missing values of t. This makes it easier to use.
I have a quick and dirty implementation which sorts the dataframe and loops through each of its lines, adding missing lines one at a time. But I'm not entirely satisfied by the solution. Is there a better way to do it?
For those who are familiar with SAS, it is similar to the proc expand.
Thanks!

As you noted in a comment to the other answer, doing it by group is easy with plyr which just leaves how to "fill in" the data sets. My approach is to use merge.
library("plyr")
test.expanded <- ddply(test, c("a","b"), function(DF) {
  DF <- merge(data.frame(t=0:9), DF[,c("t","x")], all.x=TRUE)
  DF[is.na(DF$x),"x"] <- 0
  DF
})
merge with all.x=TRUE will make the missing values NA, so the second line of the function is needed to replace those NAs with 0's.
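The same merge idea can be sketched in base R without plyr: build the full grid of groups and timestamps first, then left-join the data onto it. This is an illustrative variant, not the original answer's code; the names groups, grid, and expanded are made up, and the range 0:9 is taken from the question.

```r
test <- data.frame(
  a = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  b = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  t = c(0, 2, 3, 4, 7, 3, 4, 6, 7, 8, 9),
  x = c(1, 2, 1, 2, 2, 1, 1, 2, 1, 1, 3))

# Every (a, b) group crossed with every timestamp; by = NULL gives a Cartesian product.
groups <- unique(test[, c("a", "b")])
grid <- merge(groups, data.frame(t = 0:9), by = NULL)

# Left join the sparse data onto the full grid, then replace the gaps with 0.
expanded <- merge(grid, test, by = c("a", "b", "t"), all.x = TRUE)
expanded$x[is.na(expanded$x)] <- 0

# Make the row order explicit: by group, then by time.
expanded <- expanded[order(expanded$a, expanded$b, expanded$t), ]
rownames(expanded) <- NULL
```

For the example data this should reproduce the test.expanded data frame given in the question.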

This is convoluted but works fine:
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
my.seq <- seq(0,9)
not.t <- !(my.seq %in% test$t)
test[nrow(test)+seq(length(my.seq[not.t])),"t"] <- my.seq[not.t]
test
#------------
a b t x
1 1 1 0 1
2 1 1 2 2
3 1 1 3 1
4 1 1 4 2
5 1 1 7 2
6 1 2 3 1
7 1 2 4 1
8 1 2 6 2
9 1 2 7 1
10 1 2 8 1
11 1 2 9 3
12 NA NA 1 NA
13 NA NA 5 NA
Not sure if you want it sorted by t afterwards or not. If so, easy enough to do:
https://stackoverflow.com/a/6871968/636656
