Look, Compare and Draw: Differential Query Transformer for Automatic Oil Painting
Abstract
This work introduces a new approach to automatic oil painting that emphasizes the creation of dynamic and expressive brushstrokes. A pivotal challenge lies in mitigating duplicate and commonplace strokes, which often lead to less aesthetic outcomes. Inspired by the human painting process, i.e., observing, comparing, and drawing, we incorporate differential image analysis into a neural oil painting model, allowing the model to concentrate on the incremental impact of each successive brushstroke. To operationalize this concept, we propose the Differential Query Transformer (DQ-Transformer), a new architecture that leverages differentially derived image representations enriched with positional encoding to guide stroke prediction. This integration keeps the model sensitive to local details, resulting in more refined and nuanced stroke generation. Furthermore, we incorporate adversarial training into our framework, enhancing the accuracy of stroke prediction and thereby improving the overall realism and fidelity of the synthesized paintings. Extensive qualitative evaluations, complemented by a controlled user study, validate that our DQ-Transformer surpasses existing methods in both visual realism and artistic authenticity, typically achieving these results with fewer strokes.
Methodology
Given the canvas image and the target image generated by the renderer, we first obtain their differential image by subtracting one input from the other. Three local encoders built from convolutional neural networks extract image features with positional information. The DQ-Transformer has two components, i.e., the DQ-encoder and the DQ-decoder. The visual features are concatenated and fed to the DQ-encoder to obtain the fused feature. Next, we transform the differential image features into query tokens, which attend to the key and value pairs derived from the fused feature. Finally, the DQ-Transformer outputs a set of predicted strokes, each accompanied by a confidence score. The predicted image is generated by rendering these strokes onto the canvas. The discriminator treats the target images as real samples and the predicted images as fake samples.
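The two core operations described above, forming the differential image and using its tokens as queries against the fused feature, can be sketched as follows. This is a minimal single-head illustration in NumPy, not the paper's implementation: the function names, token counts, and feature dimension are assumptions chosen for clarity, and the CNN encoders, positional encoding, and DQ-decoder are omitted.

```python
import numpy as np

def differential_image(canvas, target):
    """Signed difference between target and current canvas (the 'compare' step)."""
    return target.astype(np.float32) - canvas.astype(np.float32)

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention: differential-image tokens
    act as queries against key/value pairs from the fused feature."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

# Toy shapes: 8 differential-query tokens, 16 fused-feature tokens, dim 32.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 32))    # tokens from the differential image
kv = rng.standard_normal((16, 32))  # tokens from the fused feature
out = cross_attention(q, kv, kv)
print(out.shape)  # (8, 32): one attended feature per query token
```

Because the queries come from the differential image, attention mass naturally concentrates on regions where the canvas still disagrees with the target, which is the intuition behind the design.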
Inference
We present four intermediate stages of oil painting for a real target image (left). Each stage is illustrated with a diagram: the top-left corner shows the current canvas, the top-right corner displays the corresponding differential image for that stage, and the bottom part presents the painting result inferred by our model. Because we explicitly compare content in the differential images during training, our model tends to add strokes in areas where discrepancies are more pronounced, thereby progressively reducing the residual content of the differential image.
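The staged inference described above can be summarized as a simple loop: at each stage the model observes the differential image, predicts strokes where the discrepancy is large, renders them, and repeats. The sketch below uses toy stand-ins for the stroke predictor and renderer; `predict_strokes`, `render`, and the stage count are assumptions for illustration, not the paper's exact API.

```python
import numpy as np

def paint_iteratively(target, predict_strokes, render, n_stages=4, tol=1e-3):
    """Coarse sketch of staged inference: each stage looks at the differential
    image and adds strokes where the canvas-target discrepancy remains large."""
    canvas = np.zeros_like(target, dtype=np.float32)
    for _ in range(n_stages):
        diff = target - canvas            # the 'compare' step
        if np.abs(diff).mean() < tol:     # stop once the canvas matches
            break
        strokes = predict_strokes(canvas, diff)
        canvas = render(canvas, strokes)  # the 'draw' step
    return canvas

# Toy stand-ins: each stage closes half of the remaining gap.
target = np.full((2, 2), 1.0, dtype=np.float32)
result = paint_iteratively(target,
                           predict_strokes=lambda c, d: d * 0.5,
                           render=lambda c, s: c + s)
print(result.mean())  # 0.9375 after four halving stages
```

Each iteration shrinks the differential image, mirroring the qualitative behavior shown in the four stages above.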
Results on various datasets
Paintings on Landscapes
Paintings on FFHQ
Paintings on Wiki Art